[Tut] How to Extract Emails from any Website using Python?


<div>
<figure class="wp-block-image size-large"><img loading="lazy" width="1024" height="576" src="https://blog.finxter.com/wp-content/uploads/2022/09/image-7-1024x576.png" alt="" class="wp-image-669058" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>
<p>The article begins by formulating a problem regarding how to extract emails from any website using <a href="https://blog.finxter.com/python-developer-income-and-opportunity/" data-type="post" data-id="189354" target="_blank" rel="noreferrer noopener">Python</a>, gives you an overview of solutions, and then goes into great detail about each solution for beginners. </p>
<p>At the end of this article, you will know the results of comparing methods of extracting emails from a website. Continue reading to find out the answers.</p>
<p>You may want to read the disclaimer on web scraping here:</p>
<p class="has-base-background-color has-background"><img src="https://s.w.org/images/core/emoji/14.0.0/72x72/2696.png" alt="⚖" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Recommended Tutorial</strong>: <a href="https://blog.finxter.com/is-web-scraping-legal/" data-type="post" data-id="383048" target="_blank" rel="noreferrer noopener">Is Web Scraping Legal?</a></p>
<p>You can find the full code of both web scrapers on our GitHub <a rel="noreferrer noopener" href="https://github.com/finxter/extract-emails-from-websitepython" data-type="URL" data-id="https://github.com/finxter/extract-emails-from-websitepython" target="_blank">here</a>. <img src="https://s.w.org/images/core/emoji/14.0.0/72x72/1f448.png" alt="?" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p>
<h2>Problem Formulation</h2>
<p>Marketers build <a href="https://blog.finxter.com/subscribe/" data-type="page" data-id="1414" target="_blank" rel="noreferrer noopener">email lists</a> to generate leads. </p>
<p>Statistics show that 33% of marketers send weekly emails, and 26% send emails multiple times per month. An email list is a fantastic tool for both companies and job seekers. </p>
<p>For instance, to find out about employment openings, you can look up an employee’s email address at your desired company. </p>
<p>However, manually locating, copying, and pasting emails into a <a href="https://blog.finxter.com/convert-html-table-to-csv-in-python/" data-type="post" data-id="590862" target="_blank" rel="noreferrer noopener">CSV file</a> takes time, costs money, and is prone to error. There are a lot of online tutorials for building email extraction bots. </p>
<p>When attempting to extract email from a website, these bots experience some difficulty. The issues include the lengthy data extraction times and the occurrence of unexpected errors. </p>
<p>Then, how can you obtain an email address from a company website in the most efficient manner? How can we use the robust programming language Python to extract this data?</p>
<h2>Method Summary</h2>
<p>This post will provide two ways to extract emails from websites. They are referred to as <strong><em>Direct Email Extraction</em></strong> and <strong><em>Indirect Email Extraction</em></strong>, respectively.</p>
<p class="has-base-background-color has-background"><img src="https://s.w.org/images/core/emoji/14.0.0/72x72/1f4a1.png" alt="?" class="wp-smiley" style="height: 1em; max-height: 1em;" /> Our Python code will search for emails on the target page of a given company or specific website when using the <strong>direct email extraction</strong> method. </p>
<p>For instance, when a user enters “<a href="http://www.scrapingbee.com">www.scrapingbee.com</a>”  into their screen, our Python email extractor bot scrapes the website’s URLs. Then it uses a <a href="https://blog.finxter.com/python-regex/" data-type="post" data-id="6210" target="_blank" rel="noreferrer noopener">regex</a> library to look for emails before <a href="https://blog.finxter.com/pandas-dataframe-to_csv-method/" data-type="post" data-id="344277" target="_blank" rel="noreferrer noopener">saving them</a> in a CSV file.</p>
<p class="has-base-background-color has-background"><img src="https://s.w.org/images/core/emoji/14.0.0/72x72/1f4a1.png" alt="?" class="wp-smiley" style="height: 1em; max-height: 1em;" /> The second method, the <strong>indirect email extraction</strong> method, leverages Google.com’s <strong><em>Search Engine Result Page (SERP)</em></strong> to extract email addresses instead of using a specific website. </p>
<p>For instance, a user may type “scrapingbee.com” as the website name. The email extractor bot will search on this term and return the results to the system. The bot then stores the email addresses extracted using regex into a CSV file from these search results. </p>
<p class="has-global-color-8-background-color has-background"><img src="https://s.w.org/images/core/emoji/14.0.0/72x72/1f449.png" alt="?" class="wp-smiley" style="height: 1em; max-height: 1em;" /> In the next section, you will learn about these methods in more detail.</p>
<p>These two techniques are excellent email list-building tools. </p>
<p>The main issue with alternative email extraction techniques posted online, as already mentioned, is that they extract hundreds of irrelevant website URLs that don’t contain emails. Running these approaches can take several hours.</p>
<p>Read on to discover our two excellent methods.&nbsp;</p>
<h2><strong>Solution</strong></h2>
<h3><strong>Method 1: Direct Email Extraction</strong></h3>
<p>This method will outline the step-by-step process for obtaining an email address from a particular website.</p>
<h4><strong>Step 1: Install Libraries.</strong></h4>
<p>Using the <a href="https://blog.finxter.com/a-guide-of-all-pip-commands/" data-type="post" data-id="90570" target="_blank" rel="noreferrer noopener"><code>pip</code> command</a>, install the following Python libraries:</p>
<ol>
<li>You can use <a href="https://blog.finxter.com/python-regex-tutorial/" data-type="post" data-id="5629" target="_blank" rel="noreferrer noopener">Regular Expression</a> (<code>re</code>) to match an email address’s format.</li>
<li>You can use the <code><a href="https://blog.finxter.com/python-requests-library/" data-type="post" data-id="37796" target="_blank" rel="noreferrer noopener">requests</a></code> module to send HTTP requests.</li>
<li><code>bs4</code> is <a href="https://blog.finxter.com/installing-beautiful-soup/" data-type="post" data-id="17693" target="_blank" rel="noreferrer noopener">Beautiful Soup</a> for web page extraction.</li>
<li>The <code>deque</code> container from the <code>collections</code> module stores the URLs waiting to be scraped.</li>
<li>The <code>urlsplit</code> function in the <code>urllib.parse</code> module splits a URL into its components.</li>
<li>The emails can be saved in a DataFrame for future processing using the <code><a href="https://blog.finxter.com/pandas-quickstart/" data-type="post" data-id="16511" target="_blank" rel="noreferrer noopener">pandas</a></code> module.</li>
<li>You can use the <code>tld</code> library to acquire relevant emails.</li>
</ol>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># re, collections, and urllib ship with Python's standard library
pip install requests
pip install bs4
pip install lxml
pip install pandas
pip install tld
</pre>
<h4><strong>Step 2<strong>:</strong> Import Libraries.</strong></h4>
<p>Import the libraries as shown below:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import re
import requests
from bs4 import BeautifulSoup
from collections import deque
from urllib.parse import urlsplit
import pandas as pd
from tld import get_fld
</pre>
<h4><strong>Step 3<strong>:</strong> Create User Input.</strong></h4>
<p>Ask the user to enter the desired website for extracting emails with the <code><a href="https://blog.finxter.com/python-input-function/" data-type="post" data-id="24632" target="_blank" rel="noreferrer noopener">input()</a></code> function and store them in the variable <code>user_url</code>:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">user_url = input("Enter the website url to extract emails: ")
if not user_url.startswith("https://"):
    user_url = "https://" + user_url
</pre>
<h4><strong>Step 4<strong>:</strong> Set up variables.</strong></h4>
<p>Before we start writing the code, let’s define some variables.</p>
<p>Create two variables using the command below to store the URLs of scraped and un-scraped websites:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">unscraped_url = deque([user_url])
scraped_url = set()
</pre>
<p>You can save the URLs of websites that are not scraped using the <code>deque</code> container. Additionally, the URLs of the sites that were scraped are saved in a <a href="https://blog.finxter.com/sets-in-python/" data-type="post" data-id="1908" target="_blank" rel="noreferrer noopener">set data format</a>.</p>
<p>As seen below, the variable <code>list_emails</code> contains the retrieved emails:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">list_emails = set()</pre>
<p>Utilizing a set data type is primarily intended to <a href="https://blog.finxter.com/how-to-remove-duplicates-from-a-python-list-while-preserving-order/" data-type="post" data-id="13975" target="_blank" rel="noreferrer noopener">eliminate duplicate</a> emails and keep just unique emails.</p>
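<p>The deduplication behavior described above can be seen in a minimal, self-contained sketch (the URLs and addresses below are hypothetical placeholders):</p>

```python
from collections import deque

# Mirror of the article's variables, seeded with a hypothetical URL
unscraped_url = deque(["https://example.com"])
scraped_url = set()
list_emails = set()

# Updating a set with duplicates keeps only the unique emails
list_emails.update(["a@example.com", "b@example.com", "a@example.com"])
print(len(list_emails))  # 2 unique emails

# popleft() takes the next URL from the left end of the deque
url = unscraped_url.popleft()
scraped_url.add(url)
```

<p>This is exactly why the article uses a set: duplicates are dropped for free.</p>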
<p>Let us proceed to the next step of our main program to extract email from a website.</p>
<h4><strong>Step 5<strong>:</strong> Adding URLs for Content Extraction.</strong></h4>
<p>Web page URLs are transferred from the variable <code>unscraped_url</code> to <code>scraped_url</code> to begin the process of extracting content from the user-entered URLs.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">while len(unscraped_url):
    url = unscraped_url.popleft()
    scraped_url.add(url)
</pre>
<p>The <code>popleft()</code> method removes the web page URLs from the left side of the <code>deque</code> container and saves them in the <code>url</code> variable. </p>
<p>Then the <code>url</code> is stored in <code>scraped_url</code> using the <code><a href="https://blog.finxter.com/python-set-add/" data-type="post" data-id="27986" target="_blank" rel="noreferrer noopener">add()</a></code> method.</p>
<h4><strong>Step 6<strong>:</strong> Splitting of URLs and merging them with base URL.</strong></h4>
<p>The website contains relative links that you cannot access directly. </p>
<p>Therefore, we must merge the relative links with the base URL. We need the <code>urlsplit()</code> function to do this.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">parts = urlsplit(url)</pre>
<p>The <code>parts</code> variable now holds the segmented URL, as shown below:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">SplitResult(scheme='https', netloc='www.scrapingbee.com', path='/', query='', fragment='')</pre>
<p>As the example above shows, the URL <a href="https://www.scrapingbee.com/" target="_blank" rel="noreferrer noopener">https://www.scrapingbee.com/</a> is divided into <code>scheme</code>, <code>netloc</code>, <code>path</code>, and other elements.</p>
<p>The split result’s <code>netloc</code> variable contains the website’s name. Continue reading to learn how this procedure benefits our programming.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">base_url = "{0.scheme}://{0.netloc}".format(parts)</pre>
<p>Next, we create the basic URL by merging the <code>scheme</code> and <code>netloc</code>.</p>
<p>The base URL is the main website’s URL, i.e., what you type into the browser’s address bar.</p>
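<p>The splitting and re-joining steps can be tried end to end on a sample URL (the URL below is just an illustration):</p>

```python
from urllib.parse import urlsplit

# Hypothetical URL used for illustration
parts = urlsplit("https://www.scrapingbee.com/blog")

# Rebuild the base URL from the scheme and the network location
base_url = "{0.scheme}://{0.netloc}".format(parts)
print(parts.netloc)  # www.scrapingbee.com
print(base_url)      # https://www.scrapingbee.com
```

<p>Note that <code>urlsplit()</code> does no network access; it is pure string parsing.</p>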
<p>If the user enters relative URLs when requested by the program, we must then convert them back to base URLs. We can accomplish this by using the command:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">if '/' in parts.path:
    part = url.rfind("/")
    path = url[0:part + 1]
else:
    path = url
</pre>
<p>Let us understand how each line of the above command works. </p>
<p>Suppose the user enters the following URL: </p>
<ul>
<li><a href="https://www.scrapingbee.com/blog" target="_blank" rel="noreferrer noopener">https://www.scrapingbee.com/blog</a></li>
</ul>
<p>This URL is a relative link, and the above set of commands will convert it to a base URL (<a href="https://www.scrapingbee.com" target="_blank" rel="noreferrer noopener"><strong>https://www.scrapingbee.com</strong></a><strong>). </strong>Let’s see how it works.</p>
<p>If the condition finds that there is a “<code>/</code>” in the path of the URL, the command locates the last slash “<code>/</code>” using the <code><a href="https://blog.finxter.com/python-string-rfind/" data-type="post" data-id="26085">rfind()</a></code> method. For this URL, the last “<code>/</code>” is at index 27.</p>
<p>The next line of code stores the URL characters from index 0 up to 27 + 1, i.e., the first 28 characters: <a href="https://www.scrapingbee.com/" target="_blank" rel="noreferrer noopener">https://www.scrapingbee.com/</a>. Thus, it converts the URL to the base URL.</p>
<p>In the <code>else</code> branch, if the URL contains no path to strip, the <code>path</code> variable is simply the URL itself.</p>
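<p>The slicing logic above can be checked in isolation on the example URL:</p>

```python
# Example URL from the article
url = "https://www.scrapingbee.com/blog"

part = url.rfind("/")   # index of the last slash
path = url[0:part + 1]  # everything up to and including that slash
print(part)  # 27
print(path)  # https://www.scrapingbee.com/
```
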
<p>The following command prints the URLs for which the program is <a href="https://blog.finxter.com/newspaper3k-a-python-library-for-fast-web-scraping/" data-type="post" data-id="34047" target="_blank" rel="noreferrer noopener">scraping</a>. </p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">print("Searching for Emails in %s" % url)</pre>
<h4><strong> Step 7<strong>:</strong></strong> <strong>Extracting Emails from the URLs.</strong></h4>
<p>An <a href="https://blog.finxter.com/python-requests-get-the-ultimate-guide/" data-type="post" data-id="37837" target="_blank" rel="noreferrer noopener">HTTP GET request</a> accesses the user-entered website.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">response = requests.get(url)</pre>
<p>Then, extract all email addresses from the response variable using a <a href="https://blog.finxter.com/how-to-find-all-matches-using-regex/" data-type="post" data-id="481806" target="_blank" rel="noreferrer noopener">regular expression</a>, and update them to the <code>list_emails</code> set.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">new_emails = re.findall(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b", response.text, re.I)
list_emails.update(new_emails)
</pre>
<p>The regular expression is built to match email address syntax; the matches are stored in the <code>new_emails</code> variable. It pulls email addresses from the page content returned by <code>response.text</code>. The <code>re.I</code> <a rel="noreferrer noopener" href="https://blog.finxter.com/python-regex-flags/" data-type="post" data-id="5733" target="_blank">flag</a> makes the match case-insensitive. The <code>list_emails</code> set is updated with the new emails.</p>
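<p>The same pattern can be exercised on a small hard-coded string instead of a live page (the addresses below are invented for the demo):</p>

```python
import re

# Hypothetical page text standing in for response.text
sample = "Reach us at sales@example.com or SUPPORT@EXAMPLE.COM for help."

pattern = r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"
new_emails = re.findall(pattern, sample, re.I)  # re.I ignores letter case

list_emails = set()
list_emails.update(new_emails)
print(list_emails)
```

<p>Both addresses match thanks to <code>re.I</code>, and the set keeps each one once.</p>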
<p>The next step is to find all of the website’s URL links and follow them in order to retrieve any further email addresses. You can use the powerful Beautiful Soup module to carry out this procedure.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">soup = BeautifulSoup(response.text, 'lxml')</pre>
<p>Beautiful Soup parses the HTML document of the webpage the user has entered, as shown in the above command.</p>
<p>You can find out how many emails have been extracted with the following command.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">print("Email Extracted: " + str(len(list_emails)))</pre>
<p>The URLs related to the website can be found with “<code>a href</code>” anchor tags. </p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">for tag in soup.find_all("a"):
    if "href" in tag.attrs:
        weblink = tag.attrs["href"]
    else:
        weblink = ""
</pre>
<p>Beautiful Soup finds all the anchor tags “<code>a</code>” on the website. </p>
<p>Then, if <code>href</code> is among a tag’s attributes, the soup fetches the URL into the <code>weblink</code> variable; otherwise, <code>weblink</code> is an empty string.</p>
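<p>The href-collection logic can be tried on a tiny inline snippet; <code>"html.parser"</code> is used here only so the demo avoids the extra lxml dependency:</p>

```python
from bs4 import BeautifulSoup

# Small inline HTML standing in for a real page
html = '<a href="/#pricing">Pricing</a><a class="plain">No link</a>'
soup = BeautifulSoup(html, "html.parser")

links = []
for tag in soup.find_all("a"):
    if "href" in tag.attrs:      # anchor carries an href attribute
        weblink = tag.attrs["href"]
    else:                        # anchor without href yields an empty string
        weblink = ""
    links.append(weblink)
print(links)
```
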
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">if weblink.startswith('/'):
    weblink = base_url + weblink
elif not weblink.startswith('https'):
    weblink = path + weblink
</pre>
<p>Such an <code>href</code> contains just a link to a particular page, beginning with “<code>/</code>” followed by the page name, with no base URL. </p>
<p>For instance, you can see the following URL on the scraping bee website:</p>
<ul>
<li><code>&lt;a <strong>href="/#pricing"</strong> class="block hover:underline">Pricing&lt;/a></code></li>
<li><code>&lt;a <strong>href="/#faq</strong>" class="block hover:underline">FAQ&lt;/a></code></li>
<li><code>&lt;a <strong>href="/documentation"</strong> class="text-white hover:underline">Documentation&lt;/a></code></li>
</ul>
<p>Thus, the above command combines the extracted <code>href</code> link and the base URL.</p>
<p>For example, in the case of pricing, the weblink variable is as&nbsp; follows:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">Weblink = "https://www.scrapingbee.com/#pricing"</pre>
<p>In some cases, <code>href</code> doesn’t start with either “<code>/</code>” or “<code>https</code>”; in that case, the command combines the path with that link. </p>
<p>For example, <code>href</code> is like below:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">&lt;a href="mailto:[email protected]?subject=Enterprise plan&amp;amp;body=Hi there, I'd like to discuss the Enterprise plan." class="btn btn-sm btn-black-o w-full mt-13">1-click quote&lt;/a></pre>
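<p>The three link cases, relative path, schemeless link, and absolute URL, can be walked through with a small helper (the links and the <code>resolve</code> name are illustrative, not part of the article’s code):</p>

```python
base_url = "https://www.scrapingbee.com"
path = "https://www.scrapingbee.com/"

def resolve(weblink):
    # Same branching as the article's snippet
    if weblink.startswith('/'):
        return base_url + weblink      # relative path: prefix the base URL
    elif not weblink.startswith('https'):
        return path + weblink          # schemeless link: prefix the path
    return weblink                     # already absolute

resolved = [resolve(w) for w in ["/#pricing", "docs", "https://example.org/"]]
print(resolved)
```
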
<p>Now let’s complete the code with the following command:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">if weblink not in unscraped_url and weblink not in scraped_url:
    unscraped_url.append(weblink)

print(list_emails)
</pre>
<p>The above command appends URLs that have not yet been scraped to the <code>unscraped_url</code> variable. To view the results, <a href="https://blog.finxter.com/python-print/" data-type="post" data-id="20731" target="_blank" rel="noreferrer noopener">print</a> the <code>list_emails</code> set.</p>
<p>Run the program.</p>
<p><strong>What if the program doesn’t work?</strong></p>
<p>Are you getting errors or exceptions of Missing Schema, Connection Error, or Invalid URL?</p>
<p>Some websites can’t be accessed for one reason or another.&nbsp;</p>
<p>Don’t worry! Let’s see how to handle these errors.</p>
<p>Use the <a href="https://blog.finxter.com/python-try-except-an-illustrated-guide/" data-type="post" data-id="367118" target="_blank" rel="noreferrer noopener">Try Exception</a> command to bypass the errors as shown below:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">try:
    response = requests.get(url)
except (requests.exceptions.MissingSchema, requests.exceptions.ConnectionError, requests.exceptions.InvalidURL):
    continue
</pre>
<p>Insert the command before the email regex command. Precisely, place this command above the <code>new_emails</code> variable.</p>
<p>Run the program now.</p>
<p><strong>Did the program work?</strong></p>
<p>Does it keep on running for several hours and not complete it?</p>
<p>The program searches and extracts all the URLs from the given website. It also extracts links pointing to other domains. For example, the Scraping Bee website has URLs such as <a href="https://seekwell.io/" target="_blank" rel="noreferrer noopener">https://seekwell.io/</a>, <a href="https://codesubmit.io/" target="_blank" rel="noreferrer noopener">https://codesubmit.io/</a>, and more.</p>
<p>A well-built website can have up to 100 links on a single page, so the program will take several hours to extract them all.</p>
<p>Sorry about it. You have to face this issue to get your target emails.</p>
<p>Bye Bye, the article ends here……..</p>
<p>No, I am just joking!</p>
<p>Fret Not! I will give you the best solution in the next step.</p>
<h4><strong>Step 8<strong>:</strong> Fix the code problems.</strong></h4>
<p>Here is the solution code for you:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">if base_url in weblink:  # code1
    if ("contact" in weblink or "Contact" in weblink or "About" in weblink
            or "about" in weblink or 'CONTACT' in weblink or 'ABOUT' in weblink
            or 'contact-us' in weblink):  # code2
        if weblink not in unscraped_url and weblink not in scraped_url:
            unscraped_url.append(weblink)
</pre>
<p>First off, apply code 1, which keeps only weblinks that contain the base URL, preventing the scraper from wandering off to other domains.</p>
<p>Since the majority of emails are provided on the contact us and about web pages, only those links from those sites will be extracted (Refer to code 2). Other pages are not considered.</p>
<p>Finally, unscraped URLs are added to the <code>unscraped_url</code> variable.</p>
<h4><strong>Step 9<strong>:</strong> Exporting the Email Address to CSV file.</strong></h4>
<p>Finally, we can save the email address in a CSV file (<code>email2.csv</code>) through data frame <a href="https://blog.finxter.com/pandas-quickstart/" data-type="post" data-id="16511" target="_blank" rel="noreferrer noopener">pandas</a>.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">url_name = "{0.netloc}".format(parts)
col = "List of Emails " + url_name
df = pd.DataFrame(list_emails, columns=[col])
s = get_fld(base_url)
df = df[df[col].str.contains(s) == True]
df.to_csv('email2.csv', index=False)
</pre>
<p>We use <code>get_fld</code> to save emails belonging to the first level domain name of the base URL. The <code>s</code> variable contains the first level domain of the base URL. For example, the first level domain is scrapingbee.com.</p>
<p>We include only emails ending with the website’s first-level domain name in the data frame. Other domain names that do not belong to the base URL are ignored. Finally, the <a href="https://blog.finxter.com/how-to-export-pandas-dataframe-to-csv-example/" data-type="post" data-id="562980" target="_blank" rel="noreferrer noopener">data frame transfers emails to a CSV file</a>.</p>
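<p>The DataFrame filter can be sketched offline; here <code>get_fld(base_url)</code> is replaced by its hard-coded result, and the email addresses are invented for the demo:</p>

```python
import pandas as pd

# get_fld("https://www.scrapingbee.com") would return "scrapingbee.com";
# hard-coded here so the sketch runs without the tld package
s = "scrapingbee.com"

col = "List of Emails www.scrapingbee.com"
df = pd.DataFrame(["hello@scrapingbee.com", "noreply@other.org"], columns=[col])

# Keep only emails containing the site's first-level domain
df = df[df[col].str.contains(s) == True]
print(df)
```

<p>Only the address on the site’s own domain survives the filter; the foreign-domain address is dropped.</p>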
<p>As previously stated, a web admin can maintain up to 100 links per page. </p>
<p>Because there are more than 30 hyperlinks on each page of a normal website, it will still take some time to finish the program. If you believe that the software has extracted enough emails, you may manually halt it using a <code>try ... except KeyboardInterrupt</code> block and the <code>raise SystemExit</code> command as shown below:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">try:
    while len(unscraped_url):
        …
        if base_url in weblink:
            if ("contact" in weblink or "Contact" in weblink or "About" in weblink
                    or "about" in weblink or 'CONTACT' in weblink or 'ABOUT' in weblink
                    or 'contact-us' in weblink):
                if weblink not in unscraped_url and weblink not in scraped_url:
                    unscraped_url.append(weblink)
    url_name = "{0.netloc}".format(parts)
    col = "List of Emails " + url_name
    df = pd.DataFrame(list_emails, columns=[col])
    s = get_fld(base_url)
    df = df[df[col].str.contains(s) == True]
    df.to_csv('email2.csv', index=False)
except KeyboardInterrupt:
    url_name = "{0.netloc}".format(parts)
    col = "List of Emails " + url_name
    df = pd.DataFrame(list_emails, columns=[col])
    s = get_fld(base_url)
    df = df[df[col].str.contains(s) == True]
    df.to_csv('email2.csv', index=False)
    print("Program terminated manually!")
    raise SystemExit
</pre>
<p>Run the program and enjoy it…</p>
<p>Let’s see what our fantastic email scraper application produced. The website I have entered is www.abbott.com.</p>
<p>Output:</p>
<div class="wp-block-image">
<figure class="aligncenter size-large"><img loading="lazy" width="478" height="1024" src="https://blog.finxter.com/wp-content/uploads/2022/09/image-8-478x1024.png" alt="" class="wp-image-669228" sizes="(max-width: 478px) 100vw, 478px" /></figure>
</div>
<h3><strong>Method 2: Indirect Email Extraction</strong></h3>
<p>You will learn the steps to extract email addresses from Google.com using the second method.</p>
<h4><strong>Step 1: Install Libraries.</strong></h4>
<p>Using the <code>pip</code> command, install the following Python libraries:</p>
<ol>
<li><code>bs4</code> is Beautiful Soup for extracting Google pages.</li>
<li>The <code>pandas</code> module can save emails in a DataFrame for future processing.</li>
<li>You can use Regular Expression (<code>re</code>) to match the email address format.</li>
<li>The <code>requests</code> library sends HTTP requests.</li>
<li>You can use the <code>tld</code> library to acquire relevant emails.</li>
<li>The <code>time</code> library delays the scraping of pages.</li>
</ol>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># re and time ship with Python's standard library
pip install bs4
pip install pandas
pip install requests
pip install tld
</pre>
<h4><strong>Step 2: Import Libraries.</strong></h4>
<p>Import the libraries.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">from bs4 import BeautifulSoup
import pandas as pd
import re
import requests
from tld import get_fld
import time
</pre>
<h4><strong>Step 3: Constructing Search Query.</strong></h4>
<p>The search query is written in the format “<code>@websitename.com</code>“.</p>
<p>Create an input for the user to enter the URL of the website.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">user_keyword = input("Enter the Website Name: ")
user_keyword = '"@' + user_keyword + '"'
</pre>
<p>As indicated in the code for the <code>user_keyword</code> variable above, the search query has the format “<code>@websitename.com</code>”, wrapped in opening and closing double quotes so Google searches for the exact phrase.</p>
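<p>A quick illustration of the resulting query string (the website name <code>abbott.com</code> is just an example input):</p>

```python
# Build the exact-phrase Google query: "@websitename.com"
website = "abbott.com"               # stands in for the user's input
user_keyword = '"@' + website + '"'
print(user_keyword)                  # "@abbott.com"
```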
<h4><strong>Step 4: Define Variables.</strong></h4>
<p>Before moving on to the heart of the program, let’s first set up the variables.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">page = 0
list_email = set()
</pre>
<p>The <code>page</code> variable moves through the Google search result pages, and <code>list_email</code> is a set that collects the extracted emails (a set avoids duplicates).</p>
<h4><strong>Step 5: Requesting Google Page.</strong></h4>
<p>In this step, you will learn how to create a Google URL link using a user keyword term and request the same.</p>
<p>The main part of the code starts as follows:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">while page &lt;= 100:
    print("Searching Emails in page No " + str(page))
    time.sleep(20.00)
    google = "https://www.google.com/search?q=" + user_keyword + "&amp;ei=dUoTY-i9L_2Cxc8P5aSU8AI&amp;start=" + str(page)
    response = requests.get(google)
    print(response)
</pre>
<p>Let’s examine what each line of code does.</p>
<ul>
<li>The <code>while</code> loop lets the email extraction bot walk through the Google results up to offset 100. Google serves 10 results per page, so this covers the first eleven result pages (offsets 0, 10, …, 100).</li>
<li>The code prints the offset of the Google page currently being extracted: the first page starts at offset 0, the second at 10, the third at 20, and so on.</li>
<li>To avoid getting the IP blocked by Google, the script sleeps for 20 seconds between requests, slowing the crawl down.</li>
</ul>
<p>Before creating the <code>google</code> variable, let us learn more about the Google search URL.</p>
<p>Suppose you search the keyword “Germany” on google.com. The Google search URL will then be as follows:</p>
<ul>
<li><code>https://www.google.com/search?q=germany</code></li>
</ul>
<p>If you click the second page of the Google search result, then the link will be as follows:</p>
<ul>
<li><code>https://www.google.com/search?q=germany&amp;ei=dUoTY-i9L_2Cxc8P5aSU8AI&amp;start=10</code></li>
</ul>
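<p>The two URLs above differ only in the <code>start</code> offset. In general, <code>start</code> counts results rather than pages, at 10 results per page; a small helper (the function name is my own) makes the mapping explicit:</p>

```python
# Google's "start" parameter is a result offset: page 1 starts at 0,
# page 2 at 10, page 3 at 20, and so on (10 results per page).
def start_offset(page_number):
    return (page_number - 1) * 10

url = "https://www.google.com/search?q=germany&start=" + str(start_offset(2))
print(url)   # https://www.google.com/search?q=germany&start=10
```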
<p>How does that link work?</p>
<ul>
<li>The user keyword is inserted after the “<code>q=</code>” parameter, and the page offset is appended after “<code>start=</code>”, as shown in the <code>google</code> variable above.</li>
<li>The code then requests the Google page and prints the response to test whether the request worked. A <code>200</code> status code means the page was fetched successfully; a <code>429</code> means you have hit the request limit and must wait about two hours before making more requests.</li>
</ul>
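<p>The status-code handling described above could be made explicit along these lines. This is only a sketch, and the stop-on-429 policy is my assumption rather than part of the original script; any object with a <code>status_code</code> attribute (such as a <code>requests.Response</code>) works here:</p>

```python
# Sketch: explicit handling of the two status codes discussed above.
def check_response(response):
    if response.status_code == 200:
        return True            # page fetched successfully
    if response.status_code == 429:
        print("Rate limited by Google - wait before retrying.")
        return False
    print("Unexpected status:", response.status_code)
    return False

class FakeResponse:            # stand-in for requests.Response in this demo
    def __init__(self, code):
        self.status_code = code

print(check_response(FakeResponse(200)))   # True
print(check_response(FakeResponse(429)))   # False
```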
<h4><strong>Step 6: Extracting Email Address.</strong></h4>
<p>In this step, you will learn how to extract the email addresses from the Google search result contents.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">soup = BeautifulSoup(response.text, 'html.parser')
new_emails = re.findall(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b", soup.text, re.I)
list_email.update(new_emails)
page = page + 10
</pre>
<p>Beautiful Soup parses the web page and extracts the text content of the HTML.</p>
<p>With the regex <code><a rel="noreferrer noopener" href="https://blog.finxter.com/python-re-findall/" data-type="post" data-id="5729" target="_blank">findall()</a></code> function, you can obtain the email addresses, as shown above. The new emails are added to the <code>list_email</code> set, and the page counter is incremented by 10 to move on to the next results page.</p>
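<p>Here is the extraction step in isolation, on a small sample string. Note that the character class is written <code>[A-Za-z]</code>: a literal <code>|</code> inside square brackets would also match pipe characters. The sample addresses below are made up for the demo:</p>

```python
import re

# Email pattern from the article, applied to a sample string.
pattern = r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"
sample = "Contact press@abbott.com or jobs@abbott.com; not an email: foo@bar"

emails = set(re.findall(pattern, sample, re.I))
print(sorted(emails))   # ['jobs@abbott.com', 'press@abbott.com']
```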
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">n = len(user_keyword)-1
base_url = "https://www." + user_keyword[2:n]
col = "List of Emails " + user_keyword[2:n]
df = pd.DataFrame(list_email, columns=[col])
s = get_fld(base_url)
df = df[df[col].str.contains(s) == True]
df.to_csv('email3.csv', index=False)
</pre>
<p>And finally, the lines above save the target emails to a CSV file. The slice <code>user_keyword[2:n]</code> skips the opening quote and the <code>@</code> sign and stops before the closing quote, leaving just the domain name.</p>
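<p>To see why the slice recovers the bare domain, assuming the query string was built without stray spaces:</p>

```python
# Index 0 is the opening quote, index 1 is "@", and n = len - 1 is the
# index of the closing quote, so user_keyword[2:n] keeps the domain only.
user_keyword = '"@abbott.com"'
n = len(user_keyword) - 1
print(user_keyword[2:n])   # abbott.com
```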
<p>Run the program and see the output.</p>
<div class="wp-block-image">
<figure class="aligncenter size-large"><img loading="lazy" width="455" height="1024" src="https://blog.finxter.com/wp-content/uploads/2022/09/image-9-455x1024.png" alt="" class="wp-image-669251" srcset="https://blog.finxter.com/wp-content/uploads/2022/09/image-9-455x1024.png 455w, https://blog.finxter.com/wp-content/uplo...33x300.png 133w, https://blog.finxter.com/wp-content/uplo...mage-9.png 502w" sizes="(max-width: 455px) 100vw, 455px" /></figure>
</div>
<h3><strong>Method 1 vs. Method 2</strong></h3>
<p>Can we determine which approach is more effective for building an email list: <strong><em>Method 1, Direct Email Extraction</em></strong>, or <strong><em>Method 2, Indirect Email Extraction</em></strong>? Both output email lists were generated from the website abbott.com.</p>
<p>Let’s contrast two email lists that were extracted using Methods 1 and 2.</p>
<ul>
<li>With Method 1, the extractor retrieved 60 emails.</li>
<li>With Method 2, the extractor retrieved 19 emails.</li>
<li>17 of the 19 emails found by Method 2 are not in Method 1&#8217;s list.</li>
<li>These extra emails are employee-specific rather than company-wide. Method 1, in turn, contains more employee emails overall.</li>
</ul>
<p>Thus, we cannot recommend one procedure over the other. Each technique surfaces fresh emails the other misses, so using both methods will grow your email list the most.</p>
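<p>Since each method finds addresses the other misses, the practical move is to merge the two result lists; a set union deduplicates automatically. The addresses below are illustrative, not real scrape output:</p>

```python
# Merge the email lists from both methods; a set union drops duplicates.
method1_emails = {"media@abbott.com", "invest@abbott.com"}     # e.g. from email2.csv
method2_emails = {"media@abbott.com", "jane.doe@abbott.com"}   # e.g. from email3.csv

combined = method1_emails | method2_emails
print(len(combined))   # 3
```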
<h2><strong>Summary</strong></h2>
<p>Building an email list is crucial for businesses and freelancers alike to increase sales and leads. </p>
<p>This article offers instructions on using Python to retrieve email addresses from websites. </p>
<p>The article presents the two best methods to obtain email addresses: the first is a direct email extractor that crawls any website, and the second extracts email addresses indirectly via Google.com.</p>
<p>Finally, the two techniques are compared; since each finds emails the other misses, both are worth using.</p>
<hr class="wp-block-separator has-alpha-channel-opacity"/>
<h2>Regex Humor</h2>
<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img loading="lazy" src="https://blog.finxter.com/wp-content/uploads/2022/06/image-133.png" alt="" class="wp-image-428862" width="700" height="629" srcset="https://blog.finxter.com/wp-content/uploads/2022/06/image-133.png 785w, https://blog.finxter.com/wp-content/uplo...00x270.png 300w, https://blog.finxter.com/wp-content/uplo...68x691.png 768w" sizes="(max-width: 700px) 100vw, 700px" /><figcaption><em>Wait, forgot to escape a space. Wheeeeee[taptaptap]eeeeee.</em> (<a href="https://imgs.xkcd.com/comics/regular_expressions.png" data-type="URL" data-id="https://imgs.xkcd.com/comics/regular_expressions.png" target="_blank" rel="noreferrer noopener">source</a>)</figcaption></figure>
</div>
</div>


https://www.sickgaming.net/blog/2022/09/...ng-python/