[Tut] Scrape a Bookstore in 5 Steps Python [Learn Project] - xSicKxBot - 06-17-2022

Scrape a Bookstore in 5 Steps Python [Learn Project]

<div>
<p><em><strong>Story</strong>: This series of articles assumes you work in the IT Department of Mason Books. The owner asks you to scrape the website of a competitor. He would like this information to gain insight into his pricing structure.</em></p>
<p class="has-base-background-color has-background"><img src="https://s.w.org/images/core/emoji/14.0.0/72x72/1f4a1.png" alt="💡" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Note</strong>: Before continuing, we recommend that you have at least a basic knowledge of <a rel="noreferrer noopener" href="https://www.w3schools.com/html/" target="_blank">HTML</a> and <a rel="noreferrer noopener" href="https://www.w3schools.com/css/default.asp" target="_blank">CSS</a> and have reviewed our articles on <a rel="noreferrer noopener" href="https://blog.finxter.com/how-to-scrape-html-tables-part-1/" target="_blank">How to Scrape HTML Tables</a>.</p>
<h2>What You’ll Build in This Project</h2>
<p>Let’s navigate to <a rel="noreferrer noopener" href="https://books.toscrape.com/index.html" data-type="URL" data-id="https://books.toscrape.com/index.html" target="_blank">Books to Scrape</a> and review the format.</p>
<div class="wp-block-image">
<figure class="aligncenter size-large"><img loading="lazy" width="1024" height="564" src="https://blog.finxter.com/wp-content/uploads/2022/03/kmc-books-01a-1024x564.png" alt="" class="wp-image-224055" srcset="https://blog.finxter.com/wp-content/uploads/2022/03/kmc-books-01a-1024x564.png 1024w, https://blog.finxter.com/wp-content/uploads/2022/03/kmc-books-01a-300x165.png 300w, https://blog.finxter.com/wp-content/uploads/2022/03/kmc-books-01a-768x423.png 768w, https://blog.finxter.com/wp-content/uploads/2022/03/kmc-books-01a.png 1247w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>
</div>
<p>At first glance, you will notice:</p>
<ul>
<li>Book categories display on the left-hand side.</li>
<li>The website lists 1,000 books in total.</li>
<li>Each web page shows 20 books.</li>
<li>Each price is in £ (in this instance, the UK pound).</li>
<li>Each book displays <strong>minimal</strong> details.</li>
<li>To view <strong>complete</strong> details for a book, click on the image or the <code>Book Title</code> hyperlink. This hyperlink leads to a page containing additional details for the selected book.</li>
<li>The total number of website pages displays in the footer (<code>Page 1 of 50</code>).</li>
</ul>
<h2 class="wp-embed-aspect-16-9 wp-has-aspect-ratio" id="getting-started">Step 1: Install and Import Libraries for Project</h2>
<p class="wp-embed-aspect-16-9 wp-has-aspect-ratio">Before you can scrape or manipulate any data, you need to install three (3) libraries.</p>
<ul>
<li>The <em><a rel="noreferrer noopener" href="https://blog.finxter.com/pandas-quickstart/" data-type="URL" data-id="https://blog.finxter.com/pandas-quickstart/" target="_blank">Pandas</a></em> library provides the <em>DataFrame</em>, a tabular structure for reading, manipulating, and writing data.</li>
<li>The <em><a rel="noreferrer noopener" href="https://blog.finxter.com/best-python-requests-tutorials/" data-type="URL" data-id="https://blog.finxter.com/best-python-requests-tutorials/" target="_blank">Requests</a></em> library sends HTTP requests from Python.</li>
<li>The <a rel="noreferrer noopener" href="https://blog.finxter.com/web-scraping-with-beautifulsoup-in-python/" data-type="URL" data-id="https://blog.finxter.com/web-scraping-with-beautifulsoup-in-python/" target="_blank">Beautiful Soup</a> library enables data extraction from HTML and XML files.</li>
</ul>
<p>To install these libraries, navigate to an <a rel="noreferrer noopener" href="https://blog.finxter.com/best-python-ide/" data-type="post" data-id="8106" target="_blank">IDE</a> terminal and execute the commands below at the command prompt. For the terminal used in this example, the command prompt is a dollar sign (<code>$</code>). Your terminal prompt may be different.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">$ pip install pandas</pre>
<p>Hit the <code>&lt;Enter&gt;</code> key on the keyboard to start the installation process.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">$ pip install requests</pre>
<p>Hit the <code>&lt;Enter&gt;</code> key on the keyboard to start the installation process.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">$ pip install beautifulsoup4</pre>
<p>Hit the <code>&lt;Enter&gt;</code> key on the keyboard to start the installation process.</p>
<p>If the installations were successful, a message in the terminal confirms each one.</p>
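<p>If you prefer to confirm the installations from Python rather than from the terminal output, a minimal sketch along these lines may help. The function name <code>missing_packages</code> and the constant <code>REQUIRED</code> are made up for this example. Note that the <code>beautifulsoup4</code> package is imported under the name <code>bs4</code>.</p>

```python
from importlib.util import find_spec


def missing_packages(import_names):
    """Return the import names that Python cannot locate."""
    # find_spec() returns None when a top-level module is not installed.
    return [name for name in import_names if find_spec(name) is None]


# Import names (not pip names) of the three libraries installed above.
REQUIRED = ["pandas", "requests", "bs4"]
```

<p>Calling <code>missing_packages(REQUIRED)</code> should return an empty list once all three installations have succeeded.</p>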
<hr class="wp-block-separator has-css-opacity"/>
<p>Feel free to view the PyCharm installation guides for the required libraries.</p>
<ul>
<li><a href="https://blog.finxter.com/how-to-install-pandas-on-pycharm/" data-type="URL" data-id="https://blog.finxter.com/how-to-install-pandas-on-pycharm/" target="_blank" rel="noreferrer noopener">How to install Pandas on PyCharm</a></li>
<li><a href="https://blog.finxter.com/how-to-install-requests-in-python/" data-type="URL" data-id="https://blog.finxter.com/how-to-install-requests-in-python/" target="_blank" rel="noreferrer noopener">How to install Requests on PyCharm</a></li>
<li><a href="https://blog.finxter.com/how-to-install-beautifulsoup-on-pycharm/" data-type="URL" data-id="https://blog.finxter.com/how-to-install-beautifulsoup-on-pycharm/" target="_blank" rel="noreferrer noopener">How to install BeautifulSoup4 on PyCharm</a></li>
</ul>
<hr class="wp-block-separator has-css-opacity"/>
<p>Add the following imports to the top of each code snippet so that the code in this article has the libraries it needs.</p>
<pre class="EnlighterJSRAW wp-embed-aspect-16-9 wp-has-aspect-ratio" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import pandas as pd
import requests
from bs4 import BeautifulSoup
import time
import urllib.request
from csv import reader, writer</pre>
<ul>
<li>The <code>time</code> library is built into Python and does not require installation. It contains <a rel="noreferrer noopener" href="https://blog.finxter.com/time-delay-in-python/" data-type="URL" data-id="https://blog.finxter.com/time-delay-in-python/" target="_blank"><code>time.sleep()</code></a>, which is used to set a delay between page scrapes.</li>
<li>The <code>urllib</code> library is built into Python and does not require installation. It contains the <code>urllib.request</code> module, which is used to save images.</li>
<li>The <code>csv</code> library is built into Python and does not require installation. It contains the <code>reader</code> and <code>writer</code> functions, which are used to read from and write to a CSV file.</li>
</ul>
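<p>To get a feel for the built-in <code>csv</code> library before scraping anything, here is a small sketch that writes rows with <code>writer</code> and reads them back with <code>reader</code>. It uses an in-memory <code>io.StringIO</code> buffer instead of a file on disk, and the function names are invented for this illustration.</p>

```python
import io
from csv import reader, writer


def rows_to_csv_text(rows):
    """Serialize a list of rows to CSV text using csv.writer."""
    buffer = io.StringIO(newline="")  # newline="" lets csv control line endings
    writer(buffer).writerows(rows)
    return buffer.getvalue()


def csv_text_to_rows(text):
    """Parse CSV text back into a list of rows using csv.reader."""
    # csv.reader always yields strings, so numbers come back as text.
    return list(reader(io.StringIO(text, newline="")))
```

<p>A title that contains a comma is quoted automatically by <code>writer</code>, which is one reason to use the library rather than joining strings by hand.</p>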
<h2>Step 2: Understand Basics and Scrape Your First Results</h2>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" width="909" height="462" src="https://blog.finxter.com/wp-content/uploads/2022/03/kmc-books-04a.png" alt="" class="wp-image-224220" srcset="https://blog.finxter.com/wp-content/uploads/2022/03/kmc-books-04a.png 909w, https://blog.finxter.com/wp-content/uploads/2022/03/kmc-books-04a-300x152.png 300w, https://blog.finxter.com/wp-content/uploads/2022/03/kmc-books-04a-768x390.png 768w" sizes="(max-width: 909px) 100vw, 909px" /></figure>
</div>
<p>In this step, you’ll perform the following tasks:</p>
<ul id="block-990dfa6f-f2e6-423a-84d3-3fbfcb432a12">
<li>Reviewing the website to scrape.</li>
<li>Understanding HTTP status codes.</li>
<li>Connecting to the <a rel="noreferrer noopener" href="https://books.toscrape.com/index.html" target="_blank">Books to Scrape</a> website using the <code><a rel="noreferrer noopener" href="https://blog.finxter.com/python-requests-library/" target="_blank">requests</a></code> library.</li>
<li>Retrieving the total number of pages to scrape.</li>
<li>Closing the open connection.</li>
</ul>
<p class="has-base-background-color has-background"><img src="https://s.w.org/images/core/emoji/14.0.0/72x72/1f30d.png" alt="🌍" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Learn More</strong>: Learn everything you need to know to reproduce this step in the <a href="https://blog.finxter.com/scraping-a-bookstore-part-1/" data-type="URL" data-id="https://blog.finxter.com/scraping-a-bookstore-part-1/" target="_blank" rel="noreferrer noopener">in-depth Finxter blog tutorial</a>.</p>
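<p>The tasks in this step could be sketched roughly as follows. This is not the code from the in-depth tutorial. It assumes the page counter (<code>Page 1 of 50</code>) sits in an <code>&lt;li class="current"&gt;</code> element, which is an assumption about the site's markup, and the function names are hypothetical.</p>

```python
import re


def pages_from_footer(text):
    """Extract the page count from footer text such as 'Page 1 of 50'."""
    match = re.search(r"Page\s+\d+\s+of\s+(\d+)", text)
    if match is None:
        raise ValueError(f"no page counter found in {text!r}")
    return int(match.group(1))


def fetch_total_pages(url="https://books.toscrape.com/index.html"):
    """Connect, check the status code, read the page count, and disconnect."""
    # Third-party packages from Step 1, imported here so that
    # pages_from_footer() stays usable on its own.
    import requests
    from bs4 import BeautifulSoup

    response = requests.get(url, timeout=10)
    try:
        if response.status_code != 200:  # 200 means the request succeeded
            raise RuntimeError(f"unexpected status code {response.status_code}")
        soup = BeautifulSoup(response.text, "html.parser")
    finally:
        response.close()  # release the connection

    footer = soup.find("li", class_="current")  # assumed location of the counter
    if footer is None:
        raise ValueError("page counter element not found")
    return pages_from_footer(footer.get_text())
```

<p>Keeping the text parsing in its own function means it can be checked without touching the network, for example with <code>pages_from_footer("Page 1 of 50")</code>.</p>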
<h2>Step 3: Configure URL to Scrape and Avoid Spamming the Server</h2>
<div class="wp-block-cover aligncenter is-light"><span aria-hidden="true" class="wp-block-cover__background has-background-dim"></span><img loading="lazy" width="886" height="672" class="wp-block-cover__image-background wp-image-422310" alt="" src="https://blog.finxter.com/wp-content/uploads/2022/06/image-122.png" data-object-fit="cover" srcset="https://blog.finxter.com/wp-content/uploads/2022/06/image-122.png 886w, https://blog.finxter.com/wp-content/uploads/2022/06/image-122-300x228.png 300w, https://blog.finxter.com/wp-content/uploads/2022/06/image-122-768x583.png 768w" sizes="(max-width: 886px) 100vw, 886px" />
<div class="wp-block-cover__inner-container">
<p class="has-text-align-center has-base-3-color has-text-color has-large-font-size"><strong>Rule: Don’t Spam the Server!</strong></p>
</div>
</div>
<p>In this step, you’ll perform the following tasks:</p>
<ul id="block-30f20a4a-690b-43a9-bf02-27dbdcbfb3a7">
<li>Configuring a page URL for scraping.</li>
<li>Setting a delay with <a href="https://blog.finxter.com/time-delay-in-python/"><code>time.sleep()</code></a> to pause between page scrapes.</li>
<li><a href="https://blog.finxter.com/python-loops/" target="_blank" rel="noreferrer noopener">Looping</a> through two (2) pages for testing purposes.</li>
</ul>
<p class="has-base-background-color has-background"><img src="https://s.w.org/images/core/emoji/14.0.0/72x72/1f30d.png" alt="🌍" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Learn More</strong>: Learn everything you need to know to reproduce this step in the <a href="https://blog.finxter.com/scraping-a-bookstore-part-2/" data-type="URL" data-id="https://blog.finxter.com/scraping-a-bookstore-part-2/" target="_blank" rel="noreferrer noopener">in-depth Finxter blog tutorial</a>.</p>
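<p>One way to picture this step is the sketch below. The URL pattern <code>catalogue/page-N.html</code> is an assumption about how the site numbers its pages, and the names <code>page_urls</code> and <code>scrape_pages</code> as well as the two-second default delay are chosen only for illustration. The download itself is passed in as a function so the loop can be tried without contacting the server.</p>

```python
import time

# Assumed pattern for paginated listing pages on Books to Scrape.
BASE_URL = "https://books.toscrape.com/catalogue/page-{}.html"


def page_urls(first, last):
    """Build the URLs for pages first..last (inclusive)."""
    return [BASE_URL.format(number) for number in range(first, last + 1)]


def scrape_pages(urls, fetch, delay=2.0):
    """Call fetch(url) for each URL, sleeping between consecutive requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay)  # pause so the server is not flooded
        results.append(fetch(url))
    return results
```

<p>For a two-page test run with the <code>requests</code> library, the call could look like <code>scrape_pages(page_urls(1, 2), fetch=lambda url: requests.get(url, timeout=10).text)</code>.</p>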
<h2>Step 4: Save Book Details in a Python List</h2>
<div class="wp-block-image">
<figure class="aligncenter size-large"><img loading="lazy" width="1024" height="709" src="https://blog.finxter.com/wp-content/uploads/2022/06/image-123-1024x709.png" alt="" class="wp-image-422311" srcset="https://blog.finxter.com/wp-content/uploads/2022/06/image-123-1024x709.png 1024w, https://blog.finxter.com/wp-content/uploads/2022/06/image-123-300x208.png 300w, https://blog.finxter.com/wp-content/uploads/2022/06/image-123-768x532.png 768w, https://blog.finxter.com/wp-content/uploads/2022/06/image-123.png 1268w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>
</div>
<p>In this step, you’ll perform the following tasks:</p>
<ul>
<li>Locating book details.</li>
<li>Writing code to retrieve this information for all books.</li>
<li>Saving book details to a <a href="https://blog.finxter.com/python-lists/" target="_blank" rel="noreferrer noopener">List</a>.</li>
</ul>
<p class="has-base-background-color has-background"><img src="https://s.w.org/images/core/emoji/14.0.0/72x72/1f30d.png" alt="🌍" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Learn More</strong>: Learn everything you need to know to reproduce this step in the <a href="https://blog.finxter.com/scraping-a-bookstore-part-3/" data-type="URL" data-id="https://blog.finxter.com/scraping-a-bookstore-part-3/" target="_blank" rel="noreferrer noopener">in-depth Finxter blog tutorial</a>.</p>
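<p>As a rough illustration of collecting book details into a list, the sketch below parses one listing page with Beautiful Soup. The selectors (<code>article.product_pod</code>, <code>p.price_color</code>, <code>p.star-rating</code>) and the choice of title, price, and rating as the saved fields are assumptions about the page markup and are not taken from the in-depth tutorial.</p>

```python
# Star ratings are assumed to be encoded as a class name, e.g. "star-rating Three".
RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}


def rating_to_int(classes):
    """Map a class list such as ['star-rating', 'Three'] to 3 (0 if absent)."""
    for name in classes:
        if name in RATING_WORDS:
            return RATING_WORDS[name]
    return 0


def extract_books(html):
    """Return a list of [title, price, rating] lists for one listing page."""
    from bs4 import BeautifulSoup  # third-party package installed in Step 1

    soup = BeautifulSoup(html, "html.parser")
    books = []
    for article in soup.find_all("article", class_="product_pod"):
        link = article.h3.a if article.h3 else None
        price = article.find("p", class_="price_color")
        rating = article.find("p", class_="star-rating")
        if link is None or price is None:
            continue  # skip entries that do not match the assumed layout
        books.append([
            link.get("title", link.get_text(strip=True)),
            price.get_text(strip=True),
            rating_to_int(rating.get("class", []) if rating else []),
        ])
    return books
```

<p>Each inner list holds one book, so the result of several pages can be combined with <code>list.extend()</code> before saving.</p>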
<h2>Step 5: Clean and Save the Scraped Output</h2>
<div class="wp-block-image">
<figure class="aligncenter size-large"><img loading="lazy" width="1024" height="340" src="https://blog.finxter.com/wp-content/uploads/2022/06/image-124-1024x340.png" alt="" class="wp-image-422312" srcset="https://blog.finxter.com/wp-content/uploads/2022/06/image-124-1024x340.png 1024w, https://blog.finxter.com/wp-content/uploads/2022/06/image-124-300x100.png 300w, https://blog.finxter.com/wp-content/uploads/2022/06/image-124-768x255.png 768w, https://blog.finxter.com/wp-content/uploads/2022/06/image-124.png 1030w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>
</div>
<p>In this step, you’ll perform the following tasks:</p>
<ul>
<li>Cleaning up the scraped output.</li>
<li>Saving the output to a <a rel="noreferrer noopener" href="https://blog.finxter.com/how-to-read-a-csv-file-into-a-python-list/" target="_blank">CSV</a> file.</li>
</ul>
<p class="has-base-background-color has-background"><img src="https://s.w.org/images/core/emoji/14.0.0/72x72/1f30d.png" alt="🌍" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Learn More</strong>: Learn everything you need to know to reproduce this step in the <a href="https://blog.finxter.com/scraping-a-bookstore-part-4/" data-type="URL" data-id="https://blog.finxter.com/scraping-a-bookstore-part-4/" target="_blank" rel="noreferrer noopener">in-depth Finxter blog tutorial</a>.</p>
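<p>A possible shape for the cleaning and saving tasks is sketched below, using the built-in <code>csv</code> library. It assumes each scraped row is a <code>[title, price, rating]</code> list with the price still carrying its currency symbol. The function names and column headers are hypothetical.</p>

```python
import re
from csv import writer


def clean_price(raw):
    """Turn a scraped price such as '£51.77' into the float 51.77."""
    # Keep only digits and the decimal point; this drops the currency
    # symbol and any stray characters left over from encoding problems.
    return float(re.sub(r"[^0-9.]", "", raw))


def save_books(books, path):
    """Write [title, price, rating] rows to a CSV file with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as csv_file:
        csv_writer = writer(csv_file)
        csv_writer.writerow(["title", "price", "rating"])
        for title, price, rating in books:
            csv_writer.writerow([title.strip(), clean_price(price), rating])
```

<p>Storing the price as a plain number makes the file easier to sort and compare later, for example after loading it into a Pandas <em>DataFrame</em>.</p>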
<h2>Conclusion</h2>
<p>This tutorial has guided you through the steps to create your first practical web scraping project: scraping the contents of a bookstore!</p>
<p>Now, go out and use your skills wisely and to the benefit of humanity, my friend! <img src="https://s.w.org/images/core/emoji/14.0.0/72x72/1f642.png" alt="🙂" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p>
<hr class="wp-block-separator has-alpha-channel-opacity"/>
</div>


https://www.sickgaming.net/blog/2022/06/14/scrape-a-bookstore-in-5-steps-python-learn-project/