[Tut] Web Scraping with PHP – Tutorial to Scrape Web Pages

[Tut] Web Scraping with PHP – Tutorial to Scrape Web Pages - xSicKxBot - 08-17-2023

<div style="margin: 5px 5% 10px 5%;"><img src="https://www.sickgaming.net/blog/wp-content/uploads/2023/08/web-scraping-with-php-tutorial-to-scrape-web-pages.jpg" width="550" height="367" title="" alt="" /></div><div><div class="modified-on" readability="7.0697674418605"> by <a href="https://phppot.com/about/">Vincy</a>. Last modified on July 21st, 2023.</div>
<p>Web scraping is a mechanism to crawl web pages using software tools or utilities. It reads the content of the website pages over a network stream.</p>
<p>This technology is also known as web crawling or data extraction. In a previous tutorial, we learned <a href="https://phppot.com/php/extract-content-using-php-and-preview-like-facebook/">how to extract pages by their URLs</a>.<br /><a class="demo" href="https://phppot.com/demo/web-scraping-php">View Demo</a></p>
<p>Several PHP libraries support this feature. In this tutorial, we will look at one of the popular web-scraping components, named <strong>DomCrawler</strong>.</p>
<p>This component is part of the PHP Symfony framework. This article has the code for integrating and using this component to crawl web pages.</p>
<p><img decoding="async" loading="lazy" class="alignnone size-large wp-image-20924" src="https://phppot.com/wp-content/uploads/2023/06/web-scraping-php-550x367.jpg" alt="web scraping php" width="550" height="367" srcset="https://phppot.com/wp-content/uploads/2023/06/web-scraping-php-550x367.jpg 550w, https://phppot.com/wp-content/uploads/2023/06/web-scraping-php-300x200.jpg 300w, https://phppot.com/wp-content/uploads/2023/06/web-scraping-php-768x512.jpg 768w, https://phppot.com/wp-content/uploads/2023/06/web-scraping-php.jpg 1200w" sizes="(max-width: 550px) 100vw, 550px"></p>
<p>We can also create custom utilities to scrape the content from remote pages. <a href="https://phppot.com/php/php-curl/">PHP provides built-in cURL functions</a> to process the network request-response cycle.</p>
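<p>As a minimal sketch of such a custom utility, the function below fetches a page body with the built-in cURL functions. The name <em>fetchPage</em>, the timeout value and the error messages are assumptions made for illustration; they are not part of the original example project.</p>
<pre class="prettyprint"><code class="language-php"><?php
/**
 * Fetch the raw HTML of a page with PHP's built-in cURL functions.
 * fetchPage() is a hypothetical helper name used only in this sketch.
 */
function fetchPage(string $url, int $timeoutSeconds = 10): string
{
    $handle = curl_init($url);
    if ($handle === false) {
        throw new RuntimeException('Could not initialise cURL.');
    }

    curl_setopt_array($handle, [
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true, // follow HTTP redirects
        CURLOPT_TIMEOUT        => $timeoutSeconds,
    ]);

    $body = curl_exec($handle);
    if ($body === false) {
        // curl_error() describes why the transfer failed
        throw new RuntimeException('cURL request failed: ' . curl_error($handle));
    }

    // The handle is released when $handle goes out of scope.
    return $body;
}
</code></pre>
<p>A call such as <em>echo fetchPage('https://example.com');</em> would then print the markup of the page, which can be passed on to any HTML parser.</p>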
<h2>About DomCrawler</h2>
<p>The DomCrawler component of Symfony parses HTML and XML content.</p>
<p>It builds a crawler object that can reach any node of an HTML tree structure. It accepts queries to filter specific nodes from the input HTML or XML.</p>
<p>It provides many crawling utilities and features, including the following.</p>
<ol>
<li>Node filtering by XPath queries.</li>
<li>Node traversal by position within the set of matched nodes.</li>
<li>Node name and value reading.</li>
<li>HTML or XML insertion into the specified container tag.</li>
</ol>
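<p>The features listed above can be sketched roughly as follows. This assumes <em>symfony/dom-crawler</em> is installed via Composer; the sample markup and variable names are made up for illustration. The fourth part shows only one reading of content insertion, namely loading markup into an empty crawler.</p>
<pre class="prettyprint"><code class="language-php"><?php
require 'vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Hypothetical sample markup, used only for illustration.
$html = '<html><body><h1>Sample title</h1>'
      . '<p class="intro">First paragraph</p>'
      . '<p>Second paragraph</p></body></html>';

$crawler = new Crawler($html);

// 1. Node filtering by an XPath query.
$intro = $crawler->filterXPath('//p[@class="intro"]')->text();

// 2. Node traversal by position: eq() is zero-based.
$second = $crawler->filterXPath('//p')->eq(1)->text();

// 3. Node name and value reading.
$heading = $crawler->filterXPath('//h1');
$name = $heading->nodeName();
$value = $heading->text();

// 4. Loading HTML content into an empty crawler.
$fragment = new Crawler();
$fragment->addHtmlContent('<div id="box"><span>Inserted text</span></div>');
$inserted = $fragment->filterXPath('//div[@id="box"]/span')->text();

echo $intro . "\n";
echo $second . "\n";
echo $name . ': ' . $value . "\n";
echo $inserted . "\n";
</code></pre>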
<h2>Steps to create a web scraping tool in PHP</h2>
<ol>
<li>Install and instantiate an HTTP client library.</li>
<li>Install and instantiate the crawler library to parse the response.</li>
<li>Prepare parameters and bundle them with the request to scrape the remote content.</li>
<li>Crawl response data and read the content.</li>
</ol>
<p>In this example, we use the Symfony HttpClient library to send the request.</p>
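<p>The four steps above might look like the sketch below. The helper name <em>scrapeHeading</em>, the <em>page</em> query parameter, the User-Agent string and the timeout are assumptions for illustration only. The <em>query</em>, <em>headers</em> and <em>timeout</em> keys are standard request options of the Symfony HttpClient.</p>
<pre class="prettyprint"><code class="language-php"><?php
require 'vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;
use Symfony\Component\HttpClient\HttpClient;
use Symfony\Contracts\HttpClient\Exception\ExceptionInterface;
use Symfony\Contracts\HttpClient\HttpClientInterface;

/**
 * scrapeHeading() is a hypothetical helper that follows the listed steps.
 */
function scrapeHeading(HttpClientInterface $client, string $url, array $query = []): string
{
    // Step 3: bundle the parameters with the request.
    $response = $client->request('GET', $url, [
        'query'   => $query,                                 // appended to the URL
        'headers' => ['User-Agent' => 'ExampleScraper/1.0'], // assumed value
        'timeout' => 10,                                     // seconds
    ]);

    // getContent() throws on transport errors and on 3xx-5xx status codes.
    $html = $response->getContent();

    // Step 4: crawl the response and read the content.
    $crawler = new Crawler($html);

    // An empty string is returned when the page has no h1 element.
    return $crawler->filterXPath('//h1')->text('');
}

try {
    // Steps 1 and 2: instantiate the HTTP client; the crawler is created per response.
    $client = HttpClient::create();
    echo scrapeHeading($client, 'https://example.com', ['page' => 1]) . "\n";
} catch (ExceptionInterface $e) {
    echo 'Request failed: ' . $e->getMessage() . "\n";
}
</code></pre>
<p>Passing the client in as an argument keeps the helper easy to try against a stub client instead of a live site.</p>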
<h2>Web scraping PHP example</h2>
<p>This example creates a client instance and sends requests to the target URL. Then, it receives the web content in a response object.</p>
<p>The DomCrawler component parses the response data to extract specific web content.</p>
<p>In this example, the crawler reads the site title by parsing the <em>h1</em> text. It also reads the text of every element matched by the <em>paragraph</em> tag.</p>
<p>The image below shows the example project structure with the PHP script to scrape the web content.</p>
<p><img decoding="async" loading="lazy" class="alignnone size-full wp-image-20923" src="https://phppot.com/wp-content/uploads/2023/06/web-scraping-php-project-structure.jpg" alt="web scraping php project structure" width="313" height="134" srcset="https://phppot.com/wp-content/uploads/2023/06/web-scraping-php-project-structure.jpg 313w, https://phppot.com/wp-content/uploads/2023/06/web-scraping-php-project-structure-300x128.jpg 300w" sizes="(max-width: 313px) 100vw, 313px"></p>
<h3>How to install the Symfony framework library</h3>
<p>We are using the popular Symfony components to scrape the web content. They can be installed via Composer.<br />The following commands install the dependencies.</p>
<pre class="prettyprint"><code>composer require symfony/http-client symfony/dom-crawler
composer require symfony/css-selector
</code></pre>
<p>After running these Composer commands, the vendor folder contains the required dependencies along with an autoload.php file. The script below loads the dependencies through this file.</p>
<p class="code-heading">index.php</p>
<pre class="prettyprint"><code class="language-php"><?php
require 'vendor/autoload.php';

use Symfony\Component\HttpClient\HttpClient;
use Symfony\Component\DomCrawler\Crawler;

$httpClient = HttpClient::create();
// Website to be scraped
$website = 'https://example.com';
// HTTP GET request and store the response
$httpResponse = $httpClient->request('GET', $website);
$websiteContent = $httpResponse->getContent();
$domCrawler = new Crawler($websiteContent);

// Filter the H1 tag text
$h1Text = $domCrawler->filter('h1')->text();
$paragraphText = $domCrawler->filter('p')->each(function (Crawler $node) {
    return $node->text();
});

// Scraped result
echo "H1: " . $h1Text . "\n";
echo "Paragraphs:\n";
foreach ($paragraphText as $paragraph) {
    echo $paragraph . "\n";
}
?>
</code></pre>
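<p>One detail worth knowing: calling <em>text()</em> on a filter that matched nothing throws an exception. The sketch below shows two possible ways to guard against that, using made-up markup without an <em>h1</em> element. It assumes the same Composer packages as above; the fallback string is an arbitrary choice.</p>
<pre class="prettyprint"><code class="language-php"><?php
require 'vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Hypothetical markup with no h1 element.
$html = '<html><body><p>Only a paragraph</p></body></html>';
$crawler = new Crawler($html);

// Option 1: check that the filter matched something before reading it.
$headings = $crawler->filter('h1');
$h1Text = $headings->count() > 0 ? $headings->text() : '(no heading found)';

// Option 2: pass a default value to text(), returned when nothing matched.
$h1OrDefault = $crawler->filter('h1')->text('(no heading found)');

// each() returns an empty array when nothing matches, so it needs no guard.
$paragraphs = $crawler->filter('p')->each(
    fn (Crawler $node): string => $node->text()
);

echo $h1Text . "\n";
echo $h1OrDefault . "\n";
echo implode("\n", $paragraphs) . "\n";
</code></pre>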
<h2>Ways to process the web-scraped data</h2>
<p>What will people do with the web-scraped data? The example code created for this article prints the content to the browser. In an actual application, this data can be used for many purposes.</p>
<ol>
<li>It provides data for finding popular trends in scraped news-site content.</li>
<li>It provides input for showing charts or statistics.</li>
<li>It helps to extract images and store them in the application’s backend.</li>
</ol>
<p>If you want to see <a href="https://phppot.com/php/extract-images-from-url-in-excel-with-php-using-phpspreadsheet/">how to extract images from the pages</a>, the linked article has a simple code example.</p>
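<p>To illustrate the first use in the list above, here is a rough sketch that counts the most frequent words in a set of scraped paragraphs. The sample texts, the helper name <em>topWords</em> and the minimum word length are assumptions for illustration. A real trend analysis would need stop-word filtering and far more data.</p>
<pre class="prettyprint"><code class="language-php"><?php
// Hypothetical sample of scraped paragraph texts.
$paragraphs = [
    'PHP makes web scraping simple.',
    'Scraping with PHP needs an HTTP client and a parser.',
    'A parser reads the HTML and PHP filters the nodes.',
];

/**
 * Return the most frequent words, keyed by word with the count as value.
 * topWords() is a hypothetical helper used only in this sketch.
 */
function topWords(array $paragraphs, int $limit = 3, int $minLength = 3): array
{
    $text = strtolower(implode(' ', $paragraphs));

    // Split into words and drop the very short ones.
    $words = array_filter(
        str_word_count($text, 1),
        fn (string $word): bool => strlen($word) >= $minLength
    );

    // Count occurrences and sort by count, highest first.
    $counts = array_count_values($words);
    arsort($counts);

    return array_slice($counts, 0, $limit, true);
}

$top = topWords($paragraphs, 1);
print_r($top);
</code></pre>
<p>With the sample texts above, the single top entry is the word "php" with a count of 3.</p>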
<h2>Caution</h2>
<p>Scraping a website against its usage policy can amount to theft. You should read a website’s policy before scraping it. If the terms are unclear, ask the website’s owner for explicit permission. Commercializing web-scraped content without permission can also be illegal in many cases. Get permission before doing any such activities.</p>
<p>Before crawling a site’s content, it is essential to read the website terms. This ensures that the publicly available content is actually permitted to be scraped.</p>
<p>Many sites provide API access or feeds for reading their content. Extracting data through properly provisioned API access is the fair way to do it. We have seen how to <a href="https://phppot.com/php/extracting-title-description-thumbnail-using-youtube-data-api/">extract the title, description and video thumbnail using YouTube API</a>.</p>
<p>For learning purposes, you may host a dummy website with lorem ipsum content and scrape it.<br /><a class="demo" href="https://phppot.com/demo/web-scraping-php">View Demo</a></p>
</div>

https://www.sickgaming.net/blog/2023/06/05/web-scraping-with-php-tutorial-to-scrape-web-pages/