[Tut] Parsing XML Using BeautifulSoup In Python


<div><h2>Introduction</h2>
<p>XML is a markup language used to store and transport data. It stands for <strong><span class="has-inline-color has-luminous-vivid-orange-color">eXtensible Markup Language.</span></strong> XML is structurally similar to HTML, but the two were designed to accomplish different goals. </p>
<ul>
<li>XML is designed to <strong>transport </strong>data while HTML is designed to <strong>display </strong>data. Many systems store data in incompatible formats, which makes exchanging data between them a time-consuming task for web developers because large amounts of data have to be converted, and some of it may be lost along the way. <strong>XML stores data in plain text format</strong>, thereby providing a <strong>software- and hardware-independent method of storing and sharing data</strong>.</li>
</ul>
<ul>
<li>Another major difference is that HTML tags are predefined whereas XML tags are not. </li>
</ul>
<p>❖ <strong>Example of XML:</strong></p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">&lt;?xml version="1.0" encoding="UTF-8"?&gt;
&lt;note&gt;
  &lt;to&gt;Harry Potter&lt;/to&gt;
  &lt;from&gt;Albus Dumbledore&lt;/from&gt;
  &lt;heading&gt;Reminder&lt;/heading&gt;
  &lt;body&gt;It does not do to dwell on dreams and forget to live!&lt;/body&gt;
&lt;/note&gt;</pre>
<p>As mentioned earlier, XML tags are not pre-defined so we need to find the tag that holds the information that we want to extract. Thus there are two major aspects governing the parsing of XML files:</p>
<ol>
<li>Finding the required Tags.</li>
<li>Extracting the data after identifying the Tags.</li>
</ol>
<h2>BeautifulSoup and LXML Installation</h2>
<p>When it comes to web scraping with Python, <a href="https://www.crummy.com/software/BeautifulSoup/">BeautifulSoup</a>&nbsp;is one of the most commonly used libraries. The recommended way of parsing XML files using BeautifulSoup is to use Python’s&nbsp;<strong>lxml</strong>&nbsp;parser.</p>
<p>You can install both libraries using the&nbsp;<strong>pip</strong>&nbsp;installation tool. Please have a look at our <strong><a href="https://blog.finxter.com/installing-beautiful-soup/">BLOG TUTORIAL</a></strong> to learn how to install them if you want to scrape data from an XML file using Beautiful soup.</p>
<p><strong># Note: </strong>Before we proceed with our discussion, please have a look at the following XML file, which we will use throughout this article. (Please create a file with the name <em>sample.xml</em> and copy-paste the code given below to practice further.)</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">&lt;?xml version="1.0" encoding="UTF-8" standalone="no"?&gt;
&lt;CATALOG&gt;
&lt;PLANT&gt;
&lt;COMMON&gt;Bloodroot&lt;/COMMON&gt;
&lt;BOTANICAL&gt;Sanguinaria canadensis&lt;/BOTANICAL&gt;
&lt;ZONE&gt;4&lt;/ZONE&gt;
&lt;LIGHT&gt;Mostly Shady&lt;/LIGHT&gt;
&lt;PRICE&gt;$2.44&lt;/PRICE&gt;
&lt;AVAILABILITY&gt;031599&lt;/AVAILABILITY&gt;
&lt;/PLANT&gt;
&lt;PLANT&gt;
&lt;COMMON&gt;Marsh Marigold&lt;/COMMON&gt;
&lt;BOTANICAL&gt;Caltha palustris&lt;/BOTANICAL&gt;
&lt;ZONE&gt;4&lt;/ZONE&gt;
&lt;LIGHT&gt;Mostly Sunny&lt;/LIGHT&gt;
&lt;PRICE&gt;$6.81&lt;/PRICE&gt;
&lt;AVAILABILITY&gt;051799&lt;/AVAILABILITY&gt;
&lt;/PLANT&gt;
&lt;PLANT&gt;
&lt;COMMON&gt;Cowslip&lt;/COMMON&gt;
&lt;BOTANICAL&gt;Caltha palustris&lt;/BOTANICAL&gt;
&lt;ZONE&gt;4&lt;/ZONE&gt;
&lt;LIGHT&gt;Mostly Shady&lt;/LIGHT&gt;
&lt;PRICE&gt;$9.90&lt;/PRICE&gt;
&lt;AVAILABILITY&gt;030699&lt;/AVAILABILITY&gt;
&lt;/PLANT&gt;
&lt;/CATALOG&gt;</pre>
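<p>The short snippets in the sections that follow assume that a <code>soup</code> object already exists. A minimal sketch of how it might be created is given here. The name <code>SAMPLE_XML</code> is hypothetical and holds a shortened inline copy of the catalog so that the sketch runs on its own; the article's final solution reads <em>sample.xml</em> from disk instead.</p>

```python
from bs4 import BeautifulSoup

# Hypothetical inline copy of part of the catalog (assumption: the real
# data lives in sample.xml; it is inlined here to keep the sketch runnable).
SAMPLE_XML = """<CATALOG>
<PLANT>
<COMMON>Bloodroot</COMMON>
<BOTANICAL>Sanguinaria canadensis</BOTANICAL>
<PRICE>$2.44</PRICE>
</PLANT>
<PLANT>
<COMMON>Marsh Marigold</COMMON>
<BOTANICAL>Caltha palustris</BOTANICAL>
<PRICE>$6.81</PRICE>
</PLANT>
</CATALOG>"""

# 'lxml' is the parser used throughout the article; it lowercases tag names,
# which is why the snippets write soup.common rather than soup.COMMON.
soup = BeautifulSoup(SAMPLE_XML, "lxml")

print(soup.common.text)
print(len(soup.find_all("plant")))
```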
<h2>Searching The Required Tags in The XML Document</h2>
<p>Since tags are not pre-defined in XML, we must first identify the tags that hold the data we need and then search for them using <code>BeautifulSoup</code>'s search methods.</p>
<p>Beautiful Soup has numerous methods for searching a parse tree. The two most popular and commonly used methods are:</p>
<ol>
<li>&nbsp;<code>find()</code></li>
<li>&nbsp;<code>find_all()</code></li>
</ol>
<p>We have an entire <a href="https://blog.finxter.com/searching-the-parse-tree-using-beautifulsoup/" target="_blank" rel="noreferrer noopener"><strong>blog tutorial</strong></a> on the two methods. Please have a look at the following tutorial to understand how these search methods work.</p>
<p>If you have read the above-mentioned article, then you can easily use the&nbsp;<code>find</code>&nbsp;and&nbsp;<code>find_all</code>&nbsp;methods to search for tags anywhere in the XML document. </p>
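<p>As a quick illustration of the two search methods, here is a small sketch on a shortened inline copy of the catalog. The names <code>SAMPLE_XML</code>, <code>first_common</code>, <code>all_common</code> and <code>missing</code> are chosen for this example only, and <code>height</code> is a made-up tag that does not occur in the data.</p>

```python
from bs4 import BeautifulSoup

# Shortened inline copy of the catalog (hypothetical name, for illustration).
SAMPLE_XML = """<CATALOG>
<PLANT><COMMON>Bloodroot</COMMON><PRICE>$2.44</PRICE></PLANT>
<PLANT><COMMON>Marsh Marigold</COMMON><PRICE>$6.81</PRICE></PLANT>
</CATALOG>"""

soup = BeautifulSoup(SAMPLE_XML, "lxml")

first_common = soup.find("common")     # first matching tag only
all_common = soup.find_all("common")   # every matching tag, in document order
missing = soup.find("height")          # find() returns None when nothing matches

print(first_common.text)
print([tag.text for tag in all_common])
print(missing)
```

<p>Because <code>find()</code> returns <code>None</code> for a missing tag, it is worth checking the result before accessing attributes such as <code>text</code> on it.</p>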
<h2>Relationship Between Tags</h2>
<p>It is extremely important to understand the relationship between tags, especially while scraping data from XML documents.</p>
<p>The three key relationships in the XML parse tree are:</p>
<ul>
<li><strong>Parent</strong>: The tag that directly contains another tag; it serves as the reference tag for navigating to its child tags.</li>
<li><strong>Children</strong>: The tags contained within the parent tag.</li>
<li><strong>Siblings</strong>: As the name suggests these are the tags that exist on the same level of the parse tree.</li>
</ul>
<p>Let us have a look at how we can navigate the XML parse tree using the above relationships. </p>
<h3>Finding Parents</h3>
<p>❖ The&nbsp;<strong>parent</strong>&nbsp;attribute allows us to find the parent/reference tag as shown in the example below.</p>
<p><strong>Example:</strong> In the following code we will find the parent of the first <code data-enlighter-language="generic" class="EnlighterJSRAW">common</code> tag.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">print(soup.common.parent.name)</pre>
<p><strong>Output:</strong></p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">plant</pre>
<p><strong>Note:</strong> The <code data-enlighter-language="generic" class="EnlighterJSRAW">name </code>attribute allows us to extract the name of the tag instead of extracting the entire content.</p>
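<p>Beyond the immediate parent, Beautiful Soup also exposes the whole chain of ancestors through <code>parents</code> and lets you jump to a named ancestor with <code>find_parent()</code>. The sketch below uses a one-plant inline document; <code>SAMPLE_XML</code> and <code>ancestor_names</code> are names made up for this example.</p>

```python
from bs4 import BeautifulSoup

# One-plant inline document (hypothetical name, for illustration).
SAMPLE_XML = "<CATALOG><PLANT><COMMON>Bloodroot</COMMON></PLANT></CATALOG>"

soup = BeautifulSoup(SAMPLE_XML, "lxml")
common = soup.common

# .parents walks upwards from the tag to the top of the parse tree.
ancestor_names = [tag.name for tag in common.parents]
print(ancestor_names)

# find_parent() returns the closest ancestor with the given name.
catalog = common.find_parent("catalog")
print(catalog.name)
```

<p>With the <code>lxml</code> parser the chain continues past <code>catalog</code> into the wrapper tags that the HTML parser adds, and ends at the <code>BeautifulSoup</code> object itself, whose name is <code>[document]</code>.</p>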
<h3>Finding Children</h3>
<p>❖ The&nbsp;<strong>children</strong> attribute allows us to iterate over the direct children of a tag, as shown in the example below.</p>
<p><strong>Example:</strong> In the following code we will find out the children of the <code data-enlighter-language="generic" class="EnlighterJSRAW">plant</code> tag.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">for child in soup.plant.children:
    # Whitespace between tags is a text node whose name is None; skip it
    if child.name is not None:
        print(child.name)</pre>
<p><strong>Output:</strong></p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">common
botanical
zone
light
price
availability</pre>
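<p>The <code>None</code> check above is needed because the newlines between tags are children too. One possible alternative, sketched here on a shortened inline document, is <code>find_all(True, recursive=False)</code>, which returns only the direct child tags. <code>SAMPLE_XML</code> and <code>child_tags</code> are names made up for this example.</p>

```python
from bs4 import BeautifulSoup

# Shortened inline copy of one plant (hypothetical name, for illustration).
SAMPLE_XML = """<CATALOG>
<PLANT>
<COMMON>Bloodroot</COMMON>
<BOTANICAL>Sanguinaria canadensis</BOTANICAL>
<ZONE>4</ZONE>
</PLANT>
</CATALOG>"""

soup = BeautifulSoup(SAMPLE_XML, "lxml")

# True matches every tag; recursive=False limits the search to direct children,
# so the whitespace strings between the tags never show up in the result.
child_tags = soup.plant.find_all(True, recursive=False)

print([tag.name for tag in child_tags])
```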
<h3>Finding Siblings</h3>
<p>A tag can have siblings before and after it. </p>
<ul>
<li>❖ The <strong>previous_siblings</strong> attribute returns the siblings before the referenced tag, and the <strong>next_siblings</strong> attribute returns the siblings after it.</li>
</ul>
<p><strong>Example: </strong>The following code finds the previous and next sibling tags of the <code>light</code> tag of the XML document.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">print("***Previous Siblings***")
for sibling in soup.light.previous_siblings:
    if sibling.name is not None:
        print(sibling.name)

print("\n***Next Siblings***")
for sibling in soup.light.next_siblings:
    if sibling.name is not None:
        print(sibling.name)</pre>
<p><strong>Output:</strong></p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">***Previous Siblings***
zone
botanical
common

***Next Siblings***
price
availability</pre>
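<p>The sibling attributes also yield the whitespace between tags, which is why the loops above skip entries without a name. The methods <code>find_previous_siblings()</code> and <code>find_next_siblings()</code> return tags only, as the following sketch on a one-plant inline document suggests. <code>SAMPLE_XML</code>, <code>previous_names</code> and <code>next_names</code> are names made up for this example.</p>

```python
from bs4 import BeautifulSoup

# One-plant inline copy of the catalog (hypothetical name, for illustration).
SAMPLE_XML = """<CATALOG>
<PLANT>
<COMMON>Bloodroot</COMMON>
<BOTANICAL>Sanguinaria canadensis</BOTANICAL>
<ZONE>4</ZONE>
<LIGHT>Mostly Shady</LIGHT>
<PRICE>$2.44</PRICE>
<AVAILABILITY>031599</AVAILABILITY>
</PLANT>
</CATALOG>"""

soup = BeautifulSoup(SAMPLE_XML, "lxml")
light = soup.light

# Called without arguments, both methods return sibling tags only.
# Previous siblings come back nearest-first, so the order is reversed
# relative to the document.
previous_names = [tag.name for tag in light.find_previous_siblings()]
next_names = [tag.name for tag in light.find_next_siblings()]

print(previous_names)
print(next_names)
```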
<h2>Extracting Data From Tags</h2>
<p>By now, we know how to navigate and find data within tags. Let us have a look at the attributes that help us to extract data from the tags.</p>
<h3>Text And String Attributes</h3>
<p>To access the text values within tags, you can use the&nbsp;<code data-enlighter-language="generic" class="EnlighterJSRAW">text</code>&nbsp;or&nbsp;<code data-enlighter-language="generic" class="EnlighterJSRAW">string</code>&nbsp;attribute. </p>
<p><strong>Example:</strong> Let us extract the text from every common tag using the <code data-enlighter-language="generic" class="EnlighterJSRAW">text</code> attribute and from every botanical tag using the <code data-enlighter-language="generic" class="EnlighterJSRAW">string</code> attribute.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># Collect every common and botanical tag in the document
plant_name = soup.find_all('common')
scientific_name = soup.find_all('botanical')
print('***PLANT NAME***')
for tag in plant_name:
    print(tag.text)
print('\n***BOTANICAL NAME***')
for tag in scientific_name:
    print(tag.string)</pre>
<p><strong>Output:</strong></p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">***PLANT NAME***
Bloodroot
Marsh Marigold
Cowslip

***BOTANICAL NAME***
Sanguinaria canadensis
Caltha palustris
Caltha palustris</pre>
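<p>For a tag that holds a single piece of text, <code>text</code> and <code>string</code> give the same value. They differ on a tag with several children, which the sketch below tries to show on a shortened inline document. <code>SAMPLE_XML</code>, <code>joined</code> and <code>pieces</code> are names made up for this example.</p>

```python
from bs4 import BeautifulSoup

# Shortened inline copy of one plant (hypothetical name, for illustration).
SAMPLE_XML = """<CATALOG>
<PLANT>
<COMMON>Bloodroot</COMMON>
<PRICE>$2.44</PRICE>
</PLANT>
</CATALOG>"""

soup = BeautifulSoup(SAMPLE_XML, "lxml")
plant = soup.plant

# A leaf tag: .string and .text agree.
print(soup.common.string, soup.common.text)

# A tag with several children: .string is None, while .text and get_text()
# concatenate the text of all descendants.
print(plant.string)

joined = plant.get_text(" ", strip=True)   # pieces stripped, joined by a space
pieces = list(plant.stripped_strings)      # the same pieces as a list

print(joined)
print(pieces)
```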
<h3>The Contents Attribute</h3>
<p>The <strong>contents</strong> attribute returns everything directly inside a tag, that is, its child tags along with their data and any text between them (here, the newline characters). The <code data-enlighter-language="generic" class="EnlighterJSRAW">contents</code> attribute returns a list, therefore we can access its elements using their index.</p>
<p><strong>Example:</strong> </p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">print(soup.plant.contents)
# Accessing content using index
print()
print(soup.plant.contents[1])</pre>
<p><strong>Output:</strong></p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">['\n', &lt;common&gt;Bloodroot&lt;/common&gt;, '\n', &lt;botanical&gt;Sanguinaria canadensis&lt;/botanical&gt;, '\n', &lt;zone&gt;4&lt;/zone&gt;, '\n', &lt;light&gt;Mostly Shady&lt;/light&gt;, '\n', &lt;price&gt;$2.44&lt;/price&gt;, '\n', &lt;availability&gt;031599&lt;/availability&gt;, '\n']

&lt;common&gt;Bloodroot&lt;/common&gt;</pre>
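<p>Since the list mixes tags with newline strings, indexing into it directly can be fragile. One way to work with it, sketched here on a shortened inline document, is to keep only the <code>Tag</code> entries and turn them into a dictionary. <code>SAMPLE_XML</code> and <code>record</code> are names made up for this example.</p>

```python
from bs4 import BeautifulSoup, Tag

# Shortened inline copy of one plant (hypothetical name, for illustration).
SAMPLE_XML = """<CATALOG>
<PLANT>
<COMMON>Bloodroot</COMMON>
<PRICE>$2.44</PRICE>
</PLANT>
</CATALOG>"""

soup = BeautifulSoup(SAMPLE_XML, "lxml")

# .contents holds Tag objects and the text between them; isinstance()
# keeps only the Tag objects.
record = {
    child.name: child.text
    for child in soup.plant.contents
    if isinstance(child, Tag)
}

print(record)
```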
<h3>Pretty Printing The Beautiful Soup Object</h3>
<p>If you observe closely, the tags look somewhat messy when we print them on the screen. While this does not directly affect productivity, a structured print style helps us read and parse the document more effectively.</p>
<p>The following code shows how the output looks when we print the BeautifulSoup object normally:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">print(soup)</pre>
<p><strong>Output:</strong></p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">&lt;?xml version="1.0" encoding="UTF-8" standalone="no"?&gt;&lt;html&gt;&lt;body&gt;&lt;catalog&gt;
&lt;plant&gt;
&lt;common&gt;Bloodroot&lt;/common&gt;
&lt;botanical&gt;Sanguinaria canadensis&lt;/botanical&gt;
&lt;zone&gt;4&lt;/zone&gt;
&lt;light&gt;Mostly Shady&lt;/light&gt;
&lt;price&gt;$2.44&lt;/price&gt;
&lt;availability&gt;031599&lt;/availability&gt;
&lt;/plant&gt;
&lt;plant&gt;
&lt;common&gt;Marsh Marigold&lt;/common&gt;
&lt;botanical&gt;Caltha palustris&lt;/botanical&gt;
&lt;zone&gt;4&lt;/zone&gt;
&lt;light&gt;Mostly Sunny&lt;/light&gt;
&lt;price&gt;$6.81&lt;/price&gt;
&lt;availability&gt;051799&lt;/availability&gt;
&lt;/plant&gt;
&lt;plant&gt;
&lt;common&gt;Cowslip&lt;/common&gt;
&lt;botanical&gt;Caltha palustris&lt;/botanical&gt;
&lt;zone&gt;4&lt;/zone&gt;
&lt;light&gt;Mostly Shady&lt;/light&gt;
&lt;price&gt;$9.90&lt;/price&gt;
&lt;availability&gt;030699&lt;/availability&gt;
&lt;/plant&gt;
&lt;/catalog&gt;
&lt;/body&gt;&lt;/html&gt;</pre>
<p><span class="has-inline-color has-luminous-vivid-orange-color">Now let us use the <span style="text-decoration: underline"><strong>prettify</strong></span> method to improve the appearance of our output.</span></p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">print(soup.prettify())</pre>
<p><strong>Output:</strong></p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">&lt;?xml version="1.0" encoding="UTF-8" standalone="no"?&gt;
&lt;html&gt;
 &lt;body&gt;
  &lt;catalog&gt;
   &lt;plant&gt;
    &lt;common&gt;
     Bloodroot
    &lt;/common&gt;
    &lt;botanical&gt;
     Sanguinaria canadensis
    &lt;/botanical&gt;
    &lt;zone&gt;
     4
    &lt;/zone&gt;
    &lt;light&gt;
     Mostly Shady
    &lt;/light&gt;
    &lt;price&gt;
     $2.44
    &lt;/price&gt;
    &lt;availability&gt;
     031599
    &lt;/availability&gt;
   &lt;/plant&gt;
   &lt;plant&gt;
    &lt;common&gt;
     Marsh Marigold
    &lt;/common&gt;
    &lt;botanical&gt;
     Caltha palustris
    &lt;/botanical&gt;
    &lt;zone&gt;
     4
    &lt;/zone&gt;
    &lt;light&gt;
     Mostly Sunny
    &lt;/light&gt;
    &lt;price&gt;
     $6.81
    &lt;/price&gt;
    &lt;availability&gt;
     051799
    &lt;/availability&gt;
   &lt;/plant&gt;
   &lt;plant&gt;
    &lt;common&gt;
     Cowslip
    &lt;/common&gt;
    &lt;botanical&gt;
     Caltha palustris
    &lt;/botanical&gt;
    &lt;zone&gt;
     4
    &lt;/zone&gt;
    &lt;light&gt;
     Mostly Shady
    &lt;/light&gt;
    &lt;price&gt;
     $9.90
    &lt;/price&gt;
    &lt;availability&gt;
     030699
    &lt;/availability&gt;
   &lt;/plant&gt;
  &lt;/catalog&gt;
 &lt;/body&gt;
&lt;/html&gt;</pre>
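<p>The &lt;html&gt; and &lt;body&gt; wrappers in the output above appear because <code>'lxml'</code> selects lxml's HTML parser, which also lowercases tag names. Beautiful Soup additionally accepts <code>'xml'</code>, which selects lxml's XML parser. The sketch below, on a one-plant inline document, suggests how the result differs; <code>SAMPLE_XML</code> and <code>xml_soup</code> are names made up for this example, and the rest of the article continues to use <code>'lxml'</code>.</p>

```python
from bs4 import BeautifulSoup

# One-plant inline document (hypothetical name, for illustration).
SAMPLE_XML = "<CATALOG><PLANT><COMMON>Bloodroot</COMMON></PLANT></CATALOG>"

# 'xml' asks Beautiful Soup for lxml's XML parser instead of its HTML parser.
xml_soup = BeautifulSoup(SAMPLE_XML, "xml")

# No html/body wrapper is added to the tree.
print(xml_soup.find("html"))

# Tag names keep their original case, so searches are case-sensitive.
print(xml_soup.find("COMMON").text)
print(xml_soup.find("common"))

print(xml_soup.prettify())
```

<p>If you switch parsers, remember to write the tag names in searches exactly as they appear in the file, for example <code>find_all('COMMON')</code>.</p>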
<h2>The Final Solution</h2>
<p>We are now well versed with all the concepts required to extract data from a given XML document. It is now time to have a look at the final code where we shall be extracting the <strong>Name, Botanical Name, and Price </strong>of each plant in our example XML document (sample.xml).</p>
<p>Please follow the comments along with the code given below to get an understanding of the logic used in the solution.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">from bs4 import BeautifulSoup

# Open and read the XML file; the with block closes it automatically
with open("sample.xml", "r") as file:
    contents = file.read()

# Create the BeautifulSoup object and use the lxml parser
soup = BeautifulSoup(contents, 'lxml')

# Extract the common, botanical and price tags
plant_name = soup.find_all('common')          # names of the plants
scientific_name = soup.find_all('botanical')  # scientific names of the plants
price = soup.find_all('price')                # prices of the plants

# enumerate() keeps count of each iteration; the counter n is used to
# access the matching index in the other two lists
for n, title in enumerate(plant_name):
    print("Plant Name:", title.text)
    print("Botanical Name:", scientific_name[n].text)
    print("Price:", price[n].text)
    print()</pre>
<p><strong>Output:</strong></p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">Plant Name: Bloodroot
Botanical Name: Sanguinaria canadensis
Price: $2.44

Plant Name: Marsh Marigold
Botanical Name: Caltha palustris
Price: $6.81

Plant Name: Cowslip
Botanical Name: Caltha palustris
Price: $9.90</pre>
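<p>The solution above relies on three separate lists staying aligned, which may break if one plant is missing a tag. A possible variation, sketched here on a shortened inline document rather than on <em>sample.xml</em>, walks over each <code>plant</code> tag and reads its fields from there. <code>SAMPLE_XML</code> and <code>records</code> are names made up for this example, and converting the price to a number is an extra step that the article itself does not perform.</p>

```python
from bs4 import BeautifulSoup

# Shortened inline copy of the catalog (hypothetical name, for illustration).
SAMPLE_XML = """<CATALOG>
<PLANT>
<COMMON>Bloodroot</COMMON>
<BOTANICAL>Sanguinaria canadensis</BOTANICAL>
<PRICE>$2.44</PRICE>
</PLANT>
<PLANT>
<COMMON>Marsh Marigold</COMMON>
<BOTANICAL>Caltha palustris</BOTANICAL>
<PRICE>$6.81</PRICE>
</PLANT>
</CATALOG>"""

soup = BeautifulSoup(SAMPLE_XML, "lxml")

records = []
for plant in soup.find_all("plant"):
    # Searching inside each plant keeps its fields together.
    records.append({
        "name": plant.find("common").text,
        "botanical": plant.find("botanical").text,
        # Strip the leading dollar sign before converting to a number.
        "price": float(plant.find("price").text.lstrip("$")),
    })

for record in records:
    print(record["name"], "|", record["botanical"], "|", record["price"])
```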
<h2>Conclusion</h2>
<p>XML documents are an important means of transporting data, and hopefully after reading this article you are well equipped to extract the data you want from them. You might also want to have a look at <strong><a href="https://www.youtube.com/playlist?list=PLbo6ydLr984ZbU9VrB1ouj9CCJ80x4Xmo" target="_blank" rel="noreferrer noopener">this video series</a></strong> where you can learn how to scrape webpages. </p>
<p>The post <a href="https://blog.finxter.com/parsing-xml-using-beautifulsoup-in-python/" target="_blank" rel="noopener noreferrer">Parsing XML Using BeautifulSoup In Python</a> first appeared on <a href="https://blog.finxter.com/" target="_blank" rel="noopener noreferrer">Finxter</a>.</p>
</div>


https://www.sickgaming.net/blog/2020/12/...in-python/