BeautifulSoup: How to get nested inner divs

This Python web scraping tutorial shows you how to get inner and nested divs using BeautifulSoup. You can use BeautifulSoup to scrape the text inside nested div tags and then process that text further. Without wasting your time, let me quickly show you the problem and how we can solve it. Be cautious when scraping data from a website. Scraping may be legal, but some websites have terms of service (TOS) that restrict it, so even a technically legal method might be a violation. Therefore, know your target source and, where possible, seek permission before you scrape any data.

Look at the structure of the HTML below and see how the divs are nested inside one another. The text we want to scrape, 'Python web scraping tutorials', sits inside a deeply nested div, so I have to go through all the parent divs before I can reach the div that contains it. Note that the text is not inside a paragraph <p> tag; if it were, a few lines of code would have been enough. I tried using BeautifulSoup to extract and print the text, but it failed because it was my first time scraping web content. In the end, my code didn't throw any error, but I was unable to print the text.

<div class="_333v _45kb".....
    <div class="_2a_i" ...............
        <div class="_2a_j".......</div>
        <div class="_2b04"...........
            <div class="_14v5"........
                <div class="_2b06".....
                    <div class="_2b05".....</div>
                    <div id=............>**Python web scraping tutorials**</div>
                </div>
            </div>
        </div>
    </div>
</div>

And here is my attempt to scrape the text in the nested div above.

import urllib.request
from bs4 import BeautifulSoup

url = "https://tutorialscamp.com"
thepage = urllib.request.urlopen(url)
bSoup = BeautifulSoup(thepage, "html.parser")
nestedDiv_list = bSoup.find_all('div', class_="_333v _45kb")
for nested_div in nestedDiv_list:
    # find() belongs to each matched tag, not to the result list
    print(nested_div.find('div').text)

The above approach isn't bad, but the structure of the HTML is the problem. Have a look at the related solution below:

from bs4 import BeautifulSoup

html = '''
<div class="foo">
    <div class="bar">
        <div class="spam">Title goes here</div>
        <div id="eggs">** Python web scraping tutorials **</div>
    </div>
</div>
'''

bSoup = BeautifulSoup(html, 'html.parser')

# grab the parent div with class foo
div = bSoup.find('div', {'class': 'foo'})
# This will print all the text
print(div.text)


print('\n----\n')
# if the other divs don't have an id, print only the one that does
for div in bSoup.find_all('div'):
    if div.has_attr('id'):
        print(div.text)

Output (blank lines trimmed):
Title goes here
** Python web scraping tutorials **

----

** Python web scraping tutorials **
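
When the target div carries an id, as `eggs` does in the sample above, it can also be fetched directly instead of looping over every div. The following is a minimal sketch that assumes the same sample HTML; the names `soup`, `by_id`, `by_css` and `inner_text` are illustrative, and the CSS selector is just one way to spell out the nesting.

```python
from bs4 import BeautifulSoup

html = '''
<div class="foo">
    <div class="bar">
        <div class="spam">Title goes here</div>
        <div id="eggs">** Python web scraping tutorials **</div>
    </div>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')

# Look the div up directly by its id attribute
by_id = soup.find('div', id='eggs')

# Or describe the nesting with a CSS selector:
# div.foo > div.bar > any div that has an id attribute
by_css = soup.select_one('div.foo > div.bar > div[id]')

# strip=True drops leading and trailing whitespace from the text
inner_text = by_id.get_text(strip=True)
print(inner_text)
```

Both lookups land on the same div here; the selector version is handy when the id is not known in advance but the nesting is.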

Here is another BeautifulSoup example for finding nested div tags. Suppose we have the following code…
<html>
<body>
<div class=" text1" id="wrapper">
      <div class=" text2" id="bar">
            <div class=" text3">
            </div>
            <div class=" text4">
                 <div class="text5"> python beautifoulsoup tutorials </div>
            </div>
      </div>
</div>
</body>
</html>

How can you extract the text "python beautifoulsoup tutorials" from the nested <div class="text5"> above using BeautifulSoup? You might think XPath could solve this easily, but BeautifulSoup does not support it, so what should you do?

The solution to finding the nested div above is to follow the BeautifulSoup hierarchy. Since the <div> tags above have class and id attributes, we could use the .contents property or the findChildren() method, but I will use a for loop to grab each matching div and access its attributes.

from bs4 import BeautifulSoup

html = '''
<html>
<body>
<div class=" text1" id="wrapper">
      <div class=" text2" id="bar">
            <div class=" text3">
            </div>
            <div class=" text4">
                 <div class=" text5"> python beautifoulsoup tutorials
                 </div>
            </div>
      </div>
</div>
</body>
</html>'''

content = BeautifulSoup(html, 'html.parser')

# class values are matched by name, so the leading space in the attribute is ignored
for div in content.find_all('div', attrs={'class': 'text5'}):
    # strip=True removes the surrounding whitespace and the trailing newline
    print(div.get_text(strip=True))

Output:
python beautifoulsoup tutorials
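
The `.contents` and `findChildren()` route mentioned earlier could look roughly like the sketch below. It assumes the same sample HTML and walks down one level at a time; the variable names are illustrative, and `find_all(..., recursive=False)` is used here as the direct-children lookup.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = '''
<div class=" text1" id="wrapper">
      <div class=" text2" id="bar">
            <div class=" text3">
            </div>
            <div class=" text4">
                 <div class=" text5"> python beautifoulsoup tutorials
                 </div>
            </div>
      </div>
</div>'''

soup = BeautifulSoup(html, 'html.parser')
wrapper = soup.find('div', id='wrapper')

# .contents holds every direct child, including whitespace strings,
# so keep only the Tag objects
direct_children = [child for child in wrapper.contents if isinstance(child, Tag)]
bar = direct_children[0]

# recursive=False limits the search to direct child tags
inner_divs = bar.find_all('div', recursive=False)
text4 = inner_divs[1]

# the first div inside text4 is the one holding the text
target = text4.find('div')
result = target.get_text(strip=True)
print(result)
```

Walking by position like this is brittle if the page layout changes, so the class-based loop is usually the safer choice when a class or id is available.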

An alternative way to get the inner div is XPath. BeautifulSoup itself does not support XPath, but there is a workaround: install lxml, a library that BeautifulSoup can also use as a parser, and query the HTML with it directly. In my case I have installed lxml, so I could solve the above issue easily by doing…

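from lxml import etree

tree = etree.fromstring(html)  # or etree.parse(...) to read from a file
# text() returns the raw text nodes, including surrounding whitespace
texts = tree.xpath('.//div[@class=" text5"]/text()')
print([t.strip() for t in texts])

Output:
['python beautifoulsoup tutorials']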
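
Note that `etree.fromstring` expects well-formed XML, which real web pages often are not. A hedged variant, assuming lxml is installed, is to parse with `lxml.html`, which tolerates messy HTML. The sample markup and the `contains()` test below are illustrative choices, not part of the original solution.

```python
from lxml import html as lxml_html

# Sample page with an unclosed <p> and a bare <br>, which strict XML parsing rejects
page = '''
<html>
<body>
<div class=" text1" id="wrapper">
    <div class=" text4">
        <div class=" text5"> python beautifoulsoup tutorials </div>
    </div>
    <p>an unclosed paragraph
    <br>
</div>
</body>
</html>'''

tree = lxml_html.fromstring(page)

# contains() avoids depending on the exact spacing inside the class attribute
matches = tree.xpath('//div[contains(@class, "text5")]/text()')

# drop whitespace-only nodes and trim the rest
texts = [m.strip() for m in matches if m.strip()]
print(texts)
```

`contains()` is a substring test, so it would also match a class such as `text50`; an exact comparison is stricter when the attribute value is known.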

That's how to get an inner div using BeautifulSoup in Python.

About Justice Ankomah

computer science certified technical instructor. Who is interested in sharing (Expert Advice Only)