Web scraping is a widely used technique today, and by the end of this article you will be able to build a simple web scraper in Python using the Beautiful Soup library. To scrape a website, you access the HTML of its pages and extract useful data from it; this technique of extracting data from a website or webpage is called web scraping. For example, here is the HTML of a very simple webpage:
<!DOCTYPE html>
<html>
  <head>
    <title>Example Page</title>
  </head>
  <body>
    <h1>Welcome to Example Page</h1>
    <p>This is a simple example webpage.</p>
  </body>
</html>
Before importing any library in your code, it must be installed on your system. We use the pip package manager to install each library.
- The Requests library provides a simple interface to HTTP, often in just one line of code. It supports HTTP operations such as GET and POST. To use Requests in your code, you first have to install it with pip in your terminal.
- Beautiful Soup provides simple methods for navigating, searching, and modifying a parse tree of HTML or XML files. It transforms a complex HTML document into a tree of Python objects and automatically converts the document to Unicode, so you don’t have to think about encodings. To use Beautiful Soup in your code, you first have to install it with pip in your terminal.
- The lxml library is used for parsing and manipulating XML and HTML documents in Python; Beautiful Soup can use it as its underlying parser.
pip install requests
pip install beautifulsoup4
pip install lxml
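Once the installation finishes, a quick way to confirm that all three libraries are importable is a short check like this (a minimal sketch):
import requests
import bs4
import lxml

print("requests, beautifulsoup4 and lxml imported successfully")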
Using Requests to Fetch HTML Content:
- To get the source code of a webpage, you send an HTTP request containing the URL to the webpage's server, and the server responds by returning the HTML content of the page. For this task we will use Requests, a third-party HTTP library for Python.
import requests

url = "https://example.com"
response = requests.get(url)

# A status code of 200 means the request succeeded
if response.status_code == 200:
    print("Your request was successful")
    print("Response Content:")
    print(response.content)
else:
    print("Your request was denied")
After getting the HTML content from the server, we are left with the task of parsing the data. Since HTML is nested, we cannot extract the data directly through simple string processing in Python; we need a proper parser, and that is exactly what Beautiful Soup provides.
Introduction to Beautiful Soup
Beautiful Soup is a Python library that simplifies the process of parsing HTML and XML files. It allows us to navigate, search, and modify the parse tree effortlessly. Before using Beautiful Soup, ensure it’s installed using pip.
from bs4 import BeautifulSoup
soup = BeautifulSoup(response.content, 'lxml')
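Once the soup object is built, you can navigate the parse tree directly through tag names. Here is a minimal sketch using the example HTML from the beginning of this article:
from bs4 import BeautifulSoup

html = """
<html>
  <head><title>Example Page</title></head>
  <body>
    <h1>Welcome to Example Page</h1>
    <p>This is a simple example webpage.</p>
  </body>
</html>
"""
soup = BeautifulSoup(html, 'lxml')
print(soup.title.text)   # Example Page
print(soup.h1.text)      # Welcome to Example Page
print(soup.p.text)       # This is a simple example webpage.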
Scraping Data from a Webpage
Now that you have learned the basics of Requests and Beautiful Soup, let’s build a scraper with them. Using the response object we can access attributes such as content, text, and headers.
In our example we only want the content, so we use response.content, which returns just the content of the webpage. Running response.content through BeautifulSoup with the lxml parser gives us a BeautifulSoup object:
STEP 1: First, import all the required modules, such as requests and bs4.
import requests
import bs4
STEP 2: Send a request to the website's server to get the HTML content.
response = requests.get("https://webscraper.io/test-sites/e-commerce/allinone/product/542")
If the status code of the response is 200, then congratulations: we successfully got the HTML from the server.
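If you prefer an exception over a manual status check, Requests also provides raise_for_status(), which raises an HTTPError for any 4xx or 5xx response:
response.raise_for_status()   # does nothing on success, raises requests.exceptions.HTTPError otherwise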
STEP 3: Create a soup object and pass the website content into it.
soup = bs4.BeautifulSoup(response.content, 'lxml')
Now we have to scrape the required fields from the page: the product name, price, and product description.
STEP 4: Find the required class, id, or tag name in the HTML to locate the element you want. Here is the HTML containing the product name:
<div class="caption">
<h4 class="pull-right price">$1769.00</h4>
<h4>Asus ROG Strix GL702ZC-GC154T</h4>
<p class="description">Asus ROG Strix GL702ZC-GC154T, 17.3" FHD, Ryzen 7 1700, 16GB, 256GB + 1TB HDD,Radeon RX 580 4GB, Windows 10 Home</p>
</div>
As you can see, the product name is in the second h4 tag, which is inside the div tag with the class caption. So first we target the div tag.
div_tag = soup.find('div',{'class':'caption'})
Now, to get the product name we use the find_all function, because the product name is inside an h4 tag and there are two h4 tags in the HTML.
h4_tags = div_tag.find_all("h4")
The product name is in the second h4, so we target it by index:
product_name = h4_tags[1].text
We use .text to extract only the text from the targeted HTML element. Similarly, the price is in the first h4, so to get the price we write:
price = h4_tags[0].text
Finally, to get the description we target the p tag with the class description.
description = div_tag.find("p",{'class':'description'}).text
Here is the full code:
import requests
import bs4

# Fetch the product page
response = requests.get("https://webscraper.io/test-sites/e-commerce/allinone/product/542")

# Parse the HTML with the lxml parser
soup = bs4.BeautifulSoup(response.content, 'lxml')

# The caption div holds the name, price, and description
div_tag = soup.find('div', {'class': 'caption'})
h4_tags = div_tag.find_all("h4")

product_name = h4_tags[1].text   # second h4: product name
price = h4_tags[0].text          # first h4: price
description = div_tag.find("p", {'class': 'description'}).text

print("Product name:", product_name)
print("Price:", price)
print("Description:", description)
Result:
- Product name: Asus ROG Strix GL702ZC-GC154T
- Price: $1769.00
- Description: Asus ROG Strix GL702ZC-GC154T, 17.3″ FHD, Ryzen 7 1700, 16GB, 256GB + 1TB HDD, Radeon RX 580 4GB, Windows 10 Home
find() and find_all():
The difference between the find and find_all methods is that find returns only the first occurrence of the tag in the HTML.
For example, if you study the code above, you know there are two h4 tags; had we used find instead of find_all, the code would have found only the first h4 tag, which contains the price, not the product name.
find_all, in contrast, scans the entire HTML document and returns every instance of the tag. That is why we used find_all to get the price and product name: both fields are in the same kind of tag. The syntax for find and find_all looks like this:
variable = soup.find("tag_name", {"attribute_name": "attribute_value"})
For find_all:
variable = soup.find_all("tag_name", {"attribute_name": "attribute_value"})
<div>
  <p class="name">King</p>
  <p class="name">Prince</p>
  <p id="vinayak2">Queen</p>
</div>
To get King and Prince, we use find_all like this:
container = soup.find_all("p", {"class": "name"})
king = container[0].text
prince = container[1].text
To get Queen, we use find like this:
queen = soup.find("p", {"id": "vinayak2"}).text
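One behavior worth remembering: find() returns None when no matching tag exists, while find_all() returns an empty list, so it pays to guard against missing elements. A minimal sketch (the id used here is hypothetical and not present in the HTML above):
tag = soup.find("p", {"id": "does-not-exist"})   # hypothetical id, no match in the HTML above
if tag is not None:
    print(tag.text)
else:
    print("Tag not found")   # calling .text on None would raise AttributeError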
LIMITATIONS OF REQUESTS:
- Requests can easily be blocked by the server (see the mitigation sketch after this list).
- It can easily fail due to network errors.
- If the server is overloaded, requests can be slow.
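Some of these issues can be softened on the client side. The sketch below, using a placeholder URL, sets a browser-like User-Agent header, adds a timeout, and catches network errors; these are common mitigations, not guarantees against blocking:
import requests

url = "https://example.com"   # placeholder URL
headers = {"User-Agent": "Mozilla/5.0"}   # some servers block the default requests User-Agent

try:
    response = requests.get(url, headers=headers, timeout=10)   # fail fast instead of hanging
    response.raise_for_status()                                 # surface 4xx/5xx as exceptions
    print("Fetched", len(response.content), "bytes")
except requests.exceptions.RequestException as err:
    print("Request failed:", err)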
CONCLUSION :
Beautiful Soup is easy to learn and beginner-friendly. In this article we covered the basics of web scraping with Beautiful Soup and built a sample project to better understand the concepts. In short, the Requests library lets you fetch static HTML content from the Internet, and the Beautiful Soup package lets you parse that HTML with a parser of your choice. There are many more advanced, interesting concepts to explore in this topic.