
Automated Web Scraping with Python AutoScraper


Web scraping is a fundamental technique for extracting useful information, such as contact details, email addresses, images, and URLs, from websites.

In web scraping, we communicate with the server to get the data in the form of HTML, and then, using Python or dedicated web scraping libraries, we mine the data we want from that HTML. A related activity is web crawling, which is used when we need a large amount of structured, labelled data, for example at industrial scale.

Basically, we first get the path or source of the target HTML that we want to scrape; the process of getting that source is called web crawling. For example, if you want to scrape the MRP of products available on Amazon.com, you first have to find the web page where the MRP appears. Getting the source of the web page is web crawling; extracting the value of the MRP from it is web scraping.
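
To make the distinction concrete, here is a minimal sketch using the requests and BeautifulSoup libraries (separate from AutoScraper); the URL and the CSS selector are hypothetical placeholders:

import requests
from bs4 import BeautifulSoup

# Crawling: fetch the source (HTML) of the target page
response = requests.get('https://example.com/product-page')  # placeholder URL
html = response.text

# Scraping: extract the value we want from that source
soup = BeautifulSoup(html, 'html.parser')
price = soup.select_one('.price')  # hypothetical selector for the MRP element
if price is not None:
    print(price.get_text(strip=True))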

Newer forms of web scraping involve observing the data feeds of web servers, for example JSON files, which are used as transporters between clients and web servers. Many websites, such as Google, Facebook, and Amazon, provide an API that allows you to access their data in a structured, labelled format.
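
Fetching structured data from such an API usually amounts to a single HTTP request that returns JSON. A minimal sketch, assuming a hypothetical endpoint:

import requests

# Hypothetical API endpoint that returns structured (JSON) data
response = requests.get('https://api.example.com/v1/products')
data = response.json()  # parse the JSON body into Python objects
print(data)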

Now, let us have a brief discussion of the AutoScraper library.

AutoScraper

When we talk about scraping, there is a lot of content on a website that we may want to scrape, but writing scripts by hand takes a lot of time, and it is a lengthy process. To overcome this problem, a group of Python developers created a library that scrapes data from a website in an easy way. AutoScraper is a Python web scraping library for scraping data from a website in a simple, easy, and fast way. It has a user-friendly interface, and you can easily interact with it.

AutoScraper accepts the URL or HTML of any website and scrapes the data by learning some rules. In other words, it matches the data on the relevant web page and scrapes data that follows similar rules.

Let’s install the AutoScraper library. There are several ways to install and use this library, but for this tutorial, we will install it from the Python Package Index (PyPI) repository using the following pip command:

pip install autoscraper

Implementation

First, we will import AutoScraper from the autoscraper library:

from autoscraper import AutoScraper

Let us start by defining the URL from which we will fetch the data and a sample of the data we want to fetch. Here, I will fetch the titles of different news stories from the Google News technology category.

url='https://news.google.com/stories/CAAqNggKIjBDQklTSGpvSmMzUnZjbmt0TXpZd1NoRUtEd2pWeHRtckNSRzFhUVdwLXhEbV95Z0FQAQ?hl=en-IN&gl=IN&ceid=IN%3Aen'

Now we define the list containing a sample of the data we want, that is, a sample title from the Google News technology section.

wanted_list = ["10 AI Tools to Enhance Your Excel Skills in 2023"]

The next step is creating an AutoScraper object so that we can use it to build the scraper model and perform the web scraping operation.

scraper = AutoScraper()

This is the final step, where we build the model on our sample and display the result of the web scraping.

result = scraper.build(url, wanted_list)
print(result)

Here we can see that it returns the titles of the stories from Google News. Similarly, we can also retrieve the URLs of the news stories by passing a sample URL from the category we defined above:

wanted_list=["./articles/CBMiUmh0dHBzOi8vd3d3LmFuYWx5dGljc2luc2lnaHQubmV0LzEwLWFpLXRvb2xzLXRvLWVuaGFuY2UteW91ci1leGNlbC1za2lsbHMtaW4tMjAyMy_SAQA?hl=en-IN&gl=IN&ceid=IN%3Aen"]

result = scraper.build(url, wanted_list)
print(result)

AutoScraper also allows you to reuse the model you built to fetch similar data from a different URL. We need to use the ‘get_result_similar’ function for this. In this step, we will retrieve the URLs of different news stories:

similar_result = scraper.get_result_similar("https://news.google.com/stories/CAAqNggKIjBDQklTSGpvSmMzUnZjbmt0TXpZd1NoRUtEd2pxamVheENSRmpvQXpLYUVCTXRpZ0FQAQ?hl=en-IN&gl=IN&ceid=IN%3Aen")
print(similar_result)

AutoScraper also allows us to save the model we created and load it whenever required.

If you want to save the model, call the save function of AutoScraper and pass the name of the file you want to save it to:

scraper.save('data')   # saving the model

Here, 'data' is the name of the file in which we want to save the model. Running this line saves the model to that file.

Now you have successfully saved the model. If you want to use this model again, load it by passing the file name as an argument:

scraper.load('data') # loading the model
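
Putting save and load together, a fresh AutoScraper instance can be restored from the saved file and reused. This is a minimal sketch, assuming the 'data' file created above exists and url is the variable defined earlier:

from autoscraper import AutoScraper

# Restore the previously saved model into a fresh scraper instance
new_scraper = AutoScraper()
new_scraper.load('data')

# Reuse the learned rules on a page of the same kind (url as defined above)
results = new_scraper.get_result_similar(url)
print(results)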

Beyond all these functionalities, AutoScraper also allows you to define proxy IP addresses to be used when fetching data. We just need to define the proxies and pass them as an argument to the build function, as in the example below:

proxy = {
    "http": 'http://127.0.0.1:8003',
    "https": 'https://127.0.0.1:8071',
}
final = scraper.build(url, wanted_list, request_args=dict(proxies=proxy))
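
Since request_args is forwarded to the underlying HTTP request, other request options can be passed the same way; for instance, a custom User-Agent header (the header value below is only an example):

headers = {
    'User-Agent': 'Mozilla/5.0',  # example value; replace with a real user agent string
}
result = scraper.build(url, wanted_list, request_args=dict(headers=headers))
print(result)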

The complete code is given below:

# Import AutoScraper from autoscraper library
from autoscraper import AutoScraper

# Define the URL from which we will fetch the data
url = 'https://news.google.com/stories/CAAqNggKIjBDQklTSGpvSmMzUnZjbmt0TXpZd1NoRUtEd2pWeHRtckNSRzFhUVdwLXhEbV95Z0FQAQ?hl=en-IN&gl=IN&ceid=IN%3Aen'

# Define the sample data that we want to fetch
wanted_list = ["10 AI Tools to Enhance Your Excel Skills in 2023"]

# Create an AutoScraper object
scraper = AutoScraper()

# Build the scraper model and perform web scraping
result = scraper.build(url, wanted_list)

# Display the result
print(result)

# Now, let's retrieve URLs of different news
# Define the sample URL for news articles
wanted_list_urls = ["./articles/CBMiUmh0dHBzOi8vd3d3LmFuYWx5dGljc2luc2lnaHQubmV0LzEwLWFpLXRvb2xzLXRvLWVuaGFuY2UteW91ci1leGNlbC1za2lsbHMtaW4tMjAyMy_SAQA?hl=en-IN&gl=IN&ceid=IN%3Aen"]

# Build the scraper model to fetch URLs
result_urls = scraper.build(url, wanted_list_urls)

# Display the URLs
print(result_urls)

# Retrieve similar data from a different URL
similar_result = scraper.get_result_similar("https://news.google.com/stories/CAAqNggKIjBDQklTSGpvSmMzUnZjbmt0TXpZd1NoRUtEd2pxamVheENSRmpvQXpLYUVCTXRpZ0FQAQ?hl=en-IN&gl=IN&ceid=IN%3Aen")
print(similar_result)

# Save the model
scraper.save('data')   # saving the model

# Load the model
scraper.load('data')   # loading the model

# Define proxy IP Addresses
proxy = {
    "http": 'http://127.0.0.1:8003',
    "https":'https://127.0.0.1:8071',
}

# Build the scraper model with proxy
final_result = scraper.build(url, wanted_list, request_args=dict(proxies=proxy))

# Display the final result
print(final_result)

Conclusion

In this blog, we saw how to use AutoScraper for web scraping by building a simple, easy-to-use model. We explored the different formats in which data can be retrieved using AutoScraper, and we saw that we can save and load the model for later use, which saves time and effort.
