Web scraping is a technique for extracting usable data from websites for a specific purpose. Websites contain large amounts of data, and web scraping is one way to get it.
In this post, we will try scraping URLs from Google and learn how to refine searches using Google search operators.
Clone this repository by executing the following command:
$ git clone https://github.com/jagadyudha/google-scraper
Install the libraries that are required by executing the following command:
# Python 3
$ pip3 install -r requirements.txt
# or Python 2
$ pip install -r requirements.txt
Before we jump into how to run the project, we need to know how this project works.
1. Identify the Google search URL
2. Collect data from the URLs
3. Locate the div tag by its class
4. Extract the a tag inside the div that we found in step 3
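Steps 3 and 4 can be sketched with only the Python standard library. This is a minimal illustration, not the repository's actual code: the class name `result`, the helper names, and the sample HTML are placeholders I made up, since the real class name depends on Google's markup at the time you scrape.

```python
from html.parser import HTMLParser


class ResultLinkParser(HTMLParser):
    """Collect href values of <a> tags nested inside <div> tags with a given class."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.div_stack = []  # one boolean per open <div>: does it have the target class?
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            classes = (attrs.get("class") or "").split()
            self.div_stack.append(self.target_class in classes)
        elif tag == "a" and any(self.div_stack) and attrs.get("href"):
            # We are somewhere inside a target div, so keep this link.
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "div" and self.div_stack:
            self.div_stack.pop()


def extract_links(html, target_class):
    parser = ResultLinkParser(target_class)
    parser.feed(html)
    return parser.links


# Hypothetical sample page; "result" is a placeholder class name.
sample_html = """
<div class="result"><a href="https://example.com/a">A</a></div>
<div class="other"><a href="https://example.com/skip">Skip</a></div>
<div class="result wide"><span><a href="https://example.com/b">B</a></span></div>
"""

print(extract_links(sample_html, "result"))
# ['https://example.com/a', 'https://example.com/b']
```

Only links inside a div carrying the target class are kept, so navigation links and other page chrome are skipped.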
Running the code is quite simple. Just execute the following command:
# Python 3
$ python3 main.py
# or Python 2
$ python main.py
The command prompts for input pages and input data. Input pages is the number of Google result pages that you want to scrape, while input data is the keyword you want to search for.
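To show how these two inputs could turn into search URLs, here is a rough sketch. It assumes Google's `q` (query) and `start` (result offset) URL parameters with ten results per page. The function name `build_search_urls` is my own and may not match what main.py does.

```python
from urllib.parse import urlencode

RESULTS_PER_PAGE = 10  # assumption: Google's default page size


def build_search_urls(keyword, pages):
    """Return one Google search URL per requested result page."""
    urls = []
    for page in range(pages):
        params = {"q": keyword, "start": page * RESULTS_PER_PAGE}
        urls.append("https://www.google.com/search?" + urlencode(params))
    return urls


for url in build_search_urls("learning Python", 2):
    print(url)
# https://www.google.com/search?q=learning+Python&start=0
# https://www.google.com/search?q=learning+Python&start=10
```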
Google search operators are often used to narrow a search to specific information, giving accurate results even when that information is hard to track down.
We can also use Google search operators in this project. For example, to find PDF files that match a keyword, I will search with the following query:
filetype:pdf intext:learning Python
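If you build such queries often, a small helper can assemble them and URL-encode the result. This is only a sketch: `build_query` is a hypothetical helper of mine, not part of the project.

```python
from urllib.parse import quote_plus


def build_query(keyword, **operators):
    """Prefix a keyword with operator:value pairs, e.g. filetype:pdf."""
    parts = [f"{name}:{value}" for name, value in operators.items()]
    parts.append(keyword)
    return " ".join(parts)


query = build_query("Python", filetype="pdf", intext="learning")
print(query)
# filetype:pdf intext:learning Python

# Encoded form, ready to be placed in the q parameter of a search URL.
print(quote_plus(query))
# filetype%3Apdf+intext%3Alearning+Python
```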
Unfortunately, if you use Google search operators too often, Google may start showing captchas.
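One thing you could try is pausing between requests. The sketch below is an assumption on my part and not something the project does: the helper `fetch_with_delay` and the 2 to 5 second range are made up, and slowing down might make captchas less frequent but does not solve them.

```python
import random
import time


def fetch_with_delay(urls, fetch, min_delay=2.0, max_delay=5.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing a random interval between calls."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # Random pause so requests are not sent back to back.
            sleep(random.uniform(min_delay, max_delay))
        results.append(fetch(url))
    return results


# Demo with a fake fetch function and a recorded (not real) sleep,
# so it runs instantly and without network access.
pauses = []
pages = fetch_with_delay(["u1", "u2", "u3"], fetch=str.upper, sleep=pauses.append)
print(pages)        # ['U1', 'U2', 'U3']
print(len(pauses))  # 2
```

In real use you would pass your own download function as `fetch` and leave `sleep` at its default.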
The query shown earlier is only one example of using Google search operators. For more details, you can check out the following cheat sheet: Google Search Operators Cheat Sheet (notion.site)
With the help of this tool, we can scrape URLs automatically instead of copying them one by one. However, captchas remain an unsolved problem when the tool is used too often.