BeautifulSoup Tutorial
The incredible amount of data on the Internet is a rich resource for any field of research or personal interest. To harvest that data effectively, you'll need to become skilled at web scraping: the technique of extracting data from websites. The Python libraries requests and Beautiful Soup are powerful tools for the job. BeautifulSoup can handle both HTML and XML, and it provides simple methods for searching, navigating, and modifying the parse tree.
Web scraping means navigating the structured elements of a website, going deeper layer by layer, and retrieving and formatting the incoming data in the style you want. In this article we apply Python and BeautifulSoup to a simple example, with step-by-step tutorials.
The code here is not complicated, so it should be easy to follow even if you are still a student. To support your learning, we provide a download link to a zip file containing all the source code for future use.
BONUS
Source Code Download
We have released it under the MIT license, so feel free to use it in your own project or your school homework.
Download Guideline
- Prepare a Python environment on Windows by clicking Python Downloads, or search for a Python setup package for Linux.
- The Windows package also contains pip, which you can use to install more Python libraries in the future.
THE BASICS
Python Web Scraping
Web pages are written in HTML, which makes them structured text. People can therefore extract organized information from websites by scraping. Python is well suited to this task, so this article introduces its BeautifulSoup library.
What is BeautifulSoup?
BeautifulSoup version 4 is a popular Python library for web scraping. Support for its predecessor, BeautifulSoup version 3, was scheduled to be dropped on or after December 31, 2020, so you should learn the newer version. Below is the definition from the BeautifulSoup Documentation.
BeautifulSoup Installation
If you have already downloaded and set up Python and its tool pip on Windows or Linux, you can install the BeautifulSoup 4 package, bs4, with a pip install command line, and then check the result with pip show.
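For example (assuming pip is already on your PATH):

```shell
pip install bs4
pip show beautifulsoup4
```

Installing the bs4 package pulls in beautifulsoup4 as its dependency, which is why pip show queries the latter name.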
Scraping An Example Website
We should choose an example website to start from. We will focus only on laptop-related content, so the laptop section of an Indian shopping website, Snapdeal Laptops, suits our purpose.
The site has two layers: a top layer for the product list and a bottom layer for the detailed specifications. To be safe, we suggest saving the pages first and then retrieving data from the saved copies. The page contents, along with the images, are therefore downloaded. Below is the running process.
Next, you will learn how to scrape those web pages with BeautifulSoup. The result will be saved in JSON format in a file named result.json.
STEP 1
Finding How Web Pages Link
The previous section showed what type of data we will meet, so now we inspect the HTML structured text to find entry points where BeautifulSoup scraping can start.
Finding The Relationship
The website you want to crawl probably has more layers than this example. As there is no general rule for all web pages, it is better to find an entry point for BeautifulSoup by working out how the pages link to each other.
Below is the JSON-styled data for a specific laptop. The fields under 'highlight' are found in Layer 2; the other fields come from Layer 1. For example, 'title' gives the product name with a brief specification, and 'img_url' can be used to download the product picture.
What to Inspect in Layer 1
From the view of the HTML document, let us continue inspecting Layer 1 in the file laptop.html. Going through the HTML structured text, BeautifulSoup can locate the key feature class='product-tuple-image' and scrape items like <a pogId=, <a href=, <source srcset=, and <img title= into pogid, href, img_url, and title, respectively. Here href points to the Layer 2 URL for that product.
Similarly, using another feature, class='product-tuple-description', BeautifulSoup can continue scraping <span product-desc-price=, <span product-price=, and <div product-discount= to retrieve the JSON fields price, price_discount, and discount. With that, all items in Layer 1 have been discovered.
What to Inspect in Layer 2
Further, BeautifulSoup traverses the HTML file of Layer 2, such as 638317853217.html. As mentioned previously, this holds the detailed laptop specification. The task in Layer 2 is done by scraping items inside the feature class='highlightsTileContent '.
STEP 2
Scraping By BeautifulSoup
Before scraping, we need to introduce a popular Python library, requests, to get content from websites. BeautifulSoup will then work iteratively through the layers to convert all the information into JSON-style data.
PyPI requests
requests is an elegant and simple HTTP library for Python. In the following paragraphs, we use it to read text from web pages and save it as HTML files. You can install it by issuing pip install requests at the command line.
requests.get() not only fetches web pages but also pulls down binary data for pictures, as in download_image().
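As a sketch of both uses (download_image follows the article's naming; save_page and the URLs are assumptions for illustration):

```python
import requests

def save_page(url, filename):
    """Fetch a web page and save its HTML text to a local file."""
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    with open(filename, "w", encoding="utf-8") as f:
        f.write(resp.text)  # decoded text content of the page

def download_image(url, filename):
    """Fetch a picture; note resp.content (raw bytes), not resp.text."""
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    with open(filename, "wb") as f:
        f.write(resp.content)  # binary image data

# Example (hypothetical URLs):
# save_page("https://www.snapdeal.com/products/computers-laptops", "laptop.html")
# download_image("https://example.com/laptop.jpg", "laptop.jpg")
```

The only real difference is resp.text versus resp.content: the former is decoded text for HTML, the latter raw bytes suitable for images.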
Python Scraping in Layer 1
Let us start with getLayer_1(), in which each web page is saved before BeautifulSoup parsing proceeds.
Saving contents as a backup is helpful for debugging. If an exception happens during BeautifulSoup scraping, it is hard to find the exact content you need again among massive web pages. In addition, scraping data too frequently may get you blocked by some websites.
For BeautifulSoup, the very first expression, soup = BeautifulSoup(page, 'html.parser'), needs to select a parser. The parser could be html.parser or html5lib; the differences between them are described in Differences between parsers.
Based on the understanding of the text structure gained in STEP 1, we find prices in the class product-tuple-description by using attrs={}, with which BeautifulSoup anchors many locations in this example. Pictures are handled the same way.
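A minimal sketch of the attrs={} usage on a hypothetical fragment imitating the listing markup (the class names come from the article; the prices are invented):

```python
from bs4 import BeautifulSoup

# Hypothetical fragment mimicking one product tuple on the listing page.
html = """
<div class="product-tuple-description">
  <span class="product-desc-price">Rs. 45,000</span>
  <span class="product-price">Rs. 39,990</span>
  <div class="product-discount">11% Off</div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
desc = soup.find("div", attrs={"class": "product-tuple-description"})
price = desc.find("span", attrs={"class": "product-desc-price"}).get_text(strip=True)
print(price)  # Rs. 45,000
```

The attrs dictionary tells find() to match only tags whose attributes have the given values, which is how BeautifulSoup anchors each location in the page.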
Layer 1 yields data about titles, images, and prices. Importantly, note that BeautifulSoup uses find_all() to gather all matching pieces and find() to gather just one, much like queries in a database.
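The difference in one small, self-contained example:

```python
from bs4 import BeautifulSoup

html = "<ul><li>HP</li><li>Dell</li><li>Lenovo</li></ul>"
soup = BeautifulSoup(html, "html.parser")

all_items = soup.find_all("li")   # every match, returned as a list
first_item = soup.find("li")      # only the first match (or None)

print(len(all_items))             # 3
print(first_item.get_text())      # HP
```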
For each page in Layer 2, 'highlight': getLayer_2() is called iteratively to retrieve more detail. Finally, json.dump() saves the JSON-formatted data to a file. Next, let us go through the steps for BeautifulSoup scraping in Layer 2.
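A sketch of the final saving step using the standard json module (the field values here are invented; the file name result.json comes from the article):

```python
import json

# Hypothetical result structure assembled by the scraper.
results = [{
    "pogid": "638317853217",
    "title": "Example Laptop",
    "price": "45000",
    "highlight": ["8 GB RAM", "512 GB SSD"],
}]

# Write the list of products to result.json in readable, indented form.
with open("result.json", "w", encoding="utf-8") as f:
    json.dump(results, f, indent=2, ensure_ascii=False)
```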
Python Scraping in Layer 2
As in Layer 1, getLayer_2() finds more product details by locating the class highlightsTileContent. Then, in a loop, it stores the scraped data in an array. You can check what this array holds, in JSON style, as discussed in STEP 1.
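A sketch of that loop on a hypothetical detail-page fragment (the class name comes from the article; the specification values are invented):

```python
from bs4 import BeautifulSoup

# Hypothetical Layer 2 fragment holding the product highlights.
html = """
<div class="highlightsTileContent">
  <li>8 GB RAM</li>
  <li>512 GB SSD</li>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
tile = soup.find("div", attrs={"class": "highlightsTileContent"})

# Collect the text of each highlight item into an array.
highlights = [li.get_text(strip=True) for li in tile.find_all("li")]
print(highlights)  # ['8 GB RAM', '512 GB SSD']
```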
STEP 3
Handling Exception
BeautifulSoup scraping won't always run smoothly, because the input HTML elements may be partially missing. In that case, you have to take action to avoid interruption and keep the entire procedure going.
Why Will Exceptions Probably Occur?
In laptop.html, if every product had identical fields, there would be no exceptions. However, when a product has fewer fields than the other, normal products, an exception can occur, as the following Python script expresses.
You have to deal with every kind of exception to keep the scraping process from being interrupted. Imagine a program suddenly exiting with an error after crawling several thousand pages; the loss would not be small.
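The original script was not preserved; here is a minimal sketch of the situation, on an assumed product tile whose <img> lacks the usual src attribute:

```python
from bs4 import BeautifulSoup

# Hypothetical product tile: the <img> tag exists but has no src attribute.
html = """
<div class="product-tuple-image">
  <source srcset="https://example.com/laptop.jpg">
  <img title="Example Laptop">
</div>
"""
soup = BeautifulSoup(html, "html.parser")
tile = soup.find("div", attrs={"class": "product-tuple-image"})

try:
    img_url = tile.find("img")["src"]         # the usual field
except KeyError:
    # Fall back to an alternative element carrying the same information.
    img_url = tile.find("source")["srcset"]

print(img_url)  # https://example.com/laptop.jpg
```

Indexing a tag by a missing attribute name raises KeyError, so the try/except keeps the scraper running instead of crashing mid-crawl.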
What Exceptions to Handle
Only two conditions are listed here; you may encounter more as you scrape more and more web pages.
KeyError means the HTML element or attribute to search for does not exist. You may know of an alternative HTML element to use instead; in this example, you can replace find('img')['src'] with find('source')['srcset']. If there is no alternative, just skip the item.
FINAL
Conclusion
This article does not repeat the detailed techniques covered in the BeautifulSoup Documentation, but instead offers opinions and directions for scraping in Python. If any of the techniques are unfamiliar, please refer to online resources for practice.
Thank you for reading; we have suggested more helpful articles below. If you want to share anything, please feel free to comment. Good luck and happy coding!
Suggested Reading
- 4 Practices for Python File Upload to PHP Server
- Compare Python Dict Index and Other 5 Ops with List
Nowadays, there are APIs for nearly everything. If you wanted to build an app that told people the current weather in their area, you could find a weather API and use the data from the API to give users the latest forecast.
But what do you do when the website you want to use doesn't have an API? That's where Web Scraping comes in. Web pages are built using HTML to create structured documents, and these documents can be parsed using programming languages to gather the data you want.
Web Scraping with Python and Beautiful Soup
There are two basic steps to web scraping for getting the data you want:
- Load the web page (i.e. the HTML) into a string
- Parse the HTML string to find the bits you care about
Python provides two very powerful tools for doing both of these tasks. We can use the Requests library to retrieve the web page containing our data, and we can use the awesome Beautiful Soup package for parsing and extracting the data. If you'd like to know a bit more about the Requests library and how it works, check out this post for a bit more depth.
Using Beautiful Soup, we can easily select any links, tables, lists, or whatever else we require from a page with the library's powerful built-in methods. So let's get started!
HTML basics
Before we get into the web scraping, it's important to understand how HTML is structured so we can appreciate how to extract data from it. The following is a simple example of an HTML page:
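The original example did not survive publication; a minimal page along the lines the surrounding text describes (all names and content assumed) might be:

```html
<!DOCTYPE html>
<html>
  <body>
    <h1 id="main-title">My Heading</h1>
    <p class="intro">A paragraph of text.</p>
    <a href="https://www.example.com">A link</a>
  </body>
</html>
```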
HTML will always start with a type declaration of <!DOCTYPE html> and will be contained between <html> / </html> tags.
The <body> tags wrap around the visible part of a website, which is made up of various combinations of header tags (<h1> to <h6>), paragraphs (<p>), links (<a>), and several others not shown in this example, such as <input> and <table> tags.
HTML tags can also be given attributes, like the id and class attributes in the example above. These attributes can help with styling by identifying elements.
If these tags are new to you, it might be worth taking some time quickly getting up to speed with HTML. Codecademy and W3Schools both offer excellent introductions into HTML (and CSS) that will be more than enough for this tutorial.
Analyzing the HTML
Have you ever followed one of those links on social media to a 'Top 10 films of 2017' article, only to find it's one of those sites where each listing is on a different page? Part of you wants to find out what they thought was number one; the other part wants to give up waiting for all the ads to load. Well, web scraping can help you with that.
We are going to use this article from CinemaBlend to find out the 10 Greatest Movies of All-Time.
Take a look at the link. It should bring you to a page where you can see that Taxi Driver was ranked 10th in the list. We want to grab this, so the first thing we need to do is look at the page structure. Right click on the page in the link above, and select the Page Source option.
This will bring up the HTML document for the entire page, side-menus and all. Don't be alarmed; I don't expect you to read all that. Instead, press Ctrl + F and search for 10. Taxi Driver.
You should find something like this:
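The original snippet was lost; based on the selector used later in this tutorial, the markup is along these lines:

```html
<div class="liststyle">10. Taxi Driver</div>
```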
This part of the HTML represents the rank and title found underneath the movie image as shown below:
The easiest way to be sure is that this search returns only one result, which means we must be looking at the right part of the page.
So the 10th entry in our list is Taxi Driver, but how do we get the other 9 without having to click through every page?
Open the page source again, but this time search for Continued On Next Page. You should find something like this:
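Again, the original snippet was lost; judging from the selectors used later in this tutorial, it is along these lines (the href value is elided here, since the real URL is not preserved):

```html
<div class="nextpage">
  <a class="next-story" href="...">Continued On Next Page</a>
</div>
```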
This section is rendered as the link we need to click on to see the next entry:
Again, we can tell this is the same element because it is the only result in the whole page source that should match.
Believe it or not, with just those two HTML segments we can create a Python script that will get us all the results from the article.
Scraping the HTML
Before we can write our scraping script, we need to install the necessary packages. Type the following into the console:
pip install requests
pip install beautifulsoup4
Now we can write our web scraper. Create a script called scraper.py and open it in your development environment. We'll start by importing Requests and BeautifulSoup:
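The import block itself is short:

```python
import requests
from bs4 import BeautifulSoup
```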
Let's use the Requests library to grab the page holding the 10. Taxi Driver entry and store it in a variable called page. We'll also create a variable called results, which will store the film rankings in a list for us.
Do you remember when we looked at the HTML for the web article using page source? Essentially, we now have that page's HTML stored in our variable, and we're going to use BeautifulSoup to parse through the response to find the data we care about.
The next step is to feed page into BeautifulSoup. Then we can use the BeautifulSoup built-in methods to extract the film and its ranking from the snippet we examined earlier.
To do this, we can use CSS selector syntax. In CSS, selectors are used to select elements for styling. Notice how the div element has a class of liststyle? We can use this to select the div tag, since a div tag with this exact class only appears once on the page.
Note: Usually, class attributes aren't unique and are used to style multiple elements in a similar way. If you want to guarantee uniqueness, try to use an id attribute.
Here, we have used the BeautifulSoup select method to grab the div element we want. The select method returns a list containing any matching elements. In our case, element returns: [<div>10. Taxi Driver</div>].
Since our list only contains one item, we get the element with index 0. We then use the BeautifulSoup get_text method to return just the text inside the div element, which will give us '10. Taxi Driver'. Finally, let's append the result to our results list.
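Putting these steps together as a sketch, on a stand-in for the page's HTML (in the real script, the HTML would come from the requests response):

```python
from bs4 import BeautifulSoup

# Stand-in for the fetched page; the div matches the snippet from page source.
html = '<div class="liststyle">10. Taxi Driver</div>'
soup = BeautifulSoup(html, "html.parser")
results = []

element = soup.select("div.liststyle")  # select returns a list of matches
text = element[0].get_text()            # just the text: '10. Taxi Driver'
results.append(text)
print(results)
```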
Crawling the HTML
Another key part of web scraping is crawling. In fact, the terms web scraper and web crawler are used almost interchangeably; however, they are subtly different. A web crawler gets web pages, whereas a web scraper extracts data from web pages. The two are often used together, since usually when you crawl some web pages you also want to get some data from them, hence the confusion.
In order for us to determine the other 9 rankings in the article, we will need to crawl through the web pages to find them. To do that, we are going to use the snippet we discovered before:
An <a> tag represents a link, and the destination for that link when clicked is held by the href attribute. We want to pass the value held by the href attribute to the Requests library, just like we did for the first page. We can do that with another CSS selector.
Here we select any a tag that has the class next-story and is within a parent div element that itself has a class of nextpage. This returns just a single result, since a link matching these criteria occurs just once on the page: our Continued On Next Page link.
We can then get the value of the href attribute by calling the get method on the a tag and storing it in a variable called url.
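As a sketch of this step (the href value here is a placeholder, since the real next-page URL is not preserved):

```python
from bs4 import BeautifulSoup

# Stand-in for the 'Continued On Next Page' markup from page source.
html = ('<div class="nextpage">'
        '<a class="next-story" href="https://example.com/page-9">'
        'Continued On Next Page</a></div>')
soup = BeautifulSoup(html, "html.parser")

# Select the link and pull out its href attribute with get().
url = soup.select("div.nextpage a.next-story")[0].get("href")
print(url)  # https://example.com/page-9
```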
The next step would be to pass the url variable into the Requests library's get method, as we did at the beginning, but to do that we are going to need to refactor our code slightly to avoid repeating ourselves.
Refactoring the Scraper
Right now, our scraper successfully grabs our chosen page and extracts the movie title and ranking, but to do the same for the remaining pages we need to repeat the process without just duplicating our code. To do this we are going to use recursion.
Recursion involves coding a function that calls itself one or more times, something that Python is able to take advantage of very easily. Here is our scraper refactored as a recursive function:
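The original listing did not survive publication; reconstructed from the description in this section (selectors and behavior as discussed, results defaulting via None to sidestep Python's mutable-default pitfall), it could look something like this:

```python
import requests
from bs4 import BeautifulSoup

def scraper(url, results=None):
    # First call: no results list supplied yet, so start a fresh one.
    if results is None:
        results = []

    # Grab the page and parse it.
    page = requests.get(url)
    soup = BeautifulSoup(page.text, "html.parser")

    # Extract the '<rank>. <title>' text, e.g. '10. Taxi Driver'.
    element = soup.select("div.liststyle")
    results.append(element[0].get_text())

    # If a 'Continued On Next Page' link exists, recurse into it.
    next_link = soup.select("div.nextpage a.next-story")
    if next_link:
        return scraper(next_link[0].get("href"), results)

    # Last page reached: no next link, so return everything gathered.
    return results
```

Kicking it off takes only the starting URL; the results list is built up internally as the function recurses through the pages.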
Let's go through each section of the code and see what is happening.
The scraper function takes two arguments. The first, url, is the URL of the page you want to extract information from, which gets passed into requests.
The second argument, results, is optional but key to the operation of our recursive function. When the function is first called, it should be called as follows:
scraper('https://www.cinemablend.com/new/10-Greatest-Movies-All-Time-According-Actors-73867.html')
The results parameter is not provided, so it is set to an empty list. The function then grabs the page and extracts the information from it, appending it to the results list.
The next vital part of our recursive function lies here:
If we find a link on the page matching the CSS selector div.nextpage a.next-story, then we call the scraper function again, this time with the href of the link to the next page AND the results list we have generated so far. This means that when scraper runs for any subsequent call, the results parameter is not empty, and we continue to append new results to it.
When the scraper reaches the last page of the article (i.e. the movie ranked number one), there will be no link matching the CSS selector, and our recursive function will return the final results list.
Note: Take care when using recursion. If you don't create a condition that will eventually end the function calls, a recursive function will run until Python stops it with a runtime error (a RecursionError), which is raised to prevent an issue known as stack overflow.
A complete working script could look something like this:
Scraper limitations
So now you've seen how easily you can extract information from a web page, why wouldn't you use it all the time? Well, sadly, there are downsides.
For starters, web scraping can be slower at obtaining information than an equivalent API, and some sites don't like you scraping information from their pages, so you need to check their policies to see if it's okay.
But perhaps the most significant drawback is changes to the HTML page structure. One of the advantages of APIs is that they are designed with developers in mind, so they are less likely to change how they work. Web pages, on the other hand, can change quite dramatically. If the web page author decides to change the class names of their elements, such as the nextpage and next-story classes we used in our CSS selectors, our scraper will break. This can be frustrating if a website updates regularly.
That being said, websites have improved their structures a lot over the years thanks to the popularity of easy-to-use frameworks, which means pages are less likely to change dramatically over time.
Summary
Hopefully you've seen enough that you can now use web scraping confidently in your own projects. The advantage of web scraping is that what you see is what you get: if you know the information you are after, you don't need to dig around trying to figure out an API to get it. Just code a simple scraper and it's yours!