Building a custom web scraper involves several steps, and it’s important to be aware of the legal and ethical considerations when scraping data from websites. Always review a website’s terms of service and robots.txt file to ensure compliance. Additionally, be respectful of the website’s resources and bandwidth.
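
If you want to automate the robots.txt check, Python's standard library ships urllib.robotparser for exactly this purpose. Here is a minimal sketch; the user agent name "MyScraperBot" is just an illustrative placeholder:

from urllib.robotparser import RobotFileParser

# Point the parser at the site's robots.txt and download it
rp = RobotFileParser()
rp.set_url("http://quotes.toscrape.com/robots.txt")
rp.read()

# can_fetch() reports whether the given user agent may crawl the URL
if rp.can_fetch("MyScraperBot", "http://quotes.toscrape.com/"):
    print("Allowed to scrape this page")
else:
    print("Disallowed by robots.txt")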

For this example, I’ll use the requests library for making HTTP requests and BeautifulSoup for parsing HTML. You can install them using:

pip install requests beautifulsoup4

Let’s create a simple web scraper to extract quotes from http://quotes.toscrape.com:

import requests
from bs4 import BeautifulSoup

def scrape_quotes():
    # URL of the website to scrape
    url = "http://quotes.toscrape.com"

    # Send an HTTP GET request to the URL (the timeout keeps the call from hanging indefinitely)
    response = requests.get(url, timeout=10)

    # Check if the request was successful (status code 200)
    if response.status_code == 200:
        # Parse the HTML content of the page using BeautifulSoup
        soup = BeautifulSoup(response.text, 'html.parser')

        # Extract quotes and authors
        quotes = soup.find_all('span', class_='text')
        authors = soup.find_all('small', class_='author')

        # Print the quotes and authors
        for quote, author in zip(quotes, authors):
            print(f"{quote.text.strip()} - {author.text.strip()}")
    else:
        print(f"Error: Unable to fetch the page (Status code: {response.status_code})")

if __name__ == "__main__":
    scrape_quotes()

This script sends an HTTP GET request to http://quotes.toscrape.com, parses the HTML content using BeautifulSoup, and extracts quotes along with their authors.

When you run the script, each scraped quote prints to your command prompt on its own line, followed by its author.

Note that websites change their structure over time, so this script may need adjustments if the target site's markup is modified. The sketch below shows one way to make the extraction a bit more defensive against such changes.
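
Instead of zipping two parallel lists, this variation walks each quote's container div and skips incomplete entries, so a layout change is more likely to produce missing output than a crash or mismatched pairs. The 'quote', 'text', and 'author' class names match the site's current markup but should be treated as assumptions:

import requests
from bs4 import BeautifulSoup

response = requests.get("http://quotes.toscrape.com", timeout=10)
response.raise_for_status()  # raise an exception on HTTP errors
soup = BeautifulSoup(response.text, 'html.parser')

# Walk each quote container and skip entries missing either piece
for quote_div in soup.find_all('div', class_='quote'):
    text = quote_div.find('span', class_='text')
    author = quote_div.find('small', class_='author')
    if text and author:
        print(f"{text.get_text(strip=True)} - {author.get_text(strip=True)}")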

Remember to respect the website's terms of service and robots.txt file, and avoid overloading the server with too many requests; if you scrape more than one page, pause between requests, as in the sketch below.
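
A minimal sketch of polite multi-page scraping, assuming the /page/N/ URL pattern that quotes.toscrape.com uses (verify the URL pattern for other sites):

import time
import requests

# Fetch the first three pages as an example, pausing between requests
for page in range(1, 4):
    url = f"http://quotes.toscrape.com/page/{page}/"
    response = requests.get(url, timeout=10)
    print(f"Fetched {url}: status {response.status_code}")
    time.sleep(1)  # one-second delay so we don't hammer the server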

