Web Scraping with Proxies: A Comprehensive Guide

Introduction

Web scraping typically refers to an automated process of collecting data from websites. On a high level, you’re essentially making a bot that visits a website, detects the data you’re interested in, and then stores it into some appropriate data structure, so you can easily analyze and access it later.

However, if you’re concerned about your anonymity on the Internet, you should probably take a little more care when scraping the web. Since your IP address is public, a website owner could track it down and, potentially, block it.

So, if you want to stay as anonymous as possible and avoid being blocked from visiting a certain website, you should consider using proxies when scraping the web.

Proxies, also referred to as proxy servers, are specialized servers that let you avoid accessing the websites you’re scraping directly. Instead, you route your scraping requests through the proxy server.

That way, your IP address gets “hidden” behind the IP address of the proxy server you’re using. This helps you stay as anonymous as possible and avoid being blocked, so you can keep scraping as long as you want.

In this comprehensive guide, you’ll get a grasp of the basics of web scraping and proxies, and you’ll see a working example of scraping a website using proxies in Node.js. Afterward, we’ll discuss why you might consider using an existing scraping solution (like ScraperAPI) over writing your own web scraper. Finally, we’ll give you some tips on how to overcome some of the most common issues you might face when scraping the web.

Web Scraping

Web scraping is the process of extracting data from websites. It automates what would otherwise be a manual process of gathering information, making it faster and less error-prone.

That way you can collect a large amount of data quickly and efficiently. Later, you can analyze, store, and use it.

The primary reason you might scrape a website is to obtain data that is either unavailable through an existing API or too vast to collect manually.

It’s particularly useful when you need to extract information from multiple pages or when the data is spread across different websites.

There are many real-world applications that utilize the power of web scraping in their business model. The majority of apps that help you track product prices and discounts, find the cheapest flights and hotels, or even find a job, use web scraping to gather the data that provides you that value.

Web Proxies

Imagine you’re sending a request to a website. Usually, your request is sent from your machine (with your IP address) to the server that hosts the website you’re trying to access. That means the server “knows” your IP address and can block you based on your geo-location, the amount of traffic you’re sending to the website, and other factors.

But when you send a request through a proxy, it routes the request through another server, hiding your original IP address behind the IP address of the proxy server. This not only helps in maintaining anonymity but also plays a crucial role in avoiding IP blocking, which is a common issue in web scraping.

By rotating through different IP addresses, proxies allow you to distribute your requests, making them appear as if they’re coming from various users. This reduces the likelihood of getting blocked and increases the chances of successfully scraping the desired data.

Types of Proxies

Typically, there are four main types of proxy servers – datacenter, residential, rotating, and mobile.

Each of them has its pros and cons, and based on that, you’ll use them for different purposes and at different costs.

Datacenter proxies are the most common and cost-effective proxies, provided by third-party data centers. They offer high speed and reliability but are more easily detectable and can be blocked by websites more frequently.

Residential proxies route your requests through real residential IP addresses. Since they appear as ordinary user connections, they are less likely to be blocked but are typically more expensive.

Rotating proxies automatically change the IP address after each request or after a set period. This is particularly useful for large-scale scraping projects, as it significantly reduces the chances of being detected and blocked.

Mobile proxies use IP addresses associated with mobile devices. They are highly effective for scraping mobile-optimized websites or apps and are less likely to be blocked, but they typically come at a premium cost.

Example Web Scraping Project

Let’s walk through a practical example of a web scraping project, and demonstrate how to set up a basic scraper, integrate proxies, and use a scraping service like ScraperAPI.

Setting up

Before you dive into the actual scraping process, it’s essential to set up your development environment.

For this example, we’ll be using Node.js since it’s well-suited for web scraping due to its asynchronous capabilities. We’ll use Axios for making HTTP requests, and Cheerio to parse and query the HTML contained in those responses.

First, ensure you have Node.js installed on your system. If you don’t have it, download and install it from nodejs.org.

Then, create a new directory for your project and initialize it:

$ mkdir my-web-scraping-project
$ cd my-web-scraping-project
$ npm init -y

Finally, install Axios and Cheerio since they are necessary for you to implement your web scraping logic:

$ npm install axios cheerio

Simple Web Scraping Script

Now that your environment is set up, let’s create a simple web scraping script. We’ll scrape a sample website to gather famous quotes and their authors.

So, create a JavaScript file named sample-scraper.js and write all of the code in it. Import the packages you’ll need to send HTTP requests and manipulate the HTML:

const axios = require('axios');
const cheerio = require('cheerio');

Next, create a wrapper function that will contain all the logic you need to scrape data from a web page. It accepts the URL of the website you want to scrape as an argument and logs all the quotes found on the page:

// Function to scrape data from a webpage
async function scrapeWebsite(url) {
    try {
        // Send a GET request to the webpage
        const response = await axios.get(url);
        
        // Load the HTML into cheerio
        const $ = cheerio.load(response.data);
        
        // Extract all elements with the class 'quote'
        const quotes = [];
        $('div.quote').each((index, element) => {
            // Extracting text from span with class 'text'
            const quoteText = $(element).find('span.text').text().trim(); 
            // Assuming there's a small tag for the author
            const author = $(element).find('small.author').text().trim(); 
            quotes.push({ quote: quoteText, author: author });
        });

        // Output the quotes
        console.log("Quotes found on the webpage:");
        quotes.forEach((quote, index) => {
            console.log(`${index + 1}: "${quote.quote}" - ${quote.author}`);
        });

    } catch (error) {
        console.error(`An error occurred: ${error.message}`);
    }
}

Note: Each quote is stored in a separate div element with the class quote. Every quote has its text and author – the text is stored in a span element with the class text, and the author in a small element with the class author.

Finally, specify the URL of the website you want to scrape – in this case, https://quotes.toscrape.com, and call the scrapeWebsite() function:

// URL of the website you want to scrape
const url = 'https://quotes.toscrape.com';

// Call the function to scrape the website
scrapeWebsite(url);

All that’s left for you to do is to run the script from the terminal:

$ node sample-scraper.js

Integrating Proxies

To use a proxy with axios, you specify the proxy settings in the request configuration. The axios.get() method can include the proxy configuration, allowing the request to route through the specified proxy server. The proxy object contains the host, port, and optional authentication details for the proxy:

// Send a GET request to the webpage with proxy configuration
const response = await axios.get(url, {
    proxy: {
        host: proxy.host,
        port: proxy.port,
        auth: {
            username: proxy.username, // Optional: Include if your proxy requires authentication
            password: proxy.password, // Optional: Include if your proxy requires authentication
        },
    },
});

Note: You need to replace these placeholders with your actual proxy details.
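
For the snippet above to work, the proxy object has to be defined somewhere in your script. Here’s a minimal sketch, assuming a hypothetical proxy endpoint and credentials – replace every value with the actual details from your proxy provider:

// Hypothetical proxy details – replace with the values from your proxy provider
const proxy = {
    host: '203.0.113.10',          // proxy server IP or hostname (placeholder)
    port: 8080,                    // proxy server port (placeholder)
    username: 'your-proxy-username', // omit auth entirely if your proxy doesn't require it
    password: 'your-proxy-password',
};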

Other than this change, the entire script remains the same:

// Function to scrape data from a webpage
async function scrapeWebsite(url) {
    try {
       // Send a GET request to the webpage with proxy configuration
        const response = await axios.get(url, {
            proxy: {
                host: proxy.host,
                port: proxy.port,
                auth: {
                    username: proxy.username, // Optional: Include if your proxy requires authentication
                    password: proxy.password, // Optional: Include if your proxy requires authentication
                },
            },
        });
        
        // Load the HTML into cheerio
        const $ = cheerio.load(response.data);
        
        // Extract all elements with the class 'quote'
        const quotes = [];
        $('div.quote').each((index, element) => {
            // Extracting text from span with class 'text'
            const quoteText = $(element).find('span.text').text().trim(); 
            // Assuming there's a small tag for the author
            const author = $(element).find('small.author').text().trim(); 
            quotes.push({ quote: quoteText, author: author });
        });

        // Output the quotes
        console.log("Quotes found on the webpage:");
        quotes.forEach((quote, index) => {
            console.log(`${index + 1}: "${quote.quote}" - ${quote.author}`);
        });

    } catch (error) {
        console.error(`An error occurred: ${error.message}`);
    }
}

Integrating a Scraping Service

Using a scraping service like ScraperAPI offers several advantages over manual web scraping since it’s designed to tackle all of the major problems you might face when scraping websites:

  • Automatically handles common web scraping obstacles such as CAPTCHAs, JavaScript rendering, and IP blocks.
  • Automatically handles proxies – proxy configuration, rotation, and much more.
  • Instead of building your own scraping infrastructure, you can leverage ScraperAPI’s pre-built solutions. This saves significant development time and resources that can be better spent on analyzing the scraped data.
  • ScraperAPI offers various customization options such as geo-location targeting, custom headers, and asynchronous scraping. You can personalize the service to suit your specific scraping needs.
  • Using a scraping API like ScraperAPI is often more cost-effective than building and maintaining your own scraping infrastructure. The pricing is based on usage, allowing you to scale up or down as needed.
  • ScraperAPI allows you to scale your scraping efforts by handling millions of requests concurrently.

To implement the ScraperAPI proxy into the scraping script you’ve created so far, there are just a few tweaks you need to make in the axios configuration.

First of all, ensure you have created a free ScraperAPI account. That way, you’ll have access to your API key, which will be necessary in the following steps.

Once you get the API key, use it as a password in the axios proxy configuration from the previous section:

// Send a GET request to the webpage with ScraperAPI proxy configuration
axios.get(url, {
    method: 'GET',
    proxy: {
        host: 'proxy-server.scraperapi.com',
        port: 8001,
        auth: {
            username: 'scraperapi',
            password: 'YOUR_API_KEY' // Paste your API key here
        },
        protocol: 'http'
    }
});

And that’s it – all of your requests will now be routed through the ScraperAPI proxy servers.

But to use the full potential of a scraping service you’ll have to configure it using the service’s dashboard – ScraperAPI is no different here.

It has a user-friendly dashboard where you can set up the web scraping process to best fit your needs. You can enable proxy or async mode, JavaScript rendering, set a region from where the requests will be sent, set your own HTTP headers, timeouts, and much more.

And the best thing is that ScraperAPI automatically generates a script containing all of the scraper settings, so you can easily integrate the scraper into your codebase.

Best Practices for Using Proxies in Web Scraping

Not all proxy providers and configurations are the same. So, it’s important to know which proxy service to choose and how to configure it properly.

Let’s take a look at some tips and tricks to help you with that!

Rotate Proxies Regularly

Implement a proxy rotation strategy that changes the IP address after a certain number of requests or at regular intervals. This approach can mimic human browsing behavior, making it less likely for websites to flag your activities as suspicious.
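
To illustrate, here’s a minimal sketch of one possible rotation strategy: it cycles through a small, hypothetical list of proxies and switches to the next one on every request (the addresses below are placeholders, and axios is assumed to be imported as in the main script):

// Hypothetical list of proxies – replace with your provider's actual endpoints
const proxies = [
    { host: '203.0.113.10', port: 8080 },
    { host: '203.0.113.11', port: 8080 },
    { host: '203.0.113.12', port: 8080 },
];

let currentProxyIndex = 0;

// Returns the next proxy in the list, wrapping around when the end is reached
function getNextProxy() {
    const proxy = proxies[currentProxyIndex];
    currentProxyIndex = (currentProxyIndex + 1) % proxies.length;
    return proxy;
}

// Usage: pass the rotated proxy into the axios request configuration
async function fetchWithRotatingProxy(url) {
    const proxy = getNextProxy();
    return axios.get(url, { proxy: { host: proxy.host, port: proxy.port } });
}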

Handle Rate Limits

Many websites enforce rate limits to prevent excessive scraping. To avoid hitting these limits, you can:

  • Introduce Delays: Add random delays between requests to simulate human behavior.
  • Monitor Response Codes: Track HTTP response codes to detect when you are being rate-limited. If you receive a 429 (Too Many Requests) response, pause your scraping for a while before trying again (see the sketch after this list).
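
Here’s a minimal sketch combining both ideas – a random delay before every request and a one-off pause when a 429 response comes back (the delay values are arbitrary examples, and axios is assumed to be imported as in the main script):

// Helper: wait for a given number of milliseconds
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function fetchWithRateLimitHandling(url) {
    // Random delay of 1–3 seconds before each request to simulate human browsing
    await sleep(1000 + Math.random() * 2000);

    try {
        return await axios.get(url);
    } catch (error) {
        // If the server says "Too Many Requests", back off before retrying once
        if (error.response && error.response.status === 429) {
            console.log('Rate limited – pausing for 60 seconds before retrying...');
            await sleep(60 * 1000);
            return await axios.get(url);
        }
        throw error;
    }
}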

Use Quality Proxies

Choosing high-quality proxies is crucial for successful web scraping. Quality proxies, especially residential ones, are less likely to be detected and banned by target websites. Using a mix of high-quality proxies can significantly enhance your chances of successful scraping without interruptions.

Quality proxy services often provide a wide range of IP addresses from different regions, enabling you to bypass geo-restrictions and access localized content.

Reliable proxy services can offer faster response times and higher uptime, which is essential when scraping large amounts of data.

As your scraping needs grow, having access to a robust proxy service allows you to scale your operations without the hassle of managing your own infrastructure.

Using a reputable proxy service often comes with customer support and maintenance, which can save you time and effort in troubleshooting issues related to proxies.

Handling CAPTCHAs and Other Challenges

CAPTCHAs and anti-bot mechanisms are some of the most common obstacles you’ll encounter while scraping the web.

Websites use CAPTCHAs to prevent automated access by trying to differentiate real humans from automated bots. They achieve this by prompting users to solve various kinds of puzzles, identify distorted objects, and so on, which can make it really difficult for you to scrape data automatically.

Even though there are many manual and automated CAPTCHA solvers available online, the best strategy for handling CAPTCHAs is to avoid triggering them in the first place. Typically, they are triggered when non-human behavior is detected. For example, a large amount of traffic sent from a single IP address using the same HTTP configuration is definitely a red flag!

So, when scraping a website, try mimicking human behavior as much as possible:

  • Add delays between requests and spread them out as much as you can.
  • Regularly rotate between multiple IP addresses using a proxy service.
  • Randomize HTTP headers and user agents (a sketch follows this list).
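
As an example of the last point, here’s a minimal sketch that picks a random User-Agent header for each request – the user-agent strings are just illustrative samples, and axios is assumed to be imported as in the main script:

// A few sample user-agent strings to rotate through (illustrative only)
const userAgents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4 Safari/605.1.15',
    'Mozilla/5.0 (X11; Linux x86_64; rv:125.0) Gecko/20100101 Firefox/125.0',
];

async function fetchWithRandomHeaders(url) {
    // Pick a random user agent for this request
    const userAgent = userAgents[Math.floor(Math.random() * userAgents.length)];

    return axios.get(url, {
        headers: {
            'User-Agent': userAgent,
            'Accept-Language': 'en-US,en;q=0.9',
        },
    });
}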

Beyond CAPTCHAs, websites often use sophisticated anti-bot measures to detect and block scraping.

Some websites use JavaScript to detect bots. Tools like Puppeteer can simulate a real browser environment, allowing your scraper to execute JavaScript and bypass these challenges.
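
If you need such a browser environment, a minimal Puppeteer sketch (assuming you’ve installed it with npm install puppeteer) could look like this, reusing the quote selectors from the earlier example:

const puppeteer = require('puppeteer');

async function scrapeWithBrowser(url) {
    // Launch a headless browser and open a new page
    const browser = await puppeteer.launch({ headless: true });
    const page = await browser.newPage();

    // Navigate and wait for the page's JavaScript to settle
    await page.goto(url, { waitUntil: 'networkidle2' });

    // Extract the rendered quotes from the page
    const quotes = await page.$$eval('div.quote', (elements) =>
        elements.map((el) => ({
            quote: el.querySelector('span.text').textContent.trim(),
            author: el.querySelector('small.author').textContent.trim(),
        }))
    );

    await browser.close();
    return quotes;
}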

Websites sometimes add hidden form fields or links that only bots will interact with, so avoid clicking on hidden elements or filling out forms with invisible fields.

Advanced anti-bot systems go as far as tracking user behavior, such as mouse movements or time spent on a page. Mimicking these behaviors using browser automation tools can help bypass these checks.

But the simplest and most efficient way to handle CAPTCHAs and anti-bot measures is definitely to use a service like ScraperAPI.

Sending your scraping requests through ScraperAPI’s API will ensure you have the best chance of not being blocked. When the API receives the request, it uses advanced machine learning techniques to determine the best request configuration to prevent triggering CAPTCHAs and other anti-bot measures.
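
Besides the proxy-mode configuration shown earlier, ScraperAPI also exposes an HTTP API endpoint. Here’s a minimal sketch, assuming the standard api.scraperapi.com endpoint with api_key and url query parameters (check ScraperAPI’s documentation for the exact options available on your plan):

// Sketch: routing a request through ScraperAPI's HTTP API instead of its proxy port
async function scrapeViaScraperAPI(targetUrl) {
    const response = await axios.get('https://api.scraperapi.com/', {
        params: {
            api_key: 'YOUR_API_KEY', // paste your API key here
            url: targetUrl,          // the page you actually want to scrape
        },
    });
    return response.data; // raw HTML of the target page, ready to load into cheerio
}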

Conclusion

As websites become more sophisticated in their anti-scraping measures, the use of proxies has become increasingly important in keeping your scraping projects successful.

Proxies help you maintain anonymity, prevent IP blocking, and enable you to scale your scraping efforts without getting obstructed by rate limits or geo-restrictions.

In this guide, we’ve explored the fundamentals of web scraping and the crucial role that proxies play in this process. We’ve discussed how proxies can help maintain anonymity, avoid IP blocks, and distribute requests to mimic natural user behavior. We’ve also covered the different types of proxies available, each with its own strengths and ideal use cases.

We demonstrated how to set up a basic web scraper and integrate proxies into your scraping script. We also explored the benefits of using a dedicated scraping service like ScraperAPI, which can simplify many of the challenges associated with web scraping at scale.

In the end, we covered the importance of carefully choosing the right type of proxy, rotating them regularly, handling rate limits, and leveraging scraping services when necessary. That way, you can ensure that your web scraping projects will be efficient, reliable, and sustainable.
