Spying

Keeping a close eye on the competition.

In this tutorial, we are going to see how to monitor a competitor web page for changes using Python/AWS Lambda and the serverless framework.

We’re going to make a CRON job that scrapes the ScrapingBee (my company’s website) pricing table and checks whether the prices have changed.

The same approach works for lots of other use cases: receiving an alert each time a new job is posted on a job board, or a new apartment appears on a rental website, for example.


Serverless refers to the execution of code inside ephemeral containers (Function as a Service, or FaaS). Some cloud providers also call these “cloud functions.”

Generally, you can trigger the function’s execution with different mechanisms such as:

  • An HTTP call to a REST API
  • A job in a message queue
  • A log (in Cloudwatch for example)
  • An IoT event

Cloud functions can be a really good fit for different use cases, like when you don’t care about latency/cold start for your CRON jobs, or when you need to “glue” different services together with some API calls.

In our example, it’s a perfect use case: we’re going to scrape a competitor’s pricing table to get an alert in case it changes. Web scraping is I/O bound: most of the time is spent waiting for an HTTP response from the server, so you don’t need a high-end CPU or a lot of RAM.

Prerequisites

In order to scaffold and deploy our project to AWS Lambda, we will use the Serverless Framework. It’s an amazing project that makes building and configuring your cloud functions really easy with a simple configuration file. It supports many different clouds (AWS, Google Cloud, Azure…) and many languages.

In order to install the CLI, you will need Node.js on your system and an AWS account. You can follow the instructions here.
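In case you want a quick reference, the usual setup looks roughly like this (the access key and secret below are placeholders for your own IAM credentials):

```shell
# Install the Serverless Framework CLI globally (requires Node.js)
npm install -g serverless

# Check that the CLI is available
serverless --version

# Register your AWS credentials with the CLI
serverless config credentials --provider aws --key YOUR_KEY --secret YOUR_SECRET
```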

Creating the Project

Now that you’ve installed the Serverless CLI, we can create a new Python project for AWS with:

serverless create --template aws-python3 --name cron-scraping --path cron-scraping

In order to scrape ScrapingBee’s pricing table, we will use Requests and BeautifulSoup packages:

 pip install requests
 pip install beautifulsoup4
 pip freeze > requirements.txt

Without using serverless this can be a problem because you need to package your dependencies into a zip file and upload everything to AWS.

With serverless, you can use a plugin that will directly read your requirements.txt file and handle the dependencies.

In order to do so, initialize an npm project and install the plugin:

 npm init
 npm install --save serverless-python-requirements

Accept all the defaults during npm init, and then add this to your serverless.yml:

# serverless.yml

plugins:
  - serverless-python-requirements

custom:
  pythonRequirements:
    dockerizePip: non-linux

You can also follow this guide if you want to know more about this.

Web Scraping

We are going to scrape the different prices in the pricing table that you can see below. If a price we extract is not $9, $29, or $99, we will send an alert to a Slack channel (or an email).

(Screenshot: ScrapingBee pricing table)

We are going to use the Requests package to get the HTML code, and BeautifulSoup to parse it and select the different prices in the table.

In our case, we can select the prices with this CSS selector:

.price.color-1 span.a
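To see how this selector behaves, here is a minimal sketch run against a stripped-down version of the pricing markup (the HTML below is a hypothetical simplification for illustration, not ScrapingBee’s actual page):

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified pricing markup for illustration only
html = """
<div class="price color-1"><span class="a">$9</span></div>
<div class="price color-1"><span class="a">$29</span></div>
<div class="price color-1"><span class="a">$99</span></div>
"""

soup = BeautifulSoup(html, "html.parser")
# .price.color-1 span.a matches a <span class="a"> inside an element
# carrying both the "price" and "color-1" classes
prices = [tag.text.strip() for tag in soup.select(".price.color-1 span.a")]
print(prices)  # ['$9', '$29', '$99']
```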

Now let’s code! In serverless.yml, a handler is specified: it’s the name of the Python function that will be executed. It takes two parameters: event, generally a Python dictionary containing the data your function needs (in our case it will be empty, since we don’t need any parameters), and context.

The context parameter is an object with different properties about the execution of your lambda function, such as the function name, version, and memory limit.
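As a sketch, you could log a few of those context attributes; the attribute names below follow the AWS Lambda Python context object, and the DummyContext class is just a hypothetical stand-in so the snippet runs outside Lambda:

```python
def describe(context):
    # Real Lambda context objects expose these attributes
    return (f"{context.function_name} v{context.function_version}, "
            f"{context.memory_limit_in_mb} MB")


# Hypothetical stand-in for local testing, not part of AWS
class DummyContext:
    function_name = "hello"
    function_version = "$LATEST"
    memory_limit_in_mb = 128


print(describe(DummyContext()))  # hello v$LATEST, 128 MB
```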

import json
import requests
from bs4 import BeautifulSoup


def hello(event, context):

    base_url = "https://www.scrapingbee.com"
    known_prices = ["$9", "$29", "$99"]
    status = "Nothing changed"

    r = requests.get(base_url)
    soup = BeautifulSoup(r.text, 'html.parser')

    prices = soup.select('.price.color-1 span.a')

    for price in prices[:3]:
        if price.text.strip() not in known_prices:
            status = f'Something changed: {price.text}'

    response = {
        "statusCode": 200,
        "body": status
    }

    return response

We then fetch ScrapingBee’s home page, parse the HTML with BeautifulSoup, and select the prices with the appropriate CSS selector. If a price differs from the known prices ($9/$29/$99), we change the status.

Instead of only setting a status, we could send a Slack notification to a channel. It’s really easy: you just have to create a Slack app to get a webhook URL, as explained here.

And then, with Requests:

payload = {"text": "A price was updated on ScrapingBee's pricing table"}
slack_request = requests.post(
    WEBHOOK_URL, json=payload, headers={"Content-Type": "application/json"}
)
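To avoid posting when nothing changed, one option is a small helper that only builds a payload when an unknown price shows up (the helper name and message text below are my own, not from the original code):

```python
def build_alert_payload(prices, known_prices):
    # Return a Slack payload only when at least one price is unknown
    changed = [p for p in prices if p not in known_prices]
    if not changed:
        return None
    return {"text": f"Prices updated on ScrapingBee's pricing table: {', '.join(changed)}"}


payload = build_alert_payload(["$9", "$29", "$199"], ["$9", "$29", "$99"])
print(payload)  # {'text': "Prices updated on ScrapingBee's pricing table: $199"}
```

The handler would then call requests.post with this payload only when it is not None.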

Deployment, Invocation, and CRON

Deploying your function to AWS is really easy with serverless:

serverless deploy

In order to invoke your function:

serverless invoke -f hello --log

We don’t want to do this manually, so we are going to add a few lines to our configuration file (serverless.yml) to invoke the function automatically once a day:

functions:
  hello:
    handler: handler.hello
    events:
      - schedule: rate(1 day)

You can learn more about schedule expressions here.
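Besides rate(), schedule events also accept AWS cron() expressions. A quick sketch of two alternatives (pick one per function):

```yaml
functions:
  hello:
    handler: handler.hello
    events:
      - schedule: rate(1 hour)        # every hour
      # or, AWS cron syntax (minute hour day-of-month month day-of-week year):
      # - schedule: cron(0 9 * * ? *) # every day at 09:00 UTC
```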

Going Further

This was a little introduction to the serverless framework and how easy it is to build and deploy simple scripts to AWS Lambda.

There are many more things to explore, such as integration with other AWS services like API Gateway to trigger a function with an HTTP call.

Another interesting topic is AWS Lambda Layers, which were introduced recently. They allow you to handle dependencies (including binaries) in your Lambda execution environment.
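In serverless.yml, a layer can be declared and attached to a function; this is a sketch following the framework’s layers syntax (the layer name and path are hypothetical):

```yaml
layers:
  pythonDeps:
    path: layer   # directory containing a python/ folder with your packages

functions:
  hello:
    handler: handler.hello
    layers:
      # CloudFormation reference: TitleCased layer name + "LambdaLayer"
      - { Ref: PythonDepsLambdaLayer }
```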

Further Reading

The Basics Of Web Scraping With Proxies

A Simple Intro to Web Scraping with Python

from DZone Cloud Zone