How to Make a Web Scraper with AWS Lambda and the Serverless Framework?

Go to the AWS Lambda web portal to build a new function, modify your Lambda code, or execute it.

September 23, 2021

The idea behind AWS is that Amazon provisions and maintains every part of your application’s infrastructure, from storage to processing power, in the cloud (that is, on Amazon’s computers), so you can build cloud-hosted apps that scale automatically. You won’t have to set up or manage servers because Amazon handles that for you. We recommend using the Serverless Framework to develop the Lambda function. A Lambda function is a function that lives in the cloud, runs only when needed, and is triggered by events or API requests.

Why Use a Scraper?

Suppose you want to collect the recipes posted on a particular website. You can get this information by scraping it from the site’s pages.

Step 1: Serverless Setup

Read the quick start guide for the Serverless Framework. Serverless removes much of the difficulty of setting up AWS infrastructure and lets you develop and test locally before deploying everything to the cloud.

Create a new Serverless project:

				
$ serverless create --template aws-nodejs --path donkeyjob
$ cd donkeyjob

The project starts with a serverless.yml file. YAML is a commonly used language for configuration, and this file holds all of the AWS configuration. For now, we can delete all of the comments and keep only the following:

				
service: donkeyjob

provider:
  name: aws
  runtime: nodejs6.10

functions:
  getdonkeyjobs:
    handler: handler.getdonkeyjobs

This declares a function called getdonkeyjobs, so handler.js needs to export a function with that name.

This is the function that will be deployed to AWS and, when triggered, will scrape the job listings.

Create the handler function in handler.js.

A Lambda handler takes three arguments: an event, a context, and a callback. Let’s start with the basics for now. You can remove the rest of the file.

				
					module.exports.getdonkeyjobs = (event, context, callback) => {
callback(null, 'Hello world');
};

Check the script locally.

				
					$ serverless invoke local --function getdonkeyjobs

This should print “Hello world”.
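As a rough illustration of the error-first callback convention the handler relies on, the same function can also be called directly from a Node script. This is only a sketch: the empty event and context objects are stand-ins, not what AWS actually passes.

```javascript
// Same handler as in handler.js, repeated so the sketch is self-contained.
const getdonkeyjobs = (event, context, callback) => {
  callback(null, 'Hello world');
};

// Stand-in invocation with empty event and context objects (an assumption;
// AWS passes much richer objects).
let outcome;
getdonkeyjobs({}, {}, (error, result) => {
  // The first argument is the error (null on success), the second the result.
  outcome = { error, result };
});

console.log(outcome.result);
```

Passing an error as the first argument instead would mark the invocation as failed.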

Step 2: Scraping The Data

We are building a scraper for the Donkey Sanctuary jobs page that parses the HTML and returns the list of jobs in the following format:

				
					[
{job: 'Marketing Campaigns Officer', closing: 'Fri Jul 21 2017 00:00:00 GMT+0100', location: 'Leeds, UK'},
{job: 'Registered Veterinary Nurse', closing: 'Sat Jul 22 2017 00:00:00 GMT+0100', location: 'Manchester, UK'},
{job: 'Building Services Manager', closing: 'Fri Jul 21 2017 00:00:00 GMT+0100', location: 'London, UK'}
];

Axios requests the page contents and passes the resulting HTML string to a separate parsing function, which can be tested on its own. Inside the parsing function, the Cheerio library parses the HTML and extracts the data we want.

Cheerio is similar to jQuery: you feed it an HTML string (for example, the response to a GET request for a page), and it builds a document object model (DOM) that you can navigate and manipulate.

Moment is a useful package for working with dates and makes producing an ISO string simple.

				
// handler.js
const request = require('axios');
const {extractListingsFromHTML} = require('./helpers');

module.exports.getdonkeyjobs = (event, context, callback) => {
  request('https://www.thedonkeysanctuary.org.uk/vacancies')
    .then(({data}) => {
      const jobs = extractListingsFromHTML(data);
      callback(null, {jobs});
    })
    .catch(callback);
};

// helpers.js
const cheerio = require('cheerio');
const moment = require('moment');

function extractListingsFromHTML (html) {
  const $ = cheerio.load(html);
  const vacancyRows = $('.view-Vacancies tbody tr');

  const vacancies = [];
  vacancyRows.each((i, el) => {
    // Extract information from each row of the jobs table
    let closing = $(el).children('.views-field-field-vacancy-deadline').first().text().trim();
    let job = $(el).children('.views-field-title').first().text().trim();
    let location = $(el).children('.views-field-name').text().trim();
    // Keep only the date before the dash and convert it to an ISO string
    closing = moment(closing.slice(0, closing.indexOf('-') - 1), 'DD/MM/YYYY').toISOString();

    vacancies.push({closing, job, location});
  });

  return vacancies;
}

module.exports = {
  extractListingsFromHTML
};
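The deadline handling in extractListingsFromHTML is easy to get wrong, so here is a minimal, dependency-free sketch of the same idea. It assumes the deadline cell reads something like '21/07/2017 - 17:00' (the exact text on the page is an assumption), and the helper name parseDeadline is hypothetical. Unlike the Moment call, it interprets the date as UTC rather than local time.

```javascript
// Hypothetical helper: turn "DD/MM/YYYY - HH:mm" (assumed format) into an
// ISO date string, or null if the text does not match.
function parseDeadline(text) {
  // Keep only the part before the dash, e.g. "21/07/2017".
  const datePart = text.split('-')[0].trim();
  const match = /^(\d{2})\/(\d{2})\/(\d{4})$/.exec(datePart);
  if (!match) {
    return null; // Unrecognised format: let the caller decide what to do.
  }
  const [, day, month, year] = match;
  // Date.UTC uses zero-based months.
  return new Date(Date.UTC(Number(year), Number(month) - 1, Number(day))).toISOString();
}

console.log(parseDeadline('21/07/2017 - 17:00')); // 2017-07-21T00:00:00.000Z
console.log(parseDeadline('to be confirmed'));    // null
```

Returning null for unexpected text keeps a malformed row from producing a misleading date.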

To use Cheerio, you need to know how to navigate the DOM and select exactly the elements you want. To do this, use your browser’s dev tools to study the HTML structure of the site you’re scraping, and keep in mind that if the HTML layout changes in the future, your scraper may stop working.
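One way to notice a layout change early is to validate what the parser returns before using it. The sketch below is an assumption about how that could look; validateListings is a hypothetical name and the checks are deliberately simple.

```javascript
// Hypothetical guard: throw if the scraped rows look wrong, which usually
// means the page markup changed and the selectors no longer match.
function validateListings(vacancies) {
  if (!Array.isArray(vacancies) || vacancies.length === 0) {
    throw new Error('No vacancies found: the page layout may have changed');
  }
  vacancies.forEach((vacancy, index) => {
    ['job', 'closing', 'location'].forEach((field) => {
      if (typeof vacancy[field] !== 'string' || vacancy[field] === '') {
        throw new Error(`Vacancy ${index} is missing "${field}"`);
      }
    });
  });
  return vacancies;
}

const sample = [
  { job: 'Chef', closing: '2017-07-21T00:00:00.000Z', location: 'Sheffield, UK' }
];
console.log(validateListings(sample).length); // 1
```

An empty list can also mean there are simply no open vacancies, so whether to treat it as an error is a judgment call.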

If we run our function now, we should see the array of jobs:

				
					$ serverless invoke local --function getdonkeyjobs

Step 3: Setup DynamoDB

A Lambda function cannot persist data between invocations; it only holds temporary state. We will configure a DynamoDB table as an AWS resource and give the Lambda function permission to interact with it. The serverless.yml will now look like this:

				
service: donkeyjob

provider:
  name: aws
  runtime: nodejs6.10

functions:
  getdonkeyjobs:
    handler: handler.getdonkeyjobs

resources:
  Resources:
    donkeyjobs:
      Type: AWS::DynamoDB::Table
      Properties:
        TableName: donkeyjobs
        AttributeDefinitions:
          - AttributeName: listingId
            AttributeType: S
        KeySchema:
          - AttributeName: listingId
            KeyType: HASH
        ProvisionedThroughput:
          ReadCapacityUnits: 1
          WriteCapacityUnits: 1

    # A policy is a resource that states one or more permissions. It lists actions, resources and effects.
    DynamoDBIamPolicy:
      Type: AWS::IAM::Policy
      DependsOn: donkeyjobs
      Properties:
        PolicyName: lambda-dynamodb
        PolicyDocument:
          Version: '2012-10-17'
          Statement:
            - Effect: Allow
              Action:
                - dynamodb:DescribeTable
                - dynamodb:Query
                - dynamodb:Scan
                - dynamodb:GetItem
                - dynamodb:PutItem
                - dynamodb:UpdateItem
                - dynamodb:DeleteItem
              Resource: arn:aws:dynamodb:*:*:table/donkeyjobs
        Roles:
          - Ref: IamRoleLambdaExecution

To create the DynamoDB table, we have to deploy this to AWS. Because a Lambda function is just a function, we can test it locally without talking to AWS, but we can’t test database access without a database.

So we deploy:

				
					$ serverless deploy

This uploads our code to AWS and creates the resources we specified in the configuration file.

Step 4: Interact with DynamoDB

We can now use the database. To do so, install the aws-sdk package (the AWS Software Development Kit), which makes interacting with DynamoDB simple.

These are the steps for each run of the scraper.

Fetch yesterday’s jobs from the database using the dynamo.scan method. The stored item looks like this:

				
					{
jobs: [ {job: 'Donkey Feeder',
closing: 'Fri Jul 21 2017 00:00:00 GMT+0100',
location: 'Leeds, UK'},
{job: 'Chef',
closing: 'Fri Jul 21 2017 00:00:00 GMT+0100',
location: 'Sheffield, UK'}
],
listingId: 'Fri Jul 21 2017 14:25:35 GMT+0100 (BST)'
}

Work out which jobs are new by comparing today’s jobs with yesterday’s, using lodash’s differenceWith and isEqual.

Delete yesterday’s jobs from the database with dynamo.delete.

Save today’s jobs in their place with dynamo.put.

Call the callback with the new jobs.

				
const request = require('axios');
const AWS = require('aws-sdk');
const dynamo = new AWS.DynamoDB.DocumentClient();
const { differenceWith, isEqual } = require('lodash');
const { extractListingsFromHTML } = require('./helpers');

module.exports.getdonkeyjobs = (event, context, callback) => {
  let newJobs, allJobs;

  request('https://www.thedonkeysanctuary.org.uk/vacancies')
    .then(({ data }) => {
      allJobs = extractListingsFromHTML(data);

      // Retrieve yesterday's jobs
      return dynamo.scan({
        TableName: 'donkeyjobs'
      }).promise();
    })
    .then(response => {
      // Figure out which jobs are new
      let yesterdaysJobs = response.Items[0] ? response.Items[0].jobs : [];

      newJobs = differenceWith(allJobs, yesterdaysJobs, isEqual);

      // Get the ID of yesterday's jobs which can now be deleted
      const jobsToDelete = response.Items[0] ? response.Items[0].listingId : null;

      // Delete old jobs
      if (jobsToDelete) {
        return dynamo.delete({
          TableName: 'donkeyjobs',
          Key: {
            listingId: jobsToDelete
          }
        }).promise();
      } else return;
    })
    .then(() => {
      // Save the list of today's jobs
      return dynamo.put({
        TableName: 'donkeyjobs',
        Item: {
          listingId: new Date().toString(),
          jobs: allJobs
        }
      }).promise();
    })
    .then(() => {
      callback(null, { jobs: newJobs });
    })
    .catch(callback);
};

We can test the function locally by executing:

				
					$ serverless invoke local --function getdonkeyjobs

We should expect the callback to return every position currently listed on The Donkey Sanctuary site, because they are all ‘new’ to us: there are no jobs in our database from the day before.

If you go to the AWS console now, you should see today’s data saved there. Go to DynamoDB, select your donkeyjobs table, and look at the entries.

If you run the function locally again, the jobs array will be empty. That’s because we’re comparing the scraped jobs with whatever is already in the database, and nothing has changed unless a new job was added in the last few minutes.
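To make the compare, delete and save cycle concrete without touching AWS, here is a small in-memory sketch. The table array stands in for the DynamoDB table and Node’s util.isDeepStrictEqual stands in for lodash’s isEqual; runScrape and both stand-ins are assumptions for illustration, not part of the handler.

```javascript
const { isDeepStrictEqual } = require('util');

// In-memory stand-in for the donkeyjobs table (holds at most one item).
const table = [];

// Hypothetical helper mirroring the handler's steps on the fake table.
function runScrape(allJobs, now) {
  // 1. Read yesterday's item, if any (the handler uses dynamo.scan).
  const previous = table[0];
  const yesterdaysJobs = previous ? previous.jobs : [];

  // 2. Keep only jobs with no deep-equal match in yesterday's list.
  const newJobs = allJobs.filter(
    (job) => !yesterdaysJobs.some((old) => isDeepStrictEqual(job, old))
  );

  // 3. Delete the old item, then 4. save today's full list.
  table.length = 0;
  table.push({ listingId: now.toString(), jobs: allJobs });

  // 5. Report only the new jobs.
  return newJobs;
}

const scraped = [
  { job: 'Donkey Feeder', closing: '2017-07-21T00:00:00.000Z', location: 'Leeds, UK' },
  { job: 'Chef', closing: '2017-07-21T00:00:00.000Z', location: 'Sheffield, UK' }
];

const firstRun = runScrape(scraped, new Date());
const secondRun = runScrape(scraped, new Date());
console.log(firstRun.length, secondRun.length); // 2 0
```

The first run reports both jobs as new because the table starts empty; the second run finds the same jobs already stored and reports none.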

Step 5: Sending a Text Using Nexmo

Now that we have a list of new opportunities, let’s send an SMS notifying our users of all the fascinating donkey jobs they can apply for!

To begin, create an account with Nexmo. It gives you $2 of free credit to play with, which is plenty. After signing up, you should land on a dashboard showing your API key and API secret. You’ll need these to send a text message through Nexmo.

The Nexmo npm package makes it easy to send a text. Install it and require it in your handler.js file. We can then send any text message we like before calling the final callback in our getdonkeyjobs handler:

				
// At the top of handler.js: const Nexmo = require('nexmo');
.then(() => {
  if (newJobs.length) {
    // Only send a text when there is something new to report
    var nexmo = new Nexmo({
      apiKey: NEXMO_API_KEY,
      apiSecret: NEXMO_API_SECRET
    });
    nexmo.message.sendSms('Donkey Jobs Finder', MY_PHONE_NUMBER, 'Hello, we found a new donkey job!');
  }
  callback(null, { jobs: newJobs });
})
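The snippet refers to NEXMO_API_KEY, NEXMO_API_SECRET and MY_PHONE_NUMBER without showing where they come from. A common approach, assumed here rather than taken from the original project, is to read them from environment variables. The helper name loadSmsConfig is hypothetical.

```javascript
// Hypothetical helper: collect SMS settings from an environment-like object
// and fail fast if any of them is missing.
function loadSmsConfig(env) {
  const names = ['NEXMO_API_KEY', 'NEXMO_API_SECRET', 'MY_PHONE_NUMBER'];
  const missing = names.filter((name) => !env[name]);
  if (missing.length > 0) {
    throw new Error(`Missing environment variables: ${missing.join(', ')}`);
  }
  return {
    apiKey: env.NEXMO_API_KEY,
    apiSecret: env.NEXMO_API_SECRET,
    phoneNumber: env.MY_PHONE_NUMBER
  };
}

// Example values only; in the handler you would pass process.env instead.
const config = loadSmsConfig({
  NEXMO_API_KEY: 'example-key',
  NEXMO_API_SECRET: 'example-secret',
  MY_PHONE_NUMBER: '447700900000'
});
console.log(config.apiKey); // example-key
```

Keeping the credentials out of handler.js also keeps them out of version control.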

To test this, we need to clear the DynamoDB table so that the function treats the scraped jobs as new, and then run our function locally once again.

With any luck, an SMS message should arrive!

The final step is to improve the formatting of our text or email. For this, we can write a new helper function that takes a list of jobs and returns a formatted message with the job title, location, and deadline of everything available.

Remember that every time we want to test our function, we have to clear the table first (there are certainly better ways to do this, but for now it’s easy enough to remove the item in the AWS console).

				
function formatJobs (list) {
  return list.reduce((acc, job) => {
    return `${acc}${job.job} in ${job.location} closing on ${moment(job.closing).format('LL')}\n\n`;
  }, 'We found:\n\n');
}

module.exports = {
  extractListingsFromHTML,
  formatJobs
};
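For illustration, this is roughly what the formatted message looks like. The sketch swaps Moment for the built-in toLocaleDateString so that it runs without dependencies (the swap is an assumption made for this sketch), and formatJobsPlain is a hypothetical name.

```javascript
// Hypothetical, dependency-free variant of formatJobs.
function formatJobsPlain(list) {
  return list.reduce((acc, job) => {
    // Render the ISO closing date as e.g. "July 21, 2017", fixed to UTC.
    const closing = new Date(job.closing).toLocaleDateString('en-US', {
      year: 'numeric',
      month: 'long',
      day: 'numeric',
      timeZone: 'UTC'
    });
    return `${acc}${job.job} in ${job.location} closing on ${closing}\n\n`;
  }, 'We found:\n\n');
}

const message = formatJobsPlain([
  { job: 'Chef', closing: '2017-07-21T00:00:00.000Z', location: 'Sheffield, UK' }
]);
console.log(message);
```

The resulting string could be passed as the third argument to nexmo.message.sendSms in place of the fixed text.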

And now that we’ve finished, we can finally deploy our entire application to AWS:

				
					$ serverless deploy

Step 6: Configuring Lambda to Execute Every Day

After we’ve deployed the function, we can invoke it in the AWS Lambda console to make sure it’s working correctly.

We can also set the function to run automatically once a day. Click ‘Add trigger’, select ‘CloudWatch Events’ from the drop-down menu, and fill in the relevant details. To run it daily, use the schedule expression rate(1 day).
