Search engines have always interested me a bit and I've wondered how to set them up. They consist of a few simple parts:

* crawler,
* scraper,
* query engine, and
* ranking system.

I decided to set myself a simple project to build one that could scrape a domain and store the content of all its pages. This post will focus on the crawler/scraper aspect of it.

## Scrape A Page

Initially, all we want to do is download a single page and scrape the text from it. This can be done quite easily with a few lines of Python; all you'll need to install is the [`requests`](https://requests.readthedocs.io/en/master/) and [`lxml`](https://lxml.de/) packages:

{{< highlight python >}}
from lxml import html
import requests
import sys

# Take the URL as a command-line argument
url = sys.argv[1]

# Download the page and parse it into an element tree
page = requests.get(url)
tree = html.fromstring(page.content)

# Grab every text node under <body> and print the lot
text = tree.xpath('//body//text()')
print('Text:', str(text))
{{< / highlight >}}

All this does is download the page, parse it into an element tree, find all the text nodes (not including HTML tags), and print them out.

## Scrape A Domain

A bit more complex is recursively scraping a domain by following every link on the site, while also making sure you're not following the same links twice (duplication). For this, you'll need to install [RabbitMQ](https://www.rabbitmq.com/), [Redis](http://redis.io/), and [Celery](http://www.celeryproject.org/). Now let's modify the previous code to look like below:

{{< highlight python >}}
from celery import Celery
from lxml import html
import redis
import requests

# Redis tracks which URLs have already been scraped;
# RabbitMQ (via Celery) queues the pages still to do.
r = redis.Redis(host='localhost', port=6379, db=0)
app = Celery('tasks', broker='pyamqp://guest@localhost//')


@app.task
def scrape(url):
    # Skip pages we've already seen
    if r.exists(url):
        return str(url)

    page = requests.get(url)
    tree = html.fromstring(page.content)

    # Mark this page as scraped before queueing its links
    r.set(url, 1)

    links = tree.xpath('//a/@href')
    text = tree.xpath('//body//text()')  # not stored yet, we'll get to that

    # Queue any links on our domain that haven't been scraped yet
    for link in links:
        if 'https://adamogrady.id.au' in link and r.exists(link) == 0:
            scrape.delay(link)

    return str(url)


scrape.delay('https://adamogrady.id.au/')
{{< / highlight >}}

When started with Celery using the command `celery -A [file name without .py] worker --loglevel=info`, the project sets up a Celery task and queues a first run of it for the domain `https://adamogrady.id.au/`. The task checks whether the page has already been scraped (exiting early if so), then requests and scrapes the page, then goes through all the links on the page and queues up any that are under the right domain (although a proper check would test that the URL starts with the domain rather than merely contains it) and haven't already been scraped.

## Store The Data

Lastly, we need to store the data somewhere. In this case I'm using [Elasticsearch](https://www.elastic.co/), which will also be our query engine and provide our ranking system later on.
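Before wiring Elasticsearch into the scraper, it's worth a quick check that the Python client can actually reach the cluster. This is just a minimal sketch, assuming the `elasticsearch` Python package and a node running locally on the default port 9200:

{{< highlight python >}}
from elasticsearch import Elasticsearch

# The client defaults to localhost:9200 when given no arguments
es = Elasticsearch()

# ping() returns True if the cluster is reachable, False otherwise
if not es.ping():
    raise RuntimeError('Cannot reach Elasticsearch on localhost:9200')

# Print the cluster's version as a quick sanity check
print(es.info()['version']['number'])
{{< / highlight >}}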
Once you have Elasticsearch installed and the appropriate Python module ready, let's modify our code a bit more:

{{< highlight python >}}
from celery import Celery
from elasticsearch import Elasticsearch
from lxml import html
import redis
import requests

r = redis.Redis(host='localhost', port=6379, db=0)
app = Celery('tasks', broker='pyamqp://guest@localhost//')
es = Elasticsearch()


@app.task
def scrape(link):
    # Skip pages we've already seen
    if r.exists(link):
        return str(link)

    page = requests.get(link)
    tree = html.fromstring(page.content)

    links = tree.xpath('//a/@href')
    text = tree.xpath('//body//text()')

    # Store the page's link and text in the 'test-search' index
    doc = {
        'link': link,
        'text': ','.join(text)
    }
    es.index(index='test-search', doc_type='page', body=doc)

    # Mark this page as scraped, then queue up any unscraped links on our domain
    r.set(link, 1)
    for single_link in links:
        if 'https://adamogrady.id.au' in single_link and r.exists(single_link) == 0:
            scrape.delay(single_link)

    return str(link)


scrape.delay('https://adamogrady.id.au/')
{{< / highlight >}}

You'll notice we're now storing the data in an Elasticsearch index (`test-search`) so it can later be queried. This code should work, and you can replace the domain with another URL to scrape whatever site you like.

## Future Improvements

There are a bunch of small improvements that could be made to this search:

* Moving to [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/) instead of `lxml` for better text scraping
* Indexing headers as a separate key in the Elasticsearch doc
* Indexing the entire page, including HTML tags
* Separating the XPath/Beautiful Soup work into its own task
* Expanding any shortened forms of URLs ('`/about`') to the full form ('`https://adamogrady.id.au/about`'), then checking the domain and removing any anchors for the same page, as sketched below (thanks [@\_\_eater\_\_](https://twitter.com/__eater__)!)
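For that last point, Python's standard library already does most of the heavy lifting with `urllib.parse`. Here's a rough sketch of how it might look; the `normalise` helper and `BASE` constant are just illustrative names rather than code from the scraper above:

{{< highlight python >}}
from urllib.parse import urljoin, urldefrag, urlparse

BASE = 'https://adamogrady.id.au/'

def normalise(page_url, href):
    """Resolve a possibly-relative href against the page it was found on,
    strip any #fragment, and return it only if it stays on our domain."""
    absolute = urljoin(page_url, href)         # '/about' -> 'https://adamogrady.id.au/about'
    absolute, _fragment = urldefrag(absolute)  # drop same-page anchors like '#contact'
    if urlparse(absolute).netloc == urlparse(BASE).netloc:
        return absolute
    return None

print(normalise('https://adamogrady.id.au/', '/about#top'))
# https://adamogrady.id.au/about
{{< / highlight >}}

Dropping something like this into the `scrape` task before the Redis check would also let the duplicate detection treat `/about`, `https://adamogrady.id.au/about`, and `https://adamogrady.id.au/about#top` as the same page.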