"A Simple Recursive Web Page Crawler to Scrape Data" written in Python
I am very new to the Python language, and I am sure there are better, professionally written tools and modules out there. Still, I have found this little piece of code I wrote quite useful, so I wanted to share it online.
Let's say we aim to scrape a specific kind of data (e-mail addresses, for example, are a good data type to try this code on) from an HTML-based web site. The code collects every link it finds and visits them recursively, looking for the data you are after. If the web site is laid out in a consistent format, you can inspect its source and identify a few HTML tags or strings that appear just before the data string you need; the small example below illustrates the idea.
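For instance, if the page source looked like the made-up fragment below, two marker strings are enough to isolate the address. This is only a minimal sketch of the cropping idea; the markers and the HTML are hypothetical:

page = '<td class="label">E-mail:</td><td><a href="mailto:info@example.com">info@example.com</a></td>'

markers = ['E-mail:', 'mailto:']      # hypothetical strings that precede the data
crop = page
for m in markers:
    idx = crop.find(m)
    if idx == -1:                     # marker not found, give up on this page
        crop = ''
        break
    crop = crop[idx + len(m):]        # keep only the text after the marker

email = crop.split('"')[0] if crop else None   # take the text up to the closing quote
print(email)                          # info@example.com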
Of course, this code will not work properly unless you make the adjustments specific to your own case. My aim is only to give a general idea of the logic behind recursively crawling a web page and scraping data from it.
In the crawl() function I insert a random pause between consecutive requests so as not to put too much load on the server. It is a good idea to include that kind of break.
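A pause like the following, placed before each request, is enough; the bounds are arbitrary example values, not the ones from the original code:

import time, random

time.sleep(random.uniform(2, 6))   # wait 2-6 seconds between consecutive requests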
PS: The code requires the bs4 (BeautifulSoup 4) module to be installed.
#!/usr/bin/python

"""
18042016
This code is able to scrape data in a web site.
It recursively visits links and sub links.
But the code should be specifically designed for the website
"""

# importing necessary libraries
import sys, getopt, time, requests, random
from bs4 import BeautifulSoup as BS

search_tags = ['write here', 'the list of', 'corresponding html tags', 'to be used to recover data']
emaildb = list()
linkdb = list()
femails = open("emails.txt", "w+")
rootDir = sys.argv[1]
depth = 0

def scrapeEmail(url):
    global emaildb
    global femails
    global search_tags
    crop = url
    # crop away everything up to and including each marker string in turn
    for tag in search_tags:
        sstring = tag
        sindex = crop.find(sstring)
        crop = crop[sindex + len(sstring):]
    sstring = '

(The listing breaks off here in the original post.)
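Since the listing is cut off in the source (it ends partway through scrapeEmail(), and the crawl() function mentioned above is missing), here is a self-contained sketch of how the remaining pieces might look. The marker strings, delay bounds, depth limit and delimiter are assumptions made for illustration, not the original values:

#!/usr/bin/python
# Hypothetical completion sketch -- not the original code from the post.
import sys, time, random, requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup as BS

search_tags = ['E-mail:', 'mailto:']    # assumed marker strings that precede the data
emaildb = []                            # e-mail addresses found so far
linkdb = []                             # links already visited
femails = open("emails.txt", "w+")
MAX_DEPTH = 3                           # assumed recursion limit

def scrapeEmail(html):
    # Crop the page text after each marker, then keep the string up to a delimiter.
    crop = html
    for tag in search_tags:
        sindex = crop.find(tag)
        if sindex == -1:
            return                      # markers not present on this page
        crop = crop[sindex + len(tag):]
    email = crop.split('"')[0].strip()  # assumed delimiter: the closing quote of the href
    if email and email not in emaildb:
        emaildb.append(email)
        femails.write(email + "\n")

def crawl(url, depth=0):
    # Fetch a page, scrape it, then recursively follow the links it contains.
    if depth > MAX_DEPTH or url in linkdb:
        return
    linkdb.append(url)
    time.sleep(random.uniform(2, 6))    # random pause so the server is not disturbed
    try:
        resp = requests.get(url, timeout=10)
    except requests.RequestException:
        return
    scrapeEmail(resp.text)
    soup = BS(resp.text, "html.parser")
    for a in soup.find_all("a", href=True):
        link = urljoin(url, a["href"])  # resolve relative links against the current page
        if link.startswith(rootDir):    # stay inside the starting site
            crawl(link, depth + 1)

if __name__ == "__main__":
    rootDir = sys.argv[1]               # starting URL given on the command line
    crawl(rootDir)
    femails.close()

Run this way, a call such as ./crawler.py http://example.com would append any addresses it finds to emails.txt; the depth limit and the rootDir prefix check keep the recursion from wandering off the starting site.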