How do I store crawled data into a database - python

I'm fairly new to Python and everything else I'm about to talk about in this question, but I want to get started with a project I've been thinking about for some time now. Basically, I want to crawl the web and display the URLs on a web page in real time, as they are crawled. I coded a simple crawler which stores the URLs in a list. I was wondering how to get this list into a database, and have the database updated every x seconds, so that I can access the database and output the list of links on the web page periodically.
I don't know much about real-time web development, but that's a topic for another day. Right now I'm more concerned with how to get the list into the database. I'm currently using the web2py framework, which is quite easy to get along with, but if you have any recommendations about where I should look or which frameworks I should check out, please mention them in your answers too. Thanks.
In a nutshell, the things I'm a noob at are: Python, databases, and real-time web development.
Here's the code for my crawler, if it helps in any way. Thanks!
from urllib2 import urlopen

def crawler(url, x):
    crawled = []
    tocrawl = []

    def crawl(url, x):
        x = x + 1
        try:
            page = urlopen(url).read()
            findlink = page.find('<a href=')
            if findlink == -1:
                return None, 0
            while findlink != -1:
                start = page.find('"', findlink)
                end = page.find('"', start + 1)
                link = page[start + 1:end]
                if link:
                    if link != url:
                        if link[0] == '/':
                            link = url + link
                        link = replace(link)
                        if (link not in tocrawl) and (link != "") and (link not in crawled):
                            tocrawl.append(link)
                findlink = page.find('<a href=', end)
            crawled.append(url)
            while tocrawl:
                crawl(tocrawl[x], x)
        except:
            # keep crawling
            crawl(tocrawl[x], x)

    crawl(url, x)

def replace(link):
    tsp = link.find('//')
    if tsp == -1:
        return link
    link = link[0:tsp] + '/' + link[tsp + 2:]
    return link

Instead of placing the URLs into a list, why not write them to the database directly? Using, for example, MySQL:
import MySQLdb

conn = MySQLdb.connect('server', 'user', 'pass', 'db')
curs = conn.cursor()
# Let the driver escape the values instead of building SQL with string
# formatting, which is open to SQL injection.
curs.execute('INSERT INTO your_table VALUES (%s, %s)', (id, str(link)))
conn.commit()  # the INSERT is not persisted until you commit
conn.close()
This way you don't have to manage the list like a pipe. But if the list is necessary, this approach can also be adapted to that method.

This sounds like a good job for Redis, which has a built-in list structure. Appending a new URL to your list is as simple as:
from redis import Redis
red = Redis()
# Later in your code...
red.lpush('crawler:tocrawl', link)
It also has a set type that lets you efficiently check which websites you've crawled, and lets you sync multiple crawlers.
# Check if we're the first one to mark this link
if red.sadd('crawler:crawled', link):
    red.lpush('crawler:tocrawl', link)
To get the next link to crawl:
url = red.lpop('crawler:tocrawl')
To see which urls are queued to be crawled:
print red.lrange('crawler:tocrawl', 0, -1)
It's just one option, but it is very fast and flexible. You can find more documentation on the Redis Python driver page.

To achieve this you need a cron job. cron is a job scheduler on Unix-like systems; you can schedule a job to run every minute, every hour, every day, and so on.
Check out this tutorial, http://newcoder.io/scrape/intro/, which will help you achieve what you want here.
Let me know if it works.
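For example, a crontab entry that runs the crawler every five minutes might look like this (the script and log paths are hypothetical; edit your crontab with `crontab -e`):

```shell
# m  h  dom mon dow   command
*/5  *  *   *   *     /usr/bin/python /home/you/crawler.py >> /home/you/crawler.log 2>&1
```

The `>> ... 2>&1` part appends both output and errors to a log file, which is handy for debugging a job that runs unattended.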

Related

How would I prevent adding multiple copies of the same data to my firebase database

I am designing a small sports news app as a school project that scrapes data from the web, posts it to a Firebase Realtime Database, and is then used in an Android Studio application being built by my project partner. So far during development I have just been deleting the database and rebuilding it every time I run the code, to prevent build-up of the same data. I am wondering how I would go about checking whether a piece of data exists before I push it to the database.
Thanks if anyone is able to point me in the right direction. Here is my code for pushing the data to Firebase:
ref = db.reference('/news')
ref.delete()

url = 'https://news.sky.com/topic/premier-league-3810'
content = requests.get(url)
soup = BeautifulSoup(content.text, "html.parser")
body = soup.find_all("h3", "sdc-site-tile__headline")

titles_list = []
links_list = []
for item in body:
    headline = item.find_all('span', class_='sdc-site-tile__headline-text')[0].text
    titles_list.append(headline)
    link = item.find('a', class_='sdc-site-tile__headline-link').get('href')
    links_list.append('https://news.sky.com' + link)

i = 0
while i < len(titles_list):
    ref.push({
        'Title': titles_list[i],
        'Link': links_list[i]
    })
    i += 1
There are a few main options here:
You can use a query to check if the data already exists, before writing it. Then when it doesn't exist yet, you can write it.
If multiple users/devices can be adding data at the same time, the above won't work as somebody may write their data just after you have checked if the values already exist. In that case you'll want to:
Use the values that are to be unique as the key of the data (so using child("your_unique_values").set instead of push), and use a transaction to ensure you don't overwrite each other's data.
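A minimal sketch of the deterministic-key approach with the firebase-admin SDK: `article_key` and `push_if_new` are made-up helper names, and hashing the link is just one way to derive a key (Firebase keys may not contain `.`, `$`, `#`, `[`, `]`, or `/`, which raw URLs do):

```python
import hashlib

def article_key(link):
    # Stable, Firebase-safe key derived from the link. Assumption: the
    # link uniquely identifies the article. A hex digest contains none
    # of the characters Firebase forbids in keys.
    return hashlib.sha1(link.encode('utf-8')).hexdigest()

def push_if_new(ref, title, link):
    # set() at a deterministic path overwrites the existing entry rather
    # than creating a duplicate, so re-running the scraper is safe.
    ref.child(article_key(link)).set({'Title': title, 'Link': link})
```

Because the same link always maps to the same path, there is no need to delete and rebuild the database between runs.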

How to 'save' progress whilst web scraping in Python?

I am scraping some data, making a lot of requests to Reddit's Pushshift API. Along the way I keep encountering HTTP errors, which halt all progress. Is there any way I can continue where I left off when an error occurs?
X = []
for i in ticklist:
    f = urlopen("https://api.pushshift.io/reddit/search/submission/?q={tick}&subreddit=wallstreetbets&metadata=true&size=0&after=1610928000&before=1613088000".format(tick=i))
    j = json.load(f)
    subs = j['metadata']['total_results']
    X.append(subs)
    print('{tick} has been scraped!'.format(tick=i))
    time.sleep(1)
I've so far mitigated the 429 errors by waiting a second between requests, but I am still experiencing connection timeouts. I'm not sure how to proceed efficiently without wasting a lot of time rerunning the code and hoping for the best.
Python SQLite DB approach. Reference: https://www.tutorialspoint.com/sqlite/sqlite_python.htm
Create an SQLite database.
Create a table of the URLs to be scraped, with a schema like CREATE TABLE COMPANY (url NOT NULL UNIQUE, Status NOT NULL DEFAULT 'Not started').
Now read only the rows whose status is 'Not started'.
You can change the status column of a URL to 'Success' once scraping is done.
So wherever the script starts, it will only run for the not-started ones.
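The steps above might look like this with the stdlib `sqlite3` module (the helper names and the `progress.db` filename are made up for the sketch):

```python
import sqlite3

def init_db(path):
    # A file path persists progress across restarts; ':memory:' works for tests.
    conn = sqlite3.connect(path)
    conn.execute("""CREATE TABLE IF NOT EXISTS company (
                        url    TEXT NOT NULL UNIQUE,
                        status TEXT NOT NULL DEFAULT 'Not started')""")
    return conn

def add_url(conn, url):
    # INSERT OR IGNORE plus the UNIQUE constraint makes re-adding a no-op.
    conn.execute("INSERT OR IGNORE INTO company (url) VALUES (?)", (url,))
    conn.commit()

def pending_urls(conn):
    # Only rows that were never marked done survive a crash and restart.
    return [r[0] for r in conn.execute(
        "SELECT url FROM company WHERE status = 'Not started'")]

def mark_done(conn, url):
    conn.execute("UPDATE company SET status = 'Success' WHERE url = ?", (url,))
    conn.commit()
```

In the scraping loop you would call `mark_done` after each successful request, so a rerun starts from `pending_urls(init_db('progress.db'))` instead of from scratch.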

Python scrapy, re compiling efficiency paradox in loop

I am totally new to programming, but I came across this bizarre phenomenon that I could not explain. Please point me in the right direction. I started crawling an entirely JavaScript-built webpage with AJAX tables. First I started with Selenium, and it worked well. However, I noticed that some of the pros here mentioned Scrapy is much faster. Then I tried, and succeeded in building the crawler under Scrapy, with a hell of a headache.
I need to use re to extract the JavaScript strings, and what happened next confused me. Here is what the Python docs say (https://docs.python.org/3/library/re.html):
"but using re.compile() and saving the resulting regular expression object for reuse is more efficient when the expression will be used several times in a single program."
I first started with re.search(pattern, str) inside the loop. Scrapy crawled 202 pages in 3 seconds (finish time - start time).
Then I followed the Python docs' suggestion, calling re.compile(pattern) before the loop to improve efficiency. Scrapy crawled the same 202 pages in 37 seconds. What is going on here?
Here is some of the code; other suggestions to improve it are greatly appreciated. Thanks.
EDIT2: I was too presumptuous to base my view on a single run.
Later 3 tests with 2000 webpages show that regex compiling within the loop is finished in 25s on average. With the same 2000 webpages regex compiling before the loop is finished in 24s on average.
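The near-identical timings are consistent with how the `re` module works: `re.search(pattern, s)` caches the compiled pattern internally after the first call, so in a loop you pay the compilation cost only once either way. A quick check outside Scrapy (the pattern and sample string here are made up, not from the real site):

```python
import re
import timeit

pattern = r'var\s+(\w+)\s*='      # hypothetical pattern
text = 'var detainCount = 42;'    # hypothetical JavaScript string

compiled = re.compile(pattern)

inline = timeit.timeit(lambda: re.search(pattern, text), number=100000)
pre = timeit.timeit(lambda: compiled.search(text), number=100000)

# Both loops are fast; re.search only compiles once because the compiled
# pattern is cached internally after the first call.
print('inline: %.3fs  precompiled: %.3fs' % (inline, pre))
```

With caching in play, any large difference between runs is more likely noise (network, site load) than regex compilation.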
EDIT:
Here is the webpage I am trying to crawl
http://trac.syr.edu/phptools/immigration/detainhistory/
I am trying to crawl basically everything in this database on a year-month basis. There are three JavaScript-generated columns on this webpage. When you select options from the drop-down menus, the webpage sends SQL queries to its server and generates the corresponding contents. I figured out how to directly generate these pre-defined queries to crawl all the table contents, but it is a great pain.
def parseYM(self, response):
    c_list = response.xpath()
    c_list_num = len(c_list)
    item2 = response.meta['item']
    # compiling before the loop
    # search_c = re.compile(pattern1)
    # search_cid = re.compile(pattern2)
    for j in range(c_list_num):
        item = myItem()
        item[1] = item2[1]
        item['id'] = item2['id']
        ym_id = item['ymid']
        item[3] = re.search(pattern1, c_list[j]).group(1)
        tmp1 = re.search(pattern2, c_list[j]).group(1)
        # item[3] = search_c.search(c_list[j]).group(1)
        # tmp1 = search_cid.search(c_list[j]).group(1)
        item[4] = tmp1
        link1 = 'link1'
        request = Request(link1, self.parse2, meta={'item': item}, dont_filter=True)
        yield request
An unnecessary temp variable is used to avoid long lines. Maybe there are better ways? I have a feeling that the regular-expression issue has something to do with the Twisted reactor. The Twisted docs are quite intimidating to newbies like me...

Flask template streaming with Jinja

I have a Flask application. On a particular view, I show a table with about 100k rows in total. It's understandably taking a long time for the page to load, and I'm looking for ways to improve it. So far I've determined that the database query returns results fairly quickly; I think the problem lies in rendering the actual page. I've found this page on streaming and am trying to work with that, but keep running into problems. I've tried the stream_template solution provided there with this code:
@app.route('/thing/matches', methods=['GET', 'POST'])
@roles_accepted('admin', 'team')
def r_matches():
    matches = Match.query.filter(
        Match.name == g.name).order_by(Match.name).all()
    return Response(stream_template('/retailer/matches.html',
                                    dashboard_title=g.name,
                                    match_show_option=True,
                                    match_form=form,
                                    matches=matches))

def stream_template(template_name, **context):
    app.update_template_context(context)
    t = app.jinja_env.get_template(template_name)
    rv = t.stream(context)
    rv.enable_buffering(5)
    return rv
The Match query is the one that returns 100k+ items. However, whenever I run this the page just shows up blank with nothing there. I've also tried the solution with streaming the data to a json and loading it via ajax, but nothing seems to be in the json file either! Here's what that solution looks like:
@app.route('/large.json')
def generate_large_json():
    def generate():
        app.logger.info("Generating JSON")
        matches = Product.query.join(Category).filter(
            Product.retailer == g.retailer,
            Product.match != None).order_by(Product.name)
        for match in matches:
            yield json.dumps(match)
    app.logger.info("Sending file response")
    return Response(stream_with_context(generate()))
Another solution I was looking at was for pagination. This solution works well, except I need to be able to sort through the entire dataset by headers, and couldn't find a way to do that without rendering the whole dataset in the table then using JQuery for sorting/pagination.
The file I get by going to /large.json is always empty. Please help or recommend another way to display such a large data set!
Edit: I got the generate() part to work and updated the code.
The problem in both cases is almost certainly that you are hanging while building 100K+ Match items and holding them all in memory. You will want to stream the results from the DB as well, using yield_per. However, only Postgres+psycopg2 support the necessary stream_results argument (here's a way to do it with MySQL):
matches = Match.query.filter(
    Match.name == g.name).order_by(Match.name).yield_per(10)
# Stream ten results at a time
An alternative
If you are using Flask-SQLAlchemy you can make use of its Pagination class to paginate your query server-side and not load all 100K+ entries into the browser. This has the added advantage of not requiring the browser to manage all of the DOM entries (assuming you are doing the HTML streaming option).
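Under the hood, pagination is just LIMIT/OFFSET run server-side; here's a minimal stdlib sketch of the idea (the `matches` table and `fetch_page` helper are hypothetical, standing in for what Flask-SQLAlchemy's `paginate()` does for you):

```python
import sqlite3

def fetch_page(conn, page, per_page=50):
    # LIMIT/OFFSET means only one page of rows ever leaves the database,
    # instead of materializing all 100K+ rows in application memory.
    offset = (page - 1) * per_page
    return [row[0] for row in conn.execute(
        "SELECT name FROM matches ORDER BY name LIMIT ? OFFSET ?",
        (per_page, offset))]
```

The browser then only renders one page of DOM entries at a time; sorting by a different column is handled by changing the ORDER BY server-side rather than sorting 100K rows in JQuery.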
See also
SQLAlchemy: Scan huge tables using ORM?
How to Use SQLAlchemy Magic to Cut Peak Memory and Server Costs in Half

Parse what you google search

I'd like to write a script (preferably in Python, but other languages are not a problem) that can parse what you type into a Google search. Suppose I search for 'cats'; then I'd like to be able to parse the string 'cats' and, for example, append it to a .txt file on my computer.
So if my searches were 'cats', 'dogs', 'cows' then I could have a .txt file like so,
cats
dogs
cows
Anyone know any APIs that can parse the search bar and return the string inputted? Or some object that I can cast into a string?
EDIT: I don't want to make a Chrome extension or anything, but preferably a Python (or Bash or Ruby) script I can run in a terminal that can do this.
Thanks
If you have access to the URL, you can look for "&q=" to find the search term. (http://google.com/...&q=cats..., for example).
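If you do have the URL, the query string can be pulled apart with the standard library instead of searching for "&q=" by hand (Python 3 shown; the example URL is made up):

```python
from urllib.parse import urlparse, parse_qs

def search_term(url):
    # parse_qs handles percent-decoding and multiple parameters for us,
    # returning a dict mapping each parameter to a list of values.
    params = parse_qs(urlparse(url).query)
    return params.get('q', [None])[0]

print(search_term('http://google.com/search?q=cats&hl=en'))  # cats
```

Appending each extracted term to a .txt file is then a one-liner with `open(path, 'a')`.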
I can offer two popular solutions.
1) Google has a search-engine API: https://developers.google.com/products/#google-search
(It has a restriction of 100 requests per day.)
Abridged code:
def gapi_parser(args):
    query = args.text; count = args.max_sites
    import config
    api_key = config.api_key
    cx = config.cx

    # Note: This API returns up to the first 100 results only.
    # https://developers.google.com/custom-search/v1/using_rest?hl=ru-RU#WorkingResults
    results = []; domains = set(); errors = []; start = 1
    while True:
        req = 'https://www.googleapis.com/customsearch/v1?key={key}&cx={cx}&q={q}&alt=json&start={start}'.format(key=api_key, cx=cx, q=query, start=start)
        if start >= 100:  # the Google API cannot do more
            break
        con = urllib2.urlopen(req)
        if con.getcode() == 200:
            data = con.read()
            j = json.loads(data)
            start = int(j['queries']['nextPage'][0]['startIndex'])
            for item in j['items']:
                match = re.search('^(https?://)?\w(\w|\.|-)+', item['link'])
                if match:
                    domain = match.group(0)
                    if domain not in results:
                        results.append(domain)
                        domains.update([domain])
                else:
                    errors.append('Can`t recognize domain: %s' % item['link'])
        if len(domains) >= args.max_sites:
            break
    print
    for error in errors:
        print error
    return (results, domains)
2) I wrote a Selenium-based script that parses the page in a real browser instance, but this solution has some restrictions, for example captchas if you run searches like a robot.
A few options you might consider, with their advantages and disadvantages:
URL:
advantage: as Chris mentioned, accessing the URL and manually changing it is an option. It should be easy to write a script for this, and I can send you my perl script if you want
disadvantage: I am not sure if you can do it. I made a perl script for that before, but it didn't work because Google states that you can't use its services outside the Google interface. You might face the same problem
Google's search API:
advantage: popular choice. Good documentation. It should be a safe choice
disadvantage: Google's restrictions.
Research other search engines:
advantage: they might not have the same restrictions as Google. You might find some search engines that let you play around more and have more freedom in general.
disadvantage: you're not going to get results that are as good as Google's
