Python scrapy, re compiling efficiency paradox in loop - python

I am totally new to programming, but I came across a bizarre phenomenon that I could not explain. Please point me in the right direction. I started crawling a webpage built entirely with JavaScript and AJAX tables. First I used Selenium and it worked well. However, I noticed that some of you pros here mentioned that Scrapy is much faster. I then tried it and succeeded in building the crawler under Scrapy, with a hell of a headache.
I need to use re to extract the JavaScript strings, and what happened next confused me. Here is what the Python doc says (https://docs.python.org/3/library/re.html):
"but using re.compile() and saving the resulting regular expression object for reuse is more efficient when the expression will be used several times in a single program."
I first used re.search(pattern, str) in a loop. Scrapy crawled 202 pages in 3 seconds (finish time - start time).
Then I followed the Python doc's suggestion and called re.compile(pattern) before the loop to improve efficiency. Scrapy crawled the same 202 pages in 37 seconds. What is going on here?
Here is some of the code. Other suggestions to improve the code are greatly appreciated. Thanks.
EDIT2: I was too presumptuous to base my view on a single run.
Three later tests with 2000 webpages show that calling re.search(pattern, str) within the loop finishes in 25 s on average. With the same 2000 webpages, compiling the regex before the loop finishes in 24 s on average.
EDIT:
Here is the webpage I am trying to crawl
http://trac.syr.edu/phptools/immigration/detainhistory/
I am trying to crawl basically everything from this database on a year-month basis. There are three JavaScript-generated columns on this webpage. When you select options from these drop-down menus, the webpage sends SQL queries back to its server and generates the corresponding contents. I figured out how to generate these pre-defined queries directly to crawl all the table contents; however, it is a great pain.
def parseYM(self, response):
    c_list = response.xpath()
    c_list_num = len(c_list)
    item2 = response.meta['item']
    # compiling before loop
    # search_c = re.compile(pattern1)
    # search_cid = re.compile(pattern2)
    for j in range(c_list_num):
        item = myItem()
        item[1] = item2[1]
        item['id'] = item2['id']
        ym_id = item['ymid']
        item[3] = re.search(pattern1, c_list[j]).group(1)
        tmp1 = re.search(pattern2, c_list[j]).group(1)
        # item[3] = search_c.search(c_list[j]).group(1)
        # tmp1 = search_cid.search(c_list[j]).group(1)
        item[4] = tmp1
        link1 = 'link1'
        request = Request(link1, self.parse2, meta={'item': item}, dont_filter=True)
        yield request
An unnecessary temp variable is used to avoid long lines. Maybe there are better ways? I have a feeling that the regular expression issue has something to do with the Twisted reactor. The Twisted doc is quite intimidating to newbies like me...
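The near-identical timings in EDIT2 are what the re documentation would lead one to expect: module-level functions such as re.search cache the compiled form of recently used patterns, so calling them in a loop does not recompile the pattern on every iteration. A minimal sketch for measuring the two approaches outside Scrapy follows. PATTERN and LINES are made-up stand-ins for pattern1 and c_list, not values from the original spider.

```python
import re
import timeit

# Hypothetical pattern and input, standing in for pattern1 and c_list.
PATTERN = r"id=(\d+)"
LINES = [f"row id={n} end" for n in range(1000)]


def search_uncompiled(lines):
    # re.search looks the pattern up in the module's internal cache on each call.
    return [re.search(PATTERN, line).group(1) for line in lines]


def search_precompiled(lines):
    # Compile once, then reuse the pattern object for every line.
    compiled = re.compile(PATTERN)
    return [compiled.search(line).group(1) for line in lines]


if __name__ == "__main__":
    for fn in (search_uncompiled, search_precompiled):
        seconds = timeit.timeit(lambda: fn(LINES), number=20)
        print(f"{fn.__name__}: {seconds:.4f} s")
```

Whatever gap remains is the per-call cache lookup, which is likely to be small next to network time in a crawl. A single run of 3 s versus 37 s therefore more plausibly reflects server or network variation than the regex code.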

Related

How to 'save' progress whilst web scraping in Python?

I am scraping some data and making a lot of requests from Reddit's pushshift API, along the way I keep encountering http errors, which halt all the progress, is there any way in which I can continue where I left off if an error occurs?
import json
import time
from urllib.request import urlopen

X = []
for i in ticklist:
    f = urlopen("https://api.pushshift.io/reddit/search/submission/?q={tick}&subreddit=wallstreetbets&metadata=true&size=0&after=1610928000&before=1613088000".format(tick=i))
    j = json.load(f)
    subs = j['metadata']['total_results']
    X.append(subs)
    print('{tick} has been scraped!'.format(tick=i))
    time.sleep(1)
I've so far mitigated the 429 error by waiting for a second in between requests - although I am experiencing connection time outs, I'm not sure how to efficiently proceed with this without wasting a lot of my time rerunning the code and hoping for the best.
A SQLite approach in Python (reference: https://www.tutorialspoint.com/sqlite/sqlite_python.htm):
Create a SQLite database.
Create a table of the URLs to be scraped, with a schema like CREATE TABLE COMPANY (url NOT NULL UNIQUE, Status NOT NULL default "Not started").
Read only the rows whose status is "Not started".
Change the Status column of a URL to "Success" once it has been scraped.
This way, whenever the script starts, it only processes the URLs that have not been scraped yet.
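The steps above can be sketched with the standard-library sqlite3 module. This is only an illustration under stated assumptions: the table name pages, the helper names, and the fetch callable (which would wrap the urlopen and json.load calls from the question) are all hypothetical.

```python
import sqlite3


def init_db(conn, urls):
    # One row per URL; rows that already exist keep their current status.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages ("
        "url TEXT NOT NULL UNIQUE, "
        "status TEXT NOT NULL DEFAULT 'Not started')"
    )
    conn.executemany(
        "INSERT OR IGNORE INTO pages (url) VALUES (?)", [(u,) for u in urls]
    )
    conn.commit()


def pending_urls(conn):
    rows = conn.execute("SELECT url FROM pages WHERE status = 'Not started'")
    return [row[0] for row in rows]


def mark_done(conn, url):
    conn.execute("UPDATE pages SET status = 'Success' WHERE url = ?", (url,))
    conn.commit()


def run(conn, fetch):
    # fetch is a hypothetical callable that downloads and processes one URL.
    for url in pending_urls(conn):
        try:
            fetch(url)
        except OSError:
            # urllib's URLError and socket timeouts are OSError subclasses;
            # the row stays "Not started" and is retried on the next run.
            continue
        mark_done(conn, url)
```

With a file-backed connection such as sqlite3.connect("progress.db") (the filename is an assumption), rerunning the script after a crash or timeout only revisits the URLs that did not finish.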

Using scrapy to extract and structure table data

I'm new to Python and Scrapy and thought I'd try scraping a simple review site. While most of the site structure is straightforward, I'm having trouble extracting the content of the reviews. This portion is visually laid out in sets of 3 (the text to the right of the 良 (good), 悪 (bad), 感 (impressions) fields), but I'm having trouble pulling this content and associating it with a reviewer or section of a review due to the use of generic divs, <br>, \n and other formatting.
Any help would be appreciated.
Here's the site and code I've tried for the grabbing them, with some results.
http://www.psmk2.net/ps2/soft_06/rpg/p3_log1.html
(1):
response.xpath('//tr//td[@valign="top"]//text()').getall()
This returns the entire set of reviews, but it contains newline markup and, more of a problem, it renders each line as a separate entry. Due to this, I can't figure out where the good, bad, and impression portions end, nor can I easily parse each separate review as entry length varies.
['\n弱点をついた時のメリット、つかれたときのデメリットがはっきりしてて良い', '\nコミュをあげるのが楽しい',
'\n仲間が多くて誰を連れてくか迷う', '\n難易度はやさしめなので遊びやすい', '\nタルタロスしかダンジョンが無くて飽きる。'........and so forth
(2) As an alternative, I tried:
response.xpath('//tr//td[@valign="top"]')[0].get()
Which actually comes close to what I'd like, save for the markup. Here it seems that it returns the entire field of a review section. Every third element should be the "good" points of each separate review (I've replaced the <> with () to show the raw return).
(td valign="top")\n精一杯考えました(br)\n(br)\n戦闘が面白いですね\n主人公だけですが・・・・(br)\n従来のプレスターンバトルの進化なので(br)\n(br)\n以上です(/td)
(3) Figuring I might be able to get just the text, I then tried:
response.xpath('//tr//td[@valign="top"]//text()')[0].get()
But that only provides each line at a time, with the \n at the front. As with (1), a line by line rendering makes it difficult to attribute reviews to reviewers and the appropriate section in their review.
From these (2) seems the closest to what I want, and I was hoping I could get some direction in how to grab each section for each review without the markup. I was thinking that since these sections come in sets of 3, if these could be put in a list that would make pulling them easier in the future (i.e. all "good" reviews follow 0, 0+3; all "bad" ones 1, 1+3 ... etc.)...but first I need to actually get the elements.
I've thought about, and tried, iterating over each line with an "if" conditional (something like:)
i = 0
if i <= len(response.xpath('//tr//td[@valign="top"]//text()').getall()):
    yield {response.xpath('//tr//td[@valign="top"]')[i].get()}
    i += 1
to pull these out, but I'm a bit lost on how to implement something like this. Not sure where it should go. I've briefly looked at Item Loader, but as I'm new to this, I'm still trying to figure it out.
Here's the block where the review code is.
def parse(self, response):
    for table in response.xpath('body'):
        yield {
            # code for other elements in review
            'date': response.xpath('//td//div[@align="left"]//text()').getall(),
            'name': response.xpath('//td//div[@align="right"]//text()').getall(),
            # this includes the above elements, and is regular enough that I can systematically extract what I want
            'categories': response.xpath('//tr//td[@class="koumoku"]//text()').getall(),
            'scores': response.xpath('//tr//td[@class="tokuten_k"]//text()').getall(),
            'play_time': response.xpath('//td[@align="right"]//span[@id="setumei"]//text()').getall(),
            # reviews code here
        }
This is a pretty simple task if you use part of the text as an anchor (I used string() to get the text content of a whole td):
for review_node in response.xpath('//table[@width="645"]'):
    good = review_node.xpath('string(.//td[b[starts-with(., "良")]]/following-sibling::td[1])').get()
    bad = review_node.xpath('string(.//td[b[starts-with(., "悪")]]/following-sibling::td[1])').get()
    ...............
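If the cells are instead collected as one flat list of whole-cell strings, the "sets of 3" idea from the question could be sketched in plain Python as below. It assumes every review contributes exactly three cells in the order good, bad, impressions; clean and group_reviews are hypothetical helper names, not part of Scrapy.

```python
def clean(text):
    """Collapse newlines and runs of whitespace into single spaces."""
    return " ".join(text.split())


def group_reviews(cells):
    """Turn a flat list of cell strings into one dict per review."""
    fields = ("good", "bad", "impressions")
    # Ignore a trailing partial group rather than guess at its meaning.
    usable = len(cells) - len(cells) % len(fields)
    return [
        dict(zip(fields, (clean(c) for c in cells[i:i + len(fields)])))
        for i in range(0, usable, len(fields))
    ]
```

Anchoring on the 良 / 悪 labels as in the answer above is probably more robust, because it does not rely on every review having all three cells.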

Getting more than 100 days of data web scraping Yahoo

Like many others I have been looking for an alternative source of stock prices now that the Yahoo and Google APIs are defunct. I decided to take a try at web scraping the Yahoo site from which historical prices are still available. I managed to put together the following code which almost does what I need:
import urllib.request as web
import bs4 as bs
import pandas as pd

def yahooPrice(tkr):
    tkr = tkr.upper()
    url = 'https://finance.yahoo.com/quote/' + tkr + '/history?p=' + tkr
    sauce = web.urlopen(url)
    soup = bs.BeautifulSoup(sauce, 'lxml')
    table = soup.find('table')
    table_rows = table.find_all('tr')
    allrows = []
    for tr in table_rows:
        td = tr.find_all('td')
        row = [i.text for i in td]
        # keep only full price rows (dividend rows have fewer cells)
        if len(row) == 7:
            allrows.append(row)
    vixdf = pd.DataFrame(allrows).iloc[0:-1]
    vixdf.columns = ['Date', 'Open', 'High', 'Low', 'Close', 'Aclose', 'Volume']
    vixdf.set_index('Date', inplace=True)
    return vixdf
which produces a dataframe with the information I want. Unfortunately, even though the actual web page shows a full year's worth of prices, my routine only returns 100 records (including dividend records). Any idea how I can get more?
The Yahoo Finance API was deprecated in May '17, I believe. Now there are not too many options for downloading time series data for free, at least that I know of. Nevertheless, there is always some kind of alternative. Check out the URL below for a tool to download historical prices.
http://investexcel.net/multiple-stock-quote-downloader-for-excel/
See this too.
https://blog.quandl.com/api-for-stock-data
I don't have an exact solution to your question, but I have a workaround (I had the same problem and used this approach). You can use BDay() (from pandas.tseries.offsets import BDay) to step back a given number of business days when collecting the data. In my case, I ran the loop three times to get 300 business days of data, knowing that 100 was the maximum I got by default.
On the first iteration the BDay() offset is set to grab the most recent 100 days of data, on the second the 100 days before that (up to 200 days back), and on the third the 100 days before that (up to 300 days back). The point is that only 100 days of data can be scraped at any one time, so requesting 300 days in one go may not return 300 days of data, which is your original problem (possibly Yahoo limits the amount of data extracted per request). I have my code here: https://github.com/ee07kkr/stock_forex_analysis/tree/dataGathering
Note that for some reason the CSV files do not work with the \t delimiter in my case, but you can still use the data frame. One more issue I currently have is that 'Volume' is a string instead of a float. The way around it is:
apple = pd.read_csv('AAPL.csv', sep='\t', index_col=0, parse_dates=True)
apple['Volume'] = apple['Volume'].str.replace(',','').astype(float)
First - Run the code below to get your 100 days.
Then - Use SQL to insert the data into a small db (Sqlite3 is pretty easy to use with python).
Finally - Amend code below to then get daily prices which you can add to grow your database.
import bs4
import requests

def function():
    url = 'https://uk.finance.yahoo.com/quote/VOD.L/history?p=VOD.L'
    response = requests.get(url)
    soup = bs4.BeautifulSoup(response.text, 'html.parser')
    rows = soup.find_all('tr')
    # cell texts for every table row
    ts = [[td.getText() for td in row.find_all('td')] for row in rows]
    # keep only the first cell (the date) of each row that has cells
    data = [cells[0] for cells in ts if cells]
    days = 100
    num = min(days, len(data)) - 1
    # walk from the oldest of the (at most) 100 rows back to the newest
    while num >= 0:
        print(data[num])
        num = num - 1
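The second step above (inserting the scraped rows into a small SQLite database) might look like the sketch below. The table name prices, its column names, and the store_prices helper are assumptions made for illustration. It also assumes each price row is a list of seven strings in the order date, open, high, low, close, adjusted close, volume, and that every numeric cell parses once commas are removed.

```python
import sqlite3


def store_prices(conn, rows):
    """Insert 7-cell price rows, skipping dates that are already stored."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS prices ("
        "date TEXT PRIMARY KEY, open REAL, high REAL, low REAL, "
        "close REAL, adj_close REAL, volume REAL)"
    )
    cleaned = [
        # Strip thousands separators so '1,234,567' becomes 1234567.0.
        (row[0], *(float(value.replace(",", "")) for value in row[1:]))
        for row in rows
        if len(row) == 7  # dividend rows have fewer cells and are skipped
    ]
    conn.executemany(
        "INSERT OR IGNORE INTO prices VALUES (?, ?, ?, ?, ?, ?, ?)", cleaned
    )
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM prices").fetchone()[0]
```

Because date is the primary key and the insert uses OR IGNORE, running the scraper daily over the latest 100 rows should only add the days that are not in the table yet.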

How to pass url through two functions - Callback

Set-up
I'm scraping housing ads with scrapy: per housing ad I scrape several housing characteristics.
Scraping the housing characteristics works fine.
Problem
Besides the housing characteristics, I want to scrape one image per ad.
I have the following code:
import scrapy

class ApartmentSpider(scrapy.Spider):
    name = 'apartments'
    start_urls = [
        'http://www.jaap.nl/huurhuizen/noord+holland/groot-amsterdam/amsterdam'
    ]

    def parse(self, response):
        for href in response.xpath(
            '//*[@id]/a',
        ).css("a.property-inner::attr(href)").extract():
            yield scrapy.Request(response.urljoin(href),
                                 callback=self.parse_ad)  # parse_ad() scrapes housing characteristics
            yield scrapy.Request(response.urljoin(href),
                                 callback=self.parse_AdImage)  # parse_AdImage() obtains one image per ad
So, I've got two yield commands, which does not work. That is, I get the characteristics, but not the images.
I can comment the first one, such that I get the images.
How do I fix this such that I get both? Thanks in advance.
Just yield them both together.
yield (scrapy.Request(response.urljoin(href), callback=self.parse_ad), scrapy.Request(response.urljoin(href), callback=self.parse_AdImage))
On the receiving end, grab both as separate values
characteristics, image = ApartmentSpider.parse(response)
I have two major suggestions:
Number 1
I would strongly suggest re-working your code to actually farm out all the info at the same time. Instead of having two separate parse_X functions...just have one that gets the info and returns a single item.
Number 2
Implement a Spider Middleware that does merging/splitting similar to what I have below for pipelines. A simple example middleware is https://github.com/scrapy/scrapy/blob/ebef6d7c6dd8922210db8a4a44f48fe27ee0cd16/scrapy/spidermiddlewares/urllength.py. You would simply merge items and track them here before they enter the itempipelines.
WARNING DO NOT DO WHAT's BELOW. I WAS GOING TO SUGGEST THIS, AND THE CODE MIGHT WORK...BUT WITH SOME POTENTIALLY HIDDEN ISSUES.
IT IS HERE FOR COMPLETENESS OF WHAT I WAS RESEARCHING -- IT IS RECOMMENDED AGAINST HERE:https://github.com/scrapy/scrapy/issues/1915
Use the item processing pipelines in Scrapy. They are incredibly useful for accumulating data. Have an item joiner pipeline whose purpose is to wait for the two separate partial data items, concatenate them into one item, and key them on the ad id (or some other unique piece of data).
In rough, not-runnable pseudocode:
class HousingItemPipeline(object):
    def __init__(self):
        self.assembledItems = dict()

    def process_item(self, item, spider):
        if isinstance(item, PartialAdHousingItem):
            self.assembledItems[unique_id] = AssembledHousingItem()
            self.assembledItems[unique_id]['field_of_interest'] = ...
            ...assemble more data
            raise DropItem("Assembled its data")
        if isinstance(item, PartialAdImageHousingItem):
            self.assembledItems[unique_id]['field_of_interest'] = ...
            ...assemble more data
            raise DropItem("Assembled its data")
        if Fully Assembled:
            return self.assembledItems.pop(unique_id)

How do I store crawled data into a database

I'm fairly new to Python and to everything else I'm about to talk about in this question, but I want to get started on a project I've been thinking about for some time now. Basically, I want to crawl the web and display the URLs on a web page in real time as they are crawled. I coded a simple crawler which stores the URLs in a list. I was wondering how to get this list into a database and have the database updated every x seconds, so that I can access the database and periodically output the list of links on the web page.
I don't know much about real-time web development, but that's a topic for another day. Right now I'm more concerned with how to get the list into the database. I'm currently using the web2py framework, which is quite easy to get along with, but if you have any recommendations as to where I should look or which frameworks I should check out, please mention that in your answers too. Thanks.
In a nutshell, the things I'm a noob at are: Python, databases, real-time web dev.
Here's the code for my crawler, if it helps in any way :) Thanks.
from urllib2 import urlopen

def crawler(url, x):
    crawled = []
    tocrawl = []

    def crawl(url, x):
        x = x + 1
        try:
            page = urlopen(url).read()
            findlink = page.find('<a href=')
            if findlink == -1:
                return None, 0
            while findlink != -1:
                start = page.find('"', findlink)
                end = page.find('"', start + 1)
                link = page[start + 1:end]
                if link:
                    if link != url:
                        if link[0] == '/':
                            link = url + link
                            link = replace(link)
                        if (link not in tocrawl) and (link != "") and (link not in crawled):
                            tocrawl.append(link)
                findlink = page.find('<a href=', end)
            crawled.append(url)
            while tocrawl:
                crawl(tocrawl[x], x)
        except:
            # keep crawling
            crawl(tocrawl[x], x)

    crawl(url, x)

def replace(link):
    tsp = link.find('//')
    if tsp == -1:
        return link
    link = link[0:tsp] + '/' + link[tsp + 2:]
    return link
Instead of placing the URLs into a list, why not write them to the database directly? For example, using MySQL:
import MySQLdb
conn = MySQLdb.connect('server', 'user', 'pass', 'db')
curs = conn.cursor()
# let the driver quote the values instead of formatting them into the SQL string
sql = 'INSERT INTO your_table VALUES (%s, %s)'
rc = curs.execute(sql, (id, str(link)))
conn.commit()
conn.close()
This way you don't have to manage the list like a pipe. But if that is necessary, this can also be adapted for that method.
This sounds like a good job for Redis, which has a built-in list structure. Appending a new URL to your list is as simple as:
from redis import Redis
red = Redis()
# Later in your code...
red.lpush('crawler:tocrawl', link)
It also has a set type that lets you efficiently check which websites you've crawled and lets you sync multiple crawlers.
# Check if we're the first one to mark this link
if red.sadd('crawler:crawled', link):
red.lpush('crawler:tocrawl', link)
To get the next link to crawl:
url = red.lpop('crawler:tocrawl')
To see which urls are queued to be crawled:
print(red.lrange('crawler:tocrawl', 0, -1))
It's just one option, but it is very fast and flexible. You can find more documentation on the redis python driver page.
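For readers without a Redis server at hand, the same check-then-enqueue pattern can be sketched with in-memory structures. This is a stand-in for illustration, not the Redis API: Frontier and its method names are hypothetical, and unlike Redis the state is neither shared between processes nor persisted.

```python
from collections import deque


class Frontier:
    """In-memory stand-in for the Redis set and list used above."""

    def __init__(self):
        self._seen = set()     # plays the role of 'crawler:crawled'
        self._queue = deque()  # plays the role of 'crawler:tocrawl'

    def add(self, link):
        # Like the sadd check: enqueue only the first time a link is seen.
        if link in self._seen:
            return False
        self._seen.add(link)
        self._queue.append(link)
        return True

    def next(self):
        # Return the oldest queued link, or None when nothing is left.
        return self._queue.popleft() if self._queue else None

    def pending(self):
        return list(self._queue)
```

This sketch hands links out in first-in, first-out order. The lpush and lpop pair in the Redis snippet pops the most recently pushed link first; pairing lpush with rpop would give the same first-in, first-out order there.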
To achieve this you need a Cron. A cron is a job scheduler for Unix-like computers. You can schedule a cron job to go every minute, every hour, every day, etc.
Check out this tutorial http://newcoder.io/scrape/intro/ and it will help you achieve what you want here.
Thanks. Info if it works.
