I am scraping some data and making a lot of requests to Reddit's Pushshift API. Along the way I keep hitting HTTP errors, which halt all progress. Is there any way I can continue where I left off when an error occurs?
import json
import time
from urllib.request import urlopen

X = []
for i in ticklist:
    f = urlopen("https://api.pushshift.io/reddit/search/submission/?q={tick}&subreddit=wallstreetbets&metadata=true&size=0&after=1610928000&before=1613088000".format(tick=i))
    j = json.load(f)
    subs = j['metadata']['total_results']  # number of submissions mentioning the ticker
    X.append(subs)
    print('{tick} has been scraped!'.format(tick=i))
    time.sleep(1)
So far I've mitigated the 429 errors by waiting a second between requests, but I'm still hitting connection timeouts. I'm not sure how to proceed efficiently without wasting a lot of time rerunning the code and hoping for the best.
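For reference, this is roughly the retry-and-continue wrapper I have in mind (an untested sketch that reuses the names from the snippet above): each ticker is retried a few times, and if it still fails the loop moves on instead of halting.

import json
import time
from urllib.error import HTTPError, URLError
from urllib.request import urlopen

X = []
for i in ticklist:
    for attempt in range(5):  # retry each ticker a few times before giving up
        try:
            f = urlopen("https://api.pushshift.io/reddit/search/submission/?q={tick}&subreddit=wallstreetbets&metadata=true&size=0&after=1610928000&before=1613088000".format(tick=i))
            j = json.load(f)
            X.append(j['metadata']['total_results'])
            print('{tick} has been scraped!'.format(tick=i))
            break  # success: move on to the next ticker
        except (HTTPError, URLError):
            time.sleep(5)  # back off, then retry the same ticker
    time.sleep(1)  # stay under the rate limit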
Python SQLite approach (reference: https://www.tutorialspoint.com/sqlite/sqlite_python.htm):
Create an SQLite database.
Create a table with the URLs to be scraped, with a schema like CREATE TABLE COMPANY (url TEXT NOT NULL UNIQUE, Status TEXT NOT NULL DEFAULT 'Not started').
Now read only the rows whose status is 'Not started'.
You can change the Status column of a URL to 'Success' once its scraping is done.
So wherever the script starts, it only runs for the ones not yet started.
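A minimal sketch of that flow, assuming the table above (base_url and ticklist are placeholders for however you build your request URLs):

import sqlite3

conn = sqlite3.connect('scrape_progress.db')
conn.execute("CREATE TABLE IF NOT EXISTS COMPANY "
             "(url TEXT NOT NULL UNIQUE, Status TEXT NOT NULL DEFAULT 'Not started')")

# Seed the table once with every URL that has to be scraped.
for i in ticklist:
    conn.execute('INSERT OR IGNORE INTO COMPANY (url) VALUES (?)', (base_url.format(tick=i),))
conn.commit()

# On every run, pick up only the URLs that are not finished yet.
pending = conn.execute("SELECT url FROM COMPANY WHERE Status = 'Not started'").fetchall()
for (url,) in pending:
    # ... scrape url here ...
    conn.execute("UPDATE COMPANY SET Status = 'Success' WHERE url = ?", (url,))
    conn.commit()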
This is my first question, so apologies in advance.
I am trying to automate filling in an online form, which has to be done one case at a time.
Some inputs request information from the server and freeze the webpage with a loading spinner GIF, after which some fields are autocompleted.
I am having issues at the end of the first entry, where I need to click ADD to submit the current info so that everything resets and the process can start again.
The exception ElementClickInterceptedException is raised in spite of using fluent waits. I have tried several approaches using XPath or the script executor, but it throws the same error.
Any thoughts?
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import Select, WebDriverWait

# send ID 1 and 2
driver.find_element(By.ID, 'bodyContent_PRD_EFE_txtNroIDPrestador').send_keys(ID)
driver.find_element(By.ID, 'bodyContent_PRD_PRE_txtNroIDPrestador').send_keys(ID2)

for i in prestaciones.index:  # prestaciones is a pd.DataFrame where I store the data to fill the form
    afi = WebDriverWait(driver, 5).until(
        EC.element_to_be_clickable((By.ID, 'bodyContent_PID_txtNroIDAfiliado')))  # store the input element
    if afi.get_attribute('value') == '':  # check if it's empty and fill it
        afi.send_keys(str(prestaciones['n af'][i]))
    else:
        driver.find_element(By.ID, 'bodyContent_PID_btnNroIDAfiliado_Clean').click()
        afi.send_keys(str(prestaciones['n af'][i]))
    # select something from a drop-down list
    prog_int = Select(driver.find_element(By.ID, 'bodyContent_PV1Internacion_selTipoAdmision'))
    prog_int.select_by_value('P')
    # fill another input
    diag = driver.find_element(By.ID, 'bodyContent_DG1_txtNroIDDiagnostico').get_attribute('value')
    if diag == '':
        driver.find_element(By.ID, 'bodyContent_DG1_txtNroIDDiagnostico').send_keys('I10')
    # select more inputs
    tip_prac = Select(driver.find_element(By.ID, 'bodyContent_PRE_selSistemaCodificacion'))
    tip_prac.select_by_value('1')
    # Codigo de prestacion
    prest = driver.find_element(By.ID, 'bodyContent_PRE_txtCodigoPrestacion')
    if prest.get_attribute('value') == '':  # deal with data left in the input for the next round of loading
        prest.send_keys(str(prestaciones['codigo'][i]))
    else:
        prest.clear()
        prest.send_keys(str(prestaciones['codigo'][i]))
    # select amount of items
    cant = driver.find_element(By.ID, 'bodyContent_PRE_txtCantidadTratamientoSolicitados').get_attribute('value')
    if cant == '':
        driver.find_element(By.ID, 'bodyContent_PRE_txtCantidadTratamientoSolicitados').send_keys('1')
    # HERE IS THE DEAL: some fields make a loading GIF appear, which intercepts the click.
    # I have tried several ways and it throws that exception, or a timeout with execute_script.
    aceptar = WebDriverWait(driver, 10).until(
        EC.element_to_be_clickable((By.ID, "bodyContent_PRE_btnTablaPrestaciones_Add")))
    aceptar.click()
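For what it's worth, the direction I've been trying goes along these lines: explicitly wait for the loading overlay to disappear before clicking, and fall back to a JavaScript click if it is still intercepted. This is a sketch only; 'loadingSpinner' is a placeholder for the real overlay's ID.

from selenium.common.exceptions import ElementClickInterceptedException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# Wait until the loading overlay is gone before trying to click (the overlay ID is a guess).
WebDriverWait(driver, 20).until(
    EC.invisibility_of_element_located((By.ID, 'loadingSpinner')))

aceptar = WebDriverWait(driver, 10).until(
    EC.element_to_be_clickable((By.ID, 'bodyContent_PRE_btnTablaPrestaciones_Add')))
try:
    aceptar.click()
except ElementClickInterceptedException:
    # Fall back to a JavaScript click, which the overlay cannot intercept.
    driver.execute_script('arguments[0].click();', aceptar)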
I want to use the forex-python module to convert amounts in various currencies to a single currency ('DKK') at the rate of a specific date: the last day of the month before the date in the dataframe.
This is the structure of my code:
import numpy as np
import pandas as pd
from datetime import datetime, timedelta
from forex_python.converter import CurrencyRates

c = CurrencyRates()
df = pd.DataFrame(data={'Date': ['2017-4-15', '2017-6-12', '2017-2-25'],
                        'Amount': [5, 10, 15], 'Currency': ['USD', 'SEK', 'EUR']})

def convert_rates(amount, currency, PstngDate):
    PstngDate = datetime.strptime(PstngDate, '%Y-%m-%d')
    if currency != 'DKK':
        return c.convert(base_cur=currency, dest_cur='DKK', amount=amount,
                         date_obj=PstngDate - timedelta(PstngDate.day))
    else:
        return amount
and then the new column with the converted amounts:
df['Amount, DKK'] = np.vectorize(convert_rates)(
amount=df['Amount'],
currency=df['Currency'],
PstngDate=df['Date']
)
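As a sanity check on the date arithmetic (a small standalone example, not part of the original code): subtracting timedelta(PstngDate.day) does land on the last day of the previous month.

from datetime import datetime, timedelta

d = datetime.strptime('2017-4-15', '%Y-%m-%d')
print(d - timedelta(d.day))  # 2017-03-31 00:00:00, the last day of the previous month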
I get the RatesNotAvailableError "Currency Rates Source Not Ready"
Any idea what can cause this? It has previously worked with small amounts of data, but I have many rows in my real df...
I inserted a small print statement into convert.py (part of forex-python) to debug this.
print(response.status_code)
Currently I receive:
502
Read these threads about the HTTP 502 error:
In HTTP 502, what is meant by an invalid response?
https://www.lifewire.com/502-bad-gateway-error-explained-2622939
These errors are completely independent of your particular setup, meaning that you could see one in any browser, on any operating system, and on any device.
A 502 indicates that there is currently a problem with the infrastructure this API uses to provide us with the required data. As I need the data myself, I will continue to monitor this issue and keep my post here updated.
There is already an open issue about this on GitHub:
https://github.com/MicroPyramid/forex-python/issues/100
From the source: https://github.com/MicroPyramid/forex-python/blob/80290a2b9150515e15139e1a069f74d220c6b67e/forex_python/converter.py#L73
Your error means the library received a non-200 response code to your request. This could mean the site is down, or that it has blocked you temporarily because you're hammering it with requests.
Try replacing the call to c.convert with something like:
from time import sleep

def try_convert(amount, currency, PstngDate):
    success = False
    while success == False:
        try:
            res = c.convert(base_cur=currency, dest_cur='DKK', amount=amount,
                            date_obj=PstngDate - timedelta(PstngDate.day))
            success = True  # only reached if the conversion worked
        except:
            # wait a while before retrying
            sleep(10)
    return res
Or even better, use a library like backoff to do the retrying for you:
https://pypi.python.org/pypi/backoff/1.3.1
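For example, something along these lines retries the conversion with exponential backoff (a sketch; I'm assuming forex-python's RatesNotAvailableError is the exception worth retrying on):

import backoff
from forex_python.converter import CurrencyRates, RatesNotAvailableError

c = CurrencyRates()

@backoff.on_exception(backoff.expo, RatesNotAvailableError, max_tries=8)
def convert_with_retry(amount, currency, date_obj):
    # Retried automatically with exponentially growing delays when the rate source is unavailable.
    return c.convert(base_cur=currency, dest_cur='DKK', amount=amount, date_obj=date_obj)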
I am currently trying to make a call using mailchimp3:
client.reports.email_activity.all(campaign_id = '#######', getall=True, fields = '######')
When I call for a large amount of email activity data I get a 401 status error; I am able to pull smaller amounts with no error.
I tried increasing the timeout:
request.get(timeout=10000)
Pulling large amounts of data in one call can cause this timeout error. Instead, it is best to use the offset method:
client.reports.email_activity.all(campaign_id = '#####', offset = #, count = #, fields = '######')
With this you can loop through and make a series of smaller calls instead of one large call that triggers the timeout.
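A rough sketch of that loop (the page size is arbitrary, and I'm assuming the per-email records come back under an 'emails' key in each response):

activity = []
offset = 0
count = 500  # page size: keep it small enough to stay under the timeout

while True:
    page = client.reports.email_activity.all(
        campaign_id='#######', offset=offset, count=count, fields='######')
    emails = page.get('emails', [])
    if not emails:
        break  # no more pages left
    activity.extend(emails)
    offset += count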
Thanks to the Mailchimp support staff who helped me troubleshoot this issue.
I have a Python script that works well for a few numbers:
def ncpr(searchnm):
    import urllib2
    from bs4 import BeautifulSoup
    mynumber = searchnm
    url = "http://www.domain.com/saveSearchSub.misc?phoneno=" + mynumber
    soup = BeautifulSoup(urllib2.urlopen(url))
    header = soup.find('td', class_='GridHeader')
    result = []
    for row in header.parent.find_next_siblings('tr'):
        cells = row.find_all('td')
        try:
            result.append(cells[2].get_text(strip=True))
        except IndexError:
            continue
    if not result:
        return mynumber  # no result rows: the number is not in the database

with open("Output.txt3", "w") as text_file:
    for i in range(9819838100, 9819838200):
        myl = str(ncpr(str(i)))
        if myl != 'None':
            text_file.write(myl + '\n')
It checks a range of 100 numbers and returns each number that is not present in the database. It takes a few seconds to process 100 records.
I need to process a million numbers, starting from different ranges.
For example:
9819800000 9819900000
9819200000 9819300000
9829100000 9829200000
9819100000 9819200000
7819800000 7819900000
8819800000 8819900000
9119100000 9119200000
9119500000 9119600000
9119700000 9119800000
9113100000 9113200000
This dictionary will be generated from the list supplied:
mylist = [98198, 98192, 98291, 98191, 78198, 88198, 91191, 91195, 91197, 91131]
mydict = {}
for mynumber in mylist:
    start_range = int(str(mynumber) + '00000')
    end_range = int(str(mynumber + 1) + '00000')
    mydict[start_range] = end_range
I need to use threads in such a way that I can check 1 million records as quickly as possible.
The problem with your code is not so much how to parallelize it as the fact that you query a single number per request. That means processing a million numbers will generate a million requests, over a million separate HTTP sessions on a million new TCP connections to www.nccptrai.gov.in. I don't think the webmaster will enjoy that.
Instead, you should find a way to get a database dump of some kind. If that's impossible, restructure your code to reuse a single connection to issue multiple requests. That's discussed here: How to Speed Up Python's urllib2 when doing multiple requests
By issuing all your requests on a single connection you avoid a ton of overhead, and will experience greater throughput as well, hopefully culminating in being able to send a single packet per request and receive a single packet per response. If you live outside India, far from the server, you may benefit quite a bit from HTTP Pipelining as well, where you issue multiple requests without waiting for earlier responses. There's a sort of hack that demonstrates that here: http://code.activestate.com/recipes/576673-python-http-pipelining/ - but beware this may again get you in more trouble with the site's operator.
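As an illustration only, here is a sketch of reusing one keep-alive connection with requests.Session, assuming the same page structure as the script in the question:

import requests
from bs4 import BeautifulSoup

def ncpr(session, number):
    url = "http://www.domain.com/saveSearchSub.misc?phoneno=" + number
    soup = BeautifulSoup(session.get(url).text, "html.parser")
    header = soup.find('td', class_='GridHeader')
    rows = header.parent.find_next_siblings('tr')
    result = [r.find_all('td')[2].get_text(strip=True)
              for r in rows if len(r.find_all('td')) > 2]
    return number if not result else None  # return the number only when it is missing

session = requests.Session()  # one pooled, keep-alive connection reused for every request
missing = [n for n in (str(i) for i in range(9819838100, 9819838200)) if ncpr(session, n)]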
I'm fairly new to Python and everything else I'm about to talk about in this question, but I want to get started on a project I've been thinking about for some time now. Basically, I want to crawl the web and display the URLs in real time on a web page as they are crawled. I coded a simple crawler which stores the URLs in a list. I was wondering how to get this list into a database and have the database updated every x seconds, so that I can access it and output the list of links on the web page periodically.
I don't know much about real-time web development, but that's a topic for another day. Right now I'm more concerned with how to get the list into the database. I'm currently using the web2py framework, which is quite easy to get along with, but if you have any recommendations about where I should look or which frameworks I should check out, please mention that in your answers too. Thanks.
In a nutshell, the things I'm a noob at are: Python, databases, and real-time web development.
Here's the code for my crawler, if it helps in any way :) Thanks.
from urllib2 import urlopen

def crawler(url, x):
    crawled = []
    tocrawl = []

    def crawl(url, x):
        x = x + 1
        try:
            page = urlopen(url).read()
            findlink = page.find('<a href=')
            if findlink == -1:
                return None, 0
            while findlink != -1:
                start = page.find('"', findlink)
                end = page.find('"', start + 1)
                link = page[start + 1:end]
                if link:
                    if link != url:
                        if link[0] == '/':
                            link = url + link
                        link = replace(link)
                        if (link not in tocrawl) and (link != "") and (link not in crawled):
                            tocrawl.append(link)
                findlink = page.find('<a href=', end)
            crawled.append(url)
            while tocrawl:
                crawl(tocrawl[x], x)
        except:
            # keep crawling
            crawl(tocrawl[x], x)

    crawl(url, x)

def replace(link):
    tsp = link.find('//')
    if tsp == -1:
        return link
    link = link[0:tsp] + '/' + link[tsp + 2:]
    return link
Instead of placing the URLs into a list, why not write them to the database directly? For example, using MySQL:
import MySQLdb

conn = MySQLdb.connect('server', 'user', 'pass', 'db')
curs = conn.cursor()
curs.execute('INSERT INTO your_table VALUES (%s, %s)', (id, str(link)))  # parameterized so the values are quoted safely
conn.commit()
conn.close()
This way you don't have to manage the list like a pipe. But if that is necessary, this approach can also be adapted for it.
This sounds like a good job for Redis, which has a built-in list structure. To append a new URL to your list, it's as simple as:
from redis import Redis
red = Redis()
# Later in your code...
red.lpush('crawler:tocrawl', link)
It also has a set type that lets you efficiently check which websites you've crawled and lets you sync multiple crawlers.
# Check if we're the first one to mark this link
if red.sadd('crawler:crawled', link):
red.lpush('crawler:tocrawl', link)
To get the next link to crawl:
url = red.lpop('crawler:tocrawl')
To see which urls are queued to be crawled:
print red.lrange('crawler:tocrawl', 0, -1)
It's just one option, but it is very fast and flexible. You can find more documentation on the redis-py driver page.
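Putting those pieces together, a worker loop might look roughly like this (a sketch; crawl_page is a placeholder for whatever function extracts the links from a page):

while True:
    url = red.lpop('crawler:tocrawl')
    if url is None:
        break  # the queue is empty for now
    if red.sadd('crawler:crawled', url):  # only process links we have not seen before
        for link in crawl_page(url):  # crawl_page: stand-in for your own crawler
            red.lpush('crawler:tocrawl', link)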
To achieve this you need cron, a job scheduler for Unix-like systems. You can schedule a cron job to run every minute, every hour, every day, etc.
Check out this tutorial, which will help you achieve what you want here: http://newcoder.io/scrape/intro/
Thanks. I'll post an update once I know whether it works.