Optimizing a web-scraper Python loop

I'm scraping articles from a news site on behalf of the owner. I have to keep it to <= 5 requests per second, which works out to ~100k articles in 6 hours (overnight), but I'm getting ~30k at best.
Running in a Jupyter notebook, the script is fine at first but becomes less and less responsive. After 6 hours the kernel is usually uninterruptible and I have to restart it. Since I'm storing each article in memory, this is a problem.
So my question is: is there a more efficient way to do this that will reach ~100k articles in 6 hours?
The code is below. For each valid URL in a Pandas dataframe column, the loop:
downloads the webpage
extracts the relevant text
cleans out some encoding garbage from the text
writes that text to another dataframe column
every 2,000 articles, saves the dataframe to a CSV (overwriting the last backup) to guard against the eventual crash of the script.
Some ideas I've considered:
Write each article to a local SQL server instead of keeping it in memory (speed concerns?)
Save each article's text to a CSV along with its URL, then build the dataframe later
Delete all the print() calls and rely solely on logging (my logger config doesn't seem to work all that well, though; I'm not sure it's logging everything I tell it to)
i = 0
# lots of NaNs in the column, hence the subsetting
for u in unique_urls[unique_urls['unique_suffixes'].isnull() == False] \
        .unique_suffixes.values[:]:
    i = i + 1
    if pd.isnull(u):
        continue
    # save our progress every 2k articles just in case
    if i % 2000 == 0:
        unique_urls.to_csv('/backup-article-txt.csv', encoding='utf-8')
    try:
        # pull the data
        html_r = requests.get(u).text
        # the phrase "TX:" indicates start of article
        # text, so if it's not present, URL must have been bad
        if html_r.find("TX:") == -1:
            continue
        # capture just the text of the article
        txt = html_r[html_r.find("TX:")+5:]
        # fix encoding/formatting quirks
        txt = txt.replace('\n', ' ')
        txt = txt.replace('[^\x00-\x7F]', '')
        # wait 200 ms to spare site's servers
        time.sleep(.2)
        # write our article to our dataframe
        unique_urls.loc[unique_urls.unique_suffixes == u, 'article_text'] = txt
        logging.info("done with url # %s -- %s remaining", i, (total_links - i))
        print "done with url # " + str(i)
        print total_links - i
    except:
        logging.exception("Exception on article # %s, URL: %s", i, u)
        print "ERROR with url # " + str(i)
        continue
This is the logging config I'm using. I found it on SO, but with this particular script it doesn't seem to capture everything.
logTime = "{:%d %b-%X}".format(datetime.datetime.now())
logger = logging.getLogger()
fhandler = logging.FileHandler(filename=logTime + '.log', mode='a')
formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
fhandler.setFormatter(formatter)
logger.addHandler(fhandler)
logger.setLevel(logging.INFO)
ETA: some details in response to answers/comments:
the script is the only thing running on a 16 GB RAM EC2 instance
articles are ~100-800 words apiece

I'm going to take an educated guess and say that, based on your description, your script turns your machine into a swap storm by the time you get to around 30k articles. I don't see anywhere in your code where you free up memory, which you could easily do with:
some_large_container = None
Setting something that you know has a large allocation to None tells Python's memory manager that it's available for garbage collection. You also might want to explicitly call gc.collect(), but I'm not sure that would do you much good.
Alternatives you could consider:
sqlite3: Instead of a remote SQL database, use sqlite3 as intermediate storage. There is a Python module for it in the standard library.
Keep appending to the CSV checkpoint file.
Compress your strings with zlib.compress().
Whichever way you decide to go, you're probably best off doing the collection as phase 1 and constructing the Pandas dataframe as phase 2. It never pays off to be clever by half; the other half tends to hang you.
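For instance, a minimal sketch of that two-phase approach with sqlite3 as the intermediate store (table/column names and the fetch_article() helper are just placeholders for your requests / "TX:" slicing logic):
import sqlite3
import pandas as pd

conn = sqlite3.connect('articles.db')
conn.execute('CREATE TABLE IF NOT EXISTS articles (url TEXT PRIMARY KEY, txt TEXT)')

# phase 1: fetch and persist each article as you go, so nothing large stays in memory
for u in urls:                       # urls = your non-null unique_suffixes values
    txt = fetch_article(u)           # placeholder for the requests.get / "TX:" slicing above
    if txt is None:
        continue
    conn.execute('INSERT OR REPLACE INTO articles VALUES (?, ?)', (u, txt))
    conn.commit()

# phase 2: build the dataframe in one shot once collection is finished
df = pd.read_sql_query('SELECT url, txt AS article_text FROM articles', conn)
conn.close()
Keying on the URL also buys you restartability: on the next run you can skip any URL that is already in the table instead of starting over.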

Related

What's a Fail-safe Way to Update Python CSV

This is my function for keeping a record of the actions users perform, stored in a CSV. It gets the username from a global and adds the increment given in the amount parameter to the specific cell of the CSV matching the user's row and today's date's column.
In brief, the function reads the CSV into a list, modifies the data, and then rewrites the whole list back into the CSV file.
The first item of each row is the username, and the header row holds the dates.
Accs\Dates,12/25/2016,12/26/2016,12/27/2016
user1,217,338,653
user2,261,0,34
user3,0,140,455
However, I'm not sure why the header sometimes gets pushed down to the second row, and why the data gets wiped entirely when it crashes.
Also, I should point out that there may be multiple scripts running this function and writing to the same file; I'm not sure if that's causing the issue.
I'm thinking maybe I could write the stats separately and uniquely for each user and combine them later, eliminating the possible clash when writing. Although it would be great if I could just improve what I have here and keep reading/writing everything in one file.
Any fail-safe way to do what I'm trying to do here?
# Search current user in first rows and updating the count on the column (today's date)
# 'amount' will be added to the respective position
def dailyStats(self, amount, code=None):

    def initStats():
        # prepping table
        with open(self.stats, 'r') as f:
            reader = csv.reader(f)
            for row in reader:
                if row:
                    self.statsTable.append(row)
                    self.statsNames.append(row[0])

    def getIndex(list, match):
        # get the index of the matched date or user
        for i, j in enumerate(list):
            if j == match:
                return i

    self.statsTable = []
    self.statsNames = []
    self.statsDates = None
    initStats()

    today = datetime.datetime.now().strftime('%m/%d/%Y')
    user_index = None
    today_index = None

    # append header if the csv is empty
    if len(self.statsTable) == 0:
        self.statsTable.append([r'Accs\Dates'])
        # rebuild updated table
        initStats()

    # add new user/date if not found in first row/column
    self.statsDates = self.statsTable[0]
    if getIndex(self.statsNames, self.username) is None:
        self.statsTable.append([self.username])
    if getIndex(self.statsDates, today) is None:
        self.statsDates.append(today)

    # rebuild statsNames after table appended
    self.statsNames = []
    for row in self.statsTable:
        self.statsNames.append(row[0])

    # getting the index of user (row) and date (column)
    user_index = getIndex(self.statsNames, self.username)
    today_index = getIndex(self.statsDates, today)

    # the row where user is matched, if there are previous dates than today which
    # has no data, append 0 (e.g. user1,0,0,0,) until the column where today's date is match
    if len(self.statsTable[user_index]) < today_index + 1:
        for i in range(0, today_index + 1 - len(self.statsTable[user_index])):
            self.statsTable[user_index].append(0)

    # insert pv or tb code if found
    if code is None:
        self.statsTable[user_index][today_index] = amount + int(re.match(r'\b\d+?\b', str(self.statsTable[user_index][today_index])).group(0))
    else:
        self.statsTable[user_index][today_index] = str(re.match(r'\b\d+?\b', str(self.statsTable[user_index][today_index])).group(0)) + ' - ' + code

    # Writing final table
    with open(self.stats, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerows(self.statsTable)

    # return the summation of the user's total count
    total_follow = 0
    for i in range(1, len(self.statsTable[user_index])):
        total_follow += int(re.match(r'\b\d+?\b', str(self.statsTable[user_index][i])).group(0))
    return total_follow
As David Z says, concurrency is the most likely cause of your problem.
I will add that the CSV format is not suitable for database-style storage, indexing, or sorting, because it is plain text and sequential.
You could handle this by using an RDBMS to store and update your data and periodically process your stats; CSV then becomes just an import/export format.
Python ships a SQLite binding in its standard library. If you build a connector that imports/updates the CSV content into a SQLite schema and then dumps the results back out as CSV, you can handle concurrency and keep your native format without worrying about installing a database server or new Python packages.
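As a rough sketch of that idea (the function name is made up, and the schema here is one row per user and date rather than one column per date):
import datetime
import sqlite3

def daily_stats(db_path, username, amount):
    # hypothetical helper: SQLite serialises concurrent writers for us
    today = datetime.datetime.now().strftime('%m/%d/%Y')
    conn = sqlite3.connect(db_path, timeout=30)  # waits if another script holds the write lock
    conn.execute('CREATE TABLE IF NOT EXISTS stats '
                 '(user TEXT, date TEXT, total INTEGER, PRIMARY KEY (user, date))')
    with conn:  # one atomic transaction per update
        conn.execute('INSERT OR IGNORE INTO stats VALUES (?, ?, 0)', (username, today))
        conn.execute('UPDATE stats SET total = total + ? WHERE user = ? AND date = ?',
                     (amount, username, today))
    total = conn.execute('SELECT SUM(total) FROM stats WHERE user = ?',
                         (username,)).fetchone()[0]
    conn.close()
    return total
Two scripts calling this at the same time simply wait for each other instead of corrupting the file; exporting back to your Accs\Dates CSV layout is then a separate, read-only step.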
Also, I should point out that there may be multiple scripts running this function and writing to the same file; I'm not sure if that's causing the issue.
More likely than not that is exactly your issue. When two things are trying to write to the same file at the same time, the outputs from the two sources can easily get mixed up together, resulting in a file full of gibberish.
An easy way to fix this is just what you mentioned in the question: have each different process (or thread) write to its own file, and then have separate code to combine all those files at the end. That's what I would probably do.
If you don't want to do that, what you can do is have different processes/threads send their information to an "aggregator process", which puts everything together and writes it to the file - the key is that only the aggregator ever writes to the file. Of course, doing that requires you to build in some method of interprocess communication (IPC), and that in turn can be tricky, depending on how you do it. Actually, one of the best ways to implement IPC for simple programs is by using temporary files, which is just the same thing as in the previous paragraph.
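Here is a minimal sketch of the first option, one file per process plus a separate combine step (the file naming is arbitrary):
import csv
import datetime
import glob
import os

def log_stat(username, amount):
    # each process appends only to its own file, so writes never interleave
    own_file = 'stats_%d.csv' % os.getpid()
    with open(own_file, 'a', newline='') as f:
        csv.writer(f).writerow([username,
                                datetime.datetime.now().strftime('%m/%d/%Y'),
                                amount])

def combine():
    # run separately, after the workers are done, to rebuild the Accs\Dates table
    totals = {}
    for path in glob.glob('stats_*.csv'):
        with open(path, newline='') as f:
            for user, date, amount in csv.reader(f):
                totals[(user, date)] = totals.get((user, date), 0) + int(amount)
    return totals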

Django streaming CSV stops prematurely

I'm using the following code to stream a table and all its fields to a CSV and then back to the user. It's more or less just the example from Django's website.
# CSV header
header = [field.name for field in model._meta.fields]
results = model.objects.all()
# Generate a streaming output in case the file is large
h = []
h.append(header)
rows = ([getattr(item, field) for field in header] for item in results)
# Need to add the header to the front
chained = itertools.chain(h, rows)
pseudo_buffer = Echo()  # file-like object that just returns the value when write is called
writer = csv.writer(pseudo_buffer)
response = StreamingHttpResponse((writer.writerow(row) for row in chained), content_type="text/csv")
filename = "{}_{}_{}.csv".format(app_name, model_name, date_str)
response['Content-Disposition'] = 'attachment; filename="{}"'.format(filename)
return response
The issue seems to be that for larger datasets it will randomly stop the stream before it has actually finished getting all the data from the table. The total file size should be about 12 MB, but it stops streaming randomly anywhere from 500 KB to about 8 MB in.
I have only seen this in our production environment; I get the whole file when I do it in my development setup. It's all running in Docker containers, so in theory it's the same setup in both cases. Not sure if there are other server-related settings, though, that could be causing this?
Our devops guy did say he increased the load balancer timeouts, but I'm under the impression the streaming response shouldn't cause timeouts anyway because it's constantly sending data.
Is it possible it has to do with how the query is being executed, and it thinks it's done before it actually is (maybe poor query performance - a lot of joins, etc)?
Thanks everyone -

For loop with Python for ArcGIS using Add Field and Field Calculator

I'll try to give a brief background here. I recently received a large amount of data that was all digitized from paper maps. Each map was saved as an individual file that contains a number of records (polygons mostly). My goal is to merge all of these files into one shapefile or geodatabase, which is an easy enough task. However, other than spatial information, the records in the file do not have any distinguishing information so I would like to add a field and populate it with the original file name to track its provenance. For example, in the file "505_dmg.shp" I would like each record to have a "505_dmg" id in a column in the attribute table labeled "map_name". I am trying to automate this using Python and feel like I am very close. Here is the code I'm using:
# Import system module
import arcpy
from arcpy import env
from arcpy.sa import *

# Set overwrite on/off
arcpy.env.overwriteOutput = "TRUE"

# Define workspace
mywspace = "K:/Research/DATA/ADS_data/Historic/R2_ADS_Historical_Maps/Digitized Data/Arapahoe/test"
print mywspace

# Set the workspace for the ListFeatureClass function
arcpy.env.workspace = mywspace

try:
    for shp in arcpy.ListFeatureClasses("", "POLYGON", ""):
        print shp
        map_name = shp[0:-4]
        print map_name
        arcpy.AddField_management(shp, "map_name", "TEXT", "", "", "20")
        arcpy.CalculateField_management(shp, "map_name", "map_name", "PYTHON")
except:
    print "Fubar, It's not working"
    print arcpy.GetMessages()
else:
    print "You're a genius Aaron"
The output I receive from running this script:
>>>
K:/Research/DATA/ADS_data/Historic/R2_ADS_Historical_Maps/Digitized Data/Arapahoe/test
505_dmg.shp
505_dmg
506_dmg.shp
506_dmg
You're a genius Aaron
Appears successful, right? Well, it has been...almost: a field was added and populated for both files, and it is perfect for the 505_dmg.shp file. The problem is that 506_dmg.shp has also been labeled "505_dmg" in the "map_name" column. Though the loop appears to be partially working, the map_name variable does not seem to be updating. Any thoughts or suggestions are much appreciated.
Thanks,
Aaron
I received a solution from the ESRI discussion board:
https://geonet.esri.com/thread/114520
Basically, a small edit to the CalculateField_management call did the trick. Here is the new line that worked:
arcpy.CalculateField_management(shp, "map_name","\"" + map_name + "\"", "PYTHON")
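The expression argument is evaluated by the field calculator rather than by your script, so it has to contain a quoted string literal; the fixed call rebuilds that literal from map_name on every pass through the loop. For completeness, the corrected loop body looks like this (prints trimmed):
try:
    for shp in arcpy.ListFeatureClasses("", "POLYGON", ""):
        map_name = shp[0:-4]
        arcpy.AddField_management(shp, "map_name", "TEXT", "", "", "20")
        # quote the value so the field calculator sees a string literal, not a name
        arcpy.CalculateField_management(shp, "map_name", "\"" + map_name + "\"", "PYTHON")
except:
    print arcpy.GetMessages()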

Loading in data via Django orm very memory intensive

So I have a script that loads in data that was saved in a Python pickle file.
dump_file = open('movies.pkl')
movie_data = pickle.load(dump_file)

#transaction.commit_manually
def load_data(data):
    start = False
    counter = 0
    for item in data:
        counter += 1
        film_name = item.decode(encoding='latin1')
        print "at", film_name, str(counter), str(len(data))
        film_rating = float(data[item][0])
        get_votes = int(data[item][2]['votes'])
        full_date = data[item][2]['year']
        temp_film = Film(name=film_name, date=full_date, rating=film_rating, votes=get_votes)
        temp_film.save()
        for actor in data[item][1]:
            actor = actor.decode(encoding='latin1')
            print "adding", actor
            person = Person.objects.get(full=actor)
            temp_film.actors.add(person)
        if counter % 10000 == 0 or counter % len(data) == 0:
            transaction.commit()
            print "COMMITED"

load_data(movie_data)
So this is a very large data set. And it takes up a lot of memory where it slows down to a crawl, and in the past my solution was to just restart the script from where I left off, so it would take quite a few runs to actually save everything into the database.
I'm wondering if there's a better way to do this (even an optimization in my code would be nice) other than writing raw sql to input the data? I've tried JSON fixtures previously and it was even worse than this method.
If movie_data is large, you might want to divide it into smaller files first and then iterate over them one by one.
Remember to free the memory of previously loaded pkl files, or keep overwriting the same variable.
If movie_data is a list, you can free the memory of, say, 1000 records after you have iterated over them by slicing, e.g. movie_data = movie_data[1000:], to reduce memory consumption over time.
You can use the bulk_create() method on the QuerySet object to create multiple objects in a single query; it's available since Django 1.4. Please go through the following documentation link:
Bulk Create - https://docs.djangoproject.com/en/dev/ref/models/querysets/#django.db.models.query.QuerySet.bulk_create
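A sketch of how that could look with the Film model from the question (the batch size is arbitrary; note that bulk_create() skips the per-object save() call and cannot populate the many-to-many actors relation, so the actors loop would still need a separate pass):
def load_films(data, batch_size=1000):
    # hypothetical rewrite of load_data() using bulk_create
    batch = []
    for item in data:
        batch.append(Film(
            name=item.decode(encoding='latin1'),
            date=data[item][2]['year'],
            rating=float(data[item][0]),
            votes=int(data[item][2]['votes']),
        ))
        if len(batch) >= batch_size:
            Film.objects.bulk_create(batch)  # one INSERT for the whole batch
            batch = []
    if batch:
        Film.objects.bulk_create(batch)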
You can also tidy up your code by opening files with the "with" keyword. The with statement automatically closes the file for you: do all the operations inside the with block, where the file stays open, and it gets closed once you leave the block.
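Applied to the pickle loading at the top of the question, that would be something like:
with open('movies.pkl', 'rb') as dump_file:
    movie_data = pickle.load(dump_file)
# dump_file is closed here, even if pickle.load raised an exception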

Writing a large number of queries to a text file

I have a list of about 200,000 entities, and I need to query a specific RESTful API for each of those entities, and end up with all the 200,000 entities saved in JSON format in txt files.
The naive way of doing it is going through the list of 200,000 entities, querying one by one, adding the returned JSON to a list, and, when that's done, writing it all to a text file. Something like:
import simplejson

from apiWrapper import api
from entities import listEntities  # list of the 200,000 entities

a = api()
fullEntityList = []
for entity in listEntities:
    fullEntityList.append(a.getFullEntity(entity))

with open("fullEntities.txt", "w") as f:
    simplejson.dump(fullEntityList, f)
Obviously this is not reliable, as 200,000 queries to the API will take about 10 hours or so, so I guess something will cause an error before it gets around to writing everything to the file.
I guess the right way is to write it in chunks, but I'm not sure how to implement that. Any ideas?
Also, I cannot do this with a database.
I would recommend writing them to a SQLite database. This is the way I do it for my own tiny web spider applications, because you can query the keys quite easily and check which ones you have already retrieved. That way your application can easily continue where it left off, in particular if some 1,000 new entries get added next week.
Do design "recovery" into your application from the beginning. If there is some unexpected exception (say, a timeout due to network congestion), you don't want to have to restart from the beginning, but only re-run those queries you have not yet successfully retrieved. At 200,000 queries, an uptime of 99.9% means you have to expect 200 failures!
For space efficiency and performance it will likely pay off to use a compressed format, such as compressing the JSON with zlib before dumping it into the database blob.
SQLite is a good choice, unless your spider runs on multiple hosts at the same time. For a single application, SQLite is perfect.
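A minimal sketch of that approach (the table layout and keying on str(entity) are my own assumptions):
import sqlite3
import zlib
import simplejson

from apiWrapper import api
from entities import listEntities

conn = sqlite3.connect('entities.db')
conn.execute('CREATE TABLE IF NOT EXISTS entities (id TEXT PRIMARY KEY, blob BLOB)')

a = api()
for entity in listEntities:
    # skip anything fetched on a previous run, so a crash only costs the current query
    if conn.execute('SELECT 1 FROM entities WHERE id = ?', (str(entity),)).fetchone():
        continue
    full = a.getFullEntity(entity)
    payload = zlib.compress(simplejson.dumps(full).encode('utf-8'))
    conn.execute('INSERT INTO entities VALUES (?, ?)',
                 (str(entity), sqlite3.Binary(payload)))
    conn.commit()
conn.close()
Reading back is just zlib.decompress plus simplejson.loads per row, and a final pass can dump everything out to the text files you need.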
The easy way is to open the file in 'a' (append) mode and write them one by one as they come in.
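For example, writing one JSON object per line (the one-object-per-line layout is my assumption; the question only asks for the results to end up in text files):
a = api()
with open("fullEntities.txt", "a") as f:
    for entity in listEntities:
        f.write(simplejson.dumps(a.getFullEntity(entity)) + "\n")
        f.flush()  # each finished query reaches disk before the next one starts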
The better way is to use a job queue. This will allow you to spawn off a.getFullEntity calls into worker thread(s) and handle the results however you want when/if they come back, or schedule retries for failures, etc.
See Queue.
I'd also use a separate thread that does the file writing, and a Queue to keep track of all entities. When I started off I thought this would be done in 5 minutes, but it turned out to be a little harder: simplejson and all other such libraries I'm aware of do not support partial writing, so you cannot first write one element of a list and later add another, etc. So I tried to solve this manually, by writing "[", ",", and "]" to the file separately and then dumping each entity individually.
Without being able to check it (as I don't have your API), you could try:
import threading
import Queue
import simplejson

from apiWrapper import api
from entities import listEntities  # list of the 200,000 entities

CHUNK_SIZE = 1000

class EntityWriter(threading.Thread):
    lines_written = False
    _filename = "fullEntities.txt"

    def __init__(self, queue):
        super(EntityWriter, self).__init__()
        self._q = queue
        self.running = False

    def run(self):
        self.running = True
        with open(self._filename, "a") as f:
            while True:
                try:
                    entity = self._q.get(block=False)
                    if not EntityWriter.lines_written:
                        EntityWriter.lines_written = True
                        f.write("[")
                        simplejson.dump(entity, f)
                    else:
                        f.write(",\n")
                        simplejson.dump(entity, f)
                except Queue.Empty:
                    break
        self.running = False

    def finish_file(self):
        with open(self._filename, "a") as f:
            f.write("]")


a = api()
fullEntityQueue = Queue.Queue(2 * CHUNK_SIZE)
n_entities = len(listEntities)
writer = None
for i, entity in enumerate(listEntities):
    fullEntityQueue.put(a.getFullEntity(entity))
    if (i + 1) % CHUNK_SIZE == 0 or i == n_entities - 1:
        if writer is None or not writer.running:
            writer = EntityWriter(fullEntityQueue)
            writer.start()
writer.join()
writer.finish_file()
What this script does
The main loop still iterates over your list of entities, getting the full information for each one. Each entity is then put into a Queue. Every 1000 entities (and at the end of the list), an EntityWriter thread is launched that runs in parallel to the main thread. This EntityWriter pulls entities from the Queue and dumps them to the desired output file.
Some additional logic is required to make the JSON a list; as mentioned above, I write "[", ",", and "]" manually. The resulting file should, in principle, be understood by simplejson when you reload it.
