I'm seeing some really odd behavior in one of my GAE apps where periodically (every few days or so) the entire datastore seems to get wiped and the app starts fresh. I've searched high and low for possible reasons and have so far come up empty-handed.
There's only one entity kind in the datastore, with two properties. I'm loading up the datastore by reading a CSV file and inserting the data. Here's what that code looks like:
filename = "data.csv"
rows = []
with open(filename, 'rb') as csvfile:
lines = csv.reader(csvfile, delimiter=',', quotechar='"')
for line in lines:
prop1Value = line[0]
prop2Value = line[1]
aRow = SomeEntity(prop1=prop1Value, prop2=prop2Value)
rows.append(aRow)
chunkSize = 50
numProgressChunks = int(len(rows) / chunkSize) + 1
for puttableRows in chunks(rows, chunkSize):
db.put(puttableRows)
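For reference, chunks isn't defined in the snippet above; it's presumably a small helper that yields successive fixed-size slices of the list, along the lines of this sketch:

def chunks(seq, size):
    # Yield successive slices of seq containing at most `size` items.
    for start in range(0, len(seq), size):
        yield seq[start:start + size]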
This is the only time that data gets written to the datastore. It works: after I import the CSV I'm able to run queries and fetch data. Then a few days later I find that the data is gone. Not only are all rows gone, but the entity (kind) is no longer present in the GAE Datastore Viewer, whereas when I delete all the rows myself, the kind still shows up.
This might be a coincidence, but the last time the data got wiped, I noticed that a GAE instance was started up around the same time (within a few minutes).
This is an HRD app, with python2.7, using django. There are no deploys around the time that it gets reset. I've looked at the logs and can't find anything odd happening around the time of the reset.
What am I missing?
I'm using the following code to stream a table and all its fields to a csv and then back to the user. It's more or less just the example from django's website.
# CSV header
header = [field.name for field in model._meta.fields]
results = model.objects.all()
# Generate a streaming output in case the file is large
h = []
h.append(header)
rows = ([getattr(item, field) for field in header] for item in results)
# Need to add the header to the front
chained = itertools.chain(h, rows)
pseudo_buffer = Echo()  # file-like object that just returns the value when write is called
writer = csv.writer(pseudo_buffer)
response = StreamingHttpResponse((writer.writerow(row) for row in chained), content_type="text/csv")
filename = "{}_{}_{}.csv".format(app_name, model_name, date_str)
response['Content-Disposition'] = 'attachment; filename="{}"'.format(filename)
return response
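For context, Echo is the pass-through pseudo-buffer from Django's streaming CSV example; a minimal version looks like this:

class Echo(object):
    # A file-like object that implements only write(), returning the value
    # instead of buffering it, so each csv row can be streamed as it is produced.
    def write(self, value):
        return value

If the table is large, it may also be worth iterating the queryset with results.iterator() so Django doesn't cache every row in memory while the response streams.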
The issue seems to be that for larger datasets, it will randomly stop the stream prior to actually finishing getting all the data from the table. The total file size is about 12MB or so, but it will stop streaming randomly from 500k to about 8MB.
I have only seen this in our production environment; I get the whole file in my development setup. It's all running in Docker containers, so in theory it's the same setup in both environments. I'm not sure if there are other server-related settings that could be causing this, though.
Our devops guy did say he increased the load balancer timeouts, but I'm under the impression the streaming response shouldn't cause timeouts anyway because it's constantly sending data.
Is it possible it has to do with how the query is being executed, and it thinks it's done before it actually is (maybe poor query performance - a lot of joins, etc)?
Thanks everyone -
I'm scraping articles from a news site on behalf of the owner. I have to keep it to <= 5 requests per second, or ~100k articles in 6 hrs (overnight), but I'm getting ~30k at best.
Using a Jupyter notebook, it runs fine at first, but becomes less and less responsive. After 6 hrs, the kernel is normally uninterruptible and I have to restart it. Since I'm storing each article in memory, this is a problem.
So my question is: is there a more efficient way to do this to reach ~100k articles in 6 hours?
The code is below. For each valid URL in a Pandas dataframe column, the loop:
downloads the webpage
extracts the relevant text
cleans out some encoding garbage from the text
writes that text to another dataframe column
every 2000 articles, it saves the dataframe to a CSV (overwriting the last backup), to handle the eventual crash of the script.
Some ideas I've considered:
Write each article to a local SQL server instead of in-mem (speed concerns?)
save each article text in a csv with its url, then build a dataframe later
delete all "print()" functions and rely solely on logging (my logger config doesn't seem to perform awesome, though--i'm not sure it's logging everything I tell it to)
import logging
import time

import pandas as pd
import requests

i = 0
# lots of NaNs in the column, hence the subsetting
for u in unique_urls[unique_urls['unique_suffixes'].isnull() == False]\
        .unique_suffixes.values[:]:
    i = i + 1
    if pd.isnull(u):
        continue
    # save our progress every 2k articles just in case
    if i % 2000 == 0:
        unique_urls.to_csv('/backup-article-txt.csv', encoding='utf-8')
    try:
        # pull the data
        html_r = requests.get(u).text
        # the phrase "TX:" indicates the start of the article
        # text, so if it's not present the URL must have been bad
        if html_r.find("TX:") == -1:
            continue
        # capture just the text of the article
        txt = html_r[html_r.find("TX:")+5:]
        # fix encoding/formatting quirks
        txt = txt.replace('\n', ' ')
        # NOTE: str.replace treats the pattern below as a literal string, not a
        # regex; re.sub(r'[^\x00-\x7F]', '', txt) would be needed to strip non-ASCII
        txt = txt.replace('[^\x00-\x7F]', '')
        # wait 200 ms to spare the site's servers
        time.sleep(.2)
        # write our article to our dataframe
        unique_urls.loc[unique_urls.unique_suffixes == u, 'article_text'] = txt
        logging.info("done with url # %s -- %s remaining", i, (total_links - i))
        print "done with url # " + str(i)
        print total_links - i
    except:
        logging.exception("Exception on article # %s, URL: %s", i, u)
        print "ERROR with url # " + str(i)
        continue
This is the logging config I'm using. I found it on SO, but w/ this particular script it doesn't seem to capture everything.
logTime = "{:%d %b-%X}".format(datetime.datetime.now())
logger = logging.getLogger()
fhandler = logging.FileHandler(filename=logTime + '.log', mode='a')
formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
fhandler.setFormatter(formatter)
logger.addHandler(fhandler)
logger.setLevel(logging.INFO)
eta: some details in response to answers/comments:
the script is the only thing running on a 16 GB RAM EC2 instance
articles are ~100-800 words apiece
Going by your description, I'm going to take an educated guess and say that your script turns your machine into a swap storm by the time you get to around 30k articles. I don't see anything in your code where you could easily free up memory using:
some_large_container = None
Setting something that you know has a large allocation to None tells Python's memory manager that it's available for garbage collection. You also might want to explicitly call gc.collect(), but I'm not sure that would do you much good.
Alternatives you could consider:
sqlite3: Instead of a remote SQL database, use sqlite3 as intermediate storage; a Python module for it ships with the standard library (a sketch follows this list).
Keep appending to the CSV checkpoint file.
Compress your strings with zlib.compress().
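As a rough illustration of the sqlite3 alternative (optionally combined with zlib compression), here is a minimal sketch; the file, table and column names are made up for the example:

import sqlite3
import zlib

conn = sqlite3.connect('articles_checkpoint.db')  # hypothetical checkpoint file
conn.execute("CREATE TABLE IF NOT EXISTS articles (url TEXT PRIMARY KEY, body BLOB)")

def save_article(url, text):
    # Compress the article text and insert/replace it keyed by URL,
    # so nothing needs to stay in the dataframe between checkpoints.
    blob = sqlite3.Binary(zlib.compress(text.encode('utf-8')))
    conn.execute("INSERT OR REPLACE INTO articles (url, body) VALUES (?, ?)", (url, blob))
    conn.commit()

def load_article(url):
    row = conn.execute("SELECT body FROM articles WHERE url = ?", (url,)).fetchone()
    return zlib.decompress(bytes(row[0])).decode('utf-8') if row else None

The dataframe can then be rebuilt in a second pass (for example with pandas.read_sql_query) once all the articles are on disk, which matches the collect-first, build-the-dataframe-later advice below.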
Whichever way you decide to go, you're probably best off doing the collection as phase 1 and constructing the Pandas dataframe as phase 2. It never pays to be clever by half; the other half tends to hang you.
I'm trying to populate a SQLite database using Django with data from a file that consists of 6 million records. However, the code I've written takes far too long even with 50,000 records.
This is the code with which I'm trying to populate the database:
import os

def populate():
    with open("filename") as f:
        for line in f:
            col = line.strip().split("|")
            duns = col[1]
            name = col[8]
            job = col[12]
            dun_add = add_c_duns(duns)
            add_contact(c_duns=dun_add, fn=name, job=job)

def add_contact(c_duns, fn, job):
    c = Contact.objects.get_or_create(duns=c_duns, fullName=fn, title=job)
    return c

def add_c_duns(duns):
    cd = Contact_DUNS.objects.get_or_create(duns=duns)[0]
    return cd

if __name__ == '__main__':
    print "Populating Contact db...."
    os.environ.setdefault("DJANGO_SETTINGS_MODULE", "settings")
    from web.models import Contact, Contact_DUNS
    populate()
    print "Done!!"
The code itself works; I have tested it with dummy records and it gives the desired results. I would like to know if there is a way to lower the execution time of this code. Thanks.
I don't have enough reputation to comment, but here's a speculative answer.
Basically the only way to do this efficiently through django's ORM is to use bulk_create. So the first thing to consider is the use of get_or_create. If your database has existing records that might have duplicates in the input file, then your only choice is writing the SQL yourself. If you use it to avoid duplicates inside the input file, then preprocess the file to remove duplicate rows.
So if you can live without the get part of get_or_create, then you can follow this strategy:
Go through each row of the input file and instantiate a Contact_DUNS instance for each entry (don't actually create the rows, just write Contact_DUNS(duns=duns) ) and save all instances to an array. Pass the array to bulk_create to actually create the rows.
Generate a list of DUNS-id pairs with values_list and convert them to a dict with the DUNS number as the key and the row id as the value.
Repeat step 1 but with Contact instances. Before creating each instance, use the DUNS number to get the Contact_DUNS id from the dictionary of step 2. Then instantiate each Contact in the following way: Contact(duns_id=c_duns_id, fullName=fn, title=job). Again, after collecting the Contact instances, just pass them to bulk_create to create the rows.
This should radically improve performance as you'll be no longer executing a query for each input line. But as I said above, this can only work if you can be certain that there are no duplicates in the database or the input file.
EDIT Here's the code:
import os

def populate_duns():
    # Will only work if there are no DUNS duplicates
    # (both in the DB and within the file)
    duns_instances = []
    with open("filename") as f:
        for line in f:
            duns = line.strip().split("|")[1]
            duns_instances.append(Contact_DUNS(duns=duns))
    # Run a single INSERT query for all DUNS instances
    # (actually it will be run in batches, but it's still quite fast)
    Contact_DUNS.objects.bulk_create(duns_instances)

def get_duns_dict():
    # This is basically a SELECT query for these two fields
    duns_id_pairs = Contact_DUNS.objects.values_list('duns', 'id')
    return dict(duns_id_pairs)

def populate_contacts():
    # Repeat the same process for Contacts
    contact_instances = []
    duns_dict = get_duns_dict()
    with open("filename") as f:
        for line in f:
            col = line.strip().split("|")
            duns = col[1]
            name = col[8]
            job = col[12]
            ci = Contact(duns_id=duns_dict[duns],
                         fullName=name,
                         title=job)
            contact_instances.append(ci)
    # Again, run only a single INSERT query
    Contact.objects.bulk_create(contact_instances)

if __name__ == '__main__':
    print "Populating Contact db...."
    os.environ.setdefault("DJANGO_SETTINGS_MODULE", "settings")
    from web.models import Contact, Contact_DUNS
    populate_duns()
    populate_contacts()
    print "Done!!"
CSV Import
First of all, 6 million records is quite a lot for SQLite and, worse still, SQLite isn't very good at importing CSV data directly.
There is no standard as to what a CSV file should look like, and the SQLite shell does not even attempt to handle all the intricacies of interpreting a CSV file. If you need to import a complex CSV file and the SQLite shell doesn't handle it, you may want to try a different front end, such as SQLite Database Browser.
On the other hand, MySQL and PostgreSQL are more capable of handling CSV data, and MySQL's LOAD DATA INFILE and PostgreSQL's COPY are both painless ways to import very large amounts of data in a very short period of time.
Suitability of SQLite.
You are using django => you are building a web app => more than one user will access the database. This is from the manual about concurrency.
SQLite supports an unlimited number of simultaneous readers, but it will only allow one writer at any instant in time. For many situations, this is not a problem. Writers queue up. Each application does its database work quickly and moves on, and no lock lasts for more than a few dozen milliseconds. But there are some applications that require more concurrency, and those applications may need to seek a different solution.
Even your read operations are likely to be rather slow, because an SQLite database is just one single file, so with this amount of data there will be a lot of seek operations involved. The data cannot be spread across multiple files or even disks as is possible with proper client-server databases.
The good news for you is that with Django you can usually switch from SQLite to MySQL or PostgreSQL just by changing your settings.py; no other changes are needed. (The reverse isn't always true.)
So I urge you to consider switching to MySQL or PostgreSQL before you get in too deep. It will help you solve your present problem and also help to avoid problems that you will run into sooner or later.
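For illustration, the switch is mostly a matter of the DATABASES setting; a minimal sketch, with placeholder names and credentials:

# settings.py -- swapping SQLite for PostgreSQL (or MySQL) is mostly a DATABASES change
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.postgresql_psycopg2',  # or 'django.db.backends.mysql'
        'NAME': 'contacts_db',       # placeholder database name
        'USER': 'db_user',           # placeholder credentials
        'PASSWORD': 'db_password',
        'HOST': 'localhost',
        'PORT': '5432',
    }
}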
6,000,000 records is quite a lot to import via Python. If Python is not a hard requirement, you could write a SQLite script that imports the CSV data directly and creates your tables using SQL statements. Even faster would be to preprocess your file using awk and output two CSV files corresponding to your two tables.
I once imported 20,000,000 records using the sqlite3 CSV importer and it took only a few minutes.
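If you stay in Python but skip the ORM, a rough sketch of loading the pipe-delimited file straight into SQLite with executemany (the database file and table definition are illustrative) would be:

import csv
import sqlite3

conn = sqlite3.connect('contacts.db')  # hypothetical database file
conn.execute("CREATE TABLE IF NOT EXISTS contact (duns TEXT, fullName TEXT, title TEXT)")

with open("filename") as f:
    reader = csv.reader(f, delimiter='|')
    # Build (duns, name, job) tuples from columns 1, 8 and 12, as in the question
    rows = ((col[1], col[8], col[12]) for col in reader)
    conn.executemany("INSERT INTO contact (duns, fullName, title) VALUES (?, ?, ?)", rows)

conn.commit()
conn.close()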
So I have a script that loads in data that was created from a python pickle file.
import pickle

from django.db import transaction

# Film and Person are assumed to be imported from the app's models module

dump_file = open('movies.pkl', 'rb')
movie_data = pickle.load(dump_file)

@transaction.commit_manually
def load_data(data):
    start = False
    counter = 0
    for item in data:
        counter += 1
        film_name = item.decode(encoding='latin1')
        print "at", film_name, str(counter), str(len(data))
        film_rating = float(data[item][0])
        get_votes = int(data[item][2]['votes'])
        full_date = data[item][2]['year']
        temp_film = Film(name=film_name, date=full_date, rating=film_rating, votes=get_votes)
        temp_film.save()
        for actor in data[item][1]:
            actor = actor.decode(encoding='latin1')
            print "adding", actor
            person = Person.objects.get(full=actor)
            temp_film.actors.add(person)
        if counter % 10000 == 0 or counter % len(data) == 0:
            transaction.commit()
            print "COMMITED"

load_data(movie_data)
So this is a very large data set. It takes up a lot of memory and slows down to a crawl, and in the past my solution was to just restart the script from where I left off, so it would take quite a few runs to actually save everything into the database.
I'm wondering if there's a better way to do this (even an optimization in my code would be nice) other than writing raw sql to input the data? I've tried JSON fixtures previously and it was even worse than this method.
If the size of movie_data is large, you might want to divide it into smaller files first and then iterate over them one by one.
Remember to free the memory of previously loaded pkl files, or keep overwriting the same variable.
If movie_data is a list, you can free the memory of, say, 1,000 records after you have iterated over them by slicing (e.g. movie_data = movie_data[1000:]) to reduce memory consumption over time; a sketch of that idea follows.
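A minimal sketch of that idea, assuming movie_data behaves like a list of records (deleting the processed slice in place also frees the memory when the list is referenced elsewhere):

CHUNK = 1000

def process_in_chunks(records, handle_record):
    # Work through the list one slice at a time, deleting each processed
    # slice so its memory can be reclaimed as the loop progresses.
    while records:
        for record in records[:CHUNK]:
            handle_record(record)  # e.g. build and save a Film object
        del records[:CHUNK]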
You can use the bulk_create() method on the QuerySet object to create multiple objects in a single query; it has been available since Django 1.4. Please go through the following documentation link -
Bulk Create - https://docs.djangoproject.com/en/dev/ref/models/querysets/#django.db.models.query.QuerySet.bulk_create
You can also tidy up your code by opening the file with Python's with statement: it keeps the file open for all the operations inside the with block and closes it automatically once you leave the block.
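A rough sketch combining both suggestions for the loader above; the import path for Film is a placeholder, and the batch size just mirrors the question's commit interval:

import pickle

from myapp.models import Film  # placeholder import path for the Film model

BATCH_SIZE = 10000

with open('movies.pkl', 'rb') as dump_file:
    movie_data = pickle.load(dump_file)

films = []
for item in movie_data:
    info = movie_data[item]
    films.append(Film(name=item.decode('latin1'),
                      date=info[2]['year'],
                      rating=float(info[0]),
                      votes=int(info[2]['votes'])))
    if len(films) >= BATCH_SIZE:
        Film.objects.bulk_create(films)  # one INSERT per batch instead of one per film
        films = []

if films:
    Film.objects.bulk_create(films)

Note that bulk_create doesn't handle many-to-many relations, so the actors would still have to be attached in a separate pass after the Film rows exist.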
I have a list of about 200,000 entities, and I need to query a specific RESTful API for each of those entities, and end up with all the 200,000 entities saved in JSON format in txt files.
The naive way of doing it is going through the list of the 200,000 entities and querying one by one, adding the returned JSON to a list, and when it's done, writing it all to a text file. Something like:
import simplejson

from apiWrapper import api
from entities import listEntities  # list of the 200,000 entities

a = api()
fullEntityList = []
for entity in listEntities:
    fullEntityList.append(a.getFullEntity(entity))

with open("fullEntities.txt", "w") as f:
    simplejson.dump(fullEntityList, f)
Obviously this is not reliable, as 200,000 queries to the API will take about 10 hours or so, so I guess something will cause an error before it gets to write it to the file.
I guess the right way is to write it in chunks, but not sure how to implement it. Any ideas?
Also, I cannot do this with a database.
I would recommend writing them to a SQLite database. That is the way I do it for my own tiny web spider applications, because you can query the keys quite easily and check which ones you have already retrieved. This way, your application can easily continue where it left off, in particular if some 1,000 new entries get added next week.
Do design "recovery" into your application from the beginning. If there is some unexpected exception (say, a timeout due to network congestion), you don't want to have to restart from the beginning, but only those queries you have not yet successfully retrieved. At 200,000 queries, an uptime of 99.9% means you have to expect 200 failures!
For space efficiency and performance it will likely pay off to use a compressed format, such as compressing the json with zlib before dumping it into the database blob.
SQLite is a good choice, unless your spider runs on multiple hosts at the same time. For a single application, sqlite is perfect.
The easy way is to open the file in 'a' (append) mode and write them one by one as they come in.
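A minimal sketch of that append-as-you-go approach, reusing a and listEntities from the question and writing one JSON object per line so a crash loses at most the entity currently in flight:

import simplejson

with open("fullEntities.txt", "a") as f:
    for entity in listEntities:
        f.write(simplejson.dumps(a.getFullEntity(entity)) + "\n")
        f.flush()  # make sure each entity reaches disk as soon as it is fetched

Reading the results back is then a matter of calling simplejson.loads on each line instead of loading one big JSON list.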
The better way is to use a job queue. This will allow you to spawn off a.getFullEntity calls into worker thread(s) and handle the results however you want when/if they come back, or schedule retries for failures, etc.
See Queue.
I'd also use a separate thread that does the file writing, and use a Queue to keep track of all entities. When I started off, I thought this would be done in 5 minutes, but then it turned out to be a little harder. simplejson and all other such libraries I'm aware of do not support partial writing, so you cannot first write one element of a list and later add another. So I tried to solve this manually, by writing [, , and ] separately to the file and then dumping each entity separately.
Without being able to check it (as I don't have your api), you could try:
import threading
import Queue
import simplejson

from apiWrapper import api
from entities import listEntities  # list of the 200,000 entities

CHUNK_SIZE = 1000

class EntityWriter(threading.Thread):
    lines_written = False
    _filename = "fullEntities.txt"

    def __init__(self, queue):
        super(EntityWriter, self).__init__()
        self._q = queue
        self.running = False

    def run(self):
        self.running = True
        with open(self._filename, "a") as f:
            while True:
                try:
                    entity = self._q.get(block=False)
                    if not EntityWriter.lines_written:
                        EntityWriter.lines_written = True
                        f.write("[")
                        simplejson.dump(entity, f)
                    else:
                        f.write(",\n")
                        simplejson.dump(entity, f)
                except Queue.Empty:
                    break
        self.running = False

    def finish_file(self):
        with open(self._filename, "a") as f:
            f.write("]")


a = api()
fullEntityQueue = Queue.Queue(2 * CHUNK_SIZE)
n_entities = len(listEntities)
writer = None
for i, entity in enumerate(listEntities):
    fullEntityQueue.put(a.getFullEntity(entity))
    if (i + 1) % CHUNK_SIZE == 0 or i == n_entities - 1:
        if writer is None or not writer.running:
            writer = EntityWriter(fullEntityQueue)
            writer.start()
writer.join()
writer.finish_file()
What this script does
The main loop still iterates over your list of entities, getting the full information for each. Each entity is then put onto a Queue. Every 1000 entities (and at the end of the list) an EntityWriter thread is launched that runs in parallel with the main thread. This EntityWriter pulls entities from the Queue and dumps them to the desired output file.
Some additional logic is required to make the JSON a list; as mentioned above, I write [, , and ] manually. The resulting file should, in principle, be understood by simplejson when you reload it.