I have a list of about 200,000 entities, and I need to query a specific RESTful API for each of those entities, and end up with all the 200,000 entities saved in JSON format in txt files.
The naive way of doing it is to go through the list of the 200,000 entities, query them one by one, add the returned JSON to a list, and when it's done, write it all to a text file. Something like:
import simplejson
from apiWrapper import api
from entities import listEntities  # list of the 200,000 entities

a = api()
fullEntityList = []
for entity in listEntities:
    fullEntityList.append(a.getFullEntity(entity))

with open("fullEntities.txt", "w") as f:
    simplejson.dump(fullEntityList, f)
Obviously this is not reliable, as 200,000 queries to the API will take about 10 hours or so, so I guess something will cause an error before it gets to write it to the file.
I guess the right way is to write it in chunks, but not sure how to implement it. Any ideas?
Also, I cannot do this with a database.
I would recommend writing them to a SQLite database. This is the way I do it for my own tiny web spider applications, because you can query the keys quite easily and check which ones you have already retrieved. This way, your application can easily continue where it left off, in particular if some 1,000 new entries get added next week.
Do design "recovery" into your application from the beginning. If there is some unexpected exception (say, a timeout due to network congestion), you don't want to have to restart from the beginning, but only re-run those queries you have not yet successfully retrieved. At 200,000 queries, an uptime of 99.9% means you have to expect 200 failures!
For space efficiency and performance it will likely pay off to use a compressed format, such as compressing the JSON with zlib before dumping it into the database blob.
SQLite is a good choice, unless your spider runs on multiple hosts at the same time. For a single application, SQLite is perfect.
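For illustration, a minimal sketch of that idea, reusing the api wrapper and entity list from the question (the table layout, database file name, and the assumption that each entity is usable as a text key are mine, not part of the original answer):

import sqlite3
import zlib
import simplejson

from apiWrapper import api
from entities import listEntities  # list of the 200,000 entities, as in the question

a = api()
conn = sqlite3.connect("entities.db")
conn.execute("CREATE TABLE IF NOT EXISTS entities (key TEXT PRIMARY KEY, data BLOB)")

# Only fetch the entities that are not stored yet, so after a crash you simply re-run the script
done = {row[0] for row in conn.execute("SELECT key FROM entities")}
for entity in listEntities:
    if entity in done:
        continue
    full = a.getFullEntity(entity)
    blob = zlib.compress(simplejson.dumps(full).encode("utf-8"))
    conn.execute("INSERT INTO entities (key, data) VALUES (?, ?)", (entity, blob))
    conn.commit()  # committing per entity keeps the window for data loss small

Reading everything back out is then a single SELECT, with a zlib.decompress and simplejson.loads per row.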
The easy way is to open the file in 'a' (append) mode and write them one by one as they come in.
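As a rough sketch of that approach (one JSON object per line, so a crash only loses the entity currently being written; the file name and loop come from the question):

import simplejson
from apiWrapper import api
from entities import listEntities

a = api()
with open("fullEntities.txt", "a") as f:   # append mode: already-written results survive a crash
    for entity in listEntities:
        f.write(simplejson.dumps(a.getFullEntity(entity)))
        f.write("\n")                      # one JSON object per line makes reloading simple

Reloading is then a matter of parsing the file line by line instead of with a single simplejson.load call.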
The better way is to use a job queue. This will allow you to spawn off a.getFullEntity calls into worker thread(s) and handle the results however you want when/if they come back, or schedule retries for failures, etc.
See Queue.
I'd also use a separate Thread that does the file-writing, and use Queue to keep a record of all entities. When I started off, I thought this would be done in 5 minutes, but then it turned out to be a little harder. simplejson and all other such libraries I'm aware of do not support partial writing, so you cannot first write one element of a list, later add another, etc. So I tried to solve this manually, by writing [, ,, and ] separately to the file and then dumping each entity separately.
Without being able to check it (as I don't have your api), you could try:
import threading
import Queue
import simplejson

from apiWrapper import api
from entities import listEntities  # list of the 200,000 entities

CHUNK_SIZE = 1000

class EntityWriter(threading.Thread):
    lines_written = False
    _filename = "fullEntities.txt"

    def __init__(self, queue):
        super(EntityWriter, self).__init__()
        self._q = queue
        self.running = False

    def run(self):
        self.running = True
        with open(self._filename, "a") as f:
            while True:
                try:
                    entity = self._q.get(block=False)
                    if not EntityWriter.lines_written:
                        EntityWriter.lines_written = True
                        f.write("[")
                        simplejson.dump(entity, f)
                    else:
                        f.write(",\n")
                        simplejson.dump(entity, f)
                except Queue.Empty:
                    break
        self.running = False

    def finish_file(self):
        with open(self._filename, "a") as f:
            f.write("]")


a = api()
fullEntityQueue = Queue.Queue(2 * CHUNK_SIZE)
n_entities = len(listEntities)
writer = None
for i, entity in enumerate(listEntities):
    fullEntityQueue.put(a.getFullEntity(entity))
    if (i + 1) % CHUNK_SIZE == 0 or i == n_entities - 1:
        if writer is None or not writer.running:
            writer = EntityWriter(fullEntityQueue)
            writer.start()
writer.join()
writer.finish_file()
What this script does
The main loop still iterates over your list of entities, getting the full information for each. Each entity is then put into a Queue. Every 1000 entities (and at the end of the list) an EntityWriter thread is launched that runs in parallel to the main thread. This EntityWriter gets entities from the Queue and dumps them to the desired output file.
Some additional logic is required to make the JSON a list; as mentioned above, I write [, ,, and ] manually. The resulting file should, in principle, be understood by simplejson when you reload it.
Related
The script has a database.json file that it reads and writes to with every operation it does. It's a simple database: 'two columns', one with text (which includes hyphens, numbers, and underscores) and an associated number. There are about 40,000 rows though.
As of right now, only one instance of the script can work with the database at a time. I know this because I had two instances running, and when I checked the database half of it was missing; it would seem each script was erasing the other's data. Long story short, I had to rebuild it from a backup, but I have a question:
Can multiple instances of a Python script read and write a JSON database like this, or must the script be written with some more complex database functionality?
Here are some snippets of the code that interacts with the database:
def load_database():
    # If file exists load the json as a dict
    if os.path.isfile('database.json'):
        return json.load(open('database.json', 'r'))
    # If file doesn't exist return empty dict because the script hasn't run yet
    else:
        return {}
------
def save_database():
    json.dump(database, open('database.json', 'w'))
------
self.user_data = self.get_user_data()
if self.user_data:
    self.user_id = str(self.user_data['id'])
    if self.user_id in database:
        if database[self.user_id] != self.username:
            main_log.info('User {} changed their username to {}.'.format(database[self.user_id], self.username))
            os.rename('Users/{}_{}'.format(database[self.user_id], self.user_id), 'Users/{}_{}'.format(self.username, self.user_id))
            add_username_to_history('Users/{}_{}/'.format(self.username, self.user_id), database[self.user_id])
            database[self.user_id] = self.username
        else:
            safe_mkdir('Users/{}_{}'.format(self.username, self.user_id))
    else:
        safe_mkdir('Users/{}_{}'.format(self.username, self.user_id))
        main_log.info('Added {} with ID:{} to the database.'.format(self.username, self.user_id))
        database.update({self.user_id: self.username})
    # Create the user log obj
    self.user_log = Logger('Users/{}_{}/log.log'.format(self.username, self.user_id))
    # Create ID File
    create_id_file('Users/{}_{}/'.format(self.username, self.user_id), self.user_id)
    save_database()
There's a few more instances where the database is called, but those give an idea.
The reason I wanted to run multiple instances was to double the scraping speed. I already use a ThreadPool, and even with 200+ threads it's not as fast as it would be with multiple instances.
What you need is the concept of atomicity.
Since your json file is not too big, one not-ideal-but-workable option could be:
When you need to read: open the file, read the values into variable(s), close the file, i.e. do not keep the json open.
Repeat step 1 every time you need to read the database.
When you need to write: open the file, write the values into the json, close the file, i.e. do not keep the json open.
Repeat step 2 every time you need to write to the database.
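A minimal sketch of that pattern, modeled on the load/save functions from the question (the temp-file-plus-rename step is my addition to make the write atomic; it is not spelled out in the original answer):

import json
import os
import tempfile

DB_FILE = 'database.json'

def load_database():
    # Open, read, close every time; never keep the file handle around
    if not os.path.isfile(DB_FILE):
        return {}
    with open(DB_FILE, 'r') as f:
        return json.load(f)

def save_database(database):
    # Write to a temp file first, then atomically replace the real file,
    # so a crash mid-write cannot leave a half-written database.json behind
    fd, tmp_path = tempfile.mkstemp(dir='.', suffix='.json')
    with os.fdopen(fd, 'w') as f:
        json.dump(database, f)
    os.replace(tmp_path, DB_FILE)

Note that this alone still does not make concurrent instances safe: two processes can load the same state and overwrite each other's changes, which is why the Redis suggestion below (or any real database) is the more robust fix.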
I used a Redis database. This allowed each instance of the script to write to the database individually, so multiple instances of the script could run at the same time.
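For reference, a rough sketch of what the ID-to-username mapping could look like with the redis-py client (user_id and username are the variables from the question's snippet; the key name 'users' is made up for illustration):

import redis

r = redis.Redis(host='localhost', port=6379, db=0)

# Store the mapping in a Redis hash; HSET is a single atomic command, so concurrent
# script instances can update different users without clobbering each other
r.hset('users', user_id, username)

# Look a user up (returns None if the ID is unknown; redis-py returns bytes)
stored = r.hget('users', user_id)
if stored is not None:
    stored = stored.decode('utf-8')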
To briefly explain the context, I am downloading SEC prospectus data, for example. After downloading I want to parse each file to extract certain data, then output the parsed dictionaries to a JSON file consisting of a list of dictionaries. I would use a SQL database for the output, but the research cluster admins at my university are being slow about getting me access. If anyone has suggestions for how to store the data for easy reading/writing later I would appreciate it; I was thinking about HDF5 as a possible alternative.
Below is a minimal example of what I am doing, with the spots that I think need to be improved labeled.
import json
import os
from math import ceil
from multiprocessing import Pool

import grequests
from tqdm import tqdm

BASE_URL = 'https://www.sec.gov/Archives/'

def classify_file(doc):
    try:
        data = {
            'link': doc.url
        }
    except AttributeError:
        return {'flag': 'ATTRIBUTE ERROR'}
    # Do a bunch of parsing using regular expressions
    return data

if __name__ == "__main__":
    items = list()
    for d in tqdm([y + ' ' + q for y in ['2019'] for q in ['1']]):
        stream = os.popen('bash ./getformurls.sh ' + d)
        stacked = stream.read().strip().split('\n')
        # split each line into the fixed-width fields
        widths = (12, 62, 12, 12, 44)
        items += [[item[sum(widths[:j]):sum(widths[:j + 1])].strip() for j in range(len(widths))] for item in stacked]
    urls = [BASE_URL + item[4] for item in items]

    resp = list()
    # PROBLEM 1
    filelimit = 100
    for i in range(ceil(len(urls) / filelimit)):
        print(f'Downloading: {i*filelimit/len(urls)*100:2.0f}%... ', end='\r', flush=True)
        resp += [r for r in grequests.map(grequests.get(u) for u in urls[i*filelimit:(i+1)*filelimit])]

    # PROBLEM 2
    with Pool() as p:
        rs = p.map_async(classify_file, resp, chunksize=20)
        rs.wait()
        prospectus = rs.get()

    with open('prospectus_data.json', 'w') as f:
        json.dump(prospectus, f)
The getformurls.sh referenced above is a bash script I wrote that was faster than doing it in Python since I could use grep. The code for that is:
#!/bin/bash
BASE_URL="https://www.sec.gov/Archives/"
INDEX="edgar/full-index/"
url="${BASE_URL}${INDEX}$1/QTR$2/form.idx"
out=$(curl -s ${url} | grep "^485[AB]POS")
echo "$out"
PROBLEM 1: So I am currently pulling about 18k files in the grequests map call. I was running into an error about too many files being open so I decided to split up the urls list into manageable chunks. I don't like this solution, but it works.
PROBLEM 2: This is where my actual error is. This code runs fine on a smaller set of urls (~2k) on my laptop (uses 100% of my cpu and ~20GB of RAM ~10GB for the file downloads and another ~10GB when the parsing starts), but when I take it to the larger 18k dataset using 40 cores on a research cluster it spins up to ~100GB RAM and ~3TB swap usage then crashes after parsing about 2k documents in 20 minutes via a KeyboardInterrupt from the server.
I don't really understand why the swap usage is getting so crazy, but I think I really just need help with memory management here. Is there a way to create a generator of unsent requests that will be sent when I call classify_file() on them later? Any help would be appreciated.
Generally when you have runaway memory usage with a Pool it's because the workers are being re-used and accumulating memory with each iteration. You can occasionally close and re-open the pool to prevent this, but it's such a common issue that Python now has a built-in parameter to do it for you...
Pool(...maxtasksperchild) is the number of tasks a worker process can complete before it will exit and be replaced with a fresh worker process, to enable unused resources to be freed. The default maxtasksperchild is None, which means worker processes will live as long as the pool.
There's no way for me to tell you what the right value is, but you generally want to set it low enough that resources can be freed fairly often, yet not so low that it slows things down. (Maybe a minute's worth of processing... just as a guess.)
with Pool(maxtasksperchild=5) as p:
    rs = p.map_async(classify_file, resp, chunksize=20)
    rs.wait()
    prospectus = rs.get()
For your first problem, you might consider just using requests and moving the call inside the worker process you already have. Pulling 18K worth of URLs and caching all that data up front is going to take time and memory. If it's all encapsulated in the worker, you'll minimize data usage and you won't need to spin up so many open file handles.
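A rough sketch of that restructuring, assuming plain requests inside the worker (the URL list construction and the parsing step are stand-ins for the code from the question):

import json
from multiprocessing import Pool

import requests

def classify_file(url):
    # Download inside the worker, so only one response per worker is in memory at a time
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
    except requests.RequestException:
        return {'flag': 'REQUEST ERROR', 'link': url}
    data = {'link': url}
    # ... parse r.text with regular expressions here, as in the original classify_file ...
    return data

if __name__ == '__main__':
    urls = []  # built from the getformurls.sh output exactly as in the question
    with Pool(maxtasksperchild=5) as p:
        prospectus = p.map(classify_file, urls, chunksize=20)
    with open('prospectus_data.json', 'w') as f:
        json.dump(prospectus, f)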
I'm a newbie in Python, and this is my first time working with a REST API in Python. First let me explain what I want to do. I have a CSV file which has the name of a product and some other details; these are data that went missing after a migration. My job now is to check in the downstream Application 1 whether it contains these products or whether they are missing there too. If a product is missing there as well, I have to dig further back upstream.
So now I have the API of Application 1 (this returns the product name and details if the product exists) and an API for OAuth 2. The latter creates a token, and I'm using that token to access the API of Application 1 (it looks like this: https://Applciationname/rest/< productname >). I get this < productname > from a list which is retrieved from the first column of the CSV file. Everything is working fine, but my list has 3000 entries and it takes almost 2 hours to complete.
Is there any faster way to check this? BTW, I'm calling the token API only once. This is how my code looks:
list = []
# reading the csv and appending each product name to list (using with open and csv.reader here)

get_token = requests.get(tokenurl, OAuthdetails)  # similar type of code
token_dict = json.loads(get_token.content.decode())
token = token_dict['access_token']

headers = {
    'Authorization': 'Bearer ' + str(token)
}
url = "https://Applciationname/rest/"

for element in list:
    full_url = url + element
    api_response = requests.get(full_url, headers=headers)
    recieved_data = json.loads(api_response.content.decode())
    if api_response.status_code == 200 and len(recieved_data) != 0:
        pass  # write the element value to the "successcall" text file (using with open here)
    else:
        pass  # write the element value to the "failurecall" text file (using with open here)
Now could you please help me optimize this, so that I can find the product names which are not in App 1 faster?
You could use threading for your for loop. Like so:
import json
import threading

import requests

# headers, url and list come from the question's code
lock = threading.RLock()
thread_list = []

def check_api(full_url):
    api_response = requests.get(full_url, headers=headers)
    recieved_data = json.loads(api_response.content.decode())
    if api_response.status_code == 200 and len(recieved_data) != 0:
        # don't forget to add a lock when writing to the file
        with lock:
            with open("successcall.txt", "a") as f:
                f.write(json.dumps(recieved_data) + "\n")
    else:
        # again, don't forget the lock, like the one above
        with lock:
            with open("failurecall.txt", "a") as f:
                f.write(full_url + "\n")

for element in list:
    full_url = url + element
    t = threading.Thread(target=check_api, args=(full_url, ))
    thread_list.append(t)

# start all threads
for thread in thread_list:
    thread.start()

# wait for them all to finish
for thread in thread_list:
    thread.join()
You should also not write to the same file from multiple threads without a lock, since it might cause problems; that is what the lock above is for.
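One caveat (my addition, not part of the original answer): spawning one thread per CSV row means roughly 3000 threads at once. A bounded pool keeps the concurrency reasonable; a sketch using the standard library's concurrent.futures, reusing the check_api function above:

from concurrent.futures import ThreadPoolExecutor

# Run at most 20 requests at a time instead of one thread per product
with ThreadPoolExecutor(max_workers=20) as executor:
    executor.map(check_api, (url + element for element in list))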
I'm trying to populate a SQLite database using Django with data from a file that consists of 6 million records. However, the code that I've written is taking a lot of time even with 50,000 records.
This is the code with which I'm trying to populate the database:
import os

def populate():
    with open("filename") as f:
        for line in f:
            col = line.strip().split("|")
            duns = col[1]
            name = col[8]
            job = col[12]
            dun_add = add_c_duns(duns)
            add_contact(c_duns=dun_add, fn=name, job=job)

def add_contact(c_duns, fn, job):
    c = Contact.objects.get_or_create(duns=c_duns, fullName=fn, title=job)
    return c

def add_c_duns(duns):
    cd = Contact_DUNS.objects.get_or_create(duns=duns)[0]
    return cd

if __name__ == '__main__':
    print "Populating Contact db...."
    os.environ.setdefault("DJANGO_SETTINGS_MODULE", "settings")
    from web.models import Contact, Contact_DUNS
    populate()
    print "Done!!"
The code works fine, since I have tested it with dummy records and it gives the desired results. I would like to know if there is a way to lower the execution time of this code. Thanks.
I don't have enough reputation to comment, but here's a speculative answer.
Basically the only way to do this efficiently through Django's ORM is to use bulk_create. So the first thing to consider is the use of get_or_create. If your database has existing records that might have duplicates in the input file, then your only choice is writing the SQL yourself. If you only use it to avoid duplicates inside the input file, then preprocess the file to remove duplicate rows.
So if you can live without the get part of get_or_create, then you can follow this strategy:
Go through each row of the input file and instantiate a Contact_DUNS instance for each entry (don't actually create the rows, just write Contact_DUNS(duns=duns) ) and save all instances to an array. Pass the array to bulk_create to actually create the rows.
Generate a list of DUNS-id pairs with values_list and convert them to a dict with the DUNS number as the key and the row id as the value.
Repeat step 1 but with Contact instances. Before creating each instance, use the DUNS number to get the Contact_DUNS id from the dictionary of step 2. Then instantiate each Contact in the following way: Contact(duns_id=c_duns_id, fullName=fn, title=job). Again, after collecting the Contact instances just pass them to bulk_create to create the rows.
This should radically improve performance as you'll be no longer executing a query for each input line. But as I said above, this can only work if you can be certain that there are no duplicates in the database or the input file.
EDIT: Here's the code:
import os

def populate_duns():
    # Will only work if there are no DUNS duplicates
    # (both in the DB and within the file)
    duns_instances = []
    with open("filename") as f:
        for line in f:
            duns = line.strip().split("|")[1]
            duns_instances.append(Contact_DUNS(duns=duns))
    # Run a single INSERT query for all DUNS instances
    # (actually it will be run in batches but it's still quite fast)
    Contact_DUNS.objects.bulk_create(duns_instances)

def get_duns_dict():
    # This is basically a SELECT query for these two fields
    duns_id_pairs = Contact_DUNS.objects.values_list('duns', 'id')
    return dict(duns_id_pairs)

def populate_contacts():
    # Repeat the same process for Contacts
    contact_instances = []
    duns_dict = get_duns_dict()
    with open("filename") as f:
        for line in f:
            col = line.strip().split("|")
            duns = col[1]
            name = col[8]
            job = col[12]
            ci = Contact(duns_id=duns_dict[duns],
                         fullName=name,
                         title=job)
            contact_instances.append(ci)
    # Again, run only a single INSERT query
    Contact.objects.bulk_create(contact_instances)

if __name__ == '__main__':
    print "Populating Contact db...."
    os.environ.setdefault("DJANGO_SETTINGS_MODULE", "settings")
    from web.models import Contact, Contact_DUNS
    populate_duns()
    populate_contacts()
    print "Done!!"
CSV Import
First of all, 6 million records is quite a lot for SQLite, and worse still, SQLite isn't very good at importing CSV data directly.
There is no standard as to what a CSV file should look like, and the SQLite shell does not even attempt to handle all the intricacies of interpreting a CSV file. If you need to import a complex CSV file and the SQLite shell doesn't handle it, you may want to try a different front end, such as SQLite Database Browser.
On the other hand, MySQL and PostgreSQL are more capable of handling CSV data, and MySQL's LOAD DATA INFILE and PostgreSQL's COPY are both painless ways to import very large amounts of data in a very short period of time.
Suitability of SQLite
You are using Django => you are building a web app => more than one user will access the database. This is from the SQLite manual, about concurrency.
SQLite supports an unlimited number of simultaneous readers, but it will only allow one writer at any instant in time. For many situations, this is not a problem. Writers queue up. Each application does its database work quickly and moves on, and no lock lasts for more than a few dozen milliseconds. But there are some applications that require more concurrency, and those applications may need to seek a different solution.
Even your read operations are likely to be rather slow, because an SQLite database is just one single file. So with this amount of data there will be a lot of seek operations involved. The data cannot be spread across multiple files or even disks, as is possible with proper client-server databases.
The good news for you is that with Django you can usually switch from SQLite to MySQL or PostgreSQL just by changing your settings.py. No other changes are needed. (The reverse isn't always true.)
So I urge you to consider switching to MySQL or PostgreSQL before you get in too deep. It will help you solve your present problem and also help you avoid problems that you will run into sooner or later.
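For illustration, the switch in settings.py is roughly this (the database name and credentials are placeholders, and the exact ENGINE string depends on your Django version and driver):

# settings.py
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.postgresql_psycopg2',  # or 'django.db.backends.mysql'
        'NAME': 'contactsdb',       # placeholder database name
        'USER': 'dbuser',           # placeholder credentials
        'PASSWORD': 'dbpassword',
        'HOST': 'localhost',
        'PORT': '',
    }
}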
6,000,000 is quite a lot to import via Python. If Python is not a hard requirement, you could write a SQLite script that directly imports the CSV data and creates your tables using SQL statements. Even faster would be to preprocess your file using awk and output two CSV files corresponding to your two tables.
I used to import 20,000,000 records using the sqlite3 CSV importer and it took only a few minutes.
So I have a script that loads in data that was created from a python pickle file.
dump_file = open('movies.pkl')
movie_data = pickle.load(dump_file)

#transaction.commit_manually
def load_data(data):
    start = False
    counter = 0
    for item in data:
        counter += 1
        film_name = item.decode(encoding='latin1')
        print "at", film_name, str(counter), str(len(data))
        film_rating = float(data[item][0])
        get_votes = int(data[item][2]['votes'])
        full_date = data[item][2]['year']
        temp_film = Film(name=film_name, date=full_date, rating=film_rating, votes=get_votes)
        temp_film.save()
        for actor in data[item][1]:
            actor = actor.decode(encoding='latin1')
            print "adding", actor
            person = Person.objects.get(full=actor)
            temp_film.actors.add(person)
        if counter % 10000 == 0 or counter % len(data) == 0:
            transaction.commit()
            print "COMMITED"

load_data(movie_data)
So this is a very large data set, and it takes up a lot of memory, to the point where the script slows down to a crawl. In the past my solution was to just restart the script from where I left off, so it would take quite a few runs to actually save everything into the database.
I'm wondering if there's a better way to do this (even an optimization in my code would be nice) other than writing raw SQL to insert the data? I've tried JSON fixtures previously and it was even worse than this method.
If the size of movie_data is large, you might want to divide it into smaller files first and then iterate over them one by one.
Remember to free the memory of previously loaded pkl files, or keep overwriting the same variable.
If movie_data is a list, you can free the memory of, say, 1000 records after you have iterated over them by slicing, such as movie_data = movie_data[1000:], to reduce memory consumption over time.
You can use the bulk_create() method on the QuerySet object to create multiple objects in a single query; it's available in Django 1.4+ (see the sketch at the end of this answer). Please go through the following documentation link -
Bulk Create - https://docs.djangoproject.com/en/dev/ref/models/querysets/#django.db.models.query.QuerySet.bulk_create
You can also optimize your code by opening the file with the "with" keyword in Python. The "with" statement automatically closes the file for you: do all the operations inside the with block, and it will keep the file open for you and close it once you're out of the block.
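A minimal sketch combining both suggestions with the Film model from the question (the batch size is arbitrary; treat it as illustrative rather than a drop-in replacement, since bulk_create skips save() and the many-to-many actor links still have to be added separately afterwards):

import pickle

def load_data_bulk(path, batch_size=10000):
    # "with" closes the pickle file automatically
    with open(path, 'rb') as dump_file:
        data = pickle.load(dump_file)

    films = []
    for item in data:
        films.append(Film(
            name=item.decode('latin1'),
            date=data[item][2]['year'],
            rating=float(data[item][0]),
            votes=int(data[item][2]['votes']),
        ))
        if len(films) >= batch_size:
            Film.objects.bulk_create(films)  # one INSERT per batch instead of one per film
            films = []
    if films:
        Film.objects.bulk_create(films)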