My website has a feature for exporting a daily report to Excel, and the report varies per user. For certain reasons I can't use Redis or Memcached. Each user has around 2-5 lakh (200,000-500,000) rows in the database. When a user triggers the export-to-Excel feature it takes 5-10 minutes, and during that time all of the server's resources (RAM, CPU) are consumed building the Excel file, which takes the site down for those 5 minutes; after that everything works fine again. I already chunked the query results into smaller parts, which solved about half of my RAM problem. Is there any other solution for CPU and RAM optimization?
Sample code:
def import_to_excel(request):
    order_list = Name.objects.all()
    book = xlwt.Workbook(encoding='utf8')
    style = xlwt.Style.default_style
    fname = 'order_data'
    sheet = book.add_sheet(fname)
    row = -1
    for order in order_list:
        row += 1
        sheet.write(row, 1, order.first_name, style=style)
        sheet.write(row, 2, order.last_name, style=style)
    # content_type replaces the old mimetype argument (removed in Django 1.7)
    response = HttpResponse(content_type='application/vnd.ms-excel')
    response['Content-Disposition'] = 'attachment; filename=order_data_pull.xls'
    book.save(response)
    return response
Instead of an HttpResponse, use a StreamingHttpResponse.
By streaming a file that takes a long time to generate, you avoid a load balancer dropping a connection that would otherwise have timed out while the server was still generating the response.
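As a rough illustration (not the original code): xlwt has to build the whole workbook in memory before saving, so streaming usually means switching to a row-by-row format such as CSV. The Echo helper, the CSV output, and the myapp import path are assumptions made for this sketch.
import csv
from django.http import StreamingHttpResponse
from myapp.models import Name  # assumed import path for the model in the question

class Echo:
    """File-like object whose write() just returns the value, so csv.writer can feed a generator."""
    def write(self, value):
        return value

def export_to_csv(request):
    writer = csv.writer(Echo())
    rows = (
        writer.writerow([order.first_name, order.last_name])
        for order in Name.objects.all().iterator()  # iterate the queryset without caching it all in RAM
    )
    response = StreamingHttpResponse(rows, content_type='text/csv')
    response['Content-Disposition'] = 'attachment; filename="order_data_pull.csv"'
    return response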
You can also process the request asynchronously using Celery.
Processing requests asynchronously lets your server keep accepting other requests while the previous one is handled by a worker in the background.
That makes the system much friendlier to users.
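For example, a minimal sketch of the Celery approach, assuming Celery is already configured for the project; the task writes the workbook to a file that can be served or emailed later, and the file path and model import are assumptions:
import xlwt
from celery import shared_task
from myapp.models import Name  # assumed import path for the model in the question

@shared_task
def export_orders_to_excel(file_path='/tmp/order_data_pull.xls'):
    book = xlwt.Workbook(encoding='utf8')
    sheet = book.add_sheet('order_data')
    style = xlwt.Style.default_style
    for row, order in enumerate(Name.objects.all().iterator()):
        sheet.write(row, 1, order.first_name, style=style)
        sheet.write(row, 2, order.last_name, style=style)
    book.save(file_path)  # the file can be served or emailed to the user later
    return file_path
The view then only calls export_orders_to_excel.delay() and returns immediately, while the worker burns the CPU outside the request/response cycle.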
Related
I am looking for some advice on this project. My thought was to use Python and an AWS Lambda to aggregate the data and respond to the website. The main parameters are date ranges, which can be dynamic.
Project Requirements:
Read data from monthly return files stored in JSON (each file contains roughly 3000 securities and is 1.6 MB in size)
Aggregate the data into various buckets displaying counts and returns for each bucket (for our purposes here, let's say the buckets are Sectors and Market Cap ranges, which can vary)
Display aggregated data on a website
Issue I face
I have successfully implemented this in an AWS Lambda; however, when testing requests that span 20 years of data (and yes, I do get them), I begin to hit the memory limits of AWS Lambda.
Process I used:
All files are stored in S3, so I use the boto3 library to obtain the files, reading them into memory. This is still small and not of any real significance.
I use json.loads to convert each file into a pandas dataframe, loading all of the files into one large dataframe. This is where it runs out of memory.
I then pass the dataframe to custom aggregations using groupby to get my results. This part is not as fast as I would like, but it does the job of getting what I need (a short sketch of this step is shown just below).
The end result is a dataframe that is then converted back into JSON and is less than 500 MB.
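For context, a minimal sketch of the kind of bucket aggregation that step refers to; the column names match the dataframe built further down, while the count/mean statistics are an assumption about what each bucket displays:
import pandas as pd

def aggregate_buckets(df):
    # Group by the Sector / market-cap buckets and report a count and average return per bucket
    return (
        df.groupby(['category', 'marketCapBand'])
          .agg(count=('ticker', 'count'), avgReturn=('totalReturn', 'mean'))
          .reset_index()
    )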
This entire process, when it runs locally outside the Lambda, takes about 40 seconds.
I have tried running this with threads, processing single frames at a time, but the performance degrades to about 1 minute 30 seconds.
While I would rather not scrap everything and start over, I am willing to do so if there is a more efficient way to handle this. The old process did everything inside node.js without a Lambda and took almost 3 minutes to generate.
Code currently used
I had to clean this up a little to pull out some items, but here is the code I'm using.
Read the data from S3 as JSON; this results in a collection of JSON strings, one per file.
while not q.empty():
    fkey = q.get()
    try:
        # fetch the S3 object and read its body into memory as a JSON string
        obj = self.s3.Object(bucket_name=bucket, key=fkey[1])
        json_data = obj.get()['Body'].read().decode('utf-8')
        results[fkey[1]] = json_data
    except Exception as e:
        results[fkey[1]] = str(e)
    q.task_done()
Loop through the JSON files to build a dataframe to work with:
for k, v in s3Data.items():
    lstdf.append(buildDataframefromJson(k, v))

def buildDataframefromJson(key, json_data):
    # Read the JSON into a dataframe, keeping only the columns we need
    tmpdf = pd.read_json(json_data,
                         dtype={
                             'ticker': str,
                             'totalReturn': np.float32,
                             'isExcluded': bool,
                             'marketCapStartUsd': np.float32,
                             'category': str,
                             'marketCapBand': str,
                             'peGreaterThanMarket': bool,
                             'epsUsd': np.float32
                         })[['ticker', 'totalReturn', 'isExcluded', 'marketCapStartUsd', 'category',
                             'marketCapBand', 'peGreaterThanMarket', 'epsUsd']]
    # Derive the month from the S3 key (fourth path segment, MM-YYYY) and tag every row with it
    dtTmp = datetime.strptime(key.split('/')[3], "%m-%Y")
    tmpdf.insert(0, 'Month', dtTmp, allow_duplicates=True)
    return tmpdf
I'm using the following code to stream a table and all its fields to a CSV and then back to the user. It's more or less the example from Django's website.
# CSV header
header = [field.name for field in model._meta.fields]
results = model.objects.all()
# Generate a streaming output in case the file is large
h = []
h.append(header)
rows = ([getattr(item, field) for field in header] for item in results)
# Need to add the header to the front
chained = itertools.chain(h, rows)
pseudo_buffer = Echo()  # File-like object that just returns the value when write is called
writer = csv.writer(pseudo_buffer)
response = StreamingHttpResponse((writer.writerow(row) for row in chained), content_type="text/csv")
filename = "{}_{}_{}.csv".format(app_name, model_name, date_str)
response['Content-Disposition'] = 'attachment; filename="{}"'.format(filename)
return response
The issue seems to be that, for larger datasets, the stream randomly stops before all the data has actually been fetched from the table. The total file size is about 12 MB, but it stops streaming anywhere from 500 KB to about 8 MB in.
I have only seen this in our production environment; I get the whole file in my development setup. It's all running in Docker containers, so in theory the setups are identical in both cases. I'm not sure whether there are other server-related settings that could be causing this, though.
Our devops guy did say he increased the load balancer timeouts, but I'm under the impression the streaming response shouldn't cause timeouts anyway because it's constantly sending data.
Is it possible it has to do with how the query is being executed, and it thinks it's done before it actually is (maybe poor query performance - a lot of joins, etc)?
Thanks everyone -
I'm scraping articles from a news site on behalf of the owner. I have to keep it to <= 5 requests per second, which works out to ~100k articles in 6 hours (overnight), but I'm getting ~30k at best.
Using a Jupyter notebook, it runs fine at first, but becomes less and less responsive. After 6 hours the kernel is usually un-interruptable and I have to restart it. Since I'm storing each article in memory, this is a problem.
So my question is: is there a more efficient way to do this to reach ~100k articles in 6 hours?
The code is below. For each valid URL in a Pandas dataframe column, the loop:
downloads the webpage
extracts the relevant text
cleans out some encoding garbage from the text
writes that text to another dataframe column
every 2000 articles, it saves the dataframe to a CSV (overwriting the last backup), to handle the eventual crash of the script.
Some ideas I've considered:
Write each article to a local SQL server instead of keeping it in memory (speed concerns?)
save each article's text in a CSV with its URL, then build a dataframe later
delete all print() calls and rely solely on logging (my logger config doesn't seem to perform well, though - I'm not sure it's logging everything I tell it to)
import re
import time
import logging
import requests
import pandas as pd

i = 0
# lots of NaNs in the column, hence the subsetting
for u in unique_urls[unique_urls['unique_suffixes'].isnull() == False]\
        .unique_suffixes.values[:]:
    i = i + 1
    if pd.isnull(u):
        continue
    # save our progress every 2k articles just in case
    if i % 2000 == 0:
        unique_urls.to_csv('/backup-article-txt.csv', encoding='utf-8')
    try:
        # pull the data
        html_r = requests.get(u).text
        # the phrase "TX:" indicates the start of the article
        # text, so if it's not present, the URL must have been bad
        if html_r.find("TX:") == -1:
            continue
        # capture just the text of the article
        txt = html_r[html_r.find("TX:")+5:]
        # fix encoding/formatting quirks
        txt = txt.replace('\n', ' ')
        txt = re.sub(r'[^\x00-\x7F]', '', txt)  # str.replace() is not regex-aware, so use re.sub to strip non-ASCII
        # wait 200 ms to spare the site's servers
        time.sleep(.2)
        # write our article to our dataframe
        unique_urls.loc[unique_urls.unique_suffixes == u, 'article_text'] = txt
        logging.info("done with url # %s -- %s remaining", i, (total_links - i))
        print("done with url # " + str(i))
        print(total_links - i)
    except Exception:
        logging.exception("Exception on article # %s, URL: %s", i, u)
        print("ERROR with url # " + str(i))
        continue
This is the logging config I'm using. I found it on SO, but with this particular script it doesn't seem to capture everything.
logTime = "{:%d %b-%X}".format(datetime.datetime.now())
logger = logging.getLogger()
fhandler = logging.FileHandler(filename=logTime + '.log', mode='a')
formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
fhandler.setFormatter(formatter)
logger.addHandler(fhandler)
logger.setLevel(logging.INFO)
ETA: some details in response to answers/comments:
the script is the only thing running on a 16 GB RAM EC2 instance
articles are ~100-800 words apiece
I'm going to take an educated guess and say that, based on your description, your script turns your machine into a swap storm by the time you get to around 30k articles. I don't see anything in your code where you could easily free up memory using:
some_large_container = None
Setting something that you know has a large allocation to None tells Python's memory manager that it's available for garbage collection. You also might want to explicitly call gc.collect(), but I'm not sure that would do you much good.
Alternatives you could consider:
sqlite3: Instead of a remote SQL database, use sqlite3 as intermediate storage. There is a Python module for it in the standard library (a minimal sketch follows below).
Keep appending to the CSV checkpoint file.
Compress your strings with zlib.compress().
Whichever way you decide to go, you're probably best off doing the collection as phase 1 and constructing the Pandas dataframe as phase 2. It never pays to be too clever by half; the other half tends to hang you.
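A minimal sketch of the sqlite3-plus-zlib idea, storing each article's text compressed as it is scraped so the dataframe can be built in a second phase; the table and column names here are assumptions for illustration.
import sqlite3
import zlib

conn = sqlite3.connect('articles.db')
conn.execute('CREATE TABLE IF NOT EXISTS articles (url TEXT PRIMARY KEY, txt BLOB)')

def save_article(url, text):
    # compress the text so 100k articles stay small on disk and out of RAM
    conn.execute('INSERT OR REPLACE INTO articles VALUES (?, ?)',
                 (url, zlib.compress(text.encode('utf-8'))))
    conn.commit()

def load_article(url):
    row = conn.execute('SELECT txt FROM articles WHERE url = ?', (url,)).fetchone()
    return zlib.decompress(row[0]).decode('utf-8') if row else None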
I'm seeing some really odd behavior in one of my GAE apps where periodically (every few days or so), the entire datastore just seems to get wiped and it restarts fresh. I've searched high and low for possible reasons for this and have so far come up empty handed.
There's only one entity kind in the datastore, with two properties. I'm loading up the datastore by reading a CSV file and inserting the data. Here's what that code looks like:
filename = "data.csv"
rows = []
with open(filename, 'rb') as csvfile:
lines = csv.reader(csvfile, delimiter=',', quotechar='"')
for line in lines:
prop1Value = line[0]
prop2Value = line[1]
aRow = SomeEntity(prop1=prop1Value, prop2=prop2Value)
rows.append(aRow)
chunkSize = 50
numProgressChunks = int(len(rows) / chunkSize) + 1
for puttableRows in chunks(rows, chunkSize):
db.put(puttableRows)
This is the only time that data gets written to the datastore. It works: after I import the CSV I'm able to run queries and fetch data. Then a few days later I find that the data is gone. Not only are all the rows gone, but the entity kind is no longer present in the GAE Datastore Viewer, whereas when I delete all the rows myself the kind still shows up.
This might be a coincidence, but the last time the data got wiped, I noticed that a GAE instance was started up around the same time (within a few minutes).
This is an HRD app, with python2.7, using django. There are no deploys around the time that it gets reset. I've looked at the logs and can't find anything odd happening around the time of the reset.
What am I missing?
I have a list of about 200,000 entities, and I need to query a specific RESTful API for each of them, ending up with all 200,000 entities saved in JSON format in text files.
The naive way of doing it is to go through the list of 200,000 entities, query them one by one, add the returned JSON to a list, and, when it's done, write it all to a text file. Something like:
import simplejson

from apiWrapper import api
from entities import listEntities  # list of the 200,000 entities

a = api()
fullEntityList = []
for entity in listEntities:
    fullEntityList.append(a.getFullEntity(entity))

with open("fullEntities.txt", "w") as f:
    simplejson.dump(fullEntityList, f)
Obviously this is not reliable, since 200,000 queries to the API will take about 10 hours, so I expect something will cause an error before it gets to write the file.
I guess the right way is to write it in chunks, but not sure how to implement it. Any ideas?
Also, I cannot do this with a database.
I would recommend writing them to an SQLite database. This is the way I do it for my own tiny web-spider applications, because you can query the keys quite easily and check which ones you have already retrieved. That way your application can easily continue where it left off, in particular if some 1,000 new entries get added next week.
Do design "recovery" into your application from the beginning. If there is some unexpected exception (say, a timeout due to network congestion), you don't want to have to restart from the beginning, but only re-run the queries that have not yet been successfully retrieved. At 200,000 queries, an uptime of 99.9% means you have to expect 200 failures!
For space efficiency and performance it will likely pay off to use a compressed format, such as compressing the JSON with zlib before dumping it into the database blob.
SQLite is a good choice, unless your spider runs on multiple hosts at the same time; for a single application, SQLite is perfect.
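A minimal sketch of that recoverable approach, assuming each entity identifier is a string; the table layout and the zlib compression are illustrative choices, and api/listEntities come from the question's own modules.
import sqlite3
import zlib
import simplejson
from apiWrapper import api
from entities import listEntities

conn = sqlite3.connect('entities.db')
conn.execute('CREATE TABLE IF NOT EXISTS entities (key TEXT PRIMARY KEY, payload BLOB)')

# keys already retrieved on previous runs, so we can resume where we left off
done = {row[0] for row in conn.execute('SELECT key FROM entities')}

a = api()
for entity in listEntities:
    if entity in done:
        continue  # already stored, skip it
    try:
        full = a.getFullEntity(entity)
    except Exception:
        continue  # leave failures for the next run instead of aborting everything
    blob = zlib.compress(simplejson.dumps(full).encode('utf-8'))
    conn.execute('INSERT OR REPLACE INTO entities VALUES (?, ?)', (entity, blob))
    conn.commit()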
The easy way is to open the file in 'a' (append) mode and write them one by one as they come in.
The better way is to use a job queue. This will allow you to spawn off a.getFullEntity calls into worker thread(s) and handle the results however you want when/if they come back, or schedule retries for failures, etc.
See Queue.
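A minimal sketch of the append-mode idea, writing one JSON object per line so a crash loses at most the record in flight; api and listEntities are the question's own modules.
import simplejson
from apiWrapper import api
from entities import listEntities

a = api()
with open("fullEntities.txt", "a") as f:
    for entity in listEntities:
        f.write(simplejson.dumps(a.getFullEntity(entity)) + "\n")
        f.flush()  # make sure each record actually reaches the disk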
I'd also use a separate thread that does the file writing, and use a Queue to keep a record of all entities. When I started off I thought this would be done in 5 minutes, but it turned out to be a little harder: simplejson and all the other such libraries I'm aware of do not support partial writing, so you cannot first write one element of a list and later add another. So I tried to solve this manually, by writing the [, the commas, and the ] to the file separately and then dumping each entity individually.
Without being able to check it (as I don't have your api), you could try:
import threading
import Queue
import simplejson

from apiWrapper import api
from entities import listEntities  # list of the 200,000 entities

CHUNK_SIZE = 1000

class EntityWriter(threading.Thread):
    lines_written = False
    _filename = "fullEntities.txt"

    def __init__(self, queue):
        super(EntityWriter, self).__init__()
        self._q = queue
        self.running = False

    def run(self):
        self.running = True
        with open(self._filename, "a") as f:
            # drain whatever is currently in the queue and append it to the file
            while True:
                try:
                    entity = self._q.get(block=False)
                    if not EntityWriter.lines_written:
                        EntityWriter.lines_written = True
                        f.write("[")
                        simplejson.dump(entity, f)
                    else:
                        f.write(",\n")
                        simplejson.dump(entity, f)
                except Queue.Empty:
                    break
        self.running = False

    def finish_file(self):
        with open(self._filename, "a") as f:
            f.write("]")


a = api()
fullEntityQueue = Queue.Queue(2 * CHUNK_SIZE)
n_entities = len(listEntities)
writer = None
for i, entity in enumerate(listEntities):
    fullEntityQueue.put(a.getFullEntity(entity))
    # every CHUNK_SIZE entities (and at the very end) launch a writer if none is running
    if (i + 1) % CHUNK_SIZE == 0 or i == n_entities - 1:
        if writer is None or not writer.running:
            writer = EntityWriter(fullEntityQueue)
            writer.start()
writer.join()
writer.finish_file()
What this script does
The main loop still iterates over your list of entities, getting the full information for each. Each entity is then put into a Queue. Every 1000 entities (and at the end of the list) an EntityWriter thread is launched that runs in parallel to the main thread. This EntityWriter reads from the Queue and dumps the entities to the desired output file.
Some additional logic is required to make the JSON a list; as mentioned above, I write the [, the commas, and the ] manually. The resulting file should, in principle, be understood by simplejson when you reload it.