Django taking a very long time outputting CSV - python

I have a model with about 90,000 entries and I am outputting it as CSV with Django, but it never finishes: I left the browser for half an hour and got no output. The same method works fine when there is less data.
My method:
def usertype_csv(request):
    response = HttpResponse(content_type='text/csv')
    response['Content-Disposition'] = 'attachment; filename="university_list.csv"'
    writer = csv.writer(response)
    news_obj = Users.objects.using('cms').all()
    writer.writerow(['NAME', 'USERNAME', 'E-MAIL ID', 'USER TYPE', 'UNIVERSITY'])
    for item in news_obj:
        writer.writerow([item.name.encode('UTF-8'), item.username.encode('UTF-8'), item.email.encode('UTF-8'),
                         item.userTypeId.userType.encode('UTF-8'), item.universityId.name.encode('UTF-8')])
    return response
I have tested this with smaller data and it works, but for very large result sets it does not.
Thanks in advance

Any large CSV file you generate should ideally be streamed back to the user.

Here you can find an example of how Django handles it (the "Streaming large CSV files" section of Django's docs on outputting CSV).
This is how I would do it:
import csv
from django.http import StreamingHttpResponse

class Echo(object):
    """An object that implements just the write method of the file-like
    interface.
    """
    def write(self, value):
        """Write the value by returning it, instead of storing in a buffer."""
        return value

def some_streaming_csv_view(request):
    """A view that streams a large CSV file."""
    pseudo_buffer = Echo()
    news_obj = Users.objects.using('cms').all()
    writer = csv.writer(pseudo_buffer)
    response = StreamingHttpResponse(
        (writer.writerow([item.name.encode('UTF-8'),
                          item.username.encode('UTF-8'),
                          item.email.encode('UTF-8'),
                          item.userTypeId.userType.encode('UTF-8'),
                          item.universityId.name.encode('UTF-8')])
         for item in news_obj),
        content_type="text/csv")
    response['Content-Disposition'] = 'attachment; filename="university_list.csv"'
    return response
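One more thing worth checking: iterating over Users.objects.using('cms').all() still pulls the whole result set into the queryset cache, and each item.userTypeId / item.universityId access can trigger an extra query per row. A hedged variation of the generator above (assuming userTypeId and universityId are foreign key fields, which the attribute access suggests) keeps memory flat and avoids the per-row queries:

def stream_users_csv(request):
    # Sketch only: select_related() fetches the related rows in one JOIN,
    # and iterator() streams results from the database cursor instead of
    # caching the whole queryset in memory.
    pseudo_buffer = Echo()
    writer = csv.writer(pseudo_buffer)
    users = (Users.objects.using('cms')
             .select_related('userTypeId', 'universityId')
             .iterator())
    rows = (
        writer.writerow([item.name, item.username, item.email,
                         item.userTypeId.userType, item.universityId.name])
        for item in users
    )
    response = StreamingHttpResponse(rows, content_type='text/csv')
    response['Content-Disposition'] = 'attachment; filename="university_list.csv"'
    return response

On Python 3 the csv module handles unicode strings directly; keep the .encode('UTF-8') calls only if you are still on Python 2.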

Related

Efficient ways of using Python to Iterate over a CSV file and make an API Call on each row

I created a script which reads a CSV file and triggers an API call for each row. It works, but my concern is whether I will run into memory issues if the file is over 1M rows.
import json
import requests
import csv
import time

"""
PURPOSE:
This is a script designed to:
    1. read through a CSV file
    2. loop through each row of the CSV file
    3. build and trigger an API Request to the registerDeviceToken endpoint
       using the contents of each row of the CSV file

INSTRUCTIONS:
1. Create a CSV file with columns in the following order (left to right):
    1. email
    2. applicationName (i.e. your bundle ID or package name)
    3. platform (i.e. APNS, APNS_SANDBOX, GCM)
    4. token (i.e. device token)
2. Save the CSV file and make note of the full 'filepath'
3. Define the required constant variables below
4. Run the python script

Note: If your CSV file does not contain column headings, then set
contains_headers to False
"""

# Define constant variables
api_endpoint = '<Insert API Endpoint>'

# Update per user specifications
file_location = '/Users/bob/Development/Python/token.csv'  # Add location of CSV File
api_key = '<Insert API Key: Type server-side>'  # Add your API Key
contains_headers = True  # Set to True if file contains column headers

def main():
    # Open and read CSV File
    with open(r'%s' % (file_location)) as x:
        reader = csv.reader(x)
        if contains_headers == True:
            next(reader)  # Skip the first row if file contains column headers
        counter = 0  # This counter is used to monitor Rate Limit

        # Loop through each row
        for row in reader:
            jsonBody = {}
            device = {}
            # Create JSON body for API Request
            device['applicationName'] = row[1]
            device['platform'] = row[2]
            device['token'] = row[3]
            device['dataFields'] = {'endpointEnabled': True}
            jsonBody['email'] = row[0]
            jsonBody['device'] = device
            # Create API Request
            destinationHeaders = {
                'api_key': api_key,
                'Content-Type': 'application/json'
            }
            r = requests.post(api_endpoint, headers=destinationHeaders, json=jsonBody)
            print(r)
            data = json.loads(r.text)
            # Print Successes/Errors to console
            msg = 'user %s token %s' % (row[0], row[3])
            if r.status_code == 200:
                try:
                    msg = 'Success - %s. %s' % (msg, data['msg'])
                except Exception:
                    continue
            else:
                msg = 'Failure - %s. Code: %s, Details: %s' % (msg, r.status_code, data['msg'])
            print(msg)
            # Add delay to avoid rate limit
            counter = counter + 1
            if counter == 400:
                time.sleep(2)
                counter = 0

if __name__ == '__main__':
    main()
I've read about using Pandas with chunking as an option, but working with the DataFrame is not intuitive to me, and I can't figure out how to iterate through each row of the chunk the way I do in the example above. A few questions:
Will what I currently have run into any memory issues if the file is over a million rows? Each CSV should only have 4 columns, if that helps.
Is Pandas chunking going to be more efficient? If so, how can I iterate over each row of the 'csv chunk' to build my API request like in the example above?
In my pathetic attempt to chunk the file, the result of printing the 'row' in this code:
for chunk in pd.read_csv(file_location, chunksize=chunk_size):
    for row in chunk:
        print(row)
returns
email
device
applicationName
platform
token
So I am super confused. Thank you in advance for all your help.
Take a look at Python generators. Generators are a kind of iterator that does not store all the values in memory:
def read_file_generator(file_name):
    with open(file_name) as csv_file:
        for row in csv_file:
            yield row

def main():
    for row in read_file_generator("my_file.csv"):
        print(row)

if __name__ == '__main__':
    main()
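On the memory question: the csv.reader loop in your script already reads one row at a time, so it will not hold a million rows in memory; the per-row requests.post calls, not memory, are the bottleneck. On the pandas question: for row in chunk iterates over the column names, which is why you see email, device, applicationName, platform and token printed. If you do want chunking, a hedged sketch (the chunk size is an assumption) iterates each chunk's rows with itertuples():

import pandas as pd

chunk_size = 10000  # assumption: tune as needed
for chunk in pd.read_csv(file_location, chunksize=chunk_size):
    # Only one chunk (a DataFrame) is held in memory at a time;
    # itertuples() yields one namedtuple per CSV row.
    for row in chunk.itertuples(index=False):
        # Access fields by column name, e.g. row.email, row.token
        # (names depend on your CSV header), and build the API request
        # exactly as in the loop of the original script.
        print(row)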

Looping through values in an API and saving to a txt file

I am using a Pokemon API: https://pokeapi.co/api/v2/pokemon/
and I am trying to make a list which can store 6 Pokemon IDs, then, using a for loop, call the API and retrieve data for each Pokemon. Finally, I want to save this info in a txt file. This is what I have so far:
import random
import requests
from pprint import pprint

pokemon_number = []
for i in range(0, 6):
    pokemon_number.append(random.randint(1, 10))

url = 'https://pokeapi.co/api/v2/pokemon/{}/'.format(pokemon_number)
response = requests.get(url)
pokemon = response.json()
pprint(pokemon)

with open('pokemon.txt', 'w') as pok:
    pok.write(pokemon_number)
I don't understand how to get the API to read the IDs from the list.
I hope this is clear, I am in a right pickle.
Thanks
You are passing pokemon_number, which is a list, into the URL. You need to iterate over the list instead.
Also, to actually save each Pokemon, you can use either its name or its ID as the filename. The json library allows for easy saving of objects to JSON files.
import random
import requests
import json

# renamed this one to indicate it's not a single number
pokemon_numbers = []
for i in range(0, 6):
    pokemon_numbers.append(random.randint(1, 10))

# looping over the generated IDs
for id in pokemon_numbers:
    url = f"https://pokeapi.co/api/v2/pokemon/{id}/"
    # if you use response, you overshadow response from the requests library
    resp = requests.get(url)
    pokemon = resp.json()
    print(pokemon['name'])
    with open(f"{pokemon['name']}.json", "w") as outfile:
        json.dump(pokemon, outfile, indent=4)
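If you would rather end up with a single txt file, as in the original attempt, a minimal sketch of the same loop (name, height and weight are fields in the PokeAPI response; the line format is just an assumption) appends one line per Pokemon instead of writing one JSON file each:

with open("pokemon.txt", "w") as outfile:
    for id in pokemon_numbers:
        pokemon = requests.get(f"https://pokeapi.co/api/v2/pokemon/{id}/").json()
        # one line per Pokemon: name, height, weight
        outfile.write(f"{pokemon['name']} {pokemon['height']} {pokemon['weight']}\n")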
I now have this:
import random
import requests

pokemon_number = []
for i in range(0, 6):
    pokemon_number.append(random.randint(1, 50))

x = 0
while x < len(pokemon_number):
    print(pokemon_number[x])
    url = 'https://pokeapi.co/api/v2/pokemon/{}/'.format(pokemon_number[x])
    response = requests.get(url)
    pokemon = response.json()
    print(pokemon['name'])
    print(pokemon['height'])
    print(pokemon['weight'])
    # append mode so every Pokemon ends up in the same file
    with open('pokemon.txt', 'a') as p:
        p.write(pokemon['name'] + '\n')
        p.write(str(pokemon['abilities']) + '\n')
    x = x + 1

Trying to load pickled data to a list isn't appending properly

I'm writing a to-do list application, and to store the task class objects I'm pickling a list of the objects created. However, when I load the data, the list appears empty. The way I structured it is to create an empty list each session, then append the contents of the pickle file. When new tasks are created, they are appended to the list, the whole list is pickled again, and then reloaded.
This is my first real software project, so my code looks pretty rough. I reviewed it and can't find any glaring errors, but obviously I am doing something wrong.
Here is the relevant code:
import _pickle as pickle
import os.path
from os import path
from datetime import datetime

# checks if data exists, and creates file if it does not
if path.exists('./tasks.txt') != True:
    open("./tasks.txt", 'wb')
else:
    pass

# define class for tasks
class task:
    def __init__(self, name, due, category):
        self.name = name
        self.due = datetime.strptime(due, '%B %d %Y %I:%M%p')
        self.category = category

    def expand(self):  # returns the contents of the task
        return str(self.name) + " is due in " + str((self.due - datetime.now()))

data = []

# load data to list
def load_data():
    with open('tasks.txt', 'rb') as file:
        while True:
            data = []
            try:
                data.append(pickle.load(file))
            except EOFError:
                break
...
# returns current task list
def list_tasks():
    clear()
    if not data:
        print("Nothing to see here.")
    else:
        i = 1
        for task in data:
            print("%s. %s" % (i, task.expand()))
            i = i + 1

# define function to add tasks
def addTask(name, due, category):
    newTask = task(name, due, category)
    data.append(newTask)
    with open('./tasks.txt', 'wb') as file:
        pickle.dump(data, file)
    load_data()
    list_tasks()
...
load_data()
list_tasks()
startup()
ask()
data = []

# load data to list
def load_data():
    with open('tasks.txt', 'rb') as file:
        while True:
            data = []
            try:
                data.append(pickle.load(file))
            except EOFError:
                break
That second data = [] doesn't look right. Having data = [] both inside and outside of the function creates two data objects, and the one you're appending to won't be accessible anywhere else. And even if it was accessible, it would still be empty since it's being reset to [] in every iteration of the while loop. Try erasing the inner data = []. Then the data.append call will affect the globally visible data, and its contents won't be reset in each loop.
Additionally, going by the rest of your code it looks like that data is supposed to be a list of tasks. But if you pickle a list of tasks and then run data.append(pickle.load(file)), then data will be a list of lists of tasks instead. One way to keep things flat is to use extend instead of append.
data = []

# load data to list
def load_data():
    with open('tasks.txt', 'rb') as file:
        while True:
            try:
                data.extend(pickle.load(file))
            except EOFError:
                break
I think it may also be possible to load the data with a single load call, rather than many calls in a loop. It depends on whether your tasks.txt file is the result of a single pickle.dump call, or if you appended text to it multiple times with multiple pickle.dump calls while the file was opened in "append" mode.
def load_data():
    with open('tasks.txt', 'rb') as file:
        return pickle.load(file)

data = load_data()
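For completeness, a minimal sketch of the matching save side: since addTask already dumps the whole data list with a single pickle.dump, the single-load version above pairs naturally with a helper like this:

def save_data(data):
    # one dump of the whole list; pairs with the single pickle.load above
    with open('tasks.txt', 'wb') as file:
        pickle.dump(data, file)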

How to pass Bags (named-tuples) between activities

I'm new to the Bonobo library and built a simple flow:
read a simple CSV called input.csv with header: Header1, Header2, Header3, Header4
append a new column which is the concatenation of the others
write the result to a CSV file called output.csv
I'm using the built-in CsvReader and CsvWriter from bonobo to make it simple.
First I was stuck with the CsvReader not sending the headers with the cells, and a suggested workaround was adding the
@use_raw_input
decorator to the transformation coming right after the CsvReader. But when passing content to the next activity, the bag once again loses its header and is seen as a simple tuple. It only works if I explicitly name the fields:
def process_rows(Header1, Header2, Header3, Header4)
My code is below (put a breakpoint in process_rows to see that you get a tuple without the header):
import bonobo
from bonobo.config import use_raw_input

# region constants
INPUT_PATH = 'input.csv'
OUTPUT_PATH = 'output.csv'
EXPECTED_HEADER = ('Header1', 'Header2', 'Header3', 'Header4')
# endregion constants

# This is stupid because all rows are checked instead of only the first
@use_raw_input  # mandatory to get the header
def validate_header(input):
    if input._fields != EXPECTED_HEADER:
        raise("This file has an unexpected header, won't be processed")
    yield input

def process_rows(*input):
    concat = ""
    for elem in input:
        concat += elem
    result = input.__add__((concat,))
    yield result

# region bonobo + main
def get_graph(**options):
    graph = bonobo.Graph()
    graph.add_chain(bonobo.CsvReader(INPUT_PATH, delimiter=','),
                    validate_header,
                    process_rows,
                    bonobo.CsvWriter(OUTPUT_PATH))
    return graph

def get_services(**options):
    return {}

if __name__ == '__main__':
    parser = bonobo.get_argument_parser()
    with bonobo.parse_args(parser) as options:
        bonobo.run(
            get_graph(**options),
            services=get_services(**options)
        )
# endregion bonobo + main
Thanks for your time and help !
I did some investigations and found this "FUTURE" document that I think is what you are after:
http://docs.bonobo-project.org/en/master/guide/future/transformations.html
But it is not implemented.
I found this similar question Why does Bonobo's CsvReader() method yield tuples and not dicts?
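Until something like that lands, one possible workaround (a sketch only, assuming the raw input behaves like a namedtuple, as validate_header's use of input._fields suggests) is to apply @use_raw_input to process_rows as well, so the field names survive into that step:

@use_raw_input
def process_rows(row):
    # row should expose the header via the namedtuple interface
    # (row._fields, row._asdict()); concatenate every field and
    # append the result as an extra column.
    concat = "".join(str(value) for value in row)
    yield row + (concat,)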

Streaming a generated CSV with Flask

I have this function for streaming text files:
def txt_response(filename, iterator):
    if not filename.endswith('.txt'):
        filename += '.txt'
    filename = filename.format(date=str(datetime.date.today()).replace(' ', '_'))
    response = Response((_.encode('utf-8') + '\r\n' for _ in iterator), mimetype='text/txt')
    response.headers['Content-Disposition'] = 'attachment; filename={filename}'.format(filename=filename)
    return response
I am working out how to stream a CSV in a similar manner. This page gives an example, but I wish to use the CSV module.
I can use StringIO and create a fresh "file" and CSV writer for each line, but it seems very inefficient. Is there a better way?
According to this answer to "how do I clear a stringio object?", it is quicker to just create a new StringIO object for each line in the file than to reuse one as in the method below. However, if you still don't want to create new StringIO instances, you can achieve what you want like this:
import csv
import StringIO
from flask import Response

def iter_csv(data):
    line = StringIO.StringIO()
    writer = csv.writer(line)
    for csv_line in data:
        writer.writerow(csv_line)
        line.seek(0)
        yield line.read()
        line.truncate(0)
        line.seek(0)  # required for Python 3

def csv_response(data):
    response = Response(iter_csv(data), mimetype='text/csv')
    response.headers['Content-Disposition'] = 'attachment; filename=data.csv'
    return response
If you just want to stream back the results as they are created by csv.writer you can create a custom object implementing an interface the writer expects.
import csv
from flask import Response

class Line(object):
    def __init__(self):
        self._line = None

    def write(self, line):
        self._line = line

    def read(self):
        return self._line

def iter_csv(data):
    line = Line()
    writer = csv.writer(line)
    for csv_line in data:
        writer.writerow(csv_line)
        yield line.read()

def csv_response(data):
    response = Response(iter_csv(data), mimetype='text/csv')
    response.headers['Content-Disposition'] = 'attachment; filename=data.csv'
    return response
A slight improvement to Justin's existing great answer: you can take advantage of the fact that csv.writer's writerow() returns whatever the underlying file's write call returns.
import csv
from flask import Response

class DummyWriter:
    def write(self, line):
        return line

def iter_csv(data):
    writer = csv.writer(DummyWriter())
    for row in data:
        yield writer.writerow(row)

def csv_response(data):
    response = Response(iter_csv(data), mimetype='text/csv')
    response.headers['Content-Disposition'] = 'attachment; filename=data.csv'
    return response
If you are dealing with large amounts of data that you don't want to hold in memory, you could use SpooledTemporaryFile. It buffers in memory until it reaches max_size, after which it rolls over to disk.
However, I would stick with the recommended answer if you just want to stream back the results as they are created.
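A minimal sketch of that SpooledTemporaryFile approach (the max_size value and the chunked re-read are assumptions, not part of the answers above):

import csv
import tempfile
from flask import Response

def csv_response_spooled(data, max_size=1024 * 1024):
    # Rows are written to a buffer that stays in memory until it grows past
    # max_size bytes, then transparently rolls over to a temp file on disk.
    buffer = tempfile.SpooledTemporaryFile(max_size=max_size, mode='w+', newline='')
    writer = csv.writer(buffer)
    for row in data:
        writer.writerow(row)
    buffer.seek(0)

    def generate(chunk_size=8192):
        while True:
            chunk = buffer.read(chunk_size)
            if not chunk:
                buffer.close()
                break
            yield chunk

    response = Response(generate(), mimetype='text/csv')
    response.headers['Content-Disposition'] = 'attachment; filename=data.csv'
    return response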
