Parsing CSV data from memory in Python - python

Is there a way to parse CSV data in Python when the data is not in a file? I'm storing CSV data in my database and I'd like to parse it. I'm looking for something analogous to Ruby's CSV.parse. I know Python has a CSV class but everything I've seen in the docs seems to deal with files as opposed to in-memory CSV data.
(And it's not an option to parse the data before it goes into the database.)
(And please don't tell me not to store the CSV data in the database. I know what I'm doing as far as the database goes.)

The Python csv module makes no special distinction for files. You can use StringIO to wrap your strings as file-like objects.

Here is why you should use cStringIO.StringIO (io.StringIO in Python 3.x) instead of some DIY kludge:
>>> import csv
>>> from cStringIO import StringIO
>>> fromDB = '"Column\nheading1",hdng2\r\n"data1\rfoo","data2\r\nfoo"\r\n'
>>> sources = [StringIO(fromDB), fromDB.splitlines(True),
... fromDB.splitlines(), fromDB.split("\n")]
>>> for i, source in enumerate(sources):
... print i, list(csv.reader(source))
...
0 [['Column\nheading1', 'hdng2'], ['data1\rfoo', 'data2\r\nfoo']] # OK
1 [['Column\nheading1', 'hdng2'], ['data1\rfoo', 'data2\r\nfoo']] # OK
2 [['Columnheading1', 'hdng2'], ['data1foo', 'data2foo']] # 3 errors
3 [['Columnheading1', 'hdng2'], ['data1\rfoo', 'data2\rfoo'], []] # 3 errors
>>>
Using guff.splitlines(True) is not recommended as it has a far greater chance than StringIO(guff) that whoever is reading your code will not have a clue what it does.

Use the StringIO module (io.StringIO in Python 3), which lets you dress strings up as file-like objects. That way you can pass a StringIO "file" to the csv module for parsing (or to any other parser you may be using).
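For Python 3, a minimal sketch of that idea could look like the following. It assumes the CSV text is already held in a str (for example, fetched from the database); parse_csv_text is a hypothetical helper name, not part of the csv module.

```python
import csv
import io


def parse_csv_text(text):
    """Parse CSV held in a string and return a list of rows (hypothetical helper)."""
    # newline="" passes line endings through untranslated, so line breaks
    # inside quoted fields reach the csv module unchanged.
    buffer = io.StringIO(text, newline="")
    return list(csv.reader(buffer))


from_db = '"Column\nheading1",hdng2\r\n"data1\rfoo","data2\r\nfoo"\r\n'
rows = parse_csv_text(from_db)
print(rows)
```

Because the reader pulls lines from the buffer on demand, quoted fields that span several lines are reassembled by the csv module itself.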

http://docs.python.org/library/csv.html
csv.reader(csvfile)
csvfile can be any object which supports the iterator protocol and returns a string each time its next() method is called; file objects and list objects are both suitable.
If you have, e.g., the content from the DB in a string, you can parse it like this:
import csv
fromDB = "1,2,3\n4,5,6"
reader = csv.reader(fromDB.split("\n"))
for row in reader:
    print("New row")
    for col in row:
        print(" ", col)

Related

How to save multiple data at once in Python

I am running a script which takes, say, an hour to generate the data I want. I want to be able to save all of the relevant variables to some external file so I can fiddle with them later without having to run the hour-long calculation over again. Is there an easy way I can save all of the variables I need into one convenient file?
In Matlab I would just contain all of the results of the calculation in a single structure so that later I could just load results.mat and I would have everything I need stored as results.output1, results.output2 or whatever. What is the Python equivalent of this?
In particular, the data that I would like to save includes arrays of complex numbers, which seems to present difficulties for using things like json.
I suggest taking a look at the built-in shelve module, which provides a persistent, dictionary-like object and generally works with all native Python types, so you can do the following.
Write a complex number to a file (named mydata in my example) under the key n (keep in mind that keys must be strings):
import shelve
my_number = 2+7j
with shelve.open('mydata') as db:
    db['n'] = my_number
Later, retrieve that number from the same file:
import shelve
with shelve.open('mydata') as db:
    my_number = db['n']
print(my_number) # (2+7j)
You can use Python's pickle module: its dump function writes all your data to a file, and its load function reads the data back later. I suggest you read more about pickle.
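As a rough sketch of the pickle approach, assuming the results are collected in one dictionary (the names results, output1 and output2 are illustrative, and a temporary directory stands in for wherever the file would really live):

```python
import os
import pickle
import tempfile

# Illustrative results, including complex numbers as mentioned in the question.
results = {"output1": [1 + 2j, 3 - 4j], "output2": 0.5}

# Hypothetical location; any writable path would do.
path = os.path.join(tempfile.mkdtemp(), "results.pkl")

# Save everything in one file (binary mode is required for pickle).
with open(path, "wb") as f:
    pickle.dump(results, f)

# Later, possibly in another session: load it back.
with open(path, "rb") as f:
    loaded = pickle.load(f)

print(loaded["output1"])
```

Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.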
I would recommend a JSON file. With json you can assign values to keys, just like dictionaries in stock Python. The json package is part of the standard library, so it is available as soon as Python is installed.
import json
# path is the name of the file to write
data = {"var1": "abcde", "var2": "fghij"}
with open(path, "w") as file:
    json.dump(data, file, indent=2, ensure_ascii=False)
You can also load this from a file using the same API:
with open(path, "r") as file:
    text = file.read()
data = json.loads(text)
Edit: JSON handles dicts, lists, strings, numbers, booleans and None, so if you want to save a list you can just put it in the dict (complex numbers, as in the question, are not supported directly):
data = {"list1": ["ab", "cd", "ef"]}
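The question mentions complex numbers, which json.dumps rejects by default. One possible workaround, sketched here under the assumption that a tagged dictionary is an acceptable on-disk form, is to pass a default function when encoding and an object_hook when decoding. The helper names and the "__complex__" tag are made up for this example.

```python
import json


def encode_complex(obj):
    """Called by json for objects it cannot serialize on its own."""
    if isinstance(obj, complex):
        return {"__complex__": True, "real": obj.real, "imag": obj.imag}
    raise TypeError("Cannot serialize object of type " + type(obj).__name__)


def decode_complex(d):
    """Called by json for every decoded dict; rebuilds tagged complex values."""
    if d.get("__complex__"):
        return complex(d["real"], d["imag"])
    return d


original = {"values": [2 + 7j, 1 - 1j]}
text = json.dumps(original, default=encode_complex)
restored = json.loads(text, object_hook=decode_complex)
print(restored)
```

This keeps the file human-readable, at the cost of a convention that every reader of the file has to know about.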

Python Storing Data

I have a list in my program and a function that appends to it. Unfortunately, when you close the program the items you added go away and the list goes back to its initial state. Is there any way to store the data so that when the user re-opens the program the list is complete?
You can use the pickle module to store in-memory data on disk. Here is an example:
Store data:
import pickle
dataset = ['hello', 'test']
outputFile = 'test.data'
# the with block closes the file even if dump() fails
with open(outputFile, 'wb') as fw:
    pickle.dump(dataset, fw)
Load data:
import pickle
inputFile = 'test.data'
with open(inputFile, 'rb') as fd:
    dataset = pickle.load(fd)
print(dataset)
You can save the list in a database such as SQLite, or in a .txt file. For example:
with open("mylist.txt", "w") as f:  # write mode
    f.write("{}".format(mylist))
Your list goes into the format() function. This creates a file named mylist.txt and saves your list data in it.
After that, when you want to access your data again, you can do:
with open("mylist.txt") as f:  # read mode, not write mode, careful
    rd = f.readlines()
# rd is a list of text lines holding the list's string form, not the list itself
print(rd)
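Reading the file back this way yields text rather than a list. A small sketch of one way to recover the list, assuming it contains only Python literals (strings, numbers, nested lists and so on), uses ast.literal_eval from the standard library. The temporary directory is only there to keep the example self-contained.

```python
import ast
import os
import tempfile

mylist = ["hello", "test", 42]

# Hypothetical location for the text file.
path = os.path.join(tempfile.mkdtemp(), "mylist.txt")

# Write the list's textual form, as in the answer above.
with open(path, "w") as f:
    f.write("{}".format(mylist))

# Read the text back and turn it into a list again.
with open(path) as f:
    restored = ast.literal_eval(f.read())

print(restored)
```

literal_eval only accepts literals, so unlike eval it will not run arbitrary code found in the file.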
The built-in pickle module provides some basic functionality for serialization, which is a term for turning arbitrary objects into something suitable to be written to disk. Check out the docs for Python 2 or Python 3.
Pickle isn't very robust though, and for more complex data you'll likely want to look into a database module like the built-in sqlite3 or a full-fledged object-relational mapping (ORM) like SQLAlchemy.
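For the sqlite3 route, a minimal sketch could look like this. The table name items, its columns, and the helper functions are assumptions made for the example; a real program would pick its own schema and database path.

```python
import os
import sqlite3
import tempfile


def save_items(db_path, items):
    """Replace the stored list with the given items, keeping their order."""
    conn = sqlite3.connect(db_path)
    try:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS items "
            "(position INTEGER PRIMARY KEY, value TEXT)"
        )
        conn.execute("DELETE FROM items")
        conn.executemany(
            "INSERT INTO items (position, value) VALUES (?, ?)",
            list(enumerate(items)),
        )
        conn.commit()
    finally:
        conn.close()


def load_items(db_path):
    """Return the stored list in its original order."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT value FROM items ORDER BY position"
        ).fetchall()
    finally:
        conn.close()
    return [value for (value,) in rows]


db_path = os.path.join(tempfile.mkdtemp(), "mylist.db")  # hypothetical path
save_items(db_path, ["hello", "test"])
items = load_items(db_path)
print(items)
```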
For storing large amounts of data, the HDF5 format is suitable. In Python, the h5py package provides an interface to it.

reconstituting a class full of data from MySQL BLOB object in python

I'm using CherryPy and it seems to not behave nicely when it comes to retrieving data from stored files on the server. (I asked for help on that and nobody replied, so I'm on to plan B, or C...) Now I have stored a class containing a bunch of data structures (3 dictionaries and two lists of lists all related) in a MySQL table, and amazingly, it was easier than I thought to insert the binary object (longblob). I turned it into a pickle file and INSERTED it.
However, I can't figure out how to reconstitute the pickle and rebuild the class full of data from it now. The database returns a giant string that looks like the pickle, but how do you make a string into a file-like object so that pickle.load(data) will work?
Alternative solutions: How to save the class as a BLOB in database, or some ideas on why I can save a pickle of this class but when I go to load it later, the class seems to be lost. But in SSH / locally, it works - only when calling pickle.load(xxx) from cherrypy do I get errors.
I'm up for plan D - if there's a better way to store a collection of structured data for fast retrieval without pickles or MYSQL blobs...
You can create a file-like in-memory object with (c)StringIO:
>>> from cStringIO import StringIO
>>> fobj = StringIO('file\ncontent')
>>> for line in fobj:
... print line
...
file
content
But for pickle usage you can directly load and dump to a string (have a look at the s in the function names):
>>> import pickle
>>> obj = 1
>>> serialized = pickle.dumps(obj)
>>> serialized
'I1\n.'
>>> pickle.loads(serialized)
1
But for structured data stored in a database, I would suggest in general that you use either
a table, preferably with an ORM like SQLAlchemy so it is directly mapped to a class, or
a dictionary, which can easily be (de)serialized with JSON,
and not use pickle at all.
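In Python 3 the pickled form is a bytes object, which fits a BLOB column directly. The sketch below shows both routes mentioned above: pickle.loads on the bytes, and pickle.load on an io.BytesIO wrapper when a file-like object is wanted. The data dictionary is an illustrative stand-in for the class contents described in the question.

```python
import io
import pickle

# Illustrative stand-in for the stored structures (dicts and lists of lists).
data = {
    "prices": {"apple": 1.5, "pear": 2.0},
    "grid": [[1, 2], [3, 4]],
}

# Serialize to bytes; this is what would be written to the BLOB column.
blob = pickle.dumps(data)

# Route 1: load straight from the bytes returned by the database.
restored = pickle.loads(blob)

# Route 2: wrap the bytes in a file-like object and use pickle.load.
restored_via_file = pickle.load(io.BytesIO(blob))

print(restored == restored_via_file)
```

When the pickled object is an instance of your own class, the module defining that class must be importable wherever the data is loaded.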
I struggled with this myself.
Convert to bytes using the UTF-8 charset and try to load the data in your object.
CurrentShoppingCart.SetCartItems(pickle.loads(bytes(DBCart[0]['Cart'], 'UTF-8')))
Andrew

Formatting a single row as CSV

I'm creating a script to convert a whole lot of data into CSV format. It runs on Google AppEngine using the mapreduce API, which is only relevant in that it means each row of data is formatted and output separately, in a callback function.
I want to take advantage of the logic that already exists in the csv module to convert my data into the correct format, but because the CSV writer expects a file-like object, I'm having to instantiate a StringIO for each row, write the row to the object, then return the content of the object, each time.
This seems silly, and I'm wondering if there is any way to access the internal CSV formatting logic of the csv module without the writing part.
The csv module wraps the _csv module, which is written in C. You could grab the source for it and modify it to not require the file-like object, but poking around in the module, I don't see any clear way to do it without recompiling.
One option could be having your own "file-like" object. Actually, csv.writer only requires the object to have a write method, so:
import csv
class PseudoFile(object):
    def write(self, string):
        # Do whatever with your string
        pass
csv.writer(PseudoFile()).writerow(row)
You're skipping a couple steps in there, but maybe it's just what you want.
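Building on the write-method idea, here is a small sketch that returns each formatted row as a string without creating a new buffer per row. RowCapture and format_row are hypothetical names chosen for this example.

```python
import csv


class RowCapture:
    """Minimal file-like object that remembers the last string written."""

    def __init__(self):
        self.last = ""

    def write(self, string):
        self.last = string


# One capture object and one writer are reused for every row.
_capture = RowCapture()
_writer = csv.writer(_capture)


def format_row(row):
    """Return one row formatted as a CSV line, including the line terminator."""
    _writer.writerow(row)
    return _capture.last


line = format_row(["a", "b,c", 'say "hi"'])
print(repr(line))
```

The module-level writer is shared state, so this sketch is not safe to call from several threads at once.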

Is there a memory efficient and fast way to load big JSON files?

I have some JSON files that are 500MB in size.
If I use the "trivial" json.load() to load the content all at once, it will consume a lot of memory.
Is there a way to read the file partially? If it were a line-delimited text file, I would be able to iterate over the lines. I am looking for something analogous.
There was a duplicate to this question that had a better answer. See https://stackoverflow.com/a/10382359/1623645, which suggests ijson.
Update:
I tried it out, and ijson is to JSON what SAX is to XML. For instance, you can do this:
import ijson
for prefix, the_type, value in ijson.parse(open(json_file_name)):
    print(prefix, the_type, value)
where prefix is a dot-separated index in the JSON tree (what happens if your key names have dots in them? I guess that would be bad for Javascript, too...), the_type describes a SAX-like event, one of 'null', 'boolean', 'number', 'string', 'map_key', 'start_map', 'end_map', 'start_array', 'end_array', and value is the value of the object or None if the_type is an event like starting/ending a map/array.
The project has some docstrings, but not enough global documentation. I had to dig into ijson/common.py to find what I was looking for.
So the problem is not that each file is too big, but that there are too many of them, and they seem to be adding up in memory. Python's garbage collector should be fine, unless you are keeping around references you don't need. It's hard to tell exactly what's happening without any further information, but some things you can try:
Modularize your code. Do something like:
for json_file in list_of_files:
    process_file(json_file)
If you write process_file() in such a way that it doesn't rely on any global state, and doesn't change any global state, the garbage collector should be able to do its job.
Deal with each file in a separate process. Instead of parsing all the JSON files at once, write a program that parses just one, and pass each one in from a shell script, or from another python process that calls your script via subprocess.Popen. This is a little less elegant, but if nothing else works, it will ensure that you're not holding on to stale data from one file to the next.
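The first suggestion can be sketched as follows. Here process_file returns only a small summary (the number of top-level entries, chosen purely for illustration), so the parsed data becomes unreachable as soon as the function returns. The demo files are created in a temporary directory to keep the example self-contained.

```python
import json
import os
import tempfile


def process_file(path):
    """Load one JSON file and return only a small summary of it."""
    with open(path, "r") as f:
        data = json.load(f)
    # Nothing but this number leaves the function, so data can be freed.
    return len(data)


def process_all(paths):
    """Process the files one at a time, keeping only the summaries."""
    return {os.path.basename(p): process_file(p) for p in paths}


# Demo with two small files standing in for the large ones.
folder = tempfile.mkdtemp()
paths = []
for name, content in [("a.json", [1, 2, 3]), ("b.json", {"x": 1})]:
    file_path = os.path.join(folder, name)
    with open(file_path, "w") as f:
        json.dump(content, f)
    paths.append(file_path)

summary = process_all(paths)
print(summary)
```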
Hope this helps.
Yes.
You can use jsonstreamer, a SAX-like push parser that I have written, which will allow you to parse arbitrary sized chunks; you can get it here and check out the README for examples. It's fast because it uses the C yajl library.
It can be done by using ijson. The working of ijson has been very well explained by Jim Pivarski in the answer above. The code below will read a file and print each json from the list. For example, file content is as below
[{"name": "rantidine", "drug": {"type": "tablet", "content_type": "solid"}},
{"name": "nicip", "drug": {"type": "capsule", "content_type": "solid"}}]
You can print every element of the array using the method below:
import ijson
def extract_json(filename):
    with open(filename, 'rb') as input_file:
        jsonobj = ijson.items(input_file, 'item')
        jsons = (o for o in jsonobj)
        for j in jsons:
            print(j)
Note: 'item' is the prefix ijson uses for the elements of an array.
If you want to access only specific objects based on a condition, you can do it in the following way:
def extract_tabtype(filename):
    with open(filename, 'rb') as input_file:
        # 'item.drug' selects the "drug" object of every array element
        objects = ijson.items(input_file, 'item.drug')
        tabtype = (o for o in objects if o['type'] == 'tablet')
        for prop in tabtype:
            print(prop)
This will print only those objects whose type is tablet.
On your mention of running out of memory I must question if you're actually managing memory. Are you using the "del" keyword to remove your old object before trying to read a new one? Python should never silently retain something in memory if you remove it.
Update
See the other answers for advice.
Original answer from 2010, now outdated
Short answer: no.
Properly dividing a json file would take intimate knowledge of the json object graph to get right.
However, if you have this knowledge, then you could implement a file-like object that wraps the json file and spits out proper chunks.
For instance, if you know that your json file is a single array of objects, you could create a generator that wraps the json file and returns chunks of the array.
You would have to do some string content parsing to get the chunking of the json file right.
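A rough sketch of such a generator, using only the standard library, is shown below. It assumes the file is a single top-level JSON array opened in text mode, that the opening bracket appears in the first chunk read, and it is lenient about stray commas. iter_array_items is a hypothetical name, and large elements are re-parsed each time a new chunk arrives, so this illustrates the idea rather than a tuned parser.

```python
import io
import json


def iter_array_items(fileobj, chunk_size=65536):
    """Yield the elements of a top-level JSON array one at a time."""
    decoder = json.JSONDecoder()
    buffer = fileobj.read(chunk_size).lstrip()
    if not buffer.startswith("["):
        raise ValueError("expected a top-level JSON array")
    buffer = buffer[1:]
    while True:
        # Skip whitespace and the commas that separate elements.
        buffer = buffer.lstrip(" \t\r\n,")
        if buffer.startswith("]"):
            return
        try:
            item, end = decoder.raw_decode(buffer)
            rest = buffer[end:].lstrip()
            # Accept the element only once its terminating "," or "]" is
            # visible; otherwise a number cut at a chunk boundary could be
            # mistaken for a complete value.
            complete = rest[:1] in (",", "]")
        except json.JSONDecodeError:
            complete = False  # element is still incomplete
        if complete:
            yield item
            buffer = rest
            continue
        chunk = fileobj.read(chunk_size)
        if not chunk:
            raise ValueError("truncated or malformed JSON array")
        buffer += chunk


sample = [{"name": "a", "n": 1}, {"name": "b", "n": [1, 2, 3]}, 12345, "x,]y"]
# A tiny chunk size forces elements to span several reads.
items = list(iter_array_items(io.StringIO(json.dumps(sample)), chunk_size=4))
print(items)
```

Only one element plus the unread remainder of the current chunk is held in memory at a time, which is the property the answer is after.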
I don't know what generates your json content. If possible, I would consider generating a number of managable files, instead of one huge file.
Another idea is to try load it into a document-store database like MongoDB.
It deals with large blobs of JSON well. Although you might run into the same problem loading the JSON - avoid the problem by loading the files one at a time.
If that path works for you, then you can interact with the JSON data via their client and potentially not have to hold the entire blob in memory.
http://www.mongodb.org/
"the garbage collector should free the memory"
Correct.
Since it doesn't, something else is wrong. Generally, the problem with infinite memory growth is global variables.
Remove all global variables.
Make all module-level code into smaller functions.
In addition to @codeape's answer:
I would try writing a custom JSON parser to help you figure out the structure of the JSON blob you are dealing with. Print out the key names only, etc. Make a hierarchical tree and decide (yourself) how you can chunk it. This way you can do what @codeape suggests: break the file up into smaller chunks, etc.
You can convert the JSON file to a CSV file and then parse that line by line:
import ijson
import csv
def convert_json(file_path):
    did_write_headers = False
    headers = []
    row = []
    # ijson reads the input in binary mode; newline='' lets csv control line endings
    with open(file_path, 'rb') as json_file, \
            open(file_path + '.csv', 'w', newline='') as csv_file:
        csv_writer = csv.writer(csv_file, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
        for prefix, event, value in ijson.parse(json_file):
            if event == 'end_map':
                # the first completed object supplies the header row
                if not did_write_headers:
                    csv_writer.writerow(headers)
                    did_write_headers = True
                csv_writer.writerow(row)
                row = []
            if event == 'map_key' and not did_write_headers:
                headers.append(value)
            if event == 'string':
                row.append(value)
Simply using json.load() will take a lot of time. Instead, if the file holds one JSON object per line, you can load the data line by line, store each line's key/value pairs in a dictionary, add that dictionary to a final dictionary, and convert the result to a pandas DataFrame for further analysis:
import json
import pandas as pd
def get_data():
    with open('Your_json_file_name', 'r') as f:
        for line in f:
            yield line
data = get_data()
data_dict = {}
for i, line in enumerate(data):
    each = {}
    # k and v are the key and value pair
    for k, v in json.loads(line).items():
        #print(f'{k}: {v}')
        each[f'{k}'] = f'{v}'
    # i is the line number, used as the column label
    data_dict[i] = each
Data = pd.DataFrame(data_dict)
# Data holds the dictionary data as a DataFrame (table format), but it will
# be in transposed form, so finally transpose the DataFrame:
Data_1 = Data.T
