Pandas vs JSON library to read a JSON file in Python

It seems that I can use both pandas and/or json to read a json file, i.e.
import pandas as pd
pd_example = pd.read_json('some_json_file.json')
or, equivalently,
import json
json_example = json.load(open('some_json_file.json'))
So my question is: what's the difference, and which one should I use? Is one recommended over the other, and are there situations where one is better than the other? Thanks.

It depends.
When you have a single JSON structure inside a json file, use read_json because it loads the JSON directly into a DataFrame. With json.load, you have to load it into a Python dictionary/list first, and then convert that into a DataFrame - an unnecessary two-step process.
Of course, this is under the assumption that the structure is directly parsable into a DataFrame. For non-trivial structures (usually of the form of complex nested lists-of-dicts), you may want to use json_normalize instead.
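For instance, a minimal sketch of what json_normalize does with a nested list-of-dicts (the records here are made up for illustration):

```python
import pandas as pd

# Hypothetical nested records of the kind read_json struggles with.
records = [
    {"id": 1, "user": {"name": "ann", "age": 30}},
    {"id": 2, "user": {"name": "bob", "age": 25}},
]

# json_normalize flattens the nested dicts into dotted column names.
df = pd.json_normalize(records)
print(df.columns.tolist())  # ['id', 'user.name', 'user.age']
```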
On the other hand, with a JSON lines file, the story becomes different. From my experience, I've found that loading a JSON lines file with pd.read_json(..., lines=True) is actually slightly slower on large data (tested on ~50k+ records once), and, to make matters worse, it cannot handle rows with errors - the entire read operation fails. In contrast, you can use json.loads on each line of your file inside a try-except block for some robust code, which actually ends up being a bit faster. Go figure.
Use whatever fits the situation.

Related

How to read json file without using python loop?

I have a JSON file that I want to convert into a DataFrame. Since the dataset is pretty large (~30 GB), I found that I need to set chunksize to limit memory use. The code is like this:
import pandas as pd
pd.options.display.max_rows
datas = pd.read_json('/Users/xxxxx/Downloads/Books.json', chunksize = 1, lines = True)
datas
Then when I run it the result is
<pandas.io.json._json.JsonReader at 0x15ce38550>
Is this an error?
Also, I found that if you loop over datas, it works. Is there a way to do it the standard way?
I don't think pandas is the way to go when reading giant json files.
First you should check whether your file is actually in valid JSON format (completely wrapped in one dictionary) or whether it is a JSONL file (each row is one dictionary in JSON format, but the rows are not connected).
Because if you are e.g. using the Amazon Review Dataset, which has these huge files, they are all JSONL but named JSON.
Two packages I can recommend for parsing huge JSON/JSONL files:
Pretty fast: ujson
import ujson
with open(file_path) as f:
    file_contents = ujson.load(f)
Even faster: ijson
import ijson
file_contents = [t for t in ijson.items(open(file_path), "item")]
This allows you to use a progress bar like tqdm as well:
import ijson
from tqdm import tqdm
file_contents = [t for t in tqdm(ijson.items(open(file_path), "item"))]
This is good for seeing how fast it is going, as it shows how many items it has already read. Since your file is 30 GB it might take quite a while to read it all, and it's good to know whether it's still going, or the memory crashed, or whatever else could have gone wrong.
You can then try to create a DataFrame from the dictionary using pandas.DataFrame.from_dict(file_contents), but I think 30 GB of content is way more than pandas allows as a maximum number of rows. Not quite sure though. Generally I would really recommend working with dictionaries for this amount of content, as it's much faster. Then convert it into a DataFrame only when you need to display some parts of it for visualization or analysis.
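As a sketch of that last step (the records here are made up), build a DataFrame only from the slice you actually need:

```python
import pandas as pd

# Hypothetical list of dicts, as ujson/ijson would produce.
file_contents = [
    {"title": "Book A", "rating": 5},
    {"title": "Book B", "rating": 3},
]

# Keep the bulk of the data as dicts; convert only the part
# you want to display or analyse.
df = pd.DataFrame.from_dict(file_contents[:1000])
print(df.shape)  # (2, 2)
```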

Floats converting to symbols reading from .dat file. Unsure of encoding

I am attempting to read MapInfo .dat files into .csv files using Python. So far, I have found the easiest way to do this is though xlwings and pandas.
When I do this (below code) I get a mostly correct .csv file. The only issue is that some columns are appearing as symbols/gibberish instead of their real values. I know this because I also have the correct data on hand, exported from MapInfo.
import xlwings as xw
import pandas as pd
app = xw.App(visible=False)
tracker = app.books.open('./cable.dat')
last_row = xw.Range('A1').current_region.last_cell.row
data = xw.Range("A1:AE" + str(last_row))
test_dataframe = data.options(pd.DataFrame, header=True).value
test_dataframe.columns = list(schema)
test_dataframe.to_csv('./output.csv')
When I compare to the real data, I can see that the symbols do actually map to the correct numbers (meaning that 1 = ?, 2 = #, 3 = #, etc.)
Below is the first part of the 'dictionary' as to how they map:
My question is this:
Is there an encoding that I can use to turn these series of symbols into their correct representation? The floats aren't the only column affected by this, but they are the most important to my data.
Any help is appreciated.
import pandas as pd
from simpledbf import Dbf5
dbf = Dbf5('path/filename.dat')
df = dbf.to_dataframe()
.dat files are dBase files underneath (https://www.loc.gov/preservation/digital/formats/fdd/fdd000324.shtml), so just use that method.
Then just output the data:
df.to_csv('outpath/filename.csv')
EDIT
If I understand correctly, you are using xlwings to load the .dat file into Excel, and then reading it into a pandas DataFrame to export it to a CSV file.
Somewhere along the way it seems that some binary data is incorrectly interpreted and then written as text to your CSV file.
directly read dBase file
My first suggestion would be to try to read the input file directly into Python without the use of an excel instance.
According to Wikipedia, MapInfo .dat files are actually dBase III files. These you can parse in Python using a library like dbfread.
inspect data before writing to csv
Secondly, I would inspect the 'corrupted' columns in python instead of immediately writing them to disk.
Either something is going wrong in the Excel import and the data in these columns gets imported as text instead of some binary number format,
or this data is correctly read into memory as a byte array (instead of a float), and when you write it to CSV it just gets dumped byte-wise to disk instead of being interpreted as a number format and given a text representation.
note
Small remark about your initial question regarding mapping text to numbers:
It will probably not be possible to create a straightforward map of characters to numbers:
These numbers could have any encoding and might not be stored as decimal text values, as you now seem to assume.
The text representations are just a decoding of the bytes using some character encoding (UTF-8, UTF-16, ...). E.g. in UTF-8 several bytes may map to one character, and the question marks or squares you see might indicate that one or more characters could not be decoded.
In any case you will lose information if you start from the text; you must start from the binary data to decode.
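To illustrate that last point, a tiny sketch of starting from the binary data: if a column really does hold the raw 8 bytes of a double (an assumption here, not something confirmed for MapInfo files), struct recovers the number exactly, whereas decoding the same bytes as text is lossy:

```python
import struct

# The raw 8 bytes of an IEEE-754 double, as they might sit in a binary record.
raw = struct.pack("<d", 3.14)

# Decoding those bytes as text just produces unprintable characters...
text = raw.decode("latin-1")

# ...but interpreting them as binary recovers the value exactly.
value = struct.unpack("<d", raw)[0]
print(value)  # 3.14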

Fastest way to read complex data structures from disk in Python

I have a dataset in CSV containing lists of values as strings in a single field that looks more or less like this:
Id,sequence
1,'1;0;2;6'
2,'0;1'
3,'1;0;9'
In the real dataset I'm dealing with, the sequence length vary greatly and can contain from one up to few thousands observations. There are many columns containing sequences all stored as strings.
I'm reading those CSVs and parsing the strings into lists nested inside a Pandas DataFrame. This takes some time, but I'm ok with it.
However, later when I save the parsed results to pickle the read time of this pickle file is very high.
I'm facing the following:
Reading a raw ~600 MB CSV file of such structure into Pandas takes around ~3 seconds.
Reading the same (raw, unprocessed) data from pickle takes ~0.1 second.
Reading the processed data from pickle takes 8 seconds!
I'm trying to find a way to read processed data from disk in the quickest possible way.
Already tried:
Experimenting with different storage formats but most of them can't store nested structures. The only one that worked was msgpack but that didn't improve the performance much.
Using structures other than Pandas DataFrame (like tuple of tuples) - faced similar performance.
I'm not very tied to the exact data structure. The thing is I would like to quickly read parsed data from disk directly to Python.
This might be a duplicate of this question
HDF5 is quite a bit quicker at handling nested pandas dataframes. I would give that a shot.
An example usage borrowed from here shows how you can chunk it efficiently when dumping:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(1000, 2), columns=list('AB'))
df.to_hdf('test.h5', 'df', mode='w', format='table', data_columns=True)

store = pd.HDFStore('test.h5')
nrows = store.get_storer('df').nrows
chunksize = 100
for i in range(nrows // chunksize + 1):
    chunk = store.select('df',
                         start=i * chunksize,
                         stop=(i + 1) * chunksize)
store.close()
When reading it back, you can do it in chunks like this, too:
for df in pd.read_hdf('raw_sample_storage2.h5', 'raw_sample_all', start=0, stop=300000, chunksize=3000):
    print(df.info())
    print(df.head(5))

Merging datasets using dask proves unsuccessful

I am trying to merge a number of large data sets using Dask in Python to avoid loading issues. I want to save the merged file as .csv. The task proves harder than imagined:
I put together a toy example with just two data sets
The code I then use is the following:
import dask.dataframe as dd
import glob
import os
os.chdir('C:/Users/Me/Working directory')
file_list = glob.glob("*.txt")
dfs = []
for file in file_list:
    ddf = dd.read_table(file, sep=';')
    dfs.append(ddf)
dd_all = dd.concat(dfs)
If I use dd_all.to_csv('*.csv') I simply print out the two original data sets.
If I use dd_all.to_csv('name.csv') I get an error saying the file does not exist.
(FileNotFoundError: [Errno 2] No such file or directory: 'C:\\Users\\Me\\Working directory\\name.csv\\1.part')
I can check using dd_all.compute() that the merged data set has successfully been created.
You are misunderstanding how Dask works - the behaviour you see is as expected. In order to be able to write from multiple workers in parallel, it is necessary for each worker to be able to write to a separate file; there is no way to know the length of the first chunk before writing it has finished, for example. To write to a single file is therefore necessarily a sequential operation.
The default operation, therefore, is to write one output file for each input partition, and this is what you see. Since Dask can read from these in parallel, it does raise the question of why you would want to create one output file at all.
For the second method without the "*" character, Dask is assuming that you are supplying a directory, not a file, and is trying to write two files within this directory, which doesn't exist.
If you really wanted to write a single file, you could do one of the following:
use the repartition method to make a single output piece and then to_csv
write the separate files and concatenate them after the fact (taking care of the header line)
iterate over the partitions of your dataframe in sequence to write to the same file.

how to retrieve all lines with errors in pandas

For example, I can use
pd.read_csv('file.csv')
to load a csv file.
By default, it fails when there are any parsing errors. I understand that one can use error_bad_lines=False to skip the rows with errors.
But my question is:
How to get all the lines where errors occur? This way, I can potentially solve the problem for not only this particular file.csv but also other related files in a batch file1.csv, file2.csv, file3.csv ...
One easy way would be to prepend a row index number to each row. This can easily be done with Awk or Python before loading the data. You could even do it in-memory using StringIO or your own custom file-like object in Python which would "magically" prepend the row numbers.
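A sketch of that idea (the file contents are made up; on_bad_lines="skip" is the modern spelling of error_bad_lines=False in pandas 1.3+):

```python
import io
import pandas as pd

def find_bad_lines(raw_text):
    # Prepend a 1-based row number to every line in memory, then parse
    # with bad lines skipped; the numbers missing from the result are
    # exactly the rows pandas rejected.
    lines = raw_text.splitlines()
    buf = io.StringIO()
    for i, line in enumerate(lines, start=1):
        buf.write(f"{i},{line}\n")
    buf.seek(0)
    df = pd.read_csv(buf, header=None, on_bad_lines="skip")
    survived = set(df[0])
    return sorted(set(range(1, len(lines) + 1)) - survived)

print(find_bad_lines("1,2\n3,4,5\n6,7"))  # [2]
```

The same function can then be run over file1.csv, file2.csv, file3.csv in a batch to collect every offending line number per file.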
