I have a dataset in CSV containing lists of values as strings in a single field that looks more or less like this:
Id,sequence
1,'1;0;2;6'
2,'0;1'
3,'1;0;9'
In the real dataset I'm dealing with, the sequence lengths vary greatly and can contain anywhere from one up to a few thousand observations. There are many columns containing sequences, all stored as strings.
I'm reading those CSVs and parsing the strings into lists nested inside a Pandas DataFrame. This takes some time, but I'm OK with it.
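For reference, a minimal sketch of that parsing step, assuming the quoted, ';'-separated format shown above (the file name data.csv is hypothetical):
import pandas as pd

# quotechar="'" strips the single quotes around each sequence string.
df = pd.read_csv("data.csv", quotechar="'")

# Split each sequence string on ';' and convert the pieces to ints,
# leaving a Python list nested inside each DataFrame cell.
df["sequence"] = df["sequence"].str.split(";").apply(lambda xs: [int(x) for x in xs])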
However, when I later save the parsed results to a pickle, the read time of that pickle file is very high.
I'm facing the following:
Reading the raw ~600 MB CSV file of this structure into Pandas takes around 3 seconds.
Reading the same (raw, unprocessed) data from a pickle takes ~0.1 seconds.
Reading the processed data from a pickle takes 8 seconds!
I'm trying to find the quickest possible way to read the processed data from disk.
Already tried:
Experimenting with different storage formats, but most of them can't store nested structures. The only one that worked was msgpack, but that didn't improve the performance much.
Using structures other than a Pandas DataFrame (like a tuple of tuples) - I faced similar performance.
I'm not tied to the exact data structure. The point is that I would like to quickly read the parsed data from disk directly into Python.
This might be a duplicate of this question.
HDF5 is quite a bit quicker at handling nested pandas dataframes. I would give that a shot.
An example usage borrowed from here shows how you can chunk it efficiently when dumping:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(1000, 2), columns=list('AB'))
df.to_hdf('test.h5', key='df', mode='w', format='table', data_columns=True)

store = pd.HDFStore('test.h5')
nrows = store.get_storer('df').nrows
chunksize = 100
for i in range(nrows // chunksize + 1):
    chunk = store.select('df',
                         start=i * chunksize,
                         stop=(i + 1) * chunksize)
    # process each chunk here
store.close()
When reading it back, you can do it in chunks like this, too:
for df in pd.read_hdf('raw_sample_storage2.h5', 'raw_sample_all',
                      start=0, stop=300000, chunksize=3000):
    df.info()  # info() prints its summary itself
    print(df.head(5))
Related
I have a JSON file that I want to convert into a DataFrame. Since the dataset is pretty large (~30 GB), I found that I need to set a chunksize to limit memory use. The code is like this:
import pandas as pd
pd.options.display.max_rows
datas = pd.read_json('/Users/xxxxx/Downloads/Books.json', chunksize = 1, lines = True)
datas
Then when I run it the result is
<pandas.io.json._json.JsonReader at 0x15ce38550>
Is this an error?
Also I found that if I loop over datas, it works. Is there a way to do this the standard way?
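For reference, a minimal sketch of the looping approach mentioned above; it assumes the concatenated result still fits in memory and uses a larger chunk size than in the question:
import pandas as pd

# read_json with chunksize returns a JsonReader (an iterator), not a DataFrame;
# iterating over it yields one DataFrame per chunk.
reader = pd.read_json('/Users/xxxxx/Downloads/Books.json', chunksize=10000, lines=True)
df = pd.concat(chunk for chunk in reader)  # only feasible if everything fits in RAM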
I don't think pandas is the way to go when reading giant JSON files.
First you should check whether your file is actually valid JSON (i.e. completely wrapped in one dictionary) or whether it is a JSONL file (each row is one dictionary in JSON format, but the rows are not connected).
For example, the Amazon Review Dataset has these huge files, which are all JSONL but named JSON.
Two packages I can recommend for parsing huge JSON/JSONL files:
Pretty fast: ujson
import ujson
with open(file_path) as f:
    file_contents = ujson.load(f)
Even faster: ijson
import ijson
file_contents = [t for t in ijson.items(open(file_path), "item")]
This allows you to use a progress bar like tqdm as well:
import ijson
from tqdm import tqdm
file_contents = [t for t in tqdm(ijson.items(open(file_path), "item"))]
This is good for knowing how fast it is going, as it shows you how many items it has already read. Since your file is 30 GB, it might take quite a while to read it all, and it's good to know whether it's still going, whether the memory crashed, or whatever else could have gone wrong.
You can then try to create a DataFrame from the parsed content using pandas.DataFrame.from_dict(file_contents), but I think 30 GB of content is way more than pandas allows as a maximum number of rows; not quite sure though. Generally I would really recommend working with dictionaries for this amount of content, as it's much faster. Then, only when you need to display some parts of it for visualization or analysis, convert it into a DataFrame.
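A minimal sketch of that last suggestion, assuming file_contents is the list of records parsed above (I use the plain DataFrame constructor here, since the records form a list of dictionaries):
import pandas as pd

# Keep working with the parsed dictionaries; only build a DataFrame
# for the slice you actually want to inspect, e.g. the first 1000 records.
preview = pd.DataFrame(file_contents[:1000])
print(preview.head())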
We are migrating from SAS to Python and I am having some trouble dealing with large dataframes.
I am dealing with a df with 15 million rows and 44 columns, a pretty large boy. I need to replace commas with dots in some columns, delete some other columns, and convert some to dates.
To delete columns, I found out that this works pretty well:
del df['column']
but when trying to replace using this:
df["column"] = (dfl["column"].replace('\.','', regex=True).replace(',','.', regex=True).astype(float))
I get:
MemoryError: Unable to allocate 14.2 MiB for an array with shape (14901054,) and data type uint8
Same happens when trying to convert to date using this:
df['column'] = pd.to_datetime(df['column'],errors='coerce')
I get:
MemoryError: Unable to allocate 114. MiB for an array with shape (14901054,) and data type datetime64[ns]
Is there another, more memory-efficient way to do these things? Or is the only solution to split the df beforehand? Thanks!
PS: not all columns give me this problem, but I guess that is not important.
I am not an expert with this library, but when you have to deal with a large amount of data, you can store the information on disk and read it back in chunks.
A good (though probably not the best) solution could be to store the data in a temporary CSV file and then read that file back in chunks, dealing with fewer rows in memory at a time:
Start from the original dataframe.
Remove the unnecessary columns.
Store the dataframe in a temporary CSV file.
Read the CSV back in chunks.
For each chunk, perform the column modifications and append it to another, final CSV file (see the sketch below).
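A rough sketch of that workflow, assuming hypothetical file names (temp.csv, final.csv) and hypothetical column names (amount, date); adapt both to your data:
import pandas as pd

chunksize = 500_000  # rows held in memory at a time; tune this to your RAM
first = True
for chunk in pd.read_csv("temp.csv", chunksize=chunksize):
    # Strip thousands separators, turn comma decimals into dots, then cast to float.
    chunk["amount"] = (chunk["amount"]
                       .replace(r"\.", "", regex=True)
                       .replace(",", ".", regex=True)
                       .astype(float))
    # Parse dates, coercing bad values to NaT.
    chunk["date"] = pd.to_datetime(chunk["date"], errors="coerce")
    # Append each processed chunk to the final CSV, writing the header only once.
    chunk.to_csv("final.csv", mode="w" if first else "a", header=first, index=False)
    first = False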
More information here (a few minutes of googling):
Why and how to use pandas with large data
Dealing with large datasets in pandas
Is Dask suitable for reading large CSV files in parallel and splitting them into multiple smaller files?
Yes, Dask can read large CSV files. It will split them into chunks:
import dask.dataframe as dd

df = dd.read_csv("/path/to/myfile.csv")
Then, when saving, Dask always saves CSV data to multiple files
df.to_csv("/output/path/*.csv")
See the read_csv and to_csv docstrings for much more information about this.
dd.read_csv
dd.DataFrame.to_csv
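A minimal end-to-end sketch of that split, assuming a hypothetical input path and using the blocksize option so each ~64 MB block of the input becomes one partition (and therefore one output file):
import dask.dataframe as dd

df = dd.read_csv("/path/to/myfile.csv", blocksize="64MB")
# to_csv writes one file per partition; the * is replaced by the partition number.
df.to_csv("/output/path/part-*.csv", index=False)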
Hi Nutsa Nazgaide and welcome to SO. First of all, I'd suggest you read about how-to-ask and mcve. Your question is good enough, but it would be great to produce a sample of your original dataframe. I'm going to produce a basic dataframe, but the logic shouldn't be too different in your case, as you just need to consider location.
Generate dataframe
import dask.dataframe as dd
import numpy as np
import pandas as pd
import string

letters = list(string.ascii_lowercase)
N = int(1e6)
df = pd.DataFrame({"member": np.random.choice(letters, N),
                   "values": np.random.rand(N)})
df.to_csv("file.csv", index=False)
One parquet file (folder) per member
If you're happy to have the output as parquet, you can just use the partition_on option:
df = dd.read_csv("file.csv")
df.to_parquet("output", partition_on="member")
If you then really need CSV, you can convert back to that format, but I strongly suggest you move your data to parquet.
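A minimal sketch of that conversion back to CSV, assuming the partitioned parquet output produced above and a hypothetical output path:
import dask.dataframe as dd

# Read the partitioned parquet dataset back and write it out as CSV,
# one file per partition.
df = dd.read_parquet("output")
df.to_csv("csv_output/part-*.csv", index=False)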
I'm kinda new to Python and data science.
I have a 33 GB CSV dataset, and I want to parse it into a DataFrame to do some stuff on it.
I tried to do it the 'casual' way with pandas.read_csv and it's taking ages to parse.
I searched on the internet and found this article.
It says that the most efficient way to read a large CSV file is to use csv.DictReader.
So I tried to do that:
import pandas as pd
import csv
df = pd.DataFrame(csv.DictReader(open("MyFilePath")))
Even with this solution it's taking ages to do the job.
Can you guys please tell me what the most efficient way is to parse a large dataset into pandas?
There is no way you can read such a big file in a short time. Anyway, there are some strategies for dealing with large data; these are some of the ones that let you implement your code without leaving the comfort of Pandas:
Sampling
Chunking
Optimising Pandas dtypes
Parallelising Pandas with Dask.
The simplest option is sampling your dataset (this may be helpful for you). Sometimes a random part of a large dataset will already contain enough information for the subsequent calculations. If you don't actually need to process your entire dataset, this is an excellent technique to use.
Sample code:
import pandas
import random

filename = "data.csv"
m = 10  # keep roughly 1/m of the rows; m was left undefined in the original snippet, so this value is an assumption
n = sum(1 for line in open(filename)) - 1  # number of data lines in the file (excluding the header)
s = n // m  # size of the sample
skip = sorted(random.sample(range(1, n + 1), n - s))  # rows to skip; row 0 (the header) is always kept
df = pandas.read_csv(filename, skiprows=skip)
This is the link for Chunking large data.
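For completeness, a minimal sketch of the chunking option; the file name and the per-chunk work (here just counting rows) are placeholders:
import pandas as pd

filename = "data.csv"
total_rows = 0
# Process the file piece by piece instead of loading it all at once.
for chunk in pd.read_csv(filename, chunksize=1_000_000):
    # Do your per-chunk processing here; as an example, just count the rows.
    total_rows += len(chunk)
print(total_rows)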
I am new to Python and I am attempting to read a large .csv file (with hundreds of thousands or possibly a few million rows, and about 15,000 columns) using pandas.
What I thought I could do is to create and save each chunk in a new .csv file, iteratively across all chunks. I am currently using a laptop with relatively limited memory (about 4 GB, in the process of upgrading it), but I was wondering whether I could do this without changing my setup now. Alternatively, I could transfer this process to a PC with large RAM and attempt larger chunks, but I wanted to get this in place even for shorter row chunks.
I have seen that I can quickly process chunks of data (e.g. 10,000 rows and all columns) using the code below. But, being a Python beginner, I have only managed to handle the first chunk. I would like to loop iteratively across chunks and save each of them.
import pandas as pd
import os
print(os.getcwd())
print(os.listdir(os.getcwd()))
chunksize = 10000
data = pd.read_csv('ukb35190.csv', chunksize=chunksize)
df = data.get_chunk(chunksize)
print(df)
export_csv1 = df.to_csv(r'/home/user/PycharmProjects/PROJECT/export_csv_1.csv', index=None, header=True)
If you are not doing any processing on the data then you don't even have to store it in a variable; you can do it directly. See the code below. Hope this helps you.
import pandas as pd

chunksize = 10000
batch_no = 1
for chunk in pd.read_csv(r'ukb35190.csv', chunksize=chunksize):
    # Write each chunk out to its own numbered CSV file.
    chunk.to_csv(r'ukb35190.csv' + str(batch_no) + '.csv', index=False)
    batch_no += 1