Pickling pandas dataframe multiplies by 5 the file size - python

I am reading a 800 Mb CSV file with pandas.read_csv, and then use the original Python pickle.dump(datfarame) to save it. The result is a 4 Gb pkl file, so the CSV size is multiplied by 5.
I expected pickle to compress data rather than extend it. Also because I can do a gzip on the CSV file which compress it to 200 Mb, dividing it by 4.
I am willing to accelerate the loading time of my program, and thought that pickling would help, but considering disk access is the main bottleneck I am understanding that I would rather have to compress the files and then use the compression option from pandas.read_csv to speed up the loading time.
Is that correct?
Is it normal that pickling pandas dataframe extend the data size?
How do you speed up loading time usually?
What are the data-size limit would you load with pandas?

Not sure why you think pickling compresses the data size, pickling creates a string version of your python object so that it can be loaded back as a python object:
In [388]:
import sys
import os
df = pd.DataFrame({'a':np.arange(5)})
df.to_pickle(r'c:\data\df.pkl')
print(sys.getsizeof(df))
statinfo = os.stat(r'c:\data\df.pkl')
print(statinfo.st_size)
with open(r'c:\data\df.pkl', 'rb') as f:
print(f.read())
56
700
b'\x80\x04\x95\xb1\x02\x00\x00\x00\x00\x00\x00\x8c\x11pandas.core.frame\x94\x8c\tDataFrame\x94\x93\x94)}\x94\x92\x94\x8c\x15pandas.core.internals\x94\x8c\x0cBlockManager\x94\x93\x94)}\x94\x92\x94(]\x94(\x8c\x11pandas.core.index\x94\x8c\n_new_Index\x94\x93\x94h\x0b\x8c\x05Index\x94\x93\x94}\x94(\x8c\x04data\x94\x8c\x15numpy.core.multiarray\x94\x8c\x0c_reconstruct\x94\x93\x94\x8c\x05numpy\x94\x8c\x07ndarray\x94\x93\x94K\x00\x85\x94C\x01b\x94\x87\x94R\x94(K\x01K\x01\x85\x94\x8c\x05numpy\x94\x8c\x05dtype\x94\x93\x94\x8c\x02O8\x94K\x00K\x01\x87\x94R\x94(K\x03\x8c\x01|\x94NNNJ\xff\xff\xff\xffJ\xff\xff\xff\xffK?t\x94b\x89]\x94\x8c\x01a\x94at\x94b\x8c\x04name\x94Nu\x86\x94R\x94h\rh\x0b\x8c\nInt64Index\x94\x93\x94}\x94(h\x11h\x14h\x17K\x00\x85\x94h\x19\x87\x94R\x94(K\x01K\x05\x85\x94h\x1f\x8c\x02i8\x94K\x00K\x01\x87\x94R\x94(K\x03\x8c\x01<\x94NNNJ\xff\xff\xff\xffJ\xff\xff\xff\xffK\x00t\x94b\x89C(\x00\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00\x04\x00\x00\x00\x00\x00\x00\x00\x94t\x94bh(Nu\x86\x94R\x94e]\x94h\x14h\x17K\x00\x85\x94h\x19\x87\x94R\x94(K\x01K\x01K\x05\x86\x94h\x1f\x8c\x02i4\x94K\x00K\x01\x87\x94R\x94(K\x03h5NNNJ\xff\xff\xff\xffJ\xff\xff\xff\xffK\x00t\x94b\x89C\x14\x00\x00\x00\x00\x01\x00\x00\x00\x02\x00\x00\x00\x03\x00\x00\x00\x04\x00\x00\x00\x94t\x94ba]\x94h\rh\x0f}\x94(h\x11h\x14h\x17K\x00\x85\x94h\x19\x87\x94R\x94(K\x01K\x01\x85\x94h"\x89]\x94h&at\x94bh(Nu\x86\x94R\x94a}\x94\x8c\x060.14.1\x94}\x94(\x8c\x06blocks\x94]\x94}\x94(\x8c\x06values\x94h>\x8c\x08mgr_locs\x94\x8c\x08builtins\x94\x8c\x05slice\x94\x93\x94K\x00K\x01K\x01\x87\x94R\x94ua\x8c\x04axes\x94h\nust\x94bb.'
The method to_csv does support compression as a kwarg, 'gzip' and 'bz2':
In [390]:
df.to_csv(r'c:\data\df.zip', compression='bz2')
statinfo = os.stat(r'c:\data\df.zip')
print(statinfo.st_size)
29

It is likely in your best interest to stash your CSV file in a database of some sort and perform operations on that rather than loading the CSV file to RAM, as Kathirmani suggested. You will see the speedup in loading time that you expect due simply to the fact that you are not filling up 800 Mb worth of RAM every time you load your script.
File compression and loading time are two conflicting elements of what you seem to be trying to accomplish. Compressing the CSV file and loading that will take more time; you've now added the extra step of having to decompress the file, which doesn't solve your problem.
Consider a precursory step to ship the data to an sqlite3 database, as described here: Importing a CSV file into a sqlite3 database table using Python.
You now have the pleasure of being able to query a subset of your data and quickly load it into a pandas.DataFrame for further use, as follows:
from pandas.io import sql
import sqlite3
conn = sqlite3.connect('your/database/path')
query = "SELECT * FROM foo WHERE bar = 'FOOBAR';"
results_df = sql.read_frame(query, con=conn)
...
Conversely, you can use pandas.DataFrame.to_sql() to save these for later use.

Dont load 800MB file to memory. It will increase your loading time. Pickle objects too takes more time to load. Instead store the csv file as a sqlite3 (which comes along with python) table. And then query the table every time depending upon your need.

You can also use panda's pickle methods which should compress your data.
Save a dataframe:
df.to_pickle(filename)
Load it:
df = pd.read_pickle(filename)

Related

Parquet File re-write has slightly larger size in both Pandas / PyArrow

So I am trying to read a parquet file into memory, choose chunks of the file and upload it to AWS S3 Bucket. I want to write sanity tests to check if a file was uploaded correctly through either size check or MD5 hash check between the local and cloud file on the bucket.
One thing I noticed is that reading a file into memory, either as bytes or pd.DataFrame / Table, and then re-writing the same object into a new file would change the file size, in my case increasing it compared to the original. Here's some sample code:
import pandas as pd
df = pd.read_parquet("data/example.parquet")
Then I simply write:
from io import ByteIO
buffer = ByteIO()
df.to_parquet(buffer) # this can be done straight without BytesIO. I use it for clarity.
with open('copy.parquet', 'rb') as f:
f.write(buffer.getvalue())
Now using ls -l on both files give me different sizes:
37089 Oct 28 16:57 data/example.parquet
37108 Dec 7 14:17 copy.parquet
Interestingly enough, I tried using a tool such as xxd paired with diff, and to my surprise the binary difference was scattered all across the file, so I think it's safe to assume that this is not just limited to a metadata difference. Reloading both files into memory using pandas gives me the same table. It might also be worth mentioning that the parquet file contains both Nan and Nat values. Unfortunately I cannot share the file but see if I can replicate the behavior with a small sample.
I also tried using Pyarrow's file reading functionality which resulted in the same file size:
import pyarrow as pa
import pyarrow.parquet as pq
with open('data/example.parquet', 'rb') as f:
buffer = pa.BufferReader(obj)
table = pq.read_table(buffer)
pq.write_table(table, 'copy.parquet')
I have also tried turning on the compression='snappy' in both versions, but it did not change the output.
Is there some configuration I'm missing when writing back to disk?
Pandas uses pyarrow to read/write parquet so it is unsurprising that the results are the same. I am not sure what clarity using buffers gives compared to saving the files directly so I have left it out in the code below.
What was used to write the example file? If it was not pandas but e.g. pyarrow directly that would show up as mostly meta data difference as pandas adds its own schema in addition to the normal arrow meta data.
Though you say this is not the case here so the likely reason is that this file was written by another system with a different version of pyarrow, as Michael Delgado mentioned in the comments snappy compression is turned on by default. Snappy is not deterministic between systems:
not across library versions (and possibly not
even across architectures)
This explains why you see the difference all over the file. You can try the code below to see that on the same machine the md5 is the same between files but the pandas version is larger due to the added meta data.
Currently the arrow s3 writer does not check for integrity but the S3 API has such a functionality. I have opened an issue to make this accessible via arrow.
import pandas as pd
import pyarrow as pa
import numpy as np
import pyarrow.parquet as pq
arr = pa.array(np.arange(100))
table = pa.Table.from_arrays([arr], names=["col1"])
pq.write_table(table, "original.parquet")
pd_copy = pd.read_parquet("original.parquet")
copy = pq.read_table("original.parquet")
pq.write_table(copy, "copy.parquet")
pd_copy.to_parquet("pd_copy.parquet")
$ md5sum original.parquet copy.parquet pd_copy.parquet
fb70a5b1ca65923fec01a54f85f17260 original.parquet
fb70a5b1ca65923fec01a54f85f17260 copy.parquet
dcb93cb89426a948e885befdbee204ff pd_copy.parquet
1092 copy.parquet
1092 original.parquet
2174 pd_copy.parquet

Convert huge csv to hdf5 format

I downloaded IBM's Airline Reporting Carrier On-Time Performance Dataset; the uncompressed CSV is 84 GB. I want to run an analysis, similar to Flying high with Vaex, with the vaex libary.
I tried to convert the CSV to a hdf5 file, to make it readable for the vaex libary:
import time
import vaex
start=time.time()
df = vaex.from_csv(r"D:\airline.csv", convert=True, chunk_size=1000000)
end=time.time()
print("Time:",(end-start),"Seconds")
I always get an error when running the code:
RuntimeError: Dirty entry flush destroy failed (file write failed: time = Fri Sep 30 17:58:55 2022
, filename = 'D:\airline.csv_chunk_8.hdf5', file descriptor = 7, errno = 22, error message = 'Invalid argument', buf = 0000021EA8C6B128, total write size = 2040, bytes this sub-write = 2040, bytes actually written = 18446744073709551615, offset = 221133661).
Second run, I get this error:
RuntimeError: Unable to flush file's cached information (file write failed: time = Fri Sep 30 20:18:19 2022
, filename = 'D:\airline.csv_chunk_18.hdf5', file descriptor = 7, errno = 22, error message = 'Invalid argument', buf = 000002504659B828, total write size = 2048, bytes this sub-write = 2048, bytes actually written = 18446744073709551615, offset = 348515307)
Is there an alternative way to convert the CSV to hdf5 without Python? For example, a downloadable software which can do this job?
I'm not familiar with vaex, so can't help with usage and functions. However, I can read error messages. :-)
It reports "bytes written" with a huge number (18_446_744_073_709_551_615), much larger than the 84GB CSV. Some possible explanations:
you ran out of disk
you ran out of memory, or
had some other error
To diagnose, try testing with a small csv file and see if vaex.from_csv() works as expected. I suggest the lax_to_jfk.csv file.
Regarding your question, is there an alternative way to convert a csv to hdf5?, why not use Python?
Are you more comfortable with other languages? If so, you can install HDF5 and write your code with their C or Fortran API.
OTOH, if you are familiar with Python, there are other packages you can use to read the CSV file and create the HDF5 file.
Python packages to read the CSV
Personally, I like NumPy's genfromtxt() to read the CSV (You can also use loadtxt() to read the CSV, if you don't have missing values and don't need the field names.) However, I think you will run into memory problems reading a 84GB file. That said, you can use the skip_header and max_rows parameters with genfromtxt() to read and load a subset of lines. Alternately you can use csv.DictReader(). It reads a line at a time. So, you avoid memory issues, but it could be very slow loading the HDF5 file.
Python packages to create the HDF5 file
I have used both h5py and pytables (aka tables) to create and read HDF5 files. Once you load the CSV data to a NumPy array, it's a snap to create the HDF5 dataset.
Here is a very simple example that reads the lax_to_jfk.csv data and loads to a HDF5 file.
csv_name = 'lax_to_jfk'
rec_arr = np.genfromtxt(csv_name+'.csv', delimiter=',',
dtype=None, names=True, encoding='bytes')
with h5py.File(csv_name+'.h5', 'w') as h5f:
h5f.create_dataset(csv_name,data=rec_arr)
Update:
After posting this example, I decided to test with a larger file (airline_2m.csv). It's 861 MB, and has 2M rows. I discovered the code above doesn't work. However, it's not because of the number of rows. The problem is the columns (field names). Turns out the data isn't as clean; there are 109 field names on row 1, and some rows have 111 columns of data. As a result, the auto-generated dtype doesn't have a matching field. While investigating this, I also discovered many rows only have the values for first 56 fields. In other words, fields 57-111 are not very useful. One solution to this is to add the usecols=() parameter. Code below reflects this modification, and works with this test file. (I have not tried testing with your large file airline.csv. Given it's size likely you will need to read and load incrementally.)
csv_name = 'airline_2m'
rec_arr = np.genfromtxt(csv_name+'.csv', delimiter=',',
dtype=None, names=True, encoding='bytes') #,
usecols=(i for i in range(56)) )
with h5py.File(csv_name+'.h5', 'w') as h5f:
h5f.create_dataset(csv_name,data=rec_arr)
I tried reproducing your example. I believe the problem you are facing is quite common when dealing with CSVs. The schema is not known.
Sometimes there are "mixed types" and pandas (used underneath vaex's read_csv or from_csv ) casts those columns as dtype object.
Vaex does not really support such mixed dtypes, and requires each column to be of a single uniform type (kind of a like a database).
So how to go around this? Well, the best way I can think of is to use the dtype argument to explicitly specify the types of all columns (or those that you suspect or know to have mixed types). I know this file has like 100+ columns and that's annoying.. but that is also kind of the price to pay when using a format such as CSV...
Another thing i noticed is the encoding.. using pure pandas.read_csv failed at some point because of encoding and requires one to add encoding="ISO-8859-1". This is also supported by vaex.open (since the args are just passed down to pandas).
In fact if you want to do manually what vaex.open does automatically for you (given that this CSV file might not be as clean as one would hope), do something like (this is pseudo code but I hope close to the real thing)
# Iterate over the file in chunks
for i, df_tmp in enumerate(pd.read_csv(file, chunksize=11_000_000, encoding="ISO-8859-1", dtype=dtype)):
# Assert or check or do whatever needs doing to ensure column types are as they should be
# Pass the data to vaex (this does not take extra RAM):
df_vaex = vaex.from_pandas(df_tmp)
# Export this chunk into HDF5
# df_vaex.export_hdf5(f'chunk_{i}.hdf5')
# When the above loop finishes, just concat and export the data to a single file if needed (gives some performance benefit).
df = vaex.open('chunk*.hdf5')
df.export_hdf5('converted.hdf5', progress='rich')
I've seen potentially much better/faster way of doing this with vaex, but it is not released yet (i saw it in the code repo on github), so I will not go into it, but if you can install from source, and want me to elaborate further feel free to drop a comment.
Hope this at least gives some ideas on how to move forward.
EDIT:
In last couple of versions of vaex core, vaex.open() opens all CSV files lazily, so then just export to hdf5/arrow directly, it will do it in one go. Check the docs for more details: https://vaex.io/docs/guides/io.html#Text-based-file-formats

How to parse data from json to DataFrame faster?

I've got a json files of total size of 3gb. I need to parse some data from it to Pandas Dataframe. I already made it a bit faster with custom library to parse json, but it is still too slow. It works only in one thread, that is a problem, too. How can I make it faster? Main problem is that is starts with 60it/s, but on 50000th iteration speed lowers down to 5it/s, but RAM is still not fully used, so it is not the problem. Here is an example of what am I doing:
import tqdm
with open('data/train.jsonlines') as fin:
for line in tqdm.tqdm_notebook(fin):
record = ujson.loads(line)
for target in record['damage_targets']:
df_train.loc[record['id'], 'target_{}'.format(target)] = record["damage_targets"][target]

python read zipfile into numpy-array efficiently

I want to read a zipfile into memory and extract its content into a numpy array (as numpy-datatypes). This needs to happen in an extremely efficient/fast manner, since the files are rather big and there are many of them. Unfortunately looking at similiar questions didn't help me, because I couldn't find a way to convert the data into numpy-datatypes at the time of reading. Also the speed turned out to be a big problem.
For example: The zipfile "log_ks818.zip" contains "log_file.csv", which contains the needed data in the following format (yyyymmdd hhnnsszzz,float,float,zero):
20161001 190000100,1.000500,1.000800,0
20161001 190001000,1.001000,1.002000,0
20161001 190002500,1.001500,1.001200,0
...
The fastest that I managed to do so far (using pandas):
zfile = zipfile.ZipFile("log_ks818.zip", 'r')
df = pd.read_csv(io.BytesIO(zfile.read("log_file.csv")), header=None, usecols=[0, 1, 2], delimiter=',', encoding='utf-8')
print(df.head())
However this takes around 6 seconds for ~2,000,000 lines in the file (~80MB if unpacked), which is too slow (plus it's not a numpy object). When I compared the read-in speed of different numpy/pandas-methods, using the extracted file on the hard drive to test, np.fromfile performed the best with 0.08 seconds to simply get it into memory. It would be nice if it was possible to stay in this magnitude when reading the data from the zip-file.
I think this is not a problem about read speed from disk. Even though you are using HDD, reading 80MB into memory can be done in one second.
Take my experience as an example, the cost of time is determined by the process of uncompressing. If you just work with extracted data, it won't cost you a lot I believe.

Killed/MemoryError when creating a large dask.dataframe from delayed collection

I am trying to create a dask.dataframe from a bunch of large CSV files (currently 12 files, 8-10 million lines and 50 columns each). A few of them might fit together into my system memory but all of them at once definitely will not, hence the use of dask instead of regular pandas.
Since reading each csv file involves some extra work (adding columns with data from the file path), I tried creating the dask.dataframe from a list of delayed objects, similar to this example.
This is my code:
import dask.dataframe as dd
from dask.delayed import delayed
import os
import pandas as pd
def read_file_to_dataframe(file_path):
df = pd.read_csv(file_path)
df['some_extra_column'] = 'some_extra_value'
return df
if __name__ == '__main__':
path = '/path/to/my/files'
delayed_collection = list()
for rootdir, subdirs, files in os.walk(path):
for filename in files:
if filename.endswith('.csv'):
file_path = os.path.join(rootdir, filename)
delayed_reader = delayed(read_file_to_dataframe)(file_path)
delayed_collection.append(delayed_reader)
df = dd.from_delayed(delayed_collection)
print(df.compute())
When starting this script (Python 3.4, dask 0.12.0), it runs for a couple of minutes while my system memory constantly fills up. When it is fully used, everything starts lagging and it runs for some more minutes, then it crashes with killed or MemoryError.
I thought the whole point of dask.dataframe was to be able to operate on larger-than-memory dataframes that span over multiple files on disk, so what am I doing wrong here?
edit: Reading the files instead with df = dd.read_csv(path + '/*.csv') seems to work fine as far as I can see. However, this does not allow me to alter each single dataframe with additional data from the file path.
edit #2:
Following MRocklin's answer, I tried to read my data with dask's read_bytes() method as well as using the single-threaded scheduler as well as doing both in combination.
Still, even when reading chunks of 100MB in single-threaded mode on a laptop with 8GB of memory, my process gets killed sooner or later. Running the code stated below on a bunch of small files (around 1MB each) of similar shape works fine though.
Any ideas what I am doing wrong here?
import dask
from dask.bytes import read_bytes
import dask.dataframe as dd
from dask.delayed import delayed
from io import BytesIO
import pandas as pd
def create_df_from_bytesio(bytesio):
df = pd.read_csv(bytesio)
return df
def create_bytesio_from_bytes(block):
bytesio = BytesIO(block)
return bytesio
path = '/path/to/my/files/*.csv'
sample, blocks = read_bytes(path, delimiter=b'\n', blocksize=1024*1024*100)
delayed_collection = list()
for datafile in blocks:
for block in datafile:
bytesio = delayed(create_bytesio_from_bytes)(block)
df = delayed(create_df_from_bytesio)(bytesio)
delayed_collection.append(df)
dask_df = dd.from_delayed(delayed_collection)
print(dask_df.compute(get=dask.async.get_sync))
If each of your files is large then a few concurrent calls to read_file_to_dataframe might be flooding memory before Dask ever gets a chance to be clever.
Dask tries to operate in low memory by running functions in an order such that it can delete intermediate results quickly. However if the results of just a few functions can fill up memory then Dask may never have a chance to delete things. For example if each of your functions produced a 2GB dataframe and if you had eight threads running at once, then your functions might produce 16GB of data before Dask's scheduling policies can kick in.
Some options
Use dask.bytes.read_bytes
The reason why read_csv works is that it chunks up large CSV files into many ~100MB blocks of bytes (see the blocksize= keyword argument). You could do this too, although it's tricky because you need to always break on an endline.
The dask.bytes.read_bytes function can help you here. It can convert a single path into a list of delayed objects, each corresponding to a byte range of that file that starts and stops cleanly on a delimiter. You would then put these bytes into an io.BytesIO (standard library) and call pandas.read_csv on that. Beware that you'll also have to handle headers and such. The docstring to that function is extensive and should provide more help.
Use a single thread
In the example above everything would be fine if we didn't have the 8x multiplier from parallelism. I suspect that if you only ran a single function at once that things would probably pipeline without ever reaching your memory limit. You can set dask to use only a single thread with the following line
dask.set_options(get=dask.async.get_sync)
Note: For Dask versions >= 0.15, you need to use dask.local.get_sync instead.
Make sure that results fit in memory (response to edit 2)
If you make a dask.dataframe and then compute it immediately
ddf = dd.read_csv(...)
df = ddf.compute()
You're loading in all of the data into a Pandas dataframe, which will eventually blow up memory. Instead it's better to operate on the Dask dataframe and only compute on small results.
# result = df.compute() # large result fills memory
result = df.groupby(...).column.mean().compute() # small result
Convert to a different format
CSV is a pervasive and pragmatic format, but also has some flaws. You might consider a data format like HDF5 or Parquet.

Categories

Resources