My application needs to process data periodically: it reads new data and merges it with the old data. The data may have billions of rows but only two columns, where the first column is the row name and the second is the value. For example:
a00001,12
a00002,2321
a00003,234
The new data may contain new row names as well as existing ones, and I want to merge them. So in each processing run I need to read the old, large data file, merge it with the new data, and then write the result to a new file.
I find that the most time-consuming part is reading and writing the data. I have tried several I/O approaches:
Plain text read and write. This is the most time-consuming way.
The Python pickle package; however, it is not efficient for large data files.
Are there any other data I/O formats or packages that can load and write large data efficiently in Python?
If you have such large amounts of data, it might be faster to try lowering the amount of data you have to read and write.
You could spread the data over multiple files instead of saving it all in one.
When processing your new data, check what old data has to be merged and just read and write those specific files.
Say your new data has rows like:
name1, data1
name2, data2
Files containing old data:

db_1.dat              db_2.dat              db_3.dat
name_1: data_1        name_1001: data_1001  name_2001: data_2001
...                   ...                   ...
name_1000: data_1000  name_2000: data_2000  name_3000: data_3000
Now you can check what data you need to merge and just read and write the specific files holding that data.
I am not sure whether what you are trying to achieve allows a system like this, but it would speed up the process because there is less data to handle.
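A minimal sketch of this idea, assuming the two-column layout from the question and a hypothetical split into 100 bucket files chosen by a stable hash of the row name:

import os
import zlib
import pandas as pd

N_BUCKETS = 100  # hypothetical number of partition files

def bucket_of(name):
    # stable hash so the same row name always lands in the same file
    return zlib.crc32(name.encode()) % N_BUCKETS

def merge_new_data(new_df, root='db'):
    # new_df has two columns: 'name' and 'value'
    os.makedirs(root, exist_ok=True)
    buckets = new_df['name'].map(bucket_of)
    for bucket, chunk in new_df.groupby(buckets):
        path = os.path.join(root, f'bucket_{bucket:03d}.csv')
        if os.path.exists(path):
            old = pd.read_csv(path, names=['name', 'value'])
            # append new rows after old ones so new values win on duplicates
            chunk = pd.concat([old, chunk]).drop_duplicates('name', keep='last')
        chunk.to_csv(path, index=False, header=False)

Only the bucket files touched by a new batch are read and rewritten; everything else on disk stays untouched.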
Maybe this article could help you. It seems like feather and parquet may be interesting.
I have a process that generates millions of small dataframes and saves them to parquet in parallel.
All dataframes have the same columns and index information, and the same number of rows (about 300).
Because the dataframes are small, the metadata is quite big in comparison with the data once they are saved as parquet files. And since the metadata of each parquet file is basically the same, disk space is wasted because the same metadata is repeated millions of times.
Is it possible to save one copy of the metadata while the other parquet files contain only the data? Then, when I need to read a dataframe, I would read the metadata and the data from two different files.
Some updates:
Concatenating them into one big dataframe would save disk space, but it is not friendly to parallel processing of each small dataframe.
I also tried other formats like feather, but it seems that feather does not store the data as efficiently as parquet: its metadata is smaller, yet the file ends up larger than the parquet metadata plus the parquet data.
This is not possible, at least using Python pandas (fastparquet and pyarrow don't have any such feature).
I do see a parameter that disables statistics in the footer:
pyarrow: write_statistics=False
fastparquet: stats=False
However, this will not save you a lot of disk space; only a little statistics-related info is left out of the parquet metadata footer.
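For illustration, this is roughly where the pyarrow flag goes (the file and column names are made up); again, it only drops the per-column statistics, not the rest of the footer:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({'a': range(300), 'b': range(300)})  # stand-in for one small dataframe
table = pa.Table.from_pandas(df)
# omit per-column statistics from the footer; schema and row-group metadata are still written
pq.write_table(table, 'small.parquet', write_statistics=False)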
"Is it possible to save one copy of the metadata while the other parquet files contain only the data?"
You want to write multiple data files that contain only row groups (no footer) and a single metadata file that contains only the footer. In that case none of those files would be a valid parquet file on its own. This should be possible in theory, but no such implementation is known to exist. Check out the comments on this thread, and maybe reach out to the parquet community on Slack to find out whether one exists.
My suggestion would be to combine the dataframes somehow before writing them to parquet on disk, or to run a job at a later stage that merges these files. Neither option is very efficient, since you have a huge number of small dataframes/files.
You could also write the 300 rows per dataframe into some kind of intermediate database and convert to parquet later.
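A rough sketch of that intermediate-database idea, using SQLite purely as an illustrative staging area (the table and file names are assumptions, and SQLite serializes writers, so truly parallel staging would need a client/server database):

import sqlite3
import pandas as pd

conn = sqlite3.connect('staging.db')

def stage(df):
    # each producer appends its ~300-row dataframe to one shared table
    df.to_sql('staging', conn, if_exists='append', index=False)

def flush_to_parquet(path='combined.parquet'):
    # later, a single job dumps the accumulated rows into one parquet file
    combined = pd.read_sql('SELECT * FROM staging', conn)
    combined.to_parquet(path, index=False)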
I have many files: 1.csv, 2.csv, ..., N.csv. I want to read them all and aggregate them into a DataFrame. But reading the files sequentially in one process will definitely be slow. How can I improve this? Besides, a Jupyter notebook is being used.
Also, I am a little confused about the cost of "passing parameters or return values between Python processes".
I know the question may be a duplicate, but I found that most answers solve it with multiprocessing. Multiprocessing does get around the GIL, but in my experience (which may be wrong), passing large data (like a DataFrame) as a parameter to a subprocess is slower than a for loop in a single process, because it requires serializing and de-serializing. And I am not sure about returning large values from the subprocess.
Is it most efficient to use a Queue, joblib, or Ray?
Reading CSV files is fast. I would read all the CSVs into a list and then concat the list into one dataframe. Here is a bit of code from my use case: I find all .csv files in my path and save the file names in the variable result. I then loop over the file names, read each CSV into a list, and later concat the list into one dataframe.
data = []
for item in result:
    data.append(pd.read_csv(item))
main_df = pd.concat(data, axis=0)
I am not saying this is the best approach, but this works great for me :)
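If the sequential loop is still too slow, one hedged alternative is to read the files in a thread pool: pandas does much of the CSV parsing in C, so threads often help for I/O-bound reads without the serialization cost of sending dataframes between processes. The file names below are just an assumption:

from concurrent.futures import ThreadPoolExecutor
import pandas as pd

paths = [f'{i}.csv' for i in range(1, 101)]  # 1.csv ... 100.csv; adjust to your files

# read the files concurrently in threads, then concatenate in the main process
with ThreadPoolExecutor(max_workers=8) as pool:
    frames = list(pool.map(pd.read_csv, paths))

main_df = pd.concat(frames, axis=0, ignore_index=True)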
I have a collection of mainly numerical data-files that are the result of running a physics simulation (or several). I can convert the files into pandas dataframes. It is natural to organize the dataframe objects in lists, lists of lists etc. For example:
allData = [df1, [df11, df12], df2, [df21, df22]]
I want to save this data to files (to be sent). I know the whole thing can be dumped into one file with e.g. a pickle format, but I don't want this because some files can be large and I want to be able to load the files selectively. So each dataframe should be stored as a separate file.
But I also want to store how the objects are organized into lists, for example in another file, so that when reading the files somewhere else, Python will know how the data files are connected.
Possibly I could solve this by inventing some system of writing the filenames and how they are structured into a txt file. But is there a proper/cleaner way to do it?
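As a sketch of the "write the filenames and their structure to a file" idea mentioned above, a small JSON manifest next to the per-dataframe files could look like this (the files/ directory, the parquet format, and the helper names are assumptions):

import json
import os
import pandas as pd

def save_nested(data, root='files'):
    # walk the nested lists, save each dataframe to its own parquet file,
    # and record a structure of file names that mirrors the original nesting
    os.makedirs(root, exist_ok=True)
    counter = 0

    def walk(obj):
        nonlocal counter
        if isinstance(obj, list):
            return [walk(item) for item in obj]
        fname = f'df_{counter:04d}.parquet'
        counter += 1
        obj.to_parquet(os.path.join(root, fname))
        return fname

    with open(os.path.join(root, 'manifest.json'), 'w') as f:
        json.dump(walk(data), f, indent=2)

def load_nested(root='files'):
    # rebuild the same nested structure, reading each dataframe back in
    with open(os.path.join(root, 'manifest.json')) as f:
        structure = json.load(f)

    def walk(obj):
        if isinstance(obj, list):
            return [walk(item) for item in obj]
        return pd.read_parquet(os.path.join(root, obj))

    return walk(structure)

Because the manifest holds only file names, the structure can be inspected and individual dataframes loaded selectively without touching the rest.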
There is already a nice question about this on SO, but the best answer is now 5 years old, so I think there should be better option(s) in 2018.
I am currently looking for a feature engineering pipeline for a larger-than-memory dataset (using suitable dtypes).
The initial file is a CSV that doesn't fit in memory. Here are my needs:
1. Create features (mainly using groupby operations on multiple columns).
2. Merge the new features into the previous data (on disk, because it doesn't fit in memory).
3. Use a subset (or all) of the columns/index for some ML applications.
4. Repeat 1/2/3 (this is an iterative process, like day 1: create 4 features, day 2: create 4 more, ...).
Attempt with parquet and dask:
First, I split the big CSV file into multiple small parquet files. With this, dask is very efficient at calculating the new features, but then I need to merge them into the initial dataset, and at the moment we cannot add new columns to parquet files. Reading the CSV by chunks, merging, and re-saving to multiple parquet files is too time-consuming, as feature engineering is an iterative process in this project.
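For reference, a minimal sketch of that split-to-parquet step with dask (the file names, blocksize, and column names are assumptions):

import dask.dataframe as dd

# read the big CSV lazily in ~256 MB chunks; each partition becomes one parquet file
data = dd.read_csv('big.csv', blocksize='256MB')
data.to_parquet('parquet_dir/', write_index=False)

# new features can then be computed out of core, e.g. a groupby aggregation
data = dd.read_parquet('parquet_dir/')
day_mean = data.groupby('day')['value'].mean().compute()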
Attempt with HDF and dask:
I then turned to HDF because we can add columns and also use special queries, and it is still binary file storage. Once again I split the big CSV file into multiple HDF files with the same key='base' for the base features, in order to get concurrent writing with dask (which is not allowed for a single HDF file).
data = data.repartition(npartitions=10)  # otherwise to_hdf was saving 8 MB files
data.to_hdf('./hdf/data-*.hdf', key='base', format='table', data_columns=['day'], get=dask.threaded.get)
(Side question: specifying data_columns seems useless for dask, as there is no "where" in dask.read_hdf?)
Contrary to what I expected, I am not able to merge the new feature into the multiple small files with code like this:
data = dd.read_hdf('./hdf/data-*.hdf', key='base')
data['day_pow2'] = data['day']**2
data['day_pow2'].to_hdf('./hdf/data-*.hdf', key='added', get=dask.threaded.get)
With dask.threaded I get "python stopped working" after 2%.
With dask.multiprocessing.get it takes forever and creates new files.
What are the most appropriate tools (storage and processing) for this workflow?
I will just copy a comment from the related issue on fastparquet: it is technically possible to add columns to existing parquet data-sets, but this is not implemented in fastparquet and possibly not in any other parquet implementation either.
Making code to do this might not be too onerous (but it is not currently planned): the calls to write columns happen sequentially, so new columns for writing would need to percolate down to this function, together with the file position corresponding to the current first byte of the metadata in the footer. In addition, the schema would need to be updated separately (this is simple). The process would need to be repeated for every file of a data-set. This is not an "answer" to the question, but perhaps someone fancies taking on the task.
I would seriously consider using a database (indexed access) as storage, or even using Apache Spark (for processing data in a distributed / clustered way) with Hive / Impala as a backend.
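A very rough sketch of the database route, using SQLite only as a stand-in for a real indexed store (the table names and the 'day'/'value' columns are assumptions): the base data and each new feature live in their own indexed tables, and a join assembles the subset needed for an ML run.

import sqlite3
import pandas as pd

conn = sqlite3.connect('features.db')

# one-off: load the base CSV in chunks so it never has to fit in memory
for chunk in pd.read_csv('big.csv', chunksize=1_000_000):
    chunk.to_sql('base', conn, if_exists='append', index=False)
conn.execute('CREATE INDEX IF NOT EXISTS idx_base_day ON base(day)')

# iterative feature step: aggregate inside the database, store the small result in its own table
feat = pd.read_sql('SELECT day, AVG(value) AS day_mean FROM base GROUP BY day', conn)
feat.to_sql('feat_day_mean', conn, if_exists='replace', index=False)

# ML step: pull only the columns needed, joined on the fly
subset = pd.read_sql(
    'SELECT b.day, b.value, f.day_mean '
    'FROM base b JOIN feat_day_mean f ON b.day = f.day',
    conn)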
I want to use Pandas to work with series in real-time. Every second, I need to add the latest observation to an existing series. My series are grouped into a DataFrame and stored in an HDF5 file.
Here's how I do it at the moment:
>> existing_series = Series([7,13,97], [0,1,2])
>> updated_series = existing_series.append( Series([111], [3]) )
Is this the most efficient way? I've read countless posts but cannot find any that focuses on efficiency with high-frequency data.
Edit: I just read about the shelve and pickle modules. It seems like they would achieve what I'm trying to do, basically saving lists to disk. Because my lists are large, is there any way not to load the full list into memory but, rather, to efficiently append values one at a time?
Take a look at the new PyTables docs in 0.10 (coming soon), or you can get them from master: http://pandas.pydata.org/pandas-docs/dev/whatsnew.html
PyTables is actually pretty good at appending, and writing to an HDFStore every second will work. You want to store the DataFrame in table format. You can then select data in a query-like fashion, e.g.
store.append('df', the_latest_df)
store.append('df', the_latest_df)
....
store.select('df', [ 'index>12:00:01' ])
If this is all happening from the same process, this will work great. If you have a writer process and another process reading, it is a little trickier (but will work correctly depending on what you are doing).
Another option is to use messaging to transmit the data from one process to another (and then append in memory); this avoids the serialization issue.
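A self-contained sketch of that append-and-select pattern with a present-day pandas HDFStore; the file name, column, and timestamps are made up, and the where syntax follows the pandas HDFStore docs:

import pandas as pd

store = pd.HDFStore('ticks.h5')

# every second, append the newest observation as a one-row table chunk
latest = pd.DataFrame({'price': [97.0]}, index=[pd.Timestamp('2012-01-01 12:00:02')])
store.append('df', latest)

# later, query just the rows you need instead of loading the whole table
recent = store.select('df', "index > pd.Timestamp('2012-01-01 12:00:01')")
store.close()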