I have a folder with many .feather files, and I would like to load all of them into dask in Python.
So far, I have tried the following sourced from a similar question on GitHub https://github.com/dask/dask/issues/1277
files = [...]
dfs = [dask.delayed(feather.read_dataframe)(f) for f in files]
df = dd.concat(dfs)
Unfortunately, this gives me the error TypeError: Truth of Delayed objects is not supported, which is mentioned there, but a workaround is not clear.
Is it possible to do the above in dask?
Instead of concat, which operates on dataframes, you want to use from_delayed, which turns a list of delayed objects, each of which represents a dataframe, into a single logical dataframe.
dfs = [dask.delayed(feather.read_dataframe)(f) for f in files]
df = dd.from_delayed(dfs)
If possible, you should also supply the meta= (a zero-length dataframe, describing the columns, index and dtypes) and divisions= (the boundary values of the index along the partitions) kwargs.
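For example, a minimal sketch of supplying meta (the file names, column names and dtypes here are hypothetical; feather.read_dataframe is assumed to come from the feather-format package, as in the snippets above):
import dask
import dask.dataframe as dd
import feather
import pandas as pd

files = ["part1.feather", "part2.feather"]  # hypothetical file list

# one delayed task per file; nothing is read until the dataframe is computed
dfs = [dask.delayed(feather.read_dataframe)(f) for f in files]

# meta: a zero-length dataframe with the expected columns and dtypes
meta = pd.DataFrame({"timestamp": pd.Series(dtype="datetime64[ns]"),
                     "value": pd.Series(dtype="float64")})

df = dd.from_delayed(dfs, meta=meta)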
I am trying to load multiple CSVs into a single pandas dataframe. They are all in one folder and all have the same column structure. I have tried a few different methods from a few different threads, and all return the error ValueError: No objects to concatenate. I'm sure the problem is something dumb like my file path. This is what I've tried:
temps = pd.concat(map(pd.read_csv, glob.glob(os.path.join('./Resources/temps', "*.csv"))))
Also this:
path = r'./Resources/temps'
temps_csvs = glob.glob(os.path.join(path, "*.csv"))
df_for_each_csv = (pd.read_csv(f) for f in temps_csvs)
temps_df = pd.concat(df_for_each_csv, ignore_index=True)
Thanks for any help!
It might not be as helpful as other answers, but when I tried running your code, it worked perfectly fine. The only difference was that I changed the path like this:
temps_csvs = glob.glob(os.path.join(os.getcwd(), "*.csv"))
df_for_each_csv = (pd.read_csv(f) for f in temps_csvs)
temps_df = pd.concat(df_for_each_csv, ignore_index = True)
and put the script in the same folder as the csv files.
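For what it's worth, pd.concat raises exactly that ValueError: No objects to concatenate when the glob matches nothing, so a quick sanity check on the path is worth adding (a minimal sketch):
import glob
import os
import pandas as pd

path = r'./Resources/temps'
temps_csvs = glob.glob(os.path.join(path, "*.csv"))
print(len(temps_csvs), "files matched")  # 0 means the path or pattern is wrong

if temps_csvs:
    temps_df = pd.concat((pd.read_csv(f) for f in temps_csvs), ignore_index=True)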
EDIT: I saw your comment about the error ParserError: Error tokenizing data. C error: Expected 5 fields in line 1394, saw 6
This means that the csv files don't have the same number of columns. Here is a question that deals with a similar issue; maybe it will help:
Reading a CSV file with irregular number of columns using Pandas
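One workaround along those lines, sketched here with a hypothetical file name, is to tell pandas the maximum number of fields up front so that shorter rows are padded with NaN instead of raising ParserError (the 6 is taken from your error message):
import pandas as pd

# names=range(6) fixes the column count; rows with fewer fields get NaN
# this assumes the file has no header row; adjust header/names if it does
df = pd.read_csv("problem_file.csv", header=None, names=range(6))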
Change tuple to list on the third line.
[pd.read_csv(f) for f in temps_csvs]
or add tuple to it: tuple(pd.read_csv(f) for f in temps_csvs)
Tuple comprehensions don't work this way.
See Why is there no tuple comprehension in Python?
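To make the distinction concrete, a small illustrative example (not tied to the question's data):
gen = (n * n for n in range(3))            # a generator object, not a tuple
as_list = [n * n for n in range(3)]        # [0, 1, 4]
as_tuple = tuple(n * n for n in range(3))  # (0, 1, 4)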
I have sensor data recorded over a timespan of one year. The data is stored in twelve chunks, with 1000 columns and ~1000000 rows each. I have worked out a script to concatenate these chunks into one large file, but about halfway through the execution I get a MemoryError. (I am running this on a machine with ~70 GB of usable RAM.)
import gc
from os import listdir
import pandas as pd
path = "/slices02/hdf/"
slices = listdir(path)
res = pd.DataFrame()
for sl in slices:
    temp = pd.read_hdf(path + f"{sl}")
    res = pd.concat([res, temp], sort=False, axis=1)
    del temp
    gc.collect()
res.fillna(method="ffill", inplace=True)
res.to_hdf(path + "sensor_data_cpl.hdf", "online", mode="w")
I have also tried to fiddle with HDFStore so I do not have to load all the data into memory (see Merging two tables with millions of rows in Python), but I could not figure out how that works in my case.
When you read in a csv as a pandas DataFrame, the process will take up to twice the needed memory at the end (because of type guessing and all the automatic stuff pandas tries to provide).
Several methods to fight that:
Use chunks. I see that your data is already in chunks, but maybe those are too big, so you can read each file in chunks using the chunksize parameter of pandas.read_hdf or pandas.read_csv (a minimal sketch follows this list).
Provide dtypes to avoid type guessing and mixed types (e.g. a column of strings with null values will have a mixed type); this works along with the low_memory parameter.
If this is not sufficient, you'll have to turn to distributed technologies like pyspark, dask, modin or even pandarallel.
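As a minimal sketch of the chunked approach (the file name is hypothetical; this assumes the HDF slices were written with format="table", which pandas requires for chunksize, and that "online" is the key they were saved under, so adjust both to your files):
import pandas as pd

pieces = []
for chunk in pd.read_hdf("/slices02/hdf/slice_01.hdf", "online", chunksize=100_000):
    # filter or downcast each chunk here before keeping it, so less ends up in memory
    pieces.append(chunk)
part = pd.concat(pieces)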
When you have this much data, avoid creating temporary dataframes, as they take up memory too. Try doing it in one pass:
import os
import pandas as pd

folder = "/slices02/hdf/"
files = [os.path.join(folder, file) for file in os.listdir(folder)]
res = pd.concat((pd.read_hdf(file) for file in files), sort=False)  # the slices are HDF files, so read_hdf
See how this works for you.
I am trying to read a large csv file and run some code on it. I use chunksize to do so.
file = "./data.csv"
df = pd.read_csv(file, sep="/", header=0, iterator=True, chunksize=1000000, dtype=str)
print(len(df.index))
I get the following error in the code:
AttributeError: 'TextFileReader' object has no attribute 'index'
How to resolve this?
Those errors are stemming from the fact that your pd.read_csv call, in this case, does not return a DataFrame object. Instead, it returns a TextFileReader object, which is an iterator. This is, essentially, because when you set the iterator parameter to True, what is returned is NOT a DataFrame; it is an iterator of DataFrame objects, each the size of the integer passed to the chunksize parameter (in this case 1000000).
Specific to your case, you can't just call df.index because an iterator object simply does not have an index attribute. This does not mean that you cannot access the DataFrames inside the iterator. It means that you either have to loop through the iterator to access one DataFrame at a time, or concatenate all of those DataFrames into one giant one.
If you are considering just working with one DataFrame at a time, then the following is what you would need to do to print the indexes of each DataFrame:
file = "./data.csv"
dfs = pd.read_csv(file, sep="/", header=0, iterator=True, chunksize=1000000, dtype=str)
for df in dfs:
    print(df.index)
    # do something
    df.to_csv('output_file.csv', mode='a', index=False)
This will save the DataFrames into an output file with the name output_file.csv. With the mode parameter set to a, the operations should append to the file. As a result, nothing should be overwritten.
However, if the goal for you is to concatenate all the DataFrames into one giant DataFrame, then the following would perhaps be a better path:
file = "./data.csv"
dfs = pd.read_csv(file, sep="/", header=0, iterator=True, chunksize=1000000, dtype=str)
giant_df = pd.concat(dfs)
print(giant_df.index)
Since you are already using the iterator parameter here, I would assume that you are concerned about memory. As such, the first strategy would be a better one. That basically means that you are taking advantage of the benefits that iterators offer when it comes to memory management for large datasets.
I hope this proves useful.
I want to concatenate 50 txt files from one directory, each of which is >300 MB. The function below works for two of these files but not for all of them. I am quite new to Python, and therefore I am not sure if my function could be faster. I already checked similar topics but couldn't find a better way. Do you have an idea how to make it more efficient?
my script to concat files:
import glob
import os
import pandas as pd

def txtComponentstoOne(rdire):
    allFiles = glob.glob(os.path.join(rdire, "*.txt"))
    df = pd.concat((pd.read_table(f, header=None, dtype={0: str, 1: int, 2: int, 3: str, 4: str, 5: int, 6: int}) for f in allFiles), ignore_index=True)
    return df
My overall aim is to calculate the median of column 6 for each group of rows with the same values in the other columns. So if you know how to do that without concatenating the files first, that would also solve my problem.
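For the stated goal itself, a minimal sketch (assuming, as in your dtype dict, that the columns are unnamed and column 6 is the last one):
import glob
import os
import pandas as pd

def median_of_col6(rdire):
    allFiles = glob.glob(os.path.join(rdire, "*.txt"))
    df = pd.concat((pd.read_table(f, header=None, dtype={0: str, 1: int, 2: int, 3: str, 4: str, 5: int, 6: int}) for f in allFiles), ignore_index=True)
    # group on columns 0-5 and take the median of column 6
    return df.groupby([0, 1, 2, 3, 4, 5])[6].median().reset_index()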
I'm trying to write code that will read from a set of CSVs named my_file_*.csv into a Dask dataframe.
Then I want to set the partitions based on the length of the CSV. I'm trying to map a function on each partition and in order to do that, each partition must be the whole CSV.
I've tried to reset the index and then set partitions based on the length of each CSV, but it looks like the index of the Dask dataframe is not unique.
Is there a better way to partition based on the length of each CSV?
So one partition should contain exactly one file?
You could do:
import dask.dataframe as dd
ddf = dd.read_csv("my_file_*.csv", blocksize=None)
Setting blocksize to None makes sure that files are not split up into several partitions. Therefore, ddf will be a dask dataframe containing one file per partition.
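Since each partition is then one whole file, a per-file function can be applied with map_partitions; a minimal sketch (add_row_count is a hypothetical example function):
import dask.dataframe as dd

def add_row_count(pdf):
    # pdf is the complete pandas DataFrame for one CSV file
    pdf = pdf.copy()
    pdf["rows_in_file"] = len(pdf)
    return pdf

ddf = dd.read_csv("my_file_*.csv", blocksize=None)
result = ddf.map_partitions(add_row_count).compute()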
You might want to check out the documentation:
general instructions on how to create dask dataframes from data
details about read_csv