pandas chunksize - concat inside and outside the loops - python

I have to read massive csv files (500 million lines), and I tried to read them with pandas using the chunksize option in order to reduce memory consumption. But I don't fully understand the behaviour of the concat method, or whether reading the whole file this way actually reduces memory. I'm adding some pseudocode below to explain what I've done so far.
Let's say I'm reading and then concatenate a file with n lines with:
iter_csv = pd.read_csv('file.csv', chunksize=n // 2)
df = pd.concat([chunk for chunk in iter_csv])
Then I have to apply a function to the dataframe to create a new column based on some values:
df['newcl'] = df.apply(function)
Everything goes fine.
But now I wonder what's the difference between the above procedure and the following:
iter_csv = pd.read_csv('file.csv', chunksize=n // 2)
for chunk in iter_csv:
    chunk['newcl'] = chunk.apply(function)
    df = pd.concat([chunk])
In terms of RAM consumption, I thought that the second method should be better because it applies the function only to the chunk and not to the whole dataframe. But the following issues occur:
putting the df = pd.concat([chunk]) inside the loop returns me a dataframe with a size of n/2 (the size of the chunk), and not the full one;
putting the df = pd.concat([chunk]) outside, after the loop returns the same n/2 dataframe length.
So my doubt is whether the first method (concatenating the dataframe right after the read_csv call) is the best one, balancing speed and RAM consumption. And I'm also wondering how I can concat the chunks using the for loop.
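For reference, here is a minimal sketch of what I had in mind for the loop version, though I'm not sure it's the right way: apply the function per chunk, collect the processed chunks in a list, and concatenate once after the loop (function is the same placeholder as above).
iter_csv = pd.read_csv('file.csv', chunksize=n // 2)
processed = []
for chunk in iter_csv:
    # apply the function to each chunk independently
    chunk['newcl'] = chunk.apply(function)
    processed.append(chunk)
# concatenate all the processed chunks once, after the loop
df = pd.concat(processed)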
Thanks for your support.

Related

Does toPandas() speed up as a pyspark dataframe gets smaller?

I figured I would ask the question. I've found a clever way to reduce the size of a PySpark Dataframe and convert it to Pandas and I was just wondering, does the toPandas function get faster as the size of the pyspark dataframe gets smaller? Here is some code:
window = Window.partitionBy(F.lit('A')).orderBy(F.lit('A'))
eps_tfs = {}
while True:
    pdf = toPandas(conn.select(F.col('*')).where(F.col('row_number') <= 2500))
    n = len(pdf)
    trigger = 0
    for u in pdf['features']:
        indices = [i for i, x in enumerate(u) if x == 1.0]
        for idx in range(len(eps_columns)):
            if idx in indices:
                try:
                    eps_tfs[eps_columns[idx]].append(True)
                except:
                    eps_tfs[eps_columns[idx]] = [True]
            else:
                try:
                    eps_tfs[eps_columns[idx]].append(False)
                except:
                    eps_tfs[eps_columns[idx]] = [False]
    full_view = full_view.append(pd.concat([pdf, pd.DataFrame(eps_tfs)], axis=1))
    conn = conn.select(F.col('*')).where(F.col('row_number') > 2500)
    conn = conn.drop("row_number")
    conn = conn.select(F.col('*'), F.row_number().over(window).alias('row_number'))
    eps_tfs = {}
    del pdf
    if n < 2500:
        break
Also, is the following code really a faster way to map the dataframe to pandas?
def _map_to_pandas(rdds):
    """ Needs to be here due to pickling issues """
    return [pd.DataFrame(list(rdds))]

def toPandas(df, n_partitions=None):
    """
    Returns the contents of `df` as a local `pandas.DataFrame` in a speedy fashion. The DataFrame is
    repartitioned if `n_partitions` is passed.
    :param df: pyspark.sql.DataFrame
    :param n_partitions: int or None
    :return: pandas.DataFrame
    """
    if n_partitions is not None: df = df.repartition(n_partitions)
    df_pand = df.rdd.mapPartitions(_map_to_pandas).collect()
    df_pand = pd.concat(df_pand)
    df_pand.columns = df.columns
    return df_pand
Is there any better way to go about doing this?
The answer by @EZY is correct (you need to collect all rows to the driver or client). However, there is one more optimisation possible with the Apache Arrow integration, which provides faster serialization for NumPy and pandas data types. It's not enabled by default, so you need to enable it by setting the Spark conf as below.
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
Here is the source code for toPandas.
First of all, yes, toPandas will be faster as your pyspark dataframe gets smaller; it has a similar flavour to sdf.collect().
The difference is that toPandas returns a pandas dataframe (pdf) while collect returns a list.
As you can see from the source code, pdf = pd.DataFrame.from_records(self.collect(), columns=self.columns): the pdf is generated by pd.DataFrame.from_records from that list.
So if your sdf is smaller, there is less data to transfer over the network, and from_records processes less data on your driver's CPU.
The design of the second snippet is different: the sdf stays distributed, the code calls mapPartitions so every worker builds a pandas dataframe from its own subset of the data, and then calls collect. Now all the pandas dataframes are transferred over the network to the driver, and finally the code calls pd.concat to glue them together.
The benefits are:
When converting to a pandas DataFrame, all the workers work on small subsets of the data in parallel, which is much better than bringing all the data to the driver and burning the driver's CPU to convert a giant dataset to pandas.
There is also a repartition step available, which means that if your dataset is huge and you only have a few partitions, the data on each partition would be huge; a plain toPandas would fail with an OOM in the serializer and would also be very slow to collect the data.
The Drawbacks are:
Now when you collect, you are not collecting the native sdf data but pandas dataframes, which have more metadata attached and are generally larger, so the total size of the objects is bigger.
pd.concat is slow, but it might still be better than from_records.
So there is no universal conclusion about which method is better; choose your tool wisely. As in this question, toPandas might be faster for a small sdf, but for a large sdf the code snippet definitely works better.
In our case, we found that just not doing toPandas() and using pdf = pd.DataFrame.from_records(self.collect(), columns=self.columns) was fastest. We couldn't use the arrow option because we got the error "arrow is not supported when using file-based collect".
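For clarity, here is a minimal sketch of that pattern, with sdf standing in for an existing pyspark.sql.DataFrame:
import pandas as pd

# collect the Spark rows on the driver and build the pandas DataFrame in one step
pdf = pd.DataFrame.from_records(sdf.collect(), columns=sdf.columns)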
Looking at the source code for toPandas(), one reason it may be slow is because it first creates the pandas DataFrame, and then copies each of the Series in that DataFrame over to the returned DataFrame. If you know that all of your columns have unique names, and that the data types will convert nicely via having pandas infer the dtype values, there is no need to do any of that copying or dtype inference.
Side note: We were converting a Spark DataFrame on Databricks with about 2 million rows and 6 columns, so your mileage may vary dependent on the size of your conversion.

Get subset of rows where any column contains a particular value

I have a very large data file (foo.sas7bdat) that I would like to filter rows from without loading the whole data file into memory. For example, I can print the first 20 rows of the dataset without loading the entire file into memory by doing the following:
import pandas
import itertools
with pandas.read_sas('foo.sas7bdat') as f:
    for row in itertools.islice(f, 20):
        print(row)
However, I am unclear on how to only print (or preferably place in a new file) only rows that have any column that contain the number 123.1. How can I do this?
Pandas has the ability to pull dataframes one chunk at a time. Following the trail of read_sas() documentation to "chunksize" I came across this:
http://pandas.pydata.org/pandas-docs/stable/io.html#iterating-through-files-chunk-by-chunk
for chunk in pd.read_sas('foo.sas7bdat', iterator=True, chunksize=100000):
    print(chunk)
This would get chunks of 100,000 lines.
As for the other problem, you would need a query, and I don't know the constraints of the problem. If you build a DataFrame with all the columns you might still overflow your memory, so an efficient way would be to collect the matching indexes into a set, sort them, and use .iloc to pull those entries if you wanted to put them into a DataFrame.
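Putting these pieces together, here is a minimal sketch of a chunk-wise filter for rows where any column equals 123.1 (this assumes an exact numeric match; adjust the condition and chunksize to your data):
import pandas as pd

matches = []
for chunk in pd.read_sas('foo.sas7bdat', chunksize=100000):
    # keep rows where any column equals 123.1
    mask = chunk.eq(123.1).any(axis=1)
    matches.append(chunk[mask])

result = pd.concat(matches, ignore_index=True)
result.to_csv('matching_rows.csv', index=False)  # or keep it in memory if it is small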
You may need to use tools that take this into account. Dask is a good alternative for use on clusters.

Read Directory of Timeseries CSV data efficiently with Dask DataFrame and Pandas

I have a directory of timeseries data stored as CSV files, one file per day. How do I load and process it efficiently with Dask DataFrame?
Disclaimer: I maintain Dask. This question occurs often enough in other channels that I decided to add a question here on StackOverflow to which I can point people in the future.
Simple Solution
If you just want to get something quickly then simple use of dask.dataframe.read_csv using a globstring for the path should suffice:
import dask.dataframe as dd
df = dd.read_csv('2000-*.csv')
Keyword arguments
The dask.dataframe.read_csv function supports most of the pandas.read_csv keyword arguments, so you might want to tweak things a bit.
df = dd.read_csv('2000-*.csv', parse_dates=['timestamp'])
Set the index
Many operations like groupbys, joins, index lookup, etc. can be more efficient if the target column is the index. For example if the timestamp column is made to be the index then you can quickly look up the values for a particular range easily, or you can join efficiently with another dataframe along time. The savings here can easily be 10x.
The naive way to do this is to use the set_index method
df2 = df.set_index('timestamp')
However if you know that your new index column is sorted then you can make this much faster by passing the sorted=True keyword argument
df2 = df.set_index('timestamp', sorted=True)
Divisions
In the above case we still pass through the data once to find good breakpoints. However, if your data is already nicely segmented (such as one file per day) then you can give these division values to set_index to avoid this initial pass (which can be costly for a large amount of CSV data).
import pandas as pd
divisions = tuple(pd.date_range(start='2000', end='2001', freq='1D'))
df2 = df.set_index('timestamp', sorted=True, divisions=divisions)
This solution correctly and cheaply sets the timestamp column as the index (allowing for efficient computations in the future).
Convert to another format
CSV is a pervasive and convenient format. However it is also very slow. Other formats like Parquet may be of interest to you. They can easily be 10x to 100x faster.
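For example, a minimal sketch of such a conversion (the file names are placeholders, and a Parquet engine such as pyarrow or fastparquet is assumed to be installed):
import dask.dataframe as dd

# read the CSV files once, then persist them to Parquet for faster future loads
df = dd.read_csv('2000-*.csv', parse_dates=['timestamp'])
df = df.set_index('timestamp', sorted=True)
df.to_parquet('2000-timeseries.parquet')

# later sessions can read the Parquet data directly
df = dd.read_parquet('2000-timeseries.parquet')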

Pandas Dataframe: Lack of Memory- What's the better way here?

f = pd.read_hdf('Sensor_Data.h5','f')
pieces = [f[x: x + 360] for x in xrange(504649)]
df = pd.concat(pieces)
Morning all. I have a file with 500,000+ rows of data. I want to take 360 row slices from this, and move it down by 1 row each time. (So I will end up with a LOT of data. )
As expected, I tried the above code and got a memory error. I'm assuming there's a better way of doing this?
EDIT: To add some context, this is a .h5 file, and I'm using pandas dataframe to try and slice it this way. I'm trying to create an array of data to feed into a deep neural network using caffenet, though the format it will be in at this point will be unclear...
The code works for small amounts of data, just not for larger ones. To be clearer about what I'm trying to do:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(10, 6)); df
[displays a 10 x 6 table of random numbers]
Now:
pieces = [df[x: x + 4] for x in xrange(7)]
f = pd.concat(pieces)
f
Displays a new table similar to the previous one, but expanded. It now has rows 0,1,2,3,1,2,3,4,2,3,4,5,3,4,5,6...
Now "pieces" is not a dataframe object itself, but a list, for some reason. Is there also a simple way to turn all of these separate datasets (0,1,2,3), (1,2,3,4) and so on into dataframe objects themselves, instead of concatenating them into one dataframe?
I hope this makes sense.
Consider using h5py. From the website: "For example, you can slice into multi-terabyte datasets stored on disk, as if they were real NumPy arrays".
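A minimal sketch of that idea follows; note it assumes the sensor data is stored as a plain HDF5 dataset under the key 'f'. A file written by pandas' HDFStore uses a different internal layout, so the actual key may need adjusting (inspect the file's keys first):
import h5py

with h5py.File('Sensor_Data.h5', 'r') as hf:
    print(list(hf.keys()))       # see what is actually stored in the file
    dset = hf['f']               # hypothetical dataset key
    window = dset[1000:1360]     # reads only this 360-row slice from disk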
So you have a couple of questions there. First the easiest: pieces is a list because you've created it with a list comprehension; it should be a list of dataframe objects. If you want to use them as separate dataframe objects you can just index into the list (i.e. pieces[0], etc.).
But you still have the problem that you are trying to create a huge data frame. Without seeing the rest of how you would use the code, I'd suggest not creating half a million slices of your df, but instead looping over your original data frame and calling whatever function you need on a single slice at a time:
for x in xrange(504649):
    result = my_func(df[x:x+360])
That way each slice is released after it's used, and hopefully the result is much smaller than the frame.
You could also, similarly to the above, write all your slices to separate CSV files and read them in as you need.
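A rough sketch of that alternative (the file naming is arbitrary):
# write each 360-row window to its own CSV file instead of keeping it all in memory
for x in xrange(504649):
    df[x:x + 360].to_csv('slice_{0:06d}.csv'.format(x), index=False)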

Is .loc the best way to build a pandas DataFrame?

I have a fairly large csv file (700mb) which is assembled as follows:
qCode Date Value
A_EVENTS 11/17/2014 202901
A_EVENTS 11/4/2014 801
A_EVENTS 11/3/2014 2.02E+14
A_EVENTS 10/17/2014 203901
etc.
I am parsing this file to get specific values, and then using DF.loc to populate a pre-existing DataFrame, i.e. the code:
for line in fileParse:
    code=line[0]
    for point in fields:
        if(point==code[code.find('_')+1:len(code)]):
            date=line[1]
            year,quarter=quarter_map(date)
            value=float(line[2])
            pos=line[0].find('_')
            ticker=line[0][0:pos]
            i=ticker+str(int(float(year)))+str(int(float(quarter)))
            df.loc[i,point]=value
        else:
            pass
The question I have is: is .loc the most efficient way to add values to an existing DataFrame? This operation seems to take over 10 hours...
FYI, fields are the columns that are in the DF (the values I'm interested in) and the index (i) is a string...
thanks
No, you should never build a dataframe row-by-row. Each time you do this the entire dataframe has to be copied (it's not extended inplace) so you are using n + (n - 1) + (n - 2) + ... + 1, O(n^2), memory (which has to be garbage collected)... which is terrible, hence it's taking hours!
You want to use read_csv, and you have a few options:
read in the entire file in one go (this should be fine with 700mb even with just a few gig of ram).
pd.read_csv('your_file.csv')
read in the csv in chunks and then glue them together (in memory)... tbh I don't think this will actually use less memory than the above, but is often useful if you are doing some munging at this step.
pd.concat(pd.read_csv('foo.csv', chunksize=100000))  # not sure what optimum value is for chunksize
read the csv in chunks and save them into pytables (rather than in memory); if you have more data than memory (and you've already bought more memory), then use pytables/hdf5!
store = pd.HDFStore('store.h5')
for df in pd.read_csv('foo.csv', chunksize=100000):
    store.append('df', df)
If I understand correctly, I think it would be much faster to:
Import the whole csv into a dataframe using pandas.read_csv.
Select the rows of interest from the dataframe.
Append the rows to your other dataframe using df.append(other_df).
If you provide more information about what criteria you are using in step 2 I can provide code there as well.
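In the meantime, here is a rough sketch with a made-up filter for step 2 (the column name and condition are placeholders, and existing_df stands in for your pre-existing DataFrame):
import pandas as pd

# step 1: read the whole 700mb csv in one go
df = pd.read_csv('your_file.csv')

# step 2 (placeholder criteria): keep only the rows you care about
rows_of_interest = df[df['qCode'].str.startswith('A_')]

# step 3: append them to the pre-existing dataframe
existing_df = existing_df.append(rows_of_interest)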
A couple of options that come to mind
1) Parse the file as you are currently doing, but build a dict instead of appending to your dataframe. After you're done with that, convert the dict to a DataFrame and then use concat() to combine it with the existing DataFrame (see the sketch below).
2) Bring that csv into pandas using read_csv(), filter/parse out what you want, and then do a concat() with the existing dataframe.
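A rough sketch of option 1 (the parsing step is elided, and the reshape assumes one value per (i, point) pair):
import pandas as pd

# collect the parsed values in a dict of lists instead of growing the DataFrame
rows = {'i': [], 'point': [], 'value': []}
for line in fileParse:
    # ... same parsing as in the question to obtain i, point and value ...
    rows['i'].append(i)
    rows['point'].append(point)
    rows['value'].append(value)

# build a DataFrame once, reshape to one column per point, then combine
new_df = pd.DataFrame(rows).pivot(index='i', columns='point', values='value')
df = pd.concat([df, new_df])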
