I figured I would ask the question. I've found a clever way to reduce the size of a PySpark Dataframe and convert it to Pandas and I was just wondering, does the toPandas function get faster as the size of the pyspark dataframe gets smaller? Here is some code:
window = Window.partitionBy(F.lit('A')).orderBy(F.lit('A'))
eps_tfs = {}
while True:
    pdf = toPandas(conn.select(F.col('*')).where(F.col('row_number') <= 2500))
    n = len(pdf)
    trigger = 0
    for u in pdf['features']:
        indices = [i for i, x in enumerate(u) if x == 1.0]
        for idx in range(len(eps_columns)):
            if idx in indices:
                try:
                    eps_tfs[eps_columns[idx]].append(True)
                except KeyError:
                    eps_tfs[eps_columns[idx]] = [True]
            else:
                try:
                    eps_tfs[eps_columns[idx]].append(False)
                except KeyError:
                    eps_tfs[eps_columns[idx]] = [False]
    full_view = full_view.append(pd.concat([pdf, pd.DataFrame(eps_tfs)], axis=1))
    conn = conn.select(F.col('*')).where(F.col('row_number') > 2500)
    conn = conn.drop("row_number")
    conn = conn.select(F.col('*'), F.row_number().over(window).alias('row_number'))
    eps_tfs = {}
    del pdf
    if n < 2500:
        break
Also, is the following code really a faster way to convert the DataFrame to pandas?
def _map_to_pandas(rdds):
    """ Needs to be here due to pickling issues """
    return [pd.DataFrame(list(rdds))]

def toPandas(df, n_partitions=None):
    """
    Returns the contents of `df` as a local `pandas.DataFrame` in a speedy fashion.
    The DataFrame is repartitioned if `n_partitions` is passed.

    :param df: pyspark.sql.DataFrame
    :param n_partitions: int or None
    :return: pandas.DataFrame
    """
    if n_partitions is not None:
        df = df.repartition(n_partitions)
    df_pand = df.rdd.mapPartitions(_map_to_pandas).collect()
    df_pand = pd.concat(df_pand)
    df_pand.columns = df.columns
    return df_pand
Is there any better way to go about doing this?
The answer by @EZY is correct (you need to collect all rows to the driver or client). However, there is one more optimisation possible with the Apache Arrow integration, which provides a much faster columnar serialization path for NumPy and pandas data types. It's not enabled by default, so you need to enable it by setting the Spark conf as below.
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
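With that set, a plain toPandas() call picks up the Arrow path automatically. A minimal sketch, assuming an active SparkSession named spark and a Spark DataFrame sdf (on Spark 2.x the key is spark.sql.execution.arrow.enabled instead):
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
# toPandas() now transfers Arrow record batches instead of pickling row by row,
# which is typically much faster for numeric columns.
pdf = sdf.toPandas()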
Here is the source code for toPandas.
First of all, yes, toPandas will be faster as your PySpark DataFrame gets smaller; it behaves much like sdf.collect().
The difference is that toPandas returns a pandas DataFrame, while collect returns a list of Rows.
As you can see in the source code, pdf = pd.DataFrame.from_records(self.collect(), columns=self.columns): the pandas DataFrame is generated by pd.DataFrame.from_records from that list.
So if your sdf is smaller, there is less data to transfer over the network, and from_records has less data to process on your driver's CPU.
The design of the second code snippet is different: sdf stays distributed, the code calls mapPartitions so every worker builds a pandas DataFrame from its own subset of the data, and then it calls collect, at which point all of those pandas DataFrames are transferred over the network to the driver. The code then calls pd.concat to stitch them into a single DataFrame.
The benefits are:
When converting to a pandas DataFrame, all the workers work on small subsets of the data in parallel, which scales much better than bringing all the data to the driver and burning the driver's CPU to convert a giant dataset to pandas.
The function can repartition first; if your dataset is huge and you have a low number of partitions, the data on each partition will be huge, and the stock toPandas can fail with a serializer OOM and be very slow to collect. Repartitioning avoids that.
The Drawbacks are:
When you collect, you are no longer collecting native sdf rows but pandas DataFrames, which carry more metadata and are generally larger, so the total size of the transferred objects is bigger.
pd.concat is slow, but it may still be better than from_records.
So there is no universal conclusion about which method is better; choose your tool wisely. As in this question, toPandas may be faster for a small sdf, but for a large sdf the second code snippet definitely works better.
In our case, we found that just not doing toPandas() and using pdf = pd.DataFrame.from_records(self.collect(), columns=self.columns) was fastest. We couldn't use the arrow option because we got the error "arrow is not supported when using file-based collect".
Looking at the source code for toPandas(), one reason it may be slow is because it first creates the pandas DataFrame, and then copies each of the Series in that DataFrame over to the returned DataFrame. If you know that all of your columns have unique names, and that the data types will convert nicely via having pandas infer the dtype values, there is no need to do any of that copying or dtype inference.
Side note: We were converting a Spark DataFrame on Databricks with about 2 million rows and 6 columns, so your mileage may vary dependent on the size of your conversion.
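For reference, a minimal sketch of that direct path, assuming a Spark DataFrame named sdf, unique column names, and that you are happy letting pandas infer the dtypes:
import pandas as pd

# Collect the rows to the driver and build the pandas DataFrame directly,
# skipping the extra Series copying and dtype handling that toPandas() performs.
rows = sdf.collect()
pdf = pd.DataFrame.from_records(rows, columns=sdf.columns)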
I'm a Pandas newbie, so please bear with me.
Overview: I started with a free-form text file created by a data harvesting script that remotely accessed dozens of different kinds of devices, and multiple instances of each. I used OpenRefine (a truly wonderful tool) to munge that into a CSV that was then input to dataframe df using Pandas in a JupyterLab notebook.
My first inspection of the data showed the 'Timestamp' column was not monotonic. I accessed individual data sources as follows, in this case for the 'T-meter' data source. (The technique was taken from a search result - I don't really understand it, but it worked.)
cond = df['Source']=='T-meter'
rows = df.loc[cond, :]
df_tmeter = pd.DataFrame(columns=df.columns)
df_tmeter = df_tmeter.append(rows, ignore_index=True)
then checked each as follows:
df_tmeter['Timestamp'].is_monotonic
Fortunately, the problem was easy to identify and fix: Some sensors were resetting, then sending bad (but still monotonic) timestamps until their clocks were updated. I wrote the function healing() to cleanly patch such errors, and it worked a treat:
df_tmeter['healed'] = df_tmeter['Timestamp'].apply(healing)
Now for my questions:
How do I get the 'healed' values back into the original df['Timestamp'] column for only the 'T-meter' items in df['Source']?
Given the function healing(), is there a clean way to do this directly on df?
Thanks!
Edit: I first thought I should be using 'views' into df, but other operations on the data would either generate errors, or silently turn the views into copies.
I wrote a wrapper function heal_row() for healing():
def heal_row(row):
    if row['Source'] == 'T-meter':  # Redundant check, but safe!
        row['Timestamp'] = healing(row['Timestamp'])
    return row
then did the following:
df = df.apply(lambda row: row if row['Source'] != 'T-meter' else heal_row(row), axis=1)
This ordering is important, since healing() is stateful based on the prior row(s), and thus can't be the default operation.
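A more compact write-back, sketched under the same assumption that healing() only needs to see the T-meter timestamps in their existing row order:
# Boolean mask selecting only the T-meter rows.
cond = df['Source'] == 'T-meter'
# Heal just those timestamps and assign the result back in place.
df.loc[cond, 'Timestamp'] = df.loc[cond, 'Timestamp'].apply(healing)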
I have a pandas dataframe data_pandas which has about half a million rows and 30000 columns. I want this to be in a Spark dataframe data_spark and I achieve this by:
data_spark = sqlContext.createDataFrame(data_pandas)
I am working on an r3.8xlarge driver with 10 workers of the same configuration. But the aforementioned operation takes forever and returns an OOM error. Is there an alternate method I can try?
The source data is in HDF format, so I can't read it directly as a Spark dataframe.
You can try using Arrow, which can make the conversion more efficient.
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
For more details refer: https://bryancutler.github.io/toPandas/
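A minimal sketch of that path, assuming a SparkSession named spark and the data_pandas DataFrame from the question (on Spark 3.x the key is spark.sql.execution.arrow.pyspark.enabled):
# Enable Arrow-based conversion before creating the Spark DataFrame.
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
# With Arrow enabled, createDataFrame transfers the pandas data in columnar
# batches instead of converting it row by row on the driver.
data_spark = spark.createDataFrame(data_pandas)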
Another option is to convert the pandas dataframe in batches rather than in one go; for example, the code below divides it into 20 chunks (parts of the solution come from the questions here and here):
import numpy as np

def unionAll(*dfs):
    ' by @zero323 from here: http://stackoverflow.com/a/33744540/42346 '
    first, *rest = dfs  # Python 3.x, for 2.x you'll have to unpack manually
    return first.sql_ctx.createDataFrame(
        first.sql_ctx._sc.union([df.rdd for df in dfs]),
        first.schema
    )

df_list = []
for chunk in np.array_split(data_pandas, 20):
    df_list.append(sqlContext.createDataFrame(chunk))

df_all = unionAll(*df_list)
I have a directory of timeseries data stored as CSV files, one file per day. How do I load and process it efficiently with Dask DataFrame?
Disclaimer: I maintain Dask. This question occurs often enough in other channels that I decided to add a question here on StackOverflow to which I can point people in the future.
Simple Solution
If you just want to get something working quickly then simple use of dask.dataframe.read_csv with a globstring for the path should suffice:
import dask.dataframe as dd
df = dd.read_csv('2000-*.csv')
Keyword arguments
The dask.dataframe.read_csv function supports most of the pandas.read_csv keyword arguments, so you might want to tweak things a bit.
df = dd.read_csv('2000-*.csv', parse_dates=['timestamp'])
Set the index
Many operations like groupbys, joins, and index lookups can be more efficient if the target column is the index. For example, if the timestamp column is made the index then you can quickly look up the values for a particular range, or join efficiently with another dataframe along time. The savings here can easily be 10x.
The naive way to do this is to use the set_index method
df2 = df.set_index('timestamp')
However if you know that your new index column is sorted then you can make this much faster by passing the sorted=True keyword argument
df2 = df.set_index('timestamp', sorted=True)
Divisions
In the above case we still pass through the data once to find good breakpoints. However, if your data is already nicely segmented (such as one file per day) then you can give these division values to set_index to avoid this initial pass (which can be costly for a large amount of CSV data).
import pandas as pd
divisions = tuple(pd.date_range(start='2000', end='2001', freq='1D'))
df2 = df.set_index('timestamp', sorted=True, divisions=divisions)
This solution correctly and cheaply sets the timestamp column as the index (allowing for efficient computations in the future).
Convert to another format
CSV is a pervasive and convenient format. However it is also very slow. Other formats like Parquet may be of interest to you. They can easily be 10x to 100x faster.
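For example, a sketch of a one-off conversion, assuming the timestamp-indexed df2 from above, a local ./parquet directory, and an installed Parquet engine such as pyarrow or fastparquet:
# Write the timestamp-indexed dataframe out as Parquet once...
df2.to_parquet('./parquet')
# ...then load the much faster binary format instead of the CSVs from now on.
df3 = dd.read_parquet('./parquet')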
f = pd.read_hdf('Sensor_Data.h5','f')
pieces = [f[x: x + 360] for x in xrange(504649)]
df = pd.concat(pieces)
Morning all. I have a file with 500,000+ rows of data. I want to take 360 row slices from this, and move it down by 1 row each time. (So I will end up with a LOT of data. )
As expected, I tried the above code and got a memory error. I'm assuming there's a better way of doing this?
EDIT: To add some context, this is a .h5 file, and I'm using pandas dataframe to try and slice it this way. I'm trying to create an array of data to feed into a deep neural network using caffenet, though the format it will be in at this point will be unclear...
The code works for small amounts of data, just not for larger ones. To be clearer about what I'm trying to do:
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(10, 6)); df
[displays a 10 x 6 table of random numbers]
Now:
pieces = [df[x: x + 4] for x in xrange(7)]
f = pd.concat(pieces)
f
Displays a new table similar to the previous one, but expanded. It now has rows 0,1,2,3,1,2,3,4,2,3,4,5,3,4,5,6...
Now "pieces" is not a dataframe object itself, but a 'list' for some reason. Is there also a simple way to turn all of these separate datasets (0,1,2,3), (1,2,3,4) and so on into dataframe objects themselves? (Instead of concatenating them together into one dataframe?)
I hope this makes sense.
Consider using h5py. From the website: "For example, you can slice into multi-terabyte datasets stored on disk, as if they were real NumPy arrays".
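A sketch of that idea, assuming the file exposes a plain numeric HDF5 dataset you can slice directly (files written by pandas HDFStore have a more involved internal layout, and the dataset name below is hypothetical):
import h5py

# Open the file lazily; a slice is read from disk only when requested.
with h5py.File('Sensor_Data.h5', 'r') as hf:
    dset = hf['sensor_readings']  # hypothetical dataset name
    window = dset[0:360]          # pulls just these 360 rows into memory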
So you have a couple of questions there. First the easiest: pieces is a list because you've created it using a list comprehension, it should be a list of dataframe objects. If you want to use them as separate data frame objects you could just index into the list (i.e. pieces[0] etc).
But you still have the problem that you are trying to create a huge data frame. Without seeing the rest of how you would use the code, I'd suggest not creating half a million slices of your df but instead looping over your original data frame and calling whatever function you need on a single slice at a time:
for x in xrange(504649):
    result = my_func(df[x:x+360])
That way each slice is released after it's used, and hopefully each result is much smaller than the frame.
You could also, similarly to the above, write all your slices to separate CSV files and read them back in as you need them.
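A quick sketch of that variant (the filename pattern is just an example, and half a million small files is only worth it if you truly need every window persisted):
# Write each 360-row window to its own CSV file instead of keeping it in memory.
for x in range(504649):
    df[x:x + 360].to_csv('slice_{}.csv'.format(x), index=False)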
I have to read massive CSV files (500 million lines), and I tried to read them with pandas using the chunksize option in order to reduce memory consumption. But I didn't understand the behaviour of the concat method, or whether reading the whole file this way still reduces memory. I'm adding some pseudocode to explain what I did so far.
Let's say I'm reading and then concatenate a file with n lines with:
iter_csv = pd.read_csv('file.csv', chunksize=n/2)
df = pd.concat([chunk for chunk in iter_csv])
Then I have to apply a function to the dataframe to create a new column based on some values:
df['newcl'] = df.apply(function)
Everything goes fine.
But now I wonder what's the difference between the above procedure and the following:
iter_csv = pd.read_csv('file.csv', chunksize=n/2)
for chunk in iter_csv:
    chunk['newcl'] = chunk.apply(function)
df = pd.concat([chunk])
In terms of RAM consumption, I thought that the second method should be better because it applies the function only to the chunk and not to the whole dataframe. But the following issues occur:
putting df = pd.concat([chunk]) inside the loop gives me a dataframe with a size of n/2 (the size of the chunk), and not the full one;
putting df = pd.concat([chunk]) outside, after the loop, gives the same n/2-row dataframe.
So my doubt is whether the first method (concatenating the dataframe right after read_csv) is the best one, balancing speed and RAM consumption. And I'm also wondering how I can concatenate the chunks properly using the for loop.
Thanks for your support.
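For what it's worth, a sketch of the per-chunk pattern the question is reaching for, keeping the placeholders n and function from the pseudocode above; the key point is that pd.concat needs the full list of processed chunks, not just the last one:
import pandas as pd

iter_csv = pd.read_csv('file.csv', chunksize=n // 2)

processed = []
for chunk in iter_csv:
    # Apply the function while only this chunk is in memory...
    chunk['newcl'] = chunk.apply(function)
    processed.append(chunk)

# ...then concatenate all the processed chunks once, outside the loop.
df = pd.concat(processed)
Note that the final concat still materialises the full dataframe, so the main saving is that apply only ever sees one chunk at a time.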