Pyspark: OutOfMemoryError on count()

This is a follow-up to my previous question: following the suggestion there, I got the row-over-row percentage changes, and since the first row in the resulting dataframe df_diff was all null values, I did:
df_diff = df_diff.dropna()
df_diff.count()
The second statement throws the following error:
Py4JJavaError: An error occurred while calling o1844.count.
: java.lang.OutOfMemoryError: Java heap space
When I try the above code on the toy df posted in the previous question it works fine, but with my actual dataframe (834 rows, 51 columns) the error above is thrown. Any guidance as to why this is happening and how to handle it would be much appreciated. Thanks
EDIT:
In my actual dataframe (df) of 834 x 51, the first column is the date and the remaining columns are closing stock prices for 50 stocks, for which I'm trying to get the daily percentage changes. Partitioning the window by the date column made no difference to the previous error in this df in pyspark, and there doesn't seem to be any other natural candidate to partition by.
The only thing that sort of worked was to do this in spark-shell. There, without partitions, I was getting warning messages ...
WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
... until I called cache() on the dataframe, but this is not ideal for a large df.

Your original code is simply not scalable. The following window definition
w = Window.orderBy("index")
requires shuffling all of the data into a single partition, and as such is useful only for small, local datasets.
Depending on the data, you can try a more complex approach, like the one shown in Avoid performance impact of a single partition mode in Spark window functions.
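For illustration, one pattern along those lines is to avoid the global window entirely by shifting the index and joining, so the shuffle stays distributed instead of collapsing onto one partition. This is only a rough sketch, assuming the frame still carries the index column and that price_cols (not in the original post) is a Python list of the 50 price column names:
from pyspark.sql import functions as F

# previous row's values, re-keyed to the following index
df_prev = df.select(
    (F.col("index") + 1).alias("index"),
    *[F.col(c).alias(c + "_prev") for c in price_cols]
)

# row-over-row percentage change without a single-partition window
df_diff = df.join(df_prev, on="index", how="inner").select(
    "index",
    *[((F.col(c) - F.col(c + "_prev")) / F.col(c + "_prev")).alias(c) for c in price_cols]
)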


using pandas.read_csv() for malformed csv data

This is a conceptual question, so no code or reproducible example.
I am processing data pulled from a database which contains records from automated processes. The regular record contains 14 fields, with a unique ID, and 13 fields containing metrics, such as the date of creation, the time of execution, the customer ID, the type of job, and so on. The database accumulates records at the rate of dozens a day, and a couple of thousand per month.
Sometimes, the processes result in errors, which result in malformed rows. Here is an example:
id1,m01,m02,m03,m04,m05,m06,m07,m08,m09,m10,m11,m12,m13 /*regular record, no error, 14 fields*/
id2,m01,m02,m03,m04,m05,m06,m07,m08,m09,m10,m11,m12,"DELETE error, failed" /*error in column 14*/
id3,m01,m02,"NO SUCH JOB error, failed" /*error in column 4*/
id4,m01,m02,m03,m04,m05,m06,"JOB failed, no time recorded" /*error in column 7*/
The requirements are to (1) populate a dashboard from the metrics, and (2) catalog the types of errors. The ideal solution uses read_csv with on_bad_lines set to some function that returns a dataframe. My hacky solution is to munge the data by hand, row by row, and create two data frames from the output. The presence of the bad lines can be reliably detected by the use of the keyword "failed." I have written the logic that collects the "failed" messages and produces a stacked bar chart by date. It works, but I'd rather use a total Pandas solution.
Is it possible to use pd.read_csv() to return 2 dataframes? If so, how would this be done? Can you point me to any example code? Or am I totally off base? Thanks.
You can load your csv file into a DataFrame and apply a filter:
df = pd.read_csv("your_file.csv", header = None)
df_filter = df.apply(lambda row: row.astype(str).str.contains('failed').any(), axis=1)
df[df_filter.values] #this gives a dataframe of "failed" rows
df[~df_filter.values] #this gives a dataframe of "non failed" rows
You need to make sure that your keyword does not appear anywhere else in your data.
PS: There might be more optimized ways to do it.
This approach reads the entire CSV into a single column, then uses a mask that identifies the failed rows to split the data into separate good and failed dataframes.
Read the entire CSV into a single column
import io
import pandas as pd
dfs = pd.read_fwf(sim_csv, widths=[999999], header=None)  # each line lands in one wide column
Build a mask identifying the failed rows
fail_msk = dfs[0].str.contains('failed')
Use that mask to split out and build separate dataframes
df_good = pd.read_csv(io.StringIO('\n'.join(dfs[~fail_msk].squeeze())), header=None)
df_fail = pd.read_csv(io.StringIO('\n'.join(dfs[fail_msk].squeeze())), header=None)

Why is 'withColumn' taking so long in pyspark?

I have a pyspark dataframe containing 1000 columns and 10,000 records (rows).
I need to create 2000 more columns, by performing some computation on the existing columns.
df #pyspark dataframe containing 1000 columns and 10,000 records
df = df.withColumn('C1001', ((df['C269'] * df['C285'])/df['C41'])) #existing column names range from C1 to C1000
df = df.withColumn('C1002', ((df['C4'] * df['C267'])/df['C146']))
df = df.withColumn('C1003', ((df['C87'] * df['C134'])/df['C238']))
.
.
.
df = df.withColumn('C3000', ((df['C365'] * df['C235'])/df['C321']))
The issue is that this takes way too long, around 45 minutes or so.
Since I am a newbie, I was wondering: what am I doing wrong?
P.S.: I am running spark on databricks, with 1 driver and 1 worker node, both having 16GB Memory and 8 Cores.
Thanks!
A lot of what you're doing is just building an execution plan. Spark executes lazily until an action triggers it, so the 45 minutes that you're seeing is probably spent executing all of the transformations that you've been setting up.
If you want to see how long a single withColumn takes, trigger an action like df.count() beforehand, then do a single withColumn followed by another df.count() (to trigger the action again).
Take a closer look at the pyspark execution plan and at transformations versus actions.
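As a rough sketch of that measurement (column names taken from the question, the timing code itself is just illustrative):
import time

df.count()  # force whatever transformations are already queued
start = time.time()
df = df.withColumn('C1001', (df['C269'] * df['C285']) / df['C41'])
df.count()  # action that actually triggers the new transformation
print("single withColumn + count took %.1f s" % (time.time() - start))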
Without being too specific, looking at the observations of the 1st answer, and knowing that execution plans for many DF columns (aka "very wide data") are costly to compute, a move to RDD processing may well be the path to take.
Do this in a single chain rather than one by one:
df = df.withColumn('C1001', COl1).withColumn('C1002', COl2).withColumn('C1003', COl3) ......
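A related sketch (not from the answer above) is to build all of the expressions first and add them in one select, so the plan isn't grown one withColumn at a time:
from pyspark.sql import functions as F

new_cols = [
    ((F.col('C269') * F.col('C285')) / F.col('C41')).alias('C1001'),
    ((F.col('C4') * F.col('C267')) / F.col('C146')).alias('C1002'),
    # ... remaining expressions up to C3000
]
df = df.select('*', *new_cols)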

How to perform a rolling_median on a large data frame in Pandas without encountering the skiplist_insert failed error?

I have a huge data frame with about 1041507 rows.
I wanted to calculate a rolling median for certain values, under certain categories in my data frame. For this I used a groupby followed by apply:
df['rolling_median']=df['value'].groupby(['Category','Subcategory']).apply(pd.rolling_median,7,min_periods=7)
However, this gives me a MemoryError: skiplist_insert failed. I will attach the full traceback if needed, but I came across another similar post which specifies that this is an issue in pandas, as can be seen here: https://github.com/pydata/pandas/issues/11697. It occurs for a very large size (>~ 35000).
After this I tried to do a bit of manipulation to simply get the rolling median by iterating over each group separately:
for index, group in df.groupby(['Category','Subcategory']):
    print pd.rolling_median(group['value'], 7, min_periods=7)
Each group is about 20-25 rows long only. Yet this function fails and shows the same MemoryError after a few iterations.
I ran the code several times, and every time it showed the memory error for different items.
I created some dummy values for anyone to test, here:
import numpy as np
import pandas as pd

index = []
[index.append(x) for y in range(25) for x in np.arange(34000)]
sample = pd.DataFrame(np.arange(34000*25), index=index)
for index, group in sample.groupby(level=0):
    try:
        pd.rolling_median(group[0], 7, 7)
    except MemoryError:
        print index  # note which group raised the error
        print pd.rolling_median(group[0], 7, 7)
If I run the rolling_median again after encountering the MemoryError (as you can see in the above code), it runs fine without another exception.
I am not sure how I can calculate my rolling median if it keeps throwing the MemoryError.
Can anyone tell me a better way to calculate the rolling_median, or help me understand the issue here?
The groupby doesn't look right; you should change
df['rolling_median']=df['value'].groupby(['Category','Subcategory']).apply(pd.rolling_median,7,min_periods=7)
to
df['rolling_median']=df.groupby(['Category','Subcategory'])['value'].apply(pd.rolling_median,7,min_periods=7)
Otherwise the groupby won't work: df['value'] is a Series holding only the "value" column, so it can't be grouped by ['Category','Subcategory'] as those columns are not present.
Also, the groupby result is going to be smaller than the length of the dataframe, so creating df['rolling_median'] from it will cause a length mismatch.
Hope that helps.
The bug was fixed in pandas 0.18.0, and the rolling_mean() and rolling_median() functions are now deprecated.
This was the bug: https://github.com/pydata/pandas/issues/11697
The replacement rolling() API can be viewed here: http://pandas.pydata.org/pandas-docs/stable/computation.html
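In newer pandas the same calculation would look roughly like this (a sketch using the rolling() API rather than the deprecated function):
df['rolling_median'] = (
    df.groupby(['Category', 'Subcategory'])['value']
      .transform(lambda s: s.rolling(7, min_periods=7).median())
)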

Memory leak in pandas when dropping dataframe column?

I have some code like the following
df = ..... # load a very large dataframe
good_columns = set(['a','b',........]) # set of "good" columns we want to keep
columns = list(df.columns.values)
for col in columns:
    if col not in good_columns:
        df = df.drop(col, 1)
The odd thing is that it successfully drops the first column that is not good - so it isn't an issue where I am holding the old and new dataframe in memory at the same time and running out of space. It breaks on the second column being dropped (MemoryError). This makes me suspect there is some kind of memory leak. How would I prevent this error from happening?
It may be that you're constantly returning a new and very large dataframe.
Try setting drop's inplace parameter to True.
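A minimal sketch of that suggestion, also dropping every unwanted column in a single call rather than one at a time:
to_drop = [col for col in df.columns if col not in good_columns]
df.drop(to_drop, axis=1, inplace=True)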
Make use of the usecols argument while reading the large data frame, to keep only the columns you want instead of dropping them later on. Check here: http://pandas.pydata.org/pandas-docs/dev/generated/pandas.io.parsers.read_csv.html
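For example (the file name is just a placeholder):
import pandas as pd
df = pd.read_csv('very_large_file.csv', usecols=list(good_columns))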
I tried the inplace=True argument but still had the same issues. There is another solution dealing with a memory leak caused by the system architecture; that helped me when I had this same issue.

Running update of Pandas dataframe

I'd like to use a DataFrame to manage data from many trials of an experiment I'm controlling with Python code. Ideally I will have one master dataframe with a row for each trial that lives in the main function namespace, and then a separate dict (or dataframe) returned from the function that I call to execute the important bits of code for each trial.
What is the best way to do a running update of the master dataframe with this returned set of data? So far I've come up with:
df = df.append(df_trial, ignore_index=True)
or
df = pd.concat([df, df_trial])
But neither seems ideal (and both take a relatively long time according to %timeit). Is there a more Pandonic way?
You should build a list of the pieces and concatenate them all in one shot at the end.
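A minimal sketch of that pattern (run_trial and trials are stand-ins for whatever produces each trial's results, not names from the question):
import pandas as pd

pieces = []
for trial in trials:              # trials: whatever iterable drives the experiment
    df_trial = run_trial(trial)   # assumed to return a one-row DataFrame
    pieces.append(df_trial)

df = pd.concat(pieces, ignore_index=True)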
