I have two dataframes, each around 400k rows; call them a and b. What I want to do is, for every row in dataframe b, look up that row's account number in dataframe a. If it exists, I want to drop that row from dataframe a. The problem is that when I try to run this code, I keep getting memory errors. Initially I was using iterrows, but that seems to be a bad idea with large datasets, so I switched to apply, yet I'm running into the same error. Below is simplified pseudocode of what I'm trying:
def reduceAccount(accountID):
    # find the rows in frameA with this account ID and drop them from frameA
    idx = frameA.loc[frameA["AccountID"] == accountID].index
    frameA.drop(idx, inplace=True)

frameB["AccountID"].apply(reduceAccount)
I've even tried some shenanigans like iterating through the first few hundred/thousand rows at a time, but after a cycle I still hit the memory error, which makes me think I'm still loading things into memory rather than clearing them as I go. Is there a better way to reduce dataframe a than what I'm trying? Note that I do not want to merge the frames (yet), just remove any row in dataframe a that has a duplicate key in dataframe b.
The issue is that in order to see all the values you are filtering on, you need both DataFrames in memory at some point. You can improve your efficiency somewhat by not using apply(), which still iterates row by row under the hood. The following is a more efficient, vectorized approach using a boolean mask directly; it keeps the rows of frameA whose AccountID does not appear in frameB:
frameA[~frameA["AccountID"].isin(frameB["AccountID"])]
However, if storage is the problem, then this may still not work. Some approaches to consider are chunking the data, as you say you've already tried, or some of the options in the pandas documentation on enhancing performance.
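For example, here is a minimal sketch of the chunked approach, assuming frameA can be re-read from disk (the file name "frame_a.csv" and the chunk size are hypothetical) and that frameB's AccountID column fits comfortably in memory:

import pandas as pd

# IDs whose rows should be dropped from frameA
drop_ids = set(frameB["AccountID"])

filtered_chunks = []
for chunk in pd.read_csv("frame_a.csv", chunksize=100_000):
    # keep only rows whose AccountID does not appear in frameB
    filtered_chunks.append(chunk[~chunk["AccountID"].isin(drop_ids)])

frameA_reduced = pd.concat(filtered_chunks, ignore_index=True)

This way only one chunk of frameA plus the set of IDs is in memory at any time; the final concat is only as large as the surviving rows.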
So basically you want every row in A whose 'AccountID' is not in B.
This can be done with a left anti-join: merge frameB's key column onto frameA with indicator=True and keep only the rows flagged as 'left_only' (see the sketch below).
I think this is best in terms of memory efficiency, since you'll be leveraging pandas' optimized built-in code.
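A minimal sketch of that anti-join, assuming AccountID is the only key you need to match on (only frameB's key column is pulled into the merge):

import pandas as pd

merged = frameA.merge(
    frameB[["AccountID"]].drop_duplicates(),  # only the key column, de-duplicated
    on="AccountID",
    how="left",
    indicator=True,                           # adds a "_merge" column
)
frameA_reduced = merged[merged["_merge"] == "left_only"].drop(columns="_merge")

Dropping duplicates from frameB's keys first avoids multiplying rows during the merge.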
I want to convert a SQL query like
SELECT * FROM df WHERE id IN (SELECT id FROM an_df)
into dask equivalent.
So, I am trying this:
d=df[df['id'].isin(an_df['id'])]
But it is throwing NotImplementedError: passing a 'dask.dataframe.core.DataFrame' to 'isin'.
Then I converted an_df['id'] to a list, like
d = df[df['id'].isin(list(an_df['id']))] or d = df[df['id'].isin(an_df['id'].compute())]
but this is very time consuming.
I want a solution that is as fast as native Dask.
df has approximately 100 million rows.
Please help me with it.
Thanks
I recommend adding a minimal reproducible example, which will make solving this particular issue easier:
https://stackoverflow.com/help/minimal-reproducible-example
It seems like you are converting the pandas.core.series.Series object returned by an_df['id'].compute() to a list, which is not needed. isin() will take a pandas Series or DataFrame object as an argument. Please see:
https://docs.dask.org/en/latest/generated/dask.dataframe.DataFrame.isin.html
In your example this should work:
series_to_match = an_df['id'].compute()
d=df[df['id'].isin(series_to_match)]
So you can omit the list(...) conversion. I expect this to be a little bit faster, since that conversion is dropped. But there are still things you need to consider here. Depending on the size of an_df['id'].compute(), you may run into trouble, since that statement pulls the resulting series into the memory of the machine your scheduler is running on.
If this series is small enough, you could try using client.scatter to make sure all of your workers have that series persisted in memory; see:
http://distributed.dask.org/en/stable/locality.html
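As a rough sketch of the scatter idea, assuming a dask.distributed cluster is already running (the scheduler address and helper name are hypothetical); futures passed into delayed tasks are resolved on the workers:

import dask
import dask.dataframe as dd
from dask.distributed import Client

client = Client("scheduler-address:8786")          # hypothetical address
ids = an_df["id"].compute()                        # small pandas Series of ids
ids_future = client.scatter(ids, broadcast=True)   # one copy per worker

def filter_partition(part, ids):
    # runs on the workers; `ids` arrives as the scattered pandas Series
    return part[part["id"].isin(ids)]

parts = df.to_delayed()                            # one lazy pandas frame per partition
filtered = [dask.delayed(filter_partition)(p, ids_future) for p in parts]
d = dd.from_delayed(filtered)                      # metadata inferred from the first part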
If that series is a huge object you'll have to tackle this differently.
I have a dask dataframe and need to compare all the values in one column against all the other values in that same column. Eventually I will be applying a function to these value pairs, but for now I'm just trying to get the .merge to work. Example code:
my_dd['dummy'] = 1
my_dd = my_dd.merge(my_dd, how='inner', on='dummy', npartitions=100000)
print(str(len(my_dd)))
At this point my starting data set is pretty small: my_dd has about 19K rows before the join, so after the join there are going to be about 360 million rows.
The above code gives me a memory error. I am running on a single machine / LocalCluster with 32 GB of RAM. When specifying npartitions, the code looks like it will work but eventually fails partway through (when most of the graph tasks are mostly complete).
I realize the problem here is the massive resulting df, but I would imagine there is some solution. I have done a lot of digging and can't find anything.
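One way to sidestep materialising the full ~360-million-row result, sketched under the assumption that whatever you eventually do with each value pair can be reduced to a small summary per pair of partitions (pair_stats below is a hypothetical placeholder for that function):

import dask
import dask.dataframe as dd
import pandas as pd

def pair_stats(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame:
    # placeholder: cross two partitions and return only a small summary
    crossed = left.assign(dummy=1).merge(right.assign(dummy=1), on="dummy")
    return pd.DataFrame({"n_pairs": [len(crossed)]})

parts = my_dd.to_delayed()                                   # lazy pandas frames
tasks = [dask.delayed(pair_stats)(a, b) for a in parts for b in parts]
summaries = dd.from_delayed(tasks, meta={"n_pairs": "int64"}).compute()

Only one pair of partitions is crossed at a time, and only the small summaries are ever collected on the client.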
So I'm a newbie when it comes to working with big data.
I'm dealing with a 60GB CSV file so I decided to give Dask a try since it produces pandas dataframes. This may be a silly question but bear with me, I just need a little push in the right direction...
So I understand why the following query using the "compute" method would be slow (lazy computation):
df.loc[1:5, 'enaging_user_following_count'].compute()
Btw, it took 10 minutes to compute.
But what I don't understand is why the following code using the "head" method prints the output in less than two seconds (in this case I'm asking for 250 rows, while the previous snippet was just for 5 rows):
df.loc[50:300, 'enaging_user_following_count'].head(250)
Why doesn't the "head" method take a long time? I feel like I'm missing something here, because I'm able to pull a much larger number of rows in a far shorter time than with the "compute" method.
Or is the compute method meant for other situations?
Note: I tried to read the documentation, but there was no explanation of why head() is fast.
I played around with this a bit half a year ago. .head() does not check all your partitions; it simply checks the first partition. There is no synchronization overhead etc., so it is quite fast, but it does not take the whole dataset into account.
You could try
df.loc[-251:, 'enaging_user_following_count'].head(250)
IIRC you should get the last 250 entries of the first partition instead of the actual last indices.
Also if you try something like
df.loc[conditionThatIsOnlyFulfilledOnPartition3 , 'enaging_user_following_count'].head(250)
you get an error that head could not find 250 samples.
If you actually just want the first few entries however it is quite fast :)
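As a side note, dask's head() takes an npartitions argument if you do want it to look beyond the first partition (-1 means search every partition), at the cost of more work:

# look across all partitions instead of only the first one
df['enaging_user_following_count'].head(250, npartitions=-1)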
This processes the whole dataset:
df.loc[1:5, 'enaging_user_following_count'].compute()
The reason is that loc is a label-based selector, and there is no telling which labels exist in which partition (there's no reason they should be monotonically increasing). When the index is well-formed, you may have useful values in df.divisions, and in that case Dask should be able to pick only the partitions of your data that it needs.
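A small sketch of that situation, assuming your data has some sortable column, say record_id (a hypothetical name), that you can promote to the index:

df2 = df.set_index("record_id")   # one-off shuffle, but the divisions become known
print(df2.divisions)              # tuple of partition boundary labels
subset = df2.loc[1:5].compute()   # now only the relevant partition(s) are read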
I have a big pyspark dataframe on which I am performing a number of transformations and joins with other dataframes. I would like to check whether the transformations and joins succeed and whether the dataframe looks as intended, but how can I show a small subset of the dataframe?
I have tried numerous things e.g.
df.show(5)
and
df.limit(5).show()
but everything I try kicks off a large number of jobs, resulting in slow performance.
I could spin up a very large cluster, but is there a way of getting only a small subset of the dataframe, fast?
Try the rdd equivalent of the dataframe
rdd_df = df.rdd
rdd_df.take(5)
Or try printing the dataframe schema:
df.printSchema()
First, to show a certain number of rows you can use the limit() method after calling a select() method, like this:
df.select('*').limit(5).show()
Also note that the df.show() action only prints the first 20 rows; it will not print the whole dataframe.
Second, and more importantly:
Spark actions:
A Spark dataframe does not contain data; it contains instructions and an operation graph. Since Spark works with big data, it does not execute each operation as it is called; to prevent slow performance, methods are separated into two kinds, actions and transformations. Transformations are collected and stored in the operation graph.
An action is a method that causes the dataframe to execute all the accumulated operations in the graph, which is what causes the slow performance, since it executes everything (note: UDFs are extremely slow).
show() is an action: when you call show(), it has to compute every pending transformation in order to show you the actual data.
Keep that in mind.
To iterate faster, you have to understand the difference between actions and transformations.
A transformation is any operation that results in another RDD/Spark dataframe, for example df.filter(...).join(...).groupBy(...). An action is any operation that results in something that is not an RDD/dataframe, for example df.write(...), df.count(), or df.show().
Transformations are lazy: unlike plain Python, writing df1 = df.filter(...) and df2 = df1.groupBy(...) does not mean that df, df1, and df2 are sitting in memory. No data flows through memory until you call an action, like .show() in your case.
Calling df.limit(5).show() will not speed up your iteration, because the limit only restricts the final dataframe that gets printed, not the original data flowing through memory.
As others have suggested, you should be able to limit your input data size in order to test whether your transformations work more quickly. To further improve iteration, you can cache dataframes produced by mature transformations instead of running them over and over again.
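A rough sketch of that workflow (the input paths, sample fraction, and column names are hypothetical):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

raw = spark.read.parquet("s3://bucket/big_table/")      # hypothetical source
lookup = spark.read.parquet("s3://bucket/dim_table/")   # hypothetical lookup table

# develop against a small, cached sample so each show() does not recompute
# the whole pipeline from the raw data
sample = raw.sample(fraction=0.001, seed=42).cache()
sample.count()                                          # action: materializes the cache

transformed = sample.filter("status = 'active'").join(lookup, on="id", how="left")
transformed.show(5)                                     # fast: runs on the cached sample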
Is sorting a DataFrame in pandas memory efficient? I.e., can I sort the dataframe without reading the whole thing into memory?
Internally, pandas relies on numpy.argsort to do all the sorting.
That being said: pandas DataFrames are backed by numpy arrays, which have to be present in memory as a whole. So, to answer your question: No, pandas needs the whole dataset in memory for sorting.
Additional thoughts:
You can of course implement such a disk-based external sort in multiple steps: load a chunk of your dataset, sort it, and save the sorted version; repeat for every chunk. Then load a part of each sorted subset, merge them into one DataFrame, and sort that. You'll have to be careful here about how much to load from each source. For example, if your 1000-element dataset is already sorted, taking the top 10 results from each of the 10 subsets won't give you the correct top 100. It will, however, give you the correct top 10.
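For illustration, a rough sketch of such an external sort, assuming the data lives in a CSV and is sorted by a numeric column named "value" (the paths, column name, and chunk sizes are all hypothetical):

import heapq
import pandas as pd

# Pass 1: sort each chunk independently and spill it to disk.
chunk_files = []
for i, chunk in enumerate(pd.read_csv("big.csv", chunksize=1_000_000)):
    path = f"sorted_chunk_{i}.csv"
    chunk.sort_values("value").to_csv(path, index=False)
    chunk_files.append(path)

# Pass 2: stream the sorted chunks back in and merge them by the sort key.
def rows(reader):
    for piece in reader:
        for record in piece.itertuples(index=False):
            yield record

readers = [pd.read_csv(p, chunksize=10_000) for p in chunk_files]
merged = heapq.merge(*(rows(r) for r in readers), key=lambda rec: rec.value)

with open("sorted.csv", "w") as out:
    header_written = False
    for rec in merged:
        if not header_written:
            out.write(",".join(rec._fields) + "\n")
            header_written = True
        out.write(",".join(map(str, rec)) + "\n")

(The naive CSV writing here ignores quoting; a real implementation would be more careful, or just hand the job to a database as suggested below.)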
Without further information about your data, I suggest you let some (relational) database handle all that stuff. They're made for this kind of thing, after all.