Let's say I have a dataframe with 1 million rows and 30 columns.
I want to add a column whose value is "the most frequent value of the previous 30 columns". I also want to add the "second most frequent value of the previous 30 columns".
I know that you can do df.mode(axis=1) for "the most frequent value of the previous 30 columns", but it is so slow.
Is there any way to vectorize this so it could be fast?
df.mode(axis=1) is already vectorized. However, you may want to consider how it works. It needs to operate on each row independently, which means you would benefit from "row-major order", which NumPy calls C order. A Pandas DataFrame stores its data in column-major order, which means that gathering the 30 values needed to compute the mode for one row requires touching 30 different regions of memory, which is not efficient.
So, try loading your data into a plain NumPy 2D array and see if that helps speed things up. It should.
I tried this on my 1.5 GHz laptop:
import numpy as np
import pandas as pd
import scipy.stats

x = np.random.randint(0, 5, (10000, 30))
df = pd.DataFrame(x)
%timeit df.mode(axis=1)
%timeit scipy.stats.mode(x, axis=1)
The DataFrame way takes 6 seconds (!), whereas the SciPy (row-major) way takes 16 milliseconds for 10k rows. Even SciPy in column-major order is not much slower, which makes me think the Pandas version is less efficient than it could be.
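To make this concrete, here is a sketch (not part of the original answer) of attaching both requested columns with SciPy. The second-mode trick masks out each row's first mode and re-runs scipy.stats.mode with nan_policy="omit"; exact NaN handling differs a bit between SciPy versions, so treat this as a starting point rather than a drop-in solution.

import numpy as np
import pandas as pd
from scipy import stats

x = df.to_numpy()                                   # row-major copy of the 30 columns

# Most frequent value per row.
first_mode = stats.mode(x, axis=1).mode.ravel()
df["most_frequent"] = first_mode

# Second most frequent value per row: hide the first mode and take the mode of
# what remains, ignoring the NaNs introduced by the mask. Rows where all 30
# values are identical come back as NaN here.
masked = np.where(x == first_mode[:, None], np.nan, x.astype(float))
df["second_most_frequent"] = stats.mode(masked, axis=1, nan_policy="omit").mode.ravel()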
Related
I have a Dask dataframe of 100 million rows of data.
I am trying to iterate over this dataframe without loading the entire dataframe
to RAM.
As an experiment, I am trying to access the row with index equal to 1.
%time dask_df.loc[1].compute()
The time it took is a whopping 8.88 s (wall time).
Why is it taking so long?
What can I do to make it faster?
Thanks in advance.
Per request, here is the code.
It is just reading 100 million rows of data and trying to access a row.
`dask_df = dd.read_parquet("/content/drive/MyDrive/AffinityScore_STAGING/staging_affinity_block1.gzip", chunksize=10000000)`
Dask DataFrame Structure:
               avg_user_prod_aff_score  internalItemID  internalUserID
npartitions=1                  float32           int16           int32
len(dask_df)
100,000,000
%time dask_df.loc[1].compute()
There are just 3 columns with datatypes of float32, int16 and int32.
The dataframe is indexed starting at 0.
Writing time is actually very good, at around 2 minutes.
I must be doing something wrong here.
Similarly to pandas, dask_df[1] would actually reference a column, not a row. So if you have a column named 1 then you're just loading a column from the whole frame. You can't access rows positionally - df.iloc only supports indexing along the second (column) axis. If your index has the value 1 in it, you could select this with df.loc, e.g.:
df.loc[1].compute()
See the dask.dataframe docs on indexing for more information and examples.
When performing .loc on an unindexed dataframe, Dask will need to decompress the full file. Since each partition will have its own index, .loc[N] will check every partition for that N, see this answer.
One way of resolving this is to pay the cost of generating a unique index once and saving the indexed parquet file. This way .loc[N] will only load information from the specific partition (or row group) that contains row N.
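A sketch of that workflow (the row_id column and the paths here are placeholders, not from the original post; calculate_divisions needs a recent Dask, older versions used gather_statistics=True):

import dask.dataframe as dd

ddf = dd.read_parquet("staging_affinity_block1.gzip")

# One-time cost: sort/shuffle on a unique key and persist the result.
ddf = ddf.set_index("row_id")            # "row_id" is an assumed unique column
ddf.to_parquet("staging_affinity_indexed/")

# With known divisions, .loc[1] only reads the one partition (row group)
# whose index range contains 1.
indexed = dd.read_parquet("staging_affinity_indexed/", calculate_divisions=True)
row = indexed.loc[1].compute()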
It looks like there is a performance issue with Dask when trying to access 10 million rows. It took 2.28 secs to access the first 10 rows.
With 100 million rows, it takes a whopping 30 secs.
I have a small toy dataset of 23 hours of irregular time series data (financial tick data) with millisecond granularity, roughly 1M rows. By irregular I mean that the timestamps are not evenly spaced. I also have a column 'mid' with some values.
I am trying to group into e.g. 2-minute buckets, take the absolute difference of 'mid' (last minus first) within each bucket, and then take the median, in the following manner:
df.groupby(["RIC", pd.Grouper(freq='2min')]).mid.apply(
lambda x: np.abs(x[-1] - x[0]) if len(x) != 0 else 0).median()
Note: 'RIC' is just another layer of grouping I am applying before the time bucket grouping.
Basically, I am telling pandas to group by every [ith minute : ith + 2 minutes] interval, and in each interval, take the last (x[-1]) and the first (x[0]) 'mid' element, and take their absolute difference. I am doing this over a range of 'freqs' as well, e.g. 2min, 4min, ..., up to 30min intervals.
This approach works completely fine, but it is awfully slow because of the usage of pandas' .apply function. I am aware that .apply doesn't take advantage of the built in vectorization of pandas and numpy, as it is computationally no different to a for loop, and am trying to figure out how to achieve the same without having to use apply so I can speed it up by several orders of magnitude.
Does anyone know how to rewrite the above code to ditch .apply? Any tips will be appreciated!
On the pandas groupby.apply webpage:
"While apply is a very flexible method, its downside is that using it can be quite a bit slower than using more specific methods like agg or transform. Pandas offers a wide range of methods that will be much faster than using apply for their specific purposes, so try to use them before reaching for apply."
Therefore, using transform should be a lot faster.
grouped = df.groupby(["RIC", pd.Grouper(freq='2min')])
abs(grouped.mid.transform("last") - grouped.mid.transform("first")).median()
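One caveat (an addition, not part of the original answer): transform broadcasts each group's result back to every row, so the final .median() above is taken over all 1M rows rather than over the 2-minute buckets. To get one value per (RIC, bucket) group, which is what the original apply produced the median over, agg does the same thing:

g = df.groupby(["RIC", pd.Grouper(freq='2min')]).mid
(g.agg("last") - g.agg("first")).abs().median()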
I am currently working on a project that uses Pandas, with a large dataset (~42K rows x 1K columns).
My dataset has many missing values which I want to interpolate to obtain a better result when training an ML model on this data. My method of interpolating is to take the average of the previous and the next value and use that as the value for any NaN. Example:
TRANSACTION    PAYED  MONDAY  TUESDAY  WEDNESDAY
D8Q3ML42DS0    1      123.2   NaN      43.12
So in the above example the NaN would be replaced with the average of 123.2 and 43.12, which is 83.16. If the value can't be interpolated, a 0 is put in instead. I was able to implement this in a number of ways, but I always run into the issue of it taking a very long time to process all of the rows in the dataset, despite running it on an Intel Core i9. The following are approaches I've tried and found to take too long:
Interpolating the data and then only replacing the elements that need to be replaced instead of replacing the entire row.
Replacing the entire row with a new pd.Series that has the old and the interpolated values. It seems like my code is able to execute reasonably well on a Numpy Array but the slowness comes from the assignment.
I'm not quite sure why the performance of my code comes nowhere close to df.interpolate() despite it being the same idea. Here is some of my code responsible for the interpolation:
import math
import statistics

def interpolate(array: np.ndarray):
    arr_len = len(array)
    for i in range(arr_len):
        if math.isnan(array[i]):
            # Fall back to 0 at the edges or when a neighbour is also NaN.
            if i == 0 or i == arr_len - 1 or math.isnan(array[i - 1]) or math.isnan(array[i + 1]):
                array[i] = 0
            else:
                array[i] = statistics.mean([array[i - 1], array[i + 1]])
    return array

for transaction_id in df.index:
    df.loc[transaction_id, 2:] = interpolate(df.loc[transaction_id, 2:])
My understanding is that Pandas has some sort of parallel techniques and functions it can use to perform this. How can I speed this process up, even a little?
df.interpolate(method='linear', limit_direction='forward', axis=0)
Try doing this; it might help.
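Note that axis=0 interpolates down each column, while the rule described in the question works across the columns of each row. A vectorized NumPy sketch of that exact rule, not part of the original answer (the .iloc[:, 2:] slice for the numeric columns is an assumption), could look like this; it uses the original neighbour values, so runs of consecutive NaNs fall back to 0:

import numpy as np

vals = df.iloc[:, 2:].to_numpy(dtype=float)

left = np.roll(vals, 1, axis=1)          # value from the previous column
right = np.roll(vals, -1, axis=1)        # value from the next column
neighbour_mean = (left + right) / 2.0    # NaN whenever either neighbour is NaN

filled = np.where(np.isnan(vals), neighbour_mean, vals)
filled[:, 0] = np.where(np.isnan(vals[:, 0]), 0.0, vals[:, 0])      # no left neighbour
filled[:, -1] = np.where(np.isnan(vals[:, -1]), 0.0, vals[:, -1])   # no right neighbour
filled = np.nan_to_num(filled, nan=0.0)  # NaN neighbours -> fall back to 0

df.iloc[:, 2:] = filled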
I have two pandas dataframes bookmarks and ratings where columns are respectively :
id_profile, id_item, time_watched
id_profile, id_item, score
I would like to find the score for each (profile, item) couple in the ratings dataframe (set to 0 if it does not exist). The problem is that the bookmarks dataframe has 73 million rows and it takes a very long time (after 15 min the code is still running). I suppose there is a better way to do it.
Here is my code :
def find_rating(val):
    res = ratings.loc[(ratings['id_profile'] == val[0]) & (ratings['id_asset'] == val[1])]
    if res.empty:
        return 0
    return res['score'].values[0]

arr = bookmarks[['id_profile', 'id_asset']].values
rates = [find_rating(i) for i in arr]
I work on Colab.
Do you think I can improve speed of the execution?
Just some thoughts! I have not tried such large data in Pandas.
In pandas, the data is indexed on rows as well as columns, so if you have 1 million rows with 5 columns, there are effectively 5 million indexed records.
For a performance boost:
Check if you can use Sparse* data structures - https://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html
Filter out as much unnecessary data as feasible.
If you can, try to stick to using only NumPy. You give up a few features, but you also lose some drag. Worth exploring.
Use a distributed multithreaded/multiprocessing tool like Dask or Ray. If you have 4 cores, you can run 4 parallel jobs, which can cut the runtime substantially.
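As a concrete illustration of avoiding per-row Python work (not from the original answer; it assumes the join keys are id_profile and id_asset, as in the posted code), a single left merge replaces the 73 million individual lookups:

merged = bookmarks.merge(
    ratings[['id_profile', 'id_asset', 'score']],
    on=['id_profile', 'id_asset'],
    how='left',
)
merged['score'] = merged['score'].fillna(0)   # missing (profile, item) pairs -> 0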
I have a huge dataframe (4 million rows and 25 columns). I am trying to investigate 2 categorical columns. One of them has around 5000 levels (app_id) and the other has 50 levels (app_category).
I have seen that for each level in app_id there is a unique value of app_category. How do I write code to prove that?
I have tried something like this:
app_id_unique = list(train['app_id'].unique())
for unique in app_id_unique:
    train.loc[train['app_id'] == unique].app_category.nunique()
This code takes forever.
I think you need groupby with nunique:
train.groupby('app_id').app_category.nunique()
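To turn that into a single yes/no check (an addition, not part of the original answer), verify that every group has exactly one category:

(train.groupby('app_id').app_category.nunique() == 1).all()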