Get mean of numpy array using pandas groupby - python

I have a DataFrame where one column is a numpy array of numbers. For example,
import numpy as np
import pandas as pd
df = pd.DataFrame.from_dict({
    'id': [1, 1, 2, 2, 3, 3, 3, 4, 4],
    'data': [np.array([0.43, 0.32, 0.19]),
             np.array([0.41, 0.11, 0.21]),
             np.array([0.94, 0.35, 0.14]),
             np.array([0.78, 0.92, 0.45]),
             np.array([0.32, 0.63, 0.48]),
             np.array([0.17, 0.12, 0.15]),
             np.array([0.54, 0.12, 0.16]),
             np.array([0.48, 0.16, 0.19]),
             np.array([0.14, 0.47, 0.01])]
})
I want to groupby the id column and aggregate by taking the element-wise average of the array. Splitting the array up first is not feasible since it is length 300 and I have 200,000+ rows. When I do df.groupby('id').mean(), I get the error "No numeric types to aggregate". I am able to get an element-wise mean of the lists using df['data'].mean(), so I think there should be a way to do a grouped mean. To clarify, I want the output to be an array for each value of ID. Each element in the resulting array should be the mean of the values of the elements in the corresponding position within each group. In the example, the result should be:
pd.DataFrame.from_dict({
    'id': [1, 2, 3, 4],
    'data': [np.array([0.42, 0.215, 0.2]),
             np.array([0.86, 0.635, 0.29500000000000004]),
             np.array([0.3433333333333333, 0.29, 0.26333333333333336]),
             np.array([0.31, 0.315, 0.1])]
})
Could someone suggest how I might do this? Thanks!

Take the mean twice, once at the array level and once at the group level:
df['data'].map(np.mean).groupby(df['id']).mean().reset_index()
id data
0 1 0.278333
1 2 0.596667
2 3 0.298889
3 4 0.241667
Based on the comment, you can do:
pd.DataFrame(df['data'].tolist(), index=df['id']).mean(level=0).agg(np.array, 1)
id
1 [0.42, 0.215, 0.2]
2 [0.86, 0.635, 0.29500000000000004]
3 [0.3433333333333333, 0.29, 0.26333333333333336]
4 [0.31, 0.315, 0.1]
dtype: object
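Note: mean(level=0) relies on the level argument, which has been deprecated in newer pandas versions. A rough groupby-based equivalent (a sketch, assuming df from the question) would be:
import numpy as np
import pandas as pd

# expand the arrays into columns, take the per-group mean, then repack each row into an array
wide = pd.DataFrame(df['data'].tolist(), index=df['id'])
means = wide.groupby(level=0).mean()
result = pd.Series(list(means.to_numpy()), index=means.index, name='data')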
Or:
df.groupby("id")['data'].apply(np.mean)

First, splitting up the array is actually feasible: your current storage keeps a full Python object per row inside the DataFrame, which takes a lot more space than simply storing the values as a flat 2D array.
# Your current memory usage
df.memory_usage(deep=True).sum()
1352
# Create a new DataFrame (really just overwrite `df` but keep separate for illustration)
df1 = pd.concat([df['id'], pd.DataFrame(df['data'].tolist())], axis=1)
# id 0 1 2
#0 1 0.43 0.32 0.19
#1 1 0.41 0.11 0.21
#2 2 0.94 0.35 0.14
#3 2 0.78 0.92 0.45
#4 3 0.32 0.63 0.48
#5 3 0.17 0.12 0.15
#6 3 0.54 0.12 0.16
#7 4 0.48 0.16 0.19
#8 4 0.14 0.47 0.01
Yes, this looks bigger, but in terms of memory it is actually smaller. The 3x saving here is a bit extreme; for larger DataFrames with long arrays the flat version will probably be more like 95% of the original memory, but it still has to be less.
df1.memory_usage(deep=True).sum()
#416
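If you want to sanity-check this on a larger frame yourself, a quick sketch (synthetic data, no particular numbers assumed):
import numpy as np
import pandas as pd

# compare object-array storage against flat column storage on a bigger synthetic frame
big = pd.DataFrame({'id': np.repeat(np.arange(1000), 10),
                    'data': [np.random.rand(300) for _ in range(10000)]})
flat = pd.concat([big['id'], pd.DataFrame(big['data'].tolist())], axis=1)
print(big.memory_usage(deep=True).sum(), flat.memory_usage(deep=True).sum())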
And now your aggregation is a normal groupby + mean, columns give the location in the array
df1.groupby('id').mean()
# 0 1 2
#id
#1 0.420000 0.215 0.200000
#2 0.860000 0.635 0.295000
#3 0.343333 0.290 0.263333
#4 0.310000 0.315 0.100000

Group by and take the mean of the arrays, where the output is an array of element-wise mean values:
df['data'].map(np.array).groupby(df['id']).mean().reset_index()
Output:
id data
0 1 [0.42, 0.215, 0.2]
1 2 [0.86, 0.635, 0.29500000000000004]
2 3 [0.3433333333333333, 0.29, 0.26333333333333336]
3 4 [0.31, 0.315, 0.1]

You can always .apply the numpy mean.
df.groupby('id')['data'].apply(np.mean).apply(np.mean)
# returns:
id
1 0.278333
2 0.596667
3 0.298889
4 0.241667
Name: data, dtype: float64

Related

Efficient embedding computations for large DataFrame

Given the DataFrame:
id articleno target
0 1 [607303] 607295
1 1 [607295] 607303
2 2 [243404, 617953] 590448
3 2 [590448, 617953] 243404
for each row, compute the average article embedding by looking up each item of its articleno list in the dictionary:
embeddings = {"607303": np.array([0.19, 0.25, 0.45]),
              "607295": np.array([0.77, 0.76, 0.55]),
              "243404": np.array([0.35, 0.44, 0.32]),
              "617953": np.array([0.23, 0.78, 0.24]),
              "590448": np.array([0.67, 0.12, 0.10])}
So, for example, and to clarify: for the third row (index 2), the article embeddings for 243404 and 617953 are [0.35, 0.44, 0.32] and [0.23, 0.78, 0.24], respectively. The average article embedding is computed as the element-wise sum of the embeddings, divided by the number of articles, so: ([0.35, 0.44, 0.32] + [0.23, 0.78, 0.24]) / 2 = [0.29, 0.61, 0.28].
Expected output:
id dim1 dim2 dim3 target
0 1 0.19 0.25 0.45 607295
1 1 0.77 0.76 0.55 607303
2 2 0.29 0.61 0.28 590448
3 2 0.45 0.45 0.17 243404
In reality, my DataFrame has millions of rows, and the lists in articleno can contain many more items. Because of this, iterating over the rows might be too slow, and a more efficient solution (perhaps vectorized) could be needed.
Moreover, the number of dimensions (embedding size) is known beforehand but is a couple of hundred, so the number of columns dim1, dim2, dim3, ..., dimN should be dynamic, based on the dimension of the embedding (N).
In the previous question, you went the extra mile to separate the elements of the articleno list and then remove the target from it. Now, if you want to access the elements inside the articleno list, you need to go the extra mile again to separate them.
To illustrate what I mean, here's an approach that generates the output of both questions while adding minimal extra code:
# construct the embeddings dataframe:
embedding_df = pd.DataFrame(embeddings).T.add_prefix('dim')
# aggregation dictionary
agg_dict = {'countrycode': 'first', 'articleno': list}
# taking the mean over the embedding columns
for i in embedding_df.columns:
    agg_dict[i] = 'mean'
new_df = df.explode('articleno')
(new_df.join(new_df['articleno'].rename('target'))
       .query('articleno != target')
       .merge(embedding_df, left_on='articleno', right_index=True)  # this line is extra from the previous question
       .groupby(['id', 'target'], as_index=False)
       .agg(agg_dict)
)
Output:
id target countrycode articleno dim0 dim1 dim2
0 2 590448 US [617953, 617953] 0.23 0.78 0.24
1 2 617953 US [590448, 590448] 0.67 0.12 0.10
Now, if you don't care about the articleno column in the final output, you can simplify your code even further while lowering memory/runtime, like this:
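# note: `g` is not defined in this snippet; it is presumably a groupby on 'id' of the
# exploded frame merged with embedding_df. For the subtraction below to align row-wise,
# the per-group sums would need to be broadcast back to the rows (e.g. via .transform('sum'))
# rather than reduced to one row per group.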
total_embeddings = g[embedding_df.columns].sum()
article_counts = g['id'].transform('size')
new_df[embedding_df.columns] = (total_embeddings.sub(new_df[embedding_df.columns])
                                .div(article_counts - 1, axis=0))
and you would get the same output.

Return rows in a Pandas dataframe by matching the nth digit of an integer?

I have a Pandas DataFrame. One integer column 'LOADCASE_ID' contains values of a fixed length, say 7 digits. I want to return rows where the nth digit matches a specific value.
E.g.
d = {'X_VAL': [1.2, 0.2, 1.4, 2.3, 0.25],
     'LOADCASE_ID': [1100123, 1200456, 1300345, 2134324, 2345300]}
df = pd.DataFrame(data=d)
df
Gives...
X_VAL LOADCASE_ID
0 1.20 1100123
1 0.20 1200456
2 1.40 1300345
3 2.30 2134324
4 0.25 2345300
I want something like...
df.loc[df['LOADCASE_ID'] == ?3?????]
to return...
X_VAL LOADCASE_ID
2 1.40 1300345
4 0.25 2345300
Thanks in advance for any help.
You can use pandas' str accessor to slice the column at a specific index and perform boolean indexing with the result:
df[df.LOADCASE_ID.astype(str).str[1].eq('3')]
X_VAL LOADCASE_ID
2 1.40 1300345
4 0.25 2345300
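As an illustration, the same idea wrapped in a small helper (hypothetical name nth_digit_eq, and assuming every ID has the same number of digits):
import pandas as pd

# boolean mask that is True where the n-th digit (0-based, counted from the left) equals `digit`
def nth_digit_eq(s: pd.Series, n: int, digit: int) -> pd.Series:
    return s.astype(str).str[n].eq(str(digit))

df[nth_digit_eq(df['LOADCASE_ID'], 1, 3)]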

How to normalize the columns of a DataFrame using sklearn.preprocessing.normalize?

Is there a way to normalize the columns of a DataFrame using sklearn's normalize? I think that by default it normalizes rows.
For example, if I had df:
A B
1000 10
234 3
500 1.5
I would want to get the following:
A B
1 1
0.234 0.3
0.5 0.15
Why do you need sklearn?
Just use pandas:
>>> df / df.max()
A B
0 1.000 1.00
1 0.234 0.30
2 0.500 0.15
>>>
You can use div after getting the max:
df.div(df.max(), axis=1)
Out[456]:
A B
0 1.000 1.00
1 0.234 0.30
2 0.500 0.15
sklearn defaults to normalizing rows with the L2 norm. Both of these arguments need to be changed to get your desired normalization by the maximum value along each column:
from sklearn import preprocessing
preprocessing.normalize(df, axis=0, norm='max')
#array([[1. , 1. ],
# [0.234, 0.3 ],
# [0.5 , 0.15 ]])
From the documentation
axis : 0 or 1, optional (1 by default) axis used to normalize the data
along. If 1, independently normalize each sample, otherwise (if 0)
normalize each feature.
So just change the axis. Having said that, sklearn is overkill for this task; it can be achieved easily using pandas.
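Note that preprocessing.normalize returns a plain numpy array; if you want to keep the column names and index, a small sketch:
import pandas as pd
from sklearn import preprocessing

df = pd.DataFrame({'A': [1000, 234, 500], 'B': [10, 3, 1.5]})

# wrap the normalized array back into a DataFrame to preserve labels
normalized = pd.DataFrame(preprocessing.normalize(df, axis=0, norm='max'),
                          columns=df.columns, index=df.index)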

Pandas - Using `.rolling()` on multiple columns

Consider a pandas DataFrame which looks like the one below
A B C
0 0.63 1.12 1.73
1 2.20 -2.16 -0.13
2 0.97 -0.68 1.09
3 -0.78 -1.22 0.96
4 -0.06 -0.02 2.18
I would like to use the function .rolling() to perform the following calculation for t = 0,1,2:
Select the rows from t to t+2
Take the 9 values contained in those 3 rows, from all the columns. Call this set S
Compute the 75th percentile of S (or other summary statistics about S)
For instance, for t = 1 we have
S = { 2.2 , -2.16, -0.13, 0.97, -0.68, 1.09, -0.78, -1.22, 0.96 } and the 75th percentile is 0.97.
I couldn't find a way to make it work with .rolling(), since it apparently takes each column separately. I'm now relying on a for loop, but it is really slow.
Do you have any suggestion for a more efficient approach?
One solution is to stack the data, then multiply your window size by the number of columns and slice the result by the number of columns. Also, since you want a forward-looking window, reverse the order of the stacked DataFrame:
wsize = 3
cols = len(df.columns)
df.stack(dropna=False)[::-1].rolling(window=wsize*cols).quantile(0.75)[cols-1::cols].reset_index(-1, drop=True).sort_index()
Output:
0 1.12
1 0.97
2 0.97
3 NaN
4 NaN
dtype: float64
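For readability, the same chain can be broken into steps (equivalent, assuming wsize and cols as defined above):
stacked = df.stack(dropna=False)[::-1]                        # flatten row-major, then reverse for a forward-looking window
rolled = stacked.rolling(window=wsize * cols).quantile(0.75)  # rolling 75th percentile over 9 stacked values
result = rolled[cols - 1::cols].reset_index(-1, drop=True).sort_index()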
In the case of many columns and a small window:
import pandas as pd
import numpy as np

wsize = 3
df2 = pd.concat([df.shift(-x) for x in range(wsize)], axis=1)
s_quant = df2.quantile(0.75, axis=1)
# Only necessary if you need to enforce sufficient data.
s_quant[df2.isnull().any(axis=1)] = np.nan
Output: s_quant
0 1.12
1 0.97
2 0.97
3 NaN
4 NaN
Name: 0.75, dtype: float64
You can use numpy's ravel. You may still have to use for loops:
for i in range(0, 3):
    print(df.iloc[i:i+3].values.ravel())
If your t steps in 3s (i.e. the windows don't overlap), you can use numpy's reshape function to create an n×9 DataFrame, for example:
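A minimal sketch of that non-overlapping reshape idea (it assumes the number of rows is a multiple of the window size, so trim any leftover rows first):
import pandas as pd

wsize = 3
n_rows = (len(df) // wsize) * wsize                    # drop leftover rows that don't fill a window
windows = pd.DataFrame(df.iloc[:n_rows].to_numpy().reshape(-1, wsize * df.shape[1]))
q75 = windows.quantile(0.75, axis=1)                   # one value per non-overlapping 3-row block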

Pandas Rank: unexpected behavior for method = 'dense' and pct = True

Suppose I have a series with duplicates:
import pandas as pd
ts = pd.Series([1,2,3,4] * 5)
and I want to calculate percentile ranks of it.
It is always a bit tricky to calculate ranks with multiple matches, but I think I am getting unexpected results:
ts.rank(method = 'dense', pct = True)
Out[112]:
0 0.05
1 0.10
2 0.15
3 0.20
4 0.05
5 0.10
6 0.15
7 0.20
8 0.05
9 0.10
10 0.15
11 0.20
12 0.05
13 0.10
14 0.15
15 0.20
16 0.05
17 0.10
18 0.15
19 0.20
dtype: float64
So I am getting as percentiles [0.05, 0.1, 0.15, 0.2], where I guess the expected output might be [0.25, 0.5, 0.75, 1], i.e. multiplying the output by the number of repeated values.
My guess here is that, in order to calculate percentile ranks, pd.rank is simply dividing by the number of observations, which is wrong for method = 'dense'.
So my questions are:
Do you agree the output is unexpected/wrong?
How can I obtain my expected output, i.e. assign to each duplicate the percentile rank I would get if I didn't have any duplicates in the series?
I have reported the issue on GitHub: https://github.com/pandas-dev/pandas/pull/15639
All that pct=True does is divide by the number of observations, which gives unexpected behavior for method='dense'; this is considered a bug, to be fixed in the next major release.
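In the meantime, a simple workaround sketch for the expected output is to divide the dense ranks by the number of distinct values (the maximum dense rank) instead of the number of observations:
import pandas as pd

ts = pd.Series([1, 2, 3, 4] * 5)
dense = ts.rank(method='dense')
pct_dense = dense / dense.max()   # 0.25, 0.50, 0.75, 1.00 for this series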
