Suppose I have a series with duplicates:
import pandas as pd
ts = pd.Series([1,2,3,4] * 5)
and I want to calculate percentile ranks of it.
It is always a bit tricky to calculate ranks with multiple matches, but I think I am getting unexpected results:
ts.rank(method = 'dense', pct = True)
Out[112]:
0 0.05
1 0.10
2 0.15
3 0.20
4 0.05
5 0.10
6 0.15
7 0.20
8 0.05
9 0.10
10 0.15
11 0.20
12 0.05
13 0.10
14 0.15
15 0.20
16 0.05
17 0.10
18 0.15
19 0.20
dtype: float64
So I am getting as percentiles [0.05, 0.1, 0.15, 0.2], where I guess the expected output might be [0.25, 0.5, 0.75, 1], i.e. multiplying the output by the number of repeated values.
My guess here is that, in order to calculate percentile ranks, pd.rank is simply dividing by the number of observations, which is wrong for method = 'dense'.
So my questions are:
Do you agree the output is unexpected/wrong?
How can I obtain my expected output, i.e. assign to each duplicate the percentile rank I would get if I didn't have any duplicates in the series?
I have reported the issue on GitHub: https://github.com/pandas-dev/pandas/pull/15639
All that pct=True does is divide by the number of observations, which gives unexpected behavior for method='dense', so this is considered a bug to be fixed in the next major release.
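In the meantime, a minimal workaround sketch (plain pandas, using only the ts defined above) is to divide the dense ranks by the number of distinct values instead of the number of observations:
# dense ranks are 1..n_unique, so dividing by nunique() yields 0.25, 0.50, 0.75, 1.00
ts.rank(method='dense') / ts.nunique()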
I have a simple DataFrame like the following:
n_obs = 3
dd = pd.DataFrame({
    'WTL_exploded': [0, 1, 2] * n_obs,
    'hazard': [0.3, 0.4, 0.5, 0.2, 0.8, 0.9, 0.6, 0.6, 0.65],
}, index=[1, 1, 1, 2, 2, 2, 3, 3, 3])
dd
I want to group by the index and get the cumulative product of the hazard column. However, I want to multiply all but the last element of each group.
Desired output:
index  hazard
1        0.30
1        0.12
2        0.20
2        0.16
3        0.60
3        0.36
How can I do that?
You can use:
out = dd.groupby(level=0, group_keys=False).apply(lambda x: x.cumprod().iloc[:-1])
Or:
out = dd.groupby(level=0).apply(lambda x: x.cumprod().iloc[:-1]).droplevel(1)
Output:
   WTL_exploded  hazard
1             0    0.30
1             0    0.12
2             0    0.20
2             0    0.16
3             0    0.60
3             0    0.36
NB. you can also use lambda x: x.cumprod().head(-1).
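If you'd rather avoid apply entirely, a possible sketch (assuming the group labels live in the index, as in dd above) is to take the group-wise cumprod and drop the last row of each group with a reverse cumcount mask:
# True for every row except the last one of its group
mask = dd.groupby(level=0).cumcount(ascending=False) > 0
# cumprod within each group (both columns, as in the output above), then keep only the masked rows
out = dd.groupby(level=0).cumprod()[mask.to_numpy()]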
The solution I found is a bit intricate but works for the test case.
First, get rid of the last row of each group:
ff = dd.groupby(lambda x:x, as_index=False).apply(lambda x: x.iloc[:-1])
ff
Then, restore the original index, group-by again and use pandas cumprod:
ff.reset_index().set_index('level_1').groupby(lambda x:x).cumprod()
Is there a more direct way?
I'm trying to sample a DataFrame based on a given Minimum Sample Interval on the "timestamp" column: starting from the first row, the next extracted row is the first one whose timestamp is at least Minimum Sample Interval larger than the timestamp of the last extracted row. So, for the table given below and Minimum Sample Interval = 0.2:
          A  timestamp
1  0.000000       0.10
2  3.162278       0.15
3  7.211103       0.45
4  7.071068       0.55
Here, we would extract indexes:
1, no last value yet so why not
Not 2, because it is only 0.05 larger than last value
3, because it is 0.35 larger than last value
Not 4, because it is only 0.1 larger than last value.
I've found a way to do this with iterrows, but I would like to avoid iterating over it if possible.
The closest I can think of is integer-dividing the timestamp column by the interval with floordiv and finding the rows where the resulting bin changes. But for a case like [0.01, 0.21, 0.55, 0.61, 0.75, 0.41], I would be selecting 0.61 (which is only 0.06 larger than 0.55, not at least 0.2) instead of 0.75.
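For reference, a minimal sketch of the iterrows approach mentioned above (assuming the frame is called df, the column is 'timestamp', and min_interval = 0.2):
min_interval = 0.2
kept, last = [], None
for idx, row in df.iterrows():
    # keep the row if nothing has been kept yet, or if it is at least min_interval later than the last kept one
    if last is None or row['timestamp'] - last >= min_interval:
        kept.append(idx)
        last = row['timestamp']
sample = df.loc[kept]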
You can use pandas.Series.diff to compute the difference between each value and the previous one:
sample = df[df['timestamp'].diff().fillna(1) > 0.2]
Output:
>>> sample
A timestamp
1 0.000000 0.10
3 7.211103 0.45
I have a DataFrame where one column is a numpy array of numbers. For example,
import numpy as np
import pandas as pd
df = pd.DataFrame.from_dict({
    'id': [1, 1, 2, 2, 3, 3, 3, 4, 4],
    'data': [np.array([0.43, 0.32, 0.19]),
             np.array([0.41, 0.11, 0.21]),
             np.array([0.94, 0.35, 0.14]),
             np.array([0.78, 0.92, 0.45]),
             np.array([0.32, 0.63, 0.48]),
             np.array([0.17, 0.12, 0.15]),
             np.array([0.54, 0.12, 0.16]),
             np.array([0.48, 0.16, 0.19]),
             np.array([0.14, 0.47, 0.01])]
})
I want to groupby the id column and aggregate by taking the element-wise average of the array. Splitting the array up first is not feasible since it is length 300 and I have 200,000+ rows. When I do df.groupby('id').mean(), I get the error "No numeric types to aggregate". I am able to get an element-wise mean of the lists using df['data'].mean(), so I think there should be a way to do a grouped mean. To clarify, I want the output to be an array for each value of ID. Each element in the resulting array should be the mean of the values of the elements in the corresponding position within each group. In the example, the result should be:
pd.DataFrame.from_dict({
    'id': [1, 2, 3, 4],
    'data': [np.array([0.42, 0.215, 0.2]),
             np.array([0.86, 0.635, 0.29500000000000004]),
             np.array([0.3433333333333333, 0.29, 0.26333333333333336]),
             np.array([0.31, 0.315, 0.1])]
})
Could someone suggest how I might do this? Thanks!
Mean it twice, once at array level and once at group level:
df['data'].map(np.mean).groupby(df['id']).mean().reset_index()
id data
0 1 0.278333
1 2 0.596667
2 3 0.298889
3 4 0.241667
Based on the comment, you can do:
pd.DataFrame(df['data'].tolist(), index=df['id']).mean(level=0).agg(np.array, 1)
id
1 [0.42, 0.215, 0.2]
2 [0.86, 0.635, 0.29500000000000004]
3 [0.3433333333333333, 0.29, 0.26333333333333336]
4 [0.31, 0.315, 0.1]
dtype: object
Or:
df.groupby("id")['data'].apply(np.mean)
First, splitting up the array is feasible: your current storage requires keeping a complex object (one array per cell) inside the DataFrame, which takes a lot more space than simply storing a flat 2D array.
# Your current memory usage
df.memory_usage(deep=True).sum()
1352
# Create a new DataFrame (really just overwrite `df` but keep separate for illustration)
df1 = pd.concat([df['id'], pd.DataFrame(df['data'].tolist())], axis=1)
# id 0 1 2
#0 1 0.43 0.32 0.19
#1 1 0.41 0.11 0.21
#2 2 0.94 0.35 0.14
#3 2 0.78 0.92 0.45
#4 3 0.32 0.63 0.48
#5 3 0.17 0.12 0.15
#6 3 0.54 0.12 0.16
#7 4 0.48 0.16 0.19
#8 4 0.14 0.47 0.01
Yes, this looks bigger, but in terms of memory it is actually smaller. The 3x factor here is a bit extreme; for larger DataFrames with long arrays the flat version will probably be more like 95% of the original memory, but it still has to be less.
df1.memory_usage(deep=True).sum()
#416
And now your aggregation is a normal groupby + mean; the column labels give the position within the array:
df1.groupby('id').mean()
# 0 1 2
#id
#1 0.420000 0.215 0.200000
#2 0.860000 0.635 0.295000
#3 0.343333 0.290 0.263333
#4 0.310000 0.315 0.100000
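If you want the result back as a single array per id, one possible follow-up (a sketch reusing the .agg(np.array, axis=1) trick shown in the other answer above):
# collapse the per-position columns back into one array per id
df1.groupby('id').mean().agg(np.array, axis=1)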
Group by and take the mean, where the output for each group is the array of element-wise mean values:
df['data'].map(np.array).groupby(df['id']).mean().reset_index()
Output:
id data
0 1 [0.42, 0.215, 0.2]
1 2 [0.86, 0.635, 0.29500000000000004]
2 3 [0.3433333333333333, 0.29, 0.26333333333333336]
3 4 [0.31, 0.315, 0.1]
You can always .apply the numpy mean.
df.groupby('id')['data'].apply(np.mean).apply(np.mean)
# returns:
id
1 0.278333
2 0.596667
3 0.298889
4 0.241667
Name: data, dtype: float64
Given the DataFrame:
id articleno target
0 1 [607303] 607295
1 1 [607295] 607303
2 2 [243404, 617953] 590448
3 2 [590448, 617953] 243404
For each row, compute the average article embedding by looking up each item of the list in the dictionary:
embeddings = {"607303": np.array([0.19, 0.25, 0.45])
,"607295": np.array([0.77, 0.76, 0.55])
,"243404": np.array([0.35, 0.44, 0.32])
,"617953": np.array([0.23, 0.78, 0.24])
,"590448": np.array([0.67, 0.12, 0.10])}
So for example, and to clarify, for the third row (index 2), the article embeddings for 243404 and 617953 are [0.35, 0.44, 0.32] and [0.23, 0.78, 0.24], respectively. The average article embedding is computed as the element-wise sum of all embeddings, divided by the number of articles, so: ([0.35, 0.44, 0.32]+[0.23, 0.78, 0.24])/2=[0.29, 0.61, 0.28].
Expected output:
id dim1 dim2 dim3 target
0 1 0.19 0.25 0.45 607295
1 1 0.77 0.76 0.55 607303
2 2 0.29 0.61 0.28 590448
3 2 0.45 0.45 0.17 243404
In reality, my DataFrame has millions of rows, and the lists in articleno can contain many more items. Because of this, iterating over the rows might be too slow, and a more efficient solution (perhaps vectorized) could be needed.
Moreover, the number of dimensions (embedding size) is known beforehand but is a couple of hundred, so the number of columns, dim1, dim2, dim3, ..., dimN, should be dynamic, based on the dimensionality of the embedding (N).
In the previous question, you went the extra mile to separate the elements of the articleno list and then remove the target from it. Now, if you want to access the elements inside the articleno list, you need to go the extra mile again to separate them.
To illustrate what I mean, here's an approach that generates the output of both questions while adding minimal extra code:
# construct the embeddings dataframe:
embedding_df = pd.DataFrame(embeddings).T.add_prefix('dim')
# aggregation dictionary
agg_dict = {'countrycode': 'first', 'articleno': list}
# take the mean over the embedding columns
for i in embedding_df.columns:
    agg_dict[i] = 'mean'
new_df = df.explode('articleno')
(new_df.join(new_df['articleno'].rename('target'))
.query('articleno != target')
.merge(embedding_df, left_on='articleno', right_index=True) # this line is extra from the previous question
.groupby(['id','target'], as_index=False)
.agg(agg_dict)
)
Output:
id target countrycode articleno dim0 dim1 dim2
0 2 590448 US [617953, 617953] 0.23 0.78 0.24
1 2 617953 US [590448, 590448] 0.67 0.12 0.10
Now, if you don't care about the articleno column in the final output, you can simplify your code even further, while lowering memory/runtime, like this:
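# note: `g` is not defined in this snippet; it is assumed to be a groupby of new_df
# (with the embedding columns merged in) carried over from the step above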
total_embeddings = g[embedding_df.columns].sum()
article_counts = g['id'].transform('size')
new_df[embedding_df.columns] = (total_embeddings.sub(new_df[embedding_df.columns])
.div(article_counts-1, axis=0)
)
and you would get the same output.
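For the exact layout asked for in this question (id, the dim columns, and the existing target), a minimal standalone sketch, assuming the article ids inside articleno are strings that match the keys of embeddings (cast one side first otherwise):
emb = pd.DataFrame(embeddings).T.add_prefix('dim')       # one row per article id
exploded = df.explode('articleno')                       # one row per (original row, article)
dims = exploded.merge(emb, left_on='articleno', right_index=True)
dims = dims.groupby(level=0)[list(emb.columns)].mean()   # element-wise average per original row
result = df[['id']].join(dims).join(df['target'])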
Is there a way to normalize the columns of a DataFrame using sklearn's normalize? I think that by default it normalizes rows.
For example, if I had df:
A B
1000 10
234 3
500 1.5
I would want to get the following:
A B
1 1
0.234 0.3
0.5 0.15
Why do you need sklearn?
Just use pandas:
>>> df / df.max()
A B
0 1.000 1.00
1 0.234 0.30
2 0.500 0.15
You can use div after getting the max:
df.div(df.max(), axis=1)
Out[456]:
A B
0 1.000 1.00
1 0.234 0.30
2 0.500 0.15
sklearn defaults to normalizing rows with the L2 norm. Both of these arguments (the axis and the norm) need to be changed for your desired normalization by the maximum value along the columns:
from sklearn import preprocessing
preprocessing.normalize(df, axis=0, norm='max')
#array([[1. , 1. ],
# [0.234, 0.3 ],
# [0.5 , 0.15 ]])
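Note that preprocessing.normalize returns a plain numpy array; if you want a labelled DataFrame back, a minimal sketch (assuming pandas is imported as pd and df is the frame above):
normalized = pd.DataFrame(preprocessing.normalize(df, axis=0, norm='max'),
                          columns=df.columns, index=df.index)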
From the documentation
axis : 0 or 1, optional (1 by default) axis used to normalize the data
along. If 1, independently normalize each sample, otherwise (if 0)
normalize each feature.
So just change the axis. Having said that, sklearn is overkill for this task; it can be achieved easily using pandas.