I have a dataframe with counts in one column, and I would like to assign several cumulative sums of this column at once. I tried the code below, but unfortunately every new column ends up containing the last cumulative sum.
import pandas as pd

d = pd.DataFrame({'counts': [242, 99, 2, 13, 0]})
kwargs = {f"cumulative_{i}" : lambda x: x['counts'].shift(1).rolling(i).sum() for i in range(1,4)}
d.assign(**kwargs)
This is what it gives me:
counts cumulative_1 cumulative_2 cumulative_3
0 242 NaN NaN NaN
1 99 NaN NaN NaN
2 2 NaN NaN NaN
3 13 343.0 343.0 343.0
4 0 114.0 114.0 114.0
but I would like to get this:
counts cumulative_1 cumulative_2 cumulative_3
0 242 NaN NaN NaN
1 99 242.0 NaN NaN
2 2 99.0 341.0 NaN
3 13 2.0 101.0 343.0
4 0 13.0 15.0 114.0
What can I change to get the above?
The variable i used in the lambda is not captured at definition time; it is a free variable that is looked up in the enclosing scope only when the lambda is called. By then the comprehension has finished, so i always evaluates to 3, its last value. To capture i at definition time, you can define a wrapper function that takes i as an argument on each iteration of the loop and returns a lambda that reads the correct i from its enclosing environment:
def roll_i(i):
    # each call creates a new closure that captures this particular value of i
    return lambda x: x['counts'].shift(1).rolling(i).sum()
kwargs = {f"cumulative_{i}" : roll_i(i) for i in range(1,4)}
d.assign(**kwargs)
counts cumulative_1 cumulative_2 cumulative_3
0 242 NaN NaN NaN
1 99 242.0 NaN NaN
2 2 99.0 341.0 NaN
3 13 2.0 101.0 343.0
4 0 13.0 15.0 114.0
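As an aside (not part of the answer above, just a common alternative sketch with the same d): binding i as a default argument of the lambda also fixes its value at definition time:
# i=i evaluates the current value of i when each lambda is defined
kwargs = {f"cumulative_{i}": (lambda x, i=i: x['counts'].shift(1).rolling(i).sum())
          for i in range(1, 4)}
d.assign(**kwargs)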
I have a dataframe, df, that I exported using groupby() and count(). When I used groupby() to count the total number for each category, it filtered out every category whose count is 0. How can I get the outcome I want in Python?
Original dataframe:
Cat UR3 VR1 VR VR3
0 ATIDS 137.0 99.0 40.0 84.0
1 BasicCrane 2.0 8.0 3.0 1.0
2 Beam Sensor 27.0 12.0 13.0 14.0
3 CLPS 1.0 NaN NaN 1.0
However, I would like a dataframe that also includes all of the required categories.
(required categories: ATIDS, BasicCrane, LLP, Beam Sensor, CLPS, SPR)
Expected dataframe (the count for 'LLP' and 'SPR' is 0):
Cat UR3 VR1 VR VR3
0 ATIDS 137.0 99.0 40.0 84.0
1 BasicCrane 2.0 8.0 3.0 1.0
2 LLP NaN NaN NaN NaN
3 Beam Sensor 27.0 12.0 13.0 14.0
4 CLPS 1.0 NaN NaN 1.0
5 SPR NaN NaN NaN NaN
>>> categories
['ATIDS', 'BasicCrane', 'LLP', 'Beam Sensor', 'CLPS', 'SPR']
>>> pd.merge(pd.DataFrame({'Cat': categories}), df, how='outer')
Cat UR3 VR1 VR VR3
0 ATIDS 137.0 99.0 40.0 84.0
1 BasicCrane 2.0 8.0 3.0 1.0
2 LLP NaN NaN NaN NaN
3 Beam Sensor 27.0 12.0 13.0 14.0
4 CLPS 1.0 NaN NaN 1.0
5 SPR NaN NaN NaN NaN
One easy option is to fill the NaN values with 0 before doing the groupby; all of the zero data (previously NaN values) will then be counted as zero.
df.fillna(0)
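A minimal sketch combining both ideas (reusing the categories list shown above): merge in the required categories first, then show the missing counts as 0 instead of NaN:
categories = ['ATIDS', 'BasicCrane', 'LLP', 'Beam Sensor', 'CLPS', 'SPR']
# outer-merge the full category list, then fill the missing counts with 0
out = pd.merge(pd.DataFrame({'Cat': categories}), df, how='outer').fillna(0)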
I have one DataFrame, df, with the four columns shown below:
IDP1 IDP1Number IDP2 IDP2Number
1 100 1 NaN
3 110 2 150
5 120 3 NaN
7 140 4 160
9 150 5 190
NaN NaN 6 130
NaN NaN 7 NaN
NaN NaN 8 200
NaN NaN 9 90
NaN NaN 10 NaN
I want to map values from df.IDP1Number into IDP2Number, matching IDP1 to IDP2: if a value in IDP2 also exists in IDP1, replace the existing IDP2Number with the corresponding IDP1Number; otherwise leave IDP2Number alone.
The error message from my attempt reads: "Reindexing only valid with uniquely valued Index objects"
The DataFrame below is what I wish to have:
IDP1 IDP1Number IDP2 IDP2Number
1 100 1 100
3 110 2 150
5 120 3 110
7 140 4 160
9 150 5 120
NaN NaN 6 130
NaN NaN 7 140
NaN NaN 8 200
NaN NaN 9 150
NaN NaN 10 NaN
Here's a way to do it:
# filter the data and create a mapping dict
maps = df.query("IDP1.notna()")[['IDP1', 'IDP1Number']].set_index('IDP1')['IDP1Number'].to_dict()
# create the new column row by row with an if/else condition
df['IDP2Number'] = df.apply(lambda x: maps.get(x['IDP2'], None) if (pd.isna(x['IDP2Number']) or x['IDP2'] in maps) else x['IDP2Number'], axis=1)
print(df)
IDP1 IDP1Number IDP2 IDP2Number
0 1.0 100.0 1 100.0
1 3.0 110.0 2 150.0
2 5.0 120.0 3 110.0
3 7.0 140.0 4 160.0
4 9.0 150.0 5 120.0
5 NaN NaN 6 130.0
6 NaN NaN 7 140.0
7 NaN NaN 8 200.0
8 NaN NaN 9 150.0
9 NaN NaN 10 NaN
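For what it's worth, a more vectorized sketch (not from the answer above; it assumes the same column names and that the non-null IDP1 values are unique) builds the mapping as a Series and uses Series.map with a fallback to the existing values:
# mapping from IDP1 -> IDP1Number, dropping rows where IDP1 is NaN
mapping = df.dropna(subset=['IDP1']).set_index('IDP1')['IDP1Number']
# look up IDP2 in the mapping; where there is no match, keep the current IDP2Number
df['IDP2Number'] = df['IDP2'].map(mapping).fillna(df['IDP2Number'])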
Suppose I have the following dataframe.
A B
0 NaN 12
1 NaN NaN
2 24 NaN
3 NaN NaN
4 NaN 13
5 NaN 11
6 NaN 13
7 18 NaN
8 19 NaN
9 17 NaN
In column 'A', the missing values need to be replaced with the mean of, say, the 3 nearest non-empty values in a sequence, if they exist. For example, the NaN at index 5 has 18 as its nearest non-empty value, and after 18 the next two values are also non-empty, so the NaN at index 5 is replaced with (18+19+17)/3.
The NaN at index 4 has 24 as its nearest non-empty value, but the two values before 24 are empty, so the NaN at index 4 is not replaced with any value.
Similarly, it needs to be done for the rest of the columns. Does anyone know a vectorized way of doing this?
Thanks!
I believe you need to combine a forward rolling mean with another rolling mean taken from the back, then use DataFrame.interpolate to spread the nearest means over the remaining NaNs, forward filling the last groups of NaNs and back filling the first ones. This produces a helper DataFrame c, which is then used to fill the missing values of the original DataFrame:
# rolling mean of 3 values in the forward direction
a = df.rolling(3).mean()
# the same rolling mean computed from the back (rows reversed)
b = df.iloc[::-1].rolling(3).mean()
# combine both, fall back to the original values, then spread the means
# to the remaining NaNs with nearest interpolation plus ffill/bfill at the edges
c = a.fillna(b).fillna(df).interpolate(method='nearest').ffill().bfill()
print (c)
A B
0 24.0 12.000000
1 24.0 12.000000
2 24.0 12.000000
3 24.0 12.333333
4 24.0 12.333333
5 18.0 11.000000
6 18.0 12.333333
7 18.0 12.333333
8 19.0 12.333333
9 18.0 12.333333
df = df.fillna(c)
print (df)
A B
0 24.0 12.000000
1 24.0 12.000000
2 24.0 12.000000
3 24.0 12.333333
4 24.0 13.000000
5 18.0 11.000000
6 18.0 13.000000
7 18.0 12.333333
8 19.0 12.333333
9 17.0 12.333333
Editing my original post to hopefully simplify my question... I'm merging multiple DataFrames into one, SomeData.DataFrame, which gives me the following:
Key 2019-02-17 2019-02-24_x 2019-02-24_y 2019-03-03
0 A 80 NaN NaN 80
1 B NaN NaN 45 36
2 C 44 NaN 39 NaN
3 D 80 NaN NaN 12
4 E 49 2 NaN NaN
What I'm trying to do now is efficiently merge the columns ending in "_x" and "_y" while keeping everything else in place so that I get:
Key 2019-02-17 2019-02-24 2019-03-03
0 A 80 NaN 80
1 B NaN 45 36
2 C 44 39 NaN
3 D 80 NaN 12
4 E 49 2 NaN
The other issue I'm trying to account for is that the data contained in SomeData.DataFrame changes weekly, so my column headers are unpredictable. Meaning, some weeks I may not have the above issue at all, and other weeks there may be multiple instances, for example:
Key 2019-02-17 2019-02-24_x 2019-02-24_y 2019-03-10_x 2019-03-10_y
0 A 80 NaN NaN 80 NaN
1 B NaN NaN 45 36 NaN
2 C 44 NaN 39 NaN 12
3 D 80 NaN NaN 12 NaN
4 E 49 2 NaN NaN 17
So that again the desired result would be:
Key 2019-02-17 2019-02-24 2019-03-10
0 A 80 NaN 80
1 B NaN 45 36
2 C 44 39 12
3 D 80 NaN 12
4 E 49 2 17
Is what I'm asking reasonable or am I venturing outside the bounds of Pandas' limits? I can't find anyone trying to do anything similar so I'm not sure anymore. Thank you in advance!
Edited answer to updated question:
df = df.set_index('Key')
df.groupby(df.columns.str.split('_').str[0], axis=1).sum()
Output:
2019-02-17 2019-02-24 2019-03-03
Key
A 80.0 0.0 80.0
B 0.0 45.0 36.0
C 44.0 39.0 0.0
D 80.0 0.0 12.0
E 49.0 2.0 0.0
Output for the second dataframe:
df.groupby(df.columns.str.split('_').str[0], axis=1).sum()
Output:
2019-02-17 2019-02-24 2019-03-10
Key
A 80.0 0.0 80.0
B 0.0 45.0 36.0
C 44.0 39.0 12.0
D 80.0 0.0 12.0
E 49.0 2.0 17.0
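A side note, as a hedged sketch (assuming pandas 0.22+ where sum takes a min_count argument): groupby(...).sum() turns groups whose source cells are all NaN into 0, which is why the outputs above show 0.0 rather than NaN. To keep NaN in that case:
# keep NaN where every column being collapsed (e.g. both "_x" and "_y") was NaN
df.groupby(df.columns.str.split('_').str[0], axis=1).sum(min_count=1)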
You could try something like this:
df_t = df.T
df_t.set_index(df_t.groupby(level=0).cumcount(), append=True)\
.unstack().T\
.sort_values(df.columns[0])[df.columns.unique()]\
.reset_index(drop=True)
Output:
val03-20 03-20 val03-24 03-24
0 a 1 d 5
1 b 6 e 7
2 c 4 f 10
3 NaN NaN g 5
4 NaN NaN h 6
5 NaN NaN i 1
import pandas as pd

x = pd.DataFrame(index=pd.date_range(start="2017-1-1", end="2017-1-13"),
                 columns="a b c".split())
x.loc[x.index[0:2], "a"] = 1
x.loc[x.index[5:10], "a"] = 1
x.loc[x.index[9:12], "b"] = 1
x.loc[x.index[1:3], "c"] = 1
x.loc[x.index[5], "c"] = 1
a b c
2017-01-01 1 NaN NaN
2017-01-02 1 NaN 1
2017-01-03 NaN NaN 1
2017-01-04 NaN NaN NaN
2017-01-05 NaN NaN NaN
2017-01-06 1 NaN 1
2017-01-07 1 NaN NaN
2017-01-08 1 NaN NaN
2017-01-09 1 NaN NaN
2017-01-10 1 1 NaN
2017-01-11 NaN 1 NaN
2017-01-12 NaN 1 NaN
2017-01-13 NaN NaN NaN
Given the above dataframe, x, I want to return the average number of occurrences of 1s within each group of consecutive 1s in a, b, and c. The average for each column is taken over the number of blocks that contain consecutive 1s.
For example, column a will output the average of 2 and 5, which is 3.5. We divide by 2 because there are 2 consecutive 1s between Jan-01 and Jan-02, then 5 consecutive 1s between Jan-06 and Jan-10, so 2 blocks of 1s in total. Similarly, for column b, we will have 3 because there is only one consecutive sequence of 1s, between Jan-10 and Jan-12. Finally, for column c, we will have the average of 2 and 1, which is 1.5.
Expected output of the toy example:
a b c
3.5 3 1.5
Use mask + apply with value_counts, and finally, find the mean of your counts -
x.eq(1)\
.ne(x.eq(1).shift())\
.cumsum(0)\
.mask(x.ne(1))\
.apply(pd.Series.value_counts)\
.mean(0)
a 3.5
b 3.0
c 1.5
dtype: float64
Details
First, assign a group number to each run of consecutive values in your dataframe (the number increments every time a cell switches between 1 and not-1) -
i = x.eq(1).ne(x.eq(1).shift()).cumsum(0)
i
a b c
2017-01-01 1 1 1
2017-01-02 1 1 2
2017-01-03 2 1 2
2017-01-04 2 1 3
2017-01-05 2 1 3
2017-01-06 3 1 4
2017-01-07 3 1 5
2017-01-08 3 1 5
2017-01-09 3 1 5
2017-01-10 3 2 5
2017-01-11 4 2 5
2017-01-12 4 2 5
2017-01-13 4 3 5
Now, keep only those group values whose cells were originally 1 in x -
j = i.mask(x.ne(1))
j
a b c
2017-01-01 1.0 NaN NaN
2017-01-02 1.0 NaN 2.0
2017-01-03 NaN NaN 2.0
2017-01-04 NaN NaN NaN
2017-01-05 NaN NaN NaN
2017-01-06 3.0 NaN 4.0
2017-01-07 3.0 NaN NaN
2017-01-08 3.0 NaN NaN
2017-01-09 3.0 NaN NaN
2017-01-10 3.0 2.0 NaN
2017-01-11 NaN 2.0 NaN
2017-01-12 NaN 2.0 NaN
2017-01-13 NaN NaN NaN
Now, apply value_counts across each column -
k = j.apply(pd.Series.value_counts)
k
a b c
1.0 2.0 NaN NaN
2.0 NaN 3.0 2.0
3.0 5.0 NaN NaN
4.0 NaN NaN 1.0
And just find the column-wise mean -
k.mean(0)
a 3.5
b 3.0
c 1.5
dtype: float64
As a handy note, if you want to, for example, find the mean counts only for more than n consecutive 1s (say, n = 1 here), then you can filter on k's index quite easily -
k[k.index > 1].mean(0)
a 5.0
b 3.0
c 1.5
dtype: float64
Let's try:
x.apply(lambda s: s.groupby(s.ne(1).cumsum()).sum(min_count=1).mean())
Output:
a 3.5
b 3.0
c 1.5
dtype: float64
Apply the lambda function to each column of the dataframe. Inside the lambda, the cumulative sum of the "not equal to 1" flags gives every run of consecutive 1s the same group label, so sum() returns the length of each run (min_count=1 keeps all-NaN groups as NaN instead of 0), and mean() then averages those run lengths.
This utilizes cumsum, shift, and an xor mask.
b = x.cumsum()
c = b.shift(-1)
b_masked = b[b.isnull() ^ c.isnull()]
b_masked.max() / b_masked.count()
a 3.5
b 3.0
c 1.5
dtype: float64
First do b = x.cumsum()
a b c
0 1.0 NaN NaN
1 2.0 NaN 1.0
2 NaN NaN 2.0
3 NaN NaN NaN
4 NaN NaN NaN
5 3.0 NaN 3.0
6 4.0 NaN NaN
7 5.0 NaN NaN
8 6.0 NaN NaN
9 7.0 1.0 NaN
10 NaN 2.0 NaN
11 NaN 3.0 NaN
12 NaN NaN NaN
Then shift b upward: c = b.shift(-1). Next, we create an XOR mask with b.isnull() ^ c.isnull(). This mask keeps only one value per run of consecutive 1s. Note that it can also produce an extra True where b is NaN (for example just before a run starts), but since we index back into b, which is NaN there, it does not generate new elements. We use a small example to illustrate:
b c b.isnull() ^ c.isnull() b[b.isnull() ^ c.isnull()]
NaN 1 True NaN
1 2 False NaN
2 NaN True 2
NaN NaN False NaN
The full b[b.isnull() ^ c.isnull()] for our dataframe looks like
a b c
0 NaN NaN NaN
1 2.0 NaN NaN
2 NaN NaN 2.0
3 NaN NaN NaN
4 NaN NaN NaN
5 NaN NaN 3.0
6 NaN NaN NaN
7 NaN NaN NaN
8 NaN NaN NaN
9 7.0 NaN NaN
10 NaN NaN NaN
11 NaN 3.0 NaN
12 NaN NaN NaN
Because we did cumsum in the first place, we only need the maximum and the number of non-NaN values in each column to calculate the mean.
Thus, we do b[b.isnull() ^ c.isnull()].max() / b[b.isnull() ^ c.isnull()].count()
You could use a regex:
import re
import numpy as np

p = r'1+'
counts = {
    c: np.mean(
        [len(m) for m in re.findall(p, ''.join(map(str, x[c].values)))]
    )
    for c in ['a', 'b', 'c']
}
This method works because each column here can be thought of as a string over the alphabet {1, nan}. 1+ matches all groups of adjacent 1s, and re.findall returns them as a list of strings; we then take the mean of the lengths of those strings.
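As a small usage note (assuming the counts dict built above), the result can be wrapped in a Series to compare it with the other answers:
pd.Series(counts)
# a    3.5
# b    3.0
# c    1.5
# dtype: float64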