I want to group by the id column in this dataframe:
id a b c
0 1 1 6 2
1 1 2 5 2
2 2 3 4 2
3 2 4 3 2
4 3 5 2 2
5 3 6 1 2
and add, for each column, the differences between rows within the same group as additional columns, to end up with this dataframe:
id a b c a_diff b_diff c_diff
0 1 1 6 2 -1.0 1.0 0.0
1 1 2 5 2 1.0 -1.0 0.0
2 2 3 4 2 -1.0 1.0 0.0
3 2 4 3 2 1.0 -1.0 0.0
4 3 5 2 2 -1.0 1.0 0.0
5 3 6 1 2 1.0 -1.0 0.0
The data:
df = pd.DataFrame({'id': [1,1,2,2,3,3], 'a': [1,2,3,4,5,6],'b': [6,5,4,3,2,1], 'c': [2,2,2,2,2,2]})
Your desired output doesn't make much sense, but I can force it there with:
df[['a_diff', 'b_diff', 'c_diff']] = df.groupby('id').transform(lambda x: x.diff(1).fillna(x.diff(-1)))
Output:
id a b c a_diff b_diff c_diff
0 1 1 6 2 -1.0 1.0 0.0
1 1 2 5 2 1.0 -1.0 0.0
2 2 3 4 2 -1.0 1.0 0.0
3 2 4 3 2 1.0 -1.0 0.0
4 3 5 2 2 -1.0 1.0 0.0
5 3 6 1 2 1.0 -1.0 0.0
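To see what the lambda does, here is a minimal sketch on a single group, assuming the df defined above: diff(1) is the forward difference (NaN in the first row of the group) and fillna(x.diff(-1)) fills that NaN with the backward difference.
import pandas as pd

df = pd.DataFrame({'id': [1,1,2,2,3,3], 'a': [1,2,3,4,5,6], 'b': [6,5,4,3,2,1], 'c': [2,2,2,2,2,2]})
g = df.loc[df['id'] == 1, 'a']                 # the 'a' column of the id == 1 group
print(g.diff(1).tolist())                      # [nan, 1.0]   forward difference
print(g.diff(-1).tolist())                     # [-1.0, nan]  backward difference
print(g.diff(1).fillna(g.diff(-1)).tolist())   # [-1.0, 1.0]  combined, matching a_diff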
I want to find the cumulative count before there is a change in value, i.e. how many rows have passed since the last change. For illustration:
Value  diff  #row since last change (how do I create this column?)
6      na    na
5      -1    0
5      0     1
5      0     2
4      -1    0
4      0     1
4      0     2
4      0     3
4      0     4
5      1     0
5      0     1
5      0     2
5      0     3
6      1     0
7      1     0
I tried to use cumsum, but it does not reset after each change.
IIUC, use a cumcount per group:
df['new'] = df.groupby(df['Value'].ne(df['Value'].shift()).cumsum()).cumcount()
output:
Value diff new
0 6 na 0
1 5 -1 0
2 5 0 1
3 5 0 2
4 4 -1 0
5 4 0 1
6 4 0 2
7 4 0 3
8 4 0 4
9 5 1 0
10 5 0 1
11 5 0 2
12 5 0 3
13 6 1 0
14 7 1 0
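To see why this works, here is a small sketch assuming only the Value column from above: df['Value'].ne(df['Value'].shift()).cumsum() builds a run id that increases each time Value changes, so cumcount restarts at 0 within every run of equal values.
import pandas as pd

df = pd.DataFrame({'Value': [6, 5, 5, 5, 4, 4, 4, 4, 4, 5, 5, 5, 5, 6, 7]})
run_id = df['Value'].ne(df['Value'].shift()).cumsum()   # new id every time Value changes
print(run_id.tolist())                                  # [1, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 5, 6]
print(df.groupby(run_id).cumcount().tolist())           # [0, 0, 1, 2, 0, 1, 2, 3, 4, 0, 1, 2, 3, 0, 0]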
If you want the NaN based on diff, you can mask the output:
df['new'] = (df.groupby(df['Value'].ne(df['Value'].shift()).cumsum()).cumcount()
.mask(df['diff'].isna())
)
output:
Value diff new
0 6 NaN NaN
1 5 -1.0 0.0
2 5 0.0 1.0
3 5 0.0 2.0
4 4 -1.0 0.0
5 4 0.0 1.0
6 4 0.0 2.0
7 4 0.0 3.0
8 4 0.0 4.0
9 5 1.0 0.0
10 5 0.0 1.0
11 5 0.0 2.0
12 5 0.0 3.0
13 6 1.0 0.0
14 7 1.0 0.0
If performance is important, count consecutive 0 values from the difference column:
m = df['diff'].eq(0)
b = m.cumsum()
df['out'] = b.sub(b.mask(m).ffill().fillna(0)).astype(int)
print (df)
Value diff need out
0 6 NaN na 0
1 5 -1.0 0 0
2 5 0.0 1 1
3 5 0.0 2 2
4 4 -1.0 0 0
5 4 0.0 1 1
6 4 0.0 2 2
7 4 0.0 3 3
8 4 0.0 4 4
9 5 1.0 0 0
10 5 0.0 1 1
11 5 0.0 2 2
12 5 0.0 3 3
13 6 1.0 0 0
14 7 1.0 0 0
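Here is a sketch of the intermediate steps on a shortened diff series (pandas and numpy assumed imported): b counts zeros cumulatively, b.mask(m).ffill() freezes that count at the last non-zero row, and subtracting the two restarts the count after every change.
import numpy as np
import pandas as pd

diff = pd.Series([np.nan, -1, 0, 0, -1, 0, 0, 0])
m = diff.eq(0)                           # True where diff is 0
b = m.cumsum()                           # running count of zeros:        0 0 1 2 2 3 4 5
last = b.mask(m).ffill().fillna(0)       # count frozen at last non-zero: 0 0 0 0 2 2 2 2
print(b.sub(last).astype(int).tolist())  # [0, 0, 1, 2, 0, 1, 2, 3]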
The goal is to put the digit from the last row of the previous letter group into the new column "last_digit_prev_group". The expected, correct value was entered by me manually in the column "col_ok". I got stuck trying shift(); the effect was far from what I expected. Maybe there is some other way?
Forgive the inconsistency of my post; I'm not an IT specialist and English is not my first language. Thanks in advance for your support.
df = pd.read_csv('C:/Users/.../a.csv', names=['group_letter', 'digit', 'col_ok'], index_col=0)
df['last_digit_prev_group'] = df.groupby('group_letter')['digit'].shift(1)
print(df)
group_letter digit col_ok last_digit_prev_group
A 1 n NaN
A 3 n 1.0
A 2 n 3.0
A 5 n 2.0
A 1 n 5.0
B 1 1 NaN
B 2 1 1.0
B 1 1 2.0
B 1 1 1.0
B 3 1 1.0
C 5 3 NaN
C 6 3 5.0
C 1 3 6.0
C 2 3 1.0
C 3 3 2.0
D 4 3 NaN
D 3 3 4.0
D 2 3 3.0
D 5 3 2.0
D 7 3 5.0
Use Series.mask with DataFrame.duplicated to keep only the last value of digit per group, then Series.shift and finally ffill:
df['last_digit_prev_group'] = (df['digit'].mask(df.duplicated('group_letter', keep='last'))
                                          .shift()
                                          .ffill())
print (df)
group_letter digit col_ok last_digit_prev_group
0 A 1 n NaN
1 A 3 n NaN
2 A 2 n NaN
3 A 5 n NaN
4 A 1 n NaN
5 B 1 1 1.0
6 B 2 1 1.0
7 B 1 1 1.0
8 B 1 1 1.0
9 B 3 1 1.0
10 C 5 3 3.0
11 C 6 3 3.0
12 C 1 3 3.0
13 C 2 3 3.0
14 C 3 3 3.0
15 D 4 3 3.0
16 D 3 3 3.0
17 D 2 3 3.0
18 D 5 3 3.0
19 D 7 3 3.0
If it's possible that some last value is NaN, forward fill within groups instead:
df['last_digit_prev_group'] = (df['digit'].mask(df.duplicated('group_letter', keep='last'))
                                          .shift()
                                          .groupby(df['group_letter']).ffill())
print (df)
group_letter digit col_ok last_digit_prev_group
0 A 1.0 n NaN
1 A 3.0 n NaN
2 A 2.0 n NaN
3 A 5.0 n NaN
4 A 1.0 n NaN
5 B 1.0 1 1.0
6 B 2.0 1 1.0
7 B 1.0 1 1.0
8 B 1.0 1 1.0
9 B 3.0 1 1.0
10 C 5.0 3 3.0
11 C 6.0 3 3.0
12 C 1.0 3 3.0
13 C 2.0 3 3.0
14 C NaN 3 3.0
15 D 4.0 3 NaN
16 D 3.0 3 NaN
17 D 2.0 3 NaN
18 D 5.0 3 NaN
19 D 7.0 3 NaN
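An alternative sketch for the original frame, under the assumption that the group_letter blocks are contiguous and appear in order: take the last digit of each group, shift it by one group, and map it back onto the rows.
last_per_group = df.groupby('group_letter', sort=False)['digit'].last()        # last digit per letter group
df['last_digit_prev_group'] = df['group_letter'].map(last_per_group.shift())   # previous group's last digit
print(df)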
I cannot figure out a nice pandas-ish way to fill in missing NaN values for a left join by sampling from the right table.
e.g.
joined_left = left.merge(right, how="left", left_on=[attr1], right_on=[attr2])
from these left and right frames
0 1 2
0 1 1 1
1 2 2 2
2 3 3 3
3 9 9 9
4 1 3 2
0 1 2
0 1 2 2
1 1 2 3
2 3 2 2
3 3 2 9
4 3 2 2
produces something like:
0 1_x 2_x 1_y 2_y
0 1 1 1 2.0 2.0
1 1 1 1 2.0 3.0
2 2 2 2 NaN NaN
3 3 3 3 2.0 2.0
4 3 3 3 2.0 9.0
5 3 3 3 2.0 2.0
6 9 9 9 NaN NaN
7 1 3 2 2.0 2.0
8 1 3 2 2.0 3.0
How do I sample a row from the right table instead of filling NaNs?
This is what I have tried so far:
left = [[1,1,1], [2,2,2],[3,3,3], [9,9,9], [1,3,2]]
right = [[1,2,2],[1,2,3],[3,2,2], [3,2,9], [3,2,2]]
left = np.asarray(left)
right = np.asarray(right)
left = pd.DataFrame(left)
right = pd.DataFrame(right)
joined_left = left.merge(right, how="left", left_on=[0], right_on=[0])
while joined_left.isnull().values.any():
    right_sample = right.sample().drop(0, axis=1)
    joined_left.fillna(value=right_sample, limit=1)
print(joined_left)
Basically, sample randomly and use fillna() on the first occurrence of a NaN value to fill it in... but for some reason I get no output.
Thank you!
One possible output could be:
0 1_x 2_x 1_y 2_y
0 1 1 1 2.0 2.0
1 1 1 1 2.0 3.0
2 2 2 2 2.0 2.0
3 3 3 3 2.0 2.0
4 3 3 3 2.0 9.0
5 3 3 3 2.0 2.0
6 9 9 9 2.0 9.0
7 1 3 2 2.0 2.0
8 1 3 2 2.0 3.0
with sampled rows 3 2 2 and 3 2 9.
Using sample with fillna
joined_left = left.merge(right, how="left", left_on=[0], right_on=[0],indicator=True) # adding indicator
joined_left
Out[705]:
0 1_x 2_x 1_y 2_y _merge
0 1 1 1 2.0 2.0 both
1 1 1 1 2.0 3.0 both
2 2 2 2 NaN NaN left_only
3 3 3 3 2.0 2.0 both
4 3 3 3 2.0 9.0 both
5 3 3 3 2.0 2.0 both
6 9 9 9 NaN NaN left_only
7 1 3 2 2.0 2.0 both
8 1 3 2 2.0 3.0 both
nnull = joined_left['_merge'].eq('left_only').sum()  # count how many rows failed to match in the merged df
s = right.sample(nnull)  # sample that many rows from the right dataframe
s.index = joined_left.index[joined_left['_merge'].eq('left_only')]  # align the sampled rows with the positions of the NaN rows
joined_left.fillna(s.rename(columns={1:'1_y',2:'2_y'}))
Out[706]:
0 1_x 2_x 1_y 2_y _merge
0 1 1 1 2.0 2.0 both
1 1 1 1 2.0 3.0 both
2 2 2 2 2.0 2.0 left_only
3 3 3 3 2.0 2.0 both
4 3 3 3 2.0 9.0 both
5 3 3 3 2.0 2.0 both
6 9 9 9 2.0 3.0 left_only
7 1 3 2 2.0 2.0 both
8 1 3 2 2.0 3.0 both
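A condensed sketch of the same idea; the 1_y/2_y column names come from the merge above, and replace=True is an added assumption so it also works when there are more unmatched rows than rows in right:
missing = joined_left['_merge'].eq('left_only')
s = right.sample(missing.sum(), replace=True).drop(columns=[0])      # random right rows, key column dropped
s.index = joined_left.index[missing]                                 # align them with the NaN rows
joined_left = joined_left.fillna(s.rename(columns={1: '1_y', 2: '2_y'}))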
My dataset looks like this (the first row is the header):
0 1 2 3 4 5
1 3 4 6 2 3
3 8 9 3 2 4
2 2 3 2 1 2
I want to select a range of columns of the dataset based on the value in column 5, e.g.:
1 3 4
3 8 9 3
2 2
I have tried the following, but it did not work:
df.iloc[:,0:df['5'].values]
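For reference, the sample frame above can be rebuilt with something like this (a sketch; plain integer column labels 0 to 5 are assumed, which is why the answers below pick the last column positionally):
import pandas as pd

# the last column (5) says how many leading columns to keep in each row
df = pd.DataFrame([[1, 3, 4, 6, 2, 3],
                   [3, 8, 9, 3, 2, 4],
                   [2, 2, 3, 2, 1, 2]])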
Let's try:
df.apply(lambda x: x[:x.iloc[5]], 1)
Output:
0 1 2 3
0 1.0 3.0 4.0 NaN
1 3.0 8.0 9.0 3.0
2 2.0 2.0 NaN NaN
Recreate your dataframe
df=pd.DataFrame([x[:x[5]] for x in df.values]).fillna(0)
df
Out[184]:
0 1 2 3
0 1 3 4.0 0.0
1 3 8 9.0 3.0
2 2 2 0.0 0.0
I'm working with Mint transaction data and trying to sum the values from each category into its parent category.
I have a dataframe mint_data that is created from all my Mint transactions:
mint_data = tranactions_data.pivot(index='Category', columns='Date', values='Amount')
mint_data image
And a dict with Category:Parent pairs (this uses xlwings to pull from excel sheet)
cat_parent = cats_sheet.range('A1').expand().options(dict).value
Cat:Parent image
I'm not sure how to go about looping through the mint_data df and summing amounts into the parent category. I would like to keep the data frame format exactly the same, just replacing the parent values.
Here is an example df:
A B C D E
par_a 0 0 5 0 0
cat1a 5 2 3 2 1
cat2a 0 1 2 1 0
par_b 1 0 1 1 2
cat1b 0 1 2 1 0
cat2b 1 1 1 1 1
cat3b 0 1 2 1 0
I also have a dict with
{'par_a': 'par_a',
'cat1a': 'par_a',
'cat2a': 'par_a',
'par_b': 'par_b',
'cat1b': 'par_b',
'cat2b': 'par_b',
'cat3b': 'par_b'}
I am trying to get the dataframe to end up with
A B C D E
par_a 5 3 10 3 1
cat1a 5 2 3 2 1
cat2a 0 1 2 1 0
par_b 2 3 6 4 3
cat1b 0 1 2 1 0
cat2b 1 1 1 1 1
cat3b 0 1 2 1 0
Let's call your dictionary "dct" and then make a new column that maps to the parent:
>>> df['parent'] = df.reset_index()['index'].map(dct).values
A B C D E parent
par_a 0 0 5 0 0 par_a
cat1a 5 2 3 2 1 par_a
cat2a 0 1 2 1 0 par_a
par_b 1 0 1 1 2 par_b
cat1b 0 1 2 1 0 par_b
cat2b 1 1 1 1 1 par_b
cat3b 0 1 2 1 0 par_b
Then sum by parent:
>>> df_sum = df.groupby('parent').sum()
A B C D E
parent
par_a 5 3 10 3 1
par_b 2 3 6 4 3
In many cases you would stop there, but since you want to combine the parent/child data, you need some sort of merge. combine_first will work well here since it will selectively update in the direction you want:
>>> df_new = df_sum.combine_first(df)
A B C D E parent
cat1a 5.0 2.0 3.0 2.0 1.0 par_a
cat1b 0.0 1.0 2.0 1.0 0.0 par_b
cat2a 0.0 1.0 2.0 1.0 0.0 par_a
cat2b 1.0 1.0 1.0 1.0 1.0 par_b
cat3b 0.0 1.0 2.0 1.0 0.0 par_b
par_a 5.0 3.0 10.0 3.0 1.0 par_a
par_b 2.0 3.0 6.0 4.0 3.0 par_b
You mentioned a multi-index in a comment, so you may prefer to organize it more like this:
>>> df_new.reset_index().set_index(['parent','index']).sort_index()
A B C D E
parent index
par_a cat1a 5.0 2.0 3.0 2.0 1.0
cat2a 0.0 1.0 2.0 1.0 0.0
par_a 5.0 3.0 10.0 3.0 1.0
par_b cat1b 0.0 1.0 2.0 1.0 0.0
cat2b 1.0 1.0 1.0 1.0 1.0
cat3b 0.0 1.0 2.0 1.0 0.0
par_b 2.0 3.0 6.0 4.0 3.0
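Putting the pieces together, a self-contained sketch with the example frame and dict from above (df.index.map is used here as a shortcut for the reset_index/map step):
import pandas as pd

df = pd.DataFrame(
    {'A': [0, 5, 0, 1, 0, 1, 0],
     'B': [0, 2, 1, 0, 1, 1, 1],
     'C': [5, 3, 2, 1, 2, 1, 2],
     'D': [0, 2, 1, 1, 1, 1, 1],
     'E': [0, 1, 0, 2, 0, 1, 0]},
    index=['par_a', 'cat1a', 'cat2a', 'par_b', 'cat1b', 'cat2b', 'cat3b'])
dct = {'par_a': 'par_a', 'cat1a': 'par_a', 'cat2a': 'par_a',
       'par_b': 'par_b', 'cat1b': 'par_b', 'cat2b': 'par_b', 'cat3b': 'par_b'}

df['parent'] = df.index.map(dct)        # map each row label to its parent
df_sum = df.groupby('parent').sum()     # per-parent totals of A..E
df_new = df_sum.combine_first(df)       # parent rows take the summed values, children stay as-is
print(df_new.reset_index().set_index(['parent', 'index']).sort_index())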