I have a pandas DataFrame as follows:
A B
1 2
1 2
1 0
1 2
2 3
2 3
2 1
3 0
3 0
3 1
3 2
I would like to get the following output:
A B
1 2
1 2
1 2
2 3
2 3
3 0
3 0
That is, for each value of A, I need only the rows whose B value occurs most often within that group. Is there a way to do this?
Many thanks!
You can combine groupby() with Series.mode():
df_out = df[df.groupby("A")["B"].transform(lambda x: x == x.mode()[0])]
print(df_out)
Prints:
A B
0 1 2
1 1 2
3 1 2
4 2 3
5 2 3
7 3 0
8 3 0
Is this what you're looking for?
df.set_index(['A','B']).loc[df.groupby(['A','B']).size().groupby(level=0).idxmax()].reset_index()
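Both answers above pick a single most frequent B per group. If two B values can be tied within one A and all of them should be kept, a count-based mask is one option. The following is only a sketch using the question's data; the names pair_count and keep are illustrative and not from the original answers.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
                   "B": [2, 2, 0, 2, 3, 3, 1, 0, 0, 1, 2]})

# Number of occurrences of each (A, B) pair, aligned to the original rows.
pair_count = df.groupby(["A", "B"])["B"].transform("size")

# Keep rows whose pair count equals the largest count within their A group.
keep = pair_count == pair_count.groupby(df["A"]).transform("max")
df_out = df[keep]
print(df_out)
```

On this data the result has the same rows as the mode-based answer, since no group has a tie.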
I have a dataframe df that looks like this:
ID  Months  Borrow_Rank
1   0       1
1   1       1
1   2       1
2   0       1
2   1       1
2   2       1
3   0       2
3   1       2
4   0       1
4   1       1
I want to create a new variable Months_Adjusted that counts up from 0 for as long as Borrow_Rank stays the same and restarts at 0 whenever Borrow_Rank changes:
ID  Months  Borrow_Rank  Months_Adjusted
1   0       1            0
1   1       1            1
1   2       1            2
2   0       1            3
2   1       1            4
2   2       1            5
3   0       2            0
3   1       2            1
4   0       1            0
4   1       1            1
Thank you all and I apologise if I could have written the question better. This is my first post.
import pandas as pd
df = pd.DataFrame({'Borrow_Rank':[1,1,1,1,1,1,1,2,2,2,2,2,3,3,3,1,1,1]})
selector = (df['Borrow_Rank'] != df['Borrow_Rank'].shift()).cumsum()
df['Months_Adjusted'] = df.groupby(selector).cumcount()
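The selector above works by labelling each run of consecutive equal Borrow_Rank values: comparing the column with its shifted copy marks the first row of every run, and the cumulative sum of those marks gives a run number to group on. A minimal sketch on the question's own data (the names changed and run_id are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"ID":          [1, 1, 1, 2, 2, 2, 3, 3, 4, 4],
                   "Months":      [0, 1, 2, 0, 1, 2, 0, 1, 0, 1],
                   "Borrow_Rank": [1, 1, 1, 1, 1, 1, 2, 2, 1, 1]})

# True on the first row of every run of equal consecutive Borrow_Rank values
# (the first row compares against NaN, so it is always True).
changed = df["Borrow_Rank"] != df["Borrow_Rank"].shift()

# Cumulative sum turns the run starts into a run label: 1, 1, ..., 2, 2, 3, 3.
run_id = changed.cumsum()

# Position of each row inside its run, starting at 0.
df["Months_Adjusted"] = df.groupby(run_id).cumcount()
print(df)
```

Grouping on the run label rather than on Borrow_Rank itself is what makes the last two rows restart at 0 even though rank 1 appeared earlier.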
Here is a solution using itertools.groupby:
from itertools import groupby
df['Months_Adjusted'] = pd.concat(
    [pd.Series(range(len(list(g))))
     for k, g in groupby(df['Borrow_Rank'])],
    ignore_index=True)
Output
ID Months Borrow_Rank Months_Adjusted
0 1 0 1 0
1 1 1 1 1
2 1 2 1 2
3 2 0 1 3
4 2 1 1 4
5 2 2 1 5
6 3 0 2 0
7 3 1 2 1
8 4 0 1 0
9 4 1 1 1
How can I filter the tolist() / values result below so that only the non-zero entries remain?
import pandas as pd
df = pd.DataFrame({'A':[0,2,3],'B':[2,0,4], 'C': [3,4,0]})
df['D']=df[['A','B','C']].values.tolist()
df.explode('D')
Data
A B C
0 0 2 3
1 2 0 4
2 3 4 0
Exploding column D produces 9 rows, but I want only the rows where D is non-zero (6 rows, as in the expected result below).
Expected result
A B C D
0 0 2 3 2
0 0 2 3 3
1 2 0 4 2
1 2 0 4 4
2 3 4 0 3
2 3 4 0 4
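I found that list(filter(None, [1,0,2,3,0])) returns only the non-zero values, but I am not sure how to apply it in the code above.
Using Index.repeat with a boolean mask (run this on the original df, before column D is added):
m = df.ne(0)
df.loc[df.index.repeat(m.sum(1))].assign(D=df.values[m])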
A B C D
0 0 2 3 2
0 0 2 3 3
1 2 0 4 2
1 2 0 4 4
2 3 4 0 3
2 3 4 0 4
The simplest approach is query:
df['D']=df[['A','B','C']].values.tolist()
df.explode('D').query('D != 0')
Output:
A B C D
0 0 2 3 2
0 0 2 3 3
1 2 0 4 2
1 2 0 4 4
2 3 4 0 3
2 3 4 0 4
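The filter idea from the question can also be applied directly while building D, so the zeros never reach explode. This is a sketch rather than either answerer's method; it uses a list comprehension in place of filter(None, ...) and wraps the result in a Series to keep one list per row.

```python
import pandas as pd

df = pd.DataFrame({"A": [0, 2, 3], "B": [2, 0, 4], "C": [3, 4, 0]})

# One list per row, containing only that row's non-zero values.
non_zero = [[v for v in row if v != 0]
            for row in df[["A", "B", "C"]].values.tolist()]
df["D"] = pd.Series(non_zero, index=df.index)

# Each list element becomes its own row; the index labels are repeated.
out = df.explode("D")
print(out)
```

One difference from the query approach: a row consisting only of zeros would get an empty list, which explode turns into a single row with NaN in D instead of dropping it.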
Suppose I have pandas DataFrame like this:
>>> df = pd.DataFrame({'id':[1,1,1,1,1,2,2,2,2,2,2,3,4],'value':[1,1,1,1,3,1,2,2,3,3,4,1,1]})
>>> df
id value
1 1
1 1
1 1
1 1
1 3
2 1
2 2
2 2
2 3
2 3
2 4
3 1
4 1
I want to get a new DataFrame with the rows holding the 2 smallest distinct values (in general, the n smallest) for each id, including duplicates, like this:
id value
0 1 1
1 1 1
3 1 1
4 1 1
5 1 3
6 2 1
7 2 2
8 2 2
9 3 1
10 4 1
I've tried using head() and nsmallest(), but I think those will not include duplicates. Is there a better way to do this?
Edited to make it clear that I want more than 2 records per group if there are more than 2 duplicates.
First use DataFrame.drop_duplicates, then take the top values per group, and finally use DataFrame.merge to bring back the duplicates:
df = pd.DataFrame({'id':[1,1,1,1,1,2,2,2,2,2,2,3,4],'value':[1,1,1,1,3,1,2,2,3,3,4,1,1]})
df1 = df.drop_duplicates(['id','value']).sort_values(['id','value']).groupby('id').head(2)
df = df.merge(df1)
print (df)
id value
0 1 1
1 1 1
2 1 1
3 1 1
4 1 3
5 2 1
6 2 2
7 2 2
8 3 1
9 4 1
Or use a custom lambda function with GroupBy.transform and filter with boolean indexing (applied to the original df):
df = df[df.groupby('id')['value'].transform(lambda x: x.isin(sorted(set(x))[:2]))]
print (df)
id value
0 1 1
1 1 1
2 1 1
3 1 1
4 1 3
5 2 1
6 2 2
7 2 2
11 3 1
12 4 1
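A dense rank per group expresses the same rule without a lambda: rows sharing a value share a rank, and ranks have no gaps, so "rank <= n" selects the n smallest distinct values together with all their duplicates. A minimal sketch on the question's data (n, dense_rank and top_n are illustrative names, not taken from the answer above):

```python
import pandas as pd

df = pd.DataFrame({"id":    [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 4],
                   "value": [1, 1, 1, 1, 3, 1, 2, 2, 3, 3, 4, 1, 1]})

n = 2  # number of distinct smallest values to keep per id

# Dense rank within each id: equal values get the same rank, ranks are 1, 2, 3, ...
dense_rank = df.groupby("id")["value"].rank(method="dense")

# Keep every row whose value is among the n smallest distinct values of its id.
top_n = df[dense_rank <= n]
print(top_n)
```

For the n largest values instead, passing ascending=False to rank should give the mirrored selection.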
I have a dataframe with the following form:
data = pd.DataFrame({'ID':[1,1,1,2,2,2,2,3,3],'Time':[0,1,2,0,1,2,3,0,1],
'sig':[2,3,1,4,2,0,2,3,5],'sig2':[9,2,8,0,4,5,1,1,0],
'group':['A','A','A','B','B','B','B','A','A']})
print(data)
ID Time sig sig2 group
0 1 0 2 9 A
1 1 1 3 2 A
2 1 2 1 8 A
3 2 0 4 0 B
4 2 1 2 4 B
5 2 2 0 5 B
6 2 3 2 1 B
7 3 0 3 1 A
8 3 1 5 0 A
I want to reshape and pad the data so that each 'ID' has the same number of Time values, the signal columns (sig, sig2) are padded with zeros (or with the mean value within the ID), and the group column carries the same letter value. The output after padding would be:
data_pad = pd.DataFrame({'ID':[1,1,1,1,2,2,2,2,3,3,3,3],'Time':[0,1,2,3,0,1,2,3,0,1,2,3],
'sig1':[2,3,1,0,4,2,0,2,3,5,0,0],'sig2':[9,2,8,0,0,4,5,1,1,0,0,0],
'group':['A','A','A','A','B','B','B','B','A','A','A','A']})
print(data_pad)
ID Time sig1 sig2 group
0 1 0 2 9 A
1 1 1 3 2 A
2 1 2 1 8 A
3 1 3 0 0 A
4 2 0 4 0 B
5 2 1 2 4 B
6 2 2 0 5 B
7 2 3 2 1 B
8 3 0 3 1 A
9 3 1 5 0 A
10 3 2 0 0 A
11 3 3 0 0 A
My end goal is to ultimately reshape this into something with shape (number of ID, number of time points, number of sequences {2 here}).
It seems that if I pivot data, the gaps are filled with NaN values, which is fine for the signal values but not for the groups. I am also hoping to avoid looping through data.groupby('ID'), since my actual data has a large number of groups and the looping would likely be very slow.
Here's one approach creating the new index with pd.MultiIndex.from_product and using it to reindex on the Time column:
df = data.set_index(['ID', 'Time'])
# define the new index
ix = pd.MultiIndex.from_product([df.index.levels[0],
                                 df.index.levels[1]],
                                names=['ID', 'Time'])
# reindex using the above multiindex
df = df.reindex(ix, fill_value=0)
# forward fill the missing values in group
df['group'] = df.group.mask(df.group.eq(0)).ffill()
print(df.reset_index())
ID Time sig sig2 group
0 1 0 2 9 A
1 1 1 3 2 A
2 1 2 1 8 A
3 1 3 0 0 A
4 2 0 4 0 B
5 2 1 2 4 B
6 2 2 0 5 B
7 2 3 2 1 B
8 3 0 3 1 A
9 3 1 5 0 A
10 3 2 0 0 A
11 3 3 0 0 A
IIUC:
(data.pivot_table(columns='Time', index=['ID','group'], fill_value=0)
.stack('Time')
.sort_index(level=['ID','Time'])
.reset_index()
)
Output:
ID group Time sig sig2
0 1 A 0 2 9
1 1 A 1 3 2
2 1 A 2 1 8
3 1 A 3 0 0
4 2 B 0 4 0
5 2 B 1 2 4
6 2 B 2 0 5
7 2 B 3 2 1
8 3 A 0 3 1
9 3 A 1 5 0
10 3 A 2 0 0
11 3 A 3 0 0
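For the stated end goal of an array with shape (number of IDs, number of time points, number of signals), the padded signal columns can be reshaped directly once every ID has every Time value. The following is a sketch that reuses the reindexing idea from the first answer; the names signals, full_index, padded and arr are illustrative.

```python
import pandas as pd

data = pd.DataFrame({'ID': [1, 1, 1, 2, 2, 2, 2, 3, 3],
                     'Time': [0, 1, 2, 0, 1, 2, 3, 0, 1],
                     'sig': [2, 3, 1, 4, 2, 0, 2, 3, 5],
                     'sig2': [9, 2, 8, 0, 4, 5, 1, 1, 0],
                     'group': ['A', 'A', 'A', 'B', 'B', 'B', 'B', 'A', 'A']})

# Only the numeric signal columns go into the array.
signals = data.set_index(['ID', 'Time'])[['sig', 'sig2']]

# Every (ID, Time) combination, ordered with ID as the outer level.
ids, times = signals.index.levels
full_index = pd.MultiIndex.from_product([ids, times], names=['ID', 'Time'])

# Missing combinations are padded with zeros.
padded = signals.reindex(full_index, fill_value=0)

# Rows are ordered ID-major, Time-minor, so a plain reshape is enough.
arr = padded.to_numpy().reshape(len(ids), len(times), padded.shape[1])
print(arr.shape)
```

Here arr[i, t] holds the (sig, sig2) pair for the i-th ID at the t-th time point; the group labels would have to be kept in a separate per-ID structure.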
Suppose I have a dataframe:
A B C D
1 1 2 2 1
2 1 1 2 1
3 3 1 0 1
4 2 4 4 4
I want to add columns B and C together and count how many of the sums differ from the value in column D. The desired output is:
A B C B+C D
1 1 2 2 4 1
2 1 1 2 3 1
3 3 1 0 1 1
4 2 4 4 8 4
Comparing "B+C" with "D", 3 values are different.
Could you please help me with this?
You could do something like:
df.B.add(df.C).ne(df.D).sum()
# 3
If you need to add the column:
df['B+C'] = df.B.add(df.C)
diff = df['B+C'].ne(df.D).sum()
print(f'There are {diff} different values compare the "B+C" and "D"')
#There are 3 different values compare the "B+C" and "D"
df.insert(3,'B+C', df['B']+df['C'])
Here 3 is the position at which the new column is inserted.
df.head()
A B C B+C D
0 1 2 2 4 1
1 1 1 2 3 1
2 3 1 0 1 1
3 2 4 4 8 4
After that you can follow the steps from @yatu's answer:
df['B+C'].ne(df['D'])
0 True
1 True
2 False
3 True
dtype: bool
df['B+C'].ne(df['D']).sum()
3