I need to go through a large pandas DataFrame and select consecutive rows with the same value in a column. E.g., in the DataFrame below, selecting on column x:
col row x y
1 1 1 1
2 2 2 2
6 3 3 8
9 2 3 4
5 3 3 9
4 9 4 4
5 5 5 1
3 7 5 2
6 6 6 6
The resulting output would be:
col row x y
6 3 3 8
9 2 3 4
5 3 3 9
5 5 5 1
3 7 5 2
Not sure how to do this.
IIUC, use boolean indexing with a mask of the consecutive values:
m = df['x'].eq(df['x'].shift())
df[m|m.shift(-1, fill_value=False)]
Output:
col row x y
2 6 3 3 8
3 9 2 3 4
4 5 3 3 9
6 5 5 5 1
7 3 7 5 2
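To make this reproducible, here is a minimal sketch that rebuilds the sample frame (values transcribed from the table above) and shows the two intermediate masks: m flags rows whose x equals the previous row's x, and m.shift(-1) extends the flag back to the first row of each run.

import pandas as pd

df = pd.DataFrame({'col': [1, 2, 6, 9, 5, 4, 5, 3, 6],
                   'row': [1, 2, 3, 2, 3, 9, 5, 7, 6],
                   'x':   [1, 2, 3, 3, 3, 4, 5, 5, 6],
                   'y':   [1, 2, 8, 4, 9, 4, 1, 2, 6]})

m = df['x'].eq(df['x'].shift())           # True where x repeats the previous row
keep = m | m.shift(-1, fill_value=False)  # also keep the first row of each run
print(df[keep])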
Following up on the above: how can I restrict this to specific consecutive values in column x? Say I want runs of 3 and 5 only, as in the DataFrame below:
col row x y
1 1 1 1
5 7 3 0
2 2 2 2
6 3 3 8
9 2 3 4
5 3 3 9
4 9 4 4
5 5 5 1
3 7 5 2
6 6 6 6
5 8 6 2
3 7 6 0
The resulting output would be:
col row x y consecutive-count
6 3 3 8 1
9 2 3 4 1
5 3 3 9 1
5 5 5 1 2
3 7 5 2 2
I tried:
m = df['x'].eq(df['x'].shift())
df[m|m.shift(-1, fill_value=False)]
But that also includes the consecutive 6s, which I don't want.
I also tried:
df.query('x in [3, 5]')
That returns every row where x is 3 or 5, whether or not it is part of a run.
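For reference, the sample frame above can be rebuilt as follows (values transcribed from the table):

import pandas as pd

df = pd.DataFrame({'col': [1, 5, 2, 6, 9, 5, 4, 5, 3, 6, 5, 3],
                   'row': [1, 7, 2, 3, 2, 3, 9, 5, 7, 6, 8, 7],
                   'x':   [1, 3, 2, 3, 3, 3, 4, 5, 5, 6, 6, 6],
                   'y':   [1, 0, 2, 8, 4, 9, 4, 1, 2, 6, 2, 0]})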
IIUC, use masks for boolean indexing. Reuse the run mask from above to keep only rows inside a run of repeats (otherwise the isolated 3 in row 1 sneaks in), check for 3 or 5, and use a forward cummax and a reverse cummax to ensure the 3-run comes before the 5-run:
m = df['x'].eq(df['x'].shift())
run = m | m.shift(-1, fill_value=False)  # rows that are part of a run
m1 = df['x'].eq(3)
m2 = df['x'].eq(5)
out = df[run & (m1 | m2) & m1.cummax() & m2[::-1].cummax()]
Output:
col row x y
3 6 3 3 8
4 9 2 3 4
5 5 3 3 9
7 5 5 5 1
8 3 7 5 2
You can create a group id for runs of consecutive values, then filter by the run length and the value of x:
# unique id for each run of consecutive equal values, then each run's length:
group_num = (df.x.shift() != df.x).cumsum()
group_len = group_num.groupby(group_num).transform("count")
# keep the 3s and 5s that sit in runs longer than one row
# (.copy() avoids a SettingWithCopyWarning on the next assignment):
df2 = df[df.x.isin([3, 5]) & (group_len > 1)].copy()
# number the surviving runs:
df2['consecutive-count'] = (df2.x != df2.x.shift()).cumsum()
Output:
col row x y consecutive-count
3 6 3 3 8 1
4 9 2 3 4 1
5 5 3 3 9 1
7 5 5 5 1 2
8 3 7 5 2 2
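An equivalent filter with GroupBy.filter, for reference (a sketch reusing the same run-id idea; it keeps only runs longer than one row whose value is 3 or 5, without the extra count column):

run_id = (df.x != df.x.shift()).cumsum()  # unique id per run of equal x
out = df.groupby(run_id).filter(lambda g: len(g) > 1 and g.x.iat[0] in (3, 5))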
I have a pandas dataframe with several columns. I want to add a new column containing, for each row, the number of rows that share the same values in two given columns.
For example, suppose I have the following dataframe:
x y
0 1 5
1 2 7
2 3 2
3 7 3
4 2 7
5 6 5
6 5 3
7 2 7
8 2 2
I want to add a third column that contains, for each row, the number of rows where both x and y match. The desired output here would be
x y frequency
0 1 5 1
1 2 7 3
2 3 2 1
3 7 3 1
4 2 7 3
5 6 5 1
6 5 3 1
7 2 7 3
8 2 2 1
For instance, all rows with (x, y) = (2, 7) get frequency 3 because (2, 7) appears three times in the dataframe.
One way to get the output is to create a "hash" (i.e., df['hash'] = df['x'].astype(str) + ',' + df['y'].astype(str) followed by df['frequency'] = df['hash'].map(collections.Counter(df['hash']))), but can we do this directly with groupby? The frequency column is exactly the size of each entry's group in df.groupby(['x', 'y']).
Thanks
IIUC this will work for you:
df['frequency'] = df.groupby(['x','y'])['y'].transform('size')
Output:
x y frequency
0 1 5 1
1 2 7 3
2 3 2 1
3 7 3 1
4 2 7 3
5 6 5 1
6 5 3 1
7 2 7 3
8 2 2 1
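For reference, a runnable version (the frame is rebuilt from the sample above):

import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3, 7, 2, 6, 5, 2, 2],
                   'y': [5, 7, 2, 3, 7, 5, 3, 7, 2]})

# broadcast each (x, y) group's size back to its rows
df['frequency'] = df.groupby(['x', 'y'])['y'].transform('size')
print(df)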
Suppose I have the following dataframe:
import pandas as pd

df = pd.DataFrame({'a': [1,1,1,2,2,2,2,2,3,3,3,3,4,4,4,4,4,4],
                   'b': [3,4,3,7,5,9,4,2,5,6,7,8,4,2,4,5,8,0]})
a b
0 1 3
1 1 4
2 1 3
3 2 7
4 2 5
5 2 9
6 2 4
7 2 2
8 3 5
9 3 6
10 3 7
11 3 8
12 4 4
13 4 2
14 4 4
15 4 5
16 4 8
17 4 0
And I would like to make a new column c with values 1 to n, counting up within each group of column a, as follows:
a b c
0 1 3 1
1 1 4 2
2 1 3 3
3 2 7 1
4 2 5 2
5 2 9 3
6 2 4 4
7 2 2 5
8 3 5 1
9 3 6 2
10 3 7 3
11 3 8 4
12 4 4 1
13 4 2 2
14 4 4 3
15 4 5 4
16 4 8 5
17 4 0 6
While I could write this with a for loop, my data frame is huge and that would be computationally costly. Is there an efficient way to generate such a column? Thanks.
Use GroupBy.cumcount:
df['c'] = df.groupby('a').cumcount().add(1)
print(df)
# Output
a b c
0 1 3 1
1 1 4 2
2 1 3 3
3 2 7 1
4 2 5 2
5 2 9 3
6 2 4 4
7 2 2 5
8 3 5 1
9 3 6 2
10 3 7 3
11 3 8 4
12 4 4 1
13 4 2 2
14 4 4 3
15 4 5 4
16 4 8 5
17 4 0 6
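For comparison, a plain-Python loop with the same semantics (shown only to make cumcount's behavior explicit; the groupby one-liner is the vectorized equivalent and is what you should use on a large frame):

counter = {}
c = []
for val in df['a']:
    counter[val] = counter.get(val, 0) + 1  # running count within each group
    c.append(counter[val])
df['c'] = c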
Here is a sample dataset:
id a
0 5 1
1 5 0
2 5 4
3 5 6
4 5 2
5 5 3
6 9 0
7 9 1
8 9 6
9 9 2
10 9 4
From this dataset I want to generate a column sum, grouped by id. For the first 3 rows of each group, sum is the running total of a. From the 4th row of each group onward, each row contains the sum of the previous 3 rows' a values.
Desired Output:
id a sum
0 5 1 1
1 5 0 1
2 5 4 5
3 5 6 5
4 5 2 10
5 5 3 12
6 9 0 0
7 9 1 1
8 9 6 7
9 9 2 7
10 9 4 9
Code I tried (this fails, since the object returned by rolling has no groupby method):
df['sum'] = df['a'].rolling(min_periods=1, window=3).groupby(df['id']).cumsum()
You can define a function like the one below and apply it per id group so the window never crosses group boundaries:
import numpy as np

def cumsum_last3(DF):
    a = DF["a"].to_numpy()
    nrow = len(a)
    sums = np.zeros(nrow, dtype=a.dtype)
    sums[:3] = np.cumsum(a[:3])   # first 3 rows: running total of a
    for i in range(3, nrow):      # from the 4th row: sum of the previous 3 rows
        sums[i] = a[i - 3:i].sum()
    DF["sum"] = sums
    return DF

DF_cums = df.groupby("id", group_keys=False).apply(cumsum_last3)
DF_cums
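A vectorized sketch of the same logic, for reference: take each group's rolling 3-row sum shifted down one row (i.e., the sum of the previous 3 rows), then fall back to the group's running total for the first rows, where the window is not yet full.

prev3 = df.groupby('id')['a'].transform(lambda s: s.rolling(3).sum().shift())
df['sum'] = prev3.fillna(df.groupby('id')['a'].cumsum()).astype(int)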
I have a dataframe like this:
1 2 3 4 5 6
Ax Ax Ax Ax Ax Ax
delta delta delta delta delta delta
0 6 4 1 5 3 2
1 6 1 5 3 2 4
2 6 1 5 3 2 4
3 6 1 5 3 2 4
4 6 1 5 3 2 4
5 6 1 5 3 2 4
6 6 1 5 3 2 4
7 6 1 5 3 2 4
8 6 1 5 3 2 4
9 6 1 5 3 2 4
I would like to pivot this such that the values become the column labels, and the column labels become the values.
So, the first two rows would become the following:
1 2 3 4 5 6
0 3 6 5 2 4 1
1 2 5 4 6 3 1
I hope that this makes sense. I have tried using pivot() and pivot_table(), but it doesn't seem possible with those.
Try:
df1 = df.copy()
df1.columns = df1.columns.droplevel([1, 2])  # keep only the top level (1..6)
# long format of (row, value) pairs, then pivot the values into columns:
df1.stack().reset_index().pivot(index='level_0', columns=0, values='level_1')
Slice the columns by the sorted indices:
import numpy as np
import pandas as pd

cols = df.columns.get_level_values(0).to_numpy()  # top-level labels 1..6
# argsort each row: the column positions in ascending order of value,
# then map those positions back to the column labels
pd.DataFrame(cols[np.argsort(df.to_numpy(), axis=1)],
             columns=range(1, df.shape[1] + 1))
1 2 3 4 5 6
0 3 6 5 2 4 1
1 2 5 4 6 3 1
2 2 5 4 6 3 1
3 2 5 4 6 3 1
4 2 5 4 6 3 1
5 2 5 4 6 3 1
6 2 5 4 6 3 1
7 2 5 4 6 3 1
8 2 5 4 6 3 1
9 2 5 4 6 3 1
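For reference, a self-contained sketch that rebuilds the sample frame (the Ax/delta header levels are assumptions taken from the display above) and reproduces this output:

import numpy as np
import pandas as pd

data = [[6, 4, 1, 5, 3, 2]] + [[6, 1, 5, 3, 2, 4]] * 9
cols = pd.MultiIndex.from_product([[1, 2, 3, 4, 5, 6], ['Ax'], ['delta']])
df = pd.DataFrame(data, columns=cols)

labels = df.columns.get_level_values(0).to_numpy()
print(pd.DataFrame(labels[np.argsort(df.to_numpy(), axis=1)],
                   columns=range(1, df.shape[1] + 1)))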