I have a pandas DataFrame with several columns. I want to add a new column containing, for each row, the number of rows that share the same pair of values in two given columns.
For example, suppose I have the following dataframe:
x y
0 1 5
1 2 7
2 3 2
3 7 3
4 2 7
5 6 5
6 5 3
7 2 7
8 2 2
I want to add a third column that contains the number of rows in which both x and y take the same pair of values. The desired output here would be:
x y frequency
0 1 5 1
1 2 7 3
2 3 2 1
3 7 3 1
4 2 7 3
5 6 5 1
6 5 3 1
7 2 7 3
8 2 2 1
For instance, all rows with (x, y) = (2, 7) get frequency 3 because (2, 7) appears three times in the dataframe.
One way to get the output is to create a "hash" column (i.e., df['hash'] = df['x'].astype(str) + ',' + df['y'].astype(str) followed by df['frequency'] = df['hash'].map(collections.Counter(df['hash']))), but can we do this directly with group-by? The frequency column is exactly the size of each row's group in df.groupby(['x', 'y']).
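In full, that workaround looks like this (map accepts the Counter because it is dict-like):
import collections
import pandas as pd
df = pd.DataFrame({'x': [1, 2, 3, 7, 2, 6, 5, 2, 2],
                   'y': [5, 7, 2, 3, 7, 5, 3, 7, 2]})
# build a string key per row, then map each key to its total count
df['hash'] = df['x'].astype(str) + ',' + df['y'].astype(str)
df['frequency'] = df['hash'].map(collections.Counter(df['hash']))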
Thanks
IIUC this will work for you:
df['frequency'] = df.groupby(['x','y'])['y'].transform('size')
Output:
x y frequency
0 1 5 1
1 2 7 3
2 3 2 1
3 7 3 1
4 2 7 3
5 6 5 1
6 5 3 1
7 2 7 3
8 2 2 1
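Note that 'size' counts all rows in the group (NaNs included), while 'count' counts only non-null values; with no missing data either works here. The column selected after the groupby is only a carrier for the transform, so any column will do, e.g.:
df['frequency'] = df.groupby(['x', 'y'])['x'].transform('size')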
I need to go through a large DataFrame and select consecutive rows with matching values in a column, i.e., in the DataFrame below, selecting on column x. How can I restrict the selection to specific consecutive values in column x? Say I want consecutive runs of 3 and 5 only.
col row x y
1 1 1 1
5 7 3 0
2 2 2 2
6 3 3 8
9 2 3 4
5 3 3 9
4 9 4 4
5 5 5 1
3 7 5 2
6 6 6 6
5 8 6 2
3 7 6 0
The results output would be:
col row x y consecutive-count
6 3 3 8 1
9 2 3 4 1
5 3 3 9 1
5 5 5 1 2
3 7 5 2 2
I tried
m = df['x'].eq(df['x'].shift())
df[m|m.shift(-1, fill_value=False)]
But that includes the consecutive 6s that I don't want.
I also tried:
df.query('x in [3, 5]')
That returns every row where x is 3 or 5, including the non-consecutive ones.
IIUC, use masks for boolean indexing. Check for 3 or 5, and use a cummax and a reverse cummax to bound the selection between the first 3 and the last 5:
m1 = df['x'].eq(3)
m2 = df['x'].eq(5)
out = df[(m1|m2)&(m1.cummax()&m2[::-1].cummax())]
Output:
col row x y
1 5 7 3 0
3 6 3 3 8
4 9 2 3 4
5 5 3 3 9
7 5 5 5 1
8 3 7 5 2
Note that the lone 3 in row 1 slips through: the cummax trick only bounds the selection between the first 3 and the last 5, it does not check that a value belongs to a consecutive run. The groupby approach below handles that case.
You can create a group id for consecutive values, then filter by the group size and the value of x:
# assign a unique id to each run of consecutive equal x values,
# then get the length of each run:
group_num = (df.x.shift() != df.x).cumsum()
group_len = group_num.groupby(group_num).transform("count")
# filter main df; .copy() avoids SettingWithCopyWarning on the assignment below:
df2 = df[(df.x.isin([3, 5])) & (group_len > 1)].copy()
# number the runs that remain:
df2['consecutive-count'] = (df2.x != df2.x.shift()).cumsum()
Output:
col row x y consecutive-count
3 6 3 3 8 1
4 9 2 3 4 1
5 5 3 3 9 1
7 5 5 5 1 2
8 3 7 5 2 2
I need to go through a large DataFrame and select consecutive rows with matching values in a column, i.e., in the DataFrame below, selecting on column x:
col row x y
1 1 1 1
2 2 2 2
6 3 3 8
9 2 3 4
5 3 3 9
4 9 4 4
5 5 5 1
3 7 5 2
6 6 6 6
The results output would be:
col row x y
6 3 3 8
9 2 3 4
5 3 3 9
5 5 5 1
3 7 5 2
Not sure how to do this.
IIUC, use boolean indexing with a mask of the consecutive values:
# True where x equals the previous row's x
m = df['x'].eq(df['x'].shift())
# keep rows that match their predecessor or their successor
df[m|m.shift(-1, fill_value=False)]
Output:
col row x y
2 6 3 3 8
3 9 2 3 4
4 5 3 3 9
6 5 5 5 1
7 3 7 5 2
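As a self-contained sketch that rebuilds the sample frame above:
import pandas as pd
df = pd.DataFrame({'col': [1, 2, 6, 9, 5, 4, 5, 3, 6],
                   'row': [1, 2, 3, 2, 3, 9, 5, 7, 6],
                   'x': [1, 2, 3, 3, 3, 4, 5, 5, 6],
                   'y': [1, 2, 8, 4, 9, 4, 1, 2, 6]})
m = df['x'].eq(df['x'].shift())
print(df[m | m.shift(-1, fill_value=False)])  # rows 2-4 and 6-7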
I have the following DataFrame:
A B
0 1 2
1 1 2
2 4 3
3 4 3
4 5 6
5 5 6
6 5 6
After grouping by column A I get 3 groups:
(1, A B
0 1 2
1 1 2)
(4, A B
2 4 3
3 4 3)
(5, A B
4 5 6
5 5 6
6 5 6)
I would like to count the groups whose size differs from a specific row count. For example, an input of 2 should output 1, because only one group (A=5, with 3 rows) has a size other than 2; an input of 3 should output 2, for the other two groups.
What is the Pandas solution for such a task?
I think you need Series.value_counts, a not-equal test with Series.ne, and then counting the number of True values with sum:
N = 2
a = df['A'].value_counts().ne(N).sum()
print(a)
1
You can count values, then filter on B:
counts = df.groupby('A').count()
count_input = 2
print(len(counts[counts['B'] != count_input]))
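Both approaches in one runnable sketch on the frame above:
import pandas as pd
df = pd.DataFrame({'A': [1, 1, 4, 4, 5, 5, 5],
                   'B': [2, 2, 3, 3, 6, 6, 6]})
N = 2
# group sizes are A=1 -> 2 rows, A=4 -> 2 rows, A=5 -> 3 rows
print(df['A'].value_counts().ne(N).sum())   # 1
counts = df.groupby('A').count()
print(len(counts[counts['B'] != N]))        # 1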
I have a dataframe that looks like the following:
index value
1 21.046091
2 52.400000
3 14.082153
4 1.859942
5 1.859942
6 2.331143
7 9.060000
8 0.789265
9 12967.7
The last value is much higher than the rest. I'm trying to bin all the values into 5 bins using pd.cut:
pd.cut(df['value'], 5, labels=[1, 2, 3, 4, 5])
But it only ends up returning the groups 1 and 5.
index value group
1 21.046091 1
2 52.400000 1
3 14.082153 1
4 1.859942 1
5 1.859942 1
6 2.331143 1
7 9.060000 1
8 0.789265 1
9 12967.7 5
The outlying value is clearly throwing off the binning, but is there a way to ensure that all five bins are represented in the dataframe without getting rid of outlying values?
You could use qcut, which bins by sample quantiles (roughly equal-sized groups) rather than equal-width intervals, so a single outlier cannot swallow the range:
pd.qcut(df['value'], 5, labels=[1, 2, 3, 4, 5])
Output:
index
1 4
2 5
3 4
4 1
5 1
6 2
7 3
8 1
9 5
Name: value, dtype: category
Categories (5, int64): [1 < 2 < 3 < 4 < 5]
print(df.assign(group=pd.qcut(df['value'], 5, labels=[1, 2, 3, 4, 5])))
value group
index
1 21.046091 4
2 52.400000 5
3 14.082153 4
4 1.859942 1
5 1.859942 1
6 2.331143 2
7 9.060000 3
8 0.789265 1
9 12967.700000 5
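To see why cut collapses here, a quick sketch (edge values are approximate): with equal-width bins the edges are driven by the min and max alone, so the outlier stretches four of the five bins over empty space.
import pandas as pd
values = pd.Series([21.046091, 52.4, 14.082153, 1.859942, 1.859942,
                    2.331143, 9.06, 0.789265, 12967.7])
_, edges = pd.cut(values, 5, retbins=True)
print(edges)  # roughly [0.8, 2594.2, 5187.6, 7780.9, 10374.3, 12967.7]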
I have two pandas DataFrames of different sizes. The two DataFrames look like:
df1 =
x y data
1 2 5
2 2 7
5 3 9
3 5 2
and another dataframe looks like:
df2 =
x y value
5 3 7
1 2 4
3 5 2
7 1 4
4 6 5
2 2 1
7 5 8
I am trying to merge these two DataFrames so that the final DataFrame keeps the matching combinations of x and y with their respective values. I am expecting the final DataFrame in this format:
x y data value
1 2 5 4
2 2 7 1
5 3 9 7
3 5 2 2
I tried this code but am not getting the expected results.
df2.set_index('x').loc[df1.x].reset_index()
Use merge. By default how='inner', so it can be omitted, and if you join only on identically named columns, the on parameter can be omitted too:
print(pd.merge(df1, df2))
x y data value
0 1 2 5 4
1 2 2 7 1
2 5 3 9 7
3 3 5 2 2
If the real data has multiple columns with the same names, pass the join keys explicitly:
print(pd.merge(df1, df2, on=['x', 'y']))
x y data value
0 1 2 5 4
1 2 2 7 1
2 5 3 9 7
3 3 5 2 2
df1.merge(df2, on=['x', 'y'])
This will do it. (Note merge has no by parameter; the keyword is on.)
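For a quick end-to-end check, a sketch with the frames from the question:
import pandas as pd
df1 = pd.DataFrame({'x': [1, 2, 5, 3], 'y': [2, 2, 3, 5],
                    'data': [5, 7, 9, 2]})
df2 = pd.DataFrame({'x': [5, 1, 3, 7, 4, 2, 7], 'y': [3, 2, 5, 1, 6, 2, 5],
                    'value': [7, 4, 2, 4, 5, 1, 8]})
# inner merge on the shared columns keeps only (x, y) pairs present in both
print(df1.merge(df2, on=['x', 'y']))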