I have a dataframe that looks like the following:
index value
1 21.046091
2 52.400000
3 14.082153
4 1.859942
5 1.859942
6 2.331143
7 9.060000
8 0.789265
9 12967.7
The last value is much higher than the rest. I'm trying to bin all the values into 5 bins using pd.cut:
pd.cut(df['value'], 5, labels=[1, 2, 3, 4, 5])
But it only ends up returning the groups 1 and 5.
index value group
0 0.410000 1
1 21.046091 1
2 52.400000 1
3 14.082153 1
4 1.859942 1
5 1.859942 1
6 2.331143 1
7 9.060000 1
8 0.789265 1
9 12967.7 5
The high outlier is clearly throwing the bins off, but is there a way to ensure that all five bins are represented in the dataframe without getting rid of the outlying value?
You could use qcut, which bins by sample quantiles (roughly equal-sized groups) rather than equal-width intervals:
pd.qcut(df['value'], 5, labels=[1, 2, 3, 4, 5])
Output:
index
1 4
2 5
3 4
4 1
5 1
6 2
7 3
8 1
9 5
Name: value, dtype: category
Categories (5, int64): [1 < 2 < 3 < 4 < 5]
print(df.assign(group=pd.qcut(df['value'], 5, labels=[1, 2, 3, 4, 5])))
value group
index
1 21.046091 4
2 52.400000 5
3 14.082153 4
4 1.859942 1
5 1.859942 1
6 2.331143 2
7 9.060000 3
8 0.789265 1
9 12967.700000 5
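To see why cut collapses almost everything into the first bin, you can inspect the bin edges each function produces with retbins=True. A minimal, self-contained sketch using the sample values above:
import pandas as pd

df = pd.DataFrame({'value': [21.046091, 52.4, 14.082153, 1.859942, 1.859942,
                             2.331143, 9.06, 0.789265, 12967.7]})

# equal-width bins: the outlier stretches the range, so the first
# interval alone spans roughly 0.79 to 2594 and swallows every other row
_, cut_edges = pd.cut(df['value'], 5, labels=[1, 2, 3, 4, 5], retbins=True)
print(cut_edges)

# quantile-based bins: the edges follow the data, so each bin
# receives roughly the same number of rows
_, qcut_edges = pd.qcut(df['value'], 5, labels=[1, 2, 3, 4, 5], retbins=True)
print(qcut_edges)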
I need to go through a large DataFrame and select consecutive rows with similar values in a column, i.e. in the DataFrame below, selecting on column x: I want to keep only runs of consecutive values in column x, say runs of 3 and 5 only.
col row x y
1 1 1 1
5 7 3 0
2 2 2 2
6 3 3 8
9 2 3 4
5 3 3 9
4 9 4 4
5 5 5 1
3 7 5 2
6 6 6 6
5 8 6 2
3 7 6 0
The results output would be:
col row x y consecutive-count
6 3 3 8 1
9 2 3 4 1
5 3 3 9 1
5 5 5 1 2
3 7 5 2 2
I tried
m = df['x'].eq(df['x'].shift())
df[m|m.shift(-1, fill_value=False)]
But that includes the consecutive 6s that I don't want.
I also tried:
df.query( 'x in [3,5]')
That prints every row where x has 3 or 5.
IIUC, use masks for boolean indexing. Check for 3 or 5, and use a cummax and a reversed cummax to keep only the span between the first 3 and the last 5:
m1 = df['x'].eq(3)
m2 = df['x'].eq(5)
out = df[(m1|m2)&(m1.cummax()&m2[::-1].cummax())]
Output:
col row x y
2 6 3 3 8
3 9 2 3 4
4 5 3 3 9
6 5 5 5 1
7 3 7 5 2
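To visualize what the two masks do, here is a small sketch on a toy series (hypothetical values, not the question's data):
import pandas as pd

s = pd.Series([1, 3, 4, 3, 5, 6, 5, 2])
m1 = s.eq(3)
m2 = s.eq(5)

print(m1.cummax().tolist())
# [False, True, True, True, True, True, True, True]  -> True from the first 3 onward

print(m2[::-1].cummax().sort_index().tolist())
# [True, True, True, True, True, True, True, False]  -> True up to the last 5

# boolean indexing aligns on the index, so the combined mask keeps
# only the 3s and 5s lying between the first 3 and the last 5
print(s[(m1 | m2) & (m1.cummax() & m2[::-1].cummax())].tolist())
# [3, 3, 5, 5]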
You can create a group id for runs of consecutive values, then filter by the run length and the value of x:
# create unique ids for consecutive groups, then get group length:
group_num = (df.x.shift() != df.x).cumsum()
group_len = group_num.groupby(group_num).transform("count")
# filter main df (copy so the new column is added to an independent frame):
df2 = df[(df.x.isin([3,5])) & (group_len > 1)].copy()
# add new group num col
df2['consecutive-count'] = (df2.x != df2.x.shift()).cumsum()
output:
col row x y consecutive-count
3 6 3 3 8 1
4 9 2 3 4 1
5 5 3 3 9 1
7 5 5 5 1 2
8 3 7 5 2 2
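Put together as a self-contained script (a sketch reconstructing the sample data from the question):
import pandas as pd

df = pd.DataFrame({'col': [1, 5, 2, 6, 9, 5, 4, 5, 3, 6, 5, 3],
                   'row': [1, 7, 2, 3, 2, 3, 9, 5, 7, 6, 8, 7],
                   'x':   [1, 3, 2, 3, 3, 3, 4, 5, 5, 6, 6, 6],
                   'y':   [1, 0, 2, 8, 4, 9, 4, 1, 2, 6, 2, 0]})

# label each run of identical consecutive x values, then measure run lengths
group_num = (df.x.shift() != df.x).cumsum()
group_len = group_num.groupby(group_num).transform('count')

# keep only runs of 3 or 5 longer than one row, then number the surviving runs
df2 = df[(df.x.isin([3, 5])) & (group_len > 1)].copy()
df2['consecutive-count'] = (df2.x != df2.x.shift()).cumsum()
print(df2)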
I have a pandas dataframe with several columns. I want to add a new column containing, for each row, the number of rows that share the same values in two given columns.
For example, suppose I have the following dataframe:
x y
0 1 5
1 2 7
2 3 2
3 7 3
4 2 7
5 6 5
6 5 3
7 2 7
8 2 2
I want to add a third column that contains the number of values for which both x and y are the same. The desired output here would be
x y frequency
0 1 5 1
1 2 7 3
2 3 2 1
3 7 3 1
4 2 7 3
5 6 5 1
6 5 3 1
7 2 7 3
8 2 2 1
For instance, all rows with (x, y) = (2, 7) get a frequency of three because (2, 7) appears three times in the dataframe.
One way to get the output is to create a "hash" (i.e., df['hash'] = df['x'].astype(str) + ',' + df['y'].astype(str) followed by df['frequency'] = df['hash'].map(collections.Counter(df['hash']))), but can we do this directly with group-by? The frequency column is exactly the size of the entry's group in df.groupby(['x', 'y']).
Thanks
IIUC this will work for you:
df['frequency'] = df.groupby(['x','y'])['y'].transform('size')
Output:
x y frequency
0 1 5 1
1 2 7 3
2 3 2 1
3 7 3 1
4 2 7 3
5 6 5 1
6 5 3 1
7 2 7 3
8 2 2 1
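For reference, a self-contained run (a minimal sketch reconstructing the frame from the question):
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3, 7, 2, 6, 5, 2, 2],
                   'y': [5, 7, 2, 3, 7, 5, 3, 7, 2]})

# transform('size') computes each (x, y) group's row count and
# broadcasts it back onto every row of that group
df['frequency'] = df.groupby(['x', 'y'])['y'].transform('size')
print(df)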
I have a pandas data frame and a list -
d={'abc':[0,2,4,5,2,2],'bec':[0,5,6,4,0,2],'def':[7,6,0,1,1,2],'rtr':[5,6,7,2,0,3],'rwr':[5,6,7,1,0,5],'xx':[4,5,6,7,8,7]}
X=pd.DataFrame(d)
abc bec def rtr rwr xx
0 0 0 7 5 5 4
1 2 5 6 6 6 5
2 4 6 0 7 7 6
3 5 4 1 2 1 7
4 2 0 1 0 0 8
5 2 2 2 3 5 7
l=[ 'bec','def','cef','ghd','rtr','fgh','ewr']
Now I want to append the list to the data frame in the following way:
For each row in the dataframe, we count the number of non-zero elements in it (say it is 3 for the first row).
We take 50% of 3 = 1.5 (rounded down to 1) and append that many elements from the list l to the row (starting from the beginning). For the first row that is 'bec'; since 'bec' is already present in the row, we increase its count by 1.
If an element from the list is not present in the dataframe, we append it as a new column.
Dry run: for row 1 (index 1), the number of non-zero elements is 6, so 50% of it is 3, and we take the first 3 elements of the list: ['bec', 'def', 'cef']. 'bec' is already present, so its count increases by 1 and the cell at (1, 'bec') becomes 6. Similarly, 'def' is present, so (1, 'def') becomes 7. 'cef' isn't present in the dataframe, so we add it and set its count to 1.
The final output looks like this-
abc bec def rtr rwr xx cef
0 0 1 8 5 5 4 0
1 2 6 7 6 6 5 1
2 4 7 1 7 7 6 0
3 5 5 2 2 1 7 1
4 2 1 1 0 0 8 0
5 2 1 1 3 5 7 1
We can use ne + sum along axis=1 to count the non-zero values in each row, followed by floordiv with 2 to keep only 50% of these counts. Next, create one record per row with the help of dict.fromkeys inside a generator expression, build a dataframe (let's say y) from these records, and add it to X to get the desired result:
y = pd.DataFrame(dict.fromkeys(l[:i], 1)
                 for i in X.ne(0).sum(1).floordiv(2).astype(int))
X.add(y.fillna(0), fill_value=0).astype(int)
abc bec cef def rtr rwr xx
0 0 1 0 8 5 5 4
1 2 6 1 7 6 6 5
2 4 7 0 1 7 7 6
3 5 5 1 2 2 1 7
4 2 1 0 1 0 0 8
5 2 3 1 3 3 5 7
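To see what each record looks like, dict.fromkeys(l[:i], 1) maps the first i list elements to a count of 1, and pd.DataFrame aligns the varying keys across rows. A small sketch:
import pandas as pd

l = ['bec', 'def', 'cef', 'ghd', 'rtr', 'fgh', 'ewr']

# one record per row: the first i elements of l, each with count 1
print(dict.fromkeys(l[:3], 1))   # {'bec': 1, 'def': 1, 'cef': 1}

# pd.DataFrame aligns the differing keys, leaving NaN where a key is absent
y = pd.DataFrame([dict.fromkeys(l[:1], 1), dict.fromkeys(l[:3], 1)])
print(y)
#    bec  def  cef
# 0    1  NaN  NaN
# 1    1  1.0  1.0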
Having the following DF:
A B
0 1 2
1 1 2
2 4 3
3 4 3
4 5 6
5 5 6
6 5 6
After grouping by column A I get 3 groups:
(1, A B
0 1 2
1 1 2)
(4, A B
2 4 3
3 4 3)
(5, A B
4 5 6
5 5 6
6 5 6)
I would like to count the groups whose size differs from a specific row count. For example, an input of 2 should output 1, because only one group has a different size (the three rows with A = 5), while an input of 3 should output 2, for the other two groups.
What is the Pandas solution for such a task?
I think you need Series.value_counts, test for inequality with Series.ne, and then count the number of True values with sum:
N = 2
a = df['A'].value_counts().ne(N).sum()
print(a)
1
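For reference, a self-contained run (a sketch reconstructing the sample frame):
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 4, 4, 5, 5, 5],
                   'B': [2, 2, 3, 3, 6, 6, 6]})

N = 2
# group sizes are {1: 2, 4: 2, 5: 3}; count how many differ from N
print(df['A'].value_counts().ne(N).sum())  # 1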
You can count values, then filter on B:
counts = df.groupby('A').count()
count_input = 2
print(len(counts[counts['B'] != count_input]))
I'm having trouble working out how to add the index value of a pandas dataframe to each value at that index. For example, if I have a dataframe of zeroes, the row with index 1 should have a value of 1 for all columns. The row at index 2 should have values of 2 for each column, and so on.
Can someone enlighten me please?
You can use pd.DataFrame.add with axis=0. Just remember, as below, to convert your index to a series first.
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 10, (5, 5)))
print(df)
0 1 2 3 4
0 3 4 2 2 2
1 9 6 1 8 0
2 2 9 0 5 3
3 3 1 1 7 0
4 2 6 3 6 6
df = df.add(df.index.to_series(), axis=0)
print(df)
0 1 2 3 4
0 3 4 2 2 2
1 10 7 2 9 1
2 4 11 2 7 5
3 6 4 4 10 3
4 6 10 7 10 10
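An equivalent idea, if you prefer numpy broadcasting over to_series (a sketch, not from the answer above): reshape the index into a column vector and let the addition broadcast across the columns.
import numpy as np
import pandas as pd

df = pd.DataFrame(np.zeros((3, 4), dtype=int))

# the index as a (3, 1) column vector broadcasts across all four columns
out = df + df.index.to_numpy()[:, None]
print(out)
#    0  1  2  3
# 0  0  0  0  0
# 1  1  1  1  1
# 2  2  2  2  2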