Count number of rows per group in pandas dataframe - python

I have a data frame like so:
canopy speed
0 1 3.3
1 2 3.3
2 2 3.1
3 2 3.1
4 2 3.1
5 2 3.0
6 2 3.0
7 2 3.5
8 2 3.5
I want to count the number of rows (observations) for each combination of canopy and speed and plot it. I expect the plot to show something like this:
canopy = 2:
3.3 1
3.1 3
3.0 2
3.5 2

You could do:
df.groupby('canopy')['speed'].value_counts().unstack('canopy').plot.bar()
This gives you some options, for example normalizing within each group (to get frequency instead of count):
(df
 .groupby('canopy')['speed']
 .value_counts(normalize=True)
 .unstack('canopy').plot.bar()
)
And, of course, you can control the rounding of the speed values (as @QuangHoang rightly mentioned, it is not a good idea to group on floats, to which I would add: at least not without some rounding):
(df
 .assign(speed=df['speed'].round(0))
 .groupby('canopy')['speed']
 .value_counts()
 .unstack('canopy').plot.bar()
)

Try:
df.canopy.eq(2).groupby(df['speed']).sum()
Output:
speed
3.0 2
3.1 3
3.3 1
3.5 2
Name: canopy, dtype: int64
Note it's really not a good idea to group on floats.
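If the measured speeds carry floating-point noise (e.g. 3.0999999 instead of 3.1), rounding to a fixed precision before counting avoids near-duplicate group keys. A minimal, self-contained sketch of that idea (the one-decimal precision is an assumption; pick whatever suits your data):

import pandas as pd

df = pd.DataFrame({"canopy": [1, 2, 2, 2, 2, 2, 2, 2, 2],
                   "speed": [3.3, 3.3, 3.1, 3.1, 3.1, 3.0, 3.0, 3.5, 3.5]})

counts = (df.assign(speed=df["speed"].round(1))   # snap speeds to one decimal place
            .groupby("canopy")["speed"]
            .value_counts()
            .unstack("canopy"))                   # rows: speed, columns: canopy
counts.plot.bar()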

Related

Smoothing Categorical Output

I have a list of outputs obtained from a cow behavior detection model. Even when a cow in a video is lying, the model often identifies it as standing, and vice versa. In each video frame a classification result is given by the model and we append it to a list. Let's assume that after 20 frames we have the following series of outputs:
behavious_cow_1 = ["stand","stand","stand","stand","lying","stand","stand","eating","stand","stand","stand","stand","lying","stand","stand","stand","stand","stand","stand","lying"]
Out of 20 classification results we have 4 misclassifications: 3 lying and 1 eating. However, the whole time the cow was sitting in one place. If the list only contained numerical values like 1, 2, 3, ..., I would have opted for a moving average to fix the misclassifications. Is there any SciPy, pandas or NumPy function that can smooth categorical output? I am thinking about taking the previous 3 and next 3 values to determine the current category.
I used the following solution -
import scipy.stats
window_length = 7
behave = ["stand","stand","stand","stand","lying","lying", "eating"]
most_freq_val = lambda x: scipy.stats.mode(x)[0][0]
smoothed = [most_freq_val(behave[i:i+window_length]) for i in range(0,len(behave)-window_length+1)]
I tried the solution posted by Hugolmn but it broke at one point. In the rolling mode, the window width is provided by the user (7 here). Within a given window, if more than one value occurs the same number of times, the code does not work: it is as if you tried to find the statistical mode (most common item) of a list and got more than one item tied for the highest frequency.
I am myself very surprised that a function such as mode() does not work with a rolling window in pandas. However, I still found a decent solution to your problem.
First, create a pandas Series with categorical datatype:
df = pd.Series(sample, dtype='category')  # 'sample' is the list of behavior labels from the question
Now you can see that df.cat.categories returns the list of categories in your data, and df.cat.codes the codes associated with them. We can use the latter to apply a rolling mode with a width of 7 (the 3 previous values, the value itself, and the 3 next):
df.cat.codes
0 2
1 2
2 2
3 2
4 1
5 2
6 2
7 0
8 2
9 2
10 2
11 2
12 1
13 2
14 2
15 2
16 2
17 2
18 2
19 1
dtype: int8
df.cat.codes.rolling(7, center=True, min_periods=0).apply(lambda x: x.mode())
0 2.0
1 2.0
2 2.0
3 2.0
4 2.0
5 2.0
6 2.0
7 2.0
8 2.0
9 2.0
10 2.0
11 2.0
12 2.0
13 2.0
14 2.0
15 2.0
16 2.0
17 2.0
18 2.0
19 2.0
dtype: float64
Finally, you can map the codes to get the strings back:
(df.cat.codes
 .rolling(7, center=True, min_periods=0)
 .apply(lambda x: x.mode())
 .map(dict(enumerate(df.cat.categories)))
)
0 stand
1 stand
2 stand
3 stand
4 stand
5 stand
6 stand
7 stand
8 stand
9 stand
10 stand
11 stand
12 stand
13 stand
14 stand
15 stand
16 stand
17 stand
18 stand
19 stand
dtype: object
And there you go! You recovered your strings after applying a rolling mode to their codes!
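If you hit the tie problem described above (x.mode() returning more than one value inside the rolling apply), one possible workaround, sketched here under the same 7-wide centered-window assumption and not taken from the original answers, is to run the rolling step on the category codes with numpy.bincount(...).argmax(), which always returns a single value:

import numpy as np
import pandas as pd

labels = ["stand", "stand", "lying", "stand", "eating", "stand", "stand"]  # any list of labels
s = pd.Series(labels, dtype="category")

smoothed = (s.cat.codes
             .rolling(7, center=True, min_periods=1)
             .apply(lambda x: np.bincount(x.astype(int)).argmax(), raw=False)  # single value even on ties
             .astype(int)
             .map(dict(enumerate(s.cat.categories))))

On a tie this picks the alphabetically first category (the lowest code); any other tie-breaking rule can be swapped into the lambda.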

Clean up this int64 variable in python

This is the raw distribution of the variable FREQUENCY:
NaN 22131161
1.0 4182626
7.0 218343
3.0 145863
1 59432
0.0 29906
2.0 28129
4.0 15237
5.0 4553
8.0 3617
3 2754
7 2635
9.0 633
2 584
4 276
0 112
8 51
5 42
6.0 19
A 9
I 7
9 6
Q 3
Y 2
X 2
Z 1
C 1
N 1
G 1
B 1
Name: FREQUENCY, dtype: int64
Group 1.0 should be the same as group 1. I wrote df['x'] = df['x'].replace({'1.0': '1'}) but it does not change anything; 9.0 vs 9 and 3.0 vs 3 show the same symptom.
How could FREQUENCY be rendered as int64 when letters are present?
Desired outcome 1: group all letter groups plus NaN into one group, and consolidate the remaining numeric value groups (1.0 and 1 become 1, for example). In SAS I would just run y = 1*X and give a value of 10 to represent the character groups plus NaN. How do I do this in Python, ideally elegantly?
Desired outcome 2: extract a binary variable z = 1 if x is NaN, otherwise z = 0.
The first issue ("group 1.0 should be the same as 1...") was fixed once I added dtype={'FREQUENCY': 'object'} while reading the csv file. Group 1.0 collapsed into group 1, and after that replace works just fine.
All other issues are pretty much resolved, except issue 2: the type is still reported as int64 even though character values are present. My guess is that Python adopts a majority rule to decide the data type, and it is indeed true that numeric values dominate the count.
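For reference, a minimal sketch of one way to get both desired outcomes directly in pandas; it is not from the original thread and assumes the column was read as strings (e.g. with dtype={'FREQUENCY': 'object'}):

import numpy as np
import pandas as pd

# Hypothetical sample mixing numeric strings, letters and missing values.
freq = pd.Series(['1.0', '1', '7.0', 'A', np.nan, '3', '9.0'], name='FREQUENCY')

# Outcome 1: coerce to numbers (letters become NaN), then code letters + NaN as 10.
numeric = pd.to_numeric(freq, errors='coerce')   # '1.0' and '1' both become 1.0
outcome1 = numeric.fillna(10).astype(int)

# Outcome 2: binary flag that is 1 where the original value was NaN, 0 otherwise.
outcome2 = freq.isna().astype(int)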

How to take difference in a specific dataframe

I'm trying to take differences of consecutive numbers in one of the dataframe columns, while preserving the ordering given by another column, for example:
import pandas as pd
df = pd.DataFrame({"A": [1,1,1,2,2,2,3,3,3,4],
                   "B": [2,1,3,3,2,1,1,2,3,4],
                   "C": [2.1,2.0,2.2,1.2,1.1,1.0,3.0,3.1,3.2,3.3]})
In [1]: df
Out[1]:
A B C
0 1 2 2.1
1 1 1 2.0
2 1 3 2.2
3 2 3 1.2
4 2 2 1.1
5 2 1 1.0
6 3 1 3.0
7 3 2 3.1
8 3 3 3.2
9 4 4 3.3
I would like to:
- for each distinct element of column A (1, 2, 3, and 4)
- sort by column B and take consecutive differences of column C
without a loop, to get something like this:
In [2]: df2
Out[2]:
A B C Diff
0 1 2 2.1 0.1
2 1 3 2.2 0.1
3 2 3 1.2 0.1
4 2 2 1.1 0.1
7 3 2 3.1 0.1
8 3 3 3.2 0.1
I have run a number of operations:
df2 = df.groupby(by='A').apply(lambda x: x.sort_values(by = ['B'])['C'].diff())
df3 = pd.DataFrame(df2)
df3.reset_index(inplace=True)
df4 = df3.set_index('level_1')
df5 = df.copy()
df5['diff'] = df4['C']
and got what I wanted:
df5
Out[1]:
A B C diff
0 1 2 2.1 0.1
1 1 1 2.0 NaN
2 1 3 2.2 0.1
3 2 3 1.2 0.1
4 2 2 1.1 0.1
5 2 1 1.0 NaN
6 3 1 3.0 NaN
7 3 2 3.1 0.1
8 3 3 3.2 0.1
9 4 4 3.3 NaN
but is there a more efficient way of doing so?
(NaN values can be easily removed so I'm not fussy about that part)
A little unclear on what is expected as the result (why are there fewer rows?).
For taking the consecutive differences you probably want to use Series.diff() (see docs here)
df['Diff'] = df.C.diff()
You can use the periods keyword if you want some (positive or negative) lag when taking the differences.
I don't see where the sorting comes into play, but for that you probably want to use Series.sort_values() (see docs here)
EDIT
Based on your updated information, I believe this may be what you are looking for:
df.sort_values(by=['B', 'C'], inplace=True)
df['diff'] = df.C.diff()
EDIT 2
Based on your new updated information about the calculation, you want to:
- groupby by A (see docs on DataFrame.groupby() here)
- sort (each group) by B (or presort by A then B, prior to groupby)
- calculate differences of C (and dismiss the first record since it will be missing).
The following code achieves that:
df.sort_values(by=['A','B'], inplace=True)
df['Diff'] = df.groupby('A').apply(lambda x: x['C'].diff()).values
df2 = df.dropna()
Explanation of the code:
The first line sorts the dataframe.
The second line has a bunch of things going on:
first, groupby (which generates a grouped DataFrame; see the helpful pandas page on split-apply-combine if you're new to groupby),
then obtain the differences of C for each group,
and "flatten" the grouped result by taking the underlying array with .values,
which we assign to df['Diff'] (that is why we needed to presort the dataframe, so this assignment lines up correctly; otherwise we would have to merge the series on A and B).
The third line just removes the NAs and assigns that to df2.
EDIT 3
I think my EDIT 2 version may be what you are looking for; it is a bit more concise and generates less auxiliary data. However, you can also improve your own version of the solution a little:
df3.reset_index(level=0, inplace=True) # no need to reset and then set again
df5 = df.copy() # only if you don't want to change df
df5['diff'] = df3.C # else, just do df.insert(2, 'diff', df3.C)
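A more concise variant (a sketch, not from the original answers) relies on the fact that groupby('A')['C'].diff() keeps the original index, so the result can be assigned back directly without going through .values:

import pandas as pd

df = pd.DataFrame({"A": [1,1,1,2,2,2,3,3,3,4],
                   "B": [2,1,3,3,2,1,1,2,3,4],
                   "C": [2.1,2.0,2.2,1.2,1.1,1.0,3.0,3.1,3.2,3.3]})

df = df.sort_values(['A', 'B'])           # order rows within each A-group by B
df['Diff'] = df.groupby('A')['C'].diff()  # index-aligned, no .values needed
df2 = df.dropna(subset=['Diff'])          # drop the first row of each group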

Conditional column arithmetic in pandas dataframe

I have a pandas dataframe with the following structure:
import numpy as np
import pandas as pd
myData = pd.DataFrame({'x': [1.2,2.4,5.3,2.3,4.1], 'y': [6.7,7.5,8.1,5.3,8.3], 'condition':[1,1,np.nan,np.nan,1],'calculation': [np.nan]*5})
print(myData)
calculation condition x y
0 NaN 1 1.2 6.7
1 NaN 1 2.4 7.5
2 NaN NaN 5.3 8.1
3 NaN NaN 2.3 5.3
4 NaN 1 4.1 8.3
I want to enter a value in the 'calculation' column based on the values in 'x' and 'y' (e.g. x/y), but only in those cells where the 'condition' column contains NaN (np.isnan(myData['condition'])). The final dataframe should look like this:
calculation condition x y
0 NaN 1 1.2 6.7
1 NaN 1 2.4 7.5
2 0.654 NaN 5.3 8.1
3 0.434 NaN 2.3 5.3
4 NaN 1 4.1 8.3
I'm happy with the idea of stepping through each row in turn using a 'for' loop and then using 'if' statements to make the calculations, but the actual dataframe I have is very large and I want to do the calculations in an array-based way. Is this possible? I guess I could calculate the value for all rows and then delete the ones I don't want, but this seems like a lot of wasted effort (the NaNs are quite rare in the dataframe) and, in some cases where 'condition' equals 1, the calculation cannot be made due to division by zero.
Thanks in advance.
Use where and pass your condition to it; this will then only perform your calculation on the rows that meet the condition:
In [117]:
myData['calculation'] = (myData['x']/myData['y']).where(myData['condition'].isnull())
myData
Out[117]:
calculation condition x y
0 NaN 1 1.2 6.7
1 NaN 1 2.4 7.5
2 0.654321 NaN 5.3 8.1
3 0.433962 NaN 2.3 5.3
4 NaN 1 4.1 8.3
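An equivalent mask-based alternative (a sketch using .loc on the same myData frame, not part of the original answer) evaluates the ratio only on the rows where condition is NaN, which also sidesteps the division-by-zero concern for the other rows:

mask = myData['condition'].isnull()
myData.loc[mask, 'calculation'] = myData.loc[mask, 'x'] / myData.loc[mask, 'y']

Both approaches are vectorised; .loc simply restricts the arithmetic to the masked rows instead of computing it everywhere and keeping only part of the result.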
EdChum's answer worked well for me! Still, I wanted to extend this thread as I think it will be useful for other people.
Let's assume your dataframe is
c x y
0 1 1.2 6.7
1 1 2.4 7.5
2 0 5.3 8.1
3 0 2.3 5.3
4 1 4.1 8.3
and you would like to update 0s in column c with associated x/y.
c x y
0 1 1.2 6.7
1 1 2.4 7.5
2 0.65 5.3 8.1
3 0.43 2.3 5.3
4 1 4.1 8.3
You can do
myData['c'] = (myData['x']/myData['y']).where(cond=myData['c']==0, other=myData['c'])
or
myData['c'].where(cond=myData['c'] != 0, other=myData['x']/myData['y'], inplace=True)
In both cases, wherever 'cond' is not satisfied, the value from 'other' is used. In the second code snippet the inplace flag also works nicely (as it would in the first snippet, too).
I found these solutions in the pandas official documentation on "where" and on "indexing".
These kinds of operations are exactly what I need most of the time. I am new to pandas and it took me a while to find this useful thread. Could anyone recommend some comprehensive tutorials to practice these types of operations? I need to "filter/groupby/slice a dataframe, then apply different functions/operations to each group/slice separately or all at once, and keep it all in place." Cheers!

CSV data - max values for segments of columns using numpy

So let's say I have a csv file with data like so:
'time' 'speed'
0 2.3
0 3.4
0 4.1
0 2.1
1 1.3
1 3.5
1 5.1
1 1.1
2 2.3
2 2.4
2 4.4
2 3.9
I want to be able to return this file so that for each increasing number under the header 'time', I find the max number in the column 'speed' and return it next to the corresponding time value in an array. The actual csv file I'm using is a lot larger, so I want to iterate over a big mass of data and not just run it where 'time' is 0, 1, or 2.
So basically I want this to return:
array([[0, 4.1], [1, 5.1], [2, 4.4]])
Using numpy specifically.
This is a bit tricky to get done in a fully vectorised way in NumPy. Here's one option:
a = numpy.genfromtxt("a.csv", names=["time", "speed"], skip_header=1)
a.sort()
unique_times = numpy.unique(a["time"])
indices = a["time"].searchsorted(unique_times, side="right") - 1
result = a[indices]
This will load the data into a one-dimensional array with two fields and sort it first. The result is an array whose data is grouped by time, with the biggest speed value always being the last in each group. We then determine the unique time values that occur and find the rightmost entry in the array for each time value.
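If you need the plain two-column array from the question rather than the structured (record) array that genfromtxt produces, one way, sketched here on top of the result above, is to stack the two fields:

max_per_time = numpy.column_stack((result["time"], result["speed"]))
# -> array of shape (3, 2): [[0., 4.1], [1., 5.1], [2., 4.4]]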
pandas fits nicely for this kind of stuff:
>>> from io import StringIO
>>> import pandas as pd
>>> df = pd.read_table(StringIO("""\
... time speed
... 0 2.3
... 0 3.4
... 0 4.1
... 0 2.1
... 1 1.3
... 1 3.5
... 1 5.1
... 1 1.1
... 2 2.3
... 2 2.4
... 2 4.4
... 2 3.9
... """), delim_whitespace=True)
>>> df
time speed
0 0 2.3
1 0 3.4
2 0 4.1
3 0 2.1
4 1 1.3
5 1 3.5
6 1 5.1
7 1 1.1
8 2 2.3
9 2 2.4
10 2 4.4
11 2 3.9
[12 rows x 2 columns]
once you have the dataframe, all you need is to group by time and aggregate by the maximum of speed:
>>> df.groupby('time')['speed'].aggregate(max)
time
0 4.1
1 5.1
2 4.4
Name: speed, dtype: float64
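And if you want exactly the array asked for in the question rather than a Series, a short sketch (keeping time as a column instead of the index; on older pandas use .values in place of .to_numpy()):

arr = df.groupby('time', as_index=False)['speed'].max().to_numpy()
# -> array of shape (3, 2): [[0., 4.1], [1., 5.1], [2., 4.4]]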
