I am trying to do something like this, but on a much larger dataframe (called Clean):
import pandas as pd
from numpy import nan as NaN

d = {'rx':   [1, 1, 1, 1, 2.1, 2.1, 2.1, 2.1],
     'vals': [NaN, 10, 10, 20, NaN, 10, 20, 20]}
df = pd.DataFrame(d)
arrays = [df.rx, df.vals]
index = pd.MultiIndex.from_arrays(arrays, names=['rx', 'vals'])
df.index = index
Hist = df.groupby(level=('rx', 'vals'))
Hist.count('vals')
This seems to work just fine, but when I run the same concept on even a subset of the Clean dataframe (substituting a column 'LagBin' for 'vals') I get an error:
df1 = pd.DataFrame(data=Clean, columns=('rx', 'LagBin'))
df1 = df1.head(n=20)
arrays = [df1.rx, df1.LagBin]
index = pd.MultiIndex.from_arrays(arrays, names=['rx', 'LagBin'])
df1.index = index
Hist = df1.groupby(level=('rx', 'LagBin'))
Hist.count('LagBin')
Specifically, Hist.count('LagBin') produces a ValueError:
ValueError: Cannot convert NA to integer
I have looked at the data structures and they seem exactly the same.
Here is the data that produces the error:
rx      LagBin
139.1   NaN
139.1   0
139.1   0
139.1   0
141.1   NaN
141.1   10
141.1   20
193     NaN
193     50
193     20
193     3600
193     50
193     0
193     20
193     10
193     110
193     80
193     460
193     30
193     0
while the original routine that works produces this:
rx    vals
1     NaN
1     10
1     10
1     20
2.1   NaN
2.1   10
2.1   20
2.1   20
What is different about these datasets that produces this error?
If I'm understanding your question correctly I believe what you want is:
Hist.agg(len).dropna()
The full code implementation looks like this:
import pandas as pd
from numpy import nan

d = {'rx': [139.1,139.1,139.1,139.1,141.1,141.1,141.1,193,193,193,193,193,193,193,193,193,193,193,193,193],
     'vals': [nan,0,0,0,nan,10,20,nan,50,20,3600,50,0,20,10,110,80,460,30,0]}
df = pd.DataFrame(d)
arrays = [df.rx, df.vals]
index = pd.MultiIndex.from_arrays(arrays, names=['rx', 'vals'])
df.index = index
Hist = df.groupby(level=('rx', 'vals'))
print(Hist.agg(len).dropna())
Where df looks like:
rx vals
rx vals
139.1 NaN 139.1 NaN
0 139.1 0
0 139.1 0
0 139.1 0
141.1 NaN 141.1 NaN
10 141.1 10
20 141.1 20
193.0 NaN 193.0 NaN
50 193.0 50
20 193.0 20
3600 193.0 3600
50 193.0 50
0 193.0 0
20 193.0 20
10 193.0 10
110 193.0 110
80 193.0 80
460 193.0 460
30 193.0 30
0 193.0 0
And the line Hist.agg(len).dropna() looks like:
rx vals
rx vals
139.1 0 3 3
141.1 10 1 1
20 1 1
193.0 0 2 2
10 1 1
20 2 2
30 1 1
50 2 2
80 1 1
110 1 1
460 1 1
3600 1 1
That looks right. I have been tinkering with groupby and came up with this solution, which seems more elegant and does not require explicitly dealing with the NAs:
df1 = pd.DataFrame(data=Clean, columns=('rx', 'LagBin'))
df1 = df1.head(n=20)
LagCount = df1["rx"].groupby((df1["rx"], df1["LagBin"])).count().reset_index(name="Count")
print(LagCount)
which gives me:
rx LagBin Count
0 139.1 0 3
1 141.1 10 1
2 141.1 20 1
3 193.0 0 2
4 193.0 10 1
5 193.0 20 2
6 193.0 30 1
7 193.0 50 2
8 193.0 80 1
9 193.0 110 1
10 193.0 460 1
11 193.0 3600 1
I like this better, because I retain values as values and not indices, which I assume will make life easier later for plotting.
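A quick sketch (on a few made-up rows) of why this matters: the agg-style route leaves rx/LagBin in a MultiIndex, while the reset_index route keeps them as ordinary columns that a plotting call can reference directly.

```python
import pandas as pd

# Toy frame standing in for df1 (values are made up for illustration)
df1 = pd.DataFrame({
    "rx": [139.1, 139.1, 139.1, 141.1, 141.1],
    "LagBin": [0, 0, 10, 10, 20],
})

# Group keys end up as index levels with the agg-style approach...
as_index = df1.groupby(["rx", "LagBin"]).size()
print(as_index.index.names)     # ['rx', 'LagBin']

# ...but as plain columns with reset_index, ready for e.g. df.plot(x=..., y=...)
as_cols = df1.groupby(["rx", "LagBin"]).size().reset_index(name="Count")
print(list(as_cols.columns))    # ['rx', 'LagBin', 'Count']
```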
I have a dataframe with a column called savings, which holds both positive and negative values. I want to create a new column called flag_negative: if the savings value is negative, assign 1 in the new column; if it is positive, assign 0. The savings column also has missing values, which I don't want to touch; leave them as they are.
I would like to use a loop or any other easy method.
My dataframe's name is df.
I want to get the following:
Number of rows: 9000
savings flag_negative
100 0
-76 1
1200 0
-
-
-200 1
500 0
I tried a loop and created the new column as flag_negative, but I am getting None for all the rows.
Below is my code:
for i in sum['savings']:
    if i > 0:
        sum['flag_negative'] = print(0)
    elif i == " ":
        sum['flag_negative'] = print(" ")
    else:
        sum['flag_negative'] = print(1)
If your dataframe is like this:
savings
0 -4
1 -41
2 174
3 -103
4 -194
5 -160
6 126
7 100
8 -125
9 -71
10 -159
11 -100
12 -30
13 -50
14 83
15 124
16 -123
17 -70
18 -71
19 -29
then you can easily filter on positive/negative and assign to a new column like this:
df.loc[df.savings < 0, 'flag_negative'] = 1
df.loc[df.savings >= 0, 'flag_negative'] = 0
resulting in:
savings flag_negative
0 -4 1.0
1 -41 1.0
2 174 0.0
3 -103 1.0
4 -194 1.0
5 -160 1.0
6 126 0.0
7 100 0.0
8 -125 1.0
9 -71 1.0
10 -159 1.0
11 -100 1.0
12 -30 1.0
13 -50 1.0
14 83 0.0
15 124 0.0
16 -123 1.0
17 -70 1.0
18 -71 1.0
19 -29 1.0
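If the float 1.0/0.0 output is a concern, a variant sketch (assuming a pandas version with the nullable Int64 dtype, i.e. 1.0+) keeps the flags as integers and the missing savings rows as missing:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"savings": [100, -76, 1200, np.nan, -200, 500]})

# Boolean test -> nullable integer, then blank out rows where savings is missing.
df["flag_negative"] = (df["savings"] < 0).astype("Int64").where(df["savings"].notna())
print(df)
```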
I have a dataframe like this:
df = pd.DataFrame({'dir': [1,1,1,1,0,0,1,1,1,0], 'price':np.random.randint(100,200,10)})
dir price
0 1 100
1 1 150
2 1 190
3 1 194
4 0 152
5 0 151
6 1 131
7 1 168
8 1 112
9 0 193
and I want a new column that shows the maximum price for each run of dir == 1, resetting whenever dir is 0.
My desired outcome looks like this:
dir price max
0 1 100 194
1 1 150 194
2 1 190 194
3 1 194 194
4 0 152 NaN
5 0 151 NaN
6 1 131 168
7 1 168 168
8 1 112 168
9 0 193 NaN
Use transform with max for filtered rows:
# get unique group labels for runs of consecutive values
g = df['dir'].ne(df['dir'].shift()).cumsum()
# filter only dir == 1
m = df['dir'] == 1

df['max'] = df[m].groupby(g)['price'].transform('max')
print(df)
dir price max
0 1 100 194.0
1 1 150 194.0
2 1 190 194.0
3 1 194 194.0
4 0 152 NaN
5 0 151 NaN
6 1 131 168.0
7 1 168 168.0
8 1 112 168.0
9 0 193 NaN
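An equivalent formulation (a sketch on the same sample data, with fixed prices in place of the random ones) computes the per-run max for every row and then blanks the dir == 0 rows with Series.where, avoiding the filtered-frame groupby:

```python
import pandas as pd

df = pd.DataFrame({'dir': [1, 1, 1, 1, 0, 0, 1, 1, 1, 0],
                   'price': [100, 150, 190, 194, 152, 151, 131, 168, 112, 193]})

g = df['dir'].ne(df['dir'].shift()).cumsum()  # label consecutive runs
m = df['dir'] == 1

# Per-run max everywhere, then NaN out the dir == 0 rows.
df['max'] = df.groupby(g)['price'].transform('max').where(m)
print(df)
```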
I'm trying to create a new column, let's call it "HomeForm", that holds the sum of the last 5 values of "FTHG" for each entry in the "HomeTeam" column.
Say for Team 0, the idea would be to populate the cell in the new column with the sum of the last 5 values of "FTHG" that correspond to Team 0. The table is ordered by date.
How can this be done in Python?
HomeTeam FTHG HomeForm
Date
136 0 4
135 2 0
135 4 2
135 5 0
135 6 1
135 13 0
135 17 3
135 18 1
134 11 4
134 12 0
128 1 0
128 3 0
128 8 2
128 9 1
128 13 3
128 14 1
128 15 0
127 7 1
127 16 1
126 10 1
Thanks.
You'll groupby on HomeTeam and perform a rolling sum here, summing for a minimum of 1 period, and maximum of 5.
First, define a function -
def f(x):
    return x.shift().rolling(window=5, min_periods=1).sum()
This function performs the rolling sum of the previous 5 games (hence the shift). Pass this function to GroupBy.transform -
df['HomeForm'] = df.groupby('HomeTeam', sort=False).FTHG.transform(f)
df
HomeTeam FTHG HomeForm
Date
136 0 4 NaN
135 2 0 NaN
135 4 2 NaN
135 5 0 NaN
135 6 1 NaN
135 13 0 NaN
135 17 3 NaN
135 18 1 NaN
134 11 4 NaN
134 12 0 NaN
128 1 0 NaN
128 3 0 NaN
128 8 2 NaN
128 9 1 NaN
128 13 3 0.0
128 14 1 NaN
128 15 0 NaN
127 7 1 NaN
127 16 1 NaN
126 10 1 NaN
If needed, fill the NaNs with zeros and convert to integer -
df['HomeForm'] = df['HomeForm'].fillna(0).astype(int)
New to pandas, I'm trying to sum up all previous values of a column. In SQL I did this by joining the table to itself, so I've been taking the same approach in pandas, but having some issues.
Original Data Frame
TeamName PlayerCount Goals CalMonth
0 A 25 126 1
1 A 25 100 2
2 A 25 156 3
3 B 22 205 1
4 B 30 300 2
5 B 28 189 3
Code
prev_month = np.where(df3['CalMonth'] == 12, df3['CalMonth'] - 11, df3['CalMonth'] + 1)
df4 = pd.merge(df3, df3, how='left', left_on=['TeamName','CalMonth'], right_on=['TeamName', prev_month])
print(df4.head(20))
Output
TeamName PlayerCount_x Goals_x CalMonth_x
0 A 25 126 1
1 A 25 100 2
2 A 25 156 3
3 B 22 205 1
4 B 22 300 2
5 B 22 189 3
PlayerCount_y Goals_y CalMonth_y
NaN NaN NaN
25 126 1
25 100 2
22 NaN NaN
22 205 1
22 100 2
The output is what I had in mind, but what I want now is to create a column that is YTD and sum up all Goals from previous months. Here are my desired results (can either include the current month or not, that can be done in an additional step):
TeamName PlayerCount_x Goals_x CalMonth_x
0 A 25 126 1
1 A 25 100 2
2 A 25 156 3
3 B 22 205 1
4 B 22 300 2
5 B 22 189 3
PlayerCount_y Goals_y CalMonth_y Goals_YTD
NaN NaN NaN NaN
25 126 1 126
25 100 2 226
22 NaN NaN NaN
22 205 1 205
22 100 2 305
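For what it's worth, the running total can also be computed without the self-join: a per-team cumulative sum shifted by one row gives the prior-months total (a sketch on the sample frame from the top of the question; dropping the shift includes the current month instead):

```python
import pandas as pd

df3 = pd.DataFrame({
    'TeamName':    ['A', 'A', 'A', 'B', 'B', 'B'],
    'PlayerCount': [25, 25, 25, 22, 30, 28],
    'Goals':       [126, 100, 156, 205, 300, 189],
    'CalMonth':    [1, 2, 3, 1, 2, 3],
})

# Cumulative goals through the *previous* month, per team.
df3['Goals_YTD'] = (df3.groupby('TeamName')['Goals']
                       .transform(lambda s: s.cumsum().shift(1)))
print(df3)
```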
The goal here is to see how many unique values I have in my database. This is the code I have written:
apps = pd.read_csv('ConcatOwned1_900.csv', sep='\t', usecols=['appid'])
apps['appid'] = apps['appid'].astype(int)
apps_list = apps['appid'].unique()
b = apps.groupby('appid').size()
blist = b.unique()
print(len(apps_list), len(blist), len(set(b)))
>>>7672 2164 2164
Why is there a difference between these two methods?
By request, I am posting some of my data:
Unnamed: 0 StudID No appid work work2
0 0 76561193665298433 0 10 nan 0
1 1 76561193665298433 1 20 nan 0
2 2 76561193665298433 2 30 nan 0
3 3 76561193665298433 3 40 nan 0
4 4 76561193665298433 4 50 nan 0
5 5 76561193665298433 5 60 nan 0
6 6 76561193665298433 6 70 nan 0
7 7 76561193665298433 7 80 nan 0
8 8 76561193665298433 8 100 nan 0
9 9 76561193665298433 9 130 nan 0
10 10 76561193665298433 10 220 nan 0
11 11 76561193665298433 11 240 nan 0
12 12 76561193665298433 12 280 nan 0
13 13 76561193665298433 13 300 nan 0
14 14 76561193665298433 14 320 nan 0
15 15 76561193665298433 15 340 nan 0
16 16 76561193665298433 16 360 nan 0
17 17 76561193665298433 17 380 nan 0
18 18 76561193665298433 18 400 nan 0
19 19 76561193665298433 19 420 nan 0
20 20 76561193665298433 20 500 nan 0
21 21 76561193665298433 21 550 nan 0
22 22 76561193665298433 22 620 6.0 3064
33 33 76561193665298434 0 10 nan 837
34 34 76561193665298434 1 20 nan 27
35 35 76561193665298434 2 30 nan 9
36 36 76561193665298434 3 40 nan 5
37 37 76561193665298434 4 50 nan 2
38 38 76561193665298434 5 60 nan 0
39 39 76561193665298434 6 70 nan 403
40 40 76561193665298434 7 130 nan 0
41 41 76561193665298434 8 80 nan 6
42 42 76561193665298434 9 100 nan 10
43 43 76561193665298434 10 220 nan 14
IIUC, based on the attached piece of the dataframe, it seems you should analyze b.index, not the values of b. Just look:
b = apps.groupby('appid').size()
In [24]: b
Out[24]:
appid
10 2
20 2
30 2
40 2
50 2
60 2
70 2
80 2
100 2
130 2
220 2
240 1
280 1
300 1
320 1
340 1
360 1
380 1
400 1
420 1
500 1
550 1
620 1
dtype: int64
In [25]: set(b)
Out[25]: {1, 2}
But if you do it for b.index you'll get the same values for all 3 methods:
blist = b.index.unique()
In [30]: len(apps_list), len(blist), len(set(b.index))
Out[30]: (23, 23, 23)
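As an aside, if the only goal is the count of distinct appids, pandas has a direct shortcut; a minimal sketch on toy data:

```python
import pandas as pd

apps = pd.DataFrame({'appid': [10, 10, 20, 30, 30, 30]})

# Series.nunique() counts distinct values directly,
# equivalent to len(apps['appid'].unique()).
n_unique = apps['appid'].nunique()
print(n_unique)  # 3
```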