Cumsum with groupby - python

I have a dataframe containing:
State Country Date Cases
0 NaN Afghanistan 2020-01-22 0
271 NaN Afghanistan 2020-01-23 0
... ... ... ... ...
85093 NaN Zimbabwe 2020-11-30 9950
85364 NaN Zimbabwe 2020-12-01 10129
I'm trying to create a new column of cumulative cases but grouped by Country AND State.
State Country Date Cases Total Cases
231 California USA 2020-01-22 5 5
342 California USA 2020-01-23 10 15
233 Texas USA 2020-01-22 4 4
322 Texas USA 2020-01-23 12 16
I have been trying to follow Pandas groupby cumulative sum and have tried things such as:
df['Total'] = df.groupby(['State','Country'])['Cases'].cumsum()
Returns a series of -1's
df['Total'] = df.groupby(['State', 'Country']).sum() \
.groupby(level=0).cumsum().reset_index()
Returns the sum.
df['Total'] = df.groupby(['Country'])['Cases'].apply(lambda x: x.cumsum())
Doesn't separate the sums by state.
df_f['Total'] = df_f.groupby(['Region','State'])['Cases'].apply(lambda x: x.cumsum())
This one works, except that when 'State' is NaN, 'Total' is also NaN.

arrays = [['California', 'California', 'Texas', 'Texas'],
['USA', 'USA', 'USA', 'USA'],
['2020-01-22','2020-01-23','2020-01-22','2020-01-23'], [5,10,4,12]]
df = pd.DataFrame(list(zip(*arrays)), columns = ['State', 'Country', 'Date', 'Cases'])
df
State Country Date Cases
0 California USA 2020-01-22 5
1 California USA 2020-01-23 10
2 Texas USA 2020-01-22 4
3 Texas USA 2020-01-23 12
temp = df.set_index(['State', 'Country', 'Date'], drop=True).sort_index()
df['Total Cases'] = temp.groupby(['State', 'Country']).cumsum().reset_index()['Cases']
df
State Country Date Cases Total Cases
0 California USA 2020-01-22 5 5
1 California USA 2020-01-23 10 15
2 Texas USA 2020-01-22 4 4
3 Texas USA 2020-01-23 12 16
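As a side note, since pandas 1.1 groupby accepts dropna=False, which keeps rows whose key is NaN in their own group instead of dropping them, so the NaN-state problem from the question can be handled directly. A minimal sketch (the dropna flag is the only change from the question's first attempt):
# Keep NaN states as their own group (requires pandas >= 1.1)
df['Total Cases'] = df.groupby(['State', 'Country'], dropna=False)['Cases'].cumsum()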

Related

How to transform combinations of values in columns into individual columns?

I have a dataset (df) that looks like this:
Date    ID     County Name  State  State Name  Product Name  Type of Transaction   QTY
202105  10001  Los Angeles  CA     California  Shoes         Entry                  630
202012  10002  Houston      TX     Texas       Keyboard      Exit                  5493
202001  11684  Chicago      IL     Illionis    Phone         Disposal               220
202107  12005  New York     NY     New York    Phone         Entry                  302
...     ...    ...          ...    ...         ...           ...                    ...
202111  14990  Orlando      FL     Florida     Shoes         Exit                   201
For every county there are multiple entries for different products, transaction types, and dates, but not all counties have the same number of entries, and they don't cover the same dates.
I want to recreate this dataset, such that:
1 - All counties have the same start and end dates, and for those dates where the county does not record entries, I want this entry to be recorded as NaN.
2 - The product names and their types are their own columns.
Essentially, this is how the dataset needs to look:
Date    ID     County Name  State  State Name  Shoes,Entry  Shoes,Exit  Shoes,Disposal  Phones,Entry  Phones,Exit  Phones,Disposal  Keyboard,Entry  Keyboard,Exit  Keyboard,Disposal
202105  10001  Los Angeles  CA     California          594         694            5660         33299         1110             5659            4559           3223              56889
202012  10002  Houston      TX     Texas              3420        4439             549          2110         5669             2245           39294           3345                556
202001  11684  Chicago      IL     Illionis          55432        4439             329         21190         4320              455           34059          44556               5677
202107  12005  New York     NY     New York          34556        2204            4329         11193        22345            43221            1544           3467              22450
...     ...    ...          ...    ...                 ...         ...             ...           ...          ...              ...             ...            ...                ...
202111  14990  Orlando      FL     Florida           54543       23059            3290         21394        34335            59660             NaN            NaN                NaN
In the example above, you can see how Florida does not record certain transactions; I would like to add NaN values so that the dataframe looks like this. I appreciate all the help!
This is essentially a pivot, with flattening of the MultiIndex:
(df
 .pivot(index=['Date', 'ID', 'County Name', 'State', 'State Name'],
        columns=['Product Name', 'Type of Transaction'],
        values='QTY')
 .pipe(lambda d: d.set_axis(map(','.join, d.columns), axis=1))
 .reset_index()
)
Output:
Date ID County Name State State Name Shoes,Entry Keyboard,Exit \
0 202001 11684 Chicago IL Illionis NaN NaN
1 202012 10002 Houston TX Texas NaN 5493.0
2 202105 10001 Los Angeles CA California 630.0 NaN
3 202107 12005 New York NY New York NaN NaN
Phone,Disposal Phone,Entry
0 220.0 NaN
1 NaN NaN
2 NaN NaN
3 NaN 302.0
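The pivot covers requirement 2. For requirement 1 (a common date range across counties), one option is to build the full grid of counties and dates and left-merge onto it, so missing county/date combinations surface as NaN rows. A sketch, assuming the reset-index result above is bound to wide:
import pandas as pd

# Cartesian grid of every county and every date seen in the data
grid = pd.MultiIndex.from_product(
    [wide['County Name'].unique(), wide['Date'].unique()],
    names=['County Name', 'Date'],
).to_frame(index=False)

# County/date pairs with no record come through as NaN rows
wide_full = grid.merge(wide, on=['County Name', 'Date'], how='left')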

How to replace the NaN values of a column with the values of another column?

How can I fill the NaN values of column ["state"] from column ["country"]?
Like in this Pandas DataFrame:
state country sum
0 NaN China 1
1 Assam India 2
2 Odisa India 3
3 Bihar India 4
4 NaN India 5
5 NaN Srilanka 6
6 NaN Malaysia 7
7 NaN Bhutan 8
8 California US 9
9 Texas US 10
10 Newyork US 11
11 NaN US 12
12 NaN Canada 13
What code should I use to fill the state column from the country column only where it is NaN, like this:
state country sum
0 China China 1
1 Assam India 2
2 Odisa India 3
3 Bihar India 4
4 India India 5
5 Srilanka Srilanka 6
6 Malaysia Malaysia 7
7 Bhutan Bhutan 8
8 California US 9
9 Texas US 10
10 Newyork US 11
11 US US 12
12 Canada Canada 13
I can use this code:
df.loc[df['state'].isnull(), 'state'] = df[df['state'].isnull()]['country'].replace(df['country'])
But on a very large dataset with 300K rows, it computes for 5-6 minutes and crashes every time, because it replaces one value at a time.
Can anyone help me with more efficient code for this?
Perhaps use fillna directly, without the isnull() check and replace():
df['state'].fillna(df['country'], inplace=True)
print(df)
Output
state country sum
0 China China 1
1 Assam India 2
2 Odisa India 3
3 Bihar India 4
4 India India 5
5 Srilanka Srilanka 6
6 Malaysia Malaysia 7
7 Bhutan Bhutan 8
8 California US 9
9 Texas US 10
10 Newyork US 11
11 US US 12
12 Canada Canada 13
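A design note: on recent pandas versions, inplace=True on a column-level fillna is discouraged (it warns under copy-on-write semantics, where it no longer updates the parent frame), so plain assignment is the safer spelling:
df['state'] = df['state'].fillna(df['country'])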

How to concatenate the following dataframes

I have two dataframes:
from datetime import date, timedelta

file_date = (date.today() - timedelta(days=2)).strftime('%m-%d-%Y')
file_date
github_dir_path = 'https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_daily_reports/'
file_path = github_dir_path + file_date + '.csv'
first dataframe:
FIPS Admin2 Province_State Country_Region Last_Update Lat Long_ Confirmed Deaths Recovered Active Combined_Key
0 45001.0 Abbeville South Carolina US 2020-04-28 02:30:51 34.223334 -82.461707 29 0 0 29 Abbeville, South Carolina, US
1 22001.0 Acadia Louisiana US 2020-04-28 02:30:51 30.295065 -92.414197 130 9 0 121 Acadia, Louisiana, US
2 51001.0 Accomack Virginia US 2020-04-28 02:30:51 37.767072 -75.632346 195 3 0 192 Accomack, Virginia, US
3 16001.0 Ada Idaho US 2020-04-28 02:30:51 43.452658 -116.241552 650 15 0 635 Ada, Idaho, US
4 19001.0 Adair Iowa US 2020-04-28 02:30:51 41.330756 -94.471059 1 0 0 1 Adair, Iowa, US
second dataframe:
0 0 ... 0 Kerala 0 Kerala 1
2 2020-02-01 Kerala 2 0 0 ... 0 Kerala 0 Kerala 2
3 2020-02-02 Kerala 3 0 0 ... 0 Kerala 0 Kerala 3
4 2020-02-03 Kerala 3 0 0 ... 0 Kerala 0 Kerala 3
Please guide me on how to concatenate both the data frames. I tried a couple of things but did not get the expected result.
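A minimal sketch of row-wise concatenation, assuming the two frames are bound to df1 and df2: pd.concat stacks them, and any column present in just one frame comes through as NaN in the other's rows.
import pandas as pd

# Row-wise concatenation; ignore_index renumbers the rows,
# sort=False keeps the existing column order where possible
combined = pd.concat([df1, df2], ignore_index=True, sort=False)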

Subtract columns from two data frames based on a common column

df1:
Asia 34
America 74
Australia 92
Africa 44
df2 :
Asia 24
Australia 90
Africa 30
I want the output of df1 - df2 to be
Asia 10
America 74
Australia 2
Africa 14
I am getting stuck on this; I am a newbie to pandas. Please help out.
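For reference, a reproducible setup matching the answer below (the column names A and B are assumptions, since the question shows no headers):
import pandas as pd

df1 = pd.DataFrame({'A': ['Asia', 'America', 'Australia', 'Africa'],
                    'B': [34, 74, 92, 44]})
df2 = pd.DataFrame({'A': ['Asia', 'Australia', 'Africa'],
                    'B': [24, 90, 30]})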
Use Series.sub with the second Series mapped onto the first via Series.map:
df1['B'] = df1['B'].sub(df1['A'].map(df2.set_index('A')['B']), fill_value=0)
print (df1)
A B
0 Asia 10.0
1 America 74.0
2 Australia 2.0
3 Africa 14.0
If a changed ordering of the first column is acceptable, convert both first columns to the index with DataFrame.set_index and subtract:
df2 = df1.set_index('A')['B'].sub(df2.set_index('A')['B'], fill_value=0).reset_index()
print (df2)
A B
0 Africa 14.0
1 America 74.0
2 Asia 10.0
3 Australia 2.0

How to filter a transposed pandas dataframe?

Say I have a transposed df like so
id 0 1 2 3
0 1361 Spain Russia South Africa China
1 1741 Portugal Cuba UK Ukraine
2 1783 Germany USA France Egypt
3 1353 Brazil Russia Japan Kenya
4 1458 India Romania Holland Nigeria
How could I get all rows where 'er' appears in any value, so that it returns this:
id 0 1 2 3
2 1783 Germany USA France Egypt
4 1458 India Romania Holland Nigeria
because 'er' is contained in Germany and Nigeria.
Thanks!
Using str.contains column-wise with apply:
df[df.apply(lambda x: x.str.contains(pat='er')).any(axis=1)]
Out[96]:
id 0 1 2 3
2 1783 Germany USA France Egypt None
4 1458 India Romania Holland Nigeria None
Use apply + str.contains across rows:
df = df[df.apply(lambda x: x.str.contains('er').any(), axis=1)]
print(df)
id 0 1 2 3
2 1783 Germany USA France Egypt
4 1458 India Romania Holland Nigeria
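A variant without the row-wise lambda, as a sketch: run str.contains once per country column and combine the boolean results. It assumes the country names sit in the columns labeled 0 through 3:
import pandas as pd

country_cols = [0, 1, 2, 3]  # assumed labels of the country columns
mask = pd.concat(
    [df[c].str.contains('er') for c in country_cols], axis=1
).any(axis=1)
print(df[mask])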
