I am new to python and pandas. I am having difficulties coming up with a column with the elapsed days since the occurrence of the first case by country. Similar to the date column, but instead of a date I want the days since the first case (since the first occurrence of a case/death/recovered within a country).
I have grouped the data by country and date and summed confirmed, deaths and recovered cases (because the original data had some countries split into regions). I also removed the days where there were no confirmed, deaths or recovered cases (I want to count from the day the first case appeared).
I would appreciate any help! Thanks in advance!
covid_data = covid_data.groupby(['Country/Region', 'Date'])[['Confirmed', 'Deaths', 'Recovered']].apply(sum)
covid_data.sort_values(by=['Country/Region', 'Date'])
covid_data.reset_index()
covid_data = covid_data[(covid_data.T != 0).any()] # eliminates rows with no confirmed, no deaths and no recovered
Output:
Country/Region Date Confirmed Deaths Recovered
Afghanistan 2020-02-24 1 0 0
2020-02-25 1 0 0
2020-02-26 1 0 0
2020-02-27 1 0 0
2020-02-28 1 0 0
2020-02-29 1 0 0
2020-03-01 1 0 0
2020-03-02 1 0 0
2020-03-03 1 0 0
2020-03-04 1 0 0
(and many other countries)
Let's start with some corrections to your initial code:
After the groupby your data is already sorted, so
covid_data.sort_values(by=['Country/Region', 'Date']) is not needed.
Actually, this instruction doesn't change anything anyway, as you didn't pass the
inplace=True parameter (and didn't assign the result).
Now, while Date is still in the index, it is time to eliminate the rows with all zeroes
in the other columns, so run covid_data = covid_data[(covid_data.T != 0).any()]
before you reset the index.
covid_data.reset_index() only generates a DataFrame with a reset index,
but doesn't save it anywhere either. You should correct it to:
covid_data.reset_index(inplace=True)
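Putting those corrections together, a minimal sketch of the cleaned-up preprocessing (assuming the same column names as in your post) could look like this:
import pandas as pd

# aggregate the regional rows of each country per date
covid_data = covid_data.groupby(['Country/Region', 'Date'])[['Confirmed', 'Deaths', 'Recovered']].sum()
# drop rows where Confirmed, Deaths and Recovered are all zero (Date is still in the index here)
covid_data = covid_data[(covid_data.T != 0).any()]
# bring Country/Region and Date back as regular columns
covid_data.reset_index(inplace=True)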
And now let's get down to the main task.
Assume that the source data, after the above initial operations, contains:
Country/Region Date Confirmed Deaths Recovered
0 Aaaa 2020-02-24 2 1 0
1 Aaaa 2020-02-25 2 0 0
2 Aaaa 2020-02-26 1 0 0
3 Aaaa 2020-02-27 3 0 0
4 Aaaa 2020-02-28 4 0 0
5 Bbbb 2020-02-20 5 1 0
6 Bbbb 2020-02-21 7 0 0
7 Bbbb 2020-02-23 9 1 0
8 Bbbb 2020-02-24 4 0 0
9 Bbbb 2020-02-25 8 1 0
i.e. 2 countries/regions.
To compute the Elapsed column for each country / region, define the following function:
import numpy as np

def getElapsed(grp):
    startDate = grp.iloc[0]
    return ((grp - startDate) / np.timedelta64(1, 'D')).astype(int)
Then run:
covid_data['Elapsed'] = covid_data.groupby('Country/Region').Date.transform(getElapsed)
The result is:
Country/Region Date Confirmed Deaths Recovered Elapsed
0 Aaaa 2020-02-24 2 1 0 0
1 Aaaa 2020-02-25 2 0 0 1
2 Aaaa 2020-02-26 1 0 0 2
3 Aaaa 2020-02-27 3 0 0 3
4 Aaaa 2020-02-28 4 0 0 4
5 Bbbb 2020-02-20 5 1 0 0
6 Bbbb 2020-02-21 7 0 0 1
7 Bbbb 2020-02-23 9 1 0 3
8 Bbbb 2020-02-24 4 0 0 4
9 Bbbb 2020-02-25 8 1 0 5
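Note that getElapsed relies on the first row of each group holding the earliest date, which is true here because the earlier groupby on ['Country/Region', 'Date'] leaves the rows sorted by date within each country. A small sketch of a variant that does not depend on row order would subtract the group minimum instead:
def getElapsed(grp):
    # use the group's minimum date rather than its first row
    return ((grp - grp.min()) / np.timedelta64(1, 'D')).astype(int)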
For anyone with the same problem:
# aggregates the countries by date
covid_data = covid_data.groupby(['Country/Region', 'Date'])[['Confirmed', 'Deaths', 'Recovered']].apply(sum)
# sorts the countries by name and then by date
covid_data.sort_values(by=['Country/Region', 'Date'])
# eliminates rows with no confirmed, no deaths and no recovered
covid_data = covid_data[(covid_data.T != 0).any()]
# gets the group-by columns back
covid_data = covid_data.reset_index()
# subtracts the min date from the current date (and returns the result in days - dt.days)
covid_data['Elapsed Days'] = (covid_data['Date'] - covid_data.groupby('Country/Region')['Date'].transform('min')).dt.days
EDIT: With the contribution of Valdi_Bo
# aggregates the countries by date
covid_data = covid_data.groupby(['Country/Region', 'Date'])[['Confirmed', 'Deaths', 'Recovered']].apply(sum)
# eliminates rows with no confirmed, no deaths and no recovered
covid_data = covid_data[(covid_data.T != 0).any()]
# gets the group-by columns back
covid_data.reset_index(inplace=True)
# subtracts the min date from the current date (and returns the result in days - dt.days)
covid_data['Elapsed Days'] = (covid_data['Date'] - covid_data.groupby('Country/Region')['Date'].transform('min')).dt.days
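A minimal, self-contained demonstration of the min-transform trick on toy data (hypothetical country names, not the real dataset):
import pandas as pd

demo = pd.DataFrame({
    'Country/Region': ['Aaaa', 'Aaaa', 'Bbbb', 'Bbbb', 'Bbbb'],
    'Date': pd.to_datetime(['2020-02-24', '2020-02-26', '2020-02-20', '2020-02-21', '2020-02-25']),
})
# subtract each country's first date from every date of that country
demo['Elapsed Days'] = (demo['Date'] - demo.groupby('Country/Region')['Date'].transform('min')).dt.days
print(demo)
#   Country/Region       Date  Elapsed Days
# 0           Aaaa 2020-02-24             0
# 1           Aaaa 2020-02-26             2
# 2           Bbbb 2020-02-20             0
# 3           Bbbb 2020-02-21             1
# 4           Bbbb 2020-02-25             5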
I have a DataFrame of store sales for 1115 stores with dates over about 2.5 years. The StateHoliday column is a categorical variable indicating the type of holiday it is. See the piece of the df below. As can be seen, b is the code for Easter. There are other codes for other holidays.
(image: a piece of the DataFrame)
My objective is to analyze sales before and during a holiday. The way I seek to do this is to change the value of the StateHoliday column to something unique for the few days before a particular holiday. For example, b is the code for Easter, so I could change the value to b- indicating that the day is shortly before Easter. The only way I can think to do this is to go through and manually change these values for certain dates. There aren't THAT many holidays, so it wouldn't be that hard to do. But still very annoying!
Tom, see if this works for you; if not, please provide additional information:
In the file I have the following data:
Store,Sales,Date,StateHoliday
1,6729,2013-03-25,0
1,6686,2013-03-26,0
1,6660,2013-03-27,0
1,7285,2013-03-28,0
1,6729,2013-03-29,b
1115,10712,2015-07-01,0
1115,11110,2015-07-02,0
1115,10500,2015-07-03,0
1115,12000,2015-07-04,c
import pandas as pd
fname = r"D:\workspace\projects\misc\data\holiday_sales.csv"
df = pd.read_csv(fname)
df["Date"] = pd.to_datetime(df["Date"])
holidays = df[df["StateHoliday"]!="0"].copy(deep=True) # taking only holidays
dictDate2Holiday = dict(zip(holidays["Date"].tolist(), holidays["StateHoliday"].tolist()))
look_back = 2 # how many days back you want to go
holiday_look_back = []
# building a list of pairs (prev days, holiday code)
for dt, h in dictDate2Holiday.items():
    prev = dt
    holiday_look_back.append((prev, h))
    for i in range(1, look_back+1):
        prev = prev - pd.Timedelta(days=1)
        holiday_look_back.append((prev, h))
dfHolidayLookBack = pd.DataFrame(holiday_look_back, columns=["Date", "StateHolidayNew"])
df = df.merge(dfHolidayLookBack, how="left", on="Date")
df["StateHolidayNew"].fillna("0", inplace=True)
print(df)
The column StateHolidayNew should have the info you need to start analyzing your data.
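If you want the look-back days to carry a distinct marker (e.g. b- for "shortly before Easter", as described in the question), a small variation of the loop above could tag them differently; this is just a sketch of one possible convention:
holiday_look_back = []
for dt, h in dictDate2Holiday.items():
    holiday_look_back.append((dt, h))  # the holiday itself keeps its original code
    prev = dt
    for i in range(1, look_back+1):
        prev = prev - pd.Timedelta(days=1)
        holiday_look_back.append((prev, h + "-"))  # days shortly before the holiday get a '-' suffix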
Assuming you have a dataframe like this:
Store Sales Date StateHoliday
0 2 4205 2016-11-15 0
1 1 684 2016-07-13 0
2 2 8946 2017-04-15 0
3 1 6929 2017-02-02 0
4 2 8296 2017-10-30 b
5 1 8261 2015-10-05 0
6 2 3904 2016-08-22 0
7 1 2613 2017-12-30 0
8 2 1324 2016-08-23 0
9 1 6961 2015-11-11 0
10 2 15 2016-12-06 a
11 1 9107 2016-07-05 0
12 2 1138 2015-03-29 0
13 1 7590 2015-06-24 0
14 2 5172 2017-04-29 0
15 1 660 2016-06-21 0
16 2 2539 2017-04-25 0
What you can do is split the rows into groups between the letter codes that represent the holidays, and then groupby to find the sales for each group. An improvement would be to backfill the group numbers with the following holiday code, e.g. groups=0.0 would become b_0, which would make it clearer which holiday each group precedes, but I am not sure how to do that (see the sketch after the final DataFrame below).
import numpy as np

df['StateHolidayBool'] = df['StateHoliday'].str.isalpha().fillna(False).replace({False: 0, True: 1})
df = df.assign(group=(df[~df['StateHolidayBool'].between(1, 1)].index.to_series().diff() > 1).cumsum())
df = df.assign(groups=np.where(df.group.notna(), df.group, df.StateHoliday)).drop(['StateHolidayBool', 'group'], axis=1)
df[~df['groups'].str.isalpha().fillna(False)].groupby('groups').sum()
Output:
Store Sales
groups
0.0 6 20764
1.0 7 23063
2.0 9 26206
Final DataFrame:
Store Sales Date StateHoliday groups
0 2 4205 2016-11-15 0 0.0
1 1 684 2016-07-13 0 0.0
2 2 8946 2017-04-15 0 0.0
3 1 6929 2017-02-02 0 0.0
4 2 8296 2017-10-30 b b
5 1 8261 2015-10-05 0 1.0
6 2 3904 2016-08-22 0 1.0
7 1 2613 2017-12-30 0 1.0
8 2 1324 2016-08-23 0 1.0
9 1 6961 2015-11-11 0 1.0
10 2 15 2016-12-06 a a
11 1 9107 2016-07-05 0 2.0
12 2 1138 2015-03-29 0 2.0
13 1 7590 2015-06-24 0 2.0
14 2 5172 2017-04-29 0 2.0
15 1 660 2016-06-21 0 2.0
16 2 2539 2017-04-25 0 2.0
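Regarding the backfill improvement mentioned above, here is one possible sketch (an assumption on my part, not part of the original answer): prefix each numeric group with the holiday code that follows it, so the block of days before b becomes b_0.0; the trailing group, which has no holiday after it, is left unchanged.
is_holiday = df['groups'].str.isalpha().fillna(False)
next_holiday = df['groups'].where(is_holiday).bfill()  # holiday code that follows each row
mask = ~is_holiday & next_holiday.notna()
df.loc[mask, 'groups'] = next_holiday[mask] + '_' + df.loc[mask, 'groups'].astype(str)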
I have a pandas dataframe with several columns, and for each row I would like to know how many of the date columns hold a date after 2016-12-31. Here is an example:
ID  Bill  Date 1      Date 2      Date 3      Date 4      Bill 2
4   6     2000-10-04  2000-11-05  1999-12-05  2001-05-04  8
6   8     2016-05-03  2017-08-09  2018-07-14  2015-09-12  17
12  14    2016-11-16  2017-05-04  2017-07-04  2018-07-04  35
And I would like to get this column
Count
0
2
3
Just create the mask and call sum on axis=1
date = pd.to_datetime('2016-12-31')
(df[['Date 1','Date 2','Date 3','Date 4']]>date).sum(1)
OUTPUT:
0 0
1 2
2 3
dtype: int64
If needed, call .to_frame('Count') to create a DataFrame with the column named Count:
(df[['Date 1','Date 2','Date 3','Date 4']]>date).sum(1).to_frame('Count')
Count
0 0
1 2
2 3
Use df.filter to select the Date* columns, then .sum(axis=1):
(df.filter(like='Date') > '2016-12-31').sum(axis=1).to_frame(name='Count')
Result:
Count
0 0
1 2
2 3
You can do:
df['Count'] = (df.loc[:, [x for x in df.columns if 'Date' in x]] > '2016-12-31').sum(axis=1)
Output:
ID Bill Date 1 Date 2 Date 3 Date 4 Bill 2 Count
0 4 6 2000-10-04 2000-11-05 1999-12-05 2001-05-04 8 0
1 6 8 2016-05-03 2017-08-09 2018-07-14 2015-09-12 17 2
2 12 14 2016-11-16 2017-05-04 2017-07-04 2018-07-04 35 3
We select the columns with 'Date' in the name, which is convenient when there are many such columns and we don't want to list them one by one. Then we compare each with the cutoff date and sum the True values.
I have a DataFrame df, that, once sorted by date, looks like this:
User Date price
0 2 2020-01-30 50
1 1 2020-02-02 30
2 2 2020-02-28 50
3 2 2020-04-30 10
4 1 2020-12-28 10
5 1 2020-12-30 20
I want to compute, for each row:
the number of rows in the last month, and
the sum of price in the last month.
On the example above, this is the output I'm looking for:
User Date price NumlastMonth Totallastmonth
0 2 2020-01-30 50 0 0
1 1 2020-02-02 30 0 0 # not 1, 50 ???
2 2 2020-02-28 50 1 50
3 2 2020-04-30 10 0 0
4 1 2020-12-28 10 0 0
5 1 2020-12-30 20 1 10 # not 0, 0 ???
I tried this, but the result accumulates over all previous rows, not only the last month.
df['NumlastMonth'] = data.sort_values('Date')\
    .groupby(['user']).amount.cumcount()
df['NumlastMonth'] = data.sort_values('Date')\
    .groupby(['user']).amount.cumsum()
Taking the question literally (acknowledging that the example doesn't quite match its description), we could do:
tally = df.groupby(pd.Grouper(key='Date', freq='M')).agg({'User': 'count', 'price': sum})
tally.index += pd.offsets.Day(1)
tally = tally.reindex(index=df.Date, method='ffill', fill_value=0)
On your input, that gives:
>>> tally
User price
Date
2020-01-30 0 0
2020-02-02 1 50
2020-02-28 1 50
2020-04-30 0 0
2020-12-28 0 0
2020-12-30 0 0
After that, it's easy to change the column names and concat:
df2 = pd.concat([
df.set_index('Date'),
tally.rename(columns={'User': 'NumlastMonth', 'price': 'Totallastmonth'})
], axis=1)
# out:
User price NumlastMonth Totallastmonth
Date
2020-01-30 2 50 0 0
2020-02-02 1 30 1 50
2020-02-28 2 50 1 50
2020-04-30 2 10 0 0
2020-12-28 1 10 0 0
2020-12-30 1 20 0 0
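If you prefer to keep the original integer index, with Date as a regular column, a merge on Date gives the same result (a sketch, assuming the dates are unique as in this example):
df2 = df.merge(
    tally.rename(columns={'User': 'NumlastMonth', 'price': 'Totallastmonth'}),
    left_on='Date', right_index=True, how='left'
)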
I have a Pandas dataframe with the following columns:
SecId Date Sector Country
184149 2019-12-31 Utility USA
184150 2019-12-31 Banking USA
187194 2019-12-31 Aerospace FRA
...............
128502 2020-02-12 CommSvcs UK
...............
The SecId and Date columns are the indices. What I want is the following:
SecId Date Aerospace Banking CommSvcs ........ Utility AFG CAN .. FRA .... UK USA ...
184149 2019-12-31 0 0 0 1 0 0 0 0 1
184150 2019-12-31 0 1 0 0 0 0 0 0 1
187194 2019-12-31 1 0 0 0 0 0 1 0 0
................
128502 2020-02-12 0 0 1 0 0 0 0 1 0
................
What is the efficient way to pivot this? The original data is denormalized for each day and can have millions of rows.
You can use get_dummies. You can cast as a categorical dtype beforehand to define what columns will be created.
code:
SECTORS = df.Sector.unique()
df["Sector"] = df.Sector.astype(pd.CategoricalDtype(SECTORS))
COUNTRIES = df.Country.unique()
df["Country"] = df.Country.astype(pd.CategoricalDtype(COUNTRIES))
df2 = pd.get_dummies(data=df, columns=["Sector", "Country"], prefix="", prefix_sep="")
output:
SecId Date Aerospace Banking Utility FRA USA
0 184149 2019-12-31 0 0 1 0 1
1 184150 2019-12-31 0 1 0 0 1
2 187194 2019-12-31 1 0 0 1 0
Try as #BEN_YO suggests:
pd.get_dummies(df,columns=['Sector', 'Country'], prefix='', prefix_sep='')
Output:
SecId Date Aerospace Banking Utility FRA USA
0 184149 2019-12-31 0 0 1 0 1
1 184150 2019-12-31 0 1 0 0 1
2 187194 2019-12-31 1 0 0 1 0
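Since the question mentions millions of rows, one memory-saving option is to ask get_dummies for sparse columns and a compact dtype (a sketch; check that your pandas version supports these parameters):
pd.get_dummies(df, columns=['Sector', 'Country'], prefix='', prefix_sep='',
               sparse=True,    # back the dummy columns with SparseArray
               dtype='uint8')  # keep the flags as 1-byte integers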
Having this dataframe:
provincia contagios defunciones fecha
0 distrito nacional 11 0 18/3/2020
1 azua 0 0 18/3/2020
2 baoruco 0 0 18/3/2020
3 dajabon 0 0 18/3/2020
4 barahona 0 0 18/3/2020
How can I have a new dataframe like this:
provincia contagios_from_march1_8 defunciones_from_march1_8
0 distrito nacional 11 0
1 azua 0 0
2 baoruco 0 0
3 dajabon 0 0
4 barahona 0 0
Where the 'contagios_from_march1_8' and 'defunciones_from_march1_8' are the result of the sum of the 'contagios' and 'defunciones' in the date range 3/1/2020 to 3/8/2020.
Thanks.
You can use sum on a boolean condition, e.g.:
df[df["fecha"] < month]["contagios"].sum()
refer to this for extracting month out of date: Extracting just Month and Year separately from Pandas Datetime column
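For the concrete range in the question, a fuller sketch (assuming fecha is day-first text such as 18/3/2020 and that the frame also contains rows for 1-8 March) could be:
df['fecha'] = pd.to_datetime(df['fecha'], dayfirst=True)
mask = (df['fecha'] >= '2020-03-01') & (df['fecha'] <= '2020-03-08')
result = (df.loc[mask]
            .groupby('provincia', as_index=False)[['contagios', 'defunciones']]
            .sum()
            .rename(columns={'contagios': 'contagios_from_march1_8',
                             'defunciones': 'defunciones_from_march1_8'}))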