I have a Pandas dataframe with the following columns:
SecId Date Sector Country
184149 2019-12-31 Utility USA
184150 2019-12-31 Banking USA
187194 2019-12-31 Aerospace FRA
...............
128502 2020-02-12 CommSvcs UK
...............
The SecId and Date columns are the index. What I want is the following:
SecId Date Aerospace Banking CommSvcs ........ Utility AFG CAN .. FRA .... UK USA ...
184149 2019-12-31 0 0 0 1 0 0 0 0 1
184150 2019-12-31 0 1 0 0 0 0 0 0 1
187194 2019-12-31 1 0 0 0 0 0 1 0 0
................
128502 2020-02-12 0 0 1 0 0 0 0 1 0
................
What is an efficient way to pivot this? The original data is denormalized for each day and can have millions of rows.
You can use get_dummies. Casting to a categorical dtype beforehand defines which columns will be created, even for values that happen to be absent from a given chunk.
code:
SECTORS = df.Sector.unique()
df["Sector"] = df["Sector"].astype(pd.CategoricalDtype(SECTORS))
COUNTRIES = df.Country.unique()
df["Country"] = df["Country"].astype(pd.CategoricalDtype(COUNTRIES))
df2 = pd.get_dummies(data=df, columns=["Sector", "Country"], prefix="", prefix_sep="")
output:
SecId Date Aerospace Banking Utility FRA USA
0 184149 2019-12-31 0 0 1 0 1
1 184150 2019-12-31 0 1 0 0 1
2 187194 2019-12-31 1 0 0 1 0
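A quick illustration of why the categorical cast helps (using the df and the categories defined above): a subset that contains no Banking rows still gets a Banking column of zeros, so output from different chunks or days lines up when concatenated.
chunk = df[df["Sector"] != "Banking"]   # hypothetical chunk that happens to contain no Banking rows
pd.get_dummies(chunk, columns=["Sector", "Country"], prefix="", prefix_sep="")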
Try, as @BEN_YO suggests:
pd.get_dummies(df,columns=['Sector', 'Country'], prefix='', prefix_sep='')
Output:
SecId Date Aerospace Banking Utility FRA USA
0 184149 2019-12-31 0 0 1 0 1
1 184150 2019-12-31 0 1 0 0 1
2 187194 2019-12-31 1 0 0 1 0
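Since the data can run to millions of rows, it may also help (an optional memory optimization, not something the question requires) that get_dummies can return sparse columns:
df2 = pd.get_dummies(df, columns=['Sector', 'Country'], prefix='', prefix_sep='', sparse=True)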
I have a DataFrame df, that, once sorted by date, looks like this:
User Date price
0 2 2020-01-30 50
1 1 2020-02-02 30
2 2 2020-02-28 50
3 2 2020-04-30 10
4 1 2020-12-28 10
5 1 2020-12-30 20
I want to compute, for each row:
the number of rows in the last month, and
the sum of price in the last month.
On the example above, the output that I'm looking for:
User Date price NumlastMonth Totallastmonth
0 2 2020-01-30 50 0 0
1 1 2020-02-02 30 0 0 # not 1, 50 ???
2 2 2020-02-28 50 1 50
3 2 2020-04-30 10 0 0
4 1 2020-12-28 10 0 0
5 1 2020-12-30 20 1 10 # not 0, 0 ???
I tried this, but the result accumulates over all previous rows, not only the last month.
df['NumlastMonth'] = df.sort_values('Date')\
                       .groupby(['User']).price.cumcount()
df['Totallastmonth'] = df.sort_values('Date')\
                         .groupby(['User']).price.cumsum()
Taking the question literally (acknowledging that the example output doesn't quite match the description), we could do:
tally = df.groupby(pd.Grouper(key='Date', freq='M')).agg({'User': 'count', 'price': sum})
tally.index += pd.offsets.Day(1)
tally = tally.reindex(index=df.Date, method='ffill', fill_value=0)
On your input, that gives:
>>> tally
User price
Date
2020-01-30 0 0
2020-02-02 1 50
2020-02-28 1 50
2020-04-30 0 0
2020-12-28 0 0
2020-12-30 0 0
After that, it's easy to change the column names and concat:
df2 = pd.concat([
    df.set_index('Date'),
    tally.rename(columns={'User': 'NumlastMonth', 'price': 'Totallastmonth'})
], axis=1)
# out:
User price NumlastMonth Totallastmonth
Date
2020-01-30 2 50 0 0
2020-02-02 1 30 1 50
2020-02-28 2 50 1 50
2020-04-30 2 10 0 0
2020-12-28 1 10 0 0
2020-12-30 1 20 0 0
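If what you actually want is the per-user count and sum over the 30 days before each row (the reading that reproduces the NumlastMonth / Totallastmonth values in your expected output for this example), a time-based rolling window is another option. A sketch, assuming Date is a datetime column and each (User, Date) pair is unique:
import pandas as pd

df = df.sort_values('Date')
roll = (df.set_index('Date')
          .groupby('User')['price']
          .rolling('30D', closed='left'))   # closed='left' excludes the current row from its own window
stats = (pd.DataFrame({'NumlastMonth': roll.count(),
                       'Totallastmonth': roll.sum()})
           .fillna(0).astype(int)           # empty windows (first row per user) become 0
           .reset_index())
df2 = df.merge(stats, on=['User', 'Date'], how='left')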
I want to add data to Redis.
My goal is to split the data and insert it into Redis little by little.
After inserting a pandas.DataFrame into Redis, I want to append more data.
I've already inserted one DataFrame into Redis, but I don't know how to keep the existing data and add to it.
for example:
log_df_v1 ## DataFrame_v1
session_id connect_date location categories join page_out
0 5fd1e923-d145-40cc-bf38-3b1156af5eb6 2020-01-01 00:14:24 경기도 4 0 1
1 5fd1e923-d145-40cc-bf38-3b1156af5eb6 2020-01-01 00:13:13 경기도 4 0 0
2 5fd1e923-d145-40cc-bf38-3b1156af5eb6 2020-01-01 00:13:10 경기도 4 0 0
3 5fd1e923-d145-40cc-bf38-3b1156af5eb6 2020-01-01 00:13:10 경기도 4 0 0
4 62de8537-e79f-4d67-8db5-57a26b89a42d 2020-01-01 00:10:52 경기도 3 0 1
Step 1. Store the DataFrame in Redis with SET
r = redis.StrictRedis(host="localhost", port=6379, db=0)
log_dic = log_df_v1.to_dict()
log_set = json.dumps(log_dic,ensure_ascii = False).encode('utf-8')
r.set('log_t1',log_set)
True
Step 2. Get the data back from Redis and turn it into a DataFrame
log_get = r.get('log_t1').decode('utf-8')
log_dic = dict(json.loads(log_get))
data_log = pd.DataFrame(log_dic)
data_log
session_id connect_date location categories join page_out
0 5fd1e923-d145-40cc-bf38-3b1156af5eb6 2020-01-01 00:14:24 경기도 4 0 1
1 5fd1e923-d145-40cc-bf38-3b1156af5eb6 2020-01-01 00:13:13 경기도 4 0 0
2 5fd1e923-d145-40cc-bf38-3b1156af5eb6 2020-01-01 00:13:10 경기도 4 0 0
3 5fd1e923-d145-40cc-bf38-3b1156af5eb6 2020-01-01 00:13:10 경기도 4 0 0
4 62de8537-e79f-4d67-8db5-57a26b89a42d 2020-01-01 00:10:52 경기도 3 0 1
Step 3 (question). I want to add another DataFrame (log_df_v2) to Redis.
However, I need to keep the existing DataFrame (log_df_v1).
log_df_v2 ## DataFrame_v2
session_id connect_date location categories join page_out
20000 f28e7b23-5ad0-460f-b50e-e6fe0b5edff6 2019-12-29 16:03:39 서울특별시 12 0 0
20001 e284ca69-333f-4cb8-84c9-485353a4ed74 2019-12-29 16:03:38 경기도 4 0 1
20002 ea348aa8-aa52-4ee2-84da-f000020c1ecf 2019-12-29 16:03:15 경상북도 1 0 0
20003 36b9795c-d38f-4ec1-8f49-0eae9cecd0b6 2019-12-29 16:03:12 경상북도 1 0 0
20004 f83e403e-16f5-4e31-8265-3ad40d9be969 2019-12-29 16:03:12 경상북도 1 0 0
The result I want:
log_get = r.get('log_t1').decode('utf-8')
log_dic = dict(json.loads(log_get))
data_log = pd.DataFrame(log_dic)
data_log
session_id connect_date location categories join page_out
0 5fd1e923-d145-40cc-bf38-3b1156af5eb6 2020-01-01 00:14:24 경기도 4 0 1
1 5fd1e923-d145-40cc-bf38-3b1156af5eb6 2020-01-01 00:13:13 경기도 4 0 0
2 5fd1e923-d145-40cc-bf38-3b1156af5eb6 2020-01-01 00:13:10 경기도 4 0 0
3 5fd1e923-d145-40cc-bf38-3b1156af5eb6 2020-01-01 00:13:10 경기도 4 0 0
4 62de8537-e79f-4d67-8db5-57a26b89a42d 2020-01-01 00:10:52 경기도 3 0 1
20000 f28e7b23-5ad0-460f-b50e-e6fe0b5edff6 2019-12-29 16:03:39 서울특별시 12 0 0
20001 e284ca69-333f-4cb8-84c9-485353a4ed74 2019-12-29 16:03:38 경기도 4 0 1
20002 ea348aa8-aa52-4ee2-84da-f000020c1ecf 2019-12-29 16:03:15 경상북도 1 0 0
20003 36b9795c-d38f-4ec1-8f49-0eae9cecd0b6 2019-12-29 16:03:12 경상북도 1 0 0
20004 f83e403e-16f5-4e31-8265-3ad40d9be969 2019-12-29 16:03:12 경상북도 1 0 0
How can I insert log_df_v1 and then append log_df_v2 in Redis?
All I want is to keep the accumulated data saved in Redis.
Please help me...
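A sketch of two ways to do this, since SET simply overwrites a key: either read the old value back, concatenate, and SET again, or push each batch onto a Redis list with RPUSH and rebuild the frame when you read. Both variants assume the r connection and the log_t1 key from step 1; the log_batches key name is made up for the example.
import json
import pandas as pd

# Option 1: read the stored frame back, append log_df_v2, overwrite the key.
old = pd.DataFrame(json.loads(r.get('log_t1').decode('utf-8')))
combined = pd.concat([old, log_df_v2])            # note: JSON turns the old index labels into strings
r.set('log_t1', json.dumps(combined.to_dict(), ensure_ascii=False).encode('utf-8'))

# Option 2: push each batch onto a Redis list, so nothing has to be re-read on insert.
r.rpush('log_batches', json.dumps(log_df_v1.to_dict(orient='records'), ensure_ascii=False))
r.rpush('log_batches', json.dumps(log_df_v2.to_dict(orient='records'), ensure_ascii=False))
# rebuild the full DataFrame from all stored batches
parts = [pd.DataFrame(json.loads(b)) for b in r.lrange('log_batches', 0, -1)]
full = pd.concat(parts, ignore_index=True)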
I am new to Python and pandas. I am having difficulties coming up with a column holding the elapsed days since the occurrence of the first case by country. It would be similar to the date column, but instead of a date I want the number of days since the first case (the first occurrence of a confirmed case/death/recovery within a country).
I have grouped the data by country and date and summed the confirmed, death and recovered cases (because the original data had some countries split into regions). I also removed the days with no confirmed cases, deaths or recoveries (I want to count from the first case onward).
I would appreciate any help! Thanks in advance!
covid_data = covid_data.groupby(['Country/Region', 'Date'])[['Confirmed', 'Deaths', 'Recovered']].apply(sum)
covid_data.sort_values(by=['Country/Region', 'Date'])
covid_data.reset_index()
covid_data = covid_data[(covid_data.T != 0).any()] # eliminates rows with no confirmed, no deaths and no recovered
Output:
Country/Region Date Confirmed Deaths Recovered
Afghanistan 2020-02-24 1 0 0
2020-02-25 1 0 0
2020-02-26 1 0 0
2020-02-27 1 0 0
2020-02-28 1 0 0
2020-02-29 1 0 0
2020-03-01 1 0 0
2020-03-02 1 0 0
2020-03-03 1 0 0
2020-03-04 1 0 0
(and many other countries)
Let's start with some corrections to your initial code:
After groupby your data is already sorted, so
covid_data.sort_values(by=['Country/Region', 'Date']) is not needed.
Actually this instruction doesn't change anything anyway, as you didn't pass
the inplace=True parameter or assign the result back.
While Date is still in the index it is time to eliminate the rows with all zeroes
in the other columns, so run covid_data = covid_data[(covid_data.T != 0).any()]
before you reset the index.
covid_data.reset_index() only generates a DataFrame with a reset index
but doesn't save it anywhere. You should correct it to:
covid_data.reset_index(inplace=True)
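Putting these corrections together, the preamble from the question would read as follows (the same operations, only reordered and saved back; a recap rather than new logic):
covid_data = covid_data.groupby(['Country/Region', 'Date'])[['Confirmed', 'Deaths', 'Recovered']].apply(sum)
covid_data = covid_data[(covid_data.T != 0).any()]   # drop all-zero rows while Date is still in the index
covid_data.reset_index(inplace=True)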
And now let's get down to the main task.
Assume that the source data, after the above initial operations, contains:
Country/Region Date Confirmed Deaths Recovered
0 Aaaa 2020-02-24 2 1 0
1 Aaaa 2020-02-25 2 0 0
2 Aaaa 2020-02-26 1 0 0
3 Aaaa 2020-02-27 3 0 0
4 Aaaa 2020-02-28 4 0 0
5 Bbbb 2020-02-20 5 1 0
6 Bbbb 2020-02-21 7 0 0
7 Bbbb 2020-02-23 9 1 0
8 Bbbb 2020-02-24 4 0 0
9 Bbbb 2020-02-25 8 1 0
i.e. 2 countries/regions.
To compute the Elapsed column for each country/region, define the following function:
import numpy as np   # needed for the timedelta unit below

def getElapsed(grp):
    startDate = grp.iloc[0]
    return ((grp - startDate) / np.timedelta64(1, 'D')).astype(int)
Then run:
covid_data['Elapsed'] = covid_data.groupby('Country/Region').Date.transform(getElapsed)
The result is:
Country/Region Date Confirmed Deaths Recovered Elapsed
0 Aaaa 2020-02-24 2 1 0 0
1 Aaaa 2020-02-25 2 0 0 1
2 Aaaa 2020-02-26 1 0 0 2
3 Aaaa 2020-02-27 3 0 0 3
4 Aaaa 2020-02-28 4 0 0 4
5 Bbbb 2020-02-20 5 1 0 0
6 Bbbb 2020-02-21 7 0 0 1
7 Bbbb 2020-02-23 9 1 0 3
8 Bbbb 2020-02-24 4 0 0 4
9 Bbbb 2020-02-25 8 1 0 5
For anyone with the same problem:
# aggregates the countries by date
covid_data = covid_data.groupby(['Country/Region', 'Date'])[['Confirmed', 'Deaths', 'Recovered']].apply(sum)
# sorts the countries by name and then by date
covid_data.sort_values(by=['Country/Region', 'Date'])
# eliminates rows with no confirmed, no deaths and no recovered
covid_data = covid_data[(covid_data.T != 0).any()]
# gets the group-by columns back
covid_data = covid_data.reset_index()
# subtracts the min date from the current date (and returns the result in days via .dt.days)
covid_data['Ellapsed Days'] = (covid_data['Date'] - covid_data.groupby('Country/Region')['Date'].transform('min')).dt.days
EDIT: With the contribution of Valdi_Bo
# aggregates the countries by date
covid_data = covid_data.groupby(['Country/Region', 'Date'])[['Confirmed', 'Deaths', 'Recovered']].apply(sum)
# eliminates rows with no confirmed, no deaths and no recovered
covid_data = covid_data[(covid_data.T != 0).any()]
# gets the group-by columns back
covid_data.reset_index(inplace=True)
# subtracts the min date from the current date (and returns the result in days via .dt.days)
covid_data['Ellapsed Days'] = (covid_data['Date'] - covid_data.groupby('Country/Region')['Date'].transform('min')).dt.days
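Both versions assume that the Date column already has a datetime dtype; if it came out of the CSV as plain text, convert it first (one line, same column name as above):
covid_data['Date'] = pd.to_datetime(covid_data['Date'])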
having this dataframe:
provincia contagios defunciones fecha
0 distrito nacional 11 0 18/3/2020
1 azua 0 0 18/3/2020
2 baoruco 0 0 18/3/2020
3 dajabon 0 0 18/3/2020
4 barahona 0 0 18/3/2020
How can I have a new dataframe like this:
provincia contagios_from_march1_8 defunciones_from_march1_8
0 distrito nacional 11 0
1 azua 0 0
2 baoruco 0 0
3 dajabon 0 0
4 barahona 0 0
Where the 'contagios_from_march1_8' and 'defunciones_from_march1_8' are the result of the sum of the 'contagios' and 'defunciones' in the date range 3/1/2020 to 3/8/2020.
Thanks.
You can filter on a condition and then call .sum(), e.g.:
df[df["fecha"] < month]["contagios"].sum()
Refer to this for extracting the month out of a date: Extracting just Month and Year separately from Pandas Datetime column
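A fuller sketch for the exact question, assuming the frame is called df as in the example and that fecha holds day-first date strings: keep the March 1 to March 8 rows, then sum both columns per provincia and rename.
import pandas as pd

df['fecha'] = pd.to_datetime(df['fecha'], dayfirst=True)
in_range = df['fecha'].between('2020-03-01', '2020-03-08')
result = (df[in_range]
          .groupby('provincia', as_index=False)[['contagios', 'defunciones']]
          .sum()
          .rename(columns={'contagios': 'contagios_from_march1_8',
                           'defunciones': 'defunciones_from_march1_8'}))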
Simplified situation:
I've got a file with a list of some countries, and I load it into dataframe df.
Then I've got data concerning those countries (and many more) in many .xls files. I try to read each of those files into df_f, subset the data I'm interested in, and then find countries from the original file; if any of them are present, copy the data into dataframe df.
The problem is that only some of the values are assigned correctly. Most of them end up as NaNs (see below).
for filename in os.listdir(os.getcwd()):
    df_f = pd.read_excel(filename, sheetname = 'Data', parse_cols = "D,F,H,J:BS", skiprows = 2, skip_footer = 2)
    df_f = df_f.fillna(0)
    df_ss = [SUBSETTING df_f here]
    countries = df_ss['Country']
    for c in countries:
        if (c in df['Country'].values):
            row_idx = df[df['Country'] == c].index
            df_h = df_ss[quarters][df_ss.Country == c]
            df.loc[row_idx, quarters] = df_h
The result I get is:
Country Q1 2000 Q2 2000 Q3 2000 Q4 2000 Q1 2001 Q2 2001 Q3 2001 \
0 Albania NaN NaN NaN NaN NaN NaN NaN
1 Algeria NaN NaN NaN NaN NaN NaN NaN
2 Argentina NaN NaN NaN NaN NaN NaN NaN
3 Armenia NaN NaN NaN NaN NaN NaN NaN
4 Australia NaN NaN NaN NaN NaN NaN NaN
5 Austria 4547431 5155839 5558963 6079089 6326217 6483130 6547780
6 Azerbaijan NaN NaN NaN NaN NaN NaN NaN
etc...
The loading and subsetting are done correctly and the data is not corrupted: I print df_h on each iteration and it shows regular numbers. The point is that after assigning them to the df dataframe they become NaNs...
Any idea?
EDIT: sample data
df:
Country Country group Population Development coefficient Q1 2000 \
0 Albania group II 2981000 -1 0
1 Algeria group I 39106000 -1 0
2 Argentina group III 42669000 -1 0
3 Armenia group II 3013000 -1 0
4 Australia group IV 23520000 -1 0
5 Austria group IV 8531000 -1 0
6 Azerbaijan group II 9538000 -1 0
7 Bangladesh group I 158513000 -1 0
8 Belarus group III 9470000 -1 0
9 Belgium group III 11200000 -1 0
(...)
Q2 2013 Q3 2013 Q4 2013 Q1 2014 Q2 2014 Q3 2014 Q4 2014 Q1 2015
0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0
5 0 0 0 0 0 0 0 0
6 0 0 0 0 0 0 0 0
7 0 0 0 0 0 0 0 0
8 0 0 0 0 0 0 0 0
9 0 0 0 0 0 0 0 0
and df_ss from one of the files:
Country Q1 2000 Q2 2000 Q3 2000 Q4 2000 Q1 2001 \
5 Guam 11257 17155 23063 29150 37098
10 Kiribati 323 342 361 380 398
15 Marshall Islands 425 428 433 440 449
17 Micronesia 0 0 0 0 0
19 Nauru 0 0 0 0 0
22 Northern Mariana Islands 2560 3386 4499 6000 8037
27 Palau 1513 1672 1828 1980 2130
(...)
Q3 2013 Q4 2013 Q1 2014 Q2 2014 Q3 2014 Q4 2014 Q1 2015
5 150028 151152 152244 153283 154310 155333 156341
10 19933 20315 20678 21010 21329 21637 21932
15 17536 19160 20827 22508 24253 26057 27904
17 18646 17939 17513 17232 17150 17233 17438
19 7894 8061 8227 8388 8550 8712 8874
22 27915 28198 28481 28753 29028 29304 29578
27 17602 17858 18105 18337 18564 18785 19001
Try setting the values like the following (see this post):
df.ix[quarters, ...] = 10
By @joris:
Can you try
df.loc[row_idx, quarters] = df_h.values
for the last line (note the extra .values at the end)?
This one worked, thanks :-)
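For anyone wondering why the .values was needed: assignment through df.loc aligns the right-hand side on its index labels, and df_h keeps the row labels of df_ss (5, 10, 15, ...), none of which exist in df, so everything aligns to NaN. .values strips the labels and the block is assigned positionally. A tiny reproduction with made-up data:
import pandas as pd

target = pd.DataFrame({'Q1 2000': [0, 0]}, index=[0, 1])
source = pd.DataFrame({'Q1 2000': [100, 200]}, index=[5, 10])   # labels not present in target

target.loc[[0, 1], ['Q1 2000']] = source          # aligns on labels 5 and 10 -> all NaN
target.loc[[0, 1], ['Q1 2000']] = source.values   # positional assignment -> 100, 200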