I have the following panel dataset, where "winner" = 1 if someone is a winner in a given period (date) and 0 if they are a loser.
ID date winner
A 2017Q4 NaN
A 2018Q4 1
A 2019Q4 0
A 2020Q4 0
A 2021Q4 1
B 2017Q4 NaN
B 2018Q4 1
B 2019Q4 1
B 2020Q4 0
B 2021Q4 0
C 2017Q4 NaN
C 2018Q4 0
C 2019Q4 0
C 2020Q4 0
C 2021Q4 0
D 2017Q4 NaN
D 2018Q4 0
D 2019Q4 1
D 2020Q4 1
D 2021Q4 1
I want to create four dummy variables: WW = 1 if someone is a winner in two consecutive periods, LL = 1 if a loser in two consecutive periods, WL = 1 if a winner in one period and a loser in the next, and LW vice versa.
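For reference, the example frame can be rebuilt like this (a sketch; values copied from the table above):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'ID': np.repeat(list('ABCD'), 5),
    'date': ['2017Q4', '2018Q4', '2019Q4', '2020Q4', '2021Q4'] * 4,
    'winner': [np.nan, 1, 0, 0, 1,
               np.nan, 1, 1, 0, 0,
               np.nan, 0, 0, 0, 0,
               np.nan, 0, 1, 1, 1],
})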
UPDATE
When I apply the answers below, I get the following:
ID date winner WW LL WL LW
A 2017Q4 NaN
A 2018Q4 1 0 0 0 0
A 2019Q4 0 0 0 1 0
A 2020Q4 0 0 1 0 0
A 2021Q4 1 0 0 0 1
B 2017Q4 NaN
B 2018Q4 1 0 0 0 0
B 2019Q4 1 1 0 0 0
B 2020Q4 0 0 0 1 0
B 2021Q4 0 0 1 0 0
C 2017Q4 NaN
C 2018Q4 0 0 0 0 0
C 2019Q4 0 0 1 0 0
C 2020Q4 0 0 1 0 0
C 2021Q4 0 0 1 0 0
D 2017Q4 NaN
D 2018Q4 0 0 0 0 0
D 2019Q4 1 0 0 0 1
D 2020Q4 1 1 0 0 0
D 2021Q4 1 1 0 0 0
How do I make sure I get NaN when the previous value is NaN?
Desired output:
ID date winner WW LL WL LW
A 2017Q4 NaN
A 2018Q4 1 NaN NaN NaN NaN
A 2019Q4 0 0 0 1 0
A 2020Q4 0 0 1 0 0
A 2021Q4 1 0 0 0 1
B 2017Q4 NaN
B 2018Q4 1 NaN NaN NaN NaN
B 2019Q4 1 1 0 0 0
B 2020Q4 0 0 0 1 0
B 2021Q4 0 0 1 0 0
C 2017Q4 NaN
C 2018Q4 0 NaN NaN NaN NaN
C 2019Q4 0 0 1 0 0
C 2020Q4 0 0 1 0 0
C 2021Q4 0 0 1 0 0
D 2017Q4 NaN
D 2018Q4 0 NaN NaN NaN NaN
D 2019Q4 1 0 0 0 1
D 2020Q4 1 1 0 0 0
D 2021Q4 1 1 0 0 0
What is the simplest way to do this?
Here's one way: use groupby.shift to get the previous record, then numpy.select to assign labels, which get_dummies converts to dummy variables:
import numpy as np
import pandas as pd

# previous period's value within each ID
df['previous'] = df.groupby('ID')['winner'].shift()
tmp = df[['previous', 'winner']]
dummy_vars = ['WW', 'LL', 'WL', 'LW']

# label each row by its (previous, current) pattern ('' where unmatched),
# expand the labels into dummy columns, blank out rows with no history,
# and drop the helper columns
out = (df.join(pd.get_dummies(np.select([tmp.eq(1).all(1),       # W then W
                                         tmp.eq(0).all(1),       # L then L
                                         tmp.eq([1, 0]).all(1),  # W then L
                                         tmp.eq([0, 1]).all(1)], # L then W
                                        dummy_vars, ''))[dummy_vars + ['']]
                 .mask(df['previous'].isna(), ''))
         .drop(columns=['previous', '']))
Output:
ID date winner WW LL WL LW
0 A 2018Q4 1
1 A 2019Q4 0 0 0 1 0
2 A 2020Q4 0 0 1 0 0
3 A 2021Q4 1 0 0 0 1
4 B 2018Q4 1
5 B 2019Q4 1 1 0 0 0
6 B 2020Q4 0 0 0 1 0
7 B 2021Q4 0 0 1 0 0
8 C 2018Q4 0
9 C 2019Q4 0 0 1 0 0
10 C 2020Q4 0 0 1 0 0
11 C 2021Q4 0 0 1 0 0
12 D 2018Q4 0
13 D 2019Q4 1 0 0 0 1
14 D 2020Q4 1 1 0 0 0
15 D 2021Q4 1 1 0 0 0
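The blanks above are the empty strings written by the mask; if you want real NaN instead, as in the desired output, one hedged follow-up is:
out = out.replace('', np.nan)   # empty strings -> NaN in the dummy columns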
Another approach:
1. map 1 and 0 to "W" and "L"
2. get the 2-period streak
3. get_dummies on the "streak"
4. join to the original DataFrame, ignoring the first row of each ID
# map 1/0 to "W"/"L" (the NaN rows are filled too, but get masked out below)
wins = df["winner"].fillna(0).map({1: "W", 0: "L"})
# previous letter + current letter = the 2-period streak
streaks = wins.shift() + wins
# dummies only where the previous row belongs to the same ID
other = pd.get_dummies(streaks.where(df["ID"].eq(df["ID"].shift())))
# join, keeping NaN wherever the previous winner value is missing
output = df.join(other.where(df["ID"].duplicated() & df["winner"].shift().notna()))
>>> output
ID date winner LL LW WL WW
0 A 2017Q4 NaN NaN NaN NaN NaN
1 A 2018Q4 1.0 NaN NaN NaN NaN
2 A 2019Q4 0.0 0.0 0.0 1.0 0.0
3 A 2020Q4 0.0 1.0 0.0 0.0 0.0
4 A 2021Q4 1.0 0.0 1.0 0.0 0.0
5 B 2017Q4 NaN NaN NaN NaN NaN
6 B 2018Q4 1.0 NaN NaN NaN NaN
7 B 2019Q4 1.0 0.0 0.0 0.0 1.0
8 B 2020Q4 0.0 0.0 0.0 1.0 0.0
9 B 2021Q4 0.0 1.0 0.0 0.0 0.0
10 C 2017Q4 NaN NaN NaN NaN NaN
11 C 2018Q4 0.0 NaN NaN NaN NaN
12 C 2019Q4 0.0 1.0 0.0 0.0 0.0
13 C 2020Q4 0.0 1.0 0.0 0.0 0.0
14 C 2021Q4 0.0 1.0 0.0 0.0 0.0
15 D 2017Q4 NaN NaN NaN NaN NaN
16 D 2018Q4 0.0 NaN NaN NaN NaN
17 D 2019Q4 1.0 0.0 1.0 0.0 0.0
18 D 2020Q4 1.0 0.0 0.0 0.0 1.0
19 D 2021Q4 1.0 0.0 0.0 0.0 1.0
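For reference, a more direct sketch that yields NaN automatically (assumes the numpy/pandas imports above): compute the previous value per ID and build each dummy only where both values are known:
prev = df.groupby('ID')['winner'].shift()
valid = prev.notna() & df['winner'].notna()
for name, (p, w) in {'WW': (1, 1), 'LL': (0, 0), 'WL': (1, 0), 'LW': (0, 1)}.items():
    # 1 when previous == p and current == w; NaN where either value is missing
    df[name] = np.where(valid, ((prev == p) & (df['winner'] == w)).astype(int), np.nan)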
Related
For example, I have a dataframe:
          0         1         2         3         4         5         6
0  0.493212  0.586246       NaN  0.589289       NaN  0.629087  0.593872
1  0.568513  0.367722       NaN       NaN       NaN       NaN  0.423369
2  0.700540  0.735529       NaN       NaN  0.494135       NaN       NaN
3       NaN       NaN       NaN  0.338822  0.466331  0.765367  0.830820
4  0.512891       NaN  0.623782  0.642438       NaN  0.541117  0.929810
If I compare it like:
df >= 0.5
The result is:
   0  1  2  3  4  5  6
0  0  1  0  1  0  1  1
1  1  0  0  0  0  0  0
2  1  1  0  0  0  0  0
3  0  0  0  0  0  1  1
4  1  0  1  1  0  1  1
How can I keep the nan cells? I mean, I need 0.5 > np.nan to give np.nan, not False.
IIUC, you can use a mask:
df.lt(0.5).astype(int).mask(df.isna())
output:
0 1 2 3 4 5 6
0 1.0 0.0 NaN 0.0 NaN 0.0 0.0
1 0.0 1.0 NaN NaN NaN NaN 1.0
2 0.0 0.0 NaN NaN 1.0 NaN NaN
3 NaN NaN NaN 1.0 1.0 0.0 0.0
4 0.0 NaN 0.0 0.0 NaN 0.0 0.0
If you want to keep the integer type:
out = df.lt(0.5).astype(pd.Int64Dtype()).mask(df.isna())
output:
0 1 2 3 4 5 6
0 1 0 <NA> 0 <NA> 0 0
1 0 1 <NA> <NA> <NA> <NA> 1
2 0 0 <NA> <NA> 1 <NA> <NA>
3 <NA> <NA> <NA> 1 1 0 0
4 0 <NA> 0 0 <NA> 0 0
Use DataFrame.mask and convert the values to integers:
df = (df >= 0.5).astype(int).mask(df.isna())
print (df)
0 1 2 3 4 5 6
0 0.0 1.0 NaN 1.0 NaN 1.0 1.0
1 1.0 0.0 NaN NaN NaN NaN 0.0
2 1.0 1.0 NaN NaN 0.0 NaN NaN
3 NaN NaN NaN 0.0 0.0 1.0 1.0
4 1.0 NaN 1.0 1.0 NaN 1.0 1.0
Details:
print ((df >= 0.5).astype(int))
0 1 2 3 4 5 6
0 0 1 0 1 0 1 1
1 1 0 0 0 0 0 0
2 1 1 0 0 0 0 0
3 0 0 0 0 0 1 1
4 1 0 1 1 0 1 1
Another idea with numpy.select:
df[:] = np.select([df.isna(), df >= 0.5], [None, 1], default=0)
print (df)
0 1 2 3 4 5 6
0 0.0 1.0 NaN 1.0 NaN 1.0 1.0
1 1.0 0.0 NaN NaN NaN NaN 0.0
2 1.0 1.0 NaN NaN 0.0 NaN NaN
3 NaN NaN NaN 0.0 0.0 1.0 1.0
4 1.0 NaN 1.0 1.0 NaN 1.0 1.0
By the way, if you need True/False with NaN, you can use the nullable boolean data type:
df = (df >= 0.5).astype(int).mask(df.isna()).astype('boolean')
print (df)
0 1 2 3 4 5 6
0 False True <NA> True <NA> True True
1 True False <NA> <NA> <NA> <NA> False
2 True True <NA> <NA> False <NA> <NA>
3 <NA> <NA> <NA> False False True True
4 True <NA> True True <NA> True True
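On recent pandas there is one more option, a hedged sketch assuming pandas >= 1.2 (where convert_dtypes produces nullable Float64) and starting from the original float frame:
nullable = df.convert_dtypes()            # float64 -> nullable Float64
out = (nullable >= 0.5).astype('Int64')   # 1/0 with <NA> kept where the input was NaN
Comparisons on nullable columns propagate <NA> natively, so no mask step is needed.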
I would like to subtract [a groupby mean of subset] from the [original] dataframe:
I have a pandas DataFrame data whose index is a datetime (monthly, say 100 years = 100 yr * 12 mo) and which has 10 columns of station IDs (i.e., a 1200-row * 10-column pd.DataFrame).
1)
I would like to first take a subset of the above data, e.g. the top 50 years (i.e., 50 yr * 12 mo),
data_sub = data_org[data_org.index.year <= top_50_year]
and calculate the monthly mean for each month for each station (column), e.g.,
mean_sub = data_sub.groupby(data_sub.index.month).mean()
or
mean_sub = data_sub.groupby(data_sub.index.month).transform('mean')
which seem to do the job.
2)
Now I want to subtract the above from the [original], NOT from the [subset], e.g.,
data_org - mean_sub
which I do not know how to do. So in summary, I would like to calculate the monthly mean from a subset of the original data (e.g., only using 50 years), and subtract that monthly mean from the original data month by month.
Subtracting was easy when I used the full [original] data to calculate the mean (i.e., .transform('mean') or .apply(lambda x: x - x.mean()) do the job), but what should I do when the mean is calculated from a [subset] of the data?
Could you share your insight for this problem? Thank you in advance!
@mozway:
The input (and also the output) shape looks like the following:
(image omitted: input shape with random values)
The output values are simply the anomalies from the [subset]'s monthly mean. Thank you.
One idea is to replace the non-matching values with NaN using DataFrame.where; after GroupBy.transform you then get the same index as the original DataFrame, so the subtraction is possible:
import numpy as np
import pandas as pd

np.random.seed(123)
data_org = pd.DataFrame(np.random.randint(10, size=(10, 3)),
                        index=pd.date_range('2000-01-01', periods=10, freq='3M'))
print (data_org)
0 1 2
2000-01-31 2 2 6
2000-04-30 1 3 9
2000-07-31 6 1 0
2000-10-31 1 9 0
2001-01-31 0 9 3
2001-04-30 4 0 0
2001-07-31 4 1 7
2001-10-31 3 2 4
2002-01-31 7 2 4
2002-04-30 8 0 7
top_50_year = 2000
data1 = data_org.where(data_org.index.to_series().dt.year <= top_50_year)
print (data1)
0 1 2
2000-01-31 2.0 2.0 6.0
2000-04-30 1.0 3.0 9.0
2000-07-31 6.0 1.0 0.0
2000-10-31 1.0 9.0 0.0
2001-01-31 NaN NaN NaN
2001-04-30 NaN NaN NaN
2001-07-31 NaN NaN NaN
2001-10-31 NaN NaN NaN
2002-01-31 NaN NaN NaN
2002-04-30 NaN NaN NaN
mean_data1 = data1.groupby(data1.index.month).transform('mean')
print (mean_data1)
0 1 2
2000-01-31 2.0 2.0 6.0
2000-04-30 1.0 3.0 9.0
2000-07-31 6.0 1.0 0.0
2000-10-31 1.0 9.0 0.0
2001-01-31 2.0 2.0 6.0
2001-04-30 1.0 3.0 9.0
2001-07-31 6.0 1.0 0.0
2001-10-31 1.0 9.0 0.0
2002-01-31 2.0 2.0 6.0
2002-04-30 1.0 3.0 9.0
df = data_org - mean_data1
print (df)
0 1 2
2000-01-31 0.0 0.0 0.0
2000-04-30 0.0 0.0 0.0
2000-07-31 0.0 0.0 0.0
2000-10-31 0.0 0.0 0.0
2001-01-31 -2.0 7.0 -3.0
2001-04-30 3.0 -3.0 -9.0
2001-07-31 -2.0 0.0 7.0
2001-10-31 2.0 -7.0 4.0
2002-01-31 5.0 0.0 -2.0
2002-04-30 7.0 -3.0 -2.0
Another idea with filtering:
np.random.seed(123)
data_org = pd.DataFrame(np.random.randint(10, size=(10, 3)),
                        index=pd.date_range('2000-01-01', periods=10, freq='3M'))
print (data_org)
0 1 2
2000-01-31 2 2 6
2000-04-30 1 3 9
2000-07-31 6 1 0
2000-10-31 1 9 0
2001-01-31 0 9 3
2001-04-30 4 0 0
2001-07-31 4 1 7
2001-10-31 3 2 4
2002-01-31 7 2 4
2002-04-30 8 0 7
top_50_year = 2000
data_sub = data_org[data_org.index.year <= top_50_year]
print (data_sub)
0 1 2
2000-01-31 2 2 6
2000-04-30 1 3 9
2000-07-31 6 1 0
2000-10-31 1 9 0
mean_sub = data_sub.groupby(data_sub.index.month).mean()
print (mean_sub)
0 1 2
1 2 2 6
4 1 3 9
7 6 1 0
10 1 9 0
Create a new column m for the months:
data_org['m'] = data_org.index.month
print (data_org)
0 1 2 m
2000-01-31 2 2 6 1
2000-04-30 1 3 9 4
2000-07-31 6 1 0 7
2000-10-31 1 9 0 10
2001-01-31 0 9 3 1
2001-04-30 4 0 0 4
2001-07-31 4 1 7 7
2001-10-31 3 2 4 10
2002-01-31 7 2 4 1
2002-04-30 8 0 7 4
and join mean_sub on this column with DataFrame.join:
mean_data1 = data_org[['m']].join(mean_sub, on='m')
print (mean_data1)
m 0 1 2
2000-01-31 1 2 2 6
2000-04-30 4 1 3 9
2000-07-31 7 6 1 0
2000-10-31 10 1 9 0
2001-01-31 1 2 2 6
2001-04-30 4 1 3 9
2001-07-31 7 6 1 0
2001-10-31 10 1 9 0
2002-01-31 1 2 2 6
2002-04-30 4 1 3 9
df = data_org - mean_data1
print (df)
0 1 2 m
2000-01-31 0 0 0 0
2000-04-30 0 0 0 0
2000-07-31 0 0 0 0
2000-10-31 0 0 0 0
2001-01-31 -2 7 -3 0
2001-04-30 3 -3 -9 0
2001-07-31 -2 0 7 0
2001-10-31 2 -7 4 0
2002-01-31 5 0 -2 0
2002-04-30 7 -3 -2 0
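A more compact sketch of the same mapping idea (assuming mean_sub from above and data_org without the helper column m): look up each row's month in mean_sub, re-label with the original index, and subtract:
monthly = mean_sub.reindex(data_org.index.month).set_axis(data_org.index)
anomaly = data_org - monthly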
I have a dataframe with a date+time and a label, which I want to reshape into date (month) columns holding the label frequencies for that month:
date_time label
1 2017-09-26 17:08:00 0
3 2017-10-03 13:27:00 2
4 2017-10-04 19:04:00 0
11 2017-10-11 18:28:00 1
27 2017-10-13 11:22:00 0
28 2017-10-13 21:43:00 0
39 2017-10-16 14:43:00 0
40 2017-10-16 21:39:00 0
65 2017-10-21 21:53:00 2
...
98 2017-11-01 20:08:00 3
99 2017-11-02 12:00:00 3
100 2017-11-02 12:01:00 2
109 2017-11-02 12:03:00 3
110 2017-11-03 22:24:00 0
111 2017-11-04 09:05:00 3
112 2017-11-06 12:36:00 3
113 2017-11-06 12:48:00 2
128 2017-11-07 15:20:00 2
143 2017-11-10 16:36:00 3
144 2017-11-10 20:00:00 0
145 2017-11-10 20:02:00 0
I group the label frequency by month with this line (thanks partially to this post):
df2 = df.groupby([pd.Grouper(key='date_time', freq='M'), 'label'])['label'].count()
which outputs
date_time label
2017-09-30 0 1
2017-10-31 0 6
1 1
2 8
3 2
2017-11-30 0 25
4 2
5 1
2 4
3 11
2017-12-31 0 14
5 3
2 5
3 7
2018-01-31 0 8
4 1
5 1
2 2
3 3
but, as mentioned before, I would like to get the data by month/date columns:
2017-09-30 2017-10-31 2017-11-30 2017-12-31 2018-01-31
0 1 6 25 14 8
1 0 1 0 0 0
2 0 8 4 5 2
3 0 2 11 7 3
4 0 0 2 0 1
5 0 0 1 3 1
Currently I can sort of split the data with
pd.concat([df2[m] for m in df2.index.levels[0]], axis=1).fillna(0)
but I lose the column names:
label label label label label
0 1.0 6.0 25.0 14.0 8.0
1 0.0 1.0 0.0 0.0 0.0
2 0.0 8.0 4.0 5.0 2.0
3 0.0 2.0 11.0 7.0 3.0
4 0.0 0.0 2.0 0.0 1.0
5 0.0 0.0 1.0 3.0 1.0
So I have to do a longer version where I generate a series, rename it, concatenate and then fill in the blanks:
m_list = []
for m in df2.index.levels[0]:
    m_labels = df2[m]
    m_labels = m_labels.rename(m)
    m_list.append(m_labels)
pd.concat(m_list, axis=1).fillna(0)
resulting in
2017-09-30 2017-10-31 2017-11-30 2017-12-31 2018-01-31
0 1.0 6.0 25.0 14.0 8.0
1 0.0 1.0 0.0 0.0 0.0
2 0.0 8.0 4.0 5.0 2.0
3 0.0 2.0 11.0 7.0 3.0
4 0.0 0.0 2.0 0.0 1.0
5 0.0 0.0 1.0 3.0 1.0
Is there a shorter/more elegant way to get to this last dataframe from my original one?
You just need unstack here:
df.groupby([pd.Grouper(key='date_time', freq='M'), 'label'])['label'].count().unstack(0,fill_value=0)
Out[235]:
date_time 2017-09-30 2017-10-31 2017-11-30
label
0 1 5 3
1 0 1 0
2 0 2 3
3 0 0 6
Based on your groupby output:
s.unstack(0,fill_value=0)
Out[240]:
date_time 2017-09-30 2017-10-31 2017-11-30 2017-12-31 2018-01-31
label
0 1 6 25 14 8
1 0 1 0 0 0
2 0 8 4 5 2
3 0 2 11 7 3
4 0 0 2 0 1
5 0 0 1 3 1
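For reference, a hedged one-liner with pd.crosstab gives the same counts (columns come out as monthly Periods rather than month-end timestamps; missing combinations are filled with 0 automatically):
out = pd.crosstab(df['label'], df['date_time'].dt.to_period('M'))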
I want to merge static data with time-varying data.
First dataframe
a_columns = pd.MultiIndex.from_product([["A","B","C"],["1","2"]])
a_index = pd.date_range("20100101","20110101",freq="BM")
a = pd.DataFrame(columns=a_columns,index=a_index)#A
Second dataframe
b_columns = ["3","4","5"]
b_index = ["A","B","C"]
b = pd.DataFrame(columns=b_columns,index=b_index)
How do I join these two? My desired dataframe has the same form as a, but with additional columns.
Thanks!
I think you need to reshape b with stack and then create a DataFrame with to_frame. concat needs a DatetimeIndex, so the new index is taken from the first value of a's index.
Last, concat + sort_index:
#added some data - 2
a_columns = pd.MultiIndex.from_product([["A","B","C"],["1","2"]])
a_index = pd.date_range("20100101","20110101",freq="BM")
a = pd.DataFrame(2,columns=a_columns,index=a_index)#A
#added some data - 1
b_columns = ["3","4","5"]
b_index = ["A","B","C"]
b = pd.DataFrame(1,columns=b_columns,index=b_index)
c = b.stack().to_frame(a.index[0]).T
print (c)
A B C
3 4 5 3 4 5 3 4 5
2010-01-29 1 1 1 1 1 1 1 1 1
d = pd.concat([a,c], axis=1).sort_index(axis=1)
print (d)
A B C
1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
2010-01-29 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0
2010-02-26 2 2 NaN NaN NaN 2 2 NaN NaN NaN 2 2 NaN NaN NaN
2010-03-31 2 2 NaN NaN NaN 2 2 NaN NaN NaN 2 2 NaN NaN NaN
2010-04-30 2 2 NaN NaN NaN 2 2 NaN NaN NaN 2 2 NaN NaN NaN
2010-05-31 2 2 NaN NaN NaN 2 2 NaN NaN NaN 2 2 NaN NaN NaN
2010-06-30 2 2 NaN NaN NaN 2 2 NaN NaN NaN 2 2 NaN NaN NaN
2010-07-30 2 2 NaN NaN NaN 2 2 NaN NaN NaN 2 2 NaN NaN NaN
2010-08-31 2 2 NaN NaN NaN 2 2 NaN NaN NaN 2 2 NaN NaN NaN
2010-09-30 2 2 NaN NaN NaN 2 2 NaN NaN NaN 2 2 NaN NaN NaN
2010-10-29 2 2 NaN NaN NaN 2 2 NaN NaN NaN 2 2 NaN NaN NaN
2010-11-30 2 2 NaN NaN NaN 2 2 NaN NaN NaN 2 2 NaN NaN NaN
2010-12-31 2 2 NaN NaN NaN 2 2 NaN NaN NaN 2 2 NaN NaN NaN
Last, if you need to replace the NaNs only in the added columns from the first row:
d[c.columns] = d[c.columns].ffill()
print (d)
A B C
1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
2010-01-29 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0
2010-02-26 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0
2010-03-31 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0
2010-04-30 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0
2010-05-31 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0
2010-06-30 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0
2010-07-30 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0
2010-08-31 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0
2010-09-30 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0
2010-10-29 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0
2010-11-30 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0
2010-12-31 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0
A similar solution with reindex:
c = b.stack().to_frame(a.index[0]).T.reindex(a.index, method='ffill')
print (c)
A B C
3 4 5 3 4 5 3 4 5
2010-01-29 1 1 1 1 1 1 1 1 1
2010-02-26 1 1 1 1 1 1 1 1 1
2010-03-31 1 1 1 1 1 1 1 1 1
2010-04-30 1 1 1 1 1 1 1 1 1
2010-05-31 1 1 1 1 1 1 1 1 1
2010-06-30 1 1 1 1 1 1 1 1 1
2010-07-30 1 1 1 1 1 1 1 1 1
2010-08-31 1 1 1 1 1 1 1 1 1
2010-09-30 1 1 1 1 1 1 1 1 1
2010-10-29 1 1 1 1 1 1 1 1 1
2010-11-30 1 1 1 1 1 1 1 1 1
2010-12-31 1 1 1 1 1 1 1 1 1
d = pd.concat([a,c], axis=1).sort_index(axis=1)
print (d)
A B C
1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
2010-01-29 2 2 1 1 1 2 2 1 1 1 2 2 1 1 1
2010-02-26 2 2 1 1 1 2 2 1 1 1 2 2 1 1 1
2010-03-31 2 2 1 1 1 2 2 1 1 1 2 2 1 1 1
2010-04-30 2 2 1 1 1 2 2 1 1 1 2 2 1 1 1
2010-05-31 2 2 1 1 1 2 2 1 1 1 2 2 1 1 1
2010-06-30 2 2 1 1 1 2 2 1 1 1 2 2 1 1 1
2010-07-30 2 2 1 1 1 2 2 1 1 1 2 2 1 1 1
2010-08-31 2 2 1 1 1 2 2 1 1 1 2 2 1 1 1
2010-09-30 2 2 1 1 1 2 2 1 1 1 2 2 1 1 1
2010-10-29 2 2 1 1 1 2 2 1 1 1 2 2 1 1 1
2010-11-30 2 2 1 1 1 2 2 1 1 1 2 2 1 1 1
2010-12-31 2 2 1 1 1 2 2 1 1 1 2 2 1 1 1
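An alternative sketch that skips the ffill/reindex step by repeating b's stacked row across a's whole index up front:
# every date gets the same static row; columns are b's stacked MultiIndex
c = pd.DataFrame([b.stack()] * len(a), index=a.index)
d = pd.concat([a, c], axis=1).sort_index(axis=1)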
Following the question I asked (Combine similar rows to one row in python dataframe):
I have the original data below, and have 2 questions to ask:
yyyymmdd hr ariel cat kiki mmax vicky gaolie shiu nick ck
0 2015-12-27 9 0 0 0 0 0 0 0 23 0
1 2015-12-27 10 0 0 0 0 0 0 0 2 0
2 2015-12-27 11 0 0 0 0 0 0 0 20 0
3 2015-12-27 12 0 0 0 0 0 0 0 4 0
4 2015-12-27 17 0 0 0 0 0 0 0 2 0
5 2015-12-27 19 1 0 0 0 0 0 0 0 0
6 2015-12-28 8 0 8 0 0 0 0 0 0 0
7 2015-12-28 9 11 11 0 0 0 0 19 0 0
8 2015-12-28 10 85 13 0 0 2 0 15 0 0
9 2015-12-28 11 2 11 0 0 2 0 14 0 0
10 2015-12-28 12 2 20 0 4 0 0 10 0 0
11 2015-12-28 13 8 9 0 9 3 0 9 0 0
12 2015-12-28 14 4 10 0 8 0 0 22 0 0
13 2015-12-28 15 3 3 0 2 0 0 16 0 0
14 2015-12-28 16 14 5 1 1 0 0 19 0 0
15 2015-12-28 17 15 1 2 0 0 0 19 0 0
16 2015-12-28 18 0 0 0 6 0 0 0 0 0
17 2015-12-28 19 0 0 0 5 0 0 0 0 0
18 2015-12-28 20 0 0 0 1 0 0 0 0 0
how can I "fill" the "hr" index of the DataFrame? The result should be something like this:
yyyymmdd hr ariel cat kiki mmax vicky gaolie shiu nick ck
12/27/15 8 NaN NaN NaN NaN NaN NaN NaN NaN NaN
12/27/15 9 0 0 0 0 0 0 0 23 0
12/27/15 10 0 0 0 0 0 0 0 2 0
12/27/15 11 0 0 0 0 0 0 0 20 0
12/27/15 12 0 0 0 0 0 0 0 4 0
12/27/15 13 NaN NaN NaN NaN NaN NaN NaN NaN NaN
12/27/15 14 NaN NaN NaN NaN NaN NaN NaN NaN NaN
12/27/15 15 NaN NaN NaN NaN NaN NaN NaN NaN NaN
12/27/15 16 NaN NaN NaN NaN NaN NaN NaN NaN NaN
12/27/15 17 0 0 0 0 0 0 0 2 0
12/27/15 18 NaN NaN NaN NaN NaN NaN NaN NaN NaN
12/27/15 19 1 0 0 0 0 0 0 0 0
12/27/15 20 NaN NaN NaN NaN NaN NaN NaN NaN NaN
12/28/15 8 0 8 0 0 0 0 0 0 0
12/28/15 9 11 11 0 0 0 0 19 0 0
12/28/15 10 85 13 0 0 2 0 15 0 0
12/28/15 11 2 11 0 0 2 0 14 0 0
12/28/15 12 2 20 0 4 0 0 10 0 0
12/28/15 13 8 9 0 9 3 0 9 0 0
12/28/15 14 4 10 0 8 0 0 22 0 0
12/28/15 15 3 3 0 2 0 0 16 0 0
12/28/15 16 14 5 1 1 0 0 19 0 0
12/28/15 17 15 1 2 0 0 0 19 0 0
12/28/15 18 0 0 0 6 0 0 0 0 0
12/28/15 19 0 0 0 5 0 0 0 0 0
12/28/15 20 0 0 0 1 0 0 0 0 0
How can I plot line charts based on the columns and hr?
x-axis = columns , i.e. : ariel ,cat, kiki...
y-axis = hr, i.e. : 8,9,10...20
every subplot represents one date (i.e. 2015-12-27, 2015-12-28..)
Here is the framework of the plot I want to get (linked picture omitted).
You can convert yyyymmdd to datetime, combine with the hr information and then resample to hourly frequency like so:
import pandas as pd

df.yyyymmdd = pd.to_datetime(df.yyyymmdd)
# fold the hour into the timestamp
df.yyyymmdd = df.apply(lambda x: x.yyyymmdd + pd.DateOffset(hours=x.hr), axis=1)
df.set_index('yyyymmdd', inplace=True)
# .asfreq() keeps the recorded values and inserts NaN rows for the missing hours
# (modern pandas needs an explicit step after resample)
df = df.resample('H').asfreq()
to get:
hr ariel cat kiki mmax vicky gaolie shiu nick ck
yyyymmdd
2015-12-27 09:00:00 9 0 0 0 0 0 0 0 23 0
2015-12-27 10:00:00 10 0 0 0 0 0 0 0 2 0
2015-12-27 11:00:00 11 0 0 0 0 0 0 0 20 0
2015-12-27 12:00:00 12 0 0 0 0 0 0 0 4 0
2015-12-27 13:00:00 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2015-12-27 14:00:00 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2015-12-27 15:00:00 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2015-12-27 16:00:00 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2015-12-27 17:00:00 17 0 0 0 0 0 0 0 2 0
2015-12-27 18:00:00 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2015-12-27 19:00:00 19 1 0 0 0 0 0 0 0 0
2015-12-27 20:00:00 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2015-12-27 21:00:00 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2015-12-27 22:00:00 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2015-12-27 23:00:00 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2015-12-28 00:00:00 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2015-12-28 01:00:00 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2015-12-28 02:00:00 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2015-12-28 03:00:00 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2015-12-28 04:00:00 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2015-12-28 05:00:00 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2015-12-28 06:00:00 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2015-12-28 07:00:00 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2015-12-28 08:00:00 8 0 8 0 0 0 0 0 0 0
2015-12-28 09:00:00 9 11 11 0 0 0 0 19 0 0
2015-12-28 10:00:00 10 85 13 0 0 2 0 15 0 0
2015-12-28 11:00:00 11 2 11 0 0 2 0 14 0 0
2015-12-28 12:00:00 12 2 20 0 4 0 0 10 0 0
2015-12-28 13:00:00 13 8 9 0 9 3 0 9 0 0
2015-12-28 14:00:00 14 4 10 0 8 0 0 22 0 0
2015-12-28 15:00:00 15 3 3 0 2 0 0 16 0 0
2015-12-28 16:00:00 16 14 5 1 1 0 0 19 0 0
2015-12-28 17:00:00 17 15 1 2 0 0 0 19 0 0
2015-12-28 18:00:00 18 0 0 0 6 0 0 0 0 0
2015-12-28 19:00:00 19 0 0 0 5 0 0 0 0 0
2015-12-28 20:00:00 20 0 0 0 1 0 0 0 0 0
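The hourly index now covers every hour of each day; if you only want hours 8 through 20, as in the desired output, a hedged follow-up:
df = df.between_time('08:00', '20:00')   # both ends inclusive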
You could plot the result as follows, assuming that you want one figure per date with one subplot per column:
import matplotlib.pyplot as plt

# pd.Grouper(freq='D') replaces the deprecated pd.TimeGrouper('D')
for d, data in df.groupby(pd.Grouper(freq='D')):
    data.plot.line(figsize=(10, 20), subplots=True, sharey=True)
    plt.gcf().savefig('cats {}.png'.format(d), bbox_inches='tight')
    plt.close()
to get one saved figure per date (output images omitted).