I have calendar data with a day-type flag - whether the day is a holiday or not.
I want to create new features:
feature_1: the value in each cell is the number of holidays in that week.
feature_2: the value in each cell is the number of holidays in an N-window (N values to the left and N to the right, each count including the current value). In the example below, N=5.
Example:
is_holiday feature_1 feature_2
idx
0 0 2 0
1 0 2 1
2 0 2 2
3 0 2 2
4 0 2 2
5 1 2 2
6 1 2 2
7 0 3 3
8 0 3 4
9 0 3 5
10 0 3 4
11 1 3 3
12 1 3 3
13 1 3 3
...
For the first feature, group every 7 values and aggregate with sum; for the second, use Series.rolling with a centered window:
# feature_1: sum of holidays within each 7-day block
df['f1'] = df.groupby(df.index // 7)['is_holiday'].transform('sum')
# feature_2: centered rolling sum over 9 days (2*N - 1 with N=5)
df['f2'] = df['is_holiday'].rolling(9, center=True, min_periods=1).sum().astype(int)
print (df)
is_holiday feature_1 feature_2 f1 f2
idx
0 0 2 0 2 0
1 0 2 1 2 1
2 0 2 2 2 2
3 0 2 2 2 2
4 0 2 2 2 2
5 1 2 2 2 2
6 1 2 2 2 2
7 0 3 3 3 3
8 0 3 4 3 4
9 0 3 5 3 5
10 0 3 4 3 4
11 1 3 3 3 3
12 1 3 3 3 3
13 1 3 3 3 3
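In general, with N values on each side where each side's count includes the current day, the centered window spans 2*N - 1 days, which is where the 9 comes from. A minimal sketch parametrizing the window (N here is an assumed knob, not part of the original code):
N = 5  # assumed parameter: values per side, including the current day
df['f2'] = df['is_holiday'].rolling(2 * N - 1, center=True, min_periods=1).sum().astype(int)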
I have data with this structure:
id month val
1 0 4
2 0 4
3 0 5
1 1 3
2 1 7
3 1 9
1 2 12
2 2 1
3 2 5
1 3 10
2 3 4
3 3 7
...
I want to get the mean of val for each id, grouped into two-month bins. Expected result:
id two_months val
1 0 3.5
2 0 5.5
3 0 7
1 1 11
2 1 2.5
3 1 6
What's the simplest way to do it using Pandas?
If months are consecutive integers starting from 0, use integer division by 2:
df = df.groupby(['id',df['month'] // 2])['val'].mean().sort_index(level=[1,0]).reset_index()
print (df)
id month val
0 1 0 3.5
1 2 0 5.5
2 3 0 7.0
3 1 1 11.0
4 2 1 2.5
5 3 1 6.0
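If the grouping column should be called two_months as in the expected result, one small follow-up is to rename it, e.g.:
df = df.rename(columns={'month': 'two_months'})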
A possible alternative is to convert the months to datetimes and group with pd.Grouper:
# map month 0, 1, 2, ... to the synthetic dates 1900-01-01, 1900-02-01, ... so pd.Grouper can bin by 2 months
df.index = pd.to_datetime(df['month'].add(1), format='%m')
df = df.groupby(['id', pd.Grouper(freq='2MS')])['val'].mean().sort_index(level=[1,0]).reset_index()
print (df)
id month val
0 1 1900-01-01 3.5
1 2 1900-01-01 5.5
2 3 1900-01-01 7.0
3 1 1900-03-01 11.0
4 2 1900-03-01 2.5
5 3 1900-03-01 6.0
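If integer bin labels are preferred over the synthetic 1900 datetimes, one possible post-processing step (a sketch, valid because all months fall in that single synthetic year) is:
# 1900-01-01 -> 0, 1900-03-01 -> 1
df['month'] = df['month'].dt.month.sub(1) // 2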
I have a dataframe with the following form:
data = pd.DataFrame({'ID':[1,1,1,2,2,2,2,3,3],'Time':[0,1,2,0,1,2,3,0,1],
                     'sig':[2,3,1,4,2,0,2,3,5],'sig2':[9,2,8,0,4,5,1,1,0],
                     'group':['A','A','A','B','B','B','B','A','A']})
print(data)
ID Time sig sig2 group
0 1 0 2 9 A
1 1 1 3 2 A
2 1 2 1 8 A
3 2 0 4 0 B
4 2 1 2 4 B
5 2 2 0 5 B
6 2 3 2 1 B
7 3 0 3 1 A
8 3 1 5 0 A
I want to reshape and pad so that each ID has the same number of Time values, sig and sig2 are padded with zeros (or the mean value within the ID), and group carries the same letter value. The output after padding would be:
data_pad = pd.DataFrame({'ID':[1,1,1,1,2,2,2,2,3,3,3,3],'Time':[0,1,2,3,0,1,2,3,0,1,2,3],
                         'sig':[2,3,1,0,4,2,0,2,3,5,0,0],'sig2':[9,2,8,0,0,4,5,1,1,0,0,0],
                         'group':['A','A','A','A','B','B','B','B','A','A','A','A']})
print(data_pad)
ID Time sig sig2 group
0 1 0 2 9 A
1 1 1 3 2 A
2 1 2 1 8 A
3 1 3 0 0 A
4 2 0 4 0 B
5 2 1 2 4 B
6 2 2 0 5 B
7 2 3 2 1 B
8 3 0 3 1 A
9 3 1 5 0 A
10 3 2 0 0 A
11 3 3 0 0 A
My end goal is to reshape this into an array of shape (number of IDs, number of time points, number of signals; 2 here).
If I pivot the data, it fills in NaN values, which is fine for the signal values but not for the groups. I am also hoping to avoid looping through data.groupby('ID'), since my actual data has a large number of groups and looping would likely be very slow.
Here's one approach: create the complete index with pd.MultiIndex.from_product and use it to reindex, so every ID gets the full range of Time values:
df = data.set_index(['ID', 'Time'])
# define the new complete index (all IDs x all Times)
ix = pd.MultiIndex.from_product([df.index.levels[0],
                                 df.index.levels[1]],
                                names=['ID', 'Time'])
# reindex using the above multiindex
df = df.reindex(ix, fill_value=0)
# the 0 fill values in group are placeholders: replace them with NaN, then forward fill
df['group'] = df.group.mask(df.group.eq(0)).ffill()
print(df.reset_index())
ID Time sig sig2 group
0 1 0 2 9 A
1 1 1 3 2 A
2 1 2 1 8 A
3 1 3 0 0 A
4 2 0 4 0 B
5 2 1 2 4 B
6 2 2 0 5 B
7 2 3 2 1 B
8 3 0 3 1 A
9 3 1 5 0 A
10 3 2 0 0 A
11 3 3 0 0 A
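For the stated end goal of a (number of IDs, number of time points, number of signals) array, the reindexed frame above (still MultiIndexed, before reset_index) can be reshaped; a minimal sketch, assuming both signal columns are wanted:
# rows are ordered ID-major by the MultiIndex, so a plain reshape works
arr = df[['sig', 'sig2']].to_numpy().reshape(len(df.index.levels[0]), len(df.index.levels[1]), 2)
print (arr.shape)  # (3, 4, 2)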
IIUC, pivot with fill_value=0 and stack Time back into the rows:
(data.pivot_table(columns='Time', index=['ID','group'], fill_value=0)
.stack('Time')
.sort_index(level=['ID','Time'])
.reset_index()
)
Output:
ID group Time sig sig2
0 1 A 0 2 9
1 1 A 1 3 2
2 1 A 2 1 8
3 1 A 3 0 0
4 2 B 0 4 0
5 2 B 1 2 4
6 2 B 2 0 5
7 2 B 3 2 1
8 3 A 0 3 1
9 3 A 1 5 0
10 3 A 2 0 0
11 3 A 3 0 0
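Note that pivot_table aggregates duplicated (ID, Time) pairs, with mean as the default; each pair is unique in this data so nothing changes, but passing aggfunc='first' would make that intent explicit, e.g.:
data.pivot_table(columns='Time', index=['ID','group'], aggfunc='first', fill_value=0)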
How can I get the data frame below?
dd = pd.DataFrame({'val':[0,0,1,1,1,0,0,0,0,1,1,0,1,1,1,1,0,0],
                   'groups':[1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,'ignore','ignore']})
val groups
0 0 1
1 0 1
2 1 1
3 1 1
4 1 1
5 0 2
6 0 2
7 0 2
8 0 2
9 1 2
10 1 2
11 0 3
12 1 3
13 1 3
14 1 3
15 1 3
16 0 ignore
17 0 ignore
I have a series dd.val which has the values [0,0,1,1,1,0,0,0,0,1,1,0,1,1,1,1,0,0].
How can I create dd.groups from dd.val?
The first run 0,0,1,1,1 forms group 1 (i.e. from the beginning up to the next occurrence of 0 after the 1's).
0,0,0,0,1,1 forms group 2 (an incremented group number, starting where the previous group ended, up to the next occurrence of 0 after the 1's), and so on. Trailing 0's that are never followed by a 1 get the label 'ignore'.
Can anyone please help?
First test where a 0 directly follows a 1 (the start of a new group) and create group numbers with a cumulative sum via Series.cumsum:
# True at each 0 that directly follows a 1; the cumulative sum numbers the groups
s = (dd['val'].eq(0) & dd['val'].shift().eq(1)).cumsum().add(1)
Then relabel the last group as 'ignore' if the data ends with 0, using numpy.where:
import numpy as np

# a group is a run of 0's followed by 1's, so the last group is all zeros exactly when the data ends with 0
mask = s.eq(s.max()) & (dd['val'].iat[-1] == 0)
dd['new'] = np.where(mask, 'ignore', s)
print (dd)
val groups new
0 0 1 1
1 0 1 1
2 1 1 1
3 1 1 1
4 1 1 1
5 0 2 2
6 0 2 2
7 0 2 2
8 0 2 2
9 1 2 2
10 1 2 2
11 0 3 3
12 1 3 3
13 1 3 3
14 1 3 3
15 1 3 3
16 0 ignore ignore
17 0 ignore ignore
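For reference, the intermediate s numbers the runs 1 through 4; the final all-zero run gets group 4, which the mask then relabels as 'ignore':
print (s.tolist())
[1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4]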
IIUC, first use diff and cumsum to number the groups, then use np.where to relabel any group that never contains a 1 as 'ignore':
import numpy as np

s = dd.val.diff().eq(-1).cumsum() + 1
# keep the group number where the group contains at least one 1, otherwise 'ignore'
dd['New'] = np.where(dd['val'].eq(1).groupby(s).transform('any'), s, 'ignore')
dd
val groups New
0 0 1 1
1 0 1 1
2 1 1 1
3 1 1 1
4 1 1 1
5 0 2 2
6 0 2 2
7 0 2 2
8 0 2 2
9 1 2 2
10 1 2 2
11 0 3 3
12 1 3 3
13 1 3 3
14 1 3 3
15 1 3 3
16 0 ignore ignore
17 0 ignore ignore
This question probably already has an answer, but I could not find one.
I want to take values from a second dataframe and append them as a new column in the first dataframe wherever the two dataframes match.
Here is some sample data, quite similar to the case I am confronting:
import pandas as pd
import numpy as np
a = np.arange(3).repeat(3)
b = np.tile(np.arange(3),3)
df1 = pd.DataFrame({'a':a, 'b':b})
a b
0 0 0
1 0 1
2 0 2
3 1 0
4 1 1
5 1 2
6 2 0
7 2 1
8 2 2
a2 = np.arange(1, 4).repeat(3)
b2 = np.tile(np.arange(3),3)
c = np.random.randint(0, 10, size=a2.size)
df2 = pd.DataFrame({'a2':a2, 'b2':b2, 'c':c})
a2 b2 c
0 1 0 3
1 1 1 1
2 1 2 9
3 2 0 5
4 2 1 8
5 2 2 4
6 3 0 1
7 3 1 6
8 3 2 1
The desired output should look like this:
a b c
0 0 0 nan
1 0 1 nan
2 0 2 nan
3 1 0 3
4 1 1 1
5 1 2 9
6 2 0 5
7 2 1 8
8 2 2 4
Unfortunately, I could not come up with any way to solve it.
Use merge with a left join, renaming the columns of df2 to match df1:
df = df1.merge(df2.rename(columns={'a2':'a', 'b2':'b'}), on=['a','b'], how='left')
print (df)
a b c
0 0 0 NaN
1 0 1 NaN
2 0 2 NaN
3 1 0 3.0
4 1 1 1.0
5 1 2 9.0
6 2 0 5.0
7 2 1 8.0
8 2 2 4.0
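Note the merged c column is upcast to float because of the NaN values; if an integer dtype is needed, pandas' nullable Int64 can hold the missing values, e.g.:
df['c'] = df['c'].astype('Int64')  # missing matches become <NA> instead of NaN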