Modify time column to exclusive and inclusive time in Pandas DataFrame - python

I have the following DataFrame of individuals and the time of an event.
id time
1 0
2 0
3 0
4 0
2 1
3 1
1 2
4 2
1 3
2 3
1 4
2 4
3 4
4 4
I want a column of left exclusive time points (start: time of the previous event). The column of right inclusive time points (stop) is the column time.
id start stop
1 0 0
2 0 0
3 0 0
4 0 0
2 0 1
3 0 1
1 0 2
4 0 2
1 2 3
2 1 3
1 3 4
2 3 4
3 1 4
4 2 4
Any straightforward functions that accomplish this?

Use DataFrameGroupBy.shift together with DataFrame.insert to add the new column in the second position, then rename the time column:
df.insert(1, 'start', df.groupby('id')['time'].shift(fill_value=0))
df = df.rename(columns={'time':'stop'})
print (df)
id start stop
0 1 0 0
1 2 0 0
2 3 0 0
3 4 0 0
4 2 0 1
5 3 0 1
6 1 0 2
7 4 0 2
8 1 2 3
9 2 1 3
10 1 3 4
11 2 3 4
12 3 1 4
13 4 2 4
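For reference, here is a self-contained sketch of the same approach that rebuilds the sample data from the question. The variable names `start` and `result` are only illustrative. It assumes the rows are already ordered by time within each id, because shift works on row order; if they are not, sort by time first.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "id":   [1, 2, 3, 4, 2, 3, 1, 4, 1, 2, 1, 2, 3, 4],
    "time": [0, 0, 0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 4, 4],
})

# Previous event time per id. The first event of each id has no
# predecessor, so fill_value=0 is used for it.
start = df.groupby("id")["time"].shift(fill_value=0)

# Rename time to stop and place start in the second position.
result = df.rename(columns={"time": "stop"})
result.insert(1, "start", start)
print(result)
```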

To get the previous value for every id, group by 'id' and use shift, assigning the result to a new column 'start':
df['start'] = df.groupby('id').time.shift(1, fill_value=0)
id time start
0 1 0 0.0
1 2 0 0.0
2 3 0 0.0
3 4 0 0.0
4 2 1 0.0
5 3 1 0.0
6 1 2 0.0
7 4 2 0.0
8 1 3 2.0
9 2 3 1.0
10 1 4 3.0
11 2 4 3.0
12 3 4 1.0
13 4 4 2.0
Then you might want to rename your 'time' column to 'end':
df.rename({'time':'end'}, axis=1, inplace=True)
If you want start to come before end, reorder your columns like this:
df[['id', 'start', 'end']]

Related

How to fill by counting last and forward N values with static window in pandas

I have calendar data that marks each day as a holiday or a working day.
I want to create two new features:
feature_1: the number of holidays in the week.
feature_2: the number of holidays in an N-window to the left and right of the current row (including the current value). In the example, N=5.
Example:
is_holiday feature_1 feature_2
idx
0 0 2 0
1 0 2 1
2 0 2 2
3 0 2 2
4 0 2 2
5 1 2 2
6 1 2 2
7 0 3 3
8 0 3 4
9 0 3 5
10 0 3 4
11 1 3 3
12 1 3 3
13 1 3 3
...
I think you need to group every 7 rows and aggregate with sum for the first feature, and use Series.rolling for the second:
df['f1'] = df.groupby(df.index // 7)['is_holiday'].transform('sum')
df['f2'] = df['is_holiday'].rolling(9, center=True, min_periods=1).sum().astype(int)
print (df)
is_holiday feature_1 feature_2 f1 f2
idx
0 0 2 0 2 0
1 0 2 1 2 1
2 0 2 2 2 2
3 0 2 2 2 2
4 0 2 2 2 2
5 1 2 2 2 2
6 1 2 2 2 2
7 0 3 3 3 3
8 0 3 4 3 4
9 0 3 5 3 5
10 0 3 4 3 4
11 1 3 3 3 3
12 1 3 3 3 3
13 1 3 3 3 3

Grouping data in pandas by rows

I have data with this structure:
id month val
1 0 4
2 0 4
3 0 5
1 1 3
2 1 7
3 1 9
1 2 12
2 2 1
3 2 5
1 3 10
2 3 4
3 3 7
...
I want to get mean val for each id, grouped by two months. Expected result:
id two_months val
1 0 3.5
2 0 5.5
3 0 7
1 1 11
2 1 2.5
3 1 6
What's the simplest way to do it using Pandas?
If the months are consecutive integers starting at 0, use integer division by 2:
df = df.groupby(['id',df['month'] // 2])['val'].mean().sort_index(level=[1,0]).reset_index()
print (df)
id month val
0 1 0 3.5
1 2 0 5.5
2 3 0 7.0
3 1 1 11.0
4 2 1 2.5
5 3 1 6.0
Another possible solution is to convert the months to datetimes and group with pd.Grouper:
df.index = pd.to_datetime(df['month'].add(1), format='%m')
df = df.groupby(['id', pd.Grouper(freq='2MS')])['val'].mean().sort_index(level=[1,0]).reset_index()
print (df)
id month val
0 1 1900-01-01 3.5
1 2 1900-01-01 5.5
2 3 1900-01-01 7.0
3 1 1900-03-01 11.0
4 2 1900-03-01 2.5
5 3 1900-03-01 6.0
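A self-contained sketch of the integer-division variant, built from the 12 rows shown in the question. Naming the bin column `two_months` is an assumption made here to match the expected output; the answer above keeps the name month.

```python
import pandas as pd

# The rows printed in the question.
df = pd.DataFrame({
    "id":    [1, 2, 3] * 4,
    "month": [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3],
    "val":   [4, 4, 5, 3, 7, 9, 12, 1, 5, 10, 4, 7],
})

# Months 0-1 fall into bin 0, months 2-3 into bin 1, and so on.
two_months = (df["month"] // 2).rename("two_months")

out = (df.groupby(["id", two_months])["val"]
         .mean()
         .sort_index(level=["two_months", "id"])
         .reset_index())
print(out)
```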

Padding and reshaping pandas dataframe

I have a dataframe with the following form:
data = pd.DataFrame({'ID':[1,1,1,2,2,2,2,3,3],'Time':[0,1,2,0,1,2,3,0,1],
'sig':[2,3,1,4,2,0,2,3,5],'sig2':[9,2,8,0,4,5,1,1,0],
'group':['A','A','A','B','B','B','B','A','A']})
print(data)
ID Time sig sig2 group
0 1 0 2 9 A
1 1 1 3 2 A
2 1 2 1 8 A
3 2 0 4 0 B
4 2 1 2 4 B
5 2 2 0 5 B
6 2 3 2 1 B
7 3 0 3 1 A
8 3 1 5 0 A
I want to reshape and pad such that each 'ID' has the same number of Time values, the signal columns are padded with zeros (or the mean value within ID), and group carries the same letter value. The output after padding would be:
data_pad = pd.DataFrame({'ID':[1,1,1,1,2,2,2,2,3,3,3,3],'Time':[0,1,2,3,0,1,2,3,0,1,2,3],
'sig1':[2,3,1,0,4,2,0,2,3,5,0,0],'sig2':[9,2,8,0,0,4,5,1,1,0,0,0],
'group':['A','A','A','A','B','B','B','B','A','A','A','A']})
print(data_pad)
ID Time sig1 sig2 group
0 1 0 2 9 A
1 1 1 3 2 A
2 1 2 1 8 A
3 1 3 0 0 A
4 2 0 4 0 B
5 2 1 2 4 B
6 2 2 0 5 B
7 2 3 2 1 B
8 3 0 3 1 A
9 3 1 5 0 A
10 3 2 0 0 A
11 3 3 0 0 A
My end goal is to ultimately reshape this into something with shape (number of ID, number of time points, number of sequences {2 here}).
It seems that if I pivot data, it fills in with nan values, which is fine for the signal values, but not the groups. I am also hoping to avoid looping through data.groupby('ID'), since my actual data has a large number of groups and the looping would likely be very slow.
Here's one approach creating the new index with pd.MultiIndex.from_product and using it to reindex on the Time column:
df = data.set_index(['ID', 'Time'])
# define the new index
ix = pd.MultiIndex.from_product([df.index.levels[0],
df.index.levels[1]],
names=['ID', 'Time'])
# reindex using the above multiindex
df = df.reindex(ix, fill_value=0)
# forward fill the missing values in group
df['group'] = df.group.mask(df.group.eq(0)).ffill()
print(df.reset_index())
ID Time sig sig2 group
0 1 0 2 9 A
1 1 1 3 2 A
2 1 2 1 8 A
3 1 3 0 0 A
4 2 0 4 0 B
5 2 1 2 4 B
6 2 2 0 5 B
7 2 3 2 1 B
8 3 0 3 1 A
9 3 1 5 0 A
10 3 2 0 0 A
11 3 3 0 0 A
IIUC:
(data.pivot_table(columns='Time', index=['ID','group'], fill_value=0)
.stack('Time')
.sort_index(level=['ID','Time'])
.reset_index()
)
Output:
ID group Time sig sig2
0 1 A 0 2 9
1 1 A 1 3 2
2 1 A 2 1 8
3 1 A 3 0 0
4 2 B 0 4 0
5 2 B 1 2 4
6 2 B 2 0 5
7 2 B 3 2 1
8 3 A 0 3 1
9 3 A 1 5 0
10 3 A 2 0 0
11 3 A 3 0 0
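For the stated end goal of an array with shape (number of ID, number of time points, number of sequences), a rough sketch follows. It assumes zero padding and uses the hypothetical names `full_index`, `padded` and `arr`. Because from_product orders the rows by ID first and Time second, a plain reshape lines up with the desired axes.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'ID': [1, 1, 1, 2, 2, 2, 2, 3, 3],
                     'Time': [0, 1, 2, 0, 1, 2, 3, 0, 1],
                     'sig': [2, 3, 1, 4, 2, 0, 2, 3, 5],
                     'sig2': [9, 2, 8, 0, 4, 5, 1, 1, 0],
                     'group': ['A', 'A', 'A', 'B', 'B', 'B', 'B', 'A', 'A']})

ids = np.sort(data['ID'].unique())
times = np.sort(data['Time'].unique())

# Every (ID, Time) combination, ordered by ID and then Time.
full_index = pd.MultiIndex.from_product([ids, times], names=['ID', 'Time'])

# Pad only the signal columns; missing combinations become 0.
padded = (data.set_index(['ID', 'Time'])[['sig', 'sig2']]
              .reindex(full_index, fill_value=0))

# Shape: (number of IDs, number of time points, number of signals).
arr = padded.to_numpy().reshape(len(ids), len(times), 2)
print(arr.shape)
```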

create a 'group number' column for a pandas data frame column of '0' and '1' s

How to get the data frame below
dd = pd.DataFrame({'val':[0,0,1,1,1,0,0,0,0,1,1,0,1,1,1,1,0,0],
'groups':[1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,'ignore','ignore']})
val groups
0 0 1
1 0 1
2 1 1
3 1 1
4 1 1
5 0 2
6 0 2
7 0 2
8 0 2
9 1 2
10 1 2
11 0 3
12 1 3
13 1 3
14 1 3
15 1 3
16 0 ignore
17 0 ignore
I have a series dd.val which has the values [0,0,1,1,1,0,0,0,0,1,1,0,1,1,1,1,0,0].
How can I create dd.groups from dd.val?
The first 0,0,1,1,1 will form group 1 (i.e. from the beginning up to the next occurrence of 0 after the 1's).
0,0,0,0,1,1 will form group 2 (incremental group number, starting where the previous group ended and running until the next occurrence of 0 after the 1's), etc.
Can anyone please help?
First test whether a 0 directly follows a 1, then create the groups from the cumulative sum with Series.cumsum:
s = (dd['val'].eq(0) & dd['val'].shift().eq(1)).cumsum().add(1)
Then use numpy.where to label the last group 'ignore' if the data end with a 0:
mask = s.eq(s.max()) & (dd['val'].iat[-1] == 0)
dd['new'] = np.where(mask, 'ignore', s)
print (dd)
val groups new
0 0 1 1
1 0 1 1
2 1 1 1
3 1 1 1
4 1 1 1
5 0 2 2
6 0 2 2
7 0 2 2
8 0 2 2
9 1 2 2
10 1 2 2
11 0 3 3
12 1 3 3
13 1 3 3
14 1 3 3
15 1 3 3
16 0 ignore ignore
17 0 ignore ignore
IIUC, first use diff and cumsum to build the group numbers, then use np.where to label every group that contains no 1 as 'ignore':
s = dd.val.diff().eq(-1).cumsum() + 1
dd['New'] = np.where(dd['val'].eq(1).groupby(s).transform('any'), s, 'ignore')
dd
val groups New
0 0 1 1
1 0 1 1
2 1 1 1
3 1 1 1
4 1 1 1
5 0 2 2
6 0 2 2
7 0 2 2
8 0 2 2
9 1 2 2
10 1 2 2
11 0 3 3
12 1 3 3
13 1 3 3
14 1 3 3
15 1 3 3
16 0 ignore ignore
17 0 ignore ignore
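A self-contained sketch that combines the two ideas above, with the hypothetical names `starts`, `group_id` and `has_one`. It stores the labels as strings, since the column mixes numbers with 'ignore'.

```python
import pandas as pd

dd = pd.DataFrame({'val': [0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0]})

# A new group starts wherever a 0 directly follows a 1.
starts = dd['val'].eq(0) & dd['val'].shift().eq(1)
group_id = starts.cumsum() + 1

# A group without any 1 (here, the trailing zeros) is labelled 'ignore'.
has_one = dd['val'].eq(1).groupby(group_id).transform('any')
dd['groups'] = group_id.astype(str).where(has_one, 'ignore')
print(dd)
```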

How to get an item from a second df if there is a match between them

This question probably already has an answer, but I could not find one.
I want to take items from a second dataframe and add them as a new column in the first dataframe wherever there is a match between the two dataframes.
Here is some sample data that is quite similar to the case I am facing.
import pandas as pd
import numpy as np
a = np.arange(3).repeat(3)
b = np.tile(np.arange(3),3)
df1 = pd.DataFrame({'a':a, 'b':b})
a b
0 0 0
1 0 1
2 0 2
3 1 0
4 1 1
5 1 2
6 2 0
7 2 1
8 2 2
a2 = np.arange(1, 4).repeat(3)
b2 = np.tile(np.arange(3),3)
c = np.random.randint(0, 10, size=a2.size)
df2 = pd.DataFrame({'a2':a2, 'b2':b2, 'c':c})
a2 b2 c
0 1 0 3
1 1 1 1
2 1 2 9
3 2 0 5
4 2 1 8
5 2 2 4
6 3 0 1
7 3 1 6
8 3 2 1
The desired output should be like
a b c
0 0 0 nan
1 0 1 nan
2 0 2 nan
3 1 0 3
4 1 1 1
5 1 2 9
6 2 0 5
7 2 1 8
8 2 2 4
Unfortunately, I could not come up with any way to solve it.
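Rename the columns of df2 to match df1, then use merge with a left join (the values in c differ from the question because df2 is filled with np.random.randint):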
df = df1.merge(df2.rename(columns={'a2':'a', 'b2':'b'}), on=['a','b'], how='left')
print (df)
a b c
0 0 0 NaN
1 0 1 NaN
2 0 2 NaN
3 1 0 3.0
4 1 1 5.0
5 1 2 0.0
6 2 0 2.0
7 2 1 6.0
8 2 2 2.0
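As an alternative to renaming, merge also accepts left_on and right_on. The sketch below swaps the random column for the fixed c values printed in the question, so that the result can be compared with the desired output; `out` is just an illustrative name.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'a': np.arange(3).repeat(3),
                    'b': np.tile(np.arange(3), 3)})

df2 = pd.DataFrame({'a2': np.arange(1, 4).repeat(3),
                    'b2': np.tile(np.arange(3), 3),
                    # fixed values as printed in the question, not random
                    'c': [3, 1, 9, 5, 8, 4, 1, 6, 1]})

# Left join keeps every row of df1; unmatched rows get NaN in c.
out = df1.merge(df2, left_on=['a', 'b'], right_on=['a2', 'b2'], how='left')

# The key columns from df2 are redundant after the join.
out = out.drop(columns=['a2', 'b2'])
print(out)
```

Because of the NaN values, c becomes a float column, which is why the matched values print as 3.0, 1.0 and so on.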
