Merge dataframes on nearest datetime / timestamp - python

I have two data frames as follows:
A = pd.DataFrame({"ID":["A", "A", "C" ,"B", "B"], "date":["06/22/2014","07/02/2014","01/01/2015","01/01/1991","08/02/1999"]})
B = pd.DataFrame({"ID":["A", "A", "C" ,"B", "B"], "date":["02/15/2015","06/30/2014","07/02/1999","10/05/1990","06/24/2014"], "value": ["3","5","1","7","8"] })
They look like the following:
>>> A
ID date
0 A 2014-06-22
1 A 2014-07-02
2 C 2015-01-01
3 B 1991-01-01
4 B 1999-08-02
>>> B
ID date value
0 A 2015-02-15 3
1 A 2014-06-30 5
2 C 1999-07-02 1
3 B 1990-10-05 7
4 B 2014-06-24 8
I want to merge A with the values of B using the nearest date. In this example, none of the dates match, but it could be the case that some do.
The output should be something like this:
>>> C
ID date value
0 A 06/22/2014 8
1 A 07/02/2014 5
2 C 01/01/2015 3
3 B 01/01/1991 7
4 B 08/02/1999 1
It seems to me that there should be a native function in pandas that would allow this.
Note: a similar question has been asked here:
pandas.merge: match the nearest time stamp >= the series of timestamps

You can use reindex with method='nearest' and then merge:
A['date'] = pd.to_datetime(A.date)
B['date'] = pd.to_datetime(B.date)
A.sort_values('date', inplace=True)
B.sort_values('date', inplace=True)
B1 = B.set_index('date').reindex(A.set_index('date').index, method='nearest').reset_index()
print (B1)
print (pd.merge(A,B1, on='date'))
ID_x date ID_y value
0 B 1991-01-01 B 7
1 B 1999-08-02 C 1
2 A 2014-06-22 B 8
3 A 2014-07-02 A 5
4 C 2015-01-01 A 3
You can also add parameter suffixes:
print (pd.merge(A,B1, on='date', suffixes=('_', '')))
ID_ date ID value
0 B 1991-01-01 B 7
1 B 1999-08-02 C 1
2 A 2014-06-22 B 8
3 A 2014-07-02 A 5
4 C 2015-01-01 A 3

pd.merge_asof(A, B, on="date", direction='nearest')
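Note that merge_asof requires both frames to be sorted by the merge key. A minimal sketch reusing A and B from the question (suffixes=('', '_B') is just an illustrative choice to keep A's ID column name):
A['date'] = pd.to_datetime(A.date)
B['date'] = pd.to_datetime(B.date)
C = pd.merge_asof(A.sort_values('date'), B.sort_values('date'),
                  on='date', direction='nearest', suffixes=('', '_B'))
print (C)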

Python - Count duplicate user Id's occurrence in a given month

If I create a Dataframe from
df = pd.DataFrame({"date": ['2022-08-10','2022-08-18','2022-08-18','2022-08-20','2022-08-20','2022-08-24','2022-08-26','2022-08-30','2022-09-3','2022-09-8','2022-09-13'],
"id": ['A','B','C','D','E','B','A','F','G','F','H']})
df['date'] = pd.to_datetime(df['date'])
(Table 1 below shows the data.)
I am interested in counting how many times an ID appears in a given month. For example, in a given month A, B and F all occur twice whilst everything else occurs once. The difficulty with this data is that the frequency of dates is not evenly spread out.
I attempted to resample on date by month, with the hope of counting duplicates.
df.resample('M', on='date')['id']
But all the functions that can be used on resample just give me the number of unique occurrences rather than how many times each ID occurred.
A rough example of the output is below [Table 2]
All of the examples I have seen merely count how many total or unique occurrences there are for a given month; this question is focused on finding out how many occurrences each ID had in a month.
Thank you for your time.
[Table 1] - Data
idx  date        id
0    2022-08-10  A
1    2022-08-18  B
2    2022-08-18  C
3    2022-08-20  D
4    2022-08-20  E
5    2022-08-24  B
6    2022-08-26  A
7    2022-08-30  F
8    2022-09-03  G
9    2022-09-08  F
10   2022-09-13  H
[Table 2] - Rough example of desired output
id  occurences in a month
A   2
B   2
C   1
D   1
E   1
F   2
G   1
H   1
Use Series.dt.to_period for month periods and count values per id by GroupBy.size, then aggregate sum:
df1 = (df.groupby(['id', df['date'].dt.to_period('m')])
         .size()
         .groupby(level=0)
         .sum()
         .reset_index(name='occurences in a month'))
print (df1)
id occurences in a month
0 A 2
1 B 2
2 C 1
3 D 1
4 E 1
5 F 2
6 G 1
7 H 1
Or use Grouper:
df1 = (df.groupby(['id', pd.Grouper(freq='M', key='date')])
         .size()
         .groupby(level=0)
         .sum()
         .reset_index(name='occurences in a month'))
print (df1)
EDIT:
df = pd.DataFrame({"date": ['2022-08-10','2022-08-18','2022-08-18','2022-08-20','2022-08-20','2022-08-24','2022-08-26',
'2022-08-30','2022-09-3','2022-09-8','2022-09-13','2050-12-15'],
"id": ['A','B','C','D','E','B','A','F','G','F','H','H']})
df['date'] = pd.to_datetime(df['date'],format='%Y-%m-%d')
print (df)
Because counting first per month (or per day, or per date) and then summing the values is the same as counting per id directly:
df1 = df.groupby('id').size().reset_index(name='occurences')
print (df1)
id occurences
0 A 2
1 B 2
2 C 1
3 D 1
4 E 1
5 F 2
6 G 1
7 H 2
The per-month and per-day counts below sum to the same totals per id:
df1 = (df.groupby(['id', df['date'].dt.to_period('m')])
.size())
print (df1)
id date
A 2022-08 2
B 2022-08 2
C 2022-08 1
D 2022-08 1
E 2022-08 1
F 2022-08 1
2022-09 1
G 2022-09 1
H 2022-09 1
2050-12 1
dtype: int64
df1 = (df.groupby(['id', df['date'].dt.to_period('d')])
.size())
print (df1)
id date
A 2022-08-10 1
2022-08-26 1
B 2022-08-18 1
2022-08-24 1
C 2022-08-18 1
D 2022-08-20 1
E 2022-08-20 1
F 2022-08-30 1
2022-09-08 1
G 2022-09-03 1
H 2022-09-13 1
2050-12-15 1
dtype: int64
df1 = (df.groupby(['id', df['date'].dt.day])
.size())
print (df1)
id date
A 10 1
26 1
B 18 1
24 1
C 18 1
D 20 1
E 20 1
F 8 1
30 1
G 3 1
H 13 1
15 1
dtype: int64
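If you want to keep the counts split out per month rather than summed across months, a minimal sketch (not part of the original answer) reshaping the per-(id, month) counts with unstack:
df1 = (df.groupby(['id', df['date'].dt.to_period('m')])
         .size()
         .unstack(fill_value=0))
print (df1)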

Breakdown data-frame into second-by-second time-series

I have this dataset of active subjects during specified time-periods.
start end name
0 00:00 00:10 a
1 00:10 00:20 b
2 00:00 00:20 c
3 00:00 00:10 d
4 00:10 00:15 e
5 00:15 00:20 a
The intervals are inclusive on the left (start) side and not inclusive on the right (end) side.
There are always three subjects active. I want to increase the granularity of the data, so that I will have info of the three active subjects for each second. Each second has three unique values.
This would be the desired result for the test case.
slot1 slot2 slot3
0 a c d
1 a c d
2 a c d
3 a c d
4 a c d
5 a c d
6 a c d
7 a c d
8 a c d
9 a c d
10 b c e
11 b c e
12 b c e
13 b c e
14 b c e
15 b c a
16 b c a
17 b c a
18 b c a
19 b c a
The order of the subjects inside the slots is irrelevant for now. The subjects can reappear in the data like "a" from 00:00 to 00:10 and then again from 00:15 to 00:20. The intervals can be at any second.
Route 1: One (costly but easy) way is to explode the data to the seconds, then merge 3 times:
time_df = (('00:' + df[['start','end']])
           .apply(lambda x: pd.to_timedelta(x).dt.total_seconds())
           .astype(int)
           .apply(lambda x: np.arange(*x), axis=1)
           .to_frame('time')
           .assign(slot=df['name'])
           .explode('time')
           )
(time_df.merge(time_df, on='time', suffixes=['1','2'])
        .query('slot1 < slot2')
        .merge(time_df, on='time')
        .query('slot2 < slot')
)
Output:
time slot1 slot2 slot
2 0 a c d
11 1 a c d
20 2 a c d
29 3 a c d
38 4 a c d
47 5 a c d
56 6 a c d
65 7 a c d
74 8 a c d
83 9 a c d
92 10 b c e
101 11 b c e
110 12 b c e
119 13 b c e
128 14 b c e
139 15 a b c
148 16 a b c
157 17 a b c
166 18 a b c
175 19 a b c
Route 2: Another way is to cross merge then query the overlapping intervals:
df[['start','end']] = (('00:' + df[['start','end']])
                       .apply(lambda x: pd.to_timedelta(x).dt.total_seconds())
                       .astype(int)
                       )
(df.merge(df, how='cross')
   .assign(start=lambda x: x.filter(like='start').max(axis=1),
           end=lambda x: x.filter(like='end').min(axis=1))
   .query('start < end & name_x < name_y')
   [['name_x','name_y','start','end']]
   .merge(df, how='cross')
   .assign(start=lambda x: x.filter(like='start').max(axis=1),
           end=lambda x: x.filter(like='end').min(axis=1))
   .query('start < end & name_y < name')
   [['start','end', 'name_x','name_y', 'name']]
)
Output:
start end name_x name_y name
3 0 10 a c d
16 10 15 b c e
38 15 20 a b c
As you can see, this output is the same as the other, just in the original form. Depending on your data, one route might be better than the other.
Update: Since your data has exactly 3 slots at any time, you can easily do this with pivot. This is the best solution.
# time_df as in Route 1
(time_df.sort_values(['time','slot'])
        .assign(nums = lambda x: np.arange(len(x)) % 3)
        .pivot(index='time', columns='nums', values='slot')
)
# in general, `.assign(nums=lambda x: x.groupby('time').cumcount())`
# also works instead of the above
Output:
nums 0 1 2
time
0 a c d
1 a c d
2 a c d
3 a c d
4 a c d
5 a c d
6 a c d
7 a c d
8 a c d
9 a c d
10 b c e
11 b c e
12 b c e
13 b c e
14 b c e
15 a b c
16 a b c
17 a b c
18 a b c
19 a b c
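If you want the exact slot1/slot2/slot3 column names from the question, a minimal sketch building on time_df and the cumcount variant mentioned in the comment above (the rename is my own addition, not part of the original answer):
out = (time_df.sort_values(['time', 'slot'])
              .assign(nums=lambda x: x.groupby('time').cumcount())
              .pivot(index='time', columns='nums', values='slot')
              .rename(columns=lambda c: f'slot{c + 1}'))
print (out)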
This solution uses piso (pandas interval set operations) and will run fast.
setup
Create data and convert to pandas.Timedelta
import pandas as pd
import piso

df = pd.DataFrame(
    {
        "start": ["00:00", "00:10", "00:00", "00:00", "00:10", "00:15"],
        "end": ["00:10", "00:20", "00:20", "00:10", "00:15", "00:20"],
        "name": ["a", "b", "c", "d", "e", "a"],
    }
)
df[["start", "end"]] = ("00:" + df[["start", "end"]].astype(str)).apply(pd.to_timedelta)
create the sample times (a pandas.TimedeltaIndex of seconds):
sample_times = pd.timedelta_range(df["start"].min(), df["end"].max(), freq="s")
solution
For each possible value of "name" create a pandas.IntervalIndex which has the intervals defined by start and stop columns:
ii_series = df.groupby("name").apply(
    lambda d: pd.IntervalIndex.from_arrays(d["start"], d["end"], closed="left")
)
ii_series looks like this:
name
a IntervalIndex([[0 days 00:00:00, 0 days 00:10:...
b IntervalIndex([[0 days 00:10:00, 0 days 00:20:...
c IntervalIndex([[0 days 00:00:00, 0 days 00:20:...
d IntervalIndex([[0 days 00:00:00, 0 days 00:10:...
e IntervalIndex([[0 days 00:10:00, 0 days 00:15:...
dtype: object
Then to each of these interval indexes we apply the piso.contains function, which can be used to test whether a set of points is contained in an interval:
contained = ii_series.apply(piso.contains, x=sample_times, result="points")
contained will be a dataframe indexed by the names, whose columns are the sample times. The transpose of this looks like:
a b c d e
0 days 00:00:00 True False True True False
0 days 00:00:01 True False True True False
0 days 00:00:02 True False True True False
0 days 00:00:03 True False True True False
0 days 00:00:04 True False True True False
... ... ... ... ... ...
0 days 00:19:56 True True True False False
0 days 00:19:57 True True True False False
0 days 00:19:58 True True True False False
0 days 00:19:59 True True True False False
0 days 00:20:00 False False False False False
This format of data may be easier to work with, depending on the application. But if you want it in the format stated in the question, you can create a series of lists, indexed by each second:
series_of_lists = (
    contained.transpose()
    .melt(ignore_index=False)
    .query("value == True")
    .reset_index()
    .groupby("index")["name"]
    .apply(pd.Series.to_list)
)
Then convert to dataframe:
pd.DataFrame(series_of_lists.to_list(), index=series_of_lists.index)
which will look like this:
0 1 2
index
0 days 00:00:00 a c d
0 days 00:00:01 a c d
0 days 00:00:02 a c d
0 days 00:00:03 a c d
0 days 00:00:04 a c d
... .. .. ..
0 days 00:19:55 a b c
0 days 00:19:56 a b c
0 days 00:19:57 a b c
0 days 00:19:58 a b c
0 days 00:19:59 a b c
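If you also want the result keyed by seconds with slot1/slot2/slot3 columns as in the question, a small follow-up sketch (the column names and the conversion to integer seconds are my own choices, not part of the original answer):
out = pd.DataFrame(series_of_lists.to_list(), index=series_of_lists.index)
out.columns = [f'slot{i + 1}' for i in range(out.shape[1])]
out.index = pd.to_timedelta(out.index).total_seconds().astype(int)
print (out)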
Note: I am the creator of piso, feel free to reach out if you have any questions.

Insert row in pandas Dataframe based on Date Column

I have a DataFrame df and a list li. My dataframe contains:
Student Score Date
A 10 15-03-19
C 11 16-03-19
A 12 16-03-19
B 10 16-03-19
A 9 17-03-19
My list contains the names of all students: li = [A, B, C].
If any student did not come on a particular day, then insert the name of that student into the dataframe with score value = 0.
My Final Dataframe should be like:
Student Score Date
A 10 15-03-19
B 0 15-03-19
C 0 15-03-19
C 11 16-03-19
A 12 16-03-19
B 10 16-03-19
A 9 17-03-19
B 0 17-03-19
C 0 17-03-19
Use DataFrame.reindex with MultiIndex.from_product:
li = list('ABC')
mux = pd.MultiIndex.from_product([df['Date'].unique(), li], names=['Date', 'Student'])
df = df.set_index(['Date', 'Student']).reindex(mux, fill_value=0).reset_index()
print (df)
Date Student Score
0 15-03-19 A 10
1 15-03-19 B 0
2 15-03-19 C 0
3 16-03-19 A 12
4 16-03-19 B 10
5 16-03-19 C 11
6 17-03-19 A 9
7 17-03-19 B 0
8 17-03-19 C 0
An alternative is to use a left join with DataFrame.merge and a helper DataFrame created by product, and finally replace missing values with fillna:
from itertools import product
df1 = pd.DataFrame(list(product(df['Date'].unique(), li)), columns=['Date', 'Student'])
df = df1.merge(df, how='left').fillna(0)
print (df)
Date Student Score
0 15-03-19 A 10.0
1 15-03-19 B 0.0
2 15-03-19 C 0.0
3 16-03-19 A 12.0
4 16-03-19 B 10.0
5 16-03-19 C 11.0
6 17-03-19 A 9.0
7 17-03-19 B 0.0
8 17-03-19 C 0.0
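Note that the left join introduces NaN, which is why Score shows up as float above. If you want integer scores, a minimal variant of the merge line above with a cast back (out is just a hypothetical name for the result):
out = df1.merge(df, how='left')
out['Score'] = out['Score'].fillna(0).astype(int)
print (out)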

Adding rows in dataframe based on values of another dataframe

I have the following two dataframes. Please note that 'amt' is grouped by 'id' in both dataframes.
df1
id code amt
0 A 1 5
1 A 2 5
2 B 3 10
3 C 4 6
4 D 5 8
5 E 6 11
df2
id code amt
0 B 1 9
1 C 12 10
I want to add a row in df2 for every id of df1 not contained in df2. For example, as IDs A, D and E are not contained in df2, I want to add a row for these IDs. The appended row should contain the id not contained in df2, a null value for the attribute code, and the value stored in df1 for the attribute amt.
The result should be something like this:
id code name
0 B 1 9
1 C 12 10
2 A nan 5
3 D nan 8
4 E nan 11
I would highly appreciate any guidance on this.
By using pd.concat
df = df1.drop('code', axis=1).drop_duplicates()
df[~df.id.isin(df2.id)]
pd.concat([df2, df[~df.id.isin(df2.id)]], axis=0).rename(columns={'amt':'name'}).reset_index(drop=True)
Out[481]:
name code id
0 9 1.0 B
1 10 12.0 C
2 5 NaN A
3 8 NaN D
4 11 NaN E
Drop duplicates from df1, then append df2, then drop more duplicates, then append again.
df2.append(
    df1.drop_duplicates('id').append(df2)
       .drop_duplicates('id', keep=False).assign(code=np.nan),
    ignore_index=True
)
id code amt
0 B 1.0 9
1 C 12.0 10
2 A NaN 5
3 D NaN 8
4 E NaN 11
Slight variation
m = ~np.in1d(df1.id.values, df2.id.values)
d = ~df1.duplicated('id').values
df2.append(df1[m & d].assign(code=np.nan), ignore_index=True)
id code amt
0 B 1.0 9
1 C 12.0 10
2 A NaN 5
3 D NaN 8
4 E NaN 11
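Note that DataFrame.append was removed in pandas 2.0, so on recent versions the same idea can be written with pd.concat. A minimal sketch assuming df1 and df2 as in the question (missing and out are hypothetical names):
missing = df1.drop_duplicates('id').loc[lambda d: ~d['id'].isin(df2['id'])].assign(code=np.nan)
out = pd.concat([df2, missing], ignore_index=True)
print (out)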

Generating sub data frame based on a value in an column

I have the following data frame in pandas. Now I want to generate a sub data frame when I see a certain value in the Activity column. So for example, I want a data frame with all the data for Name A if the Activity column has value 3 or 5.
Name Date Activity
A 01-02-2015 1
A 01-03-2015 2
A 01-04-2015 3
A 01-04-2015 1
B 01-02-2015 1
B 01-02-2015 2
B 01-03-2015 1
B 01-04-2015 5
C 01-31-2015 1
C 01-31-2015 2
C 01-31-2015 2
So for the above data, I want to get
df_A as
Name Date Activity
A 01-02-2015 1
A 01-03-2015 2
A 01-04-2015 3
A 01-04-2015 1
df_B as
B 01-02-2015 1
B 01-02-2015 2
B 01-03-2015 1
B 01-04-2015 5
Since Name C does not have 3 or 5 in the column Activity, I do not want to get this data frame.
Also, the names in the data frame can vary with each input file.
Once I have these data frame separated, I want to plot a time series.
You can group the dataframe by column Name, apply the custom function f and then select dataframes df_A and df_B:
print df
Name Date Activity
0 A 2015-01-02 1
1 A 2015-01-03 2
2 A 2015-01-04 3
3 A 2015-01-04 1
4 B 2015-01-02 1
5 B 2015-01-02 2
6 B 2015-01-03 1
7 B 2015-01-04 5
8 C 2015-01-31 1
9 C 2015-01-31 2
10 C 2015-01-31 2
def f(df):
    if ((df['Activity'] == 3) | (df['Activity'] == 5)).any():
        return df
g = df.groupby('Name').apply(f).reset_index(drop=True)
df_A = g.loc[g.Name == 'A']
print df_A
Name Date Activity
0 A 2015-01-02 1
1 A 2015-01-03 2
2 A 2015-01-04 3
3 A 2015-01-04 1
df_B = g.loc[g.Name == 'B']
print df_B
Name Date Activity
4 B 2015-01-02 1
5 B 2015-01-02 2
6 B 2015-01-03 1
7 B 2015-01-04 5
df_A.plot()
df_B.plot()
In the end you can use plot to draw the time series (see the pandas plotting documentation for more info).
EDIT:
If you want to create the dataframes dynamically, you can find all unique values of column Name with drop_duplicates:
for name in g.Name.drop_duplicates():
    print g.loc[g.Name == name]
Name Date Activity
0 A 2015-01-02 1
1 A 2015-01-03 2
2 A 2015-01-04 3
3 A 2015-01-04 1
Name Date Activity
4 B 2015-01-02 1
5 B 2015-01-02 2
6 B 2015-01-03 1
7 B 2015-01-04 5
You can use a dictionary comprehension to create a sub dataframe for each Name with an Activity value of 3 or 5.
active_names = df[df.Activity.isin([3, 5])].Name.unique().tolist()
dfs = {name: df.loc[df.Name == name, :] for name in active_names}
>>> dfs['A']
Name Date Activity
0 A 01-02-2015 1
1 A 01-03-2015 2
2 A 01-04-2015 3
3 A 01-04-2015 1
>>> dfs['B']
Name Date Activity
4 B 01-02-2015 1
5 B 01-02-2015 2
6 B 01-03-2015 1
7 B 01-04-2015 5
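An equivalent way, if you prefer to avoid the custom function from the first answer, is GroupBy.filter, which drops the groups that never contain 3 or 5 (a minimal sketch assuming df as above; active and dfs are hypothetical names):
active = df.groupby('Name').filter(lambda g: g['Activity'].isin([3, 5]).any())
dfs = {name: sub for name, sub in active.groupby('Name')}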
