Python - Select min values in dataframe - python

I have a data frame that looks like this:
How can I make a new data frame that contains only the minimum 'Time' values for a user on the same date?
So I want to have a data frame with the same structure, but only one 'Time' for a 'Date' for a user.
So it should be like this:

Sort values by time column and check for duplicates in Date+User_name. However to make sure 09:00 is lower than 10:00 we can convert the strings to time first.
import pandas as pd
data = {
'User_name':['user1','user1','user1', 'user2'],
'Date':['8/29/2016','8/29/2016', '8/31/2016', '8/31/2016'],
'Time':['9:07:41','9:07:42','9:07:43', '9:31:35']
}
# Recreate sample dataframe
df = pd.DataFrame(data)
Alternative 1 (quicker):
#100 loops, best of 3: 1.73 ms per loop
# Create a mask
m = (df.reindex(pd.to_datetime(df['Time']).sort_values().index)
.duplicated(['Date','User_name']))
# Apply inverted mask
df = df.loc[~m]
Alternative 2 (more readable):
One easier way would be too remake the df['Time'] column to datetime and group it by date and User_name and get the idxmin(). This will be our mask. (Credit to jezrael)
# 100 loops, best of 3: 4.34 ms per loop
# Create a mask
m = pd.to_datetime(df['Time']).groupby([df['Date'],df['User_name']]).idxmin()
df = df.loc[m]
Output:
Date Time User_name
0 8/29/2016 9:07:41 user1
2 8/31/2016 9:07:43 user1
3 8/31/2016 9:31:35 user2

Update 1
#User included into grouping
Not the best way but simple
df = pd.DataFrame(np.datetime64('2016')+
np.random.randint(0,3*24,
size=(7,1)).astype('<m8[h]'),
columns =['DT']).join(pd.Series(list('abcdefg'),name='str_val')
).join(pd.Series(list('UAUAUAU'),name='User'))
df['Date'] = df.DT.dt.date
df['Time'] = df.DT.dt.time
df.drop(columns = ['DT'],inplace=True)
print (df)
Output:
str_val User Date Time
0 a U 2016-01-01 04:00:00
1 b A 2016-01-01 10:00:00
2 c U 2016-01-01 20:00:00
3 d A 2016-01-01 22:00:00
4 e U 2016-01-02 04:00:00
5 f A 2016-01-02 23:00:00
6 g U 2016-01-02 09:00:00
Code to get values
print (df.sort_values(['Date','User','Time']).groupby(['Date','User']).first())
Output:
Date User
2016-01-01 A b 10:00:00
U a 04:00:00
2016-01-02 A f 23:00:00
U e 04:00:00

Related

How to tackle a dataset that has multiple same date values

I have a large data set that I'm trying to produce a time series using ARIMA. However
some of the data in the date column has multiple rows with the same date.
The data for the dates was entered this way in the data set as it was not known the exact date of the event, hence unknown dates where entered for the first of that month(biased). Known dates have been entered correctly in the data set.
2016-01-01 10035
2015-01-01 5397
2013-01-01 4567
2014-01-01 4343
2017-01-01 3981
2011-01-01 2049
Ideally I want to randomise the dates within the month so they are not the same. I have the code to randomise the date but I cannot find a way to replace the data with the date ranges.
import random
import time
def str_time_prop(start, end, time_format, prop):
stime = time.mktime(time.strptime(start, time_format))
etime = time.mktime(time.strptime(end, time_format))
ptime = stime + prop * (etime - stime)
return time.strftime(time_format, time.localtime(ptime))
def random_date(start, end, prop):
return str_time_prop(start, end, '%Y-%m-%d', prop)
# check if the random function works
print(random_date("2021-01-02", "2021-01-11", random.random()))
The code above I use to generate a random date within a date range but I'm stuggling to find a way to replace the dates.
Any help/guidance would be great.
Thanks
With the following toy dataframe:
import random
import time
import pandas as pd
df = pd.DataFrame(
{
"date": [
"2016-01-01",
"2015-01-01",
"2013-01-01",
"2014-01-01",
"2017-01-01",
"2011-01-01",
],
"value": [10035, 5397, 4567, 4343, 3981, 2049],
}
)
print(df)
# Output
date value
0 2016-01-01 10035
1 2015-01-01 5397
2 2013-01-01 4567
3 2014-01-01 4343
4 2017-01-01 3981
5 2011-01-01 2049
Here is one way to do it:
df["date"] = [
random_date("2011-01-01", "2022-04-17", random.random()) for _ in range(df.shape[0])
]
print(df)
# Ouput
date value
0 2013-12-30 10035
1 2016-06-17 5397
2 2018-01-26 4567
3 2012-02-14 4343
4 2014-06-26 3981
5 2019-07-03 2049
Since the data in the date column has multiple rows with the same date, and you want to randomize the dates within the month, you could group by the year and month and select only those who have the day equal 1. Then, use calendar.monthrange to find the last day of the month for that particular year, and use that information when replacing the timestamp's day. Change the FIRST_DAY and last_day values to match your desired range.
import pandas as pd
import calendar
import numpy as np
np.random.seed(42)
df = pd.read_csv('sample.csv')
df['date'] = pd.to_datetime(df['date'])
# group multiple rows with the same year, month and day equal 1
grouped = df.groupby([df['date'].dt.year, df['date'].dt.month, df['date'].dt.day==1])
FIRST_DAY = 2 # set for the desired range
df_list = []
for n,g in grouped:
last_day = calendar.monthrange(n[0], n[1])[1] # get last day for this month and year
g['New_Date'] = g['date'].apply(lambda d:
d.replace(day=np.random.randint(FIRST_DAY,last_day+1))
)
df_list.append(g)
new_df = pd.concat(df_list)
print(new_df)
Output from new_df
date num New_Date
2 2013-01-01 4567 2013-01-08
3 2014-01-01 4343 2014-01-21
1 2015-01-01 5397 2015-01-30
0 2016-01-01 10035 2016-01-16
4 2017-01-01 3981 2017-01-12

I have a data frame with one column and I'd like to split it into two columns [duplicate]

I have a pandas dataframe with over 1000 timestamps (below) that I would like to loop through:
2016-02-22 14:59:44.561776
I'm having a hard time splitting this time stamp into 2 columns- 'date' and 'time'. The date format can stay the same, but the time needs to be converted to CST (including milliseconds).
Thanks for the help
Had same problem and this worked for me.
Suppose the date column in your dataset is called "date"
import pandas as pd
df = pd.read_csv(file_path)
df['Dates'] = pd.to_datetime(df['date']).dt.date
df['Time'] = pd.to_datetime(df['date']).dt.time
This will give you two columns "Dates" and "Time" with splited dates.
I'm not sure why you would want to do this in the first place, but if you really must...
df = pd.DataFrame({'my_timestamp': pd.date_range('2016-1-1 15:00', periods=5)})
>>> df
my_timestamp
0 2016-01-01 15:00:00
1 2016-01-02 15:00:00
2 2016-01-03 15:00:00
3 2016-01-04 15:00:00
4 2016-01-05 15:00:00
df['new_date'] = [d.date() for d in df['my_timestamp']]
df['new_time'] = [d.time() for d in df['my_timestamp']]
>>> df
my_timestamp new_date new_time
0 2016-01-01 15:00:00 2016-01-01 15:00:00
1 2016-01-02 15:00:00 2016-01-02 15:00:00
2 2016-01-03 15:00:00 2016-01-03 15:00:00
3 2016-01-04 15:00:00 2016-01-04 15:00:00
4 2016-01-05 15:00:00 2016-01-05 15:00:00
The conversion to CST is more tricky. I assume that the current timestamps are 'unaware', i.e. they do not have a timezone attached? If not, how would you expect to convert them?
For more details:
https://docs.python.org/2/library/datetime.html
How to make an unaware datetime timezone aware in python
EDIT
An alternative method that only loops once across the timestamps instead of twice:
new_dates, new_times = zip(*[(d.date(), d.time()) for d in df['my_timestamp']])
df = df.assign(new_date=new_dates, new_time=new_times)
The easiest way is to use the pandas.Series dt accessor, which works on columns with a datetime dtype (see pd.to_datetime). For this case, pd.date_range creates an example column with a datetime dtype, therefore use .dt.date and .dt.time:
df = pd.DataFrame({'full_date': pd.date_range('2016-1-1 10:00:00.123', periods=10, freq='5H')})
df['date'] = df['full_date'].dt.date
df['time'] = df['full_date'].dt.time
In [166]: df
Out[166]:
full_date date time
0 2016-01-01 10:00:00.123 2016-01-01 10:00:00.123000
1 2016-01-01 15:00:00.123 2016-01-01 15:00:00.123000
2 2016-01-01 20:00:00.123 2016-01-01 20:00:00.123000
3 2016-01-02 01:00:00.123 2016-01-02 01:00:00.123000
4 2016-01-02 06:00:00.123 2016-01-02 06:00:00.123000
5 2016-01-02 11:00:00.123 2016-01-02 11:00:00.123000
6 2016-01-02 16:00:00.123 2016-01-02 16:00:00.123000
7 2016-01-02 21:00:00.123 2016-01-02 21:00:00.123000
8 2016-01-03 02:00:00.123 2016-01-03 02:00:00.123000
9 2016-01-03 07:00:00.123 2016-01-03 07:00:00.123000
If your timestamps are already in pandas format (not string), then:
df["date"] = df["timestamp"].date
dt["time"] = dt["timestamp"].time
If your timestamp is a string, you can parse it using the datetime module:
from datetime import datetime
data1["timestamp"] = df["timestamp"].apply(lambda x: \
datetime.strptime(x,"%Y-%m-%d %H:%M:%S.%f"))
Source:
http://pandas.pydata.org/pandas-docs/stable/timeseries.html
If your timestamp is a string, you can convert it to a datetime object:
from datetime import datetime
timestamp = '2016-02-22 14:59:44.561776'
dt = datetime.strptime(timestamp, '%Y-%m-%d %H:%M:%S.%f')
From then on you can bring it to whatever format you like.
Try
s = '2016-02-22 14:59:44.561776'
date,time = s.split()
then convert time as needed.
If you want to further split the time,
hour, minute, second = time.split(':')
try this:
def time_date(datetime_obj):
date_time = datetime_obj.split(' ')
time = date_time[1].split('.')
return date_time[0], time[0]
In addition to #Alexander if you want a single liner
df['new_date'],df['new_time'] = zip(*[(d.date(), d.time()) for d in df['my_timestamp']])
If your timestamp is a string, you can convert it to pandas timestamp before splitting it.
#convert to pandas timestamp
data["old_date"] = pd.to_datetime(data.old_date)
#split columns
data["new_date"] = data["old_date"].dt.date
data["new_time"] = data["old_date"].dt.time

find closest rows between dataframes with positive timedelta

I have two dataframes each with a datetime column:
df_long=
mytime_long
0 00:00:01 1/10/2013
1 00:00:05 1/10/2013
2 00:00:55 1/10/2013
df_short=
mytime_short
0 00:00:02 1/10/2013
1 00:00:03 1/10/2013
2 00:00:06 1/10/2013
The timestamps are unique and can be assumed sorted in each of the two dataframes.
I would like to create a new dataframe that contains the nearest (index,mytime_long) after or at the same time value in mytime_short (hence with a non-negative timedelta)
ex.
0 (0, 00:00:02 1/10/2013)
1 (2, 00:00:06 1/10/2013)
2 (np.nan,np.nat)
write a function to get the closest index & timestamp in df_short given a timestamp
def get_closest(n):
mask = df_short.mytime_short >= n
ids = np.where(mask)[0]
if ids.size > 0:
return ids[0], df_short.mytime_short[ids[0]]
else:
return np.nan, np.nan
apply this function over df_long.mytime_long, to get a new data frame with the index & timestamp values in a tuple
df = df_long.mytime_long.apply(get_closest)
df
# output:
0 (0, 2013-01-10 00:00:02)
1 (2, 2013-01-10 00:00:06)
2 (nan, nan)
ilia timofeev's answer reminded me of this pandas.merge_asof function which is perfect for this type of join
df = pd.merge_asof(df_long,
df_short.reset_index(),
left_on='mytime_long',
right_on='mytime_short',
direction='forward')[['index', 'mytime_short']]
df
# output:
index mytime_short
0 0.0 2013-01-10 00:00:02
1 2.0 2013-01-10 00:00:06
2 NaN NaT
Little bit ugly, but effective way to solve task. Idea is to join them on timestamp and select first "short" after "long" if any.
#recreate data
df_long = pd.DataFrame(
pd.to_datetime( ['00:00:01 1/10/2013','00:00:05 1/10/2013','00:00:55 1/10/2013']),
index = [0,1,2],columns = ['mytime_long'])
df_short = pd.DataFrame(
pd.to_datetime( ['00:00:02 1/10/2013','00:00:03 1/10/2013','00:00:06 1/10/2013']),
index = [0,1,2],columns = ['mytime_short'])
#join by time, preserving ids
df_all = df_short.assign(inx_s=df_short.index).set_index('mytime_short').join(
df_long.assign(inx_l=df_long.index).set_index('mytime_long'),how='outer')
#mark all "short" rows with nearest "long" id
df_all['inx_l'] = df_all.inx_l.ffill().fillna(-1)
#select "short" rows
df_short_candidate = df_all[~df_all.inx_s.isnull()].astype(int)
df_short_candidate['mytime_short'] = df_short_candidate.index
#select get minimal "short" time in "long" group,
#join back with long to recover empty intersection
df_res = df_long.join(df_short_candidate.groupby('inx_l').first())
print (df_res)
Out:
mytime_long inx_s mytime_short
0 2013-01-10 00:00:01 0.0 2013-01-10 00:00:02
1 2013-01-10 00:00:05 2.0 2013-01-10 00:00:06
2 2013-01-10 00:00:55 NaN NaT
Performance comparison on sample of 100000 elements:
186 ms to execute this implementation.
1min 3s to execute df_long.mytime_long.apply(get_closest)
UPD: but the winner is #Haleemur Ali's pd.merge_asof with 10ms

Printing a corresponding value of a column in Pandas

I have a table with 3 parameters "Date", "Time" and "Value". I want to add a new column to this table which has values corresponding to "Time" after every 10 minutes.
I though of creating a new column with Seconds for ease.
Now I want a new column suppose "x" which would have the first vlaue and the values which correspond to the "Time" only after 10 minutes. What can we do in this case?
IIUC you can first remove dates and hours by sub, add 10 Min, then convert to miliseconds and divide 1000:
df = pd.DataFrame({'Time': ['08:01:29:12','08:03:29:12','08:05:29:12'],
'Date':['1/2/2016','1/2/2016','1/2/2016']})
df['Date'] = pd.to_datetime(df.Date)
df['Time'] = pd.to_datetime(df.Time, format='%H:%M:%S:%f')
df['new'] = df.Time.sub(df.Time.values.astype('<M8[h]') +
pd.offsets.Minute(10)).astype('timedelta64[ms]') / 1000
print (df)
Date Time new
0 2016-01-02 1900-01-01 08:01:29.120 689.12
1 2016-01-02 1900-01-01 08:03:29.120 809.12
2 2016-01-02 1900-01-01 08:05:29.120 929.12
Or:
df['new'] = df.Time.sub(df.Time.values.astype('<M8[h]') +
pd.to_timedelta('00:10:00')).astype('timedelta64[ms]') / 1000
print (df)
Date Time new
0 2016-01-02 1900-01-01 08:01:29.120 689.12
1 2016-01-02 1900-01-01 08:03:29.120 809.12
2 2016-01-02 1900-01-01 08:05:29.120 929.12

Pandas Merge on Specific Attributes of DateTimeIndex

I currently have two pandas data frames which are both indexed using the pandas DateTimeIndex format.
df1
datetimeindex value
2014-01-01 00:00:00 204.501667
2014-01-01 01:00:00 125.345000
2014-01-01 02:00:00 119.660000
df2 (where the year 1900 is a filler year I added during import. Actual year does not matter)
datetimeindex temperature
1900-01-01 00:00:00 48.2
1900-01-01 01:00:00 30.2
1900-01-01 02:00:00 42.8
I would like to use pd.merge to combine the data frames based on the left index, however, I would like to ignore the year altogether to yield this:
merged_df
datetimeindex value temperature
2014-01-01 00:00:00 204.501667 48.2
2014-01-01 01:00:00 125.345000 30.2
2014-01-01 02:00:00 119.660000 42.8
so far I have tried:
merged_df = pd.merge(df1,df2,left_on =
['df1.index.month','df1.index.day','df1,index.hour'],right_on =
['df2.index.month','df2.index.day','df2.index.hour'],how = 'left')
which gave me the error KeyError: 'df2.index.month'
Is there a way to perform this merge as I have outlined it?
Thanks
You have to lose the quotesL
In [11]: pd.merge(df1, df2, left_on=[df1.index.month, df1.index.day, df1.index.hour],
right_on=[df2.index.month, df2.index.day, df2.index.hour])
Out[11]:
key_0 key_1 key_2 value temperature
0 1 1 0 204.501667 48.2
1 1 1 1 125.345000 30.2
2 1 1 2 119.660000 42.8
Here "df2.index.month" is a string whereas df2.index.month is the array of months.
Probably not as efficient because pd.to_datetime can be slow:
df2['NewIndex'] = pd.to_datetime(df2.index)
df2['NewIndex'] = df2['NewIndex'].apply(lambda x: x.replace(year=2014))
df2.set_index('NewIndex',inplace=True)
Then just do a merge on the whole index.

Categories

Resources