Printing a corresponding value of a column in Pandas - python

I have a table with 3 parameters "Date", "Time" and "Value". I want to add a new column to this table which has values corresponding to "Time" after every 10 minutes.
I though of creating a new column with Seconds for ease.
Now I want a new column suppose "x" which would have the first vlaue and the values which correspond to the "Time" only after 10 minutes. What can we do in this case?

IIUC you can first remove dates and hours by sub, add 10 Min, then convert to miliseconds and divide 1000:
df = pd.DataFrame({'Time': ['08:01:29:12','08:03:29:12','08:05:29:12'],
'Date':['1/2/2016','1/2/2016','1/2/2016']})
df['Date'] = pd.to_datetime(df.Date)
df['Time'] = pd.to_datetime(df.Time, format='%H:%M:%S:%f')
df['new'] = df.Time.sub(df.Time.values.astype('<M8[h]') +
pd.offsets.Minute(10)).astype('timedelta64[ms]') / 1000
print (df)
Date Time new
0 2016-01-02 1900-01-01 08:01:29.120 689.12
1 2016-01-02 1900-01-01 08:03:29.120 809.12
2 2016-01-02 1900-01-01 08:05:29.120 929.12
Or:
df['new'] = df.Time.sub(df.Time.values.astype('<M8[h]') +
pd.to_timedelta('00:10:00')).astype('timedelta64[ms]') / 1000
print (df)
Date Time new
0 2016-01-02 1900-01-01 08:01:29.120 689.12
1 2016-01-02 1900-01-01 08:03:29.120 809.12
2 2016-01-02 1900-01-01 08:05:29.120 929.12

Related

How to tackle a dataset that has multiple same date values

I have a large data set that I'm trying to produce a time series using ARIMA. However
some of the data in the date column has multiple rows with the same date.
The data for the dates was entered this way in the data set as it was not known the exact date of the event, hence unknown dates where entered for the first of that month(biased). Known dates have been entered correctly in the data set.
2016-01-01 10035
2015-01-01 5397
2013-01-01 4567
2014-01-01 4343
2017-01-01 3981
2011-01-01 2049
Ideally I want to randomise the dates within the month so they are not the same. I have the code to randomise the date but I cannot find a way to replace the data with the date ranges.
import random
import time
def str_time_prop(start, end, time_format, prop):
stime = time.mktime(time.strptime(start, time_format))
etime = time.mktime(time.strptime(end, time_format))
ptime = stime + prop * (etime - stime)
return time.strftime(time_format, time.localtime(ptime))
def random_date(start, end, prop):
return str_time_prop(start, end, '%Y-%m-%d', prop)
# check if the random function works
print(random_date("2021-01-02", "2021-01-11", random.random()))
The code above I use to generate a random date within a date range but I'm stuggling to find a way to replace the dates.
Any help/guidance would be great.
Thanks
With the following toy dataframe:
import random
import time
import pandas as pd
df = pd.DataFrame(
{
"date": [
"2016-01-01",
"2015-01-01",
"2013-01-01",
"2014-01-01",
"2017-01-01",
"2011-01-01",
],
"value": [10035, 5397, 4567, 4343, 3981, 2049],
}
)
print(df)
# Output
date value
0 2016-01-01 10035
1 2015-01-01 5397
2 2013-01-01 4567
3 2014-01-01 4343
4 2017-01-01 3981
5 2011-01-01 2049
Here is one way to do it:
df["date"] = [
random_date("2011-01-01", "2022-04-17", random.random()) for _ in range(df.shape[0])
]
print(df)
# Ouput
date value
0 2013-12-30 10035
1 2016-06-17 5397
2 2018-01-26 4567
3 2012-02-14 4343
4 2014-06-26 3981
5 2019-07-03 2049
Since the data in the date column has multiple rows with the same date, and you want to randomize the dates within the month, you could group by the year and month and select only those who have the day equal 1. Then, use calendar.monthrange to find the last day of the month for that particular year, and use that information when replacing the timestamp's day. Change the FIRST_DAY and last_day values to match your desired range.
import pandas as pd
import calendar
import numpy as np
np.random.seed(42)
df = pd.read_csv('sample.csv')
df['date'] = pd.to_datetime(df['date'])
# group multiple rows with the same year, month and day equal 1
grouped = df.groupby([df['date'].dt.year, df['date'].dt.month, df['date'].dt.day==1])
FIRST_DAY = 2 # set for the desired range
df_list = []
for n,g in grouped:
last_day = calendar.monthrange(n[0], n[1])[1] # get last day for this month and year
g['New_Date'] = g['date'].apply(lambda d:
d.replace(day=np.random.randint(FIRST_DAY,last_day+1))
)
df_list.append(g)
new_df = pd.concat(df_list)
print(new_df)
Output from new_df
date num New_Date
2 2013-01-01 4567 2013-01-08
3 2014-01-01 4343 2014-01-21
1 2015-01-01 5397 2015-01-30
0 2016-01-01 10035 2016-01-16
4 2017-01-01 3981 2017-01-12

pandas - combine time and date from two dataframe columns to a datetime column

This is a follow up question of the accepted solution in here.
I have a pandas dataframe:
In one column 'time' is the time stored in the following format: 'HHMMSS' (e.g. 203412 means 20:34:12).
In another column 'date' the date is stored in the following format: 'YYmmdd' (e.g 200712 means 2020-07-12). YY represents the addon to the year 2000.
Example:
import pandas as pd
data = {'time': ['123455', '000010', '100000'],
'date': ['200712', '210601', '190610']}
df = pd.DataFrame(data)
print(df)
# time date
#0 123455 200712
#1 000010 210601
#2 100000 190610
I need a third column which contains the combined datetime format (e.g. 2020-07-12 12:34:55) of the two other columns. So far, I can only modify the time but I do not know how to add the date.
df['datetime'] = pd.to_datetime(df['time'], format='%H%M%S')
print(df)
# time date datetime
#0 123455 200712 1900-01-01 12:34:55
#1 000010 210601 1900-01-01 00:00:10
#2 100000 190610 1900-01-01 10:00:00
How can I add in column df['datetime'] the date from column df['date'], so that the dataframe is:
time date datetime
0 123455 200712 2020-07-12 12:34:55
1 000010 210601 2021-06-01 00:00:10
2 100000 190610 2019-06-10 10:00:00
I found this question, but I am not exactly sure how to use it for my purpose.
You can join columns first and then specify formar:
df['datetime'] = pd.to_datetime(df['date'] + df['time'], format='%y%m%d%H%M%S')
print(df)
time date datetime
0 123455 200712 2020-07-12 12:34:55
1 000010 210601 2021-06-01 00:00:10
2 100000 190610 2019-06-10 10:00:00
If possible integer columns:
df['datetime'] = pd.to_datetime(df['date'].astype(str) + df['time'].astype(str), format='%y%m%d%H%M%S')

Convert date column formated as xx:xx.x

I have come across a CSV file that contains a date column formatted in the following manner: xx:xx.x, here's a couple of the data present in the column marked as date:
07:33.0
34:53.0
06:30.0
30:09.0
02:18.0
My question is what type of formatting is this? And how can I convert it to a proper date format using Python?
It looks like times without hours.
You can create timedeltas by add 0 hours by to_timedelta:
df['col'] = pd.to_timedelta('00:' + df['col'])
print (df)
col
0 0 days 00:07:33
1 0 days 00:34:53
2 0 days 00:06:30
3 0 days 00:30:09
4 0 days 00:02:18
Or convert to datetimes by to_datetime - there is added default date:
df['col'] = pd.to_datetime(df['col'], format='%M:%S.%f')
print (df)
col
0 1900-01-01 00:07:33
1 1900-01-01 00:34:53
2 1900-01-01 00:06:30
3 1900-01-01 00:30:09
4 1900-01-01 00:02:18

how to convert time in unorthodox format to timestamp in pandas dataframe

I have a column in my dataframe which I want to convert to a Timestamp. However, it is in a bit of a strange format that I am struggling to manipulate. The column is in the format HHMMSS, but does not include the leading zeros.
For example for a time that should be '00:03:15' the dataframe has '315'. I want to convert the latter to a Timestamp similar to the former. Here is an illustration of the column:
message_time
25
35
114
1421
...
235347
235959
Thanks
Use Series.str.zfill for add leading zero and then to_datetime:
s = df['message_time'].astype(str).str.zfill(6)
df['message_time'] = pd.to_datetime(s, format='%H%M%S')
print (df)
message_time
0 1900-01-01 00:00:25
1 1900-01-01 00:00:35
2 1900-01-01 00:01:14
3 1900-01-01 00:14:21
4 1900-01-01 23:53:47
5 1900-01-01 23:59:59
In my opinion here is better create timedeltas by to_timedelta:
s = df['message_time'].astype(str).str.zfill(6)
df['message_time'] = pd.to_timedelta(s.str[:2] + ':' + s.str[2:4] + ':' + s.str[4:])
print (df)
message_time
0 00:00:25
1 00:00:35
2 00:01:14
3 00:14:21
4 23:53:47
5 23:59:59

Python - Select min values in dataframe

I have a data frame that looks like this:
How can I make a new data frame that contains only the minimum 'Time' values for a user on the same date?
So I want to have a data frame with the same structure, but only one 'Time' for a 'Date' for a user.
So it should be like this:
Sort values by time column and check for duplicates in Date+User_name. However to make sure 09:00 is lower than 10:00 we can convert the strings to time first.
import pandas as pd
data = {
'User_name':['user1','user1','user1', 'user2'],
'Date':['8/29/2016','8/29/2016', '8/31/2016', '8/31/2016'],
'Time':['9:07:41','9:07:42','9:07:43', '9:31:35']
}
# Recreate sample dataframe
df = pd.DataFrame(data)
Alternative 1 (quicker):
#100 loops, best of 3: 1.73 ms per loop
# Create a mask
m = (df.reindex(pd.to_datetime(df['Time']).sort_values().index)
.duplicated(['Date','User_name']))
# Apply inverted mask
df = df.loc[~m]
Alternative 2 (more readable):
One easier way would be too remake the df['Time'] column to datetime and group it by date and User_name and get the idxmin(). This will be our mask. (Credit to jezrael)
# 100 loops, best of 3: 4.34 ms per loop
# Create a mask
m = pd.to_datetime(df['Time']).groupby([df['Date'],df['User_name']]).idxmin()
df = df.loc[m]
Output:
Date Time User_name
0 8/29/2016 9:07:41 user1
2 8/31/2016 9:07:43 user1
3 8/31/2016 9:31:35 user2
Update 1
#User included into grouping
Not the best way but simple
df = pd.DataFrame(np.datetime64('2016')+
np.random.randint(0,3*24,
size=(7,1)).astype('<m8[h]'),
columns =['DT']).join(pd.Series(list('abcdefg'),name='str_val')
).join(pd.Series(list('UAUAUAU'),name='User'))
df['Date'] = df.DT.dt.date
df['Time'] = df.DT.dt.time
df.drop(columns = ['DT'],inplace=True)
print (df)
Output:
str_val User Date Time
0 a U 2016-01-01 04:00:00
1 b A 2016-01-01 10:00:00
2 c U 2016-01-01 20:00:00
3 d A 2016-01-01 22:00:00
4 e U 2016-01-02 04:00:00
5 f A 2016-01-02 23:00:00
6 g U 2016-01-02 09:00:00
Code to get values
print (df.sort_values(['Date','User','Time']).groupby(['Date','User']).first())
Output:
Date User
2016-01-01 A b 10:00:00
U a 04:00:00
2016-01-02 A f 23:00:00
U e 04:00:00

Categories

Resources