How to remove microseconds from timedelta - python

I have microseconds that I want to truncate from a pandas column. I tried analyze_me['how_long_it_took_to_order'] = analyze_me['how_long_it_took_to_order'].apply(lambda x: x.replace(microsecond=0)), but this error came up: replace() takes no keyword arguments.
For example: I want 00:19:58.582052 to become 00:19:58 or 00:19:58.58.

I think you need to convert your strings to timedeltas with pd.to_timedelta and then use the dt accessor's floor method, which truncates to the frequency given as a string. Here are the first two rows of your data.
df['how_long_it_took_to_order'] = pd.to_timedelta(df['how_long_it_took_to_order'])
df['how_long_it_took_to_order'].dt.floor('s')
0 00:19:58
1 00:25:09
You can also truncate to the hundredth of a second.
df['how_long_it_took_to_order'].dt.floor('10ms')
0 00:19:58.580000
1 00:25:09.100000
Here I create a Series of timedeltas and then use the dt accessor with the floor method to truncate down to the whole second.
d = pd.timedelta_range(0, periods=6, freq='644257us')
s = pd.Series(d)
s
0 00:00:00
1 00:00:00.644257
2 00:00:01.288514
3 00:00:01.932771
4 00:00:02.577028
5 00:00:03.221285
dtype: timedelta64[ns]
Now truncate
s.dt.floor('s')
0 00:00:00
1 00:00:00
2 00:00:01
3 00:00:01
4 00:00:02
5 00:00:03
dtype: timedelta64[ns]
If you want to truncate to the hundredth of a second, do this:
s.dt.floor('10ms')
0 00:00:00
1 00:00:00.640000
2 00:00:01.280000
3 00:00:01.930000
4 00:00:02.570000
5 00:00:03.220000
dtype: timedelta64[ns]
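As a self-contained sketch of the steps above, the following starts from duration strings like the one in the question and compares floor with round. The second value (00:25:09.107093) is made up for illustration; only the first comes from the question.

```python
import pandas as pd

# Hypothetical sample data: the first value is from the question,
# the second is an invented placeholder.
analyze_me = pd.DataFrame(
    {"how_long_it_took_to_order": ["00:19:58.582052", "00:25:09.107093"]}
)

# Convert the strings to timedelta64 values first.
durations = pd.to_timedelta(analyze_me["how_long_it_took_to_order"])

# floor() truncates toward the lower multiple of the frequency.
whole_seconds = durations.dt.floor("s")
centiseconds = durations.dt.floor("10ms")

# round() goes to the nearest multiple instead, so .58 s rounds up.
rounded = durations.dt.round("s")

print(whole_seconds)
print(centiseconds)
print(rounded)
```

If the goal is to drop the fractional part rather than round it, floor is the method to use.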

Your how_long_it_took_to_order column seems to be of string (object) dtype.
So try this:
analyze_me['how_long_it_took_to_order'] = \
analyze_me['how_long_it_took_to_order'].str.split('.').str[0]
or:
analyze_me['how_long_it_took_to_order'] = \
analyze_me['how_long_it_took_to_order'].str.replace(r'(\.\d{2})\d+', r'\1', regex=True)
for "centiseconds", like: 00:19:58.58

I needed this for a simple script where I wasn't using Pandas, and came up with a simple hack which should work everywhere.
age = age - timedelta(microseconds=age.microseconds)
where age is my timedelta object.
You can't directly modify the microseconds member of a timedelta object because it's immutable, but of course, you can replace it with another immutable object.
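A minimal standard-library sketch of this trick, wrapped in a helper. The function name strip_microseconds and the sample value are assumptions for illustration, not part of the original answer.

```python
from datetime import timedelta


def strip_microseconds(delta: timedelta) -> timedelta:
    """Return a new timedelta with the microseconds component removed."""
    # timedelta is immutable, so build a new object instead of modifying it.
    return delta - timedelta(microseconds=delta.microseconds)


age = timedelta(minutes=19, seconds=58, microseconds=582052)
truncated = strip_microseconds(age)
print(truncated)  # 0:19:58
```

Because a timedelta always stores a non-negative microseconds field, this rounds negative durations toward negative infinity, the same direction as a floor.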

Related

How to change str to date when year data inconsistent?

I've got a dataframe with a column named birthdates. The values are all strings; most are saved as %d.%m.%Y, some as %d.%m.%y.
How can I make this work?
df["birthdates_clean"] = pd.to_datetime(df["birthdates"], format = "%d.%m.%Y")
If this can't work, do I need to filter the rows? How would I do it?
Thanks for taking time to answer!
I am not sure what the expected output is, but you can let to_datetime parse the dates automatically:
df = pd.DataFrame({"birthdates": ['01.01.2000', '01.02.00', '02.03.99',
'02.03.22', '01.01.71', '01.01.72']})
# as datetime
df["birthdates_clean"] = pd.to_datetime(df["birthdates"], dayfirst=True)
# as custom string
df["birthdates_clean2"] = (pd.to_datetime(df["birthdates"], dayfirst=True)
.dt.strftime('%d.%m.%Y')
)
NB. the shift point is currently at 71/72. 71 gets evaluated as 2071 and 72 as 1972
output:
birthdates birthdates_clean birthdates_clean2
0 01.01.2000 2000-01-01 01.01.2000
1 01.02.00 2000-02-01 01.02.2000
2 02.03.99 1999-03-02 02.03.1999
3 02.03.22 2022-03-02 02.03.2022
4 01.01.71 2071-01-01 01.01.2071
5 01.01.72 1972-01-01 01.01.1972
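If automatic parsing is not an option (recent pandas versions may infer one format from the first value and reject the rest), one possible alternative is to parse each format explicitly and combine the results. This is a sketch under the assumption that only the two formats from the question occur. Note that %y uses a fixed pivot (00-68 become 20xx, 69-99 become 19xx), which differs from the 71/72 shift point mentioned above.

```python
import pandas as pd

df = pd.DataFrame({"birthdates": ["01.01.2000", "01.02.00", "02.03.99"]})

# Parse with the four-digit-year format; rows that do not match become NaT.
four_digit = pd.to_datetime(df["birthdates"], format="%d.%m.%Y", errors="coerce")

# Parse with the two-digit-year format; four-digit rows become NaT here.
two_digit = pd.to_datetime(df["birthdates"], format="%d.%m.%y", errors="coerce")

# Prefer the four-digit result and fall back to the two-digit one.
df["birthdates_clean"] = four_digit.fillna(two_digit)
print(df)
```

For birthdates, any result that lands in the future could then be shifted back by 100 years as a separate step.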

Remove decimal from year value in a data frame

I currently have a 'Year Built' column in a df describing when buildings were built. When I imported the csv file, the years all had decimal places after them: 1920.0, 1985.0. How do I go about changing them into a datetime format or just removing the decimal place?
df1['Year Built'].head()
0 1920.0
1 1985.0
2 NaN
3 1930.0
4 1985.0
Name: Year Built, dtype: float64
When I tried to use datetime...
df1['Year Built'] = pd.to_datetime(df1['Year Built'])
# check
df1['Year Built'].unique()
array(['1970-01-01T00:00:00.000001920', '1970-01-01T00:00:00.000001985',
'NaT', '1970-01-01T00:00:00.000001930',
'1970-01-01T00:00:00.000001986', '1970-01-01T00:00:00.000001987',
'1970-01-01T00:00:00.000001988', '1970-01-01T00:00:00.000001990',
Add the format parameter with %Y to match YYYY, and also errors='coerce' to convert non-matching values to missing values (NaT):
df1['Year Built'] = pd.to_datetime(df1['Year Built'], format='%Y', errors='coerce')
print (df1)
Year Built
0 1920-01-01
1 1985-01-01
2 NaT
3 1930-01-01
4 1985-01-01
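If passing floats together with a format string misbehaves in your pandas version, a more explicit route is to turn the non-missing years into four-digit strings first. This is only a sketch; the sample frame mirrors the head() output from the question, and the 'Built Date' column name is an assumption.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"Year Built": [1920.0, 1985.0, np.nan, 1930.0, 1985.0]})

# Drop missing values, then go float -> int -> str so "1920.0" becomes "1920".
year_strings = df1["Year Built"].dropna().astype(int).astype(str)

# Parse the strings as years; assignment aligns on the index,
# so the row that was dropped ends up as NaT.
df1["Built Date"] = pd.to_datetime(year_strings, format="%Y")
print(df1)
```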
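You can simply change them from float to int (and later to string if you won't be processing them as numbers). Because the column contains NaN, use the nullable Int64 dtype; a plain int cast raises an error on missing values.
df1['Year Built'] = df1['Year Built'].astype('Int64')
Here is the link for more details: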
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.astype.html
Similar to Lion Nadej Ahmed's answer, you can use the dtype parameter when you read in the data, specifying an integer dtype to prevent the year from becoming a float.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
We can simply use to_datetime with an explicit year format:
import pandas as pd
df1['year_built'] = pd.to_datetime(df1['year_built'], format='%Y', errors='coerce')
print(df1)

Increment attributes of a datetime Series in pandas

I have a Series containing datetime64[ns] elements called series, and would like to increment the months. I thought the following would work fine, but it doesn't:
series.dt.month += 1
The error is
ValueError: modifications to a property of a datetimelike object are not supported. Change values on the original.
Is there a simple way to achieve this without needing to redefine things?
First, I created an example time series:
import datetime
t = [datetime.datetime(2015,4,18,23,33,58),datetime.datetime(2015,4,19,14,32,8),datetime.datetime(2015,4,20,18,42,44),datetime.datetime(2015,4,20,21,41,19)]
import pandas as pd
df = pd.DataFrame(t,columns=['Date'])
Timeseries:
df
Out[]:
Date
0 2015-04-18 23:33:58
1 2015-04-19 14:32:08
2 2015-04-20 18:42:44
3 2015-04-20 21:41:19
Now for the increment part, you can use a DateOffset. Using months=1 moves each date forward by one calendar month:
df['Date'] + pd.DateOffset(months=1)
Output:
Out[66]:
0 2015-05-18 23:33:58
1 2015-05-19 14:32:08
2 2015-05-20 18:42:44
3 2015-05-20 21:41:19
Name: Date, dtype: datetime64[ns]
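Adding a calendar month and adding a fixed 30 days only coincide for some dates. The sketch below shows the difference; the 2015-01-31 value is an added example, not from the original post.

```python
import pandas as pd

series = pd.Series(pd.to_datetime(["2015-01-31 08:00:00", "2015-04-18 23:33:58"]))

# Calendar-aware: the same day next month, clipped to the month's last day.
plus_month = series + pd.DateOffset(months=1)

# Fixed duration: exactly 30 * 24 hours later.
plus_30_days = series + pd.Timedelta(days=30)

print(plus_month)
print(plus_30_days)
```

For the April date both give 18 May, but for 31 January the month offset gives 28 February while 30 days gives 2 March.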

How to convert string timestamp values to datetime64 dtype

The DataFrame below stores the timestamp values as strings:
import pandas as pd
df = pd.DataFrame({'Time':['00:00:00:19','02:11:00:07','02:00:40:23']})
What method should I use to convert these string values to datetime64 so that sum() and mean() can be applied to the column?
It's probably not the best way, but it's functional:
durations = (df.Time.str.split(':', expand=True).applymap(int) * [24*60*60, 60*60, 60, 1]).sum(axis=1).apply(pd.Timedelta, unit='s')
Gives you:
0 0 days 00:00:19
1 2 days 11:00:07
2 2 days 00:40:23
dtype: timedelta64[ns]
And durations.sum() will give you Timedelta('4 days 11:40:49')
Okay - slightly easier:
df.Time.str.replace(r'(\d+):(.*)', r'\1 days \2', regex=True).apply(pd.Timedelta)
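Here is a self-contained sketch of the same conversion using pd.to_timedelta, followed by the sum() and mean() the question asks about. It assumes, as the answers above do, that the four fields mean days:hours:minutes:seconds.

```python
import pandas as pd

df = pd.DataFrame({"Time": ["00:00:00:19", "02:11:00:07", "02:00:40:23"]})

# Split "DD:HH:MM:SS" into four integer columns (assumed field order).
parts = df["Time"].str.split(":", expand=True).astype(int)

# Collapse the fields into a total number of seconds per row.
seconds = parts[0] * 86400 + parts[1] * 3600 + parts[2] * 60 + parts[3]

# Convert to timedelta64 so that sum() and mean() work as durations.
durations = pd.to_timedelta(seconds, unit="s")
total = durations.sum()
average = durations.mean()

print(durations)
print(total)
print(average)
```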

Extracting hours from a csv with pandas

I have a csv that looks like this
time,result
1308959819,1
1379259923,2
1318632821,3
1375216682,2
1335930758,4
Times are in Unix format. I want to extract the hours from these times and group the file by those values.
I tried
times = pd.to_datetime(df.time, unit='s')
or even
times = pd.DataFrame(pd.to_datetime(df.time, unit='s'))
but in both cases I got an error with
times.hour
>>>AttributeError: 'DataFrame' object has no attribute 'hour'
You're getting that error because Series and DataFrames don't have hour attributes. You can access the information you want using the .dt accessor:
>>> times = pd.to_datetime(df.time, unit='s')
>>> times
0 2011-06-24 23:56:59
1 2013-09-15 15:45:23
2 2011-10-14 22:53:41
3 2013-07-30 20:38:02
4 2012-05-02 03:52:38
Name: time, dtype: datetime64[ns]
>>> times.dt
<pandas.tseries.common.DatetimeProperties object at 0xb5de94c>
>>> times.dt.hour
0 23
1 15
2 22
3 20
4 3
dtype: int64
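To cover the grouping part of the question, the extracted hours can be passed straight to groupby. This is a sketch that rebuilds the CSV from the question in memory; taking the mean of result is just one possible aggregation, chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "time": [1308959819, 1379259923, 1318632821, 1375216682, 1335930758],
        "result": [1, 2, 3, 2, 4],
    }
)

# Convert Unix seconds to datetimes (UTC) and take the hour of each one.
hours = pd.to_datetime(df["time"], unit="s").dt.hour

# Group the rows by hour and aggregate the result column.
mean_by_hour = df.groupby(hours)["result"].mean()
print(mean_by_hour)
```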
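You can also use the built-in datetime class, applied element by element. Passing tz makes the hours match the UTC values above; without it, fromtimestamp uses the local time zone.
import datetime
# your code here
hours = df.time.apply(lambda ts: datetime.datetime.fromtimestamp(ts, tz=datetime.timezone.utc).hour)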
