Replacing a substring with another substring pandas python

Replacing a substring with another substring pandas python - python

I have a dataframe, df.
I want to replace the 7th to 5th from last character with a 0 if it's a /:
df['StartDate'].str[-7:-5]=df['StartDate'].str[-7:-5].str.replace('/', '0')
Returns the error:
TypeError: 'StringMethods' object does not support item assignment
Data looks like:
number StartDate EndDate Location_Id Item_Id xxx yyy\
3 460 4/1/2012 4/11/2012 2502 3890004215 0 0
28 2731 10/17/2013 10/30/2013 3509 5100012114 0 0
34 1091 1/10/2013 1/23/2013 2544 5100012910 0 0
134 1630 5/2/2013 5/15/2013 2506 69511912000 0 0
138 327 1/12/2012 1/25/2012 5503 1380016686

Pandas has builtin support for datetime objects (pandas might have its own implementation rather than using the standard library's directly, but the idea is the same), so instead of trying to reformat dates using string methods, converting to datetime is much easier:
df['StartDate'] = pd.to_datetime(df['StartDate'])
Once you've converted, there are some easy to use methods related to datetime objects that you can get at through the .dt accessor (may be a recent addition in v0.15):
df.StartDate.dt.month
Out[20]:
3 4
28 10
34 1
134 5
138 1
dtype: int64

Related

Why isn't my column converting to string from int?

*Input:*
df["waiting_time"].value_counts()

*Output:*
2 days 6724
4 days 5290
1 days 5213
7 days 4906
6 days 4037
...
132 days 1
125 days 1
117 days 1
146 days 1
123 days 1
Name: waiting_time, Length: 128, dtype: int64
I tried:
df['wait_dur'] = df['waiting_time'].values.astype(str)
and I've tried apply as well. No changes to the data type, it stays the same.

You need to skip the 'values' part in your code:
df['wait_dur'] = df['waiting_time'].astype(str)
If you check first row for example, you will get:
type(df['wait_dur'][0])
<class 'str'>

df = df.applymap(str)
This should work, it applies the map string throughout.
If you want to see more methods go here.

How to convert timedelta to integer in pandas?

I have a column 'Time' in pandas that includes both integer and time deltas in days:
index Time
1 91
2 28
3 509 days 00:00:00
4 341 days 00:00:00
5 250 days 00:00:00
I am wanting to change all of the Time deltas to integers, but I am getting many errors when trying to pick and choose which values to convert, as it throws errors when I try to convert an integer within the column rather than a TD.
I want this:
index Time
1 91
2 28
3 509
4 341
5 250
I've tried a few variations of this to check if it's an integer, as I'm not concerned with those:
for x in finished['Time Future']:
if isinstance(x, int):
continue
else:
finished['Time'][x] = finished['Time'][x].astype(int)
But It is not working at all. I can't seem to find a solution.

This seems to work:
# If the day counts are actual integers:
m = ~df.Time.apply(lambda x: isinstance(x, int))
# OR, in case the day counts are strings:
m = ~df.Time.str.isdigit()
df.loc[m, 'Time'] = df.Time[m].apply(lambda x: pd.Timedelta(x).days)
Results in:
Time
1 91
2 28
3 509
4 341
5 250

Elegant way to add years as timedelta units to shift dates - Pandas

I have a dataframe like as shown below
df1 = pd.DataFrame({'person_id': [11,11,11,21,21],
'admit_dates': ['03/21/2015', '01/21/2016', '7/20/2018','01/11/2017','12/31/2011'],
'discharge_dates': ['05/09/2015', '01/29/2016', '7/27/2018','01/12/2017','01/31/2016'],
'drug_start_dates': ['05/29/1967', '01/21/1957', '7/27/1959','01/01/1961','12/31/1961'],
'offset':[223,223,223,310,310]})
What I would like to do is add offset which is in years to the dates columns.
So, I was trying to convert the offset to timedelta object with unit=y or unit=Y and then shift admit_dates
df1['offset'] = pd.to_timedelta(df1['offset'],unit='Y') #also tried with `y` (small y)
df1['shifted_date'] = df1['admit_dates'] + df1['offset']
However, I get the below error
ValueError: Units 'M' and 'Y' are no longer supported, as they do not
represent unambiguous timedelta values durations.
Is there any other elegant way to shift dates by years?

The max Timestamp supported in pandas is Timestamp('2262-04-11 23:47:16.854775807') so you could not be able to add 310 years to date 12/31/2011, one possible way is to use python's datetime objects which support a max year upto 9999 so you should be able to add 310 years to that.
from dateutil.relativedelta import relativedelta
df['admit_dates'] = pd.to_datetime(df['admit_dates'])
df['admit_dates'] = df['admit_dates'].dt.date.add(
df['offset'].apply(lambda y: relativedelta(years=y)))
Result:
df
person_id admit_dates discharge_dates drug_start_dates offset
0 11 2238-03-21 05/09/2015 05/29/1967 223
1 11 2239-01-21 01/29/2016 01/21/1957 223
2 11 2241-07-20 7/27/2018 7/27/1959 223
3 21 2327-01-11 01/12/2017 01/01/1961 310
4 21 2321-12-31 01/31/2016 12/31/1961 310

One thing you can do is extract the year out of the date, and add it to the offset:
df1 = pd.DataFrame({'person_id': [11,11,11,21,21],
'admit_dates': ['03/21/2015', '01/21/2016', '7/20/2018','01/11/2017','12/31/2011'],
'discharge_dates': ['05/09/2015', '01/29/2016', '7/27/2018','01/12/2017','01/31/2016'],
'drug_start_dates': ['05/29/1967', '01/21/1957', '7/27/1959','01/01/1961','12/31/1961'],
'offset':[10,20,2,31,12]})
df1.admit_dates = pd.to_datetime(df1.admit_dates)
df1["new_year"] = df1.admit_dates.dt.year + df1.offset
df1["date_with_offset"] = pd.to_datetime(pd.DataFrame({"year": df1.new_year,
"month": df1.admit_dates.dt.month,
"day":df1.admit_dates.dt.day}))
The catch - with your original offsets, some of the dates cause the following error:
OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 2328-01-11 00:00:00
According to the documentation, the maximum date in pandas is Apr. 11th, 2262 (at about quarter to midnight, to be specific). It's probably because they keep time in nanoseconds, and that's when the out of bounds error occurs for this representation.

Units 'Y' and 'M' becomes deprecated since pandas 0.25.0
But thanks to numpy timedelta64 through which we can use these units in the pandas Timedelta
import pandas as pd
import numpy as np
# Builds your dataframe
df1 = pd.DataFrame({'person_id': [11,11,11,21,21],
'admit_dates': ['03/21/2015', '01/21/2016', '7/20/2018','01/11/2017','12/31/2011'],
'discharge_dates': ['05/09/2015', '01/29/2016', '7/27/2018','01/12/2017','01/31/2016'],
'drug_start_dates': ['05/29/1967', '01/21/1957', '7/27/1959','01/01/1961','12/31/1961'],
'offset':[223,223,223,310,310]})
>>> df1
person_id admit_dates discharge_dates drug_start_dates offset
0 11 03/21/2015 05/09/2015 05/29/1967 223
1 11 01/21/2016 01/29/2016 01/21/1957 223
2 11 7/20/2018 7/27/2018 7/27/1959 223
3 21 01/11/2017 01/12/2017 01/01/1961 310
4 21 12/31/2011 01/31/2016 12/31/1961 310
>>> df1['shifted_date'] = df1.apply(lambda r: pd.Timedelta(np.timedelta64(r['offset'], 'Y'))+ pd.to_datetime(r['admit_dates']), axis=1)
>>> df1['shifted_date'] = df1['shifted_date'].dt.date
>>> df1
person_id admit_dates discharge_dates drug_start_dates offset shifted_date
0 11 03/21/2015 05/09/2015 05/29/1967 223 2238-03-21
1 11 01/21/2016 01/29/2016 01/21/1957 223 2239-01-21
2 11 7/20/2018 7/27/2018 7/27/1959 223 2241-07-20
....

Python Pandas: Convert timedelta Value From Subtracting Two Dates Into Integer Datatype (AttributeError)

I have the following data set output (shown below) that was produced by the following code:
df_EVENT5_5['dtin'] = pd.to_datetime(df_EVENT5_5['dtin'])
df_EVENT5_5['age'] = df_EVENT5_5['dtin'].apply(dt.datetime.date) - df_EVENT5_5['dtbuilt'].apply(dt.datetime.date)
id age
1 6252 days, 0:00:00
2 1800 days, 0:00:00
3 5873 days, 0:00:00
In the above data set, after running dtypes on the data frame, age appears to be an object.
I want to convert the "age" column into an integer datatype that only has the value of days. Below is my desired output:
id age
1 6252
2 1800
3 5873
I tried the following code:
df_EVENT5_5['age_no_days'] = df_EVENT5_5['age'].dt.total_seconds()/ (24 * 60 * 60)
Below is the error:
AttributeError: Can only use .dt accessor with datetimelike values

The fact that you are getting an object column suggests to me that there are some values that can't be interpreted as proper timedeltas. If that's the case, I would use pd.to_timedelta with the argument errors='coerce', then call dt.days:
df['age'] = pd.to_timedelta(df['age'],errors='coerce').dt.days
>>> df
id age
0 1 6252
1 2 1800
2 3 5873

Counting daily events on Pandas Time series

Hi I have a time series and would like to count how many events I have per day(i.e. rows in the table within a day). The command I'd like to use is:
ts.resample('D', how='count')
but "count" is not a valid aggregation function for time series, I suppose.
just to clarify, here is a sample of the dataframe:
0 2008-02-22 03:43:00
1 2008-02-22 03:43:00
2 2010-08-05 06:48:00
3 2006-02-07 06:40:00
4 2005-06-06 05:04:00
5 2008-04-17 02:11:00
6 2012-05-12 06:46:00
7 2004-05-17 08:42:00
8 2004-08-02 05:02:00
9 2008-03-26 03:53:00
Name: Data_Hora, dtype: datetime64[ns]
and this is the error I am getting:
ts.resample('D').count()
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-42-86643e21ce18> in <module>()
----> 1 ts.resample('D').count()
/usr/local/lib/python2.7/dist-packages/pandas/core/generic.pyc in resample(self, rule, how, axis, fill_method, closed, label, convention, kind, loffset, limit, base)
255 def resample(self, rule, how=None, axis=0, fill_method=None,
256 closed=None, label=None, convention='start',
--> 257 kind=None, loffset=None, limit=None, base=0):
258 """
259 Convenience method for frequency conversion and resampling of regular
/usr/local/lib/python2.7/dist-packages/pandas/tseries/resample.pyc in resample(self, obj)
98 return obj
99 else: # pragma: no cover
--> 100 raise TypeError('Only valid with DatetimeIndex or PeriodIndex')
101
102 rs_axis = rs._get_axis(self.axis)
TypeError: Only valid with DatetimeIndex or PeriodIndex
That can be fixed by turning the datetime column into an index with set_index. However after I do that, I still get the following error:
DataError: No numeric types to aggregate
because my Dataframe does not have a numeric column.
But I just want to count rows!! The simple "select count(*) group by ... " from SQL.

In order to get this to work, after removing the rows in which the index was NaT:
df2 = df[df.index!=pd.NaT]
I had to add a column of ones:
df2['n'] = 1
and then count only that column:
df2.n.resample('D', how="sum")
then I could visualize the data with:
plot(df2.n.resample('D', how="sum"))

In [104]: df = DataFrame(1,index=date_range('20130101 9:01',freq='h',periods=1000),columns=['A'])
In [105]: df
Out[105]:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1000 entries, 2013-01-01 09:01:00 to 2013-02-12 00:01:00
Freq: H
Data columns (total 1 columns):
A 1000 non-null values
dtypes: int64(1)
In [106]: df.resample('D').count()
Out[106]:
A 43
dtype: int64

You can do this with a one liner, using value counts and resampling.
Assuming your DataFrame is named df:
df.index.value_counts().resample('D', how='sum')
This method also works if datetime is not your index:
df.any_datetime_series.value_counts().resample('D', how='sum')

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Replacing a substring with another substring pandas python - python

Related

Why isn't my column converting to string from int?

How to convert timedelta to integer in pandas?

Elegant way to add years as timedelta units to shift dates - Pandas

Python Pandas: Convert timedelta Value From Subtracting Two Dates Into Integer Datatype (AttributeError)

Counting daily events on Pandas Time series

Categories

Resources