or-clause with date and null-values in python

I have the following df:
date_from date_to birth_date death_date
0 2016-01-10 2019-06-05 2015-02-15 2018-07-25
1 2016-05-11 2020-06-13 2014-03-07 2020-07-11
2 2016-02-23 NaT 2014-03-07 2019-06-08
3 2015-12-08 NaT 2014-03-07 2019-06-08
I'm trying to select all cases where date_to > death_date OR where date_to is NaT.
I've tried the following code:
df = df[(df['date_to'] > df['death_date']) | (df[df['DATE_TO'].isnull()])]
but I get the following error message:
'TypeError: cannot compare a dtyped [float64] array with a scalar of type [bool]'
and I don't really know how to work around this problem.

From your question: the | operator needs two boolean Series, but df[df['DATE_TO'].isnull()] wraps the mask in a second DataFrame selection (and the column name's case doesn't match date_to). Use the mask directly:
import pandas as pd
# ..... your data frame df ......
# considering that you have the following types
>>> df.dtypes
date_from datetime64[ns]
date_to datetime64[ns]
birth_date datetime64[ns]
death_date datetime64[ns]
dtype: object
df = df[(df['date_to'] > df['death_date']) | (df['date_to'].isnull())]
>>> df
date_from date_to birth_date death_date
0 2016-01-10 2019-06-05 2015-02-15 2018-07-25
2 2016-02-23 NaT 2014-03-07 2019-06-08
3 2015-12-08 NaT 2014-03-07 2019-06-08
In case the date_to column is not datetime, you can convert it like this:
df['date_to'] = df['date_to'].replace('Nat', pd.NaT)
df['date_to'] = pd.to_datetime(df['date_to'])
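If the raw column mixes other unparseable strings with real dates, a one-pass alternative is to let pandas coerce them (a sketch; errors='coerce' turns anything unparseable, including 'Nat', into NaT):
import pandas as pd
# one-pass conversion; unparseable entries become NaT
df['date_to'] = pd.to_datetime(df['date_to'], errors='coerce')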


how to change datetime format in pandas. fastest way?

I import from .csv and get object-type columns:
begin end
0 2019-03-29 17:02:32.838469+00 2019-04-13 17:32:19.134874+00
1 2019-06-13 19:22:19.331201+00 2019-06-13 19:51:21.987534+00
2 2019-03-27 06:56:51.138795+00 2019-03-27 06:56:54.834751+00
3 2019-05-28 11:09:29.320478+00 2019-05-29 06:47:21.794092+00
4 2019-03-24 07:03:03.582679+00 2019-03-24 09:50:32.595199+00
I need to get them in datetime format, dates only:
begin end
0 2019-03-29 2019-04-13
1 2019-06-13 2019-06-13
2 2019-03-27 2019-03-27
3 2019-05-28 2019-05-29
4 2019-03-24 2019-03-24
What I do now (first convert to datetime, then cut to dates only, then convert to datetime again):
df['begin'] = pd.to_datetime(df['begin'], dayfirst=False)
df['end'] = pd.to_datetime(df['end'], dayfirst=False)
df['begin'] = df['begin'].dt.date
df['end'] = df['end'].dt.date
df['begin'] = pd.to_datetime(df['begin'], dayfirst=False)
df['end'] = pd.to_datetime(df['end'], dayfirst=False)
Is there any shorter way to do this without converting twice?
You can use .apply() on the 2 columns to call pd.to_datetime once per column, then dt.normalize() to strip the time info while keeping the datetime format (dt.tz_localize(None) removes the timezone info):
df = df.apply(lambda x: pd.to_datetime(x).dt.tz_localize(None).dt.normalize())
Result:
print(df)
begin end
0 2019-03-29 2019-04-13
1 2019-06-13 2019-06-13
2 2019-03-27 2019-03-27
3 2019-05-28 2019-05-29
4 2019-03-24 2019-03-24
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5 entries, 0 to 4
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 begin 5 non-null datetime64[ns] <=== datetime format
1 end 5 non-null datetime64[ns] <=== datetime format
dtypes: datetime64[ns](2)
memory usage: 120.0 bytes
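A per-column variant without .apply(), sketched under the same assumptions (all offsets are +00): dt.floor('D') truncates to midnight just like dt.normalize():
for col in ['begin', 'end']:
    # parse once, drop the timezone, truncate to midnight, stay datetime64
    df[col] = pd.to_datetime(df[col]).dt.tz_localize(None).dt.floor('D')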

Merge dataframe object and timedelta64

I have a dataframe whose timestamp column is of dtype datetime64
df:
time timestamp
18053.401736 2019-06-06 09:38:30+00:00
18053.418252 2019-06-06 10:02:17+00:00
18053.424514 2019-06-06 10:11:18+00:00
18053.454132 2019-06-06 10:53:57+00:00
Name: timestamp, dtype: datetime64[ns, UTC]
and a Series of dtype timedelta64
ss:
ref_time
0 0 days 09:00:00
1 0 days 09:00:01
2 0 days 09:00:02
3 0 days 09:00:03
4 0 days 09:00:04
...
21596 0 days 14:59:56
21597 0 days 14:59:57
21598 0 days 14:59:58
21599 0 days 14:59:59
21600 0 days 15:00:00
Name: timeonly, Length: 21601, dtype: timedelta64[ns]
I want to merge the two so that the output df has values only where timestamp coincides with an entry of the Series:
Desired output:
time timestamp ref_time
NaN NaN 09:00:00
... ... ...
NaN NaN 09:38:29
18053.401736 2019-06-06 09:38:30+00:00 09:38:30
NaN NaN 09:38:31
... ... ...
18053.418252 2019-06-06 10:02:17+00:00 10:02:17
NaN NaN 10:02:18
NaN NaN 10:02:19
... ... ...
18053.424514 2019-06-06 10:11:18+00:00 10:11:18
... ... ...
18053.454132 2019-06-06 10:53:57+00:00 10:53:57
However, if I convert 'timestamp' to time-only I get an object dtype, and I can't merge it with ss.
dframe['timestamp'].dtype # --> datetime64[ns, UTC]
df['timeonly'] = df['timestamp'].dt.time
df['timeonly'].dtype # --> object
df_date.merge(timeax, how='outer', on=['timeonly'])
# ValueError: You are trying to merge on object and timedelta64[ns] columns. If you wish to proceed you should use pd.concat
but using concat as suggested doesn't give me the desired output.
How can I merge/join the DataFrame and the Series?
Pandas version 1.1.5
Convert the timestamp to timedelta by subtracting the date part, and then merge:
df1 = pd.DataFrame([pd.Timestamp('2019-06-06 09:38:30+00:00'), pd.Timestamp('2019-06-06 10:02:17+00:00')], columns=['timestamp'])
df2 = pd.DataFrame([pd.Timedelta('09:38:30')], columns=['ref_time'])
>>> df1
timestamp
0 2019-06-06 09:38:30+00:00
1 2019-06-06 10:02:17+00:00
>>> df1.dtypes
timestamp datetime64[ns, UTC]
dtype: object
>>> df2
ref_time
0 09:38:30
>>> df2.dtypes
ref_time timedelta64[ns]
dtype: object
df1['merge_key'] = df1['timestamp'].dt.tz_localize(None) - pd.to_datetime(df1['timestamp'].dt.date)
df_merged = df1.merge(df2, left_on = 'merge_key', right_on = 'ref_time')
Gives:
timestamp merge_key ref_time
0 2019-06-06 09:38:30+00:00 09:38:30 09:38:30
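To reproduce the desired output above (every ref_time row kept, with NaN where nothing matches), the same merge should work with how='right' — a sketch under the same setup:
df_merged = df1.merge(df2, left_on='merge_key', right_on='ref_time', how='right')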
The main challenge here is to get everything into compatible date types. Using your examples (slightly modified) as inputs:
from io import StringIO
df = pd.read_csv(StringIO(
"""
time,timestamp
18053.401736,2019-06-06 09:38:30+00:00
18053.418252,2019-06-06 10:02:17+00:00
18053.424514,2019-06-06 10:11:18+00:00
18053.454132,2019-06-06 10:53:57+00:00
"""))
df['timestamp'] = pd.to_datetime(df['timestamp'])
sdf = pd.read_csv(StringIO(
"""
ref_time
0 days 09:00:00
0 days 09:00:01
0 days 09:00:02
0 days 09:00:03
0 days 09:00:04
0 days 09:38:30
0 days 10:02:17
0 days 14:59:56
0 days 14:59:57
0 days 14:59:58
0 days 14:59:59
0 days 15:00:00
"""))
sdf['ref_time'] = pd.to_timedelta(sdf['ref_time'])
The dtypes here match those in your question, which is important.
First we figure out the base_date, since we need to convert the timedeltas into datetimes. Note we set it to midnight of the relevant date via floor('d') (round('1d') works for this data too, but would round a past-noon timestamp up to the next midnight):
base_date = df['timestamp'].iloc[0].floor('d').to_pydatetime()
base_date
output
datetime.datetime(2019, 6, 6, 0, 0, tzinfo=<UTC>)
Next we add timedeltas from sdf to the base_date:
sdf['ref_dt'] = sdf['ref_time'] + base_date
Now sdf['ref_dt'] and df['timestamp'] are in the same 'units' and of the same type, so we can merge
sdf.merge(df, left_on = 'ref_dt', right_on = 'timestamp', how = 'left')
output
ref_time ref_dt time timestamp
-- --------------- ------------------------- ------- -------------------------
0 0 days 09:00:00 2019-06-06 09:00:00+00:00 nan NaT
1 0 days 09:00:01 2019-06-06 09:00:01+00:00 nan NaT
2 0 days 09:00:02 2019-06-06 09:00:02+00:00 nan NaT
3 0 days 09:00:03 2019-06-06 09:00:03+00:00 nan NaT
4 0 days 09:00:04 2019-06-06 09:00:04+00:00 nan NaT
5 0 days 09:38:30 2019-06-06 09:38:30+00:00 18053.4 2019-06-06 09:38:30+00:00
6 0 days 10:02:17 2019-06-06 10:02:17+00:00 18053.4 2019-06-06 10:02:17+00:00
7 0 days 14:59:56 2019-06-06 14:59:56+00:00 nan NaT
8 0 days 14:59:57 2019-06-06 14:59:57+00:00 nan NaT
9 0 days 14:59:58 2019-06-06 14:59:58+00:00 nan NaT
10 0 days 14:59:59 2019-06-06 14:59:59+00:00 nan NaT
11 0 days 15:00:00 2019-06-06 15:00:00+00:00 nan NaT
and we see the merge happening where needed

Split a date range in a single row to multiple rows with one row per month [duplicate]

I am dealing with financial data which I need to extrapolate across different months. Here is my dataframe:
invoice_id,date_from,date_to
30492,2019-02-04,2019-09-18
I want to break this up into the different months between date_from and date_to. Hence I need to add rows for each month, with the month's starting date to its ending date. The final output should look like:
invoice_id,date_from,date_to
30492,2019-02-04,2019-02-28
30492,2019-03-01,2019-03-31
30492,2019-04-01,2019-04-30
30492,2019-05-01,2019-05-31
30492,2019-06-01,2019-06-30
30492,2019-07-01,2019-07-31
30492,2019-08-01,2019-08-31
30492,2019-09-01,2019-09-18
The leap-year scenario needs to be handled as well. Is there any native method already available in the pandas datetime package which I can use to achieve the desired output?
Use:
print (df)
invoice_id date_from date_to
0 30492 2019-02-04 2019-09-18
1 30493 2019-01-20 2019-03-10
import numpy as np
#added months between date_from and date_to
df1 = pd.concat([pd.Series(r.invoice_id, pd.date_range(r.date_from, r.date_to, freq='MS'))
                 for r in df.itertuples()]).reset_index()
df1.columns = ['date_from','invoice_id']
#added starts of months - sorting for correct positions
df2 = (pd.concat([df[['invoice_id','date_from']], df1], sort=False, ignore_index=True)
.sort_values(['invoice_id','date_from'])
.reset_index(drop=True))
#added MonthEnd and date_to to last rows
mask = df2['invoice_id'].duplicated(keep='last')
s = df2['invoice_id'].map(df.set_index('invoice_id')['date_to'])
df2['date_to'] = np.where(mask, df2['date_from'] + pd.offsets.MonthEnd(), s)
print (df2)
invoice_id date_from date_to
0 30492 2019-02-04 2019-02-28
1 30492 2019-03-01 2019-03-31
2 30492 2019-04-01 2019-04-30
3 30492 2019-05-01 2019-05-31
4 30492 2019-06-01 2019-06-30
5 30492 2019-07-01 2019-07-31
6 30492 2019-08-01 2019-08-31
7 30492 2019-09-01 2019-09-18
8 30493 2019-01-20 2019-01-31
9 30493 2019-02-01 2019-02-28
10 30493 2019-03-01 2019-03-10
You can use pandas.date_range with a start and end date, in combination with freq='MS', which gives the beginning of each month, and freq='M', which gives the end of each month:
x = pd.date_range(start=df.iloc[0]['date_from'], end=df.iloc[0]['date_to'], freq='MS')
y = pd.date_range(start=df.iloc[0]['date_from'], end=df.iloc[0]['date_to'], freq='M')
df_new = pd.DataFrame({'date_from':x,
'date_to':y})
df_new['invoice_id'] = df.iloc[0]['invoice_id']
print(df_new)
date_from date_to invoice_id
0 2019-03-01 2019-02-28 30492
1 2019-04-01 2019-03-31 30492
2 2019-05-01 2019-04-30 30492
3 2019-06-01 2019-05-31 30492
4 2019-07-01 2019-06-30 30492
5 2019-08-01 2019-07-31 30492
6 2019-09-01 2019-08-31 30492
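Note the misalignment: the first month start from freq='MS' (2019-03-01) gets paired with the first month end from freq='M' (2019-02-28), and the partial first and last months (2019-02-04 and 2019-09-18) are dropped, so this does not match the desired output as-is. A hedged fix, assuming date_from and date_to are already datetime, is to splice the real endpoints onto the two ranges before pairing:
row = df.iloc[0]
starts = pd.date_range(start=row['date_from'], end=row['date_to'], freq='MS')
ends = pd.date_range(start=row['date_from'], end=row['date_to'], freq='M')
# prepend the true start / append the true end unless they already fall on month boundaries
if len(starts) == 0 or starts[0] != row['date_from']:
    starts = starts.insert(0, row['date_from'])
if len(ends) == 0 or ends[-1] != row['date_to']:
    ends = ends.append(pd.DatetimeIndex([row['date_to']]))
df_new = pd.DataFrame({'invoice_id': row['invoice_id'], 'date_from': starts, 'date_to': ends})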
Another way, using the resample method of a datetime index:
# melt, so we have start and end dates in 1 column
df = pd.melt(df, id_vars='invoice_id')
# now set the date column as index
df.set_index(inplace=True, keys='value')
# resample to daily level
df = df.resample('D').ffill().reset_index()
# get the yr-month value of each daily row
df['yr_month'] = df['value'].dt.strftime("%Y-%m")
# Now group by month and take min/max day values
output = (df.groupby(['invoice_id', 'yr_month'])['value']
          .agg(date_from='min', date_to='max')  # named aggregation; dict-renaming agg was removed in pandas 1.0
          .reset_index()
          .drop(labels='yr_month', axis=1))
print(output)
invoice_id date_from date_to
0 30492 2019-02-04 2019-02-28
1 30492 2019-03-01 2019-03-31
2 30492 2019-04-01 2019-04-30
3 30492 2019-05-01 2019-05-31
4 30492 2019-06-01 2019-06-30
5 30492 2019-07-01 2019-07-31
6 30492 2019-08-01 2019-08-31
7 30492 2019-09-01 2019-09-18
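A caveat on scaling this, as a hedged note: with more than one invoice in the frame, the daily resample above runs over the whole index at once, so duplicate dates across invoices break the reindex (and values could ffill across invoices). Grouping before resampling keeps each invoice separate; a sketch starting again from the melted frame:
df = pd.melt(df, id_vars='invoice_id')
daily = (df.set_index('value')
           .groupby('invoice_id')
           .resample('D')
           .ffill()
           .drop(columns='invoice_id', errors='ignore')  # the group key may be duplicated as a column
           .reset_index())
# then build yr_month and aggregate exactly as above, on daily instead of df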

Split datetime column into separate date and time columns

I am trying to extract a Date and a Time from a Timestamp:
DateTime
31/12/2015 22:45
to be:
Date | Time |
31/12/2015| 22:45 |
however when I use:
df['Date'] = pd.to_datetime(df['DateTime']).dt.date
I get:
2015-12-31
Similarly with Time I get:
df['Time'] = pd.to_datetime(df['DateTime']).dt.time
gives
22:45:00
but if I try to format it I get an error:
df['Date'] = pd.to_datetime(df['DateTime'], format='%d/%m/%Y').dt.date
ValueError: unconverted data remains: 00:00
Try strftime
df['DateTime'] = pd.to_datetime(df['DateTime'])
df['Date'] = df['DateTime'].dt.strftime('%d/%m/%Y')
df['Time'] = df['DateTime'].dt.strftime('%H:%M')
DateTime Date Time
0 2015-12-31 22:45:00 31/12/2015 22:45
Option 1
Since you don't really need to operate on the dates per se, just split your column on space:
df = df.DateTime.str.split(expand=True)
df.columns = ['Date', 'Time']
df
Date Time
0 31/12/2015 22:45
Option 2
Alternatively, just drop the format specifier completely:
v = pd.to_datetime(df['DateTime'], errors='coerce')
df['Time'] = v.dt.time
df['Date'] = v.dt.floor('D')
df
Time Date
0 22:45:00 2015-12-31
If your DateTime column is already a datetime type, you shouldn't need to call pd.to_datetime on it.
Are you looking for a string ("12:34") or a timestamp (the concept of 12:34 in the afternoon)? If you're looking for the former, there are answers here that cover that. If you're looking for the latter, you can use the .dt.time and .dt.date accessors.
>>> pd.__version__
u'0.20.2'
>>> df = pd.DataFrame({'DateTime':pd.date_range(start='2018-01-01', end='2018-01-10')})
>>> df['date'] = df.DateTime.dt.date
>>> df['time'] = df.DateTime.dt.time
>>> df
DateTime date time
0 2018-01-01 2018-01-01 00:00:00
1 2018-01-02 2018-01-02 00:00:00
2 2018-01-03 2018-01-03 00:00:00
3 2018-01-04 2018-01-04 00:00:00
4 2018-01-05 2018-01-05 00:00:00
5 2018-01-06 2018-01-06 00:00:00
6 2018-01-07 2018-01-07 00:00:00
7 2018-01-08 2018-01-08 00:00:00
8 2018-01-09 2018-01-09 00:00:00
9 2018-01-10 2018-01-10 00:00:00
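Worth noting: .dt.date and .dt.time return Python datetime.date / datetime.time objects, so the new columns are object dtype rather than datetime64:
>>> df.dtypes
DateTime    datetime64[ns]
date                object
time                object
dtype: object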

Pandas: Adding varying numbers of days to a date in a dataframe

I have a dataframe with a date column and then a number of days that I want to add to that column. I want to create a new column, 'Recency_Date', with the resulting value.
df:
fan Community Name Count Mean_Days Date_Min
0 855 AAA Games 6 353 2013-04-16
1 855 First Person Shooters 2 420 2012-10-16
2 855 Playstation 3 108 2014-06-12
3 3148 AAA Games 1 0 2015-04-17
4 3148 Mobile Gaming 1 0 2013-01-19
df info:
merged.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4627415 entries, 0 to 4627414
Data columns (total 5 columns):
fan int64
Community Name object
Count int64
Mean_Days int32
Date_Min datetime64[ns]
dtypes: datetime64[ns](1), int32(1), int64(2), object(1)
memory usage: 194.2+ MB
Sample data as csv:
fan,Community Name,Count,Mean_Days,Date_Min
855,AAA Games,6,353,2013-04-16 00:00:00
855,First Person Shooters,2,420,2012-10-16 00:00:00
855,Playstation,3,108,2014-06-12 00:00:00
3148,AAA Games,1,0,2015-04-17 00:00:00
3148,Mobile Gaming,1,0,2013-01-19 00:00:00
3148,Power PCs,2,0,2014-06-17 00:00:00
3148,XBOX,1,0,2009-11-12 00:00:00
3860,AAA Games,1,0,2012-11-28 00:00:00
3860,Minecraft,3,393,2011-09-07 00:00:00
4044,AAA Games,5,338,2010-11-15 00:00:00
4044,Blizzard Games,1,0,2013-07-12 00:00:00
4044,Geek Culture,1,0,2011-06-03 00:00:00
4044,Indie Games,2,112,2013-01-09 00:00:00
4044,Minecraft,1,0,2014-01-02 00:00:00
4044,Professional Gaming,1,0,2014-01-02 00:00:00
4044,XBOX,2,785,2010-11-15 00:00:00
4827,AAA Games,1,0,2010-08-24 00:00:00
4827,Gaming Humour,1,0,2012-05-05 00:00:00
4827,Minecraft,2,10,2012-03-21 00:00:00
5260,AAA Games,4,27,2013-09-17 00:00:00
5260,Indie Games,8,844,2011-06-08 00:00:00
5260,MOBA,2,0,2012-10-27 00:00:00
5260,Minecraft,5,106,2012-02-17 00:00:00
5260,XBOX,1,0,2011-06-15 00:00:00
5484,AAA Games,21,1296,2009-08-01 00:00:00
5484,Free to Play,1,0,2014-12-08 00:00:00
5484,Indie Games,1,0,2014-05-28 00:00:00
5484,Music Games,1,0,2012-09-12 00:00:00
5484,Playstation,1,0,2012-02-22 00:00:00
I've tried:
merged['Recency_Date'] = merged['Date_Min'] + timedelta(days=merged['Mean_Days'])
and:
merged['Recency_Date'] = pd.DatetimeIndex(merged['Date_Min']) + pd.DateOffset(merged['Mean_Days'])
But I'm having trouble finding something that will work for a Series rather than an individual int value. Any and all help would be very much appreciated.
If the 'Date_Min' dtype is already datetime, then you can construct a TimedeltaIndex from your 'Mean_Days' column and add it:
In [174]:
import datetime as dt
df = pd.DataFrame({'Date_Min':[dt.datetime.now(), dt.datetime(2015,3,4), dt.datetime(2011,6,9)], 'Mean_Days':[1,2,3]})
df
Out[174]:
Date_Min Mean_Days
0 2015-09-15 14:02:37.452369 1
1 2015-03-04 00:00:00.000000 2
2 2011-06-09 00:00:00.000000 3
In [175]:
df['Date_Min'] + pd.TimedeltaIndex(df['Mean_Days'], unit='D')
Out[175]:
0 2015-09-16 14:02:37.452369
1 2015-03-06 00:00:00.000000
2 2011-06-12 00:00:00.000000
Name: Date_Min, dtype: datetime64[ns]
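An equivalent idiom, sketched under the same assumption that Date_Min is already datetime64, is pd.to_timedelta, which also builds the offsets from a Series:
df['Recency_Date'] = df['Date_Min'] + pd.to_timedelta(df['Mean_Days'], unit='D')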
