How to find rows of a dataframe containing a date value? - python

There is a huge dataframe containing multiple data types in different columns. I want to find rows that contain date values in different columns.
Here is a test dataframe:
import numpy as np
import pandas as pd
from datetime import datetime

dt = pd.Series(['abc', datetime.now(), 12, '', None, np.nan, '2020-05-05'])
dt1 = pd.Series([3, datetime.now(), 'sam', '', np.nan, 'abc-123', '2020-05-25'])
dt3 = pd.Series([1, 2, 3, 4, 5, 6, 7])
df = pd.DataFrame({"A": dt.values, "B": dt1.values, "C": dt3.values})
Now, I want to create a new dataframe containing only the rows where both columns A and B hold dates, i.e. the second and the last row here.
Expected output:
A B C
1 2020-06-01 16:58:17.274311 2020-06-01 17:13:20.391394 2
6 2020-05-05 2020-05-25 7
What is the best way to do that? Thanks.
P.S. Dates can be in any standard format.

Use:
m = df[['A', 'B']].transform(pd.to_datetime, errors='coerce').isna().any(axis=1)
df = df[~m]
Here pd.to_datetime with errors='coerce' turns anything that cannot be parsed as a datetime into NaT, so m flags every row where at least one of A and B failed to parse, and ~m keeps the rows where both parsed.
Result:
# print(df)
A B C
1 2020-06-01 17:54:16.377722 2020-06-01 17:54:16.378432 2
6 2020-05-05 2020-05-25 7

A solution that tests only columns A and B is boolean indexing with DataFrame.notna and DataFrame.all, so that any row containing a non-datetime is dropped:
df = df[df[['A','B']].apply(pd.to_datetime, errors='coerce').notna().all(axis=1)]
print(df)
A B C
1 2020-06-01 16:14:35.020855 2020-06-01 16:14:35.021855 2
6 2020-05-05 2020-05-25 7

import numpy as np
import pandas as pd
from datetime import datetime

dt = pd.Series(['abc', datetime.now(), 12, '', None, np.nan, '2020-05-05'])
dt1 = pd.Series([3, datetime.now(), 'sam', '', np.nan, 'abc-123', '2020-05-25'])
dt3 = pd.Series([1, 2, 3, 4, 5, 6, 7])
df = pd.DataFrame({"A": dt.values, "B": dt1.values, "C": dt3.values})

m = pd.concat([pd.to_datetime(df['A'], errors='coerce'),
               pd.to_datetime(df['B'], errors='coerce')], axis=1).isna().any(axis=1)
print(df[~m])
Prints:
A B C
1 2020-06-01 12:17:51.320286 2020-06-01 12:17:51.320826 2
6 2020-05-05 2020-05-25 7
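Whichever variant you use, note that pd.to_datetime with errors='coerce' is permissive about what it accepts as a date. If you need stricter matching, a small helper is one option; this is a minimal sketch assuming ISO-formatted date strings (is_dateish is a hypothetical name; adjust the format to your data):
def is_dateish(v):
    # datetime objects count as dates outright
    if isinstance(v, datetime):
        return True
    # everything else must match the assumed ISO format as a string
    try:
        datetime.strptime(str(v), '%Y-%m-%d')
        return True
    except (ValueError, TypeError):
        return False

df_strict = df[df[['A', 'B']].applymap(is_dateish).all(axis=1)]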

Related

How to make new timeseries based on previous value in python pandas?

This is my first question on Stack Overflow, so the formatting might be a bit off. I have a problem that I can solve with a for loop in Python, but I don't know whether pandas itself offers a faster way to do the same thing.
Problem:
Suppose I have a pandas Series 'in' indexed by date, where every date has an integer value. There is also a Series 'out' with the same structure.
Ex:
in
date val
2022-12-01 5
2022-12-02 8
2022-12-03 19
out
date val
2022-12-01 3
2022-12-02 7
2022-12-03 21
If I want to make a Series of the number of events being processed each day, I could do it with a for loop where the value for every day is open[i] = open[i-1] + in[i] - out[i]. The result should be
open
date val
2022-12-01 2 #5-3
2022-12-02 3 #2+8-7
2022-12-03 1 #3+19-21
Is there a way to do this in pandas itself, without the need for a for loop?
new answer
Use cumsum. The recurrence open[i] = open[i-1] + in[i] - out[i] (with open[0] = in[0] - out[0]) telescopes into a running total of the daily net flow in[i] - out[i], so a cumulative sum of the difference is all you need:
ser_open = ser_in.sub(ser_out).cumsum()
Output:
2022-12-01 2
2022-12-02 3
2022-12-03 1
dtype: int64
Used input:
ser_in = pd.Series([5, 8, 19], index=['2022-12-01', '2022-12-02', '2022-12-03'])
ser_out = pd.Series([3, 7, 21], index=['2022-12-01', '2022-12-02', '2022-12-03'])
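As a quick sanity check, the question's recurrence can be run as an explicit loop over the inputs above and compared with the one-liner (a minimal sketch):
prev = 0
vals = []
for i, o in zip(ser_in, ser_out):
    prev = prev + i - o  # open[i] = open[i-1] + in[i] - out[i]
    vals.append(prev)
assert pd.Series(vals, index=ser_in.index).equals(ser_open)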
initial answer
Use shift after setting 'date' as the index (note this variant reads the previous day's value from df_open itself, so it mainly serves to recompute values that already exist):
out = (df_open
       .set_index('date').shift()
       .add(df_in.set_index('date') - df_out.set_index('date'),
            fill_value=0)
       .reset_index()
      )
Or, for assignment, use a variant with map:
df_open['val'] = df_open['date'].map(
    (df_open.set_index('date').shift()
            .add(df_in.set_index('date') - df_out.set_index('date'),
                 fill_value=0))['val']
)
Output:
date val
0 2022-12-01 2.0
1 2022-12-02 3.0
2 2022-12-03 1.0
Used inputs:
df_in = pd.DataFrame({'date': ['2022-12-01', '2022-12-02', '2022-12-03'], 'val': [5, 8, 19]})
df_out = pd.DataFrame({'date': ['2022-12-01', '2022-12-02', '2022-12-03'], 'val': [3, 7, 21]})
df_open = pd.DataFrame({'date': ['2022-12-01', '2022-12-02', '2022-12-03'], 'val': [2.0, 3.0, 1.0]})

Pandas test reappearance of a value based on the rolling period

I am trying to find a way to check whether my current row's value df['ColM'] in the dataframe below appeared in a 5-day look-back period:
import pandas as pd
import numpy as np
df = pd.DataFrame()
df['ColN'] = ['AAA', 'AAA', 'AAA', 'ABC', 'ABC', 'ABC', 'ABC', 'ABC']
df['ColM'] = ['XYZ', 'WUV', 'WUV', 'XYZ', 'WUV', 'WUV', 'OPQ', 'XYZ']
df['ColN_dt'] = ['03-12-2018', '03-13-2018', '03-16-2018', '03-18-2018', '03-22-2018', '03-23-2018', '03-26-2018', '03-30-2018']
I am trying to see whether the row's value in column ColM, within its ColN group, appeared in the last 5 days, i.e. I am looking to create a flag:
df['flag'] = [0, 0, 1, 0, 0, 1, 0, 0]
I think you can create a flag column using groupby, if your df['ColN_dt'] is a datetime Series:
# Set df['ColN_dt'] to datetime:
df['ColN_dt'] = pd.to_datetime(df['ColN_dt'])
# Make sure dates are sorted (they are in your example, but just in case)
df.sort_values('ColN_dt', inplace=True)
# Create your flag column
df['flag'] = (df.groupby(['ColN', 'ColM'])['ColN_dt']
                .apply(lambda x: x.diff() < pd.Timedelta(days=5))
                .astype(int))
This returns:
>>> df
ColN ColM ColN_dt flag
0 AAA XYZ 2018-03-12 0
1 AAA WUV 2018-03-13 0
2 AAA WUV 2018-03-16 1
3 ABC XYZ 2018-03-18 0
4 ABC WUV 2018-03-22 0
5 ABC WUV 2018-03-23 1
6 ABC OPQ 2018-03-26 0
7 ABC XYZ 2018-03-30 0
Explanation:
df.groupby(['ColN', 'ColM'])['ColN_dt']
Groups your dataframe by ColN and ColM
.apply(lambda x: x.diff() < pd.Timedelta(days=5))
Checks, within each group, whether a row's ColN_dt is less than 5 days after the previous row's. This returns booleans, which you can convert to int with .astype(int)
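Note that on recent pandas versions, groupby.apply with group_keys=True (the default) prepends the group keys to the result's index, which can break the assignment above. Computing the per-group diff directly avoids apply altogether; a sketch of the same logic:
df['flag'] = (df.groupby(['ColN', 'ColM'])['ColN_dt']
                .diff()                    # NaT for the first row of each group
                .lt(pd.Timedelta(days=5))  # comparisons against NaT are False
                .astype(int))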

Add a new row to a Pandas DataFrame with specific index name

I'm trying to add a new row to the DataFrame with a specific index name 'e'.
number variable values
a NaN bank true
b 3.0 shop false
c 0.5 market true
d NaN government true
I have tried the following but it's creating a new column instead of a new row.
new_row = [1.0, 'hotel', 'true']
df = df.append(new_row)
I still don't understand how to insert the row with a specific index. I would be grateful for any suggestions.
You can use df.loc[_not_yet_existing_index_label_] = new_row.
Demo:
In [3]: df.loc['e'] = [1.0, 'hotel', 'true']
In [4]: df
Out[4]:
number variable values
a NaN bank True
b 3.0 shop False
c 0.5 market True
d NaN government True
e 1.0 hotel true
P.S. Using this method you can't add a row with an already existing (duplicate) index label: the row with that label will be updated instead.
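For example, with the df above (hypothetical replacement values):
df.loc['e'] = [2.0, 'cafe', 'false']  # 'e' already exists, so the row is updated in place, not appended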
UPDATE:
This might not work in recent pandas/Python 3 if the index is a DatetimeIndex and the new row's label doesn't exist. It will work if we specify correct index value(s).
Demo (using pandas: 0.23.4):
In [17]: ix = pd.date_range('2018-11-10 00:00:00', periods=4, freq='30min')
In [18]: df = pd.DataFrame(np.random.randint(100, size=(4,3)), columns=list('abc'), index=ix)
In [19]: df
Out[19]:
a b c
2018-11-10 00:00:00 77 64 90
2018-11-10 00:30:00 9 39 26
2018-11-10 01:00:00 63 93 72
2018-11-10 01:30:00 59 75 37
In [20]: df.loc[pd.to_datetime('2018-11-10 02:00:00')] = [100,100,100]
In [21]: df
Out[21]:
a b c
2018-11-10 00:00:00 77 64 90
2018-11-10 00:30:00 9 39 26
2018-11-10 01:00:00 63 93 72
2018-11-10 01:30:00 59 75 37
2018-11-10 02:00:00 100 100 100
In [22]: df.index
Out[22]: DatetimeIndex(['2018-11-10 00:00:00', '2018-11-10 00:30:00', '2018-11-10 01:00:00', '2018-11-10 01:30:00', '2018-11-10 02:00:00'], dtype='datetime64[ns]', freq=None)
Use append, converting the list to a dataframe, in case you want to add multiple rows at once, i.e.
df = df.append(pd.DataFrame([new_row], index=['e'], columns=df.columns))
Or, for a single row (thanks @Zero):
df = df.append(pd.Series(new_row, index=df.columns, name='e'))
Output:
number variable values
a NaN bank True
b 3.0 shop False
c 0.5 market True
d NaN government True
e 1.0 hotel true
If it's the first row you need:
df = pd.DataFrame(columns=['number', 'variable', 'values'])
df.loc['e', ['number', 'variable', 'values']] = [1.0, 'hotel', 'true']
df.loc['e', :] = [1.0, 'hotel', 'true']
should be the correct implementation in case of conflicting index and column names.
DataFrame.append(other, ignore_index=False, verify_integrity=False, sort=False) was deprecated in pandas 1.4 and removed in pandas 2.0.
Source: Pandas Documentation
The documentation recommends using pd.concat() instead.
It would look like this, if you want an empty row with only the added index label:
df = pd.concat([df, pd.DataFrame(index=['New index label'])])
If you want to add data as well, use this:
df = pd.concat([df, pd.DataFrame([new_row], index=['New index label'], columns=df.columns)])
Hope that helps!

Groupby certain number of rows pandas

I have a dataframe with let's say 2 columns: dates and doubles
2017-05-01 2.5
2017-05-02 3.5
... ...
2017-05-17 0.2
2017-05-18 2.5
Now I would like to do a groupby-and-sum over every x rows. So, e.g., with 6 rows it would return:
2017-05-06 15.6
2017-05-12 13.4
2017-05-18 18.0
Is there a clean solution to do this without running it through a for-loop with something like this:
temp = pd.DataFrame()
# note: .ix has been removed in modern pandas; this is the original attempt
for i in range(0, len(df.index), 6):
    temp[df.ix[i]['date']] = df.ix[i:i+6]['value'].sum()
I guess you are looking for resample. Consider this dataframe:
rng = pd.date_range('2017-05-01', periods=18, freq='D')
num = np.random.randint(5, size=18)
df = pd.DataFrame({'date': rng, 'val': num})
df.resample('6D', on='date').sum().reset_index()
will return (note that resample labels each 6-day window by its left edge, while the expected output above used the last date of each window):
date val
0 2017-05-01 14
1 2017-05-07 11
2 2017-05-13 16
This is an alternative solution, grouping by an integer range over the length of the dataframe (np.arange(len(df)) // 6 gives each block of 6 rows the same key).
Two columns, using apply with a Series (returning a dict from agg is no longer supported in recent pandas):
df.groupby(np.arange(len(df)) // 6).apply(lambda x: pd.Series({'date': x.date.iloc[0],
                                                               'value': x.value.sum()}))
For multiple columns, you can use first (or last) for the date and sum for the other columns:
group = df.groupby(np.arange(len(df)) // 6)
pd.concat((group['date'].first(),
           group[[c for c in df.columns if c != 'date']].sum()), axis=1)
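On pandas 0.25+, named aggregation expresses the same first/sum pattern more compactly; a sketch assuming the 'date'/'val' columns from the resample example:
out = (df.groupby(np.arange(len(df)) // 6)
         .agg(date=('date', 'first'), val=('val', 'sum')))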

Counting the business days between two series

Is there a better way than bdate_range() to measure business days between two columns of dates via pandas?
df = pd.DataFrame({'A': ['1/1/2013', '2/2/2013', '3/3/2013'],
                   'B': ['1/12/2013', '4/4/2013', '3/3/2013']})
print(df)
df['A'] = pd.to_datetime(df['A'])
df['B'] = pd.to_datetime(df['B'])
f = lambda x: len(pd.bdate_range(x['A'], x['B']))
df['DIFF'] = df.apply(f, axis=1)
print(df)
With output of:
A B
0 1/1/2013 1/12/2013
1 2/2/2013 4/4/2013
2 3/3/2013 3/3/2013
A B DIFF
0 2013-01-01 00:00:00 2013-01-12 00:00:00 9
1 2013-02-02 00:00:00 2013-04-04 00:00:00 44
2 2013-03-03 00:00:00 2013-03-03 00:00:00 0
Thanks!
brian_the_bungler was onto the most efficient way of doing this using numpy's busday_count:
import numpy as np
A = [d.date() for d in df['A']]
B = [d.date() for d in df['B']]
df['DIFF'] = np.busday_count(A, B)
print(df)
On my machine this is 300x faster on your test case, and thousands of times faster on much larger arrays of dates.
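One caveat: np.busday_count counts business days in the half-open interval [begin, end), i.e. the end date is excluded, while len(pd.bdate_range(a, b)) includes both endpoints, so the two can differ by one when the end date falls on a business day. Also, busday_count accepts datetime64[D] arrays directly, so the list comprehensions above can be skipped (a sketch):
df['DIFF'] = np.busday_count(df['A'].values.astype('datetime64[D]'),
                             df['B'].values.astype('datetime64[D]'))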
You can use pandas' BDay offset to step through business days between two dates like this:
new_column = some_date - pd.tseries.offsets.BDay(15)
Read more in this conversation: https://stackoverflow.com/a/44288696
It also works if some_date is a single date value, not a series.
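For instance, a minimal sketch with a scalar date (the 15-day offset is just the example above):
some_date = pd.Timestamp('2013-03-03')
print(some_date - pd.tseries.offsets.BDay(15))  # 15 business days before some_date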

Categories

Resources