Exclude a specific date based on a condition using pandas - python

import pandas as pd
import numpy as np

df2 = pd.DataFrame({'person_id': [11, 11, 11, 11, 11, 12, 12, 13, 13, 14, 14, 14, 14],
                    'admit_date': ['01/01/2011', '01/01/2009', '12/31/2013', '12/31/2017',
                                   '04/03/2014', '08/04/2016', '03/05/2014', '02/07/2011',
                                   '08/08/2016', '12/31/2017', '05/01/2011', '05/21/2014',
                                   '07/12/2016']})
df2 = df2.melt('person_id', value_name='dates')
df2['dates'] = pd.to_datetime(df2['dates'])
What I would like to do is exclude/filter out records when a subject has both Dec 31st and Jan 1st in its records. Please note that the year doesn't matter.
If a subject has either Dec 31st or Jan 1st but not both, we leave them as is.
But if they have both Dec 31st and Jan 1st, we remove one of them (either Dec 31st or Jan 1st). Note that they could also have multiple entries with the same date, like person_id = 11.
I could only do the below
df2_new = df2['dates'] != '2017-12-31'  # but this only excludes Dec 31st of 2017. How can I match the month and day while ignoring the year?
df2[df2_new]
My expected output is like as shown below
For person_id = 11, we drop 12-31 because both 12-31 and 01-01 appear in their records, whereas for person_id = 14 we keep 12-31 because it appears without 01-01. In short, we drop 12-31 only when both 12-31 and 01-01 appear in a person's records.

Use:
s = df2['dates'].dt.strftime('%m-%d')                            # month-day only; year is ignored
m1 = s.eq('01-01').groupby(df2['person_id']).transform('any')    # person has a Jan 1st somewhere
m2 = s.eq('12-31').groupby(df2['person_id']).transform('any')    # person has a Dec 31st somewhere
# if a person has both, keep every row except their Dec 31st rows; otherwise keep everything
m3 = np.select([m1 & m2, m1 | m2], [s.ne('12-31'), True], default=True)
df3 = df2[m3]
Result:
# print(df3)
person_id variable dates
0 11 admit_date 2011-01-01
1 11 admit_date 2009-01-01
4 11 admit_date 2014-04-03
5 12 admit_date 2016-08-04
6 12 admit_date 2014-03-05
7 13 admit_date 2011-02-07
8 13 admit_date 2016-08-08
9 14 admit_date 2017-12-31
10 14 admit_date 2011-05-01
11 14 admit_date 2014-05-21
12 14 admit_date 2016-07-12
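The same mask can also be written with plain boolean logic, which some may find easier to read. This is a minimal equivalent sketch reusing the s, m1, and m2 computed above:
# drop a row only when its person has both dates and the row itself is a Dec 31st
m3 = ~(m1 & m2 & s.eq('12-31'))
df3 = df2[m3]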

Another way:
Coerce the dates to day-month strings.
Create a temp column where 31 Dec is converted to 01 Jan.
Drop duplicates by person_id and the temp column, keeping the first occurrence. Note that this also collapses repeated month-days for the same person (person_id 11's two 01 Jan entries become one), which differs from the mask approach above.
df2['dates'] = df2['dates'].dt.strftime('%d %b')
df2 = (df2.assign(check=np.where(df2.dates == '31 Dec', '01 Jan', df2.dates))
          .drop_duplicates(['person_id', 'variable', 'check'], keep='first')
          .drop(columns=['check']))
    person_id    variable   dates
0          11  admit_date  01 Jan
4          11  admit_date  03 Apr
5          12  admit_date  04 Aug
6          12  admit_date  05 Mar
7          13  admit_date  07 Feb
8          13  admit_date  08 Aug
9          14  admit_date  31 Dec
10         14  admit_date  01 May
11         14  admit_date  21 May
12         14  admit_date  12 Jul

Related

Given beginning and end dates with a specific year, what is the best way to fill in the date column of a dataframe that only contains day and month?

I am parsing bank statement PDFs that have the full start and end dates in question both in the filename and in the document, but the actual entries corresponding to the transactions only contain the day and month ('%d %b'). Here is what the series looks like for the "Date" column:
1
2 24 Dec
3 27 Dec
4
5
6 30 Dec
7
8 31 Dec
9
10
11 2 Jan
12
13 3 Jan
14 6 Jan
15 14 Jan
16 15 Jan
I have the start and end dates as 2013-12-23 and 2014-01-23. What is an efficient way to populate this series/column with the correct full date given the start/end range? I would like each existing date to forward-fill down to the next dated row, so:
1
2 24 Dec 2013
3 27 Dec 2013
4 27 Dec 2013
5 27 Dec 2013
6 30 Dec 2013
7 30 Dec 2013
8 31 Dec 2013
9 31 Dec 2013
10 31 Dec 2013
11 2 Jan 2014
12 2 Jan 2014
13 3 Jan 2014
14 6 Jan 2014
15 14 Jan 2014
16 15 Jan 2014
The date format is irrelevant as long as it is a datetime format. I was hoping to use something internal to pandas, but I can't figure out what to use. Right now, checking each date against the start and end dates and filling in the year based on where the date fits into the range is the best I've come up with, but it is inefficient to run this on the whole column. Any help/advice/tips would be appreciated; thanks in advance.
EDIT: just wanted to add that I am hoping for a general procedural solution that can apply to any start/end date and set of transactions and not just this particular series included although I think this is a good test case as it has an end of year overlap.
EDIT2: what I have so far after posting this question is the following, which doesn't seem terribly efficient, but seems to work:
from datetime import datetime

def add_year(date, start, end):
    if not date:
        return np.nan
    # try the start year first; if the result falls outside the range, use the end year
    test_date = datetime.strptime("{} {}".format(date, start.year), '%d %b %Y').date()
    if start <= test_date <= end:
        return test_date
    return datetime.strptime("{} {}".format(date, end.year), '%d %b %Y').date()

df['Date'] = df.Date.map(lambda date: add_year(date, start_date, end_date))
df.Date.ffill(inplace=True)
Try:
df['Date'] = df['Date'].replace('nan|NaN', float('NaN'), regex=True)
# convert string 'nan' values to actual NaN's
df['Date'] = df['Date'].ffill()
# forward-fill the NaN's
c = df['Date'].str.contains('Dec', na=False)
# check whether each 'Date' value contains 'Dec'
idx = df[c].index[-1]
# get the index of the last row where condition c is satisfied
df.loc[:idx, 'Date'] = df.loc[:idx, 'Date'] + ' 2013'
# append 2013 to 'Date' up to that last 'Dec' index
df.loc[idx+1:, 'Date'] = df.loc[idx+1:, 'Date'] + ' 2014'
# append 2014 to 'Date' after that index
df['Date'] = pd.to_datetime(df['Date'])
# finally convert these values to datetime
output of df:
Date
0 NaT
1 2013-12-24
2 2013-12-27
3 2013-12-27
4 2013-12-27
5 2013-12-30
6 2013-12-30
7 2013-12-31
8 2013-12-31
9 2013-12-31
10 2014-01-02
11 2014-01-02
12 2014-01-03
13 2014-01-06
14 2014-01-14
15 2014-01-15
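For a general, vectorized approach that works for any start/end pair, here is a minimal sketch, assuming the range spans at most one calendar-year boundary and the column holds strings; fill_years is a hypothetical helper name, not from the answers above:
import pandas as pd

def fill_years(dates, start, end):
    # parse every '%d %b' string with the start year; blank entries become NaT
    parsed = pd.to_datetime(dates + ' ' + str(start.year),
                            format='%d %b %Y', errors='coerce')
    # anything that lands before the range start must belong to the end year
    parsed = parsed.mask(parsed < start,
                         parsed + pd.DateOffset(years=end.year - start.year))
    # forward-fill the blank rows from the previous dated row
    return parsed.ffill()

df['Date'] = fill_years(df['Date'], pd.Timestamp('2013-12-23'), pd.Timestamp('2014-01-23'))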

How do I parse value as datetime?

I have dates as below:
date
0 Today, 12 Mar
1 Tomorrow, 13 Mar
2 Tomorrow, 13 Mar
3 Tomorrow, 13 Mar
4 Tomorrow, 13 Mar
5 14 Mar 2021
6 14 Mar 2021
7 14 Mar 2021
8 14 Mar 2021
9 15 Mar 2021
How do I parse it as datetime in pandas?
Your dates contain 'Today' and 'Tomorrow', which are not valid datetime tokens (at least, none that I have worked with), so first replace them with the year, assuming the year is fixed, i.e. 2021:
df['date']=df['date'].str.replace('Today','2021')
df['date']=df['date'].str.replace('Tomorrow','2021')
Now just use the to_datetime() method:
df['date']=pd.to_datetime(df['date'])
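An alternative sketch (my own variation, not from the answer above): strip the 'Today,'/'Tomorrow,' prefix, append the assumed fixed year only where it is missing, and parse with an explicit format so nothing is guessed:
cleaned = (df['date'].str.replace(r'^(?:Today|Tomorrow),\s*', '', regex=True)
                     .str.replace(r'^(\d{1,2} [A-Za-z]{3})$', r'\1 2021', regex=True))
df['date'] = pd.to_datetime(cleaned, format='%d %b %Y')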

pd.melt() a dictionary/series of dataframes

For example I have the following map:
{'df1': Jan Feb Mar
1 3 5
2 4 6
'df2': Jan Feb Mar
7 9 11
8 10 12
......}
And I want the following output:
Jan 1
Jan 2
Feb 3
Feb 4
Mar 5
Mar 6
Jan 7
Jan 8
Feb 9
Feb 10
Mar 11
Mar 12
Does anyone know if it's possible to do it this way?
What I have tried is iterating through the DataFrames to get
{'df1': Jan 1
Jan 2
Feb 3
Feb 4
Mar 5
Mar 6
'df2': Jan 7
Jan 8
Feb 9
Feb 10
Mar 11
Mar 12
by using
for x in dfMap:
    df = pd.melt(list(x.values()))
Then I tried to concat it with:
df1m = pd.concat(df.values(), ignore_index=True)
Which gave me error
AttributeError: 'list' object has no attribute 'columns'
I am fairly new to programming and really want to learn, so it would be nice if someone could explain how this works, and why a list or dict_values object has no attribute 'columns'.
Thanks in advance!
You can concat and stack. (As for the error: pd.melt expects a single DataFrame and accesses its .columns attribute, which a plain list or dict_values object doesn't have, so pass the DataFrames through pd.concat first.)
out = pd.concat(d.values()).stack().droplevel(0)
Or:
out = pd.concat(d.values()).melt()
Example:
df = pd.DataFrame(np.arange(1,10).reshape(-1,3),columns=['Jan','Feb','Mar'])
d = {}
for e, i in df.iterrows():
    d[f"df{e+1}"] = i.to_frame().T
print(d,'\n')
out = pd.concat(d.values()).stack().droplevel(0)
print(out)
{'df1': Jan Feb Mar
0 1 2 3, 'df2': Jan Feb Mar
1 4 5 6, 'df3': Jan Feb Mar
2 7 8 9}
Jan 1
Feb 2
Mar 3
Jan 4
Feb 5
Mar 6
Jan 7
Feb 8
Mar 9
dtype: int32
With melt:
out = pd.concat(d.values()).melt()
print(out)
variable value
0 Jan 1
1 Jan 4
2 Jan 7
3 Feb 2
4 Feb 5
5 Feb 8
6 Mar 3
7 Mar 6
8 Mar 9
EDIT: for the edited question, try:
out = pd.concat(d).stack().sort_index(level=[0,-1]).droplevel([0,1])
Example below:
df = pd.DataFrame(np.arange(1,13).reshape(3,-1).T,columns=['Jan','Feb','Mar'])
d = {}
for e, i in df.groupby(df.index//2):
    d[f"df{e+1}"] = i
print(d,'\n')
out = pd.concat(d).stack().sort_index(level=[0,-1]).droplevel([0,1])
print(out)
{'df1': Jan Feb Mar
0 1 5 9
1 2 6 10, 'df2': Jan Feb Mar
2 3 7 11
3 4 8 12}
Jan 1
Jan 2
Feb 5
Feb 6
Mar 9
Mar 10
Jan 3
Jan 4
Feb 7
Feb 8
Mar 11
Mar 12
dtype: int32
Or you can convert the dataframe names to int and then sort:
out = (pd.concat(d.values(),keys=[int(key[2:]) for key in d.keys()])
.stack().sort_index(level=[0,-1]).droplevel([0,1]))

Rolling percentage change in Python data frame

I have a dataframe like this (many rows):
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
a 34 24 47 30 11 57 47 44 22 33 16 39
b 50 53 42 23 19 29 38 46 21 18 13 24
. . .
.
. . .
I would like to create a new df with the rolling 3-month percentage change values, so the [1,1] element will be the % change between the value of Apr and the value of Jan, the [1,2] element will be the % change between May and Feb, etc. Therefore, for each value, I want the % change between that value and the value 3 months earlier.
This is the sample output that I want (for example the first value is
[(30-34)/34]*100 = -11.7):
Apr May Jun Jul Aug Sep Oct Nov Dec
a -11.7% -54.1% 0% 56.6% 300% .. .. .. ..
. .
. .
I know that pandas has the .pct_change method, but this does not calculate the percentages in the way that I want. Any ideas on how I can do this in Python?
Thank you
Use pct_change with axis=1 and periods=3:
df.pct_change(periods=3, axis=1)
Output:
Jan Feb Mar Apr May Jun Jul Aug Sep \
a NaN NaN NaN -0.117647 -0.541667 0.212766 0.566667 3.000000 -0.614035
b NaN NaN NaN -0.540000 -0.641509 -0.309524 0.652174 1.421053 -0.275862
Oct Nov Dec
a -0.297872 -0.636364 0.772727
b -0.526316 -0.717391 0.142857
Drop NaN columns:
df.pct_change(periods=3, axis=1).dropna(axis=1)
Output:
Apr May Jun Jul Aug Sep Oct Nov Dec
a -0.117647 -0.541667 0.212766 0.566667 3.000000 -0.614035 -0.297872 -0.636364 0.772727
b -0.540000 -0.641509 -0.309524 0.652174 1.421053 -0.275862 -0.526316 -0.717391 0.142857
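To express these as rounded percentages like the question's sample output, scale by 100; a small follow-up sketch:
# multiply the fractional changes by 100 and round to one decimal place
out = df.pct_change(periods=3, axis=1).dropna(axis=1).mul(100).round(1)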

Working with dates in pandas

I have been collecting Twitter data for a couple of days now and, among other things, I need to analyze how content propagates. I created a list of timestamps of when users were interested in content and imported the Twitter timestamps into a pandas df with the column name 'timestamps'. It looks like this:
0 Sat Dec 14 05:13:28 +0000 2013
1 Sat Dec 14 05:21:12 +0000 2013
2 Sat Dec 14 05:23:10 +0000 2013
3 Sat Dec 14 05:27:54 +0000 2013
4 Sat Dec 14 05:37:43 +0000 2013
5 Sat Dec 14 05:39:38 +0000 2013
6 Sat Dec 14 05:41:39 +0000 2013
7 Sat Dec 14 05:43:46 +0000 2013
8 Sat Dec 14 05:44:50 +0000 2013
9 Sat Dec 14 05:47:33 +0000 2013
10 Sat Dec 14 05:49:29 +0000 2013
11 Sat Dec 14 05:55:03 +0000 2013
12 Sat Dec 14 05:59:09 +0000 2013
13 Sat Dec 14 05:59:45 +0000 2013
14 Sat Dec 14 06:17:19 +0000 2013
etc. What I want to do is sample every 10 minutes and count how many users were interested in the content in each time frame. My problem is that I have no clue how to process the timestamps I imported from Twitter. Should I use regular expressions, or is there a better approach? I would appreciate it if someone could provide some pointers. Thanks!
That's Twitter's created_at format, not ISO 8601, but it can still be converted to datetime with pd.to_datetime:
>>> df[:2]
timestamp
0 Sat Dec 14 05:13:28 +0000 2013
1 Sat Dec 14 05:21:12 +0000 2013
>>> df['timestamp'] = pd.to_datetime(df['timestamp'])
>>> df[:2]
timestamp
0 2013-12-14 05:13:28
1 2013-12-14 05:21:12
To resample, you can make the timestamp column the index and use resample:
>>> df.index = df['timestamp']
>>> df.resample('20Min').count()
2013-12-14 05:00:00 timestamp 1
2013-12-14 05:20:00 timestamp 5
2013-12-14 05:40:00 timestamp 8
2013-12-14 06:00:00 timestamp 1
dtype: int64
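The question asks for 10-minute windows specifically; here is a sketch using pd.Grouper on the parsed column, which avoids setting the index first:
# group the parsed timestamps into 10-minute bins and count rows per bin
counts = df.groupby(pd.Grouper(key='timestamp', freq='10Min')).size()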
