I'm trying to use df.loc from pandas to locate a partially variable value in my DataFrame.
In my DataFrame, dates and times are combined, for example:
Time            Col1  Col2
1-1-2019 01:00     5     7
1-1-2019 02:00     6     9
1-2-2019 01:00     8     3
If I use df.loc[df['Time'] == ['1-1-2019']] I want to locate the first two rows of this DataFrame, i.e. 1-1-2019 01:00 and 1-1-2019 02:00.
Instead it gives me an error: Lengths must match. To be fair, that is a logical error for pandas to raise, because I am only supplying the day, not the time.
Is there a way to make the search value partially variable, so that pandas matches both 1-1-2019 01:00 and 1-1-2019 02:00?
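For reference, here is a minimal reproducible frame matching the example above (an illustrative setup; the answers below all assume a DataFrame named df of this shape, with Time initially stored as strings):
import pandas as pd

df = pd.DataFrame({'Time': ['1-1-2019 01:00', '1-1-2019 02:00', '1-2-2019 01:00'],
                   'Col1': [5, 6, 8],
                   'Col2': [7, 9, 3]})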
One idea is to remove the times, or more precisely set them to midnight, with Series.dt.floor or Series.dt.normalize:
df['Time'] = pd.to_datetime(df['Time'])
df1 = df.loc[df['Time'].dt.floor('d') == '2019-01-01']
#alternative
#df1 = df.loc[df['Time'].dt.normalize() == '2019-01-01']
print (df1)
Time Col1 Col2
0 2019-01-01 01:00:00 5 7
1 2019-01-01 02:00:00 6 9
Or compare dates by Series.dt.date:
from datetime import date
df['Time'] = pd.to_datetime(df['Time'])
df1 = df.loc[df['Time'].dt.date == date(2019,1,1)]
#in some pandas versions, comparing directly against the string also works
#df1 = df.loc[df['Time'].dt.date == '2019-01-01']
Or convert to YYYY-MM-DD strings with Series.dt.strftime and compare:
df['Time'] = pd.to_datetime(df['Time'])
df1 = df.loc[df['Time'].dt.strftime('%Y-%m-%d') == '2019-01-01']
The following will subset the df on your condition; note that .str.contains works here because Time is still stored as plain strings (before any pd.to_datetime conversion):
df[df.Time.str.contains('1-1-2019')]
             Time  Col1  Col2
0  1-1-2019 01:00     5     7
1  1-1-2019 02:00     6     9
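If Time has already been parsed with pd.to_datetime, the .str accessor no longer applies. One further option pandas supports natively is partial string indexing on a DatetimeIndex: with Time in the index, .loc with a bare date string selects every row on that day. A minimal sketch, reusing the frame built above:
# parse the Time column and move it into the index
df['Time'] = pd.to_datetime(df['Time'])
df1 = df.set_index('Time').loc['2019-01-01']
print (df1)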
Related
I have a dataframe with columns 'start_date' and 'end_date' for about 2,000 records. I want to create a new dataframe that enumerates every date between each start and end date (inclusive) and then counts, for each date, how many records cover it, as follows:
start and end dataframe

ID  start_date  end_date
 1  01/01/2021  03/01/2021
 2  02/01/2021  04/01/2021
 3  01/01/2021  04/01/2021
date count dataframe

date        count
01/01/2021  2
02/01/2021  3
03/01/2021  3
04/01/2021  2
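For reference, the start/end frame above can be rebuilt with the snippet below (an illustrative setup; the answers that follow assume it is named df and that the dates are day-first strings):
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3],
                   'start_date': ['01/01/2021', '02/01/2021', '01/01/2021'],
                   'end_date':   ['03/01/2021', '04/01/2021', '04/01/2021']})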
For a medium/large DataFrame it is better for performance to avoid explode with date_range; instead, repeat the rows and add timedeltas:
df["start_date"] = pd.to_datetime(df["start_date"], dayfirst=True)
df["end_date"] = pd.to_datetime(df["end_date"], dayfirst=True)
#subtract values and convert to days
s = df["end_date"].sub(df["start_date"]).dt.days + 1
#repeat index
df = df.loc[df.index.repeat(s)].copy()
#add days by timedeltas
add = pd.to_timedelta(df.groupby(level=0).cumcount(), unit='d')
df1 = (df["start_date"].add(add)
.value_counts()
.sort_index()
.rename_axis('date')
.reset_index(name='count'))
print (df1)
date count
0 2021-01-01 2
1 2021-01-02 3
2 2021-01-03 3
3 2021-01-04 2
You can use pd.date_range to get the individual dates between start_date and end_date for each row, then explode it, and finally call value_counts:
>>> out = df.apply(lambda x: pd.date_range(x['start_date'], x['end_date']),
...                axis=1).explode().value_counts()
If needed, call to_frame() with the column name for the counts, then reset the index and rename the former index column to date:
>>> out.to_frame('count').reset_index().rename(columns={'index':'date'})
OUTPUT:
date count
0 2021-01-03 3
1 2021-01-02 3
2 2021-01-04 2
3 2021-01-01 2
Don't forget to convert the start_date and end_date columns to datetime type if they are not already:
>>> df['start_date'] = pd.to_datetime(df['start_date'], dayfirst=True)
>>> df['end_date'] = pd.to_datetime(df['end_date'], dayfirst=True)
You can use:
# Convert dates in dd/mm/yyyy to datetime format
df['start_date'] = pd.to_datetime(df['start_date'], dayfirst=True)
df['end_date'] = pd.to_datetime(df['end_date'], dayfirst=True)
# Create date ranges for each row
date_rng = df.apply(lambda x: pd.date_range(x['start_date'], x['end_date']), axis=1)
# explode the date ranges into separate rows and count how many rows cover each date
df_out = df.assign(date=date_rng).explode('date').groupby('date')['date'].count().reset_index(name='count')
Result:
print(df_out)
date count
0 2021-01-01 2
1 2021-01-02 3
2 2021-01-03 3
3 2021-01-04 2
I have a dataframe like the one shown below:
df = pd.DataFrame({
    'subject_id':[1,1,1,1,1,1,1,2,2,2,2,2],
    'time_1' :['2173-04-03 12:35:00','2173-04-03 12:50:00','2173-04-05 12:59:00',
               '2173-05-04 13:14:00','2173-05-05 13:37:00','2173-07-03 13:39:00',
               '2173-07-04 11:30:00','2173-04-04 16:00:00','2173-04-09 22:00:00',
               '2173-04-11 04:00:00','2173-04-13 04:30:00','2173-04-14 08:00:00'],
    'val' :[5,5,5,5,1,6,5,5,8,3,4,6]})
df['time_1'] = pd.to_datetime(df['time_1'])
df['day'] = df['time_1'].dt.day
df['month'] = df['time_1'].dt.month
What I would like to do is drop records for subjects who don't have 4 or more unique days.
In my sample dataframe, subject_id = 1 has only 3 unique day values (3, 4 and 5), so I would like to drop subject_id = 1 completely. But subject_id = 2 has 4 or more unique day values (4, 9, 11, 13, 14). Please note that the date values carry timestamps, hence I extract the day from each datetime field and check for unique values.
This is what I tried:
df.groupby(['subject_id','day']).transform('size')>4 # doesn't work
df[df.groupby(['subject_id','day'])['subject_id'].transform('size')>=4] # doesn't produce expected output
I expect my output to keep only the rows for subject_id = 2.
Change your function from size to DataFrameGroupBy.nunique, grouping only by the subject_id column:
df = df[df.groupby('subject_id')['day'].transform('nunique')>=4]
Or alternatively you can use GroupBy.filter, but this should be slower on a larger dataframe or with many unique groups:
df = df.groupby('subject_id').filter(lambda x: x['day'].nunique()>=4)
print (df)
subject_id time_1 val day month
7 2 2173-04-04 16:00:00 5 4 4
8 2 2173-04-09 22:00:00 8 9 4
9 2 2173-04-11 04:00:00 3 11 4
10 2 2173-04-13 04:30:00 4 13 4
11 2 2173-04-14 08:00:00 6 14 4
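One caveat, added here as a possible variation rather than as part of the original answer: day extracted with dt.day is the day of the month, so the 3rd of April and the 3rd of July count as a single value. If distinct calendar dates are what should be counted instead, a minimal sketch (using a hypothetical helper column date_only) is:
# applied to the original df from the question, before the filtering above
df['date_only'] = df['time_1'].dt.normalize()
df2 = df[df.groupby('subject_id')['date_only'].transform('nunique') >= 4]
Note that on the sample data this variant keeps subject_id = 1 as well, since it spans six distinct calendar dates.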
I want to write a transformation function accessing two columns from a DataFrame and pass it to transform().
Here is the DataFrame which I would like to modify:
print(df)
date increment
0 2012-06-01 0
1 2003-04-08 1
2 2009-04-22 3
3 2018-05-24 6
4 2006-09-25 2
5 2012-11-02 4
I would like to increment the year in column date by the number of years given in column increment. The proposed code (which does not work) is:
df.transform(lambda df: date(df.date.year + df.increment, 1, 1))
Is there a way to access individual columns in the function (here a lambda function) passed to transform()?
You can use pandas.to_timedelta:
# If necessary, convert the date column to datetime type first
# df['date'] = pd.to_datetime(df['date'])
# Note: unit='Y' is an average year (365.2425 days), hence the fractional times in the output below
df['date'] = df['date'] + pd.to_timedelta(df['increment'], unit='Y')
[out]
date increment
0 2012-06-01 00:00:00 0
1 2004-04-07 05:49:12 1
2 2012-04-21 17:27:36 3
3 2024-05-23 10:55:12 6
4 2008-09-24 11:38:24 2
5 2016-11-01 23:16:48 4
or alternatively:
df['date'] = pd.to_datetime({'year': df.date.dt.year.add(df.increment),
                             'month': df.date.dt.month,
                             'day': df.date.dt.day})
[out]
date increment
0 2012-06-01 0
1 2004-04-08 1
2 2012-04-22 3
3 2024-05-24 6
4 2008-09-25 2
5 2016-11-02 4
Your own solution could also be fixed by instead using the apply method and passing the axis=1 argument:
from datetime import date
df.apply(lambda row: date(row.date.year + row.increment, 1, 1), axis=1)
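If the intent is to add whole calendar years while keeping the original month and day (pd.to_timedelta with unit='Y' uses an average year length, hence the fractional times above), another option is a per-row pd.DateOffset. A minimal sketch, rebuilding the example frame from the question:
import pandas as pd

df = pd.DataFrame({'date': pd.to_datetime(['2012-06-01', '2003-04-08', '2009-04-22',
                                           '2018-05-24', '2006-09-25', '2012-11-02']),
                   'increment': [0, 1, 3, 6, 2, 4]})

# add `increment` whole years to each date; month and day are preserved
df['date'] = df.apply(lambda row: row['date'] + pd.DateOffset(years=row['increment']), axis=1)
print (df)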
I'm trying to split one date list by using another. So:
d = {'date':['1/15/2015','2/15/2015'], 'num':[1,2]}
s = {'split':['2/1/2015']}
df = pd.DataFrame(d)
sf = pd.DataFrame(s)
df['date'] = pd.to_datetime(df['date'])
sf['split'] = pd.to_datetime(sf['split'])
df['date'].split_by(sf['split'])
would yield:
date num
0 2015-01-15 1.0
1 2015-02-01 NaN
2 2015-02-15 2.0
...but of course, it doesn't. I'm sure there's a simple merge or join I'm missing here, but I can't figure it out. Thanks.
Also, if the 'split' list has multiple dates, some of which fall outside the range of the 'date' list, I don't want them included. So basically, the extents of the new range would be the same as the old.
(side note: if there's a better way to convert a dictionary to a DataFrame and immediately convert the date strings to datetimes, that would be icing on the cake)
I think you need boolean indexing to first filter sf by the min and max of column date in df, and then concat with sort_values; to align the frames you need to rename the split column:
d = {'date':['1/15/2015','2/15/2015'], 'num':[1,2]}
s = {'split':['2/1/2015', '2/1/2016', '2/1/2014']}
df = pd.DataFrame(d)
sf = pd.DataFrame(s)
df['date'] = pd.to_datetime(df['date'])
sf['split'] = pd.to_datetime(sf['split'])
print (df)
date num
0 2015-01-15 1
1 2015-02-15 2
print (sf)
split
0 2015-02-01
1 2016-02-01
2 2014-02-01
mask = (sf.split <= df.date.max()) & (sf.split >= df.date.min())
print (mask)
0 True
1 False
2 False
Name: split, dtype: bool
sf = sf[mask]
print (sf)
split
0 2015-02-01
df = pd.concat([df, sf.rename(columns={'split':'date'})]).sort_values('date')
print (df)
date num
0 2015-01-15 1.0
0 2015-02-01 NaN
1 2015-02-15 2.0
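If a fresh 0..n-1 index is wanted in the combined frame (the concat above keeps the original row labels, hence the repeated 0), one small addition is reset_index:
df = (pd.concat([df, sf.rename(columns={'split':'date'})])
        .sort_values('date')
        .reset_index(drop=True))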
I need to subtract all elements in one column of pandas dataframe by its first value.
In this code, pandas complains about self.inferred_type, which I guess is due to the circular reference.
df.Time = df.Time - df.Time[0]
And with this code, pandas complains about setting a value on a copy.
df.Time = df.Time - df.iat[0,0]
What is the correct way to do this computation in Pandas?
I think you can select the first item in column Time with iloc:
df.Time = df.Time - df.Time.iloc[0]
Sample:
start = pd.to_datetime('2015-02-24 10:00')
rng = pd.date_range(start, periods=5)
df = pd.DataFrame({'Time': rng, 'a': range(5)})
print (df)
Time a
0 2015-02-24 10:00:00 0
1 2015-02-25 10:00:00 1
2 2015-02-26 10:00:00 2
3 2015-02-27 10:00:00 3
4 2015-02-28 10:00:00 4
df.Time = df.Time - df.Time.iloc[0]
print (df)
Time a
0 0 days 0
1 1 days 1
2 2 days 2
3 3 days 3
4 4 days 4
Note: both of your original approaches also work fine for me.
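If the setting-on-a-copy warning does appear, it usually means df is itself a slice of another DataFrame rather than a problem with the subtraction itself. A minimal sketch of the usual workarounds (assuming df came from filtering some larger frame):
# either take an explicit copy of the slice before modifying it...
df = df.copy()
df.Time = df.Time - df.Time.iloc[0]

# ...or assign through .loc in a single step
# df.loc[:, 'Time'] = df['Time'] - df['Time'].iloc[0]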