I'm trying to use df.loc from pandas to locate a partially variable value in my DataFrame.
In my DataFrame, dates and times are combined, for example:
Time            Col1  Col2
1-1-2019 01:00     5     7
1-1-2019 02:00     6     9
1-2-2019 01:00     8     3
If I use df.loc[df['Time'] == ['1-1-2019']] I want to locate the first two rows of this DataFrame, i.e. 1-1-2019 01:00 and 1-1-2019 02:00.
Instead it gives me an error: Lengths must match. To be fair, that is a logical error for pandas to raise, because I am only supplying the day, not the time.
Is there a way to make the search value partially variable, so that pandas matches both 1-1-2019 01:00 and 1-1-2019 02:00?
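For reference, here is a minimal reproducible frame matching the example above (an illustrative setup; the answers below all assume a DataFrame named df of this shape, with Time initially stored as strings):
import pandas as pd

df = pd.DataFrame({'Time': ['1-1-2019 01:00', '1-1-2019 02:00', '1-2-2019 01:00'],
                   'Col1': [5, 6, 8],
                   'Col2': [7, 9, 3]})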
One idea is to remove the times, or more precisely set them to midnight, with Series.dt.floor or Series.dt.normalize:
df['Time'] = pd.to_datetime(df['Time'])
df1 = df.loc[df['Time'].dt.floor('d') == '2019-01-01']
#alternative
#df1 = df.loc[df['Time'].dt.normalize() == '2019-01-01']
print (df1)
Time Col1 Col2
0 2019-01-01 01:00:00 5 7
1 2019-01-01 02:00:00 6 9
Or compare dates by Series.dt.date:
from datetime import date
df['Time'] = pd.to_datetime(df['Time'])
df1 = df.loc[df['Time'].dt.date == date(2019,1,1)]
#in some pandas versions, comparing directly against the string also works
#df1 = df.loc[df['Time'].dt.date == '2019-01-01']
Or convert to YYYY-MM-DD strings with Series.dt.strftime and compare:
df['Time'] = pd.to_datetime(df['Time'])
df1 = df.loc[df['Time'].dt.strftime('%Y-%m-%d') == '2019-01-01']
The following will subset the df on your condition; note that .str.contains works here because Time is still stored as plain strings (before any pd.to_datetime conversion):
df[df.Time.str.contains('1-1-2019')]
             Time  Col1  Col2
0  1-1-2019 01:00     5     7
1  1-1-2019 02:00     6     9
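If Time has already been parsed with pd.to_datetime, the .str accessor no longer applies. One further option pandas supports natively is partial string indexing on a DatetimeIndex: with Time in the index, .loc with a bare date string selects every row on that day. A minimal sketch, reusing the frame built above:
# parse the Time column and move it into the index
df['Time'] = pd.to_datetime(df['Time'])
df1 = df.set_index('Time').loc['2019-01-01']
print (df1)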
Related
I have a dataframe with columns 'start_date' and 'end_date' for about 2,000 records. I want to create a new dataframe that enumerates every date between each start and end date (inclusive) and then counts, for each date, how many records cover it, as follows:
start and end dataframe

ID  start_date  end_date
 1  01/01/2021  03/01/2021
 2  02/01/2021  04/01/2021
 3  01/01/2021  04/01/2021
date count dataframe

date        count
01/01/2021  2
02/01/2021  3
03/01/2021  3
04/01/2021  2
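For reference, the start/end frame above can be rebuilt with the snippet below (an illustrative setup; the answers that follow assume it is named df and that the dates are day-first strings):
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3],
                   'start_date': ['01/01/2021', '02/01/2021', '01/01/2021'],
                   'end_date':   ['03/01/2021', '04/01/2021', '04/01/2021']})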
For a medium/large DataFrame it is better for performance to avoid explode with date_range; instead, repeat the rows and add timedeltas:
df["start_date"] = pd.to_datetime(df["start_date"], dayfirst=True)
df["end_date"] = pd.to_datetime(df["end_date"], dayfirst=True)
#subtract values and convert to days
s = df["end_date"].sub(df["start_date"]).dt.days + 1
#repeat index
df = df.loc[df.index.repeat(s)].copy()
#add days by timedeltas
add = pd.to_timedelta(df.groupby(level=0).cumcount(), unit='d')
df1 = (df["start_date"].add(add)
.value_counts()
.sort_index()
.rename_axis('date')
.reset_index(name='count'))
print (df1)
date count
0 2021-01-01 2
1 2021-01-02 3
2 2021-01-03 3
3 2021-01-04 2
You can use pd.date_range to get the individual dates between start_date and end_date for each row, then explode it, and finally call value_counts:
>>> out = df.apply(lambda x: pd.date_range(x['start_date'], x['end_date']),
...                axis=1).explode().value_counts()
If needed, call to_frame() with the column name for the counts, then reset the index and rename the former index column to date:
>>> out.to_frame('count').reset_index().rename(columns={'index':'date'})
OUTPUT:
date count
0 2021-01-03 3
1 2021-01-02 3
2 2021-01-04 2
3 2021-01-01 2
Don't forget to convert the start_date and end_date columns to datetime type if they are not already:
>>> df['start_date'] = pd.to_datetime(df['start_date'], dayfirst=True)
>>> df['end_date'] = pd.to_datetime(df['end_date'], dayfirst=True)
You can use:
# Convert dates in dd/mm/yyyy to datetime format
df['start_date'] = pd.to_datetime(df['start_date'], dayfirst=True)
df['end_date'] = pd.to_datetime(df['end_date'], dayfirst=True)
# Create date ranges for each row
date_rng = df.apply(lambda x: pd.date_range(x['start_date'], x['end_date']), axis=1)
# explode the date ranges into separate rows and count how many rows cover each date
df_out = df.assign(date=date_rng).explode('date').groupby('date')['date'].count().reset_index(name='count')
Result:
print(df_out)
date count
0 2021-01-01 2
1 2021-01-02 3
2 2021-01-03 3
3 2021-01-04 2
I have a dataframe like the one shown below:
df = pd.DataFrame({
    'subject_id':[1,1,1,1,1,1,1,2,2,2,2,2],
    'time_1' :['2173-04-03 12:35:00','2173-04-03 12:50:00','2173-04-05 12:59:00',
               '2173-05-04 13:14:00','2173-05-05 13:37:00','2173-07-03 13:39:00',
               '2173-07-04 11:30:00','2173-04-04 16:00:00','2173-04-09 22:00:00',
               '2173-04-11 04:00:00','2173-04-13 04:30:00','2173-04-14 08:00:00'],
    'val' :[5,5,5,5,1,6,5,5,8,3,4,6]})
df['time_1'] = pd.to_datetime(df['time_1'])
df['day'] = df['time_1'].dt.day
df['month'] = df['time_1'].dt.month
What I would like to do is drop records for subjects who don't have 4 or more unique days.
In my sample dataframe, subject_id = 1 has only 3 unique day values (3, 4 and 5), so I would like to drop subject_id = 1 completely. But subject_id = 2 has 4 or more unique day values (4, 9, 11, 13, 14). Please note that the date values carry timestamps, hence I extract the day from each datetime field and check for unique values.
This is what I tried:
df.groupby(['subject_id','day']).transform('size')>4 # doesn't work
df[df.groupby(['subject_id','day'])['subject_id'].transform('size')>=4] # doesn't produce expected output
I expect my output to keep only the rows for subject_id = 2.
Change your function from size to DataFrameGroupBy.nunique, grouping only by the subject_id column:
df = df[df.groupby('subject_id')['day'].transform('nunique')>=4]
Or alternatively you can use GroupBy.filter, but this should be slower on a larger dataframe or with many unique groups:
df = df.groupby('subject_id').filter(lambda x: x['day'].nunique()>=4)
print (df)
subject_id time_1 val day month
7 2 2173-04-04 16:00:00 5 4 4
8 2 2173-04-09 22:00:00 8 9 4
9 2 2173-04-11 04:00:00 3 11 4
10 2 2173-04-13 04:30:00 4 13 4
11 2 2173-04-14 08:00:00 6 14 4
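One caveat, added here as a possible variation rather than as part of the original answer: day extracted with dt.day is the day of the month, so the 3rd of April and the 3rd of July count as a single value. If distinct calendar dates are what should be counted instead, a minimal sketch (using a hypothetical helper column date_only) is:
# applied to the original df from the question, before the filtering above
df['date_only'] = df['time_1'].dt.normalize()
df2 = df[df.groupby('subject_id')['date_only'].transform('nunique') >= 4]
Note that on the sample data this variant keeps subject_id = 1 as well, since it spans six distinct calendar dates.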
I want to write a transformation function accessing two columns from a DataFrame and pass it to transform().
Here is the DataFrame which I would like to modify:
print(df)
date increment
0 2012-06-01 0
1 2003-04-08 1
2 2009-04-22 3
3 2018-05-24 6
4 2006-09-25 2
5 2012-11-02 4
I would like to increment the year in column date by the number of years given in column increment. The proposed code (which does not work) is:
df.transform(lambda df: date(df.date.year + df.increment, 1, 1))
Is there a way to access individual columns in the function (here a lambda function) passed to transform()?
You can use pandas.to_timedelta:
# If necessary, convert the date column to datetime type first
# df['date'] = pd.to_datetime(df['date'])
# Note: unit='Y' is an average year (365.2425 days), hence the fractional times in the output below
df['date'] = df['date'] + pd.to_timedelta(df['increment'], unit='Y')
[out]
date increment
0 2012-06-01 00:00:00 0
1 2004-04-07 05:49:12 1
2 2012-04-21 17:27:36 3
3 2024-05-23 10:55:12 6
4 2008-09-24 11:38:24 2
5 2016-11-01 23:16:48 4
or alternatively:
df['date'] = pd.to_datetime({'year': df.date.dt.year.add(df.increment),
                             'month': df.date.dt.month,
                             'day': df.date.dt.day})
[out]
date increment
0 2012-06-01 0
1 2004-04-08 1
2 2012-04-22 3
3 2024-05-24 6
4 2008-09-25 2
5 2016-11-02 4
Your own solution could also be fixed by instead using the apply method and passing the axis=1 argument:
from datetime import date
df.apply(lambda row: date(row.date.year + row.increment, 1, 1), axis=1)
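If the intent is to add whole calendar years while keeping the original month and day (pd.to_timedelta with unit='Y' uses an average year length, hence the fractional times above), another option is a per-row pd.DateOffset. A minimal sketch, rebuilding the example frame from the question:
import pandas as pd

df = pd.DataFrame({'date': pd.to_datetime(['2012-06-01', '2003-04-08', '2009-04-22',
                                           '2018-05-24', '2006-09-25', '2012-11-02']),
                   'increment': [0, 1, 3, 6, 2, 4]})

# add `increment` whole years to each date; month and day are preserved
df['date'] = df.apply(lambda row: row['date'] + pd.DateOffset(years=row['increment']), axis=1)
print (df)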
I'm trying to split one date list by using another. So:
d = {'date':['1/15/2015','2/15/2015'], 'num':[1,2]}
s = {'split':['2/1/2015']}
df = pd.DataFrame(d)
sf = pd.DataFrame(s)
df['date'] = pd.to_datetime(df['date'])
sf['split'] = pd.to_datetime(sf['split'])
df['date'].split_by(sf['split'])
would yield:
date num
0 2015-01-15 1.0
1 2015-02-01 NaN
2 2015-02-15 2.0
...but of course, it doesn't. I'm sure there's a simple merge or join I'm missing here, but I can't figure it out. Thanks.
Also, if the 'split' list has multiple dates, some of which fall outside the range of the 'date' list, I don't want them included. So basically, the extents of the new range would be the same as the old.
(side note: if there's a better way to convert a dictionary to a DataFrame and immediately convert the date strings to datetimes, that would be icing on the cake)
I think you need boolean indexing to first filter sf by the min and max of column date in df, and then concat with sort_values; to align the frames you need to rename the split column:
d = {'date':['1/15/2015','2/15/2015'], 'num':[1,2]}
s = {'split':['2/1/2015', '2/1/2016', '2/1/2014']}
df = pd.DataFrame(d)
sf = pd.DataFrame(s)
df['date'] = pd.to_datetime(df['date'])
sf['split'] = pd.to_datetime(sf['split'])
print (df)
date num
0 2015-01-15 1
1 2015-02-15 2
print (sf)
split
0 2015-02-01
1 2016-02-01
2 2014-02-01
mask = (sf.split <= df.date.max()) & (sf.split >= df.date.min())
print (mask)
0 True
1 False
2 False
Name: split, dtype: bool
sf = sf[mask]
print (sf)
split
0 2015-02-01
df = pd.concat([df, sf.rename(columns={'split':'date'})]).sort_values('date')
print (df)
date num
0 2015-01-15 1.0
0 2015-02-01 NaN
1 2015-02-15 2.0
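If a fresh 0..n-1 index is wanted in the combined frame (the concat above keeps the original row labels, hence the repeated 0), one small addition is reset_index:
df = (pd.concat([df, sf.rename(columns={'split':'date'})])
        .sort_values('date')
        .reset_index(drop=True))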
I need to subtract all elements in one column of pandas dataframe by its first value.
In this code, pandas complains about self.inferred_type, which I guess is due to the circular reference.
df.Time = df.Time - df.Time[0]
And with this code, pandas complains about setting a value on a copy.
df.Time = df.Time - df.iat[0,0]
What is the correct way to do this computation in Pandas?
I think you can select the first item in column Time with iloc:
df.Time = df.Time - df.Time.iloc[0]
Sample:
start = pd.to_datetime('2015-02-24 10:00')
rng = pd.date_range(start, periods=5)
df = pd.DataFrame({'Time': rng, 'a': range(5)})
print (df)
Time a
0 2015-02-24 10:00:00 0
1 2015-02-25 10:00:00 1
2 2015-02-26 10:00:00 2
3 2015-02-27 10:00:00 3
4 2015-02-28 10:00:00 4
df.Time = df.Time - df.Time.iloc[0]
print (df)
Time a
0 0 days 0
1 1 days 1
2 2 days 2
3 3 days 3
4 4 days 4
Note: both of your original approaches also work fine for me.
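If the setting-on-a-copy warning does appear, it usually means df is itself a slice of another DataFrame rather than a problem with the subtraction itself. A minimal sketch of the usual workarounds (assuming df came from filtering some larger frame):
# either take an explicit copy of the slice before modifying it...
df = df.copy()
df.Time = df.Time - df.Time.iloc[0]

# ...or assign through .loc in a single step
# df.loc[:, 'Time'] = df['Time'] - df['Time'].iloc[0]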