I have a dataframe with a "Timestamp" column like this:
df["Timestamp"]
0 1.341709
1 1.343688
2 1.344503
3 1.344593
4 1.344700
...
1263453 413.056745
1263454 413.056836
1263455 413.056945
1263456 413.057046
1263457 413.057153
Name: Timestamp, Length: 1263458, dtype: float64
Now I have two variables that define the start and end of an interval, like so:
start = 10
end = 15
To select all rows in the DataFrame where the Timestamp lies between "start" and "end", I use a query approach:
df_want = df.query("#start <= Timestamp < #end")
This gives me a TypeError though:
TypeError: '<=' not supported between instances of 'int' and 'type'
Why does this not work? Shouldn't Timestamp be of type 'float64'? Why is it just 'type'?
You need to do the following:
df_want = df[df['Timestamp'].between(start, end)]
With the variables:
df[(df['Timestamp'] >= start) & (df['Timestamp'] <= end)]
With the values hardcoded:
df[(df['Timestamp'] >= 10) & (df['Timestamp'] <= 15)]
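For completeness: pandas query() references local variables with the @ prefix, not #, so the query approach from the question should also work once the prefix is fixed:
df_want = df.query("@start <= Timestamp < @end")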
I currently have a dataframe of customers, contracts, and contract dates like this example:
Cust Contract Start End
A 123 10/1/2021 11/3/2021
B 987 7/4/2022 8/12/2022
For each row, I want to generate a variable that tells me if it was active during a set range, e.g. 10/1/2021-12/31/2021.
When I import from the Excel file, the 'Start' and 'End' columns come in as datetime64[ns].
Code I have tried so far is this:
df.loc[df['Start'].dt.strftime('%Y-%m-%d')<='2021-10-31' & df['End'].dt.strftime('%Y-%m-%d')<='2021-10-1', 'Active Flag'] = 'Yes'
When I run this, I get the following error:
Cannot perform 'rand_' with a dtyped [object] array and scalar of type [bool]
I'm not really sure if I am even on the correct track for solving this, or if there is an easier way. Any help would be appreciated as Python's date time operations are very odd to me.
We can use simple date comparisons to solve this.
There could be two scenarios for these contracts:
Where the contract is ONLY active within the range
start_range = pd.to_datetime('10/1/2021')
end_range = pd.to_datetime('12/31/2021')
#                            --- Start Date ---         --- End Date ---
df['Active_Only_in_Range'] = (df.Start >= start_range) & (df.End <= end_range)
df['Active_Only_in_Range_Flag'] = np.where(df.Active_Only_in_Range, 'Yes', 'No')
Where the contract was PARTIALLY active within the range (i.e. it started or ended somewhere inside the provided range)
start_range = pd.to_datetime('10/1/2021')
end_range = pd.to_datetime('12/31/2021')
# ----------- Start Date ------------ ----------- End Date ------------
df['Partial_Active'] = ((df.Start >= start_range) & (df.Start <= end_range)) | ((df.End >= start_range) & (df.End <= end_range))
df['Partial_Active_Flag'] = np.where(df.Partial_Active,'Yes','No')
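Note that neither flag catches a contract that starts before the range and ends after it. If such a contract should also count as active, the usual interval-overlap test covers every case (a sketch using the same column names as above):
# Overlaps the range at all: starts before the range ends AND ends after the range starts
df['Overlaps_Range'] = (df.Start <= end_range) & (df.End >= start_range)
df['Overlaps_Range_Flag'] = np.where(df.Overlaps_Range, 'Yes', 'No')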
Depending on your problem, you can pick whichever one suits you.
Here is one approach:
# Convert to `datetime` first
df.Start = pd.to_datetime(df.Start, format="%m/%d/%Y")
df.End = pd.to_datetime(df.End, format="%m/%d/%Y")
target = pd.date_range('2021-10-1', '2021-12-31')
df['is_active'] = df.Start.isin(target) | df.End.isin(target)
Output:
Cust Contract Start End is_active
0 A 123 2021-10-01 2021-11-03 True
1 B 987 2022-07-04 2022-08-12 False
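If you'd rather not materialize the full daily range, comparing the endpoints directly should be equivalent here (a sketch, assuming the Start/End values carry no time-of-day component):
lo, hi = pd.Timestamp('2021-10-01'), pd.Timestamp('2021-12-31')
df['is_active'] = df.Start.between(lo, hi) | df.End.between(lo, hi)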
I have a column of dates (Dtype = datetime64[ns]) and a column of integers (Dtype = Int16) in the same dataframe (df).
The integers column all have values between 20 and 100.
I want to create a new date column that is the Date column minus the integer column,
i.e.
df:
RepDate | Age | RevisedDate (To be calc)
1/1/19 | 20 | 1/1/99
4/2/22 | 50 | 4/2/72
So far I've tried:
df['RevisedDate'] = df['RepDate'] - pd.DateOffset(years= df['Age'])
but I get a TypeError: cannot convert the series to <class 'int'>
I've also tried deconstructing and putting it back together:
df['YofRevDate'] = pd.DatetimeIndex(df['RepDate']).year - df['Age']
df['MofRevDate'] = pd.DatetimeIndex(df['RepDate']).month
df['DofRevDate'] = pd.DatetimeIndex(df['RepDate']).day
df['RevisedDate'] = date(df['YofRevDate'], df['MofRevDate'], df['DofRevDate'])
but I get the same error on the last line.
Your help will be much appreciated.
To achieve that you have two options:
convert Age to something like a timedelta in years (not supported in recent pandas versions) and add the timedelta to the date column
parse the date column, do the age arithmetic, and build a new date column
I'm not sure how fast my solution (the second option) is:
new_year = (df.RepDate.dt.year - df.Age).astype(str)  # subtract Age, per the question
month_day = df.RepDate.dt.month.astype(str) + "-" + df.RepDate.dt.day.astype(str)
df['NewRepDate'] = pd.to_datetime(new_year + "-" + month_day,format="%Y-%m-%d")
df
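A shorter route, at the cost of a row-wise apply, is to subtract a pd.DateOffset per row (a sketch; dateutil's relativedelta arithmetic, which DateOffset uses, also clamps Feb 29 to Feb 28 when the target year isn't a leap year):
# Row-wise subtraction of a year offset; int() guards against the Int16 dtype
df['RevisedDate'] = df.apply(lambda r: r['RepDate'] - pd.DateOffset(years=int(r['Age'])), axis=1)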
I've got a dataframe like below, where column c01 represents the start time and c04 the end of each time interval:
c01 c04
1742 8.444991 14.022029
3786 29.91143 31.422439
3951 29.91143 31.145099
5402 37.81136 42.689595
8230 63.12394 65.34602
also a list like this (it's actually way longer):
8.522494
8.54471
8.578426
8.611193
8.644996
8.678053
8.710918
8.744901
8.777851
8.811053
8.844867
8.878389
8.912099
8.944729
8.977601
9.011232
9.04492
9.078157
9.111946
9.144788
9.177663
9.211054
9.245265
9.27805
9.311766
9.344647
9.377612
9.411709
I'd like to count how many elements in the list fall in the intervals shown by the dataframe, which I coded like this:
count = 0
for index, row in speech.iterrows():
    count += gtls.count(lambda i : i in [row['c01'], row['c04']])
The file runs as a whole, but every 'count' turns out to be 0. Would you please tell me where I messed up?
I took the liberty of converting your list into a NumPy array (I called it arr). Then you can use the apply function to create your count column. Let's assume your dataframe is called df.
import numpy as np

def get_count(row):  # the logic for your summation is here
    return np.sum((row['c01'] < arr) & (arr <= row['c04']))

df['C_sum'] = df.apply(get_count, axis=1)
print(df)
Output:
c01 c04 C_sum
0 8.444991 14.022029 28
1 29.911430 31.422439 0
2 29.911430 31.145099 0
3 37.811360 42.689595 0
4 63.123940 65.346020 0
You can also do the whole thing in one line using lambda:
df['C_sum'] = df.apply(lambda row: np.sum((row['c01'] < arr) & (arr <= row['c04'])), axis=1)
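If the list is long, a fully vectorized alternative is to broadcast the interval bounds against the array once instead of applying row by row (a sketch, assuming arr is a 1-D NumPy array):
# Boolean matrix of shape (n_rows, len(arr)): is element j inside interval i?
inside = (df['c01'].to_numpy()[:, None] < arr) & (arr <= df['c04'].to_numpy()[:, None])
df['C_sum'] = inside.sum(axis=1)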
Welcome to Stack Overflow! The i in [row['c01'], row['c04']] doesn't do what you seem to think: it checks whether element i equals one of the two items in that list, not whether it lies between row['c01'] and row['c04']. For checking whether a floating point number is within a range, use row['c01'] < i < row['c04'].
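Putting that together, a corrected version of your loop could look like this (a sketch, assuming gtls is a plain Python list of floats):
count = 0
for index, row in speech.iterrows():
    # count the list elements strictly inside this row's interval
    count += sum(1 for i in gtls if row['c01'] < i < row['c04'])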
I have a dataset which has a list of subjects, a start date, and an end date. I'm trying to write a loop so that for each subject I get a list of dates between the start date and end date. I've tried so many ways to do this based on previous posts but am still having issues.
an example of the dataframe:
Participant # Start_Date End_Date
1 23-04-19 25-04-19
An example of the output I want:
Participant # Range
1 23-04-19
1 24-04-19
1 25-04-19
Right now my code looks like this:
subjs_490 = tracksheet_490['Participant #']
for subj_490 in subjs_490:
    temp_a = tracksheet_490[tracksheet_490['Participant #'].isin([subj_490])]
    start = temp_a['Start_Date']
    end = temp_a['End_Date']
    start_dates = pd.to_datetime(pd.Series(start), format='%d-%m-%y')
    end_dates = pd.to_datetime(pd.Series(end), format='%d-%m-%y')
    date_range = pd.date_range(start_dates, end_dates).tolist()
With this method I'm getting the following error:
Cannot convert input [1 2016-05-03 Name: Start_Date, dtype: datetime64[ns]] of type <class 'pandas.core.series.Series'> to Timestamp
Expanding ranges tends to be a slow process. You can create the date_range and then explode it to get what you want. Moving 'Participant #' to the index makes sure it's repeated for all rows that are exploded.
df = (df.set_index('Participant #')
        .apply(lambda x: pd.date_range(x.Start_Date, x.End_Date), axis=1)  # :( slow
        .rename('Range')
        .explode()
        .reset_index())
Participant # Range
0 1 2019-04-23
1 1 2019-04-24
2 1 2019-04-25
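Note that this assumes Start_Date and End_Date are already datetime. If they come in as strings like 23-04-19, convert them first (a sketch using the day-month-year format from your example):
df['Start_Date'] = pd.to_datetime(df['Start_Date'], format='%d-%m-%y')
df['End_Date'] = pd.to_datetime(df['End_Date'], format='%d-%m-%y')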
If you can't use explode, another option is to create a separate DataFrame for each row and then concat them all together.
pd.concat([pd.DataFrame({'Participant #': par, 'Range': pd.date_range(start, end)})
           for par, start, end in zip(df['Participant #'], df['Start_Date'], df['End_Date'])],
          ignore_index=True)
I'm filtering a dataframe by hour and weekday:
if type == 'daily':
    hour = data.index.hour
    day = data.index.weekday
    selector = (hour != 17) | ((day != 5) & (day != 6))
    data = data[selector]
if type == 'weekly':
    day = data.index.weekday
    selector = (day != 5) & (day != 6)
    data = data[selector]
Then I'm using a for loop where I need to write some conditionals according to the weekday/hour, but row.index doesn't have any of that information. What can I do in this case?
I need to do something like this (it won't work as written, since row.index doesn't have weekday or hour info):
for index, row in data.iterrows():
    if type == 'weekly' & row.index.weekday == 1 & row.index.hour == 0 & row.index.min == 0 | \
       type == 'daily' & row.index.hour == 18 & row.index.min == 0:
Thanks in advance
I know this is not the most elegant way, but you could create your variables in columns:
df['Hour'] = df.index.hour
If you need a min or a max based on those variables, you could create another column and use rolling().min() or similar rolling formulas.
Once you have your columns, you can iterate as you please with the iteration you suggested.
There's info about the index properties in the pandas DatetimeIndex documentation.
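Alternatively, note that in for index, row in data.iterrows() the loop variable index already is the row's timestamp, so you can read its attributes directly instead of going through row.index (a sketch, assuming data has a DatetimeIndex; type shadows the builtin here only to match the question's code):
for index, row in data.iterrows():
    # index is a pd.Timestamp when the DataFrame has a DatetimeIndex
    if (type == 'weekly' and index.weekday() == 1 and index.hour == 0 and index.minute == 0) or \
       (type == 'daily' and index.hour == 18 and index.minute == 0):
        ...  # conditional logic for this row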