I have a dataset that contains a column with "Time" values, but it's showing as object, and I want to convert them to time so I can loop over the rows and check whether each time falls between two times.
for i in df['Time']:
    if i >= dt.time(21,0,0) and i <= dt.time(7,30,0) or i >= dt.time(3,0,0) and i <= dt.time(10,0,0) or i >= dt.time(10,30,0) and i <= dt.time(14,0,0):
        df['In/Out'] = 'In'
    else:
        df['In/Out'] = 'Out'
I want the code to set the value in a new column to "In" if the time is between two times.
The first times are (21:00) & (07:30), the second are (03:00) & (10:00), and the third are (10:30) & (14:00).
If the time is not in those ranges, it should set the value in the new column to "Out".
You can simplify the first two ranges:
(21:00) & (07:30) and (03:00) & (10:00)
to:
(21:00) & (10:00)
so the solution is to use Series.between with numpy.where:
import datetime as dt
import numpy as np
import pandas as pd

df = pd.DataFrame({'Time':['0:01:00','8:01:00','2021-08-13 10:19:10','12:01:00',
                           '14:01:00','18:01:01','23:01:00']})
df['Time'] = pd.to_datetime(df['Time']).dt.time

# The wrap-around range (21:00, 10:00) is split at midnight into two between checks.
m = (df['Time'].between(dt.time(21,0,0), dt.time(23,59,59)) |
     df['Time'].between(dt.time(0,0,0), dt.time(10,0,0)) |
     df['Time'].between(dt.time(10,30,0), dt.time(14,0,0)))
df['In/Out'] = np.where(m, 'In', 'Out')
print(df)
Time In/Out
0 00:01:00 In
1 08:01:00 In
2 10:19:10 Out
3 12:01:00 In
4 14:01:00 Out
5 18:01:01 Out
6 23:01:00 In
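For scalar checks like the loop in the question, a small helper makes the midnight wrap-around explicit. This is a minimal sketch (the in_range helper is my own, not from the answer):

import datetime as dt

def in_range(t, start, end):
    # A range that wraps past midnight (start > end) means:
    # after the evening start OR before the morning end.
    if start <= end:
        return start <= t <= end
    return t >= start or t <= end

# (21:00, 07:30) wraps midnight, which is why the answer splits it
# at midnight before merging it with (03:00, 10:00).
print(in_range(dt.time(23, 1), dt.time(21, 0), dt.time(7, 30)))   # True
print(in_range(dt.time(15, 0), dt.time(21, 0), dt.time(7, 30)))   # False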
I have a dataframe which contains sales information of products. What I need to do is create a function which, based on the product id, product type, and date, calculates the average sales for the time period before the given date.
This is how I have implemented it, but this approach takes a lot of time and I was wondering if there is a faster way to do this.
Dataframe:
import numpy as np
import pandas as pd

product_type = ['A','B']
df = pd.DataFrame({'prod_id': np.repeat(np.arange(start=2, stop=5, step=1), 235),
                   'prod_type': np.random.choice(np.array(product_type), 705),
                   'sales_time': pd.date_range(start='1-1-2018', end='3-30-2018', freq='3H'),
                   'sale_amt': np.random.randint(4, 100, size=705)})
Current code:
def cal_avg(product, ptype, pdate):
    temp_df = df[(df['prod_id']==product) & (df['prod_type']==ptype) & (df['sales_time']<=pdate)]
    return temp_df['sale_amt'].mean()
Calling the function:
cal_avg(2,'A','2018-02-12 15:00:00')
53.983
If you are running the cal_avg function "rarely" then I suggest ignoring my answer. Otherwise, it might be beneficial to simply calculate the expanding window average for each product/product type once. It might be slow depending on your dataset size (in which case maybe run it only for specific product types), but you'll only need to run it once. First sort by the column you want to perform the expanding on (expanding is missing the 'on' parameter) to ensure the proper row order. Then groupby and transform each group (to keep the indices of the original dataframe) with your expanding window aggregation of choice (in this case 'mean').
df = df.sort_values('sales_time')
df['exp_mean_sales'] = df.groupby(['prod_id', 'prod_type'])['sale_amt'].transform(lambda gr: gr.expanding().mean())
With the result being:
df.head()
prod_id prod_type sales_time sale_amt exp_mean_sales
0 2 B 2018-01-01 00:00:00 8 8.000000
1 2 B 2018-01-01 03:00:00 72 40.000000
2 2 B 2018-01-01 06:00:00 33 37.666667
3 2 A 2018-01-01 09:00:00 81 81.000000
4 2 B 2018-01-01 12:00:00 83 49.000000
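With the expanding means precomputed, the per-call lookup reduces to taking the last row at or before the given date. A rough sketch of the cal_avg equivalent (the function name is mine; it relies on df being sorted by sales_time as above):

def avg_from_precomputed(product, ptype, pdate):
    # Rows for this product/type no later than pdate (df is sorted by sales_time).
    sub = df[(df['prod_id'] == product) & (df['prod_type'] == ptype)
             & (df['sales_time'] <= pdate)]
    # The expanding mean on the last such row is the running average up to pdate.
    return sub['exp_mean_sales'].iloc[-1] if len(sub) else float('nan')

avg_from_precomputed(2, 'A', '2018-02-12 15:00:00')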
Check the code below, with a %%timeit comparison (Google Colab):
import numpy as np
import pandas as pd

product_type = ['A','B']
df = pd.DataFrame({'prod_id': np.repeat(np.arange(start=2, stop=5, step=1), 235),
                   'prod_type': np.random.choice(np.array(product_type), 705),
                   'sales_time': pd.date_range(start='1-1-2018', end='3-30-2018', freq='3H'),
                   'sale_amt': np.random.randint(4, 100, size=705)})
## OP's function
def cal_avg(product, ptype, pdate):
    temp_df = df[(df['prod_id']==product) & (df['prod_type']==ptype) & (df['sales_time']<=pdate)]
    return temp_df['sale_amt'].mean()
## Numpy data prep
prod_id_array = np.array(df.values[:,:1])
prod_type_array = np.array(df.values[:,1:2])
sales_time_array = np.array(df.values[:,2:3], dtype=np.datetime64)
values = np.array(df.values[:,3:])
OP's function:
%%timeit
cal_avg(2,'A','2018-02-12 15:00:00')
Output:
Numpy version
%%timeit -n 1000
cal_vals = [2,'A','2018-02-12 15:00:00']
# Note: np.logical_and only takes two input arrays (a third positional
# argument is treated as `out`), so chain the comparisons with & instead.
mask = ((prod_id_array == cal_vals[0])
        & (prod_type_array == cal_vals[1])
        & (sales_time_array <= np.datetime64(cal_vals[2])))
np.mean(values[mask])
Output:
I have a dataframe with three columns, let's say:
Name Address Date
faraz xyz 2022-01-01
Abdul abc 2022-06-06
Zara qrs 2021-02-25
I want to compare each date in the Date column with all the other dates in the Date column and only keep those rows which lie within 6 months of at least one of the other dates.
For example: (2022-01-01 to 2022-06-06) is about 5 months, so we keep both these dates,
but
(2022-06-06 to 2021-02-25) and (2022-01-01 to 2021-02-25) exceed the 6-month limit,
so we drop that row.
Desired Output:
Name Address Date
faraz xyz 2022-01-01
Abdul abc 2022-06-06
I have tried a couple of approaches, such as nested loops, but I have 1 million+ entries and it takes forever to run that loop. Some of the dates repeat too; not all are unique.
for index, row in dupes_df.iterrows():
    for date in uniq_dates_list:
        format_date = datetime.strptime(date, '%d/%m/%y')
        if ((format_date.year - row['JournalDate'].year) * 12 + (format_date.month - row['JournalDate'].month) <= 6):
            print("here here")
            break
    # for/else: the else runs only when no date within 6 months was found
    else:
        dupes_df.drop(index, inplace=True)
I need a much more optimal solution for it. I have read about lambda functions, but couldn't get to the depths of them.
IIUC, this should work for you:
import pandas as pd
import itertools
from io import StringIO
data = StringIO("""Name;Address;Date
faraz;xyz;2022-01-01
Abdul;abc;2022-06-06
Zara;qrs;2021-02-25
""")
df = pd.read_csv(data, sep=';', parse_dates=['Date'])
# All pairwise combinations of dates, each pair sorted newest-first
df_date = pd.DataFrame([sorted(l, reverse=True) for l in itertools.combinations(df['Date'], 2)],
                       columns=['Date1', 'Date2'])
df_date['diff'] = (df_date['Date1'] - df_date['Date2']).dt.days
# Keep rows whose date appears in a pair at most 180 days apart
# (.T[0] picks the two dates of the pair labeled 0, the qualifying one here)
df[df.Date.isin(df_date[df_date['diff'] <= 180].iloc[:, :-1].T[0])]
Output:
Name Address Date
0 faraz xyz 2022-01-01
1 Abdul abc 2022-06-06
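For 1 million+ rows, though, the pairwise combinations above grow quadratically. A different technique (not from this answer) that scales as a sort: after sorting, a date is within 180 days of some other date exactly when it is within 180 days of an adjacent date in sorted order. A hedged sketch:

# Sort once; then only neighbours need to be checked.
s = df['Date'].sort_values()
near_prev = s.diff().dt.days <= 180          # close to the previous date
near_next = s.diff(-1).dt.days >= -180       # close to the next date
df[df['Date'].isin(s[near_prev | near_next])]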
First, I think it'd be easier if you use relativedelta from dateutil.
Reference: https://pynative.com/python-difference-between-two-dates-in-months/
Second, I think you need to add a column; let's call it score.
In the inner loop, if the delta is <= 6 months,
set score = 1 and 'continue'.
This way each row is compared to all rows.
Finally, delete all rows that have score == 0.
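A minimal sketch of this approach (the score column follows the answer; the rest of the names and the exact 6-month test are my own interpretation):

from dateutil.relativedelta import relativedelta

dates = list(df['Date'])
score = []
for i, d1 in enumerate(dates):
    keep = 0
    for j, d2 in enumerate(dates):
        if i == j:
            continue
        delta = relativedelta(max(d1, d2), min(d1, d2))
        # Within 6 months of at least one other date -> keep this row.
        if delta.years == 0 and (delta.months < 6 or (delta.months == 6 and delta.days == 0)):
            keep = 1
            break
    score.append(keep)

df['score'] = score
df = df[df['score'] == 1].drop(columns='score')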
I have a dataframe
id |start|stop|join_date
233| 0 | 12 |2015-01-01
234| 0 | 12 |2013-03-04
235| 10 | 23 |2014-01-10
GOAL:
I want to create another column stop_date that offsets the join_date based on whether or not the start date is 0.
If start is 0, then stop_date is the join_date offset by the months in stop.
If start is not 0, then stop_date is the join_date offset by the months in stop plus the months in start.
I wrote the following function:
def stop_date(x):
    if x['start'] == 0:
        return x['join_date'] + x['stop'].astype('timedelta64[M]')
    elif x['start'] != 0:
        return x['join_date'] + x['start'].astype('timedelta64[M]') + x['stop'].astype('timedelta64[M]')
    else:
        return x
I tried to apply to the dataframe by:
df['stop_date'] = df.apply(stop_date, axis = 1)
I keep getting an error : AttributeError: ("'int' object has no attribute 'astype'", 'occurred at index 0')
I cannot figure out how to achieve this.
Because when start is 0, summing start and stop doesn't change the number of months to add, you can sum both columns, convert with astype, and add the result to join_date:
df['stop_date'] = (pd.to_datetime(df['join_date'])
+ df[['start', 'stop']].sum(axis=1).astype('timedelta64[M]')
).dt.date
print (df)
id start stop join_date stop_date
0 233 0 12 2015-01-01 2016-01-01
1 234 0 12 2013-03-04 2014-03-04
2 235 10 23 2014-01-10 2016-10-10
Convert the columns to the desired dtype before you apply the function. x['stop'] is a scalar value with the datatype of the column (e.g., 12), so it has no DataFrame or Series methods, such as astype.
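A minimal sketch of that fix (the *_td column names are mine), assuming a pandas version where integer-to-'timedelta64[M]' astype is supported, as in the answer above:

import pandas as pd

df['join_date'] = pd.to_datetime(df['join_date'])
df['start_td'] = df['start'].astype('timedelta64[M]')  # month counts -> timedeltas
df['stop_td'] = df['stop'].astype('timedelta64[M]')

def stop_date(x):
    # The timedelta columns already have the right dtype, so plain addition works.
    if x['start'] == 0:
        return x['join_date'] + x['stop_td']
    return x['join_date'] + x['start_td'] + x['stop_td']

df['stop_date'] = df.apply(stop_date, axis=1)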
I have a set of IDs and Timestamps, and want to calculate the "total time elapsed per ID" by taking the difference between the earliest and latest timestamps, grouped by ID.
Data
id timestamp
1 2018-02-01 03:00:00
1 2018-02-01 03:01:00
2 2018-02-02 10:03:00
2 2018-02-02 10:04:00
2 2018-02-02 11:05:00
Expected Result
(I want the delta converted to minutes)
id delta
1 1
2 62
I have a for loop, but it's very slow (10+ min for 1M+ rows). I was wondering if this was achievable via pandas functions?
# gb returns a DataFrameGroupedBy object, grouped by ID
gb = df.groupby(['id'])
# Create the resulting df
cycletime = pd.DataFrame(columns=['id','timeDeltaMin'])
def calculate_delta():
    for id, groupdf in gb:
        time = groupdf.timestamp
        # returns timestamp rows for the current id
        time_delta = time.max() - time.min()
        # convert Timedelta object to minutes
        time_delta = time_delta / pd.Timedelta(minutes=1)
        # insert result to cycletime df
        cycletime.loc[-1] = [id, time_delta]
        cycletime.index += 1
Thinking of trying next:
- Multiprocessing
First ensure datetimes are OK:
df.timestamp = pd.to_datetime(df.timestamp)
Now find the number of minutes in the difference between the maximum and minimum for each id:
import numpy as np
>>> (df.timestamp.groupby(df.id).max() - df.timestamp.groupby(df.id).min()) / np.timedelta64(1, 'm')
id
1 1.0
2 62.0
Name: timestamp, dtype: float64
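The two groupbys can also be collapsed into a single pass; an equivalent variant (same result, my phrasing):

# One groupby: spread (max - min) per id, expressed in minutes.
df.groupby('id')['timestamp'].agg(lambda s: (s.max() - s.min()) / pd.Timedelta(minutes=1))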
You can sort by id and timestamp, then group by id and find the difference between the min and max timestamp per group.
df['timestamp'] = pd.to_datetime(df['timestamp'])
result = df.sort_values(['id']).groupby('id')['timestamp'].agg(['min', 'max'])
result['diff'] = (result['max']-result['min']) / np.timedelta64(1, 'm')
result.reset_index()[['id', 'diff']]
Output:
id diff
0 1 1.0
1 2 62.0
Another one:
import pandas as pd
import numpy as np
import datetime
ids = [1,1,2,2,2]
times = ['2018-02-01 03:00:00','2018-02-01 03:01:00','2018-02-02 10:03:00',
         '2018-02-02 10:04:00','2018-02-02 11:05:00']
df = pd.DataFrame({'id':ids,'timestamp':pd.to_datetime(pd.Series(times))})
df.set_index('id', inplace=True)
print(df.groupby(level=0).diff().sum(level=0)['timestamp'].dt.seconds/60)
I currently have a process for windowing time series data, but I am wondering if there is a vectorized, in-place approach for performance/resource reasons.
I have two lists that have the start and end dates of 30 day windows:
start_dts = ['2014-01-01', ...]
end_dts = ['2014-01-30', ...]
I have a dataframe with a field called 'transaction_dt'.
What I am trying to accomplish is a method to add two new columns ('start_dt' and 'end_dt') to each row when the transaction_dt is between a pair of 'start_dt' and 'end_dt' values. Ideally, this would be vectorized and in-place if possible.
EDIT:
As requested here is some sample data of my format:
'customer_id','transaction_dt','product','price','units'
1,2004-01-02,thing1,25,47
1,2004-01-17,thing2,150,8
2,2004-01-29,thing2,150,25
IIUC, by using IntervalIndex:
df2.index=pd.IntervalIndex.from_arrays(df2['Start'],df2['End'],closed='both')
df[['End','Start']]=df2.loc[df['transaction_dt']].values
df
Out[457]:
transaction_dt End Start
0 2017-01-02 2017-01-31 2017-01-01
1 2017-03-02 2017-03-31 2017-03-01
2 2017-04-02 2017-04-30 2017-04-01
3 2017-05-02 2017-05-31 2017-05-01
Data input:
df=pd.DataFrame({'transaction_dt':['2017-01-02','2017-03-02','2017-04-02','2017-05-02']})
df['transaction_dt']=pd.to_datetime(df['transaction_dt'])
list1=['2017-01-01','2017-02-01','2017-03-01','2017-04-01','2017-05-01']
list2=['2017-01-31','2017-02-28','2017-03-31','2017-04-30','2017-05-31']
df2=pd.DataFrame({'Start':list1,'End':list2})
df2.Start=pd.to_datetime(df2.Start)
df2.End=pd.to_datetime(df2.End)
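One caveat: df2.loc[...] raises a KeyError for any transaction_dt that falls inside no interval. A hedged sketch of a more tolerant lookup with get_indexer (same df/df2 as above):

# get_indexer returns -1 for dates that match no window.
idx = df2.index.get_indexer(df['transaction_dt'])
matched = idx >= 0
df.loc[matched, ['End', 'Start']] = df2[['End', 'Start']].iloc[idx[matched]].values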
If you want the start and end of each month, we can use this (see Extracting the first day of month of a datetime type column in pandas):
import io
import pandas as pd
import datetime
string = """customer_id,transaction_dt,product,price,units
1,2004-01-02,thing1,25,47
1,2004-01-17,thing2,150,8
2,2004-01-29,thing2,150,25"""
df = pd.read_csv(io.StringIO(string))
df["transaction_dt"] = pd.to_datetime(df["transaction_dt"])
df["start"] = df['transaction_dt'].dt.floor('d') - pd.offsets.MonthBegin(1)
df["end"] = df['transaction_dt'].dt.floor('d') + pd.offsets.MonthEnd(1)
df
Returns
customer_id transaction_dt product price units start end
0 1 2004-01-02 thing1 25 47 2004-01-01 2004-01-31
1 1 2004-01-17 thing2 150 8 2004-01-01 2004-01-31
2 2 2004-01-29 thing2 150 25 2004-01-01 2004-01-31
new approach:
import io
import pandas as pd
import datetime
string = """customer_id,transaction_dt,product,price,units
1,2004-01-02,thing1,25,47
1,2004-01-17,thing2,150,8
2,2004-06-29,thing2,150,25"""
df = pd.read_csv(io.StringIO(string))
df["transaction_dt"] = pd.to_datetime(df["transaction_dt"])
# Get all timestamps that are necessary.
# This assumes dates are sorted;
# if not, we should change [0] -> min_dt and [-1] -> max_dt.
timestamps = [df.iloc[0]["transaction_dt"].floor('d') - pd.offsets.MonthBegin(1)]
while df.iloc[-1]["transaction_dt"].floor('d') > timestamps[-1]:
    timestamps.append(timestamps[-1] + datetime.timedelta(days=30))
# We store all ranges here
ranges = list(zip(timestamps, timestamps[1:]))
# Loop through all values and fill the start and end columns
for ind, value in enumerate(df["transaction_dt"]):
    for i, (start, end) in enumerate(ranges):
        if start <= value <= end:
            df.loc[ind, "start"] = start
            df.loc[ind, "end"] = end
            # When a match is found, also drop the ranges
            # that have already been passed.
            # This can be removed if dates are not sorted,
            # but it should speed things up for large datasets.
            for _ in range(i):
                ranges.pop(0)
            # Move on to the next transaction once matched
            # (also avoids iterating over the just-mutated list).
            break
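For the vectorized behavior the question asks about, the inner loop can also be replaced with np.searchsorted over the window starts. A rough sketch under the same assumptions (sorted, non-overlapping 30-day windows; the loop above consumes ranges, so the boundaries are rebuilt from timestamps):

import numpy as np

starts = np.array(timestamps[:-1], dtype='datetime64[ns]')
ends = np.array(timestamps[1:], dtype='datetime64[ns]')

# For each transaction, the candidate window is the last one starting at or before it.
# (A date on a shared boundary lands in the later window here.)
vals = df['transaction_dt'].values
pos = np.searchsorted(starts, vals, side='right') - 1
inside = (pos >= 0) & (vals <= ends[pos])

df.loc[inside, 'start'] = starts[pos[inside]]
df.loc[inside, 'end'] = ends[pos[inside]]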