I'm having trouble adding redundant data to an incomplete DataFrame.
from datetime import datetime, date

def add_lines_when_empty(df, nameOfTimeColumn, timeDeltaInterval, startOfDay, endOfDay, max_interval_without_data):
    i = 0
    while i < len(df[nameOfTimeColumn]) - 1:
        time0 = datetime.strptime(df[nameOfTimeColumn][i], '%Y-%m-%d %H:%M:%S')
        time1 = datetime.strptime(df[nameOfTimeColumn][i+1], '%Y-%m-%d %H:%M:%S')
        if time0 - time1 != timeDeltaInterval:
            if time0.time() != startOfDay:
                row = df.iloc[i].copy()
                row.loc[nameOfTimeColumn] = (time0 - timeDeltaInterval).strftime('%Y-%m-%d %H:%M:%S')
                df = add_row_at_index(df, row, i+1)
            elif time1.time() != endOfDay:
                row = df.iloc[i].copy()
                row.loc[nameOfTimeColumn] = (time1 + timeDeltaInterval).strftime('%Y-%m-%d %H:%M:%S')
                df = add_row_at_index(df, row, i+1)
            elif time0 - time1 > max_interval_without_data:
                row = df.iloc[i].copy()
                row.loc[nameOfTimeColumn] = (time0 - max_interval_without_data
                                             - datetime.combine(date.min, startOfDay)
                                             + datetime.combine(date.min, endOfDay)).strftime('%Y-%m-%d %H:%M:%S')
                df = add_row_at_index(df, row, i+1)
        i += 1
    return df
Now, this function checks that the time between each pair of consecutive rows is timeDeltaInterval, unless one row is at startOfDay and the next is at endOfDay. In that case, it checks that the two rows are indeed on consecutive days (max_interval_without_data = 1 day). If one of these checks fails, it adds a row after row i with the same data as row i (except for the time, of course).
It really doesn't look efficient at all, but this is the best I could come up with: it takes around 10 minutes to go over a 120,000-row DataFrame. Could anyone help me make it less complicated? I'm pretty new to pandas DataFrames, so I'm still struggling to make these processes efficient.
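For what it's worth, the pairwise check itself can be vectorized: to_datetime parses the whole column at once and Series.diff computes every consecutive gap in one shot, which avoids calling strptime inside a Python loop. A minimal sketch, assuming the column parses with the format shown above (.abs() makes it indifferent to whether the rows are ascending or descending):

import pandas as pd

t = pd.to_datetime(df[nameOfTimeColumn], format='%Y-%m-%d %H:%M:%S')
# True at every position whose gap to the previous row is not timeDeltaInterval
bad_gap = t.diff().abs().iloc[1:] != timeDeltaInterval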
Also, I'm pretty sure my add_row_at_index() function is decent, but here it is just in case:
def add_row_at_index(df, row, index):
    df.loc[index - 0.5] = row
    df = df.sort_index().reset_index(drop=True)
    return df
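One design note: add_row_at_index itself is fine, but because it sorts and re-indexes the whole frame on every call, inserting k rows costs roughly k full passes over the DataFrame. The usual trick is to buffer the fractional-index inserts during the scan and sort once at the end; a sketch (new_rows is hypothetical, standing in for whatever (position, row) pairs the scan collects, and the fractions must be distinct if several rows land in the same gap):

# buffer the inserts instead of sorting after each one
for pos, row in new_rows:
    df.loc[pos - 0.5] = row
# one sort/re-index at the very end instead of one per insertion
df = df.sort_index().reset_index(drop=True)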
Here is a short example of what this function should do:
Here, we have 7 minutes missing (between 13:36 and 13:44), so it should add these 7 minutes with the same data as the 13:44 row, to give something like this:
Also, if startOfDay == 08:00 and endOfDay == 18:00, the last elif makes sure that there is no more than one day between consecutive rows; if there is, it fills the entire missing day with values from the next startOfDay.
Thanks in advance!
I'm working with field change history data, which has timestamps for when the field value was changed. In this example, I need to calculate the overall case duration in the 'Termination in Progress' status.
The given case was changed from and to this status three times in total:
see screenshot
I need to add up all three durations in this case and in other cases it can be more or less than three.
Does anyone know how to calculate that in Python?
Welcome to Stack Overflow!
Based on the limited data you provided, here is a solution that should work. The code makes some assumptions that could cause errors, so you will want to modify it to suit your needs. I avoided list comprehensions and array math to keep it clear, since you said you're new to Python.
Assumptions:
You're pulling this data into a pandas dataframe
All Old values of "Termination in Progress" have a matching new value for all Case Numbers
import datetime
import pandas as pd
import numpy as np

fp = r'<PATH TO FILE>\\'
f = '<FILENAME>.csv'
data = pd.read_csv(fp + f)

# convert the timestamps to datetime for the time delta calculations later
data['Edit Date'] = pd.to_datetime(data['Edit Date'])

# sort by case number, and by date in descending order, so the old and new values align properly
data.sort_values(by=['CaseNumber', 'Edit Date'], ascending=[True, False], inplace=True)

# find the timestamps where 'Termination in progress' occurs
old_val_ts = data.loc[data['Old Value'] == 'Termination in progress']['Edit Date'].to_list()
new_val_ts = data.loc[data['New Value'] == 'Termination in progress']['Edit Date'].to_list()

# loop over the timestamps and calc the time delta
ts_deltas = list()
for i in range(len(old_val_ts)):
    ts_deltas.append(old_val_ts[i] - new_val_ts[i])

# this loop could also be accomplished with a list comprehension like this:
# ts_deltas = [old_ts - new_ts for (old_ts, new_ts) in zip(old_val_ts, new_val_ts)]

print('Deltas between groups')
print(ts_deltas)
print()

# sum the time deltas
total_ts_delta = sum(ts_deltas, datetime.timedelta())
print('Total Time Delta')
print(total_ts_delta)
Deltas between groups
[Timedelta('0 days 00:08:00'), Timedelta('0 days 00:06:00'), Timedelta('0 days 02:08:00')]
Total Time Delta
0 days 02:22:00
I've also attached a picture of the solution, minus my file path for obvious reasons. Hope this helps. Please remember to mark this as correct if it works for you; otherwise, let me know what issues you run into.
EDIT:
If you have multiple case numbers to look at, you could do it in various ways, but the simplest is to get a list of unique case numbers with data['CaseNumber'].unique(), then iterate over that array, filtering for each case number and appending the total time delta to a new list or dictionary. (Not necessarily the most efficient solution, but it will work.)
cases_total_td = {}
unique_cases = data['CaseNumber'].unique()
for case in unique_cases:
    temp_data = data[data['CaseNumber'] == case]
    # find the timestamps where 'Termination in progress' occurs for this case
    old_val_ts = temp_data.loc[temp_data['Old Value'] == 'Termination in progress']['Edit Date'].to_list()
    new_val_ts = temp_data.loc[temp_data['New Value'] == 'Termination in progress']['Edit Date'].to_list()
    # calc the time delta between each matching pair of timestamps
    ts_deltas = [old_ts - new_ts for (old_ts, new_ts) in zip(old_val_ts, new_val_ts)]
    # sum the time deltas
    total_ts_delta = sum(ts_deltas, datetime.timedelta())
    cases_total_td[case] = total_ts_delta
print(cases_total_td)
{1005222: Timedelta('0 days 02:22:00')}
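For what it's worth, the same per-case totals can be computed with groupby instead of the manual filter loop; a sketch, relying on the same matched-pairs assumption as above:

def case_total(g):
    # positionally pair each 'old' timestamp with its matching 'new' timestamp
    old = g.loc[g['Old Value'] == 'Termination in progress', 'Edit Date'].reset_index(drop=True)
    new = g.loc[g['New Value'] == 'Termination in progress', 'Edit Date'].reset_index(drop=True)
    return (old - new).sum()

cases_total_td = data.groupby('CaseNumber').apply(case_total).to_dict()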
I have the following method in which I am eliminating overlapping intervals in a dataframe based on a set of hierarchical rules:
def disambiguate(arg):
    arg['length'] = (arg.end - arg.begin).abs()
    df = arg[['begin', 'end', 'note_id', 'score', 'length']].copy()
    data = []
    out = pd.DataFrame()
    for row in df.itertuples():
        test = df[df['note_id'] == row.note_id].copy()
        # get overlapping intervals:
        # https://stackoverflow.com/questions/58192068/is-it-possible-to-use-pandas-overlap-in-a-dataframe
        iix = pd.IntervalIndex.from_arrays(test.begin.apply(pd.to_numeric), test.end.apply(pd.to_numeric), closed='neither')
        span_range = pd.Interval(row.begin, row.end)
        fx = test[iix.overlaps(span_range)].copy()
        maxLength = fx['length'].max()
        minLength = fx['length'].min()
        maxScore = abs(float(fx['score'].max()))
        minScore = abs(float(fx['score'].min()))
        # filter out overlapping rows via hierarchy
        if maxScore > minScore:
            fx = fx[fx['score'] == maxScore]
        elif maxLength > minLength:
            fx = fx[fx['length'] == maxLength]
        data.append(fx)
    out = pd.concat(data, axis=0)
    # randomly reindex to keep a random row when dropping remaining duplicates: https://gist.github.com/cadrev/6b91985a1660f26c2742
    out.reset_index(inplace=True)
    out = out.reindex(np.random.permutation(out.index))
    return out.drop_duplicates(subset=['begin', 'end', 'note_id'])
This works fine, except for the fact that the dataframes I am iterating over have well over 100k rows each, so this is taking forever to complete. I timed the various methods using %prun in Jupyter, and the method that seems to eat up processing time is series.py:3719(apply). NB: I tried using modin.pandas, but that caused more problems (I kept getting an error about Interval needing a value where left was less than right, which I couldn't figure out; I may file a GitHub issue there).
I am looking for a way to optimize this, such as using vectorization, but honestly I don't have the slightest clue how to convert this to a vectorized form.
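One possible direction (a sketch, not a drop-in replacement): two open intervals overlap exactly when each begins before the other ends, so the full pairwise overlap matrix for one note can be computed with numpy broadcasting instead of calling IntervalIndex.overlaps once per row:

import numpy as np

b = test['begin'].to_numpy(dtype=float)
e = test['end'].to_numpy(dtype=float)
# overlap[i, j] is True when interval i and interval j overlap (open endpoints)
overlap = (b[:, None] < e[None, :]) & (b[None, :] < e[:, None])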
Here is a sample of my data:
begin,end,note_id,score
0,9,0365,1
10,14,0365,1
25,37,0365,0.7
28,37,0365,1
38,42,0365,1
53,69,0365,0.7857142857142857
56,60,0365,1
56,69,0365,1
64,69,0365,1
83,86,0365,1
91,98,0365,0.8333333333333334
101,108,0365,1
101,127,0365,1
112,119,0365,1
112,127,0365,0.8571428571428571
120,127,0365,1
163,167,0365,1
196,203,0365,1
208,216,0365,1
208,223,0365,1
208,231,0365,1
208,240,0365,0.6896551724137931
217,223,0365,1
217,231,0365,1
224,231,0365,1
246,274,0365,0.7692307692307693
252,274,0365,1
263,274,0365,0.8888888888888888
296,316,0365,0.7222222222222222
301,307,0365,1
301,316,0365,1
301,330,0365,0.7307692307692307
301,336,0365,0.78125
308,316,0365,1
308,323,0365,1
308,330,0365,1
308,336,0365,1
317,323,0365,1
317,336,0365,1
324,330,0365,1
324,336,0365,1
361,418,0365,0.7368421052631579
370,404,0365,0.7111111111111111
370,418,0365,0.875
383,418,0365,0.8285714285714286
396,404,0365,1
396,418,0365,0.8095238095238095
405,418,0365,0.8333333333333334
432,453,0365,0.7647058823529411
438,453,0365,1
438,458,0365,0.7222222222222222
I think I know what the issue was: I did my filtering on note_id incorrectly and was therefore iterating over the entire dataframe. It should have been:
cases = set(df['note_id'].tolist())
data = []
for case in cases:
    test = df[df['note_id'] == case].copy()
    for row in test.itertuples():
        # get overlapping intervals:
        # https://stackoverflow.com/questions/58192068/is-it-possible-to-use-pandas-overlap-in-a-dataframe
        # (note: iix depends only on test, so it could even be hoisted out of this inner loop)
        iix = pd.IntervalIndex.from_arrays(test.begin, test.end, closed='neither')
        span_range = pd.Interval(row.begin, row.end)
        fx = test[iix.overlaps(span_range)].copy()
        maxLength = fx['length'].max()
        minLength = fx['length'].min()
        maxScore = abs(float(fx['score'].max()))
        minScore = abs(float(fx['score'].min()))
        if maxScore > minScore:
            fx = fx[fx['score'] == maxScore]
        elif maxLength > minLength:
            fx = fx[fx['length'] == maxLength]
        data.append(fx)
out = pd.concat(data, axis=0)
For testing on one note, before I stopped iterating over the entire, unfiltered dataframe, it was taking over 16 minutes. Now it's at 28 seconds!
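As a further tidy-up, groupby yields the same per-note subsets without building the set of ids by hand, and the IntervalIndex can be built once per note rather than once per row; a sketch:

data = []
for case, test in df.groupby('note_id'):
    # built once per note: it does not depend on the current row
    iix = pd.IntervalIndex.from_arrays(test.begin, test.end, closed='neither')
    for row in test.itertuples():
        fx = test[iix.overlaps(pd.Interval(row.begin, row.end))].copy()
        # ...same hierarchy filtering as above...
        data.append(fx)
out = pd.concat(data, axis=0)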
I am attempting to collect counts of occurrences of an id between two time periods in a dataframe. I have a moderately sized dataframe (about 400 unique ids and just short of 1M rows) containing a time of occurrence and an id for the account that caused the occurrence. I am attempting to get a count of occurrences for multiple time periods (1 hour, 6 hours, 1 day, etc.) prior to a specific occurrence, and have run into lots of difficulties.
I am using Python 3.7, and for this instance I only have the pandas package loaded. I have tried using for loops, and while they likely would have worked (eventually), I am looking for something a bit more time-efficient. I have also tried list comprehensions and ran into some errors that I did not anticipate when dealing with datetime columns. Examples of both are below.
## Sample data
data = {'id':[ 'EAED813857474821E1A61F588FABA345', 'D528C270B80F11E284931A7D66640965', '6F394474B8C511E2A76C1A7D66640965', '7B9C7C02F19711E38C670EDFB82A24A9', '80B409D1EC3D4CC483239D15AAE39F2E', '314EB192F25F11E3B68A0EDFB82A24A9', '68D30EE473FE11E49C060EDFB82A24A9', '156097CF030E4519DBDF84419B855E10', 'EE80E4C0B82B11E28C561A7D66640965', 'CA9F2DF6B82011E28C561A7D66640965', '6F394474B8C511E2A76C1A7D66640965', '314EB192F25F11E3B68A0EDFB82A24A9', 'D528C270B80F11E284931A7D66640965', '3A024345C1E94CED8C7E0DA3A96BBDCA', '314EB192F25F11E3B68A0EDFB82A24A9', '47C18B6B38E540508561A9DD52FD0B79', 'B72F6EA5565B49BBEDE0E66B737A8E6B', '47C18B6B38E540508561A9DD52FD0B79', 'B92CB51EFA2611E2AEEF1A7D66640965', '136EDF0536F644E0ADE6F25BB293DD17', '7B9C7C02F19711E38C670EDFB82A24A9', 'C5FAF9ACB88D4B55AB8196DBFFE5B3C0', '1557D4ECEFA74B40C718A4E5425F3ACB', '68D30EE473FE11E49C060EDFB82A24A9', '68D30EE473FE11E49C060EDFB82A24A9', 'CAF9D8CD627B422DFE1D587D25FC4035', 'C620D865AEE1412E9F3CA64CB86DC484', '47C18B6B38E540508561A9DD52FD0B79', 'CA9F2DF6B82011E28C561A7D66640965', '06E2501CB81811E290EF1A7D66640965', '68EEE17873FE11E4B5B90AFEF9534BE1', '47C18B6B38E540508561A9DD52FD0B79', '1BFE9CB25AD84B64CC2D04EF94237749', '7B20C2BEB82811E28C561A7D66640965', '261692EA8EE447AEF3804836E4404620', '74D7C3901F234993B4788EFA9E6BEE9E', 'CAF9D8CD627B422DFE1D587D25FC4035', '76AAF82EB8C511E2A76C1A7D66640965', '4BD38D6D44084681AFE13C146542A565', 'B8D27E80B82911E28C561A7D66640965' ], 'datetime':[ "24/06/2018 19:56", "24/05/2018 03:45", "12/01/2019 14:36", "18/08/2018 22:42", "19/11/2018 15:43", "08/07/2017 21:32", "15/05/2017 14:00", "25/03/2019 22:12", "27/02/2018 01:59", "26/05/2019 21:50", "11/02/2017 01:33", "19/11/2017 19:17", "04/04/2019 13:46", "08/05/2019 14:12", "11/02/2018 02:00", "07/04/2018 16:15", "29/10/2016 20:17", "17/11/2018 21:58", "12/05/2017 16:39", "28/01/2016 19:00", "24/02/2019 19:55", "13/06/2019 19:24", "30/09/2016 18:02", "14/07/2018 17:59", "06/04/2018 22:19", "25/08/2017 17:51", "07/04/2019 02:24", "26/05/2018 17:41", "27/08/2014 06:45", "15/07/2016 19:30", "30/10/2016 20:08", "15/09/2018 18:45", "29/01/2018 02:13", "10/09/2014 23:10", "11/05/2017 22:00", "31/05/2019 23:58", "19/02/2019 02:34", "02/02/2019 01:02", "27/04/2018 04:00", "29/11/2017 20:35"]}
import pandas as pd
from datetime import timedelta

df = pd.DataFrame(data)
# parse the datetime strings so the comparisons below work (format is day/month/year)
df['datetime'] = pd.to_datetime(df['datetime'], format='%d/%m/%Y %H:%M')
df = df.sort_values(['id', 'datetime'], ascending=True)
# for loop attempt
totalAccounts = df['id'].unique()
for account in totalAccounts:
    oneHourCount = 0
    subset = df[df['id'] == account]
    for i in range(len(subset)):
        onehour = subset['datetime'].iloc[i] - timedelta(hours=1)
        for j in range(len(subset)):
            if (subset['datetime'].iloc[j] >= onehour) and (subset['datetime'].iloc[j] < subset['datetime'].iloc[i]):
                oneHourCount += 1
#list comprehension attempt
df['onehour'] = df['datetime'] - timedelta(hours=1)
for account in totalAccounts:
    subset = df[df['id'] == account]
    onehour = sum([1 for x in subset['datetime'] if x >= subset['onehour'] and x < subset['datetime']])
I am getting either 1) incredibly long runtimes with the for loop or 2) a ValueError about the truth value of a Series being ambiguous (the condition in the comprehension compares a scalar against a whole column). I know the issue is dealing with the datetimes, and perhaps it is just going to be slow going, but I want to check here first just to make sure.
So I was able to figure this out using bisection. If you have a similar question please PM me and I'd be more than happy to help.
Solution:
from bisect import bisect_left, bisect_right

# keys is the sorted list of occurrence times for the account (assumed from context)
left = bisect_left(keys, subset['start_time'].iloc[i])   ## calculated window-start time
right = bisect_right(keys, subset['datetime'].iloc[i])   ## actual time of occurrence
count = len(subset['datetime'][left:right])
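For completeness, numpy's searchsorted can compute every one-hour window count for an account in a single call, rather than bisecting row by row; a sketch, assuming subset is sorted by datetime:

import numpy as np

times = subset['datetime'].to_numpy()  # must be sorted ascending
lo = np.searchsorted(times, times - np.timedelta64(1, 'h'), side='left')
hi = np.searchsorted(times, times, side='left')  # first index at or after each time
one_hour_counts = hi - lo  # occurrences in [t - 1h, t) for every t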
I have a huge data set, something like 100k lines, and I am trying to drop a row from a dataframe if that row, which contains a list, contains a value from another dataframe. Here's a small-scale example.
import pandas as pd

has = [['#a'], ['#b'], ['#c, #d, #e, #f'], ['#g']]
use = [1,2,3,5]
z = ['#d','#a']
df = pd.DataFrame({'user': use, 'tweet': has})
df2 = pd.DataFrame({'z': z})
tweet user
0 [#a] 1
1 [#b] 2
2 [#c, #d, #e, #f] 3
3 [#g] 5
z
0 #d
1 #a
The desired outcome would be
tweet user
0 [#b] 2
1 [#g] 5
Things I've tried:
#this seems to work for dropping #a but not #d
for a in range(df.tweet.size):
    for search in df2.z:
        if search in df.loc[a].tweet:
            df.drop(a)
#this works for my small-scale example but throws an error on my big data
df['tweet'] = df.tweet.apply(', '.join)
test = df[~df.tweet.str.contains('|'.join(df2['z'].astype(str)))]
#the error being "unterminated character set at position 1343770"
#I went to check what was on that line and it returned this
basket.iloc[1343770]
user_id                                  17060480
tweet       [#IfTheyWereBlackOrBrownPeople, #WTF]
Name: 4612505, dtype: object
Any help would be greatly appreciated.
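One aside on the "unterminated character set" error: the position in that message refers to an offset inside the regex pattern (the joined z values), not a row number, and it suggests one of the patterns contains an unescaped metacharacter such as [. Escaping each pattern sidesteps this; a sketch against the joined-string version above:

import re

pattern = '|'.join(map(re.escape, df2['z'].astype(str)))
test = df[~df.tweet.str.contains(pattern)]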
Is ['#c, #d, #e, #f'] one string, or a list like this: ['#c', '#d', '#e', '#f']?
has = [['#a'], ['#b'], ['#c', '#d', '#e', '#f'], ['#g']]
use = [1,2,3,5]
z = ['#d','#a']
df = pd.DataFrame({'user': use, 'tweet': has})
df2 = pd.DataFrame({'z': z})
A simple solution would be:
screen = set(df2.z.tolist())
to_delete = list()  # this will speed things up by doing only 1 delete
for id, row in df.iterrows():
    if set(row.tweet).intersection(screen):
        to_delete.append(id)
df.drop(to_delete, inplace=True)
Speed comparison (for 10,000 rows):
import time

st = time.time()
screen = set(df2.z.tolist())
to_delete = list()
for id, row in df.iterrows():
    if set(row.tweet).intersection(screen):
        to_delete.append(id)
df.drop(to_delete, inplace=True)
print(time.time() - st)
2.142000198364258
st = time.time()
for a in df.tweet.index:
    for search in df2.z:
        if search in df.loc[a].tweet:
            df.drop(a, inplace=True)
            break
print(time.time() - st)
43.99799990653992
For me, your code works if I make several adjustments.
First, range(df.tweet.size) can miss rows; more robustly (especially if you don't have a plain increasing index), iterate over df.tweet.index.
Second, you never apply your dropping: df.drop(a) returns a new frame, so pass inplace=True.
Third, you have #d inside a single string: '#c, #d, #e, #f' is one string, not a list of tags, so you have to turn it into a list for the membership test to work.
So if you change that, the following code works fine:
has = [['#a'], ['#b'], ['#c', '#d', '#e', '#f'], ['#g']]
use = [1, 2, 3, 5]
z = ['#d', '#a']
df = pd.DataFrame({'user': use, 'tweet': has})
df2 = pd.DataFrame({'z': z})

for a in df.tweet.index:
    for search in df2.z:
        if search in df.loc[a].tweet:
            df.drop(a, inplace=True)
            break  # so if we already dropped it, we no longer check whether we should drop this line
This will provide the desired result. Be aware that it may not be optimal, though, since it isn't vectorized.
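If performance ever becomes an issue here, a single pass using set disjointness avoids both the nested loops and the repeated drop calls; a sketch, assuming each tweet cell is a proper list of tags:

screen = set(df2['z'])
# keep only the rows whose tag list shares nothing with screen
df = df[df.tweet.apply(screen.isdisjoint)]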
EDIT:
You can turn each string into a list with the following:
from itertools import chain
df.tweet = df.tweet.apply(lambda l: [s.strip() for s in chain(*(e.split(',') for e in l))])
This applies a function to each row (assuming each row contains a list with one or more strings): it splits each element by comma, strips the surrounding whitespace (so ' #d' matches '#d'), and "flattens" all the resulting lists in one row together.
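Applied to the string-style row from the question, for instance (using the chain import above):

row = ['#c, #d, #e, #f']
[s.strip() for s in chain(*(e.split(',') for e in row))]
# -> ['#c', '#d', '#e', '#f']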
EDIT2:
Yes, this is not really performant, but it basically does what was asked. Keep that in mind, and once it works, try to improve it: fewer for-loop iterations, tricks like collecting the indices first and then dropping them all at once.