iterations over list in dataframe - python

I have the following issue:
I have a dataframe with 3 columns:
The first is userID, the second is invoiceType and the third the time of creation of the invoice.
df = pd.read_csv('invoice.csv')
Output: UserID InvoiceType CreateTime
1 a 2018-01-01 12:31:00
2 b 2018-01-01 12:34:12
3 a 2018-01-01 12:40:13
1 c 2018-01-09 14:12:25
2 a 2018-01-12 14:12:29
1 b 2018-02-08 11:15:00
2 c 2018-02-12 10:12:12
I am trying to plot the invoice cycle for each user. I need to create 2 new columns, time_diff and time_diff_wrt_first_invoice. time_diff will represent the time difference between consecutive invoices for each user, and time_diff_wrt_first_invoice will represent the time difference between each invoice and the first invoice, which will be useful for plotting purposes. This is my code:
"""
********** Exploding a variable that is a list in each dataframe cell
"""
def explode_list(df, x):
    return (df[x].apply(pd.Series)
                 .stack()
                 .reset_index(level=1, drop=True)
                 .to_frame(x))
"""
****** applying explode_list to all the columns ******
"""
def explode_listDF(df):
    exploded_df = pd.DataFrame()
    for x in df.columns.tolist():
        exploded_df = pd.concat([exploded_df, explode_list(df, x)],
                                axis=1)
    return exploded_df
"""
******** Getting the time difference column in pivot table format
"""
def pivoted_diffTime(df1, _freq=60):
    # _freq is 1 for minute frequency
    # _freq is 60 for hourly frequency
    # _freq is 60*24 for daily frequency
    # _freq is 60*24*30 for monthly frequency
    df = df1.sort_values(['UserID', 'CreateTime'])
    df_pivot = df.pivot_table(index='UserID',
                              aggfunc=lambda x: list(v for v in x))
    df_pivot['time_diff'] = [[0]] * len(df_pivot)
    for user in df_pivot.index:
        try:
            _list = [0] + [math.floor((x - y).total_seconds() / (60 * _freq))
                           for x, y in zip(df_pivot.loc[user, 'CreateTime'][1:],
                                           df_pivot.loc[user, 'CreateTime'][:-1])]
            df_pivot.loc[user, 'time_diff'] = _list
        except:
            print('There is a problem here:', user)
    return df_pivot
"""
***** Pipelining the two functions to obtain an exploded dataframe
      with time difference ******
"""
def get_timeDiff(df, _frequency):
    df = explode_listDF(pivoted_diffTime(df, _freq=_frequency))
    return df
And once I have time_diff, I create time_diff_wrt_first_invoice this way:
# We initialize this variable
df_with_timeDiff['time_diff_wrt_first_invoice'] = [[0]] * len(df_with_timeDiff)
# Then we loop over users and apply a cumulative sum over time_diff
for user in df_with_timeDiff.UserID.unique():
    df_with_timeDiff.loc[df_with_timeDiff.UserID == user, 'time_diff_wrt_first_invoice'] = \
        np.cumsum(df_with_timeDiff.loc[df_with_timeDiff.UserID == user, 'time_diff'])
The problem is that I have a dataframe with hundreds of thousands of users, so this is very time-consuming. I am wondering if there is a solution that better fits my needs.

Check out .loc[] for pandas.
df_1 = pd.DataFrame(some_stuff)
df_2 = df_1.loc[df_1['column'] >= some_condition, 'specific_column']
You can select the rows that satisfy a condition, and if you add a comma after the condition followed by a specific column name, it will return only that column.
I'm not 100% sure this answers your question, since I didn't actually see one, but it seemed like you were running a lot of for loops to isolate rows and columns, which is what .loc[] is for.
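For illustration, a small self-contained sketch; the frame, column names, and threshold here are made up for the example:
import pandas as pd

# hypothetical example data, just to show the .loc pattern
df_1 = pd.DataFrame({'column': [1, 5, 10], 'specific_column': ['a', 'b', 'c']})

# rows where 'column' >= 5, returning only 'specific_column'
df_2 = df_1.loc[df_1['column'] >= 5, 'specific_column']
print(df_2)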

I have found a better solution. Here's my code:
def next_diff(x):
    return [0] + [(b - a).total_seconds() / 3600 for b, a in zip(x[1:], x[:-1])]

def create_timediff(df):
    df.sort_values(['UserID', 'CreateTime'], inplace=True)
    a = df.groupby('UserID').agg({'CreateTime': lambda x: list(v for v in x)}).CreateTime.apply(next_diff)
    b = a.apply(np.cumsum)
    a = a.reset_index()
    b = b.reset_index()
    # Here I explode the lists inside the cell
    rows1 = []
    _ = a.apply(lambda row: [rows1.append([row['UserID'], nn])
                             for nn in row.CreateTime], axis=1)
    rows2 = []
    __ = b.apply(lambda row: [rows2.append([row['UserID'], nn])
                              for nn in row.CreateTime], axis=1)
    df1_new = pd.DataFrame(rows1, columns=a.columns).set_index(['UserID'])
    df2_new = pd.DataFrame(rows2, columns=b.columns).set_index(['UserID'])
    df = df.set_index('UserID')
    df['time_diff'] = df1_new['CreateTime']
    df['time_diff_wrt_first_invoice'] = df2_new['CreateTime']
    df.reset_index(inplace=True)
    return df
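For reference, a minimal usage sketch, assuming the invoice.csv sample from the question; parse_dates is added here so CreateTime is a real datetime, which the subtraction in next_diff needs:
df = pd.read_csv('invoice.csv', parse_dates=['CreateTime'])
df = create_timediff(df)
print(df[['UserID', 'CreateTime', 'time_diff', 'time_diff_wrt_first_invoice']].head())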


How to re-number strings after sorting a dataframe

Description:
I have a GUI that allows the user to add variables that are displayed in a dataframe. As the variables are added, they are automatically numbered, ex.'FIELD_0' and 'FIELD_1' etc and each variable has a value associated with it. The data is actually row-based instead of column based, in that the 'FIELD' ids are in column 0 and progress downwards and the corresponding value is in column 1, in the same corresponding row. As shown below:
0 1
0 FIELD_0 HH_5_MILES
1 FIELD_1 POP_5_MILES
The user is able to reorder these values and move them up/down a row. However, it's important that the number ordering remains sequential. So, if the user positions 'FIELD_1' above 'FIELD_0' then it gets re-numbered appropriately. Example:
0 1
0 FIELD_0 POP_5_MILES
1 FIELD_1 HH_5_MILES
Currently, I'm using the below code to perform this adjustment - this same re-numbering occurs with other variable names within the same dataframe.
df = pandas.DataFrame({0:['FIELD_1','FIELD_0']})
variable_list = ['FIELD', 'OPERATOR', 'RESULT']
for var in variable_list:
    field_list = ['%s_%s' % (var, _) for _, field_name in enumerate(df[0].isin([var]))]
    field_count = 0
    for _, field_name in enumerate(df.loc[:, 0]):
        if var in field_name:
            df.loc[_, 0] = field_list[field_count]
            field_count += 1
This gets me the result I want, but it seems a bit inelegant. If there is a better way, I'd love to know what it is.
It appears you're looking to overwrite the Field values so that they always appear in order starting with 0.
We can filter to only the rows whose values str.contains the word FIELD, then assign those rows new values from a list comprehension like field_list.
import pandas as pd
# Modified DF
df = pd.DataFrame({0: ['FIELD_1', 'OTHER_1', 'FIELD_0', 'OTHER_0']})
# Select Where Values are Field
m = df[0].str.contains('FIELD')
# Overwrite field with new values by iterating over the total matches
df.loc[m, 0] = [f'FIELD_{n}' for n in range(m.sum())]
print(df)
df:
0
0 FIELD_0
1 OTHER_1
2 FIELD_1
3 OTHER_0
For multiple variables:
import pandas as pd
# Modified DF
df = pd.DataFrame({0: ['FIELD_1', 'OTHER_1', 'FIELD_0', 'OTHER_0']})
variable_list = ['FIELD', 'OTHER']
for v in variable_list:
    # Select Where Values are Field
    m = df[0].str.contains(v)
    # Overwrite field with new values by iterating over the total matches
    df.loc[m, 0] = [f'{v}_{n}' for n in range(m.sum())]
df:
0
0 FIELD_0
1 OTHER_0
2 FIELD_1
3 OTHER_1
You can use sort_values as below:
def f(x):
    l = x.split('_')[1]
    return int(l)

df.sort_values(0, key=lambda col: [f(k) for k in col]).reset_index(drop=True)
0
0 FIELD_0
1 FIELD_1
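For completeness, a self-contained sketch of that idea (the key= argument needs pandas >= 1.1; col.map(f) is used here so the key returns a Series):
import pandas as pd

df = pd.DataFrame({0: ['FIELD_1', 'FIELD_0']})

def f(x):
    # numeric suffix after the underscore, e.g. 'FIELD_1' -> 1
    return int(x.split('_')[1])

out = df.sort_values(0, key=lambda col: col.map(f)).reset_index(drop=True)
print(out)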

How can you increase the speed of an algorithm that computes a usage streak?

I have the following problem: I have data (table called 'answers') of a quiz application including the answered questions per user with the respective answering date (one answer per line), e.g.:
UserID  Time                 Term      QuestionID  Answer
1       2019-12-28 18:25:15  Winter19  345         a
2       2019-12-29 20:15:13  Winter19  734         b
I would like to write an algorithm to determine whether a user has used the quiz application several days in a row (a so-called 'streak'). Therefore, I want to create a table ('appData') with the following information:
UserID  Term      HighestStreak
1       Winter19  7
2       Winter19  10
For this table I need to compute the variable 'HighestStreak'. I managed to do so with the following code:
for userid, term in zip(appData.userid, appData.term):
    final_streak = 1
    for i in answers[(answers.userid==userid) & (answers.term==term)].time.dt.date.unique():
        temp_streak = 1
        while i + pd.DateOffset(days=1) in answers[(answers.userid==userid) & (answers.term==term)].time.dt.date.unique():
            i += pd.DateOffset(days=1)
            temp_streak += 1
        if temp_streak > final_streak:
            final_streak = temp_streak
    appData.loc[(appData.userid==userid) & (appData.term==term), 'HighestStreak'] = final_streak
Unfortunately, running this code takes about 45 minutes. The table 'answers' has about 4,000 lines. Is there any structural 'mistake' in my code that makes it so slow or do processes like this take that amount of time?
Any help would be highly appreciated!
EDIT:
I managed to increase the speed from 45 minutes to 2 minutes with the following change:
I first filtered the data to students who answered at least one question and set the streak to 0 for the rest (since the streak for 0 answers is always 0):
appData.loc[appData.totAnswers==0, 'highestStreak'] = 0
appDataActive = appData[appData.totAnswers!=0]
Furthermore I moved the filtered list out of the loop, so the algorithm does not need to filter twice, resulting in the following new code:
appData.loc[appData.totAnswers==0, 'highestStreak'] = 0
appDataActive = appData[appData.totAnswers!=0]
for userid, term in zip(appData.userid, appData.term):
    activeDays = answers[(answers.userid==userid) & (answers.term==term)].time.dt.date.unique()
    final_streak = 1
    for day in activeDays:
        temp_streak = 1
        while day + pd.DateOffset(days=1) in activeDays:
            day += pd.DateOffset(days=1)
            temp_streak += 1
        if temp_streak > final_streak:
            final_streak = temp_streak
    appData.loc[(appData.userid==userid) & (appData.term==term), 'HighestStreak'] = final_streak
Of course, 2 minutes is much better than 45 minutes. But are there any more tips?
My attempt, which borrows some key ideas from the connected-components problem, a fairly early problem when studying graphs.
First I create a random DataFrame with some user IDs and some dates.
import datetime
import random
import pandas
import numpy

# generate basic dataframe of users and answer dates
def sample_data_frame():
    users = ['A' + str(x) for x in range(10000)]  # generate user ids
    date_range = pandas.Series(pandas.date_range(datetime.date.today() - datetime.timedelta(days=364), datetime.date.today()),
                               name='date')
    users = pandas.Series(users, name='user')
    df = pandas.merge(date_range, users, how='cross')
    removals = numpy.random.randint(0, len(df), int(len(df)/4))  # remove a random quarter of entries
    df.drop(removals, inplace=True)
    return df

def sample_data_frame_v2():  # pandas version < 1.2
    users = ['A' + str(x) for x in range(10000)]  # generate user ids
    date_range = pandas.DataFrame(pandas.date_range(datetime.date.today() - datetime.timedelta(days=364), datetime.date.today()), columns=['date'])
    users = pandas.DataFrame(users, columns=['user'])
    date_range['key'] = 1
    users['key'] = 1
    df = users.merge(date_range, on='key')
    df = df.drop(labels='key', axis=1)
    removals = numpy.random.randint(0, len(df), int(len(df)/4))  # remove a random quarter of entries
    df.drop(removals, inplace=True)
    return df
Put your DataFrame in sorted order by user and then by date, so that each row is followed by that user's next answer day.
Create two new columns containing the user id and the date of the previous row (via shift()).
If the previous row has the same user and its date is exactly one day before the current date, the streak continues, so set the result column to False (numerically 0); otherwise a new streak starts, so set it to True (numerically 1).
Cumulatively sum the result column, which assigns a group number to each streak.
Finally, count how many entries exist per group and take the maximum for each user.
For 10k users over 364 days' worth of answers, my running time is about 1 second.
df = sample_data_frame()
df = df.sort_values(by=['user', 'date']).reset_index(drop = True)
df['shift_date'] = df['date'].shift()
df['shift_user'] = df['user'].shift()
df['result'] = ~((df['shift_date'] == df['date'] - datetime.timedelta(days=1)) & (df['shift_user'] == df['user']))
df['group'] = df['result'].cumsum()
summary = (df.groupby(by=['user', 'group']).count()['result'].max(level='user'))
summary.sort_values(ascending = False) #print user with highest streak
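If you also want the result in the same shape as the appData table from the question, a small follow-up sketch, assuming the summary Series from above (the column name is my choice):
highest = summary.rename('HighestStreak').reset_index()  # columns: user, HighestStreak
print(highest.head())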

combine pandas apply results as multiple columns in a single dataframe

Summary
Suppose you apply a function to a groupby object, so that g.apply for every group g in df.groupby(...) gives you a series/dataframe. How do I combine these results into a single dataframe, but with the group names as columns?
Details
I have a dataframe event_df that looks like this:
index event note time
0 on C 0.5
1 on D 0.75
2 off C 1.0
...
I want to create a sampling of the event for every note, and the sampling is done at times as given by t_df:
index t
0 0
1 0.5
2 1.0
...
So that I'd get something like this.
t C D
0 off off
0.5 on off
1.0 off on
...
What I've done so far:
def get_t_note_series(notedata_row, t_arr):
    """Return the time index in the sampling that corresponds to the event."""
    t_idx = np.argwhere(t_arr >= notedata_row['time']).flatten()[0]
    return t_idx

def get_t_for_gb(group, **kwargs):
    t_idxs = group.apply(get_t_note_series, args=(t_arr,), axis=1)
    t_idxs.rename('t_arr_idx', inplace=True)
    group_with_t = pd.concat([group, t_idxs], axis=1).set_index('t_arr_idx')
    print(group_with_t)
    return group_with_t

t_arr = np.arange(0, 10, 0.5)
t_df = pd.DataFrame({'t': t_arr}).rename_axis('t_arr_idx')
gb = event_df.groupby('note')
gb.apply(get_t_for_gb, **kwargs)
So what I get is a number of dataframes for each note, all of the same size (same as t_df):
t event
0 on
0.5 off
...
t event
0 off
0.5 on
...
How do I go from here to my desired dataframe, with each group corresponding to a column in a new dataframe, and the index being t?
EDIT: Sorry, I didn't take into account that you rescale your time column, and I can't present a whole solution now because I have to leave, but I think you could do the rescaling with pandas.merge_asof on your two dataframes to get the nearest "rescaled" time, and then apply the code below to the merged dataframe. I hope this is what you wanted.
import pandas as pd
import io
sio= io.StringIO("""index event note time
0 on C 0.5
1 on D 0.75
2 off C 1.0""")
df= pd.read_csv(sio, sep='\s+', index_col=0)
df.groupby(['time', 'note']).agg({'event': 'first'}).unstack(-1).fillna('off')
Take the first row in each time-note group with agg({'event': 'first'}), then unstack the note index level so the note values become columns. At the end, fill all cells for which no data points were found with 'off' via fillna.
This outputs:
Out[28]:
event
note C D
time
0.50 on off
0.75 off on
1.00 off off
You might also want to try min or max in case on/off is not unambiguous for a combination of time/note (if there are multiple rows for the same time/note where some have on and some have off) and you prefer one of these values (say, if there is one on, then no matter how many offs there are, you want an on, etc.). If you want something like a majority vote, I would suggest adding a majority-vote column in the aggregated dataframe (before the unstack()), as sketched below.
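A minimal sketch of that majority-vote idea, reusing the df built above; it picks the most frequent event per time/note combination (ties simply take the first of the most frequent values):
majority = (df.groupby(['time', 'note'])['event']
              .agg(lambda s: s.value_counts().idxmax())  # most frequent event per group
              .unstack(-1)
              .fillna('off'))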
Oh so I found it! All I had to do was to unstack the groupby results. Going back to generating the groupby result:
def get_t_note_series(notedata_row, t_arr):
    """Return the time index in the sampling that corresponds to the event."""
    t_idx = np.argwhere(t_arr >= notedata_row['time']).flatten()[0]
    return t_idx

def get_t_for_gb(group, **kwargs):
    t_idxs = group.apply(get_t_note_series, args=(t_arr,), axis=1)
    t_idxs.rename('t_arr_idx', inplace=True)
    group_with_t = pd.concat([group, t_idxs], axis=1).set_index('t_arr_idx')
    ## print(group_with_t) ## unnecessary!
    return group_with_t

t_arr = np.arange(0, 10, 0.5)
t_df = pd.DataFrame({'t': t_arr}).rename_axis('t_arr_idx')
gb = event_df.groupby('note')
result = gb.apply(get_t_for_gb, **kwargs)
At this point, result is a dataframe with note as an index:
>> print(result)
event
note t
C 0 off
0.5 on
1.0 off
....
D 0 off
0.5 off
1.0 on
....
Doing result = result.unstack('note') does the trick:
>> result = result.unstack('note')
>> print(result)
event
note C D
t
0 off off
0.5 on off
1.0 off on
....
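If you want exactly the wide layout from the question (just the C and D columns indexed by t), a small follow-up sketch: after unstack the columns carry an extra 'event' level, and selecting it leaves plain note columns.
result = result['event']   # columns are now just the note values C, D
print(result.head())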

How to store values from loop to a dataframe?

I have created a loop that generates some values. I want to store those values in a data frame. For example, completed one loop, append to the first row.
def calculate(allFiles):
    result = pd.DataFrame(columns = ['Date','Mid Ebb Total','Mid Flood Total','Mid Ebb Control','Mid Flood Control'])
    total_Mid_Ebb = 0
    total_Mid_Flood = 0
    total_Mid_EbbControl = 0
    total_Mid_FloodControl = 0
    for file_ in allFiles:
        xls = pd.ExcelFile(file_)
        df = xls.parse('General Impact')
        Mid_Ebb = df[df['Tidal Mode'] == "Mid-Ebb"] #filter
        Mid_Ebb_control = df[df['Station'].isin(['C1','C2','C3'])] #filter control
        Mid_Flood = df[df['Tidal Mode'] == "Mid-Flood"] #filter
        Mid_Flood_control = df[df['Station'].isin(['C1','C2','C3', 'SR2'])] #filter control
        total_Mid_Ebb += Mid_Ebb.Station.nunique() #count unique stations = sample number
        total_Mid_Flood += Mid_Flood.Station.nunique()
        total_Mid_EbbControl += Mid_Ebb_control.Station.nunique()
        total_Mid_FloodControl += Mid_Flood_control.Station.nunique()
    Mid_Ebb_withoutControl = total_Mid_Ebb - total_Mid_EbbControl
    Mid_Flood_withoutControl = total_Mid_Flood - total_Mid_FloodControl
    print('Ebb Tide: The total number of sample is {}. Number of sample without control station is {}. Number of sample in control station is {}'.format(total_Mid_Ebb, Mid_Ebb_withoutControl, total_Mid_EbbControl))
    print('Flood Tide: The total number of sample is {}. Number of sample without control station is {}. Number of sample in control station is {}'.format(total_Mid_Flood, Mid_Flood_withoutControl, total_Mid_FloodControl))
The dataframe result contains 4 columns. The date is fixed. I would like to put total_Mid_Ebb, Mid_Ebb_withoutControl, total_Mid_EbbControl to the dataframe.
I believe you need to append scalars in the loop to a list of tuples and then use the DataFrame constructor. Finally, compute the differences in the resulting DataFrame:
def calculate(allFiles):
    data = []
    for file_ in allFiles:
        xls = pd.ExcelFile(file_)
        df = xls.parse('General Impact')
        Mid_Ebb = df[df['Tidal Mode'] == "Mid-Ebb"] #filter
        Mid_Ebb_control = df[df['Station'].isin(['C1','C2','C3'])] #filter control
        Mid_Flood = df[df['Tidal Mode'] == "Mid-Flood"] #filter
        Mid_Flood_control = df[df['Station'].isin(['C1','C2','C3', 'SR2'])] #filter control
        total_Mid_Ebb = Mid_Ebb.Station.nunique() #count unique stations = sample number
        total_Mid_Flood = Mid_Flood.Station.nunique()
        total_Mid_EbbControl = Mid_Ebb_control.Station.nunique()
        total_Mid_FloodControl = Mid_Flood_control.Station.nunique()
        data.append((total_Mid_Ebb,
                     total_Mid_Flood,
                     total_Mid_EbbControl,
                     total_Mid_FloodControl))
    cols = ['total_Mid_Ebb','total_Mid_Flood','total_Mid_EbbControl','total_Mid_FloodControl']
    result = pd.DataFrame(data, columns=cols)
    result['Mid_Ebb_withoutControl'] = result.total_Mid_Ebb - result.total_Mid_EbbControl
    result['Mid_Flood_withoutControl'] = result.total_Mid_Flood - result.total_Mid_FloodControl
    # if you want to check all totals
    total = result.sum()
    print(total)
    return result
Here is an example of loading data per column into a dataframe after each iteration of a loop. While this is not the only method, it's one that helps you understand the concept better.
Necessary imports
import pandas as pd
from random import randint
First define an empty data-frame of 5 columns to match your problem
df = pd.DataFrame(columns=['A','B','C','D','E'])
Next we iterate through a for loop, generate a value using randint(), and add one value at a time to each column, starting with 'A' all the way to 'E'.
for i in range(5):  # add 5 rows of data
    df.loc[i, ['A']] = randint(0, 99)
    df.loc[i, ['B']] = randint(0, 99)
    df.loc[i, ['C']] = randint(0, 99)
    df.loc[i, ['D']] = randint(0, 99)
    df.loc[i, ['E']] = randint(0, 99)
We get a DF whose 5 rows are populated.
>>> df
A B C D E
0 4 74 71 37 90
1 41 80 77 81 8
2 14 16 82 98 89
3 1 77 3 56 91
4 34 9 85 44 19
Hope above helps and you are able to tailor to your needs.
Note this does not produce a row per file as requested, but it is more of a general comment about using pandas for problems like this - it is often easier to read all the data and then process it with pandas than to write your own loops over different cases.
I think you are not using pandas in the idiomatic way here. I think you will save a lot of code and get a more understandable result if you do it this way:
controlstations = ['C1', 'C2', 'C3', 'SR2']
df = pd.concat(pd.read_excel(file_, sheetname='General Impact') for file_ in files)
df['Control'] = df.Station.isin(controlstations)
counts = df.groupby(['Control', 'Tidal Mode']).Station.agg('nunique')
So here you are reading all the excel files into a single dataframe first, then adding a column to indicate if that is a control station or not, then using groupby to count the different combinations.
counts is a series with a two-dimensional index (for some made up data):
Control Tidal Mode
False Mid-Ebb 2
Mid-Flood 2
True Mid-Ebb 2
Mid-Flood 2
You can access the values you have in your function like this:
total_Mid_Ebb = counts['Mid-Ebb'].sum()
total_Mid_Ebb_Control = counts['Mid-Ebb', True]
total_Mid_Flood = counts['Mid-Flood'].sum()
total_Mid_Flood_Control = counts['Mid-Flood', True]
After which you can easily add them to a DataFrame:
import datetime
today = datetime.datetime.today()
totals = [total_Mid_Ebb, total_Mid_Flood, total_Mid_Ebb_Control, total_Mid_Flood_Control]
result = pd.DataFrame(data=[totals], columns=['Mid Ebb Total', 'Mid Flood Total', 'Mid Ebb Control', 'Mid Flood Control'],
index=[today])
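If you prefer the date as a column named Date (as in the result frame from the question) rather than as the index, a small follow-up sketch:
result = result.rename_axis('Date').reset_index()  # turn the date index into a 'Date' column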

How to "melt" `pandas.DataFrame` objects in Python 3?

I'm trying to melt certain columns of a pd.DataFrame while preserving the others. In this case, I want to melt the sine and cosine columns into values, record which column each value came from (i.e. sine or cosine) in a new column entitled data_type, and preserve the original desc column.
How can I use pd.melt to achieve this without melting and concatenating each component manually?
# Data
a = np.linspace(0,2*np.pi,100)
DF_data = pd.DataFrame([a, np.sin(np.pi*a), np.cos(np.pi*a)], index=["t", "sine", "cosine"], columns=["t_%d"%_ for _ in range(100)]).T
DF_data["desc"] = ["info about this" for _ in DF_data.index]
The roundabout way I did it:
# Melt each part
DF_melt_A = pd.DataFrame([DF_data["t"],
DF_data["sine"],
pd.Series(DF_data.shape[0]*["sine"], index=DF_data.index, name="data_type"),
DF_data["desc"]]).T.reset_index()
DF_melt_A.columns = ["idx","t","values","data_type","desc"]
DF_melt_B = pd.DataFrame([DF_data["t"],
DF_data["cosine"],
pd.Series(DF_data.shape[0]*["cosine"], index=DF_data.index, name="data_type"),
DF_data["desc"]]).T.reset_index()
DF_melt_B.columns = ["idx","t","values","data_type","desc"]
# Merge
pd.concat([DF_melt_A, DF_melt_B], axis=0, ignore_index=True)
If I do pd.melt(DF_data) I get a complete meltdown.
In response to the comments:
Alright, so I had to create a similar df because I did not have access to your a variable. I changed your a variable to a list from 0 to 99, so t will be 0 to 99.
You could do this:
a = range(0, 100)
DF_data = pd.DataFrame([a, [np.sin(x)for x in a], [np.cos(x)for x in a]], index=["t", "sine", "cosine"], columns=["t_%d"%_ for _ in range(100)]).T
DF_data["desc"] = ["info about this" for _ in DF_data.index]
df = pd.melt(DF_data, id_vars=['t','desc'])
df.head(5)
this should return what you are looking for.
t desc variable value
0 0.0 info about this sine 0.000000
1 1.0 info about this sine 0.841471
2 2.0 info about this sine 0.909297
3 3.0 info about this sine 0.141120
4 4.0 info about this sine -0.756802
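If you also want the column names asked for in the question (data_type and values), melt's var_name and value_name arguments rename the melted columns; a small variation on the snippet above:
df = pd.melt(DF_data, id_vars=['t', 'desc'],
             var_name='data_type', value_name='values')
df.head(5)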
