I have a loop that generates some values, and I want to store those values in a DataFrame: after each iteration of the loop, the results should be appended as a new row.
def calculate(allFiles):
    result = pd.DataFrame(columns=['Date', 'Mid Ebb Total', 'Mid Flood Total', 'Mid Ebb Control', 'Mid Flood Control'])
    total_Mid_Ebb = 0
    total_Mid_Flood = 0
    total_Mid_EbbControl = 0
    total_Mid_FloodControl = 0
    for file_ in allFiles:
        xls = pd.ExcelFile(file_)
        df = xls.parse('General Impact')
        Mid_Ebb = df[df['Tidal Mode'] == "Mid-Ebb"]  #filter
        Mid_Ebb_control = df[df['Station'].isin(['C1', 'C2', 'C3'])]  #filter control
        Mid_Flood = df[df['Tidal Mode'] == "Mid-Flood"]  #filter
        Mid_Flood_control = df[df['Station'].isin(['C1', 'C2', 'C3', 'SR2'])]  #filter control
        total_Mid_Ebb += Mid_Ebb.Station.nunique()  #count unique stations = sample number
        total_Mid_Flood += Mid_Flood.Station.nunique()
        total_Mid_EbbControl += Mid_Ebb_control.Station.nunique()
        total_Mid_FloodControl += Mid_Flood_control.Station.nunique()
    Mid_Ebb_withoutControl = total_Mid_Ebb - total_Mid_EbbControl
    Mid_Flood_withoutControl = total_Mid_Flood - total_Mid_FloodControl
    print('Ebb Tide: The total number of sample is {}. Number of sample without control station is {}. Number of sample in control station is {}'.format(total_Mid_Ebb, Mid_Ebb_withoutControl, total_Mid_EbbControl))
    print('Flood Tide: The total number of sample is {}. Number of sample without control station is {}. Number of sample in control station is {}'.format(total_Mid_Flood, Mid_Flood_withoutControl, total_Mid_FloodControl))
The result dataframe has a fixed Date column plus four value columns. I would like to put total_Mid_Ebb, Mid_Ebb_withoutControl and total_Mid_EbbControl into the dataframe.
I believe you need to append the scalars inside the loop to a list of tuples and then pass that list to the DataFrame constructor. The count differences are computed last, on the result DataFrame:
def calculate(allFiles):
    data = []
    for file_ in allFiles:
        xls = pd.ExcelFile(file_)
        df = xls.parse('General Impact')
        Mid_Ebb = df[df['Tidal Mode'] == "Mid-Ebb"]  #filter
        Mid_Ebb_control = df[df['Station'].isin(['C1', 'C2', 'C3'])]  #filter control
        Mid_Flood = df[df['Tidal Mode'] == "Mid-Flood"]  #filter
        Mid_Flood_control = df[df['Station'].isin(['C1', 'C2', 'C3', 'SR2'])]  #filter control
        total_Mid_Ebb = Mid_Ebb.Station.nunique()  #count unique stations = sample number
        total_Mid_Flood = Mid_Flood.Station.nunique()
        total_Mid_EbbControl = Mid_Ebb_control.Station.nunique()
        total_Mid_FloodControl = Mid_Flood_control.Station.nunique()
        data.append((total_Mid_Ebb,
                     total_Mid_Flood,
                     total_Mid_EbbControl,
                     total_Mid_FloodControl))

    cols = ['total_Mid_Ebb', 'total_Mid_Flood', 'total_Mid_EbbControl', 'total_Mid_FloodControl']
    result = pd.DataFrame(data, columns=cols)
    result['Mid_Ebb_withoutControl'] = result.total_Mid_Ebb - result.total_Mid_EbbControl
    result['Mid_Flood_withoutControl'] = result.total_Mid_Flood - result.total_Mid_FloodControl

    #if you want to check all totals
    total = result.sum()
    print(total)
    return result
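For completeness, a hedged usage sketch of the function above (the glob pattern and the fixed date value are assumptions, not taken from the question):
import glob
import pandas as pd

allFiles = glob.glob('monitoring_data/*.xlsx')  # hypothetical location of the Excel files
result = calculate(allFiles)

# the question mentions a fixed Date column; insert it with a placeholder value
result.insert(0, 'Date', '2017-05-01')
print(result.head())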
Here is an example of loading data into a dataframe column by column, one row per iteration of a loop. While this is not the only method, it is one that helps to understand the concept.
Necessary imports
import pandas as pd
from random import randint
First, define an empty dataframe with 5 columns to match your problem:
df = pd.DataFrame(columns=['A','B','C','D','E'])
Next, we iterate in a for loop, generate values with randint(), and add one value at a time to each column, starting with 'A' and going through 'E':
for i in range(5): #add 5 rows of data
df.loc[i, ['A']] = randint(0,99)
df.loc[i, ['B']] = randint(0,99)
df.loc[i, ['C']] = randint(0,99)
df.loc[i, ['D']] = randint(0,99)
df.loc[i, ['E']] = randint(0,99)
We get a DF whose 5 rows are populated.
>>> df
A B C D E
0 4 74 71 37 90
1 41 80 77 81 8
2 14 16 82 98 89
3 1 77 3 56 91
4 34 9 85 44 19
Hopefully the above helps and you can tailor it to your needs.
Note this does not produce a row per file as requested; it is more of a general comment about using pandas for problems like this: it is often easier to read all the data in first and then process it with pandas than to write your own loops over the different cases.
I don't think you are using pandas idiomatically here. You will save a lot of code and get a more understandable result if you do it this way:
controlstations = ['C1', 'C2', 'C3', 'SR2']
df = pd.concat(pd.read_excel(file_, sheet_name='General Impact') for file_ in files)
df['Control'] = df.Station.isin(controlstations)
counts = df.groupby(['Tidal Mode', 'Control']).Station.agg('nunique')
So here you are reading all the excel files into a single dataframe first, then adding a column to indicate if that is a control station or not, then using groupby to count the different combinations.
counts is a series with a two-dimensional index (for some made up data):
Tidal Mode  Control
Mid-Ebb     False      2
            True       2
Mid-Flood   False      2
            True       2
You can access the values you have in your function like this:
total_Mid_Ebb = counts['Mid-Ebb'].sum()
total_Mid_Ebb_Control = counts['Mid-Ebb', True]
total_Mid_Flood = counts['Mid-Flood'].sum()
total_Mid_Flood_Control = counts['Mid-Flood', True]
After which you can easily add them to a DataFrame:
import datetime
today = datetime.datetime.today()
totals = [total_Mid_Ebb, total_Mid_Flood, total_Mid_Ebb_Control, total_Mid_Flood_Control]
result = pd.DataFrame(data=[totals],
                      columns=['Mid Ebb Total', 'Mid Flood Total', 'Mid Ebb Control', 'Mid Flood Control'],
                      index=[today])
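To make this concrete, here is a minimal self-contained sketch on made-up data (the station names and counts are invented for illustration):
import pandas as pd

# made-up sample standing in for one parsed 'General Impact' sheet
df = pd.DataFrame({
    'Station':    ['C1', 'C2', 'S1', 'S2', 'C1', 'C3', 'S1', 'S3'],
    'Tidal Mode': ['Mid-Ebb', 'Mid-Ebb', 'Mid-Ebb', 'Mid-Ebb',
                   'Mid-Flood', 'Mid-Flood', 'Mid-Flood', 'Mid-Flood'],
})
controlstations = ['C1', 'C2', 'C3', 'SR2']
df['Control'] = df.Station.isin(controlstations)
counts = df.groupby(['Tidal Mode', 'Control']).Station.agg('nunique')

print(counts['Mid-Ebb'].sum())  # total unique Mid-Ebb stations: 4
print(counts['Mid-Ebb', True])  # unique Mid-Ebb control stations: 2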
I am currently working on a data science project. The idea is to clean the data from "glassdoor_jobs.csv" and present it in a much more understandable manner.
import pandas as pd
df = pd.read_csv('glassdoor_jobs.csv')
#salary parsing
#Removing "-1" Ratings
#Clean up "Founded"
#state field
#Parse out job description
df['hourly'] = df['Salary Estimate'].apply(lambda x: 1 if 'per hour' in x.lower() else 0)
df['employer_provided'] = df['Salary Estimate'].apply(lambda x: 1 if 'employer provided salary' in x.lower() else 0)
df = df[df['Salary Estimate'] != '-1']
Salary = df['Salary Estimate'].apply(lambda x: x.split('(')[0])
minus_Kd = Salary.apply(lambda x: x.replace('K', '').replace('$',''))
minus_hr = minus_Kd.apply(lambda x: x.lower().replace('per hour', '').replace('employer provided salary:', ''))
df['min_salary'] = minus_hr.apply(lambda x: int(x.split('-')[0]))
df['max_salary'] = minus_hr.apply(lambda x: int(x.split('-')[1]))
I am getting the error on that last line. After digging a bit, I found out that in minus_hr some of the 'Salary Estimate' values only have one number instead of a range:
index  Salary Estimate
0      150
1      58
2      130
3      125-150
4      110-140
5      200
6      67- 77
And so on. Now I'm trying to figure out how to work around the "list index out of range", and make max_salary the same as the min_salary for the cells with only one value.
I am also trying to get the average of the min and max salary, and if the cell only has a single value, make that value the average.
So in the end, something like index 0 would look like:
index  min  max  average
0      150  150  150
You'll have to add in a conditional statement somewhere.
df['min_salary'] = minus_hr.apply(lambda x: int(x.split('-')[0]) if '-' in x else int(x))
The above might do it, or you can define a function.
def max_salary(cell_value):
    # take the part after the dash if there is a range, otherwise the single value
    if '-' in cell_value:
        max_salary = cell_value.split('-')[1]
    else:
        max_salary = cell_value
    return int(max_salary)

df['max_salary'] = minus_hr.apply(max_salary)

def avg_salary(cell_value):
    # average the two numbers of a range, otherwise use the single value
    if '-' in cell_value:
        salaries = [int(s) for s in cell_value.split('-')]
        avg = sum(salaries) / len(salaries)
    else:
        avg = int(cell_value)
    return avg

df['avg_salary'] = minus_hr.apply(avg_salary)
Swap in min_salary and repeat; a sketch of that helper follows below.
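A minimal sketch of the min_salary helper, mirroring the max_salary function above and assuming the same minus_hr series:
def min_salary(cell_value):
    # take the part before the dash if there is a range, otherwise the single value
    if '-' in cell_value:
        return int(cell_value.split('-')[0])
    return int(cell_value)

df['min_salary'] = minus_hr.apply(min_salary)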
Test the length of x.split('-') before accessing the elements.
def split_salary(x):
    # helper that tests the length of the split, per the suggestion above
    salaries = x.split('-')
    if len(salaries) == 1:
        # only one salary number is given, so assign the same value to min and max
        return int(salaries[0]), int(salaries[0])
    # two salary numbers are given
    return int(salaries[0]), int(salaries[1])

df[['min_salary', 'max_salary']] = pd.DataFrame(minus_hr.apply(split_salary).tolist(), index=df.index)
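A quick sanity check of the helper on the two value shapes seen in the question:
print(split_salary('150'))      # -> (150, 150)
print(split_salary('125-150'))  # -> (125, 150)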
If you want to avoid .apply()...
Try:
import numpy as np
# extract the two numbers (if there are two numbers) from the 'Salary Estimate' column
sals = df['Salary Estimate'].str.extractall(r'(?P<min_salary>\d+)[^0-9]*(?P<max_salary>\d+)?')
# reset the new frame's index
sals = sals.reset_index()
# join the extracted min/max salary columns to the original dataframe and fill any blanks with nan
df = df.join(sals[['min_salary', 'max_salary']].fillna(np.nan))
# fill any nan values in the 'max_salary' column with values from the 'min_salary' column
df['max_salary'] = df['max_salary'].fillna(df['min_salary'])
# set the type of the columns to int
df['min_salary'] = df['min_salary'].astype(int)
df['max_salary'] = df['max_salary'].astype(int)
# calculate the average
df['average_salary'] = df.loc[:,['min_salary', 'max_salary']].mean(axis=1).astype(int)
# see what you've got
print(df)
Or without using regex:
import numpy as np
# extract the two numbers (if there are two numbers) from the 'Salary Estimate' column
df['sals'] = df['Salary Estimate'].str.split('-')
# expand the list in sals to two columns filling with nan
df[['min_salary', 'max_salary']] = pd.DataFrame(df.sals.tolist()).fillna(np.nan)
# delete the sals column
del df['sals']
# fill any nan values in the 'max_salary' column with values from the 'min_salary' column
df['max_salary'] = df['max_salary'].fillna(df['min_salary'])
# set the type of the columns to int
df['min_salary'] = df['min_salary'].astype(int)
df['max_salary'] = df['max_salary'].astype(int)
# calculate the average
df['average_salary'] = df.loc[:, ['min_salary', 'max_salary']].mean(axis=1).astype(int)
# see what you've got
print(df)
Output:
Salary Estimate min_salary max_salary average_salary
0 150 150 150 150
1 58 58 58 58
2 130 130 130 130
3 125-150 125 150 137
4 110-140 110 140 125
5 200 200 200 200
6 67- 77 67 77 72
I have the following problem: I have data (table called 'answers') of a quiz application including the answered questions per user with the respective answering date (one answer per line), e.g.:
UserID  Time                 Term      QuestionID  Answer
1       2019-12-28 18:25:15  Winter19  345         a
2       2019-12-29 20:15:13  Winter19  734         b
I would like to write an algorithm to determine whether a user has used the quiz application several days in a row (a so-called 'streak'). Therefore, I want to create a table ('appData') with the following information:
UserID  Term      HighestStreak
1       Winter19  7
2       Winter19  10
For this table I need to compute the variable 'HighestStreak'. I managed to do so with the following code:
for userid, term in zip(appData.userid, appData.term):
    final_streak = 1
    for i in answers[(answers.userid==userid) & (answers.term==term)].time.dt.date.unique():
        temp_streak = 1
        while i + pd.DateOffset(days=1) in answers[(answers.userid==userid) & (answers.term==term)].time.dt.date.unique():
            i += pd.DateOffset(days=1)
            temp_streak += 1
        if temp_streak > final_streak:
            final_streak = temp_streak
    appData.loc[(appData.userid==userid) & (appData.term==term), 'HighestStreak'] = final_streak
Unfortunately, running this code takes about 45 minutes. The table 'answers' has about 4,000 lines. Is there any structural 'mistake' in my code that makes it so slow or do processes like this take that amount of time?
Any help would be highly appreciated!
EDIT:
I managed to increase the speed from 45 minutes to 2 minutes with the following change:
First, I filtered the data to students who answered at least one question and set the streak to 0 for the rest (as the streak for 0 answers is always 0):
appData.loc[appData.totAnswers==0, 'HighestStreak'] = 0
appDataActive = appData[appData.totAnswers!=0]
Furthermore I moved the filtered list out of the loop, so the algorithm does not need to filter twice, resulting in the following new code:
appData.loc[appData.totAnswers==0, 'HighestStreak'] = 0
appDataActive = appData[appData.totAnswers!=0]

for userid, term in zip(appDataActive.userid, appDataActive.term):
    activeDays = answers[(answers.userid==userid) & (answers.term==term)].time.dt.date.unique()
    final_streak = 1
    for day in activeDays:
        temp_streak = 1
        while day + pd.DateOffset(days=1) in activeDays:
            day += pd.DateOffset(days=1)
            temp_streak += 1
        if temp_streak > final_streak:
            final_streak = temp_streak
    appData.loc[(appData.userid==userid) & (appData.term==term), 'HighestStreak'] = final_streak
Of course, 2 minutes is much better than 45 minutes. But are there any more tips?
My attempt borrows some key ideas from the connected-components problem, a fairly early problem you meet when looking at graphs.
First, I create a random DataFrame with some user IDs and some dates.
import datetime
import random
import pandas
import numpy

# generate basic dataframe of users and answer dates
def sample_data_frame():
    users = ['A' + str(x) for x in range(10000)]  # generate user ids
    date_range = pandas.Series(pandas.date_range(datetime.date.today() - datetime.timedelta(days=364), datetime.date.today()),
                               name='date')
    users = pandas.Series(users, name='user')
    df = pandas.merge(date_range, users, how='cross')
    removals = numpy.random.randint(0, len(df), int(len(df) / 4))  # remove a random quarter of entries
    df.drop(removals, inplace=True)
    return df

def sample_data_frame_v2():  # pandas version < 1.2
    users = ['A' + str(x) for x in range(10000)]  # generate user ids
    date_range = pandas.DataFrame(pandas.date_range(datetime.date.today() - datetime.timedelta(days=364), datetime.date.today()), columns=['date'])
    users = pandas.DataFrame(users, columns=['user'])
    date_range['key'] = 1
    users['key'] = 1
    df = users.merge(date_range, on='key')
    df = df.drop(labels='key', axis=1)
    removals = numpy.random.randint(0, len(df), int(len(df) / 4))  # remove a random quarter of entries
    df.drop(removals, inplace=True)
    return df
1. Sort the DataFrame by user and then by date, so that each row is followed by that user's next answer day.
2. Create two new columns holding the user id and the date of the previous row (via shift()).
3. If the previous row has the same user and its date is exactly one day before the current date, the streak continues, so set the result column to False (numerically 0); otherwise it is the start of a new streak, so set it to True (numerically 1).
4. Cumulatively sum the result column; this assigns a group number to each streak.
5. Finally, count how many entries exist per group and take the max per user.
For 10k users over 364 days' worth of answers, the running time is about 1 second:
df = sample_data_frame()
df = df.sort_values(by=['user', 'date']).reset_index(drop = True)
df['shift_date'] = df['date'].shift()
df['shift_user'] = df['user'].shift()
df['result'] = ~((df['shift_date'] == df['date'] - datetime.timedelta(days=1)) & (df['shift_user'] == df['user']))
df['group'] = df['result'].cumsum()
summary = (df.groupby(by=['user', 'group']).count()['result'].max(level='user'))
summary.sort_values(ascending = False) #print user with highest streak
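To see the shift/cumsum grouping in isolation, here is a tiny hand-made check (dates and user ids are made up; it reuses the imports from the snippet above):
tiny = pandas.DataFrame({
    'user': ['A0'] * 4 + ['A1'] * 2,
    'date': pandas.to_datetime(['2021-01-01', '2021-01-02', '2021-01-04', '2021-01-05',
                                '2021-01-01', '2021-01-02']),
})
tiny = tiny.sort_values(by=['user', 'date']).reset_index(drop=True)
tiny['new_streak'] = ~((tiny['date'].shift() == tiny['date'] - datetime.timedelta(days=1))
                       & (tiny['user'].shift() == tiny['user']))
tiny['group'] = tiny['new_streak'].cumsum()
print(tiny.groupby(['user', 'group']).size().groupby(level='user').max())
# expected: A0 -> 2, A1 -> 2 (two separate two-day streaks for A0, one for A1)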
I am reading from a CSV file using Python and I pulled the data from one column. Every 15 lines is a set of data for one category, but only the first 5 lines of each set are relevant. How can I read the first 5 lines of every 15 lines from a total of 205 lines? My current code reads every other 15 up to a point and then begins to misalign; the blue rectangles in the linked image [1] show where it starts to stray. Here is an image of my data format from the column:
inp = pd.read_csv(input_dir)
win = inp['File '][1:]
ny4 = inp['Unnamed: 24']
df = pd.DataFrame({"Ny/4": ny4})
ny4_len = len(ny4)
#
for i in win:
    itr.append(i.split('_'))
for e in itr:
    plh1.append(e[4:e.index("SFRP")])
plh1 = plh1[1::13]
win_df = pd.DataFrame({'Window': plh1})
for u in win_df['Window']:
    plh2.append(k.join(u))
chicken = len(df)/12
#kow = list(islice(ny4,1, 17))
#
maxm = pd.concat(list(map(lambda x: x[1:6], np.array_split(ny4, 17))), ignore_index=True)
plh2_df = pd.DataFrame({'Window Name': plh2})
ny4_data = pd.DataFrame(np.reshape(maxm.values, (17, 5)), columns=['Center', 'UL', 'UR', 'LL', 'LR'])
conc = pd.concat([plh2_df, ny4_data], axis=1, sort=True)
[1]: https://i.stack.imgur.com/yMqw8.png
Use pd.concat with np.array_split:
print(pd.concat(list(map(lambda x: x[:5], np.array_split(df, len(df) / 15))), ignore_index=True))
It should work now.
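As a rough illustration of what the split-and-slice does, here is a toy sketch (the chunk size of 15 and the keep-first-5 rule follow the question; the data is made up):
import numpy as np
import pandas as pd

col = pd.Series(range(45))                    # toy stand-in for the 205-line column
chunks = np.array_split(col, len(col) // 15)  # three chunks of 15 lines each
first_five = pd.concat([chunk.iloc[:5] for chunk in chunks], ignore_index=True)
print(first_five.tolist())                    # [0..4, 15..19, 30..34]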
I have a data frame with 384 rows (plus an additional dummy one at the beginning).
Each row has 4 variables I entered manually, and 3 calculated fields based on those 4 variables.
There are 3 more fields that compare each calculated variable to the row before. Each field can take one of two values (basically True/False).
Final goal: I want to arrange the data frame so that each of the 64 possible combinations of the 6 calculated fields (2^6) occurs 6 times (2^6 * 6 = 384).
Each iteration builds a frequency table (pivot), and if one of the counts differs from 6 it breaks and re-randomizes the order.
The problem is that there are 384!-12*6! possible combinations, and my computer has been running the following script for over 4 days without finding a solution.
import pandas as pd
from numpy import random

# a function that calculates if a row is congruent or in-congruent
def set_cong(df):
    if df["left"] > df["right"] and df["left_size"] > df["right_size"] or df["left"] < df["right"] and df["left_size"] < df["right_size"]:
        return "Cong"
    else:
        return "InC"

# open file and calculate the basic fields
DF = pd.read_csv("generator.csv")
DF["distance"] = abs(DF.right - DF.left)
DF["CR"] = DF.left > DF.right
DF["Cong"] = DF.apply(set_cong, axis=1)

again = 1
# main loop to try and find an optimal order
while again == 1:
    # make a copy of the DF to not have to load it each iteration
    df = DF.copy()
    again = 0
    df["rand"] = [random.randint(low=1, high=100000) for i in range(df.shape[0])]
    # as 3 of the fields are calculated based on the previous row, the first row is a dummy and must stay first when sorted
    df.loc[0, "rand"] = 0
    Sorted = df.sort_values(['rand'])
    Sorted["Cong_n1"] = Sorted.Cong.eq(Sorted.Cong.shift())
    Sorted["Side_n1"] = Sorted.CR.eq(Sorted.CR.shift())
    Sorted["Dist_n1"] = Sorted.distance.eq(Sorted.distance.shift())
    # here the dummy is deleted
    Sorted = Sorted.drop(0, axis=0)
    grouped = Sorted.groupby(['distance', 'CR', 'Cong', 'Cong_n1', 'Dist_n1', 'Side_n1'])
    for name, group in grouped:
        if group.shape[0] != 6:
            again = 1
            break

Sorted.to_csv("Edos.csv", sep="\t", index=False)
print("bye")
the data frame looks like this:
left  right  size_left  size_right  distance  cong  CR  distance_n1  cong_n1  side_n1
1     6      22         44          5         T     F   dummy        dummy    dummy
5     4      44         22          1         T     T   F            T        F
2     3      44         22          1         F     F   T            F        F
I have the following issue:
I have a dataframe with 3 columns: the first is UserID, the second is InvoiceType, and the third is the creation time of the invoice (CreateTime).
df = pd.read_csv('invoice.csv')
Output:
UserID  InvoiceType  CreateTime
1       a            2018-01-01 12:31:00
2       b            2018-01-01 12:34:12
3       a            2018-01-01 12:40:13
1       c            2018-01-09 14:12:25
2       a            2018-01-12 14:12:29
1       b            2018-02-08 11:15:00
2       c            2018-02-12 10:12:12
I am trying to plot the invoice cycle for each user. I need to create 2 new columns, time_diff and time_diff_wrt_first_invoice. time_diff will hold the time difference between consecutive invoices for each user, and time_diff_wrt_first_invoice will hold the time difference between each invoice and the user's first invoice, which is useful for plotting. This is my code:
"""
********** Exploding a variable that is a list in each dataframe cell
"""
def explode_list(df,x):
return (df[x].apply(pd.Series)
.stack()
.reset_index(level = 1, drop=True)
.to_frame(x))
"""
****** applying explode_list to all the columns ******
"""
def explode_listDF(df):
exploaded_df = pd.DataFrame()
for x in df.columns.tolist():
exploaded_df = pd.concat([exploaded_df, explode_list(df,x)],
axis = 1)
return exploaded_df
"""
******** Getting the time difference column in pivot table format
"""
def pivoted_diffTime(df1, _freq=60):
# _ freq is 1 for minutes frequency
# _freq is 60 for hour frequency
# _ freq is 60*24 for daily frequency
# _freq is 60*24*30 for monthly frequency
df = df.sort_values(['UserID', 'CreateTime'])
df_pivot = df.pivot_table(index = 'UserID',
aggfunc= lambda x : list(v for v in x)
)
df_pivot['time_diff'] = [[0]]*len(df_pivot)
for user in df_pivot.index:
try:
_list = [0]+[math.floor((x - y).total_seconds()/(60*_freq))
for x,y in zip(df_pivot.loc[user, 'CreateTime'][1:],
df_pivot.loc[user, 'CreateTime'][:-1])]
df_pivot.loc[user, 'time_diff'] = _list
except:
print('There is a prob here :', user)
return df_pivot
"""
***** Pipelining the two functions to obtain an exploaded dataframe
with time difference ******
"""
def get_timeDiff(df, _frequency):
df = explode_listDF(pivoted_diffTime(df, _freq=_frequency))
return df
And once I have time_diff, I create time_diff_wrt_first_invoice this way:
# We initialize this variable
df_with_timeDiff['time_diff_wrt_first_invoice'] = [[0]] * len(df_with_timeDiff)

# Then we loop over users and apply a cumulative sum over time_diff
for user in df_with_timeDiff.UserID.unique():
    df_with_timeDiff.loc[df_with_timeDiff.UserID==user, 'time_diff_wrt_first_invoice'] = np.cumsum(df_with_timeDiff.loc[df_with_timeDiff.UserID==user, 'time_diff'])
The problem is that I have a dataframe with hundreds of thousands of users and it's so time consuming. I am wondering if there is a solution that fits better my need.
Check out .loc[] for pandas.
df_1 = pd.DataFrame(some_stuff)
df_2 = df_1.loc[df_1['column'] >= some_condition, 'specific_column']
You can access specific columns and check rows against certain types of conditions, and if you add a comma after the condition followed by a specific column name, only that column is returned.
I'm not 100% sure that answers your question, because I didn't actually see one, but it seemed like you were running a lot of for loops to isolate columns, which is what .loc[] is for.
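As a minimal concrete sketch of that pattern (the frame and the threshold are made up):
import pandas as pd

df_1 = pd.DataFrame({'price': [10, 25, 40], 'ticker': ['AAA', 'BBB', 'CCC']})
# keep only the 'ticker' column for the rows where 'price' meets the condition
df_2 = df_1.loc[df_1['price'] >= 20, 'ticker']
print(df_2.tolist())  # ['BBB', 'CCC']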
I have found a better solution. Here's my code :
def next_diff(x):
    return [0] + [(b - a).total_seconds() / 3600 for b, a in zip(x[1:], x[:-1])]

def create_timediff(df):
    df.sort_values(['UserID', 'CreateTime'], inplace=True)
    a = df.groupby('UserID').agg({'CreateTime': lambda x: list(v for v in x)}).CreateTime.apply(next_diff)
    b = a.apply(np.cumsum)
    a = a.reset_index()
    b = b.reset_index()

    # Here I explode the lists inside the cells
    rows1 = []
    _ = a.apply(lambda row: [rows1.append([row['UserID'], nn])
                             for nn in row.CreateTime], axis=1)
    rows2 = []
    __ = b.apply(lambda row: [rows2.append([row['UserID'], nn])
                              for nn in row.CreateTime], axis=1)

    df1_new = pd.DataFrame(rows1, columns=a.columns).set_index(['UserID'])
    df2_new = pd.DataFrame(rows2, columns=b.columns).set_index(['UserID'])

    df = df.set_index('UserID')
    df['time_diff'] = df1_new['CreateTime']
    df['time_diff_wrt_first_invoice'] = df2_new['CreateTime']
    df.reset_index(inplace=True)

    return df
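A hedged usage sketch on the sample data from the question (column names follow the question; the printed values are hour differences):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'UserID': [1, 2, 3, 1, 2, 1, 2],
    'InvoiceType': ['a', 'b', 'a', 'c', 'a', 'b', 'c'],
    'CreateTime': pd.to_datetime(['2018-01-01 12:31:00', '2018-01-01 12:34:12',
                                  '2018-01-01 12:40:13', '2018-01-09 14:12:25',
                                  '2018-01-12 14:12:29', '2018-02-08 11:15:00',
                                  '2018-02-12 10:12:12']),
})
out = create_timediff(df)
print(out[['UserID', 'time_diff', 'time_diff_wrt_first_invoice']])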