I have a dataframe like this:
Application|Category|Feature|Scenario|Result|Exec_Time
A1|C1|F1|scenario1|PASS|2.3
A1|C1|F1|scenario2|FAIL|20.3
A2|C1|F3|scenario3|PASS|12.3
......
The outcome I am looking for is a pivot with the count of results by Feature and also the sum of exec times, like this:
Application|Category|Feature|Count of PASS|Count of FAIL|SumExec_Time
A1|C1|F1|200|12|45.62
A1|C1|F2|90|0|15.11
A1|C2|F3|97|2|33.11
I built individual dataframes to pivot the result counts and the sum of execution times by feature, but I am not able to merge those dataframes into my final expected outcome.
dfr = pd.pivot_table(df, index=["Application", "Category", "Feature"],
                     values=["Final_Result"], aggfunc=[len])
dft = pd.pivot_table(df, index=["Application", "Category", "Feature"],
                     values=["Exec_time_mins"], aggfunc=[np.sum])
You don't need to merge results here; you can build this with a single pivot_table or a groupby/apply. I don't have your data, but does this get you what you want?
pivot = pd.pivot_table(df, index=["Application", "Category", "Feature"],
                       values=["Final_Result", "Exec_time_mins"],
                       aggfunc=[len, np.sum])
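If you need the PASS and FAIL counts as separate columns right away, a named-aggregation sketch is another option. This assumes pandas >= 1.0 and the column names Result and Exec_Time from the sample data (the output names pass_count, fail_count and sum_exec_time are mine, not from the question):

import pandas as pd

# Count PASS and FAIL separately and sum the execution time in one groupby.
out = (df.groupby(["Application", "Category", "Feature"])
         .agg(pass_count=("Result", lambda s: (s == "PASS").sum()),
              fail_count=("Result", lambda s: (s == "FAIL").sum()),
              sum_exec_time=("Exec_Time", "sum"))
         .reset_index())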
#Count total records, number of FAILs and total time.
df2 = df.groupby(by=['Application', 'Category', 'Feature']).agg(
    {'Result': [len, lambda x: len(x[x == 'FAIL'])], 'Exec_Time': sum})
#rename columns
df2.columns=['Count of PASS','Count of FAIL','SumExec_Time']
#calculate number of pass
df2['Count of PASS']-=df2['Count of FAIL']
#reset index
df2.reset_index(inplace=True)
df2
Out[1197]:
Application Category Feature Count of PASS Count of FAIL SumExec_Time
0 A1 C1 F1 1 1 22.6
1 A2 C1 F3 1 0 12.3
I have a dataframe like the one below
id,status,amount,qty
1,pass,123,4500
1,pass,156,3210
1,fail,687,2137
1,fail,456,1236
2,pass,216,324
2,pass,678,241
2,nan,637,213
2,pass,213,543
df = pd.read_clipboard(sep=',')
I would like to do the following:
a) Group by id and compute the pass percentage for each id
b) Group by id and compute the average amount for each id
So I tried the below:
df['amt_avg'] = df.groupby('id')['amount'].mean()
df['pass_pct'] = df.groupby('status').apply(lambda x: x['status']/ x['status'].count())
df['fail_pct'] = df.groupby('status').apply(lambda x: x['status']/ x['status'].count())
but this doesn't work.
I am having trouble getting the pass percentage.
In my real data I have a lot of columns like status for which I have to find the % distribution of a specific value (e.g. pass).
I expect my output to be like the below:
id,pass_pct,fail_pct,amt_avg
1,50,50,2770.75
2,75,0,330.25
Use crosstab, replacing the missing values with the string 'nan', then remove the nan column, multiply by 100, and add the new amt_avg column with DataFrame.join:
s = df.groupby('id')['qty'].mean()
df = (pd.crosstab(df['id'], df['status'].fillna('nan'), normalize=0)
        .drop('nan', axis=1)
        .mul(100)
        .join(s.rename('amt_avg')))
print(df)
fail pass amt_avg
id
1 50.0 50.0 2770.75
2 0.0 75.0 330.25
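Since the real data has many status-like columns, a generalized sketch may help. Here raw stands for the original dataframe read from the clipboard (before df is reassigned above), and the status_cols list, the 'pass' label, and the output column names are assumptions; qty is used for amt_avg to match the expected output:

status_cols = ['status']  # columns that behave like `status`; extend with the real names

# Like the crosstab with normalize=0, missing values stay in the denominator.
pcts = {
    f'{col}_pass_pct': raw.groupby('id')[col].apply(lambda s: s.eq('pass').mean() * 100)
    for col in status_cols
}
out = pd.DataFrame(pcts).join(raw.groupby('id')['qty'].mean().rename('amt_avg'))
print(out)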
I have two different csv files that I have merged into a single data frame and grouped by the 'class_name' column. The groupby works as intended, but I don't know how to perform an operation that compares the groups against one another. From r1.csv the class algebra has gone down by 5 students, so I want -5; calculus has increased by 5, so it has to be +5; this has to be added as a new column in a separate data frame. The same goes for the date arithmetic.
This is what I tried so far
import pandas as pd
report_1_df=pd.read_csv('r1.csv')
report_2_df=pd.read_csv('r2.csv')
for group, elements in pd.concat([report_1_df, report_2_df], axis=0, sort=False).groupby('class_name'):
    print(elements)
I can see that my groupby works. I tried .sum() and .diff(), but neither does what I want. What can I do here? Thanks.
r1.csv
class_name,student_count,start_time,end_time
algebra,15,"2019,Dec,08","2019,Dec,09"
calculus,10,"2019,Dec,08","2019,Dec,09"
statistics,12,"2019,Dec,08","2019,Dec,09"
r2.csv
class_name,student_count,start_time,end_time
calculus,15,"2019,Dec,09","2019,Dec,10"
algebra,10,"2019,Dec,09","2019,Dec,10"
trigonometry,12,"2019,Dec,09","2019,Dec,10"
Needed
class_name,student_count,student_count_change,start_time,start_time_delay,end_time,end_time_delay
algebra,10,-5,"2019,Dec,09",1,"2019,Dec,10",1
calculus,15,5,"2019,Dec,09",1,"2019,Dec,10",1
statistics,12,-12,"2019,Dec,08",0,"2019,Dec,09",0
trigonometry,12,12,"2019,Dec,09",0,"2019,Dec,10",0
Not sure if there's a more direct way, but you can start by appending missing data on both your dfs:
import numpy as np

classes = (df1["class_name"].append(df2["class_name"])).unique()

def fill_data(df):
    for i in np.setdiff1d(classes, df["class_name"].values):
        df.loc[df.shape[0]] = [i, 0, *df.iloc[0, 2:].values]
    return df

df1 = fill_data(df1)
df2 = fill_data(df2)
With the missing classes filled, now you can use groupby to assign a new column for the difference and lastly drop_duplicates:
df = pd.concat([df1,df2],axis=0).reset_index(drop=True)
df["diff"] = df.groupby("class_name")["student_count"].diff().fillna(df["student_count"])
print (df.drop_duplicates("class_name",keep="last"))
class_name student_count start_time end_time diff
4 calculus 15 2019,Dec,09 2019,Dec,10 5.0
5 algebra 10 2019,Dec,09 2019,Dec,10 -5.0
6 trigonometry 12 2019,Dec,09 2019,Dec,10 12.0
7 statistics 0 2019,Dec,09 2019,Dec,10 -12.0
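The expected output also has start_time_delay and end_time_delay columns. A minimal sketch of the date arithmetic, following the same diff pattern (the format string is an assumption based on the "2019,Dec,08" strings; note that the rows synthesized by fill_data copy the first row's dates, so their deltas reflect that filled value rather than the 0 shown in the expected output):

# Parse the "2019,Dec,08"-style strings and diff them per class, in days.
df["start_time"] = pd.to_datetime(df["start_time"], format="%Y,%b,%d")
df["end_time"] = pd.to_datetime(df["end_time"], format="%Y,%b,%d")
grouped = df.groupby("class_name")
df["start_time_delay"] = grouped["start_time"].diff().dt.days.fillna(0).astype(int)
df["end_time_delay"] = grouped["end_time"].diff().dt.days.fillna(0).astype(int)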
I have a big dataframe (~10 million rows). Each row has:
category
start position
end position
If two rows are in the same category and the start and end position overlap with a +-5 tolerance, I want to keep just one of the rows.
For example
1, cat1, 10, 20
2, cat1, 12, 21
3, cat2, 10, 25
I want to filter out 1 or 2.
What I'm doing right now isn't very efficient:
import pandas as pd

df = pd.read_csv('data.csv', sep='\t', header=None)

dfs = {}
for seq in df.category.unique():
    dfs[seq] = df[df.category == seq]

discard = []
rows = []
for index, row in df.iterrows():
    if index in discard:
        continue
    df_2 = dfs[row.category]
    res = df_2[(abs(df_2.start - row.start) <= params['min_distance']) & (abs(df_2.end - row.end) <= params['min_distance'])]
    if len(res.index) > 1:
        discard.extend(res.index.values)
    rows.append(row)
df = pd.DataFrame(rows)
I've also tried a different approach making use of a sorted version of the dataframe.
my_index = 0
indexes = []
discard = []
count = 0
curr = 0
total_len = len(df.index)
while my_index < total_len - 1:
    row = df.iloc[[my_index]]
    cond = True
    next_index = 1
    while cond:
        second_row = df.iloc[[my_index + next_index]]
        c1 = (row.iloc[0].category == second_row.iloc[0].category)
        c2 = (abs(second_row.iloc[0].sstart - row.iloc[0].sstart) <= params['min_distance'])
        c3 = (abs(second_row.iloc[0].send - row.iloc[0].send) <= params['min_distance'])
        cond = c1 and c2 and c3
        if cond and (c2 and c3):
            indexes.append(my_index)
            cond = True
            next_index += 1
    indexes.append(my_index)
    my_index += next_index
indexes.append(total_len - 1)
The problem is that this solution is not perfect; sometimes it misses a row, because the overlapping row could be several rows ahead rather than the next one.
I'm looking for any ideas on how to approach this problem in a more pandas-friendly way, if one exists.
The approach here should be this:
pandas.groupby by categories
agg(Func) on groupby result
the Func should implement the logic of finding the best range inside each category (sorted search, balanced trees, or anything else)
Do you want to merge all similar rows or only 2 consecutive ones?
If all similar, I suggest you first order the rows by category, then by the 2 other columns, and squash the similar ones into a single row.
If only 2 consecutive, check whether the next value is in the range you set and, if yes, merge it. Here you can see how:
merge rows pandas dataframe based on condition
I don't believe the numeric comparisons can be made without a loop, but you can make at least part of this cleaner and more efficient:
dfs = {}
for seq in df.category.unique():
    dfs[seq] = df[df.category == seq]
Instead of this, use df.groupby('category').apply(drop_duplicates).droplevel(0), where drop_duplicates is a function containing your second loop. The function will then be called separately for each category, with a dataframe that contains only that category's rows. The outputs will be combined back into a single dataframe. That result will have a MultiIndex with the value of "category" as an outer level; this can be removed with droplevel(0).
Secondly, within the category you could sort by the first of the two numeric columns for another small speed-up:
def drop_duplicates(df):
    df = df.sort_values("sstart")
    ...
This will allow you to stop the inner loop as soon as the sstart column value is out of range, instead of comparing every row to every other row.
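A minimal sketch of that per-group function, assuming the column names category, sstart, and send from the question's second snippet and a tolerance of 5 (the tol parameter and the kept list are mine, not from the question):

import pandas as pd

def drop_duplicates(group, tol=5):
    # Sort by start position so the scan can stop early once out of range.
    group = group.sort_values("sstart")
    kept = []  # rows already accepted for this category
    for _, row in group.iterrows():
        is_dup = False
        # Compare only against kept rows, walking backwards; since kept rows
        # have non-decreasing sstart, we can break once the gap exceeds tol.
        for kept_row in reversed(kept):
            if row["sstart"] - kept_row["sstart"] > tol:
                break
            if abs(row["send"] - kept_row["send"]) <= tol:
                is_dup = True
                break
        if not is_dup:
            kept.append(row)
    return pd.DataFrame(kept)

deduped = df.groupby("category").apply(drop_duplicates).droplevel(0)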
I'm sure there must be a quick fix for this, but I can't find an answer with a good explanation. I'm looking to iterate over a dataframe and build a crosstab for each pair of columns with pandas. I have subsetted 2 columns from the original data and removed rows with unsuitable data. With the remaining data I am looking to do a crosstab to ultimately build a contingency table for a chi-squared (ChiX) test. Here is my code:
my_data = pd.read_csv(DATA_MATRIX, index_col=0)  # GET DATA
AM = pd.DataFrame(columns=my_data.columns, index=my_data.columns)  # INITIATE DF TO HOLD ChiX-result

for c1 in my_data.columns:
    for c2 in my_data.columns:
        sample_df = pd.DataFrame(my_data, columns=[c1, c2])  # make df to do ChiX on
        sample_df = sample_df[(sample_df[c1] != 0.5) | (sample_df[c2] != 0.5)].dropna()  # remove unsuitable rows
        contingency = pd.crosstab(sample_df[c1], sample_df[c2])  ## This doesn't work?
        # DO ChiX AND STORE P-VALUE IN 'AM': CODE STILL TO WRITE
The dataframe contains the values 0.0, 0.5, and 1.0. The 0.5 marks missing data, so I am removing those rows before making the contingency table; the remaining values that I wish to build the contingency tables from are all either 0.0 or 1.0. I have checked that the code works up to this point. The error printed to the console is:
ValueError: If using all scalar values, you must pass an index
Can anyone explain why this doesn't work, or help solve it in any way? Or, even better, provide an alternative way to do a chi-squared test on the columns? That would be very helpful. Thanks in advance!
EDIT: example of the structure of the first few rows of sample_df
col1 col2
sample1 1 1
sample2 1 1
sample3 0 0
sample4 0 0
sample5 0 0
sample6 0 0
sample7 0 0
sample8 0 0
sample9 0 0
sample10 0 0
sample11 0 0
sample12 1 1
A crosstab between two identical entities is meaningless. pandas is going to tell you:
ValueError: The name col1 occurs multiple times, use a level number
Meaning it assumes you're passing two different columns from a multi-indexed dataframe with the same name.
In your code, you're iterating over columns in a nested loop, so the situation arises where c1 == c2, and pd.crosstab errors out.
The fix would involve adding an if check and skipping that iteration if the columns are equal. So, you'd do:
for c1 in my_data.columns:
    for c2 in my_data.columns:
        if c1 == c2:
            continue
        ...  # rest of your code
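Since the question also asks about running the chi-squared test itself, here is a minimal sketch of the remaining step. It assumes scipy is available and reuses my_data and AM from the question; the shape guard is an extra safety check of mine:

from scipy.stats import chi2_contingency

for c1 in my_data.columns:
    for c2 in my_data.columns:
        if c1 == c2:
            continue
        sample_df = my_data[[c1, c2]]
        sample_df = sample_df[(sample_df[c1] != 0.5) | (sample_df[c2] != 0.5)].dropna()
        contingency = pd.crosstab(sample_df[c1], sample_df[c2])
        # chi2_contingency needs at least a 2x2 table; skip degenerate cases.
        if contingency.shape[0] > 1 and contingency.shape[1] > 1:
            chi2, p_value, dof, expected = chi2_contingency(contingency)
            AM.loc[c1, c2] = p_value  # store the p-value in the results frame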
I have a dataframe df which has the column months_to_maturity, with multiple rows for each months_to_maturity value of 1, 2, etc. I am trying to keep only the first 3 rows associated with a particular months_to_maturity value. For example, for months_to_maturity = 1 I would like to have only 3 associated rows, for months_to_maturity = 2 another 3 rows, and so on. I try to do this using the code below, but get the error IndexError: index 21836 is out of bounds for axis 0 with size 4412, and hence am wondering if there is a better way to do this. pairwise gives the current and next row of the dataframe. The values of months_to_maturity are sorted.
count = 0
for (i1, row1), (i2, row2) in pairwise(df.iterrows()):
    if row1.months_to_maturity == row2.months_to_maturity:
        count = count + 1
        if count == 3:
            df.drop(df.index[i1])
            df = df.reset_index()
    elif row1.months_to_maturity != row2.months_to_maturity:
        count = 0
Thank You
You can do:
df.groupby('months_to_maturity').head(3)
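As a quick illustration on a tiny made-up frame (the values below are assumptions), head(3) keeps the first three rows of each group and preserves the original row order:

import pandas as pd

df = pd.DataFrame({
    "months_to_maturity": [1, 1, 1, 1, 2, 2, 2, 2],
    "price": [10, 11, 12, 13, 20, 21, 22, 23],
})

# Keeps rows with price 10, 11, 12 for group 1 and 20, 21, 22 for group 2.
print(df.groupby("months_to_maturity").head(3))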