I'm sure this is quite simple, but my brain is frozen and there are so many different pivot and transpose methods. A hint would be nice at this stage.
I have this dataframe:
I want this:
I know how to get to here, if that helps, but I'm not sure it does:
FYI - The actual data has more columns and I need to separate out these four based on the "site" column, reformat everything, calculate some percentages, put the pieces back together, and eventually end up with something like this:
I'm hoping that if I can get on the right track for reformatting part of the data, I can repeat the process...
(then I need to figure out how to run a Chi-square test, but that's for later... :-(
The easiest resolution is df.stack:
df = pd.DataFrame({'MIC-m': [138, 3, 22, 45],
                   'MIC-t': [34, 90, 30, 53],
                   'MIC-q': [73, 13, 53, 68],
                   'Total': [229, 229, 229, 229]}, index=['H', 'L', 'M', 'X'])
# Drop Total, because we need the sums of the columns, not the rows
df.drop(columns='Total', inplace=True)
# Get the final result: append a column-total row, transpose, and stack
# (DataFrame.append was removed in pandas 2.0, so build the row and use pd.concat)
df = pd.concat([df, df.sum().rename('Total').to_frame().T]).T.stack().to_frame('count')
yields:
count
MIC-m H 138
L 3
M 22
X 45
Total 208
MIC-t H 34
L 90
M 30
X 53
Total 207
MIC-q H 73
L 13
M 53
X 68
Total 207
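Since the question mentions calculating percentages as a later step, here is a minimal sketch of one possible reading — that each count should be expressed as a share of its column total. The exact percentages wanted aren't specified, so this is an assumption:

```python
import pandas as pd

df = pd.DataFrame({'MIC-m': [138, 3, 22, 45],
                   'MIC-t': [34, 90, 30, 53],
                   'MIC-q': [73, 13, 53, 68]}, index=['H', 'L', 'M', 'X'])

# Each count as a percentage of its column's total
# (df.sum() sums down the columns; div aligns the Series with the columns)
pct = df.div(df.sum()).mul(100).round(1)
print(pct)
```

For MIC-m, for example, H is 138 of 208, i.e. 66.3%.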
I apologize as I prefer to ask questions where I've made an attempt at the code needed to resolve the issue. Here, despite many attempts, I haven't gotten any closer to a resolution (in part because I'm a hobbyist and self-taught). I'm attempting to use two dataframes together to calculate the average values in a specific column, then generate a new column to store that average.
I have two dataframes. The first contains the players and their stats. The second contains a list of each player's opponents during the season.
What I'm attempting to do is use the two dataframes to calculate expected values when facing a specific opponent. Stated otherwise, I'd like to be able to see if a player is performing better or worse than the expected results based on the opponent but first need to calculate the average of their opponents.
My dataframes actually have thousands of players and hundreds of matchups, so I've shortened them here to have a representative dataframe that isn't overwhelming.
The first dataframe (df) contains five columns. Name, STAT1, STAT2, STAT3, and STAT4.
The second dataframe (df_Schedule) has a Name column, but then has a separate column for each opponent faced. df_Schedule usually contains a different number of columns depending on the week of the season; for example, after week 1 there may be four columns, while after week 26 there might be 100 columns. For simplicity's sake, I've included the Name column plus five opponent columns: ['Name', 'Opp1', 'Opp2', 'Opp3', 'Opp4', 'Opp5'].
Using these two dataframes I'm trying to create new columns in the first dataframe (df). EXP1 (for "Expected STAT1"), EXP2, EXP3, EXP4. The expected columns are simply an average of the STAT columns based on the opponents faced during the season. For example, Edgar faced Ralph three times, Marc once and David once. The formula to calculate Edgar's EXP1 is simply:
((Ralph.STAT1 * 3) + (Marc.STAT1 * 1) + (David.STAT1 * 1)) / Number_of_Contests (which is five in this example) = 100.2
import pandas as pd
data = {'Name': ['Edgar', 'Ralph', 'Marc', 'David'],
        'STAT1': [100, 96, 110, 103],
        'STAT2': [116, 93, 85, 100],
        'STAT3': [56, 59, 41, 83],
        'STAT4': [55, 96, 113, 40]}
data2 = {'Name': ['Edgar', 'Ralph', 'Marc', 'David'],
         'Opp1': ['Ralph', 'Edgar', 'David', 'Marc'],
         'Opp2': ['Ralph', 'Edgar', 'David', 'Marc'],
         'Opp3': ['Marc', 'David', 'Edgar', 'Ralph'],
         'Opp4': ['David', 'Marc', 'Ralph', 'Edgar'],
         'Opp5': ['Ralph', 'Edgar', 'David', 'Marc']}
df = pd.DataFrame(data)
df_Schedule = pd.DataFrame(data2)
print(df)
print(df_Schedule)
I would like the result to be something like:
data_Final = {'Name': ['Edgar', 'Ralph', 'Marc', 'David'],
              'STAT1': [100, 96, 110, 103],
              'STAT2': [116, 93, 85, 100],
              'STAT3': [56, 59, 41, 83],
              'STAT4': [55, 96, 113, 40],
              'EXP1': [100.2, 102.6, 101, 105.2],
              'EXP2': [92.8, 106.6, 101.8, 92.8],
              'EXP3': [60.2, 58.4, 72.8, 47.6],
              'EXP4': [88.2, 63.6, 54.2, 98]}
df_Final = pd.DataFrame(data_Final)
print(df_Final)
Is there a way to use the scheduling dataframe to lookup the values of opponents, average them, and then create a new column based on those averages?
Try:
df = df.set_index("Name")
df_Schedule = df_Schedule.set_index("Name")

for i, c in enumerate(df.filter(like="STAT"), 1):
    df[f"EXP{i}"] = df_Schedule.replace(df[c]).mean(axis=1)

print(df.reset_index())
Prints:
Name STAT1 STAT2 STAT3 STAT4 EXP1 EXP2 EXP3 EXP4
0 Edgar 100 116 56 55 100.2 92.8 60.2 88.2
1 Ralph 96 93 59 96 102.6 106.6 58.4 63.6
2 Marc 110 85 41 113 101.0 101.8 72.8 54.2
3 David 103 100 83 40 105.2 92.8 47.6 98.0
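For reference, an alternative to the replace trick is to reshape the schedule to long form and merge the opponents' stats in. This is a sketch for STAT1 only, on the same sample data (the variable names `long`, `merged`, and `exp1` are mine):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Edgar', 'Ralph', 'Marc', 'David'],
                   'STAT1': [100, 96, 110, 103]})
df_Schedule = pd.DataFrame({'Name': ['Edgar', 'Ralph', 'Marc', 'David'],
                            'Opp1': ['Ralph', 'Edgar', 'David', 'Marc'],
                            'Opp2': ['Ralph', 'Edgar', 'David', 'Marc'],
                            'Opp3': ['Marc', 'David', 'Edgar', 'Ralph'],
                            'Opp4': ['David', 'Marc', 'Ralph', 'Edgar'],
                            'Opp5': ['Ralph', 'Edgar', 'David', 'Marc']})

# One row per (player, opponent) pairing
long = df_Schedule.melt(id_vars='Name', value_name='Opponent')
# Attach each opponent's STAT1, then average over each player's opponents
merged = long.merge(df.rename(columns={'Name': 'Opponent'}), on='Opponent')
exp1 = merged.groupby('Name')['STAT1'].mean().rename('EXP1')
print(exp1)
```

Opponents who appear multiple times are naturally weighted by their number of rows, so Edgar's EXP1 comes out to (96*3 + 110 + 103) / 5 = 100.2 as in the question.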
I have a small dataframe with student_id, exam_1, exam_2, exam_3, exam_4, and exam_5 as columns, and five students as rows. What I'd like to do is plot a bar graph showing the exam grades of one student, i.e. one specific row, and ultimately do that for each student, or for a specific student chosen from user input.
For now, though, I'm stuck on how to plot a bar graph for just one specific student.
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({'student_id': [83838, 16373, 93538, 29383, 58585],
                   'exam_1': [80, 95, 90, 75, 50],
                   'exam_2': [60, 92, 88, 85, 40],
                   'exam_3': [70, 55, 75, 45, 60],
                   'exam_4': [55, 95, 45, 80, 55],
                   'exam_5': [91, 35, 92, 90, 75]})
print(df)
Which produces this as output:
student_id exam_1 exam_2 exam_3 exam_4 exam_5
0 83838 80 60 70 55 91
1 16373 95 92 55 95 35
2 93538 90 88 75 45 92
3 29383 75 85 45 80 90
4 58585 50 40 60 55 75
Adding this code below will allow me to select just one specific student ID aka row:
df = df.loc[df['student_id'] == 29383]
print(df)
student_id exam_1 exam_2 exam_3 exam_4 exam_5
3 29383 75 85 45 80 90
From here is where I'd like to plot this particular student's exams in a bar plot.
I tried the code below, but it doesn't display it how I'd like. It seems that the index of this particular student is being used for the tick on the x-axis, as you can see in the image: it shows '3' with some bar plots around it.
exam_plots_for_29383 = df.plot.bar()
plt.show()
Which will output this bar plot:
Dataframe with bar plot. Looks weird.
I tried to transpose the dataframe, which kind of gets me to what I want. I used this code below:
df = df.T
exam_plots_for_29383_T = df.plot.bar()
plt.show()
But I end up with this as a graph:
Transpose of dataframe with bar plot. Looks weird still.
I'm a bit stuck. I know there's a logical way of properly plotting a bar plot from the dataframe, I just can't for the life of me figure it out.
I'd like the bar plot to have:
Exams 1 through 5 show up on the x-axis.
Their values on the y-axis.
Each exam bar in separate color.
The legend showing the colors.
I think the last two options are done automatically. It's just the first two that are breaking my brain. I appreciate any help or tips.
Here's the code in full in case anyone would like to see it without it being split like above.
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({'student_id': [83838, 16373, 93538, 29383, 58585],
                   'exam_1': [80, 95, 90, 75, 50],
                   'exam_2': [60, 92, 88, 85, 40],
                   'exam_3': [70, 55, 75, 45, 60],
                   'exam_4': [55, 95, 45, 80, 55],
                   'exam_5': [91, 35, 92, 90, 75]})
print(df)
df = df.loc[df['student_id'] == 29383]
print(df)
exam_plots_for_29383 = df.plot.bar()
plt.show()
df = df.T
exam_plots_for_29383_T = df.plot.bar()
plt.show()
You are very close. The issue is that your numeric student ID is messing up all of the plots (which is why ID 29383 gives you a bar close to 30,000 in all of your graphs).
Set 'student_id' as the index so that it doesn't get plotted. Now you can plot each student separately by slicing the index with .loc[student_id], or, if you plot the entire DataFrame, it will color each student differently.
df = df.set_index('student_id')
df.loc[29383].plot(kind='bar', figsize=(4,3), rot=30)
Knowing there are 5 exams, you can give each its own color if you really want: use a categorical color palette (tab10). (This only works with Series.plot.)
from matplotlib import cm
df.loc[29383].plot(kind='bar', figsize=(4,3), rot=30, color=cm.tab10.colors[0:5])
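Putting the above together as a runnable script, with axis labels added (the Agg backend line is only there so the script also runs headless; drop it when working interactively):

```python
import matplotlib
matplotlib.use('Agg')  # headless backend; remove this line for interactive use
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'student_id': [83838, 16373, 93538, 29383, 58585],
                   'exam_1': [80, 95, 90, 75, 50],
                   'exam_2': [60, 92, 88, 85, 40],
                   'exam_3': [70, 55, 75, 45, 60],
                   'exam_4': [55, 95, 45, 80, 55],
                   'exam_5': [91, 35, 92, 90, 75]}).set_index('student_id')

# Selecting one row gives a Series whose index (exam_1..exam_5) becomes the x axis
ax = df.loc[29383].plot(kind='bar', rot=30)
ax.set_xlabel('Exam')
ax.set_ylabel('Grade')
plt.tight_layout()
plt.savefig('exams_29383.png')
```

The bar heights are the five exam grades of student 29383, in column order.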
I have a Pandas dataframe with columns labeled Ticks, Water, and Temp, with a few million rows (possibly billions in a complete dataset), but it looks something like this:
...
'Ticks' 'Water' 'Temp'
215 4 26.2023
216 1 26.7324
217 17 26.8173
218 2 26.9912
219 48 27.0111
220 1 27.2604
221 19 27.7563
222 32 28.3002
...
(All temperatures are in ascending order, and all 'ticks' are also linearly spaced and in ascending order too)
What I'm trying to do is reduce the data down to a single 'Water' value for each floored, integer 'Temp' value, keeping just the first 'Tick' value (or the last; it doesn't really have much of an effect on the analysis).
The current direction I'm working in is to start at the first row, save the tick value, and add the water value; then move to the next row and, if its temperature is not a whole integer higher, add its water value too; once the temperature is an integer higher, append the saved 'tick' value, the integer temperature, and the summed water count to a new dataframe.
I'm sure this will work, but I'm thinking there should be a much more efficient way to do it, using some application of df.loc or df.iloc, since everything is nicely in ascending order.
My hopeful output for this would be a much shorter dataset with values that look something like this:
...
'Ticks' 'Water' 'Temp'
215 24 26
219 68 27
222 62 28
...
Use GroupBy.agg and Series.astype:
new_df = (df.groupby(df['Temp'].astype(int))
            .agg({'Ticks': 'first', 'Water': 'sum'})
            # .agg(Ticks=('Ticks', 'first'), Water=('Water', 'sum'))
            .reset_index()
            .reindex(columns=df.columns))
print(new_df)
Output
Ticks Water Temp
0 215 24 26
1 219 68 27
2 222 32 28
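For reference, a self-contained run of the named-aggregation variant (the commented-out line above) on the sample rows from the question:

```python
import pandas as pd

df = pd.DataFrame({'Ticks': [215, 216, 217, 218, 219, 220, 221, 222],
                   'Water': [4, 1, 17, 2, 48, 1, 19, 32],
                   'Temp': [26.2023, 26.7324, 26.8173, 26.9912,
                            27.0111, 27.2604, 27.7563, 28.3002]})

# Group by the floored temperature, keep the first tick, and sum the water counts
new_df = (df.groupby(df['Temp'].astype(int))
            .agg(Ticks=('Ticks', 'first'), Water=('Water', 'sum'))
            .reset_index()
            .reindex(columns=df.columns))
print(new_df)
```

Note that .astype(int) truncates toward zero, which is the same as flooring here because all the temperatures are positive.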
I have some trouble understanding the rules for which ticks you want in the final dataframe, but here is a way to get the indices of all Temps with equal floored value:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import pandas as pd
import numpy as np
data = pd.DataFrame({
    'Ticks': [215, 216, 217, 218, 219, 220, 221, 222],
    'Water': [4, 1, 17, 2, 48, 1, 19, 32],
    'Temp': [26.2023, 26.7324, 26.8173, 26.9912, 27.0111, 27.2604, 27.7563, 28.3002]})
# first floor all temps
data['Temp'] = data['Temp'].apply(np.floor)
# get the indices of all equal temps
groups = data.groupby('Temp').groups
print(groups)
# maybe apply mean?
data = data.groupby('Temp').mean()
print(data)
Hope this helps.
I have two data frames like the ones below, and I want to calculate the correlation coefficient.
It works fine when both columns are completely filled with actual values. But when they are not, zeros get used as values when calculating the correlation coefficient.
For example, Addison's and Caden's weights are 0, and Jack and Noah don't have weights at all. I want to exclude them from the calculation.
(From my attempts, it seems only the common index is considered, i.e. Jack and Noah are automatically excluded – is that right?)
How can I include only the people with non-zero values in the calculation?
Thank you.
import pandas as pd
Weight = {'Name': ["Abigail", "Addison", "Aiden", "Amelia", "Aria",
                   "Ava", "Caden", "Charlotte", "Chloe", "Elijah"],
          'Weight': [10, 0, 12, 20, 25, 10, 0, 18, 16, 13]}
df_wt = pd.DataFrame(Weight)
Score = {'Name': ["Abigail", "Addison", "Aiden", "Amelia", "Aria", "Ava",
                  "Caden", "Charlotte", "Chloe", "Elijah", "Jack", "Noah"],
         'Score': [360, 476, 345, 601, 604, 313, 539, 531, 507, 473, 450, 470]}
df_sc = pd.DataFrame(Score)
print(df_wt.Weight.corr(df_sc.Score))
Mask the non-zero values and take the common index:
df_wt.set_index('Name', inplace=True)
df_sc.set_index('Name', inplace=True)
mask = df_wt['Weight'].ne(0)
common_index = df_wt.loc[mask, :].index
df_wt.loc[common_index, 'Weight'].corr(df_sc.loc[common_index, 'Score'])
0.923425144491911
If both dataframes contains zeros then:
mask1 = df_wt['Weight'].ne(0)
mask2 = df_sc['Score'].ne(0)
common_index = df_wt.loc[mask1, :].index.intersection(df_sc.loc[mask2, :].index)
df_wt.loc[common_index, 'Weight'].corr(df_sc.loc[common_index, 'Score'])
Use map to add the new column, remove the 0 rows by boolean indexing, and last apply your solution within the same DataFrame:
df_wt['Score'] = df_wt['Name'].map(df_sc.set_index('Name')['Score'])
df_wt = df_wt[df_wt['Weight'].ne(0)]
print (df_wt)
Name Weight Score
0 Abigail 10 360
2 Aiden 12 345
3 Amelia 20 601
4 Aria 25 604
5 Ava 10 313
7 Charlotte 18 531
8 Chloe 16 507
9 Elijah 13 473
print (df_wt.Weight.corr(df_wt.Score))
0.923425144491911
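Equivalently, assuming a zero weight really means "missing", you can convert the zeros to NaN and let Series.corr do the work: it aligns the two Series on their index (dropping Jack and Noah, who have no weight row) and excludes NaN pairs (dropping Addison and Caden):

```python
import numpy as np
import pandas as pd

df_wt = pd.DataFrame({'Name': ["Abigail", "Addison", "Aiden", "Amelia", "Aria",
                               "Ava", "Caden", "Charlotte", "Chloe", "Elijah"],
                      'Weight': [10, 0, 12, 20, 25, 10, 0, 18, 16, 13]})
df_sc = pd.DataFrame({'Name': ["Abigail", "Addison", "Aiden", "Amelia", "Aria", "Ava",
                               "Caden", "Charlotte", "Chloe", "Elijah", "Jack", "Noah"],
                      'Score': [360, 476, 345, 601, 604, 313, 539, 531, 507, 473, 450, 470]})

# Treat 0 as missing; corr aligns on the index and drops NaN pairs
weight = df_wt.set_index('Name')['Weight'].replace(0, np.nan)
score = df_sc.set_index('Name')['Score']
print(weight.corr(score))
```

This ends up correlating the same eight people as the masking approach, so it gives the same coefficient.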
I've got a pandas dataframe of golfers' round scores going back to 2003 (approx 300000 rows). It looks something like this:
Date        Golfer          Tournament              Score  Player Total Rounds Played
2008-01-01  Tiger Woods     Invented Tournament R1  72     50
2008-01-01  Phil Mickelson  Invented Tournament R1  73     108
I want the 'Player Total Rounds Played' column to be a running total of the number of rounds (i.e. instance in the dataframe) that a player has played up to that date. Is there a quick way of doing it? My current solution (basically using iterrows and then a one-line function) works fine but will take approx 11hrs to run.
Thanks,
Tom
Here is one way:
df = df.sort_values('Date')
df['Rounds CumSum'] = df.groupby('Golfer')['Rounds'].cumsum()
For example:
import pandas as pd
df = pd.DataFrame([['A', 70, 50],
                   ['B', 72, 55],
                   ['A', 73, 45],
                   ['A', 71, 60],
                   ['B', 74, 55],
                   ['A', 72, 65]],
                  columns=['Golfer', 'Rounds', 'Played'])
df['Rounds CumSum'] = df.groupby('Golfer')['Rounds'].cumsum()
# Golfer Rounds Played Rounds CumSum
# 0 A 70 50 70
# 1 B 72 55 72
# 2 A 73 45 143
# 3 A 71 60 214
# 4 B 74 55 146
# 5 A 72 65 286
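Note that the question asks for a running count of rounds played (one per row), rather than a running sum of a numeric column. If a count is what's needed, GroupBy.cumcount gives it directly; a sketch on the same toy frame (the column name 'Total Rounds Played' is mine):

```python
import pandas as pd

df = pd.DataFrame([['A', 70, 50],
                   ['B', 72, 55],
                   ['A', 73, 45],
                   ['A', 71, 60],
                   ['B', 74, 55],
                   ['A', 72, 65]],
                  columns=['Golfer', 'Rounds', 'Played'])

# Running count of each golfer's appearances so far (cumcount is 0-based, so add 1)
df['Total Rounds Played'] = df.groupby('Golfer').cumcount() + 1
print(df)
```

As with the cumsum, sort by date first on the real data so the count runs in chronological order.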