I have two data frames like the ones below and I want to calculate the correlation coefficient.
It works fine when both columns are filled with actual values, but when they are not, zeros are treated as real values in the correlation calculation.
For example, Addison’s and Caden’s weights are 0, and Jack and Noah don’t have weights at all. I want to exclude them from the calculation.
(In my attempts, it seems that only the rows common to both frames are considered, i.e. Jack and Noah are automatically excluded – is that right?)
How can I include only the people with non-zero values in the calculation?
Thank you.
import pandas as pd
Weight = {'Name': ["Abigail","Addison","Aiden","Amelia","Aria","Ava","Caden","Charlotte","Chloe","Elijah"],
'Weight': [10, 0, 12, 20, 25, 10, 0, 18, 16, 13]}
df_wt = pd.DataFrame(Weight)
Score = {'Name': ["Abigail","Addison","Aiden","Amelia","Aria","Ava","Caden","Charlotte","Chloe","Elijah", "Jack", "Noah"],
'Score': [360, 476, 345, 601, 604, 313, 539, 531, 507, 473, 450, 470]}
df_sc = pd.DataFrame(Score)
print(df_wt.Weight.corr(df_sc.Score))
Mask out the zero values and take the common index:
df_wt.set_index('Name', inplace=True)
df_sc.set_index('Name', inplace=True)
mask = df_wt['Weight'].ne(0)
common_index = df_wt.loc[mask, :].index
df_wt.loc[common_index, 'Weight'].corr(df_sc.loc[common_index, 'Score'])
0.923425144491911
If both dataframes contain zeros, then:
mask1 = df_wt['Weight'].ne(0)
mask2 = df_sc['Score'].ne(0)
common_index = df_wt.loc[mask1, :].index.intersection(df_sc.loc[mask2, :].index)
df_wt.loc[common_index, 'Weight'].corr(df_sc.loc[common_index, 'Score'])
Use map to add a new column, remove the zero rows with boolean indexing, and finally apply your solution on the same DataFrame:
df_wt['Score'] = df_wt['Name'].map(df_sc.set_index('Name')['Score'])
df_wt = df_wt[df_wt['Weight'].ne(0)]
print (df_wt)
Name Weight Score
0 Abigail 10 360
2 Aiden 12 345
3 Amelia 20 601
4 Aria 25 604
5 Ava 10 313
7 Charlotte 18 531
8 Chloe 16 507
9 Elijah 13 473
print (df_wt.Weight.corr(df_wt.Score))
0.923425144491911
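Another angle, given only as a sketch (it is not part of either answer above): since Series.corr ignores NaN pairs, you can align the two frames on Name, turn the zero weights into NaN, and let corr() drop them. This assumes df_wt and df_sc as originally built in the question, before either answer's set_index or filtering.

import numpy as np

merged = df_wt.merge(df_sc, on='Name', how='inner')     # inner join drops Jack and Noah
merged['Weight'] = merged['Weight'].replace(0, np.nan)  # zeros become NaN
print(merged['Weight'].corr(merged['Score']))           # corr() skips NaN pairs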
I have a csv dataset with a column named "Types of Incidents" and another column named "Number of units".
Using Python and Pandas, I am trying to find the average of "Number of units" when the value in "Types of Incidents" is 111 (it occurs multiple times).
I have tried searching through multiple pandas methods but couldn't figure out how to do this on a huge dataset.
Here is the question:
What is the ratio of the average number of units that arrive to a scene of an incident classified as '111 - Building fire' to the number that arrive for '651 - Smoke scare, odor of smoke'?
An alternative to ML-Nielsen's value-specific answer:
df.groupby('Types of Incidents')['Number of units'].mean()
This will provide the average Number of units for all Incident Types.
You can specify multiple columns as well if needed.
Reproducible Example:
import pandas as pd

data = {
    "Incident_Type": [111, 380, 390, 111, 651, 651],
    "Number_of_units": [50, 40, 45, 99, 12, 13]
}
data = pd.DataFrame(data)
data
Incident_Type Number_of_units
0 111 50
1 380 40
2 390 45
3 111 99
4 651 12
5 651 13
data.groupby('Incident_Type')['Number_of_units'].mean()
Incident_Type
111 74.5
380 40.0
390 45.0
651 12.5
Name: Number_of_units, dtype: float64
Now, if you wish to find the ratio between the units, you will need to store this result as a dataframe.
average_units = data.groupby('Incident_Type')['Number_of_units'].mean().to_frame()
average_units = average_units.reset_index()
average_units
Incident_Type Number_of_units
0 111 74.5
1 380 40.0
2 390 45.0
3 651 12.5
So we have our result stored in a dataframe called average_units.
incident1_units = average_units[average_units['Incident_Type']==111]['Number_of_units'].values[0]
incident2_units = average_units[average_units['Incident_Type']==651]['Number_of_units'].values[0]
incident1_units / incident2_units
5.96
If I understand correctly, you probably have to first select the right rows and then calculate the mean. Something like this:
df.loc[df['Types of Incidents']==111, 'Number of units'].mean()
This will give you the mean of Number of units where the condition df['Types of Incidents']==111 is true.
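For the ratio asked about in the question, the same loc pattern can simply be applied twice. A minimal sketch, assuming the CSV is loaded into df and that 'Types of Incidents' holds full labels such as '111 - Building fire' (the startswith filter is an assumption about the data; use == 111 instead if the column holds bare numeric codes):

avg_111 = df.loc[df['Types of Incidents'].astype(str).str.startswith('111'), 'Number of units'].mean()
avg_651 = df.loc[df['Types of Incidents'].astype(str).str.startswith('651'), 'Number of units'].mean()
print(avg_111 / avg_651)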
I have a pandas dataset that looks at the number of n cases of an instance over time.
I have sorted the dataset in ascending order from the first recorded date and have created a new column called 'change'.
However, I am unsure how to take the data from column n and map it onto the 'change' column so that each cell in 'change' represents the difference from the previous day.
For example, if on day 334 there were n = 14000 and on day 335 there were n = 14500 cases, in that corresponding 'change' cell I would want it to say '500'.
I have been trying things out for the past couple of hours but to no avail so have come here for some help.
I know this is wordier than I would like, but if you need any clarification let me know.
import pandas as pd
df = pd.DataFrame({
'date': [1,2,3,4,5,6,7,8,9,10],
'cases': [100, 120, 129, 231, 243, 212, 375, 412, 440, 1]
})
df['change'] = df.cases.diff()
OUTPUT
date cases change
0 1 100 NaN
1 2 120 20.0
2 3 129 9.0
3 4 231 102.0
4 5 243 12.0
5 6 212 -31.0
6 7 375 163.0
7 8 412 37.0
8 9 440 28.0
9 10 1 -439.0
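As a small optional follow-up (an assumption about the desired presentation, not something the question requires): the first row has no previous day, so diff() leaves NaN there; it can be filled with 0 and the column kept as integers.

df['change'] = df.cases.diff().fillna(0).astype(int)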
I have two dataframes, one containing a column with points, and another one containing a polygon.
The data looks like this:
>>> df1
Index Point
0 1 POINT (100 400)
1 2 POINT (920 400)
2 3 POINT (111 222)
>>> df2
Index Area-ID Polygon
0 1 New York POLYGON ((226000 619000, 226000 619500, 226500...
1 2 Amsterdam POLYGON ((226000 619000, 226000 619500, 226500...
2 3 Berlin POLYGON ((226000 619000, 226000 619500, 226500...
Reproducible example:
import pandas as pd
import shapely.wkt
data = {'Index': [1, 2, 3],
'Point': ['POINT (100 400)', 'POINT (920 400)', 'POINT (111 222)']}
df1 = pd.DataFrame(data)
df1['Point'] = df1['Point'].apply(shapely.wkt.loads)
data = {'Index': [1, 2, 3],
'Area-ID': ['New York', 'Amsterdam', 'Berlin'],
'Polygon': ['POLYGON ((90 390, 110 390, 110 410, 90 410, 90 390))',
'POLYGON ((890 390, 930 390, 930 410, 890 410, 890 390))',
'POLYGON ((110 220, 112 220, 112 225, 110 225, 110 220))']}
df2 = pd.DataFrame(data)
df2['Polygon'] = df2['Polygon'].apply(shapely.wkt.loads)
With shapely's function 'polygon.contains' I can check whether a polygon contains a certain point.
The goal is to find the corresponding polygon for every point in dataframe 1.
The following approach works, but takes way too long considering the datasets are very large:
for index, row in df1.iterrows():
    print(index)
    for index2, row2 in df2.iterrows():
        if row2['Polygon'].contains(row['Point']):
            df1.loc[index, 'Area-ID'] = row2['Area-ID']
Is there a more time-efficient way to achieve this goal?
If every point is contained by exactly one polygon (as is the case in the current form of the question), you can do:
df1 = df1.assign(cities=df1.Point.apply(
    lambda point: df2['Area-ID'].loc[
        [i for i, polygon in enumerate(df2.Polygon)
         if polygon.contains(point)][0]]))
You'll get:
Index Point cities
0 1 POINT (100 400) New York
1 2 POINT (920 400) Amsterdam
2 3 POINT (111 222) Berlin
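For very large datasets, a vectorized spatial join is usually much faster than nested iterrows loops. A sketch using geopandas, which is an extra dependency not mentioned in the question (newer geopandas versions take the predicate keyword; older ones call the same argument op):

import geopandas as gpd

points = gpd.GeoDataFrame(df1, geometry='Point')      # wrap the shapely points
polygons = gpd.GeoDataFrame(df2, geometry='Polygon')  # wrap the shapely polygons

# left join: each point picks up the attributes of the polygon it falls within
matched = gpd.sjoin(points, polygons, how='left', predicate='within')
print(matched['Area-ID'])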
I apologize as I prefer to ask questions where I've made an attempt at the code needed to resolve the issue. Here, despite many attempts, I haven't gotten any closer to a resolution (in part because I'm a hobbyist and self-taught). I'm attempting to use two dataframes together to calculate the average values in a specific column, then generate a new column to store that average.
I have two dataframes. The first contains the players and their stats. The second contains a list of each player's opponents during the season.
What I'm attempting to do is use the two dataframes to calculate expected values when facing a specific opponent. Stated otherwise, I'd like to be able to see if a player is performing better or worse than the expected results based on the opponent but first need to calculate the average of their opponents.
My dataframes actually have thousands of players and hundreds of matchups, so I've shortened them here to have a representative dataframe that isn't overwhelming.
The first dataframe (df) contains five columns. Name, STAT1, STAT2, STAT3, and STAT4.
The second dataframe (df_Schedule) has a Name column and then a separate column for each opponent faced. df_Schedule usually contains a different number of columns depending on the week of the season; for example, after week 1 there may be four columns, while after week 26 there might be 100. For simplicity's sake, I've included just 'Name' plus five opponent columns: ['Opp1', 'Opp2', 'Opp3', 'Opp4', 'Opp5'].
Using these two dataframes I'm trying to create new columns in the first dataframe (df). EXP1 (for "Expected STAT1"), EXP2, EXP3, EXP4. The expected columns are simply an average of the STAT columns based on the opponents faced during the season. For example, Edgar faced Ralph three times, Marc once and David once. The formula to calculate Edgar's EXP1 is simply:
((Ralph.STAT1 * 3) + (Marc.STAT1 * 1) + (David.STAT1 * 1)) / Number_of_Contests (which is five in this example) = 100.2
import pandas as pd
data = {'Name':['Edgar', 'Ralph', 'Marc', 'David'],
'STAT1':[100, 96, 110, 103],
'STAT2':[116, 93, 85, 100],
'STAT3':[56, 59, 41, 83],
'STAT4':[55, 96, 113, 40],}
data2 = {'Name':['Edgar', 'Ralph', 'Marc', 'David'],
'Opp1':['Ralph', 'Edgar', 'David', 'Marc'],
'Opp2':['Ralph', 'Edgar', 'David', 'Marc'],
'Opp3':['Marc', 'David', 'Edgar', 'Ralph'],
'Opp4':['David', 'Marc', 'Ralph', 'Edgar'],
'Opp5':['Ralph', 'Edgar', 'David', 'Marc'],}
df = pd.DataFrame(data)
df_Schedule = pd.DataFrame(data2)
print(df)
print(df_Schedule)
I would like the result to be something like:
data_Final = {'Name':['Edgar', 'Ralph', 'Marc', 'David'],
'STAT1':[100, 96, 110, 103],
'STAT2':[116, 93, 85, 100],
'STAT3':[56, 59, 41, 83],
'STAT4':[55, 96, 113, 40],
'EXP1':[100.2, 102.6, 101, 105.2],
'EXP2':[92.8, 106.6, 101.8, 92.8],
'EXP3':[60.2, 58.4, 72.8, 47.6],
'EXP4':[88.2, 63.6, 54.2, 98],}
df_Final = pd.DataFrame(data_Final)
print(df_Final)
Is there a way to use the scheduling dataframe to lookup the values of opponents, average them, and then create a new column based on those averages?
Try:
df = df.set_index("Name")
df_Schedule = df_Schedule.set_index("Name")
for i, c in enumerate(df.filter(like="STAT"), 1):
df[f"EXP{i}"] = df_Schedule.replace(df[c]).mean(axis=1)
print(df.reset_index())
Prints:
Name STAT1 STAT2 STAT3 STAT4 EXP1 EXP2 EXP3 EXP4
0 Edgar 100 116 56 55 100.2 92.8 60.2 88.2
1 Ralph 96 93 59 96 102.6 106.6 58.4 63.6
2 Marc 110 85 41 113 101.0 101.8 72.8 54.2
3 David 103 100 83 40 105.2 92.8 47.6 98.0
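As a quick sanity check against the formula in the question (a side note, not part of the answer above), Edgar's EXP1 can be reproduced directly from the original data and data2 dicts:

stat1 = pd.DataFrame(data).set_index('Name')['STAT1']
edgar_opponents = pd.DataFrame(data2).set_index('Name').loc['Edgar']
print(stat1.loc[edgar_opponents.tolist()].mean())   # 100.2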
I have a Pandas dataframe with columns labeled Ticks, Water, and Temp, with a few million rows (possibly billions in a complete dataset), but it looks something like this
...
'Ticks' 'Water' 'Temp'
215 4 26.2023
216 1 26.7324
217 17 26.8173
218 2 26.9912
219 48 27.0111
220 1 27.2604
221 19 27.7563
222 32 28.3002
...
(All temperatures are in ascending order, and all 'ticks' are also linearly spaced and in ascending order too)
What I'm trying to do is to reduce the data down to a single 'Water' value for each floored, integer 'Temp' value, and just the first 'Tick' value (or last, it doesn't really have that much of an effect on the analysis).
The current direction I'm working in is to start at the first row and save the tick value; check whether the temperature is a whole integer greater than the previous one; add the water value; then move to the next row, check its temperature, and add its water value if the temperature is not a whole integer higher. If the temperature value is an integer higher, append the saved 'tick' value, the integer temperature value, and the summed water count to a new dataframe.
I'm sure this will work but, I'm thinking there should be a way to do this a lot more efficiently using some type of application of df.loc or df.iloc since everything is nicely in ascending order.
My hopeful output for this would be a much shorter dataset with values that look something like this:
...
'Ticks' 'Water' 'Temp'
215 24 26
219 68 27
222 62 28
...
Use GroupBy.agg and Series.astype
new_df = (df.groupby(df['Temp'].astype(int))
            .agg({'Ticks': 'first', 'Water': 'sum'})
            # .agg(Ticks=('Ticks', 'first'), Water=('Water', 'sum'))
            .reset_index()
            .reindex(columns=df.columns))
print(new_df)
Output
Ticks Water Temp
0 215 24 26
1 219 68 27
2 222 32 28
I have some trouble understanding the rules for which ticks you want in the final dataframe, but here is a way to get the indices of all Temps with equal floored value:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import pandas as pd
import numpy as np
data = pd.DataFrame({
'Ticks': [215, 216, 217, 218, 219, 220, 221, 222],
'Water': [4, 1, 17, 2, 48, 1, 19, 32],
'Temp': [26.2023, 26.7324, 26.8173, 26.9912, 27.0111, 27.2604, 27.7563, 28.3002]})
# first floor all temps
data['Temp'] = data['Temp'].apply(np.floor)
# get the indices of all equal temps
groups = data.groupby('Temp').groups
print(groups)
# maybe apply mean?
data = data.groupby('Temp').mean()
print(data)
hope this helps
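A hedged aside tying this back to the question's desired output: the question asks for the first tick and the summed water per integer temperature rather than the mean, and the same floored 'Temp' column supports that directly. This re-applies the groupby to the frame as it stands right after the flooring step (i.e. before the .mean() reassignment above), and mirrors the agg call in the other answer:

reduced = (data.groupby('Temp', as_index=False)
               .agg({'Ticks': 'first', 'Water': 'sum'}))
print(reduced)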