Pandas: Most resource efficient way to apply function - python

I have two dataframes, one containing a column with points, and another one containing a polygon.
The data looks like this:
>>> df1
Index Point
0 1 POINT (100 400)
1 2 POINT (920 400)
2 3 POINT (111 222)
>>> df2
Index Area-ID Polygon
0 1 New York POLYGON ((226000 619000, 226000 619500, 226500...
1 2 Amsterdam POLYGON ((226000 619000, 226000 619500, 226500...
2 3 Berlin POLYGON ((226000 619000, 226000 619500, 226500...
Reproducible example:
import pandas as pd
import shapely.wkt

data = {'Index': [1, 2, 3],
        'Point': ['POINT (100 400)', 'POINT (920 400)', 'POINT (111 222)']}
df1 = pd.DataFrame(data)
df1['Point'] = df1['Point'].apply(shapely.wkt.loads)

data = {'Index': [1, 2, 3],
        'Area-ID': ['New York', 'Amsterdam', 'Berlin'],
        'Polygon': ['POLYGON ((90 390, 110 390, 110 410, 90 410, 90 390))',
                    'POLYGON ((890 390, 930 390, 930 410, 890 410, 890 390))',
                    'POLYGON ((110 220, 112 220, 112 225, 110 225, 110 220))']}
df2 = pd.DataFrame(data)
df2['Polygon'] = df2['Polygon'].apply(shapely.wkt.loads)
With shapely's Polygon.contains method I can check whether a polygon contains a given point.
The goal is to find the corresponding polygon for every point in dataframe 1.
The following approach works, but takes way too long considering the datasets are very large:
for index, row in df1.iterrows():
    for index2, row2 in df2.iterrows():
        if row2['Polygon'].contains(row['Point']):
            df1.at[index, 'Area-ID'] = row2['Area-ID']
Is there a more time-efficient way to achieve this goal?

If every point is contained by exactly one polygon (as is the case in the current form of the question), you can do:
df1 = df1.assign(cities=df1.Point.apply(
    lambda point: df2['Area-ID'].loc[
        [i for i, polygon in enumerate(df2.Polygon)
         if polygon.contains(point)][0]]))
You'll get:
Index Point cities
0 1 POINT (100 400) New York
1 2 POINT (920 400) Amsterdam
2 3 POINT (111 222) Berlin
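
For genuinely large inputs, the usual escape hatch is geopandas, whose spatial join is backed by an R-tree index rather than a double loop. Here is a sketch building on the reproducible example above (it assumes geopandas is installed; in versions before 0.10 the predicate keyword was called op):

import geopandas as gpd

# Wrap the existing shapely columns in GeoDataFrames
points = gpd.GeoDataFrame(df1, geometry='Point')
polygons = gpd.GeoDataFrame(df2, geometry='Polygon')

# Spatial join: for each point, find the polygon that contains it.
# The spatial index scales far better than the nested iterrows loops.
joined = gpd.sjoin(points, polygons, how='left', predicate='within')
print(joined[['Point', 'Area-ID']])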

Related

Find the average of pandas dataframe column(s) for multiple or all rows

I have a CSV dataset with a column named "Types of Incidents" and another column named "Number of units".
Using Python and pandas, I am trying to find the average of "Number of units" when the value in "Types of Incidents" is 111 (it occurs multiple times).
I have tried several pandas methods but couldn't work out how to do this on a huge dataset.
Here is the question:
What is the ratio of the average number of units that arrive to a scene of an incident classified as '111 - Building fire' to the number that arrive for '651 - Smoke scare, odor of smoke'?
An alternative to ML-Nielsen's value-specific answer:
df.groupby('Types of Incidents')['Number of units'].mean()
This will provide the average Number of units for all Incident Types.
You can specify multiple columns as well if needed.
Reproducible Example:
import pandas as pd

data = {
    "Incident_Type": [111, 380, 390, 111, 651, 651],
    "Number_of_units": [50, 40, 45, 99, 12, 13]
}
data = pd.DataFrame(data)
data
Incident_Type Number_of_units
0 111 50
1 380 40
2 390 45
3 111 99
4 651 12
5 651 13
data.groupby('Incident_Type')['Number_of_units'].mean()
Incident_Type
111 74.5
380 40.0
390 45.0
651 12.5
Name: Number_of_units, dtype: float64
Now, if you wish to find the ratio between incident types, you can store this result as a dataframe.
average_units = data.groupby('Incident_Type')['Number_of_units'].mean().to_frame()
average_units = average_units.reset_index()
average_units
Incident_Type Number_of_units
0 111 74.5
1 380 40.0
2 390 45.0
3 651 12.5
So we have our result stored in a dataframe called average_units.
incident1_units = average_units[average_units['Incident_Type']==111]['Number_of_units'].values[0]
incident2_units = average_units[average_units['Incident_Type']==651]['Number_of_units'].values[0]
incident1_units / incident2_units
5.96
If I understand correctly, you probably have to first select the right rows and then calculate the mean. Something like this:
df.loc[df['Types of Incidents']==111, 'Number of units'].mean()
This will give you the mean of Number of units where the condition df['Types of Incidents']==111 is true.
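To get the ratio asked about in the question, the same selection can be applied twice (a sketch using the column names from the question, and assuming the incident labels are stored as the numeric codes 111 and 651):

avg_111 = df.loc[df['Types of Incidents'] == 111, 'Number of units'].mean()
avg_651 = df.loc[df['Types of Incidents'] == 651, 'Number of units'].mean()
print(avg_111 / avg_651)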

Using pandas to create rate of change in a variable

I have a pandas dataset that tracks the number of cases, n, of an instance over time.
I have sorted the dataset in ascending order from the first recorded date and have created a new column called 'change'.
I am unsure however how to take the data from column n and map it onto the 'change' column such that each cell in the 'change' column represents the difference from the previous day.
For example, if on day 334 there were n = 14000 and on day 335 there were n = 14500 cases, in that corresponding 'change' cell I would want it to say '500'.
I have been trying things out for the past couple of hours but to no avail so have come here for some help.
I know this is wordier than I would like, but if you need any clarification let me know.
import pandas as pd

df = pd.DataFrame({
    'date': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'cases': [100, 120, 129, 231, 243, 212, 375, 412, 440, 1]
})
df['change'] = df.cases.diff()
OUTPUT
date cases change
0 1 100 NaN
1 2 120 20.0
2 3 129 9.0
3 4 231 102.0
4 5 243 12.0
5 6 212 -31.0
6 7 375 163.0
7 8 412 37.0
8 9 440 28.0
9 10 1 -439.0
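
Since the title asks for a rate of change, note that diff() gives the absolute day-over-day difference; for a relative rate, pandas also provides pct_change(). A quick sketch on the same frame:

# Day-over-day relative change: 0.20 means a 20% increase over the previous day
df['pct_change'] = df.cases.pct_change()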

Creating new DF column based on average values from specific columns identified in second DF

I apologize as I prefer to ask questions where I've made an attempt at the code needed to resolve the issue. Here, despite many attempts, I haven't gotten any closer to a resolution (in part because I'm a hobbyist and self-taught). I'm attempting to use two dataframes together to calculate the average values in a specific column, then generate a new column to store that average.
I have two dataframes. The first contains the players and their stats. The second contains a list of each player's opponents during the season.
What I'm attempting to do is use the two dataframes to calculate expected values when facing a specific opponent. Stated otherwise, I'd like to be able to see if a player is performing better or worse than the expected results based on the opponent but first need to calculate the average of their opponents.
My dataframes actually have thousands of players and hundreds of matchups, so I've shortened them here to have a representative dataframe that isn't overwhelming.
The first dataframe (df) contains five columns. Name, STAT1, STAT2, STAT3, and STAT4.
The second dataframe (df_Schedule) has a Name column but then has a separate column for each opponent faced. df_Schedule usually contains a different number of columns depending on the week of the season. For example, after week 1 there may be four columns; after week 26 there might be 100 columns. For simplicity's sake, I've included just five opponent columns: ['Name', 'Opp1', 'Opp2', 'Opp3', 'Opp4', 'Opp5'].
Using these two dataframes I'm trying to create new columns in the first dataframe (df). EXP1 (for "Expected STAT1"), EXP2, EXP3, EXP4. The expected columns are simply an average of the STAT columns based on the opponents faced during the season. For example, Edgar faced Ralph three times, Marc once and David once. The formula to calculate Edgar's EXP1 is simply:
((Ralph.STAT1 * 3) + (Marc.STAT1 * 1) + (David.STAT1 * 1)) / Number_of_Contests (which is five in this example) = 100.2
import pandas as pd

data = {'Name': ['Edgar', 'Ralph', 'Marc', 'David'],
        'STAT1': [100, 96, 110, 103],
        'STAT2': [116, 93, 85, 100],
        'STAT3': [56, 59, 41, 83],
        'STAT4': [55, 96, 113, 40]}
data2 = {'Name': ['Edgar', 'Ralph', 'Marc', 'David'],
         'Opp1': ['Ralph', 'Edgar', 'David', 'Marc'],
         'Opp2': ['Ralph', 'Edgar', 'David', 'Marc'],
         'Opp3': ['Marc', 'David', 'Edgar', 'Ralph'],
         'Opp4': ['David', 'Marc', 'Ralph', 'Edgar'],
         'Opp5': ['Ralph', 'Edgar', 'David', 'Marc']}
df = pd.DataFrame(data)
df_Schedule = pd.DataFrame(data2)
print(df)
print(df_Schedule)
I would like the result to be something like:
data_Final = {'Name': ['Edgar', 'Ralph', 'Marc', 'David'],
              'STAT1': [100, 96, 110, 103],
              'STAT2': [116, 93, 85, 100],
              'STAT3': [56, 59, 41, 83],
              'STAT4': [55, 96, 113, 40],
              'EXP1': [100.2, 102.6, 101, 105.2],
              'EXP2': [92.8, 106.6, 101.8, 92.8],
              'EXP3': [60.2, 58.4, 72.8, 47.6],
              'EXP4': [88.2, 63.6, 54.2, 98]}
df_Final = pd.DataFrame(data_Final)
print(df_Final)
Is there a way to use the scheduling dataframe to lookup the values of opponents, average them, and then create a new column based on those averages?
Try:
df = df.set_index("Name")
df_Schedule = df_Schedule.set_index("Name")
for i, c in enumerate(df.filter(like="STAT"), 1):
    # df[c] is a Name -> stat Series; replace() swaps each opponent name
    # for that stat, and mean(axis=1) averages across the opponent columns
    df[f"EXP{i}"] = df_Schedule.replace(df[c]).mean(axis=1)
print(df.reset_index())
Prints:
Name STAT1 STAT2 STAT3 STAT4 EXP1 EXP2 EXP3 EXP4
0 Edgar 100 116 56 55 100.2 92.8 60.2 88.2
1 Ralph 96 93 59 96 102.6 106.6 58.4 63.6
2 Marc 110 85 41 113 101.0 101.8 72.8 54.2
3 David 103 100 83 40 105.2 92.8 47.6 98.0
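
If the replace() trick feels opaque, an equivalent formulation (a sketch, not the answerer's code) maps each opponent column through the Name-to-stat Series explicitly. It continues from the set_index calls above:

for i, c in enumerate(df.filter(like="STAT"), 1):
    # col.map(df[c]) turns every opponent name into that opponent's stat value
    df[f"EXP{i}"] = df_Schedule.apply(lambda col: col.map(df[c])).mean(axis=1)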

To find correspondents in data-frames for calculation

I have two data frames like the ones below, and I want to calculate the correlation coefficient.
It works fine when both columns are completed with actual values. But when they are not, zero is taken as the value when calculating the correlation coefficient.
For example, Addison's and Caden's weights are 0, and Jack and Noah don't have weights at all. I want to exclude them from the calculation.
(In my attempts, it seems that only equal lengths are considered, i.e. Jack and Noah are automatically excluded – is that right?)
How can I include only the people with non-zero values in the calculation?
Thank you.
import pandas as pd

Weight = {'Name': ["Abigail", "Addison", "Aiden", "Amelia", "Aria", "Ava", "Caden", "Charlotte", "Chloe", "Elijah"],
          'Weight': [10, 0, 12, 20, 25, 10, 0, 18, 16, 13]}
df_wt = pd.DataFrame(Weight)

Score = {'Name': ["Abigail", "Addison", "Aiden", "Amelia", "Aria", "Ava", "Caden", "Charlotte", "Chloe", "Elijah", "Jack", "Noah"],
         'Score': [360, 476, 345, 601, 604, 313, 539, 531, 507, 473, 450, 470]}
df_sc = pd.DataFrame(Score)

print(df_wt.Weight.corr(df_sc.Score))
Mask the non-zero values and take the common index:
df_wt.set_index('Name', inplace=True)
df_sc.set_index('Name', inplace=True)
mask = df_wt['Weight'].ne(0)
common_index = df_wt.loc[mask, :].index
df_wt.loc[common_index, 'Weight'].corr(df_sc.loc[common_index, 'Score'])
0.923425144491911
If both dataframes contain zeros, then:
mask1 = df_wt['Weight'].ne(0)
mask2 = df_sc['Score'].ne(0)
common_index = df_wt.loc[mask1, :].index.intersection(df_sc.loc[mask2, :].index)
df_wt.loc[common_index, 'Weight'].corr(df_sc.loc[common_index, 'Score'])
Use map to add the new column, remove the zero rows by boolean indexing, and finally apply your solution within the same DataFrame:
df_wt['Score'] = df_wt['Name'].map(df_sc.set_index('Name')['Score'])
df_wt = df_wt[df_wt['Weight'].ne(0)]
print (df_wt)
Name Weight Score
0 Abigail 10 360
2 Aiden 12 345
3 Amelia 20 601
4 Aria 25 604
5 Ava 10 313
7 Charlotte 18 531
8 Chloe 16 507
9 Elijah 13 473
print (df_wt.Weight.corr(df_wt.Score))
0.923425144491911
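
A third route (a sketch, not taken from the answers above) is an inner merge starting from the original frames: the join drops the names missing from either frame, and the same zero filter handles Addison and Caden:

merged = df_wt.merge(df_sc, on='Name')         # inner join drops Jack and Noah
merged = merged[merged['Weight'].ne(0)]        # exclude the zero weights
print(merged['Weight'].corr(merged['Score']))  # same 0.923425144491911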

pandas, numpy round down to nearest 100

I created a dataframe column with the code below, and am trying to figure out how to round it down to the nearest hundred.
...
# This prints out my new value rounded down to the nearest whole number.
df['new_values'] = (10000/df['old_values']).apply(numpy.floor)
# How do I get it to round down to the nearest hundred instead?
# i.e. 8450 rounded to 8400
You need to divide by 100, convert to int, and finally multiply by 100:
df['new_values'] = (df['old_values'] / 100).astype(int) *100
Same as (the values here are non-negative, so int truncation matches floor):
df['new_values'] = (df['old_values'] / 100).apply(np.floor).astype(int) *100
Sample:
df = pd.DataFrame({'old_values':[8450, 8470, 343, 573, 34543, 23999]})
df['new_values'] = (df['old_values'] / 100).astype(int) *100
print (df)
old_values new_values
0 8450 8400
1 8470 8400
2 343 300
3 573 500
4 34543 34500
5 23999 23900
EDIT:
df = pd.DataFrame({'old_values':[3, 6, 89, 573, 34, 23]})
# show the output of the first division to verify the result
df['new_values1'] = (10000/df['old_values'])
df['new_values'] = (10000/df['old_values']).div(100).astype(int).mul(100)
print (df)
old_values new_values1 new_values
0 3 3333.333333 3300
1 6 1666.666667 1600
2 89 112.359551 100
3 573 17.452007 0
4 34 294.117647 200
5 23 434.782609 400
Borrowing @jezrael's sample dataframe:
df = pd.DataFrame({'old_values':[8450, 8470, 343, 573, 34543, 23999]})
Use floordiv or //
df // 100 * 100
old_values
0 8400
1 8400
2 300
3 500
4 34500
5 23900
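
The same pattern generalizes to rounding down to any step size, not just 100 (a quick sketch; n is arbitrary):

n = 250  # any positive step size
print(df['old_values'] // n * n)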
I've tried something similar using the math module:

import math

a = [123, 456, 789, 145]

def rdl(x):
    ls = []
    for i in x:
        rounded = math.floor(i/100)*100
        ls.append(rounded)
    return ls

rdl(a)
# Output was [100, 400, 700, 100]
Hope this provides some idea. It's very similar to the solution provided by @jezrael.
