Summing Pandas columns between two rows - python

I have a Pandas dataframe with columns labeled Ticks, Water, and Temp, with a few million rows (possibly billions in a complete dataset). It looks something like this:
...
'Ticks' 'Water' 'Temp'
215 4 26.2023
216 1 26.7324
217 17 26.8173
218 2 26.9912
219 48 27.0111
220 1 27.2604
221 19 27.7563
222 32 28.3002
...
(All temperatures are in ascending order, and all 'ticks' are also linearly spaced and in ascending order too)
What I'm trying to do is reduce the data down to a single summed 'Water' value for each floored, integer 'Temp' value, keeping just the first 'Tick' value (or the last; it doesn't have much of an effect on the analysis).
The approach I'm currently working on is: start at the first row, save the tick value, and check whether the temperature is an integer value greater than the previous one. If it isn't, add the water value and move to the next row. If it is, append the saved 'tick' value, the integer temperature value, and the summed water count to a new dataframe.
I'm sure this will work, but I'm thinking there should be a way to do this a lot more efficiently using some application of df.loc or df.iloc, since everything is nicely in ascending order.
My hopeful output for this would be a much shorter dataset with values that look something like this:
...
'Ticks' 'Water' 'Temp'
215 24 26
219 68 27
222 62 28
...

Use GroupBy.agg and Series.astype
new_df = (df.groupby(df['Temp'].astype(int))
            .agg({'Ticks': 'first', 'Water': 'sum'})
            # .agg(Ticks=('Ticks', 'first'), Water=('Water', 'sum'))
            .reset_index()
            .reindex(columns=df.columns)
         )
print(new_df)
Output
Ticks Water Temp
0 215 24 26
1 219 68 27
2 222 32 28
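The commented-out line in the snippet is the equivalent named-aggregation syntax, available from pandas 0.25 onward. One caveat worth noting: astype(int) truncates toward zero, which only matches flooring for non-negative values. If the temperatures could ever be negative, a floored grouping key is safer; a minimal sketch:
import numpy as np

new_df = (df.groupby(np.floor(df['Temp']).astype(int))  # true floor, safe for negative temps
            .agg({'Ticks': 'first', 'Water': 'sum'})
            .reset_index()
            .reindex(columns=df.columns)
         )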

I have some trouble understanding the rules for which ticks you want in the final dataframe, but here is a way to get the indices of all Temps with equal floored value:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import pandas as pd
import numpy as np
data = pd.DataFrame({
    'Ticks': [215, 216, 217, 218, 219, 220, 221, 222],
    'Water': [4, 1, 17, 2, 48, 1, 19, 32],
    'Temp': [26.2023, 26.7324, 26.8173, 26.9912, 27.0111, 27.2604, 27.7563, 28.3002]})
# first floor all temps
data['Temp'] = data['Temp'].apply(np.floor)
# get the indices of all equal temps
groups = data.groupby('Temp').groups
print(groups)
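# (roughly: {26.0: [0, 1, 2, 3], 27.0: [4, 5, 6], 28.0: [7]} -- the row indices
#  per floored temp; the exact repr of the index objects varies by pandas version)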
# maybe apply mean?
data = data.groupby('Temp').mean()
print(data)
Hope this helps!

Related

Find the average of pandas dataframe column(s) for multiple or all rows

I have a CSV dataset with a column named "Types of Incidents" and another column named "Number of units".
Using Python and Pandas, I am trying to find the average of "Number of units" where the value in "Types of Incidents" is 111 (it appears multiple times).
I have tried several pandas methods but couldn't figure out how to do this on a huge dataset.
Here is the question:
What is the ratio of the average number of units that arrive to a scene of an incident classified as '111 - Building fire' to the number that arrive for '651 - Smoke scare, odor of smoke'?
An alternative to ML-Nielsen's value-specific answer:
df.groupby('Types of Incidents')['Number of units'].mean()
This will provide the average Number of units for all Incident Types.
You can specify multiple columns as well if needed.
Reproducible Example:
import pandas as pd

data = {
    "Incident_Type": [111, 380, 390, 111, 651, 651],
    "Number_of_units": [50, 40, 45, 99, 12, 13]
}
data = pd.DataFrame(data)
data
Incident_Type Number_of_units
0 111 50
1 380 40
2 390 45
3 111 99
4 651 12
5 651 13
data.groupby('Incident_Type')['Number_of_units'].mean()
Incident_Type
111 74.5
380 40.0
390 45.0
651 12.5
Name: Number_of_units, dtype: float64
Now if you wish to find the ratios of the units you will need to store this result as a dataframe.
average_units = data.groupby('Incident_Type')['Number_of_units'].mean().to_frame()
average_units = average_units.reset_index()
average_units
Incident_Type Number_of_units
0 111 74.5
1 380 40.0
2 390 45.0
3 651 12.5
So we have our result stored in a dataframe called average_units.
incident1_units = average_units[average_units['Incident_Type']==111]['Number_of_units'].values[0]
incident2_units = average_units[average_units['Incident_Type']==651]['Number_of_units'].values[0]
incident1_units / incident2_units
5.96
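Alternatively, since the groupby result is already a Series indexed by Incident_Type, the intermediate dataframe isn't strictly necessary; a shorter sketch of the same computation:
average_units = data.groupby('Incident_Type')['Number_of_units'].mean()
average_units[111] / average_units[651]  # 5.96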
If I understand correctly, you probably have to first select the right rows and then calculate the mean. Something like this:
df.loc[df['Types of Incidents']==111, 'Number of units'].mean()
This will give you the mean of Number of units where the condition df['Types of Incidents']==111 is true.
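Following that pattern, the ratio from the question could be computed with two such selections (a sketch, assuming the incident types are stored as plain numbers, as in the comparison above):
avg_111 = df.loc[df['Types of Incidents']==111, 'Number of units'].mean()
avg_651 = df.loc[df['Types of Incidents']==651, 'Number of units'].mean()
print(avg_111 / avg_651)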

Creating new DF column based on average values from specific columns identified in second DF

I apologize as I prefer to ask questions where I've made an attempt at the code needed to resolve the issue. Here, despite many attempts, I haven't gotten any closer to a resolution (in part because I'm a hobbyist and self-taught). I'm attempting to use two dataframes together to calculate the average values in a specific column, then generate a new column to store that average.
I have two dataframes. The first contains the players and their stats. The second contains a list of each player's opponents during the season.
What I'm attempting to do is use the two dataframes to calculate expected values when facing a specific opponent. Stated otherwise, I'd like to be able to see if a player is performing better or worse than the expected results based on the opponent but first need to calculate the average of their opponents.
My dataframes actually have thousands of players and hundreds of matchups, so I've shortened them here to have a representative dataframe that isn't overwhelming.
The first dataframe (df) contains five columns. Name, STAT1, STAT2, STAT3, and STAT4.
The second dataframe (df_Schedule) has a Name column and then a separate column for each opponent faced. df_Schedule usually contains a different number of columns depending on the week of the season; for example, after week 1 there may be four columns, while after week 26 there might be 100. For simplicity's sake, I've kept it to a Name column plus five opponent columns: ['Name', 'Opp1', 'Opp2', 'Opp3', 'Opp4', 'Opp5'].
Using these two dataframes I'm trying to create new columns in the first dataframe (df). EXP1 (for "Expected STAT1"), EXP2, EXP3, EXP4. The expected columns are simply an average of the STAT columns based on the opponents faced during the season. For example, Edgar faced Ralph three times, Marc once and David once. The formula to calculate Edgar's EXP1 is simply:
((Ralph.STAT1 * 3) + (Marc.STAT1 * 1) + (David.STAT1 * 1)) / Number_of_Contests (which is five in this example) = 100.2
import pandas as pd
data = {'Name': ['Edgar', 'Ralph', 'Marc', 'David'],
        'STAT1': [100, 96, 110, 103],
        'STAT2': [116, 93, 85, 100],
        'STAT3': [56, 59, 41, 83],
        'STAT4': [55, 96, 113, 40]}
data2 = {'Name': ['Edgar', 'Ralph', 'Marc', 'David'],
         'Opp1': ['Ralph', 'Edgar', 'David', 'Marc'],
         'Opp2': ['Ralph', 'Edgar', 'David', 'Marc'],
         'Opp3': ['Marc', 'David', 'Edgar', 'Ralph'],
         'Opp4': ['David', 'Marc', 'Ralph', 'Edgar'],
         'Opp5': ['Ralph', 'Edgar', 'David', 'Marc']}
df = pd.DataFrame(data)
df_Schedule = pd.DataFrame(data2)
print(df)
print(df_Schedule)
I would like the result to be something like:
data_Final = {'Name': ['Edgar', 'Ralph', 'Marc', 'David'],
              'STAT1': [100, 96, 110, 103],
              'STAT2': [116, 93, 85, 100],
              'STAT3': [56, 59, 41, 83],
              'STAT4': [55, 96, 113, 40],
              'EXP1': [100.2, 102.6, 101, 105.2],
              'EXP2': [92.8, 106.6, 101.8, 92.8],
              'EXP3': [60.2, 58.4, 72.8, 47.6],
              'EXP4': [88.2, 63.6, 54.2, 98]}
df_Final = pd.DataFrame(data_Final)
print(df_Final)
Is there a way to use the scheduling dataframe to lookup the values of opponents, average them, and then create a new column based on those averages?
Try:
df = df.set_index("Name")
df_Schedule = df_Schedule.set_index("Name")
for i, c in enumerate(df.filter(like="STAT"), 1):
    df[f"EXP{i}"] = df_Schedule.replace(df[c]).mean(axis=1)
print(df.reset_index())
Prints:
Name STAT1 STAT2 STAT3 STAT4 EXP1 EXP2 EXP3 EXP4
0 Edgar 100 116 56 55 100.2 92.8 60.2 88.2
1 Ralph 96 93 59 96 102.6 106.6 58.4 63.6
2 Marc 110 85 41 113 101.0 101.8 72.8 54.2
3 David 103 100 83 40 105.2 92.8 47.6 98.0
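For what it's worth, the trick here is that df[c] is a Series indexed by Name, and DataFrame.replace accepts a Series as a mapping from its index to its values, so every opponent name in df_Schedule is swapped for that player's stat before the row-wise mean. A roughly equivalent, arguably more explicit sketch (assuming df and df_Schedule are already indexed by Name, as above) maps each opponent column through the stat Series instead:
for i, c in enumerate(df.filter(like="STAT"), 1):
    # map each opponent name to that player's stat value, then average across the row
    df[f"EXP{i}"] = df_Schedule.apply(lambda col: col.map(df[c])).mean(axis=1)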

Label Not In List and KeyError

I'm trying to get the value of a specific cell.
main_id name code
0 1345 Jones 32
1 1543 Jack 62
2 9874 Buck 86
3 2456 Slim 94
I want the cell that says code=94, as I already know the main_id but nothing else.
raw_data = {'main_id': ['1345', '1543', '9874', '2456'],
            'name': ['Jones', 'Jack', 'Buck', 'Slim'],
            'code': [32, 62, 86, 94]}
df = pd.DataFrame(raw_data, columns=['main_id', 'name', 'code'])
v = df.loc[str(df['main_id']) == str(2456)]['code'].values
print(df.loc['name'])
The print(df.loc['name']) call claims the label is not in the index, and the v = df.loc[str(df['main_id']) == str(2456)]['code'].values line raises 'KeyError: False'.
df.loc['name'] raises a KeyError because 'name' is not in the index; it is in the columns. When you use loc, the first argument selects from the index. You can use df['name'] or df.loc[:, 'name'].
Your second attempt fails because str(df['main_id']) == str(2456) compares the string representation of the entire Series against '2456', which evaluates to the single boolean False, and df.loc[False] then raises KeyError: False. What you want is an element-wise comparison: you can pass boolean arrays to loc (both for index and columns). For example,
df.loc[df['main_id']=='2456']
Out:
main_id name code
3 2456 Slim 94
You can still select a particular column for this, too:
df.loc[df['main_id']=='2456', 'code']
Out:
3 94
Name: code, dtype: int64
With boolean indexing, the returned object will always be a Series, even if it contains only one value. So you might want to access the underlying array and select the first value from there:
df.loc[df['main_id']=='2456', 'code'].values[0]
Out:
94
But a better way is to use the item method:
df.loc[df['main_id']=='2456', 'code'].item()
Out:
94
This way, you'll get an error if the length of the returned Series is greater than 1, while values[0] does not check for that.
Alternative solution:
In [76]: df.set_index('main_id').at['2456','code']
Out[76]: 94

Python: How to find properties of items in a series

I have a series of numbers that I have broken into buckets using pandas.cut.
import numpy as np
agepreg_cuts = pd.cut(df['agepreg'], [0, 20, 25, 30, np.inf], right=False)
I then count it and display the count.
agepreg_count = (df.groupby(agepreg_cuts).count())
agepreg_count
Which gives me much more information than I want:
sest cmintvw totalwgt_lb
agepreg
[0, 20) 3182 0 1910
[20, 25) 4246 0 2962
[25, 30) 3178 0 2336
[30, inf) 2635 0 1830
Now I want to format it like this:
INAPPLICABLE 352
0 to 20 3182
20 to 25 4246
25 to 30 3178
30 to 50 2635
Total 13593
Which leads me to a couple of questions.
How do I extract the begin/end properties (e.g. 25/30) from the bin [25,30)?
How do I discover properties in a series so that I do not have to ask SO the previous question?
For reference, the data I am using comes from the nsfg. The free book thinkstats2 has companion code and data on github.
From the 'code' directory, you can run the following line to load the dataframe.
import nsfg
df = nsfg.ReadFemPreg()
df
You could iterate over the frame using iterrows and then work on the categorical value, like:
In [679]: for x, i in agepreg_count.iterrows():
   .....:     print ' to '.join(x[1:-1].split(', ')), i['agepreg']
   .....:
0 to 20 0
20 to 25 43
25 to 30 27
30 to inf 30
If you are just looking for a well-formatted string (your example suggests so), you can use the labels argument of the cut function.
import numpy as np

# create labels from breakpoints
breaks = [0, 20, 25, 30, np.inf]
diff = np.diff(breaks).tolist()

# make tuples of *breaks* and length of intervals
joint = list(zip(breaks, diff))

# format label
s1 = "{left:,.0f} to {right:,.0f}"
labels = [s1.format(left=yr[0], right=yr[0] + yr[1] - 1) for yr in joint]
labels
['0 to 19', '20 to 24', '25 to 29', '30 to inf']
Then, cut using the breaks and labels.
df['agebin'] = pd.cut(df['agepreg'], breaks, labels=labels, right=False)
And summarize:
df.groupby('agebin')['agebin'].size()
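As for extracting the begin/end of a bin: in recent pandas versions, pd.cut produces Interval objects rather than plain strings, and each interval exposes left and right attributes, so no string slicing is needed. Running dir() on one of them (or browsing the pandas API reference) lists the available properties, which also answers the second question. A minimal sketch, assuming a recent pandas:
counts = df.groupby(agepreg_cuts)['agepreg'].size()
for interval, count in counts.items():
    # interval.left and interval.right are the bin edges, e.g. 25 and 30
    print(f"{interval.left:g} to {interval.right:g}: {count}")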

percentage of sum in dataframe pandas

I created the following dataframe by using pandas melt and groupby with value and variable. I used the following:
df2 = pd.melt(df1).groupby(['value','variable'])['variable'].count().unstack('variable').fillna(0)
Percentile Percentile1 Percentile2 Percentile3
value
None 0 16 32 48
bottom 0 69 85 88
top 0 69 88 82
mediocre 414 260 209 196
I'm looking to create an output that excludes the 'None' row and creates a percentage of the sum of the 'bottom', 'top', and 'mediocre' rows. Desire output would be the following.
Percentile Percentile1 Percentile2 Percentile3
value
bottom 0% 17.3% 22.3% 24.0%
top 0% 17.3% 23.0% 22.4%
mediocre 414% 65.3% 54.7% 53.6%
One of the main parts of this that I'm struggling with is creating a new row to hold the computed output. Any help would be greatly appreciated!
You can drop the 'None' row like this:
df2 = df2.drop('None')
If you don't want it permanently dropped, you don't have to assign that result back to df2.
Then you get your desired output with:
df2.apply(lambda c: c / c.sum() * 100, axis=0)
Out[11]:
Percentile1 Percentile2 Percentile3
value
bottom 17.336683 22.251309 24.043716
top 17.336683 23.036649 22.404372
mediocre 65.326633 54.712042 53.551913
To just get straight to that result without permanently dropping the None row:
df2.drop('None').apply(lambda c: c / c.sum() * 100, axis=0)
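As a side note, apply isn't strictly needed here: dividing a DataFrame by a Series broadcasts across the columns, so the same percentages can be computed with
tmp = df2.drop('None')
tmp / tmp.sum() * 100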
