pandas - extract values greater than a threshold from a column - python

I have a DataFrame - a snapshot of which looks like this:
I am trying to grab all the math_score and reading_score values greater than 70 grouped by school_name.
So my end result should look something like this:
I am trying to calculate the % of students with a passing math_score and reading_score, that is, the % of scores > 70.
Any help on how I can go about this?
This is what I have tried:
school_data_grouped = school_data_complete.groupby('school_name')
passing_math_score = school_data_grouped.loc[(school_data_grouped['math_score'] >= 70)]
I get an error with this that says:
AttributeError: Cannot access callable attribute 'loc' of 'DataFrameGroupBy' objects, try using the 'apply' method
What can I do to achieve this? Any help is much appreciated.
Thanks!

You can create a column for whether each student passed, for example:
school_data['passed_math'] = school_data['math_score'] >= 70
school_data['passed_both'] = (school_data['math_score'] >= 70) & (school_data['reading_score'] >= 70)
You can then get the pass rate by school using a groupby:
pass_rate = school_data.groupby('school_name')[['passed_math', 'passed_both']].mean()
Selecting just the boolean columns keeps the mean meaningful: the mean of a boolean column is exactly the fraction of students who passed.
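Putting it together, a minimal end-to-end sketch; the frame below is toy data standing in for school_data_complete, with column names assumed from the question:
import pandas as pd

# toy stand-in for school_data_complete
school_data = pd.DataFrame({
    'school_name': ['A High', 'A High', 'B High', 'B High'],
    'math_score': [85, 60, 90, 75],
    'reading_score': [80, 72, 65, 95],
})

school_data['passed_math'] = school_data['math_score'] >= 70
school_data['passed_both'] = (school_data['math_score'] >= 70) & (school_data['reading_score'] >= 70)

# mean of a boolean column = fraction True; multiply by 100 for a percentage
pass_rate = school_data.groupby('school_name')[['passed_math', 'passed_both']].mean() * 100
print(pass_rate)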

You need to filter on math_score and reading_score first and then apply groupby, because groupby doesn't return a DataFrame.
To work on your question, I got data from this link
DATA
https://www.kaggle.com/aljarah/xAPI-Edu-Data/
I changed column names though.
CODE
import pandas as pd
school_data_df = pd.read_csv('xAPI-Edu-Data 2.csv')
school_data_df.head()
df_70_math_score = school_data_df[school_data_df.math_score > 70]
df_70_reading_math_score = df_70_math_score[df_70_math_score.reading_score > 70]
df_70_reading_math_score.head()
grouped_grade = df_70_reading_math_score.groupby('GradeID')
You can then generate any statistics you need from this groupby object, grouped_grade.
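For example, a couple of quick stats off the grouped object (column names as in the snippet above):
grouped_grade['math_score'].mean()  # average math score per grade
grouped_grade[['math_score', 'reading_score']].agg(['mean', 'count'])  # several stats at once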

Related

Pandas: using .query(inplace=True) doesn't filter dataset as expected

I'm quite new to Pandas and I'm doing some work on concrete compressive strength. I have a dataset which I've imported into a dataframe like so:
file = "Concrete_Data.csv"
data = pd.read_csv(file)
The dataset looks like this:
I then use the following code to create an additional column:
# insert water-cement ratio column
water_cement = pd.Series([])
for i in range(len(data)):
    water_cement[i] = round(
        data["Water"][i] / (data["Cement"][i] + data["Water"][i]), 2)
data.insert(4, "Water-Cement", water_cement, True)
But before I do this, I want to filter down my dataframe to only records with 'Age' <= 28 and then operate on that dataset. I tried using this code:
data.query('Age <= 28', inline=True)
But then upon creating the water-cement column I hit KeyError: 222. Can anyone explain this error and what's happening in the dataframe that is stopping the code from working?
You have a typo:
data.query('Age <= 28', inline=True)
should be
data.query('Age <= 28', inplace=True)
see pandas.DataFrame.query docs
First filter:
data = data.query('Age <= 28')
Or:
data = data[data['Age'] <= 28]
and then use a vectorized solution - divide and sum the columns:
data.insert(4, "Water-Cement", (data["Water"] / (data["Cement"] + data["Water"])).round(2))
This also explains the KeyError: 222: after filtering, data keeps its original row labels, so the loop's data["Water"][i] with i in range(len(data)) eventually asks for a label (like 222) that was filtered out. The vectorized version never indexes row by row, so the gaps in the index don't matter.
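Putting the two fixes together, a minimal sketch assuming the same Concrete_Data.csv column names:
import pandas as pd

data = pd.read_csv("Concrete_Data.csv")
data = data.query('Age <= 28')  # keep early-age records only

# vectorized: no row loop, so the filtered (non-contiguous) index is harmless
ratio = (data["Water"] / (data["Cement"] + data["Water"])).round(2)
data.insert(4, "Water-Cement", ratio)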

Calculating On Base Volume (OBV) with Python Pandas

I have a trading Python Pandas DataFrame which includes the "close" and "volume". I want to calculate the On-Balance Volume (OBV). I've got it working over the entire dataset but I want it to be calculated on a rolling series of 10.
The current function looks as follows...
def calculateOnBalanceVolume(df):
    df['obv'] = 0
    index = 1
    while index <= len(df) - 1:
        if df.iloc[index]['close'] > df.iloc[index-1]['close']:
            df.at[index, 'obv'] += df.at[index-1, 'obv'] + df.at[index, 'volume']
        if df.iloc[index]['close'] < df.iloc[index-1]['close']:
            df.at[index, 'obv'] += df.at[index-1, 'obv'] - df.at[index, 'volume']
        index = index + 1
    return df
This creates the "obv" column and works out the OBV over the 300 entries.
Ideally I would like to do something like this...
data['obv10'] = data.volume.rolling(10, min_periods=1).apply(calculateOnBalanceVolume)
This looks like it has potential to work but the problem is the "apply" only passes in the "volume" column so you can't work out the change in closing price.
I also tried this...
data['obv10'] = data[['close','volume']].rolling(10, min_periods=1).apply(calculateOnBalanceVolume)
Which sort of works but it tries to update the "close" and "volume" columns instead of adding the new "obv10" column.
What is the best way of doing this or do you just have to iterate over the data in batches of 10?
I found a more efficient way of doing the code above from this link:
Calculating stocks's On Balance Volume (OBV) in python
import numpy as np
def calculateOnBalanceVolume(df):
    df['obv'] = np.where(df['close'] > df['close'].shift(1), df['volume'],
                np.where(df['close'] < df['close'].shift(1), -df['volume'], 0)).cumsum()
    return df
The problem is this still does the entire data set. This looks pretty good but how can I cycle through it in batches of 10 at a time without looping or iterating through the entire data set?
*** UPDATE ***
I've got slightly closer to getting this working. I have managed to calculate the OBV in groups of 10.
for gid, df in data.groupby(np.arange(len(data)) // 10):
    df['obv'] = np.where(df['close'] > df['close'].shift(1), df['volume'],
                np.where(df['close'] < df['close'].shift(1), -df['volume'], 0)).cumsum()
I want this to be calculated rolling not in groups. Any idea how to do this using Pandas in an efficient way?
*** UPDATE ***
It turns out that OBV is supposed to be calculated over the entire data set. I've settled on the following code which looks correct now.
# calculate on-balance volume (obv)
self.df['obv'] = np.where(self.df['close'] > self.df['close'].shift(1), self.df['volume'],
              np.where(self.df['close'] < self.df['close'].shift(1), -self.df['volume'],
                       self.df.iloc[0]['volume'])).cumsum()
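For reference, a quick check of the vectorized version on toy data (values invented for illustration):
import numpy as np
import pandas as pd

df = pd.DataFrame({'close': [10, 11, 10, 12, 12],
                   'volume': [100, 150, 120, 200, 80]})
df['obv'] = np.where(df['close'] > df['close'].shift(1), df['volume'],
            np.where(df['close'] < df['close'].shift(1), -df['volume'], 0)).cumsum()
print(df)  # obv column: 0, 150, 30, 230, 230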

what is the source of this error: python pandas

import pandas as pd
census_df = pd.read_csv('census.csv')
#census_df.head()
def answer_seven():
    census_df_1 = census_df[(census_df['SUMLEV'] == 50)].set_index('CTYNAME')
    census_df_1['highest'] = census_df_1[['POPESTIAMTE2010','POPESTIAMTE2011','POPESTIAMTE2012','POPESTIAMTE2013','POPESTIAMTE2014','POPESTIAMTE2015']].max()
    census_df_1['lowest'] = census_df_1[['POPESTIAMTE2010','POPESTIAMTE2011','POPESTIAMTE2012','POPESTIAMTE2013','POPESTIAMTE2014','POPESTIAMTE2015']].min()
    x = abs(census_df_1['highest'] - census_df_1['lowest']).tolist()
    return x[0]

answer_seven()
This is trying to use the data from census.csv to find the counties that have the largest absolute change in population within 2010-2015 (the POPESTIMATE columns). I wanted to simply find the difference between the absolute values of the max and min for each year/column. You must return a string. Also, [(census_df['SUMLEV'] == 50)] means only counties are taken, as they are coded as 50. But the code gives an error that ends with
KeyError: "['POPESTIAMTE2010' 'POPESTIAMTE2011' 'POPESTIAMTE2012'
'POPESTIAMTE2013'\n 'POPESTIAMTE2014' 'POPESTIAMTE2015'] not in index"
Am I indexing the wrong data structure? I'm really new to datascience and coding.
I think the column names in the code have a typo. The pattern is 'POPESTIMATE201?', not 'POPESTIAMTE201?'.
Any help with shortening the code will be appreciated. Here is the code that works -
census_df = pd.read_csv('census.csv')

def answer_seven():
    cdf = census_df[(census_df['SUMLEV'] == 50)].set_index('CTYNAME')
    columns = ['POPESTIMATE2010', 'POPESTIMATE2011', 'POPESTIMATE2012',
               'POPESTIMATE2013', 'POPESTIMATE2014', 'POPESTIMATE2015']
    cdf['big'] = cdf[columns].max(axis=1)
    cdf['sml'] = cdf[columns].min(axis=1)
    cdf['change'] = cdf['big'] - cdf['sml']
    return cdf['change'].idxmax()
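On the shortening point, a more compact equivalent sketch, assuming the same census.csv layout:
def answer_seven():
    cdf = census_df[census_df['SUMLEV'] == 50].set_index('CTYNAME')
    cols = ['POPESTIMATE%d' % y for y in range(2010, 2016)]
    # difference between per-county max and min, then the county with the largest change
    return (cdf[cols].max(axis=1) - cdf[cols].min(axis=1)).idxmax()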

pandas iterrows throwing error

I am trying to do a change data capture on two dataframes. The logic is to merge the two dataframes, group by one key, and then run a loop over groups having count > 1 to see which column 'updated'. I am getting a strange error. Any help is appreciated.
code
import pandas as pd
import numpy as np
pd.set_option('display.height', 1000)
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
print("reading wolverine xlxs")
# defining metadata
df_header = ['DisplayName','StoreLanguage','Territory','WorkType','EntryType','TitleInternalAlias',
'TitleDisplayUnlimited','LocalizationType','LicenseType','LicenseRightsDescription',
'FormatProfile','Start','End','PriceType','PriceValue','SRP','Description',
'OtherTerms','OtherInstructions','ContentID','ProductID','EncodeID','AvailID',
'Metadata', 'AltID', 'SuppressionLiftDate','SpecialPreOrderFulfillDate','ReleaseYear','ReleaseHistoryOriginal','ReleaseHistoryPhysicalHV',
'ExceptionFlag','RatingSystem','RatingValue','RatingReason','RentalDuration','WatchDuration','CaptionIncluded','CaptionExemption','Any','ContractID',
'ServiceProvider','TotalRunTime','HoldbackLanguage','HoldbackExclusionLanguage']
df_w01 = pd.read_excel("wolverine_1.xlsx", names = df_header)
df_w02 = pd.read_excel("wolverine_2.xlsx", names = df_header)
df_w01['version'] = 'OLD'
df_w02['version'] = 'NEW'
#print(df_w01)
df_m_d = pd.concat([df_w01, df_w02], ignore_index = True)
first_pass = df_m_d[df_m_d.duplicated(['StoreLanguage','Territory','TitleInternalAlias','LocalizationType','LicenseType','FormatProfile'], keep=False)]
first_pass_keep_duplicate = df_m_d[df_m_d.duplicated(['StoreLanguage','Territory','TitleInternalAlias','LocalizationType','LicenseType','FormatProfile'], keep='first')]
group_by_1 = first_pass.groupby(['StoreLanguage','Territory','TitleInternalAlias','LocalizationType','LicenseType','FormatProfile'])
for i, rows in group_by_1.iterrows():
    print("rownumber", i)
    print(rows)
print(first_pass)
And The error I get :
AttributeError: Cannot access callable attribute 'iterrows' of 'DataFrameGroupBy' objects, try using the 'apply' method
Any help is much appreciated.
Your GroupBy object supports iteration, so instead of
for i, rows in group_by_1.iterrows():
    print("rownumber", i)
    print(rows)
you need to do something like
for name, group in group_by_1:
    print(name)
    print(group)
then you can do what you need to do with each group.
See the docs
Why not do as suggested and use apply? Something like:
def print_rows(rows):
    print(rows)

group_by_1.apply(print_rows)
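And to get at the original goal - finding which columns changed between the OLD and NEW version of a row - here is one possible sketch, assuming each duplicated group contains exactly one OLD and one NEW row:
def changed_columns(group):
    # assumes exactly one OLD and one NEW row per group
    old = group[group['version'] == 'OLD'].iloc[0]
    new = group[group['version'] == 'NEW'].iloc[0]
    diff = old != new  # note: NaN != NaN is True, so NaN pairs show up as changes
    return [col for col in group.columns if diff[col] and col != 'version']

for name, group in group_by_1:
    if len(group) > 1:
        print(name, changed_columns(group))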

Python pandas extract variables from dataframe

What is the best way to convert DataFrame columns into variables? I have a condition for bet placement and I use head(n=1):
back_bf_lay_bq = bb[(bb['bf_back_bq_lay_lose_net'] > 0) & (bb['bq_lay_price'] < 5) & (bb['bq_lay_price'] != 0) & (bb['bf_back_liquid'] > bb['bf_back_stake']) & (bb['bq_lay_liquid'] > bb['bq_lay_horse_win'])].head(n=1)
I would like to convert the columns into variables and pass them to an API for bet placement. So I convert back_bf_lay_bq to a dictionary and extract the values:
#Bets placements
dd = pd.DataFrame.to_dict(back_bf_lay_bq, orient='list')
#Betdaq bet placement
bq_selection_id = dd['bq_selection_id'][0]
bq_lay_stake = dd['bq_lay_stake'][0]
bq_lay_price = dd['bq_lay_price'][0]
bet_type = 2
reset_count = dd['bq_count_reset'][0]
withdrawal_sequence = dd['bq_withdrawal_sequence'][0]
kill_type = 2
betdaq_request = betdaq_api.PlaceOrdersNoReceipt(bq_selection_id,bq_lay_stake,bq_lay_price,bet_type,reset_count,withdrawal_sequence,kill_type)
I do not think this is the most efficient way, and it triggers a bug from time to time:
bq_selection_id = dd['bq_selection_id'][0]
IndexError: list index out of range
So can you suggest a better way to get values from DataFrame and pass them to API?
IIUC you could use iloc to get your first row and then slice your dataframe with your column subset and pass that to your variables. Something like this:
bq_selection_id, bq_lay_stake, bq_lay_price, withdrawal_sequence = back_bf_lay_bq[['bq_selection_id', 'bq_lay_stake', 'bq_lay_price', 'withdrawal_sequence']].iloc[0]
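As an aside, the intermittent IndexError suggests the filter sometimes matches no rows at all, in which case head(n=1) returns an empty frame. A guarded sketch:
cols = ['bq_selection_id', 'bq_lay_stake', 'bq_lay_price', 'withdrawal_sequence']
if not back_bf_lay_bq.empty:
    bq_selection_id, bq_lay_stake, bq_lay_price, withdrawal_sequence = back_bf_lay_bq[cols].iloc[0]
else:
    # no row met the bet-placement condition, so skip placing a bet
    pass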
