Use of iterrows() and arithmetic in Pandas - python

Indexing into an array in C is pretty easy, and the brackets handle arithmetic nicely, allowing for the comparison of adjacent values. That's what I'd like to do with iterrows() in Pandas, but I can't find a suitable example that shows how. Consider the following:
Year Name Winner Count
432 1936 Alice 0.0 2
538 1937 Alice 1.0 2
6391 1985 Bob 1.0 2
6818 1989 Brad 0.0 2
Alice did not win a prize in 1936, but she did win one in 1937. I need to iterate over all of the rows, 1) check whether the Year in row n immediately follows the Year in row n - 1, and 2) if so, whether the subject won in the second year and not the first. Alice fits the bill, and I'd like to loop through the frame printing her name and that of everyone else who meets the criteria.
I had started with . . .
for index, row in df.iterrows():
    if df['Year'] > df[df.Year - 1]:
And got, among other things, an indication that the data type I had explicitly cast as an int (i.e., Year) was now being returned as a string. Is there a way to do this, or should I explore a different method?
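A corrected loop along those lines is possible. Here's a minimal sketch (it assumes df has Year, Name, and Winner columns, and flags a win in a year that directly follows the same person's previous year):
prev = None
for index, row in df.sort_values(['Name', 'Year']).iterrows():
    # compare this row's Year to the previous row's, like adjacent array slots in C
    if (prev is not None
            and row['Name'] == prev['Name']
            and row['Year'] == prev['Year'] + 1
            and row['Winner'] == 1.0):
        print(row['Name'])
    prev = row
That said, a vectorized approach is usually preferable.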

Here's some augmented sample data, to account for edge cases:
Year Name Winner Count
432 1936 Alice 0.0 2
538 1937 Alice 1.0 2
6390 1985 Bob 1.0 2
6817 1989 Brad 0.0 2
433 1997 Alice 0.0 2
539 1993 Alice 1.0 2
6391 1986 Bob 1.0 2
6818 1990 Brad 0.0 2
6819 1991 Brad 0.0 2
This approach sorts rows by Name and Year, then establishes whether a given year meets the criteria for inclusion (i.e., consecutive with the year before, and a win).
Then a simple groupby() finds the subjects who qualify.
import pandas as pd
df = pd.read_clipboard()
df.sort_values(['Name','Year'], inplace=True)
# eligible = consecutive year and won in that year
df['eligible'] = (df.Year.subtract(df.Year.shift()) == 1) & (df.Winner == 1)
# identify any person with at least one eligible year
df.groupby('Name').eligible.any()
Output:
Name
Alice True
Bob True
Brad False
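One caveat: after the global sort, df.Year.shift() compares across name boundaries, so a subject whose first year happens to directly follow the previous subject's last year could be flagged incorrectly. A per-name diff avoids that; a sketch on the same frame:
# diff years within each Name group so comparisons never cross people
df['eligible'] = (df.groupby('Name').Year.diff() == 1) & (df.Winner == 1)
df.groupby('Name').eligible.any()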

Related

Compare two Excel files and show divergences using Python

I was comparing two Excel files which contain information about the students of two schools. However, the files may contain different numbers of rows.
The first step was to import the Excel files into two dataframes:
df1 = pd.read_excel('School A - Information.xlsx')
df2 = pd.read_excel('School B - Information.xlsx')
print(df1)
Name Age Birth_Country Previous Schools
0 tom 10 USA 3
1 nick 15 MEX 1
2 juli 14 CAN 0
3 tom 19 NOR 1
print(df2)
Name Age Birth_Country Previous Schools
0 tom 10 USA 3
1 tom 19 NOR 1
2 nick 15 MEX 4
After this, I would like to check the divergences between those two dataframes (index order is not important). However, I am receiving an error because the dataframes have different sizes (df1 has four rows, df2 has three), so the elementwise comparison cannot be broadcast.
compare = df1.values == df2.values
<ipython-input-9-7cc64ba0e622>:1: DeprecationWarning: elementwise comparison failed; this will raise an error in the future.
compare = df1.values == df2.values
print(compare)
False
Adding to that, I would like to create a third DataFrame that shows the corresponding divergences.
import numpy as np
rows, cols = np.where(compare == False)
for item in zip(rows, cols):
    df1.iloc[item[0], item[1]] = '{} --> {}'.format(df1.iloc[item[0], item[1]], df2.iloc[item[0], item[1]])
However, this code does not work, as the index order may differ between the two dataframes.
My expected output should be the below dataframe:
You can use pd.merge to accomplish this. If you're unfamiliar with dataframe merges, here's a post that describes relational database merging ideas: link. So in this case, what we want to do is first do a left merge of df2 onto df1 to find how the Previous Schools column differs:
df_merged = pd.merge(df1, df2, how="left", on=["Name", "Age", "Birth_Country"], suffixes=["_A", "_B"])
print(df_merged)
will give you a new dataframe
Name Age Birth_Country Previous Schools_A Previous Schools_B
0 tom 10 USA 3 3.0
1 nick 15 MEX 1 4.0
2 juli 14 CAN 0 NaN
3 tom 19 NOR 1 1.0
This new dataframe has all the information you're looking for. To find just the rows where the Previous Schools entries differ:
df_different = df_merged[df_merged["Previous Schools_A"]!=df_merged["Previous Schools_B"]]
print(df_different)
Name Age Birth_Country Previous Schools_A Previous Schools_B
1 nick 15 MEX 1 4.0
2 juli 14 CAN 0 NaN
and to find the rows where Previous Schools has not changed:
df_unchanged = df_merged[df_merged["Previous Schools_A"]==df_merged["Previous Schools_B"]]
print(df_unchanged)
Name Age Birth_Country Previous Schools_A Previous Schools_B
0 tom 10 USA 3 3.0
3 tom 19 NOR 1 1.0
If I were you, I'd stop here, because the final dataframe you want is going to have generic object column types due to the mix of strings and integers, which will limit its uses... but maybe you need that particular formatting for some reason. In that case, it's all about putting these dataframe subsets together in the right way to get your desired formatting. Here's one way.
First, initialize the final dataframe with the unchanged rows:
df_final = df_unchanged[["Name", "Age", "Birth_Country", "Previous Schools_A"]].copy()
df_final = df_final.rename(columns={"Previous Schools_A": "Previous Schools"})
print(df_final)
Name Age Birth_Country Previous Schools
0 tom 10 USA 3
3 tom 19 NOR 1
Now process the entries that have changed between dataframes. There are two cases here: where the entry has changed (Previous Schools_B is not NaN) and where the entry is new (Previous Schools_B is NaN). We'll deal with each in turn:
changed_entries = df_different[~pd.isnull(df_different["Previous Schools_B"])].copy()
changed_entries["Previous Schools"] = changed_entries["Previous Schools_A"].astype('str') + " --> " + changed_entries["Previous Schools_B"].astype('int').astype('str')
changed_entries = changed_entries.drop(columns=["Previous Schools_A", "Previous Schools_B"])
print(changed_entries)
Name Age Birth_Country Previous Schools
1 nick 15 MEX 1 --> 4
and now process the entries that are completely new:
new_entries = df_different[pd.isnull(df_different["Previous Schools_B"])].copy()
new_entries = "NaN --> " + new_entries[["Name", "Age", "Birth_Country", "Previous Schools_A"]].astype('str')
new_entries = new_entries.rename(columns={"Previous Schools_A": "Previous Schools"})
print(new_entries)
Name Age Birth_Country Previous Schools
2 NaN --> juli NaN --> 14 NaN --> CAN NaN --> 0
and finally, concatenate all the dataframes:
df_final = pd.concat([df_final, changed_entries, new_entries])
print(df_final)
Name Age Birth_Country Previous Schools
0 tom 10 USA 3
3 tom 19 NOR 1
1 nick 15 MEX 1 --> 4
2 NaN --> juli NaN --> 14 NaN --> CAN NaN --> 0
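As an aside, if all you need is to classify rows as shared or divergent while ignoring index order, an outer merge with indicator=True is a compact alternative (a sketch, assuming the same df1 and df2):
# _merge labels each row 'both', 'left_only' (df1 only) or 'right_only' (df2 only)
diff = pd.merge(df1, df2, how='outer', indicator=True)
print(diff[diff['_merge'] != 'both'])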

Fill a new column with the division of a column in an Excel sheet by looking up the denominator value from another sheet

I have 2 dataframes, as given below:
import pandas as pd
restaurant = pd.read_excel("C:/Users/Avinash/Desktop/restaurant data.xlsx")
restaurant
Restaurant StartYear Capex inflation_adjusted_capex
Bawarchi Restaurant 1986 6000 NaN
Ks Baker's 1988 2000 NaN
Rajesh Restaurant 1989 1050 NaN
Ahmed Steak House 1990 9000 NaN
Absolute Barbique 1997 9500 NaN
inflation = pd.read_excel("C:/Users/Avinash/Desktop/restaurant data.xlsx", sheet_name="Sheet2")
inflation
Years Inflation_Factor
1985 0.111
1986 0.134
1987 0.191
1988 0.2253
1989 0.265
1990 0.304
Aim: to fill "inflation_adjusted_capex" with the division of "Capex" by the corresponding year's "Inflation_Factor" from the second dataframe.
The code I wrote is:
for i in restaurant["StartYear"]:
    restaurant["inflation_adjusted_capex"] = (restaurant["inflation_adjusted_capex"]) / (inflation[inflation["Years"] == i]["Inflation_Factor"])
print(restaurant["inflation_adjusted_capex"])
0    NaN
1    NaN
2    NaN
3    NaN
4    NaN
Name: inflation_adjusted_capex, dtype: float64
Unfortunately this code returns NaN values; kindly help me. Thanks in advance.
There are a couple of ways to do this. (As an aside, your loop returns NaN because inflation_adjusted_capex starts out as NaN, and dividing NaN by anything is still NaN; the division is also aligned by index rather than matched by year.) The first is to join the dataframes so that you have your inflation factors in the first dataframe, and then do the calculation:
# add the Inflation_Factor column to the first dataframe
restaurant = restaurant.merge(inflation, left_on='StartYear', right_on='Years')
# do the division
restaurant['inflation_adjusted_capex'] = restaurant['Capex'] / restaurant['Inflation_Factor']
The other is to apply a function that behaves like an Excel VLOOKUP:
# set Years as the index of inflation so we can look up based on it
inflation = inflation.set_index('Years')
# look up the inflation factor and divide, applying a lambda function row-wise
restaurant['inflation_adjusted_capex'] = restaurant.apply(lambda row: row['Capex'] / inflation['Inflation_Factor'][row['StartYear']], axis=1)
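A third option is Series.map, which does the lookup without a merge or a row-wise apply. A sketch (it assumes inflation still has Years as an ordinary column):
# build a Years -> Inflation_Factor lookup, then divide elementwise
factors = inflation.set_index('Years')['Inflation_Factor']
restaurant['inflation_adjusted_capex'] = restaurant['Capex'] / restaurant['StartYear'].map(factors)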

Chained conditional count in Pandas

I have a dataframe that looks at how a form has been filled out. Here's an example:
ID Name Postcode Street Employer Salary
1 John NaN Craven Road NaN NaN
2 Sue TD2 NAN NaN 15000
3 Jimmy MW6 Blake Street Bank 40000
4 Laura QE2 Mill Lane NaN 20000
5 Sam NW2 Duke Avenue Farms 35000
6 Jordan SE6 NaN NaN NaN
7 NaN CB2 NaN Startup NaN
I want to return a count of successively filled out columns on the condition that all previous columns have been filled. The final output should look something like:
Name Postcode Street Employer Salary
6 5 3 2 2
Is there a good Pandas way of doing this? I suppose there could be a way of applying a mask so that, if any previous boolean is zero, the current column is also zero, and then counting that, but I'm not sure whether that is the best way.
Thanks!
I think you can use notnull and cummin:
In [99]: df.notnull().cummin(axis=1).sum(axis=0)
Out[99]:
Name 6
Postcode 5
Street 3
Employer 2
Salary 2
dtype: int64
Although note that I had to replace your NAN (Sue's street) with a float NaN before I did that, and I assumed that ID was your index.
The cumulative minimum is one way to implement "applying a mask so that if any previous boolean is given as zero the current column is also zero", as you predicted would work.
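That replacement step might look like this (a sketch; it assumes the literal string 'NAN' only ever stands in for a missing value):
import numpy as np
# turn the string 'NAN' into a real missing value so notnull() treats it as absent
df = df.replace('NAN', np.nan)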
Maybe cumprod. BTW, you have the string 'NAN' in your df, which counts as notnull here:
df.notnull().cumprod(1).sum()
Out[59]:
ID 7
Name 6
Postcode 5
Street 4
Employer 2
Salary 2
dtype: int64

Pandas fillna with DataFrame of values

According to the docs, the fillna value parameter can be one of the following:
value : scalar, dict, Series, or DataFrame
Value to use to fill holes (e.g. 0), alternately a dict/Series/DataFrame of values specifying which value to use for each index (for a Series) or column (for a DataFrame). (values not in the dict/Series/DataFrame will not be filled). This value cannot be a list.
I have a data frame that looks like:
PassengerId Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 892 3 Kelly, Mr. James male 34.5 0 0 330911 7.8292 NaN Q
1 893 3 Wilkes, Mrs. James (Ellen Needs) female 47.0 1 0 363272 7.0000 NaN S
2 894 2 Myles, Mr. Thomas Francis male 62.0 0 0 240276 9.6875 NaN Q
3 895 3 Wirz, Mr. Albert male 27.0 0 0 315154 8.6625 NaN S
4 896 3 Hirvonen, Mrs. Alexander (Helga E Lindqvist) female 22.0 1 1 3101298 12.2875 NaN S
And that is what I want to do:
NaN Cabin will be filled according to the median value given the Pclass feature value
NaN Age will be filled according to its median value across the data set
NaN Embarked will be filled according to the median value given the Pclass feature value
So after some data manipulation, I got this data frame:
Pclass Cabin Embarked Ticket
0 1 C S 50
1 2 F S 13
2 3 G S 5
What it says is that for Pclass == 1 the most common Cabin is C. Given that, in my original data frame df I want to fill every null df['Cabin'] with C.
This is a small example and I could treat each possible null combination by hand with something like:
df_both.loc[(df_both['Pclass'] == 1) & (df_both['Cabin'].isnull()), 'Cabin'] = 'C'
However, I wonder if I can use this derived data frame to do all this filling automatically.
Thank you.
If you want to fill all NaNs with something like the median or the mean of the specific column, you can do the following.
For the median:
df.fillna(df.median())
For the mean:
df.fillna(df.mean())
see https://pandas.pydata.org/pandas-docs/stable/missing_data.html#filling-with-a-pandasobject for more information.
Edit:
Alternatively, you can use a dictionary of specified values. The keys need to map to column names. This way you can also impute missing values in string columns.
df.fillna({'col1':'a','col2': 1})
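And to use the derived per-Pclass frame from the question to fill automatically, here is a sketch (it assumes that frame is named modes and holds the most common Cabin and Embarked per Pclass):
# build Pclass -> most-common-value lookups from the derived frame
cabin_map = modes.set_index('Pclass')['Cabin']
embarked_map = modes.set_index('Pclass')['Embarked']
# fill each NaN from the lookup matching the row's Pclass
df['Cabin'] = df['Cabin'].fillna(df['Pclass'].map(cabin_map))
df['Embarked'] = df['Embarked'].fillna(df['Pclass'].map(embarked_map))
# Age uses the overall median, per the question
df['Age'] = df['Age'].fillna(df['Age'].median())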

Max value using idxmax

I am trying to calculate the biggest difference between summer gold medal counts and winter gold medal counts relative to their total gold medal count. The problem is that I need to consider only countries that have won at least 1 gold medal in both summer and winter.
Gold: Count of summer gold medals
Gold.1: Count of winter gold medals
Gold.2: Total Gold
This is a sample of my data:
Gold Gold.1 Gold.2 ID diff gold %
Afghanistan 0 0 0 AFG NaN
Algeria 5 0 5 ALG 1.000000
Argentina 18 0 18 ARG 1.000000
Armenia 1 0 1 ARM 1.000000
Australasia 3 0 3 ANZ 1.000000
Australia 139 5 144 AUS 0.930556
Austria 18 59 77 AUT 0.532468
Azerbaijan 6 0 6 AZE 1.000000
Bahamas 5 0 5 BAH 1.000000
Bahrain 0 0 0 BRN NaN
Barbados 0 0 0 BAR NaN
Belarus 12 6 18 BLR 0.333333
This is the code that I have but it is giving the wrong answer:
def answer():
    Gold_Y = df2[(df2['Gold'] > 1) | (df2['Gold.1'] > 1)]
    df2['difference'] = (df2['Gold'] - df2['Gold.1']).abs() / df2['Gold.2']
    return df2['diff gold %'].idxmax()
answer()
Try this code after subbing in the correct (your) function and variable names. I'm new to Python, but I think the issue was that you had to use the same variable in line 4 (df1['difference']) and just add the method (.idxmax()) to the end. I don't think you need the first line of the function either, as you don't use the local variable (Gold_Y). FYI, I don't think we're working with the same dataset.
def answer_three():
    df1['difference'] = (df1['Gold'] - df1['Gold.1']).abs() / df1['Gold.2']
    return df1['difference'].idxmax()
answer_three()
def answer_three():
    atleast_one_gold = df[(df['Gold'] > 0) & (df['Gold.1'] > 0)]
    return ((atleast_one_gold['Gold'] - atleast_one_gold['Gold.1']) / atleast_one_gold['Gold.2']).idxmax()
answer_three()
def answer_three():
    _df = df[(df['Gold'] > 0) & (df['Gold.1'] > 0)]
    return ((_df['Gold'] - _df['Gold.1']) / _df['Gold.2']).idxmax()
answer_three()
This looks like a question from the programming assignment of the Coursera course "Introduction to Data Science in Python".
Having said that, if you are not cheating, "maybe" the bug is here:
Gold_Y = df2[(df2['Gold'] > 1) | (df2['Gold.1'] > 1)]
You should use the & operator. The | operator means you keep countries that have won gold in either the Summer or the Winter Olympics. With &, you should not get a NaN in your diff gold %.
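With that fix (and > 0 rather than > 1, since "at least 1 gold" includes exactly one), the filter would read:
Gold_Y = df2[(df2['Gold'] > 0) & (df2['Gold.1'] > 0)]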
def answer_three():
    diff = df['Gold'] - df['Gold.1']
    relativegold = diff.abs() / df['Gold.2']
    df['relativegold'] = relativegold
    x = df[(df['Gold.1'] > 0) & (df['Gold'] > 0)]
    return x['relativegold'].idxmax(axis=0)
answer_three()
I am pretty new to Python, and to programming as a whole, so my solution is about as novice as it gets! I love to create variables, so you'll see a lot of them in the solution.
def answer_three():
    # Boolean masking that keeps only the Gold values matching the condition stated
    # in the question; in this case, countries with at least one gold medal in the
    # summer Olympics.
    a = df.loc[df['Gold'] > 0, 'Gold']
    # Same comment as above, but 'Gold.1' is gold medals in the winter Olympics.
    b = df.loc[df['Gold.1'] > 0, 'Gold.1']
    # The absolute value of the difference between a and b.
    dif = abs(a - b)
    # I only realised later that this step wasn't essential, because the data frame
    # had already summed these up in the column 'Gold.2'.
    tots = a + b
    # dropna() drops the NaN values (countries excluded by either mask) before dividing.
    result = dif.dropna() / tots.dropna()
    # Return the index label of the max result.
    return result.idxmax()
def answer_two():
    df2 = pd.Series.max(df['Gold'] - df['Gold.1'])
    df2 = df[df['Gold'] - df['Gold.1'] == df2]
    return df2.index[0]
answer_two()
def answer_three():
    return ((df[(df['Gold'] > 0) & (df['Gold.1'] > 0)]['Gold'] - df[(df['Gold'] > 0) & (df['Gold.1'] > 0)]['Gold.1']) / df[(df['Gold'] > 0) & (df['Gold.1'] > 0)]['Gold.2']).idxmax()
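A closing note on idxmax versus argmax, since the snippets above mix them: in current pandas, Series.idxmax returns the index label of the maximum (the country name here), while Series.argmax returns its integer position. A minimal illustration:
import pandas as pd
s = pd.Series([0.33, 0.93, 0.53], index=['BLR', 'AUS', 'AUT'])
print(s.idxmax())  # AUS  (index label of the max)
print(s.argmax())  # 1    (integer position of the max)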
