Filling missing age in titanic dataset - python

I know there are several hundred solutions to this, but I was wondering if there is a smarter way to fill the missing Age column in a pandas DataFrame based on lengthy conditions like the following.
mean_value = df[(df["Survived"] == 1) & (df["Pclass"] == 1) & (df["Sex"] == "male")
                & (df["Embarked"] == "C") & (df["SibSp"] == 0) & (df["Parch"] == 0)].Age.mean().round(2)
df = df.assign(
    Age=np.where(df.Survived.eq(1) & df.Pclass.eq(1) & df.Sex.eq("male") & df.Embarked.eq("C")
                 & df.SibSp.eq(0) & df.Parch.eq(0) & df.Age.isnull(), mean_value, df.Age)
)
Repeating the above for all six columns, with every combination of categories, is long and bulky. Is there a smarter way to do this?
Edit, regarding Ben.T's answer:
If I understood your method correctly, is this the "verbose version" of it?
for a in np.unique(df.Survived):
    for b in np.unique(df.Pclass):
        for c in np.unique(df.Sex):
            for d in np.unique(df.SibSp):
                for e in np.unique(df.Parch):
                    for f in np.unique(df.Embarked):
                        mean_value = df[(df["Survived"] == a) & (df["Pclass"] == b) & (df["Sex"] == c)
                                        & (df["SibSp"] == d) & (df["Parch"] == e) & (df["Embarked"] == f)].Age.mean()
                        df = df.assign(Age=np.where(df.Survived.eq(a) & df.Pclass.eq(b) & df.Sex.eq(c)
                                                    & df.SibSp.eq(d) & df.Parch.eq(e) & df.Embarked.eq(f)
                                                    & df.Age.isnull(), mean_value, df.Age))
which is equivalent to this?
l_col = ['Survived','Pclass','Sex','Embarked','SibSp','Parch']
df['Age'] = df['Age'].fillna(df.groupby(l_col)['Age'].transform('mean'))

You can create a variable that combines all of your criteria, and then you can use the ampersand to add more criteria later.
Note, in the seaborn titanic dataset, where I got the data from, the column names are lowercase.
criteria = ((df["survived"] == 1) &
            (df["pclass"] == 1) &
            (df["sex"] == "male") &
            (df["embarked"] == "C") &
            (df["sibsp"] == 0) &
            (df["parch"] == 0))
fillin = df.loc[criteria, 'age'].mean()
df.loc[criteria & (df['age'].isnull()), 'age'] = fillin
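Because the criteria live in one variable, you can keep narrowing the group later with the same ampersand. For illustration only (the fare threshold here is hypothetical; fare is another column in the seaborn dataset):
criteria = criteria & (df["fare"] > 50)  # hypothetical extra condition
fillin = df.loc[criteria, 'age'].mean()
df.loc[criteria & (df['age'].isnull()), 'age'] = fillin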

I guess groupby.transform can do it. For each row, it computes the mean over the group defined by all the columns in the groupby, and it does so for all possible combinations at once. Then fillna with the resulting series fills each missing value with the mean of the group that has the same characteristics.
l_col = ['Survived','Pclass','Sex','Embarked','SibSp','Parch']
df['Age'] = df['Age'].fillna(df.groupby(l_col)['Age'].transform('mean'))
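To see the mechanics on a small scale, here is a minimal sketch on a toy frame (not the Titanic data): each NaN becomes the mean Age of its Sex group.
import numpy as np
import pandas as pd

toy = pd.DataFrame({'Sex': ['male', 'male', 'female', 'female'],
                    'Age': [20.0, np.nan, 30.0, np.nan]})
# transform('mean') returns a series aligned to toy's index: 20.0, 20.0, 30.0, 30.0
toy['Age'] = toy['Age'].fillna(toy.groupby('Sex')['Age'].transform('mean'))
# toy['Age'] is now 20.0, 20.0, 30.0, 30.0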

Related

Comparing two dataframes in pandas - Python

I have two dataframes and I have to compare them. If a row is the same in both dfs, I should find that row. I have written code, but it is not running correctly.
def color(row):
    if ((df3_2['dt'].values == df_condition['dt'].values)
            & (df3_2['platform'].values == df_condition['platform'].values)
            & (df3_2['network'].values == df_condition['network'].values)
            & (df3_2['currency'].values == df_condition['currency'].values)
            & (df3_2['cost'].values == df_condition['cost'].values)):
        return "asdas"
    else:
        return "wewq"

df3_2.loc[:, 'demo'] = df3_2.apply(color, axis=1)
The code wrote "asdas" for every row in the dataframe, even though the dataframes differ.
Thanks
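Note that color never uses its row argument: it compares entire columns on every call, so every row gets the same label. A sketch of a column-wise fix, assuming both frames are row-aligned with the same index (the column list is taken from your code):
import numpy as np

cols = ['dt', 'platform', 'network', 'currency', 'cost']
# True wherever all five values in a row equal the corresponding row of df_condition
matches = (df3_2[cols] == df_condition[cols]).all(axis=1)
df3_2['demo'] = np.where(matches, 'asdas', 'wewq')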

Pandas fillna() inplace Parameter Isn't Working Without Using a Triple For Loop

I'm trying to break a DataFrame into four parts and to impute rounded mean values for each part using fillna(). I have two columns I want to filter on, main_campus and degree_type, each with two unique values. Between them I should be able to filter the DataFrame into four groups.
I first did this with a triple for loop (see below), which seems to work, but when I tried to do it in a more elegant way, I got a SettingWithCopyWarning that I couldn't fix by using .loc or .copy(), and the missing values wouldn't be filled even when inplace was set to True. Here's the code for the latter method:
# Imputing mean values for main campus BA students
df[(df.main_campus == 1) &
   (df.degree_type == 'BA')] = df[(df.main_campus == 1) &
                                  (df.degree_type == 'BA')].fillna(
                                      df[(nulled_data.main_campus == 1) &
                                         (df.degree_type == 'BA')].mean(),
                                      inplace=True)
# Imputing mean values for main campus BS students
df[(df.main_campus == 1) &
   (df.degree_type == 'BS')] = df[(df.main_campus == 1) &
                                  (df.degree_type == 'BS')].fillna(
                                      df[(df.main_campus == 1) &
                                         (df.degree_type == 'BS')].mean(),
                                      inplace=True)
# Imputing mean values for downtown campus BA students
df[(df.main_campus == 0) &
   (df.degree_type == 'BA')] = df[(df.main_campus == 0) &
                                  (df.degree_type == 'BA')].fillna(
                                      df[(df.main_campus == 0) &
                                         (df.degree_type == 'BA')].mean(),
                                      inplace=True)
# Imputing mean values for downtown campus BS students
df[(df.main_campus == 0) &
   (df.degree_type == 'BS')] = df[(df.main_campus == 0) &
                                  (df.degree_type == 'BS')].fillna(
                                      df[(df.main_campus == 0) &
                                         (df.degree_type == 'BS')].mean(),
                                      inplace=True)
I should mention the previous code went through several iterations, trying it without setting it back to the slice, with and without inplace, etc.
Here's the code with the triple for loop that works:
imputation_cols = [...]  # all the columns I want to impute
for col in imputation_cols:
    for i in [1, 0]:
        for path in ['BA', 'BS']:
            group = ndf.loc[((df.main_campus == i) &
                             (df.degree_type == path)), :]
            group = group.fillna(value=round(group.mean()))
            df.loc[((df.main_campus == i) &
                    (df.degree_type == path)), :] = group
It's worth mentioning that I think the use of the group variable in the triple for loop code is also to help the filled NaN values actually get set back to the DataFrame, but I would need to double check this.
Does anyone have an idea for what's going on here?
A good way to approach such a problem is to simplify your code. Simplifying your code makes it easier to find the source of the warning:
group1 = (df.main_campus == 1) & (df.degree_type == 'BA')
group2 = (df.main_campus == 1) & (df.degree_type == 'BS')
group3 = (df.main_campus == 0) & (df.degree_type == 'BA')
group4 = (df.main_campus == 0) & (df.degree_type == 'BS')
# Imputing mean values for main campus BA students
df.loc[group1, :] = df.loc[group1, :].fillna(df.loc[group1, :].mean()) # repeat for other groups
Now you can see the problem more clearly. You are trying to write the mean of the df back to the df. Pandas issues a warning because the slice you use to compute the mean could be inconsistent with the changed dataframe. In your case it produces the correct result. But the consistency of your dataframe is at risk.
You could solve this by computing the mean beforehand:
group1_mean = df.loc[group1, :].mean()
df.loc[group1, :] = df.loc[group1, :].fillna(group1_mean)
In my opinion this makes the code more clear. But you still have four groups (group1, group2, ...). A clear sign to use a loop:
from itertools import product
for campus, degree in product([1, 0], ['BS', 'BA']):
group = (df.main_campus == campus) & (df.degree_type == degree)
group_mean = df.loc[group, :].mean()
df.loc[group, :] = df.loc[group, :].fillna(group_mean)
I have used product from itertools to get rid of the ugly nested loop. It is quite similar to your "inelegant" first solution. So you were almost there the first time.
We ended up with four lines of code and a loop. I am sure with some pandas magic you could convert it to one line. However, you will still understand these four lines in a week or a month or a year from now. Also, other people reading your code will understand it easily. Readability counts.
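For reference, that one-liner would likely lean on groupby.transform; a sketch, assuming the non-grouping columns are all numeric:
# broadcast each group's (rounded) mean back to the original shape, then fill
group_means = df.groupby(['main_campus', 'degree_type']).transform('mean').round()
df = df.fillna(group_means)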
Disclaimer: I could not test the code since you did not provide a sample dataframe. So my code may throw an error because of a typo. A minimal reproducible example makes it so much easier to answer questions. Please consider this the next time you post a question on SO.

Line type data plot doesn't appear

There is this dataset that I'm going to plot (it can be obtained here: https://www.kaggle.com/russellyates88/suicide-rates-overview-1985-to-2016). I wanted to plot suicides_no for males and females, 25-34 years old, in the Russian Federation, from 2000 to 2015. So I created a new data frame for that.
Here's the main data frame.
DF = pd.read_csv("D:/who_suicide_statistics.csv")
Here's my code for creating the new data frame.
DF1 = DF.loc[(DF["country"] == "Russian Federation") & (DF["age"] == "25-34 years")
             & (DF["sex"] == "male") & (DF["year"] >= 2000)]
DF2 = DF.loc[(DF["country"] == "Russian Federation") & (DF["age"] == "25-34 years")
             & (DF["sex"] == "female") & (DF["year"] >= 2000)]
year_sex_suicides = {}
year_sex_suicides["year"] = DF1["year"]
year_sex_suicides["male_suicides"] = DF1["suicides_no"]
year_sex_suicides["female_suicides"] = DF2["suicides_no"]
DF333 = pd.DataFrame(data=year_sex_suicides)
And here is the code for the plot that I wanted.
DF333.plot(kind="line", x="year", y=["male_suicides", "female_suicides"])
The graph I came up with (not shown here) looks wrong, but I couldn't find the problem.
You likely have quite a few NaNs in the dataset you're trying to plot (look at your DataFrame by evaluating print(DF333), for example).
You could either fill them using something like this (which is not a good way, though!):
DF333.fillna(method='ffill', inplace=True)
Or set the index to the year when building your dataframes, so the two series align:
DF1 = DF.loc[(DF["country"] == "Russian Federation") & (DF["age"] == "25-34 years")
             & (DF["sex"] == "male") & (DF["year"] >= 2000)]
DF1 = DF1.set_index('year')
DF2 = DF.loc[(DF["country"] == "Russian Federation") & (DF["age"] == "25-34 years")
             & (DF["sex"] == "female") & (DF["year"] >= 2000)]
DF2 = DF2.set_index('year')
year_sex_suicides = {}
year_sex_suicides["male_suicides"] = DF1["suicides_no"]
year_sex_suicides["female_suicides"] = DF2["suicides_no"]
DF333 = pd.DataFrame(data=year_sex_suicides)
DF333.plot(kind="line", y=["male_suicides", "female_suicides"])
That way, you ensure that everything is put in the row that matches the year, and not the row index from the original CSV file. Of course, you could also use something like pandas' groupby, which will reduce the number of lines of code as well (but maybe you need DF1 etc. later on for other purposes too):
DF1 = DF.loc[(DF["country"] == "Russian Federation") & (DF["age"] == "25-34 years")
             & (DF["year"] >= 2000)]
DF1.groupby(['year', 'sex']).sum()['suicides_no'].groupby('sex').plot()
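A close variant pivots sex into columns with unstack, which puts both lines on one set of axes with a legend:
DF1.groupby(['year', 'sex'])['suicides_no'].sum().unstack('sex').plot()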

Conversion of Access Query to Python Script

I have an Access Query which I want to convert into Python Script:
SELECT
[Functional_Details].Customer_No,
Sum([Functional_Details].[SUM(Incoming_Hours)]) AS [SumOfSUM(Incoming_Hours)],
Sum([Functional_Details].[SUM(Incoming_Minutes)]) AS [SumOfSUM(Incoming_Minutes)],
Sum([Functional_Details].[SUM(Incoming_Seconds)]) AS [SumOfSUM(Incoming_Seconds)],
[Functional_Details].Rate,
[Functional_Details].Customer_Type
FROM [Functional_Details]
WHERE(
(([Functional_Details].User_ID) Not In ("IND"))
AND
(([Functional_Details].Incoming_ID)="Airtel")
AND
(([Functional_Details].Incoming_Category)="Foreign")
AND
(([Functional_Details].Outgoing_ID)="Airtel")
AND
(([Functional_Details].Outgoing_Category)="Foreign")
AND
(([Functional_Details].Current_Operation)="NO")
AND
(([Functional_Details].Active)="NO")
)
GROUP BY [Functional_Details].Customer_No, [Functional_Details].Rate, [Functional_Details].Customer_Type
HAVING ((([Functional_Details].Customer_Type)="Check"));
I have Functional_Details stored in a dataframe: df_functional_details
I am not able to understand how to proceed with the Python script.
So far I have tried:
df_fd_temp = df_functional_details.copy()
if (df_fd_temp['User_ID'] != 'IND'
        and df_fd_temp['Incoming_ID'] == 'Airtel'
        and df_fd_temp['Incoming_Category'] == 'Foreign'
        and df_fd_temp['Outgoing_ID'] == 'Airtel'
        and df_fd_temp['Outgoing_Category'] == 'Foreign'
        and df_fd_temp['Current_Operation'] == 'NO'
        and df_fd_temp['Active'] == 'NO'):
    df_fd_temp.groupby(['Customer_No', 'Rate', 'Customer_Type']).groups
    df_fd_temp[df_fd_temp['Customer_Type'].str.contains("Check")]
First, select the rows where the conditions apply (note the parentheses and & instead of and):
df_fd_temp = df_fd_temp[(df_fd_temp['User_ID'] != 'IND') &
(df_fd_temp['Incoming_ID'] == 'Airtel') &
(df_fd_temp['Incoming_Category'] == 'Foreign') &
(df_fd_temp['Outgoing_ID'] == 'Airtel') &
(df_fd_temp['Outgoing_Category'] == 'Foreign') &
(df_fd_temp['Current_Operation'] == 'NO') &
(df_fd_temp['Active'] == 'NO')]
Then, do the group-by logic:
df_grouped = df_fd_temp.groupby(['Customer_No','Rate','Customer_Type'])
You now have a groupby object, which you can further manipulate and filter:
df_grouped.filter(lambda x: (x['Customer_Type'] == 'Check').all())  # keep groups whose Customer_Type is "Check"
You might need to tweak the group filtering based on what your actual dataset looks like.
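Putting the pieces together, the whole query might translate to the sketch below. The summed column names are copied from the SQL above, and since Customer_Type is a grouping key, the HAVING clause reduces to a plain row filter on the result:
sum_cols = ['SUM(Incoming_Hours)', 'SUM(Incoming_Minutes)', 'SUM(Incoming_Seconds)']
result = (df_fd_temp
          .groupby(['Customer_No', 'Rate', 'Customer_Type'], as_index=False)[sum_cols]
          .sum())
result = result[result['Customer_Type'] == 'Check']  # HAVING Customer_Type = "Check"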
Further reading:
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.groupby.DataFrameGroupBy.filter.html

Dropping rows in Python using != operator is not working

I want to drop rows in my dataset using:
totes = df3.loc[(df3['Reporting Date'] != '18/08/2017') & (df3['Business Line'] != 'Bondy')]
However it is not what I expect; I know that the number of rows I want to drop is 496 after using:
totes = df3.loc[(df3['Reporting Date'] == '18/08/2017') & (df3['Business Line'] == 'Bondy')]
When I run my drop function, it gives back far fewer rows than my dataset minus 496.
Does anyone know how to fix this?
You are correct to use &, but it is being misused. This is a logic problem. Note:
(NOT X) AND (NOT Y) != NOT(X AND Y)
Instead, you can compute the negation of a Boolean condition via the ~ operator:
totes = df3.loc[~((df3['Reporting Date'] == '18/08/2017') & (df3['Business Line'] == 'Bondy'))]
Those parentheses and masks can get confusing, so you can write this more clearly:
m1 = df3['Reporting Date'].eq('18/08/2017')
m2 = df3['Business Line'].eq('Bondy')
totes = df3.loc[~(m1 & m2)]
Alternatively, note that:
NOT(X AND Y) == NOT(X) OR NOT(Y)
So you can use:
m1 = df3['Reporting Date'].ne('18/08/2017')
m2 = df3['Business Line'].ne('Bondy')
totes = df3.loc[m1 | m2]
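Both spellings select the same rows; a quick sanity check, if you want to convince yourself:
eq1 = df3['Reporting Date'].eq('18/08/2017')
eq2 = df3['Business Line'].eq('Bondy')
assert (~(eq1 & eq2)).equals(~eq1 | ~eq2)  # De Morgan: NOT(X AND Y) == NOT(X) OR NOT(Y)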
