Fill missing values based on the condition of other columns - python

I have a large dataframe; the one illustrated below is simplified for clarity. The grouped statistic comes from:
pd.DataFrame(df.groupby(['Pclass', 'Sex'])['Age'].median())
I also have data with missing Age values that need to be imputed. How can I impute those values with the median of the corresponding group? The reference code below reproduces both the grouped medians and the missing data.
# You can use this for reference
import numpy as np
import pandas as pd

# grouped medians as a MultiIndex frame
mldx_arrays = [np.array([1, 1, 2, 2, 3, 3]),
               np.array(['male', 'female', 'male', 'female', 'male', 'female'])]
multiindex_df = pd.DataFrame([34, 29, 24, 40, 18, 25],
                             index=mldx_arrays, columns=['Age'])
multiindex_df.index.names = ['PClass', 'Sex']

# frame with the missing Age values to impute
d = {'PClass': [1, 1, 2, 2, 3, 3],
     'Sex': ['male', 'female', 'male', 'female', 'male', 'female'],
     'Age': [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan]}
df = pd.DataFrame(data=d)

If all Age values are missing, remove the Age column and use DataFrame.join:
df = df.drop('Age', axis=1).join(multiindex_df, on=['PClass', 'Sex'])
print(df)
PClass Sex Age
0 1 male 34
1 1 female 29
2 2 male 24
3 2 female 40
4 3 male 18
5 3 female 25
If you need to replace only the missing values, use DataFrame.join with a suffix and fill the NaNs in the original column:
df = df.join(multiindex_df, on=['PClass', 'Sex'], rsuffix='_')
df['Age'] = df['Age'].fillna(df.pop('Age_'))
print(df)
PClass Sex Age
0 1 male 34.0
1 1 female 29.0
2 2 male 24.0
3 2 female 40.0
4 3 male 18.0
5 3 female 25.0
If you need to replace missing values with the median per group, use GroupBy.transform:
df['Age'] = df['Age'].fillna(df.groupby(['PClass', 'Sex'])['Age'].transform('median'))
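Note that transform can only fill a group that contains at least one observed Age; in the demo frame above every Age is NaN, so the group medians are NaN as well. A minimal sketch with hypothetical, partially missing data shows the intended use:
import numpy as np
import pandas as pd

# hypothetical frame: some ages observed, some missing
df = pd.DataFrame({'PClass': [1, 1, 3, 3],
                   'Sex': ['male', 'male', 'female', 'female'],
                   'Age': [30.0, np.nan, 25.0, np.nan]})

# each NaN is filled with the median of its (PClass, Sex) group
df['Age'] = df['Age'].fillna(df.groupby(['PClass', 'Sex'])['Age'].transform('median'))
print(df)
   PClass     Sex   Age
0       1    male  30.0
1       1    male  30.0
2       3  female  25.0
3       3  female  25.0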

Given your example case you can simply assign the grouped Series back to the dataframe, re-defining the column (base_df here standing for the original frame that still contains the observed ages):
df['Age'] = base_df.groupby(['Pclass', 'Sex'])['Age'].median()
Otherwise you need to be careful about positioning; if the index is not sorted you might want to use sort_index() or sort_values() first, depending on the case.

Is there a special reason for filling the NaN values? If not, use reset_index on your grouped result:
df = pd.read_csv('your_file_name.csv')  # input your file name or URL
df.groupby(['Pclass', 'Sex'])['Age'].median().reset_index()

Related

How to fill a pandas dataframe column with one of two list values?

I am trying to add a 'sex' column to an existing 'tips' dataframe. There are 244 rows that need to be filled randomly with either 'Male' or 'Female'. I have tried using a for loop to iterate through each row and assign either list option, but I can't quite get it right.
sex = ['Male', 'Female']

def sex():
    for row in tips['sex']:
        sex[random.randint(0, 1)]

tips['sex'] = sex()
You can use np.random.choice for this:
import numpy as np
import pandas as pd
df = pd.DataFrame({'x': [1, 3, 4, 5, 7]})
df['sex'] = np.random.choice(['Male', 'Female'], size=len(df))
df
x sex
0 1 Male
1 3 Male
2 4 Male
3 5 Female
4 7 Male
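If the assignment should be reproducible, or weighted rather than uniform, NumPy's Generator API covers both; a small sketch, with the seed and the 60/40 split chosen arbitrarily:
import numpy as np

rng = np.random.default_rng(42)  # fixed seed for reproducible output
df['sex'] = rng.choice(['Male', 'Female'], size=len(df), p=[0.6, 0.4])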
(Far worse) loop methods:
import random

sex = ['Male', 'Female']
sex_col = []
for i in range(len(df)):
    sex_col.append(sex[random.randint(0, 1)])
df['sex'] = sex_col

for i in df.index:
    df.loc[i, 'sex'] = sex[random.randint(0, 1)]

How to get pandas crosstab to sum up values for multiple columns?

Let's assume we have a table like:
id chr val1 val2
... A 2 10
... B 4 20
... A 3 30
...and we'd like to have a contingency table like this (grouped by chr, thus using 'A' and 'B' as the row indices and then summing up the values for val1 and val2):
val1 val2 total
A 5 40 45
B 4 20 24
total 9 60 69
How can we achieve this?
pd.crosstab(index=df.chr, columns=["val1", "val2"]) looked quite promising but it just counts the rows and does not sum up the values.
I have also tried (numerous times) to supply the values manually...
pd.crosstab(
    index=df.chr.unique(),
    columns=["val1", "val2"],
    values=[
        df.groupby("chr")["val1"],
        df.groupby("chr")["val2"]
    ],
    aggfunc=sum
)
...but this always ends up in shape mismatches, and when I tried to reshape via NumPy:
values=np.array([
    df.groupby("chr")["val1"].values,
    df.groupby("chr")["val2"].values
]).reshape(-1, 2)
...crosstab tells me that it expected 1 value instead of the two given for each row.
import pandas as pd

df = pd.DataFrame({'chr': {0: 'A', 1: 'B', 2: 'A'},
                   'val1': {0: 2, 1: 4, 2: 3},
                   'val2': {0: 10, 1: 20, 2: 30}})

# aggregate values by chr
df = df.groupby('chr').sum().reset_index()
df = df.set_index('chr')
# column totals, appended as a row
df.loc['total', :] = df.sum()
# row totals
df['total'] = df.sum(axis=1)
Output
val1 val2 total
chr
A 5.0 40.0 45.0
B 4.0 20.0 24.0
total 9.0 60.0 69.0
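One side effect: assigning the totals row via .loc upcasts the whole frame to float, which is why the output above shows 5.0 rather than 5. If integer output is preferred, a final cast restores it:
df = df.astype(int)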
What you want is pivot_table:
table = pd.pivot_table(df, values=['val1', 'val2'], index=['chr'], aggfunc='sum')
table['total'] = table['val1'] + table['val2']
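pivot_table can also build the totals row for you via its margins option; a sketch of the same idea (only the row-totals column still has to be added by hand):
import pandas as pd

# df is the original three-row frame from the question
table = pd.pivot_table(df, values=['val1', 'val2'], index='chr',
                       aggfunc='sum', margins=True, margins_name='total')
table['total'] = table['val1'] + table['val2']
print(table)
       val1  val2  total
chr
A         5    40     45
B         4    20     24
total     9    60     69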

python-pandas: new column based on index?

I have a df similar to the one below:
name age sex
1 john 12 m
2 mary 13 f
3 joseph 12 m
4 maria 14 f
How can I make a new column based on the index? For example, for indices 1 and 2 I want the label cluster1, and for indices 3 and 4 the label cluster2, like so:
name age sex label
1 john 12 m cluster1
2 mary 13 f cluster1
3 joseph 12 m cluster2
4 maria 14 f cluster2
Should I use something like (df.index.isin([1, 2])) == 'cluster1'? I don't think df['target'] = (df.index.isin([1, 2])) == 'cluster1' is possible, assuming the label column doesn't exist at the beginning.
I think this is what you are looking for? You can use lists for different clusters to make your labels arbitrary in this way.
import pandas as pd

data = {'name': ['bob', 'sue', 'mary', 'steve'], 'age': [11, 23, 53, 44]}
df = pd.DataFrame(data)
print(df)

df['label'] = 0
cluster1 = [0, 3]
cluster2 = [1, 2]
df.loc[cluster1, 'label'] = 1
df.loc[cluster2, 'label'] = 2

# another way:
# df.iloc[cluster1, df.columns.get_loc('label')] = 1
# df.iloc[cluster2, df.columns.get_loc('label')] = 2
print(df)
output:
name age
0 bob 11
1 sue 23
2 mary 53
3 steve 44
name age label
0 bob 11 1
1 sue 23 2
2 mary 53 2
3 steve 44 1
You can let the initial column be anything: either one of the cluster values (so you only have to set the other cluster manually instead of both), or None, so you can easily check after assigning labels that you didn't miss any rows.
If the assignment to clusters is truly arbitrary I don't think you'll be able to automate it much more than this.
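If the cluster memberships are known up front, one compact alternative is a plain index-to-label mapping (hypothetical assignments here) applied with Index.map:
label_map = {0: 'cluster1', 3: 'cluster1', 1: 'cluster2', 2: 'cluster2'}
df['label'] = df.index.map(label_map)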
Is this the solution you are looking for? I doubled the data so you can try different sequences. If you write create_label(df, 3) instead of 2, it will label the rows three by three, so the solution is parametric.
import pandas as pd

df = pd.DataFrame({'name': ['john', 'mary', 'joseph', 'maria', 'john', 'mary', 'joseph', 'maria'],
                   'age': [12, 13, 12, 14, 12, 13, 12, 14],
                   'sex': ['m', 'f', 'm', 'f', 'm', 'f', 'm', 'f']})
df.index = df.index + 1
df['label'] = pd.Series(dtype=object)

def create_label(data, each_row):
    i = 0
    j = 1
    while i < len(data):
        # positional slice; avoids chained-assignment warnings
        data.iloc[i:i + each_row, data.columns.get_loc('label')] = 'label' + str(j)
        i += each_row
        j += 1
    return data

df_new = create_label(df, 2)
For a small dataframe or dataset you can use the code below:
Label = pd.Series(['cluster1', 'cluster1', 'cluster2', 'cluster2'])
df['label'] = Label
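One caveat: plain Series assignment aligns on the index, so if your frame is indexed 1 through 4 as in the question, the labels would shift by one and leave a NaN in the last row. Passing the raw values sidesteps the alignment:
df['label'] = Label.values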
You can use a for loop and a list to build the new column with the desired data:
import pandas as pd

df = pd.read_csv("dataset.csv")
list1 = []
for i in range(len(df.name)):
    if i < 2:
        list1.append('cluster1')
    else:
        list1.append('cluster2')
label = pd.Series(list1)
df['label'] = label
You can simply use iloc and assign the values for the column:
import pandas as pd

df = pd.read_csv('test.txt', sep='\+', engine="python")
df["label"] = ""  # adds an empty "label" column
df["label"].iloc[0:2] = "cluster1"
df["label"].iloc[2:4] = "cluster2"
Since the values do not follow a certain order, as per your comments, you'd have to assign each "cluster" value manually.
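As a side note, chained assignments like df["label"].iloc[0:2] = ... can raise SettingWithCopyWarning in recent pandas and do not work under copy-on-write; an equivalent single-step version indexes the frame directly:
df.iloc[0:2, df.columns.get_loc("label")] = "cluster1"
df.iloc[2:4, df.columns.get_loc("label")] = "cluster2"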

Squeezing pandas DataFrame to have non-null values and modify column names

I have the following sample DataFrame
import numpy as np
import pandas as pd

df = pd.DataFrame({'Tom': [2, np.nan, np.nan],
                   'Ron': [np.nan, 5, np.nan],
                   'Jim': [np.nan, np.nan, 6],
                   'Mat': [7, np.nan, np.nan]},
                  index=['Min', 'Max', 'Avg'])
that looks like this, where each column has exactly one non-null value
Tom Ron Jim Mat
Min 2.0 NaN NaN 7.0
Max NaN 5.0 NaN NaN
Avg NaN NaN 6.0 NaN
Desired Outcome
For each column, I want to have the non-null value and then append the index of the corresponding non-null value to the name of the column. So the final result should look like this
Tom_Min Ron_Max Jim_Avg Mat_Min
0 2.0 5.0 6.0 7.0
My attempt
Using list comprehensions: find the non-null value, append the corresponding index to the column name, and then create a new DataFrame:
values = [df[col][~pd.isna(df[col])].values[0] for col in df.columns]
# [2.0, 5.0, 6.0, 7.0]
new_cols = [col + '_{}'.format(df[col][~pd.isna(df[col])].index[0]) for col in df.columns]
# ['Tom_Min', 'Ron_Max', 'Jim_Avg', 'Mat_Min']
df_new = pd.DataFrame([values], columns=new_cols)
My question
Is there some in-built functionality in pandas which can do this without using for loops and list comprehensions?
If there is only one non-missing value per column, you can use DataFrame.stack, convert the Series to a one-row DataFrame, and then flatten the MultiIndex; for the correct column order use DataFrame.swaplevel with DataFrame.reindex:
df = df.stack().to_frame().T.swaplevel(1, 0, axis=1).reindex(df.columns, level=0, axis=1)
df.columns = df.columns.map('_'.join)
print(df)
Tom_Min Ron_Max Jim_Avg Mat_Min
0 2.0 5.0 6.0 7.0
Use:
s = df.T.stack()
s.index = s.index.map('_'.join)
df = s.to_frame().T
Result:
# print(df)
Tom_Min Ron_Max Jim_Avg Mat_Min
0 2.0 5.0 6.0 7.0
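For reference, the intermediate s = df.T.stack() is a Series whose MultiIndex pairs each name with the row label of its only non-null value, since stacking drops the NaNs:
print(s)
Tom  Min    2.0
Ron  Max    5.0
Jim  Avg    6.0
Mat  Min    7.0
dtype: float64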

Update a dataframe (df1) column with values from another dataframe (df2) column when a key column in df1 matches any of multiple columns in df2

I have a dataframe (df1) like this.
import pandas as pd
import numpy as np

d1 = {'A': [np.nan, 'India', 'CHN', 'JP'],
      'B': [np.nan, np.nan, np.nan, np.nan]}
df1 = pd.DataFrame(data=d1)
df1
A B
0 NaN NaN
1 India NaN
2 CHN NaN
3 JP NaN
And another dataframe like this.
d2 = {'X': ['Japan', 'China', 'India'],
      'Y': ['JP', 'CN', 'IN'],
      'Z': ['JPN', 'CHN', 'IND']}
df2 = pd.DataFrame(data=d2)
df2
X Y Z
0 Japan JP JPN
1 China CN CHN
2 India IN IND
I'm trying to update column B of df1 by searching for each value from column A of df1 across all of the columns of df2 and, where there is a match, filling in the corresponding value from column X of df2.
The expected result is:
A B
0 NaN NaN
1 India India
2 CHN China
3 JP Japan
I tried an inner join (pd.merge()), but since I have one column on the left and three columns on the right, I couldn't get far.
pd.merge(df1, df2, left_on=["A"], right_on=["X"], how="inner")
I tried using isin() and .loc, but since I need to update df1['B'] with values from df2, I couldn't figure out how to get the respective data from df2.
df1.loc[
    (df1["A"].isin(df2["X"])) |
    (df1["A"].isin(df2["Y"])) |
    (df1["A"].isin(df2["Z"]))
]
One idea is to store each column's values as keys in a dictionary, with the respective value from df2['X'] as the dictionary value. Using that dictionary as a lookup for each row of df1['A'], I can update df1['B']:
lookup_data = {
    "Japan": "Japan",
    "JP": "Japan",
    "JPN": "Japan",
}
df1['B'] = [lookup_data.get(x, np.nan) for x in df1['A']]
However, I'm interested in whether this can be solved in a more efficient way. Please help. Thanks.
You can use map on column A of df1 with a Series whose index contains all the values of df2 and whose values are the corresponding entries from column X of df2. To build it, set_index on column X, stack, and then swap the values and the index into a new Series:
# create the series for the map
s = df2.set_index(df2['X']).stack()
s = pd.Series(s.index.get_level_values(0), index=s.values)

# map A and fill the missing values in B
df1['B'] = df1['B'].fillna(df1['A'].map(s))
print(df1)
A B
0 NaN NaN
1 India India
2 CHN China
3 JP Japan
This needs to check every cell of df2 against each value in column A of df1, so we do:
s = df1.A.dropna().map(lambda x: df2.loc[df2.isin([x]).any(axis=1).loc[lambda x: x].index, 'X'].values[0])
df1.B.fillna(s, inplace=True)
df1
A B
0 NaN NaN
1 India India
2 CHN China
3 JP Japan
