Combine NaN values in rows - pandas - Python

I would like to know if it's possible to combine rows when specific columns contain NaN values. The row order can change. My thought was to combine the rows where Name is duplicated.
import pandas as pd
import numpy as np

d = {'Name': ['Jacque', 'Paul', 'Jacque'],
     'City': [np.nan, '4', '10'],
     'Birthday': ['1', '2', np.nan]}
df = pd.DataFrame(data=d)
df
And I would like to have this output:

Check with sorted
# push the NaNs to the bottom of every column, then drop the rows still containing NaN
out = df.apply(lambda x: sorted(x, key=pd.isnull)).dropna()
     Name City Birthday
0  Jacque    4        1
1    Paul   10        2
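Note that sorting each column independently can detach values from their original rows; here Jacque ends up with Paul's City. If the goal is really to merge the rows that share a duplicated Name, a groupby sketch (an assumption about the intended pairing) keeps each value with its own row:

# first() takes the first non-null value per column within each Name group
out = df.groupby('Name', as_index=False, sort=False).first()
print(out)
#      Name City Birthday
# 0  Jacque   10        1
# 1    Paul    4        2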

Related

Python: How do I perform the below DataFrame operation

I have two dataframes; the code for both dfs is below:
import pandas as pd

df1 = pd.DataFrame({'income1': [-13036.0, 1200.0, -12077.5, 1100.0],
                    'income2': [-30360.0, 2000.0, -2277.5, 1500.0]})
df2 = pd.DataFrame({'name1': ['abc', 'deb', 'hghg', 'gfgf'],
                    'name2': ['dfd', 'dfd1', 'df3df', 'fggfg']})
I want to combine the two dfs to get a single df with names against their respective income values, as shown below. Any help is appreciated. Please note that I want the same sequence as shown in my output.
It is possible to convert the values to numpy arrays, flatten them, and pass them to the DataFrame constructor:
import numpy as np

df = pd.DataFrame({'Name': np.ravel(df2.to_numpy()),
                   'Income': np.ravel(df1.to_numpy())})
print(df)
Name Income
0 abc -13036.0
1 dfd -30360.0
2 deb 1200.0
3 dfd1 2000.0
4 hghg -12077.5
5 df3df -2277.5
6 gfgf 1100.0
7 fggfg 1500.0
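This works because np.ravel flattens the 2-D array in row-major (C) order, which is exactly what interleaves name1 and name2:

print(np.ravel(df2.to_numpy()))
# ['abc' 'dfd' 'deb' 'dfd1' 'hghg' 'df3df' 'gfgf' 'fggfg']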
Or use concat with DataFrame.stack and Series.reset_index to get default index values:
df = pd.concat([df2.stack().reset_index(drop=True),
                df1.stack().reset_index(drop=True)],
               axis=1, keys=['Name', 'Income'])
print(df)
Name Income
0 abc -13036.0
1 dfd -30360.0
2 deb 1200.0
3 dfd1 2000.0
4 hghg -12077.5
5 df3df -2277.5
6 gfgf 1100.0
7 fggfg 1500.0
Try this:
incomes = pd.concat([df1.income1, df1.income2], ignore_index=True)
names = pd.concat([df2.name1, df2.name2], ignore_index=True)
df = pd.DataFrame({'Name': names, 'Incomes': incomes})
Note that ignore_index=True gives the result a clean 0..7 index instead of duplicated 0..3 labels, and that this approach stacks name2 below name1 rather than interleaving them, so the row order differs from the one requested above.

Squeezing pandas DataFrame to have non-null values and modify column names

I have the following sample DataFrame
import numpy as np
import pandas as pd

df = pd.DataFrame({'Tom': [2, np.nan, np.nan],
                   'Ron': [np.nan, 5, np.nan],
                   'Jim': [np.nan, np.nan, 6],
                   'Mat': [7, np.nan, np.nan]},
                  index=['Min', 'Max', 'Avg'])
that looks like this, where each column has only one non-null value
Tom Ron Jim Mat
Min 2.0 NaN NaN 7.0
Max NaN 5.0 NaN NaN
Avg NaN NaN 6.0 NaN
Desired Outcome
For each column, I want to have the non-null value and then append the index of the corresponding non-null value to the name of the column. So the final result should look like this
Tom_Min Ron_Max Jim_Avg Mat_Min
0 2.0 5.0 6.0 7.0
My attempt
Using list comprehensions: find the non-null value, append the corresponding index to the column name, and then create a new DataFrame:
values = [df[col][~pd.isna(df[col])].values[0] for col in df.columns]
# [2.0, 5.0, 6.0, 7.0]
new_cols = [col + '_{}'.format(df[col][~pd.isna(df[col])].index[0]) for col in df.columns]
# ['Tom_Min', 'Ron_Max', 'Jim_Avg', 'Mat_Min']
df_new = pd.DataFrame([values], columns=new_cols)
My question
Is there some in-built functionality in pandas which can do this without using for loops and list comprehensions?
If there is only one non-missing value per column, it is possible to use DataFrame.stack, convert the Series to a one-row DataFrame, and then flatten the MultiIndex; for the correct column order, DataFrame.swaplevel is combined with DataFrame.reindex:
df = (df.stack()
        .to_frame().T
        .swaplevel(1, 0, axis=1)
        .reindex(df.columns, level=0, axis=1))
df.columns = df.columns.map('_'.join)
print(df)
Tom_Min Ron_Max Jim_Avg Mat_Min
0 2.0 5.0 6.0 7.0
Use:
# transpose, then stack: stacking drops the NaNs, leaving a (name, statistic) MultiIndex
s = df.T.stack()
s.index = s.index.map('_'.join)
df = s.to_frame().T
Result:
# print(df)
Tom_Min Ron_Max Jim_Avg Mat_Min
0 2.0 5.0 6.0 7.0
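Both approaches work because stack() drops NaN entries by default; for the original df, the intermediate df.T.stack() already holds exactly the four non-null values, keyed by (name, statistic) pairs:

print(df.T.stack())
# Tom  Min    2.0
# Ron  Max    5.0
# Jim  Avg    6.0
# Mat  Min    7.0
# dtype: float64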

Merge pandas rows based on values and NaNs

My dataframe looks like this:
ID VALUE1 VALUE2 VALUE3
1 NaN [ab,c] Good
1 google [ab,c] Good
2 NaN [ab,c1] NaN
2 First [ab,c1] Good1
2 First [ab,c1]
3 NaN [ab,c] Good
The requirement is:
ID is the key. I have three rows for ID 2, so I need to merge them into one row that has valid values (excluding nulls and blanks) for all the columns.
My expected output is:
ID VALUE1 VALUE2 VALUE3
1 google [ab,c] Good
2 First [ab,c1] Good1
3 NaN [ab,c] Good
Do we have any pandas function to achieve this, or should I separate the data into two or more dataframes and merge them based on the NaNs/blanks?
Thanks for your help.
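A concise option for this merge-the-duplicates pattern (not necessarily the solution referenced in the answer below) is groupby().first(), which takes the first non-null value per column within each ID group. A minimal sketch, assuming NaN (rather than blank strings) marks the missing cells:

import pandas as pd
import numpy as np

df = pd.DataFrame({"ID": [1, 1, 2, 2, 2, 3],
                   "VALUE1": [np.nan, 'google', np.nan, 'First', 'First', np.nan],
                   "VALUE2": [['ab','c'], ['ab','c'], ['ab','c1'], ['ab','c1'], ['ab','c1'], ['ab','c']],
                   "VALUE3": ['Good', 'Good', np.nan, 'Good1', np.nan, 'Good']})

# first() skips nulls; ID 3 keeps NaN in VALUE1 because its group has no non-null value there
print(df.groupby('ID', as_index=False).first())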
Micheal G has a more elegant solution above.
Here is my more time-consuming and amateur approach:
import pandas as pd
import numpy as np

df = pd.DataFrame({"ID": [1, 1, 2, 2, 2, 3],
                   "V1": [np.nan, 'google', np.nan, 'First', 'First', np.nan],
                   "V2": [['ab','c'], ['ab','c'], ['ab','c1'], ['ab','c1'], ['ab','c1'], ['ab','c']],
                   "V3": ['Good', 'Good', np.nan, np.nan, 'Good1', 'Good']})

uniq = df.ID.unique()      # the unique values in ID
df = df.set_index(['ID'])  # indexing by ID makes the per-ID lookups below faster and easier

rows = []
for i in uniq:  # run the loop once per unique value in column ID
    temp = df.loc[i]
    if isinstance(temp, pd.Series):
        # only one row carries this ID, so keep that row as-is
        rows.append(temp.to_frame().T)
    else:
        # count the non-NaN values in each row of the group...
        non_nan_counts = temp.apply(lambda x: x.count(), axis=1).tolist()
        # ...and keep the row with the most non-NaN values:
        # max() finds the highest count, .index() finds its row position,
        # and .iloc retrieves that row
        rows.append(temp.iloc[[non_nan_counts.index(max(non_nan_counts))]])

newDf = pd.concat(rows)  # DataFrame.append was removed in pandas 2.0, so collect and concat
print(newDf)
Which should return:
V1 V2 V3
1 google [ab, c] Good
2 First [ab, c1] Good1
3 NaN [ab, c] Good
Note, I capitalised Google.
import pandas as pd
import numpy as np

data = {'ID': [1, 1, 2, 2, 2, 3],
        'VALUE1': ['NaN', 'Google', 'NaN', 'First', 'First', 'NaN'],
        'VALUE2': ['abc', 'abc', 'abc1', 'abc1', 'abc1', 'abc'],
        'VALUE3': ['Good', 'Good', 'NaN', 'Good1', '0', 'Good']}
df = pd.DataFrame(data)

df_ = df.replace('NaN', np.nan).fillna('zero', inplace=False)
df2 = df_.sort_values(['VALUE1', 'ID'])  # sorts the 'zero' placeholders after the real values
mask = df2.ID.duplicated()               # flags every occurrence of an ID after the first
print(df_[~mask])
Output
ID VALUE1 VALUE2 VALUE3
1 1 Google abc Good
3 2 First abc1 Good1
5 3 zero abc Good
Finally, note that the tilde character (~) negating the mask is essential: the sort puts the most complete row for each ID first, duplicated() then flags every later occurrence of an ID, and ~mask keeps only those first rows.

Pandas groupby and value counts for complex strings that have multiple occurrences

Suppose I have a df like this:
stringOfInterest trend
0 C up
1 D down
2 E down
3 C,O up
4 C,P up
I want to plot this df as a bar graph using pandas. To obtain proper grouped bar plots, I would like to group the data by the column df["trend"] and then count the occurrences of df["stringOfInterest"] for each letter.
As can be seen, some of these strings contain multiple letters separated by a ','.
Using
df.groupby("trend").stringOfInterest.value_counts().unstack(0)
produces the expected result:
trend down up
stringOfInterest
- 7.0 8.0
C 3.0 11.0
C,O NaN 2.0
C,P 1.0 1.0
D 1.0 2.0
E 15.0 14.0
E,T 1.0 NaN
However, I would like to count the occurrence of individual characters (C,E,D,...).
On the original df this can be achieved like this:
s = df.stringOfInterest.str.split(",", expand = True).stack()
s.value_counts()
This typically generates something like this:
C 3
E 2
D 1
O 1
P 1
T 1
Unfortunately, this cannot be used here after the groupby() in combination with unstack().
Maybe I am on the wrong track and some more elegant way would be preferred.
To clarify the plotting: For each letter (stringOfInterest), there must be two bars indicating the number of "up" and "down" trend(s).
Based on this answer here: Pandas expand rows from list data available in column
Is this something that would help you?
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame(
    {'stringOfInterest': {0: 'C', 1: 'D', 2: 'E', 3: 'C,O', 4: 'C,P'},
     'trend': {0: 'up', 1: 'down', 2: 'down', 3: 'up', 4: 'up'}})

# split the strings into letters, stack them into long form,
# then count letters per trend and pivot trends into columns
df2 = (pd.DataFrame(df.stringOfInterest.str.split(',').tolist(), index=df.trend)
         .stack()
         .reset_index()
         .groupby('trend')[0]
         .value_counts()
         .unstack()
       ).T
df2.plot(kind='bar')
plt.show()
Another approach
We could also zip the columns together and expand.
import pandas as pd
import matplotlib.pyplot as plt
from collections import Counter

# pair each trend with every individual letter, then count the (trend, letter) pairs
data = [(x, i) for x, y in zip(df.trend, df.stringOfInterest.str.split(',')) for i in y]
pd.Series(Counter(data)).plot(kind='bar')
plt.show()
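To get the grouped layout asked for in the question (a pair of up/down bars per letter), the counted (trend, letter) pairs can be unstacked first; a small sketch building on the data list above:

# Counter keys are (trend, letter) tuples, so the Series gets a MultiIndex;
# unstacking the trend level yields one row per letter with one column per trend
counts = pd.Series(Counter(data)).unstack(0)
counts.plot(kind='bar')  # two bars (down/up) per letter
plt.show()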

How do I change a single index value in pandas dataframe?

energy.loc['Republic of Korea']
I want to change the value of index from 'Republic of Korea' to 'South Korea'.
But the dataframe is too large and it is not possible to change every index value. How do I change only this single value?
@EdChum's solution looks good.
Here's one using rename, which replaces every occurrence of that value in the index:
energy.rename(index={'Republic of Korea': 'South Korea'}, inplace=True)
Here's an example
>>> import numpy as np
>>> example = pd.DataFrame({'key1': ['a', 'a', 'a', 'b', 'a', 'b'],
...                         'data1': [1, 2, 2, 3, np.nan, 4],
...                         'data2': list('abcdef')})
>>> example.set_index('key1',inplace=True)
>>> example
data1 data2
key1
a 1.0 a
a 2.0 b
a 2.0 c
b 3.0 d
a NaN e
b 4.0 f
>>> example.rename(index={'a':'c'}) # can also use inplace=True
data1 data2
key1
c 1.0 a
c 2.0 b
c 2.0 c
b 3.0 d
c NaN e
b 4.0 f
You want to do something like this:
as_list = df.index.tolist()
idx = as_list.index('Republic of Korea')
as_list[idx] = 'South Korea'
df.index = as_list
Basically, you get the index as a list, change that one element, and then replace the existing index.
Try This
df.rename(index={'Republic of Korea':'South Korea'},inplace=True)
If you have a MultiIndex DataFrame, do this:
# input DataFrame
import pandas as pd

t = pd.DataFrame(data={'i1': [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2],
                       'i2': [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3],
                       'x': [1., 2., 3., 4., 5., 6., 7., 8., 9., 10., 11., 12.]})
t.set_index(['i1', 'i2'], inplace=True)
t.sort_index(inplace=True)

# change the index level 'i1' values 0 to -1
t.rename(index={0: -1}, level='i1', inplace=True)
Here's another good one, using replace on the column. This assumes the index holding the country names is named "Country"; otherwise reset_index would not produce a "Country" column:
df.reset_index(inplace=True)  # the "Country" index becomes a regular column
df["Country"].replace("Republic of Korea", value="South Korea", inplace=True)
df.set_index("Country", inplace=True)
Here's another idea, originally based on set_value (which was removed in recent pandas, so .loc is used here); again this assumes the index is named "Country":
df = df.reset_index()
idx = df.index[df["Country"] == "Republic of Korea"]
df.loc[idx, "Country"] = "South Korea"
df = df.set_index("Country")
df["Country"] = df.index  # optionally keep the names available as a column too
We can use the rename function to change a row index or a column name. Here is an example.
Suppose the data frame is as given below:
student_id marks
index
1 12 33
2 23 98
To change index 1 to 5, we use axis=0, which refers to rows:
df.rename({1: 5}, axis=0)
Here df refers to the data frame variable; since rename returns a new frame, assign the result back (or pass inplace=True) to keep the change. The output will then be:
student_id marks
index
5 12 33
2 23 98
To change a column name, we have to use axis=1:
df.rename({"marks": "student_marks"}, axis=1)
So the changed data frame is:
student_id student_marks
index
5 12 33
2 23 98
This seems to work too:
energy.index.values[energy.index.tolist().index('Republic of Korea')] = 'South Korea'
Be aware, though, that pandas treats an Index as immutable, so writing into its underlying values array like this is discouraged; rename is the safer option.
