Selecting non-None values from a DataFrame column - Python

I would like to use the fillna function to fill the None values of a column with that column's most frequent value that is not None or NaN.
Input DF:
Col_A
a
None
None
c
c
d
d
The output DataFrame could then be (either c or d would be a valid fill value, since both are equally frequent):
Col_A
a
c
c
c
c
d
d
Any suggestion would be much appreciated.
Many Thanks, Best Regards,
Carlo

Prelude: if your None values are actually the string 'None', you can save yourself some headaches by getting rid of them up front. Use replace:
df = df.replace('None', np.nan)
I believe you could use fillna + value_counts:
df
Col_A
0 a
1 NaN
2 NaN
3 c
4 c
5 d
6 d
df.fillna(df.Col_A.value_counts(sort=False).index[0])
Col_A
0 a
1 c
2 c
3 c
4 c
5 d
6 d
Or, with Vaishali's suggestion, use idxmax to pick c:
df.fillna(df.Col_A.value_counts(sort=False).idxmax())
Col_A
0 a
1 c
2 c
3 c
4 c
5 d
6 d
The fill value could be either c or d, depending on whether or not you include sort=False.
Details
df.Col_A.value_counts(sort=False)
c 2
a 1
d 2
Name: Col_A, dtype: int64
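For completeness, a minimal end-to-end sketch of the idxmax approach, assuming the missing entries are real np.nan rather than the string 'None':
import numpy as np
import pandas as pd

df = pd.DataFrame({'Col_A': ['a', np.nan, np.nan, 'c', 'c', 'd', 'd']})
fill_value = df.Col_A.value_counts().idxmax()  # most frequent non-NaN value; ties are resolved arbitrarily
df['Col_A'] = df.Col_A.fillna(fill_value)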

fillna + mode
df.Col_A.fillna(df.Col_A.mode()[0])
Out[963]:
0 a
1 c
2 c
3 c
4 c
5 d
6 d
Name: Col_A, dtype: object
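Note that mode() returns a Series because there can be ties, which is why the [0] is needed. With the sample data:
df.Col_A.mode()
# 0    c
# 1    d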

To address the string 'None', you need to use replace and then fillna, much like @COLDSPEED suggests:
dr = df.Col_A.replace('None',np.nan)
dr.fillna(dr.dropna().value_counts().index[0])
Output:
0 a
1 d
2 d
3 c
4 c
5 d
6 d
Name: Col_A, dtype: object

Related

Grouping by a column and identifying values which are not part of each group

I have a DataFrame which looks like this:
df:-
A B
1 a
1 a
1 b
2 c
3 d
Now, using this DataFrame, I want to get the following new_df:
new_df:-
item val_not_present
1 c    # 1 doesn't have values c and d (values not part of group 1)
1 d
2 a    # 2 doesn't have values a, b and d (values not part of group 2)
2 b
2 d
3 a    # 3 doesn't have values a, b and c (values not part of group 3)
3 b
3 c
or an individual DataFrame for each item, like:
df1:
item val_not_present
1 c
1 d
df2:-
item val_not_present
2 a
2 b
2 d
df3:-
item val_not_present
3 a
3 b
3 c
I want to get all the values which are not part of that group.
You can use np.setdiff1d and explode:
values_b = df.B.unique()
pd.DataFrame(df.groupby("A")["B"].unique()
               .apply(lambda x: np.setdiff1d(values_b, x))
               .rename("val_not_present")
               .explode())
Output:
val_not_present
A
1 c
1 d
2 a
2 b
2 d
3 a
3 b
3 c
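If the one-liner is hard to parse, here is the same idea broken into steps (a sketch; the intermediate names are just illustrative):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 2, 3], 'B': list('aabcd')})

all_values = df['B'].unique()               # every value that appears in B
present = df.groupby('A')['B'].unique()     # values present in each group
missing = present.apply(lambda vals: np.setdiff1d(all_values, vals))
new_df = (missing.rename('val_not_present')
                 .explode()
                 .reset_index()
                 .rename(columns={'A': 'item'}))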
Another approach is to use crosstab/pivot_table to get the counts, then filter where the count is 0 and transform back to a DataFrame:
m = pd.crosstab(df['A'],df['B'])
pd.DataFrame(m.where(m.eq(0)).stack().index.tolist(),columns=['A','val_not_present'])
A val_not_present
0 1 c
1 1 d
2 2 a
3 2 b
4 2 d
5 3 a
6 3 b
7 3 c
You could convert B to a categorical datatype and then compute the value counts. Categorical variables show categories that have a frequency count of zero, so you could do something like this:
df['B'] = df['B'].astype('category')
new_df = (
    df.groupby('A')
      .apply(lambda x: x['B'].value_counts())
      .reset_index()
      .query('B == 0')
      .drop(labels='B', axis=1)
      .rename(columns={'level_1': 'val_not_present',
                       'A': 'item'})
)
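The reason this works: value_counts on a categorical Series reports every category, including the ones with a count of zero. A small standalone illustration (not part of the answer above):
s = pd.Series(['a', 'a', 'b'], dtype='category').cat.set_categories(list('abcd'))
s.value_counts()
# a    2
# b    1
# c    0
# d    0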

Creating a column in a pandas DataFrame based on a condition

I have two lists
A = ['a','b','c','d','e']
B = ['c','e']
and a DataFrame with a single column:
A
0 a
1 b
2 c
3 d
4 e
I wish to create an additional column that is filled for rows where the value in A matches an element of B.
A M
0 a
1 b
2 c match
3 d
4 e match
You can use loc or numpy.where with an isin condition:
df.loc[df.A.isin(B), 'M'] = 'match'
print (df)
A M
0 a NaN
1 b NaN
2 c match
3 d NaN
4 e match
Or:
df['M'] = np.where(df.A.isin(B),'match','')
print (df)
A M
0 a
1 b
2 c match
3 d
4 e match
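If you prefer a chain-friendly form, the same isin condition also works inside assign (just a sketch of the identical logic, reusing df and B from above):
df = df.assign(M=np.where(df.A.isin(B), 'match', ''))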

How to simply add a column level to a pandas dataframe

Let's say I have a DataFrame that looks like this:
df = pd.DataFrame(index=list('abcde'), data={'A': range(5), 'B': range(5)})
df
Out[92]:
A B
a 0 0
b 1 1
c 2 2
d 3 3
e 4 4
Assuming that this DataFrame already exists, how can I simply add a level 'C' to the column index so that I get this:
df
Out[92]:
A B
C C
a 0 0
b 1 1
c 2 2
d 3 3
e 4 4
I saw SO answers like python/pandas: how to combine two dataframes into one with hierarchical column index?, but these concatenate different DataFrames instead of adding a column level to an already existing DataFrame.
As suggested by @StevenG himself, a better answer:
df.columns = pd.MultiIndex.from_product([df.columns, ['C']])
print(df)
# A B
# C C
# a 0 0
# b 1 1
# c 2 2
# d 3 3
# e 4 4
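Once the extra level is there, columns are addressed by tuples, for example (a usage sketch on the df above):
df['A', 'C']                   # the single column ('A', 'C')
df.xs('C', axis=1, level=1)    # every column whose second level is 'C'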
option 1
set_index and T
df.T.set_index(np.repeat('C', df.shape[1]), append=True).T
option 2
pd.concat, keys, and swaplevel
pd.concat([df], axis=1, keys=['C']).swaplevel(0, 1, 1)
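What option 2 does, spelled out step by step (a sketch):
tmp = pd.concat([df], axis=1, keys=['C'])   # 'C' becomes the top column level: ('C', 'A'), ('C', 'B')
tmp.swaplevel(0, 1, axis=1)                 # swap the levels so 'C' ends up underneath: ('A', 'C'), ('B', 'C')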
A solution which adds a name to the new level and is easier on the eyes than other answers already presented:
df['newlevel'] = 'C'
df = df.set_index('newlevel', append=True).unstack('newlevel')
print(df)
# A B
# newlevel C C
# a 0 0
# b 1 1
# c 2 2
# d 3 3
# e 4 4
You could just assign the columns like:
>>> df.columns = [df.columns, ['C', 'C']]
>>> df
A B
C C
a 0 0
b 1 1
c 2 2
d 3 3
e 4 4
>>>
Or for unknown length of columns:
>>> df.columns = [df.columns.get_level_values(0), np.repeat('C', df.shape[1])]
>>> df
A B
C C
a 0 0
b 1 1
c 2 2
d 3 3
e 4 4
>>>
Another way for a MultiIndex (appending 'E'); this assumes df already has two-level columns such as ('A', 'C') and ('B', 'D'):
df.columns = pd.MultiIndex.from_tuples(map(lambda x: (x[0], 'E', x[1]), df.columns))
A B
E E
C D
a 0 0
b 1 1
c 2 2
d 3 3
e 4 4
I like it explicit (using MultiIndex) and chain-friendly (.set_axis):
df.set_axis(pd.MultiIndex.from_product([df.columns, ['C']]), axis=1)
This is particularly convenient when merging DataFrames with different numbers of column levels, where pandas (1.4.2) raises a FutureWarning (FutureWarning: merging between different levels is deprecated and will be removed ... ):
import pandas as pd
df1 = pd.DataFrame(index=list('abcde'), data={'A': range(5), 'B': range(5)})
df2 = pd.DataFrame(index=list('abcde'), data=range(10, 15), columns=pd.MultiIndex.from_tuples([("C", "x")]))
# df1:
A B
a 0 0
b 1 1
# df2:
C
x
a 10
b 11
# merge while giving df1 another column level:
pd.merge(df1.set_axis(pd.MultiIndex.from_product([df1.columns, ['']]), axis=1),
         df2,
         left_index=True, right_index=True)
# result:
A B C
x
a 0 0 10
b 1 1 11
Another method, but using a list comprehension of tuples as the arg to pandas.MultiIndex.from_tuples():
df.columns = pd.MultiIndex.from_tuples([(col, 'C') for col in df.columns])
df
A B
C C
a 0 0
b 1 1
c 2 2
d 3 3
e 4 4

pandas: dataframe from dict with comma separated values

I'm trying to create a DataFrame from a nested dictionary, where the values are comma-separated strings.
Each value is nested in a dict, such as:
dict = {"1":{
"event":"A, B, C"},
"2":{
"event":"D, B, A, C"},
"3":{
"event":"D, B, C"}
}
My desired output is:
A B C D
0 A B C NaN
1 A B C D
2 NaN B C D
All I have so far is converting the dict to a DataFrame and splitting the items in each string, but I'm not sure this is getting me any closer to my objective.
df = pd.DataFrame(dict)
Out[439]:
1 2 3
event A, B, C D, B, A, C D, B, C
In [441]: df.loc['event'].str.split(',').apply(pd.Series)
Out[441]:
0 1 2 3
1 A B C NaN
2 D B A C
3 D B C NaN
Any help is appreciated. Thanks
You can use a couple of comprehensions to massage the nested dict into a better format for creating a DataFrame that flags whether an entry for each column exists:
the_dict = {"1":{
"event":"A, B, C"},
"2":{
"event":"D, B, A, C"},
"3":{
"event":"D, B, C"}
}
df = pd.DataFrame([[{z: 1 for z in y.split(', ')} for y in x.values()][0]
                   for x in the_dict.values()])
>>> df
A B C D
0 1.0 1 1 NaN
1 1.0 1 1 1.0
2 NaN 1 1 1.0
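The nested comprehension is equivalent to this more explicit loop (a sketch; rows is just an illustrative name):
rows = []
for inner in the_dict.values():          # e.g. {"event": "A, B, C"}
    letters = inner["event"].split(', ')
    rows.append({letter: 1 for letter in letters})
df = pd.DataFrame(rows)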
Once you've made the DataFrame, you can simply loop through the columns and convert the values that flagged the existence of a letter into that letter using the where method (below, where the value is NaN it is left as NaN; otherwise it is replaced with the column's letter):
for col in df.columns:
    df_mask = df[col].isnull()
    df[col] = df[col].where(df_mask, col)
>>> df
A B C D
0 A B C NaN
1 A B C D
2 NaN B C D
Based on @merlin's suggestion, you can go straight to the answer within the comprehension:
df = pd.DataFrame([[{z: z for z in y.split(', ')} for y in x.values()][0]
                   for x in the_dict.values()])
>>> df
A B C D
0 A B C NaN
1 A B C D
2 NaN B C D
Starting from what you already have (with the split modified slightly to strip the extra spaces) as df1, you can probably just stack the result and use pd.crosstab() on the index and value columns:
df1 = df.loc['event'].str.split(r'\s*,\s*').apply(pd.Series)
df2 = df1.stack().rename('value').reset_index()
pd.crosstab(df2.level_0, df2.value)
# value A B C D
# level_0
# 1 1 1 1 0
# 2 1 1 1 1
# 3 0 1 1 1
This is not exactly as you asked for, but I imagine you may prefer this to your desired output.
To get exactly what you are looking for, you can add an extra column which is equal to the value column above and then unstack the index that contains the values:
df2 = df1.stack().rename('value').reset_index()
df2['value2'] = df2.value
df2.set_index(['level_0', 'value']).drop('level_1', axis = 1).unstack(level = 1)
# value2
# value A B C D
# level_0
# 1 A B C None
# 2 A B C D
# 3 None B C D
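Not part of the original answer, but another way to turn the 0/1 crosstab directly into the letter/None layout is to map each column onto its own name (a sketch, reusing df2 from above):
ct = pd.crosstab(df2.level_0, df2.value)
ct.apply(lambda col: col.map({1: col.name, 0: None}))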

pandas hierarchical indexing unique values

Say I have a Series:
A a 1
b 1
B c 5
d 8
e 5
where the first two columns together form the hierarchical index. I want to find how many unique values there are for each group at index level=0; e.g., here the output should be A 1; B 2. How can this be done easily? Thanks!
Use groupby on level 0 and then call .nunique on the column:
>>> df
val
A a 1
b 1
B c 5
d 8
e 5
>>> df.groupby(level=0)['val'].nunique()
A 1
B 2
Name: val, dtype: int64
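A self-contained sketch of the same call, building the MultiIndexed Series first (the tuple labels are illustrative):
import pandas as pd

s = pd.Series([1, 1, 5, 8, 5],
              index=pd.MultiIndex.from_tuples(
                  [('A', 'a'), ('A', 'b'), ('B', 'c'), ('B', 'd'), ('B', 'e')]))
s.groupby(level=0).nunique()
# A    1
# B    2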
