I want to add the suffix '_nan' to columns that are all NaN. I have the following code that prints what I want, but it does not reassign the columns in the actual dataframe, and I am not sure why. Does anyone have any ideas why this is happening? Thanks in advance.
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 0, 0, 0],
                   'b': [np.nan, np.nan, np.nan, np.nan],
                   'c': [np.nan, np.nan, np.nan, np.nan]})
a = df.loc[:, df.isna().all()].columns
df[[*a]] = df[[*a]].add_suffix('_nan')
You can use a list comprehension:
df.columns = [x + '_nan' if df[x].isna().all() else x for x in df.columns]
Output:
a b_nan c_nan
0 1 NaN NaN
1 0 NaN NaN
2 0 NaN NaN
3 0 NaN NaN
Why is this happening?
After some experiments I found that when you assign a slice of one pandas.DataFrame to a slice of another pandas.DataFrame, pandas apparently cares solely about the order of the columns (as given in the list), not their names. Consider the following example:
import pandas as pd
df_1 = pd.DataFrame({'a':[1,2,3],'b':[4,5,6],'c':[7,8,9]})
df_2 = pd.DataFrame({'x':[10,20,30],'y':[400,500,600]})
df_1[['a','b']] = df_2[['x','y']]
print(df_1)
Output:
a b c
0 10 400 7
1 20 500 8
2 30 600 9
whilst
...
df_1[['a','b']] = df_2[['y','x']]
print(df_1)
produces
a b c
0 400 10 7
1 500 20 8
2 600 30 9
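So to make the renaming from the original question stick, operate on the column labels themselves instead of assigning a slice back; a minimal sketch using the question's df:
df = df.rename(columns={c: c + '_nan' for c in df.columns[df.isna().all()]})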
My first post!
I'm running python 3.8.5 & pandas 1.1.0 on jupyter notebooks.
I want to divide several columns by the corresponding elements in another column of the same dataframe.
For example:
import pandas as pd
df = pd.DataFrame({'a': [2, 3, 4], 'b': [4, 6, 8], 'c':[6, 9, 12]})
df
a b c
0 2 4 6
1 3 6 9
2 4 8 12
I'd like to divide columns 'b' & 'c' by the corresponding values in 'a' and substitute the values in 'b' and 'c' with the result of this division. So the above dataframe becomes:
a b c
0 2 2 3
1 3 2 3
2 4 2 3
I tried
df.iloc[:, 1:] = df.iloc[:, 1:] / df['a']
but this gives:
a b c
0 2 NaN NaN
1 3 NaN NaN
2 4 NaN NaN
I got it working by doing:
for colname in df.columns[1:]:
    df[colname] = df[colname] / df['a']
Is there a faster way of doing the above by avoiding the for loop?
thanks,
mk
Almost there. Your attempt fails because dividing a DataFrame by a Series aligns the Series index against the DataFrame's columns; 'b' and 'c' have no match in the index of df['a'], hence the NaNs. Use div with axis=0 to align on the row index instead:
df.iloc[:,1:] = df.iloc[:,1:].div(df.a, axis=0)
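A quick check with the question's frame:
import pandas as pd

df = pd.DataFrame({'a': [2, 3, 4], 'b': [4, 6, 8], 'c': [6, 9, 12]})
df.iloc[:, 1:] = df.iloc[:, 1:].div(df.a, axis=0)
print(df)  # b and c are now 2 and 3 in every row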
df.b = df.b / df.a
df.c = df.c / df.a
or
df[['b', 'c']] = df.apply(lambda x: x[['b', 'c']] / x.a, axis=1)
Problem
I have a dataframe with some NaNs that I am trying to fill intelligently based off values from another dataframe. I have not found an efficient way to do this but I suspect there is a way with pandas.
Minimal Example
import numpy as np
import pandas as pd

index1 = [1, 1, 1, 2, 2, 2]
index2 = ['a', 'b', 'a', 'b', 'a', 'b']

# dataframe to fillna
df = pd.DataFrame(
    np.asarray([[np.nan, 90, 90, 100, 100, np.nan], index1, index2]).T,
    columns=['data', 'index1', 'index2']
)

# dataframe to look up fill values from
multi_index = pd.MultiIndex.from_product([sorted(set(index1)), sorted(set(index2))])
fill_val_lookup = pd.DataFrame([89, 91, 99, 101], index=multi_index,
                               columns=['fill_vals'])
Starting data (df):
data index1 index2
0 nan 1 a
1 90 1 b
2 90 1 a
3 100 2 b
4 100 2 a
5 nan 2 b
Lookup table to find values to fill NaNs:
fill_vals
1 a 89
b 91
2 a 99
b 101
Desired output:
data index1 index2
0 89 1 a
1 90 1 b
2 90 1 a
3 100 2 b
4 100 2 a
5 101 2 b
Ideas
The closest post I have found is about filling NaNs with values from one level of a multiindex.
I've also tried setting the index of df to be a multiindex using columns index1 and index2 and then using df.fillna, however this does not work.
combine_first is the function that you need. But first, update the index names and the column name of the other dataframe, and fix the dtypes of df (the np.asarray call above turned every column into strings):
fill_val_lookup.index.names = ["index1", "index2"]
fill_val_lookup.columns = ["data"]
df.index1 = df.index1.astype(int)
df.data = df.data.astype(float)
df.set_index(["index1","index2"]).combine_first(fill_val_lookup)\
.reset_index()
# index1 index2 data
#0 1 a 89.0
#1 1 a 90.0
#2 1 b 90.0
#3 2 a 100.0
#4 2 b 100.0
#5 2 b 101.0
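As a variant, once the renames and dtype fixes above are in place, plain fillna works too, since Series.fillna aligns on the shared MultiIndex (a sketch, not part of the original answer):
out = (df.set_index(["index1", "index2"])["data"]
         .fillna(fill_val_lookup["data"])  # aligns on the MultiIndex
         .reset_index())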
I have a large dataframe. When it was created, 'None' was used as the value where a number could not be calculated (instead of NaN).
How can I delete all rows that have 'None' in any of their columns? I thought I could use df.dropna and set the value of na, but I can't seem to get it to work.
Thanks
I think this is a good representation of the dataframe:
temp = pd.DataFrame(data=[['str1', 'str2', 2, 3, 5, 6, 76, 8],
                          ['str3', 'str4', 2, 3, 'None', 6, 76, 8]])
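For what it's worth, a quick sketch of why a plain dropna does nothing here: the 'None' in the second row is a string, not a real missing value, so pandas sees nothing to drop.
temp.dropna()  # both rows survive: 'None' is a string, not NaN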
Setup
Borrowed @MaxU's df
df = pd.DataFrame([
    [1, 2, 3],
    [4, None, 6],
    [None, 7, 8],
    [9, 10, 11]
], dtype=object)
Solution
You can just use pd.DataFrame.dropna as is
df.dropna()
0 1 2
0 1 2 3
3 9 10 11
Supposing you have None strings like in this df
df = pd.DataFrame([
    [1, 2, 3],
    [4, 'None', 6],
    ['None', 7, 8],
    [9, 10, 11]
], dtype=object)
Then combine dropna with mask
df.mask(df.eq('None')).dropna()
0 1 2
0 1 2 3
3 9 10 11
You can ensure that the entire dataframe is of object dtype when you do the comparison:
df.mask(df.astype(object).eq('None')).dropna()
0 1 2
0 1 2 3
3 9 10 11
Thanks for all your help. In the end I was able to get
df = df.replace(to_replace='None', value=np.nan).dropna()
to work. I'm not sure why your suggestions didn't work for me.
UPDATE:
In [70]: temp[temp.astype(str).ne('None').all(1)]
Out[70]:
0 1 2 3 4 5 6 7
0 str1 str2 2 3 5 6 76 8
Old answer:
In [35]: x
Out[35]:
a b c
0 1 2 3
1 4 None 6
2 None 7 8
3 9 10 11
In [36]: x = x[~x.astype(str).eq('None').any(1)]
In [37]: x
Out[37]:
a b c
0 1 2 3
3 9 10 11
or a bit nicer variant from @roganjosh:
In [47]: x = x[x.astype(str).ne('None').all(1)]
In [48]: x
Out[48]:
a b c
0 1 2 3
3 9 10 11
I'm a bit late to the party, but this is probably the simplest method:
df.dropna(axis=0, how='any')
Parameters:
axis=0/'index' or axis=1/'columns'; how='any' or how='all'
axis=0 drops rows (the most common case), and axis=1 drops columns instead.
The how parameter drops a row/column if 'any' of its values are missing, or only if all of them are missing (how='all').
Note that this only catches real None/NaN values, not the string 'None' from the question.
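A short sketch of the two how modes, assuming real None/NaN values:
import pandas as pd

df = pd.DataFrame({'a': [1, None, None], 'b': [4, 5, None]})
print(df.dropna(how='any'))  # keeps only row 0 (no missing values)
print(df.dropna(how='all'))  # drops only row 2 (all values missing)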
If None is still not removed, we can do
df = df.replace(to_replace='None', value=np.nan).dropna()
The above solution worked partially for me: the None was converted to NaN but not removed (thanks to the above answer, as it helped me move further).
So I then added one more line of code to take the particular column:
df['column'] = df['column'].apply(lambda x: str(x))
This changed the NaN to the string 'nan'.
Now remove the 'nan':
df = df[df['column'] != 'nan']
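As a side note, the string-conversion step can be avoided: once 'None' is replaced with np.nan, dropna with subset targets the particular column directly (a sketch; 'column' is a placeholder name):
df = df.replace(to_replace='None', value=np.nan).dropna(subset=['column'])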
I'm not sure if pandas is made to do this... But I'd like to add a new row to my dataframe with more values than the dataframe has columns.
Minimal example:
import pandas as pd
df = pd.DataFrame()
df['a'] = [0, 1]
df['b'] = [0, 1, 2]  # ValueError: length of values does not match length of index
Could someone please explain if this is possible? I'm using a dataframe to store long lists of data and they all have different lengths that I don't necessarily know at the start.
Absolutely possible. Use pd.concat
Demonstration
df1 = pd.DataFrame([[1, 2, 3]])
df2 = pd.DataFrame([[4, 5, 6, 7, 8, 9]])
pd.concat([df1, df2])
df1 looks like
0 1 2
0 1 2 3
df2 looks like
0 1 2 3 4 5
0 4 5 6 7 8 9
pd.concat looks like
0 1 2 3 4 5
0 1 2 3 NaN NaN NaN
0 4 5 6 7.0 8.0 9.0
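One follow-up: both rows keep their original label 0; passing ignore_index=True to pd.concat gives a fresh 0..n-1 index instead:
pd.concat([df1, df2], ignore_index=True)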
I'm a new pandas user (as of yesterday), and have found it at times both convenient and frustrating.
My current frustration is in trying to use df.fillna() on multiple columns of a dataframe. For example, I've got two sets of data (a newer set and an older set) which partially overlap. For the cases where we have new data, I just use that, but I also want to use the older data if there isn't anything newer. It seems I should be able to use fillna() to fill the newer columns with the older ones, but I'm having trouble getting that to work.
Attempt at a specific example:
df.ix[:,['newcolumn1','newcolumn2']].fillna(df.ix[:,['oldcolumn1','oldcolumn2']], inplace=True)
But this doesn't work as expected - numbers show up in the new columns that had been NaNs, but not the ones that were in the old columns (in fact, looking through the data, I have no idea where the numbers it picked came from, as they don't exist in either the new or old data anywhere).
Is there a way to fill in NaNs of specific columns in a DataFrame with values from other specific columns of the same DataFrame?
fillna is generally for carrying an observation forward or backward. Instead, I'd use np.where... If I understand what you're asking.
import numpy as np
np.where(np.isnan(df['newcolumn1']), df['oldcolumn1'], df['newcolumn1'])
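To write the result back into the frame (a sketch using the question's hypothetical column names):
df['newcolumn1'] = np.where(np.isnan(df['newcolumn1']),
                            df['oldcolumn1'],
                            df['newcolumn1'])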
To answer your question: yes. Look at using the value argument of fillna. Along with the to_dict() method on the other dataframe.
But to really solve your problem, have a look at the update() method of the DataFrame. Assuming your two dataframes are similarly indexed, I think it's exactly what you want.
In [36]: df = pd.DataFrame({'A': [0, np.nan, 2, 3, np.nan, 5], 'B': [1, 0, 1, np.nan, np.nan, 1]})
In [37]: df
Out[37]:
A B
0 0 1
1 NaN 0
2 2 1
3 3 NaN
4 NaN NaN
5 5 1
In [38]: df2 = pd.DataFrame({'A': [0, np.nan, 2, 3, 4, 5], 'B': [1, 0, 1, 1, 0, 0]})
In [40]: df2
Out[40]:
A B
0 0 1
1 NaN 0
2 2 1
3 3 1
4 4 0
5 5 0
In [52]: df.update(df2, overwrite=False)
In [53]: df
Out[53]:
A B
0 0 1
1 NaN 0
2 2 1
3 3 1
4 4 0
5 5 1
Notice that all the NaNs in df were replaced except for (1, A) since that was also NaN in df2. Also some of the values like (5, B) differed between df and df2. By using overwrite=False it keeps the value from df.
EDIT: Based on comments it seems like you're looking for a solution where the column names don't match across the two DataFrames (it'd be helpful if you posted sample data). Let's try that, replacing column A with C and B with D.
In [33]: df = pd.DataFrame({'A': [0, np.nan, 2, 3, np.nan, 5], 'B': [1, 0, 1, np.nan, np.nan, 1]})
In [34]: df2 = pd.DataFrame({'C': [0, np.nan, 2, 3, 4, 5], 'D': [1, 0, 1, 1, 0, 0]})
In [35]: df
Out[35]:
A B
0 0 1
1 NaN 0
2 2 1
3 3 NaN
4 NaN NaN
5 5 1
In [36]: df2
Out[36]:
C D
0 0 1
1 NaN 0
2 2 1
3 3 1
4 4 0
5 5 0
In [37]: d = {'A': df2.C, 'B': df2.D}  # pass these values to fillna
In [38]: df
Out[38]:
A B
0 0 1
1 NaN 0
2 2 1
3 3 NaN
4 NaN NaN
5 5 1
In [40]: df.fillna(value=d)
Out[40]:
A B
0 0 1
1 NaN 0
2 2 1
3 3 1
4 4 0
5 5 1
I think if you invest the time to learn pandas you'll hit fewer moments of frustration. It's a massive library though, so it takes time.