How to replace string in pandas data frame to NaN? - python

Let's consider following data frame:
I want to change the string-type elements of this DataFrame into NaN. One possible solution would be:
frame.replace("k", np.nan)
frame.replace("s", np.nan)
However, it would be impractical on bigger data sets to go through each element, check whether it is a string, and replace it. Is there an easier solution?
Desired table:

Use df.replace regex
import numpy as np
df.replace(regex='[A-Za-z]', value=np.nan)

Use pd.to_numeric to transform all non numeric values to nan:
frame = frame.apply(pd.to_numeric, errors='coerce')
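As a rough illustration of the pd.to_numeric approach, here is a minimal sketch. The sample frame is made up, since the question's actual table is not shown; only the "k" and "s" placeholders come from the question.

```python
import pandas as pd

# Hypothetical sample data; the question's actual table is not available.
frame = pd.DataFrame({0: [1, 2, 5], 1: [2, "k", "s"], 2: ["s", 4, 1]})

# errors="coerce" turns every value that cannot be parsed as a number into NaN.
cleaned = frame.apply(pd.to_numeric, errors="coerce")

print(cleaned)
```

Note that columns which receive a NaN come back as float, because NaN cannot be stored in a plain integer column.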

You can use astype(str) and .str.isdigit() on each column to get a mask of the values that are numbers, and then index the dataframe with that mask so that the values outside the mask become NaN:
df = df[df.astype(str).apply(lambda col: col.str.isdigit())]
Output:
>>> df
0 1 2
0 1 2 NaN
1 2 NaN 4
2 5 NaN 1
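One caveat with the isdigit mask: it only accepts strings made up purely of digits, so negative numbers and floats would be turned into NaN as well. A small sketch with made-up data, comparing it with pd.to_numeric:

```python
import pandas as pd

# Hypothetical data with a negative number and a float next to a real string.
df = pd.DataFrame({"a": [1, -3, 2.5, "k"]}, dtype=object)

# isdigit() is False for "-3" and "2.5", so those cells are masked out too.
mask = df.astype(str).apply(lambda col: col.str.isdigit())
by_isdigit = df[mask]

# pd.to_numeric keeps every parseable number and only coerces the real string.
by_to_numeric = df.apply(pd.to_numeric, errors="coerce")

print(by_isdigit)
print(by_to_numeric)
```

If the data can contain anything other than non-negative integers, the to_numeric variant is probably the safer choice.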

Related

pandas dropna() only if in first row NaN value

I have a dataframe like the following
df = [[1,'NaN',3],[4,5,'Nan'],[7,8,9]]
df = pd.DataFrame(df)
and I would like to remove all columns that have in their first row a NaN value.
So the output should be:
df = [[1,3],[4,'Nan'],[7,9]]
df = pd.DataFrame(df)
So in this case, only the second column is removed since the first element was a NaN value.
So the dropna() should depend on a condition. Any idea how to handle this? Thanks!
If the values are np.nan and not the string 'NaN' (otherwise replace them first), you can do:
Input:
df = [[1,np.nan,3],[4,5,np.nan],[7,8,9]]
df = pd.DataFrame(df)
Solution:
df.loc[:,df.iloc[0].notna()] #assign back to your desired variable
0 2
0 1 3.0
1 4 NaN
2 7 9.0
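If the frame really contains the strings 'NaN' / 'Nan' as in the question, they have to be converted first. A minimal sketch, assuming that any capitalisation of the text "nan" should count as missing:

```python
import numpy as np
import pandas as pd

# Same data as in the question, with the literal strings 'NaN' and 'Nan'.
df = pd.DataFrame([[1, "NaN", 3], [4, 5, "Nan"], [7, 8, 9]])

# Assumption: any case variant of the text "nan" should be treated as missing.
is_nan_text = df.apply(lambda col: col.astype(str).str.lower() == "nan")
cleaned = df.mask(is_nan_text, np.nan)

# Keep only the columns whose first-row value is not missing.
result = cleaned.loc[:, cleaned.iloc[0].notna()]

print(result)
```

Only column 1 is dropped; the 'Nan' in the second row of column 2 becomes a real NaN but does not affect the selection.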

Unable to fillna a column in dataframe with values from a series

I am trying to fillna in a specific column of the dataframe with the mean of not-null values of the same type (based on the value from another column in the dataframe).
Here is the code to reproduce my issue:
import numpy as np
import pandas as pd
df = pd.DataFrame()
#Create the DataFrame with a column of floats
#And a column of labels (str)
np.random.seed(seed=6)
df['col0']=np.random.randn(100)
lett=['a','b','c','d']
df['col1']=np.random.choice(lett,100)
#Set some of the floats to NaN for the test.
toz = np.random.randint(0,100,25)
df.loc[toz,'col0']=np.nan
df[df['col0'].isnull()==False].count()
#Create a DF with mean for each label.
w_series = df.loc[(~df['col0'].isnull())].groupby('col1').mean()
col0
col1
a 0.057199
b 0.363899
c -0.068074
d 0.251979
#This dataframe has our label (a,b,c,d) as the index. Doesn't seem
#to work when I try to df.fillna(w_series). So I try to reindex such
#that the labels (a,b,c,d) become a column again.
#
#For some reason I cannot just do a set_index and expect the
#old index to become column. So I append the new index and
#then reset it.
w_series['col2'] = list(range(w_series.size))
w_frame = w_series.set_index('col2',append=True)
w_frame.reset_index('col1',inplace=True)
#I try fillna() with the new dataframe.
df.fillna(w_frame)
Still no luck:
col0 col1
0 0.057199 b
1 0.729004 a
2 0.217821 d
3 0.251979 c
4 -2.486781 a
5 0.913252 b
6 NaN a
7 NaN b
What am I doing wrong?
How do I fillna the dataframe with the averages of specific rows that match the missing information?
Does the size of the dataframe being filled (df) and the filler dataframe (w_frame) have to match?
Thank you
fillna aligns on the index, so the target dataframe and the dataframe supplying the fill values need to share the same index:
df.set_index('col1')['col0'].fillna(w_frame.set_index('col1').col0).reset_index()
# Only the first 11 rows are shown
Out[74]:
col1 col0
0 b 0.363899
1 a 0.729004
2 d 0.217821
3 c -0.068074
4 a -2.486781
5 b 0.913252
6 a 0.057199
7 b 0.363899
8 c -0.068074
9 b -0.429894
10 a 2.631281
My preferred way to fillna (assigning the result back to col0, the column that has the gaps):
df['col0']=df.groupby("col1")['col0'].transform(lambda x: x.fillna(x.mean()))
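To make the index-alignment point concrete, here is a small sketch on made-up data (not the random frame from the question). It fills the gaps in col0 in two ways that should give the same result; the names means, filled_by_map and filled_by_transform are only for illustration.

```python
import numpy as np
import pandas as pd

# Small made-up frame: labels in col1, floats with gaps in col0.
df = pd.DataFrame({
    "col0": [1.0, np.nan, 3.0, 10.0, np.nan],
    "col1": ["a", "a", "a", "b", "b"],
})

# Per-label means, indexed by label (NaN values are skipped by mean()).
means = df.groupby("col1")["col0"].mean()

# Way 1: map() turns the labels into a Series of means that shares df's index,
# so fillna() can align it row by row.
filled_by_map = df["col0"].fillna(df["col1"].map(means))

# Way 2: let groupby/transform fill each group with its own mean.
filled_by_transform = df.groupby("col1")["col0"].transform(
    lambda x: x.fillna(x.mean())
)

print(filled_by_map.tolist())
print(filled_by_transform.tolist())
```

Neither variant needs the filler to have the same size as df; what matters is that the Series passed to fillna carries the same index labels as the rows being filled.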

Adding a new column in pandas dataframe from another dataframe with differing indices

This is my original dataframe.
This is my second dataframe containing one column.
I want to add the column of the second dataframe to the end of the original dataframe. The indices of the two dataframes are different. I did it like this:
df1['RESULT'] = df2['RESULT']
It doesn't return an error and the column is added but all values are NaNs. How do I add these columns with their values?
Assuming your dataframes have the same number of rows, you can assign RESULT_df['RESULT'].values to your original dataframe. This way, you don't have to worry about index alignment.
# pre 0.24
feature_file_df['RESULT'] = RESULT_df['RESULT'].values
# >= 0.24
feature_file_df['RESULT'] = RESULT_df['RESULT'].to_numpy()
Minimal Code Sample
df
A B
0 -1.202564 2.786483
1 0.180380 0.259736
2 -0.295206 1.175316
3 1.683482 0.927719
4 -0.199904 1.077655
df2
C
11 -0.140670
12 1.496007
13 0.263425
14 -0.557958
15 -0.018375
Let's try direct assignment first.
df['C'] = df2['C']
df
A B C
0 -1.202564 2.786483 NaN
1 0.180380 0.259736 NaN
2 -0.295206 1.175316 NaN
3 1.683482 0.927719 NaN
4 -0.199904 1.077655 NaN
Now, assign the array returned by .values (or .to_numpy() for pandas versions >= 0.24). .values returns a numpy array, which does not have an index, so the values are assigned by position.
df2['C'].values
array([-0.141, 1.496, 0.263, -0.558, -0.018])
df['C'] = df2['C'].values
df
A B C
0 -1.202564 2.786483 -0.140670
1 0.180380 0.259736 1.496007
2 -0.295206 1.175316 0.263425
3 1.683482 0.927719 -0.557958
4 -0.199904 1.077655 -0.018375
You can also call set_axis() to change the index of a dataframe/column. So if the lengths are the same, then with set_axis(), you can coerce the index of one dataframe to be the same as the other dataframe.
df1['A'] = df2['A'].set_axis(df1.index)
If you get SettingWithCopyWarning, then to silence it, you can create a copy by either calling join() or assign().
df1 = df1.join(df2['A'].set_axis(df1.index))
# or
df1 = df1.assign(new_col = df2['A'].set_axis(df1.index))
set_axis() is especially useful if you want to add multiple columns from another dataframe. You can just call join() after calling it on the new dataframe.
df1 = df1.join(df2[['A', 'B', 'C']].set_axis(df1.index))
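A short sketch of the set_axis() route on made-up frames, assuming both have the same number of rows. It also shows why the plain assignment produces NaNs.

```python
import pandas as pd

# Made-up frames with the same number of rows but unrelated indices.
df1 = pd.DataFrame({"X": [10, 20, 30]}, index=[0, 1, 2])
df2 = pd.DataFrame(
    {"A": [1.5, 2.5, 3.5], "B": ["p", "q", "r"]}, index=[11, 12, 13]
)

# Plain assignment aligns on index labels; none match, so every value is NaN.
misaligned = df1.assign(A=df2["A"])

# Relabel df2 with df1's index first, then join; rows now match by position.
joined = df1.join(df2[["A", "B"]].set_axis(df1.index))

print(misaligned)
print(joined)
```

If the two frames differ in length, set_axis() raises an error instead of silently misaligning, which is arguably a useful safety check.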

Remove rows where column value type is string Pandas

I have a pandas dataframe. One of my columns should only be floats. When I try to convert that column to floats, I'm alerted that there are strings in there. I'd like to delete all rows where values in this column are strings...
Use convert_objects with the parameter convert_numeric=True; this will coerce any non-numeric values to NaN:
In [24]:
df = pd.DataFrame({'a': [0.1,0.5,'jasdh', 9.0]})
df
Out[24]:
a
0 0.1
1 0.5
2 jasdh
3 9
In [27]:
df.convert_objects(convert_numeric=True)
Out[27]:
a
0 0.1
1 0.5
2 NaN
3 9.0
You can then drop them:
In [29]:
df.convert_objects(convert_numeric=True).dropna()
Out[29]:
a
0 0.1
1 0.5
3 9.0
UPDATE
Since version 0.17.0 this method is deprecated and you need to use to_numeric instead. Unfortunately, it operates on a Series rather than a whole df, so the equivalent code is now:
df.apply(lambda x: pd.to_numeric(x, errors='coerce')).dropna()
One of my columns should only be floats. I'd like to delete all rows
where values in this column are strings
You can convert your series to numeric via pd.to_numeric and then use pd.Series.notnull. Conversion to float is required as a separate step to avoid your series reverting to object dtype.
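# Data from @EdChum
df = pd.DataFrame({'a': [0.1, 0.5, 'jasdh', 9.0]})
# .copy() makes res an independent frame, so the assignment below does not raise SettingWithCopyWarning
res = df[pd.to_numeric(df['a'], errors='coerce').notnull()].copy()
res['a'] = res['a'].astype(float)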
print(res)
a
0 0.1
1 0.5
3 9.0
Assume your dataframe is df and you want to ensure that all data in one of its columns (at position n) is numeric with a specific pandas dtype, e.g. float. Note that this replaces the non-numeric values with 0 rather than removing the rows:
df[df.columns[n]] = pd.to_numeric(df[df.columns[n]], errors='coerce').fillna(0).astype(float)
You can find the data type of a column from the dtype.kind attribute. Something like df[col].dtype.kind. See the numpy docs for more details. Transpose the dataframe to go from indices to columns.
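The answers above treat "is a string" as "cannot be parsed as a number", which is not quite the same thing. A hedged sketch on made-up data (EdChum's example with one value changed to the text "9.0") showing the dtype.kind check and both interpretations:

```python
import pandas as pd

# Made-up column mixing floats, a word, and a number stored as text.
df = pd.DataFrame({"a": [0.1, 0.5, "jasdh", "9.0"]})

# Kind "O" (object) signals mixed or non-numeric content in the column.
kind = df["a"].dtype.kind

# Option 1: drop every value whose Python type is str, even if it looks numeric.
not_str = df[~df["a"].map(lambda v: isinstance(v, str))]

# Option 2: drop only the values that cannot be parsed as numbers.
parseable = df[pd.to_numeric(df["a"], errors="coerce").notna()]

print(kind)
print(not_str)
print(parseable)
```

Option 1 removes the row holding "9.0" while option 2 keeps it, so which one is right depends on whether numeric text should count as a string.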

How to add an empty column to a dataframe?

What's the easiest way to add an empty column to a pandas DataFrame object? The best I've stumbled upon is something like
df['foo'] = df.apply(lambda _: '', axis=1)
Is there a less perverse method?
If I understand correctly, assignment should fill:
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame({"A": [1,2,3], "B": [2,3,4]})
>>> df
A B
0 1 2
1 2 3
2 3 4
>>> df["C"] = ""
>>> df["D"] = np.nan
>>> df
A B C D
0 1 2 NaN
1 2 3 NaN
2 3 4 NaN
To add to DSM's answer and building on this associated question, I'd split the approach into two cases:
Adding a single column: Just assign empty values to the new columns, e.g. df['C'] = np.nan
Adding multiple columns: I'd suggest using the .reindex(columns=[...]) method of pandas to add the new columns to the dataframe's column index. This also works for adding multiple new rows with .reindex(index=[...]). Note that newer versions of pandas (v>0.20) allow you to pass the labels together with an axis keyword rather than explicitly assigning to columns or index.
Here is an example adding multiple columns:
mydf = mydf.reindex(columns = mydf.columns.tolist() + ['newcol1','newcol2'])
or
mydf = mydf.reindex(mydf.columns.tolist() + ['newcol1','newcol2'], axis=1) # version > 0.20.0
You can also always concatenate a new (empty) dataframe to the existing dataframe, but that doesn't feel as pythonic to me :)
I like:
df['new'] = pd.Series(dtype='int')
# or use other dtypes like 'float', 'object', ...
If you have an empty dataframe, this solution makes sure that no new row containing only NaN is added.
Specifying dtype is not strictly necessary, however newer Pandas versions produce a DeprecationWarning if not specified.
An even simpler solution is:
df = df.reindex(columns = header_list)
where "header_list" is a list of the headers you want to appear.
Any header in the list that is not already in the dataframe will be added as a column of NaN cells.
So if
header_list = ['a','b','c', 'd']
then c and d will be added as columns of NaN cells.
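One thing to watch with this approach: reindex keeps only the columns named in the list, so existing columns that are missing from header_list are dropped. A small sketch with a made-up frame (the column z and the names strict, extra and keep_all are only for illustration):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "z": [5, 6]})
header_list = ["a", "b", "c", "d"]

# 'c' and 'd' are added as all-NaN columns, but 'z' is dropped
# because it is not in header_list.
strict = df.reindex(columns=header_list)

# To keep every existing column, extend the current column list instead.
extra = [name for name in header_list if name not in df.columns]
keep_all = df.reindex(columns=df.columns.tolist() + extra)

print(strict)
print(keep_all)
```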
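Starting with v0.16.0, DF.assign() can be used to assign new columns (single/multiple) to a DF. The new columns are appended at the end of the DF: older versions inserted them in alphabetical order, while pandas 0.23+ on Python 3.6+ keeps the order of the keyword arguments.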
This becomes advantageous compared to simple assignment in cases wherein you want to perform a series of chained operations directly on the returned dataframe.
Consider the same DF sample demonstrated by @DSM:
df = pd.DataFrame({"A": [1,2,3], "B": [2,3,4]})
df
Out[18]:
A B
0 1 2
1 2 3
2 3 4
df.assign(C="",D=np.nan)
Out[21]:
A B C D
0 1 2 NaN
1 2 3 NaN
2 3 4 NaN
Note that this returns a copy with all the previous columns along with the newly created ones. In order for the original DF to be modified accordingly, use it like : df = df.assign(...) as it does not support inplace operation currently.
If you want to add column names from a list:
df=pd.DataFrame()
a=['col1','col2','col3','col4']
for i in a:
    df[i]=np.nan
df["C"] = ""
df["D"] = np.nan
Assignment like this can trigger a SettingWithCopyWarning when df is itself a slice of another DataFrame:
A value is trying to be set on a copy of a slice from a DataFrame. Try
using .loc[row_indexer,col_indexer] = value instead
so it's better to use insert:
df.insert(index, column_name, column_value)
@emunsing's answer is really cool for adding multiple columns, but I couldn't get it to work for me in Python 2.7. Instead, I found this works:
mydf = mydf.reindex(columns = np.append(mydf.columns.values, ['newcol1','newcol2']))
One can use df.insert(index_to_insert_at, column_header, init_value) to insert new column at a specific index.
cost_tbl.insert(1, "col_name", "")
The above statement would insert an empty Column after the first column.
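A minimal sketch of insert() on a made-up table (the item and price columns are assumptions, not from the answer):

```python
import pandas as pd

# cost_tbl here is a made-up stand-in for the table in the answer.
cost_tbl = pd.DataFrame({"item": ["x", "y"], "price": [1.0, 2.0]})

# insert() modifies the frame in place and returns None;
# position 1 makes the new column the second one.
cost_tbl.insert(1, "col_name", "")

print(cost_tbl.columns.tolist())
```

Unlike plain assignment, insert() raises a ValueError if the column name already exists, unless allow_duplicates=True is passed.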
this will also work for multiple columns:
df = pd.DataFrame({"A": [1,2,3], "B": [2,3,4]})
>>> df
A B
0 1 2
1 2 3
2 3 4
df1 = pd.DataFrame(columns=['C','D','E'])
df = df.join(df1, how="outer")
>>>df
A B C D E
0 1 2 NaN NaN NaN
1 2 3 NaN NaN NaN
2 3 4 NaN NaN NaN
Then do whatever you want with the columns: pd.Series.fillna(), pd.Series.map(), etc.
The code below addresses the question "How do I add n empty columns to my existing dataframe?". In the interest of keeping solutions to similar problems in one place, I am adding it here.
Approach 1 (to create 64 additional columns with column names from 1-64)
m = list(range(1,65,1))
dd=pd.DataFrame(columns=m)
df.join(dd).replace(np.nan,'') #df is the dataframe that already exists
Approach 2 (to create 64 additional columns with column names from 1-64)
df.reindex(df.columns.tolist() + list(range(1,65,1)), axis=1).replace(np.nan,'')
You can do
df['column'] = None #This works. This will create a new column with None type
df.column = None #This will work only when the column is already present in the dataframe
If you have a list of columns that you want to be empty, you can combine assign, a dict comprehension, and dict unpacking.
>>> df = pd.DataFrame({"A": [1,2,3], "B": [2,3,4]})
>>> nan_cols_name = ["C","D","whatever"]
>>> df.assign(**{col:np.nan for col in nan_cols_name})
A B C D whatever
0 1 2 NaN NaN NaN
1 2 3 NaN NaN NaN
2 3 4 NaN NaN NaN
You can also merge several dicts inside the one you unpack if you want different values for different columns.
df = pd.DataFrame({"A": [1,2,3], "B": [2,3,4]})
nan_cols_name = ["C","D","whatever"]
empty_string_cols_name = ["E","F","bad column with space"]
df.assign(**{
    **{col:np.nan for col in nan_cols_name},
    **{col:"" for col in empty_string_cols_name}
})
Sorry that I did not explain my answer well at the beginning. There is another way to add a new column to an existing dataframe.
Step 1: make a new empty dataframe called df_temp (with all the columns in your dataframe, plus the new column or columns you want to add).
Step 2: combine df_temp and your dataframe.
df_temp = pd.DataFrame(columns=(df.columns.tolist() + ['empty']))
df = pd.concat([df_temp, df])
It might not be the best solution, but it is another way to think about this question.
The reason I am using this method is that I get this warning all the time:
: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
df["empty1"], df["empty2"] = [np.nan, ""]
Great, I found a way to disable the warning:
pd.options.mode.chained_assignment = None
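Disabling the check hides the message but not its cause. The warning usually appears when the frame being modified was itself obtained by slicing another frame, so an alternative is to take an explicit copy first. A sketch under that assumption, with made-up names source and subset:

```python
import numpy as np
import pandas as pd

source = pd.DataFrame({"A": [1, 2, 3], "B": [2, 3, 4]})

# An explicit copy makes the subset an independent frame, so adding columns
# to it is unambiguous and does not touch `source`.
subset = source[source["A"] > 1].copy()
subset["empty1"] = np.nan
subset["empty2"] = ""

print(subset)
print(source.columns.tolist())
```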
The reason I was looking for such a solution is simply to add spaces between multiple DFs which have been joined column-wise using the pd.concat function and then written to excel using xlsxwriter.
df[' ']=df.apply(lambda _: '', axis=1)
df_2 = pd.concat([df,df1],axis=1) #worked but only once.
# Note: df & df1 have the same rows which is my index.
#
df_2[' ']=df_2.apply(lambda _: '', axis=1) #didn't work this time !!?
df_4 = pd.concat([df_2,df_3],axis=1)
I then replaced the second lambda call with
df_2['']='' #which appears to add a blank column
df_4 = pd.concat([df_2,df_3],axis=1)
I tested the output by writing it to Excel with xlsxwriter.
In Jupyter the blank columns look the same as in Excel, although without the xlsx formatting.
I'm not sure why the second lambda call didn't work.
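A likely explanation, sketched on made-up frames: after the first concat, df_2 already has a column named ' ', so assigning to df_2[' '] again overwrites that column instead of adding a second one. The lambda itself is not the problem; a plain '' assignment behaves the same way.

```python
import pandas as pd

# Made-up frames sharing the same index.
df = pd.DataFrame({"A": [1, 2]})
df1 = pd.DataFrame({"B": [3, 4]})

df[" "] = ""  # first spacer column
df_2 = pd.concat([df, df1], axis=1)
cols_before = list(df_2.columns)

# The name ' ' already exists in df_2, so this overwrites it
# instead of adding another column.
df_2[" "] = ""
cols_after_same_name = list(df_2.columns)

# A different name (here the empty string) does create a second spacer.
df_2[""] = ""
cols_final = list(df_2.columns)

print(cols_before, cols_after_same_name, cols_final)
```

So each spacer needs its own column name, for example ' ', '  ' (two spaces), and so on.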
