Python DataFrame String replace accidentally returning NaN - python

I've encountered a weird problem in Python pandas: when I read an Excel file and replace the character "K", the result gives me NaN for the rows without "K". See the image below.
It should return 173 on row #4 instead of NaN, but if I create a brand new Excel file and type the same numbers, it works.
It also works well if I use this code:
df = pd.DataFrame({ 'sales':['75.8K','6.9K','7K','6.9K','173','148']})
df
Why does this happen? Please advise!

This is because the 173 and 148 values from the Excel import are numbers, not strings. The .str accessor only operates on strings, so those numeric values become NaN. You can see this demonstrated by setting up the dataframe with numbers in those positions:
df = pd.DataFrame({ 'sales':['75.8K','6.9K','7K','6.9K',173,148]})
df.dtypes
# sales object
# dtype: object
df['num'] = df['sales'].str.replace('K','')
Output:
   sales   num
0  75.8K  75.8
1   6.9K   6.9
2     7K     7
3   6.9K   6.9
4    173   NaN
5    148   NaN
If you don't mind all your values being strings, you can use
df = pd.read_excel('manual_import.xlsx', dtype=str)
or
df = pd.read_excel('manual_import.xlsx', converters={'sales':str})
Either option will read all the sales values in as strings.
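As a rough follow-up sketch (reusing the manual_import.xlsx file name and sales column from above), you could then strip the "K" and convert the cleaned strings back to numbers:
import pandas as pd

# read every column as text so the .str methods apply to all rows
df = pd.read_excel('manual_import.xlsx', dtype=str)

# strip the "K" and convert the cleaned strings back to numbers
df['num'] = pd.to_numeric(df['sales'].str.replace('K', '', regex=False))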

Try this:
df['nums'] = df['sales'].astype(str)
df['nums'] = pd.to_numeric(df['nums'].str.replace('K', ''))

Related

How to extract column title from a dataframe and add it to another dataframe?

My goal is to have the column titles from the small df added to an existing large dataframe, without me manually typing the names in.
This is the small dataframe.
veddra_term_code  veddra_version  veddra_term_name  number_of_animals_affected  accuracy
             335              11            Emesis                         NaN       NaN
             142              11       Anaemia NOS                         NaN       NaN
The large dataframe is similar to the above but has forty columns.
This is the code I used to extract the small dataframe from dict.
df = pd.DataFrame(reaction for result in d['results'] for reaction in result['reaction']) #get reaction data
df
You can pass dataframe.reindex a list of columns, consisting of the existing columns plus new ones. If a column does not yet exist in the dataframe, it will be filled with NaN.
Assume that df is the big dataframe you want to extend with columns. You can then build a new list of column names (columns_to_add) from your small dataframe and combine them. Then you call reindex on the big dataframe.
import pandas as pd
df = pd.DataFrame({"A": [1,2,3], "B": [2,3,4]})
existing_columns = df.columns.tolist()
columns_to_add = ["C", "D"] # or use small_df.columns.tolist()
new_columns = existing_columns + columns_to_add
df = df.reindex(columns = new_columns)
This will produce:
   A  B   C   D
0  1  2 NaN NaN
1  2  3 NaN NaN
2  3  4 NaN NaN
If you do not like NaN you can use a different value by passing the keyword fill_value.
(e.g. df.reindex(columns=new_columns, fill_value=0)).
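For example, reusing the small dataframe from above and filling the new columns with 0:
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [2, 3, 4]})
df = df.reindex(columns=["A", "B", "C", "D"], fill_value=0)
#    A  B  C  D
# 0  1  2  0  0
# 1  2  3  0  0
# 2  3  4  0  0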
df.columns will give you an array of the names of the columns
import numpy as np

# loop over the small dataframe's headers
for i in small_df.columns:
    # if the large df doesn't have the header, create it
    if i not in large_df.columns:
        # creates a new column with no data
        large_df.loc[:, i] = np.nan
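A minimal, self-contained sketch of that loop (small_df and large_df here are just stand-ins for the question's dataframes):
import numpy as np
import pandas as pd

large_df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
small_df = pd.DataFrame(columns=["B", "veddra_term_name", "accuracy"])

for col in small_df.columns:
    if col not in large_df.columns:
        large_df.loc[:, col] = np.nan

print(large_df.columns.tolist())
# ['A', 'B', 'veddra_term_name', 'accuracy']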

Pandas mean function returns all NaN

I have this dataframe:
df = [{'A1':10, 'A2':''}, {'A1':11,'A2':110}, {'A1':12,'A2':120}]
And I'd like to average the different columns ignoring the '' (empty string) values.
This is the desired output
df_AVG = [{'A1':10, 'A2':'','avg':10}, {'A1':11,'A2':110,'avg': 60.5}, {'A1':12,'A2':120,'avg':66}]
And I can do it with this code:
df['avg'] = df[['A1','A2']].mean(axis=1, numeric_only=True)
But when I modify the dataframe so that it includes more than one blank value, like this:
df = [{'A1':10, 'A2':''}, {'A1':'','A2':110}, {'A1':12,'A2':120}]
and I run the same code, the output is this. All 'avg' values are NaN, including the ones that previously worked:
df_AVG = [{'A1':10, 'A2':'','avg':NaN}, {'A1':'','A2':110,'avg': NaN}, {'A1':12,'A2':120,'avg':NaN}]
Could you tell me what's wrong with this approach? Thanks!
When you use numeric_only it "drops" the non-numeric columns, so in your second case it drops both columns, since both contain strings. If you look more closely at your averages in the first case, you will see that in the second and third rows it only takes the 11 and 12, since the 110 and 120 are "dropped" because of the empty string.
If you want, you can do this:
import numpy as np
df['avg'] = df[['A1','A2']].replace('', np.nan).apply(lambda row: np.nanmean(row), axis=1)
This replaces '' with NaN and takes the mean, ignoring those NaN values.
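A slightly simpler variant (a sketch that relies on DataFrame.mean skipping NaN by default) casts the cleaned columns to float and avoids the row-wise apply:
import numpy as np
import pandas as pd

df = pd.DataFrame([{'A1': 10, 'A2': ''}, {'A1': '', 'A2': 110}, {'A1': 12, 'A2': 120}])
df['avg'] = df[['A1', 'A2']].replace('', np.nan).astype(float).mean(axis=1)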
You should coerce the columns to numeric types. A simple way could be:
df['avg'] = pd.DataFrame({col : pd.to_numeric(df[col]) for col in df.columns}).mean(axis=1)
It gives as expected:
   A1   A2    avg
0  10        10.0
1      110  110.0
2  12  120   66.0

How to slice within multiple columns?

I have a dataframe with some columns and I want to work with 3 of them (with some nan). In order to simplify, let's say that the columns are:
    A      B     C
 2135  87539  5255
  213
 9841    126
The first thing I want to do is keep only the first two digits of each cell, but I don't know how to do it, since the dtype is float and I have some missing values. I want to have this:
  A   B   C
 21  87  52
 21
 98  12
Then I want to replace the NaN values with 103. This part I did as follows, and it worked:
df.update(df[['A', 'B', 'C']].fillna(103))
So my final dataframe would be like this:
  A    B    C
 21   87   52
 21  103  103
 98   12  103
I just really don't know how to do the first part, where I slice the integers. Can anyone help me?
IIUC, use DataFrame.stack and Series.astype, then take the first 2 characters with the .str accessor.
df.stack().astype(str).str[:2].unstack()
Output
    A   B   C
0  21  87  52
1  21
2  98  12
or use pd.to_numeric if you want float values at the end:
pd.to_numeric(df.stack().astype(str).str[:2], errors='coerce').unstack()
      A     B     C
0  21.0  87.0  52.0
1  21.0   NaN   NaN
2  98.0  12.0   NaN
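To also cover the second step from the question (filling the NaN values with 103), a sketch chaining the same expression with fillna:
result = pd.to_numeric(df.stack().astype(str).str[:2], errors='coerce').unstack().fillna(103)
#       A      B      C
# 0  21.0   87.0   52.0
# 1  21.0  103.0  103.0
# 2  98.0   12.0  103.0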
Here's what you can do. I'm pretty sure it is not the perfect solution, but it's a rough idea and it does work. Maybe you can try to optimize it a bit.
import pandas as pd

data = [[2135, 87539, 5255], [213, 130, 130], [9841, 126, 130]]
df = pd.DataFrame(data, columns=['A', 'B', 'C'], dtype=float)
print(df)

# Converting to object dtype
df[["A", "B", "C"]] = df[["A", "B", "C"]].astype(str)
print(df)

# Storing columns in lists
listA = list(df.A)
listB = list(df.B)
listC = list(df.C)

# For generating the required data
newListA = []
newListB = []
newListC = []

# Storing the required data into the new lists
for itemA, itemB, itemC in zip(listA, listB, listC):
    newListA.append(str(itemA[0:2]))
    newListB.append(str(itemB[0:2]))
    newListC.append(str(itemC[0:2]))

# Converting the lists to a new dataframe with float dtype
new_df = pd.DataFrame(list(zip(newListA, newListB, newListC)), columns=['A', 'B', 'C'], dtype=float)
print(new_df)

Pandas select all columns without NaN

I have a DF with 200 columns. Most of them contain NaN's. I would like to select all columns with no NaN's, or at least with the minimum number of NaN's. I've tried dropping with a threshold and with notnull(), but without success. Any ideas?
df.dropna(thresh=2, inplace=True)
df_notnull = df[df.notnull()]
DF for example:
col1  col2  col3
  23    45   NaN
  54    39   NaN
 NaN    45    76
  87    32   NaN
The output should look like:
df.dropna(axis=1, thresh=2)
col1  col2
  23    45
  54    39
 NaN    45
  87    32
You can create a new DataFrame with only the non-NaN columns using
df = df[df.columns[~df.isnull().all()]]
Or
null_cols = df.columns[df.isnull().all()]
df.drop(null_cols, axis = 1, inplace = True)
If you wish to remove columns based on a certain percentage of NaNs, say columns where more than 90% of the data is null:
cols_to_delete = df.columns[df.isnull().sum()/len(df) > .90]
df.drop(cols_to_delete, axis = 1, inplace = True)
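Equivalently (a sketch using the fact that isnull().mean() gives the fraction of missing values per column), you can keep columns with at most a given share of NaNs in one selection:
# keep only columns where at most 10% of the values are NaN
df = df.loc[:, df.isnull().mean() <= 0.10]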
df[df.columns[~df.isnull().any()]] will give you a DataFrame with only the columns that have no null values, and should be the solution.
df[df.columns[~df.isnull().all()]] only removes the columns that have nothing but null values and leaves columns with even one non-null value.
df.isnull() will return a dataframe of booleans with the same shape as df. These bools will be True if the particular value is null and False if it isn't.
df.isnull().any() will return True for all columns with even one null. This is where I'm diverging from the accepted answer, as df.isnull().all() will not flag columns that have even one non-null value!
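A quick sketch on the example frame from the question showing the difference:
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [23, 54, np.nan, 87],
                   'col2': [45, 39, 45, 32],
                   'col3': [np.nan, np.nan, 76, np.nan]})

print(df.columns[~df.isnull().any()].tolist())  # ['col2'] - no nulls at all
print(df.columns[~df.isnull().all()].tolist())  # ['col1', 'col2', 'col3'] - at least one non-null value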
I assume that you want to get all the columns without any NaN. If that's the case, you can first get the names of the columns without any NaN using ~col.isnull().any(), then use that to select your columns.
I can think of the following code:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'col1': [23, 54, np.nan, 87],
    'col2': [45, 39, 45, 32],
    'col3': [np.nan, np.nan, 76, np.nan],
})

# This function will check if there is a null value in the column
def has_nan(col, threshold=0):
    return col.isnull().sum() > threshold

# Then you apply the "complement" of the function to get the columns with
# no NaN.
df.loc[:, ~df.apply(has_nan)]

# ... or pass the threshold as a parameter, if needed
df.loc[:, ~df.apply(has_nan, args=(2,))]
Here is a simple function which you can use directly by passing a dataframe and a threshold:
df
'''
     pets   location     owner     id
0     cat  San_Diego     Champ  123.0
1     dog        NaN       Ron    NaN
2     cat        NaN     Brick    NaN
3  monkey        NaN     Champ    NaN
4  monkey        NaN  Veronica    NaN
5     dog        NaN      John    NaN
'''
def rmissingvaluecol(dff, threshold):
    # columns to keep: those with less than `threshold` percent missing values
    l = list(dff.drop(dff.loc[:, list((100 * (dff.isnull().sum() / len(dff.index)) >= threshold))].columns, axis=1).columns.values)
    print("# Columns having more than %s percent missing values:" % threshold, (dff.shape[1] - len(l)))
    print("Columns:\n", list(set(list((dff.columns.values))) - set(l)))
    return l
rmissingvaluecol(df,1) #Here threshold is 1% which means we are going to drop columns having more than 1% of missing values
#output
'''
# Columns having more than 1 percent missing values: 2
Columns:
['id', 'location']
'''
Now create new dataframe excluding these columns
l = rmissingvaluecol(df,1)
df1 = df[l]
PS: You can change threshold as per your requirement
Bonus step
You can find the percentage of missing values for each column (optional)
def missing(dff):
    print(round((dff.isnull().sum() * 100 / len(dff)), 2).sort_values(ascending=False))
missing(df)
#output
'''
id 83.33
location 83.33
owner 0.00
pets 0.00
dtype: float64
'''
You should try df_notnull = df.dropna(how='all')
This will keep only the rows that are not entirely null.
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html
null_series = df.isnull().sum() # The number of missing values from each column in your dataframe
full_col_series = null_series[null_series == 0] # Will keep only the columns with no missing values
df = df[full_col_series.index]
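The same idea as a one-liner, selecting directly on the per-column null counts:
df = df.loc[:, df.isnull().sum() == 0]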
This worked quite well for me and is probably suited to your needs as well!
def nan_weed(df, thresh):
    keep = []
    for col in df.columns:
        # keep columns whose NaN count is within the threshold
        if df[col].isnull().sum() <= thresh:
            keep.append(col)
    return df[keep]
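For example, on the frame from the question (a sketch; nan_weed is defined above):
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [23, 54, np.nan, 87],
                   'col2': [45, 39, 45, 32],
                   'col3': [np.nan, np.nan, 76, np.nan]})

print(nan_weed(df, 0).columns.tolist())  # ['col2']
print(nan_weed(df, 1).columns.tolist())  # ['col1', 'col2']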
I see a lot of answers on this thread about how to get rid of null values, which for my dataframes is never the case. We do not delete data. Ever.
I took the question as how to get just your null values to show, and in my case I had to find latitude and longitude and fill them in.
What I did for nulls in one column was this:
df[df['Latitude'].isnull()]
or, to spell it out,
dataframe[dataframe['Column you want'].isnull()]
This pulled up my whole data frame and all the missing values of latitude.
What did not work, and I can't explain why, was trying to do two columns at the same time:
df[df[['Latitude','Longitude']].isnull()]
That gives me all NaNs across the entire data frame.
So to do this all at once, I added the ID (in my case the ID for each row is the APN) along with the two columns I needed at the end:
df[df['Latitude'].isnull()][['APN','Latitude','Longitude']]
By doing this little hack I was able to get every ID I needed to add data to, filtering across 600,000+ rows. Then I did it again for longitude just to be sure I did not miss anything.
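For what it's worth, the two-column version behaves that way because indexing a DataFrame with a boolean DataFrame masks element-wise instead of filtering rows. A sketch of the usual row-wise filter, reusing the Latitude/Longitude/APN column names from above:
# rows where either coordinate is missing, showing only the columns of interest
missing_coords = df[df[['Latitude', 'Longitude']].isnull().any(axis=1)][['APN', 'Latitude', 'Longitude']]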

Pandas command unintentionally shifting data between rows

I have a pandas dataframe with an integer column called TDLINX. I'm trying to convert it to a string padded with leading zeros so that all values are 7 characters. So 7 would become "0000007".
This is the code that I used:
df_merged_total['TDLINX2'] = df.TDLINX.apply(lambda x: str(x).zfill(7))
At first glance this appeared to work, but as I went further down the file, I realized that the value in TDLINX2 was starting to get shifted. What could be causing this and what can I do to prevent it?
You could do something like this:
>>> df = pd.DataFrame({"col":[1, 33, 555, 7777]})
>>> df["new_col"] = ["%07d" % x for x in df.col]
>>> df
col new_col
0 1 0000001
1 33 0000033
2 555 0000555
3 7777 0007777
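As a vectorized alternative, a sketch with Series.str.zfill; note that assigning a Series from one dataframe into another (as with df_merged_total and df in the question) aligns on the index, which can look like values being "shifted" when the two indexes differ:
# pad within the same dataframe so no index alignment is involved
df["new_col"] = df["col"].astype(str).str.zfill(7)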
