I have this dataframe:
df = pd.DataFrame([{'A1': 10, 'A2': ''}, {'A1': 11, 'A2': 110}, {'A1': 12, 'A2': 120}])
And I'd like to average the different columns ignoring the '' (empty string) values.
This is the desired output:
df_AVG = [{'A1':10, 'A2':'','avg':10}, {'A1':11,'A2':110,'avg': 60.5}, {'A1':12,'A2':120,'avg':66}]
And I can do it with this code:
df['avg'] = df[['A1','A2']].mean(axis=1, numeric_only=True)
But when I modify the dataframe so that it includes more than one blank value, like this
df = pd.DataFrame([{'A1': 10, 'A2': ''}, {'A1': '', 'A2': 110}, {'A1': 12, 'A2': 120}])
and run the same code, all the 'avg' values are NaN, including the ones that previously worked:
df_AVG = [{'A1':10, 'A2':'','avg':NaN}, {'A1':'','A2':110,'avg': NaN}, {'A1':12,'A2':120,'avg':NaN}]
Could you tell me what's wrong with this approach? Thanks!
When you use numeric_only it "drops" the non-numeric columns, so in your second case it drops both columns, since each of them contains an empty string and therefore has object dtype. If you check the average in your first case more closely, you will see that in the second and third rows it only takes the 11 and 12, because the 110 and 120 are "dropped" along with the column that holds the empty string.
If you want, you can do this:
import numpy as np

df['avg'] = df[['A1','A2']].replace('', np.nan).apply(lambda row: np.nanmean(row), axis=1)
It replaces '' with NaN and takes the mean ignoring those NaNs.
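For completeness, here is a minimal end-to-end sketch of that approach, assuming the second DataFrame from the question (the one with two blanks):

import numpy as np
import pandas as pd

# Second example from the question: one blank in each of two different rows
df = pd.DataFrame([{'A1': 10, 'A2': ''}, {'A1': '', 'A2': 110}, {'A1': 12, 'A2': 120}])

# Replace empty strings with NaN, then take the row-wise mean ignoring NaN
df['avg'] = df[['A1', 'A2']].replace('', np.nan).apply(lambda row: np.nanmean(row), axis=1)
print(df['avg'].tolist())  # [10.0, 110.0, 66.0]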
You should coerce the columns to numeric types. A simple way could be:
df['avg'] = pd.DataFrame({col: pd.to_numeric(df[col], errors='coerce') for col in df.columns}).mean(axis=1)
It gives the expected result:
   A1   A2    avg
0  10         10.0
1       110  110.0
2  12   120   66.0
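A more compact variant of the same coercion (just a sketch, not from the answer above) is to let apply run pd.to_numeric over both columns and then take the row-wise mean, which skips NaN by default:

# Coerce both columns to numeric (invalid entries such as '' become NaN),
# then take the row-wise mean, which ignores NaN
df['avg'] = df[['A1', 'A2']].apply(pd.to_numeric, errors='coerce').mean(axis=1)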
I have a large dataset with 400 columns and 30,000 rows. The dataset is all numerical, but some columns have weird string values in them (denoted as "#?") instead of being blank. This changes the dtype of the columns that contain "#?" to object (150 columns end up with object dtype).
I need to convert all the columns to float or int dtypes, and then fill the normal NaN values in the data with the means of each column's groups (e.g. the mean of the X rows and the mean of the Y rows in each column):
col1 col2 col3
X 21 32
X NaN 3
Y NaN 5
My end goal is to apply this to the entire data:
df.groupby("col1").transform(lambda x: x.fillna(x.mean()))
But I can't apply this to the columns that have "#?" in them; they get dropped.
I tried replacing the "#?" with a numerical value and then converting all the columns to float dtype, which works, but the replaced values should also be included in the code above.
I thought about replacing "#?" with a weird sentinel value like -123.456 so that it doesn't get mixed up with actual data points, and then replacing all the -123.456 values with the column-group means, but the -123.456 would need to be excluded from the mean, and I just don't know how that would work. If I convert it back to NaN again, the dtype changes back to object.
I think the best way to go about it would be directly replacing the #? with the column group means.
Any ideas?
edit: I'm so dumb lol
df=df.replace('#?', '').astype(float, errors = 'ignore')
this works.
Use:
print (df)
col1 col2 col3
0 X 21 32
1 X #? 3
2 Y NaN 5
df = (df.set_index('col1')
        .replace(r'#\?', np.nan, regex=True)
        .astype(float)
        .groupby("col1")
        .transform(lambda x: x.fillna(x.mean())))
print (df)
col2 col3
col1
X 21.0 32.0
X 21.0 3.0
Y NaN 5.0
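If you want col1 back as a regular column after the transform (an optional extra step, not part of the answer above), a reset_index() at the end of the chain does it:

df = (df.set_index('col1')
        .replace(r'#\?', np.nan, regex=True)
        .astype(float)
        .groupby("col1")
        .transform(lambda x: x.fillna(x.mean()))
        .reset_index())  # bring col1 back as a column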
I ran into a weird problem in Python pandas: when I read an Excel file and replace the character "K", the result gives me NaN for the rows that do not contain "K" (see the example below).
It should return 173 in row #4 instead of NaN, but if I create a brand new Excel file and type the same numbers, it works.
It also works if I build the DataFrame directly:
df = pd.DataFrame({'sales': ['75.8K', '6.9K', '7K', '6.9K', '173', '148']})
df
Why? Please advise!
This is because the 173 and 148 values from the Excel import are numbers, not strings. The .str accessor methods only operate on string values and return NaN for anything else, so those entries become NaN. You can see that demonstrated by setting up the DataFrame with numbers in those positions:
df = pd.DataFrame({ 'sales':['75.8K','6.9K','7K','6.9K',173,148]})
df.dtypes
# sales object
# dtype: object
df['num'] = df['sales'].str.replace('K','')
Output:
sales num
0 75.8K 75.8
1 6.9K 6.9
2 7K 7
3 6.9K 6.9
4 173 NaN
5 148 NaN
If you don't mind all your values being strings, you can use
df = pd.read_excel('manual_import.xlsx', dtype=str)
or
df = pd.read_excel('manual_import.xlsx', converters={'sales':str})
Either one converts all the sales values to strings.
Try this:
df['nums'] = df['sales'].astype(str)
df['nums'] = pd.to_numeric(df['nums'].str.replace('K', ''))
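Putting the two suggestions together on the sample data from this question (a sketch; the values are just the ones shown above):

import pandas as pd

# Mixed strings and numbers, as after the Excel import described above
df = pd.DataFrame({'sales': ['75.8K', '6.9K', '7K', '6.9K', 173, 148]})

# Force everything to string first, then strip the 'K' and convert to numbers
df['nums'] = pd.to_numeric(df['sales'].astype(str).str.replace('K', '', regex=False))
print(df['nums'].tolist())  # [75.8, 6.9, 7.0, 6.9, 173.0, 148.0]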
I have a CSV that is read by my Python code, and a dataframe is created using pandas.
The CSV file is in the following format:
1 1.0
2 99.0
3 20.0
7 63
My code calculates the 99th percentile and then needs to find all rows whose value in the 2nd column is at least 60.
df = pd.read_csv(io.BytesIO(body), error_bad_lines=False, header=None, encoding='latin1', sep=',')
percentile = df.iloc[:, 1:2].quantile(0.99) # Selecting 2nd column and calculating percentile
criteria = df[df.iloc[:, 1:2] >= 60.0]
While my percentile code works fine, the criteria lookup for all rows with a 2nd-column value of at least 60 returns
NaN NaN
NaN NaN
NaN NaN
NaN NaN
Can you please help me find the error?
Just correct the condition inside criteria. Since the second column has positional index 1, you should write df.iloc[:, 1].
Example:
import pandas as pd
import numpy as np
b =np.array([[1,2,3,7], [1,99,20,63] ])
df = pd.DataFrame(b.T) #just creating the dataframe
criteria = df[ df.iloc[:,1]>= 60 ]
print(criteria)
Why?
It seems like the cause lies in the type of the condition. Let's inspect:
Case 1:
type( df.iloc[:,1]>= 60 )
Returns pandas.core.series.Series, so it gives
df[ df.iloc[:,1]>= 60 ]
#out:
   0   1
1  2  99
3  7  63
Case2:
type( df.iloc[:,1:2]>= 60 )
Returns a pandas.core.frame.DataFrame, and gives
df[ df.iloc[:,1:2]>= 60 ]
#out:
     0     1
0  NaN   NaN
1  NaN  99.0
2  NaN   NaN
3  NaN  63.0
So indexing with a boolean DataFrame masks the values element-wise instead of selecting rows, which is why you get the NaNs.
Always keep in mind that 3 is a scalar, while 3:4 is a slice.
For more info it is always good to take a look at the official docs on pandas indexing.
Your indexing is a bit off: you only have two columns, [0, 1], and you are interested in selecting just the one with index 1. As @applesoup mentioned, the following is enough:
criteria = df[df.iloc[:, 1] >= 60.0]
However, I would consider naming columns and just referencing based on name. This will allow you to avoid any mistakes in case your df structure changes, e.g.:
import pandas as pd
df = pd.DataFrame({'a': [1, 2, 3, 7], 'b': [1.0, 99.0, 20.0, 63.]})
criteria = df[df['b'] >= 60.0]
People here seem more interested in coming up with alternative solutions than in digging into the OP's code to find out what is really wrong. I will adopt a diametrically opposed strategy!
The problem with your code is that you are indexing your DataFrame df by another DataFrame. Why? Because you use slices instead of integer indexing.
df.iloc[:, 1:2] >= 60.0   # returns a DataFrame with one boolean column
df.iloc[:, 1] >= 60.0     # returns a Series
df.iloc[:, [1]] >= 60.0   # returns a DataFrame with one boolean column
So correct your code by using:
criteria = df[df.iloc[:, 1] >= 60.0]  # don't slice!
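If you really want to keep the slice syntax for some reason, one option (a sketch, not required for the fix above) is to squeeze the one-column DataFrame into a Series before using it as a mask:

mask = (df.iloc[:, 1:2] >= 60.0).squeeze()  # one-column DataFrame -> Series
criteria = df[mask]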
I have a DataFrame with 200 columns. Most of them are full of NaNs. I would like to select all columns with no NaNs, or at least the ones with a minimum of NaNs. I've tried dropping them with a threshold or with notnull(), but without success. Any ideas?
df.dropna(thresh=2, inplace=True)
df_notnull = df[df.notnull()]
DF for example:
col1 col2 col3
23 45 NaN
54 39 NaN
NaN 45 76
87 32 NaN
The output should look like:
col1 col2
23   45
54   39
NaN  45
87   32
df.dropna(axis=1, thresh=2) gives exactly this output.
You can keep only the columns that are not entirely NaN using
df = df[df.columns[~df.isnull().all()]]
Or
null_cols = df.columns[df.isnull().all()]
df.drop(null_cols, axis = 1, inplace = True)
If you wish to remove columns based on a certain percentage of NaNs, say columns with more than 90% of their values null:
cols_to_delete = df.columns[df.isnull().sum()/len(df) > .90]
df.drop(cols_to_delete, axis = 1, inplace = True)
df[df.columns[~df.isnull().any()]] will give you a DataFrame with only the columns that have no null values, and should be the solution.
df[df.columns[~df.isnull().all()]] only removes the columns that have nothing but null values and leaves columns with even one non-null value.
df.isnull() will return a dataframe of booleans with the same shape as df. These bools will be True if the particular value is null and False if it isn't.
df.isnull().any() will return True for all columns with even one null. This is where I'm diverging from the accepted answer, as df.isnull().all() will not flag columns that still have even one non-null value!
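To make the difference concrete, here is a quick check on the example frame from the question (assuming the same data as above):

import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [23, 54, np.nan, 87],
                   'col2': [45, 39, 45, 32],
                   'col3': [np.nan, np.nan, 76, np.nan]})

print(df.isnull().any())   # col1 True, col2 False, col3 True
print(df.isnull().all())   # all False, because every column has at least one value
print(df[df.columns[~df.isnull().any()]])  # keeps only col2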
I assume that you want to get all the columns without any NaN. If that's the case, you can first find the columns without any NaN using ~col.isnull().any(), then use that to select your columns.
I can think in the following code:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'col1': [23, 54, np.nan, 87],
    'col2': [45, 39, 45, 32],
    'col3': [np.nan, np.nan, 76, np.nan],
})
# This function will check if there is a null value in the column
def has_nan(col, threshold=0):
return col.isnull().sum() > threshold
# Then apply the "complement" of the function to get the columns with
# no NaN.
df.loc[:, ~df.apply(has_nan)]
# ... or pass the threshold as parameter, if needed
df.loc[:, ~df.apply(has_nan, args=(2,))]
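For clarity, this is what the apply call returns on the example frame and function defined above (a quick check of the intermediate result):

print(df.apply(has_nan))
# col1     True
# col2    False
# col3     True
# dtype: bool

print(df.loc[:, ~df.apply(has_nan)])
#    col2
# 0    45
# 1    39
# 2    45
# 3    32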
Here is a simple function which you can use directly by passing a dataframe and a threshold:
df
'''
pets location owner id
0 cat San_Diego Champ 123.0
1 dog NaN Ron NaN
2 cat NaN Brick NaN
3 monkey NaN Champ NaN
4 monkey NaN Veronica NaN
5 dog NaN John NaN
'''
def rmissingvaluecol(dff, threshold):
    # Percentage of missing values in each column
    pct_missing = 100 * dff.isnull().sum() / len(dff.index)
    # Keep the columns whose percentage of missing values is below the threshold
    l = list(dff.columns[pct_missing < threshold])
    print("# Columns having more than %s percent missing values:" % threshold, dff.shape[1] - len(l))
    print("Columns:\n", list(set(dff.columns) - set(l)))
    return l

rmissingvaluecol(df, 1)  # Here the threshold is 1%, i.e. drop columns having more than 1% missing values
#output
'''
# Columns having more than 1 percent missing values: 2
Columns:
['id', 'location']
'''
Now create new dataframe excluding these columns
l = rmissingvaluecol(df,1)
df1 = df[l]
PS: You can change the threshold as per your requirement.
Bonus step
You can find the percentage of missing values for each column (optional)
def missing(dff):
print (round((dff.isnull().sum() * 100/ len(dff)),2).sort_values(ascending=False))
missing(df)
#output
'''
id 83.33
location 83.33
owner 0.00
pets 0.00
dtype: float64
'''
You should try df_notnull = df.dropna(how='all')
This will drop only the rows in which every value is null.
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html
null_series = df.isnull().sum() # The number of missing values from each column in your dataframe
full_col_series = null_series[null_series == 0] # Will keep only the columns with no missing values
df = df[full_col_series.index]
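The same idea can also be written as a single expression if you prefer (a sketch, equivalent to the three lines above):

# Keep only the columns whose count of missing values is zero
df = df.loc[:, df.isnull().sum().eq(0)]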
This worked quite well for me and can probably be tailored to your needs as well:
def nan_weed(df, thresh):
    # Keep the columns whose NaN count is at or below the threshold
    ind = []
    for col in df.columns:
        if df[col].isnull().sum() <= thresh:
            ind.append(col)
    return df[ind]
I see a lot of answers in this thread about how to get rid of null values, which for my dataframes is never an option. We do not delete data. Ever.
I took the question as how to get just your null values to show, and in my case I had to find latitude and longitude and fill them in.
What I did was this for one column nulls:
df[df['Latitude'].isnull()]
or to explain it out
dataframe[dataframe['Column you want'].isnull()]
This pulled up my whole data frame and all the missing values of latitude.
What did not work is this and I can't explain why. Trying to do two columns at the same time:
df[df[['Latitude','Longitude']].isnull()]
That will give me all NANs in the entire data frame.
So to do this all at once, what I added was the ID (in my case the ID for each row is the APN), with the two columns I needed at the end:
df[df['Latitude'].isnull()][['APN','Latitude','Longitude']]
With this little hack I was able to get every ID I needed to add data to, filtering across 600,000+ rows of data. Then I did it again for longitude just to be sure I did not miss anything.
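For what it's worth, the reason the two-column version returns a frame full of NaNs is that indexing a DataFrame with a boolean DataFrame masks element-wise (like .where) instead of filtering rows. A sketch that filters on both columns at once (column names assumed from the example above):

# One boolean per row: True if Latitude or Longitude is missing
missing_coords = df[['Latitude', 'Longitude']].isnull().any(axis=1)
df.loc[missing_coords, ['APN', 'Latitude', 'Longitude']]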
I have a dataframe generated by:
df = pd.DataFrame([[100, ' tes t ', 3], [100, np.nan, 2], [101, ' test1', 3 ], [101,' ', 4]])
It looks like
0 1 2
0 100 tes t 3
1 100 NaN 2
2 101 test1 3
3 101 4
I would like to fill column 1 "forward" with 'tes t' and 'test1'. I believe one approach would be to replace whitespace-only values with np.nan, but it is tricky since the words themselves contain whitespace. I could also group by column 0 and then use the first element of each group to fill forward. Could you provide me with some code for both alternatives? I have not managed to code it myself.
Additionally, I would like to add a column that contains the group means, so the final dataframe should look like this:
0 1 2 3
0 100 tes t 3 2.5
1 100 tes t 2 2.5
2 101 test1 3 3.5
3 101 test1 4 3.5
Could you also please advise how to accomplish something like this?
Many thanks; please let me know in case you need further information.
IIUC, you could use str.strip and then check whether the stripped string is empty.
Then perform the groupby operations, filling the NaNs with the ffill method and calculating the means using the groupby.transform function, as shown:
df[1] = df[1].str.strip().dropna().apply(lambda x: np.NaN if len(x) == 0 else x)
df[1] = df.groupby(0)[1].fillna(method='ffill')
df[3] = df.groupby(0)[2].transform(lambda x: x.mean())
df
Note: If you want to fill the NaN values with the first element of each group instead, you can do this:
df.groupby(0)[1].apply(lambda x: x.fillna(x.iloc[0]))
Breakup of steps:
Since we want to apply the function only to strings, we first drop all the NaN values; otherwise we would get a TypeError because the column contains both floats and strings, and float has no len.
df[1].str.strip().dropna()
0 tes t # operates only on indices where strings are present(empty strings included)
2 test1
3
Name: 1, dtype: object
Reindexing isn't a necessary step, as the computation only happens on the indices where strings are present.
Also, a reset_index(drop=True) isn't needed, because the groupby object returns a Series after fillna which can be assigned straight back to column 1.
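As a compact variant of the same pipeline (a sketch, not the original answer, using replace instead of the dropna/apply step), the whole thing on the question's data looks like this:

import numpy as np
import pandas as pd

df = pd.DataFrame([[100, ' tes t ', 3], [100, np.nan, 2],
                   [101, ' test1', 3], [101, ' ', 4]])

df[1] = df[1].str.strip().replace('', np.nan)   # blank strings -> NaN
df[1] = df.groupby(0)[1].ffill()                # forward-fill within each group
df[3] = df.groupby(0)[2].transform('mean')      # group means as a new column
print(df)
#      0      1  2    3
# 0  100  tes t  3  2.5
# 1  100  tes t  2  2.5
# 2  101  test1  3  3.5
# 3  101  test1  4  3.5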