Remove columns that have 'N' number of NA values in them - python

Suppose I use df.isnull().sum() and I get a count of all the 'NA' values in each column of the df dataframe. I want to remove any column that has 'K' or more NA values.
For example,
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2.1, np.nan, 4.7, 5.6, 6.8],
                   'B': [0, np.nan, np.nan, 0, 0, 0],
                   'C': [0, 0, 0, 0, 0, 0.0],
                   'D': [5, 5, np.nan, np.nan, 5.6, 6.8],
                   'E': [0, np.nan, np.nan, np.nan, np.nan, np.nan]})
df.isnull().sum()
A 1
B 2
C 0
D 2
E 5
dtype: int64
Suppose I want to remove columns that have 2 or more NA values. How would I approach this problem? My output should be:
df.columns
A,C
Can anybody help me do this?
Thanks

Call dropna and pass axis=1 to drop column-wise, and pass thresh=len(df) - K + 1. The thresh argument sets the minimum number of non-NaN values a column must have to be kept, so any column with K or more NaN values is dropped. Here K is 2, so thresh is len(df) - 1 = 5:
In [22]:
df.dropna(axis=1, thresh=len(df)-1)
Out[22]:
A C
0 1.0 0
1 2.1 0
2 NaN 0
3 4.7 0
4 5.6 0
5 6.8 0
If you just want the columns:
In [23]:
df.dropna(axis=1, thresh=len(df)-1).columns
Out[23]:
Index(['A', 'C'], dtype='object')
Or simply mask the counts output against the columns:
In [28]:
df.columns[df.isnull().sum() < 2]
Out[28]:
Index(['A', 'C'], dtype='object')
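More generally, for an arbitrary cutoff K (drop columns with K or more NaNs), a parameterized sketch of both approaches:
K = 2
# keep columns with at most K-1 NaNs, i.e. at least len(df) - K + 1 non-NaN values
kept = df.dropna(axis=1, thresh=len(df) - K + 1)
# equivalent: mask the NaN counts directly
kept = df.loc[:, df.isnull().sum() < K]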

Could do something like:
df = df.reindex(columns=[x for x in df.columns.values if df[x].isnull().sum() < threshold])
This just builds a list of columns that meet your requirement (fewer than threshold nulls) and then uses that list to reindex the dataframe. So if you set threshold to 1:
threshold = 1
df = pd.DataFrame({'A': [1, 2.1, np.nan, 4.7, 5.6, 6.8],
                   'B': [0, np.nan, np.nan, 0, 0, 0],
                   'C': [0, 0, 0, 0, 0, 0.0],
                   'D': [5, 5, np.nan, np.nan, 5.6, 6.8],
                   'E': ['NA', 'NA', 'NA', 'NA', 'NA', 'NA']})
df = df.reindex(columns=[x for x in df.columns.values if df[x].isnull().sum() < threshold])
df.count()
Will yield:
C 6
E 6
dtype: int64

The dropna() function has a thresh argument that allows you to give the minimum number of non-NaN values you require, so this gives the following counts:
df.dropna(axis=1, thresh=5).count()
A 5
C 6
E 6
If you wanted just C & E, you'd have to change thresh to 6 in this case.
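Note that the counts above appear to be based on the variant df from the previous answer, where column E holds the string 'NA' rather than np.nan. With the original df from the question, the same call keeps only the desired columns:
df.dropna(axis=1, thresh=5).columns
# Index(['A', 'C'], dtype='object')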

Related

duplicate index in a list and calculate mean by index

Input: a list of dataframes
df1 = pd.DataFrame({'N': [1.2, 1.4, 3.3]}, index=[1, 2, 3])
df2 = pd.DataFrame({'N': [2.2, 1.8, 4.3]}, index=[1, 2, 4])
df3 = pd.DataFrame({'N': [2.5, 6.4, 4.9]}, index=[3, 5, 7])
df_list = []
for df in (df1, df2, df3):
    df_list.append(df)
The indices 1, 2 and 3 are duplicated across the dataframes, and I want the average of their values in the output.
Output: a dataframe with the corresponding index
1 (1.2+2.2)/2
2 (1.4+1.8)/2
3 (3.3+2.5)/2
4 4.3
5 6.4
7 4.9
So how do I group by the duplicated indices across a list of dataframes and output the averages into a dataframe? Directly concatenating the dataframes is not an option for me.
I would first concatenate all the data into a single DataFrame. Note that the values will automatically be aligned by index. Then you can get the means easily:
df1 = pd.DataFrame({'N': [1.2, 1.4, 3.3]}, index=[1, 2, 3])
df2 = pd.DataFrame({'N': [2.2, 1.8, 4.3]}, index=[1, 2, 4])
df3 = pd.DataFrame({'N': [2.5, 6.4, 4.9]}, index=[3, 5, 7])
df_list = [df1, df2, df3]
df = pd.concat(df_list, axis=1)
df.columns = ['N1', 'N2', 'N3']
print(df.mean(axis=1))
1 1.7
2 1.6
3 2.9
4 4.3
5 6.4
7 4.9
dtype: float64
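An alternative sketch (assuming the same df_list; it still uses pd.concat, but only to stack rows): group on the index, which avoids having to rename columns:
stacked = pd.concat(df_list)                  # rows of df1, df2, df3, duplicate index labels kept
means = stacked.groupby(level=0)['N'].mean()  # average rows that share an index label
print(means)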

multiply 2 columns in 2 dfs if they match the column name

I have 2 dataframes with some shared column names.
I tried this; it only worked when the column names in the national dataframe were not repeated.
out = {}
for col in national.columns:
    for col2 in F.columns:
        if col == col2:
            out[col] = national[col].values * F[col2].values
I tried to use the same code on a df that has repeated column names, but I got the following error: 'shapes (26,33) and (1,26) not aligned: 33 (dim 1) != 1 (dim 0)'. This is because the second df has 33 columns with the same name, and each of them needs to be multiplied elementwise by one column of the first df.
This code does not work either, as there are repeated column names in urban.columns.
[np.matrix(urban[col].values) * np.matrix(F[col2].values) for col in urban.columns for col2 in F.columns if col == col2]
Reproducible code:
df1 = pd.DataFrame({
    'Col1': [1, 2, 1, 2, 3],
    'Col2': [2, 4, 2, 4, 6],
    'Col3': [7, 4, 2, 8, 6]})
df2 = pd.DataFrame({
    'Col1': [1.5, 2.0, 3.0, 5.0, 10.0],
    'Col2': [1, 0.0, 4.0, 5.0, 7.0]})
Hopefully the working example below helps. Please provide a minimal reproducible example in your question, with input code and desired output like I have provided here. Please also see how to ask a good pandas question.
df1 = pd.DataFrame({
    'Product': ['AA', 'AA', 'BB', 'BB', 'BB'],
    'Col1': [1, 2, 1, 2, 3],
    'Col2': [2, 4, 2, 4, 6]})
print(df1)
df2 = pd.DataFrame({
    'FX Rate': [1.5, 2.0, 3.0, 5.0, 10.0]})
print(df2)
df1 = df1.reset_index(drop=True)
df2 = df2.reset_index(drop=True)
for col in ['Col1', 'Col2']:
    df1[col] = df1[col] * df2['FX Rate']
df1
(df1)
Product Col1 Col2
0 AA 1 2
1 AA 2 4
2 BB 1 2
3 BB 2 4
4 BB 3 6
(df2)
FX Rate
0 1.5
1 2.0
2 3.0
3 5.0
4 10.0
Out[1]:
Product Col1 Col2
0 AA 1.5 3.0
1 AA 4.0 8.0
2 BB 3.0 6.0
3 BB 10.0 20.0
4 BB 30.0 60.0
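The loop above could also be written as a single vectorized multiplication; a sketch assuming the same df1 and df2:
# multiply both columns by the 'FX Rate' series at once, broadcasting along the row index
df1[['Col1', 'Col2']] = df1[['Col1', 'Col2']].mul(df2['FX Rate'], axis=0)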
You can't multiply two DataFrames if they have different shapes, but if you want to multiply them anyway, use transpose:
out = {}
for col in national.columns:
    for col2 in F.columns:
        if col == col2:
            out[col] = national[col].values * F[col2].T.values
You can get the common columns of the 2 dataframes, multiply the two dataframes on those columns with a simple multiplication, and then join back the column(s) that exist only in df1 to the result, as follows:
common_cols = df1.columns.intersection(df2.columns)
df1_only_cols = df1.columns.difference(common_cols)
df1_out = df1[df1_only_cols].join(df1[common_cols] * df2[common_cols])
df1 = df1_out.reindex_like(df1)
Demo
df1 = pd.DataFrame({
    'Product': ['AA', 'AA', 'BB', 'BB', 'BB'],
    'Col1': [1, 2, 1, 2, 3],
    'Col2': [2, 4, 2, 4, 6],
    'Col3': [7, 4, 2, 8, 6]})
df2 = pd.DataFrame({
    'Col1': [1.5, 2.0, 3.0, 5.0, 10.0],
    'Col2': [1, 0.0, 4.0, 5.0, 7.0]})
common_cols = df1.columns.intersection(df2.columns)
df1_only_cols = df1.columns.difference(common_cols)
df1_out = df1[df1_only_cols].join(df1[common_cols] * df2[common_cols])
df1 = df1_out.reindex_like(df1)
print(df1)
Product Col1 Col2 Col3
0 AA 1.5 2.0 7
1 AA 4.0 0.0 4
2 BB 3.0 8.0 2
3 BB 10.0 20.0 8
4 BB 30.0 42.0 6
A friend of mine sent this solution, which works just as I wanted.
out = urban.copy()
for col in urban.columns:
    for col2 in F.columns:
        if col == col2:
            out.loc[:, col] = urban.loc[:, [col]].values * F.loc[:, [col2]].values

Pandas: Last time when a column had a non-nan value

Let's assume that I have the following data-frame:
df = pd.DataFrame({"id": [1, 1, 1, 2, 2], "nominal": [1, np.nan, 1, 1, np.nan], "numeric1": [3, np.nan, np.nan, 7, np.nan], "numeric2": [2, 3, np.nan, 2, np.nan], "numeric3": [np.nan, 2, np.nan, np.nan, 3], "date":[pd.Timestamp(2005, 6, 22), pd.Timestamp(2006, 2, 11), pd.Timestamp(2008, 9, 13), pd.Timestamp(2009, 5, 12), pd.Timestamp(2010, 5, 9)]})
As output, I want to get a data-frame that indicates the number of days that have passed since a non-nan value was last seen for that column, for that id. If a column has a value for the corresponding date, or if a column doesn't have a value at the start of a new id, the value should be 0. In addition, this should be computed only for the numeric columns. With that said, the output data-frame should be:
output_df = pd.DataFrame({"numeric1_delta": [0, 234, 1179, 0, 362], "numeric2_delta": [0, 0, 945, 0, 362], "numeric3_delta": [0, 0, 945, 0, 0]})
Looking forward to your answers!
You can group by the cumulative sum of the non-null indicator and then subtract the first date of each group:
In [11]: df.numeric1.notnull().cumsum()
Out[11]:
0 1
1 1
2 1
3 2
4 2
Name: numeric1, dtype: int64
In [12]: df.groupby(df.numeric1.notnull().cumsum()).date.transform(lambda x: x.iloc[0])
Out[12]:
0 2005-06-22
1 2005-06-22
2 2005-06-22
3 2009-05-12
4 2009-05-12
Name: date, dtype: datetime64[ns]
In [13]: df.date - df.groupby(df.numeric1.notnull().cumsum()).date.transform(lambda x: x.iloc[0])
Out[13]:
0 0 days
1 234 days
2 1179 days
3 0 days
4 362 days
Name: date, dtype: timedelta64[ns]
For multiple columns:
ncols = [col for col in df.columns if col.startswith("numeric")]
for c in ncols:
    df[c + "_delta"] = df.date - df.groupby(df[c].notnull().cumsum()).date.transform('first')
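The question also asks for the counter to reset per id and for integer day counts; a sketch extending the loop above (group on both id and the cumulative non-null count, then convert the timedeltas with .dt.days):
ncols = [col for col in df.columns if col.startswith("numeric")]
for c in ncols:
    # restart the 'last seen' date at each new id as well as at each non-null value
    first_seen = df.groupby([df["id"], df[c].notnull().cumsum()]).date.transform('first')
    df[c + "_delta"] = (df.date - first_seen).dt.days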

Change some, but not all, pandas multiindex column names

Suppose I have a data frame with multiindex column names that looks like this:
         A                     B
     '1.5'  '2.3'  '8.4'      b1
r1       1      2      3       a
r2       4      5      6       b
r3       7      8      9      10
How would I change the just the column names under 'A' from strings to floats, without modifying 'b1', to get the following?
         A                     B
       1.5    2.3    8.4      b1
r1       1      2      3       a
r2       4      5      6       b
r3       7      8      9      10
In the real use case, under 'A' there would be thousands of columns with names that should be floats (they represent the wavelengths for a spectrometer) and the data in the data frame represents multiple different observations.
Thanks!
# build the DataFrame (sideways at first, then transposed)
arrays = [['A','A','A','B'],['1.5', '2.3', '8.4', 'b1']]
tuples = list( zip(*arrays) )
data1 = np.array([[1,2,3,'a'], [4,5,6,'b'], [7,8,9,10]])
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
df = pd.DataFrame(data1.T, index=index).T
Printing df.columns gives the existing column names.
Out[84]:
MultiIndex(levels=[[u'A', u'B'], [u'1.5', u'2.3', u'8.4', u'b1']],
labels=[[0, 0, 0, 1], [0, 1, 2, 3]],
names=[u'first', u'second'])
Now change the column names
# make new column titles (probably more pythonic ways to do this)
A_cols = [float(i) for i in df['A'].columns]
B_cols = [i for i in df['B'].columns]
cols = A_cols + B_cols
# set levels
levels = [df.columns.levels[0],cols]
df.columns.set_levels(levels,inplace=True)
Gives the following output
Out[86]:
MultiIndex(levels=[[u'A', u'B'], [1.5, 2.3, 8.4, u'b1']],
labels=[[0, 0, 0, 1], [0, 1, 2, 3]],
names=[u'first', u'second'])
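An alternative sketch that avoids modifying the index levels in place: rebuild the MultiIndex from tuples, converting only the labels that sit under 'A' to floats:
new_tuples = [(top, float(sub) if top == 'A' else sub) for top, sub in df.columns]
df.columns = pd.MultiIndex.from_tuples(new_tuples, names=df.columns.names)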

Why does changing one `np.nan` value change all of the nan values in pandas dataframe?

When I change one value in the entire DataFrame, it changes other values. Compare scenario 1 and scenario 2:
Scenario 1: Here notice that I only have float(np.nan) values for NaNs
import random

info_num = np.array([[random.randint(0,9) for x in range(4)]+['ui'],
                     [random.randint(0,8) for x in range(3)]+[float(np.nan)]+['g'],
                     [random.randint(0,7) for x in range(2)]+[float(np.nan)]+[90]+[float(np.nan)],
                     [random.randint(0,9) for x in range(4)]+['q'],
                     [random.randint(0,9) for x in range(4)]+['w']])
result_df = pd.DataFrame(data=info_num, columns=['G', 'Bd', 'O', 'P', 'keys'])
result_df = result_df.fillna(0.0)  # does NOT fill in the NaNs
The result of Scenario 1 is just a dataframe without the NaNs filled in.
Scenario 2: Here notice that I have a None value in just ONE spot
info_num = np.array([[random.randint(0,9) for x in range(4)]+['ui'],
                     [random.randint(0,8) for x in range(3)]+[None]+['g'],
                     [random.randint(0,7) for x in range(2)]+[float(np.nan)]+[90]+[float(np.nan)],
                     [random.randint(0,9) for x in range(4)]+['q'],
                     [random.randint(0,9) for x in range(4)]+['w']])
result_df = pd.DataFrame(data=info_num, columns=['G', 'Bd', 'O', 'P', 'keys'])
result_df = result_df.fillna(0.0)  # this works!?!
Even though I replaced only one of the float(np.nan) values with None, the other float(np.nan)s get filled in with 0.0 as well, as if they were NaNs too.
Why is there some relationship between the NaNs?
The 1st info_num is dtype='<U3' (strings). In the 2nd it is dtype=object, a mix of integers, nan (a float), strings, and a None.
In the dataframes I see something that prints as 'nan' in the one, and a mix of None and NaN in the other. It looks like fillna treats None and NaN the same, but ignores the string 'nan'.
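A quick scalar check (a sketch, not part of the original answer) illustrates that distinction:
# pd.isnull treats np.nan and None as missing, but not the string 'nan'
pd.isnull(np.nan)   # True
pd.isnull(None)     # True
pd.isnull('nan')    # False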
The doc for fillna
Fill NA/NaN values using the specified method
Pandas NaN is the same as np.nan.
fillna uses pd.isnull to determine where to put the 0.0 value.
def isnull(obj):
"""Detect missing values (NaN in numeric arrays, None/NaN in object arrays)
For the 2nd case:
In [116]: pd.isnull(result_df)
Out[116]:
G Bd O P keys
0 False False False False False
1 False False False True False
2 False False True False True
3 False False False False False
4 False False False False False
(It's all False for the first, string, case.)
In [121]: info_num0
Out[121]:
array([['4', '8', '5', '6', 'ui'],
['1', '5', '6', 'nan', 'g'],
['6', '1', 'nan', '90', 'nan'],
['5', '2', '8', '4', 'q'],
['1', '6', '4', '3', 'w']],
dtype='<U3')
In [122]: info_num
Out[122]:
array([[1, 8, 3, 0, 'ui'],
[1, 5, 1, None, 'g'],
[0, 2, nan, 90, nan],
[7, 7, 1, 4, 'q'],
[3, 7, 0, 3, 'w']], dtype=object)
np.nan is float already:
In [125]: type(np.nan)
Out[125]: float
If you'd added dtype=object to the initial array definition, you'd get the same effect as using that None:
In [140]: np.array([[random.randint(0,9) for x in range(4)]+['ui'],
[random.randint(0,8) for x in range(3)]+[np.nan]+['g'],
[random.randint(0,7) for x in range(2)]+[np.nan]+[90]+[np.nan],
[random.randint(0,9) for x in range(4)]+['q'],
[random.randint(0,9) for x in range(4)]+['w']],dtype=object)
Out[140]:
array([[6, 7, 8, 1, 'ui'],
[5, 2, 5, nan, 'g'],
[3, 0, nan, 90, nan],
[5, 2, 1, 3, 'q'],
[1, 7, 7, 2, 'w']], dtype=object)
Better yet, create the initial data as a list of lists rather than an array. numpy arrays have to have uniform elements; with a mix of ints, nan, and strings you only get that with dtype=object. But that is little more than an array wrapper around a list. Python lists already allow this kind of diversity.
In [141]: alist = [[random.randint(0,9) for x in range(4)]+['ui'],
[random.randint(0,8) for x in range(3)]+[np.nan]+['g'],
[random.randint(0,7) for x in range(2)]+[np.nan]+[90]+[np.nan],
[random.randint(0,9) for x in range(4)]+['q'],
[random.randint(0,9) for x in range(4)]+['w']]
In [142]: alist
Out[142]:
[[4, 0, 2, 6, 'ui'],
[3, 3, 3, nan, 'g'],
[3, 5, nan, 90, nan],
[4, 0, 6, 7, 'q'],
[0, 8, 3, 8, 'w']]
In [143]: result_df1 = pd.DataFrame(data=alist, columns=['G','Bd', 'O', 'P', 'keys'])
In [144]: result_df1
Out[144]:
G Bd O P keys
0 4 0 2 6 ui
1 3 3 3 NaN g
2 3 5 NaN 90 NaN
3 4 0 6 7 q
4 0 8 3 8 w
I'm not sure how pandas stores this internally, but result_df1.values does return an object array.
In [146]: result_df1.values
Out[146]:
array([[4, 0, 2.0, 6.0, 'ui'],
[3, 3, 3.0, nan, 'g'],
[3, 5, nan, 90.0, nan],
[4, 0, 6.0, 7.0, 'q'],
[0, 8, 3.0, 8.0, 'w']], dtype=object)
So if a column has a nan, all its numbers are floats (nan is a kind of float). The first 2 columns remain integer. The last is a mix of strings and that nan.
But dtypes suggest that pandas is using a structured array, with each column being a field with the relevant dtype.
In [147]: result_df1.dtypes
Out[147]:
G int64
Bd int64
O float64
P float64
keys object
dtype: object
The equivalent numpy dtype would be:
dt = np.dtype([('G',np.int64),('Bd',np.int64),('O',np.float64),('P',np.float64), ('keys',object)])
We can make a structured array with this dtype. I have to turn the list of lists into a list of tuples (the structured records):
X = np.array([tuple(x) for x in alist],dt)
producing:
array([(4, 0, 2.0, 6.0, 'ui'),
(3, 3, 3.0, nan, 'g'),
(3, 5, nan, 90.0, nan),
(4, 0, 6.0, 7.0, 'q'),
(0, 8, 3.0, 8.0, 'w')],
dtype=[('G', '<i8'), ('Bd', '<i8'), ('O', '<f8'), ('P', '<f8'), ('keys', 'O')])
That can go directly into Pandas as:
In [162]: pd.DataFrame(data=X)
Out[162]:
G Bd O P keys
0 4 0 2 6 ui
1 3 3 3 NaN g
2 3 5 NaN 90 NaN
3 4 0 6 7 q
4 0 8 3 8 w
