How do I delete observations from a data frame in Python? For example, I have a data frame with variables a, b, and c, and I want to delete an observation if variable a is missing or variable c is equal to zero.
You could build a boolean mask using isnull:
mask = (df['a'].isnull()) | (df['c'] == 0)
and then select the desired rows with:
df = df.loc[~mask]
~mask is the boolean inverse of mask, so df.loc[~mask] selects rows where a is not null and c is not 0.
For example,
import numpy as np
import pandas as pd
arr = np.arange(15, dtype='float').reshape(5,3) % 4
arr[arr > 2] = np.nan
df = pd.DataFrame(arr, columns=list('abc'))
# a b c
# 0 0 1 2
# 1 NaN 0 1
# 2 2 NaN 0
# 3 1 2 NaN
# 4 0 1 2
mask = (df['a'].isnull()) | (df['c'] == 0)
df = df.loc[~mask]
yields
a b c
0 0 1 2
3 1 2 NaN
4 0 1 2
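The same filter can also be written without an explicit mask variable; a rough equivalent (keeping rows where a is present and c is nonzero) is:
df = df[df['a'].notnull() & (df['c'] != 0)]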
Let's say your DataFrame looks like this (note that pd.np has been removed from pandas, so use NumPy's nan directly):
In [1]: import numpy as np
In [2]: data = pd.DataFrame({
   ...:     'a': [1, 2, 3, np.nan, 5],
   ...:     'b': [3, 4, np.nan, 5, 6],
   ...:     'c': [0, 1, 2, 3, 4],
   ...: })
In [3]: data
Out[3]:
a b c
0 1 3 0
1 2 4 1
2 3 NaN 2
3 NaN 5 3
4 5 6 4
To delete rows with missing observations, use:
In [5]: data.dropna()
Out[5]:
a b c
0 1 3 0
1 2 4 1
4 5 6 4
To delete rows where only column 'a' has missing observations, use:
In [6]: data.dropna(subset=['a'])
Out[6]:
a b c
0 1 3 0
1 2 4 1
2 3 NaN 2
4 5 6 4
To delete rows that have either missing observations or zeros, use:
In [18]: data[data.all(axis=1)].dropna()
Out[18]:
a b c
1 2 4 1
4 5 6 4
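This works because NaN is truthy, so data.all(axis=1) is False only for rows containing a zero, and dropna then removes the rows with missing values. If that feels too implicit, a sketch of a more explicit equivalent is:
clean = data.dropna()
clean = clean[(clean != 0).all(axis=1)]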
Related
Is there a way to find unique rows, where unique is in the sense of two "identical" columns?
>>> d = pandas.DataFrame([['A',1],['A',2],['A',3],['B',1],['B',4],['B',2]], columns = ['col_a','col_b'])
>>> d
  col_a col_b
0 A 1
1 A 2
2 A 3
3 B 1
4 B 4
5 B 2
>>> d.merge(d, left_on='col_b', right_on='col_b')
  col_a_x col_b col_a_y
0 A 1 A
1 A 1 B
2 B 1 A
3 B 1 B
4 A 2 A
5 A 2 B
6 B 2 A
7 B 2 B
8 A 3 A
9 B 4 B
>>> d_desired
  col_a_x col_b col_a_y
0 A 1 A
1 A 1 B
3 B 1 B
4 A 2 A
5 A 2 B
7 B 2 B
8 A 3 A
9 B 4 B
But I would like to drop the symmetric duplicate entries, e.g. B 1 A and B 2 A.
I later want to group by the two columns, so I need to consistently drop the same side of each "duplicate" pair: if I drop B 1 A, I should also drop B 2 A, and not A 2 B.
Try this and see if it works for you:
M = d.merge(d, left_on='col_b', right_on='col_b')
# flag rows where the left value sorts after the right one;
# keeping only col_a_x <= col_a_y retains exactly one row per symmetric pair
# (the > test already implies the values differ)
cond = M.col_a_x > M.col_a_y
# filter those rows out
M.loc[~cond]
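Since the question mentions grouping by the two columns afterwards, a hypothetical follow-up (variable names assumed) could be:
result = M.loc[~cond]
counts = result.groupby(['col_a_x', 'col_a_y']).size()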
Supposing I have the two DataFrames shown below:
dd = pd.DataFrame([1,0, 3, 0, 5])
0
0 1
1 0
2 3
3 0
4 5
and
df = pd.DataFrame([2,4])
0
0 2
1 4
How can I broadcast the values of df into dd with step = 2 so I end up with
0
0 1
1 2
2 3
3 4
4 5
One solution is positional assignment with iloc:
dd = pd.DataFrame([1, 0, 3, 0, 5])
df = pd.DataFrame([2, 4])
dd.iloc[1::2] = df.values
dd
# Out:
0
0 1
1 2
2 3
3 4
4 5
Alternatively, assign through the underlying NumPy array (this relies on .values returning a writable view, which holds for a single-dtype frame but is not guaranteed in newer pandas, so prefer the iloc form above):
dd.values[1::2] = df.values
dd now contains:
0
0 1
1 2
2 3
3 4
4 5
Note the step=2 slice here: the array[1::2] syntax means start at index 1 and take every second element through the end of the array.
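A quick NumPy illustration of that slice:
import numpy as np
a = np.arange(5)   # array([0, 1, 2, 3, 4])
a[1::2]            # array([1, 3])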
Another option: change df.index with a range so its labels point at the target positions, then fill dd by label:
df.index = range(1, len(dd)+1, 2)[:len(df)]
print (df)
0
1 2
3 4
dd.loc[df.index] = df
print (dd)
0
0 1
1 2
2 3
3 4
4 5
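If this pattern comes up repeatedly, the iloc approach generalizes to a small helper (my own sketch; the name and defaults are hypothetical):
def fill_with_step(target, source, start=1, step=2):
    # return a copy of `target` with every `step`-th row (from `start`)
    # replaced by the rows of `source`; the lengths must line up exactly
    out = target.copy()
    out.iloc[start::step] = source.values
    return out

dd = fill_with_step(dd, df)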
I have a pandas DataFrame
>>> import pandas as pd
>>> df = pd.DataFrame([['a', 2, 3], ['a,b', 5, 6], ['c', 8, 9]])
0 1 2
0 a 2 3
1 a,b 5 6
2 c 8 9
I want to spread the first column into n columns (where n is the number of unique comma-separated values, in this case 3). Each of the resulting columns should be 1 if the value is present, and 0 otherwise. The expected result is:
1 2 a c b
0 2 3 1 0 0
1 5 6 1 0 1
2 8 9 0 1 0
I came up with the following code, but it seems a bit circuitous to me.
>>> import re
>>> dfSpread = pd.get_dummies(df[0].str.split(',', expand=True)).\
...     rename(columns=lambda x: re.sub('.*_', '', x))
>>> pd.concat([df.iloc[:, 1:], dfSpread], axis=1)
Is there a built-in function that does just this, which I wasn't able to find?
Using get_dummies
df.set_index([1,2])[0].str.get_dummies(',').reset_index()
Out[229]:
1 2 a b c
0 2 3 1 0 0
1 5 6 1 1 0
2 8 9 0 0 1
You can use pop + concat here for an alternative version of Wen's answer.
pd.concat([df, df.pop(df.columns[0]).str.get_dummies(sep=',')], axis=1)
1 2 a b c
0 2 3 1 0 0
1 5 6 1 1 0
2 8 9 0 0 1
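Yet another spelling, if you would rather not mutate df with pop, is drop plus join (my variant; same output):
df.drop(columns=df.columns[0]).join(df[df.columns[0]].str.get_dummies(sep=','))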
Why does the function below change the global DataFrame named df? Shouldn't it just change a local df within the function, but not the global df?
import pandas as pd

df = pd.DataFrame()

def adding_var_inside_function(df):
    df['value'] = 0

print(df.columns)  # Index([], dtype='object')
adding_var_inside_function(df)
print(df.columns)  # Index(['value'], dtype='object')
From the docs:
Mutability and copying of data
All pandas data structures are value-mutable (the values they contain can be altered) but not always
size-mutable. The length of a Series cannot be changed, but, for
example, columns can be inserted into a DataFrame. However, the vast
majority of methods produce new objects and leave the input data
untouched. In general, though, we like to favor immutability where
sensible.
Here is another example, showing values (cell's) mutability:
In [21]: df
Out[21]:
a b c
0 3 2 0
1 3 3 1
2 4 0 0
3 2 3 2
4 0 4 4
In [22]: df2 = df
In [23]: df2.loc[0, 'a'] = 100
In [24]: df
Out[24]:
a b c
0 100 2 0
1 3 3 1
2 4 0 0
3 2 3 2
4 0 4 4
df2 is a reference to df
In [28]: id(df) == id(df2)
Out[28]: True
Here is a version of your function that won't mutate the argument DataFrame:
def adding_var_inside_function(df):
    df = df.copy()
    df['value'] = 0
    return df
In [30]: df
Out[30]:
a b c
0 100 2 0
1 3 3 1
2 4 0 0
3 2 3 2
4 0 4 4
In [31]: adding_var_inside_function(df)
Out[31]:
a b c value
0 100 2 0 0
1 3 3 1 0
2 4 0 0 0
3 2 3 2 0
4 0 4 4 0
In [32]: df
Out[32]:
a b c
0 100 2 0
1 3 3 1
2 4 0 0
3 2 3 2
4 0 4 4
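As an aside, assign is the idiomatic non-mutating way to add a column; it always returns a new DataFrame:
df_new = df.assign(value=0)   # df itself is left untouched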
I am working with survey data loaded from an h5-file as hdf = pandas.HDFStore('Survey.h5') through the pandas package. Within this DataFrame, all rows are the results of a single survey, whereas the columns are the answers for all questions within a single survey.
I am aiming to reduce this dataset to a smaller DataFrame including only the rows with a particular answer to a certain question, i.e. all rows with the same value in that column. I am able to determine the index values of all rows matching this condition, but I can't find out how to delete these rows, or how to make a new df containing only these rows.
In [36]: df
Out[36]:
A B C D
a 0 2 6 0
b 6 1 5 2
c 0 2 6 0
d 9 3 2 2
In [37]: rows
Out[37]: ['a', 'c']
In [38]: df.drop(rows)
Out[38]:
A B C D
b 6 1 5 2
d 9 3 2 2
In [39]: df[~((df.A == 0) & (df.B == 2) & (df.C == 6) & (df.D == 0))]
Out[39]:
A B C D
b 6 1 5 2
d 9 3 2 2
In [40]: df.loc[rows]  # .ix has been removed from pandas; use .loc
Out[40]:
A B C D
a 0 2 6 0
c 0 2 6 0
In [41]: df[((df.A == 0) & (df.B == 2) & (df.C == 6) & (df.D == 0))]
Out[41]:
A B C D
a 0 2 6 0
c 0 2 6 0
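To tie the pieces together, the rows list can be derived from the condition itself and passed to drop:
rows = df.index[(df.A == 0) & (df.B == 2) & (df.C == 6) & (df.D == 0)]
df.drop(rows)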
If you already know the index you can use .loc:
In [12]: df = pd.DataFrame({"a": [1,2,3,4,5], "b": [4,5,6,7,8]})
In [13]: df
Out[13]:
a b
0 1 4
1 2 5
2 3 6
3 4 7
4 5 8
In [14]: df.loc[[0,2,4]]
Out[14]:
a b
0 1 4
2 3 6
4 5 8
In [15]: df.loc[1:3]
Out[15]:
a b
1 2 5
2 3 6
3 4 7
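Note that label slicing with .loc is inclusive of both endpoints, unlike positional slicing; compare:
In [16]: df.iloc[1:3]   # positions 1 and 2 only
Out[16]:
   a  b
1  2  5
2  3  6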
If you just need the top rows, you can use df.head(10).
Use query to search for specific conditions:
In [3]: df
Out[3]:
age family name
0 1 A john
1 36 A jason
2 32 A jane
3 26 B jack
4 30 B james
In [4]: df.query('age > 30 & family == "A"')
Out[4]:
age family name
1 36 A jason
2 32 A jane
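query can also reference Python variables with the @ prefix, for example:
In [5]: min_age = 30
In [6]: df.query('age > @min_age and family == "A"')
Out[6]:
   age family name
1   36      A jason
2   32      A jane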