I am working with survey data loaded from an HDF5 file via hdf = pandas.HDFStore('Survey.h5') through the pandas package. Within this DataFrame, each row is the result of a single survey, and the columns are the answers to all questions within that survey.
I am aiming to reduce this dataset to a smaller DataFrame containing only the rows with a certain answer to a certain question, i.e. the same value in that column. I can determine the index values of all rows matching this condition, but I can't figure out how to delete those rows, or how to make a new DataFrame from those rows only.
In [36]: df
Out[36]:
A B C D
a 0 2 6 0
b 6 1 5 2
c 0 2 6 0
d 9 3 2 2
In [37]: rows
Out[37]: ['a', 'c']
In [38]: df.drop(rows)
Out[38]:
A B C D
b 6 1 5 2
d 9 3 2 2
In [39]: df[~((df.A == 0) & (df.B == 2) & (df.C == 6) & (df.D == 0))]
Out[39]:
A B C D
b 6 1 5 2
d 9 3 2 2
In [40]: df.loc[rows]
Out[40]:
A B C D
a 0 2 6 0
c 0 2 6 0
In [41]: df[((df.A == 0) & (df.B == 2) & (df.C == 6) & (df.D == 0))]
Out[41]:
A B C D
a 0 2 6 0
c 0 2 6 0
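Applied to the survey case from the question, a minimal sketch (the store key 'survey', the question column 'Q1', and the answer value 3 are placeholder assumptions, not names from the question):
import pandas as pd

hdf = pd.HDFStore('Survey.h5')   # file name from the question
df = hdf['survey']               # 'survey' is an assumed key in the store
hdf.close()

matching = df[df['Q1'] == 3]     # new frame with only the chosen answer
remaining = df[df['Q1'] != 3]    # or keep everything else instead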
If you already know the index you can use .loc:
In [12]: df = pd.DataFrame({"a": [1,2,3,4,5], "b": [4,5,6,7,8]})
In [13]: df
Out[13]:
a b
0 1 4
1 2 5
2 3 6
3 4 7
4 5 8
In [14]: df.loc[[0,2,4]]
Out[14]:
a b
0 1 4
2 3 6
4 5 8
In [15]: df.loc[1:3]
Out[15]:
a b
1 2 5
2 3 6
3 4 7
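Note that .loc slicing is label-based and includes both endpoints, which is why df.loc[1:3] returns three rows. Positional slicing with .iloc excludes the end position, as in ordinary Python slicing:
In [16]: df.iloc[1:3]
Out[16]:
a b
1 2 5
2 3 6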
If you just need the first rows, you can use df.head(10) (and df.tail(10) for the last ones).
Use query to search for specific conditions:
In [3]: df
Out[3]:
age family name
0 1 A john
1 36 A jason
2 32 A jane
3 26 B jack
4 30 B james
In [4]: df.query('age > 30 & family == "A"')
Out[4]:
age family name
1 36 A jason
2 32 A jane
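query can also reference local Python variables via the @ prefix, which is handy when the threshold is not a literal (a small sketch; min_age is an illustrative name):
In [5]: min_age = 30
In [6]: df.query('age > @min_age and family == "A"')
Out[6]:
age family name
1 36 A jason
2 32 A jane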
Is there a way to find unique rows, where unique is in the sense of two "identical" columns?
>>> d = pandas.DataFrame([['A',1],['A',2],['A',3],['B',1],['B',4],['B',2]], columns = ['col_a','col_b'])
>>> d
  col_a col_b
0 A 1
1 A 2
2 A 3
3 B 1
4 B 4
5 B 2
>>> d.merge(d, left_on='col_b', right_on='col_b')
  col_a_x col_b col_a_y
0 A 1 A
1 A 1 B
2 B 1 A
3 B 1 B
4 A 2 A
5 A 2 B
6 B 2 A
7 B 2 B
8 A 3 A
9 B 4 B
>>> d_desired
  col_a_x col_b col_a_y
0 A 1 A
1 A 1 B
3 B 1 B
4 A 2 A
5 A 2 B
7 B 2 B
8 A 3 A
9 B 4 B
But I would like to drop the duplicate entries, e.g. B 1 A and B 2 A.
I will later want to group by the two columns, so I need to consistently drop the same "duplicate" of each pair: if I drop B 1 A I should also drop B 2 A, not A 2 B.
Try this and see if it works for you:
M = d.merge(d, left_on='col_b', right_on='col_b')
# of each symmetric pair, flag the row whose left label sorts
# after its right label (e.g. B 1 A, but not A 1 B)
cond = M.col_a_x > M.col_a_y
# filter those rows out
M.loc[~cond]
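Since the question mentions grouping by the two columns afterwards, a possible continuation (a sketch; size() is just an illustrative aggregation):
dedup = M.loc[~cond]
counts = dedup.groupby(['col_a_x', 'col_a_y']).size()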
I have this table
A B C E
1 2 1 3
1 2 4 4
2 7 1 1
3 4 0 2
3 4 8 3
Now, I want to remove duplicates based on columns A and B and at the same time sum up column C. For E, it should take the value from the row where C has its maximum. The desired result table should look like this:
A B C E
1 2 5 4
2 7 1 1
3 4 8 3
I tried this: df.groupby(['A', 'B']).sum()['C'], but my data frame does not change at all; I also don't think I've incorporated the E column part properly. Can somebody advise?
Thanks so much!
If rows count as duplicates whenever their first and second columns (A and B) match, we can group by those columns.
In [20]: df
Out[20]:
A B C E
0 1 1 5 4
1 1 1 1 1
2 3 3 8 3
In [21]: df.groupby(['A', 'B'])['C'].sum()
Out[21]:
A B
1 1 6
3 3 8
Name: C, dtype: int64
I tried this: df.groupby(['A', 'B']).sum()['C'] but my data frame does not change at all
Yes, that's because pandas doesn't overwrite the initial DataFrame; groupby returns a new object.
In [22]: df
Out[22]:
A B C E
0 1 1 5 4
1 1 1 1 1
2 3 3 8 3
You have to overwrite it explicitly.
In [23]: df = df.groupby(['A', 'B'])['C'].sum()
In [24]: df
Out[24]:
A B
1 1 6
3 3 8
Name: C, dtype: int64
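This keeps only C, though. To also carry E from the row where C is largest, one option is to sort by C first and take the last E per group (a sketch using the question's column names; after the sort, 'last' coincides with the max-C row):
out = (df.sort_values('C')
         .groupby(['A', 'B'], as_index=False)
         .agg({'C': 'sum', 'E': 'last'}))
On the question's data this produces the desired three rows, with C summed and E taken from each group's max-C row.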
When using the drop method of a pandas.DataFrame, it accepts lists of column names, but not tuples, despite the documentation saying that "list-like" arguments are acceptable. Am I reading the documentation incorrectly? I would expect my MWE to work.
MWE
import pandas as pd
df = pd.DataFrame({k: range(5) for k in list('abcd')})
df.drop(['a', 'c'], axis=1) # Works
df.drop(('a', 'c'), axis=1) # Errors
Versions: Python 2.7.12, pandas 0.20.3.
The problem is that pandas uses tuples to select from a MultiIndex:
import numpy as np
import pandas as pd

np.random.seed(345)
mux = pd.MultiIndex.from_arrays([list('abcde'), list('cdefg')])
df = pd.DataFrame(np.random.randint(10, size=(4,5)), columns=mux)
print(df)
a b c d e
c d e f g
0 8 0 3 9 8
1 4 3 4 1 7
2 4 0 9 6 3
3 8 0 3 1 5
df = df.drop(('a', 'c'), axis=1)
print(df)
b c d e
d e f g
0 0 3 9 8
1 3 4 1 7
2 0 9 6 3
3 0 3 1 5
Selecting with the same tuple (on the original DataFrame, before the drop) returns a single column:
print(df[('a', 'c')])
0 8
1 4
2 4
3 8
Name: (a, c), dtype: int32
pandas treats tuples as multi-index labels, so pass a list instead (note that list(('a', 'c')) is just a verbose spelling of ['a', 'c']):
In [330]: df.drop(list(('a', 'c')), axis=1)
Out[330]:
b d
0 0 0
1 1 1
2 2 2
3 3 3
4 4 4
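To drop several columns of a MultiIndex-columned frame by their full labels, pass a list of tuples (a sketch against the MultiIndex example above):
df.drop([('a', 'c'), ('b', 'd')], axis=1)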
Here is an example of deleting rows (axis=0, the default) in a multi-index DataFrame:
In [342]: x = df.set_index(np.arange(len(df), 0, -1), append=True)
In [343]: x
Out[343]:
a b c d
0 5 0 0 0 0
1 4 1 1 1 1
2 3 2 2 2 2
3 2 3 3 3 3
4 1 4 4 4 4
In [344]: x.drop((0,5))
Out[344]:
a b c d
1 4 1 1 1 1
2 3 2 2 2 2
3 2 3 3 3 3
4 1 4 4 4 4
In [345]: x.drop([(0,5), (4,1)])
Out[345]:
a b c d
1 4 1 1 1 1
2 3 2 2 2 2
3 2 3 3 3 3
So when you specify a tuple, pandas treats it as a multi-index label.
I used this to delete a column whose name is a tuple:
del df3[('val1', 'val2')]
and it got deleted.
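On pandas 0.21+, the columns keyword does the same without the axis argument; wrapping the tuple in a list keeps it from being read as a multi-index path (a sketch, reusing df3 from above):
df3 = df3.drop(columns=[('val1', 'val2')])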
How do I delete observations from a DataFrame in Python? For example, I have a DataFrame with variables a, b, and c, and I want to delete an observation if variable a is missing or variable c is equal to zero.
You could build a boolean mask using isnull:
mask = (df['a'].isnull()) | (df['c'] == 0)
and then select the desired rows with:
df = df.loc[~mask]
~mask is the boolean inverse of mask, so df.loc[~mask] selects rows where a is not null and c is not 0.
For example,
import numpy as np
import pandas as pd
arr = np.arange(15, dtype='float').reshape(5,3) % 4
arr[arr > 2] = np.nan
df = pd.DataFrame(arr, columns=list('abc'))
# a b c
# 0 0 1 2
# 1 NaN 0 1
# 2 2 NaN 0
# 3 1 2 NaN
# 4 0 1 2
mask = (df['a'].isnull()) | (df['c'] == 0)
df = df.loc[~mask]
yields
a b c
0 0 1 2
3 1 2 NaN
4 0 1 2
Let's say your DataFrame looks like this (with numpy imported as np):
In [2]: data = pd.DataFrame({
   ...:     'a': [1, 2, 3, np.nan, 5],
   ...:     'b': [3, 4, np.nan, 5, 6],
   ...:     'c': [0, 1, 2, 3, 4],
   ...: })
In [3]: data
Out[3]:
a b c
0 1 3 0
1 2 4 1
2 3 NaN 2
3 NaN 5 3
4 5 6 4
To delete rows with missing observations, use:
In [5]: data.dropna()
Out[5]:
a b c
0 1 3 0
1 2 4 1
4 5 6 4
To delete rows where only column 'a' has missing observations, use:
In [6]: data.dropna(subset=['a'])
Out[6]:
a b c
0 1 3 0
1 2 4 1
2 3 NaN 2
4 5 6 4
To delete rows that have either missing observations or zeros, use:
In [18]: data[data.all(axis=1)].dropna()
Out[18]:
a b c
1 2 4 1
4 5 6 4
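An equivalent explicit formulation builds the mask directly, which can read more clearly (any NaN or any zero in a row disqualifies it):
mask = data.isnull().any(axis=1) | (data == 0).any(axis=1)
data.loc[~mask]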
In pandas, how can I add a new column which enumerates rows based on a given grouping?
For instance, assume the following DataFrame:
import pandas as pd
import numpy as np
a_list = ['A', 'B', 'C', 'A', 'A', 'C', 'B', 'B', 'A', 'C']
df = pd.DataFrame({'col_a': a_list, 'col_b': range(10)})
df
col_a col_b
0 A 0
1 B 1
2 C 2
3 A 3
4 A 4
5 C 5
6 B 6
7 B 7
8 A 8
9 C 9
I'd like to add a col_c that gives me the Nth row of the "group" based on a grouping of col_a and sorting of col_b.
Desired output:
col_a col_b col_c
0 A 0 1
3 A 3 2
4 A 4 3
8 A 8 4
1 B 1 1
6 B 6 2
7 B 7 3
2 C 2 1
5 C 5 2
9 C 9 3
I'm struggling to get to col_c. You can get the proper grouping and sorting with .sort_values(['col_a', 'col_b']); it's now a matter of producing that new column and labeling each row within its group.
There's cumcount, for precisely this case:
df = df.sort_values(['col_a', 'col_b'])
df['col_c'] = df.groupby('col_a').cumcount() + 1
As it says in the docs, cumcount numbers each item in each group from 0 to the length of that group minus 1; the + 1 shifts it to match the 1-based desired output.
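A closely related spelling uses rank, which yields 1-based numbers directly (a sketch, equivalent here because col_b has no ties within a group):
df['col_c'] = df.groupby('col_a')['col_b'].rank(method='first').astype(int)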
Original answer (written before cumcount was added to pandas).
You could create a helper function to do this:
def add_col_c(x):
x['col_c'] = np.arange(len(x))
return x
First sort by column col_a:
In [11]: df.sort_values('col_a', inplace=True)
then apply this function across each group:
In [12]: g = df.groupby('col_a', as_index=False)
In [13]: g.apply(add_col_c)
Out[13]:
col_a col_b col_c
3 A 3 0
8 A 8 1
0 A 0 2
4 A 4 3
6 B 6 0
1 B 1 1
7 B 7 2
9 C 9 0
2 C 2 1
5 C 5 2
In order to get 1, 2, ... you could use np.arange(1, len(x) + 1).
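The adjusted helper would then look like this (a sketch of the change just described):
def add_col_c(x):
    x['col_c'] = np.arange(1, len(x) + 1)
    return x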
The given answers both involve calling a Python function for each group; if you have many groups, a vectorized approach should be faster (I haven't checked).
Here is my pure numpy suggestion:
In [5]: df.sort_values(['col_a', 'col_b'], inplace=True, ascending=[False, False])
In [6]: sizes = df.groupby('col_a', sort=False).size().values
In [7]: df['col_c'] = np.arange(sizes.sum()) - np.repeat(sizes.cumsum() - sizes, sizes)
In [8]: print(df)
col_a col_b col_c
9 C 9 0
5 C 5 1
2 C 2 2
7 B 7 0
6 B 6 1
1 B 1 2
8 A 8 0
4 A 4 1
3 A 3 2
0 A 0 3
You could define your own function to deal with that:
In [58]: def func(x):
....: x['col_c'] = x['col_a'].argsort() + 1
....: return x
....:
In [59]: df.groupby('col_a').apply(func)
Out[59]:
col_a col_b col_c
0 A 0 1
3 A 3 2
4 A 4 3
8 A 8 4
1 B 1 1
6 B 6 2
7 B 7 3
2 C 2 1
5 C 5 2
9 C 9 3