I am working with survey data loaded from an HDF5 file via hdf = pandas.HDFStore('Survey.h5') through the pandas package. Within this DataFrame, each row is the result of a single survey, and the columns are the answers to all questions within that survey.
I am aiming to reduce this dataset to a smaller DataFrame containing only the rows with a certain answer to a certain question, i.e. the same value in that column. I can determine the index values of all rows matching this condition, but I can't figure out how to delete those rows, or how to make a new DataFrame from those rows only.
In [36]: df
Out[36]:
A B C D
a 0 2 6 0
b 6 1 5 2
c 0 2 6 0
d 9 3 2 2
In [37]: rows
Out[37]: ['a', 'c']
In [38]: df.drop(rows)
Out[38]:
A B C D
b 6 1 5 2
d 9 3 2 2
In [39]: df[~((df.A == 0) & (df.B == 2) & (df.C == 6) & (df.D == 0))]
Out[39]:
A B C D
b 6 1 5 2
d 9 3 2 2
In [40]: df.loc[rows]
Out[40]:
A B C D
a 0 2 6 0
c 0 2 6 0
In [41]: df[((df.A == 0) & (df.B == 2) & (df.C == 6) & (df.D == 0))]
Out[41]:
A B C D
a 0 2 6 0
c 0 2 6 0
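Applied to the survey case from the question, a minimal sketch (the store key 'survey', the question column 'Q1', and the answer value 3 are placeholder assumptions, not names from the question):
import pandas as pd

hdf = pd.HDFStore('Survey.h5')   # file name from the question
df = hdf['survey']               # 'survey' is an assumed key in the store
hdf.close()

matching = df[df['Q1'] == 3]     # new frame with only the chosen answer
remaining = df[df['Q1'] != 3]    # or keep everything else instead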
If you already know the index you can use .loc:
In [12]: df = pd.DataFrame({"a": [1,2,3,4,5], "b": [4,5,6,7,8]})
In [13]: df
Out[13]:
a b
0 1 4
1 2 5
2 3 6
3 4 7
4 5 8
In [14]: df.loc[[0,2,4]]
Out[14]:
a b
0 1 4
2 3 6
4 5 8
In [15]: df.loc[1:3]
Out[15]:
a b
1 2 5
2 3 6
3 4 7
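Note that .loc slicing is label-based and includes both endpoints, which is why df.loc[1:3] returns three rows. Positional slicing with .iloc excludes the end position, as in ordinary Python slicing:
In [16]: df.iloc[1:3]
Out[16]:
a b
1 2 5
2 3 6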
If you just need the first rows, you can use df.head(10) (and df.tail(10) for the last ones).
Use query to search for specific conditions:
In [3]: df
Out[3]:
age family name
0 1 A john
1 36 A jason
2 32 A jane
3 26 B jack
4 30 B james
In [4]: df.query('age > 30 & family == "A"')
Out[4]:
age family name
1 36 A jason
2 32 A jane
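query can also reference local Python variables via the @ prefix, which is handy when the threshold is not a literal (a small sketch; min_age is an illustrative name):
In [5]: min_age = 30
In [6]: df.query('age > @min_age and family == "A"')
Out[6]:
age family name
1 36 A jason
2 32 A jane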
Is there a way to find unique rows, where unique is in the sense of two "identical" columns?
>>> d = pandas.DataFrame([['A',1],['A',2],['A',3],['B',1],['B',4],['B',2]], columns = ['col_a','col_b'])
>>> d
  col_a col_b
0 A 1
1 A 2
2 A 3
3 B 1
4 B 4
5 B 2
>>> d.merge(d, left_on='col_b', right_on='col_b')
  col_a_x col_b col_a_y
0 A 1 A
1 A 1 B
2 B 1 A
3 B 1 B
4 A 2 A
5 A 2 B
6 B 2 A
7 B 2 B
8 A 3 A
9 B 4 B
>>> d_desired
  col_a_x col_b col_a_y
0 A 1 A
1 A 1 B
3 B 1 B
4 A 2 A
5 A 2 B
7 B 2 B
8 A 3 A
9 B 4 B
But I would like to drop the duplicate entries, e.g. B 1 A and B 2 A.
I will later want to group by the two columns, so I need to consistently drop the same "duplicate" of each pair: if I drop B 1 A I should also drop B 2 A, not A 2 B.
Try this and see if it works for you:
M = d.merge(d, left_on='col_b', right_on='col_b')
# of each symmetric pair, flag the row whose left label sorts
# after its right label (e.g. B 1 A, but not A 1 B)
cond = M.col_a_x > M.col_a_y
# filter those rows out
M.loc[~cond]
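Since the question mentions grouping by the two columns afterwards, a possible continuation (a sketch; size() is just an illustrative aggregation):
dedup = M.loc[~cond]
counts = dedup.groupby(['col_a_x', 'col_a_y']).size()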
I have this table
A B C E
1 2 1 3
1 2 4 4
2 7 1 1
3 4 0 2
3 4 8 3
Now, I want to remove duplicates based on columns A and B and at the same time sum up column C. For E, it should take the value from the row where C has its maximum. The desired result table should look like this:
A B C E
1 2 5 4
2 7 1 1
3 4 8 3
I tried this: df.groupby(['A', 'B']).sum()['C'], but my data frame does not change at all; I also don't think I've incorporated the E column part properly. Can somebody advise?
Thanks so much!
If rows count as duplicates whenever their first and second columns (A and B) match, we can group by those columns.
In [20]: df
Out[20]:
A B C E
0 1 1 5 4
1 1 1 1 1
2 3 3 8 3
In [21]: df.groupby(['A', 'B'])['C'].sum()
Out[21]:
A B
1 1 6
3 3 8
Name: C, dtype: int64
I tried this: df.groupby(['A', 'B']).sum()['C'] but my data frame does not change at all
Yes, that's because pandas doesn't overwrite the initial DataFrame; groupby returns a new object.
In [22]: df
Out[22]:
A B C E
0 1 1 5 4
1 1 1 1 1
2 3 3 8 3
You have to overwrite it explicitly.
In [23]: df = df.groupby(['A', 'B'])['C'].sum()
In [24]: df
Out[24]:
A B
1 1 6
3 3 8
Name: C, dtype: int64
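This keeps only C, though. To also carry E from the row where C is largest, one option is to sort by C first and take the last E per group (a sketch using the question's column names; after the sort, 'last' coincides with the max-C row):
out = (df.sort_values('C')
         .groupby(['A', 'B'], as_index=False)
         .agg({'C': 'sum', 'E': 'last'}))
On the question's data this produces the desired three rows, with C summed and E taken from each group's max-C row.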
When using the drop method of a pandas.DataFrame, it accepts lists of column names, but not tuples, despite the documentation saying that "list-like" arguments are acceptable. Am I reading the documentation incorrectly? I would expect my MWE to work.
MWE
import pandas as pd
df = pd.DataFrame({k: range(5) for k in list('abcd')})
df.drop(['a', 'c'], axis=1) # Works
df.drop(('a', 'c'), axis=1) # Errors
Versions: Python 2.7.12, pandas 0.20.3.
The problem is that pandas uses tuples to select from a MultiIndex:
import numpy as np
import pandas as pd

np.random.seed(345)
mux = pd.MultiIndex.from_arrays([list('abcde'), list('cdefg')])
df = pd.DataFrame(np.random.randint(10, size=(4,5)), columns=mux)
print(df)
a b c d e
c d e f g
0 8 0 3 9 8
1 4 3 4 1 7
2 4 0 9 6 3
3 8 0 3 1 5
df = df.drop(('a', 'c'), axis=1)
print(df)
b c d e
d e f g
0 0 3 9 8
1 3 4 1 7
2 0 9 6 3
3 0 3 1 5
Selecting with the same tuple (on the original DataFrame, before the drop) returns a single column:
print(df[('a', 'c')])
0 8
1 4
2 4
3 8
Name: (a, c), dtype: int32
pandas treats tuples as multi-index labels, so pass a list instead (note that list(('a', 'c')) is just a verbose spelling of ['a', 'c']):
In [330]: df.drop(list(('a', 'c')), axis=1)
Out[330]:
b d
0 0 0
1 1 1
2 2 2
3 3 3
4 4 4
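To drop several columns of a MultiIndex-columned frame by their full labels, pass a list of tuples (a sketch against the MultiIndex example above):
df.drop([('a', 'c'), ('b', 'd')], axis=1)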
Here is an example of deleting rows (axis=0, the default) in a multi-index DataFrame:
In [342]: x = df.set_index(np.arange(len(df), 0, -1), append=True)
In [343]: x
Out[343]:
a b c d
0 5 0 0 0 0
1 4 1 1 1 1
2 3 2 2 2 2
3 2 3 3 3 3
4 1 4 4 4 4
In [344]: x.drop((0,5))
Out[344]:
a b c d
1 4 1 1 1 1
2 3 2 2 2 2
3 2 3 3 3 3
4 1 4 4 4 4
In [345]: x.drop([(0,5), (4,1)])
Out[345]:
a b c d
1 4 1 1 1 1
2 3 2 2 2 2
3 2 3 3 3 3
So when you specify a tuple, pandas treats it as a multi-index label.
I used this to delete a column whose name is a tuple:
del df3[('val1', 'val2')]
and it got deleted.
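On pandas 0.21+, the columns keyword does the same without the axis argument; wrapping the tuple in a list keeps it from being read as a multi-index path (a sketch, reusing df3 from above):
df3 = df3.drop(columns=[('val1', 'val2')])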
How do I delete observations from a DataFrame in Python? For example, I have a DataFrame with variables a, b, and c, and I want to delete an observation if variable a is missing or variable c is equal to zero.
You could build a boolean mask using isnull:
mask = (df['a'].isnull()) | (df['c'] == 0)
and then select the desired rows with:
df = df.loc[~mask]
~mask is the boolean inverse of mask, so df.loc[~mask] selects rows where a is not null and c is not 0.
For example,
import numpy as np
import pandas as pd
arr = np.arange(15, dtype='float').reshape(5,3) % 4
arr[arr > 2] = np.nan
df = pd.DataFrame(arr, columns=list('abc'))
# a b c
# 0 0 1 2
# 1 NaN 0 1
# 2 2 NaN 0
# 3 1 2 NaN
# 4 0 1 2
mask = (df['a'].isnull()) | (df['c'] == 0)
df = df.loc[~mask]
yields
a b c
0 0 1 2
3 1 2 NaN
4 0 1 2
Let's say your DataFrame looks like this (with numpy imported as np):
In [2]: data = pd.DataFrame({
   ...:     'a': [1, 2, 3, np.nan, 5],
   ...:     'b': [3, 4, np.nan, 5, 6],
   ...:     'c': [0, 1, 2, 3, 4],
   ...: })
In [3]: data
Out[3]:
a b c
0 1 3 0
1 2 4 1
2 3 NaN 2
3 NaN 5 3
4 5 6 4
To delete rows with missing observations, use:
In [5]: data.dropna()
Out[5]:
a b c
0 1 3 0
1 2 4 1
4 5 6 4
To delete rows where only column 'a' has missing observations, use:
In [6]: data.dropna(subset=['a'])
Out[6]:
a b c
0 1 3 0
1 2 4 1
2 3 NaN 2
4 5 6 4
To delete rows that have either missing observations or zeros, use:
In [18]: data[data.all(axis=1)].dropna()
Out[18]:
a b c
1 2 4 1
4 5 6 4
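An equivalent explicit formulation builds the mask directly, which can read more clearly (any NaN or any zero in a row disqualifies it):
mask = data.isnull().any(axis=1) | (data == 0).any(axis=1)
data.loc[~mask]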
In pandas, how can I add a new column which enumerates rows based on a given grouping?
For instance, assume the following DataFrame:
import pandas as pd
import numpy as np
a_list = ['A', 'B', 'C', 'A', 'A', 'C', 'B', 'B', 'A', 'C']
df = pd.DataFrame({'col_a': a_list, 'col_b': range(10)})
df
col_a col_b
0 A 0
1 B 1
2 C 2
3 A 3
4 A 4
5 C 5
6 B 6
7 B 7
8 A 8
9 C 9
I'd like to add a col_c that gives me the Nth row of the "group" based on a grouping of col_a and sorting of col_b.
Desired output:
col_a col_b col_c
0 A 0 1
3 A 3 2
4 A 4 3
8 A 8 4
1 B 1 1
6 B 6 2
7 B 7 3
2 C 2 1
5 C 5 2
9 C 9 3
I'm struggling to get to col_c. You can get the proper grouping and sorting with .sort_values(['col_a', 'col_b']); it's now a matter of producing that new column and labeling each row within its group.
There's cumcount, for precisely this case:
df = df.sort_values(['col_a', 'col_b'])
df['col_c'] = df.groupby('col_a').cumcount() + 1
As it says in the docs, cumcount numbers each item in each group from 0 to the length of that group minus 1; the + 1 shifts it to match the 1-based desired output.
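A closely related spelling uses rank, which yields 1-based numbers directly (a sketch, equivalent here because col_b has no ties within a group):
df['col_c'] = df.groupby('col_a')['col_b'].rank(method='first').astype(int)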
Original answer (written before cumcount was added to pandas).
You could create a helper function to do this:
def add_col_c(x):
x['col_c'] = np.arange(len(x))
return x
First sort by column col_a:
In [11]: df.sort_values('col_a', inplace=True)
then apply this function across each group:
In [12]: g = df.groupby('col_a', as_index=False)
In [13]: g.apply(add_col_c)
Out[13]:
col_a col_b col_c
3 A 3 0
8 A 8 1
0 A 0 2
4 A 4 3
6 B 6 0
1 B 1 1
7 B 7 2
9 C 9 0
2 C 2 1
5 C 5 2
In order to get 1, 2, ... you could use np.arange(1, len(x) + 1).
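The adjusted helper would then look like this (a sketch of the change just described):
def add_col_c(x):
    x['col_c'] = np.arange(1, len(x) + 1)
    return x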
The given answers both involve calling a Python function for each group; if you have many groups, a vectorized approach should be faster (I haven't checked).
Here is my pure numpy suggestion:
In [5]: df.sort_values(['col_a', 'col_b'], inplace=True, ascending=[False, False])
In [6]: sizes = df.groupby('col_a', sort=False).size().values
In [7]: df['col_c'] = np.arange(sizes.sum()) - np.repeat(sizes.cumsum() - sizes, sizes)
In [8]: print(df)
col_a col_b col_c
9 C 9 0
5 C 5 1
2 C 2 2
7 B 7 0
6 B 6 1
1 B 1 2
8 A 8 0
4 A 4 1
3 A 3 2
0 A 0 3
You could define your own function to deal with that:
In [58]: def func(x):
....: x['col_c'] = x['col_a'].argsort() + 1
....: return x
....:
In [59]: df.groupby('col_a').apply(func)
Out[59]:
col_a col_b col_c
0 A 0 1
3 A 3 2
4 A 4 3
8 A 8 4
1 B 1 1
6 B 6 2
7 B 7 3
2 C 2 1
5 C 5 2
9 C 9 3