Multiple filters on a Python DataFrame

I'm pretty new to Python. I'm trying to filter rows in a DataFrame the way I would in R.
sub_df = df[df[main_id] == 3]
works, but
df[df[main_id] in [3, 7]]
gives me the error
"The truth value of a Series is ambiguous"
Can you please suggest the correct syntax for selections like this?

You can use the pandas isin function. This would look like:
import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'f']})
df[df['A'].isin([2, 3])]
giving:
A B
1 2 b
2 3 f
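
If you need to combine the membership test with other conditions, chain boolean masks with & or | and wrap each condition in parentheses (Python's and/or would raise the same "truth value of a Series is ambiguous" error). A minimal sketch reusing the toy frame above:
# rows where A is 2 or 3 and B equals 'f'
df[df['A'].isin([2, 3]) & (df['B'] == 'f')]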

Another option is apply with a membership test:
df[df[main_id].apply(lambda x: x in [3, 7])]

Yet another solution, using DataFrame.query:
In [60]: df = pd.DataFrame({'main_id': [0,1, 2, 3], 'x': list('ABCD')})
In [61]: df
Out[61]:
main_id x
0 0 A
1 1 B
2 2 C
3 3 D
In [62]: df.query("main_id in [0,3]")
Out[62]:
main_id x
0 0 A
3 3 D
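
As a small follow-up (not in the original answer): if the allowed values live in a Python variable, query can reference it with the @ prefix. A sketch with a hypothetical variable name ids:
ids = [0, 3]
df.query("main_id in @ids")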


Hierarchical Columns in Numpy

I'm new to Pandas and trying to recreate the following dataframe, such that the values in columns A and B are random numbers 0 through 8. However, I keep getting "ValueError: all arrays must be same length". Can someone please review my code? Thank you!
df = pd.DataFrame(np.random.randint(0, high=9),
                  index=[[1, 2, 3], ['a', 'b']],
                  columns=['A', 'B'])
Since there are two layers to the index, you have to create a MultiIndex:
df = pd.DataFrame(
np.random.randint(9, size=(6, 2)),
index=pd.MultiIndex.from_product([[1, 2, 3], ['a', 'b']]),
columns=['A', 'B']
)
output:
A B
1 a 1 0
b 4 3
2 a 7 3
b 1 6
3 a 5 4
b 3 3
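
Once the MultiIndex is in place, you can select by either level with loc; a quick sketch on the frame built above (your numbers will differ because the values are random):
df.loc[1]              # all rows under outer label 1
df.loc[(2, 'b'), 'A']  # the single value at index (2, 'b') in column A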

DataFrame fastest way to update rows without a loop

Creating a scenario:
Assume a DataFrame with two columns, where A is the input and B is the result of A * 2 for each row:
df = pd.DataFrame({'A': [1, 2, 3],
'B': [2, 4, 6]})
Let's say I am receiving a 100k-row dataframe and searching for errors in it (here the row where B is 0 is invalid):
df = pd.DataFrame({'A': [1, 2, 3],
'B': [2, 0, 6]})
I search for the invalid rows using
invalid_rows = df.loc[df['A']*2 != df['B']]
I have the invalid_rows now, but I am not sure what would be the fastest way to overwrite the invalid rows in the original df with the result of A[index]*2?
Iterating over the df using iterrows() is an option but slow if the df grows. Can I use df.update() for this somehow?
Working solution with a loop:
for row_index, my_series in df.iterrows():
    if my_series['A'] * 2 != my_series['B']:
        df.loc[row_index, 'B'] = my_series['A'] * 2
But is there a faster way to do this?
Using mul, ne and loc:
m = df['A'].mul(2).ne(df['B'])
# same as: m = df['A'] * 2 != df['B']
df.loc[m, 'B'] = df['A'].mul(2)
A B
0 1 2
1 2 4
2 3 6
m is a boolean Series that marks the rows where A * 2 != B:
print(m)
0 False
1 True
2 False
dtype: bool
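
An equivalent vectorized alternative, if you prefer numpy over loc assignment, is to rebuild B with np.where; a sketch of the same logic:
import numpy as np  # assumed import
df['B'] = np.where(df['A'] * 2 != df['B'], df['A'] * 2, df['B'])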

Get pandas groupby object to ignore missing dataframes

I'm using pandas to read an Excel file and convert the spreadsheet to a dataframe. Then I apply groupby and store the individual groups in variables using get_group for later computation.
My issue is that the input file isn't always the same size; sometimes the groupby will result in 10 dfs, sometimes 25, etc. How do I get my program to ignore a group that is missing from the initial data?
df = pd.read_excel(filepath, 0, skiprows=3, parse_cols='A,B,C,E,F,G',
names=['Result', 'Trial', 'Well', 'Distance', 'Speed', 'Time'])
df = df.replace({'-': 0}, regex=True) #replaces '-' values with 0
df = df['Trial'].unique()
gb = df.groupby('Trial') #groups by column Trial
trial_1 = gb.get_group('Trial 1')
trial_2 = gb.get_group('Trial 2')
trial_3 = gb.get_group('Trial 3')
trial_4 = gb.get_group('Trial 4')
trial_5 = gb.get_group('Trial 5')
Say my initial data only has 3 trials, how would I get it to ignore trials 4 and 5 later? My code runs when all trials are present but fails when some are missing :( It sounds very much like an if statement would be needed, but my tired brain has no idea where...
Thanks in advance!
After grouping you can get the groups using the .groups attribute; this returns a dict of the group names, so you can iterate over the dict keys dynamically and don't need to hard-code the number of groups:
In [22]:
df = pd.DataFrame({'grp':list('aabbbc'), 'val':np.arange(6)})
df
Out[22]:
grp val
0 a 0
1 a 1
2 b 2
3 b 3
4 b 4
5 c 5
In [23]:
gp = df.groupby('grp')
gp.groups
Out[23]:
{'a': Int64Index([0, 1], dtype='int64'),
'b': Int64Index([2, 3, 4], dtype='int64'),
'c': Int64Index([5], dtype='int64')}
In [25]:
for g in gp.groups.keys():
print(gp.get_group(g))
grp val
0 a 0
1 a 1
grp val
2 b 2
3 b 3
4 b 4
grp val
5 c 5
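
Applied to the trials in the question, you could either test whether a group exists before calling get_group, or collect whatever groups are present into a dict; a sketch assuming the grouped object is called gb as in the question:
trials = {name: gb.get_group(name) for name in gb.groups}  # only the trials present in the file
if 'Trial 4' in gb.groups:
    trial_4 = gb.get_group('Trial 4')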

Map function across multi-column DataFrame

Given a DataFrame like the following
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({'x': [1, 2, 3, 4], 'y': [4, 3, 2, 1]})
I would like to map a row-wise function across its columns
In [3]: df.map(lambda (x, y): x + y)
and get something like the following
0 5
1 5
2 5
3 5
Name: None, dtype: int64
Is this possible?
You can apply a function row-wise by setting axis=1
df.apply(lambda row: row.x + row.y, axis=1)
Out[145]:
0 5
1 5
2 5
3 5
dtype: int64
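
Note that for a simple sum like this, plain column arithmetic (or sum along axis=1) is vectorized and usually much faster than apply:
df['x'] + df['y']
# or
df.sum(axis=1)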

pandas DataFrame set value on boolean mask

I'm trying to set a number of different values in a pandas DataFrame all to the same value. I thought I understood boolean indexing for pandas, but I haven't found any resources on this specific error.
import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'f']})
mask = df.isin([1, 3, 12, 'a'])
df[mask] = 30
Traceback (most recent call last):
...
TypeError: Cannot do inplace boolean setting on mixed-types with a non np.nan value
Above, I want to replace all of the True entries in the mask with the value 30.
I could do df.replace instead, but masking feels a bit more efficient and intuitive here. Can someone explain the error, and provide an efficient way to set all of the values?
Unfortunately you can't use a boolean mask on mixed dtypes for this; you can use pandas where to set the values instead:
In [59]:
df = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'f']})
mask = df.isin([1, 3, 12, 'a'])
df = df.where(mask, other=30)
df
Out[59]:
A B
0 1 a
1 30 30
2 3 30
Note that the above will fail if you pass inplace=True to the where method, so df.where(mask, other=30, inplace=True) will raise:
TypeError: Cannot do inplace boolean setting on mixed-types with a non np.nan value
EDIT: OK, after a little misunderstanding, you can still use where by just inverting the mask:
In [2]:
df = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'f']})
mask = df.isin([1, 3, 12, 'a'])
df.where(~mask, other=30)
Out[2]:
A B
0 30 30
1 2 b
2 30 f
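
As a side note (not part of the original answer), DataFrame.mask is the complement of where, so the inverted-mask version can also be written without the ~ and gives the same result as Out[2]:
df.mask(mask, other=30)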
If you want to use different columns to create your mask, you need to call the values property of the dataframe.
Example
Let's say we want to replace values in A_1 and A_2 according to a mask built from B_1 and B_2; for example, replace those values in A (with 999) that correspond to nulls in B.
The original dataframe:
A_1 A_2 B_1 B_2
0 1 4 y n
1 2 5 n NaN
2 3 6 NaN NaN
The desired dataframe
A_1 A_2 B_1 B_2
0 1 4 y n
1 2 999 n NaN
2 999 999 NaN NaN
The code:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A_1': [1, 2, 3],
    'A_2': [4, 5, 6],
    'B_1': ['y', 'n', np.nan],
    'B_2': ['n', np.nan, np.nan]})
_mask = df[['B_1', 'B_2']].notnull().values
df[['A_1', 'A_2']] = df[['A_1', 'A_2']].where(_mask, other=999)
A_1 A_2
0 1 4
1 2 999
2 999 999
I'm not 100% sure, but I suspect the error message relates to the fact that missing data is not treated identically across dtypes. Only float has NaN, but integers can be automatically converted to floats, so it's not a problem there. But it appears that mixing numeric dtypes and object dtypes does not work so easily...
Regardless of that, you could get around it pretty easily with np.where:
df[:] = np.where(mask, 30, df)
A B
0 30 30
1 2 b
2 30 f
pandas uses NaN to mark invalid or missing data, and NaN can be used across types. Since your DataFrame has mixed int and string dtypes, it will not accept the assignment of a single value (other than NaN), as this would create a mixed type (int and str) in B through an in-place assignment.
@JohnE's method using np.where creates a new DataFrame in which the type of column B is object rather than string as in the initial example.
