Subset rows in df depending on conditions

Hello, I have a df and I wonder how I can subset the rows where:
COL1 contains the string "ok"
COL2 > 4
COL3 < 4
Here is an example:
COL1 COL2 COL3
AB_ok_7 5 2
AB_ok_4 2 5
AB_uy_2 5 2
AB_ok_2 2 2
U_ok_7 12 3
It should display only:
COL1 COL2 COL3
AB_ok_7 5 2
U_ok_7 12 3

Like this:
In [2288]: df[df['COL1'].str.contains('ok') & df['COL2'].gt(4) & df['COL3'].lt(4)]
Out[2288]:
COL1 COL2 COL3
0 AB_ok_7 5 2
4 U_ok_7 12 3

You can use boolean indexing and chain all the conditions:
m = df['COL1'].str.contains('ok')
m1 = df['COL2'].gt(4)
m2 = df['COL3'].lt(4)
df[m & m1 & m2]
COL1 COL2 COL3
0 AB_ok_7 5 2
4 U_ok_7 12 3
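
As a self-contained sketch of the masks above, the snippet below rebuilds the sample frame from the question and applies the three conditions. The `na=False` argument is an addition that is not in the original answers; it is assumed to be helpful only if COL1 can contain missing values.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "COL1": ["AB_ok_7", "AB_ok_4", "AB_uy_2", "AB_ok_2", "U_ok_7"],
    "COL2": [5, 2, 5, 2, 12],
    "COL3": [2, 5, 2, 2, 3],
})

# Each condition yields a boolean Series; & combines them element-wise.
# The parentheses are required because & binds tighter than > and <.
mask = (
    df["COL1"].str.contains("ok", na=False)  # na=False: treat missing strings as no match (assumption)
    & (df["COL2"] > 4)
    & (df["COL3"] < 4)
)

result = df[mask]
print(result)
```

The comparison operators and the `.gt()` / `.lt()` methods used in the answers are interchangeable here; the method form just avoids the extra parentheses.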

Related

position or move pandas column to a specific column index

I have a DF mydataframe and it has multiple columns (over 75 columns) with default numeric index:
Col1 Col2 Col3 ... Coln
I need to change the column positions as follows:
Col1 Col3 Col2 ... Coln
I can get the index of Col2 using:
mydataframe.columns.get_loc("Col2")
but I can't figure out how to swap them without manually listing all the columns and rearranging them in a list.

Try:
new_cols = ['Col1', 'Col3', 'Col2'] + list(df.columns[3:])
df = df[new_cols]

How to proceed:
store the names of columns in a list;
swap the names in that list;
apply the new order on the dataframe.
code:
l = list(df)
i1, i2 = l.index('Col2'), l.index('Col3')
l[i2], l[i1] = l[i1], l[i2]
df = df[l]
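
The three steps above can be wrapped in a small helper. This is only a sketch: `swap_columns` is a hypothetical name, not a pandas function, and the one-row frame is made up for illustration.

```python
import pandas as pd

def swap_columns(frame, a, b):
    """Return a copy of `frame` with columns `a` and `b` exchanged (hypothetical helper)."""
    cols = list(frame.columns)            # 1. store the column names in a list
    i, j = cols.index(a), cols.index(b)   # 2. locate the two names...
    cols[i], cols[j] = cols[j], cols[i]   #    ...and swap them in that list
    return frame[cols]                    # 3. apply the new order

# Illustrative data only.
df = pd.DataFrame([[1, 2, 3, 4]], columns=["Col1", "Col2", "Col3", "Col4"])
swapped = swap_columns(df, "Col2", "Col3")
print(list(swapped.columns))
```

Because the helper looks the names up by position, it works wherever the two columns sit and however many other columns there are.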

I'm imagining you want what @sentence is assuming: you want to swap the positions of 2 columns regardless of where they are.
This is a creative approach:
Create a dictionary that defines which columns get switched with what.
Define a function that takes a column name and returns an ordering.
Use that function as a key for sorting.
d = {'Col3': 'Col2', 'Col2': 'Col3'}
k = lambda x: df.columns.get_loc(d.get(x, x))
df[sorted(df, key=k)]
Col0 Col1 Col3 Col2 Col4
0 0 1 3 2 4
1 5 6 8 7 9
2 10 11 13 12 14
3 15 16 18 17 19
4 20 21 23 22 24
Setup
import numpy as np
import pandas as pd

df = pd.DataFrame(
np.arange(25).reshape(5, 5)
).add_prefix('Col')

Using np.r_ to build the array of column positions.
Given a sample as follows:
df:
col1 col2 col3 col4 col5 col6 col7 col8 col9 col10
0 0 1 2 3 4 5 6 7 8 9
1 10 11 12 13 14 15 16 17 18 19
i, j = df.columns.slice_locs('col2', 'col10')
df[df.columns[np.r_[:i, i+1, i, i+2:j]]]
Out[142]:
col1 col3 col2 col4 col5 col6 col7 col8 col9 col10
0 0 2 1 3 4 5 6 7 8 9
1 10 12 11 13 14 15 16 17 18 19
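
To make the np.r_ step easier to follow, here is a sketch that rebuilds the two-row sample and exposes the intermediate position array. The name `order` and the use of `iloc` are choices made for this illustration; the answer itself indexes `df.columns` instead.

```python
import numpy as np
import pandas as pd

# Two-row sample matching the frame shown above.
df = pd.DataFrame(
    np.arange(20).reshape(2, 10),
    columns=[f"col{n}" for n in range(1, 11)],
)

# i is the position of col2; j is one past the position of col10.
i, j = df.columns.slice_locs("col2", "col10")

# Concatenate: everything before col2, then col3, then col2, then the rest.
order = np.r_[:i, i + 1, i, i + 2:j]
print(order)

reordered = df.iloc[:, order]
print(list(reordered.columns))
```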

Join two data frames by matching two columns of one against a single column of the other, based on some conditions?

I have a dataframe like this:
df1
col1 col2 col3 col4
1 2 A S
3 4 A P
5 6 B R
7 8 B B
I have another data frame:
df2
col5 col6 col3
9 10 A
11 12 R
I want to join these two data frames: a row of df1 should join if its col3 or col4 value matches a col3 value of df2.
The final data frame will look like:
df3
col1 col2 col3 col5 col6
1 2 A 9 10
3 4 A 9 10
5 6 R 11 12
If the col3 value is present in df2, the join is on col3; otherwise it is on col4, provided the col4 value is present in df2's col3.
How can I do this in the most efficient way with pandas/Python?

Use two merges with the default inner join; for the second, filter out the df2 rows already matched through col3; finally concat the results:
df3 = df1.drop('col4', axis=1).merge(df2, on='col3')
df4 = (df1.drop('col3', axis=1).rename(columns={'col4':'col3'})
.merge(df2[~df2['col3'].isin(df1['col3'])], on='col3'))
df = pd.concat([df3, df4],ignore_index=True)
print (df)
col1 col2 col3 col5 col6
0 1 2 A 9 10
1 3 4 A 9 10
2 5 6 R 11 12
EDIT: Use left joins and finally combine_first:
df3 = df1.drop('col4', axis=1).merge(df2, on='col3', how='left')
df4 = (df1.drop('col3', axis=1).rename(columns={'col4':'col3'})
.merge(df2, on='col3', how='left'))
df = df3.combine_first(df4)
print (df)
col1 col2 col3 col5 col6
0 1 2 A 9.0 10.0
1 3 4 A 9.0 10.0
2 5 6 B 11.0 12.0
3 7 8 B NaN NaN
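
A different way to get the first result, sketched here as an alternative rather than as the answerer's method, is to build a single join key with `Series.where` (col3 when it exists in df2, otherwise col4) and merge once. The names `key` and `out` are introduced for this illustration only.

```python
import pandas as pd

# Sample data copied from the question.
df1 = pd.DataFrame({
    "col1": [1, 3, 5, 7],
    "col2": [2, 4, 6, 8],
    "col3": ["A", "A", "B", "B"],
    "col4": ["S", "P", "R", "B"],
})
df2 = pd.DataFrame({"col5": [9, 11], "col6": [10, 12], "col3": ["A", "R"]})

# Keep col3 where it is present in df2['col3']; otherwise fall back to col4.
key = df1["col3"].where(df1["col3"].isin(df2["col3"]), df1["col4"])

out = (
    df1[["col1", "col2"]]
    .assign(col3=key)                    # the chosen key becomes the join column
    .merge(df2, on="col3", how="inner")  # rows whose key is not in df2 drop out
)
print(out)
```

On this sample the last row of df1 falls back to col4 = "B", which is not in df2, so the inner join drops it.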

How to map with multiple columns python

I have two data frames as below:
df1:
col1 col2 col3
1 2 A
1 2 A
3 4 B
3 4 B
df2:
col1 col2
1 2
3 4
I want to put the value of col3 of df1 in df2 like below:
result:
col1 col2 col3
1 2 A
3 4 B
I tried the following code for the mapping but I get an error message. How can I do that?
df2['col3'] = df2[['col1','col2']].map(df1.set_index(['col1','col2'])['col3'])

Use:
df11 = df1[['col1','col2','col3']].drop_duplicates(['col1','col2'])
df = df2.merge(df11, on=['col1','col2'], how='left')
print (df)
col1 col2 col3
0 1 2 A
1 3 4 B
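
The lookup idea from the question (index df1 by both key columns, then fetch col3) can also be written with `DataFrame.join`, which matches the `on` columns against the other object's index. This is a sketch of that variant, not the answer's code; `lookup` is a name introduced here.

```python
import pandas as pd

# Sample data copied from the question.
df1 = pd.DataFrame({
    "col1": [1, 1, 3, 3],
    "col2": [2, 2, 4, 4],
    "col3": ["A", "A", "B", "B"],
})
df2 = pd.DataFrame({"col1": [1, 3], "col2": [2, 4]})

# One col3 value per (col1, col2) pair, indexed by that pair.
lookup = (
    df1.drop_duplicates(["col1", "col2"])
       .set_index(["col1", "col2"])["col3"]
)

# join() aligns df2's two key columns with the two index levels of lookup.
result = df2.join(lookup, on=["col1", "col2"])
print(result)
```

As in the merge-based answer, dropping duplicates first matters: without it, each row of df2 would be repeated once per matching row of df1.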

Python/Pandas - Combining groupby mean and min

What's the syntax for combining mean and a min on a dataframe? I want to group by 2 columns, calculate the mean within a group for col3 and keep the min value of col4. Would something like
groupeddf = nongrouped.groupby(['col1', 'col2', 'col3'], as_index=False).mean().min('col4')
work? If not, what's the correct syntax? Thank you!
EDIT
Okay, so the question wasn't quite clear without an example. I'll update it now. Also changes in text above.
I have:
ungrouped
col1 col2 col3 col4
1 2 3 4
1 2 4 1
2 4 2 1
2 4 1 3
2 3 1 3
Wanted output is grouped by columns 1-2, mean for column 3 (and actually some more columns on the data, this is simplified) and the minimum of col4:
grouped
col1 col2 col3 col4
1 2 3.5 1
2 4 1.5 1
2 3 1 3

I think you need first mean and then min of column col4:
min_val = nongrouped.groupby(['col1', 'col2', 'col3'], as_index=False).mean()['col4'].min()
or min of Series:
min_val = nongrouped.groupby(['col1', 'col2', 'col3'])['col4'].mean().min()
Sample:
nongrouped = pd.DataFrame({'col1':[1,1,3],
'col2':[1,1,6],
'col3':[1,1,9],
'col4':[1,3,5]})
print (nongrouped)
col1 col2 col3 col4
0 1 1 1 1
1 1 1 1 3
2 3 6 9 5
print (nongrouped.groupby(['col1', 'col2', 'col3'])['col4'].mean())
1 1 1 2
3 6 9 5
Name: col4, dtype: int64
min_val = nongrouped.groupby(['col1', 'col2', 'col3'])['col4'].mean().min()
print (min_val)
2
EDIT:
You need aggregate:
groupeddf = (nongrouped.groupby(['col1', 'col2'], sort=False)
             .agg({'col3':'mean','col4':'min'})
             .reset_index()
             .reindex(columns=nongrouped.columns))
print (groupeddf)
col1 col2 col3 col4
0 1 2 3.5 1
1 2 4 1.5 1
2 2 3 1.0 3
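
For reference, the same aggregation can be sketched with named aggregation, which assumes a pandas version that supports the `new_name=(column, function)` form. The frame below is the `ungrouped` sample from the question.

```python
import pandas as pd

# Sample data copied from the question.
ungrouped = pd.DataFrame({
    "col1": [1, 1, 2, 2, 2],
    "col2": [2, 2, 4, 4, 3],
    "col3": [3, 4, 2, 1, 1],
    "col4": [4, 1, 1, 3, 3],
})

grouped = (
    ungrouped
    .groupby(["col1", "col2"], as_index=False, sort=False)  # keep keys as columns, keep first-seen order
    .agg(col3=("col3", "mean"), col4=("col4", "min"))        # one output column per (source, function) pair
)
print(grouped)
```

With `as_index=False` the group keys stay ordinary columns, so no `reset_index()` or `reindex()` step is needed afterwards.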

Remove columns from data frame with hashing

Given two pandas dataframes:
df1 = pd.read_csv(file1, names=['col1','col2','col3'])
df2 = pd.read_csv(file2, names=['col1','col2','col3'])
I'd like to remove all the rows in df2 where the values of either col1 or col2 (or both) do not exist in df1.
Doing the following:
df2 = df2[(df2['col1'] in set(df1['col1'])) & (df2['col2'] in set(df1['col2']))]
yields:
TypeError: 'Series' objects are mutable, thus they cannot be hashed

I think you can try isin:
df2 = df2[(df2['col1'].isin(df1['col1'])) & (df2['col2'].isin(df1['col2']))]
df1 = pd.DataFrame({'col1':[1,2,3,3],
'col2':[4,5,6,2],
'col3':[7,8,9,5]})
print (df1)
col1 col2 col3
0 1 4 7
1 2 5 8
2 3 6 9
3 3 2 5
df2 = pd.DataFrame({'col1':[1,2,3,5],
'col2':[4,7,4,1],
'col3':[7,8,9,1]})
print (df2)
col1 col2 col3
0 1 4 7
1 2 7 8
2 3 4 9
3 5 1 1
df2 = df2[(df2['col1'].isin(df1['col1'])) & (df2['col2'].isin(df1['col2'].unique()))]
print (df2)
col1 col2 col3
0 1 4 7
2 3 4 9
Another option is merge, since an inner join (how='inner') is the default. Without an on argument it joins on all shared columns, so it keeps only the rows whose col1, col2 and col3 values all match a single row of df1:
print (pd.merge(df1, df2))
col1 col2 col3
0 1 4 7
