Select row in DataFrame based on values in multiple rows - python

I've got a DataFrame and I'd like to select rows where one column has a certain value AND the row above has a certain value in another column. How do I do this without a for loop?
For example:
df = pd.DataFrame({'one': [1,2,3,4,1,2,3,4], 'two': [1,2,3,4,5,6,7,8]})
I'd like to find the row where df.one equals 1 and df.two on the row above equals 4; in the example that is row 4, with values [1, 5].

You can try shift with boolean indexing:
print(df)
   one  two
0    1    1
1    2    2
2    3    3
3    4    4
4    1    5
5    2    6
6    3    7
7    4    8
print((df.one == 1) & (df.two.shift() == 4))
0    False
1    False
2    False
3    False
4     True
5    False
6    False
7    False
dtype: bool
print(df[(df.one == 1) & (df.two.shift() == 4)])
   one  two
4    1    5
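
For reference, a self-contained version of the above (a minimal sketch; it assumes nothing beyond pandas itself):

import pandas as pd

df = pd.DataFrame({'one': [1, 2, 3, 4, 1, 2, 3, 4],
                   'two': [1, 2, 3, 4, 5, 6, 7, 8]})

# shift() defaults to shift(1), so each row is compared against
# the previous row's value of 'two'
mask = (df.one == 1) & (df.two.shift() == 4)
print(df[mask])        # row 4 with values [1, 5]
print(df.index[mask])  # the matching row labels, if you need the index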

Related

recombine string columns based on another columns in pandas

I have a pandas DataFrame with 3 columns :
id product_id is_opt
1 1 False
1 2 False
1 3 True
1 4 True
2 5 False
2 6 False
2 7 False
3 8 False
3 9 False
3 10 True
I want to transform this DataFrame this way :
For a set of rows that share the same id, if all rows are is_opt = False, then the set of rows stays unchanged. For example, the rows with id = 2 do not change.
For a set of rows that share the same id, if at least one row is is_opt = True, then we apply this transformation:
All rows that are is_opt = True stay unchanged.
Each row that is is_opt = False gets the product_id of every is_opt = True row appended to its own product_id. If there are n rows with is_opt = True, then one is_opt = False row yields n rows. For example, the first row [1, 1, False] gives 2 rows [1, 1-3, False] and [1, 1-4, False].
The expected output for the example is:
id product_id
1 1-3
1 1-4
1 2-3
1 2-4
1 3
1 4
2 5
2 6
2 7
3 8-10
3 9-10
3 10
The is_opt column has been dropped in the expected result.
Can you help me find an efficient set of operations to get this result? It is straightforward with some for loops, but I would like something efficient because the DataFrames in production are huge.
You can use a custom function and itertools.product:
from itertools import product

def combine(df):
    if df['is_opt'].any():
        a = df.loc[~df['is_opt'], 'product_id']
        b = df.loc[df['is_opt'], 'product_id']
        l = ['-'.join(map(str, p)) for p in product(a, b)]
        return pd.Series(l + b.tolist())
    return df['product_id']

out = df.groupby('id').apply(combine).droplevel(1).reset_index(name='product_id')
output:
id product_id
0 1 1-3
1 1 1-4
2 1 2-3
3 1 2-4
4 1 3
5 1 4
6 2 5
7 2 6
8 2 7
9 3 8-10
10 3 9-10
11 3 10
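
A minimal end-to-end sketch, rebuilding the sample input from the question and applying the combine function defined above:

import pandas as pd
from itertools import product

df = pd.DataFrame({
    'id':         [1, 1, 1, 1, 2, 2, 2, 3, 3, 3],
    'product_id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'is_opt':     [False, False, True, True, False, False, False,
                   False, False, True],
})

out = df.groupby('id').apply(combine).droplevel(1).reset_index(name='product_id')

The pairing work happens per group, and the per-group lists are small, so itertools.product stays cheap even when the overall frame is large.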

pandas: Select rows by diff with previous row, but only one time per row

I have a dataset like below
ID value
1 10
2 15
3 18
4 30
5 35
I would like to keep all the rows where value minus the previous row's value is <= 5, so I do
df['diff'] = df.value.diff()
df = df[df['diff'] <= 5]
Then I will have
ID value diff
2 15 5
3 18 3
5 35 5
However, I don't want to keep row 3, because row 2 is kept due to row 1, and as row 1 and row 2 become a pair, row 3 should not be paired with row 2 anymore.
How could I do that using pandas? Indeed I can write a for loop but it is not the best idea.
So you have a mask that checks whether the difference to the previous row is <= 5:
>>> d = df.value.diff().le(5)
>>> d
1 False
2 True
3 True
4 False
5 True
Rows marked True will be kept, but you don't want to keep a True row if the previous row was also True.
So we shift this mask, negate it, and & it with the original, which turns each True that follows another True into False:
>>> d & ~d.shift(fill_value=False)
1 False
2 True
3 False
4 False
5 True
fill_value is needed because shifting would otherwise introduce a NaN, and a float NaN can't be bitwise-negated. Putting False there has no effect other than avoiding that error.
Now we can select the rows from the dataframe with this resultant mask:
>>> wanted = d & ~d.shift(fill_value=False)
>>> df[wanted]
ID value
2 15
5 35
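
Putting it together as a script (a minimal sketch, assuming ID is the index, as in the output above):

import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3, 4, 5],
                   'value': [10, 15, 18, 30, 35]}).set_index('ID')

d = df.value.diff().le(5)                # diff to previous row <= 5
wanted = d & ~d.shift(fill_value=False)  # drop a True that follows a True
print(df[wanted])                        # rows 2 and 5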

Populate a new dataframe column with True if two cell values match another smaller subset dataframe in pandas

I am looking to populate a new dataframe column with True if two cell values match another smaller subset dataframe in pandas, otherwise with a value of False.
For instance, this is original output dataframe I am constructing.
ID Type
1 A
2 B
3 A
4 A
5 C
6 A
7 D
8 A
9 B
10 A
And the smaller subset of the dataframe selected based on some criteria:
ID Type
1 A
3 A
4 A
5 C
7 D
10 A
What I am trying to accomplish: when ID and Type in the output dataframe match the smaller subset dataframe, I want to populate a new column called 'Result' with the value True. Otherwise, the value is False.
ID Type Result
1 A True
2 B False
3 A True
4 A True
5 C True
6 A False
7 D True
8 A False
9 B False
10 A True
You can .merge() the 2 dataframes using a left merge, with the original dataframe as the base, and turn on the indicator= parameter to show the merge result. Then convert the merge result to True for the rows that appear in both dataframes and False otherwise.
df_out = df1.merge(df2, on=['ID', 'Type'], how='left', indicator='Result')
df_out['Result'] = (df_out['Result'] == 'both')
Explanation:
With the indicator= parameter turned on, pandas will show for each row which dataframe it came from (in terms of both, left_only and right_only):
df_out = df1.merge(df2, on=['ID', 'Type'], how='left', indicator='Result')
print(df_out)
ID Type Result
0 1 A both
1 2 B left_only
2 3 A both
3 4 A both
4 5 C both
5 6 A left_only
6 7 D both
7 8 A left_only
8 9 B left_only
9 10 A both
Then we transform 'both' and the other values to True/False with a boolean mask, as follows:
df_out['Result'] = (df_out['Result'] == 'both')
print(df_out)
ID Type Result
0 1 A True
1 2 B False
2 3 A True
3 4 A True
4 5 C True
5 6 A False
6 7 D True
7 8 A False
8 9 B False
9 10 A True
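
If you would rather skip the merge entirely, an equivalent membership test on a MultiIndex is possible (a sketch, assuming df1 is the full frame and df2 the subset, as above):

df1['Result'] = df1.set_index(['ID', 'Type']).index.isin(
    df2.set_index(['ID', 'Type']).index)

This avoids materializing the intermediate merged frame, which can matter when the frames are large.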

How to search subset of a pandas dataframe for the row in which a value occurs

I have two dataframes, e.g.
import pandas as pd
import numpy as np
from random import shuffle
df_data = pd.DataFrame(data=np.random.randint(low=0, high=10, size=(10,3)), columns=['A', 'B', 'C'])
keys = np.arange(0, 10)
shuffle(keys)
df_data['keys'] = keys
key_data = pd.DataFrame(data=np.reshape(np.arange(1,10), (3,3)), columns=['Key_col1', 'Key_col2', 'Key_col3'])
key_data['Timestamp'], key_data['Info'] = ['Mon', 'Wed', 'Fri'], [13, 2, 47]
Which returns something like this:
A B C keys
0 3 9 2 5
1 7 9 4 7
2 9 6 6 0
3 9 9 0 9
4 8 5 8 6
5 2 5 7 3
6 5 1 2 4
7 3 9 6 2
8 4 2 3 8
9 6 5 5 1
and this:
Key_col1 Key_col2 Key_col3 Timestamp Info
0 1 2 3 Mon 13
1 4 5 6 Wed 2
2 7 8 9 Fri 47
I'd like to use the 'keys' column in the first dataframe to search only the Key columns in the second dataframe (i.e. Key_col1, Key_col2, Key_col3), because the 'Info' column may contain values that match keys.
I'll then add the columns Timestamp and Info to the row in which there is a match for the key.
Expected output for row 0 would be this:
A B C keys Timestamp Info
0 3 9 2 5 Wed 2
My approach is to first compare a subset of my key_data against a value:
key_data.iloc[:, 0:3] == 2
OUT
Key_col1 Key_col2 Key_col3
0 False True False
1 False False False
2 False False False
In my next step I try to return only the row where the value True occurs using df.loc
key_data.loc[:, key_data.iloc[:, 0:3] == 2]
But this results in the error ValueError: Cannot index with multidimensional key
Can somebody help me to return the row in which the value True occurs so that I can use this index for selecting where to append my data?
Thanks
EDIT: The keys are unique and all of them are present in exactly 1 of the 3 key columns.
You can try this, renaming the columns as needed:
new_df = pd.merge(df_data, key_data, how='right',
                  left_on=['keys', 'keys', 'keys'],
                  right_on=['Key_col1', 'Key_col2', 'Key_col3'])
new_df = new_df.dropna(axis=1, how='all')
Can somebody help me to return the row in which the value True occurs so that I can use this index for selecting where to append my data?
The answer to this question is key_data.loc[(key_data.iloc[:, 0:3] == 2).any(axis=1)], but for your larger goal, doing something with merge as Rahul Agarwal suggests would be better.
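
One caveat on the merge above: passing left_on=['keys', 'keys', 'keys'] requires all three key columns to equal the same key at once, which never happens in this data. Since the EDIT says each key appears in exactly one key column, a melt-then-merge sketch may be closer to the goal (column names as in the question):

# Reshape key_data so each Key_col value becomes its own row, then a
# single-column merge attaches Timestamp and Info to df_data.
long_keys = key_data.melt(id_vars=['Timestamp', 'Info'],
                          value_vars=['Key_col1', 'Key_col2', 'Key_col3'],
                          value_name='keys')
result = df_data.merge(long_keys[['keys', 'Timestamp', 'Info']],
                       on='keys', how='left')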

Can't evaluate column for empty values

I have read 20+ threads on this, and am still coming up empty (no pun intended).
I have a pandas dataframe df_s, which has a column that contains dates at iloc[:,8]. I am trying to add a new column to the dataframe with a value (yes/no) based on whether there is a value in the other column or not.
This is what I have been trying:
CDRFormUp = []
for row in df_s.iloc[:, 8]:
    if row == "":
        CDRFormUp.append('No')
    else:
        CDRFormUp.append('Yes')
df_s['CDR Form Up'] = CDRFormUp
CDRFormUp would be the new column. I'm running every row in the dataframe, and checking to see if the value in the column is anything.
I have tried...
if row <>"":
if row == "":
if row is None:
if row:
if row>0:
Nothing is working. The column contains dates and empty cells and text. For example, the value in this column in the first row is "CDF Form", in the second row it is blank, in the third row it is "4865" or something like that.
If I set the iloc to a different column that just contains country names, and set the condition to Country == "Italy", it properly adds "Yes" or "No" to the new column for each row, so it's not a wrong iloc or something else.
Any help would be incredibly appreciated.
Thanks!
You can use np.where with Pandas dataframes. Note that the empty and NaN cells should map to 'No' here, to match your loop above:
import numpy as np
import pandas as pd

df_s = pd.DataFrame(np.random.randint(1, 10, (5, 10)))
df_s.iloc[1, 8] = ''
df_s.iloc[3, 8] = np.nan
# mask() turns empty strings into NaN, so isnull() flags both blanks and NaN
df_s['CDRFormUp'] = np.where(df_s.iloc[:, 8].mask(df_s.iloc[:, 8].str.len() == 0).isnull(), 'No', 'Yes')
print(df_s)
Output:
   0  1  2  3  4  5  6  7    8  9 CDRFormUp
0  6  5  5  5  9  3  3  5    3  9       Yes
1  5  4  7  3  9  6  8  9       9        No
2  5  2  2  7  7  6  3  2    5  2       Yes
3  8  2  1  9  7  3  7  8  NaN  8        No
4  4  4  1  5  3  5  9  4    4  9       Yes
I suspect you have elements with white space.
Consider the dataframe df_s
df_s = pd.DataFrame([
    [1, 'a', 'Yes'],
    [2, '', 'No'],
    [3, ' ', 'No']
])

df_s

   0  1    2
0  1  a  Yes
1  2       No
2  3       No
Both rows 1 and 2 in column 1 look like blank strings, but only one of them actually equals '':
df_s.iloc[:, 1] == ''
0 False
1 True
2 False
Name: 1, dtype: bool
You may want to check whether the entire value is white space, or strip white space first.
Option 1
all white space
df_s.iloc[:, 1].str.match(r'^\s*$')
0 False
1 True
2 True
Name: 1, dtype: bool
Which we can convert to yes/no with
df_s.iloc[:, 1].str.match(r'^\s*$').map({True: 'no', False: 'yes'})
0 yes
1 no
2 no
Name: 1, dtype: object
Add a new column
df_s.assign(
    CDRFormUp=df_s.iloc[:, 1].str.match(r'^\s*$').map({True: 'no', False: 'yes'})
)

   0  1    2 CDRFormUp
0  1  a  Yes       yes
1  2       No        no
2  3       No        no
Option 2
strip white space then check if empty
df_s.iloc[:, 1].str.strip() == ''
0 False
1 True
2 True
Name: 1, dtype: bool
Add new column
df_s.assign(
    CDRFormUp=df_s.iloc[:, 1].str.strip().eq('').map({True: 'no', False: 'yes'})
)

   0  1    2 CDRFormUp
0  1  a  Yes       yes
1  2       No        no
2  3       No        no
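
For the original column, which mixes dates, numbers, text and blanks, a combined check along these lines may help (a sketch; 'CDR Form Up' as the new column name, matching the question):

import numpy as np

s = df_s.iloc[:, 8]
# Treat real NaN plus empty or whitespace-only strings as "no value";
# dates, numbers and any other text count as a value.
is_empty = s.isna() | s.astype(str).str.strip().eq('')
df_s['CDR Form Up'] = np.where(is_empty, 'No', 'Yes')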
