Find greater rows between two dataframes of the same shape - python

I have two dataframes of the same shape and am trying to find all the rows in df A where every value is greater than the corresponding row in df B.
Mini-example:
df_A = pd.DataFrame({'one':[20,7,2],'two':[11,9,1]})
df_B = pd.DataFrame({'one':[1,8,12],'two':[10,5,3]})
I'd like to return only row 0.
one two
0 20 11
I realise that df_A > df_B gets me most of the way, but I just can't figure out how to return only those rows where everything is True.
(I tried merging the two, but that didn't seem to make it simpler.)

IIUIC, you can use all
In [633]: m = (df_A > df_B).all(1)
In [634]: m
Out[634]:
0 True
1 False
2 False
dtype: bool
In [635]: df_A[m]
Out[635]:
one two
0 20 11
In [636]: df_B[m]
Out[636]:
one two
0 1 10
In [637]: pd.concat([df_A[m], df_B[m]])
Out[637]:
one two
0 20 11
0 1 10
Or, if you just need the row indices:
In [642]: m.index[m]
Out[642]: Int64Index([0], dtype='int64')

df_A.loc[(df_A > df_B).all(axis=1)]
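Applied to the example data from the question, this should return only row 0 (a quick check, reusing df_A and df_B as defined above):
import pandas as pd
df_A = pd.DataFrame({'one': [20, 7, 2], 'two': [11, 9, 1]})
df_B = pd.DataFrame({'one': [1, 8, 12], 'two': [10, 5, 3]})
# keep only the rows where every column of df_A is greater than the same column of df_B
print(df_A.loc[(df_A > df_B).all(axis=1)])
#    one  two
# 0   20   11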

import pandas as pd
df_A = pd.DataFrame({"one": [20, 7, 2], "two": [11, 9, 1]})
df_B = pd.DataFrame({"one": [1, 8, 12], "two": [10, 5, 3]})
row_indices = (df_A > df_B).apply(min, axis=1)
print(df_A[row_indices])
print()
print(df_B[row_indices])
Output is:
one two
0 20 11
one two
0 1 10
Explanation:
df_A > df_B compares element-wise; this is the result:
one two
0 True True
1 False True
2 False False
Python treats True > False, so applying min row-wise (this is why I used axis=1) only yields True if both values in a row are True:
0 True
1 False
2 False
This is now a boolean mask that can be used to extract rows from df_A and df_B, respectively.
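The same idea in isolation: because False < True in Python, taking the minimum of a row of booleans is equivalent to asking whether all of them are True, which is exactly what .all(axis=1) computes (a small illustration, not part of the original answer):
print(min([True, True]))   # True
print(min([True, False]))  # False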

It can be done in one line of code if you are interested.
df_A[(df_A > df_B)].dropna(axis=0, how='any')
Here df_A[(df_A > df_B)] keeps each value where the comparison is True and produces NaN where it is False:
one two
0 20.0 11.0
1 NaN 9.0
2 NaN NaN
Then we drop the rows along axis 0 that contain at least one NaN value.
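Note that indexing with a boolean DataFrame introduces NaN, so the surviving values come back as floats; if you need the original integer dtype you can cast back afterwards (a small follow-up sketch, assuming df_A and df_B from the question):
result = df_A[df_A > df_B].dropna(axis=0, how='any').astype(int)
print(result)
#    one  two
# 0   20   11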

Related

Check if a value in a series appears anywhere in a df column always evaluates to True

I have been trying to test if any value in a series appears in a column in a dataframe, however it appears that the way I have been doing it returns True regardless of whether the values appear or not.
For example, using the below dataframe and series:
df = pd.DataFrame({'Col1' : [1, 2, 2, 6, 4, 8],
'Col2' : [11, 11, 11, 11, 11, 11]})
Col1 Col2
0 1 11
1 2 11
2 2 11
3 6 11
4 4 11
5 8 11
series = pd.Series([3, 5, 9])
0 3
1 5
2 9
dtype: int64
I want to check if any value in series appears in Col1 (which it doesn't in this example, so should evaluate to False). The method I've been using to check this is:
if True in series.isin(df['Col1']):
Which evaluates to True despite the fact that the output of series.isin(df['Col1']) is a series of only False. I was wondering why this is the case, and whether there is a better way to check if a value in a series is in a dataframe column. I know I can use if any(series.isin(df['Col1'])); is this the best way?
(Pandas version: 1.1.3)
aggregate using any:
if series.isin(df['Col1']).any():
# do something
Why your approach failed
True in series.isin(df['Col1']) checks whether True appears in the index of the series, not in its values. It would work with True in series.isin(df['Col1']).values.
Although it doesn't return True for me, I suspect that in your case True in series.isin(df['Col1']) evaluates to True because you have 1 in the index and True is equivalent to 1 (which pandas version do you have?).
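To make the difference concrete (a small illustration using the df and series from the question):
checks = series.isin(df['Col1'])  # Series of all False, with index [0, 1, 2]
print(True in checks)             # version dependent: membership is tested against the index, where True == 1
print(True in checks.values)      # False: membership is tested against the values
print(checks.any())               # False: the aggregation you actually want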
You can check overlap with
overlap = series[series.isin(df['Col1'])]
If you are only interested in checking for emptiness - or no overlap - then you can use the empty property
series[series.isin(df['Col1'])].empty
# True

Dropping rows that have only one non-zero value from a pandas dataframe in python

I have a pandas dataframe as shown below:
Pandas Dataframe
I want to drop the rows that have only one non-zero value. What's the most efficient way to do this?
Try boolean indexing
# sample data
import numpy as np
import pandas as pd
df = pd.DataFrame(np.zeros((10, 10)), columns=list('abcdefghij'))
df.iloc[2:5, 3] = 1
df.iloc[4:5, 4] = 1
# boolean indexing based on condition
df[df.ne(0).sum(axis=1).ne(1)]
Only rows 2 and 3 are removed: row 4 has two non-zero values and every other row has none, so rows 2 and 3 are the only ones with exactly one non-zero value.
df.ne(0).sum(axis=1)
0 0
1 0
2 1
3 1
4 2
5 0
6 0
7 0
8 0
9 0
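Applying the mask to the sample frame above therefore keeps every row except 2 and 3 (a quick check, reusing the df built above):
print(df[df.ne(0).sum(axis=1).ne(1)].index.tolist())
# [0, 1, 4, 5, 6, 7, 8, 9]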
Not sure if this is the most efficient but I'll try:
df.loc[[idx for idx in df.index if (df.loc[idx] != 0).sum() != 1]]
Two passes per row here: one for checking != 0 and one more to sum the boolean values up (we could break earlier once a second non-zero value is found).
Otherwise, you can define a custom function to check without looping twice per row:
def check(row):
    already_has_one = False
    for value in row:
        if value != 0:
            if already_has_one:
                return False
            already_has_one = True
    return already_has_one
then keep only the rows where it returns False:
df.loc[[idx for idx in df.index if not check(df.loc[idx])]]
Which is much faster than the first.
Or like this:
df[(df.applymap(bool).sum(1) != 1).values]

Pandas dataframe self-dependency in data to fill a column

I have a dataframe with data as:
The value of "relation" is determined from the codeid. Leather has codeid=11, which has already appeared against bag, so in "relation" we put the value bag.
The same happens for shoes.
To do: fill in the value of "relation" by checking codeid within the dataframe. Any help would be appreciated.
Edit: the same codeid, e.g. 11, can appear more than twice, but "relation" can only have the value bag because bag is the first row with codeid=11. I have updated the picture as well.
If you want the relation filled only for later duplicates, taking the value from the first occurrence, use transform with 'first' and then set the remaining rows to NaN using loc with duplicated:
df = pd.DataFrame({'id':[1,2,3,4,5],
'name':list('brslp'),
'codeid':[11,12,13,11,13]})
df['relation'] = df.groupby('codeid')['name'].transform('first')
print (df)
id name codeid relation
0 1 b 11 b
1 2 r 12 r
2 3 s 13 s
3 4 l 11 b
4 5 p 13 s
#mark earlier occurrences of duplicated codeid values (keep='last' flags all but the last)
print (df['codeid'].duplicated(keep='last'))
0 True
1 False
2 True
3 False
4 False
Name: codeid, dtype: bool
#mark unique codeid values by inverting the boolean mask of all duplicates with ~
print (~df['codeid'].duplicated(keep=False))
0 False
1 True
2 False
3 False
4 False
Name: codeid, dtype: bool
#chain the boolean masks together
print (df['codeid'].duplicated(keep='last') | ~df['codeid'].duplicated(keep=False))
0 True
1 True
2 True
3 False
4 False
Name: codeid, dtype: bool
#set relation to NaN where the mask is True
df.loc[df['codeid'].duplicated(keep='last') |
~df['codeid'].duplicated(keep=False), 'relation'] = np.nan
print (df)
id name codeid relation
0 1 b 11 NaN
1 2 r 12 NaN
2 3 s 13 NaN
3 4 l 11 b
4 5 p 13 s
I think you want to do something like this:
import pandas as pd
df = pd.DataFrame([['bag', 11, 'null'],
['shoes', 12, 'null'],
['shopper', 13, 'null'],
['leather', 11, 'bag'],
['plastic', 13, 'shoes']], columns = ['name', 'codeid', 'relation'])
def codeid_analysis(rows):
    if rows['codeid'] == 11:
        rows['relation'] = 'bag'
    elif rows['codeid'] == 12:
        rows['relation'] = 'shirt'  # for example. You should put what you want here
    elif rows['codeid'] == 13:
        rows['relation'] = 'pants'  # for example. You should put what you want here
    return rows
result = df.apply(codeid_analysis, axis = 1)
print(result)
It is not the optimal solution since it is costly in memory, but here is my try. df1 is created to hold the rows with null values in the relation column, since it seems that the nulls are the first occurrences. After some cleaning, the two dataframes are merged into one.
import pandas as pd
df = pd.DataFrame([['bag', 11, 'null'],
['shoes', 12, 'null'],
['shopper', 13, 'null'],
['leather', 11, 'bag'],
['plastic', 13, 'shopper'],
['something',13,""]], columns = ['name', 'codeid', 'relation'])
df1=df.loc[df['relation'] == 'null'].copy()#create a df with only null values in relation
df1.drop_duplicates(subset=['codeid'], inplace=True)#drop duplicate codeids and keep the first entry for each
df1=df1.drop("relation",axis=1)#drop the unneeded column
final_df=pd.merge(df, df1, left_on='codeid', right_on='codeid')#merge the two dfs on codeid
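Because both frames contain a name column, the merge produces name_x (the row's own name) and name_y (the name of the first row with that codeid); a hedged cleanup sketch for that last step, assuming final_df from above (note that first-occurrence rows will point to themselves rather than stay null):
final_df = (final_df.drop(columns=['relation'])
                    .rename(columns={'name_x': 'name', 'name_y': 'relation'}))
print(final_df)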

Dropping rows in python pandas

I have the following DataFrame:
2010-01-03 2010-01-04 2010-01-05 2010-01-06 2010-01-07
1560 0.002624 0.004992 -0.011085 -0.007508 -0.007508
14 0.000000 -0.000978 -0.016960 -0.016960 -0.009106
2920 0.000000 0.018150 0.018150 0.002648 0.025379
1502 0.000000 0.018150 0.011648 0.005963 0.005963
78 0.000000 0.018150 0.014873 0.014873 0.007564
I have a list of indices corresponding to rows that I want to drop from my DataFrame. For simplicity, assume my list is idx_to_drop = [1560, 1502], which corresponds to the 1st row and 4th row in the dataframe above.
I tried to run df2 = df.drop(df.index[idx_to_drop]), but that expects row numbers rather than the .ix() index value. I have many more rows and many more columns, and getting row numbers by using the where() function takes a while.
How can I drop rows whose .ix() match?
I would tackle this by breaking the problem into two pieces. Mask what you are looking for, then sub-select the inverse.
Short answer:
df[~df.index.isin([1560, 1502])]
Explanation with runnable example, using isin:
import pandas as pd
df = pd.DataFrame({'index': [1, 2, 3, 1500, 1501],
'vals': [1, 2, 3, 4, 5]}).set_index('index')
bad_rows = [1500, 1501]
mask = df.index.isin(bad_rows)
print(mask)
[False False False True True]
df[mask]
vals
index
1500 4
1501 5
print(~mask)
[ True True True False False]
df[~mask]
vals
index
1 1
2 2
3 3
You can see that we've identified the two bad rows, and then we choose all the rows that aren't the bad ones. Our mask is for the bad rows, and all other rows are anything that is not the mask (~mask).

Keeping the N first occurrences of

The following code will (of course) keep only the first occurrence of 'Item1' in rows sorted by 'Date'. Any suggestions as to how I could get it to keep, say the first 5 occurrences?
## Sort the dataframe by Date and keep only the earliest appearance of 'Item1'
## drop_duplicates considers the column 'Item1' and keeps only the first occurrence
coocdates = data.sort('Date').drop_duplicates(cols=['Item1'])
You want to use head, either on the dataframe itself or on the groupby:
In [11]: df = pd.DataFrame([[1, 2], [1, 4], [1, 6], [2, 8]], columns=['A', 'B'])
In [12]: df
Out[12]:
A B
0 1 2
1 1 4
2 1 6
3 2 8
In [13]: df.head(2) # the first two rows
Out[13]:
A B
0 1 2
1 1 4
In [14]: df.groupby('A').head(2) # the first two rows in each group
Out[14]:
A B
0 1 2
1 1 4
3 2 8
Note: the behaviour of groupby's head was changed in 0.14 (it didn't act like a filter but modified the index), so you will have to reset the index if using an earlier version.
Use groupby() and nth():
According to Pandas docs, nth()
Take the nth row from each group if n is an int, or a subset of rows if n is a list of ints.
Therefore all you need is:
data.sort_values('Date').groupby('Item1').nth([0, 1, 2, 3, 4]).reset_index(drop=False)
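For instance, with a small made-up frame (hypothetical values, just to show the shape of the call; on older pandas versions nth sets the group key as the index, which is why the reset_index is there):
import pandas as pd
data = pd.DataFrame({'Item1': ['a', 'a', 'a', 'b', 'b'],
                     'Date': ['2020-01-05', '2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04']})
# keep at most the 2 earliest rows per Item1 (use [0, 1, 2, 3, 4] for the first 5)
print(data.sort_values('Date').groupby('Item1').nth([0, 1]))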
