Create a new column by comparing rows pandas - python

My dataframe looks like this
df = pd.DataFrame({'a': ["10001", "10001", "10002", "10002", "10002"], 'b': ['hello', 'hello', 'hola', 'hello', 'hola']})
I want to create a new column 'c' of boolean values with the following condition:
If the values of 'a' are the same (i.e. the 1st and 2nd rows, and the 3rd, 4th and 5th rows), check whether the values of 'b' in those rows are the same (the 2nd row returns True; the 4th row returns False).
If the values of 'a' are not the same, skip.
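So on the sample frame, the expected 'c' would presumably be (with the skipped rows rendered as False, which is what the answers below assume):
0    False
1     True
2    False
3    False
4    False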
My current code is the following:
def check_consistency(col1, col2):
    df['match'] = df[col1].eq(df[col1].shift())
    t = []
    for i in df['match']:
        if i == True:
            t.append(df[col2].eq(df[col2].shift()))

check_consistency('a', 'b')
And it returns an error.

I think this is a groupby:
df.groupby('a').b.apply(lambda x: x == x.shift())
Out[431]:
0    False
1     True
2    False
3    False
4    False
Name: b, dtype: bool
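To attach the result as the new column, transform keeps the index aligned with the original frame; a minimal sketch of the same idea:
df['c'] = df.groupby('a').b.transform(lambda x: x.eq(x.shift()))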

A bitwise & should do it, checking whether both conditions are satisfied:
df['c'] = (df.a == df.a.shift()) & (df.b == df.b.shift())
df.c
#0    False
#1     True
#2    False
#3    False
#4    False
#Name: c, dtype: bool
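If you would rather literally skip the rows where 'a' changes (NaN instead of False), a possible variation masks them with where:
same_a = df.a.eq(df.a.shift())
df['c'] = (same_a & df.b.eq(df.b.shift())).where(same_a)
#0      NaN
#1     True
#2      NaN
#3    False
#4    False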
Alternatively, if you want to make your current code work, you can do something like this (essentially doing the same check as above):
def check_consistency(col1, col2):
    df['match'] = df[col1].eq(df[col1].shift())
    for i in range(len(df['match'])):
        if df['match'][i] == True:
            df.loc[i, 'match'] = (df.loc[i, col2] == df.loc[i-1, col2])

check_consistency('a', 'b')
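Running the fixed function on the sample frame should then match the vectorized version above:
df['match']
#0    False
#1     True
#2    False
#3    False
#4    False
#Name: match, dtype: bool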

Related

Check columns in DataFrame for constant values - explanation

I want to check a big DataFrame for constant columns and build two lists: the first with the names of columns containing only zeros, the second with the names of columns holding constant values (excluding 0).
I found a solution (A in the code below) at Link, but I don't understand it. A does what I want, but I don't know how it works or how to get the lists from it.
import numpy as np
import pandas as pd
data = [[0,1,1],[0,1,2],[0,1,3]]
df = pd.DataFrame(data, columns=['A', 'B', 'C'])
A = df.loc[:, (df != df.iloc[0]).any()]
Use:
m1 = (df == 0).all()
m2 = (df == df.iloc[0]).all()
a = df.columns[m1].tolist()
b = df.columns[~m1 & m2].tolist()
print (a)
['A']
print (b)
['B']
Explanation:
First compare all values with 0:
print (df == 0)
      A      B      C
0  True  False  False
1  True  False  False
2  True  False  False
Then test whether all values are True with DataFrame.all:
print ((df == 0).all())
A     True
B    False
C    False
dtype: bool
Then compare each row with the first row, selected by DataFrame.iloc:
print (df == df.iloc[0])
      A     B      C
0  True  True   True
1  True  True  False
2  True  True  False
And test again with all:
print ((df == df.iloc[0]).all())
A     True
B     True
C    False
dtype: bool
Because the zero columns must be excluded, chain the inverted first mask (~m1) with the second using & (bitwise AND):
print (~m1 & m2)
A    False
B     True
C    False
dtype: bool
This seems like a clean way to do what you want:
m1 = df.eq(0).all()
m2 = df.nunique().eq(1) & ~m1
m1[m1].index, m2[m2].index
# (Index(['A'], dtype='object'), Index(['B'], dtype='object'))
m1 gives you a boolean mask of the columns that contain only zeros:
m1
A     True
B    False
C    False
dtype: bool
m2 gives you all columns with a single unique value, excluding the all-zero ones (the second condition re-uses the first):
m2
A    False
B     True
C    False
dtype: bool
Deriving your lists is trivial from these masks.
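For completeness, turning those masks into the requested lists:
a = m1[m1].index.tolist()
b = m2[m2].index.tolist()
print(a)
# ['A']
print(b)
# ['B']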

Python Pandas - Cannot recognize a string from a column in another dataframe column

I have a dataframe with the following data:
Now I am trying to use the isin method to produce a new column with the result of whether col_a is in col_b. So in this case I am trying to produce the following output:
For this I am using this code:
df['res'] = df.col_a.isin(df.col_b)
But it always returns FALSE. I also tried df['res'] = df.col_b.isin(df.col_a),
but with the same result: all the rows are FALSE.
What am I doing wrong?
Thanks!
You can check whether the value in col_a is in col_b row by row with apply:
df['res'] = df.apply(lambda x: x.col_a in x.col_b, axis=1)
Or by list comprehension:
df['res'] = [a in b for a, b in zip(df.col_a, df.col_b)]
EDIT: The error evidently means there are missing values, so an if-else statement is necessary:
import numpy as np
import pandas as pd

df = pd.DataFrame({'col_a': ['SQL', 'Java', 'C#', np.nan, 'Python', np.nan],
                   'col_b': ['I.like_SQL_since_i_used_to_ETL',
                             'I like_programming_SQL.too',
                             'I prefer Java',
                             'I like beer',
                             np.nan,
                             np.nan]})
print (df)

# x == x is False only for NaN, so the condition skips rows with missing values
df['res'] = df.apply(lambda x: x.col_a in x.col_b
                     if (x.col_a == x.col_a) and (x.col_b == x.col_b)
                     else False, axis=1)
df['res1'] = [a in b if (a == a) and (b == b) else False
              for a, b in zip(df.col_a, df.col_b)]
print (df)
    col_a                           col_b    res   res1
0     SQL  I.like_SQL_since_i_used_to_ETL   True   True
1    Java      I like_programming_SQL.too  False  False
2      C#                   I prefer Java  False  False
3     NaN                     I like beer  False  False
4  Python                             NaN  False  False
5     NaN                             NaN  False  False
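If the x == x trick for detecting NaN feels too cryptic, an equivalent and arguably more readable variant uses pd.notna explicitly (a sketch producing the same res column):
df['res'] = [pd.notna(a) and pd.notna(b) and a in b
             for a, b in zip(df.col_a, df.col_b)]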

Check if two rows in pandas DataFrame have the same set of values, regarding & regardless of column order

I have two dataframes with the same index but different column names. The number of columns is the same. I want to check, index by index, 1) whether they have the same set of values regardless of column order, and 2) whether they have the same values regarding column order.
ind = ['aaa', 'bbb', 'ccc']
df1 = pd.DataFrame({'old1': ['A','A','A'], 'old2': ['B','B','B'], 'old3': ['C','C','C']}, index=ind)
df2 = pd.DataFrame({'new1': ['A','A','A'], 'new2': ['B','C','B'], 'new3': ['C','B','D']}, index=ind)
This is the output I need.
       OpX    OpY
     -------------
aaa   True   True
bbb  False   True
ccc  False  False
Could anyone help me with OpX and OpY?
Using tuple and set: tuple keeps the order, set ignores it.
s1 = df1.apply(tuple, 1) == df2.apply(tuple, 1)
s2 = df1.apply(set, 1) == df2.apply(set, 1)
pd.concat([s1, s2], 1)
Out[746]:
         0      1
aaa   True   True
bbb  False   True
ccc  False  False
Since cs95 mentioned that apply has performance problems here, a NumPy alternative:
s = np.equal(df1.values, df2.values).all(1)
t = np.equal(np.sort(df1.values, 1), np.sort(df2.values, 1)).all(1)
pd.DataFrame(np.column_stack([s, t]), index=df1.index)
Out[754]:
         0      1
aaa   True   True
bbb  False   True
ccc  False  False
Here's a solution that is performant and should scale. First, align the DataFrames on the index so you can compare them easily.
df3 = df2.set_axis(df1.columns, axis=1, inplace=False)
df4, df5 = df1.align(df3)
For the order-sensitive check (req 2 in the question), simply call DataFrame.equals (or just use the == op):
u = (df4 == df5).all(axis=1)
u
aaa     True
bbb    False
ccc    False
dtype: bool
The order-insensitive check (req 1) is slightly more complex: sort the values within each row, then compare.
v = pd.Series((np.sort(df4) == np.sort(df5)).all(axis=1), index=u.index)
v
aaa     True
bbb     True
ccc    False
dtype: bool
Concatenate the results:
pd.concat([u, v], axis=1, keys=['X', 'Y'])
         X      Y
aaa   True   True
bbb  False   True
ccc  False  False
For item 2):
(df1.values == df2.values).all(axis=1)
This checks element-wise equality of the dataframes, and gives True when all entries in a row are equal.
For item 1), sort the values along each row first:
import numpy as np
(np.sort(df1.values, axis=1) == np.sort(df2.values, axis=1)).all(axis=1)
Construct a new DataFrame and check the equality:
df3 = pd.DataFrame(index=ind)
df3['OpX'] = (df1.values == df2.values).all(1)
df3['OpY'] = (df1.apply(np.sort, axis=1).values == df2.apply(np.sort, axis=1).values).all(1)
print(df3)
Output:
       OpX    OpY
aaa   True   True
bbb  False   True
ccc  False  False

Easy Way to See if Two Columns are One-to-One in Pandas

Working with data in Python 3+ with pandas. It seems like there should be an easy way to check if two columns have a one-to-one relationship (regardless of column type), but I'm struggling to think of the best way to do this.
Example of expected output:
A    B      C
0    'a'    'apple'
1    'b'    'banana'
2    'c'    'apple'
A & B are one-to-one? TRUE
A & C are one-to-one? FALSE
B & C are one-to-one? FALSE
Well, you can create your own function to check it:
def isOneToOne(df, col1, col2):
    first = df.groupby(col1)[col2].count().max()
    second = df.groupby(col2)[col1].count().max()
    return first + second == 2
isOneToOne(df, 'A', 'B')
#True
isOneToOne(df, 'A', 'C')
#False
isOneToOne(df, 'B', 'C')
#False
In case your data is more like this:
df = pd.DataFrame({'A': [0, 1, 2, 0],
                   'C': ["'apple'", "'banana'", "'apple'", "'apple'"],
                   'B': ["'a'", "'b'", "'c'", "'a'"]})
df
#   A    B         C
#0  0  'a'   'apple'
#1  1  'b'  'banana'
#2  2  'c'   'apple'
#3  0  'a'   'apple'
Then you can use:
def isOneToOne(df, col1, col2):
    first = df.drop_duplicates([col1, col2]).groupby(col1)[col2].count().max()
    second = df.drop_duplicates([col1, col2]).groupby(col2)[col1].count().max()
    return first + second == 2
df.groupby(col1)[col2]\
  .apply(lambda x: x.nunique() == 1)\
  .all()
should work fine if you want a true or false answer.
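Note that this checks only one direction (each col1 value maps to a single col2 value). A full one-to-one test would presumably run it both ways; a sketch (is_one_to_one_both_ways is a hypothetical name):
def is_one_to_one_both_ways(df, col1, col2):
    # one col2 value per col1 value, and one col1 value per col2 value
    return (df.groupby(col1)[col2].nunique().eq(1).all()
            and df.groupby(col2)[col1].nunique().eq(1).all())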
A nice way to visualize the relationship between two columns with discrete / categorical values (in case you are using Jupyter notebook) is :
df.groupby([col1, col2])\
  .apply(lambda x: x.count())\
  .iloc[:, 0]\
  .unstack()\
  .fillna(0)
This matrix will tell you the correspondence between the column values in the two columns.
In case of a one-to-one relationship there will be only one non-zero value per row in the matrix.
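For the record, pd.crosstab builds essentially the same correspondence matrix more directly; on the question's frame it should look something like:
pd.crosstab(df['A'], df['B'])
# B  'a'  'b'  'c'
# A
# 0    1    0    0
# 1    0    1    0
# 2    0    0    1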
df.groupby('A').B.nunique().max()==1 #Output: True
df.groupby('B').C.nunique().max()==1 #Output: False
Within each value of the groupby column, count the number of unique values in the other column, then check that the maximum over all such counts is one.
One way to solve this:
df['A to B'] = df.groupby('B')['A'].transform(lambda x: x.nunique() == 1)
df['A to C'] = df.groupby('C')['A'].transform(lambda x: x.nunique() == 1)
df['B to C'] = df.groupby('C')['B'].transform(lambda x: x.nunique() == 1)
Output:
   A  B       C  A to B  A to C  B to C
0  0  a   apple    True   False   False
1  1  b  banana    True    True    True
2  2  c   apple    True   False   False
To check column by column:
print((df['A to B'] == True).all())
print((df['A to C'] == True).all())
print((df['B to C'] == True).all())
True
False
False
Here is my solution (only two or three lines of code) to check whether any number of columns are a one-to-one match (duplicated matches are allowed; see the example below).
cols = ['A', 'B']  # or any number of columns, e.g. ['A', 'B', 'C']
res = df.groupby(cols).count()
uniqueness = [res.index.get_level_values(i).is_unique
              for i in range(res.index.nlevels)]
all(uniqueness)
Let's make it a function and add some docs:
def is_one_to_one(df, cols):
    """Check whether any number of columns are a one-to-one match.

    df: a pandas.DataFrame
    cols: must be a list of column names

    Duplicated matches are allowed:
        a - 1
        b - 2
        b - 2
        c - 3
    (These two columns will return True.)
    """
    if len(cols) == 1:
        # You can define your own rules for a 1-column check, or forbid it.
        return True

    # The main thing: the check for 2 or more columns!
    res = df.groupby(cols).count()
    # The count itself is actually not needed. What matters here is the
    # grouped *MultiIndex* and the uniqueness of each of its levels.
    uniqueness = [res.index.get_level_values(i).is_unique
                  for i in range(res.index.nlevels)]
    return all(uniqueness)
By using this function, you can do the one-to-one match check:
df = pd.DataFrame({'A': [0, 1, 2, 0],
                   'B': ["'a'", "'b'", "'c'", "'a'"],
                   'C': ["'apple'", "'banana'", "'apple'", "'apple'"]})
is_one_to_one(df, ['A', 'B'])
is_one_to_one(df, ['A', 'C'])
is_one_to_one(df, ['A', 'B', 'C'])
# Outputs:
# True
# False
# False

Cycling through a dictionary of dataframes for values above a threshold

I am trying to set up a way to cycle through a 100-element dictionary, where each element is a 42,000-row dataframe, and check whether the values are above a threshold.
I am having problems working out how to store the row data so it is not overwritten.
I've made a simple example of what I would like to do:
I have a three-element dictionary (my_dic) where each element is a dataframe.
I want to cycle through each row of every dataframe and check if any of the columns are above a threshold number.
I have been trying to use .any() and .where, but I don't know how to capture the data for each df separately.
I would like to end up with three separate new dfs of boolean values marking which entries are above the threshold.
Any help would be great!
d1 = np.random.rand(3,3)
d2 = np.random.rand(3,3)
d3 = np.random.rand(3,3)
df1 = pd.DataFrame(d1, index=['a', 'b', 'c'])
df2 = pd.DataFrame(d2, index=['a', 'b', 'c'])
df3 = pd.DataFrame(d3, index=['a', 'b', 'c'])
my_dic = {}
my_dic['a'] = df1
my_dic['b'] = df2
my_dic['c'] = df3
threshold = 0.5
I want to cycle through every row of each of the keys in my_dic and change the values to booleans indicating whether they are greater than the threshold number.
for k in my_dic:
    print(k)
    data = my_dic[k]
    for row in range(len(data)):
        print(row)
        np.where(data.iloc[row, :] > threshold)
This is the point I am struggling with; I'm not sure how to keep this data without it being overwritten.
I assume you're looking for a dict comprehension, if you just want a boolean mask.
result = {k: v > threshold for k, v in my_dic.items()}

for v in result.values():
    print(v, '\n')
       0      1      2
a  False  False  False
b   True  False   True
c  False   True   True

       0      1      2
a  False   True   True
b   True   True   True
c  False   True   True

       0      1      2
a  False  False  False
b  False  False  False
c   True   True  False
If you want the result as 0/1, use astype:
result = {k : v.gt(threshold).astype(int) for k, v in my_dic.items()}
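And if the goal (as in the original question) is to capture the actual rows where any column exceeds the threshold, rather than a full boolean frame, a short sketch:
hits = {k: v[(v > threshold).any(axis=1)] for k, v in my_dic.items()}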
