Python - Is one string column in another?

I have a pandas dataframe with two columns. I need to determine if the string value from one column is in the string value of another column. The second column could be a 'single value' like 'value1' or it could be multiple items separated by a '/' in the string, like this: 'value1/value2/value3'.
For each row, I need to determine if the string is present in the other string in the same row, so that 'value1' in 'value1/value2/value3' would evaluate to True.
My attempts so far fail to check within each row; they only check whether the string in the first column appears anywhere in the second column, across all rows.
Here is an example:
import pandas as pd
df = pd.DataFrame({'a': ['a', 'b', 'c', 'd', 'e'],
                   'b': ['a/b', 'c/d', 'c/a', 'a/b', 'e']})
df['a'].isin(df['b'])
Expected result would evaluate to:
True
False
True
False
True

Comprehension
[a in b for a, b in zip(df.a, df.b)]
[True, False, True, False, True]
df.assign(In=[a in b for a, b in zip(df.a, df.b)])
   a    b     In
0  a  a/b   True
1  b  c/d  False
2  c  c/a   True
3  d  a/b  False
4  e    e   True
Numpy
from numpy.char import find  # public home of find; numpy.core is deprecated in NumPy 2.0
a, b = df.values.astype(str).T
find(b, a) >= 0
array([ True, False, True, False, True])
df.assign(In=find(b, a) >= 0)
   a    b     In
0  a  a/b   True
1  b  c/d  False
2  c  c/a   True
3  d  a/b  False
4  e    e   True
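Note that a in b is a substring test, so 'a' would also match a value such as 'ab/c'. If only whole items between the '/' separators should count, one option is to split first. The following is a minimal sketch, assuming '/' is always the separator and neither column contains missing values. It also shows why the isin attempt from the question falls short: isin compares each value of column a for exact equality against all values of column b.

```python
import pandas as pd

df = pd.DataFrame({'a': ['a', 'b', 'c', 'd', 'e'],
                   'b': ['a/b', 'c/d', 'c/a', 'a/b', 'e']})

# isin: exact-equality membership against the whole of column b, not row by row
whole_column = df['a'].isin(df['b']).tolist()

# Row-wise substring test, as in the comprehension above
substring_match = [a in b for a, b in zip(df['a'], df['b'])]

# Row-wise token test: split on '/' and compare whole items only
token_match = [a in b.split('/') for a, b in zip(df['a'], df['b'])]

print(df.assign(In=token_match))
```

For this sample data the substring and token tests agree; they differ only when one item is a substring of another item.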

Related

how to count and change value in dataframe by row

I want to count how many times the value False appears in each row of my dataframe and store that number in the count column.
So here is how my table should look initially:
A       B    C      D      count
First   row  True   False  0
Second  row  False  True   0
Third   row  True   True   0
Fourth  row  False  False  0
This is how it should look:
A       B    C      D      count
First   row  True   False  1
Second  row  False  True   1
Third   row  True   True   0
Fourth  row  False  False  2
This is my code. To start with, I tried to count for just one column, but it does not change the value in the count column.
import pandas as pd
data = {'A': ['One', None, 'One', None], 'B': [None, 'Two', None, 'Two'], 'C': [True, False, True, False],
        'D': [False, True, True, False], 'count': [0, 0, 0, 0]}
df = pd.DataFrame(data)
for index, row in df.iterrows():
    if row['C'] is False:
        row['count'] += 1
print(df.head(4))
IIUC, you want to count the False (boolean) values per row?
You can subset the boolean columns with select_dtypes, then invert the boolean value with ~ (so that False becomes True and is equivalent to 1), then sum per row:
df['count'] = (~df.select_dtypes('boolean')).sum(axis=1)
output:
      A     B      C      D  count
0   One  None   True  False      1
1  None   Two  False   True      1
2   One  None   True   True      0
3  None   Two  False  False      2
Select columns 'C' and 'D', flip/invert the booleans (~) and then sum across both columns:
df['count'] = (~df[['C', 'D']]).sum(axis='columns')
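As for why the loop in the question leaves the count column unchanged: iterrows yields each row as a separate Series, and for a frame with mixed dtypes that Series is a copy, so assigning to it never reaches the dataframe. A minimal sketch of the difference, using the question's data (the variable names unchanged, loop_counts and vector_counts are only for illustration):

```python
import pandas as pd

df = pd.DataFrame({'A': ['One', None, 'One', None],
                   'B': [None, 'Two', None, 'Two'],
                   'C': [True, False, True, False],
                   'D': [False, True, True, False],
                   'count': [0, 0, 0, 0]})

# Assigning to the row yielded by iterrows() changes a copy; df is untouched
for index, row in df.iterrows():
    row['count'] = 99
unchanged = df['count'].tolist()

# Writing through df.at[...] changes the frame itself (column C only, as in the question)
for index in df.index:
    if not df.at[index, 'C']:
        df.at[index, 'count'] += 1
loop_counts = df['count'].tolist()

# The vectorized version from the answers counts False in both C and D without a loop
vector_counts = (~df[['C', 'D']]).sum(axis=1).tolist()
```

The vectorized form is usually preferable; the df.at loop is shown only to make the copy behaviour visible.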

How can I insert a column of a dataframe in pandas as a list of a cell into another dataframe?

I have several dataframes (df, tmp_df and sub_df) and I want to put a column of tmp_df into a cell of sub_df as a list. My code and dataframes are shown below, but the loop part is not working correctly:
import pandas as pd
df = pd.read_csv('myfile.csv')
tmp_df = pd.DataFrame()
sub_df = pd.DataFrame()
tmp_df = df[df['Type'] == True]
for c in tmp_df['Category']:
    sub_df['Data'], sub_df['Category'], sub_df['Type'] = [list(set(tmp_df['Data']))], tmp_df['Category'], tmp_df['Type']
df:
Data   Category  Type
30275  A         True
35881  C         False
28129  C         True
30274  D         False
30351  D         True
35886  A         True
39900  C         True
35887  A         False
35883  A         True
35856  D         True
35986  C         False
30350  D         False
28129  C         True
31571  C         True
tmp_df:
Data   Category  Type
30275  A         True
28129  C         True
30351  D         True
35886  A         True
39900  C         True
35883  A         True
35856  D         True
28129  C         True
31571  C         True
What should I do if I want the following result?
sub_df:
Data                       Category  Type
[30275,35886,35883]        A         True
[28129,39900,28129,31571]  C         True
[30351,35856]              D         True
You can select the rows with query, then use groupby + agg:
(df.query('Type')  # or 'Type == "True"' if strings
   .groupby('Category', as_index=False)
   .agg({'Data': list, 'Type': 'first'})
)
output:
  Category                          Data  Type
0        A         [30275, 35886, 35883]  True
1        C  [28129, 39900, 28129, 31571]  True
2        D                [30351, 35856]  True
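For a version that runs without myfile.csv, here is a minimal sketch that rebuilds the question's df inline (assuming Type holds real booleans rather than strings) and applies the same query + groupby + agg chain:

```python
import pandas as pd

# Inline copy of the question's table, so no CSV file is needed
df = pd.DataFrame({
    'Data': [30275, 35881, 28129, 30274, 30351, 35886, 39900,
             35887, 35883, 35856, 35986, 30350, 28129, 31571],
    'Category': list('ACCDDACAADCDCC'),
    'Type': [True, False, True, False, True, True, True,
             False, True, True, False, False, True, True],
})

sub_df = (df.query('Type')                        # keep rows where Type is True
            .groupby('Category', as_index=False)  # one output row per category
            .agg({'Data': list, 'Type': 'first'}))
print(sub_df)
```

Collecting with list keeps the repeated 28129 in category C, as in the desired result; the set() call in the question's attempt would have dropped the duplicate.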

check columns in DataFrame for constant values explanation

I want to check a big DataFrame for constant columns and build two lists: the first with the names of columns that contain only zeros, the second with the names of columns that hold a constant value other than 0.
I found a solution (A in the code) at Link, but I don't understand it. A does what I want, but I don't know how it works or how I can get the lists from it.
import numpy as np
import pandas as pd
data = [[0,1,1],[0,1,2],[0,1,3]]
df = pd.DataFrame(data, columns=['A', 'B', 'C'])
A = df.loc[:, (df != df.iloc[0]).any()]
Use:
m1 = (df == 0).all()
m2 = (df == df.iloc[0]).all()
a = df.columns[m1].tolist()
b = df.columns[~m1 & m2].tolist()
print (a)
['A']
print (b)
['B']
Explanation:
First compare all values with 0:
print (df == 0)
      A      B      C
0  True  False  False
1  True  False  False
2  True  False  False
Then test whether all values in each column are True with DataFrame.all:
print ((df == 0).all())
A     True
B    False
C    False
dtype: bool
Then compare every row with the first row, selected with DataFrame.iloc:
print (df == df.iloc[0])
      A     B      C
0  True  True   True
1  True  True  False
2  True  True  False
And test again with all:
print ((df == df.iloc[0]).all())
A     True
B     True
C    False
dtype: bool
Because the all-zero columns must be excluded, invert the first mask with ~ and chain it to the second mask with & (bitwise AND):
print (~m1 & m2)
A    False
B     True
C    False
dtype: bool
This seems like a clean way to do what you want:
m1 = df.eq(0).all()
m2 = df.nunique().eq(1) & ~m1
m1[m1].index, m2[m2].index
# (Index(['A'], dtype='object'), Index(['B'], dtype='object'))
m1 gives you a boolean mask of the columns that contain only zeros:
m1
A     True
B    False
C    False
dtype: bool
m2 gives you all columns with a single unique value, excluding the all-zero columns (the second condition reuses the first):
m2
A    False
B     True
C    False
dtype: bool
Deriving your lists is trivial from these masks.
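As for the expression A from the question, it can be read in three steps. The sketch below only unpacks it; the intermediate names (differs, varies, constant) are introduced for illustration. Note that it keeps the columns that vary, so the constant columns are the ones it drops.

```python
import pandas as pd

data = [[0, 1, 1], [0, 1, 2], [0, 1, 3]]
df = pd.DataFrame(data, columns=['A', 'B', 'C'])

differs = df != df.iloc[0]   # True where a cell differs from the first row
varies = differs.any()       # per column: does at least one cell differ?
A = df.loc[:, varies]        # keeps only the non-constant columns

# The constant columns are the complement; split them by their (single) value
constant = df.columns[~varies].tolist()
zeros = [c for c in constant if df[c].iloc[0] == 0]
nonzero_constant = [c for c in constant if df[c].iloc[0] != 0]

print(zeros, nonzero_constant)
```

One caveat: NaN never compares equal to itself, so a column made up entirely of NaN would be treated as varying by this comparison.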

Python Pandas - Cannot recognize a string from a column in another dataframe column

I've a dataframe with the following data:
Now I am trying to use the isin method to produce a new column that shows whether col_a is in col_b. So in this case I am trying to produce the following output:
For this I am using this code:
df['res'] = df.col_a.isin(df.col_b)
But it always returns False. I also tried this: df['res'] = df.col_b.isin(df.col_a)
but with the same result: every row is False.
What am I doing wrong?
Thanks!
You can check per row whether the value in col_a is in col_b with apply:
df['res'] = df.apply(lambda x: x.col_a in x.col_b, axis=1)
Or with a list comprehension:
df['res'] = [a in b for a, b in zip(df.col_a, df.col_b)]
EDIT: The error evidently means there are missing values, so an if-else statement is necessary:
import numpy as np
import pandas as pd
df = pd.DataFrame({'col_a': ['SQL', 'Java', 'C#', np.nan, 'Python', np.nan],
                   'col_b': ['I.like_SQL_since_i_used_to_ETL',
                             'I like_programming_SQL.too',
                             'I prefer Java',
                             'I like beer',
                             np.nan,
                             np.nan]})
print (df)
# x == x is False only for NaN, so these comparisons filter out missing values
df['res'] = df.apply(lambda x: x.col_a in x.col_b
                               if (x.col_a == x.col_a) and (x.col_b == x.col_b)
                               else False, axis=1)
df['res1'] = [a in b if (a == a) and (b == b) else False for a, b in zip(df.col_a, df.col_b)]
print (df)
    col_a                           col_b    res   res1
0     SQL  I.like_SQL_since_i_used_to_ETL   True   True
1    Java      I like_programming_SQL.too  False  False
2      C#                   I prefer Java  False  False
3     NaN                     I like beer  False  False
4  Python                             NaN  False  False
5     NaN                             NaN  False  False
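The a == a comparison works because NaN is the only value that is not equal to itself. If that reads as too cryptic, a more explicit variant is to test the types directly. This is a minimal sketch; contains is a hypothetical helper name, not a pandas function.

```python
import numpy as np
import pandas as pd

def contains(a, b):
    """Return True only when both values are strings and a occurs in b."""
    return isinstance(a, str) and isinstance(b, str) and a in b

df = pd.DataFrame({'col_a': ['SQL', 'Java', 'C#', np.nan, 'Python', np.nan],
                   'col_b': ['I.like_SQL_since_i_used_to_ETL',
                             'I like_programming_SQL.too',
                             'I prefer Java',
                             'I like beer',
                             np.nan,
                             np.nan]})

# Row-wise substring test; missing values on either side give False
df['res'] = [contains(a, b) for a, b in zip(df['col_a'], df['col_b'])]
print(df)
```

This also guards against non-string values other than NaN, which the a == a check would let through.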

matching of columns between two pandas dataframe

import numpy as np
import pandas as pd
temp1 = pd.DataFrame(index=np.arange(10), columns=['a','b'])
temp1['a'] = [1,2,2,3,3,4,4,4,9,11]
temp1['b'] = 'B'
temp2 = pd.DataFrame(index=np.arange(10), columns=['a','b'])
temp2['a'] = [1,2,3,4,5,6,7,8,9,10]
temp2['b'] = 'B'
As in the script above, I want to pick up the rows from temp1 whose value in column a does not appear in temp2. I can do this easily with %in% in R; how can I do it in pandas?
update 01
The output should be one row, in which column a is 11 and column b is B.
You can use isin to perform boolean indexing. isin produces a boolean mask:
In [95]:
temp1.a.isin(temp2.a)
Out[95]:
0     True
1     True
2     True
3     True
4     True
5     True
6     True
7     True
8     True
9    False
Name: a, dtype: bool
This can then be used as a mask in the final output:
In [94]:
# note the ~ this negates the result so equivalent of NOT
temp1[~temp1.a.isin(temp2.a)]
Out[94]:
    a  b
9  11  B
You can use isin to get the indices that are seen, and then negate the boolean indices:
temp1[~temp1.a.isin(temp2.a)]
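Put together as a self-contained script, the isin approach might look like the sketch below. The frames are built directly from dictionaries here, which is an assumption made for brevity rather than the question's own construction; the names seen and unseen_rows are illustrative.

```python
import pandas as pd

temp1 = pd.DataFrame({'a': [1, 2, 2, 3, 3, 4, 4, 4, 9, 11], 'b': 'B'})
temp2 = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 'b': 'B'})

# True where the value of temp1.a appears anywhere in temp2.a (like %in% in R)
seen = temp1['a'].isin(temp2['a'])

# ~ negates the mask, keeping only the rows whose value was not seen
unseen_rows = temp1[~seen]
print(unseen_rows)
```

Here column-wide membership is exactly what is wanted, unlike the row-by-row substring questions above, where isin is the wrong tool.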
