Applying function to pandas dataframe column after str.findall - python

I have such a dataframe df with two columns:
Col1 Col2
'abc-def-ghi' 1
'abc-opq-rst' 2
I created a new column Col3 like this:
df['Col3'] = df['Col1'].str.findall('abc', flags=re.IGNORECASE)
And got such a dataframe afterwards:
Col1 Col2 Col3
'abc-def-ghi' 1 [abc]
'abc-opq-rst' 2 [abc]
What I want to do now is to create a new column Col4 that holds 1 if Col3 contains 'abc' and 0 otherwise.
I tried to do this with a function:
def f(row):
    if row['Col3'] == '[abc]':
        val = 1
    else:
        val = 0
    return val
And applied this to my pandas dataframe:
df['Col4'] = df.apply(f, axis=1)
But I only get 0, even in rows that contain 'abc'. I think there is something wrong with my if-statement.
How can I solve this?

Just do
df['Col4'] = df.Col3.astype(bool).astype(int)
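A runnable sketch of both the bug and the fix (a third row without 'abc' is added here to show the 0 case). str.findall returns a list per row, so comparing Col3 to the string '[abc]' is always False, while an empty list is falsy and casts cleanly to 0:

```python
import re
import pandas as pd

df = pd.DataFrame({'Col1': ['abc-def-ghi', 'abc-opq-rst', 'xyz-uvw'],
                   'Col2': [1, 2, 3]})
df['Col3'] = df['Col1'].str.findall('abc', flags=re.IGNORECASE)

# row['Col3'] == '[abc]' compares a list to a string -> always False.
# An empty list is falsy, so bool -> int gives the desired 1/0 flag:
df['Col4'] = df['Col3'].astype(bool).astype(int)
print(df['Col4'].tolist())  # [1, 1, 0]
```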

Related

drop all duplicate values in python

Assume I have the following dataframe in Python:
A = [['A',2,3],['A',5,4],['B',8,9],['C',8,10],['C',9,20],['C',10,20]]
B = pd.DataFrame(A, columns = ['Col1','Col2','Col3'])
This gives me the dataframe above. I want to remove the rows that have the same value for Col1 but different values for Col3. I have tried the drop_duplicates command with different subsets of columns, but it does not give what I want. I could write a for loop, but that is not efficient at all (since there might be many more columns than this).
C = B.drop_duplicates(['Col1','Col3'], keep=False)
Can anyone help if there is any command in pandas that can do this without using a for loop?
The expected output would be just the B row, since the A and C rows are removed because they share a Col1 value but have different Col3 values.
A = [['A',2,3],['A',5,4],['B',8,9],['C',8,10],['C',9,20],['C',10,20]]
df = pd.DataFrame(A, columns = ['Col1','Col2','Col3'])
output = df.drop_duplicates('Col1', keep=False)
print(output)
Output:
Col1 Col2 Col3
2 B 8 9
This can do the job,
grouped_df = df.groupby("Col1")
groups = [grouped_df.get_group(key) for key in grouped_df.groups.keys()
          if len(grouped_df.get_group(key)["Col3"].unique()) == 1]
new_df = pd.concat(groups).reset_index(drop=True)
Output -
  Col1  Col2  Col3
0    B     8     9
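The same grouping idea can be written more compactly with groupby(...).filter — a sketch, not from the answers above, keeping only the groups whose Col3 values are all identical:

```python
import pandas as pd

A = [['A', 2, 3], ['A', 5, 4], ['B', 8, 9],
     ['C', 8, 10], ['C', 9, 20], ['C', 10, 20]]
df = pd.DataFrame(A, columns=['Col1', 'Col2', 'Col3'])

# Keep a Col1 group only if all its Col3 values agree.
new_df = df.groupby('Col1').filter(lambda g: g['Col3'].nunique() == 1)
print(new_df)  # only the B row survives
```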

Pandas all cells which length <2

I want to select all rows of a DataFrame where the value in any column is shorter than 2 characters.
For example:
df = pd.DataFrame({"col1":["a","ab",""],"col2":["bc","abc", "a"]})
col1 col2
0 a bc
1 ab abc
2 a
How to get this output:
col1 col2
0 a bc
2 a
Let's try stack to reshape, then compute the lengths with str.len and build a boolean mask with lt + any:
df[df.stack().str.len().lt(2).any(level=0)]
(On pandas 2.0+, any(level=0) was removed; use .groupby(level=0).any() instead.)
col1 col2
0 a bc
2 a
You can use the str.len() method of the pandas Series:
for col in df.columns:
    df[col] = df[col][df[col].str.len() < 3]
df = df.dropna()
Note this keeps only the rows where every cell is at most 2 characters long, which happens to match the expected output here.
A list comprehension could help here:
df.loc[[not any(len(word) > 2 for word in entry)
        for entry in df.to_numpy()]]
col1 col2
0 a bc
2 a
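An equivalent mask can also be built without reshaping — a sketch assuming all columns are strings: compute per-cell lengths column by column, then keep the rows where any cell is shorter than 2 characters:

```python
import pandas as pd

df = pd.DataFrame({"col1": ["a", "ab", ""], "col2": ["bc", "abc", "a"]})

# One boolean per cell, then collapse across columns with any(axis=1).
mask = df.apply(lambda col: col.str.len().lt(2)).any(axis=1)
print(df[mask])  # rows 0 and 2
```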

How to find common elements in several dataframes

I have the following dataframes:
df1 = pd.DataFrame({'col1': ['A','M','C'],
'col2': ['B','N','O'],
# plus many more
})
df2 = pd.DataFrame({'col3': ['A','A','A','B','B','B'],
'col4': ['M','P','Q','J','P','M'],
# plus many more
})
Which look like these:
df1:
col1 col2
A B
M N
C O
#...plus many more
df2:
col3 col4
A M
A P
A Q
B J
B P
B M
#...plus many more
The objective is to create a dataframe containing all elements of col4 for each col3 pair that occurs in a row of df1. For example, look at row 1 of df1: A is in col1 and B is in col2. Then we go to df2 and check what col4 is for df2[df2['col3'] == 'A'] and for df2[df2['col3'] == 'B']. We get ['M','P','Q'] for A and ['J','P','M'] for B. The intersection of these is ['M', 'P'], so what I want is something like this:
col1 col2 col4
A B M
A B P
....(and so on for the other rows)
The naive way to go about this is to iterate over rows and then get the intersection, but I was wondering if it's possible to solve this via merging techniques or other faster methods. So far, I can't think of any way how.
This should achieve what you want, using a combination of merge, groupby and set intersection:
# Getting tuple of all col1=col3 values in col4
df3 = pd.merge(df1, df2, left_on='col1', right_on='col3')
df3 = df3.groupby(['col1', 'col2'])['col4'].apply(tuple)
df3 = df3.reset_index()
# Getting tuple of all col2=col3 values in col4
df3 = pd.merge(df3, df2, left_on='col2', right_on='col3')
df3 = df3.groupby(['col1', 'col2', 'col4_x'])['col4_y'].apply(tuple)
df3 = df3.reset_index()
# Taking set intersection of our two tuples
df3['col4'] = df3.apply(lambda row: set(row['col4_x']) & set(row['col4_y']), axis=1)
# Dropping unnecessary columns
df3 = df3.drop(['col4_x', 'col4_y'], axis=1)
print(df3)
col1 col2 col4
0 A B {P, M}
If required, see this answer for examples of how to 'melt' col4.
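The intersection can also be expressed purely with merges — a sketch, not from the answer above, assuming df2 has no duplicate (col3, col4) pairs (otherwise add drop_duplicates first). Merge df2 against col1 and against col2 separately, then inner-merge the two results on col4 so only values present on both sides survive:

```python
import pandas as pd

df1 = pd.DataFrame({'col1': ['A', 'M', 'C'], 'col2': ['B', 'N', 'O']})
df2 = pd.DataFrame({'col3': ['A', 'A', 'A', 'B', 'B', 'B'],
                    'col4': ['M', 'P', 'Q', 'J', 'P', 'M']})

# col4 values reachable from col1, and from col2.
a = df1.merge(df2, left_on='col1', right_on='col3')[['col1', 'col2', 'col4']]
b = df1.merge(df2, left_on='col2', right_on='col3')[['col1', 'col2', 'col4']]

# Inner merge on col4 keeps exactly the intersection of the two sets.
out = a.merge(b, on=['col1', 'col2', 'col4'])
print(out)
```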

Python 3.6: Find first occurrence string (entire column value) from dataframe which starts with '$'

I have a dataframe with 55 columns and want to find the first string value that starts with '$'.
I tried the script below, but could not achieve it.
string = ''
for col in df:
    string = df[col].str.startswith('$')
    if string != '':
        sys.exit()
sample df:
Col1 Col2 Col3 Col4
123 5678 $45678 $5000
54356 768 Rs.5432 6546
Expected result: $45678, which is present in the 3rd column.
You can create mask first:
m = df.astype(str).applymap(lambda x: x.startswith('$'))
print (m)
Col1 Col2 Col3 Col4
0 False False True True
1 False False False False
And then get the position of the first True across rows and columns with numpy.where, and select the value with iat:
print (np.where(m))
(array([0, 0], dtype=int64), array([2, 3], dtype=int64))
idx = np.where(m)[0][0]
col = np.where(m)[1][0]
a = df.iat[idx, col]
$45678
Use numpy.char.startswith:
for col in df:
    if np.any(np.char.startswith(np.asarray(df[col], str), '$')):
        string = col
        break
else:
    sys.exit()
IIUC you can use .loc and .iloc based on a condition, i.e. (note the astype(str), so numeric columns don't break the .str accessor):
mask = df.apply(lambda x: x.astype(str).str.startswith('$').any())
# mask holds booleans, so with loc we can select the matching columns
col = df.loc[:, mask].iloc[:, 0]
Output col :
0 $45678
1 Rs.5432
Name: Col3, dtype: object
col[col.str.startswith('$')].values[0]
'$45678'
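The mask idea above can be collapsed into a couple of lines — a sketch, not from the answers: mask the string-converted frame and stack it, so the first surviving value in row-major order is the first '$' string:

```python
import pandas as pd

df = pd.DataFrame({'Col1': [123, 54356], 'Col2': [5678, 768],
                   'Col3': ['$45678', 'Rs.5432'], 'Col4': ['$5000', 6546]})

s = df.astype(str)
m = s.apply(lambda col: col.str.startswith('$'))

# Masked cells become NaN; dropping them after stack leaves the matches
# in row-major order, so the first one is the answer.
first = s[m].stack().dropna().iloc[0]
print(first)  # $45678
```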

Find unique values for each column

I am looking to find the unique values for each column in my dataframe. (Values unique for the whole dataframe)
Col1 Col2 Col3
1 A A B
2 C A B
3 B B F
Col1 has C as a unique value, Col2 has none and Col3 has F.
Any genius ideas ? thank you !
You can use stack to get a Series, then drop_duplicates with keep=False to remove every duplicated value, drop the first index level with reset_index, and finally reindex:
df = (df.stack()
        .drop_duplicates(keep=False)
        .reset_index(level=0, drop=True)
        .reindex(index=df.columns))
print (df)
Col1 C
Col2 NaN
Col3 F
dtype: object
The solution above works nicely only if there is one unique value per column.
Here is an attempt at a more general solution:
print (df)
Col1 Col2 Col3
1 A A B
2 C A X
3 B B F
s = df.stack().drop_duplicates(keep=False).reset_index(level=0, drop=True)
print (s)
Col1 C
Col3 X
Col3 F
dtype: object
s = s.groupby(level=0).unique().reindex(index=df.columns)
print (s)
Col1 [C]
Col2 NaN
Col3 [X, F]
dtype: object
I don't believe this is exactly what you want, but as useful information: you can find the unique values across a whole DataFrame using numpy's np.unique like so:
>>> np.unique(df[['Col1', 'Col2', 'Col3']])
['A' 'B' 'C' 'F']
You can also get unique values of a specific column, e.g. Col3:
>>> df.Col3.unique()
['B' 'F']
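The per-column uniqueness can also be phrased with value_counts — a sketch, not from the answers above: count every value across the whole frame, then intersect each column with the values that occur exactly once:

```python
import pandas as pd

df = pd.DataFrame({'Col1': ['A', 'C', 'B'],
                   'Col2': ['A', 'A', 'B'],
                   'Col3': ['B', 'B', 'F']})

# Values seen exactly once in the whole frame are unique to a single cell.
counts = df.stack().value_counts()
singles = set(counts[counts == 1].index)

result = {col: sorted(set(df[col]) & singles) for col in df.columns}
print(result)  # {'Col1': ['C'], 'Col2': [], 'Col3': ['F']}
```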
