Find value in one column in another column with regex in pandas - python

I have a pandas dataframe with two columns of strings. I want to identify all rows where the string in the first column (s1) appears within the string in the second column (s2).
So if my columns were:
abc abcd*ef_gh
z1y xxyyzz
I want to keep the first row, but not the second.
The only approach I can think of is to:
iterate through the dataframe rows
apply df.str.contains() to s2, using the contents of s1 as the matching pattern
Is there a way to accomplish this that doesn't require iterating over the rows?

This is probably doable in a vectorised way (for simple substring matching only, not regex) with numpy chararray methods:
In [326]:
print(df)
    s1          s2
0  abc  abcd*ef_gh
1  z1y      xxyyzz
2  aaa   aaabbbsss
In [327]:
import numpy as np
# np.char.find returns the index of s1 within s2, or -1 if absent
print(df.loc[np.char.find(df.s2.values.astype(str),
                          df.s1.values.astype(str)) >= 0,
             's1'])
0    abc
2    aaa
Name: s1, dtype: object

The best I could come up with is to use apply instead of manual iterations:
>>> df = pd.DataFrame({'x': ['abc', 'xyz'], 'y': ['1234', '12xyz34']})
>>> df
     x        y
0  abc     1234
1  xyz  12xyz34
>>> df.x[df.apply(lambda row: row.y.find(row.x) != -1, axis=1)]
1    xyz
Name: x, dtype: object
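If regex isn't actually needed, a plain list comprehension over the two columns does the same substring test and avoids apply entirely; a minimal sketch (column names taken from the example above):

import pandas as pd

df = pd.DataFrame({'x': ['abc', 'xyz'], 'y': ['1234', '12xyz34']})

# Build a boolean mask with Python's `in` operator (plain substring
# matching, not regex), then select the matching rows.
mask = [a in b for a, b in zip(df['x'], df['y'])]
print(df.loc[mask, 'x'])
# 1    xyz
# Name: x, dtype: object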

Related

Matching regex in two different dataframe Python

I'm having trouble matching regexes across two dataframes that are linked by Type and a unique Country. Here are samples of the data df and the regex df. Note that the two dataframes have different shapes because the regex df contains only unique values.
Data df:
Country  Type  Data
MY       ABC   MY1234567890
IT       ABC   IT1234567890
PL       PQR   PL123456
MY       XYZ   456792abc
IT       ABC   MY45889976
IT       ABC   IT567888976

Regex df:
Country  Type  Regex
MY       ABC   ^MY[0-9]{10}
IT       ABC   ^IT[0-9]{10}
PL       PQR   ^PL
MY       XYZ   ^\w{6,10}$
I have tried merging them together and using a lambda to do the matching. Below is my code:
df.merge(df_regex, left_on='Country', right_on='Country')
df['Data Quality'] = df.apply(lambda r: re.match(r['Regex'], r['Data']) and 1 or 0, axis=1)
But it adds another row for each different Type and Country, so there is a lot of duplication, which is inefficient and time consuming.
Is there a pythonic way to match each data value to its own country and type when the reference lives in another dataframe, without merging the two dfs? If the data matches its own regex it should return 1, else 0.
To avoid the duplication based on Type, include Type in the join keys as well, then apply the lambda:
import re

df2 = df.merge(df_regex, on=['Country', 'Type'])
df2['Data Quality'] = df2.apply(lambda r: 1 if re.match(r['Regex'], r['Data']) else 0, axis=1)
df2
df2
It will give you the following output.
Country Type Data Regex Data Quality
0 MY ABC MY1234567890 ^MY[0-9]{10} 1
1 IT ABC IT1234567890 ^IT[0-9]{10} 1
2 IT ABC MY45889976 ^IT[0-9]{10} 0
3 IT ABC IT567888976 ^IT[0-9]{10} 0
4 PL PQR PL123456 ^PL 1
5 MY XYZ 456792abc ^\w{6,10}$ 1
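If the merge itself is the objection, a hedged alternative (assuming the df and df_regex frames from the question) is to collapse df_regex into a plain dict keyed on (Country, Type) and look each row's pattern up directly:

import re

# (Country, Type) -> compiled pattern, built once from df_regex
lookup = {(c, t): re.compile(rx)
          for c, t, rx in zip(df_regex['Country'], df_regex['Type'], df_regex['Regex'])}

def check(country, typ, data):
    pat = lookup.get((country, typ))   # None if no rule exists for this pair
    return 1 if pat is not None and pat.match(data) else 0

df['Data Quality'] = [check(c, t, d)
                      for c, t, d in zip(df['Country'], df['Type'], df['Data'])]

This keeps df at its original shape and reuses each compiled regex across all rows that share a (Country, Type) pair.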

Pandas - slicing column values based on another column

How can I slice column values based on first & last character location indicators from two other columns?
Here is the code for a sample df:
import pandas as pd
d = {'W': ['abcde','abcde','abcde','abcde']}
df = pd.DataFrame(data=d)
df['First']=[0,0,0,0]
df['Last']=[1,2,3,5]
df['Slice']=['a','ab','abc','abcde']
print(df.head())
(The Slice column above shows the desired output; the goal is to compute it from W, First, and Last.)
Just do it with a for loop; if you are worried about speed, see For loops with pandas - When should I care?
df['Slice'] = [x[y:z] for x, y, z in zip(df.W, df.First, df.Last)]
df
Out[918]:
W First Last Slice
0 abcde 0 1 a
1 abcde 0 2 ab
2 abcde 0 3 abc
3 abcde 0 5 abcde
I am not sure if this will be faster, but a similar approach would be:
df['Slice'] = df.apply(lambda x: x[0][x[1]:x[2]], axis=1)
Briefly, you go through each row (axis=1) and apply a custom function. The function takes the row (stored as x) and slices its first element, using the second and third elements as the slicing indices (that's the lambda part). I will be happy to elaborate if this isn't clear.
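For what it's worth, the same apply can be written with explicit column labels instead of positions, which keeps working if the columns are ever reordered (labels assumed from the sample df above):

df['Slice'] = df.apply(lambda row: row['W'][row['First']:row['Last']], axis=1)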

Replacing Periods in DF's Columns

I was wondering if there was an efficient way to replace periods in pandas dataframes without having to iterate through each row and call .replace() on it.
import pandas as pd
df = pd.DataFrame.from_dict({'column':['Sam M.']})
df.column = df.column.replace('.','')
print(df)
Result
column
0 None
Desired Result
column
0 Sam M
df['column'].str.replace('.', '', regex=False)
0 Sam M
Name: column, dtype: object
Because . is a regex special character, escape it with a backslash (a raw string avoids a warning about the '\.' escape) and it will work:
Solution:
df['column'].str.replace(r'\.', '', regex=True)
Example:
df['column'] = df['column'].str.replace(r'\.', '', regex=True)
print(df)
Output:
column
0 Sam M
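Both answers boil down to the same two options; a minimal side-by-side sketch (the explicit regex= keyword is assumed available, i.e. pandas 0.23 or later):

import pandas as pd

df = pd.DataFrame({'column': ['Sam M.']})

# Option 1: literal replacement -- tell pandas the pattern is not a regex
print(df['column'].str.replace('.', '', regex=False))
# Option 2: stay in regex mode and escape the dot (the raw string avoids
# a DeprecationWarning for the '\.' escape sequence)
print(df['column'].str.replace(r'\.', '', regex=True))
# Both print:
# 0    Sam M
# Name: column, dtype: object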

How to write a return value of a function into new column of a pandas dataframe

I have a pandas dataframe containing a column with strings (that are comma separated substrings). I want to remove some of the substrings and write the remaining ones to a new column in the same dataframe.
The code I have written to do this looks like this:
def remove_betas(df):
    for index, row in df.iterrows():
        list = row['Column'].split(',')
        if 'substring' in list:
            list.remove('beta-lactam')
            New = (',').join(list)
        elif not 'substring' in list:
            New = (',').join(Gene_list)
        return New
        df['NewColumn'].iloc[index] = New

df.apply(remove_betas, axis=1)
When I run it, my new column contains only zeros. The thought behind this code is to get each string for each row in df, split it at comma into substrings and search the resulting list for the substring I want to remove. After removal, I join the list back together into a string and write that to a new column of df, at the same index position as the corresponding row.
What do I have to change to write the resulting substrings to a new column in the desired manner?
EDIT
By the way, I have tried to write a lambda expression as in how to compute a new column based on the values of other columns in pandas - python, but I cannot really figure out how to do everything in one vectorized function.
I also tried replacing the substring with nothing (as in df.column.replace('x,?', '')), but that does not work since I have to count the lists later; the substring really has to be removed, as in list.remove('substring').
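For reference, here is a minimal sketch of the split-remove-join logic the asker describes, written as a function that returns the new value (so apply can collect it) rather than assigning inside the function; the sample data and column names are hypothetical, with 'beta-lactam' as the substring from the question:

import pandas as pd

df = pd.DataFrame({'Column': ['beta-lactam,amp,tet', 'amp,tet']})

def drop_substring(s, target='beta-lactam'):
    # Split on commas, drop the target item if present, re-join.
    parts = s.split(',')
    if target in parts:
        parts.remove(target)
    return ','.join(parts)

df['NewColumn'] = df['Column'].apply(drop_substring)
print(df)
#                 Column NewColumn
# 0  beta-lactam,amp,tet   amp,tet
# 1              amp,tet   amp,tet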
Why not employ a one-liner regex solution:
import re
import pandas as pd

df = pd.DataFrame({'col1': [3, 4, 5], 'col2': ['a,ben,c', 'a,r,ben', 'cat,dog'], 'col3': [1, 2, 3]})
#In [220]: df
#Out[220]:
# col1 col2 col3
#0 3 a,ben,c 1
#1 4 a,r,ben 2
#2 5 cat,dog 3
df['new'] = df.col2.apply(lambda x: re.sub(',?ben|ben,?', '', x))
#In [222]: df
#Out[222]:
# col1 col2 col3 new
#0 3 a,ben,c 1 a,c
#1 4 a,r,ben 2 a,r
#2 5 cat,dog 3 cat,dog
Or just use str.replace:
In [272]: df.col2.str.replace(',?ben|ben,?', '', case=False, regex=True)
Out[272]:
0 a,c
1 a,r
2 cat,dog
Name: col2, dtype: object
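One caveat worth flagging: ',?ben|ben,?' also strips 'ben' out of longer items such as 'bent' or 'benjamin'. If only whole comma-separated tokens should match, a hedged variant anchors the pattern to the separators:

# Remove 'ben' only when it is a complete comma-separated item
df['new'] = (df.col2
             .str.replace(r'(^|,)ben(?=,|$)', '', regex=True)
             .str.strip(','))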

Python, pandas: how to remove greater than sign

Let's say I have the following example DataFrame
from pandas import Series, DataFrame
df = DataFrame({'A':['1', '<2', '3']})
I would like to convert column A from string to integer. In the case of '<2', I'd like to simply drop the '<' sign and put 1 (the closest integer less than 2) in the second row. What's the most efficient way to do that? This is just an example; the actual data I'm working with has hundreds of thousands of rows.
Thanks for your help in advance.
You could use Series.apply:
import pandas as pd
df = pd.DataFrame({'A':['1', '<2', '3']})
df['A'] = df['A'].apply(lambda x: int(x[1:])-1 if x.startswith('<') else int(x))
print(df.dtypes)
# A int64
# dtype: object
yields
print(df)
A
0 1
1 1
2 3
[3 rows x 1 columns]
You can use applymap on the DataFrame and remove the "<" character if it appears in the string:
df.applymap(lambda x: x.replace('<',''))
Here is the output:
A
0 1
1 2
2 3
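Note that applymap only edits the strings; if the goal is integer dtype as asked, astype(int) can be chained on afterwards (though this turns '<2' into 2, not the 1 the question asks for):

df.applymap(lambda x: x.replace('<', '')).astype(int)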
Here are two other ways of doing this which may be helpful going forward!
from pandas import Series, DataFrame
df = DataFrame({'A':['1', '<2', '3']})
df.A.str.strip('<').astype(int)
Out[1]:
0 1
1 2
2 3
And this way would be helpful if you were trying to remove a character in the middle of your number (e.g. if you had a comma or something).
df = DataFrame({'A':['1', '1,002', '3']})
df.A.str.replace(',', '').astype(int)
Out[11]:
0 1
1 1002
2 3
Name: A, dtype: int64
>>> import re
>>> df.applymap(lambda x: int(re.sub(r'[^0-9.]', '', x)))
A
0 1
1 2
2 3
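With hundreds of thousands of rows, the same '<x' -> x-1 rule can also be expressed with vectorized string methods instead of a Python-level apply; a minimal sketch:

import pandas as pd

df = pd.DataFrame({'A': ['1', '<2', '3']})

# Flag the '<' rows, strip the sign, convert, then subtract 1
# wherever the original value carried a '<'.
lt = df['A'].str.startswith('<')
df['A'] = df['A'].str.lstrip('<').astype(int) - lt.astype(int)
print(df)
#    A
# 0  1
# 1  1
# 2  3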
