search for regex in text using python - python

I want to search for region in s1 . I want to return 1 if i the text contains "region" or "région" or "regions" or "régions" and 0 in the other case.
i wrote the code below but it does'nt work
s1 = pd.Series(['here is region', 'my regions', 'régionally', 'région','régions','regions','region'])
s1.str.contains('r.gion[s][^a-zA-Z]', regex=True).astype(int)
In this case the result must be
[1,1,0,1,1,1,1]

You may use
s1.str.contains(r'\br[ée]gions?\b').astype(int)
If you want to save the regex in a file and then read in and use as a variable just write \br[ée]gions?\b there.
Test:
>>> import pandas as pd
>>> s1 = pd.Series(['here is region', 'my regions', 'régionally', 'région','régions','regions','region'])
>>> s1.str.contains(r'\br[ée]gions?\b').astype(int)
0 1
1 1
2 0
3 1
4 1
5 1
6 1
dtype: int32
Details
\b - a word boundary
r - r char
[ée] - one of the letters in the character class
gion - gion
s? - an optional s letter
\b - a word boundary.

Related

PYTHON - .replace function

I have a DF similar to the below:
Name
Text
Michael
66l additional text
John
55i additional text
Mary
88l additional text
What I want to do is anywhere "l" occurs in the first string of the "Text" column, then replace it with "P"
Current code
DF['Text'] = DF['Text'].replace({"l", "P", 1})
Desired Outcome
Name
Text
Michael
66P additional text
John
55i additional text
Mary
88P additional text
You can use pandas.Series.str.replace with regex to identify the first word of the string.
>>> import pandas as pd
>>>
>>>
>>> df
Text
0 66l additional text
1 55i additional text
2 88l additional text
>>>
>>>
>>> df['Text'] = df['Text'].str.replace(r"^\w+\b", lambda x: x.group(0).replace("l", "P"), regex=True)
>>> df
Text
0 66P additional text
1 55i additional text
2 88P additional text
Asssuming the l only occurs once (as is shown in your sample dataframe) you can use
df['Text'].str.replace(r'^(\S*)l', r'\1P', regex=True)
# => 0 66P additional text
# 1 55i additional text
# 2 88P additional text
# Name: Text, dtype: object
See the regex demo. Details:
^ - start of string
(\S*) - Group 1: zero or more whitespaces
l - an l char (letter).
The replacement is \1P, i.e. the Group 1 value + P letter.
With your shown samples only, this could be easily done by using str[range] functionality of Python pandas, with your shown samples of DataFrame please try following code.
import pandas as pd
##Create your df here....
df['Text'] = df['Text'].str[:2] + 'P ' + df['Text'].str[4:]
Explanation:
df['Text'].str[:2]: Taking(printing) from 1st position of column Text to till 3rd position(it starts from 0).
+ 'P ' +: Adding/concatenating P to it as per OP's requirement in question here.
df['Text'].str[4:]: Taking(printing) from 5th position of column Text to till end of column's value here and saving this whole df['Text'].str[:2] + 'P ' + df['Text'].str[4:] code's output into Text column itself of DataFrame.

Regex not working properly for some cases (python)?

I have a data frame where one column has string values and the other has integers but those columns have special characters with it or the string data has integers with it. So to remove it I used regex my regex is working fine but for the integer column, if 'abc123' is then it is not removing the abc and same with string column if '123abc' is there then it is not removing it. I don't know if the pattern or is wrong or the code is wrong. Below is my code,
d = [['abc','123'],['gbc#*','23abc'],['abc123','abc200'],['124abc','1230&*'],['abcer£$%&*!','230!?*&']]
df1= pd.DataFrame(d, columns=['str','int'])
print(df1)
str int
0 abc 123
1 gbc#* 23abc
2 abc123 abc200
3 124abc 1230&*
4 abcer£$%&*! 230!?*&
num = r'\d+$'
alpha = r'[a-zA-Z]+$'
wrong = df1[~df1['int'].str.contains(num, na=True)]
correct_int = [re.sub(r'([^\d]+?)', '', item) for item in wrong['int']]
print(correct_int)
wrong_str = df1[~df1['str'].str.contains(alpha, na=True)]
correct_str = [re.sub(r'([^a-zA-Z ]+?)', '', item) for item in df1['str']]
print(correct_str)
Output:
correct_int: ['23', '1230', '230']
As you can see it removed for '23abc','1230&*','230!?*&' but not for 'abc200' as the string was coming first
correct_str: ['abc', 'gbc', 'abc', 'abc', 'abcer']
now it removed for all but sometimes it's not removing when the value is '124abc'
Is my pattern wrong? I have also tried giving different patterns but nothing worked
I am removing the integers and special characters in the column 'str' and removing string values and special characters in column 'int'
Expected output:
Once after cleaning and replacing with the old with the cleaned values the output would look like this.
str int
0 abc 123
1 gbc 23
2 abc 200
3 abc 1230
4 abcer 230
You can do it with
df1['str'] = df1['str'].str.replace(r"[\d\W+]", '') # replaces numbers (\d) and non-word characters (\W) with empty strings
df1['int'] = df1['int'].str.replace(r"\D+", '') # replaces any non-decimal digit character (like [^0-9])
Returns:
str int
0 abc 123
1 gbc 23
2 abc 200
3 abc 1230
4 abcer 230
Try the following:
'\D' represents any non digit value, substitute those with empty string '' in int column
[^a-zA-Z] represents any character not in the range a-z and A-Z, substitute those with empty string '' in str column
Apply these transformations to both columns using .apply() and a lambda function
import pandas as pd
import re
d = [['abc','123'],['gbc#*','23abc'],['abc123','abc200'],['124abc','1230&*'],['abcer£$%&*!','230!?*&']]
df1= pd.DataFrame(d, columns=['str','int'])
df1['int'] = df1['int'].apply(lambda r: re.sub('\D', '', r))
df1['str'] = df1['str'].apply(lambda r: re.sub('[^a-zA-Z]', '', r))
print(df1)
Output:
str int
0 abc 123
1 gbc 23
2 abc 200
3 abc 1230
4 abcer 230

Python Pandas handle special characters in strings

I write a function which I want to apply to a dataframe later.
def get_word_count(text,df):
#text is a lowercase list of words
#df is a dataframe with 2 columns: word and count
#this function updates the word counts
#f=open('stopwords.txt','r')
#stopwords=f.read()
stopwords='in the and an - '
for word in text:
if word not in stopwords:
if df['word'].str.contains(word).any():
df.loc[df['word']==word, 'count']=df['count']+1
else:
df.loc[0]=[word,1]
df.index=df.index+1
return df
Then I check it:
word_df=pd.DataFrame(columns=['word','count'])
sentence1='[first] - missing "" in the text [first] word'.split()
y=get_word_count(sentence1, word_df)
sentence2="error: wrong word in the [second] text".split()
y=get_word_count(sentence2, word_df)
y
I get the following results:
Word Count
[first] 2
missing 1
"" 1
text 2
word 2
error: 1
wrong 1
So where is [second] from the sentence2?
If I omit one of square brackets I get an error message. How do I handle words with special characters? Note that I don't want to get rid of them because if I do, I will miss "" in the sentence1.
The problem comes from the line:
if df['word'].str.contains(word).any():
This reports if any of the words in the word column contains the given word. The DataFrame from df['word'].str.contains(word) reports True when [second] is given and compared to specifically [first].
For a quick fix, I changed the line to:
if word in df['word'].tolist():
Creating a DataFrame in a loop like that is not recommended, you should do something like this:
stopwords='in the and an - '
sentence = sentence1+sentence2
df = pd.DataFrame([sentence.split()]).T
df.rename(columns={0: 'Words'}, inplace=True)
df = df.groupby(by=['Words'])['Words'].size().reset_index(name='counts')
df = df[~df['Words'].isin(stopwords.split())]
print(df)
Words counts
0 "" 1
2 [first] 2
3 [second] 1
4 error: 1
6 missing 1
7 text 2
9 word 2
10 wrong 1
I have rebuild it in a way you can add sentences and see the frequency growing
from collections import Counter
from collections import defaultdict
import pandas as pd
def terms_frequency(corpus, stop_words=None):
'''
Takes in texts and returns a pandas DataFrame of words frequency
'''
corpus_ = corpus.split()
# remove stop wors
terms = [word for word in corpus_ if word not in stop_words]
terms_freq = pd.DataFrame.from_dict(Counter(terms), orient='index').reset_index()
terms_freq = terms_freq.rename(columns={'index':'word', 0:'count'}).sort_values('count',ascending=False)
terms_freq.reset_index(inplace=True)
terms_freq.drop('index',axis=1,inplace=True)
return terms_freq
def get_sentence(sentence, storage, stop_words=None):
storage['sentences'].append(sentence)
corpus = ' '.join(s for s in storage['sentences'])
return terms_frequency(corpus,stop_words)
# tests
STOP_WORDS = 'in the and an - '
storage = defaultdict(list)
S1 = '[first] - missing "" in the text [first] word'
print(get_sentence(S1,storage,STOP_WORDS))
print('\nNext S2')
S2 = 'error: wrong word in the [second] text'
print(get_sentence(S2,storage,STOP_WORDS))

Extract multiple Words using regex on excel data

I have a dataset in Excel:
column1
Bank A : 12
Bank B : 40
Bank C : 55
where it contains only a single row with Bank A, B and C information inside one cell.
How would I be able to use regex in Python to create 3 columns whereby my new dataset is:
Bank A Bank B Bank C
12 40 55
Thank You!
You could do it with the following regex:
(.*?)\s:\s(\d+)
Regex Demo
Or using this regex to be more forgiving with the spaces before and after the :
(.*?)(?:\s+)?:(?:\s+)?(\d+)
Regex Demo
Explanation:
(.*?) # For Group 1, match every character
\s:\s # until reaching a space + : + space
(\d+) # For Group 2, match every digit
Then with your python code you can access contents of Group 1 and 2 using the Match.group() method and build the columns as you need.

Using Reg Ex to Match Strings in a Data Frame and Replace - python

i have data frame that looks like this
value
0 A067-M4FL-CAA-020
1 MRF2-050A-TFC,60 ,R-12,HT
2 moreinfo
3 MZF8-050Z-AAB
4 GoCats
5 MZA2-0580-TFD,60 ,R-669,LT
i want to be able to strip ,60 ,R-12,HT using regex and also deletes the moreinfo and GoCats rows from the df.
My expected Results:
value
0 A067-M4FL-CAA-020
1 MRF2-050A-TFC
2 MZF8-050Z-AAB
3 MZA2-0580-TFD
I first removed the strings
del = ['hello', 'moreinfo']
for i in del:
df = df[value!= i]
Can somebody suggest a way to use regex to match and delete all case that do match A067-M4FL-CAA-020 or MZF8-050Z-AAB pattern so i don't have to create a list for all possible cases?
I was able to strip a single line like this but i want to be able to strip all matching cases in the dataframe
pattern = r',\w+ \,\w+-\w+\,\w+ *'
line = 'MRF2-050A-TFC,60 ,R-12,HT'
for i in re.findall(pattern, line):
line = line.replace(i,'')
>>> MRF2-050A-TFC
I tried adjusting my code but it prints out the same output for each row
pattern = r',\w+ \,\w+-\w+\,\w+ *'
for d in df:
for i in re.findall(pattern, d):
d = d.replace(i,'')
Any suggestions will be greatly appreciated. Thanks
You may try this
(?:\w+-){2,}[^,\n]*
Demo
Python scripts may be as follows
ss="""0 A067-M4FL-CAA-020
1 MRF2-050A-TFC,60 ,R-12,HT
2 moreinfo
3 MZF8-050Z-AAB
4 GoCats
5 MZA2-0580-TFD,60 ,R-669,LT"""
import re
regx=re.compile(r'(?:\w+-){2,}[^,\n]*')
m= regx.findall(ss)
for i in range(len(m)):
print("%d %s" %(i, m[i]))
and the output is
0 A067-M4FL-CAA-020
1 MRF2-050A-TFC
2 MZF8-050Z-AAB
3 MZA2-0580-TFD
Here's a simpler approach you can try without using regex. pandas has many in-built functions to deal with text data.
# remove unwanted values
df['value'] = df.value.str.replace(r'moreinfo|60|R-.*|HT|GoCats|\,', '')
# drop na
df = df[(df != '')].dropna()
# print
print(df)
value
0 A067-M4FL-CAA-020
1 MRF2-050A-TFC
3 MZF8-050Z-AAB
5 MZA2-0580-TFD
-----------
# data used
df = pd.read_fwf(StringIO(u'''
value
0 A067-M4FL-CAA-020
1 MRF2-050A-TFC,60 ,R-12,HT
2 moreinfo
3 MZF8-050Z-AAB
4 GoCats
5 MZA2-0580-TFD,60 ,R-669,LT'''),header=1)
I'd suggest capturing the data you DO want, since it's pretty particular, and the data you do NOT want could be anything.
Your pattern should look something like this:
^\w{4}-\w{4}-\w{3}(?:-\d{3})?
https://regex101.com/r/NtH2Ut/2
I'd recommend being more specific than \w where possible. (Like ^[A-Z]\w{3}) if you know the beginning four character chunk should start with a letter.
edit
Sorry, I may not have read your input and output literally enough:
https://regex101.com/r/NtH2Ut/3
^(?:\d+\s+\w{4}-\w{4}-\w{3}(?:-\d{3})?)|^\s+.*

Categories

Resources