How to remove duplicated results of regular expression (re) in Python

There is a string:
str = 'Please Contact Prof. Zheng Zhao: Zheng.Z#xxx.com for details, or our HR: john.will#xxx.com'
I wanted to extract all of the email addresses in that string, so I set:
p = r'[\w\.]+#[\w\.]+'
re.findall(p, str)
And the result was:
['zheng.z#xxx.com', 'Zheng.Z#xxx.com', 'john.will#xxx.com']
Apparently, the first and the second are duplicated. How do we prevent this from happening?

You can remove duplicates using a set. A set is like an unordered list which can't contain duplicates. In this case, you don't care about case, so making the results lowercase will let you properly check for duplicates.
import re
s = 'Please Contact Prof. Zheng Zhao: Zheng.Z#xxx.com for details, or our HR: john.will#xxx.com'
p = r'[\w\.]+#[\w\.]+'
list(set(result.lower() for result in re.findall(p, s)))
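A set does not keep the order in which the addresses were found. If the order matters, a dict can be used instead, since dict keys are unique and keep insertion order (Python 3.7+). The following is a minimal sketch of that variant; the sample string and the helper name unique_emails are made up for illustration and are not from the question.

```python
import re

# Hypothetical sample text containing the same address in two spellings.
s = 'Contact zheng.z#xxx.com or Zheng.Z#xxx.com, or our HR: john.will#xxx.com'
p = r'[\w.]+#[\w.]+'

def unique_emails(text, pattern=p):
    """Return lowercased matches without duplicates, in first-seen order."""
    # dict.fromkeys keeps only the first occurrence of each key.
    seen = dict.fromkeys(m.lower() for m in re.findall(pattern, text))
    return list(seen)

result = unique_emails(s)
print(result)
```

Compared with list(set(...)), this always returns the addresses in the order in which they first appear in the text.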

Related

The phone number field contains U.S. phone numbers, and needs to be modified to the international format, with "+1-" in front of the phone number

I am using
import re
def transform_record(record):
    new_record = re.sub(r'(,[^a-zA-z])', r'\1+1-',record)
    return new_record
print(transform_record("Sabrina Green,802-867-5309,System Administrator"))
# Expected output: "Sabrina Green,+1-802-867-5309,System Administrator"
But I am getting this output:
Sabrina Green,8+1-02-867-5309,S+-ystem Administrator

The following works:
re.sub(r",([\d-]+)",r",+1-\1" ,record)

import re
def transform_record(record):
    new_record = re.sub(r',(?=\d)', r',+1-',record)
    return new_record
print(transform_record("Sabrina Green,802-867-5309,System Administrator"))
# Sabrina Green,+1-802-867-5309,System Administrator
print(transform_record("Eli Jones,684-3481127,IT specialist"))
# Eli Jones,+1-684-3481127,IT specialist
print(transform_record("Melody Daniels,846-687-7436,Programmer"))
# Melody Daniels,+1-846-687-7436,Programmer
print(transform_record("Charlie Rivera,698-746-3357,Web Developer"))
# Charlie Rivera,+1-698-746-3357,Web Developer

Some people, when confronted with a problem, think
“I know, I'll use regular expressions.” Now they have two problems.
def transform_record(record, number_field=1):
    fields = record.split(",")  # See note.
    if not fields[number_field].startswith("+1-"):
        fields[number_field] = "+1-" + fields[number_field]
    return ",".join(fields)
Note on the implementation above: you are probably working with CSV data. If so, you should use a proper CSV parser instead of just splitting on commas, because plain splitting goes wrong as soon as a field contains a quoted or escaped comma.
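As a rough sketch of what the note suggests, the same transformation can be done with the standard csv module, which handles quoted fields that contain commas. The function names add_country_code and transform_csv, the prefix parameter, and the second sample row are assumptions made for this illustration.

```python
import csv
import io

def add_country_code(row, number_field=1, prefix="+1-"):
    """Return a copy of the row with the prefix added to the phone field."""
    row = list(row)
    if not row[number_field].startswith(prefix):
        row[number_field] = prefix + row[number_field]
    return row

def transform_csv(text, number_field=1):
    """Parse CSV text, prefix the phone field of every row, return CSV text."""
    reader = csv.reader(io.StringIO(text))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in reader:
        writer.writerow(add_country_code(row, number_field))
    return out.getvalue()

# The second row has a quoted field with a comma inside it.
data = ('Sabrina Green,802-867-5309,System Administrator\n'
        '"Jones, Eli",684-3481127,IT specialist\n')
converted = transform_csv(data)
print(converted)
```

With record.split(","), the second row would be split into four fields and the prefix would end up on the wrong one; the csv reader keeps "Jones, Eli" as a single field.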

If your data is not well ordered, and you want to add +1- after any comma that is followed by a digit, you may use
re.sub(r',(?=\d)', r',+1-', record)
The ,(?=\d) pattern matches a comma first, and then the (?=\d) positive lookahead makes sure there is a digit right after it, without consuming the digit (so it remains in the result).
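To make the difference between a lookahead and a consumed character concrete, here is a small comparison on the sample record from the question. The variable names are only for illustration.

```python
import re

record = "Sabrina Green,802-867-5309,System Administrator"

# Lookahead: only the comma is matched, the digit is left untouched.
with_lookahead = re.sub(r",(?=\d)", ",+1-", record)

# Capturing group: the digit is consumed, so it must be put back with \1.
with_group = re.sub(r",(\d)", r",+1-\1", record)

# Capturing group without \1: the first digit of the number is lost.
digit_dropped = re.sub(r",(\d)", ",+1-", record)

print(with_lookahead)
print(with_group)
print(digit_dropped)
```

The first two calls give the same result; the third shows why a consumed digit has to be restored in the replacement.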

First, detect the pattern in the record text with r",(?=[0-9])". This matches a comma only when a digit follows it, so +1- is inserted after the comma and in front of the existing phone number.
For example, 345-345-34567 is converted to +1-345-345-34567.
import re
def transform_record(record):
    new_record = re.sub(r",(?=[0-9])",",+1-",record)
    return new_record
print(transform_record("Sabrina Green,802-867-5309,System Administrator"))
# Sabrina Green,+1-802-867-5309,System Administrator
print(transform_record("Eli Jones,684-3481127,IT specialist"))
# Eli Jones,+1-684-3481127,IT specialist
print(transform_record("Melody Daniels,846-687-7436,Programmer"))
# Melody Daniels,+1-846-687-7436,Programmer
print(transform_record("Charlie Rivera,698-746-3357,Web Developer"))
# Charlie Rivera,+1-698-746-3357,Web Developer

import re
def transform_record(record):
    new_record = re.sub(r"([\d-]+)",r"+1-\1",record)
    return new_record

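new_record = re.sub(r",([\d])",r",+1-\1",record)
This works for me. Note the \1 in the replacement: without it, the captured first digit of the phone number would be dropped.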

Here we want to match one or more digits, so \d goes in a character class followed by the + quantifier. In the replacement for re.sub, put "+1-" in front of the captured digits (\1):
new_record = re.sub(r',([\d]+)',r',+1-\1', record)

BeautifulSoup String Search

I have been googling and looking at other questions here about searching for a string in a BeautifulSoup object.
From what I found, the following should detect the string, but it doesn't:
strings = soup.find_all(string='Results of Operations and Financial Condition')
However, the following detects the string:
tags = soup.find_all('div',{'class':'info'})
for tag in tags:
    if re.search('Results of Operations and Financial Condition',tag.text):
        ''' Do Something'''
Why does one work and the other not?

You might want to use:
strings = soup.find_all(string=lambda x: 'Results of Operations and Financial Condition' in x)
This happens because find_all, when given a plain string, only matches text nodes that are exactly equal to it. I suppose you have some other text next to 'Results of Operations and Financial Condition' in the same node.
If you check the docs, you can see that you can pass a function as the string parameter, and it seems that the following lines are equivalent:
soup.find_all(string='Results of Operations and Financial Condition')
soup.find_all(string=lambda x: x == 'Results of Operations and Financial Condition')

For this code:
import re
import urllib.request
import bs4
page = urllib.request.urlopen('https://en.wikipedia.org/wiki/Alloxylon_pinnatum')
sp = bs4.BeautifulSoup(page, 'html.parser')
print(sp.find_all(string=re.compile('The pinkish-red compound flowerheads')))  # A compiled regex searches within text nodes.
print(sp.find_all(string='The pinkish-red compound flowerheads, known as'))
print(sp.find_all(string='The pinkish-red compound flowerheads, known as '))  # Note the space at the end of the string.
The results are:
['The pinkish-red compound flowerheads, known as ']
[]
['The pinkish-red compound flowerheads, known as ']
It looks like the string argument requires an exact match against the full value of an HTML text node; it does not check whether a text node merely contains that string. You can, however, pass a regular expression to test whether a text node contains some string, as shown in the code above.
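The three kinds of comparison discussed here can be illustrated in plain Python, without BeautifulSoup. This is only a sketch of the matching logic, not of the library's implementation: the list text_nodes is a hypothetical stand-in for the text nodes of a parsed document.

```python
import re

# Hypothetical text nodes, as they might come out of a parsed page.
text_nodes = [
    "The pinkish-red compound flowerheads, known as ",
    "Alloxylon pinnatum",
]
phrase = "The pinkish-red compound flowerheads, known as"  # no trailing space

# Exact comparison, analogous to passing a plain string.
exact = [t for t in text_nodes if t == phrase]

# Substring test, analogous to passing a function such as lambda x: phrase in x.
contains = [t for t in text_nodes if phrase in t]

# Regex search, analogous to passing a compiled pattern.
by_regex = [t for t in text_nodes if re.search(re.escape(phrase), t)]

print(exact, contains, by_regex)
```

The exact comparison finds nothing because of the single trailing space in the node, while the substring test and the regex search both find it.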

Regex to catch only the certain part of the string

Is there a universal regex to catch only the names of companies?
Q4_2017_American_Airlines_Group_Inc
Q1_2016_Apple_Inc
Q4_2014_Alcoa_Inc
Q3_2015_Arconic_Inc
Q3_2017_Orkla_ASA
Q2_2018_AGCO_Corp
Quarter_3_2018_Autodesk_Inc
Q4_2018_Control4_Corp
The output should be:
American_Airlines_Group_Inc
Apple_Inc
Alcoa_Inc
Arconic_Inc
Orkla_ASA
AGCO_Corp
Autodesk_Inc
Note:
The name of the company may contain symbols or numbers

You can use this regex,
[a-zA-Z]+(?:_[a-zA-Z]+)*$
Your company names all consist of alphabetical words separated by underscores and running to the end of the string, so the above regex will work fine for them.
Here, [a-zA-Z]+ matches the first alphabetical word of the company name, (?:_[a-zA-Z]+)* matches any further alphabetical words that are preceded by an underscore, and $ ensures that the match extends to the end of the string.
Python code,
import re
arr = ['Q4_2017_American_Airlines_Group_Inc','Q1_2016_Apple_Inc','Q4_2014_Alcoa_Inc','Q3_2015_Arconic_Inc','Q3_2017_Orkla_ASA','Q2_2018_AGCO_Corp','Quarter_3_2018_Autodesk_Inc']
for s in arr:
    m = re.search(r'[a-zA-Z]+(?:_[a-zA-Z]+)*$', s)
    print(s, '-->', m.group())
Prints,
Q4_2017_American_Airlines_Group_Inc --> American_Airlines_Group_Inc
Q1_2016_Apple_Inc --> Apple_Inc
Q4_2014_Alcoa_Inc --> Alcoa_Inc
Q3_2015_Arconic_Inc --> Arconic_Inc
Q3_2017_Orkla_ASA --> Orkla_ASA
Q2_2018_AGCO_Corp --> AGCO_Corp
Quarter_3_2018_Autodesk_Inc --> Autodesk_Inc
Also, if you have all of those company names in a single multi-line string, you can use the following code, which uses re.findall with the multiline flag (?m) to list all company names,
import re
s = '''Q4_2017_American_Airlines_Group_Inc
Q1_2016_Apple_Inc
Q4_2014_Alcoa_Inc
Q3_2015_Arconic_Inc
Q3_2017_Orkla_ASA
Q2_2018_AGCO_Corp
Quarter_3_2018_Autodesk_Inc'''
print(re.findall(r'(?m)[a-zA-Z]+(?:_[a-zA-Z]+)*$', s))
Prints,
['American_Airlines_Group_Inc', 'Apple_Inc', 'Alcoa_Inc', 'Arconic_Inc', 'Orkla_ASA', 'AGCO_Corp', 'Autodesk_Inc']
Edit:
As Chyngyz Akmatov pointed out, if the name can contain numbers or, in general, any symbol, then the following regex extracts the name properly. It assumes the company name starts right after the four-digit year and the underscore that follows it.
(?<=\d{4}_).*$
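The following sketch compares the letters-only regex with the lookbehind version on three of the names from the question, including Q4_2018_Control4_Corp, whose company name contains a digit. The variable names are chosen for illustration.

```python
import re

names = [
    "Q4_2017_American_Airlines_Group_Inc",
    "Quarter_3_2018_Autodesk_Inc",
    "Q4_2018_Control4_Corp",  # company name contains a digit
]

letters_only = re.compile(r"[a-zA-Z]+(?:_[a-zA-Z]+)*$")
after_year = re.compile(r"(?<=\d{4}_).*$")

by_letters = [letters_only.search(n).group() for n in names]
by_year = [after_year.search(n).group() for n in names]

print(by_letters)
print(by_year)
```

The letters-only pattern returns just Corp for the last name, because the digit in Control4 breaks the run of alphabetical words, while the lookbehind pattern returns Control4_Corp.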

You can use re.sub (here, content is assumed to hold the input lines as a single newline-separated string):
import re
data = [re.sub(r'\w+\d{4}_', '', i) for i in filter(None, content.split('\n'))]
Output:
['American_Airlines_Group_Inc', 'Apple_Inc', 'Alcoa_Inc', 'Arconic_Inc', 'Orkla_ASA', 'AGCO_Corp', 'Autodesk_Inc']

You can also use this regex:
_\d+(?:_\d+)*_(.*)
Code:
import re
lst = ['Q4_2017_American_Airlines_Group_Inc', 'Q1_2016_Apple_Inc', 'Q4_2014_Alcoa_Inc', 'Q3_2015_Arconic_Inc', 'Q3_2017_Orkla_ASA', 'Q2_2018_AGCO_Corp', 'Quarter_3_2018_Autodesk_Inc']
for x in lst:
    print(re.search(r'_\d+(?:_\d+)*_(.*)', x).group(1))
# American_Airlines_Group_Inc
# Apple_Inc
# Alcoa_Inc
# Arconic_Inc
# Orkla_ASA
# AGCO_Corp
# Autodesk_Inc

Assuming the names contain only plain letters and underscores and are at the end of each line:
grep -o '[A-Za-z][A-Za-z_]*$' names

Multiple distinct replaces using RegEx

I am trying to write some Python code that will replace some unwanted string using RegEx. The code I have written has been taken from another question on this site.
I have a text:
text_1=u'I\u2019m \u2018winning\u2019, I\u2019ve enjoyed none of it. That\u2019s why I\u2019m withdrawing from the market,\u201d wrote Arment.'
I want to remove all the \u2019m, \u2019s, \u2019ve and so on.
The code that I've written is given below:
rep={"\n":" ","\n\n":" ","\n\n\n":" ","\n\n\n\n":" ",u"\u201c":"", u"\u201d":"", u"\u2019[a-z]":"", u"\u2013":"", u"\u2018":""}
rep = dict((re.escape(k), v) for k, v in rep.iteritems())
pattern = re.compile("|".join(rep.keys()))
text = pattern.sub(lambda m: rep[re.escape(m.group(0))], text_1)
The code works perfectly for:
u"\u201c":"", u"\u201d":"", u"\u2013":"" and u"\u2018":""
However, it doesn't work for:
u"\u2019[a-z]": re.escape turns the [a-z] part into \[a\-z\], which matches only the literal text [a-z], not a letter.
The output I am looking for is:
text_1=u'I winning, I enjoyed none of it. That why I withdrawing from the market,wrote Arment.'
How do I achieve this?

The information about the newlines completely changes the answer. For this, I think building the expression using a loop is actually less legible than just using better formatting in the pattern itself.
import re
replacements = {'newlines': ' ',
                'deletions': ''}
# [a-z]* after \u2019 also removes multi-letter suffixes such as 've
pattern = re.compile(u'(?P<newlines>\n+)|'
                     u'(?P<deletions>\u201c|\u201d|\u2019[a-z]*|\u2013|\u2018)')
def lookup(match):
    # lastgroup is the name of the group that matched
    return replacements[match.lastgroup]
text = pattern.sub(lookup, text_1)

The problem here is actually the escaping. This code does what you want more directly:
remove = (u"\u201c", u"\u201d", u"\u2019[a-z]*", u"\u2013", u"\u2018")
pattern = re.compile("|".join(remove))
text = pattern.sub("", text_1)
I've made the letters after \u2019 optional ([a-z]*), as I suppose that's what you want given your test string: this removes a bare closing quote as well as suffixes such as \u2019m and \u2019ve.
For completeness, I think I should also mention the Unidecode package, which may actually be closer to what you're trying to achieve by removing these characters.
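As a check, here is a Python 3 sketch that applies an unescaped alternation to the text from the question. It uses [a-z]* after \u2019 so that a two-letter suffix such as 've is removed in full; the function name strip_marks is an assumption made for this example.

```python
import re

text_1 = ('I\u2019m \u2018winning\u2019, I\u2019ve enjoyed none of it. '
          'That\u2019s why I\u2019m withdrawing from the market,\u201d wrote Arment.')

# Each entry is a regex fragment, so it must NOT be passed through re.escape.
remove = ("\u201c", "\u201d", "\u2019[a-z]*", "\u2013", "\u2018")
pattern = re.compile("|".join(remove))

def strip_marks(text):
    """Delete curly quotes, en dashes and apostrophe suffixes from the text."""
    return pattern.sub("", text)

cleaned = strip_marks(text_1)
print(cleaned)
```

Note that the space after the closing quotation mark is kept, so the result reads "market, wrote Arment." rather than "market,wrote Arment."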

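If the text contains literal backslash escape sequences (a backslash followed by characters such as u2019m) rather than the actual Unicode characters, the simplest way is this regex, which replaces everything from a backslash up to the next space with a single space:
X = re.compile(r'(\\)(.*?) ')
text = re.sub(X, ' ', text_1)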

Regex Python firstname lastname tag the keywords in json dump

I have a JSON file and a list of keywords, and I need to match the keywords against the JSON dump.
The keywords (data filter terms) are candidate names:
Hillary Clinton
Bernie Sanders
Jeb Bush
Donald Trump
John Kasich
Marco Rubio
Scott Walker
I need to match these keywords so that the search finds 'Scott Walker' as well as 'Scott' and 'Walker' independently,
and I need to tag the matches in the JSON dump.
Can anyone help me with this?
I wrote some pseudocode for this:
import re
json_pages = open('/home/Desktop/arti.json','r')
filterd_pages = []
for page in json_pages:
    text = page['text']
    matches = re.match('Hillary Clinton', text)
    if matches:
        page['matched_keyword'] = matches.group()
        filterd_pages.append(page)
dump_json(filterd_pages)
And a second attempt:
f = open('/home/soundarya/Documents/synapsifyone.json')
json_response = json.loads(f.read())
keywords = ['Hillary Clinton ', 'Bernie Sanders', 'Jeb Bush','Donald Trump','John Kasich','Marco Rubio','Scott Walker']
for k, v in json_response.iteritems():
    if k in keywords:
        print(v)
        break
How do I tag the keywords in the JSON dump?
I have crawled a lot of data (posts from nearly 30 URLs) using the Diffbot tool and got a JSON file as output. From this JSON file I have to match the keywords (first name, last name, or first name plus last name) and tag them at the end of each dict in the list, or return the sentences that contain 'hillary', 'clinton' or 'Hillary Clinton'.

You can create a list of regular expressions, one for each term. The idea is to construct a regular expression that matches either the whole term or any word in it.
import re
term_regexes = []
term_tags = []
for term in term_file:
    term = term.strip()  # drop the trailing newline read from the file
    term_matchers = [term] + term.split()
    term_regexes.append('|'.join(term_matchers))
    term_tags.append(term)
We are creating a list to hold the regular expressions and another for holding the tags.
term_file contains each term. For each of them, we construct the regular expression that matches either the term or any of its parts. This corresponds to the union of the expressions matching each one of them, using the union ("or") regex operator ("|"). For instance, the expression "Hillary Clinton|Hillary|Clinton" would do the job for your example.
Finally, we iterate the list of dictionaries, search for a match of any of the terms and tag when found:
for d in dict_list:
    # Search each term.
    for term_regex, tag in zip(term_regexes, term_tags):
        # re.search finds the term anywhere in the text; re.match would only test the start.
        if re.search(term_regex, d['text'], re.IGNORECASE):
            d['tag'] = tag
            break
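Putting the pieces together, a self-contained sketch of this approach might look as follows. The sample pages, the function names build_matchers and tag_pages, and the matched_keywords key are all made up for illustration; the word boundaries (\b) and re.escape are additions that are not in the answer above, included so that a name is not matched inside a longer word and so that any special characters in a name are treated literally.

```python
import re

# A few of the candidate names from the question.
candidates = ["Hillary Clinton", "Bernie Sanders", "Scott Walker"]

def build_matchers(names):
    """Build one compiled regex per name: full name, first name or last name."""
    matchers = []
    for name in names:
        parts = [name] + name.split()
        alternation = "|".join(re.escape(p) for p in parts)
        regex = re.compile(r"\b(?:%s)\b" % alternation, re.IGNORECASE)
        matchers.append((name, regex))
    return matchers

def tag_pages(pages, matchers):
    """Return copies of the pages that mention a name, tagged with every match."""
    tagged = []
    for page in pages:
        tags = [name for name, rx in matchers if rx.search(page.get("text", ""))]
        if tags:
            tagged.append(dict(page, matched_keywords=tags))
    return tagged

# Hypothetical pages standing in for the crawled JSON records.
pages = [
    {"text": "Clinton spoke in Iowa on Monday."},
    {"text": "Weather report: sunny."},
    {"text": "scott walker and Bernie Sanders debated."},
]
tagged = tag_pages(pages, build_matchers(candidates))
print(tagged)
```

Unlike the loop above, this version records every matching candidate for a page instead of stopping at the first one; which behaviour is wanted depends on how the tags are used afterwards.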
