Finding duplicate words in a string and printing them using re - Python

I need some help printing the duplicated last names in a text file (lowercase and uppercase should be treated as the same).
The program should not print names that contain digits (i.e. if a digit appears in the last name or in the first name, the whole name is ignored).
For example, my text file is:
Assaf Spanier, Assaf Din, Yo9ssi Levi, Yoram bibe9rman, David levi, Bibi Netanyahu, Amnon Levi, Ehud sPanier, Barak Spa7nier, Sara Neta4nyahu
the output should be:
Assaf
Assaf
David
Bibi
Amnon
Ehud
========
Spanier
Levi
import re
def delete_numbers(line):
    words = re.sub(r'\w*\d\w*', '', line).strip()
    for t in re.split(r',', words):
        if len(t.split()) == 1:
            words = re.sub(t, '',words)
    words = re.sub(',,', '', words)
    return words
fname = input("Enter file name: ")
file = open(fname,"r")
for line in file.readlines():
    words = delete_numbers(line)
    first_name = re.findall(r"([a-zA-Z]+)\s",words)
    for i in first_name:
        print(i)
    print("***")
    a = ""
    for t in re.split(r',', words):
        a+= (", ".join(t.split()[1:])) + " "

OK, first let's start with an aside: opening files in an idiomatic way. Use the with statement, which guarantees your file will be closed. For small scripts this isn't a big deal, but if you ever start writing longer-lived programs, leaked file handles from files that were never closed can come back to haunt you. Since your file has everything on a single line:
with open(fname) as f:
    data = f.read()
The file is now closed. This also encourages you to deal with your file immediately, and not leave it open consuming resources unnecessarily. Another aside: let's suppose you did have multiple lines. Instead of using for line in f.readlines(), use the following construct:
with open(fname) as f:
    for line in f:
        do_stuff(line)
Since you don't actually need to keep the whole file, and only need to inspect each line, don't use readlines(). Only use readlines() if you want to keep a list of lines around, something like lines = f.readlines().
OK, finally, data will look something like this:
>>> print(data)
Assaf Spanier, Assaf Din, Yo9ssi Levi, Yoram bibe9rman, David levi, Bibi Netanyahu, Amnon Levi, Ehud sPanier, Barak Spa7nier, Sara Neta4nyahu
Ok, so if you want to use regex here, I suggest the following approach:
>>> names_regex = re.compile(r"^(\D+)\s(\D+)$")
The pattern here, ^(\D+)\s(\D+)$, uses the non-digit class, \D (the opposite of \d, the digit class), and the whitespace class, \s. It also uses anchors, ^ and $, to anchor the pattern to the beginning and end of the text respectively, and the parentheses create capturing groups, which we will leverage. Try copy-pasting this into http://regexr.com/ and play around with it if you still don't understand. One important note: use raw strings, i.e. r"this is a raw string" rather than normal strings, "this is a normal string" (notice the r). This is because Python strings use some of the same escape characters as regex patterns. This will help maintain your sanity. OK, finally, I suggest using the grouping idiom, with a dict:
>>> grouper = {}
Now, our loop:
>>> for fullname in data.split(','):
...     match = names_regex.search(fullname.strip())
...     if match:
...         first, last = match.group(1), match.group(2)
...         grouper.setdefault(last.title(), []).append(first.title())
...
Note, I used the .title method to normalize all our names to "Titlecase". dict.setdefault takes a key as its first argument, and if the key doesn't exist, it sets the second argument as the value and returns it. So, I am checking whether the last name, in title case, exists in the grouper dict, and if not, setting it to an empty list, [], then appending to whatever is there!
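As an aside, the same grouping idiom can also be written with collections.defaultdict, which creates the empty list on first access. This is only a minimal sketch; the pairs list is made up here to stand in for the regex matches:

```python
from collections import defaultdict

# Hypothetical (first, last) pairs standing in for the regex matches.
pairs = [("Assaf", "Spanier"), ("David", "levi"), ("Amnon", "Levi"), ("Ehud", "sPanier")]

grouper = defaultdict(list)  # a missing key starts out as an empty list
for first, last in pairs:
    grouper[last.title()].append(first.title())

print(dict(grouper))
```

Whether you prefer this over dict.setdefault is mostly a matter of taste; setdefault keeps a plain dict, while defaultdict saves you from repeating the default on every call.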
Now pretty-printing for clarity:
>>> from pprint import pprint
>>> pprint(grouper)
{'Din': ['Assaf'],
 'Levi': ['David', 'Amnon'],
 'Netanyahu': ['Bibi'],
 'Spanier': ['Assaf', 'Ehud']}
This is a very useful data-structure. We can, for example, get all last-names with more than a single first name:
>>> for last, firsts in grouper.items():
...     if len(firsts) > 1:
...         print(last)
...
Spanier
Levi
So, putting it all together:
>>> grouper = {}
>>> names_regex = re.compile(r"^(\D+)\s(\D+)$")
>>> for fullname in data.split(','):
...     match = names_regex.search(fullname.strip())
...     if match:
...         first, last = match.group(1), match.group(2)
...         first, last = first.title(), last.title()
...         print(first)
...         grouper.setdefault(last, []).append(first)
...
Assaf
Assaf
David
Bibi
Amnon
Ehud
>>> for last, firsts in grouper.items():
...     if len(firsts) > 1:
...         print(last)
...
Spanier
Levi
Note, I have assumed order doesn't matter, so I used a normal dict. My output happens to be in the correct order because on Python 3.6, dicts are ordered! In CPython 3.6 that is an implementation detail rather than a guarantee; from Python 3.7 on, insertion order is part of the language. Use collections.OrderedDict if you need guaranteed order on older versions.

Fine, since you insist on doing it using regex, you should strive to do it in a single pass so you don't pay the overhead of repeated calls into the regex engine. The best approach would be to write a pattern that captures all first/last names that don't include digits, separated by commas, let the regex engine capture them all, then iterate over the matches and, finally, map them to a dictionary so you end up with a last name => first names map:
import collections
import re
text = "Assaf Spanier, Assaf Din, Yo9ssi Levi, Yoram bibe9rman, David levi, " \
       "Bibi Netanyahu, Amnon Levi, Ehud sPanier, Barak Spa7nier, Sara Neta4nyahu"
full_name = re.compile(r"(?:^|\s|,)([^\d\s]+)\s+([^\d\s]+)(?=$|,)")  # compile the pattern
matches = collections.OrderedDict()  # store for the last=>first name map preserving order
for match in full_name.finditer(text):
    first_name = match.group(1)
    print(first_name)  # print the first name to match your desired output
    last_name = match.group(2).title()  # capitalize the last name for case-insensitivity
    if last_name in matches:  # repeated last name
        matches[last_name].append(first_name)  # add the first name to the map
    else:  # encountering this last name for the first time
        matches[last_name] = [first_name]  # initialize the map for this last name
print("========")  # print the separator...
# finally, print all the repeated last names to match your format
for k, v in matches.items():
    if len(v) > 1:  # print only those with more than one first name attached
        print(k)
And this will give you:
Assaf
Assaf
David
Bibi
Amnon
Ehud
========
Spanier
Levi
In addition, you have the full last name => first names match in matches.
When it comes to the pattern, let's break it down piece by piece:
(?:^|\s|,) - match the beginning of the string, whitespace or a comma (non-capturing)
([^\d\s]+) - followed by one or more characters that are not digits or whitespace (capturing)
\s+ - followed by one or more whitespace characters (non-capturing)
([^\d\s]+) - followed by the same pattern as for the first name (capturing)
(?=$|,) - followed by a comma or the end of the string (look-ahead, non-capturing)
The two captured groups (first and last name) are then referenced in the match object when we iterate over matches. Easy-peasy.
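To see the pieces of that breakdown at work, here is a small sketch on a shortened, made-up input. The look-ahead is written as (?=$|,), so a digit-free name at the very end of the string is picked up as well:

```python
import re

# Same structure as the pattern discussed above, look-ahead written as (?=$|,).
full_name = re.compile(r"(?:^|\s|,)([^\d\s]+)\s+([^\d\s]+)(?=$|,)")

sample = "Assaf Spanier, Yo9ssi Levi, David levi"  # shortened sample input
pairs = [(m.group(1), m.group(2)) for m in full_name.finditer(sample)]
print(pairs)  # [('Assaf', 'Spanier'), ('David', 'levi')]
```

The name containing a digit is skipped, and the last pair is matched through the end-of-string branch of the look-ahead rather than the comma branch.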

Related

How do I make my code remove the sender names found in the messages saved in a txt file and the tags using regex

Given this dialogue between a sender and a receiver on Discord, I need to remove the tags and the names of the speakers. Removing everything before the colon (:) would do, since then the sender's name would not matter and I would always delete whoever sent the message.
This is the content of the generic_discord_talk.txt file:
Company: <#!808947310809317387> Good morning, technical secretary of X-company, will Maria attend you, how can we help you?
Customer: Hi <#!808947310809317385>, I need you to help me with the order I have made
Company: Of course, she tells me that she has placed an order through the store's website and has had a problem. What exactly is Maria about?
Customer: I add the product to the shopping cart and nothing happens <#!808947310809317387>
Company: Does Maria have the website still open? So I can accompany you during the purchase process
Client: <#!808947310809317387> Yes, I have it in front of me
import collections
import pandas as pd
import matplotlib.pyplot as plt #to then graph the words that are repeated the most
archivo = open('generic_discord_talk.txt', encoding="utf8")
a = archivo.read()
with open('stopwords-es.txt') as f:
    st = [word for line in f for word in line.split()]
print(st)
stops = set(st)
stopwords = stops.union(set(['you','for','the'])) #OPTIONAL
#print(stopwords)
I have created a regex to detect the tags
regex = re.compile("^(<#!.+>){,1}\s{,}(messegeA|messegeB|messegeC)(<#!.+>){,1}\s{,}$")
regex_tag = re.compile("^<#!.+>")
I need print(st) to give me the words, but without the senders and without the tags.
You could remove both parts using an alternation |, matching either from the start of the line to the first occurrence of a colon, or from <#! to the first closing >.
^[^:\n]+:\s*|\s*<#![^<>]+>
The pattern matches:
^ Start of string
[^:\n]+:\s* Match 1+ occurrences of any char except : or a newline, then match : and optional whitespace chars
| Or
\s*<#! Match literally, preceded by optional whitespace chars
[^<>]+ Negated character class, match 1+ occurrences of any char except < and >
> Match literally
If there can be only digits after <#!
^[^:\n]+:|<#!\d+>
For example
import re
with open('generic_discord_talk.txt', encoding="utf8") as archivo:
    a = archivo.read()
st = re.sub(r"^[^:\n]+:\s*|\s*<#![^<>]+>", "", a, flags=re.M)
If you also want to clear the leading and ending spaces, you can add this line
st = re.sub(r"^[^\S\n]*|[^\S\n]*$", "", st, flags=re.M)
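To check the two substitutions without touching the file, here is a minimal sketch that runs them on two lines copied from the sample conversation (the string a stands in for the file content):

```python
import re

# Two lines copied from the sample conversation.
a = (
    "Customer: Hi <#!808947310809317385>, I need you to help me with the order I have made\n"
    "Client: <#!808947310809317387> Yes, I have it in front of me"
)

# Drop the "Sender:" prefix and the tags, then trim spaces on each line.
st = re.sub(r"^[^:\n]+:\s*|\s*<#![^<>]+>", "", a, flags=re.M)
st = re.sub(r"^[^\S\n]*|[^\S\n]*$", "", st, flags=re.M)
print(st)
```

The second line shows why the trimming step is useful: after the prefix and the tag are removed, a leading space is left over in front of "Yes".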
I think this should work:
import re
data = """Company: <#!808947310809317387> Good morning, technical secretary of X-company, will Maria attend you, how can we help you?
Customer: Hi <#!808947310809317385>, I need you to help me with the order I have made
Company: Of course, she tells me that she has placed an order through the store's website and has had a problem. What exactly is Maria about?
Customer: I add the product to the shopping cart and nothing happens <#!808947310809317387>
Company: Does Maria have the website still open? So I can accompany you during the purchase process
Client: <#!808947310809317387> Yes, I have it in front of me"""
def run():
    for line in data.split("\n"):
        line = re.sub(r"^\w+: ", "", line)  # remove the customer/company part
        line = re.sub(r"<#!\d+>", "", line)  # remove tags
        print(line)
run()

How to find required word in novel in python?

I have a text and I have got a task in python with reading module:
Find the names of people who are referred to as Mr. XXX. Save the result in a dictionary with the name as key and number of times it is used as value. For example:
If Mr. Churchill is in the novel, then include {'Churchill' : 2}
If Mr. Frank Churchill is in the novel, then include {'Frank Churchill' : 4}
The file is .txt and it contains around 10-15 paragraphs.
Do you have any ideas about how this can be improved? (It raises an error after some words; I guess the error happens because one of the Mr. occurrences is at the end of a line.)
orig_text= open('emma.txt', encoding = 'UTF-8')
lines= orig_text.readlines()[32:16267]
counts = dict()
for line in lines:
    wordsdirty = line.split()
    try:
        print (wordsdirty[wordsdirty.index('Mr.') + 1])
    except ValueError:
        continue
Try this:
import re
text = "When did Mr. Churchill told Mr. James Brown about the fish"
m = [x[0] for x in re.findall(r'(Mr\.( [A-Z][a-z]*)+)', text)]
You get:
['Mr. Churchill', 'Mr. James Brown']
To solve the line issue simply read the entire file:
text = file.read()
Then, to count the occurrences, simply run:
from collections import Counter
Counter(m)
Finally, if you'd like to drop 'Mr. ' from all your dictionary entries, use x[0][4:] instead of x[0].
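Putting those steps together, a minimal sketch could look like this. The sample sentence is made up; for the real task, text would come from reading the whole file:

```python
import re
from collections import Counter

# Made-up sample; in the real task this would be the full text of the novel.
text = "When did Mr. Churchill tell Mr. James Brown about the fish? Mr. Churchill forgot."

# Group 1 is the whole "Mr. Name ..." match; [4:] drops the leading "Mr. ".
names = [m[0][4:] for m in re.findall(r'(Mr\.( [A-Z][a-z]*)+)', text)]
counts = dict(Counter(names))
print(counts)  # {'Churchill': 2, 'James Brown': 1}
```

Note that the pattern treats every capitalized word after "Mr." as part of the name, so a capitalized word that merely follows the name would be swallowed too.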
This can easily be done using a regex with a capturing group.
In this scenario you might want to do something like:
# retrieve a list of strings that match your regex
matches = re.findall(r"Mr\. ([a-zA-Z]+)", your_entire_file) # not sure about the regex
# then create a dictionary and count the occurrences of each match
# if you are allowed to use modules, this can be done using Counter
Counter(matches)
To access the entire file like that, you might want to map it into memory.

How to extract person name using regular expression?

I am new to regular expressions and I have a kind of phone directory. I want to extract the names out of it. I wrote this (below), but it extracts lots of unwanted text rather than just names. Can you kindly tell me what I am doing wrong and how to correct it? Here is my code:
import re
directory = '''Mark Adamson
Home: 843-798-6698
(424) 345-7659
265-1864 ext. 4467
326-665-8657x2986
E-mail:madamson#sncn.net
Allison Andrews
Home: 612-321-0047
E-mail: AEA#anet.com
Cellular: 612-393-0029
Dustin Andrews'''
nameRegex = re.compile('''
(
[A-Za-z]{2,25}
\s
([A-Za-z]{2,25})+
)
''',re.VERBOSE)
print(nameRegex.findall(directory))
the output it gives is:
[('Mark Adamson', 'Adamson'), ('net\nAllison', 'Allison'), ('Andrews\nHome', 'Home'), ('com\nCellular', 'Cellular'), ('Dustin Andrews', 'Andrews')]
Would be really grateful for help!
Your problem is that \s will also match newlines. Instead of \s just add a space. That is
name_regex = re.compile('[A-Za-z]{2,25} [A-Za-z]{2,25}')
This works if the names have exactly two words. If the names have more than two words (middle names or hyphenated last names) then you may want to expand this to something like:
name_regex = re.compile(r"^([A-Za-z \-]{2,25})+$", re.MULTILINE)
This looks for one or more words and will stretch from the beginning to end of a line (e.g. will not just get 'John Paul' from 'John Paul Jones')
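To make the difference between \s and a literal space concrete, here is a small sketch on a shortened copy of the directory from the question:

```python
import re

# Shortened copy of the directory from the question.
directory = """Mark Adamson
Home: 843-798-6698
E-mail:madamson#sncn.net
Allison Andrews
Cellular: 612-393-0029
Dustin Andrews"""

# \s also matches the newline, so words on neighbouring lines get glued together.
with_s = re.findall(r"[A-Za-z]{2,25}\s[A-Za-z]{2,25}", directory)
# A literal space keeps every match on a single line.
with_space = re.findall(r"[A-Za-z]{2,25} [A-Za-z]{2,25}", directory)

print(with_s)
print(with_space)  # ['Mark Adamson', 'Allison Andrews', 'Dustin Andrews']
```

The first list contains entries such as 'net\nAllison', which is the unwanted output described in the question.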
I suggest trying the following regex; it works for me:
"([A-Z][a-z]+\s[A-Z][a-z]+)"
The following regex works as expected.
Related part of the code:
nameRegex = re.compile(r"^[a-zA-Z]+[',. -][a-zA-Z ]?[a-zA-Z]*$", re.MULTILINE)
print(nameRegex.findall(directory))
Output:
$ python3 test.py
['Mark Adamson', 'Allison Andrews', 'Dustin Andrews']
Try:
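nameRegex = re.compile(r'^((?:\w+\s*){2,})$', flags=re.MULTILINE)
This will only choose complete lines that are made up of two or more names composed of 'word' characters. Note that because \s* can match nothing, a line holding a single word of two or more characters also matches.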

Regex to extract name from list

I am working with a text file (620KB) that has a list of ID#s followed by full names separated by a comma.
The working regex I've used for this is
^([A-Z]{3}\d+)\s+([^,\s]+)
I want to also capture the first name and middle initial (space delimiter between first and MI).
I tried this by doing:
^([A-Z]{3}\d+)\s+([^,\s]+([\D])+)
Which works, but I want to remove the line break that ends up in the output file. (I will be importing the two output files into a database, possibly Access, and I don't want to capture the line breaks.) Also, is there a better way of writing the regex?
Full code:
import re
source = open('source.txt')
ticket_list = open('ticket_list.txt', 'w')
id_list = open('id_list.txt', 'w')
for lines in source:
    m = re.search('^([A-Z]{3}\d+)\s+([^\s]+([\D+])+)', lines)
    if m:
        x = m.group()
        print('Ticket: ' + x)
        ticket_list.write(x + "\n")
ticket_list = open('First.txt', 'r')
for lines in ticket_list:
    y = re.search('^(\d+)\s+([^\s]+([\D+])+)', lines)
    if y:
        z = y.group()
        print ('ID: ' + z)
        id_list.write(z + "\n")
source.close()
ticket_list.close()
id_list.close()
Sample Data:
Source:
ABC1000033830 SMITH, Z
100000012 Davis, Franl R
200000655 Gest, Baalio
DEF4528942681 PACO, BETH
300000233 Theo, David Alex
400000012 Torres, Francisco B.
ABC1200045682 Mo, AHMED
DEF1000006753 LUGO, G TO
ABC1200123123 de la Rosa, Maria E.
Depending on what kind of linebreak you're dealing with, a simple positive lookahead may remedy your pattern capturing the linebreak in the result. This was generated by RegexBuddy 4.2.0, and worked with all your test data.
if re.search(r"^([A-Z]{3}\d+)\s+([^,\s]+([\D])+)(?=$)", subject, re.IGNORECASE | re.MULTILINE):
    pass  # Successful match
else:
    pass  # Match attempt failed
Basically, the positive lookahead makes sure that there is a linebreak (in this case, end of line) character directly after the pattern ends. It will match, but not capture the actual end of line.
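If the file is processed line by line, as in the question, another option is to keep the newline out of the name group altogether. This is a sketch of that alternative rather than the pattern above: it swaps the name part for [^\d\n]+, which assumes names never contain digits.

```python
import re

# Sample lines copied from the question; the trailing "\n" mimics reading from a file.
lines = [
    "ABC1000033830 SMITH, Z\n",
    "100000012 Davis, Franl R\n",
    "ABC1200123123 de la Rosa, Maria E.\n",
]

# [^\d\n] excludes the newline, so the match stops at the end of the name.
ticket = re.compile(r"^([A-Z]{3}\d+)\s+([^\d\n]+)")

found = []
for line in lines:
    m = ticket.search(line)
    if m:
        found.append((m.group(1), m.group(2)))

print(found)
```

Calling line.rstrip("\n") before matching would be an equally simple way to get rid of the line break.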

Python parsing

I'm trying to parse the title tag in an RSS 2.0 feed into three different variables for each entry in that feed. Using ElementTree I've already parsed the RSS so that I can print each title [minus the trailing )] with the code below:
feed = getfeed("http://www.tourfilter.com/dallas/rss/by_concert_date")
for item in feed:
    print repr(item.title[0:-1])
I include that because, as you can see, the item.title is a repr() data type, which I don't know much about.
A particular repr(item.title[0:-1]) printed in the interactive window looks like this:
'randy travis (Billy Bobs 3/21'
'Michael Schenker Group (House of Blues Dallas 3/26'
The user selects a band and I hope to, after parsing each item.title into 3 variables (one each for band, venue, and date... or possibly an array or I don't know...) select only those related to the band selected. Then they are sent to Google for geocoding, but that's another story.
I've seen some examples of regex and I'm reading about them, but it seems very complicated. Is it? I thought maybe someone here would have some insight as to exactly how to do this in an intelligent way. Should I use the re module? Does it matter that the output is currently repr()s? Is there a better way? I was thinking I'd use a loop like this (this is my pseudoPython, just kind of notes I'm writing):
list = bandRaw,venue,date,latLong
for item in feed:
    parse item.title for bandRaw, venue, date
    if bandRaw == str(band)
        send venue name + ", Dallas, TX" to google for geocoding
        return lat,long
        list = list + return character + bandRaw + "," + venue + "," + date + "," + lat + "," + long
    else
In the end, I need to have the chosen entries in a .csv (comma-delimited) file looking like this:
band,venue,date,lat,long
randy travis,Billy Bobs,3/21,1234.5678,1234.5678
Michael Schenker Group,House of Blues Dallas,3/26,4321.8765,4321.8765
I hope this isn't too much to ask. I'll be looking into it on my own, just thought I should post here to make sure it got answered.
So, the question is, how do I best parse each repr(item.title[0:-1]) in the feed into the 3 separate values that I can then concatenate into a .csv file?
Don't let regex scare you off... it's well worth learning.
Given the examples above, you might try putting the trailing parenthesis back in, and then using this pattern:
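import re
s = 'Michael Schenker Group (House of Blues Dallas 3/26)'  # title with the trailing parenthesis put back
pat = re.compile(r'([\w\s]+)\(([\w\s]+)(\d+/\d+)\)')
info = pat.match(s)
print(info.groups())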
('Michael Schenker Group ', 'House of Blues Dallas ', '3/26')
To get at each group individually, just call them on the info object (strip() drops the trailing space that groups 1 and 2 keep):
print(info.group(1)) # or info.groups()[0]
print('"%s","%s","%s"' % tuple(g.strip() for g in info.groups()))
"Michael Schenker Group","House of Blues Dallas","3/26"
The hard thing about regex in this case is making sure you account for all the possible characters in the title. If there are non-alpha chars in the 'Michael Schenker Group' part, you'll have to adjust the regex for that part to allow them.
The pattern above breaks down as follows, which is parsed left to right:
([\w\s]+) : Match any word or space characters (the plus symbol indicates that there should be one or more such characters). The parentheses mean that the match will be captured as a group. This is the "Michael Schenker Group " part. If there can be numbers and dashes here, you'll want to modify the pieces between the square brackets, which are the possible characters for the set.
\( : A literal parenthesis. The backslash escapes the parenthesis, since otherwise it counts as a regex command. This is the "(" part of the string.
([\w\s]+) : Same as the one above, but this time matches the "House of Blues Dallas " part. In parentheses so they will be captured as the second group.
(\d+/\d+) : Matches the digits 3 and 26 with a slash in the middle. In parentheses so they will be captured as the third group.
\) : Closing parenthesis for the above.
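The breakdown above can also be expressed with named groups, which makes the three parts easier to pick out later. This is only a sketch; the group names band, venue and date are chosen here for illustration:

```python
import re

# Same structure as the pattern above, with names attached to the groups.
pat = re.compile(r'(?P<band>[\w\s]+)\((?P<venue>[\w\s]+)(?P<date>\d+/\d+)\)')

s = 'randy travis (Billy Bobs 3/21)'  # sample title with the trailing parenthesis restored
info = pat.match(s)

# strip() removes the trailing space that the band and venue groups keep.
fields = {name: value.strip() for name, value in info.groupdict().items()}
print(fields)  # {'band': 'randy travis', 'venue': 'Billy Bobs', 'date': '3/21'}
```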
The python intro to regex is quite good, and you might want to spend an evening going over it http://docs.python.org/library/re.html#module-re. Also, check Dive Into Python, which has a friendly introduction: http://diveintopython3.ep.io/regular-expressions.html.
EDIT: See zacherates below, who has some nice edits. Two heads are better than one!
Regular expressions are a great solution to this problem:
>>> import re
>>> s = 'Michael Schenker Group (House of Blues Dallas 3/26'
>>> re.match(r'(.*) \((.*) (\d+/\d+)', s).groups()
('Michael Schenker Group', 'House of Blues Dallas', '3/26')
As a side note, you might want to look at the Universal Feed Parser for handling the RSS parsing as feeds have a bad habit of being malformed.
Edit
In regards to your comment... The strings occasionally being wrapped in "s rather than 's has to do with the fact that you're using repr. The repr of a string is usually delimited with 's, unless that string contains one or more 's, where instead it uses "s so that the 's don't have to be escaped:
>>> "Hello there"
'Hello there'
>>> "it's not its"
"it's not its"
Notice the different quote styles.
Regarding the repr(item.title[0:-1]) part, not sure where you got that from but I'm pretty sure you can simply use item.title. All you're doing is removing the last char from the string and then calling repr() on it, which does nothing.
Your code should look something like this:
import re
import geocoders # from GeoPy
us = geocoders.GeocoderDotUS()
import feedparser # from www.feedparser.org
feedurl = "http://www.tourfilter.com/dallas/rss/by_concert_date"
feed = feedparser.parse(feedurl)
lines = []
for entry in feed.entries:
    m = re.search(r'(.*) \((.*) (\d+/\d+)\)', entry.title)
    if m:
        bandRaw, venue, date = m.groups()
        if band == bandRaw:  # band is the name the user selected
            place, (lat, lng) = us.geocode(venue + ", Dallas, TX")
            lines.append(",".join([band, venue, date, str(lat), str(lng)]))  # join() needs strings
result = "\n".join(lines)
EDIT: replaced list with lines as the var name. list is a builtin and should not be used as a variable name. Sorry.
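For the final .csv step, the standard csv module may be easier than joining strings by hand, since it takes care of quoting if a venue name ever contains a comma. A minimal sketch for Python 3, using the placeholder coordinates from the question (not real geocoding results) and an in-memory buffer in place of a file:

```python
import csv
import io

# Rows in the band, venue, date, lat, long layout from the question.
# The coordinates are the placeholder values from the question.
rows = [
    ("randy travis", "Billy Bobs", "3/21", 1234.5678, 1234.5678),
    ("Michael Schenker Group", "House of Blues Dallas", "3/26", 4321.8765, 4321.8765),
]

buffer = io.StringIO()  # stands in for an open file
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["band", "venue", "date", "lat", "long"])  # header line
writer.writerows(rows)
result = buffer.getvalue()
print(result)
```

To write to disk instead, the buffer can be replaced with a file opened via open(path, "w", newline=""), where path is whatever file name you choose.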
