Regex to extract name from list - python

I am working with a text file (620KB) that has a list of ID#s followed by full names separated by a comma.
The working regex I've used for this is
^([A-Z]{3}\d+)\s+([^,\s]+)
I want to also capture the first name and middle initial (space delimiter between first and MI).
I tried this by doing:
^([A-Z]{3}\d+)\s+([^,\s]+([\D])+)
This works, but I want to remove the line break that ends up in the output file. I will be importing the two output files into a database (possibly Access), so I don't want to capture the line breaks. Is there also a better way of writing the regex?
Full code:
import re

source = open('source.txt')
ticket_list = open('ticket_list.txt', 'w')
id_list = open('id_list.txt', 'w')

for lines in source:
    m = re.search(r'^([A-Z]{3}\d+)\s+([^\s]+([\D+])+)', lines)
    if m:
        x = m.group()
        print('Ticket: ' + x)
        ticket_list.write(x + "\n")

ticket_list = open('First.txt', 'r')
for lines in ticket_list:
    y = re.search(r'^(\d+)\s+([^\s]+([\D+])+)', lines)
    if y:
        z = y.group()
        print('ID: ' + z)
        id_list.write(z + "\n")

source.close()
ticket_list.close()
id_list.close()
Sample Data:
Source:
ABC1000033830 SMITH, Z
100000012 Davis, Franl R
200000655 Gest, Baalio
DEF4528942681 PACO, BETH
300000233 Theo, David Alex
400000012 Torres, Francisco B.
ABC1200045682 Mo, AHMED
DEF1000006753 LUGO, G TO
ABC1200123123 de la Rosa, Maria E.

Depending on what kind of linebreak you're dealing with, a simple positive lookahead may keep your pattern from capturing the linebreak in the result. This was generated by RegexBuddy 4.2.0, and worked with all your test data.
if re.search(r"^([A-Z]{3}\d+)\s+([^,\s]+([\D])+)(?=$)", subject, re.IGNORECASE | re.MULTILINE):
    pass  # Successful match
else:
    pass  # Match attempt failed
Basically, the positive lookahead makes sure the pattern ends right before the end of the line (the linebreak). The lookahead only asserts that position; it matches, but does not capture, the actual end of line.
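As a minimal sketch (using the sample data above and the lookahead pattern from this answer), iterating with re.MULTILINE shows that the match no longer carries the trailing newline:

import re

pattern = re.compile(r"^([A-Z]{3}\d+)\s+([^,\s]+([\D])+)(?=$)",
                     re.IGNORECASE | re.MULTILINE)

sample = "ABC1000033830 SMITH, Z\nDEF4528942681 PACO, BETH\n"
for m in pattern.finditer(sample):
    # m.group() ends at the lookahead, so no "\n" would be written to the output file
    print('Ticket: ' + m.group())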

Related

How do I make my code remove the sender names found in the messages saved in a txt file and the tags using regex

I have this dialogue between a sender and a receiver on Discord, and I need to eliminate the tags and the names of the interlocutors. In this case it would be enough to remove everything before the colon (:); that way the sender's name would not matter and whoever sent the message would always be deleted.
This is the information inside the generic_discord_talk.txt file:
Company: <#!808947310809317387> Good morning, technical secretary of X-company, will Maria attend you, how can we help you?
Customer: Hi <#!808947310809317385>, I need you to help me with the order I have made
Company: Of course, she tells me that she has placed an order through the store's website and has had a problem. What exactly is Maria about?
Customer: I add the product to the shopping cart and nothing happens <#!808947310809317387>
Company: Does Maria have the website still open? So I can accompany you during the purchase process
Client: <#!808947310809317387> Yes, I have it in front of me
import collections
import re
import pandas as pd
import matplotlib.pyplot as plt  # to then graph the words that are repeated the most

archivo = open('generic_discord_talk.txt', encoding="utf8")
a = archivo.read()

with open('stopwords-es.txt') as f:
    st = [word for line in f for word in line.split()]
print(st)

stops = set(st)
stopwords = stops.union(set(['you', 'for', 'the']))  # OPTIONAL
# print(stopwords)
I have created a regex to detect the tags
regex = re.compile("^(<#!.+>){,1}\s{,}(messegeA|messegeB|messegeC)(<#!.+>){,1}\s{,}$")
regex_tag = re.compile("^<#!.+>")
I need print(st) to return the words to me, but without the senders and without the tags.
You could remove both parts using an alternation |, matching either from the start of the string to the first occurrence of a colon, or matching <#! up to the first closing >.
^[^:\n]+:\s*|\s*<#![^<>]+>
The pattern matches:
^ Start of string
[^:\n]+:\s* Match 1+ occurrences of any char except : or a newline, then match : and optional whitespace chars
| Or
\s*<#! Match literally, preceded by optional whitespace chars
[^<>]+ Negated character class, match 1+ occurrences of any char except < and >
> Match literally
If there can be only digits after <#!
^[^:\n]+:|<#!\d+>
For example
import re

archivo = open('generic_discord_talk.txt', encoding="utf8")
a = archivo.read()
st = re.sub(r"^[^:\n]+:\s*|\s*<#![^<>]+>", "", a, 0, re.M)
If you also want to clear the leading and ending spaces, you can add this line
st = re.sub(r"^[^\S\n]*|[^\S\n]*$", "", st, 0, re.M)
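As a quick sanity check on a single line from the sample conversation (illustrative only, chaining the two substitutions above):

import re

line = "Company: <#!808947310809317387> Good morning, technical secretary of X-company, will Maria attend you, how can we help you?"
st = re.sub(r"^[^:\n]+:\s*|\s*<#![^<>]+>", "", line, 0, re.M)   # drop sender and tag
st = re.sub(r"^[^\S\n]*|[^\S\n]*$", "", st, 0, re.M)            # trim leading/trailing spaces
print(st)
# Good morning, technical secretary of X-company, will Maria attend you, how can we help you?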
I think this should work:
import re
data = """Company: <#!808947310809317387> Good morning, technical secretary of X-company, will Maria attend you, how can we help you?
Customer: Hi <#!808947310809317385>, I need you to help me with the order I have made
Company: Of course, she tells me that she has placed an order through the store's website and has had a problem. What exactly is Maria about?
Customer: I add the product to the shopping cart and nothing happens <#!808947310809317387>
Company: Does Maria have the website still open? So I can accompany you during the purchase process
Client: <#!808947310809317387> Yes, I have it in front of me"""
def run():
    for line in data.split("\n"):
        line = re.sub(r"^\w+: ", "", line)  # remove the customer/company part
        line = re.sub(r"<#!\d+>", "", line)  # remove tags
        print(line)

run()

How to read the line that contains a string then extract this line without this string

I have a .txt file that contains specific lines, like this:
file.txt
.
.
T - Python and Matplotlib Essentials for Scientists and Engineers
.
A - Wood, M.A.
.
.
.
I would like to extract the lines that contain a given string. I tried a simple script:
with open('file.txt', 'r') as f:
    for line in f:
        if "T - " in line:
            o_t = line.rstrip('\n')
        elif "A - " in line:
            o_a = line.rstrip('\n')

o_T = o_t.split('T - ')
print(o_T)
o_A = o_a.split('A - ')
# o_Fname =
# o_Lname =
print(o_A)
my output:
['', 'Python and Matplotlib Essentials for Scientists and Engineers']
['', 'Wood, M.A.']
and my desired output:
Python and Matplotlib Essentials for Scientists and Engineers
Wood, M.A.
Moreover, for the second one ("Wood, M.A."), can I also extract the last name and first name?
So the final results will be:
Python and Matplotlib Essentials for Scientists and Engineers
Wood
M.A.
Use filter to remove all empty elements from the list (wrap it in list() so it also prints as a list on Python 3).
Ex:
o_T = list(filter(None, o_t.split('T - ')))
print(o_T)
o_A = list(filter(None, o_a.split('A - ')))
print(o_A)
Output:
['Python and Matplotlib Essentials for Scientists and Engineers']
['Wood, M.A.']
The fault in your case is that you print the whole list o_T (the result of the split operation) instead of just the element after the tag, o_T[1].
However, as others pointed out, you could also approach this by removing the first 4 characters with the regex \w - (.+); then you get the value after the tag. If you also need the first character (the tag itself), you could use (\w) - (.+).
In addition to that, if you'd give your variables better names, you'd have a better life :)
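A minimal sketch of that regex approach (splitting "Wood, M.A." into last name and initials is my own illustration, not part of the original answer):

import re

with open('file.txt') as f:
    for line in f:
        m = re.match(r'(\w) - (.+)', line.rstrip('\n'))
        if not m:
            continue
        tag, value = m.group(1), m.group(2)
        if tag == 'T':
            print(value)                       # Python and Matplotlib Essentials ...
        elif tag == 'A':
            last, first = [p.strip() for p in value.split(',', 1)]
            print(last)                        # Wood
            print(first)                       # M.A.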

finding duplicate words in a string and print them using re

I need some help with printing duplicated last names in a text file (lower case and upper case should be treated as the same).
The program should not print names containing digits (i.e. if a digit appears in the last name or the first name, the whole name is ignored).
for example:
my text file is :
Assaf Spanier, Assaf Din, Yo9ssi Levi, Yoram bibe9rman, David levi, Bibi Netanyahu, Amnon Levi, Ehud sPanier, Barak Spa7nier, Sara Neta4nyahu
the output should be:
Assaf
Assaf
David
Bibi
Amnon
Ehud
========
Spanier
Levi
import re

def delete_numbers(line):
    words = re.sub(r'\w*\d\w*', '', line).strip()
    for t in re.split(r',', words):
        if len(t.split()) == 1:
            words = re.sub(t, '', words)
    words = re.sub(',,', '', words)
    return words

fname = input("Enter file name: ")
file = open(fname, "r")
for line in file.readlines():
    words = delete_numbers(line)
    first_name = re.findall(r"([a-zA-Z]+)\s", words)
    for i in first_name:
        print(i)
    print("***")
    a = ""
    for t in re.split(r',', words):
        a += (", ".join(t.split()[1:])) + " "
Ok, first let's start with an aside - opening files in an idiomatic way. Use the with statement, which guarantees your file will be closed. For small scripts, this isn't a big deal, but if you ever start writing longer-lived programs, resource leaks due to incorrectly closed files can come back to haunt you. Since your file has everything on a single line:
with open(fname) as f:
    data = f.read()
The file is now closed. This also encourages you to deal with your file immediately, and not leave it open consuming resources unnecessarily. Another aside: let's suppose you did have multiple lines. Instead of using for line in f.readlines(), use the following construct:
with open(fname) as f:
    for line in f:
        do_stuff(line)
Since you don't actually need to keep the whole file, and only need to inspect each line, don't use readlines(). Only use readlines() if you want to keep a list of lines around, something like lines = f.readlines().
OK, finally, data will look something like this:
>>> print(data)
Assaf Spanier, Assaf Din, Yo9ssi Levi, Yoram bibe9rman, David levi, Bibi Netanyahu, Amnon Levi, Ehud sPanier, Barak Spa7nier, Sara Neta4nyahu
Ok, so if you want to use regex here, I suggest the following approach:
>>> names_regex = re.compile(r"^(\D+)\s(\D+)$")
The pattern here, ^(\D+)\s(\D+)$, uses the non-digit group, \D (the opposite of \d, the digit group), and the whitespace group, \s. It also uses anchors, ^ and $, to anchor the pattern to the beginning and end of the string respectively, and the parentheses create capturing groups, which we will leverage. Try copy-pasting this into http://regexr.com/ and playing around with it if you still don't understand.
One important note: use raw strings, i.e. r"this is a raw string", versus normal strings, "this is a normal string" (notice the r). This is because Python strings use some of the same escape characters as regex patterns. This will help maintain your sanity.
Ok, finally, I suggest using the grouping idiom with a dict:
>>> grouper = {}
Now, our loop:
>>> for fullname in data.split(','):
...     match = names_regex.search(fullname.strip())
...     if match:
...         first, last = match.group(1), match.group(2)
...         grouper.setdefault(last.title(), []).append(first.title())
...
Note, I used the .title method to normalize all our names to "Titlecase". dict.setdefault takes a key as its first argument, and if the key doesn't exist, it sets the second argument as the value, and returns it. So, I am checking if the last name, in title case, exists in the grouper dict, and if not, setting it to an empty list, [], then appending to whatever is there!
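For example, the same setdefault idiom on a toy dict:
>>> d = {}
>>> d.setdefault('Levi', []).append('David')
>>> d.setdefault('Levi', []).append('Amnon')
>>> d
{'Levi': ['David', 'Amnon']}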
Now pretty-printing for clarity:
>>> from pprint import pprint
>>> pprint(grouper)
{'Din': ['Assaf'],
 'Levi': ['David', 'Amnon'],
 'Netanyahu': ['Bibi'],
 'Spanier': ['Assaf', 'Ehud']}
This is a very useful data-structure. We can, for example, get all last-names with more than a single first name:
>>> for last, firsts in grouper.items():
...     if len(firsts) > 1:
...         print(last)
...
Spanier
Levi
So, putting it all together:
>>> grouper = {}
>>> names_regex = re.compile(r"^(\D+)\s(\D+)$")
>>> for fullname in data.split(','):
...     match = names_regex.search(fullname.strip())
...     if match:
...         first, last = match.group(1), match.group(2)
...         first, last = first.title(), last.title()
...         print(first)
...         grouper.setdefault(last, []).append(first)
...
Assaf
Assaf
David
Bibi
Amnon
Ehud
>>> for last, firsts in grouper.items():
...     if len(firsts) > 1:
...         print(last)
...
...
Spanier
Levi
Note, I have assumed order doesn't matter, so I used a normal dict. My output happens to be in the correct order because in Python 3.6, dicts are ordered! But don't rely on this, since it is an implementation detail and not a guarantee. Use collections.OrderedDict if you want to guarantee order.
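A minimal sketch of that OrderedDict variant (same regex and the same comma-separated data string read from the file earlier in this answer; only the dict type changes):

import re
from collections import OrderedDict

names_regex = re.compile(r"^(\D+)\s(\D+)$")
grouper = OrderedDict()   # keeps last names in first-seen order on any Python version
for fullname in data.split(','):
    match = names_regex.search(fullname.strip())
    if match:
        first, last = match.group(1).title(), match.group(2).title()
        grouper.setdefault(last, []).append(first)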
Fine, since you insist on doing it with regex, you should strive to do it in a single call so you don't pay the penalty of repeated context switches. The best approach would be to write a pattern that captures all first/last names that don't include numbers, separated by commas, let the regex engine find them all, then iterate over the matches and, finally, map them to a dictionary so you end up with a last name => first names map:
import collections
import re

text = "Assaf Spanier, Assaf Din, Yo9ssi Levi, Yoram bibe9rman, David levi, " \
       "Bibi Netanyahu, Amnon Levi, Ehud sPanier, Barak Spa7nier, Sara Neta4nyahu"

full_name = re.compile(r"(?:^|\s|,)([^\d\s]+)\s+([^\d\s]+)(?=$|,)")  # compile the pattern
matches = collections.OrderedDict()  # store for the last=>first name map preserving order

for match in full_name.finditer(text):
    first_name = match.group(1)
    print(first_name)  # print the first name to match your desired output
    last_name = match.group(2).title()  # capitalize the last name for case-insensitivity
    if last_name in matches:  # repeated last name
        matches[last_name].append(first_name)  # add the first name to the map
    else:  # encountering this last name for the first time
        matches[last_name] = [first_name]  # initialize the map for this last name

print("========")  # print the separator...
# finally, print all the repeated last names to match your format
for k, v in matches.items():
    if len(v) > 1:  # print only those with more than one first name attached
        print(k)
And this will give you:
Assaf
Assaf
David
Bibi
Amnon
Ehud
========
Spanier
Levi
In addition, you have the full last name => first names match in matches.
When it comes to the pattern, let's break it down piece by piece:
(?:^|\s|,) - match the beginning of the string, whitespace or a comma (non-capturing)
([^\d\s]+) - followed by one or more characters that are not digits or whitespace (capturing)
\s+ - followed by one or more whitespace characters (non-capturing)
([^\d\s]+) - followed by the same pattern as for the first name (capturing)
(?=$|,) - followed by a comma or the end of the string (look-ahead, non-capturing)
The two captured groups (first and last name) are then referenced in the match object when we iterate over matches. Easy-peasy.

Delete whitespace characters in quoted columns in tab-separated file?

I had a similar text file and got great help solving it, but I have to admit that I'm too new to programming in general, and regex in particular, to modify the great Python script below, written by steveha for a similar file.
EDIT: I want to get rid of tabs, newlines and any characters other than "normal" words, numbers, exclamation marks, question marks and dots - in order to get a clean CSV and from there do text analysis.
import re
import sys

_, infile, outfile = sys.argv

s_pat_row = r'''
    "([^"]+)"          # match column; this is group 1
    \s*,\s*            # match separating comma and any optional white space
    (\S+)              # match column; this is group 2
    \s*,\s*            # match separating comma and any optional white space
    "((?:\\"|[^"])*)"  # match string data that can include escaped quotes
    '''
pat_row = re.compile(s_pat_row, re.MULTILINE | re.VERBOSE)

s_pat_clean = r'''[\x01-\x1f\x7f]'''
pat_clean = re.compile(s_pat_clean)

row_template = '"{}",{},"{}"\n'

with open(infile, "rt") as inf, open(outfile, "wt") as outf:
    data = inf.read()
    for m in re.finditer(pat_row, data):
        row = m.groups()
        cleaned = re.sub(pat_clean, ' ', row[2])
        words = cleaned.split()
        cleaned = ' '.join(words)
        outrow = row_template.format(row[0], row[1], cleaned)
        outf.write(outrow)
I can't figure out how to modify it to match this file, where \t separates the columns and there is text instead of a number in the second column. My objective is to have the cleaned text ready for content analysis, but I seem to have years of learning before I get to the point where I'm familiar... ;-)
Could anyone help me modify it so it works on the data file below?
"from_user" "to_user" "full_text"
"_________erik_" "systersandra gigantarmadillo kuttersmycket NULL NULL" "\"men du...? är du bi?\". \"näeh. Tyvärr\" #fikarum,Alla vi barn i bullerbyn goes #swecrime. #fjällbackamorden,Ny mobil och en väckare som ringer 0540. #fail,När jag måste välja, \"äta kakan eller ha den kvar\", så carpe diar jag kakan på sekunden. #mums,Låter RT #bobhansson: Om pessimisterna lever 7 år kortare är det ju inte alls konstigt att dom är det.
http://t.co/a1t5ht4l2h,Finskjortan på tork: Check! Dags att leta fram gå-bort skorna..."
If your CSV file uses tabs for delimiters rather than commas, then in s_pat_row you should replace the , characters with \t. Also, the second field in your sample text file includes spaces, so the (\S+) pattern in s_pat_row will not match it. You could try this instead:
s_pat_row = r'''
    "([^"]+)"          # match column; this is group 1
    \s*\t\s*           # match separating tab and any optional white space
    ([^\t]+)           # match a string of non-tab chars; this is group 2
    \s*\t\s*           # match separating tab and any optional white space
    "((?:\\"|[^"])*)"  # match string data that can include escaped quotes
    '''
That may be sufficient to solve your immediate problem.
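As a quick, illustrative check of the modified pattern against a small, made-up tab-separated row (not your real data, which may span multiple lines):

import re

s_pat_row = r'''
    "([^"]+)"          # match column; this is group 1
    \s*\t\s*           # match separating tab and any optional white space
    ([^\t]+)           # match a string of non-tab chars; this is group 2
    \s*\t\s*           # match separating tab and any optional white space
    "((?:\\"|[^"])*)"  # match string data that can include escaped quotes
    '''
pat_row = re.compile(s_pat_row, re.MULTILINE | re.VERBOSE)

row = '"_________erik_"\t"systersandra gigantarmadillo"\t"some \\"quoted\\" text here"'
m = pat_row.search(row)
if m:
    print(m.groups())  # note: group 2 keeps its surrounding quotes, since ([^\t]+) doesn't strip them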

Python Regular Expression to find all combinations of a Letter Number Letter Designation

I need to implement a Python regular expression to search for all occurrences of A1a or A_1_a or A-1-a or _A_1_a_ or _A1a, where:
A can be A to Z.
1 can be 1 to 9.
a can be a to z.
There are only ever three characters, letter number letter, separated by underscores, dashes, or nothing. The case in the search string needs to be matched exactly.
The main problem I am having is that sometimes these three-character combinations are connected to other text by dashes and underscores. I also need the same regular expression to match A1a, A-1-a and A_1_a.
Also I forgot to mention this is an XML file.
Thanks, this found every occurrence of what I was looking for with a slight modification, [-]?[A][-]?[1][-]?[a][-]?, but I need these to be variables, something like
[-]?[var_A][-]?[var_3][-]?[Var_a][-]?
Would that be done like this:
regex = r"[-]?[%s][-]?[%s][-]?[%s][-]?"
print re.findall(regex,var_A,var_Num,Var_a)
Or more like:
regex = ''.join(['r', '\"', '[-]?[', Var_X, '][-]?[', Var_Num, '][-]?[', Var_x, '][-]?', '\"'])
print regex
for sstr in searchstrs:
matches = re.findall(regex, sstr, re.I)
But this isn't working
Sample Lines of the File:
Before Running Script
<t:ION t:SA="BoolObj" t:H="2098947" t:P="2098944" t:N="AN7 Result" t:CI="Boolean_Register" t:L="A_3_a Fdr2" t:VS="true">
<t:ION t:SA="RegisterObj" t:H="20971785" t:P="20971776" t:N="ART1 Result 1" t:CI="NumericVariable_Register" t:L="A3a1 Status" t:VS="1">
<t:ION t:SA="ModuleObj" t:H="2100736" t:P="2097152" t:N="AND/OR 14" t:CI="AndOr_Module" t:L="A_3_a**_2 Energized from Norm" t:S="0" t:SC="5">
After Running Script
What I am getting: (It's deleting the entire line and leaving only what is below)
B_1_c
B1c1
B_1_c_2
What I Want to get:
<t:ION t:SA="BoolObj" t:H="2098947" t:P="2098944" t:N="AN7 Result" t:CI="Boolean_Register" t:L="B_1_c Fdr2" t:VS="true">
<t:ION t:SA="RegisterObj" t:H="20971785" t:P="20971776" t:N="ART1 Result 1" t:CI="NumericVariable_Register" t:L="B1c1 Status" t:VS="1">
<t:ION t:SA="ModuleObj" t:H="2100736" t:P="2097152" t:N="AND/OR 14" t:CI="AndOr_Module" t:L="B_1_c_2 Energized from Norm" t:S="0" t:SC="5">
import re
import os

search_file_name = 'Alarms Test.fwn'
pattern = 'A3a'
fileName, fileExtension = os.path.splitext(search_file_name)
newfilename = fileName + '_' + pattern + fileExtension
outfile = open(newfilename, 'wb')

def find_ext(text):
    matches = re.findall(r'([_-]?[A{1}][_-]?[3{1}][_-]?[a{1}][_-]?)', text)
    records = [m.replace('3', '1').replace('A', 'B').replace('a', 'c') for m in matches]
    if matches:
        outfile.writelines(records)
        return 1
    else:
        outfile.writelines(text)
        return 0

def main():
    success = 0
    count = 0
    with open(search_file_name, 'rb') as searchfile:
        try:
            searchstrs = searchfile.readlines()
            for s in searchstrs:
                success = find_ext(s)
                count = count + success
        finally:
            searchfile.close()
    print count

if __name__ == "__main__":
    main()
You want to use the following to find your matches.
matches = re.findall(r'([_-]?[a-z][_-]?[1-9][_-]?[a-z][_-]?)', s, re.I)
If you are looking to find the matches and then strip all of the -, _ characters, you could do:
import re

s = '''
A1a _A_1 A_ A_1_a A-1-a _A_1_a_ _A1a _A-1-A_ a1_a A-_-5-a
_A-_-5-A a1_-1 XMDC_A1a or XMDC-A1a or XMDC_A1-a XMDC_A_1_a_ _A-1-A_
'''

def find_this(text):
    matches = re.findall(r'([_-]?[a-z][_-]?[1-9][_-]?[a-z][_-]?)', text, re.I)
    records = [m.replace('-', '').replace('_', '') for m in matches]
    print records

find_this(s)
Output
['A1a', 'A1a', 'A1a', 'A1a', 'A1a', 'A1A', 'a1a', 'A1a', 'A1a', 'A1a', 'A1a', 'A1A']
To quickly get the A1as out without the punctuation, and not having to reconstruct the string from captured parts...
t = '''A1a _B_2_z_
A_1_a
A-1-a
_A_1_a_
_C1c '''
re.findall("[A-Z][0-9][a-z]",t.replace("-","").replace("_",""))
Output:
['A1a', 'B2z', 'A1a', 'A1a', 'A1a', 'C1c']
(But if you don't want to capture from FILE.TXT-2b, then you would have to be careful about most of these solutions...)
If the string can be separated by multiple underscores or dashes (e.g. A__1a):
[_-]*[A-Z][_-]*[1-9][_-]*[a-z]
If there can only be one or zero underscores or dashes:
[_-]?[A-Z][_-]?[1-9][_-]?[a-z]
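A quick, illustrative check of the first variant:

import re

print(re.findall(r"[_-]*[A-Z][_-]*[1-9][_-]*[a-z]", "A__1a A-1-a B2z"))
# ['A__1a', 'A-1-a', 'B2z']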
regex = r"[A-Z][-_]?[1-9][-_]?[a-z]"
print re.findall(regex,some_string_variable)
should work
To capture just the parts you're interested in, wrap them in parens:
regex = r"([A-Z])[-_]?([1-9])[-_]?([a-z])"
print re.findall(regex,some_string_variable)
If the underscores or dashes (or lack thereof) must match each other, or it will otherwise return bad results, you would need a state machine, whereas regex is stateless.
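On the follow-up about turning the letter, digit and letter into variables: build the pattern string first, then pass the finished pattern to re.findall. A minimal sketch (the variable names and sample string here are purely illustrative):

import re

var_A, var_num, var_a = 'A', '3', 'a'  # the three characters to search for
# format the pattern before calling re.findall, escaping the variables in case
# they ever contain regex metacharacters
regex = r"[-_]?%s[-_]?%s[-_]?%s[-_]?" % (re.escape(var_A), re.escape(var_num), re.escape(var_a))
print(re.findall(regex, "A3a A_3_a A-3-a B_1_c"))  # ['A3a', 'A_3_a', 'A-3-a']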
