I'm sure that if a solution exists for this then it's out there somewhere, but I can't find it. I've followed Python regex to match a specific word and had success with the first part, but I'm now struggling with the second.
I've inherited a horrible file format where each test result is on its own line. Records are limited to 12 chars, so some results are split into groups of lines, e.g. SITE, SITE1 and SITE2. I'm trying to parse the file into a dictionary so I can do more analysis with it and ultimately produce a formatted report.
The link above / code below allows me to match each SITE and concatenate them together, but it's giving me problems matching INS, INS 1 and INS 2 correctly. Yes, the space is intentional; it's what I have to deal with. INS is the test result and INS 1 is the limit of the test for a pass.
Is there a regular expression that would match
SITE against SITE (True) but not SITE against SITE1 (False),
and
INS against INS (True) but not INS against INS 1 (False)?
Here is the Python code:
import re

lines = ['SITE start', 'SITE1 more', 'SITE2 end', 'INS value1', 'INS 1 value2']
headings = ['SITE', 'SITE1', 'SITE2', 'INS', 'INS 1']
for line in lines:
    for heading in headings:
        headregex = r"\b" + heading + r"\b"
        match = re.search(headregex, line)
        if match:
            print("Found " + heading + " " + line)
        else:
            print("Not Found " + heading + " " + line)
And here is some dummy data:
TEST MODE 131 AUTO
SITE startaddy
SITE1 middle addy
SITE2 end addy
USER DB
VISUAL CHECK P
BOND RANGE 25A
EARTH 0.09 OHM P
LIMIT 0.10 OHM
INS 500 V
INS 1 >299 MEG P
...
TEST MODE 231 AUTO
SITE startaddy
SITE1 middle addy
SITE2 end addy
USER DB
VISUAL CHECK P
INS 500 V
INS 2 >299 MEG P
...
Sorry for the horrid formatting; it's copied and pasted from what I am dealing with!
The problem is that the INS pattern finds a partial match inside INS 1 or INS 2, etc.
When extracting alternatives, it is customary to use alternations starting with the longest value (like INS \d+|INS), but in this case you are looking to obtain a list of all regex matches, only excluding some overlapping heading matches.
To achieve that, you can exclude the unwanted match by treating all the headings items as regular expressions and defining the INS pattern as INS(?! \d), a negative lookahead that makes sure INS is not matched if it is followed by a space and a digit.
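As a quick aside, here is a minimal sketch of the alternation-ordering point; the sample string below is made up, not taken from the question's data:

import re

# Made-up sample line, just to show that alternation order matters
sample = "INS 500 V and INS 1 >299 MEG"
print(re.findall(r"INS \d+|INS", sample))  # ['INS 500', 'INS 1'] - longest alternative first
print(re.findall(r"INS|INS \d+", sample))  # ['INS', 'INS'] - the shorter alternative wins at each position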
See the Python demo:
import re

lines = ['SITE start', 'SITE1 more', 'SITE2 end', 'INS value1', 'INS 1 value2']
headings = ['SITE', 'SITE1', 'SITE2', r"INS(?! \d)", 'INS 1']
# order the heading patterns longest-first (by pattern length)
headings = sorted(headings, key=lambda x: len(x), reverse=True)
for line in lines:
    print("----")
    for heading in headings:
        headregex = r"\b{}\b".format(heading)
        match = re.search(headregex, line)
        if match:
            print("Found " + heading + " " + line)
        else:
            print("Not Found " + heading + " " + line)
Just to give an answer that might solve the problem while avoiding some of the tediousness, is this what you are trying to achieve?
import re

lines = ['SITE start', 'SITE1 more', 'SITE2 end', 'INS value1', 'INS 1 value2']
headings = ['SITE', 'SITE1', 'SITE2', 'INS', 'INS 1']
headings_re = re.compile(r"(SITE\d?)?(INS( \d)?)? (.*)")
# built by hand; only works if SITE and INS are the literal identifiers

site = []
ins = []
for line in lines:
    match = headings_re.match(line)
    if match:
        if match.group(1):
            site.append(match.group(4))
        elif match.group(2):
            ins.append(match.group(4))
        else:
            print("something weird happened")
            print(match.group(0))
    else:
        print("something weird happened")
        print(line)

print("SITE: {}".format(" ".join(site)))
# >> SITE: start more end
print("INS: {}".format(" ".join(ins)))
# >> INS: value1 value2
I have two things that I would like to replace in my text files.
Add a space between each character of strings ending with '#' (e.g. ABC# becomes A B C).
Ignore strings ending with 'H' or of the form 'xx:xx:xx' (e.g. 1111H is left alone), but spell out other digits (e.g. 1111 becomes 'ONE ONE ONE ONE').
So far this is my code:
import os
import re

dest1 = r"C:\Users\CL\Desktop\Folder"
files = os.listdir(dest1)

# dictionary to turn digits into words
numbers = {"0": "ZERO ", "1": "ONE ", "2": "TWO ", "3": "THREE ", "4": "FOUR ",
           "5": "FIVE ", "6": "SIX ", "7": "SEVEN ", "8": "EIGHT ", "9": "NINE "}

for f in files:
    with open(os.path.join(dest1, f), "r") as infile:
        text_read = infile.read()
    # num sub pattern
    text = re.sub(r'[%s]\s?' % ''.join(numbers), lambda x: numbers[x.group().strip()] + ' ', text_read)
    # write the result back to the file
    with open(os.path.join(dest1, f), "w") as outfile:
        outfile.write(text)
Sample .txt:
1111H I have 11 ABC# apples
11:12:00 I went to my# room
Required output:
1111H I have ONE ONE A B C apples
11:12:00 I went to M Y room
Also, I realized that when I write the new result, the format gets 'messy' without the line breaks. Not sure why.
#current output
ONE ONE ONE ONE H - I HAVE ONE ONE ABC# APPLES
ONE ONE ONE TWO H - I WENT TO MY# ROOM
#overwritten output
ONE ONE ONE ONE H - I HAVE ONE ONE ABC# APPLES ONE ONE ONE TWO H - I WENT TO MY# ROOM
You can use
def process_match(x):
    if x.group(1):
        return " ".join(x.group(1).upper())
    elif x.group(2):
        return numbers[x.group(2)]
    else:
        return x.group()

print(re.sub(r'\b(?:\d+[A-Z]+|\d{2}:\d{2}:\d{2})\b|\b([A-Za-z]+)#|([0-9])', process_match, text_read))
# => 1111H I have ONE ONE A B C apples
#    11:12:00 I went to M Y room
See the regex demo. The main idea behind this approach is to parse the string only once, capturing (or not) parts of it, and to process each match on the fly, either returning it as is (if nothing was captured) or returning a converted chunk of text (if a group was captured); a stripped-down sketch of this callback idea follows the breakdown below.
Regex details:
\b(?:\d+[A-Z]+|\d{2}:\d{2}:\d{2})\b - a word boundary, and then either one or more digits and one or more uppercase letters, or three occurrences of colon-separated double digits, and then a word boundary
| - or
\b([A-Za-z]+)# - Group 1: words with # at the end: a word boundary, then one or more letters and a #
| - or
([0-9]) - Group 2: an ASCII digit.
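To make the callback idea concrete, here is a stripped-down, self-contained sketch; the pattern and input string below are made up for illustration, not taken from the question:

import re

# Minimal sketch of re.sub with a callable replacement (hypothetical data)
def repl(m):
    if m.group(1):          # group 1 participated: transform the captured text
        return m.group(1).upper()
    return m.group()        # nothing captured: return the match unchanged

print(re.sub(r"keep:(\w+)|\w+", repl, "keep:this but not that"))
# => THIS but not that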
Attached is a text file that I want to parse. I want to select the text between the last occurrences of the following word combinations:
(1) Item 7 Management Discussion Analysis
(2) Item 8 Financial Statements
I would usually use a regex as follows:
re.findall(r"Item(?:(?!Item).)*7(?:(?!Item|7).)*Management(?:(?!Item|7|Management).)*Analysis[\s\S]*Item(?:(?!Item).)*8(?:(?!Item|8).)*Financial(?:(?!Item|8|Financial).)*Statements",text, re.DOTALL)
You can see in the text file that the combination of Item 7 and Item 8 occurs often, but if I find the last match of (1) and the last match of (2), I greatly increase the probability of grabbing the desired text.
The desired text in my text file starts with:
"'This Item 7, Management's Discussion and
Analysis of Financial Condition and Results of Operations, and other
parts of this Form 10-K contain forward-looking statements, within the
meaning of the Private Securities Litigation Reform Act of 1995, that
involve risks and..... "
and ends with:
"Item 8.
Financial Statements and Supplementary Data"
How can I adapt my regex code to grab this last pair between Item 7 and Item 8?
UPDATE:
I am also trying to parse this file using the same items.
This code has been rewritten. It now works with both the original data file (Output2.txt) and the newly added data file (Output2012.txt).
import re

discussions = []
for input_file_name in ['Output2.txt', 'Output2012.txt']:
    with open(input_file_name) as f:
        doc = f.read()

    item7 = r"Item 7\.*\s*Management.s Discussion and Analysis of Financial Condition and Results of Operations"
    discussion_text = r"[\S\s]*"
    item8 = r"Item 8\.*\s*Financial Statements"
    discussion_pattern = item7 + discussion_text + item8

    results = re.findall(discussion_pattern, doc)
    # Some input files have a table of contents and others don't,
    # so just keep the last match
    discussion = results[-1]
    discussions.append((input_file_name, discussion))
The discussions variable contains the results for each of the data files.
This is the original solution. It does not work for the new file, but it does show the use of named groups. (I am not familiar with Stack Overflow protocol here; should I delete this old code?)
By using longer match strings, the number of matches can be reduced to just 2 for both Item 7
and Item 8 - the table of contents entry and the actual section.
So search for the second occurrence of Item 7 and keep all text until Item 8. This code uses
Python named groups.
import re

with open('Output2.txt') as f:
    doc = f.read()

item7 = r"Item 7\.*\s*Management.s Discussion and Analysis of Financial Condition and Results of Operations"
item8 = r"Item 8\.*\s*Financial Statements"

discussion_pattern = re.compile(
    r"(?P<item7>" + item7 + ")"
    r"([\S\s]*)"
    r"(?P<item7heading>" + item7 + ")"
    r"(?P<discussion>[\S\s]*)"
    r"(?P<item8heading>" + item8 + ")"
)

match = re.search(discussion_pattern, doc)
discussion = match.group('discussion')
Use this pattern with the s (DOTALL) option:
.*(Item 7.*?Item 8)
The result is in capturing group #1.
Demo
. # Any character except line break
* # (zero or more)(greedy)
( # Capturing Group (1)
Item 7 # "Item 7"
. # Any character except line break
*? # (zero or more)(lazy)
Item 8 # "Item 8"
) # End of Capturing Group (1)
# " "
re.findall(r"Item(?:(?!Item).)*7(?:(?!Item|7).)*Management(?:(?!Item|7|Management).)*Analysis[\s\S]*Item(?:(?!Item).)*8(?:(?!Item|8).)*Financial(?:(?!Item|8|Financial).)*Statements(?!.*?(?:Item(?:(?!Item).)*7)|(?:Item(?:(?!Item).)*8))",text, re.DOTALL)
Try this.Added a lookahead .
I am working with a text file (620KB) that has a list of ID#s followed by full names separated by a comma.
The working regex I've used for this is
^([A-Z]{3}\d+)\s+([^,\s]+)
I want to also capture the first name and middle initial (space delimiter between first and MI).
I tried this by doing:
^([A-Z]{3}\d+)\s+([^,\s]+([\D])+)
Which works, but I want to remove the newline break that is generated in the output file (I will be importing the two output files into a database, possibly Access, and I don't want to capture the newline breaks). Also, is there a better way of writing the regex?
Full code:
import re

source = open('source.txt')
ticket_list = open('ticket_list.txt', 'w')
id_list = open('id_list.txt', 'w')

for lines in source:
    m = re.search(r'^([A-Z]{3}\d+)\s+([^\s]+([\D+])+)', lines)
    if m:
        x = m.group()
        print('Ticket: ' + x)
        ticket_list.write(x + "\n")

ticket_list = open('First.txt', 'r')
for lines in ticket_list:
    y = re.search(r'^(\d+)\s+([^\s]+([\D+])+)', lines)
    if y:
        z = y.group()
        print('ID: ' + z)
        id_list.write(z + "\n")

source.close()
ticket_list.close()
id_list.close()
Sample Data:
Source:
ABC1000033830 SMITH, Z
100000012 Davis, Franl R
200000655 Gest, Baalio
DEF4528942681 PACO, BETH
300000233 Theo, David Alex
400000012 Torres, Francisco B.
ABC1200045682 Mo, AHMED
DEF1000006753 LUGO, G TO
ABC1200123123 de la Rosa, Maria E.
Depending on what kind of linebreak you're dealing with, a simple positive lookahead may stop your pattern from capturing the linebreak in the result. This was generated by RegexBuddy 4.2.0 and worked with all your test data.
if re.search(r"^([A-Z]{3}\d+)\s+([^,\s]+([\D])+)(?=$)", subject, re.IGNORECASE | re.MULTILINE):
# Successful match
else:
# Match attempt failed
Basically, the positive lookahead makes sure that there is a linebreak (in this case, end of line) character directly after the pattern ends. It will match, but not capture the actual end of line.
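As a rough, self-contained sketch using two rows copied from the question's sample data:

import re

# Two rows from the question's sample data
subject = "ABC1000033830 SMITH, Z\n100000012 Davis, Franl R\n"
m = re.search(r"^([A-Z]{3}\d+)\s+([^,\s]+([\D])+)(?=$)", subject, re.IGNORECASE | re.MULTILINE)
if m:
    print(repr(m.group(1)))  # 'ABC1000033830'
    print(repr(m.group(2)))  # 'SMITH, Z' - no trailing newline in the capture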
I have a list of tweets, from which I have to choose tweets that have terms like "sale", "discount", or "offer". Also, I need to find tweets that advertise certain deals, like a discount, by recognizing things like "%", "Rs.", "$", amongst others. I have absolutely no idea about regular expressions and the documentation isn't getting me anywhere. Here is my code. It's rather lousy, but please excuse that.
import pymongo
import re
import datetime

client = pymongo.MongoClient()
db = client.PWSocial
fourteen_days_ago = datetime.datetime.utcnow() - datetime.timedelta(days=14)
id_list = [57947109, 183093247, 89443197, 431336956]
ar1 = [" deal ", " deals ", " offer ", " offers " "discount", "promotion", " sale ", " inr", " rs", "%", "inr ", "rs ", " rs."]

def func(ac_id):
    mylist = []
    newlist = []
    tweets = list(db.tweets.find({'user_id': ac_id, 'created_at': {'$gte': fourteen_days_ago}}))
    for item in tweets:
        data = item.get('text')
        data = data.lower()
        data = data.split()
        flag = 0
        if set(ar1).intersection(data):
            flag = 1
        abc = []
        for x in ar1:
            for y in data:
                if re.search(x, y):
                    abc.append(x)
                    flag = 1
                    break
        if flag == 1:
            mylist.append(item.get('id'))
            newlist.append(abc)
    print(mylist)
    print(newlist)

for i in id_list:
    func(i)
This code doesn't give me any correct results, and being new to regexes, I cannot figure out what's wrong with it. Can anyone suggest a better way to do this job? Any help is appreciated.
My first advice: learn regular expressions; they give you almost unlimited power for text processing.
But to give you a working solution (and a starting point for further exploration), try this:
import re

re_offers = re.compile(r'''
    \b              # Word boundary
    (?:             # Non-capturing group
        deals?          # deal or deals
        |               # or ...
        offers?         # offer or offers
        |
        discount
        |
        promotion
        |
        sale
        |
        rs\.?           # rs or rs.
        |
        inr\d+          # INR then digits
        |
        \d+inr          # Digits then INR
    )               # End of group
    \b              # Word boundary
    |               # or ...
    \b\d+%          # Digits (1 or more) then a percent sign
    |
    \$\d+\b         # Dollar then digits (no thousands separator handling yet)
    ''',
    re.I | re.X)    # Ignore case, verbose format - for you :)

abc = re_offers.findall("e misio $1 is inr123 discount 1INR a 1% and deal")
print(abc)
You don't need to use a regular expression for this; you can use any():
if any(term in tweet for term in search_terms):
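For instance, a minimal self-contained sketch of that approach; the tweet text and terms below are made up:

# Hypothetical data, just to show the any() membership test
search_terms = ["sale", "discount", "offer"]
tweets = ["Huge discount this weekend!", "Just a normal tweet"]

for tweet in tweets:
    if any(term in tweet.lower() for term in search_terms):
        print("matched:", tweet)   # only the first tweet is printed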
In your array of things to search for, you don't have a comma between " offers " and "discount", which causes the two strings to be joined together.
Also, when you use split you are getting rid of the whitespace in your input text. "I have a deal" will become ["I", "have", "a", "deal"], but your search terms almost all contain whitespace, so remove the spaces from the search terms in ar1.
However, you might want to avoid using regular expressions and just use in instead (you will still need the changes I suggest above, though):
if x in y:
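To illustrate both points (the merged strings and the plain in test), here is a small sketch with a shortened, made-up list rather than the full ar1:

# Adjacent string literals are concatenated by Python, so the missing comma
# silently merges two search terms into one
terms = [" offers " "discount", " sale "]
print(terms)                                 # [' offers discount', ' sale '] - two items, not three
print("discount" in "big discount today")    # True - a plain substring test with `in`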
You might want to consider starting with find instead of a regex. You don't have complex expressions, and as you're handling a line of text you don't need to call split; instead, just use find:
for token in ar1:
    if data.find(token) != -1:
        abc.append(data)
Your for item in tweets loop becomes:
for item in tweets:
    data = item.get('text')
    data = data.lower()
    for x in ar1:
        if data.find(x) != -1:
            newlist.append(data)
            mylist.append(item.get('id'))
            break
Re: your comment on jonsharpe's post, to avoid including substrings, surround your tokens with spaces, e.g. " rs ", " INR ".
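A quick sketch of why the surrounding spaces matter (made-up strings):

# Without padding, "rs" matches inside unrelated words
print("rs" in "this cursor is broken")     # True - false positive from "cursor"
print(" rs " in "this cursor is broken")   # False
print(" rs " in "price 100 rs only")       # True - the intended hit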
What I am trying to do is take user-input text that may contain wildcards (so I need to keep them that way) and then search for the specified input. In the example I have working below, I use the pipe |.
I figured out how to make this work:
import re

dual = 'a bunch of stuff and a bunch more stuff!'
reobj = re.compile(r'b(.*?)f|\s[a](.*?)u', re.IGNORECASE)
result = reobj.findall(dual)
for link in result:
    print(link[0] + ' ' + link[1])
which returns:
unch o
nd a b
As well:
dual2 = 'a bunch of stuff and a bunch more stuff!'
# So I want to now send in regex patterns of my own.
userin1 = r'b(.*?)f'
userin2 = r'\s[a](.*?)u'
reobj = re.compile(userin1, re.IGNORECASE)
result = reobj.findall(dual2)
for link in result:
    print(link[0] + ' ' + link[1])
Which returns:
u n
u n
I don't understand what it is doing, because if I get rid of everything except link[0] in the print I get:
u
u
I can, however, pass in a single user-input regex string:
dual = 'a bunch of stuff and a bunch more stuff!'
userinput = 'b(.*?)f'
reobj = re.compile(userinput, re.IGNORECASE)
result = reobj.findall(dual)
print(result)
but when I try to update this to two user strings with the pipe:
dual = 'a bunch of stuff and a bunch more stuff!'
userin1 = 'b(.*?)f'
userin2 = '\s[a](.*?)u'
reobj = re.compile(userin1|userin2, re.IGNORECASE)
result = reobj.findall(dual)
print(result)
I get the error:
reobj = re.compile(userin1|userin2, re.IGNORECASE)
TypeError: unsupported operand type(s) for |: 'str' and 'str'
I get this error a lot, for example if I put brackets () or [] around userin1|userin2.
I have found the following:
Python regular expressions OR
but cannot get it to work.
What I would like to do is understand how to pass in these regex variables combined with OR and return all the matches of both, as well as something such as AND. In the end this will be useful because it will operate on files and let me know which files contain particular words under the various logical relations (OR, AND, etc.).
Thanks much for your thoughts,
Brian
Although I couldn't get the answer from A. Rodas to work, it gave me the idea of using .join. The example I worked out, although slightly different, returns (in link[0] and link[1]) the desired results.
import re

userin1 = '(T.*?n)'
userin2 = '(G.*?p)'
list_patterns = [userin1, userin2]
swaplogic = '|'

string = 'What is a Torsion Abelian Group (TAB)?'
theresult = re.findall(swaplogic.join(list_patterns), string)
print(theresult)

for link in theresult:
    print(link[0] + ' ' + link[1])