Python: matching OR of two variables containing regex code

What I am trying to do is take user input text that contains regex wildcards (so I need to keep them as they are) and search for the specified input. In the working example below I use the pipe |.
I figured out how to make this work:
dual = 'a bunch of stuff and a bunch more stuff!'
reobj = re.compile('b(.*?)f|\s[a](.*?)u', re.IGNORECASE)
result = reobj.findall(dual)
for link in result:
    print link[0] +' ' + link[1]
which returns:
unch o
nd a b
I also tried the following:
dual2 = 'a bunch of stuff and a bunch more stuff!'
#So I want to now send in the regex codes of my own.
userin1 = 'b(.*?)f'
userin2 = '\s[a](.*?)u'
reobj = re.compile(userin1, re.IGNORECASE)
result = reobj.findall(dual2)
for link in result:
    print link[0] +' ' + link[1]
Which returns:
u n
u n
I don't understand what it is doing: if I remove everything except link[0] from the print statement, I get:
u
u
However, I can pass in a single user-supplied regex string:
dual = 'a bunch of stuff and a bunch more stuff!'
userinput = 'b(.*?)f'
reobj = re.compile(userinput, re.IGNORECASE)
result = reobj.findall(dual)
print(result)
But when I try to extend this to two user strings joined with the pipe:
dual = 'a bunch of stuff and a bunch more stuff!'
userin1 = 'b(.*?)f'
userin2 = '\s[a](.*?)u'
reobj = re.compile(userin1|userin2, re.IGNORECASE)
result = reobj.findall(dual)
print(result)
I get the error:
reobj = re.compile(userin1|userin2, re.IGNORECASE)
TypeError: unsupported operand type(s) for |: 'str' and 'str'
I get this error repeatedly, for example when I put brackets () or [] around userin1|userin2.
I have found the following:
Python regular expressions OR
but cannot get it to work.
What I would like is to understand how to pass in these regex variables combined with OR and get back all the matches of both, and likewise with something like AND. In the end this will operate on files and tell me which files contain particular words under the various logical relations (OR, AND, etc.).
Thanks much for your thoughts,
Brian

Although I couldn't get the answer from A. Rodas to work, it gave me the idea of using .join. The example I worked out is slightly different, but it returns the desired results (in link[0] and link[1]).
userin1 = '(T.*?n)'
userin2 = '(G.*?p)'
list_patterns = [userin1,userin2]
swaplogic = '|'
string = 'What is a Torsion Abelian Group (TAB)?'
theresult = re.findall(swaplogic.join(list_patterns), string)
print theresult
for link in theresult:
    print link[0]+' '+link[1]
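The TypeError in the question comes from applying | to two Python strings; the alternation has to be part of the pattern text itself. A minimal Python 3 sketch of OR and AND over user-supplied patterns could look like the following. The helper names match_any and match_all are made up for illustration. Each pattern is wrapped in a non-capturing group so the alternation cannot bleed across patterns, and finditer with group(0) is used because findall returns only the captured groups when a pattern contains any.

```python
import re

def match_any(patterns, text, flags=re.IGNORECASE):
    """OR: return the full text of every match of any of the patterns."""
    combined = "|".join("(?:%s)" % p for p in patterns)
    return [m.group(0) for m in re.finditer(combined, text, flags)]

def match_all(patterns, text, flags=re.IGNORECASE):
    """AND: True only if every pattern is found somewhere in the text."""
    return all(re.search(p, text, flags) is not None for p in patterns)

dual = 'a bunch of stuff and a bunch more stuff!'
userin1 = r'b(.*?)f'
userin2 = r'\s[a](.*?)u'

print(match_any([userin1, userin2], dual))   # ['bunch of', ' and a bu']
print(match_all([userin1, userin2], dual))   # True
```

For the file use case, match_all could be called once per file's contents to decide whether the file satisfies every pattern, and match_any to list what was found.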

Related

Function to extract company register number from text string using Regex

I have a function which extracts the company register number (German: handelsregisternummer) from a given text. Although my regex for this particular problem matches the correct format (please see demo), I cannot extract the correct company register number.
I want to extract HRB 142663 B but I get HRB 142663.
Most numbers are in the format HRB 123456 but sometimes there is the letter B attached to the end.
import re
def get_handelsregisternummer(string, keyword):
    # https://regex101.com/r/k6AGmq/10
    reg_1 = fr'\b{keyword}[,:]?(?:[- ](?:Nr|Nummer)[.:]*)?\s?(\d+(?: \d+)*)(?: B)?'
    match = re.compile(reg_1)
    handelsregisternummer = match.findall(string) # list of matched words
    if handelsregisternummer: # not empty
        return handelsregisternummer[0]
    else: # no match found
        handelsregisternummer = ""
    return handelsregisternummer
Example text scraped from website. Linebreaks make words attached to each other:
text_impressum = """"Berlin, HRB 142663 BVAT-ID.: DE283580648Tax Reference Number:"""
Apply function:
for keyword in ['HRB', 'HRA', 'HR B', 'HR A']:
    handelsregisternummer = get_handelsregisternummer(text_impressum, keyword=keyword)
    if handelsregisternummer: # if list is not empty anymore, then do...
        handelsregisternummer = keyword + " " + handelsregisternummer
        break
if not handelsregisternummer: # if list is empty
    handelsregisternummer = 'not specified'
handelsregisternummer_dict = {'handelsregisternummer':handelsregisternummer}
Afterwards I get:
handelsregisternummer_dict ={'handelsregisternummer': 'HRB 142663'}
But I want this:
handelsregisternummer_dict ={'handelsregisternummer': 'HRB 142663 B'}
You need to use two capturing groups in the regex to capture the keyword and the number, and just match the rest:
reg_1 = fr'\b({keyword})[,:]?(?:[- ](?:Nr|Nummer)[.:]*)?\s?(\d+(?: \d+)*(?: B)?)'
#            |_________|                                   |____________________|
Then join all the capturing groups that findall returns for the match:
    if handelsregisternummer: # if list is not empty anymore, then do...
        handelsregisternummer = " ".join(handelsregisternummer)
        break
See the Python demo.
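Put together, the two-group approach can be sketched as a single self-contained function. This is only an illustration under a few assumptions: the name extract_register_number is hypothetical, it uses search with groups() instead of findall, and it loops over the keywords internally rather than in the caller.

```python
import re

def extract_register_number(text, keywords=('HRB', 'HRA', 'HR B', 'HR A')):
    """Return '<keyword> <number>[ B]' for the first keyword that matches, else 'not specified'."""
    for keyword in keywords:
        # Group 1 captures the keyword, group 2 the digits plus an optional trailing " B".
        pattern = rf'\b({keyword})[,:]?(?:[- ](?:Nr|Nummer)[.:]*)?\s?(\d+(?: \d+)*(?: B)?)'
        match = re.search(pattern, text)
        if match:
            return " ".join(match.groups())
    return 'not specified'

text_impressum = 'Berlin, HRB 142663 BVAT-ID.: DE283580648Tax Reference Number:'
print(extract_register_number(text_impressum))   # HRB 142663 B
```

Because the optional " B" now sits inside the second capturing group, it is returned together with the digits instead of being matched and discarded.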

Pandas Python Regular Expression Assistance

I wasn't sure what to call this title, feel free to edit it if you think there is a better name.
What I am trying to do is find cases that match certain search criteria.
Specifically, I am trying to find sentences that contain the word "where". Once I have identified one, I want to check whether "SqlCommand" is also located within that same tag.
Let's say I have a dataframe that looks like this:
search_criteria = ['where']
df4
Q R
0 file.sql <sentence>dave likes stuff</sentence><properties>version = "2", description = "example" type="SqlCommand">select id, name, from table where criteria = '5'</property><sentence>dave hates stuff>
0 file.sql <sentence>dave likes stuff</sentence><properties>version = "2", description = "example">select id, name, from table where criteria = '5'</properties><sentence>dave hates stuff>
I am trying to return this:
Q R
0 file.sql <properties>version = "2", description = "example">select id, name, from table</properties>
This record should get returned because it contains both "where" and "sqlcommand".
Here is my current process:
regex_stuff = df_all_xml_mfiles_tgther[cc:cc+1].R.str.findall('(<[^<]*?' + 'where' + '[^>]*?>)', re.IGNORECASE)
sql_command_regex_stuff = df_all_xml_mfiles_tgther[cc:cc+1].R.str.findall('(<property[^<]*?' + 'sqlcommand' + '[^>]*?<\/property>)', re.IGNORECASE)
if not regex_stuff.empty: #if one of the search criteria is found
    if not sql_command_regex_stuff.empty: #check to see if the phrase "sqlcommand" is found anywhere as well
        (insert rest of code)
This does not return anything.
What am I doing wrong?
Edit #1:
It seems like I need to do something at the end, to make the regex look something like this:
<property[^<]*?SqlCommand[^(<\/property>)]*
It doesn't work yet, but I feel like this is a step in the right direction.
You could just filter with str.contains:
df[(df['R'].str.contains('where', flags=re.IGNORECASE) & df['R'].str.contains('sqlcommand', flags=re.IGNORECASE))]
Q R
0 file.sql <sentence>dave likes stuff</sentence><properti...
or use ~ to return the opposite: rows that do not contain both 'where' and 'sqlcommand'
df[~(df['R'].str.contains('where', flags=re.IGNORECASE) & df['R'].str.contains('sqlcommand', flags=re.IGNORECASE))]
Q R
1 file.sql <sentence>dave likes stuff</sentence><properti...
First of all, you need well-formed XML and SQL content, so you should make the following corrections:
As the opening tag is <properties>, the closing tag must also be </properties>, not </property>.
version, description and type are attributes (the > closing the opening tag comes after them), so after properties there should be a space, not >.
Remove , after version = "2".
Remove , after name.
Remove ( before <properties and ) after </properties>.
To find the required rows, use str.contains as the filtering expression.
Below you have an example program:
import pandas as pd
import re
df4 = pd.DataFrame({
'Q' : 'file.sql',
'R' : [
'<s>dave</s><properties type="SqlCommand">select id, name '
'from table where criteria=\'5\'</properties><s>dave</s>',
'<s>dave</s><properties>select id, name from table '
'where criteria=\'6\'</properties><s>dave</s>',
'<s>mike</s><properties type="SqlCommand">drop table "Xyz"'
'</properties><s>mike</s>' ]})
df5 = df4[df4.R.str.contains(
'<properties[^<>]+?sqlcommand[^<>]+?>[^<>]+?where',
flags=re.IGNORECASE)]
print(df5)
Note that the regex enforces the proper sequence of strings:
First match <properties.
Then a sequence of chars other than < and > ([^<>]+?), so we are still within the just-opened XML tag.
Then match sqlcommand (ignoring case).
Then another sequence of chars other than < and > ([^<>]+?).
Then >, closing the tag.
Then another sequence of chars other than < and > ([^<>]+?).
And finally where (also ignoring case).
Checking for sqlcommand and where in two separate regexes is wrong, as these words can be at other locations that do not meet your requirement.
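To see why two independent checks can give a false positive, here is a small sketch using only the re module with the same combined pattern. The two sample strings and the function names separate_checks and combined_check are invented for illustration; the second string has "where" in a different tag than the SqlCommand one.

```python
import re

COMBINED = re.compile(
    r'<properties[^<>]+?sqlcommand[^<>]+?>[^<>]+?where',
    re.IGNORECASE)

def separate_checks(text):
    """True if both words occur anywhere in the text, regardless of position."""
    return (re.search('where', text, re.IGNORECASE) is not None
            and re.search('sqlcommand', text, re.IGNORECASE) is not None)

def combined_check(text):
    """True only if 'where' appears in the body of a SqlCommand <properties> tag."""
    return COMBINED.search(text) is not None

same_tag = ('<s>dave</s><properties type="SqlCommand">'
            'select id from t where x=1</properties>')
different_tags = ('<s>where is dave</s><properties type="SqlCommand">'
                  'drop table t</properties>')

for text in (same_tag, different_tags):
    print(separate_checks(text), combined_check(text))
```

Both strings pass the separate checks, but only the first one passes the combined pattern.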

Exhaustively parse file for all matches

I have a grammar for parsing some log files using pyparsing but am running into an issue where only the first match is being returned. Is there a way to ensure that I get exhaustive matches? Here's some code:
from pyparsing import Literal, Optional, oneOf, OneOrMore, ParserElement, Regex, restOfLine, Suppress, ZeroOrMore
ParserElement.setDefaultWhitespaceChars(' ')
dt = Regex(r'''\d{2} (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) 20\d\d \d\d:\d\d:\d\d\,\d{3}''')
# TODO maybe add a parse action to make a datetime object out of the dt capture group
log_level = Suppress('[') + oneOf("INFO DEBUG ERROR WARN TRACE") + Suppress(']')
package_name = Regex(r'''(com|org|net)\.(\w+\.)+\w+''')
junk_data = Optional(Regex('\(.*?\)'))
guid = Regex('[A-Za-z0-9]{8}-[A-Za-z0-9]{4}-[A-Za-z0-9]{4}-[A-Za-z0-9]{4}-[A-Za-z0-9]{12}')
first_log_line = dt.setResultsName('datetime') + \
log_level('log_level') + \
guid('guid') + \
junk_data('junk') + \
package_name('package_name') + \
Suppress(':') + \
restOfLine('message') + \
Suppress('\n')
additional_log_lines = Suppress('\t') + package_name + restOfLine
log_entry = (first_log_line + Optional(ZeroOrMore(additional_log_lines)))
log_batch = OneOrMore(log_entry)
In my mind, the last two lines are sort of equivalent to
log_entry := first_log_line | first_log_line additional_log_lines
additional_log_lines := additional_log_line | additional_log_line additional_log_lines
log_batch := log_entry | log_entry log_batch
Or something of the sort. Am I thinking about this wrong? I only see a single match with all of the expected tokens when I do print(log_batch.parseString(data).dump()).
Your scanString behavior is a strong clue. Suppose I wrote an expression to match one or more items, and erroneously defined my expression such that the second item in my list did not match. Then OneOrMore(expr) would fail, while expr.scanString would "succeed", in that it would give me more matches, but would still overlook the match I might have wanted, but just mis-parsed.
import pyparsing as pp
data = "AAA _AB BBB CCC"
expr = pp.Word(pp.alphas)
print(pp.OneOrMore(expr).parseString(data))
Gives:
['AAA']
At first glance, this looks like the OneOrMore is failing, whereas scanString shows more matches:
['AAA']
['AB'] <- really wanted '_AB' here
['BBB']
['CCC']
Here is a loop using scanString which prints not the matches, but the gaps between the matches, and where they start:
# loop to find non-matching parts in data
last_end = 0
for t,s,e in expr.scanString(data):
    gap = data[last_end:s]
    print(s, ':', repr(gap))
    last_end = e
Giving:
0 : ''
5 : ' _' <-- AHA!!
8 : ' '
12 : ' '
Here's another way to visualize this.
# print markers where each match begins in input string
markers = [' ']*len(data)
for t,s,e in expr.scanString(data):
    markers[s] = '^'
print(data)
print(''.join(markers))
Prints:
AAA _AB BBB CCC
^    ^  ^   ^
Your code would be a little more complex since your data spans many lines, but using pyparsing's line, lineno and col methods, you could do something similar.
So, there's a workaround that seems to do the trick. For whatever reason, scanString does iterate through them all appropriately, so I can very simply get my matches in a generator with:
matches = (m for m, _, _ in log_batch.scanString(data))
Still not sure why parseString isn't working exhaustively, though, and still a bit worried that I've misunderstood something about pyparsing, so more pointers are welcome here.

Python: Match strings for certain terms

I have a list of tweets, from which I have to choose tweets that have terms like "sale", "discount", or "offer". Also, I need to find tweets that advertise certain deals, like a discount, by recognizing things like "%", "Rs.", "$" amongst others. I have absolutely no idea about regular expressions and the documentation isn't getting me anywhere. Here is my code. It's rather lousy, but please excuse that
import pymongo
import re
import datetime
client = pymongo.MongoClient()
db = client .PWSocial
fourteen_days_ago = datetime.datetime.utcnow() - datetime.timedelta(days=14)
id_list = [57947109, 183093247, 89443197, 431336956]
ar1 = [" deal "," deals ", " offer "," offers " "discount", "promotion", " sale ", " inr", " rs", "%", "inr ", "rs ", " rs."]
def func(ac_id):
    mylist = []
    newlist = []
    tweets = list(db.tweets.find({'user_id' : ac_id, 'created_at': { '$gte': fourteen_days_ago }}))
    for item in tweets:
        data = item.get('text')
        data = data.lower()
        data = data.split()
        flag = 0
        if set(ar1).intersection(data):
            flag = 1
        abc = []
        for x in ar1:
            for y in data:
                if re.search(x,y):
                    abc.append(x)
                    flag = 1
                    break
        if flag == 1:
            mylist.append(item.get('id'))
            newlist.append(abc)
    print mylist
    print newlist
for i in id_list:
    func(i)
This code doesn't give me any correct results, and being a noob to regexes, I cannot figure out what's wrong with it. Can anyone suggest a better way to do this job? Any help is appreciated.
My first advice: learn regular expressions; they give you enormous power for text processing.
But to give you a working solution (and a starting point for further exploration), try this:
import re
re_offers = re.compile(r'''
\b # Word boundary
(?: # Non capturing parenthesis
deals? # Deal or deals
| # or ...
offers? # Offer or offers
|
discount
|
promotion
|
sale
|
rs\.? # rs or rs.
|
inr\d+ # INR then digits
|
\d+inr # Digits then INR
) # And group
\b # Word boundary
| # or ...
\b\d+% # Digits (1 or more) then percent
|
\$\d+\b # Dollar then digits (didn't care of thousand separator yet)
''',
re.I|re.X) # Ignore case, verbose format - for you :)
abc = re_offers.findall("e misio $1 is inr123 discount 1INR a 1% and deal")
print(abc)
You don't need to use a regular expression for this, you can use any:
if any(term in tweet for term in search_terms):
In your array of things to search for you don't have a comma between " offers " and "discount" which is causing them to be joined together.
Also when you use split you are getting rid of the whitespace in your input text. "I have a deal" will become ["I","have","a","deal"] but your search terms almost all contain whitespace. So remove the spaces from your search terms in array ar1.
However, you might want to avoid using regular expressions and just use in instead (you will still need the changes I suggest above though):
if x in y:
You might want to consider starting with find instead of a regex. You don't have complex expressions, and as you're handling a line of text you don't need to call split; just use find:
for token in ar1:
    if data.find(token) != -1:
        abc.append(data)
Your for item in tweets loop becomes:
for item in tweets:
    data = item.get('text')
    data = data.lower()
    for x in ar1:
        if data.find(x) != -1:  # find returns -1 when x is absent
            newlist.append(data)
            mylist.append(item.get('id'))
            break
Re: your comment on jonsharpe's post, to avoid including substrings, surround your tokens by spaces, e.g. " rs ", " INR "
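As an alternative to padding tokens with spaces, which misses terms at the very start or end of a tweet or next to punctuation, word boundaries can do the same job. The following is a rough Python 3 sketch, not a drop-in replacement for the code above: the term list is trimmed from ar1, and the names WORD_RE, PRICE_RE and offer_terms are made up for this example.

```python
import re

# Whole-word terms; \b stops "sale" from matching inside "wholesale".
WORD_RE = re.compile(r'\b(?:deals?|offers?|discount|promotion|sale|inr|rs)\b',
                     re.IGNORECASE)
# Simple price markers: digits followed by %, or $ followed by digits.
PRICE_RE = re.compile(r'\d+%|\$\d+')

def offer_terms(tweet):
    """Return the offer-related terms found in one tweet (words lower-cased)."""
    found = [m.group(0).lower() for m in WORD_RE.finditer(tweet)]
    found += [m.group(0) for m in PRICE_RE.finditer(tweet)]
    return found

tweets = [
    'Flat 20% discount on all orders',
    'Wholesale prices for offshore users',
    'Big SALE: deals from $5',
]
for tweet in tweets:
    print(offer_terms(tweet))
```

A tweet would then be kept whenever offer_terms returns a non-empty list.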

Python search for multiple values and show with boundaries

I am trying to allow the user to do this:
Let's say initially the text says:
"hello world hello earth"
when the user searches for "hello" it should display:
|hello| world |hello| earth
here's what I have:
m = re.compile(pattern)
i = 0
match = False
while i < len(self.fcontent):
    content = " ".join(self.fcontent[i])
    i = i + 1
    for find in m.finditer(content):
        print i,"\t"+content[:find.start()]+"|"+content[find.start():find.end()]+"|"+content[find.end():]
        match = True
        pr = raw_input( "(n)ext, (p)revious, (q)uit or (r)estart? ")
        if (pr == 'q'):
            break
        elif (pr == 'p'):
            i = i - 2
        elif (pr == 'r'):
            i = 0
if match is False:
    print "No matches in the file!"
where :
pattern = user specified pattern
fcontent = contents of a file read in and stored as array of words and lines e.g:
[['line','1'],['line','2','here'],['line','3']]
however it prints
|hello| world hello earth
hello world |hello| earth
How can I merge the two lines so they are displayed as one?
Thanks
Edit:
This is part of a larger search function where the pattern (in this case the word "hello") is passed in by the user, so I have to use regex search/match/finditer to find the pattern. The replace and other methods sadly won't work, because the user can choose to search for "[0-9]$", and that would mean putting the ending number between |'s.
If you're just doing that, use str.replace.
print self.content.replace(m.pattern, "|%s|" % m.pattern)
you can use regexp as follows:
import re
src = "hello world hello earth"
dst = re.sub('hello', '|hello|', src)
print dst
or use string replace:
dst = src.replace('hello', '|hello|')
Ok, going back to original solution since OP confirmed that word would stand on its own (ie not be a substring of another word).
target = 'hello'
line = 'hello world hello earth'
rep_target = '|{}|'.format(target)
line = line.replace(target, rep_target)
yields:
|hello| world |hello| earth
As has been pointed out based on your example, using str.replace is the easiest. If more complex criteria are required, then you can adapt the following...
import re
def highlight(string, words, boundary='|'):
    if isinstance(words, basestring):
        words = [words]
    # Join the escaped words with the regex alternation operator, longest first
    rs = '({})'.format('|'.join(sorted(map(re.escape, words), key=len, reverse=True)))
    return re.sub(rs, lambda L: '{0}{1}{0}'.format(boundary, L.group(1)), string)
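Since the question's edit says the pattern is an arbitrary user-supplied regex such as "[0-9]$", escaping the input is not an option there. A minimal Python 3 sketch for that case passes the pattern straight to re.sub and wraps each match via a callback. The function name highlight_matches is an assumption made for this example.

```python
import re

def highlight_matches(pattern, text, boundary='|'):
    """Wrap every match of a user-supplied regex in boundary markers, in one pass."""
    return re.sub(pattern, lambda m: boundary + m.group(0) + boundary, text)

print(highlight_matches('hello', 'hello world hello earth'))  # |hello| world |hello| earth
print(highlight_matches('[0-9]$', 'line 3'))                  # line |3|
```

Because re.sub handles all matches in a single call, each line is printed once with every occurrence marked, rather than once per match as in the finditer loop. A pattern that can match the empty string would insert empty marker pairs, so such input may need to be rejected first.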
