Boolean search text file in Python

Boolean search text file in Python - python

I have a text file with 32 articles. Each article starts with the expression: <Number> of 32 DOCUMENTS, for example: 1 of 32 DOCUMENTS, 2 of 32 DOCUMENTS, etc. In order to find each article I have used the following code:
import re
sections = []
current = []
with open("Aberdeen2005.txt") as f:
for line in f:
if re.search(r"(?i)\d+ of \d+ DOCUMENTS", line):
sections.append("".join(current))
current = [line]
else:
current.append(line)
print(len(sections))
So now, articles are represented by the expression sections
The next thing I want to do, is to subgroup the articles in 2 groups. Those articles containing the words: economy OR economic AND uncertainty OR uncertain AND tax OR policy, identify them with the number 1.
Whereas those articles containing the following words: economy OR economic AND uncertain OR uncertainty AND regulation OR spending, identify them with the number 2. This is what I have tried so far:
for i in range(len(sections)):
group1 = re.search(r"+[economic|economy].+[uncertainty|uncertain].+[tax|policy]", , sections[i])
group2 = re.search(r"+[economic|economy].+[uncertainty|uncertain].+[regulation|spending]", , sections[i])
Nevertheless, it does not seem to work. Any ideas why?

It's a bit wordy, but you can get away without using regular expressions here, for example:
# Take a lowercase copy for comparisons
s = sections[i].lower()
if (('economic' in s or 'economy' in s) and
('uncertainty' in s or 'uncertain' in s) and
('tax' in s or 'policy' in s)):
do_stuff()

It is possible to write this as a single regular expression, but it is a bit tricky. For each and you'd use a zero-width lookahead assertion (?= ), and for each or you'd use a branch. Also, we'd have to use the \b for a word boundary. We'd use re.match instead of re.search.
belongs_to_group1 = bool(re.match(
r'(?=.*\b(?:economic|economy)\b)'
r'(?=.*\b(?:uncertain|uncertainty)\b)'
r'(?=.*\b(?:tax|policy)\b)', text, re.I))
Thus not very readable.
A more fruitful approach would be to find all words and put them into a set
words = set(re.findall(r'\w+', text.lower()))
belongs_to_group1 = (('uncertainty' in words or 'uncertain' in words)
and ('economic' in words or 'economy' in words)
and ('tax' in words or 'policy' in words))

You can use re.search to find those words. Then you can use if statements and python's and and or statements for the logic, and then store group one and two as two lists with the section index number as a value.
One thing you might want to note is that your logic may need brackets.
By
economy OR economic AND uncertainty OR uncertain AND tax OR policy
I assume you mean
(economy OR economic) AND (uncertainty OR uncertain) AND (tax OR policy)
which is different to (for example)
economy OR (economic AND uncertainty) OR (uncertain AND tax) OR policy
EDIT1:
Python will evaluate your statement without brackets from left to right, i.e.:
( ( ( ( (economy OR economic) AND uncertainty) OR uncertain) AND tax) OR policy)
Which I imagine is not what you want (e.g. the above evaluates true if it includes the word policy but none of the others)
EDIT2:
As pointed out in comments, EDIT1 is incorrect, although you would still need brackets to achieve case 1, if you don't have them you will get case 2 instead (and case 3 is a load of rubbish)

Related

How to remove spaces in a "single" word? ("bo ok" to "book")

I am reading a badly formatted text, and often there are unwanted spaces inside a single word. For example, "int ernational trade is not good for economies" and so forth. Is there any efficient tool that can cope with this? (There are a couple of other answers like here, which do not work in a sentence.)
Edit: About the impossibility mentioned, I agree. One option is to preserve all possible options. In my case this edited text will be matched with another database that has the original (clean) text. This way, any wrong removal of spaces just gets tossed away.\

You could use the PyEnchant package to get a list of English words. I will assume words that do not have meaning on their own but do together are a word, and use the following code to find words that are split by a single space:
import enchant
text = "int ernational trade is not good for economies"
fixed_text = []
d = enchant.Dict("en_US")
for i in range(len(words := text.split())):
if fixed_text and not d.check(words[i]) and d.check(compound_word := ''.join([fixed_text[-1], words[i]])):
fixed_text[-1] = compound_word
else:
fixed_text.append(words[i])
print(' '.join(fixed_text))
This will split the text on spaces and append words to fixed_text. When it finds that a previously added word is not in the dictionary, but appending the next word to it does make it valid, it sticks those two words together.
This should help sanitize most of the invalid words, but as the comments mentioned it is sometimes impossible to find out if two words belong together without performing some sort of lexical analysis.
As suggested by Pranav Hosangadi, here is a modified (and a little more involved) version which can remove multiple spaces in words by compounding previously added words which are not in the dictionary. However, since a lot of smaller words are valid in the English language, many spaced out words don't correctly concatenate.
import enchant
text = "inte rnatio nal trade is not good for ec onom ies"
fixed_text = []
d = enchant.Dict("en_US")
for i in range(len(words := text.split())):
if fixed_text and not d.check(compound_word := words[i]):
for j, pending_word in enumerate(fixed_text[::-1], 1):
if not d.check(pending_word) and d.check(compound_word := ''.join([pending_word, compound_word])):
del fixed_text[-j:]
fixed_text.append(compound_word)
break
else:
fixed_text.append(words[i])
else:
fixed_text.append(words[i])
print(' '.join(fixed_text))

Reading financial statements using REGEX

I'm working on a project where I have to read scanned images of financial statements. I used tesseract 4 to convert the image to a text output, which looks as such (here is a snippet):
REVENUE 9,000,000 900,000
COST OF SALES 900,000 900,000
GROSS PROFIT (90%; 2016 - 90%) 900,000 900,000
I would like to break the above into a list of three entries, where the first entry is the text, then the second and third entries would be the numbers. For example the first row would look something like this:
[[REVENUE], [9,000,000], [9,000,000]]
I came across this stack overflow post where someone attempts to use re.match() to the .groups() method to find the pattern: How to split strings into text and number?
I'm just being introduced to regex and I'm struggling to properly understand the syntax and documentation. I'm trying to use a cheat sheet for now, but I'm having a tough time figuring out how to go about this, please help.

I wrote this regex through watching your first expected output. But i am not sure what your desired output is with your third sentence.
([A-Za-z ]+)(?=\d|\S) match name until we found a number or symbol.
.*? for the string which we do not care
([\d,]+)\s([\d,]+|(?=-\n|-$)) match one or two groups of number, if there is only one group of number, this group should end with newline or end of text.
Test code(edited):
import re
regex = r"([A-Za-z ]+)(?=\d|\S).*?([\d,]+)\s([\d,]+|(?=-\n|-$))"
text = """
REVENUE 9,000,000 900,000
COST OF SALES 900,000 900,000
GROSS PROFIT (90%; 2016 - 90%) 900,000 900,000
Business taxes 999 -
"""
print(re.findall(regex,text))
# [('REVENUE ', '9,000,000', '900,000'), ('COST OF SALES ', '900,000', '900,000'), ('GROSS PROFIT ', '900,000', '900,000'), ('Business taxes ', '999', '')]

Regexes are overkill for this problem as you've stated it.
text.split() and a join of the items before the last two is better suited to this.
lines = [ "REVENUE 9,000,000 900,000",
"COST OF SALES 900,000 900,000",
"GROSS PROFIT (90%; 2016 - 90%) 900,000 900,000" ]
out = []
for line in lines:
parts = line.split()
if len(parts) < 3:
raise InputError
if len(parts) == 3:
out.append(parts)
else:
out.append([' '.join(parts[0:len(parts)-2]), parts[-2], parts[-1]])
out will contain
[['REVENUE', '9,000,000', '900,000'],
['COST OF SALES', '900,000', '900,000'],
['GROSS PROFIT (90%; 2016 - 90%)', '900,000', '900,000']]
If the label text needs further extraction, you could use regexes, or you could simply look at the items in parts[0:len(parts)-2] and process them based on the words and numbers there.

To detect the string
rev_str = "[[REVENUE], [9,000,000], [9,000,000]]"
and extract the values
("REVENUE", "9,000,000", "9,000,000")
you would do
import re
x = re.match(r"\[\[([A-Z]+)\], \[([0-9,]+)\], \[([0-9,]+)\]\]", rev_str)
x.groups()
# ('REVENUE', '9,000,000', '9,000,000')
Let's unpack this big ol' string.
Square brackets signify a range of characters. For example, [A-Z] means to look for all letters from A to Z, whereas [0-9,] means to look for the digits 0 through 9, as well as the character ,. The - here is an operator used inside square brackets to denote a range of characters that we want.
The + operator means to look for at least one occurrence of whatever immediately precedes it. For example, the expression [A-Z]+ means to look for at least one occurrence of any of the letters A through Z. You can also use the * operator instead, to look for at least zero occurrences of whatever precedes it.
The round brackets (i.e. parentheses) signify a group to be extracted from the regex. Whenever that pattern is matched, whatever is inside any expression in parentheses will be extracted and returned as a group. For example, ([A-Z+]) means to look for at least one occurrence of any of the letters A through Z, and then save whatever that turns out to be. We access this by doing x.groups() after assigning the result of the regex match to a variable x.
Otherwise, it's straightforward - accommodating for the pattern [[TEXT], [NUMBER], [NUMBER]]. The square brackets are escaped with the \ character, because we want to interpret them literally, rather than as a range of characters.
Overall, the re.match() function will search rev_str for any places where the given pattern matches, keep track of the groups within that match, and return those groups when you call x.groups().
This is a fairly simple example, but you've gotta start somewhere, right? You should be able to use this as a starting point for making a more complicated regex expression to process more of your code.

Choose best fitting regex [duplicate]

Is it possible to match 2 regular expressions in Python?
For instance, I have a use-case wherein I need to compare 2 expressions like this:
re.match('google\.com\/maps', 'google\.com\/maps2', re.IGNORECASE)
I would expect to be returned a RE object.
But obviously, Python expects a string as the second parameter.
Is there a way to achieve this, or is it a limitation of the way regex matching works?
Background: I have a list of regular expressions [r1, r2, r3, ...] that match a string and I need to find out which expression is the most specific match of the given string. The way I assumed I could make it work was by:
(1) matching r1 with r2.
(2) then match r2 with r1.
If both match, we have a 'tie'. If only (1) worked, r1 is a 'better' match than r2 and vice-versa.
I'd loop (1) and (2) over the entire list.
I admit it's a bit to wrap one's head around (mostly because my description is probably incoherent), but I'd really appreciate it if somebody could give me some insight into how I can achieve this. Thanks!

Outside of the syntax clarification on re.match, I think I am understanding that you are struggling with taking two or more unknown (user input) regex expressions and classifying which is a more 'specific' match against a string.
Recall for a moment that a Python regex really is a type of computer program. Most modern forms, including Python's regex, are based on Perl. Perl's regex's have recursion, backtracking, and other forms that defy trivial inspection. Indeed a rogue regex can be used as a form of denial of service attack.
To see of this on your own computer, try:
>>> re.match(r'^(a+)+$','a'*24+'!')
That takes about 1 second on my computer. Now increase the 24 in 'a'*24 to a bit larger number, say 28. That take a lot longer. Try 48... You will probably need to CTRL+C now. The time increase as the number of a's increase is, in fact, exponential.
You can read more about this issue in Russ Cox's wonderful paper on 'Regular Expression Matching Can Be Simple And Fast'. Russ Cox is the Goggle engineer that built Google Code Search in 2006. As Cox observes, consider matching the regex 'a?'*33 + 'a'*33 against the string of 'a'*99 with awk and Perl (or Python or PCRE or Java or PHP or ...) Awk matches in 200 microseconds but Perl would require 1015 years because of exponential back tracking.
So the conclusion is: it depends! What do you mean by a more specific match? Look at some of Cox's regex simplification techniques in RE2. If your project is big enough to write your own libraries (or use RE2) and you are willing to restrict the regex grammar used (i.e., no backtracking or recursive forms), I think the answer is that you would classify 'a better match' in a variety of ways.
If you are looking for a simple way to state that (regex_3 < regex_1 < regex_2) when matched against some string using Python or Perl's regex language, I think that the answer is it is very very hard (i.e., this problem is NP Complete)
Edit
Everything I said above is true! However, here is a stab at sorting matching regular expressions based on one form of 'specific': How many edits to get from the regex to the string. The greater number of edits (or the higher the Levenshtein distance) the less 'specific' the regex is.
You be the judge if this works (I don't know what 'specific' means to you for your application):
import re
def ld(a,b):
"Calculates the Levenshtein distance between a and b."
n, m = len(a), len(b)
if n > m:
# Make sure n <= m, to use O(min(n,m)) space
a,b = b,a
n,m = m,n
current = range(n+1)
for i in range(1,m+1):
previous, current = current, [i]+[0]*n
for j in range(1,n+1):
add, delete = previous[j]+1, current[j-1]+1
change = previous[j-1]
if a[j-1] != b[i-1]:
change = change + 1
current[j] = min(add, delete, change)
return current[n]
s='Mary had a little lamb'
d={}
regs=[r'.*', r'Mary', r'lamb', r'little lamb', r'.*little lamb',r'\b\w+mb',
r'Mary.*little lamb',r'.*[lL]ittle [Ll]amb',r'\blittle\b',s,r'little']
for reg in regs:
m=re.search(reg,s)
if m:
print "'%s' matches '%s' with sub group '%s'" % (reg, s, m.group(0))
ld1=ld(reg,m.group(0))
ld2=ld(m.group(0),s)
score=max(ld1,ld2)
print " %i edits regex->match(0), %i edits match(0)->s" % (ld1,ld2)
print " score: ", score
d[reg]=score
print
else:
print "'%s' does not match '%s'" % (reg, s)
print " ===== %s ===== === %s ===" % ('RegEx'.center(10),'Score'.center(10))
for key, value in sorted(d.iteritems(), key=lambda (k,v): (v,k)):
print " %22s %5s" % (key, value)
The program is taking a list of regex's and matching against the string Mary had a little lamb.
Here is the sorted ranking from "most specific" to "least specific":
===== RegEx ===== === Score ===
Mary had a little lamb 0
Mary.*little lamb 7
.*little lamb 11
little lamb 11
.*[lL]ittle [Ll]amb 15
\blittle\b 16
little 16
Mary 18
\b\w+mb 18
lamb 18
.* 22
This based on the (perhaps simplistic) assumption that: a) the number of edits (the Levenshtein distance) to get from the regex itself to the matching substring is the result of wildcard expansions or replacements; b) the edits to get from the matching substring to the initial string. (just take one)
As two simple examples:
.* (or .*.* or .*?.* etc) against any sting is a large number of edits to get to the string, in fact equal to the string length. This is the max possible edits, the highest score, and the least 'specific' regex.
The regex of the string itself against the string is as specific as possible. No edits to change one to the other resulting in a 0 or lowest score.
As stated, this is simplistic. Anchors should increase specificity but they do not in this case. Very short stings don't work because the wild-card may be longer than the string.
Edit 2
I got anchor parsing to work pretty darn well using the undocumented sre_parse module in Python. Type >>> help(sre_parse) if you want to read more...
This is the goto worker module underlying the re module. It has been in every Python distribution since 2001 including all the P3k versions. It may go away, but I don't think it is likely...
Here is the revised listing:
import re
import sre_parse
def ld(a,b):
"Calculates the Levenshtein distance between a and b."
n, m = len(a), len(b)
if n > m:
# Make sure n <= m, to use O(min(n,m)) space
a,b = b,a
n,m = m,n
current = range(n+1)
for i in range(1,m+1):
previous, current = current, [i]+[0]*n
for j in range(1,n+1):
add, delete = previous[j]+1, current[j-1]+1
change = previous[j-1]
if a[j-1] != b[i-1]:
change = change + 1
current[j] = min(add, delete, change)
return current[n]
s='Mary had a little lamb'
d={}
regs=[r'.*', r'Mary', r'lamb', r'little lamb', r'.*little lamb',r'\b\w+mb',
r'Mary.*little lamb',r'.*[lL]ittle [Ll]amb',r'\blittle\b',s,r'little',
r'^.*lamb',r'.*.*.*b',r'.*?.*',r'.*\b[lL]ittle\b \b[Ll]amb',
r'.*\blittle\b \blamb$','^'+s+'$']
for reg in regs:
m=re.search(reg,s)
if m:
ld1=ld(reg,m.group(0))
ld2=ld(m.group(0),s)
score=max(ld1,ld2)
for t, v in sre_parse.parse(reg):
if t=='at': # anchor...
if v=='at_beginning' or 'at_end':
score-=1 # ^ or $, adj 1 edit
if v=='at_boundary': # all other anchors are 2 char
score-=2
d[reg]=score
else:
print "'%s' does not match '%s'" % (reg, s)
print
print " ===== %s ===== === %s ===" % ('RegEx'.center(15),'Score'.center(10))
for key, value in sorted(d.iteritems(), key=lambda (k,v): (v,k)):
print " %27s %5s" % (key, value)
And soted RegEx's:
===== RegEx ===== === Score ===
Mary had a little lamb 0
^Mary had a little lamb$ 0
.*\blittle\b \blamb$ 6
Mary.*little lamb 7
.*\b[lL]ittle\b \b[Ll]amb 10
\blittle\b 10
.*little lamb 11
little lamb 11
.*[lL]ittle [Ll]amb 15
\b\w+mb 15
little 16
^.*lamb 17
Mary 18
lamb 18
.*.*.*b 21
.* 22
.*?.* 22

It depends on what kind of regular expressions you have; as #carrot-top suggests, if you actually aren't dealing with "regular expressions" in the CS sense, and instead have crazy extensions, then you are definitely out of luck.
However, if you do have traditional regular expressions, you might make a bit more progress. First, we could define what "more specific" means. Say R is a regular expression, and L(R) is the language generated by R. Then we might say R1 is more specific than R2 if L(R1) is a (strict) subset of L(R2) (L(R1) < L(R2)). That only gets us so far: in many cases, L(R1) is neither a subset nor a superset of L(R2), and so we might imagine that the two are somehow incomparable. An example, trying to match "mary had a little lamb", we might find two matching expressions: .*mary and lamb.*.
One non-ambiguous solution is to define specificity via implementation. For instance, convert your regular expression in a deterministic (implementation-defined) way to a DFA and simply count states. Unfortunately, this might be relatively opaque to a user.
Indeed, you seem to have an intuitive notion of how you want two regular expressions to compare, specificity-wise. Why not simple write down a definition of specificity, based on the syntax of regular expressions, that matches your intuition reasonably well?
Totally arbitrary rules follow:
Characters = 1.
Character ranges of n characters = n (and let's say \b = 5, because I'm not sure how you might choose to write it out long-hand).
Anchors are 5 each.
* divides its argument by 2.
+ divides its argument by 2, then adds 1.
. = -10.
Anyway, just food for thought, as the other answers do a good job of outlining some of the issues you're facing; hope it helps.

I don't think it's possible.
An alternative would be to try to calculate the number of strings of length n that the regular expression also matches. A regular expression that matches 1,000,000,000 strings of length 15 characters is less specific than one that matches only 10 strings of length 15 characters.
Of course, calculating the number of possible matches is not trivial unless the regular expressions are simple.

Option 1:
Since users are supplying the regexes, perhaps ask them to also submit some test strings which they think are illustrative of their regex's specificity. (i.e. that show their regex is more specific than a competitor's regex.) Collect all the user's submitted test strings, and then test all the regexes against the complete set of test strings.
To design a good regex, the author must have put thought into what strings match and don't match their regex, so it should be easy for them to supply good test strings.
Option 2:
You might try a Monte Carlo approach: Starting with the string that both regexes match, write a generator which generates mutations of that string (permute characters, add/remove characters, etc.) If both regexes match or don't match the same way for each mutation, then the regexes "probably tie". If one matches a mutations that the other doesn't, and vice versa, then they "absolutely tie".
But if one matches a strict superset of mutations then it is "probably less specific" than the other.
The verdict after a large number of mutations may not always be correct, but may be reasonable.
Option 3:
Use ipermute or pyParsing's invert to generate strings which match each regex. This will only work on a regexes that use a limited subset of regex syntax.

I think you could do it by looking the result of matching with the longest result
>>> m = re.match(r'google\.com\/maps','google.com/maps/hello')
>>> len(m.group(0))
15
>>> m = re.match(r'google\.com\/maps2','google.com/maps/hello')
>>> print (m)
None
>>> m = re.match(r'google\.com\/maps','google.com/maps2/hello')
>>> len(m.group(0))
15
>>> m = re.match(r'google\.com\/maps2','google.com/maps2/hello')
>>> len(m.group(0))
16

re.match('google\.com\/maps', 'google\.com\/maps2', re.IGNORECASE)
The second item to re.match() above is a string -- that's why it's not working: the regex says to match a period after google, but instead it finds a backslash. What you need to do is double up the backslashes in the regex that's being used as a regex:
def compare_regexes(regex1, regex2):
"""returns regex2 if regex1 is 'smaller' than regex2
returns regex1 if they are the same
returns regex1 if regex1 is 'bigger' than regex2
otherwise returns None"""
regex1_mod = regex1.replace('\\', '\\\\')
regex2_mod = regex2.replace('\\', '\\\\')
if regex1 == regex2:
return regex1
if re.match(regex1_mod, regex2):
return regex2
if re.match(regex2_mod, regex1):
return regex1
You can change the returns to whatever suits your needs best. Oh, and make sure you are using raw strings with re. r'like this, for example'

Is it possible to match 2 regular expressions in Python?
That certainly is possible. Use parenthetical match groups joined by | for alteration. If you arrange the parenthetical match groups by most specific regex to least specific, the rank in the returned tuple from m.groups() will show how specific your match is. You can also use named groups to name how specific your match is, such as s10 for very specific and s0 for a not so specific match.
>>> s1='google.com/maps2text'
>>> s2='I forgot my goggles at the house'
>>> s3='blah blah blah'
>>> m1=re.match(r'(^google\.com\/maps\dtext$)|(.*go[a-z]+)',s1)
>>> m2=re.match(r'(^google\.com\/maps\dtext$)|(.*go[a-z]+)',s2)
>>> m1.groups()
('google.com/maps2text', None)
>>> m2.groups()
(None, 'I forgot my goggles')
>>> patt=re.compile(r'(?P<s10>^google\.com\/maps\dtext$)|
... (?P<s5>.*go[a-z]+)|(?P<s0>[a-z]+)')
>>> m3=patt.match(s3)
>>> m3.groups()
(None, None, 'blah')
>>> m3.groupdict()
{'s10': None, 's0': 'blah', 's5': None}
If you do not know ahead of time which regex is more specific, this is a much harder problem to solve. You want to have a look at this paper covering security of regex matches against file system names.

I realize that this is a non-solution, but as there is no unambiguous way to tell which is the "most specific match", certainly when it depends on what your users "meant", the easiest thing to do would be to ask them to provide their own priority. For example just by putting the regexes in the right order. Then you can simply take the first one that matches. If you expect the users to be comfortable with regular expressions anyway, this is maybe not too much to ask?

Python Regular Expressions Findall

To look through data, I am using regular expressions. One of my regular expressions is (they are dynamic and change based on what the computer needs to look for --- using them to search through data for a game AI):
O,2,([0-9],?){0,},X
After the 2, there can (and most likely will) be other numbers, each followed by a comma.
To my understanding, this will match:
O,2,(any amount of numbers - can be 0 in total, each followed by a comma),X
This is fine, and works (in RegExr) for:
O,4,1,8,6,7,9,5,3,X
X,6,3,7,5,9,4,1,8,2,T
O,2,9,6,7,11,8,X # matches this
O,4,6,9,3,1,7,5,O
X,6,9,3,5,1,7,4,8,O
X,3,2,7,1,9,4,6,X
X,9,2,6,8,5,3,1,X
My issue is that I need to match all the numbers after the original, provided number. So, I want to match (in the example) 9,6,7,11,8.
However, implementing this in Python:
import re
pattern = re.compile("O,2,([0-9],?){0,},X")
matches = pattern.findall(s) # s is the above string
matches is ['8'], the last number, but I need to match all of the numbers after the given (so '9,6,7,11,8').
Note: I need to use pattern.findall because thee will be more than one match (I shortened my list of strings, but there are actually around 20 thousand strings), and I need to find the shortest one (as this would be the shortest way for the AI to win).
Is there a way to match the entire string (or just the last numbers after those I provided)?
Thanks in advance!

Use this:
O,2,((?:[0-9],?){0,}),X
See it in action:http://regex101.com/r/cV9wS1
import re
s = '''O,4,1,8,6,7,9,5,3,X
X,6,3,7,5,9,4,1,8,2,T
O,2,9,6,7,11,8,X
O,4,6,9,3,1,7,5,O
X,6,9,3,5,1,7,4,8,O
X,3,2,7,1,9,4,6,X
X,9,2,6,8,5,3,1,X'''
pattern = re.compile("O,2,((?:[0-9],?){0,}),X")
matches = pattern.findall(s) # s is the above string
print matches
Outputs:
['9,6,7,11,8']
Explained:
By wrapping the entire value capture between 2, and ,X in (), you end up capturing that as well. I then used the (?: ) to ignore the inner captured set.

you don't have to use regex
split the string to array
check item 0 == 0 , item 1==2
check last item == X
check item[2:-2] each one of them is a number (is_digit)
that's all

python regex help: unknown information to skip

I'm having trouble with the needed regular expression... I'm sure I need to probably be using some combination of 'lookaround' or conditional expressions, but I'm at a loss.
I have a data string like:
pattern1 pattern2 pattern3 unwanted-groups pattern4 random number of tokens pattern5 optional1 optional2 more unknown unwanted junk separated with white spaces optional3 optional4 etc
where I have a matching expression for each of the 'pattern#' and 'optional#' groups (optional groups being groups that are not required in the data and therefore not always present), but I don't have any pattern (text is free-form) or group count to skip for the other sections other than all 'tokens' are separated by white space.
I've managed to figure out how to skip the unwanted stuff between the required groups but when I hit the optional groups, I'm lost. any suggestion on where I should be looking for hints/help?
Thanks
this is what I currently have:
pattern = re.compile(r'(?:(METAR|SPECI)\s*)*(?P<ICAO>[\w]{4}\s)*'
r'(?P<NIL>(NIL)\s)*(?P<UTC>[\d]{6}Z\s)*(?P<AUTOCOR>(AUTO|COR)*\s)*'
r'(?P<WINDS>[\w]{5,6}G*[\d]{0,2}(MPS|KT|KMH)\s)\s*'
r'.*?\s' #skip miscellaneous between winds and thermal data
r'(?P<THERM>[\d]{2}/[\d]{2}\s)\s*(?P<PRESS>A[\d]{4}\s)\s*'
r'(?:RMK\s)\s*(?P<AUTO>AO\d\s)*'
r'(?P<PEAK>(PK\sWND\s[\d]{5,6}/[\d]{2,4}))*'
r'(?P<SLP>SLP[\d]{3}\s)*'
r'(?P<PRECIP>P[\d]{4}\s)*'
r'(?P<remains>.*)'
)
example = "METAR KCSM 162353Z AUTO 07011KT 10SM TS SCT100 28/19 A3000 RMK AO2 PK WND 06042/2325 WSHFT 2248 LTG DSNT ALQDS PRESRR SLP135 T02780189 10389 20272 53007="
data = pattern.match(example)
It seems to work for the first 10 groups, but that is about it....
again thanks everybody

If all the data is in that format I'd go with split instead. I think it will be faster.
str = "regex1 regex2 regex3 unwanted-regex regex4 random number of tokens regex5 optregex1 optregex2 more unknown unwanted junk separated with white spaces optregex3 optregex4 etc"
parts = str.split() # now you have each part as an element of the array.
for index,item in enumerate(parts):
if index == 3:
continue # this is unwanted-regex
else:
# do what you want with the information here

You need to use the | operator and findall:
>>> re.compile("(regex\d+|optregex\d+)")
>>> regex.findall(string)
[u'regex1', u'regex2', u'regex3', u'regex4', u'regex5', u'optregex1', u'optregex2', u'optregex3', u'optregex4']
An advice: there are several tools (GUIs) that allow you to experiment with (and actually help writing) regular expressions. For python, I'm quite fond of kodos.

If all of your targets consist of things like "foo1", "bar22" etc (in other words a sequence of letters followed by a sequence of digits) and everything else (sequences of digits, "words" without numeric suffixes, etc) is "junk" then the following seems to be sufficient:
re.findall(r'[A-Za-z]+\d+', targetstr)
(We can't use just r'\w+\d+' because \w matches digits and _ (underscores) as well as letters).
If you're looking for a limited number of key patterns, or some of the junk might match "foo123 ... then you'll obviously have to be more specific.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Boolean search text file in Python - python

It's a bit wordy, but you can get away without using regular expressions here, for example: # Take a lowercase copy for comparisons s = sections[i].lower() if (('economic' in s or 'economy' in s) and ('uncertainty' in s or 'uncertain' in s) and ('tax' in s or 'policy' in s)): do_stuff()

Related

How to remove spaces in a "single" word? ("bo ok" to "book")

Reading financial statements using REGEX

Choose best fitting regex [duplicate]

Python Regular Expressions Findall

python regex help: unknown information to skip

Categories

Resources