Matching new lines - python

I have the following regexp:
pattern = re.compile(r"HESAID:|SHESAID:")
It's working correctly. I use it to split by multiple delimiters like this:
result = pattern.split(content)
What I want to add is verification so that the split does NOT happen unless HESAID: or SHESAID: are placed on new lines. This is not working:
pattern = re.compile(r"\nHESAID:\n|\nSHESAID:\n")
Please help.

It would be helpful if you elaborated on how exactly it is not working, but I am guessing that the issue is that it does not match consecutive lines of HESAID/SHESAID. You can fix this by using beginning and end of line anchors instead of actually putting \n in your regex:
pattern = re.compile(r'^HESAID:$|^SHESAID:$', re.MULTILINE)
The re.MULTILINE flag is necessary so that ^ and $ match at beginning and end of lines, instead of just the beginning and end of the string.
I would probably rewrite the regex as follows; the ? after the S makes it optional:
pattern = re.compile(r'^S?HESAID:$', re.MULTILINE)
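As a quick check of the anchored pattern, here is a minimal sketch. The sample content string is made up for illustration; only the pattern comes from the answer above. It shows that a HESAID: in the middle of a line is left alone, and that the newlines around each delimiter stay in the resulting pieces.

```python
import re

# Hypothetical sample text: two delimiter lines plus one inline "HESAID:"
content = "intro\nHESAID:\nhello\nSHESAID:\nhi, HESAID: is not a delimiter here\n"

# ^ and $ anchor to line boundaries because of re.MULTILINE
pattern = re.compile(r'^S?HESAID:$', re.MULTILINE)

parts = pattern.split(content)
print(parts)
```

Because the anchors are zero-width, the newline characters are not consumed by the split; call strip() on each part if you do not want them.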

Related

regex expression to exclude lines based on beginning or ending patterns

I am searching a file for lines that do not match one of three possible regex patterns in Python. If I were to search for each individually, the patterns would be:
pattern1 = '_[AB]_[0-9]+$'
pattern2 = 'uce.+'
pattern3 = 'ENSOFAS.+'
pattern2 and pattern3 occur near the beginning of the line (these lines technically start with >). pattern1 occurs at the end of the string.
I've seen ways of combining pattern2 and pattern3 with something like ^>(?:(?!uce|ENSOFAS).+$) (I'm not sure if this is formatted correctly). How can I also include pattern1 in a single regex search? The reason I'm doing this is to skip over lines that match any one of these patterns.
In essence, you are combining three smaller regexes into one, saying that the matcher may match any one of them. The general method for this is the alternation operator, as @TallChuck has commented. So, in keeping with his example and your variables, I might do this:
pattern1 = '_[AB]_[0-9]+$'
pattern2 = '^>uce.+'
pattern3 = '^>ENSOFAS.+'
re_pattern = '(?:{}|{}|{})'.format(pattern1, pattern2, pattern3)
your_re = re.compile( re_pattern )
There's no reason you cannot include the beginning-of-line anchor ^ in each subpattern, so I've done that. Meanwhile, your example used the non-capturing grouping operator, which is (?:...), so I've mimicked that here as well.
The above is the exact same as if you had put it together all at once:
your_re = re.compile('(?:_[AB]_[0-9]+$|^>uce.+|^>ENSOFAS.+)')
Take your pick as to which is more readable and maintainable by you or your team.
Finally, note that it may be more efficient to pull out the beginning-of-line anchor (^) as the last paragraph of your question suggested, or the regex engine may be smart enough to do that on its own. I suggest getting it working first, then optimizing if you need to.
Another option is to match all three at the beginning of the line by simply adding the "match anything" operator (.*) to the first pattern:
^(?:.*_[AB]_[0-9]+$|>uce.+|>ENSOFAS.+)
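To show how the combined pattern might be used to skip lines, here is a minimal sketch. The sample lines are invented for illustration and the name skip_re is hypothetical; the pattern is the all-at-once version from above.

```python
import re

# Combined pattern: matches a line if ANY of the three alternatives matches
skip_re = re.compile(r'(?:_[AB]_[0-9]+$|^>uce.+|^>ENSOFAS.+)')

# Hypothetical sample lines
lines = [">uce-1234 chr1", ">ENSOFAS0001 gene", ">sample_A_12", ">keep_me", "ACGT"]

# Keep only the lines that match none of the patterns
kept = [line for line in lines if not skip_re.search(line)]
print(kept)
```

Using search rather than match matters here, because the first alternative is anchored at the end of the line rather than the beginning.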

How to match characters at the beginning of a complex regular expression in Python?

I have a regular expression (see this question) used to match C function definitions in text file. In particular, I'm working on some git diff output.
f = open(input_file)
diff_txt = ''.join(f.readlines())
f.close()
re_flags = re.VERBOSE | re.MULTILINE
pattern = re.compile(r"""
(^[^-+]) # Problematic line: Want to ensure we do not match lines with +/-
(?<=[\s:~])
(\w+)
\s*
\(([\w\s,<>\[\].=&':/*]*?)\)
\s*
(const)?
\s*
(?={)
""",
re_flags)
The input file is raw git diff output generated in the usual way:
git diff <commit-sha-1> <commit-sha-2> > tmp.diff
The first line (^[^-+]) in my regex string is problematic. Without this line the regex will successfully match all C/C++ functions in input_file, but with it, nothing is matched. I need this line because I want to exclude functions that were added or removed between the two repository revisions, and lines that are added and removed are identified as
+ [added line]
- [removed line]
I've read the docs and I can't seem to find where my error is, some help would be much appreciated.
- and + can be special characters in regular expressions. Try escaping them with backslashes: [^\-\+]
See this question
Simply change the problematic line
(^[^-+])
to
^(?!\+|\-).*
Because the negative lookahead (?!...) is zero-width and consumes no characters, we have to include the .* after it to consume the text between the start of the line and the function name; otherwise nothing will match.
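Here is a minimal sketch of the lookahead version applied to the pattern from the question. The three-line diff_txt sample is made up for illustration; a real diff would come from the file as in the question.

```python
import re

# Hypothetical diff fragment: one context line, one added line, one removed line
diff_txt = (
    " int kept(int a) {\n"
    "+int added(int a) {\n"
    "-int removed(int a) {\n"
)

pattern = re.compile(r"""
    ^(?!\+|-).*                     # line must not start with + or -; .* consumes up to the name
    (?<=[\s:~])                     # the name is preceded by whitespace, ':' or '~'
    (\w+)                           # function name
    \s*
    \(([\w\s,<>\[\].=&':/*]*?)\)    # parameter list
    \s*
    (const)?
    \s*
    (?={)                           # followed by the opening brace
    """, re.VERBOSE | re.MULTILINE)

names = [m.group(1) for m in pattern.finditer(diff_txt)]
print(names)
```

Only the function on the unchanged context line is reported, since ^ can only match at a line start and the lookahead rejects lines that begin with + or -.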

Regex multiline syntax help (python)

I'm struggling to do multiline regex with multiple matches.
I have data separated by newlines/line breaks like below. My pattern matches each of these lines if I test it separately. How can I match all the occurrences (specifically the numbers)?
I've read that I could/should use DOTALL somehow (possibly with MULTILINE). This seems to match any character (newlines also), but I'm not sure of any eventual side effects. I don't want it to match an integer or something and give me malformed data in the end.
Any info on this would be great.
What I really need, though, is some assistance in making this example code work. I only need to fetch the numbers from the data.
I used re.fullmatch when I only needed one specific match in a previous case, and I'm not entirely sure which function I should use now (finditer, findall, search, etc.).
Thank you for any and all help :)
import re

data = """http://store.steampowered.com/app/254060/
http://www.store.steampowered.com/app/254061/
https://www.store.steampowered.com/app/254062
store.steampowered.com/app/254063
254064"""
regPattern = r'^\s*(?:https?:\/\/)?(?:www\.)?(?:store\.steampowered\.com\/app\/)?([0-9]+)\/?\s*$'
evaluateData = re.search(regPattern, data, re.DOTALL | re.MULTILINE)
if evaluateData is not None:
    print('do stuff')
else:
    print('found no match')
import re
p = re.compile(r'^\s*(?:https?:\/\/)?(?:www\.)?(?:store\.steampowered\.com\/app\/)?([0-9]+)\/?\s*$', re.MULTILINE)
test_str = "http://store.steampowered.com/app/254060/\nhttp://www.store.steampowered.com/app/254061/\nhttps://www.store.steampowered.com/app/254062\nstore.steampowered.com/app/254063\n254064"
re.findall(p, test_str)
https://regex101.com/r/rC9rI0/1
This gives ['254060', '254061', '254062', '254063', '254064'].
Are you trying to return those specific integers?
re.search stops at the first occurrence.
You should use this instead:
re.findall(regPattern, data, re.MULTILINE)
['254060', '254061', '254062', '254063', '254064']
Note: search was not working for me (Python 2.7.9). It returned only the first line of data.
/ has no special meaning, so you do not have to escape it (and in non-raw strings you would have to escape every \).
Try this:
regPattern = r'^\s*(?:https?://)?(?:www\.)?(?:store\.steampowered\.com/app/)?([0-9]+)/?\s*$'
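To make the difference between search and findall concrete, here is a small sketch using the data from the question and the simplified pattern above. The names reg_pattern, first and all_ids are just illustrative.

```python
import re

data = """http://store.steampowered.com/app/254060/
http://www.store.steampowered.com/app/254061/
https://www.store.steampowered.com/app/254062
store.steampowered.com/app/254063
254064"""

reg_pattern = r'^\s*(?:https?://)?(?:www\.)?(?:store\.steampowered\.com/app/)?([0-9]+)/?\s*$'

# search returns only the first match object (or None)
first = re.search(reg_pattern, data, re.MULTILINE)

# findall returns the captured group for every line that matches
all_ids = re.findall(reg_pattern, data, re.MULTILINE)

print(first.group(1))
print(all_ids)
```

re.MULTILINE alone is enough here, since the pattern never needs . to match a newline. If you also need match positions, re.finditer yields match objects instead of strings.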

Using regex to find multiple matches on the same line

I need to build a program that can read multiple lines of code, and extract the right information from each line.
Example text:
no matches
one match <'found'>
<'one'> match <found>
<'three'><'matches'><'found'>
For this case, the program should detect <'found'>, <'one'>, <'three'>, <'matches'> and <'found'> as matches because they all have "<" and "'".
However, I cannot work out a system using regex to account for multiple matches on the same line. I was using something like:
re.search('^<.*>$')
But if there are multiple matches on one line, the extra "<'" and "'>" are taken as part of the .*, without counting them as separate matches. How do I fix this?
This works:
>>> r = re.compile(r"\<\'.*?\'\>")
>>> r.findall(s)
["<'found'>", "<'one'>", "<'three'>", "<'matches'>", "<'found'>"]
Use findall instead of search:
re.findall( r"<'.*?'>", str )
You can use re.findall and match on non > characters inside of the angle brackets:
>>> re.findall('<[^>]*>', "<'three'><'matches'><'found'>")
["<'three'>", "<'matches'>", "<'found'>"]
Non-greedy quantifier '?' as suggested by anubhava is also an option.
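For comparison, here is a small sketch running both suggested patterns over the full example text from the question (the variable names are illustrative). Note that the character-class version only looks at the angle brackets, so it also picks up <found>, which has no quotes.

```python
import re

text = (
    "no matches\n"
    "one match <'found'>\n"
    "<'one'> match <found>\n"
    "<'three'><'matches'><'found'>\n"
)

# Non-greedy version: requires the quotes just inside the angle brackets
quoted = re.findall(r"<'.*?'>", text)

# Character-class version: any run of non-'>' characters between < and >
any_brackets = re.findall(r"<[^>]*>", text)

print(quoted)
print(any_brackets)
```

If the quotes are required, the first pattern is probably the safer choice for this input.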

dealing with \n characters at end of multiline string in python

I have been using python with regex to clean up a text file. I have been using the following method and it has generally been working:
mystring = compiledRegex.sub("replacement",mystring)
The string in question is an entire text file that includes many embedded newlines. Some of the compiled regexes cover multiple lines using the re.DOTALL option. If the last character in the compiled regex is a \n, the above command will substitute all matches of the regex except the match that ends with the final newline at the end of the string. In fact, I have had several other, no doubt related, problems dealing with newlines and multiple newlines when they appear at the very end of the string. Can anyone give me a pointer as to what is going on here? Thanks in advance.
If I understood you correctly, and all you need is the text without a newline at the end of each line so that you can iterate over it to find a required word, then you can try the following:
data = [line for line in text.split('\n') if line.strip()]  # gives you all non-empty lines without '\n' at the end
Now you can search or replace any text you need using list slicing or regex functionality.
Or you can use replace to replace every '\n' with whatever you want:
text.replace('\n', '')
My bet is that your file does not end with a newline...
>>> content = open('foo').read()
>>> print content
TOTAL:.?C2
abcTOTAL:AC2
defTOTAL:C2
>>> content
'TOTAL:.?C2\nabcTOTAL:AC2\ndefTOTAL:C2'
...so the last line does not match the regex:
>>> regex = re.compile('TOTAL:.*?C2\n', re.DOTALL)
>>> regex.sub("XXX", content)
'XXXabcXXXdefTOTAL:C2'
If that is the case, the solution is simple: just match either a newline or the end of the file (with $):
>>> regex = re.compile('TOTAL:.*?C2(\n|$)', re.DOTALL)
>>> regex.sub("XXX", content)
'XXXabcXXXdefXXX'
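A small variation on the same idea, as a sketch: without re.MULTILINE, $ also matches just before a trailing newline, whereas \Z matches only at the very end of the string. Using a non-capturing group with \Z makes the "newline or end of file" intent explicit. The content string is the one from the session above.

```python
import re

content = 'TOTAL:.?C2\nabcTOTAL:AC2\ndefTOTAL:C2'

# (?:\n|\Z): either a newline, or the very end of the string
regex = re.compile(r'TOTAL:.*?C2(?:\n|\Z)', re.DOTALL)

result = regex.sub("XXX", content)
print(result)
```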
I can't get a good handle on what is going on from your explanation, but you may be able to fix it by replacing all multiple newlines with a single newline as you read in the file. Another option might be to drop the trailing \n from the regex unless you need it for something.
Is the question mark there to prevent the regex from matching more than one line at a time? If so, then you probably want to use the MULTILINE flag instead of the DOTALL flag. The ^ sign will then match just after a newline or at the beginning of the string, and the $ sign will match just before a newline character or at the end of the string.
For example:
regex = re.compile('^TOTAL:.*$', re.MULTILINE)
content = regex.sub('', content)
However, this still leaves the problem of empty lines. But why not just run one additional regex at the end that removes blank lines?
regex = re.compile('\n{2,}')
content = regex.sub('\n', content)
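Putting the two steps together, here is a minimal sketch on a made-up content string whose last line has no trailing newline. The intermediate names are illustrative.

```python
import re

# Hypothetical input; note that the final TOTAL line has no trailing newline
content = "keep 1\nTOTAL: 10\nkeep 2\nTOTAL: 20"

# Step 1: blank out every line that starts with TOTAL:
without_totals = re.sub(r'^TOTAL:.*$', '', content, flags=re.MULTILINE)

# Step 2: collapse the runs of newlines left behind
collapsed = re.sub(r'\n{2,}', '\n', without_totals)

print(repr(without_totals))
print(repr(collapsed))
```

The last TOTAL line is removed even though the file does not end with a newline, because $ also matches at the end of the string.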
