regex matching long string broke with \ read from file

The file I am reading contains these lines:
fileContent.py
header.Description ="long"\
"description"
header.Priority =1
header.Type ="short"
I need a regex that matches both the lines that are broken with a backslash and the ones that aren't. At the moment I'm doing it this way:
import re

with open('fileContent.py') as f:
    fileContent = f.read()
template = r'\nheader\.%s\s*=\s*.+(\\n.+)?'
values = ['Description', 'Priority']
for value in values:
    print(re.search(re.compile(template % str(value)), fileContent).group(0))
and I receive:
header.Priority ="1"
header.Description ="long"\
If I change my template to not use raw string:
template = '\nheader\\.%s\\s*=\\s*.+(\\\n.+)?'
I receive:
header.Priority ="1"
header.Type ="short"
header.Description ="long"\
"description"
How can I build a regex that matches a string broken across two lines, as above, as well as a single-line string? I don't want the line containing header.Type in the result, because I'm not looking for it!
Why doesn't '\\\n' work as I expected, that is, match a backslash followed by a newline?

Try this regex:
(?:[^\r\n]+\\[\r\n]*)+|[^\r\n]+
See the DEMO

The reason your pattern is not matching backslash+newline is that you have r'\\n' which means a backslash + 'n'.
For the case above, you could try this regex:
\nheader\.Description\s*=\s*[^\r\n]+(?P<broken_line>\\\n.+)
See demo here.
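The escaping difference is easy to check in isolation. The sketch below assumes sample contents that mirror the question's file (with a leading newline so the \n at the start of each template can match) and a hypothetical helper named find(). The last template is one possible alternative, not the only fix. Note that simply making the group in the pattern above optional would probably not be enough: a greedy .+ or [^\r\n]+ consumes the trailing backslash, and the optional group is then skipped.

```python
import re

# Assumed sample mirroring the question's file; the leading newline stands in
# for a preceding line so the templates' initial \n can match.
file_content = (
    '\nheader.Description ="long"\\\n'
    '"description"\n'
    'header.Priority =1\n'
    'header.Type ="short"\n'
)

# Raw string: the regex engine sees \\ followed by n, i.e. a literal
# backslash and then the letter "n". That sequence never occurs in the file.
raw_template = r'\nheader\.%s\s*=\s*.+(\\n.+)?'

# Non-raw string: Python turns '\\\n' into backslash + newline character,
# which the regex engine reads as an escaped newline, i.e. just a newline.
plain_template = '\nheader\\.%s\\s*=\\s*.+(\\\n.+)?'

# One possible alternative: zero or more lines ending in a backslash,
# followed by the final line of the value.
continued_template = r'\nheader\.%s\s*=\s*(?:.*\\\n)*.*'


def find(template, name):
    """Hypothetical helper: return the matched text, or None."""
    match = re.search(template % re.escape(name), file_content)
    return match.group(0) if match else None


for template in (raw_template, plain_template, continued_template):
    for name in ('Description', 'Priority'):
        print(repr(find(template, name)))
```

With the non-raw template, the optional group matches any following line, which is why header.Type shows up after header.Priority.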
However, it is not advisable to parse code with regular expressions, because Python is not a regular language. Use the ast module instead.
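A minimal sketch of the ast approach, assuming the file contains only simple assignments of literals to attributes of a name called header, as in the question. The function name header_values and the inline sample source are illustrative, not part of the original post.

```python
import ast

# Assumed sample mirroring the question's file contents.
source = (
    'header.Description ="long"\\\n'
    '"description"\n'
    'header.Priority =1\n'
    'header.Type ="short"\n'
)


def header_values(source, wanted):
    """Collect literal values assigned to header.<name> for the wanted names."""
    found = {}
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Assign):
            continue
        for target in node.targets:
            if (isinstance(target, ast.Attribute)
                    and isinstance(target.value, ast.Name)
                    and target.value.id == 'header'
                    and target.attr in wanted):
                # literal_eval only accepts literals, so arbitrary
                # expressions raise ValueError instead of being executed.
                found[target.attr] = ast.literal_eval(node.value)
    return found


result = header_values(source, ['Description', 'Priority'])
print(result)
```

Because the parser handles the backslash continuation and the adjacent string literals itself, the broken Description line comes back as a single string.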

Related

How to catch a string using regex in python and replace it by desired string

I am new to Python and I wrote the following code, which is supposed to catch a specific string and replace it with another specific string.
sid=\"1722407313768658\"
I used this regex: sid=(.+?)
but it catches an irrelevant string as well:
https://tmobile.demdex.net/dest5.html?d_nsid=0#
Also, when I run this regex on sid=\"1722407313768658\" (replacing it with 1900117189066752), I get the following result, which adds to the string instead of replacing it: sid=\1900117189066752\ "1722407313768658\"
(Instead of 1722407313768658 I want to have 1900117189066752.)
This is my Python code:
import re
content = c.read()
################################################################
# change sessionid in content
replace_small_sid = str('sid=\\' + "\\"+str(sid) + "\\" + " ")
content = re.sub("sid=(.+?)", replace_small_sid, content)

As I understand it you wish to match string patterns in the form:
sid=\"1722407313768658\"
With the aim of replacing the digits.
To achieve this we can use positive lookbehinds and lookaheads as described here:
https://www.regular-expressions.info/lookaround.html
Lookahead and lookbehind, collectively called "lookaround", are zero-length assertions just like the start and end of line, and start and end of word anchors explained earlier in this tutorial. The difference is that lookaround actually matches characters, but then gives up the match, returning only the result: match or no match. That is why they are called "assertions". They do not consume characters in the string, but only assert whether a match is possible or not.
In this case our lookbehind will match
sid=\"
Our lookahead will match
\"
Please see the example here: https://regex101.com/r/2pXcMI/2
Finally, we can use this to match and replace as follows:
import re
line = "sid=\"1722407313768658\" safklabsf ipashf oiasfoi asbg fasnk sid=\"65641\" asjobfaosb asbfaosb asf asfauv sid=\"651564165\"."
replace_with = '1900117189066752'
line = re.sub(r'(?<=sid=\")\d+(?=\")', replace_with, line)
line
This returns
'sid="1900117189066752" safklabsf ipashf oiasfoi asbg fasnk sid="1900117189066752" asjobfaosb asbfaosb asf asfauv sid="1900117189066752".'
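The original pattern sid=(.+?) ends with a lazy quantifier, so it matches sid= plus exactly one character, and it also matches inside d_nsid=0. The sketch below is one hedged alternative that uses a capturing group instead of lookarounds and tolerates an optional literal backslash before the quote, on the assumption that the real content contains backslash-escaped quotes. The function name replace_sid and the sample content are made up for illustration.

```python
import re


def replace_sid(content, new_sid):
    """Replace the digits after sid=" or sid=\\" with new_sid."""
    # \b keeps "d_nsid=" from matching; \\? allows an optional literal
    # backslash before the opening quote.
    pattern = re.compile(r'(\bsid=\\?")\d+')
    # A function replacement avoids any ambiguity between the group
    # reference and the digits of the new id.
    return pattern.sub(lambda m: m.group(1) + new_sid, content)


content = (
    'x sid=\\"1722407313768658\\" '
    'https://tmobile.demdex.net/dest5.html?d_nsid=0# sid="65641"'
)
result = replace_sid(content, '1900117189066752')
print(result)
```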

Since you want to replace a specific string, you can do it with str.replace, which returns a new string:
content = content.replace("1722407313768658","1900117189066752")

Python - Parsing JSON formatted text file with regex

I have a text file formatted like a JSON file however everything is on a single line (could be a MongoDB File). Could someone please point me in the direction of how I could extract values using a Python regex method please?
The text shows up like this:
{"d":{"__type":"WikiFileNodeContent:http:\/\/samplesite.com.‌au\/ns\/business\/wi‌ki","author":null,"d‌escription":null,"fi‌leAssetId":"034b9317‌-60d9-45c2-b6d6-0f24‌b59e1991","filename"‌:"Reports.pdf"},"cre‌atedBy":1531,"create‌dByUsername":"John Cash","icon":"\/Assets10.37.5.0\/pix\/16x16\/page_white_acro‌bat.png","id":3041,"‌inheritedPermissions‌":false,"name":"map"‌,"permissions":[23,8‌7,35,49,65],"type":3‌,"viewLevel":2},{"__‌type":"WikiNode:http‌:\/\/samplesite.com.‌au\/ns\/business\/wi‌ki","children":[],"c‌ontent":
I want to get the "fileAssetId" and "filename" values. I've tried to load the text with Python's json module, but I get an error.
For the fileAssetId I tried this regex:
regex = re.compile(r"([0-9a-f]{8})\S*-\S*([0-9a-f]{4})\S*-\S*([0-9a-f]{4})\S*-\S*([0-9a-f]{4})\S*-\S*([0-9a-f]{12})")
But i get the following 034b9317‌, 60d9, 45c2, b6d6, 0f24‌b59e1991
I'm not sure how to get the data as it's displayed.

How about using positive lookahead and lookbehind:
(?<=\"fileAssetId\":\")[a-fA-F0-9-]+?(?=\")
captures the fileAssetId and
(?<=\"filename\":\").+?(?=\")
matches the filename.
For a detailed explanation of the regex have a look at the Regex101-Example. (Note: I combined both in the example with an OR-Operator | to show both matches at once)
To get a list of all matches use re.findall or re.finditer instead of re.match.
re.findall(pattern, string) returns a list of matching strings.
re.finditer(pattern, string) returns an iterator with the objects.
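As a small illustration of the two lookaround patterns with re.findall and re.finditer, here is a sketch run against a shortened, made-up sample of the question's text (the variable names are assumptions):

```python
import re

# Shortened sample based on the text in the question.
text = (
    '{"d":{"author":null,'
    '"fileAssetId":"034b9317-60d9-45c2-b6d6-0f24b59e1991",'
    '"filename":"Reports.pdf"},"createdBy":1531}'
)

asset_id_pattern = r'(?<="fileAssetId":")[a-fA-F0-9-]+?(?=")'
filename_pattern = r'(?<="filename":").+?(?=")'

# findall returns the matched strings directly.
asset_ids = re.findall(asset_id_pattern, text)
filenames = re.findall(filename_pattern, text)
print(asset_ids, filenames)

# finditer yields match objects, which also carry the match position.
for match in re.finditer(asset_id_pattern, text):
    print(match.group(0), match.span())
```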

You can use python's walk method and check each entry with re.match.
If the string you have cannot be converted to a Python dict, you can use a regex alone (your_string is the text to search):
print(re.match(r'.*fileAssetId\":\"([^\"]+)\".*', your_string).group(1))
Solution for your example:
import re
example_string = '{"d":{"__type":"WikiFileNodeContent:http:\/\/samplesite.com.u\/ns\/business\/wiki","author":null,"description":null,"fileAssetId":"034b9317-60d9-45c2-b6d6-0f24b59e1991","filename":"Reports.pdf"},"createdBy":1531,"createdByUsername":"John Cash","icon":"\/Assets10.37.5.0\/pix\/16x16\/page_white_acrobat.png","id":3041,"inheritedPermissions":false,"name":"map","permissions":[23,87,35,49,65],"type":3,"viewLevel":2},{"__type":"WikiNode:http:\/\/samplesite.com.au\/ns\/business\/wiki","children":[],"content"'
regex_pattern = r'.*fileAssetId\":\"([^\"]+)\".*'
match = re.match(regex_pattern, example_string)
fileAssetId = match.group(1)
print('fileAssetId: {}'.format(fileAssetId))
Executing this yields:
fileAssetId: 034b9317-60d9-45c2-b6d6-0f24b59e1991

Try adding \n to the string that you are entering in to the file (\n means new line)

Based on the idea given here https://stackoverflow.com/a/3845829 and by following the JSON standard https://www.json.org/json-en.html, we can use Python + regex https://pypi.org/project/regex/ and do the following:
import regex  # third-party package: https://pypi.org/project/regex/

json_pattern = (
    r'(?(DEFINE)'
    r'(?P<whitespace>( |\n|\r|\t)*)'
    r'(?P<boolean>true|false)'
    r'(?P<number>-?(0|([1-9]\d*))(\.\d*[1-9])?([eE][+-]?\d+)?)'
    r'(?P<string>"([^"\\]|\\("|\\|/|b|f|n|r|t|u[0-9a-fA-F]{4}))*")'
    r'(?P<array>\[((?&whitespace)|(?&value)(,(?&value))*)\])'
    r'(?P<key>(?&whitespace)(?&string)(?&whitespace))'
    r'(?P<value>(?&whitespace)((?&boolean)|(?&number)|(?&string)|(?&array)|(?&object)|null)(?&whitespace))'
    r'(?P<object>\{((?&whitespace)|(?&key):(?&value)(,(?&key):(?&value))*)\})'
    r'(?P<document>(?&object)|(?&array))'
    r')'
    r'(?&document)'
)
json_regex = regex.compile(json_pattern)
# json_document_text is the text to match
match = json_regex.match(json_document_text)
You can change the last line of json_pattern to match individual objects instead of a whole document by replacing (?&document) with (?&object). The regex turned out simpler than I expected. I have not run extensive tests on it, but it works fine for me and I have tried it on hundreds of files. I will try to improve my answer if I find any issue when running it.

How to perform this regex replacement more effectively in python without repeating the search?

In python, I want to search for a pattern in a given line and surround it with the html tags. I am doing it as follows:
pattern = "(boy|girl)"
line = "I am a boy"
m = re.search(pattern, line)
line = re.sub(pattern, "<strong><u>"+m.group(0)+"</u></strong>", line)
But I feel like I am performing the search twice. In other words, I feel like I should be able to accomplish this in one line, but I just don't know the right command in Python yet.
Is there something like "&" from Perl that you can use to do something like:
s/pattern/<tag>&</tag>/;

Use:
line = re.sub(pattern, r'<strong><u>\1</u></strong>', line)
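The \1 is the key part: it is replaced by the text matched by the first group, which here is the whole pattern because the pattern is wrapped in parentheses (\g<0> refers to the whole match even without a group). The r prefix is recommended for all RE patterns to keep backslash escapes as literals.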
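A short sketch comparing the two backreference forms on the question's example. The variable names with_group and with_whole_match are illustrative.

```python
import re

line = "I am a boy"

# \1 refers to the first capturing group, so the pattern needs parentheses.
with_group = re.sub(r"(boy|girl)", r"<strong><u>\1</u></strong>", line)

# \g<0> refers to the entire match, the closest analogue of Perl's "&",
# and works without any capturing group.
with_whole_match = re.sub(r"boy|girl", r"<strong><u>\g<0></u></strong>", line)

print(with_group)
print(with_whole_match)
```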

Multiline regex python

I'm trying to do some text file parsing where this pattern is repeated throughout the file:
VERSION.PROGRAM:program_name
VERSION.SUBPROGRAM:sub_program_name
My intent is to, given a program_name, retrieve the sub_program_name for each block of text I mentioned above.
I have the following function that finds if the text actually exists, but doesn't print the sub_program_name:
import re

def find_subprogram(program_name):
    regex_string = r'VERSION.PROGRAM:%s\nVERSION.SUBPROGRAM:.' % program_name
    with open('file.txt', 'r') as f:
        match = re.search(regex_string, f.read(), re.DOTALL|re.MULTILINE)
    if match:
        print(match.group())
I will appreciate some help or tips.
Thanks

Your regex has a typo, it's looking for PRGRAM.

To match text that spans multiple lines you do not need the MULTILINE flag. MULTILINE only changes ^ and $ so that they also match at the start and end of each line, and your pattern uses neither.
Your pattern also ends with a single ., which matches only one character, so the sub-program name is never captured.
Keep %s as the placeholder for the program name and add a capturing group (.*) after SUBPROGRAM: to capture the name. Drop re.DOTALL as well, otherwise .* runs past the end of the line.
Here is an example.
Using VERSION\.PROGRAM:YOURSTRING\nVERSION\.SUBPROGRAM:(.*) will match the groups properly:
re.compile(r'VERSION\.PROGRAM:%s\nVERSION\.SUBPROGRAM:(.*)' % re.escape(yourstr))
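Put together, the pattern can be used as in the sketch below. The function takes the text as an argument instead of reading file.txt, and the sample text and the name find_subprograms are assumptions made for illustration.

```python
import re


def find_subprograms(text, program_name):
    """Return every sub-program name listed directly after program_name."""
    pattern = re.compile(
        r'VERSION\.PROGRAM:%s\nVERSION\.SUBPROGRAM:(.*)'
        % re.escape(program_name)
    )
    # Without re.DOTALL, (.*) stops at the end of the SUBPROGRAM line.
    return pattern.findall(text)


# Made-up sample in the format described in the question.
sample = (
    'VERSION.PROGRAM:alpha\n'
    'VERSION.SUBPROGRAM:alpha_sub\n'
    'other text\n'
    'VERSION.PROGRAM:beta\n'
    'VERSION.SUBPROGRAM:beta_sub\n'
)

alpha_subs = find_subprograms(sample, 'alpha')
print(alpha_subs)
```

re.findall returns only the captured group when the pattern has exactly one, so the result is a list of sub-program names.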

dealing with \n characters at end of multiline string in python

I have been using python with regex to clean up a text file. I have been using the following method and it has generally been working:
mystring = compiledRegex.sub("replacement",mystring)
The string in question is an entire text file that includes many embedded newlines. Some of the compiled regex's cover multiple lines using the re.DOTALL option. If the last character in the compiled regex is a \n the above command will substitute all matches of the regex except the match that ends with the final newline at the end of the string. In fact, I have had several other no doubt related problems dealing with newlines and multiple newlines when they appear at the very end of the string. Can anyone give me a pointer as to what is going on here? Thanks in advance.

If I understood you correctly, and all you need is the text without the newline at the end of each line so that you can iterate over it to find a required word, then you can try the following:
data = [line for line in text.split('\n') if line.strip()]  # all non-empty lines, without '\n' at the end
Now you can search or replace any text you need using list slicing or regex functionality.
Or you can use replace to swap every '\n' for whatever you want:
text = text.replace('\n', '')

My bet is that your file does not end with a newline...
>>> content = open('foo').read()
>>> print content
TOTAL:.?C2
abcTOTAL:AC2
defTOTAL:C2
>>> content
'TOTAL:.?C2\nabcTOTAL:AC2\ndefTOTAL:C2'
...so the last line does not match the regex:
>>> regex = re.compile('TOTAL:.*?C2\n', re.DOTALL)
>>> regex.sub("XXX", content)
'XXXabcXXXdefTOTAL:C2'
If that is the case, the solution is simple: just match either a newline or the end of the file (with $):
>>> regex = re.compile('TOTAL:.*?C2(\n|$)', re.DOTALL)
>>> regex.sub("XXX", content)
'XXXabcXXXdefXXX'

I can't get a good handle on what is going on from your explanation, but you may be able to fix it by replacing all multiple newlines with a single newline as you read in the file. Another option might be to drop the trailing \n from the regex, unless you need it for something.

Is the question mark there to prevent the regex from matching more than one line at a time? If so, you probably want the MULTILINE flag instead of the DOTALL flag. With MULTILINE, ^ also matches just after a newline as well as at the beginning of the string, and $ also matches just before a newline as well as at the end of the string.
eg.
regex = re.compile('^TOTAL:.*$', re.MULTILINE)
content = regex.sub('', content)
However, this still leaves you with the problem of empty lines. Why not just run one additional regex at the end that removes blank lines?
regex = re.compile('\n{2,}')
content = regex.sub('\n', content)
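The two steps can be tried together on a small made-up string (the sample text and variable names are assumptions):

```python
import re

# Made-up sample: two TOTAL lines, the last one without a trailing newline.
content = 'keep1\nTOTAL:abc\nkeep2\nTOTAL:xyz'

# Step 1: blank out every line that starts with TOTAL:. With MULTILINE,
# ^ and $ anchor at line boundaries, and $ also matches at the end of
# the string, so the last line is handled too.
total_line = re.compile(r'^TOTAL:.*$', re.MULTILINE)
without_totals = total_line.sub('', content)

# Step 2: collapse the runs of newlines left behind by step 1.
blank_lines = re.compile(r'\n{2,}')
cleaned = blank_lines.sub('\n', without_totals)

print(repr(without_totals))
print(repr(cleaned))
```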
