Regex to search file for exact string

Regex to search file for exact string - python

I have looked at the other questions around this, but I can't quite get those answers to work for my situation.
I have a string that I'm searching for inside a file. I want to match the exact string and, if there's a match, do something. I'm trying to formulate a regex for the string "author:". If this string is found, strip it from the line and give me everything to the right of it, removing any whitespace. Any ideas, looking at the code below, on how to achieve this?
metadata_author = re.compile(r'\bauthor:\b')
with open(tempfile, encoding='latin-1') as search:
    for line in search:
        result = metadata_author.search(line)
        if result in line:
            author = result.strip()
            print(author)

I would use a lookbehind (with a negative lookbehind for the possible dot as mentioned in the comment):
metadata_author = re.compile(r'(?<=(?<!\.)\bauthor:).+')
with open(tempfile, encoding='latin-1') as search:
    for line in search:
        result = metadata_author.search(line)
        if result:
            author = result.group().strip()
            print(author)
re.search returns a match object, and not a string, so to get the matching string you have to call result.group().
If you want to remove all whitespace (instead of just trimming), use re.sub(r'\s+', '', result.group()) instead of result.group().strip().
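The same extraction can also be written with a capturing group instead of a lookbehind. The following is only a sketch: the sample lines and the helper name extract_authors are made up for illustration, and the lines are held in memory rather than read from a file. Note that the trailing \b in the question's pattern cannot match when "author:" is followed by a space, because a colon and a space are both non-word characters, so it is dropped here.

```python
import re

# Capture everything after "author:"; the negative lookbehind skips
# occurrences that are preceded by a dot (e.g. "x.author:").
AUTHOR_PATTERN = re.compile(r'(?<!\.)\bauthor:\s*(.*)')


def extract_authors(lines):
    """Return the stripped text that follows 'author:' on each matching line."""
    authors = []
    for line in lines:
        match = AUTHOR_PATTERN.search(line)
        if match:
            # group(1) is the captured text; strip() trims surrounding whitespace.
            authors.append(match.group(1).strip())
    return authors


# Hypothetical sample input standing in for the file's lines.
sample_lines = [
    'title: Some Book\n',
    'author:   Jane Doe  \n',
    'meta.author: ignored\n',
]
authors = extract_authors(sample_lines)
print(authors)
```

To use it on a file, pass the open file object (or a list of its lines) to extract_authors in place of sample_lines.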

Related

Python - Parsing JSON formatted text file with regex

I have a text file formatted like a JSON file however everything is on a single line (could be a MongoDB File). Could someone please point me in the direction of how I could extract values using a Python regex method please?
The text shows up like this:
{"d":{"__type":"WikiFileNodeContent:http:\/\/samplesite.com.‌au\/ns\/business\/wi‌ki","author":null,"d‌escription":null,"fi‌leAssetId":"034b9317‌-60d9-45c2-b6d6-0f24‌b59e1991","filename"‌:"Reports.pdf"},"cre‌atedBy":1531,"create‌dByUsername":"John Cash","icon":"\/Assets10.37.5.0\/pix\/16x16\/page_white_acro‌bat.png","id":3041,"‌inheritedPermissions‌":false,"name":"map"‌,"permissions":[23,8‌7,35,49,65],"type":3‌,"viewLevel":2},{"__‌type":"WikiNode:http‌:\/\/samplesite.com.‌au\/ns\/business\/wi‌ki","children":[],"c‌ontent":
I want to get the "fileAssetId" and "filename" values. I've tried to load the line with Python's json module, but I get an error.
For the fileAssetId I tried this regex:
regex = re.compile(r"([0-9a-f]{8})\S*-\S*([0-9a-f]{4})\S*-\S*([0-9a-f]{4})\S*-\S*([0-9a-f]{4})\S*-\S*([0-9a-f]{12})")
But I get the following: 034b9317, 60d9, 45c2, b6d6, 0f24b59e1991
I'm not too sure how to get the data as it's displayed.

How about using positive lookahead and lookbehind:
(?<=\"fileAssetId\":\")[a-fA-F0-9-]+?(?=\")
captures the fileAssetId and
(?<=\"filename\":\").+?(?=\")
matches the filename.
For a detailed explanation of the regex have a look at the Regex101-Example. (Note: I combined both in the example with an OR-Operator | to show both matches at once)
To get a list of all matches use re.findall or re.finditer instead of re.match.
re.findall(pattern, string) returns a list of matching strings.
re.finditer(pattern, string) returns an iterator of match objects.
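As a rough illustration of the two lookaround patterns with re.findall and re.finditer, here is a sketch on a shortened sample string. The sample text and variable names are assumptions made for the example. The pasted data in the question appears to contain invisible zero-width characters, so the sketch strips U+200B and U+200C first; that step is a guess about the input and is harmless if the characters are absent.

```python
import re

# Shortened, hypothetical sample; \u200c marks the zero-width characters
# that seem to be scattered through the pasted text.
raw_text = (
    '{"d":{"author":null,"fileAssetId":"034b9317\u200c-60d9-45c2-b6d6-0f24'
    '\u200cb59e1991","filename":"Reports.pdf"},"createdBy":1531}'
)

# Remove zero-width space / zero-width non-joiner before matching.
cleaned = re.sub('[\u200b\u200c]', '', raw_text)

asset_id_pattern = re.compile(r'(?<="fileAssetId":")[a-fA-F0-9-]+?(?=")')
filename_pattern = re.compile(r'(?<="filename":").+?(?=")')

# findall returns the matched strings directly.
asset_ids = asset_id_pattern.findall(cleaned)

# finditer yields match objects; group() gives the matched text.
filenames = [m.group() for m in filename_pattern.finditer(cleaned)]

print(asset_ids)
print(filenames)
```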

You can use python's walk method and check each entry with re.match.
In case the string you have is not convertible to a Python dict, you can use just a regex:
print(re.match(r'.*fileAssetId\":\"([^\"]+)\".*', your_string).group(1))
Solution for your example:
import re
example_string = '{"d":{"__type":"WikiFileNodeContent:http:\/\/samplesite.com.u\/ns\/business\/wiki","author":null,"description":null,"fileAssetId":"034b9317-60d9-45c2-b6d6-0f24b59e1991","filename":"Reports.pdf"},"createdBy":1531,"createdByUsername":"John Cash","icon":"\/Assets10.37.5.0\/pix\/16x16\/page_white_acrobat.png","id":3041,"inheritedPermissions":false,"name":"map","permissions":[23,87,35,49,65],"type":3,"viewLevel":2},{"__type":"WikiNode:http:\/\/samplesite.com.au\/ns\/business\/wiki","children":[],"content"'
regex_pattern = r'.*fileAssetId\":\"([^\"]+)\".*'
match = re.match(regex_pattern, example_string)
fileAssetId = match.group(1)
print('fileAssetId: {}'.format(fileAssetId))
Executing this yields:
fileAssetId: 034b9317-60d9-45c2-b6d6-0f24b59e1991

Try adding \n to the string that you are entering in to the file (\n means new line)

Based on the idea given here https://stackoverflow.com/a/3845829 and by following the JSON standard https://www.json.org/json-en.html, we can use Python + regex https://pypi.org/project/regex/ and do the following:
import regex  # third-party package from PyPI, not the standard re module
json_pattern = (
    r'(?(DEFINE)'
    r'(?P<whitespace>( |\n|\r|\t)*)'
    r'(?P<boolean>true|false)'
    r'(?P<number>-?(0|([1-9]\d*))(\.\d*[1-9])?([eE][+-]?\d+)?)'
    r'(?P<string>"([^"\\]|\\("|\\|/|b|f|n|r|t|u[0-9a-fA-F]{4}))*")'
    r'(?P<array>\[((?&whitespace)|(?&value)(,(?&value))*)\])'
    r'(?P<key>(?&whitespace)(?&string)(?&whitespace))'
    r'(?P<value>(?&whitespace)((?&boolean)|(?&number)|(?&string)|(?&array)|(?&object)|null)(?&whitespace))'
    r'(?P<object>\{((?&whitespace)|(?&key):(?&value)(,(?&key):(?&value))*)\})'
    r'(?P<document>(?&object)|(?&array))'
    r')'
    r'(?&document)'
)
json_regex = regex.compile(json_pattern)
match = json_regex.match(json_document_text)
You can change the last line of json_pattern to match individual objects instead of a whole document by replacing (?&document) with (?&object). I think the regex is easier than I expected, but I did not run extensive tests on this. It works fine for me and I have tested hundreds of files. I will try to improve my answer in case I find any issue when running it.

Regex for parsing list items

I want to slurp a text into a list, and then parse a bit each item so I can keep the text I actually want.
I'm currently using:
with open("C:/text.txt", "rU") as input:
    lines = [line.rstrip('\n') for line in input]
for line in lines:
    #str(line)
    regex = r"\:\s*\"(.*)\"\s{5}\d?"
    try:
        found = re.search(regex, line).group(1)
    except AttributeError:
        found = 'nah'
    print(found)
But it doesn't work: it always ends up in the except branch. When applied to a hard-coded string, it works. Is there a difference when dealing with list items?
The text file is structured as such:
Thank you in advance!

Judging from the image you provided, there appear to be 3 whitespace characters between the text and the digits, not 5.
Without the exact text, it is impossible to tell which whitespace characters they are, but there is clearly at least one.
So, you need to modify the regex you are using to
r':\s*"(.*)"\s+'
Here, \s+ matches 1 or more whitespaces.
Note that \d? at the end of the pattern is not required if you are not interested in the whole match and only need Group 1 value.
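A small sketch of the difference between the two patterns follows. The sample line is hypothetical, since the original text file is only available as an image, and the helper name extract is made up for the example.

```python
import re

ORIGINAL_PATTERN = r"\:\s*\"(.*)\"\s{5}\d?"  # requires exactly 5 whitespace characters
RELAXED_PATTERN = r':\s*"(.*)"\s+'           # accepts 1 or more


def extract(pattern, line):
    """Return group 1 of the first match, or 'nah' when nothing matches."""
    match = re.search(pattern, line)
    return match.group(1) if match else 'nah'


# Hypothetical line with 3 spaces between the quoted text and the digits.
sample_line = '1: "Some text here"   42'

with_original = extract(ORIGINAL_PATTERN, sample_line)
with_relaxed = extract(RELAXED_PATTERN, sample_line)
print(with_original)
print(with_relaxed)
```

Checking the match object for None, as extract does, avoids relying on AttributeError to detect a failed search.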

regex matching long string broke with \ read from file

The file I read contains these lines:
fileContent.py
header.Description ="long"\
"description"
header.Priority =1
header.Type ="short"
I need a regex that matches both the lines that are broken across two lines and the ones that aren't. Currently I'm doing it this way:
with open('fileContent.py') as f:
    fileContent = f.read()
template = r'\nheader\.%s\s*=\s*.+(\\n.+)?'
values = ['Description', 'Priority']
for value in values:
    print re.search(re.compile(template % str(value)), fileContent).group(0)
and I receive:
header.Priority ="1"
header.Description ="long"\
If I change my template to not use raw string:
template = '\nheader\\.%s\\s*=\\s*.+(\\\n.+)?'
I receive:
header.Priority ="1"
header.Type ="short"
header.Description ="long"\
"description"
How can I build a regex that matches a string broken across two lines, as above, and also a single-line string? I don't want the line containing header.Type, because I'm not looking for it!
Why doesn't '\\\n' work as I expected, matching a backslash + newline sequence?

Try this regex:
(?:[^\r\n]+\\[\r\n]*)+|[^\r\n]+
See the DEMO

The reason your pattern is not matching backslash+newline is that you have r'\\n' which means a backslash + 'n'.
For the case above, you could try this regex:
\nheader\.Description\s*=\s*[^\r\n]+(?P<broken_line>\\\n.+)
See demo here.
But it is not advisable to parse code with regular expressions, because Python is not a regular language. Use the ast module instead.
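Both routes can be sketched on the sample file content from the question. This is an illustration under assumptions, not a definitive solution: the pattern below is a variation (not the exact regex above) that treats backslash + newline as part of the value, and the names TEMPLATE, find_assignment and header_values are invented for the example. In a raw string, \\\n is read by the regex engine as a literal backslash followed by a newline.

```python
import ast
import re

# The question's file content, written out as a string.
file_content = (
    'header.Description ="long"\\\n'
    '"description"\n'
    'header.Priority =1\n'
    'header.Type ="short"\n'
)

# (?:\\\n|.)+ consumes either a backslash-newline pair or any character
# other than a newline, so a continued line is kept as part of the value.
TEMPLATE = r'^header\.%s\s*=\s*((?:\\\n|.)+)'


def find_assignment(name, text):
    """Return the full text of the 'header.<name> = ...' assignment, or None."""
    match = re.search(TEMPLATE % re.escape(name), text, re.MULTILINE)
    return match.group(0) if match else None


description = find_assignment('Description', file_content)
priority = find_assignment('Priority', file_content)

# The ast route: let Python parse the file and read the assigned literals.
header_values = {}
for node in ast.parse(file_content).body:
    if isinstance(node, ast.Assign):
        target = node.targets[0]
        if (isinstance(target, ast.Attribute)
                and isinstance(target.value, ast.Name)
                and target.value.id == 'header'):
            header_values[target.attr] = ast.literal_eval(node.value)

print(description)
print(priority)
print(header_values)
```

The ast version returns the evaluated values, so the two adjacent string literals in the continued line come back as one string.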

Multiline regex python

I'm trying to do some text file parsing where this pattern is repeated throughout the file:
VERSION.PROGRAM:program_name
VERSION.SUBPROGRAM:sub_program_name
My intent is, given a program_name, to retrieve the sub_program_name for each block of text I mentioned above.
I have the following function that finds if the text actually exists, but doesn't print the sub_program_name:
def find_subprogram(program_name):
    regex_string = r'VERSION.PROGRAM:%s\nVERSION.SUBPROGRAM:.' % program_name
    with open('file.txt', 'r') as f:
        match = re.search(regex_string, f.read(), re.DOTALL|re.MULTILINE)
    if match:
        print match.group()
I will appreciate some help or tips.
Thanks

Your regex has a typo, it's looking for PRGRAM.

You don't need the MULTILINE flag to match across several lines. All it does is make ^ and $ match at the start and end of every line instead of only at the start and end of the whole string; your pattern already contains an explicit \n.
Also, %s is not regex syntax: it is just Python string formatting that inserts the program name into the pattern.
To capture the sub-program name, add a capturing group such as (.*) after SUBPROGRAM:, and drop re.DOTALL so that . stops at the end of the line.
Here is an example.
Using VERSION\.PROGRAM:YOURSTRING\nVERSION\.SUBPROGRAM:(.*) will match the groups properly:
re.compile(r'VERSION\.PROGRAM:%s\nVERSION\.SUBPROGRAM:(.*)' % re.escape(yourstr))
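Put together, the capturing-group approach might look like the sketch below. It works on a string rather than a file, and both the sample text and the function name find_subprograms are assumptions made for illustration.

```python
import re


def find_subprograms(text, program_name):
    """Return every sub-program name listed directly after the given program."""
    pattern = re.compile(
        # re.escape keeps characters in program_name from being read as regex syntax.
        r'VERSION\.PROGRAM:%s\nVERSION\.SUBPROGRAM:(.*)' % re.escape(program_name)
    )
    # Without re.DOTALL, (.*) stops at the end of the SUBPROGRAM line.
    return pattern.findall(text)


# Hypothetical file content with two blocks.
sample_text = (
    'VERSION.PROGRAM:alpha\n'
    'VERSION.SUBPROGRAM:alpha_sub\n'
    'VERSION.PROGRAM:beta\n'
    'VERSION.SUBPROGRAM:beta_sub\n'
)

alpha_subs = find_subprograms(sample_text, 'alpha')
print(alpha_subs)
```

For a real file, read its content with f.read() and pass the resulting string as text.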

Using regex in Python 2.7.3 to search text and output matches

I am trying to accomplish exactly what the title says. The program is meant to read a .txt file from a specified path and match the terms specified in the code. This is what I have so far:
import re
source = open("C:\\test.txt", "r")
lines = []
for line in source:
    line = line.strip()
    lines.append(line)
    if re.search('reply', line):
        print 'found: ', line
As you can see, I am specifying the term 'reply' using re.search but this restricts me to one term. I know there is a way to specify a list or dictionary of words to search for, but my attempts have failed. I think it's possible to create a list with something like ...
keywords = ['reply', 'error', 'what']
... but despite what I've read on this site, I can't seem to incorporate this into the code properly. Any advice or assistance with this is greatly appreciated!
PS. If I wanted to make the search case sensitive, would I be able to use ...
"(.*)(R|r)eply(.*)"
... in the list of terms I want to find?

One way:
import re
source = open("input", "r")
lines = []
keywords = ['reply', 'error', 'what']
# join list with OR, '|', operators
# re.I makes it case-insensitive
exp = re.compile("|".join(keywords), re.I)
for line in source:
    line = line.strip()
    lines.append(line)
    if re.search(exp, line):
        print 'found: ', line

With re.search(), you pass a single string, but you can specify quite complex patterns. See the docs on the Python re module, which has a section on "Regular Expression Syntax".
In fact you have the answer in your question... "R|r" searches for "R" or "r", so "reply|error|what" searches for 'reply', 'error', or 'what'.
PS. If I wanted to make the search case sensitive, would I be able to use ...
"(.*)(R|r)eply(.*)"
There's no need for the .* bit (and it may make your code slower). The re.search() function looks for a match anywhere in the string. (R|r)eply will look for 'reply' or 'Reply', it won't match 'REPLY' or 'rePly'.
If you want a case insensitive search, there's a flags=re.IGNORECASE option that you can pass to re.search(). E.g.:
re.search('reply', line, flags=re.IGNORECASE)
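A sketch combining the keyword list with the IGNORECASE flag is shown below, written with Python 3 syntax. Two details are additions that the answers above do not require: re.escape guards against keywords that contain regex metacharacters, and the \b word boundaries (optional) stop 'what' from matching inside 'whatever'. The sample lines are invented.

```python
import re

keywords = ['reply', 'error', 'what']

# Build one alternation such as \b(?:reply|error|what)\b from the list.
pattern = re.compile(
    r'\b(?:%s)\b' % '|'.join(re.escape(word) for word in keywords),
    flags=re.IGNORECASE,
)

# Hypothetical lines standing in for the file's content.
sample_lines = [
    'Please REPLY soon',
    'Nothing to see here',
    'An Error occurred',
    'whatever',
]

found = [line for line in sample_lines if pattern.search(line)]
for line in found:
    print('found:', line)
```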
