Python Regex for Searching pattern in text file - python

Tags in Sample.txt:
<ServiceRQ>want everything between...</ServiceRQ>
<ServiceRQ xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance>want everything between</ServiceRQ>
..
Can someone please help me write a regex that extracts the expected output from a text file? I want to find the tags shown above.
This is what I have tried: re.search(r"<(.*?)RQ(.*?)>(.*?)</(.*?)RQ>", line), but it is not working properly. I want to search the text file for tags whose names contain RQ.
The expected output should be
1. <ServiceRQ>want everything between</ServiceRQ>
2. <ServiceRQ xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance>want everything between</ServiceRQ>

Try this pattern
regex= r'<\w+RQ.*?>.*?</\w+RQ>'
data=re.findall(regex, line)
The above regex will give output like
['<ServiceRQ>want everything between...</ServiceRQ>', '<ServiceRQ xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance>want everything between</ServiceRQ>']

As Ashish has mentioned, this one gives the tag including the contents.
regex= r'<\w+RQ.*?>.*?</\w+RQ>'
data=re.findall(regex, line)
To retrieve just the contents within the tags, change .*? to (.*?) between the tags:
regex = r'<\w+RQ.*?>(.*?)<\/\w+RQ>'
data = re.findall(regex, sample)
This would result in the following output:
['want everything between...', 'want everything between']
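The two patterns above can be combined into one pass that returns both the whole element and its contents. This is only a sketch: the sample string is a hypothetical stand-in for the contents of Sample.txt, and the backreference \1 assumes the closing tag repeats the name of the opening tag.

```python
import re

# Hypothetical sample text standing in for the contents of Sample.txt.
sample = (
    '<ServiceRQ>want everything between...</ServiceRQ>\n'
    '<OtherTag>ignored</OtherTag>\n'
    '<ServiceRQ xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">'
    'want everything\nbetween</ServiceRQ>\n'
)

# [^>]* keeps the attribute match inside the opening tag;
# re.DOTALL lets .*? cross newlines inside the element body;
# \1 requires the closing tag to use the same name as the opening tag.
tag_pattern = re.compile(r'<(\w+RQ)\b[^>]*>(.*?)</\1>', re.DOTALL)

# Each entry is (whole element, contents only).
matches = [(m.group(0), m.group(2)) for m in tag_pattern.finditer(sample)]

for whole, contents in matches:
    print(repr(contents))
```

Reading the whole file with f.read() instead of matching line by line is what allows an element that spans several lines to be found.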

Related

Python - Parsing JSON formatted text file with regex

I have a text file formatted like a JSON file however everything is on a single line (could be a MongoDB File). Could someone please point me in the direction of how I could extract values using a Python regex method please?
The text shows up like this:
{"d":{"__type":"WikiFileNodeContent:http:\/\/samplesite.com.au\/ns\/business\/wiki","author":null,"description":null,"fileAssetId":"034b9317-60d9-45c2-b6d6-0f24b59e1991","filename":"Reports.pdf"},"createdBy":1531,"createdByUsername":"John Cash","icon":"\/Assets10.37.5.0\/pix\/16x16\/page_white_acrobat.png","id":3041,"inheritedPermissions":false,"name":"map","permissions":[23,87,35,49,65],"type":3,"viewLevel":2},{"__type":"WikiNode:http:\/\/samplesite.com.au\/ns\/business\/wiki","children":[],"content":
I want to get the "fileAssetId" and "filename" values. I've tried to load the line with Python's json module, but I get an error.
For the fileAssetId I tried this regex:
regex = re.compile(r"([0-9a-f]{8})\S*-\S*([0-9a-f]{4})\S*-\S*([0-9a-f]{4})\S*-\S*([0-9a-f]{4})\S*-\S*([0-9a-f]{12})")
But I get the following: 034b9317, 60d9, 45c2, b6d6, 0f24b59e1991
I'm not too sure how to get the data as it is displayed.
How about using positive lookahead and lookbehind:
(?<=\"fileAssetId\":\")[a-fA-F0-9-]+?(?=\")
captures the fileAssetId and
(?<=\"filename\":\").+?(?=\")
matches the filename.
For a detailed explanation of the regex have a look at the Regex101-Example. (Note: I combined both in the example with an OR-Operator | to show both matches at once)
To get a list of all matches use re.findall or re.finditer instead of re.match.
re.findall(pattern, string) returns a list of matching strings.
re.finditer(pattern, string) returns an iterator of match objects.
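As a rough illustration of the two lookaround patterns with re.findall and re.finditer, here is a minimal sketch. The text variable is a shortened, hypothetical version of the line from the question, and the escaped quotes are dropped because a double quote needs no escaping inside a single-quoted raw string.

```python
import re

# Shortened, hypothetical version of the line from the question.
text = (
    '{"d":{"author":null,'
    '"fileAssetId":"034b9317-60d9-45c2-b6d6-0f24b59e1991",'
    '"filename":"Reports.pdf"},"createdBy":1531}'
)

# re.findall returns the matched strings directly.
asset_ids = re.findall(r'(?<="fileAssetId":")[a-fA-F0-9-]+?(?=")', text)

# re.finditer yields match objects; group(0) is the matched text.
filenames = [m.group(0) for m in re.finditer(r'(?<="filename":").+?(?=")', text)]

print(asset_ids)
print(filenames)
```

The lookbehind works here because its content has a fixed width, which Python's re module requires.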
You can use python's walk method and check each entry with re.match.
If the string you have cannot be converted to a Python dict, you can use a regex alone:
print(re.match(r'.*fileAssetId\":\"([^\"]+)\".*', your_string).group(1))
Solution for your example:
import re
example_string = '{"d":{"__type":"WikiFileNodeContent:http:\/\/samplesite.com.u\/ns\/business\/wiki","author":null,"description":null,"fileAssetId":"034b9317-60d9-45c2-b6d6-0f24b59e1991","filename":"Reports.pdf"},"createdBy":1531,"createdByUsername":"John Cash","icon":"\/Assets10.37.5.0\/pix\/16x16\/page_white_acrobat.png","id":3041,"inheritedPermissions":false,"name":"map","permissions":[23,87,35,49,65],"type":3,"viewLevel":2},{"__type":"WikiNode:http:\/\/samplesite.com.au\/ns\/business\/wiki","children":[],"content"'
regex_pattern = r'.*fileAssetId\":\"([^\"]+)\".*'
match = re.match(regex_pattern, example_string)
fileAssetId = match.group(1)
print('fileAssetId: {}'.format(fileAssetId))
Executing this yields:
fileAssetId: 034b9317-60d9-45c2-b6d6-0f24b59e1991
Try adding \n to the string that you are entering in to the file (\n means new line)
Based on the idea given here https://stackoverflow.com/a/3845829 and by following the JSON standard https://www.json.org/json-en.html, we can use Python + regex https://pypi.org/project/regex/ and do the following:
import regex  # third-party module from PyPI, not the standard-library re
json_pattern = (
r'(?(DEFINE)'
r'(?P<whitespace>( |\n|\r|\t)*)'
r'(?P<boolean>true|false)'
r'(?P<number>-?(0|([1-9]\d*))(\.\d*[1-9])?([eE][+-]?\d+)?)'
r'(?P<string>"([^"\\]|\\("|\\|/|b|f|n|r|t|u[0-9a-fA-F]{4}))*")'
r'(?P<array>\[((?&whitespace)|(?&value)(,(?&value))*)\])'
r'(?P<key>(?&whitespace)(?&string)(?&whitespace))'
r'(?P<value>(?&whitespace)((?&boolean)|(?&number)|(?&string)|(?&array)|(?&object)|null)(?&whitespace))'
r'(?P<object>\{((?&whitespace)|(?&key):(?&value)(,(?&key):(?&value))*)\})'
r'(?P<document>(?&object)|(?&array))'
r')'
r'(?&document)'
)
json_regex = regex.compile(json_pattern)
match = json_regex.match(json_document_text)
You can change the last line of json_pattern to match individual objects instead of a whole document by replacing (?&document) with (?&object). The regex turned out to be simpler than I expected, but I have not run extensive tests on it. It works fine for me, and I have tested it on hundreds of files. I will try to improve my answer if I find any issue when running it.

How to capture all content between two captured groups

I have a txt file that I converted from a pdf that contains a long list of items. These items have a numbering convention as follows:
[A-Z]{1,2}\d{1,2}\.\d{1,2}\.\d{1,2}
This expression would match something between:
A1.1.1
and
ZZ99.99.99
This works just fine. The issue I am having is that I am trying to capture this in group 1 and everything between each item number (the item description) in group 2.
I also need these returned as a list or an iterable so that, eventually, the contents captured can be exported to an excel spreadsheet.
This is the regex I have currently:
^([A-Z]{1,2}\d{1,2}\.\d{1,2}\.\d{1,2}\s)([\w\W]*?)(?:\n)
Follow this link to find a sample of what I have and the issues I am facing:
Debuggex Demo
Is anyone able to help me figure out how to capture everything between each number no matter how many paragraphs?
Any input would be greatly appreciated, thanks!
You are very close:
import re
s = """
A1.2.1 This is the first paragraph of the description that is being captured by the regex even if the description contains multiple lines of text.ZZ99.99.99
"""
final_data = re.findall(r"[A-Z]{1,2}\d{1,2}\.\d{1,2}\.\d{1,2}(.*?)[A-Z]{1,2}\d{1,2}\.\d{1,2}\.\d{1,2}", s)
Output:
[' This is the first paragraph of the description that is being captured by the regex even if the description contains multiple lines of text.']
By using (.*?) you can match any text between the letters and numbers as defined by your first regex.
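If the closing item number should not be consumed, so that every item in a longer list is returned with its own description, one possible variant replaces the trailing number with a lookahead. This is a sketch under two assumptions that the question does not confirm: each item number starts a line, and no description line begins with something that looks like an item number. The sample text is made up for illustration.

```python
import re

# Made-up sample; the real file comes from a PDF conversion.
text = """A1.2.1 First description,
spanning two lines.

Second paragraph of the same item.
B12.3.4 Another item.
ZZ99.99.99 Last item.
"""

item_pattern = re.compile(
    # group 1: item number at the start of a line
    r'^([A-Z]{1,2}\d{1,2}\.\d{1,2}\.\d{1,2})\s+'
    # group 2: description, matched lazily across newlines
    r'(.*?)'
    # stop just before the next item number, or at the end of the text
    r'(?=^[A-Z]{1,2}\d{1,2}\.\d{1,2}\.\d{1,2}\s|\Z)',
    re.DOTALL | re.MULTILINE,
)

# findall returns (number, description) tuples because there are two groups.
items = [(num, desc.strip()) for num, desc in item_pattern.findall(text)]

for num, desc in items:
    print(num, repr(desc))
```

Because the lookahead does not consume the next number, that number is still available as group 1 of the following match, and the resulting list of tuples can be written out row by row to a spreadsheet.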

Multiline regex python

I'm trying to do some text file parsing where this pattern is repeated throughout the file:
VERSION.PROGRAM:program_name
VERSION.SUBPROGRAM:sub_program_name
My intent is to retrieve, given a program_name, the sub_program_name for each block of text I mentioned above.
I have the following function that finds if the text actually exists, but doesn't print the sub_program_name:
def find_subprogram(program_name):
    regex_string = r'VERSION.PROGRAM:%s\nVERSION.SUBPROGRAM:.' % program_name
    with open('file.txt', 'r') as f:
        match = re.search(regex_string, f.read(), re.DOTALL|re.MULTILINE)
    if match:
        print(match.group())
I will appreciate some help or tips.
Thanks
Your regex has a typo, it's looking for PRGRAM.
To match across several lines you don't need the MULTILINE modifier. All it does is make ^ and $ match at the start and end of every line instead of only at the start and end of the whole string. DOTALL, on the other hand, lets . match newlines, so with it (.*) would run past the end of the line.
You also are not using valid regex matching techniques. You should look up how to properly use regex.
To capture the rest of the line, use (.*). %s is only a string-formatting placeholder, not a regex wildcard.
Here is an example
Using VERSION\.PROGRAM:YOURSTRING\nVERSION\.SUBPROGRAM:(.*) will match the groups properly
re.compile(r'VERSION\.PROGRAM:%s\nVERSION\.SUBPROGRAM:(.*)' % re.escape(yourstr))
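Put together, the function from the question might look like the sketch below. It takes the text as an argument instead of opening file.txt so that it can be run on its own; the function name find_subprograms and the sample data are made up for illustration.

```python
import re

def find_subprograms(program_name, text):
    """Return the sub-program name of every block that belongs to program_name."""
    # No flags: without DOTALL, (.*) stops at the end of the line.
    # re.escape protects against regex metacharacters in the program name.
    pattern = re.compile(
        r'VERSION\.PROGRAM:%s\nVERSION\.SUBPROGRAM:(.*)' % re.escape(program_name)
    )
    return pattern.findall(text)

# Made-up file contents.
sample_text = (
    "VERSION.PROGRAM:alpha\n"
    "VERSION.SUBPROGRAM:alpha_sub\n"
    "VERSION.PROGRAM:beta\n"
    "VERSION.SUBPROGRAM:beta_sub\n"
    "VERSION.PROGRAM:alpha\n"
    "VERSION.SUBPROGRAM:alpha_other\n"
)

result = find_subprograms("alpha", sample_text)
print(result)
```

With a real file, the text argument would be the result of f.read() inside a with open(...) block.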

Python Regex for selecting text between fixed title

I am trying to get the text between two fixed headers in Python.
Please check this link http://regex101.com/r/jV4oP5/1
I want to extract everything that comes after OPINION. The regex I wrote matches only the first line, and it also matches OPINION BY.
Is there another regex that can fetch the data?
Any help is appreciated.
Use the DOTALL modifier (s) to extract everything after OPINION.
OPINION.*
If you don't want to match OPINION then use a lookbehind,
(?<=OPINION).*
If you really mean you want the entire document after "OPINION", try: (OPINION(\n*.*)*$)
Your regex was only finding the first new line character followed by any normal characters (excluding new line).
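In Python the DOTALL modifier is passed as a flag. The sketch below first shows the lookbehind pattern with re.DOTALL, then a stricter variant that skips an "OPINION BY" line. The document text is invented, and the stricter variant assumes the header sits on a line of its own, which the question does not confirm.

```python
import re

# Invented document; the real one is behind the regex101 link.
document = (
    "CASE TITLE\n"
    "OPINION BY: JUDGE\n"
    "OPINION\n"
    "First paragraph.\n"
    "\n"
    "Second paragraph.\n"
)

# Lookbehind plus DOTALL: everything after the first occurrence of OPINION,
# which here is the "OPINION BY" line.
after_first = re.search(r'(?<=OPINION).*', document, re.DOTALL).group(0)

# Assumption: the real header is the word OPINION alone on its line.
# MULTILINE lets ^ match at the start of that line; DOTALL lets (.*) run to the end.
match = re.search(r'^OPINION[ \t]*\n(.*)', document, re.MULTILINE | re.DOTALL)
opinion_text = match.group(1) if match else ''

print(repr(opinion_text))
```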

Removing TAGS in a document

I need to find all the tags in a .txt file (an SEC filing) and remove them from the filing.
As a beginner in Python, I used the following code to find the tags, but it returns None, None, ... and I don't know how to remove all the tags. My question is how to find all the tags <....> and remove them so that the document contains everything but tags.
import re
tags = [re.search(r'<.+>', line) for line in mylist]
# mylist is the list of lines returned by open(filename, 'rU').readlines()
Thanks for your time.
Use something like this:
re.sub(r'<[^>]+>', '', open(filename, 'r').read())
Your current code is getting a None for each line that does not include angle-bracketed tags.
You probably want to use [^>] to make sure it matches only up to the first >.
re.sub(r'<.*?>', '', line)
Use re.sub and <.*?> expression
Well, for starters, you're going to need a different regex. The one you have will select everything between the first '<' and the last '>'. So the string:
I can type in <b>BOLD</b>
would render the match:
<b>BOLD</b>
The way to fix this is to use a lazy quantifier. You should be using
<.+?>
to match HTML tags. And ultimately, you should be substituting, so:
re.sub(r'<.+?>', '', line)
Though, I suspect what you'd actually like to match is between the tags. Here's where a good lookahead can do wonders!
(?<=>).+?(?=<)
Looks crazy, but it breaks down pretty easy. Let's start with what you know:
.+?
matches a string of arbitrary length. ? means it will match the shortest string possible. (The laziness we added before)
(?<=...)
is a lookbehind. It literally looks behind itself without consuming the expression.
(?=...)
is a lookahead. It works like a lookbehind, but checks the text that follows. Then with a little findall:
re.findall(r'(?<=>).+?(?=<)', line)
Now you can iterate over the list and trim any unnecessary spaces that got left behind, which makes for some really nice output! Or, if you'd really like to use a substitution method (I know I would):
re.sub(r'\s*(?:<.+?>\s*)+', ' ', line)
the
\s*
will match any amount of whitespace attached to a tag, which you can then replace with one space, whittling down those unnerving double and triple spaces that often result from overcareful tagging. As a bonus, the
(?: ... )
is known as a non-capturing group (it won't give you smaller sub matches in your result). It's not really necessary in this situation for your purposes, but groups are always useful things to think about, and it's good practice to only capture the ones you need. Tacking a + onto the end of that (as I did), will capture as many tags as are right next to each other, eliminating them into a single space. So if the file has
This is <b> <i> overemphasized </b> </i>!
you'd get
This is overemphasized !
instead of
This is   overemphasized  !
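The difference between plain tag removal and the whitespace-collapsing substitution can be checked with a short sketch. It uses the [^>]+ form of the tag pattern from the earlier answers, and the variable names are only illustrative.

```python
import re

line = 'This is <b> <i> overemphasized </b> </i>!'

# Plain removal: every tag disappears, but the spaces around it stay.
stripped = re.sub(r'<[^>]+>', '', line)

# Collapsing removal: a run of adjacent tags, together with the
# whitespace around them, becomes a single space.
collapsed = re.sub(r'\s*(?:<[^>]+>\s*)+', ' ', line)

print(repr(stripped))
print(repr(collapsed))
```

For a whole filing, the same substitution can be applied to the result of open(filename).read() instead of a single line.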
