Using python to find specific pattern contained in a paragraph

I'm trying to use python to go through a file, find a specific piece of information and then print it to the terminal. The information I'm looking for is contained in a block that looks something like this:
\\Version=EM64L-G09RevD.01\State=1-A1\HF=-1159.6991675\RMSD=4.915e-11\RMSF=1.175e-07\ZeroPoint=0.0353317\
I would like to be able to get the information HF=-1159.6991675. More generally, I would like the script to copy and print \HF=WhateverTheNumberIs\
I've managed to make scripts that are able to copy an entire line and print it out to the terminal, but I am unsure how to accomplish this particular task.

My suggestion is to use regular expressions (regex) to catch the required pattern:
import re  # for using regular expressions
s = open("<filename here>").read()  # read the content of the file and hold it as a string to be scanned
p = re.compile(r"\\HF=[^\\]+")  # the pattern you described: starts with \HF= and runs up to the next \
print(p.findall(s))  # finds all occurrences and prints them
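As a minimal sketch of the approach above, here is the same kind of pattern run against the sample block from the question. The text is inlined rather than read from a file, and the names text, hf_pattern and matches are only illustrative. A capture group is added so that just the number comes back.

```python
import re

# Sample block from the question; backslashes are doubled because this is
# an ordinary (non-raw) string literal. In practice this would come from a file.
text = (
    "\\Version=EM64L-G09RevD.01\\State=1-A1\\HF=-1159.6991675"
    "\\RMSD=4.915e-11\\RMSF=1.175e-07\\ZeroPoint=0.0353317\\"
)

# \\HF= matches a literal backslash followed by HF=,
# and ([^\\]+) captures everything up to the next backslash.
hf_pattern = re.compile(r"\\HF=([^\\]+)")

matches = hf_pattern.findall(text)  # one captured value per occurrence
for value in matches:
    print("HF=" + value)
```

With one capture group in the pattern, findall returns only the captured values, so the surrounding backslashes are not part of the result.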

Regular expressions are the answer, something like r'\\HF=.*?\\'.
Tutorial: regex tutorial
Once you have learned regex, it is an indispensable resource.

Related

How to find filenames with a specific extension using regex?

How can I grab 'dlc3.csv' and 'spongebob.csv' from the string below via the absolute quickest method, which I assume is regex?
4918, fx,fx,weapon/muzzleflashes/fx_m1carbine,3.3,3.3,|sp/zombie_m1carbine|weapon|../zone_source/dlc3.csv|csv|../zone_source/spongebob.csv|csv
I've already managed to achieve this by using split() and for loops, but it's slowing my program down way too much.
I would post an example of my current code, but it's got a load of other stuff in it, so it would only cause you to ask more questions.
In a nutshell, I'm opening a large 6,000-line .csv file and then using nested for loops to iterate through each line, using .split() to find specific parts in each line. I have many files where I need to scan specific things on each line. At the moment I've only implemented a couple of features in my Qt program, and it's already taking up to 5 seconds to load some things and up to 10 seconds for others, all of which is due to the nested loops. I've looked at where to use range, where not to, and where to use enumerate. I also use time.time() and logging.info() to show the speed of each code change. After asking around, I've been told that using a regex is the best option for me, as it would remove the need for many of my for loops. The problem is I have no clue how to use regex. I of course plan on learning it, but if someone could help me out with this, it would be much appreciated.
Thanks.
Edit: just to point out that when scanning each line, the filename is unknown; ".csv" is the only part that is known. So I basically need the regex to grab every filename before .csv, but of course without grabbing the crap before the filename.
I'm currently looking for .csv using .split('/') and .split('|'), then checking whether .csv is in each list item to grab the 'unknown' filename. Some lines will have only 1 filename whereas others will have 2+, so I need the regex to account for this too.

You can use this pattern: [^/]*\.csv
Breakdown:
[^/] - Any character that's not a forward slash (this includes newlines)
* - Zero or more of them
\. - A literal dot. (The backslash is necessary because the dot is a special character in regex.)
csv - The literal text csv
For example:
import re
s = '''4918, fx,fx,weapon/muzzleflashes/fx_m1carbine,3.3,3.3,|sp/zombie_m1carbine|weapon|../zone_source/dlc3.csv|csv|../zone_source/spongebob.csv|csv'''
pattern = re.compile(r'[^/]*\.csv')
result = pattern.findall(s)
Result:
['dlc3.csv', 'spongebob.csv']
Note: It could just as easily be result = re.findall(r'[^/]*\.csv', s), but for code cleanliness, I prefer naming my regexes. You might consider giving it an even clearer name in your code, like pattern_csv_basename or something like that.
Docs: re, including re.findall
See also: The official Python Regular Expression HOWTO
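One caveat with [^/]*\.csv is that it only stops at a forward slash, so a filename that follows a | directly would be returned with the preceding text attached. A slightly stricter sketch, assuming filenames never contain / or |, excludes both characters. The entry other.csv in the sample line is made up for illustration and is not in the original data.

```python
import re

# Sample line based on the question; "other.csv" is a hypothetical entry
# that follows a pipe directly instead of a slash.
line = (
    "4918, fx,fx,weapon/muzzleflashes/fx_m1carbine,3.3,3.3,"
    "|sp/zombie_m1carbine|weapon|../zone_source/dlc3.csv|csv|other.csv|csv"
)

# [^/|]+ : one or more characters that are neither a slash nor a pipe
# \.csv  : a literal ".csv"
csv_name = re.compile(r"[^/|]+\.csv")

names = csv_name.findall(line)
print(names)
```

Compiling the pattern once and reusing it for every line of a large file avoids repeating the setup work inside the loop.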

Python - Parsing JSON formatted text file with regex

I have a text file formatted like a JSON file however everything is on a single line (could be a MongoDB File). Could someone please point me in the direction of how I could extract values using a Python regex method please?
The text shows up like this:
{"d":{"__type":"WikiFileNodeContent:http:\/\/samplesite.com.‌au\/ns\/business\/wi‌ki","author":null,"d‌escription":null,"fi‌leAssetId":"034b9317‌-60d9-45c2-b6d6-0f24‌b59e1991","filename"‌:"Reports.pdf"},"cre‌atedBy":1531,"create‌dByUsername":"John Cash","icon":"\/Assets10.37.5.0\/pix\/16x16\/page_white_acro‌bat.png","id":3041,"‌inheritedPermissions‌":false,"name":"map"‌,"permissions":[23,8‌7,35,49,65],"type":3‌,"viewLevel":2},{"__‌type":"WikiNode:http‌:\/\/samplesite.com.‌au\/ns\/business\/wi‌ki","children":[],"c‌ontent":
I want to get the "fileAssetId" and "filename" values. I've tried to load the text with Python's json module, but I get an error.
For the fileAssetId I tried this regex:
regex = re.compile(r"([0-9a-f]{8})\S*-\S*([0-9a-f]{4})\S*-\S*([0-9a-f]{4})\S*-\S*([0-9a-f]{4})\S*-\S*([0-9a-f]{12})")
But i get the following 034b9317‌, 60d9, 45c2, b6d6, 0f24‌b59e1991
I'm not too sure how to get the data as it's displayed.

How about using positive lookahead and lookbehind:
(?<=\"fileAssetId\":\")[a-fA-F0-9-]+?(?=\")
captures the fileAssetId and
(?<=\"filename\":\").+?(?=\")
matches the filename.
For a detailed explanation of the regex have a look at the Regex101-Example. (Note: I combined both in the example with an OR-Operator | to show both matches at once)
To get a list of all matches use re.findall or re.finditer instead of re.match.
re.findall(pattern, string) returns a list of matching strings.
re.finditer(pattern, string) returns an iterator of match objects.
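A short sketch of the two lookaround patterns used with re.findall and re.finditer follows. The sample text is a shortened, cleaned-up version of the data in the question, and the variable names are illustrative.

```python
import re

# Shortened sample based on the question's data (an assumption for illustration).
text = (
    '{"d":{"author":null,'
    '"fileAssetId":"034b9317-60d9-45c2-b6d6-0f24b59e1991",'
    '"filename":"Reports.pdf"},"createdBy":1531}'
)

# The lookbehind anchors on the key, the lookahead stops at the closing quote.
asset_id = re.compile(r'(?<="fileAssetId":")[a-fA-F0-9-]+?(?=")')
file_name = re.compile(r'(?<="filename":").+?(?=")')

# findall returns the matched strings directly.
ids = asset_id.findall(text)

# finditer yields match objects; group(0) is the whole match.
names = [m.group(0) for m in file_name.finditer(text)]

print(ids)
print(names)
```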

You can use python's walk method and check each entry with re.match.
If the string you have cannot be converted to a Python dict, you can use just a regex:
print(re.match(r'.*fileAssetId\":\"([^\"]+)\".*', your_string).group(1))
Solution for your example:
import re
example_string = '{"d":{"__type":"WikiFileNodeContent:http:\/\/samplesite.com.u\/ns\/business\/wiki","author":null,"description":null,"fileAssetId":"034b9317-60d9-45c2-b6d6-0f24b59e1991","filename":"Reports.pdf"},"createdBy":1531,"createdByUsername":"John Cash","icon":"\/Assets10.37.5.0\/pix\/16x16\/page_white_acrobat.png","id":3041,"inheritedPermissions":false,"name":"map","permissions":[23,87,35,49,65],"type":3,"viewLevel":2},{"__type":"WikiNode:http:\/\/samplesite.com.au\/ns\/business\/wiki","children":[],"content"'
regex_pattern = r'.*fileAssetId\":\"([^\"]+)\".*'
match = re.match(regex_pattern, example_string)
fileAssetId = match.group(1)
print('fileAssetId: {}'.format(fileAssetId))
Executing this yields:
fileAssetId: 034b9317-60d9-45c2-b6d6-0f24b59e1991

Try adding \n to the string that you are entering in to the file (\n means new line)

Based on the idea given here https://stackoverflow.com/a/3845829 and by following the JSON standard https://www.json.org/json-en.html, we can use Python + regex https://pypi.org/project/regex/ and do the following:
import regex  # third-party package (pip install regex), not the standard-library re
json_pattern = (
r'(?(DEFINE)'
r'(?P<whitespace>( |\n|\r|\t)*)'
r'(?P<boolean>true|false)'
r'(?P<number>-?(0|([1-9]\d*))(\.\d*[1-9])?([eE][+-]?\d+)?)'
r'(?P<string>"([^"\\]|\\("|\\|/|b|f|n|r|t|u[0-9a-fA-F]{4}))*")'
r'(?P<array>\[((?&whitespace)|(?&value)(,(?&value))*)\])'
r'(?P<key>(?&whitespace)(?&string)(?&whitespace))'
r'(?P<value>(?&whitespace)((?&boolean)|(?&number)|(?&string)|(?&array)|(?&object)|null)(?&whitespace))'
r'(?P<object>\{((?&whitespace)|(?&key):(?&value)(,(?&key):(?&value))*)\})'
r'(?P<document>(?&object)|(?&array))'
r')'
r'(?&document)'
)
json_regex = regex.compile(json_pattern)
match = json_regex.match(json_document_text)
You can change the last line of json_pattern to match individual objects rather than the whole document by replacing (?&document) with (?&object). I think the regex is easier than I expected, but I did not run extensive tests on it. It works fine for me, and I have tested it on hundreds of files. I will try to improve my answer if I find any issue when running it.

Find text between strings python

I looked at similar question and answers but could not solve my issue.
I have a string, like the following:
ecc, ecc, .....thisIsUnique("92781227-7e7e-4768-8ee3-4e1615bddf3c", ecc, ecc.......
The text before and after could be very long and contain nothing unique.
What I need is to get the 92781227-7e7e-4768-8ee3-4e1615bddf3c code as a string. So I'm looking for something that works like this:
when you find thisIsUnique, go ahead and start reading the code after the first (" characters, and keep reading until you find the first ", characters.
Unfortunately I'm not familiar with regex, but maybe there are different ways to solve the problem.
Thanks to all.

There are a few sites you should read to learn what regex is: https://regexone.com/ and Learning Regular Expressions. Use a site like https://regex101.com/ to test what you have tried. But to get you started, this works on exactly what you pasted as an example:
import re
text = 'ecc, ecc, .....thisIsUnique("92781227-7e7e-4768-8ee3-4e1615bddf3c", ecc, ecc.......'
match = re.search(r'thisIsUnique\("([^"]+)', text)  # raw string so \( reaches the regex engine unchanged
print(match.group(1))
result:
92781227-7e7e-4768-8ee3-4e1615bddf3c

Use re.search:
In [991]: text = 'ecc, ecc, .....thisIsUnique("92781227-7e7e-4768-8ee3-4e1615bddf3c", ecc, ecc.......'
In [992]: re.search(r'(?<=thisIsUnique\(")(.*?)"', text).group(1)
Out[992]: '92781227-7e7e-4768-8ee3-4e1615bddf3c'
The pattern r'(?<=thisIsUnique\(")(.*?)"' employs a lookbehind.
Additional Reading
Regex HOWTO - getting started with tutorial
General documentation
Additional tutorial site - TutorialsPoint

Python - regular expressions - find every word except in tags

How to find all words except the ones in tags using RE module?
I know how to find something, but how do I do the opposite? I write a pattern to search for, but actually I want to find every word except everything inside tags and the tags themselves.
So far I managed this:
f = open (filename,'r')
data = re.findall(r"<.+?>", f.read())
Well, it prints everything inside <> tags, but how do I make it find every word except what's inside those tags?
I tried using ^ at the start of the pattern inside [], but then symbols such as . are treated literally, without special meaning.
I also managed to solve this by splitting the string using '''\= <>"''', then checking the whole string for words that are inside <> tags (like align, right, td, etc.) and appending the words that are not inside <> tags to another list. But that's a bit of an ugly solution.
Is there some simple way to search for every word except anything that's inside <> and these tags themselves?
So let's say the string is 'hello 123 <b>Bold</b> <p>end</p>'
with re.findall, would return:
['hello', '123', 'Bold', 'end']

Using regex for this kind of task is not the best idea, as you cannot make it work for every case.
One solution that should catch most such words is the regex pattern
\b\w+\b(?![^<]*>)
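Here is a small sketch of that pattern applied to the example string from the question. The negative lookahead rejects any word that is followed by a > with no < in between, which is what being inside a tag looks like.

```python
import re

html = 'hello 123 <b>Bold</b> <p>end</p>'

# \b\w+\b      : a whole word
# (?![^<]*>)   : not followed by a ">" before the next "<", i.e. not inside a tag
words = re.findall(r'\b\w+\b(?![^<]*>)', html)
print(words)
```

As noted above, this is a heuristic: a stray > or < in the text itself can lead to wrong results.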

If you want to avoid using a regular expression, BeautifulSoup makes it very easy to get just the text from an HTML document:
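from bs4 import BeautifulSoup  # BeautifulSoup 4 (pip install beautifulsoup4)
soup = BeautifulSoup(html_string, "html.parser")  # html_string holds the HTML document
text = "".join(soup.find_all(string=True))  # concatenate all text nodes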
From there, you can get the list of words with split:
words = text.split()

Something like re.compile(r'<[^>]+>').sub('', string).split() should do the trick.
You might want to read this post about processing context-free languages using regular expressions.

Strip out all the tags (using your original regex), then match words.
The only weakness is if there are <s in the strings other than as tag delimiters, or the HTML is not well formed. In that case, it is better to use an HTML parser.
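A minimal sketch of this two-step approach, using the tag regex from the question and the example string given there. Replacing each tag with a space instead of an empty string is a choice made here so that words on either side of a tag do not get glued together.

```python
import re

html = 'hello 123 <b>Bold</b> <p>end</p>'

# Step 1: strip the tags using the original regex from the question.
text_only = re.sub(r'<.+?>', ' ', html)

# Step 2: match the remaining words.
words = re.findall(r'\w+', text_only)
print(words)
```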

Python: finding a variable in a string

I currently have some code that goes to a URL, fetches the source code, and I'm trying to get it to return a variable from the string. So I created:
changetime = refreshsource.find('VARIABLE pm NST')
But it wouldn't find the area in the string because the word is not VARIABLE, it is something else. How would I retrieve the constantly changing VARIABLE from that string?

A regular expression will be able to achieve this for you. If you give some examples of what the variable will be, we could come up with a stricter expression. To match what you have above, something like the following will do:
import re
# this will match 01:23, 11:34, 12:00, etc.
timex = re.compile('.*(\d{2}:\d{2})[ ]?pm NST')
match = timex.match(text, re.M|re.S)
variable = match.groups(0)
Edit: this code will actually work (unlike that first attempt :) ):
import re
# this will match 01:23, 11:34, 12:00, etc.
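timex = re.compile(r'(\d{2}:\d{2})[ ]?pm NST')  # raw string keeps \d intact
match = timex.search(text)  # search scans the whole string, unlike match
if match:
    variable = match.group(1)  # the captured time, e.g. '11:34'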
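For a self-contained illustration, the sketch below runs the same kind of search on a made-up string. The sample text is an assumption, since the question does not show the real page source. It also handles the case where nothing matches.

```python
import re

# Hypothetical page text; the real source is not shown in the question.
text = "Last updated at 11:34 pm NST on the status page"

# Two digits, a colon, two digits, an optional space, then the fixed suffix.
timex = re.compile(r'(\d{2}:\d{2})[ ]?pm NST')

match = timex.search(text)
# group(1) is the captured time; fall back to None when nothing matches.
variable = match.group(1) if match else None
print(variable)
```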

If the pattern is really that simple, then this seems like a typical case where regular expressions come in quite handy.
Note: if you are new to regular expressions, you may want to read an introduction, such as http://www.regular-expressions.info.
On the other hand, if the pattern is more complex, then you may want to use an HTML parser, such as BeautifulSoup.
