regex with python

I want to extract part of a file: it should start with [{"linkId":"changeDriveLink" and end with the text just before ,"zone"
My input is:
[{"linkIdsd":"changeDridsdve [{"linkId":"changeDriveLink","url":"/drive
/3696434","zoneId":"forceAjax"},{"linkId":"printProductsFormSubst","url":"/drive/rayon.pagetemplate.substitutionlist.printproductsformsubst","zoneId":"forc
,"zone"
and I want to get:
[{"linkId":"changeDriveLink","url":"/drive
/3696434","zoneId":"forceAjax"},{"linkId":"printProductsFormSubst","url":"/drive/rayon.pagetemplate.substitutionlist.printproductsformsubst","zoneId":"forc
How can I do this with a regex, please?

IMHO, using the json package and searching the parsed structure is better than writing a complex regex, which will be unreadable and quite hard to debug.
You can visit this post (Parsing json and searching through it) for more ideas.

The regular expression
re.compile(r'^\[\{"linkId":"changeDriveLink".*,"zone"', re.DOTALL)
should do this. The .* in the middle matches any run of characters, and re.DOTALL makes sure that even newlines are matched, in case your JSON is pretty-printed.
But I think it would be better to load the file with the json package and then check whether it satisfies your requirements:
import json

with open('filename_here.json', 'r') as json_file:
    data = json.load(json_file)

if data[0]['linkId'] == 'changeDriveLink':
    print('OK')      # the first element has the expected linkId
else:
    print('not OK')  # the first element is something else
Based on the string you've given, your json is a list (array), and its first element is a dict, and the dict has a 'linkId' key with the value 'changeDriveLink'. This is what I check in the if statement.
EDIT:
Now I understand what you want to do.
First, you should omit the ^ character from the beginning of the expression, since the string you provided is not the start of the JSON file; it should be the beginning of the result.
Then, you can get the string you want with e.g. grouping:
import re

pattern = re.compile(r'.*(?P<result>\[\{"linkId":"changeDriveLink".*),"zone"', re.DOTALL)
match_obj = pattern.match('your_json_string')
if match_obj is not None:
    the_string_you_want = match_obj.group('result')
What I used here is called a named group; you can read more about it in the documentation.
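As a rough sketch, here is the grouping idea applied to the sample input from the question. The raw string below is abridged from that input (the long URL is shortened). Two choices here are my own and not part of the answer above: search is used instead of match, so no leading .* is needed, and the lazy .*? stops at the first ,"zone" rather than the last.

```python
import re

# Abridged copy of the sample input from the question (one URL is shortened).
raw = (
    '[{"linkIdsd":"changeDridsdve [{"linkId":"changeDriveLink","url":"/drive\n'
    '/3696434","zoneId":"forceAjax"},{"linkId":"printProductsFormSubst",'
    '"url":"/drive/rayon","zoneId":"forc\n'
    ',"zone"'
)

# Named group "result" captures from the wanted start marker up to,
# but not including, the first literal ,"zone" that follows it.
pattern = re.compile(
    r'(?P<result>\[\{"linkId":"changeDriveLink".*?),"zone"',
    re.DOTALL,
)

match_obj = pattern.search(raw)
result = match_obj.group('result') if match_obj else None
print(result)
```

Note that ,"zoneId" does not end the match, because the pattern requires the closing quote directly after zone.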

Related

Format a string to a proper JSON object

I have a string (from an API call) that looks something like this:
val=
{input:a,matches:[{in:["w","x","y","z"],output:{num1:0d-2,num2:7.0d-1}},
{in:["w","x"],output:{num1:0d-2,num2:8.0d-1}}]}
I need to do temp=json.loads(val); but the problem is that the string is not a valid JSON. The keys and values do not have the quotes around them. I tried explicitly putting the quotes and that worked.
How can I programmatically add the quotes to such a string before reading it as JSON?
Also, how can I replace the numbers in scientific notation with decimals? E.g. 0d-2 becomes "0" and 8.0d-1 becomes "0.8"?

You could catch anything that's a string with a regex and replace it accordingly.
Assuming the strings that need quotes:
start with a letter
can have numbers at the end
never start with numbers
never have numbers or special characters in between them
This would be a regex code to catch them:
([a-z]*\d*):
You can try it out here. Or learn more about regex here.
Let's do it in python:
import re
# catch a string in json
json_string = ('{input:a,matches:[{in:["w","x","y","z"],output:{num1:0d-2,num2:7.0d-1}},'
               '{in:["w","x"],output:{num1:0d-2,num2:8.0d-1}}]}')  # note the single quotes!
# search the strings according to our rule
string_search = re.search(r'([a-z]*\d*):', json_string)
# extract the first capture group; so everything we matched in brackets
# this is to exclude the colon at the end from the found string as
# we don't want to enquote the colons as well
extracted_strings = string_search.group(1)
This is a solution in case you will build a loop later.
However if you just want to catch all possible strings in python as a list you can do simply the following instead:
import re
# catch ALL strings in json
json_string = ('{input:a,matches:[{in:["w","x","y","z"],output:{num1:0d-2,num2:7.0d-1}},'
               '{in:["w","x"],output:{num1:0d-2,num2:8.0d-1}}]}')  # note the single quotes!
extract_all_strings = re.findall(r'([a-z]*\d*):', json_string)
# note that this by default catches only our capture group in brackets
# so no extra step required
That covers the regex basics of finding everything.
With these basics you could either use re.sub to replace every match with itself in quotes, or first generate a list of replacements to verify that everything went right (probably something you'd want to do, since this approach may be a little unstable).
Note that this is why I wrote a fairly comprehensive answer instead of just pointing you to a "re.sub" one-liner.
You can approach your scientific number notation problem in the same way.
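For illustration, a minimal re.sub sketch of that idea might look like the following. The helper name to_json_text and its three rules are my own assumptions, tuned only to the sample string from the question: keys and bare values are plain words, no quoted string contains a colon, and the literals true, false and null do not occur. It also leaves the numbers as JSON floats instead of turning them into quoted strings.

```python
import json
import re

raw = ('{input:a,matches:[{in:["w","x","y","z"],output:{num1:0d-2,num2:7.0d-1}},'
       '{in:["w","x"],output:{num1:0d-2,num2:8.0d-1}}]}')

def to_json_text(text):
    """Hypothetical helper: patch the sample's pseudo-JSON into valid JSON text."""
    # 1. Rewrite d-style exponents (7.0d-1) as e-style exponents (7.0e-1).
    text = re.sub(r'(\d+(?:\.\d+)?)[dD]([+-]?\d+)', r'\1e\2', text)
    # 2. Quote bare keys, i.e. words directly followed by a colon.
    text = re.sub(r'([A-Za-z_]\w*)\s*:', r'"\1":', text)
    # 3. Quote bare word values, i.e. words directly after a colon.
    text = re.sub(r':\s*([A-Za-z_]\w*)', r':"\1"', text)
    return text

data = json.loads(to_json_text(raw))
print(data["input"], data["matches"][1]["output"])
```

Checking the patched text with json.loads right away is a cheap way to notice when the input drifts outside these assumptions.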

Python regular expression help needed, multiple lines regex

I was trying to scrape a link out of a .eml file, but my search always returns None. I don't even get the link with the confirm brackets; getting the valid link would be no problem once the string is pulled.
One problem that I see is that the string the regex should find spans multiple lines, but the regex itself seems to be valid.
Code/regex I use:
def get_url(raw):
    #get rid of whitespaces
    raw = raw.replace(' ', '')
    #search for the link
    url = re.search('href=3D(.*?)token([^\s]+)\W([^\s]+)\W([^\s]+)\W([^\s]+)\W([^\s]+)', raw).group(1)
    return url

First thing, the .eml is encoded in MIME quoted-printable (the hint is the = signs at the end of the lines). You should decode this first, instead of dealing with the encoded raw text.
Second, regex is overkill. Some nice string.split() usage will work just as well. Regex is extremely useful in its proper usage scenarios, but some simple Python can usually do the same without having to use regex's flavor of magic, which can be confusing as [REDACTED].
Note that if you're building a regex, it's always advisable to use one of the gazillion regex editors, as these will help you build it. My personal favorite is regex101.
EDIT: added regex way to do it.
import quopri
import re

def get_url_by_regex(raw):
    decoded = quopri.decodestring(raw).decode("utf-8")
    return re.search('(<a href=")(.*?)(")', decoded).group(2)

def get_url(raw):
    decoded = quopri.decodestring(raw).decode("utf-8")
    for line in decoded.split('\n'):
        if 'token=' in line:
            return line.split('<a href="')[1].split('"')[0]
    return None # just in case this is needed

print(get_url(raw_email))
print(get_url_by_regex(raw_email))
result is:
https://app.rule.io/subscriber/optIn?token=eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJzd[REST_OF_TOKEN_REDACTED]
https://app.rule.io/subscriber/optIn?token=eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJzd[REST_OF_TOKEN_REDACTED]
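To see why decoding first matters, here is a small self-contained sketch. The raw_email bytes are a made-up quoted-printable fragment (example.com and the token are placeholders, not the real mail): =3D is an encoded = sign, and a = at the end of a line is a soft line break that the decoder removes.

```python
import quopri
import re

# Hypothetical quoted-printable fragment: "=3D" encodes "=", and the "="
# right before the newline is a soft line break inside the token.
raw_email = b'<a href=3D"https://example.com/optIn?token=3Dabc=\n123">Confirm</a>'

# Decode the transfer encoding first, then search the plain text.
decoded = quopri.decodestring(raw_email).decode("utf-8")
match = re.search(r'<a href="(.*?)"', decoded)
url = match.group(1) if match else None
print(url)
```

After decoding, the link sits on one line again, so the multi-line problem from the question goes away.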

Python - Parsing JSON formatted text file with regex

I have a text file formatted like a JSON file however everything is on a single line (could be a MongoDB File). Could someone please point me in the direction of how I could extract values using a Python regex method please?
The text shows up like this:
{"d":{"__type":"WikiFileNodeContent:http:\/\/samplesite.com.‌au\/ns\/business\/wi‌ki","author":null,"d‌escription":null,"fi‌leAssetId":"034b9317‌-60d9-45c2-b6d6-0f24‌b59e1991","filename"‌:"Reports.pdf"},"cre‌atedBy":1531,"create‌dByUsername":"John Cash","icon":"\/Assets10.37.5.0\/pix\/16x16\/page_white_acro‌bat.png","id":3041,"‌inheritedPermissions‌":false,"name":"map"‌,"permissions":[23,8‌7,35,49,65],"type":3‌,"viewLevel":2},{"__‌type":"WikiNode:http‌:\/\/samplesite.com.‌au\/ns\/business\/wi‌ki","children":[],"c‌ontent":
I want to get the "fileAssetId" and "filename" values. I've tried to load the file with Python's json module, but I get an error.
For the fileAssetId I tried this regex:
regex = re.compile(r"([0-9a-f]{8})\S*-\S*([0-9a-f]{4})\S*-\S*([0-9a-f]{4})\S*-\S*([0-9a-f]{4})\S*-\S*([0-9a-f]{12})")
But i get the following 034b9317‌, 60d9, 45c2, b6d6, 0f24‌b59e1991
I'm not too sure how to get the data as it's displayed.

How about using positive lookahead and lookbehind:
(?<=\"fileAssetId\":\")[a-fA-F0-9-]+?(?=\")
captures the fileAssetId and
(?<=\"filename\":\").+?(?=\")
matches the filename.
For a detailed explanation of the regex have a look at the Regex101-Example. (Note: I combined both in the example with an OR-Operator | to show both matches at once)
To get a list of all matches use re.findall or re.finditer instead of re.match.
re.findall(pattern, string) returns a list of matching strings.
re.finditer(pattern, string) returns an iterator of match objects.
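A short sketch of both patterns with re.findall and re.finditer, run against a trimmed-down sample string of my own (it keeps only the two fields of interest from the question's data):

```python
import re

# Trimmed-down sample containing only the two fields of interest.
sample = ('{"d":{"fileAssetId":"034b9317-60d9-45c2-b6d6-0f24b59e1991",'
          '"filename":"Reports.pdf"}}')

# The lookbehind anchors on the key, the lookahead stops at the closing quote,
# so only the value itself is part of the match.
asset_ids = re.findall(r'(?<="fileAssetId":")[a-fA-F0-9-]+?(?=")', sample)
filenames = re.findall(r'(?<="filename":").+?(?=")', sample)

# finditer yields match objects, useful when positions are needed as well.
filename_matches = [m.group(0) for m in re.finditer(r'(?<="filename":").+?(?=")', sample)]

print(asset_ids, filenames)
```

Python's re module only accepts fixed-width lookbehinds, which is fine here because each key is a literal string.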

You can use python's walk method and check each entry with re.match.
If the string you have cannot be converted to a Python dict, you can use just a regex:
print(re.match(r'.*fileAssetId\":\"([^\"]+)\".*', your_string).group(1))
Solution for your example:
import re
example_string = r'{"d":{"__type":"WikiFileNodeContent:http:\/\/samplesite.com.au\/ns\/business\/wiki","author":null,"description":null,"fileAssetId":"034b9317-60d9-45c2-b6d6-0f24b59e1991","filename":"Reports.pdf"},"createdBy":1531,"createdByUsername":"John Cash","icon":"\/Assets10.37.5.0\/pix\/16x16\/page_white_acrobat.png","id":3041,"inheritedPermissions":false,"name":"map","permissions":[23,87,35,49,65],"type":3,"viewLevel":2},{"__type":"WikiNode:http:\/\/samplesite.com.au\/ns\/business\/wiki","children":[],"content"'
regex_pattern = r'.*fileAssetId\":\"([^\"]+)\".*'
match = re.match(regex_pattern, example_string)
fileAssetId = match.group(1)
print('fileAssetId: {}'.format(fileAssetId))
Executing this yields:
fileAssetId: 034b9317-60d9-45c2-b6d6-0f24b59e1991

Try adding \n to the string that you are entering in to the file (\n means new line)

Based on the idea given here https://stackoverflow.com/a/3845829 and by following the JSON standard https://www.json.org/json-en.html, we can use Python + regex https://pypi.org/project/regex/ and do the following:
import regex  # third-party module from https://pypi.org/project/regex/

json_pattern = (
    r'(?(DEFINE)'
    r'(?P<whitespace>( |\n|\r|\t)*)'
    r'(?P<boolean>true|false)'
    r'(?P<number>-?(0|([1-9]\d*))(\.\d*[1-9])?([eE][+-]?\d+)?)'
    r'(?P<string>"([^"\\]|\\("|\\|/|b|f|n|r|t|u[0-9a-fA-F]{4}))*")'
    r'(?P<array>\[((?&whitespace)|(?&value)(,(?&value))*)\])'
    r'(?P<key>(?&whitespace)(?&string)(?&whitespace))'
    r'(?P<value>(?&whitespace)((?&boolean)|(?&number)|(?&string)|(?&array)|(?&object)|null)(?&whitespace))'
    r'(?P<object>\{((?&whitespace)|(?&key):(?&value)(,(?&key):(?&value))*)\})'
    r'(?P<document>(?&object)|(?&array))'
    r')'
    r'(?&document)'
)
json_regex = regex.compile(json_pattern)
match = json_regex.match(json_document_text)
You can change the last line of json_pattern to match individual objects instead of a whole document by replacing (?&document) with (?&object). I think the regex is easier than I expected, but I did not run extensive tests on this. It works fine for me and I have tested hundreds of files. I will try to improve my answer in case I find any issue when running it.

python regular expression not matching file contents with re.match and re.MULTILINE flag

I'm reading in a file and storing its contents as a multiline string. Then I loop through some values I get from a Django query to run regexes based on the query result values. My regex seems like it should be working, and it works if I copy the values returned by the query, but for some reason it isn't matching when all the parts are working together.
My code is:
with open("/path_to_my_file") as myfile:
    data=myfile.read()
#read saved settings then write/overwrite them into the config
items = MyModel.objects.filter(some_id="s100009")
for item in items:
    regexString = "^\s*"+item.feature_key+":"
    print regexString #to verify its what I want it to be, ie debug
    pq = re.compile(regexString, re.M)
    if pq.match(data):
        #do stuff
So basically my problem is that the regex isn't matching. When I copy the file contents into a big old string, and copy the value(s) printed by the print regexString line, it does match, so I'm thinking there's some esoteric Python/Django thing going on (or maybe not so esoteric, as Python isn't my first language).
And for examples sake, the output of print regexString is :
^\s*productDetailOn:
File contents:
productDetailOn:true,
allOff:false,
trendingWidgetOn:true,
trendingWallOn:true,
searchResultOn:false,
bannersOn:true,
homeWidgetOn:true,
}
Running Python 2.7. Also, dumped the types of both item.feature and data, and both were unicode. Not sure if that matters? Anyway, I'm starting to hit my head off the desk after working this for a couple hours, so any help is appreciated. Cheers!

According to the documentation, re.match only ever matches at the beginning of the string, never at the beginning of each line:
Note that even in MULTILINE mode, re.match() will only match at the beginning of the string and not at the beginning of each line.
You need to use re.search:
regexString = r"^\s*"+item.feature_key+":"
pq = re.compile(regexString, re.M)
if pq.search(data):
A small note on the raw string (r"^\s*"): in this case, it is equivalent to "^\s*" because \s is not a recognized string escape sequence (unlike \r or \n), so Python keeps the backslash as it is. Still, it is safer to always declare regex patterns with raw string literals in Python (and with corresponding notations in other languages, too); recent Python versions also warn about unrecognized escapes such as "\s" in ordinary string literals.
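The difference is easy to reproduce without Django. In this sketch the data string and the feature_key value are stand-ins for the file contents and item.feature_key from the question; re.escape is an extra precaution I added in case a key ever contains regex metacharacters.

```python
import re

# Stand-in for the file contents; the key of interest is not on the first line.
data = "{\n  productDetailOn:true,\n  allOff:false,\n}"
feature_key = "allOff"  # stand-in for item.feature_key

pattern = re.compile(r"^\s*" + re.escape(feature_key) + ":", re.M)

# match() only tries position 0 of the whole string, even with re.M.
matched = pattern.match(data)
# search() scans forward, and with re.M the ^ may anchor at any line start.
found = pattern.search(data)

print(matched)
print(found.group(0) if found else None)
```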

Search a delimited string in a file - Python

I have the following read.json file
{:{"JOL":"EuXaqHIbfEDyvph%2BMHPdCOJWMDPD%2BGG2xf0u0mP9Vb4YMFr6v5TJzWlSqq6VL0hXy07VDkWHHcq3At0SKVUrRA7shgTvmKVbjhEazRqHpvs%3D-%1E2D%TL/xs23EWsc40fWD.tr","LAPTOP":"error"}
and this Python script:
import re
shakes = open("read.json", "r")
needed = open("needed.txt", "w")
for text in shakes:
    if re.search('JOL":"(.+?).tr', text):
        print >> needed, text,
I want it to find what's between two words (JOL":" and .tr) and then print it. But all it does is print all the text in "read.json".

You're calling re.search, but you're not doing anything with the returned match, except to check that there is one. Instead, you're just printing out the original text. So of course you get the whole line.
The solution is simple: just store the result of re.search in a variable, so you can use it. For example:
for text in shakes:
    match = re.search('JOL":"(.+?).tr', text)
    if match:
        print >> needed, match.group(1)
In your example, the match is JOL":"EuXaqHIbfEDyvph%2BMHPdCOJWMDPD%2BGG2xf0u0mP9Vb4YMFr6v5TJzWlSqq6VL0hXy07VDkWHHcq3At0SKVUrRA7shgTvmKVbjhEazRqHpvs%3D-%1E2D%TL/xs23EWsc40fWD.tr, and the first (and only) group in it is EuXaqHIbfEDyvph%2BMHPdCOJWMDPD%2BGG2xf0u0mP9Vb4YMFr6v5TJzWlSqq6VL0hXy07VDkWHHcq3At0SKVUrRA7shgTvmKVbjhEazRqHpvs%3D-%1E2D%TL/xs23EWsc40fWD, which is (I think) what you're looking for.
However, a couple of side notes:
First, . is a special pattern in a regex, so you're actually matching anything up to any character followed by tr, not .tr. For that, escape the . with a \. (And, once you start putting backslashes into a regex, use a raw string literal.) So: r'JOL":"(.+?)\.tr'.
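A tiny sketch of the difference, using a made-up value in which the letters tr also appear before the real .tr suffix:

```python
import re

# Made-up value: "Xtr" appears before the real ".tr" ending.
text = 'JOL":"abcXtr.tr"'

# Unescaped dot: "." matches any character, so the match ends at "Xtr".
loose = re.search(r'JOL":"(.+?).tr', text).group(1)
# Escaped dot: only a literal ".tr" ends the match.
strict = re.search(r'JOL":"(.+?)\.tr', text).group(1)

print(loose)
print(strict)
```

With the sample data from the question both patterns happen to give the same result, which is why the bug is easy to miss.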
Second, this is making a lot of assumptions about the data that probably aren't warranted. What you really want here is not "everything between JOL":" and .tr", it's "the value associated with key 'JOL' in the JSON object". The only problem is that this isn't quite a JSON object, because of that prefixed :. Hopefully you know where you got the data from, and therefore what format it's actually in. For example, if you know it's actually a sequence of colon-prefixed JSON objects, the right way to parse it is:
import json

d = json.loads(text[1:])
if 'JOL' in d:
    print >> needed, d['JOL']
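As a hedged Python 3 sketch of the same idea: the sample line shown in the question starts with {: rather than a single colon, so this version drops everything up to and including the first colon before parsing. That slicing rule is an assumption based only on the sample, and the JOL value is shortened here.

```python
import json

# Shortened version of the sample line, including the odd "{:" prefix.
line = '{:{"JOL":"EuXaqHIb%2BMHPd.tr","LAPTOP":"error"}\n'

# Assumption: everything up to and including the first colon is a prefix,
# and what follows is an ordinary JSON object.
payload = line[line.index(':') + 1:]
d = json.loads(payload)

jol = d.get('JOL')
print(jol)
```

If the real file uses a different prefix, the slice needs adjusting; json.loads raises a ValueError when the remainder is not valid JSON, which makes a wrong guess visible immediately.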
Finally, you don't actually have anything named needed in your code; you opened a file named 'needed.txt', but you called the file object love. If your real code has a similar bug, it's possible that you're overwriting some completely different file over and over, and then looking in needed.txt and seeing nothing changed each time…

If you know that your starting and ending matching strings only appear once, you can ignore that it's JSON. If that's OK, then you can split on the starting characters (JOL":"), take the 2nd element of the split array [1], then split again on the ending characters (.tr) and take the 1st element of the split array [0].
>>> text = '{:{"JOL":"EuXaqHIbfEDyvph%2BMHPdCOJWMDPD%2BGG2xf0u0mP9Vb4YMFr6v5TJzWlSqq6VL0hXy07VDkWHHcq3At0SKVUrRA7shgTvmKVbjhEazRqHpvs%3D-%1E2D%TL/xs23EWsc40fWD.tr","LAPTOP":"error"}'
>>> text.split('JOL":"')[1].split('.tr')[0]
'EuXaqHIbfEDyvph%2BMHPdCOJWMDPD%2BGG2xf0u0mP9Vb4YMFr6v5TJzWlSqq6VL0hXy07VDkWHHcq3At0SKVUrRA7shgTvmKVbjhEazRqHpvs%3D-%1E2D%TL/xs23EWsc40fWD'
