backward search in python? - python

I have a list with lines like:
=cat-egory/packagename-version
so I have to split it up into 3 different variables, like
category = cat-egory
package_name = packagename
package_version = version
I have to avoid the = and / characters.
I am fond of Perl, so I used to write a regexp like:
(?<==)\w+.\w+
which would give me cat-egory without the leading = character, and so on. But as far as I know, ?<= does not work in Python. How should I extract the data then?

It seems to be working well. See: https://regex101.com/r/nnMRKd/2

It seems to work fine; maybe you are just missing the basic Python framework for capturing:
import re
text = "=cat-egory/packagename-version"
results = re.search(r"(?<==)\w+.\w+", text)
if results:
    print(results.group(0))
Output:
cat-egory
Make sure to use .search instead of .match, as suggested in a comment. In Python, .group is how you reference what you have captured, instead of $1 in Perl. Nothing too fancy here :)
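The difference between the two calls matters for this particular pattern. A minimal sketch, reusing the sample line from the question (the variable names are only illustrative): re.match tries the pattern at index 0 only, where no = precedes the position, so the lookbehind cannot succeed, while re.search scans forward and finds the match right after the =.

```python
import re

text = "=cat-egory/packagename-version"
pattern = r"(?<==)\w+.\w+"

# re.match anchors at index 0; nothing precedes index 0, so the
# lookbehind (?<==) cannot succeed there and the result is None.
anchored = re.match(pattern, text)

# re.search tries every position and succeeds right after the "=".
found = re.search(pattern, text)

print(anchored)
print(found.group(0))
```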

You could even go one step further and use tuple unpacking:
import re
string = "=cat-egory/packagename-version"
rx = re.compile(r'(?<==)([^/]+)/([^-]+)-(.+)')
for match in rx.finditer(string):
    category, package_name, version = match.groups()
    print(category)
    # cat-egory
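Since the capture groups already leave out the = and the /, the same split can also be written without a lookbehind. This is only a sketch, assuming the category contains no /, the package name contains no hyphen, and every line starts with =; split_atom, the group names and the second sample line are hypothetical and not part of the answers above.

```python
import re

# Literal "=" first, then three named groups separated by "/" and "-".
ATOM_RE = re.compile(r"=(?P<category>[^/]+)/(?P<name>[^-]+)-(?P<version>.+)")


def split_atom(line):
    """Return (category, package_name, version), or None if the line does not fit."""
    match = ATOM_RE.fullmatch(line.strip())
    if match is None:
        return None
    return match.group("category"), match.group("name"), match.group("version")


lines = ["=cat-egory/packagename-version", "=dev-lang/python-3.11"]
for line in lines:
    print(split_atom(line))
```

Returning None for lines that do not fit keeps the caller in control of how malformed input is handled.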

Related

How do I remove everything from a string except what I want?

Okay, so basically I want the user to be able to input something, like "quote python syntax and semantics", remove the word 'quote' and anything else (for example, the command could be, 'could you quote for me Python syntax and semantics') then format it in a way that I can pass it to the Wikipedia article URL (in this case 'https://en.wikipedia.org/wiki/Python_syntax_and_semantics'), request it and scrape the element(s) I want.
Any answer would be greatly appreciated.
Here's a simple example of doing this:
import re
msg = input() # Here give as input "quote python syntax and semantics"
repMsg = re.sub("quote", "", msg).strip() # Erase "quote" and space at the start
repMsg = re.sub(" ", "_", repMsg) # Replace spaces with _
print(repMsg) # prints "python_syntax_and_semantics"
Python's re module is very handy for this sort of thing. Note that you'll probably need to fine-tune your code, e.g. decide when to replace only the first occurrence versus all occurrences, at which point to strip whitespace, etc.
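To cover the longer command from the question as well, one option is to keep only the text that follows the word "quote" and build the URL from that. This is a rough sketch under several assumptions: the optional filler after "quote" is exactly "for me", the article title is the remaining text with its first letter capitalized, and wikipedia_url is a hypothetical helper name.

```python
import re
from urllib.parse import quote as url_quote


def wikipedia_url(command):
    """Build a Wikipedia article URL from a command containing the word 'quote'."""
    # Keep the text after the first standalone "quote" (and an optional "for me").
    match = re.search(r"\bquote\b(?:\s+for\s+me)?\s+(.+)", command, flags=re.IGNORECASE)
    if match is None:
        return None
    topic = match.group(1).strip()
    if not topic:
        return None
    # Capitalize the first letter and replace runs of whitespace with "_".
    title = re.sub(r"\s+", "_", topic[0].upper() + topic[1:])
    return "https://en.wikipedia.org/wiki/" + url_quote(title)


print(wikipedia_url("quote python syntax and semantics"))
print(wikipedia_url("could you quote for me Python syntax and semantics"))
```

Whether the resulting article actually exists still has to be checked when the page is requested.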

Python - Parsing JSON formatted text file with regex

I have a text file formatted like a JSON file however everything is on a single line (could be a MongoDB File). Could someone please point me in the direction of how I could extract values using a Python regex method please?
The text shows up like this:
{"d":{"__type":"WikiFileNodeContent:http:\/\/samplesite.com.au\/ns\/business\/wiki","author":null,"description":null,"fileAssetId":"034b9317-60d9-45c2-b6d6-0f24b59e1991","filename":"Reports.pdf"},"createdBy":1531,"createdByUsername":"John Cash","icon":"\/Assets10.37.5.0\/pix\/16x16\/page_white_acrobat.png","id":3041,"inheritedPermissions":false,"name":"map","permissions":[23,87,35,49,65],"type":3,"viewLevel":2},{"__type":"WikiNode:http:\/\/samplesite.com.au\/ns\/business\/wiki","children":[],"content":
I want to get the "fileAssetId" and "filename" values. I've tried to load the line with Python's json module, but I get an error.
For the fileAssetId I tried this regex:
regex = re.compile(r"([0-9a-f]{8})\S*-\S*([0-9a-f]{4})\S*-\S*([0-9a-f]{4})\S*-\S*([0-9a-f]{4})\S*-\S*([0-9a-f]{12})")
But I get the following: 034b9317, 60d9, 45c2, b6d6, 0f24b59e1991
I'm not too sure how to get the data as it's displayed.
How about using positive lookahead and lookbehind:
(?<=\"fileAssetId\":\")[a-fA-F0-9-]+?(?=\")
captures the fileAssetId and
(?<=\"filename\":\").+?(?=\")
matches the filename.
For a detailed explanation of the regex have a look at the Regex101-Example. (Note: I combined both in the example with an OR-Operator | to show both matches at once)
To get all matches, use re.findall or re.finditer instead of re.match.
re.findall(pattern, string) returns a list of the matching strings.
re.finditer(pattern, string) returns an iterator of match objects.
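A short sketch of how the two lookaround patterns behave with re.findall and re.finditer. The sample text is a shortened, hypothetical version of the line from the question, and the variable names are only illustrative.

```python
import re

text = '{"d":{"fileAssetId":"034b9317-60d9-45c2-b6d6-0f24b59e1991","filename":"Reports.pdf"}}'

id_pattern = r'(?<="fileAssetId":")[a-fA-F0-9-]+?(?=")'
name_pattern = r'(?<="filename":").+?(?=")'

# findall returns the matched text directly, because the patterns
# contain no capture groups (lookarounds do not capture).
ids = re.findall(id_pattern, text)
names = re.findall(name_pattern, text)
print(ids, names)

# finditer yields match objects, which also expose the match position.
for match in re.finditer(id_pattern, text):
    print(match.group(0), match.span())
```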
You can use python's walk method and check each entry with re.match.
In case the string you have cannot be converted to a Python dict, you can use just a regex:
print(re.match(r'.*fileAssetId\":\"([^\"]+)\".*', your_string).group(1))
Solution for your example:
import re
example_string = r'{"d":{"__type":"WikiFileNodeContent:http:\/\/samplesite.com.au\/ns\/business\/wiki","author":null,"description":null,"fileAssetId":"034b9317-60d9-45c2-b6d6-0f24b59e1991","filename":"Reports.pdf"},"createdBy":1531,"createdByUsername":"John Cash","icon":"\/Assets10.37.5.0\/pix\/16x16\/page_white_acrobat.png","id":3041,"inheritedPermissions":false,"name":"map","permissions":[23,87,35,49,65],"type":3,"viewLevel":2},{"__type":"WikiNode:http:\/\/samplesite.com.au\/ns\/business\/wiki","children":[],"content"'
regex_pattern = r'.*fileAssetId\":\"([^\"]+)\".*'
match = re.match(regex_pattern, example_string)
fileAssetId = match.group(1)
print('fileAssetId: {}'.format(fileAssetId))
Executing this yields:
fileAssetId: 034b9317-60d9-45c2-b6d6-0f24b59e1991
Try adding \n to the string that you are writing into the file (\n means new line).
Based on the idea given here https://stackoverflow.com/a/3845829 and by following the JSON standard https://www.json.org/json-en.html, we can use Python + regex https://pypi.org/project/regex/ and do the following:
import regex  # third-party module: https://pypi.org/project/regex/
json_pattern = (
    r'(?(DEFINE)'
    r'(?P<whitespace>( |\n|\r|\t)*)'
    r'(?P<boolean>true|false)'
    r'(?P<number>-?(0|([1-9]\d*))(\.\d+)?([eE][+-]?\d+)?)'
    r'(?P<string>"([^"\\]|\\("|\\|/|b|f|n|r|t|u[0-9a-fA-F]{4}))*")'
    r'(?P<array>\[((?&whitespace)|(?&value)(,(?&value))*)\])'
    r'(?P<key>(?&whitespace)(?&string)(?&whitespace))'
    r'(?P<value>(?&whitespace)((?&boolean)|(?&number)|(?&string)|(?&array)|(?&object)|null)(?&whitespace))'
    r'(?P<object>\{((?&whitespace)|(?&key):(?&value)(,(?&key):(?&value))*)\})'
    r'(?P<document>(?&object)|(?&array))'
    r')'
    r'(?&document)'
)
json_regex = regex.compile(json_pattern)
match = json_regex.match(json_document_text)  # json_document_text holds the text to check
You can change the last line of json_pattern to match individual objects instead of a whole document by replacing (?&document) with (?&object). I think the regex is simpler than I expected, but I did not run extensive tests on it. It works fine for me, and I have tested it on hundreds of files. I will try to improve my answer if I find any issue when running it.

How to clean a string using python regular expression

I have the following string, which I have to clean:
import re
addr="abcd&^fhj"
problemchars = re.compile(r'[=\+/&<>;\'"\?%#$#\,\. \t\r\n]')
re.search(problemchars,addr)
In that case, use re.sub, searching for \W (any non-word character, i.e. anything other than letters, digits and the underscore) and replacing it with nothing.
import re
addr="abcd&^fhj"
print(re.sub(r"\W","",addr))
(r"\W+" works too, but I'm not sure it would be more performant)
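Note that \W treats digits and the underscore as word characters, so they survive the substitution. A small sketch of the difference; the sample string with the extra _42 is made up for illustration, and the negated character class is one possible alternative when only letters should remain.

```python
import re

addr = "abcd&^fhj_42"

# \W removes everything except letters, digits and the underscore.
word_chars_only = re.sub(r"\W", "", addr)

# A negated character class keeps ASCII letters only.
letters_only = re.sub(r"[^A-Za-z]", "", addr)

print(word_chars_only)
print(letters_only)
```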
You could also use the filter function if you don't want to go with a regex:
line = "abcd&^fhj"
line = filter(str.isalpha, line)
print line  # Python 2, where filter returns a str for str input
Output:
abcdfhj
Edit: In Python 3, filter returns an iterator, so join the result before printing:
print(''.join(line))

python findall, group and pipe

x = "type='text'"
re.findall("([A-Za-z]+)='(.*?)'", x) # this will work like a charm and produce
# [('type', 'text')]
However, my problem is that I'd like to implement a pipe (alternation) so that the same regex will apply to
x = 'type="text"' # see the quotes
Basically, the following regex should work but with findall it results in something strange:
([A-Za-z]+)=('(.*?)'|"(.*?)")
And I can't use ['"] instead of a pipe because it may end with bad results:
value="hey there what's up?"
Now, how can I build such a regex that would apply to either single or double quotes? By the way, please do not suggest any html or xml parsers as I'm not interested in them.
shlex would do a better job here, but if you insist on re, use ([A-Za-z]+)=(?P<quote>['"])(.+?)(?P=quote)
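A sketch of how the named backreference handles both quote styles, including the value with an apostrophe from the question. parse_attribute is a hypothetical helper; group 2 is the quote character itself, so only groups 1 and 3 are returned.

```python
import re

# (?P<quote>['"]) remembers which quote opened the value and
# (?P=quote) requires the same character to close it.
ATTR_RE = re.compile(r"""([A-Za-z]+)=(?P<quote>['"])(.+?)(?P=quote)""")


def parse_attribute(text):
    """Return (name, value) for the first name=quoted-value pair, or None."""
    match = ATTR_RE.search(text)
    if match is None:
        return None
    return match.group(1), match.group(3)


for sample in ["type='text'", 'type="text"', 'value="hey there what\'s up?"']:
    print(parse_attribute(sample))
```

Because the closing quote must equal the opening one, the apostrophe inside the double-quoted value does not end the match early.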
The problem is that in ([A-Za-z]+)=('(.*?)'|"(.*?)") you have four groups and you need only two (this is probably why you found the results strange). If you use ([A-Za-z]+)=('.*?'|".*?") then it should be all right. Remember that you can exclude a group from capturing with (?:...), so this would be equivalent: ([A-Za-z]+)=('(?:.*?)'|"(?:.*?)").
EDIT: I've just realised that this solution would include the surrounding quotes, which you don't want. You can easily strip them off, though. You could also use a backreference, but then you would have one extra group, which should be removed at the end, for example:
import re
from operator import itemgetter
x = "type='text' TYPE=\"TEXT\""
print(list(map(itemgetter(0, 2), re.findall("([A-Za-z]+)=(['\"])(.*?)\\2", x))))
gives [('type', 'text'), ('TYPE', 'TEXT')].

Splitting an expression

I have to split a string into a list of substrings such that every parenthesized sub-expression is split out.
Let's say I have (9+2-(3*(4+2))); then I should get (4+2), (3*6) and (9+2-18).
The basic objective is to learn which of the inner parentheses is going to be evaluated first and then evaluate it.
Please help.
It would be helpful if you could suggest a method using the re module. Just so this is clear to everyone: it is not homework, and I understand Polish notation. What I am looking for is to use the power of Python and the re module to do it in fewer lines of code.
Thanks a lot.
eval is insecure, so you have to check the input string for dangerous content.
>>> import re
>>> e = "(9+2-(3*(4+2)))"
>>> while '(' in e:
...     inner = re.search(r'(\([^()]+\))', e).group(1)
...     e = re.sub(re.escape(inner), eval('str' + inner), e)
...     print(inner, end=' ')
...
(4+2) (3*6) (9+2-18)
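One way to act on the warning about eval is to evaluate each innermost group with a small whitelist-based evaluator built on the ast module instead. This is a sketch, not a hardened implementation: safe_eval and evaluation_order are hypothetical names, only +, -, * and / on plain numbers are accepted, and the input is assumed to have balanced, non-empty parentheses.

```python
import ast
import operator
import re

# Whitelist of binary operators the evaluator accepts (an assumption
# for this sketch; extend it if more operators are needed).
_BIN_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def safe_eval(expression):
    """Evaluate +, -, *, / arithmetic on numbers without calling eval()."""
    def walk(node):
        if isinstance(node, ast.Constant) and type(node.value) in (int, float):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
            return _BIN_OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        raise ValueError("unsupported expression: %r" % expression)
    return walk(ast.parse(expression, mode="eval").body)


def evaluation_order(expression):
    """Return (innermost-first list of parenthesized groups, final value as text)."""
    steps = []
    while "(" in expression:
        # The innermost group is the first one with no nested parentheses.
        inner = re.search(r"\([^()]+\)", expression).group(0)
        steps.append(inner)
        expression = expression.replace(inner, str(safe_eval(inner)), 1)
    return steps, expression


print(evaluation_order("(9+2-(3*(4+2)))"))
```

Anything outside the whitelist, such as a function call or an attribute access, raises ValueError instead of being executed.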
Try something like this:
import re
a = "(9+2-(3*(4+2)))"
s, r = a, re.compile(r'\([^(]*?\)')
while '(' in s:
    g = r.search(s).group(0)
    s = r.sub(str(eval(g)), s)
    print(g)
print(s)
This sounds very homeworkish so I am going to reply with some good reading that might lead you down the right path. Take a peek at http://en.wikipedia.org/wiki/Polish_notation. It's not exactly what you want but understanding will lead you pretty close to the answer.
I don't know exactly what you want to do, but if you want to add other operations and have more control over the expression, I suggest you use a parser:
http://www.dabeaz.com/ply/ <-- ply, for example
