How to clean a string using python regular expression - python

I have the following string, which I need to clean:
import re
addr="abcd&^fhj"
problemchars = re.compile(r'[=\+/&<>;\'"\?%#$#\,\. \t\r\n]')
re.search(problemchars,addr)

In that case, use re.sub to search for \W (any character that is not a letter, digit, or underscore) and replace it with nothing.
import re
addr="abcd&^fhj"
print(re.sub(r"\W", "", addr))
(r"\W+" works too, but I'm not sure it would perform any better.)
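A minimal sketch of the re.sub approach, for comparison. The helper names strip_non_word and strip_non_alnum are made up for this example. One thing worth knowing is that \W leaves underscores in place, because \w includes "_"; an explicit character class removes them as well.

```python
import re

def strip_non_word(text):
    # \W matches any character that is not a letter, digit, or underscore.
    return re.sub(r"\W", "", text)

def strip_non_alnum(text):
    # Explicit ASCII class: underscores are removed too.
    return re.sub(r"[^A-Za-z0-9]", "", text)

print(strip_non_word("abcd&^fhj"))
print(strip_non_alnum("ab_cd&^fhj"))
```

Which of the two fits depends on whether underscores count as "problem characters" in your data.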

You could use the filter function as well if you don't want to go with regex:
line = "abcd&^fhj"
line = filter(str.isalpha, line)
print line # Change for python3
Output :
abcdfhj
Edit: In Python 3, filter returns an iterator rather than a string, so join the result before printing:
print(''.join(line))
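Put together for Python 3, the filter approach might look like the sketch below. The sample string has two digits appended (an assumption, not part of the question) to show the difference between str.isalpha and str.isalnum.

```python
line = "abcd&^fhj42"

# In Python 3, filter() returns a lazy iterator, so join it back into a string.
letters_only = "".join(filter(str.isalpha, line))

# str.isalnum keeps digits as well as letters.
letters_and_digits = "".join(filter(str.isalnum, line))

print(letters_only)
print(letters_and_digits)
```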

Related

split() not producing expected results

I have a problem with Python's split that I can't figure out: the split call below does not work as expected, although I have used similar splits before and they worked fine.
content = open(file).read()
Sep = content.split(r'Document [a-zA-Z0-9]{25}\n')
The file I am reading is something very easy as:
"I like coffee.
Document CLASSAR020181030eeat0000l
I like tea as well.
Document CLASSAR020181030eeat0000l
I like both coffee and tea."
str.split() splits using a fixed delimiter, not a regular expression. You need to use re.split().
import re
sep = re.split(r'Document [a-zA-Z0-9]{25}\n', content)
Error - regular expression syntax on string methods
content is a string, so content.split invokes the string method, which expects a fixed separator string. It does not interpret the separator as a regular expression; it looks for that exact text, finds none, and returns the whole content as a single element.
Solution - Use re module
You can instead use methods within the regular expression module, as you're using regular expression syntax:
import re
with open(file, 'r') as fp:
    content = fp.read()
pattern = re.compile(r'Document \w{25}\n')
separated = pattern.split(content)
The with block is just best practice for opening files in python. It
is a context manager that automatically closes your file when you're
finished. You may run into problems in the future if you don't use
this.
The regular expression I have used is slightly different from yours. It
is close but not identical: \w is shorthand for [a-zA-Z0-9_] (and, for
str patterns in Python 3, other Unicode word characters too), so it also
matches underscores. If that matters, keep your original [a-zA-Z0-9] class.
We are using a split method again. However, this one belongs to the compiled pattern object from the re module, not to the string.
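The difference between the two split calls can be seen in a short sketch. The sample text from the question is inlined as a string here (an assumption made so the example runs without a file).

```python
import re

# Sample text from the question, inlined so the example needs no file.
content = (
    "I like coffee.\n"
    "Document CLASSAR020181030eeat0000l\n"
    "I like tea as well.\n"
    "Document CLASSAR020181030eeat0000l\n"
    "I like both coffee and tea."
)

pattern = r"Document [a-zA-Z0-9]{25}\n"

# str.split searches for the pattern text literally, finds nothing,
# and returns the whole string as a single element.
literal_parts = content.split(pattern)

# re.split interprets the pattern as a regular expression.
regex_parts = re.split(pattern, content)

print(len(literal_parts))
print(regex_parts)
```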

Python - Parsing JSON formatted text file with regex

I have a text file that is formatted like JSON, but everything is on a single line (it could be a MongoDB file). Could someone please point me toward how to extract values from it using a Python regex?
The text looks like this:
{"d":{"__type":"WikiFileNodeContent:http:\/\/samplesite.com.au\/ns\/business\/wiki","author":null,"description":null,"fileAssetId":"034b9317-60d9-45c2-b6d6-0f24b59e1991","filename":"Reports.pdf"},"createdBy":1531,"createdByUsername":"John Cash","icon":"\/Assets10.37.5.0\/pix\/16x16\/page_white_acrobat.png","id":3041,"inheritedPermissions":false,"name":"map","permissions":[23,87,35,49,65],"type":3,"viewLevel":2},{"__type":"WikiNode:http:\/\/samplesite.com.au\/ns\/business\/wiki","children":[],"content":
I want to get the "fileAssetId" and "filename" values. I've tried to load the line with Python's json module, but I get an error.
For the fileAssetId I tried this regex:
regex = re.compile(r"([0-9a-f]{8})\S*-\S*([0-9a-f]{4})\S*-\S*([0-9a-f]{4})\S*-\S*([0-9a-f]{4})\S*-\S*([0-9a-f]{12})")
But I get the following: 034b9317, 60d9, 45c2, b6d6, 0f24b59e1991
I'm not too sure how to get the data as it is displayed.
How about using positive lookahead and lookbehind:
(?<=\"fileAssetId\":\")[a-fA-F0-9-]+?(?=\")
captures the fileAssetId and
(?<=\"filename\":\").+?(?=\")
matches the filename.
For a detailed explanation of the regex have a look at the Regex101-Example. (Note: I combined both in the example with an OR-Operator | to show both matches at once)
To get a list of all matches use re.findall or re.finditer instead of re.match.
re.findall(pattern, string) returns a list of matching strings.
re.finditer(pattern, string) returns an iterator of match objects.
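As a rough sketch, the two look-around patterns can be used with re.findall and re.finditer like this. The text variable is a shortened, hypothetical version of the line from the question. Both look-behinds have a fixed width, which Python's re module requires.

```python
import re

# Shortened, hypothetical version of the single-line text from the question.
text = (
    '{"d":{"author":null,"fileAssetId":"034b9317-60d9-45c2-b6d6-0f24b59e1991",'
    '"filename":"Reports.pdf"},"createdBy":1531}'
)

asset_id_pattern = re.compile(r'(?<="fileAssetId":")[a-fA-F0-9-]+?(?=")')
filename_pattern = re.compile(r'(?<="filename":").+?(?=")')

# findall returns the matched strings directly (the patterns have no groups).
asset_ids = asset_id_pattern.findall(text)

# finditer yields match objects; group(0) is the matched text.
filenames = [m.group(0) for m in filename_pattern.finditer(text)]

print(asset_ids)
print(filenames)
```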
You can use python's walk method and check each entry with re.match.
In case the string you have cannot be converted to a Python dict, you can use a regex on its own:
print(re.match(r'.*fileAssetId\":\"([^\"]+)\".*', your_string).group(1))
Solution for your example:
import re
example_string = r'{"d":{"__type":"WikiFileNodeContent:http:\/\/samplesite.com.au\/ns\/business\/wiki","author":null,"description":null,"fileAssetId":"034b9317-60d9-45c2-b6d6-0f24b59e1991","filename":"Reports.pdf"},"createdBy":1531,"createdByUsername":"John Cash","icon":"\/Assets10.37.5.0\/pix\/16x16\/page_white_acrobat.png","id":3041,"inheritedPermissions":false,"name":"map","permissions":[23,87,35,49,65],"type":3,"viewLevel":2},{"__type":"WikiNode:http:\/\/samplesite.com.au\/ns\/business\/wiki","children":[],"content"'
regex_pattern = r'.*fileAssetId\":\"([^\"]+)\".*'
match = re.match(regex_pattern, example_string)
fileAssetId = match.group(1)
print('fileAssetId: {}'.format(fileAssetId))
Executing this yields:
fileAssetId: 034b9317-60d9-45c2-b6d6-0f24b59e1991
Try adding \n to the string that you are entering in to the file (\n means new line)
Based on the idea given here https://stackoverflow.com/a/3845829 and by following the JSON standard https://www.json.org/json-en.html, we can use Python + regex https://pypi.org/project/regex/ and do the following:
import regex  # third-party module from PyPI, not the built-in re
json_pattern = (
r'(?(DEFINE)'
r'(?P<whitespace>( |\n|\r|\t)*)'
r'(?P<boolean>true|false)'
r'(?P<number>-?(0|([1-9]\d*))(\.\d*[1-9])?([eE][+-]?\d+)?)'
r'(?P<string>"([^"\\]|\\("|\\|/|b|f|n|r|t|u[0-9a-fA-F]{4}))*")'
r'(?P<array>\[((?&whitespace)|(?&value)(,(?&value))*)\])'
r'(?P<key>(?&whitespace)(?&string)(?&whitespace))'
r'(?P<value>(?&whitespace)((?&boolean)|(?&number)|(?&string)|(?&array)|(?&object)|null)(?&whitespace))'
r'(?P<object>\{((?&whitespace)|(?&key):(?&value)(,(?&key):(?&value))*)\})'
r'(?P<document>(?&object)|(?&array))'
r')'
r'(?&document)'
)
json_regex = regex.compile(json_pattern)
match = json_regex.match(json_document_text)
You can change the last line of json_pattern to match individual objects rather than a whole document by replacing (?&document) with (?&object). The regex turned out simpler than I expected. I have not run extensive tests on it, but it has worked on the hundreds of files I tried. I will try to improve my answer if I find any issue when running it.

Simple regex in python

I'm trying to simply get everything after the colon in the following:
hfarnsworth:204b319de6f41bbfdbcb28da724dda23
And then everything before the space in the following:
29ca0a80180e9346295920344d64d1ce ::: 25basement
Here's what I have:
for line in f:
    line = line.rstrip() #to remove \n
    line = re.compile('.* ',line) #everything before space.
    print line
Any tips to point me in the correct direction? Thanks!
Also, is re.compile the correct function to use if I want the matched string returned? I'm pretty new at python too.
Thanks!!
string = "hfarnsworth:204b319de6f41bbfdbcb28da724dda23"
print(string.split(":")[1])
string = "29ca0a80180e9346295920344d64d1ce ::: 25basement"
print(string.split(" ")[0])
First, you should probably take a careful look at the documentation for re.compile. It does not expect its second parameter to be the string to search (the second parameter is flags). Try re.search or re.findall instead. E.g.:
>>> s = "29ca0a80180e9346295920344d64d1ce ::: 25basement"
>>> re.findall(r'(\S*) ', s)[0]
'29ca0a80180e9346295920344d64d1ce'
>>> re.search(r'(\S*) ', s).groups()
('29ca0a80180e9346295920344d64d1ce',)
BTW, this is not the task for regular expressions. Consider using some simple string operations (like split).
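A minimal sketch of the plain string approach, using str.partition. The helper names after_colon and before_space are made up for this example. partition splits on the first occurrence only and never raises, even when the separator is missing.

```python
def after_colon(line):
    # Everything after the first ":" (empty string if there is no colon).
    _, _, rest = line.partition(":")
    return rest

def before_space(line):
    # Everything before the first space (the whole line if there is none).
    head, _, _ = line.partition(" ")
    return head

print(after_colon("hfarnsworth:204b319de6f41bbfdbcb28da724dda23"))
print(before_space("29ca0a80180e9346295920344d64d1ce ::: 25basement"))
```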
This regular expression seems to work:
r"^(?:[^:]*\:)?([^:]*)(?::::.*)?$"

python regular expression replacing part of a matched string

I have a string that might look like this:
"myFunc('element','node','elementVersion','ext',12,0,0)"
I'm currently checking for validity using this pattern, which works fine:
myFunc\((.+?)\,(.+?)\,(.+?)\,(.+?)\,(.+?)\,(.+?)\,(.+?)\)
Now I'd like to replace whatever string is in the 3rd parameter.
Unfortunately, I can't just use a plain string replace on the substring in the 3rd position, since the same substring could appear anywhere else in that string.
With this pattern and re.findall,
myFunc\(.+?\,.+?\,(.+?)\,.+?\,.+?\,.+?\,.+?\)
I was able to get the contents of the substring in the 3rd position, but re.sub does not replace it in place; it just returns the string I want to replace it with :/
Here's my code:
myRe = re.compile(r"myFunc\(.+?\,.+?\,(.+?)\,.+?\,.+?\,.+?\,.+?\)")
val = "myFunc('element','node','elementVersion','ext',12,0,0)"
print myRe.findall(val)
print myRe.sub("noVersion",val)
any idea what i've missed ?
thanks!
Seb
In re.sub, you need to specify a substitution for the whole matching string. That means that you need to repeat the parts that you don't want to replace. This works:
myRe = re.compile(r"(myFunc\(.+?\,.+?\,)(.+?)(\,.+?\,.+?\,.+?\,.+?\))")
print(myRe.sub(r'\1"noversion"\3', val))
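The same idea can be sketched with a replacement function instead of a backreference template. The function name replace_third is made up for this example. A function sidesteps any ambiguity when the replacement text starts with a digit directly after a group reference.

```python
import re

my_re = re.compile(r"(myFunc\(.+?,.+?,)(.+?)(,.+?,.+?,.+?,.+?\))")
val = "myFunc('element','node','elementVersion','ext',12,0,0)"

def replace_third(match):
    # Keep groups 1 and 3 unchanged; swap only the middle group.
    return match.group(1) + "'noVersion'" + match.group(3)

result = my_re.sub(replace_third, val)
print(result)
```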
If your only tool is a hammer, all problems look like nails. A regular expression is a powerful hammer, but it is not the best tool for every task.
Some tasks are better handled by a parser. In this case the argument list in the string looks just like a Python tuple, so you can cheat and use Python's built-in parser:
>>> strdata = "myFunc('element','node','elementVersion','ext',12,0,0)"
>>> args = re.search(r'\(([^\)]+)\)', strdata).group(1)
>>> eval(args)
('element', 'node', 'elementVersion', 'ext', 12, 0, 0)
If you can't trust the input, ast.literal_eval is safer than eval for this. Once you have deconstructed the argument list in the string, I think you can figure out how to manipulate and reassemble it, if needed.
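One possible way to do the whole round trip with ast.literal_eval is sketched below. It assumes every argument is a plain Python literal and that rebuilding with repr gives an acceptable quoting style; neither is stated in the question.

```python
import ast
import re

strdata = "myFunc('element','node','elementVersion','ext',12,0,0)"

# Grab the text between the parentheses.
arg_text = re.search(r"\(([^)]+)\)", strdata).group(1)

# literal_eval only accepts Python literals, so it cannot run arbitrary code.
args = list(ast.literal_eval(arg_text))

# Replace the 3rd argument and reassemble the call text.
args[2] = "noVersion"
rebuilt = "myFunc(" + ",".join(repr(a) for a in args) + ")"
print(rebuilt)
```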
Read the documentation: re.sub returns a copy of the string where every occurrence of the entire pattern is replaced with the replacement. It cannot in any case modify the original string, because Python strings are immutable.
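Try using look-ahead and look-behind assertions to construct a regex that only matches the element itself. Note that Python's built-in re module only accepts fixed-width look-behind, so it rejects the variable-width look-behind below; the third-party regex module supports variable-length look-behind: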
myRe = re.compile(r"(?<=myFunc\(.+?\,.+?\,)(.+?)(?=\,.+?\,.+?\,.+?\,.+?\))")
Have you tried using named groups? http://docs.python.org/howto/regex.html#search-and-replace
Hopefully that will let you just target the 3rd match.
If you want to do this without using regex:
>>> s = "myFunc('element','node','elementVersion','ext',12,0,0)"
>>> l = s.split(",")
>>> l[2]="'noVersion'"
>>> s = ",".join(l)
>>> s
"myFunc('element','node','noVersion','ext',12,0,0)"

Python regex confused by brackets ([])? [duplicate]

Is python confused, or is the programmer?
I've got a lot of lines of this:
some_dict[0x2a] = blah
some_dict[0xab] = blah, blah
What I'd like to do is to convert the hex codes into all uppercase to look like this:
some_dict[0x2A] = blah
some_dict[0xAB] = blah, blah
So I decided to call in the regular expressions. Normally, I'd just do this using my editor's regexps (xemacs), but the need to convert to uppercase pushes one into Lisp. ....ok... how about Python?
So I whip together a short script which doesn't work. I've condensed the code into this example, which doesn't work either. It looks to me like Python's regexps are getting confused by the brackets in the code. Is it me or Python?
import fileinput
import sys
import re
this = "0x2a"
that = "[0x2b]"
for line in [this, that]:
    found = re.match("0x([0-9,a-f]{2})", line)
    if found:
        print("Found: %s" % found.group(0))
(I'm using the () grouping constructs so I don't capitalize the 'x' in '0x'.)
This example only prints the 0x2a value, not the 0x2b. Is this correct behavior?
I can easily work around this by changing the match expression to:
found = re.match("\[0x([0-9,a-f]{2}\])", line)
but I'm just wondering if someone can give me some insight into what's going on here.
Running Python 2.6.2 on Linux.
re.match matches from the start of the string. Use re.search instead to "match the first occurrence anywhere in the string". The key bit about this in the docs is here.
I don't think you need the comma within the brackets. i.e.:
found = re.match("0x([0-9,a-f]{2})", line)
tells python to look for commas which it might be mistakenly matching. I think you want
found = re.match("0x([0-9a-f]{2})", line)
You're using a partial pattern, so you can't use re.match, which only matches at the beginning of the input string. You need to use re.search, which can match anywhere in the string.
>>> that = "[0x2b]"
>>> m = re.search("0x([0-9,a-f]{2})", that)
>>> m.group()
'0x2b'
You'll want to change
found = re.match("0x([0-9,a-f]{2})", line)
to
found = re.search("0x([0-9,a-f]{2})", line)
re.match will match only from the beginning of the string, which fails in the "[0x2b]" case.
re.search will match anywhere in the string, and thus ignore the leading "[" in the "[0x2b]" case.
See search() vs. match() for details.
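The difference is easy to check side by side. This sketch uses the two test strings from the question and the pattern without the stray comma.

```python
import re

pattern = re.compile(r"0x([0-9a-f]{2})")
lines = ["0x2a", "[0x2b]"]

# match() only succeeds when the pattern matches at position 0.
matched = [line for line in lines if pattern.match(line)]

# search() scans forward and finds the first occurrence anywhere.
searched = [pattern.search(line).group(0) for line in lines]

print(matched)
print(searched)
```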
You want to use re.search. This explains why.
If you use re.sub and pass a callable as the replacement, it will also do the uppercasing for you:
>>> that = 'some_dict[0x2a] = blah'
>>> m = re.sub("0x([0-9,a-f]{2})", lambda x: "0x"+x.group(1).upper(), that)
>>> m
'some_dict[0x2A] = blah'
