Simple regex in python - python

I'm trying to simply get everything after the colon in the following:
hfarnsworth:204b319de6f41bbfdbcb28da724dda23
And then everything before the space in the following:
29ca0a80180e9346295920344d64d1ce ::: 25basement
Here's what I have:
for line in f:
line = line.rstrip() #to remove \n
line = re.compile('.* ',line) #everything before space.
print line
Any tips to point me in the corrent direction? Thanks!
Also, is re.compile the correct function to use if I want the matched string returned? I'm pretty new at python too.
Thanks!!

string = "hfarnsworth:204b319de6f41bbfdbcb28da724dda23"
print(string.split(":")[1:])
string = "29ca0a80180e9346295920344d64d1ce ::: 25basement"
print(string.split(" ")[0])

At first you should probably take a careful look at the doc for re.compile. It doesn't expect for second parameter to be a string to lookup. Try to use re.search or re.findall. E.g.:
>>> s = "29ca0a80180e9346295920344d64d1ce ::: 25basement"
>>> re.findall('(\S*) ', s)[0]
'29ca0a80180e9346295920344d64d1ce'
>>> re.search('(\S*) ', s).groups()
('29ca0a80180e9346295920344d64d1ce',)
BTW, this is not the task for regular expressions. Consider using some simple string operations (like split).

this regular expression seems to work
r"^(?:[^:]*\:)?([^:]*)(?::::.*)?$"

Related

How to clean a string using python regular expression

I have the following string which have to clean
#import re
addr="abcd&^fhj"
problemchars = re.compile(r'[=\+/&<>;\'"\?%#$#\,\. \t\r\n]')
re.search(problemchars,addr)
In that case use re.sub searching \W (non-alphanum) and replacing by nothing.
import re
addr="abcd&^fhj"
print(re.sub("\W","",addr))
("\W+" works too, but not sure it would be more performant)
you could use the filter function as well if you don't want to go with regex
line = "abcd&^fhj"
line = filter(str.isalpha, line)
print line # Change for python3
Output :
abcdfhj
Edit: For python 3 you could change the print statement like this since the filter function returns an iterable.
print(''.join(list(line)))

python regex and replace

I am trying to learn python and regex at the same time and I am having some trouble in finding how to match till end of string and make a replacement on the fly.
So, I have a string like so:
ss="this_is_my_awesome_string/mysuperid=687y98jhAlsji"
What I'd want is to first find 687y98jhAlsji (I do not know this content before hand) and then replace it to myreplacedstuff like so:
ss="this_is_my_awesome_string/mysuperid=myreplacedstuff"
Ideally, I'd want to do a regex and replace by first finding the contents after mysuperid= (till the end of string) and then perform a .replace or .sub if this makes sense.
I would appreciate any guidance on this.
You can try this:
re.sub(r'[^=]+$', 'myreplacedstuff', ss)
The idea is to use a character class that exclude the delimiter (here =) and to anchor the pattern with $
explanation:
[^=] is a character class and means all characters that are not =
[^=]+ one or more characters from this class
$ end of the string
Since the regex engine works from the left to the right, only characters that are not an = at the end of the string are matched.
You can use regular expressions:
>>> import re
>>> mymatch = re.search(r'mysuperid=(.*)', ss)
>>> ss.replace(mymatch.group(1), 'replacing_stuff')
'this_is_my_awesome_string/mysuperid=replacing_stuff'
You should probably use #Casimir's answer though. It looks cleaner, and I'm not that good at regex :p.

trying to search and replace "[" opening square bracket

I'm new to python, so please forgive me if this is a stupid question:
I have an XML file which I'm carrying out a number of search and replaces. One of the things I need to replace is a space followed by [ with just [ (no space before). I've tried the following but keep getting an error:
line = re.sub(' [','[',line)
and I assume its because the square brackets is used for wildcards, but I don't know what the syntax should be in order to get this to work
Any help appreciated
Thanks
No need for regex here at all. This will do fine:
line = line.replace(' [', '[')
You need to escape the [ with a backslash:
line = re.sub(r' \[','[',line)
>>> line=' [stuff]'
>>> re.sub(' \[','[',line)
'[stuff]'
Since you are using sub from the regular expression package, you need to escape the bracket since it is used to express character ranges in regular expressions (e.g. [a-z]):
line = re.sub(' \[','[',line)

Python split by regular expression

In Python, I am extracting emails from a string like so:
split = re.split(" ", string)
emails = []
pattern = re.compile("^[a-zA-Z0-9_\.-]+#[a-zA-Z0-9-]+.[a-zA-Z0-9-\.]+$");
for bit in split:
result = pattern.match(bit)
if(result != None):
emails.append(bit)
And this works, as long as there is a space in between the emails. But this might not always be the case. For example:
Hello, foo#foo.com
would return:
foo#foo.com
but, take the following string:
I know my best friend mailto:foo#foo.com!
This would return null. So the question is: how can I make it so that a regex is the delimiter to split? I would want to get
foo#foo.com
in all cases, regardless of punctuation next to it. Is this possible in Python?
By "splitting by regex" I mean that if the program encounters the pattern in a string, it will extract that part and put it into a list.
I'd say you're looking for re.findall:
>>> email_reg = re.compile(r'[a-zA-Z0-9_.-]+#[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+')
>>> email_reg.findall('I know my best friend mailto:foo#foo.com!')
['foo#foo.com']
Notice that findall can handle more than one email address:
>>> email_reg.findall('Text text foo#foo.com, text text, baz#baz.com!')
['foo#foo.com', 'baz#baz.com']
Use re.search or re.findall.
You also need to escape your expression properly (. needs to be escaped outside of character classes, not inside) and remove/replace the anchors ^ and $ (for example with \b), eg:
r"\b[a-zA-Z0-9_.+-]+#[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+\b"
The problem I see in your regex is your use of ^ which matches the start of a string and $ which matches the end of your string. If you remove it and then run it with your sample test case it will work
>>> re.findall("[A-Za-z0-9\._-]+#[A-Za-z0-9-]+.[A-Za-z0-9-\.]+","I know my best friend mailto:foo#foo.com!")
['foo#foo.com']
>>> re.findall("[A-Za-z0-9\._-]+#[A-Za-z0-9-]+.[A-Za-z0-9-\.]+","Hello, foo#foo.com")
['foo#foo.com']
>>>

python regular expression replacing part of a matched string

i got an string that might look like this
"myFunc('element','node','elementVersion','ext',12,0,0)"
i'm currently checking for validity using, which works fine
myFunc\((.+?)\,(.+?)\,(.+?)\,(.+?)\,(.+?)\,(.+?)\,(.+?)\)
now i'd like to replace whatever string is at the 3rd parameter.
unfortunately i cant just use a stringreplace on whatever sub-string on the 3rd position since the same 'sub-string' could be anywhere else in that string.
with this and a re.findall,
myFunc\(.+?\,.+?\,(.+?)\,.+?\,.+?\,.+?\,.+?\)
i was able to get the contents of the substring on the 3rd position, but re.sub does not replace the string it just returns me the string i want to replace with :/
here's my code
myRe = re.compile(r"myFunc\(.+?\,.+?\,(.+?)\,.+?\,.+?\,.+?\,.+?\)")
val = "myFunc('element','node','elementVersion','ext',12,0,0)"
print myRe.findall(val)
print myRe.sub("noVersion",val)
any idea what i've missed ?
thanks!
Seb
In re.sub, you need to specify a substitution for the whole matching string. That means that you need to repeat the parts that you don't want to replace. This works:
myRe = re.compile(r"(myFunc\(.+?\,.+?\,)(.+?)(\,.+?\,.+?\,.+?\,.+?\))")
print myRe.sub(r'\1"noversion"\3', val)
If your only tool is a hammer, all problems look like nails. A regular expression is a powerfull hammer but is not the best tool for every task.
Some tasks are better handled by a parser. In this case the argument list in the string is just like a Python tuple, sou you can cheat: use the Python builtin parser:
>>> strdata = "myFunc('element','node','elementVersion','ext',12,0,0)"
>>> args = re.search(r'\(([^\)]+)\)', strdata).group(1)
>>> eval(args)
('element', 'node', 'elementVersion', 'ext', 12, 0, 0)
If you can't trust the input ast.literal_eval is safer than eval for this. Once you have the argument list in the string decontructed I think you can figure out how to manipulate and reassemble it again, if needed.
Read the documentation: re.sub returns a copy of the string where every occurrence of the entire pattern is replaced with the replacement. It cannot in any case modify the original string, because Python strings are immutable.
Try using look-ahead and look-behind assertions to construct a regex that only matches the element itself:
myRe = re.compile(r"(?<=myFunc\(.+?\,.+?\,)(.+?)(?=\,.+?\,.+?\,.+?\,.+?\))")
Have you tried using named groups? http://docs.python.org/howto/regex.html#search-and-replace
Hopefully that will let you just target the 3rd match.
If you want to do this without using regex:
>>> s = "myFunc('element','node','elementVersion','ext',12,0,0)"
>>> l = s.split(",")
>>> l[2]="'noVersion'"
>>> s = ",".join(l)
>>> s
"myFunc('element','node','noVersion','ext',12,0,0)"

Categories

Resources