Delete Chars in Python - python

does anybody know how to delete all characters behind a specific character??
like this:
http://google.com/translate_t
into
http://google.com

if you're asking about an abstract string and not url you could go with:
>>> astring ="http://google.com/translate_t"
>>> astring.rpartition('/')[0]
'http://google.com'

For URLs, using urlparse (the module is named urllib.parse in Python 3):
>>> import urlparse
>>> parts = urlparse.urlsplit('http://google.com/path/to/resource?query=spam#anchor')
>>> parts
('http', 'google.com', '/path/to/resource', 'query=spam', 'anchor')
>>> urlparse.urlunsplit((parts[0], parts[1], '', '', ''))
'http://google.com'
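For readers on Python 3, here is a minimal sketch of the same approach using urllib.parse. The helper name strip_to_origin is made up for illustration and is not part of the original answer.

```python
from urllib.parse import urlsplit, urlunsplit

def strip_to_origin(url):
    """Keep only the scheme and host of a URL (hypothetical helper)."""
    parts = urlsplit(url)
    # Rebuild the URL with empty path, query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, '', '', ''))

origin = strip_to_origin('http://google.com/path/to/resource?query=spam#anchor')
print(origin)
```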
For arbitrary strings, using re:
>>> import re
>>> re.split(r'\b/\b', 'http://google.com/path/to/resource', 1)
['http://google.com', 'path/to/resource']

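url = "http://google.com/translate_t"
shortened = url[0:url.rfind("/")]
That should do it. url[a:b] returns a substring in Python, and rfind returns the index of the last occurrence of a character sequence, searching from the end of the string. Note that rfind returns -1 when the sequence is not found, in which case the slice drops the last character of the string instead. (The variable is named url here because str would shadow the built-in string type.)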
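A small sketch of the rfind approach that also handles a missing separator. The function name strip_after_last is an assumption made for this example.

```python
def strip_after_last(text, sep="/"):
    """Return text up to (not including) the last occurrence of sep."""
    idx = text.rfind(sep)
    # rfind returns -1 when sep is absent; return the text unchanged then.
    return text if idx == -1 else text[:idx]

shortened = strip_after_last("http://google.com/translate_t")
unchanged = strip_after_last("no-slash-here")
print(shortened, unchanged)
```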

If you know the position of the character then you can use the slice syntax to create a new string:
In [2]: s1 = "abc123"
In [3]: s2 = s1[:3]
In [4]: print s2
abc
To find the position you can use the find() or index() methods of strings.
The split() and partition() methods may be useful, too.
Those methods are documented in the Python docs for sequences.
Removing part of a string in place is impossible because strings are immutable; all of these operations return a new string.
If you want to process URLs then you should definitely use the urlparse library. It lets you split a URL into its parts. If you just want to remove a part of the file path then you will still have to do that yourself.
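As a rough illustration of the string methods mentioned above, this sketch finds a position with find() and slices up to it, then does the same with partition(). The sample string is the one from the session above; the variable names are illustrative.

```python
s = "abc123"

# find() returns the index of the first match, or -1 if there is none.
pos = s.find("1")
head = s[:pos] if pos != -1 else s

# partition() splits around the first occurrence of the separator.
before, sep, after = s.partition("1")

# index() behaves like find() but raises ValueError when nothing is found.
try:
    s.index("z")
    found_z = True
except ValueError:
    found_z = False

print(head, before, after, found_z)
```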

Related

regex match proc name without slash

I have a list of proc names on Linux. Some have slash, some don't. For example,
kworker/23:1
migration/39
qmgr
I need to extract just the proc name without the slash and the rest. I tried a few different ways but still won't get it completely correct. What's wrong with my regex? Any help would be much appreciated.
>>> str='kworker/23:1'
>>> match=re.search(r'^(.+)\/*',str)
>>> match.group(1)
'kworker/23:1'
The problem with the regex is that the greedy .+ runs all the way to the end of the string. Everything after it (\/*) is optional and is allowed to match nothing, so the greedy group swallows the slash and everything after it. To fix this, replace the . with anything but a /:
([^\/]+)\/?.*
works. In case it is new to you, [^\/] matches any character except a slash, because the ^ at the beginning of a character class inverts which characters are matched.
Alternatively, you can also use split as suggested by Moses Koledoye. split is often better for simple string manipulation, while regex enables you to perform very complex tasks with rather little code.
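A quick sketch that applies the corrected pattern to the three process names from the question. The names PROC_NAME and extracted are chosen for this example only.

```python
import re

# One or more non-slash characters, then an optional slash and the rest.
PROC_NAME = re.compile(r'([^/]+)/?.*')

names = ['kworker/23:1', 'migration/39', 'qmgr']
extracted = [PROC_NAME.match(name).group(1) for name in names]
print(extracted)
```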
An alternative to regex is to split on slash and take the first item:
>>> s ='kworker/23:1'
>>> s.split('/')[0]
'kworker'
This also works when the string does not contain a slash:
>>> s = 'qmgr'
>>> s.split('/')[0]
'qmgr'
But if you're going to stick to re, I think re.sub is what you want, as you won't need to fetch the matching group:
>>> import re
>>> s ='kworker/23:1'
>>> re.sub(r'/.*$', '', s)
'kworker'
On a side note, assigning to the name str shadows the built-in string type, which you don't want.

How to use regular expressions in python?

Hopefully someone can help, I'm trying to use a regular expression to extract something from a string that occurs after a pattern, but it's not working and I'm not sure why. The regex works fine in linux...
import re
s = "GeneID:5408878;gbkey=CDS;product=carboxynorspermidinedecarboxylase;protein_id=YP_001405731.1"
>>> x = re.search(r'(?<=protein_id=)[^;]*',s)
>>> print(x)
<_sre.SRE_Match object at 0x000000000345B7E8>
Use .group() on the search result to print the captured groups:
>>> print(x.group(0))
YP_001405731.1
As Martijn pointed out, you created a match object. The regular expression is correct. If it had not matched, print(x) would have printed None.
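Since re.search returns None when nothing matches, it can help to check the result before calling .group(). A minimal sketch, assuming a hypothetical helper named protein_id:

```python
import re

def protein_id(record):
    """Return the protein_id value from a record, or None if absent."""
    match = re.search(r'(?<=protein_id=)[^;]*', record)
    # re.search returns None on failure, so guard before calling group().
    return match.group(0) if match else None

s = "GeneID:5408878;gbkey=CDS;product=carboxynorspermidinedecarboxylase;protein_id=YP_001405731.1"
found = protein_id(s)
missing = protein_id("GeneID:5408878;gbkey=CDS")
print(found, missing)
```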
You should probably think about re-writing your regex so that you find all pairs so you don't have to muck around with specific groups and hard-coded look behinds...
import re
kv = dict(re.findall(r'(\w+)=([^;]+)', s))
# {'gbkey': 'CDS', 'product': 'carboxynorspermidinedecarboxylase', 'protein_id': 'YP_001405731.1'}
print(kv['protein_id'])
# YP_001405731.1

How do I ensure that re.findall() stops at the right place?

Here is the code I have:
a='<title>aaa</title><title>aaa2</title><title>aaa3</title>'
import re
re.findall(r'<(title)>(.*)<(/title)>', a)
The result is:
[('title', 'aaa</title><title>aaa2</title><title>aaa3', '/title')]
If I ever designed a crawler to get me titles of web sites, I might end up with something like this rather than a title for the web site.
My question is, how do I limit findall to a single <title></title>?
Use re.search instead of re.findall if you only want one match:
>>> s = '<title>aaa</title><title>aaa2</title><title>aaa3</title>'
>>> import re
>>> re.search('<title>(.*?)</title>', s).group(1)
'aaa'
If you wanted all tags, then you should consider changing it to be non-greedy (ie - .*?):
print re.findall(r'<title>(.*?)</title>', s)
# ['aaa', 'aaa2', 'aaa3']
But really consider using BeautifulSoup or lxml or similar to parse HTML.
Use a non-greedy search instead:
r'<(title)>(.*?)<(/title)>'
The question-mark says to match as few characters as possible. Now your findall() will return each of the results you want.
http://docs.python.org/2/howto/regex.html#greedy-versus-non-greedy
re.findall(r'<(title)>(.*?)<(/title)>', a)
Add a ? after the *, so it will be non-greedy.
It will be much easier using BeautifulSoup module.
https://pypi.python.org/pypi/beautifulsoup4
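BeautifulSoup is a third-party package. As a rough stand-in that needs no installation, this sketch uses the standard library's html.parser instead. That is a swapped-in technique, not what the answers above propose, and the class name TitleCollector is made up for this example.

```python
from html.parser import HTMLParser

class TitleCollector(HTMLParser):
    """Collect the text content of every <title> element."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
            self.titles.append("")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so append to the current title.
        if self._in_title:
            self.titles[-1] += data

collector = TitleCollector()
collector.feed('<title>aaa</title><title>aaa2</title><title>aaa3</title>')
collector.close()
titles = collector.titles
print(titles)
```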

grep variable in python

I need something like grep in python
I have done research and found the re module to be suitable
I need to search variables for a specific string
To search for a specific string within a variable, you can just use in:
>>> 'foo' in 'foobar'
True
>>> s = 'foobar'
>>> 'foo' in s
True
>>> 'baz' in s
False
Using re.findall is another easy way. You can search for a literal string (although the in operator serves that purpose better, and with re you need to escape regex metacharacters), or for a pattern like the ones you would pass to grep (the two regex syntaxes differ in some details).
>>> re.findall("x", "xyz")
['x']
>>> re.findall("b.d", "abcde")
['bcd']
>>> re.findall("a?ba?c", "abacbc")
['abac', 'bc']
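To illustrate the point about escaping, this sketch compares a raw pattern with one passed through re.escape. The sample strings are invented for the example.

```python
import re

needle = "a.b"
haystack = "a.b axb"

# Unescaped, the dot matches any character, so 'axb' is found too.
loose = re.findall(needle, haystack)

# re.escape turns the needle into a pattern that matches it literally.
literal = re.findall(re.escape(needle), haystack)

print(loose, literal)
```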
It sounds like what you really want is the ability to print a large substring in a way that lets you easily see where a particular substring is. There are a couple of ways to approach this.
def grep(large_string, substring):
    # enumerate yields (index, item) pairs, so the index comes first
    for i, line in enumerate(large_string.split('\n')):
        if substring in line:
            print("{}: {}".format(i, line))
This would print only the lines that have your substring. However, you would lose a bunch of context. If you want true grep, replace if substring in line with something that uses the re module to do regular expression matching.
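One possible shape for the regex variant mentioned above, as a sketch only. The function name grep_lines is an assumption, and it returns the matches instead of printing them so they are easier to reuse.

```python
import re

def grep_lines(large_string, pattern):
    """Return (line_number, line) pairs for lines matching the regex."""
    regex = re.compile(pattern)
    return [
        (i, line)
        for i, line in enumerate(large_string.split('\n'))
        if regex.search(line)
    ]

hits = grep_lines("foo\nbar\nfoobar", r"^foo")
for number, line in hits:
    print("{}: {}".format(number, line))
```

Line numbers here are zero-based, matching enumerate's default; pass start=1 to enumerate for grep-style numbering.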
def highlight(large_string, substring):
    from colorama import Fore
    text_in_between = large_string.split(substring)
    highlighted_substring = "{}{}{}".format(Fore.RED, substring, Fore.RESET)
    print(highlighted_substring.join(text_in_between))
This will print the whole large string, but with the substring you are looking for in red. Note that you'll need to pip install colorama for it to work. You can of course combine the two approaches.

python regular expression replacing part of a matched string

i got an string that might look like this
"myFunc('element','node','elementVersion','ext',12,0,0)"
i'm currently checking for validity using, which works fine
myFunc\((.+?)\,(.+?)\,(.+?)\,(.+?)\,(.+?)\,(.+?)\,(.+?)\)
now i'd like to replace whatever string is at the 3rd parameter.
unfortunately i cant just use a stringreplace on whatever sub-string on the 3rd position since the same 'sub-string' could be anywhere else in that string.
with this and a re.findall,
myFunc\(.+?\,.+?\,(.+?)\,.+?\,.+?\,.+?\,.+?\)
i was able to get the contents of the substring on the 3rd position, but re.sub does not replace the string it just returns me the string i want to replace with :/
here's my code
myRe = re.compile(r"myFunc\(.+?\,.+?\,(.+?)\,.+?\,.+?\,.+?\,.+?\)")
val = "myFunc('element','node','elementVersion','ext',12,0,0)"
print myRe.findall(val)
print myRe.sub("noVersion",val)
any idea what i've missed ?
thanks!
Seb
In re.sub, you need to specify a substitution for the whole matching string. That means that you need to repeat the parts that you don't want to replace. This works:
myRe = re.compile(r"(myFunc\(.+?\,.+?\,)(.+?)(\,.+?\,.+?\,.+?\,.+?\))")
print myRe.sub(r'\1"noversion"\3', val)
If your only tool is a hammer, all problems look like nails. A regular expression is a powerful hammer but is not the best tool for every task.
Some tasks are better handled by a parser. In this case the argument list in the string is just like a Python tuple, so you can cheat and use the Python built-in parser:
>>> strdata = "myFunc('element','node','elementVersion','ext',12,0,0)"
>>> args = re.search(r'\(([^\)]+)\)', strdata).group(1)
>>> eval(args)
('element', 'node', 'elementVersion', 'ext', 12, 0, 0)
If you can't trust the input, ast.literal_eval is safer than eval for this. Once you have the argument list in the string deconstructed, I think you can figure out how to manipulate and reassemble it again, if needed.
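A sketch of the whole round trip with ast.literal_eval: parse the arguments, replace the third one, and rebuild the call string. It assumes every argument is a Python literal, and the variable names are illustrative.

```python
import ast
import re

strdata = "myFunc('element','node','elementVersion','ext',12,0,0)"

# Grab everything between the parentheses.
inner = re.search(r'\(([^)]+)\)', strdata).group(1)

# literal_eval only accepts literals, so it cannot run arbitrary code.
args = list(ast.literal_eval(inner))
args[2] = 'noVersion'

# repr() restores the quotes around string arguments.
rebuilt = "myFunc({})".format(",".join(repr(a) for a in args))
print(rebuilt)
```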
Read the documentation: re.sub returns a copy of the string where every occurrence of the entire pattern is replaced with the replacement. It cannot in any case modify the original string, because Python strings are immutable.
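Try using look-ahead and look-behind assertions to construct a regex that only matches the element itself. Note that the standard re module only accepts fixed-width look-behind patterns, so the expression below raises an error when compiled with re as written; it needs an engine that supports variable-width look-behind, such as the third-party regex module:
myRe = re.compile(r"(?<=myFunc\(.+?\,.+?\,)(.+?)(?=\,.+?\,.+?\,.+?\,.+?\))")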
Have you tried using named groups? http://docs.python.org/howto/regex.html#search-and-replace
Hopefully that will let you just target the 3rd match.
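A sketch of how named groups could be used here. The group names head, third and tail are invented for this example; the pattern otherwise follows the one from the question.

```python
import re

pattern = re.compile(
    r"(?P<head>myFunc\(.+?,.+?,)"   # function name and first two arguments
    r"(?P<third>.+?)"               # the argument to replace
    r"(?P<tail>,.+?,.+?,.+?,.+?\))" # remaining four arguments
)

val = "myFunc('element','node','elementVersion','ext',12,0,0)"

# \g<name> refers back to a named group in the replacement string.
result = pattern.sub(r"\g<head>'noVersion'\g<tail>", val)
print(result)
```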
If you want to do this without using regex:
>>> s = "myFunc('element','node','elementVersion','ext',12,0,0)"
>>> l = s.split(",")
>>> l[2]="'noVersion'"
>>> s = ",".join(l)
>>> s
"myFunc('element','node','noVersion','ext',12,0,0)"
