Wildcard matching substring in Python - python

I am completely new to Python and don't know how to get a sub-string which matches some wildcard condition from a string.
I am trying to get a timestamp from the following string:
sdc4-251504-7f5-f59c349f0e516894fc89d2686a0d57f5-1360922654.97671.data
I want to get only "1360922654.97671" part out of the string.
Please help.

Because you mentioned wildcards, you can use the re module:
In [77]: import re
In [78]: s = "sdc4-251504-7f5-f59c349f0e516894fc89d2686a0d57f5-1360922654.97671.data"
In [79]: re.findall(r"\d+\.\d+", s)
Out[79]: ['1360922654.97671']
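As a minimal sketch, the same idea can be wrapped in a function that anchors the match to the end of the name, so an earlier digits-dot-digits run cannot be picked up by accident. The function name extract_timestamp and the assumed layout (a dash, the timestamp, then one final extension) are illustrative assumptions, not part of the original answer.

```python
import re

# Assumed layout: <anything>-<digits>.<digits>.<extension without dots>
TIMESTAMP_RE = re.compile(r"-(\d+\.\d+)\.[^.]+$")

def extract_timestamp(name):
    """Return the timestamp substring, or None if the name does not fit the layout."""
    match = TIMESTAMP_RE.search(name)
    return match.group(1) if match else None

s = "sdc4-251504-7f5-f59c349f0e516894fc89d2686a0d57f5-1360922654.97671.data"
print(extract_timestamp(s))
```

Returning None instead of raising lets the caller decide how to treat names that do not match.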

If the dots and dashes have their specific function within your string, you can use this:
>>> s = "sdc4-251504-7f5-f59c349f0e516894fc89d2686a0d57f5-1360922654.97671.data"
>>> s.rsplit('.', 1)[0].split('-')[-1]
'1360922654.97671'
Step by step:
>>> s.rsplit('.', 1)
['sdc4-251504-7f5-f59c349f0e516894fc89d2686a0d57f5-1360922654.97671', 'data']
>>> s.rsplit('.', 1)[0]
'sdc4-251504-7f5-f59c349f0e516894fc89d2686a0d57f5-1360922654.97671'
>>> s.rsplit('.', 1)[0].split('-')
['sdc4', '251504', '7f5', 'f59c349f0e516894fc89d2686a0d57f5', '1360922654.97671']
>>> s.rsplit('.', 1)[0].split('-')[-1]
'1360922654.97671'
This will work for any strings in the form:
anything-WHATYOUWANT.stringwithoutdots
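For strings of that form, str.rpartition is a possible alternative that avoids building intermediate lists. This is only a sketch; the helper name last_field is hypothetical.

```python
def last_field(name):
    """Return the text between the last '-' and the last '.' of name."""
    # rpartition splits on the LAST occurrence of the separator.
    stem, _, _extension = name.rpartition(".")
    _, _, field = stem.rpartition("-")
    return field

s = "sdc4-251504-7f5-f59c349f0e516894fc89d2686a0d57f5-1360922654.97671.data"
print(last_field(s))
```

Note that if the name contains no dot at all, rpartition leaves the stem empty, so the function returns an empty string.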

>>> s = "sdc4-251504-7f5-f59c349f0e516894fc89d2686a0d57f5-1360922654.97671.data"
>>> s.split('-')[-1][:-5]
'1360922654.97671'
This uses slightly fewer characters, but it only works when the last part of the string is .data or another suffix of exactly five characters.

Related

python 3 remove before and after on string

I have this string /1B5DB40?full and I want to convert it to 1B5DB40.
I need to remove the ?full and the front /
My site won't always have ?full at the end so I need something that will still work even if the ?full is not there.
Thanks and hopefully this isn't too confusing to get some help :)
EDIT:
I know I could slice at 0 and 8 or whatever, but the 1B5DB40 could be longer or shorter. For example it could be /1B5DB4000?full or /1B5
Using str.lstrip (to remove the leading /) and str.split (to remove the optional part after ?):
>>> '/1B5DB40?full'.lstrip('/').split('?')[0]
'1B5DB40'
>>> '/1B5DB40'.lstrip('/').split('?')[0]
'1B5DB40'
or using urllib.parse.urlparse:
>>> import urllib.parse
>>> urllib.parse.urlparse('/1B5DB40?full').path.lstrip('/')
'1B5DB40'
>>> urllib.parse.urlparse('/1B5DB40').path.lstrip('/')
'1B5DB40'
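You can use lstrip and rstrip:
>>> data = '/1B5DB40?full'
>>> data.lstrip('/').rstrip('?full')
'1B5DB40'
This only works as long as the part that you want to extract does not start with / or end with any of the characters f, u, l, ?, because lstrip and rstrip remove sets of characters, not exact substrings.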
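To make that limitation concrete, here is a small sketch that compares rstrip with an exact prefix/suffix removal. The helper strip_affixes is a hypothetical name introduced for illustration.

```python
def strip_affixes(text, prefix="/", suffix="?full"):
    """Remove prefix and suffix only when they match exactly."""
    if prefix and text.startswith(prefix):
        text = text[len(prefix):]
    if suffix and text.endswith(suffix):
        text = text[:-len(suffix)]
    return text

tricky = "/1Bfull?full"
# rstrip treats '?full' as the character set {?, f, u, l} and eats too much.
print(tricky.lstrip("/").rstrip("?full"))
# Exact matching keeps the trailing 'full' that belongs to the ID.
print(strip_affixes(tricky))
print(strip_affixes("/1B5"))
```

On Python 3.9 and later, str.removeprefix and str.removesuffix provide the same exact-match behaviour.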
You can use regular expressions (the ?full suffix is made optional here so that the pattern also matches when it is missing):
>>> import re
>>> extract = re.compile(r'/?(.*?)(?:\?full)?$')
>>> print(extract.search('/1B5DB40?full').group(1))
1B5DB40
>>> print(extract.search('/1Buuuuu?full').group(1))
1Buuuuu
What about regular expressions?
import re
re.search(r'/(?P<your_site>[^\?]+)', '/1B5DB40?full').group('your_site')
In this case it matches everything that is between '/' and '?', but you can change it to your specific requirements
>>> '/1B5DB40?full'.split('/')[1].split('?')[0]
'1B5DB40'
>>> '/1B5'.split('/')[1].split('?')[0]
'1B5'
>>> '/1B5DB40000?full'.split('/')[1].split('?')[0]
'1B5DB40000'
Split will simply return a single element list containing the original string if the separator is not found.

Extracting a substring of a string in Python based on presence of another string

common is always present regardless of string. Using that information, I'd like to grab the substring that comes just before it, in this case, "banana":
string = "apple_orange_banana_common_fruit"
In this case, "fruit":
string = "fruit_common_apple_banana_orange"
How would I go about doing this in Python?
You can use re.search() to extract the substring:
>>> import re
>>> s = 'apple_orange_banana_common_fruit'
>>> re.search(r'([a-zA-Z]+)_common', s).group(1)
'banana'
This will return a list of matches:
import re
string = "apple_orange_banana_common_fruit"
preceding_word = re.findall("[A-Za-z]+(?=_common)", string)
If common only occurs once per string, you might be better off using hwnd's solution.
import re
string = "apple_orange_banana_common_fruit"
preceding_word = re.search(r'([a-zA-Z]+)(?=_common)', string)
print(preceding_word.group(1))
>>> string = "fruit_common_apple_banana_orange"
>>> parts = string.split('_')
>>> print(parts[parts.index('common') - 1])
fruit
>>> string = "apple_orange_banana_common_fruit"
>>> parts = string.split('_')
>>> print(parts[parts.index('common') - 1])
banana
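One caveat with the split/index approach: if common happens to be the first field, index - 1 becomes -1 and silently returns the last field, and if common is absent, list.index raises ValueError. A sketch that guards both cases follows; the function name word_before and its default arguments are assumptions made for illustration.

```python
def word_before(text, marker="common", sep="_"):
    """Return the field immediately before marker, or None if there is none."""
    parts = text.split(sep)
    try:
        idx = parts.index(marker)
    except ValueError:
        return None  # marker not present at all
    if idx == 0:
        return None  # nothing precedes the marker; avoid wrapping to parts[-1]
    return parts[idx - 1]

print(word_before("apple_orange_banana_common_fruit"))
print(word_before("fruit_common_apple_banana_orange"))
print(word_before("common_apple"))
```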

Using regular expression to extract string

I need to extract the IP address from the following string.
>>> mydns='ec2-54-196-170-182.compute-1.amazonaws.com'
The text to the left of the dot needs to be returned. The following works as expected.
>>> mydns[:18]
'ec2-54-196-170-182'
But it does not work in all cases. For e.g.
mydns='ec2-666-777-888-999.compute-1.amazonaws.com'
>>> mydns[:18]
'ec2-666-777-888-99'
How do I use regular expressions in Python for this?
No need for regex... Just use str.split
mydns.split('.', 1)[0]
Demo:
>>> mydns='ec2-666-777-888-999.compute-1.amazonaws.com'
>>> mydns.split('.', 1)[0]
'ec2-666-777-888-999'
If you wanted to use regex for this:
Regex String
ec2-([0-9]{1,3})-([0-9]{1,3})-([0-9]{1,3})-([0-9]{1,3}).*
Alternative (EC2 Agnostic):
.*\b([0-9]{1,3})-([0-9]{1,3})-([0-9]{1,3})-([0-9]{1,3}).*
Replacement String
Regular: \1.\2.\3.\4
Reverse: \4.\3.\2.\1
Python code
import re
subject = 'ec2-54-196-170-182.compute-1.amazonaws.com'
result = re.sub("ec2-([0-9]{1,3})-([0-9]{1,3})-([0-9]{1,3})-([0-9]{1,3}).*", r"\1.\2.\3.\4", subject)
print(result)
This regex will match everything up to the first dot: ^[^.]+
So try this:
import re
string = "ec2-54-196-170-182.compute-1.amazonaws.com"
ip = re.findall(r'^[^.]+', string)[0]
print(ip)
Output:
ec2-54-196-170-182
The nice thing is that this matches whether the hostname starts with ec2, ec3, and so on, so this regex behaves much like @mgilson's code.
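If the goal is the dotted IP address rather than the host label, the captured groups can be joined and checked with the standard ipaddress module. This is a sketch under the assumption that hostnames always look like ec2-A-B-C-D.<rest>; the function name ec2_host_to_ip is hypothetical.

```python
import ipaddress
import re

# Assumed hostname layout: ec2-<octet>-<octet>-<octet>-<octet>.<rest>
EC2_HOST_RE = re.compile(r"^ec2-(\d{1,3})-(\d{1,3})-(\d{1,3})-(\d{1,3})\.")

def ec2_host_to_ip(hostname):
    """Return the dotted IPv4 address embedded in hostname, or None."""
    match = EC2_HOST_RE.match(hostname)
    if match is None:
        return None
    candidate = ".".join(match.groups())
    try:
        # ip_address rejects out-of-range octets such as 666.
        return str(ipaddress.ip_address(candidate))
    except ValueError:
        return None

print(ec2_host_to_ip("ec2-54-196-170-182.compute-1.amazonaws.com"))
print(ec2_host_to_ip("ec2-666-777-888-999.compute-1.amazonaws.com"))
```

The second call returns None because 666.777.888.999 is not a valid IPv4 address, even though the regex matches it.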

python regular expression to find something in between two strings or phrases

How can I use regex in python to capture something between two strings or phrases, and removing everything else on the line?
For example, the following is a protein sequence preceded by a one-line header. How can I sift off "CG33289-PC" from the header below based on the stipulation that is occurs after the phrase "FlyBase_Annotation_IDs:" and before the next comma "," ?
I need to substitute the header with this simplified result "CG33289-PC" and not destroy the protein sequence (found below the header-line in all caps).
This is what each protein sequence entry looks like - a header followed by a sequence:
>FBpp0293870 type=protein;loc=3L:join(21527760..21527913,21527977..21528076,21528130..21528390,21528443..21528653,21528712..21529192,21529254..21529264); ID=FBpp0293870; name=CG33289-PC; parent=FBgn0053289,FBtr0305327; dbxref=FlyBase:FBpp0293870,FlyBase_Annotation_IDs:CG33289-PC; MD5=478485a27487608aa2b6c35d39a3295c; length=405; release=r5.45; species=Dmel;
MEMLKYVISDNNYSWWIKLYFAIIFALVLFVAVNLAVGIYNKWDSTPVII
GISSKMTPIDQIPFPTITVCNMNQAKKSKVEHLMPGSIRYAMLQKTCYKE
SNFSQYMDTQHRNETFSNFILDVSEKCADLIVSCIFHQQRIPCTDIFRET
FVDEGLCCIFNVLHPYYLYKFKSPYIRDFTSSDRFADIAVDWDPISGYPQ
RLPSSYYPRPGVGVGTSMGLQIVLNGHVDDYFCSSTNGQGFKILLYNPID
QPRMKESGLPVMIGHQTSFRIIARNVEATPSIRNIHRTKRQCIFSDEQEL
LFYRYYTRRNCEAECDSMFFLRLCSCIPYYLPLIYPNASVCDVFHFECLN
RAESQIFDLQSSQCKEFCLTSCHDLIFFPDAFSTPFSQKDVKAQTNYLTN
FSRAV
This is the desired output:
CG33289-PC
MEMLKYVISDNNYSWWIKLYFAIIFALVLFVAVNLAVGIYNKWDSTPVII
GISSKMTPIDQIPFPTITVCNMNQAKKSKVEHLMPGSIRYAMLQKTCYKE
SNFSQYMDTQHRNETFSNFILDVSEKCADLIVSCIFHQQRIPCTDIFRET
FVDEGLCCIFNVLHPYYLYKFKSPYIRDFTSSDRFADIAVDWDPISGYPQ
RLPSSYYPRPGVGVGTSMGLQIVLNGHVDDYFCSSTNGQGFKILLYNPID
QPRMKESGLPVMIGHQTSFRIIARNVEATPSIRNIHRTKRQCIFSDEQEL
LFYRYYTRRNCEAECDSMFFLRLCSCIPYYLPLIYPNASVCDVFHFECLN
RAESQIFDLQSSQCKEFCLTSCHDLIFFPDAFSTPFSQKDVKAQTNYLTN
FSRAV
Using regexps:
>>> s = """>FBpp0293870 type=protein;loc=3L:join(21527760..21527913,21527977..21528076,21528130..21528390,21528443..21528653,21528712..21529192,21529254..21529264); ID=FBpp0293870; name=CG33289-PC; parent=FBgn0053289,FBtr0305327; dbxref=FlyBase:FBpp0293870,FlyBase_Annotation_IDs:CG33289-PC; MD5=478485a27487608aa2b6c35d39a3295c; length=405; release=r5.45; species=Dmel; MEMLKYVISDNNYSWWIKLYFAIIFALVLFVAVNLAVGIYNKWDSTPVII
GISSKMTPIDQIPFPTITVCNMNQAKKSKVEHLMPGSIRYAMLQKTCYKE
SNFSQYMDTQHRNETFSNFILDVSEKCADLIVSCIFHQQRIPCTDIFRET
FVDEGLCCIFNVLHPYYLYKFKSPYIRDFTSSDRFADIAVDWDPISGYPQ
RLPSSYYPRPGVGVGTSMGLQIVLNGHVDDYFCSSTNGQGFKILLYNPID
QPRMKESGLPVMIGHQTSFRIIARNVEATPSIRNIHRTKRQCIFSDEQEL
LFYRYYTRRNCEAECDSMFFLRLCSCIPYYLPLIYPNASVCDVFHFECLN
RAESQIFDLQSSQCKEFCLTSCHDLIFFPDAFSTPFSQKDVKAQTNYLTN
FSRAV"""
>>> import re
>>> print(re.sub(r'.*FlyBase_Annotation_IDs:([\w-]+).*;', r'\1\n', s))
CG33289-PC
MEMLKYVISDNNYSWWIKLYFAIIFALVLFVAVNLAVGIYNKWDSTPVII
GISSKMTPIDQIPFPTITVCNMNQAKKSKVEHLMPGSIRYAMLQKTCYKE
SNFSQYMDTQHRNETFSNFILDVSEKCADLIVSCIFHQQRIPCTDIFRET
FVDEGLCCIFNVLHPYYLYKFKSPYIRDFTSSDRFADIAVDWDPISGYPQ
RLPSSYYPRPGVGVGTSMGLQIVLNGHVDDYFCSSTNGQGFKILLYNPID
QPRMKESGLPVMIGHQTSFRIIARNVEATPSIRNIHRTKRQCIFSDEQEL
LFYRYYTRRNCEAECDSMFFLRLCSCIPYYLPLIYPNASVCDVFHFECLN
RAESQIFDLQSSQCKEFCLTSCHDLIFFPDAFSTPFSQKDVKAQTNYLTN
FSRAV
>>>
Not an elegant solution, but this should work for you:
>>> fly = 'FlyBase_Annotation_IDs'
>>> repl = 'CG33289-PC'
>>> part1, part2 = protein.split(fly)
>>> part2 = part2.replace(repl, "FooBar")
>>> protein = fly.join([part1, part2])
This assumes that protein holds the entry text and that FlyBase_Annotation_IDs appears only once in it.
I'm not sure about the format of the file, but this regex will capture the data in your example:
"FlyBase_Annotation_IDs:([A-Z0-9a-z-]*);"
Use the re.findall function to get the match.
Assuming there is a newline after the header:
>>> import re
>>> protein = "..."
>>> r = re.compile(r"^.*FlyBase_Annotation_IDs:([A-Z0-9a-z-]*);.*$", re.MULTILINE)
>>> r.sub(r"\1", protein)
The group ([A-Z0-9a-z-]*) in the regular expression extracts any alphanumeric character and the dash. If ids can have other characters, just add them.
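For a file with many entries, one possible approach is to process it line by line and rewrite only the header lines, which leaves the sequence lines untouched by construction. The sketch below assumes that headers start with > and that each header is on its own line; simplify_headers is a hypothetical name, and the sample header is a shortened version of the one in the question.

```python
import re

ID_RE = re.compile(r"FlyBase_Annotation_IDs:([\w-]+)")

def simplify_headers(text):
    """Replace each '>' header line with its FlyBase annotation ID."""
    out = []
    for line in text.splitlines():
        if line.startswith(">"):
            match = ID_RE.search(line)
            if match:
                line = match.group(1)  # keep only the ID, e.g. CG33289-PC
        out.append(line)  # sequence lines pass through unchanged
    return "\n".join(out)

# Shortened sample entry (most header fields omitted for brevity).
sample = (
    ">FBpp0293870 type=protein; "
    "dbxref=FlyBase:FBpp0293870,FlyBase_Annotation_IDs:CG33289-PC; species=Dmel;\n"
    "MEMLKYVISDNNYSWWIKLYFAIIFALVLFVAVNLAVGIYNKWDSTPVII\n"
    "FSRAV"
)
print(simplify_headers(sample))
```

Headers without a FlyBase_Annotation_IDs field are left as they are, so nothing is silently dropped.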

in python, how to match Strings based on Regular Expression and get the non-matching parts as a list?

For example: I have a string "abcde2011-09-30.log", and I want to check whether this string matches "(\d){4}-(\d){2}-(\d){2}" (I don't think this is the correct syntax, but you get the idea). I also need to split the string into 3 parts: (abcde), (2011-09-30), (.log). How can I do it in Python? Thanks.
There's a split function in the re module that should work for you.
>>> import re
>>> s = 'abcde2011-09-30.log'
>>> re.split(r'(\d{4}-\d{2}-\d{2})', s)
['abcde', '2011-09-30', '.log']
If you don't actually want the date as part of the returned list, just omit the parentheses around the regular expression so that it doesn't have a capturing group:
>>> re.split(r'\d{4}-\d{2}-\d{2}', s)
['abcde', '.log']
Be advised that if the pattern matches more than once, i.e. if there is more than one date in the filename, then this will split on each of them. For example,
>>> s2 = 'abcde2011-09-30fghij2012-09-31.log'
>>> re.split(r'(\d{4}-\d{2}-\d{2})', s2)
['abcde', '2011-09-30', 'fghij', '2012-09-31', '.log']
If this is a problem, you can use the maxsplit argument to split only once, on the first occurrence of the date:
>>> re.split(r'(\d{4}-\d{2}-\d{2})', s2, maxsplit=1)
['abcde', '2011-09-30', 'fghij2012-09-31.log']
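If you also need to know whether the string matched at all, a sketch using re.search and slicing gives the three parts directly and makes the no-match case explicit. The function name split_around_date is an assumption introduced for illustration.

```python
import re

DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")

def split_around_date(name):
    """Return (before, date, after) for the first date in name, or None."""
    match = DATE_RE.search(name)
    if match is None:
        return None
    # Slice around the match instead of splitting, so later dates stay intact.
    return name[:match.start()], match.group(0), name[match.end():]

print(split_around_date("abcde2011-09-30.log"))
print(split_around_date("abcde2011-09-30fghij2012-09-31.log"))
print(split_around_date("nodate.log"))
```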
How's this:
>>> import re
>>> a = "abcde2011-09-30.log"
>>> myregexp = re.compile(r'^(.*)(\d{4}-\d{2}-\d{2})(\.\w+)$')
>>> m = myregexp.match(a)
>>> m
<_sre.SRE_Match object at 0xb7f69480>
>>> m.groups()
('abcde', '2011-09-30', '.log')
I don't know the exact python regex syntax but something like this should do the job:
/^(\D+?)([\d-]+)(\.log)$/
(Without using regex, and interpreting your string as a filename:)
Let's start by splitting the filename from the extension '.log':
import os
filename, ext = os.path.splitext('abcde2011-09-30.log')
Most probably the length of the date is always 10, allowing for:
year, month, day = [int(i) for i in filename[-10:].split('-')]
description = filename[:-10]
However, if you are not sure, you can find out where the date part of the filename starts (this assumes the description itself contains no digits):
for i in range(len(filename)):
    if filename[i].isdigit():
        break
description, date = filename[:i], filename[i:]
year, month, day = [int(c) for c in date.split('-')]
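Building on the fixed-length idea, the date part can also be validated with datetime.strptime, which rejects impossible dates instead of returning three unchecked integers. This is a sketch that assumes the date always occupies the last 10 characters before the extension; parse_dated_filename is a hypothetical name.

```python
import os
from datetime import datetime

def parse_dated_filename(path):
    """Split 'descYYYY-MM-DD.ext' into (description, date, extension)."""
    stem, ext = os.path.splitext(path)
    # Assumption: the date is exactly the last 10 characters of the stem.
    description, date_text = stem[:-10], stem[-10:]
    # strptime raises ValueError for malformed or impossible dates.
    date = datetime.strptime(date_text, "%Y-%m-%d").date()
    return description, date, ext

description, date, ext = parse_dated_filename("abcde2011-09-30.log")
print(description, date.isoformat(), ext)
```

A filename that does not end in a valid date raises ValueError, which the caller can catch to skip or report the file.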
