Python - Extract important string information


I have the following string
http://example.com/variable/controller/id32434242423423234?param1=321&param2=4324342
What is the best way to extract the id value, in this case 32434242423423234?
Regards,
Mladjo

You could just use a regular expression, e.g.:
import re
s = "http://example.com/variable/controller/id32434242423423234?param1=321&param2=4324342"
m = re.search(r'controller/id(\d+)\?',s)
if m:
    print("Found the id:", m.group(1))
If you need the value as a number rather than a string, you can use int(m.group(1)). There are plenty of other ways of doing this that might be more appropriate, depending on the larger goal of your code, but without more context it's hard to say.
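A minimal sketch of the same regular-expression approach wrapped in a function, assuming Python 3. The name `extract_id` is hypothetical, and the pattern assumes the id is a run of digits directly after `/id`, followed by either `?` or the end of the string.

```python
import re

# Hypothetical helper; assumes the id is a run of digits right after "/id".
ID_PATTERN = re.compile(r"/id(\d+)(?:\?|$)")

def extract_id(url):
    """Return the numeric id as an int, or None if the URL has no id."""
    match = ID_PATTERN.search(url)
    return int(match.group(1)) if match else None

url = "http://example.com/variable/controller/id32434242423423234?param1=321&param2=4324342"
print(extract_id(url))
```

Returning None instead of raising keeps the caller in control when a URL does not follow the expected layout.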

>>> from urllib.parse import urlparse  # Python 3; in Python 2 the module was named urlparse
>>> res = urlparse("http://example.com/variable/controller/id32434242423423234?param1=321&param2=4324342")
>>> res.path
'/variable/controller/id32434242423423234'
>>> import posixpath
>>> posixpath.split(res.path)
('/variable/controller', 'id32434242423423234')
>>> directory,filename=posixpath.split(res.path)
>>> filename[2:]
'32434242423423234'
Using urlparse and posixpath might be too much for this case, but I think it is the clean way to do it.
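Written as a script instead of an interpreter session, the approach might look like the sketch below. It assumes Python 3, where the parsing functions live in `urllib.parse`. Reading the query parameters with `parse_qs` goes beyond what the question asked and is included only to show what the parsed URL makes available.

```python
import posixpath
from urllib.parse import urlparse, parse_qs

url = "http://example.com/variable/controller/id32434242423423234?param1=321&param2=4324342"

parts = urlparse(url)
last_segment = posixpath.basename(parts.path)   # last path component
# Assumes the segment always starts with the literal prefix "id".
id_value = last_segment[len("id"):]
params = parse_qs(parts.query)                  # values are lists of strings

print(id_value)
print(params)
```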

>>> s
'http://example.com/variable/controller/id32434242423423234?param1=321&param2=4324342'
>>> s.split("id")
['http://example.com/variable/controller/', '32434242423423234?param1=321&param2=4324342']
>>> s.split("id")[-1].split("?")[0]
'32434242423423234'
>>>
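One caveat with splitting on "id": the result changes if those two letters appear again later in the URL. The sketch below uses a made-up parameter name, `paid`, to show the problem, and a variant that cuts off the query string first and then anchors on "/id".

```python
# Hypothetical URL whose query string also contains the letters "id".
url = "http://example.com/variable/controller/id32434242423423234?paid=321"

# The plain split picks up the wrong piece here.
naive = url.split("id")[-1].split("?")[0]
print(naive)

# Drop the query string first, then take what follows the last "/id".
path = url.partition("?")[0]
id_value = path.rpartition("/id")[2]
print(id_value)
```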

While regex is THE way to go, for simple things I have written a string parser. In a way, it is the (incomplete) reverse of a string formatting operation with PEP 3101. This is very convenient because it means that you do not have to learn another way of specifying the strings.
For example:
>>> 'The answer is {:d}'.format(42)
'The answer is 42'
The parser does the opposite:
>>> Parser('The answer is {:d}')('The answer is 42')
42
For your case, if you want an int as output
>>> url = 'http://example.com/variable/controller/id32434242423423234?param1=321&param2=4324342'
>>> fmt = 'http://example.com/variable/controller/id{:d}?param1=321&param2=4324342'
>>> Parser(fmt)(url)
32434242423423234
If you want a string:
>>> fmt = 'http://example.com/variable/controller/id{:s}?param1=321&param2=4324342'
>>> Parser(fmt)(url)
'32434242423423234'
If you want to capture more things in a dict:
>>> fmt = 'http://example.com/variable/controller/id{id:s}?param1={param1:s}&param2={param2:s}'
>>> Parser(fmt)(url)
{'id': '32434242423423234', 'param1': '321', 'param2': '4324342'}
Or in a tuple:
>>> fmt = 'http://example.com/variable/controller/id{:s}?param1={:s}&param2={:s}'
>>> Parser(fmt)(url)
('32434242423423234', '321', '4324342')
Give it a try, it is hosted here
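The implementation of that parser is not shown here. Purely as an illustration of the idea, a reverse-format helper could be sketched by translating the format string into a regular expression. The function `parse_with_format` below is hypothetical, is not the linked parser, and supports only named `{name:d}` and `{name:s}` fields.

```python
import re

# Hypothetical helper, not the parser from the answer: supports only
# named fields of the form {name:d} (integer) and {name:s} (string).
FIELD = re.compile(r"\{(\w+):([ds])\}")

def parse_with_format(fmt, text):
    """Return a dict of captured fields, or None if text does not fit fmt."""
    pattern, kinds, pos = "", {}, 0
    for field in FIELD.finditer(fmt):
        name, kind = field.groups()
        kinds[name] = kind
        # Literal text between fields must match exactly.
        pattern += re.escape(fmt[pos:field.start()])
        body = r"-?\d+" if kind == "d" else r".+?"
        pattern += "(?P<{}>{})".format(name, body)
        pos = field.end()
    pattern += re.escape(fmt[pos:])
    match = re.fullmatch(pattern, text)
    if match is None:
        return None
    # Convert {name:d} fields to int, leave {name:s} fields as strings.
    return {name: int(value) if kinds[name] == "d" else value
            for name, value in match.groupdict().items()}

url = "http://example.com/variable/controller/id32434242423423234?param1=321&param2=4324342"
fmt = "http://example.com/variable/controller/id{id:d}?param1={param1:s}&param2={param2:s}"
print(parse_with_format(fmt, url))
```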

Related

Python replace method strange behavior

Please help and explain. I tried adding the max argument, but it didn't help.
key = "tea-1_a-1"
print(key.replace("a-1","a-2")) # prints 'tea-2_a-2'
I need tea-1_a-2.

Try the following:
key = "tea-1_a-1"
print(key.replace("_a-1","_a-2"))

A regular expression would do the job by looking for either the beginning of the string or the underscore character before your pattern:
>>> import re
>>> key = 'a-1_tea-1'
>>> re.sub(r'(?:^|(?<=_))a-1', 'a-2', key)
'a-2_tea-1'
>>> key = 'tea-1_a-1'
>>> re.sub(r'(?:^|(?<=_))a-1', 'a-2', key)
'tea-1_a-2'
See Python Regular expression syntax documentation for more information.
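If the key is always a sequence of underscore-separated tokens (an assumption based on the two examples above), a regex-free sketch is to split the key, replace only the tokens that equal "a-1" exactly, and join the pieces again:

```python
key = "tea-1_a-1"

# Replace only whole underscore-separated tokens equal to "a-1".
tokens = ["a-2" if token == "a-1" else token for token in key.split("_")]
result = "_".join(tokens)
print(result)
```

Because "tea-1" is compared as a whole token, its trailing "a-1" is left alone.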

how to split the text using python?

f_output.write('\n{}, {}\n'.format(filename, summary))
I am printing the output with the name of the file. I am getting output such as VCALogParser_output_ARW.log, VCALogParser_output_CZC.log and so on, but I am only interested in printing ARW, CZC and so on. Can someone please tell me how to split this text?

filename.split('_')[-1].split('.')[0]
This will give you 'ARW'.
summary.split('_')[-1].split('.')[0]
And this will give you 'CZC'.

If you are only interested in CZC and ARW without the .log then, you can do it with re.search method:
>>> import re
>>> s1 = 'VCALogParser_output_ARW.log'
>>> s2 = 'VCALogParser_output_CZC.log'
>>> re.search(r'.*_(.*)\.log', s1).group(1)
'ARW'
>>> re.search(r'.*_(.*)\.log', s2).group(1)
'CZC'
Or, better, compile your pattern p and then call its search method when formatting your string:
>>> p = re.compile(r'.*_(.*)\.log')
>>>
>>> '\n{}, {}\n'.format(p.search(s1).group(1), p.search(s2).group(1))
'\nARW, CZC\n'
Also, it might be helpful to use re.sub with a positive lookbehind and a named group:
>>> p = re.compile(r'.*(?<=_)(?P<mystr>[a-zA-Z0-9]+)\.log$')
>>> p.sub(r'\g<mystr>', s1)
'ARW'
>>> p.sub(r'\g<mystr>', s2)
'CZC'
>>> '\n{}, {}\n'.format(p.sub(r'\g<mystr>', s1), p.sub(r'\g<mystr>', s2))
'\nARW, CZC\n'
If you cannot or do not want to use the re module, you can compute the lengths of the parts you don't need and slice your strings with them:
>>> i1 = len('VCALogParser_output_')
>>> i2 = len('.log')
>>>
>>> '\n{}, {}\n'.format(s1[i1:-i2], s2[i1:-i2])
'\nARW, CZC\n'
But keep in mind that the above is valid as long as you have those common strings in all of your string variables.

fname.split('_')[-1]
is rough, but it will give you 'CZC.log', 'ARW.log' and so on, assuming that all files have the same underscore-delimited format.

If the format of the file is always such that it ends with _ARW.log or _CZC.log this is really easy to do just using the standard string split() method, with two consecutive splits:
shortname = filename.split("_")[-1].split('.')[0]
Or, to make it (arguably) a bit more readable, we can use the os module:
import os
shortname = os.path.splitext(filename)[0].split("_")[-1]
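The os.path variant could be wrapped in a small function, sketched below. The name `short_name` is hypothetical, and the `basename` call is an addition that assumes the input might include a directory part.

```python
import os

def short_name(filename):
    """Return the part between the last underscore and the extension."""
    # basename drops any directory part; splitext drops the extension.
    stem = os.path.splitext(os.path.basename(filename))[0]
    return stem.rsplit("_", 1)[-1]

for name in ("VCALogParser_output_ARW.log", "VCALogParser_output_CZC.log"):
    print(short_name(name))
```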

You can also try:
>>> s1 = 'VCALogParser_output_ARW.log'
>>> s2 = 'VCALogParser_output_CZC.log'
>>> s1.split('_')[2].split('.')[0]
'ARW'
>>> s2.split('_')[2].split('.')[0]
'CZC'

Parse the file name correctly. My guess is that you want to strip the file extension .log and the prefix VCALogParser_output_; for that, str.replace is enough and str.split is not needed.
Use os.linesep when writing to the file if you want platform-specific line endings. Note that for files opened in text mode the Python documentation recommends writing a plain '\n' instead, since it is translated automatically.
The code below produces the desired result after applying the steps listed above:
import os
filename = 'VCALogParser_output_SOME_NAME.log'
summary = 'some summary'
fname = filename.replace('VCALogParser_output_', '').replace('.log', '')
linesep = os.linesep
f_output.write('{linesep}{fname}, {summary}{linesep}'
               .format(fname=fname, summary=summary, linesep=linesep))
# or if vars in execution scope strictly controlled pass locals() into format
f_output.write('{linesep}{fname}, {summary}{linesep}'.format(**locals()))
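Note that str.replace removes every occurrence of the given text, wherever it appears in the name. A slightly stricter sketch removes the prefix and suffix only when the name really starts and ends with them. The function name `strip_affixes` is hypothetical.

```python
PREFIX = "VCALogParser_output_"
SUFFIX = ".log"

def strip_affixes(filename):
    """Remove the known prefix and suffix only where they actually occur."""
    if filename.startswith(PREFIX):
        filename = filename[len(PREFIX):]
    if filename.endswith(SUFFIX):
        filename = filename[:-len(SUFFIX)]
    return filename

print(strip_affixes("VCALogParser_output_SOME_NAME.log"))
```

Unlike the split-based answers, this keeps names that themselves contain underscores, such as SOME_NAME, in one piece.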

python 3 remove before and after on string

I have this string /1B5DB40?full and I want to convert it to 1B5DB40.
I need to remove the ?full and the front /
My site won't always have ?full at the end so I need something that will still work even if the ?full is not there.
Thanks and hopefully this isn't too confusing to get some help :)
EDIT:
I know I could slice at 0 and 8 or whatever, but the 1B5DB40 could be longer or shorter. For example it could be /1B5DB4000?full or /1B5

Using str.lstrip (to remove the leading /) and str.split (to remove the optional part after ?):
>>> '/1B5DB40?full'.lstrip('/').split('?')[0]
'1B5DB40'
>>> '/1B5DB40'.lstrip('/').split('?')[0]
'1B5DB40'
or using urllib.parse.urlparse:
>>> import urllib.parse
>>> urllib.parse.urlparse('/1B5DB40?full').path.lstrip('/')
'1B5DB40'
>>> urllib.parse.urlparse('/1B5DB40').path.lstrip('/')
'1B5DB40'

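You can use lstrip and rstrip:
>>> data = '/1B5DB40?full'
>>> data.lstrip('/').rstrip('?full')
'1B5DB40'
This only works as long as the part that you want to extract does not start with / or end with any of the characters f, u, l, ?, because lstrip and rstrip remove sets of characters rather than a fixed prefix or suffix.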
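The limitation is easy to hit. The sketch below uses a made-up value that itself ends in "full" to show what rstrip does, followed by a variant that removes the exact suffix only when it is present.

```python
data = "/1B5full?full"   # hypothetical id that itself ends in "full"

# rstrip treats "?full" as a set of characters, so it eats into the id too.
stripped = data.lstrip("/").rstrip("?full")
print(stripped)

# Removing the exact suffix only when it is present keeps the id intact.
value = data.lstrip("/")
if value.endswith("?full"):
    value = value[:-len("?full")]
print(value)
```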

You can use regular expressions:
>>> import re
>>> extract = re.compile(r'/?(.*?)(?:\?full)?$')  # the ?full suffix is optional
>>> print(extract.search('/1B5DB40?full').group(1))
1B5DB40
>>> print(extract.search('/1Buuuuu?full').group(1))
1Buuuuu

What about regular expressions?
import re
re.search(r'/(?P<your_site>[^\?]+)', '/1B5DB40?full').group('your_site')
In this case it matches everything that is between '/' and '?', but you can change it to your specific requirements

>>> '/1B5DB40?full'.split('/')[1].split('?')[0]
'1B5DB40'
>>> '/1B5'.split('/')[1].split('?')[0]
'1B5'
>>> '/1B5DB40000?full'.split('/')[1].split('?')[0]
'1B5DB40000'
Split will simply return a single element list containing the original string if the separator is not found.
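A similar sketch with str.partition avoids indexing into the split result, so it also copes with input that has no leading slash. The function name `extract_code` is hypothetical.

```python
def extract_code(path):
    """Drop one leading "/" and anything from the first "?" onwards."""
    if path.startswith("/"):
        path = path[1:]
    # partition returns the whole string as the first item when "?" is absent.
    return path.partition("?")[0]

for sample in ("/1B5DB40?full", "/1B5", "1B5DB40000?full"):
    print(extract_code(sample))
```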

Python regex string matching with varying search string

Is there any way in Python to perform the following:
"DDx" should match "01x", "10x", "11x", "00x"
in an elegant way?
The easiest way I see to do this is by using regex, which in this case would be:
re.search(r'\d\dx', line)
Is there any way to dynamically update this regex?
In case the input is:
"D0x" then regex: \d0x
Please help.
Using Python 2.7
EDIT
In simpler terms, my question is:
>>> str = "DDx"
>>> str.replace('\d','D')
>>> re.search(<use str here>,line)
Or any alternate approach

I think I found the answer:
>>> import re
>>> s = "DDx"
>>> s = s.replace('D', r'\d')
>>> p = "01x"
>>> c = re.search(s, p)
>>> print(c.group(0))
01x
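A plain replace works for templates like "DDx", but any other character in the template that has a special meaning in regular expressions (for example "." or "+") would be interpreted as regex syntax. A slightly more defensive sketch escapes everything except the "D" placeholder. The function name `template_to_regex` is hypothetical.

```python
import re

def template_to_regex(template):
    """Build a compiled regex where each "D" stands for one digit."""
    # Every character other than "D" is escaped so it matches literally.
    parts = [r"\d" if char == "D" else re.escape(char) for char in template]
    return re.compile("".join(parts))

pattern = template_to_regex("D0x")
matches = [s for s in ("00x", "10x", "01x", "a0x") if pattern.search(s)]
print(matches)
```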

Wildcard matching substring in Python

I am completely new to Python and don't know how to get a sub-string which matches some wildcard condition from a string.
I am trying to get a timestamp from the following string:
sdc4-251504-7f5-f59c349f0e516894fc89d2686a0d57f5-1360922654.97671.data
I want to get only "1360922654.97671" part out of the string.
Please help.

Because you mentioned wildcards, you can use re:
In [77]: import re
In [78]: s = "sdc4-251504-7f5-f59c349f0e516894fc89d2686a0d57f5-1360922654.97671.data"
In [79]: re.findall(r"\d+\.\d+", s)
Out[79]: ['1360922654.97671']
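If other parts of the name could also look like "digits.digits", the pattern can be anchored to the end of the string. The sketch below assumes the timestamp sits right before the final extension, and it additionally assumes the value is a Unix epoch time in seconds, which the question does not state.

```python
import re
from datetime import datetime, timezone

name = "sdc4-251504-7f5-f59c349f0e516894fc89d2686a0d57f5-1360922654.97671.data"

# Assumes the timestamp is "digits.digits" right before the final extension.
match = re.search(r"-(\d+\.\d+)\.[^.-]+$", name)
timestamp = match.group(1) if match else None
print(timestamp)

# Interpreting the value as Unix epoch seconds is an assumption.
if timestamp is not None:
    moment = datetime.fromtimestamp(float(timestamp), tz=timezone.utc)
    print(moment.isoformat())
```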

If the dots and dashes have their specific function within your string, you can use this:
>>> s = "sdc4-251504-7f5-f59c349f0e516894fc89d2686a0d57f5-1360922654.97671.data"
>>> s.rsplit('.', 1)[0].split('-')[-1]
'1360922654.97671'
Step by step:
>>> s.rsplit('.', 1)
['sdc4-251504-7f5-f59c349f0e516894fc89d2686a0d57f5-1360922654.97671', 'data']
>>> s.rsplit('.', 1)[0]
'sdc4-251504-7f5-f59c349f0e516894fc89d2686a0d57f5-1360922654.97671'
>>> s.rsplit('.', 1)[0].split('-')
['sdc4', '251504', '7f5', 'f59c349f0e516894fc89d2686a0d57f5', '1360922654.97671']
>>> s.rsplit('.', 1)[0].split('-')[-1]
'1360922654.97671'
This will work for any strings in the form:
anything-WHATYOUWANT.stringwithoutdots

>>> s = "sdc4-251504-7f5-f59c349f0e516894fc89d2686a0d57f5-1360922654.97671.data"
>>> s.split('-')[-1][:-5]
'1360922654.97671'
Slightly fewer characters, but this only works where the last part of the string is .data or another five-character suffix (including the dot).
