How to extract a substring from a string in Python? - python

Suppose I have a string, text2 = r'C:\Users\Sony\Desktop\f.html', and I want to split it into "C:\Users\Sony\Desktop" and "f.html" and store them in separate variables. What should I do? I tried regular expressions but wasn't successful.

os.path.split does what you want:
>>> import os
>>> help(os.path.split)
Help on function split in module ntpath:
split(p)
Split a pathname.
Return tuple (head, tail) where tail is everything after the final slash.
Either part may be empty.
>>> os.path.split(r'c:\users\sony\desktop\f.html')
('c:\\users\\sony\\desktop', 'f.html')
>>> path,filename = os.path.split(r'c:\users\sony\desktop\f.html')
>>> path
'c:\\users\\sony\\desktop'
>>> filename
'f.html'
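As a hedged aside: os.path.split follows the path conventions of the platform the code runs on, so the session above assumes Windows. A minimal sketch, assuming you need to split a Windows-style path on any operating system, is to call ntpath (the module named in the help output above) directly, or to use pathlib.PureWindowsPath. The variable names are illustrative.

```python
import ntpath
from pathlib import PureWindowsPath

text2 = r'C:\Users\Sony\Desktop\f.html'

# ntpath is the Windows implementation behind os.path; importing it
# directly applies Windows splitting rules on any platform.
head, tail = ntpath.split(text2)
print(head)  # C:\Users\Sony\Desktop
print(tail)  # f.html

# pathlib exposes the same split through the parent and name attributes.
p = PureWindowsPath(text2)
print(str(p.parent), p.name)
```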

Related

Why is the split() returning list objects that are empty? [duplicate]

I have the following file names that exhibit this pattern:
000014_L_20111007T084734-20111008T023142.txt
000014_U_20111007T084734-20111008T023142.txt
...
I want to extract the middle two time stamp parts after the second underscore '_' and before '.txt'. So I used the following Python regex string split:
time_info = re.split('^[0-9]+_[LU]_|-|\.txt$', f)
But this gives me two extra empty strings in the returned list:
time_info=['', '20111007T084734', '20111008T023142', '']
How do I get only the two time stamp information? i.e. I want:
time_info=['20111007T084734', '20111008T023142']
I'm no Python expert but maybe you could just remove the empty strings from your list?
str_list = re.split(r'^[0-9]+_[LU]_|-|\.txt$', f)
time_info = list(filter(None, str_list))  # list() is needed on Python 3, where filter returns an iterator
Don't use re.split(), use the groups() method of regex Match/SRE_Match objects.
>>> f = '000014_L_20111007T084734-20111008T023142.txt'
>>> time_info = re.search(r'[LU]_(\w+)-(\w+)\.', f).groups()
>>> time_info
('20111007T084734', '20111008T023142')
You can even name the capturing groups and retrieve them in a dict, though you use groupdict() rather than groups() for that. (The regex pattern for such a case would be something like r'[LU]_(?P<groupA>\w+)-(?P<groupB>\w+)\.')
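A minimal sketch of the named-group variant, assuming the same file name as above. The group names "start" and "end" are illustrative choices, not anything the original answer prescribes.

```python
import re

f = '000014_L_20111007T084734-20111008T023142.txt'

# Named groups: (?P<name>...) makes each capture available by name.
pattern = re.compile(r'[LU]_(?P<start>\w+)-(?P<end>\w+)\.')
match = pattern.search(f)

# search() returns None when nothing matches, so guard before using it.
time_info = match.groupdict() if match else {}
print(time_info)
```

groupdict() returns a plain dict, so the two timestamps can then be read as time_info['start'] and time_info['end'].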
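If the timestamps always come after the second _, you can use str.split and str.strip (note that strip(".txt") removes any of the characters ., t and x from both ends rather than the literal suffix, which is harmless here because the name begins with a digit and the timestamps end with one):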
>>> strs = "000014_L_20111007T084734-20111008T023142.txt"
>>> strs.strip(".txt").split("_",2)[-1].split("-")
['20111007T084734', '20111008T023142']
Since this came up on google and for completeness, try using re.findall as an alternative!
This does require a little re-thinking, but it still returns a list of matches like split does. This makes it a nice drop-in replacement for some existing code and gets rid of the unwanted text. Pair it with lookaheads and/or lookbehinds and you get very similar behavior.
Yes, this is a bit of a "you're asking the wrong question" answer, and it doesn't use re.split(). It does solve the underlying issue, though: your list of matches suddenly has zero-length strings in it and you don't want that.
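The answer above gives no code, so here is a rough sketch of what it might look like. Both patterns are assumptions about the file-name format (timestamps shaped like 8 digits, "T", 6 digits), not something stated in the question.

```python
import re

f = '000014_L_20111007T084734-20111008T023142.txt'

# Assumption: a timestamp is 8 digits, a literal "T", then 6 digits.
time_info = re.findall(r'\d{8}T\d{6}', f)
print(time_info)

# Lookaround variant: a run of digits/"T" that is preceded by "_" or "-"
# and followed by "-" or ".". The delimiters are checked but not returned.
bounded = re.findall(r'(?<=[_-])[0-9T]+(?=[-.])', f)
print(bounded)
```

Because findall returns only what matched, there are no empty strings to filter out afterwards.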
>>> f='000014_L_20111007T084734-20111008T023142.txt'
>>> f[9:-4].split('-')
['20111007T084734', '20111008T023142']
or, somewhat more general:
>>> f[f.rfind('_')+1:-4].split('-')
['20111007T084734', '20111008T023142']

how to split the text using python?

f_output.write('\n{}, {}\n'.format(filename, summary))
I am printing the file name as output, and I get VCALogParser_output_ARW.log, VCALogParser_output_CZC.log and so on, but I am only interested in printing ARW, CZC and so on. Can someone please tell me how to split this text?
filename.split('_')[-1].split('.')[0]
this will give you : 'ARW'
summary.split('_')[-1].split('.')[0]
and this will give you: 'CZC'
If you are only interested in CZC and ARW without the .log then, you can do it with re.search method:
>>> import re
>>> s1 = 'VCALogParser_output_ARW.log'
>>> s2 = 'VCALogParser_output_CZC.log'
>>> re.search(r'.*_(.*)\.log', s1).group(1)
'ARW'
>>> re.search(r'.*_(.*)\.log', s2).group(1)
'CZC'
Or, better, compile your pattern p and then call its search method when formatting your string:
>>> p = re.compile(r'.*_(.*)\.log')
>>>
>>> '\n{}, {}\n'.format(p.search(s1).group(1), p.search(s2).group(1))
'\nARW, CZC\n'
Also, re.sub with a positive lookbehind and a named group might be helpful:
>>> p = re.compile(r'.*(?<=_)(?P<mystr>[a-zA-Z0-9]+)\.log$')
>>> p.sub(r'\g<mystr>', s1)
'ARW'
>>> p.sub(r'\g<mystr>', s2)
'CZC'
>>> '\n{}, {}\n'.format(p.sub(r'\g<mystr>', s1), p.sub(r'\g<mystr>', s2))
'\nARW, CZC\n'
If you cannot or do not want to use the re module, you can compute the lengths of the parts you don't need and slice your strings with them:
>>> i1 = len('VCALogParser_output_')
>>> i2 = len ('.log')
>>>
>>> '\n{}, {}\n'.format(s1[i1:-i2], s2[i1:-i2])
'\nARW, CZC\n'
But keep in mind that the above is valid as long as you have those common strings in all of your string variables.
fname.split('_')[-1]
is rough, but it will give you 'CZC.log', 'ARW.log' and so on, assuming that all files have the same underscore-delimited format.
If the format of the file is always such that it ends with _ARW.log or _CZC.log this is really easy to do just using the standard string split() method, with two consecutive splits:
shortname = filename.split("_")[-1].split('.')[0]
Or, to make it (arguably) a bit more readable, we can use the os module:
shortname = os.path.splitext(filename)[0].split("_")[-1]
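The os.path.splitext approach can be wrapped in a small helper. This is only a sketch: the function name short_name is hypothetical, and it assumes the part you want is whatever follows the last underscore.

```python
import os

def short_name(filename):
    """Return the text between the last underscore and the extension."""
    # splitext separates the extension: ('VCALogParser_output_ARW', '.log')
    stem, _ext = os.path.splitext(filename)
    # rpartition splits on the last "_"; if there is no underscore,
    # the whole stem comes back as the final element.
    return stem.rpartition('_')[2]

print(short_name('VCALogParser_output_ARW.log'))  # ARW
print(short_name('VCALogParser_output_CZC.log'))  # CZC
```

Unlike split('.')[0], splitext only removes the final extension, so a stem that itself contains dots is left intact.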
You can also try:
>>> s1 = 'VCALogParser_output_ARW.log'
>>> s2 = 'VCALogParser_output_CZC.log'
>>> s1.split('_')[2].split('.')[0]
'ARW'
>>> s2.split('_')[2].split('.')[0]
'CZC'
Parse the file name correctly: my guess is that you want to strip the file extension .log and the prefix VCALogParser_output_. For that, str.replace is enough; there is no need for str.split.
Use os.linesep when writing to the file to be cross-platform (although for files opened in text mode, the Python documentation recommends writing a plain '\n' instead).
The code below produces the desired result after applying the steps above:
import os
filename = 'VCALogParser_output_SOME_NAME.log'
summary = 'some summary'
fname = filename.replace('VCALogParser_output_', '').replace('.log', '')
linesep = os.linesep
f_output.write('{linesep}{fname}, {summary}{linesep}'
               .format(fname=fname, summary=summary, linesep=linesep))
# or, if the variables in the execution scope are strictly controlled, pass locals() into format
f_output.write('{linesep}{fname}, {summary}{linesep}'.format(**locals()))

python 3 remove before and after on string

I have this string /1B5DB40?full and I want to convert it to 1B5DB40.
I need to remove the ?full and the front /
My site won't always have ?full at the end so I need something that will still work even if the ?full is not there.
Thanks and hopefully this isn't too confusing to get some help :)
EDIT:
I know I could slice at 0 and 8 or whatever, but the 1B5DB40 could be longer or shorter. For example it could be /1B5DB4000?full or /1B5
Using str.lstrip (to remove the leading /) and str.split (to remove the optional part after ?):
>>> '/1B5DB40?full'.lstrip('/').split('?')[0]
'1B5DB40'
>>> '/1B5DB40'.lstrip('/').split('?')[0]
'1B5DB40'
or using urllib.parse.urlparse:
>>> import urllib.parse
>>> urllib.parse.urlparse('/1B5DB40?full').path.lstrip('/')
'1B5DB40'
>>> urllib.parse.urlparse('/1B5DB40').path.lstrip('/')
'1B5DB40'
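You can use lstrip and rstrip:
>>> data = '/1B5DB40?full'
>>> data.lstrip('/').rstrip('?full')
'1B5DB40'
This only works as long as the part you want to extract does not start with / or end with any of the characters ?, f, u, l, because lstrip and rstrip remove sets of characters rather than literal substrings.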
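To make that caveat concrete, here is a small sketch with a hypothetical ID that ends in "ff". str.partition is swapped in as an alternative that cuts at the first "?" instead of stripping characters.

```python
data = '/1B5DBff?full'   # hypothetical ID that happens to end in "ff"

# rstrip treats its argument as a set of characters, so it also eats
# the trailing "ff" that belongs to the ID.
stripped = data.lstrip('/').rstrip('?full')
print(stripped)      # 1B5DB

# partition cuts at the first "?" and leaves the ID intact; without a
# "?" it simply returns the whole string as the first element.
partitioned = data.lstrip('/').partition('?')[0]
print(partitioned)   # 1B5DBff
```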
You can use regular expressions:
>>> import re
>>> extract = re.compile(r'/?(.*?)\?full')
>>> print(extract.search('/1B5DB40?full').group(1))
1B5DB40
>>> print(extract.search('/1Buuuuu?full').group(1))
1Buuuuu
Note that this pattern requires ?full to be present; search returns None for a string without it.
What about regular expressions?
import re
re.search(r'/(?P<your_site>[^\?]+)', '/1B5DB40?full').group('your_site')
In this case it matches everything that is between '/' and '?', but you can change it to your specific requirements.
>>> '/1B5DB40?full'.split('/')[1].split('?')[0]
'1B5DB40'
>>> '/1B5'.split('/')[1].split('?')[0]
'1B5'
>>> '/1B5DB40000?full'.split('/')[1].split('?')[0]
'1B5DB40000'
Split will simply return a single element list containing the original string if the separator is not found.

how to extract string inside single quotes using python script

I have a set of strings as follows:
text:u'MUC-EC-099_SC-Memory-01_TC-25'
text:u'MUC-EC-099_SC-Memory-01_TC-26'
text:u'MUC-EC-099_SC-Memory-01_TC-27'
I extracted this data from an XLS file and converted it to strings.
Now I have to extract the data inside the single quotes and put it in a list.
The expected output looks like:
[MUC-EC-099_SC-Memory-01_TC-25, MUC-EC-099_SC-Memory-01_TC-26,MUC-EC-099_SC-Memory-01_TC-27]
Thanks in advance.
Use re.findall:
>>> import re
>>> strs = """text:u'MUC-EC-099_SC-Memory-01_TC-25'
text:u'MUC-EC-099_SC-Memory-01_TC-26'
text:u'MUC-EC-099_SC-Memory-01_TC-27'"""
>>> re.findall(r"'(.*?)'", strs, re.DOTALL)
['MUC-EC-099_SC-Memory-01_TC-25',
'MUC-EC-099_SC-Memory-01_TC-26',
'MUC-EC-099_SC-Memory-01_TC-27'
]
You can use the following expression:
(?<=')[^']+(?=')
This matches one or more characters that are not ' and that are enclosed between ' and '.
Python Code:
quoted = re.compile("(?<=')[^']+(?=')")
for value in quoted.findall(str(row[1])):
    i.append(value)
print(i)
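One caveat with the lookaround pattern, sketched here on a hypothetical line that holds two quoted values: because the quotes are not consumed, the text between two values can be reported as a match too. A capturing group avoids this.

```python
import re

# Hypothetical input with two quoted values on one line.
line = "'a' 'b'"

# Lookarounds do not consume the quotes, so the closing quote of one
# value can double as the opening quote of the next "match".
with_lookarounds = re.findall(r"(?<=')[^']+(?=')", line)
print(with_lookarounds)   # ['a', ' ', 'b']

# A capturing group consumes both quotes, so only real values come back.
with_group = re.findall(r"'([^']+)'", line)
print(with_group)         # ['a', 'b']
```

With exactly one quoted value per string, as in the question, both patterns give the same result.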
That text: prefix seems a little familiar. Are you using xlrd to extract it? In that case, the reason you have the prefix is because you're getting the wrapped Cell object, not the value in the cell. For example, I think you're doing something like
>>> sheet.cell(2,2)
number:4.0
>>> sheet.cell(3,3)
text:u'C'
To get the unwrapped object, use .value:
>>> sheet.cell(3,3).value
u'C'
(Remember that the u here is simply telling you the string is unicode; it's not a problem.)

python: Match string using regular expression

I am learning regular expressions. Don't understand how to match the following pattern:
" myArray = ["Var1","Var2"]; "
Ideally I want to get the data in the array and to convert into python array
Are the array items guaranteed to be surrounded by double-quotes?
This is a quick and dirty method:
re.findall('"([^,]+)"', source)
where source is your string.
I didn't escape the double-quotes in the regex since you can also use single-quotes in Python.
This returns a list of each item surrounded by double quotes
so in your example: ['Var1', 'Var2']
The complexity of the regular expression depends heavily on how much the input varies. The simplest expressions that match the given string are:
>>> from re import search, findall
>>> s = ' myArray = ["Var1","Var2"]; '
>>> name, body = search(r'\s*(\w*)\s*=\s*\[(.*)\]', s).groups(0)
>>> contents = findall(r'"(\w*)"', body)
>>> name, contents
('myArray', ['Var1', 'Var2'])
"Converting" to python array can be done like this:
>>> globals().update({name: contents})
>>> myArray
['Var1', 'Var2']
However, this is actually a bad idea, as it writes garbage into globals. Instead, try using a separate dictionary or something similar.
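A minimal sketch of the separate-dictionary idea, reusing the regular expressions from this answer. The name parsed is illustrative.

```python
import re

s = ' myArray = ["Var1","Var2"]; '

# Same two-step parse as above: first the name and the bracketed body,
# then the quoted items inside the body.
name, body = re.search(r'\s*(\w*)\s*=\s*\[(.*)\]', s).groups()
contents = re.findall(r'"(\w*)"', body)

# Store the result under its name in an ordinary dict instead of globals().
parsed = {name: contents}
print(parsed['myArray'])  # ['Var1', 'Var2']
```

Keeping parsed values in their own dict means a name in the input can never overwrite a variable or function in your program.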
If you are interested in just getting the data in the array, you can skip using regex and use eval instead.
Consider this:
myArray = eval('["Var1","Var2"]')
If you must use the line you gave in the example, you can also use exec. However this command is somewhat dangerous and needs special care if used.
Without using an re you could use builtin string methods and literal_eval which given your example returns a usable list object:
from ast import literal_eval
text = ' myArray = ["Var1","Var2"]; '
name, arr_text = (el.strip('; ') for el in text.split('='))
arr = literal_eval(arr_text)
print(name, arr)
Then do what you want with name and arr...
