python: Match string using regular expression

python: Match string using regular expression - python

I am learning regular expressions. Don't understand how to match the following pattern:
" myArray = ["Var1","Var2"]; "
Ideally I want to get the data in the array and to convert into python array

Are the array items guaranteed to be surrounded by double-quotes?
This is a quick and dirty method:
re.findall('"([^,]+)"', source)
where source is your string.
I didn't escape the double-quotes in the regex since you can also use single-quotes in Python.
This returns a list of each item surrounded by double quotes
so in your example: ['Var1', 'Var2']

Regular expression complexity differs much depending on variations of input. The easiest expressions that matches given string are:
>>> from re import search, findall
>>> s = ' myArray = ["Var1","Var2"]; '
>>> name, body = search(r'\s*(\w*)\s*=\s*\[(.*)\]', s).groups(0)
>>> contents = findall(r'"(\w*)"', body)
>>> name, contents
('myArray', ['Var1', 'Var2'])
"Converting" to python array can be done like this:
>>> globals().update({name: contents})
>>> myArray
['Var1', 'Var2']
Though it is actually a bad idea as it writes garbage in globals. Instead, try using separate dictionary, or something.

If you are interested in just getting the data in the array, you can skip using regex and use eval instead.
Consider this:
myArray = eval('["Var1","Var2"]')
If you must use the line you gave in the example, you can also use exec. However this command is somewhat dangerous and needs special care if used.

Without using an re you could use builtin string methods and literal_eval which given your example returns a usable list object:
from ast import literal_eval
text = ' myArray = ["Var1","Var2"]; '
name, arr_text = (el.strip('; ') for el in text.split('='))
arr = literal_eval(arr_text)
print name, arr
Then do what you want with name and arr...

Related

Remove brackets and number inside from string Python

I've seen a lot of examples on how to remove brackets from a string in Python, but I've not seen any that allow me to remove the brackets and a number inside of the brackets from that string.
For example, suppose I've got a string such as "abc[1]". How can I remove the "[1]" from the string to return just "abc"?
I've tried the following:
stringTest = "abc[1]"
stringTestWithoutBrackets = str(stringTest).strip('[]')
but this only outputs the string without the final bracket
abc[1
I've also tried with a wildcard option:
stringTest = "abc[1]"
stringTestWithoutBrackets = str(stringTest).strip('[\w+\]')
but this also outputs the string without the final bracket
abc[1

You could use regular expressions for that, but I think the easiest way would be to use split:
>>> stringTest = "abc[1][2][3]"
>>> stringTest.split('[', maxsplit=1)[0]
'abc'

You can use regex but you need to use it with the re module:
re.sub(r'\[\d+\]', '', stringTest)
If the [<number>] part is always at the end of the string you can also strip via:
stringTest.rstrip('[0123456789]')
Though the latter version might strip beyond the [ if the previous character is in the strip list too. For example in "abc1[5]" the "1" would be stripped as well.

Assuming your string has the format "text[number]" and you only want to keep the "text", then you could do:
stringTest = "abc[1]"
bracketBegin = stringTest.find('[')
stringTestWithoutBrackets = stringTest[:bracketBegin]

How to copy changing substring in string?

How can I copy data from changing string?
I tried to slice, but length of slice is changing.
For example in one case I should copy number 128 from string '"edge_liked_by":{"count":128}', in another I should copy 15332 from "edge_liked_by":{"count":15332}

You could use a regular expression:
import re
string = '"edge_liked_by":{"count":15332}'
number = re.search(r'{"count":(\d*)}', string).group(1)

Really depends on the situation, however I find regular expressions to be useful.
To grab the numbers from the string without caring about their location, you would do as follows:
import re
def get_string(string):
return re.search(r'\d+', string).group(0)
>>> get_string('"edge_liked_by":{"count":128}')
'128'
To only get numbers from the *end of the string, you can use an anchor to ensure the result is pulled from the far end. The following example will grab any sequence of unbroken numbers that is both preceeded by a colon and ends within 5 characters of the end of the string:
import re
def get_string(string):
rval = None
string_match = re.search(r':(\d+).{0,5}$', string)
if string_match:
rval = string_match.group(1)
return rval
>>> get_string('"edge_liked_by":{"count":128}')
'128'
>>> get_string('"edge_liked_by":{"1321":1}')
'1'
In the above example, adding the colon will ensure that we only pick values and don't match keys such as the "1321" that I added in as a test.
If you just want anything after the last colon, but excluding the bracket, try combining split with slicing:
>>> '"edge_liked_by":{"count":128}'.split(':')[-1][0:-1]
'128'
Finally, considering this looks like a JSON object, you can add curly brackets to the string and treat it as such. Then it becomes a nested dict you can query:
>>> import json
>>> string = '"edge_liked_by":{"count":128}'
>>> string = '{' + string + '}'
>>> string = json.loads(string)
>>> string.get('edge_liked_by').get('count')
128
The first two will return a string and the final one returns a number due to being treated as a JSON object.

It looks like the type of string you are working with is read from JSON, maybe you are getting it as the output of some API you are working with?
If it is JSON, you've probably gone one step too far in atomizing it to a string like this. I'd work with the original output, if possible, if I were you.
If not, to make it more JSON like, I'd convert it to JSON by wrapping it in {}, and then working with the json.loads module.
import json
string = '"edge_liked_by":{"count":15332}'
string = "{"+string+"}"
json_obj = json.loads(string)
count = json_obj['edge_liked_by']['count']
count will have the desired output. I prefer this option to using regular expressions because you can rely on the structure of the data and reuse the code in case you wish to parse out other attributes, in a very intuitive way. With regular expressions, the code you use will change if the data are decimal, or negative, or contain non-numeric characters.

Does this help ?
a='"edge_liked_by":{"count":128}'
import re
b=re.findall(r'\d+', a)[0]
b
Out[16]: '128'

Why is the split() returning list objects that are empty? [duplicate]

I have the following file names that exhibit this pattern:
000014_L_20111007T084734-20111008T023142.txt
000014_U_20111007T084734-20111008T023142.txt
...
I want to extract the middle two time stamp parts after the second underscore '_' and before '.txt'. So I used the following Python regex string split:
time_info = re.split('^[0-9]+_[LU]_|-|\.txt$', f)
But this gives me two extra empty strings in the returned list:
time_info=['', '20111007T084734', '20111008T023142', '']
How do I get only the two time stamp information? i.e. I want:
time_info=['20111007T084734', '20111008T023142']

I'm no Python expert but maybe you could just remove the empty strings from your list?
str_list = re.split('^[0-9]+_[LU]_|-|\.txt$', f)
time_info = filter(None, str_list)

Don't use re.split(), use the groups() method of regex Match/SRE_Match objects.
>>> f = '000014_L_20111007T084734-20111008T023142.txt'
>>> time_info = re.search(r'[LU]_(\w+)-(\w+)\.', f).groups()
>>> time_info
('20111007T084734', '20111008T023142')
You can even name the capturing groups and retrieve them in a dict, though you use groupdict() rather than groups() for that. (The regex pattern for such a case would be something like r'[LU]_(?P<groupA>\w+)-(?P<groupB>\w+)\.')

If the timestamps are always after the second _ then you can use str.split and str.strip:
>>> strs = "000014_L_20111007T084734-20111008T023142.txt"
>>> strs.strip(".txt").split("_",2)[-1].split("-")
['20111007T084734', '20111008T023142']

Since this came up on google and for completeness, try using re.findall as an alternative!
This does require a little re-thinking, but it still returns a list of matches like split does. This makes it a nice drop-in replacement for some existing code and gets rid of the unwanted text. Pair it with lookaheads and/or lookbehinds and you get very similar behavior.
Yes, this is a bit of a "you're asking the wrong question" answer and doesn't use re.split(). It does solve the underlying issue- your list of matches suddenly have zero-length strings in it and you don't want that.

>>> f='000014_L_20111007T084734-20111008T023142.txt'
>>> f[10:-4].split('-')
['0111007T084734', '20111008T023142']
or, somewhat more general:
>>> f[f.rfind('_')+1:-4].split('-')
['20111007T084734', '20111008T023142']

how to extract string inside single quotes using python script

Have a set of string as follows
text:u'MUC-EC-099_SC-Memory-01_TC-25'
text:u'MUC-EC-099_SC-Memory-01_TC-26'
text:u'MUC-EC-099_SC-Memory-01_TC-27'
These data i have extracted from a Xls file and converted to string,
now i have to Extract data which is inside single quotes and put them in a list.
expecting output like
[MUC-EC-099_SC-Memory-01_TC-25, MUC-EC-099_SC-Memory-01_TC-26,MUC-EC-099_SC-Memory-01_TC-27]
Thanks in advance.

Use re.findall:
>>> import re
>>> strs = """text:u'MUC-EC-099_SC-Memory-01_TC-25'
text:u'MUC-EC-099_SC-Memory-01_TC-26'
text:u'MUC-EC-099_SC-Memory-01_TC-27'"""
>>> re.findall(r"'(.*?)'", strs, re.DOTALL)
['MUC-EC-099_SC-Memory-01_TC-25',
'MUC-EC-099_SC-Memory-01_TC-26',
'MUC-EC-099_SC-Memory-01_TC-27'
]

You can use the following expression:
(?<=')[^']+(?=')
This matches zero or more characters that are not ' which are enclosed between ' and '.
Python Code:
quoted = re.compile("(?<=')[^']+(?=')")
for value in quoted.findall(str(row[1])):
i.append(value)
print i

That text: prefix seems a little familiar. Are you using xlrd to extract it? In that case, the reason you have the prefix is because you're getting the wrapped Cell object, not the value in the cell. For example, I think you're doing something like
>>> sheet.cell(2,2)
number:4.0
>>> sheet.cell(3,3)
text:u'C'
To get the unwrapped object, use .value:
>>> sheet.cell(3,3).value
u'C'
(Remember that the u here is simply telling you the string is unicode; it's not a problem.)

How to split a string by using [] in Python

So from this string:
"name[id]"
I need this:
"id"
I used str.split ('[]'), but it didn't work. Does it only take a single delimiter?

Use a regular expression:
import re
s = "name[id]"
re.find(r"\[(.*?)\]", s).group(1) # = 'id'
str.split() takes a string on which to split input. For instance:
"i,split,on commas".split(',') # = ['i', 'split', 'on commas']
The re module also allows you to split by regular expression, which can be very useful, and I think is what you meant to do.
import re
s = "name[id]"
# split by either a '[' or a ']'
re.split('\[|\]', s) # = ['name', 'id', '']

Either
"name[id]".split('[')[1][:-1] == "id"
or
"name[id]".split('[')[1].split(']')[0] == "id"
or
re.search(r'\[(.*?)\]',"name[id]").group(1) == "id"
or
re.split(r'[\[\]]',"name[id]")[1] == "id"

Yes, the delimiter is the whole string argument passed to split. So your example would only split a string like 'name[]id[]'.
Try eg. something like:
'name[id]'.split('[', 1)[-1].split(']', 1)[0]
'name[id]'.split('[', 1)[-1].rstrip(']')

I'm not a fan of regex, but in cases like it often provides the best solution.
Triptych already recommended this, but I'd like to point out that the ?P<> group assignment can be used to assign a match to a dictionary key:
>>> m = re.match(r'.*\[(?P<id>\w+)\]', 'name[id]')
>>> result_dict = m.groupdict()
>>> result_dict
{'id': 'id'}
>>>

You don't actually need regular expressions for this. The .index() function and string slicing will work fine.
Say we have:
>>> s = 'name[id]'
Then:
>>> s[s.index('[')+1:s.index(']')]
'id'
To me, this is easy to read: "start one character after the [ and finish before the ]".

def between_brackets(text):
return text.partition('[')[2].partition(']')[0]
This will also work even if your string does not contain a […] construct, and it assumes an implied ] at the end in the case you have only a [ somewhere in the string.

I'm new to python and this is an old question, but maybe this?
str.split('[')[1].strip(']')

You can get the value of the list use []. For example, create a list from URL like below with split.
>>> urls = 'http://quotes.toscrape.com/page/1/'
This generates a list like the one below.
>>> print( urls.split("/") )
['http:', '', 'quotes.toscrape.com', 'page', '11', '']
And what if you wanna get value only "http" from this list? You can use like this
>>> print(urls.split("/")[0])
http:
Or what if you wanna get value only "1" from this list? You can use like this
>>> print(urls.split("/")[-2])
1

str.split uses the entire parameter to split a string. Try:
str.split("[")[1].split("]")[0]

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

python: Match string using regular expression - python

I am learning regular expressions. Don't understand how to match the following pattern: " myArray = ["Var1","Var2"]; " Ideally I want to get the data in the array and to convert into python array

Related

Remove brackets and number inside from string Python

How to copy changing substring in string?

Why is the split() returning list objects that are empty? [duplicate]

how to extract string inside single quotes using python script

How to split a string by using [] in Python

Categories

Resources