I receive an input string whose values can be expressed in two possible formats, e.g.:
#short format
data = '"interval":19'
>>> "interval":19
#extended format
data = '"interval":{"t0":19,"tf":19}'
>>> "interval":{"t0":19,"tf":19}
I would like to check whether the short format is used and, if so, convert it to the extended format.
Considering that the string could be composed of multiple values, i.e.
data = '"interval":19,"interval2":{"t0":10,"tf":15}'
>>> "interval":19,"interval2":{"t0":10,"tf":15}
I cannot just say:
if ":{" not in data:
#then short format is used
I would like to code something like:
if ":$(a general int/float/double number)" in data:
#extract the number
#replace ":{number}" with the extended format
I know how to code the replacing part.
I need help implementing the if condition. I think of it as a substring with a variable part: the number is the variable part, while the value name followed by ":" is the rigid part.
"some_value":19
^            ^
rigid format variable part
EDIT - WHY NOT PARSE IT?
I know the string is "JSON-friendly" and that I can convert it into a dictionary and then easily access the values.
Indeed, I already have this solution in my code. But I don't like it since the input string could be multilevel and I need to iterate on the leaf values of the resulting dictionary, independently from the dictionary levels. The latter is not a simple thing to do.
So I was wondering whether a way to act directly on the string exists.
If you replace every key other than t0 and tf that is followed by a number, it should work.
Here is an example on a multilevel string; it could probably be tidied up:
import re
s = '"interval": 19,"t0interval2":{"t0":10,"tf":15},{"deeper": {"other_interval":23}}'
gex = r'("(?!(t0|tf)")\w+":)\s*(\d+)'  # raw string, so \w, \s and \d reach the regex engine unchanged
new_s = re.sub(gex, r'\1 {"t0": \3, "tf": \3}', s)
print(new_s)
>>> print(new_s)
"interval": {"t0": 19, "tf": 19},"t0interval2":{"t0":10,"tf":15},{"deeper": {"other_interval": {"t0": 23, "tf": 23}}}
You could use a regular expression. ("interval":)(\d+) will look for the string '"interval":' followed by one or more digits.
Let's test this:
import re
data = '"interval":19,"interval2":{"t0":10,"tf":15},"interval":25'
result = re.sub(r'("interval":)(\d+)', r'xxx', data)
print(result)
# -> xxx,"interval2":{"t0":10,"tf":15},xxx
We see that we found the correct places. Now we're going to create your target format. Here the matched groups come in handy. In the regular expression ("interval":) is group 1, (\d+) is group 2.
Now we use the content of those groups to create your wanted result.
data = '"interval":19,"interval2":{"t0":10,"tf":15},"interval":25'
result = re.sub(r'("interval":)(\d+)', r'\1{"t0":\2,"tf":\2}', data)
print(result)
# -> "interval":{"t0":19,"tf":19},"interval2":{"t0":10,"tf":15},"interval":{"t0":25,"tf":25}
If there are floating point values involved you'll have to change (\d+) to ([.\d]+).
If you want to match any key made of Unicode word characters, not only interval, you can use the special sequence \w; because a key can be several characters long, the expression becomes \w+.
data = '"interval":19,"interval2":{"t0":10,"tf":15},"Monty":25.4'
result = re.sub(r'("\w+":)([.\d]+)', r'\1{"t0":\2,"tf":\2}', data)
print(result)
# -> "interval":{"t0":19,"tf":19},"interval2":{"t0":{"t0":10,"tf":10},"tf":{"t0":15,"tf":15}},"Monty":{"t0":25.4,"tf":25.4}
Dang! Yes, we found "Monty", but now the values from the second part are found too. We'll have to fix this somehow. Let's see. We don't want ("\w+":) if it's preceded by {, so we're going to use a negative lookbehind assertion: (?<!{)("\w+":). And after the number part ([.\d]+) we don't want a } or another digit, so we're using negative lookahead assertions here: ([.\d]+)(?!})(?!\d).
data = '"interval":19,"interval2":{"t0":10,"tf":15},"Monty":25.4'
result = re.sub(r'(?<!{)("\w+":)([.\d]+)(?!})(?!\d)', r'\1{"t0":\2,"tf":\2}', data)
print(result)
# -> "interval":{"t0":19,"tf":19},"interval2":{"t0":10,"tf":15},"Monty":{"t0":25.4,"tf":25.4}
Hooray, it works!
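One caveat, sketched here under assumptions: [.\d]+ also accepts malformed numbers such as 1.2.3, and for a decimal that sits directly before }, backtracking can still match its integer part. A stricter number pattern plus a lookahead that also rejects a dot may avoid both. The name STRICT and the sample data are illustrative and not part of the answer above.

```python
import re

# Same lookbehind as above. The number must be a well-formed integer or
# decimal and must not be followed by a digit, a dot, or a closing brace.
# (Assumption: no exponents such as 1e5 occur in the input.)
STRICT = re.compile(r'(?<!{)("\w+":)(-?\d+(?:\.\d+)?)(?![\d.}])')

data = '"interval":19,"interval2":{"t0":10.5,"tf":15.5},"Monty":25.4'
result = STRICT.sub(r'\1{"t0":\2,"tf":\2}', data)
print(result)
# -> "interval":{"t0":19,"tf":19},"interval2":{"t0":10.5,"tf":15.5},"Monty":{"t0":25.4,"tf":25.4}
```

Like the original pattern, this still relies on the extended format having exactly the two keys t0 and tf.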
Regular expressions are powerful and fun, but if you start to add more constraints this might become unmanageable.
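When the constraints keep growing, parsing is one alternative. The sketch below assumes the input becomes valid JSON once it is wrapped in braces, and that a dictionary with exactly the keys t0 and tf is already in extended format. The helper name expand is hypothetical.

```python
import json

def expand(node):
    """Recursively replace bare numbers with {"t0": n, "tf": n}."""
    if isinstance(node, dict):
        if set(node) == {"t0", "tf"}:
            return node  # already extended, leave untouched
        return {key: expand(value) for key, value in node.items()}
    if isinstance(node, (int, float)) and not isinstance(node, bool):
        return {"t0": node, "tf": node}
    return node  # strings, lists, booleans and None pass through unchanged

data = '"interval":19,"interval2":{"t0":10,"tf":15},"deep":{"Monty":25.4}'
parsed = json.loads("{" + data + "}")  # assumption: wrapping in braces is enough
expanded = expand(parsed)
print(json.dumps(expanded, separators=(",", ":")))
```

The recursion visits every level, so the nesting depth of the leaf values does not matter.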
I have a string as shown below,
someVariable1='9',someVariable2='some , value, comma,present',somevariable5='N/A',someVariable6='some text,comma,= present,'
I have to split the above string on commas in Python, ignoring commas within quotes, and then create a dictionary to get the values of the variables.
Example:
somedictionary.get('someVariable1')
I am new to Python. How can I achieve this?
Try this regular expression ,(?=(?:[^']*\'[^']*\')*[^']*$) for splitting:
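import re
s = "someVariable1='9',someVariable2='some , value, comma,present',somevariable5='N/A',someVariable6='some text,comma,= present,'"
re.split(",(?=(?:[^']*\'[^']*\')*[^']*$)",s)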
# ["someVariable1='9'",
# "someVariable2='some , value, comma,present'",
# "somevariable5='N/A'",
# "someVariable6='some text,comma,= present,'"]
This uses the lookahead syntax (?=...) to pick out the commas to split on;
The lookahead pattern is (?:[^']*\'[^']*\')*[^']*$
[^']*$ matches any trailing non-' characters up to the end of the string;
The non-capturing group (?:...) wraps the pattern [^']*\'[^']*\' for one pair of single quotes, so (?:...)* requires an even number of quotes between the comma and the end of the string. Only a comma that lies outside any quoted value satisfies this and acts as a delimiter.
This assumes the quotes are always paired.
To convert the above to a dictionary, you can split each sub-expression on the = that directly precedes the opening quote:
lst = re.split(",(?=(?:[^']*\'[^']*\')*[^']*$)",s)
dict_ = {k: v for exp in lst for k, v in [re.split("=(?=\')", exp)]}
dict_
# {'someVariable1': "'9'",
# 'someVariable2': "'some , value, comma,present'",
# 'someVariable6': "'some text,comma,= present,'",
# 'somevariable5': "'N/A'"}
dict_.get('someVariable2')
# "'some , value, comma,present'"
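If the surrounding quotes are not wanted in the values, a different technique is to match each name='value' pair directly instead of splitting. This is only a sketch: it assumes that names consist of word characters and that values never contain a single quote.

```python
import re

s = ("someVariable1='9',someVariable2='some , value, comma,present',"
     "somevariable5='N/A',someVariable6='some text,comma,= present,'")

# Each match yields a (name, value) tuple; the quotes sit outside the groups,
# so they are not part of the captured value.
pairs = re.findall(r"(\w+)='([^']*)'", s)
somedictionary = dict(pairs)

print(somedictionary.get('someVariable2'))
# -> some , value, comma,present
```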
Build a copy of the string, looping through each character of the original string, and keeping track of the number of single-quotes you've encountered.
Whenever you see a comma, refer to the single-quote count. If it's odd (meaning you're currently inside a quoted string), don't add the comma onto the string copy; instead add some unique placeholder value (e.g. something like PEANUTBUTTER that would never actually appear in the string).
When you're finished building the string copy, it won't have any commas inside quotes, because you replaced all those with PEANUTBUTTER, so you can safely split on commas.
Then, in the list you got from splitting, go back and replace PEANUTBUTTER with commas.
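The steps above could look roughly like this. The function name split_outside_quotes is made up for the sketch, and it assumes the placeholder never occurs in the input and that quotes are always paired.

```python
def split_outside_quotes(text, placeholder="PEANUTBUTTER"):
    """Split text on commas that are not inside single quotes."""
    pieces = []
    quote_count = 0
    for ch in text:
        if ch == "'":
            quote_count += 1
        if ch == "," and quote_count % 2 == 1:
            pieces.append(placeholder)  # comma inside quotes: hide it
        else:
            pieces.append(ch)
    protected = "".join(pieces)
    # Split on the remaining commas, then restore the hidden ones.
    return [part.replace(placeholder, ",") for part in protected.split(",")]

s = "someVariable1='9',someVariable2='some , value, comma,present',somevariable5='N/A'"
print(split_outside_quotes(s))
```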
I am new to regular expressions and wrote this to check whether idno has digits 0 to 9 in the first nine characters and V, v, X or x as the last. Is the syntax correct? It raises an error saying two arguments are required.
Another problem is that it should be only 10 characters long. I used separate code to validate that, but can I integrate it into this too?
if len(idno) is 10:
    if re.match("[0-9]{9}[VvXx],idno") == true:
        print "Valid"
You have more wrong there than right, I'm afraid. Note the following:
You should really compare integers by equality (== 10) not identity (is 10) - CPython interns small integers, so your current code will work, but that's an implementation detail you shouldn't rely on;
If you add $ (end of string) to the end, the regular expression will only match strings ten characters long, making the len check unnecessary anyway;
The quotes are in the wrong place, so you're passing a single string to re.match, rather than the pattern and the string you want to match it against - the comma and idno are all part of the pattern parameter;
'true' != 'True': Python is case-sensitive, and the booleans start with capital letters;
re.match returns either an SRE_Match object or None, neither of which == True. However, it's pretty awkward to write == True even where you're only getting True or False, and you can use the fact that Match is truth-y and None is false-y to write the much neater if some_thing: rather than if some_thing == True:; and
Regular expressions already have a shorthand for [0-9]: you can just use \d (digit).
Your code should therefore be:
if re.match(r'\d{9}[VvXx]$', idno):
    #       ^ note 'raw' string, to avoid escaping the backslash
    print("Valid")
You could simplify further using the re.IGNORECASE flag and making the character class for the last character [vx]. A few examples:
>>> import re
>>> for test in ('123456789x', '123456789a', '123abc456x', '123456789xa'):
...     print test, re.match(r'\d{9}[vx]$', test, re.I)
...     #                                         ^ shorter version of IGNORECASE
...
123456789x <_sre.SRE_Match object at 0x10041e308> # valid
123456789a None # wrong final letter
123abc456x None # non-digits in first nine characters
123456789xa None # start matches but ends with additional character
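One detail worth knowing: $ also matches just before a trailing newline, so '123456789x\n' passes the check above. In Python 3.4 and later, re.fullmatch is one way to require that the whole string matches. The sketch below uses Python 3 syntax, and the names ID_PATTERN and is_valid_id are made up for illustration.

```python
import re

# [0-9] is used here because, in Python 3, \d also matches non-ASCII digits
# in str patterns unless re.ASCII is given.
ID_PATTERN = re.compile(r'[0-9]{9}[vx]', re.IGNORECASE)

def is_valid_id(idno):
    """True if idno is exactly nine digits followed by V or X (any case)."""
    return ID_PATTERN.fullmatch(idno) is not None

for test in ('123456789x', '123456789V', '123456789a', '123456789xa', '123456789x\n'):
    print(repr(test), is_valid_id(test))
```

Only the first two test strings are reported as valid; the last one is rejected because of its trailing newline.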
To look through data, I am using regular expressions. One of my regular expressions is (they are dynamic and change based on what the computer needs to look for --- using them to search through data for a game AI):
O,2,([0-9],?){0,},X
After the 2, there can (and most likely will) be other numbers, each followed by a comma.
To my understanding, this will match:
O,2,(any amount of numbers - can be 0 in total, each followed by a comma),X
This is fine, and works (in RegExr) for:
O,4,1,8,6,7,9,5,3,X
X,6,3,7,5,9,4,1,8,2,T
O,2,9,6,7,11,8,X # matches this
O,4,6,9,3,1,7,5,O
X,6,9,3,5,1,7,4,8,O
X,3,2,7,1,9,4,6,X
X,9,2,6,8,5,3,1,X
My issue is that I need to match all the numbers after the original, provided number. So, I want to match (in the example) 9,6,7,11,8.
However, implementing this in Python:
import re
pattern = re.compile("O,2,([0-9],?){0,},X")
matches = pattern.findall(s) # s is the above string
matches is ['8'], the last number, but I need to match all of the numbers after the given (so '9,6,7,11,8').
Note: I need to use pattern.findall because there will be more than one match (I shortened my list of strings, but there are actually around 20 thousand strings), and I need to find the shortest one (as this would be the shortest way for the AI to win).
Is there a way to match the entire string (or just the last numbers after those I provided)?
Thanks in advance!
Use this:
O,2,((?:[0-9],?){0,}),X
See it in action: http://regex101.com/r/cV9wS1
import re
s = '''O,4,1,8,6,7,9,5,3,X
X,6,3,7,5,9,4,1,8,2,T
O,2,9,6,7,11,8,X
O,4,6,9,3,1,7,5,O
X,6,9,3,5,1,7,4,8,O
X,3,2,7,1,9,4,6,X
X,9,2,6,8,5,3,1,X'''
pattern = re.compile("O,2,((?:[0-9],?){0,}),X")
matches = pattern.findall(s) # s is the above string
print(matches)
Outputs:
['9,6,7,11,8']
Explained:
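findall returns the contents of the capturing group, and a repeated capturing group keeps only its last repetition, which is why you got ['8']. By wrapping everything between 2, and ,X in (), you capture the whole run of numbers instead. I then used (?: ) to make the inner group non-capturing, so it no longer shows up in the result.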
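A small comparison may make the difference concrete. The three-line sample string is invented for this sketch, * is used as the equivalent of {0,}, and "shortest" is assumed to mean the fewest comma-separated numbers.

```python
import re

s = "O,2,9,6,7,11,8,X\nO,2,5,X\nX,3,2,7,1,9,4,6,X"

# A repeated capturing group keeps only its last repetition.
last_only = re.findall(r"O,2,([0-9],?)*,X", s)
print(last_only)   # -> ['8', '5']

# Capturing the whole repetition returns every number after "O,2,".
matches = re.findall(r"O,2,((?:[0-9],?)*),X", s)
print(matches)     # -> ['9,6,7,11,8', '5']

# Fewest moves, counted as comma-separated items (assumed meaning of "shortest").
shortest = min(matches, key=lambda m: len(m.split(",")))
print(shortest)    # -> 5
```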
You don't have to use a regex:
split the string on commas into a list
check that item 0 == 'O' and item 1 == '2'
check that the last item == 'X'
check that each of items[2:-1] is a number (str.isdigit)
That's all.
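A minimal sketch of this approach follows. It reads the first item as the letter O (as in the question's pattern) and takes everything between the second and the last item as the numbers. The function name moves_after and its default arguments are assumptions.

```python
def moves_after(line, player="O", first_move="2", end="X"):
    """Return the numbers after first_move, or None if line does not fit."""
    items = line.split(",")
    if len(items) < 3 or items[0] != player or items[1] != first_move:
        return None
    if items[-1] != end:
        return None
    middle = items[2:-1]  # everything between the given move and the end marker
    if not all(item.isdigit() for item in middle):
        return None
    return middle

print(moves_after("O,2,9,6,7,11,8,X"))   # -> ['9', '6', '7', '11', '8']
print(moves_after("O,4,6,9,3,1,7,5,O"))  # -> None
```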
I'm having problems using findall in Python.
I have a text such as:
the name of 33e4853h45y45 is one of the 33e445a64b65 and we want all the 33e5c44598e46 to be matched
So I'm trying to find all occurrences of those alphanumeric strings in the text. The thing is, I know they all have the "33e" prefix.
Right now, I have strings = re.findall(r"(33e+)+", stdout_value) but it doesn't work.
I want to be able to return 33e445a64b65, 33e5c44598e46
Try this:
>>> import re
>>> x="the name of 33e4853h45y45 is one of the 33e445a64b65 and we want all the 33e5c44598e46 to be matched"
>>> re.findall(r"33e\w+",x)
['33e4853h45y45', '33e445a64b65', '33e5c44598e46']
Here's a slight variation:
>>> string = '''the name of 33e4853h45y45 is one of the 33e445a64b65 and we want all the 33e5c44598e46 to be matched'''
>>> re.findall(r"(33e[a-z0-9]+)", string)
['33e4853h45y45', '33e445a64b65', '33e5c44598e46']
Instead of matching any word characters, it will only match digits and lowercase letters after the 33e -- that's what the [a-z0-9]+ means.
If you wanted to also match capital letters, you could replace that part with [a-zA-Z0-9]+ instead.
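To see how these patterns differ, here is a small comparison on an invented sample string that contains uppercase letters and an underscore. It also shows why the pattern from the question falls short: e+ only repeats the letter e, so nothing after the prefix is matched.

```python
import re

text = "ids: 33eAB_12 and 33e9f"

question = re.findall(r"(33e+)+", text)           # pattern from the question
word_chars = re.findall(r"33e\w+", text)          # letters, digits, underscore
lower_only = re.findall(r"33e[a-z0-9]+", text)    # digits and lowercase letters
any_case = re.findall(r"33e[a-zA-Z0-9]+", text)   # digits and letters of either case

print(question)    # -> ['33e', '33e']
print(word_chars)  # -> ['33eAB_12', '33e9f']
print(lower_only)  # -> ['33e9f']
print(any_case)    # -> ['33eAB', '33e9f']
```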