Python and Regex - extracting a number from a string - python

I'm new to regex, and I'm starting to sort of get the hang of things. I have a string that looks like this:
This is a generated number #123 which is an integer.
The text that I've shown here around the 123 will always stay exactly the same, but it may have further text on either side. The number may be 123, 597392, or really any run of one or more digits. I believe I can match the number and the following text using \d+(?= which is an integer.), but how do I write the look-behind part?
When I try (?<=This is a generated number #)\d+(?= which is an integer.), it does not match using regexpal.com as a tester.
Also, how would I use python to get this into a variable (stored as an int)?
NOTE: I only want to find the numbers that are sandwiched in between the text I've shown. The string might be much longer with many more numbers.

You don't really need a fancy regex. Just use a group on what you want.
re.search(r'#(\d+)', 'This is a generated number #123 which is an integer.').group(1)
If you want to match a number in the middle of some known text, follow the same rule:
r'some text you know (\d+) other text you also know'

import re
res = re.search(r'#(\d+)', 'This is a generated number #123 which is an integer.')
if res is not None:
    integer = int(res.group(1))

You can just use findall() from the re module.
import re
string = "This is a string that contains #134534 and other things"
match = re.findall(r'#\d+ .+', string)
print(match)
The output would be ['#134534 and other things'].
This matches a number of any length (#123 or #123235345), then a space, then the rest of the line up to the next newline character.

If you want to get the numbers only when they follow the text "This is a generated number #" AND are followed by " which is an integer.", you don't need look-behind and look-ahead. You can simply match the whole string and capture the digits in a group, like:
"This is a generated number #(\d+) which is an integer."
I am not sure if I understood what you really want though. :)
updated
In [16]: a='This is a generated number #123 which is an integer.'
In [17]: b='This should be a generated number #123 which could be an integer.'
In [18]: exp=r"This is a generated number #(\d+) which is an integer."
In [19]: result =re.search(exp, a)
In [20]: int(result.group(1))
Out[20]: 123
In [21]: result = re.search(exp,b)
In [22]: result is None
Out[22]: True
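For reference, the look-behind pattern from the question is itself valid in Python's re module, because the look-behind text has a fixed width; the tester that was used may simply not support it. A minimal sketch, assuming a made-up sample string (the names sample and numbers are illustrative), that collects every sandwiched number as an int:

```python
import re

# Hypothetical input: two "sandwiched" numbers plus an unrelated one (#999).
sample = (
    "Intro #999. This is a generated number #123 which is an integer. "
    "More text. This is a generated number #597392 which is an integer."
)

# Fixed-width look-behind and look-ahead; only the digits themselves are matched.
pattern = re.compile(
    r"(?<=This is a generated number #)\d+(?= which is an integer\.)"
)

numbers = [int(m) for m in pattern.findall(sample)]
print(numbers)  # [123, 597392]
```

The capture-group version shown above gives the same result and is usually easier to read; the look-around form is mainly useful when the match itself must contain only the digits.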

Related

Check if a variable substring is in a string

I receive an input string having values expressed in two possible formats. E.g.:
#short format
data = '"interval":19'
>>> "interval":19
#extended format
data = '"interval":{"t0":19,"tf":19}'
>>> "interval":{"t0":19,"tf":19}
I would like to check whether a short format is used and, in case, make it extended.
Considering that the string could be composed of multiple values, i.e.
data = '"interval":19,"interval2":{"t0":10,"tf":15}'
>>> "interval":19,"interval2":{"t0":10,"tf":15}
I cannot just say:
if ":{" not in data:
    #then short format is used
I would like to code something like:
if ":$(a general int/float/double number)" in data:
    #extract the number
    #replace ":{number}" with the extended format
I know how to code the replacing part.
I need help implementing the if condition: in my mind, I model it as a variable substring, in which the variable part is the number inside it, while the rigid format is the $(value name) + ":" part.
"some_value":19
^            ^
rigid format variable part
EDIT - WHY NOT PARSE IT?
I know the string is "JSON-friendly" and I can convert it into a dictionary, easily accessing then the values.
Indeed, I already have this solution in my code. But I don't like it since the input string could be multilevel and I need to iterate on the leaf values of the resulting dictionary, independently from the dictionary levels. The latter is not a simple thing to do.
So I was wondering whether a way to act directly on the string exists.
If you replace all keys, except t0, tf, followed by numbers, it should work.
I show you an example on a multilevel string, probably to be put in a better shape:
import re
s = '"interval": 19,"t0interval2":{"t0":10,"tf":15},{"deeper": {"other_interval":23}}'
gex = r'("(?!(t0|tf)")\w+":)\s*(\d+)'
new_s = re.sub(gex, r'\1 {"t0": \3, "tf": \3}', s)
print(new_s)
>>> print(new_s)
"interval": {"t0": 19, "tf": 19},"t0interval2":{"t0":10,"tf":15},{"deeper": {"other_interval": {"t0": 23, "tf": 23}}}
You could use a regular expression. ("interval":)(\d+) will look for the string '"interval":' followed by any number of digits.
Let's test this
data = '"interval":19,"interval2":{"t0":10,"tf":15},"interval":25'
result = re.sub(r'("interval":)(\d+)', r'xxx', data)
print(result)
# -> xxx,"interval2":{"t0":10,"tf":15},xxx
We see that we found the correct places. Now we're going to create your target format. Here the matched groups come in handy. In the regular expression ("interval":) is group 1, (\d+) is group 2.
Now we use the content of those groups to create your wanted result.
data = '"interval":19,"interval2":{"t0":10,"tf":15},"interval":25'
result = re.sub(r'("interval":)(\d+)', r'\1{"t0":\2,"tf":\2}', data)
print(result)
# -> "interval":{"t0":19,"tf":19},"interval2":{"t0":10,"tf":15},"interval":{"t0":25,"tf":25}
If there are floating point values involved you'll have to change (\d+) to ([.\d]+).
If you want any Unicode standard word characters and not only interval you can use the special sequence \w and because it could be multiple characters the expression will be \w+.
data = '"interval":19,"interval2":{"t0":10,"tf":15},"Monty":25.4'
result = re.sub(r'("\w+":)([.\d]+)', r'\1{"t0":\2,"tf":\2}', data)
print(result)
# -> "interval":{"t0":19,"tf":19},"interval2":{"t0":{"t0":10,"tf":10},"tf":{"t0":15,"tf":15}},"Monty":{"t0":25.4,"tf":25.4}
Dang! Yes, we found "Monty" but now the values from the second part are found too. We'll have to fix this somehow. Let's see. We don't want ("\w+") if it's preceded by { so we're going to use a negative lookbehind assertion: (?<!{)("\w+"). And after the number part (\d+) we don't want a } or another digit, so we're using negative lookahead assertions here: ([.\d]+)(?!})(?!\d).
data = '"interval":19,"interval2":{"t0":10,"tf":15},"Monty":25.4'
result = re.sub(r'(?<!{)("\w+":)([.\d]+)(?!})(?!\d)', r'\1{"t0":\2,"tf":\2}', data)
print(result)
# -> "interval":{"t0":19,"tf":19},"interval2":{"t0":10,"tf":15},"Monty":{"t0":25.4,"tf":25.4}
Hooray, it works!
Regular expressions are powerful and fun, but if you start to add more constraints this might become unmanageable.
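For comparison, the parse-and-walk route mentioned in the question can be kept fairly short with a recursive function. This is only a sketch under assumptions: that the fragment becomes valid JSON once wrapped in braces, and that any dictionary whose keys are exactly t0 and tf is already in extended form. The name expand_leaves and the sample data are hypothetical.

```python
import json

def expand_leaves(node):
    """Recursively replace bare numeric leaves with {"t0": n, "tf": n}."""
    if isinstance(node, dict):
        # Already in extended format: leave untouched.
        if set(node) == {"t0", "tf"}:
            return node
        return {key: expand_leaves(value) for key, value in node.items()}
    if isinstance(node, list):
        return [expand_leaves(item) for item in node]
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(node, (int, float)) and not isinstance(node, bool):
        return {"t0": node, "tf": node}
    return node

# Assumed input: the fragment from the question plus one nested level.
data = '"interval":19,"interval2":{"t0":10,"tf":15},"deeper":{"other":23.5}'
parsed = json.loads("{" + data + "}")
expanded = expand_leaves(parsed)

# Strip the outer braces again to get back to the fragment form.
print(json.dumps(expanded, separators=(",", ":"))[1:-1])
```

The recursion handles any nesting depth, which is the part that becomes hard to express with look-around assertions.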

how to check if a beautiful soup object contains numbers

I am trying to scrape a page and am having trouble checking whether a BeautifulSoup element contains numbers. If it does, I would like to clean the string and keep just the number, which is a zip code. But before I clean it, I have to check whether the element has a zip code at all.
I search for the element with the following code:
soup.find("span",{"class": "locality"}).get_text()
Output: 68549 Ilvesheim, Baden-Württemberg,
I tried to check the string with the following code, but it always returns False:
soup.find("span",{"class": "locality"}).get_text().isalnum()
soup.find("span",{"class": "locality"}).get_text().isdigit()
Is there another way to check it? Since it contains "68549", it should return True.
You could use this simple function to check if a string contains numbers:
def hasNumbers(inputString):
    return any(char.isdigit() for char in inputString)
But I think this is an XY problem, and what you are really looking for is a regex to extract a zip code, check out the following:
\s(\d+)\s (You may have to change this up depending on the acceptable forms of a zip code)
>>> s = 'Output: 68549 Ilvesheim, Baden-Württemberg,'
>>> re.findall(r'\s(\d+)\s', s)
['68549']
If the string does not contain a zip code, you can check for this by just making sure the length of the result re.findall() is 0:
>>> re.findall(r'\s(\d+)\s', 'No zip code here!')
[]
Using Regex:
import re
hasnumber = re.findall(r'\d+', "68549 Ilvesheim, Baden-Württemberg")
if hasnumber:
    print(hasnumber)
Output:
['68549']
If you are looking for zip codes, you might want to consider the valid ranges. For example German ZIP codes are exactly 5 digits in length:
import re
for test in ['68549 Ilvesheim, Baden-Württemberg', 'test 01234', 'test 2 123456789', 'inside (56089)']:
    if len(re.findall(r'\b\d{5}\b', test)):
        print("'{}' has zipcode".format(test))
Of these four examples, only the third does not match as a zip code:
'68549 Ilvesheim, Baden-Württemberg' has zipcode
'test 01234' has zipcode
'inside (56089)' has zipcode
The {5} tells the regex to match exactly 5 digits, with \b ensuring a word boundary on either side. If you want five or six digits, use {5,6}.
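Putting the check and the extraction together, a small helper can return the zip code or None in a single step. This is a minimal sketch assuming five-digit codes as described above; the name extract_zip is hypothetical, and the input string stands in for the result of get_text().

```python
import re

# Exactly five digits with a word boundary on both sides.
ZIP_PATTERN = re.compile(r"\b\d{5}\b")

def extract_zip(text):
    """Return the first standalone 5-digit run in text, or None."""
    match = ZIP_PATTERN.search(text)
    return match.group(0) if match else None

locality = "68549 Ilvesheim, Baden-Württemberg,"
print(extract_zip(locality))             # 68549
print(extract_zip("No zip code here!"))  # None
```

Because the helper returns None when nothing matches, the caller can test the result directly instead of calling isdigit() on the whole string.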

How to attach a string to an integer in Python 3

I've got a two-letter word that I'd like to attach to a double-digit number. The word is a string and the number is an integer.
Say the name of the number is "number" and the name of the word is "word".
How would you make it print both of them together without spaces. When I try it right now it still has a space between them regardless of what I try.
Thanks !
'{}{}'.format(word, number)
For example,
In [19]: word='XY'
In [20]: number=123
In [21]: print('{}{}'.format(word, number))
XY123
The print function has a sep parameter that controls spacing between items:
print(number, word, sep="")
If you need a string, rather than printing, then unutbu's answer with string formatting is better, but this may get you to your desired results with fewer steps.
In python 3 the preferred way to construct strings is by using format
To print out a word and a number joined together you would use:
print("{}{}".format(word, number))
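On Python 3.6 and later, an f-string is another way to write the same thing. A short sketch, with the sample values for word and number assumed for illustration:

```python
word = "XY"    # assumed two-letter word (a string)
number = 12    # assumed double-digit number (an int)

joined = f"{word}{number}"        # f-string, available from Python 3.6
also_joined = word + str(number)  # explicit conversion, then concatenation

print(joined)  # XY12
```

Both forms produce a new string, so they also work when the result needs to be stored rather than printed.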

How do I extract some string from a long string in Python?

I have a lot of long strings - not all of them have the same length and content, so that's why I can't use indices - and I want to extract a string from all of them. This is what I want to extract:
http://www.someDomainName.com/anyNumber
SomeDomainName doesn't contain any numbers, and anyNumber is different in each long string. The code should extract the desired string from any string possible and should take into account spaces and any other weird thing that might appear in the long string (this should be possible with regex, right?). Could anybody help me with this? Thank you.
Update: I should have said that www. and .com are always the same. Also someDomainName! But there's another http://www. in the string
import re
results = re.findall(r'\bhttp://www\.someDomainName\.com/\d+\b', long_string)
>>> import re
>>> pattern = re.compile("(http://www\\.)(\\w*)(\\.com/)(\\d+)")
>>> matches = pattern.search("http://www.someDomainName.com/2134")
>>> if matches:
...     print(matches.group(0))
...     print(matches.group(1))
...     print(matches.group(2))
...     print(matches.group(3))
...     print(matches.group(4))
...
http://www.someDomainName.com/2134
http://www.
someDomainName
.com/
2134
In the above pattern, we have captured 5 groups:
Group 0 is the complete string that is matched.
The rest are in the order of the brackets you see (so you are looking for the second one, (\\w*)).
If you want, you can capture only the part of the string you are interested in: remove the brackets from the rest of the pattern and keep just (\\w*).
>>> pattern = re.compile("http://www\\.(\\w*)\\.com/\\d+")
>>> matches = pattern.search("http://www.someDomainName.com/2134")
>>> if matches:
...     print(matches.group(1))
...
someDomainName
In the above example, groups 2, 3 and 4 from the previous example no longer exist, because we have captured only one group. Group 0 is always available: it is the complete string that matches.
Yeah, your simplest bet is regex. Here's something that will probably get the job done:
import re
matcher = re.compile(r'www\.(.+)\.com/(.+)')
matches = matcher.search(yourstring)
if matches:
    str1, str2 = matches.groups()
If you are sure that there are no dots in SomeDomainName, you can just take the first occurrence of the string ".com/" and take everything from that index on.
This avoids regular expressions, which are harder to maintain:
exp = 'http://www.aejlidjaelidjl.com/alieilael'
print(exp[exp.find('.com/')+5:])
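To address the update in the question (a second http://www. elsewhere in the text), anchoring the pattern on the fixed domain name is enough to skip the other URL. A minimal sketch, assuming a made-up long_string; the other URL and the surrounding noise are illustrative only:

```python
import re

# Hypothetical long string containing another http://www. URL and some noise.
long_string = (
    "See http://www.example.org/about for background, then visit "
    "http://www.someDomainName.com/2134 (spaces  and\tother noise) today."
)

# The domain is fixed, so only the trailing number is captured.
url_pattern = re.compile(r"\bhttp://www\.someDomainName\.com/(\d+)\b")

match = url_pattern.search(long_string)
if match:
    url = match.group(0)          # the whole URL
    number = int(match.group(1))  # just the trailing number
    print(url, number)
```

Use findall() or finditer() instead of search() if the desired URL can appear more than once.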

Python Regex to match a string as a pattern and return number

I have some lines that represent some data in a text file. They are all of the following format:
s = 'TheBears SUCCESS Number of wins : 14'
They all begin with the name then whitespace and the text 'SUCCESS Number of wins : ' and finally the number of wins, n1. There are multiple strings each with a different name and value. I am trying to write a program that can parse any of these strings and return the name of the dataset and the numerical value at the end of the string. I am trying to use regular expressions to do this and I have come up with the following:
import re
def winnumbers(s):
    pattern = re.compile(r"""(?P<name>.*?) #starting name
                         \s*SUCCESS #whitespace and success
                         \s*Number\s*of\s*wins #whitespace and strings
                         \s*\:\s*(?P<n1>.*?)""",re.VERBOSE)
    match = pattern.match(s)
    name = match.group("name")
    n1 = match.group("n1")
    return (name, n1)
So far, my program can return the name, but the trouble comes after that. They all have the text "SUCCESS Number of wins : " so my thinking was to find a way to match this text. But I realize that my method of matching an exact substring isn't correct right now. Is there any way to match a whole substring as part of the pattern? I have been reading quite a bit on regular expressions lately but haven't found anything like this. I'm still really new to programming and I appreciate any assistance.
Eventually, I will use float() to return n1 as a number, but I left that out because it doesn't properly find the number in the first place right now and would only return an error.
Try this one out:
((\S+)\s+SUCCESS Number of wins : (\d+))
These are the results:
>>> regex = re.compile(r"((\S+)\s+SUCCESS Number of wins : (\d+))")
>>> r = regex.search(string)
>>> r
<_sre.SRE_Match object at 0xc827cf478a56b350>
>>> regex.match(string)
<_sre.SRE_Match object at 0xc827cf478a56b228>
# List the groups found
>>> r.groups()
(u'TheBears SUCCESS Number of wins : 14', u'TheBears', u'14')
# List the named dictionary objects found
>>> r.groupdict()
{}
# Run findall
>>> regex.findall(string)
[(u'TheBears SUCCESS Number of wins : 14', u'TheBears', u'14')]
# So you can do this for the name and number:
>>> fullstring, name, number = r.groups()
If you don't need the full string, just remove the surrounding parentheses.
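The pattern in the question ends in a lazy (?P<n1>.*?), which is allowed to match the empty string, so n1 comes back empty. Below is a sketch of the same named-group, re.VERBOSE idea with \d+ for the number and a guard for lines that do not match. It assumes the name contains no whitespace; the constant name WINS_PATTERN is hypothetical.

```python
import re

WINS_PATTERN = re.compile(r"""
    (?P<name>\S+)          # team name: a run of non-whitespace characters
    \s+SUCCESS             # whitespace, then the literal word SUCCESS
    \s+Number\s+of\s+wins  # fixed label; literal spaces are ignored in VERBOSE mode
    \s*:\s*                # the colon with optional whitespace around it
    (?P<n1>\d+)            # the number of wins (one or more digits)
""", re.VERBOSE)

def winnumbers(s):
    """Return (name, wins) for a matching line, or None otherwise."""
    match = WINS_PATTERN.match(s)
    if match is None:
        return None
    return match.group("name"), int(match.group("n1"))

print(winnumbers("TheBears SUCCESS Number of wins : 14"))  # ('TheBears', 14)
```

Replace int() with float() if the values can be fractional, and widen the number group accordingly.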
I believe there is no real need for a regex here, so you can use the following code if it is acceptable to you (I am posting it so that you have another option):
dict((line[:line.lower().index('success')].strip(), line[line.lower().index('wins :') + 6:].strip()) for line in text.split('\n') if 'success' in line.lower())
Or, if you are sure that all words are separated by single spaces:
output = {}
for line in text.split('\n'):
    if 'success' in line.lower():
        words = line.strip().split(' ')
        output[words[0]] = words[-1]
If the text in the middle is always constant, there is no need for a regular expression. The inbuilt string processing functions will be more efficient and easier to develop, debug and maintain. In this case, you can just use the inbuilt split() function to get the pieces, and then clean the two pieces as appropriate:
>>> def winnumber(s):
...     parts = s.split('SUCCESS Number of wins : ')
...     return (parts[0].strip(), int(parts[1]))
...
>>> winnumber('TheBears SUCCESS Number of wins : 14')
('TheBears', 14)
Note that I have output the number of wins as an integer (as presumably this will always be a whole number), but you can easily substitute float()- or any other conversion function - for int() if you desire.
Edit: Obviously this will only work for single lines - if you call the function with several lines it will give you errors. To process an entire file, I'd use map():
>>> list(map(winnumber, open(filename, 'r')))
[('TheBears', 14), ('OtherTeam', 6)]
Also, I'm not sure of your end use for this code, but you might find it easier to work with the outputs as a dictionary:
>>> dict(map(winnumber, open(filename, 'r')))
{'OtherTeam': 6, 'TheBears': 14}
