I am trying to scrape a page and am having trouble checking whether a BeautifulSoup element contains numbers. If it does, I would like to clean the string and keep only the number, which is a zip code. But before I clean it, I have to check whether the element has a zip code at all.
I find the element with the following code:
soup.find("span",{"class": "locality"}).get_text()
Output: 68549 Ilvesheim, Baden-Württemberg,
I tried to check the string with the following code, but it always returns False:
soup.find("span", {"class": "locality"}).get_text().isalnum()
soup.find("span", {"class": "locality"}).get_text().isdigit()
Is there another way to check it? Since the string contains "68549", I expected True.
You could use this simple function to check if a string contains numbers:
def hasNumbers(inputString):
    return any(char.isdigit() for char in inputString)
But I think this is an XY problem, and what you are really looking for is a regex to extract a zip code, check out the following:
\s(\d+)\s (You may have to change this up depending on the acceptable forms of a zip code)
>>> import re
>>> s = 'Output: 68549 Ilvesheim, Baden-Württemberg,'
>>> re.findall(r'\s(\d+)\s', s)
['68549']
If the string does not contain a zip code, you can detect this by checking that the list returned by re.findall() is empty:
>>> re.findall(r'\s(\d+)\s', 'No zip code here!')
[]
Using Regex:
import re
hasnumber = re.findall(r'\d+', "68549 Ilvesheim, Baden-Württemberg")
if hasnumber:
    print(hasnumber)
Output:
['68549']
If you are looking for zip codes, you might want to consider the valid ranges. For example German ZIP codes are exactly 5 digits in length:
import re
for test in ['68549 Ilvesheim, Baden-Württemberg', 'test 01234', 'test 2 123456789', 'inside (56089)']:
    if re.findall(r'\b\d{5}\b', test):
        print("'{}' has zipcode".format(test))
Of these four examples, the third does not match as a zip code:
'68549 Ilvesheim, Baden-Württemberg' has zipcode
'test 01234' has zipcode
'inside (56089)' has zipcode
The {5} tells the regex to match exactly 5 digits, with \b ensuring a word boundary on either side. If you want five or six digits, use {5,6}.
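To tie this back to the original question, here is a minimal sketch that returns the zip code itself, or None when the string has none. The helper name extract_zip is made up for illustration, and the five-digit rule is the German-format assumption from above.

```python
import re

# Assumption: a zip code is exactly five digits (German format).
ZIP_RE = re.compile(r'\b\d{5}\b')

def extract_zip(text):
    """Return the first five-digit zip code in text, or None if there is none."""
    match = ZIP_RE.search(text)
    return match.group(0) if match else None

print(extract_zip('68549 Ilvesheim, Baden-Württemberg,'))
print(extract_zip('test 2 123456789'))
```

Checking the return value against None covers both steps at once: testing whether a zip code is present and cleaning the string down to it.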
This is in python
Input string:
Str = 'Y=DAT,X=ZANG,FU=_COG-GAB-CANE-,FU=FARE,T=TART,RO=TOP,FU=#-_MAP.com-,Z=TRY'
Expected output
'FU=_COG-GAB-CANE_,FU=FARE,FU=#-_MAP.com_'
Here 'FU=' is the key we are looking for, together with the value that follows it.
Return all occurrences of FU= (with their associated values) as a comma-separated string. They can occur anywhere within the string, and special characters are allowed.
Here is one approach.
>>> import re
>>> str_ = 'Y=DAT,X=ZANG,FU=FAT,T=TART,FU=GEM,RO=TOP,FU=MAP,Z=TRY'
>>> re.findall.__doc__[:58]
'Return a list of all non-overlapping matches in the string'
>>> re.findall(r'FU=\w+', str_)
['FU=FAT', 'FU=GEM', 'FU=MAP']
>>> ','.join(re.findall(r'FU=\w+', str_))
'FU=FAT,FU=GEM,FU=MAP'
Got it working
Python Code
import re
str_ = 'Y=DAT,X=ZANG,FU=_COG-GAB-CANE-,FU=FARE,T=TART,RO=TOP,FU=#-_MAP.com-,Z=TRY'
str2='FU='+',FU='.join(re.findall(r'FU=(.*?),', str_))
print(str2)
Gives the desired output:
'FU=_COG-GAB-CANE-,FU=FARE,FU=#-_MAP.com-'
This gives me all occurrences of FU= followed by their values, irrespective of order and of the number of special characters.
It is a bit unclean, though, since I am manually adding FU= for the first occurrence.
Please suggest a cleaner way of doing it if there is one, but it does get the work done.
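One possibly cleaner variant, offered as a sketch rather than the definitive answer: keep the FU= prefix inside the match so nothing has to be re-added by hand. The function name extract_pairs and its key parameter are hypothetical. The split-based version assumes values never contain a comma.

```python
import re

str_ = 'Y=DAT,X=ZANG,FU=_COG-GAB-CANE-,FU=FARE,T=TART,RO=TOP,FU=#-_MAP.com-,Z=TRY'

def extract_pairs(text, key='FU'):
    """Keep only the comma-separated items that start with 'key='."""
    prefix = key + '='
    return ','.join(item for item in text.split(',') if item.startswith(prefix))

# Regex alternative: match 'FU=' plus everything up to the next comma.
regex_version = ','.join(re.findall(r'FU=[^,]*', str_))

print(extract_pairs(str_))
print(regex_version)
```

Unlike the pattern FU=(.*?), neither version needs a trailing comma, so an FU= item at the very end of the string is still picked up. The regex version would also match inside a longer key that merely ends in FU=, which the split version avoids.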
I have a string from which I want to take the values within the parentheses, and then get the individual values that are separated by commas.
Example: x(142,1,23ERWA31)
I would like to get:
142
1
23ERWA31
Is it possible to get everything with one regex?
I have found a method to do so, but it is ugly.
This is how I did it in python:
import re
string = "x(142,1,23ERWA31)"
firstResult = re.search(r"\((.*?)\)", string)
secondResult = re.search(r"(?<=\()(.*?)(?=\))", firstResult.group(0))
finalResult = [x.strip() for x in secondResult.group(0).split(',')]
for i in finalResult:
    print(i)
142
1
23ERWA31
This works for your example string:
import re
string = "x(142,1,23ERWA31)"
l = re.findall(r'([^(,)]+)(?!.*\()', string)
print(l)
Result: a plain list
['142', '1', '23ERWA31']
The expression matches a run of characters that are none of (, a comma, or ), and, to keep the leading x from being picked up, the run may not be followed by a ( anywhere further in the string. This also makes it work if your preamble x consists of more than a single character.
findall rather than search makes sure all items are found, and as a bonus it returns a plain list of the results.
You can make this a lot simpler. You are running your first Regex but then not taking the result. You want .group(1) (inside the brackets), not .group(0) (the whole match). Once you have that you can just split it on ,:
import re
string = "x(142,1,23ERWA31)"
firstResult = re.search(r"\((.*?)\)", string)
for e in firstResult.group(1).split(','):
    print(e)
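The same idea can be wrapped in a small function that also copes with input lacking parentheses, where re.search returns None and .group(1) would raise an AttributeError. This is only a sketch; the name values_in_parens is made up for illustration.

```python
import re

def values_in_parens(text):
    """Return the comma-separated values inside the first (...) in text."""
    match = re.search(r'\(([^()]*)\)', text)
    if match is None:
        # No parenthesised part at all.
        return []
    return [item.strip() for item in match.group(1).split(',')]

print(values_in_parens("x(142,1,23ERWA31)"))
print(values_in_parens("no parentheses here"))
```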
A little wonky looking, and also assuming there's always going to be a grouping of 3 values in the parenthesis - but try this regex
\((.*?),(.*?),(.*?)\)
To extract all the group matches to a single object - your code would then look like
import re
string = "x(142,1,23ERWA31)"
firstResult = re.search(r"\((.*?),(.*?),(.*?)\)", string).groups()
You can then index the firstResult object like a list (it is a tuple)
>>> print(firstResult[2])
23ERWA31
I'm trying to find all instances of a specific substring (a!b2 as an example) and return the 4 characters that follow each match. These 4 characters are dynamic and can be any letter, digit or symbol.
I've tried searching, but it seems like the similar questions that are asked are requesting help with certain characters that can easily split a substring, but since the characters I'm looking for are dynamic, I'm not sure how to write the regex.
When using regex, you can use "." to dynamically match any character. Use {number} to specify how many characters to match, and use parentheses as in (.{number}) to specify that the match should be captured for later use.
>>> import re
>>> s = "a!b2foobar a!b2bazqux a!b2spam and eggs"
>>> print(re.findall("a!b2(.{4})", s))
['foob', 'bazq', 'spam']
import re
s = "a!b2foobar a!b2bazqux"  # example input
print(re.search(r'a!b2(.{4})', s).group(1))
.{4} matches any 4 characters except newlines. group(1) is the captured part, i.e. those 4 characters.
group(0) is the complete match of the searched string. You can read about group id here.
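If the substring might contain characters that are special in a regex (such as . or *), re.escape makes it safe to embed in the pattern. A minimal sketch, with the function name following_chars and the parameter n chosen purely for illustration:

```python
import re

def following_chars(text, sub, n=4):
    """Return the n characters that follow each occurrence of sub in text."""
    # re.escape treats every character of sub literally.
    pattern = re.escape(sub) + '(.{%d})' % n
    return re.findall(pattern, text)

print(following_chars("a!b2foobar a!b2bazqux a!b2spam and eggs", "a!b2"))
print(following_chars("a.b1234 axb5678", "a.b"))
```

Without the escaping, the second call would also match axb5678, because an unescaped dot matches any character.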
If you're only looking for how to grab the following 4 characters using regex, what you probably want is the curly-brace quantifier: '{}'.
They go into more detail in the post here, but essentially you would use [a-zA-Z0-9]{X,Y} or (.{X,Y}), where X and Y are the minimum and maximum number of characters you're looking for (in your case, you would only need {4}).
A more Pythonic way to solve this problem, however, would be to use string slicing and the index method.
E.g. given an input_string, once you find the substring at index i using index, you can use input_string[i+len(sub_str):i+len(sub_str)+4] to grab the characters that follow.
As an example,
input_string = 'abcdefg'
sub_str = 'abcd'
found_index = input_string.index(sub_str)
start_index = found_index + len(sub_str)
symbol = input_string[start_index:start_index + 4]
print(symbol)
Outputs (to show it works with fewer than 4 characters as well): efg
index also accepts start and end positions for the search, so you could use it in a loop to handle every occurrence of the substring, starting each search at the previously found index + 1.
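The loop described above could look roughly like this. It is a sketch that uses str.find instead of str.index, because find returns -1 when nothing is left rather than raising ValueError; the function name chars_after_each is made up.

```python
def chars_after_each(text, sub, n=4):
    """Collect the (up to) n characters following every occurrence of sub."""
    results = []
    start = 0
    while True:
        found = text.find(sub, start)
        if found == -1:
            # No further occurrence.
            break
        begin = found + len(sub)
        results.append(text[begin:begin + n])
        # Resume the search one position after the previous hit.
        start = found + 1
    return results

print(chars_after_each("a!b2foobar a!b2bazqux a!b2spam and eggs", "a!b2"))
```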
I have a lot of long strings - not all of them have the same length and content, so that's why I can't use indices - and I want to extract a string from all of them. This is what I want to extract:
http://www.someDomainName.com/anyNumber
someDomainName doesn't contain any numbers, and anyNumber is different in each long string. The code should extract the desired string from any possible input and should cope with spaces and any other odd content that might appear in the long string. This should be possible with a regex, right? Could anybody help me with this? Thank you.
Update: I should have said that www. and .com are always the same. Also someDomainName! But there's another http://www. in the string
import re
results = re.findall(r'\bhttp://www\.someDomainName\.com/\d+\b', long_string)
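As a quick illustration of that pattern, here is a sketch run against a made-up long_string that also contains a second, unrelated http://www. address, as mentioned in the update. The sample text and the otherSite domain are invented for the example.

```python
import re

# Hypothetical input with two URLs; only the second is the wanted one.
long_string = ('Some text http://www.otherSite.com/about more text '
               'http://www.someDomainName.com/4711 and a trailing note')

results = re.findall(r'\bhttp://www\.someDomainName\.com/\d+\b', long_string)
print(results)
```

Because the domain name is spelled out literally and the path must consist of digits, the other address is skipped.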
>>> import re
>>> pattern = re.compile("(http://www\\.)(\\w*)(\\.com/)(\\d+)")
>>> matches = pattern.search("http://www.someDomainName.com/2134")
>>> if matches:
...     print(matches.group(0))
...     print(matches.group(1))
...     print(matches.group(2))
...     print(matches.group(3))
...     print(matches.group(4))
...
http://www.someDomainName.com/2134
http://www.
someDomainName
.com/
2134
The match above exposes 5 groups:
Group 0 is the complete string that is matched.
The rest follow the order of the brackets in the pattern, so you are looking for the second one: (\\w*)
If you want, you can capture only the part of the string you are interested in. To do that, remove the brackets from the parts of the pattern you don't need and keep only (\\w*)
>>> pattern = re.compile("http://www\\.(\\w*)\\.com/\\d+")
>>> matches = pattern.search("http://www.someDomainName.com/2134")
>>> if matches:
...     print(matches.group(1))
...
someDomainName
In this second example there are no groups 2, 3 and 4 as there were before, because only one group is captured. Group 0 is always available: it is the complete string that matched.
Yeah, your simplest bet is regex. Here's something that will probably get the job done:
import re
matcher = re.compile(r'www\.(.+)\.com/(.+)')
matches = matcher.search(yourstring)
if matches:
    str1, str2 = matches.groups()
If you are sure that there are no dots in SomeDomainName, you can just take the first occurrence of the string ".com/" and take everything from that index on.
This spares you the regex, which is harder to maintain.
exp = 'http://www.aejlidjaelidjl.com/alieilael'
print(exp[exp.find('.com/') + 5:])
I'm new to regex, and I'm starting to sort of get the hang of things. I have a string that looks like this:
This is a generated number #123 which is an integer.
The text that I've shown here around the 123 will always stay exactly the same, but there may be further text on either side. The number may be 123, 597392, or really any run of one or more digits. I believe I can match the number and the following text using \d+(?= which is an integer.), but how do I write the look-behind part?
When I try (?<=This is a generated number #)\d+(?= which is an integer.), it does not match using regexpal.com as a tester.
Also, how would I use python to get this into a variable (stored as an int)?
NOTE: I only want to find the numbers that are sandwiched in between the text I've shown. The string might be much longer with many more numbers.
You don't really need a fancy regex. Just use a group on what you want.
re.search(r'#(\d+)', 'This is a generated number #123 which is an integer.').group(1)
If you want to match a number in the middle of some known text, follow the same rule:
r'some text you know (\d+) other text you also know'
res = re.search(r'#(\d+)', 'This is a generated number #123 which is an integer.')
if res is not None:
    integer = int(res.group(1))
You can just use the findall() in the re module.
import re
string = "This is a string that contains #134534 and other things"
match = re.findall(r'#\d+ .+', string)
print(match)
Output would be ['#134534 and other things']
This will match a number of any length (#123 or #123235345), then a space, then the rest of the line until it hits a newline character.
If you want to get the number only when it follows the text "This is a generated number #" AND is followed by " which is an integer.", you don't need look-behind and look-ahead. You can simply match the whole string, like:
r"This is a generated number #(\d+) which is an integer\."
I am not sure if I understood what you really want though. :)
updated
In [16]: a='This is a generated number #123 which is an integer.'
In [17]: b='This should be a generated number #123 which could be an integer.'
In [18]: exp=r"This is a generated number #(\d+) which is an integer\."
In [19]: result =re.search(exp, a)
In [20]: int(result.group(1))
Out[20]: 123
In [21]: result = re.search(exp,b)
In [22]: result is None
Out[22]: True
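For completeness, the look-behind form from the question does work with Python's re module, since the look-behind text has a fixed width. The following is a sketch only; the names NUMBER_RE and generated_numbers and the sample text are made up for illustration.

```python
import re

# Look-behind and look-ahead: only the digits themselves are matched.
NUMBER_RE = re.compile(
    r'(?<=This is a generated number #)\d+(?= which is an integer\.)'
)

def generated_numbers(text):
    """Return every sandwiched number in text as an int."""
    return [int(digits) for digits in NUMBER_RE.findall(text)]

text = ('Order #42 shipped. This is a generated number #123 which is an integer. '
        'Later: This is a generated number #597392 which is an integer.')
print(generated_numbers(text))
```

Numbers such as #42 that are not surrounded by the known text are ignored, which matches the note in the question about longer strings containing many other numbers.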