Find ISBN with regex in Python - python

I have a text (actually lots of texts) with one ISBN somewhere inside, and I have to find it.
I know: my ISBN-13 will start with "978" followed by 10 digits.
I don't know: how many '-' (hyphens) there are, or whether they are in the correct places.
My code only finds the ISBN when it contains no hyphens:
import re
regex = r'978[0-9]{10}'
pattern = re.compile(regex, re.UNICODE)
for match in pattern.findall(mytext):
    print(match)
But how can I find ISBNs like these:
978-123-456-789-0
978-1234-567890
9781234567890
etc...
Is this possible with one regex-pattern?
Thanks!

This matches 10 digits and allows one optional hyphen before each:
regex = r'978(?:-?\d){10}'

Since you can't have 2 consecutive hyphens, and it must end with a digit:
r'978(-?\d){10}'
... allowing for a hyphen right after the 978, mandating a digit after every hyphen (so it cannot end in a hyphen), and allowing for consecutive digits by making each hyphen optional.
I would add \b before the 978 and after the {10}, to make sure the ISBNs are well separated from surrounding text.
Also, I would add ?: right after the opening parenthesis, to make the group non-capturing (slightly better performance, and also more expressive; with findall, a capturing group would also make it return only the last repetition of the group instead of the whole match), making it:
r'\b978(?:-?\d){10}\b'
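As a quick check of the pattern above, here is a minimal sketch that runs it over the three sample ISBNs from the question. The sample sentence and the `find_isbns` helper name are made up for illustration.

```python
import re

# Pattern from the answer above: "978", then 10 digits,
# each optionally preceded by a single hyphen.
ISBN_RE = re.compile(r'\b978(?:-?\d){10}\b')

def find_isbns(text):
    """Return every ISBN-like substring found in text (hypothetical helper)."""
    return ISBN_RE.findall(text)

sample = "See 978-123-456-789-0, 978-1234-567890 and 9781234567890."
print(find_isbns(sample))
# ['978-123-456-789-0', '978-1234-567890', '9781234567890']
```

Because the group is non-capturing, findall returns the whole match. A run of 14 digits starting with 978 is rejected by the trailing \b.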

What about adding the - char in the pattern for the regex? This way, it will look for any combination of (number or -)x10 times.
regex=r'978[0-9\-]{10}'
Although it may be better to use
regex=r'978[0-9\-]+'
because with {10} every - uses up one of the 10 positions, so the match is cut off before the last digits.
Test
>>> import re
>>> regex=r'978[0-9\-]+'
>>> pattern = re.compile(regex, re.UNICODE)
>>> mytext="978-123-456-789-0"
>>> for match in pattern.findall(mytext):
...     print(match)
...
978-123-456-789-0
>>> mytext="978-1234-567890"
>>> for match in pattern.findall(mytext):
...     print(match)
...
978-1234-567890
>>> mytext="9781234567890"
>>> for match in pattern.findall(mytext):
...     print(match)
...
9781234567890
>>>

You can try to match any run of digits and - characters. In that case, however, you can't know how many characters will be matched:
regex=r'978[\d\-]+\d'
pattern = re.compile(regex, re.UNICODE)
for match in pattern.findall(mytext):
    print(match)
If your ISBN is stuck between other digits or hyphens, you'll have some problems, but if it's clearly separated, no worries :)
EDIT: Following the first comment, I added an extra \d at the end of the regex (the code above is updated), because you know that an ISBN-13 ends with a digit.

The simplest way should be
regex=r'978[-0-9]{10,15}'
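which will accept all of the examples, although it also accepts strings with the wrong number of digits or with consecutive hyphens.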

If someone is still looking: ISBN Detail and Constraints
Easy one: regex = r'^(978-?|979-?)?\d(-?\d){9}$'
Strong one: isbnRegex = r'^(978-?|979-?)?\d{1,5}-?\d{1,7}-?\d{1,6}-?\d{1,3}$', and include a length check for 10 or 13 after removing the hyphens. (Note: also check that a string of length 13 starts with 978 or 979. Some edge cases still need to be checked.)
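The length check described above could look roughly like this. It is only a sketch: the `looks_like_isbn` helper is a made-up name, it does not verify the check digit, and it does not handle an ISBN-10 that ends in X.

```python
import re

# "Strong" pattern from the answer above, anchored to the whole string.
STRONG_RE = re.compile(r'^(978-?|979-?)?\d{1,5}-?\d{1,7}-?\d{1,6}-?\d{1,3}$')

def looks_like_isbn(candidate):
    """Hypothetical helper: regex match plus the length check."""
    if not STRONG_RE.match(candidate):
        return False
    digits = candidate.replace('-', '')
    if len(digits) == 13:
        # A 13-digit ISBN must carry the 978 or 979 prefix.
        return digits.startswith(('978', '979'))
    return len(digits) == 10

for candidate in ['978-1234-567890', '0-306-40615-2', '12345', '1234567890123']:
    print(candidate, looks_like_isbn(candidate))
```

Since the pattern is anchored with ^ and $, it validates a candidate string; it will not find an ISBN embedded in a longer text.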

Related

Parsing based on pattern not at the beginning

I want to extract the number before "2022" in a set of strings. I currently do
a= mystring.strip().split("2022")[0]
and, for instance, when mystring=' 1020220519AX', this gives a = '10'. However,
mystring.strip().split("2022")[0]
fails to return a='202' when mystring=' 20220220519AX' (it returns an empty string instead). Therefore, I want the code to split the string on a "2022" that is not at the beginning of the non-whitespace characters in the string.
Can you please guide me with this?
Use a regular expression rather than split().
import re
mystring = ' 20220220519AX'
match = re.search(r'^\s*(\d+?)2022', mystring)
if match:
    print(match.group(1))
^\s* skips over the whitespace at the beginning, then (\d+?) captures the following digits up to the first 2022.
You can tell a regex engine that you want all the digits before 2022:
r'\d+(?=2022)'
Like .split(), a regex engine is 'greedy' by default - 'greedy' here means that as soon as it can take something that it is instructed to take, it will take that and it won't try another option, unless the rest of the expression cannot be made to work.
So, in your case, mystring.strip().split("2022") splits on the first 2022 it can find and since there's nothing stopping it, that is the result you have to work with.
Using regex, you can even tell it you're not interested in the 2022, but in the numbers before it: the \d+ will match as long a string of digits it can find (greedy), but the (?=2022) part says it must be followed by a literal 2022 to be a match (and that won't be part of the match, a 'positive lookahead').
Using something like:
import re
mystring = ' 20220220519AX'
print(re.findall(r'\d+(?=2022)', mystring))
Will show you all consecutive matches.
Note that for a string like ' 920220220519AX 12022', it will find ['9202', '1'] and only that - it won't find all possible combinations of matches. The first, greedy pass through the string that succeeds is the answer you get.
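To make the greedy behaviour concrete, the sketch below runs the same lookahead pattern with a greedy and a lazy quantifier on the example string above. The variable names are arbitrary.

```python
import re

text = ' 920220220519AX 12022'

# Greedy: \d+ grabs as many digits as possible, then backs off
# until the lookahead (?=2022) succeeds.
greedy = re.findall(r'\d+(?=2022)', text)

# Lazy: \d+? grows one digit at a time and stops at the first
# position where the lookahead succeeds.
lazy = re.findall(r'\d+?(?=2022)', text)

print(greedy)  # ['9202', '1']
print(lazy)    # ['9', '202', '1']
```

Neither variant enumerates every possible combination. Each scan moves left to right and continues after the end of the previous match.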
You could use re.split() with an assertion that the start of the string is not directly to the left (after calling strip()), or you can match the first run of 1 or more digits from the start of the string, which also works when there are more occurrences of 2022:
import re
strings = [
    ' 1020220519AX',
    ' 20220220519AX'
]
for s in strings:
    parts = re.split(r"(?<!^)2022", s.strip())
    if parts:
        print(parts[0])
for s in strings:
    m = re.match(r"\s*(\d+?)2022", s)
    if m:
        print(m.group(1))
Both will output
10
202
Note that the split variant does not guarantee that the first part consists of digits; the string is only split.
If the string consists of only word characters, splitting on \B2022, where \B means "not a word boundary", will also prevent splitting at the start of the example string.
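A minimal sketch of the \B variant, assuming the strings contain only word characters after stripping. The `head_before_2022` name is made up for this example.

```python
import re

def head_before_2022(s):
    """Hypothetical helper: text before the first non-leading "2022"."""
    # \B2022 only matches a "2022" preceded by another word character,
    # so a "2022" at the very start of the stripped string is skipped.
    # maxsplit=1 stops after the first split.
    return re.split(r'\B2022', s.strip(), maxsplit=1)[0]

for s in [' 1020220519AX', ' 20220220519AX']:
    print(head_before_2022(s))
# 10
# 202
```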

Extracting entries from a line into a list in Python Regex

I have the following string:
myst="Cluster 2 0 13aa,>FZGRY:07872:11201...*1 13aa,>FZGRY:08793:13012...at100.00%2 13aa,>FZGRY:04065:08067...at100.00%"
What I want to do is extract the content bounded by > and ... into a list,
yielding:
['FZGRY:07872:11201','FZGRY:08793:13012', 'FZGRY:04065:08067']
But why doesn't this line do the job:
import re
mem = re.findall(">(.*)\.\.\.",myst)
mem
What's the right way to do it?
You can use look arounds to do this.
>>> re.findall(r'(?<=>)[^.]+(?=[.]{3})', myst)
['FZGRY:07872:11201', 'FZGRY:08793:13012', 'FZGRY:04065:08067']
Regex
(?<=>) Positive look behind. Checks if the string is preceded by >
[^.]+ Matches anything other than ., + matches one or more.
(?=[.]{3}) Positive look ahead. Check if the matched string is followed by ...
What is wrong with your regex?
>(.*)\.\.\. Here the .* is greedy and will try to match as much as possible. Add a ? after the * to make it non-greedy (and use a raw string so the backslashes reach the regex engine unchanged).
>>> re.findall(r">(.*?)\.\.\.", myst)
['FZGRY:07872:11201', 'FZGRY:08793:13012', 'FZGRY:04065:08067']
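The difference between the greedy and non-greedy versions can be seen side by side in the sketch below. The third pattern, which uses a negated character class instead of a lazy quantifier, is an alternative added here for comparison and is not part of the answer above.

```python
import re

myst = ("Cluster 2 0 13aa,>FZGRY:07872:11201...*1 13aa,"
        ">FZGRY:08793:13012...at100.00%2 13aa,>FZGRY:04065:08067...at100.00%")

# Greedy: .* runs to the end and backs off to the LAST "...".
greedy = re.findall(r">(.*)\.\.\.", myst)

# Lazy: .*? stops at the FIRST "..." after each ">".
lazy = re.findall(r">(.*?)\.\.\.", myst)

# Negated class: never crosses a dot, so no backtracking is needed.
negated = re.findall(r">([^.]+)\.\.\.", myst)

print(len(greedy))  # 1 (one long match spanning all three entries)
print(lazy)         # ['FZGRY:07872:11201', 'FZGRY:08793:13012', 'FZGRY:04065:08067']
print(negated)      # same list as lazy
```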

How should I write this regex in python

I have the string.
st = "12345 hai how r u #3456? Awer12345 7890"
re.findall('([0-9]+)',st)
It should not return:
['12345', '3456', '12345', '7890']
I should get
['12345','7890']
I should only take the purely numeric values; a match should not be attached to any other characters such as letters or special characters.
No need to use a regular expression:
[i for i in st.split(" ") if i.isdigit()]
Which I think is much more readable than using a regex
Corey's solution is really the right way to go here, but since the question did ask for regex, here is a regex solution that I think is simpler than the others:
re.findall(r'(?<!\S)\d+(?!\S)', st)
And an explanation:
(?<!\S) # Fail if the previous character (if one exists) isn't whitespace
\d+ # Match one or more digits
(?!\S) # Fail if the next character (if one exists) isn't whitespace
Some examples:
>>> re.findall(r'(?<!\S)\d+(?!\S)', '12345 hai how r u #3456? Awer12345 7890')
['12345', '7890']
>>> re.findall(r'(?<!\S)\d+(?!\S)', '12345 hai how r u #3456? Awer12345 7890123ER%345 234 456 789')
['12345', '234', '456', '789']
In [21]: re.findall(r'(?:^|\s)(\d+)(?=$|\s)', st)
Out[21]: ['12345', '7890']
Here,
(?:^|\s) is a non-capture group that matches the start of the string, or a space.
(\d+) is a capture group that matches one or more digits.
(?=$|\s) is lookahead assertion that matches the end of the string, or a space, without consuming it.
Use this pattern: (^|\s)[0-9]+(\s|$). (^|\s) means that your number must be at the start of the string or there must be a whitespace character before the number. And (\s|$) means that there must be whitespace after the number or the number is at the end of the string.
As Jan Pöschko said, 456 won't be found in 123 456, because the space between the two numbers is consumed by the first match. If your "bad" parts (#, Awer) are always prefixes, you can use the pattern (^|\s)[0-9]+ and everything will be OK. It will match all numbers that are preceded only by whitespace or are at the start of the string. Hope this helped...
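The missed match in "123 456" can be reproduced with the short sketch below. It compares the consuming pattern with a lookaround version that checks the neighbouring characters without consuming them. The variable names are arbitrary.

```python
import re

text = '123 456'

# Consuming version: the trailing (\s|$) eats the space, so "456" has no
# leading whitespace left to match against.
consuming = [m.group(0).strip()
             for m in re.finditer(r'(^|\s)[0-9]+(\s|$)', text)]

# Lookaround version: the assertions inspect the neighbours only.
lookaround = re.findall(r'(?<!\S)[0-9]+(?!\S)', text)

print(consuming)   # ['123']
print(lookaround)  # ['123', '456']
```

finditer with group(0) is used for the first pattern because findall would return the contents of the two capturing groups rather than the number itself.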
Your expression finds all sequences of digits, regardless of what surrounds them. You need to include a specification of what comes before and after the sequence to get the behavior you want:
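re.findall(r"(?<!\S)(\d+)(?!\S)", st)
will do what you want. In English, it says "match all sequences of one or more digits that are not adjacent to any non-whitespace character", so each number must be delimited by whitespace or by the start or end of the string. (A character class such as [\D\b] does not work here: inside brackets, \b means a backspace character rather than a word boundary, and \D would also accept # or ? as a delimiter.)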

Python Regular Expression Match All 5 Digit Numbers but None Larger

I'm attempting to string match 5-digit coupon codes spread throughout an HTML web page. For example, 53232, 21032, 40021 etc... I can handle the simpler case of any string of 5 digits with [0-9]{5}, though this also matches inside 6, 7, 8... n digit numbers. Can someone please suggest how I would modify this regular expression to match only 5 digit numbers?
>>> import re
>>> s="four digits 1234 five digits 56789 six digits 012345"
>>> re.findall(r"\D(\d{5})\D", s)
['56789']
If they can occur at the very beginning or the very end, it's easier to pad the string than to mess with special cases:
>>> re.findall(r"\D(\d{5})\D", " "+s+" ")
Without padding the string for the special cases of start and end of string, as in John La Rooy's answer, one can use negative lookahead and lookbehind to handle both cases with a single regular expression:
>>> import re
>>> s = "88888 999999 3333 aaa 12345 hfsjkq 98765"
>>> re.findall(r"(?<!\d)\d{5}(?!\d)", s)
['88888', '12345', '98765']
full string: ^[0-9]{5}$
within a string: [^0-9][0-9]{5}[^0-9]
Note: There is a problem with using \D, since \D matches (and consumes) a character that is not a digit; use \b instead.
\b is important here because it matches the empty string, but only at the beginning or end of a word.
import re
text = "four digits 1234 five digits 56789 six digits 01234,56789,01234"
re.findall(r"\b\d{5}\b", text)
result : ['56789', '01234', '56789', '01234']
but if one uses
re.findall(r"\D(\d{5})\D", text)
output : ['56789', '01234']
\D cannot handle numbers separated by a single comma, because the comma is consumed by the previous match, nor a number at the very end of the string.
More documentation: https://docs.python.org/2/library/re.html
More Clarification on usage of \D vs \b:
This example uses \D but it doesn't capture all the five digits number.
This example uses \b while capturing all five digits number.
Cheers
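One caveat when choosing between \b and the digit lookarounds from the earlier answer: \b treats letters and underscores as part of the same word, so a code glued to letters is not found. The sample string below is made up to show the difference.

```python
import re

s = "code12345 and 67890x 123456"

# \b needs a word boundary on both sides, so digits touching letters fail.
word_boundary = re.findall(r"\b\d{5}\b", s)

# The lookarounds only require that the neighbours are not digits.
lookaround = re.findall(r"(?<!\d)\d{5}(?!\d)", s)

print(word_boundary)  # []
print(lookaround)     # ['12345', '67890']
```

Which behaviour is right depends on whether coupon codes on the page can be directly adjacent to letters.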
A very simple way would be to match all groups of digits, like with r'\d+', and then skip every match that isn't five characters long when you process the results.
You probably want to match a non-digit before and after your string of 5 digits, like [^0-9]([0-9]{5})[^0-9]. Then you can capture the inner group (the actual string you want).
You could try
\D\d{5}\D
or maybe
\b\d{5}\b
I'm not sure how python treats line-endings and whitespace there though.
I believe ^\d{5}$ would not work for you, as you likely want to get numbers that are somewhere within other text.
I use a simpler regex:
re.findall(r"\d{5}", mystring)
It will search for 5 consecutive digits. But you have to be sure that the string contains no longer run of digits, because the first 5 digits of such a run would match too.

Regular expression to match alphanumeric string

If string "x" contains any letter or number, print that string.
How to do that using regular expressions?
The code below is wrong
if re.search('^[A-Z]?[a-z]?[0-9]?', i):
    print(i)
re — Regular expression operations
This question is actually rather tricky. Unfortunately \w includes _ and [a-z] solutions assume a 26-letter alphabet. With the below solution please read the pydoc where it talks about LOCALE and UNICODE.
"[^_\\W]"
Note that since you are only testing for existence, no quantifiers need to be used -- and in fact, using quantifiers that may match 0 times will return false positives.
You want
if re.search('[A-Za-z0-9]+', i):
    print(i)
I suggest that you check out RegexBuddy. It can explain regexes well.
[A-Z]?[a-z]?[0-9]? matches an optional upper case letter, followed by an optional lower case letter, followed by an optional digit. So, it also matches an empty string. What you're looking for is this: [a-zA-Z0-9] which will match a single digit, lower- or upper case letter.
And if you need to check for letter (and digits) outside of the ascii range, use this if your regex flavour supports it: [\p{L}\p{N}]. Where \p{L} matches any letter and \p{N} any number.
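Python's built-in re module does not support \p{L} or \p{N}, so in Python the [^\W_] form from the earlier answer is a practical way to get a Unicode-aware test. A minimal sketch, where `has_alnum` is a made-up helper name:

```python
import re

# [^\W_] means "a word character that is not an underscore", i.e. a letter
# or a digit. For str patterns in Python 3, \w is Unicode-aware by default.
ALNUM_RE = re.compile(r'[^\W_]')

def has_alnum(s):
    """Hypothetical helper: True if s contains at least one letter or digit."""
    return ALNUM_RE.search(s) is not None

for candidate in ['abc123', '###%$#%#^!', '___', 'é']:
    print(repr(candidate), has_alnum(candidate))
```

No quantifier is needed, because search succeeds as soon as a single matching character is found.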
You don't need a regex.
>>> a="abc123"
>>> if True in map(str.isdigit,list(a)):
...     print(a)
...
abc123
>>> if True in map(str.isalpha,list(a)):
...     print(a)
...
abc123
>>> a="###%$#%#^!"
>>> if True in map(str.isdigit,list(a)):
...     print(a)
...
>>> if True in map(str.isalpha,list(a)):
...     print(a)
...
