Extract VAT identification number with Python [duplicate] - python

This question already has answers here:
What do ^ and $ mean in a regular expression?
(2 answers)
Closed 2 years ago.
I am trying to extract the German VAT number (Umsatzsteuer-Identifikationsnummer) from a text.
string = "I want to get this DE813992525 number."
I know that the correct regex for this problem is (?xi)^( (DE)?[0-9]{9}|)$.
It works great according to my demo.
What I tried is:
import re
string = "I want to get this DE813992525 number."
match = re.compile(r'(?xi)^( (DE)?[0-9]{9}|)$')
print(match.findall(string))
Output:
[]
What I would like to get is:
DE813992525

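When searching within a string, don't use ^ and $. They anchor the pattern to the start and end of the string, so it only matches when the entire string is the VAT number: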
import re
string = """I want to get this DE813992525 number.
I want to get this DE813992526 number.
"""
match = re.compile(r'DE[0-9]{9}')
print(match.findall(string))
Out:
['DE813992525', 'DE813992526']
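To see why the anchors matter, here is a small sketch that compares an anchored pattern with an unanchored one. The variable names and the added \b word boundaries are illustrative assumptions, not part of the answer above.

```python
import re

text = "I want to get this DE813992525 number."

# Anchored: ^ and $ require the whole string to be the VAT number.
anchored = re.compile(r'^(?:DE)?[0-9]{9}$', re.IGNORECASE)
# Unanchored, with word boundaries, for searching inside longer text.
embedded = re.compile(r'\bDE[0-9]{9}\b', re.IGNORECASE)

in_sentence_anchored = anchored.findall(text)
in_sentence_embedded = embedded.findall(text)
validates_bare_value = anchored.match("DE813992525") is not None

print(in_sentence_anchored)   # the sentence as a whole is not a VAT number
print(in_sentence_embedded)   # the number is found inside the sentence
print(validates_bare_value)   # anchors suit validating a single value
```

The anchored form is still useful when you validate one field that should contain nothing but the number.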

Related

RegEx returns empty list when searching for words which begin with a number [duplicate]

This question already has answers here:
What do ^ and $ mean in a regular expression?
(2 answers)
Closed 2 years ago.
I've got a problem with carets and dollar signs in Python.
I want to find every word that starts with a number and ends with a letter.
Here is what I've tried already:
import re
text = "Cell: 415kkk -555- 9999ll Work: 212-555jjj -0000"
phoneNumRegex = re.compile(r'^\d+\w+$')
print(phoneNumRegex.findall(text))
Result is an empty list:
[]
The result I want:
415kkk, 9999ll, 555jjj
Where is the problem?
Problems with your regex:
^...$ means the pattern must match the whole string from start to end. Remove the anchors.
\w matches any word character: letters (upper and lower case), digits, and the underscore '_'. So \d+\w+ would also match a purely numeric word such as '555', with \d+ taking '55' and \w+ taking the last '5', and add it to the result.
Use this instead:
import re
text = "Cell: 415kkk -555- 9999ll Work: 212-555jjj -0000"
phoneNumRegex = re.compile(r'\b\d+[a-zA-Z]+\b')
print(phoneNumRegex.findall(text))
Output:
['415kkk', '9999ll', '555jjj']
The '\b' tokens are word boundaries, so the pattern does not match '123def' inside 'abc123def'.
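As a rough side-by-side, the sketch below runs three variants of the pattern on the question's text. The variable names are made up for illustration.

```python
import re

text = "Cell: 415kkk -555- 9999ll Work: 212-555jjj -0000"

# Anchored to the whole string: nothing matches.
anchored = re.findall(r'^\d+\w+$', text)
# Anchors removed, but \w also accepts digits, so pure numbers slip in.
too_loose = re.findall(r'\d+\w+', text)
# Letters only after the digits, bounded by word boundaries.
strict = re.findall(r'\b\d+[a-zA-Z]+\b', text)

print(anchored)
print(too_loose)
print(strict)
```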
Further reading:
re-syntax
regex101.com - regex tester website that can handle Python syntax

Regex to extract the date based on particular string [duplicate]

This question already has answers here:
Python/Regex - How to extract date from filename using regular expression?
(5 answers)
Closed 2 years ago.
I am trying to extract the date only if the string matches a particular regex.
Example:
string1 = '10/22/2019 from'
string2 = '12/22/2020 33455SE'
string3 = '7/20/2020 S0023'
I am trying to extract the date from string2.
Regex used:
r'(\d+[/]\d+[/]\d+[-\s\.]\d+)'
This regex works when the string looks like "10/22/2019 33455", but when letters follow the digits, as in "33455SE", my code fails.
Any help?
What I tried:
r'(\d+[/]\d+[/]\d+[-\s\.]^\d+)' - tried using ^ to mean "starts with".
Expected output: only the dates from string2 and string3:
12/22/2020
7/20/2020
This works:
import re
a = "3443E hello 10/22/2019 33455SE"
# {1,2} also allows single-digit months and days such as 7/20/2020
number = re.findall(r"[0-9]{1,2}/[0-9]{1,2}/[0-9]{4}", a)
print(number[0])
Output:
10/22/2019
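This should work; it matches the date together with the code that follows it:
r'(\d+[/]\d+[/]\d+[-\s\.]\d+[A-Z]*)'
To get only the date, use a lookahead that requires a following token containing a digit:
\d{1,2}/\d{2}/\d{4}(?=\s\w*\d+)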
https://regex101.com/r/gCXHQ6/3
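A minimal sketch of the lookahead pattern applied to the three strings from the question. The list and variable names are assumptions added for illustration.

```python
import re

samples = ['10/22/2019 from', '12/22/2020 33455SE', '7/20/2020 S0023']

# (?=...) checks what follows the date without including it in the match.
pattern = re.compile(r'\d{1,2}/\d{2}/\d{4}(?=\s\w*\d+)')

dates = []
for sample in samples:
    found = pattern.search(sample)
    if found:
        dates.append(found.group(0))

print(dates)
```

The first string is skipped because "from" contains no digit, so the lookahead fails there.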

how to fix ''nothing to repeat at position 2'' [duplicate]

This question already has answers here:
Escaping regex string
(4 answers)
Closed 3 years ago.
I am trying to stem the words in a text column of a DataFrame.
data is a DataFrame, KARMA is the text column, and zargan is a dict that maps each word to its root.
for a in range(1, 100000):
    for j in data.KARMA[a].split():
        pattern = r'\b' + j + r'\b'
        data.KARMA[a] = re.sub(pattern, str(zargan.get(j, j)), data.KARMA[a])
print(data.KARMA[1])
I want to replace each word in the texts with its root.
It looks like j contains a regular expression special character such as *. If you want it to be interpreted as literal text, use re.escape:
pattern = r'\b'+re.escape(j)+r'\b'
and possibly the same for r if it should similarly be coerced into a literal string.
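The following sketch reproduces the kind of failure described and shows what re.escape changes. The tokens "*abc" and "a.b" are hypothetical examples, not values from the asker's data.

```python
import re

# Hypothetical token that starts with a quantifier character.
word = "*abc"
try:
    re.compile(r'\b' + word + r'\b')
    compile_error = None
except re.error as exc:
    compile_error = str(exc)
print(compile_error)

# Escaping makes every character in the token literal.
dotted = "a.b"
text = "axb a.b"
unescaped = re.sub(r'\b' + dotted + r'\b', "X", text)
escaped = re.sub(r'\b' + re.escape(dotted) + r'\b', "X", text)
print(unescaped)  # the unescaped "." also matches the "x" in "axb"
print(escaped)    # only the literal "a.b" is replaced
```

Even when an unescaped token compiles, it may silently match more than intended, so escaping is worth doing for any text that comes from data.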

re.findall only finding half the patterns [duplicate]

This question already has answers here:
Why doesn't [01-12] range work as expected?
(7 answers)
Closed 4 years ago.
I'm using re.findall to parse the year and month from a string, however it is only outputting patterns from half the string. Why is this?
date_string = '2011-1-1_2012-1-3,2015-3-1_2015-3-3'
find_year_and_month = re.findall('[1-2][0-9][0-9][0-9]-[1-12]', date_string)
print(find_year_and_month)
and my output is this:
['2011-1', '2012-1']
This is the current output for those dates but why am I only getting pattern matching for half the string?
[1-12] doesn't do what you think it does. It is a character class that matches a single character: anything in the range 1 to 1, or a 2. The month 3 in the 2015 dates therefore never matches.
See this question for some replacement regex options, like ([1-9]|1[0-2]): How to represent regex number ranges (e.g. 1 to 12)?
If you want an interactive tool for experimenting with regexes, I personally recommend Regexr.
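A quick way to convince yourself is the sketch below, which checks the character class against every digit and then tests an alternation for the months 1 to 12. The variable names are illustrative.

```python
import re

# A character class matches one character; "1-1" is a range and "2" a literal.
single_chars = re.findall(r'[1-12]', '0123456789')
print(single_chars)

# Alternation expresses the numeric range 1 to 12.
month = re.compile(r'1[0-2]|[1-9]')
valid = [n for n in range(0, 14) if month.fullmatch(str(n))]
print(valid)
```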
Adjust your regex pattern as shown below:
import re
date_string = '2011-1-1_2012-1-3,2015-3-1_2015-3-3'
find_year_and_month = re.findall('([1-2][0-9]{3}-(?:1[0-2]|[1-9]))', date_string)
print(find_year_and_month)
The output:
['2011-1', '2012-1', '2015-3', '2015-3']

Capture repeated characters and split using Python [duplicate]

This question already has answers here:
How can I tell if a string repeats itself in Python?
(13 answers)
Closed 3 years ago.
I need to split a string into its repeated substrings.
For example:
My string is "howhowhow"
I need the output 'how,how,how'.
I can't use 'how' directly in my regular expression because my input varies. I need to check whether the string consists of a repeating substring and split it at each repetition.
import re
string = "howhowhow"
print(','.join(re.findall(re.search(r"(.+?)\1", string).group(1), string)))
OUTPUT
howhowhow -> how,how,how
howhowhowhow -> how,how,how,how
testhowhowhow -> how,how,how # not clearly defined by OP
The pattern is non-greedy so that howhowhowhow doesn't map to howhow,howhow which is also legitimate. Remove the ? if you prefer the longest match.
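If the whole string must consist of repetitions, a stricter variant is possible. The sketch below swaps re.search for re.fullmatch and returns None when the input is not a pure repetition; the function name split_repeats is hypothetical.

```python
import re

def split_repeats(text):
    """Return the repeated unit joined by commas, or None if text is not
    made up entirely of one repeated substring."""
    # (.+?) finds the shortest unit; \1+ requires it to repeat to the end.
    found = re.fullmatch(r'(.+?)\1+', text)
    if found is None:
        return None
    unit = found.group(1)
    return ','.join([unit] * (len(text) // len(unit)))

print(split_repeats("howhowhow"))
print(split_repeats("aabaab"))
print(split_repeats("testhowhowhow"))
```

Building the result from the unit and its count also avoids using the captured text as a regex, which matters if the unit contains special characters.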
If you know the length of the repeated substring, plain slicing is enough:
lengthofRepeatedChar = 3
str1 = 'howhowhow'
HowmanyTimesRepeated = len(str1) // lengthofRepeatedChar
print(((str1[:lengthofRepeatedChar] + ',') * HowmanyTimesRepeated)[:-1])
Output:
how,how,how
