I need to write a regex pattern which, from the string "PriitPann39712047623+372 5688736402-12-1998Oja 18-2,Pärnumaa,Are", will return a first name, last name, id code, phone number, date of birth and address. There are no hard requirements besides these: both the first and last names always begin with a capital letter, the id code always consists of 11 digits, the phone number calling code is +372 and the phone number itself consists of 8 digits, the date of birth has the format dd-mm-yyyy, and the address has no specific pattern.
That is, taking the example above, the result should be [("Priit", "Pann", "39712047623", "+372 56887364", "02-12-1998", "Oja 18-2,Pärnumaa,Are")]. I came up with this pattern:
r"([1-9][0-9]{10})(\+\d{3}\s*\d{7,8})(\d{1,2}\ -\d{1,2}\-\d{1,4})"
however it returns everything except first name, last name and address. For example, ^[^0-9]* returns both the first and last name, however I don't understand how to make it return them separately. How can it be improved so that it also separately finds both the first and last name, as well as the address?
The following regex splits each of the fields into a separate group.
r"([A-Z]+[a-z]+)([A-Z]+[a-z]+)([0-9]{11})(\+372 [0-9]{8})([0-9]{2}-[0-9]{2}-[0-9]{4})(.*$)"
You can get each group by calling
m = re.search(regex, search_string)
for i in range(1, num_fields + 1):  # group 0 is the whole match; the fields are groups 1..num_fields
    group_i = m.group(i)
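As a rough, self-contained sketch of the same idea, the snippet below compiles a pattern of this shape and reads all six fields at once with groups(). The names PATTERN, text and fields are made up for illustration, and the ASCII-only name classes are an assumption: a name starting with a letter such as Ä or Õ would need a wider character class.

```python
import re

# One capturing group per field, in the order they appear in the string.
PATTERN = re.compile(
    r"([A-Z][a-z]+)"                  # first name (assumes ASCII letters only)
    r"([A-Z][a-z]+)"                  # last name (same assumption)
    r"([0-9]{11})"                    # id code: exactly 11 digits
    r"(\+372 [0-9]{8})"               # phone: +372, a space, 8 digits
    r"([0-9]{2}-[0-9]{2}-[0-9]{4})"   # date of birth, dd-mm-yyyy
    r"(.*)$"                          # address: whatever is left
)

text = "PriitPann39712047623+372 5688736402-12-1998Oja 18-2,Pärnumaa,Are"

match = PATTERN.search(text)
# groups() returns groups 1..6 as a tuple; group 0 (the whole match) is excluded.
fields = match.groups() if match else ()
print(fields)
```

Calling PATTERN.findall(text) instead gives a list with one such tuple per match, which is the shape asked for in the question.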
I have a web scraper that feeds values into a data extractor. The extractor expects the month to be shifted back by one.
For example, January is equal to 00.
'''
today="02-10-2020"
preDay="02-09-2020"
months ={"01":"00","02":"01","03":"02","04":"03","05":"04","06":"05",
"07":"06","08":"07","09":"08","10":"09","11":"10","12":"11"}
for cur, pre in months.items():
today= today[0:2].replace(cur, pre)
'''
Maybe I do not completely understand how dictionaries are iterated, but when I try this, it replaces every value that matches a key. I only want it to change the first two characters in the string and leave the rest of the data alone.
I have successfully done this with an "if" statement, but I would like to do the same using a dictionary.
If I'm understanding your question correctly, there is no need for a loop at all.
To shift the month part of the date back one month using your dictionary, you could simply do this:
today="02-10-2020"
months ={"01":"00","02":"01","03":"02","04":"03","05":"04","06":"05",
"07":"06","08":"07","09":"08","10":"09","11":"10","12":"11"}
today = today.replace(today[0:2], months[today[0:2]], 1)
print(today)
#output:
#01-10-2020
According to the Documentation:
str.replace(old, new[, count])
Return a copy of the string with all occurrences of substring old replaced by new. If the optional argument count is given, only the first count occurrences are replaced.
So the code I wrote takes the first two characters as the old and replaces it with the value from the dictionary at the key that matches the first two characters. The "1" at the end makes sure this replacement only happens once.
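A variation on the same lookup, sketched with slicing instead of str.replace, so that only the first two characters can ever change. The helper name shift_month is hypothetical, and building the dictionary with a comprehension is just a shorter way to write the same twelve entries.

```python
# Same mapping as above: "01" -> "00", "02" -> "01", ..., "12" -> "11".
months = {f"{m:02d}": f"{m - 1:02d}" for m in range(1, 13)}


def shift_month(date_string):
    """Map the first two characters through months and keep the rest unchanged."""
    return months[date_string[:2]] + date_string[2:]


today = shift_month("02-10-2020")
print(today)
```

An unknown prefix raises KeyError here, which may or may not be what you want for malformed input.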
I'm working on a project where I have to read scanned images of financial statements. I used tesseract 4 to convert the image to a text output, which looks as such (here is a snippet):
REVENUE 9,000,000 900,000
COST OF SALES 900,000 900,000
GROSS PROFIT (90%; 2016 - 90%) 900,000 900,000
I would like to break the above into a list of three entries, where the first entry is the text, then the second and third entries would be the numbers. For example the first row would look something like this:
[[REVENUE], [9,000,000], [9,000,000]]
I came across this Stack Overflow post where someone uses re.match() together with the .groups() method to find the pattern: How to split strings into text and number?
I'm just being introduced to regex and I'm struggling to properly understand the syntax and documentation. I'm trying to use a cheat sheet for now, but I'm having a tough time figuring out how to go about this, please help.
I wrote this regex based on your first expected output, but I am not sure what output you want for your third line.
([A-Za-z ]+)(?=\d|\S) matches the name until a digit or symbol is found.
.*? skips the part of the string we do not care about.
([\d,]+)\s([\d,]+|(?=-\n|-$)) matches one or two groups of numbers; if there is only one group, it must be followed by a dash at the end of the line or the end of the text.
Test code (edited):
import re
regex = r"([A-Za-z ]+)(?=\d|\S).*?([\d,]+)\s([\d,]+|(?=-\n|-$))"
text = """
REVENUE 9,000,000 900,000
COST OF SALES 900,000 900,000
GROSS PROFIT (90%; 2016 - 90%) 900,000 900,000
Business taxes 999 -
"""
print(re.findall(regex,text))
# [('REVENUE ', '9,000,000', '900,000'), ('COST OF SALES ', '900,000', '900,000'), ('GROSS PROFIT ', '900,000', '900,000'), ('Business taxes ', '999', '')]
Regexes are overkill for this problem as you've stated it.
text.split() and a join of the items before the last two is better suited to this.
lines = ["REVENUE 9,000,000 900,000",
         "COST OF SALES 900,000 900,000",
         "GROSS PROFIT (90%; 2016 - 90%) 900,000 900,000"]
out = []
for line in lines:
    parts = line.split()
    if len(parts) < 3:
        # a row needs a label plus two numbers
        raise ValueError("expected a label and two numbers: %r" % line)
    if len(parts) == 3:
        out.append(parts)
    else:
        # everything before the last two items is the label
        out.append([' '.join(parts[0:len(parts)-2]), parts[-2], parts[-1]])
out will contain
[['REVENUE', '9,000,000', '900,000'],
['COST OF SALES', '900,000', '900,000'],
['GROSS PROFIT (90%; 2016 - 90%)', '900,000', '900,000']]
If the label text needs further extraction, you could use regexes, or you could simply look at the items in parts[0:len(parts)-2] and process them based on the words and numbers there.
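The same split can also be sketched with str.rsplit, which splits from the right and so peels off the last two columns directly. The function name split_row is made up here; the approach assumes, as above, that the two numeric columns never contain spaces.

```python
def split_row(line):
    """Split a row into [label, number, number], taking the numbers from the right."""
    # rsplit(None, 2) splits on runs of whitespace, at most twice, from the right.
    # Unpacking raises ValueError when the row has fewer than three fields.
    label, first_number, second_number = line.rsplit(None, 2)
    return [label, first_number, second_number]


rows = [split_row(line) for line in (
    "REVENUE 9,000,000 900,000",
    "COST OF SALES 900,000 900,000",
    "GROSS PROFIT (90%; 2016 - 90%) 900,000 900,000",
)]
print(rows)
```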
To detect the string
rev_str = "[[REVENUE], [9,000,000], [9,000,000]]"
and extract the values
("REVENUE", "9,000,000", "9,000,000")
you would do
import re
x = re.match(r"\[\[([A-Z]+)\], \[([0-9,]+)\], \[([0-9,]+)\]\]", rev_str)
x.groups()
# ('REVENUE', '9,000,000', '9,000,000')
Let's unpack this big ol' string.
Square brackets define a character class, that is, a set of characters to match. For example, [A-Z] matches any single letter from A to Z, whereas [0-9,] matches any digit 0 through 9 or the character ,. The - here is an operator used inside square brackets to denote a range of characters that we want.
The + operator means to look for at least one occurrence of whatever immediately precedes it. For example, the expression [A-Z]+ means to look for at least one occurrence of any of the letters A through Z. You can also use the * operator instead, to look for zero or more occurrences of whatever precedes it.
The round brackets (i.e. parentheses) signify a group to be extracted from the regex. Whenever that pattern is matched, whatever is inside any expression in parentheses will be extracted and returned as a group. For example, ([A-Z]+) means to look for at least one occurrence of any of the letters A through Z, and then save whatever that turns out to be. We access this by doing x.groups() after assigning the result of the regex match to a variable x.
Otherwise, it's straightforward: the rest simply spells out the pattern [[TEXT], [NUMBER], [NUMBER]]. The square brackets are escaped with the \ character, because we want to interpret them literally, rather than as a character class.
Overall, the re.match() function checks whether the given pattern matches at the start of rev_str (use re.search() to find a match anywhere in the string), keeps track of the groups within that match, and returns those groups when you call x.groups().
This is a fairly simple example, but you've gotta start somewhere, right? You should be able to use this as a starting point for making a more complicated regex to process more of your text.
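One detail that is easy to check by hand: re.match() only tries the pattern at the beginning of the string, while re.search() scans for the first place it fits. A small sketch, where the "row: " prefix and the variable names are made up for illustration:

```python
import re

row = re.compile(r"\[\[([A-Z]+)\], \[([0-9,]+)\], \[([0-9,]+)\]\]")

rev_str = "[[REVENUE], [9,000,000], [9,000,000]]"
padded = "row: " + rev_str           # same text, but not at the start of the string

at_start = row.match(rev_str)        # matches: the pattern begins at index 0
not_at_start = row.match(padded)     # None: match() does not skip the prefix
anywhere = row.search(padded)        # matches: search() scans forward

print(at_start.groups())
print(not_at_start)
print(anywhere.groups())
```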
To look through data, I am using regular expressions. One of my regular expressions is (they are dynamic and change based on what the computer needs to look for --- using them to search through data for a game AI):
O,2,([0-9],?){0,},X
After the 2, there can (and most likely will) be other numbers, each followed by a comma.
To my understanding, this will match:
O,2,(any amount of numbers - can be 0 in total, each followed by a comma),X
This is fine, and works (in RegExr) for:
O,4,1,8,6,7,9,5,3,X
X,6,3,7,5,9,4,1,8,2,T
O,2,9,6,7,11,8,X # matches this
O,4,6,9,3,1,7,5,O
X,6,9,3,5,1,7,4,8,O
X,3,2,7,1,9,4,6,X
X,9,2,6,8,5,3,1,X
My issue is that I need to match all the numbers after the original, provided number. So, I want to match (in the example) 9,6,7,11,8.
However, implementing this in Python:
import re
pattern = re.compile("O,2,([0-9],?){0,},X")
matches = pattern.findall(s) # s is the above string
matches is ['8'], the last number, but I need to match all of the numbers after the given (so '9,6,7,11,8').
Note: I need to use pattern.findall because there will be more than one match (I shortened my list of strings, but there are actually around 20 thousand strings), and I need to find the shortest one (as this would be the shortest way for the AI to win).
Is there a way to match the entire string (or just the last numbers after those I provided)?
Thanks in advance!
Use this:
O,2,((?:[0-9],?){0,}),X
See it in action: http://regex101.com/r/cV9wS1
import re
s = '''O,4,1,8,6,7,9,5,3,X
X,6,3,7,5,9,4,1,8,2,T
O,2,9,6,7,11,8,X
O,4,6,9,3,1,7,5,O
X,6,9,3,5,1,7,4,8,O
X,3,2,7,1,9,4,6,X
X,9,2,6,8,5,3,1,X'''
pattern = re.compile("O,2,((?:[0-9],?){0,}),X")
matches = pattern.findall(s) # s is the above string
print(matches)
Outputs:
['9,6,7,11,8']
Explained:
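Wrapping everything between 2, and ,X in () captures the whole run of numbers as a single group. The inner group is changed to the non-capturing form (?: ) so that findall does not report it; a repeated capturing group only keeps its last repetition, which is why you were getting '8'.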
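To see both behaviours side by side, and to pick the shortest sequence as the question asks, something along these lines should work. The third row of the sample data is invented for the demonstration, and "shortest" is assumed to mean the fewest comma-separated numbers.

```python
import re

# Two rows start with O,2; the last one is invented for this demonstration.
s = "O,2,9,6,7,11,8,X\nO,4,6,9,3,1,7,5,O\nO,2,5,3,X"

# A repeated capturing group keeps only its final repetition.
repeated_group = re.findall(r"O,2,([0-9],?){0,},X", s)

# Capturing the whole run (with a non-capturing group inside) keeps everything.
whole_run = re.findall(r"O,2,((?:[0-9],?){0,}),X", s)

# Fewest numbers wins; splitting on "," counts 11 as a single number.
shortest = min(whole_run, key=lambda run: len(run.split(",")))

print(repeated_group)
print(whole_run)
print(shortest)
```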
You don't have to use regex:
split the string into a list on ","
check that item 0 == "O" and item 1 == "2"
check that the last item == "X"
check that each of items[2:-1] is a number (str.isdigit)
that's all
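The steps above could be sketched roughly as follows. The function name moves_after and its parameters are hypothetical; it returns the list of numbers between the given prefix and the closing letter, or None when the row does not fit.

```python
def moves_after(line, player="O", first_move="2", end="X"):
    """Return the numbers between 'player,first_move' and 'end', or None if no fit."""
    items = line.strip().split(",")
    if len(items) < 3:
        return None
    if items[0] != player or items[1] != first_move or items[-1] != end:
        return None
    moves = items[2:-1]                      # everything between the prefix and the end
    if not all(item.isdigit() for item in moves):
        return None
    return moves


print(moves_after("O,2,9,6,7,11,8,X"))
print(moves_after("O,4,1,8,6,7,9,5,3,X"))
```

With around 20 thousand rows, the shortest winning row would then be the non-None result with the smallest len().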
I know that there are similar questions to mine that have been answered, but after reading through them I still don't have the solution I'm looking for.
Using Python 3.2.2, I need to match "Month, Day, Year" with the Month being a string and the Day being two digits no greater than 31, 30, or 28 depending on the month (29 for February in a leap year). (Basically a real and valid date.)
This is what I have so far:
pattern = "(January|February|March|April|May|June|July|August|September|October|November|December)[,][ ](0[1-9]|[12][0-9]|3[01])[,][ ]((19|20)[0-9][0-9])"
expression = re.compile(pattern)
matches = expression.findall(sampleTextFile)
I'm still not too familiar with regex syntax so I may have characters in there that are unnecessary (the [,][ ] for the comma and spaces feels like the wrong way to go about it), but when I try to match "January, 26, 1991" in my sample text file, the printing out of the items in "matches" is ('January', '26', '1991', '19').
Why does the extra '19' appear at the end?
Also, what things could I add to or change in my regex that would allow me to validate dates properly? My plan right now is to accept nearly all dates and weed them out later using high level constructs by comparing the day grouping with the month and year grouping to see if the day should be <31,30,29,28
Any help would be much appreciated including constructive criticism on how I am going about designing my regex.
Here's one way to make a regular expression that will match any date of your desired format (though you could obviously tweak whether commas are optional, add month abbreviations, and so on):
years = r'((?:19|20)\d\d)'
pattern = r'(%%s) +(%%s), *%s' % years
thirties = pattern % (
"September|April|June|November",
r'0?[1-9]|[12]\d|30')
thirtyones = pattern % (
"January|March|May|July|August|October|December",
r'0?[1-9]|[12]\d|3[01]')
fours = '(?:%s)' % '|'.join('%02d' % x for x in range(4, 100, 4))
feb = r'(February) +(?:%s|%s)' % (
r'(?:(0?[1-9]|1\d|2[0-8])), *%s' % years, # 1-28 any year
r'(?:(29), *((?:(?:19|20)%s)|2000))' % fours) # 29 leap years only
result = '|'.join('(?:%s)' % x for x in (thirties, thirtyones, feb))
r = re.compile(result)
print(result)
Then we have:
>>> r.match('January 30, 2001') is not None
True
>>> r.match('January 31, 2001') is not None
True
>>> r.match('January 32, 2001') is not None
False
>>> r.match('February 32, 2001') is not None
False
>>> r.match('February 29, 2001') is not None
False
>>> r.match('February 28, 2001') is not None
True
>>> r.match('February 29, 2000') is not None
True
>>> r.match('April 30, 1908') is not None
True
>>> r.match('April 31, 1908') is not None
False
And what is this glorious regexp, you may ask?
>>> print(result)
(?:(September|April|June|November) +(0?[1-9]|[12]\d|30), *((?:19|20)\d\d))|(?:(January|March|May|July|August|October|December) +(0?[1-9]|[12]\d|3[01]), *((?:19|20)\d\d))|(?:February +(?:(?:(0?[1-9]|1\d|2[0-8]), *((?:19|20)\d\d))|(?:(29), *((?:(?:19|20)(?:04|08|12|16|20|24|28|32|36|40|44|48|52|56|60|64|68|72|76|80|84|88|92|96))|2000))))
(I initially intended to do a tongue-in-cheek enumeration of the possible dates, but I basically ended up hand-writing that whole gross thing except for the multiples of four, anyway.)
Here are some quick thoughts:
Everyone who is suggesting you use something other than regular expression is giving you very good advice. On the other hand, it's always a good time to learn more about regular expression syntax...
An expression in square brackets -- [...] -- matches any single character inside those brackets. So writing [,], which only contains a single character, is exactly identical to writing a simple unadorned comma: ,.
When the pattern contains groups, the .findall method returns a list with one tuple of groups per match. A group is identified by parentheses -- (...) -- and they count from left to right, outermost first. Your final expression looks like this:
((19|20)[0-9][0-9])
The outermost parentheses match the entire year, and the inside parentheses match the first two digits. Hence, for a date like "1989", the final two match groups are going to be 1989 and 19. Since you don't want the inner group (first two digits), you should use a non-capturing group instead. Non-capturing groups start with ?:, used like this: (?:a|b|c)
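A quick way to confirm this is to run both versions of the year group against the sample date from the question. The sketch below writes the comma and space literally instead of [,][ ]; the names MONTHS, DAY and sample are made up for illustration.

```python
import re

MONTHS = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")
DAY = r"0[1-9]|[12][0-9]|3[01]"

# Year written with an inner capturing group, as in the question.
capturing = re.compile(r"(%s), (%s), ((19|20)[0-9][0-9])" % (MONTHS, DAY))

# Same pattern, but the inner group is non-capturing.
non_capturing = re.compile(r"(%s), (%s), ((?:19|20)[0-9][0-9])" % (MONTHS, DAY))

sample = "Born on January, 26, 1991."

with_extra = capturing.findall(sample)
without_extra = non_capturing.findall(sample)

print(with_extra)
print(without_extra)
```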
By the way, there is some good documentation on how to use regular expressions here.
Python has a date parser as part of the time module:
import time
time.strptime("December 31, 2012", "%B %d, %Y")
The above is all you need if the date format is always the same.
So, in real production code, I would write a regular expression that parses the date, and then use the results from the regular expression to build a date string that is always the same format.
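A minimal sketch of that two-step approach, under a few assumptions: it uses datetime.strptime (a close relative of time.strptime above) because it returns a date object, it relies on the default locale so that %B means English month names, and the pattern DATE_RE and the helper parse_date are made-up names. The regex only finds something date-shaped; strptime is what rejects impossible dates such as February 31.

```python
import re
from datetime import datetime

# Loose pattern: a capitalised word, optional comma, day, comma, four-digit year.
DATE_RE = re.compile(r"([A-Z][a-z]+),? +(\d{1,2}), *(\d{4})")


def parse_date(text):
    """Return a datetime.date for the first valid date in text, else None."""
    m = DATE_RE.search(text)
    if m is None:
        return None
    # Rebuild the pieces in one fixed format before handing them to strptime.
    normalized = "%s %s, %s" % m.groups()
    try:
        return datetime.strptime(normalized, "%B %d, %Y").date()
    except ValueError:
        # Unknown month name, or a day that does not exist in that month and year.
        return None


print(parse_date("January, 26, 1991"))
print(parse_date("February 31, 2012"))
```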
Now that you said, in the comments, that this is homework, I'll post another answer with tips on regular expressions.
You have this regular expression:
pattern = "(January|February|March|April|May|June|July|August|September|October|November|December)[,][ ](0[1-9]|[12][0-9]|3[01])[,][ ]((19|20)[0-9][0-9])"
One feature of regular expressions is a "character class". Characters in square brackets make a character class. Thus [,] is a character class matching a single character, , (a comma). You might as well just put the comma.
Perhaps you wanted to make the comma optional? You can do that by putting a question mark after it: ,?
Anything you put into parentheses makes a "match group". I think the mysterious extra "19" came from a match group you didn't mean to have. You can make a non-capturing group using this syntax: (?:
So, for example:
r'(?:red|blue) socks'
This would match "red socks" or "blue socks" but does not make a match group. If you then put that inside plain parentheses:
r'((?:red|blue) socks)'
That would make a match group, whose value would be "red socks" or "blue socks"
I think if you apply these comments to your regular expression, it will work. It is mostly correct now.
As for validating the date against the month, that is way beyond the scope of a regular expression. Your pattern will match "February 31" and there is no easy way to fix that.
First of all, as others have said, I don't think that regular expressions are the best choice to solve this problem, but to answer your question: by using parentheses you are dividing the pattern into several groups, and when you call findall, it returns a list containing every group you created for each match.
((19|20)[0-9][0-9])
Here is your problem: the regex captures both the entire year and 19 or 20, depending on whether the year starts with 19 or 20.