Parsing data from the squished string - python

I need to write a regex pattern which, from the string "PriitPann39712047623+372 5688736402-12-1998Oja 18-2,Pärnumaa,Are", returns a first name, last name, ID code, phone number, date of birth and address. There are no hard requirements besides these: both the first and last names always begin with a capital letter, the ID code always consists of 11 digits, the phone number's calling code is +372 and the number itself consists of 8 digits, the date of birth has the format dd-mm-yyyy, and the address has no specific pattern.
That is, taking the example above, the result should be [("Priit", "Pann", "39712047623", "+372 56887364", "02-12-1998", "Oja 18-2,Pärnumaa,Are")]. I got this pattern
r"([1-9][0-9]{10})(\+\d{3}\s*\d{7,8})(\d{1,2}\-\d{1,2}\-\d{1,4})"
however it returns everything except the first name, last name and address. For example, ^[^0-9]* returns both the first and last name together, but I don't understand how to make it return them separately. How can the pattern be improved so that it also finds the first name, the last name and the address as separate groups?

The following regex splits each of the fields into a separate group.
r"([A-Z]+[a-z]+)([A-Z]+[a-z]+)([0-9]{11})(\+372 [0-9]{8})([0-9]{2}-[0-9]{2}-[0-9]{4})(.*$)"
You can get each group by calling
import re
m = re.search(regex, search_string)
# groups are numbered from 1; group(0) is the whole match
for i in range(1, num_fields + 1):
    group_i = m.group(i)
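As a minimal sketch of the same idea run end to end, the version below uses re.findall, which returns one tuple per match when the pattern has several groups, so the result has the shape asked for in the question. The names PATTERN, parse_records and sample are made up for this illustration, and the sketch assumes first and last names contain only unaccented ASCII letters.

```python
import re

# Hypothetical helper names; assumes ASCII-only first and last names.
PATTERN = re.compile(
    r"([A-Z][a-z]+)"        # first name: capital letter, then lowercase letters
    r"([A-Z][a-z]+)"        # last name: same shape
    r"(\d{11})"             # ID code: exactly 11 digits
    r"(\+372 ?\d{8})"       # phone: +372, optional space, 8 digits
    r"(\d{2}-\d{2}-\d{4})"  # date of birth: dd-mm-yyyy
    r"(.*)"                 # address: whatever is left
)


def parse_records(text):
    """Return a list with one 6-tuple of fields per record found in text."""
    return PATTERN.findall(text)


sample = "PriitPann39712047623+372 5688736402-12-1998Oja 18-2,Pärnumaa,Are"
print(parse_records(sample))
```

Fixing the ID code at 11 digits and the phone number at 8 digits is what lets the regex find the field boundaries inside the unbroken run of digits.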

Related

ValueError: Columns must be same length as key with multiple outputs

I am extracting a substring from an Excel cell and the entire string says this:
The bolts are 5" long each and 3" apart
I want to extract the length of the bolt which is 5". And I use the following code to get that
df['Bolt_Length'] = df['Description'].str.extract(r'(\s[0-9]")',expand=False)
But if the string says the following:
The bolts are 10" long each and 3" apart
and I try to use the following code:
df['Bolt_Length'] = df['Description'].str.extract(r'(\s(\d{1,2})")',expand=False)
I get the following error message:
ValueError: Columns must be same length as key
I think Python doesn't know which number to pick: the 10" or the 3".
How can I fix this? How do I tell Python to only go for the first "?
On another note, what if I want to get both the bolt length and the distance between bolts? How do I extract the two at the same time?
Your problem is that you have two capture groups in your second regular expression (\s(\d{1,2})"), not one. So basically, you're telling Python to get the number with the ", and the same number without the ":
>>> df['Description'].str.extract(r'(\s(\d{1,2})")', expand=False)
      0   1
0    5"   5
1   10"  10
You can add ?: right after the opening parenthesis of a group to make it so that it does not capture anything, though it still functions as a group. The following makes it so that the inner group, which excludes the ", does not capture:
# notice                                vv
>>> df['Description'].str.extract(r'(\s(?:\d{1,2})")', expand=False)
0     5"
1    10"
Name: Description, dtype: object
The error occurs because your regex contains two capturing groups, which extract two column values, but you assign those to a single column, df['Bolt_Length'].
You need to use as many capturing groups in the regex pattern as there are columns you assign the values to:
df['Bolt_Length'] = df['Description'].str.extract(r'\s(\d{1,2})"',expand=False)
The \s(\d{1,2})" regex only contains one pair of unescaped parentheses that form a capturing group, so this works fine since this single value is assigned to a single Bolt_Length column.
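The group counting described above can be checked with the standard re module alone, and the same mechanism answers the follow-up about extracting both numbers at once. This is only a sketch: the group names length and spacing are invented for the illustration. Passing a pattern with two named groups to str.extract would produce one column per group, so the result would have to be assigned to two columns.

```python
import re

# A compiled pattern reports how many capturing groups it has.
one_group = re.compile(r'\s(\d{1,2})"')
two_groups = re.compile(r'(\s(\d{1,2})")')
print(one_group.groups, two_groups.groups)

# Hypothetical group names; each named group captures one of the two numbers.
both_values = re.compile(r'\s(?P<length>\d{1,2})".*?\s(?P<spacing>\d{1,2})"')

text = 'The bolts are 10" long each and 3" apart'
m = both_values.search(text)
print(m.group("length"), m.group("spacing"))
```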

Regex to split phrases separated into columns by many whitespaces

I'm hoping to get some regex assistance. I've got lines of columnar text that I'd like to split with regexes. Each column can be phrases of arbitrary characters, separated by a whitespace or maybe even two. Columns are separated by a larger number of whitespaces, perhaps at least 4.
Ultimately, I need to match a date if it's in the second column.
Here's an example. I need the date in this column to be the group important_date
    Rain in Spain       11/01/2000     90 Days
important_date should not match the date in this next line:
   Another line     of text      10/15/1990
# EXAMPLE:
import re
regex = r"(.*)\s(?P<important_date>\d{1,2}\/\d{1,2}\/\d{4}).*"
match_this = "    Rain in Spain       11/01/2000     90 Days"
not_this = "   Another line     of text      10/15/1990"
print(f"Finding this date is good:{re.search(regex, match_this).group('important_date')}" )
print(f"But this one should throw an error:{re.search(regex,not_this).group('important_date')}")
I'm also comparing these regexes against lots of other lines of text with various structures, so this is why I don't want to just split on a string of " ". To know I've got the important_date, I need to know that the whole line looks like: one-column, second column is date, maybe another column after the date too.
Doing this with a single regex would also just fit much more easily into the rest of the application. I'm worried that line.split(" ") and checking the resulting list would interfere with other checks going on in this app.
I have not been able to figure out how to write the first part of the regex that captures words with no-more-than-2 spaces between them. Can I use lookaheads for this somehow?
Thank you!
Try this: (?m)^\s*(\w+\s)+\s+(?P<important_date>\d\d/\d\d/\d\d\d\d).*$ (https://regex101.com/r/PnIU3e/3).
I assume that the first column consists of words separated by single spaces, and is separated from the second column by more than one space.
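Here is a small sketch of how that pattern could be used from Python. The helper name find_important_date and the two sample lines are assumptions made for this illustration; in particular, the samples assume that columns are separated by runs of several spaces and that the second line has a text column before its date.

```python
import re

# Pattern taken from the answer above; everything else is illustrative.
DATE_IN_SECOND_COLUMN = re.compile(
    r"(?m)^\s*(\w+\s)+\s+(?P<important_date>\d\d/\d\d/\d\d\d\d).*$"
)


def find_important_date(line):
    """Return the date if it sits in the second column, otherwise None."""
    m = DATE_IN_SECOND_COLUMN.search(line)
    return m.group("important_date") if m else None


# Assumed layouts: the date is the second column only in the first line.
match_this = "    Rain in Spain       11/01/2000     90 Days"
not_this = "   Another line     of text      10/15/1990"

print(find_important_date(match_this))
print(find_important_date(not_this))
```

Returning None instead of raising keeps the check easy to combine with other per-line tests.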
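You can split on runs of 2 or more spaces and only use the data if it is the second field:
for x in (match_this, not_this):
    # strip the line, then split it into columns on runs of 2+ spaces or tabs
    te = re.split(r'[ \t]{2,}', x.strip())
    # guard against lines that have no second column at all
    if len(te) > 1 and re.match(r'\d{1,2}/\d{1,2}/\d{4}', te[1]):
        # you have an important date
        print(te[1])
    else:
        # you don't
        print('no match')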

Match pattern 1 and/or pattern 2

I have multiple file names that are either a movie title or an episode of a TV show. For the movie titles I want to match the year the movie came out, and for the episodes I want to match the season and episode number in the format S00E00. However, I can't know whether the string contains one or the other; sometimes it contains both the season and episode and the year. I also don't know what comes first in the string, the year or the season and episode.
I tried with the following pattern: (\d{4})|S(\d\d)E(\d\d), however that only returns a match for the one that came first. For the string 2012.S01E02, it returns 2012, and for the string S01E02.2012 it returns S01E02. The rest of the capture groups is None (I'm using Python 3.5).
I have a solution which uses two separate matches and if-statements, and it generally looks ugly. Is there a way to have one regex pattern that returns a list (or tuple) which contains (year, season, episode), regardless of what comes first in the string?
You could use the following regular expression:
.*?(\d{4}).*?(S\d{2}E\d{2}).*?|.*?(S\d{2}E\d{2}).*?(\d{4}).*?|.*?(S\d{2}E\d{2}).*?|.*?(\d{4}).*?
.*?(\d{4}).*?(S\d{2}E\d{2}).*?: This first tries to match the year followed by the episode number.
.*?(S\d{2}E\d{2}).*?(\d{4}).*?: This matches the reverse order.
.*?(S\d{2}E\d{2}).*?: This matches the episode number alone.
.*?(\d{4}).*?: This matches the year alone.
Because the alternatives are tried in this order, you get both the year and the episode number whenever both are present. In JavaScript:
var regex = /.*?(\d{4}).*?(S\d{2}E\d{2}).*?|.*?(S\d{2}E\d{2}).*?(\d{4}).*?|.*?(S\d{2}E\d{2}).*?|.*?(\d{4}).*?/;
var matches = "test|S02E12|2012_test".match(regex);
matches = matches.filter(function(item) {
    return item !== undefined;
}).slice(1).sort();
console.log(matches);
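Since the question is about Python, here is a hedged sketch of a different route to the same result: keep the asker's original alternation, but collect every match with finditer and merge the groups. The names PATTERN and parse_name are invented for this illustration, and this is a swapped-in technique rather than a translation of the JavaScript above.

```python
import re

# The asker's own pattern: group 1 = year, group 2 = season, group 3 = episode.
PATTERN = re.compile(r"(\d{4})|S(\d\d)E(\d\d)")


def parse_name(name):
    """Return (year, season, episode); parts that are absent stay None."""
    result = [None, None, None]
    for m in PATTERN.finditer(name):
        # Each match fills either the year slot or the season/episode slots.
        for i, value in enumerate(m.groups()):
            if value is not None and result[i] is None:
                result[i] = value
    return tuple(result)


print(parse_name("2012.S01E02"))
print(parse_name("S01E02.2012"))
print(parse_name("Movie.2012.mkv"))
```

Because each alternative is matched independently, the order of the year and the episode tag in the file name no longer matters.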

Regex in python that looks into pattern over multiple lines

I am extracting records from a file that has the information of interest spread over three or more lines. The information is in sequence and follows a reasonable pattern, but it can have some boilerplate text in between.
Since this is a text file converted from PDF it is also possible that there is a page number or some other simple control elements in between.
Pattern consists of:
starting line: last name and first name separated by comma, and nothing else
next line will have two long numbers (>=7 digits) followed by two dates
last line of interest will have 4-digit number followed by a date
Layout (the name line and the two number lines are the pattern of interest):
LAST NAME ,FIRST NAME
... nothing or possibly some junk text
999999999 9999999 MM/DD/YY MM/DD/YY junk text
... nothing or possibly some junk text
9999 MM/DD/YY junk
... junk text I don't care about
My target text by default looks something like:
SOME IRRELEVANT TEXT
DOE ,JOHN
200000002 100000070 04/04/13 12/12/12 XYZ IJK ABC SOMETHING SOMETHING
0999 12/22/12 0 1 0 SOMETHING ELSE
MORE OF SOMETHING ELSE
but it is possible to encounter something in between so it would look like:
SOME IRRELEVANT TEXT
DOE ,JOHN
Page 13 Header
200000002 100000070 04/04/13 12/12/12 XYZ IJK ABC SOMETHING SOMETHING
0999 12/22/12 0 1 0 SOMETHING ELSE
MORE OF SOMETHING ELSE
I don't really need to validate much here, since I know that this pattern will occur as a substring, but with possible insertions.
So far, I have been catching the three lines with the following three regular expressions:
(([A-Z]+\s+)+,[A-Z]+)
(\d{7,}\s+\d{7,}\s+(\d{2}/\d{2}/\d{2}\s+){2})
(\d{4}\s+\d{2}/\d{2}/\d{2})
but I would like to extract the whole data of interest.
Is that possible and if so, how?
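Here I have added the regular expressions to a list and tried finding a match for each one after the other. Is this what you were looking for?
import re

# each pattern targets one of the three lines of interest, in order;
# .*? is non-greedy so that the leading digits of a number are not swallowed
regexpList = [re.compile(r"(([A-Z]+\s+)+,[A-Z]+)"),
              re.compile(r"^.*?(\d{7,}\s+\d{7,}\s+(\d{2}/\d{2}/\d{2}\s+){2})"),
              re.compile(r"^.*?(\d{4}\s+\d{2}/\d{2}/\d{2}).*")]

with open("C:\\Users\\mridulp\\Desktop\\temp\\file1.txt") as f:
    lines = f.readlines()

i = 0  # index of the pattern expected to match next
for l in lines:
    mObj = regexpList[i].match(l)
    if mObj:
        print(mObj.group(1))
        i = i + 1
        if i > 2:
            i = 0  # all three lines found, start over for the next record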
This should pull all instances of the desired substrings from the larger string x (the full text of the file) for you:
re.findall(r'([A-Z]+\s+,[A-Z]+).+?(\d+\s+\d+\s+\d{2}\/\d{2}\/\d{2}\s+\d{2}\/\d{2}\/\d{2}).+?(\d+\s+\d{2}\/\d{2}\/\d{2})', x, re.S)
The re.S flag lets .+? run across line breaks, so junk lines between the three parts are skipped. The resulting list of tuples can be stitched together if needed to get a list of desired substrings with the junk text removed.
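To make the findall approach concrete, here is a sketch run against the sample from the question that has a page header inserted. The names RECORD, text and records are made up for the example, and the digit counts are tightened to the asker's own \d{7,} and \d{4}, which is a small variation on the pattern above.

```python
import re

# re.S lets ".+?" cross line breaks, so junk lines between the parts are skipped.
RECORD = re.compile(
    r"([A-Z]+\s+,[A-Z]+)"                                          # LAST ,FIRST
    r".+?(\d{7,}\s+\d{7,}\s+\d{2}/\d{2}/\d{2}\s+\d{2}/\d{2}/\d{2})"  # two numbers, two dates
    r".+?(\d{4}\s+\d{2}/\d{2}/\d{2})",                             # 4-digit number and a date
    re.S,
)

text = """SOME IRRELEVANT TEXT
DOE ,JOHN
Page 13 Header
200000002 100000070 04/04/13 12/12/12 XYZ IJK ABC SOMETHING SOMETHING
0999 12/22/12 0 1 0 SOMETHING ELSE
MORE OF SOMETHING ELSE
"""

records = RECORD.findall(text)
for name, numbers, last in records:
    print(name, "|", numbers, "|", last)
```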

Parse file for maximum value

I am trying to parse some data contained within a file:
>in:12 out:8 John
>in:20 out:12 Fred
>in:8 out:2 Danny
I would like to find the maximum in value, and find who has the maximum in (Fred does in my example).
It's a non-standard data format you've got there, so you have to write a non-standard parser (a better idea would be to use a standard exchange format like JSON and a parser from the standard library). I'd
create a Person class with, say, in_ and out attributes (in itself is a reserved word in Python, so it cannot be an attribute name)
write a parser function that takes a line from the input file and, if the line contains valid data, creates a new Person
create a list of Persons from your input file called persons.
sort this list ascending by in_: persons_sorted = sorted(persons, key=lambda p: p.in_)
get the maximum: max_in_person = persons_sorted[-1]
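The steps above could be sketched roughly as follows. The field name in_, the helper parse_line and the in-memory lines list are assumptions made for this illustration, and max() is used here in place of sorting the list and taking the last element.

```python
from dataclasses import dataclass


@dataclass
class Person:
    name: str
    in_: int  # trailing underscore because "in" is a reserved word
    out: int


def parse_line(line):
    """Return a Person for a line like '>in:12 out:8 John', or None if malformed."""
    parts = line.strip().split(maxsplit=2)
    if len(parts) != 3:
        return None
    if not parts[0].startswith(">in:") or not parts[1].startswith("out:"):
        return None
    try:
        # both prefixes are 4 characters long, so the number starts at index 4
        return Person(name=parts[2], in_=int(parts[0][4:]), out=int(parts[1][4:]))
    except ValueError:
        return None


# Stand-in for the lines read from the input file.
lines = [">in:12 out:8 John", ">in:20 out:12 Fred", ">in:8 out:2 Danny"]
persons = [p for p in map(parse_line, lines) if p is not None]
max_in_person = max(persons, key=lambda p: p.in_)
print(max_in_person.name)
```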
Try this
>in:(\d+) out:\d+ (.*)
Group 1 will contain the in score and group 2 the name
You'll still have to filter the maximum of group 1 in python code to get the name as this is not what regexes are for.
I'm not a python programmer but this is a good start
for match in re.finditer(r">in:(\d+) out:\d+ (.*)", subject):
    # match start: match.start()
    # match end (exclusive): match.end()
    # matched text: match.group()
    print(match.group(1), match.group(2))  # the in value and the name
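Picking the maximum in Python, the step left open above, might look like this sketch. The variable names are made up, and subject simply holds the three sample lines from the question.

```python
import re

subject = """>in:12 out:8 John
>in:20 out:12 Fred
>in:8 out:2 Danny"""

# Collect (in value, name) pairs; "." stops at the end of each line.
pairs = [
    (int(m.group(1)), m.group(2))
    for m in re.finditer(r">in:(\d+) out:\d+ (.*)", subject)
]

# Compare on the numeric in value only.
best_in, best_name = max(pairs, key=lambda pair: pair[0])
print(best_name, best_in)
```

Converting group 1 to int matters here, since comparing the raw strings would rank "8" above "20".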
