Well I have a huge text and I need to find a way to catch a pattern and send it to a dataframe using pandas (that part is ok).
Basically it goes like this:
string = """
Huge text etc etc etc
Most frequent senders:
NAME OF THE PERSON - 01.234.567/0001-89 (SOME RANDOM UPPERCASE TEXT) - 14 times in total of: R$10.000,00
NAME OF THE PERSON - 012.345.678-90 (SOME RANDOM UPPERCASE TEXT) - 30 times in total of: R$10.000,00
NAME OF THE PERSON - 01.234.567/0001-89 (SOME RANDOM UPPERCASE TEXT) - 10 times in total of: R$10.000,00
Most frequent recipients:
NAME OF THE PERSON - 01.234.567/0001-89 (SOME RANDOM UPPERCASE TEXT) - 14 times in total of: R$10.000,00
NAME OF THE PERSON - 012.345.678-90 (SOME RANDOM UPPERCASE TEXT) - 30 times in total of: R$10.000,00
NAME OF THE PERSON - 01.234.567/0001-89 (SOME RANDOM UPPERCASE TEXT) - 10 times in total of: R$10.000,00
More text after this. """
I need to separate the name of the person, the ID number (that can come in two different patterns: xx.xxx.xxx/0001-xx or xxx.xxx.xxx-xx), the number of times, and the total amount (BRL).
I managed to get the id numbers like this:
r = re.compile(r' [0-9]{3}\.?[0-9]{3}\.?[0-9]{3}\-?[0-9]{2} | [0-9]{2}\.?[0-9]{3}\.?[0-9]{3}\/?[0-9]{4}\-?[0-9]{2} ')
print(r.findall(string))
But that is it, I'm having difficulties trying to get the rest of the info correctly, any help would be very appreciated.
Text all together:
"""Huge text etc etc etc Most frequent senders: NAME OF THE PERSON - 01.234.567/0001-89 (SOME RANDOM UPPERCASE TEXT) - 14 times in total of: R$10.000,00 NAME OF THE PERSON - 012.345.678-90 (SOME RANDOM UPPERCASE TEXT) - 30 times in total of: R$10.000,00 NAME OF THE PERSON - 01.234.567/0001-89 (SOME RANDOM UPPERCASE TEXT) - 10 times in total of: R$10.000,00 Most frequent recipients: NAME OF THE PERSON - 01.234.567/0001-89 (SOME RANDOM UPPERCASE TEXT) - 14 times in total of: R$10.000,00 NAME OF THE PERSON - 012.345.678-90 (SOME RANDOM UPPERCASE TEXT) - 30 times in total of: R$10.000,00 NAME OF THE PERSON - 01.234.567/0001-89 (SOME RANDOM UPPERCASE TEXT) - 10 times in total of: R$10.000,00 More text after this. """
You could use
^\s*([^-\n]+)\s+-\s+([-\d./]+).+\b(\d+)\s+times.+R\$([\d.,]+)$
See a working demo on regex101.com.
Broken down, this reads:
^\s* # start of the line, whitespace
([^-\n]+) # anything not a "-" nor a newline
\s+-\s+ # " - "
([-\d./]+) # the ID part
.+ # everything in that line...
\b(\d+)\s+times # ... backtracking to a digit, followed by spaces and "times"
.+ # once again everything in that line...
R\$([\d.,]+) # ... backtracking to R$, followed by the total amount
$ # end of the line
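As a rough sketch, the same breakdown can be kept inside the pattern itself by compiling it with re.VERBOSE, combined with re.MULTILINE so that ^ and $ match at every line. The two-line sample is shortened from the question; the names rx, sample and rows are only illustrative.

```python
import re

# The breakdown above, written as a verbose pattern. re.VERBOSE ignores
# unescaped whitespace in the pattern and treats "#" as a comment start.
rx = re.compile(r"""
    ^\s*             # start of the line, optional whitespace
    ([^-\n]+)        # name: anything that is neither a hyphen nor a newline
    \s+-\s+          # the separator between name and ID
    ([-\d./]+)       # the ID part
    .+               # everything in that line...
    \b(\d+)\s+times  # ... backtracking to the count in front of "times"
    .+               # once again everything in that line...
    R\$([\d.,]+)     # ... backtracking to R$ and the total amount
    $                # end of the line
""", re.VERBOSE | re.MULTILINE)

sample = """Most frequent senders:
NAME OF THE PERSON - 01.234.567/0001-89 (SOME RANDOM UPPERCASE TEXT) - 14 times in total of: R$10.000,00
NAME OF THE PERSON - 012.345.678-90 (SOME RANDOM UPPERCASE TEXT) - 30 times in total of: R$10.000,00
"""

# One tuple (name, ID, times, amount) per matching line
rows = [m.groups() for m in rx.finditer(sample)]
for row in rows:
    print(row)
```

The heading line is skipped because it contains no " - " separator followed by an ID.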
Note that a hyphenated name like Jean-Baptiste Demartial would break this pattern, because the name group excludes every "-". If you are likely to encounter such names, you may use
^\s*((?:(?! - ).)+)\s+-\s+([-\d./]+).+\b(\d+)\s+times.+R\$([\d.,]+)$
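instead; the only change is the name group, ((?:(?! - ).)+), which accepts any character as long as " - " does not start at that position. See another demo on regex101.com.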
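In terms of Python, this could be:
import re
rx = re.compile(r'pattern', re.MULTILINE)  # re.MULTILINE makes ^ and $ match at every line
for match in rx.finditer(your_string_here):
    print(match.group(1)) # name
    print(match.group(2)) # ID
    print(match.group(3)) # number of times
    print(match.group(4)) # total amount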
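For the DataFrame step, one possible sketch is to collect each match as a dict. It uses the hyphen-tolerant pattern, and brl_to_decimal is a hypothetical helper (not part of the original answer) that assumes amounts always use "." for thousands and "," for decimals. The sample lines are adapted from the question.

```python
import re
from decimal import Decimal

rx = re.compile(
    r'^\s*((?:(?! - ).)+)\s+-\s+([-\d./]+).+\b(\d+)\s+times.+R\$([\d.,]+)$',
    re.MULTILINE,
)

def brl_to_decimal(amount):
    # Hypothetical helper: "10.000,00" -> Decimal("10000.00").
    # Assumes "." is the thousands separator and "," the decimal separator.
    return Decimal(amount.replace('.', '').replace(',', '.'))

sample = """Most frequent senders:
JEAN-BAPTISTE DEMARTIAL - 012.345.678-90 (SOME RANDOM UPPERCASE TEXT) - 30 times in total of: R$10.000,00
NAME OF THE PERSON - 01.234.567/0001-89 (SOME RANDOM UPPERCASE TEXT) - 14 times in total of: R$1.234,56
"""

records = [
    {
        "name": m.group(1),
        "id": m.group(2),
        "times": int(m.group(3)),
        "total": brl_to_decimal(m.group(4)),
    }
    for m in rx.finditer(sample)
]
for record in records:
    print(record)
```

A list of dicts like records can be passed directly to pandas.DataFrame, with the dict keys becoming the column names.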
I have a string pdf_text(below)
pdf_text = """ Account History Report
IMAGE All Notes
Date Created:18/04/2022
Number of Pages: 4
Client Code - 110203 Client Name - AWS PTE. LTD.
Our Ref :2118881115 Name: Sky Blue Ref 1 :12-34-56789-2021/2 Ref 2:F2021004444
Amount: $100.11 Total Paid:$0.00 Balance: $100.11 Date of A/C: 01/08/2021 Date Received: 10/12/2021
Last Paid: Amt Last Paid: A/C Status: CLOSED Collector : Sunny Jane
Date Notes
04/03/2022 Letter Dated 04 Mar 2022.
Our Ref :2112221119 Name: Green Field Ref 1 :98-76-54321-2021/1 Ref 2:F2021001111
Amount: $233.88 Total Paid:$0.00 Balance: $233.88 Date of A/C: 01/08/2021 Date Received: 10/12/2021
Last Paid: Amt Last Paid: A/C Status: CURRENT Collector : Sam Jason
Date Notes
11/03/2022 Email for payment
11/03/2022 Case Status
08/03/2022 to send a Letter
08/03/2022 845***Ringing, No reply
21/02/2022 Letter printed - LET: LETTER 2
18/02/2022 Letter sent - LET: LETTER 2
18/02/2022 845***Line busy
"""
I need to split the string on each line of the form Our Ref :Value Name: Value Ref 1 :Value Ref 2:Value, which is the start of every data entity, so that each entity ends up in its own string.
I used the regex pattern
data_entity_sep_pattern = r'(Our Ref.*?Name.*?Ref 1.*?Ref 2.*?)'
But the separators are not kept together with the sections they belong to:
split_on_data_entity = re.split(data_entity_sep_pattern, pdf_text.strip())
The result is not what I expected. I expected split_on_data_entity[1] and split_on_data_entity[2] to be in one string, and split_on_data_entity[3] and split_on_data_entity[4] to be in one string.
I was referring to this answer, https://stackoverflow.com/a/2136580/10216112, which explains that parentheses retain the separator.
Expected was split_on_data_entity[1] and split_on_data_entity[2] be in one string
The parentheses retain the string, but in a separate chunk.
If you want to keep the string, but have it as part of the next chunk, use a look-ahead (?= )
Some other remarks:
You may also want to require that "Our Ref" occurs as the first set of letters on a line. While you are at it, you can let the pattern consume the preceding newline character and any white space that follows it, so that they do not end up in the sections.
There is no need to match .*? at the very end of your pattern.
As the text comes from a PDF, you may not want to be too strict about the number of spaces between words. You could use \s+.
data_entity_sep_pattern = r'\n\s*(?=Our\s+Ref.*?Name.*?Ref\s+1.*?Ref\s+2)'
split_on_data_entity = re.split(data_entity_sep_pattern, pdf_text)
for section in split_on_data_entity:
    print(section)
    print("--------------------------")
I asked a question not long ago, but I forgot that regex is very sensitive to formatting, and I showed the string in the wrong format.
The problem is that I receive a huge, disorganized text that is all in one line.
In this line I have 2 different "blocks" that I need: "Most frequent senders" and "Most frequent recipients".
As I said, it's all in one straight line, something like this:
string = """
Huge text etc etc etc Most frequent senders: NAME OF THE PERSON - 01.234.567/0001-89 (SOME RANDOM UPPERCASE TEXT) - 14 time(s) in total of: R$10.000,00 NAME OF THE PERSON - 012.345.678-90 (SOME RANDOM UPPERCASE TEXT) - 30 times in total of: R$10.000,00 NAME OF THE PERSON - 01.234.567/0001-89 (SOME RANDOM UPPERCASE TEXT) - 10 times in total of: R$10.000,00 Most frequent recipients: NAME OF THE PERSON - 01.234.567/0001-89 (SOME RANDOM UPPERCASE TEXT) - 14 time(s) in total of: R$10.000,00 NAME OF THE PERSON - 012.345.678-90 (SOME RANDOM UPPERCASE TEXT) - 30 time(s) in total of: R$10.000,00 NAME OF THE PERSON - 01.234.567/0001-89 (SOME RANDOM UPPERCASE TEXT) - 10 time(s) in total of: R$10.000,00 More text after this. """
As you can see, this is terribly disorganized but it's how I receive it.
Basically what I'm trying to do is get the name of the person, the ID (that can have 2 patterns xx.xxx.xxx/0001-xx or xxx.xxx.xxx-xx), the number of times and the amount (in BRL so R$).
I found a way to get the IDs, but nothing more.
r = re.compile(r' [0-9]{3}\.?[0-9]{3}\.?[0-9]{3}\-?[0-9]{2} | [0-9]{2}\.?[0-9]{3}\.?[0-9]{3}\/?[0-9]{4}\-?[0-9]{2} ')
print(r.findall(string))
Any help would be very much appreciated.
Supposing the name of the person is always uppercase and preceded by digits (or : for the first occurrence) and white space(s):
r = re.compile(r'(?<=[\d:])\s+([A-Z ]*) - ([0-9]{3}\.?[0-9]{3}\.?[0-9]{3}\-?[0-9]{2}|[0-9]{2}\.?[0-9]{3}\.?[0-9]{3}\/?[0-9]{4}\-?[0-9]{2}).*?- (\d*)\s.*?: R\$([\d\.,]+)')
Note: You had unnecessary white spaces in your original regex before and after the IDs. You should get more matches with this one.
You'll also get more readable output, one match per line, with the following command:
print(*r.findall(string), sep='\n')
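A quick sketch of the pattern on a shortened one-line string follows. The names and amounts are invented for illustration, and the str.partition split into senders and recipients is an extra assumption on my part, relying on the literal heading "Most frequent recipients:" appearing exactly once.

```python
import re

r = re.compile(r'(?<=[\d:])\s+([A-Z ]*) - ([0-9]{3}\.?[0-9]{3}\.?[0-9]{3}\-?[0-9]{2}|[0-9]{2}\.?[0-9]{3}\.?[0-9]{3}\/?[0-9]{4}\-?[0-9]{2}).*?- (\d*)\s.*?: R\$([\d\.,]+)')

# Invented sample in the same one-line layout as the question
one_line = (
    "Intro text Most frequent senders: "
    "ALICE SILVA - 01.234.567/0001-89 (COMPANY) - 14 time(s) in total of: R$10.000,00 "
    "BOB SOUZA - 012.345.678-90 (PERSON) - 30 times in total of: R$2.500,50 "
    "Most frequent recipients: "
    "CARLA LIMA - 98.765.432/0001-10 (COMPANY) - 3 time(s) in total of: R$999,99 "
    "More text after this."
)

# All entries, regardless of block
rows = r.findall(one_line)
print(*rows, sep='\n')

# Assumed way to separate the two blocks: split at the second heading.
# The heading is kept in the second part so the look-behind still sees ":".
before, heading, after = one_line.partition("Most frequent recipients:")
senders = r.findall(before)
recipients = r.findall(heading + after)
```

Each tuple holds the name, the ID, the number of times and the amount, in that order.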
I used the regular expression module to print the student's name from the following text. However, with my name regex defined as below, it prints only the word "Student". How do I fix it?
import re
list = '''Student name - Shaurya Ronak Ajmera
Age - 7 years
Std - 2nd
Parents name - Ronak and Shital Ajmera
Phone no - ******************
Address - ***********
Knows gujarati - can speak and understand. Reads and writes small sentences.
Batch - 11 to 12:30 am'''
numberRegex = re.compile(r'\d{10}')
nameRegex = re.compile(r'(student)\.*[A-Z a-z]+',re.I)
mo = nameRegex.findall(list)
print(mo)
no = numberRegex.findall(list)
print(no)
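Match the literal text student name - first, then capture the names that follow it in a group. When a pattern contains a capture group, re.findall returns the content of that group rather than the whole match, which is why your pattern printed only Student.
You can omit A-Z and leave a-z in the character class, as re.I makes the match case insensitive.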
student name - ([a-z]+(?: [a-z]+)*)
See a Python demo
Example
nameRegex = re.compile(r'student name - ([a-z]+(?: [a-z]+)*)', re.I)
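As a small sketch of the difference, the snippet below runs both the original pattern and the suggested one on a shortened copy of the text. The variable is called text here rather than list, to avoid shadowing the built-in.

```python
import re

text = '''Student name - Shaurya Ronak Ajmera
Age - 7 years
Std - 2nd
Parents name - Ronak and Shital Ajmera'''

# Original pattern: the only capture group is (student), so findall
# returns just that group for every match
old_regex = re.compile(r'(student)\.*[A-Z a-z]+', re.I)
old_result = old_regex.findall(text)
print(old_result)  # ['Student']

# Suggested pattern: the capture group now surrounds the names
name_regex = re.compile(r'student name - ([a-z]+(?: [a-z]+)*)', re.I)
names = name_regex.findall(text)
print(names)  # ['Shaurya Ronak Ajmera']
```

The group stops at the end of the first line because it only allows letters separated by single spaces, not newlines.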
I have the following line of text:
text= 'Cms12345678 Gleandaleacademy Fee Collection 00001234Abcd Renewal 123Acgf456789'
I am trying to split only the words that consist of letters followed by numbers, or numbers followed by letters, to get this output:
output_text = 'Cms 12345678 Gleandaleacademy Fee Collection 00001234 Abcd Renewal 123Acgf456789'
I have tried the following approach:
import re
text = 'Cms12345678 Gleandaleacademy Fee Collection 00001234Abcd Renewal 123Acgf456789'
text = text.lower().strip()
text = text.split(' ')
output_text =[]
for i in text:
    if bool(re.match(r'[a-z]+\d+|\d+\w+',i, re.IGNORECASE))==True:
        out_split = re.split(r'(\d+)',i)
        for j in out_split:
            output_text.append(j)
    else:
        output_text.append(i)
output_text = ' '.join(output_text)
Which is giving output as:
output_text = 'cms 12345678 gleandaleacademy fee collection 00001234 abcd renewal 123 acgf 456789 '
This code also splits the last element of the text, 123acgf456789, because of the incorrect regex in re.match.
Please help me get the correct output.
You can use
re.sub(r'\b(?:([a-zA-Z]+)(\d+)|(\d+)([a-zA-Z]+))\b', r'\1\3 \2\4', text)
See the regex demo
Details
\b - word boundary
(?: - start of a non-capturing group (necessary for the word boundaries to be applied to all the alternatives):
([a-zA-Z]+)(\d+) - Group 1: one or more letters and Group 2: one or more digits
| - or
(\d+)([a-zA-Z]+) - Group 3: one or more digits and Group 4: one or more letters
) - end of the group
\b - word boundary
During the replacement, either \1 and \2 or \3 and \4 replacement backreferences are initialized, so concatenating them as \1\3 and \2\4 yields the right results.
See a Python demo:
import re
text = "Cms1291682971 Gleandaleacademy Fee Collecti 0000548Andb Renewal 402Ecfev845410001"
print( re.sub(r'\b(?:([a-zA-Z]+)(\d+)|(\d+)([a-zA-Z]+))\b', r'\1\3 \2\4', text) )
# => Cms 1291682971 Gleandaleacademy Fee Collecti 0000548 Andb Renewal 402Ecfev845410001
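To see why the word boundaries matter, here is a short sketch that runs the pattern on the text from the question, with and without the \b anchors. The variable names are only illustrative.

```python
import re

text = 'Cms12345678 Gleandaleacademy Fee Collection 00001234Abcd Renewal 123Acgf456789'

# With \b on both sides, a token must consist of exactly one run of
# letters and one run of digits (in either order) to be split
with_boundaries = re.sub(
    r'\b(?:([a-zA-Z]+)(\d+)|(\d+)([a-zA-Z]+))\b', r'\1\3 \2\4', text)

# Without \b, the pattern may match only part of a token, so the
# mixed token 123Acgf456789 gets split after its leading digits
without_boundaries = re.sub(
    r'(?:([a-zA-Z]+)(\d+)|(\d+)([a-zA-Z]+))', r'\1\3 \2\4', text)

print(with_boundaries)
print(without_boundaries)
```

The replacement relies on unmatched groups being substituted with an empty string, which is how re.sub behaves since Python 3.5.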
I've searched and searched, but can't find any relief for my regex woes.
I wrote the following dummy sentence:
Watch Joe Smith Jr. and Saul "Canelo" Alvarez fight Oscar de la Hoya and Genaddy Triple-G Golovkin for the WBO belt GGG. Canelo Alvarez and Floyd 'Money' Mayweather fight in Atlantic City, New Jersey. Conor MacGregor will be there along with Adonis Superman Stevenson and Mr. Sugar Ray Robinson. "Here Goes a String". 'Money Mayweather'. "this is not a-string", "this is not A string", "This IS a" "Three Word String".
I'm looking for a regular expression that will return the following when used in Python 3.6:
Canelo, Money, Money Mayweather, Three Word String
The regex that has gotten me the closest is:
(["'])[A-Z](\\?.)*?\1
I want it to match only strings of 3 capitalized words or fewer that are immediately surrounded by single or double quotes. Unfortunately, so far it seems to match any string in quotes, no matter what the length or the content, as long as it begins with a capital letter.
I've put a lot of time into trying to hack through it myself, but I've hit a wall. Can anyone with stronger regex kung-fu give me an idea of where I'm going wrong here?
Try to use this one: (["'])((?:[A-Z][a-z]+ ?){1,3})\1
(["']) - opening quote
(?:[A-Z][a-z]+ ?){1,3} - a capitalized word, repeated 1 to 3 times, with the words separated by a space
[A-Z] - capital letter (first character of the word)
[a-z]+ - lowercase letters (rest of the word)
_? - optional space separating the capitalized words (_ stands for a space); it is optional so that the last or only word needs no trailing space
{1,3} - 1 to 3 times
\1 - closing quote, same as opening
Group 2 is what you want.
Match 1
Full match 29-37 `"Canelo"`
Group 1. 29-30 `"`
Group 2. 30-36 `Canelo`
Match 2
Full match 146-153 `'Money'`
Group 1. 146-147 `'`
Group 2. 147-152 `Money`
Match 3
Full match 318-336 `'Money Mayweather'`
Group 1. 318-319 `'`
Group 2. 319-335 `Money Mayweather`
Match 4
Full match 398-417 `"Three Word String"`
Group 1. 398-399 `"`
Group 2. 399-416 `Three Word String`
RegEx101 Demo: https://regex101.com/r/VMuVae/4
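In Python, this could look like the sketch below. It uses finditer and reads group 2 of every match; the sentence is the one from the question, and the variable names are only illustrative.

```python
import re

sentence = '''Watch Joe Smith Jr. and Saul "Canelo" Alvarez fight Oscar de la Hoya and Genaddy Triple-G Golovkin for the WBO belt GGG. Canelo Alvarez and Floyd 'Money' Mayweather fight in Atlantic City, New Jersey. Conor MacGregor will be there along with Adonis Superman Stevenson and Mr. Sugar Ray Robinson. "Here Goes a String". 'Money Mayweather'. "this is not a-string", "this is not A string", "This IS a" "Three Word String".'''

# Group 1 is the opening quote, group 2 the quoted capitalized words;
# \1 requires the closing quote to be the same character as the opening one
rx = re.compile(r'''(["'])((?:[A-Z][a-z]+ ?){1,3})\1''')

quoted = [m.group(2) for m in rx.finditer(sentence)]
print(quoted)
# ['Canelo', 'Money', 'Money Mayweather', 'Three Word String']
```

Note that re.findall would return (quote, words) tuples here, since the pattern has two capture groups, which is why the sketch picks group 2 explicitly.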
Working with the text you've provided, I would try to use regular expression lookaround to get the words surrounded by quotes and then apply some conditions on those matches to determine which ones meet your criterion. The following is what I would do:
[p for p in re.findall('(?<=[\'"])[\w ]{2,}(?=[\'"])', txt) if all(x.istitle() for x in p.split(' ')) and len(p.split(' ')) <= 3]
txt is the text you've provided here. The output is the following:
# ['Canelo', 'Money', 'Money Mayweather', 'Three Word String']
Cleaner:
matches = []
for m in re.findall(r'(?<=[\'"])[\w ]{2,}(?=[\'"])', txt):
    if all(x.istitle() for x in m.split(' ')) and len(m.split(' ')) <= 3:
        matches.append(m)
print(matches)
# ['Canelo', 'Money', 'Money Mayweather', 'Three Word String']
Here's my go at it: ([\"'])(([A-Z][^ ]*? ?){1,3})\1