I have a string pdf_text (below):
pdf_text = """ Account History Report
IMAGE All Notes
Date Created:18/04/2022
Number of Pages: 4
Client Code - 110203 Client Name - AWS PTE. LTD.
Our Ref :2118881115 Name: Sky Blue Ref 1 :12-34-56789-2021/2 Ref 2:F2021004444
Amount: $100.11 Total Paid:$0.00 Balance: $100.11 Date of A/C: 01/08/2021 Date Received: 10/12/2021
Last Paid: Amt Last Paid: A/C Status: CLOSED Collector : Sunny Jane
Date Notes
04/03/2022 Letter Dated 04 Mar 2022.
Our Ref :2112221119 Name: Green Field Ref 1 :98-76-54321-2021/1 Ref 2:F2021001111
Amount: $233.88 Total Paid:$0.00 Balance: $233.88 Date of A/C: 01/08/2021 Date Received: 10/12/2021
Last Paid: Amt Last Paid: A/C Status: CURRENT Collector : Sam Jason
Date Notes
11/03/2022 Email for payment
11/03/2022 Case Status
08/03/2022 to send a Letter
08/03/2022 845***Ringing, No reply
21/02/2022 Letter printed - LET: LETTER 2
18/02/2022 Letter sent - LET: LETTER 2
18/02/2022 845***Line busy
"""
I need to split the string on the line Our Ref :Value Name: Value Ref 1 :Value Ref 2:Value, which is the start of every data entity (marked with rectangles in the picture),
so that I get each marked entity as a separate string.
I used the regex pattern
data_entity_sep_pattern = r'(Our Ref.*?Name.*?Ref 1.*?Ref 2.*?)'
But I don't see the separators being retained with the split lines.
split_on_data_entity = re.split(data_entity_sep_pattern, pdf_text.strip())
The result was not what I expected: split_on_data_entity[1] and split_on_data_entity[2] should be in one string, and split_on_data_entity[3] and split_on_data_entity[4] in another.
I was following this answer, https://stackoverflow.com/a/2136580/10216112, which explains that parentheses retain the separator.
Expected was split_on_data_entity[1] and split_on_data_entity[2] be in one string
The parentheses retain the string, but in a separate chunk.
If you want to keep the string, but have it as part of the next chunk, use a look-ahead (?= )
Some other remarks:
You may also want to require that "Our Ref" occurs as the first set of letters on a line. While you are at it, you can remove the preceding newline character, together with any white space that follows it.
There is no need to match .*? at the very end of your pattern.
As the text comes from a PDF, you may not want to be too strict about the number of spaces between words. You could use \s+.
import re

data_entity_sep_pattern = r'\n\s*(?=Our\s+Ref.*?Name.*?Ref\s+1.*?Ref\s+2)'
split_on_data_entity = re.split(data_entity_sep_pattern, pdf_text)
for section in split_on_data_entity:
    print(section)
    print("--------------------------")
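To make the difference between the two approaches concrete, here is a minimal sketch on a shortened, made-up input (the text and the simplified patterns are assumptions for illustration, not the full report). It contrasts splitting on a capturing group with splitting on a look-ahead:

```python
import re

# Shortened, made-up stand-in for the PDF text
text = "header\nOur Ref :1 Name: A\nnote a\nOur Ref :2 Name: B\nnote b"

# Capturing group: the separator is kept, but as its own list element
with_group = re.split(r'(Our Ref.*?Name)', text)

# Look-ahead: only the newline is consumed, so the separator
# stays at the start of the chunk that follows it
with_lookahead = re.split(r'\n\s*(?=Our\s+Ref.*?Name)', text)

print(with_group)
print(with_lookahead)
```

With the capturing group, each "Our Ref ... Name" ends up in an element of its own, separate from the rest of its line; with the look-ahead, it stays attached to the text that follows it.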
OK, so I asked a question not long ago, but I forgot that regex is very delicate and I showed the string in the wrong format.
The problem is, I receive a huge disorganized text that is all in one line.
In this line I have 2 different "blocks" I need: "Most frequent senders" and "Most frequent recipients".
As I said, it's all in one straight line, kinda like this:
string = """
Huge text etc etc etc Most frequent senders: NAME OF THE PERSON - 01.234.567/0001-89 (SOME RANDOM UPPERCASE TEXT) - 14 time(s) in total of: R$10.000,00 NAME OF THE PERSON - 012.345.678-90 (SOME RANDOM UPPERCASE TEXT) - 30 times in total of: R$10.000,00 NAME OF THE PERSON - 01.234.567/0001-89 (SOME RANDOM UPPERCASE TEXT) - 10 times in total of: R$10.000,00 Most frequent recipients: NAME OF THE PERSON - 01.234.567/0001-89 (SOME RANDOM UPPERCASE TEXT) - 14 time(s) in total of: R$10.000,00 NAME OF THE PERSON - 012.345.678-90 (SOME RANDOM UPPERCASE TEXT) - 30 time(s) in total of: R$10.000,00 NAME OF THE PERSON - 01.234.567/0001-89 (SOME RANDOM UPPERCASE TEXT) - 10 time(s) in total of: R$10.000,00 More text after this. """
As you can see, this is terribly disorganized but it's how I receive it.
Basically what I'm trying to do is get the name of the person, the ID (that can have 2 patterns xx.xxx.xxx/0001-xx or xxx.xxx.xxx-xx), the number of times and the amount (in BRL so R$).
I found a way to get the IDs, but nothing more:
r = re.compile(r' [0-9]{3}\.?[0-9]{3}\.?[0-9]{3}\-?[0-9]{2} | [0-9]{2}\.?[0-9]{3}\.?[0-9]{3}\/?[0-9]{4}\-?[0-9]{2} ')
print(r.findall(string))
Any help would be very much appreciated.
Supposing the name of the person is always uppercase and preceded by digits (or : for the first occurrence) and white space(s):
r = re.compile(r'(?<=[\d:])\s+([A-Z ]*) - ([0-9]{3}\.?[0-9]{3}\.?[0-9]{3}\-?[0-9]{2}|[0-9]{2}\.?[0-9]{3}\.?[0-9]{3}\/?[0-9]{4}\-?[0-9]{2}).*?- (\d*)\s.*?: R\$([\d\.,]+)')
Note: You had unnecessary white spaces in your original regex before and after the IDs. You should get more matches with this one.
You'll also get more readable output with the following command:
print(*r.findall(string), sep='\n')
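As a quick check, the pattern above can be run against a shortened one-line input. This is only a sketch: the names, the parenthesized text and the amounts below are made up, and the field names in the dictionaries are an assumption chosen for illustration.

```python
import re

r = re.compile(
    r'(?<=[\d:])\s+([A-Z ]*) - '
    r'([0-9]{3}\.?[0-9]{3}\.?[0-9]{3}\-?[0-9]{2}'
    r'|[0-9]{2}\.?[0-9]{3}\.?[0-9]{3}\/?[0-9]{4}\-?[0-9]{2})'
    r'.*?- (\d*)\s.*?: R\$([\d\.,]+)'
)

# Made-up sample in the same one-line layout as the question
sample = ("Most frequent senders: JOHN DOE - 01.234.567/0001-89 (ACME) - 14 time(s) "
          "in total of: R$10.000,00 JANE ROE - 012.345.678-90 (FOO) - 30 times "
          "in total of: R$2.500,50 More text")

# One dict per match; the keys are hypothetical column names
fields = ("name", "id", "times", "amount")
records = [dict(zip(fields, groups)) for groups in r.findall(sample)]

for record in records:
    print(record)
```

A list of dictionaries like this can be passed directly to a DataFrame constructor if pandas is used for the next step.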
Well I have a huge text and I need to find a way to catch a pattern and send it to a dataframe using pandas (that part is ok).
Basically it goes like this:
string = """
Huge text etc etc etc
Most frequent senders:
NAME OF THE PERSON - 01.234.567/0001-89 (SOME RANDOM UPPERCASE TEXT) - 14 times in total of: R$10.000,00
NAME OF THE PERSON - 012.345.678-90 (SOME RANDOM UPPERCASE TEXT) - 30 times in total of: R$10.000,00
NAME OF THE PERSON - 01.234.567/0001-89 (SOME RANDOM UPPERCASE TEXT) - 10 times in total of: R$10.000,00
Most frequent recipients:
NAME OF THE PERSON - 01.234.567/0001-89 (SOME RANDOM UPPERCASE TEXT) - 14 times in total of: R$10.000,00
NAME OF THE PERSON - 012.345.678-90 (SOME RANDOM UPPERCASE TEXT) - 30 times in total of: R$10.000,00
NAME OF THE PERSON - 01.234.567/0001-89 (SOME RANDOM UPPERCASE TEXT) - 10 times in total of: R$10.000,00
More text after this. """
I need to separate the name of the person, the ID number (that can come in two different patterns: xx.xxx.xxx/0001-xx or xxx.xxx.xxx-xx), the number of times, and the total amount (BRL).
I managed to get the id numbers like this:
r = re.compile(r' [0-9]{3}\.?[0-9]{3}\.?[0-9]{3}\-?[0-9]{2} | [0-9]{2}\.?[0-9]{3}\.?[0-9]{3}\/?[0-9]{4}\-?[0-9]{2} ')
print(r.findall(string))
But that is it. I'm having difficulty getting the rest of the info correctly; any help would be much appreciated.
Text all together:
"""Huge text etc etc etc Most frequent senders: NAME OF THE PERSON - 01.234.567/0001-89 (SOME RANDOM UPPERCASE TEXT) - 14 times in total of: R$10.000,00 NAME OF THE PERSON - 012.345.678-90 (SOME RANDOM UPPERCASE TEXT) - 30 times in total of: R$10.000,00 NAME OF THE PERSON - 01.234.567/0001-89 (SOME RANDOM UPPERCASE TEXT) - 10 times in total of: R$10.000,00 Most frequent recipients: NAME OF THE PERSON - 01.234.567/0001-89 (SOME RANDOM UPPERCASE TEXT) - 14 times in total of: R$10.000,00 NAME OF THE PERSON - 012.345.678-90 (SOME RANDOM UPPERCASE TEXT) - 30 times in total of: R$10.000,00 NAME OF THE PERSON - 01.234.567/0001-89 (SOME RANDOM UPPERCASE TEXT) - 10 times in total of: R$10.000,00 More text after this. """
You could use
^\s*([^-\n]+)\s+-\s+([-\d./]+).+\b(\d+)\s+times.+R\$([\d.,]+)$
See a working demo on regex101.com.
Broken down, this reads:
^\s*                 # start of the line, whitespace
([^-\n]+)            # anything not a "-" nor a newline
\s+-\s+              # " - "
([-\d./]+)           # the ID part
.+                   # everything in that line...
\b(\d+)\s+times      # ... backtracking to digits, followed by spaces and "times"
.+                   # once again everything in that line...
R\$([\d.,]+)         # ... backtracking to R$, followed by the total amount
$                    # end of the line
Note that a name containing a hyphen, like Jean-Baptiste Demartial, would break the rule. If you are likely to encounter such names, you may use
^\s*((?:(?! - ).)+)\s+-\s+([-\d./]+).+\b(\d+)\s+times.+R\$([\d.,]+)$
# ^^^
instead. See another demo on regex101.com.
In terms of Python, this could be:
import re

# re.MULTILINE lets ^ and $ match at the start and end of every line
rx = re.compile(r'pattern', re.MULTILINE)
for match in rx.finditer(your_string_here):
    print(match.group(1))  # name
    print(match.group(2))  # ID
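Here is a self-contained sketch of that loop with the first pattern filled in. The sample text is a shortened, made-up version of the multi-line input, and it assumes the pattern is compiled with re.MULTILINE so that ^ and $ anchor at each line:

```python
import re

rx = re.compile(
    r'^\s*([^-\n]+)\s+-\s+([-\d./]+).+\b(\d+)\s+times.+R\$([\d.,]+)$',
    re.MULTILINE,
)

# Made-up sample in the multi-line layout of the question
text = """Huge text etc etc etc
Most frequent senders:
JOHN DOE - 01.234.567/0001-89 (ACME) - 14 times in total of: R$10.000,00
JANE ROE - 012.345.678-90 (FOO) - 30 times in total of: R$2.500,50
More text after this."""

# Each match yields (name, ID, number of times, total amount)
rows = [m.groups() for m in rx.finditer(text)]

for row in rows:
    print(row)
```

Lines without the " - ID ... times ... R$" structure, such as the headings, simply produce no match.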
Problem Introduction
So I've fried my brain trying to get negative lookaheads/lookbehinds to work. For the last example input, my current solution returns no match (see the expected output table). I'm struggling to match the title part of the string when it includes a year that is not at the end of the string. To be clear, I only want to match the year if it is at the end of the string. The current regex fails on the last example because the title group excludes every "Q" and every digit, that is, NOT("Q" OR \d), whereas I only want it to stop at a "Q" followed by a digit, that is, NOT("Q" followed by \d). Any tips/suggestions greatly appreciated. Note: I'm using Python 3.8.
Example Input
AXP - Earnings call Q2 2021
AXP - Conference call 2021
BAC,BAC.PE,BAC.PL,BACRP,BML.PL,BML.PJ,BML.PH,BML.PG,BAC.PB,BAC.PK,BAC.PM,BAC.PN - Earnings call Q2 2021
GM - General Motors Company (GM) Presents at Deutsche Bank AutoTech Conference
AXP - American Express Company (AXP) Management Presents at Barclays 2020 Global Financial Services Conference
The period will always be of the form Q[1-4]. period and year are optional. If they do occur, they will be at the end of the string. symbol and title are always separated by - and always occur.
Expected Output
symbol | title | period | year
-------|-------|--------|-----
AXP | Earnings call | Q2 | 2021
AXP | Conference call | | 2021
BAC | Earnings call | Q2 | 2021
GM | General Motors Company (GM) Presents at Deutsche Bank AutoTech Conference | |
AXP | American Express Company (AXP) Management Presents at Barclays 2020 Global Financial Services Conference | |
What I've Tried
r"^(?P<symbol>[^\,]{1,8})(\,[A-Z\.]+)*\s\-\s(?P<title>[^Q\d]*)\s?(?P<period>Q\d)?\s?(?P<year>19|20\d{2})$"
You can use
^(?P<symbol>[^,]{1,8})(?:,[A-Z.]*)*\s+-\s+(?P<title>(?:(?!Q\d).)*?)\s*(?P<period>Q\d)?\s?(?P<year>(?:19|20)\d{2})?$
See the regex demo.
Note:
[^Q\d]* is wrong, as it matches zero or more chars other than Q and digits. You need to match any text up to a Q + digit, that is, a (?:(?!Q\d).)*? tempered greedy token.
(?P<year>19|20\d{2}) is obligatory, but it must be optional. Also, 19|20 is not grouped, so \d{2} is only applied to 20. Hence (?P<year>19|20\d{2}) => (?P<year>(?:19|20)\d{2})?.
There are other small enhancements here: the unnecessary escapes (\, \. and \-) are dropped, and the whitespace around the hyphen is matched with \s+.
Details:
^ - start of string
(?P<symbol>[^,]{1,8}) - Group "symbol": one to eight chars other than a comma
(?:,[A-Z.]*)* - zero or more repetitions of a comma and then zero or more uppercase letters/dots
\s+-\s+ - a hyphen enclosed with one or more whitespaces
(?P<title>(?:(?!Q\d).)*?) - Group "title": any char other than a line break char, zero or more but as few as possible occurrences, that does not start a Q+digit char sequence
\s* - zero or more whitespaces
(?P<period>Q\d)? - an optional Group "period": a Q and a digit
\s? - an optional whitespace
(?P<year>(?:19|20)\d{2})? - an optional Group "year": 19 or 20 and then two digits
$ - end of string.
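The pattern can be tried from Python against the example inputs from the question. This is a sketch: splitting the pattern over several string literals and collecting the results with groupdict() are choices made here for readability, not part of the answer itself.

```python
import re

rx = re.compile(
    r"^(?P<symbol>[^,]{1,8})(?:,[A-Z.]*)*\s+-\s+"
    r"(?P<title>(?:(?!Q\d).)*?)\s*"
    r"(?P<period>Q\d)?\s?(?P<year>(?:19|20)\d{2})?$"
)

lines = [
    "AXP - Earnings call Q2 2021",
    "AXP - Conference call 2021",
    "AXP - American Express Company (AXP) Management Presents at "
    "Barclays 2020 Global Financial Services Conference",
]

# Groups that did not take part in the match come back as None
parsed = [rx.match(line).groupdict() for line in lines]

for row in parsed:
    print(row)
```

In the last line the year 2020 is not at the end of the string, so it stays inside the title and the year group is left empty.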
I am trying to write a Python regular expression that captures multiple values from a few columns in a dataframe. The regular expression below attempts to do that. The string has 4 parts:
group 1: Date - month and day
group 2: Date - month and day
group 3: description text before amount i.e. group 4
group 4: amount - this group is optional
Some peculiar conditions for group 3 (the text):
(1) The text itself might contain characters like "-" and "$", so we cannot use - and $ as the boundary of the text.
(2) The text (group 3) is sometimes not followed by an amount.
(3) The space between groups 3 and 4 is optional.
Below is the Python function, which takes a dataframe with 4 columns (c1, c2, c3, c4) and adds the columns dt, txt and amt to it after processing.
def parse_values(args):
    re_1='(([JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC]{3}\s{0,}[\d]{1,2})\s{0,}){2}(.*[\s]|.*[^\$]|.*[^-]){1}([-+]?\$[\d|,]+(?:\.\d+)?)?'
    srch=re.search(re_1, args[0])
    if srch is None:
        return args
    m = re.match(re_1, args[0])
    args['dt']=m.group(1)
    args['txt']=m.group(3)
    args['amt']=m.group(4)
    if m.group(4) is None:
        if pd.isnull(args['c3']):
            args['amt']=args.c2
        else:
            args['amt']=args.c3
    return args
To test the results, I have the 6 rows below, which need to return a properly formatted amt column.
tt=[{'c1':'OCT 7 OCT 8 HURRY CURRY THORNHILL ','c2':'$16.84'},
{'c1':'OCT 7 OCT 8 HURRY CURRY THORNHILL','c2':'$16.84'},
{'c1':'MAR 15 MAR 16 LOBLAWS FOODS INC - EAST YORK -$80,00,7770.70'},
{'c1':'MAR 15 MAR 16 LOBLAWS FOODS INC - EAST YORK-$2070.70'},
{'c1':'MAR 15 MAR 16 LOBLAWS FOODS INC - EAST YORK$2070.70'},
{'c1':'MAR 15 MAR 16 LOBLAWS FOODS INC - EAST YORK $80,00,7770.70'}
]
t=pd.DataFrame(tt,columns=['c1','c2','c3','c4'])
t=t.apply(parse_values,1)
t
However, due to the error in my regular expression re_1, the amt and txt columns are not parsed properly: they return NaN or miss some words in some rows of the output.
How about this:
(((?:JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC)\s*[\d]{1,2})\s*){2}(.*?)\s*(?=[\-$])([-+]?\$[\d|,]+(?:\.\d+)?)
As seen at regex101.com
Explanation:
First off, I've shortened the regex by changing a few minor details, like using \s* instead of \s{0,}, which means exactly the same thing.
The [JAN|...|DEC] part was using a character class, i.e. [], which matches only a single character from the set. A non-capturing group with alternation, (?:JAN|...|DEC), is the correct way to choose between multi-letter alternatives, which in your case are the months.
The meat of the regex: LOOKAHEADS
(?=[\-$]) tells the regex that the lazy (.*?) before it should keep expanding only until it reaches a position followed by a dash or a dollar sign. Lookaheads don't consume what they look for; they only assert that it follows the current position.
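As a rough check, the suggested pattern can be run against some of the test rows from the question. In this sketch, parse_line is a hypothetical helper introduced only for illustration. Note that the amount group is not optional in this pattern, so rows without an amount produce no match and would still need the fallback to c2/c3 from the question's function.

```python
import re

MONTHS = "JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC"
rx = re.compile(
    r"(((?:" + MONTHS + r")\s*[\d]{1,2})\s*){2}"
    r"(.*?)\s*(?=[\-$])([-+]?\$[\d|,]+(?:\.\d+)?)"
)

def parse_line(line):
    """Hypothetical helper: return (text, amount), or None if there is no match."""
    m = rx.search(line)
    if m is None:
        return None
    return m.group(3), m.group(4)

rows = [
    "MAR 15 MAR 16 LOBLAWS FOODS INC - EAST YORK -$80,00,7770.70",
    "MAR 15 MAR 16 LOBLAWS FOODS INC - EAST YORK-$2070.70",
    "MAR 15 MAR 16 LOBLAWS FOODS INC - EAST YORK$2070.70",
    "OCT 7 OCT 8 HURRY CURRY THORNHILL",  # no amount in the text
]
results = [parse_line(row) for row in rows]

for result in results:
    print(result)
```

The dash inside the description is skipped because it is not followed by a dollar sign, so the lazy group keeps expanding until the real amount starts.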
I have a text document and I need to add two # symbols before the keywords present in an array.
Sample text and Array:
str ="This is a sample text document which consists of all demographic information of employee here is the value you may need,name: George employee_id:14296blood_group:b positive this is the blood group of the employeeage:32"
arr=['name','employee_id','blood_group','age']
Required Text:
str ="This is a sample text document which consists of all demographic information of employee here is the value you may need, ##name: George ##employee_id:14296 ##blood_group:b positive this is the blood group of the employee ##age:32"
Just use the replace function
str ="This is a sample text document which consists of all demographic information of employee here is the value you may need,name: George employee_id:14296blood_group:b positive this is the blood group of the employeeage:32"
arr = ['name','employee_id','blood_group','age']
for w in arr:
    str = str.replace(w, f'##{w}')
print(str)
You can simply loop over arr and use the str.replace function. Strings are immutable, so the result has to be assigned back:
for repl in arr:
    strng = strng.replace(repl, '##'+repl)
print(strng)
However, I urge you to change the variable name str, because it shadows the built-in str type.
You might use the re module for that task in the following way:
import re
txt = "This is a sample text document which consists of all demographic information of employee here is the value you may need,name: George employee_id:14296blood_group:b positive this is the blood group of the employeeage:32"
arr=['name','employee_id','blood_group','age']
newtxt = re.sub('('+'|'.join(arr)+')',r'##\1',txt)
print(newtxt)
Output:
This is a sample text document which consists of all demographic information of employee here is the value you may need,##name: George ##employee_id:14296##blood_group:b positive this is the blood group of the employee##age:32
Explanation: here I used regular expression to catch words from your list and replace each with ##word. This is single pass, as opposed to X passes when using multiple str.replace (where X is length of arr), so should be more efficient for cases where arr is long.
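A slightly more defensive variant of the same single-pass idea, as a sketch: it escapes each keyword with re.escape, in case a keyword ever contains a regex metacharacter, and it uses a replacement function instead of a backreference. The shortened sample text is an assumption made for brevity.

```python
import re

# Shortened version of the sample text
txt = "need,name: George employee_id:14296blood_group:b positive employeeage:32"
arr = ['name', 'employee_id', 'blood_group', 'age']

# Escape each keyword so it is always matched literally
pattern = re.compile('|'.join(re.escape(w) for w in arr))

# Prefix every match with '##' in a single pass over the text
newtxt = pattern.sub(lambda m: '##' + m.group(0), txt)
print(newtxt)
```

If one keyword is a prefix of another, sorting arr by length in descending order before joining makes the longer keyword win.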
As an alternative, you can insert the marker at the position returned by str.find; for a longer list, convert the lines below into a loop. The required text also seems to have a space before ##.
str= str[:str.find(arr[0])] + ' ##' + str[str.find(arr[0]):]
str= str[:str.find(arr[1])] + ' ##' + str[str.find(arr[1]):]
str= str[:str.find(arr[2])] + ' ##' + str[str.find(arr[2]):]
str= str[:str.find(arr[3])] + ' ##' + str[str.find(arr[3]):]
You can replace each keyword with a space and ## followed by the keyword, and then replace double spaces with a single space in the result.
str ="This is a sample text document which consists of all demographic information of employee here is the value you may need,name: George employee_id:14296blood_group:b positive this is the blood group of the employeeage:32"
arr=['name','employee_id','blood_group','age']
for i in arr:
    str = str.replace(i, " ##{}".format(i))
print(str.replace("  ", " "))
Output
This is a sample text document which consists of all demographic information of employee here is the value you may need, ##name: George ##employee_id:14296 ##blood_group:b positive this is the blood group of the employee ##age:32