Using regex to get different groups in pattern - python

OK, so I asked a question not long ago, but I forgot that regex is very delicate and I showed the string in the wrong format.
The problem is that I receive a huge, disorganized text that is all on one line.
In this line I have 2 different "blocks" that I need: "Most frequent senders" and "Most frequent recipients".
As I said, it's all on one straight line, kind of like this:
string = """
Huge text etc etc etc Most frequent senders: NAME OF THE PERSON - 01.234.567/0001-89 (SOME RANDOM UPPERCASE TEXT) - 14 time(s) in total of: R$10.000,00 NAME OF THE PERSON - 012.345.678-90 (SOME RANDOM UPPERCASE TEXT) - 30 times in total of: R$10.000,00 NAME OF THE PERSON - 01.234.567/0001-89 (SOME RANDOM UPPERCASE TEXT) - 10 times in total of: R$10.000,00 Most frequent recipients: NAME OF THE PERSON - 01.234.567/0001-89 (SOME RANDOM UPPERCASE TEXT) - 14 time(s) in total of: R$10.000,00 NAME OF THE PERSON - 012.345.678-90 (SOME RANDOM UPPERCASE TEXT) - 30 time(s) in total of: R$10.000,00 NAME OF THE PERSON - 01.234.567/0001-89 (SOME RANDOM UPPERCASE TEXT) - 10 time(s) in total of: R$10.000,00 More text after this. """
As you can see, this is terribly disorganized but it's how I receive it.
Basically what I'm trying to do is get the name of the person, the ID (that can have 2 patterns xx.xxx.xxx/0001-xx or xxx.xxx.xxx-xx), the number of times and the amount (in BRL so R$).
I found a way to get the IDs, but that is it, nothing more.
r = re.compile(r' [0-9]{3}\.?[0-9]{3}\.?[0-9]{3}\-?[0-9]{2} | [0-9]{2}\.?[0-9]{3}\.?[0-9]{3}\/?[0-9]{4}\-?[0-9]{2} ')
print(r.findall(string))
Any help would be very much appreciated.

Supposing the name of the person is always uppercase and preceded by digits (or : for the first occurrence) and white space(s):
r = re.compile(r'(?<=[\d:])\s+([A-Z ]*) - ([0-9]{3}\.?[0-9]{3}\.?[0-9]{3}\-?[0-9]{2}|[0-9]{2}\.?[0-9]{3}\.?[0-9]{3}\/?[0-9]{4}\-?[0-9]{2}).*?- (\d*)\s.*?: R\$([\d\.,]+)')
Note: You had unnecessary whitespace before and after the IDs in your original regex. You should get more matches with this one.
You'll also get more readable output with the following command:
print(*r.findall(string), sep='\n')
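To label the captured fields and tell senders from recipients, one option is to cut the text at the two headings first and run the same pattern on each part. This is only a sketch: it assumes both headings appear exactly once and in this order, and the helper name `parse_block`, the names and the amounts in the shortened sample text are made up for illustration. The text is split on the heading without its colon, so the colon stays in front of the first name and the lookbehind still matches.

```python
import re

# Same pattern as above, only wrapped over several lines.
ENTRY = re.compile(
    r'(?<=[\d:])\s+([A-Z ]*) - '
    r'([0-9]{3}\.?[0-9]{3}\.?[0-9]{3}\-?[0-9]{2}'
    r'|[0-9]{2}\.?[0-9]{3}\.?[0-9]{3}\/?[0-9]{4}\-?[0-9]{2})'
    r'.*?- (\d*)\s.*?: R\$([\d\.,]+)'
)

def parse_block(block):
    """Return one dict per entry found in a block of text (hypothetical helper)."""
    return [
        {
            "name": name.strip(),
            "id": id_,
            "times": int(times) if times else None,
            "amount": amount,  # kept as text, e.g. "10.000,00"
        }
        for name, id_, times, amount in ENTRY.findall(block)
    ]

# Made-up sample in the same shape as the real input.
text = (
    "Huge text Most frequent senders: "
    "ANA SILVA - 01.234.567/0001-89 (SOME TEXT) - 14 time(s) in total of: R$10.000,00 "
    "JOSE LIMA - 012.345.678-90 (SOME TEXT) - 30 times in total of: R$2.500,50 "
    "Most frequent recipients: "
    "MARIA SOUZA - 01.234.567/0001-89 (SOME TEXT) - 10 time(s) in total of: R$300,00 "
    "More text after this."
)

# Split on the headings (without the colon) so each entry keeps its section.
_, rest = text.split("Most frequent senders", 1)
senders_part, recipients_part = rest.split("Most frequent recipients", 1)

senders = parse_block(senders_part)
recipients = parse_block(recipients_part)
print(*senders, sep='\n')
print(*recipients, sep='\n')
```

Each dict can then be used as one row, with the section known from the list it ended up in.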

Related

How can I use regex in Python to find these patterns

Well I have a huge text and I need to find a way to catch a pattern and send it to a dataframe using pandas (that part is ok).
Basically it goes like this:
string = """
Huge text etc etc etc
Most frequent senders:
NAME OF THE PERSON - 01.234.567/0001-89 (SOME RANDOM UPPERCASE TEXT) - 14 times in total of: R$10.000,00
NAME OF THE PERSON - 012.345.678-90 (SOME RANDOM UPPERCASE TEXT) - 30 times in total of: R$10.000,00
NAME OF THE PERSON - 01.234.567/0001-89 (SOME RANDOM UPPERCASE TEXT) - 10 times in total of: R$10.000,00
Most frequent recipients:
NAME OF THE PERSON - 01.234.567/0001-89 (SOME RANDOM UPPERCASE TEXT) - 14 times in total of: R$10.000,00
NAME OF THE PERSON - 012.345.678-90 (SOME RANDOM UPPERCASE TEXT) - 30 times in total of: R$10.000,00
NAME OF THE PERSON - 01.234.567/0001-89 (SOME RANDOM UPPERCASE TEXT) - 10 times in total of: R$10.000,00
More text after this. """
I need to separate the name of the person, the ID number (that can come in two different patterns: xx.xxx.xxx/0001-xx or xxx.xxx.xxx-xx), the number of times, and the total amount (BRL).
I managed to get the id numbers like this:
r = re.compile(r' [0-9]{3}\.?[0-9]{3}\.?[0-9]{3}\-?[0-9]{2} | [0-9]{2}\.?[0-9]{3}\.?[0-9]{3}\/?[0-9]{4}\-?[0-9]{2} ')
print(r.findall(string))
But that is it, I'm having difficulties trying to get the rest of the info correctly, any help would be very appreciated.
Text all together:
"""Huge text etc etc etc Most frequent senders: NAME OF THE PERSON - 01.234.567/0001-89 (SOME RANDOM UPPERCASE TEXT) - 14 times in total of: R$10.000,00 NAME OF THE PERSON - 012.345.678-90 (SOME RANDOM UPPERCASE TEXT) - 30 times in total of: R$10.000,00 NAME OF THE PERSON - 01.234.567/0001-89 (SOME RANDOM UPPERCASE TEXT) - 10 times in total of: R$10.000,00 Most frequent recipients: NAME OF THE PERSON - 01.234.567/0001-89 (SOME RANDOM UPPERCASE TEXT) - 14 times in total of: R$10.000,00 NAME OF THE PERSON - 012.345.678-90 (SOME RANDOM UPPERCASE TEXT) - 30 times in total of: R$10.000,00 NAME OF THE PERSON - 01.234.567/0001-89 (SOME RANDOM UPPERCASE TEXT) - 10 times in total of: R$10.000,00 More text after this. """
You could use
^\s*([^-\n]+)\s+-\s+([-\d./]+).+\b(\d+)\s+times.+R\$([\d.,]+)$
See a working demo on regex101.com.
Broken down, this reads:
^\s* # start of the line, whitespace
([^-\n]+) # anything not a "-" nor a newline
\s+-\s+ # " - "
([-\d./]+) # the ID part
.+ # everything in that line...
\b(\d+)\s+times # ... backtracking to a digit, followed by spaces and "times"
.+ # once again everything in that line...
R\$([\d.,]+) # ... backtracking to R$, followed by the total amount
$ # end of the line
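Note that a name like Jean-Baptiste Demartial would break this rule, because the name group stops at the first "-". If you are likely to encounter such names, you may use
^\s*((?:(?! - ).)+)\s+-\s+([-\d./]+).+\b(\d+)\s+times.+R\$([\d.,]+)$
instead, where the name group (?:(?! - ).)+ stops only at a "-" with a space on both sides. See another demo on regex101.com.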
In terms of Python, this could be:
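rx = re.compile(r'pattern', re.MULTILINE)  # ^ and $ must match at every line
for match in rx.finditer(your_string_here):
    print(match.group(1))  # name
    print(match.group(2))  # ID
    print(match.group(3))  # number of times
    print(match.group(4))  # total amount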
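Put together with real input, the loop might look like the sketch below. It uses the second pattern (the one that tolerates hyphenated names) and assumes the text still has its line breaks, since `^` and `$` only match per line with `re.MULTILINE`. The sample names and amounts are made up, and `rows` is just an illustrative variable name.

```python
import re

rx = re.compile(
    r'^\s*((?:(?! - ).)+)\s+-\s+([-\d./]+).+\b(\d+)\s+times.+R\$([\d.,]+)$',
    re.MULTILINE,  # let ^ and $ match at the start and end of every line
)

# Made-up sample in the same shape as the multi-line input.
text = """Huge text etc etc etc
Most frequent senders:
JEAN-BAPTISTE DEMARTIAL - 01.234.567/0001-89 (SOME TEXT) - 14 times in total of: R$10.000,00
ANA SILVA - 012.345.678-90 (SOME TEXT) - 30 times in total of: R$2.500,50
More text after this."""

# One dict per matching line; heading lines and free text do not match.
rows = [
    {
        "name": m.group(1),
        "id": m.group(2),
        "times": int(m.group(3)),
        "amount": m.group(4),
    }
    for m in rx.finditer(text)
]

for row in rows:
    print(row)
```

A list of dicts like `rows` can be passed straight to `pandas.DataFrame(rows)` if a dataframe is the end goal.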

Hash Tables: Ransom Note (time limit exceeds)

Harold is a kidnapper who wrote a ransom note, but now he is worried it will be traced back to him through his handwriting. He found a magazine and wants to know if he can cut out whole words from it and use them to create an untraceable replica of his ransom note. The words in his note are case-sensitive and he must use only whole words available in the magazine. He cannot use substrings or concatenation to create the words he needs.
Given the words in the magazine and the words in the ransom note, print Yes if he can replicate his ransom note exactly using whole words from the magazine; otherwise, print No.
Example
magazine = "attack at dawn" note = "Attack at dawn"
The magazine has all the right words, but there is a case mismatch. The answer is No.
Function Description
Complete the checkMagazine function in the editor below. It must print Yes if the note can be formed using the magazine, or No.
checkMagazine has the following parameters:
string magazine[m]: the words in the magazine
string note[n]: the words in the ransom note
Prints
string: either Yes or No, no return value is expected
Input Format
The first line contains two space-separated integers, m and n, the numbers of words in the magazine and the note, respectively.
The second line contains m space-separated strings, each magazine[i].
The third line contains n space-separated strings, each note[i].
Constraints
.
Each word consists of English alphabetic letters (i.e., a to z and A to Z).
Sample Input 0
6 4
give me one grand today night
give one grand today
Sample Output 0
Yes
Sample Input 1
6 5
two times three is not four
two times two is four
Sample Output 1
No
Explanation 1
'two' only occurs once in the magazine.
Sample Input 2
7 4
ive got a lovely bunch of coconuts
ive got some coconuts
Sample Output 2
No
Explanation 2
Harold's magazine is missing the word some.
# Complete the checkMagazine function below.
def checkMagazine(magazine, note):
    hash_table = [[] for _ in range(5)]
    test = 0
    for a in magazine:
        length = len(a)
        key = (length - 1) % 5
        hash_table[key].append(a)
    for a in note:
        key = (len(a) - 1) % 5
        if a not in hash_table[key]:
            test = 1
            break
        else:
            hash_table[key].remove(a)
    if test == 1:
        print("No")
    else:
        print("Yes")
My solution exceeds the time limit on test cases 16 and 17. How can I improve it?
Introducing Python Counters
From the docs: A Counter is a dict subclass for counting hashable objects. It is a collection where elements are stored as dictionary keys and their counts are stored as dictionary values.
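from collections import Counter

def checkMagazine(magazine, note):
    magazine_counter = Counter(magazine)
    note_counter = Counter(note)
    # The problem asks for the answer to be printed, not returned.
    print("Yes" if (magazine_counter & note_counter) == note_counter else "No")
The & operator between the two counters is their intersection: it gives the minimum of corresponding counts. If the result equals the note counter, every word of the note occurs in the magazine at least as often as it is needed, which is what you want.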
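As for why the version in the question is slow: `a not in hash_table[key]` and `hash_table[key].remove(a)` each scan a Python list from the front, and with only five buckets those lists grow with the size of the magazine. A dictionary of counts gives average constant-time lookups instead. Below is a minimal sketch of that idea without Counter; the helper name `can_form_note` is made up, and it returns a boolean so the result is easy to check before printing.

```python
def can_form_note(magazine, note):
    """Return True if every word of note is available in magazine (hypothetical helper)."""
    available = {}
    for word in magazine:
        available[word] = available.get(word, 0) + 1

    for word in note:
        if available.get(word, 0) == 0:
            return False      # word is missing or all copies are used up
        available[word] -= 1  # consume one copy of the word
    return True


def checkMagazine(magazine, note):
    print("Yes" if can_form_note(magazine, note) else "No")


checkMagazine("give me one grand today night".split(), "give one grand today".split())
checkMagazine("two times three is not four".split(), "two times two is four".split())
```

Both loops touch each word once, so the work grows linearly with the number of words rather than with the product of the two lengths.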

removing non English words from text in df.columns words contain letters and numbers

How can I remove non-English words from the text in a DataFrame column, where the unwanted words contain both letters and numbers?
Example:
df['text']
'the interiors nrd studio | happy mothers day ”there is no influence so powerful as that of the mother.” —sara josepha hale... happy mother’s day mom & to all the mothers around the world! lots of light natasha
0wet3bxtfl'
'but still missing you every day happy mothers day francis mcclafferty (mccool) 9wlhju7cxf'
From the above 2 rows I need to remove the words '0wet3bxtfl' and '9wlhju7cxf'.
The example retains some strings that would not be found in a list of English words ("nrd", "mcclafferty", "mccool") while removing "0wet3bxtfl" and "9wlhju7cxf". The expected result is therefore probably best achieved by removing any non-whitespace sequence that contains either a letter followed by a digit or a digit followed by a letter (together with any spaces that follow), without regard to whether the words are "English" or not.
The following would do this:
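import re
...
pattern = r'[^\s]*(\d[a-zA-Z]|[a-zA-Z]\d)[^\s]* *'
# re.sub works on one string at a time, so apply it to each row of the column
filtered = df['text'].apply(lambda s: re.sub(pattern, '', s))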
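To see what the pattern removes without needing a DataFrame, here is a small sketch on plain strings. The two rows are shortened versions of the examples in the question, and the names `MIXED`, `rows` and `cleaned` are made up for illustration. The `rstrip()` is an addition: the pattern removes spaces after a token but not the one before it, so a removed last word would otherwise leave a trailing space.

```python
import re

# Tokens that contain a digit next to a letter, plus any spaces that follow.
MIXED = re.compile(r'[^\s]*(?:\d[a-zA-Z]|[a-zA-Z]\d)[^\s]* *')

rows = [
    'happy mothers day natasha 0wet3bxtfl',
    'happy mothers day francis mcclafferty (mccool) 9wlhju7cxf',
]

cleaned = [MIXED.sub('', row).rstrip() for row in rows]

for row in cleaned:
    print(row)
```

Words such as "mcclafferty" and "(mccool)" survive because they contain no digits, while a pure number would also survive because it contains no letters.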

Cropping out a portion of a string and printing using regex

I am trying to crop a portion of a list of strings and print them. The data looks like the following -
Books are on the table\nPick them up
Pens are in the bag\nBring them
Cats are roaming around
Dogs are sitting
Pencils, erasers, ruler cannot be found\nSearch them
Laptops, headphones are lost\nSearch for them
(These are just a few of the 100 lines of data in the file.)
I have to crop the string before the \n in lines 1, 2, 5 and 6 and print it. I also have to print lines 3 and 4 along with them. Expected output -
Books are on the table
Pens are in the bag
Cats are roaming around
Dogs are sitting
Pencils erasers ruler cannot be found
Laptops headphones are lost
What I have tried so far -
First I replace the comma with a space - a = name.replace(',',' ');
Then I use regex to crop out the substring. My regex expression is - b = r'.*-\s([\w\s]+)\\n'. I am unable to print line 3 and 4 in which \n is not present.
The output that I am receiving now is -
Books are on the table
Pens are in the bag
Pencils erasers ruler cannot be found
Laptops headphones are lost
What should I add to my expression to print out lines 3 and 4 as well?
TIA
You may match and remove the line parts starting with a combination of a backslash and n, or all punctuation (non-word and non-whitespace) chars using a re.sub:
a = re.sub(r'\\n.*|[^\w\s]+', '', a)
See the regex demo
Details
\\n.* - a \, n, and then the rest of the line
| - or
[^\w\s]+ - 1 or more chars other than word and whitespace chars
If you need to make sure there is an uppercase letter after \n, you may add [A-Z] after n in the pattern.
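Applied to a few of the sample lines, the substitution can be sketched as follows. This assumes, as in the question, that the file contains a literal backslash followed by n rather than a real line break, which is why the sample strings are written as raw strings; the names `lines` and `cropped` are made up.

```python
import re

# Raw strings: \n here is a literal backslash followed by "n", as in the file.
lines = [
    r'Books are on the table\nPick them up',
    r'Cats are roaming around',
    r'Pencils, erasers, ruler cannot be found\nSearch them',
]

# Drop everything from the literal \n onwards, and any punctuation before it.
cropped = [re.sub(r'\\n.*|[^\w\s]+', '', line) for line in lines]

for line in cropped:
    print(line)
```

Lines without the \n marker pass through unchanged apart from punctuation removal, so lines 3 and 4 are printed as well.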
I know many people like to twist their minds into knots with regex but why not,
with open('geek_lines.txt') as lines:
    for line in lines:
        print(line.rstrip().split(r'\n')[0])
Simple to write, simple to read, seems to produce the correct result.
Books are on the table
Pens are in the bag
Cats are roaming around
Dogs are sitting
Pencils, erasers, ruler cannot be found
Laptops, headphones are lost

Regex to match strings in quotes that contain only 3 or less capitalized words

I've searched and searched, but can't find any relief for my regex woes.
I wrote the following dummy sentence:
Watch Joe Smith Jr. and Saul "Canelo" Alvarez fight Oscar de la Hoya and Genaddy Triple-G Golovkin for the WBO belt GGG. Canelo Alvarez and Floyd 'Money' Mayweather fight in Atlantic City, New Jersey. Conor MacGregor will be there along with Adonis Superman Stevenson and Mr. Sugar Ray Robinson. "Here Goes a String". 'Money Mayweather'. "this is not a-string", "this is not A string", "This IS a" "Three Word String".
I'm looking for a regular expression that will return the following when used in Python 3.6:
Canelo, Money, Money Mayweather, Three Word String
The regex that has gotten me the closest is:
(["'])[A-Z](\\?.)*?\1
I want it to only match strings of 3 capitalized words or fewer immediately surrounded by single or double quotes. Unfortunately, so far it seems to match any string in quotes, no matter what the length, no matter what the content, as long as it begins with a capital letter.
I've put a lot of time into trying to hack through it myself, but I've hit a wall. Can anyone with stronger regex kung-fu give me an idea of where I'm going wrong here?
Try to use this one: (["'])((?:[A-Z][a-z]+ ?){1,3})\1
(["']) - opening quote
((?:[A-Z][a-z]+ ?){1,3}) - capitalized word repeated 1 to 3 times, separated by spaces (this is group 2)
[A-Z] - capital char (word beginning char)
[a-z]+ - non-capital chars (end of word)
_? - space separator of capitalized words (_ is a space), ? for single word w/o ending space
{1,3} - 1 to 3 times
\1 - closing quote, same as opening
Group 2 is what you want.
Match 1
Full match 29-37 `"Canelo"`
Group 1. 29-30 `"`
Group 2. 30-36 `Canelo`
Match 2
Full match 146-153 `'Money'`
Group 1. 146-147 `'`
Group 2. 147-152 `Money`
Match 3
Full match 318-336 `'Money Mayweather'`
Group 1. 318-319 `'`
Group 2. 319-335 `Money Mayweather`
Match 4
Full match 398-417 `"Three Word String"`
Group 1. 398-399 `"`
Group 2. 399-416 `Three Word String`
RegEx101 Demo: https://regex101.com/r/VMuVae/4
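In Python, group 2 of this pattern can be collected with `re.finditer`. The sketch below runs it on a shortened version of the sentence from the question; the variable names are made up for illustration.

```python
import re

# Group 1 is the opening quote, group 2 the 1 to 3 capitalized words.
pattern = re.compile(r'''(["'])((?:[A-Z][a-z]+ ?){1,3})\1''')

# Shortened version of the sample sentence.
txt = (
    'Saul "Canelo" Alvarez and Floyd \'Money\' Mayweather. '
    '"Here Goes a String". \'Money Mayweather\'. '
    '"this is not A string", "This IS a" "Three Word String".'
)

found = [m.group(2) for m in pattern.finditer(txt)]
print(found)
```

`re.findall` would return a tuple of both groups for every match, so `finditer` with `group(2)` keeps the result to just the quoted text.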
Working with the text you've provided, I would try to use regular expression lookaround to get the words surrounded by quotes and then apply some conditions on those matches to determine which ones meet your criterion. The following is what I would do:
[p for p in re.findall(r'(?<=[\'"])[\w ]{2,}(?=[\'"])', txt) if all(x.istitle() for x in p.split(' ')) and len(p.split(' ')) <= 3]
txt is the text you've provided here. The output is the following:
# ['Canelo', 'Money', 'Money Mayweather', 'Three Word String']
Cleaner:
matches = []
for m in re.findall(r'(?<=[\'"])[\w ]{2,}(?=[\'"])', txt):
    if all(x.istitle() for x in m.split(' ')) and len(m.split(' ')) <= 3:
        matches.append(m)
print(matches)
# ['Canelo', 'Money', 'Money Mayweather', 'Three Word String']
Here's my go at it: ([\"'])(([A-Z][^ ]*? ?){1,3})\1
