regex - 1. add space between string, 2. ignore certain pattern - python

I have two things that I would like to replace in my text files.
Add " " between String end with '#' (eg. ABC#) into (eg. A B C)
Ignore certain Strings end with 'H' or 'xx:xx:xx' (eg. 1111H - ignore), (eg. if is 1111, process into 'one one one one')
so far this is my code..
import re
dest1 = r"C:\\Users\CL\\Desktop\\Folder"
files = os.listdir(dest1)
#dictionary to process Int to Str
numbers = {"0":"ZERO ", "1":"ONE ", "2":"TWO ", "3":"THREE ", "4":"FOUR ", "5":"FIVE ", "6":"SIX ", "7":"SEVEN ", "8":"EIGHT ", "9":"NINE "}
for f in files:
text= open(dest1+ "\\" + f,"r")
text_read = text.read()
#num sub pattern
text = re.sub('[%s]\s?' % ''.join(numbers), lambda x: numbers[x.group().strip()]+' ', text)
#write result to file
data = f.write(text)
f.close()
sample .txt
1111H I have 11 ABC# apples
11:12:00 I went to my# room
output required
1111H I have ONE ONE A B C apples
11:12:00 I went to M Y room
also.. i realized when I write the new result, the format gets 'messy' without the breaks. not sure why.
#current output
ONE ONE ONE ONE H - I HAVE ONE ONE ABC# APPLES
ONE ONE ONE TWO H - I WENT TO MY# ROOM
#overwritten output
ONE ONE ONE ONE H - I HAVE ONE ONE ABC# APPLES ONE ONE ONE TWO H - I WENT TO MY# ROOM

You can use
def process_match(x):
if x.group(1):
return " ".join(x.group(1).upper())
elif x.group(2):
return f'{numbers[x.group(2)] }'
else:
return x.group()
print(re.sub(r'\b(?:\d+[A-Z]+|\d{2}:\d{2}:\d{2})\b|\b([A-Za-z]+)#|([0-9])', process_match, text_read))
# => 1111H I have ONE ONE A B C apples
# 11:12:00 I went to M Y room
See the regex demo. The main idea behind this approach is to parse the string only once capturing or not parts of it, and process each match on the go, either returning it as is (if it was not captured) or converted chunks of text (if the text was captured).
Regex details:
\b(?:\d+[A-Z]+|\d{2}:\d{2}:\d{2})\b - a word boundary, and then either one or more digits and one or more uppercase letters, or three occurrences of colon-separated double digits, and then a word boundary
| - or
\b([A-Za-z]+)# - Group 1: words with # at the end: a word boundary, then oneor more letters and a #
| - or
([0-9]) - Group 2: an ASCII digit.

Related

preprocessing the text and excluding form footnotes , extra spaces and

I need to clean my corpus, it includes these problems
multiple spaces --> Tables .
footnote --> 10h 50m,1
unknown ” --> replace " instead of ”
e.g
for instance, you see it here:
On 1580 November 12 at 10h 50m,1 they set Mars down at 8° 36’ 50” Gemini2 without mentioning the horizontal variations, by which term I wish the diurnal parallaxes and the refractions to be understood in what follows. Now this observation is distant and isolated. It was reduced to the moment of opposition using the diurnal motion from the Prutenic Tables .
I have done it using these functions
def fix4token(x):
x=re.sub('”', '\"', x)
if (x[0].isdigit()== False )| (bool(re.search('[a-zA-Z]', x))==True ):
res=x.rstrip('0123456789')
output = re.split(r"\b,\b",res, 1)[0]
return output
else:
return x
def removespaces(x):
res=x.replace(" ", " ")
return(res)
it works not bad for this but the result is so
On 1580 November 12 at 10h 50m, they set Mars down at 8° 36’ 50" Gemini without mentioning the horizontal variations, by which term I wish the diurnal parallaxes and the refractions to be understood in what follows. Now this observation is distant and isolated. It was reduced to the moment of opposition using the diurnal motion from the Prutenic Tables.
but the problem is it damaged other paragraphs. it does not work ver well,
I guess because this break other things
x=re.sub('”', '\"', x)
if (x[0].isdigit()== False )| (bool(re.search('[a-zA-Z]', x))==True ):
res=x.rstrip('0123456789')
output = re.split(r"\b,\b",res, 1)[0]
what is the safest way to do these?
1- remove footnotes like in these phrases
"10h 50m,1" or (extra foot note in text after comma)
"Gemini2" (zodic names of month + footnote)
without changing another part of the text (e.g my approach will break the "DC2" to "DC" which is not desired
2- remove multiple spaces before dot . like "Tables ." to no spaces
or remove multiple before, like: ", by which term" to this 9only one space) ", by which term"
3-replace unknown ” -> replace " ...which is done
thank you
You can use
text = re.sub(r'\b(?:(?<=,)\d+|(Capricorn|Aquarius|Pisces|Aries|Taurus|Gemini|Cancer|Leo|Virgo|Libra|Scorpio|Ophiuchus|Sagittarius)\d+)\b|\s+(?=[.,])', r'\1', text, flags=re.I).replace('”', '"')
text = re.sub(r'\s{2,}', ' ', text)
Details:
\b - a word boundary
(?: - start of a non-capturing group:
(?<=,)\d+ - one or more digits that are preceded with a comma
| - or
(Capricorn|Aquarius|Pisces|Aries|Taurus|Gemini|Cancer|Leo|Virgo|Libra|Scorpio|Ophiuchus|Sagittarius)\d+ - one of the zodiac sign words (captured into Group 1, \1 in the replacement pattern refers to this value) and then one or more digits
) - the end of the non-capturing group
\b
| - or
\s+ - one or more whitespaces
(?=[.,]) - that are immediately followed with . or ,.
The .replace('”', '"') replaces all ” with a " char.
The re.sub(r'\s{2,}', ' ', text) code replaces all chunks of two or more whitespaces with a single regular space.

How to get all sentences that contain multiple words in Python

I am trying to make a regular expressions to get all sentences containing two words (order doesn't matter), but I can't find the solution for this.
"Supermarket. This apple costs 0.99."
I want to get back the following sentence:
This apple costs 0.99.
I tried:
([^.]*?(apple)*?(costs)[^.]*\.)
I have problems because the price contains a dot. Also this expressions gives back results with only one of the words.
Approach: For each Phrase, we have to find the sentences which contain all the words of the phrase. So, for each word in the given phrase, we check if a sentence contains it. We do this for each sentence. This process of searching may become faster if the words in the sentence are stored in a set instead of a list.
Below is the implementation of above approach in python:
def getRes(sent, ph):
sentHash = dict()
# Loop for adding hased sentences to sentHash
for s in range(1, len(sent)+1):
sentHash[s] = set(sent[s-1].split())
# For Each Phrase
for p in range(0, len(ph)):
print("Phrase"+str(p + 1)+":")
# Get the list of Words
wordList = ph[p].split()
res = []
# Then Check in every Sentence
for s in range(1, len(sentHash)+1):
wCount = len(wordList)
# Every word in the Phrase
for w in wordList:
if w in sentHash[s]:
wCount -= 1
# If every word in phrase matches
if wCount == 0:
# add Sentence Index to result Array
res.append(s)
if(len(res) == 0):
print("NONE")
else:
print('% s' % ' '.join(map(str, res)))
# Driver Function
def main():
sent = ["Strings are an array of characters",
"Sentences are an array of words"]
ph = ["an array of", "sentences are strings"]
getRes(sent, ph)
main()
You use a negated character class [^.] which matches any character except a dot.
But in your example data Supermarket. This apple costs 0.99. there are 2 dots before the dot at the end, so you can not cross the dot after Supermarket. to match apple
You could for example match until the first dot, then assert costs and use a capture group to match the part with apple and make sure the line ends with a dot.
The assertion for word 1 with a match for word 2 will match the words in both combinations.
^[^.]*\.\s*(?=.*\bcosts\b)(.*\bapple\b.*\.)$
Explanation
^[^.]*\. From the start of the string, match until and including the first dot
\s* Match 0+ whitespace character
(?=.*\bcosts\b) Positive lookahead, assert costs at the right
( Capture group 1 (this has the desired value)
.*\bapple\b.*\. Match the rest of the line that includes apple and ends with a dot
) Close group 1
$ Assert end of string
Regex demo | Python demo
import re
regex = r"^[^.]*\.\s*(?=.*\bcosts\b)(.*\bapple\b.*\.)$"
test_str = ("Supermarket. This apple costs 0.99.\n"
"Supermarket. This costs apple 0.99.\n"
"Supermarket. This apple is 0.99.\n"
"Supermarket. This orange costs 0.99.")
print(re.findall(regex, test_str, re.MULTILINE))
Output
['This apple costs 0.99.', 'This costs apple 0.99.']
I also suggest to first extract sentences and then find sentences that have both words.
However, the problem of splitting text into sentences is pretty hard because of existence of abbreviations, unusual names, etc. One way to do it is by using nltk.tokenize.punkt module.
You'll need to install NLTK and then run this in Python:
import nltk
nltk.download('punkt')
After that you can use English language sentence tokenizer with two regexes:
TEXT = 'Mr. Bean is in supermarket. iPhone 12 by Apple Inc. costs $999.99.'
WORD1 = 'apple'
WORD2 = 'costs'
import nltk.data, re
# Regex helper
find_word = lambda w, s: re.search(r'(^|\W)' + w + r'(\W|$)', s, re.I)
eng_sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
for sent in eng_sent_detector.tokenize(TEXT):
if find_word(WORD1, sent) and find_word(WORD2, sent):
print (sent,"\n----")
Output:
iPhone 12 by Apple Inc. costs $999.99.
----
Notice that it handles numbers and abbreviations for you.

How to print specific words in colour on python?

I want to print a specific word a different color every time it appears in the text. In the existing code, I've printed the lines that contain the relevant word "one".
import json
from colorama import Fore
fh = open(r"fle.json")
corpus = json.loads(fh.read())
for m in corpus['smsCorpus']['message']:
identity = m['#id']
text = m['text']['$']
strtext = str(text)
utterances = strtext.split()
if 'one' in utterances:
print(identity,text, sep ='\t')
I imported Fore but I don't know where to use it. I want to use it to have the word "one" in a different color.
output (section of)
44814 Ohhh that's the one Johnson told us about...can you send it to me?
44870 Kinda... I went but no one else did, I so just went with Sarah to get lunch xP
44951 No, it was directed in one place loudly and stopped when I stoppedmore or less
44961 Because it raised awareness but no one acted on their new awareness, I guess
44984 We need to do a fob analysis like our mcs onec
Thank you
You could also just use the ANSI color codes in your strings:
# define aliases to the color-codes
red = "\033[31m"
green = "\033[32m"
blue = "\033[34m"
reset = "\033[39m"
t = "That was one hell of a show for a one man band!"
utterances = t.split()
if "one" in utterances:
# figure out the list-indices of occurences of "one"
idxs = [i for i, x in enumerate(utterances) if x == "one"]
# modify the occurences by wrapping them in ANSI sequences
for i in idxs:
utterances[i] = red + utterances[i] + reset
# join the list back into a string and print
utterances = " ".join(utterances)
print(utterances)
If you only have 1 coloured word you can use this I think, you can expand the logic for n coloured words:
our_str = "Ohhh that's the one Johnson told us about...can you send it to me?"
def colour_one(our_str):
if "one" in our_str:
str1, str2 = our_str.split("one")
new_str = str1 + Fore.RED + 'one' + Style.RESET_ALL + str2
else:
new_str = our_str
return new_str
I think this is an ugly solution, not even sure if it works. But it's a solution if you can't find anything else.
i use colour module from this link or colored module that link
Furthermore if you dont want to use a module for coloring you can address to this link or that link

How can I find the best fit subsequences of a large string?

Say I have one large string and an array of substrings that when joined equal the large string (with small differences).
For example (note the subtle differences between the strings):
large_str = "hello, this is a long string, that may be made up of multiple
substrings that approximately match the original string"
sub_strs = ["hello, ths is a lng strin", ", that ay be mad up of multiple",
"subsrings tat aproimately ", "match the orginal strng"]
How can I best align the strings to produce a new set of sub strings from the original large_str? For example:
["hello, this is a long string", ", that may be made up of multiple",
"substrings that approximately ", "match the original string"]
Additional Info
The use case for this is to find the page breaks of the original text from the existing page breaks of text extracted from a PDF document. Text extracted from the PDF is OCR'd and has small errors compared to the original text, but the original text does not have page breaks. The goal is to accurately page break the original text avoiding the OCR errors of the PDF text.
Concatenate the substrings
Align the concatenation with the original string
Keep track of which positions in the original string are aligned with the boundaries between the substrings
Split the original string on the positions aligned with those boundaries
An implementation using Python's difflib:
from difflib import SequenceMatcher
from itertools import accumulate
large_str = "hello, this is a long string, that may be made up of multiple substrings that approximately match the original string"
sub_strs = [
"hello, ths is a lng strin",
", that ay be mad up of multiple",
"subsrings tat aproimately ",
"match the orginal strng"]
sub_str_boundaries = list(accumulate(len(s) for s in sub_strs))
sequence_matcher = SequenceMatcher(None, large_str, ''.join(sub_strs), autojunk = False)
match_index = 0
matches = [''] * len(sub_strs)
for tag, i1, i2, j1, j2 in sequence_matcher.get_opcodes():
if tag == 'delete' or tag == 'insert' or tag == 'replace':
matches[match_index] += large_str[i1:i2]
while j1 < j2:
submatch_len = min(sub_str_boundaries[match_index], j2) - j1
while submatch_len == 0:
match_index += 1
submatch_len = min(sub_str_boundaries[match_index], j2) - j1
j1 += submatch_len
else:
while j1 < j2:
submatch_len = min(sub_str_boundaries[match_index], j2) - j1
while submatch_len == 0:
match_index += 1
submatch_len = min(sub_str_boundaries[match_index], j2) - j1
matches[match_index] += large_str[i1:i1+submatch_len]
j1 += submatch_len
i1 += submatch_len
print(matches)
Output:
['hello, this is a long string',
', that may be made up of multiple ',
'substrings that approximately ',
'match the original string']
You are trying to solve sequence alignment problem. In your case it is a "local" sequence alignment. It can be solved with Smith-Waterman approach. One possible implementation is here.
If you run it, you receive:
large_str = "hello, this is a long string, that may be made up of multiple substrings that approximately match the original string"
sub_strs = ["hello, ths is a lng sin", ", that ay be md up of mulple", "susrings tat aproimately ", "manbvch the orhjgnal strng"]
for sbs in sub_strs:
water(large_str, sbs)
>>>
Identity = 85.185 percent
Score = 210
hello, this is a long strin
hello, th s is a l ng s in
hello, th-s is a l-ng s--in
Identity = 84.848 percent
Score = 255
, that may be made up of multiple
, that ay be m d up of mul ple
, that -ay be m-d- up of mul--ple
Identity = 83.333 percent
Score = 225
substrings that approximately
su s rings t at a pro imately
su-s-rings t-at a-pro-imately
Identity = 75.000 percent
Score = 175
ma--tch the or-iginal string
ma ch the or g nal str ng
manbvch the orhjg-nal str-ng
The middle line shows matching characters. If you need the positions, look for max_i value to get ending position in original string. The starting position will be the value of i at the end of water() function.
(The additional info makes a lot of the following unnecessary. It was written for a situation where the substrings provided might be any permutation of the order in which they occur in the main string)
There will be a dynamic programming solution for a problem very close to this. In the dynamic programming algorithm that gives you edit distance, the state of the dynamic program is (a, b) where a is the offset into the first string and b is the offset into the second string. For each pair (a, b) you work out the smallest possible edit distance that matches the first a characters of the first string with the first b characters of the second string, working out (a, b) from (a-1, b-1), (a-1, b), and (a, b-1).
You can now write a similar algorithm with state (a, n, m, b) where a is the total number of characters consumed by substrings so far, n is the index of the current substring, m is the position within the current substring, and b is the number of characters matched in the second string. This solves the problem of matching b against a string composed by pasting together any number of copies of any of the available substrings.
This is a different problem, because if you are trying to reconstitute a long string from fragments, you might get a solution that uses the same fragment more than once, but if you are doing this you might hope that the answer is obvious enough that the collection of substrings it produces happens to be a permutation of the collection given to it.
Because the edit distance returned by this method will always be at least as good as the best edit distance when you force a permutation, you could also use this to compute a lower bound on the best possible edit distance for a permutation, and run a branch and bound algorithm to find the best permutation.

Count characters in string

So I'm trying to count anhCrawler and return the number of characters with and without spaces alone with the position of "DEATH STAR" and return it in theReport. I can't get the numbers to count correctly either. Please help!
anhCrawler = """Episode IV, A NEW HOPE. It is a period of civil war. \
Rebel spaceships, striking from a hidden base, have won their first \
victory against the evil Galactic Empire. During the battle, Rebel \
spies managed to steal secret plans to the Empire's ultimate weapon, \
the DEATH STAR, an armored space station with enough power to destroy \
an entire planet. Pursued by the Empire's sinister agents, Princess Leia\
races home aboard her starship, custodian of the stolen plans that can \
save her people and restore freedom to the galaxy."""
theReport = """
This text contains {0} characters ({1} if you ignore spaces).
There are approximately {2} words in the text. The phrase
DEATH STAR occurs and starts at position {3}.
"""
def analyzeCrawler(thetext):
numchars = 0
nospacechars = 0
numspacechars = 0
anhCrawler = thetext
word = anhCrawler.split()
for char in word:
numchars = word[numchars]
if numchars == " ":
numspacechars += 1
anhCrawler = re.split(" ", anhCrawler)
for char in anhCrawler:
nospacechars += 1
numwords = len(anhCrawler)
pos = thetext.find("DEATH STAR")
char_len = len("DEATH STAR")
ds = thetext[261:271]
dspos = "[261:271]"
return theReport.format(numchars, nospacechars, numwords, dspos)
print analyzeCrawler(theReport)
You're overthinking this problem.
Number of chars in string (returns 520):
len(anhCrawler)
Number of non-whitespace characters in string (using split as using split automatically removes the whitespace, and join creates a string with no whitespace) (returns 434):
len(''.join(anhCrawler.split()))
Finding the position of "DEATH STAR" (returns 261):
anhCrawler.find("DEATH STAR")
Here, you have simplilfied version of your function:
import re
def analyzeCrawler2(thetext, text_to_search = "DEATH STAR"):
numchars = len(anhCrawler)
nospacechars = len(re.sub(r"\s+", "", anhCrawler))
numwords = len(anhCrawler.split())
dspos = anhCrawler.find(text_to_search)
return theReport.format(numchars, nospacechars, numwords, dspos)
print analyzeCrawler2(theReport)
This text contains 520 characters (434 if you ignore spaces).
There are approximately 87 words in the text. The phrase
DEATH STAR occurs and starts at position 261.
I think the trick part is to remove white spaces from the string and to calculate the non-space character count. This can be done simply using regular expression. Rest should be self-explanatory.
First off, you need to indent the code that's inside a function. Second... your code can be simplified to the following:
theReport = """
This text contains {0} characters ({1} if you ignore spaces).
There are approximately {2} words in the text. The phrase
DEATH STAR is the {3}th word and starts at the {4}th character.
"""
def analyzeCrawler(thetext):
numchars = len(anhCrawler)
nospacechars = len(anhCrawler.replace(' ', ''))
numwords = len(anhCrawler.split())
word = 'DEATH STAR'
wordPosition = anhCrawler.split().index(word)
charPosition = anhCrawler.find(word)
return theReport.format(
numchars, nospacechars, numwords, wordPosition, charPosition
)
I modified the last two format arguments because it wasn't clear what you meant by dspos, although maybe it's obvious and I'm not seeing it. In any case, I included the word and char position instead. You can determine which one you really meant to include.

Categories

Resources