how to deal with compound words in regex - python

I am writing regexes that return the definitions of abbreviations from a text. I have solved a number of cases, but I cannot find a solution for the case where the abbreviation has a different number of characters than its expansion has words, for example because one of the words is a compound, like below.
string = 'CRC comes from the words colorectal cancer'
I would like to get 'colorectal cancer' based on its short form. Do you have any advice on what steps I should take? I thought of splitting compound words, but that will lead to other problems.

In CRC, the first word should begin with C, and the next word could begin with either R or C. If the second word begins with R, the third word should begin with C, or there may be no third word at all. At the same time you should check whether the second word starts with C; if so, you don't need to check for a third word, because the R is then hidden inside the first, compound word (as in 'colorectal'). An OR condition (alternation) in the regex may be of help here, but I cannot pinpoint exactly how without more data samples.
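For example, a minimal sketch of that alternation idea in Python (this pattern is just one guess at encoding the CRC case, not a general solution):

import re

# Either the first word covers both the C and the R (a compound word like
# 'colorectal') followed by a C word, or C, R and C each start their own word.
pattern = re.compile(r'\b(c\w*r\w*\s+c\w+|c\w+\s+r\w+\s+c\w+)\b')

text = 'CRC comes from the words colorectal cancer'
# Search only the part after the abbreviation, lowercased, so that
# 'CRC comes' itself cannot match.
print(pattern.search(text.split(' ', 1)[1].lower()).group(1))
# colorectal cancer
print(pattern.search('cyclic redundancy check').group(1))
# cyclic redundancy check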

Related

How can I find repeated string segments in Python?

So I have some medium-length strings - each somewhere between a few words and a few sentences. Sometimes, a substring in the text is repeated twice in a row. I need to write code that automatically identifies the repeated part, or at least flags it with high probability.
What I know:
The repeated substring is a series of a few whole words (and punctuation marks). A repeat will not happen in the middle of a word.
The repeat is of a variable length. It can be a few words to a few sentences itself. But it's always at least a few words long. I would like to avoid flagging single word repetitions if possible.
When a repeat happens, it's always repeated exactly once, and right after the previous appearance. right after the previous appearance. (<- example)
I need to run this check on about a million different strings, so the code has to be somewhat efficient at least (not the brute force check-every-option approach).
I've been struggling with this for a while now. Would really appreciate your help.
Since the repetition of one word is a subclass of a multiple-word repetition, it's already helpful to match single words or word-like sequences. Here is the regular expression I tried on your question in an editor with regex search:
(\<\w.{3,16}\w\>).{2,}\1
This is the first repetition found
The repeat is of a variable length. It can be a few words to a few sentences itself. But it's always at least a few words long. I would like to avoid flagging single word repetitions if possible.
But it next finds 'repeat' in 'repeating', so we have to tune the limits.
The part (\<\w.{3,16}\w\>) means
from word start (including a character)
3 to 16 arbitrary characters
before word end (including a character)
In other words, one or more words with a total character count of 5 to 18.
The part .{2,}\1 means
at least two characters
no upper limit
followed by the captured match again (the backreference \1)
Here, the lower limit can be higher. An upper limit should be tried, especially on longer text.
I'd suggest starting by finding short character sequences that repeat, then refining the result by looking for longer sequences that repeat (plus additional characters at the end).
It's also a matter of preprocessing. I'd guess that repeating multiple-word sequences would be missed if line breaks (instead of spaces) occur in different places.
To automate this further, you may switch to Python's re module.
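For example, a rough translation to Python's re (a sketch: \b replaces the editor's \< and \> word boundaries, and re.DOTALL lets the gap and the repeat span line breaks):

import re

# Same limits as above: a 5-18 character word-bounded chunk, a gap of at
# least two characters, then the captured chunk again.
pattern = re.compile(r'\b(\w.{3,16}\w)\b(.{2,})\1', re.DOTALL)

text = ('The repeat is of a variable length. It can be a few words to a few '
        'sentences itself. But it is always at least a few words long.')
m = pattern.search(text)
if m:
    print(m.group(1))
# a few words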

Print something when a word is in a word list

So I am currently trying to build a Caesar cipher cracker that automatically tries all the possibilities and compares them to a big list of words to see whether they are real words, so some sort of dictionary attack, I guess.
I found a list with a lot of German words, and they are even split so that each word is on a new line. Currently, I am struggling with comparing the sentence that I currently have against the whole word list, so that when the program sees that a word in my sentence is also a word in the word list, it prints out that this is a real word and possibly the right sentence.
This is how far I currently am; I have not included the code with which I try all the 26 letters, only my way of looking through the word list and comparing it to a sentence. Maybe someone can tell me what I am doing wrong and why it doesn't work:
No idea why it doesn't work. I have also tried it with regular expressions, but nothing works. The list is really long (166k words).
There is a \n at the end of each word in the list you created from the file, so the words will never compare equal to what they are compared against.
Remove the newline character before appending (you can, for example, use wordlist.append(line.rstrip())).
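A minimal sketch of the fixed reading loop (assuming the file is called 'wordlist.txt'; a set instead of a list also keeps the 166k membership tests fast):

# Read the word list once, stripping the trailing newline from each line.
wordlist = set()
with open('wordlist.txt', encoding='utf-8') as f:
    for line in f:
        wordlist.add(line.rstrip())

sentence = 'hallo welt'
for word in sentence.lower().split():
    if word in wordlist:
        print(word, 'is a real word - possibly the right sentence')

Membership testing in a set is O(1) on average, so this stays fast even with 166k words, whereas scanning a plain list for every candidate word would not.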

How to remove all the spaces between letters?

I have text with words like this: a n a l i z e, c l a s s, etc. But there are normal words as well. I need to remove all these spaces between the letters of the words.
reg_let = re.compile('\s[А-Яа-яёЁa-zA-Z](\s)', re.DOTALL)
text = 'T h i s is exactly w h a t I needed'
text = re.sub(reg_let, '', text)
text
OUTPUT:
'Tiis exactlyhtneeded' (while I need - 'This is exactly what I needed')
As far as I know, there is no easy way to do it, because your biggest problem is to distinguish the meaningful words; in other words, you need some semantic engine to tell you which word is meaningful in the sentence.
The only thing I can think of is a word embedding model; without anything like that you can remove as many spaces as you want, but you can't distinguish the words, meaning you'll never know which spaces not to remove.
I would love it if someone corrected me if there's a simpler way I'm not aware of.
There is no easy solution to this problem.
The only solution that I can think of is one in which a dictionary is used to check whether a word is correct or not (i.e., present in the English dictionary).
But even doing so you'll get a lot of false positives. For example, if I have the text:
a n a n a s
the words:
a
an
as
are all correct in the English dictionary. How do I split the text? For me, as a human who can read the text, it is clear that the word here is ananas. But one could also split the text as:
an an as
which is grammatically correct but doesn't make sense in English. The correctness is given by the context, and I, as a human, can understand the context. One could split and concatenate the string in different ways to check whether the result makes sense, but unfortunately there is no library or simple procedure that can understand context.
Machine Learning could be a way, but there is no perfect solution.
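To illustrate the dictionary idea, here is a toy sketch with a hypothetical mini dictionary. It greedily joins the longest run of single letters that forms a known word, which is exactly where the ananas / 'an an as' ambiguity described above bites:

dictionary = {'this', 'is', 'exactly', 'what', 'i', 'needed', 'ananas'}

def join_spaced_letters(text):
    tokens = text.lower().split()
    out, i = [], 0
    while i < len(tokens):
        # Find the longest run of single letters starting at position i.
        j = i
        while j < len(tokens) and len(tokens[j]) == 1:
            j += 1
        # Try the longest candidate first, then shorter prefixes of the run.
        k = j
        while k - i > 1 and ''.join(tokens[i:k]) not in dictionary:
            k -= 1
        if k - i > 1:
            out.append(''.join(tokens[i:k]))
            i = k
        else:
            out.append(tokens[i])
            i += 1
    return ' '.join(out)

print(join_spaced_letters('T h i s is exactly w h a t I needed'))
# this is exactly what i needed
print(join_spaced_letters('a n a n a s'))
# ananas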

Matching words with Regex (Python 3)

I have been staring at this problem for hours; I don't know what regex to use to solve it.
Problem:
Given the following input strings, find all possible output words 5 characters or longer.
qwertyuytresdftyuioknn
gijakjthoijerjidsdfnokg
Your program should find all possible words (5+ characters) that can be derived from the strings supplied.
Use http://norvig.com/ngrams/enable1.txt as your search dictionary.
The order of the output words doesn't matter.
queen question
gaeing garring gathering gating geeing gieing going goring
Assumptions about the input strings:
QWERTY keyboard
Lowercase a-z only, no whitespace or punctuation
The first and last characters of the input string will always match the first and last characters of the desired output word.
Don't assume users take the most efficient path between letters
Every letter of the output word will appear in the input string
Attempted solution:
First I downloaded the words from that webpage and stored them in a file on my computer ('words.txt'):
import requests
res = requests.get('http://norvig.com/ngrams/enable1.txt')
res.raise_for_status()
fp = open('words.txt', 'wb')
for chunk in res.iter_content(100000):
    fp.write(chunk)
fp.close()
I'm then trying to match the words I need using regex. The problem is that I don't know how to format my re.compile() to achieve this.
import re
input = 'qwertyuytresdftyuioknn' #example
fp= open('words.txt')
string = fp.read()
regex = re.compile(input[0]+'\w{3,}'+input[-1]) #wrong need help here
regex.findall(string)
Obviously this is wrong, since I need to match letters from my input string going from left to right, not just any letters, which is what \w{3,} mistakenly does. Any help with this would be greatly appreciated.
This feels a bit like a homework problem, so I won't post the full answer but will try to give some hints. Character groups to match are given between square brackets: [adfg] will match any of the letters a, d, f or g, and [adfg]{3,} will match any part consisting of at least 3 of these letters. Looking at your list of words, you only want to match whole lines: if you pass re.MULTILINE as the second argument to re.compile, ^ will match the beginning and $ the end of a line.
Addition:
If the characters can only appear in the order given, and assuming that each character can appear any number of times: 'qw*e*r*t*y*u*y*t*r*e*s*d*f*t*y*u*i*o*k*n*n'. However, we also have to require at least 5 characters in total; a positive lookbehind assertion (?<=\w{5}) added to the end will ensure that.
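Putting those hints together, a sketch under the stated assumptions (reusing the words.txt file downloaded in the question):

import re

with open('words.txt') as f:
    words = f.read()

inp = 'qwertyuytresdftyuioknn'
# First and last characters are fixed; every letter in between may repeat
# zero or more times, in input order. The lookbehind enforces length >= 5.
pattern = inp[0] + ''.join(ch + '*' for ch in inp[1:-1]) + inp[-1]
regex = re.compile('^' + pattern + r'(?<=\w{5})$', re.MULTILINE)
print(regex.findall(words))  # should include 'queen' and 'question'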

Efficient replacement of occurrences of a list of words

I need to censor all occurrences of a list of words with *'s. I have about 400 words in the list and it's going to get hit with a lot of traffic, so I want to make it very efficient. What's an efficient algorithm/data structure to do this in? Preferably something already in Python.
Examples:
"piss off" => "**** off"
"hello" => "hello"
"go to hell" => "go to ****"
A case-insensitive trie-backed set implementation might fit the bill. For each word, you'll only process a minimum of characters. For example, you would only need to process the first letter of the word 'zoo' to know the word is not present in your list (assuming you have no 'z' expletives).
This is not something that is packaged with Python, however. You may also observe better performance from a simple dictionary-based solution, since dictionaries are implemented in C.
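A minimal sketch of such a trie-backed set, using plain nested dicts as nodes ('$' is an arbitrary end-of-word marker):

class Trie:
    def __init__(self):
        self.root = {}

    def add(self, word):
        node = self.root
        for ch in word.lower():
            node = node.setdefault(ch, {})
        node['$'] = True  # marks that a word ends here

    def __contains__(self, word):
        node = self.root
        for ch in word.lower():
            if ch not in node:
                return False  # bail out at the first unmatched letter
            node = node[ch]
        return '$' in node

banned = Trie()
for w in ('piss', 'hell'):
    banned.add(w)

print('hell' in banned)  # True
print('zoo' in banned)   # False, rejected after one character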
(1) Let P be the set of phrases to censor.
(2) Precompute H = {h(w) | p in P, w is a word in p}, where h is a sensible hash function.
(3) For each word v that is input, test whether h(v) in H.
(4) If h(v) not in H, emit v.
(5) If h(v) in H, back off to any naive method that will check whether v and the words following form a phrase in P.
Step (5) is not a problem since we assume that P is (very) small compared to the quantity of input. Step (3) is an O(1) operation.
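A minimal sketch of that scheme, using a Python set (which hashes its elements) for H and a toy phrase list standing in for P:

PHRASES = {'piss', 'hell', 'son of a gun'}
H = {w for p in PHRASES for w in p.split()}  # step (2)

def censor(text):
    words = text.split()
    out, i = [], 0
    while i < len(words):
        if words[i] not in H:  # steps (3) and (4)
            out.append(words[i])
            i += 1
            continue
        # Step (5): naive check whether a phrase from P starts at position i.
        for phrase in PHRASES:
            pw = phrase.split()
            if words[i:i + len(pw)] == pw:
                out.extend('*' * len(w) for w in pw)
                i += len(pw)
                break
        else:
            out.append(words[i])
            i += 1
    return ' '.join(out)

print(censor('piss off'))    # **** off
print(censor('go to hell'))  # go to ****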
As cheeken has mentioned, a trie may be the thing you need; actually, you should use the Aho–Corasick string-matching algorithm, which is something more than a trie.
For every string S you need to process, the time complexity is approximately O(len(S)), i.e. linear.
You also need to build the automaton initially; its time complexity is O(Σ len(word)), and its space complexity is about (usually less than) O(52 · Σ len(word)), where 52 is the size of the alphabet (I take it as ['a'..'z', 'A'..'Z']). And you need to do this just once (or each time the system launches).
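For instance, with the third-party pyahocorasick package (an assumption: it is not part of the standard library and must be installed separately):

import ahocorasick

banned = ['piss', 'hell']
A = ahocorasick.Automaton()
for word in banned:
    A.add_word(word, word)
A.make_automaton()  # builds the automaton once, as described above

def censor(text):
    chars = list(text)
    for end, word in A.iter(text):  # one linear pass over the text
        start = end - len(word) + 1
        # Censor whole words only, so that 'hello' is left alone.
        before_ok = start == 0 or not text[start - 1].isalnum()
        after_ok = end == len(text) - 1 or not text[end + 1].isalnum()
        if before_ok and after_ok:
            chars[start:end + 1] = '*' * len(word)
    return ''.join(chars)

print(censor('go to hell'))  # go to ****
print(censor('hello'))       # hello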
You might want to time a regexp based solution against others. I have used similar regexp based substitution of one to three thousand words on a text to change phrases into links before, but I am not serving those pages to many people.
I take the set of words (it could be phrases), and form a regular expression out of them that will match their occurrence as a complete word in the text because of the '\b'.
If you have a dictionary mapping words to their sanitized version then you could use that. I just swap every odd letter with '*' for convenience here.
The sanitizer function just returns the sanitized version of any matched swear word and is used in the regular expression substitution call on the text to return a sanitized version.
import re
swearwords = set("Holy Cow".split())
swear = re.compile(r'\b(%s)\b' % '|'.join(sorted(swearwords, key=lambda w: (-len(w), w))))
sanitized = {sw:''.join((ch if not i % 2 else '*' for i,ch in enumerate(sw))) for sw in swearwords}
def sanitizer(matchobj):
    return sanitized.get(matchobj.group(1), '????')
txt = 'twat prick Holy Cow ... hell hello shitter bonk'
swear.sub(sanitizer, txt)
# Out[1]: 'twat prick H*l* C*w ... hell hello shitter bonk'
You might want to use re.subn and the count argument to limit the number of substitutions done and just reject the whole text if it has too many profanities:
maxswear = 2
newtxt, scount = swear.subn(sanitizer, txt, count=maxswear)
if scount >= maxswear: newtxt = 'Ouch my ears hurt. Please tone it down'
print(newtxt)
# 'Ouch my ears hurt. Please tone it down'
If performance is what you want, I would suggest:
Get a sample of the input.
Calculate the average number of censored words per line.
Define a max number of words to filter per line (3, for example).
Calculate which censored words get the most hits in the sample.
Write a function that, given the censored words, will generate a Python file with if statements to check each word, putting the 'most hits' words first; since you just want to match whole words, it will be fairly simple (see the sketch after this answer).
Once you hit the max number per line, exit the function.
I know this is not nice, and I'm only suggesting this approach because of the high-traffic scenario; looping over every word in your list would have a huge negative impact on performance.
Hope that helps, or at least gives you some out-of-the-box ideas on how to tackle the problem.
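A toy sketch of the code-generation step (the word ordering, cap and file name are all hypothetical):

# Emit a Python file whose checker tests the 'most hits' words first and
# stops censoring once the per-line cap is reached.
most_hits_first = ['hell', 'piss', 'twat']  # from the sample, most frequent first
MAX_PER_LINE = 3

lines = ['def censor_line(words):',
         '    hits, out = 0, []',
         '    for w in words:']
for word in most_hits_first:
    lines.append(f'        if hits < {MAX_PER_LINE} and w == {word!r}:')
    lines.append("            out.append('*' * len(w)); hits += 1; continue")
lines.append('        out.append(w)')
lines.append("    return ' '.join(out)")

with open('generated_filter.py', 'w') as f:
    f.write('\n'.join(lines) + '\n')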
