I have text with words like this: a n a l i z e, c l a s s, etc., but there are normal words as well. I need to remove all these spaces between the letters of the spaced-out words.
import re

reg_let = re.compile(r'\s[А-Яа-яёЁa-zA-Z](\s)', re.DOTALL)
text = 'T h i s is exactly w h a t I needed'
text = re.sub(reg_let, '', text)
text
OUTPUT:
'Tiis exactlyhtneeded' (while I need 'This is exactly what I needed')
As far as I know, there is no easy way to do this, because your biggest problem is distinguishing the meaningful words; in other words, you need some semantic engine to tell you which word is meaningful in the sentence.
The only thing I can think of is a word embedding model. Without something like that you can clear as many spaces as you want, but you can't distinguish the words, meaning you'll never know which spaces not to remove.
I would love for someone to correct me if there's a simpler way I'm not aware of.
There is no easy solution to this problem.
The only solution that I can think of is one that uses a dictionary to check whether a word is correct or not (i.e. present in the English dictionary).
But even doing so you'll get a lot of false positives. For example, if I got the text:
a n a n a s
the words:
a
an
as
are all correct in the English dictionary. How do I split the text? For me, as a human who can read the text, it is clear that the word here is ananas. But one could split the text as such:
an an as
Which is grammatically correct, but doesn't make sense in English. The correctness is given by the context. I, as a human, can understand the context. One could split and concatenate the string in different ways to check whether it makes sense, but unfortunately there is no library or simple procedure that can understand context.
Machine Learning could be a way, but there is no perfect solution.
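As a toy illustration of that ambiguity, here is a small sketch with a tiny made-up dictionary; it enumerates every way to rebuild 'a n a n a s' from dictionary words:
# Tiny hypothetical dictionary, just to illustrate the ambiguity.
dictionary = {"a", "an", "as", "ananas"}
letters = "a n a n a s".split()

def segmentations(tokens):
    # Yield every way of joining the tokens into dictionary words.
    if not tokens:
        yield []
        return
    for i in range(1, len(tokens) + 1):
        word = "".join(tokens[:i])
        if word in dictionary:
            for rest in segmentations(tokens[i:]):
                yield [word] + rest

for seg in segmentations(letters):
    print(" ".join(seg))
# Prints both 'an an as' and 'ananas'; both are dictionary-valid, only one is sensible.
Both segmentations pass the dictionary check, which is exactly why context (or a language model) is needed to pick one.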
Related
Today I wrote my first program, which is essentially a vocabulary learning program! So naturally I have pretty huge lists of vocabulary and a couple of questions. I created a class with parameters, one of which is the German vocab and one of which is the Spanish vocab. My first question is: is there any way to turn all the plain text vocabulary that I copy from an internet vocab list into strings and separate them, without adding the quotes and the commas manually?
And my second question:
I created another list to assign each German vocab to each Spanish vocab and it looks a little bit like that:
vocabs = [
    Vocabulary(spanish_word[0], german_word[0]),
    Vocabulary(spanish_word[1], german_word[1]),
    # etc.
]
Vocabulary would be the class, spanish_word the first word list and german_word the other, obviously.
But with a lot of vocab that's a lot of work too. Is there any way to automate the process of pairing each word from the Spanish word list with the corresponding German one? I first tried it with
vocabs = [
for spanish word in german word
Vocabulary(spanish_word[0], german_word[0])
]
But that didn't work. Researching on the internet also didn't help much.
Please don't be rude if these are noob questions. I'm actually pretty happy that my program is running so well, and I would be thankful for any help to make it better.
Without knowing what it is you're looking to do with the result, it appears you're trying to do this:
vocabs = [Vocabulary(s, g) for s, g in zip(spanish_word, german_word)]
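For example, with two small illustrative lists (the list contents and the stub Vocabulary class here are just placeholders for whatever you already have):
class Vocabulary:
    # Stand-in for your existing class.
    def __init__(self, spanish, german):
        self.spanish = spanish
        self.german = german

spanish_word = ["gato", "perro"]
german_word = ["Katze", "Hund"]

# zip() pairs the two lists element by element, so no manual indexing is needed.
vocabs = [Vocabulary(s, g) for s, g in zip(spanish_word, german_word)]
print(vocabs[0].spanish, vocabs[0].german)  # gato Katze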
You didn't provide any code or example data around the "turn all the plain text vocabulary [..] into strings and separate them without adding the quotes and the commas manually" part. There's sure to be a way to do what you need, but you should probably ask a separate question, after first looking for a solution and attempting it yourself. Ask a question if you can't get it to work.
I am looking for something slightly more reliable for unpredictable strings than just checking if "word" in "check for word".
To paint an example, let's say I have the following sentence:
"Learning Python!"
If the sentence contains "Python", I'd want to evaluate to true, but what if it were:
"Learning #python!"
Doing a split with a space as a delimiter would give me ["learning", "#python"], which does not match "python".
(Note: While I do understand that I could remove the # in this particular case, the problem with this is that 1. I am tagging programming languages and don't want to strip out the # in C#, and 2. this is just an example case; there are a lot of different ways I could see human-typed titles including these hints that I'd still like to catch.)
I'd basically like to check whether, beyond reasonable doubt, the sequence of characters I'm looking for is there, despite any weird way it might be mentioned. What are some ways to do this? I have looked at fuzzy search a bit, but I haven't seen any use cases for looking for single words.
The end goal here is that I have tags of programming languages, and I'd like to take people's stream titles and tag the language if it's mentioned in the title.
This code prints True if the string contains 'python', ignoring case.
import re

title = "Learning Python!"
print(re.search("python", title, re.IGNORECASE) is not None)
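If you want to check a whole list of language tags against a title, a small sketch along the same lines (the tag list and titles are made up) might look like this; re.escape keeps names like "c#" and "c++" from being read as regex syntax:
import re

tags = ["python", "c#", "c++", "java"]  # illustrative tag list

def find_tags(title):
    return [t for t in tags if re.search(re.escape(t), title, re.IGNORECASE)]

print(find_tags("Learning #python!"))  # ['python']
print(find_tags("C# for beginners"))   # ['c#']
Note that this is still a plain substring check, so e.g. "java" would also match "JavaScript"; whether that matters depends on your tag list.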
I am making regexes that return the definitions of abbreviations from a text. I have solved a number of cases, but I cannot handle the case where the abbreviation has a different number of characters than its expansion has words, for example because one of the words is a compound, like below.
string = 'CRC comes from the words colorectal cancer'
I would like to get 'colorectal cancer' based on its short form. Do you have any advice on what steps I should take? I thought of splitting compound words, but that would lead to other problems.
In CRC, the first word should begin with C, and the next word could start with either R or C. If the second word starts with R, then the third word should start with C, or there may be no third word at all.
At the same time, you should check whether the second word starts with C; if so, you don't need to check for a third word. An OR condition (alternation) in the regex may be of help. I can't pinpoint how without more data samples.
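For the specific CRC example, here is one hedged sketch of that alternation idea; it assumes the definition follows a cue phrase like 'comes from the words', as in the sample string, and allows the R to sit either at the start of its own word or inside a compound word:
import re

string = 'CRC comes from the words colorectal cancer'

# Either three words starting with c, r, c, or two words where the first
# contains the r internally (a compound like 'colorectal') and the second
# starts with c. The cue phrase is an assumption taken from the sample text.
pattern = re.compile(
    r'comes from the words\s+(c\w*\s+r\w*\s+c\w*|c\w*r\w*\s+c\w*)',
    re.IGNORECASE,
)
match = pattern.search(string)
if match:
    print(match.group(1))  # colorectal cancer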
I need to censor all occurrences of a list of words with *'s. I have about 400 words in the list and it's going to get hit with a lot of traffic, so I want to make it very efficient. What's an efficient algorithm/data structure to do this in? Preferably something already in Python.
Examples:
"piss off" => "**** off"
"hello" => "hello"
"go to hell" => "go to ****"
A case-insensitive trie-backed set implementation might fit the bill. For each word, you'll only process a minimum of characters. For example, you would only need to process the first letter of the word 'zoo' to know it is not in your list (assuming you have no 'z' expletives).
This is not something that is packaged with Python, however. You may observe better performance from a simple dictionary/set solution, since that is implemented in C.
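For comparison, a minimal sketch of that plain set ("dictionary") approach, using the examples from the question (the banned list is illustrative):
# Per-word set lookup; each banned word is replaced by asterisks of the same length.
banned = {"piss", "hell"}

def censor(text):
    return " ".join(
        "*" * len(word) if word.lower() in banned else word
        for word in text.split()
    )

print(censor("piss off"))    # **** off
print(censor("go to hell"))  # go to ****
print(censor("hello"))       # hello
Punctuation attached to a word (e.g. 'hell!') slips through this naive split, which is where the regex-based answers below do better.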
(1) Let P be the set of phrases to censor.
(2) Precompute H = {h(w) | p in P, w is a word in p}, where h is a sensible hash function.
(3) For each word v that is input, test whether h(v) in H.
(4) If h(v) not in H, emit v.
(5) If h(v) in H, back off to any naive method that will check whether v and the words following form a phrase in P.
Step (5) is not a problem since we assume that P is (very) small compared to the quantity of input. Step (3) is an O(1) operation.
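A rough sketch of those steps (the phrase list here is made up; Python's set already hashes its members, so H is simply a set):
P = [p.split() for p in ["darn it", "heck"]]   # (1) phrases to censor, illustrative
H = {w for p in P for w in p}                  # (2) every word of every phrase

def censor(text):
    words = text.split()
    out = []
    i = 0
    while i < len(words):
        if words[i] not in H:                  # (3)-(4) cheap set lookup, emit as-is
            out.append(words[i])
            i += 1
            continue
        for phrase in P:                       # (5) naive check only for candidates
            if words[i:i + len(phrase)] == phrase:
                out.extend("*" * len(w) for w in phrase)
                i += len(phrase)
                break
        else:
            out.append(words[i])               # a hash hit, but no phrase starts here
            i += 1
    return " ".join(out)

print(censor("oh darn it all"))  # oh **** ** all
print(censor("take it easy"))    # take it easy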
As cheeken has mentioned, a trie may be the thing you need, and actually you should use the Aho–Corasick string matching algorithm, which is something more than a trie.
For every string S you need to process, the time complexity is approximately O(len(S)), i.e. linear.
You also need to build the automaton initially, which takes O(sum(len(word))) time and about (usually less than) O(52 * sum(len(word))) space, where 52 is the size of the alphabet (I take it as ['a'..'z', 'A'..'Z']). And you only need to do this once (or each time the system launches).
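As a sketch, assuming the third-party pyahocorasick package is installed (pip install pyahocorasick) and using a made-up word list:
import ahocorasick

words = ["piss", "hell"]                 # illustrative
automaton = ahocorasick.Automaton()
for w in words:
    automaton.add_word(w, w)
automaton.make_automaton()               # build once, e.g. at startup

text = "go to hell"
out = text
for end, w in automaton.iter(text):      # one linear pass over the text
    start = end - len(w) + 1
    out = out[:start] + "*" * len(w) + out[end + 1:]
print(out)                               # go to ****
Note that Aho–Corasick finds substrings, so 'hello' would also trigger 'hell'; whole-word boundaries still need to be handled on top of it.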
You might want to time a regexp-based solution against the others. I have used a similar regexp-based substitution of one to three thousand words on a text to change phrases into links before, but I am not serving those pages to many people.
I take the set of words (it could be phrases) and form a regular expression out of them that will match their occurrence as a complete word in the text, because of the '\b'.
If you have a dictionary mapping words to their sanitized versions, then you could use that. Here I just swap every odd letter with '*' for convenience.
The sanitizer function returns the sanitized version of any matched swear word, and is used in the regular expression substitution call on the text to return a sanitized version.
import re
swearwords = set("Holy Cow".split())
swear = re.compile(r'\b(%s)\b' % '|'.join(sorted(swearwords, key=lambda w: (-len(w), w))))
sanitized = {sw:''.join((ch if not i % 2 else '*' for i,ch in enumerate(sw))) for sw in swearwords}
def sanitizer(matchobj):
return sanitized.get(matchobj.group(1), '????')
txt = 'twat prick Holy Cow ... hell hello shitter bonk'
swear.sub(sanitizer, txt)
# Out[1]: 'twat prick H*l* C*w ... hell hello shitter bonk'
You might want to use re.subn and the count argument to limit the number of substitutions done and just reject the whole text if it has too many profanities:
maxswear = 2
newtxt, scount = swear.subn(sanitizer, txt, count=maxswear)
if scount >= maxswear: newtxt = 'Ouch my ears hurt. Please tone it down'
print(newtxt)
# 'Ouch my ears hurt. Please tone it down'
If performance is what you want I would suggest:
Get a sample of the input
Calculate the average number of censored words per line
Define a max number of words to filter per line (3, for example)
Calculate which censored words have the most hits in the sample
Write a function that, given the censored words, will generate a Python file with if statements to check each word, putting the 'most hits' words first; since you just want to match whole words it will be fairly simple
Once you hit the max number per line, exit the function
I know this is not nice, and I'm only suggesting this approach because of the high-traffic scenario: looping over every word in your list for each line will have a huge negative impact on performance.
Hope that helps, or at least gives you some out-of-the-box ideas on how to tackle the problem.
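A rough sketch of that code-generation idea (the word list, sample lines, and generated module name are all made up):
from collections import Counter

censored = ["hell", "piss", "darn"]                # illustrative word list
sample = ["go to hell", "darn it", "hello there"]  # illustrative input sample

# Count hits in the sample so the most frequent words are checked first.
hits = Counter(w for line in sample for w in line.split() if w in censored)
ordered = sorted(censored, key=lambda w: -hits[w])

with open("generated_filter.py", "w") as f:
    f.write("def count_censored(words, max_hits=3):\n")
    f.write("    hits = 0\n")
    for w in ordered:
        f.write(f"    if {w!r} in words:\n")
        f.write("        hits += 1\n")
        f.write("        if hits >= max_hits:\n")
        f.write("            return hits\n")
    f.write("    return hits\n")

# Usage: import generated_filter and call count_censored(line.split()).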
I have tokenized names (strings), with the tokens separated by underscores, which will always contain a "side" token with the value M, L, or R.
The presence of that value is guaranteed to be unique (no repetitions, and no danger that other tokens might have similar values).
For example:
foo_M_bar_type
foo_R_bar_type
foo_L_bar_type
I'd like, in a single regex, to swap L for R and vice versa wherever found, leaving M untouched.
I.e. the above would become:
foo_M_bar_type
foo_L_bar_type
foo_R_bar_type
when pushed through this ideal expression.
This was something I thought would be a 10-minute exercise while writing some simple stuff, but I couldn't quite crack it as concisely as I wanted to.
The problem itself was of course trivial to solve with one condition that changes the pattern, but I'd love some help doing it within a single re.sub().
Of course any food for thought is always welcome, but since this is an intellectual exercise that a couple of colleagues and I failed at, I'd love to see it cracked that way.
And yes, I'm fully aware it might not be considered very Pythonic, nor ideal, to solve the problem with a regex, but humour me please :)
Thanks in advance
This answer [ab]uses the replacement function:
>>> s = "foo_M_bar_type foo_R_bar_type foo_L_bar_type"
>>> import re
>>> re.sub("_[LR]_", lambda m: {'_L_':'_R_','_R_':'_L_'}[m.group()], s)
'foo_M_bar_type foo_L_bar_type foo_R_bar_type'
>>>