Generalised method to clean data for text classification - python

Whilst searching for a text classification method, I came across this Python code, which was used in the pre-processing step:

import re
from nltk.corpus import stopwords

REPLACE_BY_SPACE_RE = re.compile('[/(){}\[\]\|#,;]')
BAD_SYMBOLS_RE = re.compile('[^0-9a-z #+_]')
STOPWORDS = set(stopwords.words('english'))

def clean_text(text):
    """
    text: a string
    return: modified initial string
    """
    text = text.lower()  # lowercase text
    text = REPLACE_BY_SPACE_RE.sub(' ', text)  # replace symbols matched by REPLACE_BY_SPACE_RE with a space
    text = BAD_SYMBOLS_RE.sub('', text)  # remove symbols matched by BAD_SYMBOLS_RE
    text = text.replace('x', '')
    text = ' '.join(word for word in text.split() if word not in STOPWORDS)  # remove stopwords from text
    return text
I then tested this section of code to understand the syntax and its purpose:
BAD_SYMBOLS_RE = re.compile('[^0-9a-z #+_]')
text = '[0a;m]'
BAD_SYMBOLS_RE.sub(' ', text)
# returns ' 0a m ' whilst I thought it would return ' ; '
Question: why didn't the code replace 0, a, and m although 0-9a-z was specified inside the [ ]? Why did it replace ; although that character wasn't specified?
Edit (to avoid being marked as a duplicate):
My perceptions of the code are:
The line BAD_SYMBOLS_RE = re.compile('[^0-9a-z #+_]') is confusing. Including the characters #, +, and _ inside the [ ] made me think the line was trying to remove the characters in the list (because no word in an English dictionary would contain those bad characters #+_, I believe?). Consequently, it made me interpret the ^ as the start of a string (instead of negation); hence the original post (which was kindly answered by Tim Pietzcker and Raymond Hettinger). The two lines REPLACE_BY_SPACE_RE and BAD_SYMBOLS_RE should have been combined into one, such as
REMOVE_PUNCT = re.compile('[^0-9a-z]')
text = REMOVE_PUNCT.sub('', text)
I also think the line text = text.replace('x', '') (which was meant to remove the IDs that were masked as XXX-XXXX.... in the raw data) will lead to a bad outcome; for example, the word next will become net.
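A more targeted removal is possible. The sketch below is my own illustration (the pattern name and the assumption that masked IDs appear as whole, already-lowercased tokens of x's are mine, not from the original code):

```python
import re

# Hypothetical alternative: remove only whole tokens made of x's
# (optionally hyphen-separated, e.g. "xxx-xxxx" after lowercasing),
# so ordinary words containing an x, like "next", survive.
MASKED_ID_RE = re.compile(r'\bx+(?:-x+)*\b')

print(MASKED_ID_RE.sub('', 'call me next week at xxx-xxxx'))
# -> 'call me next week at '
```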
Additional questions:
Are my perceptions reasonable?
Should numbers/digits be removed from text?
Could you please recommend an overall/general strategy/code for text pre-processing for (English) text classification?

Here's some documentation about character classes.
Basically, [abc] means "any one of a, b, or c" whereas [^abc] means "any character that is not a, b, or c".
So your regex operation removes every non-digit, non-letter character except space, #, + and _ from the string, which explains the result you're getting.
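The difference is easy to see in a quick sketch that reuses the OP's pattern:

```python
import re

# [abc] substitutes a, b and c; [^abc] substitutes everything EXCEPT a, b and c
print(re.sub('[abc]', '*', 'abcxyz'))   # -> ***xyz
print(re.sub('[^abc]', '*', 'abcxyz'))  # -> abc***

# The OP's pattern therefore keeps 0-9, a-z, space, #, + and _ and strips the rest
BAD_SYMBOLS_RE = re.compile('[^0-9a-z #+_]')
print(BAD_SYMBOLS_RE.sub(' ', '[0a;m]'))  # -> ' 0a m '
```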

General rules
The square brackets specify any one single character.
Roughly [xyz] is a short-cut for (x|y|z) but without creating a group.
Likewise [a-z] is a short-cut for (a|b|c|...|y|z).
The interpretation of character sets can be a little tricky. The start and end points get converted to their ordinal positions and the range of matching characters is inferred from there. For example [A-z] converts A to 65 and z to 122, so everything from 65 to 122 is included. That means that it also matches characters like ^ which convert to 94. It also means that characters like ö won't match because that converts to 246 which is outside the range.
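A quick check of this ordinal-range behaviour (nothing here beyond the standard library):

```python
import re

print(ord('A'), ord('z'), ord('^'), ord('ö'))  # 65 122 94 246

# [A-z] spans code points 65..122, so it matches ^ but not ö
print(bool(re.match('[A-z]', '^')))  # True
print(bool(re.match('[A-z]', 'ö')))  # False
```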
Another interesting form of character class uses the ^ to invert the selection. For example, [^a-z] means "any character not in the range from a to z".
The full details are in the "character sets" section of the re docs.
Specific Problem
In the OP's example, BAD_SYMBOLS_RE = re.compile('[^0-9a-z #+_]'), the caret ^ at the beginning inverts the range so that the listed symbols are excluded from the search.
That is why the code didn't replace 0, a, and m although 0-9a-z was specified inside the [ ]. Essentially, it treated the specified characters as good characters.
Hope this helps :-)

Related

Merging three regex patterns used for text cleaning to improve efficiency

Given a text I want to make some modifications:
replace uppercase chars at the beginning of a sentence.
remove chars like ’ or ' (without adding whitespace)
remove unwanted chars for example ³ or ? , ! . (and replace with whitespace)
import re

def multiple_replace(text):
    # first sub so words like can't will change to cant and not can t
    first_strip = re.sub("[’']", '', text)

    def cap(match):
        return match.group().lower()

    p = re.compile(r'((?<=[\.\?!]\s)(\w+)|(^\w+))')
    # second sub to change all words that begin a sentence to lowercase
    second_strip = p.sub(cap, first_strip)
    # third sub removes all . from text unless they are used in decimal numbers
    third_strip = re.sub(r'(?<!\d)\.|\.(?!\d)', '', second_strip)
    # fourth sub removes unexpected chars (e.g. !, ?, ³) and replaces them with whitespace
    fourth_strip = re.sub('[^A-Za-z0-9##_$&%]+', ' ', third_strip)
    return fourth_strip
I am wondering if there is a more efficient way of doing it? Because I am going over the text 4 times just so it can be in the right format for me to parse. This seems a lot especially if there are millions of documents. Is there a more efficient way of doing this?
You could make use of an alternation to match either an uppercase char A-Z at the start of the string, or after . ? or ! followed by a whitespace char.
I think you can also add a . to the negated character class [^A-Za-z0-9##_$&%.]+ to not remove the dot for a decimal value and change the order of operations to use cap first before removing any dots.
import re
def cap(match):
    return match.group().lower()
p = re.compile(r'(?<=[.?!]\s)[A-Z]|^[A-Z]', re.M)
text = "A test here. this `` (*)is. Test, but keep 1.2"
first_strip = p.sub(cap, text)
second_strip = re.sub(r"[`']+|(?<!\d)\.|\.(?!\d)", '', first_strip)
third_strip = re.sub('[^A-Za-z0-9##_$&%.]+', ' ', second_strip)
print(third_strip)
Output
a test here this is test but keep 1.2
You could also use a lambda with all 3 patterns and 2 capturing groups checking the group values in the callback, but I think that would not benefit the readability or making it easier to change or test.
import re
p = re.compile(r"(?:((?<=[.?!]\s)[A-Z]|^[A-Z])|[`']+|((?<!\d)\.|\.(?!\d))|[^A-Za-z0-9##_$&%.]+)", re.M)
text = "A test here. this `` (*)is. Test, but keep 1.2"
result = re.sub(p, lambda x: x.group(1).lower() if x.group(1) else ('' if x.group(2) else ' '), text)
print(result)
Output
a test here this is test but keep 1.2

Is there a way to tell if a newline character is splitting two distinct words in Python?

Using the below code, I imported a few .csv files with sentences like the following into Python:
df = pd.concat((pd.read_csv(f) for f in path), ignore_index=True)
Sample sentence:
I WANT TO UNDERSTAND WHERE TH\nERE ARE\nSOME \n NEW RESTAURANTS. \n
While I have no problem removing the newline characters surrounded by spaces, in the middle of words, or at the end of the string, I don't know what to do with the newline characters separating words.
The output I want is as follows:
Goal sentence:
I WANT TO UNDERSTAND WHERE THERE ARE SOME NEW RESTAURANTS.
Is there a way for me to indicate in my code that the newline character is surrounded by two distinct words? Or is this classic garbage in, garbage out?
df = df[~df['Sentence'].str.contains("\n")]
After doing some digging, I came up with two solutions.
1. The textwrap package: Though it seems that the textwrap package is normally used for visual formatting (i.e. telling a UI when to show "..." to signify a long string), it successfully identified the \n patterns I was having issues with. Though it's still necessary to remove extra whitespace of other kinds, this package got me 90% of the way there.
import textwrap
sample = 'I WANT TO UNDERSTAND WHERE TH\nERE ARE\nSOME \n NEW RESTAURANTS. \n'
sample_wrap = textwrap.wrap(sample)
print(sample_wrap)
'I WANT TO UNDERSTAND WHERE THERE ARE SOME NEW RESTAURANTS. '
2. Function to ID different \n appearance patterns: The 'boil the ocean' solution I came up with before learning about textwrap, and it doesn't work as well. This function finds matches defined as a newline character surrounded by two word (alphanumeric) characters. For all matches, the function searches NLTK's words.words() list for each string surrounding the newline character. If at least one of the two strings is a word in that list, it's considered to be two separate words.
This doesn't take into consideration domain-specific words, which have to be added to the wordlist, or words like "about", which would be incorrectly categorized by this function if the newline character appeared as "ab\nout". I'd recommend textwrap for this reason, but still thought I'd share.
import re
from nltk.corpus import words  # source of the wordlist mentioned above

wordlist = set(words.words())

carriage = re.compile(r'(\n+)')
wordword = re.compile(r'((\w+)\n+(\w+))')

def carriage_return(sentence):
    if carriage.search(sentence):
        if not wordword.search(sentence):
            sentence = re.sub(carriage, '', sentence)
        else:
            matches = re.findall(wordword, sentence)
            for match in matches:
                word1 = match[1].lower()
                word2 = match[2].lower()
                if word1 in wordlist or word2 in wordlist or word1.isdigit() or word2.isdigit():
                    sentence = sentence.replace(match[0], word1 + ' ' + word2)
                else:
                    sentence = sentence.replace(match[0], word1 + word2)
            sentence = re.sub(carriage, '', sentence)
    display(sentence)  # display() comes from IPython/Jupyter
    return sentence

Python Regex Find Word with Random White Space Mixed in

How do you write a regular expression to match a specific word in a string, when the string has white space added in random places?
I've got a string that has been extracted from a pdf document that has a table structure. As a consequence of that structure the extracted string contains randomly inserted new lines and white spaces. The specific words and phrases that I'm looking for are there with characters all in the correct order, but chopped randomly with white spaces. For example: "sta ck over flow".
The content of the pdf document was extracted with PyPDF2 as this is the only option available on my company's python library.
I know that I can write a specific string match for this with a possible white space after every character, but there must be a better way of searching for it.
Here's an example of what I've been trying to do.
my_string = "find the ans weron sta ck over flow"
# r's\s*t\s*a\s*c\s*k\s*' # etc
my_cleaned_string = re.sub(r's\s*t\s*a\s*c\s*k\s*', '', my_string)
Any suggestions?
Actually what you're doing is the best way. The only addition I can suggest is to dynamically construct such regexp from a word:
word = "stack"
regexp = r'\s*'.join(word)
my_string = "find the ans weron sta ck over flow"
my_cleaned_string = re.sub(regexp, '', my_string)
The best you can probably do here is to just strip all whitespace and then search for the target string inside the stripped text:
my_string = "find the ans weron sta ck over flow"
my_string = re.sub(r'\s+', '', my_string)
if 'stack' in my_string:
    print("MATCH")
The reason I use "best" above is that in general you won't know if a space is an actual word boundary, or just random whitespace which has been inserted. So, you can really only do as good as finding your target as a substring in the stripped text. Note that the input text 'rust acknowledge' would now match positive for stack.
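That caveat is easy to reproduce:

```python
import re

stripped = re.sub(r'\s+', '', 'rust acknowledge')
print(stripped)             # rustacknowledge
print('stack' in stripped)  # True -- a spurious match across the word gap
```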

what should be the regex to match 3 or more consecutive vowels and print the whole word with the vowels [duplicate]

Say for example I have the following string "one two(three) (three) four five" and I want to replace "(three)" with "(four)" but not within words. How would I do it?
Basically I want to do a regex replace and end up with the following string:
"one two(three) (four) four five"
I have tried the following regex but it doesn't work:
@"\b\(three\)\b"
Basically I am writing some search and replace code and am giving the user the usual options to match case, match whole word etc. In this instance the user has chosen to match whole words but I don't know what the text being searched for will be.
Your problem stems from a misunderstanding of what \b actually means. Admittedly, it is not obvious.
The reason \b\(three\)\b doesn’t match the threes in your input string is the following:
\b means: the boundary between a word character and a non-word character.
Letters (e.g. a-z) are considered word characters.
Punctuation marks such as ( are considered non-word characters.
Here is your input string again, stretched out a bit, and I’ve marked the places where \b matches:
  o n e   t w o ( t h r e e )   ( t h r e e )   f o u r   f i v e
 ↑     ↑ ↑     ↑ ↑         ↑     ↑         ↑   ↑       ↑ ↑       ↑
As you can see here, there is a \b between “two” and “(three)”, but not before the second “(three)”.
The moral of the story? “Whole-word search” doesn’t really make much sense if what you’re searching for is not just a word (a string of letters). Since you have punctuation characters (parentheses) in your search string, it is not as such a “word”. If you searched for a word consisting only of word characters, then \b would do what you expect.
You can, of course, use a different Regex to match the string only if it surrounded by spaces or occurs at the beginning or end of the string:
(^|\s)\(three\)(\s|$)
However, the problem with this is, of course, that if you search for “three” (without the parentheses), it won’t find the one in “(three)” because it doesn’t have spaces around it, even though it is actually a whole word.
I think most text editors (including Visual Studio) will use \b only if your search string actually starts and/or ends with a word character:
var pattern = Regex.Escape(searchString);
if (Regex.IsMatch(searchString, @"^\w"))
    pattern = @"\b" + pattern;
if (Regex.IsMatch(searchString, @"\w$"))
    pattern = pattern + @"\b";
That way they will find “(three)” even if you select “whole words only”.
Here a simple code you may be interested in:
string pattern = @"\b" + find + @"\b";
Regex.Replace(stringToSearch, pattern, replace, RegexOptions.IgnoreCase);
Source code: snip2code - C#: Replace an exact word in a sentence
See what a word boundary matches:
A word boundary can occur in one of three positions:
Before the first character in the string, if the first character is a word character.
After the last character in the string, if the last character is a word character.
Between two characters in the string, where one is a word character and the other is not a word character.
So, your \b\(three\)\b regex DOES work, but NOT the way you expected. It does not match (three) in In (three) years, In(three) years and In (three)years, but it matches in In(three)years because there are word boundaries between n and ( and between ) and y.
What you can do in these situations is use dynamic adaptive word boundaries that are constructs that ensure whole word matching where they are expected only (see my "Dynamic adaptive word boundaries" YT video for better visual understanding of these constructs).
In C#, it can be written as
@"(?!\B\w)\(three\)(?<!\w\B)"
In short:
(?!\B\w) - only require a word boundary on the left if the char that follows the word boundary is a word char
\(three\)
(?<!\w\B) - only require a word boundary on the right if the char that precedes the word boundary is a word char.
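The same construct also compiles under Python's re module; a small sketch contrasting it with a plain \b pattern:

```python
import re

text = "In (three) years"

# \b fails here: the characters around each parenthesis are all non-word chars
print(re.findall(r'\b\(three\)\b', text))               # []

# The adaptive form only demands a boundary where a word char is adjacent
print(re.findall(r'(?!\B\w)\(three\)(?<!\w\B)', text))  # ['(three)']
```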
In case your search phrases can contain whitespaces and you need to match the longer alternatives first you can build the pattern dynamically from a list like
var phrases = new List<string> { @"(one)", @".two.", "[three]" };
phrases = phrases.OrderByDescending(x => x.Length).ToList();
var pattern = $@"(?!\B\w)(?:{string.Join("|", phrases.Select(z => Regex.Escape(z)))})(?<!\w\B)";
with the resulting pattern like (?!\B\w)(?:\[three]|\(one\)|\.two\.)(?<!\w\B) that matches what you'd expect.
I recently came across a similar issue in javascript trying to match terms with a leading '$' character only as separate words, e.g. if $hot = 'FUZZ', then:
"some $hot $hotel bird$hot pellets" ---> "some FUZZ $hotel bird$hot pellets"
The regex /\b\$hot\b/g (my first guess) did not work for the same reason the parens did not match in the original question — as non word characters, there is no word/non-word boundary preceding them with whitespace or a string start.
However the regex /\B\$hot\b/g does match, which shows that the positions not marked in @timwi's excellent example match the \B term. This was not intuitive to me because ") (" is not made of regex word characters. But I guess since \B is an inversion of the \b class, it doesn't have to be word characters, it just has to be not- not- word characters :)
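The same test expressed in Python (the original snippet is JavaScript; Python's re treats \B identically):

```python
import re

# \B matches before $ when preceded by whitespace or the string start,
# but not between a word character and $ (as in "bird$hot")
result = re.sub(r'\B\$hot\b', 'FUZZ', "some $hot $hotel bird$hot pellets")
print(result)  # some FUZZ $hotel bird$hot pellets
```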
As Gopi said, but (theoretically) catching only (three) not two(three):
string input = "one two(three) (three) four five";
string output = input.Replace(" (three) ", " (four) ");
When I test that, I get: "one two(three) (four) four five" Just remember that white-space is a string character, too, so it can also be replaced. If I did this:
//use same input
string output = input.Replace(" ", ";");
I'd get "one;two(three);(three);four;five"

Regex to get list of all words with specific letters (unicode graphemes)

I'm writing a Python script for a FOSS language learning initiative. Let's say I have an XML file (or to keep it simple, a Python list) with a list of words in a particular language (in my case, the words are in Tamil, which uses a Brahmi-based Indic script).
I need to draw out the subset of those words that can be spelled using just those letters.
An English example:
words = ["cat", "dog", "tack", "coat"]
get_words(['o', 'c', 'a', 't']) should return ["cat", "coat"]
get_words(['k', 'c', 't', 'a']) should return ["cat", "tack"]
A Tamil example:
words = [u"மரம்", u"மடம்", u"படம்", u"பாடம்"]
get_words([u'ம', u'ப', u'ட', u'ம்']) should return [u"மடம்", u"படம்"]
get_words([u'ப', u'ம்', u'ட']) should return [u"படம்"]
The order that the words are returned in, or the order that the letters are entered in should not make a difference.
Although I understand the difference between unicode codepoints and graphemes, I'm not sure how they're handled in regular expressions.
In this case, I would want to match only those words that are made up of the specific graphemes in the input list, and nothing else (i.e. the markings that follow a letter should only follow that letter, but the graphemes themselves can occur in any order).
To support characters that can span several Unicode codepoints:
# -*- coding: utf-8 -*-
import re
import unicodedata
from functools import partial
NFKD = partial(unicodedata.normalize, 'NFKD')
def match(word, letters):
    word, letters = NFKD(word), map(NFKD, letters)  # normalize
    return re.match(r"(?:%s)+$" % "|".join(map(re.escape, letters)), word)
words = [u"மரம்", u"மடம்", u"படம்", u"பாடம்"]
get_words = lambda letters: [w for w in words if match(w, letters)]
print(" ".join(get_words([u'ம', u'ப', u'ட', u'ம்'])))
# -> மடம் படம்
print(" ".join(get_words([u'ப', u'ம்', u'ட'])))
# -> படம்
It assumes that the same character can be used zero or more times in a word.
If you want only words that contain exactly given characters:
import regex # $ pip install regex
chars = regex.compile(r"\X").findall # get all characters
def match(word, letters):
    return sorted(chars(word)) == sorted(letters)
words = ["cat", "dog", "tack", "coat"]
print(" ".join(get_words(['o', 'c', 'a', 't'])))
# -> coat
print(" ".join(get_words(['k', 'c', 't', 'a'])))
# -> tack
Note: there is no cat in the output in this case because cat doesn't use all given characters.
What does it mean to normalize? And could you please explain the syntax of the re.match() regex?
>>> import re
>>> re.escape('.')
'\\.'
>>> c = u'\u00c7'
>>> cc = u'\u0043\u0327'
>>> cc == c
False
>>> re.match(r'%s$' % (c,), cc) # do not match
>>> import unicodedata
>>> norm = lambda s: unicodedata.normalize('NFKD', s)
>>> re.match(r'%s$' % (norm(c),), norm(cc)) # do match
<_sre.SRE_Match object at 0x1364648>
>>> print c, cc
Ç Ç
Without normalization c and cc do not match. The characters are from the unicodedata.normalize() docs.
EDIT: Okay, don't use any of the answers from here. I wrote them all while thinking Python regular expressions didn't have a word boundary marker, and I tried to work around this lack. Then @Mark Tolonen added a comment that Python has \b as a word boundary marker! So I posted another answer, short and simple, using \b. I'll leave this here in case anyone is interested in seeing solutions that work around the lack of \b, but I don't really expect anyone to be.
It is easy to make a regular expression that matches only a string of a specific set of characters. What you need to use is a "character class" with just the characters you want to match.
I'll do this example in English.
[ocat] This is a character class that will match a single character from the set [o, c, a, t]. Order of the characters doesn't matter.
[ocat]+ Putting a + on the end makes it match one or more characters from the set. But this is not enough by itself; if you had the word "coach" this would match and return "coac".
Sadly, there isn't a regular expression feature for "word boundary". [EDIT: This turns out not to be correct, as I said in the first paragraph.] We need to make one of our own. There are two possible word beginnings: the start of a line, or whitespace separating our word from the previous word. Similarly, there are two possible word endings: end of a line, or whitespace separating our word from the next word.
Since we will be matching some extra stuff we don't want, we can put parentheses around the part of the pattern we do want.
To match two alternatives, we can make a group in parentheses and separate the alternatives with a vertical bar. Python regular expressions have a special notation to make a group whose contents we don't want to keep: (?:)
So, here is the pattern to match the beginning of a word. Start of line or white space: (?:^|\s)
Here is the pattern for end of word. White space or end of line: (?:\s|$)
Putting it all together, here is our final pattern:
(?:^|\s)([ocat]+)(?:\s|$)
You can build this dynamically. You don't need to hard-code the whole thing.
import re
s_pat_start = r'(?:^|\s)(['
s_pat_end = r']+)(?:\s|$)'
set_of_chars = get_the_chars_from_somewhere_I_do_not_care_where()
# set_of_chars is now set to the string: "ocat"
s_pat = s_pat_start + set_of_chars + s_pat_end
pat = re.compile(s_pat)
Now, this doesn't in any way check for valid words. If you have the following text:
This is sensible. This not: occo cttc
The pattern I showed you will match on occo and cttc, and those are not really words. They are strings made only of letters from [ocat] though.
So just do the same thing with Unicode strings. (If you are using Python 3.x then all strings are Unicode strings, so there you go.) Put the Tamil characters in the character class and you are good to go.
This has a confusing problem: re.findall() doesn't return all possible matches.
EDIT: Okay, I figured out what was confusing me.
What we want is for our pattern to work with re.findall() so you can collect all the words. But re.findall() only finds non-overlapping patterns. In my example, re.findall() only returned ['occo'] and not ['occo', 'cttc'] as I expected... but this is because my pattern was matching the white space after occo. The match group didn't collect the white space, but it was matched all the same, and since re.findall() wants no overlap between matches, the white space was "used up" and didn't work for cttc.
The solution is to use a feature of Python regular expressions that I have never used before: special syntax that says "must not be preceded by" or "must not be followed by". The sequence \S matches any non-whitespace so we could use that. But punctuation is non-whitespace, and I think we do want punctuation to delimit a word. There is also special syntax for "must be preceded by" or "must be followed by". So here is, I think, the best we can do:
Build a string that means "match when the character class string is at start of line and followed by whitespace, or when character class string is preceded by whitespace and followed by whitespace, or when character class string is preceded by whitespace and followed by end of line, or when character class string is preceded by start of line and followed by end of line".
Here is that pattern using ocat:
r'(?:^([ocat]+)(?=\s)|(?<=\s)([ocat]+)(?=\s)|(?<=\s)([ocat]+)$|^([ocat]+)$)'
I'm very sorry but I really do think this is the best we can do and still work with re.findall()!
It's actually less confusing in Python code though:
import re
NMGROUP_BEGIN = r'(?:' # begin non-matching group
NMGROUP_END = r')' # end non-matching group
WS_BEFORE = r'(?<=\s)' # require white space before
WS_AFTER = r'(?=\s)' # require white space after
BOL = r'^' # beginning of line
EOL = r'$' # end of line
CCS_BEGIN = r'([' #begin a character class string
CCS_END = r']+)' # end a character class string
PAT_OR = r'|'
set_of_chars = get_the_chars_from_somewhere_I_do_not_care_where()
# set_of_chars now set to "ocat"
CCS = CCS_BEGIN + set_of_chars + CCS_END # build up character class string pattern
s_pat = (NMGROUP_BEGIN +
BOL + CCS + WS_AFTER + PAT_OR +
WS_BEFORE + CCS + WS_AFTER + PAT_OR +
WS_BEFORE + CCS + EOL + PAT_OR +
BOL + CCS + EOL +
NMGROUP_END)
pat = re.compile(s_pat)
text = "This is sensible. This not: occo cttc"
pat.findall(text)
# returns: [('', 'occo', '', ''), ('', '', 'cttc', '')]
So, the crazy thing is that when we have alternative patterns that could match, re.findall() seems to return an empty string for the alternatives that didn't match. So we just need to filter out the length-zero strings from our results:
import itertools as it
raw_results = pat.findall(text)
results = [s for s in it.chain(*raw_results) if s]
# results set to: ['occo', 'cttc']
I guess it might be less confusing to just build four different patterns, run re.findall() on each, and join the results together.
EDIT: Okay, here is the code for building four patterns and trying each. I think this is an improvement.
import re
WS_BEFORE = r'(?<=\s)' # require white space before
WS_AFTER = r'(?=\s)' # require white space after
BOL = r'^' # beginning of line
EOL = r'$' # end of line
CCS_BEGIN = r'([' #begin a character class string
CCS_END = r']+)' # end a character class string
set_of_chars = get_the_chars_from_somewhere_I_do_not_care_where()
# set_of_chars now set to "ocat"
CCS = CCS_BEGIN + set_of_chars + CCS_END # build up character class string pattern
lst_s_pat = [
    BOL + CCS + WS_AFTER,
    WS_BEFORE + CCS + WS_AFTER,
    WS_BEFORE + CCS + EOL,
    BOL + CCS + EOL
]
lst_pat = [re.compile(s) for s in lst_s_pat]
text = "This is sensible. This not: occo cttc"
result = []
for pat in lst_pat:
    result.extend(pat.findall(text))
# result set to: ['occo', 'cttc']
EDIT: Okay, here is a very different approach. I like this one best.
First, we will match all words in the text. A word is defined as one or more characters that are not punctuation and are not white space.
Then, we use a filter to remove words from the above; we keep only words that are made only of the characters we want.
import re
import string
# Create a pattern that matches all characters not part of a word.
#
# Note that '-' has a special meaning inside a character class, but it
# is valid punctuation that we want to match, so put in a backslash in
# front of it to disable the special meaning and just match it.
#
# Use '^' which negates all the chars following.  So, a word is a series
# of characters that are all not whitespace and not punctuation.
WORD_BOUNDARY = string.whitespace + string.punctuation.replace('-', r'\-')
WORD = r'[^' + WORD_BOUNDARY + r']+'
# Create a pattern that matches only the words we want.
set_of_chars = get_the_chars_from_somewhere_I_do_not_care_where()
# set_of_chars now set to "ocat"
# build up character class string pattern
CCS = r'[' + set_of_chars + r']+'
pat_word = re.compile(WORD)
pat = re.compile(CCS)
text = "This is sensible.  This not: occo cttc"
# This makes it clear how we are doing this.
all_words = pat_word.findall(text)
result = [s for s in all_words if pat.match(s)]
# "lazy" generator expression that yields up good results when iterated
# May be better for very large texts.
result_genexp = (s for s in (m.group(0) for m in pat_word.finditer(text)) if pat.match(s))
# force the expression to expand out to a list
result = list(result_genexp)
# result set to: ['occo', 'cttc']
EDIT: Now I don't like any of the above solutions; please see the other answer, the one using \b, for the best solution in Python.
It is easy to make a regular expression that matches only a string of a specific set of characters. What you need to use is a "character class" with just the characters you want to match.
I'll do this example in English.
[ocat] This is a character class that will match a single character from the set [o, c, a, t]. Order of the characters doesn't matter.
[ocat]+ Putting a + on the end makes it match one or more characters from the set. But this is not enough by itself; if you had the word "coach" this would match and return "coac".
\b[ocat]+\b Now it only matches on word boundaries. (Thank you very much @Mark Tolonen for educating me about \b.)
So, just build up a pattern like the above, only using the desired character set at runtime, and there you go. You can use this pattern with re.findall() or re.finditer().
import re
words = ["cat", "dog", "tack", "coat"]
def get_words(chars_seq, words_seq=words):
    s_chars = ''.join(chars_seq)
    s_pat = r'\b[' + s_chars + r']+\b'
    pat = re.compile(s_pat)
    return [word for word in words_seq if pat.match(word)]
assert get_words(['o', 'c', 'a', 't']) == ["cat", "coat"]
assert get_words(['k', 'c', 't', 'a']) == ["cat", "tack"]
I would not use regular expressions to solve this problem. I would rather use collections.Counter like so:
>>> from collections import Counter
>>> def get_words(word_list, letter_string):
...     return [word for word in word_list
...             if Counter(word) & Counter(letter_string) == Counter(word)]
>>> words = ["cat", "dog", "tack", "coat"]
>>> letters = 'ocat'
>>> get_words(words, letters)
['cat', 'coat']
>>> letters = 'kcta'
>>> get_words(words, letters)
['cat', 'tack']
This solution should also work for other languages. Counter(word) & Counter(letter_string) finds the intersection between the two counters, or the min(c[x], f[x]). If this intersection is equivalent to your word, then you want to return the word as a match.
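The intersection test can be checked in isolation:

```python
from collections import Counter

letters = Counter('kcta')

# & keeps min(count) per element; a word matches when intersecting with the
# available letters preserves every one of the word's own counts
print((Counter('tack') & letters) == Counter('tack'))  # True
print((Counter('coat') & letters) == Counter('coat'))  # False: no 'o' available
```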
