Is this modified cleanedup function correct? - python

So I'm new to Python, and I was hoping I could get some insight into my cleanedup function. My cleanedup is supposed to keep not only letters but also numbers and certain symbols like '#' and '_'. Here is my code.
def cleanedup(s):
    alphabet = 'abcdefghijklmnopqrstuvwxyz'
    digits = '0123456789'
    cleantext = ''
    for character in s.lower():
        if character in alphabet, digits, or characters == '#', '_':
            cleantext += character
        else:
            cleantext += ' '
    return cleantext
I was hoping to see if this function is correct or if it needs some adjusting. If it does need adjusting, I hope it is nothing too far from what I have above. Thank you.

character in alphabet, digits, or characters == '#', '_' is not a valid Python expression. I'm surprised you're not getting an error. The correct way to express this would be
if character in alphabet or character in digits or character in ('#', '_'):
A better way would be to condense all the allowed characters into a single data structure, then compare the characters against that:
from string import ascii_lowercase, digits

allowed = set(ascii_lowercase + digits + '#_')

def cleanedup(s):
    return ''.join(c if c in allowed else ' ' for c in s.lower())
''.join is another way of combining many strings that doesn't create intermediate strings in the process.
A set is a data structure, like a list, that works more like a mathematical set. It's faster to look up whether or not an object is in a set than it is for a list.
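For illustration, here is the set-based version in use (definitions repeated so the snippet runs on its own; the sample string is made up):
from string import ascii_lowercase, digits

allowed = set(ascii_lowercase + digits + '#_')

def cleanedup(s):
    return ''.join(c if c in allowed else ' ' for c in s.lower())

print(cleanedup("Tag: #python_3!"))  # returns 'tag  #python_3 '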
A more advanced way of doing what you want would be to use regular expressions:
import re

pattern = re.compile("[^a-z0-9#_]")  # All characters that are not a-z, 0-9, _, and #

def cleanedup(s):
    return pattern.sub(' ', s.lower())
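For completeness, the same made-up sample run through the regex version gives the same result (definitions repeated so the snippet runs on its own):
import re

pattern = re.compile("[^a-z0-9#_]")

def cleanedup(s):
    return pattern.sub(' ', s.lower())

print(cleanedup("Tag: #python_3!"))  # returns 'tag  #python_3 '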

Related

Python - remove punctuation marks at the end and at the beginning of one or more words

I wanted to know how to remove punctuation marks at the end and at the beginning of one or more words.
If there are punctuation marks inside the word, we don't remove them.
For example:
input:
word = "!.test-one,-"
output:
word = "test-one"
Use strip:
>>> import string
>>> word = "!.test-one,-"
>>> word.strip(string.punctuation)
'test-one'
The best solution is to use the .strip(chars) method of the built-in class str.
Another approach would be to use a regular expression and the regular expressions module.
In order to understand what strip() and the regular expression do, you can take a look at two functions which duplicate the behavior of strip(): the first one uses recursion, the second one uses while loops:
chars = '''!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~'''

def cstm_strip_1(word, chars):
    # Approach using recursion:
    w = word[1 if word[0] in chars else 0: -1 if word[-1] in chars else None]
    if w == word:
        return w
    else:
        return cstm_strip_1(w, chars)

def cstm_strip_2(word, chars):
    # Approach using a while loop:
    i, j = 0, -1
    while word[i] in chars:
        i += 1
    while word[j] in chars:
        j -= 1
    return word[i:j+1]
import re, string
chars = string.punctuation
word = "~!.test-one^&test-one--two???"
wsc = word.strip(chars)
assert wsc == cstm_strip_1(word, chars)
assert wsc == cstm_strip_2(word, chars)
assert wsc == re.sub(r"(^[^\w]+)|([^\w]+$)", "", word)
word = "__~!.test-one^&test-one--two??__"
wsc = word.strip(chars)
assert wsc == cstm_strip_1(word, chars)
assert wsc == cstm_strip_2(word, chars)
# assert wsc == re.sub(r"(^[^\w]+)|([^\w]+$)", "", word)
assert re.sub(r"(^[^\w]+)|([^\w]+$)", "", word) == word
print(re.sub(r"(^[^\w]+)|([^\w]+$)", "", word), '!=', wsc )
print('"',re.sub(r"(^[^\w]+)|([^\w]+$)", "", "\tword\t"), '" != "', "\tword\t".strip(chars), '"', sep='' )
Notice that the result when using the given regular expression pattern can differ from the result when using .strip(string.punctuation), because the set of characters matched by [^\w] differs from the set of characters in string.punctuation (the underscore, for example, is punctuation but also a word character).
SUPPLEMENT
What does the regular expression pattern:
(^[^\w]+)|([^\w]+$)
mean?
Below is a detailed explanation:
The '|' character means "or", providing two alternatives for the
sub-string (called a match) which is to be found in the provided string.
'(^[^\w]+)' is the first of the two alternatives for a match.
'(' and ')' enclose what is called a "capturing group", here (^[^\w]+).
The first of the two '^' asserts the position at the start of the line.
'\w' (a backslash-escaped 'w') means "word character"
(i.e. letters a-z, A-Z, digits 0-9 and the underscore '_').
The second of the two '^' means logical "not"
(here: not a "word character"),
i.e. all characters except a-zA-Z0-9 and '_'
(for example '~' or 'ö').
Notice that the meaning of '^' depends on context:
'^' outside of [ ] means start of line/string;
'^' inside of [ ] as the first character means logical not,
and anywhere else inside [ ] it means itself.
'[' and ']' enclose the specification of a set of characters
and mean the occurrence of exactly one of them.
'+' means occurrence one or more times
of what was defined by the preceding token.
'([^\w]+$)' is the second alternative for a match,
differing from the first by stating that the match
must be found at the end of the string.
'$' means "end of the line" (or "end of string").
The regular expression pattern tells the regular expression engine to work as follows:
The engine looks at the start of the string for an occurrence of a non-word character. If one is found, it is remembered as a match, and the next character is checked and added to the already found ones if it is also a non-word character. This way the start of the string is checked for occurrences of non-word characters, which are then removed from the string if the pattern is used in re.sub(r"(^[^\w]+)|([^\w]+$)", "", word), which replaces any found characters with an empty string (in other words, it deletes the found characters from the string).
After the engine hits the first word character in the string, the search jumps from the start of the string to the end of the string, because the first alternative of the pattern is limited to the start of the line and only the second alternative remains. This way, non-word characters in the middle part of the string are not searched for.
The engine then looks at the end of the string for a non-word character and proceeds as at the start, but going backwards, to ensure that the found non-word characters are at the end of the string.
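For illustration, here is a small check (using the sample word from the question) of which parts of the string the two alternatives actually match:
import re

print(re.findall(r"(^[^\w]+)|([^\w]+$)", "!.test-one,-"))
# -> [('!.', ''), ('', ',-')]  (leading and trailing runs; the inner '-' is untouched)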
Using re.sub
import re
word = "!.test-one,-"
out = re.sub(r"(^[^\w]+)|([^\w]+$)", "", word)
print(out)
Gives:
test-one
Check this example using slicing:
import string
sentence = "_blogs that are consistently updated by people that know about the trends, and market, and care about giving quality content to their readers."
if sentence[0] in string.punctuation:
    sentence = sentence[1:]
if sentence[-1] in string.punctuation:
    sentence = sentence[:-1]
print(sentence)
Output:
blogs that are consistently updated by people that know about the trends, and market, and care about giving quality content to their readers
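Note that this slicing approach removes at most one character from each end; applied to the example from the original question it would leave some punctuation behind (a quick check):
import string

word = "!.test-one,-"
if word[0] in string.punctuation:
    word = word[1:]
if word[-1] in string.punctuation:
    word = word[:-1]
print(word)  # -> '.test-one,'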

python re.sub removes numeric characters palindrome

I am trying to remove punctuation to check if a phrase (or word) is a palindrome, but when I have a word with numbers they are removed and it returns True instead of False. "1a2", after cleaning punctuation with sub, returns 'a', though it should still give me '1a2'. I thought I was picking up only punctuation for substitution.
import re

def isPalindrome(s):
    clean = re.sub("[,.;@#?+^:%-=()!&$]", " ", s)
    lower = ''.join([i.lower() for i in clean.split()])
    if lower == lower[::-1]:
        return True
    else:
        return False

print(isPalindrome("1a2"))
You're using - inside your character class, where it creates a range (%-= happens to include the digits), so you need to escape it. Try this instead:
re.sub("[,.;@#?+^:%\-=()!&$]", " ", s)
Have a look at the re documentation for a list of special characters and for how a character class [] is written.
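For illustration, here is the check from the question with the hyphen escaped; the %-= range is gone, so the digits survive (a sketch):
import re

def isPalindrome(s):
    clean = re.sub(r"[,.;@#?+^:%\-=()!&$]", " ", s)
    lower = ''.join(i.lower() for i in clean.split())
    return lower == lower[::-1]

print(isPalindrome("1a2"))  # -> False, since '1a2' reversed is '2a1'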
I would use str.maketrans and the punctuation set from the string module in your case, because I think that this is more readable than a regex:
import string
s = s.translate(str.maketrans('', '', string.punctuation))
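A minimal sketch of how that would look inside the palindrome check from the question (the second test string is just an extra illustration):
import string

def isPalindrome(s):
    clean = s.translate(str.maketrans('', '', string.punctuation))
    lower = ''.join(clean.lower().split())
    return lower == lower[::-1]

print(isPalindrome("1a2"))                             # -> False
print(isPalindrome("A man, a plan, a canal: Panama"))  # -> True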
Special characters must be escaped in your regex string. I.e.
clean = re.sub(r"[,\.;#\#\?\+\^:%\-=\(\)!\&\$]", " ", s)
or use re.escape, which automatically escapes special characters
esc = re.escape(r',.;@#?+^:%-=()!&$')
clean = re.sub("[" + esc + "]", " ", s)

Replacing or introducing a space after a sequence of letters some known or unknown then write on newlines content

The problem is that I now have a string where some words are stuck together:
fooledDog and I need fooled D****string text continues with inserted " "
whateveredJ and I need whatevered J*******string text continues with inserted " "
string = string.replace("edD","ed D")
string = string.replace("edJ","ed J")
but instead of "D" and "J" I need to match any possible character, so as to avoid hard-coding values here, so that the code will work with any letter or number in this position.
This is a pretty easy problem to solve with regular expressions (not something that is always true, even when regexes are the best tool for the job). Try this:
import re
text = "fooledDog whateveredJob"
fixed_text = re.sub(r'ed([A-Z])', r'ed \1', text)
print(fixed_text) # prints "fooled Dog whatevered Job"
The pattern looks for the letters 'ed' in lowercase, followed by any capital letter (which gets captured). The replacement is 'ed' and a space, followed by the capital letter from the capturing group.
I don't fully understand your question, but it seems you have some camelCase words you want to separate. If that's the case, try this:
import re
name = 'CamelCaseTest123'
splitted = re.sub('(?!^)([A-Z][a-z]+)', r' \1', name).split()
Output:
['Camel', 'Case', 'Test123']

Python .replace() function, removing backslash in certain way

I have a huge string which contains emoticons like "\u201d", AS WELL AS "\advance\".
All that I need is to remove the backslashes so that:
- \u201d = \u201d
- \united\ = united
(as it breaks the process of uploading it to the BigQuery database)
I know it should be somehow this way:
string.replace('\\', '') But I'm not sure how to keep the \u201d emoticons.
ADDITIONAL:
Example of Unicode emoticons:
\ud83d\udc9e
\u201c
\u2744\ufe0f\u2744\ufe0f\u2744\ufe0f
You can split on all '\' and then use a regex to put the leading '\' back in front of your emoticons:
s = '\\advance\\\\united\\ud83d\\udc9e\\u201c\\u2744\\ufe0f\\u2744\\ufe0f\\u2744\\ufe0f'
import re
print(re.sub('(u[a-f0-9]{4})',lambda m: '\\'+m.group(0),''.join(s.split('\\'))))
As your emoticons are 'u' followed by 4 hex digits, 'u[a-f0-9]{4}' will match them all, and you just have to add the leading backslashes back.
First of all, you delete every '\' in the string with either ''.join(s.split('\\')) or s.replace('\\', '').
Then you match every "emoticon" with the regex u[a-f0-9]{4} (which is 'u' with 4 hex characters behind it).
And with re.sub, you replace every match with itself prefixed by a leading \\.
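Here is the same idea written out step by step, on a shortened version of the sample string, in case the one-liner is hard to read (a sketch, not a drop-in replacement):
import re

s = '\\advance\\\\united\\ud83d\\udc9e\\u201c'

no_backslashes = s.replace('\\', '')            # 1. drop every backslash
restored = re.sub(r'u[a-f0-9]{4}',              # 2. find 'u' + 4 hex digits ...
                  lambda m: '\\' + m.group(0),  # ... and put a backslash back in front
                  no_backslashes)
print(restored)  # -> advanceunited\ud83d\udc9e\u201c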
You could simply add the backslash back in front of your string after the replacement, if the string starts with \u and has at least one digit.
import re

def clean(s):
    re1 = '(\\\\)'  # Any Single Character "\"
    re2 = '(u)'     # Any Single Character "u"
    re3 = '.*?'     # Non-greedy match on filler
    re4 = '(\\d)'   # Any Single Digit
    rg = re.compile(re1 + re2 + re3 + re4, re.IGNORECASE | re.DOTALL)
    m = rg.search(s)
    if m:
        r = '\\' + s.replace('\\', '')
    else:
        r = s.replace('\\', '')
    return r
a = '\\u123'
b = '\\united\\'
c = '\\ud83d'
>>> print(a, b, c)
\u123 \united\ \ud83d
>>> print(clean(a), clean(b), clean(c))
\u123 united \ud83d
Of course, you have to split your string if multiple entries are in the same line:
string = '\\u123 \\united\\ \\ud83d'
clean_string = ' '.join([clean(word) for word in string.split()])
You can use this simple method to replace the last occurrence of the backslash character:
def replace_character(s, old, new):
    return (s[::-1].replace(old[::-1], new[::-1], 1))[::-1]
replace_character('\\advance\\', '\\', '')
replace_character('\\u201d', '\\', '')
Output:
\advance
u201d
You can do it as simply as this:
text = text[:-1]
Here you just drop the last character.

Regex to get list of all words with specific letters (unicode graphemes)

I'm writing a Python script for a FOSS language learning initiative. Let's say I have an XML file (or to keep it simple, a Python list) with a list of words in a particular language (in my case, the words are in Tamil, which uses a Brahmi-based Indic script).
Given a list of letters, I need to draw out the subset of those words that can be spelled using just those letters.
An English example:
words = ["cat", "dog", "tack", "coat"]
get_words(['o', 'c', 'a', 't']) should return ["cat", "coat"]
get_words(['k', 'c', 't', 'a']) should return ["cat", "tack"]
A Tamil example:
words = [u"மரம்", u"மடம்", u"படம்", u"பாடம்"]
get_words([u'ம', u'ப', u'ட', u'ம்']) should return [u"மடம்", u"படம்"]
get_words([u'ப', u'ம்', u'ட']) should return [u"படம்"]
The order that the words are returned in, or the order that the letters are entered in should not make a difference.
Although I understand the difference between unicode codepoints and graphemes, I'm not sure how they're handled in regular expressions.
In this case, I would want to match only those words that are made up of the specific graphemes in the input list, and nothing else (i.e. the markings that follow a letter should only follow that letter, but the graphemes themselves can occur in any order).
To support characters that can span several Unicode codepoints:
# -*- coding: utf-8 -*-
import re
import unicodedata
from functools import partial
NFKD = partial(unicodedata.normalize, 'NFKD')
def match(word, letters):
word, letters = NFKD(word), map(NFKD, letters) # normalize
return re.match(r"(?:%s)+$" % "|".join(map(re.escape, letters)), word)
words = [u"மரம்", u"மடம்", u"படம்", u"பாடம்"]
get_words = lambda letters: [w for w in words if match(w, letters)]
print(" ".join(get_words([u'ம', u'ப', u'ட', u'ம்'])))
# -> மடம் படம்
print(" ".join(get_words([u'ப', u'ம்', u'ட'])))
# -> படம்
It assumes that the same character can be used zero or more times in a word.
If you want only words that contain exactly given characters:
import regex  # $ pip install regex
chars = regex.compile(r"\X").findall  # get all characters

def match(word, letters):
    return sorted(chars(word)) == sorted(letters)

words = ["cat", "dog", "tack", "coat"]
print(" ".join(get_words(['o', 'c', 'a', 't'])))
# -> coat
print(" ".join(get_words(['k', 'c', 't', 'a'])))
# -> tack
Note: there is no cat in the output in this case because cat doesn't use all given characters.
What does it mean to normalize? And could you please explain the syntax of the re.match() regex?
>>> import re
>>> re.escape('.')
'\\.'
>>> c = u'\u00c7'
>>> cc = u'\u0043\u0327'
>>> cc == c
False
>>> re.match(r'%s$' % (c,), cc) # do not match
>>> import unicodedata
>>> norm = lambda s: unicodedata.normalize('NFKD', s)
>>> re.match(r'%s$' % (norm(c),), norm(cc)) # do match
<_sre.SRE_Match object at 0x1364648>
>>> print c, cc
Ç Ç
Without normalization c and cc do not match. The characters are from the unicodedata.normalize() docs.
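To make the construction in match() above more concrete, here is roughly what the built pattern looks like, using the English example (the Tamil letters work the same way):
import re

letters = ['o', 'c', 'a', 't']
pattern = r"(?:%s)+$" % "|".join(map(re.escape, letters))
print(pattern)                           # -> (?:o|c|a|t)+$
print(bool(re.match(pattern, "coat")))   # -> True
print(bool(re.match(pattern, "dog")))    # -> False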
EDIT: Okay, don't use any of the answers from here. I wrote them all while thinking Python regular expressions didn't have a word boundary marker, and I tried to work around this lack. Then @Mark Tolonen added a comment that Python has \b as a word boundary marker! So I posted another answer, short and simple, using \b. I'll leave this here in case anyone is interested in seeing solutions that work around the lack of \b, but I don't really expect anyone to be.
It is easy to make a regular expression that matches only a string of a specific set of characters. What you need to use is a "character class" with just the characters you want to match.
I'll do this example in English.
[ocat] This is a character class that will match a single character from the set [o, c, a, t]. Order of the characters doesn't matter.
[ocat]+ Putting a + on the end makes it match one or more characters from the set. But this is not enough by itself; if you had the word "coach" this would match and return "coac".
Sadly, there isn't a regular expression feature for "word boundary". [EDIT: This turns out not to be correct, as I said in the first paragraph.] We need to make one of our own. There are two possible word beginnings: the start of a line, or whitespace separating our word from the previous word. Similarly, there are two possible word endings: end of a line, or whitespace separating our word from the next word.
Since we will be matching some extra stuff we don't want, we can put parentheses around the part of the pattern we do want.
To match two alternatives, we can make a group in parentheses and separate the alternatives with a vertical bar. Python regular expressions have a special notation to make a group whose contents we don't want to keep: (?:)
So, here is the pattern to match the beginning of a word. Start of line or white space: (?:^|\s)
Here is the pattern for end of word. White space or end of line: (?:\s|$)
Putting it all together, here is our final pattern:
(?:^|\s)([ocat]+)(?:\s|$)
You can build this dynamically. You don't need to hard-code the whole thing.
import re
s_pat_start = r'(?:^|\s)(['
s_pat_end = r']+)(?:\s|$)'
set_of_chars = get_the_chars_from_somewhere_I_do_not_care_where()
# set_of_chars is now set to the string: "ocat"
s_pat = s_pat_start + set_of_chars + s_pat_end
pat = re.compile(s_pat)
Now, this doesn't in any way check for valid words. If you have the following text:
This is sensible. This not: occo cttc
The pattern I showed you will match on occo and cttc, and those are not really words. They are strings made only of letters from [ocat] though.
So just do the same thing with Unicode strings. (If you are using Python 3.x then all strings are Unicode strings, so there you go.) Put the Tamil characters in the character class and you are good to go.
This has a confusing problem: re.findall() doesn't return all possible matches.
EDIT: Okay, I figured out what was confusing me.
What we want is for our pattern to work with re.findall() so you can collect all the words. But re.findall() only finds non-overlapping patterns. In my example, re.findall() only returned ['occo'] and not ['occo', 'cttc'] as I expected... but this is because my pattern was matching the white space after occo. The match group didn't collect the white space, but it was matched all the same, and since re.findall() wants no overlap between matches, the white space was "used up" and didn't work for cttc.
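For illustration, a minimal reproduction of that behaviour using the same sample text:
import re

pat = re.compile(r'(?:^|\s)([ocat]+)(?:\s|$)')
print(pat.findall("This is sensible. This not: occo cttc"))
# -> ['occo']  (the space after 'occo' was consumed, so 'cttc' has no leading \s left to match)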
The solution is to use a feature of Python regular expressions that I have never used before: special syntax that says "must not be preceded by" or "must not be followed by". The sequence \S matches any non-whitespace so we could use that. But punctuation is non-whitespace, and I think we do want punctuation to delimit a word. There is also special syntax for "must be preceded by" or "must be followed by". So here is, I think, the best we can do:
Build a string that means "match when the character class string is at start of line and followed by whitespace, or when character class string is preceded by whitespace and followed by whitespace, or when character class string is preceded by whitespace and followed by end of line, or when character class string is preceded by start of line and followed by end of line".
Here is that pattern using ocat:
r'(?:^([ocat]+)(?=\s)|(?<=\s)([ocat]+)(?=\s)|(?<=\s)([ocat]+)$|^([ocat]+)$)'
I'm very sorry but I really do think this is the best we can do and still work with re.findall()!
It's actually less confusing in Python code though:
import re
NMGROUP_BEGIN = r'(?:' # begin non-matching group
NMGROUP_END = r')' # end non-matching group
WS_BEFORE = r'(?<=\s)' # require white space before
WS_AFTER = r'(?=\s)' # require white space after
BOL = r'^' # beginning of line
EOL = r'$' # end of line
CCS_BEGIN = r'([' #begin a character class string
CCS_END = r']+)' # end a character class string
PAT_OR = r'|'
set_of_chars = get_the_chars_from_somewhere_I_do_not_care_where()
# set_of_chars now set to "ocat"
CCS = CCS_BEGIN + set_of_chars + CCS_END # build up character class string pattern
s_pat = (NMGROUP_BEGIN +
BOL + CCS + WS_AFTER + PAT_OR +
WS_BEFORE + CCS + WS_AFTER + PAT_OR +
WS_BEFORE + CCS + EOL + PAT_OR +
BOL + CCS + EOL +
NMGROUP_END)
pat = re.compile(s_pat)
text = "This is sensible. This not: occo cttc"
pat.findall(text)
# returns: [('', 'occo', '', ''), ('', '', 'cttc', '')]
So, the crazy thing is that when we have alternative patterns that could match, re.findall() seems to return an empty string for the alternatives that didn't match. So we just need to filter out the length-zero strings from our results:
import itertools as it
raw_results = pat.findall(text)
results = [s for s in it.chain(*raw_results) if s]
# results set to: ['occo', 'cttc']
I guess it might be less confusing to just build four different patterns, run re.findall() on each, and join the results together.
EDIT: Okay, here is the code for building four patterns and trying each. I think this is an improvement.
import re
WS_BEFORE = r'(?<=\s)' # require white space before
WS_AFTER = r'(?=\s)' # require white space after
BOL = r'^' # beginning of line
EOL = r'$' # end of line
CCS_BEGIN = r'([' #begin a character class string
CCS_END = r']+)' # end a character class string
set_of_chars = get_the_chars_from_somewhere_I_do_not_care_where()
# set_of_chars now set to "ocat"
CCS = CCS_BEGIN + set_of_chars + CCS_END # build up character class string pattern
lst_s_pat = [
    BOL + CCS + WS_AFTER,
    WS_BEFORE + CCS + WS_AFTER,
    WS_BEFORE + CCS + EOL,
    BOL + CCS + EOL
]
lst_pat = [re.compile(s) for s in lst_s_pat]
text = "This is sensible. This not: occo cttc"
result = []
for pat in lst_pat:
    result.extend(pat.findall(text))
# result set to: ['occo', 'cttc']
EDIT: Okay, here is a very different approach. I like this one best.
First, we will match all words in the text. A word is defined as one or more characters that are not punctuation and are not white space.
Then, we use a filter to remove words from the above; we keep only words that are made only of the characters we want.
import re
import string
# Create a pattern that matches all characters not part of a word.
#
# Note that '-' has a special meaning inside a character class, but it
# is valid punctuation that we want to match, so put in a backslash in
# front of it to disable the special meaning and just match it.
#
# Use '^' which negates all the chars following.  So, a word is a series
# of characters that are all not whitespace and not punctuation.
WORD_BOUNDARY = string.whitespace + string.punctuation.replace('-', r'\-')
WORD = r'[^' + WORD_BOUNDARY + r']+'
# Create a pattern that matches only the words we want.
set_of_chars = get_the_chars_from_somewhere_I_do_not_care_where()
# set_of_chars now set to "ocat"
# build up character class string pattern
CCS = r'[' + set_of_chars + r']+'
pat_word = re.compile(WORD)
pat = re.compile(CCS)
text = "This is sensible.  This not: occo cttc"
# This makes it clear how we are doing this.
all_words = pat_word.findall(text)
result = [s for s in all_words if pat.match(s)]
# "lazy" generator expression that yields up good results when iterated
# May be better for very large texts.
result_genexp = (s for s in (m.group(0) for m in pat_word.finditer(text)) if pat.match(s))
# force the expression to expand out to a list
result = list(result_genexp)
# result set to: ['occo', 'cttc']
EDIT: Now I don't like any of the above solutions; please see the other answer, the one using \b, for the best solution in Python.
It is easy to make a regular expression that matches only a string of a specific set of characters. What you need to use is a "character class" with just the characters you want to match.
I'll do this example in English.
[ocat] This is a character class that will match a single character from the set [o, c, a, t]. Order of the characters doesn't matter.
[ocat]+ Putting a + on the end makes it match one or more characters from the set. But this is not enough by itself; if you had the word "coach" this would match and return "coac".
\b[ocat]+\b Now it only matches on word boundaries. (Thank you very much @Mark Tolonen for educating me about \b.)
So, just build up a pattern like the above, only using the desired character set at runtime, and there you go. You can use this pattern with re.findall() or re.finditer().
import re

words = ["cat", "dog", "tack", "coat"]

def get_words(chars_seq, words_seq=words):
    s_chars = ''.join(chars_seq)
    s_pat = r'\b[' + s_chars + r']+\b'
    pat = re.compile(s_pat)
    return [word for word in words_seq if pat.match(word)]

assert get_words(['o', 'c', 'a', 't']) == ["cat", "coat"]
assert get_words(['k', 'c', 't', 'a']) == ["cat", "tack"]
I would not use regular expressions to solve this problem. I would rather use collections.Counter like so:
>>> from collections import Counter
>>> def get_words(word_list, letter_string):
...     return [word for word in word_list if Counter(word) & Counter(letter_string) == Counter(word)]
>>> words = ["cat", "dog", "tack", "coat"]
>>> letters = 'ocat'
>>> get_words(words, letters)
['cat', 'coat']
>>> letters = 'kcta'
>>> get_words(words, letters)
['cat', 'tack']
This solution should also work for other languages. Counter(word) & Counter(letter_string) finds the intersection of the two counters, i.e. for each character it keeps the minimum of its count in the word and its count in the letter string. If this intersection is equal to Counter(word), every character of the word is available among the given letters, and the word is returned as a match.
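For illustration, here is what the intersection looks like for a couple of the sample words:
from collections import Counter

print(Counter("coat") & Counter("ocat"))  # every letter of 'coat' survives with count 1
print(Counter("tack") & Counter("ocat"))  # 'k' is dropped, so this is not equal to Counter('tack')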
