I need to write a function that replaces every casing variant of a word with its lowercase form.
For example, a paragraph may contain the word 'something' in different formats like 'Something', 'SomeThing', 'SOMETHING', 'SomeTHing', and I need to convert all of these variants to the lowercase 'something'.
How do I write a function that performs this lowercase replacement?
You can split your paragraph into different words, then use the slugify module to generate a slug of each word, compare it with "something", and if there is a match, replace the word with "something".
In [1]: text = "This paragraph contains Something, SOMETHING, AND SomeTHing"
In [2]: from slugify import slugify
In [3]: for word in text.split(" "):  # split the text on spaces and iterate over the words
   ...:     if slugify(word) == "something":  # compare the word's slug with "something"
   ...:         text = text.replace(word, word.lower())
In [4]: text
Out[4]: 'This paragraph contains something, something, AND something'
Split the text into single words and check whether each word, written in lower case, is "something". If yes, then change its case to lower:
if word.lower() == "something":
    text = text.replace(word, "something")
To know how to split a text into words, see this question.
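Putting those two lines into a complete loop, a minimal sketch might look like this (trailing punctuation is stripped before comparing, so that tokens like 'Something,' still match; the sample text is taken from the question):

```python
text = "This paragraph contains Something, SOMETHING, AND SomeTHing"

for word in text.split():
    # strip surrounding punctuation before comparing
    if word.strip(".,!?;:").lower() == "something":
        text = text.replace(word, word.lower())
```

After the loop, `text` is `'This paragraph contains something, something, AND something'`.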
Another way is to iterate through the text position by position and check whether the 9-character substring starting there spells "something":
text = "Many words: SoMeThInG, SOMEthING, someTHing"
for n in range(len(text) - 8):
    if text[n:n+9].lower() == "something":  # check whether "something" starts here
        text = text.replace(text[n:n+9], "something")
print text
You can also use re.findall to search and split the paragraph into words and punctuation, and replace all the different cases of "Something" with the lowercase version:
import re
text = "Something, Is: SoMeThInG, SOMEthING, someTHing."
to_replace = "something"
words_punct = re.findall(r"[\w']+|[.,!?;: ]", text)
new_text = "".join(to_replace if x.lower() == to_replace else x for x in words_punct)
print(new_text)
Which outputs:
something, Is: something, something, something.
Note: re.findall requires a hardcoded regular expression to search for contents in a string. If your actual text contains characters that are not covered by the regular expression above, you will need to add them as needed.
I'm facing this issue:
I need to remove duplications from the beginning of each word of a text, but only if all words in the text are duplicated (and capitalize the result afterwards).
Examples:
text = str("Thethe cacar isis momoving vvery fasfast")
So this text should be treated and printed as:
output:
"The car is moving very fast"
I got this code to treat the text:
import re

phrase = str("Thethe cacar isis momoving vvery fasfast")
phrase_up = phrase.upper()
text = re.sub(r'(.+?)\1+', r'\1', phrase_up)
text_cap = text.capitalize()
"The car is moving very fast"
Or:
def remove_duplicates(word):
    unique_letters = set(word)
    sorted_letters = sorted(unique_letters, key=word.index)
    return ''.join(sorted_letters)

words = phrase.split(' ')
new_phrase = ' '.join(remove_duplicates(word) for word in words)
What I can't work out is HOW to determine whether a text needs this treatment.
Because if we get a text such as:
"This meme is funny, said Barbara"
Where even though "meme" and "Barbara" ("ar"-"ar") contain repeating substrings, not all words do, so this text shouldn't be treated.
Any pointers here?
I would suggest adopting a check for whether a word is legal, using something like what is described in this post's best answer. If the word is not an English word, then you should apply the regex.
For example, a word like "meme" should be in the English dictionary, so you should not check it for repetitions.
So I would first split the string on spaces to get the tokens, then check whether each token is an English word. If it is, skip the regex check; otherwise check for repetitions.
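A minimal sketch of that idea, with a tiny hardcoded word set standing in for a real English dictionary (in practice you would load a proper word list, e.g. from a file or a corpus package). The duplicate-removal regex is the one from the question, with IGNORECASE added so "The" matches its repeat "the" without uppercasing first:

```python
import re

# stand-in for a real English dictionary (assumption for illustration)
english_words = {"this", "meme", "is", "funny", "said", "barbara"}

def dedupe(word):
    # collapse a repeated leading substring, e.g. "Thethe" -> "The";
    # IGNORECASE lets the backreference match across case
    return re.sub(r'(.+?)\1+', r'\1', word, flags=re.IGNORECASE)

def maybe_treat(phrase):
    tokens = phrase.split()
    # treat the phrase only if no token is already a legal English word
    if all(tok.lower().strip(".,") not in english_words for tok in tokens):
        return ' '.join(dedupe(tok) for tok in tokens).capitalize()
    return phrase
```

With the question's examples, `maybe_treat("Thethe cacar isis momoving vvery fasfast")` yields "The car is moving very fast", while "This meme is funny, said Barbara" is returned untouched because "this", "meme", etc. are legal words.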
I have a string and rules/mappings for replacement and no-replacements.
E.g.
"This is an example sentence that needs to be processed into a new sentence."
"This is a second example sentence that shows how 'sentence' in 'sentencepiece' should not be replaced."
Replacement rules:
replace_dictionary = {'sentence': 'processed_sentence'}
no_replace_set = {'example sentence'}
Result:
"This is an example sentence that needs to be processed into a new processed_sentence."
"This is a second example sentence that shows how 'processed_sentence' in 'sentencepiece' should not be replaced."
Additional criteria:
Only replace if case is matched, i.e. case matters.
Whole-word replacement only; punctuation should be ignored, but kept after replacement.
I was wondering what the cleanest way to solve this problem in Python 3.x would be.
Based on the answer of demongolem.
UPDATE
I am sorry, I missed the fact that only whole words should be replaced. I updated my code and even generalized it into a function.
import re

def replace_whole(sentence, replace_token, replace_with, dont_replace):
    rx = f"""["'.,:; ]({replace_token})["'.,:; ]"""
    matches = re.finditer(rx, sentence)
    out_sentence = sentence
    found = []
    indices = []
    for m in matches:
        indices.append(m.start(0))
        found.append(m.group())
    context_size = len(dont_replace)
    for i in range(len(indices)):
        context = sentence[indices[i] - context_size:indices[i] + context_size]
        if dont_replace in context:
            continue
        # first replace the word only in the substring found
        to_replace = found[i].replace(replace_token, replace_with)
        # then replace the word in the context found, so any surrounding
        # token like quotes or punctuation is carried over unchanged
        replace_val = context.replace(found[i], to_replace)
        # finally replace the context in the (partially rewritten) sentence
        out_sentence = out_sentence.replace(context, replace_val)
    return out_sentence
Use regular expressions to find all occurrences and values of your string (we need to check whether each hit is a whole word or embedded in another word) by using finditer(). You might need to adjust rx to match your definition of "whole word". Then take the context around each hit, of the size of your no_replace rule, and check whether the context contains your no_replace string.
If it does not, you may replace it: use replace() on the matched word only, then replace the occurrence of the word inside the context (so any surrounding token like quotes or periods gets carried over and the context does not change), then replace the context in the whole text. That way the replacement is nearly unique and no weird behaviour should happen.
Using your examples, this leads to:
replace_whole(sen2, "sentence", "processed_sentence", "example sentence")
>>>"This is a second example sentence that shows how 'processed_sentence' in 'sentencepiece' should not be replaced."
and
replace_whole(sen1, "sentence", "processed_sentence", "example sentence")
>>>'This is an example sentence that needs to be processed into a new processed_sentence.'
After some research, this is what I believe to be the best and cleanest solution to my problem. It works by calling match_fun whenever a match has been found; match_fun performs the replacement only if no "no-replace" phrase overlaps the current match. Let me know if you need more clarification or if you believe something can be improved.
import re

replace_dict = ...  # The code below assumes you already have this
no_replace_dict = ...  # Maps each replace token to its no-replace phrases
text = ...  # The text on input.

def match_fun(match: re.Match) -> str:
    str_match: str = match.group()
    if str_match not in no_replace_dict:
        return replace_dict[str_match]
    for no_replace in no_replace_dict[str_match]:
        for no_replace_match in re.finditer(r'\b' + no_replace + r'\b', text):
            # skip the replacement if the no-replace phrase overlaps this match
            if no_replace_match.start() < match.end() and match.start() < no_replace_match.end():
                return str_match
    return replace_dict[str_match]

for replace in replace_dict:
    pattern = re.compile(r'\b' + replace + r'\b')
    text = pattern.sub(match_fun, text)
I want to convert all the titlecase words (words starting with an uppercase character and having the rest of their characters lowercase) in the string to lowercase. For example, if my initial string is:
text = " ALL people ARE Great"
I want my resultant string to be:
"ALL people ARE great"
I tried the following, but it did not work:
text = text.split()
for i in text:
    if i in [word for word in a if not word.islower() and not word.isupper()]:
        text[i] = text[i].lower()
I also checked the related question Check if string is upper, lower, or mixed case in Python. I want to iterate over my dataframe and do this for each word that meets the criteria.
You could define your own transform function:
def transform(s):
    if len(s) == 1 and s.isupper():
        return s.lower()
    if s[0].isupper() and s[1:].islower():
        return s.lower()
    return s

text = " ALL people ARE Great"
final_text = " ".join([transform(word) for word in text.split()])
You can use str.istitle() to check whether a word is titlecased, i.e. whether its first character is uppercase and the rest are lowercase.
For getting your desired result, you need to:
Convert your string to a list of words using str.split().
Do the transformation you need using str.istitle() and str.lower() (I am using a list comprehension to iterate the list and generate a new list of words in the desired format).
Join the list back into a string using str.join().
For example:
>>> text = " ALL people ARE Great"
>>> ' '.join([word.lower() if word.istitle() else word for word in text.split()])
'ALL people ARE great'
I am trying to change the words that are nouns in a text to "noun".
I am having trouble. Here is what I have so far.
def noun(file):
    for word in file:
        for ch in word:
            if ch[-1:-3] == "ion" or ch[-1:-3] == "ism" or ch[-1:-3] == "ity":
                word = "noun"
            if file(word-1) == "the" and (file(word+1) == "of" or file(word+1) == "on"):
                word = "noun"
                # words that appear after the
    return outfile
Any ideas?
Your slices are empty:
>>> 'somethingion'[-1:-3]
''
because the endpoint lies before the start. You could just use [-3:] here:
>>> 'somethingion'[-3:]
'ion'
But you'd be better off using str.endswith() instead:
ch.endswith(("ion", "ism", "ity"))
The function will return True if the string ends with any of the 3 given strings.
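For example, applied to a few sample words (the word list here is just for illustration):

```python
words = ["scion", "capitalism", "unity", "doom"]
# str.endswith() accepts a tuple of suffixes and returns True if any matches
noun_like = [w for w in words if w.endswith(("ion", "ism", "ity"))]
```

This keeps "scion", "capitalism", and "unity" but drops "doom".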
Note that ch is not actually a word: if word is a string, then for ch in word iterates over individual characters, and those are never going to end in 3-character strings, being only one character long themselves.
Your attempts to look at the next and previous words are also going to fail; you cannot use a list or file object as a callable, let alone use file(word - 1) as a meaningful expression (string - 1 fails, as does file(...)).
Instead of looping over the 'word', you could use a regular expression here:
import re
nouns = re.compile(r'(?<=\bthe\b)(\s*\w+(?:ion|ism|ity)\s*)(?=\b(?:of|on)\b)')
some_text = nouns.sub(' noun ', some_text)
This looks for words ending in your three substrings, but only if preceded by the and followed by of or on and replaces those with noun.
Demo:
>>> import re
>>> nouns = re.compile(r'(?<=\bthe\b)(\s*\w+(?:ion|ism|ity)\s*)(?=\b(?:of|on)\b)')
>>> nouns.sub(' noun ', 'the scion on the prism of doom')
'the noun on the noun of doom'
I have a long text file (a screenplay). I want to turn this text file into a list (where every word is separated) so that I can search through it later on.
The code I have at the moment is:
file = open('screenplay.txt', 'r')
words = list(file.read().split())
print words
I think this works to split up all the words into a list, however I'm having trouble removing all the extra stuff like commas and periods at the end of words. I also want to make capital letters lower case (because I want to be able to search in lower case and have both capitalized and lower case words show up). Any help would be fantastic :)
This is a job for regular expressions!
For example:
import re
file = open('screenplay.txt', 'r')
# .lower() returns a version with all upper case characters replaced with lower case characters.
text = file.read().lower()
file.close()
# replaces anything that is not a lowercase letter, a space, or an apostrophe with a space:
text = re.sub('[^a-z\ \']+', " ", text)
words = list(text.split())
print words
A screenplay should be short enough to be read into memory in one fell swoop. If so, you could then remove all punctuation using the translate method. Finally, you can produce your list simply by splitting on whitespace using str.split:
import string
with open('screenplay.txt', 'rb') as f:
    content = f.read()
content = content.translate(None, string.punctuation).lower()
words = content.split()
print words
Note that this will change Mr.Smith into mrsmith. If you'd like it to become ['mr', 'smith'], you could replace all punctuation with spaces and then use str.split:
def using_translate(content):
    table = string.maketrans(
        string.punctuation,
        ' ' * len(string.punctuation))
    content = content.translate(table).lower()
    words = content.split()
    return words
One problem you might encounter using a positive regex pattern such as [a-z]+ is that it will only match ascii characters. If the file has accented characters, the words would get split apart.
Gruyère would become ['Gruy','re'].
You could fix that by using re.split to split on punctuation.
For example,
def using_re(content):
    words = re.split(r"[ %s\t\n]+" % (string.punctuation,), content.lower())
    return words
However, using str.translate is faster:
In [72]: %timeit using_re(content)
100000 loops, best of 3: 9.97 us per loop
In [73]: %timeit using_translate(content)
100000 loops, best of 3: 3.05 us per loop
Use the replace method:
mystring = mystring.replace(",", "")
If you want a more elegant solution that you will use many times over, read up on regular expressions. Most languages support them, and they are extremely useful for more complicated replacements and such.
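For instance, a single re.sub call can strip several punctuation characters at once (the sample string is just for illustration):

```python
import re

text = "Hello, world. This is a comma-heavy, punctuated sentence."
# remove commas and periods in one pass, then lowercase the result
cleaned = re.sub(r"[.,]", "", text).lower()
```

The character class `[.,]` can be extended with any other characters you want removed.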
You could use a dictionary to specify what characters you don't want, and format the current string based on your choices.
replaceChars = {'.':'',',':'', ' ':''}
print reduce(lambda x, y: x.replace(y, replaceChars[y]), replaceChars, "ABC3.2,1,\nCda1,2,3....".lower())
Output:
abc321
cda123
You can use a simple regexp to create a set of all words (sequences of one or more alphabetic characters):
import re
words = set(re.findall("[a-z]+", f.read().lower()))
Using a set each word will be included just once.
Just using findall will instead give you all the words in order.
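To illustrate the difference between the two (the sample text is just for illustration):

```python
import re

text = "The CAT sat on the mat. The cat left."
# every word, lowercased, in order of appearance (with duplicates)
all_words = re.findall("[a-z]+", text.lower())
# each distinct word exactly once (order not preserved)
unique_words = set(all_words)
```

Here `all_words` keeps all nine tokens in order, while `unique_words` collapses them to six distinct words.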
You can try something like this; it probably needs some work on the regexp, though:
import re
text = file.read()
words = map(lambda x: re.sub("[,.!?]", "", x).lower(), text.split())
I have tried this code and it works in my case:
from string import punctuation, whitespace

s = ''
with open("path of your file", "r") as myfile:
    content = myfile.read().split()
    for word in content:
        if (word in punctuation) or (word in whitespace):
            pass
        else:
            s += word.lower()
print(s)