Remove white space after detokenizing a string with apostrophe - python

I want to remove the white space in words like can't or won't either through regex or when detokenizing
from nltk.tokenize import WordPunctTokenizer
tok = WordPunctTokenizer()
detok = MosesDetokenizer()
pattern= "[^\w ]+ "
text= "i can ' t use this cause they won ' t fit"
string= re.sub(pattern, '', text)
tk = tok.tokenize(string)
output= detok.detokenize(tk, return_str = True)
print(output)
"i can 't use this cause they won' t fit"
Any ideas on how I can remove the white space after 'can' and 'won' so that I get can't and won't? When I use output = (' '.join(tk)).strip() to detokenize, I get double white space, one before and one after the apostrophe. Example: i can ' t use this cause they won ' t fit

I think that you can simply do something like:
output = "i can 't use this cause they won' t fit"
# Handle both the space before and the space after the apostrophe
output = output.replace(" '", "'").replace("' ", "'")
print(output)
"i can't use this cause they won't fit"

@BenT I can't speak to the regex, but you can apply the following operations to your output:
output = "i can 't use this cause they won' t fit"
output = "'".join(output.split(" '"))
output = "'".join(output.split("' "))
print(output)
"i can't use this cause they won't fit"
One line solution is also there:
output = output.replace("' ", "'").replace(" '", "'")
print(output)
"i can't use this cause they won't fit"

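Since the question also asks about a regex, here is a minimal sketch of that route. The helper name join_apostrophes is made up for illustration, and the pattern assumes every apostrophe that sits between two word characters belongs to a contraction, so it would also glue together quoted words such as say 'hello' now.

```python
import re

def join_apostrophes(text):
    # Hypothetical helper: drop any whitespace around an apostrophe
    # that has a word character on each side
    return re.sub(r"(?<=\w)\s*'\s*(?=\w)", "'", text)

print(join_apostrophes("i can ' t use this cause they won ' t fit"))
print(join_apostrophes("i can 't use this cause they won' t fit"))
```

Both calls print i can't use this cause they won't fit, so the same substitution can be applied either to the raw text or to the detokenizer output.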
Related

Eliminating a white spaces from a string except for end of the string

I want to eliminate white spaces in a string except for end of the string
code:
sentence = ['He must be having a great time/n ', 'It is fun to play chess ', 'Sometimes TT is better than Badminton ']
pattern = "\s+^[\s+$]"
res = [re.sub(pattern,', ', line) for line in sentence]
print(res)
But...
output is same input list.
['He must be having a great time/n ', 'It is fun to play chess ', 'Sometimes TT is better than Badminton ']
Can anyone suggest the right solution.
expected output:
['He,must,be,having,a,great,time', 'It,is,fun,to,play,chess', 'Sometimes,TT,is,better,than,Badminton ']
We can first strip off leading/trailing whitespace, then do a basic replacement of space to comma:
import re
sentence = ['He must be having a great time\n ', 'It is fun to play chess ', 'Sometimes TT is better than Badminton ']
output = [re.sub(r'\s+', ',', x.strip()) for x in sentence]
print(output)
This prints:
['He,must,be,having,a,great,time',
'It,is,fun,to,play,chess',
'Sometimes,TT,is,better,than,Badminton']
You can use a simpler split/join method (timeit: 1.48 µs ± 74 ns).
str.split() will split on groups of whitespace characters (space or newline for instance).
str.join(iter) will join the elements of iter with the str it is used on.
Demo:
sentence = [
"He must be having a great time\n ",
"It is fun to play chess ",
"Sometimes TT is better than Badminton ",
]
[",".join(s.split()) for s in sentence]
gives
['He,must,be,having,a,great,time',
'It,is,fun,to,play,chess',
'Sometimes,TT,is,better,than,Badminton']
Second method, strip/replace (timeit: 1.56 µs ± 107 ns).
str.strip() removes all whitespace characters at the beginning and the end of str.
str.replace(old, new) replaces all occurrences of old in str with new (works because you have single spaces between words in your strings).
Demo:
sentence = [
"He must be having a great time\n ",
"It is fun to play chess ",
"Sometimes TT is better than Badminton ",
]
[s.strip().replace(" ", ",") for s in sentence]
gives
['He,must,be,having,a,great,time',
'It,is,fun,to,play,chess',
'Sometimes,TT,is,better,than,Badminton']
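To make the "single spaces" caveat concrete, here is a small sketch comparing the two methods on a string that contains a double space. The function names and the sample string are only for illustration.

```python
def commas_split_join(s):
    # split() with no argument treats any run of whitespace as one separator
    return ",".join(s.split())

def commas_strip_replace(s):
    # replace() turns every single space into a comma, one for one
    return s.strip().replace(" ", ",")

sample = "It is  fun to play chess "  # note the two spaces after "is"

print(commas_split_join(sample))     # It,is,fun,to,play,chess
print(commas_strip_replace(sample))  # It,is,,fun,to,play,chess
```

If the input may contain runs of whitespace, the split/join version is the safer choice of the two.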
def eliminating_white_spaces(lines):
    # Replace spaces with commas in place; the last element is left
    # untouched if it contains a space
    for i in range(len(lines)):
        if ' ' in lines[i] and i + 1 == len(lines):
            pass
        else:
            lines[i] = str(lines[i]).replace(' ', ',')
    return lines

How to make function in a class remove word multiple times per line

The code below is supposed to clean out the word frack, and potentially a list of bad words. For now, the issue is with the function clean_line: if a line contains frack more than once, only the first occurrence is replaced, and it does not match capitalized forms.
class Cleaner:
    def __init__(self, forbidden_word = "frack"):
        """ Set the forbidden word """
        self.word = forbidden_word
    def clean_line(self, line):
        """Clean up a single string, replacing the forbidden word by *beep!*"""
        found = line.find(self.word)
        if found != -1:
            return line[:found] + "*beep!*" + line[found+len(self.word):]
        return line
    def clean(self, text):
        for i in range(len(text)):
            text[i] = self.clean_line(text[i])
example_text = [
    "What the frack! I am not going",
    "to honour that question with a response.",
    "In fact, I think you should",
    "get the fracking frack out of here!",
    "Frack you!"
]
clean_text = Cleaner().clean(example_text)
for line in example_text: print(line)
Assuming that you just want to get rid of any word with frack in it, you could do something like the code below. If you need to also get rid of trailing whitespace, then you will need to change the regular expression a little bit. If you need to learn more about regular expressions, I would recommend checking out regexone.com.
# Using regular expressions makes string manipulation easier
import re
example_text = [
"What the frack! I am not going",
"to honour that question with a response.",
"In fact, I think you should",
"get the fracking frack out of here!",
"Frack you!"
]
# The pattern below gets rid of all words which start with 'frack'
filter = re.compile(r'frack\w*', re.IGNORECASE)
# We then apply this filter to each element in the example_text list
clean = [filter.sub("", e) for e in example_text]
print(clean)
Output
['What the ! I am not going',
'to honour that question with a response.',
'In fact, I think you should',
'get the   out of here!',
' you!']
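Use this simple code to clean up your line from a bad word:
line = "frack one Frack two"
bad_word = "frack"
# Lowercase first so that "Frack" matches too (this lowercases the whole line)
line = line.lower()
# Remove the bad word, then collapse the whitespace it leaves behind
clean_line = " ".join(line.replace(bad_word, "").split())
Resulting in clean_line being:
"one two"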

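For the class from the question, a minimal sketch that keeps the *beep!* replacement, handles every occurrence on a line, and ignores case could look like this. Storing a compiled pattern and returning a new list from clean are my own choices here, not something the question requires.

```python
import re

class Cleaner:
    def __init__(self, forbidden_word="frack"):
        # re.escape keeps the word literal even if it contains regex metacharacters
        self.pattern = re.compile(re.escape(forbidden_word), re.IGNORECASE)

    def clean_line(self, line):
        """Replace every occurrence of the forbidden word, in any letter case."""
        return self.pattern.sub("*beep!*", line)

    def clean(self, text):
        """Return a new list of cleaned lines instead of modifying text in place."""
        return [self.clean_line(line) for line in text]

example_text = [
    "What the frack! I am not going",
    "get the fracking frack out of here!",
    "Frack you!",
]
for line in Cleaner().clean(example_text):
    print(line)
```

Because clean now returns the cleaned lines, assigning clean_text = Cleaner().clean(example_text) gives a usable list rather than None.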
Append sections of string to list in Python

I have a particularly long, nasty string that looks something like this:
nastyString = ' nameOfString1, Inc_(stuff)\n nameOfString2, Inc_(stuff)\n '
and so on. The key defining feature is that each "nameOfString" is followed by a \n with two spaces after it. The first nameOfString has two spaces in front of it as well.
I'm trying to create a list that would look something like this:
niceList = [nameOfString1, Inc_(stuff), nameOfString2, Inc_(Stuff)] and so on.
I've tried to use newString = nastyString.split() as well as newString = nastyString.replace('\n ', ''), but ultimately, these solutions can't work because each nameOfString has a space after the comma and before the 'I' of Inc. Furthermore, not all the nameOfStrings have an 'Inc,' but most do have some sort of space in their name.
Would really appreciate some guidance or direction on how I could tackle this issue, thanks!
Maybe you can try something like this:
[word for word in nastyString.replace("\n", "").replace(",", "").strip().split(' ') if word !='']
Output:
['nameOfString1', 'Inc_(stuff)', 'nameOfString2', 'Inc_(stuff)']
nastyString = ' nameOfString1, Inc_(stuff)\n nameOfString2, Inc_(stuff)\n '
# replace '\n' with ','
nastyString = nastyString.replace('\n', ',')
# split at ',' and `strip()` all extra spaces
niceList = [v.strip() for v in nastyString.split(',') if v.strip()]
output:
niceList
['nameOfString1', 'Inc_(stuff)', 'nameOfString2', 'Inc_(stuff)']
Update: OP shared new input:
That's awesome, never knew about the strip function. However, I am actually trying to include the "Inc" section, so I was hoping for output of: ['nameOfString1, Inc_(stuff)', 'nameOfString2, Inc_(stuff)'] and so on, any advice?
nastyString = ' nameOfString1, Inc_(stuff)\n nameOfString2, Inc_(stuff)\n '
niceList = [v.strip() for v in nastyString.split('\n') if v.strip()]
new output:
niceList
['nameOfString1, Inc_(stuff)', 'nameOfString2, Inc_(stuff)']
You can use regular expressions:
import re
nastyString = ' nameOfString1, Inc_(stuff)\n nameOfString2, Inc_(stuff)\n '
# \s already covers '\n'; a raw string avoids the invalid-escape warning
new_string = [i for i in re.split(r"[\s,]", nastyString) if i]
Output:
['nameOfString1', 'Inc_(stuff)', 'nameOfString2', 'Inc_(stuff)']
If you would rather not replace '\n' explicitly, you can do this:
import re
nastyString = ' nameOfString1, Inc_(stuff)\n nameOfString2, Inc_(stuff)\n '
# '.' matches any character except a newline, so the '\n' characters are dropped
word = re.findall(r'.', nastyString)
s = ""
for i in word:
    s += i
print(s)
Output:
' nameOfString1, Inc_(stuff) nameOfString2, Inc_(stuff) '
Now you can use split():
print(s.split(','))

Regex pattern for matching entire word if it have a ; in the word in python

I am trying to remove some garbage from a text and would like to remove all words that have ";" in the middle of 2 characters. I have tried both expressions below
r'\s.*;.*\s' and r'\s.*\W.*\s'
in this text
'the cat as;asas was wjdwi;qs at home'
And it seems to miss some white spaces, returning
'cat as;asas was wjdwi;qs at '
When I needed
'the cat was at home'
Simple solution is to not use a regex:
s = 'the cat as;asas was wjdwi;qs at home'
res = ' '.join(w for w in s.split() if ';' not in w)
# the cat was at home
You might need a more complicated check, but split it into "words" first, then apply a check to each "word"...
You can use this:
re.sub(r'(?i)\s?[a-z]+;[a-z]+\s?', ' ', yourstr)
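The patterns in the question miss because .* is greedy and also matches spaces, so a single match can run across several words. A minimal sketch that sticks to non-whitespace characters with \S instead is shown below; the helper name remove_semicolon_words is made up for illustration, and the pattern treats any token containing ";" as garbage, not only letter-only tokens.

```python
import re

def remove_semicolon_words(s):
    # \S*;\S* matches one whitespace-delimited token containing ';'
    # the leading \s* also removes the whitespace in front of that token
    return re.sub(r'\s*\S*;\S*', '', s).strip()

print(remove_semicolon_words('the cat as;asas was wjdwi;qs at home'))
# the cat was at home
```

The final strip() only matters when the first token of the string is one of the removed words.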

Python: Formatting a list (which contains a list) for printing

I'm working on a project that translates input to Pig Latin (yeah, I'm sure you've never seen this one before...) and having trouble formatting my output.
(for the following, sentence = a list holding user input (phrase), split by phrase.split() )
sentence.remove(split)
final = map(str,sentence)
print "Final is (before formatting:", final
final = [sentence[0].capitalize()] , sentence[1:]
#finalFormat = ' '.join(final)
print "Final is", str(final).strip('[]')
#print "FinalFormat is", finalFormat
print "In Pig Latin, you said \"", ' '.join(map(str, final)), "\". Oink oink!"
What I get is:
"In Pig Latin, you said "['Firstword'] ['secondword', 'thirdword'] "
What I am looking for is:
"In Pig Latin, you said "Firstword secondword thirdword."
Based on my debug print statements it looks like my problem is still on the line (5 from the bottom):
final = [sentence[0].capitalize()] , sentence[1:]
Thanks in advance!
Change this line:
final = [sentence[0].capitalize()] , sentence[1:]
To this:
final = [sentence[0].capitalize()] + sentence[1:]
The comma builds a tuple of two lists, so map(str, final) turns each whole list into a string; with + you get a single flat list of words.
Note: using 'single"' quotes here will avoid "this\"" ugliness.
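Putting the fix and the quoting note together, a small Python 3 sketch might look like this. The sample sentence list is an assumption standing in for the already translated words.

```python
# Assumed sample data standing in for the translated words
sentence = ["firstword", "secondword", "thirdword"]

# Concatenate two lists so that final is one flat list of strings
final = [sentence[0].capitalize()] + sentence[1:]

# Single quotes around the template avoid escaping the inner double quotes
message = 'In Pig Latin, you said "{}". Oink oink!'.format(" ".join(final))
print(message)
# In Pig Latin, you said "Firstword secondword thirdword". Oink oink!
```

Using format() instead of comma-separated print arguments also avoids the extra spaces that print inserts around the quotation marks.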
