empty space after stopwords removal and lemmatisation

The text looks like this before processing:
0 [It's, good, for, beginners] positive
1 [I, recommend, this, starter, Ukulele, kit., I... positive
After preprocessing with stopword removal and lemmatisation:
nlp = spacy.load('en', disable=['ner', 'parser']) # disabling Named Entity Recognition for speed
def cleaning(doc):
    txt = [token.lemma_ for token in doc if not token.is_stop]
    if len(txt) > 2:
        return ' '.join(txt)
brief_cleaning = (re.sub("[^A-Za-z']+", ' ', str(row)).lower() for row in df3['reviewText'])
txt = [cleaning(doc) for doc in nlp.pipe(brief_cleaning, batch_size=5000, n_threads=-1)]
The result came out like this:
0 ' good ' ' ' ' beginner ' positive
1 ' ' ' recommend ' ' ' ' starter ' ' ukulele ... positive
As you can see, the result contains many ' ' entries. What caused this? I assume it is the return ' '.join(txt) and the re.sub("[^A-Za-z']+", ' ', ...) call, but if I remove the space or use return txt, no stopwords are removed and no lemmatisation is carried out.
Will these empty strings cause trouble, or are they necessary? I'm building bigrams and a word2vec model afterwards.
How can I fix it so that the result is returned as ' recommend ' ' starter ' ' ukulele ' ' kit ' ' need ' ' learn ' ' ukulele '?
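
One likely cause, assuming each row of df3['reviewText'] is already a list of tokens (as the sample rows suggest): str(row) produces the printed form of the list, quotes included, and the regex deliberately keeps apostrophes, so the quote characters survive the cleaning step and can later be split off as tokens of their own. The sketch below reproduces the effect without spaCy; the helper names clean_via_str and clean_via_join are made up for illustration.

```python
import re

PATTERN = "[^A-Za-z']+"  # same pattern as in the question

def clean_via_str(row):
    # str() on a list yields "['good', 'for', ...]"; the regex keeps the quotes
    return re.sub(PATTERN, ' ', str(row)).lower()

def clean_via_join(row):
    # Joining the tokens first gives plain text with no list punctuation
    return re.sub(PATTERN, ' ', ' '.join(row)).lower().strip()

row = ['good', 'for', 'beginners']
print(repr(clean_via_str(row)))   # " 'good' 'for' 'beginners' "
print(repr(clean_via_join(row)))  # 'good for beginners'
```

If the rows turn out to be plain strings rather than lists, this is not the cause and the stray quotes must come from somewhere else in the pipeline.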

Related

Regular expression for word with specific prefix/suffix

I want to match a word only if it is surrounded by at most one wildcard character on either side, followed by a space or nothing on either side. For example, I want ring to match 'ring', ' ring', ' tring', 'ring ', ' ringt', ' ringt ', ' ring ', 'tringt ', 'tringt '
but not:
'ttring', 'ringttt', 'ttringtt'
So far I have:
[?\s\S]ring[?\s\S][?!\s]
Any suggestions?

If I understand correctly, this should do:
(?:^|\s)\S?ring\S?(?:\s|$)
(?:^|\s) - this non-capturing group makes sure that the pattern is preceded by whitespace or sits at the beginning of the string
\S? matches zero or one non-whitespace character
ring matches the literal ring
(?:\s|$) - this non-capturing group makes sure the match is followed by whitespace or sits at the end of the string
Example:
In [92]: l = ['ring ', ' ringt', ' ringt ', ' ring ', \
              'tringt ', 'tringt ', 'ttring', 'ringttt', 'ttringtt']
In [93]: list(filter(lambda s: re.search(r'(?:^|\s)\S?ring\S?(?:\s|$)', s), l))
Out[93]: ['ring ', ' ringt', ' ringt ', ' ring ', 'tringt ', 'tringt ']
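
A possible variation, not part of the answer above: the groups (?:^|\s) and (?:\s|$) consume the surrounding whitespace, which can matter when you pull several matches out of one longer string, because two neighbouring words share a single space. Lookarounds check the same boundaries without consuming anything. A minimal sketch, with a made-up sample string:

```python
import re

# (?<!\S) : not preceded by a non-whitespace character (start or whitespace)
# (?!\S)  : not followed by a non-whitespace character (end or whitespace)
RING = re.compile(r'(?<!\S)\S?ring\S?(?!\S)')

text = 'ring bring rings strings ttring'
print(RING.findall(text))  # ['ring', 'bring', 'rings']

# The consuming version finds only the first of two adjacent words,
# because the shared space is already part of the first match.
print(re.findall(r'(?:^|\s)\S?ring\S?(?:\s|$)', 'ring bring'))  # ['ring ']
```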

Why don't these variables output their values properly?

I am currently working in Python 3.5 and I'm having an issue with the keys in my dictionary. I have the numbers 1-29 as keys, with letters as their values, and for some reason none of the double-digit numbers register as one number. For example, 11 comes out as 1 and 1 (F and F) instead of 11 (I), and 13 comes out as 1 and 3 (F and TH) instead of 13 (EO). Is there a way to fix this so that I can get the values of the double-digit numbers?
My code is here:
Dict = {'1':'F ', '2':'U ', '3':'TH ', '4':'O ', '5':'R ', '6':'CoK ', '7':'G ', '8':'W ', '9':'H ',
        '10':'N ', '11':'I ', '12':'J ', '13':'EO ', '14':'P ', '15':'X ', '16':'SoZ ', '17':'T ',
        '18':'B ', '19':'E ', '20':'M ', '21':'L ', '22':'NGING ',
        '23':'OE ' , '24':'D ', '25':'A ', '26':'AE ', '27':'Y ', '28':'IAoIO ', '29':'EA '}
textIn = ' '
#I'm also not sure why this doesn't work to quit out
while textIn != 'Q':
    textIn = input('Type in a sentence ("Q" to quit)\n>')
    textOut = ''
    for i in textIn:
        if i in Dict:
            textOut += Dict[i]
        else:
            print("Not here")
    print(textOut)

Your for i in textIn: loops over the individual characters of your input. So if you type 11, the input is actually the string '11', and for i in '11' visits each '1' separately:
>>> text = input()
13
>>> text
'13' # See, it's a string with the single-quote marks around it!
>>> for i in text:
... print(i)
...
1
3
>>> # As you see, it printed them separately.
You don't need the for loop at all; you can just use:
if textIn in Dict:
    textOut += Dict[textIn]
since your dict has the key '11' and your textIn is equal to '11'.
There is another major issue in your code: the textOut variable gets reset on every pass through the while loop, so you lose everything you've built up. Create it before the while loop instead:
textIn = ''    # must exist before the loop condition is first checked
textOut = ''
while textIn != 'Q':
    textIn = input('Type in a sentence ("Q" to quit)\n>')
    if textIn in Dict:
        textOut += Dict[textIn]
    else:
        print("Not here")
    print(textOut)
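
Looking up the whole input only works when a single number is typed per line. If a line may hold several numbers, one option is to split it on whitespace and look up each piece. This is a sketch under that assumption (numbers separated by spaces); the function name decode, the abbreviated table RUNES, and the '? ' placeholder for unknown tokens are all hypothetical.

```python
# Abbreviated copy of the question's table; the full Dict works the same way
RUNES = {'1': 'F ', '3': 'TH ', '11': 'I ', '13': 'EO '}

def decode(text, table):
    """Translate space-separated numbers via table; unknown tokens become '? '."""
    return ''.join(table.get(token, '? ') for token in text.split())

print(repr(decode('11 13 1', RUNES)))  # 'I EO F '
```

Because split() yields '11' as one token, the double-digit keys are matched as a whole instead of digit by digit.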

Going character by character in a string and swapping whitespaces with python

Okay, so I have to switch ' ' to *s. I came up with the following:
def characterSwitch(ch,ca1,ca2,start = 0, end = len(ch)):
    while start < end:
        if ch[start] == ca1:
            ch[end] == ca2
        start = start + 1
sentence = "Ceci est une toute petite phrase."
print characterSwitch(sentence, ' ', '*')
print characterSwitch(sentence, ' ', '*', 8, 12)
print characterSwitch(sentence, ' ', '*', 12)
print characterSwitch(sentence, ' ', '*', end = 12)
Assigning len(ch) as the default doesn't seem to work, and I'm also pretty sure this isn't the most efficient way of doing this. The following is the output I'm aiming for:
Ceci*est*une*toute*petite*phrase.
Ceci est*une*toute petite phrase.
Ceci est une*toute*petite*phrase.
Ceci*est*une*toute petite phrase.

Are you looking for replace() ?
sentence = "Ceci est une toute petite phrase."
sentence = sentence.replace(' ', '*')
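print sentence
# Ceci*est*une*toute*petite*phrase.
A demo is also available on ideone.com.
For your second requirement (replacing only from the 8th to the 12th character), slice out that part, replace within it, and glue the rest of the sentence back on. The slice end is exclusive, so 13 includes index 12:
sentence = "Ceci est une toute petite phrase."
sentence = sentence[:8] + sentence[8:13].replace(' ', '*') + sentence[13:]
# Ceci est*une*toute petite phrase.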
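
The slicing idea can be wrapped in a function with the signature from the question. Two assumptions in this sketch: end is treated as inclusive, because that is what the desired output implies (the space at index 12 is replaced in the 8, 12 call), and the default is None rather than len(ch), since default values are evaluated when the function is defined, at which point ch does not exist yet. The name character_switch is just a renamed stand-in for the asker's function.

```python
def character_switch(ch, ca1, ca2, start=0, end=None):
    """Replace ca1 with ca2 between indices start and end (both inclusive)."""
    if end is None:
        end = len(ch) - 1  # resolve the default at call time, when ch is known
    middle = ch[start:end + 1].replace(ca1, ca2)
    # Strings are immutable, so build and return a new one
    return ch[:start] + middle + ch[end + 1:]

sentence = "Ceci est une toute petite phrase."
print(character_switch(sentence, ' ', '*'))          # Ceci*est*une*toute*petite*phrase.
print(character_switch(sentence, ' ', '*', 8, 12))   # Ceci est*une*toute petite phrase.
print(character_switch(sentence, ' ', '*', 12))      # Ceci est une*toute*petite*phrase.
print(character_switch(sentence, ' ', '*', end=12))  # Ceci*est*une*toute petite phrase.
```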

Assuming you have to do it character by character, you could do it this way:
sentence = "this is a sentence."
replaced = ""
for c in sentence:
    if c == " ":
        replaced += "*"
    else:
        replaced += c
print replaced

How to reduce whitespace in Python? [duplicate]

This question already has answers here:
Is there a simple way to remove multiple spaces in a string?
(27 answers)
Closed 6 years ago.
How do I reduce whitespace in Python from
test = '   Good   ' to single whitespace test = ' Good '
I tried to define this function, but when I call test = reducing_white(test) it doesn't work at all. Does it have to do with the function's return or something?
counter = []
def reducing_white(txt):
    counter = txt.count(' ')
    while counter > 2:
        txt = txt.replace(' ','',1)
        counter = txt.count(' ')
    return txt

Here is how I solved it:
def reduce_ws(txt):
    ntxt = txt.strip()
    return ' '+ ntxt + ' '
j = '    Hello World    '
print(reduce_ws(j))
OUTPUT:
' Hello World '

You can use regular expressions:
import re
re.sub(r'\s+', ' ', test)
# returns ' Good '
test = '   Good    Sh ow   '
re.sub(r'\s+', ' ', test)
# returns ' Good Sh ow '
r'\s+' matches any run of one or more whitespace characters, and re.sub replaces the entire run with ' ', i.e. a single space.
This approach handles any run of whitespace (spaces, tabs, newlines), wherever it occurs in the string.
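
For comparison, here is a small sketch of the regex approach next to the str.split / str.join idiom. The difference is at the edges: re.sub leaves one space where leading or trailing whitespace was, while split() with no arguments discards it. The function names and the sample string are made up for illustration.

```python
import re

def collapse_keep_edges(text):
    # Each run of whitespace becomes one space, including at both ends
    return re.sub(r'\s+', ' ', text)

def collapse_and_trim(text):
    # split() with no arguments splits on runs of whitespace and drops
    # leading and trailing whitespace entirely
    return ' '.join(text.split())

sample = '   Good    Sh ow  '
print(repr(collapse_keep_edges(sample)))  # ' Good Sh ow '
print(repr(collapse_and_trim(sample)))    # 'Good Sh ow'
```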

concatenate print stmt inside and outside of for loop into a sentence in python

def PrintFruiteListSentence(list_of_fruits):
    print 'You would like to eat',
    for i, item in enumerate (list_of_fruits):
        if i != (len(list_of_fruits) - 1):
            print item, 'as fruit', i+2, 'and',
        else:
            print item, 'as fruit', i+2,
    print 'in your diet'
Output:
You would like to eat apple as fruit 1 and orange as fruit 2 and banana as fruit 3 and grape as fruit 4 in your diet.
How can I get this sentence into a variable, so that I can pass it as input to another function?

Just change your calls to print into concatenation onto an actual string, and return that string:
def PrintFruiteListSentence(list_of_fruits):
    sentence = 'You would like to eat '
    for i, item in enumerate (list_of_fruits):
        if i != (len(list_of_fruits) - 1):
            sentence += item + ' as fruit ' + str(i+1) + ' and '
        else:
            sentence += item + ' as fruit ' + str(i+1)
    sentence += ' in your diet'
    return sentence
You could also build the sentence with str.join and a generator expression instead of a for loop, though that is not necessary.
Also note that if you want the index to start at a specific number, you can pass a start value to enumerate:
>>> def PrintFruiteListSentence(list_of_fruits):
...     sentence = 'You would like to eat ' + ' and '.join(fruit + ' as fruit ' + str(index) for index,fruit in enumerate(list_of_fruits,1)) + ' in your diet'
...     print(sentence)
...
>>> PrintFruiteListSentence(['apple','orange','grapes'])
You would like to eat apple as fruit 1 and orange as fruit 2 and grapes as fruit 3 in your diet
EDIT: make sure to convert the number to a string with str() before concatenating, e.g. str(i+1).

The following code works:
def func(fruits):
    start = "You would like to eat "
    for i, item in enumerate(fruits):
        if i != (len(fruits) - 1):
            start += item + ' as fruit ' + str(i+1) + ' and ' # note your mistake: i+1, not i+2
        else:
            start += item + ' as fruit ' + str(i+1) # see the comment above
    start += ' in your diet'
    return start
print (func(["apple", "banana", "grapes"]))
You can also try running the above snippet on repl.it.

First, you need to make it a variable, like so:
def PrintFruiteListSentence(list_of_fruits):
    myStr = 'You would like to eat '  # no trailing comma, which would create a tuple
    for i, item in enumerate (list_of_fruits):
        if i != (len(list_of_fruits) - 1):
            myStr += item + ' as fruit ' + str(i+1)+ ' and '
        else:
            myStr += item + ' as fruit ' + str(i+1)
    myStr += ' in your diet'
    return myStr
def otherFunction(inputString):
    print(inputString)
otherFunction(PrintFruiteListSentence(['apple','banana']))#example
Also look at str.format(), which makes life much easier.
EDIT:
Here is an example to illustrate a simple usage of str.format(). It might not seem powerful in this case, but it can be very helpful for complicated string manipulation, or where you need a specific floating-point format.
def formatExample(list_of_fruits):
    myStr="you would like to eat "
    for i in enumerate(list_of_fruits,1):
        myStr += '{1:} as fruit {0:d}'.format(*i)+' and '
    return myStr[:-4]+"in your diet."
otherFunction(formatExample(['apple','banana']))#prints a similar sentence

Okay, this question has been answered already. But here is another take where the str.join method takes the center stage it deserves. No concatenating strings with +. No if / else statements. No unnecessary variables. Easy to read and understand what's happening:
def PrintFruiteListSentence(fruits):
    return ' '.join([
        'You would like to eat',
        ' and '.join(
            '{0} as fruit {1}'.format(f, c) for c, f in enumerate(fruits, 1)
        ),
        'in your diet'
    ])
