This question already has answers here:
Removing duplicates in lists
(56 answers)
Closed 1 year ago.
I have a string like:
'hi', 'what', 'are', 'are', 'what', 'hi'
I want to remove a specific repeated word. For example:
'hi', 'what', 'are', 'are', 'what'
Here, I am only removing the repeated occurrences of 'hi' and keeping the rest of the repeated words.
How to do this using regex?
Regex is used for text search. You have structured data, so this is unnecessary.
def remove_all_but_first(iterable, removeword='hi'):
    remove = False
    for word in iterable:
        if word == removeword:
            if remove:
                # the first occurrence was already yielded, skip this one
                continue
            else:
                remove = True
        yield word
Note that this will return an iterator, not a list. Cast the result to list if you need it to remain a list.
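For example, a quick usage sketch with the list from the question:
words = ['hi', 'what', 'are', 'are', 'what', 'hi']
result = list(remove_all_but_first(words, removeword='hi'))
print(result)  # ['hi', 'what', 'are', 'are', 'what']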
You can do this
import re
s= "['hi', 'what', 'are', 'are', 'what', 'hi']"
# convert string to list. Remove first and last char, remove ' and empty spaces
s=s[1:-1].replace("'",'').replace(' ','').split(',')
remove = 'hi'
# store the index of first occurance so that we can add it after removing all occurance
firstIndex = s.index(remove)
# regex to remove all occurances of a word
regex = re.compile(r'('+remove+')', flags=re.IGNORECASE)
op = regex.sub("", '|'.join(s)).split('|')
# clean up the list by removing empty items
while("" in op) :
op.remove("")
# re-insert the removed word in the same index as its first occurance
op.insert(firstIndex, remove)
print(str(op))
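This should print ['hi', 'what', 'are', 'are', 'what'], which matches the expected result in the question.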
You don't need regex for that. Convert the string to a list, then find the index of the first occurrence of the word and filter it out of a slice of the rest of the list:
lst = "['hi', 'what', 'are', 'are', 'what', 'hi']"
lst = ast.literal_eval(lst)
word = 'hi'
index = lst.index('hi') + 1
lst = lst[:index] + [x for x in lst[index:] if x != word]
print(lst) # ['hi', 'what', 'are', 'are', 'what']
I am trying to create a function which takes a string as input and returns a list containing the stem of each word in the string. The problem is that, because of the nested for loop, the words in the string are appended multiple times to the list. Is there a way to avoid this?
def stemmer(text):
    stemmed_string = []
    res = text.split()
    suffixes = ('ed', 'ly', 'ing')
    for word in res:
        for i in range(len(suffixes)):
            if word.endswith(suffixes[i]):
                stemmed_string.append(word[:-len(suffixes[i])])
            elif len(word) > 8:
                stemmed_string.append(word[:8])
            else:
                stemmed_string.append(word)
    return stemmed_string
If I call the function on this text ('I have a dog that is barking'), this is the output:
['I',
'I',
'I',
'have',
'have',
'have',
'a',
'a',
'a',
'dog',
'dog',
'dog',
'that',
'that',
'that',
'is',
'is',
'is',
'barking',
'barking',
'bark']
You are appending something in each round of the loop over suffixes. To avoid the problem, don't do that.
It's not clear if you want to add the shortest possible string out of a set of candidates, or how to handle stacked suffixes. Here's a version which always strips as much as possible.
def stemmer(text):
    stemmed_string = []
    suffixes = ('ed', 'ly', 'ing')
    for word in text.split():
        for suffix in suffixes:
            if word.endswith(suffix):
                word = word[:-len(suffix)]
        stemmed_string.append(word)
    return stemmed_string
Notice the more idiomatic syntax for looping directly over the suffixes, too, instead of indexing with range(len(...)).
This will reduce "sparingly" to "spar", etc.
Like every naïve stemmer, this will also do stupid things with words like "sly" and "thing".
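For example, running this version on the sentence from the question and on the words mentioned above should give something like:
print(stemmer('I have a dog that is barking'))
# ['I', 'have', 'a', 'dog', 'that', 'is', 'bark']
print(stemmer('sparingly sly thing'))
# ['spar', 's', 'th']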
Demo: https://ideone.com/a7FqBp
I was trying to create a program that removes all sorts of punctuation from a given input sentence. The code looked somewhat like this
from string import punctuation
sent = str(input())
def rempunc(string):
    for i in string:
        word = ''
        list = [0]
        if i in punctuation:
            x = string.index(i)
            word += string[list[-1]:x] + ' '
            list.append(x)
    list_2 = word.split(' ')
    return list_2
print(rempunc(sent))
However the output is coming out as follows:
This state ment has # 1 ! punc.
['This', 'state', 'ment', 'has', '#', '1', '!', 'punc', '']
Why isn't the punctuation being removed entirely? Am I missing something in the code?
I tried changing x to x-1 in line 7, but it did not help. Now I'm stuck and don't know what else to try.
Repeated string slicing isn't necessary here.
I would suggest using filter() to filter out the undesired characters for each word, and then reading that result into a list comprehension. From there, you can use a second filter() operation to remove the empty strings:
from string import punctuation

def remove_punctuation(s):
    cleaned_words = [''.join(filter(lambda x: x not in punctuation, word))
                     for word in s.split()]
    return list(filter(lambda x: x != "", cleaned_words))

print(remove_punctuation(input()))
This outputs:
['This', 'state', 'ment', 'has', '1', 'punc']
I have a big text file like this (without blank lines in between; every word is on its own line):
this
is
my
text
and
it
should
be
awesome
.
And I have also a list like this:
index_list = [[1,2,3,4,5],[6,7,8],[9,10]]
Now I want to replace every element of each list with the corresponding index line of my text file, so the expected answer would be:
new_list = [['this', 'is', 'my', 'text', 'and'], ['it', 'should', 'be'], ['awesome', '.']]
I tried a nasty workaround with two for loops and a range function that was way too complicated (or so I thought). Then I tried it with linecache.getline, but that also has some issues:
import linecache
new_list = []
for l in index_list:
    for j in l:
        new_list.append(linecache.getline('text_list', j))
This produces only one big list, which I don't want. Also, after every word I get a \n, which I don't get when I open the file with b = open('text_list', 'r').read().splitlines(), but I don't know how to use that in my replace (or rather, create) function so I don't end up with [['this\n', 'is\n', etc...
You are very close. Just use a temp list and then append that to the main list. You can also use str.strip to remove the newline character.
Ex:
import linecache
new_list = []
index_list = [[1, 2, 3, 4, 5], [6, 7, 8], [9, 10]]
for l in index_list:
    temp = []  # temp list for this group of line numbers
    for j in l:
        temp.append(linecache.getline('text_list', j).strip())
    new_list.append(temp)  # append to main list
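Assuming text_list is the ten-line file shown in the question, new_list should come out as:
[['this', 'is', 'my', 'text', 'and'], ['it', 'should', 'be'], ['awesome', '.']]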
You could use iter to do this, as long as your text_list has exactly as many elements as sum(map(len, index_list)):
text_list = ['this', 'is', 'my', 'text', 'and', 'it', 'should', 'be', 'awesome', '.']
index_list = [[1,2,3,4,5],[6,7,8],[9,10]]
text_list_iter = iter(text_list)
texts = [[next(text_list_iter) for _ in index] for index in index_list]
Output
[['this', 'is', 'my', 'text', 'and'], ['it', 'should', 'be'], ['awesome', '.']]
But I am not sure if this is what you wanted to do; maybe I am assuming some ordering of index_list. The other approach I can think of is this list comprehension:
texts_ = [[text_list[i-1] for i in l] for l in index_list]
Output
[['this', 'is', 'my', 'text', 'and'], ['it', 'should', 'be'], ['awesome', '.']]
I have difficulties splitting a string into specific parts in Python 3.
The string is basically a list with a colon (:) as a delimiter.
Only when the colon (:) is prefixed with a backslash (\) does it not count as a delimiter but as part of the list item.
Example:
String --> I:would:like:to:find\:out:how:this\:works
Converted List --> ['I', 'would', 'like', 'to', 'find\:out', 'how', 'this\:works']
Any idea how this could work?
@Bertrand I was trying to give you some code and was able to figure out a workaround, but this is probably not the most beautiful solution:
text = "I:would:like:to:find\:out:how:this\:works"
values = text.split(":")
new = []
concat = False
temp = None
for element in values:
# when one element ends with \\
if element.endswith("\\"):
temp = element
concat = True
# when the following element ends with \\
# concatenate both before appending them to new list
elif element.endswith("\\") and temp is not None:
temp = temp + ":" + element
concat = True
# when the following element does not end with \\
# append and set concat to False and temp to None
elif concat is True:
new.append(temp + ":" + element)
concat = False
temp = None
# Append element to new list
else:
new.append(element)
print(new)
Output:
['I', 'would', 'like', 'to', 'find\\:out', 'how', 'this\\:works']
You should use re.split and perform a negative lookbehind to check for the backslash character.
import re
pattern = r'(?<!\\):'
s = 'I:would:like:to:find\:out:how:this\:works'
print(re.split(pattern, s))
Output:
['I', 'would', 'like', 'to', 'find\\:out', 'how', 'this\\:works']
You can replace the "\:" with something else (just make sure it is something that doesn't occur anywhere else in the string; you can use a long placeholder or something), then split by ":" and replace it back.
[x.replace("$","\:") for x in str1.replace("\:","$").split(":")]
Explanation:
str1 = 'I:would:like:to:find\:out:how:this\:works'
Replace ":" with "$" (or something else):
str1.replace("\:","$")
Out: 'I:would:like:to:find$out:how:this$works'
Now split by ":"
str1.replace("\:","$").split(":")
Out: ['I', 'would', 'like', 'to', 'find$out', 'how', 'this$works']
and replace "$" with ":" for every element:
[x.replace("$","\:") for x in str1.replace("\:","$").split(":")]
Out: ['I', 'would', 'like', 'to', 'find\\:out', 'how', 'this\\:works']
Use re.split
Ex:
import re
s = "I:would:like:to:find\:out:how:this\:works"
print( re.split(r"(?<=\w):", s) )
Output:
['I', 'would', 'like', 'to', 'find\\:out', 'how', 'this\\:works']
This question already has answers here:
Split Strings into words with multiple word boundary delimiters
(31 answers)
Closed 4 years ago.
The Python code below reads 'resting-place' as one word.
The modified list shows up as: ['This', 'is', 'my', 'resting-place.']
I want it to show as: ['This', 'is', 'my', 'resting', 'place'],
thereby giving me a total of 5 words instead of 4 in the modified list.
original = 'This is my resting-place.'
modified = original.split()
print(modified)
numWords = 0
for word in modified:
    numWords += 1
print('Total words are:', numWords)
Output is:
Total words are: 4
I want the output to have 5 words.
To count the number of words in a sentence, treating - as separating two words, without splitting:
>>> original = 'This is my resting-place.'
>>> sum(map(original.strip().count, [' ','-'])) + 1
5
Here is the code:
s='This is my resting-place.'
len(s.split(" "))
4
You can use regex:
import re
original = 'This is my resting-place.'
print(re.split("\s+|-", original))
Output:
['This', 'is', 'my', 'resting', 'place.']
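If you just need the count, the length of that list should give you the 5 words you expect:
print(len(re.split(r"\s+|-", original)))  # 5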
I think you will find what you want in this article. It shows how to create a function to which you can pass multiple separators for splitting a string; in your case you'll be able to split on that extra character:
http://code.activestate.com/recipes/577616-split-strings-w-multiple-separators/
Here is an example of the final result:
>>> s = 'thing1,thing2/thing3-thing4'
>>> tsplit(s, (',', '/', '-'))
>>> ['thing1', 'thing2', 'thing3', 'thing4']
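The tsplit helper above comes from the linked recipe; I haven't reproduced the recipe's exact code here, but a minimal sketch of such a helper (built on re.split with the separators escaped) could look roughly like this:
import re

def tsplit(s, separators):
    # build a pattern that matches any of the given separators;
    # re.escape protects characters like '-' that are special in regex
    pattern = '|'.join(re.escape(sep) for sep in separators)
    return re.split(pattern, s)

print(tsplit('thing1,thing2/thing3-thing4', (',', '/', '-')))
# ['thing1', 'thing2', 'thing3', 'thing4']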