I'm trying to remove unwanted special symbols from the strings in a list by looping through each character of every word and keeping it only if .isalnum() returns True. I added a condition to make an exception for the apostrophe, for cases like "can't", "didn't", and "won't". But it also keeps the symbol in cases where I don't want it, like "'", "'cant", and "'hello'". Is there a way to keep the apostrophe only when it is in the middle of a word?
data_set = "Hello WOrld &()*hello world ////dog /// cat world hello can't "
split_it = data_set.lower().split()
new_word = ''
new_list = list()
for word in split_it:
    new_word = ''.join([x for x in word if x.isalnum() or x == "'"])
    new_list.append(new_word)
print(new_list)
['hello', 'world', 'hello', 'world', 'dog', '', 'cat', 'world', 'hello', "can't"]
If you know all of the characters you don't want, you could use .strip() to only remove them from the start and end:
>>> words = "Hello WOrld &()*hello world ////dog /// cat world hello can't ".lower().split()
>>> cleaned_words = [word.strip("&()*/") for word in words]
>>> print(cleaned_words)
['hello', 'world', 'hello', 'world', 'dog', '', 'cat', 'world', 'hello', "can't"]
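If you'd rather not enumerate the characters by hand, string.punctuation covers the ASCII ones (a variant sketch; it won't catch every possible symbol):
>>> import string
>>> [word.strip(string.punctuation) for word in words]
['hello', 'world', 'hello', 'world', 'dog', '', 'cat', 'world', 'hello', "can't"]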
Otherwise, you'll probably want a regexp that matches any character except those whitelisted, anchored to the start or end of the string, and then use re.sub() to remove them:
>>> import re
>>> nonalnum_at_edge_re = re.compile(r'^[^a-z0-9]+|[^a-z0-9]+$', re.I)
>>> cleaned_words = [re.sub(nonalnum_at_edge_re, '', word) for word in words]
>>> cleaned_words
['hello', 'world', 'hello', 'world', 'dog', '', 'cat', 'world', 'hello', "can't"]
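Since the pattern is compiled, you can also call its .sub() method directly; note that it strips edge apostrophes while leaving interior ones alone:
>>> nonalnum_at_edge_re.sub('', "'hello'")
'hello'
>>> nonalnum_at_edge_re.sub('', "can't")
"can't"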
You could use a regular expression that matches any character that isn't either a lowercase letter or number, and either doesn't have such a character before it (start of word) or after it (end of word):
import re
phrase = "Hello WOrld &()*hello world ////dog /// cat world hello can't "
regex = re.compile(r'(?<![a-z0-9])([^a-z0-9])|([^a-z0-9])(?![a-z0-9])')
print([re.sub(regex, '', word) for word in phrase.lower().split()])
Output:
['hello', 'world', 'hello', 'world', 'dog', '', 'cat', 'world', 'hello', "can't"]
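To see why only interior apostrophes survive: in "can't" the apostrophe has an alphanumeric on both sides, so neither alternative matches, while each quote in "'hello'" fails one of the checks:
print(regex.sub('', "can't"))    # can't
print(regex.sub('', "'hello'"))  # hello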
I want an efficient way to split a list of strings using a list of words as the delimiters. The output is another list of strings.
I tried chaining multiple .split calls in a single line, which does not work because the first .split returns a list and the subsequent .split calls require a string.
Here is the input:
words = ["hello my name is jolloopp", "my jolloopp name is hello"]
splitters = ['my', 'is']
I want the output to be
final_list = ["hello ", " name ", " jolloopp", " jolloopp name ", " hello"]
Note the spaces.
It is also possible to have something like
draft_list = [["hello ", " name ", " jolloopp"], [" jolloopp name ", " hello"]]
which can be flattened using something like numpy reshape(-1,1) to get final_list, but the ideal case is
ideal_list = ["hello", "name", "jolloopp", "jolloopp name", "hello"]
where the spaces have been stripped, which is similar to using .strip().
EDIT 1:
Using re.split doesn't fully work if the word delimiters are part of other words.
words = ["hellois my name is myjolloopp", "my isjolloopp name is myhello"]
splitters = ['my', 'is']
then the output would be
['hello', '', 'name', '', 'jolloopp', '', 'jolloopp name', '', 'hello']
when it should be
['hellois', 'name', 'myjolloopp', 'isjolloopp name', 'myhello']
This is a known issue with solutions using re.split.
EDIT 2:
[x.strip() for x in re.split(' | '.join(splitters), ''.join(words))]
does not work properly when the input is
words = ["hello world", "hello my name is jolloopp", "my jolloopp name is hello"]
The output becomes
['hello worldhello', 'name', 'jolloopp', 'jolloopp name', 'hello']
when the output should be
['hello world', 'hello', 'name', 'jolloopp', 'jolloopp name', 'hello']
You could use re like this.
Updated using the better way suggested by @pault: word boundaries \b instead of spaces.
>>> import re
>>> splitters = ['my', 'is']
>>> words = ['hello world', 'hello my name is jolloopp', 'my jolloopp name is hello']
# Iterate over the list of words and use re to split each string,
>>> [z for y in (re.split('|'.join(r'\b{}\b'.format(x) for x in splitters), word) for word in words) for z in y]
['hello world', 'hello ', ' name ', ' jolloopp', '', ' jolloopp name ', ' hello']
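To get the stripped ideal_list (as in EDIT 2's expected output), drop the empties and strip the whitespace; a small follow-up, naming the intermediate list result for clarity:
>>> result = [z for y in (re.split('|'.join(r'\b{}\b'.format(x) for x in splitters), word) for word in words) for z in y]
>>> [w.strip() for w in result if w.strip()]
['hello world', 'hello', 'name', 'jolloopp', 'jolloopp name', 'hello']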
I want to split a sentence on multiple delimiters:
.?!\n
However, I want to keep a comma attached to the word before it.
For example for the string
'Hi, How are you?'
I want the result
['Hi,', 'How', 'are', 'you', '?']
I tried the following, but it does not give the required result:
words = re.findall(r"\w+|\W+", text)
Use re.split with a capturing group to keep your delimiters, then filter out the strings that contain only whitespace:
>>> import re
>>> s = 'Hi, How are you?'
>>> [x for x in re.split(r'(\s|!|\.|\?|\n)', s) if x.strip()]
['Hi,', 'How', 'are', 'you', '?']
If using re.findall:
>>> ss = """
... Hi, How are
...
... yo.u
... do!ing?
... """
>>> [w for w in re.findall(r'(\w+,?|[.?!]?)?\s*', ss) if w]
['Hi,', 'How', 'are', 'yo', '.', 'u', 'do', '!', 'ing', '?']
You can use:
re.findall(r'(.*?)([\s\.\?!\n])', text)
With a bit of itertools magic and list comprehensions:
[i.strip() for i in itertools.chain.from_iterable(re.findall(r'(.*?)([\s\.\?!\n])', text)) if i.strip()]
And a bit more comprehensible version:
import re
import itertools

words = []
found = itertools.chain.from_iterable(re.findall(r'(.*?)([\s\.\?!\n])', text))
for i in found:
    w = i.strip()
    if w:
        words.append(w)
I have been playing around with this code, which is meant to read a string of text that has no spaces. The code needs to separate the string into words by identifying the capital letters using regular expressions. However, I can't seem to get it to keep the capital letters.
import re
mystring = 'ThisIsStringWithoutSpacesWordsTextManDogCow!'
wordList = re.sub("[^\^a-z]"," ",mystring)
print (wordList)
Try:
re.sub("([A-Z])"," \\1",mystring).split()
This prepends a space in front of every capital letter and splits on these spaces.
Output:
['This',
'Is',
'String',
'Without',
'Spaces',
'Words',
'Text',
'Man',
'Dog',
'Cow!']
As an alternative to sub, you could use re.findall to find all the words (beginning with an uppercase letter followed by zero or more non-uppercase characters) and then join them back together:
>>> ' '.join(re.findall(r'[A-Z][^A-Z]*', mystring))
'This Is String Without Spaces Words Text Man Dog Cow!'
Try
>>> re.split('([A-Z][a-z]*)', mystring)
['', 'This', '', 'Is', '', 'String', '', 'Without', '', 'Spaces', '', 'Words', '', 'Text', '', 'Man', '', 'Dog', '', 'Cow', '!']
This gives you word per word output. Even the ! is separated out.
If you don't want the extra '' entries, you can remove them with filter(lambda x: x != '', a), where a is the output of the command above (on Python 3, wrap it in list()):
>>> list(filter(lambda x: x != '', a))
['This', 'Is', 'String', 'Without', 'Spaces', 'Words', 'Text', 'Man', 'Dog', 'Cow', '!']
Not a regular expression solution, but you can do it in normal code as well :-)
mystring = 'ThisIsStringWithoutSpacesWordsTextManDogCow!'
output_list = []
index = 0
for i, letter in enumerate(mystring):
    # each capital letter after the first marks the start of a new word
    if i != 0 and letter.isupper():
        output_list.append(mystring[index:i])
        index = i
output_list.append(mystring[index:])  # don't forget the final word
Now, on topic, this could be something like what you are looking for:
mystring = re.sub(r"([a-z\d])([A-Z])", r'\1 \2', mystring)
# Makes the string space-separated. You can use split to convert it to a list.
mystring = mystring.split()
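For the sample string above, that gives:
words = re.sub(r"([a-z\d])([A-Z])", r'\1 \2', 'ThisIsStringWithoutSpacesWordsTextManDogCow!').split()
# ['This', 'Is', 'String', 'Without', 'Spaces', 'Words', 'Text', 'Man', 'Dog', 'Cow!']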
I have to match all the alphanumeric words from a text.
>>> import re
>>> text = "hello world!! how are you?"
>>> final_list = re.findall(r"[a-zA-Z0-9]+", text)
>>> final_list
['hello', 'world', 'how', 'are', 'you']
>>>
This is fine, but I also have a few words to negate, i.e. words that shouldn't be in my final list.
>>> negate_words = ['world', 'other', 'words']
A bad way to do it
>>> negate_str = '|'.join(negate_words)
>>> filter(lambda x: not re.match(negate_str, x), final_list)
['hello', 'how', 'are', 'you']
But I could save a loop if my very first regex pattern could be changed to exclude those words. I found how to negate characters, but I have whole words to negate; I also found regex lookbehind in other questions, but that doesn't help either.
Can it be done using python re?
Update
My text can span a few hundred lines, and the list of negate_words can be lengthy too.
Considering this, is regex even the right tool for the task in the first place? Any suggestions?
I don't think there is a clean way to do this using regular expressions. The closest I could find was a bit ugly and not exactly what you wanted:
>>> re.findall(r"\b(?:world|other|words)|([a-zA-Z0-9]+)\b", text)
['hello', '', 'how', 'are', 'you']
Why not use Python's sets instead. They are very fast:
>>> list(set(final_list) - set(negate_words))
['hello', 'how', 'are', 'you']
If order is important, see the reply from @glglgl below. His list comprehension version is very readable. Here's a fast but less readable equivalent using itertools:
>>> import itertools
>>> negate_words_set = set(negate_words)
>>> list(itertools.filterfalse(negate_words_set.__contains__, final_list))
['hello', 'how', 'are', 'you']
Another alternative is to build up the word list in a single pass using re.finditer:
>>> negate_words_set = set(negate_words)
>>> result = []
>>> for mo in re.finditer(r"[a-zA-Z0-9]+", text):
...     word = mo.group()
...     if word not in negate_words_set:
...         result.append(word)
...
>>> result
['hello', 'how', 'are', 'you']
Maybe it's worth trying pyparsing for this:
>>> from pyparsing import *
>>> negate_words = ['world', 'other', 'words']
>>> parser = OneOrMore(Suppress(oneOf(negate_words)) ^ Word(alphanums)).ignore(CharsNotIn(alphanums))
>>> parser.parseString('hello world!! how are you?').asList()
['hello', 'how', 'are', 'you']
Note that oneOf(negate_words) must be before Word(alphanums) to make sure that it matches earlier.
Edit: Just for the fun of it, I repeated the exercise using lepl (also an interesting parsing library)
>>> from lepl import *
>>> negate_words = ['world', 'other', 'words']
>>> parser = OneOrMore(~Or(*negate_words) | Word(Letter() | Digit()) | ~Any())
>>> parser.parse('hello world!! how are you?')
['hello', 'how', 'are', 'you']
Don't ask too much of regex.
Instead, think of generators.
import re
unwanted = ('world', 'other', 'words')
text = "hello world!! how are you?"
gen = (m.group() for m in re.finditer(r"[a-zA-Z0-9]+", text))
li = [w for w in gen if w not in unwanted]
A generator could also be built here instead of the list li.
I'm trying to convert a string to a list of words using python. I want to take something like the following:
string = 'This is a string, with words!'
Then convert to something like this :
list = ['This', 'is', 'a', 'string', 'with', 'words']
Notice the omission of punctuation and spaces. What would be the fastest way of going about this?
Despite the late response, I think this is the simplest way for anyone else stumbling on this post:
>>> string = 'This is a string, with words!'
>>> string.split()
['This', 'is', 'a', 'string,', 'with', 'words!']
Try this:
import re
mystr = 'This is a string, with words!'
wordList = re.sub(r"[^\w]", " ", mystr).split()
How it works:
From the docs:
re.sub(pattern, repl, string, count=0, flags=0)
Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl. If the pattern isn’t found, string is returned unchanged. repl can be a string or a function.
So in our case:
the pattern is any non-alphanumeric character.
\w means any alphanumeric character and is equal to the character set [a-zA-Z0-9_], i.e. a to z, A to Z, 0 to 9, and underscore.
So we match any non-alphanumeric character, replace it with a space, and then call split(), which splits the string by spaces and converts it to a list.
So 'hello-world' becomes 'hello world' with re.sub, and then ['hello', 'world'] after split().
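A quick check of that example:
import re
print(re.sub(r"[^\w]", " ", "hello-world"))          # hello world
print(re.sub(r"[^\w]", " ", "hello-world").split())  # ['hello', 'world']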
Let me know if any doubts come up.
To do this properly is quite complex. For your research, it is known as word tokenization. You should look at NLTK if you want to see what others have done, rather than starting from scratch:
>>> import nltk
>>> paragraph = u"Hi, this is my first sentence. And this is my second."
>>> sentences = nltk.sent_tokenize(paragraph)
>>> for sentence in sentences:
... nltk.word_tokenize(sentence)
[u'Hi', u',', u'this', u'is', u'my', u'first', u'sentence', u'.']
[u'And', u'this', u'is', u'my', u'second', u'.']
The simplest way:
>>> import re
>>> string = 'This is a string, with words!'
>>> re.findall(r'\w+', string)
['This', 'is', 'a', 'string', 'with', 'words']
Using string.punctuation for completeness:
import re
import string

s = 'This is a string, with words!'
# re.escape keeps characters like ']' and '\' in string.punctuation from breaking the character class
x = re.sub('[' + re.escape(string.punctuation) + ']', '', s).split()
This handles newlines as well.
Well, you could use
import re
list = re.sub(r'[.!,;?]', ' ', string).split()
Note that both string and list are names of builtin types, so you probably don't want to use those as your variable names.
Inspired by @mtrw's answer, but improved to strip out punctuation at word boundaries only:
import re
import string
def extract_words(s):
    return [re.sub('^[{0}]+|[{0}]+$'.format(string.punctuation), '', w) for w in s.split()]
>>> str = 'This is a string, with words!'
>>> extract_words(str)
['This', 'is', 'a', 'string', 'with', 'words']
>>> str = '''I'm a custom-built sentence with "tricky" words like https://stackoverflow.com/.'''
>>> extract_words(str)
["I'm", 'a', 'custom-built', 'sentence', 'with', 'tricky', 'words', 'like', 'https://stackoverflow.com']
Personally, I think this is slightly cleaner than the answers provided
import re

def split_to_words(sentence):
    # use sentence.lower() here if needed
    return list(filter(lambda w: len(w) > 0, re.split(r'\W+', sentence)))
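For example:
print(split_to_words('This is a string, with words!'))
# ['This', 'is', 'a', 'string', 'with', 'words']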
A regular expression for words would give you the most control. You would want to carefully consider how to deal with words with dashes or apostrophes, like "I'm".
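A minimal sketch of such a pattern (the word_re name and the tolerated inner characters are illustrative choices, not the only option); it keeps apostrophes and hyphens only when they sit between alphanumerics:
import re

# words made of alphanumerics, optionally joined by an interior ' or -
word_re = re.compile(r"[A-Za-z0-9]+(?:['-][A-Za-z0-9]+)*")

print(word_re.findall('''I'm a custom-built sentence, with "tricky" words!'''))
# ["I'm", 'a', 'custom-built', 'sentence', 'with', 'tricky', 'words']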
list = mystr.split(" ", mystr.count(" "))
This way you eliminate every special char outside of the alphabet:
def wordsToList(strn):
    L = strn.split()
    cleanL = []
    abc = 'abcdefghijklmnopqrstuvwxyz'
    ABC = abc.upper()
    letters = abc + ABC
    for e in L:
        word = ''
        for c in e:
            if c in letters:
                word += c
        if word != '':
            cleanL.append(word)
    return cleanL

s = 'She loves you, yea yea yea! '
L = wordsToList(s)
print(L)  # ['She', 'loves', 'you', 'yea', 'yea', 'yea']
I'm not sure if this is fast or optimal or even the right way to program.
def split_string(string):
    return string.split()
This function will return the list of words of a given string.
In this case, if we call the function as follows,
string = 'This is a string, with words!'
split_string(string)
The return output of the function would be
['This', 'is', 'a', 'string,', 'with', 'words!']
This is from my attempt at a coding challenge that couldn't use regex:
outputList = "".join((c if c.isalnum() or c == "'" else ' ') for c in inputStr).split(' ')
The role of the apostrophe is the interesting part.
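One caveat, shown with a made-up input string for illustration: splitting on a literal space leaves empty strings behind wherever several symbols were replaced in a row, while a bare .split() avoids that:
s = "hello &*world"
print("".join((c if c.isalnum() or c == "'" else ' ') for c in s).split(' '))  # ['hello', '', '', 'world']
print("".join((c if c.isalnum() or c == "'" else ' ') for c in s).split())    # ['hello', 'world']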
Probably not very elegant, but at least you know what's going on.
my_str = "Simple sample, test! is, olny".lower()
my_lst =[]
temp=""
len_my_str = len(my_str)
number_letter_in_data=0
list_words_number=0
for number_letter_in_data in range(0, len_my_str, 1):
if my_str[number_letter_in_data] in [',', '.', '!', '(', ')', ':', ';', '-']:
pass
else:
if my_str[number_letter_in_data] in [' ']:
#if you want longer than 3 char words
if len(temp)>3:
list_words_number +=1
my_lst.append(temp)
temp=""
else:
pass
else:
temp = temp+my_str[number_letter_in_data]
my_lst.append(temp)
print(my_lst)
You can try and do this:
tryTrans = str.maketrans(",!", "  ")  # both arguments must be the same length
mystr = "This is a string, with words!"
mystr = mystr.translate(tryTrans)
listOfWords = mystr.split()
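If matching the argument lengths feels fragile, str.maketrans also accepts a dict; a small variant sketch:
tryTrans = str.maketrans({',': ' ', '!': ' '})
listOfWords = "This is a string, with words!".translate(tryTrans).split()
# ['This', 'is', 'a', 'string', 'with', 'words']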