I want to remove all the words that contain a specific substring.
Sentence = 'walking my dog https://github.com/'
substring = 'http'
# Remove all words that start with the substring
#...
result = 'walking my dog'
This respects the original spacing in the string without having to fiddle around too much.
import re
string = "a suspect http://string.com with spaces before and after"
starts = "http"
re.sub(f"\\b{starts}[^ ]*[ ]+", "", string)
'a suspect with spaces before and after'
There is a simple approach that we can use for this.
Split the sentence into words
Find all the works that
Check if that word contains the substring and remove it
Join back the remaining words.
>>> sentence = 'walking my dog https://github.com/'
>>> substring = 'http'
>>> f = lambda v, w: ' '.join(filter(lambda x: w not in x, v.split(' ')))
>>> f(sentence, substring)
'walking my dog'
Explanation:
1. ' '.join(
2. filter(
3. lambda x: w not in x,
4. v.split(' ')
6. )
7. )
1 stars with a join. 2 is for filtering all the elements from 4, which splits the string into words. The condition to filter is substring not in word. The not in does a O(len(substring) * len(word)) complexity comparison.
Note: The only step that can be sped up is line 3. The fact that you are comparing words to a constant string, you can use Rabin-Karp String Matching to find the string in O(len(word)) or Z-Function to find the string in O(len(word) + len(substring))
Related
I have a python challenge that if given a string with '_' or '-' in between each word such as the_big_red_apple or the-big-red-apple to convert it to camel case. Also if the first word is uppercase keep it as uppercase. This is my code. Im not allowed to use the re library in the challenge however but I didn't know how else to do it.
from re import sub
def to_camel_case(text):
if text[0].isupper():
text = sub(r"(_|-)+"," ", text).title().replace(" ", "")
else:
text = sub(r"(_|-)+"," ", text).title().replace(" ", "")
text = text[0].lower() + text[1:]
return print(text)
Word delimiters can be - dash or _ underscore.
Let's simplify, making them all underscores:
text = text.replace('-', '_')
Now we can break out words:
words = text.split('_')
With that in hand it's simple to put them back together:
text = ''.join(map(str.capitalize, words))
or more verbosely, with a generator expression,
assign ''.join(word.capitalize() for word in words).
I leave "finesse the 1st character"
as an exercise to the reader.
If you RTFM you'll find it contains a wealth of knowledge.
https://docs.python.org/3/library/re.html#raw-string-notation
'+'
Causes the resulting RE to match 1 or more repetitions of the preceding RE. ab+ will match ‘a’ followed by any non-zero number of ‘b’s
The effect of + is turn both
db_rows_read and
db__rows_read
into DbRowsRead.
Also,
Raw string notation (r"text") keeps regular expressions sane.
The regex in your question doesn't exactly
need a raw string, as it has no crazy
punctuation like \ backwhacks.
But it's a very good habit to always put
a regex in an r-string, Just In Case.
You never know when code maintenance
will tack on additional elements,
and who wants a subtle regex bug on their hands?
You can try it like this :
def to_camel_case(text):
s = text.replace("-", " ").replace("_", " ")
s = s.split()
if len(text) == 0:
return text
return s[0] + ''.join(i.capitalize() for i in s[1:])
print(to_camel_case('momo_es-es'))
the output of print(to_camel_case('momo_es-es')) is momoEsEs
r"..." refers to Raw String in Python which simply means treating backlash \ as literal instead of escape character.
And (_|-)[+] is a Regular Expression that match the string containing one or more - or _ characters.
(_|-) means matching the string that contains - or _.
+ means matching the above character (- or _) than occur one or more times in the string.
In case you cannot use re library for this solution:
def to_camel_case(text):
# Since delimiters can have 2 possible answers, let's simplify it to one.
# In this case, I replace all `_` characters with `-`, to make sure we have only one delimiter.
text = text.replace("_", "-") # the_big-red_apple => the-big-red-apple
# Next, we should split our text into words in order for us to iterate through and modify it later.
words = text.split("-") # the-big-red-apple => ["the", "big", "red", "apple"]
# Now, for each word (except the first word) we have to turn its first character to uppercase.
for i in range(1, len(words)):
# `i`start from 1, which means the first word IS NOT INCLUDED in this loop.
word = words[i]
# word[1:] means the rest of the characters except the first one
# (e.g. w = "apple" => w[1:] = "pple")
words[i] = word[0].upper() + word[1:].lower()
# you can also use Python built-in method for this:
# words[i] = word.capitalize()
# After this loop, ["the", "big", "red", "apple"] => ["the", "Big", "Red", "Apple"]
# Finally, we put the words back together and return it
# ["the", "Big", "Red", "Apple"] => theBigRedApple
return "".join(words)
print(to_camel_case("the_big-red_apple"))
Try this:
First, replace all the delimiters into a single one, i.e. str.replace('_', '-')
Split the string on the str.split('-') standardized delimiter
Capitalize each string in list, i.e. str.capitilize()
Join the capitalize string with str.join
>>> s = "the_big_red_apple"
>>> s.replace('_', '-').split('-')
['the', 'big', 'red', 'apple']
>>> ''.join(map(str.capitalize, s.replace('_', '-').split('-')))
'TheBigRedApple'
>> ''.join(word.capitalize() for word in s.replace('_', '-').split('-'))
'TheBigRedApple'
If you need to lowercase the first char, then:
>>> camel_mile = lambda x: x[0].lower() + x[1:]
>>> s = 'TheBigRedApple'
>>> camel_mile(s)
'theBigRedApple'
Alternative,
First replace all delimiters to space str.replace('_', ' ')
Titlecase the string str.title()
Remove space from string, i.e. str.replace(' ', '')
>>> s = "the_big_red_apple"
>>> s.replace('_', ' ').title().replace(' ', '')
'TheBigRedApple'
Another alternative,
Iterate through the characters and then keep a pointer/note on previous character, i.e. for prev, curr in zip(s, s[1:])
check if the previous character is one of your delimiter, if so, uppercase the current character, i.e. curr.upper() if prev in ['-', '_'] else curr
skip whitepace characters, i.e. if curr != " "
Then add the first character in lowercase, [s[0].lower()]
>>> chars = [s[0].lower()] + [curr.upper() if prev in ['-', '_'] else curr for prev, curr in zip(s, s[1:]) if curr != " "]
>>> "".join(chars)
'theBigRedApple'
Yet another alternative,
Replace/Normalize all delimiters into a single one, s.replace('-', '_')
Convert it into a list of chars, list(s.replace('-', '_'))
While there is still '_' in the list of chars, keep
find the position of the next '_'
replacing the character after '_' with its uppercase
replacing the '_' with ''
>>> s = 'the_big_red_apple'
>>> s_list = list(s.replace('-', '_'))
>>> while '_' in s_list:
... where_underscore = s_list.index('_')
... s_list[where_underscore+1] = s_list[where_underscore+1].upper()
... s_list[where_underscore] = ""
...
>>> "".join(s_list)
'theBigRedApple'
or
>>> s = 'the_big_red_apple'
>>> s_list = list(s.replace('-', '_'))
>>> while '_' in s_list:
... where_underscore = s_list.index('_')
... s_list[where_underscore:where_underscore+2] = ["", s_list[where_underscore+1].upper()]
...
>>> "".join(s_list)
'theBigRedApple'
Note: Why do we need to convert the string to list of chars? Cos strings are immutable, 'str' object does not support item assignment
BTW, the regex solution can make use of some group catching, e.g.
>>> import re
>>> s = "the_big_red_apple"
>>> upper_regex_group = lambda x: x.group(1).upper()
>>> re.sub("[_|-](\w)", upper_regex_group, s)
'theBigRedApple'
>>> re.sub("[_|-](\w)", lambda x: x.group(1).upper(), s)
'theBigRedApple'
say I have a certain string and a list of strings.
I would like to append to a new list all the words from the list (of strings)
that are exactly like the pattern
for example:
list of strings = ['string1','string2'...]
pattern =__letter__letter_ ('_c__ye_' for instance)
I need to add all strings that are made up of the same letters in the same places as the pattern, and has the same length.
so for instance:
new_list = ['aczxyep','zcisyef'...]
I have tried this:
def pattern_word_equality(words,pattern):
list1 = []
for word in words:
for letter in word:
if letter in pattern:
list1.append(word)
return list1
help will be much appreciated :)
If your pattern is as simple as _c__ye_, then you can look for the characters in the specific positions:
words = ['aczxyep', 'cxxye', 'zcisyef', 'abcdefg']
result1 = list(filter(lambda w: w[1] == 'c' and w[4:6] == 'ye', words))
If your pattern is getting more complex, then you can start using regular expressions:
pat = re.compile("^.c..ye.$")
result2 = list(filter(lambda w: pat.match(w), words))
Output:
print(result1) # ['aczxyep', 'zcisyef']
print(result2) # ['aczxyep', 'zcisyef']
This works:
words = ['aczxyep', 'cxxye', 'zcisyef', 'abcdefg']
pattern = []
for i in range(len(words)):
if (words[i])[1].lower() == 'c' and (words[i])[4:6].lower() == 'ye':
pattern.append(words[i])
print(pattern)
You start by defining the words and pattern lists. Then you loop around for the amount of items in words by using len(words). You then find whether the i item number is follows the pattern by seeing if the second letter is c and the 5th and 6th letters are y and e. If this is true then it appends that word onto pattern and it prints them all out at the end.
I tried matching words including the letter "ab" or "ba" e.g. "ab"olition, f"ab"rics, pro"ba"ble. I came up with the following regular expression:
r"[Aa](?=[Bb])[Bb]|[Bb](?=[Aa])[Aa]"
But it includes words that start or end with ", (, ), / ....non-alphanumeric characters. How can I erase it? I just want to match words list.
import sys
import re
word=[]
dict={}
f = open('C:/Python27/brown_half.txt', 'rU')
w = open('C:/Python27/brown_halfout.txt', 'w')
data = f.read()
word = data.split() # word is list
f.close()
for num2 in word:
match2 = re.findall("\w*(ab|ba)\w*", num2)
if match2:
dict[num2] = (dict[num2] + 1) if num2 in dict.keys() else 1
for key2 in sorted(dict.iterkeys()):print "%s: %s" % (key2, dict[key2])
print len(dict.keys())
Here, I don't know how to mix it up with "re.compile~~" method that 1st comment said...
To match all the words with ab or ba (case insensitive):
import re
text = 'fabh, obar! (Abtt) yybA, kk'
pattern = re.compile(r"(\w*(ab|ba)\w*)", re.IGNORECASE)
# to print all the matches
for match in pattern.finditer(text):
print match.group(0)
# to print the first match
print pattern.search(text).group(0)
https://regex101.com/r/uH3xM9/1
Regular expressions are not the best tool for the job in this case. They'll complicate stuff way too much for such simple circumstances. You can instead use Python's builtin in operator (works for both Python 2 and 3)...
sentence = "There are no probable situations whereby that may happen, or so it seems since the Abolition."
words = [''.join(filter(lambda x: x.isalpha(), token)) for token in sentence.split()]
for word in words:
word = word.lower()
if 'ab' in word or 'ba' in word:
print('Word "{}" matches pattern!'.format(word))
As you can see, 'ab' in word evaluates to True if the string 'ab' is found as-is (that is, exactly) in word, or False otherwise. For example 'ba' in 'probable' == True and 'ab' in 'Abolition' == False. The second line takes take of dividing the sentence in words and taking out any punctuation character. word = word.lower() makes word lowercase before the comparisons, so that for word = 'Abolition', 'ab' in word == True.
I would do it this way:
Strip your string from unwanted chars using the below two
techniques, your choice:
a - By building a translation dictionary and using translate method:
>>> import string
>>> del_punc = dict.fromkeys(ord(c) for c in string.punctuation)
s = 'abolition, fabrics, probable, test, case, bank;, halfback 1(ablution).'
>>> s = s.translate(del_punc)
>>> print(s)
'abolition fabrics probable test case bank halfback 1ablution'
b - using re.sub method:
>>> import string
>>> import re
>>> s = 'abolition, fabrics, probable, test, case, bank;, halfback 1(ablution).'
>>> s = re.sub(r'[%s]'%string.punctuation, '', s)
>>> print(s)
'abolition fabrics probable test case bank halfback 1ablution'
Next will be finding your words containing 'ab' or 'ba':
a - Splitting over whitespaces and finding occurrences of your desired strings, which is the one I recommend to you:
>>> [x for x in s.split() if 'ab' in x.lower() or 'ba' in x.lower()]
['abolition', 'fabrics', 'probable', 'bank', 'halfback', '1ablution']
b -Using re.finditer method:
>>> pat
re.compile('\\b.*?(ab|ba).*?\\b', re.IGNORECASE)
>>> for m in pat.finditer(s):
print(m.group())
abolition
fabrics
probable
test case bank
halfback
1ablution
string = "your string here"
lowercase = string.lower()
if 'ab' in lowercase or 'ba' in lowercase:
print(true)
else:
print(false)
Try this one
[(),/]*([a-z]|(ba|ab))+[(),/]*
def split_on_separators(word, separators):
"""
Return a list of non-empty, non-blank strings from the original string
determined by splitting the string on any of the separators.
separators is a string of single-character separators.
>>> split_on_separators("Wow! Fantastic, you're done.", "!,")
['Wow', ' Fantastic', " you're done."]
"""
word_list = []
for ch in word.split():
stripped = ch.strip(separators)
word_list.append(stripped)
return word_list
#output
['Wow', 'Fantastic', "you're", 'done.']
The separators are removed but I can't seem to get the white space in front of the F. Secondly 'you're done.' is not in a single string
Any Help would be greatly appreciated :)
I'm using python 3
One solution could be:
def split_on_separators(word, separators):
word_list = [word]
auxList = []
for sep in separators:
for w in word_list:
auxList.extend(w.split(sep))
word_list = auxList
auxList = list()
return word_list
Out[76]: ['Wow', ' Fantastic', " you're done."]
This should do the job:
def split_on_separators(word, separators):
for sep in separators:
word = word.replace(sep, '^#^')
return [x.strip() for x in word.split('^#^')]
The ^#^ is just a placeholder. I made it a weird character combination to make sure it doesn't appear in a normal sentence. You can replace it, if you want.
Another crazy solution:
import itertools
def split_on_separators(word, separators):
groups = itertools.groupby(word, lambda char: char in separators)
return [''.join(letters) for is_sep, letters in groups if not is_sep]
For the case where two separators can be adjacent (and you want to represent it as an empty word):
import itertools
def split_on_separators(word, separators):
groups = itertools.groupby(word, lambda char: char in separators)
seps2words = lambda letters: [''] * (len(tuple(letters)) - 1)
return [word for is_sep, letters in groups
for word in ([''.join(letters)] if not is_sep else seps2words(letters))]
You're splitting prior to stripping it. Splitting is done based on whitespace between words; therefore, you won't see it in your final output. That is also the reason "you're done" is not a single string. It splits it based on white space between the words.
You might try a delimiter for the split:
str.split([sep[, maxsplit]]) Return a list of the words in the string,
using sep as the delimiter string.
I have a two dimensional array called "beats" with a bunch of data. In the second column of the array, there is a list of words in alphabetical order.
I also have a sentence called "words" which was originally a string, which I've turned into an array.
I need to check if one of the words in "words" matches any of the words in the second column of the array "beats". If a match has been found, the program changes the matched word in the sentence "words" to "match" and then return the words in a string. This is the code I'm using:
i = 0
while i < len(words):
n = 0
while n < len(beats):
if words[i] == beats[n][1]:
words[i] = "match"
n = n + 1
i = i + 1
mystring = ' '.join(words)
return mystring
So if I have the sentence:
"Money is the last money."
And "money" is in the second column of the array "beats", the result would be:
"match is the last match."
But since there's a period behind "match", it doesn't consider it a match.
Is there a way to ignore punctuation when comparing the two strings? I don't want to strip the sentence of punctuation because I want the punctuation to be in tact when I return the string once my program's done replacing the matches.
You can create a new string that has the properties you want, and then compare with the new string(s). This will strip everything but numbers, letters, and spaces while making all letters lowercase.
''.join([letter.lower() for letter in ' '.join(words) if letter.isalnum() or letter == ' '])
To strip everything but letters from a string you can do something like:
from string import ascii_letters
''.join([letter for letter in word if letter in ascii_letters])
You could use a regex:
import re
st="Money is the last money."
words=st.split()
beats=['money','nonsense']
for i,word in enumerate(words):
if word=='match': continue
for tgt in beats:
word=re.sub(r'\b{}\b'.format(tgt),'match',word,flags=re.I)
words[i]=word
print print ' '.join(words)
prints
match is the last match.
If it is only the fullstop that you are worried about, then you can add another if case to match that too. Or similar you can add custom handling if your cases are limited. or otherwise regex is the way to go.
words="Money is the last money. This money is another money."
words = words.split()
i = 0
while i < len(words):
if (words[i].lower() == "money".lower()):
words[i] = "match"
if (words[i].lower() == "money".lower() + '.'):
words[i] = "match."
i = i + 1
mystring = ' '.join(words)
print mystring
Output:
match is the last match. This match is another match.