Say I have strings,
string1 = 'Hello how are you'
string2 = 'are you doing now?'
The result should be something like
Hello how are you doing now?
I was thinking of different approaches using re and string search.
(Longest common substring problem)
But is there a simple way (or a library) that does this in Python?
To make things clear, I'll add one more set of test strings:
string1 = 'This is a nice ACADEMY'
string2 = 'DEMY you know!'
The result would be:
'This is a nice ACADEMY you know!'
This should do:
string1 = 'Hello how are you'
string2 = 'are you doing now?'
i = 0
while not string2.startswith(string1[i:]):
    i += 1
sFinal = string1[:i] + string2
OUTPUT :
>>> sFinal
'Hello how are you doing now?'
Or make it a function so that you can reuse it:
def merge(s1, s2):
    i = 0
    while not s2.startswith(s1[i:]):
        i += 1
    return s1[:i] + s2
OUTPUT :
>>> merge('Hello how are you', 'are you doing now?')
'Hello how are you doing now?'
>>> merge("This is a nice ACADEMY", "DEMY you know!")
'This is a nice ACADEMY you know!'
This should do what you want:
def overlap_concat(s1, s2):
    l = min(len(s1), len(s2))
    for i in range(l, 0, -1):
        if s1.endswith(s2[:i]):
            return s1 + s2[i:]
    return s1 + s2
Examples:
>>> overlap_concat("Hello how are you", "are you doing now?")
'Hello how are you doing now?'
>>>
>>> overlap_concat("This is a nice ACADEMY", "DEMY you know!")
'This is a nice ACADEMY you know!'
>>>
Using str.endswith and enumerate:
def overlap(string1, string2):
    for i, s in enumerate(string2, 1):
        if string1.endswith(string2[:i]):
            break
    return string1 + string2[i:]
>>> overlap("Hello how are you", "are you doing now?")
'Hello how are you doing now?'
>>> overlap("This is a nice ACADEMY", "DEMY you know!")
'This is a nice ACADEMY you know!'
To account for trailing special characters, you could use an re-based substitution:
import re
string1 = re.sub(r'[^\w\s]', '', string1)
Note, however, that this removes all special characters in the first string.
A modification to the above function that finds the longest matching substring (instead of the shortest) involves traversing string2 in reverse.
def overlap(string1, string2):
    for i in range(len(string2)):
        if string1.endswith(string2[:len(string2) - i]):
            break
    else:
        # no overlap at all: plain concatenation
        return string1 + string2
    return string1 + string2[len(string2) - i:]
>>> overlap('Where did', 'did you go?')
'Where did you go?'
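To make the difference between the shortest-match and longest-match strategies concrete, here is a minimal sketch. The names overlap_shortest and overlap_longest are hypothetical, chosen only for illustration, and both fall back to plain concatenation when nothing overlaps, which is an assumption about the desired behaviour.

```python
def overlap_shortest(s1, s2):
    """Merge on the shortest prefix of s2 that is a suffix of s1."""
    for i in range(1, len(s2) + 1):
        if s1.endswith(s2[:i]):
            return s1 + s2[i:]
    return s1 + s2  # assumed fallback: no overlap, just concatenate


def overlap_longest(s1, s2):
    """Merge on the longest prefix of s2 that is a suffix of s1."""
    for i in range(min(len(s1), len(s2)), 0, -1):
        if s1.endswith(s2[:i]):
            return s1 + s2[i:]
    return s1 + s2  # assumed fallback: no overlap, just concatenate


# 'xaba' ends with both 'a' and 'aba', so the two strategies disagree.
print(overlap_shortest('xaba', 'abac'))  # xababac
print(overlap_longest('xaba', 'abac'))   # xabac
```

For the two test inputs in the question both strategies give the same result; they only differ when a short prefix of the second string also happens to end the first.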
The other answers are great, but they fail for this input:
string1 = 'THE ACADEMY has'
string2 = '.CADEMY has taken'
output:
>>> merge(string1,string2)
'THE ACADEMY has.CADEMY has taken'
>>> overlap(string1,string2)
'THE ACADEMY has'
However, the standard library module difflib proved effective in my case:
from difflib import SequenceMatcher
match = SequenceMatcher(None, string1, string2).find_longest_match(
    0, len(string1), 0, len(string2))
print(match)  # -> Match(a=5, b=1, size=10)
print(string1[:match.a + match.size] + string2[match.b + match.size:])
output:
Match(a=5, b=1, size=10)
THE ACADEMY has taken
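The difflib approach can be wrapped in a reusable function. This is a minimal sketch: the name merge_on_longest_match is hypothetical, and the fallback to plain concatenation when no common block exists is an assumption, not part of the answer above. autojunk=False is passed so that long inputs are not affected by the matcher's popularity heuristic.

```python
from difflib import SequenceMatcher


def merge_on_longest_match(s1, s2):
    """Join s1 and s2 at their longest common block (hypothetical helper)."""
    matcher = SequenceMatcher(None, s1, s2, autojunk=False)
    match = matcher.find_longest_match(0, len(s1), 0, len(s2))
    if match.size == 0:
        # Assumed fallback: nothing in common, so just concatenate.
        return s1 + s2
    # Keep s1 up to the end of the block, then s2 after the block.
    return s1[:match.a + match.size] + s2[match.b + match.size:]


print(merge_on_longest_match('THE ACADEMY has', '.CADEMY has taken'))
# THE ACADEMY has taken
```

Note that this joins on the longest common block anywhere in the two strings, not only at the end of the first and the start of the second, so text after the block in s1 and before the block in s2 is discarded.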
The words you want to drop are the ones in the second string that already appear in the first, so you can try something like:
new_string=[string2.split()]
new=[]
new1=[j for item in new_string for j in item if j not in string1]
new1.insert(0,string1)
print(" ".join(new1))
with the first test case:
string1 = 'Hello how are you'
string2 = 'are you doing now?'
output:
Hello how are you doing now?
second test case:
string1 = 'This is a nice ACADEMY'
string2 = 'DEMY you know!'
output:
This is a nice ACADEMY you know!
Explanation:
First, we split the second string into words so we can find which ones to drop:
new_string=[string2.split()]
Second, we check each word of the split string against string1; if a word already occurs in string1, we skip it and keep only the words that do not:
new1=[j for item in new_string for j in item if j not in string1]
This list comprehension is the same as:
new1=[]
for item in new_string:
    for j in item:
        if j not in string1:
            new1.append(j)
The last step puts string1 at the front of the list and joins it:
new1.insert(0,string1)
print(" ".join(new1))
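One caveat with the snippet above: j not in string1 is a substring test, so a word such as 'a' is dropped whenever the letter appears anywhere in string1. A word-level variant that only merges whole words at the boundary could look like the sketch below. The name merge_words is hypothetical, and this is a different technique from the answer above: it will not merge a partial word such as 'DEMY' into 'ACADEMY'.

```python
def merge_words(s1, s2):
    """Merge s1 and s2 on the longest run of whole words they share at the boundary."""
    w1, w2 = s1.split(), s2.split()
    for k in range(min(len(w1), len(w2)), 0, -1):
        # Compare the last k words of s1 with the first k words of s2.
        if w1[-k:] == w2[:k]:
            return ' '.join(w1 + w2[k:])
    return ' '.join(w1 + w2)  # assumed fallback: no shared words


print(merge_words('Hello how are you', 'are you doing now?'))
# Hello how are you doing now?
print(merge_words('I like tea', 'tea is a drink'))
# I like tea is a drink
```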
Related
I am trying to remove special characters from each element in a string. The code below does count the elements, but I can't get .isalpha() to remove the non-alphabetical characters. Is anyone able to assist? Thank you in advance.
input = 'Hello, Goodbye hello hello! bye byebye hello?'
word_list = input.split()
for word in word_list:
    if word.isalpha()==False:
        word[:-1]
di = dict()
for word in word_list:
    di[word] = di.get(word,0)+1
di
It seems you are expecting word[:-1] to remove the last character of word and have that change reflected in the list word_list. However, you have assigned the string in word_list to a new variable called word and therefore the change won't be reflected in the list itself.
A simple fix would be to create a new list and append values to that. Note that your original string is called input, which shadows the builtin input() function; that is not a good idea:
input_string = 'Hello, Goodbye hello hello! bye byebye hello?'
word_list = input_string.split()
new = []
for word in word_list:
    if word.isalpha() == False:
        new.append(word[:-1])
    else:
        new.append(word)
di = dict()
for word in new:
    di[word] = di.get(word,0)+1
print(di)
# {'byebye': 1, 'bye': 1, 'Hello': 1, 'Goodbye': 1, 'hello': 3}
You could also remove the second for loop and use collections.Counter instead:
from collections import Counter
print(Counter(new))
You are nearly there with your for loop. The main stumbling block seems to be that word[:-1] on its own does nothing, you need to store that data somewhere. For example, by appending to a list.
You also need to specify what happens to strings which don't need modifying. I'm also not sure what purpose the dictionary serves.
So here's your for loop re-written:
mystring = 'Hello, Goodbye hello hello! bye byebye hello?'
word_list = mystring.split()
res = []
for word in word_list:
    if not word.isalpha():
        res.append(word[:-1])
    else:
        res.append(word)
mystring_out = ' '.join(res) # 'Hello Goodbye hello hello bye byebye hello'
The idiomatic way to write the above is via feeding a list comprehension to str.join:
mystring_out = ' '.join([word[:-1] if not word.isalpha() else word \
for word in mystring.split()])
It goes without saying that this assumes word.isalpha() returns False due to an unwanted character at the end of a string, and that this is the only scenario you want to consider for special characters.
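If punctuation can also appear at the start of a word, or more than once at the end, one option is str.strip with string.punctuation instead of slicing off a single character. This is a minimal sketch; the function name strip_punctuation is hypothetical, and punctuation inside a word (for example an apostrophe) is deliberately left alone.

```python
import string


def strip_punctuation(text):
    """Strip leading and trailing punctuation from every whitespace-separated word."""
    words = (word.strip(string.punctuation) for word in text.split())
    # Drop tokens that consisted of punctuation only.
    return ' '.join(word for word in words if word)


print(strip_punctuation('Hello, Goodbye hello hello! bye byebye hello?'))
# Hello Goodbye hello hello bye byebye hello
```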
One solution using re:
In [1]: import re
In [2]: a = 'Hello, Goodbye hello hello! bye byebye hello?'
In [3]: ' '.join([i for i in re.split(r'[^A-Za-z]', a) if i])
Out[3]: 'Hello Goodbye hello hello bye byebye hello'
I tried matching words that contain the letters "ab" or "ba", e.g. "ab"olition, f"ab"rics, pro"ba"ble. I came up with the following regular expression:
r"[Aa](?=[Bb])[Bb]|[Bb](?=[Aa])[Aa]"
But the matches include words that start or end with non-alphanumeric characters such as ", (, ), or /. How can I strip those? I just want a list of the matching words.
import sys
import re
word=[]
dict={}
f = open('C:/Python27/brown_half.txt', 'rU')
w = open('C:/Python27/brown_halfout.txt', 'w')
data = f.read()
word = data.split() # word is list
f.close()
for num2 in word:
    match2 = re.findall("\w*(ab|ba)\w*", num2)
    if match2:
        dict[num2] = (dict[num2] + 1) if num2 in dict.keys() else 1
for key2 in sorted(dict.iterkeys()):
    print "%s: %s" % (key2, dict[key2])
print len(dict.keys())
I don't know how to combine this with the re.compile approach that the first comment suggested.
To match all the words with ab or ba (case insensitive):
import re
text = 'fabh, obar! (Abtt) yybA, kk'
pattern = re.compile(r"(\w*(ab|ba)\w*)", re.IGNORECASE)
# to print all the matches
for match in pattern.finditer(text):
    print(match.group(0))
# to print the first match
print(pattern.search(text).group(0))
https://regex101.com/r/uH3xM9/1
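If re.findall is preferred over finditer, a non-capturing group keeps the result a flat list of whole words, because findall returns the groups rather than the full match when the pattern contains capturing groups. The sketch below also counts the matches, which is what the question's loop does; the names AB_WORD and count_ab_words are hypothetical.

```python
import re
from collections import Counter

# (?:...) is non-capturing, so findall returns the whole matched word.
AB_WORD = re.compile(r"\w*(?:ab|ba)\w*", re.IGNORECASE)


def count_ab_words(text):
    """Count words containing 'ab' or 'ba', ignoring case and surrounding punctuation."""
    return Counter(word.lower() for word in AB_WORD.findall(text))


print(AB_WORD.findall('fabh, obar! (Abtt) yybA, kk'))
# ['fabh', 'obar', 'Abtt', 'yybA']
```

Because \w does not match punctuation, quotes and parentheses around a word never become part of the match.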
Regular expressions are not the best tool for the job in this case. They'll complicate stuff way too much for such simple circumstances. You can instead use Python's builtin in operator (works for both Python 2 and 3)...
sentence = "There are no probable situations whereby that may happen, or so it seems since the Abolition."
words = [''.join(filter(lambda x: x.isalpha(), token)) for token in sentence.split()]
for word in words:
    word = word.lower()
    if 'ab' in word or 'ba' in word:
        print('Word "{}" matches pattern!'.format(word))
As you can see, 'ab' in word evaluates to True if the string 'ab' is found as-is (that is, exactly) in word, and False otherwise. For example, 'ba' in 'probable' is True and 'ab' in 'Abolition' is False. The second line takes care of splitting the sentence into words and stripping any punctuation characters. word = word.lower() makes word lowercase before the comparisons, so that for word = 'Abolition', 'ab' in word is True.
I would do it this way:
Strip unwanted characters from your string using either of the two techniques below:
a - By building a translation dictionary and using the translate method:
>>> import string
>>> del_punc = dict.fromkeys(ord(c) for c in string.punctuation)
>>> s = 'abolition, fabrics, probable, test, case, bank;, halfback 1(ablution).'
>>> s = s.translate(del_punc)
>>> print(s)
abolition fabrics probable test case bank halfback 1ablution
b - using re.sub method:
>>> import string
>>> import re
>>> s = 'abolition, fabrics, probable, test, case, bank;, halfback 1(ablution).'
>>> s = re.sub(r'[%s]' % re.escape(string.punctuation), '', s)
>>> print(s)
abolition fabrics probable test case bank halfback 1ablution
Next will be finding your words containing 'ab' or 'ba':
a - Splitting over whitespaces and finding occurrences of your desired strings, which is the one I recommend to you:
>>> [x for x in s.split() if 'ab' in x.lower() or 'ba' in x.lower()]
['abolition', 'fabrics', 'probable', 'bank', 'halfback', '1ablution']
b - Using re.finditer method:
>>> pat = re.compile(r'\b.*?(ab|ba).*?\b', re.IGNORECASE)
>>> pat
re.compile('\\b.*?(ab|ba).*?\\b', re.IGNORECASE)
>>> for m in pat.finditer(s):
...     print(m.group())
...
abolition
fabrics
probable
test case bank
halfback
1ablution
string = "your string here"
lowercase = string.lower()
if 'ab' in lowercase or 'ba' in lowercase:
    print(True)
else:
    print(False)
Try this one
[(),/]*([a-z]|(ba|ab))+[(),/]*
I have a string, for example 'i cant sleep what should i do', as well as a phrase that is contained in the string, 'cant sleep'. What I am trying to accomplish is to get an n-sized window around the phrase even if there aren't n words on either side. So in this case, with a window size of 2 (2 words on either side of the phrase), I would want 'i cant sleep what should'.
This is my current attempt at a window size of 2. However, it fails when the number of words to the left or right of the phrase is less than 2. I would also like to be able to use different window sizes.
import re
sentence = 'i cant sleep what should i do'
phrase = 'cant sleep'
sentence_words = re.findall(r'\w+', sentence)
phrase_words = re.findall(r'\w+', phrase)
left = sentence_words.index(phrase_words[0])
right = sentence_words.index(phrase_words[-1])
print(sentence_words[left-2:right+3])
You can use the partition method for a non-regex solution:
>>> s='i cant sleep what should i do'
>>> p='cant sleep'
>>> lh, _, rh = s.partition(p)
Then use a slice to get up to two words:
>>> n=2
>>> ' '.join(lh.split()[:n]), p, ' '.join(rh.split()[:n])
('i', 'cant sleep', 'what should')
Your exact output:
>>> ' '.join(lh.split()[:n]+[p]+rh.split()[:n])
'i cant sleep what should'
You would want to check whether p is in s or if the partition succeeds of course.
As pointed out in the comments, the slice on lh should use a negative index to take the last n words (thanks Mathias Ettinger):
>>> s='w1 w2 w3 w4 w5 w6 w7 w8 w9'
>>> p='w4 w5'
>>> n=2
>>> lh, _, rh = s.partition(p)
>>> ' '.join(lh.split()[-n:]+[p]+rh.split()[:n])
'w2 w3 w4 w5 w6 w7'
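Putting the partition steps together with the membership check mentioned above gives something like the following sketch. The function name window_by_partition and the choice to return None when the phrase is absent are assumptions. Since partition works on raw substrings, the phrase may also match in the middle of a longer word.

```python
def window_by_partition(s, p, n):
    """Return up to n words on each side of phrase p in s, or None if p is absent."""
    if p not in s:
        return None  # assumed behaviour for a missing phrase
    lh, _, rh = s.partition(p)
    # lh.split()[-0:] would return the whole list, so n == 0 needs special care.
    left = lh.split()[-n:] if n else []
    right = rh.split()[:n]
    return ' '.join(left + [p] + right)


print(window_by_partition('i cant sleep what should i do', 'cant sleep', 2))
# i cant sleep what should
```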
If you define words as entities separated by spaces, you can split your sentence and use regular Python slicing:
def get_window(sentence, phrase, window_size):
    sentence = sentence.split()
    phrase = phrase.split()
    words = len(phrase)
    for i,word in enumerate(sentence):
        if word == phrase[0] and sentence[i:i+words] == phrase:
            start = max(0, i-window_size)
            return ' '.join(sentence[start:i+words+window_size])
sentence = 'i cant sleep what should i do'
phrase = 'cant sleep'
print(get_window(sentence, phrase, 2))
You can also turn it into a generator by changing return to yield, which generates all windows if phrase matches several times in sentence:
>>> list(gen_window('I dont need it, I need to get rid of it', 'need', 2))
['I dont need it, I', 'it, I need to get']
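For completeness, the generator variant described above might look like this. It is a sketch that mirrors get_window with return swapped for yield; only the name gen_window comes from the example call.

```python
def gen_window(sentence, phrase, window_size):
    """Yield a window of words around every occurrence of phrase in sentence."""
    sentence = sentence.split()
    phrase = phrase.split()
    words = len(phrase)
    for i, word in enumerate(sentence):
        if word == phrase[0] and sentence[i:i + words] == phrase:
            # Clamp the start so a negative index does not wrap around.
            start = max(0, i - window_size)
            yield ' '.join(sentence[start:i + words + window_size])


print(list(gen_window('I dont need it, I need to get rid of it', 'need', 2)))
# ['I dont need it, I', 'it, I need to get']
```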
import re
def contains_sublist(lst, sublst):
    n = len(sublst)
    for i in range(len(lst)-n+1):
        if sublst == lst[i:i+n]:
            a = max(0, i-2)  # clamp at the start of the list
            b = min(i+n+2, len(lst))
            return ' '.join(lst[a:b])
sentence = 'i cant sleep what should i do'
phrase = 'cant sleep'
sentence_words = re.findall(r'\w+', sentence)
phrase_words = re.findall(r'\w+', phrase)
print(contains_sublist(sentence_words, phrase_words))
You can split words using built-in string methods, so re shouldn't be necessary. If you want to vary the window sizes, wrap it in a function like so:
def get_word_window(sentence, phrase, w_left=0, w_right=0):
    w_lst = sentence.split()
    p_lst = phrase.split()
    for i,word in enumerate(w_lst):
        if word == p_lst[0] and \
                w_lst[i:i+len(p_lst)] == p_lst:
            left = max(0, i-w_left)
            right = min(len(w_lst), i+w_right+len(p_lst))
            return w_lst[left:right]
Then you can get the new phrase like so:
>>> sentence='i cant sleep what should i do'
>>> phrase='cant sleep'
>>> ' '.join(get_word_window(sentence,phrase,2,2))
'i cant sleep what should'
I am taking an input string that is all one continuous group of letters and splitting it into a sentence. The problem is that as a beginner I can't figure out how to modify the string to ONLY capitalize the first letter and convert the others to lowercase. I know the string.lower but that converts everything to lowercase. Any ideas?
# This program asks user for a string run together
# with each word capitalized and gives back the words
# separated and only the first word capitalized
import re
def main():
    # ask the user for a string
    string = input( 'Enter some words each one capitalized, run together without spaces ')
    for ch in string:
        if ch.isupper() and not ch.islower():
            newstr = re.sub('[A-Z]',addspace,string)
    print(newstr)
def addspace(m) :
    return ' ' + m.group(0)
#call the main function
main()
You can use capitalize():
Return a copy of the string with its first character capitalized and
the rest lowercased.
>>> s = "hello world"
>>> s.capitalize()
'Hello world'
>>> s = "hello World"
>>> s.capitalize()
'Hello world'
>>> s = "hELLO WORLD"
>>> s.capitalize()
'Hello world'
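Combined with the re.sub step from the question, capitalize() gives the whole transformation in two lines. This is a minimal sketch; the function name split_capitalized is hypothetical, and it assumes every word in the input starts with an ASCII capital letter.

```python
import re


def split_capitalized(run_together):
    """Insert a space before each capital letter, then capitalize only the first word."""
    spaced = re.sub(r'([A-Z])', r' \1', run_together).strip()
    # capitalize() upper-cases the first character and lower-cases the rest.
    return spaced.capitalize()


print(split_capitalized('StopAndSmellTheRoses'))
# Stop and smell the roses
```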
Unrelated example. To capitalize only the first letter you can do:
>>> s = 'hello'
>>> s = s[0].upper()+s[1:]
>>> print(s)
Hello
>>> s = 'heLLO'
>>> s = s[0].upper()+s[1:]
>>> print(s)
HeLLO
For a whole string, you can do
>>> s = 'what is your name'
>>> print(' '.join(i[0].upper()+i[1:] for i in s.split()))
What Is Your Name
[EDIT]
You can also do:
>>> s = 'Hello What Is Your Name'
>>> s = ''.join(j.lower() if i>0 else j for i,j in enumerate(s))
>>> print(s)
Hello what is your name
If you only want to capitalize the start of sentences (and your string has multiple sentences), you can do something like:
>>> sentences = "this is sentence one. this is sentence two. and SENTENCE three."
>>> split_sentences = sentences.split('.')
>>> '. '.join([s.strip().capitalize() for s in split_sentences])
'This is sentence one. This is sentence two. And sentence three. '
If you don't want to change the case of the letters that don't start the sentence, then you can define your own capitalize function:
>>> def my_capitalize(s):
...     if s:  # check that s is not ''
...         return s[0].upper() + s[1:]
...     return s
...
and then:
>>> '. '.join([my_capitalize(s.strip()) for s in split_sentences])
'This is sentence one. This is sentence two. And SENTENCE three. '
The aim of my task is to add spaces before and after punctuation. Currently I've been using an iterative str.replace() to replace each punctuation character p with " "+p+" ". How do I achieve the same output with str.translate(), where I can just pass in two lists or a dictionary:
inlist = string.punctuation
outlist = [" "+p+" " for p in string.punctuation]
inoutdict = {p:" "+p+" " for p in string.punctuation}
Let's assume that all the punctuation characters I have are in string.punctuation. Currently, I'm doing it like this:
from string import punctuation as punct
def punct_tokenize(text):
    for ch in text:
        if ch in punct:
            text = text.replace(ch, " "+ch+" ")
    return " ".join(text.split())
sent = "This's a foo-bar sentences with many, many punctuation."
print punct_tokenize(sent)
Also, this iterative str.replace() is taking too long. Will str.translate() be any faster?
In Python 2, the dict form of translate only works with unicode objects:
>>> import string
>>> inoutdict = {ord(p):unicode(" "+p+" ") for p in string.punctuation}
>>> unicode("foo,,,bar!!1").translate(inoutdict)
u'foo , , , bar ! ! 1'
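In Python 3 every str is Unicode, so the same idea works directly on str via str.maketrans. The sketch below assumes Python 3; the name PUNCT_TABLE is hypothetical, and punct_tokenize reuses the name from the question.

```python
import string

# str.maketrans accepts a dict mapping single characters to replacement strings.
PUNCT_TABLE = str.maketrans({p: " " + p + " " for p in string.punctuation})


def punct_tokenize(text):
    """Surround every punctuation character with spaces, then normalise whitespace."""
    return " ".join(text.translate(PUNCT_TABLE).split())


print(punct_tokenize("This's a foo-bar sentences with many, many punctuation."))
# This ' s a foo - bar sentences with many , many punctuation .
```

The table is built once and translate makes a single pass over the text, whereas the loop in the question calls str.replace once per punctuation character it encounters.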
Another option is with regular expressions:
>>> import re
>>> rx = '[%s]' % re.escape(string.punctuation)
>>> re.sub(rx, r" \g<0> ", "foo,,,bar!!1")
'foo , , , bar ! ! 1'
As usual, show us a bigger picture to get better answers, e.g. why are you doing that? where does the input come from?, etc...