Regex does not identify '#' for removing '#' from words starting with '#' - python

How can I remove # from a word in a string when it is the first character of that word? It should remain if it appears by itself, in the middle of a word, or at the end of a word.
Currently I am using this regular expression:
test = "# #DataScience"
test = re.sub(r'\b#\w\w*\b', '', test)
to remove the # from words starting with #, but it does not work at all: it returns the string unchanged.
Can anyone please tell me why the # is not being recognized and removed?
Examples -
test - "# #DataScience"
Expected Output - "# DataScience"
Test - "kjndjk#jnjkd"
Expected Output - "kjndjk#jnjkd"
Test - "# #DataScience #KJSBDKJ kjndjk#jnjkd #jkzcjkh# iusadhuish#"
Expected Output - "# DataScience KJSBDKJ kjndjk#jnjkd jkzcjkh# iusadhuish#"

import re
a = '# #DataScience'
b = 'kjndjk#jnjkd'
c = "# #DataScience #KJSBDKJ kjndjk#jnjkd #jkzcjkh# iusadhuish#"
# Match a '#' that is preceded by whitespace and followed by a non-space character
regex = r'(\s+)#(\S)'
print(re.sub(regex, r'\1\2', a))
print(re.sub(regex, r'\1\2', b))
print(re.sub(regex, r'\1\2', c))

You can split your string by space ' ' to make a list of all words in the string. Then loop in that list, check each word for your given condition and replace hash if necessary. After that you can join the list by space ' ' to create a string and return it.
import re
def remove_hash(text):
    words = text.split(' ')  # Split the string into a list of words
    without_hash = []  # List for the words after removing the hash
    for word in words:
        if re.match('^#[a-zA-Z]+', word) is not None:  # the word starts with '#' followed by at least one letter
            without_hash.append(word[1:])  # if so, remove the hash and append the word to the new list
        else:
            without_hash.append(word)  # otherwise append the word as is
    return ' '.join(without_hash)  # join the new list with spaces and return it
Output:
>>> remove_hash('# #DataScience')
'# DataScience'
>>> remove_hash('kjndjk#jnjkd')
'kjndjk#jnjkd'
>>> remove_hash("# #DataScience #KJSBDKJ kjndjk#jnjkd #jkzcjkh# iusadhuish#")
'# DataScience KJSBDKJ kjndjk#jnjkd jkzcjkh# iusadhuish#'
You can make your code shorter (but a bit harder to understand) by avoiding the if/else, like this:
def remove_hash(text):
    words = text.split(' ')
    without_hash = []
    for word in words:
        without_hash.append(re.sub(r'^#+(.+)', r'\1', word))
    return ' '.join(without_hash)
This gives the same results for the examples above.
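The two versions agree on the examples from the question, but they are not strictly equivalent: the first only strips a # that is followed by a letter, while the second strips any run of leading # characters as long as something follows. A small sketch comparing them (the names remove_hash_strict and remove_hash_loose are made up for this illustration):

```python
import re

def remove_hash_strict(text):
    # First version: drop a leading '#' only when a letter follows it
    return ' '.join(
        word[1:] if re.match(r'^#[a-zA-Z]+', word) else word
        for word in text.split(' ')
    )

def remove_hash_loose(text):
    # Second version: drop every leading '#' as long as something follows
    return ' '.join(re.sub(r'^#+(.+)', r'\1', word) for word in text.split(' '))

sample = '#tag #123 ##double #'
print(remove_hash_strict(sample))  # tag #123 ##double #
print(remove_hash_loose(sample))   # tag 123 double #
```

Which behavior is preferable depends on whether tokens such as #123 should count as words starting with #.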

Give the following pattern a try. It looks for a sequence of '#' characters and whitespace located at the beginning of the string and substitutes '# ' for it.
import re
test = "# #DataScience"
test = re.sub(r'(^[#\s]+)', '# ', test)
>>>test
# DataScience
You can play with the pattern further here: https://regex101.com/r/6hfw4t/1
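As for why the original pattern never matches: \b only matches between a word character (\w) and a non-word character, and # is itself a non-word character. So \b# can only match when the # is preceded by a letter, digit, or underscore, which is exactly the case that should be left alone. A minimal sketch of one alternative, using a lookbehind and a lookahead instead of \b (the name strip_leading_hash is invented here):

```python
import re

def strip_leading_hash(text):
    # Remove a '#' that is not preceded by a non-space character
    # (start of string or whitespace) and is followed by a word character.
    return re.sub(r'(?<!\S)#(?=\w)', '', text)

# The original pattern finds no match: there is no word boundary before these '#'
print(re.sub(r'\b#\w\w*\b', '', '# #DataScience'))   # # #DataScience
print(strip_leading_hash('# #DataScience'))          # # DataScience
print(strip_leading_hash('kjndjk#jnjkd'))            # kjndjk#jnjkd
print(strip_leading_hash('#jkzcjkh# iusadhuish#'))   # jkzcjkh# iusadhuish#
```

Unlike the whitespace-based patterns above, this sketch also handles a hashtag at the very start of the string.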

Related

How to remove duplicate words (case-insensitive)

I have a string
str1='This Python is good Good python'
I want the output to remove duplicates, keeping the first occurrence of each word irrespective of case; e.g. good and Good are considered the same, as are Python and python. The output should be
output='This Python is good'
Following a rather traditional approach:
str1 = 'This Python is good Good python'
words_seen = set()
output = []
for word in str1.split():
    if word.lower() not in words_seen:
        words_seen.add(word.lower())
        output.append(word)
output = ' '.join(output)
print(output) # This Python is good
A caveat: it does not preserve word boundaries consisting of multiple spaces: 'python    puppy' would become 'python puppy'.
A very ugly short version:
words_seen = set()
output = ' '.join(word for word in str1.split() if not (word.lower() in words_seen or words_seen.add(word.lower())))
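Another way to write the same idea, shown as a sketch only: since dicts preserve insertion order in Python 3.7+, a dict keyed by the lowercased word remembers the first spelling seen. The function name dedupe_words is made up for this example.

```python
def dedupe_words(text):
    first_seen = {}
    for word in text.split():
        # setdefault stores the word only if its lowercase key is new
        first_seen.setdefault(word.lower(), word)
    return ' '.join(first_seen.values())

print(dedupe_words('This Python is good Good python'))  # This Python is good
```

Like the loop above, this collapses runs of whitespace because it relies on split() and join().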
One approach might be to use regular expressions to remove any word for which we can find a duplicate. The catch is that regex engines move from start to end of a string. Since you want to retain the first occurrence, we can reverse the string, and then do the cleanup.
import re
str1 = 'This Python is good Good python'
str1_reverse = ' '.join(reversed(str1.split(' ')))
str1_reverse = re.sub(r'\s*(\w+)\s*(?=.*\b\1\b)', ' ', str1_reverse, flags=re.I)
str1 = ' '.join(reversed(str1_reverse.strip().split(' ')))
print(str1) # This Python is good

Split alphanumeric strings by space and keep separator for just first occurrence

I am trying to split the first alphanumeric occurrence and rejoin it with a space, keeping the other occurrences as they are, but I can't work out a pattern to do that.
For example:
string: Johnson12 is at club39
converted_string: Johnson 12 is at club39
Desired Output:
input = "Brijesh Tiwari810663 A14082014RGUBWA"
output = Brijesh Tiwari 810663 A14082014RGUBWA
Code:
import re
regex = re.compile("[0-9]{1,}")
testString = "Brijesh Tiwari810663 A14082014RGUBWA" # fill this in
a = re.split('([^a-zA-Z0-9])', testString)
print(a)
>> ['Brijesh', ' ', 'Tiwari810663', ' ', 'A14082014RGUBWA']
Here is one way. We can use re.findall with the pattern [A-Za-z]+|[0-9]+, which matches either a run of letters or a run of digits at each position. Then join the resulting list with spaces to get your output:
inp = "Brijesh Tiwari810663 A14082014RGUBWA"
output = ' '.join(re.findall(r'[A-Za-z]+|[0-9]+', inp))
print(output) # Brijesh Tiwari 810663 A 14082014 RGUBWA
Edit: For your updated requirement, use re.sub with just one replacement:
inp = "Johnson12 is at club39"
output = re.sub(r'\b([A-Za-z]+)([0-9]+)\b', r'\1 \2', inp, count=1)
print(output) # Johnson 12 is at club39
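The same substitution also produces the desired output for the other input in the question, because A14082014RGUBWA is not matched at all: its digits are followed by more letters, so the trailing \b fails. A small sketch wrapping it in a function (split_first_alnum is a hypothetical name), with count passed as a keyword argument:

```python
import re

def split_first_alnum(text):
    # Insert a space between letters and digits in the first word
    # that consists of letters followed only by digits.
    return re.sub(r'\b([A-Za-z]+)([0-9]+)\b', r'\1 \2', text, count=1)

print(split_first_alnum("Johnson12 is at club39"))
# Johnson 12 is at club39
print(split_first_alnum("Brijesh Tiwari810663 A14082014RGUBWA"))
# Brijesh Tiwari 810663 A14082014RGUBWA
```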

Print words in between two keywords

I am trying to write code in Python that prints all the words between two keywords.
scenario = "This is a test to see if I can get Python to print out all the words in between Python and words"
go = False
start = "Python"
end = "words"
for line in scenario:
    if start in line: go = True
    elif end in line:
        go = False
        continue
    if go: print(line)
I want to get a printout of "to print out all the".
A slightly different approach: let's create a list with each element being a word in the sentence. Then let's use list.index() to find the positions at which the start and end words first occur. We can then return the words in the list between those indices. We want the result back as a string and not a list, so we join the words together with a space.
# list of words ['This', 'is', 'a', 'test', ...]
words = scenario.split()
# list of words between start and end ['to', 'print', ..., 'the']
matching_words = words[words.index(start)+1:words.index(end)]
# join back to one string with spaces between
' '.join(matching_words)
Result:
to print out all the
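One thing to keep in mind is that list.index() raises ValueError when a keyword is missing, and it always returns the first occurrence, even if the end word appears before the start word. A sketch that handles both cases, under the assumption that returning an empty string is acceptable when nothing is found (words_between is a made-up name):

```python
def words_between(text, start, end):
    words = text.split()
    try:
        start_idx = words.index(start)
        # Search for the end word only after the start word
        end_idx = words.index(end, start_idx + 1)
    except ValueError:
        # One of the keywords is missing
        return ''
    return ' '.join(words[start_idx + 1:end_idx])

scenario = "This is a test to see if I can get Python to print out all the words in between Python and words"
print(words_between(scenario, "Python", "words"))  # to print out all the
```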
Your initial problem is that you're iterating over scenario, the string, character by character instead of splitting it into separate words (use scenario.split()). There are also other issues, such as switching to searching for the end word once the start has been found. Instead, you might like to use index to find the two strings and then slice the string:
scenario = "This is a test to see if I can get Python to print out all the words in between Python and words"
start = "Python"
end = "words"
start_idx = scenario.index(start)
end_idx = scenario.index(end)
print(scenario[start_idx + len(start):end_idx].strip())
You can accomplish this with a simple regex:
import re
txt = "This is a test to see if I can get Python to print out all the words in between Python and words"
x = re.search(r"(?<=Python\s).*?(?=\s+words)", txt)
print(x.group())  # to print out all the
Here is the regex in action --> REGEX101
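If the keywords are not fixed in advance, the pattern can be built from variables. The following is only a sketch: it uses a capture group instead of the lookbehind and lookahead, and re.escape so that keywords containing regex metacharacters are treated literally. The function name text_between is invented for the example.

```python
import re

def text_between(text, start, end):
    pattern = re.escape(start) + r'\s+(.*?)\s+' + re.escape(end)
    match = re.search(pattern, text)
    # Return None when the keywords are not found in this order
    return match.group(1) if match else None

txt = "This is a test to see if I can get Python to print out all the words in between Python and words"
print(text_between(txt, "Python", "words"))  # to print out all the
```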
Split the string and go over it word by word to find the index at which the two keywords occur. Once you have those two indices, combine the list between those indices into a string.
scenario = 'This is a test to see if I can get Python to print out all the words in between Python and words'
start_word = 'Python'
end_word = 'words'
# Split the string into a list of words
words = scenario.split()
# Find start and end indices
start = words.index(start_word) + 1
end = words.index(end_word)
# Construct a string from the elements between `start` and `end`
result = ' '.join(words[start:end])
# Print the result
print(result)

How to capitalize every word including in the beginning of a quote?

For example, this sentence:
say "mosquito!"
I try to capitalize with the following code:
'say "mosquito!"'.capitalize()
Which returns this:
'Say "mosquito!"'
However, the desired result is:
'Say "Mosquito!"'
You can use str.title:
print('say "mosquito!"'.title())
# Output: Say "Mosquito!"
Looks like Python has a built-in method for this!
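One caveat with str.title(): it treats an apostrophe as a word boundary, so the letter after it is capitalized too. The Python documentation for str.title() describes a regex-based workaround along these lines; the sketch below adapts it (titlecase is just an illustrative name):

```python
import re

def titlecase(s):
    # Capitalize each run of letters, keeping an apostrophe suffix
    # such as 's attached to its word
    return re.sub(r"[A-Za-z]+('[A-Za-z]+)?",
                  lambda mo: mo.group(0).capitalize(),
                  s)

print('say "mosquito\'s house"'.title())     # Say "Mosquito'S House"
print(titlecase('say "mosquito\'s house"'))  # Say "Mosquito's House"
```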
This is quite tricky. I look for the first letter (alphabetic character) of each word: split the string into words, convert the first letter of each word to upper case, and join the words again.
def start(word):
    # Index of the first alphabetic character, or 0 if there is none
    for n in range(len(word)):
        if word[n].isalpha():
            return n
    return 0
strng = 'say "mosquito\'s house"'
print(' '.join(word[:start(word)] + word[start(word)].upper() + word[start(word)+1:] for word in strng.split()))
Result:
Say "Mosquito's House"
You could do it using a lambda in regular expression substitution:
string = 'say "mosquito\'s house" '
import re
caps = re.sub("((^| |\")[a-z])",lambda m:m.group(1).upper(),string)
# 'Say "Mosquito\'s House" '

Having trouble adding a space after a period in a python string

I have to write code to do two things:
Compress more than one occurrence of the space character into one.
Add a space after a period, if there isn't one.
For example:
input> This is weird.Indeed
output>This is weird. Indeed.
This is the code I wrote:
def correction(string):
    list=[]
    for i in string:
        if i!=" ":
            list.append(i)
        elif i==" ":
            k=i+1
            if k==" ":
                k=""
            list.append(i)
    s=' '.join(list)
    return s
strn=input("Enter the string: ").split()
print (correction(strn))
This code takes any input from the user and removes all the extra spaces, but it does not add the space after the period. (I know why: because of the split function, the period and the next word are taken together as one word. I just can't figure out how to fix it.)
This is some code I found online:
import re
def correction2(string):
    corstr = re.sub(r' +', ' ', string)
    final = re.sub(r'\.', '. ', corstr)
    return final
strn = "This is as .Indeed"
print(correction2(strn))
The problem with this code is that it can't take any input from the user; the string is predefined in the program.
So can anyone suggest how to improve either of the two programs so that it does both things on ANY input from the user?
Is this what you desire?
import re
def corr(s):
    return re.sub(r'\.(?! )', '. ', re.sub(r' +', ' ', s))
s = input("> ")
print(corr(s))
I've changed the regex to a lookahead pattern, take a look here.
Edit: explain Regex as requested in comment
re.sub() takes (at least) three arguments: The Regex search pattern, the replacement the matched pattern should be replaced with, and the string in which the replacement should be done.
I'm doing two steps at once here, using the output of one function as the input of another.
First, the inner re.sub(r' +', ' ', s) searches for multiple spaces (r' +') in s to replace them with single spaces. Then the outer re.sub(r'\.(?! )', '. ', ...) looks for periods without a following space character to replace them with '. '. I'm using a negative lookahead pattern to match only periods that are not followed by the specified lookahead pattern (a normal space character in this case). You may want to play around with this pattern; this may help you understand it better.
The r string prefix makes the string a raw string, in which backslashes are not treated as escape characters. These patterns would still match without it, since '\.' is not a recognized string escape, but recent Python versions warn about such invalid escape sequences, so it's a habit of mine to use raw strings with regular expressions.
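Note that a negative lookahead such as \.(?! ) also succeeds at the very end of the string, so a final period gets a trailing space. If that is unwanted, a positive lookahead for a non-space character is one possible variation. A sketch comparing the two (corr_strict is a hypothetical name, not part of the answer above):

```python
import re

def corr(s):
    # Version from the answer: add a space after any period not followed by one
    return re.sub(r'\.(?! )', '. ', re.sub(r' +', ' ', s))

def corr_strict(s):
    # Variation: add a space only when a non-space character follows the period
    return re.sub(r'\.(?=\S)', '. ', re.sub(r' +', ' ', s))

print(repr(corr('This  is weird.Indeed.')))         # 'This is weird. Indeed. '
print(repr(corr_strict('This  is weird.Indeed.')))  # 'This is weird. Indeed.'
```

Both versions would still split decimals such as 3.14, so the pattern may need further tightening for text that contains numbers.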
For a more basic answer, without regex:
>>> def remove_doublespace(string):
...     if '  ' not in string:
...         return string
...     return remove_doublespace(string.replace('  ', ' '))
...
>>> remove_doublespace('hi there how are you.i am fine. '.replace('.', '. '))
'hi there how are you. i am fine. '
You can try the following code:
>>> import re
>>> s = 'This is weird.Indeed'
>>> def correction(s):
...     res = re.sub(r'\s+$', '', re.sub(r'\s+', ' ', re.sub(r'\.', '. ', s)))
...     if res[-1] != '.':
...         res += '.'
...     return res
...
>>> print(correction(s))
This is weird. Indeed.
>>> s = input()
hee ss.dk
>>> s
'hee ss.dk'
>>> correction(s)
'hee ss. dk.'
