Alternative approach to strip symbols in a string - python

I am working on a function that retains symbols inside a word (a word can consist of a-zA-Z, 0-9 and _) but removes every other symbol outside the word.
For example:
Input String - hell_o ? my name _ i's <hel'lo/>
Output - ['hell_o', 'my', 'name', '_', "i's", "hel'lo"]
The function I am using:
l = ' '.join(filter(None,(word.strip(punctuation.replace("_","")) for word in input_String.split())))
l = re.sub(r'\s+'," ",l)
t = str.split(l.lower())
I know this is not the best or most optimal way. Can anyone recommend alternatives that I can try? Probably a regex could do this?
I tried using:
negative look around and look behinds: \W+(?!\S*[a-z])|(?<!\S)\W+
s.strip(punctuation)
re.sub('[^\w]', ' ', doc.strip(' ').lower()) - This Removes punctuation inside the word too

As you mention, you can match any character other than a-zA-Z, 0-9 and _ between two letters with (?<=[a-z])\W(?=[a-z]) and replace it with nothing to remove it.
Be aware that this is a dangerous algorithm. For instance, in the sentence I'm fine.And you? there is no space after the dot, so the result contains fineAnd, which may not be what you want. Note also that \W matches whitespace and apostrophes, so unless you narrow the character class, those are removed between letters too.
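To see how aggressive the substitution is, here is a small sketch in Python. The second pattern is an assumed variant (not from the answer) that excludes whitespace and apostrophes from the match, which reproduces the fused fineAnd case while leaving the rest of the sentence alone.

```python
import re

sentence = "I'm fine.And you?"

# The pattern from the answer, with case-insensitive matching.
# \W also matches spaces and apostrophes, so those vanish as well.
broad = re.sub(r"(?<=[a-z])\W(?=[a-z])", "", sentence, flags=re.I)

# Assumed variant: do not treat whitespace or apostrophes as removable.
narrow = re.sub(r"(?<=[a-z])[^\w\s'](?=[a-z])", "", sentence, flags=re.I)

print(broad)   # ImfineAndyou?
print(narrow)  # I'm fineAnd you?
```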
[EDIT] after your comments.
Ok I misunderstood your question.
I have now come up with a regex that selects 'hell_o', 'my', 'name', "i's", "hel'lo":
(?<![a-z])[a-z][^\s]*[a-z](?![a-z])
You can see it working here: https://regex101.com/r/EAEelq/3. (don't forget the i and g flags).
[EDIT] As you also want to match the _ outside a word
OK, if you also want the underscores to be matched, update it as follows: (?<![a-z_])[a-z_][^\s]*[a-z_](?![a-z_])|(?<= )[a-z_](?= )
See it working here: https://regex101.com/r/EAEelq/4
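For reference, a minimal sketch of how the final pattern could be used from Python. Here re.IGNORECASE stands in for the i flag, and re.findall plays the role of the g flag; the names WORD and tokens are only illustrative.

```python
import re

# Final pattern from the answer, compiled case-insensitively.
WORD = re.compile(
    r"(?<![a-z_])[a-z_][^\s]*[a-z_](?![a-z_])|(?<= )[a-z_](?= )",
    re.IGNORECASE,
)

input_string = "hell_o ? my name _ i's <hel'lo/>"

# Lowercase first, as the original function does, then collect all matches.
tokens = WORD.findall(input_string.lower())
print(tokens)
```

Two limits are worth knowing: the character classes list only letters and underscore, so digits at the edge of a word are trimmed, and a single-letter word is only picked up when it has a space on both sides.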

Related

How to write regex to fix words composed of duplicate letters?

I scraped a few pdfs and some thick fonts get scraped as in this example:
text='and assesses oouurr rreeffoorrmmeedd tteeaacchhiinngg in the classroom'
instead of
"and assesses our reformed teaching in the classroom"
How to fix this? I am trying with regex
pattern=r'([a-z])(?=\1)'
re.sub(pattern,'',text)
#"and aseses our reformed teaching in the clasrom"
I am thinking of grouping the two groups above and add word boundaries
EDIT: this one fixes words with even number of letters:
pattern=r'([a-z])\1([a-z])\2'
re.sub(pattern,r'\1\2',text)
#"and assesses ourr reformed teaching in the classroom"
If letters are duplicated, you can try something like this
for w in text.split():
    if len(w) %2 != 0:
        print(w)
        continue
    if w[0::2] == w[1::2]:
        print(w[0::2])
        continue
    print(w)
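The same check can be wrapped in a function that returns the repaired text instead of printing word by word. This is only a sketch: dedupe_word is a hypothetical helper name, and it assumes words are separated by whitespace with no punctuation attached.

```python
def dedupe_word(word):
    """Collapse a word whose letters are each doubled (e.g. 'oouurr' -> 'our')."""
    # The characters at even positions must spell the same string as those at odd positions.
    if len(word) % 2 == 0 and word[0::2] == word[1::2]:
        return word[0::2]
    return word

text = 'and assesses oouurr rreeffoorrmmeedd tteeaacchhiinngg in the classroom'
fixed = ' '.join(dedupe_word(w) for w in text.split())
print(fixed)
```

Note that a genuine word made only of doubled letters (for example "aa") would be collapsed as well, so the heuristic may need a dictionary check for real data.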
I am using a mixed approach: build the pattern and the substitution in a for loop, then apply the regex. The regexes applied go from, e.g., words of 8x2=16 letters down to 3.
import re
text = 'and assesses oouurr rreeffoorrmmeedd tteeaacchhiinngg in the classroom'
wrd_len = [9,8,7,6,5,4,3,2]
for l in wrd_len:
    sub = '\\' + '\\'.join(map(str,range(1,l+1)))
    pattern = '([a-z])\\' + '([a-z])\\'.join(map(str,range(1,l+1)))
    text = re.sub(pattern, sub , text)
text
#and assesses our reformed teaching in the classroom
For example, the regex for 3-letter words becomes:
re.sub(r'([a-z])\1([a-z])\2([a-z])\3', r'\1\2\3', text)
As a side note, I could not get those backslashes right with raw strings, and I am actually going to use [a-zA-Z].
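One possible way to get the backslashes right with raw strings is to format the group number into a raw template. The sketch below assumes the [a-zA-Z] class mentioned above; build_pattern is a hypothetical helper name, not part of the original code.

```python
import re

def build_pattern(n):
    """Return (pattern, replacement) for n consecutive doubled letters."""
    # In r"\%d" the backslash stays literal; % then fills in the group number.
    pattern = "".join(r"([a-zA-Z])\%d" % i for i in range(1, n + 1))
    replacement = "".join(r"\%d" % i for i in range(1, n + 1))
    return pattern, replacement

text = 'and assesses oouurr rreeffoorrmmeedd tteeaacchhiinngg in the classroom'
for n in range(9, 1, -1):  # same lengths as wrd_len: 9 down to 2
    pattern, replacement = build_pattern(n)
    text = re.sub(pattern, replacement, text)
print(text)
```

Going from the longest pattern to the shortest matters here: a short pattern applied first would partially rewrite longer words and leave a stray doubled letter behind.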
I found a solution in JavaScript that works fine:
([a-z])\1(?:(?=([a-z])\2)|(?<=\3([a-z])\1\1))
but somehow it doesn't work in Python, because the lookbehind can't take these group references, so I came up with another solution that works for this example:
([a-z])\1(?:(?=([a-z])\2)|(?=[^a-z]))
try it here
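A short sketch of how the second pattern might be applied in Python with re.sub, replacing each matched pair with its single letter. The names PAIR, fixed and tail are illustrative only.

```python
import re

# A doubled letter that is followed by another doubled letter or by a non-letter.
PAIR = re.compile(r"([a-z])\1(?:(?=([a-z])\2)|(?=[^a-z]))")

text = 'and assesses oouurr rreeffoorrmmeedd tteeaacchhiinngg in the classroom'
fixed = PAIR.sub(r"\1", text)
print(fixed)

# Caveat: (?=[^a-z]) needs an actual character to look at, so a doubled
# pair at the very end of the string is left untouched.
tail = PAIR.sub(r"\1", "oouurr")
print(tail)
```

If the text can end with a scraped word, adding an end-of-string alternative to the second lookahead would be one way to cover that case.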

How to split strings with special characters without removing those characters?

I'm writing a function that needs to return an abbreviated version of a str. The returned str must contain the first letter, the number of characters removed, and the last letter. It must be abbreviated per word, not per sentence, and afterwards I need to join the words again in the same format, including the special characters. I tried the re.findall() method, but it drops the special characters, so " ".join() cannot restore them.
Here's my code:
import re
def abbreviate(wrd):
    return " ".join([i if len(i) < 4 else i[0] + str(len(i[1:-1])) + i[-1] for i in re.findall(r"[\w']+", wrd)])
print(abbreviate("elephant-rides are really fun!"))
The output would be:
e6t r3s are r4y fun
But the output should be:
e6t-r3s are r4y fun!
No need for str.join. Might as well take full advantage of what the re module has to offer.
re.sub accepts a string or a callable object (like a function or lambda), which takes the current match as an input and must return a string with which to replace the current match.
import re
pattern = "\\b[a-z]([a-z]{2,})[a-z]\\b"
string = "elephant-rides are really fun!"
def replace(match):
    return f"{match.group(0)[0]}{len(match.group(1))}{match.group(0)[-1]}"
abbreviated = re.sub(pattern, replace, string)
print(abbreviated)
Output:
e6t-r3s are r4y fun!
Maybe someone else can improve upon this answer with a cuter pattern, or any other suggestions. The way the pattern is written now, it assumes that you're only dealing with lowercase letters, so that's something to keep in mind - but it should be pretty straightforward to modify it to suit your needs. I'm not really a fan of the repetition of [a-z], but that's just the quickest way I could think of for capturing the "inner" characters of a word in a separate capturing group. You may also want to consider what should happen with words/contractions like "don't" or "shouldn't".
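As one possible modification along those lines, the sketch below treats any run of ASCII letters, upper or lower case, as a word. That definition is an assumption, and the names WORD and shorten are illustrative. It also shows what happens to a contraction under this choice.

```python
import re

# Assumption: a "word" is any run of four or more ASCII letters.
WORD = re.compile(r"[A-Za-z]{4,}")

def shorten(match):
    word = match.group(0)
    # first letter + count of inner letters + last letter
    return f"{word[0]}{len(word) - 2}{word[-1]}"

abbreviated = WORD.sub(shorten, "Elephant-rides are REALLY fun!")
print(abbreviated)

# The apostrophe splits the contraction, so only "shouldn" is abbreviated.
contraction = WORD.sub(shorten, "They shouldn't")
print(contraction)
```

Whether "shouldn't" should become "s5n't" or count the apostrophe as part of the word is a design decision the pattern has to encode explicitly.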
Thank you for viewing my question. After a few more searches, trial, and error I finally found a way to execute my code properly without changing it too much. I simply substituted re.findall(r"[\w']+", wrd) with re.split(r'([\W\d\_])', wrd) and also removed the whitespace in "".join() for they were simply not needed anymore.
import re
def abbreviate(wrd):
    return "".join([i if len(i) < 4 else i[0] + str(len(i[1:-1])) + i[-1] for i in re.split(r'([\W\d\_])', wrd)])
print(abbreviate("elephant-rides are not fun!"))
Output:
e6t-r3s are not fun!
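The reason this works is that re.split keeps the separators in the result when the pattern contains a capturing group. A small sketch of what the split returns (the underscore is written without the backslash here, which is equivalent):

```python
import re

# The parentheses make re.split return the separators too.
parts = re.split(r"([\W\d_])", "elephant-rides are not fun!")
print(parts)

# Joining with the empty string restores the original text exactly.
rebuilt = "".join(parts)
print(rebuilt)
```

Because digits and underscores are also in the character class, each of them is treated as a separator, so a token such as "abc123def" would be abbreviated in pieces.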

Capitalizing the beginning of sentences in Python

The following code is for an assignment that asks that a string of sentences is entered from a user and that the beginning of each sentence is capitalized by a function.
For example, if a user enters: 'hello. these are sample sentences. there are three of them.'
The output should be: 'Hello. These are sample sentences. There are three of them.'
I have created the following code:
def main():
    sentences = input('Enter sentences with lowercase letters: ')
    capitalize(sentences)
#This function capitalizes the first letter of each sentence
def capitalize(user_sentences):
    sent_list = user_sentences.split('. ')
    new_sentences = []
    count = 0
    for count in range(len(sent_list)):
        new_sentences = sent_list[count]
        new_sentences = (new_sentences +'. ')
        print(new_sentences.capitalize())
main()
This code has two issues that I am not sure how to correct. First, it prints each sentence as a new line. Second, it adds an extra period at the end. The output from this code using the sample input from above would be:
Hello.
These are sample sentences.
There are three of them..
Is there a way to format the output to be one line and remove the final period?
The following works for reasonably clean input:
>>> s = 'hello. these are sample sentences. there are three of them.'
>>> '. '.join(x.capitalize() for x in s.split('. '))
'Hello. These are sample sentences. There are three of them.'
If there is more varied whitespace around the full-stop, you might have to use some more sophisticated logic:
>>> '. '.join(x.strip().capitalize() for x in s.split('.'))
Which normalizes the whitespace which may or may not be what you want.
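If the whitespace should be left exactly as entered, one alternative (not from the answers here) is to uppercase only the first letter of each sentence with re.sub. This sketch assumes sentences end in ., ! or ? followed by whitespace; capitalize_sentences is a hypothetical name. Unlike str.capitalize, it does not lowercase the rest of the sentence.

```python
import re

def capitalize_sentences(text):
    """Uppercase the first letter of the text and of each sentence."""
    # Group 1: start of text, or end punctuation plus the whitespace after it.
    # Group 2: the lowercase letter that starts the sentence.
    return re.sub(
        r"(^|[.!?]\s+)([a-z])",
        lambda m: m.group(1) + m.group(2).upper(),
        text,
    )

s = 'hello. these are sample sentences.  there are three of them.'
result = capitalize_sentences(s)
print(result)

# Names inside a sentence keep their capital letter.
kept = capitalize_sentences('i met John. he left.')
print(kept)
```

Abbreviations such as "e.g. this" would still be capitalized after the dot, so the pattern is only suitable for fairly clean input.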
def main():
    sentences = input('Enter sentences with lowercase letters: ')
    capitalizeFunc(sentences)
def capitalizeFunc(user_sentences):
    sent_list = user_sentences.split('. ')
    print(". ".join((i.capitalize() for i in sent_list)))
main()
Output:
Enter sentences with lowercase letters: hello. these are sample sentences. there are three of them.
Hello. These are sample sentences. There are three of them.
I think this might be helpful:
>>> sentence = input()
>>> '. '.join(map(lambda s: s.strip().capitalize(), sentence.split('.')))
>>> s = 'hello. these are sample sentences. there are three of them.'
>>> '. '.join(map(str.capitalize, s.split('. ')))
'Hello. These are sample sentences. There are three of them.'
This code has two issues that I am not sure how to correct. First, it prints each sentence as a new line.
That’s because you’re printing each sentence with a separate call to print. By default, print adds a newline. If you don’t want it to, you can override what it adds with the end keyword parameter. If you don’t want it to add anything at all, just use end=''
Second, it adds an extra period at the end.
That’s because you’re explicitly adding a period to every sentence, including the last one.
One way to fix this is to keep track of the index as well as the sentence as you’re looping over them—e.g., with for index, sentence in enumerate(sentences):. Then you only add the period if index isn’t the last one. Or, slightly more simply, you add the period at the start, if the index is anything but zero.
However, there’s a better way out of both of these problems. You split the string into sentences by splitting on '. '. You can join those sentences back into one big string by doing the exact opposite:
sentences = '. '.join(sentences)
Then you don’t need a loop (there’s one hidden inside join of course), you don’t need to worry about treating the last or first one special, and you only have one print instead of a bunch of them so you don’t need to worry about end.
A different trick is to put the cleverness of print to work for you instead of fighting it. Not only does it add a newline at the end by default, it also lets you print multiple things and adds a space between them by default. For example, print(1, 2, 3) or, equivalently, print(*[1, 2, 3]) will print out 1 2 3. And you can override that space separator with anything else you want. So you can print(*sentences, sep='. ', end='') to get exactly what you want in one go. However, this may be a bit opaque or over-clever to people reading your code. Personally, whenever I can use join instead (which is usually), I do that even though it’s a bit more typing, because it makes it more obvious what’s happening.
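Both options side by side, as a minimal sketch using the sample input from the question (the variable names are illustrative):

```python
sent_list = 'hello. these are sample sentences. there are three of them.'.split('. ')
capitalized = [s.capitalize() for s in sent_list]

# Option 1: build one string with join, then print it once.
joined = '. '.join(capitalized)
print(joined)

# Option 2: let print insert the separator between the unpacked items.
print(*capitalized, sep='. ')
```

Both calls print the same single line; the join version also leaves you with a string you can return or reuse.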
As a side note, a bit of your code is misleading:
new_sentences = []
count = 0
for count in range(len(sent_list)):
    new_sentences = sent_list[count]
    new_sentences = (new_sentences +'. ')
    print(new_sentences.capitalize())
The logic of that loop is fine, but it would be a lot easier to understand if you called the one-new-sentence variable new_sentence instead of new_sentences, and didn’t set it to an empty list at the start. As it is, the reader is led to expect that you’re going to build up a list of new sentences and then do something with it, but actually you just throw that list away at the start and handle each sentence one by one.
And, while we’re at it, you don’t need count here; just loop over sent_list directly:
for sentence in sent_list:
    new_sentence = sentence + '. '
    print(new_sentence.capitalize())
This does the same thing as the code you had, but I think it’s easier to understand that it does that thing from a quick glance.
(Of course you still need the fixes for your two problems.)
Use nltk.sent_tokenize to split the string into sentences, capitalize each sentence, and join them again.
A sentence doesn't always end with a period; it can also end with ? or !. Also, three consecutive dots (...) don't necessarily end a sentence. sent_tokenize is designed to handle these cases.
from nltk.tokenize import sent_tokenize
def capitalize(user_sentences):
    sents = sent_tokenize(user_sentences)
    capitalized_sents = [sent.capitalize() for sent in sents]
    joined_ = ' '.join(capitalized_sents)
    print(joined_)
Your sentences were printed on separate lines because print always ends its output with a newline, so printing the sentences one by one in a loop puts each on its own line. Print them all at once after joining them, or pass end='' to print so it does not append a newline.
The extra period at the end appears because you append '. ' to every sentence. The good thing about sent_tokenize is that it doesn't remove '.', '?', etc. from the end of the sentences, so you don't have to append '. ' again manually. Instead, just join the sentences with a space character.
If you get an error for nltk not being recognized, you can install it by running pip install nltk on the terminal/cmd.

How to use multiple 'if' statements nested inside an enumerator?

I have a massive string of letters all jumbled up, 1.2k lines long.
I'm trying to find a lowercase letter that has EXACTLY three capital letters on either side of it.
This is what I have so far
def scramble(sentence):
    try:
        for i,v in enumerate(sentence):
            if v.islower():
                if sentence[i-4].islower() and sentence[i+4].islower():
                    ....
                    ....
    except IndexError:
        print() #Trying to deal with the problem of reaching the end of the list
#This section is checking if the fourth letters before and after i are lowercase
#to ensure the central lower case letter has exactly three upper case letters around it
But now I am stuck with the next step. What I would like to achieve is create a for-loop in range of (-3,4) and check that each of these letters is uppercase. If in fact there are three uppercase letters either side of the lowercase letter then print this out.
For example
for j in range(-3,4):
    if j != 0:
        #Some code to check if the letters in this range are uppercase
#if j != 0 is there because we already know it is lowercase
#because of the previous if v.islower(): statement.
If this doesn't make sense, this would be an example output if the code worked as expected
scramble("abcdEFGhIJKlmnop")
OUTPUT
EFGhIJK
One lowercase letter with three uppercase letters either side of it.
Here is a way to do it "Pythonically" without regular expressions:
s = 'abcdEFGhIJKlmnop'
words = [s[i:i+7] for i in range(len(s) - 6) if s[i:i+3].isupper() and s[i+3].islower() and s[i+4:i+7].isupper()]
print(words)
And the output is:
['EFGhIJK']
And here is a way to do it with regular expressions, which is, well, also Pythonic :-)
import re
words = re.findall(r'[A-Z]{3}[a-z][A-Z]{3}', s)
If you can't use regular expressions, maybe this for loop can do the trick:
if v.islower():
    if sentence[i-4].islower() and sentence[i+4].islower():
        for k in range(1,4):
            if sentence[i-k].islower() or sentence[i+k].islower():
                break
            if k == 3:
                return i
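A self-contained sketch of the loop idea that avoids IndexError and negative-index wraparound by checking the bounds explicitly. It uses the range(-3, 4) check from the question; find_guarded is a hypothetical name, and "exactly three" is interpreted as "the fourth neighbour on each side, if any, is not uppercase".

```python
def find_guarded(sentence):
    """Return each window of 3 uppercase, 1 lowercase, 3 uppercase letters
    that is not flanked by a further uppercase letter."""
    results = []
    n = len(sentence)
    for i, ch in enumerate(sentence):
        if not ch.islower():
            continue
        # Need three characters on both sides of position i.
        if i < 3 or i + 3 >= n:
            continue
        # Offsets -3..3 except 0 must all be uppercase.
        if not all(sentence[i + j].isupper() for j in range(-3, 4) if j != 0):
            continue
        # The fourth neighbour may be missing, but must not be uppercase.
        before_ok = i - 4 < 0 or not sentence[i - 4].isupper()
        after_ok = i + 4 >= n or not sentence[i + 4].isupper()
        if before_ok and after_ok:
            results.append(sentence[i - 3:i + 4])
    return results

print(find_guarded("abcdEFGhIJKlmnop"))
```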
Regex is probably the easiest. Using a modified version of Israel Unterman's answer to account for the outside edges and non-uppercase surroundings, the full regex might be:
s = 'abcdEFGhIJKlmnopABCdEFGGIddFFansTBDgRRQ'
import re
words = re.findall(r'(?:^|[^A-Z])([A-Z]{3}[a-z][A-Z]{3})(?:[^A-Z]|$)', s)
# words is ['EFGhIJK', 'TBDgRRQ']
Using non-capturing (?:...) groups keeps the search for the beginning of the line or a non-uppercase character from being included in the match groups, leaving only the desired tokens in the result list. This should account for all conditions listed by the OP.
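A possible variant, offered as a sketch rather than as part of the answer, uses lookarounds instead of groups that consume the neighbouring character. The difference shows up when two hits are separated by a single character, because a consumed separator cannot serve as the left edge of the next match.

```python
import re

s = 'abcdEFGhIJKlmnopABCdEFGGIddFFansTBDgRRQ'

# Lookarounds inspect the neighbours without consuming them.
GUARDED = re.compile(r'(?<![A-Z])[A-Z]{3}[a-z][A-Z]{3}(?![A-Z])')
words = GUARDED.findall(s)
print(words)

# The pattern from the answer, for comparison on a tighter input.
CONSUMING = re.compile(r'(?:^|[^A-Z])([A-Z]{3}[a-z][A-Z]{3})(?:[^A-Z]|$)')
close = 'ABCdEFGxHIJkLMN'
print(GUARDED.findall(close))
print(CONSUMING.findall(close))
```

On the sample string both patterns agree; on the second input the consuming pattern finds only the first token, since the x has already been used up.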
(removed all my prior code as it was generally *bad*)

python: dictionary of words and wordforms

I have the following problem: I created a (German) dictionary that maps words to their corresponding lemma. Example:
"Lagerbestände", "Lager-bestand"; "Wohnhäuser", "Wohn-haus"; "Bahnhof", "Bahn-hof"
I now have a text and I want to look up the lemma of every word. A word may appear that is not in the dict, such as "Restbestände". But we already know the lemma of "bestände". So I want to take the first part of the word, which is unknown in dicti, append the lemmatized second part, and print this out (or return it).
Example: "Restbestände" --> "Rest-bestand" ("bestand" is taken from the lemma of "Lagerbestände").
I coded the following:
for limit in range(1, len(Word)):
    for k, v in dicti.iteritems():
        if re.search('[\w]*'+Word[limit:], k, re.IGNORECASE) != None:
            if '-' in v:
                tmp = v.find('-')
                end = v[tmp:]
                end = re.sub(ur'[-]',"", end)
                Word = Word[:limit] + '-' + end
But I got 2 problems:
At the end of the words, it is printed out every time "&#10". How can I avoid this?
The second part of the word is sometimes not correct - there must be a logical error.
How would you solve this?
"At the end of the words, it is printed out every time "&#10". How can I avoid this?"
You must use Unicode everywhere in your script. Everywhere, everywhere, everywhere.
Also, Python regex functions accept the re.UNICODE flag, which you should always set in Python 2 (in Python 3, str patterns are Unicode-aware by default). German letters are outside the ASCII set, so the regex can sometimes be confused, for instance when matching r'\w'.
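For the lookup itself, here is a rough Python 3 sketch of the suffix idea from the question, using str.endswith instead of a regex. Everything in it is an assumption for illustration: guess_lemma and min_suffix are hypothetical names, and the strip() call is a guess based on &#10 being the numeric character reference for a line feed, which suggests the words still carry a trailing newline.

```python
def guess_lemma(word, lemmas, min_suffix=3):
    """Guess a hyphenated lemma for an unknown compound (hypothetical helper)."""
    word = word.strip()  # drop a trailing newline, if any
    if word in lemmas:
        return lemmas[word]
    lowered = word.lower()
    # Try the longest suffix first; stop before the suffix gets too short.
    for limit in range(1, len(word) - min_suffix + 1):
        suffix = lowered[limit:]
        for known, lemma in lemmas.items():
            if known.lower().endswith(suffix) and '-' in lemma:
                # Reuse the part of the known lemma after the last hyphen.
                tail = lemma.rsplit('-', 1)[1]
                return word[:limit] + '-' + tail
    return word

lemmas = {
    "Lagerbestände": "Lager-bestand",
    "Wohnhäuser": "Wohn-haus",
    "Bahnhof": "Bahn-hof",
}
print(guess_lemma("Restbestände\n", lemmas))
```

The sketch returns on the first (longest) matching suffix instead of rewriting Word inside the loop, which may be the source of the logical error in the original. It does not verify that the matched suffix lines up with the second component of the known lemma, so short suffixes can still produce wrong guesses.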
