Remove characters from beginning and end or only end of line - python

I want to remove some symbols from a string using a regular expression, for example:
== (that occur both at the beginning and at the end of a line),
* (at the beginning of a line ONLY).
def some_func():
clean = re.sub(r'= {2,}', '', clean) #Removes 2 or more occurrences of = at the beg and at the end of a line.
clean = re.sub(r'^\* {1,}', '', clean) #Removes 1 or more occurrences of * at the beginning of a line.
What's wrong with my code? It seems like expressions are wrong. How do I remove a character/symbol if it's at the beginning or at the end of the line (with one or more occurrences)?

If you only want to remove characters from the beginning and the end, you could use the string.strip() method. This would give some code like this:
>>> s1 = '== foo bar =='
>>> s1.strip('=')
' foo bar '
>>> s2 = '* foo bar'
>>> s2.lstrip('*')
' foo bar'
The strip method removes the characters given in the argument from the beginning and the end of the string, ltrip removes them from only the beginning, and rstrip removes them only from the end.
If you really want to use a regular expression, they would look something like this:
clean = re.sub(r'(^={2,})|(={2,}$)', '', clean)
clean = re.sub(r'^\*+', '', clean)
But IMHO, using strip/lstrip/rstrip would be the most appropriate for what you want to do.
Edit: On Nick's suggestion, here is a solution that would do all this in one line:
clean = clean.lstrip('*').strip('= ')
(A common mistake is to think that these methods remove characters in the order they're given in the argument, in fact, the argument is just a sequence of characters to remove, whatever their order is, that's why the .strip('= ') would remove every '=' and ' ' from the beginning and the end, and not just the string '= '.)

You have extra spaces in your regexs. Even a space counts as a character.
r'^(?:\*|==)|==$'

First of all you should pay attention to the spaces before "{" ... those are meaningful so the quantifier in your example applies to the space.
To remove "=" (two or more) only at begin or end also you need a different regexp... for example
clean = re.sub(r'^(==+)?(.*?)(==+)?$', r'\2', s)
If you don't put either "^" or "$" the expression can match anywhere (i.e. even in the middle of the string).

And not substituting but keeping ? :
tu = ('======constellation==' , '==constant=====' ,
'=flower===' , '===bingo=' ,
'***seashore***' , '*winter*' ,
'====***conditions=**' , '=***trees====***' ,
'***=information***=' , '*=informative***==' )
import re
RE = '((===*)|\**)?(([^=]|=(?!=+\Z))+)'
pat = re.compile(RE)
for ch in tu:
print ch,' ',pat.match(ch).group(3)
Result:
======constellation== constellation
==constant===== constant
=flower=== =flower
===bingo= bingo=
***seashore*** seashore***
*winter* winter*
====***conditions=** ***conditions=**
=***trees====*** =***trees====***
***=information***= =information***=
*=informative***== =informative***
Do you want in fact
====***conditions=** to give conditions=** ?
***====hundred====*** to give hundred====*** ?
for the beginning ?**

I think that the following code will do the job:
tu = ('======constellation==' , '==constant=====' ,
'=flower===' , '===bingo=' ,
'***seashore***' , '*winter*' ,
'====***conditions=**' , '=***trees====***' ,
'***=information***=' , '*=informative***==' )
import re,codecs
with codecs.open('testu.txt', encoding='utf-8', mode='w') as f:
pat = re.compile('(?:==+|\*+)?(.*?)(?:==+)?\Z')
xam = max(map(len,tu)) + 3
res = '\n'.join(ch.ljust(xam) + pat.match(ch).group(1)
for ch in tu)
f.write(res)
print res
Where was my brain when I wrote the RE in my earlier post ??! O!O
Non greedy quantifier .*? before ==+\Z is the real solution.

Related

How to delete whitespace in only a certain part of the string

So let's say I have this string like
string = 'abcd <# string that has whitespace> efgh'
And I want to delete all the white space inside this <#...> And not affect anything outside <#...>
But the characters outside <#...> can change too so the <#...> is not going to be in a fixed position.
How should I do this?
This is not a complicated operation. You just do it like you would as a human being. Find the two delimiters, keep the part before the first one, remove space from the middle, keep the rest.
string = 'abcd <# string that has whitespace> efgh'
i1 = string.find('<#')
i2 = string.find('>')
res = string[:i1] + string[i1:i2].replace(' ','') + string[i2:]
print(res)
Output:
abcd <#stringthathaswhitespace> efgh
How about this...
string = 'abcd <# string that has whitespace> efgh'
s = string.split()
s = ' '.join( (s[0], ''.join(s[1:-1]), s[-1]) )
If <#...> exists consistently, one method to find the string is use regular expressions (regex) to search for that part of the string with the charactersyou want to modify. You then need to strip out the white space.
It takes a bit to get your head around regex, but they can be powerful tool.
Regex

Trying to remove all punctuation characters from a string but everything I keep getting // left in

I am trying to write a function to remove all punctuation characters from a string. I've tried several permutations on translate, replace, strip, etc. My latest attempt uses a brute force approach:
def clean_lower(sample):
punct = list(string.punctuation)
for c in punct:
sample.replace(c, ' ')
return sample.split()
That gets rid of almost all of the punctuation but I'm left with // in front of one of the words. I can't seem to find any way to remove it. I've even tried explicitly replacing it with sample.replace('//', ' ').
What do I need to do?
using translate is the fastest way to remove punctuations, this will remove // too:
import string
s = "This is! a string, with. punctuations? //"
def clean_lower(s):
return s.translate(str.maketrans('', '', string.punctuation))
s = clean_lower(s)
print(s)
Use regular expressions
import re
def clean_lower(s):
return(re.sub(r'\W','',s))
Above function erases any symbols except underscore
Perhaps you should approach it from the perspective of what you want to keep:
For example:
import string
toKeep = set(string.ascii_letters + string.digits + " ")
toRemove = set(string.printable) - toKeep
cleanUp = str.maketrans('', '', "".join(toRemove))
usage:
s = "Hello! world of / and dice".translate(cleanUp)
# s will be 'Hello world of and dice'
as suggested by #jasonharper you need to redefine "sample" and it should work:
import string
sample='// Hello?) // World!'
print(sample)
punct=list(string.punctuation)
for c in punct:
sample=sample.replace(c,'')
print(sample.split())

Capture ALL strings within a Python script with regex

This question was inspired by my failed attempts after trying to adapt this answer: RegEx: Grabbing values between quotation marks
Consider the following Python script (t.py):
print("This is also an NL test")
variable = "!\n"
print('And this has an escaped quote "don\'t" in it ', variable,
"This has a single quote ' but doesn\'t end the quote as it" + \
" started with double quotes")
if "Foo Bar" != '''Another Value''':
"""
This is just nonsense
"""
aux = '?'
print("Did I \"failed\"?", f"{aux}")
I want to capture all strings in it, as:
This is also an NL test
!\n
And this has an escaped quote "don\'t" in it
This has a single quote ' but doesn\'t end the quote as it
started with double quotes
Foo Bar
Another Value
This is just nonsense
?
Did I \"failed\"?
{aux}
I wrote another Python script using re module and, from my attempts into regex, the one which finds most of them is:
import re
pattern = re.compile(r"""(?<=(["']\b))(?:(?=(\\?))\2.)*?(?=\1)""")
with open('t.py', 'r') as f:
msg = f.read()
x = pattern.finditer(msg, re.DOTALL)
for i, s in enumerate(x):
print(f'[{i}]',s.group(0))
with the following result:
[0] And this has an escaped quote "don\'t" in it
[1] This has a single quote ' but doesn\'t end the quote as it started with double quotes
[2] Foo Bar
[3] Another Value
[4] Did I \"failed\"?
To improve my failures, I couldn't also fully replicate what I can found with regex101.com:
I'm using Python 3.6.9, by the way, and I'm asking for more insights into regex to crack this one.
Because you want to match ''' or """ or ' or " as the delimiter, put all of that into the first group:
('''|"""|["'])
Don't put \b after it, because then it won't match strings when those strings start with something other than a word character.
Because you want to make sure that the final delimiter isn't treated as a starting delimiter when the engine starts the next iteration, you'll need to fully match it (not just lookahead for it).
The middle part to match anything but the delimiter can be:
((?:\\.|.)*?)
Put it all together:
('''|"""|["'])((?:\\.|.)*?)\1
and the result you want will be in the second capture group:
pattern = re.compile(r"""(?s)('''|\"""|["'])((?:\\.|.)*?)\1""")
with open('t.py', 'r') as f:
msg = f.read()
x = pattern.finditer(msg)
for i, s in enumerate(x):
print(f'[{i}]',s.group(2))
https://regex101.com/r/dvw0Bc/1

Having trouble adding a space after a period in a python string

I have to write a code to do 2 things:
Compress more than one occurrence of the space character into one.
Add a space after a period, if there isn't one.
For example:
input> This is weird.Indeed
output>This is weird. Indeed.
This is the code I wrote:
def correction(string):
list=[]
for i in string:
if i!=" ":
list.append(i)
elif i==" ":
k=i+1
if k==" ":
k=""
list.append(i)
s=' '.join(list)
return s
strn=input("Enter the string: ").split()
print (correction(strn))
This code takes any input by the user and removes all the extra spaces,but it's not adding the space after the period(I know why not,because of the split function it's taking the period and the next word with it as one word, I just can't figure how to fix it)
This is a code I found online:
import re
def correction2(string):
corstr = re.sub('\ +',' ',string)
final = re.sub('\.','. ',corstr)
return final
strn= ("This is as .Indeed")
print (correction2(strn))
The problem with this code is I can't take any input from the user. It is predefined in the program.
So can anyone suggest how to improve any of the two codes to do both the functions on ANY input by the user?
Is this what you desire?
import re
def corr(s):
return re.sub(r'\.(?! )', '. ', re.sub(r' +', ' ', s))
s = input("> ")
print(corr(s))
I've changed the regex to a lookahead pattern, take a look here.
Edit: explain Regex as requested in comment
re.sub() takes (at least) three arguments: The Regex search pattern, the replacement the matched pattern should be replaced with, and the string in which the replacement should be done.
What I'm doing here is two steps at once, I've been using the output of one function as input of another.
First, the inner re.sub(r' +', ' ', s) searches for multiple spaces (r' +') in s to replace them with single spaces. Then the outer re.sub(r'\.(?! )', '. ', ...) looks for periods without following space character to replace them with '. '. I'm using a negative lookahead pattern to match only sections, that don't match the specified lookahead pattern (a normal space character in this case). You may want to play around with this pattern, this may help understanding it better.
The r string prefix changes the string to a raw string where backslash-escaping is disabled. Unnecessary in this case, but it's a habit of mine to use raw strings with regular expressions.
For a more basic answer, without regex:
>>> def remove_doublespace(string):
... if ' ' not in string:
... return string
... return remove_doublespace(string.replace(' ',' '))
...
>>> remove_doublespace('hi there how are you.i am fine. '.replace('.', '. '))
'hi there how are you. i am fine. '
You try the following code:
>>> s = 'This is weird.Indeed'
>>> def correction(s):
res = re.sub('\s+$', '', re.sub('\s+', ' ', re.sub('\.', '. ', s)))
if res[-1] != '.':
res += '.'
return res
>>> print correction(s)
This is weird. Indeed.
>>> s=raw_input()
hee ss.dk
>>> s
'hee ss.dk'
>>> correction(s)
'hee ss. dk.'

Python string clean up: How can I remove the excess of newlines of a string in python?

I'm creating a Python/Django app and I need to clean up a string, but the main problem I have is that the string has too many line breaks in some parts. I don't want to delete all the line breaks, just the excess of them. How can I archive this in python? I'm using Python 2.7 and Django 1.6
A regexp is one way. Using your updated sample input:
>>> a = "This is my sample text.\r\n\r\n\r\n\r\n\r\n Here start another sample text"
>>> import re
>>> re.sub(r'(\r\n){2,}','\r\n', a)
'This is my sample text.\r\n Here start another sample text'
r'(\r\n)+' would work too, I just like using the 2+ lower bound to avoid some replacements of singleton \r\n substrings with the same substring.
Or you can use the splitlines method on the string and rejoin after filtering:
>>> '\r\n'.join(line for line in a.splitlines() if line)
As an example, if you know what you want to replace:
>>> a = 'string with \n a \n\n few too many\n\n\n lines'
>>> a.replace('\n'*2, '\n') # Replaces \n\n with just \n
'string with \n a \n few too many\n\n lines'
>>> a.replace('\n'*3, '') # Replaces \n\n\n with nothing...
'string with \n a \n\n few too many lines'
Or, using regular expression to find what you want
>>> import re
>>> re.findall(r'.*([\n]+).*', a)
['\n', '\n\n', '\n\n\n']
import re
a = 'string with \n a \n\n few too many\n\n\n lines'
re.sub('\n+', '\n', a)
To use a regex to replace multiple occurrences of newline with a single one (or something else you prefer such as a period, tab or whatever), try:
import re
testme = 'Some text.\nSome more text.\n\nEven more text.\n\n\n\n\nThe End'
print re.sub('\n+', '\n', testme)
Note that '\n' is a single-character (a newline), not two characters (literal backslash and 'n').
You can of course compile the regex in advance if you intend to re-use it:
pattern = re.compile('\n+')
print pattern.sub('\n', testme)
Best I could do, but Peter DeGlopper's was better.
import re
s = '\n' * 9 + 'abc' + '\n'*10
# s == '\n\n\n\n\n\n\n\n\nabc\n\n\n\n\n\n\n\n\n\n\n'
lines = re.compile('\n+')
excess_lines = lines.findall(s)
# excess_lines == ['\n' * 9, '\n' * 10]
# I feel as though there is a better way, but this works
def cmplen(first, second):
'''
Function to order strings in descending order by length
Needed so that we replace longer strings of new lines first
'''
if len(first) < len(second):
return 1
elif len(first) > len(second):
return -1
else:
return 0
excess_lines.sort(cmp=cmplen)
# excess_lines == ['\n' * 10, '\n' * 9]
for lines in excess_lines:
s = s.replace(lines, '\n')
# s = '\nabc\n'
This solution feels dirty and inelegant, but it works. You need to sort by string length because if you have a string '\n\n\n aaaaaaa \n\n\n\n' and do a replace(), the \n\n\n will replace \n\n\n\n with \n\n, and not be caught later on.

Categories

Resources