Can anyone show me how to remove spaces before and after a hyphen? The code below works on removing the space fore the hyphen but I need to removed before and after the hyphen.
#!/usr/bin/python3
import re
test_strings = ["(1973) -trailer.mp4", "(1973)- fanart.jpg", "(1973) - poster.jpg"]
for i in test_strings:
res = re.sub(' +-', '-', i)
print (res)
You could probably do it in a single regex, but since I'm sometimes lazy, just chain two together, like:
#!/usr/bin/python3
import re
test_strings = ["(1973) -trailer.mp4", "(1973)- fanart.jpg", "(1973) - poster.jpg"]
for i in test_strings:
res = re.sub('- +','-', re.sub(' +-', '-', i))
print (res)
Edit: MrGeek has a better answer in a comment.
You can use pattern \s*-\s* where \s to represent any whitespace character, if you strictly want to use space then you can just use space character in the pattern i.e. *- *.
>>> pattern = re.compile('\s*-\s*')
>>> [pattern.sub('-', item) for item in test_strings]
#output:
['(1973)-trailer.mp4', '(1973)-fanart.jpg', '(1973)-poster.jpg']
Related
I have a string with multiple newline symbols:
text = 'foo\na\nb\n$\n\nxz\nbar'
I want to remove the lines that are shorter than 3 symbols. The desired output is
'foo\n\nbar'
I tried
re.sub(r'(\n([\s\S]{0,2})\n)+', '\nX\n', text, flags= re.S)
but this matches only some subset of the string and the result is
'foo\nX\nb\nX\nxz\nbar'
I need somehow to do greedy search and replace the longest string matching the pattern.
re.S makes . match everything including newline, and you don't want that. Instead use re.M so ^ matches beginning of string and after newline, and use:
>>> import re
>>> text = 'foo\na\nb\n$\n\nxz\nbar'
>>> re.findall('(?m)^.{0,2}\n',text)
['a\n', 'b\n', '$\n', '\n', 'xz\n']
>>> re.sub('(?m)^.{0,2}\n','',text)
'foo\nbar'
That's "from start of a line, match 0-2 non-newline characters, followed by a newline".
I noticed your desired output has a \n\n in it. If that isn't a mistake use .{1,2} if blank lines are to be left in.
You might also want to allow the final line of the string to have an optional terminating newline, for example:
>>> re.sub('(?m)^.{0,2}$\n?','','foo\na\nb\n$\n\nxz\nbar') # 3 symbols at end, no newline
'foo\nbar'
>>> re.sub('(?m)^.{0,2}$\n?','','foo\na\nb\n$\n\nxz\nbar\n') # same, with newline
'foo\nbar\n'
>>> re.sub('(?m)^.{0,2}$\n?','','foo\na\nb\n$\n\nxz\nba\n') # <3 symbols, newline
'foo\n'
>>> re.sub('(?m)^.{0,2}$\n?','','foo\na\nb\n$\n\nxz\nba') # < 3 symbols, no newline
'foo\n'
Perhaps you can use re.findall instead:
text = 'foo\na\nb\n$\n\nxz\nbar'
import re
print (repr("".join(re.findall(r"\n?\w{3,}\n?",text))))
#
'foo\n\nbar'
You can use this regex, which looks for any set of less than 3 non-newline characters following either start-of-string or a newline and followed by a newline or end-of-string, and replace it with an empty string:
(^|\n)[^\n]{0,2}(?=\n|$)
In python:
import re
text = 'foo\na\nb\n$\n\nxz\nbar'
print(re.sub(r'(^|\n)[^\n]{0,2}(?=\n|$)', '', text))
Output
foo
bar
Demo on rextester
There's no need to use regex for this.
raw_str = 'foo\na\nb\n$\n\nxz\nbar'
str_res = '\n'.join([curr for curr in raw_str.splitlines() if len(curr) >= 3])
print(str_res):
foo
bar
I have a regex pattern with optional characters however at the output I want to remove those optional characters. Example:
string = 'a2017a12a'
pattern = re.compile("((20[0-9]{2})(.?)(0[1-9]|1[0-2]))")
result = pattern.search(string)
print(result)
I can have a match like this but what I want as an output is:
desired output = '201712'
Thank you.
You've already captured the intended data in groups and now you can use re.sub to replace the whole match with just contents of group1 and group2.
Try your modified Python code,
import re
string = 'a2017a12a'
pattern = re.compile(".*(20[0-9]{2}).?(0[1-9]|1[0-2]).*")
result = re.sub(pattern, r'\1\2', string)
print(result)
Notice, how I've added .* around the pattern, so any of the extra characters around your data is matched and gets removed. Also, removed extra parenthesis that were not needed. This will also work with strings where you may have other digits surrounding that text like this hello123 a2017a12a some other 99 numbers
Output,
201712
Regex Demo
You can just use re.sub with the pattern \D (=not a number):
>>> import re
>>> string = 'a2017a12a'
>>> re.sub(r'\D', '', string)
'201712'
Try this one:
import re
string = 'a2017a12a'
pattern = re.findall("(\d+)", string) # this regex will capture only digit
print("".join(p for p in pattern)) # combine all digits
Output:
201712
If you want to remove all character from string then you can do this
import re
string = 'a2017a12a'
re.sub('[A-Za-z]+','',string)
Output:
'201712'
You can use re module method to get required output, like:
import re
#method 1
string = 'a2017a12a'
print (re.sub(r'\D', '', string))
#method 2
pattern = re.findall("(\d+)", string)
print("".join(p for p in pattern))
You can also refer below doc for further knowledge.
https://docs.python.org/3/library/re.html
I'm trying to use a regex to clean some data before I insert the items into the database. I haven't been able to solve the issue of removing trailing special characters at the end of my strings.
How do I write this regex to only remove trailing special characters?
import re
strings = ['string01_','str_ing02_^','string03_#_', 'string04_1', 'string05_a_']
for item in strings:
clean_this = (re.sub(r'([_+!##$?^])', '', item))
print (clean_this)
outputs this:
string01 # correct
string02 # incorrect because it remove _ in the string
string03 # correct
string041 # incorrect because it remove _ in the string
string05a # incorrect because it remove _ in the string and not just the trailing _
You could also use the special purpose rstrip method of strings
[s.rstrip('_+!##$?^') for s in strings]
# ['string01', 'str_ing02', 'string03', 'string04_1', 'string05_a']
You could repeat the character class 1+ times or else only 1 special character would be replaced. Then assert the end of the string $. Note that you don't need the capturing group around the character class:
[_+!##$?^]+$
For example:
import re
strings = ['string01_','str_ing02_^','string03_#_', 'string04_1', 'string05_a_']
for item in strings:
clean_this = (re.sub(r'[_+!##$?^]+$', '', item))
print (clean_this)
See the Regex demo | Python demo
If you also want to remove whitespace characters at the end you could add \s to the character class:
[_+!##$?^\s]+$
Regex demo
You need an end-of-word anchor $
clean_this = (re.sub(r'[_+!##$?^]+$', '', item))
Demo
I'm trying to convert multiple continuous newline characters followed by a Capital Letter to "____" so that I can parse them.
For example,
i = "Inc\n\nContact"
i = re.sub(r'([\n]+)([A-Z])+', r"____\2", i)
In [25]: i
Out [25]: 'Inc____Contact'
This string works fine. I can parse them using ____ later.
However it doesn't work on this particular string.
i = "(2 months)\n\nML"
i = re.sub(r'([\n]+)([A-Z])+', r"____\2", i)
Out [31]: '(2 months)____L'
It ate capital M.
What am I missing here?
EDIT To replace multiple continuous newline characters (\n) to ____, this should do:
>>> import re
>>> i = "(2 months)\n\nML"
>>> re.sub(r'(\n+)(?=[A-Z])', r'____', i)
'(2 months)____ML'
(?=[A-Z]) is to assert "newline characters followed by Capital Letter". REGEX DEMO.
Well let's take a look at your regex ([\n]+)([A-Z])+ - the first part ([\n]+) is fine, matching multiple occurences of a newline into one group (note - this wont match the carriage return \r). However the second part ([A-Z])+ leeds to your error it matches a single uppercase letter into a capturing group - multiple times, if there are multiple Uppercase letter, which will reset the group to the last matched uppercase letter, which is then used for the replace.
Try the following and see what happens
import re
i = "Inc\n\nABRAXAS"
i = re.sub(r'([\n]+)([A-Z])+', r"____\2", i)
You could simply place the + inside the capturing group, so multiple uppercase letters are matched into it. You could also just leave it out, as it doesn't make a difference, how many of these uppercase letters follow.
import re
i = "Inc\n\nABRAXAS"
i = re.sub(r'(\n+)([A-Z])', r"____\2", i)
If you want to replace any sequence of linebreaks, no matter what follows - drop the ([A-Z]) completely and try
import re
i = "Inc\n\nABRAXAS"
i = re.sub(r'(\n+)', r"____", i)
You could also use ([\r\n]+) as pattern, if you want to consider carriage returns
Try:
import re
p = re.compile(ur'[\r?\n]')
test_str = u"(2 months)\n\nML"
subst = u"_"
result = re.sub(p, subst, test_str)
It will reduce string to
(2 months)__ML
See Demo
I am trying to replace this string to become this
import re
s = "haha..hehe.hoho"
s = re.sub('[..+]+',' ', s)
my output i get haha hehe hoho
desired output
haha hehe.hoho
What am i doing wrong?
Test on sites like regexpal: http://regexpal.com/
It's easier to get the output and check if the regex is right.
You should change your regex to something like: '\.\.' if you want to remove only double dots.
If you want to remove when there's at least 2 dots you can use '\.{2,}'.
Every character you put inside a [] will be checked against your expression
And the dot character has a special meaning on a regex, to avoid this meaning you should prefix it with a escape character: \
You can read more about regular expressions metacharacters here: https://www.hscripts.com/tutorials/regular-expression/metacharacter-list.php
[a-z] A range of characters. Matches any character in the specified
range.
. Matches any single character except "n".
\ Specifies the next character as either a special character, a literal, a back reference, or an octal escape.
Your new code:
import re
s = "haha..hehe.hoho"
#pattern = '\.\.' #If you want to remove when there's 2 dots
pattern = '\.{2,}' #If you want to remove when there's at least 2 dots
s = re.sub(pattern, ' ', s)
Unless you are constrained to use regex, then I find the replace() function much simpler:
s = "haha..hehe.hoho"
print s.replace('..',' ')
gives your desired output:
haha hehe.hoho
Change:
re.sub('[..+]+',' ', s)
to:
re.sub('\.\.+',' ', s)
[..+]+ , this meaning in regex is that use the any in the list at least one time. So it matches the .. as well as . in your input. Make the changes as below:
s = re.sub('\.\.+',' ', s)
[] is a character class and will match on anything in it (meaning any 1 .).
I'm guessing you used it because a simple . wouldn't work, because it's a meta character meaning any character. You can simply escape it to mean a literal dot with a \. As such:
s = re.sub('\.\.',' ', s)
Here is what your regex means:
So, you allow for 1 or more literal periods or plus symbols, which is not the case.
You do not have to repeat the same symbol when looking for it, you can use quantifiers, like {2}, which means "exactly 2 occurrences".
You can use split and join, see sample working program:
import re
s = "haha..hehe.hoho"
s = " ".join(re.split(r'\.{2}', s))
print s
Output:
haha hehe.hoho
Or you can use the sub with the regex, too:
s = re.sub(r'\.{2}', ' ', "haha..hehe.hoho")
In case you have cases with more than 2 periods, you should use \.{2,} regex.