Why does Python's re module escape semicolon characters?
print(re.escape('text;text'))
gives me the following output:
text\;text
>>> re.escape.__doc__
'Escape all non-alphanumeric characters in pattern.'
It escapes ; (semicolon) because ; is not an alphanumeric character.
It escapes a semicolon because that is what it's designed to do. As per the docs, it escapes all non-alphanumeric characters.
Return string with all non-alphanumerics backslashed; this is useful if you want to match an arbitrary literal string that may have regular expression metacharacters in it.
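To see what the escaping is for, here is a minimal sketch (the sample strings are made up for illustration). Note that on Python 3.7 and later, re.escape only escapes characters that can have special meaning in a regular expression, so ; is left alone there; the text\;text output in the question comes from an older version. Either way, the escaped pattern matches the literal string.

```python
import re

literal = 'a.b;c'  # hypothetical sample with a metacharacter ('.') and a semicolon

# Unescaped, '.' matches any character, so an unintended string matches too.
unescaped_hit = re.fullmatch(literal, 'axb;c') is not None

# Escaped, the pattern matches only the literal string itself.
escaped = re.escape(literal)
escaped_hit = re.fullmatch(escaped, 'axb;c') is not None
literal_hit = re.fullmatch(escaped, literal) is not None

print(unescaped_hit, escaped_hit, literal_hit)  # True False True
```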
I have the following strings.
string1 = "按照由 GPV 提供的相关报告; 世界卫生组织 WHO 发布的有关研究"
string2 = "\n\n 介绍 INTRODUCTION"
How can I remove the spaces between Chinese characters and English acronyms?
The expected result is:
"按照由GPV提供的相关报告; 世界卫生组织WHO发布的有关研究".
However, the re pattern should not remove the space between 介绍 and INTRODUCTION since there are no Chinese characters on the right side of INTRODUCTION.
If you can use the third-party regex module (an alternative regex implementation), it supports \p{script} tokens, which make this task easy:
\p{Han}+\s+\p{Latin}+\s+\p{Han}+
Unfortunately, Python's native re module doesn't support these.
To remove the spaces, use capturing groups to select the surrounding words and refer to them in your replacement pattern:
Match (\p{Han}+)\s+(\p{Latin}+)\s+(\p{Han}+)
Replace by \1\2\3
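If installing the regex module is not an option, the same match/replace idea can be roughly sketched with the standard re module. This is an approximation, not an equivalent: it assumes the CJK Unified Ideographs block (U+4E00 to U+9FFF) is a good enough stand-in for \p{Han} and ASCII letters for \p{Latin}. The function name remove_inner_spaces is hypothetical.

```python
import re

# Assumption: approximate \p{Han} with the CJK Unified Ideographs block and
# \p{Latin} with ASCII letters; the regex module's script classes cover more.
HAN = r'[\u4e00-\u9fff]'
LATIN = r'[A-Za-z]'
pattern = re.compile(r'({0}+)\s+({1}+)\s+({0}+)'.format(HAN, LATIN))

def remove_inner_spaces(text):
    """Drop whitespace around a Latin word that has Han text on both sides."""
    return pattern.sub(r'\1\2\3', text)

string1 = "按照由 GPV 提供的相关报告; 世界卫生组织 WHO 发布的有关研究"
string2 = "\n\n 介绍 INTRODUCTION"

result1 = remove_inner_spaces(string1)  # spaces around GPV and WHO are removed
result2 = remove_inner_spaces(string2)  # unchanged: nothing Han follows INTRODUCTION
```

Because each match consumes the Han run on its right, a Latin word that immediately follows that run would not be rejoined in the same pass; a lookahead for the trailing group would be one way to handle that case.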
What do the lines of code below do, and what is their Jython equivalent?
Function Import_PUERTOR(strField, strRecord)
Dim re
Set re = New RegExp
re.Pattern = "^\s*"
re.MultiLine = False
strField = re.Replace(strField,"")
End Function
This code strips the leading whitespace from the strField string.
No regex conversion is needed: Python (and therefore Jython) has a non-regex built-in for that, which is faster and shorter to write:
strField = strField.lstrip()
lstrip returns a copy of the string with leading characters removed.
Syntax
str.lstrip([chars])
chars
Optional. String specifying the set of characters to be removed. If omitted or None, the chars argument defaults to removing whitespace. The chars argument is not a prefix; rather, all combinations of its values are stripped.
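A short sketch comparing the regex from the question with lstrip, and showing that chars is treated as a set of characters rather than a prefix. The sample strings are illustrative only.

```python
import re

s = '  \t hello '

# Direct translation of the VBScript pattern "^\s*" replaced by "".
via_regex = re.sub(r'^\s*', '', s)

# The built-in gives the same result for this input; trailing space is kept.
via_lstrip = s.lstrip()
print(via_regex == via_lstrip)  # True

# chars is a set: any leading run made of these characters is removed.
stripped_set = 'www.example.com'.lstrip('cmowz.')
print(stripped_set)  # example.com
```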
I can use other escape characters without any problem, but my Atom text editor and Python itself don't treat \s as an escape character; they treat it as normal characters.
print '\s', test_line
just writes
\stesting_bot1
How can I make the editor and Python treat this as an escape character that produces a space?
\s isn't an escape sequence in Python. \t, \n, \r, etc. are (see the Python lexical analysis docs), but a backslash followed by a character with no special meaning is left in the string as-is, hence your \s appearing literally.
However, \s does mean whitespace in regular expression syntax, of course...
I think you may be confusing a regex with a string. For a normal string, you just need to use the space character to print it:
print(' testing_bot1')
\s is not an escape sequence, so it will be interpreted as just backslash + "s".
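The difference between a string escape and a regex class can be sketched like this. A raw string is used for \s so the example does not depend on how a given Python version warns about unknown escapes.

```python
import re

two_chars = r'\s'   # raw string: a backslash followed by the letter s
one_char = '\t'     # a real escape sequence: a single tab character

print(len(two_chars), len(one_char))  # 2 1

# Only the regex engine gives \s a meaning: it matches any whitespace character.
replaced = re.sub(r'\s', '_', 'a b\tc')
print(replaced)  # a_b_c
```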
What regex would I use to get urls that follow this pattern:
'https://www.facebook.com/' + 'some text' + 'browser'
Meaning that it starts with the facebook url, has some text which varies, and then has the word 'browser' at the end?
Thanks!
r'https://www\.facebook\.com/.*browser'
. is a regex metacharacter meaning "any single character", so the literal periods have to be escaped with backslashes. * means "any number of matches for the previous thing", so .* means "any number of arbitrary characters". The r in front of the string marks it as a raw string literal, so backslashes are processed by the regex engine instead of the Python parser.
I think it should be r'https://www\.facebook\.com/.*browser$', because the word 'browser' is at the end.
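A small sketch of how the two patterns differ in practice. The URLs and the helper is_match are made up for illustration.

```python
import re

loose = re.compile(r'https://www\.facebook\.com/.*browser')
anchored = re.compile(r'https://www\.facebook\.com/.*browser$')

def is_match(regex, url):
    """Return True if the pattern matches starting at the beginning of url."""
    return regex.match(url) is not None

ends_with = 'https://www.facebook.com/some/page/browser'
in_middle = 'https://www.facebook.com/browser/settings'
wrong_host = 'https://wwwXfacebook.com/page/browser'

print(is_match(loose, ends_with), is_match(anchored, ends_with))    # True True
print(is_match(loose, in_middle), is_match(anchored, in_middle))    # True False
print(is_match(loose, wrong_host), is_match(anchored, wrong_host))  # False False
```

Without the trailing $, 'browser' only has to appear somewhere after the prefix; with it, the URL must end there.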
def symbolsReplaceDashes(text):
I want to replace all spaces and symbols with hyphens, because I want to use the result in a URL.
import re
text = "this isn't alphanumeric"
result = re.sub(r'\W','-',text) # result will be "this-isn-t-alphanumeric"
The \W class is the inverse of the \w class, which consists of alphanumeric characters and underscores ([a-zA-Z0-9_] for ASCII text). Thus, replacing every character that matches \W with a dash leaves you with a string that consists of only alphanumerics, underscores, and dashes, suitable for a URL.
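One caveat, assuming Python 3: for str patterns, \w and \W are Unicode-aware by default, so accented letters count as word characters unless re.ASCII is passed. A small sketch with a made-up sample string:

```python
import re

text = 'caf\u00e9 au lait'  # hypothetical sample; \u00e9 is an accented "e"

# Default (Unicode) matching keeps the accented letter.
unicode_aware = re.sub(r'\W', '-', text)               # 'caf\u00e9-au-lait'

# With re.ASCII, \w is limited to [a-zA-Z0-9_], so the accent is replaced too.
ascii_only = re.sub(r'\W', '-', text, flags=re.ASCII)  # 'caf--au-lait'
print(ascii_only)
```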
Instead of a regex, if you want to escape a string to be used in a URL, use urllib.quote() or urllib.quote_plus(). For more complex queries, you might want to build the URL using urllib.urlencode(). You can reverse the quoting with urllib.unquote() and urllib.unquote_plus(). (In Python 3, these functions live in urllib.parse.)
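A minimal sketch of these calls, assuming Python 3, where they are imported from urllib.parse. The query parameter names q and page are made up.

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus, urlencode

text = "this isn't alphanumeric"

quoted = quote(text)       # 'this%20isn%27t%20alphanumeric'
plus = quote_plus(text)    # 'this+isn%27t+alphanumeric'

# urlencode builds a whole query string from key/value pairs.
query = urlencode({'q': text, 'page': 2})  # 'q=this+isn%27t+alphanumeric&page=2'

# Both encodings round-trip back to the original text.
print(unquote(quoted) == text, unquote_plus(plus) == text)  # True True
```

Unlike the \W substitution, quoting is reversible, but the result is percent-encoded rather than a readable hyphenated slug.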
This response doesn't use regular expressions, but should also work, with greater control over the types of symbols to filter. It uses the unicodedata module to remove all symbols by checking the categories of the characters.
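import unicodedata
# See http://www.dpawson.co.uk/xsl/rev2/UnicodeCategories.html for character categories
# Symbol categories (currency, modifier, math, other) plus space separators
replace = ('Sc', 'Sk', 'Sm', 'So', 'Zs')
def symbolsReplaceDashes(text):
    # Written for Python 3, where every str is Unicode; on Python 2,
    # pass a unicode string (the original wrapped each char in unicode()).
    L = []
    for char in text:
        if unicodedata.category(char) in replace:
            L.append('-')
        else:
            L.append(char)
    return ''.join(L)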
You may need to use something like urllib.quote(output.encode('utf-8')) to encode the result if it contains characters beyond basic ASCII alphanumerics.