Regex not working when used on a Chinese text [closed] - python

Closed. This question needs debugging details. It is not currently accepting answers.
Edit the question to include desired behavior, a specific problem or error, and the shortest code necessary to reproduce the problem. This will help others answer the question.
Closed 6 years ago.
I created a small python function to remove some undesired elements from strings written in Chinese.
Those undesired elements feature an ampersand at the beginning (&Something).
The function uses a regex to spot them, remove them and return the longest part of the string without those undesired elements, but for some reason it's not working as expected.
I tested the function on strings in other languages and alphabets and it works as expected.
# -*- coding: utf-8 -*-
import re
def clean_sentence(my_text):
    split_the_text = re.split(r'([&].*?\s)', my_text)
    longest_sentence = max(split_the_text, key=len)
    return longest_sentence
my_string = "一个神奇的鸭子飞在与&SOMETHING然后唱支歌给&PERSON"
print clean_sentence(my_string)
That's the output:
õ©Çõ©¬þÑ×ÕÑçþÜäÚ©¡Õ¡ÉÚú×Õ£¿õ©Ä&SOMETHINGþäÂÕÉÄÕö▒µö»µ¡îþ╗Ö&PERSON

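Pretty simple:
Your pattern requires a whitespace character (\s) after the tag, but the string contains none, so nothing matches and the text is never split. If your SOMETHING or PERSON consist only of English letters, digits or underscores, you might be able to get along with the code below. Note that in Python 3, \w also matches Chinese characters by default, so the re.ASCII flag is needed to stop the match at the end of the tag.
import re
def clean_sentence(my_text):
    # re.ASCII restricts \w to [a-zA-Z0-9_], so Chinese text after a tag is kept
    split_the_text = re.split(r'&\w+', my_text, flags=re.ASCII)
    longest_sentence = max(split_the_text, key=len)
    return longest_sentence
my_string = "一个神奇的鸭子飞在与&SOMETHING然后唱支歌给&PERSON"
print(clean_sentence(my_string))
# 一个神奇的鸭子飞在与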
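To see what the split actually produces, here is a small sketch (assuming Python 3) that compares default Unicode matching of \w with ASCII-only matching on the string from the question. The variable names unicode_parts and ascii_parts are only illustrative.

```python
import re

my_string = "一个神奇的鸭子飞在与&SOMETHING然后唱支歌给&PERSON"

# Default in Python 3: \w is Unicode-aware, so it also consumes the
# Chinese characters that directly follow &SOMETHING.
unicode_parts = re.split(r'&\w+', my_string)

# With re.ASCII, \w only matches [a-zA-Z0-9_], so the match stops
# at the end of the tag and the Chinese text after it survives.
ascii_parts = re.split(r'&\w+', my_string, flags=re.ASCII)

print(unicode_parts)  # ['一个神奇的鸭子飞在与', '', '']
print(ascii_parts)    # ['一个神奇的鸭子飞在与', '然后唱支歌给', '']
```

In both cases the longest piece happens to be the same, but only the ASCII variant keeps the middle part of the sentence.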

Related

How to use str.replace() in this case in Python? [closed]

Expression example:
"abcddomain_rgz.png"
"djhajhdomain_rgb1.png"
I want to replace the domain*.png part of the strings above with "domain.json".
Expected results:
"abcddomain.json"
"djhajhdomain.json"
This is a typical case for a regex, as mentioned in the comment section. Since you do not know the length of the text between domain and .png, you need a regular expression to perform the replacement.
Python provides the re module, whose sub function performs the replacement:
import re
string = "djhajhdomain_rgb1.png"
# The dot before png is escaped so that it only matches a literal "."
result = re.sub(r"domain.*\.png", "domain.json", string)
print(result)
This will return:
djhajhdomain.json
Use a Python regex instead (the re module):
re.sub(r'domain.*\.png$', r"domain.json", 'djhajhdomain_rgb1.png')
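As a quick check, the pattern above can be wrapped in a small helper and run on both strings from the question. This is only a sketch; the function name to_domain_json and the extra input "mydomain_xpng" are made up for illustration. The last two lines show why the dot before png should be escaped.

```python
import re

def to_domain_json(name):
    # Replace everything from "domain" through a trailing ".png"
    # with "domain.json".
    return re.sub(r'domain.*\.png$', 'domain.json', name)

print(to_domain_json("abcddomain_rgz.png"))     # abcddomain.json
print(to_domain_json("djhajhdomain_rgb1.png"))  # djhajhdomain.json

# An unescaped "." matches any character, so a name that merely ends
# in "png" is rewritten as well:
print(re.sub(r'domain.*.png', 'domain.json', "mydomain_xpng"))  # mydomain.json
print(to_domain_json("mydomain_xpng"))                          # mydomain_xpng
```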
Your best bet here would be regex.
x = "djhajhdomain_rgb1.png"
y = "djhajhdomain.json"
import re
pattern = re.compile(r'\w+domain')
ext = '.json'
match = re.match(pattern, x).group(0)
result = match+ext
assert result == y
The steps are:
import the re module
compile a pattern to search for in the string (note that the pattern only accepts alphanumeric characters and/or underscores before the literal string "domain")
set a predefined file extension
use the compiled pattern to match the string
concatenate the match and the extension
confirm that the result matches the desired output

Python Regular Expression for pattern containing multiple lines [closed]

I want to extract all the text printed after "AAAAAAAAAAAAAAAAAA"
Give me some text!
AAAAAAAAAAAAAAAAAA
S
p
p
p
Epppp
The following does not work:
import re
m = re.findall(r'AAAAAAAAAAAAAAAAAA(.*)', result)
print m[0]
Also, can I specify a variable in a regular expression instead of a hard coded string: "AAAAAAAAAAAAAAAAAA"?
Reason being, the text: "AAAAAAAAAAAAAAAAAA" is a variable and changes. So, I would like to look for a specific variable value in the pattern and then extract all the text after it.
By default, . does not match a newline, so your (.*) only captures the (empty) rest of the line that contains the marker. Use re.S or re.DOTALL (they are synonyms) to have findall match across lines. Or, in your case, search is probably more appropriate since you only want one match. Also, to have it work for a non-hard-coded string, simply use string formatting or string concatenation. To avoid having unescaped regex characters in the string, run it through re.escape.
import re
result = """Give me some text!
AAAAAAAAAAAAAAAAAA
S
p
p
p
Epppp"""
s = 'AAAAAAAAAAAAAAAAAA'
# With formatting
m = re.search(r'{}(.*)'.format(re.escape(s)), result, re.S)
# With concatenation
m = re.search(re.escape(s) + r'(.*)', result, re.S)
print(m.group(1))
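The same idea can be packaged as a reusable function. This is a minimal sketch, not the only way to do it: the name text_after is hypothetical, and the optional \n after the marker is an added assumption so that the result does not start with the line break that follows the marker.

```python
import re

def text_after(marker, text):
    # re.escape makes the marker match literally, even if it contains
    # characters such as "." or "+". re.DOTALL lets "." match newlines.
    m = re.search(re.escape(marker) + r'\n?(.*)', text, re.DOTALL)
    return m.group(1) if m else None

result = """Give me some text!
AAAAAAAAAAAAAAAAAA
S
p
p
p
Epppp"""

print(text_after('AAAAAAAAAAAAAAAAAA', result))
```

If the marker does not occur in the text, the function returns None instead of raising an error.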

Search and Replace a word within a word in Python. Replace() method not working [closed]

How do I search and replace using built-in Python methods?
For instance, with a string of appleorangegrapes (yes all of them joined),
Replace "apple" with "mango".
The .replace method only works if the words are evenly spaced out but not if they are combined as one. Is there a way around this?
I searched the web but again the .replace method only gives me an example if they are spaced out.
Thank you for looking at the problem!
This works exactly as expected and advertised. Have a look:
s = 'appleorangegrapes'
print(s) # -> appleorangegrapes
s = s.replace('apple', 'mango')
print(s) # -> mangoorangegrapes
The only thing that you have to be careful of is that replace is not an in-place operator and as such it does not update s automatically; it only creates a new string that you have to assign to something.
s = 'appleorangegrapes'
s.replace('apple', 'mango') # a new string is returned, but it is discarded
print(s) # -> appleorangegrapes
replace works on any string, whether or not the words are separated by spaces. I am not sure why you think it doesn't; here is a test:
>>> s='appleorangegrapes'
>>> s.replace('apple','mango')
'mangoorangegrapes'
>>>
Don't you see that you received your expected result?
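For completeness, here is a short sketch of two points made above: replace matches substrings anywhere in the string and returns a new string, and its optional third argument limits how many occurrences are replaced. The second test string is made up for illustration.

```python
s = 'appleorangegrapes'

# replace returns a new string; the original is left untouched
replaced = s.replace('apple', 'mango')
print(s)         # appleorangegrapes
print(replaced)  # mangoorangegrapes

# The optional count argument replaces only the first N occurrences
t = 'appleorangeapple'
first_only = t.replace('apple', 'mango', 1)
print(first_only)  # mangoorangeapple
```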

Replace every caret with a superscript in a python string [closed]

I want to replace every caret character with a unicode superscript, for nicer printing of equations in python. My problem is, every caret may be followed by a different exponent value, so in the unicode string u'\u00b*', the * wildcard needs to be the exponent I want to print in the string. I figured some regex would work for this, but my experience with that is very little.
For example, suppose I have the string "x^3-x^2". I would then want it converted to the unicode string u"x\u00b3-x\u00b2".
You can use re.sub and str.translate to catch exponents and change them to unicode superscripts.
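import re
def to_superscript(num):
    # Map each ASCII digit to its Unicode superscript counterpart
    transl = str.maketrans(dict(zip('1234567890', '¹²³⁴⁵⁶⁷⁸⁹⁰')))
    return num.translate(transl)
s = 'x^3-x^2'
# Raw string, so that \^, \s and \d reach the regex engine unchanged
out = re.sub(r'\^\s*(\d+)', lambda m: to_superscript(m[1]), s)
print(out)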
Output
x³-x²
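The answer above only handles non-negative integer exponents. As a hedged extension, which goes beyond what the question asked, the following sketch also accepts a leading minus sign by mapping it to the superscript minus character (U+207B). The names SUPERSCRIPTS and superscript_exponents are hypothetical.

```python
import re

# Translation table: ASCII digits and "-" to their superscript forms
SUPERSCRIPTS = str.maketrans('0123456789-', '⁰¹²³⁴⁵⁶⁷⁸⁹⁻')

def superscript_exponents(expr):
    # Match a caret, optional whitespace, then an optionally negative integer,
    # and replace the whole match with the superscript form of the number.
    return re.sub(r'\^\s*(-?\d+)',
                  lambda m: m.group(1).translate(SUPERSCRIPTS),
                  expr)

print(superscript_exponents('x^3-x^2'))      # x³-x²
print(superscript_exponents('x^-1 + y^10'))  # x⁻¹ + y¹⁰
```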

How to cut link in python? [closed]

I have the following link:
http://ecx.images-amazon.com/images/I/51JXXb2vpDL._SY344_PJlook-inside-v2,TopRight,1,0_SH20_BO1,204,203,200_.jpg
How to take just this one part of the link:
http://ecx.images-amazon.com/images/I/51JXXb2vpDL.jpg
and remove everything else? I also want to keep the extension.
I want to remove this part:
._SY344_PJlook-inside-v2,TopRight,1,0_SH20_BO1,204,203,200_
and keep this part:
http://ecx.images-amazon.com/images/I/51JXXb2vpDL.jpg
How can I do this in python?
You could use:
re.sub(r'\._[\w.,-]*(\.(?:jpg|png|gif))$', r'\1', inputurl)
This makes some assumptions but works on your input. The search starts at the ._ sequence, takes anything after that that is a letter, digit, dash, underscore, dot or comma, then matches the extension. I picked an explicit small group of possible extensions; you could also just use (\.\w+)$ at the end instead to widen the acceptable extensions to word characters.
Demo:
>>> import re
>>> inputurl = 'http://ecx.images-amazon.com/images/I/51JXXb2vpDL._SY344_PJlook-inside-v2,TopRight,1,0_SH20_BO1,204,203,200_.jpg'
>>> re.sub(r'\._[\w.,-]*(\.(?:jpg|png|gif))$', r'\1', inputurl)
'http://ecx.images-amazon.com/images/I/51JXXb2vpDL.jpg'
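The wider variant mentioned above, with (\.\w+)$ in place of the explicit extension list, could look like the sketch below. The function name strip_modifiers is hypothetical, and the code assumes the unwanted part always starts with ._ as in the example link.

```python
import re

def strip_modifiers(url):
    # Drop everything from "._" up to the final extension,
    # keeping the extension itself (captured as group 1).
    return re.sub(r'\._[\w.,-]*(\.\w+)$', r'\1', url)

inputurl = ('http://ecx.images-amazon.com/images/I/51JXXb2vpDL.'
            '_SY344_PJlook-inside-v2,TopRight,1,0_SH20_BO1,204,203,200_.jpg')

print(strip_modifiers(inputurl))
# http://ecx.images-amazon.com/images/I/51JXXb2vpDL.jpg
```

A link that contains no ._ sequence is returned unchanged, since the pattern does not match.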
url = "http://ecx.images-amazon.com/images/I/51JXXb2vpDL._SY344_PJlook-inside-v2,TopRight,1,0_SH20_BO1,204,203,200_.jpg"
l = url.split(".")
print(".".join(l[:-2:])+".{}".format(l[-1]))
prints
http://ecx.images-amazon.com/images/I/51JXXb2vpDL.jpg
The following should work:
import re
url = "http://ecx.images-amazon.com/images/I/51JXXb2vpDL._SY344_PJlook-inside-v2,TopRight,1,0_SH20_BO1,204,203,200_.jpg"
print(re.sub(r"(https?://.+?)\._.+(\.\w+)", r'\1\2', url))
The above code prints
http://ecx.images-amazon.com/images/I/51JXXb2vpDL.jpg
An important detail: More links are necessary to find the correct pattern. I'm currently assuming you want everything until the first ._
url = re.sub(r"(/[^./]+)\.[^/]*?(\.[^.]+)$", r"\1\2", url)
