I need to replace all occurrences of regular spaces in «статья 1», «статьи 2», etc. with non-breaking spaces.
The construction below works fine:
re.sub('(стат.{0,4}) (\d+)', r'\1 \2', text)  # the 'r' prefix on the replacement is important; without it the word is not replaced correctly, at least for Russian text.
However, I do not want to call re.sub repeatedly for «статья», then for «пункт», then for the names of months; instead I want a dictionary of regex patterns and replacements. Here is my code, but it does not work as expected: 'статья 1 статьи 2' should come out as 'статья(non-breaking space here)1 статьи(non-breaking space here)2':
import re

text = 'статья 1 статьи 2'
dic = {'(cтат.{0,4}) (\d+)': r'\1 \2'}

def replace():
    global text
    final_text = ''
    for i in dic:
        new_text = re.sub(str(i), str(dic[i]), text)
        text = new_text
    return text

print(replace())
The problem is that you copied and pasted wrong.
This pattern works:
'(стат.{0,4}) (\d+)'
This one doesn't:
'(cтат.{0,4}) (\d+)'
Why? Because in the first one, and in your search string, that first character is a U+0441, a Cyrillic small Es. But in the second one, it's a U+0063, a Latin small C. Of course the two look identical in most fonts, but they're not the same character.
So, how can you tell? Well, when I suspected this problem, here's what I did:
>>> a = '(стат.{0,4}) (\d+)' # copied and pasted from your working code
>>> b = '(cтат.{0,4}) (\d+)' # copied and pasted from your broken code
>>> print(a.encode('unicode-escape').decode('ascii'))
(\u0441\u0442\u0430\u0442.{0,4}) (\\d+)
>>> print(b.encode('unicode-escape').decode('ascii'))
(c\u0442\u0430\u0442.{0,4}) (\\d+)
And the difference is obvious: the first one has a \u0441 escape sequence where the second one has a plain ASCII c.
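With that fixed, the dictionary-driven loop from the question works. Here is a minimal sketch of the corrected version (I pass text as an argument instead of using global, and write the non-breaking space explicitly as \u00a0 so it can't be confused with a regular space):

import re

text = 'статья 1 статьи 2'
# Cyrillic 'с' in the pattern; the replacement joins the groups with U+00A0
dic = {r'(стат.{0,4}) (\d+)': '\\1\u00a0\\2'}

def replace(text):
    for pattern, repl in dic.items():
        text = re.sub(pattern, repl, text)
    return text

print(replace(text))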
I am handed a bunch of data and trying to get rid of certain characters. The data contains multiple instances of "^{number}" → "^0", "^1", "^2", etc.
I am trying to set all of these instances to an empty string, "", is there a better way to do this than
string.replace("^0", "").replace("^1", "").replace("^2", "")
I understand you can use a dictionary, but it seems a little overkill considering each item will be replaced with "".
I understand that the digits are always at the end of the string; have a look at the solutions below.
with regex:
import re
text = 'xyz125'
s = re.sub(r"\d+$", '', text)  # raw string, so \d reaches the regex engine intact
print(s)
it should print:
xyz
without regex; keep in mind that this solution removes all digits, not only the ones at the end of the string:
text = 'xyz125'
result = ''.join(i for i in text if not i.isdigit())
print(result)
it should print:
xyz
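Note that both snippets strip only the digits and leave the caret in place. If the tokens really look like "^0", "^1", etc., a single substitution can remove the whole token at once; a minimal sketch, assuming the caret always directly precedes the digits:

import re

s = 'foo^0 bar^12 baz'
cleaned = re.sub(r'\^\d+', '', s)  # remove every caret-plus-digits token
print(cleaned)  # foo bar baz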
I have some data stored as a pandas data frame, and one of the columns contains text strings in Korean. I would like to process each of these text strings as follows:
my_string = '모질상태불량(피부상태불량, 심하게 야윔), 치석심함, 양측 수정체 백탁, 좌측 화농성 눈곱심함(7/22), 코로나음성(활력저하)'
Into a comma-separated string like this:
parsed_text = '모질상태불량, 피부상태불량, 심하게 야윔, 치석심함, 양측 수정체 백탁, 좌측 화농성 눈곱심함(7/22), 코로나음성, 활력저하'
So the problem is to identify cases where a word (or several words) is followed by parentheses containing text only (one word, or several words separated by commas) and replace them with all the words (before and inside the parentheses) separated by commas (for later processing). If a word is followed by parentheses containing numbers (as in this case, 7/22), it should be kept as it is. If a word is not followed by any parentheses, it should also be kept as it is. Furthermore, I would like to preserve the order of the words (as they appeared in the original string).
I can extract text in parentheses by using regex as follows:
corrected_string = re.findall(r'(\w+)\((\D.*?)\)', my_string)
which yields this:
[('모질상태불량', '피부상태불량, 심하게 야윔'), ('코로나음성', '활력저하')]
But I'm having difficulty creating my resulting string, i.e. replacing my original text with the pattern I've matched. Any suggestions? Thank you.
You can use re.findall with a pattern that optionally matches a number enclosed in parentheses:
corrected_string = re.findall(r'[^,()]+(?:\([^)]*\d[^)]*\))?', my_string)
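A quick check against the sample string, stripping each match and joining them back with commas:

import re

my_string = '모질상태불량(피부상태불량, 심하게 야윔), 치석심함, 양측 수정체 백탁, 좌측 화농성 눈곱심함(7/22), 코로나음성(활력저하)'
matches = re.findall(r'[^,()]+(?:\([^)]*\d[^)]*\))?', my_string)
parsed_text = ', '.join(m.strip() for m in matches)
print(parsed_text)
# 모질상태불량, 피부상태불량, 심하게 야윔, 치석심함, 양측 수정체 백탁, 좌측 화농성 눈곱심함(7/22), 코로나음성, 활력저하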
It's a little bit clumsy, but you can try:
my_string_list = [x.strip() for x in re.split(r"\((?!\d)|(?<!\d)\)|,", my_string) if x]
# you can then join the list back into a string.
I am supposed to write a little program that takes in a Persian text and in some places changes the space to a half-space. The half-space, or zero-width non-joiner, is used in some languages to avoid a ligature when normalizing a text. Its Unicode character is supposedly '\u200c', and in some text editors it can be typed with SHIFT+SPACE:
import re
txt = input('Please enter a Persian text: ')
original_pattern = r'\b(\w+)\s*(ها|هايي|هايم|هاي)\b'
new_pattern = r'\1 \2'
new_txt = re.sub(original_pattern, new_pattern, txt)
print(new_txt)
In the code above, new_pattern is supposed to introduce a half-space between \1 and \2, currently there is a space between them.
The question is: How can I put a half-space there? I tried the following and in both cases got a syntax error:
new_pattern = ur'\1\u200c\2'
new_pattern = r'\1\u200c\2'
By the way, although the Wikipedia article gives the Unicode character for ZWNJ as U+200C, it doesn't seem to work that way in the Python shell, and it actually doubles the space:
>>> print ('He is a',u'\u200c','boy')
He is a boy
>>> print ("کتاب",u"\u200c","ها")
کتاب ها
>>> print ("کتاب ها")
کتاب ها
>>>
Python adds a separator between the arguments of the print function; you can control this with the sep argument. Try
print ('He is a', '\u200c', 'boy', sep="")
For a pattern, try
new_pattern = '\\1\u200c\\2'
or
new_pattern = '\\1\N{ZERO WIDTH NON-JOINER}\\2'
The reason is that when you add an r prefix, backslash escapes are not processed, so the \u200c part of the pattern is treated as a literal six-character string; i.e. the pattern equals \\1\\u200c\\2, hence your error.
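Putting it together, a minimal corrected version of the snippet from the question (same search pattern, only the replacement changed):

import re

txt = input('Please enter a Persian text: ')
original_pattern = r'\b(\w+)\s*(ها|هايي|هايم|هاي)\b'
new_pattern = '\\1\u200c\\2'  # ZWNJ joins the two groups
new_txt = re.sub(original_pattern, new_pattern, txt)
print(new_txt)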
I have the following line :
CommonSettingsMandatory = #<Import Project="[\\.]*Shared(\\vc10\\|\\)CommonSettings\.targets," />#,true
and i want the following output:
['commonsettingsmandatory', '<Import Project="[\\\\.]*Shared(\\\\vc10\\\\|\\\\)CommonSettings\\.targets," />', 'true'
If I do a simple split on the comma, it will also split inside the value whenever the value itself contains one; since I wrote a comma after targets, it would split there.
So I want to ignore the text between the # characters to make sure there's no splitting there.
I really don't know how to do this!
http://docs.python.org/library/re.html#re.split
import re
string = 'CommonSettingsMandatory = #toto,tata#, true'
splitlist = re.split(r'\s?=\s?#(.*?)#,\s?', string)
Then splitlist contains ['CommonSettingsMandatory', 'toto,tata', 'true'].
While you might be able to use split with a lookbehind, I would use the groups captured by this expression.
(\S+)\s*=\s*#([^#]+)#,\s*(.*)
m = re.search(expression, myString); use m.group(1) for the first string, m.group(2) for the second, etc.
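For example, run against the sample line from the answer above (variable names are just for illustration):

import re

expression = r'(\S+)\s*=\s*#([^#]+)#,\s*(.*)'
myString = 'CommonSettingsMandatory = #toto,tata#, true'
m = re.search(expression, myString)
if m:
    print(m.group(1))  # CommonSettingsMandatory
    print(m.group(2))  # toto,tata
    print(m.group(3))  # true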
If I understand you correctly, you're trying to split the string using spaces as delimiters, but you want to also remove any text between pound signs?
If that's correct, why not simply remove the pound sign-delimited text before splitting the string?
import re
myString = re.sub(r'#.*?#', '', myString)
myArray = myString.split(' ')
EDIT: (based on revised question)
import re
myArray = re.findall(r'^(.*?) = #(.*?)#,(.*?)$', myString)
That will return a list of tuples containing your matches, in the form of:
[
(
'CommonSettingsMandatory',
'<Import Project="[\\\\.]*Shared(\\\\vc10\\\\|\\\\)CommonSettings\\.targets," />',
'true'
)
]
(spacing added to illustrate the format better)
I have multilingual strings consisting of both languages that use whitespace as a word separator (English, French, etc.) and languages that don't (Chinese, Japanese, Korean).
Given such a string, I want to separate the English/French/etc part into words using whitespace as separator, and to separate the Chinese/Japanese/Korean part into individual characters.
And I want to put of all those separated components into a list.
Some examples would probably make this clear:
Case 1: English-only string. This case is easy:
>>> "I love Python".split()
['I', 'love', 'Python']
Case 2: Chinese-only string:
>>> list(u"我爱蟒蛇")
[u'\u6211', u'\u7231', u'\u87d2', u'\u86c7']
In this case I can turn the string into a list of Chinese characters. But within the list I'm getting unicode representations:
[u'\u6211', u'\u7231', u'\u87d2', u'\u86c7']
How do I get it to display the actual characters instead of the unicode? Something like:
['我', '爱', '蟒', '蛇']
??
Case 3: A mix of English & Chinese:
I want to turn an input string such as
"我爱Python"
into a list like this:
['我', '爱', 'Python']
Is it possible to do something like that?
I thought I'd show the regex approach, too. It doesn't feel right to me, but that's mostly because all of the language-specific i18n oddities I've seen make me worry that a regular expression might not be flexible enough for all of them; then again, you may well not need any of that. (In other words: possible overdesign.)
# -*- coding: utf-8 -*-
import re

def group_words(s):
    regex = []
    # Match a whole word:
    regex += [ur'\w+']
    # Match a single CJK character:
    regex += [ur'[\u4e00-\ufaff]']
    # Match one of anything else, except for spaces:
    regex += [ur'[^\s]']
    regex = "|".join(regex)
    r = re.compile(regex)
    return r.findall(s)

if __name__ == "__main__":
    print group_words(u"Testing English text")
    print group_words(u"我爱蟒蛇")
    print group_words(u"Testing English text我爱蟒蛇")
In practice, you'd probably want to only compile the regex once, not on each call. Again, filling in the particulars of character grouping is up to you.
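For instance, a sketch of the compile-once variant under Python 2 (where \w without re.UNICODE stays ASCII-only; the module-level name _WORD_RE is mine):

# -*- coding: utf-8 -*-
import re

_WORD_RE = re.compile(ur'\w+|[\u4e00-\ufaff]|[^\s]')

def group_words(s):
    return _WORD_RE.findall(s)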
In Python 3, this one also splits out numbers, if you need that:
import re

def spliteKeyWord(str):
    regex = r"[\u4e00-\ufaff]|[0-9]+|[a-zA-Z]+\'*[a-z]*"
    matches = re.findall(regex, str, re.UNICODE)
    return matches
print(spliteKeyWord("Testing English text我爱Python123"))
=> ['Testing', 'English', 'text', '我', '爱', 'Python', '123']
Formatting a list shows the repr of its components. If you want to view the strings naturally rather than escaped, you'll need to format them yourself. (repr should not be escaping these characters; repr(u'我') should return "u'我'", not "u'\\u6211'". Apparently Python 3 gets this right; only 2.x is stuck with the English-centric escaping for Unicode strings.)
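For example, in Python 2 you can join the items yourself instead of printing the list:

# -*- coding: utf-8 -*-
words = list(u"我爱蟒蛇")
print u"[%s]" % u", ".join(words)  # prints: [我, 爱, 蟒, 蛇]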
A basic algorithm you can use is assigning a character class to each character, then grouping letters by class. Starter code is below.
I didn't use a doctest for this because I hit some odd encoding issues that I don't want to look into (out of scope). You'll need to implement a correct grouping function.
Note that if you're using this for word wrapping, there are other per-language considerations. For example, you don't want to break on non-breaking spaces; you do want to break on hyphens; for Japanese you don't want to split apart きゅ; and so on.
# -*- coding: utf-8 -*-
import itertools, unicodedata

def group_words(s):
    # This is a closure for key(), encapsulated in an array to work around
    # 2.x's lack of the nonlocal keyword.
    sequence = [0x10000000]
    def key(part):
        val = ord(part)
        if part.isspace():
            return 0
        # This is incorrect, but serves this example; finding a more
        # accurate categorization of characters is up to the user.
        asian = unicodedata.category(part) == "Lo"
        if asian:
            # Never group asian characters, by returning a unique value for each one.
            sequence[0] += 1
            return sequence[0]
        return 2
    result = []
    for key, group in itertools.groupby(s, key):
        # Discard groups of whitespace.
        if key == 0:
            continue
        str = "".join(group)
        result.append(str)
    return result

if __name__ == "__main__":
    print group_words(u"Testing English text")
    print group_words(u"我爱蟒蛇")
    print group_words(u"Testing English text我爱蟒蛇")
Modified Glenn's solution to drop symbols and work for Russian, French, etc. alphabets:

import re

def rec_group_words():
    regex = []
    # Match a whole word:
    regex += [r'[A-Za-z0-9\xc0-\xff]+']
    # Match a single CJK character:
    regex += [r'[\u4e00-\ufaff]']
    regex = "|".join(regex)
    return re.compile(regex)
The following works for Python 3.7:

import re

def group_words(s):
    return re.findall(u'[\u4e00-\u9fff]|[a-zA-Z0-9]+', s)

if __name__ == "__main__":
    print(group_words(u"Testing English text"))
    print(group_words(u"我爱蟒蛇"))
    print(group_words(u"Testing English text我爱蟒蛇"))
['Testing', 'English', 'text']
['我', '爱', '蟒', '蛇']
['Testing', 'English', 'text', '我', '爱', '蟒', '蛇']
For some reason, I could not adapt Glenn Maynard's answer to Python 3.
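In case it helps, here is a sketch of how that approach might look under Python 3, where the u and ur prefixes are unnecessary and strings are Unicode by default. Note that \w in Python 3 matches CJK characters too, so the word pattern is written as an explicit ASCII range here:

import re

def group_words(s):
    regex = "|".join([
        r"[A-Za-z0-9_]+",    # a whole (ASCII) word
        r"[\u4e00-\ufaff]",  # a single CJK character
        r"\S",               # one of anything else, except spaces
    ])
    return re.findall(regex, s)

print(group_words("Testing English text我爱蟒蛇"))
# ['Testing', 'English', 'text', '我', '爱', '蟒', '蛇']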