Regular expression in Python issue. re.sub() working strangely

I am attempting to use Python to write text to a Microsoft Word document. The code works perfectly except when it runs up against a non-ASCII character. When that happens, I am greeted by the following error:
ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters
I attempted to solve this issue by using regular expressions to pluck out and replace non-ASCII characters. re.sub(pattern, repl, string, count=0, flags=0) seemed like it would work. Here is the code that I threw together:
match1 = re.search(ur'Ê¼', bodyHTML)
match2 = re.search(ur'ï¬', bodyHTML)
match3 = re.search(ur'ï¬‚', bodyHTML)
if match1:
    print 'Match 1'
    bodyHTML = re.sub(ur'Ê¼', "'", bodyHTML)
if match3:
    print 'Match 3'
    bodyHTML = re.sub(ur'ï¬‚', 'fl', bodyHTML)
if match2:
    print 'Match 2'
    bodyHTML = re.sub(ur'ï¬', 'fi', bodyHTML)
"match1" works perfectly. Whenever there is an Ê¼ in the text, it is replaced by an apostrophe.
"match2" and "match3" are a different story. Here's an example:
After a short hike we had our ï¬rst glimpse of the museum
Naturally, this triggers a response from match2. But instead of producing
After a short hike we had our first glimpse of the museum
It spits out
After a short hike we had our fiürst glimpse of the museum
This happens several times. "signiï¬cant" becomes "signifiücant" and so on.
I am not sure why this is happening.
I am also running into an issue where match2 is steamrolling any match from match3. In other words
the ripples on the pond and so did the shimmering reï¬‚ections in the glass walls
becomes
the ripples on the pond and so did the shimmering refiéected in the glass walls
instead of
the ripples on the pond and so did the shimmering reflected in the glass walls
I'm not really sure why match2 is dominating, especially because I put the match3 if statement before the one for match2 specifically so it would remove all of the sections with "ï¬‚" and leave only the "ï¬" snippets for match2 to mop up.
As far as the other non-ascii characters popping up after running the code...I have no idea.
Any help is much appreciated.
Thank you

For specific 'translation' I use chr like this:
def process_special_characters(result):
    '''sub out known yuck for legit stuff (smart quotes, etc.), keep track of how many changes'''
    total = 0
    result = re.subn(chr(133), chr(95), result) #strange underbar
    total += result[1]
    result = re.subn(chr(145), chr(39), result[0]) #smart quote
    total += result[1]
    result = re.subn(chr(146), chr(39), result[0]) #other smart quote
    total += result[1]
    result = re.subn(chr(150), chr(45), result[0]) #strange hyphen
    total += result[1]
    result = re.subn(chr(8212), chr(45), result[0]) #strange long hyphen
    total += result[1]
    result = re.subn(chr(160), chr(32), result[0]) #non breaking space
    total += result[1]
    return result[0], total
You could easily make a dict whose keys are your "from" values and whose values are the "to" values, and loop over it. If it's a short list this works fine. Note that using subn allows you to keep track of the number of changes. You can always toss all that logic and just go with sub; in my case the business side wanted a count of changes.
You could also have lines like this if it is easier to read:
result = re.subn(chr(133), "_", result) #strange underbar
Good luck! Just toss a comment in if you have more questions and I'll update.
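The dict-and-loop idea mentioned above could look roughly like this. It is only a sketch under a few assumptions: it targets Python 3, it swaps re.subn for plain str.count and str.replace (the "from" values are literal characters, so no regex is needed), and the mapping is a made-up example rather than a complete list.

```python
# Hypothetical mapping; extend it with whatever characters show up in your data.
REPLACEMENTS = {
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark
    "\u02bc": "'",   # modifier letter apostrophe
    "\u2014": "-",   # long dash
    "\u00a0": " ",   # non-breaking space
    "\ufb01": "fi",  # "fi" ligature
    "\ufb02": "fl",  # "fl" ligature
}


def process_special_characters(text, replacements=REPLACEMENTS):
    """Return (cleaned_text, total_number_of_replacements)."""
    total = 0
    for old, new in replacements.items():
        total += text.count(old)        # how many times this character occurs
        text = text.replace(old, new)   # literal replacement, no regex involved
    return text, total


cleaned, changes = process_special_characters(
    "our \ufb01rst re\ufb02ection\u2014\u2018ok\u2019"
)
print(cleaned, changes)
```

Because every key is a single, distinct character, the order of the loop does not matter here; with multi-character keys that overlap, the longer ones would have to be handled first.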

Related

From strings without punctuation search into master string and take from there slices with punctuation without libraries, possible?

I have this homework to do (no libraries allowed) and I've underestimated this problem:
Let's say we have a list of strings: str_list = ["my head's", "free", "at last", "into alarm", "in another moment", "neck"]
What we know for sure is that every single string is contained in the master_string, that they appear in order, and that they contain no punctuation (all thanks to previous checks I've made).
Then we have the string: master_string = "'Come, my head's free at last!' said Alice in a tone of delight, which changed into alarm in another moment, when she found that her shoulders were nowhere to be found: all she could see, when she looked down, was an immense length of neck, which seemed to rise like a stalk out of a sea of green leaves that lay far below her."
What I must do here is basically find the sequences of at least k consecutive strings from str_list (in this case k = 2) that are contained in the master_string. However, I underestimated the fact that each string in str_list can contain more than one word, so doing master_string.split() won't get me anywhere: it would mean asking something like if "my head's" == "my", which is of course false.
I was thinking about concatenating the strings one at a time and searching in master_string.strip(".,:;!?"), but if I find matching sequences I absolutely need to take them directly from the master_string, because I need the punctuation in the result variable. This basically means taking slices directly from master_string, but how can that be done? Is it even possible, or do I have to change approach? This is driving me totally crazy, especially because no libraries are allowed.
In case you're wondering what is the expected result here would be:
["my head's free at last!", "into alarm in another moment,"] (because both satisfy the condition of at least k strings from str_list), and "neck" would be saved in a discard_list since it doesn't satisfy that condition (it mustn't be discarded with .pop() because I need to do other things with the discarded values).

Here is my solution:
Try to extend every string in str_list, based on the master_string and a finite set of punctuation characters (e.g. my head’s -> my head’s free at last!; free -> free at last!).
Keep only the substrings that have been extended at least k times.
Remove redundant substrings (e.g. free at last! is already contained in my head’s free at last!).
This is the code:
str_list = ["my head’s", "free", "at last", "into alarm", "in another moment", "neck"]
master_string = "‘Come, my head’s free at last!’ said Alice in a tone of delight, which changed into alarm in another moment, when she found that her shoulders were nowhere to be found: all she could see, when she looked down, was an immense length of neck, which seemed to rise like a stalk out of a sea of green leaves that lay far below her."
punctuation_characters = ".,:;!?" # list of punctuation characters
k = 1
def extend_string(current_str, successors_num = 0) :
    # check if the next token is a punctuation mark
    for punctuation_mark in punctuation_characters :
        if current_str + punctuation_mark in master_string :
            return extend_string(current_str + punctuation_mark, successors_num)
    # check if the next token is a proper successor
    for successor in str_list :
        if current_str + " " + successor in master_string :
            return extend_string(current_str + " " + successor, successors_num+1)
    # cannot extend the string anymore
    return current_str, successors_num

extended_strings = []
for s in str_list :
    extended_string, successors_num = extend_string(s)
    if successors_num >= k : extended_strings.append(extended_string)
extended_strings.sort(key=len) # sorting by ascending length
result_list = []
for es in extended_strings :
    result_list = list(filter(lambda s2 : s2 not in es, result_list))
    result_list.append(es)
print(result_list) # result: ['my head’s free at last!', 'into alarm in another moment,']

I've got two different versions. Number 1 gives you neck :(, but number 2 doesn't give you as much. Here's number 1:
master_string = "Come, my head’s free at last!’ said Alice in a tone of delight, which changed into alarm in another moment, when she found that her shoulders were nowhere to be found: all she could see, when she looked down, was an immense length of neck, which seemed to rise like a stalk out of a sea of green leaves that lay far below her."
str_list = ["my head's", "free", "at last", "into alarm", "in another moment", "neck"]
new_str = ''
for word in str_list:
    if word in master_string:
        new_str += word + ' '
print(new_str)
and here's number 2:
master_string = "Come, my head’s free at last!’ said Alice in a tone of delight, which changed into alarm in another moment, when she found that her shoulders were nowhere to be found: all she could see, when she looked down, was an immense length of neck, which seemed to rise like a stalk out of a sea of green leaves that lay far below her."
str_list = ["my head's", "free", "at last", "into alarm", "in another moment", "neck"]
new_str = ''
for word in str_list:
    if word in master_string:
        new_word = word.split(' ')
        if len(new_word) == 2:
            new_str += word + ' '
print(new_str)

Python - RegEx for splitting text into sentences (sentence-tokenizing) [duplicate]

I want to make a list of sentences from a string and then print them out. I don't want to use NLTK to do this. So it needs to split on a period at the end of a sentence, but not at decimals, abbreviations, titles in names, or a .com inside a sentence. This is my attempt at a regex, which doesn't work:
import re
text = """\
Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a probability of .9 it isn't.
"""
sentences = re.split(r' *[\.\?!][\'"\)\]]* *', text)
for stuff in sentences:
    print(stuff)
Example output of what it should look like
Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it.
Did he mind?
Adam Jones Jr. thinks he didn't.
In any case, this isn't true...
Well, with a probability of .9 it isn't.

(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s
Try this: split your string on this pattern. You can also check the demo:
http://regex101.com/r/nG1gU7/27
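As a quick check, here is how the pattern above behaves with re.split on the sample text from the question. This is a small sketch for Python 3; the variable names are mine, and the sample text is written without the trailing newline from the question (with it, the final newline also counts as a split point and an empty string appears at the end of the list).

```python
import re

# Pattern from the answer above, written as a raw string.
SENTENCE_BREAK = re.compile(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s')

text = ("Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid "
        "a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. In any "
        "case, this isn't true... Well, with a probability of .9 it isn't.")

# Split at every whitespace character that the lookbehinds accept.
sentences = SENTENCE_BREAK.split(text)
for sentence in sentences:
    print(sentence)
```

On this input the split produces the five sentences asked for. Note that the third lookbehind only accepts a period or a question mark, so a sentence ending in an exclamation mark would not be split off.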

Ok so sentence-tokenizers are something I looked at in a little detail, using regexes, nltk, CoreNLP, spaCy. You end up writing your own and it depends on the application. This stuff is tricky and valuable and people don't just give their tokenizer code away. (Ultimately, tokenization is not a deterministic procedure, it's probabilistic, and also depends very heavily on your corpus or domain, e.g. legal/financial documents vs social-media posts vs Yelp reviews vs biomedical papers...)
In general you can't rely on one single Great White infallible regex, you have to write a function which uses several regexes (both positive and negative); also a dictionary of abbreviations, and some basic language parsing which knows that e.g. 'I', 'USA', 'FCC', 'TARP' are capitalized in English.
To illustrate how easily this can get seriously complicated, let's try to write you that functional spec for a deterministic tokenizer just to decide whether single or multiple period ('.'/'...') indicates end-of-sentence, or something else:
function isEndOfSentence(leftContext, rightContext)
Return False for decimals inside numbers or currency e.g. 1.23 , $1.23, "That's just my $.02" Consider also section references like 1.2.A.3.a, European date formats like 09.07.2014, IP addresses like 192.168.1.1, MAC addresses...
Return False (and don't tokenize into individual letters) for known abbreviations e.g. "U.S. stocks are falling" ; this requires a dictionary of known abbreviations. Anything outside that dictionary you will get wrong, unless you add code to detect unknown abbreviations like A.B.C. and add them to a list.
Ellipses '...' at ends of sentences are terminal, but in the middle of sentences are not. This is not as easy as you might think: you need to look at the left context and the right context, specifically whether the RHS is capitalized, and again consider capitalized words like 'I' and abbreviations. Here's an example which proves the ambiguity: She asked me to stay... I left an hour later. (Was that one sentence or two? Impossible to determine.)
You may also want to write a few patterns to detect and reject miscellaneous non-sentence-ending uses of punctuation: emoticons :-), ASCII art, spaced ellipses . . . and other stuff esp. Twitter. (Making that adaptive is even harder). How do we tell if #midnight is a Twitter user, the show on Comedy Central, text shorthand, or simply unwanted/junk/typo punctuation? Seriously non-trivial.
After you handle all those negative cases, you could arbitrarily say that any isolated period followed by whitespace is likely to be an end of sentence. (Ultimately, if you really want to buy extra accuracy, you end up writing your own probabilistic sentence-tokenizer which uses weights, and training it on a specific corpus(e.g. legal texts, broadcast media, StackOverflow, Twitter, forums comments etc.)) Then you have to manually review exemplars and training errors. See Manning and Jurafsky book or Coursera course [a].
Ultimately you get as much correctness as you are prepared to pay for.
All of the above is clearly specific to the English-language/ abbreviations, US number/time/date formats. If you want to make it country- and language-independent, that's a bigger proposition, you'll need corpora, native-speaking people to label and QA it all, etc.
All of the above is still only ASCII, which is practically speaking only 96 characters. Allow the input to be Unicode, and things get harder still (and the training-set necessarily must be either much bigger or much sparser)
In the simple (deterministic) case, function isEndOfSentence(leftContext, rightContext) would return boolean, but in the more general sense, it's probabilistic: it returns a float 0.0-1.0 (confidence level that that particular '.' is a sentence end).
References: [a] Coursera video: "Basic Text Processing 2-5 - Sentence Segmentation - Stanford NLP - Professor Dan Jurafsky & Chris Manning" [UPDATE: an unofficial version used to be on YouTube, was taken down]
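To make the deterministic variant of isEndOfSentence(leftContext, rightContext) a bit more concrete, here is a deliberately small sketch in Python 3. Everything in it is an assumption for illustration: the snake_case name, the tiny abbreviation set, and the three rules, which cover only a fraction of the cases listed above (no dates, emoticons, or unknown abbreviations).

```python
# Hypothetical, deliberately tiny abbreviation list; a real one would be far larger.
KNOWN_ABBREVIATIONS = {"mr.", "mrs.", "dr.", "jr.", "i.e.", "e.g.", "u.s."}


def is_end_of_sentence(left_context, right_context):
    """Guess whether the period that ends left_context closes a sentence.

    left_context  -- text up to and including the '.'
    right_context -- text immediately after that '.'
    """
    # A period followed directly by a non-space character sits inside a token
    # (1.23, cheapsite.com, 192.168.1.1), so it cannot end a sentence.
    if right_context and not right_context[0].isspace():
        return False
    # Known abbreviations do not end a sentence.
    tokens = left_context.split()
    last_token = tokens[-1].lower() if tokens else ""
    if last_token in KNOWN_ABBREVIATIONS:
        return False
    # Otherwise accept end of input, or a following capital letter or opening quote.
    following = right_context.lstrip()
    return not following or following[0].isupper() or following[0] in "\"'("


print(is_end_of_sentence("he paid a lot for it.", " Did he mind?"))
print(is_end_of_sentence("Mr.", " Smith bought cheapsite.com"))
```

The probabilistic version described above would return a float between 0.0 and 1.0 instead of a bool, with the rules replaced by weights learned from a corpus.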

Try to split the input according to the spaces rather than a dot or ?, if you do like this then the dot or ? won't be printed in the final result.
>>> import re
>>> s = """Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a probability of .9 it isn't."""
>>> m = re.split(r'(?<=[^A-Z].[.?]) +(?=[A-Z])', s)
>>> for i in m:
...     print i
...
Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it.
Did he mind?
Adam Jones Jr. thinks he didn't.
In any case, this isn't true...
Well, with a probability of .9 it isn't.

sent = re.split('(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)(\s|[A-Z].*)',text)
for s in sent:
    print s
Here the regex used is : (?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)(\s|[A-Z].*)
First block: (?<!\w\.\w.) : a negative lookbehind ((?<!...)) that rejects positions preceded by a word character (\w), a full stop (\.), another word character (\w) and any one character (.), which covers abbreviations such as "i.e."
Second block: (?<![A-Z][a-z]\.): a negative lookbehind that rejects positions preceded by an uppercase letter ([A-Z]), a lowercase letter ([a-z]) and a dot (\.), which covers titles such as "Mr." or "Jr."
Third block: (?<=\.|\?): a positive lookbehind that requires the position to be preceded by a dot (\.) OR a question mark (\?)
Fourth block: (\s|[A-Z].*): this matches what follows the dot OR question mark from the third block: either a whitespace character (\s) OR any sequence of characters starting with an uppercase letter ([A-Z].*).
This block is important to split if the input is as
Hello world.Hi I am here today.
i.e. if there is space or no space after the dot.

Naive approach for proper english sentences not starting with non-alphas and not containing quoted parts of speech:
import re
text = """\
Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a probability of .9 it isn't.
"""
EndPunctuation = re.compile(r'([\.\?\!]\s+)')
NonEndings = re.compile(r'(?:Mrs?|Jr|i\.e)\.\s*$')
parts = EndPunctuation.split(text)
sentence = []
for part in parts:
    if len(part) and len(sentence) and EndPunctuation.match(sentence[-1]) and not NonEndings.search(''.join(sentence)):
        print(''.join(sentence))
        sentence = []
    if len(part):
        sentence.append(part)
if len(sentence):
    print(''.join(sentence))
False positive splitting may be reduced by extending NonEndings a bit. Other cases will require additional code. Handling typos in a sensible way will prove difficult with this approach.
You will never reach perfection with this approach. But depending on the task it might just work "enough"...

I'm not great at regular expressions, but a simpler version, "brute force" actually, of above is
sentence = re.compile("([\'\"][A-Z]|([A-Z][a-z]*\. )|[A-Z])(([a-z]*\.[a-z]*\.)|([A-Za-z0-9]*\.[A-Za-z0-9])|([A-Z][a-z]*\. [A-Za-z]*)|[^\.?]|[A-Za-z])*[\.?]")
which means
start acceptable units are '[A-Z] or "[A-Z]
Please note that alternation (|) tries its branches from left to right and takes the first one that matches, so the order is very important. That's why I have written the expression for forms like i.e. first, followed by the one for forms like Inc.

Try this:
(?<!\b(?:[A-Z][a-z]|\d|[i.e]))\.(?!\b(?:com|\d+)\b)

I wrote this taking into consideration smci's comments above. It is a middle-of-the-road approach that doesn't require external libraries and doesn't use regex. It allows you to provide a list of abbreviations and accounts for sentences ended by terminators in wrappers, such as a period and quote: [.", ?', .)].
abbreviations = {'dr.': 'doctor', 'mr.': 'mister', 'bro.': 'brother', 'bro': 'brother', 'mrs.': 'mistress', 'ms.': 'miss', 'jr.': 'junior', 'sr.': 'senior', 'i.e.': 'that is', 'e.g.': 'for example', 'vs.': 'versus'}
terminators = ['.', '!', '?']
wrappers = ['"', "'", ')', ']', '}']
def find_sentences(paragraph):
    end = True
    sentences = []
    while end > -1:
        end = find_sentence_end(paragraph)
        if end > -1:
            sentences.append(paragraph[end:].strip())
            paragraph = paragraph[:end]
    sentences.append(paragraph)
    sentences.reverse()
    return sentences

def find_sentence_end(paragraph):
    [possible_endings, contraction_locations] = [[], []]
    contractions = abbreviations.keys()
    sentence_terminators = terminators + [terminator + wrapper for wrapper in wrappers for terminator in terminators]
    for sentence_terminator in sentence_terminators:
        t_indices = list(find_all(paragraph, sentence_terminator))
        possible_endings.extend(([] if not len(t_indices) else [[i, len(sentence_terminator)] for i in t_indices]))
    for contraction in contractions:
        c_indices = list(find_all(paragraph, contraction))
        contraction_locations.extend(([] if not len(c_indices) else [i + len(contraction) for i in c_indices]))
    possible_endings = [pe for pe in possible_endings if pe[0] + pe[1] not in contraction_locations]
    if len(paragraph) in [pe[0] + pe[1] for pe in possible_endings]:
        max_end_start = max([pe[0] for pe in possible_endings])
        possible_endings = [pe for pe in possible_endings if pe[0] != max_end_start]
    possible_endings = [pe[0] + pe[1] for pe in possible_endings if sum(pe) > len(paragraph) or (sum(pe) < len(paragraph) and paragraph[sum(pe)] == ' ')]
    end = (-1 if not len(possible_endings) else max(possible_endings))
    return end

def find_all(a_str, sub):
    start = 0
    while True:
        start = a_str.find(sub, start)
        if start == -1:
            return
        yield start
        start += len(sub)
I used Karl's find_all function from this entry: Find all occurrences of a substring in Python

My example is based on the example of Ali, adapted to Brazilian Portuguese. Thanks Ali.
ABREVIACOES = ['sra?s?', 'exm[ao]s?', 'ns?', 'nos?', 'doc', 'ac', 'publ', 'ex', 'lv', 'vlr?', 'vls?',
'exmo(a)', 'ilmo(a)', 'av', 'of', 'min', 'livr?', 'co?ls?', 'univ', 'resp', 'cli', 'lb',
'dra?s?', '[a-z]+r\(as?\)', 'ed', 'pa?g', 'cod', 'prof', 'op', 'plan', 'edf?', 'func', 'ch',
'arts?', 'artigs?', 'artg', 'pars?', 'rel', 'tel', 'res', '[a-z]', 'vls?', 'gab', 'bel',
'ilm[oa]', 'parc', 'proc', 'adv', 'vols?', 'cels?', 'pp', 'ex[ao]', 'eg', 'pl', 'ref',
'[0-9]+', 'reg', 'f[ilí]s?', 'inc', 'par', 'alin', 'fts', 'publ?', 'ex', 'v. em', 'v.rev']
ABREVIACOES_RGX = re.compile(r'(?:{})\.\s*$'.format('|\s'.join(ABREVIACOES)), re.IGNORECASE)
def sentencas(texto, min_len=5):
    # based on https://stackoverflow.com/questions/25735644/python-regex-for-splitting-text-into-sentences-sentence-tokenizing
    texto = re.sub(r'\s\s+', ' ', texto)
    EndPunctuation = re.compile(r'([\.\?\!]\s+)')
    # print(NonEndings)
    parts = EndPunctuation.split(texto)
    sentencas = []
    sentence = []
    for part in parts:
        txt_sent = ''.join(sentence)
        q_len = len(txt_sent)
        if len(part) and len(sentence) and q_len >= min_len and \
                EndPunctuation.match(sentence[-1]) and \
                not ABREVIACOES_RGX.search(txt_sent):
            sentencas.append(txt_sent)
            sentence = []
        if len(part):
            sentence.append(part)
    if sentence:
        sentencas.append(''.join(sentence))
    return sentencas
Full code in: https://github.com/luizanisio/comparador_elastic

If you want to break up sentences at 3 periods (not sure if this is what you want) you can use this regular expression:
import re
text = """\
Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a probability of .9 it isn't.
"""
sentences = re.split(r'\.{3}', text)
for stuff in sentences:
    print(stuff)

How to use text.split() and retain blank (empty) lines

New to python, need some help with my program. I have a code which takes in an unformatted text document, does some formatting (sets the pagewidth and the margins), and outputs a new text document. My entire code works fine except for this function which produces the final output.
Here is the segment of the problem code:
def process(document, pagewidth, margins, formats):
    res = []
    onlypw = []
    pwmarg = []
    count = 0
    marg = 0
    for segment in margins:
        for i in range(count, segment[0]):
            res.append(document[i])
        text = ''
        foundmargin = -1
        for i in range(segment[0], segment[1]+1):
            marg = segment[2]
            text = text + '\n' + document[i].strip(' ')
        words = text.split()
Note: segment[0] means the beginning of the document, and segment[1] means the end of the document, if you are wondering about the range. My problem is that when I copy text to words (in words = text.split()), it does not retain my blank lines. The output I should be getting is:
This is my substitute for pistol and ball. With a
philosophical flourish Cato throws himself upon his sword; I
quietly take to the ship. There is nothing surprising in
this. If they but knew it, almost all men in their degree,
some time or other, cherish very nearly the same feelings
towards the ocean with me.

There now is your insular city of the Manhattoes, belted
round by wharves as Indian isles by coral reefs--commerce
surrounds it with her surf.
And what my current output looks like:
This is my substitute for pistol and ball. With a
philosophical flourish Cato throws himself upon his sword; I
quietly take to the ship. There is nothing surprising in
this. If they but knew it, almost all men in their degree,
some time or other, cherish very nearly the same feelings
towards the ocean with me. There now is your insular city of
the Manhattoes, belted round by wharves as Indian isles by
coral reefs--commerce surrounds it with her surf.
I know the problem happens when I copy text to words, since it doesn't keep the blank lines. How can I make sure it copies the blank lines plus the words?
Please let me know if I should add more code or more detail!

First split on at least 2 newlines, then split on words:
import re
paragraphs = re.split('\n\n+', text)
words = [paragraph.split() for paragraph in paragraphs]
You now have a list of lists, one per paragraph; process these per paragraph, after which you can rejoin the whole thing into new text with double newlines inserted back in.
I've used re.split() to support paragraphs being delimited by more than 2 newlines; you could use a simple text.split('\n\n') if there are ever only going to be exactly 2 newlines between paragraphs.
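The split, process, rejoin cycle described above might look like the sketch below. It assumes that "processing" a paragraph means re-wrapping it to a page width, which is what the question is after; the reflow name and the use of textwrap.fill are my choices for illustration, not part of the original code.

```python
import re
import textwrap


def reflow(text, width=60):
    """Re-wrap each paragraph to `width` columns, keeping blank lines between them."""
    # Paragraphs are separated by two or more newlines.
    paragraphs = re.split(r'\n\n+', text.strip())
    wrapped = []
    for paragraph in paragraphs:
        words = paragraph.split()  # collapses single newlines and runs of spaces
        wrapped.append(textwrap.fill(' '.join(words), width=width))
    # Put exactly one blank line back between paragraphs.
    return '\n\n'.join(wrapped)


sample = "This is my substitute\nfor pistol and ball.\n\n\nThere now is your\ninsular city."
print(reflow(sample, width=40))
```

Margins could be handled the same way per paragraph, for example through the initial_indent and subsequent_indent arguments of textwrap.fill.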

Use a regexp to find the words and the blank lines, rather than splitting:
m = re.compile('(\S+|\n\n)')
words=m.findall(text)
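A tiny Python 3 illustration of what that findall call returns; the sample string is made up. The blank line shows up as its own '\n\n' token, which the formatting loop can then treat as a paragraph break.

```python
import re

# Either a run of non-whitespace characters (a word) or a blank line.
TOKEN = re.compile(r'\S+|\n\n')

text = "first paragraph here\n\nsecond one"
tokens = TOKEN.findall(text)
print(tokens)
```

A single newline inside a paragraph matches neither alternative, so it is skipped just like a space.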

Python: Split unicode string on word boundaries

I need to take a string, and shorten it to 140 characters.
Currently I am doing:
if len(tweet) > 140:
    tweet = re.sub(r"\s+", " ", tweet) #normalize space
    footer = "… " + utils.shorten_urls(post['url'])
    avail = 140 - len(footer)
    words = tweet.split()
    result = ""
    for word in words:
        word += " "
        if len(word) > avail:
            break
        result += word
        avail -= len(word)
    tweet = (result + footer).strip()
    assert len(tweet) <= 140
So this works great for English and English-like strings, but fails for a Chinese string because tweet.split() just returns a list with a single element:
>>> s = u"简讯：新華社報道，美國總統奧巴馬乘坐的「空軍一號」專機晚上10時42分進入上海空域，預計約30分鐘後抵達浦東國際機場，開展他上任後首次訪華之旅。"
>>> s
u'\u7b80\u8baf\uff1a\u65b0\u83ef\u793e\u5831\u9053\uff0c\u7f8e\u570b\u7e3d\u7d71\u5967\u5df4\u99ac\u4e58\u5750\u7684\u300c\u7a7a\u8ecd\u4e00\u865f\u300d\u5c08\u6a5f\u665a\u4e0a10\u664242\u5206\u9032\u5165\u4e0a\u6d77\u7a7a\u57df\uff0c\u9810\u8a08\u7d0430\u5206\u9418\u5f8c\u62b5\u9054\u6d66\u6771\u570b\u969b\u6a5f\u5834\uff0c\u958b\u5c55\u4ed6\u4e0a\u4efb\u5f8c\u9996\u6b21\u8a2a\u83ef\u4e4b\u65c5\u3002'
>>> s.split()
[u'\u7b80\u8baf\uff1a\u65b0\u83ef\u793e\u5831\u9053\uff0c\u7f8e\u570b\u7e3d\u7d71\u5967\u5df4\u99ac\u4e58\u5750\u7684\u300c\u7a7a\u8ecd\u4e00\u865f\u300d\u5c08\u6a5f\u665a\u4e0a10\u664242\u5206\u9032\u5165\u4e0a\u6d77\u7a7a\u57df\uff0c\u9810\u8a08\u7d0430\u5206\u9418\u5f8c\u62b5\u9054\u6d66\u6771\u570b\u969b\u6a5f\u5834\uff0c\u958b\u5c55\u4ed6\u4e0a\u4efb\u5f8c\u9996\u6b21\u8a2a\u83ef\u4e4b\u65c5\u3002']
How should I do this so it handles I18N? Does this make sense in all languages?
I'm on python 2.5.4 if that matters.

Chinese doesn't usually have whitespace between words, and the symbols can have different meanings depending on context. You will have to understand the text in order to split it at a word boundary. In other words, what you are trying to do is not easy in general.

For word segmentation in Chinese, and other advanced tasks in processing natural language, consider NLTK as a good starting point if not a complete solution -- it's a rich Python-based toolkit, particularly good for learning about NL processing techniques (and not rarely good enough to offer you viable solution to some of these problems).

the re.U flag will treat \s according to the Unicode character properties database.
The given string, however, doesn't apparently contain any white space characters according to python's unicode database:
>>> x = u'\u7b80\u8baf\uff1a\u65b0\u83ef\u793e\u5831\u9053\uff0c\u7f8e\u570b\u7e3d\u7d71\u5967\u5df4\u99ac\u4e58\u5750\u7684\u300c\u7a7a\u8ecd\u4e00\u865f\u300d\u5c08\u6a5f\u665a\u4e0a10\u664242\u5206\u9032\u5165\u4e0a\u6d77\u7a7a\u57df\uff0c\u9810\u8a08\u7d0430\u5206\u9418\u5f8c\u62b5\u9054\u6d66\u6771\u570b\u969b\u6a5f\u5834\uff0c\u958b\u5c55\u4ed6\u4e0a\u4efb\u5f8c\u9996\u6b21\u8a2a\u83ef\u4e4b\u65c5\u3002'
>>> re.compile(r'\s+', re.U).split(x)
[u'\u7b80\u8baf\uff1a\u65b0\u83ef\u793e\u5831\u9053\uff0c\u7f8e\u570b\u7e3d\u7d71\u5967\u5df4\u99ac\u4e58\u5750\u7684\u300c\u7a7a\u8ecd\u4e00\u865f\u300d\u5c08\u6a5f\u665a\u4e0a10\u664242\u5206\u9032\u5165\u4e0a\u6d77\u7a7a\u57df\uff0c\u9810\u8a08\u7d0430\u5206\u9418\u5f8c\u62b5\u9054\u6d66\u6771\u570b\u969b\u6a5f\u5834\uff0c\u958b\u5c55\u4ed6\u4e0a\u4efb\u5f8c\u9996\u6b21\u8a2a\u83ef\u4e4b\u65c5\u3002']

I tried out the solution with PyAPNS for push notifications and just wanted to share what worked for me. The issue I had is that truncating at 256 bytes in UTF-8 would result in the notification getting dropped. I had to make sure the notification was encoded as "unicode_escape" to get it to work. I'm assuming this is because the result is sent as JSON and not raw UTF-8. Anyways here is the function that worked for me:
def unicode_truncate(s, length, encoding='unicode_escape'):
    encoded = s.encode(encoding)[:length]
    return encoded.decode(encoding, 'ignore')

After speaking with some native Cantonese, Mandarin, and Japanese speakers it seems that the correct thing to do is hard, but my current algorithm still makes sense to them in the context of internet posts.
Meaning, they are used to the "split on space and add … at the end" treatment.
So I'm going to be lazy and stick with it, until I get complaints from people that don't understand it.
The only change to my original implementation would be to not force a space on the last word, since it is unneeded in any language (and to use the Unicode ellipsis character … (U+2026) instead of three dots ... to save 2 characters).
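A sketch of that revised approach in Python 3, for illustration only. The function name, the suffix parameter (standing in for the shortened URL), and the hard-cut fallback for text without any usable whitespace are my assumptions; the suffix is assumed to be much shorter than the limit.

```python
ELLIPSIS = "\u2026"  # single-character ellipsis


def truncate_on_spaces(text, suffix="", limit=140):
    """Shorten text to at most `limit` characters, cutting at whitespace."""
    text = " ".join(text.split())            # normalize whitespace
    if len(text) <= limit:
        return text
    footer = ELLIPSIS + suffix
    avail = limit - len(footer)
    kept = []
    used = 0
    for word in text.split(" "):
        needed = len(word) + (1 if kept else 0)  # separator only between words
        if used + needed > avail:
            break
        kept.append(word)
        used += needed
    if not kept:
        # No word fits (e.g. Chinese text without spaces): fall back to a hard cut.
        return text[:avail] + footer
    return " ".join(kept) + footer


print(truncate_on_spaces("word " * 50))
```

Since the separator is only counted between words, no trailing space is left in front of the ellipsis.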

Basically, in CJK (Except Korean with spaces), you need dictionary look-ups to segment words properly. Depending on your exact definition of "word", Japanese can be more difficult than that, since not all inflected variants of a word (i.e. "行こう" vs. "行った") will appear in the dictionary. Whether it's worth the effort depends upon your application.

This punts the word-breaking decision to the re module, but it may work well enough for you.
import re
def shorten(tweet, footer="", limit=140):
    """Break tweet into two pieces at roughly the last word break
    before limit.
    """
    lower_break_limit = limit / 2
    # limit under which to assume breaking didn't work as expected
    limit -= len(footer)
    tweet = re.sub(r"\s+", " ", tweet.strip())
    m = re.match(r"^(.{,%d})\b(?:\W|$)" % limit, tweet, re.UNICODE)
    if not m or m.end(1) < lower_break_limit:
        # no suitable word break found
        # cutting at an arbitrary location,
        # or if len(tweet) < lower_break_limit, this will be true and
        # returning this still gives the desired result
        return tweet[:limit] + footer
    return m.group(1) + footer

What you're looking for is a Chinese word segmentation tool. Word segmentation is not an easy task and is not yet perfectly solved. There are several tools:
CkipTagger: developed by Academia Sinica, Taiwan.
jieba: developed by Sun Junyi, a Baidu engineer.
pkuseg: developed by the Language Computing and Machine Learning Group, Peking University.
If what you want is character segmentation, it can be done, although it is not very useful.
>>> s = u"简讯：新華社報道，美國總統奧巴馬乘坐的「空軍一號」專機晚上10時42分進入上海空域，預計約30分鐘後抵達浦東國際機場，開展他上任後首次訪華之旅。"
>>> chars = list(s)
>>> chars
[u'\u7b80', u'\u8baf', u'\uff1a', u'\u65b0', u'\u83ef', u'\u793e', u'\u5831', u'\u9053', u'\uff0c', u'\u7f8e', u'\u570b', u'\u7e3d', u'\u7d71', u'\u5967', u'\u5df4', u'\u99ac', u'\u4e58', u'\u5750', u'\u7684', u'\u300c', u'\u7a7a', u'\u8ecd', u'\u4e00', u'\u865f', u'\u300d', u'\u5c08', u'\u6a5f', u'\u665a', u'\u4e0a', u'1', u'0', u'\u6642', u'4', u'2', u'\u5206', u'\u9032', u'\u5165', u'\u4e0a', u'\u6d77', u'\u7a7a', u'\u57df', u'\uff0c', u'\u9810', u'\u8a08', u'\u7d04', u'3', u'0', u'\u5206', u'\u9418', u'\u5f8c', u'\u62b5', u'\u9054', u'\u6d66', u'\u6771', u'\u570b', u'\u969b', u'\u6a5f', u'\u5834', u'\uff0c', u'\u958b', u'\u5c55', u'\u4ed6', u'\u4e0a', u'\u4efb', u'\u5f8c', u'\u9996', u'\u6b21', u'\u8a2a', u'\u83ef', u'\u4e4b', u'\u65c5', u'\u3002']
>>> print('/'.join(chars))
简/讯/：/新/華/社/報/道/，/美/國/總/統/奧/巴/馬/乘/坐/的/「/空/軍/一/號/」/專/機/晚/上/1/0/時/4/2/分/進/入/上/海/空/域/，/預/計/約/3/0/分/鐘/後/抵/達/浦/東/國/際/機/場/，/開/展/他/上/任/後/首/次/訪/華/之/旅/。

Save two characters and use an ellipsis (…, 0x2026) instead of three dots!

Python parsing

I'm trying to parse the title tag in an RSS 2.0 feed into three different variables for each entry in that feed. Using ElementTree I've already parsed the RSS so that I can print each title [minus the trailing )] with the code below:
feed = getfeed("http://www.tourfilter.com/dallas/rss/by_concert_date")
for item in feed:
    print repr(item.title[0:-1])
I include that because, as you can see, the item.title is a repr() data type, which I don't know much about.
A particular repr(item.title[0:-1]) printed in the interactive window looks like this:
'randy travis (Billy Bobs 3/21'
'Michael Schenker Group (House of Blues Dallas 3/26'
The user selects a band and I hope to, after parsing each item.title into 3 variables (one each for band, venue, and date... or possibly an array or I don't know...) select only those related to the band selected. Then they are sent to Google for geocoding, but that's another story.
I've seen some examples of regex and I'm reading about them, but it seems very complicated. Is it? I thought maybe someone here would have some insight as to exactly how to do this in an intelligent way. Should I use the re module? Does it matter that the output is currently repr()s? Is there a better way? I was thinking I'd use a loop like (and this is my pseudoPython, just kind of notes I'm writing):
list = bandRaw,venue,date,latLong
for item in feed:
    parse item.title for bandRaw, venue, date
    if bandRaw == str(band)
        send venue name + ", Dallas, TX" to google for geocoding
        return lat,long
        list = list + return character + bandRaw + "," + venue + "," + date + "," + lat + "," + long
    else
In the end, I need to have the chosen entries in a .csv (comma-delimited) file looking like this:
band,venue,date,lat,long
randy travis,Billy Bobs,3/21,1234.5678,1234.5678
Michael Schenker Group,House of Blues Dallas,3/26,4321.8765,4321.8765
I hope this isn't too much to ask. I'll be looking into it on my own, just thought I should post here to make sure it got answered.
So, the question is, how do I best parse each repr(item.title[0:-1]) in the feed into the 3 separate values that I can then concatenate into a .csv file?

Don't let regex scare you off... it's well worth learning.
Given the examples above, you might try putting the trailing parenthesis back in, and then using this pattern:
import re
s = 'Michael Schenker Group (House of Blues Dallas 3/26)' # a title with the ")" restored
pat = re.compile(r'([\w\s]+)\(([\w\s]+)(\d+/\d+)\)')
info = pat.match(s)
print(info.groups())
('Michael Schenker Group ', 'House of Blues Dallas ', '3/26')
To get at each group individually, call group() on the info object. The first two groups keep their trailing space, so strip them before writing them out:
print(info.group(1)) # or info.groups()[0]
print('"%s","%s","%s"' % (info.group(1).strip(), info.group(2).strip(), info.group(3)))
"Michael Schenker Group","House of Blues Dallas","3/26"
The hard part about regex in this case is knowing every character that can appear in the title. If the 'Michael Schenker Group' part can contain anything other than letters, digits, underscores and whitespace, you'll have to adjust that part of the regex to allow it.
The pattern above is read left to right and breaks down as follows:
([\w\s]+) : Match any word or whitespace characters (the plus sign means one or more of them). The parentheses mean that the match will be captured as a group. This is the "Michael Schenker Group " part. \w already covers letters, digits and the underscore; if there can be dashes or other punctuation here, add them between the square brackets, which list the characters allowed in the set.
\( : A literal parenthesis. The backslash escapes it, since an unescaped parenthesis would start a capture group. This is the "(" part of the string.
([\w\s]+) : Same as the one above, but this time matches the "House of Blues Dallas " part. In parentheses so they will be captured as the second group.
(\d+/\d+) : Matches the digits 3 and 26 with a slash in the middle. In parentheses so they will be captured as the third group.
\) : Closing parenthesis for the above.
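As a quick check of the breakdown above, here is a minimal sketch (Python 3 syntax) that runs the pattern against the two titles from the question with the closing parenthesis restored. One caveat: \w also matches digits, so with a two-digit month the greedy venue group keeps the first digit and the date comes out short. The tight_pat variant and the third title are my own illustrations, not part of the original answer; tight_pat assumes there is always whitespace between the venue and the date.

```python
import re

# Pattern from the answer, written as a raw string.
pat = re.compile(r'([\w\s]+)\(([\w\s]+)(\d+/\d+)\)')

# Hypothetical tightened variant: lazy groups plus mandatory whitespace
# before the date, so the venue cannot swallow a digit of the month.
tight_pat = re.compile(r'([\w\s]+?)\s*\(([\w\s]+?)\s+(\d+/\d+)\)')

titles = [
    'randy travis (Billy Bobs 3/21)',
    'Michael Schenker Group (House of Blues Dallas 3/26)',
    'Some Band (Some Venue 12/25)',  # made-up title with a two-digit month
]

def parse(pattern, title):
    """Return (band, venue, date), or None if the title does not match."""
    m = pattern.match(title)
    return m.groups() if m else None

for title in titles:
    print(parse(pat, title), parse(tight_pat, title))
```

For the made-up third title, the original pattern returns ('Some Band ', 'Some Venue 1', '2/25'), while the tightened one returns ('Some Band', 'Some Venue', '12/25').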
The Python intro to regex is quite good, and you might want to spend an evening going over it: http://docs.python.org/library/re.html#module-re. Also, check Dive Into Python, which has a friendly introduction: http://diveintopython3.ep.io/regular-expressions.html.
EDIT: See zacherates below, who has some nice edits. Two heads are better than one!

Regular expressions are a great solution to this problem:
>>> import re
>>> s = 'Michael Schenker Group (House of Blues Dallas 3/26'
>>> re.match(r'(.*) \((.*) (\d+/\d+)', s).groups()
('Michael Schenker Group', 'House of Blues Dallas', '3/26')
As a side note, you might want to look at the Universal Feed Parser for handling the RSS parsing as feeds have a bad habit of being malformed.
Edit
In regards to your comment... The strings occasionally being wrapped in "s rather than 's has to do with the fact that you're using repr. The repr of a string is usually delimited with 's, unless that string contains one or more 's, where instead it uses "s so that the 's don't have to be escaped:
>>> "Hello there"
'Hello there'
>>> "it's not its"
"it's not its"
Notice the different quote styles.

Regarding the repr(item.title[0:-1]) part, I'm not sure where you got that from, but I'm pretty sure you can simply use item.title. All you're doing is removing the last character from the string and then calling repr() on it, which only wraps the result in quotes for display.
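A small sketch of what the slice and repr() each contribute, in Python 3 syntax. The title string is the first example from the question with its closing parenthesis restored, which is an assumption about what the feed actually returns.

```python
title = 'randy travis (Billy Bobs 3/21)'

trimmed = title[0:-1]   # drops only the final character, the ")"
shown = repr(trimmed)   # a new string that includes the quote characters

print(trimmed)          # randy travis (Billy Bobs 3/21
print(shown)            # 'randy travis (Billy Bobs 3/21'
print(type(trimmed) is type(shown))  # True: both are ordinary strings
```

So there is no special "repr data type": the title is a plain string either way, and it can be passed straight to re.match or re.search.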
Your code should look something like this:
import re
import feedparser # from www.feedparser.org
from geopy import geocoders # from GeoPy
us = geocoders.GeocoderDotUS()
feedurl = "http://www.tourfilter.com/dallas/rss/by_concert_date"
feed = feedparser.parse(feedurl)
lines = []
for entry in feed.entries:
    # titles look like "band (venue m/d)"
    m = re.search(r'(.*) \((.*) (\d+/\d+)\)', entry.title)
    if m:
        bandRaw, venue, date = m.groups()
        if band == bandRaw: # band is the name the user selected
            place, (lat, lng) = us.geocode(venue + ", Dallas, TX")
            # str() in case lat and lng are numbers rather than strings
            lines.append(",".join([band, venue, date, str(lat), str(lng)]))
result = "\n".join(lines)
EDIT: replaced list with lines as the var name. list is a builtin and should not be used as a variable name. Sorry.
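Venue names can themselves contain commas, in which case joining the fields by hand produces a malformed row. Here is a minimal sketch of the final step using the standard csv module, in Python 3 syntax. The geocoding call is left out; the values in rows, including the coordinates, are just the sample data from the question, and rows_to_csv is a hypothetical helper name.

```python
import csv
import io

def rows_to_csv(rows):
    """Serialize (band, venue, date, lat, long) tuples as CSV text with a header."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["band", "venue", "date", "lat", "long"])
    for row in rows:
        writer.writerow(row)  # the writer converts numbers to text and quotes where needed
    return buf.getvalue()

rows = [
    ("randy travis", "Billy Bobs", "3/21", 1234.5678, 1234.5678),
    ("Michael Schenker Group", "House of Blues Dallas", "3/26", 4321.8765, 4321.8765),
]

print(rows_to_csv(rows))
```

To write straight to a file instead, open it with newline="" and hand the file object to csv.writer in place of the StringIO buffer.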
