I am trying to split message text for a messaging system into sequences of at most 160 characters that end in a space. The only exception is the very last sequence, which can end in anything as long as it is 160 characters or fewer.
The regular expression '.{1,160}\s' almost works, but it cuts off the last word of a message, because the last character of a message is generally not a space.
I also tried '.{1,160}\s|.{1,160}', but this does not work because the final sequence is just the remaining text after the last space. Does anyone have an idea how to do this?
EXAMPLE:
two_cities = ("It was the best of times, it was the worst of times, it was " +
"the age of wisdom, it was the age of foolishness, it was the " +
"epoch of belief, it was the epoch of incredulity, it was the " +
"season of Light, it was the season of Darkness, it was the " +
"spring of hope, it was the winter of despair, we had " +
"everything before us, we had nothing before us, we were all " +
"going direct to Heaven, we were all going direct the other " +
"way-- in short, the period was so far like the present period," +
" that some of its noisiest authorities insisted on its being " +
"received, for good or for evil, in the superlative degree of " +
"comparison only.")
chunks = re.findall('.{1,160}\s|.{1,160}', two_cities)
print(chunks)
will return
['It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of ',
'incredulity, it was the season of Light, it was the season of Darkness, it was the spring of hope, it was the winter of despair, we had everything before us, we ',
'had nothing before us, we were all going direct to Heaven, we were all going direct the other way-- in short, the period was so far like the present period, ',
'that some of its noisiest authorities insisted on its being received, for good or for evil, in the superlative degree of comparison ',
'only.']
where the final element of the list should be
'that some of its noisiest authorities insisted on its being received, for good or for evil, in the superlative degree of comparison only.'
not 'only.'
Try this - .{1,160}(?:(?<=[ ])|$)
.{1,160} # 1 - 160 chars
(?:
(?<= [ ] ) # Lookbehind, must end with a space
| $ # or, be at End of String
)
Info -
By default, the engine will try to match 160 characters (greedily).
Then it checks the next part of the expression.
The lookbehind enforces the last character matched with .{1,160} is a space.
Or, if at the end of the string, no enforcement.
If the lookbehind fails, and not at the end of string, the engine will backtrack to 159 characters, then check again. This repeats until the assertion passes.
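A minimal sketch of how this pattern behaves with re.findall, assuming the message contains no newlines (. does not match them by default) and no single word longer than 159 characters. The sample message and the names message and chunk_pattern are made up for illustration:

```python
import re

# Hypothetical sample text; any long single-line message behaves the same way.
message = ("It was the best of times, it was the worst of times, " * 8).strip()

# Up to 160 characters that must end right after a space,
# or else run to the very end of the string.
chunk_pattern = re.compile(r".{1,160}(?:(?<=[ ])|$)")
chunks = chunk_pattern.findall(message)

for chunk in chunks:
    print(len(chunk), repr(chunk[-10:]))
```

Because the group is non-capturing, findall returns the whole matches, so joining the chunks gives back the original message.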
You could avoid a regular expression here, since a pattern that has to backtrack from 160 characters on every chunk can be inefficient.
I would recommend something like this:
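chunks = []
current = ""
words = two_cities.split(" ")
for i, word in enumerate(words):
    # every word keeps its trailing space except the very last one
    piece = word if i == len(words) - 1 else word + " "
    # start a new chunk when this word would push the current one past 160
    if current and len(current) + len(piece) > 160:
        chunks.append(current)
        current = ""
    current += piece
chunks.append(current)
print(chunks)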
This creates a list of all the words, split on spaces.
If the word will fit onto the string, it will add it onto the string. When it cannot, it adds it to the list and starts a new string. At the end, you have a list of the results.
Related
I have various instance of strings such as:
- hello world,i am 2000to -> hello world, i am 2000 to
- the state was 56,869,12th -> the state was 56,869, 12th
- covering.2% -> covering. 2%
- fiji,295,000 -> fiji, 295,000
To deal with the first case, I came up with a two-step regex:
re.sub(r"(?<=[,])(?=[^\s])(?=[^0-9])", r" ", text) # hello world, i am 2000to
re.sub(r"(?<=[0-9])(?=[.^[a-z])", r" ", text) # hello world, i am 2000 to
But this breaks the text in other ways, and the other cases are not covered either. Can anyone suggest a more general regex that handles all the cases properly? I've tried using replace, but it makes some unintended replacements, which in turn cause other problems. I'm not an expert in regex and would appreciate pointers.
This approach covers your cases above by breaking the text into tokens:
import re
in_list = [
    'hello world,i am 2000to',
    'the state was 56,869,12th',
    'covering.2%',
    'fiji,295,000',
    'and another example with a decimal 12.3not4,5 is right out',
    'parrot,, is100.00% dead',  # the comma here keeps this string separate from the next one
    'Holy grail runs for this portion of 100 minutes,!, 91%. Fascinating'
]
tokenizer = re.compile(r'[a-zA-Z]+[\.,]?|(?:\d{1,3}(?:,\d{3})+|\d+)(?:\.\d+)?(?:%|st|nd|rd|th)?[\.,]?')
for s in in_list:
    print(' '.join(tokenizer.findall(s)))
# hello world, i am 2000 to
# the state was 56,869, 12th
# covering. 2%
# fiji, 295,000
# and another example with a decimal 12.3 not 4, 5 is right out
# parrot, is 100.00% dead
# Holy grail runs for this portion of 100 minutes, 91%. Fascinating
Breaking up the regex, each token is the longest available substring that is either:
Only letters, with or without a trailing period or comma: [a-zA-Z]+[\.,]?
OR |
A number-ish expression, which could be
1 to 3 digits \d{1,3} followed by one or more groups of comma + 3 digits (?:,\d{3})+
OR | any number of comma-free digits \d+
optionally a decimal point followed by at least one digit (?:\.\d+)?
optionally a suffix (percent, 'st', 'nd', 'rd', 'th') (?:%|st|nd|rd|th)?
optionally a period or comma [\.,]?
Note the (?:blah) is used to suppress re.findall's natural desire to tell you how every parenthesized group matches up on an individual basis. In this case we just want it to walk forward through the string, and the ?: accomplishes this.
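To illustrate the point about (?:...), here is a small sketch comparing re.findall with capturing and non-capturing groups. The sample string and variable names are made up for the example:

```python
import re

sample = "1st 22nd 3"

# With capturing groups, findall returns one tuple of groups per match.
with_groups = re.findall(r"(\d+)(st|nd|rd|th)?", sample)
print(with_groups)     # [('1', 'st'), ('22', 'nd'), ('3', '')]

# With a non-capturing group, findall returns the whole matched text.
without_groups = re.findall(r"\d+(?:st|nd|rd|th)?", sample)
print(without_groups)  # ['1st', '22nd', '3']
```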
I need to clean my corpus. It has these problems:
multiple spaces --> Tables .
footnote --> 10h 50m,1
unknown ” --> replace it with "
For instance, you can see all of them here:
On 1580 November 12 at 10h 50m,1 they set Mars down at 8° 36’ 50” Gemini2 without mentioning the horizontal variations, by which term I wish the diurnal parallaxes and the refractions to be understood in what follows. Now this observation is distant and isolated. It was reduced to the moment of opposition using the diurnal motion from the Prutenic Tables .
I have done it using these functions
def fix4token(x):
    x=re.sub('”', '\"', x)
    if (x[0].isdigit()== False )| (bool(re.search('[a-zA-Z]', x))==True ):
        res=x.rstrip('0123456789')
        output = re.split(r"\b,\b",res, 1)[0]
        return output
    else:
        return x
def removespaces(x):
    res=x.replace("  ", " ")
    return(res)
It works reasonably well for this paragraph; the result is:
On 1580 November 12 at 10h 50m, they set Mars down at 8° 36’ 50" Gemini without mentioning the horizontal variations, by which term I wish the diurnal parallaxes and the refractions to be understood in what follows. Now this observation is distant and isolated. It was reduced to the moment of opposition using the diurnal motion from the Prutenic Tables.
But the problem is that it damaged other paragraphs. It does not work very well,
I guess because these lines break other things:
x=re.sub('”', '\"', x)
if (x[0].isdigit()== False )| (bool(re.search('[a-zA-Z]', x))==True ):
    res=x.rstrip('0123456789')
    output = re.split(r"\b,\b",res, 1)[0]
What is the safest way to do these?
1- Remove footnotes like the ones in these phrases:
"10h 50m,1" (an extra footnote number in the text after a comma)
"Gemini2" (zodiac name + footnote number)
without changing any other part of the text (e.g. my approach turns "DC2" into "DC", which is not desired).
2- Remove spaces before a dot, e.g. "Tables ." should become "Tables.",
or collapse multiple spaces elsewhere, e.g. ",  by which term" should become ", by which term" (only one space).
3- Replace the unknown ” with " (this is already done).
Thank you.
You can use
text = re.sub(r'\b(?:(?<=,)\d+|(Capricorn|Aquarius|Pisces|Aries|Taurus|Gemini|Cancer|Leo|Virgo|Libra|Scorpio|Ophiuchus|Sagittarius)\d+)\b|\s+(?=[.,])', r'\1', text, flags=re.I).replace('”', '"')
text = re.sub(r'\s{2,}', ' ', text)
Details:
\b - a word boundary
(?: - start of a non-capturing group:
(?<=,)\d+ - one or more digits that are preceded with a comma
| - or
(Capricorn|Aquarius|Pisces|Aries|Taurus|Gemini|Cancer|Leo|Virgo|Libra|Scorpio|Ophiuchus|Sagittarius)\d+ - one of the zodiac sign words (captured into Group 1, \1 in the replacement pattern refers to this value) and then one or more digits
) - the end of the non-capturing group
\b - a word boundary
| - or
\s+ - one or more whitespaces
(?=[.,]) - that are immediately followed with . or ,.
The .replace('”', '"') replaces all ” with a " char.
The re.sub(r'\s{2,}', ' ', text) code replaces all chunks of two or more whitespaces with a single regular space.
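As a quick check, the two substitutions above can be wrapped in a function and run on a shortened sample. This is only a sketch: the names ZODIAC, FOOTNOTE_OR_SPACE and clean_text and the sample sentence are made up for illustration. Note that the (?<=,)\d+ branch also removes the digits after a comma in a number such as 295,000, so it is only safe if the corpus has no thousands separators.

```python
import re

ZODIAC = ("Capricorn|Aquarius|Pisces|Aries|Taurus|Gemini|Cancer|Leo|Virgo|"
          "Libra|Scorpio|Ophiuchus|Sagittarius")
FOOTNOTE_OR_SPACE = re.compile(
    r"\b(?:(?<=,)\d+|(" + ZODIAC + r")\d+)\b|\s+(?=[.,])", flags=re.I)

def clean_text(text):
    # Drop footnote digits (keeping the zodiac name via \1) and spaces before . or ,
    text = FOOTNOTE_OR_SPACE.sub(r"\1", text).replace("”", '"')
    # Collapse runs of two or more whitespace characters into one space
    return re.sub(r"\s{2,}", " ", text)

sample = ("at 10h 50m,1 they set Mars down at 36’ 50” Gemini2 without   "
          "variations. DC2 from the Prutenic Tables   .")
cleaned = clean_text(sample)
print(cleaned)

# Caveat: digits after a comma inside a number are stripped as well.
thousands = clean_text("fiji, 295,000")
print(thousands)
```

"DC2" is left alone because there is no word boundary between "C" and "2" and "DC" is not in the zodiac list.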
I have this homework to do (no libraries allowed) and I've underestimated this problem:
Let's say we have a list of strings: str_list = ["my head's", "free", "at last", "into alarm", "in another moment", "neck"]
What we know for sure is that every string is contained in master_string, that they appear in order, and that they have no punctuation (all thanks to earlier checks I've made).
Then we have the string: master_string = "'Come, my head's free at last!' said Alice in a tone of delight, which changed into alarm in another moment, when she found that her shoulders were nowhere to be found: all she could see, when she looked down, was an immense length of neck, which seemed to rise like a stalk out of a sea of green leaves that lay far below her."
What I must do is find the sequences of at least k consecutive strings from str_list (in this case k = 2) that are contained in master_string. However, I underestimated the fact that each string in str_list can hold more than one word, so master_string.split() won't get me anywhere: it would mean asking something like "my head's" == "my", which is of course false.
I was thinking about concatenating the strings one at a time and searching in master_string.strip(".,:;!?"), but when I find a matching sequence I absolutely need to take it directly from master_string, because I need the punctuation in the result variable. This basically means taking slices directly from master_string, but how can that be done? Is it even possible, or do I have to change approach? This is driving me totally crazy, especially because no libraries are allowed.
In case you're wondering, the expected result here would be:
["my head's free at last!", "into alarm in another moment,"] (because both satisfy the condition of at least k strings from str_list), and "neck" would be saved in a discard_list since it doesn't satisfy that condition (it mustn't be thrown away with .pop() because I need to do other things with the discarded strings).
Here is my solution:
Try to extend each string in str_list, based on master_string and a finite set of punctuation characters (e.g. my head’s -> my head’s free at last!; free -> free at last!).
Keep only the strings that have been extended at least k times.
Remove redundant substrings (e.g. free at last! is already contained in my head’s free at last!).
This is the code:
str_list = ["my head’s", "free", "at last", "into alarm", "in another moment", "neck"]
master_string = "‘Come, my head’s free at last!’ said Alice in a tone of delight, which changed into alarm in another moment, when she found that her shoulders were nowhere to be found: all she could see, when she looked down, was an immense length of neck, which seemed to rise like a stalk out of a sea of green leaves that lay far below her."
punctuation_characters = ".,:;!?" # list of punctuation characters
k = 1 # minimum number of successors appended to a string (i.e. two strings in a row)
def extend_string(current_str, successors_num = 0) :
    # check if the next token is a punctuation mark
    for punctuation_mark in punctuation_characters :
        if current_str + punctuation_mark in master_string :
            return extend_string(current_str + punctuation_mark, successors_num)
    # check if the next token is a proper successor
    for successor in str_list :
        if current_str + " " + successor in master_string :
            return extend_string(current_str + " " + successor, successors_num+1)
    # cannot extend the string anymore
    return current_str, successors_num
extended_strings = []
for s in str_list :
    extended_string, successors_num = extend_string(s)
    if successors_num >= k : extended_strings.append(extended_string)
extended_strings.sort(key=len) # sorting by ascending length
result_list = []
for es in extended_strings :
    # drop shorter results that are contained in the current, longer one
    result_list = list(filter(lambda s2 : s2 not in es, result_list))
    result_list.append(es)
print(result_list) # result: ['my head’s free at last!', 'into alarm in another moment,']
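The question also asks for a discard_list. One way to build it from the result, sketched here under the assumption that a string counts as discarded when it does not occur inside any kept sequence (the result_list literal below just repeats the output shown above, with straight apostrophes):

```python
str_list = ["my head's", "free", "at last", "into alarm", "in another moment", "neck"]
result_list = ["my head's free at last!", "into alarm in another moment,"]

# Keep every original string that is not part of any accepted sequence.
discard_list = [s for s in str_list
                if not any(s in kept for kept in result_list)]
print(discard_list)  # ['neck']
```

This uses only built-in operations, so it stays within the "no libraries" constraint.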
I've got two different versions. Number 1 gives you neck :(, but number 2 doesn't give you as much. Here's number 1:
master_string = "Come, my head’s free at last!’ said Alice in a tone of delight, which changed into alarm in another moment, when she found that her shoulders were nowhere to be found: all she could see, when she looked down, was an immense length of neck, which seemed to rise like a stalk out of a sea of green leaves that lay far below her."
str_list = ["my head's", "free", "at last", "into alarm", "in another moment", "neck"]
new_str = ''
for word in str_list:
    if word in master_string:
        new_str += word + ' '
print(new_str)
And here's number 2:
master_string = "Come, my head’s free at last!’ said Alice in a tone of delight, which changed into alarm in another moment, when she found that her shoulders were nowhere to be found: all she could see, when she looked down, was an immense length of neck, which seemed to rise like a stalk out of a sea of green leaves that lay far below her."
str_list = ["my head's", "free", "at last", "into alarm", "in another moment", "neck"]
new_str = ''
for word in str_list:
    if word in master_string:
        new_word = word.split(' ')
        if len(new_word) == 2:
            new_str += word + ' '
print(new_str)
I want to make sure that each sentence in a text starts with a capital letter.
E.g. "we have good news and bad news about your emissaries to our world," the extraterrestrial ambassador informed the Prime Minister. the good news is they tasted like chicken." should become
"We have good news and bad news about your emissaries to our world," the extraterrestrial ambassador informed the Prime Minister. The good news is they tasted like chicken."
I tried using split() to split the sentence. Then, I capitalized the first character of each line. I appended the rest of the string to the capitalized character.
text = input("Enter the text: \n")
lines = text.split('. ') # Split the sentences
for line in lines:
    a = line[0].capitalize() # capitalize the first character of the sentence
    for i in range(1, len(line)):
        a = a + line[i]
    print(a)
I want to obtain "We have good news and bad news about your emissaries to our world," the extraterrestrial ambassador informed the Prime Minister. The good news is they tasted like chicken."
I get "We have good news and bad news about your emissaries to our world," the extraterrestrial ambassador informed the Prime Minister
The good news is they tasted like chicken."
This code should work:
text = input("Enter the text: \n")
lines = text.split('. ') # Split the sentences
for index, line in enumerate(lines):
    lines[index] = line[0].upper() + line[1:]
print(". ".join(lines))
The error in your code is that str.split(sep) removes the separator from the result, which is why the period disappears.
Sorry for not providing a thorough description as I cannot think of what to say. Please feel free to ask in comments.
EDIT: Let me try to explain what I did.
Lines 1-2: Accept the input and split it into a list on '. '. On the sample input, this gives: ['"we have good news and bad news about your emissaries to our world," the extraterrestrial ambassador informed the Prime Minister', 'the good news is they tasted like chicken.']. Note that the period is gone from the first sentence, where the split happened.
Line 3: enumerate returns an iterator that walks through an iterable and yields a tuple of the index and the item for each element.
Line 4: Replaces line at its position in lines with the uppercased first character plus the rest of the line.
Line 5: Prints the message. ". ".join(lines) basically reverses what you did with split. sep.join(l) takes an iterable of strings, l, and sticks them together with sep between all the items. Without this, you would be missing your periods.
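A short sketch of the split/join round trip described above, using a made-up sample text instead of input():

```python
text = "first sentence. second sentence. third sentence."

# split drops every ". " separator; the final "." stays because no space follows it
parts = text.split(". ")
print(parts)  # ['first sentence', 'second sentence', 'third sentence.']

# capitalize the first character of each part, then put the separators back
capitalized = [part[0].upper() + part[1:] for part in parts]
rebuilt = ". ".join(capitalized)
print(rebuilt)  # First sentence. Second sentence. Third sentence.
```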
When you split the string by ". ", that removes every ". " from your string and puts the remaining pieces into a list. You need to add the lost periods back to your sentences to make this work.
Also, this can leave the last sentence with a double period, since it only has "." at the end of it, not ". ". We need to remove that trailing period (if it exists) before splitting, to make sure we don't get double periods.
text = input("Enter the text: \n")
output = ""
if text.endswith('.'):
    # remove the last period to avoid double periods in the last sentence
    text = text[:-1]
lines = text.split('. ') # Split the sentences
for line in lines:
    a = line[0].capitalize() # capitalize the first character of the sentence
    for i in range(1, len(line)):
        a = a + line[i]
    a = a + '. ' # add back the removed period and space
    output = output + a
print(output.rstrip()) # drop the space after the final period
We can also make this solution cleaner:
text = input("Enter the text: \n")
output = ""
if text.endswith('.'):
    # remove the last period to avoid double periods in the last sentence
    text = text[:-1]
lines = text.split('. ') # Split the sentences
for line in lines:
    a = line[0].capitalize() + line[1:] + '. ' # add back the removed period and space
    output = output + a
print(output.rstrip()) # drop the space after the final period
By using str[1:] you can get a copy of your string with the first character removed. And using str[:-1] will give you a copy of your string with the last character removed.
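One detail worth knowing about these slices: unlike indexing with [0], a slice never raises on an empty string. A small sketch (capitalize_first is a hypothetical helper name) that relies on this, so empty pieces produced by split do not crash the loop:

```python
def capitalize_first(sentence):
    # sentence[:1] is "" for an empty string, whereas sentence[0] would raise IndexError
    return sentence[:1].upper() + sentence[1:]

word = "hello."
print(word[1:])   # ello.
print(word[:-1])  # hello

# "one. . two" splits into ['one', '', 'two'], including an empty piece
pieces = "one. . two".split(". ")
fixed = [capitalize_first(piece) for piece in pieces]
print(fixed)      # ['One', '', 'Two']
```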
split splits the string, and none of the resulting strings contain the delimiter (the string or character you split by).
Change your code to this:
text = input("Enter the text: \n")
lines = text.split('. ') #Split the sentences
final_text = ". ".join([line[0].upper()+line[1:] for line in lines])
print(final_text)
The code below can handle multiple sentence types (ending in ".", "!", "?", etc.) and will capitalize the first word of each sentence. Since you want to keep your existing capital letters, the capitalize function will not work (it makes every character after the first lowercase, including words that do not start a sentence). You can throw a lambda function into the list comp to take advantage of upper() on the first letter of each sentence; this keeps the rest of the sentence completely unchanged.
import re
original_sentence = 'we have good news and bad news about your emissaries to our world," the extraterrestrial ambassador informed the Prime Minister. the good news is they tasted like chicken.'
val = re.split('([.!?] *)', original_sentence)
new_sentence = ''.join([(lambda x: x[0].upper() + x[1:])(each) if len(each) > 1 else each for each in val])
print(new_sentence)
The "new_sentence" list comprehension is the same as saying:
sentence = []
for each in val:
    sentence.append((lambda x: x[0].upper() + x[1:])(each) if len(each) > 1 else each)
print(''.join(sentence))
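The difference between capitalize() and uppercasing only the first character is easy to see on a fragment of the sample text (variable names here are just for illustration):

```python
sentence = "the extraterrestrial ambassador informed the Prime Minister"

# capitalize() uppercases the first character and lowercases everything else
with_capitalize = sentence.capitalize()
print(with_capitalize)   # The extraterrestrial ambassador informed the prime minister

# upper() on the first character alone leaves the rest untouched
with_upper_first = sentence[0].upper() + sentence[1:]
print(with_upper_first)  # The extraterrestrial ambassador informed the Prime Minister
```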
You can use the re.sub function to replace the very first character of the text, and every word character that follows a period and a space, with its uppercase equivalent.
import re
original_sentence = 'we have good news and bad news about your emissaries to our world," the extraterrestrial ambassador informed the Prime Minister. the good news is they tasted like chicken.'
def replacer(match_obj):
    return match_obj.group(0).upper()
# Replace the very first character, or any word character following a period and a space, with its uppercase version.
result = re.sub(r"(?<=\. )(\w)|^\w", replacer, original_sentence)
print(result)
# 'We have good news and bad news about your emissaries to our world," the extraterrestrial ambassador informed the Prime Minister. The good news is they tasted like chicken.'
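The same idea can be extended to sentences ending in "!" or "?" by widening the lookbehind. This is a sketch, not part of the answer above; SENTENCE_START and capitalize_sentences are made-up names. Note that ^\w only matches when the text starts with a word character, so a text that opens with a quotation mark keeps its first letter unchanged.

```python
import re

# A word character at the start of the text, or right after ". ", "! " or "? "
SENTENCE_START = re.compile(r"(?<=[.!?] )\w|^\w")

def capitalize_sentences(text):
    return SENTENCE_START.sub(lambda match: match.group(0).upper(), text)

result = capitalize_sentences("is it done? yes. great! thanks")
print(result)  # Is it done? Yes. Great! Thanks
```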
So I am attempting to use python to write text to a Microsoft word document. The code works perfectly except for when it runs up against a non-ascii character. When that happens, I am greeted by the following error.
ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters
I attempted to solve this issue by using regular expressions to pluck out and replace non-ascii characters. re.sub(pattern, repl, string, count=0, flags=0) seemed like it would work. Here is the code that I threw together:
match1 = re.search(ur'ʼ', bodyHTML)
match2 = re.search(ur'ï¬', bodyHTML)
match3 = re.search(ur'fl', bodyHTML)
if match1:
    print 'Match 1'
    bodyHTML = re.sub(ur'ʼ', "'", bodyHTML)
if match3:
    print 'Match 3'
    bodyHTML = re.sub(ur'fl', 'fl', bodyHTML)
if match2:
    print 'Match 2'
    bodyHTML = re.sub(ur'ï¬', 'fi', bodyHTML)
"match1" works perfectly. Whenever there is an ʼ in the text, it is replaced by an apostrophe.
"match2" and "match3" are a different story. Here's an example:
After a short hike we had our ï¬rst glimpse of the museum
Naturally, this triggers a response from match2. But instead of producing
After a short hike we had our first glimpse of the museum
It spits out
After a short hike we had our fiürst glimpse of the museum
This happens several times. "signiï¬cant" becomes "signifiücant" and so on.
I am not sure why this is happening.
I am also running into an issue where match2 is steamrolling any match from match3. In other words
the ripples on the pond and so did the shimmering reflections in the glass walls
becomes
the ripples on the pond and so did the shimmering refiéected in the glass walls
instead of
the ripples on the pond and so did the shimmering reflected in the glass walls
I'm not really sure why match2 is dominating, especially because I put the match3 if statement before the one for match2 specifically so it would remove all of sections with "fl" and leave only the "ï¬" snippets for match2 to mop up.
As far as the other non-ascii characters popping up after running the code...I have no idea.
Any help is much appreciated.
Thank you
For specific 'translation' I use chr like this:
def process_special_characters(result):
    '''sub out known yuck for legit stuff (smart quotes, etc.), keep track of how many changes'''
    total = 0
    result = re.subn(chr(133), chr(95), result) #strange underbar
    total += result[1]
    result = re.subn(chr(145), chr(39), result[0]) #smart quote
    total += result[1]
    result = re.subn(chr(146), chr(39), result[0]) #other smart quote
    total += result[1]
    result = re.subn(chr(150), chr(45), result[0]) #strange hyphen
    total += result[1]
    result = re.subn(chr(8212), chr(45), result[0]) #strange long hyphen
    total += result[1]
    result = re.subn(chr(160), chr(32), result[0]) #non breaking space
    total += result[1]
    return result[0], total
You could easily make a dict and just have the key be your "from" and the value be the "to" and loop over it. If it's a short list this works fine. Note that using subn allows you to keep track of the number of changes. You can always toss all that logic and just go with sub; in my case the business side wanted a count of changes.
You could also have lines like this if it is easier to read:
result = re.subn(chr(133), "_", result) #strange underbar
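The dict-and-loop variant mentioned above could look roughly like this in Python 3. It is a sketch: the REPLACEMENTS table is an assumed example mapping, not the exact set of characters from the function above, and should be filled with whatever characters actually appear in your text.

```python
import re

# Assumed example mapping of "from" characters to "to" replacements
REPLACEMENTS = {
    "\u2018": "'",   # left single smart quote
    "\u2019": "'",   # right single smart quote
    "\u2013": "-",   # en dash
    "\u2014": "-",   # em dash
    "\u00a0": " ",   # non-breaking space
}

def process_special_characters(text):
    """Replace each known character and count how many changes were made."""
    total = 0
    for old, new in REPLACEMENTS.items():
        # re.escape guards against keys that are regex metacharacters
        text, count = re.subn(re.escape(old), new, text)
        total += count
    return text, total

cleaned, changes = process_special_characters("it\u2019s a test \u2014 ok\u00a0now")
print(cleaned, changes)  # it's a test - ok now 3
```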
Good luck! Just toss a comment in if you have more questions and I'll update.