Breaking text into chunks using regex?

I am trying to create a regex that will take a longish string that contains space separated words and break it into chunks of up to 50 characters that end with a space or the end of the line.
I first came up with: (.{0,50}(\s|$)) but that only grabbed the first match. I then thought I would add a * to the end: (.{0,50}(\s|$))* but now it grabs the entire string.
I've been testing it, but can't seem to get it to work as needed. Can anyone see what I am doing wrong?

Here, it seems to be working:
import re
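# (?:\s|$) means "whitespace or end of string"; inside [...] the | and $ would be literal characters.
# {1,50} instead of {0,50} avoids empty or whitespace-only matches.
p = re.compile(r'(.{1,50}(?:\s|$))')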
test_str = u"jasdljasjdlk jal skdjl ajdl kajsldja lksjdlkasd jas lkjdalsjdalksjdalksjdlaksjdk sakdjakl jd fgdfgdfg\nhgkjd fdkfhgk dhgkjhdfhg kdhfgk jdfghdfkjghjf dfjhgkdhf hkdfhgkj jkdfgk jfgkfg dfkghk hdfkgh d asdada \ndkjfghdkhg khdfkghkd hgkdfhgkdhfk k dfghkdfgh dfgdfgdfgd\n"
re.findall(p, test_str)

What are you using to match the regex? The re.findall() method should return what you want.
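As a minimal sketch of that suggestion: the function name chunk_text and the sample sentence are made up for illustration, and it assumes no single word is longer than the width (such a word would be partly dropped by this pattern).

```python
import re

def chunk_text(text, width=50):
    """Split text into chunks of at most `width` characters, breaking at whitespace."""
    # .{1,width} is greedy, so each match takes as many characters as fit
    # before a whitespace character or the end of the string.
    pattern = re.compile(r'.{1,%d}(?:\s|$)' % width)
    # Each raw match still carries the whitespace it ended on, so strip it.
    return [chunk.strip() for chunk in pattern.findall(text)]

chunks = chunk_text("the quick brown fox jumps over the lazy dog", width=10)
print(chunks)
```

Because findall() returns every non-overlapping match, there is no need to add a * after the group.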

It's not using a regex, but have you thought about using textwrap.wrap()?
In [8]: import textwrap
text = ' '.join([
"Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed et convallis",
"lectus. Quisque maximus diam ut sodales tincidunt. Integer ac finibus",
"elit. Etiam tristique euismod justo, vel pretium tellus malesuada et.",
"Pellentesque id mattis eros, at bibendum mauris. In luctus lorem eget nisl",
"sagittis sollicitudin. Aenean consequat at lacus at porttitor. Nunc sit",
"amet neque eu sem venenatis rutrum. Proin sed tempus lacus, sit amet porta",
"velit. Suspendisse et semper nisl, eu varius orci. Ut non metus."])
In [9]: textwrap.wrap(text, 50)
Out[9]: ['Lorem ipsum dolor sit amet, consectetur adipiscing',
'elit. Sed et convallis lectus. Quisque maximus',
'diam ut sodales tincidunt. Integer ac finibus',
'elit. Etiam tristique euismod justo, vel pretium',
'tellus malesuada et. Pellentesque id mattis eros,',
'at bibendum mauris. In luctus lorem eget nisl',
'sagittis sollicitudin. Aenean consequat at lacus',
'at porttitor. Nunc sit amet neque eu sem venenatis',
'rutrum. Proin sed tempus lacus, sit amet porta',
'velit. Suspendisse et semper nisl, eu varius orci.',
'Ut non metus.']

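If you need runs of at most 50 non-whitespace characters, use '[^\s]{1,50}'. Note that this cuts long words into pieces and never joins several words into one chunk.
Example with a smaller number:
>>> import re
>>> text = "Lorem ipsum sit dolor"
>>> splitter = re.compile(r'[^\s]{1,3}')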
>>> splitter.findall(text)
['Lor', 'em', 'ips', 'um', 'sit', 'dol', 'or']

Related

How to search backwards with regex on multiline string in Python

I'm wondering if there's an efficient way of doing the following:
I have a python script that reads an entire file into a single string. Then, given the location of a token of interest, I'd like to find the string index of the beginning of the line given that token.
file_str = read_file("foo.txt")
token_pos = re.search("token",file_str).start()
#this does not work, as str.rfind does not take regex, and you cannot specify re.M:
beginning_of_line = file_str.rfind("^",0,token_pos)
I could use a greedy regex to find the last beginning of line, but this has to be done many times, so I'd rather not scan the whole file on each iteration. Is there a good way to do this?
----------------- EDIT ----------------
I tried to keep the question simple, but it looks like more details are required. Here's a better example of one of the things I'm trying to do:
file_str = """
{
blah {
{} {{} "string with unmatched }" }
}
}"""
I happen to know where the opening and closing positions of blah's braces are. I need to get the lines between the braces (non-inclusive). So, given the position of the closing brace, I need to find the beginning of the line containing it. I'd like to do something akin to a reverse regex to find it. I can, of course, write a special function to do this, but I was thinking there would be some more Pythonic way of going about it. To further complicate things, I would have to do this several times per file, and the file string can potentially change between iterations, so pre-indexing doesn't really work either...

Instead of matching just the keyword, match everything from the start of the line to the keyword. You could use re.finditer() to get an iterator that keeps yielding matches as it finds them.
file_str = """Lorem ipsum dolor sit amet, consectetur adipiscing elit amet.
Vestibulum vestibulum mollis enim, eu tristique est rhoncus et.
Curabitur sem nisi, ornare eu pellentesque at, interdum at lectus.
Phasellus molestie, turpis id ornare efficitur, ex tellus aliquet ipsum, vitae ullamcorper tellus diam a velit.
Nulla eget eleifend nisl.
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Nullam finibus, velit non euismod faucibus, dolor orci maximus lacus, sed mattis nisi erat eget turpis.
Maecenas ut pharetra lorem.
Curabitur nec dui sed velit euismod bibendum.
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Pellentesque tempor dolor at placerat aliquet.
Duis laoreet, est vitae tempor porta, risus leo ullamcorper risus, quis vestibulum massa orci ut felis.
In finibus purus ac nulla congue mattis.
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Duis efficitur dui ac nisi lobortis, a bibendum felis volutpat.
Aenean consectetur diam at risus hendrerit, in vestibulum erat porttitor.
Quisque fringilla accumsan neque, sed efficitur nunc tristique maximus.
Maecenas gravida lectus et porttitor ultrices.
Nam lobortis, massa et porta vulputate, nulla turpis maximus sapien, sit amet finibus libero mauris eu sapien.
Donec sollicitudin vulputate neque, in tempor nisi suscipit quis.
"""
keyword = "amet"
for match_obj in re.finditer(f"^.*{keyword}", file_str, re.MULTILINE):
    beginning_of_line = match_obj.start()
    print(beginning_of_line, match_obj)
Which gives:
0 <re.Match object; span=(0, 60), match='Lorem ipsum dolor sit amet, consectetur adipiscin>
331 <re.Match object; span=(331, 357), match='Lorem ipsum dolor sit amet'>
566 <re.Match object; span=(566, 592), match='Lorem ipsum dolor sit amet'>
815 <re.Match object; span=(815, 841), match='Lorem ipsum dolor sit amet'>
1129 <re.Match object; span=(1129, 1206), match='Nam lobortis, massa et porta vulputate, nulla tur>
Note that the first line is matched only once even though it contains two amets: .* is greedy, so the match runs to the last amet on the line and the first amet is consumed by the .*

You don't need to use regex to find the beginning of lines with the token.
This iterates over the file line by line, builds the string foo from the file's content, and records in the list line_pos_with_token the index in foo at which each line containing the token begins.
token = "token"
foo = ''
line_pos_with_token = []
with open("foo.txt", "r") as f:
    for line in f:
        if token in line:
            # len(foo) is the offset at which this line starts
            line_pos_with_token.append(len(foo))
        foo += line
print(line_pos_with_token)
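
If only the start-of-line index for a known token position is needed, a plain backwards search for the previous newline may be enough. This sketch is not taken from the answers above; line_start is a hypothetical helper and the sample string is invented.

```python
import re

def line_start(text, pos):
    """Return the index at which the line containing `pos` begins."""
    # rfind searches backwards within text[0:pos]; it returns -1 when there is
    # no earlier newline, so adding 1 yields 0 for the first line.
    return text.rfind("\n", 0, pos) + 1

file_str = "first line\nsecond token line\nthird"
token_pos = re.search("token", file_str).start()
beginning_of_line = line_start(file_str, token_pos)
print(beginning_of_line, repr(file_str[beginning_of_line:token_pos]))
```

Since str.rfind only looks at the slice before the token, nothing is pre-indexed and the call still works if the string changes between iterations.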

Python 'textwrap' module not replacing all whitespace when using 'wrap' and 'fill' functions

I'm trying to output some text to the console using the Python textwrap module. According to the documentation, both the wrap and fill functions replace whitespace by default because the replace_whitespace kwarg defaults to True.
However, in practice some whitespace remains and I am unable to format the text as required.
How can I format the text below without the chunk of whitespace after the first sentence?
Some title
----------
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Duis sem eros, imperdiet vitae mi
cursus, dapibus fringilla eros. Duis quis euismod ex. Maecenas ac consequat quam. Sed ac augue
dignissim, facilisis sapien vitae, vehicula mi. Maecenas elementum ex eu tortor dignissim
scelerisque. Curabitur et lobortis ex.
Here is the code that I am using.
from textwrap import fill, wrap
lines = [
'Some title',
'----------',
'',
' '.join(wrap('''Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Duis sem eros, imperdiet vitae mi cursus, dapibus fringilla eros. Duis
quis euismod ex. Maecenas ac consequat quam. Sed ac augue dignissim,
facilisis sapien vitae, vehicula mi. Maecenas elementum ex eu tortor
dignissim scelerisque. Curabitur et lobortis ex.'''))
]
out = []
for l in lines:
    out.append(fill(l, width=100))
print('\n'.join(out))

replace_whitespace does not collapse runs of regular spaces ' '; it only replaces each of '\t\n\v\f\r' with a single space.
From the docs:
replace_whitespace
(default: True) If true, after tab expansion but before wrapping, the wrap() method will replace each whitespace character with a single space. The whitespace characters replaced are as follows: tab, newline, vertical tab, formfeed, and carriage return ('\t\n\v\f\r').
https://docs.python.org/3/library/textwrap.html#textwrap.TextWrapper.replace_whitespace
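
A small sketch of what that means in practice. The sample text and variable names are invented, and collapsing whitespace with split()/join() is one possible workaround, not something the documentation prescribes.

```python
from textwrap import fill

raw = """Lorem ipsum dolor sit amet.
        Duis sem eros, imperdiet vitae mi cursus."""

# replace_whitespace turns the newline into one space but leaves the
# eight indentation spaces alone, so a gap survives in the output.
unchanged = fill(raw, width=100)

# Collapsing every whitespace run to a single space first removes the gap.
normalized = fill(" ".join(raw.split()), width=100)

print(repr(unchanged))
print(repr(normalized))
```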

So it looks like this can be done using dedent and defining the paragraph as separate single-line strings instead of one multiline string.
from textwrap import dedent, fill
lines = [
'Some title',
'----------',
'',
dedent('Lorem ipsum dolor sit amet, consectetur adipiscing elit. ' \
'Duis sem eros, imperdiet vitae mi cursus, dapibus fringilla eros. Duis ' \
'quis euismod ex. Maecenas ac consequat quam. Sed ac augue dignissim, ' \
'facilisis sapien vitae, vehicula mi. Maecenas elementum ex eu tortor ' \
'dignissim scelerisque. Curabitur et lobortis ex.')
]
out = []
for l in lines:
    out.append(fill(l, width=100))
print('\n'.join(out))

Scraping portions of text only after specific words in HTML file

I'm very new to Python (one week old), so I'm sorry if this sounds silly, but I would really appreciate some help. I want to scrape specific portions of text in an HTML file. For example, let's say the whole text is:
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Nunc fringilla arcu congue metus aliquam mollis.
Mauris nec maximus purus. Maecenas sit amet pretium tellus.
Quisque at dignissim lacus.
I want to scrape all the text after the word "mollis" and before the phrase "Quisque at dignissim lacus", and the desired output should be:
Mauris nec maximus purus. Maecenas sit amet pretium tellus.
So far, I have just managed to scrape some parts from a website and remove the HTML tag:
from bs4 import BeautifulSoup
import re  # "from re import findall" would leave the name re undefined below
file = open('filename.html', encoding="UTF-8")
soup = BeautifulSoup(file, 'lxml')
for match in soup.find_all('div', class_='discussion-desc'):
    recom = match.text
re.findall(r'#(\w+)','recommendations')
#['recommendations', 'steps']
#re.findall(r'#(\w+)', 'recommendations')
#[]
#(re.findall(r'#(\w+)', 'recommendations') or None,)[0]
#'recommendations'
#print (re.findall(r'#(\w+)', 'recommendations') or None,)[0]
#None
Please help, thank you.

In case of a single occurrence, you can use re.search():
import re
s = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nunc fringilla arcu congue metus aliquam mollis. Mauris nec maximus purus. Maecenas sit amet pretium tellus. Quisque at dignissim lacus"
re.search(r'mollis\.(.*?)Quisque at dignissim lacus', s).group(1)
Output:
Out[28]: ' Mauris nec maximus purus. Maecenas sit amet pretium tellus. '
In case of multiple occurrences, have a look at re.findall()
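
A hedged sketch of the re.findall() variant. The sample string is invented, and re.DOTALL is added on the assumption that the text between the markers may span several lines.

```python
import re

s = ("intro mollis. first part Quisque at dignissim lacus "
     "more mollis. second part Quisque at dignissim lacus")

# (.*?) is non-greedy, so each match stops at the nearest closing phrase.
# re.DOTALL lets . match newlines as well.
pattern = r'mollis\.(.*?)Quisque at dignissim lacus'
parts = [m.strip() for m in re.findall(pattern, s, re.DOTALL)]
print(parts)
```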

How can I split a text by (a), (b)?

I want to split my text by subparts (a), (b), ...
import re
s = "(a) First sentence. \n(b) Second sentence. \n(c) Third sentence."
l = re.compile('\(([a-f]+)').split(s)
With my regex I get a list of 7 elements:
['', 'a', ') First sentence. \n', 'b', ') Second sentence. \n', 'c', ') Third sentence.']
but what I want is a list of 3 elements,
the first item should be (a) with the first sentence, the second item (b) and the third and last item (c):
['(a) First sentence.', '(b) Second sentence.', '(c) Third sentence.']

You can use a positive lookahead (?=...) to split the string at every position that is immediately followed by the pattern (a letter from a to f in parentheses). Splitting on such an empty match requires Python 3.7 or later:
import re
s = "(a) Lorem ipsum dolor sit amet, consectetur adipiscing elit. \n(b) Nullam porta aliquet ornare. Integer non ullamcorper nibh. Curabitur eu maximus odio. Mauris egestas fermentum ligula non fermentum. Sed tincidunt dolor porta egestas consequat. Nullam pharetra fermentum venenatis. Maecenas at tempor sapien, eu gravida augue. Fusce nec elit sollicitudin est euismod placerat nec ut purus. \n(c) Phasellus fermentum enim ex. Suspendisse ac augue vitae magna convallis dapibus."
l = re.compile(r'(?=\([a-f]\))').split(s)
print(l)
Output:
['', '(a) Lorem ipsum dolor sit amet, consectetur adipiscing elit. \n', '(b) Nullam porta aliquet ornare. Integer non ullamcorper nibh. Curabitur eu maximus odio. Mauris egestas fermentum ligula non fermentum. Sed tincidunt dolor porta egestas consequat. Nullam pharetra fermentum venenatis. Maecenas at tempor sapien, eu gravida augue. Fusce nec elit sollicitudin est euismod placerat nec ut purus. \n', '(c) Phasellus fermentum enim ex. Suspendisse ac augue vitae magna convallis dapibus.']
If you don't want the empty string(s), you can use filter:
l = list(filter(None, l))
If you don't want the trailing newlines on each string, you can use map:
l = list(map(str.strip, l))
or
l = list(map(str.rstrip, l))
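
The steps above can be combined into one helper. This is only a sketch: split_subparts is a made-up name, and it assumes Python 3.7 or later, where re.split accepts patterns that match the empty string.

```python
import re

def split_subparts(text):
    """Split text before each "(a)" .. "(f)" marker and tidy the pieces."""
    # The lookahead matches the empty string right before a marker,
    # so the marker itself stays attached to the text that follows it.
    parts = re.split(r'(?=\([a-f]\))', text)
    # Drop the empty leading piece and strip surrounding whitespace.
    return [p.strip() for p in parts if p.strip()]

s = "(a) First sentence. \n(b) Second sentence. \n(c) Third sentence."
print(split_subparts(s))
```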

python write strings of bytes

I compress some data with the lzw module and I save them into a file ('wb' mode). This returns something like this:
'\x18\xc0\x86#\x08$\x0e\x060\x82\xc2`\x90\x98l*'
For small compressed data lzw's strings are in the above format.
When I compress bigger strings, the compressed string from lzw is split into lines.
'\x18\xc0\x86#\x08$\x0e\x060\x82\xc2`\x90\x98l*', '\xff\xb6\xd9\xe8r4'
As I checked, the string contains '\n' chars, so I think I lose information if the newline is missing. How can I store the string so that it is not split and is stored on one line?
I have tried this:
for i in s_string:
    testfile.write(i)
-----------------
testfile.write(s_string)
EDIT
def mycpsr(x):
    #x = '11010101001010101010010111110101010101001010' # some random bits for lzw input
    temp = lzw.compress(x)
    temp = "".join(temp)
    return temp
>>> import lzw
>>> print mycpsr('10101010011111111111111111111111100000000000111111')
If I use a bigger input, let's say x is a string of 0s and 1s with len(x) = 1000, and I append the compressed data to a file, I get multiple lines instead of one line.
If the file has this data:
'\t' + normal strings + '\n'
<LZW-strings(with \t\n chars)>
'\t' + normal strings + '\n'
How can I tell which is lzw data and which is other data?

So, your binary data contains newlines, and you want to embed it into a line-oriented document. To do that, you need to quote newlines in the binary data. One way to do it, which will quote not only newlines, but other non-printable characters, is by using base64 encoding:
import base64, lzw
def my_compress(x):
    # b64encode emits no line breaks (encodestring would insert one every
    # 76 characters), so the result is a single line; \n is appended explicitly
    return base64.b64encode("".join(lzw.compress(x))) + "\n"
def my_decompress(line):
    return lzw.decompress(base64.b64decode(line))
If your code handles binary characters other than newline, you can make the encoding more space-efficient by only replacing newline with r"\n" (backslash followed by n), and backslash with r"\\" (two backslash characters). This will allow lzw data to reside in a single binary line, and you will need to just do the inverse transformation before calling lzw.decompress.
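
A minimal sketch of that escaping scheme, written for Python 3 bytes. The function names are hypothetical and the sample bytes are invented. The decoder scans left to right instead of chaining replace() calls, on the assumption that an escaped backslash followed by a literal n has to survive the round trip.

```python
def escape_newlines(data: bytes) -> bytes:
    # Escape backslashes first so the "\n" markers added next stay unambiguous.
    return data.replace(b"\\", b"\\\\").replace(b"\n", b"\\n")

def unescape_newlines(line: bytes) -> bytes:
    out = bytearray()
    i = 0
    while i < len(line):
        if line[i:i + 1] == b"\\" and i + 1 < len(line):
            nxt = line[i + 1:i + 2]
            # "\n" (backslash, n) becomes a newline; "\\" becomes one backslash.
            out += b"\n" if nxt == b"n" else nxt
            i += 2
        else:
            out += line[i:i + 1]
            i += 1
    return bytes(out)

data = b"\x18\xc0\n\\n\x86\\\n"
encoded = escape_newlines(data)
print(encoded)
```

The encoded value contains no newline byte, so it can be written as one line and passed through unescape_newlines before decompression.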

You are dealing with binary data. If your data contains more than 256 bytes, there is a good chance that some of the bytes equal the ASCII code of '\n'. The result is a binary file that appears to contain more than one line when viewed as a text file.
This is not a problem as long as you deal with binary files as sequence of bytes not as a sequence of lines.
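
To illustrate, here is a sketch that writes bytes containing newlines in 'wb' mode and reads them back in 'rb' mode. The payload and file name are invented, and a temporary directory is used only to keep the example self-contained.

```python
import os
import tempfile

# Sample bytes that happen to include newline bytes (0x0a).
payload = b"\x18\xc0\x86#\x08$\n\x0e\x060\x82\n\xc2`"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.bin")
    # Binary mode writes the bytes untouched, newlines included.
    with open(path, "wb") as f:
        f.write(payload)
    # Reading everything back as bytes ignores any notion of "lines".
    with open(path, "rb") as f:
        restored = f.read()

print(restored == payload)
```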

>>> txt = """Lorem ipsum dolor sit amet, consectetur adipiscing elit. Vestibulum
ante velit, adipiscing eget sodales non, faucibus vitae nunc. Praesent ac lorem
cursus, aliquet magna sed, porta diam. Nunc lorem sapien, euismod in congue non
, tincidunt sit amet arcu. Lorem ipsum dolor sit amet, consectetur adipiscing el
it. Phasellus eleifend bibendum massa, ac convallis tellus sodales in. Suspendis
se non aliquam massa. Aenean erat ipsum, sagittis vitae elementum sit amet, iacu
lis sit amet quam. Vivamus luctus hendrerit libero at fringilla. Nullam id urna
est. Vestibulum pretium et tellus et dictum.
...
... Fusce nulla velit, lobortis at ligula eget, fermentum condimentum felis. Mae
cenas pretium posuere elit in posuere. Suspendisse gravida erat tristique, venen
atis erat at, sagittis elit. Donec laoreet lacinia nunc, eu consequat tortor. Cr
as at sem scelerisque, tristique dolor a, porta mauris. Fusce fermentum massa vi
tae arcu sagittis, et laoreet lacus suscipit. Vestibulum sed accumsan quam. Vest
ibulum eu egestas nisl. Curabitur dolor massa, auctor tempus dui ut, volutpat vu
lputate massa. Fusce vitae tortor adipiscing, gravida est at, molestie tortor. A
enean quis magna magna. Donec cursus enim ac egestas cursus. Pellentesque pulvin
ar nibh in sapien sollicitudin, eget tempus tortor pulvinar. Phasellus dignissim
, urna a sagittis tempor, nulla nulla rhoncus enim, vel molestie nisl lectus qui
s erat. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Vestibulum sit
amet malesuada nisi, sit amet placerat sem."""
>>>
>>> print "".join(lzw.decompress(lzw.compress(txt)))
This appears to decode it correctly, including the \n characters.
