How can I split a text by (a), (b)? - python

I want to split my text by subparts (a), (b), ...
import re
s = "(a) First sentence. \n(b) Second sentence. \n(c) Third sentence."
l = re.compile(r'\(([a-f]+)').split(s)
With my regex I get a list of 7 elements:
['', 'a', ') First sentence. \n', 'b', ') Second sentence. \n', 'c', ') Third sentence.']
but what I want is a list of 3 elements,
the first item should be (a) with the first sentence, the second item (b) and the third and last item (c):
['(a) First sentence.', '(b) Second sentence.', '(c) Third sentence.']

You can use a positive lookahead (?=...) to split the string at the positions that are immediately followed by a letter from a to f in parentheses:
import re
s = "(a) Lorem ipsum dolor sit amet, consectetur adipiscing elit. \n(b) Nullam porta aliquet ornare. Integer non ullamcorper nibh. Curabitur eu maximus odio. Mauris egestas fermentum ligula non fermentum. Sed tincidunt dolor porta egestas consequat. Nullam pharetra fermentum venenatis. Maecenas at tempor sapien, eu gravida augue. Fusce nec elit sollicitudin est euismod placerat nec ut purus. \n(c) Phasellus fermentum enim ex. Suspendisse ac augue vitae magna convallis dapibus."
l = re.compile(r'(?=\([a-f]\))').split(s)
print(l)
Output:
['', '(a) Lorem ipsum dolor sit amet, consectetur adipiscing elit. \n', '(b) Nullam porta aliquet ornare. Integer non ullamcorper nibh. Curabitur eu maximus odio. Mauris egestas fermentum ligula non fermentum. Sed tincidunt dolor porta egestas consequat. Nullam pharetra fermentum venenatis. Maecenas at tempor sapien, eu gravida augue. Fusce nec elit sollicitudin est euismod placerat nec ut purus. \n', '(c) Phasellus fermentum enim ex. Suspendisse ac augue vitae magna convallis dapibus.']
If you don't want the empty string(s), you can use filter:
l = list(filter(None, l))
If you don't want the trailing newlines on each string, you can use map:
l = list(map(str.strip, l))
or
l = list(map(str.rstrip, l))
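Putting the pieces together, a minimal sketch (Python 3.7+, where re.split accepts a zero-width pattern) that yields exactly the three items asked for:
import re

s = "(a) First sentence. \n(b) Second sentence. \n(c) Third sentence."

parts = re.split(r'(?=\([a-f]\))', s)        # split right before "(a)" ... "(f)"
parts = [p.strip() for p in parts if p]      # drop the leading empty string, trim whitespace
print(parts)  # ['(a) First sentence.', '(b) Second sentence.', '(c) Third sentence.']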

Related

How to split a text file into chunks?

I have tried many methods but none has worked for me. I want to split a text file's lines into multiple chunks, specifically 50 lines per chunk,
like this: [['Line1', 'Line2', ... up to 50], ...] and so on.
data.txt (example):
Line1
Line2
Line3
Line4
Line5
Line6
Line7
Line8
Python code:
with open('data.txt', 'r') as file:
    sample = file.readlines()

chunks = []
for i in range(0, len(sample), 3):    # replace 3 with 50 in your case
    chunks.append(sample[i:i+3])      # replace 3 with 50 in your case
chunks (in my example, chunks of 3 lines):
[['Line1\n', 'Line2\n', 'Line3\n'], ['Line4\n', 'Line5\n', 'Line6\n'], ['Line7\n', 'Line8']]
You can apply the string.rstrip('\n') method on those lines to remove the \n at the end.
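For example, a one-line sketch that strips them from the chunks built above:
chunks = [[line.rstrip('\n') for line in chunk] for chunk in chunks]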
Alternative:
Without reading the whole file into memory (better):
chunks = []
with open('data.txt', 'r') as file:
    while True:
        chunk = []
        for i in range(3):    # replace 3 with 50 in your case
            line = file.readline()
            if not line:
                break
            chunk.append(line)
            # or chunk.append(line.rstrip('\n')) to remove the '\n' at the end
        if not chunk:
            break
        chunks.append(chunk)
print(chunks)
This produces the same result.
A good way to do it would be to create a generic generator function that could break any sequence up into chunks of any size. Here's what I mean:
from itertools import zip_longest

def grouper(n, iterable):  # Generator function.
    "s -> (s0, s1, ...sn-1), (sn, sn+1, ...s2n-1), (s2n, s2n+1, ...s3n-1), ..."
    FILLER = object()  # Unique object.
    for group in zip_longest(*([iter(iterable)]*n), fillvalue=FILLER):
        limit = group.index(FILLER) if group[-1] is FILLER else len(group)
        yield group[:limit]  # Sliced to remove any filler.

if __name__ == '__main__':
    from pprint import pprint

    with open('lorem ipsum.txt') as inf:
        for chunk in grouper(3, inf):
            pprint(chunk, width=90)
If the lorem ipsum.txt file contained these lines:
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Duis ut volutpat sem.
In felis nibh, efficitur id orci tristique, sollicitudin rhoncus nibh. In
elementum suscipit est, et varius mi aliquam ac. Duis fringilla neque urna,
dapibus volutpat ex ullamcorper eget. Duis in mauris vitae neque porttitor
facilisis. Nulla ornare leo ac nibh facilisis, in feugiat eros accumsan.
Suspendisse elementum elementum libero, sed tempor ex sollicitudin ac. Cras
pharetra, neque eu porttitor mattis, odio quam interdum diam, quis aliquam ex
arcu non nisl. Duis consequat lorem metus. Mauris vitae ex ante. Duis vehicula.
The result will be the following chunks, each composed of at most 3 lines:
('Lorem ipsum dolor sit amet, consectetur adipiscing elit. Duis ut volutpat sem.\n',
'In felis nibh, efficitur id orci tristique, sollicitudin rhoncus nibh. In\n',
'elementum suscipit est, et varius mi aliquam ac. Duis fringilla neque urna,\n')
('dapibus volutpat ex ullamcorper eget. Duis in mauris vitae neque porttitor\n',
'facilisis. Nulla ornare leo ac nibh facilisis, in feugiat eros accumsan.\n',
'Suspendisse elementum elementum libero, sed tempor ex sollicitudin ac. Cras\n')
('pharetra, neque eu porttitor mattis, odio quam interdum diam, quis aliquam ex\n',
'arcu non nisl. Duis consequat lorem metus. Mauris vitae ex ante. Duis vehicula.\n')
Update
If you want to remove the newline characters from the end of the lines of the file, you could do it with the same generic grouper() function by passing it a generator expression to preprocess the lines being read without needing to read them all into memory first:
if __name__ == '__main__':
    from pprint import pprint

    with open('lorem ipsum.txt') as inf:
        lines = (line.rstrip() for line in inf)  # Generator expression (note the outer parentheses).
        for chunk in grouper(3, lines):
            pprint(chunk, width=90)
Output using generator expression:
('Lorem ipsum dolor sit amet, consectetur adipiscing elit. Duis ut volutpat sem.',
'In felis nibh, efficitur id orci tristique, sollicitudin rhoncus nibh. In',
'elementum suscipit est, et varius mi aliquam ac. Duis fringilla neque urna,')
('dapibus volutpat ex ullamcorper eget. Duis in mauris vitae neque porttitor',
'facilisis. Nulla ornare leo ac nibh facilisis, in feugiat eros accumsan.',
'Suspendisse elementum elementum libero, sed tempor ex sollicitudin ac. Cras')
('pharetra, neque eu porttitor mattis, odio quam interdum diam, quis aliquam ex',
'arcu non nisl. Duis consequat lorem metus. Mauris vitae ex ante. Duis vehicula.')
You can split the text by each newline using the str.splitlines() method. Then, using a list comprehension, you can use list slices to slice the list at increments of the chunk_size (50 in your case). Below, I used 3 as the chunk_size variable, but you can replace that with 50:
text = '''Lorem ipsum dolor sit amet, consectetur adipiscing elit,
sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
Ut enim ad minim veniam,
quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
Duis aute irure dolor in reprehenderit in voluptate
velit esse cillum dolore eu fugiat nulla pariatur.
Excepteur sint occaecat cupidatat non proident,
sunt in culpa qui officia deserunt mollit anim id est laborum.'''
lines = text.splitlines()
chunk_size = 3
chunks = [lines[i: i + chunk_size] for i in range(0, len(lines), chunk_size)]
print(chunks)
Output:
[['Lorem ipsum dolor sit amet, consectetur adipiscing elit,', 'sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.', 'Ut enim ad minim veniam,'],
['quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.', 'Duis aute irure dolor in reprehenderit in voluptate', 'velit esse cillum dolore eu fugiat nulla pariatur.'],
['Excepteur sint occaecat cupidatat non proident,', 'sunt in culpa qui officia deserunt mollit anim id est laborum.']]

Python 'textwrap' module not replacing all whitespace when using 'wrap' and 'fill' functions

I'm trying to output some text to the console using the Python textwrap module. According to the documentation, both the wrap and fill functions will replace whitespace by default because the replace_whitespace kwarg defaults to True.
However, in practice some whitespace remains and I am unable to format the text as required.
How can I format the text below without the chunk of whitespace after the first sentence?
Some title
----------
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Duis sem eros, imperdiet vitae mi
cursus, dapibus fringilla eros. Duis quis euismod ex. Maecenas ac consequat quam. Sed ac augue
dignissim, facilisis sapien vitae, vehicula mi. Maecenas elementum ex eu tortor dignissim
scelerisque. Curabitur et lobortis ex.
Here is the code that I am using.
from textwrap import fill, wrap

lines = [
    'Some title',
    '----------',
    '',
    ' '.join(wrap('''Lorem ipsum dolor sit amet, consectetur adipiscing elit.
        Duis sem eros, imperdiet vitae mi cursus, dapibus fringilla eros. Duis
        quis euismod ex. Maecenas ac consequat quam. Sed ac augue dignissim,
        facilisis sapien vitae, vehicula mi. Maecenas elementum ex eu tortor
        dignissim scelerisque. Curabitur et lobortis ex.'''))
]

out = []
for l in lines:
    out.append(fill(l, width=100))

print('\n'.join(out))
replace_whitespace does not affect regular spaces ' '; it only replaces '\t\n\v\f\r'.
From the docs:
replace_whitespace
(default: True) If true, after tab expansion but before wrapping, the wrap() method will replace each whitespace character with a single space. The whitespace characters replaced are as follows: tab, newline, vertical tab, formfeed, and carriage return ('\t\n\v\f\r').
https://docs.python.org/3/library/textwrap.html#textwrap.TextWrapper.replace_whitespace
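A minimal sketch of what that means in practice (the string is just an illustration of an indented continuation line, like the ones in the triple-quoted string above):
from textwrap import fill

s = 'First sentence.\n    Second sentence.'   # a newline followed by a 4-space indent
print(repr(fill(s, width=100)))
# 'First sentence.     Second sentence.' - the newline became one space,
# but the four indentation spaces survived, leaving a run of five spaces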
So it looks like this can be done using dedent and building the long text from separate adjacent string literals instead of a multiline string.
from textwrap import dedent, fill

lines = [
    'Some title',
    '----------',
    '',
    dedent('Lorem ipsum dolor sit amet, consectetur adipiscing elit. ' \
           'Duis sem eros, imperdiet vitae mi cursus, dapibus fringilla eros. Duis ' \
           'quis euismod ex. Maecenas ac consequat quam. Sed ac augue dignissim, ' \
           'facilisis sapien vitae, vehicula mi. Maecenas elementum ex eu tortor ' \
           'dignissim scelerisque. Curabitur et lobortis ex.')
]

out = []
for l in lines:
    out.append(fill(l, width=100))

print('\n'.join(out))
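An alternative, if you would rather keep the triple-quoted string: collapse all whitespace runs yourself before filling. This is a suggestion of mine, not part of the answer above:
from textwrap import fill

text = '''Lorem ipsum dolor sit amet, consectetur adipiscing elit.
    Duis sem eros, imperdiet vitae mi cursus, dapibus fringilla eros.'''
collapsed = ' '.join(text.split())    # str.split() with no argument swallows all whitespace runs
print(fill(collapsed, width=100))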

Python YAML dump using block style without quotes

How do you load and dump YAML using PyYAML, so that it uses the original styling as closely as possible?
I have Python to load and dump YAML data like:
import sys
import yaml
def _represent_dictorder(self, data):
    # Maintains ordering of specific dictionary keys in the YAML output.
    _data = []
    ordering = ['questions', 'tags', 'answers', 'weight', 'date', 'text']
    for key in ordering:
        if key in data:
            _data.append((str(key), data.pop(key)))
    if data:
        _data.extend(data.items())
    return self.represent_mapping(u'tag:yaml.org,2002:map', _data)
yaml.add_representer(dict, _represent_dictorder)
text="""- questions:
- Lorem ipsum dolor sit amet, consectetur adipiscing elit.
tags:
context: curabitur
answers:
- weight: 2
date: 2014-1-19
text: |-
1. Mauris lorem magna, auctor et tristique id, fringilla ut metus.
2. Donec pellentesque elit non felis feugiat, in gravida ex hendrerit.
3. Mauris quis velit sapien. Nullam blandit, diam et pharetra maximus, mi erat scelerisque turpis, eu vestibulum dui ligula non lectus.
a. Aenean consectetur eleifend accumsan.
4. In erat lacus, egestas ut tincidunt ac, congue quis elit. Suspendisse semper purus ac turpis maximus dignissim.
a. Proin nec neque convallis, placerat odio non, suscipit erat. Nulla nec mattis nibh, accumsan feugiat felis.
5. Mauris lorem magna, auctor et tristique id, fringilla ut metus.
a. Morbi non arcu odio. Maecenas faucibus urna et leo euismod placerat.
b. Nulla facilisi. Pellentesque at pretium nunc.
c. Ut ipsum nibh, suscipit a pretium eu, eleifend vitae purus.
"""
yaml.dump(yaml.load(text), stream=sys.stdout, default_flow_style=False, indent=4)
but this outputs the YAML in a different style, like:
- questions:
- Lorem ipsum dolor sit amet, consectetur adipiscing elit.
tags:
context: curabitur
answers:
- weight: 2
date: 2014-1-19
text: "1. Mauris lorem magna, auctor et tristique id, fringilla ut metus.\n\
2. Donec pellentesque elit non felis feugiat, in gravida ex hendrerit.\n\
3. Mauris quis velit sapien. Nullam blandit, diam et pharetra maximus,\
\ mi erat scelerisque turpis, eu vestibulum dui ligula non lectus.\n \
\ a. Aenean consectetur eleifend accumsan.\n4. In erat lacus, egestas\
\ ut tincidunt ac, congue quis elit. Suspendisse semper purus ac turpis\
\ maximus dignissim.\n a. Proin nec neque convallis, placerat odio\
\ non, suscipit erat. Nulla nec mattis nibh, accumsan feugiat felis.\n\
5. Mauris lorem magna, auctor et tristique id, fringilla ut metus.\n \
\ a. Morbi non arcu odio. Maecenas faucibus urna et leo euismod placerat.\n\
\ b. Nulla facilisi. Pellentesque at pretium nunc.\n c. Ut ipsum\
\ nibh, suscipit a pretium eu, eleifend vitae purus."
As you can see, it's changing the style of the text-block, so that newlines are escaped, making it a lot harder to read.
So I tried specifying the default_style attribute like:
yaml.dump(yaml.load(text), stream=sys.stdout, default_flow_style=False, default_style='|', indent=4)
And that fixed the text-block style, but then it broke other styles by putting quotes around all other strings, adding newlines to single-line strings, and munging integers, like:
- "questions":
- |-
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
"tags":
"context": |-
curabitur
"answers":
- "weight": !!int |-
2
"date": |-
2014-1-19
"text": |-
1. Mauris lorem magna, auctor et tristique id, fringilla ut metus.
2. Donec pellentesque elit non felis feugiat, in gravida ex hendrerit.
3. Mauris quis velit sapien. Nullam blandit, diam et pharetra maximus, mi erat scelerisque turpis, eu vestibulum dui ligula non lectus.
a. Aenean consectetur eleifend accumsan.
4. In erat lacus, egestas ut tincidunt ac, congue quis elit. Suspendisse semper purus ac turpis maximus dignissim.
a. Proin nec neque convallis, placerat odio non, suscipit erat. Nulla nec mattis nibh, accumsan feugiat felis.
5. Mauris lorem magna, auctor et tristique id, fringilla ut metus.
a. Morbi non arcu odio. Maecenas faucibus urna et leo euismod placerat.
b. Nulla facilisi. Pellentesque at pretium nunc.
c. Ut ipsum nibh, suscipit a pretium eu, eleifend vitae purus.
How do I fix this so the output resembles the style of my original input?
How would you determine what string to represent as a block literal (or a folded block for that matter) and what to represent inline?
Under the assumption that you only want block literals used with strings that span over multiple lines, you can write your own string representer to switch between the styles based on the string content:
def selective_representer(dumper, data):
    return dumper.represent_scalar(u"tag:yaml.org,2002:str", data,
                                   style="|" if "\n" in data else None)

yaml.add_representer(str, selective_representer)
Now if you dump your data with default flow style set to False (to prevent dict/list inlining):
yaml.dump(yaml.load(text), stream=sys.stdout, default_flow_style=False, indent=4)
Your scalars will act as you expect them to.
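For instance, a minimal self-contained sketch (the sample dict is just an illustration) showing the representer switching styles:
import sys
import yaml

def selective_representer(dumper, data):
    return dumper.represent_scalar(u"tag:yaml.org,2002:str", data,
                                   style="|" if "\n" in data else None)

yaml.add_representer(str, selective_representer)

yaml.dump({'single': 'one line', 'multi': 'line one\nline two'},
          stream=sys.stdout, default_flow_style=False)
# Expected output (roughly):
# multi: |-
#   line one
#   line two
# single: one line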

Breaking text into chunks using regex?

I am trying to create a regex that will take a longish string that contains space separated words and break it into chunks of up to 50 characters that end with a space or the end of the line.
I first came up with: (.{0,50}(\s|$)) but that only grabbed the first match. I then thought I would add a * to the end: (.{0,50}(\s|$))* but now it grabs the entire string.
I've been testing here, but can't seem to get it to work as needed. Can anyone see what I am doing wrong?
Here, it seems to be working:
import re
p = re.compile(ur'(.{0,50}[\s|$])')
test_str = u"jasdljasjdlk jal skdjl ajdl kajsldja lksjdlkasd jas lkjdalsjdalksjdalksjdlaksjdk sakdjakl jd fgdfgdfg\nhgkjd fdkfhgk dhgkjhdfhg kdhfgk jdfghdfkjghjf dfjhgkdhf hkdfhgkj jkdfgk jfgkfg dfkghk hdfkgh d asdada \ndkjfghdkhg khdfkghkd hgkdfhgkdhfk k dfghkdfgh dfgdfgdfgd\n"
re.findall(p, test_str)
What are you using to match the regex? The re.findall() method should return what you want.
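For example, a minimal sketch of that approach (a greedy match of up to 50 characters ending at whitespace or at the end of the string; the sample text is just an illustration):
import re

text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed et convallis lectus."
chunks = [m.strip() for m in re.findall(r'.{1,50}(?:\s+|$)', text)]
print(chunks)
# ['Lorem ipsum dolor sit amet, consectetur adipiscing', 'elit. Sed et convallis lectus.']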
It's not using a regex, but have you thought about using textwrap.wrap()?
In [8]: import textwrap
text = ' '.join([
"Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed et convallis",
"lectus. Quisque maximus diam ut sodales tincidunt. Integer ac finibus",
"elit. Etiam tristique euismod justo, vel pretium tellus malesuada et.",
"Pellentesque id mattis eros, at bibendum mauris. In luctus lorem eget nisl",
"sagittis sollicitudin. Aenean consequat at lacus at porttitor. Nunc sit",
"amet neque eu sem venenatis rutrum. Proin sed tempus lacus, sit amet porta",
"velit. Suspendisse et semper nisl, eu varius orci. Ut non metus."])
In [9]: textwrap.wrap(text, 50)
Out[9]: ['Lorem ipsum dolor sit amet, consectetur adipiscing',
'elit. Sed et convallis lectus. Quisque maximus',
'diam ut sodales tincidunt. Integer ac finibus',
'elit. Etiam tristique euismod justo, vel pretium',
'tellus malesuada et. Pellentesque id mattis eros,',
'at bibendum mauris. In luctus lorem eget nisl',
'sagittis sollicitudin. Aenean consequat at lacus',
'at porttitor. Nunc sit amet neque eu sem venenatis',
'rutrum. Proin sed tempus lacus, sit amet porta',
'velit. Suspendisse et semper nisl, eu varius orci.',
'Ut non metus.']
Here's what you need - '[^\s]{1,50}'.
Example with a smaller chunk size:
>>> text = "Lorem ipsum sit dolor"
>>> splitter = re.compile('[^\s]{1,3}')
>>> splitter.findall(text)
['Lor', 'em', 'ips', 'um', 'sit', 'dol', 'or']

python write strings of bytes

I compress some data with the lzw module and save it to a file (opened in 'wb' mode). The compression returns something like this:
'\x18\xc0\x86#\x08$\x0e\x060\x82\xc2`\x90\x98l*'
For small compressed data, the lzw strings are in the above format.
When I compress bigger strings, the compressed string is split into lines.
'\x18\xc0\x86#\x08$\x0e\x060\x82\xc2`\x90\x98l*', '\xff\xb6\xd9\xe8r4'
As I checked, the string contains '\n' characters, so I think I will lose information if a newline goes missing. How can I store the string so that it stays unsplit, on a single line?
I have tried this:
for i in s_string:
    testfile.write(i)
-----------------
testfile.write(s_string)
EDIT
def mycpsr(x):
    # x = '11010101001010101010010111110101010101001010' # some random bits for lzw input
    temp = lzw.compress(x)
    temp = "".join(temp)
    return temp
>>> import lzw
>>> print mycpsr('10101010011111111111111111111111100000000000111111')
If I use bigger input, let's say x is a string of 0s and 1s with len(x) = 1000, and I take the compressed data and append it to a file, I get multiple lines instead of one line.
If the file has this data:
'\t' + normal strings + '\n'
<LZW-strings(with \t\n chars)>
'\t' + normal strings + '\n'
How can I tell which part is lzw data and which is other data?
So, your binary data contains newlines, and you want to embed it into a line-oriented document. To do that, you need to quote newlines in the binary data. One way to do it, which will quote not only newlines but also other non-printable characters, is to use base64 encoding:
import base64, lzw

def my_compress(x):
    # Returns a single line, one trailing \n included.
    return base64.encodestring("".join(lzw.compress(x)))

def my_decompress(line):
    return lzw.decompress(base64.decodestring(line))
If your code handles binary characters other than newline, you can make the encoding more space-efficient by only replacing newline with r"\n" (backslash followed by n), and backslash with r"\\" (two backslash characters). This will allow lzw data to reside in a single binary line, and you will need to just do the inverse transformation before calling lzw.decompress.
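A minimal sketch of that escaping scheme (my own illustration of the paragraph above, not part of the lzw module):
def escape_line(data):
    # Escape backslashes first, then newlines, so the two substitutions don't collide.
    return data.replace("\\", r"\\").replace("\n", r"\n")

def unescape_line(line):
    # Inverse transformation: r"\n" back to a newline, r"\\" back to a single backslash.
    out, i = [], 0
    while i < len(line):
        if line[i] == "\\" and i + 1 < len(line):
            out.append("\n" if line[i + 1] == "n" else line[i + 1])
            i += 2
        else:
            out.append(line[i])
            i += 1
    return "".join(out)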
You are dealing with binary data. If your data contains more than 256 bytes, there is a good chance that some byte corresponds to the ASCII code of '\n'. This will result in a binary file that contains more than one line if treated as a text file.
This is not a problem as long as you deal with binary files as a sequence of bytes, not as a sequence of lines.
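For example, a minimal sketch of reading it back as bytes rather than lines (the filename is just an illustration):
with open('compressed.lzw', 'rb') as f:
    data = f.read()    # one byte string; any '\n' bytes are simply part of the data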
>>> txt = """Lorem ipsum dolor sit amet, consectetur adipiscing elit. Vestibulum
ante velit, adipiscing eget sodales non, faucibus vitae nunc. Praesent ac lorem
cursus, aliquet magna sed, porta diam. Nunc lorem sapien, euismod in congue non
, tincidunt sit amet arcu. Lorem ipsum dolor sit amet, consectetur adipiscing el
it. Phasellus eleifend bibendum massa, ac convallis tellus sodales in. Suspendis
se non aliquam massa. Aenean erat ipsum, sagittis vitae elementum sit amet, iacu
lis sit amet quam. Vivamus luctus hendrerit libero at fringilla. Nullam id urna
est. Vestibulum pretium et tellus et dictum.
...
... Fusce nulla velit, lobortis at ligula eget, fermentum condimentum felis. Mae
cenas pretium posuere elit in posuere. Suspendisse gravida erat tristique, venen
atis erat at, sagittis elit. Donec laoreet lacinia nunc, eu consequat tortor. Cr
as at sem scelerisque, tristique dolor a, porta mauris. Fusce fermentum massa vi
tae arcu sagittis, et laoreet lacus suscipit. Vestibulum sed accumsan quam. Vest
ibulum eu egestas nisl. Curabitur dolor massa, auctor tempus dui ut, volutpat vu
lputate massa. Fusce vitae tortor adipiscing, gravida est at, molestie tortor. A
enean quis magna magna. Donec cursus enim ac egestas cursus. Pellentesque pulvin
ar nibh in sapien sollicitudin, eget tempus tortor pulvinar. Phasellus dignissim
, urna a sagittis tempor, nulla nulla rhoncus enim, vel molestie nisl lectus qui
s erat. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Vestibulum sit
amet malesuada nisi, sit amet placerat sem."""
>>>
>>> print "".join(lzw.decompress(lzw.compress(txt)))
This appears to decode it correctly again, including the \n characters.
