I compress some data with the lzw module and save it to a file (opened in 'wb' mode). This returns something like this:
'\x18\xc0\x86#\x08$\x0e\x060\x82\xc2`\x90\x98l*'
For small compressed data, lzw's output strings look like the above. When I feed in bigger strings, the compressed output gets split across lines:
'\x18\xc0\x86#\x08$\x0e\x060\x82\xc2`\x90\x98l*', '\xff\xb6\xd9\xe8r4'
As I checked, the string contains '\n' characters, so I think I will lose information if a newline goes missing. How can I store the string so that it stays unsplit, on a single line?
I have tried this:
for i in s_string:
    testfile.write(i)
-----------------
testfile.write(s_string)
EDIT
def mycpsr(x):
    # x = '11010101001010101010010111110101010101001010'  # some random bits for lzw input
    temp = lzw.compress(x)
    temp = "".join(temp)
    return temp
>>> import lzw
>>> print mycpsr('10101010011111111111111111111111100000000000111111')
If I use bigger input, let's say x is a string of 0s and 1s with len(x) = 1000, and I append the compressed data to a file, I get multiple lines instead of one line.
If the file has this data:
'\t' + normal strings + '\n'
<LZW-strings(with \t\n chars)>
'\t' + normal strings + '\n'
How can I tell which lines are LZW data and which are other data?
So, your binary data contains newlines, and you want to embed it into a line-oriented document. To do that, you need to quote newlines in the binary data. One way to do it, which will quote not only newlines, but other non-printable characters, is by using base64 encoding:
import base64, lzw

def my_compress(x):
    # b64encode never inserts newlines (unlike encodestring, which wraps
    # its output every 76 characters), so the result really is one line
    return base64.b64encode("".join(lzw.compress(x))) + "\n"

def my_decompress(line):
    return lzw.decompress(base64.b64decode(line))
If the rest of your code can handle binary characters other than newline, you can make the encoding more space-efficient by replacing only newline with r"\n" (backslash followed by n) and backslash with r"\\" (two backslash characters). This keeps the lzw data on a single binary line, and you just need to apply the inverse transformation before calling lzw.decompress.
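For illustration, here is a minimal sketch of that escaping scheme (the function names are mine, not from any library):

def escape_line(data):
    # escape backslashes first, so the backslashes we introduce for
    # newlines are not escaped a second time
    return data.replace("\\", "\\\\").replace("\n", "\\n")

def unescape_line(line):
    # walk the string so that an escaped backslash followed by a real 'n'
    # is not mistaken for an escaped newline
    out = []
    i = 0
    while i < len(line):
        if line[i] == "\\" and i + 1 < len(line):
            out.append("\n" if line[i + 1] == "n" else line[i + 1])
            i += 2
        else:
            out.append(line[i])
            i += 1
    return "".join(out)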
You are dealing with binary data. If your data contains more than 256 bytes, there is a good chance that some of those bytes correspond to the ASCII code of '\n'. This will result in a binary file that contains more than one line when treated as a text file.
This is not a problem as long as you deal with binary files as sequences of bytes, not as sequences of lines.
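For example, a minimal sketch of byte-oriented round-tripping (assuming the lzw module from the question, whose compress and decompress yield string chunks):

# write the compressed bytes as one opaque blob
with open('data.lzw', 'wb') as f:
    f.write("".join(lzw.compress(x)))

# read() returns all the bytes at once; any '\n' bytes are just data
with open('data.lzw', 'rb') as f:
    blob = f.read()

restored = "".join(lzw.decompress(blob))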
>>> txt = """Lorem ipsum dolor sit amet, consectetur adipiscing elit. Vestibulum
ante velit, adipiscing eget sodales non, faucibus vitae nunc. Praesent ac lorem
cursus, aliquet magna sed, porta diam. Nunc lorem sapien, euismod in congue non
, tincidunt sit amet arcu. Lorem ipsum dolor sit amet, consectetur adipiscing el
it. Phasellus eleifend bibendum massa, ac convallis tellus sodales in. Suspendis
se non aliquam massa. Aenean erat ipsum, sagittis vitae elementum sit amet, iacu
lis sit amet quam. Vivamus luctus hendrerit libero at fringilla. Nullam id urna
est. Vestibulum pretium et tellus et dictum.
...
... Fusce nulla velit, lobortis at ligula eget, fermentum condimentum felis. Mae
cenas pretium posuere elit in posuere. Suspendisse gravida erat tristique, venen
atis erat at, sagittis elit. Donec laoreet lacinia nunc, eu consequat tortor. Cr
as at sem scelerisque, tristique dolor a, porta mauris. Fusce fermentum massa vi
tae arcu sagittis, et laoreet lacus suscipit. Vestibulum sed accumsan quam. Vest
ibulum eu egestas nisl. Curabitur dolor massa, auctor tempus dui ut, volutpat vu
lputate massa. Fusce vitae tortor adipiscing, gravida est at, molestie tortor. A
enean quis magna magna. Donec cursus enim ac egestas cursus. Pellentesque pulvin
ar nibh in sapien sollicitudin, eget tempus tortor pulvinar. Phasellus dignissim
, urna a sagittis tempor, nulla nulla rhoncus enim, vel molestie nisl lectus qui
s erat. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Vestibulum sit
amet malesuada nisi, sit amet placerat sem."""
>>>
>>> print "".join(lzw.decompress(lzw.compress(txt)))
This appears to correctly round-trip the text, including the '\n' characters.
I'm wondering if there's an efficient way of doing the following:
I have a python script that reads an entire file into a single string. Then, given the location of a token of interest, I'd like to find the string index of the beginning of the line containing that token.
import re

file_str = read_file("foo.txt")  # read_file: my helper that returns the file contents as one string
token_pos = re.search("token", file_str).start()
# this does not work, as str.rfind does not take a regex, and you cannot specify re.M:
beginning_of_line = file_str.rfind("^", 0, token_pos)
I could use a greedy regex to find the last beginning-of-line, but this has to be done many times, and I don't want to re-scan the whole file on each iteration. Is there a good way to do this?
----------------- EDIT ----------------
I tried to post as simple a question as possible, but it looks like more details are required. Here's a better example of one of the things I'm trying to do:
file_str = """
{
blah {
{} {{} "string with unmatched }" }
}
}"""
I happen to know the positions of the opening and closing braces of blah. I need to get the lines between the braces (non-inclusive). So, given the position of the closing brace, I need to find the beginning of the line containing it. I'd like to do something akin to a reverse regex search to find it. I can, of course, write a special function to do this, but I was thinking there would be some more Pythonic way of going about it. To further complicate things, I have to do this several times per file, and the file string can potentially change between iterations, so pre-indexing doesn't really work either.
Instead of matching just the keyword, match everything from the start of the line to the keyword. You can use re.finditer() to get an iterator that keeps yielding matches as it finds them.
file_str = """Lorem ipsum dolor sit amet, consectetur adipiscing elit amet.
Vestibulum vestibulum mollis enim, eu tristique est rhoncus et.
Curabitur sem nisi, ornare eu pellentesque at, interdum at lectus.
Phasellus molestie, turpis id ornare efficitur, ex tellus aliquet ipsum, vitae ullamcorper tellus diam a velit.
Nulla eget eleifend nisl.
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Nullam finibus, velit non euismod faucibus, dolor orci maximus lacus, sed mattis nisi erat eget turpis.
Maecenas ut pharetra lorem.
Curabitur nec dui sed velit euismod bibendum.
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Pellentesque tempor dolor at placerat aliquet.
Duis laoreet, est vitae tempor porta, risus leo ullamcorper risus, quis vestibulum massa orci ut felis.
In finibus purus ac nulla congue mattis.
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Duis efficitur dui ac nisi lobortis, a bibendum felis volutpat.
Aenean consectetur diam at risus hendrerit, in vestibulum erat porttitor.
Quisque fringilla accumsan neque, sed efficitur nunc tristique maximus.
Maecenas gravida lectus et porttitor ultrices.
Nam lobortis, massa et porta vulputate, nulla turpis maximus sapien, sit amet finibus libero mauris eu sapien.
Donec sollicitudin vulputate neque, in tempor nisi suscipit quis.
"""
keyword = "amet"
for match_obj in re.finditer(f"^.*{keyword}", file_str, re.MULTILINE):
beginning_of_line = match_obj.start()
print(beginning_of_line, match_obj)
Which gives:
0 <re.Match object; span=(0, 60), match='Lorem ipsum dolor sit amet, consectetur adipiscin>
331 <re.Match object; span=(331, 357), match='Lorem ipsum dolor sit amet'>
566 <re.Match object; span=(566, 592), match='Lorem ipsum dolor sit amet'>
815 <re.Match object; span=(815, 841), match='Lorem ipsum dolor sit amet'>
1129 <re.Match object; span=(1129, 1206), match='Nam lobortis, massa et porta vulputate, nulla tur>
Note that the first line gets matched only once even though it contains two amets: .* is greedy, so the first amet on the line is consumed by it.
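If you instead want each match to stop at the first keyword on a line, a non-greedy quantifier does that; a small sketch using the same file_str:

for match_obj in re.finditer(f"^.*?{keyword}", file_str, re.MULTILINE):
    # .*? is lazy, so the match ends at the first amet on each line
    print(match_obj.start(), match_obj)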
You don't need a regex to find the beginnings of the lines containing the token.
This iterates over the file line by line, builds the string foo from the file's contents, and records in the list line_pos_with_token the offset at which each line containing the token starts:
token = "token"
foo = ''
line_pos_with_token = []
with open("foo.txt", "r") as f:
for line in f:
if token in line:
line_pos_with_token.append(len(foo))
foo += line
print(line_pos_with_token)
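As a side note: if you already have the whole file in one string plus a token position, as in the original question, the beginning of the line is just one past the previous newline, so plain str.rfind is enough (a sketch, using file_str and token_pos from the question):

# rfind returns -1 when the token is on the first line, so +1 correctly yields 0
beginning_of_line = file_str.rfind("\n", 0, token_pos) + 1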
I'm trying to output some text to the console using the Python textwrap module. According to the documentation, both the wrap and fill functions will replace whitespace by default because the replace_whitespace kwarg defaults to True.
However, in practice some whitespace remains and I am unable to format the text as required.
How can I format the text below without the chunk of whitespace after the first sentence?
Some title
----------
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Duis sem eros, imperdiet vitae mi
cursus, dapibus fringilla eros. Duis quis euismod ex. Maecenas ac consequat quam. Sed ac augue
dignissim, facilisis sapien vitae, vehicula mi. Maecenas elementum ex eu tortor dignissim
scelerisque. Curabitur et lobortis ex.
Here is the code that I am using.
from textwrap import fill, wrap

lines = [
    'Some title',
    '----------',
    '',
    ' '.join(wrap('''Lorem ipsum dolor sit amet, consectetur adipiscing elit.
        Duis sem eros, imperdiet vitae mi cursus, dapibus fringilla eros. Duis
        quis euismod ex. Maecenas ac consequat quam. Sed ac augue dignissim,
        facilisis sapien vitae, vehicula mi. Maecenas elementum ex eu tortor
        dignissim scelerisque. Curabitur et lobortis ex.'''))
]

out = []
for l in lines:
    out.append(fill(l, width=100))

print('\n'.join(out))
replace_whitespace does not affect regular spaces (' '); it only replaces '\t\n\v\f\r'.
From the docs:
replace_whitespace
(default: True) If true, after tab expansion but before wrapping, the wrap() method will replace each whitespace character with a single space. The whitespace characters replaced are as follows: tab, newline, vertical tab, formfeed, and carriage return ('\t\n\v\f\r').
https://docs.python.org/3/library/textwrap.html#textwrap.TextWrapper.replace_whitespace
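A small demonstration of the effect (a sketch; the indented literal stands in for the question's multiline string):

from textwrap import fill

s = '''Lorem ipsum dolor sit amet.
        Duis sem eros.'''
# the newline becomes a single space, but the eight indentation spaces
# survive, producing the unwanted gap after the first sentence
print(fill(s, width=100))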
So it looks like this can be fixed by building the long text from adjacent string literals (implicit concatenation) instead of a multiline string, optionally running it through dedent:
from textwrap import dedent, fill

lines = [
    'Some title',
    '----------',
    '',
    dedent('Lorem ipsum dolor sit amet, consectetur adipiscing elit. '
           'Duis sem eros, imperdiet vitae mi cursus, dapibus fringilla eros. Duis '
           'quis euismod ex. Maecenas ac consequat quam. Sed ac augue dignissim, '
           'facilisis sapien vitae, vehicula mi. Maecenas elementum ex eu tortor '
           'dignissim scelerisque. Curabitur et lobortis ex.')
]

out = []
for l in lines:
    out.append(fill(l, width=100))

print('\n'.join(out))
I have a script that pulls a markdown file from GitHub. I then decode the file and find-and-replace parameters in it with user-given inputs.
tempdict = {'key_1': 'input_1', 'key_2': 'input_2'}
It worked great until I realized that it was missing some instances. The text was being read in chunks of 256 bytes, and if a key sat at the end of a chunk it ended up split across two entries of rawtext, so my find-and-replace missed it. How can I avoid this error?
Code below:
import re
import requests

url = 'https://raw.githubusercontent.com/foo/documents/master/bar.md'
rawtext = requests.get(url)

rep = {}
for k, v in tempdict.items():
    rep[k.replace('_', '\\_')] = v

# iterating a Response yields fixed-size byte chunks, not lines
decodedtext = []
for line in rawtext:
    decodedtext.append(line.decode("utf-8"))

editedtext = []
rep = dict((re.escape(k), v) for k, v in rep.items())
pattern = re.compile('|'.join(rep.keys()))
for line in decodedtext:
    line = pattern.sub(lambda m: rep[re.escape(m.group(0))], line)
    editedtext.append(line)

encodedtext = []
for line in editedtext:
    encodedtext.append(str.encode(line))

markdownfile = open('edited.md', 'wb')
for i in encodedtext:
    markdownfile.write(i)
markdownfile.close()
Pseudo-Markdown file below.
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Phasellus pharetra,
urna ullamcorper congue aliquet, leo nulla facilisis ipsum, tincidunt
suscipit eros erat tempor orci. Sed porttitor convallis ligula, a pharetra
nisl rhoncus sit key_1. Maecenas consequat sem nec urna mattis, non ornare
risus fermentum. Nam consectetur volutpat felis sed blandit. Etiam eu
sollicitudin diam, eu euismod diam. Fusce diam libero, sagittis varius elit
nec, ultricies key_2 est. Sed nibh purus, tincidunt eu fringilla ut,
ultrices et orci.
Any time I tried to use rawtext.text or rawtext.iter_items(), it wouldn't keep the markdown format. Also, I copy the items from tempdict into rep because the raw text adds \\ before every underscore.
Your issue is that you are checking each element of a list for a match, which means that if a match straddles two list entries it will not be found. Instead of building a list, build one string, split it on newlines (assuming you are not trying to match across newlines), and run your replacement over that new list.
decodedtext = ""
for line in rawtext:
decodedtext += line.decode("utf-8")
...
for line in decodedtext.split("\n"):
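Putting it together, a minimal sketch of the fix applied to the question's code (assuming rep and pattern are built as before; note that requests can also hand over the whole payload at once via Response.content, which sidesteps the chunking entirely):

# decode the complete payload in one go - no key can straddle a chunk boundary
decodedtext = rawtext.content.decode("utf-8")
editedtext = pattern.sub(lambda m: rep[re.escape(m.group(0))], decodedtext)

with open('edited.md', 'wb') as f:
    f.write(editedtext.encode("utf-8"))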
I want to vectorize some text so that each term maps to an integer, convert text to its mapped integers, and also create a new sentence from a given sequence of integers such as [2, 9, 39, 46, 56, 12, 89, 9].
I have seen some custom functions that can be used for this purpose, but I want to know whether sklearn itself has such functions.
from sklearn.feature_extraction.text import CountVectorizer
a=["""Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Morbi imperdiet mauris posuere, condimentum odio et, volutpat orci.
Curabitur sodales vulputate eros eu gravida. Sed pharetra imperdiet nunc et tempor.
Nullam lectus est, rhoncus vitae lacus at, fermentum aliquam metus.
Phasellus a sollicitudin tortor, non tempor nulla.
Etiam mattis felis enim, a malesuada ligula dignissim at.
Integer congue dolor ut magna blandit, lobortis consequat ante aliquam.
Nulla imperdiet libero eget lorem sagittis, eget iaculis orci dignissim.
Phasellus sit amet sodales odio. Pellentesque commodo tempor risus, et tincidunt neque.
Praesent et sem velit. Maecenas id risus sit amet ex convallis ultrices vel sed purus.
Sed fringilla, leo quis congue sollicitudin, mauris nunc vehicula mi, et laoreet ligula
urna et nulla. Nam sollicitudin urna sed dolor vehicula euismod. Mauris bibendum pulvinar
ornare. In suscipit sed mi ut posuere.
Proin egestas, nibh ut egestas mattis, ipsum nulla bibendum enim, ac suscipit nisl justo
id metus. Nam est dui, elementum eget suscipit nec, aliquam in mi. Integer tortor erat,
aliquet at sapien et, fringilla posuere leo. Praesent non congue est. Vivamus tincidunt
tellus eu placerat tincidunt. Phasellus convallis lacus vitae ex congue efficitur.
Sed ut bibendum massa, vitae molestie ligula. Phasellus purus felis, fermentum vitae
hendrerit vel, vulputate quis metus."""]
vec = CountVectorizer()
dtm = vec.fit_transform(a)
print vec.vocabulary_

# convert text to corresponding vectors
mapped_a =

# new sentence using the mapped values below
# input: [2, 9, 39, 46, 56, 12, 89, 9]
# creating a sentence using a specific sequence
new_sentence =
For vectorizing a sentence into integers you can use the transform function. The output of this function is a vector with counts for each term, i.e. a feature vector.
vec = CountVectorizer()
vec.fit(a)
print vec.vocabulary_

new_sentence = "dolor nulla enim"
mapped_a = vec.transform([new_sentence])
print mapped_a.toarray()  # sparse feature vector

tokenizer = vec.build_tokenizer()
# print the id of each word in the sentence
for token in tokenizer(new_sentence):
    print vec.vocabulary_.get(token)
The second part of the question is not so straightforward. CountVectorizer has an inverse_transform function for this purpose, which takes a sparse vector of features as input. However, in your example you want to create a sentence in which the same term may occur more than once, and that is not possible with this function.
The solution is to use the vocabulary (word to id) and build an inverse vocabulary (id to word) from it. CountVectorizer has no inverse vocabulary by default, so you must create it from vocabulary_.
import numpy as np

input = [2, 9, 9]

# 1. inverse_transform function
# build a sparse-style indicator vector over the whole vocabulary
sparse_input = [1 if i in input else 0 for i in range(0, len(vec.vocabulary_))]
print vec.inverse_transform(sparse_input)
> ['aliquam', 'commodo']

# 2. inverse vocabulary - custom solution
terms = np.array(list(vec.vocabulary_.keys()))
indices = np.array(list(vec.vocabulary_.values()))
inverse_vocabulary = terms[np.argsort(indices)]
for i in input:
    print inverse_vocabulary[i]
> ['aliquam', 'commodo', 'commodo']
Take a look at the preprocessing utilities in sklearn: LabelEncoder and OneHotEncoder are usually used to encode categorical variables. But encoding whole free text this way is not recommended!
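For completeness, a minimal sketch of what LabelEncoder does with categorical values (the category names are made up for illustration):

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(["paris", "tokyo", "amsterdam"])
print le.transform(["tokyo", "paris"])   # ids are assigned in sorted label order
print le.inverse_transform([2, 2, 1])    # back from ids to labels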
I am trying to create a regex that will take a longish string of space-separated words and break it into chunks of up to 50 characters, each ending with a space or the end of the line.
I first came up with (.{0,50}(\s|$)), but that only grabbed the first match. I then thought I would add a * to the end, (.{0,50}(\s|$))*, but now it grabs the entire string.
I've been testing here, but can't seem to get it to work as needed. Can anyone see what I am doing wrong?
Here, it seems to be working:
import re

# note the non-capturing group (?:\s|$): inside a character class like
# [\s|$], the characters | and $ would be matched literally
p = re.compile(ur'(.{0,50}(?:\s|$))')
test_str = u"jasdljasjdlk jal skdjl ajdl kajsldja lksjdlkasd jas lkjdalsjdalksjdalksjdlaksjdk sakdjakl jd fgdfgdfg\nhgkjd fdkfhgk dhgkjhdfhg kdhfgk jdfghdfkjghjf dfjhgkdhf hkdfhgkj jkdfgk jfgkfg dfkghk hdfkgh d asdada \ndkjfghdkhg khdfkghkd hgkdfhgkdhfk k dfghkdfgh dfgdfgdfgd\n"
re.findall(p, test_str)
What are you using to match the regex? The re.findall() method should return what you want.
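A quick sanity check (a sketch): with a non-capturing group, findall returns the chunks themselves, and {1,50} avoids a spurious empty match at the end of the string:

import re

text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed et convallis lectus."
chunks = re.findall(r'.{1,50}(?:\s|$)', text)
print chunks
# ['Lorem ipsum dolor sit amet, consectetur adipiscing ',
#  'elit. Sed et convallis lectus.']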
It's not using a regex, but have you thought about using textwrap.wrap()?
In [8]: import textwrap
text = ' '.join([
"Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed et convallis",
"lectus. Quisque maximus diam ut sodales tincidunt. Integer ac finibus",
"elit. Etiam tristique euismod justo, vel pretium tellus malesuada et.",
"Pellentesque id mattis eros, at bibendum mauris. In luctus lorem eget nisl",
"sagittis sollicitudin. Aenean consequat at lacus at porttitor. Nunc sit",
"amet neque eu sem venenatis rutrum. Proin sed tempus lacus, sit amet porta",
"velit. Suspendisse et semper nisl, eu varius orci. Ut non metus."])
In [9]: textwrap.wrap(text, 50)
Out[9]: ['Lorem ipsum dolor sit amet, consectetur adipiscing',
'elit. Sed et convallis lectus. Quisque maximus',
'diam ut sodales tincidunt. Integer ac finibus',
'elit. Etiam tristique euismod justo, vel pretium',
'tellus malesuada et. Pellentesque id mattis eros,',
'at bibendum mauris. In luctus lorem eget nisl',
'sagittis sollicitudin. Aenean consequat at lacus',
'at porttitor. Nunc sit amet neque eu sem venenatis',
'rutrum. Proin sed tempus lacus, sit amet porta',
'velit. Suspendisse et semper nisl, eu varius orci.',
'Ut non metus.']
Here's what you need: '[^\s]{1,50}'.
An example with a smaller number:
>>> text = "Lorem ipsum sit dolor"
>>> splitter = re.compile('[^\s]{1,3}')
>>> splitter.findall(text)
['Lor', 'em', 'ips', 'um', 'sit', 'dol', 'or']