I have some information about a dataset stored in a Python dictionary, with a structure like the following:
{
"name": "Dataset name"
"authors": ["Foo Bar", "Mickey Mouse"]
"keywords" : ["Lorem", "ipsum", "door", "sit"]
"description": "Sed pretium suscipit elit, ac euismod turpis aliquet vel. Curabitur placerat pharetra ipsum eu posuere. Nullam ut rutrum est, ut aliquam risus. Praesent efficitur lectus ac rhoncus hendrerit. Nulla facilisis metus sed purus faucibus mattis."
"files": [ list of files ]
}
I am looking for a good package to display this information on the console in a pretty and easy-to-read way.
I am looking for a result like this:
#############################################################################
# NAME #
#############################################################################
# Authors: #
# - Mr. Foo Bar #
# - Mickey Mouse #
#############################################################################
# Keywords: #
# - lorem #
# - ipsum #
# - dolor #
#############################################################################
# Description: #
# #
# Sed pretium suscipit elit, ac euismod turpis aliquet vel. Curabitur plcea #
# pharetra ipsum eu posuere. Nullam ut rutrum est, ut aliquam rises. #
# Praesent efficitur lectus ac rhoncus hendrerit. Nula facilisis metus sed #
# purus faucibus mattis. #
#############################################################################
File Description
-----------------------------------------------------------------------------
main.py Main file etc etc
test/test.h test file dolor foo bar foo
The best option is pprint, which stands for "pretty print":
import pprint
pprint.pprint(your_dict)
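pprint only gives you a plain nested dump of the dictionary, though. If you want something closer to the boxed layout sketched in the question, the standard-library textwrap module is enough to hand-roll it. Below is a rough sketch of that idea (my own, not a packaged solution), assuming the dictionary has the keys shown above:
import textwrap

def boxed(dataset, width=79):
    # Render the dataset dict in a "#"-framed box; assumes the keys
    # "name", "authors", "keywords" and "description" are present.
    inner = width - 4
    rule = "#" * width
    out = [rule, "# " + dataset["name"].upper().center(inner) + " #", rule]
    for key in ("authors", "keywords"):
        out.append(("# " + key.capitalize() + ":").ljust(width - 1) + "#")
        for item in dataset[key]:
            out.append(("#  - " + item).ljust(width - 1) + "#")
        out.append(rule)
    out.append("# Description:".ljust(width - 1) + "#")
    out.append("#".ljust(width - 1) + "#")
    for line in textwrap.wrap(dataset["description"], inner):
        out.append(("# " + line).ljust(width - 1) + "#")
    out.append(rule)
    return "\n".join(out)

print(boxed(your_dict))
The file/description table at the bottom could be added the same way, left-justifying the two columns with str.ljust.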
Related
I'm trying to output some text to the console using the Python textwrap module. According to the documentation, both the wrap and fill functions replace whitespace by default because the replace_whitespace kwarg defaults to True.
However, in practice some whitespace remains and I am unable to format the text as required.
How can I format the text below without the chunk of whitespace after the first sentence?
Some title
----------
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Duis sem eros, imperdiet vitae mi
cursus, dapibus fringilla eros. Duis quis euismod ex. Maecenas ac consequat quam. Sed ac augue
dignissim, facilisis sapien vitae, vehicula mi. Maecenas elementum ex eu tortor dignissim
scelerisque. Curabitur et lobortis ex.
Here is the code that I am using.
from textwrap import fill, wrap
lines = [
    'Some title',
    '----------',
    '',
    ' '.join(wrap('''Lorem ipsum dolor sit amet, consectetur adipiscing elit.
        Duis sem eros, imperdiet vitae mi cursus, dapibus fringilla eros. Duis
        quis euismod ex. Maecenas ac consequat quam. Sed ac augue dignissim,
        facilisis sapien vitae, vehicula mi. Maecenas elementum ex eu tortor
        dignissim scelerisque. Curabitur et lobortis ex.'''))
]
out = []
for l in lines:
    out.append(fill(l, width=100))
print('\n'.join(out))
replace_whitespace does not affect regular spaces ' '; it only replaces '\t\n\v\f\r'.
From the docs:
replace_whitespace
(default: True) If true, after tab expansion but before wrapping, the wrap() method will replace each whitespace character with a single space. The whitespace characters replaced are as follows: tab, newline, vertical tab, formfeed, and carriage return ('\t\n\v\f\r').
https://docs.python.org/3/library/textwrap.html#textwrap.TextWrapper.replace_whitespace
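A small illustration of the difference (my own example, not from the question): the newline is collapsed to a single space, but the ordinary spaces that indent the continuation line survive the wrap, which is exactly the gap you are seeing.
from textwrap import fill

s = 'Lorem ipsum dolor sit amet.\n        Duis sem eros, imperdiet vitae mi cursus.'
print(fill(s, width=60))
# The '\n' is replaced by one space, but the eight indentation
# spaces remain, leaving a visible gap after the first sentence.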
So it looks like this can be done by using dedent and building the multi-line text from separate concatenated strings.
from textwrap import dedent, fill
lines = [
    'Some title',
    '----------',
    '',
    dedent('Lorem ipsum dolor sit amet, consectetur adipiscing elit. '
           'Duis sem eros, imperdiet vitae mi cursus, dapibus fringilla eros. Duis '
           'quis euismod ex. Maecenas ac consequat quam. Sed ac augue dignissim, '
           'facilisis sapien vitae, vehicula mi. Maecenas elementum ex eu tortor '
           'dignissim scelerisque. Curabitur et lobortis ex.')
]
out = []
for l in lines:
    out.append(fill(l, width=100))
print('\n'.join(out))
How do you load and dump YAML using PyYAML, so that it uses the original styling as closely as possible?
I have Python code to load and dump YAML data like this:
import sys
import yaml
def _represent_dictorder(self, data):
    # Maintains ordering of specific dictionary keys in the YAML output.
    _data = []
    ordering = ['questions', 'tags', 'answers', 'weight', 'date', 'text']
    for key in ordering:
        if key in data:
            _data.append((str(key), data.pop(key)))
    if data:
        _data.extend(data.items())
    return self.represent_mapping(u'tag:yaml.org,2002:map', _data)

yaml.add_representer(dict, _represent_dictorder)
text="""- questions:
- Lorem ipsum dolor sit amet, consectetur adipiscing elit.
tags:
context: curabitur
answers:
- weight: 2
date: 2014-1-19
text: |-
1. Mauris lorem magna, auctor et tristique id, fringilla ut metus.
2. Donec pellentesque elit non felis feugiat, in gravida ex hendrerit.
3. Mauris quis velit sapien. Nullam blandit, diam et pharetra maximus, mi erat scelerisque turpis, eu vestibulum dui ligula non lectus.
a. Aenean consectetur eleifend accumsan.
4. In erat lacus, egestas ut tincidunt ac, congue quis elit. Suspendisse semper purus ac turpis maximus dignissim.
a. Proin nec neque convallis, placerat odio non, suscipit erat. Nulla nec mattis nibh, accumsan feugiat felis.
5. Mauris lorem magna, auctor et tristique id, fringilla ut metus.
a. Morbi non arcu odio. Maecenas faucibus urna et leo euismod placerat.
b. Nulla facilisi. Pellentesque at pretium nunc.
c. Ut ipsum nibh, suscipit a pretium eu, eleifend vitae purus.
"""
yaml.dump(yaml.load(text), stream=sys.stdout, default_flow_style=False, indent=4)
but this outputs the YAML in a different style, like:
- questions:
- Lorem ipsum dolor sit amet, consectetur adipiscing elit.
tags:
context: curabitur
answers:
- weight: 2
date: 2014-1-19
text: "1. Mauris lorem magna, auctor et tristique id, fringilla ut metus.\n\
2. Donec pellentesque elit non felis feugiat, in gravida ex hendrerit.\n\
3. Mauris quis velit sapien. Nullam blandit, diam et pharetra maximus,\
\ mi erat scelerisque turpis, eu vestibulum dui ligula non lectus.\n \
\ a. Aenean consectetur eleifend accumsan.\n4. In erat lacus, egestas\
\ ut tincidunt ac, congue quis elit. Suspendisse semper purus ac turpis\
\ maximus dignissim.\n a. Proin nec neque convallis, placerat odio\
\ non, suscipit erat. Nulla nec mattis nibh, accumsan feugiat felis.\n\
5. Mauris lorem magna, auctor et tristique id, fringilla ut metus.\n \
\ a. Morbi non arcu odio. Maecenas faucibus urna et leo euismod placerat.\n\
\ b. Nulla facilisi. Pellentesque at pretium nunc.\n c. Ut ipsum\
\ nibh, suscipit a pretium eu, eleifend vitae purus."
As you can see, it's changing the style of the text-block, so that newlines are escaped, making it a lot harder to read.
So I tried specifying the default_style argument like:
yaml.dump(yaml.load(text), stream=sys.stdout, default_flow_style=False, default_style='|', indent=4)
And that fixed the text-block style, but then it broke other styles by putting quotes around all other strings, adding newlines to single-line strings, and munging integers, like:
- "questions":
- |-
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
"tags":
"context": |-
curabitur
"answers":
- "weight": !!int |-
2
"date": |-
2014-1-19
"text": |-
1. Mauris lorem magna, auctor et tristique id, fringilla ut metus.
2. Donec pellentesque elit non felis feugiat, in gravida ex hendrerit.
3. Mauris quis velit sapien. Nullam blandit, diam et pharetra maximus, mi erat scelerisque turpis, eu vestibulum dui ligula non lectus.
a. Aenean consectetur eleifend accumsan.
4. In erat lacus, egestas ut tincidunt ac, congue quis elit. Suspendisse semper purus ac turpis maximus dignissim.
a. Proin nec neque convallis, placerat odio non, suscipit erat. Nulla nec mattis nibh, accumsan feugiat felis.
5. Mauris lorem magna, auctor et tristique id, fringilla ut metus.
a. Morbi non arcu odio. Maecenas faucibus urna et leo euismod placerat.
b. Nulla facilisi. Pellentesque at pretium nunc.
c. Ut ipsum nibh, suscipit a pretium eu, eleifend vitae purus.
How do I fix this so the output resembles the style of my original input?
How would you determine which strings to represent as block literals (or folded blocks, for that matter) and which to represent inline?
Under the assumption that you only want block literals used for strings that span multiple lines, you can write your own string representer that switches between the styles based on the string content:
def selective_representer(dumper, data):
    return dumper.represent_scalar(u"tag:yaml.org,2002:str", data,
                                   style="|" if "\n" in data else None)
yaml.add_representer(str, selective_representer)
Now if you dump your data with default flow style set to False (to prevent dict/list inlining):
yaml.dump(yaml.load(text), stream=sys.stdout, default_flow_style=False, indent=4)
Your scalars will act as you expect them to.
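For completeness, here is a minimal self-contained sketch of the representer in action, using a small stand-in document rather than the original one:
import sys
import yaml

def selective_representer(dumper, data):
    # Literal block style only for strings that actually contain newlines.
    return dumper.represent_scalar(u"tag:yaml.org,2002:str", data,
                                   style="|" if "\n" in data else None)

yaml.add_representer(str, selective_representer)

doc = {"context": "curabitur", "weight": 2, "text": "1. Mauris lorem magna.\n2. Donec pellentesque elit."}
yaml.dump(doc, stream=sys.stdout, default_flow_style=False, indent=4)
# "text" is emitted as a |- block, while "context" and the integer
# "weight" keep their plain single-line style.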
I want to vectorize some text into corresponding integers, convert text to its mapped integers, and also create a new sentence from a new sequence of input integers [2,9,39,46,56,12,89,9].
I have seen some custom functions that can be used for this purpose, but I want to know whether sklearn itself has such functions.
from sklearn.feature_extraction.text import CountVectorizer
a=["""Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Morbi imperdiet mauris posuere, condimentum odio et, volutpat orci.
Curabitur sodales vulputate eros eu gravida. Sed pharetra imperdiet nunc et tempor.
Nullam lectus est, rhoncus vitae lacus at, fermentum aliquam metus.
Phasellus a sollicitudin tortor, non tempor nulla.
Etiam mattis felis enim, a malesuada ligula dignissim at.
Integer congue dolor ut magna blandit, lobortis consequat ante aliquam.
Nulla imperdiet libero eget lorem sagittis, eget iaculis orci dignissim.
Phasellus sit amet sodales odio. Pellentesque commodo tempor risus, et tincidunt neque.
Praesent et sem velit. Maecenas id risus sit amet ex convallis ultrices vel sed purus.
Sed fringilla, leo quis congue sollicitudin, mauris nunc vehicula mi, et laoreet ligula
urna et nulla. Nam sollicitudin urna sed dolor vehicula euismod. Mauris bibendum pulvinar
ornare. In suscipit sed mi ut posuere.
Proin egestas, nibh ut egestas mattis, ipsum nulla bibendum enim, ac suscipit nisl justo
id metus. Nam est dui, elementum eget suscipit nec, aliquam in mi. Integer tortor erat,
aliquet at sapien et, fringilla posuere leo. Praesent non congue est. Vivamus tincidunt
tellus eu placerat tincidunt. Phasellus convallis lacus vitae ex congue efficitur.
Sed ut bibendum massa, vitae molestie ligula. Phasellus purus felis, fermentum vitae
hendrerit vel, vulputate quis metus."""]
vec = CountVectorizer()
dtm=vec.fit_transform(a)
print vec.vocabulary_
#convert text to corresponding vectors
mapped_a=
#new sentence using below mapped values
#input [2,9,39,46,56,12,89,9]
#creating sentence using specific sequence
new_sentence=
For vectorizing a sentence into integers you can use the transform function. The output of this function is a vector with counts for each term, i.e. a feature vector.
vec = CountVectorizer()
vec.fit(a)
print vec.vocabulary_
new_sentence = "dolor nulla enim"
mapped_a = vec.transform([new_sentence])
print mapped_a.toarray() # sparse feature vector
tokenizer = vec.build_tokenizer()
# array of words ids
for token in tokenizer(new_sentence):
    print vec.vocabulary_.get(token)
The second part of the question is not so straightforward. CountVectorizer has an inverse_transform function for this purpose, which takes a sparse feature vector as input. However, in your example you want to create a sentence in which the same term may occur more than once, and that is not possible with this function.
The solution is to use the vocabulary (word to id) and build an inverse vocabulary (id to word) from it. CountVectorizer has no inverse vocabulary by default, so you must create it from the vocabulary.
import numpy as np

input = [2,9,9]
# 1. inverse_transform function
# create sparse vector
sparse_input = [1 if i in input else 0 for i in range(0, len(vec.vocabulary_))]
print vec.inverse_transform(sparse_input)
> ['aliquam', 'commodo']
# 2. Inverse vocabulary - custom solution
terms = np.array(list(vec.vocabulary_.keys()))
indices = np.array(list(vec.vocabulary_.values()))
inverse_vocabulary = terms[np.argsort(indices)]
for i in input:
    print inverse_vocabulary[i]
> ['aliquam', 'commodo', 'commodo']
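For reference, the same idea as one self-contained sketch in Python 3 syntax, using the id sequence from your question. Note that the toy corpus here is too small to contain most of those ids, so unknown ids are simply skipped:
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["lorem ipsum dolor sit amet consectetur adipiscing elit"]
vec = CountVectorizer()
vec.fit(corpus)

# Build an id -> word lookup from the fitted vocabulary (word -> id).
inverse_vocabulary = {idx: word for word, idx in vec.vocabulary_.items()}

ids = [2, 9, 39, 46, 56, 12, 89, 9]
new_sentence = " ".join(inverse_vocabulary[i] for i in ids if i in inverse_vocabulary)
print(new_sentence)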
Take a look at the preprocessing utilities in sklearn: LabelEncoder and OneHotEncoder are usually used to encode categorical variables. Encoding whole text this way is not recommended, though!
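For categorical labels (rather than free text), the usage looks roughly like this (a small illustration unrelated to the corpus above):
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
codes = le.fit_transform(["red", "green", "blue", "green"])
print(codes)                         # [2 1 0 1] - classes are sorted alphabetically
print(le.inverse_transform(codes))   # ['red' 'green' 'blue' 'green']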
I am working with Python 3.4 on Windows 7. I am trying to compare two text files and report the differences between them using difflib.
Following is the code I am using:
import difflib
from difflib_data import *
with open("s1.txt") as f, open("s2.txt") as g:
flines = f.readlines()
glines = g.readlines()
d = difflib.Differ()
diff = d.compare(flines, glines)
print("\n".join(diff))
Traceback:
from difflib_data import *
ImportError: No module named 'difflib_data'
How can I remove this error? Thanks.
From the following post, it seems this is the example data provided with the PyMOTW tutorial.
I assume the author wants you to copy and paste the test data into a new file named difflib_data.py in your working directory.
Copy the following lines into difflib_data.py
text1 = """Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Integereu lacus accumsan arcu fermentum euismod. Donec pulvinar porttitortellus. Aliquam venenatis. Donec facilisis pharetra tortor. In necmauris eget magna consequat convallis. Nam sed sem vitae odiopellentesque interdum. Sed consequat viverra nisl. Suspendisse arcumetus, blandit quis, rhoncus ac, pharetra eget, velit. Maurisurna. Morbi nonummy molestie orci. Praesent nisi elit, fringilla ac,suscipit non, tristique vel, mauris. Curabitur vel lorem id nisl portaadipiscing. Suspendisse eu lectus. In nunc. Duis vulputate tristiqueenim. Donec quis lectus a justo imperdiet tempus."""
text1_lines = text1.splitlines()
text2 = """Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Integereu lacus accumsan arcu fermentum euismod. Donec pulvinar porttitortellus. Aliquam venenatis. Donec facilisis pharetra tortor. In necmauris eget magna consequat convallis. Nam sed sem vitae odiopellentesque interdum. Sed consequat viverra nisl. Suspendisse arcumetus, blandit quis, rhoncus ac, pharetra eget, velit. Maurisurna. Morbi nonummy molestie orci. Praesent nisi elit, fringilla ac,suscipit non, tristique vel, mauris. Curabitur vel lorem id nisl portaadipiscing. Suspendisse eu lectus. In nunc. Duis vulputate tristiqueenim. Donec quis lectus a justo imperdiet tempus."""
text2_lines = text2.splitlines()
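Note that the snippet in your question never actually uses anything from difflib_data, so another option is simply to delete the import line. If you do want to run the PyMOTW example itself, the comparison would look something like this (a sketch using the two strings above):
import difflib
from difflib_data import *

d = difflib.Differ()
diff = d.compare(text1_lines, text2_lines)
print('\n'.join(diff))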
When I submit the following text in my textarea box on the Windows GAE launcher at http://localhost:8080, it displays fine.
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Vivamus a dolor eget diam
condimentum varius. Proin malesuada dictum ante, sed commodo purus vestibulum in.
Sed nibh dui, volutpat eu porta eu, molestie ut lacus. Vivamus iaculis urna ut tellus
blandit eu at nisl. Fusce eros libero, aliquam vitae hendrerit vitae, posuere ac diam.
Vivamus sagittis, felis in imperdiet pellentesque, eros nibh porttitor nisi, id
tristique leo libero a ligula. In in elit et velit auctor lacinia eleifend cursus mauris. Mauris
pellentesque lorem et augue placerat ultrices. Nam sed quam nisl, eget elementum felis.
Integer sapien ipsum, aliquet quis viverra quis, adipiscing eget sapien. Nam consequat
lacinia enim, id viverra nisl molestie feugiat.
When my code is deployed on GAE and I hit the submit button, it displays like this:
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Vivamus a dolor eg=
et diam condimentum varius. Proin malesuada dictum ante, sed commodo purus =
vestibulum in. Sed nibh dui, volutpat eu porta eu, molestie ut lacus. Vivam=
us iaculis urna ut tellus tempor blandit eu at nisl. Fusce eros libero, ali=
quam vitae hendrerit vitae, posuere ac diam. Vivamus sagittis, felis in imp=
erdiet pellentesque, eros nibh porttitor nisi, id tristique leo libero a li=
gula. In in elit et velit auctor lacinia eleifend cursus mauris. Mauris pel=
lentesque lorem et augue placerat ultrices. Nam sed quam nisl, eget element=
um felis. Integer sapien ipsum, aliquet quis viverra quis, adipiscing eget =
sapien. Nam consequat lacinia enim, id viverra nisl molestie feugiat.
Implementation description below:
I am using the jinja2 engine. I have autoescape set to False:
jinja_env = jinja2.Environment(loader = jinja2.FileSystemLoader(template_dir), autoescape = False)
I get the content from a textarea element. Here is how it is set in my template:
<label>
<div>Information</div>
<textarea name="information">{{r.information}}</textarea>
</label>
I retrieve the string using:
information = self.request.get('information')
I commit the string to the datastore:
r.information = information
r.put()
When displaying it again for editing I use the same template code:
<label>
<div>Information</div>
<textarea name="information">{{r.information}}</textarea>
</label>
Everything works great locally, but when I deploy it to Google App Engine I get some strange results. Where do those = signs come from, I wonder?
EDIT:
For clarification: it is putting =CRLF at the end of every line.
EDIT 2:
Here is the code from comment 21 of the bug:
def from_fieldstorage(cls, fs):
    """
    Create a dict from a cgi.FieldStorage instance
    """
    obj = cls()
    if fs.list:
        # fs.list can be None when there's nothing to parse
        for field in fs.list:
            if field.filename:
                obj.add(field.name, field)
            else:
                # first, set a common charset to utf-8.
                common_charset = 'utf-8'
                # second, check Content-Transfer-Encoding and decode
                # the value appropriately
                field_value = field.value
                transfer_encoding = field.headers.get(
                    'Content-Transfer-Encoding', None)
                if transfer_encoding == 'base64':
                    field_value = base64.b64decode(field_value)
                if transfer_encoding == 'quoted-printable':
                    field_value = quopri.decodestring(field_value)
                if field.type_options.has_key('charset') and \
                        field.type_options['charset'] != common_charset:
                    # decode with a charset specified in each
                    # multipart, and then encode it again with a
                    # charset specified in top level FieldStorage
                    field_value = field_value.decode(
                        field.type_options['charset']).encode(common_charset)
                # TODO: Should we take care of field.name here?
                obj.add(field.name, field_value)
    return obj

multidict.MultiDict.from_fieldstorage = classmethod(from_fieldstorage)
You might be falling foul of this bug
The workaround in comment 21 has worked for me in the past, and recent comments indicate it still does.
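If you just want to confirm that what ended up in the datastore really is quoted-printable, decoding it by hand with the standard-library quopri module is a quick check (a sketch; stored_value stands in for whatever you read back from the datastore):
import quopri

stored_value = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Vivamus a dolor eg=\net diam condimentum varius."
print(quopri.decodestring(stored_value.encode()).decode())
# The soft line break ("=" at the end of the line) disappears and
# the original text "...a dolor eget diam..." is recovered.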