How can I scrape data from .odt files with Python?

I have a bunch of .odt files (about 2000 files), each file contains clinical data of patients of Oncology Department of the hospital I work in.
I need to load this data in a MySQL database.
Each document is formatted quite the same way, and looks like this:
Date dd.mm.yyyy
Mr. XXX YYY DOB dd.mm.yyyy
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Phasellus ut elit vel tellus vulputate gravida et sit amet neque. Donec pulvinar finibus aliquam. Donec hendrerit vitae ex a sollicitudin. Vestibulum maximus tristique pellentesque. Nullam felis erat, porta ut urna sit amet, mattis ullamcorper turpis. Aliquam erat volutpat. Aenean consequat molestie risus sed blandit. Nullam tristique luctus turpis, quis blandit turpis fringilla vitae. Nulla facilisi. Donec fringilla tristique sapien, et congue enim laoreet tincidunt. Sed vel odio leo. Integer scelerisque pulvinar sem vel maximus. Quisque rutrum, mi in posuere tempus, nunc odio posuere arcu, sed rhoncus urna ante id lorem. Nunc facilisis justo et mattis varius.
Dr. XXX YYY
I tried the approach from the question "how to retrive data from odt xml file in python?"
and obtained an XML file (I only had to change the line fd.write(content)
to fd.write(str(content)) to make it work).
How can I parse the data from that XML file? Should I use BeautifulSoup?
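One possible direction, sketched with the standard library only: an .odt file is a zip archive whose body lives in content.xml, so the paragraphs can be read directly without writing an intermediate XML file. The function names, the regular expressions, and the assumed layout (date line, patient line, free-text notes, doctor line) are illustrative guesses based on the sample above, not a tested solution for the real files.

```python
import re
import zipfile
import xml.etree.ElementTree as ET

# Namespace of the text elements in an OpenDocument content.xml
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"

# Patterns for the layout shown in the question (assumed; adjust as needed)
DATE_RE = re.compile(r"^Date\s+(\d{2}\.\d{2}\.\d{4})$")
PATIENT_RE = re.compile(r"^(.+?)\s+DOB\s+(\d{2}\.\d{2}\.\d{4})$")


def odt_paragraphs(source):
    """Return the non-empty paragraph texts of an .odt file (path or file object)."""
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("content.xml"))
    texts = ("".join(p.itertext()).strip() for p in root.iter("{%s}p" % TEXT_NS))
    return [t for t in texts if t]


def parse_record(paragraphs):
    """Map paragraphs to fields, assuming: date, patient line, notes..., doctor."""
    if len(paragraphs) < 3:
        raise ValueError("too few paragraphs for the assumed layout")
    date_match = DATE_RE.match(paragraphs[0])
    patient_match = PATIENT_RE.match(paragraphs[1])
    if not (date_match and patient_match):
        raise ValueError("unexpected document layout")
    return {
        "date": date_match.group(1),
        "patient": patient_match.group(1),
        "dob": patient_match.group(2),
        "notes": "\n".join(paragraphs[2:-1]),
        "doctor": paragraphs[-1],
    }
```

Calling parse_record(odt_paragraphs("some_file.odt")) on each file would give one dictionary per document, which could then be inserted into the database with a parameterized query. Documents that deviate from the assumed layout raise ValueError, so they can be collected and reviewed by hand.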

Related

Python Curses Module preset text

I just started using the Python curses module, and I was wondering if there is a way to preset text. For example, when you run the code it would print:
Lorem ipsum dolor sit amet, consectetur adipiscing elit.Cras nec sapien ut dui mattis viverra.
Nam et pharetra erat. Mauris sodales aliquam purus, ac congue urna dictum eu.
Donec varius facilisis quam, at malesuada odio tincidunt et. Duis fermentum leo
at mi sodales sodales quis vel urna.
and you would then be able to edit that preset text freely. Thanks.

Python YAML dump using block style without quotes

How do you load and dump YAML using PyYAML, so that it uses the original styling as closely as possible?
I have Python to load and dump YAML data like:
import sys
import yaml
def _represent_dictorder(self, data):
    # Maintains ordering of specific dictionary keys in the YAML output.
    _data = []
    ordering = ['questions', 'tags', 'answers', 'weight', 'date', 'text']
    for key in ordering:
        if key in data:
            _data.append((str(key), data.pop(key)))
    if data:
        _data.extend(data.items())
    return self.represent_mapping(u'tag:yaml.org,2002:map', _data)
yaml.add_representer(dict, _represent_dictorder)
text="""- questions:
  - Lorem ipsum dolor sit amet, consectetur adipiscing elit.
  tags:
    context: curabitur
  answers:
  - weight: 2
    date: 2014-1-19
    text: |-
      1. Mauris lorem magna, auctor et tristique id, fringilla ut metus.
      2. Donec pellentesque elit non felis feugiat, in gravida ex hendrerit.
      3. Mauris quis velit sapien. Nullam blandit, diam et pharetra maximus, mi erat scelerisque turpis, eu vestibulum dui ligula non lectus.
         a. Aenean consectetur eleifend accumsan.
      4. In erat lacus, egestas ut tincidunt ac, congue quis elit. Suspendisse semper purus ac turpis maximus dignissim.
         a. Proin nec neque convallis, placerat odio non, suscipit erat. Nulla nec mattis nibh, accumsan feugiat felis.
      5. Mauris lorem magna, auctor et tristique id, fringilla ut metus.
         a. Morbi non arcu odio. Maecenas faucibus urna et leo euismod placerat.
         b. Nulla facilisi. Pellentesque at pretium nunc.
         c. Ut ipsum nibh, suscipit a pretium eu, eleifend vitae purus.
"""
yaml.dump(yaml.load(text), stream=sys.stdout, default_flow_style=False, indent=4)
but this outputs the YAML in a different style, like:
- questions:
- Lorem ipsum dolor sit amet, consectetur adipiscing elit.
tags:
context: curabitur
answers:
- weight: 2
date: 2014-1-19
text: "1. Mauris lorem magna, auctor et tristique id, fringilla ut metus.\n\
2. Donec pellentesque elit non felis feugiat, in gravida ex hendrerit.\n\
3. Mauris quis velit sapien. Nullam blandit, diam et pharetra maximus,\
\ mi erat scelerisque turpis, eu vestibulum dui ligula non lectus.\n \
\ a. Aenean consectetur eleifend accumsan.\n4. In erat lacus, egestas\
\ ut tincidunt ac, congue quis elit. Suspendisse semper purus ac turpis\
\ maximus dignissim.\n a. Proin nec neque convallis, placerat odio\
\ non, suscipit erat. Nulla nec mattis nibh, accumsan feugiat felis.\n\
5. Mauris lorem magna, auctor et tristique id, fringilla ut metus.\n \
\ a. Morbi non arcu odio. Maecenas faucibus urna et leo euismod placerat.\n\
\ b. Nulla facilisi. Pellentesque at pretium nunc.\n c. Ut ipsum\
\ nibh, suscipit a pretium eu, eleifend vitae purus."
As you can see, it's changing the style of the text-block, so that newlines are escaped, making it a lot harder to read.
So I tried specifying the default_style attribute like:
yaml.dump(yaml.load(text), stream=sys.stdout, default_flow_style=False, default_style='|', indent=4)
And that fixed the text-block style, but then it broke other styles by putting quotes around all other strings, adding newlines to single-line strings, and munging integers, like:
- "questions":
- |-
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
"tags":
"context": |-
curabitur
"answers":
- "weight": !!int |-
2
"date": |-
2014-1-19
"text": |-
1. Mauris lorem magna, auctor et tristique id, fringilla ut metus.
2. Donec pellentesque elit non felis feugiat, in gravida ex hendrerit.
3. Mauris quis velit sapien. Nullam blandit, diam et pharetra maximus, mi erat scelerisque turpis, eu vestibulum dui ligula non lectus.
a. Aenean consectetur eleifend accumsan.
4. In erat lacus, egestas ut tincidunt ac, congue quis elit. Suspendisse semper purus ac turpis maximus dignissim.
a. Proin nec neque convallis, placerat odio non, suscipit erat. Nulla nec mattis nibh, accumsan feugiat felis.
5. Mauris lorem magna, auctor et tristique id, fringilla ut metus.
a. Morbi non arcu odio. Maecenas faucibus urna et leo euismod placerat.
b. Nulla facilisi. Pellentesque at pretium nunc.
c. Ut ipsum nibh, suscipit a pretium eu, eleifend vitae purus.
How do I fix this so the output resembles the style of my original input?

How would you determine what string to represent as a block literal (or a folded block for that matter) and what to represent inline?
Under the assumption that you only want block literals used with strings that span over multiple lines, you can write your own string representer to switch between the styles based on the string content:
def selective_representer(dumper, data):
    return dumper.represent_scalar(u"tag:yaml.org,2002:str", data,
                                   style="|" if "\n" in data else None)
yaml.add_representer(str, selective_representer)
Now if you dump your data with default flow style set to False (to prevent dict/list inlining):
yaml.dump(yaml.load(text), stream=sys.stdout, default_flow_style=False, indent=4)
Your scalars will act as you expect them to.
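As a self-contained illustration of the representer above, here is a minimal sketch that registers it on a SafeDumper subclass instead of globally. The class name BlockStyleDumper and the sample data are made up for the example; the sketch assumes a PyYAML version where yaml.dump accepts a Dumper argument.

```python
import yaml


class BlockStyleDumper(yaml.SafeDumper):
    """Dumper subclass so the custom representer stays local (illustrative name)."""


def selective_representer(dumper, data):
    # Literal block style only for multi-line strings; default style otherwise.
    style = "|" if "\n" in data else None
    return dumper.represent_scalar("tag:yaml.org,2002:str", data, style=style)


BlockStyleDumper.add_representer(str, selective_representer)

# Hypothetical sample data: one single-line and one multi-line string.
data = {"title": "hello", "text": "line one\nline two"}
output = yaml.dump(data, Dumper=BlockStyleDumper, default_flow_style=False)
print(output)
```

Registering on a subclass keeps the change away from other code that uses the default dumpers. Note that PyYAML may still fall back to a quoted style for some strings, for example when a line ends with whitespace.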

ImportError :No module named difflib_data

I am working with Python 3.4 on Windows 7. I am trying to compare two text files and report the differences between them using difflib.
This is the code I am using:
import difflib
from difflib_data import *
with open("s1.txt") as f, open("s2.txt") as g:
    flines = f.readlines()
    glines = g.readlines()
d = difflib.Differ()
diff = d.compare(flines, glines)
print("\n".join(diff))
Traceback:
from difflib_data import *
ImportError: No module named 'difflib_data'
How can I get rid of this error? Thanks.

According to the following post, difflib_data appears to be the example data provided with the PyMOTW tutorial.
I assume the author wants you to copy and paste the source of the test data into a new file named difflib_data.py in your working directory.
Copy the following lines into difflib_data.py:
text1 = """Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Integer
eu lacus accumsan arcu fermentum euismod. Donec pulvinar porttitor
tellus. Aliquam venenatis. Donec facilisis pharetra tortor. In nec
mauris eget magna consequat convallis. Nam sed sem vitae odio
pellentesque interdum. Sed consequat viverra nisl. Suspendisse arcu
metus, blandit quis, rhoncus ac, pharetra eget, velit. Mauris
urna. Morbi nonummy molestie orci. Praesent nisi elit, fringilla ac,
suscipit non, tristique vel, mauris. Curabitur vel lorem id nisl porta
adipiscing. Suspendisse eu lectus. In nunc. Duis vulputate tristique
enim. Donec quis lectus a justo imperdiet tempus."""
text1_lines = text1.splitlines()
text2 = """Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Integer
eu lacus accumsan arcu fermentum euismod. Donec pulvinar porttitor
tellus. Aliquam venenatis. Donec facilisis pharetra tortor. In nec
mauris eget magna consequat convallis. Nam sed sem vitae odio
pellentesque interdum. Sed consequat viverra nisl. Suspendisse arcu
metus, blandit quis, rhoncus ac, pharetra eget, velit. Mauris
urna. Morbi nonummy molestie orci. Praesent nisi elit, fringilla ac,
suscipit non, tristique vel, mauris. Curabitur vel lorem id nisl porta
adipiscing. Suspendisse eu lectus. In nunc. Duis vulputate tristique
enim. Donec quis lectus a justo imperdiet tempus."""
text2_lines = text2.splitlines()
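Since the question reads its own files, the difflib_data import is not actually needed for the comparison itself. A minimal sketch of the same Differ call on two in-memory lists (the sample lines and the helper name diff_lines are invented for illustration):

```python
import difflib


def diff_lines(a_lines, b_lines):
    """Return the Differ comparison of two lists of lines as a list of strings."""
    return list(difflib.Differ().compare(a_lines, b_lines))


# Hypothetical sample input standing in for the contents of s1.txt and s2.txt
old = ["alpha", "beta", "gamma"]
new = ["alpha", "gamma", "delta"]

result = diff_lines(old, new)
print("\n".join(result))
```

Lines prefixed with "- " appear only in the first input, lines prefixed with "+ " only in the second, and lines prefixed with two spaces are common to both. When the lines come from readlines() they already end in a newline, so "".join(...) avoids doubled blank lines in the printed result.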

html textarea submission causing '=' chars to display at new lines only when deployed on GAE

When I submit the following text in my textarea box on the windows GAE launcher at http://localhost:8080 it displays fine.
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Vivamus a dolor eget diam
condimentum varius. Proin malesuada dictum ante, sed commodo purus vestibulum in.
Sed nibh dui, volutpat eu porta eu, molestie ut lacus. Vivamus iaculis urna ut tellus
blandit eu at nisl. Fusce eros libero, aliquam vitae hendrerit vitae, posuere ac diam.
Vivamus sagittis, felis in imperdiet pellentesque, eros nibh porttitor nisi, id
tristique leo libero a ligula. In in elit et velit auctor lacinia eleifend cursus mauris. Mauris
pellentesque lorem et augue placerat ultrices. Nam sed quam nisl, eget elementum felis.
Integer sapien ipsum, aliquet quis viverra quis, adipiscing eget sapien. Nam consequat
lacinia enim, id viverra nisl molestie feugiat.
When my code is deployed on GAE after I hit the submit button it displays like this:
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Vivamus a dolor eg=
et diam condimentum varius. Proin malesuada dictum ante, sed commodo purus =
vestibulum in. Sed nibh dui, volutpat eu porta eu, molestie ut lacus. Vivam=
us iaculis urna ut tellus tempor blandit eu at nisl. Fusce eros libero, ali=
quam vitae hendrerit vitae, posuere ac diam. Vivamus sagittis, felis in imp=
erdiet pellentesque, eros nibh porttitor nisi, id tristique leo libero a li=
gula. In in elit et velit auctor lacinia eleifend cursus mauris. Mauris pel=
lentesque lorem et augue placerat ultrices. Nam sed quam nisl, eget element=
um felis. Integer sapien ipsum, aliquet quis viverra quis, adipiscing eget =
sapien. Nam consequat lacinia enim, id viverra nisl molestie feugiat.
Implementation Description below:
I am using the jinja2 engine. I have autoescape = False:
jinja_env = jinja2.Environment(loader = jinja2.FileSystemLoader(template_dir), autoescape = False)
I get the content from a textarea element. Here is how it is set in my template:
<label>
<div>Information</div>
<textarea name="information">{{r.information}}</textarea>
</label>
I retrieve the string using:
information = self.request.get('information')
I commit the string to the datastore:
r.information = information
r.put()
When displaying it again for editing I use the same template code:
<label>
<div>Information</div>
<textarea name="information">{{r.information}}</textarea>
</label>
Everything works great locally. But when I deploy it to the google app engine I am getting some strange results. Where do those = signs come from I wonder?
EDIT:
For clarification it is putting =CRLF at the end of every line.
EDIT 2:
Here is the code from comment 21 of the bug:
def from_fieldstorage(cls, fs):
    """
    Create a dict from a cgi.FieldStorage instance
    """
    obj = cls()
    if fs.list:
        # fs.list can be None when there's nothing to parse
        for field in fs.list:
            if field.filename:
                obj.add(field.name, field)
            else:
                # first, set a common charset to utf-8.
                common_charset = 'utf-8'
                # second, check Content-Transfer-Encoding and decode
                # the value appropriately
                field_value = field.value
                transfer_encoding = field.headers.get(
                    'Content-Transfer-Encoding', None)
                if transfer_encoding == 'base64':
                    field_value = base64.b64decode(field_value)
                if transfer_encoding == 'quoted-printable':
                    field_value = quopri.decodestring(field_value)
                if field.type_options.has_key('charset') and \
                        field.type_options['charset'] != common_charset:
                    # decode with a charset specified in each
                    # multipart, and then encode it again with a
                    # charset specified in top level FieldStorage
                    field_value = field_value.decode(
                        field.type_options['charset']).encode(common_charset)
                # TODO: Should we take care of field.name here?
                obj.add(field.name, field_value)
    return obj
multidict.MultiDict.from_fieldstorage = classmethod(from_fieldstorage)

You might be falling foul of this bug
The workaround in comment 21 has worked for me in the past, and recent comments indicate it still does.
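The stray = characters at line ends look like quoted-printable soft line breaks, which is the encoding the workaround decodes with quopri.decodestring. A small sketch of that decoding step in isolation, using a fragment of the text from the question as assumed input:

```python
import quopri

# A quoted-printable body: "=" at the very end of a line marks a soft line
# break, meaning the line continues on the next one.
encoded = b"Vivamus a dolor eg=\r\net diam condimentum varius."

# Decoding removes the soft line breaks and rejoins the text.
decoded = quopri.decodestring(encoded)
print(decoded.decode("utf-8"))
```

If the form value arrives still encoded like this and is stored without decoding, the =CRLF sequences end up in the saved text, which would match the symptom described in the question.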

lxml.html search and replace [closed]

Closed 12 years ago.
I need your help or suggestions.
I started reading some books about Python just because of this problem :) but I see it will take me a long time to learn the whole language. I also skimmed and searched through the lxml.html documentation, but I still can't figure out how to do what I want.
I created two HTML files as samples to explain my problem. You can see the code here: http://pzt.me/ltbj
There is also a screenshot of the differences, which makes it easier to see what's going on.
If somebody has done something like this before, or if you have an idea how I could do it, please let me know.
Thank you.
Best,
Jozsef
OK here is the code:
~~~~~~~~~~~
This:
~~~~~~~~~~~
New Document
<body>
<h2><a name="2" class="class1">2</a></h2> ^ top ^
<p><span class="class3">20</span>Sed imperdiet, lacus eu consectetur tempus, tellus metus vestibulum tortor, nec tincidunt nisl enim non tortor. <span class="class3">21</span>Nam in aliquam magna. Maecenas hendrerit fringilla dui facilisis aliquet. Phasellus neque justo, aliquet non pellentesque vel, dictum non libero. Phasellus vel nulla mi, id molestie purus. Suspendisse orci ante, imperdiet at tempus id, pulvinar eu mi. Aliquam erat volutpat. <span class="class3">22</span>Class aptent taciti sociosqu ad litora torquent per conubia nostra, per inceptos himenaeos. Pellentesque pretium, ligula tristique porta fringilla, mauris lectus gravida nibh, consectetur ornare lacus tellus quis sem. <span class="class3">23</span>Curabitur nibh dui, feugiat sed luctus sed, laoreet sed tortor.</p>
<p><span class="class3">24</span>Class aptent taciti sociosqu ad litora torquent per conubia nostra, per inceptos himenaeos. <span class="class3">25</span>Class aptent taciti sociosqu ad litora torquent per conubia nostra, per inceptos himenaeos.</p>
<p><span class="class3">26</span>Sed imperdiet, lacus eu consectetur tempus, "tellus metus vestibulum tortor, nec tincidunt nisl enim non tortor."</p>
<p><span class="class3">27</span></p>
<p>Nunc volutpat lacus;</p>
<p>Etiam sit amet dapibus;</p>
<p>Nunc consequat mauris.</p>
<p><span class="class3">15</span>Vestibulum ante ipsum primis in faucibus orci luctus et ultrices posuere cubilia Curae; Nunc volutpat lacus a lacus dignissim sed iaculis metus consectetur. <span class="class3">17</span>Nunc consequat mauris nec ligula ullamcorper ut iaculis nibh sodales. "Nulla tincidunt lorem eu odio laoreet facilisis." <span class="class3">18</span>Aliquam erat volutpat. Curabitur sagittis, mauris quis laoreet consectetur, erat urna tincidunt augue, ut eleifend felis mi quis felis. <span class="class3">19</span>Vivamus a elit risus, consequat sagittis ligula. Nunc ut vestibulum ipsum. Curabitur at sapien vitae est egestas aliquam. <span class="class3">20</span> Donec porttitor, ligula vel venenatis posuere, purus nunc adipiscing ante, id pellentesque turpis nulla eu magna. <span class="class3">21</span>Praesent gravida, eros ut scelerisque commodo, magna quam volutpat elit, a aliquet neque ligula a mauris. <span class="class3">22</span>Curabitur nibh dui, feugiat sed luctus sed, laoreet sed tortor. <span class="class3">23</span>Lorem ipsum dolor sit:</p>
<p>Pellentesque pretium, ligula tristique</p>
<p>felis viverra;</p>
<p>justo lobortis ut "l"</p>
<p>unc ut consectetur fermentum.</p>
<p><span class="class3">14</span>Proin et tellus felis:</p>
<p>Suspendisse potenti,</p>
<p>enim non tortor</p>
<p>Donec porttitor.</p>
<p>Morbi eleifend fermentum</p>
<p>Aliquam id ante.</p>
<p><span class="class3">15</span></p>
<p>Curabitur nibh dui, feugiat sed luctus sed, laoreet sed tortor,</p>
<p>etiam ullamcorper.</p>
<p>vivamus interdum nulla,</p>
<p>odio laoreet facilisis.</p>
<p><span class="class3">20</span>Suspendisse potenti. Nam in aliquam magna. Maecenas hendrerit fringilla dui facilisis aliquet. <span class="class3">21</span>Suspendisse potenti. Nam in aliquam magna. Maecenas hendrerit fringilla dui facilisis aliquet. </p>
</body>
~~~~~~~~~~~~~~~~~~~~~
To become this:
~~~~~~~~~~~~~~~~~~~~~
New Document
<body>
<h2><a name="2" class="class1">2</a></h2> ^ top ^
<p><span class="class3">20</span>Sed imperdiet, lacus eu consectetur tempus, tellus metus vestibulum tortor, nec tincidunt nisl enim non tortor. <span class="class3">21</span>Nam in aliquam magna. Maecenas hendrerit fringilla dui facilisis aliquet. Phasellus neque justo, aliquet non pellentesque vel, dictum non libero. Phasellus vel nulla mi, id molestie purus. Suspendisse orci ante, imperdiet at tempus id, pulvinar eu mi. Aliquam erat volutpat. <span class="class3">22</span>Class aptent taciti sociosqu ad litora torquent per conubia nostra, per inceptos himenaeos. Pellentesque pretium, ligula tristique porta fringilla, mauris lectus gravida nibh, consectetur ornare lacus tellus quis sem. <span class="class3">23</span>Curabitur nibh dui, feugiat sed luctus sed, laoreet sed tortor.</p>
<p><span class="class3">24</span>Class aptent taciti sociosqu ad litora torquent per conubia nostra, per inceptos himenaeos. <span class="class3">25</span>Class aptent taciti sociosqu ad litora torquent per conubia nostra, per inceptos himenaeos.</p>
<p><span class="class3">26</span>Sed imperdiet, lacus eu consectetur tempus, "tellus metus vestibulum tortor, nec tincidunt nisl enim non tortor."</p>
<p><span class="class3">27</span><br />
Nunc volutpat lacus;<br />
Etiam sit amet dapibus;<br />
Nunc consequat mauris.</p>
<p><span class="class3">15</span>Vestibulum ante ipsum primis in faucibus orci luctus et ultrices posuere cubilia Curae; Nunc volutpat lacus a lacus dignissim sed iaculis metus consectetur. <span class="class3">17</span>Nunc consequat mauris nec ligula ullamcorper ut iaculis nibh sodales. "Nulla tincidunt lorem eu odio laoreet facilisis." <span class="class3">18</span>Aliquam erat volutpat. Curabitur sagittis, mauris quis laoreet consectetur, erat urna tincidunt augue, ut eleifend felis mi quis felis. <span class="class3">19</span>Vivamus a elit risus, consequat sagittis ligula. Nunc ut vestibulum ipsum. Curabitur at sapien vitae est egestas aliquam. <span class="class3">20</span> Donec porttitor, ligula vel venenatis posuere, purus nunc adipiscing ante, id pellentesque turpis nulla eu magna. <span class="class3">21</span>Praesent gravida, eros ut scelerisque commodo, magna quam volutpat elit, a aliquet neque ligula a mauris. <span class="class3">22</span>Curabitur nibh dui, feugiat sed luctus sed, laoreet sed tortor. <span class="class3">23</span>Lorem ipsum dolor sit:<br />
Pellentesque pretium, ligula tristique<br />
felis viverra;<br />
justo lobortis ut "l"<br />
unc ut consectetur fermentum.</p>
<p><span class="class3">14</span>Proin et tellus felis:<br />
Suspendisse potenti,<br />
enim non tortor<br />
Donec porttitor.<br />
Morbi eleifend fermentum<br />
Aliquam id ante.</p>
<p><span class="class3">15</span><br />
Curabitur nibh dui, feugiat sed luctus sed, laoreet sed tortor,<br />
etiam ullamcorper.<br />
vivamus interdum nulla,<br />
odio laoreet facilisis.</p>
<p><span class="class3">20</span>Suspendisse potenti. Nam in aliquam magna. Maecenas hendrerit fringilla dui facilisis aliquet. <span class="class3">21</span>Suspendisse potenti. Nam in aliquam magna. Maecenas hendrerit fringilla dui facilisis aliquet. </p>
</body>
I can't include the image, sorry. You will need to follow the link above if you want to see it.
Thanks.

Use BeautifulSoup to parse the document and recreate it after processing it. It is the easiest thing to do. I wouldn't use lxml for what you are trying to do.
http://www.crummy.com/software/BeautifulSoup/documentation.html
Look at example here on how tags are added and removed:
Extract all <script> tags in an HTML page and append to the bottom of the document
https://stackoverflow.com/questions/tagged/beautifulsoup

If you're really that short on time, you may be able to accomplish your task after reading chapter 8 of Dive Into Python ( http://diveintopython.net/html_processing/index.html ).
That said, I strongly suggest that you start from the very beginning of the book.
Regular expressions (chapter 7 of the same book) may also be of great help. I have not quite understood what you're trying to accomplish, though. Do you want to replace <p></p> tags with <br/>?
Anyway, look into the sgmllib and re modules.
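To make the regular-expression suggestion concrete, here is a rough sketch using the re module. It assumes the intended rule is "a paragraph that does not start with a class3 number span is folded into the previous paragraph with a line break", which is only an interpretation of the two samples in the question. The function name is made up, and a regex like this only suits markup as regular as the sample.

```python
import re

# Matches the boundary between two paragraphs when the second one does not
# start with a <span class="class3"> number.
PARAGRAPH_BREAK = re.compile(r'</p>\s*<p>(?!<span class="class3">)')


def merge_unnumbered_paragraphs(html):
    """Fold paragraphs without a leading number span into the previous one."""
    return PARAGRAPH_BREAK.sub("<br />\n", html)


# Fragment taken from the sample document in the question
sample = (
    '<p><span class="class3">27</span></p>\n'
    '<p>Nunc volutpat lacus;</p>\n'
    '<p>Nunc consequat mauris.</p>\n'
    '<p><span class="class3">15</span>Vestibulum ante ipsum</p>'
)
print(merge_unnumbered_paragraphs(sample))
```

For less uniform HTML, a real parser such as BeautifulSoup or lxml would be the safer choice, as the other answer suggests.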
