Error while creating PDF using ReportLab in Python

Data:
['<p>Work! please work.img:0\xc3\x82\xc2\xa0Will you?img:1</p>img:2img:3\xc3\x82\xc2\xa0ascasdacasdadasdaca HAHAHAHAHA! BAND!\n', '\n', "<p>Random test.</p><p><br />If you want to start a flame war, mention lines of code per day or hour in a developer\xc3\xa2€™s public forum. At least that is what I found when I started investigating how many lines of code are written per day per programmer. Lines of code, or loc for short, are supposedly a terrible metric for measuring programmer productivity and empirically I agree with this. There are too many variables involved starting with the definition of a line of code and going all the way up to the complexity of the requirements. There are single lines that take a long time to get right and there many lines which are mindless boilerplate code. All the same this measurement does have information encoded in it; the hard part is extracting that information and drawing the correct conclusions. Unfortunately I don\xc3\xa2€™t have access to enough data about software projects to provide a statistically sound analysis but I got a very interesting result from measuring two very different projects that I would like to share.</p><p>The first project is a traditional client server data mining tool for a vertical market mostly built in VB.NET and WinForms. This project started in 2003 and has been through several releases and an upgrade from .NET 1.1 to .NET 2.0. It has server components but most of the half a million lines of code lives in the client side. The team has always had around four developers although not always the same people. The average lines of code for this project came in at around ninety lines of code per day per developer. I wasn\xc3\xa2€™t able to measure the SQL in the stored procedures so this number is slightly inflated.</p><p><em>The second project is much smaller adding up to ten thousand lines of C# plus seven thousand lines of XAML c</em>reated by a team of four that also worked on the first project. This project lasted three months and it is a WPF point of sale application thus very different in scope from the first project. <strong>It was built around a number of web services in SOA fashion and does not have a database per se. Its average came up around seventy lines of code per developer per day.</strong></p><p>I am very surprised with the closeness of these numbers, especially given the difference in size and scope of the products. The commonality between them are the .NET framework and the team and one of them may be the key. Of these two, I am leaning to the .NET framework being the unifier because although the developers worked on both projects, three of elements on the team of the second project have spent less than a year on the first project and did not belong to the core team that wrote the vast majority of that first product. Or maybe there is something more general at work here?</p><p>The first step in using the WP_Filesystem is requesting credentials from the user. 
The normal way this is accomplished is at the time when you're saving the results of a form input, or you have otherwise determined that you need to write to a file.</p><p>The credentials form can be displayed onto an admin page by using the following code:</p><pre>$url = wp_nonce_url('themes.php?page=example','example-theme-options');\n</pre>", "if (false === ($creds = request_filesystem_credentials($url, '', false, false, null) ) ) {\n", '\treturn; // stop processing here\n', '}\n', '<p>The request_filesystem_credentials() call takes five arguments.</p><ul><li>The URL to which the form should be submitted (a nonced URL to a theme page was used in the example above)</li><li>A method override (normally you should leave this as the empty string: "")</li><li>An error flag (normally false unless an error is detected, see below)</li><li>A context directory (false, or a specific directory path that you want to test for access)</li><li>Form fields (an array of form field names from your previous form that you wish to "pass-through" the resulting credentials form, or null if there are none)</li></ul><p>The request_filesystem_credentials call will test to see if it is capable of writing to the local filesystem directly without credentials first. If this is the case, then it will return true and not do anything. Your code can then proceed to use the WP_Filesystem class.</p><p>The request_filesystem_credentials call also takes into account hardcoded information, such as hostname or username or password, which has been inserted into the wp-config.php file using defines. If these are pre-defined in that file, then this call will return that information instead of displaying a form, bypassing the form for the user.</p><p>If it does need credentials from the user, then it will output the FTP information form and return false. In this case, you should stop processing further, in order to allow the user to input credentials. Any form fields names you specified will be included in the resulting form as hidden inputs, and will be returned when the user resubmits the form, this time with FTP credentials.</p><p>Note: Do not use the reserved names of hostname, username, password, public_key, or private_key for your own inputs. These are used by the credentials form itself. Alternatively, if you do use them, the request_filesystem_credentials function will assume that they are the incoming FTP credentials.</p><p>When the credentials form is submitted, it will look in the incoming POST data for these fields, and if found, it will return them in an array suitable for passing to WP_Filesystem, which is the next step.</p><p><a id="Initializing_WP_Filesystem_Base" name="Initializing_WP_Filesystem_Base"></a>']
I use ReportLab to convert it to pdf but it fails.
This is my ReportLab code:
for page in self.pagelist:
    self.image_parser(page)
    print page.content
    for i in range(0, len(page.content)):
        bogustext = page.content[i]
        while (len(re.findall(r'img:?', bogustext)) > 0):
            for m in re.finditer(r'img:?', bogustext):
                image_tag = bogustext[m.start():m.end()+1]
                print (image_tag.split(':')[1])
                im = Image(page.images[int(image_tag.split(':')[1])], width=2*inch, height=2*inch)
                Story.append(Paragraph(bogustext[0:m.start()], style))
                bogustext = bogustext.replace(bogustext[0:m.start()], '')
                Story.append(im)
                bogustext = bogustext.replace(image_tag, '')
                break
        p = Paragraph(bogustext, style)
        Story.append(p)
        Story.append(Spacer(1, 0.2*inch))
page is an instance of a class whose page.content attribute contains the Data I mentioned above.
self.image_parser(page) is a function that removes all the image URLs from page.content (the Data).
Error:
xml parser error (invalid attribute name id) in paragraph beginning
'<p>The request_filesystem_cred'
I don't get this error if I produce a PDF for every element of the list but I do get one if I try to make a complete PDF out of it. Where am I going wrong?
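For reference, here is a minimal sketch (not part of my program) of stripping the id attribute the parser complains about, on the assumption that the <a id="..." name="..."></a> anchor in the Data is what triggers it; clean_for_paragraph is just a hypothetical helper:

def clean_for_paragraph(text):
    # drop id="..." attributes, which ReportLab's Paragraph markup rejects
    return re.sub(r'\s+id="[^"]*"', '', text)

p = Paragraph(clean_for_paragraph(bogustext), style)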

Related

Properly refresh an SQLAlchemy session to view externally updated data

After trying everything suggested here, I still can't get SQLAlchemy to display the correct results!
I've used various combinations of Nick's answer, session.commit(), flush() and expire_all(), restarted MySQL, even restarted the entire freaking server, and I still get old results from SQLAlchemy...why????
The most infuriating thing about this whole issue is that I can see from any other application, or even from a direct connection.execute() call, that the updated data is there. I just can't get it to display on the webpage!
BTW this is in a Pyramid app, not Flask, but since Pyramid is 99% Flask it shouldn't make a difference, right?
MTIA for any help on this, it's driving me nuts!!
PS: I tried to add this as an answer to the linked question, but it was deleted for not being a valid answer. So for future reference, if I just want to add something to an existing question without having to post an entirely new one, how would I go about that?
EDIT: My apologies zvone, here is my code:
DBSession = scoped_session(sessionmaker(extension=ZopeTransactionExtension()))
session = DBSession()
query = session.query(Item).join(Item.tagged)
filters = []
for term in searchTerms:
    subterms = term.split(' ')
    for subterm in subterms:
        filters.append(Item.itemTitle.like('%' + subterm + '%'))
        filters.append(Tag.tagName.like('%' + subterm + '%'))
query = query.filter(or_(*filters))
matchedItems = query.all()
And to make some more sense out of it, here's the context:
I'm building a basic CMS where users can upload and download items of any type (text files, images, etc.).
The whole idea of this page is to allow the user to search for items that have been tagged with certain expressions. Tags are entered in the search field as a comma-delimited string of search phrases, e.g. "movies, books, photos, search term with spaces". This string is split up into its counterparts to create searchTerms, a Python list of all the terms entered into the field.
You can see in the code where I'm iterating through searchTerms, splitting phrases into separate words and adding query filters for each word.
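For clarity, the split described above is roughly this (just a sketch - the code that actually builds searchTerms isn't shown here):

raw = "movies, books, photos, search term with spaces"
searchTerms = [term.strip() for term in raw.split(',') if term.strip()]
# -> ['movies', 'books', 'photos', 'search term with spaces']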
The problem arises when searching for "big, theory". I know for certain that 3 users on the production site have posted Big Bang Theory episodes, but after migrating these DB records to my dev server, I only get one search result (the old amount).
Many thanks again for the help! :D

Creating an archive - Save results or request them every time?

I'm working on a project that allows users to enter SQL queries with parameters. Each query will be executed on a schedule the user decides (say every 2 hours for 6 months), and the results will be sent to their email address.
They'll get it in the form of an HTML email message, so what the system basically does is run the queries and generate HTML that is then sent to the user.
I also want to save those results, so that a user can go on our website and look at previous results.
My question is - what data do I save?
Do I save the SQL query with those parameters (i.e. the date parameters, so the user can see the results relevant to that specific date)? This means that when the user clicks on this specific result, I need to execute the query again.
Or do I save the HTML that was generated back then, and simply display it when the user wishes to see this result?
I'd appreciate it if somebody would explain the pros and cons of each solution, and which one is considered the best & the most efficient.
The archive will probably be 1-2 months old, and I can't really predict the amount of rows each query will return.
Thanks!
Specifically regarding retrieving the results of queries that have been run previously, I would suggest saving the results so they can be viewed later rather than running the queries again and again. The main benefits of this approach are:
You save unnecessary computational work re-running the same queries;
You guarantee that the result set will be the same as the original report. For example if you save just the SQL then the records queried may have changed since the query was last run or records may have been added / deleted.
The disadvantage of this approach is that it will probably use more disk space, but this is unlikely to be an issue unless you have queries returning millions of rows (in which case html is probably not such a good idea anyway).
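As a sketch of the "save the generated HTML" approach (using sqlite3 purely for illustration; the table and column names are made up):

import sqlite3
import datetime

conn = sqlite3.connect('archive.db')
conn.execute("""
    CREATE TABLE IF NOT EXISTS report_archive (
        id          INTEGER PRIMARY KEY,
        query_id    INTEGER NOT NULL,   -- which saved query this run belongs to
        run_at      TEXT    NOT NULL,   -- when the query was executed
        html_result TEXT    NOT NULL    -- the exact HTML that was emailed
    )""")

def archive_result(query_id, html):
    # store the rendered HTML at send time
    conn.execute("INSERT INTO report_archive (query_id, run_at, html_result) VALUES (?, ?, ?)",
                 (query_id, datetime.datetime.utcnow().isoformat(), html))
    conn.commit()

def view_result(archive_id):
    # re-displaying is just a lookup - the query is never re-run
    row = conn.execute("SELECT html_result FROM report_archive WHERE id = ?",
                       (archive_id,)).fetchone()
    return row[0] if row else None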
If I were building this type of application, then:
I would offer some common preset queries - by current date, current time, date ranges, time ranges, and others relevant to the application - so the user can select them easily.
I would also add some autocompletion for common keywords.
If the data changes frequently there is little point in saving the HTML; generating it fresh is the better option.
The crucial difference is that if the data changes, a new run of the query will return a different result than what was saved some time ago, so you have to decide whether the user should get the up-to-date data or a snapshot of what the data used to be.
If the relevant data does not change, it becomes a matter of how expensive the queries are, how many users will run them and how often; you may then decide to save the results instead of re-running the queries, to improve performance.

google app engine textarea (from form) to datastore

So I have a simple form that takes a few inputs (two text inputs and two textareas) and runs it through a function that puts all four inputs into the datastore (Google App Engine). The problem is when I have a decent amount of text in one of the textareas (meaning 5 paragraphs of ~4-5 sentences each, roughly 2,000 characters).
I am using TextProperty()s in the datastore (and StringProperty for the smaller inputs). It works when I only put in a few words for each, but not when I put in a decent amount of text. What happens: a blank webpage comes up instead of my basic confirmation page, and no data is transferred into the datastore.
My handler uses get() (as opposed to POST)
Why is this happening and how do I fix it?
I'm sure this is a simple fix, but I am somewhat green to this. Thanks
While in theory there is no limit, in practice browsers and servers apply limits to the query string, and since you are using GET instead of POST, all your inputs are passed as query parameters in the URL.
When you are getting values from input forms, you should use method="POST" in the <form> and handle it correctly in your handler using post(). If you go through the Getting Started guide you will find the section on Handling Forms.
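A rough sketch of that flow with the classic google.appengine.ext.webapp framework (the handler, model and field names are just illustrative):

from google.appengine.ext import db, webapp

class Entry(db.Model):
    title = db.StringProperty()
    body = db.TextProperty()   # TextProperty is not limited to 500 characters like StringProperty

class SubmitHandler(webapp.RequestHandler):
    def get(self):
        # the form must use method="post" so the textareas travel in the request body,
        # not in the URL query string
        self.response.out.write(
            '<form method="post" action="/submit">'
            '<input name="title">'
            '<textarea name="body"></textarea>'
            '<input type="submit">'
            '</form>')

    def post(self):
        entry = Entry(title=self.request.get('title'),
                      body=self.request.get('body'))
        entry.put()
        self.response.out.write('Saved.')

application = webapp.WSGIApplication([('/submit', SubmitHandler)], debug=True)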

GAE datastore - best practice when there are more writes than reads

I'm trying to do some practicing with the GAE datastore to get a feeling about the queries and billings mechanisms.
I've read the O'Reilly book about GAE, and watched the Google videos about the datastore. My problem is that the best-practice methods usually assume more reads than writes to the datastore.
I built a super simple app:
- there are two webpages - one to choose links, and one to view chosen links
- every user can choose to add url links to his "links feed"
- the user can choose as many links as he wants, whenever he wants
- on a different webpage, I want to show the user the most recent 10 links he chose
- every user has his own "links feed" webpage
- on every "link" I want to save and show some metadata - for example: the url link itself; when it was chosen; how many times it appeared on the feed already; etc.
In this case, since the user can choose as many links as he wants, whenever he wants, my app writes to the datastore much more often than it reads (a write happens when the user chooses another link; a read happens when the user opens the webpage to see his "links feed").
Question 1:
I can think of (at least) two options for how to handle the data for this app:
Option A:
- maintain entity per user with the user details, registration, etc
- maintain another entity per user that holds his recent 10 chosen links, which will be rendered to the user's webpage after he asks for it
Option B:
- maintain entity per url link - which means all the urls of all users will be stored as the same object
- maintain entity per user details (same as in Option A), but add a reference to the user's urls in the big table of the urls
What will be the better method?
Question 2:
If I want to count the total numbers of urls chosen till today, or the daily amount of urls the user chose, or any other counting - should I use it with my SDK tools, or should I insert counters in the entities I described above? (I want to reduce the amount of datastore writes as much as I can)
EDIT (to answer #Elad's comment):
Assume I want to save only the 10 most recent urls per user; the rest I want to get rid of (so as not to overpopulate my DB with unnecessary data).
EDIT 2: after adding the code
So I gave it a try with the following code (trying Elad's method first):
Here's my class:
class UserChannel(db.Model):
    currentUser = db.UserProperty()
    userCount = db.IntegerProperty(default=0)
    currentList = db.StringListProperty()  # holds the last 20-30 urls
Then I serialized the url & metadata into JSON strings, which the user POSTs from the first page.
Here's how the POST is handled:
def post(self):
    user = users.get_current_user()
    if user:
        # logging messages for debugging
        self.response.headers['Content-Type'] = 'text/html'
        #self.response.out.write('<p>the user_id is: %s</p>' % user.user_id())

        # updating the new item that the user adds
        current_user = UserChannel.get_by_key_name(user.nickname())
        dataJson = self.request.get('dataJson')
        #self.response.out.write('<p>the dataJson is: %s</p>' % dataJson)
        current_user.currentPlaylist.append(dataJson)
        sizePlaylist = len(current_user.currentPlaylist)
        self.response.out.write('<p>size of currentplaylist is: %s</p>' % sizePlaylist)
        # whenever the list gets to 30 I cut it to be 20 long
        if sizePlaylist > 30:
            for i in range(0, 9):
                current_user.currentPlaylist.pop(i)
        current_user.userCount += 1
        current_user.put()
        Updater().send_update(dataJson)
    else:
        self.response.headers['Content-Type'] = 'text/html'
        self.response.out.write('user_not_logged_in')
where Updater is my method for updating the feed webpage via the Channel API.
Now, it all works, I can see each user has a ListProperty with 20-30 links (when it hits 30, I cut it down to 20 with the pop()), but! the prices are quite high...
each POST like the one here takes ~200ms, 121 cpu_ms, cpm_usd= 0.003588. This is very expensive considering all I do is save a string to the list...
I think the problem might be that the entity gets big with the big ListProperty?
First, you're right to worry about lots of writes to the GAE datastore - my own experience is that they're very expensive compared to reads. For instance, an app of mine that did nothing but insert records into a single model table exhausted the free quota with a few tens of thousands of writes per day. So handling writes efficiently translates directly into your bottom line.
First Question
I wouldn't store links as separate entities. The datastore is not an RDBMS, so standard normalization practices do not necessarily apply. For each User entity, use a ListProperty to store the most recent URLs along with their metadata (you can serialize everything into a string).
This is efficient for writing since you only update a single record - there are no updates to all the link records whenever the user adds links. Keep in mind that to keep a rolling list (FIFO) with the referenced URLs stored as separate entities, every new URL means two write actions - an insert of the new URL, and a delete to remove the oldest one.
It's also efficient for reading since a single read on the user record gives you all the data you need to render the User's feed.
From a storage perspective, the total number of URLs in the world far exceeds your number of users (even if you become the next Facebook), and so does the variance of URLs chosen by your users, so it's likely that the typical URL will have a single user - there is no real gain from RDBMS-style normalization of the data.
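To make that concrete, here is a rough sketch of the single-entity-per-user idea (the model, property and helper names are mine, purely for illustration):

import json
from google.appengine.ext import db

class UserFeed(db.Model):
    user = db.UserProperty()
    recent_links = db.StringListProperty()   # each item is a JSON blob of url + metadata

def add_link(feed, url, title):
    feed.recent_links.append(json.dumps({'url': url, 'title': title}))
    feed.recent_links = feed.recent_links[-10:]   # keep only the most recent 10 - still one write
    feed.put()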
Another optimization idea: if your users usually add several links in a short period you can try to write them in bulk rather than separately. Use memcache to store newly added user URLs, and the Task Queue to periodically write that transient data to the persistent datastore. I'm not sure what's the resource cost of using Tasks though - you'll have to check.
Here's a good article to read on the subject.
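A very rough sketch of that buffering idea (the memcache key and the flush task URL are made-up placeholders; keep in mind memcache is volatile, so a buffered link can be lost before the flush happens):

from google.appengine.api import memcache, taskqueue

def buffer_link(user_id, link_json):
    key = 'pending_links:' + user_id
    pending = memcache.get(key) or []
    pending.append(link_json)
    memcache.set(key, pending)
    # a real app would throttle this, e.g. only enqueue when the buffer was empty
    taskqueue.add(url='/tasks/flush_links', params={'user_id': user_id})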
Second Question
Use counters. Just keep in mind that they aren't trivial in a distributed environment, so read up - there are many GAE articles, recipes and blog posts on the subject; just google "appengine counters". Here too, using memcache should be a good option for reducing the total number of datastore writes.
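For illustration, a bare-bones sharded counter along the lines of those recipes (names are illustrative):

import random
from google.appengine.ext import db

NUM_SHARDS = 20

class CounterShard(db.Model):
    name = db.StringProperty(required=True)
    count = db.IntegerProperty(default=0)

def increment(name):
    def txn():
        index = random.randint(0, NUM_SHARDS - 1)
        shard = CounterShard.get_by_key_name(name + str(index))
        if shard is None:
            shard = CounterShard(key_name=name + str(index), name=name)
        shard.count += 1
        shard.put()
    db.run_in_transaction(txn)

def get_count(name):
    return sum(shard.count for shard in CounterShard.all().filter('name =', name))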
Answer 1
Store Links as separate entities. Also store an entity per user with a ListProperty holding the keys of the most recent 20 links. As the user chooses more links you just update the ListProperty of keys. ListProperty maintains order, so you don't need to worry about the chronological order of the chosen links as long as you follow a FIFO insertion order.
When you want to show the user's chosen links (page 2) you can do one get(keys) to fetch all the user's links in one call.
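As a sketch, that batch fetch could look something like this (model and property names are made up):

from google.appengine.ext import db

class Link(db.Model):
    url = db.LinkProperty()
    chosen_at = db.DateTimeProperty(auto_now_add=True)

class UserProfile(db.Model):
    recent_link_keys = db.ListProperty(db.Key)   # keys of the ~20 most recent Link entities

def recent_links(profile):
    # one round trip fetches all of the referenced Link entities
    return db.get(profile.recent_link_keys)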
Answer 2
Definitely keep counters. As the number of entities grows, the cost of counting records will keep increasing, but with counters the performance will remain the same.

User-defined derived data in Django

How do I let my users apply their own custom formula to a table of data to derive new fields?
I am working on a Django application which is going to store and process a lot of data for subscribed users on the open web. Think 100-10,000 sensor readings in one page request. I am going to be drawing graphs using this data and also showing tables containing it. I expect groups of sensors to be defined by my users, who will have registered themselves on my website (i.e. they correspond to a Django model).
I would like to allow the user to be able to create fields that are derived from their sensor data (as part of a setup process). For example, the user might know that their average house temperature is (temperature sensor1 + temperature sensor2) / 2 and want to show that on the graph. They might also want something more interesting like solar hot water heated is (temp out - temp in) * flow * conversion constant. I will then save these defined formulas for them and everyone else who views this page of sensor data.
The main question is how I define the formula at the centre of the system. Do I just have a user-defined string holding the formula (say 100 chars long) and parse it myself - replace the user-defined variables with an input sample and call it toast?
Update
In the end I got just the answer I asked for: a safe way to evaluate a stored user function on the server. Evaluating the same function on the client as well, while the function is being defined, will be a great way to make the UI intuitive.
Depends on who your clients are.
If this is "open to the public" on the WWW, you have to parse expressions yourself. You can use the Python compiler to compile Python syntax. You can also invent your own compiler for a subset of Python syntax. There are lots of examples; start with the ply project.
If this is in-house ("behind the firewall"), let them post a piece of Python code and exec that code.
Give them an environment with "from math import *" functionality available.
Fold the following around their supplied line of code:
def userFunc( col1, col2, col3, ... ):
    result1 = {{ their code goes here }}
    return result1
Then you can exec the function definition and use the defined function without bad things happening.
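Something like this rough sketch (the column names and sample values are invented):

user_line = "(temp_out - temp_in) * flow * 4.2"   # the trusted in-house user's one line of code

source = (
    "from math import *\n"
    "def userFunc(temp_out, temp_in, flow):\n"
    "    result1 = %s\n"
    "    return result1\n"
) % user_line

namespace = {}
exec(source, namespace)                       # defines userFunc inside namespace
print(namespace['userFunc'](55.0, 20.0, 0.3))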
While some folks like to crow that exec is a "security problem", it's no more of a security problem than users sharing passwords, or admins doing intentionally stupid things like deleting important files or turning the power off randomly while your program is running.
exec is only a security problem if you allow anyone access to it. For in-house applications, you know the users. Train them.
I would work out what operations you want to support [+,-,*,/,(,),etc] and develop client-side (JavaScript) code to edit and apply those operations to new fields of the data. I don't see the need to do any of this server-side, and you will end up with a more responsive and enjoyable user experience as a result.
If you allow the user to save their formulas and re-load them when they revisit the site, you can get their browser to do all the calculations. Just provide some generic methods to add columns of data which are generated by applying one of their formulas to your data.
I imagine the next step would be to allow them to apply those operations to the newly generated columns.
Have you considered posting their data into a google spreadsheet? This would save a lot of the development work as they already allow you to define formulas etc. and apply it to the data. I'm not too sure of the data limit (how much data you can post and operate on) mind you.
Another user asked a similar question in C.
In that post, Warren suggested that the formula could be parsed and converted from
(a + c) / b
Into reverse polish notation
a c + b /
Which is easier to process.
In this case, you could intercept the formula model's save and generate the postfix notation from the user-defined formula. Once you have postfix notation, it is fairly straightforward to write a loop that evaluates the formula from left to right.
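For example, a small left-to-right postfix evaluator might look like this (token and variable names are purely illustrative):

import operator

OPS = {'+': operator.add, '-': operator.sub,
       '*': operator.mul, '/': operator.truediv}

def eval_postfix(tokens, values):
    # tokens: e.g. ['a', 'c', '+', 'b', '/'];  values: {'a': 4.0, 'b': 2.0, 'c': 6.0}
    stack = []
    for tok in tokens:
        if tok in OPS:
            right = stack.pop()
            left = stack.pop()
            stack.append(OPS[tok](left, right))
        else:
            stack.append(values[tok])
    return stack[0]

print(eval_postfix(['a', 'c', '+', 'b', '/'], {'a': 4.0, 'b': 2.0, 'c': 6.0}))   # 5.0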
As for implementation in Django, the core question remaining is how to map different input fields into the formula. A simple solution would be a model representing the derived field, with a many-to-many relationship to its inputs and the symbol name ("a", "b" or "c") defined per input.
If performance is really critical, you might somehow further pre-process the postfix formula before applying it to the data.
