Search text for valid Python code - python

I have chunks of text that may or may not contain Python code. I need a way to search the text for code and if it is there, do something. I could easily search for specific strings that match a regex, but I need something general.
One thought I had would be to run the text through ast, but that would require parsing out all possible substrings and submitting each of them to ast.
To be clear, the text comes from a Q&A forum for Python. Users frequently post code in their questions, but the code is all smushed into one long, incoherent line when it should be formatted to be displayed properly. I need to check if there is code included in the text and if there is and it isn't formatted properly, complain to the user. Checking formatting is something I can handle, I just need to check the text for any Python code.
Any help would be appreciated.

Related

How to create a dynamic form with python using translated text as input?

I have an original text that I want to translate. I normally do it manually but I know I could save a lot of time translating automatically the most frequent words and expressions.
I will find out how to translate simple words, the problem is not here. I have read some books on python and I think using string manipulations can be done.
But I am lost about how to create the output file.
The output file will contain:
short empty forms ready to be filled wherever there is text that has not been translated
the translated words wherever they were in the original file
In the output file I will fill manually the empty forms, after pressing Tab the cursor should jump to the next exmpty form
I am lost here, I know how to do forms on html but the language I am used to is Python.
I would like to know what modules from Python I could use. I need some guidance on this.
Can you recommend me a book or a tool that explains how to do something similar to this?
This is what I want to do, assuming I have managed to create a simple database to translate colors from Spanish to English.
The first step contains the original file.
The second step contains the automatic translation.
In the third step I complete the manual translation.
After finishing everything is grouped into a normal txt file ready to be used.
I think it is quite clear. I don't expect people to tell me the code to do this, I just need to know what tools could be used to achieve my goal.
Thanks for editing.
To create an interface that works with a web browser, Flask for Python is a good method for creating webforms. There are tutorials available.
One method for storing data would be an SQLite file. That may be more than you need, so I'd recommend starting with a CSV file. Libraries exist in Python for both CSVs and SQLite.

Possible to replace paragraph numbers in .doc file with "actual" text?

In Microsoft Word, paragraph numbers are marked up slightly differently to the actual text of a document. To illustrate -
I assume this is because they are auto-generated and can be positioned differently.
I would like to replace this auto-generated 1.2.1.2 text with "actual" text, so it will be marked up the same as the "Stack Overflow" to its right.
Is this possible somehow programmatically?
I am open to language suggestion if this is possible in any way (all of my research has led me to believe it is not-hence my lack of an "attempt" at this, if I am in err here then please let me know in the comments and I will very happily update my question with an attempt if you can point me towards an API or give me something to start with please :)).
Depending on where that underline is coming from, converting the numbers to text may not have the result you're looking for. You need to make sure that all the formatting is part of the Heading style and not applied directly. The ConvertNumbertToText method, performed on any specified Range, will convert automatic numbering to plain text
wdDocument.Content.ListFormat.ConvertNumbersToText
You may also need to remove direct character formatting in order for the style format to display on the converted text:
wdDocument.Content.Select
Selection.ClearCharacterDirectFormatting

Can you modify only a text string in an XML file and still maintain integrity and functionality of .docx encasement?

I want to enter data into a Microsoft Excel Spreadsheet, and for that data to interact and write itself to other documents and webforms.
With success, I am pulling data from an Excel spreadsheet using xlwings. Right now, I’m stuck working with .docx files. The goal here is to write the Excel data into specific parts of a Microsoft Word .docx file template and create a new file.
My specific question is:
Can you modify just a text string(s) in a word/document.xml file and still maintain the integrity and functionality of its .docx encasement? It seems that there are numerous things that can change in the XML code when making even the slightest change to a Word document. I've been working with python-docx and lxml, but I'm not sure if what I seek to do is possible via this route.
Any suggestions or experiences to share would be greatly appreciated. I feel I've read every article that is easily discoverable through a google search at least 5 times.
Let me know if anything needs clarification.
Some things to note:
I started getting into coding about 2 months ago. I’ve been doing it intensively for that time and I feel I’m picking up the essential concepts, but there are severe gaps in my knowledge.
Here are my tools:
Yosemite 10.10,
Microsoft Office 2011 for Mac
You probably need to be more specific, but the short answer is, in principle, yes.
At a certain level, all python-docx does is modify strings in the XML. A couple things though:
The XML you create needs to remain well-formed and valid according to the schema. So if you change the text enclosed in a <w:t> element, for example, that works fine. Conversely, if you inject a bunch of random XML at an arbitrary point in one of the .xml parts, that will corrupt the file.
The XML "files", known as parts that make up a .docx file are contained in a Zip archive known as a package. You must unpackage and repackage that set of parts properly in order to have a valid .docx file afterward. python-docx takes care of all those details for you, but if you're going directly at the .docx file you'll need to take care of that yourself.

How to parse just the text from a Word Doc using Python?

When you try opening a MS Word document or for that matter most Windows file formats, you will see gibberish as given below broken intermittently by the actual text. I need to extract the text that goes in and want to ignore the gibberish -- which is something like given below. How do I extract only the text that matters, and ignore rest of the stuff. Please advise.
Here's a sample of open("sample.doc",r").read() of a word doc. Thanks
00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00In an Interesting news,his is the first time we polled Indian channel community for their preferred memory supplier. Transcend came a close second, was seen to be more popular among class A city based resellers, was also the most recalled memory brand among customers according to resellers. However Transcend channels complained of parallel imports and constant unavailability of the products in grey x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x
The tool that seems the most viable, particularly if you need an all python solution is OleFileIO.
doc is a binary format, it's not a markup language or something.
Specs: http://www.microsoft.com/interop/docs/OfficeBinaryFormats.mspx
There is no generic why to extract
information from every file format.
You need to know the format to know
how to extract the information.
Just wanted to state that first. So what you should look for is libraries and software that can convert/extract the information you want. And as mentioned by Ofir MicroSoft have tools for that for their formats.
But if you can not do this and want to take the chance that there is text visible in the file that you think is interesting to read you could do a normal read and look for sequences of bytes that will build text. Then comes the question, what languages/charset should I support support in my hunt for text. Is it multi-byte text?
The easy start is to loop through the data and look for sequences of [a-zA-z0-9_- ] to find the text. But word is probably multi-byte. So you should scan double byte as one char.
Note: some of the new formats like open office and docx is multiple files in a compressed container. So you need to de-compress the file first, and scan XML documents after the text you looking for.
Word doc is a compressed format. You need to uncompress it first to get the real data (try open a doc file in a program like winrar, and you'll see it contains multiple files.
It even seems to be XML, so reading the format should not be that hard, although I'm not sure if you get all the data this way.
I had a similar problem, needing to query hundreds of Word documents. I converted the Word files to text files and used normal text parsing tools. Worked well.

Python SAX parser says XML file is not well-formed

I stripped some tags that I thought were unnecessary from an XML file. Now when I try to parse it, my SAX parser throws an error and says my file is not well-formed. However, I know every start tag has an end tag. The file's opening tag has a link to an XML schema. Could this be causing the trouble? If so, then how do I fix it?
Edit: I think I've found the problem. My character data contains "&lt" and "&gt" characters, presumably from html tags. After being parsed, these are converted to "<" and ">" characters, which seems to bother the SAX parser. Is there any way to prevent this from happening?
I would suggest putting those tags back in and making sure it still works. Then, if you want to take them out, do it one at a time until it breaks.
However, I question the wisdom of taking them out. If it's your XML file, you should understand it better. If it's a third-party XML file, you really shouldn't be fiddling with it (until you understand it better :-).
Does the sax parser not give you details about where it thinks it's not well-formed?
Have you tried loading the file into an XML editor and checking it there? Do other XML parsers accept it?
The schema shouldn't change whether or not the XML is well-formed or not; it may well change whether it's valid or not. See the wikipedia entry for XML well-formedness for a little bit more, or the XML specs for a lot more detail :)
EDIT: To represent "&" in text, you should escape it as &
So:
&lt
should be
&lt
(assuming you really want ampersand, l, t).
I would second recommendation to try to parse it using another XML parser. That should give an indication as to whether it's the document that's wrong, or parser.
Also, the actual error message might be useful. One fairly common problem for example is that the xml declaration (if one is used, it's optional) must be the very first thing -- not even whitespace is allowed before it.
You could load it into Firefox, if you don't have an XML editor. Firefox shows you the error.

Categories

Resources