I have a big collection of .tex files (TeX/LaTeX), and I'm writing a Python script that analyzes these files. I only want to analyze LaTeX files, so I want to remove all pure TeX files.
I have thought about making sure \begin{document} is contained in every file, but this rejects quite a large number of my files, since several files are only chapters in a book, long lists, or sections in a dissertation that do not have the \begin{document} command.
Does anybody have an idea of how to filter all the pure TeX files out of my collection?
I think there's unlikely to be a completely foolproof way of doing this, given that you want to be sensitive to files which can be input with \input or \include. Given a particular file, though, you can probably classify it with considerable confidence by spotting the first of the following which you find.
TeX files usually end with \bye, and that's typically not defined in a LaTeX file.
The macro \begin is unlikely to be defined in a ‘normal’ TeX file (though \end is defined in the plain format).
That's probably about the best you can do, though it would surely be enough for the sort of statistical analysis you appear to be doing.
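For concreteness, a minimal Python sketch of that kind of marker-based check might look like the following (the function name and the extra \documentclass test are my own additions, not part of the answer above):

import re

# Classification implied by the first marker found while scanning the file.
MARKERS = [
    (re.compile(r"\\documentclass|\\begin\{"), "latex"),  # LaTeX preamble / \begin
    (re.compile(r"\\bye\b"), "tex"),                      # plain-TeX end-of-file
]

def classify_tex(path):
    """Return 'latex', 'tex', or 'unknown' for a single .tex file."""
    with open(path, errors="ignore") as f:
        for line in f:
            for pattern, kind in MARKERS:
                if pattern.search(line):
                    return kind
    return "unknown"

It is only a heuristic, of course; the caveats below still apply.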
There's nothing to stop someone writing a TeX file from defining \begin to mean something, nor someone writing a LaTeX file to define \bye to mean something. The problem, from your point of view, is that there aren't any TeX constructs that are truly forbidden in a LaTeX file (and vice versa), even though things like \halign would be rare in LaTeX. Indeed, since LaTeX is ‘just’ a TeX format, there isn't any fundamental difference between the two, at all.
Just to drive the latter point home, there's such a thing as ConTeXt, which is a TeX format which isn't plain, but which isn't LaTeX either. It's rather rare, though.
Yeah sure, add all your file names to a list by listing the directory.
import os

x = os.listdir("path")
This will add the directory contents to the variable x.
Then loop through it:
PureTex = []
for name in x:
    if name.endswith('.tex'):  # keep only the .tex files
        PureTex.append(name)
Now the PureTex list will contain all the .tex file names.
I have been thinking for some time that variable fonts were simply combinations of multiple fonts, with values interpolated between them. However, I just read about this project, Prototypo (which is unfortunately discontinued), and discovered how they were storing their fonts as variables. See this screenshot from a promotional video from a few years ago:
And it seemed just so logical! Why not use a real language-like format, with variables and all? In the picture above, it (kinda) looks like Python code.
And then I thought "It must have been thought through, let's look at how OpenType font variations are implemented."
And I looked on the web for the schema and the specification, but could not find it.
So the actual question(s):
How are variable fonts stored in OTF files? Is it simply, as I thought before, multiple fonts with the in-between values interpolated? Is there a variable language like the one above used to write the variable parts of the font (obviously)?
Where can I find the TTF specification for variable fonts? Is there any?
Is there a way to write a variable font in a regular text file? (Of course that would involve vector graphics of some sort, like const d = 'M23.6,0c-3.4,0-6.3,2.7-7.6,5.6C14.7,2.7,11.8,0,8.4,0C3.8,0,0,3.8,0,8.4c0,9.4,9.5,11.9,16,21.2 c6.1-9.3,16-12.1,16-21.2C32,3.8,28.2,0,23.6,0z', which draws a heart.)
Thank you (that's what the heart is for :)
I have a lot of PDF, DOC[X], TIFF and other files (scans from a shared folder). Each file is converted into a pack of text files: one text file per page.
Each pack of files could contain multiple documents (for example three contracts), and the document kind is not limited to contracts.
While processing a pack of files I don't know what kinds of documents the current pack contains, and it's possible that one pack contains multiple document kinds (contracts, invoices, etc.).
I'm looking for some possible approaches to solve this programmatically.
I've tried to search for something like this, but without any success.
UPD: I tried to create a binary classifier with scikit-learn and am now looking for another solution.
At its basis, given that these are "scans", this sounds more like something that could be approached with computer vision; however, that is currently far above my own level of programming.
E.g. projects like SimpleCV may be a good starting point,
http://www.simplecv.org/
Or possibly you could get away with OCR reading the "scans" and working based on the contents. pytesseract seems popular for this type of task,
https://pypi.org/project/pytesseract/
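As a minimal sketch of that OCR route (assuming Tesseract itself is installed and the page images already exist; the filename is a placeholder):

from PIL import Image
import pytesseract

# OCR one page image into a plain string.
text = pytesseract.image_to_string(Image.open("scan_page_1.png"))
print(text)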
However, that still leaves the question of how you would tell your program that a given part of the image means this is 3 separate contracts. Is there anything about these files in particular that makes this clear, e.g. "1 of 3" on the pages, a logo, or otherwise? That will be the main part that determines how complex a problem you are trying to solve.
The best solution was to create a binary classifier (SGDClassifier) and train it on the classes first-page and not-first-page. Each item in the dataset was trimmed to 100 tokens (words).
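For illustration, a rough sketch of such a first-page / not-first-page classifier; the TF-IDF features, pipeline layout and toy data are my own assumptions, and only the SGDClassifier and the 100-token trim come from the description above:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

def trim(text, n_tokens=100):
    # Keep only the first 100 tokens (words) of each page, as described above.
    return " ".join(text.split()[:n_tokens])

pages = ["CONTRACT No. 42 between Acme and ...", "continued terms and conditions ..."]
labels = ["first-page", "not-first-page"]

model = make_pipeline(TfidfVectorizer(), SGDClassifier())
model.fit([trim(p) for p in pages], labels)
print(model.predict([trim("INVOICE No. 7 issued to ...")]))

Once every page is labelled, the pack can be cut into separate documents at each predicted first page.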
I've got a PDF file that I'm trying to obtain specific data from.
I've been able to parse the PDF via PyPDF2 into one long string but searching for specific data is difficult because of - I assume - formatting in the original PDF.
What I am looking to do is to retrieve specific known fields and the data that immediately follows (as formatted in the PDF) and then store these in separate variables.
The PDFs are bills and hence are all presented in the exact same way, with defined fields and images. So what I am looking to do is to extract these fields.
What would be the best way to achieve this?
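For what it's worth, a minimal sketch of that label-then-value search over the PyPDF2 text (assuming a recent PyPDF2 with the PdfReader API; "Account number" and the file name are placeholders for whatever fields the bills actually use):

import re
from PyPDF2 import PdfReader

reader = PdfReader("bill.pdf")
text = "\n".join(page.extract_text() or "" for page in reader.pages)

# Grab the token that immediately follows a known field label.
match = re.search(r"Account number[:\s]+(\S+)", text)
account_number = match.group(1) if match else None
print(account_number)

As the answer below explains, whether this works depends entirely on how the text is laid out inside the PDF.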
I've got a PDF file that I'm trying to obtain specific data from.
In general, it is probably impossible (or extremely difficult), and details (that you don't mention) are very important. Study the complex PDF specification in detail. Notice that PDF is (more or less accidentally) Turing complete, so your problem is undecidable in general, since it is equivalent to the halting problem.
For example, a normal human reader could read digits in the document as text, or as a JPEG image, etc., and in practice many PDF documents contain such kinds of data. Practically speaking, PDF is an output-only format, designed for screen display and printing, not for extracting data from it.
You need to understand how exactly that PDF file was generated (with what exact software, from what actual data). That could take a lot of time (maybe several years of full-time reverse-engineering work) without help.
A much better approach is to contact the person or entity providing that PDF file and negotiate some way of accessing the actual data (or at least get a detailed explanation of how that particular PDF file is generated). For example, if the PDF file is computed from some database, you'd be better off accessing that database.
Perhaps using metadata or comments in your PDF file might help in guessing how it was generated.
The source of the data might produce various kinds of PDF file. For example, my cheap scanner is able to produce PDF. But your program would have a hard time extracting numerical data from it (because that kind of PDF essentially wraps a pixelated image à la JPEG) and would need to deploy image-recognition techniques (i.e. OCR) to do so.
I am trying to gather some graphics and text from different folders and present them in a comprehensive way. For this I use Python to copy them into one folder and derive a dynamic LaTeX presentation, where I plot the copied graphics and print the text.

The problem I'm facing now is that I can derive the title for a slide dynamically from a text file, but if it's too long it will obviously wrap around. This dynamic title can be pretty long, so it might fill the whole slide. What I'd like to do is to limit the space used by this text without losing its information. The non-elegant solution I have is to count the characters and, if the count is over a certain threshold, use a smaller font. This solution is tedious and not optimal; I'd love to hear a better idea.
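For concreteness, the workaround looks roughly like this on the Python side (assuming a beamer \frametitle; the threshold and the size commands are arbitrary placeholders):

def sized_title(title, threshold=60):
    # Fall back to a smaller LaTeX size command once the title gets long.
    size = r"\small" if len(title) > threshold else r"\Large"
    return r"\frametitle{%s %s}" % (size, title)

print(sized_title("A very long dynamically generated title that would otherwise fill the slide"))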
I'm not familiar with the PDF specification at all. I was wondering if it's possible to directly manipulate a PDF file so that certain blocks of text that I've identified as important are highlighted in colors of my choice. The language of choice would be Python.
It's possible, but not necessarily easy, because the PDF format is so rich. You can find a document describing it in detail here. The first elementary example it gives about how PDFs display text is:
BT
/F13 12 Tf
288 720 Td
(ABC) Tj
ET
BT and ET are commands to begin and end a text object; Tf is a command to use external font resource F13 (which happens to be Helvetica) at size 12; Td is a command to position the cursor at the given coordinates; Tj is a command to write the glyphs for the previous string. The flavor is somewhat "reverse-polish notation"-oid, and indeed quite close to the flavor of PostScript, one of Adobe's other great contributions to typesetting.
The problem is, there is nothing in the PDF specs that says that text that "looks" like it belongs together on the page as displayed must actually "be" together; since precise coordinates can always be given, if the PDF is generated by a sophisticated typography layout system, it might position text precisely, character by character, by coordinates. Reconstructing text in form of words and sentences is therefore not necessarily easy -- it's almost as hard as optical text recognition, except that you are given the characters precisely (well -- almost... some alleged "images" might actually display as characters...;-).
pyPdf is a very simple pure-Python library that's a good starting point for playing around with PDF files. Its "text extraction" function is quite elementary and does nothing but concatenate the arguments of a few text-drawing commands; you'll see that it suffices on some docs and is quite unusable on others, but at least it's a start. As distributed, pyPdf does just about nothing with colors, but with some hacking that could be remedied.
reportlab's powerful Python library is entirely focused on generating new PDFs, not on interpreting or modifying existing ones. At the other extreme, the pure-Python library pdfminer focuses entirely on parsing PDF files; it does do some clustering to try to reconstruct text in cases in which simpler libraries would be stumped.
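As a quick illustration of the pdfminer route (using the modern pdfminer.six fork, which post-dates this answer), its high-level API already performs that layout analysis:

from pdfminer.high_level import extract_text

# Extract text with pdfminer.six's layout analysis; the file name is a placeholder.
text = extract_text("document.pdf")
print(text[:500])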
I don't know of an existing library that performs the transformational tasks you desire, but it should be feasible to mix and match some of these existing ones to get most of it done... good luck!
Highlighting is possible in a PDF file using PDF annotations, but doing it natively is not an easy job. Whether any of the mentioned libraries provide such a facility is something you may want to look into.
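For example, a minimal sketch with PyMuPDF (a library not mentioned above, assuming a recent version of its API); the phrase and colour are placeholders:

import fitz  # PyMuPDF

doc = fitz.open("input.pdf")
for page in doc:
    for rect in page.search_for("important phrase"):
        annot = page.add_highlight_annot(rect)  # standard highlight annotation
        annot.set_colors(stroke=(1, 1, 0))      # yellow; any RGB triple works
        annot.update()
doc.save("highlighted.pdf")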