How to read ATFX data? - python

I have a set of measurement data in .atfx format. I know it is possible to read this data with ArtemiS SUITE, but I need to do some post-processing in Python. I looked into the files, and as far as I can see, an .atfx file is a header file (with an XML structure) that points to binary files, so I'm not sure how I could write a Python script to decode it, or whether that is possible at all.
Is there a way to open ATFX files in python or is there a workaround?

Related

how to get text from a downloadable .doc file without saving?

I'm trying to download a .doc file using a requests.get() request (I've heard about other methods, but they all require saving the file too).
Is there any method I could use to extract the text from it (or even convert it into a .txt, for example) straight away, without saving it to a file?
I've tried passing request.raw into various converters (docx2txt.process(), for example), but I assume they all work with files, not with streams.
While the script is running, memory allocation is handled by the Python interpreter, but if you save the content to a file, the memory is allocated differently. This article may be helpful to you.
Link: article

Making changes to a ntriples file with python

Scenario: I just got my hands on a huge ntriples file (6.5gb uncompressed). I am trying to open it and perform some operations (such as cleaning some of the data that it contains).
Issue: I haven't been able to check the contents of this file. Notepad++ cannot handle it, and in RDFlib the farthest I got was loading the file; I cannot find a way to edit it without parsing the entire thing. I also tried using the RDF package (from how to parse big datasets using RDFLib?), but I cannot find a way to install it in Python 3.
Question: What is the best option to perform this kind of operation? Is there any command in rdflib that allows for this kind of editing?
If it's N-Triples, then the file is basically one triple per line. You can therefore read the file in small chunks (some N lines at a time), parse each chunk with rdflib, and then apply whatever cleaning operation you need to the resulting graph.
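The chunked reading described above could be sketched roughly as follows. This is only an illustration, not a definitive implementation: `iter_chunks` and the chunk size are hypothetical names and values chosen for the example. The comment at the end shows how a chunk might be handed to rdflib, assuming rdflib is installed.

```python
from itertools import islice


def iter_chunks(lines, lines_per_chunk=100_000):
    """Yield lists of at most `lines_per_chunk` lines from any iterable of lines.

    An open text file works as `lines`, so the whole file is never held in memory.
    """
    it = iter(lines)
    while True:
        chunk = list(islice(it, lines_per_chunk))
        if not chunk:
            return
        yield chunk


# Possible usage (assumes rdflib is installed; the file name is hypothetical):
#
#   from rdflib import Graph
#   with open("data.nt", encoding="utf-8") as f:
#       for chunk in iter_chunks(f):
#           g = Graph()
#           g.parse(data="".join(chunk), format="nt")
#           ...  # clean the triples in g, then append them to an output file
```

Because each N-Triples line is a complete statement, splitting on line boundaries never cuts a triple in half.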

Read only the headers of Excel files

I have a large number of Excel files that I need to download from the web and then extract only the header (column names) from and then move on. So far I have only managed to download the whole file and then read it into a Pandas DF from which I can extract the column names.
Is there a faster way to read, rather than download, or parse only the header, rather than the whole Excel file?
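import requests
import pandas as pd

resp = requests.get(test_url)  # test_url: URL of the Excel file, defined elsewhere
with open('test.xls', 'wb') as output:
    output.write(resp.content)
# newer pandas versions spell this keyword sheet_name (older ones used sheetname)
headers = pd.ExcelFile("test.xls").parse(sheet_name=2)
headers.columns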
If there is not an efficient way to "partially" download the Excel file to get only the header, is there an efficient way to read only the header after it has already been downloaded?
I would say no, because xls Excel files are binary files, so the pandas ExcelFile parser needs a complete file. If you give it a partial file, it should report an invalid file (with good reason).
If you really want to do that, you will have to thoroughly analyze (in binary form) some of the Excel files you want to process and try to identify the minimum number of bytes you need in order to find the names in the first row. Then you should download the files by handling the HTTP protocol at a low enough level to be able to close the connection, or at least stop reading, as soon as you have enough bytes. Finally, you have to write a dedicated parser and hope that nothing changes in those files, because you are no longer using maintained high-level tools but only binary reads.
TL;DR: unless you have a very strong reason to do that, just forget it, because it will be hard, error-prone, and hardly maintainable, if it is possible at all.
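For illustration only, the "stop reading as soon as you have enough bytes" step might look like the sketch below. The helper names `read_prefix` and `download_prefix` are hypothetical, and the number of bytes needed is something you would have to work out from your own files. The bytes returned are still a truncated Excel file, so the dedicated parser mentioned above is not covered here.

```python
from urllib.request import urlopen


def read_prefix(stream, max_bytes, chunk_size=8192):
    """Read at most `max_bytes` from a binary file-like object, then stop."""
    buf = bytearray()
    while len(buf) < max_bytes:
        chunk = stream.read(min(chunk_size, max_bytes - len(buf)))
        if not chunk:  # stream ended before max_bytes were available
            break
        buf.extend(chunk)
    return bytes(buf)


def download_prefix(url, max_bytes):
    """Fetch only the first `max_bytes` of a URL and close the connection early."""
    with urlopen(url) as resp:
        return read_prefix(resp, max_bytes)
```

Leaving the `with` block closes the connection, so the remainder of the response body is not read.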

Can you modify only a text string in an XML file and still maintain integrity and functionality of .docx encasement?

I want to enter data into a Microsoft Excel Spreadsheet, and for that data to interact and write itself to other documents and webforms.
With success, I am pulling data from an Excel spreadsheet using xlwings. Right now, I’m stuck working with .docx files. The goal here is to write the Excel data into specific parts of a Microsoft Word .docx file template and create a new file.
My specific question is:
Can you modify just a text string(s) in a word/document.xml file and still maintain the integrity and functionality of its .docx encasement? It seems that there are numerous things that can change in the XML code when making even the slightest change to a Word document. I've been working with python-docx and lxml, but I'm not sure if what I seek to do is possible via this route.
Any suggestions or experiences to share would be greatly appreciated. I feel I've read every article that is easily discoverable through a google search at least 5 times.
Let me know if anything needs clarification.
Some things to note:
I started getting into coding about 2 months ago. I’ve been doing it intensively for that time and I feel I’m picking up the essential concepts, but there are severe gaps in my knowledge.
Here are my tools:
Yosemite 10.10,
Microsoft Office 2011 for Mac
You probably need to be more specific, but the short answer is, in principle, yes.
At a certain level, all python-docx does is modify strings in the XML. A couple things though:
The XML you create needs to remain well-formed and valid according to the schema. So if you change the text enclosed in a <w:t> element, for example, that works fine. Conversely, if you inject a bunch of random XML at an arbitrary point in one of the .xml parts, that will corrupt the file.
The XML "files" that make up a .docx file, known as parts, are contained in a Zip archive known as a package. You must unpackage and repackage that set of parts properly in order to have a valid .docx file afterward. python-docx takes care of all those details for you, but if you're going directly at the .docx file you'll need to take care of that yourself.
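As a rough sketch of doing that unpackage, edit, and repackage cycle by hand with only the standard library: the function name `replace_text_in_docx` and the demo package are hypothetical, and the demo is a stripped-down stand-in rather than a complete Word file. Real documents declare many more XML namespaces, which ElementTree may rewrite or drop on output, so for production use python-docx (or lxml) is the safer route.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
ET.register_namespace("w", W_NS)  # keep the familiar w: prefix on output


def replace_text_in_docx(src_bytes, old, new):
    """Return new package bytes with `old` replaced by `new` inside <w:t> elements."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(src_bytes)) as zin, \
            zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as zout:
        for item in zin.infolist():
            data = zin.read(item.filename)
            if item.filename == "word/document.xml":
                root = ET.fromstring(data)
                # Only the text of <w:t> elements is touched; structure is unchanged.
                for t in root.iter(f"{{{W_NS}}}t"):
                    if t.text and old in t.text:
                        t.text = t.text.replace(old, new)
                data = ET.tostring(root, encoding="UTF-8", xml_declaration=True)
            zout.writestr(item, data)  # every other part is copied through as-is
    return out.getvalue()


def make_demo_package():
    """Build a tiny in-memory package that has just enough parts for the demo."""
    xml = (f'<w:document xmlns:w="{W_NS}"><w:body><w:p><w:r>'
           '<w:t>Hello NAME</w:t></w:r></w:p></w:body></w:document>')
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as z:
        z.writestr("[Content_Types].xml", "<Types/>")
        z.writestr("word/document.xml", xml)
    return buf.getvalue()
```

Calling `replace_text_in_docx(make_demo_package(), "NAME", "Alice")` returns a new archive in which only the text inside `<w:t>` has changed.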

python adding gibberish when reading from a .rtf file?

I have a .rtf file that contains nothing but an integer, say 15. I wish to read this integer in through python and manipulate that integer in some way. However, it seems that python is reading in much of the metadata associated with .rtf files. Why is that? How can I avoid it? For example, trying to read in this file, I get..
{\rtf1\ansi\ansicpg1252\cocoartf949\cocoasubrtf460
{\fonttbl\f0\fswiss\fcharset0
Helvetica;}
{\colortbl;\red255\green255\blue255;}
\margl720\margr720\margb720\margt720\vieww9000\viewh8400\viewkind0
\pard\tx566\tx1133\tx1700\tx2267\tx2834\tx3401\tx3968\tx4535\tx5102\tx5669\tx6236\tx6803\ql\qnatural\pardirnatural
That's the nature of .RTF (i.e. Rich Text Format) files: they include extra data that defines how the text is laid out and formatted.
Storing data in such files is not recommended, precisely because of the difficulties you noted. If you go to the effort of parsing this file to "recover" your one numeric value, you expose your application to the risk that an updated version of the RTF format renders the parsing logic partially incorrect and yields wrong numeric data.
Why not store this information in a true text file? This could be a flat text file or, preferably, an XML, YAML, or JSON file for added "forward" compatibility, since you may later want to add extra parameters to the file.
If this file is a given, however, there probably exist Python libraries to read and write to it. Check the Python Package Index (PyPI) for the RTF keyword.
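To illustrate the plain-text alternative suggested above, the integer could be stored in a small JSON file along these lines. This is a minimal sketch: the helper names and the "value" key are hypothetical choices for the example.

```python
import json


def save_value(path, value):
    """Write the value to a JSON file under a named key."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"value": value}, f)


def load_value(path):
    """Read the value back; JSON keeps it as an int, with no formatting metadata."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)["value"]
```

Using a keyed object rather than a bare number leaves room to add further parameters later without changing the reader.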
That's exactly what the RTF file contains, so Python (in the absence of further instruction) is giving you what the file contains.
You may be looking for a library to read the contents of RTF files, such as pyrtf-ng.
