Scenario: I just got my hands on a huge N-Triples file (6.5 GB uncompressed). I am trying to open it and perform some operations (such as cleaning some of the data it contains).
Issue: I haven't been able to check the contents of this file. Notepad++ cannot handle it, and in RDFLib the furthest I got was to load the file, but I cannot seem to find a way to edit it without parsing the entire thing. I also tried using the RDF package (from "How to parse big datasets using RDFLib?"), but I cannot find a way to install it in Python 3.
Question: What is the best option to perform this kind of operation? Is there any command in rdflib that allows for this kind of editing?
If it's N-Triples, then it's essentially one triple per line. You can therefore read the file in small chunks (some N lines at a time), parse each chunk with rdflib, and then run whatever cleaning operation you need on the resulting graph.
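A minimal sketch of that approach, assuming rdflib 6+ (where serialize() returns a string rather than bytes); the file names and chunk size are just placeholders. Since N-Triples escapes newlines inside literals, splitting on physical lines is safe:

from itertools import islice
from rdflib import Graph

CHUNK_LINES = 100_000  # lines per chunk; tune to your memory budget

def chunks(path, n=CHUNK_LINES):
    """Yield the file as strings of at most n lines each."""
    with open(path, encoding='utf-8') as f:
        while True:
            lines = list(islice(f, n))
            if not lines:
                break
            yield ''.join(lines)

with open('cleaned.nt', 'w', encoding='utf-8') as out:   # hypothetical output file
    for chunk in chunks('huge.nt'):                       # hypothetical input file
        g = Graph()
        g.parse(data=chunk, format='nt')
        # ... clean the triples in g here ...
        out.write(g.serialize(format='nt'))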
Related
I have a set of measurement data in .atfx format. I know it is possible to read this data with ArtemiS SUITE, but I would need to do some post-processing in Python. I tried to look into the files, but as far as I can see, .atfx is a header file (with an XML structure) that points to binary files, so I'm not sure how I could write a Python script to decode that, or whether it is possible at all.
Is there a way to open ATFX files in python or is there a workaround?
I'm trying to download a .doc file using a requests.get() request (I've heard about other methods, but they all require saving too).
Is there any method I could use to extract the text from it (or even convert it into a .txt for example) straight away without saving it into a file?
I've tried passing request.raw into various converters (docx2txt.process() for example), but I assume they all work with files, not with streams.
While the script is running, memory allocation is handled by the Python interpreter, but if you save the content to a file, the memory is allocated differently. This article may be helpful to you.
Link: article
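If the file is actually a .docx (a zip container rather than the old binary .doc format), one thing worth trying is to wrap the downloaded bytes in an in-memory file object: docx2txt ultimately opens the document with zipfile, which accepts file-like objects. A rough sketch, with a hypothetical URL:

import io
import requests
import docx2txt

response = requests.get('https://example.com/report.docx')  # hypothetical URL
response.raise_for_status()

# Wrap the raw bytes in a file-like object so nothing is written to disk.
text = docx2txt.process(io.BytesIO(response.content))
print(text)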
I am new to Apache Beam. I have a requirement to read a text file with the format as given below
a=1
b=3
c=2

a=2
b=6
c=5
Here, all rows up to an empty line are part of one record and need to be processed together (e.g. inserted into a table as columns). The above example corresponds to a file with just two records.
I am using ReadFromText to read the file and process it. It reads each line as an element. I then try to loop and process until I hit an empty line.
ReadFromText returns a PCollection, and I have read that a PCollection is an abstraction of a potentially distributed dataset. My question is: while reading, will I get the rows in the same order as in the file, or will I just get a collection of rows where the order is not preserved? What solution can I use to solve this problem?
I am using Python. I have to read the file from a GCS bucket and use Google Dataflow for execution.
No, your records are not guaranteed to be in the same order. PCollections are inherently unordered, and elements in a PCollection are expected to be parallelizable, that is, distinct and not reliant on other elements in the PCollection.
In your example you're using TextIO which treats each line of a text file as a separate element, but what you need is to gather each set of data for a record as one element. There are many potential ways around this.
If you can modify the text file, you could put all your data on a single line per record, and then parse that line in a transform you write. This is the usual approach taken, for example with CSV files.
If you can't modify the files, a simple way to add your own logic for reading them is to retrieve the files with FileIO and then write a custom ParDo containing that logic (see the sketch after the next paragraph). This is not as simple as using an existing IO out of the box, but it is still easier than creating a fully featured Source.
If the files are more complex and you need a more robust solution, you can implement your own Source that reads the file and outputs records in your required format. This would most likely involve using Splittable DoFns and would require a fair amount of knowledge of how a FileBasedSource works.
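A rough sketch of the FileIO-plus-ParDo option, assuming records are separated by a single blank line; the bucket path and the key=value parsing are placeholders to adapt to your real format:

import apache_beam as beam
from apache_beam.io import fileio

class SplitIntoRecords(beam.DoFn):
    """Read a whole file and emit one dict per blank-line-separated record."""
    def process(self, readable_file):
        text = readable_file.read_utf8()
        for block in text.split('\n\n'):
            lines = [line for line in block.splitlines() if line.strip()]
            if lines:
                # 'a=1' -> ('a', '1'); adjust the parsing to your actual needs.
                yield dict(line.split('=', 1) for line in lines)

with beam.Pipeline() as p:
    records = (
        p
        | fileio.MatchFiles('gs://my-bucket/input/*.txt')  # hypothetical path
        | fileio.ReadMatches()
        | beam.ParDo(SplitIntoRecords())
    )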
I need to process a .warc file through Spark but I can't seem to find a straightforward way of doing so. I would prefer to use Python and to not read the whole file into an RDD through wholeTextFiles() (because the whole file would be processed at a single node(?)) therefore it seems like the only/best way is through a custom Hadoop InputFormat used with .hadoopFile() in Python.
However, I could not find an easy way of doing this. To split a .warc file into entries is as simple as splitting on \n\n\n; so how can I achieve this, without writing a ton of extra (useless) code as shown in various "tutorials" online? Can it be done all in Python?
i.e., How to split a warc file into entries without reading the whole thing with wholeTextFiles?
If the delimiter is \n\n\n, you can use textinputformat.record.delimiter:
sc.newAPIHadoopFile(
    path,
    'org.apache.hadoop.mapreduce.lib.input.TextInputFormat',
    'org.apache.hadoop.io.LongWritable',
    'org.apache.hadoop.io.Text',
    conf={'textinputformat.record.delimiter': '\n\n\n'}
)
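Each element produced this way is a (byte offset, record text) pair, so you would typically keep just the text and drop any empty trailing record; a small follow-up, assuming the call above is bound to a variable named rdd:

entries = rdd.map(lambda kv: kv[1]).filter(lambda rec: rec.strip())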
I am trying to perform find and replace on docx files, whilst still maintaining formatting.
From reading up on this, the best way seems to be performing the find/replace on the XML of the document.
I can load in the XML file and find/replace on it, but I am unsure how to write it back.
docx:
Hello {text}!
python:
import zipfile
zip = zipfile.ZipFile(open('test.docx', 'rb'))
xmlString = zip.read('word/document.xml').decode('utf-8')
xmlString = xmlString.replace('{text}', 'world')
What you are trying is dangerous, because you are processing a high-level docx file at a low level. If you really want to do it, just use the hints from overwriting file in ziparchive, as suggested by @shahvishal.
But unless you fully know all the details of the docx format, my advice is: do not do it. Suppose the string {text} appears, for whatever reason, in some internal field or attribute. You are likely to change the file in an unexpected way, leading immediately (or, even worse, later) to the destruction of the file (Word no longer being able to process it).
If you do your processing on a Windows machine with Word installed, you could certainly try to use automation to process the file with Microsoft Word (a rough sketch follows the list below). Unfortunately, I only did that a long time ago and cannot give useful links. You will need:
general knowledge of the pywin32 module
sufficient knowledge of the Automation interface of MS Word. Fortunately, its documentation is good, with many examples, provided you have a full installation of Microsoft Office including the macro help
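For what it's worth, here is a rough sketch of that automation route using pywin32 (Windows with Word installed only; the paths are hypothetical):

import win32com.client  # pip install pywin32

word = win32com.client.gencache.EnsureDispatch('Word.Application')
word.Visible = False
try:
    doc = word.Documents.Open(r'C:\path\to\test.docx')    # hypothetical path
    # Replace every occurrence in the document body; 2 == wdReplaceAll.
    doc.Content.Find.Execute(FindText='{text}', ReplaceWith='world', Replace=2)
    doc.SaveAs(r'C:\path\to\test_out.docx')               # hypothetical path
    doc.Close()
finally:
    word.Quit()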
You're really going to want to use a library for reading/writing docx files rather than trying to just deal with them as raw XML. A cursory search came up with the pypi module docx but I haven't used this module so I can't endorse it:
https://pypi.python.org/pypi/docx/0.2.4
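If you do go the library route, here is a rough sketch with python-docx (a different, more actively maintained package than the one linked above); it assumes the placeholder is not split across multiple runs, which Word sometimes does:

from docx import Document  # pip install python-docx

doc = Document('test.docx')
for paragraph in doc.paragraphs:
    for run in paragraph.runs:
        if '{text}' in run.text:
            # Editing at run level keeps that run's character formatting.
            run.text = run.text.replace('{text}', 'world')
doc.save('test_out.docx')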
I've had the (unfortunate) experience of dealing with the manipulation of MS Office documents from other programming languages, and spending the time to find good libraries really paid off.
The old saying goes "don't reinvent the wheel" and I think that's definitely true when manipulating non-trivial file formats. If a somewhat mature library exists to do the job, use it!
You would need to replace the file in the zip archive. There is no "simple" way of achieving this. The following is a question that should help:
overwriting file in ziparchive
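A minimal sketch of that approach, based on the linked question: copy every entry of the original archive into a new one, swapping in the modified word/document.xml. Keep the earlier caveat in mind that a naive string replace can corrupt the document if the placeholder turns up anywhere unexpected.

import zipfile

def replace_in_docx(src, dst, old, new):
    """Copy src to dst, replacing old with new inside word/document.xml only."""
    with zipfile.ZipFile(src, 'r') as zin, \
         zipfile.ZipFile(dst, 'w', zipfile.ZIP_DEFLATED) as zout:
        for item in zin.infolist():
            data = zin.read(item.filename)
            if item.filename == 'word/document.xml':
                data = data.decode('utf-8').replace(old, new).encode('utf-8')
            zout.writestr(item, data)

replace_in_docx('test.docx', 'test_out.docx', '{text}', 'world')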