Spark reading WARC file with custom InputFormat - python

I need to process a .warc file with Spark, but I can't seem to find a straightforward way of doing so. I would prefer to use Python and not read the whole file into an RDD through wholeTextFiles() (because the whole file would then be processed at a single node(?)), so it seems the only/best way is a custom Hadoop InputFormat used with .hadoopFile() in Python.
However, I could not find an easy way of doing this. Splitting a .warc file into entries is as simple as splitting on \n\n\n; so how can I achieve that without writing a ton of extra (useless) code, as shown in various "tutorials" online? Can it be done entirely in Python?
i.e., How to split a warc file into entries without reading the whole thing with wholeTextFiles?

If the delimiter is \n\n\n, you can use textinputformat.record.delimiter:
rdd = sc.newAPIHadoopFile(
    path,
    'org.apache.hadoop.mapreduce.lib.input.TextInputFormat',
    'org.apache.hadoop.io.LongWritable',
    'org.apache.hadoop.io.Text',
    conf={'textinputformat.record.delimiter': '\n\n\n'}
)
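Each element of the resulting RDD is a (byte offset, record text) pair, so a small follow-up sketch (assuming you only care about the text of each WARC record) is to drop the keys:
records = rdd.map(lambda kv: kv[1])  # keep only the Text value of each record
print(records.count())  # number of WARC entries found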

Related

input data from multiple source into hadoop(HDFS)

How can I put data from multiple different sources into HDFS using Python?
I already tried loading a SQL file using PySpark (in the PyCharm IDE) and it worked.
Now I need more functionality that allows me to ingest other kinds of data into HDFS.
PySpark is very versatile - it can read multiple inputs via Streaming/SQL. You'll need to be more specific about what sources you are trying to load from.
However, if you want a more accessible way to ingest lots of data, that is what apache-kafka was explicitly built for. If you prefer not having to write lots of code, then you may also look at apache-nifi, which integrates nicely within the Hadoop ecosystem.
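As an illustration of the PySpark route only (a sketch with hypothetical connection details, since the question doesn't say which sources are involved), a JDBC table can be pulled and landed in HDFS as Parquet like this:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ingest").getOrCreate()

# Read one hypothetical source (a JDBC table) and write it to HDFS as Parquet.
# The JDBC driver jar must be on the Spark classpath for this to work.
df = (spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://dbhost:5432/mydb")  # hypothetical
      .option("dbtable", "public.orders")                   # hypothetical
      .option("user", "etl")
      .option("password", "secret")
      .load())
df.write.mode("overwrite").parquet("hdfs:///landing/orders")  # hypothetical path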

Reading custom format file using apache beam

I am new to Apache Beam. I have a requirement to read a text file with the format as given below
a=1
b=3
c=2

a=2
b=6
c=5
Here all rows up to an empty line are part of one record and need to be processed together (e.g., inserted into a table as columns). The above example corresponds to a file with just 2 records.
I am using ReadFromText to read the file and process it. It reads each line as an element. I am then trying to loop and process till I get empty lines.
ReadFromText returns a PCollection, and I have read that a PCollection is an abstraction of a potentially distributed dataset. My doubt is: while reading, will I get the records in the same order as in the file, or will I just get a collection of rows where the order is not preserved? What solution can I use to solve this problem?
I am using python language. I have to read the file from the GCP bucket and use Google Dataflow for execution.
No, your records are not guaranteed to be in the same order. PCollections are inherently unordered, and elements in a PCollection are expected to be parallelizable, that is, distinct and not reliant on other elements in the PCollection.
In your example you're using TextIO which treats each line of a text file as a separate element, but what you need is to gather each set of data for a record as one element. There are many potential ways around this.
If you can modify the text file, you could put all your data on a single line per record, and then parse that line in a transform you write. This is the usual approach taken, for example with CSV files.
If you can't modify the files, a simple way to add your own reading logic is to retrieve the files with FileIO and then write a custom ParDo that reads each file however you need. This is not as simple as using an existing IO out of the box, but is still easier than creating a fully featured Source.
If the files are more complex and you need a more robust solution, you can implement your own Source that reads the file and outputs records in your required format. This would most likely involve using Splittable DoFns and would require a fair amount of knowledge of how a FileBasedSource works.
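A minimal sketch of the FileIO-plus-ParDo approach (a FlatMap is used here as shorthand for a ParDo; it assumes records are separated by a single blank line, that each file fits in memory inside the DoFn, and the bucket path is hypothetical):
import apache_beam as beam
from apache_beam.io import fileio

def split_records(readable_file):
    # Read the whole file and split it on blank lines so that each
    # group of "key=value" rows becomes one element (a dict here).
    text = readable_file.read_utf8()
    for block in text.split("\n\n"):
        lines = [ln for ln in block.splitlines() if ln.strip()]
        if lines:
            yield dict(ln.split("=", 1) for ln in lines)

with beam.Pipeline() as p:
    records = (
        p
        | fileio.MatchFiles("gs://my-bucket/input/*.txt")  # hypothetical pattern
        | fileio.ReadMatches()
        | beam.FlatMap(split_records)
    )
    records | beam.Map(print)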

Making changes to a ntriples file with python

Scenario: I just got my hands on a huge ntriples file (6.5gb uncompressed). I am trying to open it and perform some operations (such as cleaning some of the data that it contains).
Issue: I haven't been able to check the contents of this file. Notepad++ cannot handle it, and with RDFLib the furthest I got was loading the file; I cannot seem to find a way to edit it without parsing the entire thing. I also tried the RDF package (from "how to parse big datasets using RDFLib?"), but I cannot find a way to install it for Python 3.
Question: What is the best option to perform this kind of operation? Is there any command in rdflib that allows for this kind of editing?
If it's N-Triples, then it's basically one triple per line. Therefore, you can read the file in small chunks (some N lines at a time), parse each chunk via rdflib, and then apply whatever cleaning operations you need to the resulting graph.
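A rough sketch of that chunked approach (the chunk size, file name, and cleaning step are placeholders):
from rdflib import Graph

CHUNK_LINES = 100000  # hypothetical chunk size

def clean_chunk(lines):
    # Parse just this chunk of N-Triples into a small in-memory graph,
    # apply the cleaning you need, and write the result wherever you like.
    g = Graph()
    g.parse(data="".join(lines), format="nt")
    # ... cleaning / filtering of g goes here ...

with open("huge.nt", encoding="utf-8") as src:  # hypothetical file name
    buf = []
    for line in src:
        buf.append(line)
        if len(buf) >= CHUNK_LINES:
            clean_chunk(buf)
            buf = []
    if buf:
        clean_chunk(buf)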

How to append data to a nested JSON file in Python

I'm creating a program that will need to store different objects in a logical structure on a file which will be read by a web server and displayed to users.
Since the file will contain a lot of information, loading the whole file into memory, appending information, and writing the whole file back to the filesystem - as some answers suggested - will prove problematic.
I'm looking for something of this sort:
foods = [{
    "fruits": {
        "apple": "red",
        "banana": "yellow",
        "kiwi": "green"
    },
    "vegetables": {
        "cucumber": "green",
        "tomato": "red",
        "lettuce": "green"
    }
}]
I would like to be able to add additional data to the table like so:
newFruit = {"cherry": "red"}
foods[0]["fruits"].update(newFruit)
Is there any way to do this in python with JSON without loading the whole file?
That is not possible with pure JSON; appending to a JSON list will always require reading the whole file into memory.
But you could use JSON Lines for that. It's a format where each line is a valid JSON value on its own, and it's what AWS uses for their APIs. Your vegetables.json could be written like this:
{"cucumber":"green"}
{"tomato":"red"}
{"lettuce":"green"}
Adding a new entry is then very easy, because it just means appending a new line to the end of the file.
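A minimal sketch of that append (the file name is just a placeholder):
import json

def append_record(path, record):
    # Append one record as a single JSON line; nothing is read back in.
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

append_record("vegetables.jsonl", {"cherry": "red"})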
Since the file will contain a lot of information, loading the whole file into memory, appending information and writing the whole file back to the filesystem - as some answers stated - will prove problematic
If your file really is too huge to fit in memory, then either the source JSON should have been split into smaller independent parts, or it's just not a proper use case for JSON. In other words, what you have in this case is a design issue, not a coding one.
There's at least one streaming JSON parser that might or might not allow you to solve the issue, depending on the source data structure and the updates you actually have to do.
This being said, given today's computers, you need a really huge JSON file to end up eating all your RAM, so before anything else you should probably just check the actual file size and how much memory it takes to parse it into Python.
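For the streaming route, one such parser (not named in the answer above, so this is an assumption) is ijson, which can walk a large top-level array without loading the whole file:
import ijson  # streaming JSON parser; one possible choice, not prescribed above

# Iterate over the top-level array one object at a time instead of
# loading the entire file with json.load().
with open("foods.json", "rb") as f:  # hypothetical file name
    for obj in ijson.items(f, "item"):
        print(list(obj.keys()))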

ReactNative -- Creating file with Python to be read by AsyncStorage

I am trying to solve an issue for my app, which is programmed in ReactNative.
I have a type of object named 'Card' which has two main variables, both of them strings: 'question' and 'answer'.
The data for these objects is provided to the app, which has to load it and show it to the user. I have around 10 thousand "points" of data written as lines in a .txt, each line being: "this is a question? // here is the answer".
I would like to create a Python script that breaks that .txt into 10 thousand files that can be read by a ReactNative method at run-time. I thought about AsyncStorage.getItem and wanted to ask how to format these files so that they can be read by that method.
But I'm beginning to think I would need to use Expo.FileSystem.readAsStringAsync(). Is that right? I would rather use AsyncStorage.getItem and just parse the file right away...
AsyncStorage is not the thing you're looking for. It's a key-value storage, which doesn't solve your problem, as you're trying to use an already existing text file as a database, so to say. Expo's built-in Expo.FileSystem.readAsStringAsync() seems to do exactly what you want. There are probably more ways to solve your problem, but using Expo's API is a pretty good solution, I'd say.
If you want to improve performance, I'd suggest using an actual database, since you then won't need to parse the txt file into the appropriate format. You could use an SQLite database and just populate the data on the app's first launch, similar to the approach in "How can I embed an SQLite database into an application?". Or you could call a remote API, but that's probably out of your scope.
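For the Python side of the conversion, a minimal sketch (assuming the "question // answer" line format described above; file names are placeholders) that produces one JSON file the app could load with Expo.FileSystem.readAsStringAsync() and JSON.parse():
import json

cards = []
with open("cards.txt", encoding="utf-8") as src:  # hypothetical input file
    for line in src:
        line = line.strip()
        if not line:
            continue
        # Each line looks like "this is a question? // here is the answer"
        question, _, answer = line.partition(" // ")
        cards.append({"question": question, "answer": answer})

with open("cards.json", "w", encoding="utf-8") as dst:  # hypothetical output file
    json.dump(cards, dst, ensure_ascii=False)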
