Reading a custom format file using Apache Beam - Python

I am new to Apache Beam. I have a requirement to read a text file with the format given below:
a=1
b=3
c=2

a=2
b=6
c=5
Here, all rows up to an empty line are part of one record and need to be processed together (e.g. inserted into a table as columns). The above example corresponds to a file with just two records.
I am using ReadFromText to read the file and process it. It reads each line as an element. I am then trying to loop and process lines until I reach an empty line.
ReadFromText returns a PCollection, and I have read that a PCollection is an abstraction of a potentially distributed dataset. My question is: while reading, will I get the rows in the same order as in the file, or will I just get a collection of rows where the order is not preserved? What approach can I use to solve this problem?
I am using Python. I have to read the file from a Google Cloud Storage bucket and use Google Dataflow for execution.

No, your records are not guaranteed to be in the same order. PCollections are inherently unordered, and elements in a PCollection are expected to be parallelizable, that is, distinct and not reliant on other elements in the PCollection.
In your example you're using TextIO, which treats each line of a text file as a separate element, but what you need is to gather each record's set of lines into one element. There are several potential ways around this.
If you can modify the text file, you could put all your data on a single line per record, and then parse that line in a transform you write. This is the usual approach taken, for example with CSV files.
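For instance, if each record were rewritten onto a single line such as a=1,b=3,c=2 (an assumed layout, not taken from the question), a minimal parsing transform could look like this:

import apache_beam as beam

def parse_record(line):
    """Turn a line like 'a=1,b=3,c=2' into {'a': '1', 'b': '3', 'c': '2'}."""
    return dict(field.split('=', 1) for field in line.split(','))

with beam.Pipeline() as p:
    records = (
        p
        | 'Read' >> beam.io.ReadFromText('gs://my-bucket/records.txt')  # hypothetical path
        | 'Parse' >> beam.Map(parse_record)
    )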
If you can't modify the files, a simple way to add your own file-reading logic is to retrieve the files with FileIO and then write a custom ParDo containing that logic. This is not as simple as using an existing IO out of the box, but is still easier than creating a fully featured Source.
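A sketch of that FileIO approach, assuming each file fits in memory and that records are separated by blank lines as in the question:

import apache_beam as beam
from apache_beam.io import fileio

class ParseRecords(beam.DoFn):
    """Reads a whole matched file and yields one dict per blank-line-delimited record."""
    def process(self, readable_file):
        text = readable_file.read_utf8()
        for block in text.split('\n\n'):
            lines = [line for line in block.splitlines() if line.strip()]
            if lines:
                # 'a=1' -> ('a', '1'); one dict per record.
                yield dict(line.split('=', 1) for line in lines)

with beam.Pipeline() as p:
    records = (
        p
        | fileio.MatchFiles('gs://my-bucket/input/*.txt')  # hypothetical pattern
        | fileio.ReadMatches()
        | beam.ParDo(ParseRecords())
    )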
If the files are more complex and you need a more robust solution, you can implement your own Source that reads the file and outputs records in your required format. This would most likely involve using Splittable DoFns and would require a fair amount of knowledge of how a FileBasedSource works.

Related

How to append data to a nested JSON file in Python

I'm creating a program that will need to store different objects in a logical structure on a file which will be read by a web server and displayed to users.
Since the file will contain a lot of information, loading the whole file into memory, appending information, and writing the whole file back to the filesystem - as some answers stated - will prove problematic.
I'm looking for something of this sort:
foods = [{
    "fruits": {
        "apple": "red",
        "banana": "yellow",
        "kiwi": "green"
    },
    "vegetables": {
        "cucumber": "green",
        "tomato": "red",
        "lettuce": "green"
    }
}]
I would like to be able to add additional data to the table like so:
newFruit = {"cherry":"red"}
foods["fruits"].append(newFruit)
Is there any way to do this in python with JSON without loading the whole file?
That is not possible with pure JSON; appending to a JSON list will always require reading the whole file into memory.
But you could use JSON Lines for that. It's a format where each line is a valid JSON document on its own; it's what AWS uses for their APIs. Your vegetables.json could be written like this:
{"cucumber":"green"}
{"tomato":"red"}
{"lettuce":"green"}
Adding a new entry then becomes very easy, because it is just appending a new line to the end of the file.
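A minimal sketch of appending and reading back (the file name is just an example):

import json

# Appending one record is a plain file append; nothing else is rewritten.
with open('vegetables.jsonl', 'a') as f:
    f.write(json.dumps({"pepper": "red"}) + '\n')

# Reading back parses one line at a time instead of one big document.
with open('vegetables.jsonl') as f:
    vegetables = [json.loads(line) for line in f if line.strip()]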
Since the file will contain a lot of information, loading the whole file into memory, appending information and writing the whole file back to the filesystem - as some answers stated - will prove problematic
If your file is really too huge to fit in memory, then either the source JSON should have been split into smaller independent parts, or it's just not a proper use case for JSON. In other words, what you have here is a design issue, not a coding one.
There's at least one streaming JSON parser that might or might not allow you to solve the issue, depending on the source data structure and the effective updates you have to do.
That being said, given today's computers, you need a really huge JSON file to end up eating all your RAM, so before anything else you should probably just check the actual file size and how much memory it takes to parse it into Python.
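One such streaming parser is ijson (named here only as an example; the answer above doesn't pick a specific library). A rough sketch of iterating over the top-level list from the question without loading the whole document at once:

import ijson  # third-party streaming JSON parser, one possible choice

with open('foods.json', 'rb') as f:
    # 'item' addresses each element of the root array, yielded one at a time.
    for entry in ijson.items(f, 'item'):
        print(entry['fruits'])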

How to retrieve the content of a PCollection and assign it to a normal variable?

I am using Apache-Beam with the Python SDK.
Currently, my pipeline reads multiple files, parses them, and generates pandas DataFrames from their data.
Then it groups them into a single DataFrame.
What I want now is to retrieve this single fat DataFrame and assign it to a normal Python variable.
Is it possible to do?
A PCollection is simply a logical node in the execution graph, and its contents are not necessarily actually stored anywhere, so this is not possible directly.
However, you can ask your pipeline to write the PCollection to a file (e.g. convert elements to strings and use WriteToText with num_shards=1), run the pipeline and wait for it to finish, and then read that file from your main program.
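A rough sketch of that pattern (the output location and the sample data are placeholders for your real pipeline):

import apache_beam as beam

output_prefix = '/tmp/result'  # hypothetical output location

with beam.Pipeline() as p:  # the with-block waits for the pipeline to finish
    (p
     | beam.Create(['a', 'b', 'c'])   # stand-in for your real transforms
     | beam.Map(str)                  # elements must be converted to strings
     | beam.io.WriteToText(output_prefix, num_shards=1))

# With num_shards=1 there is exactly one shard to read back afterwards.
with open(output_prefix + '-00000-of-00001') as f:
    lines = f.read().splitlines()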

Spark reading WARC file with custom InputFormat

I need to process a .warc file through Spark but I can't seem to find a straightforward way of doing so. I would prefer to use Python and to not read the whole file into an RDD through wholeTextFiles() (because the whole file would be processed at a single node(?)) therefore it seems like the only/best way is through a custom Hadoop InputFormat used with .hadoopFile() in Python.
However, I could not find an easy way of doing this. To split a .warc file into entries is as simple as splitting on \n\n\n; so how can I achieve this, without writing a ton of extra (useless) code as shown in various "tutorials" online? Can it be done all in Python?
i.e., How to split a warc file into entries without reading the whole thing with wholeTextFiles?
If the delimiter is \n\n\n, you can use textinputformat.record.delimiter:
sc.newAPIHadoopFile(
    path,
    'org.apache.hadoop.mapreduce.lib.input.TextInputFormat',
    'org.apache.hadoop.io.LongWritable',
    'org.apache.hadoop.io.Text',
    conf={'textinputformat.record.delimiter': '\n\n\n'}
)
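newAPIHadoopFile returns an RDD of (byte offset, record text) pairs; a likely follow-up, not part of the answer above, is to keep only the record text:

from pyspark import SparkContext

sc = SparkContext(appName='warc-split')

warc_records = (
    sc.newAPIHadoopFile(
        'hdfs:///data/example.warc',  # hypothetical path
        'org.apache.hadoop.mapreduce.lib.input.TextInputFormat',
        'org.apache.hadoop.io.LongWritable',
        'org.apache.hadoop.io.Text',
        conf={'textinputformat.record.delimiter': '\n\n\n'})
    .map(lambda kv: kv[1])  # drop the byte-offset key, keep the record text
)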

Is JSON appropriate for indexing large filesystems?

The task at hand is simple: all file names and their respective directories are to be stored in an index file. Right now, it is a plain text file (compressed). I was wondering if JSON is fit for the job. I am using a script written in Python to do this. It walks through all the directories and saves them in a paragraph format. I compress the result at the end too. It is fairly fast - about 15 seconds for indexing 70,000 files and 7,500 folders (including compression time). The size of the index file is about 300 KB:
this/is/the/folder
file1.png
file2.txt
file3.dat
another/folder
morefiles.inf
saywhaaa.mov
The parser for this index file is simple. The first line in each paragraph is the top folder, and the remaining lines are the file names contained inside that folder. Keeping the "don't fix it if it isn't broken" policy in mind, I wondered whether using JSON could have any advantages here. Is it quicker to parse the data? Would the index file increase in size? I don't know any JSON, but I will learn it if it would help my case.
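A minimal sketch of the parser described above (the file name is an assumption):

def parse_index(path):
    """Parse the paragraph format: the first line of each blank-line-separated
    block is the folder, the remaining lines are the files inside it."""
    index, block = {}, []
    with open(path) as f:
        for line in f:
            line = line.rstrip('\n')
            if line:
                block.append(line)
            elif block:
                index[block[0]] = block[1:]
                block = []
    if block:  # the last block may not be followed by a blank line
        index[block[0]] = block[1:]
    return index

index = parse_index('index.txt')  # e.g. {'this/is/the/folder': ['file1.png', ...]}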

XML to Postgres via python/psycopg2

I have an existing Python script that loops through a directory of XML files, parsing each file using etree and inserting data at different points into a Postgres database schema using the psycopg2 module. This hacked-together script worked just fine, but now the amount of data (the number and size of XML files) is growing rapidly, and the number of INSERT statements is just not scaling. The largest table in my final database has grown to roughly 50 million records from about 200,000 XML files. So my question is, what is the most efficient way to:
Parse data out of XMLs
Assemble row(s)
Insert row(s) to Postgres
Would it be faster to write all the data to a CSV in the correct format and then bulk load the final CSV tables into Postgres using the COPY FROM command?
Otherwise, I was thinking about populating some sort of temporary data structure in memory that I could insert into the DB once it reaches a certain size. I am just having trouble arriving at the specifics of how this would work.
Thanks for any insight on this topic, and please let me know if more information is needed to answer my question.
copy_from is the fastest way I've found to do bulk inserts. You might be able to get away with streaming the data through a generator, to avoid writing temporary files while keeping memory usage low.
A generator function could assemble rows out of the XML data, and you then consume that generator with copy_from. You may even want multiple levels of generators, such that one yields records from a single file and another composes those from all 200,000 files. You'd end up with a single query, which will be much faster than 50,000,000 individual inserts.
I wrote an answer here with links to example and benchmark code for setting something similar up.
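A rough sketch of the batching idea (not the linked answer; the table, columns, and XML element names are all placeholders):

import io
import itertools
import xml.etree.ElementTree as ET
import psycopg2

def rows_from_xml(paths):
    """Yield one tab-separated COPY line per record across all XML files."""
    for path in paths:
        for rec in ET.parse(path).iter('record'):  # hypothetical element name
            yield '%s\t%s\n' % (rec.get('id'), rec.get('value'))

def copy_in_batches(conn, paths, batch_size=100000):
    """Stream generator output into Postgres with copy_from, one bounded
    in-memory batch at a time, instead of millions of single INSERTs."""
    rows = rows_from_xml(paths)
    with conn.cursor() as cur:
        while True:
            batch = list(itertools.islice(rows, batch_size))
            if not batch:
                break
            cur.copy_from(io.StringIO(''.join(batch)),
                          'my_table', columns=('id', 'value'))  # hypothetical table
    conn.commit()

conn = psycopg2.connect('dbname=mydb')  # hypothetical connection string
copy_in_batches(conn, ['file1.xml', 'file2.xml'])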
