Is JSON appropriate for indexing large filesystems? - python

The task at hand is simple: all file names and their respective directories are to be stored in an index file. Right now, it is a plain text file (compressed). I was wondering if JSON is fit for the job. I am using a script written in Python to do this. It walks through all the directories and saves them in a paragraph format, and I compress the result at the end. It is fairly fast - about 15 seconds for indexing 70,000 files and 7,500 folders (including compression time). The size of the index file is about 300 KB, and it looks like this:
this/is/the/folder
file1.png
file2.txt
file3.dat
another/folder
morefiles.inf
saywhaaa.mov
The parser for this index file is simple. The first line in each paragraph is the top folder, and the remaining lines are the names of the files contained inside that folder. Keeping the "if it isn't broken, don't fix it" policy in mind, I wondered whether using JSON would have any advantages here. Is it quicker to parse the data? Will the index file increase in size? I don't know any JSON, but I will learn it if it would help my case.
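For comparison, here is a minimal sketch of how the same index could be built as JSON, with each folder becoming a key that maps to its list of file names (the root path and output name are placeholders, not taken from the question):

import json
import os

# Build a folder -> file-names mapping and dump it as JSON.
# "/path/to/root" and "index.json" are placeholders.
index = {}
for folder, _subdirs, files in os.walk("/path/to/root"):
    index[folder] = files

with open("index.json", "w", encoding="utf-8") as f:
    json.dump(index, f)

Whether this parses faster or compresses smaller than the paragraph format is easy to measure on the same 70,000-file tree.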

Related

Is there an easy way to handle inconsistent file paths in blob storage?

I have a service that drops a bunch of .gz files to an azure container on a daily cadence. I'm looking to pick these files up and convert the underlying txt/json into tables. The issue perplexing me is that the service adds two random string prefix folders and a date folder to the path.
Here's an example file path:
container/service-exports/z633dbc1-3934-4cc3-ad29-e82c6e74f070/2022-07-12/42625mc4-47r6-4bgc-ac72-11092822dd81-9657628860/*.gz
I've thought of 3 possible solutions:
I don't necessarily need the data to persist. I could theoretically loop through each folder looking for .gz files, open and write them to an output file, and then go back through and delete the folders in the path.
Create some sort of checkpoint file that keeps track of each path per gzip, and then configure some way of comparing against the checkpoint file at runtime. I'm not sure how efficient this would be over time.
Use a regex to look for random strings matching the pattern/length of the prefixes and then look for the current date folder; if the date isn't today, pass (sketched below).
Am I missing a prebuilt library or function capable of simplifying this? I searched around but couldn't find any discussions of this type of problem.
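For context, option 3 would look roughly like the following with the azure-storage-blob SDK; the connection string, container name and prefix pattern are placeholders, and it relies on the date folder always being the third path segment.

import re
from datetime import date
from azure.storage.blob import ContainerClient

# Placeholders: supply your own connection string and container name.
client = ContainerClient.from_connection_string("<connection-string>", "container")

today = date.today().isoformat()                          # e.g. "2022-07-12"
random_prefix = re.compile(r"^[0-9a-z]{8}(-[0-9a-z]+)+$")  # rough shape of the random folders

for blob in client.list_blobs(name_starts_with="service-exports/"):
    parts = blob.name.split("/")
    # Expected shape: service-exports/<random>/<YYYY-MM-DD>/<random>/<file>.gz
    if (
        len(parts) == 5
        and random_prefix.match(parts[1])
        and parts[2] == today
        and parts[4].endswith(".gz")
    ):
        print("would process:", blob.name)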
You can do this using koalas.
import databricks.koalas as ks

# The wildcards cover the two random prefix folders and the date folder.
path = "wasbs://container/service-exports/*/*/*.gz"
df = ks.read_csv(path, sep=",", header="infer")
This should work well if all of the .gz files have the same columns; df will then contain all of the data from the .gz files concatenated.
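If only the current day's export is needed, a small variation on the above narrows the wildcard to today's date folder (the wasbs URL shape is copied from the answer and may still need adjusting for your storage account):

from datetime import date
import databricks.koalas as ks

# Only the date folder changes from day to day; the random prefixes stay as wildcards.
today = date.today().isoformat()          # e.g. "2022-07-12"
path = f"wasbs://container/service-exports/*/{today}/*/*.gz"
df = ks.read_csv(path, sep=",", header="infer")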

Reading custom format file using apache beam

I am new to Apache Beam. I have a requirement to read a text file in the format given below:
a=1
b=3
c=2

a=2
b=6
c=5
Here, all rows up to an empty line are part of one record and need to be processed together (e.g. inserted into a table as columns). The example above corresponds to a file with just 2 records.
I am using ReadFromText to read the file and process it. It reads each line as an element. I am then trying to loop and process till I get empty lines.
ReadFromText returns a PCollection, and I have read that a PCollection is an abstraction of a potentially distributed dataset. My doubt is: while reading, will I get the lines in the same order as in the file, or will I just get a collection of rows where the order is not preserved? What solution can I use to solve this problem?
I am using Python. I have to read the file from a GCP bucket and will use Google Dataflow for execution.
No, your records are not guaranteed to be in the same order. PCollections are inherently unordered, and elements in a PCollection are expected to be parallelizable, that is, distinct and not reliant on other elements in the PCollection.
In your example you're using TextIO, which treats each line of a text file as a separate element, but what you need is to gather all of the lines belonging to one record into a single element. There are several potential ways around this:
If you can modify the text file, you could put all your data on a single line per record, and then parse that line in a transform you write. This is the usual approach taken, for example with CSV files.
If you can't modify the files, a simple way to add your own logic for reading them is to retrieve the files with FileIO and then write a custom ParDo with your own logic for reading each file (see the sketch after these options). This is not as simple as using an existing IO out of the box, but it is still easier than creating a fully featured Source.
If the files are more complex and you need a more robust solution, you can implement your own Source that reads the file and outputs records in your required format. This would most likely involve using Splittable DoFns and would require a fair amount of knowledge of how a FileBasedSource works.
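As a rough illustration of the FileIO-plus-ParDo route (the bucket path is a placeholder, and this assumes each file is small enough to be read into a worker's memory):

import apache_beam as beam
from apache_beam.io import fileio


class ParseRecords(beam.DoFn):
    """Emit one dict per blank-line-separated record, e.g. {"a": "1", "b": "3", "c": "2"}."""
    def process(self, readable_file):
        text = readable_file.read_utf8()
        for block in text.split("\n\n"):               # records are separated by empty lines
            pairs = [line.split("=", 1) for line in block.splitlines() if "=" in line]
            if pairs:
                yield dict(pairs)


with beam.Pipeline() as pipeline:
    records = (
        pipeline
        | fileio.MatchFiles("gs://my-bucket/input/*.txt")   # placeholder pattern
        | fileio.ReadMatches()
        | beam.ParDo(ParseRecords())
        # downstream transforms (e.g. writing rows to a table) would follow here
    )

Because each record is assembled inside a single DoFn call, the order of lines within a record is preserved even though the PCollection of records itself is unordered.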

How to append data to a nested JSON file in Python

I'm creating a program that will need to store different objects in a logical structure on a file which will be read by a web server and displayed to users.
Since the file will contain a lot of information, loading the whole file into memory, appending information and writing the whole file back to the filesystem - as some answers stated - will prove problematic.
I'm looking for something of this sort:
foods = [{
    "fruits": {
        "apple": "red",
        "banana": "yellow",
        "kiwi": "green"
    },
    "vegetables": {
        "cucumber": "green",
        "tomato": "red",
        "lettuce": "green"
    }
}]
I would like to be able to add additional data to the table like so:
newFruit = {"cherry":"red"}
foods["fruits"].append(newFruit)
Is there any way to do this in python with JSON without loading the whole file?
That is not possible with pure JSON; appending to a JSON list will always require reading the whole file into memory.
But you could use JSON Lines for that. It's a format where each line is a valid JSON document on its own, and it's what AWS uses for their APIs. Your vegetables.json could be written like this:
{"cucumber":"green"}
{"tomato":"red"}
{"lettuce":"green"}
Adding a new entry is then very easy, because it just means appending one more line to the end of the file.
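A minimal sketch of what appending looks like in Python with this layout (the file name is just illustrative):

import json

def append_record(path, record):
    # One JSON document per line; nothing already in the file needs to be read.
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

append_record("vegetables.jsonl", {"cherry": "red"})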
Since the file will contain a lot of information, loading the whole file into memory, appending information and writing the whole file back to the filesystem - as some answers stated - will prove problematic.
If your file is really too huge to fit in memory, then either the source JSON should have been split into smaller independent parts, or it's just not a proper use case for JSON. In other words, what you have here is a design issue, not a coding one.
There's at least one streaming JSON parser that might or might not let you solve the issue, depending on the source data structure and the updates you actually have to make.
That being said, given today's computers, you need a really huge JSON file to eat up all your RAM, so before anything else you should probably check the actual file size and how much memory it takes to parse it into Python.
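For completeness, this is roughly what the streaming approach looks like with ijson, one such parser (the file name is a placeholder; ijson.parse yields one (prefix, event, value) triple per token, so you can react to just the parts you care about):

import ijson

# Stream parse events instead of loading the whole document into memory.
# With the structure from the question, prefixes look like "item.fruits.apple".
with open("foods.json", "rb") as f:
    for prefix, event, value in ijson.parse(f):
        if event == "string":
            print(prefix, "->", value)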

Can you modify only a text string in an XML file and still maintain integrity and functionality of .docx encasement?

I want to enter data into a Microsoft Excel spreadsheet and have that data interact with and write itself into other documents and web forms.
I am successfully pulling data from an Excel spreadsheet using xlwings. Right now, I'm stuck on working with .docx files. The goal here is to write the Excel data into specific parts of a Microsoft Word .docx template and create a new file.
My specific question is:
Can you modify just the text string(s) in a word/document.xml file and still maintain the integrity and functionality of its .docx encasement? It seems that numerous things can change in the XML when you make even the slightest change to a Word document. I've been working with python-docx and lxml, but I'm not sure whether what I'm trying to do is possible via this route.
Any suggestions or experiences to share would be greatly appreciated. I feel I've read every article that is easily discoverable through a google search at least 5 times.
Let me know if anything needs clarification.
Some things to note:
I started getting into coding about 2 months ago. I’ve been doing it intensively for that time and I feel I’m picking up the essential concepts, but there are severe gaps in my knowledge.
Here are my tools:
Yosemite 10.10,
Microsoft Office 2011 for Mac
You probably need to be more specific, but the short answer is, in principle, yes.
At a certain level, all python-docx does is modify strings in the XML. A couple of things to keep in mind, though:
The XML you create needs to remain well-formed and valid according to the schema. So if you change the text enclosed in a <w:t> element, for example, that works fine. Conversely, if you inject a bunch of random XML at an arbitrary point in one of the .xml parts, that will corrupt the file.
The XML "files", known as parts that make up a .docx file are contained in a Zip archive known as a package. You must unpackage and repackage that set of parts properly in order to have a valid .docx file afterward. python-docx takes care of all those details for you, but if you're going directly at the .docx file you'll need to take care of that yourself.

Corrupted Data Format in CSV file - Single row data split among 2 rows. Quick fix needed

I have hundreds of CSV files which all appear to be corrupt, and all of them share the same problem: each file has 5 headers, but the data for each record is always split across 2 rows.
I was thinking of writing a Python script to correct this and was wondering whether there is a function or library that can do it quickly, rather than writing a whole script to adjust it by hand. How can I correct this for all of the files?
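Without seeing the exact layout it's hard to be definitive, but if each logical record really does occupy exactly two consecutive rows, a rough sketch with the standard csv module could merge the halves like this (the input and output paths are placeholders):

import csv
import glob

# Assumption: row 0 is the header and every logical record spans rows (1,2), (3,4), ...
for name in glob.glob("broken/*.csv"):
    with open(name, newline="") as f:
        rows = list(csv.reader(f))
    header, body = rows[0], rows[1:]
    merged = [body[i] + body[i + 1] for i in range(0, len(body) - 1, 2)]
    with open(name.replace(".csv", "_fixed.csv"), "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(merged)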
