Where to store metadata associated with files? - python

This is a question about storing and loading data, particularly in Python. I'm not entirely sure this is the appropriate forum, so redirect me if not.
I'm handling about 50 CSV files of roughly 1,000 rows each, and each file has 10 parameters of associated metadata. What is the best way to store this such that:
(A) All the information is human-readable plain text, and it's easy for a non-programming human to associate data and metadata.
(B) It's convenient to load the metadata and each column of the CSV into a Python dictionary.
I've considered four possible solutions:
(0) Previously, I've stored smaller amounts of metadata in the filename. This is bad for obvious reasons.
(1) Assign each CSV file an ID number, name each file "ID.csv", and produce a "metadata.csv" that maps each ID number to its metadata. The shortcoming here is that ID numbers reduce human readability: to learn the contents of a file, a non-programming reader must manually check "metadata.csv".
(2) Leave the metadata at the top of the CSV file. The shortcoming here is that my program needs two steps: (a) read the metadata from some arbitrary number of lines at the top of the file, and (b) tell the CSV reader (pandas.read_csv) to skip those lines. (A rough sketch of this appears at the end of the question.)
(3) Convert the CSV to a data serialization format like YAML, where I could easily include the metadata. The shortcomings are that loading the columns of the CSV into my dictionary is no longer as easy, and not everyone knows YAML.
Are there any clever solutions to this problem? Thanks!
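For what it's worth, option (2) in code might look roughly like this, assuming the metadata sits in a fixed number of "key: value" lines at the top of the file (the file name and line count below are made up):

import pandas as pd

META_LINES = 10  # assumed number of metadata lines at the top of the file

# (a) read the metadata lines into a dictionary
metadata = {}
with open("experiment_01.csv") as f:
    for _ in range(META_LINES):
        key, _, value = f.readline().partition(":")
        metadata[key.strip()] = value.strip()

# (b) tell pandas to skip those lines and load the columns
df = pd.read_csv("experiment_01.csv", skiprows=META_LINES)
columns = {name: df[name].tolist() for name in df.columns}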

This question is a tad subjective so it may be closed, but let me suggest the built-in Python json module. JSON strikes a good balance of human-readability and is portable to almost any language. You could convert your original data into something like this:
{
"metadata":{"name":"foo", "status":"bar"},
"data":[[1,2,3],[4,5,6],[....]]
}
where data holds your original CSV contents and metadata is a dictionary containing whatever data you would like to store. It is also simple to strip the metadata out and recover the original CSV data from this format, all within the confines of built-in Python modules.
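For example, a minimal sketch using only the built-in csv and json modules (the file names and metadata values here are made up):

import csv
import json

metadata = {"name": "foo", "status": "bar"}

# read the original CSV into a list of rows
with open("data.csv", newline="") as f:
    rows = list(csv.reader(f))

# bundle metadata and data together in one JSON file
with open("data.json", "w") as f:
    json.dump({"metadata": metadata, "data": rows}, f, indent=2)

# later: load it back and split the metadata from the data again
with open("data.json") as f:
    bundle = json.load(f)
metadata, rows = bundle["metadata"], bundle["data"]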

Related

Reading custom format file using apache beam

I am new to Apache Beam. I have a requirement to read a text file with the format given below:
a=1
b=3
c=2

a=2
b=6
c=5
Here, all rows up to an empty line are part of one record and need to be processed together (e.g., inserted into a table as columns). The example above corresponds to a file with just two records.
I am using ReadFromText to read the file and process it. It reads each line as an element. I am then trying to loop and process lines until I hit an empty line.
ReadFromText returns a PCollection, and I have read that a PCollection is an abstraction of a potentially distributed dataset. My question is: while reading, will I get the records in the same order as in the file, or will I just get a collection of rows where the order is not preserved? What can I use to solve this problem?
I am using python language. I have to read the file from the GCP bucket and use Google Dataflow for execution.
No, your records are not guaranteed to be in the same order. PCollections are inherently unordered, and elements in a PCollection are expected to be parallelizable, that is, distinct and not reliant on other elements in the PCollection.
In your example you're using TextIO, which treats each line of a text file as a separate element, but what you need is to gather each record's set of lines into one element. There are several ways around this.
If you can modify the text file, you could put all your data on a single line per record, and then parse that line in a transform you write. This is the usual approach taken, for example with CSV files.
If you can't modify the files, a simple solution is to retrieve the files with FileIO and then write a custom ParDo containing your own logic for reading them. This is not as simple as using an existing IO out of the box, but is still easier than creating a fully featured Source.
If the files are more complex and you need a more robust solution, you can implement your own Source that reads the file and outputs records in your required format. This would most likely involve using Splittable DoFns and would require a fair amount of knowledge of how a FileBasedSource works.
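For instance, a hedged sketch of the second option (FileIO plus a custom ParDo); the bucket path is a placeholder and each file is assumed to fit in a single worker's memory:

import apache_beam as beam
from apache_beam.io import fileio

class SplitIntoRecords(beam.DoFn):
    # yields one dict per blank-line-separated block of key=value lines
    def process(self, readable_file):
        text = readable_file.read_utf8()
        for block in text.split("\n\n"):
            lines = [ln for ln in block.splitlines() if ln.strip()]
            if lines:
                yield dict(ln.split("=", 1) for ln in lines)

with beam.Pipeline() as p:
    records = (
        p
        | fileio.MatchFiles("gs://my-bucket/input/*.txt")  # placeholder path
        | fileio.ReadMatches()
        | beam.ParDo(SplitIntoRecords())
    )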

Best way to save an np.array or a python list object as a single record in BigQuery?

I have an ML model (text embedding) which outputs a large 1024 length vector of floats, which I want to persist in a BigQuery table.
The individual values in the vector don't mean anything on their own; the entire vector is the feature of interest. Hence, I want to store these lists in a single column in BigQuery as opposed to one column for each float. Adding an extra 1024 columns to a table that otherwise has just 4 or 5 columns seems like a bad idea.
Is there a way of storing a Python list or an np.array in a single column in BigQuery (maybe by converting it to JSON first, or something along those lines)?
Maybe it's not exactly what you were looking for, but the following options are the closest workarounds to what you're trying to achieve.
First of all, you can save your data in a CSV file with one column locally and then load that file into BigQuery. There are also other file formats that can be loaded into BigQuery from a local machine which might interest you. I personally would go with CSV.
I tried this by creating an empty table in my dataset without adding any fields, then using the code mentioned in the first link after saving a column of my random data in a CSV file.
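As an illustration (not necessarily the exact code from the linked example), a rough sketch with the google-cloud-bigquery client, where the whole vector is serialized as a JSON string into a single CSV cell and the project/dataset/table names are placeholders:

import csv
import json
import numpy as np
from google.cloud import bigquery

vector = np.random.rand(1024)

# one column, one row: the whole vector as a JSON string
with open("embedding.csv", "w", newline="") as f:
    csv.writer(f).writerow([json.dumps(vector.tolist())])

client = bigquery.Client()
table_id = "my-project.my_dataset.random_data"  # placeholder
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    autodetect=True,
)
with open("embedding.csv", "rb") as f:
    job = client.load_table_from_file(f, table_id, job_config=job_config)
job.result()  # wait for the load job to finish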
If you encounter the following error regarding the permissions, see this solution. It uses an authentication key instead.
google.api_core.exceptions.Forbidden: 403 GET https://bigquery.googleapis.com/bigquery/v2/projects/project-name/jobs/job-id?location=EU: Request had insufficient authentication scopes.
Also, you might find this link useful, in case you get the following error:
google.api_core.exceptions.BadRequest: 400 Provided Schema does not match Table my-project:my_dataset.random_data. Cannot add fields (field: double_field_0)
Besides loading your data from a local file, you can upload your data file to Google Cloud Storage and load the data from there. Many file formats are supported, such as Avro, Parquet, ORC, CSV and newline-delimited JSON.
Finally, there is the option of streaming the data directly into a BigQuery table by using the API, but it is not available on the free tier.

How to append data to a nested JSON file in Python

I'm creating a program that will need to store different objects in a logical structure on a file which will be read by a web server and displayed to users.
Since the file will contain a lot of information, loading the whole file into memory, appending information and writing the whole file back to the filesystem - as some answers suggest - will prove problematic.
I'm looking for something of this sort:
foods = {
    "fruits": {
        "apple": "red",
        "banana": "yellow",
        "kiwi": "green"
    },
    "vegetables": {
        "cucumber": "green",
        "tomato": "red",
        "lettuce": "green"
    }
}
I would like to be able to add additional data to the table like so:
newFruit = {"cherry":"red"}
foods["fruits"].append(newFruit)
Is there any way to do this in python with JSON without loading the whole file?
That is not possible with pure JSON, appending to a JSON list will always require reading the whole file into memory.
But you could use JSON Lines for that. It's a format where each line is a valid JSON value on its own; it's what AWS uses for their APIs. Your vegetables.json could be written like this:
{"cucumber":"green"}
{"tomato":"red"}
{"lettuce":"green"}
Adding a new entry then becomes very easy, because it just means appending a new line to the end of the file.
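For instance, a minimal sketch of appending to and reading back such a file (the file name and new entry are made up):

import json

new_entry = {"pepper": "red"}

# appending: write one more line, no need to load the existing file
with open("vegetables.jsonl", "a") as f:
    f.write(json.dumps(new_entry) + "\n")

# reading: parse one line at a time (collected into a list here for simplicity)
with open("vegetables.jsonl") as f:
    vegetables = [json.loads(line) for line in f if line.strip()]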
Since the file will contain a lot of information, loading the whole file into memory, appending information and writing the whole file back to the filesystem - as some answers stated - will prove problematic
If your file is really too huge to fit in memory, then either the source JSON should have been split into smaller independent parts, or it's just not a proper use case for JSON. In other words, what you have here is a design issue, not a coding one.
There's at least one streaming JSON parser that may or may not let you solve the issue, depending on the source data structure and the updates you have to make.
That said, given today's computers, you need a really huge JSON file to eat up all your RAM, so before anything else you should probably just check the actual file size and how much memory it takes to parse it into Python.
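One such streaming parser, named here purely as an example and not necessarily the one linked above, is ijson; it can walk a large JSON document without loading it all into memory:

import ijson

# stream the key/value pairs under "fruits" without parsing the whole file
with open("foods.json", "rb") as f:
    for name, colour in ijson.kvitems(f, "fruits"):
        print(name, colour)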

Best format to store data

In my program I read data from a file and then parse it. The format is
data | data | data | data | data
What would be a better format to store the data in?
It must be easily parsed by python and easy to use.
JSON - http://docs.python.org/2/library/json.html
CSV - http://docs.python.org/2/library/csv.html?highlight=csvreader
XML - there's a selection of libraries to choose from, depending on what you need.
Take a look at pickling. You can serialise and write objects to a file and then read them back later.
If the data needs to be read by programs written in other languages consider using JSON.
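For instance, a minimal sketch of the pickle route (json.dump/json.load would look almost identical):

import pickle

records = [["data1", "data2", "data3"], ["data4", "data5", "data6"]]

# serialise the object to disk
with open("records.pkl", "wb") as f:
    pickle.dump(records, f)

# read it back later
with open("records.pkl", "rb") as f:
    restored = pickle.load(f)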
Your data format is fine if you don't need to use the pipe (|) character anywhere. Databases often use pipe-delimited data and it's easily parsed.
CSV (comma-separated values) is a more universal format, but not much different than pipe-separated. Both have some limitations, but for simple data they work fine.
XML is good if you have complex data, but it's a more complicated format. Complicated doesn't necessarily mean better if your needs are simple, so you'll need to think about the data you want to store and whether you want to transfer it to other apps or languages.

Convert .csv file into .dbf using Python?

How can I convert a .csv file into .dbf file using a python script? I found this piece of code online but I'm not certain how reliable it is. Are there any modules out there that have this functionality?
Using the dbf package you can create a basic dbf table from a csv file with code similar to this:
import dbf
some_table = dbf.from_csv(csvfile='/path/to/file.csv', to_disk=True)
This will create a table with the same name, with either Character or Memo fields and field names of f0, f1, f2, etc.
For a different filename use the filename parameter, and if you know your field names you can also use the field_names parameter.
some_table = dbf.from_csv(csvfile='data.csv', filename='mytable',
                          field_names='name age birth'.split())
Rather basic documentation is available here.
Disclosure: I am the author of this package.
You won't find anything on the net that reads a CSV file and writes a DBF file such that you can just invoke it and supply 2 file-paths. For each DBF field you need to specify the type, size, and (if relevant) number of decimal places.
Some questions:
What software is going to consume the output DBF file?
There is no such thing as "the" (one and only) DBF file format. Do you need dBase III? dBase 4? 7? Visual FoxPro? etc.?
What is the maximum length of text field that you need to write? Do you have non-ASCII text?
Which version of Python?
If your requirements are minimal (dBase III format, no non-ASCII text, text <= 254 bytes long, Python 2.X), then the cookbook recipe that you quoted should do the job.
Use the csv library to read your data from the csv file. The third-party dbf library can write a dbf file for you.
Edit: Originally, I listed dbfpy, but the library above seems to be more actively updated.
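As a rough illustration of that csv + dbf combination (the field specs are guesses about your data, and the exact open() call may differ between dbf versions):

import csv
import dbf

# define the target DBF structure up front; types and sizes are assumptions
table = dbf.Table("mytable.dbf", "name C(25); age N(3,0)")
table.open(mode=dbf.READ_WRITE)

with open("data.csv", newline="") as f:
    for name, age in csv.reader(f):
        table.append((name, int(age)))

table.close()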
None that are well-polished, to my knowledge. I have had to work with xBase files many times over the years, and I keep finding myself writing code to do it when I have to do it. I have, somewhere in one of my backups, a pretty functional, pure-Python library to do it, but I don't know precisely where that is.
Fortunately, the xBase file format isn't all that complex. You can find the specification on the Internet, of course. At a glance the module that you linked to looks fine, but of course make copies of any data that you are working with before using it.
A solid, read/write, fully functional xBase library with all the bells and whistles is something that has been on my TODO list for a while... I might even get to it in what is left this year, if I'm lucky... (probably not, though, sadly).
I have created a Python script here. It should be customizable for any csv layout. You do need to know your DBF data structure before this will be possible. The script requires two csv files: one for your DBF header setup and one for your body data. Good luck.
https://github.com/mikebrennan/csv2dbf_python
