Parsing an xml file and storing it into a database - python

Is there a generic/automatic way in R or in Python to parse XML files with their nodes and attributes, automatically generate MySQL tables for storing that information, and then populate those tables?

Regarding "Is there a generic/automatic way in R to parse XML files with their nodes and attributes, automatically generate MySQL tables for storing that information, and then populate those tables?", the answer is a good old "yes you can", at least in R.
The XML package for R can read XML documents and return R data.frame types in a single call using the xmlToDataFrame() function.
And the RMySQL package can transfer data.frame objects to the database in a single command---including table creation if need be---using the dbWriteTable() function defined in the common DBI backend for R and provided for MySQL by RMySQL.
So in short: two lines can do it, so you can easily write yourself a small helper function that wraps them along with a commensurate amount of error checking.
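That answer is about R, but since the question also asks about Python, here is a rough Python analogue of the same two-call idea (not part of the answer above): it assumes a flat XML document where each child element becomes one row, and the file name, table name, and connection string are hypothetical placeholders.

import pandas as pd
from sqlalchemy import create_engine

# Roughly analogous to xmlToDataFrame(): one call from XML to a DataFrame.
df = pd.read_xml("records.xml")

# Roughly analogous to dbWriteTable(): one call creates and populates the table.
engine = create_engine("mysql+pymysql://user:password@localhost/mydb")
df.to_sql("records", engine, if_exists="replace", index=False)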

These are three separate operations: parsing, table creation, and data population. You can do all three with Python, but there's nothing "automatic" about it, and I don't think it's always easy: XML is hierarchical while SQL is relational and set-based, so you can't always derive a good relational schema from every XML stream you might encounter.
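For the simple case of a flat XML document, a minimal sketch of doing those three steps by hand might look like the following; the element names, table name, and connection details are hypothetical placeholders, and nested XML would first need a real schema design.

import xml.etree.ElementTree as ET
import mysql.connector

# Parsing: read the whole document (assumes <records><record>...</record>...</records>).
root = ET.parse("records.xml").getroot()

conn = mysql.connector.connect(user="user", password="password", database="mydb")
cur = conn.cursor()

# Table creation: derive TEXT columns from the first record's child tags
# (the tag names are trusted here, which is fine for a local file you control).
columns = [child.tag for child in root[0]]
cur.execute(
    "CREATE TABLE IF NOT EXISTS records ({})".format(
        ", ".join("{} TEXT".format(c) for c in columns)
    )
)

# Data population: one parameterized INSERT per record element.
insert = "INSERT INTO records ({}) VALUES ({})".format(
    ", ".join(columns), ", ".join(["%s"] * len(columns))
)
for record in root:
    cur.execute(insert, [record.findtext(c) for c in columns])

conn.commit()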

There's the XML package for reading XML into R, and the RMySQL package for writing data from R into MySQL.
Between the two there's a lot of work. XML goes well beyond the scope of an RDBMS like MySQL, so something that could handle any XML thrown at it would be either ridiculously complex or trivially useless.

We do something like this at work sometimes, but not in Python. In that case each usage requires a custom program to be written, since we only have a SAX parser available; an XML decoder that returns a dictionary/hash in a single step would help a lot.
At the very least you'd have to tell it which tags map to which tables and fields; no pre-existing lib can know that...

Related

How can I safely parameterize table/column names in BigQuery SQL?

I am using python's BigQuery client to create and keep up-to-date some tables in BigQuery that contain daily counts of certain firebase events joined with data from other sources (sometimes grouped by country etc.). Keeping them up-to-date requires the deletion and replacement of data for past days because the day tables for firebase events can be changed after they are created (see here and here). I keep them up-to-date in this way to avoid querying the entire dataset which is very financially/computationally expensive.
This deletion and replacement process needs to be repeated for many tables and so consequently I need to reuse some queries stored in text files. For example, one deletes everything in the table from a particular date onward (delete from x where event_date >= y). But because BigQuery disallows the parameterization of table names (see here) I have to duplicate these query text files for each table I need to do this for. If I want to run tests I would also have to duplicate the aforementioned queries for test tables too.
I basically need something like psycopg2.sql for bigquery so that I can safely parameterize table and column names whilst avoiding SQLi. I actually tried to repurpose this module by calling the as_string() method and using the result to query BigQuery. But the resulting syntax doesn't match and I need to start a postgres connection to do it (as_string() expects a cursor/connection object). I also tried something similar with sqlalchemy.text to no avail. So I concluded I'd have to basically implement some way of parameterizing the table name myself, or implement some workaround using the python client library. Any ideas of how I should go about doing this in a safe way that won't lead to SQLi? Cannot go into detail but unfortunately I cannot store the tables in postgres or any other db.
As discussed in the comments, the best option for avoiding SQLi in your case is ensuring your server's security.
If you still need or want to validate your input parameters before building your query, I recommend using a regex to check the input strings.
In Python you could use the re module.
Since I don't know how your code works, how your datasets/tables are organized, or exactly how you plan to check whether a string is a valid source, the basic example below shows how you could check a string with this module:
import re

tests = ["your-dataset.your-table", "(SELECT * FROM <another table>)", "dataset-09123.my-table-21112019"]

# Supposing that the input pattern is <dataset>.<table>
regex = re.compile(r"[a-zA-Z0-9-]+\.[a-zA-Z0-9-]+")

for t in tests:
    # fullmatch() requires the whole string to fit the pattern, not just a prefix.
    if regex.fullmatch(t):
        print("This source is ok")
    else:
        print("This source is not ok")
In this example, only strings that match the pattern dataset.table (where both the dataset and the table may contain only alphanumeric characters and dashes) are considered valid.
When running the code, the first and third elements of the list are accepted, while the second (which could potentially change your whole query) is rejected.
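Building on that check, here is a hedged sketch of how the validated identifier could then be used with the BigQuery Python client: the table name is interpolated only after the regex check passes, while actual values still go through proper query parameters. The dataset/table names and the event_date column are hypothetical.

import re
from google.cloud import bigquery

TABLE_RE = re.compile(r"[a-zA-Z0-9-]+\.[a-zA-Z0-9-]+")

def delete_from_date(client, table, cutoff_date):
    # Refuse anything that doesn't look like <dataset>.<table>.
    if not TABLE_RE.fullmatch(table):
        raise ValueError("invalid table identifier: {!r}".format(table))
    sql = "DELETE FROM `{}` WHERE event_date >= @cutoff".format(table)
    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter("cutoff", "DATE", cutoff_date),
        ]
    )
    client.query(sql, job_config=job_config).result()

# Usage (assumes credentials are configured):
# delete_from_date(bigquery.Client(), "my_dataset.daily_counts", "2020-01-01")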

Method for converting diverse JSON files into RDBMS schema?

I have a large number of JSON documents. I would like to store them in an RDBMS for querying. Once there they will never change; it's a data warehousing issue. I have lots of RDBMS data that I want to match the JSON data with, so it would be inefficient to store the JSON in a more traditional manner (e.g. CouchDB).
From hunting the web, I gather that the best approach might be to create JSON schema files using a tool such as JSON Schema Generator and then use that to build a structured RDBMS series of tables. My data is sufficiently limited in scope (minimal JSON nesting) that I could do this by hand if needed, but a tool that automatically converted from JSON schema to DB DDL statements would be great if it is out there.
My question has two parts but is aimed at the first issue: is there a tool or method by which I can create a master schema that describes all of my data, given that many instances are missing various fields (and I have tens of gigabytes of JSON data)? The second part concerns the serialization process: does there exist a library (ideally Python) that would take a schema file and a JSON object and output the DML to insert it into an RDBMS?
We just published this package at https://github.com/deepstartup/jsonutils. Maybe you will find it useful. If you need us to update something, open a JIRA ticket.
Try:
pip install DDLJ
from DDLj import genddl
genddl(*param1,param2,*param3,*param4)
where
param1 = JSON schema file
param2 = database (default Oracle)
param3 = glossary file
param4 = DDL output script
Some draft Python for converting JSON to DDL. You'll have to adapt it for JSON schema.
import json
import sys

# Load the JSON document named on the command line.
with open(sys.argv[1]) as fp:
    jsobj = json.load(fp)

# Emit a very rough CREATE TABLE statement from the top-level "fields" list.
# "mytable" is just a placeholder name.
print("CREATE TABLE mytable (")
print(",\n".join("    {} {}".format(elt["name"], elt["type"]) for elt in jsobj["fields"]))
print(");")

How should I append to small file-like objects in mongodb?

I need an interface to mongodb by which I can treat data in a collection like a standard python file-like object. These will be fairly small files (measured in kilobytes, at most) and in particular I need the ability to append to these so-called files. (So this question is not a dupe.)
I have read the GridFS documentation, and in particular it says I should not use it for small files. The only other implementations I've been able to find have all been PHP. I'm not really looking for help writing any specifics of the code, but implementing the entire file api seems a daunting task.
Are there any shortcuts or tools to make it easier to implement file-like objects in python 2?
Am I missing that someone has already done this?
(Why am I doing this? Because I received an eleventh-hour requirement that we deploy a pre-existing application that produces csv files on a multinode cloud environment that cannot transparently handle files.)
For question 1: check out the io module, and especially IOBase, which defines the file-like interfaces in terms of a fairly sensible set of methods.
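As one possible shortcut along those lines, here is a minimal, write-only sketch (not a complete file API) of an IOBase subclass that appends each write as a chunk onto a single MongoDB document; the collection layout and names are hypothetical.

import io

class MongoAppendFile(io.RawIOBase):
    """Append-only file-like object backed by one MongoDB document."""

    def __init__(self, collection, name):
        self._collection = collection
        self._name = name

    def writable(self):
        return True

    def write(self, b):
        data = bytes(b)
        # Each write is a single atomic $push of one small chunk.
        self._collection.update_one(
            {"name": self._name},
            {"$push": {"chunks": data}},
            upsert=True,
        )
        return len(data)

# Usage with pymongo (hypothetical database and collection):
# f = MongoAppendFile(pymongo.MongoClient().mydb.files, "report.csv")
# f.write(b"a,b,c\n")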
You could just store the data as binary, or text, in a MongoDB collection. But you'd have two problems:
You'd have to implement as much of the Python file protocol as your other code expects to have implemented.
When you append to the "file", the document would grow in MongoDB and possibly need to be moved on disk to a location with enough space to hold the larger document. Moving documents is expensive.
Go with GridFS: the documentation discourages you from using it for small files, but for your case it's perfect, because PyMongo has already done the work of implementing Python's file protocol for MongoDB data. To append to a GridFS file you must read it, save a new version with the additional data, and delete the previous version, but this isn't much more expensive than moving a grown document anyway.
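A minimal sketch of that read/rewrite/delete cycle with PyMongo's gridfs module, assuming a database handle db and a logical filename; error handling is kept to the bare minimum.

import gridfs
from gridfs.errors import NoFile

def append_to_gridfs_file(db, filename, new_data):
    fs = gridfs.GridFS(db)
    existing = b""
    old_id = None
    try:
        old = fs.get_last_version(filename)
        existing = old.read()
        old_id = old._id
    except NoFile:
        pass  # first write: nothing to append to yet
    # Save the new version, then drop the old one.
    fs.put(existing + new_data, filename=filename)
    if old_id is not None:
        fs.delete(old_id)

# Usage (hypothetical database):
# append_to_gridfs_file(pymongo.MongoClient().mydb, "report.csv", b"a,b,c\n")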

Storing JSON in MySQL?

I have some things that do not need to be indexed or searched (game configurations) so I was thinking of storing JSON on a BLOB. Is this a good idea at all? Or are there alternatives?
If you need to query based on the values within the JSON, it would be better to store the values separately.
If you are just loading a set of configurations like you say you are doing, storing the JSON directly in the database works great and is a very easy solution.
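As a sketch of that easy solution, the configuration can be dumped to a JSON string and stored in an ordinary TEXT/BLOB column; the table, column, and connection details below are hypothetical, and mysql-connector-python is just one driver choice.

import json
import mysql.connector

conn = mysql.connector.connect(user="game", password="secret", database="games")
cur = conn.cursor()

# Write: serialize the config dict to a JSON string and insert it as a plain value.
config = {"difficulty": "hard", "fullscreen": True}
cur.execute(
    "INSERT INTO game_configs (name, config_json) VALUES (%s, %s)",
    ("default", json.dumps(config)),
)
conn.commit()

# Read: fetch the string back and parse it.
cur.execute("SELECT config_json FROM game_configs WHERE name = %s", ("default",))
loaded = json.loads(cur.fetchone()[0])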
No different than people storing XML snippets in a database (that doesn't have XML support). Don't see any harm in it, if it really doesn't need to be searched at the DB level. And the great thing about JSON is how parseable it is.
I don't see why not. As a related real-world example, WordPress stores serialized PHP arrays as a single value in many instances.
I think it's better to serialize your data; if you are using Python, cPickle is a good choice.

Aggregating multiple feeds with Universal Feed Parser

Having great luck working with single-source feed parsing in Universal Feed Parser, but now I need to run multiple feeds through it and generate chronologically interleaved output (not RSS). Seems like I'll need to iterate through URLs and stuff every entry into a list of dictionaries, then sort that by the entry timestamps and take a slice off the top. That seems do-able, but pretty expensive resource-wise (I'll cache it aggressively for that reason).
Just wondering if there's an easier way - an existing library that works with feedparser to do simple aggregation, for example. Sample code? Gotchas or warnings? Thanks.
You could throw the feeds into a database and then generate a new feed from this database.
Consider looking into two feedparser-based RSS aggregators: Planet Feed Aggregator and FeedJack (Django based), or at least how they solve this problem.
There is already a suggestion here to store the data in a database, e.g. with bsddb.btopen() or any RDBMS.
If you'd rather merge the data in memory, take a look at heapq.merge() and bisect.insort(), or use one of the B-tree implementations.
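For the in-memory route, a minimal sketch with heapq.merge() might look like this; it assumes every entry carries a published_parsed timestamp and a title, and the feed URLs are hypothetical.

import heapq
import feedparser

urls = ["http://example.com/a.rss", "http://example.com/b.rss"]
feeds = [feedparser.parse(u) for u in urls]

# Sort each feed's entries newest-first, then lazily merge the sorted streams.
sorted_feeds = [
    sorted(f.entries, key=lambda e: e.published_parsed, reverse=True)
    for f in feeds
]
merged = heapq.merge(*sorted_feeds, key=lambda e: e.published_parsed, reverse=True)

# Take a slice off the top: the 20 newest entries across all feeds.
for entry in list(merged)[:20]:
    print(entry.published_parsed, entry.title)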
