What serialization format is this, and are there any libraries to parse it back to Python-native data structures or at least something easier to manage?
At least it looks like it could have a 1:1 equivalent in Python.
%xt%tableFameUpdate%-1%{"season":[1.329534083671E9,"160",53255],"leaderboard":[["1001:6587656216929005792","1718","Kjeld","http:/..."],["1001:6301086609221020111","802","Asti","http://..."],["1018:995158152656680513","419","QiZOra","http://..."],["1018:8494206166685317681","364","Bingay","http://..."],["1:100000380528383","160","...","http://..."]],"multipliers":{"1001:6835768553933918921":67,"1001:4106589374707547411":0,"1001:5353968490097996024":0,"1018:1168770734837476224":0,"1018:8374571792147098127":0,"1001:4225536539330822139":0,"1:100000380528383":0,"1001:4082457720357735190":68,"1001:1650191466786177826":0,"1001:4299232509980238095":38,"1001:7604050184057349633":0,"1001:6587656216929005792":0,"1001:3852516077423175846":0,"1001:888471333619738847":9,"1001:7823244004315560346":0,"1001:7665905871463311833":0,"1001:4453073160237910447":0,"1001:6338802281112620503":64,"1001:7644306056081384910":13,"1001:4956919992342871722":0,"1001:4126528826861913228":29,"1001:7325864606573096759":47,"1001:6494182198787618518":16,"1001:3678910058012926187":4,"1001:435065490460532259":39,"1001:5366593356123167358":0,"1001:6041488907938219046":8,"1001:6051083835382544277":5,"1001:9187877490300372546":0,"1001:482518425014054339":0}}%
If you strip off the first piece and the last percent sign, it is JSON, which you could parse with any JSON parser. It looks like it's using the percent signs as a sort of delimiter, so you could probably split on those.
Apart from the few characters at the beginning, this looks like JSON.
%xt%tableFameUpdate%-1% is not JSON, but the rest is. There are a lot of JSON parsers for Python; pick one and it should parse your data without a hitch.
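A minimal sketch of that approach, using the standard-library json module. The function name and the labels given to the leading fields (prefix, command, field) are made up for illustration, since the post does not say what those fields mean, and the sample message is an abbreviated version of the one in the question.

```python
import json

def parse_xt_message(raw):
    """Split a '%xt%<name>%<field>%<json>%' message into its parts."""
    # Split on the first four percent signs only, so that a '%' inside
    # the JSON payload (if one ever appears) is left alone.
    _, prefix, command, field, rest = raw.split("%", 4)
    # Drop the single trailing '%' that terminates the message.
    payload = rest[:-1] if rest.endswith("%") else rest
    return prefix, command, field, json.loads(payload)

# Abbreviated sample taken from the message in the question.
sample = ('%xt%tableFameUpdate%-1%{"season":[1.329534083671E9,"160",53255],'
          '"multipliers":{"1:100000380528383":0}}%')

prefix, command, field, data = parse_xt_message(sample)
print(command, field, data["season"][2])
```

The result is ordinary Python: data is a dict, the leaderboard and season entries become lists, and the numbers become int or float.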
I downloaded the page source (html) of websites with Selenium (Python). And I wish to find all base 64 encoded strings in html files.
Is there a known structure to all base64-encoded strings in HTML? From my observations, it seems like they start with ;base64 followed by the encoded string and finally a closing bracket ). Is that accurate?
From Wikipedia, the encoded string must also be composed of the following characters: ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/. Can someone also confirm that?
Thanks a lot in advance!
* Edit 1 *
Thanks a lot Tris! The link you provided is very helpful! However, from that, it seems like there is no specific format for the end of a base64 string. If I want to detect its end, what advice would you give other than looking for )?
I mainly want to track the changes of a bunch of websites, and the base64 encodings contain a lot of data that are not relevant for my use. To save storage, I therefore intend to remove them. An example is www.amd.com, which has the following data:image/png;base64,... (after being rendered by browser).
Since there are many different websites, I don't know all of their formats. Here are some other examples of the base64 strings that I found and are not useful to me:
data:font/truetype;base64,AAEAAA...
...
Several of the examples that I saw ended with a closing bracket ). May I ask under what circumstances they would end with ) and when they would not?
Thanks again!
Not all base64-encoded strings will include a ;base64 at the beginning of them -- this is typically specific to data URLs. If you are specifically looking for base64-encoded images or other inline elements that would otherwise be referred to with an HTTP URL, this might be fine. The closing bracket is not typically relevant; I haven't seen it required on data URLs or other base64-encoded strings.
Typically, base64-encoded strings use the alphabet you've mentioned -- ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/. If the input length is not a multiple of 3 bytes, the output is padded with one or two = characters at the end so that its length is a multiple of 4.
There is another commonly used base64 format on the web -- the URL-safe base64 format. In this encoding, + and / are typically replaced with - and _ so they can be included in URLs safely, hence the name.
This information may be irrelevant if you know more about the structure of the websites you are trying to parse, aside from just "they contain base64-encoded string data."
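For the use case in the question (dropping inline base64 data to save storage), one possible sketch is a regular expression that finds data URLs and cuts the payload at the first character outside the base64 alphabets. That also covers the end-of-string question: the ) in the examples is most likely the close of a CSS url(...) wrapper, and a quote or bracket from an HTML attribute would end the match in the same way. The pattern, the function name and the sample input are illustrative assumptions, not a complete data URL parser.

```python
import re

# 'data:<mime type>[;param...];base64,' followed by the payload.
# The payload is the longest run of characters from the standard and
# URL-safe alphabets, then at most two '=' padding characters.
DATA_URL_RE = re.compile(
    r"(data:[\w.+/-]*(?:;[\w=.+-]+)*;base64,)[A-Za-z0-9+/_-]+={0,2}"
)

def strip_base64_payloads(html, placeholder="..."):
    """Replace each base64 payload with a placeholder, keeping the prefix."""
    return DATA_URL_RE.sub(lambda m: m.group(1) + placeholder, html)

# Made-up sample input: one data URL in CSS, one in an HTML attribute.
sample = ('background:url(data:image/png;base64,iVBORw0KGgo=) no-repeat; '
          '<img src="data:font/truetype;base64,AAEAAA==">')
print(strip_base64_payloads(sample))
```

Keeping the data:...;base64, prefix means a later diff still shows that an inline resource was present, just not its contents.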
What I'm trying to do is write a GUI program in Python that displays the contents of XML files with some basic features like syntax highlighting (this isn't the only thing it needs to do, but it is one of the things).
To do this, I figured I should use an xml parsing package such as lxml or ElementTree. When I render the xml, I would like to be able to use the data structure produced by the parser to do things like syntax highlighting (or whatever). lxml has a "sourceline" property, but nothing for column numbers as far as I can tell.
Am I going about this the right way? Is there a better way to accomplish what I want? Otherwise I think I will have to write my own XML parser, which I'm not enthusiastic about.
I would like to preface my question that this is the first time I've interacted with an API and JSON as I'm typically more on the Database sides of things.
With that, I'm a little confused with one of the APIs I'm currently working with.
I have a vendor that has an API that allows me to pull down some information about some of the users of that service. The problem is that the response seems not to be in JSON or, if it is, it isn't a form of JSON that I have seen.
The response looks like this.
{"Header":"Field1,Field2,Field3,Field4", "Rows":["Row1Value1,Row1Value2,Row1Value3,Row1Value4","Row2Value1,Row2Value2,Row2Value3,Row2Value4"]}
This seems wrong compared with everything that I've been doing with JSON so far. I'm unable to interpret this as anything usable in Python or PowerShell.
Is this a type of format? Or is this some weird thing that this vendor has generated that isn't JSON and needs to be interpreted as its own thing?
It looks like a half-JSON implementation; the outer containers look like JSON, and you get a JSON list for the rows, but the inner contents of Header and each row in Rows look like strings you'll need to tokenize yourself (split on commas).
I think there is a bit of confusion here. JSON stands for JavaScript Object Notation. Any text that follows the JSON grammar, which allows only strings, numbers, booleans, null, arrays, and objects, is JSON.
So, is this JSON? Yes, beyond doubt. Is this good JSON? Not really. Ideally the fields would arrive as separate JSON values so that you could load the response straight into a tabular form, but here you have to split the strings yourself.
Using simple string manipulation (split()), you can easily parse the rows and restructure them to your heart's content.
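As a rough sketch of that restructuring: parse the outer JSON first, then split the inner strings. This version swaps str.split(",") for the standard-library csv module, on the assumption (not confirmed by the vendor's response) that a value containing a comma would be quoted CSV-style. The function name is made up for illustration.

```python
import csv
import json

def rows_from_response(text):
    """Turn the vendor's {"Header": "...", "Rows": [...]} into a list of dicts."""
    doc = json.loads(text)
    # csv.reader accepts any iterable of strings, so each string is
    # treated as one CSV line.
    header = next(csv.reader([doc["Header"]]))
    return [dict(zip(header, fields)) for fields in csv.reader(doc["Rows"])]

# The response exactly as shown in the question.
response = ('{"Header":"Field1,Field2,Field3,Field4", '
            '"Rows":["Row1Value1,Row1Value2,Row1Value3,Row1Value4",'
            '"Row2Value1,Row2Value2,Row2Value3,Row2Value4"]}')

records = rows_from_response(response)
print(records[0]["Field2"])
```

Each record is a plain dict keyed by the header names, which is usually the easiest shape to load into a database or a table.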
In my program I read data from a file and then parse it. The format is
data | data | data | data | data
What is a better format to store data in?
It must be easily parsed by Python and easy to use.
JSON - http://docs.python.org/2/library/json.html
CSV - http://docs.python.org/2/library/csv.html?highlight=csvreader
XML - there's a selection to choose from depending what you need.
Take a look at pickling. You can serialise and write objects to a file and then read them back later.
If the data needs to be read by programs written in other languages consider using JSON.
Your data format is fine if you don't need to use the pipe (|) character anywhere. Databases often use pipe-delimited data and it's easily parsed.
CSV (comma-separated values) is a more universal format, but not much different than pipe-separated. Both have some limitations, but for simple data they work fine.
XML is good if you have complex data, but it's a more complicated format. Complicated doesn't necessarily mean better if your needs are simple, so you'd need to think about the data you want to store, and if you want to transfer it to other apps or languages.
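To illustrate how close these options are in practice, here is a small sketch that reads the existing pipe-delimited layout with the csv module and writes the same rows out as JSON. The sample text and the function name are made up; a real program would pass an open file instead of io.StringIO.

```python
import csv
import io
import json

def read_pipe_delimited(lines):
    """Read 'data | data | data' lines into a list of lists of strings."""
    reader = csv.reader(lines, delimiter="|", skipinitialspace=True)
    # skipinitialspace drops the space after each '|'; strip() removes
    # the space that precedes the next one.
    return [[field.strip() for field in row] for row in reader]

# Made-up sample in the layout from the question.
text = "a | b | c | d | e\n1 | 2 | 3 | 4 | 5\n"

rows = read_pipe_delimited(io.StringIO(text))
as_json = json.dumps(rows)
print(as_json)
```

If the data never needs to leave the program, keeping the pipe format and reading it this way is probably the smallest change.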
I'm using Google App Engine and python for a web service. Some of the models (tables) I have in my web service have several binary data fields in them, and I'd like to present this data to a computer requesting it, all fields at the same time. Now, the problem is I don't know how to write it out in a way that the other computer knows where the first data ends and the other starts. I've been using JSON for all the things that aren't binary, but afaik JSON doesn't work for binary data. So how do you get around this?
You could of course separate the data and put it in its own model, and then reference it back to some metadata model. That would allow you to make a single page that just prints one data field of one of the items, but that is trappy both server and client implementation wise.
Another solution would be to put in some kind of separator, and just split the data on that. I suppose it would work and that's how you do it, but isn't there like a standardized way to do that? Any libraries I could use?
In short, I'd like to be able to do something like this:
binaryDataField1: data data data ...
binaryDataField2: data data data ...
etc
Several easy options:
base64 encode your data - meaning you can still use JSON.
Use Protocol Buffers.
Prefix each field with its length - either as a 4- or 8- byte integer, or as a numeric string.
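The length-prefix option can be sketched with the standard-library struct module. This assumes a 4-byte big-endian unsigned length in front of each field; the framing and the function names are choices made for illustration, not an established wire format.

```python
import struct

def pack_fields(fields):
    """Concatenate byte strings, each preceded by its 4-byte length."""
    out = bytearray()
    for data in fields:
        out += struct.pack(">I", len(data))  # big-endian unsigned 32-bit
        out += data
    return bytes(out)

def unpack_fields(blob):
    """Reverse pack_fields: walk the blob one length-prefixed field at a time."""
    fields = []
    offset = 0
    while offset < len(blob):
        (length,) = struct.unpack_from(">I", blob, offset)
        offset += 4
        if offset + length > len(blob):
            raise ValueError("truncated field")
        fields.append(blob[offset:offset + length])
        offset += length
    return fields

blob = pack_fields([b"\x00\x01", b"", b"abc%|"])
print(unpack_fields(blob))
```

Because the reader always knows how many bytes to take, the data itself can contain any byte value, which is the main advantage over picking a separator.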
One solution that would leverage your JSON investment would be to simply convert the binary data to something that JSON can support. For example, Base64 encoding might work well for you. You could treat the output of your Base64 encoder just like you would a normal string in JSON. It looks like Python has Base64 support built in, though I only use Java on App Engine, so I can't guarantee that the linked library works in the sandbox.
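A minimal sketch of the Base64-in-JSON idea, using only the standard-library base64 and json modules. The field names come from the question; the function names are made up, and whether this runs unchanged in the App Engine sandbox is not something verified here.

```python
import base64
import json

def encode_record(fields):
    """Serialize a dict of name -> bytes as JSON with Base64 string values."""
    return json.dumps({
        name: base64.b64encode(data).decode("ascii")
        for name, data in fields.items()
    })

def decode_record(text):
    """Reverse encode_record: decode every value back to bytes."""
    return {
        name: base64.b64decode(value)
        for name, value in json.loads(text).items()
    }

record = {
    "binaryDataField1": b"\x00\x01\x02",
    "binaryDataField2": b"\xff\xfe",
}
wire = encode_record(record)
print(wire)
```

The cost is size: Base64 output is roughly a third larger than the raw bytes, which matters if the fields are big.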