Serializing data and unpacking safely from an untrusted source - Python

I am using Pyramid as a basis for transfer of data for a turn-based video game. The clients use POST data to present their actions, and GET to retrieve serialized game board data. The game data can sometimes involve strings, but is almost always two integers and two tuples:
gamedata = (userid, gamenumber, (sourcex, sourcey), (destx, desty))
My general client-side approach was to pickle the data, encode it as base64, urlencode it, and submit the POST. The server then receives the POST, unpacks the single-item dictionary, decodes the base64, and unpickles the data object.
I want to use pickle because it preserves classes and typed values; submitting game data as plain POST fields gives me only strings.
However, pickle is regarded as unsafe, so I turned to PyYAML, which serves the same purpose. Using yaml.safe_load(data), I can deserialize data without exposing security flaws. However, safe_load is VERY safe: it refuses even a harmless tuple of integers, because yaml.dump tags tuples as !!python/tuple, which safe_load will not construct.
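A minimal illustration of the problem, assuming PyYAML's default dumper:

import yaml

# yaml.dump tags Python tuples with !!python/tuple
text = yaml.dump((10, 12))    # something like '!!python/tuple [10, 12]'

# safe_load refuses to construct that tag and raises ConstructorError
yaml.safe_load(text)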
Is there some middle ground here? Is there a way to serialize python structures without at the same time allowing execution of arbitrary code?
My first thought was to write a wrapper for my send and receive functions that uses underscores in value names to recreate tuples, e.g. sending would convert the dictionary value source: (x, y) to source_0: x, source_1: y. My second thought was that this wasn't a very wise way to develop.
Edit: here's my implementation using JSON. It doesn't seem as powerful as YAML or pickle, but I'm still concerned there may be security holes.
The client side was written a bit more verbosely while I experimented:
import base64, json
from urllib.parse import urlencode
from urllib.request import urlopen

arbitrarydata = {'id': 14, 'gn': 25, 'sourcecoord': (10, 12), 'destcoord': (8, 14)}
jsondata = json.dumps(arbitrarydata)
# urlsafe_b64encode needs bytes, so encode the JSON string first
b64data = base64.urlsafe_b64encode(jsondata.encode('utf-8'))
# urlopen wants the POST body as bytes as well
transmitstring = urlencode([('data', b64data)]).encode('ascii')
urlopen('http://127.0.0.1:9000/post', transmitstring).read()
The Pyramid server can then retrieve the data object:
json.loads(base64.urlsafe_b64decode(request.POST['data'].encode('ascii')))
On an unrelated note, I'd love to hear some other opinions about the acceptability of using POST data this way; my game client is in no way browser-based at this time.

Why not use colander for your serialization and deserialization? Colander turns an object schema into a simple data structure and vice versa, and you can use JSON to send and receive this information.
For example:
import colander
supported_languages = ['en', 'fr', 'de']   # assumed for the example; left undefined in the original

class Item(colander.MappingSchema):
    thing = colander.SchemaNode(colander.String(),
                                validator=colander.OneOf(['foo', 'bar']))
    flag = colander.SchemaNode(colander.Boolean())
    language = colander.SchemaNode(colander.String(),
                                   validator=colander.OneOf(supported_languages))

class Items(colander.SequenceSchema):
    item = Item()
The above setup defines a list of item objects, but you can easily define game-specific objects too.
Deserialization becomes:
items = Items().deserialize(json.loads(jsondata))
and serialization is:
json.dumps(Items().serialize(items))
Apart from letting you round-trip Python objects, it also validates the serialized data to ensure it fits your schema and hasn't been mucked about with.
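For instance, here is a sketch of what a schema for the original question's game data might look like, using colander's TupleSchema for the coordinate pairs (names taken from the question; this builds on the snippet above):

import json

class Coordinate(colander.TupleSchema):
    x = colander.SchemaNode(colander.Integer())
    y = colander.SchemaNode(colander.Integer())

class GameData(colander.MappingSchema):
    userid = colander.SchemaNode(colander.Integer())
    gamenumber = colander.SchemaNode(colander.Integer())
    sourcecoord = Coordinate()
    destcoord = Coordinate()

# deserialize() raises colander.Invalid if the payload doesn't match the schema
move = GameData().deserialize(json.loads(jsondata))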

How about json? The module is part of the Python standard library, and it allows serialization of most generic data without any risk of arbitrary code execution.
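A quick sketch of the round trip; note that JSON has no tuple type, so tuples come back as lists:

import json

gamedata = {'userid': 14, 'gamenumber': 25, 'sourcecoord': (10, 12)}
text = json.dumps(gamedata)     # '{"userid": 14, "gamenumber": 25, "sourcecoord": [10, 12]}'
restored = json.loads(text)     # 'sourcecoord' is now the list [10, 12]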

I don't see raw JSON providing the answer here, as the question specifically mentions pickling classes and values. Straight JSON cannot serialize and deserialize arbitrary Python classes, while pickle can.
I use a pickle-based serialization method for almost all server-to-server communication, but always with a very serious authentication mechanism (e.g. RSA key-pair matching). That means I only deal with trusted sources.
If you absolutely need to work with untrusted sources, I would, at the very least, try to add (much as @MartijnPieters suggests) a schema to validate your transactions. I don't think there is a good way to work with arbitrary pickled data from an untrusted source: you'd have to parse the byte string with some disassembler and then allow only trusted patterns (or block untrusted ones), and I don't know of anything that can do this for pickle.
However, if your class is "simple enough"… you might be able to use JSONEncoder, which essentially converts your Python class to something JSON can serialize… and thus validate…
How to make a class JSON serializable
The catch, however, is that you have to write a custom encoder that derives from JSONEncoder.
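A minimal sketch of that approach, with a hypothetical Coord class standing in for your game objects:

import json

class Coord:
    def __init__(self, x, y):
        self.x, self.y = x, y

class GameEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, Coord):
            return {'x': obj.x, 'y': obj.y}    # reduce our class to plain JSON types
        return super().default(obj)            # anything else fails loudly

json.dumps({'source': Coord(10, 12)}, cls=GameEncoder)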

Related

Convert GraphQLResponse dictionary to python object

I am running a GraphQL query using aiographql-client and getting back a GraphQLResponse object, which contains a raw dict as part of the response JSON data.
This dictionary conforms to a schema, which I am able to parse into a graphql.type.schema.GraphQLSchema type using graphql-core's build_schema method.
I can also correctly get the GraphQLObjectType of the object that is being returned, however I am not sure how to properly deserialize the dictionary into a python object with all the appropriate fields, using the GraphQLObjectType as a reference.
Any help would be greatly appreciated!
I'd recommend using Pydantic to do the heavy lifting in the parsing.
You can then either generate the models beforehand and select the ones you need based on GraphQLObjectType or generate them at runtime based on the definition returned by build_schema.
If you really must define your models at runtime, you can do that with pydantic's create_model function, described here: https://pydantic-docs.helpmanual.io/usage/models/#dynamic-model-creation
For the static model generation you can probably leverage something like https://jsontopydantic.com/
If you share some code samples, I'd be happy to give some more insight on the actual implementation.
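For example, a minimal sketch of the dynamic route with create_model (the field names here are assumptions, not derived from your schema):

from pydantic import create_model

# each field is declared as (type, default); ... marks the field as required
Hero = create_model('Hero', name=(str, ...), appears_in=(list, []))

hero = Hero(**{'name': 'R2-D2', 'appears_in': ['NEWHOPE']})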
I faced the same tedious problem while developing a personal project.
Because of that, I just published a library whose purpose is to manage the mappings between Python objects and GraphQL objects (Python objects -> GraphQL queries, and GraphQL responses -> Python objects):
https://github.com/dapalex/py-graphql-mapper
So far it manages only basic queries and responses; if it proves useful, I will keep implementing more features.
Have a look and see if it can help you.
Coming back to this, there are some projects out there trying to achieve this functionality:
https://github.com/enra-GmbH/graphql-codegen-ariadne
https://github.com/sauldom102/gql_schema_codegen

dictionary | class | namedtuple from YAML

I have a large-ish YAML file (~40 lines) that I'm loading using PyYAML. This is of course parsed into a large-ish dictionary plus a couple of arrays.
My question is: how should I manage the data? I can of course leave it in the output dictionary and work through the data, but I was wondering whether it's better to wrap the data in a class or use a namedtuple to hold it.
Any first-hand experience with that?
Whether you post-process the data structure into a class or not primarily has to do with how you are using that data. The same applies to the decision whether to use a tag or not and load (some of) the data from the YAML file into a specific instance of a class that way.
The primary advantage of using a class in both cases (post-processing, tagging) is that you can do additional consistency checks during initialisation that are not done on the key-value pairs of a dict or on the items of a list.
A class also allows you to provide methods to check values before they are set, e.g. to make sure they are of the right type.
Whether that overhead is necessary depends on the project, on who is using and/or updating the data, and on how long the project and its data are going to live (i.e. are you still going to understand the data and its implicit structure a year from now?). These are all issues for which a well-designed (and documented) class can help, at the cost of some extra work up front.
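As a minimal sketch of the namedtuple route, assuming the top level of the file is a mapping whose keys are valid Python identifiers:

import yaml
from collections import namedtuple

with open('config.yaml') as f:    # hypothetical file name
    raw = yaml.safe_load(f)

Config = namedtuple('Config', raw.keys())
cfg = Config(**raw)               # attribute access: cfg.some_key instead of raw['some_key']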

python json response using schema

I have an app and a server communicating using JSON. I'm now trying to "pythonize" my server code as best I can (I'm a long-time C coder, and I'm afraid my Python code flow looks more C-like than Pythonic).
I have a bunch of messages going back and forth. Thus far the message format has been "implicit": I didn't really define a schema to make it explicit, readable, validatable, etc.
Searching on the topic, I now have a good handle on how to define the incoming message schema, validate it, and so on. With colander, I might even be able to load it directly into a class.
However, on the outbound side (i.e. responses from the server), I want a similarly well-defined structure and interface.
My question is:
How do I USE the defined outbound schema while CONSTRUCTING the response data? A C analogy would be to use a struct.
Essentially, I don't want any place in my code to do something ugly like
r = dict(response_field=response_data)
HttpResponse(json.dumps(r))
Because then I'm implicitly creating my format on the fly...
I'd rather use the schema as the base from which to construct the response.
Any thoughts, suggestions, or best-practice pointers?
Thanks!
You can define your outbound data contracts with regular Python classes.
Or you might consider json-schema to define the public API interfaces (incoming and outgoing data contracts). There are json-schema validators for Python that can be a good alternative to colander.
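For example, a sketch using the third-party jsonschema package (the schema fields here are assumptions):

import jsonschema

response_schema = {
    'type': 'object',
    'properties': {
        'response_field': {'type': 'string'},
    },
    'required': ['response_field'],
}

# raises jsonschema.ValidationError if the response drifts from the contract
jsonschema.validate({'response_field': 'ok'}, response_schema)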
If you have structured data à la relational database, then you might consider XSD and XML. More on this on Stack Overflow.
If structures and constraints are simple, then Avro or Protocol Buffers might be enough.

What would be the equivalent of Python's "pickle" in Node.js

One of Python's features is the pickle module, which allows you to store any arbitrary object and restore it exactly to its original form. One common usage is to take a fully instantiated object and pickle it for later use. In my case I have an AMQP Message object that is not serializable, and I want to be able to store it in a session store and retrieve it, which I can do with pickle. The primary difference is that I need to call a method on the object; I am not just looking for the data.
But this project is in Node.js, and it seems like, with all of Node's low-level libraries, there must be some way to save this object so that it can persist between web calls.
The use case is that a web page picks up a RabbitMQ message and displays the info derived from it. I don't want to acknowledge the message until it has been acted on. I would normally just save the data in session state, but that's not an option unless I can somehow save it in its original form.
See the pickle-js project: https://code.google.com/p/pickle-js/
Also, from findbestopensource.com:
pickle.js is a JavaScript implementation of the Python pickle format. It supports pickles containing a cross-language subset of the primitive types. Key differences between pickle.js and pickle.py:
- text pickles only
- some types are lossily converted (e.g. int)
- some types are not supported (e.g. class)
More information available here: http://www.findbestopensource.com/product/pickle-js
As far as I am aware, there isn't an equivalent to pickle in JavaScript (or in the standard node libraries).
Check out https://github.com/carlos8f/hydration to see if it fits your needs. I'm not sure it's as complete as pickle but it's pretty terrific.
Disclaimer: The module author and I are coworkers.

Storing a python set in a database with django

I need to store a Python set in a database for later access. What's the best way to go about doing this? My initial plan was to use a TextField on my model and store the set as a comma- or pipe-delimited string; when I need to pull it back out for use in my app, I could rebuild the set by calling split on the string. Obviously, if there is a simple way to serialize the set so I can pull it back out as a set later, that would be best.
If your database is better at storing blobs of binary data, you can pickle your set. (Pickle's old protocol 0 even stores data as ASCII text, so it might be better than the delimited string approach anyway.) Just pickle.dumps(your_set), and later unpickled = pickle.loads(database_string).
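A minimal sketch of the round trip (keep in mind that unpickling runs arbitrary code, so only load data you stored yourself):

import pickle

blob = pickle.dumps({1, 2, 3})    # bytes under modern protocols
restored = pickle.loads(blob)     # {1, 2, 3}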
There are a number of options here, depending on what kind of data you wish to store in the set.
If it's regular integers, CommaSeparatedIntegerField might work fine, although it often feels like a clumsy storage method to me.
If it's other kinds of Python objects, you can try pickling it before saving it to the database, and unpickling it when you load it again. That seems like a good approach.
If you want something human-readable in your database, though, you could even JSON-encode it into a TextField, as long as the data you're storing doesn't include arbitrary Python objects (note that you'll need to convert the set to a list first, since JSON has no set type).
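A quick sketch of that route, assuming the set holds JSON-friendly values:

import json

text = json.dumps(sorted({3, 1, 2}))    # JSON has no set type, so store a list: '[1, 2, 3]'
restored = set(json.loads(text))        # back to {1, 2, 3}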
Redis natively stores sets (as well as other data structures: lists, hashes, queues) and provides set operations, and it's rocket fast too. I find it's the Swiss Army knife of Python development.
I know it's not a relational database per se, but it does solve this problem very concisely.
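A quick sketch with the redis-py client, assuming a Redis server running on localhost:

import redis

r = redis.Redis()
r.sadd('myset', 1, 2, 3)       # stored as a native Redis set
members = r.smembers('myset')  # {b'1', b'2', b'3'}; values come back as bytes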
What about CommaSeparatedIntegerField?
If you need another type (strings, for example), you can create your own field that works like CommaSeparatedIntegerField but uses strings (without commas).
Or, probably a better way of doing it: keep a dictionary that maps integers to your values.
