Python input() causing Checkmarx SQL injection / command injection

We have a Flask/ML application that takes user input while training. In production it skips user input and instead reads from a steps.ob file.
However, Checkmarx still identifies this as a potential SQL injection vulnerability.
for i in range(0, n):
    ele = input("Enter column name (one at a time): ")
    cols_drop.append(ele)  # adding the element
X = df.drop(cols_drop, axis='columns')
Is there any short workaround for clearing this vulnerability? (In production, since we supply the steps.ob file, it won't take user input, but Checkmarx does not consider that there is an if/else production condition.)

Static analysis doesn't execute your code with values to see which logic branches are taken. In Checkmarx's case, it is doing data-flow analysis, so it will ignore the if/else and see the flow from the user-interactive input to your DB API. This means this is a true-positive result. You just have to decide whether it is exploitable or not, given (perhaps) compensating controls in place during deployment.
Given that SQL injection is reported, it is likely that your code isn't doing proper SQL parameter binding/sanitization, or that Checkmarx doesn't recognize the API you're using for SQL I/O. There is not enough code to really tell if this is the case; I am making some assumptions based on the content of your question.
I am also fairly certain Checkmarx doesn't recognize any of the Pandas API, so it cannot see that the code is also subject to input via the steps.ob file. Depending on how that file gets placed and is secured from potential modification by an attacker, that may be a false-negative SQL injection. If the Pandas API were recognized, the analysis would potentially see the file input and give you another SQL injection vulnerability.
The good news is both would be resolved with proper parameter binding. If you are already doing proper parameter binding, you have some choices:
Mark it as Not Exploitable
Tune the analysis queries to recognize any sources/sinks/sanitizers
Checkmarx can usually help you tune the query.
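For illustration, here is a minimal sketch of what proper parameter binding looks like; the sqlite3 driver and the table/column names are assumptions, not taken from your code:

import sqlite3

# Hypothetical table/column names for illustration only.
conn = sqlite3.connect("app.db")
user_supplied = input("Enter a value: ")

# Unsafe: concatenating input into the SQL string lets the input alter the query itself.
# conn.execute("SELECT * FROM samples WHERE label = '" + user_supplied + "'")

# Safe: the driver binds the value as data, never as SQL text.
cursor = conn.execute("SELECT * FROM samples WHERE label = ?", (user_supplied,))
rows = cursor.fetchall()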

Display data as insert statements

I need to move data from one database to another.
I can use Python; my counterpart can't.
How can I select all the data from a table and save it as INSERT statements, using SQLAlchemy?
Is there a way to create a backup like this?
As others have suggested in comments, using the database backup program (mysqldump, pg_dump, etc) is your best bet; that will make sure that the data is transferred correctly for the underlying database.
Outputting INSERT statements will be risky; even the built-in SQLAlchemy facility for doing this comes with a big red warning, complete with a picture of a dragon, indicating that it can be dangerous.
If you nevertheless need to do this, and the data is generally trusted and doesn't contain much in the way of odd types, you can do the following:
Create (but do not execute) an insert expression as though you were inserting the rows back into the database.
Use the .compile() method with the relevant dialect parameter and literal_binds set to True.
Manually double-check that the output is, in fact, valid for the database; as per the warning in the SQLAlchemy FAQ, this method is not very dependable and may expose you to attacks if it's part of any production system.
I wouldn't recommend formatting up INSERT statements by hand; you're unlikely to do a better job than SQLAlchemy...
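To make the steps above concrete, here is a rough sketch of that approach; the engine URL, table name, and target dialect are assumptions for illustration, and you should sanity-check the emitted SQL as noted above:

from sqlalchemy import create_engine, MetaData, Table, insert
from sqlalchemy.dialects import mysql

# Reflect the source table (names/URLs are placeholders); requires SQLAlchemy 1.4+.
engine = create_engine("sqlite:///source.db")
metadata = MetaData()
customers = Table("customers", metadata, autoload_with=engine)

with engine.connect() as conn:
    for row in conn.execute(customers.select()):
        # Build (but do not execute) an INSERT for this row.
        stmt = insert(customers).values(**row._mapping)
        # Compile for the target dialect with the values inlined as literals.
        sql = stmt.compile(
            dialect=mysql.dialect(),
            compile_kwargs={"literal_binds": True},
        )
        print(f"{sql};")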

DynamoDB consistent read results in schema error

I am trying to interact with a DynamoDB table from Python using boto. I want all reads/writes to use quorum consistency to ensure that reads sent out immediately after writes always reflect the correct data.
NOTE: my table is set up with "phone_number" as the hash key and first_name+last_name as a secondary index. And for the purposes of this question one (and only one) item exists in the db (first_name="Paranoid", last_name="Android", phone_number="42")
The following code works as expected:
customer = customers.get_item(phone_number="42")
While this statement:
customer = customers.get_item(phone_number="42", consistent_read=True)
fails with the following error:
boto.dynamodb2.exceptions.ValidationException: ValidationException: 400 Bad Request
{u'message': u'The provided key element does not match the schema', u'__type': u'com.amazon.coral.validate#ValidationException'}
Could this be the result of some hidden data corruption due to failed requests in the past? (for example two concurrent and different writes executed at eventual consistency)
Thanks in advance.
It looks like you are calling the get_item method so the issue is with how you are passing parameters.
get_item(hash_key, range_key=None, attributes_to_get=None, consistent_read=False, item_class=<class 'boto.dynamodb.item.Item'>)
Which would mean you should be calling the API like:
customer = customers.get_item(hash_key="42", consistent_read=True)
I'm not sure why the original call you were making was working.
To address your concerns about data corruption and eventual consistency: it is highly unlikely that any API call you could make to DynamoDB could result in it getting into a bad state, outside of you sending it bad data for an item. DynamoDB is a highly tested solution that provides exceptional availability and goes to extraordinary lengths to take care of the data you send it.
Eventual consistency is something to be aware of with DynamoDB, but generally speaking it does not cause many issues, depending on the specifics of the use case. While AWS does not provide specific metrics on what "eventually consistent" looks like, in day-to-day use it is normal to be able to read back records that were just written or modified in under a second, even with eventually consistent reads.
As for performing multiple writes simultaneously on the same object, DynamoDB writes are always strongly consistent. If you are worried about an individual item being modified at the same time and producing unexpected behavior, you can use conditional writes, which allow the write to fail and let your application logic deal with any issues that arise.
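As an illustration of conditional writes, here is a sketch using the newer boto3 library rather than the boto2 API from your question, so treat it as an example of the idea rather than a drop-in replacement:

import boto3
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("customers")  # table name assumed

try:
    table.put_item(
        Item={"phone_number": "42", "first_name": "Paranoid", "last_name": "Android"},
        # Only write if no item with this hash key exists yet.
        ConditionExpression="attribute_not_exists(phone_number)",
    )
except ClientError as err:
    if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
        # Another writer got there first; resolve the conflict in application logic.
        pass
    else:
        raise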

Creating an archive - Save results or request them every time?

I'm working on a project that allows users to enter SQL queries with parameters. That SQL query will be executed on a schedule they decide (say, every 2 hours for 6 months), and they then get the results back at their email address.
They'll get it in the form of an HTML-email message, so what the system basically does is run the queries, and generate HTML that is then sent to the user.
I also want to save those results, so that a user can go on our website and look at previous results.
My question is - what data do I save?
Do I save the SQL query with those parameters (i.e. the date parameters, so the user can see the results relevant to that specific date)? This means that when the user clicks on this specific result, I need to execute the query again.
Or do I save the HTML that was generated back then, and simply display it when the user wishes to see this result?
I'd appreciate it if somebody would explain the pros and cons of each solution, and which one is considered the best & the most efficient.
The archive will probably be 1-2 months old, and I can't really predict the amount of rows each query will return.
Thanks!
Specifically regarding retrieving the results of queries that have been run previously, I would suggest saving the results so they can be viewed later, rather than running the queries again and again. The main benefits of this approach are:
You save unnecessary computational work re-running the same queries;
You guarantee that the result set will be the same as the original report. For example, if you save just the SQL, the records queried may have changed since the query was last run, or records may have been added/deleted.
The disadvantage of this approach is that it will probably use more disk space, but this is unlikely to be an issue unless you have queries returning millions of rows (in which case HTML is probably not such a good idea anyway).
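As a rough sketch of the save-the-results option (the schema and names below are assumptions, not part of the question), each run can store the rendered HTML alongside the query and its parameters so it can be redisplayed without re-running the SQL:

import datetime
import sqlite3

conn = sqlite3.connect("archive.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS report_archive (
        id            INTEGER PRIMARY KEY,
        user_id       INTEGER NOT NULL,
        query_text    TEXT NOT NULL,   -- the SQL the user scheduled
        params_json   TEXT NOT NULL,   -- the parameters used for this run
        rendered_html TEXT NOT NULL,   -- exactly what was emailed
        run_at        TEXT NOT NULL
    )
""")

def archive_result(user_id, query_text, params_json, rendered_html):
    # Store the snapshot for later viewing on the website.
    conn.execute(
        "INSERT INTO report_archive (user_id, query_text, params_json, rendered_html, run_at) "
        "VALUES (?, ?, ?, ?, ?)",
        (user_id, query_text, params_json, rendered_html,
         datetime.datetime.utcnow().isoformat()),
    )
    conn.commit()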
If I were creating this type of application, then:
I would provide some common queries (get by current date, current time, date ranges, time ranges, and others relevant to the application) for the user to select easily.
I would add autocompletion for common keywords.
If the data changes frequently there is no point in saving the HTML; generating a new one is the better option.
The crucial difference is that if the data changes, a new query will return a different result than what was saved some time ago, so you have to decide whether the user should get up-to-date data or a snapshot of what the data used to be.
If the relevant data does not change, it's a matter of how expensive the queries are, how many users will run them, and how often; you may then decide to save the results instead of re-running the queries, to improve performance.

Best strategy for error handling in an interface to a database and web display

I decided to ask this question after going back and forth 100s of times trying to place error-handling routines to optimize data integrity while also taking into account speed and efficiency (and wasting 100s of hours in the process). So here's the setup.
Database    ->   python classes   ->   python code       ->   javascript
MongoDB          that represent        that serves             web interface
                 the data              pages (pyramid)
I want data to be robust; that is the number one requirement. So right now I validate data on the JavaScript side of the page, but also validate in the Python classes, which more or less represent the data structures. While most server routines run through the Python classes, that sometimes feels inefficient given that data has to pass through different levels of error checking.
EDIT: I guess I should clarify. I am not looking to unify validation of client- and server-side code. Sorry for the bad write-up. I'm looking more to figure out where the server-side validation should be done: should it be in the direct interface to the database, or in the web-server code where the data is received?
For instance, if I have an object with a barcode, should I validate the barcode in the code that receives the data through AJAX, or should I just call the object's class and validate there?
Again, are there some sort of guidelines on how to do error checking in general? I want to be professional and learn, but hopefully not have to go through a whole book.
I am not a software engineer, but I hope those of you who are familiar with complex projects, can tell me where I can find few guidelines on how to model/error check in a situation like this.
I'm not necessarily looking for an answer, but more for a pointer to a short set of guidelines for creating projects with different layers like this. Hopefully nothing extremely long.
I don't even know what tags to use in the post. HELP!!
Validating on the client and validating on the server serve different purposes entirely. Validating on the server is to make sure your model invariants hold, and it has to be done to maintain data integrity. Validating on the client is so the user gets a friendly error message telling him that his input would've violated data integrity, instead of having a traceback blow up in his face.
So there's a subtle difference in that when validating on the server you only really care whether or not the data is valid. On the client you also care, on a finer-grained level, why the input could be invalid. (Another thing that has to be handled at the client is an input format error, i.e. entering characters where a number is expected.)
It is possible to meet in the middle a little. If your model validity constraints are specified declaratively, you can use that metadata to generate some of the client validations, but they're not really sufficient. A good example would be user registration. Commonly you want two password fields, and you want the input in both to match, but the model will only contain one attribute for the password. You might also want to check the password complexity, but it's not necessarily a domain model invariant. (That is, your application will function correctly even if users have weak passwords, and the password complexity policy can change over time without the data integrity breaking.)
Another problem specific to client-side validation is that you often need to express a dependency between the validation checks. I.e. you have a required field that's a number that must be lower than 100. You need to validate that a) the field has a value; b) the field value is a valid integer; and c) the field value is lower than 100. If any of these checks fails, you want to avoid displaying unnecessary error messages for further checks in the sequence, in order to tell the user what his specific mistake was. The model doesn't need to care about that distinction. (Aside: this is where some frameworks fail miserably - JSF and Spring MVC, or at least one of them, first attempt to do data-type conversion from the input strings to the form object properties, and if that fails, they cannot perform any further validations.)
In conclusion, the above implies that if you care about data integrity and usability, you necessarily have to validate data at least twice, since the validations achieve different purposes even if there's some overlap. Client-side validation will have more checks, and finer-grained checks, than the model-layer validation. I wouldn't really try to unify them except where your chosen framework makes it easy. (I don't know about Pyramid - Django keeps these concerns separate in that Forms are a different layer than your Models, both can be validated, and they're joined by ModelForms that let you add additional validations to the ones performed by the model.)
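Applied to the barcode example from the question, a minimal sketch (with hypothetical names, and the barcode format assumed) of keeping the invariant on the model class, while the AJAX/view layer only translates failures into friendly messages:

import re

class Item:
    BARCODE_RE = re.compile(r"^\d{12,13}$")  # assumed EAN/UPC-style format

    def __init__(self, barcode):
        self.barcode = self.validate_barcode(barcode)

    @classmethod
    def validate_barcode(cls, barcode):
        # Model-level invariant: every path that stores a barcode goes through this.
        if not cls.BARCODE_RE.match(barcode or ""):
            raise ValueError("barcode must be 12-13 digits")
        return barcode

def save_item_view(request_json):
    # Hypothetical AJAX view: maps the model's exception to a client-facing error.
    try:
        item = Item(request_json.get("barcode"))
    except ValueError as exc:
        return {"ok": False, "error": str(exc)}
    # ... persist item to MongoDB here ...
    return {"ok": True}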
Not sure I fully understand your question, but error handling on pymongo can be found here -
http://api.mongodb.org/python/current/api/pymongo/errors.html
Not sure if you're using a particular ORM - the docs have links to what's available, and these individually have their own best usages:
http://api.mongodb.org/python/current/tools.html
Do you have a particular ORM that you're using, or implementing your own through pymongo?

Reverse Search Best Practices?

I'm making an app that has a need for reverse searches. By this, I mean that users of the app will enter search parameters and save them; then, when any new objects get entered onto the system, if they match the existing search parameters that a user has saved, a notification will be sent, etc.
I am having a hard time finding solutions for this type of problem.
I am using Django and thinking of building the searches and pickling them using Q objects as outlined here: http://www.djangozen.com/blog/the-power-of-q
The way I see it, when a new object is entered into the database, I will have to load every single saved query from the db and somehow run it against this one new object to see if it would match that search query... This doesn't seem ideal - has anyone tackled such a problem before?
At the database level, many databases offer 'triggers'.
Another approach is to have timed jobs that periodically fetch all items from the database that have a last-modified date since the last run; then these get filtered and alerts issued. You can perhaps put some of the filtering into the query statement in the database. However, this is a bit trickier if notifications need to be sent if items get deleted.
You can also put triggers manually into the code that submits data to the database, which is perhaps more flexible and certainly doesn't rely on specific features of the database.
A nice way for the triggers and the alerts to communicate is through message queues - queues such as RabbitMQ and other AMQP implementations will scale with your site.
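For the message-queue hand-off, a small sketch (queue name and payload are assumptions) using the pika client for RabbitMQ, publishing an event from the code that writes new items to the database:

import json
import pika

# Publish a "new item" event that a separate alerting worker consumes.
connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="new_items", durable=True)

def notify_new_item(item_id, fields):
    channel.basic_publish(
        exchange="",
        routing_key="new_items",
        body=json.dumps({"id": item_id, "fields": fields}),
    )

notify_new_item(42, {"title": "example"})
connection.close()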
The amount of effort you use to solve this problem is directly related to the number of stored queries you are dealing with.
Over 20 years ago we handled stored queries by treating them as mini-docs and indexing them based on all of their must-have and may-have terms. A new doc's term list was used as a sort of query against this "database of queries", which built a list of possibly interesting searches to run, and then only those searches were run against the new docs. This may sound convoluted, but when there are more than a few stored queries (say anywhere from 10,000 to 1,000,000 or more) and you have a complex query language that supports a hybrid of Boolean and similarity-based searching, it substantially reduced the number we had to execute as full-on queries -- often to no more than 10 or 15 queries.
One thing that helped was that we were in control of the horizontal and the vertical of the whole thing. We used our query parser to build a parse tree and that was used to build the list of must/may have terms we indexed the query under. We warned the customer away from using certain types of wildcards in the stored queries because it could cause an explosion in the number of queries selected.
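A rough sketch of that idea in Python (the data structures and term sets are invented for illustration): index each stored query by its must-have terms, then use a new document's own terms to select only the stored queries that could possibly match.

from collections import defaultdict

# Stored queries: id -> the set of must-have terms your query parser extracted.
stored_queries = {
    1: {"python", "django"},
    2: {"rabbitmq"},
    3: {"django", "signals"},
}

# Inverted index: term -> ids of stored queries that require it.
term_index = defaultdict(set)
for qid, terms in stored_queries.items():
    for term in terms:
        term_index[term].add(qid)

def candidate_queries(doc_terms):
    """Return ids of stored queries whose must-have terms all appear in the doc."""
    candidates = set()
    for term in doc_terms:
        candidates |= term_index[term]
    # Only these few candidates are then executed as full searches against the new doc.
    return [qid for qid in candidates if stored_queries[qid] <= doc_terms]

print(candidate_queries({"python", "django", "orm"}))   # -> [1]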
Update for comment:
Short answer: I don't know for sure.
Longer answer: We were dealing with a custom-built text search engine, and part of its query syntax allowed slicing the doc collection in certain ways very efficiently, with special emphasis on date_added. We played a lot of games because we were ingesting 4-10,000,000 new docs a day and running them against up to 1,000,000+ stored queries on DEC Alphas with 64MB of main memory. (This was in the late 80's/early 90's.)
I'm guessing that filtering on something equivalent to date_added could be used in combination with the date of the last time you ran your queries, or maybe the highest id at the last query run time. If you need to re-run the queries against a modified record, you could use its id as part of the query.
For me to get any more specific, you're going to have to get a lot more specific about exactly what problem you are trying to solve and the scale of the solution you are trying to accomplish.
If you stored the type(s) of object(s) involved in each stored search as a generic relation, you could add a post-save signal to all involved objects. When the signal fires, it looks up only the searches that involve its object type and runs those. That probably will still run into scaling issues if you have a ton of writes to the db and a lot of saved searches, but it would be a straightforward Django approach.
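A minimal Django sketch of that approach (model and helper names are assumptions, and notify() is a hypothetical function): each SavedSearch records which model it applies to, and a post_save handler runs only the searches registered for that model against the one new object.

import pickle

from django.contrib.contenttypes.models import ContentType
from django.db import models
from django.db.models.signals import post_save
from django.dispatch import receiver

class SavedSearch(models.Model):
    owner_email = models.EmailField()
    content_type = models.ForeignKey(ContentType, on_delete=models.CASCADE)
    pickled_q = models.BinaryField()   # the pickled Q object, as in the linked article

@receiver(post_save)
def run_matching_searches(sender, instance, created, **kwargs):
    if not created or sender is SavedSearch:
        return
    ct = ContentType.objects.get_for_model(sender)
    for search in SavedSearch.objects.filter(content_type=ct):
        q = pickle.loads(bytes(search.pickled_q))
        # Apply the saved query to just this one new object.
        if sender.objects.filter(q, pk=instance.pk).exists():
            notify(search.owner_email, instance)   # hypothetical notification helper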
