firebase encryption or not? [closed] - python

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 4 days ago.
The community reviewed whether to reopen this question 4 days ago and left it closed:
Original close reason(s) were not resolved
I'm new to the world of Android development. I'm self-taught in Python and still learning. I've spent basically 10 - 14 hours a day, every day, for 7 months developing an app (because I'm addicted to programming now, hehe).
Anyway, my app is functional but not yet on the Play Store. I'm looking into that, and it got me thinking about another rabbit hole to go down...
The app collects location data, phone numbers and some other sensitive data. That data gets stored in the Firebase Realtime Database as JSON. Firebase is owned by Google, it uses HTTPS, and you need to be authorised to access it. In the eyes of professional developers, Google, etc., would the sensitive JSON data I'm storing be classed as "secure" on Firebase, or do I have to learn about Python encryption as well?

No
No, it's not necessarily secure just because of where it is stored.
Security is a multi-layered discipline that involves awareness of threat vectors, sensitivity to the threat environment and consideration of the data stored, not to mention legal requirements and financial risks!
Multi-layered Security
To use a concrete example, imagine an extremely "secure" vault with 15-inch thick steel walls. Inside is a priceless treasure trove. However, for the sake of convenience, the owner of the vault has left the key taped to the front door and for the sake of cost has not hired security guards or paid for cameras.
While the vault may be impressive, the way in which it is used makes it an easy target for anyone who wants to break in and steal the contents.
Your firebase database may be physically secure (since it is located inside of a Google warehouse somewhere), but your app holds the keys to the database. If your app is easy to hack into, then the security of your database is compromised.
When we say that security is "multi-layered", it means that you shouldn't be overly-reliant on any one layer of security. Perhaps your database has a password. But if the password is compromised, then all of that data is now compromised. Likewise, if your data is encrypted, but the encryption key is compromised, then all of your data is compromised. But if your database requires a password AND the data is encrypted, then an attacker would need both the password and the encryption key. Having one would not be enough. This is an example of multi-layered security.
Security Doesn't Stop at Your Database
Unfortunately, the need to access data requires, by definition, a breach of your security walls around the database. Again, to use a concrete example, this is like the classic movie trope of a laundry truck entering a maximum security prison. All the barbed wire and guards may be undone by the perfectly ordinary and expected laundry truck driving out the front gate. So in addition to database security, you need to consider app security.
For example, how easy is it for a user to spoof another user in your app? It doesn't matter if your database has many layers of security if an attacker can just use your app to access data for any and all users. (For the sake of this conversation, your "app" includes the service endpoints which your locally installed Android app uses to communicate with the server, which can be easily sniffed out by even an amateur hacker.)
No One-Size-Fits-All Advice
Security is a non-trivial topic and so it's not possible to give you advice on how to secure your app and database. The best advice I can give is to be very thoughtful about what you choose to store. If you are going to have a central database, then assume it will be breached and all the contents leaked. How bad will that be for you? If it will expose you to legal or financial risk, then it may be cost-effective to hire a professional who can help you provide the necessary security for your app. Note that privacy laws are very complicated and vary exceedingly across jurisdictions, so if you are going to store sensitive user data you might need to consult a lawyer.
Here's a quick handful of laws you may need to consider when storing sensitive user data, but there are many, many more:
USA's COPPA (Children's Online Privacy Protection Act)
The EU's GDPR (General Data Protection Regulation)
California's CCPA (California Consumer Privacy Act)
However, it sounds like this is a hobby. If so, then consider alternatives to a central database (which can get very expensive if your app goes viral!). Maybe use a local database so that all data is stored on the user's personal device (and, perhaps, provide an easy way to export/import that data). Some users (including me!) would actually find that to be a valuable feature! Or consider a hybrid model, where sensitive information is stored locally and general information that is not personally identifiable (non-PII) is stored centrally (so you can run usage reports, etc).
Security is a balancing act between accessibility and secrecy, so there is not going to be any one-size-fits-all advice.
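As a rough illustration of the hybrid model described above, here is a minimal sketch using Python's built-in sqlite3 module. The table layout, function names and the shape of the usage report are assumptions made up for this example; they are not part of any Firebase or Android API. The idea is only that phone numbers and coordinates stay in a database on the device, and the only thing that would ever be sent to a central server is a count.

```python
import sqlite3


def open_local_store(path=":memory:"):
    """Open (or create) the on-device database. The schema is hypothetical."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS contacts ("
        "id INTEGER PRIMARY KEY, phone TEXT NOT NULL, lat REAL, lon REAL)"
    )
    return conn


def save_contact(conn, phone, lat, lon):
    """Store sensitive fields locally only; nothing here touches the network."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO contacts (phone, lat, lon) VALUES (?, ?, ?)",
            (phone, lat, lon),
        )


def usage_report(conn):
    """Build the non-identifying summary that could be stored centrally."""
    (count,) = conn.execute("SELECT COUNT(*) FROM contacts").fetchone()
    return {"contact_count": count}
```

On a real device you would pass a file path inside the app's private storage instead of ":memory:", and the export/import feature could be as simple as copying that file.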
Learn More
Firebase: Privacy and Security in Firebase
Android: Security tips
Oracle: What is Data Security?
FTC: Mobile Health App Developers: FTC Best Practices
Note that the FTC's #1 Tip is:
1. Minimize Data.
Do you need to collect and retain people’s information? Remember, if you don’t collect data in the first place, you don’t have to go to the effort of securing it. If the data you collect is integral to your product or service, that’s fine, but take reasonable steps to secure the data you transmit and store, and delete it once you no longer have a legitimate business need to retain it. If you collect and retain it, you must protect it.
Can you keep the data in a de-identified form? When data is de-identified, it can’t be reasonably associated with a particular individual. A key to effective de-identification is to ensure that the data cannot be reasonably re-identified. For example, U.S. Department of Health and Human Services regulations require entities covered by the Health Insurance Portability and Accountability Act (HIPAA) either to remove specific identifiers, including date of birth and five-digit zip code, from protected health information or to have a privacy and data security expert determine that the risk of re-identification is “very small.” Appropriately de-identified data can protect people’s privacy while still allowing for beneficial use. For example, if your app collects geolocation information as part of an effort to map asthma outbreaks in a metropolitan area, consider whether you can provide the same functionality while maintaining and using that information in de-identified form. You can reduce the risk of re-identification of location data by not collecting highly specific location data about individual users in the first place, by limiting the number of locations stored for each user, or aggregating location data across users.
Since re-identification is always a risk, it’s important to keep up with technological developments. Publicly commit not to re-identify the data. And make sure your contracts with third parties require them to commit not to re-identify the data. Then monitor the third parties to make sure they live up to their promises.
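To make the FTC's suggestions about location data a little more concrete, here is a small sketch of two of them: not keeping highly specific coordinates, and aggregating across users. The function names and the choice of two decimal places (roughly a kilometre of latitude) are assumptions for illustration only; whether that is coarse enough for your data is something you would have to judge, and coarsening alone does not guarantee that data cannot be re-identified.

```python
from collections import Counter


def coarsen(lat, lon, decimals=2):
    """Round coordinates so that only an approximate grid cell is kept."""
    return (round(lat, decimals), round(lon, decimals))


def aggregate(points, decimals=2):
    """Count how many reports fall in each grid cell, dropping who sent them."""
    return Counter(coarsen(lat, lon, decimals) for lat, lon in points)
```

For example, aggregate() applied to every user's reports gives you a heat map of cells and counts without storing a per-user location history at all.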

Related

How do cloud services have access to your hosted script?

Let's say you have some proprietary python + selenium script that needs to run daily. If you host them on AWS, Google cloud, Azure, etc. are they allowed to see your script ? What is the best practice to "hide" such script when hosted online ?
Any way to "obfuscate" the logic, such as converting python script to binary ?
Can the cloud vendors access your script/source code/program/data?
I am not including government/legal subpoenas in this answer.
They own the infrastructure. They govern access. They control security.
However, in the real world there are numerous firewalls in place with auditing, logging and governance. A cloud vendor employee would risk termination and/or prison time for bypassing these controls.
Secrets (or rumors) are never secret for long and the valuation of AWS, Google, etc. would vaporize if they violated customer trust.
Therefore the answer is yes, it is possible but extremely unlikely. Professionally, I trust the cloud vendors with the same respect I give my bank.
Here you can find information regarding Google Cloud Enterprise Privacy Commitments.
It describes how Google protects the privacy of Google Cloud Platform and Google Workspace customers.
You control your data. Customer data is your data, not Google’s. We only process your data according to your agreement(s).
We never use your data for ads targeting. We do not process your customer data or service data to create ads profiles or improve Google Ads products.
We are transparent about data collection and use. We’re committed to transparency, compliance with regulations like the GDPR, and privacy best practices.
We never sell customer data or service data. We never sell customer data or service data to third parties.
Security and privacy are primary design criteria for all of our products. Prioritizing the privacy of our customers means protecting the data you trust us with. We build the strongest security technologies into our products.
Therefore I believe it is extremely unlikely that Google will check the content of your scripts.

DB structure for Cloud based billing software, planning issues [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 2 years ago.
We are building cloud-based billing software. The software is web-based and should function (at least) like desktop software. We will have 5000+ users billing at the same time. For now, we have only 250 users, and we need to scale. We are using Angular for the frontend, Python for the backend and React Native for the mobile app. PostgreSQL is used as the database. I have a few doubts to clarify before we scale.
Will using PostgreSQL as the DB cause any issues in the future?
Instead of an integer primary key, we are using UUIDs (for easy data migrations, but they use more space). Will that introduce any problems in the future?
Do we have to consider any DB methods for this kind of scaling? (Right now we use a single DB for all users.)
We are planning to use one server with a huge spec (for all users). Will that be good enough, or do we have to plan for anything else?
Are separate application and DB servers needed for this scenario?
I'll try to answer the questions. Feel free to judge it.
So, you are building cloud-based billing software. You now have 250+ users and expect to have at least 5000 users in the future.
Now answering the questions you asked:
Will using PostgreSQL as the DB cause any issues in the future?
ans: PostgreSQL is great for production. It is the safe way to go. It shouldn't show any issues in the future, but that depends highly on the DB design.
Instead of an integer primary key, we are using UUIDs (for easy data migrations, but they use more space). Will that introduce any problems in the future?
ans: Using UUIDs has its own advantages and disadvantages. If you think scaling is a problem, then you should consider updating your revenue model.
Do we have to consider any DB methods for this kind of scaling? (Right now we use a single DB for all users.)
ans: A single DB for a production app is good at the initial stage. When scaling, especially in the case of 5000 concurrent users, it is good to think about moving to microservices.
We are planning to use one server with a huge spec (for all users). Will that be good enough, or do we have to plan for anything else?
ans: Like I said, 5k concurrent users will require a mighty server (it depends highly on the operations though; I'm assuming moderate-to-heavy calculations and such). Therefore, it's recommended to plan for a microservices architecture. That way you can scale up heavily used services and scale down the others. But keep in mind that microservices may sound great, but in practice they are a pain to set up. If you have a strong backend team, you can proceed with this idea; otherwise just don't.
Are separate application and DB servers needed for this scenario?
ans: Short answer: yes. Long answer: why would you want to stress a single server machine with both jobs when you have that many users?
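On the UUID question above, the trade-off can be seen directly from Python's standard uuid module. This is only an illustrative sketch: new_invoice_id is a made-up name, and the 8-byte figure is the size of a 64-bit integer key (PostgreSQL's bigint), compared with the 16 bytes of a UUID.

```python
import uuid


def new_invoice_id():
    """Generate a random (version 4) UUID without asking the database."""
    return uuid.uuid4()


invoice_id = new_invoice_id()

# A UUID is 128 bits, so it takes 16 bytes per key (and per index entry).
UUID_BYTES = len(invoice_id.bytes)

# A 64-bit integer key takes 8 bytes.
BIGINT_BYTES = 8
```

The advantage is that any service or client can create keys independently, which is what makes migrations and merges easier; the cost is twice the key size and keys that are not ordered by insertion time.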

How to safely pass credentials to jdbc interface in Pyspark [duplicate]

The attack
One possible threat model, in the context of credential storage, is an attacker who has the ability to:
inspect any (user) process memory
read local (user) files
AFAIK, the consensus on this type of attack is that it's impossible to prevent (since the credentials must be stored in memory for the program to actually use them), but there's a couple of techniques to mitigate it:
minimize the amount of time the sensitive data is stored in memory
overwrite the memory as soon as the data is not needed anymore
mangle the data in memory, keep moving it, and other security through obscurity measures
Python in particular
The first technique is easy enough to implement, possibly through a keyring (hopefully kernel space storage)
The second one is not achievable at all without writing a C module, to the best of my knowledge (but I'd love to be proved wrong here, or to have a list of existing modules)
The third one is tricky.
In particular, since Python is a language with very powerful introspection and reflection capabilities, it's difficult to prevent access to the credentials by anyone who can execute Python code in the interpreter process.
There seems to be a consensus that there's no way to enforce private attributes and that attempts at it will at best annoy other programmers who are using your code.
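On the second technique (overwriting memory), one partial mitigation that stays in pure Python is to keep the secret in a mutable bytearray rather than in a str or bytes object, since a bytearray can be zeroed in place. This is only a sketch under assumptions: the function names are hypothetical, and it does nothing about copies made elsewhere (for example if the secret was ever held as an immutable str or bytes object, or copied by a library), nor about an attacker who reads memory while the secret is still in use.

```python
def wipe(buf: bytearray) -> None:
    """Overwrite every byte of the buffer in place."""
    for i in range(len(buf)):
        buf[i] = 0


def use_and_wipe(secret: bytearray, use_secret):
    """Run use_secret(secret), then zero the buffer even if it raises."""
    try:
        return use_secret(secret)
    finally:
        wipe(secret)
```

This shortens the time the secret sits in that particular buffer, which is the first mitigation on the list, but it is not a guarantee that no other copy exists in the process.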
The question
Taking all this into consideration, how does one securely store authentication credentials using python? What are the best practices? Can something be done about the language "everything is public" philosophy? I know "we're all consenting adults here", but should we be forced to choose between sharing our passwords with an attacker and using another language?
There are two very different reasons why you might store authentication credentials:
To authenticate your user: For example, you only allow the user access to the services after the user authenticates to your program
To authenticate the program with another program or service: For example, the user starts your program which then accesses the user's email over the Internet using IMAP.
In the first case, you should never store the password (or an encrypted version of the password). Instead, you should hash the password with a high-quality salt and ensure that the hashing algorithm you use is computationally expensive (to prevent dictionary attacks) such as PBKDF2 or bcrypt. See Salted Password Hashing - Doing it Right for many more details. If you follow this approach, even if the hacker retrieves the salted, slow-hashed token, they can't do very much with it.
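For the first case, a minimal sketch of salted, slow password hashing with PBKDF2 can be written with only the standard library (hashlib.pbkdf2_hmac, os.urandom and hmac.compare_digest). The iteration count and the function names below are assumptions for illustration; pick an iteration count based on current guidance and on what your hardware can afford.

```python
import hashlib
import hmac
import os

# Assumed work factor for illustration; tune it for your own hardware.
ITERATIONS = 600_000


def hash_password(password: str, iterations: int = ITERATIONS):
    """Return (salt, iterations, digest); store these, never the password."""
    salt = os.urandom(16)  # a fresh random salt for every password
    digest = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, iterations
    )
    return salt, iterations, digest


def verify_password(password: str, salt: bytes, iterations: int, expected: bytes) -> bool:
    """Recompute the hash and compare it in constant time."""
    candidate = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, iterations
    )
    return hmac.compare_digest(candidate, expected)
```

Storing the iteration count next to the salt and digest lets you raise the work factor later without invalidating existing accounts.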
In the second case, there are a number of things done to make secret discovery harder (as you outline in your question), such as:
Keeping secrets encrypted until needed, decrypting on demand, then re-encrypting immediately after
Using address space randomization so each time the application runs, the keys are stored at a different address
Using the OS keystores
Using a "hard" language such as C/C++ rather than a VM-based, introspective language such as Java or Python
Such approaches are certainly better than nothing, but a skilled hacker will break it sooner or later.
Tokens
From a theoretical perspective, authentication is the act of proving that the person challenged is who they say they are. Traditionally, this is achieved with a shared secret (the password), but there are other ways to prove yourself, including:
Out-of-band authentication. For example, where I live, when I try to log into my internet bank, I receive a one-time password (OTP) as an SMS on my phone. In this method, I prove who I am by virtue of owning a specific telephone number
Security token: To log in to a service, I have to press a button on my token to get an OTP which I then use as my password.
Other devices:
SmartCard, in particular as used by the US DoD where it is called the CAC. Python has a module called pyscard to interface to this
NFC device
And a more complete list here
The commonality between all these approaches is that the end-user controls these devices and the secrets never actually leave the token/card/phone, and certainly are never stored in your program. This makes them much more secure.
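To give a feel for how a button-press security token can produce one-time passwords, here is a sketch of HOTP as specified in RFC 4226, which is one common scheme (the answer above does not say which scheme any particular token uses, so treat this as an assumption). The token and the server share a secret and a counter; each press increments the counter and derives a short code from an HMAC.

```python
import hashlib
import hmac
import struct


def hotp(secret: bytes, counter: int, digits: int = 6) -> str:
    """Compute an RFC 4226 HOTP code for the given counter value."""
    # HMAC-SHA1 over the counter encoded as an 8-byte big-endian integer.
    mac = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    # Dynamic truncation: the low 4 bits of the last byte select an offset.
    offset = mac[-1] & 0x0F
    code = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)
```

In a hardware token the secret lives inside the device and only the short code is ever typed in, which is the property described above: the secret itself never reaches your program.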
Session stealing
However (there is always a however):
Let us suppose you manage to secure the login so the hacker cannot access the security tokens. Now your application is happily interacting with the secured service. Unfortunately, if the hacker can run arbitrary executables on your computer, the hacker can hijack your session for example by injecting additional commands into your valid use of the service. In other words, while you have protected the password, it's entirely irrelevant because the hacker still gains access to the 'secured' resource.
This is a very real threat, as multiple cross-site scripting attacks have shown (one example is U.S. Bank and Bank of America Websites Vulnerable, but there are countless more).
Secure proxy
As discussed above, there is a fundamental issue in keeping the credentials of an account on a third-party service or system so that the application can log onto it, especially if the only log-on approach is a username and password.
One way to partially mitigate this is to delegate communication with the service to a secure proxy, and to develop a secure sign-on approach between the application and the proxy. In this approach:
The application uses a PKI scheme or two-factor authentication to sign onto the secure proxy
The user adds security credentials to the third-party system to the secure proxy. The credentials are never stored in the application
Later, when the application needs to access the third-party system, it sends a request to the proxy. The proxy logs on using the security credentials and makes the request, returning results to the application.
The disadvantages to this approach are:
The user may not want to trust the secure proxy with the storage of the credentials
The user may not trust the secure proxy with the data flowing through it to the third-party application
The application owner has additional infrastructure and hosting costs for running the proxy
Some answers
So, on to specific answers:
How does one securely store authentication credentials using python?
If storing a password for the application to authenticate the user, use a PBKDF2 algorithm, such as https://www.dlitz.net/software/python-pbkdf2/
If storing a password/security token to access another service, then there is no absolutely secure way.
However, consider switching authentication strategies to, for example the smartcard, using, eg, pyscard. You can use smartcards to both authenticate a user to the application, and also securely authenticate the application to another service with X.509 certs.
Can something be done about the language "everything is public" philosophy? I know "we're all consenting adults here", but should we be forced to choose between sharing our passwords with an attacker and using another language?
IMHO there is nothing wrong with writing a specific module in Python that does its damnedest to hide the secret information, making it a right bugger for others to reuse (annoying other programmers is its purpose). You could even code large portions in C and link to it. However, don't do this for other modules for obvious reasons.
Ultimately, though, if the hacker has control over the computer, there is no privacy on the computer at all. Theoretical worst-case is that your program is running in a VM, and the hacker has complete access to all memory on the computer, including the BIOS and graphics card, and can step your application through authentication to discover its secrets.
Given no absolute privacy, the rest is just obfuscation, and the level of protection is simply how hard it is obfuscated vs. how much a skilled hacker wants the information. And we all know how that ends, even for custom hardware and billion-dollar products.
Using Python keyring
While this will quite securely manage the key with respect to other applications, all Python applications share access to the tokens. This is not in the slightest bit secure to the type of attack you are worried about.
I'm no expert in this field and am really just looking to solve the same problem that you are, but it looks like something like Hashicorp's Vault might be able to help out quite nicely.
In particular, with respect to the problem of storing credentials for third-party services, e.g.:
In the modern world of API-driven everything, many systems also support programmatic creation of access credentials. Vault takes advantage of this support through a feature called dynamic secrets: secrets that are generated on-demand, and also support automatic revocation.
For Vault 0.1, Vault supports dynamically generating AWS, SQL, and Consul credentials.
More links:
Github
Vault Website
Use Cases

How to extract information (citation, h-index, currently working institution etc) about all professors in a specific field from Google scholar?

I want to compare different information (citation, h-index, etc) of professors in a specific field in different institutions all over the world by data mining and analysis techniques. But I have no idea how to extract these data of hundreds of (or even thousands of) professors since Google does not provide an official API for it. So I am wondering are there any other ways to do that?
This Google Code tool will calculate an individual h-index. If you do this on demand for a limited number of people in a particular field, you should not break the terms of use: they don't specifically refer to limits on access, but they do refer to disruption of service (which bulk requests may potentially cause). The export questions state:
I wrote a program to download lots of search results, but you blocked my computer from accessing Google Scholar. Can you raise the limit?
Err, no, please respect our robots.txt when you access Google Scholar using automated software. As the wearers of crawler's shoes and webmaster's hat, we cannot recommend adherence to web standards highly enough.
Web of Science does have an API available and a collaboration agreement with Google Scholar, but Web of Science access is only available to certain individuals.
A solution could be to request the user's Web of Science credentials (or use your own) to return the information on demand, perhaps for the top people in the field, and then store it as you planned. Google Scholar only updates a few times per week, so this would not be excessive use.
The other option is to request permission from Google, which is mentioned in the terms of use, although it seems unlikely to be granted.
I've done a project exactly on this.
You provide the script with an input text file containing the names of the professors whose information you'd like to retrieve, and the script crawls Google Scholar and collects the info you are interested in.
The project also provides functionality for automatically downloading the profile pictures of the researchers/professors.
In order to respect the constraints imposed by the portal, you can set a delay between requests. If you have more than 1k profiles to crawl it might take a while, but it works.
A concurrency-enabled script has also been implemented, and it runs way faster than the basic sequential approach.
Note: in order to specify the information you need, you have to know either the id or the class name of the HTML elements generated by Google Scholar.
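The "delay between requests" idea can be sketched roughly like this with the standard library. This is not the code of the project mentioned above; crawl_politely, the fetch callback and the 5 second default are assumptions for illustration. It checks each URL against an already-loaded robots.txt (urllib.robotparser.RobotFileParser) and sleeps between the requests it does make.

```python
import time
from urllib.robotparser import RobotFileParser


def crawl_politely(urls, fetch, robots, delay_seconds=5.0,
                   user_agent="*", sleep=time.sleep):
    """Fetch allowed URLs one by one, pausing between requests.

    fetch is any callable that takes a URL and returns the page content;
    robots is a RobotFileParser that has already been loaded.
    """
    results = {}
    for url in urls:
        if not robots.can_fetch(user_agent, url):
            continue  # respect robots.txt and skip disallowed pages
        if results:
            sleep(delay_seconds)  # wait before every request after the first
        results[url] = fetch(url)
    return results
```

In real use you would load the rules with robots.set_url(...) followed by robots.read(), and pass a fetch function built on whatever HTTP client you prefer.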
good luck!

Security considerations - office website/portal on GAE

If one needs to create an office website (that serves as a platform for clients/customers/employees) to login and access shared data, what are the security considerations.
to give you some more detail,
The office portal has been developed in django/python and hosted through GAE. Essentially, the end point comes with a login/password to enter into the portal and access data.
I would like to know:
a) what are the things we can do to bring in a high level of security. Essentially the data is critical and hence need to be accessed by authorized people only. So would like to make it such that "The app is as safe as - how safely one keeps his password. Meaning, the only way to enter the system (unauthorized) is through a password leak (by the person) and not in any hackish way." :)
b) can we host the apps on GAE (appspot.com) with https?
c) are there better ways to secure other than passwords (i have heard about ssh keys/certificates). But the ultimate users may not be highly tech savvy.
There is always a trade-off between usability and security. The more security features you implement, the more difficult the system becomes to use.
can we host the apps on GAE (appspot.com) with https?
Yes, but not on your own domain, only on appspot.com. If you are serving your app from your own domain, you must direct all secure traffic through your app's appspot domain (on your own domain, you'd have to buy an SSL certificate, and you would need a dedicated IP etc.). If you really have to, there are ways to route SSL traffic over your own domain, but as this requires another server running something like stunnel, it gives attackers another target.
If your app has username/password authentication, the app really is as safe as how safely each user keeps their password, provided you have no bugs in your code that could be exploited. About the "hackish way": on GAE, you don't have to care about server security; the only possible attack target is your code.
These are some strategies for securing your app:
good QA and code review to find critical bugs; Django has already built-in protection against most trivial attacks like XSRF and SQL injection, so look at the parts of your own code that are related to critical data and authentication
think of other authentication methods like client-side certificates (easy to use for the end user, most browsers support this natively and modern operating systems have a certificate store; probably not an easy thing to do on GAE)
the weakest point of every secure environment is the user, so you should inform the users about good practices for handling sensitive data and passwords (BTW, requiring a password change every few months does not improve security at all, as it usually results in users writing down their passwords because they can't remember them; you lose more security than you gain)
you should have good intrusion detection to lock out an attacker as soon as possible, for example behaviour analysis; example: if a user from the USA logs in from an IP in Estonia, this is suspicious
network access restrictions: you could block all IP ranges except those of your enterprise from accessing critical data; if a password gets leaked, this minimizes the possible impact
improve end user security: if one of the users has a trojan on their computer that takes screen captures or logs keystrokes, all your security is lost, as the attacker could just watch the user while they are viewing sensitive data; you should have a good security policy in your enterprise
force users to access your site over SSL; you should not let the users choose whether they prefer comfort over security
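Two of the strategies above, network access restrictions and flagging logins from an unexpected country, can be sketched with Python's standard ipaddress module. The network ranges below are reserved documentation ranges standing in for your enterprise's real ones, and the function names and country codes are assumptions for illustration; how you obtain a country for an IP address (a GeoIP lookup, for example) is outside this sketch.

```python
import ipaddress

# Placeholder ranges (reserved for documentation); replace with your own.
ALLOWED_NETWORKS = [
    ipaddress.ip_network("203.0.113.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def is_allowed(remote_addr: str, networks=ALLOWED_NETWORKS) -> bool:
    """Return True only if the client address is inside an allowed range."""
    try:
        addr = ipaddress.ip_address(remote_addr)
    except ValueError:
        return False  # not a valid IP address at all
    return any(addr in net for net in networks)


def is_suspicious_login(usual_country: str, login_country: str) -> bool:
    """Flag a login whose country differs from the user's usual one."""
    return usual_country != login_country
```

In a Django app a check like is_allowed() would typically sit in middleware in front of the views that serve critical data, and a suspicious login would trigger an extra verification step rather than an outright block.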
