I've very recently picked up programming in Python and am working on creating a database.
I've already worked out extracting all these files from their source so they are all in a directory on my computer.
All of these files are structured the same way and what I want to do is search these multidimensional dictionaries and locate the value for a specific set of keys.
These JSON files are all structured like this:
{
    "userid": 34535367,
    "result": {
        "list": [
            {
                "name": 264,
                "age": 64,
                "id": 456345345
            },
            {
                "name": 263,
                "age": 42,
                "id": 364563463456
            }
        ]
    }
}
In my case, I would like to search for the "name" key and return the relevant data (quality, id and the original userid) for the thousands of names just like it from my millions of JSON files.
Basically I'm very new at this and the little programming knowledge I have is in Python. I'm happy to start learning whatever I need to, but I'm not sure which direction to go.
If your goal is to create a database, then you should look at how databases work, because they solve the same problem you are trying to solve right now :)
NoSQL databases (like MongoDB) also work with JSON documents and most likely implement a whole set of tools to search and filter documents.
Now to answer your question, there is no quick way to do so unless you do some preprocessing, meaning that you store different information about the data (called metadata).
This is a huge subject and I don't have enough expertise to give you all the answers, but I can give you a simple tip: Use indexes.
An index is a sorted key/value map where, for every value, we store the documents that contain that value (or the file + position of the JSON document). For example, an index for the name property would look like this:
{
    263: ('jsonfile10.json', '0'),
    264: ('jsonfile10.json', '30'),
    # The JSON document with name 264 can be found in the jsonfile10.json file on line 30
}
By keeping an index for the most queried values, you can turn a linear time search into a logarithmic time search, not to mention that inserting a new document is much faster. In your case, you seem to need an index only on the name field.
Creating/updating the index is done when you insert, update or remove a document. Using a balanced binary tree can accelerate the updates on the index.
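A minimal sketch of such an index in Python, assuming every file follows the structure shown in the question. The function names add_to_index and build_index are made up for illustration, and a plain dict (a hash table, not a sorted map or balanced tree) stands in for the index, so lookups are constant time on average rather than logarithmic:

```python
import json
from collections import defaultdict
from pathlib import Path


def add_to_index(index, doc, filename):
    """Record every entry of doc["result"]["list"] under its "name" value."""
    for entry in doc.get("result", {}).get("list", []):
        index[entry["name"]].append({
            "file": filename,
            "userid": doc.get("userid"),
            "id": entry.get("id"),
            "age": entry.get("age"),
        })


def build_index(directory):
    """Read each *.json file in the directory once and index it by name."""
    index = defaultdict(list)
    for path in sorted(Path(directory).glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            add_to_index(index, json.load(handle), path.name)
    return index


# Example using the document from the question, so no files are needed.
sample = {
    "userid": 34535367,
    "result": {"list": [
        {"name": 264, "age": 64, "id": 456345345},
        {"name": 263, "age": 42, "id": 364563463456},
    ]},
}
index = defaultdict(list)
add_to_index(index, sample, "jsonfile10.json")
print(index[264])
```

The files are read only once while the index is built. With millions of files the index may not fit in memory, which is one reason to consider a real database instead.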
As a suggestion, why don't you just process all the incoming files and insert the data into a database? You will have a toolset to query that database. SQLite for example will do (as well as any other more sophisticated database):
http://www.sqlite.org/
http://docs.python.org/2/library/sqlite3.html
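As a rough sketch of the SQLite route, assuming the JSON structure from the question. The table name entries, the index name and the helper functions are hypothetical choices, and an in-memory database is used only so the example runs without touching the disk:

```python
import sqlite3


def create_store(conn):
    """Create the table and an index on the column that will be searched."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS entries "
        "(name INTEGER, age INTEGER, id INTEGER, userid INTEGER)"
    )
    conn.execute("CREATE INDEX IF NOT EXISTS idx_entries_name ON entries (name)")


def insert_document(conn, doc):
    """Flatten one JSON document into one row per list entry."""
    rows = [
        (entry["name"], entry.get("age"), entry["id"], doc["userid"])
        for entry in doc["result"]["list"]
    ]
    conn.executemany(
        "INSERT INTO entries (name, age, id, userid) VALUES (?, ?, ?, ?)", rows
    )


def find_by_name(conn, name):
    """Return (age, id, userid) tuples for every entry with this name."""
    cursor = conn.execute(
        "SELECT age, id, userid FROM entries WHERE name = ? ORDER BY id", (name,)
    )
    return cursor.fetchall()


sample = {
    "userid": 34535367,
    "result": {"list": [
        {"name": 264, "age": 64, "id": 456345345},
        {"name": 263, "age": 42, "id": 364563463456},
    ]},
}
conn = sqlite3.connect(":memory:")  # use a file path to keep the data
create_store(conn)
insert_document(conn, sample)
conn.commit()
print(find_by_name(conn, 264))
```

The database maintains the index on name itself, so there is no separate index file to keep in sync when documents are added or removed.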
Another simple solution might be to build a file that maps name_id to /file/path. You can then find a name id with a binary search in logarithmic time. But I'd still advise using a proper database, as maintaining the index yourself will be more cumbersome than doing some inserts/deletes.
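The binary-search idea could look roughly like this with the standard bisect module. The file paths and the helper names build_mapping and find_paths are invented for the example, and the mapping is kept in memory here rather than in a file:

```python
import bisect


def build_mapping(pairs):
    """Sort (name_id, path) pairs and split them into two parallel lists."""
    ordered = sorted(pairs)
    names = [name_id for name_id, _ in ordered]
    paths = [path for _, path in ordered]
    return names, paths


def find_paths(names, paths, name_id):
    """Binary-search the sorted names and return every matching path."""
    start = bisect.bisect_left(names, name_id)
    end = bisect.bisect_right(names, name_id)
    return paths[start:end]


# Hypothetical mapping: name 264 appears in two files.
pairs = [
    (264, "/data/jsonfile10.json"),
    (263, "/data/jsonfile10.json"),
    (264, "/data/jsonfile11.json"),
]
names, paths = build_mapping(pairs)
print(find_paths(names, paths, 264))
```

Each lookup is logarithmic, but every new file means re-sorting or inserting into the sorted lists, which is the maintenance cost mentioned above.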
I have a bunch of JSON files, and suppose each have the following structure:
{
    "fields": {
        "name": "Bob",
        "key": "bob"
    },
    "results": {
        "bob": { ... }
    }
}
For some unfortunate reason, while the structure of the JSON is fairly consistent, there is one dynamic key under "results". Defining the schema for what sits under "fields" is fairly straightforward to me.
So, for several JSON files, the final schema might be:
fieldSchema = StructField(...)
resultsSchema = StructField("results", StructType([StructField("bob", ...)]))
finalSchema = StructType([fieldSchema, resultsSchema])
Where the problem is this line: StructField("bob", ...)
Obviously, bob is not the key I'm looking for. This name for the StructField would ideally be some kind of wildcard character, regex pattern, or worst case, some dynamic field based on other fields.
I'm a newbie to Spark and have been scouring the documentation and historical StackOverflow posts, but I've been unable to find anything.
Long story short, I want to be able to pass some kind of wide net for the name parameter in StructField to encompass a variety of different keys, similar to a regex pattern.
I'm working with MongoDB and want to analyze the data extracted from this database with Python to visualize the required information. Three questions arise: 1) the data contains DBRef values that I don't know how to manipulate, 2) the data seems to be nested and needs to be broken down to a lower level, 3) can I convert a DBRef to a JSON file and then analyze it?
Thanks guys
Have a look at this.
This allows you to essentially "unpack" the DBRef and retrieve only the ids, if that is of any use to you.
Example:
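from bson.dbref import DBRef
from bson.objectid import ObjectId

x = {
    "oId": 567,
    "notice": [
        DBRef("noticeId", ObjectId("5f45177b93d7b757bcbd2d55"))
    ]
}
# A DBRef exposes the _id of the referenced document through its .id attribute
print(x.get('oId'), x.get('notice')[0].id)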
In my realtime database I have a path /stats which contains a set of documents.
Using the Python SDK, I want to get the /stats document as a dict. My code looks like this:
path = "/stats"
ref = db.reference(path, firebase_app)
document = ref.get()
print(document)
And the output is
[None, {'name': 'Full Time Statistics', 'thumbnail': 'https://***', 'url': 'https://***'}]
which is a list, not a dictionary. How can I change this and read this document path as a dictionary, something like this:
{"1": {'name': 'Full Time Statistics', 'thumbnail': 'https://***', 'url': 'https://***'}}
On the other hand, I can get other documents with similar structures as dictionaries with no issue. Why does this happen and how can I solve it?
Two things are happening here:
Since you are retrieving /stats you are getting all nodes under it. Since this is a repeated list and Firebase Realtime Database keys are strings, you'd normally get a dictionary (with the keys in the dictionary being the keys in the JSON).
Since your keys are numeric values, Firebase "thinks" you are trying to store an array/list and it tries to coerce the data into an array for you. That's why you get a None entry in the list: that's Firebase filling in the zeroth element for you.
There's unfortunately no way to disable this array coercion. I typically get around it by prefixing the keys with a fixed string, so that Firebase bypasses its array logic. So:
stats: {
stat1: { ... },
stat2: { ... }
}
Also see:
Best Practices: Arrays in Firebase
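If renaming the keys is not an option, the coerced list can also be turned back into a dictionary on the client side. This is only a sketch in plain Python, not part of the Firebase SDK: the helper name list_to_dict is made up, and it assumes that None entries are gaps filled in by the array coercion rather than real data.

```python
def list_to_dict(value):
    """Turn a coerced list back into a dict keyed by the original numeric keys."""
    if isinstance(value, list):
        # Skip the None placeholders that were filled in for missing indexes.
        return {str(i): item for i, item in enumerate(value) if item is not None}
    return value  # already a dict (or a scalar), leave it alone


# The value printed in the question, as returned by ref.get()
document = [
    None,
    {'name': 'Full Time Statistics', 'thumbnail': 'https://***', 'url': 'https://***'},
]
print(list_to_dict(document))
```

Values that already come back as dictionaries pass through unchanged, so the same helper can wrap every ref.get() call.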
I am trying to add data to the Firestore database without overwriting it. The data is in the format written below, there are numerous other "Question" entries in the same format, and I want to add all of them to just one document.
{
"Question": String,
"Answer": String,
}
The same question has been asked here, but it covers it in Java and not in Python. I have tried both updating and setting the document, but each attempt has only overwritten it.
Note that all of my Questions are elements in a list in this format:
['{\n "Question": String,\n "Answer": String\n}', ...]
What I am currently doing in my code is going through the array and performing the code below:
doc_ref = db.collection(u"Questions").document(u"ques")
doc_ref.update(questionsAnswers)
but this only leaves me with the last question added to the database.
Use the update method to change the contents of an existing document as shown in the documentation.
city_ref = db.collection(u'your-collection').document(u'your-document')
city_ref.update({u'your-field': u'your-field-value'})
I suggest also using the API documentation.
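One likely reason only the last question survives is that every update call writes the same two field names, "Question" and "Answer", so each call replaces the previous values. A sketch of one way around that, assuming each list element is a JSON string as described in the question: give every question its own field name and send them all in one update. The helper name, the question_ prefix and the sample questions are made up for illustration.

```python
import json


def build_update_payload(question_strings, prefix="question_"):
    """Parse each JSON string and store it under its own field name."""
    payload = {}
    for position, raw in enumerate(question_strings):
        payload[f"{prefix}{position}"] = json.loads(raw)
    return payload


# Placeholder data in the list format described in the question.
questions = [
    '{\n "Question": "What is 2 + 2?",\n "Answer": "4"\n}',
    '{\n "Question": "Capital of France?",\n "Answer": "Paris"\n}',
]
payload = build_update_payload(questions)
print(sorted(payload))

# With a Firestore client, the whole payload would go out in a single call:
# db.collection(u"Questions").document(u"ques").update(payload)
```

Because each question now has a distinct field name, later writes no longer collide with earlier ones.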
Disclaimer: Both Python and CouchDB are new for me. So far my "programming" has mostly consisted of Bash scripts.
I'm trying to create a small script that updates objects in a CouchDB database. The objects however aren't created by my script but by an App called Tap Forms that uses CouchDB for sync. Basically I'm trying to automatically update the content of the app. That also means I can't really influence the structure or names of the objects in CouchDB.
The Database is mostly filled with objects of this structure:
{
    "_id": "rec-3b17...",
    "_rev": "21-cdf6...",
    "values": {
        "fld-c3d4...": 4,
        "fld-1def...": 1000000000000,
        "fld-bb44...": 760000000000,
        "fld-a44f...": "admin,name",
        "fld-5fc0...": "SSD",
        "fld-642c...": true
    },
    "deviceName": "MacBook Air",
    "dateModified": "2019-02-08T14:47:06.051Z",
    "dateCreated": "2019-02-08T11:33:00.018Z",
    "type": "frm-7ff3...",
    "dbID": "db-1435...",
    "form": "frm-7ff3..."
}
I shortened the numbers a bit and removed some entries to increase readability.
Now the actual values I'm trying to update are within the "values" : {...} array (or object, or list, guess I don't have much experience with JSON either).
As I know some of these values, I managed to create a view that finds the _id of an object on the server. I then use the python-couchdb module as described in the documentation:
for item in db.view('CustomViews/test2', key="GENERIC"):
    doc = db[item.id]
This gives me the object. However, I want to update one of the values within the values array, let's say fld-c3d4.... But how? Using doc['values'] = 'new_value' replaces the whole array. I tried other (seemingly logical) ways along the lines of doc['values['fld-c3d4']'] = 'new_value' but couldn't wrap my head around it. I couldn't find an example in any documentation.
So here's an example of how to update fld-c3d4.
Your document behaves like a dictionary that contains a nested dictionary.
If you want to get the values, you will do something like this:
values = doc['values']
Now the variable values points to the values in your document.
From there, you can access a sub value:
values['fld-c3d4'] = 'new value'
If you want to directly update the value from the doc, you just have to chain those operations:
doc['values']['fld-c3d4'] = 'new value'
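The same thing can be tried on a plain dictionary, without a CouchDB server. This is only a sketch: the document below is a cut-down stand-in with shortened, hypothetical ids, and set_value is a made-up helper name.

```python
# Cut-down stand-in for a document fetched with doc = db[item.id]
doc = {
    "_id": "rec-3b17",
    "values": {"fld-c3d4": 4, "fld-5fc0": "SSD"},
}


def set_value(document, field_id, new_value):
    """Change one entry inside document["values"] and leave the others untouched."""
    document["values"][field_id] = new_value
    return document


set_value(doc, "fld-c3d4", "new value")
print(doc["values"])

# The change only exists in memory at this point. With python-couchdb the
# document would then be written back, for example with db.save(doc);
# check the library documentation for the exact call.
```

Assigning to doc['values'] directly replaces the whole nested dictionary, which is why the other fields disappeared in the first attempt.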