I retrieve data from my DB for a Python app and it comes in the following format (as a list, tbl):
[
{
"id": "rec2fiwnTQewTv9HC",
"createdTime": "2022-06-27T08:25:47.000Z",
"fields": {
"Num": 19,
"latitude": 31.101405,
"longitude": 36.391831,
"State": 2,
"Label": "xyz",
"Red": 0,
"Green": 255,
"Blue": 0
}
},
{
"id": "rec4y7vhgZVDHrhrQ",
"createdTime": "2022-06-27T08:25:47.000Z",
"fields": {
"Num": 30,
"latitude": 31.101405,
"longitude": 36.391831,
"State": 2,
"Label": "abc",
"Red": 0,
"Green": 255,
"Blue": 0
}
}
]
I can retrieve the values in the nested fields dictionary by doing this:
pd.DataFrame([d['fields'] for d in tbl])
I would like to add the id field to each row of the dataframe but I can't figure out how to do this.
Try:
import pandas as pd

data = [
{
"id": "rec2fiwnTQewTv9HC",
"createdTime": "2022-06-27T08:25:47.000Z",
"fields": {
"Num": 19,
"latitude": 31.101405,
"longitude": 36.391831,
"State": 2,
"Label": "xyz",
"Red": 0,
"Green": 255,
"Blue": 0,
},
},
{
"id": "rec4y7vhgZVDHrhrQ",
"createdTime": "2022-06-27T08:25:47.000Z",
"fields": {
"Num": 30,
"latitude": 31.101405,
"longitude": 36.391831,
"State": 2,
"Label": "abc",
"Red": 0,
"Green": 255,
"Blue": 0,
},
},
]
# Merge each record's id with its nested "fields" dict via ** unpacking.
df = pd.DataFrame([{"id": d["id"], **d["fields"]} for d in data])
print(df)
Prints:
id Num latitude longitude State Label Red Green Blue
0 rec2fiwnTQewTv9HC 19 31.101405 36.391831 2 xyz 0 255 0
1 rec4y7vhgZVDHrhrQ 30 31.101405 36.391831 2 abc 0 255 0
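As an aside, the ** unpacking is what merges each nested "fields" dict with the id into a single flat record. pandas can also do the flattening for you with json_normalize, though the nested keys come back with a "fields." prefix. A minimal sketch, assuming the same data list as above and that you don't need createdTime:

import pandas as pd

# json_normalize flattens nested dicts into dotted column names
# (e.g. "fields.Num"); strip the prefix to get the bare field names.
df = pd.json_normalize(data)
df.columns = [c.removeprefix("fields.") for c in df.columns]  # Python 3.9+
df = df.drop(columns=["createdTime"])
print(df)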
Take a look at this example dictionary, stored in a JSON file:
{
"fridge": {
"vegetables": {
"Cucumber": 0,
"Carrot": 2,
"Lettuce": 5
},
"drinks": {
"Water": 12,
"Juice": 4,
"Soda": 2
}
}
}
So in this example we are showing the contents of my fridge: the AMOUNT (e.g. 2) of every ITEM (e.g. Soda) in every CATEGORY (e.g. drinks). The way we did this: we first created a dictionary for the fridge, and for every category we have another dictionary inside it that maps each item type to its amount.
Now let's say we went shopping... We bought some FRUITS at the supermarket and we got:
"fruits": {
"Apple": 3,
"Banana": 2,
"Melon": 1
}
We want to put this data (the fruits) into my fridge, except we don't have a CATEGORY for "fruits"! So not only do we have to add a new dictionary to my fridge, we also already have the data we want to put inside it.
Now, this fridge thing was an example to help you understand what I want. So how do you insert a new dictionary into an already existing one that has key-value pairs in it? In other words, I want my fridge to look like this:
{
"fridge": {
"vegetables": {
"Cucumber": 0,
"Carrot": 2,
"Lettuce": 5
},
"drinks": {
"Water": 12,
"Juice": 4,
"Soda": 2
},
"fruits": {
"Apple": 3,
"Banana": 2,
"Melon": 1
}
}
}
I tried append, but as expected it does not work for dictionaries (it is for lists only), so I do not know what to do... keep in mind that I do not want to re-define my data, I want to ADD data to existing data so that I can edit it later. Would appreciate some help, thanks!
Suppose you have this string in proper JSON format:
j_str1='''\
{"fridge": {
"vegetables": {
"Cucumber": 0,
"Carrot": 2,
"Lettuce": 5
},
"drinks": {
"Water": 12,
"Juice": 4,
"Soda": 2
}
}}'''
And:
j_str2='''\
{"fruits": {
"Apple": 3,
"Banana": 2,
"Melon": 1
}}
'''
First convert to a Python dict:
import json

j = json.loads(j_str1)
>>> j
{'fridge': {'vegetables': {'Cucumber': 0, 'Carrot': 2, 'Lettuce': 5}, 'drinks': {'Water': 12, 'Juice': 4, 'Soda': 2}}}
Then update:
j["fridge"].update(json.loads(j_str2))
>>> j
{'fridge': {'vegetables': {'Cucumber': 0, 'Carrot': 2, 'Lettuce': 5}, 'drinks': {'Water': 12, 'Juice': 4, 'Soda': 2}, 'fruits': {'Apple': 3, 'Banana': 2, 'Melon': 1}}}
Then convert back to JSON:
>>> print(json.dumps(j, indent=3))
{
"fridge": {
"vegetables": {
"Cucumber": 0,
"Carrot": 2,
"Lettuce": 5
},
"drinks": {
"Water": 12,
"Juice": 4,
"Soda": 2
},
"fruits": {
"Apple": 3,
"Banana": 2,
"Melon": 1
}
}
}
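If your data is already a Python dict rather than a JSON string, you don't need the string round-trip at all: plain key assignment (or dict.update) adds the new category in place. A minimal sketch of the same operation:

fridge = {
    "fridge": {
        "vegetables": {"Cucumber": 0, "Carrot": 2, "Lettuce": 5},
        "drinks": {"Water": 12, "Juice": 4, "Soda": 2},
    }
}

# Assigning to a new key adds the "fruits" category without
# re-defining the existing data.
fridge["fridge"]["fruits"] = {"Apple": 3, "Banana": 2, "Melon": 1}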
I have a DataFrame with lists in one column.
I want to pretty-print the data as JSON. How can I use indentation without the lists in each cell being spread across multiple lines?
An example:
import pandas as pd

df = pd.DataFrame(range(3))
df["lists"] = [list(range(i+1)) for i in range(3)]
print(df)
output:
0 lists
0 0 [0]
1 1 [0, 1]
2 2 [0, 1, 2]
Now I want to print the data as JSON using:
print(df.to_json(orient="index", indent=2))
output:
{
"0":{
"0":0,
"lists":[
0
]
},
"1":{
"0":1,
"lists":[
0,
1
]
},
"2":{
"0":2,
"lists":[
0,
1,
2
]
}
}
desired output:
{
"0":{
"0":0,
"lists":[0]
},
"1":{
"0":1,
"lists":[0,1]
},
"2":{
"0":2,
"lists":[0,1,2]
}
}
If you don't mind the lists being rendered as strings, you can cast the list column to str temporarily when printing the dataframe:
print(df.astype({'lists': 'str'}).to_json(orient="index", indent=2))
{
"0":{
"0":0,
"lists":"[0]"
},
"1":{
"0":1,
"lists":"[0, 1]"
},
"2":{
"0":2,
"lists":"[0, 1, 2]"
}
}
If you don't want to see the quote marks, you can use a regex to strip them:

import re

# Match "lists":"..." and drop the quotes around the bracketed value.
result = re.sub(r'("lists":)"([^"]*)"', r"\1 \2",
                df.astype({'lists': 'str'}).to_json(orient="index", indent=2))
print(result)
{
"0":{
"0":0,
"lists": [0]
},
"1":{
"0":1,
"lists": [0, 1]
},
"2":{
"0":2,
"lists": [0, 1, 2]
}
}
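If you'd rather avoid the string cast and the regex, another option is to rebuild the output yourself: round-trip the frame through to_json so every value is a plain Python object, then dump dicts with indentation but lists inline. A minimal sketch (dumps_compact_lists is a hypothetical helper, not a pandas API; df is the dataframe from the question):

import json

# Round-trip converts numpy scalars to plain Python objects.
data = json.loads(df.to_json(orient="index"))

def dumps_compact_lists(obj, level=0, step=2):
    # Dicts get one indent level per depth; lists are emitted inline.
    pad = " " * (level * step)
    child = " " * ((level + 1) * step)
    if isinstance(obj, dict):
        items = ",\n".join(
            f'{child}"{k}":{dumps_compact_lists(v, level + 1, step)}'
            for k, v in obj.items()
        )
        return "{\n" + items + "\n" + pad + "}"
    if isinstance(obj, list):
        return json.dumps(obj, separators=(",", ":"))
    return json.dumps(obj)

print(dumps_compact_lists(data))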
I'm relatively new to Elasticsearch and am having a problem determining why the number of records in a Python dataframe differs from the index's document count in Elasticsearch.
I start by loading the data into a dataframe, which has 62932 records, and then create the index in Elasticsearch from it with Python.
When I check the index in Kibana under Management / Index Management, there are only 62630 documents, and the Stats window shows a deleted count of 302. I don't know what this means.
Below is the output from the Stats window:
{
"_shards": {
"total": 2,
"successful": 1,
"failed": 0
},
"stats": {
"uuid": "egOx_6EwTFysBr0WkJyR1Q",
"primaries": {
"docs": {
"count": 62630,
"deleted": 302
},
"store": {
"size_in_bytes": 4433722
},
"indexing": {
"index_total": 62932,
"index_time_in_millis": 3235,
"index_current": 0,
"index_failed": 0,
"delete_total": 0,
"delete_time_in_millis": 0,
"delete_current": 0,
"noop_update_total": 0,
"is_throttled": false,
"throttle_time_in_millis": 0
},
"get": {
"total": 0,
"time_in_millis": 0,
"exists_total": 0,
"exists_time_in_millis": 0,
"missing_total": 0,
"missing_time_in_millis": 0,
"current": 0
},
"search": {
"open_contexts": 0,
"query_total": 140,
"query_time_in_millis": 1178,
"query_current": 0,
"fetch_total": 140,
"fetch_time_in_millis": 1233,
"fetch_current": 0,
"scroll_total": 1,
"scroll_time_in_millis": 6262,
"scroll_current": 0,
"suggest_total": 0,
"suggest_time_in_millis": 0,
"suggest_current": 0
},
"merges": {
"current": 0,
"current_docs": 0,
"current_size_in_bytes": 0,
"total": 2,
"total_time_in_millis": 417,
"total_docs": 62932,
"total_size_in_bytes": 4882755,
"total_stopped_time_in_millis": 0,
"total_throttled_time_in_millis": 0,
"total_auto_throttle_in_bytes": 20971520
},
"refresh": {
"total": 26,
"total_time_in_millis": 597,
"external_total": 24,
"external_total_time_in_millis": 632,
"listeners": 0
},
"flush": {
"total": 1,
"periodic": 0,
"total_time_in_millis": 10
},
"warmer": {
"current": 0,
"total": 23,
"total_time_in_millis": 0
},
"query_cache": {
"memory_size_in_bytes": 17338,
"total_count": 283,
"hit_count": 267,
"miss_count": 16,
"cache_size": 4,
"cache_count": 4,
"evictions": 0
},
"fielddata": {
"memory_size_in_bytes": 0,
"evictions": 0
},
"completion": {
"size_in_bytes": 0
},
"segments": {
"count": 2,
"memory_in_bytes": 22729,
"terms_memory_in_bytes": 17585,
"stored_fields_memory_in_bytes": 2024,
"term_vectors_memory_in_bytes": 0,
"norms_memory_in_bytes": 512,
"points_memory_in_bytes": 2112,
"doc_values_memory_in_bytes": 496,
"index_writer_memory_in_bytes": 0,
"version_map_memory_in_bytes": 0,
"fixed_bit_set_memory_in_bytes": 0,
"max_unsafe_auto_id_timestamp": -1,
"file_sizes": {}
},
"translog": {
"operations": 62932,
"size_in_bytes": 17585006,
"uncommitted_operations": 0,
"uncommitted_size_in_bytes": 55,
"earliest_last_modified_age": 0
},
"request_cache": {
"memory_size_in_bytes": 0,
"evictions": 0,
"hit_count": 0,
"miss_count": 0
},
"recovery": {
"current_as_source": 0,
"current_as_target": 0,
"throttle_time_in_millis": 0
}
},
"total": { ... identical to "primaries" above, since the index has one primary shard ... }
}
}
Why does the doc count differ from the index total? I've exported the data and the number of records matches the doc count. How can I find out why documents were deleted, and how can I make sure they are not deleted in the future?
The deleted count almost always means some documents were indexed more than once with the same _id: re-indexing an existing _id marks the old version as deleted and writes a new one, and the deleted versions linger until segments are merged. Your numbers fit: 62630 live documents + 302 deleted = 62932, your index_total (and delete_total is 0, so nothing was explicitly deleted). Until a merge purges them, deleted documents have a few side effects:
Deleted documents tie up disk space in the index.
In-memory per-document data structures, such as norms or field data, will still consume RAM for deleted documents.
Search throughput is lower, since each search must check the deleted bitset for every potential hit. More on this in the second link below.
Aggregate term statistics, used for query scoring, will still reflect deleted terms and documents. When a merge completes, the term statistics will suddenly jump closer to their true values, changing hit scores. In practice this impact is minor, unless the deleted documents had divergent statistics from the rest of the index.
A deleted document ties up a document ID from the maximum 2.1 B documents for a single shard. If your shard is riding close to that limit (not recommended!) this could matter.
Fuzzy queries can have slightly different results, because they may match ghost terms.
https://www.elastic.co/guide/en/elasticsearch/reference/current/cat-indices.html
https://www.elastic.co/blog/lucenes-handling-of-deleted-documents
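To check whether that's what happened, look for duplicate keys in the dataframe before indexing. A minimal sketch, assuming df is the dataframe you index from and that the column mapped to _id is called "id" (hypothetical; adjust to your schema):

# Each row whose "id" repeats an earlier one overwrites that document
# in Elasticsearch and bumps the deleted count instead of adding a doc.
dupes = df["id"].duplicated().sum()
print(f"{dupes} rows re-use an earlier _id")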
I created a nested dictionary in Python like this:
{
    "Laptop": {
        "sony": 1,
        "apple": 2,
        "asus": 5
    },
    "Camera": {
        "sony": 2,
        "sumsung": 1,
        "nikon": 4
    }
}
But I couldn't figure out how to write this nested dict into a JSON file. Any comments will be appreciated!
import json

d = {
    "Laptop": {
        "sony": 1,
        "apple": 2,
        "asus": 5,
    },
    "Camera": {
        "sony": 2,
        "sumsung": 1,
        "nikon": 4,
    },
}
with open("my.json", "w") as f:
    json.dump(d, f)
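If you want the file to be human-readable, json.dump also accepts an indent argument:

with open("my.json", "w") as f:
    json.dump(d, f, indent=3)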