I have a DataFrame with lists in one column, and I want to pretty-print the data as JSON. How can I use indentation without also indenting the list values inside each cell?
An example:
import pandas as pd

df = pd.DataFrame(range(3))
df["lists"] = [list(range(i+1)) for i in range(3)]
print(df)
output:
   0      lists
0  0        [0]
1  1     [0, 1]
2  2  [0, 1, 2]
Now I want to print the data as JSON using:
print(df.to_json(orient="index", indent=2))
output:
{
  "0":{
    "0":0,
    "lists":[
      0
    ]
  },
  "1":{
    "0":1,
    "lists":[
      0,
      1
    ]
  },
  "2":{
    "0":2,
    "lists":[
      0,
      1,
      2
    ]
  }
}
desired output:
{
  "0":{
    "0":0,
    "lists":[0]
  },
  "1":{
    "0":1,
    "lists":[0,1]
  },
  "2":{
    "0":2,
    "lists":[0,1,2]
  }
}
If you don't need strictly valid JSON output, you can temporarily convert the list column to strings when printing the DataFrame:
print(df.astype({'lists':'str'}).to_json(orient="index", indent=2))
{
  "0":{
    "0":0,
    "lists":"[0]"
  },
  "1":{
    "0":1,
    "lists":"[0, 1]"
  },
  "2":{
    "0":2,
    "lists":"[0, 1, 2]"
  }
}
If you don't want to see the quote marks, you can use a regex to remove them:
import re

result = re.sub(r'("lists":)"([^"]*)"', r"\1 \2",
                df.astype({'lists':'str'}).to_json(orient="index", indent=2))
print(result)
{
  "0":{
    "0":0,
    "lists": [0]
  },
  "1":{
    "0":1,
    "lists": [0, 1]
  },
  "2":{
    "0":2,
    "lists": [0, 1, 2]
  }
}
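If you prefer to keep real JSON arrays rather than strings, a minimal alternative sketch is to round-trip through to_json and collapse each flat array onto one line afterwards; this assumes the lists contain only scalars (no nested brackets):
import json
import re

# Round-trip through to_json so NumPy scalars are already serialized,
# then re-dump with indentation and collapse each flat [...] span.
text = json.dumps(json.loads(df.to_json(orient="index")), indent=2)
result = re.sub(r"\[[^][{}]*\]", lambda m: " ".join(m.group().split()), text)
print(result)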
I have a json file which looks like this:
"Aveiro": {
"Albergaria-a-Velha": {
"candidates": [
{
"effectiveCandidates": [
"JOSÉ OLIVEIRA SANTOS"
],
"party": "B.E.",
"votes": {
"absoluteMajority": 0,
"acronym": "B.E.",
"constituenctyCounter": 1,
"mandates": 0,
"percentage": 1.34,
"presidents": 0,
"validVotesPercentage": 1.4,
"votes": 179
}
},
{
"effectiveCandidates": [
"ANTÓNIO AUGUSTO AMARAL LOUREIRO E SANTOS"
],
"party": "CDS-PP",
"votes": {
"absoluteMajority": 1,
"acronym": "CDS-PP",
"constituenctyCounter": 1,
"mandates": 5,
"percentage": 59.7,
"presidents": 1,
"validVotesPercentage": 62.5,
"votes": 7970
}
},
{
"effectiveCandidates": [
"CARLOS MANUEL DA COSTA SERVEIRA VASQUES"
],
"party": "CH",
"votes": {
"absoluteMajority": 0,
"acronym": "CH",
"constituenctyCounter": 1,
"mandates": 0,
"percentage": 1.87,
"presidents": 0,
"validVotesPercentage": 1.95,
"votes": 249
}
},
{
"effectiveCandidates": [
"RODRIGO MANUEL PEREIRA MARQUES LOURENÇO"
],
"party": "PCP-PEV",
"votes": {
"absoluteMajority": 0,
"acronym": "PCP-PEV",
"constituenctyCounter": 1,
"mandates": 0,
"percentage": 1.57,
"presidents": 0,
"validVotesPercentage": 1.65,
"votes": 210
}
},
{
"effectiveCandidates": [
"DELFINA LISBOA MARTINS DA CUNHA"
],
"party": "PPD/PSD",
"votes": {
"absoluteMajority": 0,
"acronym": "PPD/PSD",
"constituenctyCounter": 1,
"mandates": 2,
"percentage": 24.23,
"presidents": 0,
"validVotesPercentage": 25.37,
"votes": 3235
}
},
{
"effectiveCandidates": [
"JESUS MANUEL VIDINHA TOMÁS"
],
"party": "PS",
"votes": {
"absoluteMajority": 0,
"acronym": "PS",
"constituenctyCounter": 1,
"mandates": 0,
"percentage": 6.82,
"presidents": 0,
"validVotesPercentage": 7.14,
"votes": 910
}
}
],
"parentTerritoryName": "Aveiro",
"territoryKey": "LOCAL-010200",
"territoryName": "Albergaria-a-Velha",
"total_votes": {
"availableMandates": 0,
"blankVotes": 377,
"blankVotesPercentage": 2.82,
"displayMessage": null,
"hasNoVoting": false,
"nullVotes": 221,
"nullVotesPercentage": 1.66,
"numberParishes": 6,
"numberVoters": 13351,
"percentageVoters": 59.48
}
},
The full file is here for reference
I thought that this code would work:
import pandas as pd
import json

with open('autarquicas_2021.json') as f:
    data = json.load(f)

df = pd.json_normalize(data)
However this is returning the following:
df.head()
Aveiro.Albergaria-a-Velha.candidates ... Évora.Évora.total_votes.percentageVoters
0 [{'effectiveCandidates': ['JOSÉ OLIVEIRA SANTO... ... 49.84
[1 rows x 4312 columns]
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Columns: 4312 entries, Aveiro.Albergaria-a-Velha.candidates to Évora.Évora.total_votes.percentageVoters
dtypes: bool(308), float64(924), int64(1540), object(1540)
memory usage: 31.7+ KB
None
For some reason the code is not working, and my research has led me to no solutions; it seems that every JSON file has a mind of its own.
Any help would be much appreciated. Thank you in advance!
Disclaimer: This is for an open source project to bring more transparency into local elections in Portugal. It will not be used for commercial, or for profit projects.
You can use json_normalize with a little transformation of the original JSON format.
First, convert the JSON into a list format.
I am assuming "Aveiro" is the city and "Albergaria-a-Velha" is the district. Apologies for my unfamiliarity with the area; if that is wrong, please rename the keys.
res = [{**v, 'city': city, 'district': district}
       for city, districts in data.items()
       for district, v in districts.items()]
This transforms the original key-value style JSON into a list of objects:
[{
  "city": "Aveiro",
  "district": "Albergaria-a-Velha",
  "candidates": [{
    ...
  }],
  ...
}]
Then use json_normalize.
df = pd.json_normalize(res, record_path=['candidates'], meta=['total_votes', 'city', 'district'])
Further expand the nested total_votes object:
df = pd.concat([df, pd.json_normalize(df['total_votes'])], axis=1)
>>> df.iloc[0]
effectiveCandidates [JOSÉ OLIVEIRA SANTOS]
party B.E.
votes.absoluteMajority 0
votes.acronym B.E.
votes.constituenctyCounter 1
votes.mandates 0
votes.percentage 1.34
votes.presidents 0
votes.validVotesPercentage 1.4
votes.votes 179
total_votes {'availableMandates': 0, 'blankVotes': 377, 'b...
city Aveiro
district Albergaria-a-Velha
availableMandates 0
blankVotes 377
blankVotesPercentage 2.82
displayMessage None
hasNoVoting False
nullVotes 221
nullVotesPercentage 1.66
numberParishes 6
numberVoters 13351
percentageVoters 59.48
Name: 0, dtype: object
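Optionally, since total_votes has now been expanded into its own columns, the original nested-dict column can be dropped:
# Optional cleanup: the expanded columns make the raw dict column redundant.
df = df.drop(columns=['total_votes'])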
Recursive Approach:
I usually use this function (a recursive approach) to do that kind of thing:
# Function for flattening nested JSON
def flatten_json(y):
    out = {}

    def flatten(x, name=''):
        # If the nested key-value pair is of dict type
        if type(x) is dict:
            for a in x:
                flatten(x[a], name + a + '_')
        # If the nested key-value pair is of list type
        elif type(x) is list:
            i = 0
            for a in x:
                flatten(a, name + str(i) + '_')
                i += 1
        else:
            out[name[:-1]] = x

    flatten(y)
    return out
You can call flatten_json to flatten your nested JSON:
# Driver code
print(flatten_json(data))
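Note that flattening the whole file this way yields a single huge row. For this particular dataset you probably want one flattened record per district; a minimal sketch of that, assuming the same city/district nesting as above:
import pandas as pd

# One row per (city, district); flatten only the district payload.
rows = [
    {'city': city, 'district': district, **flatten_json(payload)}
    for city, districts in data.items()
    for district, payload in districts.items()
]
df = pd.DataFrame(rows)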
Library-based approach:
from flatten_json import flatten

unflat_json = {'user': {'foo': {'UserID': '0123456',
                                'Email': 'foo@mail.com',
                                'friends': ['Johnny', 'Mark', 'Tom']}}}
flat_json = flatten(unflat_json)
print(flat_json)
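With flatten_json's default underscore separator, this should print something along the lines of:
{'user_foo_UserID': '0123456', 'user_foo_Email': 'foo@mail.com', 'user_foo_friends_0': 'Johnny', 'user_foo_friends_1': 'Mark', 'user_foo_friends_2': 'Tom'}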
I have a dict stored under the variable parsed:
{
  "8119300029": {
    "store": 4,
    "total": 4,
    "web": 4
  },
  "8119300030": {
    "store": 2,
    "total": 2,
    "web": 2
  },
  "8119300031": {
    "store": 0,
    "total": 0,
    "web": 0
  },
  "8119300032": {
    "store": 1,
    "total": 1,
    "web": 1
  },
  "8119300033": {
    "store": 0,
    "total": 0,
    "web": 0
  },
  "8119300034": {
    "store": 2,
    "total": 2,
    "web": 2
  },
  "8119300036": {
    "store": 0,
    "total": 0,
    "web": 0
  },
  "8119300037": {
    "store": 0,
    "total": 0,
    "web": 0
  },
  "8119300038": {
    "store": 2,
    "total": 2,
    "web": 2
  },
  "8119300039": {
    "store": 3,
    "total": 3,
    "web": 3
  },
  "8119300040": {
    "store": 3,
    "total": 3,
    "web": 3
  },
  "8119300041": {
    "store": 0,
    "total": 0,
    "web": 0
  }
}
I am trying to get the "web" value from each JSON entry but can only get the key values.
for x in parsed:
    print(x["web"])
I tried doing this ^ but kept getting this error: "string indices must be integers". Can somebody explain why this is wrong?
Because your x variable is the dict key name, not the value:
for x in parsed:
    print(parsed[x]['web'])
A little information on your parsed data: this is basically a dictionary of dictionaries. I won't go into too much of the nitty-gritty, but it would be worth reading up a bit on JSON: https://www.w3schools.com/python/python_json.asp
In your example, for x in parsed iterates through the keys of the parsed dictionary, e.g. 8119300029, 8119300030, etc. So x is a key (in this case, a string), not a dictionary. The reason you're getting an error about string indices is that you're trying to index a string: for example, x[0] would give you the first character 8 of the key 8119300029.
If you need to get each web value, then you need to access that key in the parsed[x] dictionary:
for x in parsed:
    print(parsed[x]["web"])
Output:
4
2
0
...
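A slightly more idiomatic variant iterates over key-value pairs directly with .items(), which avoids the extra lookup:
for key, value in parsed.items():
    print(value["web"])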
I'm relatively new to Elasticsearch and am having a problem determining why the number of records in a Python DataFrame differs from the index's document count in Elasticsearch.
I start by creating a DataFrame with 62932 records and then index it into Elasticsearch from Python.
When I check the index in Kibana under Management/Index Management there are only 62630 documents, and according to the Stats window there is a deleted count of 302. I don't know what this means.
Below is the output from the STATS window
{
  "_shards": {
    "total": 2,
    "successful": 1,
    "failed": 0
  },
  "stats": {
    "uuid": "egOx_6EwTFysBr0WkJyR1Q",
    "primaries": {
      "docs": {
        "count": 62630,
        "deleted": 302
      },
      "store": {
        "size_in_bytes": 4433722
      },
      "indexing": {
        "index_total": 62932,
        "index_time_in_millis": 3235,
        "index_current": 0,
        "index_failed": 0,
        "delete_total": 0,
        "delete_time_in_millis": 0,
        "delete_current": 0,
        "noop_update_total": 0,
        "is_throttled": false,
        "throttle_time_in_millis": 0
      },
      "get": {
        "total": 0,
        "time_in_millis": 0,
        "exists_total": 0,
        "exists_time_in_millis": 0,
        "missing_total": 0,
        "missing_time_in_millis": 0,
        "current": 0
      },
      "search": {
        "open_contexts": 0,
        "query_total": 140,
        "query_time_in_millis": 1178,
        "query_current": 0,
        "fetch_total": 140,
        "fetch_time_in_millis": 1233,
        "fetch_current": 0,
        "scroll_total": 1,
        "scroll_time_in_millis": 6262,
        "scroll_current": 0,
        "suggest_total": 0,
        "suggest_time_in_millis": 0,
        "suggest_current": 0
      },
      "merges": {
        "current": 0,
        "current_docs": 0,
        "current_size_in_bytes": 0,
        "total": 2,
        "total_time_in_millis": 417,
        "total_docs": 62932,
        "total_size_in_bytes": 4882755,
        "total_stopped_time_in_millis": 0,
        "total_throttled_time_in_millis": 0,
        "total_auto_throttle_in_bytes": 20971520
      },
      "refresh": {
        "total": 26,
        "total_time_in_millis": 597,
        "external_total": 24,
        "external_total_time_in_millis": 632,
        "listeners": 0
      },
      "flush": {
        "total": 1,
        "periodic": 0,
        "total_time_in_millis": 10
      },
      "warmer": {
        "current": 0,
        "total": 23,
        "total_time_in_millis": 0
      },
      "query_cache": {
        "memory_size_in_bytes": 17338,
        "total_count": 283,
        "hit_count": 267,
        "miss_count": 16,
        "cache_size": 4,
        "cache_count": 4,
        "evictions": 0
      },
      "fielddata": {
        "memory_size_in_bytes": 0,
        "evictions": 0
      },
      "completion": {
        "size_in_bytes": 0
      },
      "segments": {
        "count": 2,
        "memory_in_bytes": 22729,
        "terms_memory_in_bytes": 17585,
        "stored_fields_memory_in_bytes": 2024,
        "term_vectors_memory_in_bytes": 0,
        "norms_memory_in_bytes": 512,
        "points_memory_in_bytes": 2112,
        "doc_values_memory_in_bytes": 496,
        "index_writer_memory_in_bytes": 0,
        "version_map_memory_in_bytes": 0,
        "fixed_bit_set_memory_in_bytes": 0,
        "max_unsafe_auto_id_timestamp": -1,
        "file_sizes": {}
      },
      "translog": {
        "operations": 62932,
        "size_in_bytes": 17585006,
        "uncommitted_operations": 0,
        "uncommitted_size_in_bytes": 55,
        "earliest_last_modified_age": 0
      },
      "request_cache": {
        "memory_size_in_bytes": 0,
        "evictions": 0,
        "hit_count": 0,
        "miss_count": 0
      },
      "recovery": {
        "current_as_source": 0,
        "current_as_target": 0,
        "throttle_time_in_millis": 0
      }
    },
    "total": {
      "docs": {
        "count": 62630,
        "deleted": 302
      },
      "store": {
        "size_in_bytes": 4433722
      },
      "indexing": {
        "index_total": 62932,
        "index_time_in_millis": 3235,
        "index_current": 0,
        "index_failed": 0,
        "delete_total": 0,
        "delete_time_in_millis": 0,
        "delete_current": 0,
        "noop_update_total": 0,
        "is_throttled": false,
        "throttle_time_in_millis": 0
      },
      "get": {
        "total": 0,
        "time_in_millis": 0,
        "exists_total": 0,
        "exists_time_in_millis": 0,
        "missing_total": 0,
        "missing_time_in_millis": 0,
        "current": 0
      },
      "search": {
        "open_contexts": 0,
        "query_total": 140,
        "query_time_in_millis": 1178,
        "query_current": 0,
        "fetch_total": 140,
        "fetch_time_in_millis": 1233,
        "fetch_current": 0,
        "scroll_total": 1,
        "scroll_time_in_millis": 6262,
        "scroll_current": 0,
        "suggest_total": 0,
        "suggest_time_in_millis": 0,
        "suggest_current": 0
      },
      "merges": {
        "current": 0,
        "current_docs": 0,
        "current_size_in_bytes": 0,
        "total": 2,
        "total_time_in_millis": 417,
        "total_docs": 62932,
        "total_size_in_bytes": 4882755,
        "total_stopped_time_in_millis": 0,
        "total_throttled_time_in_millis": 0,
        "total_auto_throttle_in_bytes": 20971520
      },
      "refresh": {
        "total": 26,
        "total_time_in_millis": 597,
        "external_total": 24,
        "external_total_time_in_millis": 632,
        "listeners": 0
      },
      "flush": {
        "total": 1,
        "periodic": 0,
        "total_time_in_millis": 10
      },
      "warmer": {
        "current": 0,
        "total": 23,
        "total_time_in_millis": 0
      },
      "query_cache": {
        "memory_size_in_bytes": 17338,
        "total_count": 283,
        "hit_count": 267,
        "miss_count": 16,
        "cache_size": 4,
        "cache_count": 4,
        "evictions": 0
      },
      "fielddata": {
        "memory_size_in_bytes": 0,
        "evictions": 0
      },
      "completion": {
        "size_in_bytes": 0
      },
      "segments": {
        "count": 2,
        "memory_in_bytes": 22729,
        "terms_memory_in_bytes": 17585,
        "stored_fields_memory_in_bytes": 2024,
        "term_vectors_memory_in_bytes": 0,
        "norms_memory_in_bytes": 512,
        "points_memory_in_bytes": 2112,
        "doc_values_memory_in_bytes": 496,
        "index_writer_memory_in_bytes": 0,
        "version_map_memory_in_bytes": 0,
        "fixed_bit_set_memory_in_bytes": 0,
        "max_unsafe_auto_id_timestamp": -1,
        "file_sizes": {}
      },
      "translog": {
        "operations": 62932,
        "size_in_bytes": 17585006,
        "uncommitted_operations": 0,
        "uncommitted_size_in_bytes": 55,
        "earliest_last_modified_age": 0
      },
      "request_cache": {
        "memory_size_in_bytes": 0,
        "evictions": 0,
        "hit_count": 0,
        "miss_count": 0
      },
      "recovery": {
        "current_as_source": 0,
        "current_as_target": 0,
        "throttle_time_in_millis": 0
      }
    }
  }
}
Why does the doc count differ from the index total? I've exported the data and the number of records matches the doc count. How can I find out why documents were deleted, and make sure they are not deleted in the future?
The deleted count usually means some documents were indexed more than once with the same _id: Elasticsearch treats the second write as an update, which marks the old version of the document as deleted. That matches your numbers exactly: index_total (62932) = docs.count (62630) + docs.deleted (302). The deleted copies are purged when segments merge. In the meantime, deleted documents have several effects:
Deleted documents tie up disk space in the index.
In-memory per-document data structures, such as norms or field data, will still consume RAM for deleted documents.
Search throughput is lower, since each search must check the deleted bitset for every potential hit.
Aggregate term statistics, used for query scoring, will still reflect deleted terms and documents. When a merge completes, the term statistics will suddenly jump closer to their true values, changing hit scores. In practice this impact is minor, unless the deleted documents had divergent statistics from the rest of the index.
A deleted document ties up a document ID from the maximum 2.1 B documents for a single shard. If your shard is riding close to that limit (not recommended!) this could matter.
Fuzzy queries can have slightly different results, because they may match ghost terms.
https://www.elastic.co/guide/en/elasticsearch/reference/current//cat-indices.html
https://www.elastic.co/blog/lucenes-handling-of-deleted-documents
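If the duplicate-_id explanation fits, you can verify it before indexing by counting duplicates in whatever DataFrame column you use as the document _id. A minimal sketch, where doc_id is a hypothetical column name (rename to match your data):
# 'doc_id' is a hypothetical name for the column used as the Elasticsearch _id.
n_overwrites = df['doc_id'].duplicated().sum()
print(f"{n_overwrites} rows would overwrite an earlier document")
# If this prints 302, it accounts exactly for the deleted count you saw.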
I am creating an h5 file with 5 datasets, with names like ['a160'], ['a1214'].
How can I make it so that the datasets are sorted by the dataset name?
For example, when I run h5dump on my file I get:
HDF5 "jjjj.h5" {
GROUP "/" {
DATASET "a1214" {
DATATYPE H5T_IEEE_F32BE
DATASPACE SIMPLE { ( 1, 19 ) / ( H5S_UNLIMITED, 19 ) }
DATA {
(0,0): 160, 0, 165, 4, 2.29761, 264, 4, 1.74368, 1, 0, 17, 193, 0, 0,
(0,14): 0, 0, 0, 0, 0
}
}
DATASET "a160" {
DATATYPE H5T_IEEE_F32BE
DATASPACE SIMPLE { ( 3, 19 ) / ( H5S_UNLIMITED, 19 ) }
DATA {
(0,0): 263, 0, 262, 7, 4.90241, 201, 34, 0.348432, 1, 0, 29, 11, 0, 0,
(0,14): 0, 0, 0, 0, 0,
}
}
But I want it ordered numerically by the dataset name; I need h5dump to output:
HDF5 "jjjj.h5" {
GROUP "/" {
DATASET "a160" {
DATATYPE H5T_IEEE_F32BE
DATASPACE SIMPLE { ( 3, 19 ) / ( H5S_UNLIMITED, 19 ) }
DATA {
(0,0): 263, 0, 262, 7, 4.90241, 201, 34, 0.348432, 1, 0, 29, 11, 0, 0,
(0,14): 0, 0, 0, 0, 0,
}
}
DATASET "a1214" {
DATATYPE H5T_IEEE_F32BE
DATASPACE SIMPLE { ( 1, 19 ) / ( H5S_UNLIMITED, 19 ) }
DATA {
(0,0): 160, 0, 165, 4, 2.29761, 264, 4, 1.74368, 1, 0, 17, 193, 0, 0,
(0,14): 0, 0, 0, 0, 0
}
}
}
By default h5dump sorts HDF5 files' groups and attributes by their names in ascending order:
-q Q, --sort_by=Q Sort groups and attributes by index Q
-z Z, --sort_order=Z Sort groups and attributes by order Z
Q - is the sort index type. It can be "creation_order" or "name" (default)
Z - is the sort order type. It can be "descending" or "ascending" (default)
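So the default behaviour is equivalent to passing those flags explicitly; note that --sort_order=descending merely reverses the name sort, it does not make it numeric:
h5dump --sort_by=name --sort_order=ascending jjjj.h5    # the default
h5dump --sort_by=name --sort_order=descending jjjj.h5   # reversed, still lexicographic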
The problem in this case is that "a160" is considered greater than "a1214" because that's how lexicographic (dictionary) sorting works ('a12' < 'a16').
Short of renaming the datasets, there's no change you can make to the internal structure of the HDF5 file that will make h5dump's name sort put "a160" first. However, you can zero-pad your names like so:
a0040
a0160
a1214
and then the standard dictionary sort will output the file the way you want.
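A minimal h5py sketch of the zero-padding idea (the pad width of 4 and the placeholder data are assumptions, not the asker's code):
import h5py
import numpy as np

with h5py.File("jjjj.h5", "w") as f:
    for num, rows in [(160, 3), (1214, 1)]:
        f.create_dataset(
            f"a{num:04d}",                            # "a0160", "a1214"
            data=np.zeros((rows, 19), dtype=">f4"),   # H5T_IEEE_F32BE
            maxshape=(None, 19),                      # H5S_UNLIMITED first dim
        )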