How can I convert this JSON to a pandas DataFrame? I have tried multiple approaches; a JSON example on w3schools worked correctly, but I am new to Python, so any recommendations would be appreciated.
The JSON format is:
[
{
"id": 14256,
"city": {
"id": {
"$numberLong": "14256"
},
"name": "Azadshahr",
"findname": "AZADSHAHR",
"country": "IR",
"coord": {
"lon": 48.570728,
"lat": 34.790878
},
"zoom": {
"$numberLong": "10"
}
}
},
{
"id": {
"$numberLong": "465726"
},
"city": {
"id": {
"$numberLong": "465726"
},
"name": "Zadonsk",
"findname": "ZADONSK",
"country": "RU",
"coord": {
"lon": 38.926102,
"lat": 52.3904
},
"zoom": {
"$numberLong": "16"
}
}
}
]
I tried a conversion but I am receiving an error and not getting the whole data. Loading the file itself works:
import json
import pandas as pd

with open('data/history.city.list.json') as f:
    data = json.load(f)
but I am not able to get it into a DataFrame. This is what I have tried:
_id = []
country = []
coord_lat = []
coord_lon = []
counter = 0
for i in data:
    _id.append(data[counter]['id'])
    country.append(data[counter]['city']['country'])
    coord_lat.append(data[counter]['city']['coord']['lon'])
    coord_lat.append(data[counter]['city']['coord']['lat'])
    counter += 1
When I then tried to build a DataFrame:
df = pd.DataFrame({'Longitude': coord_lat, 'Latitude': coord_lat})
df.head(10)
This worked, but as soon as I add 'Country' to pd.DataFrame(), it returns ValueError: arrays must all be same length.
I understand that the country column does not match the length of the other columns, but can this be achieved, and is there a simpler way to do it?
You can use json_normalize():
import pandas as pd
d = [
{
"id": 14256,
"city": {
"id": {
"$numberLong": "14256"
},
"name": "Azadshahr",
"findname": "AZADSHAHR",
"country": "IR",
"coord": {
"lon": 48.570728,
"lat": 34.790878
},
"zoom": {
"$numberLong": "10"
}
}
},
{
"id": {
"$numberLong": "465726"
},
"city": {
"id": {
"$numberLong": "465726"
},
"name": "Zadonsk",
"findname": "ZADONSK",
"country": "RU",
"coord": {
"lon": 38.926102,
"lat": 52.3904
},
"zoom": {
"$numberLong": "16"
}
}
}
]
pd.json_normalize(d)  # pd.io.json.json_normalize in pandas < 1.0
Output:
id city.id.$numberLong city.name city.findname city.country city.coord.lon city.coord.lat city.zoom.$numberLong id.$numberLong
0 14256.0 14256 Azadshahr AZADSHAHR IR 48.570728 34.790878 10 NaN
1 NaN 465726 Zadonsk ZADONSK RU 38.926102 52.390400 16 465726
The column names do not match your expected output, but you can change that easily, e.g. with df.columns = ['Id', 'City', ..., 'Zoom'] or with df.rename().
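For instance, a minimal sketch of flattening and renaming in one go (the target column names here are my assumption of the desired output, and the record is trimmed down from the post's JSON):

```python
import pandas as pd

# a trimmed-down record in the same shape as the post's JSON
d = [{"id": 14256,
      "city": {"name": "Azadshahr", "country": "IR",
               "coord": {"lon": 48.570728, "lat": 34.790878}}}]

# flatten the nested dicts, then rename the dotted columns
df = pd.json_normalize(d).rename(columns={
    "id": "Id",
    "city.name": "City",
    "city.country": "Country",
    "city.coord.lon": "Longitude",
    "city.coord.lat": "Latitude",
})
```

Any flattened column you do not rename keeps its dotted name, so the mapping only needs to cover the columns you care about.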
I am using Python and pandas, trying to convert a DataFrame to a nested JSON. The function .to_json() doesn't give me enough flexibility for my aim.
Here are some rows of the data frame (in CSV, comma-separated):
Hotel_id,Room_id,Client_id,Loayalty_level,Price
1,100,1,Default,100
1,100,2,Default,98
1,101,1,Default,200
1,101,1,Discounted,196
1,101,2,Default,202
1,101,3,Default,204
There is a lot of repetitive information and I would like to have a JSON like this:
{
"hotelId": 1,
"rooms": [
{
"roomId": 100,
"prices": [
{
"clientId": 1,
"price": {
"default": 100
}
},
{
"clientId": 2,
"price": {
"default": 98
}
}
]
},
{
"roomId": 101,
"prices": [
{
"clientId": 1,
"price": {
"default": 200,
"discounted": 196
}
},
{
"clientId": 2,
"price": {
"default": 202
}
},
{
"clientId": 3,
"price": {
"default": 204
}
}
]
}
]
}
How can I achieve this?
Have a look at the convtools library; it provides a lot of primitives for data processing.
Here is a solution for your case:
import json
from convtools import conversion as c
from convtools.contrib.tables import Table
input_data = [
("Hotel_id", "Room_id", "Client_id", "Loayalty_level", "Price"),
("1", "100", "1", "Default", "100"),
("1", "100", "2", "Default", "98"),
("1", "101", "1", "Default", "200"),
("1", "101", "1", "Discounted", "196"),
("1", "101", "2", "Default", "202"),
("1", "101", "3", "Default", "204"),
]
# if reading from csv is needed
# rows = Table.from_csv("tmp/input.csv", header=True).into_iter_rows(tuple)
# convert to list of dicts
rows = list(Table.from_rows(input_data, header=True).into_iter_rows(dict))
# generate the converter (store somewhere and reuse, because this is where
# code-generation happens)
converter = (
c.group_by(c.item("Hotel_id"))
.aggregate(
{
"hotelId": c.item("Hotel_id").as_type(int),
"rooms": c.ReduceFuncs.Array(c.this()).pipe(
c.group_by(c.item("Room_id")).aggregate(
{
"roomId": c.item("Room_id").as_type(int),
"prices": c.ReduceFuncs.Array(c.this()).pipe(
c.group_by(c.item("Client_id")).aggregate(
{
"clientId": c.item("Client_id").as_type(
int
),
"price": c.ReduceFuncs.DictFirst(
c.item("Loayalty_level"),
c.item("Price").as_type(float),
),
}
)
),
}
)
),
}
)
.gen_converter()
)
print(json.dumps(converter(rows)))
The output is:
[
{
"hotelId": 1,
"rooms": [
{
"roomId": 100,
"prices": [
{ "clientId": 1, "price": { "Default": 100.0 } },
{ "clientId": 2, "price": { "Default": 98.0 } }
]
},
{
"roomId": 101,
"prices": [
{ "clientId": 1, "price": { "Default": 200.0, "Discounted": 196.0 } },
{ "clientId": 2, "price": { "Default": 202.0 } },
{ "clientId": 3, "price": { "Default": 204.0 } }
]
}
]
}
]
P.S. Pay attention to the c.ReduceFuncs.DictFirst part: this is where it takes the first price per loyalty level. You may want to change it to DictLast / DictMax / DictMin / DictArray.
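For comparison, the same nesting can be sketched in plain pandas with nested groupby comprehensions (my own sketch, not from the convtools answer; note that dict(zip(...)) keeps the last price if a loyalty level repeats for a client, and the misspelled Loayalty_level column name is kept from the question):

```python
import pandas as pd

rows = pd.DataFrame(
    [
        (1, 100, 1, "Default", 100),
        (1, 100, 2, "Default", 98),
        (1, 101, 1, "Default", 200),
        (1, 101, 1, "Discounted", 196),
        (1, 101, 2, "Default", 202),
        (1, 101, 3, "Default", 204),
    ],
    columns=["Hotel_id", "Room_id", "Client_id", "Loayalty_level", "Price"],
)

# group hotel -> room -> client, collecting loyalty-level prices per client
result = [
    {
        "hotelId": int(hotel_id),
        "rooms": [
            {
                "roomId": int(room_id),
                "prices": [
                    {
                        "clientId": int(client_id),
                        "price": dict(zip(g["Loayalty_level"], g["Price"])),
                    }
                    for client_id, g in room_df.groupby("Client_id")
                ],
            }
            for room_id, room_df in hotel_df.groupby("Room_id")
        ],
    }
    for hotel_id, hotel_df in rows.groupby("Hotel_id")
]
```

The price values come out as numpy scalars here, so you may need to cast them (e.g. with float()) before calling json.dumps on the result.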
I am trying to link several Altair charts that share aspects of the same data. I can do this by merging all the data into one data frame, but because of the nature of the data, the merged data frame is much larger than two separate data frames, one per chart: the columns unique to each chart repeat their rows for every entry in the shared column.
Would using transform_lookup save space over just using the merged data frame, or does transform_lookup end up doing the whole merge internally?
No, the entire dataset is still included in the Vega spec when you use transform_lookup. You can see this by printing the JSON spec of the charts you create. With the example from the docs:
import altair as alt
import pandas as pd
from vega_datasets import data
people = data.lookup_people().head(3)
people
name age height
0 Alan 25 180
1 George 32 174
2 Fred 39 182
groups = data.lookup_groups().head(3)
groups
group person
0 1 Alan
1 1 George
2 1 Fred
With pandas merge:
merged = pd.merge(groups, people, how='left',
left_on='person', right_on='name')
print(alt.Chart(merged).mark_bar().encode(
x='mean(age):Q',
y='group:O'
).to_json())
{
"$schema": "https://vega.github.io/schema/vega-lite/v4.8.1.json",
"config": {
"view": {
"continuousHeight": 300,
"continuousWidth": 400
}
},
"data": {
"name": "data-b41b97ffc89b39c92e168871d447e720"
},
"datasets": {
"data-b41b97ffc89b39c92e168871d447e720": [
{
"age": 25,
"group": 1,
"height": 180,
"name": "Alan",
"person": "Alan"
},
{
"age": 32,
"group": 1,
"height": 174,
"name": "George",
"person": "George"
},
{
"age": 39,
"group": 1,
"height": 182,
"name": "Fred",
"person": "Fred"
}
]
},
"encoding": {
"x": {
"aggregate": "mean",
"field": "age",
"type": "quantitative"
},
"y": {
"field": "group",
"type": "ordinal"
}
},
"mark": "bar"
}
With transform_lookup all the data is still there, but as two separate datasets (so technically it takes slightly more space, with the additional braces and the transform):
print(alt.Chart(groups).mark_bar().encode(
x='mean(age):Q',
y='group:O'
).transform_lookup(
lookup='person',
from_=alt.LookupData(data=people, key='name',
fields=['age'])
).to_json())
{
"$schema": "https://vega.github.io/schema/vega-lite/v4.8.1.json",
"config": {
"view": {
"continuousHeight": 300,
"continuousWidth": 400
}
},
"data": {
"name": "data-5fe242a79352d1fe243b588af570c9c6"
},
"datasets": {
"data-2b374d1509415e1d327c3a7521f8117c": [
{
"age": 25,
"height": 180,
"name": "Alan"
},
{
"age": 32,
"height": 174,
"name": "George"
},
{
"age": 39,
"height": 182,
"name": "Fred"
}
],
"data-5fe242a79352d1fe243b588af570c9c6": [
{
"group": 1,
"person": "Alan"
},
{
"group": 1,
"person": "George"
},
{
"group": 1,
"person": "Fred"
}
]
},
"encoding": {
"x": {
"aggregate": "mean",
"field": "age",
"type": "quantitative"
},
"y": {
"field": "group",
"type": "ordinal"
}
},
"mark": "bar",
"transform": [
{
"from": {
"data": {
"name": "data-2b374d1509415e1d327c3a7521f8117c"
},
"fields": [
"age",
"height"
],
"key": "name"
},
"lookup": "person"
}
]
}
Where transform_lookup can save space is when you use it with the URLs of two datasets:
people = data.lookup_people.url
groups = data.lookup_groups.url
print(alt.Chart(groups).mark_bar().encode(
x='mean(age):Q',
y='group:O'
).transform_lookup(
lookup='person',
from_=alt.LookupData(data=people, key='name',
fields=['age'])
).to_json())
{
"$schema": "https://vega.github.io/schema/vega-lite/v4.8.1.json",
"config": {
"view": {
"continuousHeight": 300,
"continuousWidth": 400
}
},
"data": {
"url": "https://vega.github.io/vega-datasets/data/lookup_groups.csv"
},
"encoding": {
"x": {
"aggregate": "mean",
"field": "age",
"type": "quantitative"
},
"y": {
"field": "group",
"type": "ordinal"
}
},
"mark": "bar",
"transform": [
{
"from": {
"data": {
"url": "https://vega.github.io/vega-datasets/data/lookup_people.csv"
},
"fields": [
"age",
"height"
],
"key": "name"
},
"lookup": "person"
}
]
}
I am trying to load JSON from a URL and convert it to a pandas DataFrame that looks like the sample below.
I've tried json_normalize, but it duplicates the columns, one for each data type (value and stringValue). Is there a simpler way than that method followed by dropping and renaming columns after creating the DataFrame? I want to keep the stringValue.
Person ID Position ID Job ID Manager
0 192 936 93 Tom
my_json = {
"columns": [
{
"alias": "c3",
"label": "Person ID",
"dataType": "integer"
},
{
"alias": "c36",
"label": "Position ID",
"dataType": "string"
},
{
"alias": "c40",
"label": "Job ID",
"dataType": "integer",
"entityType": "job"
},
{
"alias": "c19",
"label": "Manager",
"dataType": "integer"
},
],
"data": [
{
"c3": {
"value": 192,
"stringValue": "192"
},
"c36": {
"value": "936",
"stringValue": "936"
},
"c40": {
"value": 93,
"stringValue": "93"
},
"c19": {
"value": 12412453,
"stringValue": "Tom"
}
}
]
}
If c19 were of type string, this would work:
import pandas as pd

alias_to_label = {x['alias']: x['label'] for x in my_json["columns"]}
is_str = {x['alias']: ('string' == x['dataType']) for x in my_json["columns"]}
data = []
for x in my_json["data"]:
    data.append({
        k: v["stringValue" if is_str[k] else 'value']
        for k, v in x.items()
    })
df = pd.DataFrame(data).rename(columns=alias_to_label)
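Since the stated goal is to keep the stringValue, a simpler sketch (my assumption: the display string is wanted for every column, so every cell becomes a string) can skip the dataType check entirely. The my_json below abbreviates the post's structure:

```python
import pandas as pd

# abbreviated version of the post's JSON
my_json = {
    "columns": [
        {"alias": "c3", "label": "Person ID", "dataType": "integer"},
        {"alias": "c19", "label": "Manager", "dataType": "integer"},
    ],
    "data": [
        {"c3": {"value": 192, "stringValue": "192"},
         "c19": {"value": 12412453, "stringValue": "Tom"}},
    ],
}

# alias -> label mapping, then one record per row using only stringValue
labels = {col["alias"]: col["label"] for col in my_json["columns"]}
records = [{k: v["stringValue"] for k, v in row.items()} for row in my_json["data"]]
df = pd.DataFrame(records).rename(columns=labels)
```

The trade-off is that numeric columns such as Person ID come out as strings; cast them afterwards if you need numbers.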
I have a project in which I have to convert a JSON file into a CSV file.
The JSON sample:
{
"P_Portfolio Group": {
"depth": 1,
"dataType": "PortfolioOverview",
"levelId": "P_Portfolio Group",
"path": [
{
"label": "Portfolio Group",
"levelId": "P_Portfolio Group"
}
],
"label": "Portfolio Group",
"header": [
{
"id": "Label",
"label": "Security name",
"type": "text",
"contentType": "text"
},
{
"id": "SecurityValue",
"label": "MioCHF",
"type": "text",
"contentType": "number"
},
{
"id": "SecurityValuePct",
"label": "%",
"type": "text",
"contentType": "pct"
}
],
"data": [
{
"dataValues": [
{
"value": "Client1",
"type": "text"
},
{
"value": 2068.73,
"type": "number"
},
{
"value": 14.0584,
"type": "pct"
}
]
},
{
"dataValues": [
{
"value": "Client2",
"type": "text"
},
{
"value": 1511.9,
"type": "number"
},
{
"value": 10.2744,
"type": "pct"
}
]
},
{
"dataValues": [
{
"value": "Client3",
"type": "text"
},
{
"value": 1354.74,
"type": "number"
},
{
"value": 9.2064,
"type": "pct"
}
]
},
{
"dataValues": [
{
"value": "Client4",
"type": "text"
},
{
"value": 1225.78,
"type": "number"
},
{
"value": 8.33,
"type": "pct"
}
]
}
],
"summary": [
{
"value": "Total",
"type": "text"
},
{
"value": 11954.07,
"type": "number"
},
{
"value": 81.236,
"type": "pct"
}
]
}
}
And I want to obtain something like:
Client1,2068.73,14.0584
Client2,1511.9,10.2744
Client3,871.15,5.92
Client4,11954.07,81.236
Can you please give me a hint? This is what I have tried:
import csv
import json
with open("C:\Users\SVC\Desktop\test.json") as file:
    x = json.load(file)
f = csv.writer(open("C:\Users\SVC\Desktop\test.csv", "wb+"))
for x in x:
    f.writerow(x["P_Portfolio Group"]["data"]["dataValues"]["value"])
but it doesn't work.
import csv
import json

with open(r'C:\Users\SVC\Desktop\test.json') as json_file:
    portfolio_group = json.load(json_file)

with open(r'C:\Users\SVC\Desktop\test.csv', 'w', newline='') as csv_file:
    csv_obj = csv.writer(csv_file)
    for data in portfolio_group['P_Portfolio Group']['data']:
        csv_obj.writerow([d['value'] for d in data['dataValues']])
This results in the following C:\Users\SVC\Desktop\test.csv content:
Client1,2068.73,14.0584
Client2,1511.9,10.2744
Client3,1354.74,9.2064
Client4,1225.78,8.33
Or use the pandas library:
import pandas as pd

data = pd.read_json(r"C:\Users\SVC\Desktop\test.json")
data.to_csv('test.csv')
done
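For this nested structure, reading the file straight into a DataFrame will not yield the flat client rows; a sketch that parses the JSON first and only uses pandas for the CSV writing (the x dict below abbreviates the sample data):

```python
import pandas as pd

# parsed JSON, abbreviated to the parts the CSV needs
x = {
    "P_Portfolio Group": {
        "data": [
            {"dataValues": [{"value": "Client1"}, {"value": 2068.73}, {"value": 14.0584}]},
            {"dataValues": [{"value": "Client2"}, {"value": 1511.9}, {"value": 10.2744}]},
        ]
    }
}

# one CSV row per "data" entry, one cell per dataValue
rows = [[d["value"] for d in entry["dataValues"]]
        for entry in x["P_Portfolio Group"]["data"]]
csv_text = pd.DataFrame(rows).to_csv(header=False, index=False)
```

Pass a file path to to_csv instead of capturing the string if you want the file written directly.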
In Python I'm currently working with a very large JSON file with deeply nested dictionaries and arrays. My issue is that the structure is not consistent. In the example below (countries, with regions/states, cities, and suburbs), if there is only one suburb the value is a dictionary, but if there is more than one it is an array of dictionaries, forcing me to add another line of code to go deeper. Sure, I can if/else my way through it, but this is only a small portion of the inconsistency, and scattering if/else everywhere just isn't right.
What I'd like to do is simply search anything within Belgium for the dictionary entry "code": "8400" and return its location within the JSON file. What would be my best approach to do something like this? Thanks!
***SNIP***
{
"code": "BE",
"name": "Belgium",
"regions": {
"region": [
{
"code": "45",
"name": "Flanders",
"places": {
"place": [
{
"code": "1790",
"name": "Affligem"
},
{
"code": "8570",
"name": "Anzegem"
},
{
"code": "8630",
"name": "Diksmuide"
},
{
"code": "9600",
"name": "Ronse"
}
]
},
"subregions": {
"subregion": [
{
"code": "46",
"name": "Coast",
"places": {
"place": [
{
"code": "8300",
"name": "Knokke-Heist"
},
{
"code": "8400",
"name": "Oostende",
"subplaces": {
"subplace": {
"code": "8450",
"name": "Bredene"
}
}
},
{
"code": "8420",
"name": "De Haan"
},
{
"code": "8430",
"name": "Middelkerke"
},
{
"code": "8434",
"name": "Westende-Bad"
},
{
"code": "8490",
"name": "Jabbeke"
},
{
"code": "8660",
"name": "De Panne"
},
{
"code": "8670",
"name": "Oostduinkerke"
}
]
}
},
{
"code": "47",
"name": "Cities",
"places": {
"place": [
{
"code": "1000",
"name": "Brussels"
},
{
"code": "2000",
"name": "Antwerp"
},
{
"code": "8000",
"name": "Bruges"
},
{
"code": "8340",
"name": "Damme"
},
{
"code": "9000",
"name": "Gent"
}
]
}
},
{
"code": "48",
"name": "Interior",
"places": {
"place": [
{
"code": "2260",
"name": "Westerlo"
},
{
"code": "2400",
"name": "Mol"
},
{
"code": "2590",
"name": "Berlaar"
},
{
"code": "8500",
"name": "Kortrijk",
"subplaces": {
"subplace": {
"code": "8940",
"name": "Wervik"
}
}
},
{
"code": "8610",
"name": "Handzame"
},
{
"code": "8755",
"name": "Ruiselede"
},
{
"code": "8900",
"name": "Ieper"
},
{
"code": "8970",
"name": "Poperinge"
}
]
}
},
EDIT:
I was asked to show how I'm currently getting through this JSON file. root is a dictionary containing numbers that equal the city/suburb I'm trying to search for; it doesn't define beforehand whether it is a city or a suburb. Below is my lazily coded search from when I was trying to learn how to dig through this file, before I realized how complicated it was getting and got a bit stuck.
SNIP
for k in dataDict['countries']['country']:
    if k['code'] == root['country']:
        for y in k['regions']['region']['places']['place']:
            if y['code'] == root['place']:
                city = y['name']
            else:
                try:
                    for p in y['subplaces']['subplace']:
                        if p['code'] == root['place']:
                            city = p['name']
                except:
                    pass
If I understand correctly, each dictionary has the following structure:
{"code": ...,  # some int or str
 "name": ...,  # some str
 # plus optionally "country" / "place" / whatever: a dict or list
}
You can write a recursive function that handles one and only one dict:
def foo(my_dict):
    if my_dict['code'] == root['place']:
        city = my_dict['name']
    elif "country" in my_dict:
        city = foo(my_dict['country'])
    elif "place" in my_dict:
        city = foo(my_dict['place'])
        # ... and so on for the other nesting keys
    else:
        city = None
    return city
Hope this example helps.
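For the original question of locating "code": "8400" anywhere under Belgium, a generic recursive walk that treats dicts and lists uniformly avoids the one-suburb-vs-list-of-suburbs special case entirely (a sketch: find_code and the trimmed belgium dict below are my own, based only on the snippet above):

```python
def find_code(node, target, path=()):
    """Return (path, dict) for the first dict whose "code" equals target, else None."""
    if isinstance(node, dict):
        if node.get("code") == target:
            return path, node
        for key, value in node.items():
            found = find_code(value, target, path + (key,))
            if found:
                return found
    elif isinstance(node, list):
        for index, item in enumerate(node):
            found = find_code(item, target, path + (index,))
            if found:
                return found
    return None  # leaves (strings, numbers) and misses fall through to here

# trimmed-down version of the Belgium data above
belgium = {
    "code": "BE",
    "name": "Belgium",
    "regions": {"region": [{
        "code": "45",
        "subregions": {"subregion": [{
            "code": "46",
            "places": {"place": [
                {"code": "8300", "name": "Knokke-Heist"},
                {"code": "8400", "name": "Oostende",
                 "subplaces": {"subplace": {"code": "8450", "name": "Bredene"}}},
            ]}
        }]}
    }]}
}

path, match = find_code(belgium, "8400")
print(path)           # ('regions', 'region', 0, 'subregions', 'subregion', 0, 'places', 'place', 1)
print(match["name"])  # Oostende
```

Because lists and single dicts are both handled, it finds "8450" inside the lone subplace dict just as well as codes inside the place lists.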