I have a DataFrame (df) like this:
PointID Time geojson
---- ---- ----
36F 2016-04-01T03:52:30 {'type': 'Point', 'coordinates': [3.961389, 43.123]}
36G 2016-04-01T03:52:50 {'type': 'Point', 'coordinates': [3.543234, 43.789]}
The geojson column contains data in geoJSON format (essentially, a Python dict).
I want to create a new column in geoJSON format, which includes the time coordinate. In other words, I want to inject the time information into the geoJSON info.
For a single value, I can successfully do:
oldjson = df.iloc[0]['geojson']
newjson = [oldjson['coordinates'][0], oldjson['coordinates'][1], df.iloc[0]['Time']]
For a single parameter, I successfully used DataFrame.apply in combination with a lambda (thanks to a related SO question).
But now, I have two parameters, and I want to use it on the whole DataFrame. As I am not confident with the .apply syntax and lambda, I do not know if this is even possible. I would like to do something like this:
def inject_time(geojson, time):
    """
    Injects Time dimension into geoJSON coordinates. Expects a dict in geojson POINT format.
    """
    geojson['coordinates'] = [geojson['coordinates'][0], geojson['coordinates'][1], time]
    return geojson

df["newcolumn"] = df["geojson"].apply(lambda x: inject_time(x, df['Time']))
...but that does not work, because the function would inject the whole series.
EDIT:
I figured that the format of the timestamped geoJSON should be something like this:
TimestampedGeoJson({
"type": "FeatureCollection",
"features": [
{
"type": "Feature",
"geometry": {
"type": "LineString",
"coordinates": [[-70,-25],[-70,35],[70,35]],
},
"properties": {
"times": [1435708800000, 1435795200000, 1435881600000]
}
}
]
})
So the time element is in the properties element, but this does not change the problem much.
You need DataFrame.apply with axis=1 for processing by rows:
df['new'] = df.apply(lambda x: inject_time(x['geojson'], x['Time']), axis=1)
# temporarily display long strings in the column
with pd.option_context('display.max_colwidth', 100):
    print(df['new'])
0 {'type': 'Point', 'coordinates': [3.961389, 43.123, '2016-04-01T03:52:30']}
1 {'type': 'Point', 'coordinates': [3.543234, 43.789, '2016-04-01T03:52:50']}
Name: new, dtype: object
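Note that inject_time as written in the question modifies each dict in place, so the original geojson column changes as well. The following is a minimal sketch of a non-mutating variant, assuming the time column is named Time as in the table above. The sample frame is rebuilt from the two rows shown in the question, and the zip-based list comprehension is an alternative I am adding for comparison, not part of the original answer:

```python
import pandas as pd

def inject_time(geojson, time):
    """Return a copy of a GeoJSON Point dict with the time appended to its coordinates."""
    lon, lat = geojson["coordinates"][:2]
    # {**geojson, ...} builds a new dict, so the input dict is left untouched
    return {**geojson, "coordinates": [lon, lat, time]}

# Sample data rebuilt from the two rows shown in the question
df = pd.DataFrame({
    "PointID": ["36F", "36G"],
    "Time": ["2016-04-01T03:52:30", "2016-04-01T03:52:50"],
    "geojson": [
        {"type": "Point", "coordinates": [3.961389, 43.123]},
        {"type": "Point", "coordinates": [3.543234, 43.789]},
    ],
})

# Row-wise apply, as in the answer above
df["new"] = df.apply(lambda row: inject_time(row["geojson"], row["Time"]), axis=1)

# Alternative: pair the two columns directly, without building a row object per record
df["new_zip"] = [inject_time(g, t) for g, t in zip(df["geojson"], df["Time"])]
```

Both columns hold the same dicts; the zip version is often faster on large frames because it skips the per-row Series construction.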
I am new to Python (and coding in general) so I'll do my best to explain the challenge I'm trying to work through.
I'm working with a large dataset which was exported as a CSV from a database. However, there is one column within this CSV export that contains a nested list of dictionaries (as best as I can tell). I've looked around extensively online for a solution, including on Stackoverflow, but haven't quite gotten a full solution. I think I understand conceptually what I'm trying to accomplish, but not clear as to the best method or data prepping process to use.
Here is an example of the data (pared down to just the two columns I'm interested in):
{
  "app_ID": {
    "0": "1abe23574",
    "1": "4gbn21096"
  },
  "locations": {
    "0": "[ {"loc_id" : "abc1", "lat" : "12.3456", "long" : "101.9876"},
            {"loc_id" : "abc2", "lat" : "45.7890", "long" : "102.6543"} ]",
    "1": "[ ]"
  }
}
Basically, each app_ID can have multiple locations tied to it, or the list can be empty, as seen above. Following some guides I found online, I have attempted to use pandas' json_normalize() function to "unfold" the list of dictionaries into their own rows in a pandas DataFrame.
I'd like to end up with something like this:
loc_id  lat      long      app_ID
abc1    12.3456  101.9876  1abe23574
abc2    45.7890  102.6543  1abe23574
etc...
I am learning about how to use the different functions of json_normalize, like "record_path" and "meta", but haven't been able to get it to work yet.
I have tried loading the json file into a Jupyter Notebook using:
with open('location_json.json', 'r') as f:
    data = json.loads(f.read())
df = pd.json_normalize(data, record_path = ['locations'])
but it only creates a dataframe that is 1 row and multiple columns long, where I'd like to have multiple rows generated from the inner-most dictionary that tie back to the app_ID and loc_ID fields.
Attempt at a solution:
I was able to get close to the dataframe format I wanted using:
with open('location_json.json', 'r') as f:
    data = json.loads(f.read())
df = pd.json_normalize(data['locations']['0'])
but that would then require some kind of iteration through the list in order to create a dataframe, and then I'd lose the connection to the app_ID fields. (As best as I can understand how the json_normalize function works).
Am I on the right track trying to use json_normalize, or should I start over again and try a different route? Any advice or guidance would be greatly appreciated.
I'm not sure that suggesting the convtools library is a good idea, since you are a beginner: the library is almost like another Python on top of Python. It helps to dynamically define data conversions (generating Python code under the hood).
But anyway, here is the code if I understood the input data right:
import json
from convtools import conversion as c
data = {
"app_ID": {"0": "1abe23574", "1": "4gbn21096"},
"locations": {
"0": """[ {"loc_id" : "abc1", "lat" : "12.3456", "long" : "101.9876" },
{"loc_id" : "abc2", "lat" : "45.7890", "long" : "102.6543"} ]""",
"1": "[ ]",
},
}
# define it once and use multiple times
converter = (
c.join(
# converts "app_ID" data to iterable of dicts
(
c.item("app_ID")
.call_method("items")
.iter({"id": c.item(0), "app_id": c.item(1)})
),
# converts "locations" data to iterable of dicts,
# where each id like "0" is zipped to each location.
# the result is iterable of dicts like {"id": "0", "loc": {"loc_id": ... }}
(
c.item("locations")
.call_method("items")
.iter(
c.zip(id=c.repeat(c.item(0)), loc=c.item(1).pipe(json.loads))
)
.flatten()
),
# join on "id"
c.LEFT.item("id") == c.RIGHT.item("id"),
how="full",
)
# process results, where 0 index is LEFT item, 1 index is the RIGHT one
.iter(
{
"loc_id": c.item(1, "loc", "loc_id", default=None),
"lat": c.item(1, "loc", "lat", default=None),
"long": c.item(1, "loc", "long", default=None),
"app_id": c.item(0, "app_id"),
}
)
.as_type(list)
.gen_converter()
)
result = converter(data)
assert result == [
{'loc_id': 'abc1', 'lat': '12.3456', 'long': '101.9876', 'app_id': '1abe23574'},
{'loc_id': 'abc2', 'lat': '45.7890', 'long': '102.6543', 'app_id': '1abe23574'},
{'loc_id': None, 'lat': None, 'long': None, 'app_id': '4gbn21096'}
]
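For comparison, here is a minimal sketch that stays within json and pandas, assuming the locations column holds JSON-encoded strings as in the data dict above. Unlike the full join in the convtools version, it simply drops app IDs whose location list is empty:

```python
import json
import pandas as pd

# Same sample data as above; each "locations" value is a JSON-encoded string
data = {
    "app_ID": {"0": "1abe23574", "1": "4gbn21096"},
    "locations": {
        "0": '[{"loc_id": "abc1", "lat": "12.3456", "long": "101.9876"},'
             ' {"loc_id": "abc2", "lat": "45.7890", "long": "102.6543"}]',
        "1": "[ ]",
    },
}

# One row per app_ID, with the raw JSON string in the "locations" column
df = pd.DataFrame(data)

# Decode each string and emit one record per location, carrying the app_ID along
records = [
    {**loc, "app_ID": app_id}
    for app_id, raw in zip(df["app_ID"], df["locations"])
    for loc in json.loads(raw)
]
result = pd.DataFrame(records, columns=["loc_id", "lat", "long", "app_ID"])
print(result)
```

If the rows without locations must be kept, merging result back onto df[["app_ID"]] with how="left" would be one way to do it.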
I'm parsing some XML data, doing some logic on it, and trying to display the results in an HTML table. The dictionary, after filling, looks like this:
{
"general_info": {
"name": "xxx",
"description": "xxx",
"language": "xxx",
"prefix": "xxx",
"version": "xxx"
},
"element_count": {
"folders": 23,
"conditions": 72,
"listeners": 1,
"outputs": 47
},
"external_resource_count": {
"total": 9,
"extensions": {
"jar": 8,
"json": 1
},
"paths": {
"/lib": 9
}
},
"complexity": {
"over_1_transition": {
"number": 4,
"percentage": 30.769
},
"over_1_trigger": {
"number": 2,
"percentage": 15.385
},
"over_1_output": {
"number": 4,
"percentage": 30.769
}
}
}
Then I'm using pandas to convert the dictionary into a table, like so:
data_frame = pandas.DataFrame.from_dict(data=extracted_metrics, orient='index').stack().to_frame()
The result is a table that is mostly correct:
While the first and second levels seem to render correctly, those categories with a sub-sub category get written as a string in the cell, rather than as a further column. I've also tried using stack(level=1) but it raises an error "IndexError: Too many levels: Index has only 1 level, not 2". I've also tried making it into a series with no luck. It seems like it only renders "complete" columns. Is there a way of filling up the empty spaces in the dictionary before processing?
How can I get, for example, external_resource_count -> extensions to have two daughter rows jar and json, with an additional column for the values, so that the final table looks like this:
Extra credit if anyone can tell me how to get rid of the first row with the index numbers. Thanks!
The way you load the dataframe is correct, but you should rename the 0 column to some column name.
import pandas as pd

# this function extracts the given keys from your nested dicts
def explode_and_filter(df, filterdict):
    return [df[col].apply(lambda x: x.get(k) if isinstance(x, dict) else x).rename(k)
            for col, nested in filterdict.items()
            for k in nested]

data_frame = pd.DataFrame.from_dict(data=extracted_metrics, orient='index').stack().to_frame(name='somecol')
# separate the rows where a dict is present and explode only those rows
mask = data_frame.somecol.apply(lambda x: isinstance(x, dict))
expp = explode_and_filter(data_frame[mask],
                          {'somecol': ['jar', 'json', '/lib', 'number', 'percentage']})
# concat the exploded series into a frame
exploded_df = (pd.concat(expp, axis=1).stack().to_frame(name='somecol2')
               .reset_index(level=2)
               .rename(columns={'level_2': 'somecol'}))
# concat the rows with dict elements with the rows with non-dict elements
out = pd.concat([data_frame[~mask], exploded_df])
The output dataframe looks like this
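An alternative that does not require listing the inner keys by hand is to flatten the dictionary recursively and use the key paths as a MultiIndex. This is only a sketch under my own assumptions: flatten is a hypothetical helper, the data is a shortened copy of the dictionary from the question, and shorter paths are padded with empty strings so that every row has the same number of index levels:

```python
import pandas as pd

def flatten(d, parent=()):
    """Yield (path_tuple, value) for every leaf of a nested dict."""
    for key, value in d.items():
        path = parent + (key,)
        if isinstance(value, dict):
            yield from flatten(value, path)
        else:
            yield path, value

# Shortened copy of the dictionary from the question
extracted_metrics = {
    "element_count": {"folders": 23, "conditions": 72},
    "external_resource_count": {
        "total": 9,
        "extensions": {"jar": 8, "json": 1},
        "paths": {"/lib": 9},
    },
}

rows = dict(flatten(extracted_metrics))
depth = max(len(path) for path in rows)

# Pad shorter paths with "" so every key has the same number of levels
index = pd.MultiIndex.from_tuples([p + ("",) * (depth - len(p)) for p in rows])
table = pd.Series(list(rows.values()), index=index, name="value").to_frame()
print(table)
```

Here extensions gets jar and json as daughter rows, each with its own value, and the single column is named value instead of 0.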
I have a Collection with heavily nested docs in MongoDB, I want to flatten and import to Pandas. There are some nested dicts, but also a list of dicts that I want to transform into columns (see examples below for details).
I already have a function that works for smaller batches of documents, but the solution (I found it in the answer to this question) uses json. The problem with the json.loads operation is that it fails with a MemoryError on bigger selections from the Collection.
I tried many solutions suggesting other JSON parsers (e.g. ijson), but for different reasons none of them solved my problem. The only way left, if I want to keep the transformation via json, would be to chunk bigger selections into smaller groups of documents and parse them iteratively.
At this point I thought (and this is my main question here): maybe there is a smarter way to do the unnesting without taking the detour through json, directly in MongoDB or in pandas, or in some combination of the two?
This is a shortened example Doc:
{
'_id': ObjectId('5b40fcc4affb061b8871cbc5'),
'eventId': 2,
'sId' : 6833,
'stage': {
'value': 1,
'Name': 'FirstStage'
},
'quality': [
{
'type': {
'value': 2,
'Name': 'Color'
},
'value': '124'
},
{
'type': {
'value': 7,
'Name': 'Length'
},
'value': 'Short'
},
{
'type': {
'value': 15,
'Name': 'Printed'
}
}
]
}
This is what a successful DataFrame representation would look like (I skipped the columns '_id' and 'sId' for readability):
eventId stage.value stage.name q_color q_length q_printed
1 2 1 'FirstStage' 124 'Short' 1
My code so far (which runs into memory problems - see above):
def load_events(filter = 'sId', id = 6833, all = False):
    if all:
        print('Loading all events.')
        cursor = events.find()
    else:
        print('Loading events with %s equal to %s.' %(filter, id))
        print('Filtering...')
        cursor = events.find({filter : id})
    print('Loading...')
    l = list(cursor)
    print('Parsing json...')
    sanitized = json.loads(json_util.dumps(l))
    print('Parsing quality...')
    for ev in sanitized:
        for q in ev['quality']:
            name = 'q_' + str(q['type']['Name'])
            value = q.pop('value', 1)
            ev[name] = value
        ev.pop('quality',None)
    normalized = json_normalize(sanitized)
    df = pd.DataFrame(normalized)
    return df
You don't need to convert the nested structures using json parsers. Just create your dataframe from the record list:
df = DataFrame(list(cursor))
and afterwards use pandas in order to unpack your lists and dictionaries:
import pandas
from itertools import chain
import numpy

# t is the list of documents, e.g. t = list(cursor)
df = pandas.DataFrame(t)
df['stage.value'] = df['stage'].apply(lambda cell: cell['value'])
df['stage.name'] = df['stage'].apply(lambda cell: cell['Name'])
# list of (name, value) pairs per document; a missing value defaults to 1
df['q_'] = df['quality'].apply(lambda cell: [(m['type']['Name'], m['value'] if 'value' in m.keys() else 1) for m in cell])
# turn each list of pairs into a dict
df['q_'] = df['q_'].apply(lambda cell: dict((k, v) for k, v in cell))
# collect every property name that occurs in any document
keys = set(chain(*df['q_'].apply(lambda column: column.keys())))
for key in keys:
    column_name = 'q_{}'.format(key).lower()
    df[column_name] = df['q_'].apply(lambda cell: cell[key] if key in cell.keys() else numpy.nan)
df.drop(['stage', 'quality', 'q_'], axis=1, inplace=True)
I use a few steps to unpack the nested data types. First, the names and values are used to create a flat list of pairs (tuples). In the second step, a dictionary is built from the tuples, taking keys from the first and values from the second position of each tuple. Then all existing property names are extracted once using a set. Finally, each property gets a new column in a loop; inside the loop, the values of each pair are mapped to the respective column cells.
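The same unpacking can also be done one document at a time, before the DataFrame is built, which avoids both the json round trip and the temporary q_ column. This is a sketch under assumptions: flatten_event is a hypothetical helper, docs stands in for the pymongo cursor, and the _id is shown as a plain string instead of an ObjectId. Whether it resolves the MemoryError on the full collection has not been verified:

```python
import pandas as pd

def flatten_event(doc):
    """Flatten one event document into a single-level dict."""
    # Copy the top-level scalar fields as they are
    row = {k: v for k, v in doc.items() if k not in ("stage", "quality")}
    # Nested dict -> dotted, lower-cased column names (stage.value, stage.name)
    for key, value in doc.get("stage", {}).items():
        row[f"stage.{key.lower()}"] = value
    # List of dicts -> one q_<name> column each; a missing value defaults to 1
    for q in doc.get("quality", []):
        row["q_" + q["type"]["Name"].lower()] = q.get("value", 1)
    return row

# Stand-in for events.find(...); any iterable of documents works the same way
docs = [
    {
        "_id": "5b40fcc4affb061b8871cbc5",
        "eventId": 2,
        "sId": 6833,
        "stage": {"value": 1, "Name": "FirstStage"},
        "quality": [
            {"type": {"value": 2, "Name": "Color"}, "value": "124"},
            {"type": {"value": 7, "Name": "Length"}, "value": "Short"},
            {"type": {"value": 15, "Name": "Printed"}},
        ],
    }
]

df = pd.DataFrame([flatten_event(doc) for doc in docs])
print(df)
```

Properties that are missing from a document end up as NaN in that row, because pandas aligns the dict keys across all rows.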
I have a large, deeply nested JSON file saved in .txt format. I need to access some specific key pairs and create a data frame or another transformed JSON object for further use. Here is a small sample with 2 key pairs.
[
{
"ko_id": [819752],
"concepts": [
{
"id": ["11A71731B880:http://ontology.intranet.com/Taxonomy/116#en"],
"uri": ["http://ontology.intranet.com/Taxonomy/116"],
"language": ["en"],
"prefLabel": ["Client coverage & relationship management"]
}
]
},
{
"ko_id": [819753],
"concepts": [
{
"id": ["11A71731B880:http://ontology.intranet.com/Taxonomy/116#en"],
"uri": ["http://ontology.intranet.com/Taxonomy/116"],
"language": ["en"],
"prefLabel": ["Client coverage & relationship management"]
}
]
}
]
The following code loads the data as a list, but I probably need to access the data as a dictionary. I need the "ko_id", "uri" and "prefLabel" from each key pair, put into a pandas data frame or a dictionary for further analysis.
with open('sample_data.txt') as data_file:
    json_sample = js.load(data_file)
The following code gives me the exact value of the first element, but I do not know how to put it together and build the final algorithm to create the dataframe.
print(sample_dict["ko_id"][0])
print(sample_dict["concepts"][0]["prefLabel"][0])
print(sample_dict["concepts"][0]["uri"][0])
for record in sample_dict:
    df = pd.DataFrame(record['concepts'])
    df['ko_id'] = record['ko_id']
    final_df = final_df.append(df)
You can pass the data to pandas.DataFrame using a generator:
import pandas as pd
import json as js
with open('sample_data.txt') as data_file:
    json_sample = js.load(data_file)
df = pd.DataFrame(data = ((key["ko_id"][0],
key["concepts"][0]["prefLabel"][0],
key["concepts"][0]["uri"][0]) for key in json_sample),
columns = ("ko_id", "prefLabel", "uri"))
Output:
>>> df
ko_id prefLabel uri
0 819752 Client coverage & relationship management http://ontology.intranet.com/Taxonomy/116
1 819753 Client coverage & relationship management http://ontology.intranet.com/Taxonomy/116
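The generator above reads only the first entry of each "concepts" list. If a record can carry several concepts, a nested comprehension yields one row per concept. This is a sketch, not part of the original answer; the data is inlined instead of read from sample_data.txt, and the second concept of the second record is a hypothetical entry added to show the effect:

```python
import pandas as pd

# Inlined sample; the "Taxonomy/117" concept is hypothetical
json_sample = [
    {
        "ko_id": [819752],
        "concepts": [
            {"uri": ["http://ontology.intranet.com/Taxonomy/116"],
             "prefLabel": ["Client coverage & relationship management"]},
        ],
    },
    {
        "ko_id": [819753],
        "concepts": [
            {"uri": ["http://ontology.intranet.com/Taxonomy/116"],
             "prefLabel": ["Client coverage & relationship management"]},
            {"uri": ["http://ontology.intranet.com/Taxonomy/117"],
             "prefLabel": ["Hypothetical second label"]},
        ],
    },
]

# One row per (record, concept) pair; every value is wrapped in a one-element list
rows = [
    {
        "ko_id": record["ko_id"][0],
        "prefLabel": concept["prefLabel"][0],
        "uri": concept["uri"][0],
    }
    for record in json_sample
    for concept in record["concepts"]
]
df = pd.DataFrame(rows, columns=["ko_id", "prefLabel", "uri"])
print(df)
```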
I tried to create a JSON object but I made a mistake somewhere. I'm getting some data from a CSV file (center is a string, lat and lng are floats).
My code:
data = []
data.append({
    'id': 'id',
    'is_city': False,
    'name': center,
    'county': center,
    'cluster': i,
    'cluster2': i,
    'avaible': True,
    'is_deleted': False,
    'coordinates': ('{%s,%s}' %(lat,lng))
})
json_data = json.dumps(data)
print(json_data)
It produces this:
[{
"county": "County",
"is_city": false,
"is_deleted": false,
"name": "name",
"cluster": 99,
"cluster2": 99,
"id": "id",
"coordinates": "{41.0063945,28.9048234}",
"avaible": true
}]
This is what I want:
{
"id" : "id",
"is_city" : false,
"name" : "name",
"county" : "county",
"cluster" : 99,
"cluster2" : 99,
"coordinates" : [
41.0870185,
29.0235126
],
"available" : true,
"isDeleted" : false,
}
You are defining coordinates to be a string of the specified format. There is no way json can encode that as a list; you are saying one thing when you want another.
Similarly, if you don't want the top-level dictionary to be the only element in a list, don't define it to be an element in a list.
data = {
    'id': 'id',
    'is_city': False,
    'name': name,
    'county': county,
    'cluster': i,
    'cluster2': i,
    'available': True,
    'is_deleted': False,
    'coordinates': [lat, lng]
}
I don't know how you defined center, or how you expected it to have the value 'name' and the value 'county' at basically the same time. I have declared two new variables to hold these values; you will need to adapt your code to take care of this detail. I also fixed the typo in "available" where apparently you expected Python to somehow take care of this.
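Put together as a runnable sketch: the values assigned to name, county, lat, lng and i below are placeholders standing in for whatever is read from the CSV file. A Python list is encoded as a JSON array and True/False become true/false:

```python
import json

# Placeholder values standing in for the fields read from the CSV file
name, county = "name", "county"
lat, lng = 41.0870185, 29.0235126
i = 99

data = {
    "id": "id",
    "is_city": False,
    "name": name,
    "county": county,
    "cluster": i,
    "cluster2": i,
    "available": True,
    "is_deleted": False,
    "coordinates": [lat, lng],  # a list, so json encodes it as an array
}

json_data = json.dumps(data)
print(json_data)
```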
You can use pprint for pretty-printing in Python, but it should be applied to an object, not a string.
In your case json_data is a string that represents a JSON object, so you need to load it back into an object before you pprint it (or use the data variable itself, since it already contains this object in your example).
For example, try running:
pprint.pprint(json.loads(json_data))
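A short sketch of the difference, using a shortened, hypothetical json_data string. Keep in mind that pprint shows the Python representation (single quotes, False), so if the goal is readable JSON text, json.dumps with the indent parameter may be the better fit:

```python
import json
import pprint

# Shortened, hypothetical version of the JSON string from the question
json_data = '[{"id": "id", "is_city": false, "cluster": 99, "coordinates": [41.0063945, 28.9048234]}]'

# Parse the string back into Python objects (a list holding one dict)
obj = json.loads(json_data)

# Pretty-prints the Python objects, using Python's own repr for the values
pprint.pprint(obj)

# Pretty-prints as valid JSON text instead
pretty = json.dumps(obj, indent=2)
print(pretty)
```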