I have a collection of heavily nested documents in MongoDB that I want to flatten and import into Pandas. There are some nested dicts, but also a list of dicts that I want to transform into columns (see the example below for details).
I already have a function that works for smaller batches of documents. But the solution (I found it in the answer to this question) uses json. The problem with the json.loads operation is that it fails with a MemoryError on bigger selections from the collection.
I tried many solutions suggesting other JSON parsers (e.g. ijson), but for different reasons none of them solved my problem. The only way left, if I want to keep the transformation via json, would be to chunk bigger selections into smaller groups of documents and parse them iteratively.
At this point I thought (and this is my main question here): maybe there is a smarter way to do the unnesting without the detour through json, directly in MongoDB, in Pandas, or in some combination of the two?
This is a shortened example Doc:
{
    '_id': ObjectId('5b40fcc4affb061b8871cbc5'),
    'eventId': 2,
    'sId' : 6833,
    'stage': {
        'value': 1,
        'Name': 'FirstStage'
    },
    'quality': [
        {
            'type': {
                'value': 2,
                'Name': 'Color'
            },
            'value': '124'
        },
        {
            'type': {
                'value': 7,
                'Name': 'Length'
            },
            'value': 'Short'
        },
        {
            'type': {
                'value': 15,
                'Name': 'Printed'
            }
        }
    ]
}
This is what a successful dataframe representation would look like (I skipped columns '_id' and 'sId' for readability):
   eventId  stage.value  stage.name    q_color  q_length  q_printed
1  2        1            'FirstStage'  124      'Short'   1
My code so far (which runs into memory problems - see above):
import json
import pandas as pd
from bson import json_util
from pandas import json_normalize

# events is the pymongo collection, defined elsewhere
def load_events(filter='sId', id=6833, all=False):
    if all:
        print('Loading all events.')
        cursor = events.find()
    else:
        print('Loading events with %s equal to %s.' % (filter, id))
        print('Filtering...')
        cursor = events.find({filter: id})
    print('Loading...')
    l = list(cursor)
    print('Parsing json...')
    sanitized = json.loads(json_util.dumps(l))
    print('Parsing quality...')
    for ev in sanitized:
        # turn every quality entry into a q_<Name> key; a missing value becomes 1
        for q in ev['quality']:
            name = 'q_' + str(q['type']['Name'])
            value = q.pop('value', 1)
            ev[name] = value
        ev.pop('quality', None)
    normalized = json_normalize(sanitized)
    df = pd.DataFrame(normalized)
    return df
You don't need to convert the nested structures using json parsers. Just create your dataframe from the record list:
df = DataFrame(list(cursor))
and afterwards use pandas in order to unpack your lists and dictionaries:
import pandas
from itertools import chain
import numpy

# t is the list of documents, e.g. list(cursor)
df = pandas.DataFrame(t)
df['stage.value'] = df['stage'].apply(lambda cell: cell['value'])
df['stage.name'] = df['stage'].apply(lambda cell: cell['Name'])
df['q_'] = df['quality'].apply(lambda cell: [(m['type']['Name'], m['value'] if 'value' in m.keys() else 1) for m in cell])
df['q_'] = df['q_'].apply(lambda cell: dict((k, v) for k, v in cell))
keys = set(chain(*df['q_'].apply(lambda column: column.keys())))
for key in keys:
    column_name = 'q_{}'.format(key).lower()
    df[column_name] = df['q_'].apply(lambda cell: cell[key] if key in cell.keys() else numpy.nan)
df.drop(['stage', 'quality', 'q_'], axis=1, inplace=True)
I unpack the nested data types in several steps. First, the names and values are used to create a flat list of pairs (tuples). Second, a dictionary is built from those tuples, taking keys from the first position and values from the second. Then all existing property names are extracted once using a set. Each property gets a new column in a loop, and inside the loop the value of each pair is mapped to the respective column cell, with NaN where a document lacks that property.
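As a complement to the approach above, the flattening can also be done one document at a time in plain Python, before anything reaches pandas, which avoids the JSON round trip entirely. The following is only a minimal sketch under assumptions: `flatten_event` is a hypothetical helper name, the sample document is abbreviated from the example in the question, and only one level of dict nesting is handled.

```python
def flatten_event(doc):
    """Flatten one event document into a single-level dict (hypothetical helper)."""
    flat = {}
    for key, value in doc.items():
        if key == "quality":
            # Each quality entry becomes a q_<name> key; a missing 'value' defaults to 1.
            for q in value:
                flat["q_" + str(q["type"]["Name"]).lower()] = q.get("value", 1)
        elif isinstance(value, dict):
            # One level of nesting, e.g. stage -> stage.value, stage.Name
            for sub_key, sub_value in value.items():
                flat[f"{key}.{sub_key}"] = sub_value
        else:
            flat[key] = value
    return flat


# Abbreviated sample document (the '_id' field is left out here)
sample = {
    "eventId": 2,
    "sId": 6833,
    "stage": {"value": 1, "Name": "FirstStage"},
    "quality": [
        {"type": {"value": 2, "Name": "Color"}, "value": "124"},
        {"type": {"value": 7, "Name": "Length"}, "value": "Short"},
        {"type": {"value": 15, "Name": "Printed"}},
    ],
}

row = flatten_event(sample)
print(row)
```

With a pymongo cursor, something like pd.DataFrame([flatten_event(d) for d in cursor]) should then build the frame directly. For very large selections the flat rows could be collected in batches instead of all at once.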
Given a list of dictionaries such as:
list_ = [
    { 'name' : 'date',
      'value': '2021-01-01'
    },
    { 'name' : 'length',
      'value': '500'
    },
    { 'name' : 'server',
      'value': 'g.com'
    },
]
How can I access the value where the key name == length?
I want to avoid iteration if possible, and just be able to check if the key called 'length' exists, and if so, get its value.
With iteration, and using next, you could do:
list_ = [
{'name': 'date',
'value': '2021-01-01'
},
{'name': 'length',
'value': '500'
},
{'name': 'server',
'value': 'g.com'
}
]
res = next(dic["value"] for dic in list_ if dic.get("name", "") == "length")
print(res)
Output
500
As an alternative, if the "names" are unique you could build a dictionary to avoid further iterations on list_, as follows:
lookup = {d["name"] : d["value"] for d in list_}
res = lookup["length"]
print(res)
Output
500
Notice that if you need a second key such as "server", you won't need to iterate, just do:
lookup["server"] # g.com
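One caveat with both variants: next raises StopIteration, and lookup["..."] raises KeyError, when no entry with that name exists. Since the question asks to check whether 'length' exists at all, here is a small sketch that returns a default instead. The name 'width' is just a made-up example of a missing entry.

```python
list_ = [
    {"name": "date", "value": "2021-01-01"},
    {"name": "length", "value": "500"},
    {"name": "server", "value": "g.com"},
]

# next() accepts a default as its second argument; it is returned
# when the generator yields nothing.
length = next((d["value"] for d in list_ if d.get("name") == "length"), None)
missing = next((d["value"] for d in list_ if d.get("name") == "width"), None)

# dict.get() plays the same role for the lookup-table variant.
lookup = {d["name"]: d["value"] for d in list_}
server = lookup.get("server")
absent = lookup.get("width", "n/a")

print(length, missing, server, absent)
```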
It sure is hard to find an element in a list without iterating through it. That's the first solution I will show:
list(filter(lambda element: element['name'] == 'length', list_))[0]['value']
This filters your list down to the elements whose name is 'length', takes the first of them, and then selects the 'value' of that element.
Now, if you had a better data structure, you wouldn't have to iterate. In order to create that better data structure, unfortunately, we will have to iterate the list. A list of dicts with "name" and "value" could really just be a single dict where "name" is the key and "value" is the value. To create that dict:
dict_ = {item['name']:item['value'] for item in list_}
Then you can just select 'length':
dict_['length']
I'm using the Google Sheets API to get data, which I then pass to Pandas so I can easily work with it.
Let's say I want to get a sheet with the following data (depicted as a JSON object, as tables weren't rendered well here):
{
columns: ['Name', 'Age', 'Tlf.', 'Address'],
data: ['Julie', '35', '12345', '8 Leafy Street']
}
The sheets API will return something along the lines of this:
{
'range': 'Cases!A1:AE999',
'majorDimension': 'ROWS',
'values':
[
['Name', 'Age', 'Tlf.', 'Address'],
['Julie', '35', '12345', '8 Leafy Street']
]
}
This is great and allows me to easily pass the column headings and data to Pandas without much fuss. I do this in the following manner:
values = sheets_api_result["values"]
df = pd.DataFrame(values[1:], columns=values[0])
My Problem
If I have a G Suite sheet that looks like the table below, depicted as a key:value data type
{
columns: ['Name', 'Age', 'Tlf.', 'Address'],
data: ['Julie', '35', '', '']
}
I will receive the following response
{
'range': 'Cases!A1:AE999',
'majorDimension': 'ROWS',
'values':
[
['Name', 'Age', 'Tlf.', 'Address'],
['Julie', '35']
]
}
Note that the lengths of the two arrays are not equal, and that instead of None or null values being returned, the data is simply not present in the response.
When working with this data in my code, I end up with an error that looks like this
ValueError: 4 columns passed, passed data had 2 columns
So as far as I can tell I have two options:
1. Come up with a clever way to pad my response with None where necessary.
2. If possible, instruct the API to return a null value in the JSON where null values exist, especially when the last column(s) have no data at all.
With regard to point 1, I think I can append x None values to the list, where x is equal to length_of_column_heading_array - length_of_data_array. This does however seem ugly, and perhaps there is a more elegant way of doing it.
And with regards to point 2, I haven't managed to find an answer that helps me.
If anyone has any ideas on how I can solve this, I'd be very grateful.
Cheers!
If anyone is interested, here is how I solved the issue.
First, we need to get all the data from the Sheets API.
# define the names of the tabs I want to get
ranges = ['tab1', 'tab2']
# Call the Sheets API
request = service.spreadsheets().values().batchGet(spreadsheetId=document, ranges=ranges,)
response = request.execute()
Now I want to go through every row and ensure that each row's list contains the same number of elements as the first row, which contains the column headings.
# response is the response from google sheets API,
# and from the code above. It contains column headings
# and data from every row.
# valueRanges is the key to access the data.
def extract_case_data(response, keyword):
    for obj in response["valueRanges"]:
        if keyword in obj["range"]:
            values = pad_data(obj["values"])
            df = pd.DataFrame(values[1:], columns=values[0])
            return df
    return None
And finally, the method to pad the data
def pad_data(data: list):
    # build a new array with the column heading data
    # this is the list which we will return
    return_data = [data[0]]
    for row in data[1:]:
        difference = len(data[0]) - len(row)
        # copy the row so the API response is not modified in place
        new_row = list(row)
        # append None to the lists which have a shorter
        # length than the column heading list
        for count in range(1, difference + 1):
            new_row.append(None)
        return_data.append(new_row)
    return return_data
I'm certainly not saying that this is the best or most elegant solution, but it has done the trick for me.
Hope this helps someone.
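Same idea, with perhaps a simpler look.
Get the raw values:
result = service.spreadsheets().values().get(spreadsheetId=spreadsheet_id, range=data_range).execute()
raw_values = result.get('values', [])
Then pad each row while iterating (expected_length is the number of column headings). Note that reassigning row inside a plain for loop would not change raw_values, so build a new list:
padded_values = [row + [''] * (expected_length - len(row)) for row in raw_values]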
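The padding idea from both answers can be sketched without the API as follows. `pad_rows` is a hypothetical helper name and the 'Bob' row is made-up sample data. The assumption is that the first row holds the column headings and that no data row is longer than it.

```python
def pad_rows(values, fill=None):
    """Return copies of all rows, each padded to the header length (hypothetical helper)."""
    if not values:
        return []
    width = len(values[0])
    # A non-positive multiplier yields an empty list, so full rows stay unchanged.
    return [list(row) + [fill] * (width - len(row)) for row in values]


# Shape of the 'values' list as returned by the Sheets API, with ragged rows
values = [
    ["Name", "Age", "Tlf.", "Address"],
    ["Julie", "35"],
    ["Bob", "41", "67890"],
]

padded = pad_rows(values)
print(padded)
```

The result can then be passed on as before, e.g. pd.DataFrame(padded[1:], columns=padded[0]). Because the rows are copied, the original API response is left untouched.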
I have looked for solutions to my problem but couldn't find anything that applies. I'm trying to import a high dimension JSON file into a Pandas dataframe.
The structure is something like:
{ 'manufacturing_plant_events':
{ 'data':
{ 'shiftInformation':
{ 'shift1':
{ 'color': 'red'
, 'amount' : 32
, 'order' : None
},
'shift2':
{ 'color': 'blue'
, 'amount' : 44
, 'order' : 1
},
'shift3':
{ 'color': 'green'
, 'amount' : 98
, 'order' : 2
}
}
...}
...}
}
I have tried numerous solutions including:
json.loads()
pd.DataFrame(json)
json_normalize(json)
pd.read_json(json)
and others. I've also tried flattening my array and converting it into a dataframe, but that didn't work either. I'm not sure if this is even possible, or if a dataframe supports only a few levels of nesting.
The flattening I tried was simply to create columns in a dataframe that contain the leaf information. Hence, I'm also fine with a dataframe that has two columns: the full path to a node and the actual value stored in that node.
First row in my dataframe:
(
manufacturing_plant_events.data.shiftInformation.shift1.color
'red'
manufacturing_plant_events.data.shiftInformation.shift1.amount
32
manufacturing_plant_events.data.shiftInformation.shift1.order
None
)
and so on.
Any suggestion on how to solve this is highly appreciated.
I have come up with a dataframe by flattening the dict:
import pandas as pd

def flat_dict(dictionary, prefix):
    # Recursively walk the dict; each leaf becomes one row:
    # the keys along its path, followed by the leaf value
    if isinstance(dictionary, dict):
        rows = []
        for key, items in dictionary.items():
            rows += flat_dict(items, prefix + [key])
        return rows
    else:
        return [prefix + [dictionary]]

def dict_to_df(dictionary):
    return pd.DataFrame(flat_dict(dictionary, []))
Of course, you first need to load your JSON into a dict using the json package.
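For the "full path plus value" layout described in the question, pandas' own json_normalize may also be enough once the data is a Python dict rather than a JSON string. A minimal sketch, assuming a shortened version of the structure from the question (the elided parts are simply left out):

```python
import pandas as pd

data = {
    "manufacturing_plant_events": {
        "data": {
            "shiftInformation": {
                "shift1": {"color": "red", "amount": 32, "order": None},
                "shift2": {"color": "blue", "amount": 44, "order": 1},
            }
        }
    }
}

# One row whose column names are the dotted paths to each leaf
wide = pd.json_normalize(data)

# Reshape into two columns: the full path and the value stored at that leaf
long = wide.iloc[0].rename_axis("path").reset_index(name="value")
print(long)
```

If the JSON contains lists of records rather than only nested dicts, this simple call will not unpack them, so the recursive approach above may still be the better fit.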
I am extracting a column from an Excel document with pandas. After that, for each row of the selected column, I want to replace every key that appears in a list of dictionaries with its corresponding value.
import pandas as pd
file_loc = "excelFile.xlsx"
df = pd.read_excel(file_loc, usecols = "C")
In this case, the column I work with is df['Q10'], and it has more than 10k rows.
Traditionally, if I want to replace a value in df, I use:
df['Q10'].str.replace('val1', 'val2')
Now, I have a dictionary of words like:
mydic = [
    {
        'key': "wasn't",
        'value': 'was not'
    },
    {
        'key': "I'm",
        'value': 'I am'
    },
    # ... plus tons of lines of key-value pairs
]
Currently, I have created a function that iterates over "mydic" and replaces all occurrences one by one.
def replaceContractions(df, mydic):
    for cont in contractions:
        df.str.replace(cont['key'], cont['value'])
Next I call this function passing mydic and my dataframe:
replaceContractions(df['Q10'], contractions)
First problem: this is very expensive, because mydic has a lot of items and the data set is iterated once for each of them.
Second: it doesn't seem to work :(
Any ideas?
Convert your "dictionary" to a more friendly format:
m = {d['key'] : d['value'] for d in mydic}
m
{"I'm": 'I am', "wasn't": 'was not'}
Next, call replace with the regex switch and pass m to it.
df['Q10'] = df['Q10'].replace(m, regex=True)
replace accepts a dictionary of key-replacement pairs, and it should be much faster than iterating over the key-replacement pairs one at a time.
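Two details are worth spelling out. The loop in the question has no visible effect because str.replace returns a new Series and the result is never assigned. And with regex=True the dictionary keys are interpreted as regular expressions, so keys containing characters such as '.' or '?' would need escaping. As an alternative sketch (not the method from the answer above), the keys can be escaped and combined into one compiled pattern. The sample sentences are made up.

```python
import re
import pandas as pd

mydic = [
    {"key": "wasn't", "value": "was not"},
    {"key": "I'm", "value": "I am"},
]
mapping = {d["key"]: d["value"] for d in mydic}

# Escape every key and try longer keys first, so a longer contraction
# is preferred over a shorter one that is a prefix of it.
pattern = re.compile(
    "|".join(re.escape(k) for k in sorted(mapping, key=len, reverse=True))
)

s = pd.Series(["I'm sure it wasn't me", "nothing here"])

# A callable replacement receives the match object and looks up its substitute.
result = s.str.replace(pattern, lambda m: mapping[m.group(0)], regex=True)
print(result.tolist())
```

This makes a single pass over each string regardless of how many key-value pairs there are.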
Currently I have a question about Python pandas. I want to filter a dataframe dynamically using a URL query string.
For example:
CSV:
url: http://example.com/filter?Name=Sam&Age=21&Gender=male
Hardcoded:
filtered_data = data[
    (data['Name'] == 'Sam') &
    (data['Age'] == 21) &
    (data['Gender'] == 'male')
]
I don't want to hard-code the filter keys as above, because the CSV file can change at any time and have different column headers.
Any suggestions?
The easiest way to create this filter dynamically is probably to use np.all.
For example:
import numpy as np
query = {'Name': 'Sam', 'Age': 21, 'Gender': 'male'}
filters = [data[k] == v for k, v in query.items()]
filter_data = data[np.all(filters, axis=0)]
Use df.query. For example:
df = pd.read_csv(url)
conditions = "Name == 'Sam' and Age == 21 and Gender == 'male'"
filtered_data = df.query(conditions)
You can build the conditions string dynamically using string formatting, for example (using {!r} so that string values are quoted):
conditions = " and ".join("{} == {!r}".format(col, val)
                          for col, val in zip(df.columns, values))
Typically, your web framework will return the arguments in a dict-like structure. Let's say your args are like this:
args = {
    'Name': ['Sam'],
    'Age': ['21'],  # Note that Age is a string
    'Gender': ['male']
}
You can filter your dataset successively like this:
for key, values in args.items():
    data = data[data[key].isin(values)]
However, this is likely not to match any data for Age, which may have been loaded as an integer. In that case, you could load all CSV columns as strings via pd.read_csv(filename, dtype=object), or convert to string before comparison:
for key, values in args.items():
    data = data[data[key].astype(str).isin(values)]
Incidentally, this will also match multiple values. For example, take the URL http://example.com/filter?Name=Sam&Name=Ben&Age=21&Gender=male -- which leads to the structure:
args = {
'Name': ['Sam', 'Ben'], # There are 2 names
'Age': ['21'],
'Gender': ['male']
}
In this case, both Ben and Sam will be matched, since we're using .isin to match.
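Putting the pieces together without a web framework, the query string can be parsed with the standard library, which yields the same dict-of-lists structure as above. This is only a sketch: `filter_by_query` is a hypothetical helper name, the data frame is made-up sample data, and silently skipping query keys that are not column names is an assumption about the desired behaviour.

```python
from urllib.parse import urlparse, parse_qs

import pandas as pd


def filter_by_query(data, url):
    """Filter a DataFrame by the query string of a URL (hypothetical helper)."""
    args = parse_qs(urlparse(url).query)  # e.g. {'Name': ['Sam'], 'Age': ['21']}
    mask = pd.Series(True, index=data.index)
    for key, values in args.items():
        if key not in data.columns:
            continue  # assumption: unknown keys are ignored
        # Compare as strings, since query-string values are always strings
        mask &= data[key].astype(str).isin(values)
    return data[mask]


data = pd.DataFrame({
    "Name": ["Sam", "Ben", "Sam", "Ann"],
    "Age": [21, 21, 30, 21],
    "Gender": ["male", "male", "male", "female"],
})

one = filter_by_query(data, "http://example.com/filter?Name=Sam&Age=21&Gender=male")
two = filter_by_query(data, "http://example.com/filter?Name=Sam&Name=Ben&Age=21&Gender=male")
print(one)
print(two)
```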