I have written code to encode one row of a dataframe to json, as follows:
def encode_df_metadata_row(df):
    return {'name': df['Title'].values[0],
            'code': df['Code'].values[0],
            'frequency': df['Frequency'].values[0],
            'description': df['Subtitle'].values[0],
            'source': df['Source'].values[0]}
Now I would like to encode an entire dataframe to json with some transformation, so I wrote this function:
def encode_metadata_list(df_metadata):
    return [encode_df_metadata_row(df_row) for index, df_row in df_metadata.iterrows()]
I then try to call the function using this code:
df_oodler_metadata = pd.read_csv('DATA\oodler-datasets-metadata.csv')
response = encode_metadata_list(df_oodler_metadata)
print(response)
When I run this code, I get the following error:
AttributeError: 'str' object has no attribute 'values'
I've tried a bunch of variations but I keep getting similar errors. Does someone know the right way to do this?
DataFrame.iterrows yields (index, row) pairs, where each row is a Series. The Series stores a single element per column, so the .values[0] part in your encode_df_metadata_row(df) function no longer applies - the correct form of the function is:
def encode_df_metadata_row(row):
    return {'name': row['Title'],
            'code': row['Code'],
            'frequency': row['Frequency'],
            'description': row['Subtitle'],
            'source': row['Source']}
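For reference, here is a quick sketch of the fixed pipeline (your encode_metadata_list plus the corrected row function; column names come from the question, the values are made up for illustration):

import pandas as pd

# toy one-row DataFrame with the question's columns
df = pd.DataFrame([{'Title': 'Rainfall', 'Code': 'RF01', 'Frequency': 'monthly',
                    'Subtitle': 'Monthly rainfall totals', 'Source': 'oodler'}])

print(encode_metadata_list(df))
# [{'name': 'Rainfall', 'code': 'RF01', 'frequency': 'monthly',
#   'description': 'Monthly rainfall totals', 'source': 'oodler'}]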
I'm using Python (Google Colab) and I have JSON data with some fields like:
[{'ActedBy': ['team'], 'ActedAt': '2022-03-07T22:43:46Z', 'Status': 'Completed', 'LAB': 'No'}]
I need to get the "ActedAt" value in order to get the date. How can I get this?
Thanks!
You have an array of dictionaries. First, grab a dictionary from the array by index, then proceed to get the ActedAt property. Something like this:
data = [{'ActedBy': ['team'], 'ActedAt': '2022-03-07T22:43:46Z', 'Status': 'Completed', 'LAB': 'No'}]
# put the index in a variable for readability
index = 0
# get the date you want
date = data[index]['ActedAt']
print(date)
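If you then need an actual date object rather than the raw string, one option (assuming the timestamp format shown, with a trailing Z) is:

from datetime import datetime

# parse the ISO-style timestamp and extract the date portion
acted_at = datetime.strptime(date, '%Y-%m-%dT%H:%M:%SZ')
print(acted_at.date())  # 2022-03-07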
I'm using the Google Sheets API to get data which I then pass to Pandas so I can easily work with it.
Let's say I want to get a sheet with the following data (depicted as a JSON object, since tables don't present well here):
{
    columns: ['Name', 'Age', 'Tlf.', 'Address'],
    data: ['Julie', '35', '12345', '8 Leafy Street']
}
The sheets API will return something along the lines of this:
{
    'range': 'Cases!A1:AE999',
    'majorDimension': 'ROWS',
    'values': [
        ['Name', 'Age', 'Tlf.', 'Address'],
        ['Julie', '35', '12345', '8 Leafy Street']
    ]
}
This is great and allows me to easily pass the column headings and data to Pandas without much fuss. I do this in the following manner:
values = sheets_api_result["values"]
df = pd.DataFrame(values[1:], columns=values[0])
My Problem
If I have a GSuite sheet that looks like the below table, depicted as a key:value data type:
{
    columns: ['Name', 'Age', 'Tlf.', 'Address'],
    data: ['Julie', '35', '', '']
}
I will receive the following response
{
    'range': 'Cases!A1:AE999',
    'majorDimension': 'ROWS',
    'values': [
        ['Name', 'Age', 'Tlf.', 'Address'],
        ['Julie', '35']
    ]
}
Note that the lengths of the two arrays are unequal, and that instead of None or null values being returned, the data is simply not present in the response.
When working with this data in my code, I end up with an error that looks like this
ValueError: 4 columns passed, passed data had 2 columns
So as far as I can tell I have two options:
Come up with a clever way to pad my response where necessary with None
If possible, instruct the API to return a null value in the JSON where null values exist, especially when the last column(s) have no data at all.
With regard to point 1, I think I can append x None values to the list, where x equals length_of_column_heading_array - length_of_data_array. This does, however, seem ugly, and perhaps there is a more elegant way of doing it.
And with regard to point 2, I haven't managed to find an answer that helps me.
If anyone has any ideas on how I can solve this, I'd be very grateful.
Cheers!
If anyone is interested, here is how I solved the issue.
First, we need to get all the data from the Sheets API.
# define the names of the tabs I want to get
ranges = ['tab1', 'tab2']
# Call the Sheets API
request = service.spreadsheets().values().batchGet(spreadsheetId=document, ranges=ranges,)
response = request.execute()
Now I want to go through every row and ensure that it contains the same number of elements as the first row, which holds the column headings.
# response is the response from google sheets API,
# and from the code above. It contains column headings
# and data from every row.
# valueRanges is the key to access the data.
def extract_case_data(response, keyword):
    for obj in response["valueRanges"]:
        if keyword in obj["range"]:
            values = pad_data(obj["values"])
            df = pd.DataFrame(values[1:], columns=values[0])
            return df
    return None
And finally, the method to pad the data
def pad_data(data: list):
    # build a new list seeded with the column heading row;
    # this is the list which we will return
    return_data = [data[0]]
    for row in data[1:]:
        difference = len(data[0]) - len(row)
        new_row = row
        # append None to rows that are shorter
        # than the column heading list
        for count in range(difference):
            new_row.append(None)
        return_data.append(new_row)
    return return_data
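For example, fed the short response from the question, it fills in the missing trailing cells:

values = [['Name', 'Age', 'Tlf.', 'Address'],
          ['Julie', '35']]
print(pad_data(values))
# [['Name', 'Age', 'Tlf.', 'Address'], ['Julie', '35', None, None]]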
I'm certainly not saying that this is the best or most elegant solution, but it has done the trick for me.
Hope this helps someone.
Same idea, maybe simpler look:
Get raw values
result = service.spreadsheets().values().get(spreadsheetId=spreadsheet_id, range=data_range).execute()
raw_values = result.get('values', [])
Then pad in place while iterating (note the in-place +=; rebinding with row = row + ... would build a new list and leave raw_values unchanged):
expected_length = len(raw_values[0])
for row in raw_values:
    row += [''] * (expected_length - len(row))
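After padding, every row matches the heading row in length, so the DataFrame construction from earlier in the question works unchanged:

import pandas as pd

df = pd.DataFrame(raw_values[1:], columns=raw_values[0])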
When I run the following code I get an output containing a value (1.11113) that I want to use within the code (after this first section). The full output I get is shown after the code. Basically, what I'm trying to do is extract a real-time forex value to use in an order. This order would be placed after this initial code within the same Python module. Thanks for your help.
import json
from oandapyV20.contrib.requests import MarketOrderRequest
from oandapyV20.contrib.requests import TakeProfitDetails, StopLossDetails
import oandapyV20.endpoints.orders as orders
import oandapyV20
import oandapyV20.endpoints.pricing as pricing
from exampleauth import exampleAuth
import argparse
from oandapyV20 import API
from oandapyV20.exceptions import V20Error
import oandapyV20.endpoints.instruments as instruments
from oandapyV20.definitions.instruments import CandlestickGranularity
import re
# pricef=float(price)
# parser.add_argument('--price', choices=price, default='M', help='Mid/Bid/Ask')
accountID, access_token = exampleAuth()
api = oandapyV20.API(access_token=access_token)
params = {"instruments": "EUR_USD"}
r = pricing.PricingInfo(accountID=accountID, params=params)
rv = api.request(r)
print(rv)
OUTPUT
{'prices': [{'asks': [{'liquidity': 10000000, 'price': '1.11132'}],
             'bids': [{'liquidity': 10000000, 'price': '1.11113'}],
             'closeoutAsk': '1.11132',
             'closeoutBid': '1.11113',
             'instrument': 'EUR_USD',
             'quoteHomeConversionFactors': {'negativeUnits': '1.00000000',
                                            'positiveUnits': '1.00000000'},
             'status': 'tradeable',
             'time': '2020-05-31T23:02:34.271983628Z',
             'tradeable': True,
             'type': 'PRICE',
             'unitsAvailable': {'default': {'long': '3852555',
                                            'short': '3852555'},
                                'openOnly': {'long': '3852555',
                                             'short': '3852555'},
                                'reduceFirst': {'long': '3852555',
                                                'short': '3852555'},
                                'reduceOnly': {'long': '0', 'short': '0'}}}],
 'time': '2020-05-31T23:02:40.672716661Z'}
Your output is a dictionary with a number of nested lists and dictionaries.
To access a value in a dictionary, you use the same syntax as when accessing a member of a list, except that the key does not have to be a number; it can be almost any data type, most commonly a string. So rv['time'] in your case would yield '2020-05-31T23:02:40.672716661Z'.
Since the number 1.11113 appears twice in the dictionary, here are the two expressions that would access the corresponding fields:
rv['prices'][0]['bids'][0]['price']
and
rv['prices'][0]['closeoutBid']
This will be a string, so to use it as a number you would have to convert it using float().
Also notice the occasional [0] to access the first element of a list.
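Putting that together, a minimal sketch that pulls the bid price out of the response shown above:

# extract the first bid price and convert it to a number
bid = float(rv['prices'][0]['bids'][0]['price'])
print(bid)  # 1.11113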
Looks like you want:
rv['prices'][0]['bids'][0]['price']
at least for this case. It might not be that you always want the first price entry or the first bid entry, in which case you might want to do some sort of sorting or filtering on whatever criteria you want to use to pick the right entry from among more than one.
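For instance, if the params had requested several instruments, a hedged sketch of that kind of filtering (using only the fields visible in the output above) might look like:

# pick the price entry for a specific instrument,
# then take the highest bid among its bid entries
eur_usd = next(p for p in rv['prices'] if p['instrument'] == 'EUR_USD')
best_bid = max(float(b['price']) for b in eur_usd['bids'])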
I have a Collection with heavily nested docs in MongoDB that I want to flatten and import into Pandas. There are some nested dicts, but also a list of dicts that I want to transform into columns (see examples below for details).
I already have a function that works for smaller batches of documents. But the solution (I found it in the answer to this question) uses json. The problem with the json.loads operation is that it fails with a MemoryError on bigger selections from the Collection.
I tried many solutions suggesting other json-parsers (e.g. ijson), but for different reasons none of them solved my problem. The only way left, if I want to keep up the transformation via json, would be chunking bigger selections into smaller groups of documents and iterate the parsing.
At this point I thought (and this is my main question here): maybe there is a smarter way to do the unnesting without the detour through json, directly in MongoDB or in Pandas, or somehow combined?
This is a shortened example Doc:
{
    '_id': ObjectId('5b40fcc4affb061b8871cbc5'),
    'eventId': 2,
    'sId': 6833,
    'stage': {
        'value': 1,
        'Name': 'FirstStage'
    },
    'quality': [
        {
            'type': {
                'value': 2,
                'Name': 'Color'
            },
            'value': '124'
        },
        {
            'type': {
                'value': 7,
                'Name': 'Length'
            },
            'value': 'Short'
        },
        {
            'type': {
                'value': 15,
                'Name': 'Printed'
            }
        }
    ]
}
This is what a successful dataframe representation would look like (I skipped columns '_id' and 'sId' for readability):
   eventId  stage.value  stage.name    q_color  q_length  q_printed
1  2        1            'FirstStage'  124      'Short'   1
My code so far (which runs into memory problems - see above):
def load_events(filter = 'sId', id = 6833, all = False):
    if all:
        print('Loading all events.')
        cursor = events.find()
    else:
        print('Loading events with %s equal to %s.' % (filter, id))
        print('Filtering...')
        cursor = events.find({filter: id})
    print('Loading...')
    l = list(cursor)
    print('Parsing json...')
    sanitized = json.loads(json_util.dumps(l))
    print('Parsing quality...')
    for ev in sanitized:
        for q in ev['quality']:
            name = 'q_' + str(q['type']['Name'])
            value = q.pop('value', 1)
            ev[name] = value
        ev.pop('quality', None)
    normalized = json_normalize(sanitized)
    df = pd.DataFrame(normalized)
    return df
You don't need to convert the nested structures using json parsers. Just create your dataframe from the record list:
df = DataFrame(list(cursor))
and afterwards use pandas in order to unpack your lists and dictionaries:
import pandas
from itertools import chain
import numpy
df = pandas.DataFrame(list(cursor))
df['stage.value'] = df['stage'].apply(lambda cell: cell['value'])
df['stage.name'] = df['stage'].apply(lambda cell: cell['Name'])
df['q_'] = df['quality'].apply(lambda cell: [(m['type']['Name'], m['value'] if 'value' in m.keys() else 1) for m in cell])
df['q_'] = df['q_'].apply(lambda cell: dict((k, v) for k, v in cell))
keys = set(chain(*df['q_'].apply(lambda column: column.keys())))
for key in keys:
    column_name = 'q_{}'.format(key).lower()
    df[column_name] = df['q_'].apply(lambda cell: cell[key] if key in cell.keys() else numpy.nan)
df.drop(['stage', 'quality', 'q_'], axis=1, inplace=True)
I use three steps to unpack the nested data types. First, the names and values are used to create a flat list of pairs (tuples). In the second step, a dictionary built from those tuples takes its keys from the first and its values from the second position of each tuple. Then all existing property names are extracted once using a set. Each property gets a new column in a loop, and inside the loop the value of each pair is mapped to the respective column cell.
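As a sanity check, the first two steps run on the quality list from the example document give exactly the dictionary that the loop then spreads into columns:

quality = [
    {'type': {'value': 2, 'Name': 'Color'}, 'value': '124'},
    {'type': {'value': 7, 'Name': 'Length'}, 'value': 'Short'},
    {'type': {'value': 15, 'Name': 'Printed'}},
]
# step 1: flat list of (name, value) pairs; step 2: dict from those pairs
pairs = [(m['type']['Name'], m['value'] if 'value' in m.keys() else 1) for m in quality]
print(dict(pairs))  # {'Color': '124', 'Length': 'Short', 'Printed': 1}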
As a learning project, I'm using MongoDB with Bottle for a web service. What I want to do is fetch results from MongoDB and display them in a template. Here's the output I want from my template:
output.tpl
<html><body>
%for record in records:
<li>{{record.city}} {{record.date}}
%end
</body></html>
I can pull the data out no problem:
result = db.records.find(query).limit(3)
return template('records_template', records=result)
But this resulted in no output at all - some debugging shows me that result is some sort of cursor:
<pymongo.cursor.Cursor object at 0x1560dd0>
So I attempted to convert this in to something that the template would like:
result = db.records.find(query).limit(3)
viewmodel = []
for row in result:
    l = dict()
    for column in row:
        l[str(column)] = row[column]
    viewmodel.append(l)
return template('records_template', records=viewmodel)
Debugging shows me that my view data looks OK:
[{'_id': ObjectId('4fe3dfbc62933a0338000001'),
  'city': u'CityName',
  'date': u'Thursday June 21, 2012'},
 {'_id': ObjectId('4fe3dfbd62933a0338000088'),
  'city': u'CityName',
  'date': u'Thursday June 21, 2012'},
 {'_id': ObjectId('4fe3dfbd62933a0338000089'),
  'city': u'CityName',
  'date': u'Thursday June 21, 2012'}]
But this is the response I'm getting. Any ideas why?
AttributeError("'dict' object has no attribute 'city'",)
Edit: I added that bit about l[str(column)]=row[column] to convert the dictionary keys to non-unicode strings in case that was the problem, but it doesn't seem to matter either way.
You need to use the dictionary syntax to look up the properties:
{{record['city']}} {{record['date']}}
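With that change, output.tpl from the question becomes:

<html><body>
%for record in records:
<li>{{record['city']}} {{record['date']}}
%end
</body></html>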
result = db.records.find(query).limit(3)
viewmodel = []
for row in result:
    l = dict()
    for column in row:
        l[str(column)] = row[column]
    viewmodel.append(l)
return template('records_template', records=viewmodel)
can be simplified to:
result = db.records.find(query).limit(3)
return template('records_template', records=list(result))
Python's beauty...