How to pick apart array data

I'm trying to output just the employee data (empFirst, empLast, empSalary, empRoles) to a Bottle project. I just want the values, not the keys. How would I go about this? It feels like I've tried everything, but I can't get at the data I need!
My query:
```
emp_curs = connection.coll.find({},{"_id": False,"employee.empFirst":True})
dept_list = list(emp_curs)
```
(just playing with the first name for now until it's working)
My loop:
```
% for d in emp_list:
  % for i in d:
  <tr>
    <td>{{d[i]}}</td>
    <td>{{d[i]}}</td>
    <td>{{d[i]}}</td>
    <td>{{d[i]}}</td>
  </tr>
  %end
%end
```
That's the closest I've gotten. I'm looking to take all the data and place it in a table.
Sorry, here's some sample data:
[
    {
        "deptCode": "ACCT",
        "deptName": "Accounting",
        "deptBudget": 200000,
        "employee": [
            {
                "empFirst": "Marsha",
                "empLast": "Bonavoochi",
                "empSalary": 59000
            },
            {
                "empFirst": "Roberto",
                "empLast": "Acostaletti",
                "empSalary": 85000,
                "empRoles": [
                    "Manager"
                ]
            },
            {
                "empFirst": "Dini",
                "empLast": "Cappelletti",
                "empSalary": 50500
            }
        ]
    }
]

It looks like you are stopping just one layer early within your nested list of dictionaries. This should get you all the applicable values for the employee data:
for department in department_list:
    for employee in department["employee"]:
        for value in employee.values():
            print(value)  # or whatever operation you want, adding to the table in your case
It looks like adding to the table already works the way you want, so this loop should slot in. Based on the structure of your sample data, I'm assuming there will be multiple departments to pull this data from (hence me starting with department_list).
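As a rough sketch of how the loop above could feed a table, the snippet below builds one row of cell values per employee. The names FIELDS and employee_rows are made up for illustration, and it assumes that a missing key (such as empRoles for most employees) should become an empty cell.

```python
# Sample data shaped like the question's documents (one department shown).
department_list = [
    {
        "deptCode": "ACCT",
        "employee": [
            {"empFirst": "Marsha", "empLast": "Bonavoochi", "empSalary": 59000},
            {"empFirst": "Roberto", "empLast": "Acostaletti", "empSalary": 85000,
             "empRoles": ["Manager"]},
        ],
    }
]

# Hypothetical column order for the table.
FIELDS = ("empFirst", "empLast", "empSalary", "empRoles")


def employee_rows(departments):
    """Return one list of cell values per employee, in FIELDS order."""
    rows = []
    for department in departments:
        for employee in department.get("employee", []):
            row = []
            for field in FIELDS:
                value = employee.get(field, "")  # assumption: empty cell when a key is absent
                if isinstance(value, list):
                    value = ", ".join(value)  # flatten the roles list for display
                row.append(value)
            rows.append(row)
    return rows


rows = employee_rows(department_list)
print(rows)
```

Passing rows to the template would then let the inner loop emit one <td> per cell, instead of repeating {{d[i]}} four times.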

Related

Using pandas.json_normalize to "unfold" a dictionary of a list of dictionaries

I am new to Python (and coding in general) so I'll do my best to explain the challenge I'm trying to work through.
I'm working with a large dataset which was exported as a CSV from a database. However, there is one column within this CSV export that contains a nested list of dictionaries (as best as I can tell). I've looked around extensively online for a solution, including on Stack Overflow, but haven't quite found a full one. I think I understand conceptually what I'm trying to accomplish, but I'm not clear on the best method or data prepping process to use.
Here is an example of the data (pared down to just the two columns I'm interested in):
{
    "app_ID": {
        "0": "1abe23574",
        "1": "4gbn21096"
    },
    "locations": {
        "0": "[ {"loc_id" : "abc1", "lat" : "12.3456", "long" : "101.9876"},
               {"loc_id" : "abc2", "lat" : "45.7890", "long" : "102.6543"} ]",
        "1": "[ ]"
    }
}
Basically, each app_ID can have multiple locations tied to a single ID, or the list can be empty, as seen above. I have tried following some guides I found online that use pandas' json_normalize() function to "unfold" the list of dictionaries into their own rows in a pandas DataFrame.
I'd like to end up with something like this:
loc_id  lat      long      app_ID
abc1    12.3456  101.9876  1abe23574
abc2    45.7890  102.6543  1abe23574
etc...
I am learning about how to use the different functions of json_normalize, like "record_path" and "meta", but haven't been able to get it to work yet.
I have tried loading the json file into a Jupyter Notebook using:
with open('location_json.json', 'r') as f:
    data = json.loads(f.read())
df = pd.json_normalize(data, record_path = ['locations'])
but it only creates a dataframe that is 1 row and multiple columns long, where I'd like to have multiple rows generated from the inner-most dictionary that tie back to the app_ID and loc_ID fields.
Attempt at a solution:
I was able to get close to the dataframe format I wanted using:
with open('location_json.json', 'r') as f:
    data = json.loads(f.read())
df = pd.json_normalize(data['locations']['0'])
but that would then require some kind of iteration through the list in order to create a dataframe, and then I'd lose the connection to the app_ID fields. (As best as I can understand how the json_normalize function works).
Am I on the right track trying to use json_normalize, or should I start over again and try a different route? Any advice or guidance would be greatly appreciated.

I hesitate to suggest the convtools library to a beginner, because it is almost like another Python on top of Python: it lets you define data conversions dynamically and generates Python code under the hood.
But anyway, here is the code, if I understood the input data right:
import json
from convtools import conversion as c
data = {
    "app_ID": {"0": "1abe23574", "1": "4gbn21096"},
    "locations": {
        "0": """[ {"loc_id" : "abc1", "lat" : "12.3456", "long" : "101.9876" },
        {"loc_id" : "abc2", "lat" : "45.7890", "long" : "102.6543"} ]""",
        "1": "[ ]",
    },
}
# define it once and use multiple times
converter = (
    c.join(
        # converts "app_ID" data to iterable of dicts
        (
            c.item("app_ID")
            .call_method("items")
            .iter({"id": c.item(0), "app_id": c.item(1)})
        ),
        # converts "locations" data to iterable of dicts,
        # where each id like "0" is zipped to each location.
        # the result is iterable of dicts like {"id": "0", "loc": {"loc_id": ... }}
        (
            c.item("locations")
            .call_method("items")
            .iter(
                c.zip(id=c.repeat(c.item(0)), loc=c.item(1).pipe(json.loads))
            )
            .flatten()
        ),
        # join on "id"
        c.LEFT.item("id") == c.RIGHT.item("id"),
        how="full",
    )
    # process results, where 0 index is LEFT item, 1 index is the RIGHT one
    .iter(
        {
            "loc_id": c.item(1, "loc", "loc_id", default=None),
            "lat": c.item(1, "loc", "lat", default=None),
            "long": c.item(1, "loc", "long", default=None),
            "app_id": c.item(0, "app_id"),
        }
    )
    .as_type(list)
    .gen_converter()
)
result = converter(data)
assert result == [
    {'loc_id': 'abc1', 'lat': '12.3456', 'long': '101.9876', 'app_id': '1abe23574'},
    {'loc_id': 'abc2', 'lat': '45.7890', 'long': '102.6543', 'app_id': '1abe23574'},
    {'loc_id': None, 'lat': None, 'long': None, 'app_id': '4gbn21096'}
]
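For comparison, the same reshaping can be sketched with only the standard library, without convtools or json_normalize. This is a different technique from the answer above: it assumes each "locations" value is a JSON string (as in the sample) and that an app_ID with no locations should still produce one row of None values, mirroring the full join.

```python
import json

data = {
    "app_ID": {"0": "1abe23574", "1": "4gbn21096"},
    "locations": {
        "0": '[{"loc_id": "abc1", "lat": "12.3456", "long": "101.9876"},'
             ' {"loc_id": "abc2", "lat": "45.7890", "long": "102.6543"}]',
        "1": "[ ]",
    },
}

rows = []
for key, app_id in data["app_ID"].items():
    # Each cell holds a JSON string, so it has to be parsed before use.
    locations = json.loads(data["locations"].get(key, "[]"))
    if not locations:
        # Assumption: keep IDs without locations as a row of None values.
        rows.append({"loc_id": None, "lat": None, "long": None, "app_ID": app_id})
    for location in locations:
        rows.append({**location, "app_ID": app_id})

for row in rows:
    print(row)
```

Since rows is a plain list of dictionaries, pd.DataFrame(rows) should turn it into the table described in the question.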

Update single row formatting for entire sheet

I just want to apply formatting from a JSON entry. The first thing I did was set up my desired format on the second row of every column in my spreadsheet. I then retrieved those cells with a .get request (from A2 to AO3).
row_1 = google_api.service.spreadsheets().get(
    spreadsheetId=ss_id,
    ranges="Tab1!A2:AO3",
    includeGridData=True).execute()
The next thing I did was collect each of the formats for each column and record them in a dictionary.
my_dictionary_of_formats = {}
row_values = row_1['sheets'][0]['data'][0]['rowData'][0]['values']
for column in range(0, len(row_values)):
    my_dictionary_of_formats[column] = row_values[column]['effectiveFormat']
Now I have a dictionary of the effective formats for all my columns. I'm having trouble applying that format to all rows in each column. I tried a batchUpdate request:
cell_data = {
    "effectiveFormat": my_dictionary_of_formats[0]}
row_data = {
    "values": [
        cell_data
    ]
}
update_cell = {
    "rows": [
        row_data
    ],
    "fields": "*",
    "range":
        {
            "sheetId": input_master.tab_id,
            "startRowIndex": 2,
            "startColumnIndex": 0,
            "endColumnIndex": 1
        }
}
request_body = {
    "requests": [
        {"updateCells": update_cell}],
    "includeSpreadsheetInResponse": True,
    "responseIncludeGridData": True}
service.spreadsheets().batchUpdate(spreadsheetId=my_id, body=request_body).execute()
This wiped out everything and I'm not sure why. I don't think I understand the fields='*' attribute.
TL;DR
I want to apply a format to all rows in a single column. Much like if I used the "Paint Format" tool on the second row, first column and dragged it all the way down to the last row.
-----Update
Hi, thanks to the comments this was my solution:
### Part 1, collect all formats from second row
import json
row_2 = google_api.service.spreadsheets().get(
    spreadsheetId=spreadsheet_id,
    ranges="tab1!A2:AO2",
    includeGridData=True).execute()
my_dictionary = {}
row_values = row_2['sheets'][0]['data'][0]['rowData'][0]['values']
for column in range(0, len(row_values)):
    my_dictionary[column] = row_values[column]
# json.dump (not json.dumps) writes to a file object
json.dump(my_dictionary, open('config/format.json', 'w'))
### Part 2, apply formats
requests = []
my_dict = json.load(open('config/format.json'))
for column in my_dict:
    requests.append(
        {
            "repeatCell": {
                "range": {
                    "sheetId": tab_id,
                    "startRowIndex": str(1),
                    "startColumnIndex": str(column),
                    "endColumnIndex": str(int(column)+1)
                },
                "cell": {
                    "userEnteredFormat": my_dict[column]
                },
                'fields': "userEnteredFormat({})".format(",".join(my_dict[column].keys()))
            }
        })
body = {"requests": requests}
google_api.service.spreadsheets().batchUpdate(spreadsheetId=s.spreadsheet_id,body=body).execute()

When you include fields as a part of the request, you indicate to the API endpoint that it should overwrite the specified fields in the targeted range with the information found in your uploaded resource. fields="*" correspondingly is interpreted as "This request specifies the entire data and metadata of the given range. Remove any previous data and metadata from the range and use what is supplied instead."
Thus, anything not specified in your updateCells requests will be removed from the range supplied in the request (e.g. values, formulas, data validation, etc.).
You can learn more in the guide to batchUpdate.
For an updateCells request, the fields parameter is described as follows:
"The fields of CellData that should be updated. At least one field must be specified. The root is the CellData; 'row.values.' should not be specified. A single "*" can be used as short-hand for listing every field."
If you then view the resource description of CellData, you observe the following fields:
"userEnteredValue"
"effectiveValue"
"formattedValue"
"userEnteredFormat"
"effectiveFormat"
"hyperlink"
"note"
"textFormatRuns"
"dataValidation"
"pivotTable"
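Thus, the fields specification should name only the field you actually supply in your row_data property. Note that effectiveFormat is read-only in the API, so to write a format you would send it as userEnteredFormat in the cell data and set fields="userEnteredFormat".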
Consider also using the repeatCell request if you are just specifying a single format.
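A minimal sketch of what such a repeatCell request body could look like is shown below. The helper name build_repeat_cell_request, the sheet ID 0, and the sample bold format are assumptions for illustration. The snippet only builds the dictionary and does not call the API. Leaving endRowIndex out of the range is intended to make the range run to the end of the sheet.

```python
def build_repeat_cell_request(sheet_id, column_index, cell_format, start_row_index=1):
    """Build a repeatCell request that applies cell_format down one column.

    Hypothetical helper: only userEnteredFormat is listed in "fields",
    so cell values and other properties are left untouched.
    """
    return {
        "repeatCell": {
            "range": {
                "sheetId": sheet_id,
                "startRowIndex": start_row_index,    # 0-based, so 1 is the second row
                "startColumnIndex": column_index,
                "endColumnIndex": column_index + 1,  # end index is exclusive
            },
            "cell": {"userEnteredFormat": cell_format},
            "fields": "userEnteredFormat",
        }
    }


# Assumed example format: bold text in column A of sheet 0.
sample_format = {"textFormat": {"bold": True}}
body = {"requests": [build_repeat_cell_request(0, 0, sample_format)]}
print(body)
```

The resulting body would be passed to spreadsheets().batchUpdate(spreadsheetId=..., body=body) in the same way as the question's request_body.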

Django queryset: Build a dictionary from a queryset, with common elements

I'm trying to construct a dictionary from my database, that will separate my data into values with common time stamps.
data_point:
    time: <timestamp>
    value: integer
I have 66k data points, of which around 7k share timestamps (meaning the measurements were taken at the same time).
I need to make a dict that would look like:
{
    "data_array": [
        {
            "time": "2018-05-11T10:34:43.826Z",
            "values": [
                13560465,
                87856595,
                78629348
            ]
        },
        {
            "time": "2018-05-11T10:34:43.882Z",
            "values": [
                13560689,
                78237945,
                92378456
            ]
        }
    ]
}
There are other keys in the dictionary, but I'm just having a bit of a struggle with this particular key.
The idea is to look at my data queryset and group objects that share a timestamp, then add a key "time" to my dict, with the timestamp as its value, and an array "values" holding the list of those data.value objects.
I'm not experienced enough to build this without looping a lot and probably being very inefficient. I'm thinking of some kind of "while timestamp doesn't change: append value to list", though I'm not sure how to go about that either.
Ideally, if I can do this with queries (should be faster, right?), I would prefer that.

Why not use collections.defaultdict?
from collections import defaultdict
data = defaultdict(list)
# qs is your queryset
for time, value in qs.values_list('time', 'value'):
    data[time].append(value)
In that case data looks like:
{
    'time_1': [
        value_1_1,
        value_1_2,
        ...
    ],
    'time_2': [
        value_2_1,
        value_2_2,
        ...
    ],
    ....
}
At this point you can build any output format you want.
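As a small sketch of that last step, the snippet below turns the grouped values into the "data_array" shape from the question. The list of (time, value) tuples stands in for qs.values_list('time', 'value'). Sorting by timestamp is an assumption, not something the question requires.

```python
from collections import defaultdict

# Stand-in for qs.values_list('time', 'value'): an iterable of (time, value) pairs.
pairs = [
    ("2018-05-11T10:34:43.826Z", 13560465),
    ("2018-05-11T10:34:43.882Z", 13560689),
    ("2018-05-11T10:34:43.826Z", 87856595),
    ("2018-05-11T10:34:43.882Z", 78237945),
]

grouped = defaultdict(list)
for time, value in pairs:
    grouped[time].append(value)

# Assumption: entries are ordered by timestamp in the output.
result = {
    "data_array": [
        {"time": time, "values": values}
        for time, values in sorted(grouped.items())
    ]
}
print(result)
```

This makes a single pass over the rows, so the query itself stays a plain values_list call and the grouping happens in Python.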

Create a data frame from a complex nested dictionary?

I have a big, deeply nested JSON file saved in .txt format. I need to access some specific key pairs and create a data frame or another transformed JSON object for further use. Here is a small sample with 2 key pairs.
[
    {
        "ko_id": [819752],
        "concepts": [
            {
                "id": ["11A71731B880:http://ontology.intranet.com/Taxonomy/116#en"],
                "uri": ["http://ontology.intranet.com/Taxonomy/116"],
                "language": ["en"],
                "prefLabel": ["Client coverage & relationship management"]
            }
        ]
    },
    {
        "ko_id": [819753],
        "concepts": [
            {
                "id": ["11A71731B880:http://ontology.intranet.com/Taxonomy/116#en"],
                "uri": ["http://ontology.intranet.com/Taxonomy/116"],
                "language": ["en"],
                "prefLabel": ["Client coverage & relationship management"]
            }
        ]
    }
]
The following code loads the data as a list, but I probably need to access the data as a dictionary. I need the "ko_id", "uri" and "prefLabel" from each key pair, put into a pandas data frame or a dictionary for further analysis.
with open('sample_data.txt') as data_file:
    json_sample = js.load(data_file)
The following code gives me the exact values of the first element, but I do not know how to put it together and build the final algorithm to create the data frame.
print(sample_dict["ko_id"][0])
print(sample_dict["concepts"][0]["prefLabel"][0])
print(sample_dict["concepts"][0]["uri"][0])

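# DataFrame.append was removed in pandas 2.0, so collect the pieces and concatenate once
frames = []
for record in sample_dict:
    df = pd.DataFrame(record['concepts'])
    df['ko_id'] = record['ko_id'][0]
    frames.append(df)
final_df = pd.concat(frames, ignore_index=True)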

You can pass the data to pandas.DataFrame using a generator:
import pandas as pd
import json as js
with open('sample_data.txt') as data_file:
    json_sample = js.load(data_file)
df = pd.DataFrame(data = ((key["ko_id"][0],
                           key["concepts"][0]["prefLabel"][0],
                           key["concepts"][0]["uri"][0]) for key in json_sample),
                  columns = ("ko_id", "prefLabel", "uri"))
Output:
>>> df
    ko_id                                  prefLabel                                        uri
0  819752  Client coverage & relationship management  http://ontology.intranet.com/Taxonomy/116
1  819753  Client coverage & relationship management  http://ontology.intranet.com/Taxonomy/116
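The generator above reads only the first entry of each "concepts" list. If a record can hold several concepts, a sketch along the following lines would emit one row per concept instead. The second concept in the sample (Taxonomy/117, "Second label") is made up for illustration and is not part of the original data.

```python
# Sample shaped like the question's file; the second concept is hypothetical.
json_sample = [
    {
        "ko_id": [819752],
        "concepts": [
            {"uri": ["http://ontology.intranet.com/Taxonomy/116"],
             "prefLabel": ["Client coverage & relationship management"]},
            {"uri": ["http://ontology.intranet.com/Taxonomy/117"],
             "prefLabel": ["Second label"]},
        ],
    },
    {
        "ko_id": [819753],
        "concepts": [
            {"uri": ["http://ontology.intranet.com/Taxonomy/116"],
             "prefLabel": ["Client coverage & relationship management"]},
        ],
    },
]

# One row per (record, concept) pair; every value is a one-element list in the source.
rows = [
    {
        "ko_id": record["ko_id"][0],
        "prefLabel": concept["prefLabel"][0],
        "uri": concept["uri"][0],
    }
    for record in json_sample
    for concept in record["concepts"]
]

for row in rows:
    print(row)
```

The list of dictionaries can be handed to pd.DataFrame(rows) to get the same three columns as in the output above.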

How to get documents where KEY is greater than X

I am recording users' daily usage of my platform.
The documents in MongoDB are structured like this:
_id: X
day1: {
    loginCount: 4
    someDict: { x:y, z:m }
}
day2: {
    loginCount: 5
    someDict: { a:b, c:d }
}
Then I need to get the last 2 days of stats for user X.
How can I get the values whose days are more recent than two days ago (for example, with the '$gte' operator)?

OK, if you insist on this schema, try this:
{
    _id: Usemongokeyhere,
    userid: X,
    days: [
        {day: ISODate("2013-08-12T00:00:00Z"),
         loginCount: 10,
         // more stuff
        },
        {day: ISODate("2013-08-13T00:00:00Z"),
         loginCount: 11,
         // more stuff
        },
    ]
},
// more users
Then you can query like:
db.items.find(
    {"days.day": {$gte: ISODate("2013-08-30T00:00:00.000Z"),
                  $lt: ISODate("2013-08-31T00:00:00.000Z")
                 }
    }
)
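From Python, a filter for "the last two days" against that restructured schema could be sketched as follows. The function name last_days_filter is made up, and the snippet only builds the filter dictionary and does not connect to a database. It uses $elemMatch so that both bounds apply to the same entry of the days array. Note that such a query still returns whole documents, so the matching day entries would need to be picked out afterwards.

```python
from datetime import datetime, timedelta, timezone


def last_days_filter(user_id, now, days=2):
    """Build a find() filter for users with a day entry in the last `days` days.

    Hypothetical helper: assumes documents shaped like
    {"userid": ..., "days": [{"day": <datetime>, "loginCount": ...}, ...]}.
    """
    # Start at midnight, `days` days before `now`.
    start = (now - timedelta(days=days)).replace(
        hour=0, minute=0, second=0, microsecond=0
    )
    return {
        "userid": user_id,
        "days": {"$elemMatch": {"day": {"$gte": start, "$lte": now}}},
    }


# Fixed timestamp so the example is reproducible.
now = datetime(2013, 8, 31, 12, 30, tzinfo=timezone.utc)
query = last_days_filter("X", now)
print(query)
```

With a driver such as pymongo, the dictionary would be passed as the first argument to a collection's find() call.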

Unless the question changes, I am answering based on this schema:
_id: X
day1: {
    loginCount: 4
    someDict: { x:y, z:m }
}
day2: {
    loginCount: 5
    someDict: { a:b, c:d }
}
Answer:
last 2 day's user stats which belongs to user X.
With this structure you cannot do it on the MongoDB side with operators like $gte, because a query for user X returns the whole document, and that document contains the information for all days. Keeping dynamic values as keys is, in my opinion, bad practice. You can retrieve specific days by naming the fields in a projection, like db.collection.find({_id:X},{day1:1,day2:1}).
However, you have to know what the keys are, and I am not sure how you store day1 and day2 as keys: as an ISO date or as a timestamp? Depending on how you store them, you can build the projection by writing yesterday and the day before yesterday as a date string or timestamp, and get the information you need.
