returning data from a dataframe using jsonify - python

I have a web service, and the pattern for returning data is basically: get the required data into a dataframe, then use the code below to return it.
return jsonify([{'id': row.id,
                 'name': row.name,
                 'age': row.age
                 } for row in data.itertuples()])
This works fine. However, now that I have a dataframe with 30-odd columns, is there a more concise way of doing this? I don't want to copy the above and write 30 lines of 'some_name': row.some_name.

You could iterate over the attributes of each row and keep the ones that aren't special attributes (names starting with _) or functions.
def get_attributes(obj):
    return {
        attr: getattr(obj, attr) for attr in dir(obj)
        if not attr.startswith('_') and not callable(getattr(obj, attr))
    }
Example usage:
import pandas as pd

data = pd.DataFrame(
    {
        'id': [0, 1],
        'name': ['name_1', 'name_2'],
        'age': [16, 32]
    },
    index=['dog', 'hawk']
)
print([
    get_attributes(row)
    for row in data.itertuples()
])
Output:
[{'Index': 'dog', 'age': 16, 'id': 0, 'name': 'name_1'},
{'Index': 'hawk', 'age': 32, 'id': 1, 'name': 'name_2'}]
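If the goal is simply one dict per row, pandas can also produce the records directly: DataFrame.to_dict(orient='records') returns a list of {column: value} dicts, and the namedtuples from itertuples() expose _asdict(). A minimal sketch, assuming the same data frame and the Flask jsonify used in the question:
from flask import jsonify

def all_rows(data):
    # One dict per row, keyed by column name; the index is not included.
    # Call data.reset_index() first if you need it as a field.
    return jsonify(data.to_dict(orient='records'))
The equivalent with itertuples() would be jsonify([row._asdict() for row in data.itertuples(index=False)]).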

Related

Replace single quotes with doubles to turn contents of a file into a nested JSON and normalize it afterwards

I have 70k files all of which look similar to this:
{'id': 24, 'name': None, 'city': 'City', 'region_id': 19,
'story_id': 1, 'description': 'text', 'uik': None, 'ustatus': 'status',
'wuiki_tik_name': '', 'reaction': None, 'reaction_official': '',
'created_at': '2011-09-07T07:24:44.420Z', 'lat': 54.7, 'lng': 20.5,
'regions': {'id': 19, 'name': 'name'}, 'stories': {'id': 1, 'name': '2011-12-04'}, 'assets': [], 'taggings': [{'tags': {'id': 6, 'name': 'name',
'tag_groups': {'id': 3, 'name': 'Violation'}}},
{'tags': {'id': 8, 'name': 'name', 'tag_groups': {'id': 5, 'name': 'resource'}}},
{'tags': {'id': 1, 'name': '01. Federal', 'tag_groups': {'id': 1, 'name': 'Level'}}},
{'tags': {'id': 3, 'name': '03. Local', 'tag_groups': {'id': 1, 'name': 'stuff'}}},
{'tags': {'id': 2, 'name': '02. Regional', 'tag_groups':
{'id': 1, 'name': 'Level'}}}], 'message_id': None, '_count': {'assets': 0, 'other_messages': 0, 'similars': 0, 'taggings': 5}}
The ultimate goal is to export it into a single CSV file. It can be successfully done without flattening. But since it has a lot of nested values, I would like to flatten it, and this is where I began facing problems related to data types. Here's the code:
import json
from pandas.io.json import json_normalize
import glob

path = glob.glob("all_messages/*.json")
for file in path:
    with open(file, "r") as filer:
        content = json.loads(json.dumps(filer.read()))
    if content != 404:
        df_main = json_normalize(content)
        df_regions = json_normalize(content, record_path=['regions'], record_prefix='regions.', meta=['id'])
        df_stories = json_normalize(content, record_path=['stories'], record_prefix='stories.', meta=['id'])
        # ... More code related to normalization
        df_out.to_csv('combined_json.csv')
This code occasionally throws:
AttributeError: 'str' object has no attribute 'values' or ValueError: DataFrame constructor not properly called!. I realise that this is caused by json.dumps() producing a JSON string rather than a dict. However, I have failed to turn it into anything usable.
Any possible solutions to this?
If you only need to change ' to ":
...
for file in path:
    with open(file, "r") as filer:
        content = filer.read().replace("'", '"')
...
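Note that a plain quote swap breaks on values like None (which is not valid JSON) and on apostrophes inside strings. Since the files are Python dict literals rather than JSON, a safer sketch, and an alternative of my own rather than the original approach, is to parse them with ast.literal_eval:
import ast
import glob
import pandas as pd

frames = []
for file in glob.glob("all_messages/*.json"):
    with open(file, "r") as filer:
        # literal_eval parses Python dict literals directly, so None,
        # True/False and embedded apostrophes all survive intact.
        content = ast.literal_eval(filer.read())
    frames.append(pd.json_normalize(content))

pd.concat(frames, ignore_index=True).to_csv("combined_json.csv", index=False)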
Making copies and using grep would be easier
While it is not the solution I was initially expecting, this approach worked as well. I kept getting error messages related to the structure of the dict literals, which were reluctant to become JSON, so I took the CSV file that I wanted to normalise and worked through it column by column:
df = pd.read_csv("combined_json.csv")
df['regions'] = df['regions'].apply(lambda x: x.replace("'", '"'))
regions = pd.json_normalize(df['regions'].apply(json.loads).tolist()).rename(
    columns=lambda x: x.replace('regions.', ''))
df['regions'] = regions['name']
Or, if it had more nested levels:
df['taggings'] = df['taggings'].apply(lambda x: x.replace("'", '"'))
taggings = pd.concat([pd.json_normalize(json.loads(j)) for j in df['taggings']])
df = df.reset_index(drop=True)
taggings = taggings.reset_index(drop=True)
df[['tags_id', 'nametag', 'group_tag', 'group_tag_name']] = taggings[['tags.id', 'tags.name', 'tags.tag_groups.id', 'tags.tag_groups.name']]
The result then went to a CSV file with df.to_csv().

Python: when extracting certain keys, how can I avoid a KeyError when the key is missing from some dict elements in API JSON?

I can successfully extract every column using Python, except the one I need most (order_id), from an API-generated JSON that lists field reps' interactions with clients.
Not all interactions result in orders; there are multiple types of interactions. I know I will need a fallback of None for missing keys, and then an if-statement in my for loop to check whether order_id is null; if it isn't, add the item to the list.
I just cannot figure it out so would appreciate every bit of help!
This is the code that works:
import requests
import json
r = requests.get(baseurl + endpoint + '?page_number=1' + '&page_size=2', headers=headers)
output = r.json()
interactions_list = []
for item in output['data']:
    columns = {
        'id': item['id'],
        'number': item['user_id'],
        'name': item['user_name'],
    }
    interactions_list.append(columns)
print(interactions_list)
This returns an error-free result:
[{'id': 1, 'number': 6, 'name': 'Johnny'}, {'id': 2, 'number': 7, 'name': 'David'}]
When I include the order_id in the loop:
interactions_list = []
for item in output['data']:
    columns = {
        'id': item['id'],
        'number': item['user_id'],
        'name': item['user_name'],
        'order': item['order_id'],
    }
    interactions_list.append(columns)
It returns:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_17856/1993147086.py in <module>
6 'number': item['user_id'],
7 'name': item['user_name'],
----> 8 'order': item['order_id'],
9 }
10
KeyError: 'order_id'
Use the .get() method of the dictionary:
columns = {
    'id': item.get('id'),
    'number': item.get('user_id'),
    'name': item.get('user_name'),
    'order': item.get('order_id'),
}
This will set your missing values to None. If you want a different default, pass a second argument to get, e.g. item.get('user_name', 'N/A').
EDIT: To conditionally add items based on the presence of the order_id
interactions_list = []
for item in output['data']:
    if 'order_id' in item:
        columns = {
            'id': item.get('id'),
            'number': item.get('user_id'),
            'name': item.get('user_name', 'N/A'),
            'order': item.get('order_id'),
        }
        interactions_list.append(columns)
Alternatively, you can use a list comprehension approach, which should be slightly more efficient than using list.append in a loop:
output = {'data': [{'order_id': 'n/a', 'id': '123'}]}
interactions_list = [
    {
        'id': item.get('id'),
        'number': item.get('user_id'),
        'name': item.get('user_name', 'N/A'),
        'order': item.get('order_id'),
    } for item in output['data'] if 'order_id' in item
]
# [{'id': '123', 'number': None, 'name': 'N/A', 'order': 'n/a'}]
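If instead you want every interaction regardless of whether order_id is present, pandas offers another route: pd.json_normalize builds the whole table in one call, and keys absent from some records simply come out as NaN. A small sketch under that assumption (the rename mapping is mine, not from the original):
import pandas as pd

df = pd.json_normalize(output['data'])
# Columns missing from some records (e.g. order_id) become NaN there.
df = df.rename(columns={'user_id': 'number', 'user_name': 'name',
                        'order_id': 'order'})
interactions_list = df.to_dict(orient='records')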

Extracting value for one dictionary key in Pandas based on another in the same dictionary

This is from an R guy.
I have this mess in a Pandas column: data['crew'].
array(["[{'credit_id': '54d5356ec3a3683ba0000039', 'department': 'Production', 'gender': 1, 'id': 494, 'job': 'Casting', 'name': 'Terri Taylor', 'profile_path': None}, {'credit_id': '56407fa89251417055000b58', 'department': 'Sound', 'gender': 0, 'id': 6745, 'job': 'Music Editor', 'name': 'Richard Henderson', 'profile_path': None}, {'credit_id': '5789212392514135d60025fd', 'department': 'Production', 'gender': 2, 'id': 9250, 'job': 'Executive In Charge Of Production', 'name': 'Jeffrey Stott', 'profile_path': None}, {'credit_id': '57892074c3a36835fa002886', 'department': 'Costume & Make-Up', 'gender': 0, 'id': 23783, 'job': 'Makeup Artist', 'name': 'Heather Plott', 'profile_path': None}
It goes on for quite some time. Each new dict starts with a credit_id field. One cell can hold several dicts in an array.
Assume I want the names of all Casting directors, as shown in the first entry. I need to check the job entry in every dict and, if it's Casting, grab what's in the name field and store it in my data frame in data['crew'].
I tried several strategies, then backed off and went for something simple.
Running the following shut me down, so I can't even access a simple field. How can I get this done in Pandas?
for row in data.head().iterrows():
    if row['crew'].job == 'Casting':
        print(row['crew'])
EDIT: Error Message
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-138-aa6183fdf7ac> in <module>()
1 for row in data.head().iterrows():
----> 2 if row['crew'].job == 'Casting':
3 print(row['crew'])
TypeError: tuple indices must be integers or slices, not str
EDIT: Code used to get the array of dict (strings?) in the first place.
def convert_JSON(data_as_string):
    try:
        dict_representation = ast.literal_eval(data_as_string)
        return dict_representation
    except ValueError:
        return []

data["crew"] = data["crew"].map(lambda x: sorted([d['name'] if d['job'] == 'Casting' else '' for d in convert_JSON(x)])).map(lambda x: ','.join(map(str, x)))
To create a DataFrame from your sample data, write:
df = pd.DataFrame(data=[
    {'credit_id': '54d5356ec3a3683ba0000039', 'department': 'Production',
     'gender': 1, 'id': 494, 'job': 'Casting', 'name': 'Terri Taylor',
     'profile_path': None},
    {'credit_id': '56407fa89251417055000b58', 'department': 'Sound',
     'gender': 0, 'id': 6745, 'job': 'Music Editor',
     'name': 'Richard Henderson', 'profile_path': None},
    {'credit_id': '5789212392514135d60025fd', 'department': 'Production',
     'gender': 2, 'id': 9250, 'job': 'Executive In Charge Of Production',
     'name': 'Jeffrey Stott', 'profile_path': None},
    {'credit_id': '57892074c3a36835fa002886', 'department': 'Costume & Make-Up',
     'gender': 0, 'id': 23783, 'job': 'Makeup Artist',
     'name': 'Heather Plott', 'profile_path': None}])
Then you can get your data with a single instruction:
df[df.job == 'Casting'].name
The result is:
0 Terri Taylor
Name: name, dtype: object
The result above is a Pandas Series holding the names found.
In this case, 0 is the index value of the matching record and
Terri Taylor is the name of the (only, in your data) Casting director.
Edit
If you want just a list (not Series), write:
df[df.job == 'Casting'].name.tolist()
The result is ['Terri Taylor'] - just a list.
I think both of my solutions should be quicker than an "ordinary" loop
based on iterrows().
To compare execution times, you may also try yet another solution:
df.query("job == 'Casting'").name.tolist()
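To actually measure this, a quick check with timeit is a reasonable sketch (on a frame this small the numbers are only indicative):
from timeit import timeit

print(timeit(lambda: df[df.job == 'Casting'].name.tolist(), number=1000))
print(timeit(lambda: df.query("job == 'Casting'").name.tolist(), number=1000))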
==========
And as far as your code is concerned:
iterrows() returns, for each row, a pair containing:
the index of the current row,
a Series with the content of this row.
So your loop should look something like:
for row in df.iterrows():
    if row[1].job == 'Casting':
        print(row[1]['name'])
You cannot write row[1].name because, on a Series, .name refers to the index value
(here we have a collision with a built-in attribute of the Series).
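As a side note (not from the original answer): with itertuples() that collision goes away, because each row comes back as a namedtuple whose fields are the column names, so the column is reachable as a plain attribute:
for row in df.itertuples():
    if row.job == 'Casting':
        # Here .name is the column value, not the index label.
        print(row.name)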

Replacement for dataframe.iterrows()

I'm working on a script for migrating data from MongoDB to ClickHouse. Because nested structures aren't implemented well enough in ClickHouse, I iterate over the nested structure and bring it to a flat representation, where every element of the nested structure is a distinct row in the ClickHouse database.
What I do is iterate over a list of dictionaries and take the target values. The structure looks like this:
[
    {
        'Comment': None,
        'Details': None,
        'FunnelId': 'MegafonCompany',
        'IsHot': False,
        'IsReadonly': False,
        'Name': 'Новый',
        'SetAt': datetime.datetime(2018, 4, 20, 10, 39, 55, 475000),
        'SetById': 'ekaterina.karpenko',
        'SetByName': 'Екатерина Карпенко',
        'Stage': {
            'Label': 'Новые',
            'Order': 0,
            '_id': 'newStage'
        },
        'Tags': None,
        'Type': 'Unknown',
        'Weight': 120,
        '_id': 'new'
    },
    {
        'Comment': None,
        'Details': {
            'Name': 'взят в работу',
            '_id': 1
        },
        'FunnelId': 'MegafonCompany',
        'IsHot': False,
        'IsReadonly': False,
        'Name': 'В работе',
        'SetAt': datetime.datetime(2018, 4, 20, 10, 40, 4, 841000),
        'SetById': 'ekaterina.karpenko',
        'SetByName': 'Екатерина Карпенко',
        'Stage': {
            'Label': 'Приглашение на интервью',
            'Order': 1,
            '_id': 'recruiterStage'
        },
        'Tags': None,
        'Type': 'InProgress',
        'Weight': 80,
        '_id': 'phoneInterview'
    }
]
I have a function that does this on a dataframe object via the data.iterrows() method:
def to_flat(data, coldict, field_last_upd):
    m_status_history = stc.special_mongo_names['status_history_cand']
    n_statuse_change = coldict['n_statuse_change']['name']
    data[n_statuse_change] = n_status_change(dp.force_take_series(data, m_status_history))
    flat_cols = [x for x in coldict.values() if x['coltype'] == stc.COLTYPE_FLAT]
    old_cols_names = [x['name'] for x in coldict.values() if x['coltype'] == stc.COLTYPE_PREPARATION]
    t_time = time.time()
    t_len = 0
    new_rows = list()
    # The outer loop over rows was missing from the snippet as posted;
    # restored here, since the question says data.iterrows() is used.
    for _, row in data.iterrows():
        for j in range(row[n_statuse_change]):
            t_new_value_row = np.empty(shape=[0, 0])
            for k in range(len(flat_cols)):
                if flat_cols[k]['colsubtype'] == stc.COLSUBTYPE_FLATPATH:
                    new_value = dp.under_value_line(
                        row,
                        path_for_status(j, row[n_statuse_change] - 1, flat_cols[k]['path'])
                    )
                    # Additionally process the date
                    if flat_cols[k]['name'] == coldict['status_set_at']['name']:
                        new_value = dp.iso_date_to_datetime(new_value)
                    if flat_cols[k]['name'] == coldict['status_set_at_mil']['name']:
                        new_value = dp.iso_date_to_miliseconds(new_value)
                    if flat_cols[k]['name'] == coldict['status_stage_order']['name']:
                        try:
                            new_value = int(new_value)
                        except (TypeError, ValueError):
                            pass
                else:
                    if flat_cols[k]['name'] == coldict['status_index']['name']:
                        new_value = j
                t_new_value_row = np.append(t_new_value_row, dp.some_to_null(new_value))
            new_rows.append(np.append(row[old_cols_names].values, t_new_value_row))
    pdb.set_trace()
    res = pd.DataFrame(new_rows, columns=[
        x['name'] for x in coldict.values()
        if x['coltype'] == stc.COLTYPE_FLAT or x['coltype'] == stc.COLTYPE_PREPARATION
    ])
    return res
It takes values from the list of dicts, prepares them to meet ClickHouse's requirements using numpy arrays, and then appends them all together to get a new dataframe with the target values and column names.
I've noticed that if the nested structure is big enough, it starts working much more slowly. I've found an article where different methods of iterating over pandas objects are compared.
It claims that iterating with the .apply() method is much faster, and that vectorization is faster still. But the samples given are pretty trivial and rely on applying the same function to all of the values. Is it possible to iterate over a pandas object in a faster manner while using a variety of functions on different types of data?
I think your first step should be converting your data into a pandas dataframe; then it will be much easier to handle. I couldn't decipher the exact functions you wanted to run, but perhaps my example helps.
import datetime
import pandas as pd
data_dict_array = [
    {
        'Comment': None,
        'Details': None,
        'FunnelId': 'MegafonCompany',
        'IsHot': False,
        'IsReadonly': False,
        'Name': 'Новый',
        'SetAt': datetime.datetime(2018, 4, 20, 10, 39, 55, 475000),
        'SetById': 'ekaterina.karpenko',
        'SetByName': 'Екатерина Карпенко',
        'Stage': {
            'Label': 'Новые',
            'Order': 0,
            '_id': 'newStage'
        },
        'Tags': None,
        'Type': 'Unknown',
        'Weight': 120,
        '_id': 'new'
    },
    {
        'Comment': None,
        'Details': {
            'Name': 'взят в работу',
            '_id': 1
        },
        'FunnelId': 'MegafonCompany',
        'IsHot': False,
        'IsReadonly': False,
        'Name': 'В работе',
        'SetAt': datetime.datetime(2018, 4, 20, 10, 40, 4, 841000),
        'SetById': 'ekaterina.karpenko',
        'SetByName': 'Екатерина Карпенко',
        'Stage': {
            'Label': 'Приглашение на интервью',
            'Order': 1,
            '_id': 'recruiterStage'
        },
        'Tags': None,
        'Type': 'InProgress',
        'Weight': 80,
        '_id': 'phoneInterview'
    }
]
# Converting the data into something pandas can read;
# in particular, flattening the Stage dict
for data_dict in data_dict_array:
    d_temp = data_dict.pop("Stage")
    data_dict["Stage_Label"] = d_temp["Label"]
    data_dict["Stage_Order"] = d_temp["Order"]
    data_dict["Stage_id"] = d_temp["_id"]

df = pd.DataFrame(data_dict_array)

# Let's say I want to set Comment to "cool" if Name is 'В работе'.
# In .loc[], the first argument filters the rows, the second picks the column.
df.loc[df['Name'] == 'В работе', 'Comment'] = "cool"
df
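As a side note, the manual flattening of Stage can also be delegated to pandas: pd.json_normalize expands nested dicts into dotted columns. A sketch over the same data_dict_array (my suggestion, not part of the original answer):
df = pd.json_normalize(data_dict_array)
# Nested dicts become dotted columns: 'Stage.Label', 'Stage.Order',
# 'Stage._id', and likewise 'Details.Name' / 'Details._id' where Details
# is a dict; rows where Details is None get NaN in those columns.
df.loc[df['Name'] == 'В работе', 'Comment'] = "cool"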

How can I recursively add dictionaries in Python from JSON?

Dear Stackoverflow Members,
I have this JSON array, and it consists of the following items (basically):
[
    {
        'Name': 'x',
        'Id': 'y',
        'Unsusedstuff': 'unused',
        'Unsusedstuff2': 'unused2',
        'Children': []
    },
    {
        'Name': 'xx',
        'Id': 'yy',
        'Unsusedstuff': 'unused',
        'Unsusedstuff2': 'unused2',
        'Children': [{
            'Name': 'xyx',
            'Id': 'yxy',
            'Unsusedstuff': 'unused',
            'Unsusedstuff2': 'unused2',
            'Children': []
        }]
    }
]
You get the basic idea. I want to mirror this structure (grabbing just the id and the name) in a Python list using the following code:
names = []

def parseNames(col):
    for x in col:
        if len(x['Children']) > 0:
            names.append({'Name': x['Name'], 'Id': x['Id'], 'Children': parseNames(x['Children'])})
        else:
            return {'Name': x['Name'], 'Id': x['Id']}
But it only seems to return the first 'root' and the first nested folder; it doesn't loop through them all.
How would I be able to fix this?
Greetings,
Mats
The way I read this, you're trying to convert this tree into a tree of nodes which only have Id, Name and Children. In that case, the way I'd think of it is as cleaning nodes.
To clean a node:
Create a node with the Name and Id of the original node.
Set the new node's Children to be the cleaned versions of the original node's children. (This is the recursive call.)
In code, that would be:
def clean_node(node):
    return {
        'Name': node['Name'],
        'Id': node['Id'],
        'Children': [clean_node(child) for child in node['Children']],
    }

>>> print([clean_node(node) for node in data])
[{'Name': 'x', 'Children': [], 'Id': 'y'}, {'Name': 'xx', 'Children': [{'Name': 'xyx', 'Children': [], 'Id': 'yxy'}], 'Id': 'yy'}]
I find it's easier to break recursive problems down like this - trying to use global variables turns simple things very confusing very quickly.
Check this
def parseNames(col):
    for x in col:
        if len(x['Children']) > 0:
            a = [{
                'Name': x['Name'],
                'Id': x['Id'],
                'Children': x['Children'][0]['Children']
            }]
            parseNames(a)
        names.append({'Name': x['Name'], 'Id': x['Id']})
    return names
Output I get is
[{'Name': 'x', 'Id': 'y'}, {'Name': 'xx', 'Id': 'yy'}, {'Name': 'xx', 'Id': 'yy'}]
You can parse a JSON object with this:
import json

response = json.loads(my_string)
Now response is a dictionary with the keys of the JSON object.
