Replacement for dataframe.iterrows() - python

I'm working on a script for migrating data from MongoDB to ClickHouse. Because nested structures aren't supported well enough in ClickHouse, I iterate over each nested structure and flatten it, so that every element of the nested structure becomes a distinct row in the ClickHouse database.
What I do is iterate over a list of dictionaries and take the target values. The structure looks like this:
[
    {
        'Comment': None,
        'Details': None,
        'FunnelId': 'MegafonCompany',
        'IsHot': False,
        'IsReadonly': False,
        'Name': 'Новый',
        'SetAt': datetime.datetime(2018, 4, 20, 10, 39, 55, 475000),
        'SetById': 'ekaterina.karpenko',
        'SetByName': 'Екатерина Карпенко',
        'Stage': {
            'Label': 'Новые',
            'Order': 0,
            '_id': 'newStage'
        },
        'Tags': None,
        'Type': 'Unknown',
        'Weight': 120,
        '_id': 'new'
    },
    {
        'Comment': None,
        'Details': {
            'Name': 'взят в работу',
            '_id': 1
        },
        'FunnelId': 'MegafonCompany',
        'IsHot': False,
        'IsReadonly': False,
        'Name': 'В работе',
        'SetAt': datetime.datetime(2018, 4, 20, 10, 40, 4, 841000),
        'SetById': 'ekaterina.karpenko',
        'SetByName': 'Екатерина Карпенко',
        'Stage': {
            'Label': 'Приглашение на интервью',
            'Order': 1,
            '_id': 'recruiterStage'
        },
        'Tags': None,
        'Type': 'InProgress',
        'Weight': 80,
        '_id': 'phoneInterview'
    }
]
I have a function that does this on a dataframe object via the data.iterrows() method:
import time
import pdb

import numpy as np
import pandas as pd
# stc, dp, n_status_change and path_for_status are the author's project-specific helpers

def to_flat(data, coldict, field_last_upd):
    m_status_history = stc.special_mongo_names['status_history_cand']
    n_statuse_change = coldict['n_statuse_change']['name']
    data[n_statuse_change] = n_status_change(dp.force_take_series(data, m_status_history))
    flat_cols = [x for x in coldict.values() if x['coltype'] == stc.COLTYPE_FLAT]
    old_cols_names = [x['name'] for x in coldict.values() if x['coltype'] == stc.COLTYPE_PREPARATION]
    t_time = time.time()
    t_len = 0
    new_rows = list()
    # per-row loop (restored here; the text above states data.iterrows() is used)
    for _, row in data.iterrows():
        for j in range(row[n_statuse_change]):
            t_new_value_row = np.empty(shape=[0, 0])
            for k in range(len(flat_cols)):
                if flat_cols[k]['colsubtype'] == stc.COLSUBTYPE_FLATPATH:
                    new_value = dp.under_value_line(
                        row,
                        path_for_status(j, row[n_statuse_change] - 1, flat_cols[k]['path'])
                    )
                    # additionally process the date fields
                    if flat_cols[k]['name'] == coldict['status_set_at']['name']:
                        new_value = dp.iso_date_to_datetime(new_value)
                    if flat_cols[k]['name'] == coldict['status_set_at_mil']['name']:
                        new_value = dp.iso_date_to_miliseconds(new_value)
                    if flat_cols[k]['name'] == coldict['status_stage_order']['name']:
                        try:
                            new_value = int(new_value)
                        except (TypeError, ValueError):
                            pass
                else:
                    if flat_cols[k]['name'] == coldict['status_index']['name']:
                        new_value = j
                t_new_value_row = np.append(t_new_value_row, dp.some_to_null(new_value))
            new_rows.append(np.append(row[old_cols_names].values, t_new_value_row))
    pdb.set_trace()
    res = pd.DataFrame(new_rows, columns=[
        x['name'] for x in coldict.values()
        if x['coltype'] == stc.COLTYPE_FLAT or x['coltype'] == stc.COLTYPE_PREPARATION
    ])
    return res
It takes values from the list of dicts, prepares them to meet ClickHouse's requirements using numpy arrays, and then appends them all together to get a new dataframe with the target values and column names.
I've noticed that when the nested structure is big enough, this becomes much slower. I found an article where different methods of iteration over pandas objects are compared.
It claims that iterating with the .apply() method is much faster, and that vectorization is faster still. But the samples given are fairly trivial and apply the same function to all of the values. Is it possible to iterate over a pandas object in a faster manner while applying a variety of functions to different types of data?

I think your first step should be converting your data into a pandas dataframe; then it will be much easier to handle. I couldn't decipher the exact functions you wanted to run, but perhaps my example helps:
import datetime
import pandas as pd
data_dict_array = [
    {
        'Comment': None,
        'Details': None,
        'FunnelId': 'MegafonCompany',
        'IsHot': False,
        'IsReadonly': False,
        'Name': 'Новый',
        'SetAt': datetime.datetime(2018, 4, 20, 10, 39, 55, 475000),
        'SetById': 'ekaterina.karpenko',
        'SetByName': 'Екатерина Карпенко',
        'Stage': {
            'Label': 'Новые',
            'Order': 0,
            '_id': 'newStage'
        },
        'Tags': None,
        'Type': 'Unknown',
        'Weight': 120,
        '_id': 'new'
    },
    {
        'Comment': None,
        'Details': {
            'Name': 'взят в работу',
            '_id': 1
        },
        'FunnelId': 'MegafonCompany',
        'IsHot': False,
        'IsReadonly': False,
        'Name': 'В работе',
        'SetAt': datetime.datetime(2018, 4, 20, 10, 40, 4, 841000),
        'SetById': 'ekaterina.karpenko',
        'SetByName': 'Екатерина Карпенко',
        'Stage': {
            'Label': 'Приглашение на интервью',
            'Order': 1,
            '_id': 'recruiterStage'
        },
        'Tags': None,
        'Type': 'InProgress',
        'Weight': 80,
        '_id': 'phoneInterview'
    }
]
# converting your data into something pandas can read
# in particular, flattening the Stage dict
for data_dict in data_dict_array:
    d_temp = data_dict.pop("Stage")
    data_dict["Stage_Label"] = d_temp["Label"]
    data_dict["Stage_Order"] = d_temp["Order"]
    data_dict["Stage_id"] = d_temp["_id"]

df = pd.DataFrame(data_dict_array)

# let's say I want to set Comment to "cool" if Name is 'В работе'
# in .loc[], the first argument filters the rows, the second picks the column
df.loc[df['Name'] == 'В работе', 'Comment'] = "cool"
df
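Side note: pandas can also do this flattening for you. Here is a minimal sketch using pd.json_normalize, assuming the original data_dict_array from before the Stage-flattening loop above (in very old pandas versions the function lives in pandas.io.json instead):

# json_normalize walks nested dicts and builds one column per leaf,
# joining the key path with the given separator, e.g. Stage_Label
flat_df = pd.json_normalize(data_dict_array, sep="_")
print(flat_df[["Name", "Stage_Label", "Stage_Order", "Stage__id"]])

The double underscore in Stage__id comes from joining "Stage" with the nested key "_id".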

Related

Python: when trying to extract certain keys, how can I avoid a KeyError when the key is missing from some dict elements in the API JSON?

I can successfully extract every column using Python, except the one I need most (order_id), from an API-generated JSON that lists field reps' interactions with clients.
Not all interactions result in orders; there are multiple types of interactions. I know I will need a flag to return 'None', and then an if-statement in my for loop to check whether order_id is null or not. If it is not 'None/null', add it to the list.
I just cannot figure it out, so I would appreciate every bit of help!
This is the code that works:
import requests
import json

r = requests.get(baseurl + endpoint + '?page_number=1' + '&page_size=2', headers=headers)
output = r.json()
interactions_list = []
for item in output['data']:
    columns = {
        'id': item['id'],
        'number': item['user_id'],
        'name': item['user_name'],
    }
    interactions_list.append(columns)
print(interactions_list)
This returns an error-free result:
[{'id': 1, 'number': 6, 'name': 'Johnny'}, {'id': 2, 'number': 7, 'name': 'David'}]
When I include the order_id in the loop:
interactions_list = []
for item in output['data']:
    columns = {
        'id': item['id'],
        'number': item['user_id'],
        'name': item['user_name'],
        'order': item['order_id'],
    }
    interactions_list.append(columns)
It returns:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_17856/1993147086.py in <module>
6 'number': item['user_id'],
7 'name': item['user_name'],
----> 8 'order': item['order_id'],
9 }
10
KeyError: 'order_id'
Use the get method of the dictionary:
columns = {
    'id': item.get('id'),
    'number': item.get('user_id'),
    'name': item.get('user_name'),
    'order': item.get('order_id'),
}
This will set your missing values to None. If you want to choose what the None value is, pass a second argument to get, e.g. item.get('user_name', 'N/A').
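A tiny illustration with a hypothetical item:

item = {'id': 1}
print(item.get('user_name'))         # None  (key missing, implicit default)
print(item.get('user_name', 'N/A'))  # N/A   (key missing, explicit default)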
EDIT: To conditionally add items based on the presence of the order_id
interactions_list = []
for item in output['data']:
    if 'order_id' in item:
        columns = {
            'id': item.get('id'),
            'number': item.get('user_id'),
            'name': item.get('user_name', 'N/A'),
            'order': item.get('order_id'),
        }
        interactions_list.append(columns)
Alternatively, you can use a list comprehension approach, which should be slightly more efficient than using list.append in a loop:
output = {'data': [{'order_id': 'n/a', 'id': '123'}]}
interactions_list = [
    {
        'id': item.get('id'),
        'number': item.get('user_id'),
        'name': item.get('user_name', 'N/A'),
        'order': item.get('order_id'),
    }
    for item in output['data'] if 'order_id' in item
]
# [{'id': '123', 'number': None, 'name': 'N/A', 'order': 'n/a'}]
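One subtlety worth noting: 'order_id' in item only checks that the key exists, whereas an API may also send the key with an explicit null value. If both cases should be skipped, a variant (my suggestion, not from the question) is to test the value instead:

interactions_list = [
    {
        'id': item.get('id'),
        'number': item.get('user_id'),
        'name': item.get('user_name', 'N/A'),
        'order': item['order_id'],
    }
    for item in output['data'] if item.get('order_id') is not None
]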

Python JSON to dataframe [duplicate]

This question already has answers here:
Python - How to convert JSON File to Dataframe
(5 answers)
Closed 1 year ago.
I have JSON in this format:
{
    "projects": [
        {
            "author": {
                "id": 163,
                "name": "MyApp",
                "easy_external_id": null
            },
            "sum_time_entries": 0,
            "sum_estimated_hours": 29,
            "currency": "EUR",
            "custom_fields": [
                {
                    "id": 42,
                    "name": "System",
                    "internal_name": null,
                    "field_format": "string",
                    "value": null
                },
                {
                    "id": 40,
                    "name": "Short describe",
                    "internal_name": null,
                    "field_format": "string",
                    "value": ""
                }
            ]
        }
    ]"total_count": 1772,
    "offset": 0,
    "limit": 1
}
And I don't know how to convert this JSON "completely" to a dataframe. I really just want what's in projects. But when I do this:
df = pd.DataFrame(data['projects'])
although I only get the dataframe from projects, some columns (for example author or custom_fields) remain undecomposed, and I would like to decompose those columns as well.
Can anyone advise?
I expect:

author.id  author.name  author.easy_external_id  sum_time_entries  currency  custom_fields.id  custom_fields.name  etc..
163        MyApp        null                     0                 EUR       42                System              ...
Try:
df = pd.json_normalize(data['projects'])
See the pandas documentation for json_normalize.
I tried it here and it works... I think the problem is in your JSON file (as pasted, there is a missing comma after the projects array). Try doing:
data = {
    'projects': [{
        'author': {
            'id': 163,
            'name': 'MyApp',
            'easy_external_id': None
        },
        'sum_time_entries': 0,
        'sum_estimated_hours': 29,
        'currency': 'EUR',
        'custom_fields': [
            {
                'id': 42,
                'name': 'System',
                'internal_name': None,
                'field_format': 'string',
                'value': None
            },
            {
                'id': 40,
                'name': 'Short describe',
                'internal_name': None,
                'field_format': 'string',
                'value': ''
            }
        ]
    }],
    'total_count': 1772,
    'offset': 0,
    'limit': 1
}
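If you also want each custom field decomposed into its own row, rather than left as a list in one cell, json_normalize's record_path and meta arguments can do that. A sketch using the data dict above (which parent fields to carry along is my assumption):

import pandas as pd

# one row per custom field, carrying selected parent fields along
cf = pd.json_normalize(
    data['projects'],
    record_path='custom_fields',
    meta=[['author', 'id'], ['author', 'name'], 'currency'],
    record_prefix='custom_fields.',
)
print(cf.columns.tolist())
# ['custom_fields.id', 'custom_fields.name', 'custom_fields.internal_name',
#  'custom_fields.field_format', 'custom_fields.value', 'author.id', 'author.name', 'currency']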

Fastest way to get specific key from a dict if it is found

I am currently writing a scraper that reads from an API that returns JSON. Calling response.json() returns a dict, so we can simply use e.g. response["object"] to get the value we want. The current mock data looks like this:
data = {
    'id': 336461,
    'thumbnail': '/images/product/123456?trim&h=80',
    'variants': None,
    'name': 'Testing',
    'data': {
        'Videoutgång': {
            'Typ av gränssnitt': {
                'name': 'Typ av gränssnitt',
                'value': 'PCI Test'
            }
        }
    },
    'stock': {
        'web': 0,
        'supplier': None,
        'displayCap': '50',
        '1': 0,
        'orders': {
            'CL': {
                'ordered': -10,
                'status': 1
            }
        }
    }
}
The catch is that the API response sometimes contains "orders -> CL" and sometimes doesn't, so I need to handle both the happy path and the unhappy path. What is the fastest way to get data out of a dict in both cases?
I have currently done something like this:
data = {
    'id': 336461,
    'thumbnail': '/images/product/123456?trim&h=80',
    'variants': None,
    'name': 'Testing',
    'data': {
        'Videoutgång': {
            'Typ av gränssnitt': {
                'name': 'Typ av gränssnitt',
                'value': 'PCI Test'
            }
        }
    },
    'stock': {
        'web': 0,
        'supplier': None,
        'displayCap': '50',
        '1': 0,
        'orders': {
            'CL': {
                'ordered': -10,
                'status': 1
            }
        }
    }
}
if (
    "stock" in data
    and "orders" in data["stock"]
    and "CL" in data["stock"]["orders"]
    and "status" in data["stock"]["orders"]["CL"]
    and data["stock"]["orders"]["CL"]["status"]
):
    print(f'{data["stock"]["orders"]["CL"]["status"]}: {data["stock"]["orders"]["CL"]["ordered"]}')
1: -10
My question is: what is the fastest way to get the data from a dict when it is present?
Lookups in dictionaries are fast because Python implements them using hash tables.
In Big O terms, dictionary lookups have constant time complexity, O(1). Here is another approach, using the .get() method:
data = {
    'id': 336461,
    'thumbnail': '/images/product/123456?trim&h=80',
    'variants': None,
    'name': 'Testing',
    'data': {
        'Videoutgång': {
            'Typ av gränssnitt': {
                'name': 'Typ av gränssnitt',
                'value': 'PCI Test'
            }
        }
    },
    'stock': {
        'web': 0,
        'supplier': None,
        'displayCap': '50',
        '1': 0,
        'orders': {
            'CL': {
                'ordered': -10,
                'status': 1
            }
        }
    }
}
if data.get('stock', {}).get('orders', {}).get('CL'):
    print(f'{data["stock"]["orders"]["CL"]["status"]}: {data["stock"]["orders"]["CL"]["ordered"]}')
Here is a nice writeup on lookups in Python, with lists and dictionaries as examples.
I take your point. For this question, since your stock has just a few keys, it is hard to say whether the .get() method will work faster than a loop. If your dictionary had many more items then .get() would certainly be much faster, but with so few keys it will not make much difference.
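If chains like this occur in many places, a small helper keeps the lookups readable. A minimal sketch (the deep_get name and the try/except approach are my own, not from the question):

from functools import reduce

def deep_get(d, *keys, default=None):
    # walk the chain of keys, returning default if any level is missing
    try:
        return reduce(lambda acc, key: acc[key], keys, d)
    except (KeyError, TypeError):
        return default

status = deep_get(data, 'stock', 'orders', 'CL', 'status')
if status is not None:
    print(f"{status}: {deep_get(data, 'stock', 'orders', 'CL', 'ordered')}")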

Expand a dataframe column to many

I have read some posts but I have not been able to get what I want.
I have a dataframe with ~4k rows and a few columns which I exported from Infoblox (a DNS server).
One of the columns holds the DHCP attributes, and I would like to expand it into separate columns.
This is my df (Excel screenshot):
[excel screenshot]
One of the columns is a list of dicts with all the options; this is a sanitized example:
[
    {"name": "tftp-server-name", "num": 66, "value": "10.70.0.27", "vendor_class": "DHCP"},
    {"name": "bootfile-name", "num": 67, "value": "pxelinux.0", "vendor_class": "DHCP"},
    {"name": "dhcp-lease-time", "num": 51, "use_option": False, "value": "21600", "vendor_class": "DHCP"},
    {"name": "domain-name-servers", "num": 6, "use_option": False, "value": "10.71.73.143,10.71.74.163", "vendor_class": "DHCP"},
    {"name": "domain-name", "num": 15, "use_option": False, "value": "example.com", "vendor_class": "DHCP"},
    {"name": "routers", "num": 3, "use_option": True, "value": "10.70.1.200", "vendor_class": "DHCP"},
]
I would like to expand this column into several columns on the same row, using "name" as the dataframe column and "value" as the row value.
This would be the goal:
tftp-server-name voip-tftp-server dhcp-lease-time domain-name-server domain-name routers
0 10.71.69.58 10.71.69.58,10.71.69.59 86400 10.71.73.143,10.71.74.163 example.com 10.70.12.254
In order to get a global df with all the information, I guess I should create a new df that keeps the index so I can merge it with the primary one, but I wasn't able to do it.
I have tried expand, append, explode...
Please, could you help me?
Thank you so much for your solutions (to both).
I got it working; I am adding the complete solution in case someone needs it (there may be a more pythonic way, but it works):
def formato(df):
    opciones = df['options']
    df_int = pd.DataFrame()
    for i in opciones:
        # note: DataFrame.append is deprecated in recent pandas; pd.concat is the modern equivalent
        df_int = df_int.append(pd.DataFrame(i).set_index("name")[["value"]].T.reset_index(drop=True))
    df_int.index = range(len(df_int.index))
    df_global = pd.merge(df, df_int, left_index=True, right_index=True, how="inner")
    df_global = df_global.rename(columns={"comment": "Comentario", "end_addr": "IP Fin", "network": "Red",
                                          "start_addr": "IP Inicio", "disable": "Deshabilitado"})
    df_global = df_global[["Red", "Comentario", "IP Inicio", "IP Fin", "dhcp-lease-time",
                           "domain-name-servers", "domain-name", "routers", "tftp-server-name", "bootfile-name",
                           "voip-tftp-server", "wdm-server-ip-address", "ftp-file-server", "vendor-encapsulated-options"]]
    return df_global
Here is one solution:
import pandas as pd
data = [{'name': 'tftp-server-name', 'num': 66, 'value': '10.70.0.27', 'vendor_class': 'DHCP'}, {'name': 'bootfile-name', 'num': 67, 'value': 'pxelinux.0', 'vendor_class': 'DHCP'}, {'name': 'dhcp-lease-time', 'num': 51, 'use_option': False, 'value': '21600', 'vendor_class': 'DHCP'}, {'name': 'domain-name-servers', 'num': 6, 'use_option': False, 'value': '10.71.73.143,10.71.74.163', 'vendor_class': 'DHCP'}, {'name': 'domain-name', 'num': 15, 'use_option': False, 'value': 'example.com', 'vendor_class': 'DHCP'}, {'name': 'routers', 'num': 3, 'use_option': True, 'value': '10.70.1.200', 'vendor_class': 'DHCP'}]
df = pd.DataFrame(data).set_index("name")[["value"]].T.reset_index(drop=True)
output:
name tftp-server-name bootfile-name dhcp-lease-time domain-name-servers domain-name routers
0 10.70.0.27 pxelinux.0 21600 10.71.73.143,10.71.74.163 example.com 10.70.1.200
You can use json_normalize as follows:
# note: in recent pandas, json_normalize is available directly as pd.json_normalize
from pandas.io.json import json_normalize
import ast
import pandas as pd

def extract_dict(ld):
    res = {}
    # the column holds the list of dicts as a string, so parse it first
    for d in ast.literal_eval(ld):
        res[d['name']] = d['value']
    return res

# load dataframe (I made a dummy, replace it with read from file)
df = pd.DataFrame.from_dict({'temp': ['temp'], 'option': ['''[{'name': 'tftp-server-name', 'num': 66, 'value': '10.70.0.27', 'vendor_class': 'DHCP'}, {'name': 'bootfile-name', 'num': 67, 'value': 'pxelinux.0', 'vendor_class': 'DHCP'}, {'name': 'dhcp-lease-time', 'num': 51, 'use_option': False, 'value': '21600', 'vendor_class': 'DHCP'}, {'name': 'domain-name-servers', 'num': 6, 'use_option': False, 'value': '10.71.73.143,10.71.74.163', 'vendor_class': 'DHCP'}, {'name': 'domain-name', 'num': 15, 'use_option': False, 'value': 'example.com', 'vendor_class': 'DHCP'}, {'name': 'routers', 'num': 3, 'use_option': True, 'value': '10.70.1.200', 'vendor_class': 'DHCP'}]''']})
B = json_normalize(df['option'].apply(extract_dict).tolist())
print(B)
The output looks like this:
tftp-server-name bootfile-name dhcp-lease-time domain-name-servers domain-name routers
0 10.70.0.27 pxelinux.0 21600 10.71.73.143,10.71.74.163 example.com 10.70.1.200
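To attach these expanded columns back to the original dataframe, which is what the question ultimately wants, a join on the shared default integer index is enough. A sketch, assuming df and B come from the snippet above and are in the same row order:

df_full = df.drop(columns='option').join(B)
print(df_full)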

returning data from a dataframe using jsonify

I have a web service, and the pattern for returning data is basically to get the required data into a dataframe and then use the code below to return it.
return jsonify([{'id': row.id,
                 'name': row.name,
                 'age': row.age
                 } for row in data.itertuples()])
This works fine. However, now that I have a dataframe with 30-odd columns, is there a more efficient way of doing this? I don't want to copy the above and write 30 lines of 'some_name': row.some_name.
You could iterate over the attributes of your rows and return the ones that aren't special attributes (ones that start with _) or functions.
def get_attributes(obj):
    return {
        attr: getattr(obj, attr) for attr in dir(obj)
        if not attr.startswith('_') and not callable(getattr(obj, attr))
    }
Example usage:
data = pd.DataFrame(
    {
        'id': [0, 1],
        'name': ['name_1', 'name_2'],
        'age': [16, 32]
    },
    index=['dog', 'hawk']
)
print([
    get_attributes(row)
    for row in data.itertuples()
])
Output:
[{'Index': 'dog', 'age': 16, 'id': 0, 'name': 'name_1'},
{'Index': 'hawk', 'age': 32, 'id': 1, 'name': 'name_2'}]
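For completeness: pandas can produce this list-of-dicts shape directly with DataFrame.to_dict, which avoids the per-attribute reflection. A sketch (note that, unlike the itertuples approach above, the index is dropped unless you reset_index first):

records = data.reset_index().to_dict(orient='records')
# [{'index': 'dog', 'id': 0, 'name': 'name_1', 'age': 16},
#  {'index': 'hawk', 'id': 1, 'name': 'name_2', 'age': 32}]
return jsonify(records)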
