Expand a dataframe column into many - Python

I have read some posts but have not been able to get what I want.
I have a dataframe with ~4k rows and a few columns, exported from Infoblox (a DNS server).
One of the columns holds DHCP attributes, and I would like to expand it into separate columns.
This is my df (I attach a screenshot from Excel):
[Excel screenshot]
The column in question contains a list of dictionaries, one per DHCP option; this is an example (sanitized):
[
{"name": "tftp-server-name", "num": 66, "value": "10.70.0.27", "vendor_class": "DHCP"},
{"name": "bootfile-name", "num": 67, "value": "pxelinux.0", "vendor_class": "DHCP"},
{"name": "dhcp-lease-time", "num": 51, "use_option": False, "value": "21600", "vendor_class": "DHCP"},
{"name": "domain-name-servers", "num": 6, "use_option": False, "value": "10.71.73.143,10.71.74.163", "vendor_class": "DHCP"},
{"name": "domain-name", "num": 15, "use_option": False, "value": "example.com", "vendor_class": "DHCP"},
{"name": "routers", "num": 3, "use_option": True, "value": "10.70.1.200", "vendor_class": "DHCP"},
]
I would like to expand this column into several columns on the same row, using "name" as the column name and "value" as the cell value.
This would be the goal:
tftp-server-name voip-tftp-server dhcp-lease-time domain-name-servers domain-name routers
0 10.71.69.58 10.71.69.58,10.71.69.59 86400 10.71.73.143,10.71.74.163 example.com 10.70.12.254
In order to have a global df with all the information, I guess I should create a new df that keeps the index so I can merge it with the primary one, but I wasn't able to do it.
I have tried expand, append, explode...
Please, could you help me?
Thank you so much for your solutions (to both answers).
I got it working. I add the complete solution below, in case someone needs it (there may be a more Pythonic way, but it works):
def formato(df):
    opciones = df['options']
    df_int = pd.DataFrame()
    for i in opciones:
        # each options list becomes a single-row frame, columns named after the options
        # (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
        fila = pd.DataFrame(i).set_index("name")[["value"]].T.reset_index(drop=True)
        df_int = pd.concat([df_int, fila], ignore_index=True)
    # merge the expanded columns back onto the original frame by row position
    df_global = pd.merge(df, df_int, left_index=True, right_index=True, how="inner")
    # note: rename df_global (not df), otherwise the merged option columns are lost
    df_global = df_global.rename(columns={"comment": "Comentario", "end_addr": "IP Fin",
                                          "network": "Red", "start_addr": "IP Inicio",
                                          "disable": "Deshabilitado"})
    df_global = df_global[["Red", "Comentario", "IP Inicio", "IP Fin", "dhcp-lease-time",
                           "domain-name-servers", "domain-name", "routers", "tftp-server-name",
                           "bootfile-name", "voip-tftp-server", "wdm-server-ip-address",
                           "ftp-file-server", "vendor-encapsulated-options"]]
    return df_global

Here is one solution:
import pandas as pd

data = [
    {"name": "tftp-server-name", "num": 66, "value": "10.70.0.27", "vendor_class": "DHCP"},
    {"name": "bootfile-name", "num": 67, "value": "pxelinux.0", "vendor_class": "DHCP"},
    {"name": "dhcp-lease-time", "num": 51, "use_option": False, "value": "21600", "vendor_class": "DHCP"},
    {"name": "domain-name-servers", "num": 6, "use_option": False, "value": "10.71.73.143,10.71.74.163", "vendor_class": "DHCP"},
    {"name": "domain-name", "num": 15, "use_option": False, "value": "example.com", "vendor_class": "DHCP"},
    {"name": "routers", "num": 3, "use_option": True, "value": "10.70.1.200", "vendor_class": "DHCP"},
]

# keep only the "value" column, indexed by "name", then transpose to a single row
df = pd.DataFrame(data).set_index("name")[["value"]].T.reset_index(drop=True)
output:
name tftp-server-name bootfile-name dhcp-lease-time domain-name-servers domain-name routers
0 10.70.0.27 pxelinux.0 21600 10.71.73.143,10.71.74.163 example.com 10.70.1.200
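To scale this to every row of the original ~4k-row frame and attach the result back to it (the end goal of the question), here is a minimal sketch, assuming the list-of-dicts column in df is named options as in the asker's code:

import pandas as pd

# Build one {name: value} dict per row, turn them into a wide frame aligned on
# the original index, then join the new columns onto the original dataframe.
expanded = pd.DataFrame(
    [{opt["name"]: opt["value"] for opt in opts} for opts in df["options"]],
    index=df.index,
)
df_global = df.join(expanded)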

You can use json_normalize as follows:
import ast
import pandas as pd

def extract_dict(ld):
    # parse the stringified list of dicts and map each option name to its value
    res = {}
    for d in ast.literal_eval(ld):
        res[d['name']] = d['value']
    return res

# load dataframe (I made a dummy, replace it with read from file)
df = pd.DataFrame.from_dict({'temp': ['temp'], 'option': ['''[{'name': 'tftp-server-name', 'num': 66, 'value': '10.70.0.27', 'vendor_class': 'DHCP'}, {'name': 'bootfile-name', 'num': 67, 'value': 'pxelinux.0', 'vendor_class': 'DHCP'}, {'name': 'dhcp-lease-time', 'num': 51, 'use_option': False, 'value': '21600', 'vendor_class': 'DHCP'}, {'name': 'domain-name-servers', 'num': 6, 'use_option': False, 'value': '10.71.73.143,10.71.74.163', 'vendor_class': 'DHCP'}, {'name': 'domain-name', 'num': 15, 'use_option': False, 'value': 'example.com', 'vendor_class': 'DHCP'}, {'name': 'routers', 'num': 3, 'use_option': True, 'value': '10.70.1.200', 'vendor_class': 'DHCP'}]''']})
# json_normalize now lives in the top-level pandas namespace
# (from pandas.io.json import json_normalize is deprecated)
B = pd.json_normalize(df['option'].apply(extract_dict).tolist())
print(B)
The output looks like this:
tftp-server-name bootfile-name dhcp-lease-time domain-name-servers domain-name routers
0 10.70.0.27 pxelinux.0 21600 10.71.73.143,10.71.74.163 example.com 10.70.1.200

Related

Convert nested dictionary to pandas dataframe in python

All, I have the following nested dictionary (from a JSON API response). I would like to access the individual items by means of a pandas dataframe.
The dictionary looks as follows:
{'pagination': {'limit': 2, 'offset': 0, 'count': 2, 'total': 1474969}, 'data': [{'flight_date': '2022-10-12', 'flight_status': 'active', 'departure': {'airport': 'Tullamarine', 'timezone': 'Australia/Melbourne', 'iata': 'MEL', 'icao': 'YMML', 'terminal': '2', 'gate': '16', 'delay': 20, 'scheduled': '2022-10-12T00:50:00+00:00', 'estimated': '2022-10-12T00:50:00+00:00', 'actual': '2022-10-12T01:09:00+00:00', 'estimated_runway': '2022-10-12T01:09:00+00:00', 'actual_runway': '2022-10-12T01:09:00+00:00'}, 'arrival': {'airport': 'Hong Kong International', 'timezone': 'Asia/Hong_Kong', 'iata': 'HKG', 'icao': 'VHHH', 'terminal': '1', 'gate': None, 'baggage': None, 'delay': None, 'scheduled': '2022-10-12T06:55:00+00:00', 'estimated': '2022-10-12T06:55:00+00:00', 'actual': None, 'estimated_runway': None, 'actual_runway': None}, 'airline': {'name': 'Finnair', 'iata': 'AY', 'icao': 'FIN'}, 'flight': {'number': '5844', 'iata': 'AY5844', 'icao': 'FIN5844', 'codeshared': {'airline_name': 'cathay pacific', 'airline_iata': 'cx', 'airline_icao': 'cpa', 'flight_number': '178', 'flight_iata': 'cx178', 'flight_icao': 'cpa178'}}, 'aircraft': None, 'live': None}, {'flight_date': '2022-10-12', 'flight_status': 'active', 'departure': {'airport': 'Tullamarine', 'timezone': 'Australia/Melbourne', 'iata': 'MEL', 'icao': 'YMML', 'terminal': '2', 'gate': '5', 'delay': 25, 'scheduled': '2022-10-12T00:30:00+00:00', 'estimated': '2022-10-12T00:30:00+00:00', 'actual': '2022-10-12T00:55:00+00:00', 'estimated_runway': '2022-10-12T00:55:00+00:00', 'actual_runway': '2022-10-12T00:55:00+00:00'}, 'arrival': {'airport': 'Kuala Lumpur International Airport (klia)', 'timezone': 'Asia/Kuala_Lumpur', 'iata': 'KUL', 'icao': 'WMKK', 'terminal': '1', 'gate': None, 'baggage': None, 'delay': 3, 'scheduled': '2022-10-12T06:00:00+00:00', 'estimated': '2022-10-12T06:00:00+00:00', 'actual': None, 'estimated_runway': None, 'actual_runway': None}, 'airline': {'name': 'KLM', 'iata': 'KL', 'icao': 'KLM'}, 'flight': {'number': '4109', 'iata': 'KL4109', 'icao': 'KLM4109', 'codeshared': {'airline_name': 'malaysia airlines', 'airline_iata': 'mh', 'airline_icao': 'mas', 'flight_number': '128', 'flight_iata': 'mh128', 'flight_icao': 'mas128'}}, 'aircraft': None, 'live': None}]}
The dictionary is stored under the variable name api_response. I am using the following code to convert to a dataframe as described in https://sparkbyexamples.com/pandas/pandas-convert-json-to-dataframe/
My code:
import boto3
import json
from datetime import datetime
import calendar
import random
import time
import requests
import pandas as pd

aircraftdata = ''
params = {
    'access_key': 'KEY',
    'limit': '2',
    'flight_status': 'active'
}
url = "http://api.aviationstack.com/v1/flights"
api_result = requests.get(url, params)
api_statuscode = api_result.status_code
api_response = api_result.json()
print(type(api_response))  # dictionary
print(api_response)
df = pd.DataFrame.from_dict(api_response, orient='index')
This yields the following error:
AttributeError: 'list' object has no attribute 'items'
I would like to obtain a dataframe with the live data for each flight:
flight_iata, live_latitude, live_longitude
AA1004, 36.2, -106.8
df = pd.json_normalize(api_response["data"])
df = df[df.loc[:, df.columns.str.contains("live", case=False)].columns]
print(df)
live.updated live.latitude live.longitude live.altitude live.direction live.speed_horizontal live.speed_vertical live.is_ground
0 2019-12-12T10:00:00+00:00 36.2856 -106.807 8846.82 114.34 894.348 1.188 False
If you want to drop live. from the headers you can:
df.columns = df.columns.str.split(".").str[-1]
print(df)
updated latitude longitude altitude direction speed_horizontal speed_vertical is_ground
0 2019-12-12T10:00:00+00:00 36.2856 -106.807 8846.82 114.34 894.348 1.188 False
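As a side note, a shorter equivalent of the substring-based column selection above is DataFrame.filter (which matches case-sensitively, but that is fine for these lowercase live. column names):

# Same selection in one call: keep only columns whose name contains "live"
df = pd.json_normalize(api_response["data"]).filter(like="live")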
Considering the desired output, let's say that the dictionary dic looks like the following
dic = {
"pagination": {
"limit": 100,
"offset": 0,
"count": 100,
"total": 1669022
},
"data": [
{
"flight_date": "2019-12-12",
"flight_status": "active",
"departure": {
"airport": "San Francisco International",
"timezone": "America/Los_Angeles",
"iata": "SFO",
"icao": "KSFO",
"terminal": "2",
"gate": "D11",
"delay": 13,
"scheduled": "2019-12-12T04:20:00+00:00",
"estimated": "2019-12-12T04:20:00+00:00",
"actual": "2019-12-12T04:20:13+00:00",
"estimated_runway": "2019-12-12T04:20:13+00:00",
"actual_runway": "2019-12-12T04:20:13+00:00"
},
"arrival": {
"airport": "Dallas/Fort Worth International",
"timezone": "America/Chicago",
"iata": "DFW",
"icao": "KDFW",
"terminal": "A",
"gate": "A22",
"baggage": "A17",
"delay": 0,
"scheduled": "2019-12-12T04:20:00+00:00",
"estimated": "2019-12-12T04:20:00+00:00",
"actual": None,
"estimated_runway": None,
"actual_runway": None
},
"airline": {
"name": "American Airlines",
"iata": "AA",
"icao": "AAL"
},
"flight": {
"number": "1004",
"iata": "AA1004",
"icao": "AAL1004",
"codeshared": None
},
"aircraft": {
"registration": "N160AN",
"iata": "A321",
"icao": "A321",
"icao24": "A0F1BB"
},
"live": {
"updated": "2019-12-12T10:00:00+00:00",
"latitude": 36.28560000,
"longitude": -106.80700000,
"altitude": 8846.820,
"direction": 114.340,
"speed_horizontal": 894.348,
"speed_vertical": 1.188,
"is_ground": False
}
}
]
}
In order to obtain the desired output, one can start by converting the dictionary to a dataframe with pandas.DataFrame
df = pd.DataFrame(dic['data'], columns=['flight', 'live'])
[Out]:
flight live
0 {'number': '1004', 'iata': 'AA1004', 'icao': '... {'updated': '2019-12-12T10:00:00+00:00', 'lati...
Then, one can use .apply() with a custom lambda function as follows
df = df[['flight', 'live']].apply(lambda x: pd.Series([x['flight']['iata'], x['live']['latitude'], x['live']['longitude'], x['live']['altitude']]), axis=1)
[Out]:
0 1 2 3
0 AA1004 36.2856 -106.807 8846.82
Finally, the only thing missing is the name of the columns. And in order to change it, one can do the following
df.columns = ['flight_iata', 'live_latitude', 'live_longitude', 'live_altitude']
[Out]:
flight_iata live_latitude live_longitude live_altitude
0 AA1004 36.2856 -106.807 8846.82
And that is the desired output.
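For comparison, here is a sketch of the same result using pd.json_normalize, which flattens the nested dicts into dotted column names in a single call:

# Flatten every nested dict, then select and rename the columns of interest.
flat = pd.json_normalize(dic["data"])
df = flat[["flight.iata", "live.latitude", "live.longitude", "live.altitude"]]
df.columns = ["flight_iata", "live_latitude", "live_longitude", "live_altitude"]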

Creating Pandas DataFrame from SmartSheet API (nested, awkward, JSON)

I'm trying to connect to my office's SmartSheet API via Python to create some performance tracking dashboards that utilize data outside of SmartSheet. All I want to do is create a simple DataFrame where fields reflect columnId and cell values reflect the displayValue key in the Smartsheet dictionary. I am doing this using a standard API requests.get rather than SmartSheet's API documentation because I've found the latter less easy to work with.
The table (sample) is set up as:
Number Letter Name
1 A Joe
2 B Jim
3 C Jon
The JSON syntax from the sheet GET request is:
{'id': 339338304219012,
'name': 'Sample Smartsheet',
'version': 1,
'totalRowCount': 3,
'accessLevel': 'OWNER',
'effectiveAttachmentOptions': ['GOOGLE_DRIVE',
'EVERNOTE',
'DROPBOX',
'ONEDRIVE',
'LINK',
'FILE',
'BOX_COM',
'EGNYTE'],
'ganttEnabled': False,
'dependenciesEnabled': False,
'resourceManagementEnabled': False,
'cellImageUploadEnabled': True,
'userSettings': {'criticalPathEnabled': False, 'displaySummaryTasks': True},
'userPermissions': {'summaryPermissions': 'ADMIN'},
'hasSummaryFields': False,
'permalink': 'https://app.smartsheet.com/sheets/5vxMCJQhMV7VFFPMVfJgg2hX79rj3fXgVGG8fp61',
'createdAt': '2020-02-13T16:32:02Z',
'modifiedAt': '2020-02-14T13:15:18Z',
'isMultiPicklistEnabled': True,
'columns': [{'id': 6273865019090820,
'version': 0,
'index': 0,
'title': 'Number',
'type': 'TEXT_NUMBER',
'primary': True,
'validation': False,
'width': 150},
{'id': 4022065205405572,
'version': 0,
'index': 1,
'title': 'Letter',
'type': 'TEXT_NUMBER',
'validation': False,
'width': 150},
{'id': 8525664832776068,
'version': 0,
'index': 2,
'title': 'Name',
'type': 'TEXT_NUMBER',
'validation': False,
'width': 150}],
'rows': [{'id': 8660990817003396,
'rowNumber': 1,
'expanded': True,
'createdAt': '2020-02-14T13:15:18Z',
'modifiedAt': '2020-02-14T13:15:18Z',
'cells': [{'columnId': 6273865019090820, 'value': 1.0, 'displayValue': '1'},
{'columnId': 4022065205405572, 'value': 'A', 'displayValue': 'A'},
{'columnId': 8525664832776068, 'value': 'Joe', 'displayValue': 'Joe'}]},
{'id': 498216492394372,
'rowNumber': 2,
'siblingId': 8660990817003396,
'expanded': True,
'createdAt': '2020-02-14T13:15:18Z',
'modifiedAt': '2020-02-14T13:15:18Z',
'cells': [{'columnId': 6273865019090820, 'value': 2.0, 'displayValue': '2'},
{'columnId': 4022065205405572, 'value': 'B', 'displayValue': 'B'},
{'columnId': 8525664832776068, 'value': 'Jim', 'displayValue': 'Jim'}]},
{'id': 5001816119764868,
'rowNumber': 3,
'siblingId': 498216492394372,
'expanded': True,
'createdAt': '2020-02-14T13:15:18Z',
'modifiedAt': '2020-02-14T13:15:18Z',
'cells': [{'columnId': 6273865019090820, 'value': 3.0, 'displayValue': '3'},
{'columnId': 4022065205405572, 'value': 'C', 'displayValue': 'C'},
{'columnId': 8525664832776068, 'value': 'Jon', 'displayValue': 'Jon'}]}]}
Here are the two ways I've approached the problem:
INPUT:
import pandas as pd

samplej = sample.json()
# pd.json_normalize replaces the deprecated pandas.io.json import
s_rows = pd.json_normalize(data=samplej['rows'], record_path='cells', meta=['id', 'rowNumber'])
s_rows
OUTPUT:
A DataFrame with columnId, value, displayValue, id, and rowNumber as their own fields.
If I could figure out how to transpose this data in the right way I could probably make it work, but that seems incredibly complicated.
INPUT:
samplej = sample.json()
cellist = []

def get_cells():
    srows = samplej['rows']
    for s_cells in srows:
        scells = s_cells['cells']
        cellist.append(scells)

get_cells()
celldf = pd.DataFrame(cellist)
celldf
OUTPUT:
This returns a DataFrame with the correct number of columns and rows, but each cell is populated with a dictionary that looks like
In [14]:
celldf.loc[1,1]
Out [14]:
{'columnId': 4022065205405572, 'value': 'B', 'displayValue': 'B'}
If there was a way to remove everything except the value corresponding to the displayValue key in every cell, this would probably solve my problem. Again, though, it seems weirdly complicated.
I'm fairly new to Python and working with APIs, so there may be a simple way to address the problem that I'm overlooking. Or, if you have a suggestion for approaching the possible solutions I outlined above, I'm all ears. Thanks for your help!
You must make use of the columns field:
colnames = {x['id']: x['title'] for x in samplej['columns']}
columns = [x['title'] for x in samplej['columns']]
cellist = [
    {colnames[scells['columnId']]: scells['displayValue'] for scells in s_cells['cells']}
    for s_cells in samplej['rows']
]
celldf = pd.DataFrame(cellist, columns=columns)
This gives as expected:
Number Letter Name
0 1 A Joe
1 2 B Jim
2 3 C Jon
If some cells could contain only a columnId but no displayValue field, scells['displayValue'] should be replaced in above code with scells.get('displayValue', defaultValue), where defaultValue could be None, np.nan or any other relevant default.
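A sketch of that defensive variant, using None as the default:

# Same cell extraction as above, but tolerant of cells missing 'displayValue'.
cellist = [
    {colnames[cell['columnId']]: cell.get('displayValue')  # None when absent
     for cell in s_cells['cells']}
    for s_cells in samplej['rows']
]
celldf = pd.DataFrame(cellist, columns=columns)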

Replacement for dataframe.iterrows()

I'm working on a script for migrating data from MongoDB to ClickHouse. Because nested structures aren't well supported in ClickHouse, I iterate over the nested structure and flatten it, so that every element of the nested structure becomes a distinct row in the ClickHouse database.
What I do is iterate over a list of dictionaries and take the target values. The structure looks like this:
[
{
'Comment': None,
'Details': None,
'FunnelId': 'MegafonCompany',
'IsHot': False,
'IsReadonly': False,
'Name': 'Новый',
'SetAt': datetime.datetime(2018, 4, 20, 10, 39, 55, 475000),
'SetById': 'ekaterina.karpenko',
'SetByName': 'Екатерина Карпенко',
'Stage': {
'Label': 'Новые',
'Order': 0,
'_id': 'newStage'
},
'Tags': None,
'Type': 'Unknown',
'Weight': 120,
'_id': 'new'
},
{
'Comment': None,
'Details': {
'Name': 'взят в работу',
'_id': 1
},
'FunnelId': 'MegafonCompany',
'IsHot': False,
'IsReadonly': False,
'Name': 'В работе',
'SetAt': datetime.datetime(2018, 4, 20, 10, 40, 4, 841000),
'SetById': 'ekaterina.karpenko',
'SetByName': 'Екатерина Карпенко',
'Stage': {
'Label': 'Приглашение на интервью',
'Order': 1,
'_id': 'recruiterStage'
},
'Tags': None,
'Type': 'InProgress',
'Weight': 80,
'_id': 'phoneInterview'
}
]
I have a function that does this on a dataframe object via the data.iterrows() method:
def to_flat(data, coldict, field_last_upd):
    m_status_history = stc.special_mongo_names['status_history_cand']
    n_statuse_change = coldict['n_statuse_change']['name']
    data[n_statuse_change] = n_status_change(dp.force_take_series(data, m_status_history))
    flat_cols = [x for x in coldict.values() if x['coltype'] == stc.COLTYPE_FLAT]
    old_cols_names = [x['name'] for x in coldict.values() if x['coltype'] == stc.COLTYPE_PREPARATION]
    t_time = time.time()
    t_len = 0
    new_rows = list()
    for index, row in data.iterrows():  # the iterrows() loop in question (restored; missing from the snippet as posted)
        for j in range(row[n_statuse_change]):
            t_new_value_row = np.empty(shape=[0, 0])
            for k in range(len(flat_cols)):
                if flat_cols[k]['colsubtype'] == stc.COLSUBTYPE_FLATPATH:
                    new_value = dp.under_value_line(
                        row,
                        path_for_status(j, row[n_statuse_change] - 1, flat_cols[k]['path'])
                    )
                    # additionally process the date fields
                    if flat_cols[k]['name'] == coldict['status_set_at']['name']:
                        new_value = dp.iso_date_to_datetime(new_value)
                    if flat_cols[k]['name'] == coldict['status_set_at_mil']['name']:
                        new_value = dp.iso_date_to_miliseconds(new_value)
                    if flat_cols[k]['name'] == coldict['status_stage_order']['name']:
                        try:
                            new_value = int(new_value)
                        except:
                            new_value = new_value
                else:
                    if flat_cols[k]['name'] == coldict['status_index']['name']:
                        new_value = j
                t_new_value_row = np.append(t_new_value_row, dp.some_to_null(new_value))
            new_rows.append(np.append(row[old_cols_names].values, t_new_value_row))
    res = pd.DataFrame(new_rows, columns=[
        x['name'] for x in coldict.values()
        if x['coltype'] == stc.COLTYPE_FLAT or x['coltype'] == stc.COLTYPE_PREPARATION
    ])
    return res
It takes values from the list of dicts, prepares them to meet ClickHouse's requirements using numpy arrays, and then appends them all together into a new dataframe with the target values and column names.
I've noticed that when the nested structure is big enough, it starts to run much slower. I found an article where different methods of iteration in Python are compared.
It claims that iterating via the .apply() method is much faster, and vectorization faster still. But the samples given are fairly trivial and rely on applying the same function to all of the values. Is it possible to iterate over a pandas object in a faster manner, while using a variety of functions on different types of data?
I think your first step should be converting your data into a pandas dataframe; then it will be much easier to handle. I couldn't decipher the exact functions you wanted to run, but perhaps my example helps.
import datetime
import pandas as pd
data_dict_array = [
{
'Comment': None,
'Details': None,
'FunnelId': 'MegafonCompany',
'IsHot': False,
'IsReadonly': False,
'Name': 'Новый',
'SetAt': datetime.datetime(2018, 4, 20, 10, 39, 55, 475000),
'SetById': 'ekaterina.karpenko',
'SetByName': 'Екатерина Карпенко',
'Stage': {
'Label': 'Новые',
'Order': 0,
'_id': 'newStage'
},
'Tags': None,
'Type': 'Unknown',
'Weight': 120,
'_id': 'new'
},
{
'Comment': None,
'Details': {
'Name': 'взят в работу',
'_id': 1
},
'FunnelId': 'MegafonCompany',
'IsHot': False,
'IsReadonly': False,
'Name': 'В работе',
'SetAt': datetime.datetime(2018, 4, 20, 10, 40, 4, 841000),
'SetById': 'ekaterina.karpenko',
'SetByName': 'Екатерина Карпенко',
'Stage': {
'Label': 'Приглашение на интервью',
'Order': 1,
'_id': 'recruiterStage'
},
'Tags': None,
'Type': 'InProgress',
'Weight': 80,
'_id': 'phoneInterview'
}
]
# converting your data into something pandas can read
# in particular, flattening the Stage dict
for data_dict in data_dict_array:
    d_temp = data_dict.pop("Stage")
    data_dict["Stage_Label"] = d_temp["Label"]
    data_dict["Stage_Order"] = d_temp["Order"]
    data_dict["Stage_id"] = d_temp["_id"]

df = pd.DataFrame(data_dict_array)

# let's say I want to set Comment to "cool" if Name is 'В работе'
# in .loc[], the first argument filters the rows, the second picks the column
df.loc[df['Name'] == 'В работе', 'Comment'] = "cool"
df
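As an aside, pd.json_normalize can do the flattening step in one call, which may also help with the speed concern in the question. A sketch, applied to the original (un-mutated) records, before the Stage-popping loop above:

import pandas as pd

# json_normalize expands nested dicts into separate columns in one pass;
# sep="_" yields Stage_Label, Stage_Order, Stage__id instead of dotted names.
df = pd.json_normalize(data_dict_array, sep="_")
# Caveat: keys that are None in some records and dicts in others (e.g. 'Details')
# produce a mix of 'Details' and 'Details_Name' columns, so some cleanup may remain.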

Python manipulating json, lists, and dictionaries

Sorry for the length, but I tried to be complete.
I'm trying to process the following data (only a small sample from a much larger JSON file, same structure):
{
  "count": 394,
  "status": "ok",
  "data": [
    {
      "md5": "cd042ba78d0810d86755136609793d6d",
      "threatscore": 90,
      "threatlevel": 0,
      "avdetect": 0,
      "vxfamily": "",
      "domains": [
        "dynamicflakesdemo.com",
        "www.bountifulbreast.co.uk"
      ],
      "hosts": [
        "66.33.214.180",
        "64.130.23.5"
      ],
      "environmentId": "1"
    },
    {
      "md5": "4f3a560c8deba19c5efd48e9b6826adb",
      "threatscore": 65,
      "threatlevel": 0,
      "avdetect": 0,
      "vxfamily": "",
      "domains": [
        "px.adhigh.net"
      ],
      "hosts": [
        "130.211.155.133",
        "65.52.108.163",
        "172.225.246.16"
      ],
      "environmentId": "1"
    }
  ]
}
if "threatscore" is over 70 I want to add it to this json structure -
Ex.
"data": [
{
"md5": "cd042ba78d0810d86755136609793d6d",
"threatscore": 90,
{
"Event":
{"date":"2015-11-25",
"threat_level_id":"1",
"info":"HybridAnalysis",
"analysis":"0",
"distribution":"0",
"orgc":"SOC",
"Attribute": [
{"type":"ip-dst",
"category":"Network activity",
"to_ids":True,
"distribution":"3",
"value":"66.33.214.180"},
{"type":"ip-dst",
"category":"Network activity",
"to_ids":True,
"distribution":"3",
"value":"64.130.23.5"}
{"type":"domain",
"category":"Network activity",
"to_ids":True,
"distribution":"3",
"value":"dynamicflakesdemo.com"},
{"type":"domain",
"category":"Network activity",
"to_ids":True,
"distribution":"3",
"value":"www.bountifulbreast.co.uk"}
{"type":"md5",
"category":"Payload delivery",
"to_ids":True,
"distribution":"3",
"value":"cd042ba78d0810d86755136609793d6d"}]
}
}
This is my code -
from datetime import datetime
import os
import json
from pprint import pprint

now = datetime.now()
testFile = open("feed.json")
feed = json.load(testFile)

for x in feed['data']:
    if x['threatscore'] > 90:
        data = {}
        data['Event'] = {}
        data['Event']["date"] = now.strftime("%Y-%m-%d")
        data['Event']["threat_level_id"] = "1"
        data['Event']["info"] = "HybridAnalysis"
        data['Event']["analysis"] = 0
        data['Event']["distribution"] = 3
        data['Event']["orgc"] = "Malware"
        data['Event']["Attribute"] = []
        if 'hosts' in x:
            data['Event']["Attribute"].append({'type': "ip-dst"})
            data['Event']["Attribute"][0]["category"] = "Network activity"
            data['Event']["Attribute"][0]["to-ids"] = True
            data['Event']["Attribute"][0]["distribution"] = "3"
            data["Event"]["Attribute"][0]["value"] = x['hosts']
        if 'md5' in x:
            data['Event']["Attribute"].append({'type': "md5"})
            data['Event']["Attribute"][1]["category"] = "Payload delivery"
            data['Event']["Attribute"][1]["to-ids"] = True
            data['Event']["Attribute"][1]["distribution"] = "3"
            data['Event']["Attribute"][1]['value'] = x['md5']
        if 'domains' in x:
            data['Event']["Attribute"].append({'type': "domain"})
            data['Event']["Attribute"][2]["category"] = "Network activity"
            data['Event']["Attribute"][2]["to-ids"] = True
            data['Event']["Attribute"][2]["distribution"] = "3"
            data['Event']["Attribute"][2]["value"] = x['domains']
        attributes = data["Event"]["Attribute"]
        data["Event"]["Attribute"] = []
        for attribute in attributes:
            for value in attribute["value"]:
                if value == " ":
                    pass
                else:
                    new_attr = attribute.copy()
                    new_attr["value"] = value
                    data["Event"]["Attribute"].append(new_attr)
        pprint(data)

with open('output.txt', 'w') as outfile:
    json.dump(data, outfile)
And now it seems to be cleaned up a little, but data['md5'] is being split into individual letters, and I think, as L3viathan said earlier, I keep overwriting the first element in the dictionary... but I'm not sure how to get it to keep appending.
{'Event': {'Attribute': [{'category': 'Network activity',
'distribution': '3',
'to-ids': True,
'type': 'ip-dst',
'value': u'216.115.96.174'},
{'category': 'Network activity',
'distribution': '3',
'to-ids': True,
'type': 'ip-dst',
'value': u'64.4.54.167'},
{'category': 'Network activity',
'distribution': '3',
'to-ids': True,
'type': 'ip-dst',
'value': u'63.250.200.37'},
{'category': 'Payload delivery',
'distribution': '3',
'to-ids': True,
'type': 'md5',
'value': u'7'},
{'category': 'Payload delivery',
'distribution': '3',
'to-ids': True,
'type': 'md5',
'value': u'1'},
And still getting the following error in the end:
Traceback (most recent call last):
File "hybridanalysis.py", line 34, in
data['Event']["Attribute"][1]["category"] = "Payload delivery"
IndexError: list index out of range
The final goal is to get it set so that I can post the events into MISP but they have to go one at a time.
I think this should fix your problems. I added each attribute dictionary all in one go, and moved the data into a list (which is more appropriate), but you might want to remove the superfluous list which wraps the Events.
from datetime import datetime
import os
import json
from pprint import pprint

now = datetime.now()
testFile = open("feed.json")
feed = json.load(testFile)

data_list = []
for x in feed['data']:
    if x['threatscore'] > 90:
        data = {}
        data['Event'] = {}
        data['Event']["date"] = now.strftime("%Y-%m-%d")
        data['Event']["threat_level_id"] = "1"
        data['Event']["info"] = "HybridAnalysis"
        data['Event']["analysis"] = 0
        data['Event']["distribution"] = 3
        data['Event']["orgc"] = "Malware"
        data['Event']["Attribute"] = []
        if 'hosts' in x:
            data['Event']["Attribute"].append({
                'type': 'ip-dst',
                'category': 'Network activity',
                'to-ids': True,
                'distribution': '3',
                'value': x['hosts']})
        if 'md5' in x:
            data['Event']["Attribute"].append({
                'type': 'md5',
                'category': 'Payload delivery',
                'to-ids': True,
                'distribution': '3',
                'value': x['md5']})
        if 'domains' in x:
            data['Event']["Attribute"].append({
                'type': 'domain',
                'category': 'Network activity',
                'to-ids': True,
                'distribution': '3',
                'value': x['domains']})
        # fan the list-valued attributes out into one attribute per value
        attributes = data["Event"]["Attribute"]
        data["Event"]["Attribute"] = []
        for attribute in attributes:
            for value in attribute["value"]:
                if value == " ":
                    pass
                else:
                    new_attr = attribute.copy()
                    new_attr["value"] = value
                    data["Event"]["Attribute"].append(new_attr)
        data_list.append(data)

with open('output.txt', 'w') as outfile:
    json.dump(data_list, outfile)
In the JSON, "Attribute" holds a list with one item (a dict) in it, as shown here.
{'Event': {'Attribute': [{'category': 'Network activity',
'distribution': '3',
'to-ids': True,
'type': 'ip-dst',
'value': [u'54.94.221.70']}]
...
When you call data['Event']["Attribute"][1]["category"] you are asking for the second item (index 1) in the list of attributes, while it only has one item, which is why you are getting the error.
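A minimal illustration of that failure mode:

# The attribute list holds a single element, so index 1 is out of range:
attrs = [{'type': 'ip-dst', 'category': 'Network activity'}]
attrs[0]["category"]  # works
attrs[1]["category"]  # IndexError: list index out of range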
Thanks L3viathan! Below is how I tweaked it so it does not iterate over the MD5s.
attributes = data["Event"]["Attribute"]
data["Event"]["Attribute"] = []
for attribute in attributes:
    if attribute['type'] == 'md5':
        # md5 is a single string, not a list, so copy it through as-is
        new_attr = attribute.copy()
        new_attr["value"] = str(x['md5'])
        data["Event"]["Attribute"].append(new_attr)
    else:
        for value in attribute["value"]:
            new_attr = attribute.copy()
            new_attr["value"] = value
            data["Event"]["Attribute"].append(new_attr)
data_list.append(data)
Manipulating json seems to be the way to go to learn lists and dictionaries.

Dictionary keys and values

I'm working with Python dictionaries and there's something I don't understand when I use the functions dict.values() and dict.keys().
Why does the result also include the "description" of the function (dict_values, dict_keys)? Am I missing something here?
participant = {"name": "Lisa", "age": 16, "activities": [{"name": "running", "duration": 340},{"name": "walking", "duration": 790}]}
print(participant.values())
print(participant.keys())
The print gives these results:
dict_values([[{'duration': 340, 'name': 'running'}, {'duration': 790, 'name': 'walking'}], 'Lisa', 16])
dict_keys(['activities', 'name', 'age'])
I don't want 'dict_values' and 'dict_keys' in the result. What am I doing wrong?
For this purpose you can use the built-in list constructor:
list(participant.keys()) # ['name', 'activities', 'age']
list(participant.values())
# ['Lisa', [{'name': 'running', 'duration': 340}, {'name': 'walking', 'duration': 790}], 16]
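As background (a general Python fact, not specific to this example): dict.keys() and dict.values() return lightweight view objects, which is what the dict_keys/dict_values wrappers in the printout are. Wrapping them in list() materializes them, but the views can also be iterated or passed to functions directly:

# Views are iterable as-is; list() is only needed when an actual list is required.
for key, value in participant.items():
    print(key, value)

sorted_keys = sorted(participant.keys())  # any function accepting an iterable works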
