I have a config file that I've been working from, e.g.:
"Preprocessing": {
"BOW":{"ngram_range":[1,2], "max_features":[100, 200]},
"RemoveStopWords": {"Parameter1": ["..."]}
}
The idea is to take this data, generate every combination of parameters across the two preprocessing steps, and pass each one into a Preprocessing object. The output I'm looking for is:
[{"BOW":{"ngram_range":1, "max_features":100}, "RemoveStopWords":{"Parameter1": "..."},
{"BOW":{"ngram_range":2, "max_features":100}, "RemoveStopWords":{"Parameter1": "..."},
{"BOW":{"ngram_range":1, "max_features":200}, "RemoveStopWords":{"Parameter1": "..."},
{"BOW":{"ngram_range":2, "max_features":200}, "RemoveStopWords":{"Parameter1": "..."}]
Current code:
from itertools import product

def unpack_preprocessing_steps(preprocessing: dict):
    """
    Take the Preprocessing section of the config file
    and produce a list of preprocessing combinations.
    """
    preprocessing_steps = []   # save for all steps bow, w2v, etc.
    preprocessing_params = []  # individual parameters for each preprocessing step
    for key, values in preprocessing.items():
        for key2, values2 in values.items():
            preprocessing_steps.append(key2)
            preprocessing_params.append(values2)
    iterables = product(*preprocessing_params)  # creates a matrix of every combination
    iterable_of_params = [i for i in iterables]
    exploded_preprocessing_list = []
    for params in iterable_of_params:
        individual_objects = {}  # store each object as an unpackable datatype
        for step, param in zip(preprocessing_steps, params):
            individual_objects[step] = param  # stores every iteration as its own set of preprocesses
        exploded_preprocessing_list.append(individual_objects)
    return exploded_preprocessing_list
Current (and wrong) output is:
[{"ngram_range":1, "max_features":100, "Parameter1":"..."},
{"ngram_range":2, "max_features":200, "Parameter1":"..."},
{"ngram_range":1, "max_features":100, "Parameter1":"..."},
{"ngram_range":2, "max_features":200, "Parameter1":"..."}]
This should work for you - assuming you always want that same RemoveStopWords section. It generates the product of all feature keys and values:
from itertools import product
# pprint just to make output clearer
from pprint import pprint
config = {
    "Preprocessing": {
        "BOW": {"ngram_range": [1, 2], "max_features": [100, 200]},
        "RemoveStopWords": {"Parameter1": ["..."]},
    }
}
features, values = zip(*config["Preprocessing"]["BOW"].items())
bows = [dict(zip(features, v)) for v in product(*values)]
newconf = []
for bow in bows:
    newconf.append({
        "BOW": bow,
        "RemoveStopWords": config["Preprocessing"]["RemoveStopWords"],
    })
pprint(newconf)
Result:
[{'BOW': {'max_features': 100, 'ngram_range': 1}, 'RemoveStopWords': {'Parameter1': ['...']}},
{'BOW': {'max_features': 200, 'ngram_range': 1}, 'RemoveStopWords': {'Parameter1': ['...']}},
{'BOW': {'max_features': 100, 'ngram_range': 2}, 'RemoveStopWords': {'Parameter1': ['...']}},
{'BOW': {'max_features': 200, 'ngram_range': 2}, 'RemoveStopWords': {'Parameter1': ['...']}}]
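If the other sections should be expanded as well (or more preprocessing steps get added later), a more general sketch is to expand each step's parameter lists first and then take the product across the steps. It reuses the config defined above; expand_steps is just an illustrative name:
from itertools import product

def expand_steps(preprocessing):
    """Expand each step's parameter lists, then combine across steps."""
    per_step = []
    for step, params in preprocessing.items():
        keys, values = zip(*params.items())
        # every combination of this single step's parameter values
        per_step.append([(step, dict(zip(keys, combo))) for combo in product(*values)])
    # every combination across the different steps
    return [dict(combo) for combo in product(*per_step)]

pprint(expand_steps(config["Preprocessing"]))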
I'm looking for any suggestions to resolve an issue I'm facing. It might seem like a simple problem, but after a few days of trying to find an answer, I no longer think it is.
I'm receiving data (StringType) in the following JSON-like format, and there is a requirement to turn it into a flat key-value dictionary. Here is a payload sample:
s = """{"status": "active", "name": "{\"first\": \"John\", \"last\": \"Smith\"}", "street_address": "100 \"Y\" Street"}"""
and the desired output should look like this:
{'status': 'active', 'name_first': 'John', 'name_last': 'Smith', 'street_address': '100 "Y" Street'}
The issue is that I can't find a way to turn the original string (s) into a dictionary. Once I can achieve that, the flattening part works perfectly fine.
import json
import collections
import ast
#############################################################
# Flatten complex structure into a flat dictionary
#############################################################
def flatten_dictionary(dictionary, parent_key=False, separator='_', value_to_str=True):
    """
    Turn a nested complex json into a flattened dictionary
    :param dictionary: The dictionary to flatten
    :param parent_key: The string to prepend to dictionary's keys
    :param separator: The string used to separate flattened keys
    :param value_to_str: Force all returned values to string type
    :return: A flattened dictionary
    """
    items = []
    for key, value in dictionary.items():
        new_key = str(parent_key) + separator + key if parent_key else key
        try:
            value = json.loads(value)
        except BaseException:
            value = value
        if isinstance(value, collections.MutableMapping):
            if not value.items():
                items.append((new_key, None))
            else:
                items.extend(flatten_dictionary(value, new_key, separator).items())
        elif isinstance(value, list):
            if len(value):
                for k, v in enumerate(value):
                    items.extend(flatten_dictionary({str(k): (str(v) if value_to_str else v)}, new_key).items())
            else:
                items.append((new_key, None))
        else:
            items.append((new_key, (str(value) if value_to_str else value)))
    return dict(items)
# Data samples: string and dictionary
s = """{"status": "active", "name": "{\"first\": \"John\", \"last\": \"Smith\"}", "street_address": "100 \"Y\" Street"}"""
d = {"status": "active", "name": "{\"first\": \"John\", \"last\": \"Smith\"}", "street_address": "100 \"Y\" Street"}
# Works for dictionary type
print(flatten_dictionary(d))
# Doesn't work for string type, for any of the below methods
e = eval(s)
# a = ast.literal_eval(s)
# j = json.loads(s)
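One detail that explains the failures above, assuming the sample is typed into Python source exactly as shown: inside a regular (non-raw) triple-quoted string, the \" escapes are consumed by Python itself, so s ends up containing unescaped inner quotes and is neither valid JSON nor a valid Python literal, which is why json.loads, ast.literal_eval and eval all reject it:
print(s)
# {"status": "active", "name": "{"first": "John", "last": "Smith"}", "street_address": "100 "Y" Street"}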
Try:
import json
import re
def jsonify(s):
    s = s.replace('"{', '{').replace('}"', '}')
    s = re.sub(r'street_address":\s+"(.+)"(.+)"(.+)"', r'street_address": "\1\2\3"', s)
    return json.loads(s)
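A quick check against the sample payload from the question (note that this first version drops the quotes around Y):
print(jsonify(s))
# {'status': 'active', 'name': {'first': 'John', 'last': 'Smith'}, 'street_address': '100 Y Street'}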
If you must keep the quotes around Y, try:
def jsonify(s):
    s = s.replace('"{', '{').replace('}"', '}')
    search = re.search(r'street_address":\s+"(.+)"(.+)"(.+)"', s)
    if search:
        s = re.sub(r'street_address":\s+"(.+)"(.+)"(.+)"', r'street_address": "\1\2\3"', s)
    dict_version = json.loads(s)
    dict_version['street_address'] = dict_version['street_address'].replace(search.group(2), '"' + search.group(2) + '"')
    return dict_version
A more generalized attempt:
def jsonify(s):
    pattern = r'(?<=[,}])\s*"(.[^\{\}:,]+?)":\s+"([^\{\}:,]+?)"([^\{\}:,]+?)"([^\{\}:,]+?)"([,\}])'
    s = s.replace('"{', '{').replace('}"', '}')
    search = re.search(pattern, s)
    matches = []
    if search:
        matches = re.findall(pattern, s)
        s = re.sub(pattern, r'"\1": "\2\3\4"\5', s)
    dict_version = json.loads(s)
    for match in matches:
        dict_version[match[0]] = dict_version[match[0]].replace(match[2], '"' + match[2] + '"')
    return dict_version
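Feeding that result into the flatten_dictionary function from the question should then produce the desired flat output (a sketch; note that the collections.MutableMapping alias used there was removed in Python 3.10, where it needs to be collections.abc.MutableMapping):
print(flatten_dictionary(jsonify(s)))
# {'status': 'active', 'name_first': 'John', 'name_last': 'Smith', 'street_address': '100 "Y" Street'}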
Why can't I convert the group inside the groupby loop to a list? Currently I am working with Django==2.2.1, and when I try the data below in a plain Python console, it works fine.
from itertools import groupby
from operator import itemgetter

@login_required
def list(request, template_name='cart/list.html'):
    # I also tried with this dummy data
    test_data = [{'total_order':1,'agent_name':'agentbeli','total_pcs':1,'total_kg':5.0},{'total_order':1,'agent_name':'agent123','total_pcs':1,'total_kg':5.0},{'total_order':1,'agent_name':'agent123','total_pcs':1,'total_kg':6.0},{'total_order':1,'agent_name':'agentbeli','total_pcs':1,'total_kg':6.0},{'total_order':1,'agent_name':'agentbeli','total_pcs':1,'total_kg':6.0},{'total_order':1,'agent_name':'agent123','total_pcs':1,'total_kg':7.0}]
    print(type(test_data))  # a list
    sorted_totals = sorted(test_data, key=itemgetter('total_order'))
    for agent_name, group in groupby(sorted_totals, key=lambda x: x['agent_name']):
        print(agent_name, list(group))  # I stopped here when converting the `group` to a list.
But I am getting an error when I try it inside a Django view.
I also tried it with defaultdict
from collections import defaultdict

@login_required
def list(request, template_name='cart/list.html'):
    test_data = [{'total_order':1,'agent_name':'agentbeli','total_pcs':1,'total_kg':5.0},{'total_order':1,'agent_name':'agent123','total_pcs':1,'total_kg':5.0},{'total_order':1,'agent_name':'agent123','total_pcs':1,'total_kg':6.0},{'total_order':1,'agent_name':'agentbeli','total_pcs':1,'total_kg':6.0},{'total_order':1,'agent_name':'agentbeli','total_pcs':1,'total_kg':6.0},{'total_order':1,'agent_name':'agent123','total_pcs':1,'total_kg':7.0}]
    grouped = defaultdict(list)
    for data_total in test_data:
        grouped[data_total['agent_name']].append(data_total)  # stopped here
    grouped_out = []
    for agent_name, group in grouped.items():
        total_order = 0
        total_pcs = 0
        total_kg = 0
        if isinstance(group, list):
            for data_total in group:
                total_order += data_total.get('total_order')
                total_pcs += data_total.get('total_pcs')
                total_kg += data_total.get('total_kg')
        grouped_out.append({
            'agent_name': agent_name,
            'total_order': total_order,
            'total_pcs': total_pcs,
            'total_kg': total_kg
        })
But again the error I found stops at the wrapper view. As with the previous attempt, it references _wrapped_view.
Finally, I fixed it manually by using a dict.
test_data = [{'total_order':1,'agent_name':'agentbeli','total_pcs':1,'total_kg':5.0},{'total_order':1,'agent_name':'agent123','total_pcs':1,'total_kg':5.0},{'total_order':1,'agent_name':'agent123','total_pcs':1,'total_kg':6.0},{'total_order':1,'agent_name':'agentbeli','total_pcs':1,'total_kg':6.0},{'total_order':1,'agent_name':'agentbeli','total_pcs':1,'total_kg':6.0},{'total_order':1,'agent_name':'agent123','total_pcs':1,'total_kg':7.0}]
grouped = {}
for data_total in test_data:
    agent_name = data_total.get('agent_name')
    if agent_name in grouped:
        new_data = grouped[agent_name]  # dict
        new_data['total_order'] += data_total.get('total_order')
        new_data['total_pcs'] += data_total.get('total_pcs')
        new_data['total_kg'] += data_total.get('total_kg')
        grouped[agent_name].update(**new_data)
    else:
        grouped[agent_name] = data_total
And the resulting grouped dict looks like this:
{'agent123': {'agent_name': 'agent123',
              'total_kg': 18.0,
              'total_order': 3,
              'total_pcs': 3},
 'agentbeli': {'agent_name': 'agentbeli',
               'total_kg': 17.0,
               'total_order': 3,
               'total_pcs': 3}}
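One caveat with this manual fix: grouped[agent_name] = data_total stores a reference to the original dict, so the += updates also mutate the entries inside test_data, and the .update(**new_data) call is redundant because new_data is the same object. A small sketch that copies first, assuming the same test_data as above:
grouped = {}
for data_total in test_data:
    agent_name = data_total['agent_name']
    if agent_name not in grouped:
        grouped[agent_name] = dict(data_total)  # copy, so test_data stays untouched
    else:
        grouped[agent_name]['total_order'] += data_total['total_order']
        grouped[agent_name]['total_pcs'] += data_total['total_pcs']
        grouped[agent_name]['total_kg'] += data_total['total_kg']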
I have the following code:
import json
data_sample = [{
    "name": "John",
    "age": 30,
    "cars": [
        {
            "temp": {"sum": "20", "for": 12},
            "id": 30,
            "element": [{"model": "Taurus1", "doors": {"id": "1", "id2": 101}},
                        {"model": "T1", "doors": {"id": "2", "id2": 12}},
                        {"model": "As", "doors": {"id": "Mo", "id2": 4}}]
        },
        {
            "temp": {"sum": "10", "for": 12},
            "id": 31,
            "element": [{"model": "Taurus2", "doors": {"id": "2", "id2": 102}},
                        {"model": "T2", "doors": {"id": "5", "id2": 12}},
                        {"model": "Thing", "doors": {"id": "Fo", "id2": 4}}]
        },
        {
            "temp": {"sum": "20", "for": 10},
            "id": 32,
            "element": [{"model": "Taurus3", "doors": {"id": "3", "id2": 103}},
                        {"model": "T3", "doors": {"id": "15", "id2": 62}},
                        {"model": "By", "doors": {"id": "Log", "id2": 4}}]
        }
    ]
}]
def flat_list(z):
    x = []
    for i, data_obj in enumerate(z):
        if type(data_obj) is dict or type(data_obj) is list:
            x.extend([flatten_data(data_obj)])
        else:
            x.extend([data_obj])
    return x

def flatten_data(y):
    out = {}
    def flatten(x, name=''):
        if type(x) is dict:
            for a in x:
                flatten(x[a], name + a + '_')
        elif type(x) is list:
            out[name[:-1]] = flat_list(x)
        else:
            out[name[:-1]] = x
    flatten(y)
    return out

def generatejson(response2):
    # response2 is [(first data set), (second data set)]; convert it to a dictionary {0: (first data set), 1: (second data set)}
    sample_object = {i: data_response for i, data_response in enumerate(response2)}
    flat = {k: flatten_data(v) for k, v in sample_object.items()}
    return json.dumps(flat, sort_keys=True)

print(generatejson(data_sample))
This code takes data in the following format:
[(first data set), (second data set)]
and begins looking for nested dicts. When a nested dict is detected, the code flattens it up to the parent level. For example, doors is a nested dict, so it gets flattened into keys like doors_id and doors_id2 (see the short illustration below). Note that it doesn't change the lists/arrays; they are not being flattened.
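A quick illustration of that behavior on one of the nested element entries from the sample above:
print(flatten_data({"model": "Taurus1", "doors": {"id": "1", "id2": 101}}))
# {'model': 'Taurus1', 'doors_id': '1', 'doors_id2': 101}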
My issue:
With a small amount of data the code works great, but when handling a large number of sets (1000+) the performance is very poor, and it sometimes even crashes.
How can I improve and optimize the performance of this code?
The data_sample contains only 1 data set (I assume that's enough for checking).
I've created a function that iterates through a multi-level dictionary and executes a second function, ssocr, which needs four arguments: coord, background, foreground, type (they are the values of my keys). This is my dictionary, which is taken from a JSON file.
document json
def parse_image(self, d):
    bg = d['background']
    fg = d['foreground']
    results = {}
    for k, v in d['boxes'].iteritems():
        if 'foreground' in d['boxes']:
            myfg = d['boxes']['foreground']
        else:
            myfg = fg
        if k != 'players_home' and k != 'players_opponent':
            results[k] = MyAgonism.ssocr(v['coord'], bg, myfg, v['type'])
    results['players_home'] = {}
    for k, v in d['boxes']['players_home'].iteritems():
        if 'foreground' in d['boxes']['players_home']:
            myfg = d['boxes']['players_home']['foreground']
        else:
            myfg = fg
        if k != 'background' and k != 'foreground':
            for k2, v2 in d['boxes']['players_home'][k].iteritems():
                if k2 != 'fouls':
                    results['players_home'][k] = {}
                    results['players_home'][k][k2] = MyAgonism.ssocr(v2['coord'], bg, myfg, v2['type'])
    return results
In the last iteritems loop I get the correct value only for the name key. The score key doesn't appear. It looks like name overrides score in my results['players_home'] dictionary.
output: ... "player4": {"name": 9}, "player5": {"name": 24} ...
I would like something like ... "player4": {"name": 9, "score": value}, "player5": {"name": 24, "score": value} ...
What am I doing wrong? Here is the full code just in case: Full Code
This might be the/a problem:
if k2 != 'fouls':
    results['players_home'][k] = {}
In your loop, each time that k2 is not 'fouls', you create a new empty dict and store it in results['players_home'][k]. That means any entry previously stored there (such as the one for name) is no longer accessible.
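A minimal fix, keeping the rest of the loops from the question unchanged, is to create each player's dict only once, e.g. with setdefault, so later keys are added to the same dict instead of replacing it:
if k2 != 'fouls':
    results['players_home'].setdefault(k, {})
    results['players_home'][k][k2] = MyAgonism.ssocr(v2['coord'], bg, myfg, v2['type'])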
There is a way to initialize a structure with a dictionary:
fooData= {'y': 1, 'x': 2}
fooStruct = ffi.new("foo_t*", fooData)
fooBuffer = ffi.buffer(fooStruct)
Is there a ready-made function to do the conversion in the other direction, from the cdata structure back to Python data?
fooStruct = ffi.new("foo_t*")
(ffi.buffer(fooStruct))[:] = fooBuffer
fooData= convert_to_python( fooStruct[0] )
Do I have to use ffi.typeof("foo_t").fields by myself?
I came up with this code so far:
def __convert_struct_field(s, fields):
    for field, fieldtype in fields:
        if fieldtype.type.kind == 'primitive':
            yield (field, getattr(s, field))
        else:
            yield (field, convert_to_python(getattr(s, field)))

def convert_to_python(s):
    type = ffi.typeof(s)
    if type.kind == 'struct':
        return dict(__convert_struct_field(s, type.fields))
    elif type.kind == 'array':
        if type.item.kind == 'primitive':
            return [s[i] for i in range(type.length)]
        else:
            return [convert_to_python(s[i]) for i in range(type.length)]
    elif type.kind == 'primitive':
        return int(s)
Is there a faster way?
Arpegius' solution works fine for me, and is quite elegant. I implemented a solution based on Selso's suggestion to use inspect; dir() can substitute for inspect.
from inspect import getmembers
from cffi import FFI
ffi = FFI()
from pprint import pprint
def cdata_dict(cd):
    if isinstance(cd, ffi.CData):
        try:
            return ffi.string(cd)
        except TypeError:
            try:
                return [cdata_dict(x) for x in cd]
            except TypeError:
                return {k: cdata_dict(v) for k, v in getmembers(cd)}
    else:
        return cd

foo = ffi.new("""
    struct Foo {
        char name[6];
        struct {
            int a, b[3];
        } item;
    } *""", {
    'name': b"Foo",
    'item': {'a': 3, 'b': [1, 2, 3]}
})

pprint(cdata_dict(foo))
Output:
{'item': {'a': 3, 'b': [1, 2, 3]}, 'name': b'Foo'}
Unfortunately this code does not work for me, as some struct members are "pointer" types, which leads to storing None in the dict.
I am a Python noob, but maybe the inspect module would be another starting point, and a shorter way to print "simple" data. Then we would iterate over the result in order to unroll the data structure.
For example, with the following struct:
struct foo {
    int a;
    char b[10];
};
Using inspect.getmembers(obj) I get the following result:
[('a', 10), ('b', <cdata 'char[10]' 0x7f0be10e2824>)]
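Building on that, here is a rough sketch that turns the getmembers() result into a plain dict. It assumes an ffi = FFI() instance as in the answer above, and that every field is either a primitive or a char array (as in this struct); struct_to_dict is just an illustrative name:
from inspect import getmembers

def struct_to_dict(obj):
    # char arrays come back as cdata; ffi.string() turns them into bytes,
    # while plain ints/floats are already Python values
    return {name: (ffi.string(value) if isinstance(value, ffi.CData) else value)
            for name, value in getmembers(obj)}
For the struct above this would give something like {'a': 10, 'b': b'...'}.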
Your code is fine.
Even if there was a built-in way in CFFI, it would not be what you need here. Indeed, you can say ffi.new("foo_t*", {'p': p1}) where p1 is another cdata, but you cannot recursively pass a dictionary containing more dictionaries. The same would be true in the opposite direction: you would get a dictionary that maps field names to "values", but the values themselves would be more cdata objects anyway, and not recursively more dictionaries.