High-level description of what I want: I want to receive a JSON payload detailing the values of certain fields/features, say {a: 1, b: 2, c: 3}, as a Flask (JSON) request. Then I want to convert the resulting Python dict into an R dataframe (a single row) with rpy2, and feed it into a model in R which expects input where each column is a factor. I usually use Python for this sort of thing and serialize a vectorizer object from sklearn, but this particular analysis needs to be done in R.
So here is what I'm doing so far.
import os

import rpy2.robjects as robjects
from rpy2.robjects.packages import STAP

# read the R source and expose its functions to Python
model_path = os.path.join('model', 'rsource_file.R')
with open(model_path, 'r') as f:
    string = f.read()
model = STAP(string, "model")

# `data` is the Python dict parsed from the Flask JSON request
data_r = robjects.DataFrame(data)
data_factored = model.prepdata(data_r)
result = model.predict(data_factored)
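Depending on your rpy2 version, passing a plain dict of Python scalars to robjects.DataFrame may not convert cleanly; wrapping each value in an R vector is more explicit. A minimal sketch with a made-up payload:
import rpy2.robjects as robjects

payload = {"a": 1, "b": 2, "c": 3}  # e.g. request.get_json() in the Flask view
data_r = robjects.DataFrame({
    k: robjects.IntVector([v]) if isinstance(v, int) else robjects.StrVector([str(v)])
    for k, v in payload.items()
})
print(data_r)  # a one-row R data.frame with columns a, b, c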
The relevant R functions from rsource_file.R are:
prepdata = function(row){
  for(v in vars) if(typeof(row[,v])=="character") row[,v] = as.factor(row[,v], levs[0,v])
  modm2 = model.matrix(frm, data=tdz2, contrasts.arg = c1, xlev = levs)
}
where contrasts and levels have been pre-extracted from an existing dataset like so:
# vars = vector of columns of interest
load("data.Rd")
for(v in vars) if(typeof(data[,v])=="character") data[,v] = as.factor(data[,v])
frm = ~ weightedsum_of_things  # function mapped, causes no issue
modm = model.matrix(frm, data=data)
levs = lapply(data, levels)
c1 = attributes(modm)$contrasts
Calling prepdata does not give me what I want, which is for the new dataframe (built from the JSON request, data_r) to be properly turned into a vector of factors with the same encoding by which the elements of the data.Rd dataset were transformed.
Thank you for your assistance, will upvote.
More detail: what my code is attempting to do is map levels() over the dataset to extract a list of lists of possible levels for each factor, and then, for matching values in the new input, call factor() with the new data row as well as the corresponding set of levels, levs[0,v].
This throws an error that you can't use factor if there isn't more than one level. I think this might have something to do with the labels/levels difference? I'm calling levs[,v] to get the element of the return value of lapply(data, levels) corresponding to the "title" v (a string). I extracted the levels from the dataset, but referencing them in the body of prepdata this way doesn't seem to work. Do I need to extract labels instead? If so, how can I do that?
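For what it's worth, the levels piece can be exercised from Python on its own. A minimal sketch (with made-up level values, not my real columns) of calling R's factor() through rpy2 with a pre-extracted level set, so a single new value gets the same encoding:
import rpy2.robjects as robjects

# hypothetical pre-extracted levels for one column, plus one new incoming value
levs_for_column = robjects.StrVector(["low", "medium", "high"])
new_value = robjects.StrVector(["medium"])

factored = robjects.r["factor"](new_value, levels=levs_for_column)
print(list(factored.levels))                   # ['low', 'medium', 'high']
print(robjects.r["as.integer"](factored)[0])   # 2, the code for 'medium'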
The "pandasdmx" documentation has an example of how to set dimensions, but no how to set attributes.
I tried passing "key" to the variable, but an error occurs.
Perhaps setting attributes requires a different syntax?
At the end of my sample code, I printed out a list of dimensions and attributes.
import pandasdmx as sdmx
import pandas as pd
ecb = sdmx.Request('ECB')
key = dict(FREQ=['M'], SEC_ITEM=['F51100'], REF_AREA = ['I8'], SEC_ISSUING_SECTOR = ['1000'])
params = dict(startPeriod='2010-02', endPeriod='2010-02')
data_msg = ecb.data('SEC', key=key, params=params)
data = data_msg.data[0]
daily = [s for sk, s in data.series.items() if sk.FREQ == 'M']
cur_df = pd.concat(sdmx.to_pandas(daily)).unstack()
print(cur_df.to_string())
flow_msg = ecb.dataflow()
dataflows = sdmx.to_pandas(flow_msg.dataflow)
exr_msg = ecb.dataflow('SEC')
exr_flow = exr_msg.dataflow.SEC
dsd = exr_flow.structure
dimen = dsd.dimensions.components
attrib = dsd.attributes.components
print(dimen)
print(attrib)
To clarify, what you're asking here is not “how to set attributes” but “how to include attributes when converting (or ‘writing’) an SDMX DataSet to a pandas DataFrame”.
Using sdmx1,¹ see the documentation for the attribute parameter to the write_dataset() function. This is the function that is ultimately called by to_pandas() when you pass a DataSet object as the first argument.
In your sample code, the key line is:
sdmx.to_pandas(daily)
You might try instead:
sdmx.to_pandas(data_msg.data[0], attributes="dsgo")
Per the documentation, the string "dsgo" means “include in additional columns all attributes attached at the dataset, series key, group, and observation levels.”
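Applied to your sample code, a hedged sketch (the exact attribute columns depend on the dataflow) would be:
# include dataset-, series-, group-, and observation-level attributes as extra columns
df = sdmx.to_pandas(data_msg.data[0], attributes="dsgo")
print(df.head())  # a 'value' column plus one column per attached attribute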
¹ sdmx1 is a fork of pandaSDMX with many enhancements and more regular releases.
I'm trying to figure out how to index a variable with an indexed Set.
For example:
model = AbstractModel()
model.J = Set()
model.O = Set(model.J)
I want to define a variable indexed over both Sets. Can someone help me? I tried the following:
model.eb=Param(model.J, model.O)
which gives
TypeError("Cannot index a component with an indexed set")
Has anyone any suggestions on how to define this variable properly?
Pyomo doesn't support indexed Sets like that (I'm actually unaware of use cases for indexed sets in Pyomo, although they seem to be a thing in GAMS). You could approach this as follows (using ConcreteModel here, for illustration):
Define Sets for all unique values of jobs and operations (I assume you have some data structure which maps the operations to the jobs):
import pyomo.environ as po
import itertools
model = po.ConcreteModel()
map_J_O = {'J1': ['O11', 'O12'],
           'J2': ['O21']}
unique_J = map_J_O.keys()
model.J = po.Set(initialize=unique_J)
unique_O = set(itertools.chain.from_iterable(map_J_O.values()))
model.O = po.Set(initialize=unique_O)
Then you could define a combined Set which contains all valid combinations of J and O:
model.J_O = po.Set(within=model.J * model.O,
                   initialize=[(j, o) for j in map_J_O for o in map_J_O[j]])
model.J_O.display()
# Output:
#J_O : Dim=0, Dimen=2, Size=3, Domain=J_O_domain, Ordered=False, Bounds=None
# [('J1', 'O11'), ('J1', 'O12'), ('J2', 'O21')]
Create the parameter using the combined Set:
model.eb = po.Param(model.J_O)
This last line will throw an error if the parameter is initialized with any invalid combination of J and O. Alternatively, you can also initialize the parameter for all combinations
po.Param(model.J * model.O)
and only provide values for the valid combinations, but this might bite you later. Also, model.J_O might come in handy for variables and constraints as well, depending on your model formulation; see the sketch below.
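For illustration, a minimal sketch (the variable and constraint names are made up) of indexing a variable and a constraint over the same combined Set:
# hypothetical: a non-negative variable over the valid (j, o) pairs only
model.x = po.Var(model.J_O, domain=po.NonNegativeReals)

# one constraint per valid (j, o) pair; Pyomo unpacks each tuple into the rule arguments
def x_limit_rule(m, j, o):
    return m.x[j, o] <= 10
model.x_limit = po.Constraint(model.J_O, rule=x_limit_rule)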
The API here: https://api.bitfinex.com/v2/tickers?symbols=ALL
does not have any labels, and I want to extract all of the tBTCUSD, tLTCUSD, etc. Basically everything without numbers. Normally, I would extract this information if it were labeled, so I could do something like:
data['name']
or something like that. However, this API does not have labels, so how can I get this info with Python?
You can do it like this:
import requests

j = requests.get('https://api.bitfinex.com/v2/tickers?symbols=ALL').json()
mydict = {}
for i in j:
    mydict[i[0]] = i[1:]
Or using dictionary comprehension:
mydict = {i[0]: i[1:] for i in j}
Then access it as:
mydict['tZRXETH']
I don't have access to Python right now, but it looks like they're organized in a superarray of several subarrays.
You should be able to extract everything (the superarray) as data, and then do a:
for array in data:
    print(array[0])
Not sure if this answers your question. Let me know!
Even if it doesn't have labels (or, more specifically, if it's not a JSON object) it's still a perfectly legal piece of JSON, since it's just some arrays contained within a parent array.
Assuming you can already get the text from the api, you can load it as a Python object using json.loads:
import json
data = json.loads(your_data_as_string)
Then, since the labels you want to extract are always in the first position of the arrays, you can store them in a list using a list comprehension:
labels = [x[0] for x in data]
labels will be:
['tBTCUSD', 'tLTCUSD', 'tLTCBTC', 'tETHUSD', 'tETHBTC', 'tETCBTC', ...]
I am using this JSON to CSV Converter to convert my JSON data to CSV, which I can further work on in Excel:
https://github.com/vinay20045/json-to-csv
The structure of my JSON data looks like following: https://pastebin.com/rPkqcXiF
{
"page": 1,
"pages": 1,
"limit": 100,
"total": 20,
"items": [
{...}
]
}
In line 64 there is an array of items. The first item is shown from line 65 to 92.
The next item with the same structure would then start at line 93, when present.
My problem now is: I fetch 2 datasets from the REST API.
One of those datasets has an items array with 2 items. The Python script then generates further columns for the new items: the first item becomes items_0, the next items_1, and so on.
Example of what I mean, formatted for viewing in Excel: Pastebin EqGHX07U (only 2 links allowed here)
Instead of generating new columns when the number of array elements rises, I'd like to have only one set of columns in the CSV header. When the number of array elements rises, a new line should be generated with all the other data repeated as before; only the data of the new array element changes.
Example of what I mean, formatted for viewing in Excel: Pastebin QLnaiqDs (only 2 links allowed here)
It would be awesome if you could help me out here! A few hints on how to solve that would be highly appreciated; I am not used to Python though :(
Thank you so much!
If I understood correctly, here is an approach you could think about:
headers = []
csv_text = ""

def add_empty_fields():
    """Appends a ';' to every existing line so each new header gets an empty field."""
    global csv_text
    csv_lines = csv_text.split("\n")
    csv_text = ""
    for line in csv_lines[:-1]:
        csv_text += line + ";\n"
    csv_text += csv_lines[-1]

def add_ordered_data(json_parsed_to_dict):
    global csv_text
    # Get the headers of this JSON item
    tmp_list = set(json_parsed_to_dict.keys())
    # Add the data in the order of the known headers
    for h in headers:
        if h in tmp_list:
            tmp_list.discard(h)
            csv_text += str(json_parsed_to_dict[h]) + ";"
        else:
            csv_text += ";"
    # Append any new headers and their values behind the known ones
    for new_header in tmp_list:
        headers.append(new_header)
        csv_text += str(json_parsed_to_dict[new_header]) + ";"
    # add_empty_fields()
    # (you could do csv_text.replace("\n", ";\n") here instead of add_empty_fields)
    csv_text += "\n"
I wrote this all directly here, so it probably has some failures, but I hope it helps you.
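A hedged usage sketch (with made-up item dicts, assuming each item has already been flattened to a plain dict):
items = [
    {"id": 1, "name": "foo"},
    {"id": 2, "name": "bar", "extra": "x"},  # a new key appears in a later item
]
for item in items:
    add_ordered_data(item)

print(";".join(headers))
print(csv_text)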
Apart from handling JSON data with a node element, the script can also handle a JSON array without a node element as input. Refer to the readme file and sample_2 in the repo.
So you can pre-process the input to get all the items from both API calls and merge them before feeding the list to the converter, like:
data_to_be_processed = items_from_api_1 + items_from_api_2
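For example, a hedged pre-processing sketch (the endpoint URLs are placeholders for your real REST API):
import requests

# fetch both datasets; these URLs are placeholders
resp_1 = requests.get("https://example.com/api/dataset/1").json()
resp_2 = requests.get("https://example.com/api/dataset/2").json()

# keep only the per-item records and merge them; the page/limit/total metadata is dropped
data_to_be_processed = resp_1["items"] + resp_2["items"]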
You could write this pre-processor as a standalone module or modify my script just after line 77.
Hope it helps...
I have a list of JSON documents, in the format:
[{"a": 1, "b": [2, 5, 6]}, {"a": 2, "b": [1, 3, 5]}, ...]
What I need to do is make nodes with parameter a, and connect them to all the nodes in the list b that have that value for a. So the first node will connect to nodes 2, 5 and 6. Right now I'm using Python's neo4jrestclient to populate but it's taking a long time. Is there a faster way to populate?
Currently this is my script:
break_list = []
for each in ans[1:]:
    ref = each[0]
    q = """MATCH n WHERE n.url = '%s' RETURN n;""" % (ref)
    n1 = gdb.query(q, returns=client.Node)[0][0]
    for link in each[6]:
        if len(link) > 4:
            text, link = link.split('!__!')
            q2 = """MATCH n WHERE n.url = '%s' RETURN n;""" % (link)
            try:
                n2 = gdb.query(q2, returns=client.Node)
                n1.relationships.create("Links", n2[0][0], anchor_text=text)
            except:
                break_list.append((ref, link))
You might want to consider converting your JSON to CSV (using something like jq); then you could use the LOAD CSV Cypher tool for the import. LOAD CSV is optimized for data import, so you will get much better performance using this method. With your example, the LOAD CSV script would look something like this:
Your JSON converted to CSV:
"a","b"
"1","2,5,6"
"2","1,3,5"
First, create a uniqueness constraint / index. This ensures only one node is created for any "name" and creates an index for faster lookup performance.
CREATE CONSTRAINT ON (p:Person) ASSERT p.name IS UNIQUE;
Given the above CSV file this Cypher script can be used to efficiently import data:
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///path/to/file.csv" AS row
MERGE (a:Person{name: row.a})
WITH a,row
UNWIND split(row.b,',') AS other
MERGE (b:Person {name:other})
CREATE UNIQUE (a)-[:CONNECTED_TO]->(b);
Other option
Another option is to use the JSON as a parameter in a Cypher query and then iterate through each element of the JSON array using UNWIND.
WITH {d} AS json
UNWIND json AS doc
MERGE (a:Person{name: doc.a})
WITH doc, a
UNWIND doc.b AS other
MERGE (b:Person{name:other})
CREATE UNIQUE (a)-[:CONNECTED_TO]->(b);
There might be some performance issues with a very large JSON array, though. See some examples of this here and here.
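For reference, a hedged sketch of sending the documents as a parameter from Python, using the official neo4j driver rather than neo4jrestclient, with the newer $d parameter syntax and MERGE in place of the deprecated CREATE UNIQUE (the connection details are placeholders):
from neo4j import GraphDatabase

docs = [{"a": 1, "b": [2, 5, 6]}, {"a": 2, "b": [1, 3, 5]}]

query = """
UNWIND $d AS doc
MERGE (a:Person {name: doc.a})
WITH doc, a
UNWIND doc.b AS other
MERGE (b:Person {name: other})
MERGE (a)-[:CONNECTED_TO]->(b)
"""

# placeholder connection details
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
with driver.session() as session:
    session.run(query, d=docs)
driver.close()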