Python Cubes OLAP Framework - how to work with joins? - python

I'm trying to use Python's OLAP framework Cubes on a very simple database, but I'm having some trouble joining tables.
My schema looks like this:
Users table
ID | name
Products table
ID | name | price
Purchases table
ID | user_id | product_id | date
And the cubes model:
{
    'dimensions': [
        {'name': 'user_id'},
        {'name': 'product_id'},
        {'name': 'date'},
    ],
    'cubes': [
        {
            'name': 'purchases',
            'dimensions': ['user_id', 'product_id', 'date'],
            'measures': ['price'],
            'mappings': {
                'purchases.user_id': 'users.id',
                'purchases.product_id': 'products.id',
                'purchases.price': 'products.price'
            },
            'joins': [
                {
                    'master': 'purchases.user_id',
                    'detail': 'users.id'
                },
                {
                    'master': 'purchases.product_id',
                    'detail': 'products.id'
                }
            ]
        }
    ]
}
Now I would like to display all the purchases, showing the product's name, user's name and purchase date. I can't seem to find a way to do this. The documentation is a bit scarce.
Thank you

First let's fix the model a bit. In your schema you have more attributes per dimension, id and name, and you might end up having more details in the future. You can add them by specifying the attributes as a list: "attributes": ["id", "name"]. Note also that the dimension is named after the entity product, not after the key product_id. The key product_id is just an attribute of the product dimension, as is name or, in the future, maybe category. A dimension reflects the analyst's point of view.
For the time being, we ignore the fact that date should be a special dimension, and we treat date as a single-value key, for example a year, so as not to complicate things here.
"dimensions": [
{"name": "user", "attributes": ["id", "name"]},
{"name": "product", "attributes": ["id", "name"]},
{"name": "date"}
],
Because we changed the names of the dimensions, we have to change them in the cube's dimension list as well:
"cubes": [
{
"name": "purchases",
"dimensions": ["user", "product", "date"],
...
Your schema reflects a classic transactional schema, not a traditional data-warehouse schema. In this case you have to be explicit, as you were, and mention all necessary mappings. The rule is: if the attribute belongs to the fact table (the logical view), then the key is just the attribute name, such as price, with no table specification. If the attribute belongs to a dimension, such as product.id, then the syntax is dimension.attribute. The value in the mappings dictionary is the physical table and physical column. See the documentation for more information about mappings. The mappings for your schema look like:
"mappings": {
"price": "products.price",
"product.id": "products.id",
"product.name": "products.name",
"user.id": "users.id",
"user.name": "users.name"
}
You would not have to write mappings if your schema was:
fact purchases
id | date | user_id | product_id | amount
dimension product
id | name | price
dimension user
id | name
In this case you would need only the joins, because all dimension attributes are in their respective dimension tables. Note the amount in the fact table: since you do not record a count of purchased products per purchase, it would be the same as the product's price.
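Just for illustration, if you had that star schema, the model could be reduced to something like this sketch, with no mappings at all and only joins (assuming the tables are named purchases, product and user, so that the default attribute-to-table mapping applies):
{
    "dimensions": [
        {"name": "user", "attributes": ["id", "name"]},
        {"name": "product", "attributes": ["id", "name", "price"]},
        {"name": "date"}
    ],
    "cubes": [
        {
            "name": "purchases",
            "dimensions": ["user", "product", "date"],
            "measures": ["amount"],
            "joins": [
                {"master": "purchases.user_id", "detail": "user.id"},
                {"master": "purchases.product_id", "detail": "product.id"}
            ]
        }
    ]
}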
Here is the updated model for your actual schema:
{
    "dimensions": [
        {"name": "user", "attributes": ["id", "name"]},
        {"name": "product", "attributes": ["id", "name"]},
        {"name": "date"}
    ],
    "cubes": [
        {
            "name": "purchases",
            "dimensions": ["user", "product", "date"],
            "measures": ["price"],
            "mappings": {
                "price": "products.price",
                "product.id": "products.id",
                "product.name": "products.name",
                "user.id": "users.id",
                "user.name": "users.name"
            },
            "joins": [
                {
                    "master": "purchases.user_id",
                    "detail": "users.id"
                },
                {
                    "master": "purchases.product_id",
                    "detail": "products.id"
                }
            ]
        }
    ]
}
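To answer the original question from Python as well, listing the purchases and aggregating them, here is a minimal sketch using the Workspace API (assuming a recent Cubes release; method names differ slightly between versions):
from cubes import Workspace

# a sketch, assuming a recent Cubes release and the slicer.ini settings shown below
workspace = Workspace()
workspace.register_default_store("sql", url="sqlite:///data.sqlite")
workspace.import_model("model.json")

browser = workspace.browser("purchases")

# list the individual purchases with the mapped dimension attributes
# (user name, product name, date)
for fact in browser.facts():
    print(fact)

# aggregate the price measure over the whole cube
result = browser.aggregate()
print(result.summary)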
You can also try the model without writing any Python code, just by using the slicer command. For that you will need a slicer.ini configuration file:
[server]
backend: sql
port: 5000
log_level: info
prettyprint: yes
[workspace]
url: sqlite:///data.sqlite
[model]
path: model.json
Change url in [workspace] to point to your database and change path in [model] to point to your model file. Then start the server with slicer serve slicer.ini and try:
curl "http://localhost:5000/aggregate"
Also try a drill-down:
curl "http://localhost:5000/aggregate?drilldown=product"
If you need any further help, just let me know, I'm the Cubes author.

Related

JSON jq/python file manipulation with specific key name aggregation

I need to modify the structure of this json file:
[
    {
        "id": "3333",
        "properties": {
            "label": "Computer",
            "name": "My-Laptop"
        }
    },
    {
        "id": "9998",
        "type": "file_system",
        "properties": {
            "mount_point": "/opt",
            "name": "/dev/mapper/rhel-opt",
            "root_container": "3333"
        },
        "label": "FileSystem"
    },
    {
        "id": "9999",
        "type": "file_system",
        "properties": {
            "mount_point": "/var",
            "name": "/dev/mapper/rhel-var",
            "root_container": "3333"
        },
        "label": "FileSystem"
    }
]
in order to have this kind of output:
[
    {
        "id": "3333",
        "properties": {
            "label": "Computer",
            "name": "My-Laptop",
            "file_system": [
                "/opt",
                "/var"
            ]
        }
    }
]
The idea is to have, in the new JSON structure, my laptop with its two file-system partitions in an array named "file_system".
As you can see, the two partitions are related to the first entry by the id and root_container fields.
Now imagine having not just one laptop but thousands of laptops, each with a different id and its own partitions, related to the laptop by the root_container key.
Is there an option to do this with jq functions or a Python script?
Many thanks
You could employ reduce to iterate over the items while extracting their id, mount_point and root_container. Then, if a root_container was present, delete that entry and add its mount_point to the entry whose id matches their root_container. For convenience, I also employed INDEX on the items' id fields to simplify their access as .[$id] and .[$root_container], which had to be undone at the end using map(.).
jq '
reduce .[] as {$id, properties: {$mount_point, $root_container}} (
    INDEX(.id);
    if $root_container then
        del(.[$id])
        | .[$root_container].properties.file_system += [$mount_point]
    else . end
)
| map(.)
'
[
    {
        "id": "3333",
        "properties": {
            "label": "Computer",
            "name": "My-Laptop",
            "file_system": [
                "/opt",
                "/var"
            ]
        }
    }
]
Demo
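Since the question also asks about a Python script, here is a rough sketch of the same idea in Python (the input file name input.json is just an example):
import json

with open("input.json") as f:
    items = json.load(f)

# index the entries by id for easy lookup
by_id = {item["id"]: item for item in items}

for item in list(by_id.values()):
    root = item["properties"].get("root_container")
    if root in by_id:
        # move the partition's mount_point into its parent's file_system array
        by_id[root]["properties"].setdefault("file_system", []).append(
            item["properties"]["mount_point"]
        )
        # drop the partition entry itself
        del by_id[item["id"]]

print(json.dumps(list(by_id.values()), indent=2))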

How can I check the value of a parameter?

I want to write a program that will save information from the API in the form of a JSON file. The API has an 'exchangeId' parameter. When I save information from the API, I want to save only those files in which more than one distinct 'exchangeId' value is present. How can I do that? Please give me a hand.
My Code:
exchangeIds = {102,311,200,302,521,433,482,406,42,400}
for pair in json_data["marketPairs"]:
    if (id := pair.get("exchangeId")):
        if id in exchangeIds:
            json_data["marketPairs"].append(pair)
            exchangeIds.remove(id)
            pairs.append({
                "exchange_name": pair["exchangeName"],
                "market_url": pair["marketUrl"],
                "price": pair["price"],
                "last_update": pair["lastUpdated"],
                "exchange_id": pair["exchangeId"]
            })
out_object["name_of_coin"] = json_data["name"]
out_object["marketPairs"] = pairs
out_object["pairs"] = json_data["numMarketPairs"]
name = json_data["name"]
Example of an exchangeIds output that I don't need:
{200} # only one id in exchangeIds
Example of JSON output:
{
    "name_of_coin": "Pax Dollar",
    "marketPairs": [
        {
            "exchange_name": "Bitrue",
            "market_url": "https://www.bitrue.com/trade/usdp_usdt",
            "price": 1.0000617355334473,
            "last_update": "2021-12-24T16:39:09.000Z",
            "exchange_id": 433
        },
        {
            "exchange_name": "Hotbit",
            "market_url": "https://www.hotbit.io/exchange?symbol=USDP_USDT",
            "price": 0.964348817699553,
            "last_update": "2021-12-24T16:39:08.000Z",
            "exchange_id": 400
        }
    ],
    "pairs": 22
} # this is an example of what I need, because there are two ids
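One possible approach (a sketch built around the field names from the snippet above, so json_data and the key names are assumptions from your code): collect the matching pairs first, and write the file only when more than one distinct exchangeId was found.
import json

wanted_ids = {102, 311, 200, 302, 521, 433, 482, 406, 42, 400}

pairs = []
seen_ids = set()
for pair in json_data["marketPairs"]:
    exchange_id = pair.get("exchangeId")
    if exchange_id in wanted_ids and exchange_id not in seen_ids:
        seen_ids.add(exchange_id)
        pairs.append({
            "exchange_name": pair["exchangeName"],
            "market_url": pair["marketUrl"],
            "price": pair["price"],
            "last_update": pair["lastUpdated"],
            "exchange_id": exchange_id,
        })

# save only when more than one distinct exchangeId matched
if len(seen_ids) > 1:
    out_object = {
        "name_of_coin": json_data["name"],
        "marketPairs": pairs,
        "pairs": json_data["numMarketPairs"],
    }
    with open(json_data["name"] + ".json", "w") as f:
        json.dump(out_object, f, indent=2)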

How can I define the structure of a JSON to transform it to CSV

I have a JSON structured like this:
{
    "data": [
        {
            "groups": {
                "data": [
                    {
                        "group_name": "Wedding planning - places and bands (and others) to recommend!",
                        "date_joined": "2009-03-12 01:01:08.677427"
                    },
                    {
                        "group_name": "Harry Potter and the Deathly Hollows",
                        "date_joined": "2009-01-15 01:38:06.822220"
                    },
                    {
                        "group_name": "Xbox , Playstation, Wii - console fans",
                        "date_joined": "2010-04-02 04:02:58.078934"
                    }
                ]
            },
            "id": "0"
        },
        {
            "groups": {
                "data": [
                    {
                        "group_name": "Lost&Found (Strzegom)",
                        "date_joined": "2010-02-01 14:13:34.551920"
                    },
                    {
                        "group_name": "Tennis, Squash, Badminton, table tennis - looking for sparring partner (Strzegom)",
                        "date_joined": "2008-09-24 17:29:43.356992"
                    }
                ]
            },
            "id": "1"
        }
    ]
}
How does one parse JSON in this form? Should I try building a class resembling this format? My desired output is a CSV where the index is the "id", the first column holds the most recently joined group, the second column the second most recently joined group, and so on.
Meaning the result would be:
   most recent                              second most recent
0  Xbox , Playstation, Wii - console fans   Wedding planning - places and bands (and others) to recommend!
1  Lost&Found (Strzegom)                    Tennis, Squash, Badminton, table tennis - looking for sparring partner (Strzegom)
A solution could look like this:
import json
import time

import pandas as pd

# f is an open file handle containing the JSON from the question
data = json.load(f)
result = []

# number of group_name entries per id; for this example [3, 2]
max_element_group_name = [len(data['data'][i]['groups']['data']) for i in range(len(data['data']))]
max_element_group_name.sort()

for i in range(len(data['data'])):
    # get the id of each entry
    id = data['data'][i]['id']
    # sort the groups by date_joined, most recent first
    sorted_groups_by_date = sorted(
        data['data'][i]['groups']['data'],
        key=lambda x: time.strptime(x['date_joined'], '%Y-%m-%d %H:%M:%S.%f'),
        reverse=True
    )
    # take only as many group names as the smallest group count (here 2),
    # so every row has the same number of columns
    group_names = [sorted_groups_by_date[j]['group_name'] for j in range(max_element_group_name[0])]
    # append the row together with its id
    result.append([id] + group_names)

# build a dataframe from the result list
df = pd.DataFrame(result, columns=['id', 'most recent', 'second most recent'])
# it could be better.
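To produce the CSV the question asks for, with id as the index, the dataframe can then be written out (the output file name is just an example):
df.set_index('id').to_csv('groups.csv')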

An efficient way to unpack nested JSON into a dataframe

I have a nested JSON and I want to transform it into a pandas dataframe. I was able to normalize it with json_normalize.
However, there are still JSON layers within the dataframe, which I also want to unpack. What is the best way to do this? I will likely have to deal with this a few more times within the project I am currently working on.
The JSON I have is the following:
{
    "data": {
        "allOpportunityApplication": {
            "data": [
                {
                    "id": "111111111",
                    "opportunity": {
                        "programme": {
                            "short_name": "XX"
                        }
                    },
                    "person": {
                        "home_lc": {
                            "name": "NAME"
                        }
                    },
                    "standards": [
                        {
                            "constant_name": "constant1",
                            "standard_option": {
                                "option": "true"
                            }
                        },
                        {
                            "constant_name": "constant2",
                            "standard_option": {
                                "option": "true"
                            }
                        }
                    ]
                }
            ]
        }
    }
}
I used json_normalize:
standards_df = json_normalize(
    standard_json['allOpportunityApplication']['data'],
    record_path=['standards'],
    meta=['id', 'person', 'opportunity']
)
With that I get a dataframe with the columns constant_name, standard_option, id, person and opportunity. The problem is that standard_option, person and opportunity still contain JSON, each with a single value inside.
The current output and the expected output for each column are as follows:
Standard_option
Currently an item in the column "standard_option" looks like:
{'option': 'true'}
I want it to be just true
Person
Currently an item in the column "person" looks like:
{'home_lc': {'name': 'NAME'}}
I want it to look like: NAME
Opportunity
Currently an item in the column "opportunity" looks like:
{'programme': {'short_name': 'XX'}}
I want it to look like: XX
Might not be the best way, but I think it works.
standards_df['person'] = (standards_df.loc[:, 'person']
                          .apply(lambda x: x['home_lc']['name']))
standards_df['opportunity'] = (standards_df.loc[:, 'opportunity']
                               .apply(lambda x: x['programme']['short_name']))
  constant_name  standard_option.option         id  person  opportunity
0     constant1                    true  111111111    NAME           XX
1     constant2                    true  111111111    NAME           XX
standard_option was already fine when I ran your code.
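For completeness, a possible end-to-end version combining the normalization from the question with the flattening above (a sketch; it assumes pandas >= 1.0, where json_normalize is available as pd.json_normalize, and that raw holds the JSON shown above):
import pandas as pd

# raw is the nested JSON from the question
standards_df = pd.json_normalize(
    raw["data"]["allOpportunityApplication"]["data"],
    record_path=["standards"],
    meta=["id", "person", "opportunity"],
)

# "standard_option" is already flattened into the "standard_option.option" column;
# unpack the remaining single-key dicts by hand
standards_df["person"] = standards_df["person"].apply(lambda x: x["home_lc"]["name"])
standards_df["opportunity"] = standards_df["opportunity"].apply(lambda x: x["programme"]["short_name"])

print(standards_df)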

PyMongo: Using $push to update an existing document with a reference to another document

I have a teams collection and a players collection. I am trying to insert documents into the teams collection from the players collection using $push.
Here are the data models for both:
Teams:
{
    "team_id": 1,
    "team_name": team_name,
    "general_manager": general_manager,
    "players": [
    ]
}
Players:
{
    "_id": "5c076550c779ce4fa2d4c9fd",
    "first_name": first_name,
    "last_name": last_name,
}
Here is the code I'm using:
player = players.find_one({"$and": [
    {"first_name": first_name},
    {"last_name": last_name}]})

teams.update(
    {"team_name": team_name},
    {"$push":
        {"players": {
            "$ref": "players",
            "$id": player["_id"],
            "$db": db
        }}})
When I execute this, I get the following error message:
pymongo.errors.WriteError: Found $id field without a $ref before it, which is invalid.
What am I doing wrong? Thanks in advance!
I simplified your queries a bit. Try the code below (explanation in the comments):
# locate the player record
player = players.find_one({"first_name": first_name, "last_name": last_name})

# push it into the "players" array of the team
teams.update_one(
    {"team_name": team_name},
    {"$push": {"players": player}}
)
I used update_one instead of update, as I assume you only need to update one document in the teams collection.
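If you actually want to store a reference rather than an embedded copy of the player document, the bson package that ships with PyMongo provides a DBRef type, which serializes the $ref/$id/$db fields in the required order. A sketch, assuming db holds the database name:
from bson.dbref import DBRef

teams.update_one(
    {"team_name": team_name},
    {"$push": {"players": DBRef("players", player["_id"], db)}}
)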
