How to consider foreign key relations in (pandas) pivot_tables?

With PowerPivot in Excel I am able to
a) create a data model that consists of several connected tables (see example below)
b) create a pivot table based on that model
For display, the pivot table does not use id (=integer) values but the corresponding string values as row/column headers.
With pandas I could
a) load and join those related tables
b) create a pivot table based on the joined table
pivot_table = pandas.pivot_table(
    joined_table,
    index=["scenario_name"],  # entries to show as row headers
    columns="param_name",     # entries to show as column headers
    values="value",           # entries to aggregate and show as cells
    aggfunc=numpy.sum,        # aggregation function(s)
)
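For context, step (a) could be done with merge; a minimal sketch, using the example tables shown further below:
import pandas as pd

# Sketch of step (a): join both lookup tables onto the fact table so that
# scenario_name and param_name are available for the pivot.
joined_table = (
    data_table
    .merge(scenario_table.rename(columns={'id': 'scenario_id',
                                          'name': 'scenario_name'}),
           on='scenario_id')
    .merge(param_table.rename(columns={'id': 'param_id',
                                       'name': 'param_name'}),
           on='param_id')
)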
However, with huge tables I would expect it to be more efficient if the pivot table operated on the non-joined data table and applied the string values only for result display.
=> Is there a convenient way to consider foreign key relations when using pandas DataFrame and pivot_table?
I would expect something like
pivot_table = pandas.pivot_table(
    {"data": data_table,
     "scenario": scenario_table,
     "param": param_table},
    index=["scenario:name"],  # entries to show as row headers
    columns="param:name",     # entries to show as column headers
    values="data:value",      # entries to aggregate and show as cells
    aggfunc=numpy.sum,        # aggregation function(s)
)
=> If not, are there alternative libraries to pandas that can handle related tables as the source for pivot tables?
Small example table structure:
table "data"
id scenario_id param_id value
1 1 1 100
2 1 2 200
table "scenario"
id name
1 reference
2 best_case
table "param"
id name
1 solar
2 wind
scenario_id of table data points on id of table scenario
param_id of table data points on id of table param
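Even though pandas has no built-in notion of foreign keys, the behavior asked for above can be approximated: pivot on the raw id columns and map the display names onto the result only afterwards. A minimal sketch using the example tables above:
import pandas as pd
import numpy as np

# Pivot on the integer foreign keys first ...
pivot = pd.pivot_table(
    data_table,
    index='scenario_id',
    columns='param_id',
    values='value',
    aggfunc=np.sum,
)
# ... then replace the ids by their names for display only.
pivot.index = pivot.index.map(scenario_table.set_index('id')['name'])
pivot.columns = pivot.columns.map(param_table.set_index('id')['name'])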
Another example with some more columns:

Related

Joining portions of a python dictionary using a reference dataframe

I have a dictionary of dataframes with keys that look like this. It's called frames1.
dict_keys(['TableA','TableB','TableC','TableD'])
I also have a 'master' dataframe that tells me how to join these dataframes.
Gold Table  Silver Table 1  Silver Table 2  Join Type  Left_Attr  Right_Attr
System      Table A         Table B         left       ID         applic_id
System      Table C         Table A         right      fam        famid
System      Table A         Table D         left       NameID     name
The "System" gold table is the combination of all 3 rows. In other words, I need to join Table A to Table B on the attributes listed and then use that output as my NEW Table A when I join Table C and Table A in row 2. Then I need to use that table to as my NEW Table A to join to Table D. This creates the final "System" Table.
What I've tried:
for i in range(len(master)):
    System = pd.merge(frames1[master.iloc[i, 1]], frames1[master.iloc[i, 2]],
                      how=master.iloc[i, 3],
                      left_on=master.iloc[i, 4], right_on=master.iloc[i, 5])
This only applies one row of the master table at a time, and each iteration overwrites the previous result. How would I go about creating a for loop to join these together?
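One way to chain the merges is to reuse each result as the new "Table A". A sketch, assuming the frames1 keys match the master entries with spaces removed (e.g. 'TableA') and that Table A takes part in every row, as in the example:
import pandas as pd

tables = dict(frames1)                      # copy so the originals stay intact
for i in range(len(master)):
    left = tables[master.iloc[i, 1].replace(' ', '')]
    right = tables[master.iloc[i, 2].replace(' ', '')]
    merged = pd.merge(left, right,
                      how=master.iloc[i, 3],
                      left_on=master.iloc[i, 4],    # note: left_on/right_on,
                      right_on=master.iloc[i, 5])   # not on_left/on_right
    tables['TableA'] = merged               # the result is the new "Table A"
System = tables['TableA']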

How do I insert data in selective columns using PySpark?

I have a table on Redshift to which I want to insert some data using a pyspark dataframe. The Redshift table has this schema:
CREATE TABLE admin.audit_of_all_tables
(
    wh_table_name        varchar,
    wh_schema_name       varchar,
    wh_population_method integer,
    wh_audit_date        timestamp without time zone,
    wh_percent_change    numeric(15,5),
    wh_s3_path           varchar
)
DISTSTYLE AUTO;
In my dataframe, I want to keep values for only the first 4 columns and write that dataframe's data to this table.
My dataframe is something like this:
Now, I want to do df.write.format to my table on Redshift, but I need to somehow specify that I want to insert data into only the first four columns and pass no value for the last 2 columns (keeping them null by default).
Any idea how to specify this using dataframe.write.format (or any other method)?
Thanks for reading.
You can use selectExpr to select the first four columns plus two additional null columns cast to the required types:
df2 = df.selectExpr("table_name as wh_table_name",
                    "schema_name as wh_schema_name",
                    "population_method as wh_population_method",
                    "audit_date as wh_audit_date",
                    "cast(null as double) as wh_percent_change",
                    "cast(null as string) as wh_s3_path")
df2.write....
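The same result can also be expressed with the DataFrame API instead of SQL expressions, using lit(None) casts; a sketch, with the column names from the answer above:
from pyspark.sql import functions as F

# rename the four populated columns, then append two typed null columns
df2 = (df.withColumnRenamed('table_name', 'wh_table_name')
         .withColumnRenamed('schema_name', 'wh_schema_name')
         .withColumnRenamed('population_method', 'wh_population_method')
         .withColumnRenamed('audit_date', 'wh_audit_date')
         .withColumn('wh_percent_change', F.lit(None).cast('double'))
         .withColumn('wh_s3_path', F.lit(None).cast('string')))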

Pandas convert data from two tables into third table. Cross Referencing and converting unique rows to columns

I have the following tables:
Table A
listData = {'id': [1, 2, 3], 'date': ['06-05-2021', '07-05-2021', '17-05-2021']}
tableA = pd.DataFrame(listData, columns=['id', 'date'])
Table B
detailData = {'code': ['D123', 'F268', 'A291', 'D123', 'F268', 'A291'],
              'id': ['1', '1', '1', '2', '2', '2'],
              'stock': [5, 5, 2, 10, 11, 8]}
tableB = pd.DataFrame(detailData, columns=['code', 'id', 'stock'])
OUTPUT TABLE
output = {'code': ['D123', 'F268', 'A291'],
          '06-05-2021': [5, 5, 2],
          '07-05-2021': [10, 11, 8]}
pd.DataFrame(output, columns=['code', '06-05-2021', '07-05-2021'])
Note: the code above is hard-coded to produce the output. I need to generate the output table from Table A and Table B.
Here is a brief explanation of how the output table is generated, in case it is not self-explanatory:
The id column needs to be cross-referenced from Table A to Table B, substituting the corresponding dates into Table B.
Then all the unique dates in Table B should be made into columns, and the corresponding stock values moved to the newly created date columns.
I am not sure where to start. I am new to pandas and have only ever used it for simple data manipulation. If anyone can suggest where to get started, it would be of great help.
Try:
tableA['id'] = tableA['id'].astype(str)
tableB.merge(tableA, on='id').pivot(index='code', columns='date', values='stock')
Output:
date  06-05-2021  07-05-2021
code
A291           2           8
D123           5          10
F268           5          11
Details:
First, merge on id; this is like doing a SQL join. The dtypes must match, hence the astype conversion to str.
Next, reshape the dataframe using pivot to get code by date.
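Note that pivot raises an error if the same (code, date) pair occurs more than once; if the real data can contain such duplicates, pivot_table with an aggregation function is the safer variant. A sketch:
# pivot_table tolerates duplicate (code, date) pairs by aggregating them
tableB.merge(tableA, on='id').pivot_table(index='code', columns='date',
                                          values='stock', aggfunc='sum')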

Grist: lookup value from another table

Attempting to learn Grist, and I don't know Python... but I'm willing to learn. Just trying to draw the lines between Python and formulas.
I have a table that has "Items": fields named "ProductID" "collection" & "buyer"
There is another table that is named "Sales": fields named "Sku"(same as ProductID) "Qty" "Cost" "Sales" "Date"
I would like to create another table, that consolidates the data into one document (since all of sales may not be in all of items, and sales has a ton of duplicates due to the date the transaction occurred.)
Something like: "Sku" "Buyer" "collection" "Qty" "Cost" "Sales" "margin"(formula to calculate)
"Sales" would need to be the root table, and reference the "items" table for more information.
If my data were smaller, in Excel I would:
copy the SKUs, paste them into a new tab, remove duplicates, and run a SUMIFS.
For example, if the formula is in cell B1 and the SKU is in A1:
=Sumifs(sales!$Qty, sales!$sku, A1)
Then I would run an INDEX/MATCH on items in C1, for example:
=index(items!$Buyer, match(a1, Items!$ProductID, 0), 1)
(Very late, but answering in case it helps others.)
It sounds like the resulting table should have one record per Sku (aka ProductId). You can do it in two ways: as another view of the Items table or as a summary of the Sales table grouped by Sku.
In either case, you can pull in the sum of Qty or Cost as needed.
In case of a summary table, such sums get included automatically (more on that in https://support.getgrist.com/summary-tables/#summary-formulas).
If you base it on the Items table, you can look up relevant Sales records by adding a formula column (e.g. named sales) with formula:
Sales.lookupRecords(Sku=$ProductID)
Then you can easily add sums of Qty and Cost as columns with formulas like SUM($sales.Qty) or SUM($sales.Cost).
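For the margin column mentioned in the question, a formula column along these lines could work on top of that setup; a sketch, given that Grist formulas are Python expressions (the Sales and Cost field names are taken from the question):
# formula for a "margin" column, building on the $sales lookup column above
SUM($sales.Sales) - SUM($sales.Cost)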

How to add values from one column to another table using join?

I am having difficulties merging 2 tables. In fact, I would like to add a column from Table B into Table A based on one key.
Table A (632 rows) contains the following columns:
part_number / part_designation / AC / AC_program
Table B (4,674 rows) contains the following columns:
part_ref / supplier_id / supplier_name / ac_program
I would like to add the supplier_name values into Table A
I have succeeded in compiling a left join based on the condition tableA.part_number == tableB.part_ref.
However, when I look at the resulting table, additional rows were created: I now have 683 rows instead of the initial 632 rows of Table A. How do I keep the same number of rows while including the supplier_name values in Table A? Below is a graph of my transformations:
Here is my code:
Table B seems to contain duplicates in part_ref. The join operation creates a new record in your original table for each duplicate in Table B. You can verify this by comparing the number of unique part_ref values with the number of rows:
import pandas as pd
# if these two numbers differ, part_ref contains duplicates
print(len(pd.unique(updated_ref_table.part_ref)))
print(updated_ref_table.shape[0])
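One way to keep the original 632 rows is therefore to de-duplicate Table B on the join key before merging. A sketch (the tableA and tableB variable names are assumptions):
# drop duplicate part_ref rows so the left join cannot fan out Table A
tableB_unique = tableB.drop_duplicates(subset='part_ref')
result = tableA.merge(tableB_unique[['part_ref', 'supplier_name']],
                      left_on='part_number', right_on='part_ref',
                      how='left')
assert len(result) == len(tableA)   # still 632 rows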
