Prepping data for sankey plot in plotly - python

I have data as follows:
df = pd.DataFrame({'Id': [1, 2, 3, 4], 'ColA': [30, 20, 20,30], 'ColB':[50, 20, 30,70],
'ColC':[70, 30, 20,80]})
I want to prepare this for a sankey plot using plotly. I am not sure how to do the same. What I want to plot is essentially, Id's as bases and the column values as levels in the data. Added the image for reference.

There are too many weights for the suggested nodes. In a Sankey diagram, the weights are shown as the widths of the edges between nodes.
If your Sankey has 12 nodes, only connected row-wise, then you can only have 9 weights. (One weight between each of the rectangles in your picture.) I arbitrarily chose to use the first three rows of your data, for columns a, b, and c.
Nodes
First, I created node labels, one for each of the 12. I used your column & row names and concatenated them. Notice that the output of this list is in order first by columns, then by rows. You don't have to do it this way, but you do need to refer to the node labels by index position, so putting them in a specific order now makes it easier later.
Next, you need to create a source, target, and weight list.
Source, Target, and Weights
Your source and target list are index positions of the from and to nodes. So the first edge is between ColA id1 and ColA id2; therefore, the first source list element is 0; the first target list element is 1.
Since the 4th (index 3) element is the last in the row and the rows aren't connected, I dynamically created the lists of values for source and target, then removed the elements that would link the end of one row to the beginning of the next (i.e., ColA id4 does not connect to ColB id1).
(If this doesn't make sense, comment out the code that contains .remove(), then plot. It should clarify things!)
For the weights, I used .iloc to extract the columns other than id, and rows 1–3. To make this into a list, first I made it a numpy array, transposed it (rows to columns, columns to rows), then flattened it into a list. (This was done so it would be in the same order as the nodes list.)
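The ordering described above (all of ColA's values, then ColB's, then ColC's) can be checked without NumPy. This is a minimal sketch in plain Python using the first three rows of the question's data; the helper name column_major is made up for this illustration.

```python
def column_major(rows):
    """Flatten a list of rows column by column (transpose, then flatten)."""
    return [value for column in zip(*rows) for value in column]

# rows 1-3 of the data frame, without the Id column: [ColA, ColB, ColC]
rows = [
    [30, 50, 70],
    [20, 20, 30],
    [20, 30, 20],
]
weights = column_major(rows)
print(weights)  # [30, 20, 20, 50, 20, 30, 70, 30, 20]
```

The first three weights belong to the ColA edges, the next three to ColB, and the last three to ColC, which matches the order of the node labels.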
Finally...
Finally, a plot.
import plotly.graph_objects as go
import pandas as pd
import numpy as np
df = pd.DataFrame(
{'Id': [1, 2, 3, 4], 'ColA': [30, 20, 20, 30], 'ColB':[50, 20, 30,70],
'ColC':[70, 30, 20,80]})
# more weights than edges...using the first 3 rows for weights
# using depicted col/row names as nodes:
nds = [rws + str(cls) for rws in ["colA", "colB", "colC"] for cls in range(1, 5) ]
print(nds)
# ['colA1', 'colA2', 'colA3', 'colA4', 'colB1', 'colB2', 'colB3', 'colB4',
# 'colC1', 'colC2', 'colC3', 'colC4']
# notice that index 0, is col 1, row 1—the first source entry
# where's it going to? that's your target
sre = list(range(0, 11)) # source list
sre.remove(3) # remove row connectors
sre.remove(7)
trg = list(range(1, 12)) # target list
trg.remove(4)
trg.remove(8)
# now the weights—the values in the data frame (leave out id, last row)
wts = df.iloc[0:3, 1:4].to_numpy()
# transpose, so columns/rows are swapped, then flatten to a list
wts = np.transpose(wts).flatten()
# plot it
fig = go.Figure([go.Sankey(
node = dict(label = nds), link = dict(source = sre, target = trg, value = wts)
)])
fig.show() # gimme
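The .remove() calls above are specific to 3 columns of 4 nodes. As a sketch of how the same source and target lists could be built for other sizes, assuming the nodes are numbered chain by chain as in nds (chain_links is a hypothetical helper, not part of plotly):

```python
def chain_links(n_chains, chain_len):
    """Link each node to the next node in its own chain, never across chains."""
    source, target = [], []
    for chain in range(n_chains):
        start = chain * chain_len  # index of the first node in this chain
        for step in range(chain_len - 1):
            source.append(start + step)
            target.append(start + step + 1)
    return source, target

src, tgt = chain_links(3, 4)
print(src)  # [0, 1, 2, 4, 5, 6, 8, 9, 10]
print(tgt)  # [1, 2, 3, 5, 6, 7, 9, 10, 11]
```

For 3 chains of 4 nodes this gives the same lists as sre and trg above, so they could be passed to link = dict(source=..., target=..., value=...) unchanged.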

How to extract the labels from sns.clustermap

If I'm plotting a (correlation) dataframe with sns.clustermap, it automatically takes the dataframe's MultiIndex as labels and plots them to the right of and below the clustermap.
How do I access these labels? I'm using clustermaps as an exploratory tool for large-ish datasets (100-200 entries) and I need the names for the entries in various clusters.
EXAMPLE:
elev = [1, 100, 10, 1000, 100, 10]
number = [1, 2, 3, 4, 5, 6]
name = ['foo', 'bar', 'baz', 'qux', 'quux', 'quuux']
idx = pd.MultiIndex.from_arrays([name, elev, number],
names=('name','elev', 'number'))
data = np.random.rand(20,6)
df = pd.DataFrame(data=data, columns=idx)
clustermap = sns.clustermap(df.corr())
gives a clustermap (image not shown).
Now I'd say that there are two distinct clusters: the first two rows and the last four rows, so [foo-1-1, bar-100-2] and [baz-10-3, qux-1000-4, quux-100-5, quuux-10-6].
How can I extract these (or the whole [foo-1-1, bar-100-2, baz-10-3, qux-1000-4, quux-100-5, quuux-10-6] list)? With 100+ entries, just writing them down by hand isn't really an option.
The documentation offers clustergrid.dendrogram_row.reordered_ind, but that just gives me the index numbers in the original dataframe. I'm looking for something more like the output of df.columns.
With this, it seems to me that I'm heading in the right direction, but I can only extract which cluster a given row belongs to when I let it form clusters automatically, whereas I'd like to define the clusters myself, visually.
As always with such things, the answer is out there, I just overlooked it.
This answer (pointed out by Trenton McKinney in comments) has the needed snippet in it:
ax_heatmap.yaxis.get_majorticklabels()
(I wouldn't have looked into ax_heatmap to get to that...). So, continuing the MWE from the question:
labels = clustermap.ax_heatmap.yaxis.get_majorticklabels()
However, that's a list of
type(labels[0])
matplotlib.text.Text
so unless I'm missing something (again), it's not exactly straightforward to use. However, that can simply be looped into something more useful. Let's say I'm interested in the whole name (i.e. the complete former df MultiIndex) and the number:
labels_list = []
number_list = []
for i in labels:
    i = str(i)
    name_start = i.find('\'')+1
    name_end = i.rfind('\'')
    name = i[name_start:name_end]
    number_start = name.rfind('-')+1
    number = name[number_start:]
    number = int(number)
    labels_list.append(name)
    number_list.append(number)
Now I've got two easily workable lists, one with full strings and one with ints.
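Each tick label is a matplotlib Text object, and its get_text() method returns the label string directly, so searching for quotes in str(i) may not be needed. Below is a sketch assuming the labels look like 'foo-1-1' (the MultiIndex levels joined with '-'). The helper split_labels is made up for this illustration, and the strings are hard-coded so the example runs without seaborn.

```python
def split_labels(label_texts):
    """Return the full label strings and the trailing integer of each label."""
    names, numbers = [], []
    for text in label_texts:
        names.append(text)
        # rsplit from the right, so names that contain '-' still work
        numbers.append(int(text.rsplit("-", 1)[1]))
    return names, numbers

# With a real clustermap these strings would come from something like:
#   [t.get_text() for t in clustermap.ax_heatmap.yaxis.get_majorticklabels()]
texts = ["baz-10-3", "qux-1000-4", "foo-1-1"]
names, numbers = split_labels(texts)
print(names)    # ['baz-10-3', 'qux-1000-4', 'foo-1-1']
print(numbers)  # [3, 4, 1]
```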

Retrieve original index of sequentially removed column (a row is also removed) of a matrix in Julia or Python

I want to retrieve the original index of the column with the largest sum at each iteration after the previous column with the largest sum is removed. Meanwhile, the row of the same index of the deleted column is also deleted from the matrix at each iteration.
For example, in a 10 by 10 matrix, the 5th column has the largest sum, hence the 5th column and row are removed. Now the matrix is 9 by 9 and the sum of columns is recalculated. Suppose the 6th column has the largest sum, hence the 6th column and row of the current matrix are removed, which is the 7th in the original matrix. Do this iteratively until the desired number of columns index is preserved.
My code in Julia that does not work is pasted below. Step two in the for loop is not correct because a row is removed at each iteration, thus the sum of columns are different.
Thanks!
# a matrix of random numbers
mat = rand(10, 10);
# column sum of the original matrix
matColSum = sum(mat, dims=1);
# iteratively remove columns with the largest sum
idxColRemoveList = [];
matTemp = mat;
for i in 1:4 # Suppose 4 columns need to be removed
    # 1. find the index of the column with the largest column sum at current iteration
    sumTemp = sum(matTemp, dims=1);
    maxSumTemp = maximum(sumTemp);
    idxColRemoveTemp = argmax(sumTemp)[2];
    # 2. record the original index of the removed scenario
    idxColRemoveOrig = findall(x->x==maxSumTemp, matColSum)[1][2];
    push!(idxColRemoveList, idxColRemoveOrig);
    # 3. update the matrix. Note that the corresponding row is also removed.
    matTemp = matTemp[Not(idxColRemoveTemp), Not(idxColRemoveTemp)];
end
Python solution:
import numpy as np
mat = np.random.rand(5, 5)
n_remove = 3
original = np.arange(len(mat)).tolist()
removed = []
for i in range(n_remove):
    col_sum = np.sum(mat, axis=0)
    col_rm = np.argsort(col_sum)[-1]
    removed.append(original.pop(col_rm))
    mat = np.delete(np.delete(mat, col_rm, 0), col_rm, 1)
print(removed)
print(original)
print(mat)
I'm guessing the problem you had was keeping track of which index in the original array each current column/row corresponds to. I just used a list [0, 1, 2, ...] and popped one value from it in each iteration.
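The index bookkeeping can also be shown without NumPy. The sketch below is a plain-Python variant, not the code above: instead of deleting anything it keeps a list of surviving original indices (alive) and sums only over those. The function name and the small test matrix are made up for illustration.

```python
def remove_largest_columns(mat, n_remove):
    """Repeatedly drop the column (and matching row) with the largest column sum.

    Returns the original indices of the removed columns, in removal order,
    and the original indices that survive.
    """
    alive = list(range(len(mat)))  # original indices still in the matrix
    removed = []
    for _ in range(n_remove):
        # column sums computed over the surviving rows only
        sums = {j: sum(mat[i][j] for i in alive) for j in alive}
        largest = max(alive, key=sums.get)
        alive.remove(largest)
        removed.append(largest)
    return removed, alive

mat = [
    [1, 2, 1],
    [9, 1, 1],
    [1, 9, 3],
]
removed, alive = remove_largest_columns(mat, 2)
print(removed)  # [1, 2]
print(alive)    # [0]
```

Column 1 goes first (sum 12). Once row 1 is gone, column 0 drops from 11 to 2, so column 2 (sum 4) is removed next. This is the recalculation that the Julia attempt in the question misses.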
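A simpler way to code the problem would be to overwrite the elements of the selected column (and the matching row) with a very small number instead of deleting them. This avoids "sort" and "pop", and because nothing is deleted the column indices never shift, so argmax always returns the original index. Note that this relies on the entries being non-negative, as they are with np.random.rand.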
import numpy as np
n = 1000
mat = np.random.rand(n, n)
n_remove = 500
removed = []
for i in range(n_remove):
    # get sum of each column
    col_sum = np.sum(mat, axis=0)
    col_rm = np.argmax(col_sum)
    # record the column ID
    removed.append(col_rm)
    # overwrite the col_rm-th column and row with a tiny value
    mat[:, col_rm] = 1e-10
    mat[col_rm, :] = 1e-10
print(removed)

How to only iterate certain positions in the Itertool combinations

I am working on a python project which iterates through all the possible combinations of entries in a row of excel data to find which combination produces the correct output.
To achieve this, I am iterating through different combinations of 0 and 1 to choose whether that entry is required for the combination. 1 meaning data point is included in the calculation and 0 meaning the data point is not included.
The number of combinations would thus be equal to 2 ^ (Number of excel columns)
Example Excel Data:
1, 22, 7, 11, 2, 4
Example Iteration:
(1, 0, 0, 0, 1, 0)
I could be looking for what combination of the excel data would result in an output of 3, the only correct combination of the excel data being the above iteration.
However, I would know that any value greater than 3 would not be included in a possible combination that would equal 3. As such I would like to choose and set the values of these columns to 0 and iterate the other columns only. This would in turn reduce the number of combinations.
Combination = 2 ^ (Number of excel columns - Fixed Entry Columns)
At the moment I am using itertools.product to get all the combinations I need:
Numbers = ["0","1"]
for item in itertools.product(Numbers, repeat=len(df.columns)):
    Iteration = pd.DataFrame(item) # Iteration e.g (0,1,1,1,0,0,1)
    Data = df.iloc[0] # Excel data row
    Data = Data.to_numpy()
    Iteration = Iteration.astype(float)
    Answer = np.dot(Data, Iteration) # Get the result of (Iteration * Data) to check if answer is correct
This results in iterating through combinations which I know will not work.
Is there a way to only iterate 0's and 1's in certain positions of the combination while keeping the known entries a fixed value (either 0 or 1) to reduce the possible combinations?
Some of the Excel files have over 25 columns, which would result in 33,554,432 (2^25) combinations. That is why I am trying to reduce the number of columns I need to iterate over by fixing the values of the columns I already know.
If you need further clarification, please let me know. I am a novice programmer, so I may be overlooking or overcomplicating a simple solution.
Find which columns meet your criteria for exclusion. Then just get the product combinations for the other columns.
One possible method:
from itertools import product
LIMIT=10
column_data = [1, 22, 7, 11, 2, 4]
changeable_indexes = [i for i,x in enumerate(column_data) if x <= LIMIT]
for item in product([0,1], repeat=len(changeable_indexes)):
    row_iteration = [0] * len(column_data)
    for index, value in zip(changeable_indexes, item):
        row_iteration[index] = value
    print(row_iteration)
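Putting the pieces together for the question's example (target 3), a sketch might look like the following. It assumes all values are non-negative, so anything larger than the target can safely be fixed at 0. The function name matching_masks is made up for illustration.

```python
from itertools import product

def matching_masks(values, target):
    """Yield 0/1 masks whose selected values sum to target.

    Positions holding a value greater than target are fixed at 0,
    so only the remaining positions are iterated.
    """
    free = [i for i, v in enumerate(values) if v <= target]
    for bits in product([0, 1], repeat=len(free)):
        mask = [0] * len(values)
        for i, bit in zip(free, bits):
            mask[i] = bit
        if sum(v * m for v, m in zip(values, mask)) == target:
            yield tuple(mask)

print(list(matching_masks([1, 22, 7, 11, 2, 4], 3)))
# [(1, 0, 0, 0, 1, 0)]
```

Only the positions holding 1 and 2 are free here, so 2^2 = 4 masks are checked instead of 2^6 = 64.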

Sort array values for a particular slice from 3d DataArray

Summary: Given a 3D array, how can I slice at two particular co-ordinates and then sort on the VALUES of the 3rd dimension, retaining index information?
Preamble:
I am trying to compare the cost of shopping baskets for customers buying a combination of apples and bananas. I know our competitors' unit costs for these fruits, and depending on what cost I choose, I can be cheaper or more expensive. I would like to be able to rank my basket cost for a particular combination (e.g. 3 apples and 15 bananas) against my competitors.
I've tried to include all the relevant code but the real salient point is at the end.
1) Building a function which takes in a price point for apples and bananas, and returns a grid of order cost:
apple_range = np.arange(1, 12, 1)
banana_range = np.arange(5, 30, 5)
def order_costs(no_apples, no_bananas, apple_cost=None, banana_cost=None):
    return (no_apples * apple_cost) + (no_bananas * banana_cost)
fv = np.vectorize(order_costs, excluded=['apple_cost', 'banana_cost'])
2) My competitors' pricing as a dataframe, and then a 3D numpy array with the 'depth' axis used for each competitor
fruit_prices = pd.DataFrame(
data = [[1,2], [3,4], [5,6]],
index = ['A', 'B', 'C'],
columns = ['apple_cost', 'banana_cost'],
)
order_costs_dict = {}
for idx, row in fruit_prices.iterrows():
    order_costs_dict[idx] = fv(apple_range[:, np.newaxis], banana_range, **dict(row))
order_costs = np.dstack(list(order_costs_dict.values()))
3) Convert the data into a DataArray
bvs_dataset = xr.Dataset(
{'order_costs':(['apples', 'bananas', 'supplier'], order_costs)},
coords = {'apples': (['apples'], apple_range),
'bananas': (['bananas'], banana_range),
'supplier': (['supplier'], list(order_costs_dict.keys()))}
)
bvs_array = bvs_dataset.to_array()
Now I make my selection, I want to know the costs of ordering 1 apple and 5 bananas
4)
selection = bvs_array.sel(apples=1, bananas=5)
selection
QUESTION:
Assuming these results aren't ordered ascending, how can I
1) Sort them according to order_costs, whilst retaining the information in the 'index' (Supplier name, A, B or C)
2) Find the rank of my corresponding order cost e.g. if my order costs 19 then this will return 2.
I have tried the sortby() method on my selection but if I pass 'order_costs' as the variable, I receive KeyError. Sorting by 'variable' doesn't seem to have the right effect, although doesn't raise an error.
What am I doing wrong?
I think I found my answer.
1) Make my selection 1 dimensional
selection = selection[0]
2) Reindex by the argsorted variable
selection = selection[selection.variable.argsort()]
3) Now selection should be sorted, and you have the indices to look at the supplier column too.
I had a look at the indices returned by argsort() and they didn't appear to match order_value order, but when I actually used it, it gave me the right answer.
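The two things asked for (sorting while keeping the supplier label, and ranking a given cost) can be illustrated independently of xarray. The sketch below uses plain Python rather than the argsort approach above. The costs are those of 1 apple and 5 bananas under the price table from the question, and rank_of is a made-up helper.

```python
from bisect import bisect_left

# order costs for 1 apple and 5 bananas, from the competitor price table
costs = {"C": 1 * 5 + 5 * 6, "A": 1 * 1 + 5 * 2, "B": 1 * 3 + 5 * 4}

# sort (supplier, cost) pairs by cost so the label travels with the value
ranked = sorted(costs.items(), key=lambda pair: pair[1])
print(ranked)  # [('A', 11), ('B', 23), ('C', 35)]

def rank_of(my_cost, ranked_pairs):
    """1-based position my_cost would take among the sorted competitor costs."""
    sorted_costs = [cost for _, cost in ranked_pairs]
    return bisect_left(sorted_costs, my_cost) + 1

print(rank_of(19, ranked))  # 2
```

An order cost of 19 falls between suppliers A (11) and B (23), so it ranks second, as in the question's example.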

one-hot encoding, access list elements

I have a .csv file with data, and I want to transform some of its columns to one-hot. The problem occurs in the second-to-last line, where the one-hot index (e.g. 1st feature) gets set in all rows instead of just the one I am currently in.
It seems to be some problem with how I access the 2D list... any suggestions?
thank you
def one_hot_encode(data_list, column):
    one_hot_list = [[]]
    different_elements = []
    for row in data_list[1:]: # count different elements
        if row[column] not in different_elements:
            different_elements.append(row[column])
    for i in range(len(different_elements)): # set variable names
        one_hot_list[0].append(different_elements[i])
    vector = [] # create list shape with zeroes
    for i in range(len(different_elements)):
        vector.append(0)
    for i in range(1460):
        one_hot_list.append(vector)
    ind_row = 1 # encode 1 for each sample
    for row in data_list[1:]:
        index = different_elements.index(row[column])
        one_hot_list[ind_row][index] = 1 # mistake!! sets all rows to 1
        ind_row += 1
Your problem stems from the vector object you're creating to do the one-hot encoding; you've created one object, and then built a one_hot_list that contains 1460 references to the same object. When you make a change in one of the rows, it will be reflected in all of the rows.
A quick solution would be to create a separate copy of the vector for each row (see "How to clone or copy a list?"):
one_hot_list.append(vector[:])
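The aliasing described above can be reproduced in a few lines. This is a small standalone sketch with made-up variable names, separate from the question's code:

```python
zeros = [0, 0, 0]

# every row is the *same* list object, so one write shows up in all rows
shared = [zeros for _ in range(3)]
shared[0][1] = 1
print(shared)  # [[0, 1, 0], [0, 1, 0], [0, 1, 0]]

# slicing with [:] makes a new list for each row
template = [0, 0, 0]
independent = [template[:] for _ in range(3)]
independent[0][1] = 1
print(independent)  # [[0, 1, 0], [0, 0, 0], [0, 0, 0]]
```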
Some of the other things you're doing in your function are a bit slow or roundabout. I'd suggest a few changes:
def one_hot_encode(data_list, column):
    one_hot_list = [[]]
    # count different elements
    different_elements = set(row[column] for row in data_list[1:])
    # convert different_elements to a list with a canonical order,
    # store in the first element of one_hot_list
    one_hot_list[0] = sorted(different_elements)
    vector = [0] * len(different_elements) # create list shape with zeroes
    one_hot_list.extend([vector[:] for _ in range(1460)])
    # build a mapping of different_element values to indices into
    # one_hot_list[0]
    index_lookup = dict((e,i) for (i,e) in enumerate(one_hot_list[0]))
    # encode 1 for each sample
    for rindex, row in enumerate(data_list[1:], 1):
        cindex = index_lookup[row[column]]
        one_hot_list[rindex][cindex] = 1
This builds different_elements in linear time by using the set data type, and uses list comprehensions to produce the values for one_hot_list[0] (the list of element values which are being one-hot encoded), the zero vector, and one_hot_list[1:] (which is the actual one-hot-encoded matrix value). Also, there's a dict called index_lookup that lets you quickly map element values onto their integer index, instead of searching for them over and over again. Finally, your row index into the one_hot_list matrix can be managed for you by enumerate.
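As a runnable illustration of the same ideas, here is a compact variant. It is a sketch under two assumptions that differ from the code above: the number of rows is taken from data_list rather than hard-coded as 1460, and the result is returned. The sample data is invented.

```python
def one_hot_encode(data_list, column):
    """One-hot encode one column of a row list whose first row is a header."""
    rows = data_list[1:]  # skip the header row
    categories = sorted({row[column] for row in rows})
    index_lookup = {value: i for i, value in enumerate(categories)}
    encoded = [categories]  # first entry lists the category names
    for row in rows:
        vector = [0] * len(categories)  # a fresh list for every row
        vector[index_lookup[row[column]]] = 1
        encoded.append(vector)
    return encoded

data = [["id", "colour"], [1, "red"], [2, "blue"], [3, "red"]]
result = one_hot_encode(data, 1)
print(result)  # [['blue', 'red'], [0, 1], [1, 0], [0, 1]]
```

Because a new vector is built inside the loop, no two rows can share the same list object.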
I'm not 100% sure of what you are trying to do but the problem you are seeing is in these lines:
for i in range(1460):
    one_hot_list.append(vector)
These lines build one_hot_list from 1460 references to the same vector of zeros, whereas I think you want a new vector each time. A direct fix would be to copy it each time:
for i in range(1460):
    one_hot_list.append(vector[:])
But a more Pythonic approach would be to create the list with a comprehension. Perhaps something like this:
vector_size = len(different_elements)
one_hot_list = [[0] * vector_size for i in range(1460)]
You can use set() to collect the unique items in the column:
different_elements = list(set(row[column] for row in data_list[1:]))
I suggest you save yourself the hassle of re-implementing this in plain Python. You can use pandas.get_dummies for this:
First some test data (test.csv):
A
Foo
Bar
Baz
Then in Python:
import pandas as pd
df = pd.read_csv('test.csv')
# convert column 'A' to one-hot encoding
pd.get_dummies(df['A'])
You can retrieve the underlying numpy array using:
pd.get_dummies(df['A']).values
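Which results in (with pandas versions before 2.0; newer versions default to a bool dtype, so pass dtype='uint8' to get_dummies to reproduce this exactly):
array([[0, 0, 1],
       [1, 0, 0],
       [0, 1, 0]], dtype=uint8)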
