I'm trying hard to build a treemap with Plotly.
The main difficulty I have is that the sub-categories don't fill the map. I think there is a problem in my data structure. Thanks for any idea you could have.
My source dataframe looks like this:
id parent value color
0 F A1 20 0.298782
1 F A2 10 0.030511
2 F B1 35 0.562464
3 F B2 45 0.778931
4 F C1 30 0.308459
5 F C2 46 0.505771
6 M A1 24 0.242964
7 M A2 6 0.604043
8 M B1 24 0.279880
9 M B2 57 0.269249
10 M C1 82 0.914589
11 M C2 61 0.827076
12 A1 Pat A 44 0.896741
13 A2 Pat B 16 0.112626
14 B2 Pat A 102 0.024187
15 B1 Pat B 59 0.462012
16 C1 Pat A 112 0.003501
17 C2 Pat B 107 0.614476
18 Pat A total 258 0.150514
19 Pat B total 182 0.698287
20 total NaN 440 0.744805
I used the following code:
fig = go.Figure(go.Treemap(
    ids=df_all_trees['id'],
    labels=df_all_trees['id'],
    parents=df_all_trees['parent'],
    values=df_all_trees['value'],
    # branchvalues='total',
    marker=dict(
        colors=df_all_trees['value'],
        colorscale='RdBu',
        cmid=average_score),
    hovertemplate='<b>%{label} </b> <br> Sales: %{value}<br> Success rate: %{color:.2f}',
    name=''
))
fig.show()
and obtain a treemap where the sub-category rectangles don't fill their parents.
What I would like: something "fully mapped", where each branch is completely tiled by its children.
It turns out all that was needed was to uncomment branchvalues='total'. I hope nobody wasted time on this.
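For anyone landing here later, a sketch of the corrected call, assuming df_all_trees and average_score (presumably the mean of the color column) are defined as above. Two things to note: branchvalues='total' requires every parent's value to equal the sum of its children's values, which this data already satisfies, and the color column, rather than value, appears to be what the hovertemplate's "Success rate" expects:

import plotly.graph_objects as go

fig = go.Figure(go.Treemap(
    ids=df_all_trees['id'],
    labels=df_all_trees['id'],
    parents=df_all_trees['parent'],
    values=df_all_trees['value'],
    branchvalues='total',  # parents' values are the totals of their children
    marker=dict(
        colors=df_all_trees['color'],  # success rate drives the colorscale
        colorscale='RdBu',
        cmid=average_score),
    hovertemplate='<b>%{label} </b> <br> Sales: %{value}<br> Success rate: %{color:.2f}',
    name=''
))
fig.show()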
I want to subtract many columns from many columns in a data frame.
My code:
df =
A1 B1 A2 B2
0 15 30 50 70
1 25 40 60 80
# I have many columns like this. I want to compute A1-A2, B1-B2, etc.
# My approach is:
first_cols = ['A1', 'B1']
sec_cols = ['A2', 'B2']
# New column names
sub_cols = ['A_sub', 'B_sub']
df[sub_cols] = df[first_cols] - df[sec_cols]
Present output:
ValueError: Wrong number of items passed, placement implies 1
Expected output:
df =
A1 B1 A2 B2 A_sub B_sub
0 15 30 50 70 -35 -40
1 25 40 60 80 -35 -40
I think what you are trying to do is similar to this post. In DataFrames, arithmetic operations are generally aligned on column and row indices. Since you are trying to subtract columns with different labels, pandas aligns on the union of the labels and produces all-NaN columns instead of the differences; assigning those four columns to two new names is what raises the ValueError. So df[sub_cols] = df[first_cols] - df[sec_cols] won't work.
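You can see the alignment problem directly (a quick sketch with the question's data):

df[['A1', 'B1']] - df[['A2', 'B2']]
#    A1  A2  B1  B2
# 0 NaN NaN NaN NaN
# 1 NaN NaN NaN NaN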
However, if you convert the right-hand side to a numpy array, pandas carries the operation out elementwise. So df[sub_cols] = df[first_cols] - df[second_cols].values will work and give you the expected result.
import pandas as pd
df = {"A1":[15,25], "B1": [30, 40], "A2":[50,60], "B2": [70, 80]}
df = pd.DataFrame(df)
first_cols = ["A1", "B1"]
second_cols = ["A2", "B2"]
sub_cols = ["A_sub","B_sub"]
df[sub_cols] = df[first_cols] - df[second_cols].values
print(df)
A1 B1 A2 B2 A_sub B_sub
0 15 30 50 70 -35 -40
1 25 40 60 80 -35 -40
You could also pull it off with a groupby on the columns, grouping by the first character of each column name:

import numpy as np

subtraction = (df.groupby(df.columns.str[0], axis=1)
                 .agg(np.subtract.reduce, axis=1)
                 .add_suffix("_sub")
               )
df.assign(**subtraction)
A1 B1 A2 B2 A_sub B_sub
0 15 30 50 70 -35 -40
1 25 40 60 80 -35 -40
It's not quite clear what you want. If you want one column that's A1-A2 and another that's B1-B2, you can do df[['A1', 'B1']].sub(df[['A2', 'B2']].values).
Help me change only one 'B' to 'C'
df = pd.DataFrame({'Student': ['A','B','B','D','E','F'],
'maths': [50,60,75,85,64,24],
'sci':[25,34,68,58,75,64],
'sco':[36,49,58,63,85,96]})
Student maths sci sco
0 A 50 25 36
1 B 60 34 49
2 B 75 68 58
3 D 85 58 63
4 E 64 75 85
5 F 24 64 96
df.replace('B','C') # this changes both 'B' values
Using replace, I want to change only the 'B' in row 2 to 'C'.
I would suggest using the df.at function. Try using this code:
import pandas as pd
df = pd.DataFrame({'Student': ['A','B','B','D','E','F'],
'maths': [50,60,75,85,64,24],
'sci':[25,34,68,58,75,64],
'sco':[36,49,58,63,85,96]})
df.at[2, "Student"] = "C"
print(df)
You can also use loc to select the single cell to change:
df = pd.DataFrame({'Student': ['A','B','B','D','E','F'],
'maths': [50,60,75,85,64,24],
'sci':[25,34,68,58,75,64],
'sco':[36,49,58,63,85,96]})
df.loc[2, 'Student'] = df.loc[2, 'Student'].replace('B', 'C')
print(df)
# Student maths sci sco
#0 A 50 25 36
#1 B 60 34 49
#2 C 75 68 58
#3 D 85 58 63
#4 E 64 75 85
#5 F 24 64 96
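If you don't know the row position in advance, a boolean mask can target just the occurrence you want. A small sketch (the duplicated() call is one way to pick out the second 'B'; adapt the condition to your real criterion):

# mark rows whose Student is 'B' and is a repeat of an earlier 'B'
mask = (df['Student'] == 'B') & df['Student'].duplicated()
df.loc[mask, 'Student'] = 'C'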
I have a multi-indexed dataframe and I wish to extract a subset based on index values and on a boolean criterion. Ultimately I want to overwrite a specific column with new values, using multi-index keys and boolean indexers to select the records to modify.
import pandas as pd
import numpy as np

years = [1994, 1995, 1996]
householdIDs = list(range(1, 100))
midx = pd.MultiIndex.from_product([years, householdIDs], names=['Year', 'HouseholdID'])

householdIncomes = np.random.randint(10000, 100000, size=len(years)*len(householdIDs))
householdSize = np.random.randint(1, 5, size=len(years)*len(householdIDs))

df = pd.DataFrame({'HouseholdIncome': householdIncomes, 'HouseholdSize': householdSize},
                  index=midx)
df.sort_index(inplace=True)
Here's what the sample data looks like...
df.head()
=> HouseholdIncome HouseholdSize
Year HouseholdID
1994 1 23866 3
2 57956 3
3 21644 3
4 71912 4
5 83663 3
I'm able to successfully query the dataframe using the indices and column labels.
This example gives me the HouseholdSize for household 3 in year 1996
df.loc[(1996, 3), 'HouseholdSize']
=> 1
However, I'm unable to combine boolean selection with multi-index queries...
The pandas docs on multi-indexing say there is a way to combine boolean indexing with multi-indexing and give an example...
In [52]: idx = pd.IndexSlice
In [56]: mask = dfmi[('a','foo')]>200
In [57]: dfmi.loc[idx[mask,:,['C1','C3']],idx[:,'foo']]
Out[57]:
lvl0 a b
lvl1 foo foo
A3 B0 C1 D1 204 206
C3 D0 216 218
D1 220 222
B1 C1 D0 232 234
D1 236 238
C3 D0 248 250
D1 252 254
...which I can't seem to replicate on my dataframe
idx = pd.IndexSlice
householdSizeAbove2 = df.HouseholdSize > 2
df.loc[idx[householdSizeAbove2, 1996, :], 'HouseholdSize']
Traceback (most recent call last):
File "python", line 1, in <module>
KeyError: 'MultiIndex Slicing requires the index to be fully lexsorted tuple len (3), lexsort depth (2)'
In this example I would want to see all the households in 1996 with HouseholdSize above 2.
DataFrame.query() should work in this case:
df.query("Year == 1996 and HouseholdID > 2")
Demo:
In [326]: with pd.option_context('display.max_rows',20):
...: print(df.query("Year == 1996 and HouseholdID > 2"))
...:
HouseholdIncome HouseholdSize
Year HouseholdID
1996 3 28664 4
4 11057 1
5 36321 2
6 89469 4
7 35711 2
8 85741 1
9 34758 3
10 56085 2
11 32275 4
12 77096 4
... ... ...
90 40276 4
91 10594 2
92 61080 4
93 65334 2
94 21477 4
95 83112 4
96 25627 2
97 24830 4
98 85693 1
99 84653 4
[97 rows x 2 columns]
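Note that this demo filters on the HouseholdID index level; to match the stated goal (household size above 2 in 1996), query the column the same way, since query() resolves both named index levels and column names:

df.query("Year == 1996 and HouseholdSize > 2")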
UPDATE:
Is there a way to select a specific column?
In [333]: df.loc[df.eval("Year == 1996 and HouseholdID > 2"), 'HouseholdIncome']
Out[333]:
Year HouseholdID
1996 3 28664
4 11057
5 36321
6 89469
7 35711
8 85741
9 34758
10 56085
11 32275
12 77096
...
90 40276
91 10594
92 61080
93 65334
94 21477
95 83112
96 25627
97 24830
98 85693
99 84653
Name: HouseholdIncome, dtype: int32
and ultimately I want to overwrite the data on the dataframe.
In [331]: df.loc[df.eval("Year == 1996 and HouseholdID > 2"), 'HouseholdSize'] *= 10
In [332]: df.loc[df.eval("Year == 1996 and HouseholdID > 2")]
Out[332]:
HouseholdIncome HouseholdSize
Year HouseholdID
1996 3 28664 40
4 11057 10
5 36321 20
6 89469 40
7 35711 20
8 85741 10
9 34758 30
10 56085 20
11 32275 40
12 77096 40
... ... ...
90 40276 40
91 10594 20
92 61080 40
93 65334 20
94 21477 40
95 83112 40
96 25627 20
97 24830 40
98 85693 10
99 84653 40
[97 rows x 2 columns]
UPDATE2:
I want to pass a variable year instead of a specific value. Is there a cleaner way to do it than "Year == " + str(year) + " and HouseholdID > " + str(householdSize)?
In [5]: year = 1996
In [6]: household_ids = [1, 2, 98, 99]
In [7]: df.loc[df.eval("Year == @year and HouseholdID in @household_ids")]
Out[7]:
HouseholdIncome HouseholdSize
Year HouseholdID
1996 1 42217 1
2 66009 3
98 33121 4
99 45489 3
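As an aside, the original IndexSlice attempt appears to have failed because the slicer tuple had three elements (mask, year, :) for a two-level index, which is what triggers the "lexsort depth" message. A direct way to combine a level selection with a boolean condition, sketched against the frame built above:

year_mask = df.index.get_level_values('Year') == 1996
size_mask = df['HouseholdSize'] > 2
df.loc[year_mask & size_mask, 'HouseholdSize']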
I have a table with 4 columns delimited by whitespace
A1 3445 1 24
A1 3445 1 214
A2 3603 2 45
A2 3603 2 144
A0 3314 3 8
A0 3314 3 134
A0 3314 4 46
I would like to group the lines by the ID in the first column (e.g. A1) and, for each ID, keep the line with the biggest number in the last column. So the end result would look like this:
A1 3445 1 214
A2 3603 2 144
A0 3314 3 134
I have gotten as far as splitting the lines, but I don't understand how to compare them.
Any help would be nice.
Use the sorted function, giving the last column as the key:

with open('a.txt', 'r') as a:  # 'a.txt' is your file
    table = []
    for line in a:
        table.append(line.split())

s = sorted(table, key=lambda x: int(x[-1]), reverse=True)
for r in s:
    print('\t'.join(r))
Result:
A1 3445 1 214
A2 3603 2 144
A0 3314 3 134
A0 3314 4 46
A2 3603 2 45
A1 3445 1 24
A0 3314 3 8
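Note that this only sorts the lines; every row is kept rather than just the best one per ID. To reduce the output to one line per ID you still need to group, as the answers below do.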
data_dic = {}
for line in open('1.txt'):
    key, a, b, num = line.split()
    # keep the entry with the largest last-column value per ID
    if key not in data_dic or int(num) >= data_dic[key][-1]:
        data_dic[key] = [a, b, int(num)]
print(data_dic)
I think this may be the result you want:
data = [('A1',3445,1,24), ('A1',3445,1,214), ('A2',3603,2,45),
('A2',3603,2,144), ('A0',3314,3,8), ('A0',3314,3,134),
('A0',3314,4, 46)]
from itertools import groupby
for key, group in groupby(data, lambda x: x[0]):
    print(sorted(group, key=lambda x: x[-1], reverse=True)[0])
The output is:
('A1', 3445, 1, 214)
('A2', 3603, 2, 144)
('A0', 3314, 3, 134)
You can use the groupby function from itertools for this.
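One caveat: itertools.groupby only groups consecutive items with equal keys, so if equal IDs might not be adjacent in your input, sort by ID first. A small sketch that also uses max instead of a full sort per group:

from itertools import groupby

rows = sorted(data, key=lambda x: x[0])  # make equal IDs adjacent
for key, group in groupby(rows, key=lambda x: x[0]):
    print(max(group, key=lambda x: x[-1]))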
I'm trying to restructure a data-frame in R for k-means. Presently the data is structured like this:
Subject Posture s1 s2 s3....sn
1 45 45 43 42 ...
2 90 35 45 42 ..
3 0 3 56 98
4 45 ....
and so on. I'd like to collapse all the sn variables into a single column and create an additional variable with the s-number:
Subject Posture sn dv
1 45 1 45
2 90 2 35
3 0 3 31
4 45 4 45
Is this possible within R, or am I better off reshaping the CSV directly in Python?
Any help is greatly appreciated.
Here's the classic approach in base R (though using "reshape2" is probably the more common practice).
Assuming we're starting with "mydf", defined as:
mydf <- data.frame(Subject = 1:3, Posture = c(45, 90, 0),
s1 = c(45, 35, 3), s2 = c(43, 45, 56), s3 = c(42, 42, 98))
You can reshape with:
reshape(mydf, direction = "long", idvar=c("Subject", "Posture"),
varying = 3:ncol(mydf), sep = "", timevar="sn")
# Subject Posture sn s
# 1.45.1 1 45 1 45
# 2.90.1 2 90 1 35
# 3.0.1 3 0 1 3
# 1.45.2 1 45 2 43
# 2.90.2 2 90 2 45
# 3.0.2 3 0 2 56
# 1.45.3 1 45 3 42
# 2.90.3 2 90 3 42
# 3.0.3 3 0 3 98
require(reshape2)
melt(df, id.vars = c("Subject", "Posture"))
Where df is the data.frame you presented. Note that both Subject and Posture need to be id variables so that only the s-columns are melted. Next time please use dput() to provide actual data.
I think this will work for you.
EDIT:
Make sure to install the reshape2 package first of course.
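Since the question mentions Python as an alternative: the same reshape is essentially a one-liner with pandas.melt. A sketch, assuming the data has been read into a DataFrame shaped like the example:

import pandas as pd

mydf = pd.DataFrame({'Subject': [1, 2, 3], 'Posture': [45, 90, 0],
                     's1': [45, 35, 3], 's2': [43, 45, 56], 's3': [42, 42, 98]})

long_df = pd.melt(mydf, id_vars=['Subject', 'Posture'], var_name='sn', value_name='dv')
long_df['sn'] = long_df['sn'].str.lstrip('s').astype(int)  # keep just the s-number
print(long_df)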