How to extract the labels from sns.clustermap - python

If I'm plotting a (correlation) dataframe with sns.clustermap, it automatically takes the dataframe's MultiIndex as labels and plots them to the right of and below the clustermap.
How do I access these labels? I'm using clustermaps as an exploratory tool for largish datasets (100-200 entries) and I need the names of the entries in the various clusters.
EXAMPLE:
import numpy as np
import pandas as pd
import seaborn as sns

elev = [1, 100, 10, 1000, 100, 10]
number = [1, 2, 3, 4, 5, 6]
name = ['foo', 'bar', 'baz', 'qux', 'quux', 'quuux']
idx = pd.MultiIndex.from_arrays([name, elev, number],
                                names=('name', 'elev', 'number'))
data = np.random.rand(20, 6)
df = pd.DataFrame(data=data, columns=idx)
clustermap = sns.clustermap(df.corr())
gives the clustermap (figure omitted).
Now I'd say that there are two distinct clusters: the first two rows and the last four rows, i.e. [foo-1-1, bar-100-2] and [baz-10-3, qux-1000-4, quux-100-5, quuux-10-6].
How can I extract these (or the whole [foo-1-1, bar-100-2, baz-10-3, qux-1000-4, quux-100-5, quuux-10-6] list)? With 100+ entries, writing them down by hand isn't really an option.
The documentation offers clustergrid.dendrogram_row.reordered_ind, but that just gives me the positional index numbers into the original dataframe, whereas I'm looking for something more like the output of df.columns.
With this it seems I'm heading in the right direction, but I can only extract which cluster a given row belongs to when I let it form the clusters automatically, whereas I'd like to define the clusters myself, visually.
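(For what it's worth, those positional indices can at least be mapped back onto the labels by indexing the correlation frame with them; a minimal sketch, reusing the variables from the MWE above:
reordered = clustermap.dendrogram_row.reordered_ind
ordered_labels = df.corr().index[reordered]
)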

As always with such things, the answer is out there, I just overlooked it.
This answer (pointed out by Trenton McKinney in the comments) has the needed snippet in it:
ax_heatmap.yaxis.get_majorticklabels()
(I wouldn't have looked into ax_heatmap to get to that...). So, continuing the MWE from the question:
labels = clustermap.ax_heatmap.yaxis.get_majorticklabels()
However, that's a list of matplotlib.text.Text objects:
type(labels[0])
matplotlib.text.Text
so unless I'm missing something (again), it's not exactly straightforward to use. However, it can simply be looped over into something more useful. Let's say I'm interested in the whole name (i.e. the complete former df MultiIndex entry) and the number:
labels_list = []
number_list = []
for label in labels:
    # Text objects expose the label string via get_text()
    name = label.get_text()
    # the number is the part of the label after the last hyphen
    number = int(name.rsplit('-', 1)[-1])
    labels_list.append(name)
    number_list.append(number)
Now I've got two easily workable lists, one with full strings and one with ints.
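The same trick should work for the column labels via the x axis; a compact sketch of both, under the same assumptions as above:
row_labels = [t.get_text() for t in clustermap.ax_heatmap.yaxis.get_majorticklabels()]
col_labels = [t.get_text() for t in clustermap.ax_heatmap.xaxis.get_majorticklabels()]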

Related

Print Pandas without dtype

I've read a few other posts about this but the other solutions haven't worked for me. I'm trying to look at 2 different CSV files and compare data from 1 column from each file. Here's what I have so far:
import pandas as pd
import numpy as np

dataBI = pd.read_csv("U:/eu_inventory/EO BI Orders.csv")
dataOrderTrimmed = dataBI.iloc[:, 1:2].values
dataVA05 = pd.read_csv("U:/eu_inventory/VA05_Export.csv")
dataVAOrder = dataVA05.iloc[:, 1:2].values
dataVAList = []
ordersInBoth = []
ordersInBI = []
ordersInVA = []
for order in np.nditer(dataOrderTrimmed):
    if order in dataVAOrder:
        ordersInBoth.append(order)
    else:
        ordersInBI.append(order)
So if the order number from dataOrderTrimmed is also in dataVAOrder I want to add it to ordersInBoth, otherwise I want to add it to ordersInBI. I think it splits the information correctly, but if I try to print ordersInBoth, each item prints as array(5555555, dtype=int64). I want a list of the plain order numbers, not arrays and not including the dtype information. Let me know if you need more information or if the way I've typed it out is confusing. Thanks!
The way you're using .iloc gives you a DataFrame, which becomes a 2D array when you access .values. If you just want the values in the column at index 1, then you should just say:
dataOrderTrimmed = dataBI.iloc[:, 1].values
Then you can iterate over dataOrderTrimmed directly (i.e. you don't need nditer), and you will get regular scalar values.
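Putting that together with the question's loop, a minimal sketch (file paths and column positions assumed from the question):
dataOrderTrimmed = dataBI.iloc[:, 1].values  # 1D array of scalar order numbers
dataVAOrder = dataVA05.iloc[:, 1].values
ordersInBoth = [order for order in dataOrderTrimmed if order in dataVAOrder]
ordersInBI = [order for order in dataOrderTrimmed if order not in dataVAOrder]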

Setting DataFrame columns from current columns' data

I've stumbled upon some intricate data and I want to present it totally differently.
Currently, my dataframe has a default (numeric) index and 3 columns: sequence (which stores sentences), labels (a list of 20 different strings) and scores (a list of length 20 that corresponds to the labels list, so the ith element in scores is the score of the ith element in labels).
The labels list is sorted via the scores list: if label j has the highest score in row i, then j shows up first in that row's labels list; if another label has the highest score, it shows up first instead. So essentially each labels list is sorted by its scores list.
I want to paint a different picture: use the labels as my new columns and, as values, use the corresponding entries of the scores list.
For example, if this is how my current dataframe looks:
import numpy as np
import pandas as pd

d = {'sentence': ['Hello, my name is...', 'I enjoy reading books'], 'labels': [['Happy', 'Sad'], ['Sad', 'Happy']], 'score': [['0.9', '0.1'], ['0.8', '0.2']]}
df = pd.DataFrame(data=d)
df
I want to keep the first column, the sentence, but then use the labels as the rest of the columns and fill them with the values of the corresponding scores.
An example output would then be:
new_format_d = {'sentence': ['Hello, my name is...', 'I enjoy reading books'], 'Happy': ['0.9', '0.2'], 'Sad': ['0.1', '0.8']}
new_format_df = pd.DataFrame(data=new_format_d)
new_format_df
Is there an "easy" way to execute that?
I was finally able to solve it using a NumPy array hack:
First you convert the lists to NumPy arrays:
df['labels'] = df['labels'].map(lambda x: np.array(x))
df['score'] = df['score'].map(lambda x: np.array(x))
Then you loop over the labels and add each label, one at a time, together with its corresponding score, using the boolean selection below:
for label in df['labels'][0]:
    # pick the score sitting at the same position as this label
    df[label] = df[['labels', 'score']].apply(lambda x: x['score'][x['labels'] == label][0], axis=1)
My suggestion is to change your dictionary if you can. First find the indices of Happy and Sad in labels:
happy_index = [internal_list.index('Happy') for internal_list in d['labels']]
sad_index = [internal_list.index('Sad') for internal_list in d['labels']]
Then add new keys named Happy and Sad to your dictionary:
d['Happy'] = [d['score'][cnt][index] for cnt, index in enumerate(happy_index)]
d['Sad'] = [d['score'][cnt][index] for cnt, index in enumerate(sad_index)]
Finally, delete the redundant keys and convert it to a dataframe:
del d['labels']
del d['score']
df = pd.DataFrame(d)
                sentence Happy  Sad
0   Hello, my name is...   0.9  0.1
1  I enjoy reading books   0.2  0.8
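A more idiomatic alternative sketch, assuming pandas >= 1.3 (for multi-column explode) and unique sentences:
exploded = df.explode(['labels', 'score'])
wide = exploded.pivot(index='sentence', columns='labels', values='score').reset_index()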

Bootstrap Samples of a Dask Dataframe

I have a large data frame with all binary variables (a sparse matrix that was converted into pandas so that I can later convert to Dask). The dimensions are 398,888 x 52,034.
I am trying to create a much larger data frame that consists of 10,000 different bootstrap samples from the original data frame. Each sample is the same size as the original data. The final data frame will also have a column that keeps track of which bootstrap sample that row is from.
Here is my code:
import numpy as np
import pandas as pd
import dask.dataframe as dd

# sample df (each row has four values, so four column names are needed)
df_pd = pd.DataFrame(np.array([[0, 0, 0, 0], [1, 0, 0, 0], [0, 1, 0, 1]]),
                     columns=['a', 'b', 'c', 'd'])
# convert into Dask dataframe
df_dd = dd.from_pandas(df_pd, npartitions=4)

B = 2  # eventually 10,000
big_df = dd.from_pandas(pd.DataFrame([]), npartitions=1000)
for i in range(B + 1):
    data = df_dd.sample(frac=1, replace=True, random_state=i)
    data["sample"] = i
    big_df.append(data)
The data frame produced by the loop is empty, but I cannot figure out why. To be more specific, if I look at big_df.head() I get UserWarning: Insufficient elements for 'head'. 5 elements requested, only 0 elements available. Try passing larger 'npartitions' to 'head'. If I try print(big_df), I get ValueError: No objects to concatenate.
My guess is that there is at least a problem with the line big_df = dd.from_pandas(pd.DataFrame([]), npartitions=1000), but I have no idea what.
Let me know if I need to clarify anything. I am somewhat new to Python and even newer to Dask, so even small tips or feedback that don't fully answer the question would be greatly appreciated. Thanks!
You are probably better off using dask.dataframe.concat and concatenating the dataframes together; still, there are a few problems:
append creates a new object, so you have to save that object: df = df.append(data)
try calling big_df.head(npartitions=-1); it uses all partitions to get 5 rows (the appending/concatenating here can create small partitions with fewer than 5 rows).
It would also be good to write this with pandas first before jumping to Dask. You might be interested in reading through: https://docs.dask.org/en/latest/best-practices.html#load-data-with-dask
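A minimal sketch of the concat approach, reusing the question's variables; collecting the samples in a list sidesteps the lost-append problem entirely:
samples = []
for i in range(B + 1):
    data = df_dd.sample(frac=1, replace=True, random_state=i)
    data["sample"] = i
    samples.append(data)
big_df = dd.concat(samples)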

Remove specific elements from a numpy array

I have an np.array from which I would like to remove specific elements based on the element "name" rather than the index. Is this sort of thing possible with np.delete()?
Namely my original ndarray is
textcodes = data['CODES'].unique()
which captures unique text codes given the quarter.
Specifically, I want to remove certain codes which I need to run through a separate process, and put them into a separate ndarray:
sep_list = np.array(['SPCFC_CODE_1', 'SPCFC_CODE_2', 'SPCFC_CODE_3', 'SPCFC_CODE_4'])
I have trouble finding a solution for removing the codes in sep_list from textcodes, because I don't know where those codes will be indexed (it differs each quarter), and I would like to automate the removal based on the specific names instead, because those will always be the same.
Any help is greatly appreciated. Thank you.
You should be able to do something like:
import numpy as np
data = [3,2,1,0,10,5]
bad_list = [1, 2]
data = np.asarray(data)
new_list = np.asarray([x for x in data if x not in bad_list])
print("BAD")
print(data)
print("GOOD")
print(new_list)
Yields:
BAD
[ 3 2 1 0 10 5]
GOOD
[ 3 0 10 5]
It is impossible to tell for sure since you did not provide sample data, but the following implementation using your variables should work:
import numpy as np
textcodes = data['CODES'].unique()
sep_list = np.array(['SPCFC_CODE_1', 'SPCFC_CODE_2', 'SPCFC_CODE_3', 'SPCFC_CODE_4'])
final_list = [x for x in textcodes if x not in sep_list]
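A vectorized alternative sketch using np.isin, which avoids the Python-level loop and keeps the result an ndarray (variable names as in the question):
keep_mask = np.isin(textcodes, sep_list, invert=True)
final_list = textcodes[keep_mask]  # codes to keep
separated = textcodes[~keep_mask]  # codes routed to the separate process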

Pandas python barplot by subgroup

Ok so I have a dataframe object that's indexed as follows:
index, rev, metric1 (more metrics.....)
exp1, 92365, 0.018987
exp2, 92365, -0.070901
exp3, 92365, 0.150140
exp1, 87654, 0.003008
exp2, 87654, -0.065196
exp3, 87654, -0.174096
For each of these metrics I want to create individual stacked barplots comparing them based on their rev.
Here's what I've tried:
df = df[['rev', 'metric1']]
df = df.groupby("rev")
df.plot(kind='bar')
This results in 2 individual bar graphs of the metric. Ideally these two would be merged and stacked (right now stacked=True does nothing). Any help would be much appreciated.
This would give me my ideal result; however, I don't think reorganizing the data to fit it is the best way to achieve my goal, as I have many metrics and many revisions:
index, metric1(rev87654), metric1(rev92365)
exp1, 0.018987, 0.003008
exp2, -0.070901, -0.065196
exp3, 0.150140, -0.174096
This is my goal. (made by hand)
http://i.stack.imgur.com/5GRqB.png
Following this matplotlib gallery example:
http://matplotlib.org/examples/api/barchart_demo.html
they get multiple bars to plot by calling bar once for each set.
You could access these values in pandas with indexing operations as follows:
import numpy as np
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(16.2, 10), dpi=300)
# first group: rows where SL equals its first unique value
Y = Tire2[Tire2.SL == Tire2.SL.unique()[0]].SA.values[0:13]
X = np.linspace(0, Y.size, Y.size)
ax.bar(X, Y, width=.4)
# second group, shifted right so the bars sit side by side
Y = Tire2[Tire2.SL == Tire2.SL.unique()[2]].SA.values[0:13]
X = np.linspace(0, Y.size, Y.size) + .5
ax.bar(X, Y, width=.4, color='r')
Working from the inside out:
get all of the unique values of 'SL' ('rev' in your case)
get a boolean vector of all rows where 'SL' equals the first (or nth) unique value
index Tire2 by that boolean vector (this pulls out only those rows where the vector is True)
access the values of SA (or a metric in your case; I took only the [0:13] values because I was testing this on a huge data set)
bar-plot those values
If your experiments are consistently in order in the frame (as shown), that's that. Otherwise you might need to do a little sorting to get your Y values in the right order; .sort_values(column_name) should take care of that. In my code, I'd slip it in between ...[0]] and .SA....
In general, this kind of operation can really help you out in wrangling big frames. .between is useful, and you can always add, multiply, etc. the boolean vectors to construct more complex logic (see the small sketch below).
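For instance, a small sketch of combining boolean vectors, using the rev/metric1 columns from the question:
mask = (df['rev'] == 92365) & df['metric1'].between(-0.1, 0.2)
subset = df[mask]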
I'm not sure how to get the plot you want automatically without doing exactly the reorganization you specify at the end. The answer by user3823992 gives you more detailed control of the plots, but if you want them more automatic, here is some temporary reorganization that should work: it uses the indexing similarly but also concatenates back into a DataFrame that will do the plot for you.
import numpy as np
import pandas as pd

exp = ['exp1', 'exp2', 'exp3'] * 2
rev = [1, 1, 1, 2, 2, 2]
met1 = np.linspace(-0.5, 1, 6)
met2 = np.linspace(1.0, 5.0, 6)
met3 = np.linspace(-1, 1, 6)
df = pd.DataFrame({'rev': rev, 'met1': met1, 'met2': met2, 'met3': met3}, index=exp)

for met in df.columns:
    if met != 'rev':
        # start with the first revision's column for this metric
        merged = df[df['rev'] == df.rev.unique()[0]][met]
        merged.name = merged.name + 'rev' + str(df.rev.unique()[0])
        # concatenate each remaining revision alongside it
        for r in df.rev.unique()[1:]:
            tmp = df[df['rev'] == r][met]
            tmp.name = tmp.name + 'rev' + str(r)
            merged = pd.concat([merged, tmp], axis=1)
        merged.plot(kind='bar')
This should give you three plots, one for each of my fake metrics.
EDIT: Or something like this might also do:
df['exp'] = df.index
pt = pd.pivot_table(df, values='met1', index=['exp'], columns=['rev'])
pt.plot(kind='bar')
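And since the question asked for stacked bars, stacking the revisions within each experiment should just be a flag away on the pivoted frame:
pt.plot(kind='bar', stacked=True)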
