Highlighting specific columns in a bar chart in Python using Altair

I want to highlight specific data points in a bar chart based on my requirements. Using Altair, I am able to achieve this for a single data point (e.g. 'A' in the code below). Here's the sample data frame and code:
import pandas as pd
import altair as alt

df = pd.DataFrame({
    'Rank': [25, 20, 40, 10, 50, 35],
    'Name': ['A', 'B', 'C', 'D', 'E', 'F'],
})

bar1 = alt.Chart(df).mark_bar().encode(
    y=alt.Y('Rank'),
    x=alt.X('Name:O', sort='x'),
    color=alt.condition(
        alt.datum.Name == 'A',
        alt.value('red'),
        alt.value('blue')
    )
)
bar1
How can I highlight two or more data points (e.g. A and B) with the same color and the others with a different one? I tried passing the names as a list, Select = ['A', 'B'], and then using alt.datum.Name == Select, but that does not work.
How can I get this done? Also, I am trying to understand why passing a list did not work.
Thank you.

You could use the FieldOneOfPredicate to check if the Name column is one of the items in the list:
import pandas as pd
import altair as alt

df = pd.DataFrame({
    'Rank': [25, 20, 40, 10, 50, 35],
    'Name': ['A', 'B', 'C', 'D', 'E', 'F'],
})

bar1 = alt.Chart(df).mark_bar().encode(
    y=alt.Y('Rank'),
    x=alt.X('Name:O', sort='x'),
    color=alt.condition(
        alt.FieldOneOfPredicate('Name', ['A', 'B']),
        alt.value('red'),
        alt.value('blue')
    )
)
bar1
You can read more about it in the Vega-Lite docs. You could also use an expression string with an "or" operator:
color=alt.condition(
    "datum.Name == 'A'"
    "|| datum.Name == 'B'",  # split over two rows for readability
    alt.value('red'),
    alt.value('blue')
)
I don't think there is a single Vega expression operator for checking membership the way Python's in does. This answer doesn't mention one either.
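As for why passing the list did not work: alt.datum.Name == Select most likely compiles to an equality comparison against the whole list rather than a membership test, so no single name ever matches it. Below is a minimal sketch, assuming the same df as above; select and predicate are just local variable names used for illustration, and the approach simply builds the "or" expression string from a Python list so the highlighted names are not hard-coded:

# Minimal sketch, assuming the df defined above: build the "or"
# predicate string from a Python list of names to highlight.
select = ['A', 'B']
predicate = " || ".join("datum.Name == '%s'" % name for name in select)

bar1 = alt.Chart(df).mark_bar().encode(
    y=alt.Y('Rank'),
    x=alt.X('Name:O', sort='x'),
    color=alt.condition(predicate, alt.value('red'), alt.value('blue'))
)
bar1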

Related

How to automatically drop index levels that only have a single value?

I have a dataframe that has columns A to M, for example. Then I did:

groups = df.groupby(['E', 'D', 'B', 'G', 'I'])
stats = pd.concat(
    [
        groups['N'].mean().rename('N_mean'),
        groups['H'].median().rename('H_median'),
    ],
    axis=1,
)
stats = stats[stats['N_mean'] > 0]
Now if I print stats, the index is ('E', 'D', 'B', 'G', 'I'). However, many of these levels contain only a single value, which makes them insignificant. I know I can determine which levels are insignificant and then call stats.index.droplevel(...), but I wonder whether there is already a built-in method to do this automatically.
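A hedged sketch of one way to do this, assuming stats is the MultiIndexed result built above (drop_single_value_levels is just an illustrative helper name, not a pandas built-in):

def drop_single_value_levels(frame):
    # Hypothetical helper: drop index levels that hold only one distinct
    # value, keeping at least one level so the index stays valid.
    single = [i for i in range(frame.index.nlevels)
              if frame.index.get_level_values(i).nunique() == 1]
    if len(single) == frame.index.nlevels:
        single = single[:-1]
    return frame.droplevel(single) if single else frame

stats = drop_single_value_levels(stats)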

Grouping, summing, sorting and selecting within a DataFrame

I have a DataFrame like this one:
df = pd.DataFrame({'State': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
                   'County': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i'],
                   'Population': [10, 11, 12, 13, 17, 16, 15, 18, 14]})
Looking at the two most populous counties for each state, what are the two most populous states (in order of highest population to lowest population)?
I solved it by using a loop, and now I'm trying to get the same result grouping, summing, sorting and selecting.
The following code works, but I'm sure there are many different and more elegant ways to do it.
df.groupby(['State'])['Population'].nlargest(2).groupby(['State']).sum()\
    .sort_values(ascending=False)[:2].to_frame()\
    .reset_index()['State'].tolist()
You can shorten this a little:
df.groupby(['State'])['Population'].nlargest(2)\
    .sum(level=0).sort_values(ascending=False).index[:2].tolist()
No need to convert back to a dataframe to retrieve the states; just get the states from the index directly. Using sum with the level parameter is just shorthand for grouping by that index level again.
Here's an alternate way to do it, with explanations of each operation:

(df.sort_values('Population', ascending=False)      # order rows by population, highest first
   .groupby('State').head(2)                        # keep the two most populous counties per state
   .groupby('State').sum()                          # total population of those two counties per state
   .sort_values('Population', ascending=False)[:2]  # top 2 states by that population
   .index                                           # get the state names
   .tolist()                                        # convert to a list
)
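Note that sum(level=0) in the shortened answer is deprecated in recent pandas versions; a hedged sketch of the same chain using an explicit groupby on the index level instead:

# Hedged sketch: same idea as the shortened answer, but grouping on the
# 'State' index level rather than using the deprecated sum(level=...).
(df.groupby('State')['Population'].nlargest(2)
   .groupby(level='State').sum()
   .sort_values(ascending=False)
   .index[:2]
   .tolist())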

Python Pandas DataFrame pivot_table bizarre values

"Bizarre" is such an emotionally charged word.
Assume that I have 5 students: A, B, C, D, and E. Each of these students grades two of their peers on a writing assignment. The data is as follows:
peer_review = pd.DataFrame({
    'Student': ['A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'D', 'D', 'D', 'D', 'E', 'E'],
    'Assessor': ['B', 'C', 'A', 'D', 'D', 'D', 'B', 'D', 'D', 'D', 'A', 'A', 'A', 'E', 'C', 'E'],
    'Score': [72, 53, 92, 100, 2, 90, 75, 50, 50, 47, 97, 86, 41, 17, 47, 29]})
Now, in some cases an assessor graded the student's assignment more than once. Maybe the student turned it in and revised several times. Maybe the assessor was drunk and didn't remember that he had already graded this student's assignment. In any case, I would like to be able to see a list of all scores that each assessor gave to each student. I tried to do this as follows:
peer_review.pivot_table(
    index='Student',
    columns='Assessor',
    values='Score',
    aggfunc=identity)
I can already hear you asking: what is the "identity" function? It's this:
def identity(x):
    return x
However, when I run this pivot_table call repeatedly, it gives me different answers each time for the cells that have multiple values.
So, here are the questions:
What is the significance of the numbers that seem to change randomly as I run the pivot_table function repeatedly?
How do I fix the identity function so that it returns a simple list of all the scores when an assessor graded the same assignment more than once?
UPDATE #1:
I found that it is a pandas Series object that is being passed to the identity function. I changed the identity function to this:
def identity(x):
    return x.values
This still gives me the bizarre random numbers. Realizing that x.values is a numpy.ndarray, I then tried this:
def identity(x):
    return x.values.tolist()
This results in a ValueError exception. ("Function does not reduce.")
UPDATE #2:
The workaround proposed by ZJS works perfectly. Still wondering why pivot_table has failed me.
This will work every time:

groups = peer_review.groupby(['Assessor', 'Student'])   # group into Assessor/Student combos
peer_review = groups.apply(lambda x: list(x['Score']))  # build a list of scores for each combo
peer_review = peer_review.unstack('Student')            # move the Student level to the columns

I'm still investigating why pivot_table doesn't work.
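For what it's worth, in recent pandas versions pivot_table should also accept the built-in list as aggfunc, which would make the custom identity function unnecessary; a hedged sketch, applied to the original peer_review DataFrame from the question:

# Hedged sketch: aggfunc=list puts a Python list of all scores in each
# Student/Assessor cell (NaN where an assessor never graded a student).
scores = peer_review.pivot_table(
    index='Student',
    columns='Assessor',
    values='Score',
    aggfunc=list)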

Pass Multiple Columns to groupby.transform

I understand that when you call a groupby.transform with a DataFrame column, the column is passed to the function that transforms the data. But what I cannot understand is how to pass multiple columns to the function.
import numpy as np
import pandas as pd

people = pd.DataFrame(np.random.randn(5, 5),
                      columns=['a', 'b', 'c', 'd', 'e'],
                      index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])
key = ['one', 'two', 'one', 'two', 'one']
Now I can easily demean the data etc., but what I can't seem to do properly is transform data inside groups using multiple column values as parameters of the function. For example, if I wanted to add a column 'f' that took the value (a.mean() - b.mean()) * c for each observation, how can that be achieved using the transform method?
I have tried variants of the following:

people['f'] = np.nan
Grouped = people.groupby(key)

def TransFunc(col1, col2, col3):
    return col1.mean() - col2.mean() * col3

Grouped.f.transform(TransFunc(Grouped['a'], Grouped['b'], Grouped['c']))
But this is clearly wrong. I have also tried to wrap the function in a lambda but can't quite make that work either.
I am able to achieve the result by iterating through the groups in the following manner:
for group in Grouped:
    Amean = np.mean(list(group[1].a))
    Bmean = np.mean(list(group[1].b))
    CList = list(group[1].c)
    IList = list(group[1].index)
    for y in range(len(CList)):
        people['f'][IList[y]] = (Amean - Bmean) * CList[y]
But that does not seem a satisfactory solution, particularly if the index is non-unique. Also, I know this must be possible using groupby.transform.
To generalise the question: how does one write functions for transforming data that have parameters that involve using values from multiple columns?
Help appreciated.
You can use the apply() method:

import numpy as np
import pandas as pd

np.random.seed(0)
people2 = pd.DataFrame(np.random.randn(5, 5),
                       columns=['a', 'b', 'c', 'd', 'e'],
                       index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])
key = ['one', 'two', 'one', 'two', 'one']
Grouped = people2.groupby(key)

def f(df):
    df["f"] = (df.a.mean() - df.b.mean()) * df.c
    return df

people2 = Grouped.apply(f)
print(people2)
If you want a more general method:

Grouped = people2.groupby(key)

def f(a, b, c, **kw):
    return (a.mean() - b.mean()) * c

people2["f"] = Grouped.apply(lambda df: f(**df))
print(people2)
This is based upon the answer provided by HYRY (thanks), who made me see how this could be achieved. My version does nothing more than generalise the function and pass the column names as arguments when it is called. I think, though, that the function has to be called with a lambda:
import pandas as pd
import numpy as np

people = pd.DataFrame(np.random.randn(5, 5),
                      columns=['a', 'b', 'c', 'd', 'e'],
                      index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])
key = ['one', 'two', 'one', 'two', 'one']
people['f'] = ""
Grouped = people.groupby(key)

def FUNC(df, col1, col2, col3, col4):
    df[col1] = (df[col2].mean() - df[col3].mean()) * df[col4]
    return df

people2 = Grouped.transform(lambda x: FUNC(x, 'f', 'a', 'b', 'c'))
This appears to me to be the best way I have seen of doing this. Basically, the entire grouped data frame is passed to the function as x, and the columns can then be referred to by name as arguments.
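A hedged alternative, not from the original answers: since the expression only needs each group's column means broadcast back to the rows, two column-wise transforms can do the same job without passing the whole sub-frame through transform, assuming the same people and key as above:

# Hedged sketch: broadcast the per-group means of 'a' and 'b' back to the
# rows with transform('mean'), then combine with column 'c'.
g = people.groupby(key)
people['f'] = (g['a'].transform('mean') - g['b'].transform('mean')) * people['c']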

Stacked bar chart with differently ordered colors using matplotlib

I am a beginner with Python. I am trying to make a horizontal bar chart with differently ordered colors.
I have a data set like the one below:
dataset = [{'A': 19, 'B': 39, 'C': 61, 'D': 70},
           {'A': 34, 'B': 68, 'C': 32, 'D': 38},
           {'A': 35, 'B': 45, 'C': 66, 'D': 50},
           {'A': 23, 'B': 23, 'C': 21, 'D': 16}]
data_orders = [['A', 'B', 'C', 'D'],
               ['B', 'A', 'C', 'D'],
               ['A', 'B', 'D', 'C'],
               ['B', 'A', 'C', 'D']]
The first list contains the numerical data, and the second one contains the order of the items in each row. I need the second list here because the order of A, B, C, and D is crucial for the dataset when presenting it in my case.
Using data like the above, I want to make a stacked bar chart like the picture below, which I made manually in MS Excel. What I hope to do now is make this type of bar chart using Matplotlib with a dataset like the one above, in a more automatic way. I would also like to add a legend to the chart if possible.
Actually, I have gotten totally lost trying this by myself. Any help will be very, very helpful.
Thank you very much for your attention!
It's a long program, but it works. I added one dummy row of data to distinguish the row count from the column count:
import numpy as np
from matplotlib import pyplot as plt

dataset = [{'A': 19, 'B': 39, 'C': 61, 'D': 70},
           {'A': 34, 'B': 68, 'C': 32, 'D': 38},
           {'A': 35, 'B': 45, 'C': 66, 'D': 50},
           {'A': 23, 'B': 23, 'C': 21, 'D': 16},
           {'A': 35, 'B': 45, 'C': 66, 'D': 50}]
data_orders = [['A', 'B', 'C', 'D'],
               ['B', 'A', 'C', 'D'],
               ['A', 'B', 'D', 'C'],
               ['B', 'A', 'C', 'D'],
               ['A', 'B', 'C', 'D']]
colors = ["r", "g", "b", "y"]
names = sorted(dataset[0].keys())

# Values of each row, arranged in that row's requested order.
values = np.array([[data[name] for name in order]
                   for data, order in zip(dataset, data_orders)])
# Left edge of each segment: cumulative sum of the preceding segments.
lefts = np.insert(np.cumsum(values, axis=1), 0, 0, axis=1)[:, :-1]
orders = np.array(data_orders)
bottoms = np.arange(len(data_orders))

for name, color in zip(names, colors):
    # Positions of this name in every row, so one call draws all
    # segments of this color at once.
    idx = np.where(orders == name)
    value = values[idx]
    left = lefts[idx]
    plt.bar(left=left, height=0.8, width=value, bottom=bottoms,
            color=color, orientation="horizontal", label=name)

plt.yticks(bottoms + 0.4, ["data %d" % (t + 1) for t in bottoms])
plt.legend(loc="best", bbox_to_anchor=(1.0, 1.00))
plt.subplots_adjust(right=0.85)
plt.show()
The resulting figure (one horizontal stacked bar per row, with each row's segments in its own order) is not reproduced here.
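One caveat: the plt.bar(left=..., orientation="horizontal") call above relies on an older Matplotlib API; in current Matplotlib the idiomatic call for horizontal bars is plt.barh. A hedged sketch of the same plotting loop, assuming the names, colors, orders, values, lefts and bottoms arrays built above:

# Hedged sketch: the same stacked bars drawn with plt.barh. barh centers
# each bar on its y value by default, so the 0.4 tick offset used above
# is no longer needed.
for name, color in zip(names, colors):
    idx = np.where(orders == name)
    plt.barh(y=bottoms, width=values[idx], left=lefts[idx],
             height=0.8, color=color, label=name)
plt.yticks(bottoms, ["data %d" % (t + 1) for t in bottoms])
plt.legend(loc="best", bbox_to_anchor=(1.0, 1.00))
plt.show()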
dataset = [{'A': 19, 'B': 39, 'C': 61, 'D': 70},
           {'A': 34, 'B': 68, 'C': 32, 'D': 38},
           {'A': 35, 'B': 45, 'C': 66, 'D': 50},
           {'A': 23, 'B': 23, 'C': 21, 'D': 16}]
data_orders = [['A', 'B', 'C', 'D'],
               ['B', 'A', 'C', 'D'],
               ['A', 'B', 'D', 'C'],
               ['B', 'A', 'C', 'D']]
for i, x in enumerate(data_orders):
    for y in x:
        # do something here with dataset[i][y] in matplotlib
        pass
