Function for bar plot in Python

I am trying to write a function for a bar plot like the one shown below for Category and Group, based on the Index column. The function has to split the data by Index values X and Y separately and plot the graphs for Category and Group.
Index Group Category Population
X A 5 12
X A 5 34
Y B 5 23
Y B 5 34
Y B 6 33
X A 6 44
Y C 7 12
X C 7 23
Y A 8 12
Y A 8 4
X B 8 56
Y B 9 67
X B 10 23
Y A 8 45
X C 9 34
X C 9 56
Here Men and Women correspond to the Index values X and Y in my case.
I have tried many different ways but have not been able to solve this. It would be really helpful if anyone could help me with it.

Not sure if this is what you are looking for, but IMO it's the easiest way to plot a multi-index:
df["Index"] = df["Index"].map({"X":"Male", "Y": "Female"})
df_ = df.groupby(["Group","Category","Index"]).mean().unstack()
df_.plot.bar()
This will give you:
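For reference, here is a self-contained version of this answer. The dataframe below is transcribed from the question's table (an assumption about how the asker's data is loaded), and a non-interactive backend is set so it runs in a script:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, safe for scripts
import matplotlib.pyplot as plt

# Sample data transcribed from the question's table
df = pd.DataFrame({
    "Index":      list("XXYYYXYXYYXYXYXX"),
    "Group":      list("AABBBACCAABBBACC"),
    "Category":   [5, 5, 5, 5, 6, 6, 7, 7, 8, 8, 8, 9, 10, 8, 9, 9],
    "Population": [12, 34, 23, 34, 33, 44, 12, 23, 12, 4, 56, 67, 23, 45, 34, 56],
})

# Map the index labels, average per (Group, Category, Index),
# then unstack Index into columns so plot.bar draws paired bars
df["Index"] = df["Index"].map({"X": "Male", "Y": "Female"})
df_ = df.groupby(["Group", "Category", "Index"]).mean().unstack()
df_.plot.bar()
plt.tight_layout()
```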


Change plot color according to the values from array [duplicate]

This question already has answers here:
matplotlib color line by "value" (2 answers)
How to manually create a legend (5 answers)
map pandas values to a categorical level (1 answer)
Closed 1 year ago.
I tried other solutions similar to my question, but did not succeed:
python: how to plot one line in different colors
Color by Column Values in Matplotlib
pandas plot one line graph with color change on column
I want the plot to change color when the value changes: for instance, while the emotion is 0 the line stays black, when the value changes to 1 it turns red, when it is 2 it turns blue, and so on. The progress I've made so far is attached to this question; thank you in advance.
import numpy as np
import matplotlib.pyplot as plt

random_emotions = [0,0,0,0,0,0,0,1,2,3,2,1,2,1,
                   2,3,2,2,2,2,1,1,2,3,3,3,3,3,3,4,
                   4,4,4,4,2,1,2,2,1,2,3,4,0,0,0,0,0]
EmotionsInNumber = np.array(random_emotions)
x = np.arange(len(EmotionsInNumber))
Angry = np.ma.masked_where(EmotionsInNumber == 0, EmotionsInNumber)
Fear = np.ma.masked_where(EmotionsInNumber == 1, EmotionsInNumber)
Happy = np.ma.masked_where(EmotionsInNumber == 2, EmotionsInNumber)
Neutral = np.ma.masked_where(EmotionsInNumber == 3, EmotionsInNumber)
Sad = np.ma.masked_where(EmotionsInNumber == 4, EmotionsInNumber)
fig, ax = plt.subplots()
ax.plot(x, Angry, linewidth=4, color='black')
ax.plot(x, Fear, linewidth=4, color='red')
ax.plot(x, Happy, linewidth=4, color='blue')
ax.plot(x, Neutral, linewidth=4, color='yellow')
ax.plot(x, Sad, linewidth=4, color='green')
ax.legend(['Angry', 'Fear', 'Happy', 'Neutral', 'Sad'])
ax.set_title("Emotion Report of ")
plt.show()
This is the result that I am getting
The colors do not change accordingly, the legend is wrong, and I have no idea how to fix this.
matplotlib color line by "value" [duplicate]
This 'matplotlib color line by "value" [duplicate]' is the closest I got, but when the color changes to cyan at indices 1 and 5, blue should be absent there, yet both blue and cyan keep being plotted. This happens because the dataframe is grouped by 'colors'; it should not plot blue at 1 and 5, nor cyan at 2, 3, 4.
The main question will be closed as a duplicate of the linked questions and answers above.
The code is explained in the duplicates.
When a question is marked as a duplicate and you don't agree, it is your responsibility to show, with code, exactly how you tried to incorporate the duplicate and what isn't working.
SO is a repository of questions and answers, which can be used as a reference to answer new questions. When a question is answered by code in an existing question/answer, it is up to you to do the work.
Since it's a duplicate, this answer has been added as a community wiki.
from matplotlib.lines import Line2D
import pandas as pd
import matplotlib.pyplot as plt

# set up the dataframe to match the duplicate
random_emotions = [0,0,0,0,0,0,0,1,2,3,2,1,2,1, 2,3,2,2,2,2,1,1,2,3,3,3,3,3,3,4, 4,4,4,4,2,1,2,2,1,2,3,4,0,0,0,0,0]
df = pd.DataFrame({'val': random_emotions})

# mapping values is covered in the duplicate
emotion_dict = {0: 'Angry', 1: 'Fear', 2: 'Happy', 3: 'Neutral', 4: 'Sad'}
color_dict = {0: 'k', 1: 'r', 2: 'b', 3: 'y', 4: 'g'}
df['emotion'] = df.val.map(emotion_dict)
df['color'] = df.val.map(color_dict)

# everything else from here is from the duplicate
df['change'] = df.val.ne(df.val.shift().bfill()).astype(int)
df['subgroup'] = df['change'].cumsum()
df.index += df['subgroup'].values
first_i_of_each_group = df[df['change'] == 1].index
for i in first_i_of_each_group:
    # Copy next group's first row to current group's last row
    df.loc[i-1] = df.loc[i]
    # But make this new row part of the current group
    df.loc[i-1, 'subgroup'] = df.loc[i-2, 'subgroup']
# Don't need the change col anymore
df.drop('change', axis=1, inplace=True)
df.sort_index(inplace=True)
# Create duplicate indexes at each subgroup border to ensure the plot is continuous.
df.index -= df['subgroup'].values

fig, ax = plt.subplots(figsize=(15, 4))
for k, g in df.groupby('subgroup'):
    g.plot(ax=ax, y='val', color=g['color'].values[0], marker='.', legend=False, xticks=df.index)
ax.margins(x=0)

# creating a custom legend is covered in the duplicate
custom_lines = [Line2D([0], [0], color=color, lw=4) for color in color_dict.values()]
_ = ax.legend(title='Emotion', handles=custom_lines, labels=emotion_dict.values(), bbox_to_anchor=(1, 1.02), loc='upper left')
# display(df.T) -- output truncated; note the duplicated indices at each subgroup border:
          0      1      2      3      4      5      6      7      7      8      8      9       9       ...
val       0      0      0      0      0      0      0      1      1      2      2      3       3       ...
emotion   Angry  Angry  Angry  Angry  Angry  Angry  Angry  Fear   Fear   Happy  Happy  Neutral Neutral ...
color     k      k      k      k      k      k      k      r      r      b      b      y       y       ...
subgroup  0      0      0      0      0      0      0      0      1      1      2      2       3       ...
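As a side note on the original attempt: np.ma.masked_where hides values where the condition is True, so the question's masks are inverted — each series keeps everything except its own emotion. A minimal sketch of the corrected masking (markers are my addition so isolated points stay visible; the segments still break at each transition, which is exactly what the subgroup trick above works around):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, safe for scripts
import matplotlib.pyplot as plt

emotions = np.array([0,0,0,0,0,0,0,1,2,3,2,1,2,1,
                     2,3,2,2,2,2,1,1,2,3,3,3,3,3,3,4,
                     4,4,4,4,2,1,2,2,1,2,3,4,0,0,0,0,0])
x = np.arange(len(emotions))

fig, ax = plt.subplots()
for value, (label, color) in enumerate([('Angry', 'black'), ('Fear', 'red'),
                                        ('Happy', 'blue'), ('Neutral', 'yellow'),
                                        ('Sad', 'green')]):
    # Keep only this emotion: mask everything that is NOT equal to it
    series = np.ma.masked_where(emotions != value, emotions)
    ax.plot(x, series, linewidth=4, marker='o', color=color, label=label)
ax.legend()
ax.set_title("Emotion Report")
```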

Assign values to newly created column by matching certain words in Python

I have a dataset where, whenever an id contains a specific substring, I'd like to assign a corresponding value in a new column.
Data
id status
see-dd-2333 y
see-dd-aaaaa y
sal-led-sss y
sal-led-sss n
dis-dd-red n
Desired
id status pw
see-dd-2333 y 14
see-dd-aaaaa y 14
sal-led-sss y 8
sal-led-sss n 8
dis-dd-red n 5
Doing
I am thinking I can use a dictionary. Whenever I see the pattern 'see-dd', I'd like to supply the numerical value 14; when an id contains 'sal-led', the value 8; and whenever I see 'dis-dd', the value 5.
out = {
    'see-dd': 14,
    'sal-led': 8,
}
Any suggestion is appreciated.
The simplest would be to use the replace method. As the docs note:
This method has a lot of options. You are encouraged to experiment
and play with this method to gain intuition about how it works.
df['id'].replace(regex=out)
0 14
1 14
2 8
3 8
4 5
Name: id, dtype: int64
with out as:
out = {
    'see-dd': 14,
    'sal-led': 8,
    'dis-dd': 5
}
df['pw'] = df['id'].replace(regex=out)
df
id status pw
0 see-dd-2333 y 14
1 see-dd-aaaaa y 14
2 sal-led-sss y 8
3 sal-led-sss n 8
4 dis-dd-red n 5
You can also use:
df['pw'] = df['id'].str.rsplit('-', n=1).str.get(0).map(out)
Output:
id status pw
0 see-dd-2333 y 14
1 see-dd-aaaaa y 14
2 sal-led-sss y 8
3 sal-led-sss n 8
4 dis-dd-red n 5
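For reference, a self-contained sketch combining both approaches above, with the sample data reconstructed from the question (column values are transcribed, not from a real dataset):

```python
import pandas as pd

df = pd.DataFrame({
    'id': ['see-dd-2333', 'see-dd-aaaaa', 'sal-led-sss', 'sal-led-sss', 'dis-dd-red'],
    'status': ['y', 'y', 'y', 'n', 'n'],
})
out = {'see-dd': 14, 'sal-led': 8, 'dis-dd': 5}

# Approach 1: regex replace -- a cell matching a pattern key becomes its number
df['pw'] = df['id'].replace(regex=out)

# Approach 2: drop the last '-'-separated token, then map the remaining prefix
df['pw2'] = df['id'].str.rsplit('-', n=1).str.get(0).map(out)
```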

Finding row with closest numerical proximity within Pandas DataFrame

I have a Pandas DataFrame with the following hypothetical data:
ID Time X-coord Y-coord
0 1 5 68 5
1 2 8 72 78
2 3 1 15 23
3 4 4 81 59
4 5 9 78 99
5 6 12 55 12
6 7 5 85 14
7 8 7 58 17
8 9 13 91 47
9 10 10 29 87
For each row (or ID), I want to find the ID with the closest proximity in time and space (X & Y) within this dataframe. Bonus: Time should have priority over XY.
Ideally, in the end I would like to have a new column called "Closest_ID" containing the most proximal ID within the dataframe.
I'm having trouble coming up with a function for this.
I would really appreciate any help or hint that points me in the right direction!
Thanks a lot!
Let's denote df as our dataframe. Then you can do something like:
from sklearn.metrics import pairwise_distances
import numpy as np

space_vals = df[['X-coord', 'Y-coord']]
time_vals = df[['Time']]  # pairwise_distances expects 2-D input
space_distance = pairwise_distances(space_vals)
time_distance = pairwise_distances(time_vals)
# exclude each row's zero distance to itself
np.fill_diagonal(space_distance, 1e9)  # arbitrary large number
np.fill_diagonal(time_distance, 1e9)
closest_space_id = np.argmin(space_distance, axis=0)
closest_time_id = np.argmin(time_distance, axis=0)
Then, you can store the last 2 results in 2 columns, or somehow decide which one is closer.
Note: this code hasn't been checked, and it might have a few bugs...
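Building on that idea, here is a hedged sketch (pure NumPy/pandas, no sklearn) that produces the requested Closest_ID column and gives time priority by comparing time distance first and breaking ties on spatial distance. The lexicographic scaling trick assumes integer time gaps, and the tie-breaking rule itself is my interpretation of the "bonus":

```python
import numpy as np
import pandas as pd

# Sample data transcribed from the question
df = pd.DataFrame({
    'ID':      [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Time':    [5, 8, 1, 4, 9, 12, 5, 7, 13, 10],
    'X-coord': [68, 72, 15, 81, 78, 55, 85, 58, 91, 29],
    'Y-coord': [5, 78, 23, 59, 99, 12, 14, 17, 47, 87],
})

time = df['Time'].to_numpy(dtype=float)
xy = df[['X-coord', 'Y-coord']].to_numpy(dtype=float)

# Pairwise absolute time differences and Euclidean spatial distances
time_dist = np.abs(time[:, None] - time[None, :])
space_dist = np.linalg.norm(xy[:, None, :] - xy[None, :, :], axis=-1)

# Time has priority: scaling time by a constant larger than any spatial
# distance makes the combined score lexicographic (valid for integer time gaps)
score = time_dist * (space_dist.max() + 1) + space_dist
np.fill_diagonal(score, np.inf)  # a row is not its own neighbour
df['Closest_ID'] = df['ID'].to_numpy()[np.argmin(score, axis=1)]
```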

Dataframe set_index produces duplicate index values instead of doing hierarchical grouping

I have a dataframe that looks like this (index not shown)
Time Letter Type Value
0 A x 10
0 B y 20
1 A y 30
1 B x 40
3 C x 50
I want to produce a dataframe that looks like this:
Time Letter TypeX TypeY
0    A      10
0    B            20
1    A            30
1    B      40
3    C      50
To do that, I decided I would first create a table with multiple indices, Time and Letter, and then unstack the last index, Type.
Let's say my original dataframe was named my_table. I tried my_table.reset_index().set_index(['Time', 'Letter']), but instead of grouping the rows so that under every (Time, Letter) pair there is BOTH Type x and Type y, they seem to have been sorted by Type (adding a few more entries to demonstrate the point):
Time(i) Letter(i) Type Value
0       A         x    10
        D         x    25
        H         x    15
        G         x    33
1       B         x    40
        G         x    10
3       C         x    50
0       B         y    20
        H         y    10
1       A         y    30
Why does this happen? I expected a result like so:
Time Letter Type Value
0    A      x    10
            y    30
     B      y    20
     H      x    15
            y    10
     D      x    25
     G      x    33
1    B      x    40
     G      x    10
3    C      x    50
The same behavior occurs when I make Type one of the indices, it just becomes bold as an index.
How do I successfully group columns using Time and Letter to get X and Y to be matched by those columns, so I can successfully use unstack?
You need to set Type as part of the index as well:
df.set_index(['Time','Letter','Type']).Value.unstack(fill_value='').reset_index()
Out[178]:
Type  Time Letter   x   y
0        0      A  10
1        0      B      20
2        1      A      30
3        1      B  40
4        3      C  50
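For reference, a self-contained sketch of this unstack approach, using the question's sample data (reconstructed here as a plain dataframe):

```python
import pandas as pd

# Sample data from the question
my_table = pd.DataFrame({
    'Time':   [0, 0, 1, 1, 3],
    'Letter': ['A', 'B', 'A', 'B', 'C'],
    'Type':   ['x', 'y', 'y', 'x', 'x'],
    'Value':  [10, 20, 30, 40, 50],
})

# Index on all three keys, then pivot Type into columns;
# fill_value='' leaves missing (Time, Letter, Type) combinations blank
wide = (my_table.set_index(['Time', 'Letter', 'Type'])['Value']
                .unstack(fill_value='')
                .reset_index())
print(wide)
```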

Pandas groupby multiple columns with rolling date offset - How?

I am trying to do a rolling sum across partitioned data based on a moving 2 business day window. It feels like it should be both easy and widely used, but the solution is beyond me.
#generate sample data
import pandas as pd
import numpy as np
import datetime
vals = [-4,17,-4,-16,2,20,3,10,-17,-8,-21,2,0,-11,16,-24,-10,-21,5,12,14,9,-15,-15]
grp = ['X']*6 + ['Y'] * 6 + ['X']*6 + ['Y'] * 6
typ = ['foo']*12+['bar']*12
dat = ['19/01/18','19/01/18','22/01/18','22/01/18','23/01/18','24/01/18'] * 4
#create dataframe with sample data
df = pd.DataFrame({'group': grp,'type':typ,'value':vals,'date':dat})
df.date = pd.to_datetime(df.date)
df.head(12)
gives the following (note this is just the head 12 rows):
date group type value
0 19/01/2018 X foo -4
1 19/01/2018 X foo 17
2 22/01/2018 X foo -4
3 22/01/2018 X foo -16
4 23/01/2018 X foo 2
5 24/01/2018 X foo 20
6 19/01/2018 Y foo 3
7 19/01/2018 Y foo 10
8 22/01/2018 Y foo -17
9 22/01/2018 Y foo -8
10 23/01/2018 Y foo -21
11 24/01/2018 Y foo 2
The desired results are (all rows shown here):
date group type 2BD Sum
1 19/01/2018 X foo 13
2 22/01/2018 X foo -7
3 23/01/2018 X foo -18
4 24/01/2018 X foo 22
5 19/01/2018 Y foo 13
6 22/01/2018 Y foo -12
7 23/01/2018 Y foo -46
8 24/01/2018 Y foo -19
9 19/01/2018 X bar -11
10 22/01/2018 X bar -19
11 23/01/2018 X bar -18
12 24/01/2018 X bar -31
13 19/01/2018 Y bar 17
14 22/01/2018 Y bar 40
15 23/01/2018 Y bar 8
16 24/01/2018 Y bar -30
I have viewed this question and tried
df.groupby(['group','type']).rolling('2d',on='date').agg({'value':'sum'}
).reset_index().groupby(['group','type','date']).agg({'value':'sum'}).reset_index()
This would work fine if 'value' were always positive, but that is not the case here. I have tried many other approaches that raised errors, which I can list if it helps. Can anyone help?
I expected the following to work:
g = lambda ts: ts.rolling('2B', on='date')['value'].sum()
df.groupby(['group', 'type']).apply(g)
However, I get an error as a business day is not a fixed frequency.
This brings me to the following, admittedly uglier, solution:
value_per_bday = lambda df: df.resample('B', on='date')['value'].sum()
df = df.groupby(['group', 'type']).apply(value_per_bday).stack()
value_2_bdays = lambda x: x.rolling(2, min_periods=1).sum()
df = df.groupby(level=['group', 'type']).apply(value_2_bdays)
Maybe it reads better as a function; your pick.
def resample_and_sum(x):
    x = x.resample('B', on='date')['value'].sum()
    x = x.rolling(2, min_periods=1).sum()
    return x

df = df.groupby(['group', 'type']).apply(resample_and_sum).stack()
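An equivalent self-contained sketch using pd.Grouper instead of resample-inside-apply (an alternative formulation, not the answer's exact code) reproduces the desired 2-business-day sums from the question:

```python
import pandas as pd

# Sample data from the question
vals = [-4,17,-4,-16,2,20,3,10,-17,-8,-21,2,0,-11,16,-24,-10,-21,5,12,14,9,-15,-15]
grp = ['X']*6 + ['Y']*6 + ['X']*6 + ['Y']*6
typ = ['foo']*12 + ['bar']*12
dat = ['19/01/18','19/01/18','22/01/18','22/01/18','23/01/18','24/01/18'] * 4
df = pd.DataFrame({'group': grp, 'type': typ, 'value': vals,
                   'date': pd.to_datetime(dat, dayfirst=True)})

# Sum per business day within each (group, type) partition...
daily = df.groupby(['group', 'type', pd.Grouper(key='date', freq='B')])['value'].sum()
# ...then roll a 2-row (= 2 business day) window within each partition
bd2 = (daily.groupby(level=['group', 'type'], group_keys=False)
            .apply(lambda s: s.rolling(2, min_periods=1).sum()))
```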
