Barplot comparing two columns - python

I would like to draw a bar plot comparing the evolution of 2 revenue variables on a monthly time axis (12 months of invoices).
I wanted to use sns.barplot, but I can't use "hue" (because the 2 variables aren't subcategories of a single column?). Is there another way that is as simple as hue? Can I "create" a hue?
Here is a small sample of my data:
(I did transform my table into a pivot table)
[In]
data_pivot['Revenue-Small-Seller-in'] = data_pivot["Small-Seller"] + data_pivot["Best-Seller"] + data_pivot["Medium-Seller"]
data_pivot['Revenue-Not-Small-Seller-in'] = data_pivot["Best-Seller"] + data_pivot["Medium-Seller"]
data_pivot
[Out]
+-----------+-------+------+-------------------------+-----------------------------+
| InvoiceNo | Month | Year | Revenue-Small-Seller-in | Revenue-Not-Small-Seller-in |
+-----------+-------+------+-------------------------+-----------------------------+
| 536365    | 12    | 2010 | 139.12                  | 139.12                      |
| 536366    | 12    | 2010 | 22.20                   | 11.10                       |
| 536367    | 12    | 2010 | 278.73                  | 246.93                      |
+-----------+-------+------+-------------------------+-----------------------------+
(sorry for the ugly presentation of my data, see the picture to see the complete table (as there are multiple columns))

You can do:
import matplotlib.pyplot as plt

# Plot the last two (revenue) columns of the pivot table as grouped bars.
render_df = data_pivot[data_pivot.columns[-2:]]
fig, ax = plt.subplots(1, 1)
render_df.plot(kind='bar', ax=ax)
ax.legend()
plt.show()
Output:
Or, in seaborn style as you requested:
import seaborn as sns

render_df = data_pivot[data_pivot.columns[-2:]].stack().reset_index()
sns.barplot(x='level_0', y=0, hue='level_1', data=render_df)
here render_df after stack() is:
+---+---------+-----------------------------+--------+
| | level_0 | level_1 | 0 |
+---+---------+-----------------------------+--------+
| 0 | 0 | Revenue-Small-Seller-in | 139.12 |
| 1 | 0 | Revenue-Not-Small-Seller-in | 139.12 |
| 2 | 1 | Revenue-Small-Seller-in | 22.20 |
| 3 | 1 | Revenue-Not-Small-Seller-in | 11.10 |
| 4 | 2 | Revenue-Small-Seller-in | 278.73 |
| 5 | 2 | Revenue-Not-Small-Seller-in | 246.93 |
+---+---------+-----------------------------+--------+
and output:
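If you would rather have named columns than stack's level_0/level_1, pd.melt is another way to "create" a hue. This is only a sketch, assuming InvoiceNo, Month and Year are ordinary columns of data_pivot:
import pandas as pd
import seaborn as sns

# Melt the two revenue columns into long form; the former column names become the hue.
long_df = data_pivot.melt(
    id_vars=['InvoiceNo', 'Month', 'Year'],
    value_vars=['Revenue-Small-Seller-in', 'Revenue-Not-Small-Seller-in'],
    var_name='revenue_type',
    value_name='revenue')

# One bar per month per revenue type, summing all invoices in each month.
sns.barplot(x='Month', y='revenue', hue='revenue_type', data=long_df, estimator=sum)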

Related

Python - pandas remove duplicate rows based on condition

I have a csv which has data that looks like this
+----+------+----------------------------+
| id | code | date                       |
+----+------+----------------------------+
| 1  | 2    | 2022-10-05 07:22:39+00::00 |
| 1  | 0    | 2022-11-05 02:22:35+00::00 |
| 2  | 3    | 2021-01-05 10:10:15+00::00 |
| 2  | 0    | 2019-01-11 10:05:21+00::00 |
| 2  | 1    | 2022-01-11 10:05:22+00::00 |
| 3  | 2    | 2022-10-10 11:23:43+00::00 |
+----+------+----------------------------+
I want to remove duplicate ids based on the following conditions:
For the code column, choose a value that is not equal to 0, and among those pick the one with the latest timestamp.
Add another column, prev_code, containing a list of all the remaining code values that did not end up in the code column.
Something like this:
+----+------+-----------+
| id | code | prev_code |
+----+------+-----------+
| 1  | 2    | [0]       |
| 2  | 1    | [0,2]     |
| 3  | 2    | []        |
+----+------+-----------+
There is probably a sleeker solution but something along the following lines should work.
import pandas as pd

df = pd.read_csv('file.csv')
# For each id, among the non-zero codes keep the one with the latest timestamp.
lastcode = df[df.code != 0].groupby('id').apply(lambda block: block[block['date'] == block['date'].max()]['code'])
# For each id, collect every code value other than the chosen one.
prev_codes = df.groupby('id').agg(code=('code', lambda x: [val for val in x if val != lastcode[x.name].values[0]]))['code']
pd.DataFrame({'id': list(map(lambda x: x[0], lastcode.index.values)), 'code': lastcode.values, 'prev_code': prev_codes.values})
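For comparison, a possibly sleeker groupby/apply sketch of the same idea (it assumes every id has at least one non-zero code, as in the sample, and that the date column sorts chronologically):
import pandas as pd

df = pd.read_csv('file.csv')

def summarise(group):
    # Latest non-zero code for this id; every other code goes into prev_code.
    chosen = group[group['code'] != 0].sort_values('date')['code'].iloc[-1]
    rest = [c for c in group['code'] if c != chosen]
    return pd.Series({'code': chosen, 'prev_code': rest})

result = df.groupby('id').apply(summarise).reset_index()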

Using pandas apply to pass in both a row and the entire dataframe with it [duplicate]

This question already has an answer here:
Pandas - Finding percent contributed by each group
(1 answer)
Closed 2 years ago.
I have a df and I want to create some new columns from it. How would I use the apply function to pass in both the row and the entire df? I need the entire df to do some filtering, and the result depends on the values in each row.
Or maybe I don't need to use apply, but that's the first thing that came to my mind. Thank you and all help is appreciated!
Ex of df:
+----+--------+--------+
| ID | Family | Amount |
+----+--------+--------+
| 1 | A | 2 |
| 2 | A | 10 |
| 3 | B | 4 |
| 4 | B | 7 |
+----+--------+--------+
Result:
+----+--------+--------+-----------+------------+
| ID | Family | Amount | Total_Fam | Id_Percent |
+----+--------+--------+-----------+------------+
| 1 | A | 2 | 12 | .166 |
| 2 | A | 10 | 12 | .833 |
| 3 | B | 4 | 11 | .363 |
| 4 | B | 7 | 11 | .636 |
+----+--------+--------+-----------+------------+
First group by Family and transform Amount to get the family total; then you can divide Amount by the new column directly.
import numpy as np

df['Total_Fam'] = df.groupby('Family')['Amount'].transform(np.sum)
df['Id_Percent'] = df['Amount'] / df['Total_Fam']
df
Using apply on a single column passes each value individually. If you apply a function across the entire DataFrame with axis=1, each call sees the whole row, so you can use all of its columns. In the example below, df['new_2'] is built by applying a function to the DataFrame row-wise, so I do not need to pass the df to it separately.
import pandas as pd
import seaborn as sns

df = sns.load_dataset('iris')
df['new'] = df['species'].apply(lambda x: x[:2])

def sumIsMore(dataframe):
    x = dataframe['sepal_length']
    y = dataframe['sepal_width']
    return x + y >= 8.5

df['new_2'] = df.apply(sumIsMore, axis=1)
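If you really do need the whole DataFrame inside the row-wise function, as the question asks, one option is simply to capture it in a lambda (or pass it via apply's args= parameter). A sketch using the Family/Amount frame from the question:
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3, 4],
                   'Family': ['A', 'A', 'B', 'B'],
                   'Amount': [2, 10, 4, 7]})

def id_percent(row, full_df):
    # Filter the full frame by this row's Family, then compute the row's share.
    family_total = full_df.loc[full_df['Family'] == row['Family'], 'Amount'].sum()
    return row['Amount'] / family_total

# The lambda passes both the row and the entire DataFrame.
df['Id_Percent'] = df.apply(lambda row: id_percent(row, df), axis=1)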

How to differentiate, by color, values from two date columns on the x-axis with matplotlib in Python?

I am trying to make a plot with dates on the x-axis and some values on the y-axis, but I have two date columns. I would like to highlight the dates from the second column with dots of another color. Is that possible?
+----+------------+------------+-------+
| ID | Date1      | Date2      | value |
+----+------------+------------+-------+
| 1  | 2008-05-14 | 2010-03-28 | 5     |
| 1  | 2005-12-07 | 2010-03-28 | 3     |
| 1  | 2008-10-27 | 2010-03-28 | 6     |
+----+------------+------------+-------+
import matplotlib.pyplot as plt

df1 = df[df['ID'] == 1]
df1 = df1.sort_values(by='Date1')
date = df1['Date1']
res = df1['value']
fig, ax = plt.subplots()
ax.plot(date, res, 'o-')
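One possible approach (a sketch, assuming you want the same values drawn again at the Date2 positions in a different color):
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical data mirroring the table above.
df = pd.DataFrame({'ID': [1, 1, 1],
                   'Date1': pd.to_datetime(['2008-05-14', '2005-12-07', '2008-10-27']),
                   'Date2': pd.to_datetime(['2010-03-28', '2010-03-28', '2010-03-28']),
                   'value': [5, 3, 6]})

df1 = df[df['ID'] == 1].sort_values(by='Date1')

fig, ax = plt.subplots()
# Values over Date1 as a connected line with markers.
ax.plot(df1['Date1'], df1['value'], 'o-', label='Date1')
# The same values plotted at Date2, as dots in another color.
ax.plot(df1['Date2'], df1['value'], 'o', color='red', label='Date2')
ax.legend()
plt.show()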

pandas plot columns from two dataframes in one figure

I have a dataframe consisting of mean and std-dev of distributions
df.head()
+---+---------+----------------+-------------+---------------+------------+
| | user_id | session_id | sample_mean | sample_median | sample_std |
+---+---------+----------------+-------------+---------------+------------+
| 0 | 1 | 20081023025304 | 4.972789 | 5 | 0.308456 |
| 1 | 1 | 20081023025305 | 5.000000 | 5 | 1.468418 |
| 2 | 1 | 20081023025306 | 5.274419 | 5 | 4.518189 |
| 3 | 1 | 20081024020959 | 4.634855 | 5 | 1.387244 |
| 4 | 1 | 20081026134407 | 5.088195 | 5 | 2.452059 |
+---+---------+----------------+-------------+---------------+------------+
From this, I plot a histogram of the distribution
plt.hist(df['sample_mean'],bins=50)
plt.xlabel('sampling rate (sec)')
plt.ylabel('Frequency')
plt.title('Histogram of trips mean sampling rate')
plt.show()
I then write a function to compute pdf and cdf, passing dataframe and column name:
def compute_distrib(df, col):
    stats_df = df.groupby(col)[col].agg('count').pipe(pd.DataFrame).rename(columns={col: 'frequency'})
    # PDF
    stats_df['pdf'] = stats_df['frequency'] / sum(stats_df['frequency'])
    # CDF
    stats_df['cdf'] = stats_df['pdf'].cumsum()
    stats_df = stats_df.reset_index()
    return stats_df
So for example:
stats_df = compute_distrib(df, 'sample_mean')
stats_df.head(2)
+---+---------------+-----------+----------+----------+
| | sample_median | frequency | pdf | cdf |
+---+---------------+-----------+----------+----------+
| 0 | 1 | 4317 | 0.143575 | 0.143575 |
| 1 | 2 | 10169 | 0.338200 | 0.481775 |
+---+---------------+-----------+----------+----------+
Then I plot the cdf distribution this way:
ax1 = stats_df.plot(x = 'sample_mean', y = ['cdf'], grid = True)
ax1.legend(loc='best')
Goal:
I would like to plot these figures in one figure side-by-side instead of plotting separately and somehow putting them together in my slides.
You can use matplotlib.pyplot.subplots to draw multiple plots next to each other:
import matplotlib.pyplot as plt

fig, axs = plt.subplots(nrows=1, ncols=2)
# Pass the data you wish to plot.
axs[0].hist(...)
axs[1].plot(...)
plt.show()
Note that with a single row of subplots, axs is a one-dimensional array, so it is indexed as axs[0] and axs[1].
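Filled in with the data from the question (assuming the df and compute_distrib defined above), that might look like:
import matplotlib.pyplot as plt

fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(10, 4))

# Left panel: histogram of the sample means.
axs[0].hist(df['sample_mean'], bins=50)
axs[0].set_xlabel('sampling rate (sec)')
axs[0].set_ylabel('Frequency')
axs[0].set_title('Histogram of trips mean sampling rate')

# Right panel: the cdf computed by compute_distrib.
stats_df = compute_distrib(df, 'sample_mean')
axs[1].plot(stats_df['sample_mean'], stats_df['cdf'], label='cdf')
axs[1].grid(True)
axs[1].legend(loc='best')

plt.show()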

Plotting an awkward pandas multi index dataframe

I have a very awkward dataframe that looks like this:
+----+------+-------+-------+--------+----+--------+
| | | hour1 | hour2 | hour 3 | … | hour24 |
+----+------+-------+-------+--------+----+--------+
| id | date | | | | | |
| 1 | 3 | 4 | 0 | 96 | 88 | 35 |
| | 4 | 10 | 2 | 54 | 42 | 37 |
| | 5 | 9 | 32 | 8 | 70 | 34 |
| | 6 | 36 | 89 | 69 | 46 | 78 |
| 2 | 5 | 17 | 41 | 48 | 45 | 71 |
| | 6 | 50 | 66 | 82 | 72 | 59 |
| | 7 | 14 | 24 | 55 | 20 | 89 |
| | 8 | 76 | 36 | 13 | 14 | 21 |
| 3 | 5 | 97 | 19 | 41 | 61 | 72 |
| | 6 | 22 | 4 | 56 | 82 | 15 |
| | 7 | 17 | 57 | 30 | 63 | 88 |
| | 8 | 83 | 43 | 35 | 8 | 4 |
+----+------+-------+-------+--------+----+--------+
For each id there is a list of dates and for each date the hour columns represent that full day's worth of data broken out by hour for the full 24hrs.
What I would like to do is plot (using matplotlib) the full hourly data for each of the ids, but I can't think of a way to do this. I was looking into the possibility of creating numpy matrices, but I'm not sure if that is the right path to go down.
Clarification: Essentially, for each id I want to concatenate all the hourly data together in order and plot that. I already have the days in the proper order, so I imagine it's just a matter of finding a way to put all of the hourly data for each id into one object.
Any thoughts on how to best accomplish this?
Here is some sample data in csv format: http://www.sharecsv.com/s/e56364930ddb3d04dec6994904b05cc6/test1.csv
Here is one approach:
import numpy as np
from matplotlib import pyplot

# d is the DataFrame with the (id, date) MultiIndex from the question.
for groupID, data in d.groupby(level='id'):
    fig = pyplot.figure()
    ax = fig.gca()
    # String each id's rows out into one long hourly series.
    ax.plot(data.values.ravel())
    ax.set_xticks(np.arange(len(data)) * 24)
    ax.set_xticklabels(data.index.get_level_values('date'))
ravel is a numpy method that will string out multiple rows into one long 1D array.
Beware of running this interactively on a large dataset, as it creates a separate figure for each id. If you want to save the plots or the like, set a noninteractive matplotlib backend and use savefig to save each figure, then close it before creating the next one.
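A sketch of that save-and-close pattern, reusing d and the loop from above (the backend choice and the 'id_{}.png' filename pattern are just assumptions):
import matplotlib
matplotlib.use('Agg')  # noninteractive backend; set before importing pyplot
import numpy as np
from matplotlib import pyplot

for groupID, data in d.groupby(level='id'):
    fig = pyplot.figure()
    ax = fig.gca()
    ax.plot(data.values.ravel())
    ax.set_xticks(np.arange(len(data)) * 24)
    ax.set_xticklabels(data.index.get_level_values('date'))
    fig.savefig('id_{}.png'.format(groupID))
    pyplot.close(fig)  # free the figure before creating the next one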
It might also be of interest to stack the data frame so that you have the dates and times together in the same index. For example, doing
df = df.stack().unstack(0)
will put the dates and times in the index and the ids as the column names. Calling df.plot() will then give you a line plot for each time series on the same axes, so you could do
ax = df.stack().unstack(0).plot()
and format the axes either by passing arguments to the plot method or by calling methods on ax.
I am not totally happy with this solution but maybe it can serve as a starting point. Since your data is cyclic, I chose a polar chart. Unfortunately, the resolution in the y direction is poor, so I zoomed into the plot manually:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from itertools import cycle

df = pd.read_csv('test1.csv')
df_new = df.set_index(['id', 'date'])
n = len(df_new.columns)
# convert from hours to rad
angle = np.linspace(0, 2 * np.pi, n)
# color palette to cycle through
n_data = len(df_new.T.columns)
color = plt.cm.Paired(np.linspace(0, 1, n_data // 2))  # divided by two since you have 'red' and 'blue'
c_iter = cycle(color)
fig = plt.figure()
ax = fig.add_subplot(111, polar=True)
# loop through the columns and manually select one category
for ind, i in enumerate(df_new.T.columns):
    if i[0] == 'red':
        ax.plot(angle, df_new.T[i].values, color=next(c_iter), label=i, linewidth=2)
# set the labels
ax.set_xticks(np.linspace(0, 2 * np.pi, 24, endpoint=False))
ax.set_xticklabels(range(24))
# make the legend
ax.legend(loc='upper left', bbox_to_anchor=(1.2, 1.1))
plt.show()
(Polar plot shown at zoom levels 0, 1 and 2.)
