I have a dataframe that looks like this:
id|date |amount
1 |02-04-18|3000
1 |05-04-19|5000
1 |10-04-19|2600
2 |10-04-19|2600
2 |11-04-19|3000
I want to plot the amount spent over time for each unique id and add an average trend line. This is the code that I have:
import matplotlib.pyplot as plt
import pandas as pd
# one row per id, one column per date; missing combinations become 0
temp_m = df.pivot_table(index='id', columns='date', values='amount', fill_value=0).reset_index()
# back to long format: one row per (id, date) pair
temp_m = pd.melt(temp_m, id_vars=['id'], value_name='amount')
temp_m['date'] = temp_m['date'].astype('str')
fig, ax = plt.subplots(figsize=(20,10))
for i, group in temp_m.groupby('id'):
    group.plot('date', y='amount', ax=ax, legend=None)
plt.xticks(rotation = 90)
Each line is a unique customer.
Goal: I want to add another line that is the average of all the individual customer trends.
Also, if there is a better way to graph the individual lines, please let me know.
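First, reshape the data so that each id becomes a column indexed by date (the date column should already be parsed as datetime, e.g. with pd.to_datetime, so that to_julian_date is available on the index):
import seaborn as sns
agg = df.set_index(['date', 'id']).unstack()
agg.columns = agg.columns.get_level_values(-1)
This makes plotting very easy:
sns.lineplot(data=agg)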
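The average trend can be calculated by fitting a linear regression to each id and averaging the fitted lines:
import pandas as pd
from sklearn.linear_model import LinearRegression
regress = {}
# Julian dates as a 2-D feature array of shape (n_dates, 1)
idx = agg.index.to_julian_date().to_numpy()[:, None]
for c in agg.columns:
    # note: fillna(0) treats dates without a purchase as zero spend
    regress[c] = LinearRegression().fit(idx, agg[c].fillna(0)).predict(idx)
trend = pd.Series(pd.DataFrame(regress).mean(axis=1).values, agg.index)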
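If you would rather not pull in scikit-learn, the same idea can be sketched with np.polyfit. This is only a rough sketch under a few assumptions: the dates are read as day-month-year, each id is fitted on its own observed points only (no zero filling, so the result differs from the regression above), and the names days, grid and fitted are made up for the illustration.

```python
import numpy as np
import pandas as pd

# Sample data from the question; dates are ASSUMED to be day-month-year.
df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2],
    "date": ["02-04-18", "05-04-19", "10-04-19", "10-04-19", "11-04-19"],
    "amount": [3000, 5000, 2600, 2600, 3000],
})
df["date"] = pd.to_datetime(df["date"], format="%d-%m-%y")

# Days since the first observation serve as the numeric x axis.
df["days"] = (df["date"] - df["date"].min()).dt.days

# Common grid of days on which every fitted line is evaluated.
grid = np.sort(df["days"].unique())

fitted = {}
for cust_id, group in df.groupby("id"):
    # Degree-1 least-squares fit: amount ~ slope * days + intercept.
    slope, intercept = np.polyfit(group["days"], group["amount"], 1)
    fitted[cust_id] = slope * grid + intercept

# Average of the per-id fitted lines at each grid point.
trend = pd.DataFrame(fitted, index=grid).mean(axis=1)
```

The resulting trend is indexed by day offset, so it can be drawn on the same axes as the individual lines once those are plotted against the same offsets.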
I have a histogram:
# Let's load the diabetes dataset that ships with scikit-learn.
from sklearn.datasets import load_diabetes
#sklearn gives you the data as a dictionary, so
diabetes = load_diabetes(as_frame=True)
data = diabetes['frame']
import matplotlib.pyplot as plt
%matplotlib inline
bmi_hist = plt.hist(data['bmi'], density=False)
bmi_hist = plt.ylabel("Frequency")
bmi_hist = plt.xlabel("Normalized BMI")
bp_hist = plt.hist(data['bp'], density=False)
bp_hist = plt.ylabel("Frequency")
bp_hist = plt.xlabel("Normalized BP")
This is a histogram for two of the columns in the frame above.
I want to compare these two in a scatter graph. My attempts haven't been quite successful as I know I need an X and a Y to plot.
I thought I would use the same axis as the histogram:
y_bmi = data['bmi'].value_counts() # frequency
x_bmi = data['bmi'] # normalized value
ax1 = df.plot.scatter(x = x_bmi, y= y_bmi, c='DarkBlue')
But this can only be used on a dataframe, so do I have to copy the values of the bmi column into a new dataframe, or is there a simpler method?
Any help would be greatly appreciated.
Many Thanks.
Hi everyone, I'm trying to plot data from a CSV file. There are 7 columns in my CSV. I've already plotted the Genre column with this code:
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
import numpy as np
df = pd.read_csv('booksellers.csv')
genre = df['Genre']
countFiction = 0
countNonFiction = 0
for i in genre:
    if i == "Fiction":
        countFiction += 1
    else:
        countNonFiction += 1
labels = 'Fiction','Non Fiction'
sizes = [countFiction,countNonFiction]
fig1, ax1 = plt.subplots()
ax1.pie(sizes,labels=labels,startangle=90,autopct='%1.1f%%')
plt.show()
Now I want to plot two more columns, 'Author' and the average 'User Rating'. If an author appears more than once, how can I get a single entry per author with their average user rating? Also, what kind of graph suits this data?
# You can iterate row by row and collect the ratings for each author
from statistics import mean
data = {}
for index, row in df.iterrows():
    author = row['Author']
    if author not in data:
        data[author] = []
    data[author].append(row['User Rating'])
rates_by_authors = {}
for k in data.keys():
    rates_by_authors[k] = mean(data[k])
# After building the dictionary with the code above:
# list(rates_by_authors.keys()) is the list of authors, usable as the X axis
# list(rates_by_authors.values()) is the list of average ratings per author, usable as the Y axis
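The same per-author average can probably be had more directly with a pandas groupby. A minimal sketch, assuming the column names 'Author' and 'User Rating' from the question; the small dataframe here is a made-up stand-in for booksellers.csv.

```python
import pandas as pd

# Hypothetical stand-in for booksellers.csv; only the two relevant columns.
df = pd.DataFrame({
    "Author": ["Ann", "Bob", "Ann", "Cy", "Bob"],
    "User Rating": [4.0, 3.5, 5.0, 4.2, 4.5],
})

# One entry per author holding the mean of that author's ratings,
# sorted from highest to lowest average.
avg_rating = (
    df.groupby("Author")["User Rating"]
      .mean()
      .sort_values(ascending=False)
)

authors = avg_rating.index.tolist()   # X axis labels
ratings = avg_rating.tolist()         # Y axis values
```

A bar chart is a common choice for one numeric value per category, for example avg_rating.plot.bar(). With many authors it may be worth plotting only the top few.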
I'm trying to do a line plot with one line per column. My dataset looks like this:
I'm using this code, but it's giving me the following error:
ValueError: Wrong number of items passed 3, placement implies 27
plot_x = 'bill__effective_due_date'
plot_y = ['RR_bucket1_perc', 'RR_bucket7_perc', 'RR_bucket14_perc']
ax = sns.pointplot(x=plot_x, y=plot_y, data=df_rollrates_plot, marker="o", palette=sns.color_palette("coolwarm"))
display(ax.figure)
Maybe it's a silly question, but I'm new to Python, so I'm not sure how to do this. This is my expected output:
Thanks!!
You can plot the dataframe as follows (edit: I updated the code below to make bill__effective_due_date the index of the dataframe):
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
rr1 = [20,10,2,10,2,5]
rr7 = [17,8,2,8,2,4]
rr14 = [12,5,2,5,2,3]
x = ['Nov-1','Nov-2','Nov-3','Nov-4','Nov-5','Nov-6']
df_rollrates_plot = pd.DataFrame({'RR_bucket1_perc':rr1,
'RR_bucket7_perc':rr7,
'RR_bucket14_perc':rr14})
df_rollrates_plot.index = x
df_rollrates_plot.index.name = 'bill__effective_due_date'
sns.lineplot(data=df_rollrates_plot)
plt.grid()
Your data is in the wrong shape to take advantage of the hue parameter in seaborn's lineplot. You need to stack it so that the columns become categorical values.
import pandas as pd
import seaborn as sns
rr1 = [20,10,2,10,2,5]
rr7 = [17,8,2,8,2,4]
rr14 = [12,5,2,5,2,3]
x = ['Nov-1','Nov-2','Nov-3','Nov-4','Nov-5','Nov-6']
df = pd.DataFrame({'bill_effective_due_date':x,
'RR_bucket1_perc':rr1,
'RR_bucket7_perc':rr7,
'RR_bucket14_perc':rr14})
# This is where you are reshaping your data to make it work like you want
df = df.set_index('bill_effective_due_date').stack().reset_index()
df.columns=['bill_effective_due_date','roll_rates_perc','roll_rates']
sns.lineplot(data=df, x='bill_effective_due_date',y='roll_rates', hue='roll_rates_perc', marker='o')
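As an alternative to stack plus renaming, the same wide-to-long reshape can be sketched with DataFrame.melt, which names the new columns in one step. The column names bucket and roll_rate are made up for this illustration, and the data is a shortened copy of the example above.

```python
import pandas as pd

df = pd.DataFrame({
    "bill_effective_due_date": ["Nov-1", "Nov-2", "Nov-3"],
    "RR_bucket1_perc": [20, 10, 2],
    "RR_bucket7_perc": [17, 8, 2],
    "RR_bucket14_perc": [12, 5, 2],
})

# Wide -> long: one row per (date, bucket) pair.
long_df = df.melt(
    id_vars="bill_effective_due_date",
    var_name="bucket",        # hypothetical name for the former column headers
    value_name="roll_rate",   # hypothetical name for the values
)
print(long_df.shape)
```

The long frame can then be passed to seaborn in the same way, e.g. sns.lineplot(data=long_df, x='bill_effective_due_date', y='roll_rate', hue='bucket', marker='o').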
I have a pandas dataframe. I am making a scatter plot and trying to categorize the data using the colorbar. I did it for monthly and quarterly classification, as shown in the example code below.
a = np.random.rand(366)
b = np.random.rand(366)*0.4
index = (pd.date_range(pd.to_datetime('01-01-2000'), periods=366))
df = pd.DataFrame({'a':a,'b':b},index = index)
plt.scatter(df['a'],df['b'],c = df.index.month)
plt.colorbar()
And also by quarter:
plt.scatter(df['a'],df['b'],c = df.index.quarter)
plt.colorbar()
My question: is there any way to categorize by half-year, for example months 1-6 and 7-12, and also by a shifted split such as months 10-3 and 4-9?
Thank you, any help or suggestions will be highly appreciated.
Write a custom function and pass its result to the c (color) argument of scatter. Here is an example for a half-yearly division. You can use it as a template for your own split function:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# if the month is 1 to 6 it is the first half-year, else the second half-year
def halfyear(m):
    return 0 if (m <= 6) else 1
# vectorize function to use with Series
hy = np.vectorize(halfyear)
a = np.random.rand(366)
b = np.random.rand(366)*0.4
index = (pd.date_range(pd.to_datetime('01-01-2000'), periods=366))
df = pd.DataFrame({'a':a,'b':b},index = index)
# apply custom function 'hy' for 'c' argument
plt.scatter(df['a'],df['b'], c = hy(df.index.month))
plt.colorbar()
plt.show()
Another way is to use a lambda function:
plt.scatter(df['a'],df['b'], \
c = df.index.map(lambda m: 0 if (m.month > 0 and m.month < 7) else 1))
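For the shifted split the question also asks about (months 10-3 versus 4-9), the template can be generalized with a start month. This is only a sketch; the function name half_year and its start parameter are made up for the illustration.

```python
def half_year(month, start=1):
    """Return 0 if `month` falls in the six months beginning at `start`, else 1."""
    # The modulo wraps around the year end, so start=10 covers months 10, 11, 12, 1, 2, 3.
    return 0 if (month - start) % 12 < 6 else 1

months = list(range(1, 13))
calendar_halves = [half_year(m) for m in months]            # 1-6 vs 7-12
shifted_halves = [half_year(m, start=10) for m in months]   # 10-3 vs 4-9
print(shifted_halves)  # [0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0]
```

With the dataframe from the question, the colors could then be built as c=[half_year(m, start=10) for m in df.index.month].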
I would opt for a solution that does not completely discard the monthly information. Using colors that are similar but distinguishable for the months makes it possible to visually classify by half-year as well as by month.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.colors
a = np.random.rand(366)
b = np.random.rand(366)*0.4
index = (pd.date_range(pd.to_datetime('01-01-2000'), periods=366))
df = pd.DataFrame({'a':a,'b':b},index = index)
colors=["crimson", "orange", "darkblue", "skyblue"]
cdic = list(zip([0,.499,.5,1],colors))
cmap = matplotlib.colors.LinearSegmentedColormap.from_list("name", cdic,12 )
norm = matplotlib.colors.BoundaryNorm(np.arange(13)+.5,12)
plt.scatter(df['a'],df['b'],c = df.index.month, cmap=cmap, norm=norm)
plt.colorbar(ticks=np.arange(1,13))
plt.show()
I have three dataframes, each containing 17 sets of data with groups A, B, and C, as shown in the following code snippet:
import pandas as pd
import numpy as np
data1 = pd.DataFrame(np.random.rand(17,3), columns=['A','B','C'])
data2 = pd.DataFrame(np.random.rand(17,3)+0.2, columns=['A','B','C'])
data3 = pd.DataFrame(np.random.rand(17,3)+0.4, columns=['A','B','C'])
I would like to draw a box plot to compare the three groups, as shown in the figure below.
I am trying to make the plot using seaborn's box plot as follows:
import seaborn as sns
sns.boxplot(data1, groupby='A','B','C')
but obviously this does not work. Can someone please help?
Consider assigning an indicator like Location to distinguish your three sets of data. Then concatenate all three and melt the data to retrieve one value column, one Letter categorical column, and one Location column, all inputs into sns.boxplot:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
data1 = pd.DataFrame(np.random.rand(17,3), columns=['A','B','C']).assign(Location=1)
data2 = pd.DataFrame(np.random.rand(17,3)+0.2, columns=['A','B','C']).assign(Location=2)
data3 = pd.DataFrame(np.random.rand(17,3)+0.4, columns=['A','B','C']).assign(Location=3)
cdf = pd.concat([data1, data2, data3])
mdf = pd.melt(cdf, id_vars=['Location'], var_name='Letter')
print(mdf.head())
#    Location Letter     value
# 0         1      A  0.223565
# 1         1      A  0.515797
# 2         1      A  0.377588
# 3         1      A  0.687614
# 4         1      A  0.094116
ax = sns.boxplot(x="Location", y="value", hue="Letter", data=mdf)
plt.show()