Add text annotation to plot from a pandas dataframe - python

My Code:
import matplotlib.pyplot as plt
import pandas as pd
import os, glob
path = r'C:/Users/New folder'
all_files = glob.glob(os.path.join(path, "*.txt"))
df = pd.DataFrame()
for file_ in all_files:
    file_df = pd.read_csv(file_,sep=',', parse_dates=[0], infer_datetime_format=True,header=None, usecols=[0,1,2,3,4,5,6], names=['Date','Time','open', 'high', 'low', 'close','volume','tradingsymbol'])
    # Append this file's rows to the combined frame
    df = pd.concat([df, file_df], ignore_index=True)
df = df[['Date','Time','close','volume','tradingsymbol']]
df["Time"] = pd.to_datetime(df['Time'])
df.set_index('Time', inplace=True)
print(df)
fig, axes = plt.subplots(nrows=2, ncols=1)
################### Volume ###########################
df.groupby('tradingsymbol')['volume'].plot(legend=True, rot=0, grid=True, ax=axes[0])
################### PRICE ###########################
df.groupby('tradingsymbol')['close'].plot(legend=True, rot=0, grid=True, ax=axes[1])
plt.show()
My current output looks like this:
I need to add text annotations to the matplotlib plot. My desired output is similar to the image below:

It's hard to answer this question without access to your dataset or a simpler example, but I'll try my best.
Let's begin by setting up a dataframe which may or may not resemble your data:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(low=0, high=10, size=(5, 3)),
                  columns=['a', 'b', 'c'])
With the dataset we'll now proceed to plot it with
fig, ax = plt.subplots(1, 1)
df.plot(legend=True, ax=ax)
Finally, we'll loop over the columns and annotate each data point:
for col in df.columns:
    for id, val in enumerate(df[col]):
        ax.text(id, val, str(val))
This gave me the following plot, which resembles your desired figure.
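If the labels sit directly on top of the markers, a small offset may help. The following is a minimal sketch of the same loop using ax.annotate instead of ax.text; the 5-point vertical offset, the seeded random data and the Agg backend line are illustrative assumptions, not part of the answer above.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; drop this line to get a window
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Stand-in data in the same shape as the answer's example (seeded for repeatability)
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(low=0, high=10, size=(5, 3)),
                  columns=['a', 'b', 'c'])

fig, ax = plt.subplots(1, 1)
df.plot(legend=True, ax=ax)

# Annotate every data point, shifting the label 5 points above the marker
for col in df.columns:
    for x, y in df[col].items():
        ax.annotate(str(y), xy=(x, y), xytext=(0, 5),
                    textcoords='offset points', ha='center')

fig.savefig("annotated.png")
```

Because the offset is given in points rather than data units, the gap between marker and label stays the same when the axis limits change.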

Related

Seaborn xaxis with large timeline

I have around 4475 rows of csv data like below:
,Time,Values,Size
0,1900-01-01 23:11:30.368,2,
1,1900-01-01 23:11:30.372,2,
2,1900-01-01 23:11:30.372,2,
3,1900-01-01 23:11:30.372,2,
4,1900-01-01 23:11:30.376,2,
5,1900-01-01 23:11:30.380,,
6,1900-01-01 23:11:30.380,,
7,1900-01-01 23:11:30.380,,
8,1900-01-01 23:11:30.380,,321
9,1900-01-01 23:11:30.380,,111
.
.
4474,1900-01-01 23:11:32.588,,
When I try to create a simple seaborn lineplot with the code below, it creates a line chart, but the line is continuous even though my data, i.e. 'Values', has many empty/NaN values which should show up as gaps in the chart. How can I do that?
from datetime import datetime
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.read_csv("Data.csv")
sns.set(rc={'figure.figsize':(13,4)})
ax =sns.lineplot(x="Time", y="Values", data=df)
ax.set(xlabel='Time', ylabel='Values')
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()
As reported in this answer:
I've looked at the source code and it looks like lineplot drops nans from the DataFrame before plotting. So unfortunately it's not possible to do it properly.
So, the easiest way to do it is to use matplotlib in place of seaborn.
In the code below I generate a dataframe like yours, with 20% missing values in the 'Values' column, and use matplotlib to draw the plot:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.DataFrame({'Time': pd.date_range(start = '1900-01-01 23:11:30', end = '1900-01-01 23:11:30.1', freq = 'ms')})
df['Values'] = np.random.randint(low = 2, high = 10, size = len(df))
df['Values'] = df['Values'].mask(np.random.random(df['Values'].shape) < 0.2)
fig, ax = plt.subplots(figsize = (13, 4))
ax.plot(df['Time'], df['Values'])
ax.set(xlabel = 'Time', ylabel = 'Values')
plt.xticks(rotation = 90)
plt.tight_layout()
plt.show()
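The same approach can be tried on CSV text laid out like the sample in the question. This is only a sketch: the rows below are a shortened, made-up excerpt in the same column layout, read from a string rather than from Data.csv, and the Agg backend line is an assumption so that it runs without a display.

```python
import io

import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; drop this line to get a window
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up excerpt in the question's layout: unnamed index, Time, Values, Size
csv_text = """,Time,Values,Size
0,1900-01-01 23:11:30.368,2,
1,1900-01-01 23:11:30.372,2,
2,1900-01-01 23:11:30.376,2,
3,1900-01-01 23:11:30.380,,
4,1900-01-01 23:11:30.384,,321
5,1900-01-01 23:11:30.388,3,
6,1900-01-01 23:11:30.392,3,
"""

# Parse 'Time' as datetimes; empty 'Values' cells become NaN
df = pd.read_csv(io.StringIO(csv_text), index_col=0, parse_dates=['Time'])

fig, ax = plt.subplots(figsize=(13, 4))
# matplotlib keeps the NaN entries, so the line breaks where values are missing
(line,) = ax.plot(df['Time'], df['Values'])
ax.set(xlabel='Time', ylabel='Values')
fig.autofmt_xdate(rotation=90)
fig.savefig("gaps.png")
```

Note that an isolated value with NaN on both sides has no neighbour to connect to, so it is only visible if a marker is added, for example marker='o'.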

Dataframe Bar plot with Seaborn

I'm trying to create a bar plot from a DataFrame with Datetime Index.
This is an example working code:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
sns.set()
index = pd.date_range('2012-01-01', periods=48, freq='M')
data = np.random.randint(100, size = (len(index),1))
df = pd.DataFrame(index=index, data=data, columns=['numbers'])
fig, ax = plt.subplots()
ax.bar(df.index, df['numbers'])
The result is:
As you can see, the white bars cannot be distinguished well against the background (why?).
I tried using instead:
df['numbers'].plot(kind='bar')
import matplotlib.ticker as ticker
ticklabels = df.index.strftime('%Y-%m')
ax.xaxis.set_major_formatter(ticker.FixedFormatter(ticklabels))
with this result:
But in this way I lose the automatic xticks labels (and grid) 6-months spacing.
Any idea?
You can just change the style:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
index = pd.date_range('2012-01-01', periods=48, freq='M')
data = np.random.randint(100, size = (len(index),1))
df = pd.DataFrame(index=index, data=data, columns=['numbers'])
plt.figure(figsize=(12, 5))
plt.style.use('default')
plt.bar(df.index,df['numbers'],color="red")
You are not actually using seaborn to draw the bars. Replace ax.bar(df.index, df['numbers'])
with
sns.barplot(x=df.index, y=df['numbers'], ax=ax)
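As for the "why?" in the question, one likely explanation (an assumption, not something the answers above confirm) is that ax.bar uses a default width of 0.8 in data units, which is 0.8 days on a date axis, while the seaborn style draws white edges around patches. With monthly data the bars are then so thin that mostly the white edge is visible. A minimal sketch that keeps ax.bar and the date axis but passes an explicit width; the 20-day width and the month-start dates are illustrative choices:

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; drop this line to get a window
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

sns.set()

# Monthly sample data (month-start dates here, unlike the month-end dates in the question)
index = pd.date_range('2012-01-01', periods=48, freq='MS')
data = np.random.randint(100, size=(len(index), 1))
df = pd.DataFrame(index=index, data=data, columns=['numbers'])

fig, ax = plt.subplots()
# Give the bars an explicit width of 20 days instead of the default 0.8 days
bars = ax.bar(df.index, df['numbers'], width=pd.Timedelta(days=20))
fig.savefig("bars.png")
```

Since the x values stay real dates, the automatic date tick locator and the grid spacing are left untouched.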

Divide axes.table multiindex into different columns

I am trying to create a crosstab with a multi-level index, which I need to print to a PDF.
I am using matplotlib to write the data to the PDF, and I cannot find any method that prints a dataframe to PDF directly.
So I am using axes.table to convert the dataframe into a table that can be printed to the PDF.
However, the two index levels of the dataframe are combined into one in the table.
See the output below.
Can these index levels ('ABC', 'D') be separated into two columns, like ABC | D?
If yes, how?
import matplotlib.pyplot as plt
import matplotlib.backends.backend_pdf
import pandas as pd
pdf = matplotlib.backends.backend_pdf.PdfPages("test.pdf")
fig = plt.figure(figsize=(20, 20))
grid = plt.GridSpec(1, 2, wspace=0.2,width_ratios=[14, 6])
plt.autoscale()
ax0 = fig.add_subplot(grid[0 ,0])
ax1 = fig.add_subplot(grid[0, 1])
df = pd.DataFrame({'country': ['ABC','PQR','XYZ','ABC','PQR'], 'region': ['D','E','F','D','F'], 'month_day':[1,1,1,2,3],'sales' : [100,200,300,500,100]})
table=pd.pivot_table(df, values='sales', index=['country','region'], columns=['month_day'], aggfunc=sum, fill_value=0)
#for printing on pdf
the_table = ax0.table(cellText=table.values,colLabels=table.columns,rowLabels=table.index,loc='center')
pdf.savefig(fig, bbox_inches='tight')
pdf.close()
I found a solution after a few tries:
table.reset_index(inplace=True)
worked in this case.
import matplotlib.pyplot as plt
import matplotlib.backends.backend_pdf
import pandas as pd
pdf = matplotlib.backends.backend_pdf.PdfPages("test.pdf")
fig = plt.figure(figsize=(20, 20))
grid = plt.GridSpec(1, 2, wspace=0.2,width_ratios=[14, 6])
plt.autoscale()
ax0 = fig.add_subplot(grid[0 ,0])
ax1 = fig.add_subplot(grid[0, 1])
df = pd.DataFrame({'country': ['ABC','PQR','XYZ','ABC','PQR'], 'region': ['D','E','F','D','F'], 'month_day':[1,1,1,2,3],'sales' : [100,200,300,500,100]})
table=pd.pivot_table(df, values='sales', index=['country','region'], columns=['month_day'], aggfunc=sum, fill_value=0)
table.reset_index(inplace=True)
the_table = ax0.table(cellText=table.values,colLabels=table.columns,colWidths=[0.07,0.06,0.04,0.04,0.04],loc='center')
ax0.axis("off")
ax1.axis("off")
plt.axis("off")
pdf.savefig(fig, bbox_inches='tight')
pdf.close()
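To see why reset_index helps, it may be useful to look at the pivot table before and after the call. A small sketch using the same data as above; the variable name flat, the string form aggfunc='sum' and the Agg backend line are choices made for this illustration:

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; drop this line to get a window
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({'country': ['ABC', 'PQR', 'XYZ', 'ABC', 'PQR'],
                   'region': ['D', 'E', 'F', 'D', 'F'],
                   'month_day': [1, 1, 1, 2, 3],
                   'sales': [100, 200, 300, 500, 100]})

table = pd.pivot_table(df, values='sales', index=['country', 'region'],
                       columns=['month_day'], aggfunc='sum', fill_value=0)
print(table.index.nlevels)  # 2: country and region live in the row index

# Move both index levels into ordinary columns
flat = table.reset_index()
print(list(flat.columns))   # country and region are now regular columns

fig, ax = plt.subplots()
ax.axis('off')
# Each index level now gets its own table column, so rowLabels is not needed
the_table = ax.table(cellText=flat.values.tolist(),
                     colLabels=[str(c) for c in flat.columns],
                     loc='center')
```

Once the index levels are regular columns, they are part of table.values, which is why the solution above drops the rowLabels argument.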

Box plot using pandas

Trying to plot a box plot for a pandas dataframe but the x-axis column names don't appear to be clear.
import matplotlib.pyplot as plt
pd.set_option('display.mpl_style', 'default')
fig, ax1 = plt.subplots()
%matplotlib inline
df.boxplot(column = ['avg_dist','avg_rating_by_driver','avg_rating_of_driver','avg_surge','surge_pct','trips_in_first_30_days','weekday_pct'])
Below is the output
How can I fix this so that the x-axis column names are readable?
I think you need the rot parameter:
cols = ['avg_dist','avg_rating_by_driver','avg_rating_of_driver',
        'avg_surge','surge_pct','trips_in_first_30_days','weekday_pct']
df.boxplot(column=cols, rot=90)
Sample:
import numpy as np
import pandas as pd
np.random.seed(100)
cols = ['avg_dist','avg_rating_by_driver','avg_rating_of_driver',
        'avg_surge','surge_pct','trips_in_first_30_days','weekday_pct']
df = pd.DataFrame(np.random.rand(10, 7), columns=cols)
df.boxplot(column=cols, rot=90)
Another option is to make the orientation of your boxes horizontal.
np.random.seed(100)
cols = ['avg_dist','avg_rating_by_driver','avg_rating_of_driver',
'avg_surge','surge_pct','trips_in_first_30_days','weekday_pct']
df = pd.DataFrame(np.random.rand(10, 7), columns=cols)
df.boxplot(column=cols, vert=False)
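A middle ground between the two answers is a partial rotation combined with a larger figure. The sketch below is one possible variant: the 45 degree angle, the figure size and the call to tight_layout are assumptions chosen for illustration, and the data is random as in the samples above.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; drop this line to get a window
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

np.random.seed(100)
cols = ['avg_dist', 'avg_rating_by_driver', 'avg_rating_of_driver',
        'avg_surge', 'surge_pct', 'trips_in_first_30_days', 'weekday_pct']
df = pd.DataFrame(np.random.rand(10, 7), columns=cols)

# Draw on an explicitly sized Axes and rotate the labels by 45 degrees
fig, ax = plt.subplots(figsize=(8, 5))
df.boxplot(column=cols, rot=45, ax=ax)

# Make room for the rotated labels so they are not cut off at the bottom
fig.tight_layout()
fig.savefig("boxplot.png")
```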

Multiple histograms in Pandas

I would like to create the following histogram (see image below) taken from the book "Think Stats". However, I cannot get them on the same plot. Each DataFrame takes its own subplot.
I have the following code:
import nsfg
import matplotlib.pyplot as plt
df = nsfg.ReadFemPreg()
preg = nsfg.ReadFemPreg()
live = preg[preg.outcome == 1]
first = live[live.birthord == 1]
others = live[live.birthord != 1]
#fig = plt.figure()
#ax1 = fig.add_subplot(111)
first.hist(column = 'prglngth', bins = 40, color = 'teal', \
alpha = 0.5)
others.hist(column = 'prglngth', bins = 40, color = 'blue', \
alpha = 0.5)
plt.show()
The above code does not work when I use ax = ax1, as suggested in "pandas multiple plots not working as hists", and the example in "Overlaying multiple histograms using pandas" does not do what I need either. When I run the code as it is, it creates two windows with histograms. Any ideas how to combine them?
Here's an example of how I'd like the final figure to look:
As far as I can tell, pandas can't handle this situation. That's ok since all of their plotting methods are for convenience only. You'll need to use matplotlib directly. Here's how I do it:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas
#import seaborn
#seaborn.set(style='ticks')
np.random.seed(0)
df = pandas.DataFrame(np.random.normal(size=(37,2)), columns=['A', 'B'])
fig, ax = plt.subplots()
a_heights, a_bins = np.histogram(df['A'])
b_heights, b_bins = np.histogram(df['B'], bins=a_bins)
width = (a_bins[1] - a_bins[0])/3
ax.bar(a_bins[:-1], a_heights, width=width, facecolor='cornflowerblue')
ax.bar(b_bins[:-1]+width, b_heights, width=width, facecolor='seagreen')
#seaborn.despine(ax=ax, offset=10)
And that gives me:
In case anyone wants to plot one histogram over another (rather than alternating bars) you can simply call .hist() consecutively on the series you want to plot:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas
np.random.seed(0)
df = pandas.DataFrame(np.random.normal(size=(37,2)), columns=['A', 'B'])
df['A'].hist()
df['B'].hist()
This gives you:
Note that the order you call .hist() matters (the first one will be at the back)
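When overlaying this way, the histogram drawn last can hide the one behind it, and each call picks its own bin edges. A minimal sketch of one way around both issues, assuming it is acceptable to compute common bin edges up front and to draw the bars semi-transparently; the alpha value of 0.5 and the choice of 10 bins are illustrative:

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; drop this line to get a window
import matplotlib.pyplot as plt
import numpy as np
import pandas

np.random.seed(0)
df = pandas.DataFrame(np.random.normal(size=(37, 2)), columns=['A', 'B'])

# Compute one set of bin edges from both columns so the bars line up
bins = np.histogram_bin_edges(df[['A', 'B']].to_numpy().ravel(), bins=10)

fig, ax = plt.subplots()
# alpha < 1 keeps the histogram at the back visible through the front one
_, a_edges, _ = ax.hist(df['A'], bins=bins, alpha=0.5, label='A')
_, b_edges, _ = ax.hist(df['B'], bins=bins, alpha=0.5, label='B')
ax.legend()
fig.savefig("overlay.png")
```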
A quick solution is to use melt() from pandas and then plot with seaborn.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# make dataframe
df = pd.DataFrame(np.random.normal(size=(200,2)), columns=['A', 'B'])
# plot melted dataframe in a single command
sns.histplot(df.melt(), x='value', hue='variable',
multiple='dodge', shrink=.75, bins=20);
Setting multiple='dodge' makes it so the bars are side-by-side, and shrink=.75 makes it so the pair of bars take up 3/4 of the whole bin.
To help understand what melt() did, these are the dataframes df and df.melt():
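For a text-only view of the same transformation, here is a small sketch with a made-up four-row frame; the seed and the name long_df are illustrative only:

```python
import numpy as np
import pandas as pd

# Small wide frame: one column per series
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(4, 2)), columns=['A', 'B'])

# melt() stacks the columns into long form:
# 'variable' holds the original column name, 'value' holds the number
long_df = df.melt()

print(df.shape)                # (4, 2)
print(long_df.shape)           # (8, 2)
print(list(long_df.columns))   # ['variable', 'value']
```

The long form is what lets histplot split the data by hue='variable' in a single call.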
From the pandas website (http://pandas.pydata.org/pandas-docs/stable/visualization.html#visualization-hist):
df4 = pd.DataFrame({'a': np.random.randn(1000) + 1, 'b': np.random.randn(1000),
'c': np.random.randn(1000) - 1}, columns=['a', 'b', 'c'])
plt.figure();
df4.plot(kind='hist', alpha=0.5)
You make two dataframes and one matplotlib axis
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
df1 = pd.DataFrame({
'data1': np.random.randn(10),
'data2': np.random.randn(10)
})
df2 = df1.copy()
fig, ax = plt.subplots()
df1.hist(column=['data1'], ax=ax)
df2.hist(column=['data2'], ax=ax)
Here is the snippet. In my case I have explicitly specified the bins and range, since I didn't handle outlier removal as the author of the book did.
fig, ax = plt.subplots()
ax.hist([first.prglngth, others.prglngth], 10, (27, 50), histtype="bar", label=("First", "Other"))
ax.set_title("Histogram")
ax.legend()
Refer to the Matplotlib multihist example that plots data sets of different sizes.
This could be done more briefly:
plt.hist([first.prglngth, others.prglngth], bins = 40, color =('teal','blue'), label=("First", "Other"))
plt.legend(loc='best')
Note that as the number of bins increases, the plot may become visually cluttered.
You could also check out the pandas.DataFrame.plot.hist() function, which plots the histogram of each column of the dataframe in the same figure.
Visibility is limited, but it may help:
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.hist.html
