I have a dataframe with 1000 rows like below
start_time val
0 15:16:25 0.01
1 15:17:51 0.02
2 15:26:16 0.03
3 15:27:28 0.04
4 15:32:08 0.05
5 15:32:35 0.06
6 15:33:02 0.07
7 15:33:46 0.08
8 15:33:49 0.09
9 15:34:04 0.10
10 15:34:23 0.11
11 15:34:32 0.12
12 15:34:32 0.13
13 15:35:53 0.14
14 15:37:31 0.15
15 15:38:11 0.16
16 15:38:17 0.17
17 15:38:29 0.18
18 15:40:07 0.19
19 15:40:32 0.20
20 15:40:53 0.21
... .... ..
I would like to plot it, with the time on the x axis. I have used
plt.plot(df['start_time'].dt.total_seconds(),df['val'])
# generate a formatter, using the fields required
fmtr = mdates.DateFormatter("%H:%M")
# need a handle to the current axes to manipulate it
ax = plt.gca()
# set this formatter to the axis
ax.xaxis.set_major_formatter(fmtr)
The code runs, but the labels on the x axis do not show the correct time.
Any help? thank you in advance
You can convert timedeltas to seconds:
plt.plot(df['start_time'].dt.total_seconds(),df['val'])
Alternatively, plot the timedeltas directly and format the tick labels as strings. The solution for converting timedeltas to strings only needs the nanoseconds converted to seconds:
import datetime
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker
fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(df['start_time'], df['val'])
def timeTicks(x, pos):
    # tick values arrive in nanoseconds; convert them to seconds
    seconds = x / 10**9
    d = datetime.timedelta(seconds=seconds)
    return str(d)
formatter = matplotlib.ticker.FuncFormatter(timeTicks)
ax.xaxis.set_major_formatter(formatter)
plt.xticks(rotation=90)
plt.show()
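If you go with the first option and plot seconds instead of raw timedeltas, the tick formatter no longer needs the nanosecond division. Below is a minimal sketch of such a formatter; the name format_seconds is made up for illustration and it assumes the tick values are seconds since midnight.

```python
import datetime


def format_seconds(x, pos=None):
    """Turn a tick value given in seconds into an H:MM:SS label."""
    # Round to whole seconds so the label carries no fractional part.
    return str(datetime.timedelta(seconds=int(round(x))))


# 15:16:25 expressed as seconds since midnight
print(format_seconds(15 * 3600 + 16 * 60 + 25))
```

It can be passed to matplotlib.ticker.FuncFormatter in the same way as timeTicks above, after plotting df['start_time'].dt.total_seconds() on the x axis.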
I have a Pandas data frame with the following structure:
alpha beta gamma mse
0 0.00 0.00 0.00 0.000000
1 0.05 0.05 0.90 0.025411
2 0.05 0.10 0.85 0.025794
3 0.05 0.15 0.80 0.026289
4 0.05 0.20 0.75 0.025320
.. ... ... ... ...
148 0.75 0.05 0.20 0.026816
149 0.75 0.10 0.15 0.025817
150 0.75 0.15 0.10 0.025702
151 0.80 0.05 0.15 0.027104
152 0.80 0.10 0.10 0.025936
I would like to visualise the data frame with a heatmap where alpha is represented on the x-axis, beta is represented on the y-axis, and for each square of the lattice, the mean MSE over all gammas is computed. Is there an easy way to do this by using Seaborn?
Thanks in advance.
For the data you showed, yes, you can do it with:
sns.heatmap(df.pivot_table(index='beta', columns='alpha', values='mse'))
All the calculations should be done in your DataFrame.
Once you have the data, you can use a pivoted DataFrame to build the heatmap:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
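# Assuming that you have the df variable with your data
# pivot the data: alpha on the x-axis (columns), beta on the y-axis (index);
# pivot_table averages mse over the gammas that share an (alpha, beta) pair,
# whereas plain pivot raises an error on such duplicate pairs
pivoted = df.pivot_table(index='beta', columns='alpha', values='mse', aggfunc='mean')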
# plot the heatmap
sns.heatmap(pivoted, annot=True)
plt.show()
More information in the official documentation: https://seaborn.pydata.org/generated/seaborn.heatmap.html
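To see what the averaging step does before anything is drawn, here is a small sketch on made-up numbers (the values are hypothetical, not taken from the question). Each (alpha, beta) pair appears with two gammas, and pivot_table collapses them to their mean.

```python
import pandas as pd

# Hypothetical sample: two gamma values for each (alpha, beta) pair.
df = pd.DataFrame({
    "alpha": [0.05, 0.05, 0.05, 0.05, 0.10, 0.10],
    "beta":  [0.05, 0.05, 0.10, 0.10, 0.05, 0.05],
    "gamma": [0.90, 0.80, 0.85, 0.75, 0.85, 0.75],
    "mse":   [0.02, 0.04, 0.03, 0.05, 0.01, 0.03],
})

# Rows become beta (y-axis), columns become alpha (x-axis);
# every cell holds the mean mse over the gammas of that pair.
grid = df.pivot_table(index="beta", columns="alpha", values="mse", aggfunc="mean")
print(grid)
```

Pairs that never occur in the data (here alpha = 0.10, beta = 0.10) come out as NaN, which sns.heatmap leaves blank.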
The following is my Pandas dataframe. It's very easy to create a line plot for all the items with matplotlib; I just write
df.plot()
And it creates a separate line for each item. I want to create the same line plots with plotly express, but I am not able to, maybe because I have date columns.
df:
dataDate 2019-10-01 2019-10-02 2019-10-01 2019-10-01 2019-10-02
name
item1 0.24 0.12 0.19 0.20 0.12
item2 0.26 0.25 0.17 0.17 0.13
item3 0.22 0.24 0.18 0.17 0.16
item4 0.72 0.22 0.19 0.20 0.15
item5 0.55 0.23 0.19 0.18 0.14
Please suggest how I can create line plots for all the items across time with plotly express. Thanks
They have great examples in their documentation (https://plot.ly/python/plotly-express/#scatter-and-line-plots).
By design it works best with tidy data so you would have a column for Date, a column for Item Number, and then a column for the value.
from datetime import datetime, timedelta
import numpy as np
import pandas as pd
import plotly.express as px
base = datetime.today()
# ten dates repeated for each of the three categories
dates = [base - timedelta(days=x) for x in range(10)] * 3
cats = ['A'] * 10 + ['B'] * 10 + ['C'] * 10
vals = np.arange(30)
# tidy layout: one row per (date, category) observation
df = pd.DataFrame({'Date': dates, 'Category': cats, 'Value': vals})
fig = px.line(df, x='Date', y='Value', color='Category')
fig.show()
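For a frame shaped like the one in the question (items as rows, dates as columns), one way to get to the tidy layout is melt. This is only a sketch: the sample values are a cut-down, hypothetical version of the question's data, and it assumes the date columns are unique, which the frame shown in the question is not.

```python
import pandas as pd

# Hypothetical wide frame: one row per item, one column per date.
wide = pd.DataFrame(
    {
        "2019-10-01": [0.24, 0.26],
        "2019-10-02": [0.12, 0.25],
        "2019-10-03": [0.19, 0.17],
    },
    index=pd.Index(["item1", "item2"], name="name"),
)

# Move the index into a column, then stack the date columns into rows.
tidy = wide.reset_index().melt(id_vars="name", var_name="dataDate", value_name="value")
# Parse the former column labels so the x axis is treated as dates.
tidy["dataDate"] = pd.to_datetime(tidy["dataDate"])
print(tidy)
```

The result can then be passed along as px.line(tidy, x='dataDate', y='value', color='name'), which draws one line per item.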
I am running the code below.
import datetime
import pandas as pd
import numpy as np
import pylab as pl
from sklearn.cluster import AffinityPropagation
from sklearn import metrics
from matplotlib.collections import LineCollection
from pandas_datareader import data as wb
from sklearn import cluster, covariance, manifold
###############################################################################
start = '2019-02-01'
end = '2020-02-01'
tickers = ['MMM',
           'ABT',
           'ABBV',
           'ABMD',
           'ACN',
           'ATVI']
thelen = len(tickers)
price_data = []
for ticker in tickers:
    prices = wb.DataReader(ticker, start = start, end = end, data_source='yahoo')[['Open','Adj Close']]
    price_data.append(prices.assign(ticker=ticker)[['ticker', 'Open', 'Adj Close']])
df = pd.concat(price_data)
df.rename(columns = {'ticker':'Ticker', 'Adj Close':'Close'}, inplace = True)
df.dtypes
df.head()
df.shape
#df.reset_index()
pd.set_option('display.max_columns', 500)
# np.float was removed from NumPy; the builtin float is equivalent here
open = np.array([df.Open]).astype(float)
close = np.array([df.Close]).astype(float)
# The daily variations of the quotes are what carry most information
variation = (close - open)
The code above gives me this 1d array:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
0 0.38 0.93 0.3 0.72 -0.42 0.37 0.36 0.71 0.89 -0.32 0.11 -0.06 -0.17 0.4 0.25 -0.48 0.1 -0.29 -0.29 -0.38 0.21 0.22 0.11 -0.01 -0.07 -0.66 0 -0.78 0.24 -0.89 0.07
My desired output would be a 2d array, like this.
0 1 2 3 4 5 6 7 8 9 10
0 0.38 0.93 0.3 0.72 -0.42 0.37 0.36 0.71 0.89 -0.32 0.11
1 0.61 0.18 0.63 0.02 -0.03 -0.27 -0.75 -1 0.48 -0.74 -0.34
2 1.77 0.95 1.69 2.05 -1.36 2.25 1.83 -0.8 1.35 -0.99 -1.35
3 0.7 -0.12 0.32 -0.14 -0.53 0.63 0.85 0.46 0.23 -0.83 0.59
4 1.71 -0.8 0.74 -0.58 -1.2 0.38 0.35 0.06 0.56 -0.38 0.64
5 0.47 0.25 0.93 -0.9 -0.15 0.64 -0.11 -0.09 0.44 -0.47 -0.09
How can I change my 1d array to a 2d array, with the open-close differences running horizontally and one row per stock? Thanks.
I actually got this to work. Apparently you have to store items in a list rather than a dataframe.
import datetime
import pandas as pd
import numpy as np
import pylab as pl
from sklearn.cluster import AffinityPropagation
from sklearn import metrics
from matplotlib.collections import LineCollection
from pandas_datareader import data as wb
from sklearn import cluster, covariance, manifold
start = '2019-02-01'
end = '2020-02-01'
tickers = ['AXP', 'AAPL', 'BA', 'CAT', 'CSCO', 'CVX', 'XOM', 'GS',
           'HD', 'IBM', 'INTC', 'JNJ', 'KO', 'JPM', 'MCD', 'MMM',
           'MRK', 'MSFT', 'NKE', 'PFE', 'PG', 'TRV', 'UNH', 'UTX',
           'VZ', 'V', 'WBA', 'WMT', 'DIS']
thelen = len(tickers)
price_data = []
for ticker in tickers:
    prices = wb.DataReader(ticker, start = start, end = end, data_source='yahoo')[['Open','Adj Close']]
    price_data.append(prices.assign(ticker=ticker)[['ticker', 'Open', 'Adj Close']])
#names = np.reshape(price_data, (len(price_data), 1))
names = pd.concat(price_data)
names.reset_index()
#pd.set_option('display.max_columns', 500)
# one row per ticker; np.float was removed from NumPy, so use the builtin float
open = np.array([q['Open'] for q in price_data]).astype(float)
close = np.array([q['Adj Close'] for q in price_data]).astype(float)
#close_prices = np.array([q.close for q in quotes]).astype(np.float)
# The daily variations of the quotes are what carry most information
variation = (close - open)
# pd.DataFrame(variation).to_csv("C:\\path\\file.csv")
# Learn a graphical structure from the correlations
edge_model = covariance.GraphicalLassoCV()
X = variation
# standardize the time series: using correlations rather than covariance
# is more efficient for structure recovery
X = variation.copy().T
X /= X.std(axis=0)
edge_model.fit(X)
# Cluster using affinity propagation
_, labels = cluster.affinity_propagation(edge_model.covariance_)
n_labels = labels.max()
details = [(name,cluster) for name, cluster in zip(tickers,labels)]
for detail in details:
    print(detail)
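The 2d layout can also be built from the concatenated (long) frame by pivoting, instead of keeping the per-ticker frames in a list. The sketch below uses hypothetical prices in place of the downloaded quotes and assumes the frame has Date, Ticker, Open and Close columns.

```python
import pandas as pd

tickers = ["MMM", "ABT"]
dates = pd.to_datetime(["2019-02-01", "2019-02-04", "2019-02-05"])

# Hypothetical prices standing in for the downloaded quotes.
long_df = pd.DataFrame({
    "Date": list(dates) * 2,
    "Ticker": ["MMM"] * 3 + ["ABT"] * 3,
    "Open": [10.0, 11.0, 12.0, 20.0, 21.0, 22.0],
    "Close": [10.5, 10.75, 12.25, 19.5, 21.5, 22.0],
})

# Daily variation per row, then one row per ticker and one column per date.
long_df["Variation"] = long_df["Close"] - long_df["Open"]
wide = long_df.pivot(index="Ticker", columns="Date", values="Variation")
# pivot sorts the index alphabetically; restore the original ticker order.
wide = wide.reindex(tickers)

variation = wide.to_numpy()
print(variation.shape)
```

Unlike stacking a list of arrays, the pivot aligns rows by date, so a ticker with a missing trading day gets a NaN in that column rather than a shorter row.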
I have this kind of data :
ID x1 x2 x3 x4 x5 x6 x7 x8 x9 x10
1 -0.18 5 -0.40 -0.26 0.53 -0.66 0.10 2 -0.20 1
2 -0.58 5 -0.52 -1.66 0.65 -0.15 0.08 3 3.03 -2
3 -0.62 5 -0.09 -0.38 0.65 0.22 0.44 4 1.49 1
4 -0.22 -3 1.64 -1.38 0.08 0.42 1.24 5 -0.34 0
5 0.00 5 1.76 -1.16 0.78 0.46 0.32 5 -0.51 -2
What's the best method for visualizing this data? I'm using matplotlib to visualize it, and I read it from a csv using pandas.
thanks
Visualising data in a high-dimensional space is always a difficult problem. One solution that is commonly used (and is now available in pandas) is to inspect all of the 1D and 2D projections of the data. It doesn't give you all of the information about the data, but that's impossible to visualise unless you can see in 10D! Here's an example of how to do this with pandas (version 0.7.3 upwards):
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix
# first make some fake data with the same layout as yours
data = pd.DataFrame(np.random.randn(100, 10),
                    columns=['x1', 'x2', 'x3', 'x4', 'x5',
                             'x6', 'x7', 'x8', 'x9', 'x10'])
# now plot using pandas
scatter_matrix(data, alpha=0.2, figsize=(6, 6), diagonal='kde')
This generates a plot with all of the 2D projections as scatter plots, and KDE histograms of the 1D projections:
I also have a pure matplotlib approach to this on my github page, which produces a very similar type of plot (it is designed for MCMC output, but is also appropriate here). Here's how you'd use it here:
import corner_plot as cp
# DataFrame.as_matrix() was removed from pandas; to_numpy() returns the same array
cp.corner_plot(data.to_numpy(), axis_labels=data.columns, nbins=10,
               figsize=(7,7), scatter=True, fontsize=10, tickfontsize=7)
You can also change the plot over time, so that at each instant you plot a different "dimension" of the dataframe.
Here is an example of how to make a plot that changes over time; you can adjust it for your purposes:
import matplotlib.pyplot as plt
import numpy as np
fig = plt.figure()
ax = fig.add_subplot(111)
x = np.arange(-3, 3, 0.01)
for n in range(15):
    # np.sinc(t) computes sin(pi*t) / (pi*t) and is defined at t = 0
    y = np.sinc(x * n)
    # plt.hold(False) was removed from matplotlib; clear the axes instead
    ax.cla()
    ax.grid(True)
    line, = ax.plot(x, y)
    plt.draw()
    plt.pause(0.5)
I have data like this:
a b c d e
alpha 5.51 0.60 -0.12 26.90 76284.53
beta 3.39 0.94 -0.17 -0.20 -0.20
gamma 7.98 3.34 -1.41 7.74 28394.93
delta 2.29 1.24 0.40 0.29 0.28
I want to make a nice, publishable histogram like this one
but with a break in the y axis, so that the variation of a, b, c, d and e is visible and the data is not squashed by the extreme values in column e, as in this one but using an interlaced colorbar histogram:
I would like to do that in python (matplotlib, pandas, numpy/scipy) or in mathematica... or any other open and free high-level language (R, scilab, ...). Thanks for your help.
edit: using matplotlib through pandas lets you adjust the space between the two subplots with the "hspace" option, via the options button at the bottom left.
Have you seen this example? It's for a broken y-axis plot in matplotlib.
Hope this helps.
Combining with pandas this gives:
import pandas as pd
import matplotlib.pyplot as plt
from io import StringIO
data = """\
a b c d e
alpha 5.51 0.60 -0.12 26.90 76284.53
beta 3.39 0.94 -0.17 -0.20 -0.20
gamma 7.98 3.34 -1.41 7.74 28394.93
delta 2.29 1.24 0.40 0.29 0.28
"""
df = pd.read_csv(StringIO(data), sep=r'\s+')
f, axis = plt.subplots(2, 1, sharex=True)
df.plot(kind='bar', ax=axis[0])
df.plot(kind='bar', ax=axis[1])
axis[0].set_ylim(20000, 80000)
axis[1].set_ylim(-2, 30)
axis[1].legend().set_visible(False)
axis[0].spines['bottom'].set_visible(False)
axis[1].spines['top'].set_visible(False)
axis[0].xaxis.tick_top()
axis[0].tick_params(labeltop=False)
axis[1].xaxis.tick_bottom()
d = .015
kwargs = dict(transform=axis[0].transAxes, color='k', clip_on=False)
axis[0].plot((-d,+d),(-d,+d), **kwargs)
axis[0].plot((1-d,1+d),(-d,+d), **kwargs)
kwargs.update(transform=axis[1].transAxes)
axis[1].plot((-d,+d),(1-d,1+d), **kwargs)
axis[1].plot((1-d,1+d),(1-d,1+d), **kwargs)
plt.show()