Matplotlib like graphs with plotly express - python

Below is my pandas DataFrame. Creating a line plot for all the items is very easy with matplotlib; I just write
df.plot()
and it creates a separate line for each item. I want to create the same line plots with plotly express, but I am not able to, maybe because the columns are dates.
df:
dataDate 2019-10-01 2019-10-02 2019-10-01 2019-10-01 2019-10-02
name
item1 0.24 0.12 0.19 0.20 0.12
item2 0.26 0.25 0.17 0.17 0.13
item3 0.22 0.24 0.18 0.17 0.16
item4 0.72 0.22 0.19 0.20 0.15
item5 0.55 0.23 0.19 0.18 0.14
Please suggest how I can create line plots for all the items over time with plotly express. Thanks.

They have great examples in their documentation (https://plot.ly/python/plotly-express/#scatter-and-line-plots).
By design, plotly express works best with tidy (long-format) data, so you would have one column for the date, one column for the item, and one column for the value.
from datetime import datetime, timedelta
import numpy as np
import pandas as pd
import plotly.express as px
# build a small tidy DataFrame: 10 dates for each of 3 categories
base = datetime.today()
dates = [base - timedelta(days=x) for x in range(10)] * 3
cats = ['A'] * 10 + ['B'] * 10 + ['C'] * 10
vals = np.arange(30)
df = pd.DataFrame({'Date': dates, 'Category': cats, 'Value': vals})
# one line per category
fig = px.line(df, x='Date', y='Value', color='Category')
fig.show()
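To get from the wide table in the question to the tidy layout described above, one option is `DataFrame.melt`. The following is a minimal sketch, not the asker's actual data: the three distinct dates and the column name `value` are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical wide table: one row per item, one column per date
# (dates chosen for illustration only).
wide = pd.DataFrame(
    {
        "2019-10-01": [0.24, 0.26, 0.22],
        "2019-10-02": [0.12, 0.25, 0.24],
        "2019-10-03": [0.19, 0.17, 0.18],
    },
    index=pd.Index(["item1", "item2", "item3"], name="name"),
)

# Move the index into a column, then stack the date columns into rows.
tidy = wide.reset_index().melt(
    id_vars="name", var_name="dataDate", value_name="value"
)

# Parse the date labels so the x-axis is treated as time.
tidy["dataDate"] = pd.to_datetime(tidy["dataDate"])

print(tidy.head())
```

With plotly installed, the result could then be passed to `px.line(tidy, x="dataDate", y="value", color="name")` to draw one line per item.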

Related

How to visualise means with Seaborn?

I have a Pandas data frame with the following structure:
alpha beta gamma mse
0 0.00 0.00 0.00 0.000000
1 0.05 0.05 0.90 0.025411
2 0.05 0.10 0.85 0.025794
3 0.05 0.15 0.80 0.026289
4 0.05 0.20 0.75 0.025320
.. ... ... ... ...
148 0.75 0.05 0.20 0.026816
149 0.75 0.10 0.15 0.025817
150 0.75 0.15 0.10 0.025702
151 0.80 0.05 0.15 0.027104
152 0.80 0.10 0.10 0.025936
I would like to visualise the data frame with a heatmap where alpha is represented on the x-axis, beta is represented on the y-axis, and for each square of the lattice, the mean MSE over all gammas is computed. Is there an easy way to do this by using Seaborn?
Thanks in advance.
For what you showed, yes, you can do it with:
sns.heatmap(df.pivot_table(index='beta', columns='alpha', values='mse'))
All the calculations should be done on your DataFrame first.
Once you have the data, you can use a pivoted DataFrame to build the heatmap:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
# Assuming that you have the df variable with your data
# pivot the data: beta on the rows (y-axis), alpha on the columns (x-axis);
# pivot_table averages mse over repeated (alpha, beta) pairs, i.e. over gamma
pivoted = df.pivot_table(index='beta', columns='alpha', values='mse', aggfunc='mean')
# plot the heatmap
sns.heatmap(pivoted, annot=True)
plt.show()
More information in the official documentation: https://seaborn.pydata.org/generated/seaborn.heatmap.html
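As a small check of the aggregation step, the sketch below uses made-up values (not the asker's data) in which one (alpha, beta) pair occurs with two different gammas. It shows that `pivot_table` with `aggfunc="mean"` averages them, and that a missing pair becomes NaN.

```python
import pandas as pd

# Made-up sample: (alpha=0.05, beta=0.05) appears twice with different gamma.
df = pd.DataFrame(
    {
        "alpha": [0.05, 0.05, 0.05, 0.10],
        "beta": [0.05, 0.05, 0.10, 0.05],
        "gamma": [0.90, 0.80, 0.85, 0.85],
        "mse": [0.02, 0.04, 0.03, 0.05],
    }
)

# Rows are beta (heatmap y-axis), columns are alpha (heatmap x-axis).
grid = df.pivot_table(index="beta", columns="alpha", values="mse", aggfunc="mean")

print(grid)
```

The resulting `grid` is what would be handed to `sns.heatmap`; cells with no data stay empty in the plot.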

Getting meaningful results from pandas.describe()

I called describe on one column of a dataframe and ended up with the following output,
count 1.048575e+06
mean 8.232821e+01
std 2.859016e+02
min 0.000000e+00
25% 3.000000e+00
50% 1.400000e+01
75% 6.000000e+01
max 8.599700e+04
What parameter do I pass to get meaningful integer values? What I mean is that when I check the count in SQL, it is about 43 million. All the other values are also different. Can someone help me understand what this conversion means, and how do I get floats rounded to 2 decimal places? I'm new to pandas.
You can directly use round() and pass the number of decimals you want as the argument:
# importing pandas as pd
import pandas as pd
# importing numpy as np
import numpy as np
# setting the seed to create the dataframe
np.random.seed(25)
# Creating a 5 * 4 dataframe
df = pd.DataFrame(np.random.random([5, 4]), columns =["A", "B", "C", "D"])
# rounding describe
df.describe().round(2)
A B C D
count 5.00 5.00 5.00 5.00
mean 0.52 0.47 0.38 0.42
std 0.21 0.23 0.19 0.29
min 0.33 0.12 0.16 0.11
25% 0.41 0.37 0.28 0.19
50% 0.45 0.58 0.37 0.44
75% 0.56 0.59 0.40 0.52
max 0.87 0.70 0.68 0.84
There are two ways to control how pandas displays floats: set a global display option, or format the result yourself with apply.
pd.set_option('display.float_format', lambda x: '%.5f' % x)
df['X'].describe().apply("{0:.5f}".format)
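The values in the question are ordinary floats shown in scientific notation, so `1.048575e+06` is simply 1048575. Here is a minimal sketch of the `apply` approach on a made-up Series (the values and the name `X` are assumptions for illustration), using a thousands separator and two decimals:

```python
import pandas as pd

# Made-up sample column; only the formatting technique matters here.
s = pd.Series([0, 3, 14, 60, 85997], name="X")

summary = s.describe()

# Format every statistic as a fixed-point string with 2 decimals.
formatted = summary.apply("{:,.2f}".format)

print(formatted)

# Scientific notation is only a display form of the same number.
count_from_question = float("1.048575e+06")
print(int(count_from_question))
```

Note that `formatted` holds strings, so it is suited to display rather than further arithmetic.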

Python: plot timedelta and cumulative values

I have a dataframe with 1000 rows like below
start_time val
0 15:16:25 0.01
1 15:17:51 0.02
2 15:26:16 0.03
3 15:27:28 0.04
4 15:32:08 0.05
5 15:32:35 0.06
6 15:33:02 0.07
7 15:33:46 0.08
8 15:33:49 0.09
9 15:34:04 0.10
10 15:34:23 0.11
11 15:34:32 0.12
12 15:34:32 0.13
13 15:35:53 0.14
14 15:37:31 0.15
15 15:38:11 0.16
16 15:38:17 0.17
17 15:38:29 0.18
18 15:40:07 0.19
19 15:40:32 0.20
20 15:40:53 0.21
... .... ..
I would like to plot it with the time on the x axis. I have used
plt.plot(df['start_time'].dt.total_seconds(),df['val'])
# generate a formatter, using the fields required
fmtr = mdates.DateFormatter("%H:%M")
# need a handle to the current axes to manipulate it
ax = plt.gca()
# set this formatter to the axis
ax.xaxis.set_major_formatter(fmtr)
It runs fine, but the labels on the x axis do not show the correct time.
Any help? Thank you in advance.
You can convert timedeltas to seconds:
plt.plot(df['start_time'].dt.total_seconds(),df['val'])
Alternatively, plot the timedeltas directly and format the tick labels as strings with a FuncFormatter; it is only necessary to convert the tick values from nanoseconds to seconds:
import datetime
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker
fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(df['start_time'], df['val'])
def timeTicks(x, pos):
    # tick values arrive in nanoseconds; convert to seconds
    seconds = x / 10**9
    d = datetime.timedelta(seconds=seconds)
    return str(d)
formatter = matplotlib.ticker.FuncFormatter(timeTicks)
ax.xaxis.set_major_formatter(formatter)
plt.xticks(rotation=90)
plt.show()
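A variation on the same idea, sketched here without the plotting step: convert the timedeltas to seconds first, so the tick formatter needs no nanosecond division. The helper name `format_seconds` is hypothetical, and the three rows are copied from the question's sample.

```python
import datetime

import pandas as pd

# First three rows of the question's sample, as a timedelta column.
df = pd.DataFrame(
    {
        "start_time": pd.to_timedelta(["15:16:25", "15:17:51", "15:26:16"]),
        "val": [0.01, 0.02, 0.03],
    }
)

# x values in plain seconds since midnight.
seconds = df["start_time"].dt.total_seconds()


def format_seconds(x, pos=None):
    """Turn a tick value in seconds into an H:MM:SS label."""
    return str(datetime.timedelta(seconds=int(x)))


labels = [format_seconds(s) for s in seconds]
print(labels)
```

A function with this `(x, pos)` signature can be wrapped in `matplotlib.ticker.FuncFormatter` and set on the x axis after plotting `seconds` against `df['val']`.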

Only one index label in the dataset

I am working with the ecoli dataset from http://archive.ics.uci.edu/ml/datasets/Ecoli. The values are separated by tabs. I would like to give each column a name, but when I do that using the following code:
import pandas as pd
ecoli_cols= ['N_ecoli', 'info1', 'info2', 'info3', 'info4','info5','info6','info7','type']
d= pd.read_table('ecoli.csv',sep= ' ',header = None, names= ecoli_cols)
Instead of naming each existing column, it creates 6 new columns. I would like those names applied to the columns I already have. Later I would like to extract information from this dataset, so it is important to have the values comma-separated or in a table. Thanks.
You can read the data directly from the URL and use the separator \s+ (one or more whitespace characters):
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/ecoli/ecoli.data'
ecoli_cols= ['N_ecoli', 'info1', 'info2', 'info3', 'info4','info5','info6','info7','type']
df = pd.read_table(url, sep=r'\s+', header=None, names=ecoli_cols)
# alternative: the delim_whitespace parameter (deprecated in recent pandas versions)
# df = pd.read_table(url, delim_whitespace=True, header=None, names=ecoli_cols)
print (df.head())
N_ecoli info1 info2 info3 info4 info5 info6 info7 type
0 AAT_ECOLI 0.49 0.29 0.48 0.5 0.56 0.24 0.35 cp
1 ACEA_ECOLI 0.07 0.40 0.48 0.5 0.54 0.35 0.44 cp
2 ACEK_ECOLI 0.56 0.40 0.48 0.5 0.49 0.37 0.46 cp
3 ACKA_ECOLI 0.59 0.49 0.48 0.5 0.52 0.45 0.36 cp
4 ADI_ECOLI 0.23 0.32 0.48 0.5 0.55 0.25 0.35 cp
But if you want to use your own file with a tab separator:
d = pd.read_table('ecoli.csv', sep='\t',header = None, names= ecoli_cols)
And if the separator is ;:
d = pd.read_table('ecoli.csv', sep=';',header = None, names= ecoli_cols)
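The whitespace separator can be tried without downloading anything. This sketch parses two rows taken from the output above out of an in-memory string; the uneven spacing is an assumption made to mimic a space-padded file.

```python
import io

import pandas as pd

ecoli_cols = ['N_ecoli', 'info1', 'info2', 'info3', 'info4',
              'info5', 'info6', 'info7', 'type']

# Two sample rows, padded with a varying number of spaces.
sample = (
    "AAT_ECOLI   0.49  0.29  0.48  0.50  0.56  0.24  0.35   cp\n"
    "ACEA_ECOLI  0.07  0.40  0.48  0.50  0.54  0.35  0.44   cp\n"
)

# r'\s+' treats any run of whitespace as a single separator.
df = pd.read_csv(io.StringIO(sample), sep=r'\s+', header=None, names=ecoli_cols)

print(df)
```

With `sep=' '` each single space would count as its own delimiter, which is one way a padded file can end up with extra, mostly empty columns.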

Create new DF with values representing difference between two dataframes

I am working with two numeric data.frames, both with 13803 observations and 13803 variables. Their column and row names are identical, but their entries differ. I want to create a new data.frame in which the df2 values are subtracted from the df1 values.
The "formula" would be: df1 (entry values) - df2 (entry values) = df3 (difference). In other words, the purpose is to find the difference between all corresponding entries.
My problem is illustrated here.
DF1
[GENE128] [GENE271] [GENE2983]
[GENE231] 0.71 0.98 0.32
[GENE128] 0.23 0.61 0.90
[GENE271] 0.87 0.95 0.63
DF2
[GENE128] [GENE271] [GENE2983]
[GENE231] 0.70 0.94 0.30
[GENE128] 0.25 0.51 0.80
[GENE271] 0.82 0.92 0.60
NEW DF3
[GENE128] [GENE271] [GENE2983]
[GENE231] 0.01 0.04 0.02
[GENE128] -0.02 0.10 0.10
[GENE271] 0.05 0.03 0.03
So, in DF3 the values are the difference between DF1 and DF2 for each entry.
DF1(GENE231) - DF2(GENE231) = DF3(DIFFERENCE-GENE231)
DF1(GENE271) - DF2(GENE271) = DF3(DIFFERENCE-GENE271)
and so on...
Help would be much appreciated!
Kind regards,
Harebell
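The question is phrased in terms of data.frames. If the same tables were held as pandas DataFrames (an assumption, in keeping with the other threads on this page), subtraction already aligns on row and column labels, so the element-wise difference is a single expression. A minimal sketch with the values from the illustration:

```python
import pandas as pd

rows = ["GENE231", "GENE128", "GENE271"]
cols = ["GENE128", "GENE271", "GENE2983"]

df1 = pd.DataFrame(
    [[0.71, 0.98, 0.32], [0.23, 0.61, 0.90], [0.87, 0.95, 0.63]],
    index=rows, columns=cols,
)
df2 = pd.DataFrame(
    [[0.70, 0.94, 0.30], [0.25, 0.51, 0.80], [0.82, 0.92, 0.60]],
    index=rows, columns=cols,
)

# Element-wise difference, matched by row and column label;
# rounding hides floating-point noise such as 0.010000000000000009.
df3 = (df1 - df2).round(2)

print(df3)
```

Because alignment is by label rather than position, the two frames do not need their rows or columns in the same order.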
