Format numerous floats in a data frame - python

I need help: I am unable to get the seaborn plot to display properly.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
dataset = pd.read_csv('sales.csv', header=0,sep =',',
usecols = [1,2,3,4])
#remove NaN
dataset.dropna(inplace = True)
df = pd.DataFrame(data=dataset)
sns.regplot(data=df, x='TV', y='sales')
plt.show()
An example of sales.csv:
id,TV,radio,newspaper,sales
1,230.10000000,37.8,69.2,22.1
2,1e12,39.3,45.1,10.4
3,17.2,45.9,69.3,9.3
4,151.5,41.3,58.5,18.5
5,180.8,10.8,58.4,12.9
5,180.8,10.8,58.4,12.9
6,8.7,48.9,75,7.2
7,57.5,32.8,23.5,11.8
8,120.2,19.6,11.6,13.2
9,8.6,2.1,1,4.8
10,199.8,2.6,21.2,10.6
11,66.1,5.8,24.2,8.6
12,214.7,24,4,17.4
13,23.8,35.1,65.9,9.2
14,97.5,7.6,7.2,9.7
15,1,32.9,46,19
16,195.4,47.7,52.9,22.4
17,67.8,36.6,114,12.5
18,281.4,39.6,55.8,24.4
19,69.2,20.5,18.3,11.3
20,147.3,23.9,19.1,14.6
21,218.4,27.7,53.4,18
22,237.4,5.1,23.5,12.5
23,13.2,15.9,49.6,5.6
24,228.3,16.9,26.2,15.5
25,62.3,12.6,18.3,9.7
26,262.9,3.5,19.5,12
27,142.9,29.3,12.6,15
28,240.1,16.7,22.9,15.9
29,248.8,27.1,22.9,18.9
30,70.6,16,40.8,10.5
31,292.9,28.3,43.2,21.4
32,112.9,17.4,38.6,11.9
33,97.2,1.5,30,9.6
34,1e12,20,0.3,17.4

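The main problem is that the dataset contains values of 1e12 used to represent NA. With those values included, the x axis stretches out to 1e12 and all the real data points are squeezed together, so they should be replaced or dropped. The easiest way to convert '1e12' to NA is via the na_values='1e12' parameter to pd.read_csv().
Alternatively, dataset.replace(1e12, np.nan, inplace=True) (with numpy imported as np) can be used to convert them later. np.nan keeps the column's float dtype, whereas pd.NA would turn it into an object column.
Note that dataset is already a dataframe, so the call df = pd.DataFrame(data=dataset) is unnecessary.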
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
dataset = pd.read_csv('sales.csv', header=0, sep=',', na_values='1e12',
usecols=[1, 2, 3, 4])
# remove NaN
dataset.dropna(inplace=True)
sns.regplot(data=dataset, x='TV', y='sales')
plt.show()
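To check the effect of na_values without the original file, here is a minimal sketch that reads a few rows of the sample data from an in-memory string. The csv_text variable and the choice of rows are only for illustration; the read_csv arguments are the ones from the answer.

```python
import io

import pandas as pd

# A few rows copied from the sample sales.csv, including the 1e12 sentinel.
csv_text = """id,TV,radio,newspaper,sales
1,230.10000000,37.8,69.2,22.1
2,1e12,39.3,45.1,10.4
3,17.2,45.9,69.3,9.3
34,1e12,20,0.3,17.4
"""

# na_values tells read_csv to treat the literal text '1e12' as missing.
dataset = pd.read_csv(io.StringIO(csv_text), header=0, sep=',',
                      na_values='1e12', usecols=[1, 2, 3, 4])
n_missing = int(dataset['TV'].isna().sum())

# Rows with a missing value are dropped before plotting.
clean = dataset.dropna()
print(n_missing, len(clean))
```

Printing dataset before the dropna call shows NaN in the TV column where the file had 1e12.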

Related

Plotting data with pandas and matplotlib but no error also not showing data

import matplotlib.pyplot as plt
import pandas as pd
sickpay = pd.read_csv('sickleavedata.csv', index_col = 0)
plt.bar(sickpay, height=1)
plt.xlabel('JobTitle')
plt.ylabel('SickLeaveHours')
plt.title('Ages of different persons')
plt.legend()
plt.show()
I am trying to create a chart from the data, but nothing shows up on it, and I am not getting any errors either.
In the given code, sickpay is the whole dataframe, not a column of it.
You can either pass a single column as follows, or use the dataframe.plot(kind='bar') option:
import matplotlib.pyplot as plt
import pandas as pd
sickpay = pd.read_csv('sickleavedata.csv', index_col = 0)
sickpay.plot(kind='bar')
# or pass the index as the x positions and one column as the bar heights
plt.bar(sickpay.index, sickpay['column_name'])
plt.show()
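As a rough illustration of the difference between passing the dataframe and passing one column, the sketch below builds a small stand-in for sickleavedata.csv. The job titles, the hours, and the column name SickLeaveHours are made up, since the real file is not shown. The Agg backend is selected only so that the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for sickleavedata.csv read with index_col=0.
sickpay = pd.DataFrame(
    {"SickLeaveHours": [12, 30, 7]},
    index=pd.Index(["Clerk", "Engineer", "Nurse"], name="JobTitle"),
)

fig, ax = plt.subplots()
# x positions come from the index, bar heights from a single column.
bars = ax.bar(sickpay.index, sickpay["SickLeaveHours"])
ax.set_xlabel("JobTitle")
ax.set_ylabel("SickLeaveHours")

# Collect the heights that were actually drawn, one per row of the frame.
heights = [bar.get_height() for bar in bars]
```

With an interactive backend, calling plt.show() at the end displays one bar per job title.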

plotting boxplot with sns

I would like to depict the value of my variables found in a dataset in the form of a boxplot. The dataset is the following:
https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(original)
So far my code is the following:
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import preprocessing
df=pd.read_csv(file,names=['id', 'clump_thickness','unif_cell_size',
'unif_cell_shape', 'marg_adhesion', 'single_epith_cell_size',
'bare_nuclei', 'bland_chromatin', 'normal_nucleoli','mitoses','Class'])
#boxplot
plt.figure(figsize=(15,10))
names=list(df.columns)
names=names[:-1]
min_max_scaler=preprocessing.MinMaxScaler()
X = df.drop(["Class"],axis=1)
columnsN=list(X.columns)
x_scaled=min_max_scaler.fit_transform(X) #normalization
X[columnsN]=x_scaled
y = df['Class']
sns.set_context('notebook', font_scale=1.5)
sns.boxplot(x=X['unif_cell_size'],y=y,data=df.iloc[:, :-1],orient="h")
My boxplot returns the following figure:
but I would like to display my information like the following graph:
I know that is from a different dataset, but I can see that they have displayed the diagnosis, at the same time, for each feature with their values. I have tried to do it in different ways, but I am not able to do that graph.
I have tried the following:
data_st = pd.concat([y,X],axis=1)
data_st = pd.melt(data_st,id_vars=columnsN,
var_name="X",
value_name='value')
sns.boxplot(x='value', y="X", data=data_st,hue=y,palette='Set1')
plt.legend(loc='best')
but still no results. Any help?
Thanks
Reshape the data with pandas.DataFrame.melt, as in the code below.
Most of the benign (class 2) boxplots sit at 0 (scaled) or 1 (unscaled), as they should.
To understand the counts, run print(df_scaled_melted.groupby(['Class', 'Attributes', 'Values'])['Values'].count().unstack()) after the melt.
MinMaxScaler has been used, but it is unnecessary in this case, because all of the attributes already share the same range. If you plot the data without scaling, the plot will look the same, except that the y-axis range will be 1 - 10 instead.
Scaling should really only be used when the attribute ranges diverge widely, so that one attribute would otherwise have too much influence in some ML algorithms.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import numpy as np
from sklearn.preprocessing import MinMaxScaler
# path to file
p = Path(r'c:\some_path_to_file\breast-cancer-wisconsin.data')
# create dataframe
df = pd.read_csv(p, names=['id', 'clump_thickness','unif_cell_size',
'unif_cell_shape', 'marg_adhesion', 'single_epith_cell_size',
'bare_nuclei', 'bland_chromatin', 'normal_nucleoli','mitoses','Class'])
# replace ? with np.nan (the np.NaN alias was removed in NumPy 2.0)
df.replace('?', np.nan, inplace=True)
# scale the data
min_max_scaler = MinMaxScaler()
df_scaled = pd.DataFrame(min_max_scaler.fit_transform(df.iloc[:, 1:-1]))
df_scaled.columns = df.columns[1:-1]
df_scaled['Class'] = df['Class']
# melt the dataframe
# df_scaled has no id column, so melt every column (iloc[:, 1:] would drop clump_thickness)
df_scaled_melted = df_scaled.melt(id_vars='Class', var_name='Attributes', value_name='Values')
# plot the data
plt.figure(figsize=(12, 8))
g = sns.boxplot(x='Attributes', y='Values', hue='Class', data=df_scaled_melted)
for item in g.get_xticklabels():
    item.set_rotation(90)
plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))
plt.show()
Without scaling:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import numpy as np
p = Path.cwd() / r'data\breast_cancer\breast-cancer-wisconsin.data'
df = pd.read_csv(p, names=['id', 'clump_thickness','unif_cell_size',
'unif_cell_shape', 'marg_adhesion', 'single_epith_cell_size',
'bare_nuclei', 'bland_chromatin', 'normal_nucleoli','mitoses','Class'])
df.replace('?', np.nan, inplace=True)
df.dropna(inplace=True)
df = df.astype('int')
df_melted = df.iloc[:, 1:].melt(id_vars='Class', var_name='Attributes', value_name='Values')
plt.figure(figsize=(12, 8))
g = sns.boxplot(x='Attributes', y='Values', hue='Class', data=df_melted)
for item in g.get_xticklabels():
    item.set_rotation(90)
plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))
plt.show()
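The reshaping step is easier to follow on a tiny frame. The sketch below uses three made-up rows with two of the attribute columns. It only shows what melt does to the layout; it is not the real dataset.

```python
import pandas as pd

# Made-up rows in the same layout as the real data: attributes plus Class.
wide = pd.DataFrame({
    "clump_thickness": [5, 8, 1],
    "mitoses": [1, 7, 1],
    "Class": [2, 4, 2],
})

# One row per (Class, attribute, value) instead of one column per attribute.
long = wide.melt(id_vars="Class", var_name="Attributes", value_name="Values")

# Number of observations per class and attribute.
counts = long.groupby(["Class", "Attributes"])["Values"].count()
print(long)
print(counts)
```

In this long form, seaborn can map Attributes to the x axis, Values to the y axis and Class to hue, which is what the boxplot calls in the answer rely on.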

How do I style only the last row of a pandas dataframe?

I can style a pandas dataframe:
import pandas as pd
import numpy as np
import seaborn as sns
cm = sns.diverging_palette(-5, 5, as_cmap=True)
df = pd.DataFrame(np.random.randn(3, 4))
df.style.background_gradient(cmap=cm)
but I can't figure out how to only apply a style to the last row. There is a subset option in the background_gradient call, and it suggests that I use an index slice but I cannot figure out how to make just the last row have any kind of styling.
Here is my closest to success:
df.style.background_gradient(cmap=cm, subset=[2], axis='index')
Use the last element of your index as your subset.
df.style.background_gradient(cmap=cm, axis=1, subset=df.index[-1])
You could also use pd.IndexSlice which is useful if you want to apply the style to multiple rows, including the last:
import pandas as pd
import numpy as np
import seaborn as sns
cm = sns.diverging_palette(-5, 5, as_cmap=True)
df = pd.DataFrame(np.random.randn(3, 4))
indices = pd.IndexSlice[[0, df.last_valid_index()], :]
df.style.background_gradient(cmap=cm, axis=1, subset=indices)
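One way to see which cells a subset covers is to pass the same selector to df.loc first. The sketch below builds an explicit row selector for the last row with pd.IndexSlice. The variable names are illustrative, and the frame uses fixed numbers instead of random ones so the result is easy to read.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(3, 4))

# Row selector for the last row only, all columns.
last_row = pd.IndexSlice[df.index[-1:], :]

# The same selector works with .loc, which makes it easy to check.
selected = df.loc[last_row]
print(selected)

# In a notebook, the selector would then be passed as `subset`, for example:
# df.style.background_gradient(cmap='coolwarm', axis=1, subset=last_row)
```

Spelling out both the row and the column part of the selector avoids any ambiguity about whether a bare list such as [2] refers to rows or to columns.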

Plot using seaborn with FacetGrid where values are ndarray in dataframe

I want to plot a dataframe where y values are stored as ndarrays within a column
i.e.:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.DataFrame(index=np.arange(0,4), columns=('sample','class','values'))
for iloc in [0,2]:
    df.loc[iloc] = {'sample':iloc,
                    'class':'raw',
                    'values':np.random.random(5)}
    df.loc[iloc+1] = {'sample':iloc,
                      'class':'predict',
                      'values':np.random.random(5)}
grid = sns.FacetGrid(df, col="class", row="sample")
grid.map(plt.plot, np.arange(0,5), "value")
TypeError: unhashable type: 'numpy.ndarray'
Do I need to break out the ndarrays into separate rows? Is there a simple way to do this?
Thanks
This is quite an unusual way of storing data in a dataframe. Two options (I'd recommend option B):
A. Custom mapping in seaborn
Indeed seaborn does not support such format natively. You may construct your own function to plot to the grid though.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.DataFrame(index=np.arange(0,4), columns=('sample','class','values'))
for iloc in [0,2]:
    df.loc[iloc] = {'sample':iloc,
                    'class':'raw',
                    'values':np.random.random(5)}
    df.loc[iloc+1] = {'sample':iloc,
                      'class':'predict',
                      'values':np.random.random(5)}
grid = sns.FacetGrid(df, col="class", row="sample")
def plot(*args,**kwargs):
    plt.plot(args[0].iloc[0], **kwargs)
grid.map(plot, "values")
B. Unnesting
However I would advise to "unnest" the dataframe first and get rid of the numpy arrays inside the cells.
The question "pandas: When cell contents are lists, create a row for each element in the list" shows a way to do that.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.DataFrame(index=np.arange(0,4), columns=('sample','class','values'))
for iloc in [0,2]:
    df.loc[iloc] = {'sample':iloc,
                    'class':'raw',
                    'values':np.random.random(5)}
    df.loc[iloc+1] = {'sample':iloc,
                      'class':'predict',
                      'values':np.random.random(5)}
res = df.set_index(["sample", "class"])["values"].apply(pd.Series).stack().reset_index()
res.columns = ["sample", "class", "original_index", "values"]
Then use the FacetGrid in the usual way.
grid = sns.FacetGrid(res, col="class", row="sample")
grid.map(plt.plot, "original_index", "values")
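A different way to unnest, not the one used in the answer, is DataFrame.explode. The sketch below assumes pandas 1.1 or later (for ignore_index) and builds the nested frame from a list of records instead of assigning rows with df.loc. The position column plays the role of original_index above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Same layout as in the question: one ndarray of 5 values per row.
records = []
for sample in (0, 2):
    for cls in ("raw", "predict"):
        records.append({"sample": sample, "class": cls, "values": rng.random(5)})
nested = pd.DataFrame(records)

# explode creates one row per array element and repeats the other columns.
long = nested.explode("values", ignore_index=True)

# Position of each element inside its original array (0..4), used as x later.
long["position"] = long.groupby(["sample", "class"]).cumcount()

# explode leaves an object column, so convert back to float for plotting.
long["values"] = long["values"].astype(float)
print(long.head())
```

The resulting frame can be passed to sns.FacetGrid in the same way as res, with "position" and "values" as the arguments to grid.map.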

How to plot DataFrames? in Python

I'm trying to plot a DataFrame, but I'm not getting the results I need. This is an example of what I'm trying to do and what I'm currently getting. (I'm new in Python)
import pandas as pd
import matplotlib.pyplot as plt
my_data = {1965:{'a':52, 'b':54, 'c':67, 'd':45},
1966:{'a':34, 'b':34, 'c':35, 'd':76},
1967:{'a':56, 'b':56, 'c':54, 'd':34}}
df = pd.DataFrame(my_data)
df.plot( style=[])
plt.show()
I'm getting the following graph, but what I need is: the years in the X axis and each line must be what is currently in X axis (a,b,c,d). Thanks for your help!!.
import pandas as pd
import matplotlib.pyplot as plt
my_data = {1965:{'a':52, 'b':54, 'c':67, 'd':45},
1966:{'a':34, 'b':34, 'c':35, 'd':76},
1967:{'a':56, 'b':56, 'c':54, 'd':34}}
df = pd.DataFrame(my_data)
df.T.plot( kind='bar') # or df.T.plot.bar()
plt.show()
Updates:
If this is what you want:
df = pd.DataFrame(my_data)
df.columns=[str(x) for x in df.columns] # convert year numerical values to str
df.T.plot()
plt.show()
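you can do it this way (keeping the year columns numeric so that the tick formatter can print them):
import matplotlib.ticker as mtick
df = pd.DataFrame(my_data)
ax = df.T.plot(linewidth=2.5)
plt.locator_params(nbins=len(df.columns))
ax.xaxis.set_major_formatter(mtick.FormatStrFormatter('%4d'))
plt.show()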
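The reason df.T helps is that DataFrame.plot puts the index on the x axis and draws one series per column. A short sketch with the data from the question shows the layout before and after transposing; by_year is just an illustrative name.

```python
import pandas as pd

my_data = {1965: {'a': 52, 'b': 54, 'c': 67, 'd': 45},
           1966: {'a': 34, 'b': 34, 'c': 35, 'd': 76},
           1967: {'a': 56, 'b': 56, 'c': 54, 'd': 34}}
df = pd.DataFrame(my_data)

# Before: rows are a-d and columns are the years.
print(df)

# After transposing: rows are the years (the x axis), columns are a-d (the lines).
by_year = df.T
print(by_year)
```

Calling by_year.plot() then gives one line each for a, b, c and d across the years.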
