I am a beginner in Python and have a question.
This code works well to create a chart for a single column:
The main DataFrame (main_df) has a 'Datum' column plus the value columns, for example 'DH 001'.
1- Removing Outliers:
def remove_outliers(df_in, col):
    q1 = df_in[col].quantile(0.25)
    q3 = df_in[col].quantile(0.75)
    iqr = q3-q1
    lower_bound = q1-1.5*iqr
    upper_bound = q3+1.5*iqr
    df_out = df_in.loc[(df_in[col] > lower_bound) & (df_in[col] < upper_bound)]
    return df_out
2- Define the Format of the Lineplot
rc={'axes.labelsize': 20, 'font.size': 20, 'legend.fontsize':20,'axes.titlesize':20,'xtick.labelsize': 14,'ytick.labelsize': 14, 'lines.linewidth':1, 'lines.markersize':7, 'xtick.major.pad':10}
sns.set(rc=rc)
3- Create a line plot with seaborn:
import matplotlib.dates as mdates
df1_DH001= remove_outliers(main_df, 'DH 001')[['DH 001','Datum']]
df1_DH001_chart= sns.scatterplot(x='Datum', y='DH 001', data=df1_DH001)
df1_DH001_chart= sns.lineplot(x='Datum', y='DH 001', data=df1_DH001, lw=3, color="b")
df1_DH001_chart.set(xlim=('1995','2019'), ylim=(0, 220) ,title='DH 001', ylabel='Nitrat mg/L', xlabel="Jahr")
df1_DH001_chart.xaxis.set_major_locator(mdates.YearLocator(1))
df1_DH001_chart.xaxis.set_major_formatter(mdates.DateFormatter('%Y'))
df1_DH001_chart
So I got this:
Now I would like to write a for loop that creates the same plot with the same x-axis (Datum) for each of the other columns (there are 22 columns).
Could someone help me?
Import the following:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
Create a sample DF:
data = {'day': ['Mon','Tue','Wed','Thu'],
        'col1': [22000,25000,27000,35000],
        'col2': [2200,2500,2700,3500],
        }
df = pd.DataFrame(data)
Select only the numeric columns of your DF, or alternatively select the columns that you want to consider in the loop:
# 'number' matches all integer and float columns
df1 = df.select_dtypes(include='number')
Iterate through the columns and print a line plot with seaborn:
for i, col in enumerate(df1.columns):
    plt.figure(i)
    sns.lineplot(x='day', y=col, data=df)
Then the following pictures will be shown:
I am trying to write a for loop that creates distplot subplots.
I have a dataframe with many columns of different lengths. (not including the NaN values)
fig = make_subplots(
    rows=len(assets), cols=1,
    y_title = 'Hourly Price Distribution')
i=1
for col in df_all.columns:
    fig = ff.create_distplot([[df_all[[col]].dropna()]], col)
    fig.append()
    i+=1
fig.show()
I am trying to run a for loop for subplots for distplots and get the following error:
PlotlyError: Oops! Your data lists or ndarrays should be the same length.
UPDATE:
This is an example below:
df = pd.DataFrame({'2012': np.random.randn(20),
                   '2013': np.random.randn(20)+1})
df.loc[0, '2012'] = np.nan
fig = ff.create_distplot([df[c].dropna() for c in df.columns],
                         df.columns, show_hist=False, show_rug=False)
fig.show()
I would like to plot each distribution in a different subplot.
Thank you.
Update: Distribution plots
Calculating the correct values is probably both quicker and more elegant using numpy. But I often build parts of my graphs using one plotly approach (figure factory, plotly express) and then use them with other elements of the plotly library (plotly.graph_objects) to get what I want. The complete snippet below shows how to do just that in order to build a go-based subplot with elements from ff.create_distplot. I'd be happy to give further explanations if the following suggestion suits your needs.
Complete code
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.figure_factory as ff
from plotly.subplots import make_subplots
import plotly.graph_objects as go
df = pd.DataFrame({'2012': np.random.randn(20),
                   '2013': np.random.randn(20)+1})
df.loc[0, '2012'] = np.nan
df = df.reset_index()
dfm = pd.melt(df, id_vars=['index'], value_vars=df.columns[1:])
dfm = dfm.dropna()
dfm.rename(columns={'variable':'year'}, inplace = True)
cols = dfm.year.unique()
nrows = len(cols)
fig = make_subplots(rows=nrows, cols=1)
for r, col in enumerate(cols, 1):
    dfs = dfm[dfm['year']==col]
    fx1 = ff.create_distplot([dfs['value'].values], ['distplot'], curve_type='kde')
    fig.add_trace(go.Scatter(
        x=fx1.data[1]['x'],
        y=fx1.data[1]['y'],
        ), row=r, col=1)
fig.show()
First suggestion
You should:
1. Restructure your data with pd.melt(df, id_vars=['index'], value_vars=df.columns[1:]),
2. and then use the resulting column 'variable' to build one subplot per year through the facet_row argument.
In the complete snippet below you'll see that I've changed 'variable' to 'year' in order to make the plot more intuitive. There's one particularly convenient side-effect with this approach, namely that running dfm.dropna() will remove the na value for 2012 only. If you were to do the same thing on your original dataframe, the corresponding value in the same row for 2013 would also be removed.
import numpy as np
import pandas as pd
import plotly.express as px
df = pd.DataFrame({'2012': np.random.randn(20),
                   '2013': np.random.randn(20)+1})
df.loc[0, '2012'] = np.nan
df = df.reset_index()
dfm = pd.melt(df, id_vars=['index'], value_vars=df.columns[1:])
dfm = dfm.dropna()
dfm.rename(columns={'variable':'year'}, inplace = True)
fig = px.histogram(dfm, x="value",
facet_row = 'year')
fig.show()
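The dropna side effect mentioned above can be checked on a tiny frame. This is only a sketch with made-up values (three rows instead of twenty), and the names wide and long_df are just illustrative:

```python
import numpy as np
import pandas as pd

# Small made-up frame: one missing value in '2012', none in '2013'
df = pd.DataFrame({"2012": [np.nan, 1.0, 2.0],
                   "2013": [10.0, 11.0, 12.0]})

# Wide format: dropna() removes the whole first row,
# so the valid 2013 value 10.0 is lost as well
wide = df.dropna()

# Long format: only the single missing 2012 observation is removed
long_df = (df.reset_index()
             .melt(id_vars="index", var_name="year", value_name="value")
             .dropna())

counts = long_df.groupby("year").size().to_dict()
print(len(wide))   # rows left in the wide frame
print(counts)      # observations left per year in the long frame
```

In the long frame every row holds exactly one observation, which is why dropping missing values there does not touch the other year.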
Problem
I have data that looks like the following:
Month  Product  SalesCount
1      4        94
1      6        38
1      2        56
1      7        47
I would like to:
1. Display a histogram with the bars sorted by SalesCount, from highest to lowest.
2. Display all labels and titles.
What I've Tried
import numpy as np
import pandas as pd
import seaborn as sns
rng = np.random.default_rng()
dft = pd.DataFrame({'Month': 1,
                    'Product': rng.choice(30, size=30, replace=False),
                    'SalesCount': np.random.randint(1, 100, 30),
                    })
# Try to sort the dataframe
#dft = dft.sort_values(by=['SalesCount'])
print(dft)
g = sns.catplot(data=dft, kind='bar', x='Product', y='SalesCount', height=6, aspect=1.8, facecolor=(0.3,0.3,0.7,1))
#, order=dft[['Product', 'SalesCount']].index
(g.set_axis_labels('Product', 'Count')
.set_titles('test'))
This shows a chart similar to this:
I have tried sorting the dataframe first (dft = dft.sort_values(by=['SalesCount'])) and also adding the order parameter (order=dft[['Product', 'SalesCount']].index) to the sns.catplot call. Neither attempt sorts the histogram.
The second issue is adding a title. I have tried .set_titles('test') on the FacetGrid instance returned by sns.catplot, but the title does not show up.
Thanks!
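You may need to make your Product column a string instead of an integer. As far as I can tell, seaborn sorts numeric categories by value, whereas string categories keep the row order of the DataFrame, so the sorted frame is then plotted in sorted order. This should work: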
import numpy as np
import pandas as pd
import seaborn as sns
rng = np.random.default_rng()
dft = pd.DataFrame({'Month': 1,
'Product': rng.choice(30, size=30, replace=False),
'SalesCount': np.random.randint(1, 100, 30),
})
# Sort the dataframe from highest to lowest SalesCount
dft = dft.sort_values(by=['SalesCount'], ascending=False)
dft['Product'] = dft['Product'].astype(str)
print(dft)
g = sns.catplot(data=dft, kind='bar', x='Product', y='SalesCount', height=6, aspect=1.8, facecolor=(0.3,0.3,0.7,1))
#, order=dft[['Product', 'SalesCount']].index
(g.set_axis_labels('Product', 'Count')
.set_titles('test'))
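For the two open points (explicit ordering and the missing title), a sketch like the following might help. It is not part of the answer above: it uses the four sample rows from the question, passes the desired order explicitly through catplot's order parameter, and sets the title on the single Axes via g.ax, on the assumption that set_titles only labels facets created with row/col variables. The title text and the Agg backend line are illustrative choices.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so this runs anywhere
import pandas as pd
import seaborn as sns

# The four sample rows from the question
dft = pd.DataFrame({
    "Month": 1,
    "Product": [4, 6, 2, 7],
    "SalesCount": [94, 38, 56, 47],
})
dft["Product"] = dft["Product"].astype(str)  # treat product ids as labels

# Product labels ordered from highest to lowest SalesCount
order = dft.sort_values("SalesCount", ascending=False)["Product"].tolist()

g = sns.catplot(data=dft, kind="bar", x="Product", y="SalesCount",
                order=order, height=6, aspect=1.8)
g.set_axis_labels("Product", "Count")

# With a single facet, the title can be set directly on the one Axes
g.ax.set_title("Sales count per product")
```

Passing order explicitly has the advantage that the bar order no longer depends on how the DataFrame happens to be sorted.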
Hello,
I'm trying to plot a box plot combining columns from two different data frames. Help please :)
This is the code:
import pandas as pd
from numpy import random
#Generating the data frame
df1 = pd.DataFrame(data = random.randn(5,2), columns = ['W','Y'])
df2 = pd.DataFrame(data = random.randn(5,2), columns = ['X','Y'])
print(df1.head())
print('\n')
print(df2.head())
This is the output:
This is what I want to get:
The following will give you what you desire:
import matplotlib.pyplot as plt
fig, ax = plt.subplots(1, 1)
ax.boxplot([df1['Y'], df2['Y']], positions=[1, 2])
ax.set_xticklabels(['W', 'X'])
ax.set_ylabel('Y')
This gave me the plot below (which I think is what you were aiming for):
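An alternative sketch, in case you prefer to stay in pandas: put the two 'Y' columns side by side in one frame and let DataFrame.boxplot draw one box per column. The column labels 'Y (df1)' and 'Y (df2)' are made up for illustration, and the random data is only a stand-in for the frames in the question.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so this runs anywhere
import numpy as np
import pandas as pd

# Stand-in data with the same shape and column names as in the question
rng = np.random.default_rng(0)
df1 = pd.DataFrame(rng.standard_normal((5, 2)), columns=["W", "Y"])
df2 = pd.DataFrame(rng.standard_normal((5, 2)), columns=["X", "Y"])

# Rename each 'Y' so the two columns can live in one frame without clashing
combined = pd.concat(
    [df1["Y"].rename("Y (df1)"), df2["Y"].rename("Y (df2)")],
    axis=1,
)

ax = combined.boxplot()  # one box per column of `combined`
ax.set_ylabel("Y")
```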
I want to draw a line/scatter plot of 'Value' only for the rows where 'Country Name' == 'Argentina', out of the entire data set.
This is my code
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_excel("C:/Users/kdandebo/Desktop/Models/Python excercise/Data3.xlsx")
x = (df['Country Name'])
# I have figured out that x cannot be compared to the string 'Argentina' like this, but I couldn't think of any other way.
# I have also tried the version below, but neither works:
#if (df['Country Name'] == 'Argentina'):
#    y = (df['Value'])
for x == ("Argentina"):
    y = (df['Value'])
plt.scatter(x,y)
plt.show()
The main problem was reading the spreadsheet file and selecting the right sheet:
import pandas as pd
import matplotlib.pyplot as plt
xl = pd.ExcelFile("Data3.xlsx")
df=xl.parse("Data")
x = df[df['Country Name']=="Argentina"]
plt.scatter(x['Country Name'],x['Value'])
plt.show()
This code is self-contained and answers the question:
import pandas as pd
df = pd.DataFrame({'Series Name': ['GDP']*4,
'Country Name': ['Argentina']*2 + ['Bolivia']*2,
'Time': [2001, 2002, 2001, 2002],
'Value': [1, 3, 2, 4]})
#print(df)
df[df['Country Name'] == 'Argentina'].plot.scatter('Time', 'Value')
Answers to questions like this can quite often be found in the library documentation under examples or tutorials.
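To make the reasoning behind the filter explicit: comparing a column to a string gives one True/False per row, not a single value, which is why the if and for attempts in the question cannot work. A minimal sketch with made-up rows (the variable names mask, ambiguous and argentina are only illustrative):

```python
import pandas as pd

df = pd.DataFrame({"Country Name": ["Argentina", "Argentina", "Bolivia"],
                   "Time": [2001, 2002, 2001],
                   "Value": [1, 3, 2]})

# Element-wise comparison: a boolean Series with one entry per row
mask = df["Country Name"] == "Argentina"
print(mask.tolist())

# Using the whole Series in an `if` is ambiguous, so pandas raises ValueError
ambiguous = False
try:
    if mask:
        pass
except ValueError:
    ambiguous = True
print("if on a Series raises ValueError:", ambiguous)

# Indexing with the mask keeps only the rows where it is True
argentina = df[mask]
print(argentina)
```

The filtered frame argentina can then be passed to any of the plotting calls shown in the answers.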
Before you start plotting, you first need to extract the data for Argentina.
import pandas as pd
import matplotlib.pyplot as plt
# Define the headers
headers = ["SeriesName", "CountryName", "Time", "Value"]
# Read in the Excel file
df_raw = pd.read_excel("C:/1/Data3.xlsx",header=None, names=headers)
# Countries to keep
country = ["Argentina"]
# Create a copy of the data with only Argentina
df = df_raw[df_raw.CountryName.isin(country)].copy()
#print(df)
After extracting the data, you can make the plot with pandas alone.
'''Pandas plot'''
df.plot.line(x='Time', y='Value', c='Red',legend =0, title = "ARGENTINA GDP per capita")
plt.show()
You can also make the plot with Matplotlib, Seaborn, or Plotly.
# Create plot from matplotlib
plt.figure()
plt.scatter(df.Value, df.Time)
plt.xlabel('GDP Value')
plt.ylabel('Years')
plt.title('''ARGENTINA
GDP per capita (constant 2010 US$) ''')
plt.show()
Seaborn plot
import seaborn as sns
sns.scatterplot(x="Value", y="Time", data=df, color = 'DarkBlue')
plt.subplots_adjust(top=0.9)
plt.suptitle("ARGENTINA GDP per capita")
plt.show()
Plotly plot
import plotly
import plotly.graph_objs as go
trace = go.Scatter(x = df.Time, y = df.Value)
data = [trace]
plotly.offline.plot({"data": data}, filename='Argentina GDP.html')
I wish to create a seaborn pointplot to display the full data distribution in a column, alongside the distribution of the lowest 25% of values, and the distribution of the highest 25% of values, and all side by side (on the x axis).
My attempt so far gives me the values, but they are all displayed at the same position on the x-axis instead of being spread out from left to right, and there is no obvious way to label the points with x-ticks (which I would prefer to a legend).
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib notebook
df = sns.load_dataset('tips')
df1 = df[(df.total_bill < df.total_bill.quantile(.25))]
df2 = df[(df.total_bill > df.total_bill.quantile(.75))]
sns.pointplot(y=df['total_bill'], data=df, color='red')
sns.pointplot(y=df1['total_bill'], data=df1, color='green')
sns.pointplot(y=df2['total_bill'], data=df2, color='blue')
You could .join() the new distributions to your existing df and then .plot() using wide format:
lower, upper = df.total_bill.quantile([.25, .75]).values.tolist()
df = df.join(df.loc[df.total_bill < lower, 'total_bill'], rsuffix='_lower')
df = df.join(df.loc[df.total_bill > upper, 'total_bill'], rsuffix='_upper')
sns.pointplot(data=df.loc[:, [c for c in df.columns if c.startswith('total')]])
to get:
If you wanted to add groups, you could simply use .unstack() to get to long format:
df = df.loc[:, ['total_bill', 'total_bill_upper', 'total_bill_lower']].unstack().reset_index().drop('level_1', axis=1).dropna()
df.columns = ['grp', 'val']
and then plot with:
sns.pointplot(x='grp', y='val', hue='grp', data=df)
I would think along the lines of adding a "group" and then plot as a single DataFrame.
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib notebook
df = sns.load_dataset('tips')
# Duplicate the data: the first copy gets the 'L'/'H' labels, the second copy becomes 'all'
df = pd.concat([df, df])
df.loc[(df.total_bill < df.total_bill.quantile(.25)),'group'] = 'L'
df.loc[(df.total_bill > df.total_bill.quantile(.75)),'group'] = 'H'
df = df.reset_index(drop=True)
df.loc[len(df)//2:,'group'] = 'all'
sns.pointplot(data = df,
y='total_bill',
x='group',
hue='group',
linestyles='')
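A variation on the same "group" idea, sketched under some assumptions: instead of duplicating the frame and relabelling half of it, the three subsets are built separately and stacked with pd.concat. The random total_bill values are a stand-in for the tips dataset (so no download is needed), and the group labels are made up.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so this runs anywhere
import numpy as np
import pandas as pd
import seaborn as sns

# Stand-in for the tips data: 200 made-up bill amounts
rng = np.random.default_rng(0)
df = pd.DataFrame({"total_bill": rng.uniform(3, 50, 200)})

lower, upper = df["total_bill"].quantile([0.25, 0.75])

# One labelled copy per group, stacked into a single long frame
long_df = pd.concat([
    df.assign(group="all"),
    df[df["total_bill"] < lower].assign(group="lowest 25%"),
    df[df["total_bill"] > upper].assign(group="highest 25%"),
], ignore_index=True)

# One point (mean with error bar) per group, spread along the x-axis
ax = sns.pointplot(data=long_df, x="group", y="total_bill",
                   order=["lowest 25%", "all", "highest 25%"])
```

Because the group is an ordinary column, the x-tick labels come for free and no legend is needed.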