How can I evenly space date data from a data frame? - python

I have a dataframe of measurements from an experiment.
I can easily plot the data from the data frame using pandas. Here is the result.
The dates are evenly spaced on the axis, but in reality, they are not evenly spaced. How can I get an accurate representation of the time between measurements?
Here is my code for plotting the data frame:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
normal_df= pd.DataFrame(normal, columns = cols, index = rows[2::])
print(normal_df)
#Write the data frame to an xlsx file
normal_df.to_excel(csv_file[0:-4] + '_Normalized_Survival.xlsx')
avg = normal_df.mean()
errors = normal_df.sem()
avg.plot(marker = 'v',yerr = errors)
plt.title('Mean Survival with Standard Error',fontsize = 20)
plt.xticks(fontsize = 12,rotation = 45)
plt.yticks(fontsize = 12)
plt.xlabel('Time',fontsize = 18)
plt.ylabel('% Survival',fontsize = 18)
plt.xlim([0,6.1])
plt.legend(['Survival'])
plt.show()

Here's one option you can try: use string operations to extract the integer day from each label, then set the index to the resulting values so that the plot uses the real spacing between measurements.
In [10]: cpy = [100, 89, 84, 73, 65, 6, 0]
In [11]: days = ['Day 1','Day 2','Day 3','Day 6','Day 9','Day 14','Day 16']
In [12]: df = pd.DataFrame({'day':days,'val':cpy})
In [13]: df['dayint'] = df.day.apply(lambda x : int(x.split(' ')[-1]))
In [14]: df.set_index(df.dayint, inplace=True)
In [15]: df.val.plot()
In [16]: plt.show()
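The same idea can be sketched with a regular expression instead of `split`. This is only an illustration: the helper name `with_numeric_day_index` and the `label_col` parameter are made up here, and it assumes every label contains exactly one integer.

```python
import pandas as pd

# Data mirroring the answer above: labels like "Day 6" with uneven gaps.
df = pd.DataFrame({
    "day": ["Day 1", "Day 2", "Day 3", "Day 6", "Day 9", "Day 14", "Day 16"],
    "val": [100, 89, 84, 73, 65, 6, 0],
})

def with_numeric_day_index(frame, label_col="day"):
    """Return a copy of frame indexed by the integer found in each label."""
    day_numbers = frame[label_col].str.extract(r"(\d+)", expand=False).astype(int)
    return frame.set_index(day_numbers.rename("dayint"))

indexed = with_numeric_day_index(df)
print(indexed.index.tolist())
```

Calling `indexed["val"].plot(marker="v")` should then place the points at their numeric day values rather than at evenly spaced positions.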

Related

How to normalize coloring of data with seaborn in pandas?

My output looks like picture 1 because one value is 0 and the rest are much bigger (values are between 0 and 100). I would like output like the one shown in picture 2. How can I solve this problem?
Here is a minimal reproducible example.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib import colors
index = pd.MultiIndex.from_product([[2019, 2020], [1, 2]],
names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Group1', 'Group2', 'Group3'], ['value1', 'value2']],
names=['subject', 'type'])
data = np.round(np.random.randn(4, 6), 1)
data[:, ::2] *= 20
data += 50
rdata = pd.DataFrame(data, index=index, columns=columns)
cc = sns.light_palette("red", as_cmap=True)
cc.set_bad('white')
def my_gradient(s, cmap):
    return [f'background-color: {colors.rgb2hex(x)}'
            for x in cmap(s.replace(np.inf, np.nan))]
styler = rdata.style
red = styler.apply(
my_gradient,
cmap=cc,
subset=rdata.columns.get_loc_level('value1', level=1)[0],
axis=0)
styler
Picture 1
Picture 2
You need to normalize. Usually, in matplotlib, a norm is used, of which plt.Normalize() is the most standard one.
The updated code could look like:
my_norm = plt.Normalize(0, 100)
def my_gradient(s, cmap):
    return [f'background-color: {colors.rgb2hex(x)}'
            for x in cmap(my_norm(s.replace(np.inf, np.nan)))]
You can normalize your data with the equation (x-min)/(max-min). To apply this to your dataframe, you could use something like the following:
rows = []
for i, row in df.iterrows():
    hold = {}
    for h in df:
        hold[h] = (row[h] - df[h].min()) / (df[h].max() - df[h].min())
    rows.append(hold)
# DataFrame.append was removed in pandas 2.0, so build the frame from a list of rows
result = pd.DataFrame(rows)
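The same (x-min)/(max-min) scaling can also be written without an explicit loop, since pandas aligns column-wise minima and maxima automatically. This is a minimal sketch; the function name `min_max_normalize` and the example values are illustrative and not taken from the question.

```python
import pandas as pd

def min_max_normalize(df):
    """Scale every column to the range [0, 1] using (x - min) / (max - min)."""
    col_min = df.min()
    col_range = df.max() - col_min
    return (df - col_min) / col_range

# Hypothetical example values.
example = pd.DataFrame({"a": [0.0, 50.0, 100.0], "b": [10.0, 20.0, 30.0]})
normalized = min_max_normalize(example)
print(normalized)
```

Note that a column whose values are all identical has a range of zero, so this sketch would produce NaN for it.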

Seaborn catplot Sort by Count column

Problem
I have data that looks like the following:
Month  Product  SalesCount
1      4        94
1      6        38
1      2        56
1      7        47
I would like to:
Display a histogram with the bars sorted by SalesCount, from highest to lowest.
Display all labels and titles.
What I've Tried
import numpy as np
import pandas as pd
import seaborn as sns
rng = np.random.default_rng()
dft = pd.DataFrame({'Month': 1,
'Product': rng.choice(30, size=30, replace=False),
'SalesCount': np.random.randint(1, 100, 30),
})
# Try to sort the dataframe
#dft = dft.sort_values(by=['SalesCount'])
print(dft)
g = sns.catplot(data=dft, kind='bar', x='Product', y='SalesCount', height=6, aspect=1.8, facecolor=(0.3,0.3,0.7,1))
#, order=dft[['Product', 'SalesCount']].index
(g.set_axis_labels('Product', 'Count')
.set_titles('test'))
This produces a chart similar to this:
I have tried sorting the dataframe first (dft = dft.sort_values(by=['SalesCount'])) and also adding the order parameter (order=dft[['Product', 'SalesCount']].index) to the sns.catplot call. Neither attempt sorts the histogram.
The second issue is the title. I tried calling .set_titles('test') on the FacetGrid instance returned by sns.catplot, but the title does not show up.
Thanks!
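You may need to make your Product column a string instead of an integer. With integer values, seaborn orders the categories numerically, so sorting the dataframe has no effect; with strings, it keeps the order in which the values appear. This should work: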
import numpy as np
import pandas as pd
import seaborn as sns
rng = np.random.default_rng()
dft = pd.DataFrame({'Month': 1,
'Product': rng.choice(30, size=30, replace=False),
'SalesCount': np.random.randint(1, 100, 30),
})
# Sort the dataframe by SalesCount, highest first
dft = dft.sort_values(by=['SalesCount'], ascending=False)
dft['Product'] = dft['Product'].astype(str)
print(dft)
g = sns.catplot(data=dft, kind='bar', x='Product', y='SalesCount', height=6, aspect=1.8, facecolor=(0.3,0.3,0.7,1))
#, order=dft[['Product', 'SalesCount']].index
(g.set_axis_labels('Product', 'Count')
.set_titles('test'))
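An alternative that keeps Product as an integer is to pass the sorted Product values through the `order` parameter, rather than the dataframe index as attempted in the question. For the title, `set_titles` appears to fill in per-facet templates only when the grid is faceted by `row` or `col`, so the sketch below sets the title on the single Axes instead. The helper names `products_by_sales` and `plot_sorted` are hypothetical, and the data is randomly generated.

```python
import numpy as np
import pandas as pd

def products_by_sales(dft):
    """Return Product values ordered by SalesCount, highest first."""
    return dft.sort_values("SalesCount", ascending=False)["Product"].tolist()

def plot_sorted(dft, title):
    """Draw the bar chart with an explicit category order and a visible title."""
    import seaborn as sns  # imported here so the ordering helper works without seaborn
    g = sns.catplot(data=dft, kind="bar", x="Product", y="SalesCount",
                    order=products_by_sales(dft), height=6, aspect=1.8)
    g.set_axis_labels("Product", "Count")
    g.ax.set_title(title)  # single-facet grid: title the one Axes directly
    return g

# Hypothetical data with the same columns as the question.
rng = np.random.default_rng(0)
dft = pd.DataFrame({"Month": 1,
                    "Product": rng.choice(30, size=30, replace=False),
                    "SalesCount": rng.integers(1, 100, size=30)})
order = products_by_sales(dft)
```

Calling `plot_sorted(dft, "test")` should then draw the bars from the highest to the lowest SalesCount with the title shown above the chart.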

Working with spectra (energy/data pairs) [duplicate]

I have a DataFrame where the index is NOT time. I need to re-scale all of the values from an old index which is not equi-spaced, to a new index which has different limits and is equi-spaced.
The first and last values in the columns should stay as they are (although they will have the new, stretched index values assigned to them).
Example code is:
import numpy as np
import pandas as pd
%matplotlib inline
index = np.asarray((2, 2.5, 3, 6, 7, 12, 15, 18, 20, 27))
x = np.sin(index / 10)
df = pd.DataFrame(x, index=index)
df.plot();
newindex = np.linspace(0, 29, 100)
How do I create a DataFrame where the index is newindex and the new x values are interpolated from the old x values?
The first new x value should be the same as the first old x value. Ditto for the last x value. That is, there should not be NaNs at the beginning and copies of the last old x repeated at the end.
The others should be interpolated to fit the new equi-spaced index.
I tried df.interpolate() but couldn't work out how to interpolate against the newindex.
Thanks in advance for any help.
This works well:
import numpy as np
import pandas as pd
def interp(df, new_index):
    """Return a new DataFrame with all columns values interpolated
    to the new_index values."""
    df_out = pd.DataFrame(index=new_index)
    df_out.index.name = df.index.name
    # DataFrame.iteritems was removed in pandas 2.0; items() is the replacement
    for colname, col in df.items():
        df_out[colname] = np.interp(new_index, df.index, col)
    return df_out
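The function above interpolates at the new index values but leaves the old index where it is, so new points outside the old limits take the nearest end value. The question also asks for the old limits to be stretched onto the new ones. A minimal sketch of that extra step follows; the name `stretch_and_interp` is made up here, and it assumes the old index is sorted in increasing order and that a linear stretch is what is wanted.

```python
import numpy as np
import pandas as pd

def stretch_and_interp(df, new_index):
    """Map the old index limits onto the new limits, then interpolate linearly."""
    old = df.index.to_numpy(dtype=float)
    new = np.asarray(new_index, dtype=float)
    # Linear rescale: old[0] -> new[0] and old[-1] -> new[-1]
    scaled = new[0] + (old - old[0]) / (old[-1] - old[0]) * (new[-1] - new[0])
    out = pd.DataFrame(index=pd.Index(new, name=df.index.name))
    for colname in df.columns:
        out[colname] = np.interp(new, scaled, df[colname].to_numpy())
    return out

# Data from the question.
index = np.asarray((2, 2.5, 3, 6, 7, 12, 15, 18, 20, 27))
df = pd.DataFrame(np.sin(index / 10), index=index)
newindex = np.linspace(0, 29, 100)
stretched = stretch_and_interp(df, newindex)
```

With this approach the first and last rows of `stretched` carry the first and last original values, and every row in between is interpolated on the stretched axis.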
I have adopted the following solution:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
def reindex_and_interpolate(df, new_index):
    # Index.union replaces the older `df.index | new_index` set operation
    return df.reindex(df.index.union(new_index)).interpolate(method='index', limit_direction='both').loc[new_index]
index = np.asarray((2, 2.5, 3, 6, 7, 12, 15, 18, 20, 27))
x = np.sin(index / 10)
df = pd.DataFrame(x, index=index)
# pd.Index with a float dtype replaces pd.Float64Index, which pandas 2.0 removed
newindex = pd.Index(np.linspace(min(index)-5, max(index)+5, 50), dtype=float)
df_reindexed = reindex_and_interpolate(df, newindex)
plt.figure()
plt.scatter(df.index, df.values, color='red', alpha=0.5)
plt.scatter(df_reindexed.index, df_reindexed.values, color='green', alpha=0.5)
plt.show()
I wonder if you're up against one of pandas' limitations; it seems like you have limited choices for aligning your df to an arbitrary set of numbers (your newindex).
For example, your stated newindex only overlaps with the first and last numbers in index, so linear interpolation (rightly) interpolates a straight line between the start (2) and end (27) of your index.
import numpy as np
import pandas as pd
%matplotlib inline
index = np.asarray((2, 2.5, 3, 6, 7, 12, 15, 18, 20, 27))
x = np.sin(index / 10)
df = pd.DataFrame(x, index=index)
newindex = np.linspace(min(index), max(index), 100)
df_reindexed = df.reindex(index = newindex)
df_reindexed.interpolate(method = 'linear', inplace = True)
df.plot()
df_reindexed.plot()
If you change newindex to provide more overlapping points with your original data set, interpolation works in a more expected manner:
newindex = np.linspace(min(index), max(index), 26)
df_reindexed = df.reindex(index = newindex)
df_reindexed.interpolate(method = 'linear', inplace = True)
df.plot()
df_reindexed.plot()
There are other methods that do not require one to manually align the indices, but the resulting curve (while technically correct) is probably not what one wants:
newindex = np.linspace(min(index), max(index), 1000)
df_reindexed = df.reindex(index = newindex, method = 'ffill')
df.plot()
df_reindexed.plot()
I looked at the pandas docs but I couldn't identify an easy solution.
https://pandas.pydata.org/pandas-docs/stable/basics.html#basics-reindexing

Scatter and curve plot using matplotlib

I am trying to plot the accuracy evolution of NN models over time. I have an Excel file with data like the following:
and I wrote the following code to extract data and plot the scatter:
import pandas as pd
data = pd.read_excel (r'SOTA DNN.xlsx')
acc1 = pd.DataFrame(data, columns= ['Top-1-Acc'])
para = pd.DataFrame(data, columns= ['Parameters'])
dates = pd.to_datetime(data['Date'], format='%Y-%m-%d')
import matplotlib.pyplot as plt
plt.grid(True)
plt.ylim(40, 100)
plt.scatter(dates, acc1)
plt.show()
Is there a way to draw a line in the same figure that connects only the models achieving a new maximum, and to print their names, as in this example:
Is it also possible to stretch the x-axis (for the dates)?
It is still not clear what you mean by "stretch the x-axis" and you did not provide your dataset, but here is a possible general approach:
import matplotlib.pyplot as plt
import pandas as pd
#fake data generation, this has to be substituted by your .xls import routine
import string
import numpy as np
np.random.seed(1234)
n = 30
acc = np.concatenate([np.random.randint(0, 10, 10), np.random.randint(0, 30, 10), np.random.randint(0, 100, n-20)])
date_range = pd.date_range("20190101", periods=n)
#random 5-letter model names (pandas._testing.rands_array is a private helper that may not exist in your pandas version)
model = ["".join(np.random.choice(list(string.ascii_letters), 5)) for _ in range(n)]
df = pd.DataFrame({"Model": model, "Date": date_range, "TopAcc": acc})
df = df.sample(frac=1).reset_index(drop=True)
#now to the actual data modification
#first, we extract the dataframe with monotonically increasing values after sorting the date column
df = df.sort_values("Date").reset_index()
df["Max"] = df.TopAcc.cummax().diff()
df.loc[0, "Max"] = 1
dfmax = df[df.Max > 0]
#then, we plot all data, followed by the best performers
fig, ax = plt.subplots(figsize=(10, 5))
ax.scatter(df.Date, df.TopAcc, c="grey")
ax.plot(dfmax.Date, dfmax.TopAcc, marker="x", c="blue")
#finally, we annotate the best performers
for _, xylabel in dfmax.iterrows():
    ax.text(xylabel.Date, xylabel.TopAcc, xylabel.Model, c="blue", horizontalalignment="right", verticalalignment="bottom")
plt.show()
Sample output:

Pandas pie plot actual values for multiple graphs

I'm trying to print actual values in pie charts instead of percentages. For a one-dimensional series, this helps:
Matplotlib pie-chart: How to replace auto-labelled relative values by absolute values
But when I try to create multiple pies it won't work.
d = {'Yes':pd.Series([825, 56], index=["Total", "Last 2 Month"]), 'No':pd.Series([725, 73], index=["Total", "Last 2 Month"])}
df = pd.DataFrame(d)
df = df.T
def absolute_value(val):
    a = np.round(val/100.*df.values, 0)
    return a
df.plot.pie(subplots=True, figsize=(12, 6),autopct=absolute_value)
plt.show()
How can I make this right?
Thanks.
A hacky solution would be to index the dataframe within the absolute_value function, considering that this function is called exactly once per value in that dataframe.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
d = {'Yes':pd.Series([825, 56], index=["Total", "Last 2 Month"]),
'No':pd.Series([725, 73], index=["Total", "Last 2 Month"])}
df = pd.DataFrame(d)
df = df.T
i = [0]
def absolute_value(val):
    a = df.iloc[i[0]%len(df),i[0]//len(df)]
    i[0] += 1
    return a
df.plot.pie(subplots=True, figsize=(12, 6),autopct=absolute_value)
plt.show()
The other option is to plot the pie charts individually by looping over the columns.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
d = {'Yes':pd.Series([825, 56], index=["Total", "Last 2 Month"]),
'No':pd.Series([725, 73], index=["Total", "Last 2 Month"])}
df = pd.DataFrame(d)
df = df.T
print(df.iloc[:,0].sum())
def absolute_value(val, summ):
    a = np.round(val/100.*summ,0)
    return a
fig, axes = plt.subplots(ncols=len(df.columns))
for i,ax in enumerate(axes):
    df.iloc[:,i].plot.pie(ax=ax,autopct=lambda x: absolute_value(x,df.iloc[:,i].sum()))
plt.show()
In both cases the output would look similar to this
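A variation on the second option is to build the `autopct` callback with a small factory function, so each pie gets a callback bound to its own column total. This is only a sketch: `make_autopct` is a hypothetical helper name, and it assumes the counts are whole numbers so that rounding recovers them exactly.

```python
def make_autopct(total):
    """Build an autopct callback that converts a percentage back to an absolute count."""
    def autopct(pct):
        # pct is the wedge's share in percent, as matplotlib passes it to autopct
        return f"{pct / 100.0 * total:.0f}"
    return autopct

# Values from the "Total" column in the question: Yes = 825, No = 725.
total = 825 + 725
label_yes = make_autopct(total)(100.0 * 825 / total)
label_no = make_autopct(total)(100.0 * 725 / total)
print(label_yes, label_no)
```

Inside the plotting loop this could be used as `df.iloc[:,i].plot.pie(ax=ax, autopct=make_autopct(df.iloc[:,i].sum()))`, which avoids the module-level counter of the first option.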
