I have a DataFrame where the index is NOT time. I need to re-scale all of the values from an old index which is not equi-spaced, to a new index which has different limits and is equi-spaced.
The first and last values in the columns should stay as they are (although they will have the new, stretched index values assigned to them).
Example code is:
import numpy as np
import pandas as pd
%matplotlib inline
index = np.asarray((2, 2.5, 3, 6, 7, 12, 15, 18, 20, 27))
x = np.sin(index / 10)
df = pd.DataFrame(x, index=index)
df.plot();
newindex = np.linspace(0, 29, 100)
How do I create a DataFrame where the index is newindex and the new x values are interpolated from the old x values?
The first new x value should be the same as the first old x value. Ditto for the last x value. That is, there should not be NaNs at the beginning and copies of the last old x repeated at the end.
The others should be interpolated to fit the new equi-spaced index.
I tried df.interpolate() but couldn't work out how to interpolate against the newindex.
Thanks in advance for any help.
This works well:
import numpy as np
import pandas as pd
def interp(df, new_index):
    """Return a new DataFrame with all columns' values interpolated
    to the new_index values."""
    df_out = pd.DataFrame(index=new_index)
    df_out.index.name = df.index.name
    # DataFrame.iteritems() was removed in pandas 2.0; items() is the replacement
    for colname, col in df.items():
        df_out[colname] = np.interp(new_index, df.index, col)
    return df_out
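Note that np.interp holds the first and last values constant for new index points that fall outside the old index range. If the intent is instead to stretch the old index onto the new limits, as the question describes, one possible sketch is to rescale the old index linearly before interpolating. The helper name stretch_and_interp is hypothetical, and the sketch assumes both indices are sorted in ascending order.

```python
import numpy as np
import pandas as pd

def stretch_and_interp(df, new_index):
    """Hypothetical helper: stretch the old index onto the limits of
    new_index, then interpolate every column linearly.
    Assumes df.index and new_index are sorted ascending."""
    new_index = np.asarray(new_index, dtype=float)
    old = df.index.to_numpy(dtype=float)
    # Linearly map [old[0], old[-1]] onto [new_index[0], new_index[-1]]
    scale = (new_index[-1] - new_index[0]) / (old[-1] - old[0])
    stretched = new_index[0] + (old - old[0]) * scale
    out = pd.DataFrame(index=new_index)
    for colname, col in df.items():
        out[colname] = np.interp(new_index, stretched, col.to_numpy())
    return out

index = np.asarray((2, 2.5, 3, 6, 7, 12, 15, 18, 20, 27))
df = pd.DataFrame(np.sin(index / 10), index=index)
newindex = np.linspace(0, 29, 100)
df_new = stretch_and_interp(df, newindex)
```

With this approach the first and last rows of df_new should carry the first and last old values, now located at the new index limits, with no NaNs and no flat runs at either end.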
I have adopted the following solution:
import numpy as np
import pandas as pd
import matplotlib.pylab as plt
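def reindex_and_interpolate(df, new_index):
    # Index.union replaces the `df.index | new_index` set-operation syntax, which newer pandas no longer supports
    return df.reindex(df.index.union(new_index)).interpolate(method='index', limit_direction='both').loc[new_index]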
index = np.asarray((2, 2.5, 3, 6, 7, 12, 15, 18, 20, 27))
x = np.sin(index / 10)
df = pd.DataFrame(x, index=index)
newindex = pd.Index(np.linspace(min(index)-5, max(index)+5, 50))  # pd.Float64Index was removed in pandas 2.0
df_reindexed = reindex_and_interpolate(df, newindex)
plt.figure()
plt.scatter(df.index, df.values, color='red', alpha=0.5)
plt.scatter(df_reindexed.index, df_reindexed.values, color='green', alpha=0.5)
plt.show()
I wonder if you're up against one of pandas' limitations; you have limited choices for aligning your df to an arbitrary set of numbers (your newindex).
For example, the newindex below only overlaps with the first and last numbers in index, so every other row is NaN after reindexing, and linear interpolation (rightly) draws a straight line between the start (2) and end (27) of your index.
import numpy as np
import pandas as pd
%matplotlib inline
index = np.asarray((2, 2.5, 3, 6, 7, 12, 15, 18, 20, 27))
x = np.sin(index / 10)
df = pd.DataFrame(x, index=index)
newindex = np.linspace(min(index), max(index), 100)
df_reindexed = df.reindex(index = newindex)
df_reindexed.interpolate(method = 'linear', inplace = True)
df.plot()
df_reindexed.plot()
If you change newindex to provide more overlapping points with your original data set, interpolation works in a more expected manner:
newindex = np.linspace(min(index), max(index), 26)
df_reindexed = df.reindex(index = newindex)
df_reindexed.interpolate(method = 'linear', inplace = True)
df.plot()
df_reindexed.plot()
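To see why the number of overlapping points matters, it can help to count how many original values survive the reindex before any interpolation happens. The following is a small sketch using the same data; the helper name surviving_points is hypothetical.

```python
import numpy as np
import pandas as pd

index = np.asarray((2, 2.5, 3, 6, 7, 12, 15, 18, 20, 27))
df = pd.DataFrame(np.sin(index / 10), index=index)

def surviving_points(df, newindex):
    """Hypothetical helper: number of rows that still hold original
    data after reindexing (all other rows are NaN)."""
    return int(df.reindex(newindex).notna().sum().iloc[0])

# 100 points: only exact matches with the old index keep their values
n_sparse = surviving_points(df, np.linspace(index.min(), index.max(), 100))
# 26 points: a step of 1 lands on most of the original index values
n_dense = surviving_points(df, np.linspace(index.min(), index.max(), 26))
print(n_sparse, n_dense)
```

Since reindex only keeps values whose index labels match exactly, interpolate has just those surviving points to work with.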
There are other methods that do not require one to manually align the indices, but the resulting curve (while technically correct) is probably not what one wants:
newindex = np.linspace(min(index), max(index), 1000)
df_reindexed = df.reindex(index = newindex, method = 'ffill')
df.plot()
df_reindexed.plot()
I looked at the pandas docs but I couldn't identify an easy solution.
https://pandas.pydata.org/pandas-docs/stable/basics.html#basics-reindexing
My styled table looks like picture 1, because one value is 0 and the rest are much larger (all values lie between 0 and 100). I would like it to look like picture 2. How can I solve this problem?
Here is a minimal reproducible example.
import numpy as np  # needed for np.round, np.random, np.inf and np.nan below
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib import colors
index = pd.MultiIndex.from_product([[2019, 2020], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Group1', 'Group2', 'Group3'], ['value1', 'value2']],
                                     names=['subject', 'type'])
data = np.round(np.random.randn(4, 6), 1)
data[:, ::2] *= 20
data += 50
rdata = pd.DataFrame(data, index=index, columns=columns)
cc = sns.light_palette("red", as_cmap=True)
cc.set_bad('white')
def my_gradient(s, cmap):
    return [f'background-color: {colors.rgb2hex(x)}'
            for x in cmap(s.replace(np.inf, np.nan))]
styler = rdata.style
red = styler.apply(
    my_gradient,
    cmap=cc,
    subset=rdata.columns.get_loc_level('value1', level=1)[0],
    axis=0)
styler
Picture 1
Picture 2
You need to normalize. Usually, in matplotlib, a norm is used, of which plt.Normalize() is the most standard one.
The updated code could look like:
my_norm = plt.Normalize(0, 100)
def my_gradient(s, cmap):
    return [f'background-color: {colors.rgb2hex(x)}'
            for x in cmap(my_norm(s.replace(np.inf, np.nan)))]
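As a rough illustration of why the norm matters: a matplotlib colormap called with floats expects values in the 0 to 1 range, so raw values such as 50 or 100 all land on the top color. The sketch below uses a hypothetical white-to-red colormap as a stand-in for the seaborn palette and compares the colors with and without normalization.

```python
import numpy as np
from matplotlib import colors

# Hypothetical white-to-red colormap standing in for sns.light_palette("red")
cmap = colors.LinearSegmentedColormap.from_list("white_red", ["white", "red"])
# plt.Normalize is the same class as matplotlib.colors.Normalize
norm = colors.Normalize(vmin=0, vmax=100)

values = np.array([0.0, 50.0, 100.0])
scaled = norm(values)  # maps the 0-100 data range onto 0-1

# Without the norm, every value above 1 saturates at the top color
hex_raw = [colors.rgb2hex(c) for c in cmap(values)]
# With the norm, the colors spread across the whole gradient
hex_scaled = [colors.rgb2hex(c) for c in cmap(scaled)]
print(hex_raw)
print(hex_scaled)
```

Fixing vmin and vmax to the known data range (0 and 100 here) also keeps the colors comparable across columns.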
You can normalize your data with min-max scaling, (x - min) / (max - min), computed per column. To apply this to your dataframe you could use something like the following:
rows = []
for i, row in df.iterrows():
    hold = {}
    for h in df:
        hold[h] = (row[h] - df[h].min()) / (df[h].max() - df[h].min())
    # DataFrame.append was removed in pandas 2.0, so collect the rows in a list
    rows.append(hold)
result = pd.DataFrame(rows)
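The same per-column min-max scaling can also be written without an explicit loop, because pandas aligns the column-wise minima and maxima with the DataFrame's columns. This is a minimal sketch on hypothetical sample data, not the original poster's frame.

```python
import pandas as pd

# Hypothetical sample data; any all-numeric DataFrame works the same way
df = pd.DataFrame({"value1": [0.0, 40.0, 70.0, 100.0],
                   "value2": [48.0, 50.0, 51.0, 52.0]})

col_min = df.min()  # Series of per-column minima
col_max = df.max()  # Series of per-column maxima
# Broadcasting aligns the Series with the columns of df
result = (df - col_min) / (col_max - col_min)
```

One caveat: a column whose values are all equal has max - min equal to zero, so its scaled values come out as NaN.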
I would like to get the 1st element (so the one with the lowest index of the df) in a transform_joinaggregate and use that in a grouped transform_calculate, for example to subtract it.
I can calculate it in pandas like below, but I am looking for a way to do it in Altair.
The 'offset' column below is what I want to get in Altair:
import altair as alt
import numpy as np
import pandas as pd
y = [1, 2, 3, 2, 4, 6]
x = [1,2,3] * 2
cat = ['a','a','a','b','b','b']
df = pd.DataFrame({'cat':cat, 'x':x, 'y':y})
df['offset'] = df.groupby('cat')['y'].transform('first')
line = alt.Chart(df).mark_line().encode(
    x='x:Q',
    y='y:Q',
    color='cat:N'
)
line2 = alt.Chart(df).mark_line(opacity=0.4).transform_calculate(
    y2 = alt.datum.y - alt.datum.offset
).encode(
    x='x:Q',
    y='y2:Q',
    color='cat:N'
)
line + line2
Which gives this plot:
I think I might need to use the "values" aggregation and the "slice" expression but I can't figure it out. The below returns an empty chart (NaNs).
alt.Chart(df).mark_line(point=True).transform_joinaggregate(
    vals = 'values(y)',
    groupby = ['cat']
).transform_calculate(
    corr = alt.expr.slice(alt.datum.vals, 0, 1),
    y2 = alt.datum.y - alt.datum.corr
).encode(
    x='x:Q',
    y='y2:Q',
)
I am trying to compare two small, summarized pandas DataFrames with a Seaborn line plot, but one of the lines is shifted by one unit along the x-axis. What is wrong?
The dataframes are:
Here is my code:
df = pd.read_csv('/home/gazelle/Documents/m3inference/m3_result.csv',index_col='id')
df = df.drop("Unnamed: 0",axis=1)
for i, v in df.iterrows():
    if str(i) not in result:
        df.drop(i, inplace=True)
    else:
        df.loc[i, 'estimated'] = result[str(i)]
m3 = pd.read_csv('plot_result.csv').set_index('id')
ids = list(m3.index.values)
m3 = m3['age'].value_counts().to_frame().reset_index().sort_values('index')
m3 = m3.rename(columns={m3.columns[0]:'bucket', m3.columns[1]:'age'})
df_estimated = df[df.index.isin(ids)]['estimated'].value_counts().to_frame().reset_index().sort_values('index')
df_estimated = df_estimated.rename(columns={df_estimated.columns[0]:'bucket', df_estimated.columns[1]:'age'})
sns.lineplot(x='bucket', y='age', data=m3)
sns.lineplot(x='bucket', y='age', data=df_estimated)
And the result is:
As has been pointed out in the comments, the data and code you provide appear to produce the correct result:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
sns.set()
m3 = pd.DataFrame({"index": [2, 3, 4, 1], "age": [123, 116, 66, 33]})
df_estimated = pd.DataFrame({"index": [3, 2, 4, 1], "estimated": [200, 100, 37, 1]})
sns.lineplot(x="index", y="age", data=m3)
sns.lineplot(x="index", y="estimated", data=df_estimated)
plt.show()
This gives a plot which is different from the one you posted above:
From your screenshots it looks like you are working in a Jupyter notebook. You are probably suffering from the issue that at the time you plot, the dataframe m3 no longer has the values you printed above, but has been modified.
Here's the short example version of my problem:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
x = np.arange(1, 101, 1)
y1 = np.linspace(0, 1000, 50)
y2 = np.linspace(500, 2000, 50)
y = np.concatenate((y1, y2))
data = np.asmatrix([x, y]).T
df = pd.DataFrame(data)
fig, ax = plt.subplots(figsize=(12,8))
ax.plot(df.iloc[:, 0], df.iloc[:, 1], 'r')
plt.gca().invert_yaxis()
plt.show()
Please run the code and see the plot it generates.
I want to read the dataframe from the back (from x=100 to x=0) and make sure my y-axis is always decreasing (from y=2000 to y=0). I want to remove rows where the y value is not decreasing when read from the end of the dataframe.
How can I edit my dataframe to make this happen?
I'm not really happy with this solution, but it's better than nothing. I found it really hard to describe this problem without becoming too vague. Please comment if you see room for improvement, because I know there is.
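newindex = []
running_max = -np.inf  # renamed from `max` to avoid shadowing the built-in
for row in df.index:
    # keep a row only if its y value exceeds every value seen so far
    if df.loc[row, 1] > running_max:
        running_max = df.loc[row, 1]
        newindex.append(row)
df = df.loc[newindex]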
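The loop above scans from the first row and keeps running maxima. If the rows should instead be selected while reading from the last row backwards, as the question describes, a vectorized sketch with cummin might look like the following. The column names "x" and "y" are assumptions made for this example; the original frame uses integer column labels.

```python
import numpy as np
import pandas as pd

x = np.arange(1, 101)
y = np.concatenate((np.linspace(0, 1000, 50), np.linspace(500, 2000, 50)))
df = pd.DataFrame({"x": x, "y": y})  # assumed column names

# Walk the y values from the last row to the first
rev = df["y"].iloc[::-1]
# Smallest value seen strictly before each row in that reversed order
prev_min = rev.cummin().shift(1, fill_value=np.inf)
# Keep a row only if it is lower than everything seen so far
keep = rev < prev_min
# Restore the original row order
filtered = df.loc[keep[keep].index[::-1]]
```

Read from the end, filtered["y"] is strictly decreasing; rows from the first ramp that are not below the lowest value already seen in the second ramp are dropped.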
I have a dataframe of measurements from an experiment.
I can easily plot the data from the data frame using pandas. Here is the result.
The dates are evenly spaced on the axis, but in reality, they are not evenly spaced. How can I get an accurate representation of the time between measurements?
Here is my code for plotting the data frame:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
normal_df= pd.DataFrame(normal, columns = cols, index = rows[2::])
print(normal_df)
#Write the data frame to an xlsx file
normal_df.to_excel(csv_file[0:-4] + '_Normalized_Survival.xlsx')
avg = normal_df.mean()
errors = normal_df.sem()
avg.plot(marker = 'v',yerr = errors)
plt.title('Mean Survival with Standard Error',fontsize = 20)
plt.xticks(fontsize = 12,rotation = 45)
plt.yticks(fontsize = 12)
plt.xlabel('Time',fontsize = 18)
plt.ylabel('% Survival',fontsize = 18)
plt.xlim([0,6.1])
plt.legend(['Survival'])
plt.show()
Here's one option you can try: use string operations to extract the integer day number, then set the index to the resulting values so that the spacing on the x-axis reflects the actual days:
In [10]: cpy = [100, 89, 84, 73, 65, 6, 0]
In [11]: days = ['Day 1','Day 2','Day 3','Day 6','Day 9','Day 14','Day 16']
In [12]: df = pd.DataFrame({'day':days,'val':cpy})
In [13]: df['dayint'] = df.day.apply(lambda x : int(x.split(' ')[-1]))
In [14]: df.set_index(df.dayint, inplace=True)
In [15]: df.val.plot()
In [16]: plt.show()
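A slightly more compact variant of the same idea, sketched here with the vectorized string accessor instead of apply, assuming every label contains exactly one run of digits:

```python
import pandas as pd

cpy = [100, 89, 84, 73, 65, 6, 0]
days = ['Day 1', 'Day 2', 'Day 3', 'Day 6', 'Day 9', 'Day 14', 'Day 16']
df = pd.DataFrame({'day': days, 'val': cpy})

# Pull the digits out of each label and convert them to integers
day_numbers = df['day'].str.extract(r'(\d+)', expand=False).astype(int)
# Use the numeric day as the index so plots space points by actual day
df_by_day = df.set_index(day_numbers.rename('dayint'))
```

Calling df_by_day['val'].plot() would then place the points at their numeric day positions rather than at evenly spaced category labels.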