Here's the short example version of my problem:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
x = np.arange(1, 101, 1)
y1 = np.linspace(0, 1000, 50)
y2 = np.linspace(500, 2000, 50)
y = np.concatenate((y1, y2))
data = np.column_stack((x, y))  # two columns: x and y (avoids the discouraged np.matrix class)
df = pd.DataFrame(data)
fig, ax = plt.subplots(figsize=(12,8))
ax.plot(df.iloc[:, 0], df.iloc[:, 1], 'r')
plt.gca().invert_yaxis()
plt.show()
Please run the code and see the plot it generates.
I want to read the dataframe from the back (from x=100 back to x=1) and make sure y is always decreasing (from y=2000 down to y=0). That is, I want to remove the rows where the y value does not keep decreasing when read from the end of the dataframe.
How can I edit my dataframe to make this happen?
I'm not really happy with this solution, but it's better than nothing. I found it really hard to describe this problem without becoming too vague. Please comment if you see room for improvement, because I know there is.
newindex = []
running_max = float('-inf')  # avoids shadowing the built-in max() and an arbitrary sentinel
for row in df.index:
    # keep the row only if it sets a new record (strictly increasing y front-to-back)
    if df.loc[row, 1] > running_max:
        running_max = df.loc[row, 1]
        newindex.append(row)
df = df.loc[newindex]
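For reference, the same filter can be written without the explicit loop by comparing each y-value with the running maximum of all earlier rows (a sketch, applied to the original, unfiltered df):

# running maximum of all *previous* y-values; -inf keeps the first row
prev_max = df[1].cummax().shift(fill_value=-np.inf)

# keep only rows that set a new record: strictly increasing y front-to-back,
# i.e. strictly decreasing when read from the end
df_filtered = df[df[1] > prev_max]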
Related
Limsup is defined from the suprema of the sequence from each point onward. In other words, at the current position one looks at all future values and takes their supremum to build the limsup.
Question
What is the most efficient and Pythonic way of calculating limsup/liminf in pandas?
My try
I am calculating the limsup using a for loop which I am sure is not an efficient way.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(0)
x = np.random.randn(2000)
y = np.cumsum(x)
df = pd.DataFrame(y, columns=['value'])
df['lim_sup'] = 0.0
fig, ax = plt.subplots(figsize=(20, 4))
# O(n^2) loop: for every position, take the maximum of all values from there onward
for i in range(len(df)):
    df.loc[df.index[i], 'lim_sup'] = df['value'].iloc[i:].max()
df['value'].plot(ax=ax)
df['lim_sup'].plot(ax=ax, color='r')
ax.legend(['value', 'limsup'])
plt.show()
Reverse the values and use cummax to get the cumulative maximum from the bottom up:
df["suprema"] = df.loc[::-1, "value"].cummax()
This column should probably be referred to as the suprema for m >= n, rather than the limsup.
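Applied to the DataFrame from the question, a quick sketch to check it against the looped version (assuming the df built above):

# cumulative maximum computed from the bottom up; the assignment re-aligns on the
# index, so each row receives the supremum over itself and all later rows
df["suprema"] = df.loc[::-1, "value"].cummax()

fig, ax = plt.subplots(figsize=(20, 4))
df["value"].plot(ax=ax)
df["suprema"].plot(ax=ax, color="r")
ax.legend(["value", "suprema"])
plt.show()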
I get output like you can see in picture 1, because one value is 0 and the rest are much bigger (the values are between 0 and 100). I would like to get output like what is shown in picture 2. How can I solve this problem?
This is a minimal reproducible example:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib import colors

index = pd.MultiIndex.from_product([[2019, 2020], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Group1', 'Group2', 'Group3'], ['value1', 'value2']],
                                     names=['subject', 'type'])
data = np.round(np.random.randn(4, 6), 1)
data[:, ::2] *= 20
data += 50
rdata = pd.DataFrame(data, index=index, columns=columns)

cc = sns.light_palette("red", as_cmap=True)
cc.set_bad('white')

def my_gradient(s, cmap):
    return [f'background-color: {colors.rgb2hex(x)}'
            for x in cmap(s.replace(np.inf, np.nan))]

styler = rdata.style
red = styler.apply(
    my_gradient,
    cmap=cc,
    subset=rdata.columns.get_loc_level('value1', level=1)[0],
    axis=0)
styler
Picture 1
Picture 2
You need to normalize. In matplotlib this is usually done with a norm, of which plt.Normalize() is the most standard one.
The updated code could look like:
my_norm = plt.Normalize(0, 100)

def my_gradient(s, cmap):
    return [f'background-color: {colors.rgb2hex(x)}'
            for x in cmap(my_norm(s.replace(np.inf, np.nan)))]
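If you do not need the special handling of inf values, pandas' built-in Styler.background_gradient can apply the same fixed 0-100 scaling directly through its vmin/vmax arguments (a sketch reusing rdata, cc, and the subset from the question):

# built-in gradient with a fixed 0-100 scale, limited to the 'value1' columns
rdata.style.background_gradient(
    cmap=cc,
    vmin=0,
    vmax=100,
    subset=rdata.columns.get_loc_level('value1', level=1)[0],
    axis=0,
)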
You can normalize your data with the equation (x - min) / (max - min). To apply this to your dataframe you could use something like the following:
result_rows = []
for i, row in df.iterrows():
    hold = {}
    for h in df:
        hold[h] = (row[h] - df[h].min()) / (df[h].max() - df[h].min())
    result_rows.append(hold)

# DataFrame.append was removed in pandas 2.0, so collect the rows first
result = pd.DataFrame(result_rows)
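The same min-max scaling can also be written without the explicit loops, since pandas applies these operations column-wise (a sketch assuming all columns are numeric):

# column-wise (x - min) / (max - min) in one expression
result = (df - df.min()) / (df.max() - df.min())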
I have a DataFrame where the index is NOT time. I need to re-scale all of the values from an old index which is not equi-spaced, to a new index which has different limits and is equi-spaced.
The first and last values in the columns should stay as they are (although they will have the new, stretched index values assigned to them).
Example code is:
import numpy as np
import pandas as pd
%matplotlib inline
index = np.asarray((2, 2.5, 3, 6, 7, 12, 15, 18, 20, 27))
x = np.sin(index / 10)
df = pd.DataFrame(x, index=index)
df.plot();
newindex = np.linspace(0, 29, 100)
How do I create a DataFrame where the index is newindex and the new x values are interpolated from the old x values?
The first new x value should be the same as the first old x value. Ditto for the last x value. That is, there should not be NaNs at the beginning and copies of the last old x repeated at the end.
The others should be interpolated to fit the new equi-spaced index.
I tried df.interpolate() but couldn't work out how to interpolate against the newindex.
Thanks in advance for any help.
This works well:
import numpy as np
import pandas as pd
def interp(df, new_index):
    """Return a new DataFrame with all columns' values interpolated
    to the new_index values."""
    df_out = pd.DataFrame(index=new_index)
    df_out.index.name = df.index.name

    for colname, col in df.items():  # iteritems() was removed in pandas 2.0
        df_out[colname] = np.interp(new_index, df.index, col)

    return df_out
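Used with the data from the question, it would look something like this (a sketch):

index = np.asarray((2, 2.5, 3, 6, 7, 12, 15, 18, 20, 27))
df = pd.DataFrame(np.sin(index / 10), index=index)
newindex = np.linspace(0, 29, 100)

# np.interp clamps outside the old range, so the first and last old values
# are carried to the new endpoints instead of becoming NaN
df_new = interp(df, newindex)
df_new.plot()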
I have adopted the following solution:
import numpy as np
import pandas as pd
import matplotlib.pylab as plt
def reindex_and_interpolate(df, new_index):
    # index.union() is the explicit set operation; `df.index | new_index`
    # is no longer treated as a union on numeric indexes in recent pandas
    return (df.reindex(df.index.union(new_index))
              .interpolate(method='index', limit_direction='both')
              .loc[new_index])
index = np.asarray((2, 2.5, 3, 6, 7, 12, 15, 18, 20, 27))
x = np.sin(index / 10)
df = pd.DataFrame(x, index=index)
newindex = pd.Index(np.linspace(min(index) - 5, max(index) + 5, 50))  # pd.Float64Index was removed in pandas 2.0
df_reindexed = reindex_and_interpolate(df, newindex)
plt.figure()
plt.scatter(df.index, df.values, color='red', alpha=0.5)
plt.scatter(df_reindexed.index, df_reindexed.values, color='green', alpha=0.5)
plt.show()
I wonder if you're up against one of pandas' limitations; it seems like you have limited choices for aligning your df to an arbitrary set of numbers (your newindex).
For example, your stated newindex only overlaps with the first and last numbers in index, so linear interpolation (rightly) interpolates a straight line between the start (2) and end (27) of your index.
import numpy as np
import pandas as pd
%matplotlib inline
index = np.asarray((2, 2.5, 3, 6, 7, 12, 15, 18, 20, 27))
x = np.sin(index / 10)
df = pd.DataFrame(x, index=index)
newindex = np.linspace(min(index), max(index), 100)
df_reindexed = df.reindex(index = newindex)
df_reindexed.interpolate(method = 'linear', inplace = True)
df.plot()
df_reindexed.plot()
If you change newindex to provide more overlapping points with your original data set, interpolation works in a more expected manner:
newindex = np.linspace(min(index), max(index), 26)
df_reindexed = df.reindex(index = newindex)
df_reindexed.interpolate(method = 'linear', inplace = True)
df.plot()
df_reindexed.plot()
There are other methods that do not require one to manually align the indices, but the resulting curve (while technically correct) is probably not what one wants:
newindex = np.linspace(min(index), max(index), 1000)
df_reindexed = df.reindex(index = newindex, method = 'ffill')
df.plot()
df_reindexed.plot()
I looked at the pandas docs but I couldn't identify an easy solution.
https://pandas.pydata.org/pandas-docs/stable/basics.html#basics-reindexing
I am importing data from a .txt file and plotting it; that part works.
Now I want to hide some parts of the data, i.e. set all y-values for an interval of x to 0 or, better, hide them completely, while the rest of the plot stays visible.
# backslashes in the Windows path are escaped so Python does not treat them as escape sequences
data = pd.read_csv('C:\\users\\johan\\Documents\\Arbeit\\Schwarzkoerper\\Avasoft\\winkel3\\' + ''.join(L[k]),
                   sep=';', skiprows=10, decimal=",", header=None)
data = pd.DataFrame(data)  # required

x = data[0] * 10**(-9)
y = data[1]

plt.plot(x, y * Teilung())  # Teilung() is the user's own scaling function
plt.axis([450*10**(-9), 1100*10**(-9), 0, 60000])
plt.show()
To be concrete: I want to hide the y-values for x in [500e-9, 600e-9].
Using the following column operation with a boolean condition, it is possible to keep only the y-values you need:
df.y[(df.x < 500 * 1e-9) | (df.x > 600 * 1e-9)]
Then plot those y-values against their corresponding x-values.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
lower_bound = 500 * 1e-9
upper_bound = 600 * 1e-9

df = pd.DataFrame({
    "x": np.linspace(0, 1100 * 1e-9, 100),
    "y": [5 * np.cos(i) + i for i in range(100)],
})

# original values
plt.plot(df.iloc[:, 0], df.iloc[:, 1])

# keep only the points outside the hidden interval and plot y against
# the x-values of those same points (not the first len(hidden_df) rows)
mask = (df.x < lower_bound) | (df.x > upper_bound)
hidden_df = df.y[mask]
plt.plot(df.x[mask], hidden_df)
plt.ticklabel_format(axis="x", style="sci", scilimits=(0,0))
plt.legend(("Original", "Hidden"))
plt.grid()
plt.show()
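If you want the line to show an actual gap rather than drawing a second curve, another option is to blank out the hidden interval with NaN, which matplotlib simply skips (a sketch using Series.where on the same df):

# replace y inside the hidden x-interval with NaN so the line is interrupted
y_gapped = df.y.where((df.x < lower_bound) | (df.x > upper_bound))
plt.plot(df.x, y_gapped)
plt.show()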
I have a plot with some outliers (wrong measurements):
The base data is good, though. I just want to delete everything that is too far off the "current average". I tried a rolling mean (df.rolling(...).mean()) but with no satisfactory result:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
data = np.genfromtxt('shard_height_plot.csv', delimiter = ',')
df = pd.DataFrame(data)
df.set_index(0, inplace = True)
df2 = df.rolling(20).mean()
plt.plot(df)
plt.plot(df2)
plt.show()
I tried to search the web for a good solution but couldn't find one. It shouldn't be that hard to delete data points that jump through the roof, should it?
Edit:
data file can be downloaded here: https://ufile.io/pviuc
Edit2:
I tackled this problem of too many outliers by improving my data set creation.
The core of it:
# reject the new reading if it differs from the reading before last by more than 30
if abs(D - D_List[-2]) > 30:
    D = D_List[-2]
    D_List.pop()
D_List.append(D)
Basically, what this does is check whether a value changes by more than 30; if so, it deletes the last value and replaces it with the second-to-last. Not very spectacular, but just what I need. I used one of the answers anyway because it is so much prettier. Thank you guys very much.
Let's try using scipy.signal (see the docs):
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy import signal
data = np.genfromtxt('shard_height_plot.csv', delimiter = ',')
df = pd.DataFrame(data)
df.set_index(0, inplace = True)
df2 = df.rolling(20).mean()
# 3rd-order Butterworth low-pass filter; filtfilt runs it forward and backward
# so the smoothed curve has no phase shift
b, a = signal.butter(3, 0.05)
y = signal.filtfilt(b, a, df[1].values)
df3 = pd.DataFrame(y, index=df2.index)
plt.plot(df, alpha=.3)
plt.plot(df2, alpha=.3)
plt.plot(df3)
plt.show()
Output:
Use medfilt:
y = signal.medfilt(df[1].values)
Output:
There are many ways to smooth a curve (rolling mean, GAM, smoothing spline, etc.); my favorite is the Savitzky–Golay method.
It works as follows: a small window around each data point y is fit with a polynomial by least squares, and that polynomial is used to estimate the smoothed value ŷ at that point. The window is then shifted forward by one data point.
import numpy as np
import matplotlib.pyplot as plt
from scipy.signal import savgol_filter
x = np.linspace(0,5,150)
y = np.cos(x) + np.random.random(150) * 0.15
yhat = savgol_filter(y, 49, 3)  # window length 49, polynomial order 3
plt.plot(x,y)
plt.plot(x,yhat, color='red')
plt.show()
Note that a rolling mean can't work in your case with a window as small as 20, since each outlier then has a non-negligible weight (5%) and will always induce a big bias...
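If you prefer to stay in pandas, a rolling median is a more robust relative of the rolling mean, since a single spike inside the window barely moves the median (a sketch on the df from the question):

# rolling median: robust to isolated spikes, unlike the rolling mean
df_median = df.rolling(20, center=True).median()
plt.plot(df, alpha=.3)
plt.plot(df_median)
plt.show()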