Quantile values for each column in dataframe - python

I have a dataframe which consists of columns of numbers. I am trying to calculate the decile rank values for each column. The following code gives me the values for the dataframe as a whole. How can I do it column by column?
pd.qcut(df, 10, labels=False)
Thanks.

If you apply qcut to each column, you will get a dataframe where each entry is that value's decile rank within its column.
import numpy as np
import pandas as pd

data_a = np.random.random(100)
data_b = 100 * np.random.random(100)
df = pd.DataFrame(columns=['A', 'B'], data=list(zip(data_a, data_b)))

# axis=0 applies pd.qcut to each column independently; with q=10 and
# labels=False every entry becomes its column's decile rank (0-9).
rank = df.apply(pd.qcut, axis=0, q=10, labels=False)
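As a quick sanity check of the result, each column now holds its own 0-9 decile rank, and with 100 rows per column every decile should contain exactly ten entries:

# Just a usage sketch: verify the per-column ranking.
print(rank.head())
print(rank['A'].value_counts().sort_index())  # ten values in each decile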

Related

Normalization row by row in data set

I'm trying to normalize a dataset to [-1, +1], and this code I wrote can normalize column by column. Could you tell me how to normalize row by row?
from sklearn import preprocessing
import pandas as pd

df = pd.read_csv('/-----.csv')
df_max_scaled = df.copy()
for column in df.columns:
    df_max_scaled[column] = df_max_scaled[column] / df_max_scaled[column].abs().max()
You could use apply with axis=1, which will process the DataFrame row by row:
df.apply(lambda x: x/x.abs().max(), axis=1)
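A minimal self-contained sketch of the same idea (the data here is made up, since the original CSV isn't available):

import numpy as np
import pandas as pd

# Hypothetical stand-in for the CSV: 5 rows x 4 columns of signed values.
df = pd.DataFrame(np.random.uniform(-10, 10, size=(5, 4)), columns=list('ABCD'))

# axis=1 hands each row to the lambda as a Series, so the max is taken
# within the row; every row then lies in [-1, 1].
normalized = df.apply(lambda x: x / x.abs().max(), axis=1)
print(normalized.abs().max(axis=1))  # each row's largest magnitude is 1.0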

Pandas dynamic rolling on the dataframe

I have a pandas dataframe:
import yfinance as yf
import numpy as np
from scipy.signal import argrelextrema
import matplotlib.pyplot as plt
import pandas as pd

n = 2
df = yf.Ticker("INFY.NS").history(period='400d', interval='1D')
# Mark local maxima of Close (highest within a window of n) in a new 'max' column.
df['max'] = df.iloc[argrelextrema(df['Close'].values, np.greater_equal, order=n)[0]]['Close']
print(df)
I have created a column named max which has values as shown in the screenshot. The screenshot is only for reference; sample data can be obtained by running the code above.
I want to compare the max values (those which are non-NaN) with each other, but only in the forward direction.
For example, 777.244202 would be compared with all later values of the "max" column which are higher than 777.244202, and I want to print those rows that form a .618 Fibonacci retracement with 777.244202.
Is there any simpler method in pandas that can do this?
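Not as a single built-in that I know of, but one reading of the question is a pairwise forward scan over the non-NaN peaks. A minimal sketch, assuming "having a .618 retracement" means the earlier peak sits near 61.8% of a later, higher peak (the helper name, the ratio test and the tolerance are all assumptions, not taken from the question):

import numpy as np
import pandas as pd

def forward_fib_pairs(max_series, ratio=0.618, tol=0.01):
    # For each non-NaN peak, look only at later, higher peaks and keep
    # the pairs whose price ratio is close to the Fibonacci ratio.
    peaks = max_series.dropna()
    hits = []
    for i, (t0, p0) in enumerate(peaks.items()):
        later = peaks.iloc[i + 1:]
        higher = later[later > p0]
        # p0 / p1 ~ 0.618 means p0 sits at the 61.8% level of p1.
        matches = higher[np.isclose(p0 / higher, ratio, atol=tol)]
        for t1, p1 in matches.items():
            hits.append((t0, p0, t1, p1))
    return pd.DataFrame(hits, columns=['date_a', 'peak_a', 'date_b', 'peak_b'])

pairs = forward_fib_pairs(df['max'])
print(pairs)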

Interpolating from multiple dataframes

I have two timeseries dataframes, where xp holds, say, the x coordinates of the data in fp.
I want to interpolate the values from the xp/fp combinations per date for a fixed set of x values, so the resulting output is a timeseries dataframe with the same datetime index as xp and fp, and number of columns equal to the number of elements in x.
I have tried to use numpy.interp() but end up with ValueError: object too deep for desired array.
import pandas as pd
import numpy as np

fp = pd.DataFrame(
    data=np.random.randint(0, 100, size=(10, 4)),
    index=pd.date_range("20180101", periods=10),
    columns=list('ABCD'),
)
xp = pd.DataFrame(
    data=np.column_stack([
        list(range(1, 11)),
        list(range(70, 80)),
        list(range(150, 160)),
        list(range(220, 230)),
    ]),
    index=pd.date_range("20180101", periods=10),
    columns=list('ABCD'),
)
x = [60, 120, 180]
x_interp = np.interp(x, xp, fp)
It seems like np.interp can't take dataframes as input, but it sounds like this is the fastest way for me to do it for a large dataset (of >3000 xp and fp rows).
Would appreciate any pointers.
UPDATE
Found a way of doing what I wanted, as below:
x_interp = pd.DataFrame.from_records(
    fp.index.to_series().apply(lambda z: np.interp(x, xp.loc[z], fp.loc[z])).values,
    index=fp.index,
)
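Equivalently, and arguably easier to read, the same per-date interpolation can be written as a plain loop over the shared index (a sketch of the same approach, not a different method):

# One np.interp call per date, then reassemble with the original index.
rows = [np.interp(x, xp.loc[d], fp.loc[d]) for d in fp.index]
x_interp = pd.DataFrame(rows, index=fp.index, columns=x)

Note that np.interp assumes each row of xp is increasing, which holds for the sample data above.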

Pandas DataFrame shows NaN

I have an HDF5 file that I have moved into a DataFrame, but the problem is that when I want to plot it, nothing shows on the graph. I checked the new dataframe and saw there was nothing in it.
This is my DF (I'm not allowed to post pics, so please click the link).
df1 = pd.DataFrame(df.Price, index = df.Timestamp)
plt.figure()
df1.plot()
plt.show()
The second DF shows NaN in the Price column. What's wrong?
I think you need set_index with the column Timestamp, then select the column Price and plot:
# convert the column to floats
df['Price'] = df['Price'].astype(float)
df.set_index('Timestamp')['Price'].plot()

# if there is some non-numeric data, convert it to NaNs instead
df['Price'] = pd.to_numeric(df['Price'], errors='coerce')
df.set_index('Timestamp')['Price'].plot()
You get NaNs when using the DataFrame constructor because the data are not aligned: the index of df is not the same as the Timestamp column.
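A quick way to see that misalignment (a sketch, assuming df still has its default integer index as in the question):

# The constructor reindexes the Price Series by the Timestamp values;
# those timestamps do not appear in df's RangeIndex, so every lookup
# misses and the column comes out all NaN.
bad = pd.DataFrame(df.Price, index=df.Timestamp)
print(bad['Price'].isna().all())  # True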
You can fix this by adding .values. Or how about creating a Series instead?
#df1 = pd.DataFrame(df.Price.values, df.Timestamp)
serie = pd.Series(df.Price.values, df.Timestamp)
Saw it was answered here: pandas.Series() Creation using DataFrame Columns returns NaN Data entries
Full example:
import pandas as pd
import numpy as np
import datetime
import matplotlib.pyplot as plt

df = pd.DataFrame(columns=["Price", "Timestamp", "Random"])
df.Price = np.random.randint(100, size=10)
df.Timestamp = [datetime.datetime(2000, 1, 1) + datetime.timedelta(days=int(i))
                for i in np.random.randint(100, size=10)]
df.Random = np.random.randint(10, size=10)

serie = pd.Series(df.Price.values, df.Timestamp)
serie.plot()
plt.show()
Difference
print("{}\n{}".format(type(df.Price), type(df.Price.values)))
<class 'pandas.core.series.Series'>  # does not work (gets realigned by index)
<class 'numpy.ndarray'>              # works (purely positional, nothing to align)

Bin pandas dataframe by integer values

I have a pandas dataframe and I want to bin the data by the values of a single column: e.g. 0-0.1, 0.1-0.2, etc., starting at 0 and ending at 1, with intervals of 0.1, taking the mean of each column within each bin.
I'm attempting to accomplish this using the .groupby functionality of pandas. See my code below:
import pandas as pd
import numpy as np

my_df = pd.DataFrame({"a": np.random.random(100),
                      "b": np.random.random(100),
                      "id": np.arange(100)})
bins = np.linspace(0, 1, 0.1)
groups = my_df.groupby(np.digitize(my_df.a, bins))
binned_data = groups.mean()
print(binned_data)
The print line then gives a single row with index "1", even though the data of column "a" should have a range of values for the bins specified.
I think it's a problem with the creation of "bins", but I can't work out what.
I want 10 rows binned from 0 to 1 in 0.1 intervals. How can I accomplish this?
Many thanks.
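For what it's worth, a minimal sketch of one likely fix: np.linspace's third argument is the number of points, not the step size, so np.linspace(0, 1, 0.1) does not produce the intended edges. Eleven points give the edges 0.0, 0.1, ..., 1.0, i.e. ten bins:

import numpy as np
import pandas as pd

my_df = pd.DataFrame({"a": np.random.random(100),
                      "b": np.random.random(100),
                      "id": np.arange(100)})

# 11 edges -> 10 bins of width 0.1; np.arange(0, 1.1, 0.1) would also work.
bins = np.linspace(0, 1, 11)
groups = my_df.groupby(np.digitize(my_df.a, bins))
binned_data = groups.mean()
print(binned_data)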
