I have a dataframe named "df" with a datetime index and four columns:
            A    B    C    D
1/1/2020  0.1  0.3  0.2  0.2
1/2/2020  0.3  0.1  0.3  0.3
1/3/2020  0.2  0.2  0.3  0.1
1/4/2020  0.1  0.1  0.1  0.3
I would like to divide the data into 4 "discretized" quantiles.
If I wanted to do this for column "A", all I would need to do is use pandas' qcut function, as below:
df["A"] = pd.qcut(df["A"], 4)
However, the problem is that I would like to create quantiles for each date, i.e. divide the data into 4 quartiles for each row (NOT each column). How would I do this?
You can use .apply with the axis=1 parameter:
df.apply(lambda x: pd.qcut(x, 4, duplicates='drop'), axis=1)
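For reference, a minimal runnable sketch on the sample data from the question; labels=False is an optional extra (not part of the answer above) that returns integer bin codes instead of Interval objects:
import pandas as pd

# Rebuild the sample dataframe from the question
df = pd.DataFrame(
    {"A": [0.1, 0.3, 0.2, 0.1],
     "B": [0.3, 0.1, 0.2, 0.1],
     "C": [0.2, 0.3, 0.3, 0.1],
     "D": [0.2, 0.3, 0.1, 0.3]},
    index=pd.to_datetime(["1/1/2020", "1/2/2020", "1/3/2020", "1/4/2020"]),
)

# Bin each row's four values into quartiles; duplicates='drop' handles rows with ties
row_quartiles = df.apply(lambda x: pd.qcut(x, 4, labels=False, duplicates="drop"), axis=1)
print(row_quartiles)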
I have a dataframe with 5 columns that have missing values.
How do I fill the missing values with the average of the previous two columns' values at the same position?
Here is the sample code:
import pandas as pd

coh0 = [0.5, 0.3, 0.1, 0.2, 0.2]
coh1 = [0.4, 0.3, 0.6, 0.5]
coh2 = [0.2, 0.2, 0.3]
coh3 = [0.8, 0.8]
coh4 = [0.5]
df = pd.DataFrame({'coh0': pd.Series(coh0), 'coh1': pd.Series(coh1),
                   'coh2': pd.Series(coh2), 'coh3': pd.Series(coh3),
                   'coh4': pd.Series(coh4)})
df
Here is the sample output
   coh0  coh1  coh2  coh3  coh4
0   0.5   0.4   0.2   0.8   0.5
1   0.3   0.3   0.2   0.8   NaN
2   0.1   0.6   0.3   NaN   NaN
3   0.2   0.5   NaN   NaN   NaN
4   0.2   NaN   NaN   NaN   NaN
Here is the desired result I am looking for:
The NaN value in each column should be replaced by the average of the previous two columns' values at the same position. However, the first NaN value in the second column should take the last value of the first column.
For the exception you named, the first NaN, you can do
df.iloc[1, -1] = df.iloc[0, -1]
though it doesn't make a difference in this case, as the mean of 0.2 and 0.8 is 0.5 anyway.
Either way, the rest is something like a rolling window calculation, except that it has to be computed incrementally. Normally you want to vectorize your operations and avoid iterating over the dataframe, but IMHO this is one of the rarer cases where it is actually appropriate to loop over the columns (cf. this excellent post), i.e.,
compute the row-wise (axis=1) mean of up to two columns left of the current one (df.iloc[:, max(0, i-2):i]),
and fill its NaN values from the resulting series.
for i in range(1, df.shape[1]):
    # row-wise mean of up to two columns to the left of column i
    mean_df = df.iloc[:, max(0, i-2):i].mean(axis=1)
    # fill column i's NaNs from that mean
    df.iloc[:, i] = df.iloc[:, i].fillna(mean_df)
which results in
   coh0  coh1  coh2   coh3    coh4
0   0.5   0.4  0.20  0.800  0.5000
1   0.3   0.3  0.20  0.800  0.5000
2   0.1   0.6  0.30  0.450  0.3750
3   0.2   0.5  0.35  0.425  0.3875
4   0.2   0.2  0.20  0.200  0.2000
I have a dataframe like as shown below
import numpy as np
import pandas as pd
np.random.seed(100)
df = pd.DataFrame({'grade': np.random.choice(list('ABCD'), size=(20)),
                   'dash': np.random.choice(list('PQRS'), size=(20)),
                   'dumeel': np.random.choice(list('QWER'), size=(20)),
                   'dumma': np.random.choice((1234), size=(20)),
                   'target': np.random.choice([0, 1], size=(20))
                   })
I would like to do the below:
a) event rate - compute the % occurrence of 1s (from the target column) for each unique value in each of the input categorical columns
b) non-event rate - compute the % occurrence of 0s (from the target column) for each unique value in each of the input categorical columns
I tried the below
input_category_columns = df.select_dtypes(include='object')
df_rate_calc = pd.DataFrame()
for ip in input_category_columns:
    feature, target = ip, 'target'
    df_rate_calc['col_name'] = pd.crosstab(df[feature], df[target], normalize='columns')
I would like to do this on a million rows, so an efficient approach would be really helpful.
I expect my output to be as shown below. I have shown it for only two columns, but I want to produce this output for all categorical columns.
Here is one approach:
Select the categorical columns (cols)
Melt the dataframe with target as id variable and cols as value variables
Group the dataframe and use value_counts to calculate frequency
Unstack to reshape the dataframe
cols = df.select_dtypes('object').columns
df_out = (
    df.melt('target', cols)
      .groupby(['variable', 'target'])['value']
      .value_counts(normalize=True)
      .unstack(1, fill_value=0)
)
print(df_out)
target            0    1
variable value
dash     P      0.4  0.3
         Q      0.2  0.3
         R      0.2  0.1
         S      0.2  0.3
dumeel   E      0.2  0.2
         Q      0.1  0.0
         R      0.4  0.6
         W      0.3  0.2
grade    A      0.4  0.2
         B      0.0  0.2
         C      0.4  0.3
         D      0.2  0.3
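Since the question asks for a "% occurrence", the normalized frequencies can optionally be scaled to percentages; a small follow-up sketch (df_pct is just an illustrative name):
# Optional: express the normalized frequencies as percentages
df_pct = df_out.mul(100).round(1)
print(df_pct)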
Using Pandas, how can I efficiently add a new column that is true/false if the value in one column (x) is between the values in two other columns (low and high)?
The np.select approach from here works perfectly, but I "feel" like there should be a one-liner way to do this.
Using Python 3.7
import numpy as np
import pandas as pd

fid = [0, 1, 2, 3, 4]
x = [0.18, 0.07, 0.11, 0.3, 0.33]
low = [0.1, 0.1, 0.1, 0.1, 0.1]
high = [0.2, 0.2, 0.2, 0.2, 0.2]
test = pd.DataFrame(data=zip(fid, x, low, high), columns=["fid", "x", "low", "high"])
conditions = [(test["x"] >= test["low"]) & (test["x"] <= test["high"])]
labels = ["True"]
test["between"] = np.select(conditions, labels, default="False")
display(test)
As mentioned by @Brebdan, you can use this built-in:
test["between"] = test["x"].between(test["low"], test["high"])
output:
   fid     x  low  high  between
0    0  0.18  0.1   0.2     True
1    1  0.07  0.1   0.2    False
2    2  0.11  0.1   0.2     True
3    3  0.30  0.1   0.2    False
4    4  0.33  0.1   0.2    False
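If you need to control whether the endpoints count as "between", Series.between also accepts an inclusive parameter. A small sketch, assuming pandas >= 1.3 (where inclusive takes the strings 'both', 'neither', 'left', or 'right'; older versions take a boolean):
# Exclude the endpoints instead of including them (pandas >= 1.3 syntax)
test["strictly_between"] = test["x"].between(test["low"], test["high"], inclusive="neither")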
I have the following dataframe:
   actual_credit  min_required_credit
0            0.3                  0.4
1            0.5                  0.2
2            0.4                  0.4
3            0.2                  0.3
I need to add a column indicating where actual_credit >= min_required_credit. The result would be:
   actual_credit  min_required_credit  result
0            0.3                  0.4   False
1            0.5                  0.2    True
2            0.4                  0.4    True
3            0.2                  0.3   False
I am doing the following:
df['result'] = abs(df['actual_credit']) >= abs(df['min_required_credit'])
However, the 3rd row (0.4 and 0.4) constantly results in False. After researching this issue in various places, including What is the best way to compare floats for almost-equality in Python?, I still can't get this to work. Whenever the two columns have an identical value, the result is False, which is not correct.
I am using python 3.3
Due to imprecise float comparison, you can OR your comparison with np.isclose; isclose takes relative and absolute tolerance params, so the following should work:
df['result'] = df['actual_credit'].ge(df['min_required_credit']) | np.isclose(df['actual_credit'], df['min_required_credit'])
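For completeness, a minimal runnable sketch of this approach; the 1 - 0.9 row is an illustrative assumption added to force the float-precision issue (a literal 0.4 vs 0.4 would already compare equal):
import numpy as np
import pandas as pd

# 1 - 0.9 evaluates to 0.09999999999999998, so a plain >= comparison
# against 0.1 is False even though the values are "equal" on paper
df = pd.DataFrame({'actual_credit': [0.3, 0.5, 1 - 0.9, 0.2],
                   'min_required_credit': [0.4, 0.2, 0.1, 0.3]})

# Plain comparison misses the near-equal row
print((df['actual_credit'] >= df['min_required_credit']).tolist())
# [False, True, False, False]

# OR-ing with np.isclose treats near-equal floats as equal
df['result'] = df['actual_credit'].ge(df['min_required_credit']) | np.isclose(
    df['actual_credit'], df['min_required_credit'])
print(df['result'].tolist())
# [False, True, True, False]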
@EdChum's answer works great, but using the pandas.DataFrame.round function is another clean option that works well without the use of numpy.
df = pd.DataFrame(  # adding a small difference at the thousandths place to reproduce the issue
    data=[[0.3, 0.4], [0.5, 0.2], [0.400, 0.401], [0.2, 0.3]],
    columns=['actual_credit', 'min_required_credit'])
df['result'] = df['actual_credit'].round(1) >= df['min_required_credit'].round(1)
print(df)
   actual_credit  min_required_credit  result
0            0.3                0.400   False
1            0.5                0.200    True
2            0.4                0.401    True
3            0.2                0.300   False
You might also consider using round() to edit your dataframe more permanently, depending on whether you need that extra precision or not. In this example, the OP suggests the extra precision is probably just noise that is causing confusion.
df = pd.DataFrame(  # adding a small difference at the thousandths place to reproduce the issue
    data=[[0.3, 0.4], [0.5, 0.2], [0.400, 0.401], [0.2, 0.3]],
    columns=['actual_credit', 'min_required_credit'])
df = df.round(1)
df['result'] = df['actual_credit'] >= df['min_required_credit']
print(df)
   actual_credit  min_required_credit  result
0            0.3                  0.4   False
1            0.5                  0.2    True
2            0.4                  0.4    True
3            0.2                  0.3   False
In general, numpy comparison functions work well with pd.Series and allow for element-wise comparisons:
isclose, allclose, greater, greater_equal, less, less_equal, etc.
In your case greater_equal would do:
df['result'] = np.greater_equal(df['actual_credit'], df['min_required_credit'])
or alternatively, as proposed, using pandas' .ge() (or .le(), .gt(), etc.):
df['result'] = df['actual_credit'].ge(df['min_required_credit'])
The risk of OR-ing .ge() with np.isclose (as mentioned above) is that, e.g., comparing 3.999999999999 and 4.0 might return True, which might not necessarily be what you want.
Use pandas.DataFrame.abs() instead of the built-in abs():
df['result'] = df['actual_credit'].abs() >= df['min_required_credit'].abs()
I have a PySpark DataFrame, df1, that looks like:
Customer1  Customer2  v_cust1  v_cust2
        1          2      0.9      0.1
        1          3      0.3      0.4
        1          4      0.2      0.9
        2          1      0.8      0.8
I want to take the cosine similarity of the two columns and have something like this:
Customer1  Customer2  v_cust1  v_cust2  cosine_sim
        1          2      0.9      0.1         0.1
        1          3      0.3      0.4         0.9
        1          4      0.2      0.9        0.15
        2          1      0.8      0.8           1
I have a Python function that receives a number or an array of numbers, like this:
def cos_sim(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
How can I create the cosine_sim column in my dataframe using a udf?
Can I pass several columns instead of one column to the udf cos_sim function?
It would be more efficient to use a pandas_udf, which performs better at vectorized operations than regular Spark UDFs: Introducing Pandas UDF for PySpark
import numpy as np
from pyspark.sql.functions import PandasUDFType, pandas_udf
import pyspark.sql.functions as F
# Names of columns
a, b = "v_cust1", "v_cust2"
cosine_sim_col = "cosine_sim"
# Make a reserved column to hold the values, since the constraint of a GROUPED_MAP
# pandas_udf is that the input schema and output schema have to remain the same.
df = df.withColumn(cosine_sim_col, F.lit(1.0).cast("double"))
@pandas_udf(df.schema, PandasUDFType.GROUPED_MAP)
def cos_sim(df):
    df[cosine_sim_col] = float(np.dot(df[a], df[b]) / (np.linalg.norm(df[a]) * np.linalg.norm(df[b])))
    return df
# Assuming that you want to groupby Customer1 and Customer2 for arrays
df2 = df.groupby(["Customer1", "Customer2"]).apply(cos_sim)
# But if you want to send entire columns then make a column with the same
# value in all rows and group by it. For e.g.:
df3 = df.withColumn("group", F.lit("group_a")).groupby("group").apply(cos_sim)
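As a side note, on Spark 3.0+ the GROUPED_MAP pandas_udf style shown above is superseded by applyInPandas, which takes a plain (undecorated) function plus an output schema. A minimal sketch, assuming Spark >= 3.0:
# Assumes Spark >= 3.0; cos_sim_plain is the same function body as above,
# just without the @pandas_udf decorator
def cos_sim_plain(pdf):
    pdf[cosine_sim_col] = float(np.dot(pdf[a], pdf[b]) / (np.linalg.norm(pdf[a]) * np.linalg.norm(pdf[b])))
    return pdf

df2 = df.groupby("Customer1", "Customer2").applyInPandas(cos_sim_plain, schema=df.schema)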