Create Max and Min column values from a single column value pandas - python

I have a dataframe like the one below and I need to create two columns out of the base column.
Input
Kg
0.5
0.5
1
1
1
2
2
5
5
5
Expected Output
Kg_From Kg_To
0 0.5
0 0.5
0.5 1
0.5 1
0.5 1
1 2
1 2
2 5
2 5
2 5
How can this be done in pandas?

Assuming your Kg column is sorted:
s = df["Kg"].unique()
# map each unique value to the previous unique value; the smallest has no predecessor and gets 0
df["Kg_from"] = df["Kg"].map({k: v for k, v in zip(s[1:], s)}).fillna(0)
print (df)
Kg Kg_from
0 0.5 0.0
1 0.5 0.0
2 1.0 0.5
3 1.0 0.5
4 1.0 0.5
5 2.0 1.0
6 2.0 1.0
7 5.0 2.0
8 5.0 2.0
9 5.0 2.0
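If you also need the Kg_To column from the expected output, renaming the original column afterwards (an extra line, not part of the answer above) completes it:
df = df.rename(columns={"Kg": "Kg_To"})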

import numpy as np

# get the unique (sorted) values and the count of each value in the Kg column
# (this assumes the Kg column itself is sorted, as in the example)
val, counts = np.unique(df.Kg, return_counts=True)
# shift forward by 1 and replace the first value with 0
val = np.roll(val, 1)
val[0] = 0
# repeat each shifted value according to the counts computed above
df['Kg_from'] = np.repeat(val, counts)
df
Kg Kg_from
0 0.5 0.0
1 0.5 0.0
2 1.0 0.5
3 1.0 0.5
4 1.0 0.5
5 2.0 1.0
6 2.0 1.0
7 5.0 2.0
8 5.0 2.0
9 5.0 2.0

Build a mapping with zip and dict: take the unique sorted values from np.unique, prepend a 0 with np.insert, zip the two together, and create the new column with DataFrame.insert:
df = df.rename(columns={'Kg':'Kg_To'})
a = np.unique(df["Kg_To"])
df.insert(0, 'Kg_from', df['Kg_To'].map(dict(zip(a, np.insert(a, 0, 0)))))
print (df)
Kg_from Kg_To
0 0.0 0.5
1 0.0 0.5
2 0.5 1.0
3 0.5 1.0
4 0.5 1.0
5 1.0 2.0
6 1.0 2.0
7 2.0 5.0
8 2.0 5.0
9 2.0 5.0

Code:
kgs = df.Kg.unique()
lower = [0] + list(kgs[:-1])
kg_dict = dict(zip(kgs, lower))  # map each Kg value to the previous unique value
# new dataframe
new_df = pd.DataFrame({
    'Kg_From': df['Kg'].map(kg_dict),
    'Kg_To': df['Kg']
})
# or if you want new columns:
df['Kg_from'] = df['Kg'].map(kg_dict)
Output:
Kg_From Kg_To
0 0.0 0.5
1 0.0 0.5
2 0.5 1.0
3 0.5 1.0
4 0.5 1.0
5 1.0 2.0
6 1.0 2.0
7 2.0 5.0
8 2.0 5.0
9 2.0 5.0
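For completeness, a pandas-only sketch of the same idea as the answers above: map each Kg value to the previous unique value in sorted order (the variable names here are my own):
uniq = df['Kg'].drop_duplicates().sort_values()
df['Kg_From'] = df['Kg'].map(dict(zip(uniq, uniq.shift().fillna(0))))
df['Kg_To'] = df['Kg']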

Related

Pandas apply function to column taking the value of previous column

I have to create a timeseries using column values for computing the Recency of a customer.
The formula I have to use is R(t) = 0 if the customer has bought something in that month, R(t-1) + 1 otherwise.
I managed to compute a dataframe
CustomerID -1 0 1 2 3 4 5 6 7 8 9 10 11 12
0 17850 0 0.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
1 13047 0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 1.0 0.0 1.0 1.0
2 12583 0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 14688 0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
4 15311 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3750 15471 0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0
3751 13436 0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0
3752 15520 0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0
3753 14569 0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0
3754 12713 0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0
In it there is a 0 if the customer bought something in that month and a 1 otherwise. The column names indicate a time period, with the column "-1" as a dummy column.
How can I replace the value in each column with 0 if the current value is 0 and with the value of the previous column + 1 otherwise?
For example, the final result for the second customer should be 0 1 0 0 1 0 0 1 0 1 0 1 2
I know how to apply a function to a column, but I don't know how to make that function use the value from the previous column.
Use the apply function to iterate over the rows of the dataframe and do the manipulation, carrying the previously computed value along so that consecutive months without a purchase keep accumulating:
def apply_function(row):
    out = [row.iloc[0]]  # keep the first (CustomerID) column as-is
    for item in row.iloc[1:]:
        # 0 on a purchase month, previous computed recency + 1 otherwise
        out.append(0 if item == 0 else out[-1] + 1)
    return out
new_df = df.apply(apply_function, axis=1, result_type='expand')
new_df.columns = df.columns  # just to restore the previous column names
Do you insist on using the column structure? It is common with time series to use rows, e.g., a dataframe with columns CustomerID, hasBoughtThisMonth. You can then easily add the Recency column by using a pandas transform().
I cannot post comments yet, hence the question phrased this way.
Edit: here is another way to go about it. I took two customers as an example, with some random numbers for whether or not they bought something in a given month.
Basically, you put your table in long format and use a groupby + cumsum to get your result. Notice that this approach avoids your dummy column.
import pandas as pd
import numpy as np
np.random.seed(1)
# Make example dataframe
df = pd.DataFrame({'CustomerID': [1]*12 + [2]*12,
                   'Month': [1,2,3,4,5,6,7,8,9,10,11,12]*2,
                   'hasBoughtThisMonth': np.random.randint(2, size=24)})
# Make Recency column by finding contiguous groups of ones, and groupby
contiguous_groups = df['hasBoughtThisMonth'].diff().ne(0).cumsum()
df['Recency'] = df.groupby(by=['CustomerID', contiguous_groups],
                           as_index=False)['hasBoughtThisMonth'].cumsum().reset_index(drop=True)
The result is
CustomerID Month hasBoughtThisMonth Recency
0 1 1 1 1
1 1 2 1 2
2 1 3 0 0
3 1 4 0 0
4 1 5 1 1
5 1 6 1 2
6 1 7 1 3
7 1 8 1 4
8 1 9 1 5
9 1 10 0 0
10 1 11 0 0
11 1 12 1 1
12 2 1 0 0
13 2 2 1 1
14 2 3 1 2
15 2 4 0 0
16 2 5 0 0
17 2 6 1 1
18 2 7 0 0
19 2 8 0 0
20 2 9 0 0
21 2 10 1 1
22 2 11 0 0
23 2 12 0 0
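If you need to get back to the wide layout of the question afterwards, a pivot would do it (a sketch; recency_wide is my own name):
recency_wide = df.pivot(index='CustomerID', columns='Month', values='Recency')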
It would be easier if you first set CustomerID as the index and transpose your dataframe, then apply your custom function, i.e. something like:
df.T.apply(custom_func)
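A minimal sketch of that idea, assuming the wide dataframe from the question and a small helper written only for this illustration:
import pandas as pd

# assume `df` is the wide dataframe from the question, one row per customer
wide = df.set_index('CustomerID').T          # months become the rows, customers the columns

def recency(col):
    # 0 on a purchase month, previous computed value + 1 otherwise
    out = []
    for item in col:
        out.append(0 if item == 0 else (out[-1] if out else 0) + 1)
    return pd.Series(out, index=col.index)

result = wide.apply(recency).T               # back to one row per customer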

Find values from other dataframe and assign to original dataframe

Having input dataframe:
x_1 x_2
0 0.0 0.0
1 1.0 0.0
2 2.0 0.2
3 2.5 1.5
4 1.5 2.0
5 -2.0 -2.0
and additional dataframe as follows:
index x_1_x x_2_x x_1_y x_2_y value dist dist_rank
0 0 0.0 0.0 0.1 0.1 5.0 0.141421 2.0
4 0 0.0 0.0 1.5 1.0 -2.0 1.802776 3.0
5 0 0.0 0.0 0.0 0.0 3.0 0.000000 1.0
9 1 1.0 0.0 0.1 0.1 5.0 0.905539 1.0
11 1 1.0 0.0 2.0 0.4 3.0 1.077033 3.0
14 1 1.0 0.0 0.0 0.0 3.0 1.000000 2.0
18 2 2.0 0.2 0.1 0.1 5.0 1.902630 3.0
20 2 2.0 0.2 2.0 0.4 3.0 0.200000 1.0
22 2 2.0 0.2 1.5 1.0 -2.0 0.943398 2.0
29 3 2.5 1.5 2.0 0.4 3.0 1.208305 3.0
30 3 2.5 1.5 2.5 2.5 4.0 1.000000 1.0
31 3 2.5 1.5 1.5 1.0 -2.0 1.118034 2.0
38 4 1.5 2.0 2.0 0.4 3.0 1.676305 3.0
39 4 1.5 2.0 2.5 2.5 4.0 1.118034 2.0
40 4 1.5 2.0 1.5 1.0 -2.0 1.000000 1.0
45 5 -2.0 -2.0 0.1 0.1 5.0 2.969848 2.0
46 5 -2.0 -2.0 1.0 -2.0 6.0 3.000000 3.0
50 5 -2.0 -2.0 0.0 0.0 3.0 2.828427 1.0
I want to create new columns in the input dataframe based on the additional dataframe: for each row, and for each dist_rank, it should pull in the corresponding x_1_y, x_2_y and value (matched on index).
I tried the following lines:
df['value_dist_rank1'] = result.loc[result['dist_rank']==1.0, 'value']
df['value_dist_rank1'] = result[result['dist_rank']==1.0]['value']
but both gave the same output:
x_1 x_2 value_dist_rank1
0 0.0 0.0 NaN
1 1.0 0.0 NaN
2 2.0 0.2 NaN
3 2.5 1.5 NaN
4 1.5 2.0 NaN
5 -2.0 -2.0 3.0
Here is a way to do it :
(For the sake of clarity I consider the input df as df1 and the additional df as df2)
# First we group df2 by index, to collect all the column information for each index on one line
df2 = df2.groupby('index').agg(lambda x: list(x)).reset_index()
# Then we expand each list into three columns, since there are always three rows per index
columns = ['dist_rank', 'value', 'x_1_y', 'x_2_y']
column_to_add = ['value', 'x_1_y', 'x_2_y']
for index, row in df2.iterrows():
for i in range(3):
column_names = ["{}_dist_rank{}".format(x, row.dist_rank[i])[:-2] for x in column_to_add]
values = [row[x][i] for x in column_to_add]
for column, value in zip(column_names, values):
df2.loc[index, column] = value
# We drop the columns that are not useful :
df2.drop(columns=columns+['dist', 'x_1_x', 'x_2_x'], inplace = True)
# Finally we merge the modified df with our initial dataframe :
result = df1.merge(df2, left_index=True, right_on='index', how='left')
Output :
x_1 x_2 index value_dist_rank2 x_1_y_dist_rank2 x_2_y_dist_rank2 \
0 0.0 0.0 0 5.0 0.1 0.1
1 1.0 0.0 1 3.0 0.0 0.0
2 2.0 0.2 2 -2.0 1.5 1.0
3 2.5 1.5 3 -2.0 1.5 1.0
4 1.5 2.0 4 4.0 2.5 2.5
5 -2.0 -2.0 5 5.0 0.1 0.1
value_dist_rank3 x_1_y_dist_rank3 x_2_y_dist_rank3 value_dist_rank1 \
0 -2.0 1.5 1.0 3.0
1 3.0 2.0 0.4 5.0
2 5.0 0.1 0.1 3.0
3 3.0 2.0 0.4 4.0
4 3.0 2.0 0.4 -2.0
5 6.0 1.0 -2.0 3.0
x_1_y_dist_rank1 x_2_y_dist_rank1
0 0.0 0.0
1 0.1 0.1
2 2.0 0.4
3 2.5 2.5
4 1.5 1.0
5 0.0 0.0
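A shorter route to a similar result (a sketch, starting again from the original additional dataframe df2 and input df1 as named in the answer) is to unstack df2 on dist_rank and join it onto df1:
wide = (df2.set_index(['index', 'dist_rank'])[['value', 'x_1_y', 'x_2_y']]
           .unstack('dist_rank'))
wide.columns = ['{}_dist_rank{}'.format(col, int(rank)) for col, rank in wide.columns]
result = df1.join(wide)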

Concatenating crosstabs of different variables

I have a Pandas (0.23.4) DataFrame with several categorical columns.
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.choice([True, False, np.nan], (6, 4)), columns=['a', 'b', 'c', 'd'])
a b c d
0 NaN 1.0 NaN NaN
1 NaN 1.0 NaN 0.0
2 1.0 NaN 1.0 NaN
3 0.0 NaN 0.0 1.0
4 NaN 1.0 NaN NaN
5 NaN 1.0 0.0 1.0
I have two sets of columns of interest:
cross_cols = ['a', 'b']
type_cols = ['c', 'd']
I would like to get a cross tab of counts of each cross_col variable with each type_col variable (a with c and d, and b with c and d), excluding NaN, all displayed side-by-side. The desired result is:
c d
0.0 1.0 All 0.0 1.0 All
a 0.0 0 0 0 1 1 2
1.0 2 1 3 1 0 1
All 2 1 3 2 1 3
b 0.0 0 0 0 0 1 1
1.0 2 1 3 2 0 2
All 2 1 3 2 1 3
Notice that I am not interested in counts for different combinations of a and b or of c and d, which is what I'm getting by changing the index and columns parameters of pd.crosstab.
Currently I'm using the following code:
cross_rows = []
for col in cross_cols:
cross_rows.append(pd.concat([pd.crosstab(df[col], df[type_var],margins=True) for type_var in type_cols],axis=1,keys = type_cols,sort=True))
results = pd.concat(cross_rows, keys = cross_cols,sort=True)
It gives the following result:
c d
c 0.0 1.0 All 0.0 1.0 All
a 1.0 2.0 1.0 3.0 1 0 1
All 2.0 1.0 3.0 2 1 3
0.0 NaN NaN NaN 1 1 2
b 1.0 2.0 1.0 3.0 2 0 2
All 2.0 1.0 3.0 2 1 3
0.0 NaN NaN NaN 0 1 1
The result is fine, but the code is slow and a bit ugly. I suspect that there's a faster and more Pythonic approach. Is there a single function call that would get the job done, or another faster solution?
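Not necessarily faster, but the double loop can at least be compressed into one nested comprehension that produces the same result as above (a sketch, not the desired layout):
results = pd.concat(
    [pd.concat([pd.crosstab(df[c], df[t], margins=True) for t in type_cols],
               axis=1, keys=type_cols, sort=True)
     for c in cross_cols],
    keys=cross_cols, sort=True)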

new Pandas Dataframe column calculated from other column values

How can I create a new column in a dataframe that consists of the MEAN of an indexed range of values in that row?
example:
1 2 3 JUNK
0 0.0 0.0 0.0 A
1 1.0 1.0 -1.0 B
2 2.0 2.0 1.0 C
the JUNK column would be ignored when trying to determine the MEAN column
expected output:
1 2 3 JUNK MEAN
0 0.0 0.0 0.0 A 0.0
1 1.0 1.0 -1.0 B 0.33
2 2.0 2.0 1.0 C 1.66
Use drop to remove the unwanted column, or iloc to filter out unnecessary columns by position:
df['MEAN'] = df.drop('JUNK', axis=1).mean(axis=1)
df['MEAN'] = df.iloc[:, :-1].mean(axis=1)
print (df)
1 2 3 JUNK MEAN
0 0.0 0.0 0.0 A 0.000000
1 1.0 1.0 -1.0 B 0.333333
2 2.0 2.0 1.0 C 1.666667
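If the numeric columns are not always in the same positions, selecting them by dtype is another option (a sketch; run it before the MEAN column exists, otherwise MEAN itself would be included):
df['MEAN'] = df.select_dtypes(include='number').mean(axis=1)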

Greedy most diverse subset of pandas dataframe

This is my dataset:
import pandas as pd
import itertools
A = ['A','B','C']
M = ['1','2','3']
F = ['plus','minus','square']
df = pd.DataFrame(list(itertools.product(A,M,F)), columns=['A','M','F'])
print(df)
A M F
0 A 1 plus
1 A 1 minus
2 A 1 square
3 A 2 plus
4 A 2 minus
5 A 2 square
I want to get the top-n rows (a subset) of that dataframe which are maximally diverse.
To compute diversity, I use 1 - Jaccard similarity.
def jaccard(a, b):
    # a and b are sets; returns the Jaccard similarity
    c = a.intersection(b)
    return float(len(c)) / (len(a) + len(b) - len(c))
Using dataframe operations, I can take the cartesian product of the dataframe with apply, compute the diversity value of each pair, and get the most diverse counterpart of each row with df.idxmax(axis=1). But this way I have to compute the diversity values of all pairs first, which is not efficient.
0 1 2 3 4 5 6 7 8 9 10
0 0.0 1.0 0.8 0.5 0.5 0.8 0.5 1.0 0.8 0.8 0.8
1 0.0 0.0 1.0 0.8 1.0 0.8 1.0 0.8 0.8 0.8 0.8
2 0.0 0.0 0.0 1.0 0.5 1.0 0.5 0.8 0.8 1.0 1.0
3 0.0 0.0 0.0 0.0 0.8 0.8 0.8 0.8 0.5 0.8 0.5
4 0.0 0.0 0.0 0.0 0.0 0.8 0.8 1.0 0.5 1.0 0.8
df.idxmax(axis=1).sample(4)
5 6
2 3
0 1
8 9
dtype: int64
I want to implement this algorithm, but I did not understand lines 6 and 7.
How do I compute the argmax here? And why does line 10 return Sk when Sk is never initialized inside the loop?
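The linked pseudocode is not reproduced here, but a greedy max-min selection is commonly written along these lines; this is only a sketch reusing the df and jaccard defined above (the starting row and the function name are my own choices), where the argmax is taken over the rows not yet selected:
def greedy_diverse(df, n):
    # greedy max-min selection: start from the first row, then repeatedly add the
    # candidate whose smallest (1 - jaccard) distance to the selected rows is largest
    rows = [set(r) for r in df.itertuples(index=False)]
    selected = [0]
    while len(selected) < n:
        remaining = [i for i in range(len(rows)) if i not in selected]
        best = max(remaining,
                   key=lambda i: min(1 - jaccard(rows[i], rows[j]) for j in selected))
        selected.append(best)
    return df.iloc[selected]

subset = greedy_diverse(df, 4)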
