Create Max and Min column values from a single column value pandas - python

I have a dataframe like the one below and I need to create two columns out of the base column.
Input
Kg
0.5
0.5
1
1
1
2
2
5
5
5
Expected Output
Kg_From Kg_To
0 0.5
0 0.5
0.5 1
0.5 1
0.5 1
1 2
1 2
2 5
2 5
2 5
How can this be done in pandas?

Assuming your Kg column is sorted:
s = df["Kg"].unique()
# map each unique value to the previous unique value; the smallest has no predecessor and gets 0
df["Kg_from"] = df["Kg"].map({k: v for k, v in zip(s[1:], s)}).fillna(0)
print (df)
Kg Kg_from
0 0.5 0.0
1 0.5 0.0
2 1.0 0.5
3 1.0 0.5
4 1.0 0.5
5 2.0 1.0
6 2.0 1.0
7 5.0 2.0
8 5.0 2.0
9 5.0 2.0
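If you also need the Kg_To column from the expected output, renaming the original column afterwards (an extra line, not part of the answer above) completes it:
df = df.rename(columns={"Kg": "Kg_To"})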

import numpy as np

# get the unique (sorted) values and the count of each value in the Kg column
# (this assumes the Kg column itself is sorted, as in the example)
val, counts = np.unique(df.Kg, return_counts=True)
# shift forward by 1 and replace the first value with 0
val = np.roll(val, 1)
val[0] = 0
# repeat each shifted value according to the counts computed above
df['Kg_from'] = np.repeat(val, counts)
df
Kg Kg_from
0 0.5 0.0
1 0.5 0.0
2 1.0 0.5
3 1.0 0.5
4 1.0 0.5
5 2.0 1.0
6 2.0 1.0
7 5.0 2.0
8 5.0 2.0
9 5.0 2.0

Build a mapping with zip and dict: take the unique sorted values from np.unique, prepend a 0 with np.insert, zip the two together, and create the new column with DataFrame.insert:
df = df.rename(columns={'Kg':'Kg_To'})
a = np.unique(df["Kg_To"])
df.insert(0, 'Kg_from', df['Kg_To'].map(dict(zip(a, np.insert(a, 0, 0)))))
print (df)
Kg_from Kg_To
0 0.0 0.5
1 0.0 0.5
2 0.5 1.0
3 0.5 1.0
4 0.5 1.0
5 1.0 2.0
6 1.0 2.0
7 2.0 5.0
8 2.0 5.0
9 2.0 5.0

Code:
kgs = df.Kg.unique()
lower = [0] + list(kgs[:-1])
kg_dict = dict(zip(kgs, lower))  # map each Kg value to the previous unique value
# new dataframe
new_df = pd.DataFrame({
    'Kg_From': df['Kg'].map(kg_dict),
    'Kg_To': df['Kg']
})
# or if you want new columns:
df['Kg_from'] = df['Kg'].map(kg_dict)
Output:
Kg_From Kg_To
0 0.0 0.5
1 0.0 0.5
2 0.5 1.0
3 0.5 1.0
4 0.5 1.0
5 1.0 2.0
6 1.0 2.0
7 2.0 5.0
8 2.0 5.0
9 2.0 5.0
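For completeness, a pandas-only sketch of the same idea as the answers above: map each Kg value to the previous unique value in sorted order (the variable names here are my own):
uniq = df['Kg'].drop_duplicates().sort_values()
df['Kg_From'] = df['Kg'].map(dict(zip(uniq, uniq.shift().fillna(0))))
df['Kg_To'] = df['Kg']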

Related

Pandas apply function to column taking the value of previous column

I have to create a timeseries using column values for computing the Recency of a customer.
The formula I have to use is R(t) = 0 if the customer has bought something in that month, R(t-1) + 1 otherwise.
I managed to compute a dataframe
CustomerID -1 0 1 2 3 4 5 6 7 8 9 10 11 12
0 17850 0 0.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
1 13047 0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 1.0 0.0 1.0 1.0
2 12583 0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 14688 0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
4 15311 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3750 15471 0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0
3751 13436 0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0
3752 15520 0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0
3753 14569 0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0
3754 12713 0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0
In it there is a 0 if the customer bought something in that month and a 1 otherwise. The column names indicate a time period, with the column "-1" as a dummy column.
How can I replace the value in each column with 0 if the current value is 0 and with the value of the previous column + 1 otherwise?
For example, the final result for the second customer should be 0 1 0 0 1 0 0 1 0 1 0 1 2
I know how to apply a function to a column, but I don't know how to make that function use the value from the previous column.
Use the apply function to iterate over the rows of the dataframe and do the manipulation, carrying the previously computed value along so that consecutive months without a purchase keep accumulating:
def apply_function(row):
    out = [row.iloc[0]]  # keep the first (CustomerID) column as-is
    for item in row.iloc[1:]:
        # 0 on a purchase month, previous computed recency + 1 otherwise
        out.append(0 if item == 0 else out[-1] + 1)
    return out
new_df = df.apply(apply_function, axis=1, result_type='expand')
new_df.columns = df.columns  # just to restore the previous column names
Do you insist on using the column structure? It is common with time series to use rows, e.g., a dataframe with columns CustomerID, hasBoughtThisMonth. You can then easily add the Recency column by using a pandas transform().
I cannot post comments yet, hence the question phrased this way.
Edit: here is another way to go about it. I took two customers as an example, with some random numbers for whether or not they bought something in a given month.
Basically, you put your table in long format and use a groupby + cumsum to get your result. Notice that this approach avoids your dummy column.
import pandas as pd
import numpy as np
np.random.seed(1)
# Make example dataframe
df = pd.DataFrame({'CustomerID': [1]*12 + [2]*12,
                   'Month': [1,2,3,4,5,6,7,8,9,10,11,12]*2,
                   'hasBoughtThisMonth': np.random.randint(2, size=24)})
# Make Recency column by finding contiguous groups of ones, and groupby
contiguous_groups = df['hasBoughtThisMonth'].diff().ne(0).cumsum()
df['Recency'] = df.groupby(by=['CustomerID', contiguous_groups],
                           as_index=False)['hasBoughtThisMonth'].cumsum().reset_index(drop=True)
The result is
CustomerID Month hasBoughtThisMonth Recency
0 1 1 1 1
1 1 2 1 2
2 1 3 0 0
3 1 4 0 0
4 1 5 1 1
5 1 6 1 2
6 1 7 1 3
7 1 8 1 4
8 1 9 1 5
9 1 10 0 0
10 1 11 0 0
11 1 12 1 1
12 2 1 0 0
13 2 2 1 1
14 2 3 1 2
15 2 4 0 0
16 2 5 0 0
17 2 6 1 1
18 2 7 0 0
19 2 8 0 0
20 2 9 0 0
21 2 10 1 1
22 2 11 0 0
23 2 12 0 0
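If you need to get back to the wide layout of the question afterwards, a pivot would do it (a sketch; recency_wide is my own name):
recency_wide = df.pivot(index='CustomerID', columns='Month', values='Recency')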
It would be easier if you first set CustomerID as the index and transpose your dataframe, then apply your custom function, i.e. something like:
df.T.apply(custom_func)
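A minimal sketch of that idea, assuming the wide dataframe from the question and a small helper written only for this illustration:
import pandas as pd

# assume `df` is the wide dataframe from the question, one row per customer
wide = df.set_index('CustomerID').T          # months become the rows, customers the columns

def recency(col):
    # 0 on a purchase month, previous computed value + 1 otherwise
    out = []
    for item in col:
        out.append(0 if item == 0 else (out[-1] if out else 0) + 1)
    return pd.Series(out, index=col.index)

result = wide.apply(recency).T               # back to one row per customer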

Find values from other dataframe and assign to original dataframe

Having input dataframe:
x_1 x_2
0 0.0 0.0
1 1.0 0.0
2 2.0 0.2
3 2.5 1.5
4 1.5 2.0
5 -2.0 -2.0
and additional dataframe as follows:
index x_1_x x_2_x x_1_y x_2_y value dist dist_rank
0 0 0.0 0.0 0.1 0.1 5.0 0.141421 2.0
4 0 0.0 0.0 1.5 1.0 -2.0 1.802776 3.0
5 0 0.0 0.0 0.0 0.0 3.0 0.000000 1.0
9 1 1.0 0.0 0.1 0.1 5.0 0.905539 1.0
11 1 1.0 0.0 2.0 0.4 3.0 1.077033 3.0
14 1 1.0 0.0 0.0 0.0 3.0 1.000000 2.0
18 2 2.0 0.2 0.1 0.1 5.0 1.902630 3.0
20 2 2.0 0.2 2.0 0.4 3.0 0.200000 1.0
22 2 2.0 0.2 1.5 1.0 -2.0 0.943398 2.0
29 3 2.5 1.5 2.0 0.4 3.0 1.208305 3.0
30 3 2.5 1.5 2.5 2.5 4.0 1.000000 1.0
31 3 2.5 1.5 1.5 1.0 -2.0 1.118034 2.0
38 4 1.5 2.0 2.0 0.4 3.0 1.676305 3.0
39 4 1.5 2.0 2.5 2.5 4.0 1.118034 2.0
40 4 1.5 2.0 1.5 1.0 -2.0 1.000000 1.0
45 5 -2.0 -2.0 0.1 0.1 5.0 2.969848 2.0
46 5 -2.0 -2.0 1.0 -2.0 6.0 3.000000 3.0
50 5 -2.0 -2.0 0.0 0.0 3.0 2.828427 1.0
I want to create new columns in the input dataframe based on the additional dataframe: for each row, and for each dist_rank, it should pull in the corresponding x_1_y, x_2_y and value (matched on index).
I tried the following lines:
df['value_dist_rank1'] = result.loc[result['dist_rank']==1.0, 'value']
df['value_dist_rank1'] = result[result['dist_rank']==1.0]['value']
but both gave the same output:
x_1 x_2 value_dist_rank1
0 0.0 0.0 NaN
1 1.0 0.0 NaN
2 2.0 0.2 NaN
3 2.5 1.5 NaN
4 1.5 2.0 NaN
5 -2.0 -2.0 3.0
Here is a way to do it :
(For the sake of clarity I consider the input df as df1 and the additional df as df2)
# First we group df2 by index, to collect all the column information for each index on one line
df2 = df2.groupby('index').agg(lambda x: list(x)).reset_index()
# Then we expand each list into three columns, since there are always three rows per index
columns = ['dist_rank', 'value', 'x_1_y', 'x_2_y']
column_to_add = ['value', 'x_1_y', 'x_2_y']
for index, row in df2.iterrows():
for i in range(3):
column_names = ["{}_dist_rank{}".format(x, row.dist_rank[i])[:-2] for x in column_to_add]
values = [row[x][i] for x in column_to_add]
for column, value in zip(column_names, values):
df2.loc[index, column] = value
# We drop the columns that are not useful :
df2.drop(columns=columns+['dist', 'x_1_x', 'x_2_x'], inplace = True)
# Finally we merge the modified df with our initial dataframe :
result = df1.merge(df2, left_index=True, right_on='index', how='left')
Output :
x_1 x_2 index value_dist_rank2 x_1_y_dist_rank2 x_2_y_dist_rank2 \
0 0.0 0.0 0 5.0 0.1 0.1
1 1.0 0.0 1 3.0 0.0 0.0
2 2.0 0.2 2 -2.0 1.5 1.0
3 2.5 1.5 3 -2.0 1.5 1.0
4 1.5 2.0 4 4.0 2.5 2.5
5 -2.0 -2.0 5 5.0 0.1 0.1
value_dist_rank3 x_1_y_dist_rank3 x_2_y_dist_rank3 value_dist_rank1 \
0 -2.0 1.5 1.0 3.0
1 3.0 2.0 0.4 5.0
2 5.0 0.1 0.1 3.0
3 3.0 2.0 0.4 4.0
4 3.0 2.0 0.4 -2.0
5 6.0 1.0 -2.0 3.0
x_1_y_dist_rank1 x_2_y_dist_rank1
0 0.0 0.0
1 0.1 0.1
2 2.0 0.4
3 2.5 2.5
4 1.5 1.0
5 0.0 0.0
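A shorter route to a similar result (a sketch, starting again from the original additional dataframe df2 and input df1 as named in the answer) is to unstack df2 on dist_rank and join it onto df1:
wide = (df2.set_index(['index', 'dist_rank'])[['value', 'x_1_y', 'x_2_y']]
           .unstack('dist_rank'))
wide.columns = ['{}_dist_rank{}'.format(col, int(rank)) for col, rank in wide.columns]
result = df1.join(wide)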

Concatenating crosstabs of different variables

I have a Pandas (0.23.4) DataFrame with several categorical columns.
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.choice([True, False, np.nan], (6, 4)), columns=['a', 'b', 'c', 'd'])
a b c d
0 NaN 1.0 NaN NaN
1 NaN 1.0 NaN 0.0
2 1.0 NaN 1.0 NaN
3 0.0 NaN 0.0 1.0
4 NaN 1.0 NaN NaN
5 NaN 1.0 0.0 1.0
I have two sets of columns of interest:
cross_cols = ['a', 'b']
type_cols = ['c', 'd']
I would like to get a cross tab of counts of each cross_col variable with each type_col variable (a with c and d, and b with c and d), excluding NaN, all displayed side-by-side. The desired result is:
c d
0.0 1.0 All 0.0 1.0 All
a 0.0 0 0 0 1 1 2
1.0 2 1 3 1 0 1
All 2 1 3 2 1 3
b 0.0 0 0 0 0 1 1
1.0 2 1 3 2 0 2
All 2 1 3 2 1 3
Notice that I am not interested in counts for different combinations of a and b or of c and d, which is what I'm getting by changing the index and columns parameters of pd.crosstab.
Currently I'm using the following code:
cross_rows = []
for col in cross_cols:
cross_rows.append(pd.concat([pd.crosstab(df[col], df[type_var],margins=True) for type_var in type_cols],axis=1,keys = type_cols,sort=True))
results = pd.concat(cross_rows, keys = cross_cols,sort=True)
It gives the following result:
c d
c 0.0 1.0 All 0.0 1.0 All
a 1.0 2.0 1.0 3.0 1 0 1
All 2.0 1.0 3.0 2 1 3
0.0 NaN NaN NaN 1 1 2
b 1.0 2.0 1.0 3.0 2 0 2
All 2.0 1.0 3.0 2 1 3
0.0 NaN NaN NaN 0 1 1
The result is fine, but the code is slow and a bit ugly. I suspect that there's a faster and more Pythonic approach. Is there a single function call that would get the job done, or another faster solution?
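Not necessarily faster, but the double loop can at least be compressed into one nested comprehension that produces the same result as above (a sketch, not the desired layout):
results = pd.concat(
    [pd.concat([pd.crosstab(df[c], df[t], margins=True) for t in type_cols],
               axis=1, keys=type_cols, sort=True)
     for c in cross_cols],
    keys=cross_cols, sort=True)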

new Pandas Dataframe column calculated from other column values

How can I create a new column in a dataframe that consists of the MEAN of an indexed range of values in that row?
example:
1 2 3 JUNK
0 0.0 0.0 0.0 A
1 1.0 1.0 -1.0 B
2 2.0 2.0 1.0 C
the JUNK column would be ignored when trying to determine the MEAN column
expected output:
1 2 3 JUNK MEAN
0 0.0 0.0 0.0 A 0.0
1 1.0 1.0 -1.0 B 0.33
2 2.0 2.0 1.0 C 1.66
Use drop to remove the unwanted column, or iloc to filter out unnecessary columns by position:
df['MEAN'] = df.drop('JUNK', axis=1).mean(axis=1)
df['MEAN'] = df.iloc[:, :-1].mean(axis=1)
print (df)
1 2 3 JUNK MEAN
0 0.0 0.0 0.0 A 0.000000
1 1.0 1.0 -1.0 B 0.333333
2 2.0 2.0 1.0 C 1.666667
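If the numeric columns are not always in the same positions, selecting them by dtype is another option (a sketch; run it before the MEAN column exists, otherwise MEAN itself would be included):
df['MEAN'] = df.select_dtypes(include='number').mean(axis=1)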

Greedy most diverse subset of pandas dataframe

This is my dataset:
import pandas as pd
import itertools
A = ['A','B','C']
M = ['1','2','3']
F = ['plus','minus','square']
df = pd.DataFrame(list(itertools.product(A,M,F)), columns=['A','M','F'])
print(df)
A M F
0 A 1 plus
1 A 1 minus
2 A 1 square
3 A 2 plus
4 A 2 minus
5 A 2 square
I want to get the top-n rows (a subset) of that dataframe which are maximally diverse.
To compute diversity, I use 1 - Jaccard similarity.
def jaccard(a, b):
    # a and b are sets; returns the Jaccard similarity
    c = a.intersection(b)
    return float(len(c)) / (len(a) + len(b) - len(c))
Using dataframe operations, I can take the cartesian product of the dataframe with apply, compute the diversity value of each pair, and get the most diverse counterpart of each row with df.idxmax(axis=1). But this way I have to compute the diversity values of all pairs first, which is not efficient.
0 1 2 3 4 5 6 7 8 9 10
0 0.0 1.0 0.8 0.5 0.5 0.8 0.5 1.0 0.8 0.8 0.8
1 0.0 0.0 1.0 0.8 1.0 0.8 1.0 0.8 0.8 0.8 0.8
2 0.0 0.0 0.0 1.0 0.5 1.0 0.5 0.8 0.8 1.0 1.0
3 0.0 0.0 0.0 0.0 0.8 0.8 0.8 0.8 0.5 0.8 0.5
4 0.0 0.0 0.0 0.0 0.0 0.8 0.8 1.0 0.5 1.0 0.8
df.idxmax(axis=1).sample(4)
5 6
2 3
0 1
8 9
dtype: int64
I want to implement this algorithm, but I did not understand lines 6 and 7.
How do I compute the argmax here? And why does line 10 return Sk when Sk is never initialized inside the loop?
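The linked pseudocode is not reproduced here, but a greedy max-min selection is commonly written along these lines; this is only a sketch reusing the df and jaccard defined above (the starting row and the function name are my own choices), where the argmax is taken over the rows not yet selected:
def greedy_diverse(df, n):
    # greedy max-min selection: start from the first row, then repeatedly add the
    # candidate whose smallest (1 - jaccard) distance to the selected rows is largest
    rows = [set(r) for r in df.itertuples(index=False)]
    selected = [0]
    while len(selected) < n:
        remaining = [i for i in range(len(rows)) if i not in selected]
        best = max(remaining,
                   key=lambda i: min(1 - jaccard(rows[i], rows[j]) for j in selected))
        selected.append(best)
    return df.iloc[selected]

subset = greedy_diverse(df, 4)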
