Shift cells below to count - R equivalent of Python code

I am using the code below to produce the following result in Python, and I want an equivalent of this code in R.
Here N is a column of the dataframe data. The CN column is calculated from the values of column N following a specific pattern, and it gives me the following result in Python:
+---+----+
| N | CN |
+---+----+
| 0 | 0 |
| 1 | 1 |
| 1 | 1 |
| 2 | 2 |
| 2 | 2 |
| 0 | 3 |
| 0 | 3 |
| 1 | 4 |
| 1 | 4 |
| 1 | 4 |
| 2 | 5 |
| 2 | 5 |
| 3 | 6 |
| 4 | 7 |
| 0 | 8 |
| 1 | 9 |
| 2 | 10 |
+---+----+
A short overview of my code:
import pandas as pd
import numpy as np

data = pd.read_table(filename,skiprows=15,decimal=',', sep='\t',header=None,names=["Date ","Heure ","temps (s) ","X","Z"," LVDT V(mm) " ,"Force normale (N) ","FT","FN(N) ","TS"," NS(kPa) ","V (mm/min)","Vitesse normale (mm/min)","e (kPa)","k (kPa/mm) " ,"N " ,"Nb cycles normal" ,"Cycles " ,"Etat normal" ,"k imposÈ (kPa/mm)"])
data.columns = [col.strip() for col in data.columns.tolist()]
N = data[data.keys()[15]]
N = np.array(N)
data["CN"] = (data.N.shift().bfill() != data.N).astype(int).cumsum()
An example of data.head() is shown here:
+-------+-------------+------------+-----------+----------+----------+------------+-------------------+-----------+-------------+-----------+------------+------------+--------------------------+------------+------------+-----+------------------+--------+-------------+-------------------+----+
| Index | Date | Heure | temps (s) | X | Z(mm) | LVDT V(mm) | Force normale (N) | FT | FN(N) | FT (kPa) | NS(kPa) | V (mm/min) | Vitesse normale (mm/min) | e (kPa) | k (kPa/mm) | N | Nb cycles normal | Cycles | Etat normal | k imposÈ (kPa/mm) | CN |
+-------+-------------+------------+-----------+----------+----------+------------+-------------------+-----------+-------------+-----------+------------+------------+--------------------------+------------+------------+-----+------------------+--------+-------------+-------------------+----+
| 184 | 01/02/2022 | 12:36:52 | 402.163 | 6.910243 | 1.204797 | 0.001101 | 299.783665 | 31.494351 | 1428.988908 | 11.188704 | 505.825016 | 0.1 | 2.0 | 512.438828 | 50.918786 | 0.0 | 0.0 | Sort | Monte | 0.0 | 0 |
| 185 | 01/02/2022 | 12:36:54 | 404.288 | 6.907822 | 1.205647 | 4.9e-05 | 296.072718 | 31.162313 | 1404.195316 | 11.028167 | 494.97955 | 0.1 | -2.0 | 500.084986 | 49.685639 | 0.0 | 0.0 | Sort | Descend | 0.0 | 0 |
| 186 | 01/02/2022 | 12:36:56 | 406.536 | 6.907906 | 1.204194 | -0.000214 | 300.231424 | 31.586401 | 1429.123486 | 11.21895 | 505.750815 | 0.1 | 2.0 | 512.370164 | 50.914002 | 0.0 | 0.0 | Sort | Monte | 0.0 | 0 |
| 187 | 01/02/2022 | 12:36:58 | 408.627 | 6.910751 | 1.204293 | -0.000608 | 300.188686 | 31.754064 | 1428.979519 | 11.244542 | 505.624564 | 0.1 | 2.0 | 512.309254 | 50.906544 | 0.0 | 0.0 | Sort | Monte | 0.0 | 0 |
| 188 | 01/02/2022 | 12:37:00 | 410.679 | 6.907805 | 1.205854 | -0.000181 | 296.358074 | 31.563389 | 1415.224427 | 11.129375 | 502.464948 | 0.1 | 2.0 | 510.702313 | 50.742104 | 0.0 | 0.0 | Sort | Monte | 0.0 | 0 |
+-------+-------------+------------+-----------+----------+----------+------------+-------------------+-----------+-------------+-----------+------------+------------+--------------------------+------------+------------+-----+------------------+--------+-------------+-------------------+----+

A one-line cumsum trick solves it: diff(df1$N) != 0 flags every row where N differs from the previous row, and cumsum counts those changes (the leading 0L keeps the first row at zero).
cumsum(c(0L, diff(df1$N) != 0))
#> [1] 0 1 1 2 2 3 3 4 4 4 5 5 6 7 8 9 10
all.equal(
cumsum(c(0L, diff(df1$N) != 0)),
df1$CN
)
#> [1] TRUE
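For comparison, the same diff-based idea can be expressed back in pandas; this is just a cross-check sketch using the N values from the table above, not part of the R answer:
import pandas as pd

N = pd.Series([0, 1, 1, 2, 2, 0, 0, 1, 1, 1, 2, 2, 3, 4, 0, 1, 2])

# diff().ne(0) flags rows where N changes; fillna(0) keeps the first row at "no change"
CN = N.diff().fillna(0).ne(0).cumsum()
print(CN.tolist())
# [0, 1, 1, 2, 2, 3, 3, 4, 4, 4, 5, 5, 6, 7, 8, 9, 10]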
Data
x <- "
+---+----+
| N | CN |
+---+----+
| 0 | 0 |
| 1 | 1 |
| 1 | 1 |
| 2 | 2 |
| 2 | 2 |
| 0 | 3 |
| 0 | 3 |
| 1 | 4 |
| 1 | 4 |
| 1 | 4 |
| 2 | 5 |
| 2 | 5 |
| 3 | 6 |
| 4 | 7 |
| 0 | 8 |
| 1 | 9 |
| 2 | 10 |
+---+----+"
df1 <- read.table(textConnection(x), header = TRUE, sep = "|", comment.char = "+")[2:3]
Created on 2022-02-14 by the reprex package (v2.0.1)

Related

Pandas: group and count column values per another column

I have a dataframe in the following form:
+---------+---------+-------+-------+-----------------+
| country | payment | type | err | email |
+---------+---------+-------+-------+-----------------+
| AU | visa | type1 | OK | user1#email.com |
| DE | paypal | type1 | OK | user2#email.com |
| AU | visa | type2 | ERROR | user1#email.com |
| US | visa | type2 | OK | user4#email.com |
| FR | visa | type1 | OK | user2#email.com |
| FR | visa | type1 | ERROR | user2#email.com |
+---------+---------+-------+-------+-----------------+
df = pd.DataFrame({'country':['AU','DE','AU','US','FR','FR'],
'payment':['visa','paypal','visa','visa','visa','visa'],
'type':['type1','type1','type2','type2','type1','type1'],
'err':['OK','OK','ERROR','OK','OK','ERROR'],
'email': ['user1#email.com','user2#email.com','user1#email.com','user4#email.com','user2#email.com','user2#email.com'] })
My goal is to transform it so that I group by payment and country and create new columns:
number_payments - just the count for the group,
num_errors - the number of ERROR values in the group,
num_type1 .. num_type3 - the number of corresponding values in the type column (only 3 possible values),
num_errors_per_unique_email - the average number of errors per unique email for the group,
num_type1_per_unique_email .. num_type3_per_unique_email - the average number of each type per unique email for the group.
Like this:
+---------+---------+-----------------+------------+-----------+-----------+-----------------------------+----------------------------+----------------------------+----------------------------+
| payment | country | number_payments | num_errors | num_type1 | num_type2 | num_errors_per_unique_email | num_type1_per_unique_email | num_type2_per_unique_email | num_type3_per_unique_email |
+---------+---------+-----------------+------------+-----------+-----------+-----------------------------+----------------------------+----------------------------+----------------------------+
| paypal | DE | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| visa | AU | 2 | 1 | 1 | 1 | 1 | 0 | 1 | 0 |
| visa | FR | 2 | 0 | 1 | 1 | 1 | 2 | 0 | 0 |
| visa | US | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
+---------+---------+-----------------+------------+-----------+-----------+-----------------------------+----------------------------+----------------------------+----------------------------+
Thanks to #anky's solution (get dummies, create the group, join the size with the sum) I'm able to get the first part of the task and receive this:
c = df['err'].eq("ERROR")
g = (df[['payment','country']].assign(num_errors=c,
**pd.get_dummies(df[['type']],prefix=['num'])).groupby(['payment','country']))
out = g.size().to_frame("number_payments").join(g.sum()).reset_index()
+---------+---------+-----------------+------------+-----------+-----------+
| payment | country | number_payments | num_errors | num_type1 | num_type2 |
+---------+---------+-----------------+------------+-----------+-----------+
| paypal | DE | 1 | 0 | 1 | 0 |
| visa | AU | 2 | 1 | 1 | 1 |
| visa | FR | 2 | 1 | 2 | 0 |
| visa | US | 1 | 0 | 0 | 1 |
+---------+---------+-----------------+------------+-----------+-----------+
But I'm stuck on how to properly add columns like 'num_errors_per_unique_email' and 'num_type_per_unique_email'.
I'd appreciate any help.
Like this?
dfemail = df.groupby('email')[['err', 'type']].count()
dfemail
err type
email
user1#email.com 2 2
user2#email.com 3 3
user4#email.com 1 1
I've managed to do this, but not in a very efficient or proper way, so correct answers are appreciated.
c = df['err'].eq("ERROR")
g = (df[['payment','country','email']].assign(num_errors=c,
**pd.get_dummies(df[['type']],prefix=['num'])).groupby(['payment','country']))
out = g.size().to_frame("number_payments").join([g.sum(), g['email'].nunique().to_frame("unique_emails")]).reset_index()
out['num_errors_per_unique_email'] = out['num_errors'] / out['unique_emails']
out['num_type1_per_unique_email'] = out['num_type1'] / out['unique_emails']
out['num_type2_per_unique_email'] = out['num_type2'] / out['unique_emails']
out
+---------+---------+-----------------+------------+-----------+-----------+---------------+-----------------------------+----------------------------+----------------------------+
| payment | country | number_payments | num_errors | num_type1 | num_type2 | unique_emails | num_errors_per_unique_email | num_type1_per_unique_email | num_type2_per_unique_email |
+---------+---------+-----------------+------------+-----------+-----------+---------------+-----------------------------+----------------------------+----------------------------+
| paypal | DE | 1 | 0 | 1 | 0 | 1 | 0.0 | 1.0 | 0.0 |
| visa | AU | 2 | 1 | 1 | 1 | 1 | 1.0 | 1.0 | 1.0 |
| visa | FR | 2 | 1 | 2 | 0 | 1 | 1.0 | 2.0 | 0.0 |
| visa | US | 1 | 0 | 0 | 1 | 1 | 0.0 | 0.0 | 1.0 |
+---------+---------+-----------------+------------+-----------+-----------+---------------+-----------------------------+----------------------------+----------------------------+
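For reference, the same per-group numbers can also be produced in a single pass with named aggregation (available in pandas 0.25+); this is a sketch built on the df defined above, and the num_type3 columns would only appear if type3 actually occurred in the data:
import pandas as pd

c = df['err'].eq('ERROR')
g = (df.assign(num_errors=c, **pd.get_dummies(df[['type']], prefix=['num']))
       .groupby(['payment', 'country']))

out = g.agg(number_payments=('err', 'size'),
            num_errors=('num_errors', 'sum'),
            num_type1=('num_type1', 'sum'),
            num_type2=('num_type2', 'sum'),
            unique_emails=('email', 'nunique')).reset_index()

# divide each count by the number of unique emails in the group
for col in ['num_errors', 'num_type1', 'num_type2']:
    out[col + '_per_unique_email'] = out[col] / out['unique_emails']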

update table information based on columns of another table

I am new to Python and have two dataframes: df1 contains information about all students with their group and score, and df2 contains updated information about a few students who changed their group and score. How can I update the information in df1 based on the values of df2 (group and score)?
df1
+----+----------+-----------+----------------+
| |student No| group | score |
|----+----------+-----------+----------------|
| 0 | 0 | 0 | 0.839626 |
| 1 | 1 | 0 | 0.845435 |
| 2 | 2 | 3 | 0.830778 |
| 3 | 3 | 2 | 0.831565 |
| 4 | 4 | 3 | 0.823569 |
| 5 | 5 | 0 | 0.808109 |
| 6 | 6 | 4 | 0.831645 |
| 7 | 7 | 1 | 0.851048 |
| 8 | 8 | 3 | 0.843209 |
| 9 | 9 | 4 | 0.84902 |
| 10 | 10 | 0 | 0.835143 |
| 11 | 11 | 4 | 0.843228 |
| 12 | 12 | 2 | 0.826949 |
| 13 | 13 | 0 | 0.84196 |
| 14 | 14 | 1 | 0.821634 |
| 15 | 15 | 3 | 0.840702 |
| 16 | 16 | 0 | 0.828994 |
| 17 | 17 | 2 | 0.843043 |
| 18 | 18 | 4 | 0.809093 |
| 19 | 19 | 1 | 0.85426 |
+----+----------+-----------+----------------+
df2
+----+-----------+----------+----------------+
| | group |student No| score |
|----+-----------+----------+----------------|
| 0 | 2 | 1 | 0.887435 |
| 1 | 0 | 19 | 0.81214 |
| 2 | 3 | 17 | 0.899041 |
| 3 | 0 | 8 | 0.853333 |
| 4 | 4 | 9 | 0.88512 |
+----+-----------+----------+----------------+
The result (df3):
+----+----------+-----------+----------------+
| |student No| group | score |
|----+----------+-----------+----------------|
| 0 | 0 | 0 | 0.839626 |
| 1 | 1 | 2 | 0.887435 |
| 2 | 2 | 3 | 0.830778 |
| 3 | 3 | 2 | 0.831565 |
| 4 | 4 | 3 | 0.823569 |
| 5 | 5 | 0 | 0.808109 |
| 6 | 6 | 4 | 0.831645 |
| 7 | 7 | 1 | 0.851048 |
| 8 | 8 | 0 | 0.853333 |
| 9 | 9 | 4 | 0.88512 |
| 10 | 10 | 0 | 0.835143 |
| 11 | 11 | 4 | 0.843228 |
| 12 | 12 | 2 | 0.826949 |
| 13 | 13 | 0 | 0.84196 |
| 14 | 14 | 1 | 0.821634 |
| 15 | 15 | 3 | 0.840702 |
| 16 | 16 | 0 | 0.828994 |
| 17 | 17 | 3 | 0.899041 |
| 18 | 18 | 4 | 0.809093 |
| 19 | 19 | 0 | 0.81214 |
+----+----------+-----------+----------------+
My code to update df1 from df2:
dfupdated = df1.merge(df2, how='left', on=['student No'], suffixes=('', '_new'))
dfupdated['group'] = np.where(pd.notnull(dfupdated['group_new']), dfupdated['group_new'],
dfupdated['group'])
dfupdated['score'] = np.where(pd.notnull(dfupdated['score_new']), dfupdated['score_new'],
dfupdated['score'])
dfupdated.drop(['group_new', 'score_new'],axis=1, inplace=True)
dfupdated.reset_index(drop=True, inplace=True)
but I face the following error:
KeyError: "['group'] not in index"
I don't know what's wrong.
I ran the same code and got the expected result. Here is a different way to solve it; try:
dfupdated = df1.merge(df2, on='student No', how='left')
dfupdated['group'] = dfupdated['group_y'].fillna(dfupdated['group_x'])
dfupdated['score'] = dfupdated['score_y'].fillna(dfupdated['score_x'])
dfupdated.drop(['group_x', 'group_y','score_x', 'score_y'], axis=1,inplace=True)
This will give you the solution you want.
To get the max from each group:
dfupdated.groupby(['group'], sort=False)['score'].max()
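As an alternative sketch (assuming 'student No' is unique in both frames), DataFrame.update can overwrite the matching rows after indexing both frames by 'student No'; note that update may upcast integer columns to float:
import pandas as pd

df1 = df1.set_index('student No')
df1.update(df2.set_index('student No'))  # overwrites group/score where df2 provides values
df1 = df1.reset_index()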

I have obtained another two dataframes from my original dataframe; how can I merge the columns I need into a final one?

I have a table with 4 columns. From this data I obtained another 2 tables with some rolling averages computed from the original table. Now I want to combine these 3 into a final table, but the indexes are no longer in order and I can't do it. I just started to learn Python, I have zero experience, and I would really appreciate all the help I can get.
DF
+----+------------+-----------+------+------+
| | A | B | C | D |
+----+------------+-----------+------+------+
| 1 | Home Team | Away Team | Htgs | Atgs |
| 2 | dalboset | sopot | 1 | 2 |
| 3 | calnic | resita | 1 | 3 |
| 4 | sopot | dalboset | 2 | 2 |
| 5 | resita | sopot | 4 | 1 |
| 6 | sopot | dalboset | 2 | 1 |
| 7 | caransebes | dalboset | 1 | 2 |
| 8 | calnic | resita | 1 | 3 |
| 9 | dalboset | sopot | 2 | 2 |
| 10 | calnic | resita | 4 | 1 |
| 11 | sopot | dalboset | 2 | 1 |
| 12 | resita | sopot | 1 | 2 |
| 13 | sopot | dalboset | 1 | 3 |
| 14 | caransebes | dalboset | 2 | 2 |
| 15 | calnic | resita | 4 | 1 |
| 16 | dalboset | sopot | 2 | 1 |
| 17 | calnic | resita | 1 | 2 |
| 18 | sopot | dalboset | 4 | 1 |
| 19 | resita | sopot | 2 | 1 |
| 20 | sopot | dalboset | 1 | 2 |
| 21 | caransebes | dalboset | 1 | 3 |
| 22 | calnic | resita | 2 | 2 |
+----+------------+-----------+------+------+
CODE
df1 = df.groupby('Home Team')[['Htgs', 'Atgs']].rolling(window=4, min_periods=3).mean()
df1 = df1.rename(columns={'Htgs': 'Htgs/3', 'Atgs': 'Htgc/3'})
df1
df2 = df.groupby('Away Team')[['Htgs', 'Atgs']].rolling(window=4, min_periods=3).mean()
df2 = df2.rename(columns={'Htgs': 'Atgc/3', 'Atgs': 'Atgs/3'})
df2
Now I need a solution to see the columns with the rolling averages next to the Home Team, Away Team, Htgs, Atgs columns from the original table.
Done! I created a new column directly in the dataframe like this:
df = pd.read_csv('Fd.csv')
df['Htgs/3'] = df.groupby('Home Team')['Htgs'].rolling(window=4, min_periods=3).mean().reset_index(0, drop=True)
Htgs/3 will be the new column with the rolling average of Htgs grouped by Home Team, and for the rest I will do the same as in this part (see the sketch below).
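Following that pattern for the remaining columns, a possible sketch (assuming the CSV has the columns Home Team, Away Team, Htgs and Atgs) is:
import pandas as pd

df = pd.read_csv('Fd.csv')

# rolling means per home team, pushed back onto the original row order
df['Htgs/3'] = df.groupby('Home Team')['Htgs'].rolling(window=4, min_periods=3).mean().reset_index(0, drop=True)
df['Htgc/3'] = df.groupby('Home Team')['Atgs'].rolling(window=4, min_periods=3).mean().reset_index(0, drop=True)

# rolling means per away team
df['Atgc/3'] = df.groupby('Away Team')['Htgs'].rolling(window=4, min_periods=3).mean().reset_index(0, drop=True)
df['Atgs/3'] = df.groupby('Away Team')['Atgs'].rolling(window=4, min_periods=3).mean().reset_index(0, drop=True)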

Multi-Index Lookup Mapping

I'm trying to create a new column which has a value based on 2 indices of that row. I have 2 dataframes with equivalent multi-index on the levels I'm querying (but not of equal size). For each row in the 1st dataframe, I want the value of the 2nd df that matches the row's indices.
I originally thought perhaps I could use a .loc[] and filter off the index values, but I cannot seem to get this to change the output row-by-row. If I wasn't using a dataframe object, I'd loop over the whole thing to do it.
I have tried to use the .apply() method, but I can't figure out what function to pass to it.
Creating some toy data with the same structure:
import pandas as pd
import numpy as np

np.random.seed(1)
df = pd.DataFrame({'Aircraft':np.ones(15),
'DC':np.append(np.repeat(['A','B'], 7), 'C'),
'Test':np.array([10,10,10,10,10,10,20,10,10,10,10,10,10,20,10]),
'Record':np.array([1,2,3,4,5,6,1,1,2,3,4,5,6,1,1]),
# There are multiple "value" columns in my data, but I have simplified here
'Value':np.random.random(15)
}
)
df.set_index(['Aircraft', 'DC', 'Test', 'Record'], inplace=True)
df.sort_index(inplace=True)
v = pd.DataFrame({'Aircraft':np.ones(7),
'DC':np.repeat('v',7),
'Test':np.array([10,10,10,10,10,10,20]),
'Record':np.array([1,2,3,4,5,6,1]),
'Value':np.random.random(7)
}
)
v.set_index(['Aircraft', 'DC', 'Test', 'Record'], inplace=True)
v.sort_index(inplace=True)
df['v'] = df.apply(lambda x: v.loc[df.iloc[x]])
Returns error for indexing on multi-index.
To set all values to a single "v" value:
df['v'] = float(v.loc[(slice(None), 'v', 10, 1), 'Value'])
So inputs look like this:
--------------------------------------------
| Aircraft | DC | Test | Record | Value |
|----------|----|------|--------|----------|
| 1.0 | A | 10 | 1 | 0.847576 |
| | | | 2 | 0.860720 |
| | | | 3 | 0.017704 |
| | | | 4 | 0.082040 |
| | | | 5 | 0.583630 |
| | | | 6 | 0.506363 |
| | | 20 | 1 | 0.844716 |
| | B | 10 | 1 | 0.698131 |
| | | | 2 | 0.112444 |
| | | | 3 | 0.718316 |
| | | | 4 | 0.797613 |
| | | | 5 | 0.129207 |
| | | | 6 | 0.861329 |
| | | 20 | 1 | 0.535628 |
| | C | 10 | 1 | 0.121704 |
--------------------------------------------
--------------------------------------------
| Aircraft | DC | Test | Record | Value |
|----------|----|------|--------|----------|
| 1.0 | v | 10 | 1 | 0.961791 |
| | | | 2 | 0.046681 |
| | | | 3 | 0.913453 |
| | | | 4 | 0.495924 |
| | | | 5 | 0.149950 |
| | | | 6 | 0.708635 |
| | | 20 | 1 | 0.874841 |
--------------------------------------------
And after the operation, I want this:
| Aircraft | DC | Test | Record | Value | v |
|----------|----|------|--------|----------|----------|
| 1.0 | A | 10 | 1 | 0.847576 | 0.961791 |
| | | | 2 | 0.860720 | 0.046681 |
| | | | 3 | 0.017704 | 0.913453 |
| | | | 4 | 0.082040 | 0.495924 |
| | | | 5 | 0.583630 | 0.149950 |
| | | | 6 | 0.506363 | 0.708635 |
| | | 20 | 1 | 0.844716 | 0.874841 |
| | B | 10 | 1 | 0.698131 | 0.961791 |
| | | | 2 | 0.112444 | 0.046681 |
| | | | 3 | 0.718316 | 0.913453 |
| | | | 4 | 0.797613 | 0.495924 |
| | | | 5 | 0.129207 | 0.149950 |
| | | | 6 | 0.861329 | 0.708635 |
| | | 20 | 1 | 0.535628 | 0.874841 |
| | C | 10 | 1 | 0.121704 | 0.961791 |
Edit:
As you are on pandas 0.23.4, just change droplevel to reset_index with the option drop=True:
df_result = (df.reset_index('DC').assign(v=v.reset_index('DC', drop=True))
.set_index('DC', append=True)
.reorder_levels(v.index.names))
Original:
One way is to move the index level DC of df into the columns, use assign to create the new column from v, and then set_index and reorder_levels to restore the original index:
df_result = (df.reset_index('DC').assign(v=v.droplevel('DC'))
.set_index('DC', append=True)
.reorder_levels(v.index.names))
Out[1588]:
Value v
Aircraft DC Test Record
1.0 A 10 1 0.847576 0.961791
2 0.860720 0.046681
3 0.017704 0.913453
4 0.082040 0.495924
5 0.583630 0.149950
6 0.506363 0.708635
20 1 0.844716 0.874841
B 10 1 0.698131 0.961791
2 0.112444 0.046681
3 0.718316 0.913453
4 0.797613 0.495924
5 0.129207 0.149950
6 0.861329 0.708635
20 1 0.535628 0.874841
C 10 1 0.121704 0.961791
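An alternative sketch: since v's index without the DC level is unique, its Value column can be reindexed against df's index with the DC level dropped, giving one looked-up value per row of df (.values makes the assignment ignore index alignment):
df['v'] = (v['Value']
           .reset_index('DC', drop=True)       # index becomes (Aircraft, Test, Record)
           .reindex(df.index.droplevel('DC'))  # align to df's rows, ignoring DC
           .values)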

Self-defined auto-binning function fails with ValueError: too many values to unpack (expected 4)

The data looks like this:
+------------------+--------------------------------------+-----+--------------------------------------+-------------+---------------+---------------------------------+-------------------------+------------------------------+--------------------------------------+--------------------+
| SeriousDlqin2yrs | RevolvingUtilizationOfUnsecuredLines | age | NumberOfTime30-59DaysPastDueNotWorse | DebtRatio | MonthlyIncome | NumberOfOpenCreditLinesAndLoans | NumberOfTimes90DaysLate | NumberRealEstateLoansOrLines | NumberOfTime60-89DaysPastDueNotWorse | NumberOfDependents |
+------------------+--------------------------------------+-----+--------------------------------------+-------------+---------------+---------------------------------+-------------------------+------------------------------+--------------------------------------+--------------------+
| 1 | 0.186762647 | 44 | 0 | 0.579385737 | 1920 | 5 | 0 | 1 | 0 | 4 |
| 1 | 0.023579132 | 57 | 0 | 0.299046087 | 8700 | 12 | 0 | 3 | 0 | 0 |
| 1 | 0.003621589 | 58 | 0 | 0.310172457 | 4000 | 14 | 0 | 1 | 0 | 2 |
| 1 | 0.145603022 | 77 | 0 | 0.313491151 | 4350 | 10 | 0 | 1 | 0 | 0 |
| 1 | 0.245191827 | 53 | 0 | 0.238513897 | 7051 | 3 | 0 | 1 | 0 | 0 |
| 1 | 0.443504996 | 23 | 0 | 54 | 6670.221237 | 3 | 0 | 0 | 0 | 0.757222268 |
| 1 | 0.332611231 | 51 | 0 | 0.267748028 | 9000 | 9 | 0 | 2 | 0 | 6 |
| 1 | 0.625243566 | 40 | 0 | 0.474557522 | 7231 | 9 | 0 | 1 | 0 | 2 |
| 1 | 0.091590841 | 51 | 0 | 2359 | 6670.221237 | 3 | 0 | 1 | 0 | 0 |
| 1 | 0.69186808 | 48 | 3 | 0.125587441 | 10000 | 7 | 0 | 0 | 0 | 2 |
| 1 | 0.004999828 | 63 | 1 | 0.246688328 | 4000 | 5 | 0 | 1 | 0 | 0 |
| 1 | 0.064841612 | 53 | 0 | 0.239478872 | 11666 | 13 | 0 | 3 | 0 | 1 |
| 1 | 0.512060051 | 44 | 1 | 0.412406271 | 4400 | 14 | 0 | 0 | 0 | 1 |
| 1 | 0.9999999 | 25 | 0 | 0.024314936 | 2590 | 1 | 0 | 0 | 0 | 0 |
| 1 | 0.372130998 | 32 | 0 | 0.206849144 | 8000 | 7 | 0 | 0 | 0 | 2 |
| 1 | 0.9999999 | 34 | 0 | 0.208158368 | 5000 | 4 | 0 | 0 | 0 | 2 |
| 1 | 0.023464572 | 63 | 0 | 0.149350649 | 1539 | 13 | 0 | 0 | 0 | 0 |
| 1 | 0.937531861 | 64 | 2 | 0.563646207 | 14776 | 9 | 0 | 2 | 1 | 1 |
| 1 | 0.001808414 | 51 | 0 | 1736 | 6670.221237 | 7 | 0 | 1 | 0 | 0 |
| 1 | 0.019950125 | 54 | 1 | 3622 | 6670.221237 | 7 | 0 | 1 | 0 | 0 |
| 1 | 0.183178709 | 42 | 0 | 0.162644416 | 7789 | 12 | 0 | 0 | 0 | 2 |
| 1 | 0.039786673 | 76 | 0 | 0.011729323 | 3324 | 6 | 0 | 0 | 0 | 1 |
| 1 | 0.047418557 | 41 | 1 | 1.178100863 | 2200 | 7 | 0 | 1 | 0 | 2 |
| 1 | 0.127890461 | 59 | 0 | 67 | 6670.221237 | 2 | 0 | 0 | 0 | 0 |
| 1 | 0.074955088 | 57 | 0 | 776 | 6670.221237 | 9 | 0 | 0 | 0 | 0 |
| 1 | 0.025459356 | 63 | 0 | 0.326794591 | 18708 | 30 | 0 | 3 | 0 | 1 |
| 1 | 0.9999999 | 29 | 0 | 0.06104034 | 3767 | 1 | 0 | 0 | 0 | 3 |
| 1 | 0.016754935 | 50 | 0 | 0.046870976 | 7765 | 6 | 0 | 0 | 0 | 0 |
| 0 | 0.566751792 | 40 | 0 | 0.2010541 | 6450 | 9 | 1 | 0 | 1 | 4 |
+------------------+--------------------------------------+-----+--------------------------------------+-------------+---------------+---------------------------------+-------------------------+------------------------------+--------------------------------------+--------------------+
# self-defined auto-binning function
import numpy as np
import pandas as pd
from scipy import stats

def mono_bin(Y, X, n=20):
    r = 0
    good = Y.sum()
    bad = Y.count() - good
    while np.abs(r) < 1:
        d1 = pd.DataFrame({"X": X, "Y": Y, "Bucket": pd.qcut(X, n)})
        d2 = d1.groupby('Bucket', as_index=True)
        r, p = stats.spearmanr(d2.mean().X, d2.mean().Y)
        n = n - 1
    d3 = pd.DataFrame(d2.X.min(), columns=['min'])
    d3['min'] = d2.min().X
    d3['max'] = d2.max().X
    d3['sum'] = d2.sum().Y
    d3['total'] = d2.count().Y
    d3['rate'] = d2.mean().Y
    d3['woe'] = np.log((d3['rate'] / (1 - d3['rate'])) / (good / bad))
    d4 = (d3.sort_index(by='min')).reset_index(drop=True)
    print("=" * 60)
    print(d4)
    return d4
I have checked some answers; most of them suggested using something like iteritems, but that seems different from my case.
dfx1, ivx1, cutx1, woex1 = mono_bin(data.SeriousDlqin2yrs, data.RevolvingUtilizationOfUnsecuredLines, n=10)
ValueError: too many values to unpack (expected 4)
Any idea how I can fix this?
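For what it's worth, the mismatch is visible in the code itself: mono_bin as written returns only the single DataFrame d4, while the call site unpacks four values, which is exactly what raises the ValueError. Either have the function return four things, or unpack just one; a minimal sketch of the latter:
d4 = mono_bin(data.SeriousDlqin2yrs, data.RevolvingUtilizationOfUnsecuredLines, n=10)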
