I have a dataframe in the following form:
+---------+---------+-------+-------+-----------------+
| country | payment | type | err | email |
+---------+---------+-------+-------+-----------------+
| AU | visa | type1 | OK | user1@email.com |
| DE | paypal | type1 | OK | user2@email.com |
| AU | visa | type2 | ERROR | user1@email.com |
| US | visa | type2 | OK | user4@email.com |
| FR | visa | type1 | OK | user2@email.com |
| FR | visa | type1 | ERROR | user2@email.com |
+---------+---------+-------+-------+-----------------+
df = pd.DataFrame({'country': ['AU','DE','AU','US','FR','FR'],
                   'payment': ['visa','paypal','visa','visa','visa','visa'],
                   'type': ['type1','type1','type2','type2','type1','type1'],
                   'err': ['OK','OK','ERROR','OK','OK','ERROR'],
                   'email': ['user1@email.com','user2@email.com','user1@email.com',
                             'user4@email.com','user2@email.com','user2@email.com']})
My goal is to transform it so that the data is grouped by payment and country, with these new columns:
number_payments - just the count for the group,
num_errors - the number of ERROR values in the group,
num_type1 .. num_type3 - the number of corresponding values in the type column (only 3 possible values),
num_errors_per_unique_email - the average number of errors per unique email for this group,
num_type1_per_unique_email .. num_type3_per_unique_email - the average number of each type per unique email for this group.
Like this:
+---------+---------+-----------------+------------+-----------+-----------+-----------------------------+----------------------------+----------------------------+----------------------------+
| payment | country | number_payments | num_errors | num_type1 | num_type2 | num_errors_per_unique_email | num_type1_per_unique_email | num_type2_per_unique_email | num_type3_per_unique_email |
+---------+---------+-----------------+------------+-----------+-----------+-----------------------------+----------------------------+----------------------------+----------------------------+
| paypal | DE | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| visa | AU | 2 | 1 | 1 | 1 | 1 | 0 | 1 | 0 |
| visa | FR | 2 | 0 | 1 | 1 | 1 | 2 | 0 | 0 |
| visa | US | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
+---------+---------+-----------------+------------+-----------+-----------+-----------------------------+----------------------------+----------------------------+----------------------------+
Thanks to @anky's solution (get dummies, create the group, join the size with sum) I'm able to get the first part of the task and receive this:
c = df['err'].eq("ERROR")
g = (df[['payment','country']]
     .assign(num_errors=c, **pd.get_dummies(df[['type']], prefix=['num']))
     .groupby(['payment','country']))
out = g.size().to_frame("number_payments").join(g.sum()).reset_index()
+---------+---------+-----------------+------------+-----------+-----------+
| payment | country | number_payments | num_errors | num_type1 | num_type2 |
+---------+---------+-----------------+------------+-----------+-----------+
| paypal | DE | 1 | 0 | 1 | 0 |
| visa | AU | 2 | 1 | 1 | 1 |
| visa | FR | 2 | 1 | 2 | 0 |
| visa | US | 1 | 0 | 0 | 1 |
+---------+---------+-----------------+------------+-----------+-----------+
But I'm stuck on how to properly add columns like 'num_errors_per_unique_email' and 'num_type_per_unique_email'.
Appreciate any help.
Like this?
dfemail = df.groupby('email')[['err', 'type']].count()
dfemail
err type
email
user1@email.com 2 2
user2@email.com 3 3
user4@email.com 1 1
I've managed to do this, but not in a very efficient or proper way, so correct answers are appreciated.
c = df['err'].eq("ERROR")
g = (df[['payment','country','email']]
     .assign(num_errors=c, **pd.get_dummies(df[['type']], prefix=['num']))
     .groupby(['payment','country']))
out = g.size().to_frame("number_payments").join([g.sum(), g['email'].nunique().to_frame("unique_emails")]).reset_index()
out['num_errors_per_unique_email'] = out['num_errors'] / out['unique_emails']
out['num_type1_per_unique_email'] = out['num_type1'] / out['unique_emails']
out['num_type2_per_unique_email'] = out['num_type2'] / out['unique_emails']
out
+---------+---------+-----------------+------------+-----------+-----------+---------------+-----------------------------+----------------------------+----------------------------+
| payment | country | number_payments | num_errors | num_type1 | num_type2 | unique_emails | num_errors_per_unique_email | num_type1_per_unique_email | num_type2_per_unique_email |
+---------+---------+-----------------+------------+-----------+-----------+---------------+-----------------------------+----------------------------+----------------------------+
| paypal | DE | 1 | 0 | 1 | 0 | 1 | 0.0 | 1.0 | 0.0 |
| visa | AU | 2 | 1 | 1 | 1 | 1 | 1.0 | 1.0 | 1.0 |
| visa | FR | 2 | 1 | 2 | 0 | 1 | 1.0 | 2.0 | 0.0 |
| visa | US | 1 | 0 | 0 | 1 | 1 | 0.0 | 0.0 | 1.0 |
+---------+---------+-----------------+------------+-----------+-----------+---------------+-----------------------------+----------------------------+----------------------------+
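For reference, here is a slightly more compact sketch of the same idea (using the toy df above); the only difference is that the per-unique-email columns are computed in one step by dividing the count columns by the unique-email counts:
d = df.assign(num_errors=df['err'].eq('ERROR'),
              **pd.get_dummies(df[['type']], prefix=['num']))
g = d.groupby(['payment', 'country'])
out = (g.size().to_frame('number_payments')
       .join(g[['num_errors', 'num_type1', 'num_type2']].sum())
       .join(g['email'].nunique().rename('unique_emails')))
per_email = (out[['num_errors', 'num_type1', 'num_type2']]
             .div(out['unique_emails'], axis=0)
             .add_suffix('_per_unique_email'))
out = out.join(per_email).reset_index()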
I am using the code below to produce the following result in Python and I want an equivalent of this code in R.
Here N is a column of the dataframe data. The CN column is calculated from the values of column N following a specific pattern: it increments by 1 each time the value of N changes from the previous row. This gives me the following result in Python.
+---+----+
| N | CN |
+---+----+
| 0 | 0 |
| 1 | 1 |
| 1 | 1 |
| 2 | 2 |
| 2 | 2 |
| 0 | 3 |
| 0 | 3 |
| 1 | 4 |
| 1 | 4 |
| 1 | 4 |
| 2 | 5 |
| 2 | 5 |
| 3 | 6 |
| 4 | 7 |
| 0 | 8 |
| 1 | 9 |
| 2 | 10 |
+---+----+
A short overview of my code:
data = pd.read_table(filename, skiprows=15, decimal=',', sep='\t', header=None,
                     names=["Date ", "Heure ", "temps (s) ", "X", "Z", " LVDT V(mm) ",
                            "Force normale (N) ", "FT", "FN(N) ", "TS", " NS(kPa) ",
                            "V (mm/min)", "Vitesse normale (mm/min)", "e (kPa)",
                            "k (kPa/mm) ", "N ", "Nb cycles normal", "Cycles ",
                            "Etat normal", "k imposé (kPa/mm)"])
data.columns = [col.strip() for col in data.columns.tolist()]
N = data[data.keys()[15]]
N = np.array(N)
data["CN"] = (data.N.shift().bfill() != data.N).astype(int).cumsum()
An example of data.head() is here:
+-------+-------------+------------+-----------+----------+----------+------------+-------------------+-----------+-------------+-----------+------------+------------+--------------------------+------------+------------+-----+------------------+--------+-------------+-------------------+----+
| Index | Date | Heure | temps (s) | X | Z(mm) | LVDT V(mm) | Force normale (N) | FT | FN(N) | FT (kPa) | NS(kPa) | V (mm/min) | Vitesse normale (mm/min) | e (kPa) | k (kPa/mm) | N | Nb cycles normal | Cycles | Etat normal | k imposÈ (kPa/mm) | CN |
+-------+-------------+------------+-----------+----------+----------+------------+-------------------+-----------+-------------+-----------+------------+------------+--------------------------+------------+------------+-----+------------------+--------+-------------+-------------------+----+
| 184 | 01/02/2022 | 12:36:52 | 402.163 | 6.910243 | 1.204797 | 0.001101 | 299.783665 | 31.494351 | 1428.988908 | 11.188704 | 505.825016 | 0.1 | 2.0 | 512.438828 | 50.918786 | 0.0 | 0.0 | Sort | Monte | 0.0 | 0 |
| 185 | 01/02/2022 | 12:36:54 | 404.288 | 6.907822 | 1.205647 | 4.9e-05 | 296.072718 | 31.162313 | 1404.195316 | 11.028167 | 494.97955 | 0.1 | -2.0 | 500.084986 | 49.685639 | 0.0 | 0.0 | Sort | Descend | 0.0 | 0 |
| 186 | 01/02/2022 | 12:36:56 | 406.536 | 6.907906 | 1.204194 | -0.000214 | 300.231424 | 31.586401 | 1429.123486 | 11.21895 | 505.750815 | 0.1 | 2.0 | 512.370164 | 50.914002 | 0.0 | 0.0 | Sort | Monte | 0.0 | 0 |
| 187 | 01/02/2022 | 12:36:58 | 408.627 | 6.910751 | 1.204293 | -0.000608 | 300.188686 | 31.754064 | 1428.979519 | 11.244542 | 505.624564 | 0.1 | 2.0 | 512.309254 | 50.906544 | 0.0 | 0.0 | Sort | Monte | 0.0 | 0 |
| 188 | 01/02/2022 | 12:37:00 | 410.679 | 6.907805 | 1.205854 | -0.000181 | 296.358074 | 31.563389 | 1415.224427 | 11.129375 | 502.464948 | 0.1 | 2.0 | 510.702313 | 50.742104 | 0.0 | 0.0 | Sort | Monte | 0.0 | 0 |
+-------+-------------+------------+-----------+----------+----------+------------+-------------------+-----------+-------------+-----------+------------+------------+--------------------------+------------+------------+-----+------------------+--------+-------------+-------------------+----+
A one line cumsum trick solves it.
cumsum(c(0L, diff(df1$N) != 0))
#> [1] 0 1 1 2 2 3 3 4 4 4 5 5 6 7 8 9 10
all.equal(
cumsum(c(0L, diff(df1$N) != 0)),
df1$CN
)
#> [1] TRUE
Created on 2022-02-14 by the reprex package (v2.0.1)
Data
x <- "
+---+----+
| N | CN |
+---+----+
| 0 | 0 |
| 1 | 1 |
| 1 | 1 |
| 2 | 2 |
| 2 | 2 |
| 0 | 3 |
| 0 | 3 |
| 1 | 4 |
| 1 | 4 |
| 1 | 4 |
| 2 | 5 |
| 2 | 5 |
| 3 | 6 |
| 4 | 7 |
| 0 | 8 |
| 1 | 9 |
| 2 | 10 |
+---+----+"
df1 <- read.table(textConnection(x), header = TRUE, sep = "|", comment.char = "+")[2:3]
Created on 2022-02-14 by the reprex package (v2.0.1)
I am working on a logistic regression for employee turnover.
The variables I have are
q3attendance object
q4attendance object
average_attendance object
training_days int64
esat float64
lastcompany object
client_category object
qual_category object
location object
rating object
role object
band object
resourcegroup object
skill object
status int64
I have marked the categorical variables using
cat = [' bandlevel ', ' resourcegroup ', ' skill ', ..]
I define x and y using x=df.iloc[:,:-1] and y=df.iloc[:,-1].
Next I need to create dummy variables. So, I use the command
xd = pd.get_dummies(x, drop_first=True)
After this, I expect the continuous variables to remain as they are and dummies to be created for all categorical variables. However, on executing the command, I find that the code treats the continuous variables as categorical as well and ends up creating dummies for them too. So, if the tenure is 3 years 2 months, 4 years 3 months, etc., 3.2 and 4.3 are both taken as categories. I end up with more than 1500 dummies, and it is a challenge to run the regression after that.
What am I missing? Should I specifically mark out the categorical variables when using get_dummies?
pd.get_dummies has an optional parameter columns which accepts a list of the columns for which you need to create the encoding.
For example:
df.head()
+----+------+--------------+-------------+---------------------------+----------+-----------------+
| | id | first_name | last_name | email | gender | ip_address |
|----+------+--------------+-------------+---------------------------+----------+-----------------|
| 0 | 1 | Lucine | Krout | lkrout0#sourceforge.net | Female | 199.158.46.27 |
| 1 | 2 | Sherm | Jullian | sjullian1#mapy.cz | Male | 8.97.22.209 |
| 2 | 3 | Derk | Mulloch | dmulloch2#china.com.cn | Male | 132.108.184.131 |
| 3 | 4 | Elly | Sulley | esulley3#com.com | Female | 63.177.149.251 |
| 4 | 5 | Brocky | Jell | bjell4#huffingtonpost.com | Male | 152.32.40.4 |
| 5 | 6 | Harv | Allot | hallot5#blogtalkradio.com | Male | 71.135.240.164 |
| 6 | 7 | Wolfie | Stable | wstable6#utexas.edu | Male | 211.31.189.141 |
| 7 | 8 | Harcourt | Dunguy | hdunguy7#whitehouse.gov | Male | 224.214.43.40 |
| 8 | 9 | Devina | Salerg | dsalerg8#furl.net | Female | 49.169.34.38 |
| 9 | 10 | Missie | Korpal | mkorpal9#wunderground.com | Female | 119.115.90.232 |
+----+------+--------------+-------------+---------------------------+----------+-----------------+
then
columns = ["gender"]
pd.get_dummies(df, columns=columns)
+----+------+--------------+-------------+---------------------------+-----------------+-----------------+---------------+
| | id | first_name | last_name | email | ip_address | gender_Female | gender_Male |
|----+------+--------------+-------------+---------------------------+-----------------+-----------------+---------------|
| 0 | 1 | Lucine | Krout | lkrout0#sourceforge.net | 199.158.46.27 | 1 | 0 |
| 1 | 2 | Sherm | Jullian | sjullian1#mapy.cz | 8.97.22.209 | 0 | 1 |
| 2 | 3 | Derk | Mulloch | dmulloch2#china.com.cn | 132.108.184.131 | 0 | 1 |
| 3 | 4 | Elly | Sulley | esulley3#com.com | 63.177.149.251 | 1 | 0 |
| 4 | 5 | Brocky | Jell | bjell4#huffingtonpost.com | 152.32.40.4 | 0 | 1 |
| 5 | 6 | Harv | Allot | hallot5#blogtalkradio.com | 71.135.240.164 | 0 | 1 |
| 6 | 7 | Wolfie | Stable | wstable6#utexas.edu | 211.31.189.141 | 0 | 1 |
| 7 | 8 | Harcourt | Dunguy | hdunguy7#whitehouse.gov | 224.214.43.40 | 0 | 1 |
| 8 | 9 | Devina | Salerg | dsalerg8#furl.net | 49.169.34.38 | 1 | 0 |
| 9 | 10 | Missie | Korpal | mkorpal9#wunderground.com | 119.115.90.232 | 1 | 0 |
+----+------+--------------+-------------+---------------------------+-----------------+-----------------+---------------+
This will encode only the gender column.
All data is auto-generated and doesn't represent the real world.
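Applied back to the original question, a hedged sketch along the same lines could look like this (column names are taken from the question's dtype listing; it assumes the attendance columns hold numeric strings that pd.to_numeric can coerce):
import pandas as pd

# Numeric-looking columns that are currently object dtype: coerce them so
# get_dummies leaves them alone (unparsable values become NaN).
num_cols = ['q3attendance', 'q4attendance', 'average_attendance']
x[num_cols] = x[num_cols].apply(pd.to_numeric, errors='coerce')

# Encode only the truly categorical columns.
cat_cols = ['lastcompany', 'client_category', 'qual_category', 'location',
            'rating', 'role', 'band', 'resourcegroup', 'skill']
xd = pd.get_dummies(x, columns=cat_cols, drop_first=True)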
I have loaded raw_data from MySQL using sqlalchemy and pymysql
engine = create_engine('mysql+pymysql://[user]:[passwd]@[host]:[port]/[database]')
df = pd.read_sql_table('data', engine)
df is something like this
| Age Category | Category |
|--------------|----------------|
| 31-26 | Engaged |
| 26-31 | Engaged |
| 31-36 | Not Engaged |
| Above 51 | Engaged |
| 41-46 | Disengaged |
| 46-51 | Nearly Engaged |
| 26-31 | Disengaged |
Then I performed the analysis as follows:
age = pd.crosstab(df['Age Category'], df['Category'])
| Category | A | B | C | D |
|--------------|---|----|----|---|
| Age Category | | | | |
| 21-26 | 2 | 2 | 4 | 1 |
| 26-31 | 7 | 11 | 12 | 5 |
| 31-36 | 3 | 5 | 5 | 2 |
| 36-41 | 2 | 4 | 1 | 7 |
| 41-46 | 0 | 1 | 3 | 2 |
| 46-51 | 0 | 0 | 2 | 3 |
| Above 51 | 0 | 3 | 0 | 6 |
I want to change it to a pandas DataFrame, something like this.
| Age Category | A | B | C | D |
|--------------|---|----|----|---|
| 21-26 | 2 | 2 | 4 | 1 |
| 26-31 | 7 | 11 | 12 | 5 |
| 31-36 | 3 | 5 | 5 | 2 |
| 36-41 | 2 | 4 | 1 | 7 |
| 41-46 | 0 | 1 | 3 | 2 |
| 46-51 | 0 | 0 | 2 | 3 |
| Above 51 | 0 | 3 | 0 | 6 |
Thank you for your time and consideration
Both texts are the columns name and the index name; the solution for changing them is to use DataFrame.rename_axis:
age = age.rename_axis(index=None, columns='Age Category')
Or set the columns name from the index name, and then set the index name back to its default, None:
age.columns.name = age.index.name
age.index.name = None
print (age)
Age Category Disengaged Engaged Nearly Engaged Not Engaged
26-31 1 1 0 0
31-26 0 1 0 0
31-36 0 0 0 1
41-46 1 0 0 0
46-51 0 0 1 0
Above 51 0 1 0 0
But these texts are something like metadata, so some functions may remove them.
I have a table with 4 columns. From this data I obtained another 2 tables with some rolling averages from the original table. Now I want to combine these 3 into a final table, but the indexes are not in order anymore and I can't do it. I just started to learn Python, I have zero experience, and I would really need all the help I can get.
DF
+----+------------+-----------+------+------+
| | A | B | C | D |
+----+------------+-----------+------+------+
| 1 | Home Team | Away Team | Htgs | Atgs |
| 2 | dalboset | sopot | 1 | 2 |
| 3 | calnic | resita | 1 | 3 |
| 4 | sopot | dalboset | 2 | 2 |
| 5 | resita | sopot | 4 | 1 |
| 6 | sopot | dalboset | 2 | 1 |
| 7 | caransebes | dalboset | 1 | 2 |
| 8 | calnic | resita | 1 | 3 |
| 9 | dalboset | sopot | 2 | 2 |
| 10 | calnic | resita | 4 | 1 |
| 11 | sopot | dalboset | 2 | 1 |
| 12 | resita | sopot | 1 | 2 |
| 13 | sopot | dalboset | 1 | 3 |
| 14 | caransebes | dalboset | 2 | 2 |
| 15 | calnic | resita | 4 | 1 |
| 16 | dalboset | sopot | 2 | 1 |
| 17 | calnic | resita | 1 | 2 |
| 18 | sopot | dalboset | 4 | 1 |
| 19 | resita | sopot | 2 | 1 |
| 20 | sopot | dalboset | 1 | 2 |
| 21 | caransebes | dalboset | 1 | 3 |
| 22 | calnic | resita | 2 | 2 |
+----+------------+-----------+------+------+
CODE
df1 = df.groupby('Home Team')[['Htgs', 'Atgs']].rolling(window=4, min_periods=3).mean()
df1 = df1.rename(columns={'Htgs': 'Htgs/3', 'Atgs': 'Htgc/3'})
df1
df2 = df.groupby('Away Team')[['Htgs', 'Atgs']].rolling(window=4, min_periods=3).mean()
df2 = df2.rename(columns={'Htgs': 'Atgc/3', 'Atgs': 'Atgs/3'})
df2
Now I need a solution to see the rolling-average columns next to the Home Team, Away Team, Htgs and Atgs columns from the original table.
Done!
I created a new column directly in the dataframe, like this:
df = pd.read_csv('Fd.csv')
df['Htgs/3'] = df.groupby('Home Team')['Htgs'].rolling(window=4, min_periods=3).mean().reset_index(0, drop=True)
Htgs/3 will be the new column with the rolling average of Htgs within each Home Team group, and for the rest I will do the same as in this part.
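For the remaining columns the same pattern applies; a sketch, reusing the renamed column names from df1/df2 above:
df['Htgc/3'] = df.groupby('Home Team')['Atgs'].rolling(window=4, min_periods=3).mean().reset_index(0, drop=True)
df['Atgc/3'] = df.groupby('Away Team')['Htgs'].rolling(window=4, min_periods=3).mean().reset_index(0, drop=True)
df['Atgs/3'] = df.groupby('Away Team')['Atgs'].rolling(window=4, min_periods=3).mean().reset_index(0, drop=True)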
I'm trying to create a new column which has a value based on 2 indices of that row. I have 2 dataframes with equivalent multi-index on the levels I'm querying (but not of equal size). For each row in the 1st dataframe, I want the value of the 2nd df that matches the row's indices.
I originally thought perhaps I could use a .loc[] and filter off the index values, but I cannot seem to get this to change the output row-by-row. If I wasn't using a dataframe object, I'd loop over the whole thing to do it.
I have tried to use the .apply() method, but I can't figure out what function to pass to it.
Creating some toy data with the same structure:
import pandas as pd
import numpy as np

np.random.seed(1)
df = pd.DataFrame({'Aircraft':np.ones(15),
'DC':np.append(np.repeat(['A','B'], 7), 'C'),
'Test':np.array([10,10,10,10,10,10,20,10,10,10,10,10,10,20,10]),
'Record':np.array([1,2,3,4,5,6,1,1,2,3,4,5,6,1,1]),
# There are multiple "value" columns in my data, but I have simplified here
'Value':np.random.random(15)
}
)
df.set_index(['Aircraft', 'DC', 'Test', 'Record'], inplace=True)
df.sort_index(inplace=True)
v = pd.DataFrame({'Aircraft':np.ones(7),
'DC':np.repeat('v',7),
'Test':np.array([10,10,10,10,10,10,20]),
'Record':np.array([1,2,3,4,5,6,1]),
'Value':np.random.random(7)
}
)
v.set_index(['Aircraft', 'DC', 'Test', 'Record'], inplace=True)
v.sort_index(inplace=True)
df['v'] = df.apply(lambda x: v.loc[df.iloc[x]])
Returns error for indexing on multi-index.
To set all values to a single "v" value:
df['v'] = float(v.loc[(slice(None), 'v', 10, 1), 'Value'])
So inputs look like this:
--------------------------------------------
| Aircraft | DC | Test | Record | Value |
|----------|----|------|--------|----------|
| 1.0 | A | 10 | 1 | 0.847576 |
| | | | 2 | 0.860720 |
| | | | 3 | 0.017704 |
| | | | 4 | 0.082040 |
| | | | 5 | 0.583630 |
| | | | 6 | 0.506363 |
| | | 20 | 1 | 0.844716 |
| | B | 10 | 1 | 0.698131 |
| | | | 2 | 0.112444 |
| | | | 3 | 0.718316 |
| | | | 4 | 0.797613 |
| | | | 5 | 0.129207 |
| | | | 6 | 0.861329 |
| | | 20 | 1 | 0.535628 |
| | C | 10 | 1 | 0.121704 |
--------------------------------------------
--------------------------------------------
| Aircraft | DC | Test | Record | Value |
|----------|----|------|--------|----------|
| 1.0 | v | 10 | 1 | 0.961791 |
| | | | 2 | 0.046681 |
| | | | 3 | 0.913453 |
| | | | 4 | 0.495924 |
| | | | 5 | 0.149950 |
| | | | 6 | 0.708635 |
| | | 20 | 1 | 0.874841 |
--------------------------------------------
And after the operation, I want this:
| Aircraft | DC | Test | Record | Value | v |
|----------|----|------|--------|----------|----------|
| 1.0 | A | 10 | 1 | 0.847576 | 0.961791 |
| | | | 2 | 0.860720 | 0.046681 |
| | | | 3 | 0.017704 | 0.913453 |
| | | | 4 | 0.082040 | 0.495924 |
| | | | 5 | 0.583630 | 0.149950 |
| | | | 6 | 0.506363 | 0.708635 |
| | | 20 | 1 | 0.844716 | 0.874841 |
| | B | 10 | 1 | 0.698131 | 0.961791 |
| | | | 2 | 0.112444 | 0.046681 |
| | | | 3 | 0.718316 | 0.913453 |
| | | | 4 | 0.797613 | 0.495924 |
| | | | 5 | 0.129207 | 0.149950 |
| | | | 6 | 0.861329 | 0.708635 |
| | | 20 | 1 | 0.535628 | 0.874841 |
| | C | 10 | 1 | 0.121704 | 0.961791 |
Edit:
Since you are on pandas 0.23.4, just change droplevel to reset_index with the option drop=True:
df_result = (df.reset_index('DC').assign(v=v.reset_index('DC', drop=True))
.set_index('DC', append=True)
.reorder_levels(v.index.names))
Original:
One way is to move the index level DC of df into the columns so that both frames share the (Aircraft, Test, Record) index, use assign to create the new column (assign aligns v on that shared index), and then set_index and reorder_levels to restore the original index:
df_result = (df.reset_index('DC').assign(v=v.droplevel('DC'))
.set_index('DC', append=True)
.reorder_levels(v.index.names))
Out[1588]:
Value v
Aircraft DC Test Record
1.0 A 10 1 0.847576 0.961791
2 0.860720 0.046681
3 0.017704 0.913453
4 0.082040 0.495924
5 0.583630 0.149950
6 0.506363 0.708635
20 1 0.844716 0.874841
B 10 1 0.698131 0.961791
2 0.112444 0.046681
3 0.718316 0.913453
4 0.797613 0.495924
5 0.129207 0.149950
6 0.861329 0.708635
20 1 0.535628 0.874841
C 10 1 0.121704 0.961791