This question already has answers here:
how do I insert a column at a specific column index in pandas?
(6 answers)
Closed 2 years ago.
I would like to create a new column in Python and place in a specific position. For instance, let "example" be the following dataframe:
import pandas as pd
example = pd.DataFrame({
'name': ['alice','bob','charlie'],
'age': [25,26,27],
'job': ['manager','clerk','policeman'],
'gender': ['female','male','male'],
'nationality': ['french','german','american']
})
I would like to create a new column to contain the values of the column "age":
example['age_times_two']= example['age'] *2
Yet, this code creates a column at the end of the dataframe. I would like to place it as the third column, or, in other words, the column right next to the column "age". How could this be done:
a) By setting an absolute place to the new column (e.g. third position)?
b) By setting a relative place for the new column (e.g. right to the column "age")?
You can use df.insert here.
example.insert(2,'age_times_two',example['age']*2)
example
name age age_times_two job gender nationality
0 alice 25 50 manager female french
1 bob 26 52 clerk male german
2 charlie 27 54 policeman male american
This is a bit manual way of doing it:
example['age_times_two']= example['age'] *2
cols = list(example.columns.values)
cols
You get a list of all the columns, and you can rearrange them manually and place them in the code below
example = example[['name', 'age', 'age_times_two', 'job', 'gender', 'nationality']]
Another way to do it:
example['age_times_two']= example['age'] *2
cols = example.columns.tolist()
cols = cols[:2]+cols[-1:]+cols[2:-1]
example = example[cols]
print(example)
.
name age age_times_two job gender nationality
0 alice 25 50 manager female french
1 bob 26 52 clerk male german
2 charlie 27 54 policeman male american
Related
From the dataframe, I create a new dataframe, in which the values from the "Select activity" column contain lists, which I will split and transform into new rows. But there is a value: "Nothing, just walking", which I need to leave unchanged. Tell me, please, how can I do this?
The original dataframe looks like this:
Name Age Select activity Profession
0 Ann 25 Cycling, Running Saleswoman
1 Mark 30 Nothing, just walking Manager
2 John 41 Cycling, Running, Swimming Accountant
My code looks like this:
df_new = df.loc[:, ['Name', 'Age']]
df_new['Activity'] = df['Select activity'].str.split(', ')
df_new = df_new.explode('Activity').reset_index(drop=True)
I get this result:
Name Age Activity
0 Ann 25 Cycling
1 Ann 25 Running
2 Mark 30 Nothing
3 Mark 30 just walking
4 John 41 Cycling
5 John 41 Running
6 John 41 Swimming
In order for the value "Nothing, just walking" not to be divided by 2 values, I added the following line:
if df['Select activity'].isin(['Nothing, just walking']) is False:
But it throws an error.
then let's look ahead after comma to guarantee a Capital letter, and only then split. So instead of , we have , (?=[A-Z])
df_new = df.loc[:, ["Name", "Age"]]
df_new["Activity"] = df["Select activity"].str.split(", (?=[A-Z])")
df_new = df_new.explode("Activity", ignore_index=True)
i only changed the splitter, and ignore_index=True to explode instead of resetting afterwards (also the single quotes..)
to get
>>> df_new
Name Age Activity
0 Ann 25 Cycling
1 Ann 25 Running
2 Mark 30 Nothing, just walking
3 John 41 Cycling
4 John 41 Running
5 John 41 Swimming
one line as usual
df_new = (df.loc[:, ["Name", "Age"]]
.assign(Activity=df["Select activity"].str.split(", (?=[A-Z])"))
.explode("Activity", ignore_index=True))
This is just an oversimplification but I have this large categorical data.
Name Age Gender
John 12 Male
Ana 24 Female
Dave 16 Female
Cynthia 17 Non-Binary
Wayne 26 Male
Hebrew 29 Non-Binary
Suppose that it is assigned as df and I want it to return as a list with non-duplicate values:
'Male','Female','Non-Binary'
I tried it with this code, but this returns the gender with duplicates
list(df['Gender'])
How can I code it in pandas so that it can return values without duplicates?
In these cases you have to remember that df["Gender"] is a Pandas Series so you could use .drop_duplicates() to retrieve another Pandas Series with the duplicated values removed or use .unique() to retrieve a Numpy Array containing the unique values.
>> df["Gender"].drop_duplicates()
0 Male
1 Female
3 Non-Binary
4 Male
Name: Gender, dtype: object
>> df["Gender"].unique()
['Male ' 'Female' 'Non-Binary' 'Male']
This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 1 year ago.
I have 2 csv file. They have one common column which is ID. What I want to do is I want to extract the common rows and built another dataframe. Firstly, I want to select job, and after that, as I said they have one common column, I want to find the rows whose IDs are the same. Visually, the dataframe should be seen like this:
Let first DataFrame is:
#ID
#Gender
#Job
#Shift
#Wage
1
Male
Engineer
Night
8000
2
Male
Engineer
Night
7865
3
Female
Worker
Day
5870
4
Male
Accountant
Day
5870
5
Female
Architecture
Day
4900
Let second one is:
#ID
#Department
1
IT
2
Quality Control
5
Construction
7
Construction
8
Human Resources
And the new DataFrame should be like:
#ID
#Department
#Job
#Wage
1
IT
Engineer
8000
2
Quality Control
Engineer
7865
5
Construction
Architecture
4900
You can use:
df_result = df1.merge(df2, on = 'ID', how = 'inner')
If you want to select only certain columns from a certain df use:
df_result = df1[['ID','Job', 'Wage']].merge(df2[['ID', 'Department']], on = `ID`, how = 'inner')
Use:
df = df2.merge(df1[['ID','Job', 'Wage']], on='ID')
for example, there is a column in a dataframe, 'ID'.
One of the entries is for example, '13245993, 3004992'
I only want to get '13245993'.
That also applies for every row in column 'ID'.
How to change the data in each row in column 'ID'?
You can try like this, apply slicing on ID column to get the required result. I am using 3 chars as no:of chars here
import pandas as pd
data = {'Name':['Tom', 'nick', 'krish', 'jack'],
'ID':[90877, 10909, 12223, 12334]}
df=pd.DataFrame(data)
print('Before change')
print(df)
df["ID"]=df["ID"].apply(lambda x: (str(x)[:3]))
print('After change')
print(df)
output
Before change
Name ID
0 Tom 90877
1 nick 10909
2 krish 12223
3 jack 12334
After change
Name ID
0 Tom 908
1 nick 109
2 krish 122
3 jack 123
You could do something like
data[data['ID'] == '13245993']
this will give you the columns where ID is 13245993
More Indepth Code
I hope this answers your question if not please let me know.
With best regards
I have two csv files as my raw data to read into different dataframes. One is called 'employee' and another is called 'origin'. However, I cannot upload the files here so I hardcoded the data into the dataframes below. The task I'm trying to solve is to update the 'Eligible' column in employee_details with 'Yes' or 'No' based on the value of the 'Country' column in origin_details. If Country = UK, then put 'Yes' in the Eligible column for that Personal_ID. Else, put 'No'.
import pandas as pd
import numpy as np
employee = {
'Personal_ID': ['1000123', '1100258', '1104682', '1020943'],
'Name': ['Tom', 'Joseph', 'Krish', 'John'],
'Age': ['40', '35', '43', '51'],
'Eligible': ' '}
origin = {
'Personal_ID': ['1000123', '1100258', '1104682', '1020943', '1573482', '1739526'],
'Country': ['UK', 'USA', 'FRA', 'SWE', 'UK', 'AU']}
employee_details = pd.DataFrame(employee)
origin_details = pd.DataFrame(origin)
employee_details['Eligible'] = np.where((origin_details['Country']) == 'UK', 'Yes', 'No')
print(employee_details)
print(origin_details)
The output of above code shows the below error message:
ValueError: Length of values (6) does not match length of index (4)
However, I am expecting to see the below as my output.
Personal_ID Name Age Eligible
0 1000123 Tom 40 Yes
1 1100258 Joseph 35 No
2 1104682 Krish 43 No
3 1020943 John 51 No
I also don't want to delete anything in my dataframes to match the size specified in the ValueError message because I may need the extra Personal_IDs in the origin_details later. Alternatively, I can keep all the existing Personal_ID's in the raw data (employee_details, origin_details) and create a new dataframe from those to extract the records which have the same Personal_ID's and determine the np.where() condition from there.
Please advise! Any helps are appreciated, thank you!!
You can merge the 2 dataframes on Personal ID and then use np.where
Merge with how='outer' to keep all personal IDs
df_merge = pd.merge(employee_details, origin_details, on='Personal_ID', how='outer')
df_merge['Eligible'] = np.where(df_merge['Country']=='UK', 'Yes', 'No')
Personal_ID Name Age Eligible Country
0 1000123 Tom 40 Yes UK
1 1100258 Joseph 35 No USA
2 1104682 Krish 43 No FRA
3 1020943 John 51 No SWE
4 1573482 NaN NaN Yes UK
5 1739526 NaN NaN No AU
If you dont want to keep all personal IDs then you can merge with how='inner' and you won't see the NANs
df_merge = pd.merge(employee_details, origin_details, on='Personal_ID', how='inner')
df_merge['Eligible'] = np.where(df_merge['Country']=='UK', 'Yes', 'No')
Personal_ID Name Age Eligible Country
0 1000123 Tom 40 Yes UK
1 1100258 Joseph 35 No USA
2 1104682 Krish 43 No FRA
3 1020943 John 51 No SWE
You are using a Pandas Series object inside a Numpy method, np.where((origin_details['Country'])). I believe this is the problem.
try:
employee_details['Eligible'] = origin_details['Country'].apply(lambda x:"Yes" if x=='UK' else "No")
It is always much easier and faster to use the pandas library to analyze dataframes instead of converting them back to numpy arrays
Well, the first thing I want to answer about is the exception and how lucky you are that it didn't if your tables were the same length your code was going to work.
but there is an assumption in the code that I don't think you thought about and that is that the ids may not be in the same order or like in the example there are more ids in some table than the other if you had the same length of tables but not the same order you would have got incorrect eligible values for each row. the current way to do this is as follow
first join the table to one using personal_id but use left join as you don't want to lose data if there is no origin info for that personal id.
combine_df = pd.merge(employee_details, origin_details, on='Personal_ID', how='left')
use the apply function to fill the new column
combine_df['Eligible'] = combine_df['Country'].apply(lambda x:'Yes' if x=='UK' else 'No')