Python Pandas concatenate every 2nd row to previous row

I have a Pandas dataframe similar to this one:
age name sex
0 30 jon male
1 blue php null
2 18 jane female
3 orange c++ null
and I am trying to concatenate every second row to the previous one adding extra columns:
age name sex colour language other
0 30 jon male blue php null
1 18 jane female orange c++ null
I tried shift() but was duplicating every row.
How can this be done?

You can create a new dataframe by slicing the dataframe using iloc with a step of 2:
cols = ['age', 'name', 'sex']
new_cols = ['colour', 'language', 'other']
d = dict()
for col, ncol in zip(cols, new_cols):
d[col] = df[col].iloc[::2].values
d[ncol] = df[col].iloc[1::2].values
pd.DataFrame(d)
Result:
age colour name language sex other
0 30 blue jon php male NaN
1 18 orange jane c++ female NaN

TRY:
df = pd.concat([df.iloc[::2].reset_index(drop=True), pd.DataFrame(
df.iloc[1::2].values, columns=['colour', 'language', 'other'])], axis=1)
OUTPUT:
age name sex colour language other
0 30 jon male blue php NaN
1 18 jane female orange c++ NaN

Reshape the values and create a new dataframe
pd.DataFrame(df.values.reshape(-1, df.shape[1] * 2),
columns=['age', 'name', 'sex', 'colour', 'language', 'other'])
age name sex colour language other
0 30 jon male blue php NaN
1 18 jane female orange c++ NaN
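The reshape idea above can be wrapped in a small function that also checks its inputs. This is only a sketch: `pair_rows` is a hypothetical helper name, and the sample frame represents the question's `null` as `None`, which is an assumption about the original data.

```python
import pandas as pd

# Sample frame from the question; 'null' is represented as None (an assumption).
df = pd.DataFrame({
    'age': [30, 'blue', 18, 'orange'],
    'name': ['jon', 'php', 'jane', 'c++'],
    'sex': ['male', None, 'female', None],
})

def pair_rows(frame, extra_cols):
    """Hypothetical helper: append every second row to the row before it."""
    if len(frame) % 2 != 0:
        raise ValueError('expected an even number of rows')
    if len(extra_cols) != frame.shape[1]:
        raise ValueError('need one new column name per existing column')
    # Each output row is two consecutive input rows laid side by side.
    values = frame.to_numpy().reshape(-1, frame.shape[1] * 2)
    return pd.DataFrame(values, columns=list(frame.columns) + list(extra_cols))

out = pair_rows(df, ['colour', 'language', 'other'])
print(out)
```

Because the values pass through a single NumPy array, all resulting columns have object dtype; converting `age` back to a numeric type afterwards may be worthwhile.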

Related

In-place update in pandas: update the value of the cell based on a condition

DOB Name
0 1956-10-30 Anna
1 1993-03-21 Jerry
2 2001-09-09 Peter
3 1993-01-15 Anna
4 1999-05-02 James
5 1962-12-17 Jerry
6 1972-05-04 Kate
In a dataframe similar to the one above, I have duplicate names. I want to add the suffix '_0' to the name if the DOB is before 1990 and the name is a duplicate.
I am expecting a result like this
DOB Name
0 1956-10-30 Anna_0
1 1993-03-21 Jerry
2 2001-09-09 Peter
3 1993-01-15 Anna
4 1999-05-02 James
5 1962-12-17 Jerry_0
6 1972-05-04 Kate
I am using the following
df['Name'] = df[(df['DOB'] < '01-01-1990') & (df['Name'].isin(['Anna','Jerry']))].Name.apply(lambda x: x+'_0')
But I am getting this result
DOB Name
0 1956-10-30 Anna_0
1 1993-03-21 NaN
2 2001-09-09 NaN
3 1993-01-15 NaN
4 1999-05-02 NaN
5 1962-12-17 Jerry_0
6 1972-05-04 NaN
How can I add a suffix to a Name that is duplicated and belongs to someone born before 1990?
The problem with your df['Name'] = df[(df['DOB'] < '01-01-1990') & (df['Name'].isin(['Anna','Jerry']))].Name.apply(lambda x: x+'_0') is that df[(df['DOB'] < '01-01-1990') & (df['Name'].isin(['Anna','Jerry']))] is a filtered dataframe with fewer rows than the original. When you assign the result back, pandas aligns it on the index, so the rows that were filtered out have no corresponding value and become NaN.
You can try mask instead
m = (df['DOB'] < '1990-01-01') & df['Name'].duplicated(keep=False)
df['Name'] = df['Name'].mask(m, df['Name']+'_0')
You can use masks and boolean indexing:
# is the year before 1990?
m1 = pd.to_datetime(df['DOB']).dt.year.lt(1990)
# is the name duplicated?
m2 = df['Name'].duplicated(keep=False)
# if both conditions are True, add '_0' to the name
df.loc[m1&m2, 'Name'] += '_0'
output:
DOB Name
0 1956-10-30 Anna_0
1 1993-03-21 Jerry
2 2001-09-09 Peter
3 1993-01-15 Anna
4 1999-05-02 James
5 1962-12-17 Jerry_0
6 1972-05-04 Kate
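To see where the NaN values come from, it may help to look at the index of the filtered result. The sketch below rebuilds the sample data (assuming DOB is stored as datetime) and uses the illustrative variable names `partial` and `aligned`, which are not from the original posts.

```python
import pandas as pd

df = pd.DataFrame({
    'DOB': pd.to_datetime(['1956-10-30', '1993-03-21', '2001-09-09', '1993-01-15',
                           '1999-05-02', '1962-12-17', '1972-05-04']),
    'Name': ['Anna', 'Jerry', 'Peter', 'Anna', 'James', 'Jerry', 'Kate'],
})

# Rows born before 1990 whose name occurs more than once.
m = (df['DOB'] < pd.Timestamp('1990-01-01')) & df['Name'].duplicated(keep=False)

# The filtered Series only carries the index labels of the matching rows ...
partial = df.loc[m, 'Name'] + '_0'
# ... so aligning it to the full index leaves NaN for every other row.
aligned = partial.reindex(df.index)

# Assigning through .loc touches only the selected rows.
df.loc[m, 'Name'] = partial
print(df)
```

Here `aligned` reproduces what the original assignment wrote into the column, while the `.loc` assignment keeps the untouched names.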

Create categorical column in python from string values

I have a pandas dataframe that includes a "Name" column. Strings in the Name column may contain "Joe", "Bob", or "Joe Bob". I want to add a column for the type of person: just Joe, just Bob, or Both.
I was able to do this by creating boolean columns, turning them into strings, combining the strings, and then replacing the values. It just...didn't feel very elegant! I am new to Python...is there a better way to do this?
My original dataframe:
df = pd.DataFrame(data= [['Joe Biden'],['Bobby Kennedy'],['Joe Bob Briggs']], columns = ['Name'])
Name
0 Joe Biden
1 Bobby Kennedy
2 Joe Bob Briggs
I added two boolean columns to find names:
df['Joe'] = df.Name.str.contains('Joe')
df['Joe'] = df.Joe.astype('int')
df['Bob'] = df.Name.str.contains('Bob')
df['Bob'] = df.Bob.astype('int')
Now my dataframe looks like this:
df = pd.DataFrame(data= [['Joe Biden',1,0],['Bobby Kennedy',0,1],['Joe Bob Briggs',1,1]], columns = ['Name','Joe', 'Bob'])
Name Joe Bob
0 Joe Biden 1 0
1 Bobby Kennedy 0 1
2 Joe Bob Briggs 1 1
But what I really want is one "Type" column with categorical values: Joe, Bob, or Both.
To do that, I added a column to combine the booleans, then I replaced the values:
df["Type"] = df["Joe"].astype(str) + df["Bob"].astype(str)
Name Joe Bob Type
0 Joe Biden 1 0 10
1 Bobby Kennedy 0 1 01
2 Joe Bob Briggs 1 1 11
df['Type'] = df.Type.astype('str')
df['Type'] = df['Type'].replace({'11': 'Both', '10': 'Joe', '01': 'Bob'})
Name Joe Bob Type
0 Joe Biden 1 0 Joe
1 Bobby Kennedy 0 1 Bob
2 Joe Bob Briggs 1 1 Both
This feels clunky. Anyone have a better way?
Thanks!
You can use np.select to create the column Type.
You need to order your condlist correctly, from the most specific condition to the most general.
df['Type'] = np.select([df['Name'].str.contains('Joe') & df['Name'].str.contains('Bob'),
df['Name'].str.contains('Joe'),
df['Name'].str.contains('Bob')],
choicelist=['Both', 'Joe', 'Bob'])
Output:
>>> df
Name Type
0 Joe Biden Joe
1 Bobby Kennedy Bob
2 Joe Bob Briggs Both
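A slightly extended sketch of the same np.select idea: it computes each condition once, passes an explicit string default, and stores the result as a categorical column. The extra row 'Ann Lee' and the label 'Neither' are assumptions added for illustration; they are not part of the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Name': ['Joe Biden', 'Bobby Kennedy', 'Joe Bob Briggs', 'Ann Lee']})

has_joe = df['Name'].str.contains('Joe')
has_bob = df['Name'].str.contains('Bob')

# Conditions are checked in order, so the most specific one comes first.
labels = np.select(
    [has_joe & has_bob, has_joe, has_bob],
    ['Both', 'Joe', 'Bob'],
    default='Neither',  # assumed label for rows matching neither name
)

# Store the labels as a pandas categorical with a fixed set of categories.
df['Type'] = pd.Categorical(labels, categories=['Joe', 'Bob', 'Both', 'Neither'])
print(df)
```

Passing a string default keeps every choice the same type; with string choices, leaving np.select's default at its integer value of 0 can raise a dtype error on recent NumPy versions.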

How to map to multiple values in a dictionary in pandas

I have the following pandas df:
Name
Jack
Alex
Jackie
Susan
I also have the following dict:
d = {'Jack':['Male','22'],'Alex':['Male','26'],'Jackie':['Female','28'],'Susan':['Female','30']}
I would like to add two columns for Gender and Age so that my df returns:
Name Gender Age
Jack Male 22
Alex Male 26
Jackie Female 28
Susan Female 30
I have tried:
df['Gender'] = df.Name.map(d[0])
df['Age'] = df.Name.map(d[1])
but no such luck. Any ideas or help would be much appreciated! Thanks!
df['Gender'] = df.Name.map(lambda x: d[x][0])
df['Age'] = df.Name.map(lambda x: d[x][1])
Take all the values of the dictionary
d = {'Jack':['Male','22'],'Alex':['Male','26'],'Jackie':['Female','28'],'Susan':['Female','30']}
value_list = list(d.values())
df = pd.DataFrame(value_list, columns =['Gender', 'Age'])
print(df)
Use pd.DataFrame constructor with Series.map and use pd.concat to concat with df:
df = pd.concat([df, pd.DataFrame(df.Name.map(d).tolist(), columns=['Gender', 'Age'])], axis=1)
print(df)
Name Gender Age
0 Jack Male 22
1 Alex Male 26
2 Jackie Female 28
3 Susan Female 30
The solutions below also work if a name has no match in the dictionary, for example:
d = {'Alex':['Male','26'],'Jackie':['Female','28'],'Susan':['Female','30']}
print (df)
Name Gender Age
0 Alex Male 26
1 Jack NaN NaN
2 Jackie Female 28
3 Susan Female 30
Use DataFrame.from_dict on your dictionary and add the result to df by matching on the Name column with DataFrame.join. The advantage is that it works the same way if the input data has more columns:
df = df.join(pd.DataFrame.from_dict(d, orient='index', columns=['Gender','Age']), on='Name')
print (df)
Name Gender Age
0 Jack Male 22
1 Alex Male 26
2 Jackie Female 28
3 Susan Female 30
Your solution should work if you create 2 dictionaries:
d1 = {k:v[0] for k,v in d.items()}
d2 = {k:v[1] for k,v in d.items()}
df['Gender'] = df.Name.map(d1)
df['Age'] = df.Name.map(d2)
print (df)
Name Gender Age
0 Jack Male 22
1 Alex Male 26
2 Jackie Female 28
3 Susan Female 30
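For completeness, here is a self-contained sketch of the from_dict/join approach on a dictionary that lacks one of the names. It also converts the ages, which the dictionary stores as strings, to numbers. The variable names `lookup` and `out` are illustrative only.

```python
import pandas as pd

# 'Jack' is deliberately missing from the dictionary.
d = {'Alex': ['Male', '26'], 'Jackie': ['Female', '28'], 'Susan': ['Female', '30']}
df = pd.DataFrame({'Name': ['Jack', 'Alex', 'Jackie', 'Susan']})

# One row per dictionary key, one column per list position.
lookup = pd.DataFrame.from_dict(d, orient='index', columns=['Gender', 'Age'])
# The ages are strings in the dictionary; convert them to numbers.
lookup['Age'] = pd.to_numeric(lookup['Age'])

# Left join on Name: names missing from the dictionary get NaN.
out = df.join(lookup, on='Name')
print(out)
```

Because of the missing name, the Age column ends up as float after the join.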

How can I count how many male/female are in each title?

I am a newbie to datascience and I want to count how many female/male are in each Title.
I tried the following piece of code:
newdf = pd.DataFrame()
newdf['Title'] = full['Name'].map(lambda name: name.split(',')[1].split('.')[0].strip())
newdf['Age'] = full['Age']
newdf['Sex'] = full['Sex']
newdf.dropna(axis = 0,inplace=True)
print(newdf.head())
What I get is :
Title Age Sex
0 Mr 22.0 male
1 Mrs 38.0 female
2 Miss 26.0 female
3 Mrs 35.0 female
4 Mr 35.0 male
Then I am trying this to add #male,#female columns
df = pd.DataFrame()
df = newdf[['Age','Title']].groupby('Title').mean().sort_values(by='Age',ascending=False)
df['#People'] = newdf['Title'].value_counts()
df['Male'] = newdf['Title'].sum(newdf['Sex']=='male')
df['Female'] = newdf['Title'].sum(newdf['Sex']=='female')
Error message that I have:
TypeError: 'Series' objects are mutable, thus they cannot be hashed
What I expected is to have four columns: Title,Age(average),#People, #male,#female. So I want to know how many of those #people are male and female
P.S Without these lines :
df['Male'] = newdf['Title'].sum(newdf['Sex']=='male')
df['Female'] = newdf['Title'].sum(newdf['Sex']=='female')
everything works fine,and I get:
Age #People
Title
Capt 70.000000 1
Col 54.000000 4
Sir 49.000000 1
Major 48.500000 2
Lady 48.000000 1
Dr 43.571429 7
....
But without #male,#female.
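Use GroupBy.agg to aggregate the mean together with the group size, then add the new columns from crosstab with DataFrame.join (the crosstab columns hold counts per sex; the '_avg' suffix in the code is only a label):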
df1 = (df.groupby('Title')['Age']
.agg([('Age','mean'),('#People','size')])
.sort_values(by='Age',ascending=False))
df2 = pd.crosstab(df['Title'], df['Sex']).add_suffix('_avg')
df = df1.join(df2)
print (df)
Age #People female_avg male_avg
Title
Mrs 36.5 2 2 0
Mr 28.5 2 0 2
Miss 26.0 1 1 0
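The same result can be sketched with named aggregation on the five sample rows shown in the question. The column name `People` stands in for `#People` because keyword arguments cannot contain `#`, and the variable names `stats`, `counts` and `result` are illustrative only.

```python
import pandas as pd

# The five rows printed by newdf.head() in the question.
newdf = pd.DataFrame({
    'Title': ['Mr', 'Mrs', 'Miss', 'Mrs', 'Mr'],
    'Age': [22.0, 38.0, 26.0, 35.0, 35.0],
    'Sex': ['male', 'female', 'female', 'female', 'male'],
})

# Named aggregation: mean age and number of people per title.
stats = (newdf.groupby('Title')
              .agg(Age=('Age', 'mean'), People=('Age', 'size'))
              .sort_values(by='Age', ascending=False))

# crosstab counts how often each Sex value occurs within each Title.
counts = pd.crosstab(newdf['Title'], newdf['Sex'])

# Both frames are indexed by Title, so join lines them up row by row.
result = stats.join(counts)
print(result)
```

Titles with no members of one sex get a count of 0 in that column rather than a missing value.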

add a row at top in pandas dataframe [duplicate]

Below is my dataframe
import pandas as pd
df = pd.DataFrame({'name': ['jon','sam','jane','bob'],
'age': [30,25,18,26],
'sex':['male','male','female','male']})
age name sex
0 30 jon male
1 25 sam male
2 18 jane female
3 26 bob male
I want to insert a new row at the first position
name: dean, age: 45, sex: male
age name sex
0 45 dean male
1 30 jon male
2 25 sam male
3 18 jane female
4 26 bob male
What is the best way to do this in pandas?
Probably this is not the most efficient way but:
df.loc[-1] = [45, 'Dean', 'male'] # adding a row (values in column order: age, name, sex)
df.index = df.index + 1 # shifting index
df.sort_index(inplace=True)
Output:
age name sex
0 45 Dean male
1 30 jon male
2 25 sam male
3 18 jane female
4 26 bob male
If it's going to be a frequent operation, then it makes sense (in terms of performance) to gather the data into a list first and then use pd.concat([], ignore_index=True) (similar to @Serenity's solution):
Demo:
data = []
# always inserting new rows at the first position - last row will be always on top
data.insert(0, {'name': 'dean', 'age': 45, 'sex': 'male'})
data.insert(0, {'name': 'joe', 'age': 33, 'sex': 'male'})
#...
pd.concat([pd.DataFrame(data), df], ignore_index=True)
In [56]: pd.concat([pd.DataFrame(data), df], ignore_index=True)
Out[56]:
age name sex
0 33 joe male
1 45 dean male
2 30 jon male
3 25 sam male
4 18 jane female
5 26 bob male
PS I wouldn't call .append(), pd.concat(), .sort_index() too frequently (for each single row) as it's pretty expensive. So the idea is to do it in chunks...
@edyvedy13's solution worked great for me. However, it needs to be updated because pandas' sort method is deprecated; sort_index replaces it.
df.loc[-1] = [45, 'Dean', 'male'] # adding a row (values in column order: age, name, sex)
df.index = df.index + 1 # shifting index
df = df.sort_index() # sorting by index
Use pandas.concat and reset the index of the new dataframe:
import pandas as pd
df = pd.DataFrame({'name': ['jon','sam','jane','bob'],
'age': [30,25,18,26],
'sex':['male','male','female','male']})
# new line
line = pd.DataFrame({'name': 'dean', 'age': 45, 'sex': 'male'}, index=[0])
# concatenate two dataframe
df2 = pd.concat([line, df]).reset_index(drop=True)
print (df2)
Output:
age name sex
0 45 dean male
1 30 jon male
2 25 sam male
3 18 jane female
4 26 bob male
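If rows are prepended in more than one place, the concat approach can be wrapped in a small function. This is a sketch only; `prepend_rows` is a hypothetical helper name, not a pandas function.

```python
import pandas as pd

def prepend_rows(frame, rows):
    """Hypothetical helper: put a list of row dicts on top of frame."""
    # Build the new rows with the same column order as the target frame.
    top = pd.DataFrame(rows, columns=frame.columns)
    # ignore_index=True renumbers the result 0..n-1.
    return pd.concat([top, frame], ignore_index=True)

df = pd.DataFrame({'name': ['jon', 'sam', 'jane', 'bob'],
                   'age': [30, 25, 18, 26],
                   'sex': ['male', 'male', 'female', 'male']})

out = prepend_rows(df, [{'name': 'dean', 'age': 45, 'sex': 'male'}])
print(out)
```

Passing the rows as dictionaries avoids depending on the column order, and collecting several rows into one call keeps the number of concat operations low.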
import pandas as pd
df = pd.DataFrame({'name': ['jon','sam','jane','bob'],
'age': [30,25,18,26],
'sex': ['male','male','female','male']})
df1 = pd.DataFrame({'name': ['dean'], 'age': [45], 'sex':['male']})
df1 = pd.concat([df1, df]) # DataFrame.append was removed in pandas 2.0
df1 = df1.reset_index(drop=True)
That works
This works for me.
>>> import pandas as pd
>>> df = pd.DataFrame({'name': ['jon','sam','jane','bob'],
... 'age': [30,25,18,26],
... 'sex':['male','male','female','male']})
>>> df
age name sex
0 30 jon male
1 25 sam male
2 18 jane female
3 26 bob male
>>> df.loc['a']=[45,'dean','male']
>>> df
age name sex
0 30 jon male
1 25 sam male
2 18 jane female
3 26 bob male
a 45 dean male
>>> newIndex=['a']+[ind for ind in df.index if ind!='a']
>>> df=df.reindex(index=newIndex)
>>> df
age name sex
a 45 dean male
0 30 jon male
1 25 sam male
2 18 jane female
3 26 bob male
