How to extract first word from DataFrame - python

Background
I created the data frame below by combining two datasets from Kaggle:
Titanic: Machine Learning from Disaster
(input/titanic/train.csv)
titanic-nationalities
DataFrame name: output
   PassengerId  Nationality                 Name
0            1  CelticEnglish               Braund, Mr. Owen Harris
1            2  CelticEnglish               Cumings, Mrs. John Bradley (Florence Briggs Th...
2            3  Nordic,Scandinavian,Sweden  Heikkinen, Miss. Laina
3            4  CelticEnglish               Futrelle, Mrs. Jacques Heath (Lily May Peel)
....
What I hoped to transform
   PassengerId  Nationality    Name
0            1  CelticEnglish  Braund
1            2  CelticEnglish  Cumings
2            3  Nordic         Heikkinen
3            4  CelticEnglish  Futrelle
....
Problem
I ran the code below, but I don't know how to fix the following error.
Error
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
----> 1 output['Nationality'].split('\n', 1)[0]
2 output['Name'].split('\n', 1)[0]
/opt/conda/lib/python3.7/site-packages/pandas/core/generic.py in __getattr__(self, name)
5137 if self._info_axis._can_hold_identifiers_and_holds_name(name):
5138 return self[name]
-> 5139 return object.__getattribute__(self, name)
5140
5141 def __setattr__(self, name: str, value) -> None:
AttributeError: 'Series' object has no attribute 'split'
Code
output['Nationality'].split('\n', 1)[0]
output['Name'].split('\n', 1)[0]
What I tried to do
I tried converting the columns to strings first, but the result did not change.
output['Nationality'] = output['Nationality'].astype(str)
output['Name'] = output['Name'].astype(str)
output['Nationality'] = output['Nationality'].str.split('\n', expand=True)[0]
output['Name'] = output['Name'].str.split('\n', expand=True)[0]
output
   PassengerId  Nationality                 Name
0            1  CelticEnglish               Braund, Mr. Owen Harris
1            2  CelticEnglish               Cumings, Mrs. John Bradley (Florence Briggs Th...
2            3  Nordic,Scandinavian,Sweden  Heikkinen, Miss. Laina
3            4  CelticEnglish               Futrelle, Mrs. Jacques Heath (Lily May Peel)
Environment
Kaggle Notebook

A Series object doesn't have a split method; string methods are reached through the .str accessor (output['Nationality'].str.split(...)). If the column does not already hold strings, convert it first.
Check the data types of the columns with df.dtypes (an attribute, so no parentheses).
Convert a column with output['Nationality'] = output['Nationality'].astype(str).

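Try .str.split(), splitting on the comma rather than on '\n' (the values contain no newlines, which is why splitting on '\n' changed nothing):
output['Nationality'] = output['Nationality'].str.split(',').str[0]
output['Name'] = output['Name'].str.split(',').str[0]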
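As a runnable sketch of the .str accessor approach, assuming the separator of interest is a comma (as the sample rows suggest). The small frame below is a hypothetical stand-in for output, built from the rows shown in the question:

```python
import pandas as pd

# Hypothetical stand-in for the `output` frame from the question.
output = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Nationality": ["CelticEnglish", "CelticEnglish", "Nordic,Scandinavian,Sweden"],
    "Name": ["Braund, Mr. Owen Harris",
             "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
             "Heikkinen, Miss. Laina"],
})

# Split each string on the first comma only and keep the part before it.
for col in ["Nationality", "Name"]:
    output[col] = output[col].str.split(",", n=1).str[0].str.strip()

print(output)
```

Passing n=1 stops after the first comma, so the remainder of each string is never split further.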

Related

How to search for name in dataframe

For example, I want to find all the people who have "Abbott" in their name:
0 Abbing, Mr. Anthony
1 Abbott, Mr. Rossmore Edward
2 Abbott, Mrs. Stanton (Rosa Hunt)
3 Abelson, Mr. Samuel
4 Abelson, Mrs. Samuel (Hannah Wizosky)
...
886 de Mulder, Mr. Theodore
887 de Pelsmaeker, Mr. Alfons
888 del Carlo, Mr. Sebastiano
889 van Billiard, Mr. Austin Blyler
890 van Melkebeke, Mr. Philemon
Name: Name, Length: 891, dtype: object
df.loc[name in df["Name"]]
I tried this and it didn't work; it raised:
'False: boolean label can not be used without a boolean index'
You can use str.contains with the column you are interested in searching
>>> import pandas as pd
>>> df = pd.DataFrame(data={'Name': ['Smith', 'Jones', 'Smithson']})
>>> df
Name
0 Smith
1 Jones
2 Smithson
>>> df[df['Name'].str.contains('Smith')]
Name
0 Smith
2 Smithson
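If the column may contain missing values or mixed letter case, str.contains can be told how to handle both. A minimal sketch, assuming a hypothetical sample with one lowercase name and one missing value:

```python
import pandas as pd

# Hypothetical sample, including a lowercase variant and a missing value.
df = pd.DataFrame({"Name": ["Abbing, Mr. Anthony",
                            "Abbott, Mr. Rossmore Edward",
                            "abbott, Mrs. Stanton (Rosa Hunt)",
                            None]})

# case=False ignores letter case; na=False treats missing names as non-matches,
# so the mask is purely boolean and safe to use for indexing.
mask = df["Name"].str.contains("Abbott", case=False, na=False)
matches = df[mask]
print(matches)
```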

How to extract status in full name in pd.Dataframe column?

I have a dataset. Here is the 'Name' column:
0 Braund, Mr. Owen Harris
1 Cumings, Mrs. John Bradley (Florence Briggs Th...
2 Heikkinen, Miss. Laina
3 Futrelle, Mrs. Jacques Heath (Lily May Peel)
4 Allen, Mr. William Henry
...
151 Pears, Mrs. Thomas (Edith Wearne)
152 Meo, Mr. Alfonzo
153 van Billiard, Mr. Austin Blyler
154 Olsen, Mr. Ole Martin
155 Williams, Mr. Charles Duane
I need to extract the first name, status, and second name. On a plain string this works:
full_name="Braund, Mr. Owen Harris"
first_name=full_name.split(',')[0]
second_name=full_name.split('.')[1]
print('First name:',first_name)
print('Second name:',second_name)
status = full_name.replace(first_name, '').replace(',','').split('.')[0]
print('Status:',status)
>First name: Braund
>Second name: Owen Harris
>Status: Mr
But when I try to do this with pandas, extracting the status fails:
df['first_Name'] = df['Name'].str.split(',').str.get(0)  # this works well
But after this:
status= df['Name'].str.replace(df['first_Name'], '').replace(',','').split('.').str.get(0)
I get an error:
>>TypeError: 'Series' objects are mutable, thus they cannot be hashed
What are possible solutions?
Edit: Thanks for the answers and for str.extract. I tried:
def extract_name_data(row):
    row.str.extract('(?P<first_name>[^,]+), (?P<status>\w+.) (?P<second_name>[^(]+\w) ?')
    last_name = row['second_name']
    title = row['status']
    first_name = row['first_name']
    return first_name, second_name, status
and get
AttributeError: 'str' object has no attribute 'str'
What can be done? row is meant to be df['Name'].
You could use str.extract with named capturing groups (written as a raw string with the dot escaped, so that \. matches only a literal period):
df['Name'].str.extract(r'(?P<first_name>[^,]+), (?P<status>\w+\.) (?P<second_name>[^(]+\w) ?')
output:
     first_name    status  second_name
0    Braund        Mr.     Owen Harris
1    Cumings       Mrs.    John Bradley
2    Heikkinen     Miss.   Laina
3    Futrelle      Mrs.    Jacques Heath
4    Allen         Mr.     William Henry
5    Pears         Mrs.    Thomas
6    Meo           Mr.     Alfonzo
7    van Billiard  Mr.     Austin Blyler
8    Olsen         Mr.     Ole Martin
9    Williams      Mr.     Charles Duane
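Regarding the AttributeError in the edit: that error suggests the function was called once per string (for example through .apply), whereas the .str accessor exists only on a Series. A minimal sketch, assuming the function is handed the whole column; PATTERN and the three-row frame are illustrative names and data, not part of the original post:

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Braund, Mr. Owen Harris",
                            "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
                            "Heikkinen, Miss. Laina"]})

# The title's trailing dot is escaped so it matches only a literal ".".
PATTERN = r"(?P<first_name>[^,]+), (?P<status>\w+\.) (?P<second_name>[^(]+\w)"

def extract_name_data(names: pd.Series) -> pd.DataFrame:
    """Return one column per named group; call it on the whole column."""
    return names.str.extract(PATTERN)

parts = extract_name_data(df["Name"])  # pass the Series, not a single string
df = df.join(parts)
print(df)
```

Calling extract_name_data(df["Name"]) directly, rather than df["Name"].apply(extract_name_data), keeps the argument a Series.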
You can also put your original code, slightly modified, into the pandas .apply() function:
Just replace your Python variable names with the pandas column names.
For example, replace full_name with x['Name'] and first_name with x['first_Name'] within the lambda passed to .apply():
df['status'] = df.apply(lambda x: x['Name'].replace(x['first_Name'], '').replace(',','').split('.')[0], axis=1)
Though it may not be the most efficient way of doing it, it is an easy way to turn your existing Python code into a working pandas version.
Result:
print(df)
     Name                                                first_Name    status
0    Braund, Mr. Owen Harris                             Braund        Mr
1    Cumings, Mrs. John Bradley (Florence Briggs Th...   Cumings       Mrs
2    Heikkinen, Miss. Laina                              Heikkinen     Miss
3    Futrelle, Mrs. Jacques Heath (Lily May Peel)        Futrelle      Mrs
4    Allen, Mr. William Henry                            Allen         Mr
151  Pears, Mrs. Thomas (Edith Wearne)                   Pears         Mrs
152  Meo, Mr. Alfonzo                                    Meo           Mr
153  van Billiard, Mr. Austin Blyler                     van Billiard  Mr
154  Olsen, Mr. Ole Martin                               Olsen         Mr
155  Williams, Mr. Charles Duane                         Williams      Mr
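For comparison, a vectorized sketch that produces the same status values without a row-wise lambda. It assumes every name follows the "Surname, Title. Given names" layout seen in the sample; the pattern is an illustration, not taken from the answers above:

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Braund, Mr. Owen Harris",
                            "Heikkinen, Miss. Laina",
                            "van Billiard, Mr. Austin Blyler"]})

# Capture the text between the first comma and the following dot (dot excluded).
# expand=False returns a Series because the pattern has a single group.
df["status"] = df["Name"].str.extract(r",\s*([^.]+)\.", expand=False)
print(df)
```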

How do I read specific csv file into pandas df?

I'm having a problem reading the file titanic.csv into a pandas dataframe. The csv is delimited by ",", but when I try to read into pandas with the following code:
df = pd.read_csv("titanic_train.csv")
df.head()
I get an issue with all values ending up in the first column. I tried to add delimiter="," in the read command, but still no luck.
Any ideas on where I'm going wrong?
Thanks a lot!
Like others mentioned, a simple read_csv should have worked for you.
Here are a few ways to debug:
You can run the self-contained code below and see if it works.
You can copy and paste the included string into a text file and try to load it.
You can use an online Python editor, e.g. Google Colab, to ensure that it's not related to your local setup.
You can post a link to the CSV to get further help.
import pandas as pd
from io import StringIO
sample=StringIO('''PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35,0,0,373450,8.05,,S
6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
7,0,1,"McCarthy, Mr. Timothy J",male,54,0,0,17463,51.8625,E46,S
8,0,3,"Palsson, Master. Gosta Leonard",male,2,3,1,349909,21.075,,S
9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27,0,2,347742,11.1333,,S
''')
df = pd.read_csv(sample)
print(df)
Output:
PassengerId Survived Pclass ... Fare Cabin Embarked
0 1 0 3 ... 7.2500 NaN S
1 2 1 1 ... 71.2833 C85 C
2 3 1 3 ... 7.9250 NaN S
3 4 1 1 ... 53.1000 C123 S
4 5 0 3 ... 8.0500 NaN S
5 6 0 3 ... 8.4583 NaN Q
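Another debugging step is to look at the raw header line, since "everything in the first column" often means the file uses a separator other than the one pandas was told about. A minimal sketch, assuming a hypothetical semicolon-separated file held in the string raw; with a real file you would read its first line instead:

```python
from io import StringIO

import pandas as pd

# Hypothetical file content that uses ";" rather than ",".
raw = "PassengerId;Survived;Pclass\n1;0;3\n2;1;1\n"

# Inspect the header exactly as stored, including any stray quotes.
first_line = raw.splitlines()[0]
print(repr(first_line))

# Count a few candidate separators in the header and pick the most frequent.
candidates = [",", ";", "\t", "|"]
sep = max(candidates, key=first_line.count)

df = pd.read_csv(StringIO(raw), sep=sep)
print(df.shape)
```

pandas can also attempt this detection itself with sep=None together with engine="python", though checking the raw line first shows what is actually in the file.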

TypeError: find() takes at least 1 argument (0 given)

I have this dummy dataset,
df = pd.DataFrame(['Braund, Mr. Owen Harris','Cumings, Mrs.John','Heikkinen, Miss. Lainia', 'Futerelle, Mrs. Jacques Health', 'Allen, Mr. William Henry'], columns=['Names'])
which has head of
Names
0 Braund, Mr. Owen Harris
1 Cumings, Mrs.John
2 Heikkinen, Miss. Lainia
3 Futerelle, Mrs. Jacques Health
4 Allen, Mr. William Henry
I am trying to solve a dummy problem: finding the index of the first ',' in each value of the column, using this code
df['Names'].apply(str.find(','))
but it gives the following error.
TypeError: find() takes at least 1 argument (0 given)
Why does it raise this error even though I am providing the argument?
Two main issues:
You are calling find statically (on the str class instead of an instance), in which case it expects 2 arguments (the string to search and the substring). Here ',' is taken as the string to search, so no substring is given.
.apply accepts a function, but you gave it the result of calling str.find (which would be an integer if the call succeeded).
Pandas provides a str accessor that exposes the most common str methods and applies them in a vectorized way:
print(df.Names.str.find(','))
outputs
0 6
1 7
2 9
3 9
4 5
Name: Names, dtype: int64
You could still use Python's str.find, but you'd have to create a custom lambda:
print(df.Names.apply(lambda string: string.find(',')))
Also outputs
0 6
1 7
2 9
3 9
4 5
Name: Names, dtype: int64
But using the str accessor (or any other available accessor, or pandas method) will almost always be more efficient than a lambda passed to .apply.
Of course, you can reassign the result back to a new column in both cases:
df['First Comma Index'] = df.Names.str.find(',')
df['First Comma Index'] = df.Names.apply(lambda string: string.find(','))
You can access string methods for a column or Series directly using df['Names'].str. This'll let you do df['Names'].str.find(",").
You're getting the error because "str" here is just the class, not any particular string, so it's expecting an underlying string in which to look and doesn't find any.
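The difference between calling find on the class and on an instance can be checked directly. A small sketch, using one name from the question as sample data:

```python
import pandas as pd

s = "Braund, Mr. Owen Harris"

# Through the class, the string to search must be passed explicitly first.
assert str.find(s, ",") == s.find(",") == 6

# str.find(",") treats "," as the string to search and receives no substring,
# which is what raises the TypeError from the question.
try:
    str.find(",")
except TypeError as exc:
    error_message = str(exc)
    print(error_message)

# The .str accessor applies the instance method to every element.
names = pd.Series([s, "Allen, Mr. William Henry"], name="Names")
positions = names.str.find(",")
print(positions.tolist())  # [6, 5]
```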
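You can also loop over the rows to get the result, although this is slower than the str accessor:
Code -
df['Index'] = -1  # create the column first; -1 is what str.find returns for "not found"
for i in range(len(df)):
    string = df['Names'][i]
    df.loc[i, 'Index'] = string.find(',')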
Hopefully you can apply this method:
df["Finds"] = df["Names"].str.find(",")
This will give you the following result:
Names Finds
0 Braund, Mr. Owen Harris 6
1 Cumings, Mrs.John 7
2 Heikkinen, Miss. Lainia 9
3 Futerelle, Mrs. Jacques Health 9
4 Allen, Mr. William Henry 5
I hope this helps.

'DataFrame' object has no attribute 'melt'

I just want to use the melt function in pandas and I just keep on getting the same error.
Just typing the example provided by the documentation:
cheese = pd.DataFrame({'first' : ['John', 'Mary'],
'last' : ['Doe', 'Bo'],
'height' : [5.5, 6.0],
'weight' : [130, 150]})
I just get the error:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-119-dc0a0b96cf46> in <module>()
----> 1 cheese.melt(id_vars=['first', 'last'])
C:\Anaconda2\lib\site-packages\pandas\core\generic.pyc in __getattr__(self, name)
2670 if name in self._info_axis:
2671 return self[name]
-> 2672 return object.__getattribute__(self, name)
2673
2674 def __setattr__(self, name, value):
AttributeError: 'DataFrame' object has no attribute 'melt'
Your pandas version is below 0.20.0, so you need pandas.melt instead of DataFrame.melt:
df = pd.melt(cheese, id_vars=['first', 'last'])
print (df)
first last variable value
0 John Doe height 5.5
1 Mary Bo height 6.0
2 John Doe weight 130.0
3 Mary Bo weight 150.0
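If the same code has to run on both older and newer installations, one option is to check for the method before using it. A minimal sketch of that idea; long_df is an illustrative name:

```python
import pandas as pd

cheese = pd.DataFrame({"first": ["John", "Mary"],
                       "last": ["Doe", "Bo"],
                       "height": [5.5, 6.0],
                       "weight": [130, 150]})

# Show which pandas version is actually installed.
print(pd.__version__)

# Use the method when the installed pandas has it, otherwise the function.
if hasattr(cheese, "melt"):
    long_df = cheese.melt(id_vars=["first", "last"])
else:
    long_df = pd.melt(cheese, id_vars=["first", "last"])
print(long_df)
```

Upgrading pandas removes the need for the fallback, since both spellings are then available.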
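An alternative that does not rely on melt is to stack the value columns:
def grilled(d):
    return (d.set_index(['first', 'last'])
             .rename_axis('variable', axis=1)  # name the column axis explicitly
             .stack()
             .reset_index(name='value'))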
grilled(cheese)
first last variable value
0 John Doe height 5.5
1 John Doe weight 130.0
2 Mary Bo height 6.0
3 Mary Bo weight 150.0
