Python: How to split a string column in a dataframe?

I have a dataframe with two columns: Date and Location (object dtype). Below is the format of the Location column, with sample values:
Date Location
1 07/12/1912 AtlantiCity, New Jersey
2 08/06/1913 Victoria, British Columbia, Canada
3 09/09/1913 Over the North Sea
4 10/17/1913 Near Johannisthal, Germany
5 03/05/1915 Tienen, Belgium
6 09/03/1915 Off Cuxhaven, Germany
7 07/28/1916 Near Jambol, Bulgeria
8 09/24/1916 Billericay, England
9 10/01/1916 Potters Bar, England
10 11/21/1916 Mainz, Germany
My requirement is to split Location on the "," separator and keep only the last part (e.g. New Jersey, Canada, Germany, England) in the Location column. I also have to check whether a value has only a single element (i.e. it contains no ",").
Is there a way to do this with a built-in method, without looping over every row?
Sorry if the question is not up to standard; I am new to Python and still learning.

A straightforward way is to apply the split method to each element of the column, pick the last piece, and strip the whitespace left behind by the separator:
df.Location.apply(lambda x: x.split(",")[-1].strip())
1 New Jersey
2 Canada
3 Over the North Sea
4 Germany
5 Belgium
6 Germany
7 Bulgeria
8 England
9 England
10 Germany
Name: Location, dtype: object
To check whether a cell has more than one element, we can use the str.contains method on the column; False marks the values with a single element:
df.Location.str.contains(",")
1 True
2 True
3 False
4 True
5 True
6 True
7 True
8 True
9 True
10 True
Name: Location, dtype: bool
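The same result can also be reached without apply by chaining the vectorized .str methods. This is a minimal sketch, assuming a small frame rebuilt from the first few rows of the question; the sample data and the names last_part and single_element are only for illustration:

```python
import pandas as pd

# Sample rows copied from the question (illustrative subset)
df = pd.DataFrame({
    "Location": [
        "AtlantiCity, New Jersey",
        "Victoria, British Columbia, Canada",
        "Over the North Sea",
        "Near Johannisthal, Germany",
    ]
})

# Split on commas, take the last piece of each list, trim whitespace
last_part = df["Location"].str.split(",").str[-1].str.strip()

# True where the value has no comma, i.e. a single element
single_element = ~df["Location"].str.contains(",", regex=False)

print(last_part.tolist())
print(single_element.tolist())
```

Assigning last_part back with df["Location"] = last_part would overwrite the column in place, which is what the question asks for.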

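We could also try str.extract, capturing everything after the last comma (the leading \s* keeps the space that follows the comma out of the captured group):
print(df['Location'].str.extract(r'\s*([^,]+)$'))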
#0 New Jersey
#1 Canada
#2 Over the North Sea
#3 Germany
#4 Belgium
#5 Germany
#6 Bulgeria
#7 England
#8 England
#9 Germany

Related

Flag if a row is duplicated and attach whether it's the 1st, 2nd, etc. duplicated row

I'd like to flag whether a row is duplicated, and attach whether it belongs to the 1st, 2nd, 3rd, etc. group of duplicated rows in a Pandas DataFrame.
More visually, I'd like to go from:
id Country City
1 France Paris
2 France Paris
3 France Lyon
4 France Lyon
5 France Lyon
to
id Country City duplicated_flag
1 France Paris 1
2 France Paris 1
3 France Lyon 2
4 France Lyon 2
5 France Lyon 2
Note that id is not taken into account to see if the row is duplicated.
Two options:
First, if you have lots of columns that you need to compare, you can use:
comparison_df = df.drop("id", axis=1)
df["duplicated_flag"] = (comparison_df != comparison_df.shift()).any(axis=1).cumsum()
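We drop the columns that aren't needed in the comparison. Then we check whether each row differs from the one above it using .shift() and .any(axis=1). Finally, .cumsum() turns those change points into a running group number, which becomes the value of duplicated_flag. Note that this relies on duplicated rows being adjacent to each other.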
But, if you only have two columns to compare (or if for some reason you have lots of columns that you need to drop), you can find mismatched rows one at a time, and then use .cumsum() to get the value of duplicated_flag for each row. It's a bit more verbose so I'm not super happy with this option, but I'm leaving this here for completeness in case this suits your use case better:
country_comparison = df["Country"].ne(df["Country"].shift())
city_comparison = df["City"].ne(df["City"].shift())
df["duplicated_flag"] = (country_comparison | city_comparison).cumsum()
print(df)
Both options output:
id Country City duplicated_flag
0 1 France Paris 1
1 2 France Paris 1
2 3 France Lyon 2
3 4 France Lyon 2
4 5 France Lyon 2
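If the duplicated rows are not guaranteed to sit next to each other, the shift-based comparison starts a new number every time the values change. A possible alternative, sketched here on made-up data with the rows deliberately interleaved, is to number each distinct (Country, City) pair with groupby(...).ngroup():

```python
import pandas as pd

# Illustrative data: the Paris and Lyon rows are interleaved on purpose
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "Country": ["France"] * 5,
    "City": ["Paris", "Lyon", "Paris", "Lyon", "Lyon"],
})

# sort=False numbers the groups in order of first appearance;
# ngroup() is zero-based, so add 1 to start the flag at 1
df["duplicated_flag"] = df.groupby(["Country", "City"], sort=False).ngroup() + 1

print(df)
```

On the adjacent data from the question, this gives the same 1, 1, 2, 2, 2 flags as the two options above.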

How to Split a column into two by comma delimiter, and put a value without comma in second column and not in first?

I have a column in a df that I want to split into two columns splitting by comma delimiter. If the value in that column does not have a comma I want to put that into the second column instead of first.
Origin
New York, USA
England
Russia
London, England
California, USA
USA
I want the result to be:
Location Country
New York USA
NaN England
NaN Russia
London England
California USA
NaN USA
I used this code:
df['Location'], df['Country'] = df['Origin'].str.split(',', 1)
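We can try using str.extract here:
# Everything before the last comma; NaN when there is no comma
df["Location"] = df["Origin"].str.extract(r'(.*),', expand=False)
# The trailing run of space-separated words
df["Country"] = df["Origin"].str.extract(r'(\w+(?: \w+)*)$', expand=False)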
Here is a way using str.extract() with named groups:
df['Origin'].str.extract(r'(?P<Location>[A-Za-z ]+(?=,))?(?:, )?(?P<Country>\w+)')
Output:
Location Country
0 New York USA
1 NaN England
2 NaN Russia
3 London England
4 California USA
5 NaN USA
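For comparison, here is a sketch of a regex-free variant built on str.split. One likely reason the attempt in the question fails is that unpacking a Series into two names iterates over its rows rather than over the split pieces. The names parts and has_comma below are only illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "Origin": [
        "New York, USA", "England", "Russia",
        "London, England", "California, USA", "USA",
    ]
})

# Split at most once; each cell becomes a list of one or two pieces
parts = df["Origin"].str.split(",", n=1)
has_comma = df["Origin"].str.contains(",", regex=False)

# The last piece is always the country
df["Country"] = parts.str[-1].str.strip()
# The first piece is a location only when a comma was present
df["Location"] = parts.str[0].str.strip().where(has_comma)

print(df[["Location", "Country"]])
```

Series.where keeps the values where the condition is True and fills the rest with NaN, which puts the comma-free values in the Country column only.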

Pandas sorting strings in cells

I have dataframe like this:
individual states
1 Alaska, Hawaii
2 Hawaii, Alaska
3 Kansas, Iowa, Maryland
4 New Jersey, Newada
5 Newada, New Jersey
I would like to sort the strings within the cells and would like to obtain the following dataframe
individual states
1 Alaska, Hawaii
2 Alaska, Hawaii
3 Iowa, Kansas, Maryland
4 New Jersey, Newada
5 New Jersey, Newada
How could I do it?
This isn't a simple problem... I would suggest splitting, sorting and joining with map:
df['states'] = df['states'].map(lambda x: ', '.join(sorted(x.split(', '))))
df
individual states
0 1 Alaska, Hawaii
1 2 Alaska, Hawaii
2 3 Iowa, Kansas, Maryland
3 4 New Jersey, Newada
4 5 New Jersey, Newada
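I am using get_dummies and then dot to join the column names back together. The columns produced by get_dummies are sorted, which is what orders the names; note that the result here is joined with "," and no space: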
s = df.states.str.get_dummies(', ')
s.dot(s.columns+',').str[:-1]
Out[861]:
0 Alaska,Hawaii
1 Alaska,Hawaii
2 Iowa,Kansas,Maryland
3 New Jersey,Newada
4 New Jersey,Newada
dtype: object
df['state'] = s.dot(s.columns+',').str[:-1]
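The split, sort and join steps can also be written with the .str accessor alone. This is a small sketch on a subset of the data from the question; unlike the get_dummies route, it keeps the ", " separator and would keep any repeated names within a cell:

```python
import pandas as pd

df = pd.DataFrame({
    "individual": [1, 2, 3],
    "states": ["Alaska, Hawaii", "Hawaii, Alaska", "Kansas, Iowa, Maryland"],
})

# Split each cell into a list, sort the list, then join it back
df["states"] = df["states"].str.split(", ").map(sorted).str.join(", ")

print(df)
```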

How to slice between two elements in a pandas series

I have a Series containing names with their nationalities in parentheses.
I want it to contain just each individual's nationality, without the parentheses and with the same index.
0 LOMBARDI Domingo (URU)
1 MACIAS Jose (ARG)
2 TEJADA Anibal (URU)
3 WARNKEN Alberto (CHI)
4 REGO Gilberto (BRA)
5 CRISTOPHE Henry (BEL)
6 MATEUCCI Francisco (URU)
7 MACIAS Jose (ARG)
8 LANGENUS Jean (BEL)
9 TEJADA Anibal (URU)
10 SAUCEDO Ulises (BOL)
I tried applying .split(' ')[2] to the series, but got "'Series' object has no attribute 'split'".
You need to use the str accessor on the series.
df.name.str.split('(').str[1].str[:-1]
Output:
0 URU
1 ARG
2 URU
3 CHI
4 BRA
5 BEL
6 URU
7 ARG
8 BEL
9 URU
10 BOL
Name: name, dtype: object
Using extract (with a raw string so the backslashes reach the regex engine unchanged):
s.str.extract(r'.*\((.*)\).*', expand=True)[0]
Out[463]:
0 URU
1 ARG
2 URU
3 CHI
Name: 0, dtype: object
Using slice. May not be optimal as it assumes the right side of the string is constant but it's another possible solution.
df.name.str.slice(start = -4).str[:-1]
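A slightly tighter pattern captures only what sits between the parentheses, so it does not depend on the code being three letters long. A minimal sketch, assuming a Series built from a few of the rows above:

```python
import pandas as pd

s = pd.Series(
    ["LOMBARDI Domingo (URU)", "MACIAS Jose (ARG)", "WARNKEN Alberto (CHI)"],
    name="name",
)

# Capture the text between "(" and ")"; expand=False returns a Series
nationality = s.str.extract(r"\(([^)]*)\)", expand=False)

print(nationality.tolist())
```

The index of nationality matches the index of s, so it can be assigned straight back to the original column.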

Weird behaviour with pandas cut, groupby and multiindex in Python

I have a dataframe like this one,
Continent % Renewable
Country
China Asia 2
United States North America 1
Japan Asia 1
United Kingdom Europe 1
Russian Federation Europe 2
Canada North America 5
Germany Europe 2
India Asia 1
France Europe 2
South Korea Asia 1
Italy Europe 3
Spain Europe 3
Iran Asia 1
Australia Australia 1
Brazil South America 5
where % Renewable is a column created using the cut function,
Top15['% Renewable'] = pd.cut(Top15['% Renewable'], 5, labels=range(1,6))
when I group by Continent and % Renewable to count the number of countries in each subset I do,
count_groups = Top15.groupby(['Continent', '% Renewable']).size()
which is,
Continent % Renewable
Asia 1 4
2 1
Australia 1 1
Europe 1 1
2 3
3 2
North America 1 1
5 1
South America 5 1
The weird thing is the indexing: if I index a combination whose count is > 0, this gives me the value,
count_groups.loc['Asia', 1]
>> 4
if not,
count_groups.loc['Asia', 3]
>> IndexingError: Too many indexers
Shouldn't it give me a 0, as there are no entries in that category? I would assume so, as that dataframe was created using the groupby.
If not, can anyone suggest a procedure that preserves a count of 0 for a category of % Renewable with no countries?
You have a Series with MultiIndex. Normally, we use tuples for indexing with MultiIndexes but pandas can be flexible about that.
In my opinion, count_groups.loc[('Asia', 3)] should raise a KeyError since this pair does not appear in the index but that's for pandas developers to decide I guess.
To return a default value from a Series, we can use get like we do in dictionaries:
count_groups.get(('Asia', 3), 0)
This will return 0 if the key does not exist.
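A short sketch of both ideas, on a made-up frame with plain integer bins rather than the categorical column produced by cut: get with a default for a single lookup, and, as one possible way to keep the zero counts in the Series itself, reindex against the full set of (Continent, bin) combinations. The name full_index is illustrative:

```python
import pandas as pd

# Illustrative data; "% Renewable" holds plain integer bin labels here
Top15 = pd.DataFrame({
    "Continent": ["Asia", "Asia", "Asia", "Asia", "Asia", "Europe", "Europe"],
    "% Renewable": [2, 1, 1, 1, 1, 1, 2],
})
count_groups = Top15.groupby(["Continent", "% Renewable"]).size()

# Single lookups with a default for combinations that never occurred
missing = count_groups.get(("Asia", 3), 0)
present = count_groups.get(("Asia", 1), 0)

# Every (Continent, bin) combination, filled with 0 where absent
full_index = pd.MultiIndex.from_product(
    [count_groups.index.levels[0], range(1, 6)],
    names=count_groups.index.names,
)
full_counts = count_groups.reindex(full_index, fill_value=0)

print(missing, present)
print(full_counts)
```

After the reindex, full_counts.loc[("Asia", 3)] is an ordinary lookup that returns 0 instead of raising.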
