Pandas sorting strings in cells - python

I have a dataframe like this:
individual  states
1           Alaska, Hawaii
2           Hawaii, Alaska
3           Kansas, Iowa, Maryland
4           New Jersey, Newada
5           Newada, New Jersey
I would like to sort the strings within the cells, to obtain the following dataframe:
individual  states
1           Alaska, Hawaii
2           Alaska, Hawaii
3           Iowa, Kansas, Maryland
4           New Jersey, Newada
5           New Jersey, Newada
How could I do it?

This isn't a simple problem... I would suggest splitting, sorting and joining with map:
df['states'] = df['states'].map(lambda x: ', '.join(sorted(x.split(', '))))
df
   individual                  states
0           1          Alaska, Hawaii
1           2          Alaska, Hawaii
2           3  Iowa, Kansas, Maryland
3           4      New Jersey, Newada
4           5      New Jersey, Newada
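For what it's worth, the same split/sort/join can also be written as a list comprehension, which is often marginally faster for pure string work (a sketch, assuming the column has no NaN values):
df['states'] = [', '.join(sorted(x.split(', '))) for x in df['states']]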

I am using get_dummies, then dot the result back:
s = df.states.str.get_dummies(', ')
s.dot(s.columns+',').str[:-1]
Out[861]:
0           Alaska,Hawaii
1           Alaska,Hawaii
2    Iowa,Kansas,Maryland
3       New Jersey,Newada
4       New Jersey,Newada
dtype: object
df['state'] = s.dot(s.columns+',').str[:-1]
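To see why the dot trick works: get_dummies builds a 0/1 indicator matrix with one column per state, sorted alphabetically, and the dot product against the comma-suffixed column names concatenates the names of the columns that are 1 in each row (0 * 'Alaska,' is the empty string, 1 * 'Alaska,' is 'Alaska,'), so the result comes out sorted for free. The intermediate matrix s looks like this:
   Alaska  Hawaii  Iowa  Kansas  Maryland  New Jersey  Newada
0       1       1     0       0         0           0       0
1       1       1     0       0         0           0       0
2       0       0     1       1         1           0       0
3       0       0     0       0         0           1       1
4       0       0     0       0         0           1       1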

Related

Get the number of IDs that have the same combination of distinct values in the 'locations' column

I have a table with ids and locations they have been to.
id  Location
1   Maryland
1   Iowa
2   Maryland
2   Texas
3   Georgia
3   Iowa
4   Maryland
4   Iowa
5   Maryland
5   Iowa
5   Texas
I'd like to perform a query that would allow me to get the number of ids per combination.
In this example table, the output would be -
Maryland, Iowa - 2
Maryland, Texas - 1
Georgia, Iowa - 1
Maryland, Iowa, Texas - 1
My original thought was to add up the ASCII values of the distinct locations for each id, count how many ids share each sum, and then work out which combination corresponds to each sum. I was not able to do that, as SQL Server would not let me cast an nvarchar to a numeric data type. Is there any other way I could use SQL to get the number of ids per combination? Using Python to get the number of ids per combination is also acceptable; however, SQL is preferred.
If you want to solve this in SQL and you are running SQL Server 2017 or later, you can use a CTE to aggregate the locations for each id using STRING_AGG, and then count the occurrences of each aggregated string:
WITH all_locations AS (
    SELECT STRING_AGG(Location, ', ') WITHIN GROUP (ORDER BY Location) AS aloc
    FROM locations
    GROUP BY id
)
SELECT aloc, COUNT(*) AS cnt
FROM all_locations
GROUP BY aloc
ORDER BY cnt, aloc
Output:
aloc                   cnt
Georgia, Iowa          1
Iowa, Maryland, Texas  1
Maryland, Texas        1
Iowa, Maryland         2
Note that I have applied an ordering to the STRING_AGG to ensure that someone who visits Maryland and then Iowa is treated the same way as someone who visits Iowa and then Maryland. If this is not the desired behaviour, simply delete the WITHIN GROUP clause.
Demo on dbfiddle
Use groupby + agg + value_counts:
new_df = df.groupby('id')['Location'].agg(list).str.join(', ').value_counts().reset_index()
Output:
>>> new_df
                   index  Location
0         Maryland, Iowa         2
1        Maryland, Texas         1
2          Georgia, Iowa         1
3  Maryland, Iowa, Texas         1
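In older pandas versions, reset_index here gives the unhelpful column names 'index' (holding the combination strings) and 'Location' (holding the counts); a quick rename makes the result clearer (a sketch):
new_df.columns = ['Location', 'count']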
Let us do groupby with join, then value_counts:
df.groupby('id')['Location'].agg(', '.join).value_counts()
Out[938]:
join
Maryland, Iowa           2
Georgia, Iowa            1
Maryland, Iowa, Texas    1
Maryland, Texas          1
dtype: int64
Use a frozenset to aggregate, to ensure having unique, order-insensitive groups:
df.groupby('id')['Location'].agg(frozenset).value_counts()
Output:
(Maryland, Iowa) 2
(Texas, Maryland) 1
(Georgia, Iowa) 1
(Texas, Maryland, Iowa) 1
Name: Location, dtype: int64
Or a sorted string join:
df.groupby('id')['Location'].agg(lambda x: ', '.join(sorted(x))).value_counts()
Output:
Iowa, Maryland           2
Maryland, Texas          1
Georgia, Iowa            1
Iowa, Maryland, Texas    1
Name: Location, dtype: int64

Extracting Specific Text From column in dataframe in pandas

I have a pandas dataframe with a State column, from which I need to extract the token ending in one of [ft, mi, FT, MI] using a regular expression and store it in another column.
import pandas as pd
df1 = pd.DataFrame({
    'State': ['Arizona 4.47ft', 'Georgia 1023mi', 'Newyork 2022 NY 74.6 FT',
              'Indiana 747MI(In)', 'Florida 453mi FL']})
Expected output:
               State  Distance
0     Arizona 4.47ft    4.47ft
1     Georgia 1023mi    1023mi
2  Newyork NY 74.6ft    74.6ft
3  Indiana 747MI(In)     747MI
4   Florida 453mi FL     453mi
Would anyone please help?
Build a regex pattern with the help of the list l, then use str.extract to pull the first occurrence of this pattern out of the State column:
l = ['ft','mi','FT','MI']
df1['Distance'] = df1['State'].str.extract(r'(\S+(?:%s))\b' % '|'.join(l))
                    State Distance
0          Arizona 4.47ft   4.47ft
1          Georgia 1023mi   1023mi
2  Newyork 2022 NY 74.6FT   74.6FT
3       Indiana 747MI(In)    747MI
4        Florida 453mi FL    453mi
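An alternative sketch: instead of listing both cases in l, make the whole pattern case-insensitive with an inline flag. Note that, like the original pattern, this would also catch any word that merely ends in ft or mi (e.g. 'Miami'):
df1['Distance'] = df1['State'].str.extract(r'(?i)(\S+(?:ft|mi))\b')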

formatting multiple city names into universal name for each city all at once using pandas

I want to change all the city-name variants into one universal name.
       City  b  c
0  New york  1  1
1  New York  2  2
2      N.Y.  3  3
3        NY  4  4
They all refer to the city New York, but Python sees them as separate entities, so I've changed them all into one:
df["City"] = df["City"].replace({"N.Y.":"New york", "New York": "New york", "NY": "New york"})
After this I need to check that every variation of New York is covered; to do that, I've created a function:
def universal_ok(universal_name):
    count = 0
    for c in df.City:
        if c == universal_name:
            count += 1
    # This only works when the column consists of only one type of city
    if count == len(df.City):
        return "Yes all names are formatted correctly"
    else:
        return f"there are {len(df.City) - count} names that need to be changed"

universal_ok("New york")
But the problem is: what about when there is more than one city in the column?
       City  b  c
0  New york  1  1
1  New York  2  2
2      N.Y.  3  3
3        NY  4  4
4   Toronto  3  2
5        TO  3  2
6   toronto  3  2
Is there a way to change each city to a universal name?
Convert to Lower, Unique Values, Map and Count:
Data:
City      b  c
New york  1  1
New York  2  2
N.Y.      3  3
NY        4  4
Toronto   3  2
TO        3  2
toronto   3  2
Convert to Lower:
This will reduce the variations of the city names.
pandas.Series.str.lower
df.City = df.City.str.lower()
City      b  c
new york  1  1
new york  2  2
n.y.      3  3
ny        4  4
toronto   3  2
to        3  2
toronto   3  2
Unique Values:
This will give you all the unique values in the column:
pandas.Series.unique
df.City.unique()
array(['new york', 'n.y.', 'ny', 'toronto', 'to'], dtype=object)
Mapping the City Names:
Use the unique values list to map the values to the preferred form.
I created a tuple, then used a dict comprehension to create the dictionary.
I did this so I wouldn't have to repeatedly type the preferred city name, because I'm lazy / efficient that way.
Tuples
Python Dictionary Comprehension Tutorial
pandas.Series.map
cities_tup = (('New York', ['ny', 'n.y.', 'new york']),
              ('Toronto', ['toronto', 'to']))
cities_map = {y: x[0] for x in cities_tup for y in x[1]}

{'ny': 'New York',
 'n.y.': 'New York',
 'new york': 'New York',
 'toronto': 'Toronto',
 'to': 'Toronto'}
df.City = df.City.map(cities_map)
City      b  c
New York  1  1
New York  2  2
New York  3  3
New York  4  4
Toronto   3  2
Toronto   3  2
Toronto   3  2
Unique Counts to verify:
Verify city names have been updated and count them
pandas.Series.value_counts
df.City.value_counts()
New York 4
Toronto 3
Name: City, dtype: int64
Remarks
Undoubtedly, there are alternate methods to accomplish this task, but I think this is straightforward and easy to follow.
Someone will probably come along and offer a one-liner.
You need a specific column with some sort of city id, otherwise you won’t be able to distinguish between Paris, France and Paris, Texas, nor will you be able to group Istanbul and Constantinople.
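On the one-liner remark: once cities_map exists, the lowercasing and mapping steps above chain into a single line (a sketch, equivalent to the step-by-step version):
df.City = df.City.str.lower().map(cities_map)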

Constructing a dataframe with multiple columns based on str conditions using a loop - python

I have a webscraped Twitter DataFrame that includes user location. The location variable looks like this:
2 Crockett, Houston County, Texas, 75835, USA
3 NYC, New York, USA
4 Warszawa, mazowieckie, RP
5 Texas, USA
6 Virginia Beach, Virginia, 23451, USA
7 Louisville, Jefferson County, Kentucky, USA
I would like to construct state dummies for all USA states by using a loop.
I have managed to extract users from the USA using
location_usa = location_df['location'].str.contains('usa', case = False)
However, the code would be too bulky if I wrote this for every single state. I have a list of the states as strings.
Also I am unable to use
pd.Series.str.get_dummies()
as there are different locations within the same state and each entry is a whole sentence.
I would like the output to look something like this:
   Alabama  Alaska  Arizona
1        0       0        1
2        0       1        0
3        1       0        0
4        0       0        0
Or the same with Boolean values.
Use .str.extract to get a Series of the states, and then use pd.get_dummies on that Series. You will need to define a list of all 50 states:
import pandas as pd
states = ['Texas', 'New York', 'Kentucky', 'Virginia']
pd.get_dummies(df.col1.str.extract('(' + '|'.join(x + ',' for x in states) + ')')[0].str.strip(','))
   Kentucky  New York  Texas  Virginia
0         0         0      1         0
1         0         1      0         0
2         0         0      0         0
3         0         0      1         0
4         0         0      0         1
5         1         0      0         0
Note that I matched on states followed by a ',', as that seems to be the pattern; it lets you avoid false matches like 'Virginia Beach' matching 'Virginia', or more problematic things like 'Washington County, Minnesota'.
If you expect multiple states to match on a single line, then this becomes .extractall, summing across the 0th level:
pd.get_dummies(df.col1.str.extractall('(' + '|'.join(x + ',' for x in states) + ')')[0].str.strip(',')).sum(level=0).clip(upper=1)
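Note that sum(level=0) was deprecated in pandas 1.3 and removed in 2.0; in recent versions the same level-wise sum is spelled with groupby (a sketch):
pd.get_dummies(df.col1.str.extractall('(' + '|'.join(x + ',' for x in states) + ')')[0].str.strip(',')).groupby(level=0).sum().clip(upper=1)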
Edit:
Perhaps there are better ways, but this can be a bit safer, as suggested by @BradSolomon, allowing matches on 'State,( optional 5 digit Zip,) USA':
states = ['Texas', 'New York', 'Kentucky', 'Virginia', 'California', 'Pennsylvania']
pat = '(' + '|'.join(x + r',?(\s\d{5},)?\sUSA' for x in states) + ')'
s = df.col1.str.extract(pat)[0].str.split(',').str[0]
Output: s
0 Texas
1 New York
2 NaN
3 Texas
4 Virginia
5 Kentucky
6 Pennsylvania
Name: 0, dtype: object
from input:
                                           col1
0  Crockett, Houston County, Texas, 75835, USA
1                            NYC, New York, USA
2                     Warszawa, mazowieckie, RP
3                                    Texas, USA
4          Virginia Beach, Virginia, 23451, USA
5   Louisville, Jefferson County, Kentucky, USA
6                 California, Pennsylvania, USA
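From s, pd.get_dummies(s) then yields the indicator columns as before; the NaN row (the non-USA location) simply comes out all zeros:
pd.get_dummies(s)

   Kentucky  New York  Pennsylvania  Texas  Virginia
0         0         0             0      1         0
1         0         1             0      0         0
2         0         0             0      0         0
3         0         0             0      1         0
4         0         0             0      0         1
5         1         0             0      0         0
6         0         0             1      0         0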

Python: How to split a string column in a dataframe?

I have a dataframe with two columns: Date, and Location (object dtype). Below is the format of the Location column with values:
    Date        Location
1   07/12/1912  AtlantiCity, New Jersey
2   08/06/1913  Victoria, British Columbia, Canada
3   09/09/1913  Over the North Sea
4   10/17/1913  Near Johannisthal, Germany
5   03/05/1915  Tienen, Belgium
6   09/03/1915  Off Cuxhaven, Germany
7   07/28/1916  Near Jambol, Bulgeria
8   09/24/1916  Billericay, England
9   10/01/1916  Potters Bar, England
10  11/21/1916  Mainz, Germany
My requirement is to split Location on the "," separator and keep only the last part (e.g. New Jersey, Canada, Germany, England, etc.) in the Location column. I also have to check whether a value has only a single element (values with a single element have no ",").
Is there a way I can do this with a predefined method, without looping over each and every row?
Sorry if the question is not up to standard, as I am new to Python and still learning.
A straightforward way is to apply the split method to each element of the column and pick the last piece:
df.Location.apply(lambda x: x.split(",")[-1])
1 New Jersey
2 Canada
3 Over the North Sea
4 Germany
5 Belgium
6 Germany
7 Bulgeria
8 England
9 England
10 Germany
Name: Location, dtype: object
To check whether each cell has only one element, we can use the str.contains method on the column:
df.Location.str.contains(",")
1 True
2 True
3 False
4 True
5 True
6 True
7 True
8 True
9 True
10 True
Name: Location, dtype: bool
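Putting the two together: a mask-based sketch that rewrites only the multi-part locations, and also trims the leading space that split(",") leaves behind:
mask = df.Location.str.contains(",")
df.loc[mask, 'Location'] = df.loc[mask, 'Location'].str.split(",").str[-1].str.strip()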
We could try with str.extract
print(df['Location'].str.extract(r'([^,]+$)'))
#0 New Jersey
#1 Canada
#2 Over the North Sea
#3 Germany
#4 Belgium
#5 Germany
#6 Bulgeria
#7 England
#8 England
#9 Germany
