How to extract city from dataframe column (inconsistent format) - python

I have this dataframe, and I want to extract the cities into a separate column. As you can see, the format is not the same everywhere, and the city can be anywhere in the row. How can I extract only the cities into a new column?
Note: we are talking about German cities here. Maybe I could find a dictionary that lists all German cities and somehow compare it with my dataset?
Here is a dictionary of German cities: https://gist.github.com/embayer/772c442419999fa52ca1
Dataframe
Adresse
0 Karlstr 10, 10 B, 30,; 04916 Hamburg
1 München Dorfstr. 28-55, 22555
2 Marnstraße. Berlin 12, 45666 Berlin
3 Musterstr, 24855 Dresden
... ...
850 Muster Hausweg 11, Hannover, 56668
851 Mariestr. 4, 48669 Nürnberg
852 Hilden Weederstr 33-55, 56889
853 Pt-gaanen-Str. 2, 45883 Potsdam
Output
Cities
0 Hamburg
1 München
2 Berlin
3 Dresden
... ...
850 Hannover
851 Nürnberg
852 Hilden
853 Potsdam

You could extract into a list all the cities from the dictionary you provided (I assume it's the 'stadt' key), and then use str.findall on your column:
cities_ = [city['stadt'] for city in cities]
df.Adresse.str.findall(r'|'.join(cities_))
>>>
0 [Karlstr, Hamburg]
1 []
2 []
3 []
4 []
5 []
6 []
7 []
8 []
Name: Adresse, dtype: object
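A hedged refinement of the same idea: plain alternation can also match a city name embedded inside a longer token, so you could escape each name and anchor the pattern on word boundaries (this assumes the cities_ list built above):
import re

# escape each city name and only match it as a whole word
pattern = r'\b(?:' + '|'.join(map(re.escape, cities_)) + r')\b'
df['Cities'] = df.Adresse.str.findall(pattern)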

You can simply use str.extract, since all the names sit between pairs of stars:
df["cities"] = df["Adresse"].str.extract(r'\*\*(\w+)\*\*')
Since it seems the stars are not present in your actual file, you can do it differently.
Use the dictionary of cities, called cities, from the file you linked, but keep only the unique city names in a set:
german_cities = set(map(lambda x: x['stadt'], cities))
Then we split the address string of each row and look each piece up in the set of German cities.
Since apply passes each address as the first argument to the function, we only need to supply the set of German cities via args.
def lookup_cities(string, cities):
    splits = string.replace(",", "").split(" ")
    for s in splits:
        if s in cities:
            return s
    return "NaN"

df["Adresse"].apply(lookup_cities, args=(german_cities,))
Now, if you find any "NaN", then either a city in your document has a typo or there are several ways to write it; you'll have to investigate yourself.
P.S.: I had to remove all the spaces in the cities file, otherwise the names wouldn't match. It was just a matter of using find-and-replace in my editor.
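For illustration, a minimal end-to-end sketch with a small stand-in set of cities (the real set comes from the gist); the function is repeated so the snippet runs on its own:
import pandas as pd

german_cities = {"Hamburg", "München", "Hilden"}  # stand-in for the gist data

df = pd.DataFrame({"Adresse": [
    "Karlstr 10, 10 B, 30,; 04916 Hamburg",
    "München Dorfstr. 28-55, 22555",
    "Hilden Weederstr 33-55, 56889",
]})

def lookup_cities(string, cities):
    splits = string.replace(",", "").split(" ")
    for s in splits:
        if s in cities:
            return s
    return "NaN"

df["Cities"] = df["Adresse"].apply(lookup_cities, args=(german_cities,))
print(df["Cities"].to_list())  # ['Hamburg', 'München', 'Hilden']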

You can use a regular expression to extract the city names, as they are marked by **:
import re
import pandas as pd

df = pd.DataFrame({"Adresse": ["Karlstr 10, 10 B, 30,; 04916 **Hamburg**",
                               "**München** Dorfstr. 28-55, 22555",
                               "Marnstraße. Berlin 12, 45666 **Berlin**",
                               "Musterstr, 24855 **Dresden**"]})
df['Cities'] = [re.findall(r".*\*\*(.*)\*\*", address)[0] for address in df['Adresse']]
This results in:
df
Adresse Cities
0 Karlstr 10, 10 B, 30,; 04916 **Hamburg** Hamburg
1 **München** Dorfstr. 28-55, 22555 München
2 Marnstraße. Berlin 12, 45666 **Berlin** Berlin
3 Musterstr, 24855 **Dresden** Dresden

Related

How to get the index of a particular element (a sentence string in a DataFrame column) subject to a condition

I have a table. There are numbers in the column 'Para'. I have to find the index of a particular sentence in the column 'Country_Title', in such a way that the value in column 'Para' is 2.
Main DataFrame 'df_countries' is shown below:
Index  Sequence  Para  Country_Title
0      5         4     India is seventh largest country
1      6         6     Australia is a continent country
2      7         2     Canada is the 2nd largest country
3      9         3     UAE is a country in Western Asia
4      10        2     China is a country in East Asia
5      11        1     Germany is in Central Europe
6      13        2     Russia is the largest country
7      14        3     Capital city of China is Beijing
Suppose my keyword is China, and I want to get the index of the sentence containing 'China', but only the one where 'Para' = 2.
Consider the rows at indexes 4 and 7; both mention China in 'Country_Title', but I want the index of the one with 'Para' = 2, i.e., the result must be index = 4.
My Approach:
I derived another DataFrame 'df_para2_countries' from the above table, as shown below:
Index  Para  Country_Title
2      2     Canada is the 2nd largest country
4      2     China is a country in East Asia
6      2     Russia is the largest country
Now I store the country titles as:
c = list(df_para2_countries['Country_Title'])
I used a for loop to iterate over the elements of 'c' and find the index of a particular country in the table 'df_countries':
for i in c:
    if 'China' in i:
        print(i)
        ind = df_para2_countries.loc[df_para2_countries['Country_Title'] = i]
        print(ind)
The line assigning 'ind' gives an error. I want to get the index, but this doesn't work.
Please post your suggestions on how I can approach this.
You need two equals signs in your condition: == compares, while a single = is an assignment and is invalid inside .loc[].
If you need only the 'index', that is, the value from your first column called Index, you can convert the index of the frame returned by .loc[] to a list and take the first value, for instance:
ind = df_para2_countries.loc[df_para2_countries['Country_Title'] == i].index.to_list()[0]
Hope it works :)
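For completeness, a runnable sketch that reconstructs df_para2_countries from the question's table and applies the corrected line:
import pandas as pd

# values taken from the question's derived table
df_para2_countries = pd.DataFrame(
    {"Para": [2, 2, 2],
     "Country_Title": ["Canada is the 2nd largest country",
                       "China is a country in East Asia",
                       "Russia is the largest country"]},
    index=[2, 4, 6],
)

for i in df_para2_countries["Country_Title"]:
    if "China" in i:
        ind = df_para2_countries.loc[df_para2_countries["Country_Title"] == i].index.to_list()[0]
        print(ind)  # -> 4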

Python Text File to Data Frame with Specific Pattern

I am trying to convert a bunch of text files into a data frame using Pandas.
Each text file starts with two relevant pieces of information: the Number and the Register variables.
Then the text files have some random text that should not be taken into consideration.
Last, the text files contain information such as the share number, the name of the person, the birth date, the address, and some additional rows that start with a lowercase letter. Each group contains this information, and the pattern is always the same: the first row of a group is defined by a number (hereafter the id), followed by the word "SHARE".
Here is an example:
Number 01600 London Register 4314
Some random text...
1 SHARE: 73/1284
John Smith
BORN: 1960-01-01 ADR: Streetname 3/2 1000
f 4222/2001
h 1334/2000
i 5774/2000
4 SHARE: 58/1284
Boris Morgan
BORN: 1965-01-01 ADR: Streetname 4 2000
c 4222/1988
f 4222/2000
I need to transform the text into a data frame with the following output, where each group is stored in one row:
Number  Register  City    Id  Share    Name          Born        c          f          h          i
01600   4314      London  1   73/1284  John Smith    1960-01-01  NaN        4222/2001  1334/2000  5774/2000
01600   4314      London  4   58/1284  Boris Morgan  1965-01-01  4222/1988  4222/2000  NaN        NaN
My initial approach was to first read the text file and apply a regular expression for each case:
import pandas as pd
import re

text = open(r'Test.txt', 'r').read()
for line in re.findall('SHARE.*', text):
    print(line)
But probably there is a better way to do it.
Any help is highly appreciated. Thanks in advance.
This can be done without regex, using a list comprehension and string splitting:
import pandas as pd

text = '''Number 01600 London Register 4314
Some random text...
1 SHARE: 73/1284
John Smith
BORN: 1960-01-01 ADR: Streetname 3/2 1000
f 4222/2001
h 1334/2000
i 5774/2000
4 SHARE: 58/1284
Boris Morgan
BORN: 1965-01-01 ADR: Streetname 4 2000
c 4222/1988
f 4222/2000'''

text = [i.strip() for i in text.splitlines()]  # create a list of lines
data = []

# extract metadata from the first line
number = text[0].split()[1]
city = text[0].split()[2]
register = text[0].split()[4]

# collect the index numbers of the lines where new items start
indices = [text.index(i) for i in text if 'SHARE' in i]

# split the list at the retrieved indexes to get a list of lists of items
items = [text[i:j] for i, j in zip([0] + indices, indices + [None])][1:]

for item in items:
    d = {'Number': number, 'Register': register, 'City': city,
         'Id': int(item[0].split()[0]), 'Share': item[0].split(': ')[1],
         'Name': item[1], 'Born': item[2].split()[1]}
    # tokenize the remaining lines (the lowercase-letter rows)
    rows = [s.split() for s in item[3:]]
    merged_rows = []
    for row in rows:
        if len(row[0]) == 1 and row[0].isalpha():
            merged_rows.append(row)
        else:
            # continuation line: glue its first token onto the previous value
            merged_rows[-1][-1] = merged_rows[-1][-1] + row[0]
    d.update({name: value for name, value in merged_rows})
    data.append(d)

# load the list of dicts as a dataframe
df = pd.DataFrame(data)
Output:
   Number  Register  City    Id  Share    Name          Born        f          h          i          c
0  01600   4314      London  1   73/1284  John Smith    1960-01-01  4222/2001  1334/2000  5774/2000  NaN
1  01600   4314      London  4   58/1284  Boris Morgan  1965-01-01  4222/2000  NaN        NaN        4222/1988
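Since the question mentions a bunch of text files, one way to scale this up is to wrap the parsing in a function and concatenate the per-file frames. A hedged sketch: parse_text is a condensed version of the logic above (it assumes every lowercase row is exactly 'letter value'; the merge step above handles messier rows), and the .txt glob pattern is an assumption:
from pathlib import Path
import pandas as pd

def parse_text(text):
    # condensed single-file version of the parsing logic above
    lines = [l.strip() for l in text.splitlines()]
    head = lines[0].split()
    number, city, register = head[1], head[2], head[4]
    starts = [i for i, l in enumerate(lines) if 'SHARE' in l]
    data = []
    for i, j in zip(starts, starts[1:] + [None]):
        chunk = lines[i:j]
        d = {'Number': number, 'Register': register, 'City': city,
             'Id': int(chunk[0].split()[0]), 'Share': chunk[0].split(': ')[1],
             'Name': chunk[1], 'Born': chunk[2].split()[1]}
        d.update(dict(l.split() for l in chunk[3:]))  # assumes 'letter value' rows
        data.append(d)
    return data

# parse every .txt file in the folder and stack the results
frames = [pd.DataFrame(parse_text(p.read_text())) for p in Path('.').glob('*.txt')]
df = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()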

How to extract unique values from pandas column where values are in list

I want to extract unique cities from the City column in a pandas dataframe. The City column holds lists of values. How would I get the city frequencies, like:
Lahore 3
Karachi 2
Sydney 1
etc.
Sample dataframe:
Name Age City
a jack 34 [Sydney,Delhi]
b Riti 31 [Lahore,Delhi]
c Aadi 16 [New York, Karachi, Lahore]
d Mohit 32 [Peshawar,Delhi, Karachi]
Thank you
Let us try explode + value_counts
out = df.City.explode().value_counts()
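A minimal runnable sketch of this, assuming the City column holds actual Python lists (not strings that merely look like lists):
import pandas as pd

df = pd.DataFrame({
    "Name": ["jack", "Riti", "Aadi", "Mohit"],
    "Age": [34, 31, 16, 32],
    "City": [["Sydney", "Delhi"],
             ["Lahore", "Delhi"],
             ["New York", "Karachi", "Lahore"],
             ["Peshawar", "Delhi", "Karachi"]],
})

out = df.City.explode().value_counts()
print(out)
# Delhi 3, Lahore 2, Karachi 2, then the single-occurrence cities
# (the order among equal counts may vary)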

How to condense text cleaning steps into single Python function?

Newer programmer here, deeply appreciate any help this knowledgeable community is willing to provide.
I have a column of 140,000 text strings (company names) in a pandas dataframe. I want to strip all whitespace in and around the strings, remove all punctuation, substitute specific substrings, and uniformly transform to lowercase. I then want to take the first 10 characters of each string and store them in a new dataframe column.
Here is a reproducible example.
import string
import pandas as pd

data = ["West Georgia Co",
        "W.B. Carell Clockmakers",
        "Spine & Orthopedic LLC",
        "LRHS Saint Jose's Grocery",
        "Optitech#NYCityScape"]
df = pd.DataFrame(data, columns=['co_name'])

def remove_punctuations(text):
    for punctuation in string.punctuation:
        text = text.replace(punctuation, '')
    return text

# applying remove_punctuations function
df['co_name_transform'] = df['co_name'].apply(remove_punctuations)

# this next step replaces 'Saint' with 'st' to standardize,
# and I may want to make other substitutions but this is a common one.
df['co_name_transform'] = df.co_name_transform.str.replace('Saint', 'st')

# remove whitespace
df['co_name_transform'] = df.co_name_transform.str.replace(' ', '')

# make lowercase
df['co_name_transform'] = df.co_name_transform.str.lower()

# keep the first 10 characters of each string
df['co_name_transform'] = df.co_name_transform.str[0:10]

print(df)
co_name co_name_transform
0 West Georgia Co westgeorgi
1 W.B. Carell Clockmakers wbcarellcl
2 Spine & Orthopedic LLC spineortho
3 LRHS Saint Jose's Grocery lrhsstjose
4 Optitech#NYCityScape optitechny
How can I put all these steps into a single function, something like this?
def clean_text(df[col]):
    for co in co_name:
        do_all_the_steps
    return df[new_col]
Thank you
You don't need a custom function to do this; you can chain the string methods in a one-liner (regex=True is needed on pandas 2.0+, where str.replace defaults to plain-text replacement):
df['co_name_transform'] = df['co_name'].str.replace('[^A-Za-z0-9-]+', '', regex=True).str.replace('Saint', 'st').str.lower().str[0:10]
The final output will be:
co_name co_name_transform
0 West Georgia Co westgeorgi
1 W.B. Carell Clockmakers wbcarellcl
2 Spine & Orthopedic LLC spineortho
3 LRHS Saint Jose's Grocery lrhsstjose
4 Optitech#NYCityScape optitechny
You can do all the steps in the function you pass to the apply method:
import re
df['co_name_transform'] = df['co_name'].apply(lambda s: re.sub(r'[\W_]+', '', s).replace('Saint', 'st').lower()[:10])
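If you do want the named function the question asks for, one possible wrapper around the same steps (the function name is illustrative, and df is the frame from the question):
import re

def clean_text(s):
    # drop punctuation/whitespace, standardize 'Saint', lowercase, keep first 10 chars
    s = re.sub(r'[\W_]+', '', s)
    s = s.replace('Saint', 'st')
    return s.lower()[:10]

df['co_name_transform'] = df['co_name'].apply(clean_text)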
Another solution, similar to the previous one, but with the replacements collected in one dictionary, so you can add more items to replace. Here the full cleaned strings are kept, and df['co_name'][0:10] shows the first 10 rows (the previous solution instead truncates each string to its first 10 characters).
data = ["West Georgia Co",
"W.B. Carell Clockmakers",
"Spine & Orthopedic LLC",
"LRHS Saint Jose's Grocery",
"Optitech#NYCityScape","Optitech#NYCityScape","Optitech#NYCityScape","Optitech#NYCityScape","Optitech#NYCityScape","Optitech#NYCityScape","Optitech#NYCityScape","Optitech#NYCityScape","Optitech#NYCityScape"]
df = pd.DataFrame(data, columns = ['co_name'])
to_replace = {'[^A-Za-z0-9-]+':'','Saint':'st'}
for i in to_replace :
df['co_name'] = df['co_name'].str.replace(i,to_replace[i]).str.lower()
df['co_name'][0:10]
Result:
0 westgeorgiaco
1 wbcarellclockmakers
2 spineorthopedicllc
3 lrhsstjosesgrocery
4 optitechnycityscape
5 optitechnycityscape
6 optitechnycityscape
7 optitechnycityscape
8 optitechnycityscape
9 optitechnycityscape
Name: co_name, dtype: object
Previous solution (keeps only the first 10 characters of each string):
df['co_name_transform'] = df['co_name'].str.replace('[^A-Za-z0-9-]+', '', regex=True).str.replace('Saint', 'st').str.lower().str[0:10]
Result:
0 westgeorgi
1 wbcarellcl
2 spineortho
3 lrhsstjose
4 optitechny
5 optitechny
6 optitechny
7 optitechny
8 optitechny
9 optitechny
10 optitechny
11 optitechny
12 optitechny
Name: co_name_transform, dtype: object

using len() in Pandas dataframe

This is what my DataFrame looks like:
StateAb GivenNm Surname PartyNm PartyAb ElectedOrder
35 WA Joe BULLOCK Australian Labor Party ALP 2
36 WA Michaelia CASH Liberal LP 3
37 WA Linda REYNOLDS Liberal LP 4
38 WA Wayne DROPULICH Australian Sports Party SPRT 5
39 WA Scott LUDLAM The Greens (WA) GRN 6
and I want to get a list of senators whose surname is more than 9 characters long.
So I thought the code should be like this:
df[len(df.Surname) > 9]
but this raises a KeyError. Where did I go wrong?
The correct way to filter a DataFrame based on the length of the strings in a column is
df[df['Surname'].str.len() > 9]
df['Surname'].str.len() creates a Series with the length of each surname, and the boolean mask df['Surname'].str.len() > 9 keeps only the rows where that length exceeds 9. What you did instead was check the length of the Series itself: len(df.Surname) is the number of rows (a single int), so len(df.Surname) > 9 is a single boolean, and df[False] looks for a column named False, hence the KeyError.
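A minimal reproduction (surnames taken from the question) makes the difference visible:
import pandas as pd

df = pd.DataFrame({"Surname": ["BULLOCK", "CASH", "REYNOLDS", "DROPULICH", "LUDLAM"]})

print(df["Surname"].str.len().to_list())  # [7, 4, 8, 9, 6] -- one length per row
print(len(df["Surname"]))                 # 5 -- the number of rows, a single int

# len(df.Surname) > 9 is therefore the single boolean False, and df[False]
# looks up a column named False -> KeyError. The per-row version filters correctly:
print(df[df["Surname"].str.len() > 9])    # empty here: no surname exceeds 9 characters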
Have a look at the built-in Python filter function; it does the same job on plain data structures (here a list of dicts rather than a DataFrame):
records = [
    {"Surname": "Bullock-ish"},
    {"Surname": "Cash"},
    {"Surname": "Reynolds"},
]
longnames = list(filter(lambda s: len(s["Surname"]) > 9, records))
print(longnames)
>> [{'Surname': 'Bullock-ish'}]
