Split dataframe column based on specific list of words - python

Is it possible to split strings in a dataframe column based on a list of words?
For example, there is a dataframe with a column Company; each record contains the company name, a legal form, and sometimes additional information after the legal form, like 'electronics'.
Company
XYZ ltd electronics
ABC ABC inc iron
AB XY Z inc
CD EF GHI JK llc incident
I have a list of 1,500 worldwide legal forms for companies (inc, ltd, ...). I would like to split the strings in the dataframe column based on this legal form list, for example:
['gmbh', 'ltd', 'inc', 'srl', 'spa', 'co', 'sa', 'ag', 'kg', 'ab', 'spol', 'sasu', 'sas', 'pvt', 'sarl', 'gmbh & co kg', 'llc', 'ilc', 'corp', 'ltda', 'coltd', 'se', 'as', 'sp zoo', 'plc', 'pvtltd', 'og', 'gen']
In other words, to separate everything before and after the words in the list into new columns. This is the desired output:
Company        Legal form   Addition
XYZ            ltd          electronics
ABC ABC        inc          iron
AB XY Z        inc
CD EF GHI JK   llc          incident
Note that "inc" appears in the middle, at the end, and also as part of a word ("incident") in the various company name examples.

You could use a regular expression (regex) to pull out the legal form. Each legal form is matched in the format \slegalform\s, i.e. preceded and followed by whitespace. A space is appended to every company name so a legal form at the end of the string still matches. The data is processed in lowercase and converted back to Title Case afterwards. One caveat: longer forms such as gmbh & co kg must come before their shorter prefixes (gmbh) in the alternation, otherwise the shorter one wins. So try this:
import pandas as pd
import re

# longer forms first, so 'gmbh & co kg' is tried before 'gmbh'
legal_forms = r'(\sgmbh\s&\sco\skg\s|\ssp\szoo\s|\sgmbh\s|\sltd\s|\sinc\s|\ssrl\s|\sspa\s|\sco\s|\ssa\s|\sag\s|\skg\s|\sab\s|\sspol\s|\ssasu\s|\ssas\s|\spvt\s|\ssarl\s|\sllc\s|\silc\s|\scorp\s|\sltda\s|\scoltd\s|\sse\s|\sas\s|\splc\s|\spvtltd\s|\sog\s|\sgen\s)'

df = pd.DataFrame({'Company': ['XYZ ltd electronics', 'ABC ABC inc iron', 'AB XY Z inc', 'CD EF GHI JK llc incident']})
# append a trailing space so a legal form at the end of the name matches too
df['Coy'] = df['Company'].apply(lambda x: [e.strip() for e in re.split(legal_forms, x.lower() + ' ')])
print(df)
This will create a list for each company name, separated by the legal form
Company Coy
0 XYZ ltd electronics [xyz, ltd, electronics]
1 ABC ABC inc iron [abc abc, inc, iron]
2 AB XY Z inc [ab xy z, inc, ]
3 CD EF GHI JK llc incident [cd ef ghi jk, llc, incident]
After that you can split them into 3 separate columns:
df1 = pd.DataFrame(df['Coy'].tolist(), columns=['Company', 'Legal form', 'Addition'])
for col in df1.columns:
    df1[col] = df1[col].str.title()
print(df1)
Output:
Company Legal form Addition
0 Xyz Ltd Electronics
1 Abc Abc Inc Iron
2 Ab Xy Z Inc
3 Cd Ef Ghi Jk Llc Incident
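With the real list of 1,500 forms you likely don't want to hand-write that pattern. A minimal sketch of building it from the list itself (a shortened stand-in list here, for illustration), sorting longer forms first so multi-word entries like 'gmbh & co kg' are tried before their prefixes, and re.escape-ing each entry:

```python
import re

import pandas as pd

# shortened stand-in for the full 1,500-entry legal form list
forms = ['gmbh', 'ltd', 'inc', 'co', 'kg', 'gmbh & co kg', 'llc', 'sp zoo']

# longest first, so multi-word forms win over their prefixes;
# re.escape guards against regex metacharacters in a form
pattern = r'(\s(?:' + '|'.join(re.escape(f) for f in
                               sorted(forms, key=len, reverse=True)) + r')\s)'

df = pd.DataFrame({'Company': ['XYZ ltd electronics', 'AB XY Z inc']})
df['Coy'] = df['Company'].apply(
    lambda x: [e.strip() for e in re.split(pattern, x.lower() + ' ')])
print(df['Coy'].tolist())
```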

Assuming you just need to split on a few known legal forms, you could try something like this:
import re
import pandas as pd

df = pd.DataFrame({'Company': ['XYZ ltd electronics', 'ABC ABC inc iron', 'AB XY Z inc', 'CD EF GHI JK llc chicago']})
df['Addition'] = df['Company'].apply(lambda x: re.split(r'(ltd|inc|llc)', x))
print(df)
print(df)
Company Addition
0 XYZ ltd electronics [XYZ , ltd, electronics]
1 ABC ABC inc iron [ABC ABC , inc, iron]
2 AB XY Z inc [AB XY Z , inc, ]
3 CD EF GHI JK llc chicago [CD EF GHI JK , llc, chicago]
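A vectorized alternative for the same three forms (a sketch; it splits at the first whole-word legal form in each name) is str.extract, which puts the three pieces straight into columns. The \b word boundaries keep 'inc' from matching inside 'incident':

```python
import pandas as pd

df = pd.DataFrame({'Company': ['XYZ ltd electronics', 'ABC ABC inc iron',
                               'AB XY Z inc', 'CD EF GHI JK llc incident']})

# lazy prefix up to the first whole-word legal form, then the rest;
# \b stops 'inc' from matching inside 'incident'
parts = df['Company'].str.extract(r'^(.*?)\s*\b(ltd|inc|llc)\b\s*(.*)$')
parts.columns = ['Company', 'Legal form', 'Addition']
print(parts)
```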

Related

How can I count # of occurrences of more than one column (eg city & country)?

Given the following data ...
city country
0 London UK
1 Paris FR
2 Paris US
3 London UK
... I'd like a count of each city-country pair
city country n
0 London UK 2
1 Paris FR 1
2 Paris US 1
The following works but feels like a hack:
df = pd.DataFrame([('London', 'UK'), ('Paris', 'FR'), ('Paris', 'US'), ('London', 'UK')], columns=['city', 'country'])
df.assign(**{'n': 1}).groupby(['city', 'country']).count().reset_index()
I'm assigning an additional column n of all 1s, grouping on city&country, and then count()ing occurrences of this new 'all 1s' column. It works, but adding a column just to count it feels wrong.
Is there a cleaner solution?
There is a better way: use value_counts.
df.value_counts().reset_index(name='n')
city country n
0 London UK 2
1 Paris FR 1
2 Paris US 1
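An equivalent without the dummy column is groupby(...).size(), which counts rows per group directly:

```python
import pandas as pd

df = pd.DataFrame([('London', 'UK'), ('Paris', 'FR'), ('Paris', 'US'), ('London', 'UK')],
                  columns=['city', 'country'])

# size() counts rows per group, so no helper column is needed
out = df.groupby(['city', 'country']).size().reset_index(name='n')
print(out)
```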

How can I remove the incomplete parenthesis without affecting other rows using pandas?

I have a data frame where I want to remove an unclosed parenthesis and the text after it, without affecting other rows.
Assume the following dataframe "customer" with the column "Name":
Name
Company ABC (Malaysia) Ltd
Company HIJ (M) Ltd (B12
Company KLM (M) Ltd (
My code:
customer["Name"] = customer["Name"].str.replace(r"\([^()]{1,3}", "", regex=True)
Output:
Name
Company ABC aysia) Ltd
Company HIJ (M) Ltd
Company KLM (M) Ltd
My desire output:
Name
Company ABC (Malaysia) Ltd
Company HIJ (M) Ltd
Company KLM (M) Ltd
Is it possible to do this? I have tried different approaches and still can't get my desired output.
Assuming there won't be nested parentheses in the garbage section, you can anchor the replacement to the end of the string.
import pandas as pd
df = pd.DataFrame({"company": ["Company ABC (Malaysia) Ltd", "Company HIJ (M) Ltd (B12", "Company KLM (M) Ltd ("]})
df["company"] = df["company"].str.replace(r"\s*\([^()]*?$", "", regex=True)
print(df)
company
0 Company ABC (Malaysia) Ltd
1 Company HIJ (M) Ltd
2 Company KLM (M) Ltd

Removing everything after a char in a dataframe

If I have the following dataframe 'countries':
country info
england london-europe
scotland edinburgh-europe
china beijing-asia
unitedstates washington-north_america
I would like to take the info field and remove everything after the '-', to get:
country info
england london
scotland edinburgh
china beijing
unitedstates washington
How do I do this?
Try:
countries['info'] = countries['info'].str.split('-').str[0]
Output:
country info
0 england london
1 scotland edinburgh
2 china beijing
3 unitedstates washington
You just need to keep the first part of the string after a split on the dash character:
countries['info'] = countries['info'].str.split('-').str[0]
Or, equivalently, you can use
countries['info'] = countries['info'].str.split('-').map(lambda x: x[0])
You can also use str.extract with pattern r"(\w+)(?=\-)"
Ex:
print(df['info'].str.extract(r"(\w+)(?=\-)"))
Output:
info
0 london
1 edinburgh
2 beijing
3 washington
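Another vectorized option on the same sample data is str.replace with a regex that drops the dash and everything after it:

```python
import pandas as pd

countries = pd.DataFrame({'country': ['england', 'scotland', 'china', 'unitedstates'],
                          'info': ['london-europe', 'edinburgh-europe',
                                   'beijing-asia', 'washington-north_america']})

# delete the dash and everything that follows it
countries['info'] = countries['info'].str.replace(r'-.*$', '', regex=True)
print(countries['info'].tolist())
```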

Replace only last occurrence of column value in DataFrame

I've a DataFrame with a Company column.
Company
-------------------------------
Tundra Corporation Art Limited
Desert Networks Incorporated
Mount Yellowhive Security Corp
Carter, Rath and Mueller Limited (USD/AC)
Barrows corporation /PACIFIC
Corporation, Mounted Security
I've a dictionary with regexes to normalize the company entities.
(^|\s)corporation(\s|$); Corp
(^|\s)Limited(\s|$); LTD
(^|\s)Incorporated(\s|$); INC
...
I need to normalize only the last occurrence. This is my desired output.
Company
-------------------------------
Tundra Corporation Art LTD
Desert Networks INC
Mount Yellowhive Security Corp
Carter, Rath and Mueller LTD (USD/AC)
Barrows Corp /PACIFIC
Corp, Mounted Security
(Only normalize Limited, and not Corporation, for: Tundra Corporation Art Limited.)
My code:
for k, v in entity_dict.items():
    df['Company'].replace(regex=True, inplace=True, to_replace=re.compile(k, re.I), value=v)
Is it possible to only change the last occurrence of an entity (do i need to change my regex)?
Change (\s|$) to ($) so the entity matches only at the end of the string:
entity_dict = {'(^|\s)corporation($)': ' Corp',
               '(^|\s)Limited($)': ' LTD',
               '(^|\s)Incorporated($)': ' INC'}
for k, v in entity_dict.items():
    df['Company'].replace(regex=True, inplace=True, to_replace=re.compile(k, re.I), value=v)
print(df)
Company
0 Tundra Corporation Art LTD
1 Desert Networks INC
2 Mount Yellowhive Security Corp
EDIT: You can simplify the dictionary to plain words with no regex, build a lowercase version of it, use Series.str.findall to collect all entity matches, take the last one with .str[-1], map it through the lowercase dict, and finally replace in a list comprehension:
entity_dict = {'corporation': 'Corp',
               'Limited': 'LTD',
               'Incorporated': 'INC'}
lower = {k.lower(): v for k, v in entity_dict.items()}

s1 = df['Company'].str.findall('|'.join(lower.keys()), flags=re.I).str[-1].fillna('')
s2 = s1.str.lower().map(lower).fillna('')
df['Company'] = [a.replace(b, c) for a, b, c in zip(df['Company'], s1, s2)]
print(df)
print (df)
Company
0 Tundra Corporation Art LTD
1 Desert Networks INC
2 Mount Yellowhive Security Corp
3 Carter, Rath and Mueller LTD (USD/AC)
4 Barrows Corp /PACIFIC
5 Corp, Mounted Security
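Another sketch (using the same three entities) that stays closer to the per-pattern approach: locate the last match with re.finditer and splice the replacement in by position, which guarantees only the final occurrence changes even when the entity is not at the end of the string:

```python
import re

import pandas as pd

entity_dict = {'corporation': 'Corp', 'limited': 'LTD', 'incorporated': 'INC'}
pattern = re.compile('|'.join(entity_dict), re.I)

def normalize_last(name):
    # find every entity occurrence, then rewrite only the last one in place
    matches = list(pattern.finditer(name))
    if not matches:
        return name
    m = matches[-1]
    return name[:m.start()] + entity_dict[m.group().lower()] + name[m.end():]

df = pd.DataFrame({'Company': ['Tundra Corporation Art Limited',
                               'Barrows corporation /PACIFIC']})
df['Company'] = df['Company'].map(normalize_last)
print(df['Company'].tolist())
```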

Pandas - how to remove spaces in each column in a dataframe?

I'm trying to remove spaces, apostrophes, and double quotes from each column's data using this for loop:
for c in data.columns:
    data[c] = data[c].str.strip().replace(',', '').replace('\'', '').replace('\"', '').strip()
but I keep getting this error:
AttributeError: 'Series' object has no attribute 'strip'
data is the data frame and was obtained from an excel file
xl = pd.ExcelFile('test.xlsx');
data = xl.parse(sheetname='Sheet1')
Am I missing something? I added the str but that didn't help. Is there a better way to do this?
I don't want to reference columns by label, like data['column label'], because the labels can differ. I would like to iterate over each column and remove the characters mentioned above.
incoming data:
id city country
1 Ontario Canada
2 Calgary ' Canada'
3 'Vancouver Canada
desired output:
id city country
1 Ontario Canada
2 Calgary Canada
3 Vancouver Canada
UPDATE: using your sample DF:
In [80]: df
Out[80]:
id city country
0 1 Ontario Canada
1 2 Calgary ' Canada'
2 3 'Vancouver Canada
In [81]: df.replace(r'[,\"\']','', regex=True).replace(r'\s*([^\s]+)\s*', r'\1', regex=True)
Out[81]:
id city country
0 1 Ontario Canada
1 2 Calgary Canada
2 3 Vancouver Canada
OLD answer:
you can use DataFrame.replace() method:
In [75]: df.to_dict('r')
Out[75]:
[{'a': ' x,y ', 'b': 'a"b"c', 'c': 'zzz'},
{'a': "x'y'z", 'b': 'zzz', 'c': ' ,s,,'}]
In [76]: df
Out[76]:
a b c
0 x,y a"b"c zzz
1 x'y'z zzz ,s,,
In [77]: df.replace(r'[,\"\']','', regex=True).replace(r'\s*([^\s]+)\s*', r'\1', regex=True)
Out[77]:
a b c
0 xy abc zzz
1 xyz zzz s
r'\1' is a backreference to the first numbered capturing RegEx group
data[c] does not return a single value; it returns a Series (a whole column of data), which has no plain strip method. Apply the strip operation to the entire column instead, either through the vectorized .str accessor or with df.apply.
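A sketch of the loop the asker was after, on the question's sample city/country data, using the vectorized .str methods and skipping non-string columns such as 'id':

```python
import pandas as pd

data = pd.DataFrame({'id': [1, 2, 3],
                     'city': ['Ontario', 'Calgary', "'Vancouver"],
                     'country': ['Canada', " ' Canada'", 'Canada']})

for c in data.columns:
    if data[c].dtype == object:  # only string columns have .str methods
        data[c] = (data[c].str.replace(r"[,'\"]", '', regex=True)  # drop , ' "
                          .str.strip())                            # then trim spaces
print(data)
```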
