Remove extra spaces between columns - python

I got the output below:
sports(6 spaces)mourinho keen to tie up long-term de gea deal
opinion(5 spaces)the reality of north korea as a nuclear power
How can I make these become sports(1 space)... and opinion(1 space)... when I write to a .txt file?
Here is my code:
the_frame = pdsql.read_sql_query("SELECT category, title FROM training;", conn)
pd.set_option('display.max_colwidth', -1)
print(the_frame)
the_frame = the_frame.replace(r'\s+', ' ', regex=True)  # tried to remove multiple spaces
base_filename = 'Values.txt'
with open(os.path.join(base_filename), 'w') as outfile:
    df = pd.DataFrame(the_frame)
    df.to_string(outfile, index=False, header=False)

I think your solution is nice; it only needs simplifying. I also tested it with multiple tabs, and it works nicely too.
the_frame = pdsql.read_sql_query("SELECT category, title FROM training;", conn)
the_frame = the_frame.replace(r'\s+', ' ', regex=True)
base_filename = 'Values.txt'
the_frame.to_csv(base_filename, index=False, header=False)
Sample:
the_frame = pd.DataFrame({
    'A': ['sports      mourinho keen to tie up long-term de gea deal',
          'opinion     the reality of north korea as a nuclear power'],
    'B': list(range(2))
})
print(the_frame)
                                                   A  B
0  sports      mourinho keen to tie up long-term...  0
1  opinion     the reality of north korea as a n...  1
the_frame = the_frame.replace(r'\s+', ' ', regex=True)
print(the_frame)
                                                   A  B
0  sports mourinho keen to tie up long-term de ge...  0
1  opinion the reality of north korea as a nuclea...  1
EDIT: I believe you need to join both columns with a space and write the output to the file without a sep parameter.
the_frame = pd.DataFrame({
    'category': {0: 'sports', 1: 'sports', 2: 'opinion', 3: 'opinion', 4: 'opinion'},
    'title': {0: 'mourinho keen to tie up long-term de gea deal',
              1: 'suarez fires barcelona nine clear in sociedad fightback',
              2: 'the reality of north korea as a nuclear power',
              3: 'the real fire fury',
              4: 'opposition and dr mahathir'}
})
print(the_frame)
  category                                              title
0   sports     mourinho keen to tie up long-term de gea deal
1   sports  suarez fires barcelona nine clear in sociedad ...
2  opinion      the reality of north korea as a nuclear power
3  opinion                                 the real fire fury
4  opinion                         opposition and dr mahathir
the_frame = the_frame['category'] + ' ' + the_frame['title']
print(the_frame)
0    sports mourinho keen to tie up long-term de ge...
1    sports suarez fires barcelona nine clear in so...
2    opinion the reality of north korea as a nuclea...
3                            opinion the real fire fury
4                    opinion opposition and dr mahathir
dtype: object
base_filename = 'Values.txt'
the_frame.to_csv(base_filename, index=False, header=False)
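To sanity-check the result, reading the file back should show the single-spaced lines (a quick sketch using the same Values.txt as above):
with open('Values.txt') as f:
    print(f.read())
# expected, from the sample data:
# sports mourinho keen to tie up long-term de gea deal
# sports suarez fires barcelona nine clear in sociedad fightback
# ...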

You can try the following. Instead of
the_frame = the_frame.replace(r'\s+', ' ', regex=True)
use the .str accessor on a column (note that .str works on a Series, not on the whole DataFrame):
the_frame['title'] = the_frame['title'].str.replace(r'\s+', ' ', regex=True)  # this will remove multiple whitespaces
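If you want the same collapse applied to every column of the frame at once, a minimal sketch (assuming all columns hold strings):
# .str works per-Series, so apply the replacement column by column
the_frame = the_frame.apply(lambda col: col.str.replace(r'\s+', ' ', regex=True))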

Applying a custom function to each row in a column in a dataframe

I have a bit of code which pulls the latitude and longitude for a location. It is here:
address = 'New York University'
url = 'https://nominatim.openstreetmap.org/search/' + urllib.parse.quote(address) +'?format=json'
response = requests.get(url).json()
print(response[0]["lat"])
print(response[0]["lon"])
I want to apply this as a function to a long column of "address".
I've seen loads of questions about 'apply' and 'map', but they're almost all simple math examples.
Here is what I tried last night:
def locate(address):
    response = requests.get(url).json()
    print(response[0]["lat"])
    print(response[0]["lon"])
    return

df['lat'] = df['lat'].map(locate)
df['lon'] = df['lon'].map(locate)
This ended up just applying the first row lat / lon to the entire csv.
What is the best method to turn the code into a custom function and apply it to each row?
Thanks in advance.
EDIT: Thank you @PacketLoss for your assistance. I'm getting an IndexError: list index out of range, but it does work on his sample dataframe.
Here is the read_csv I used to pull in the data:
df = pd.read_csv('C:\\Users\\CIHAnalyst1\\Desktop\\InstitutionLocations.csv', sep=',', error_bad_lines=False, index_col=False, dtype='unicode', encoding = "utf-8", warn_bad_lines=False)
Here is a text copy of the rows from the dataframe:
address
0 GRAND CANYON UNIVERSITY
1 SOUTHERN NEW HAMPSHIRE UNIVERSITY
2 WESTERN GOVERNORS UNIVERSITY
3 FLORIDA INTERNATIONAL UNIVERSITY - UNIVERSITY ...
4 PENN STATE UNIVERSITY UNIVERSITY PARK
... ...
4292 THE ART INSTITUTES INTERNATIONAL LLC
4293 INTERCOAST - ONLINE
4294 CAROLINAS COLLEGE OF HEALTH SCIENCES
4295 DYERSBURG STATE COMMUNITY COLLEGE COVINGTON
4296 ULTIMATE MEDICAL ACADEMY - NY
You need to return your values from your function, or nothing will happen.
We can use apply here and pass the address from the df as well.
data = {'address': ['New York University', 'Sydney Opera House', 'Paris', 'SupeRduperFakeAddress']}
df = pd.DataFrame(data)
def locate(row):
    url = 'https://nominatim.openstreetmap.org/search/' + urllib.parse.quote(row['address']) + '?format=json'
    response = requests.get(url).json()
    if response:
        row['lat'] = response[0]['lat']
        row['lon'] = response[0]['lon']
    return row

df = df.apply(locate, axis=1)
Outputs
address lat lon
0 New York University 40.72925325 -73.99625393609625
1 Sydney Opera House -33.85719805 151.21512338473752
2 Paris 48.8566969 2.3514616
3 SupeRduperFakeAddress NaN NaN
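A note on the IndexError mentioned in the question's EDIT: one plausible cause (an assumption, not confirmed from the question) is that Nominatim throttles or blocks bulk anonymous requests and returns an empty list for some rows. Nominatim's usage policy asks for an identifying User-Agent and roughly one request per second, so a more defensive variant of the same function might look like this:
import time
import urllib.parse
import requests

def locate(row):
    url = ('https://nominatim.openstreetmap.org/search/'
           + urllib.parse.quote(row['address']) + '?format=json')
    # identify the client; anonymous requests are often rejected
    response = requests.get(url, headers={'User-Agent': 'my-geocoding-script'}).json()
    if response:
        row['lat'] = response[0]['lat']
        row['lon'] = response[0]['lon']
    time.sleep(1)  # stay under the ~1 request/second limit
    return row
Here 'my-geocoding-script' is a placeholder; use something that identifies your application.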

Pandas: Remove all words from specific list within dataframe strings in large dataset

So I have three pandas dataframes (train_original, train_augmented, test). Overall it is about 700k lines. And I would like to remove all cities from a cities list, common_cities. But tqdm in the notebook cell suggests it would take about 24 hrs to replace everything from a list of 33,000 cities.
dataframe example (train_original):

id  name_1                            name_2
0   sun blinds decoration paris inc.  indl de cuautitlan sa cv
1   eih ltd. dongguan wei shi         plastic new york product co., ltd.
2   jsh ltd. (hk) mexico city         arab shipbuilding seoul and repair yard madrid c

common_cities list example:

common_cities = ['moscow', 'madrid', 'san francisco', 'mexico city']

what the output is supposed to be:

id  name_1                      name_2
0   sun blinds decoration inc.  indl de sa cv
1   eih ltd. wei shi            plastic product co., ltd.
2   jsh ltd. (hk)               arab shipbuilding and repair yard c
My solution worked well on a small list of filter words, but when the list is large, the performance is low.
%%time
for city in tqdm(common_cities):
    train_original.replace(re.compile(fr'\b({city})\b'), '', inplace=True)
    train_augmented.replace(re.compile(fr'\b({city})\b'), '', inplace=True)
    test.replace(re.compile(fr'\b({city})\b'), '', inplace=True)
P.S.: I presume it's not great to split the strings and substitute city names with a list comprehension, because a city name could be two or more words.
Any suggestions, ideas on approach to make a quick replacement on Pandas Dataframes in such situations?
Instead of iterating over the huge dfs once per city, remember that pandas replace accepts dictionaries with all the replacements to be done in a single call.
Therefore we can start by creating the dictionary and then using it with replace. Note that the cities are substrings inside each cell rather than whole cell values, so the keys need to be regex patterns and the call needs regex=True:
replacements = {fr'\b{city}\b': '' for city in common_cities}
train_original = train_original.replace(replacements, regex=True)
train_augmented = train_augmented.replace(replacements, regex=True)
test = test.replace(replacements, regex=True)
Edit: reading the documentation, it might be even easier, because replace also accepts lists of patterns to be replaced:
train_original = train_original.replace(common_cities, '', regex=True)
train_augmented = train_augmented.replace(common_cities, '', regex=True)
test = test.replace(common_cities, '', regex=True)
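If replace is still slow with 33,000 patterns, a further option (a sketch, assuming the name columns are called name_1 and name_2 as in the example) is to compile all cities into a single alternation and make one pass per column with str.replace, so each string is scanned once rather than once per city:
import re

# one word-bounded alternation of all cities; re.escape guards metacharacters
pattern = r'\b(?:' + '|'.join(map(re.escape, common_cities)) + r')\b'

for frame in (train_original, train_augmented, test):
    for col in ('name_1', 'name_2'):
        frame[col] = frame[col].str.replace(pattern, '', regex=True)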

Creating pandas df columns with list items of uneven length?

I have a list of addresses that I would like to put into a dataframe where each row is a new address and the columns are the units of the address (title, street, city).
However, the way the list is structured, some addresses are longer than others. For example:
address = ['123 Some Street, City','45 Another Place, PO Box 123, City']
I have a pandas dataframe with the following columns:
Index  Court    Address                            Zipcode  Phone
0      Court 1  123 Court Dr, Springfield          12345    11111
1      Court 2  45 Court Pl, PO Box 45, Pawnee     54321    11111
2      Court 3  1725 Slough Ave, Scranton          18503    11111
3      Court 4  101 Court Ter, Unit 321, Eagleton  54322    11111
I would like to split the Address column into up to three columns depending on how many comma separators there are in the address, with NaN filling in where values will be missing.
For example, I hope the data will look like this:
Index  Court    Address          Address2   City         Zip  Phone
0      Court 1  123 Court Dr     NaN        Springfield  ...  ...
1      Court 2  45 Court Pl      PO Box 45  Pawnee       ...  ...
2      Court 3  1725 Slough Ave  NaN        Scranton     ...  ...
3      Court 4  101 Court Ter    Unit 321   Eagleton     ...  ...
I have plowed through and tried a ton of different solutions on StackOverflow to no avail. The closest I have gotten is with this code:
df2 = pd.concat([df, df['Address'].str.split(', ', expand=True)], axis=1)
But that returns a dataframe that adds the following three columns to the end structured as such:
...  0                1            2
...  123 Court Dr     Springfield  None
...  45 Court Pl      PO Box 45    Pawnee
This is close, but as you can see, for the shorter entries, the city lines up with the second address line for the longer entries.
Ideally, column 2 should populate every single row with a city, and column 1 should alternate between "None" and the second address line if applicable.
I hope this makes sense -- this is a tough one to put into words. Thanks!
You could do something like this:
df['Address1'] = df['Address'].str.split(',').str[0]
df['Address2'] = df['Address'].str.extract(',(.*),')
df['City'] = df['Address'].str.split(',').str[-1]
Addresses, especially those produced by human input, can be tricky. But if your addresses only fit those two formats, this will work:
Note: If there is an additional format you have to account for, this will print the culprit.
def split_address(df):
    for index, row in df.iterrows():
        full_address = row['Address']
        if full_address.count(',') == 2:  # street, unit/box, city
            split = full_address.split(', ')
            # assign via df.loc, since rows yielded by iterrows are copies
            df.loc[index, 'address_1'] = split[0]
            df.loc[index, 'address_2'] = split[1]
            df.loc[index, 'city'] = split[2]
        elif full_address.count(',') == 1:  # street, city
            split = full_address.split(', ')
            df.loc[index, 'address_1'] = split[0]
            df.loc[index, 'city'] = split[1]
        else:
            print("address does not fit known formats {0}".format(full_address))
Essentially the two things that should help you are str.count(), which tells you the number of commas in a string, and str.split(), which you already found, and which splits the input into a list. You can reference the portions of this list to allocate the pieces to the correct columns. For a vectorized alternative, see the sketch below.
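For completeness, a vectorized sketch of the same allocation (assuming at most one optional middle segment, as in the sample data): split once, then take the first part as the street, the last part as the city, and the middle part only where it exists:
parts = df['Address'].str.split(', ')
df['Address1'] = parts.str[0]
df['City'] = parts.str[-1]
df['Address2'] = parts.apply(lambda p: p[1] if len(p) == 3 else None)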
You can look into creating a function using the package usaddress. It has been very helpful for me when I need to split address into parts:
import usaddress
df = pd.DataFrame(['123 Main St. Suite 100 Chicago, IL', '123 Main St. PO Box 100 Chicago, IL'], columns=['Address'])
Then create functions for how you want to split the data:
def Address1(x):
    try:
        data = usaddress.tag(x)
        if 'AddressNumber' in data[0].keys() and 'StreetName' in data[0].keys() and 'StreetNamePostType' in data[0].keys():
            return data[0]['AddressNumber'] + ' ' + data[0]['StreetName'] + ' ' + data[0]['StreetNamePostType']
    except:
        pass

def Address2(x):
    try:
        data = usaddress.tag(x)
        if 'OccupancyType' in data[0].keys() and 'OccupancyIdentifier' in data[0].keys():
            return data[0]['OccupancyType'] + ' ' + data[0]['OccupancyIdentifier']
        elif 'USPSBoxType' in data[0].keys() and 'USPSBoxID' in data[0].keys():
            return data[0]['USPSBoxType'] + ' ' + data[0]['USPSBoxID']
    except:
        pass

def PlaceName(x):
    try:
        data = usaddress.tag(x)
        if 'PlaceName' in data[0].keys():
            return data[0]['PlaceName']
    except:
        pass
df['Address1'] = df.apply(lambda x: Address1(x['Address']), axis=1)
df['Address2'] = df.apply(lambda x: Address2(x['Address']), axis=1)
df['City'] = df.apply(lambda x: PlaceName(x['Address']), axis=1)
out:

                               Address      Address1    Address2     City
0   123 Main St. Suite 100 Chicago, IL  123 Main St.   Suite 100  Chicago
1  123 Main St. PO Box 100 Chicago, IL  123 Main St.  PO Box 100  Chicago

Replace whole string if it contains substring in pandas dataframe based on dictionary key

I am trying to replace data in column 'Place' with data from the dictionary I created. The column 'Place' contains a substring (not case sensitive) of the dictionary key. I cannot get either of my methods to work; any guidance is appreciated.
incoming_df = pd.DataFrame({
    'First_Name': ['John', 'Chris', 'renzo', 'Laura', 'Stan', 'Russ', 'Lip', 'Hick', 'Donald'],
    'Last_Name': ['stanford', 'lee', 'Olivares', 'Johnson', 'Stanley', 'Russaford', 'Lipper', 'Hero', 'Lipsey'],
    'location': ['Grant Elementary', 'Code Academy', 'Queen Prep', 'Waves College', 'duke Prep',
                 'california Academy', 'SF College Prep', 'San Ramon Prep', 'San Jose High']})

df = pd.DataFrame({'FirstN': [], 'LastN': [], 'Place': []})

# re-index based on data given
df = df.reindex(incoming_df.index)

# copy data over to new dataframe
df['LastN'] = incoming_df.loc[:, incoming_df.columns.str.contains('Last', case=False)]
df['FirstN'] = incoming_df.loc[:, incoming_df.columns.str.contains('First', case=False)]
df['Place'] = incoming_df.loc[:, incoming_df.columns.str.contains('School|Work|Site|Location', case=False)]

places = {'Grant': 'DEF Grant Elementary',
          'Code': 'DEF Code Academy',
          'Queen': 'DEF Queen Preparatory High School',
          'Waves': 'DEF Waves College Prep',
          'Duke': 'DEF Duke Preparatory Institute',
          'California': 'DEF California Academy',
          'SF College': 'DEF San Francisco College',
          'San Ramon': 'DEF San Ramon Prep',
          'San Jose': 'DEF San Jose High School'}

# replace dictionary values with values in Place (results in NaN values inside 'Place' column)
pat = r'({})'.format('|'.join(places.keys()))
extracted = df.Place.str.extract(pat, expand=False).dropna()
df['Place'] = extracted.apply(lambda x: places[x])

# Also tried this method but it did not work
df['Place'] = df['Place'].replace(places)
# original df
FirstN LastN Place
0 John stanford Grant Elementary
1 Chris lee Code Academy
2 renzo Olivares Queen Prep
3 Laura Johnson Waves College
4 Stan Stanley duke Prep
5 Russ Russaford california Academy
6 Lip Lipper SF College Prep
7 Hick Hero San Ramon Prep
8 Donald Lipsey San Jose High
# target df
FirstN LastN Place
0 John Stanford DEF Grant Elementary
1 Chris Lee DEF Code Academy
2 Renzo Olivares DEF Queen Preparatory High School
3 Laura Johnson DEF Waves College Prep
4 Stan Stanley DEF Duke Preparatory Institute
5 Russ Russaford DEF California Academy
6 Lip Lipper DEF San Francisco College
7 Hick Hero DEF San Ramon Prep
8 Donald Lipsey DEF San Jose High School
Using this loop solved my issue
for k, v in dic.items():
    df['Place'] = np.where(df['Place'].str.contains(k, case=False), v, df['Place'])
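One caveat with this loop: str.contains treats k as a regular expression by default, so a key containing regex metacharacters (dots, parentheses, etc.) would misbehave. A safer sketch escapes each key first:
import re
import numpy as np

for k, v in dic.items():
    mask = df['Place'].str.contains(re.escape(k), case=False)
    df['Place'] = np.where(mask, v, df['Place'])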
Using a list comprehension, and making use of next to short-circuit and avoid wasted iteration:
df.assign(Place=[next((v for k, v in dic.items() if u in k.lower()), None) for u in df.User])
Place User
0 Heights College arenzo
1 Queens University brenzo
2 York Academy crenzo
3 Danes Institute drenzo
4 Duke University erenzo
Using apply and loc:
for key, value in dic.items():
    df.loc[df['Place'].apply(lambda x: x in key.lower()), 'Place'] = value
This is challenging given the string mismatch on 'Place'. Some naive workarounds:
1) You can utilize an index mapping, reformatting your dict to:
dic = {0: 'Heights College',
       1: 'Queens University',
       2: 'York Academy',
       3: 'Danes Institute',
       4: 'Duke University'}
Then use a map from your dict to df index:
df['Place'] = df.index.to_series().map(dic)
2) Alternatively, if your User column is unique, you could edit your dic to map user to place and then use a similar map, which performs a lookup from User into your dict and returns the place:
dic = {'arenzo': 'Heights College',
       'brenzo': 'Queens University',
       'crenzo': 'York Academy',
       'drenzo': 'Danes Institute',
       'erenzo': 'Duke University'}

df['Place'] = df['User'].map(dic)

Use dictionary to replace a string within a string in Pandas columns

I am trying to use dictionary keys to replace strings in a pandas column with their values. However, each column contains sentences. Therefore, I must first tokenize the sentences and detect whether a word in the sentence corresponds to a key in my dictionary, then replace the string with the corresponding value.
However, the result I keep getting is None. Is there a better, more pythonic way to approach this problem?
Here is my MCVE for the moment. In the comments, I specify where the issue is happening.
import pandas as pd

data = {'Categories': ['animal', 'plant', 'object'],
        'Type': ['tree', 'dog', 'rock'],
        'Comment': ['The NYC tree is very big', 'The cat from the UK is small', 'The rock was found in LA.']
        }
ids = {'Id': ['NYC', 'LA', 'UK'],
       'City': ['New York City', 'Los Angeles', 'United Kingdom']}

df = pd.DataFrame(data)
ids = pd.DataFrame(ids)

def col2dict(ids):
    data = ids[['Id', 'City']]
    idDict = data.set_index('Id').to_dict()['City']
    return idDict

def replaceIds(data, idDict):
    ids = idDict.keys()
    types = idDict.values()
    data['commentTest'] = data['Comment']
    words = data['commentTest'].apply(lambda x: x.split())
    for (i, word) in enumerate(words):
        # Here we can see that the words appear
        print word
        print ids
        if word in ids:
            # Here we can see that they are not being recognized. What happened?
            print ids
            print word
            words[i] = idDict[word]
    data['commentTest'] = ' '.apply(lambda x: ''.join(x))
    return data

idDict = col2dict(ids)
results = replaceIds(df, idDict)
Results:
None
I am using Python 2.7, and when I print out the dict, the keys show u'' Unicode prefixes.
My expected outcome is:
Categories Comment Type commentTest
0 animal The NYC tree is very big tree The New York City tree is very big
1 plant The cat from the UK is small dog The cat from the United Kingdom is small
2 object The rock was found in LA. rock The rock was found in Los Angeles.
You can create a dictionary and then replace:
ids = {'Id': ['NYC', 'LA', 'UK'],
       'City': ['New York City', 'Los Angeles', 'United Kingdom']}
ids = dict(zip(ids['Id'], ids['City']))
print(ids)
{'UK': 'United Kingdom', 'LA': 'Los Angeles', 'NYC': 'New York City'}
df['commentTest'] = df['Comment'].replace(ids, regex=True)
print(df)
Categories Comment Type \
0 animal The NYC tree is very big tree
1 plant The cat from the UK is small dog
2 object The rock was found in LA. rock
commentTest
0 The New York City tree is very big
1 The cat from the United Kingdom is small
2 The rock was found in Los Angeles.
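One caution about replace(ids, regex=True): each key is treated as an unanchored pattern, so a short id like 'LA' would also match inside longer words (e.g. 'PLAN'). If that is a risk in your data, a word-bounded variant (a sketch) is:
import re

ids_bounded = {r'\b{}\b'.format(re.escape(k)): v for k, v in ids.items()}
df['commentTest'] = df['Comment'].replace(ids_bounded, regex=True)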
It's actually much faster to use str.replace() than replace(), even though str.replace() requires a loop:
ids = {'NYC': 'New York City', 'LA': 'Los Angeles', 'UK': 'United Kingdom'}
for old, new in ids.items():
    df['Comment'] = df['Comment'].str.replace(old, new, regex=False)
# Categories Type Comment
# 0 animal tree The New York City tree is very big
# 1 plant dog The cat from the United Kingdom is small
# 2 object rock The rock was found in Los Angeles
The only time replace() outperforms a str.replace() loop is with small dataframes.
The timing functions for reference:
def Series_replace(df):
    df['Comment'] = df['Comment'].replace(ids, regex=True)
    return df

def Series_str_replace(df):
    for old, new in ids.items():
        df['Comment'] = df['Comment'].str.replace(old, new, regex=False)
    return df
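A minimal harness to compare the two yourself (a sketch; the 10000x repetition is an arbitrary assumption to make the frame large enough to measure):
import timeit

big = pd.concat([df] * 10000, ignore_index=True)
print(timeit.timeit(lambda: Series_replace(big.copy()), number=3))
print(timeit.timeit(lambda: Series_str_replace(big.copy()), number=3))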
Note that if ids is a dataframe instead of dictionary, you can get the same performance with itertuples():
ids = pd.DataFrame({'Id': ['NYC', 'LA', 'UK'], 'City': ['New York City', 'Los Angeles', 'United Kingdom']})
for row in ids.itertuples():
    df['Comment'] = df['Comment'].str.replace(row.Id, row.City, regex=False)
