How do I remove multiple spaces between two words in Python?
e.g.:
"Bertug     Mete" (multiple blanks between the words)
should become
"Bertug Mete"
The input is read from an .xls file. I have tried using split() but it doesn't seem to work as expected.
import pandas as pd , string , re
dataFrame = pd.read_excel("C:\\Users\\Bertug\\Desktop\\example.xlsx")
#names1 = ''.join(dataFrame.Name.to_string().split())
print(type(dataFrame.Name))
#print(dataFrame.Name.str.split())
Let me know where I'm going wrong.
I think you can use replace with a regex:
df.Name = df.Name.replace(r'\s+', ' ', regex=True)
Sample:
df = pd.DataFrame({'Name':['Bertug    Mete','a','Joe    Black']})
print (df)
             Name
0  Bertug    Mete
1               a
2    Joe    Black
df.Name = df.Name.replace(r'\s+', ' ', regex=True)
#similar solution
#df.Name = df.Name.str.replace(r'\s+', ' ', regex=True)
print (df)
          Name
0  Bertug Mete
1            a
2    Joe Black
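Because r'\s+' matches a run of any whitespace, tabs and newlines inside a name are collapsed too; a minimal sketch:

```python
import pandas as pd

# r'\s+' matches runs of any whitespace, so tabs and newlines
# embedded in the name collapse to a single space as well.
df = pd.DataFrame({'Name': ['Bertug    Mete', 'Joe\t\tBlack', 'Ann\nLee']})
df.Name = df.Name.replace(r'\s+', ' ', regex=True)
print(df.Name.tolist())  # ['Bertug Mete', 'Joe Black', 'Ann Lee']
```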
Related
I would like to know how to write a formula that would identify/display records of string/object data type on a Pandas DataFrame that contains leading or trailing spaces.
The purpose for this is to get an audit on a Jupyter notebook of such records before applying any strip functions.
The goal is for the script to identify these records automatically without having to type the name of the columns manually. The scope should be any column of str/object data type that contains a value that includes either a leading or trailing spaces or both.
Please notice. I would like to see the resulting output in a dataframe format.
Thank you!
Link to sample dataframe data
You can use:
df['col'].str.startswith(' ')
df['col'].str.endswith(' ')
or with a regex:
df['col'].str.match(r'\s+')
df['col'].str.contains(r'\s+$')
Example:
df = pd.DataFrame({'col': [' abc', 'def', 'ghi ', ' jkl ']})
df['start'] = df['col'].str.startswith(' ')
df['end'] = df['col'].str.endswith(' ')
df['either'] = df['start'] | df['end']
col start end either
0 abc True False True
1 def False False False
2 ghi False True True
3 jkl True True True
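The two checks can also be collapsed into a single contains call with an alternation; a small sketch:

```python
import pandas as pd

# One regex alternation flags leading OR trailing whitespace in one pass.
s = pd.Series([' abc', 'def', 'ghi ', ' jkl '])
mask = s.str.contains(r'^\s+|\s+$')
print(mask.tolist())  # [True, False, True, True]
```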
However, this is likely not faster than directly stripping the spaces:
df['col'] = df['col'].str.strip()
col
0 abc
1 def
2 ghi
3 jkl
updated answer
To detect the columns with leading/trailing spaces, you can use:
cols = df.astype(str).apply(lambda c: c.str.contains(r'^\s+|\s+$')).any()
cols[cols].index
example on the provided link:
Index(['First Name', 'Team'], dtype='object')
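Since the linked sample file isn't reproduced here, a self-contained sketch of the same detection on made-up data:

```python
import pandas as pd

# Made-up frame standing in for the linked sample: two object columns
# carry stray whitespace, the numeric column does not.
df = pd.DataFrame({'First Name': [' Jake', 'Anna'],
                   'Team': ['Reds ', 'Blues'],
                   'Age': [21, 33]})
cols = df.astype(str).apply(lambda c: c.str.contains(r'^\s+|\s+$')).any()
print(cols[cols].index.tolist())  # ['First Name', 'Team']
```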
I have texts in one column and a respective dictionary in another column. I have tokenized the text and want to replace any token that matches a key in the corresponding dictionary. The text and the dictionary are specific to each record of a pandas DataFrame.
import pandas as pd
data =[['1','i love mangoes',{'love':'hate'}],['2', 'its been a long time we have not met',{'met':'meet'}],['3','i got a call from one of our friends',{'call':'phone call','one':'couple of'}]]
df = pd.DataFrame(data, columns = ['id', 'text','dictionary'])
The final output DataFrame should be
data = [['1','i hate mangoes'],['2', 'its been a long time we have not meet'],['3','i got a phone call from couple of of our friends']]
df = pd.DataFrame(data, columns=['id', 'modified_text'])
I am using Python 3 in a windows machine
You can use the dict.get method after zipping the two columns and splitting each sentence:
df['modified_text']=([' '.join([b.get(i,i) for i in a.split()])
for a,b in zip(df['text'],df['dictionary'])])
print(df)
Output:
id text \
0 1 i love mangoes
1 2 its been a long time we have not met
2 3 i got a call from one of our friends
dictionary \
0 {'love': 'hate'}
1 {'met': 'meet'}
2 {'call': 'phone call', 'one': 'couple of'}
modified_text
0 i hate mangoes
1 its been a long time we have not meet
2 i got a phone call from couple of of our friends
I added spaces around the keys and values to distinguish a whole word from part of one:
def replace(text, mapping):
    new_s = text
    for key in mapping:
        k = ' ' + key + ' '
        val = ' ' + mapping[key] + ' '
        new_s = new_s.replace(k, val)
    return new_s
df_out = (df.assign(modified_text=lambda f:
              f.apply(lambda row: replace(row.text, row.dictionary), axis=1))
          [['id', 'modified_text']])
print(df_out)
id modified_text
0 1 i hate mangoes
1 2 its been a long time we have not met
2 3 i got a phone call from couple of of our friends
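Note that the space-padding trick only matches words that have a space on both sides, so a key at the very start or end of the sentence is skipped (which is why 'met' stays unchanged above). A hedged sketch of a word-boundary regex alternative:

```python
import re

# Sketch: \b word boundaries match tokens at the start and end of the
# string too, unlike padding each key with spaces on both sides.
def replace_words(text, mapping):
    for key, val in mapping.items():
        text = re.sub(r'\b%s\b' % re.escape(key), val, text)
    return text

print(replace_words('its been a long time we have not met', {'met': 'meet'}))
# its been a long time we have not meet
```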
I have a dataframe which has some duplicate tags separated by commas in the "Tags" column. Is there a way to remove the duplicate strings from the series? I want the output in row 400 to have just Museum, Drinking, Shopping.
I can't simply split on the comma and drop repeated words, because some tags in the series contain similar words, for example [Museum, Art Museum, Shopping]: splitting and dropping every 'Museum' string would affect the distinct 'Art Museum' tag.
Desired Output
You can split on the comma and convert to a set(), which removes duplicates, after removing leading/trailing whitespace with str.strip(). Then you can apply() this to your column (note that a set does not preserve the original tag order):
df['Tags']=df['Tags'].apply(lambda x: ', '.join(set([y.strip() for y in x.split(',')])))
You can create a function that removes duplicates from a given string, then apply this function to your column Tags. dict.fromkeys drops duplicates while preserving the original order:
def remove_dup(strng):
    '''
    Split the string on ', ', drop duplicate tags, and join back.
    '''
    return ', '.join(list(dict.fromkeys(strng.split(', '))))
df['Tags'] = df['Tags'].apply(lambda x: remove_dup(x))
DEMO:
import pandas as pd
my_dict = {'Tags':["Museum, Art Museum, Shopping, Museum",'Drink, Drink','Shop','Visit'],'Country':['USA','USA','USA', 'USA']}
df = pd.DataFrame(my_dict)
df['Tags'] = df['Tags'].apply(lambda x: remove_dup(x))
df
Output:
Tags Country
0 Museum, Art Museum, Shopping USA
1 Drink USA
2 Shop USA
3 Visit USA
Since no code sample was given, I've thrown together something that should work.
import pandas as pd
test = [['Museum', 'Art Museum', 'Shopping', "Museum"]]
df = pd.DataFrame()
df[0] = test
df[0] = df.applymap(set)
Out[35]:
0
0 {Museum, Shopping, Art Museum}
One approach that avoids apply
# in your code just s = df['Tags']
s = pd.Series(['','', 'Tour',
'Outdoors, Beach, Sports',
'Museum, Drinking, Drinking, Shopping'])
(s.str.split(r',\s+', expand=True)
.stack()
.reset_index()
.drop_duplicates(['level_0',0])
.groupby('level_0')[0]
.agg(','.join)
)
Output:
level_0
0
1
2 Tour
3 Outdoors,Beach,Sports
4 Museum,Drinking,Shopping
Name: 0, dtype: object
There may be fancier ways of doing this kind of thing, but this will do the job.
Make it lower-case:
data['tags'] = data['tags'].str.lower()
Split every row in the tags column on the comma; this returns a list of strings:
data['tags'] = data['tags'].str.split(',')
Map str.strip over every element of the list (removing leading/trailing spaces), then apply set to drop the duplicate words:
data['tags'] = data['tags'].apply(lambda x: set(map(str.strip, x)))
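The steps above, chained on a small made-up frame:

```python
import pandas as pd

# Lower-case, split on comma, strip each piece, then dedupe with set
# (set order is arbitrary, so compare as a set rather than a string).
data = pd.DataFrame({'tags': ['Museum, Drinking, Drinking, Shopping']})
data['tags'] = (data['tags'].str.lower()
                            .str.split(',')
                            .apply(lambda x: set(map(str.strip, x))))
print(data['tags'][0] == {'museum', 'drinking', 'shopping'})  # True
```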
I am trying to write (to an Excel file) / print a DataFrame whose column values contain newlines (\n).
>>> textbox = ""
>>> textbox+=str('So, so you think you can tell \n')
>>> textbox+=str('Heaven from hell')
>>> print textbox
So, so you think you can tell
Heaven from hell
>>> df1 = pd.DataFrame({'lyric':[textbox]})
>>> df1
lyric
0 So, so you think you can tell \nHeaven from hell
>>> print df1
lyric
0 So, so you think you can tell \nHeaven from hell
So when I print the df or write it to an Excel file, I see the literal "\n" printed instead of a line break. How do I get the line break?
I think you need to create a list from the string by splitting on \n, then strip leading and trailing whitespace:
splitlines solution:
df1 = pd.DataFrame({'lyric':textbox.splitlines()})
df1.lyric = df1.lyric.str.strip()
print (df1)
lyric
0 So, so you think you can tell
1 Heaven from hell
split solution:
print (textbox.split('\n'))
['So, so you think you can tell ', 'Heaven from hell']
df1 = pd.DataFrame({'lyric':textbox.split('\n')})
df1.lyric = df1.lyric.str.strip()
print (df1)
lyric
0 So, so you think you can tell
1 Heaven from hell
strip by list comprehension:
df1 = pd.DataFrame({'lyric':[x.strip() for x in textbox.split('\n')]})
print (df1)
lyric
0 So, so you think you can tell
1 Heaven from hell
EDIT:
If you instead want to keep a single row and just drop the newline, I think you need replace:
df1 = pd.DataFrame({'lyric':[textbox]})
df1.lyric = df1.lyric.str.replace('\n', '')
print (df1)
lyric
0 So, so you think you can tell Heaven from hell
titanic_df['Embarked'] = titanic_df['Embarked'].fillna("S")
titanic_df is a DataFrame and Embarked is a column name. I have missing cells in my column, i.e. blank spaces, and I want to put "S" in the missing places, but the code mentioned above is not working. Please help me.
I think you need replace, because fillna only fills real NaN values, not blank strings:
titanic_df['Embarked'] = titanic_df['Embarked'].replace(" ", "S")
Sample:
import pandas as pd
titanic_df = pd.DataFrame({'Embarked':['a','d',' ']})
print (titanic_df)
Embarked
0 a
1 d
2
titanic_df['Embarked'] = titanic_df['Embarked'].replace(" ", "S")
print (titanic_df)
Embarked
0 a
1 d
2 S
You can also use str.replace with a regex if you need to replace one or more whitespace characters.
Here ^ anchors the start of the string and $ the end, so ^\s+$ matches cells that contain only whitespace:
titanic_df = pd.DataFrame({'Embarked':['a ',' d',' ', ' ']})
print (titanic_df)
Embarked
0 a
1 d
2
3
titanic_df['Embarked'] = titanic_df['Embarked'].str.replace(r"^\s+$", "S", regex=True)
#same output
#titanic_df['Embarked'] = titanic_df['Embarked'].replace(r"^\s+$", "S", regex=True)
print (titanic_df)
Embarked
0 a
1 d
2 S
3 S
Or you could use apply:
titanic_df['Embarked'] = titanic_df['Embarked'].apply(lambda x: "S" if x == " " else x)
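For completeness, a sketch of why the original fillna("S") did nothing: fillna only fills real NaN cells, so whitespace-only strings have to be converted to NaN first (column name reused from the question, data made up):

```python
import pandas as pd
import numpy as np

# Blank strings are not NaN, so fillna skips them; convert them first.
titanic_df = pd.DataFrame({'Embarked': ['a', ' ', np.nan]})
titanic_df['Embarked'] = (titanic_df['Embarked']
                          .replace(r'^\s*$', np.nan, regex=True)
                          .fillna('S'))
print(titanic_df['Embarked'].tolist())  # ['a', 'S', 'S']
```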