Remove * from a specific column value - python

For this dataframe, what is the best way to get rid of the * in "Stad Brussel*"? In the real dataframe, the * also appears as a superscript. Please refer to the pic. Thanks.
            Dutch name  postcode  Population
0           Anderlecht      1070      118241
1             Oudergem      1160       33313
2  Sint-Agatha-Berchem      1082       24701
3        Stad Brussel*      1000      176545
4            Etterbeek      1040       47414
Desired results:
            Dutch name  postcode  Population
0           Anderlecht      1070      118241
1             Oudergem      1160       33313
2  Sint-Agatha-Berchem      1082       24701
3         Stad Brussel      1000      176545
4            Etterbeek      1040       47414

You can try:
df['Dutch name'] = df['Dutch name'].replace({r'\*': ''}, regex=True)
This will remove all * characters in the 'Dutch name' column. If you need to remove the character from multiple columns, use:
df = df.replace({r'\*': ''}, regex=True)
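As a quick sanity check, here is a minimal, self-contained sketch built from the sample data above, showing the replacement end to end:

import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'Dutch name': ['Anderlecht', 'Oudergem', 'Sint-Agatha-Berchem',
                   'Stad Brussel*', 'Etterbeek'],
    'postcode': [1070, 1160, 1082, 1000, 1040],
    'Population': [118241, 33313, 24701, 176545, 47414],
})

# Escape the * so it is treated as a literal character, not a regex quantifier
df['Dutch name'] = df['Dutch name'].replace({r'\*': ''}, regex=True)
print(df)  # row 3 now reads 'Stad Brussel'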

If you are manipulating plain strings, you can use regular expression matching with the re module.
Something like :
import re
txt = 'Your file as a string here'
out = re.sub(r'\*', '', txt)
out now contains what you want.

For a DataFrame, first define the column(s) to be checked:
cols_to_check = ['Dutch name']
then:
df[cols_to_check] = df[cols_to_check].replace({r'\*': ''}, regex=True)

Related

How to identify records in a DataFrame (Python/Pandas) that contains leading or trailing spaces

I would like to know how to write a formula that identifies/displays records of string/object data type in a Pandas DataFrame that contain leading or trailing spaces.
The purpose of this is to get an audit, in a Jupyter notebook, of such records before applying any strip functions.
The goal is for the script to identify these records automatically, without having to type the column names manually. The scope should be any column of str/object data type that contains a value with leading or trailing spaces, or both.
Please note: I would like to see the resulting output in dataframe format.
Thank you!
Link to sample dataframe data
You can use:
df['col'].str.startswith(' ')
df['col'].str.endswith(' ')
or with a regex:
df['col'].str.match(r'\s+')
df['col'].str.contains(r'\s+$')
Example:
df = pd.DataFrame({'col': [' abc', 'def', 'ghi ', ' jkl ']})
df['start'] = df['col'].str.startswith(' ')
df['end'] = df['col'].str.endswith(' ')
df['either'] = df['start'] | df['end']
     col  start    end  either
0    abc   True  False    True
1    def  False  False   False
2   ghi   False   True    True
3   jkl    True   True    True
However, this is likely not faster than directly stripping the spaces:
df['col'] = df['col'].str.strip()
col
0 abc
1 def
2 ghi
3 jkl
updated answer
To detect the columns with leading/trailing spaces, you can use:
cols = df.astype(str).apply(lambda c: c.str.contains(r'^\s+|\s+$')).any()
cols[cols].index
example on the provided link:
Index(['First Name', 'Team'], dtype='object')
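Putting the detection together on made-up data (the column names below are illustrative, chosen to mirror the output quoted above; the real sample lives behind the link):

import pandas as pd

# Illustrative frame standing in for the linked sample file
df = pd.DataFrame({
    'First Name': [' Alice', 'Bob', 'Carol '],
    'Team': ['Red ', 'Blue', 'Green'],
    'Age': [34, 28, 45],
})

# True wherever a cell has a leading or trailing space
mask = df.astype(str).apply(lambda c: c.str.contains(r'^\s+|\s+$'))

cols = mask.any()
print(cols[cols].index)       # Index(['First Name', 'Team'], dtype='object')

# Audit view requested in the question: the offending rows, as a dataframe
print(df[mask.any(axis=1)])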

Pandas : Changing a column of dataset from string to integer

One Column of my dataset is like this:
0 10,000+
1 500,000+
2 5,000,000+
3 50,000,000+
4 100,000+
Name: Installs, dtype: object
and I want to change these 'xxx,yyy,zzz+' strings to integers.
first I tried this function:
df['Installs'] = pd.to_numeric(df['Installs'])
and I got this error:
ValueError: Unable to parse string "10,000" at position 0
and then I tried to remove '+' and ',' with this method:
df['Installs'] = df['Installs'].str.replace('+','',regex = True)
df['Installs'] = df['Installs'].str.replace(',','',regex = True)
but nothing changed!
How can I convert these strings to integers?
With regex=True, the + (plus) character is interpreted specially, as a regex quantifier. You can either disable regular expression replacement (regex=False), or even better, change your regular expression to match + or , and remove them at once:
df['Installs'] = df['Installs'].str.replace('[+,]', '', regex=True).astype(int)
Output:
>>> df['Installs']
0       10000
1      500000
2     5000000
3    50000000
4      100000
Name: Installs, dtype: int64
A bare + is not a valid regex pattern ("nothing to repeat"); alternatively, strip every non-digit character:
df['Installs'] = pd.to_numeric(df['Installs'].str.replace(r'\D', '', regex=True))
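A minimal end-to-end reproduction using the sample values from the question:

import pandas as pd

df = pd.DataFrame({'Installs': ['10,000+', '500,000+', '5,000,000+',
                                '50,000,000+', '100,000+']})

# '[+,]' is a character class matching a literal '+' or ','
df['Installs'] = df['Installs'].str.replace(r'[+,]', '', regex=True).astype(int)
print(df['Installs'].dtype)  # int64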

Number formatting after mapping?

I have a data frame with a number column, such as:
CompteNum
100
200
300
400
500
and a file with the mapping of all these numbers to other numbers, which I import into Python and convert into a dictionary:
{100: 1, 200: 2, 300: 3, 400: 4, 500: 5}
And I am creating a second column in the data frame that combines both numbers in the format df number + dict number: from 100 to 1001, and so on...
## dictionary
accounts = pd.read_excel("mapping-accounts.xlsx")
accounts = accounts[['G/L Account #','FrMap']]
accounts = accounts.set_index('G/L Account #').to_dict()['FrMap']
## data frame --> CompteNum is the Number Column
df['CompteNum'] = df['CompteNum'].map(accounts).astype(str) + df['CompteNum'].astype(str)
The problem is that my output is then 100.01.0 instead of 1001, which creates additional manual work in the output Excel file. I have tried:
df['CompteNum'] = df['CompteNum'].str.replace('.0', '')
but it doesn't delete ALL the zeros, and I would want the additional ones deleted too. Any suggestions?
The problem is missing values for non-matched keys after map; a possible solution is:
print (df)
CompteNum
0 100
1 200
2 300
3 400
4 500
5 40
accounts1 = {100: 1, 200:2, 300:3, 400:4, 500:5}
s = df['CompteNum'].astype(str)
# map gives NaN for unmatched values; drop them before casting to int
s1 = df['CompteNum'].map(accounts1).dropna().astype(int).astype(str)
# the string concat aligns on index; unmatched rows become NaN and fall back to s
df['CompteNum'] = (s + s1).fillna(s)
print (df)
CompteNum
0 1001
1 2002
2 3003
3 4004
4 5005
5 40
Your solution should be changed to a regex replace: use $ to anchor at the end of the string and escape the ., because . is a special regex character (it matches any character):
df['CompteNum'] = df['CompteNum'].str.replace(r'\.0$', '', regex=True)
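A quick sketch of why the unescaped pattern misbehaves, using the problematic value from the question:

import pandas as pd

s = pd.Series(['100.01.0'])
# Unescaped, '.' matches ANY character, so far too much is removed:
print(s.str.replace('.0', '', regex=True)[0])     # '01'
# Escaped and anchored, only a literal '.0' at the very end is removed:
print(s.str.replace(r'\.0$', '', regex=True)[0])  # '100.01'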

Pandas: How to extract columns on new columns which contain special separators?

My data frame has some columns which contain digits and words. Before the digits and words there are sometimes special characters like ">*".
The columns are mostly divided by , or /. Based on the separators, I want to split the values into new columns and then delete the original one.
Here I reproduce my dataframe and my code:
import pandas as pd

d = {'error': [
'test,121',
'123',
'test,test',
'>errrI1GB,213',
'*errrI1GB,213',
'*errrI1GB/213',
'*>errrI1GB/213',
'>*errrI1GB,213',
'>test, test',
'>>test, test',
'>>:test,test',
]}
df = pd.DataFrame(data=d)
df['error'] = df['error'].str.replace(' ', '')
df[['error1', 'error2']] = df['error'].str.extract(r'.*?(\w*)[,|/](\w*)')
df
So far my approach is first to remove the whitespaces with
df['error'] = df['error'].str.replace(' ', '')
Then I constructed my regex with this help:
https://regex101.com/r/UHzTOq/13
.*?(\w*)[,|/](\w*)
Afterwards I delete the messy column with:
df.drop(columns=['error'], inplace=True)
Single values in a row are not considered, therefore I get NaN as a result. How can I include them in my regex?
A solution is:
df[['error1', 'error2']] = df['error'].str.extract(r'^[>*:]*(.*?)(?:[,/](.*))?$')
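A quick check of that pattern against the sample frame from the question (after the whitespace strip):

import pandas as pd

d = {'error': ['test,121', '123', 'test,test', '>errrI1GB,213',
               '*errrI1GB,213', '*errrI1GB/213', '*>errrI1GB/213',
               '>*errrI1GB,213', '>test, test', '>>test, test', '>>:test,test']}
df = pd.DataFrame(data=d)
df['error'] = df['error'].str.replace(' ', '')

# [>*:]* strips the leading junk; the optional (?:...)? group lets single
# values such as '123' land in error1 with NaN in error2
df[['error1', 'error2']] = df['error'].str.extract(r'^[>*:]*(.*?)(?:[,/](.*))?$')
print(df)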
Assuming that we'd like rows with only a test or a 123 to end up in the error1 column, maybe we'd just slightly modify your original expression:
^.*?(\w*)\s*(?:[,|/]\s*(\w*))?\s*$
I'm pretty sure there should be other easier ways though.
Test
import pandas as pd
d = {'error': [
'test,121',
'123',
'test',
'test,test',
'>errrI1GB,213',
'*errrI1GB,213',
'*errrI1GB/213',
'*>errrI1GB/213',
'>*errrI1GB,213',
'>test, test',
'>>test, test',
'>>:test,test',
]}
df = pd.DataFrame(data=d)
df['error1'] = df['error'].str.replace(r'(?mi)^.*?(\w*)\s*(?:[,|/]\s*(\w*))?\s*$', r'\1', regex=True)
df['error2'] = df['error'].str.replace(r'(?mi)^.*?(\w*)\s*(?:[,|/]\s*(\w*))?\s*$', r'\2', regex=True)
print(df)
The expression is explained in the top right panel of regex101.com, if you wish to explore/simplify/modify it, and the linked demo shows how it matches against some sample inputs.
Output
             error    error1 error2
0         test,121      test    121
1              123       123
2             test      test
3        test,test      test   test
4    >errrI1GB,213  errrI1GB    213
5    *errrI1GB,213  errrI1GB    213
6    *errrI1GB/213  errrI1GB    213
7   *>errrI1GB/213  errrI1GB    213
8   >*errrI1GB,213  errrI1GB    213
9      >test, test      test   test
10    >>test, test      test   test
11    >>:test,test      test   test
RegEx Circuit: jex.im visualizes regular expressions (diagram omitted).

Is there a way in pandas to remove duplicates from within a series?

I have a dataframe which has some duplicate tags, separated by commas, in the "Tags" column. Is there a way to remove the duplicate strings from each value? I want the output in row 400 to have just Museum, Drinking, Shopping.
I can't simply split on the comma and drop repeated words, because some tags in the series contain similar words: for example, with [Museum, Art Museum, Shopping], splitting and dropping multiple 'Museum' strings would affect the distinct 'Art Museum' string.
Desired Output
You can split on the comma and convert to a set(), which removes duplicates, after removing leading/trailing whitespace with str.strip(). Then df.apply() this to your column. Note that a set does not preserve the original order of the tags.
df['Tags'] = df['Tags'].apply(lambda x: ', '.join(set(y.strip() for y in x.split(','))))
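For instance, a minimal sketch (note the resulting order is arbitrary):

import pandas as pd

df = pd.DataFrame({'Tags': ['Museum, Drinking, Drinking, Shopping']})
df['Tags'] = df['Tags'].apply(lambda x: ', '.join(set(y.strip() for y in x.split(','))))
print(df['Tags'][0])  # e.g. 'Shopping, Museum, Drinking' -- set order is not guaranteed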
You can create a function that removes duplicates from a given string. Then apply this function to your column Tags.
def remove_dup(strng):
    '''
    Split the string on ', ' and rejoin the unique parts,
    preserving their original order (dict.fromkeys keeps insertion order).
    '''
    return ', '.join(dict.fromkeys(strng.split(', ')))
df['Tags'] = df['Tags'].apply(remove_dup)
DEMO:
import pandas as pd
my_dict = {'Tags':["Museum, Art Museum, Shopping, Museum",'Drink, Drink','Shop','Visit'],'Country':['USA','USA','USA', 'USA']}
df = pd.DataFrame(my_dict)
df['Tags'] = df['Tags'].apply(remove_dup)
df
Output:
                           Tags Country
0  Museum, Art Museum, Shopping     USA
1                         Drink     USA
2                          Shop     USA
3                         Visit     USA
Without a code example in the question, I've thrown together something that should work.
import pandas as pd
test = [['Museum', 'Art Museum', 'Shopping', 'Museum']]
df = pd.DataFrame()
df[0] = test
df[0] = df[0].apply(set)
Out[35]:
                                0
0  {Museum, Shopping, Art Museum}
One approach that avoids apply
# in your code just s = df['Tags']
s = pd.Series(['','', 'Tour',
'Outdoors, Beach, Sports',
'Museum, Drinking, Drinking, Shopping'])
(s.str.split(r',\s+', expand=True)
.stack()
.reset_index()
.drop_duplicates(['level_0',0])
.groupby('level_0')[0]
.agg(','.join)
)
Output:
level_0
0
1
2                        Tour
3       Outdoors,Beach,Sports
4    Museum,Drinking,Shopping
Name: 0, dtype: object
There may be much fancier ways of doing this kind of thing, but this will do the job.
Make it lower-case:
data['tags'] = data['tags'].str.lower()
Split every row in the tags column on the comma; this returns a list of strings:
data['tags'] = data['tags'].str.split(',')
Map str.strip over every element of the list (removing leading/trailing spaces), then apply set to drop the duplicates:
data['tags'] = data['tags'].apply(lambda x: set(map(str.strip, x)))
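The same steps condensed into a self-contained sketch (the data frame and its tags column are illustrative names carried over from the steps above):

import pandas as pd

data = pd.DataFrame({'tags': ['Museum, Drinking, Drinking, Shopping',
                              'Outdoors, Beach, Sports']})

data['tags'] = data['tags'].str.lower()     # 1. normalise case
data['tags'] = data['tags'].str.split(',')  # 2. one list of strings per row
data['tags'] = data['tags'].apply(lambda x: set(map(str.strip, x)))  # 3-4. strip and dedupe
print(data)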
