How can I delete the numbers in my "Kecamatan" column?
You can remove the leading digits, dot, and space using a regular expression, as follows:
import pandas as pd
df = pd.DataFrame({
    'Kecamatan': ['010. Donomulyo', '020. Kalipare', '030. Pagak', '040. Bankur']
})
df['Kecamatan'] = df['Kecamatan'].str.replace(r'\A[0-9. ]+', '', regex=True)
print(df)
   Kecamatan
0  Donomulyo
1   Kalipare
2      Pagak
3     Bankur
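If you prefer to avoid regex, `str.lstrip` with an explicit character set does the same job here (a sketch; note it would also eat a name that itself starts with a digit, which this sample data does not have):

```python
import pandas as pd

df = pd.DataFrame({
    'Kecamatan': ['010. Donomulyo', '020. Kalipare', '030. Pagak', '040. Bankur']
})

# strip any leading digits, dots, and spaces (mirrors the regex character class)
df['Kecamatan'] = df['Kecamatan'].str.lstrip('0123456789. ')
print(df['Kecamatan'].tolist())  # ['Donomulyo', 'Kalipare', 'Pagak', 'Bankur']
```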
I have a pandas dataframe of postcodes which have been concatenated with the two-letter country code. Some of these are Brazilian postcodes and I want to replace the last three characters of any postcode which starts with 'BR' with '000'.
import pandas as pd
data = ['BR86037-890', 'GBBB7', 'BR86071-570','BR86200-000','BR86026-480','BR86082-701', 'GBCW9', 'NO3140']
df = pd.DataFrame(data, columns=['postcode'])
I have tried the below, but it is not changing any of the postcodes:
if df['postcode'].str.startswith('BR').all():
    df["postcode"] = df["postcode"].str.replace(r'.{3}$', '000')
Use str.replace with a capturing group:
df['postcode'] = df['postcode'].str.replace(r'(BR.*)...', r'\g<1>000', regex=True)
# or, more generic
df['postcode'] = df['postcode'].str.replace(r'(BR.*).{3}', r'\g<1>'+'0'*3, regex=True)
Output:

      postcode
0  BR86037-000
1        GBBB7
2  BR86071-000
3  BR86200-000
4  BR86026-000
5  BR86082-000
6        GBCW9
7       NO3140
The code is not working because df['postcode'].str.startswith('BR').all() returns a single boolean indicating whether every postcode starts with 'BR'. Since some do not, it is False, so the replacement inside the if never runs.
Try this instead, replacing only the masked rows (note regex=True, which newer pandas versions require for pattern-based replacement):

data = ['BR86037-890', 'GBBB7', 'BR86071-570','BR86200-000','BR86026-480','BR86082-701', 'GBCW9', 'NO3140']
df = pd.DataFrame(data, columns=['postcode'])
mask = df['postcode'].str.startswith('BR')
df.loc[mask, 'postcode'] = df.loc[mask, 'postcode'].str.replace(r'.{3}$', '000', regex=True)
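A quick sketch showing both points together: the all() check is False on this data (which is why the original if branch is skipped), while the mask updates only the BR rows:

```python
import pandas as pd

data = ['BR86037-890', 'GBBB7', 'BR86071-570', 'BR86200-000',
        'BR86026-480', 'BR86082-701', 'GBCW9', 'NO3140']
df = pd.DataFrame(data, columns=['postcode'])

mask = df['postcode'].str.startswith('BR')
print(mask.all())  # False -- so the original `if` body never executed

# replace the last three characters only on the masked rows
df.loc[mask, 'postcode'] = df.loc[mask, 'postcode'].str.replace(r'.{3}$', '000', regex=True)
print(df['postcode'].tolist())
```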
I want to clean the following Pandas dataframe column in a single, more efficient statement than the multi-step code below.
Input:

                  string
0  ['string', '#string']
1            ['#string']
2                     []

Output:

            string
0  string, #string
1          #string
2              NaN
Code:
import pandas as pd
import numpy as np
d = {"string": ["['string', '#string']", "['#string']", "[]"]}
df = pd.DataFrame(d)
df['string'] = df['string'].astype(str).str.strip('[]')
df['string'] = df['string'].replace("\'", "", regex=True)
df['string'] = df['string'].replace(r'^\s*$', np.nan, regex=True)
print(df)
You can use

df['string'] = df['string'].astype(str).str.replace(r"^\[+|\]+$|'", '', regex=True).replace(r'^\s*$', np.nan, regex=True)

Details:

^\[+ - one or more [ chars at the start of the string
\]+$ - one or more ] chars at the end of the string
' - any ' char

str.replace strips all of those matches out, and the chained Series.replace then turns any string left empty (or whitespace-only) into np.nan.
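As a quick check, here is the cleanup run end to end on the sample frame, written as two chained replaces so the empty-string-to-NaN step is explicit:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"string": ["['string', '#string']", "['#string']", "[]"]})

# strip brackets and quotes, then convert empty results to NaN
df['string'] = (df['string'].astype(str)
                .str.replace(r"^\[+|\]+$|'", '', regex=True)
                .replace(r'^\s*$', np.nan, regex=True))
print(df['string'].tolist())  # ['string, #string', '#string', nan]
```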
I would need to change the name of my indices:
Country Date (other columns)
/link1/subpath2/Text by Poe/
/link1/subpath2/Text by Wilde/
/link1/subpath2/Text by Whitman/
Country and Date are my indices. I would like to extract the words Poe, Wilde and Whitman from index column Country in order to have
Country Date (other columns)
Poe
Wilde
Whitman
Currently I am doing it one by one:
df=df.rename(index={'/link1/subpath2/Text by Poe/': 'Poe'})
df=df.rename(index={'/link1/subpath2/Text by Wilde/': 'Wilde'})
df=df.rename(index={'/link1/subpath2/Text by Whitman/': 'Whitman'})
It works, but since I have hundreds of datasets, doing this by hand is not doable.
You can use str.replace:
df['Country'] = df['Country'].str.replace(r'/link1/subpath2/Text by ', '')
df['Country'] = df['Country'].str.replace(r'/', '')
If 'Country' is an Index you can do as follows:
df = df.set_index('Country')
df.index = df.index.str.replace(r'/link1/subpath2/Text by ', '')
If it's a MultiIndex you can use .reset_index:
df = df.reset_index()
df['Country'] = df['Country'].str.replace(r'/link1/subpath2/Text by ', '')
You can always use regex pattern if things get more complicated:
import re
import pandas as pd
df = pd.DataFrame(['foo', 'bar', 'z'], index=['/link1/subpath2/Text by Poe/',
'/link1/subpath2/Text by Wilde/',
'/link1/subpath2/Text by Whitman/'])
name_pattern = re.compile(r'by (\w+)/')
df.index = [name_pattern.findall(idx)[0] for idx in df.index]
df
where name_pattern captures the word between 'by ' and the trailing '/'
You can use str.extract with a pattern that catches the last word with (\w*), preceded by a whitespace \s and followed by the / at the end of the line ($). Because it is an index, you need to rebuild it with MultiIndex.from_arrays.
df.index = pd.MultiIndex.from_arrays(
    [df.index.get_level_values(0).str.extract(r'\s(\w*)/$')[0],
     df.index.get_level_values(1)],
    names=['Country', 'Dates'])
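A self-contained sketch of that MultiIndex rebuild (the Dates values here are made up for illustration):

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples([
    ('/link1/subpath2/Text by Poe/', '2020'),
    ('/link1/subpath2/Text by Wilde/', '2021'),
    ('/link1/subpath2/Text by Whitman/', '2022'),
], names=['Country', 'Dates'])
df = pd.DataFrame({'col': [1, 2, 3]}, index=idx)

# extract the word before the trailing slash from level 0, keep level 1 as-is
df.index = pd.MultiIndex.from_arrays(
    [df.index.get_level_values(0).str.extract(r'\s(\w*)/$')[0],
     df.index.get_level_values(1)],
    names=['Country', 'Dates'])
print(df.index.get_level_values('Country').tolist())  # ['Poe', 'Wilde', 'Whitman']
```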
I have an ID column containing a grave accent, like `1234ABC40, and I want to remove just that character while keeping the column in the dataframe.
I tried the following on the column only. The file is read in as x and has multiple columns; id is the column I want to fix.
x = pd.read_csv(r'C:\filename.csv', index_col=False)
id = str(x['id'])
id2 = unidecode.unidecode(id)
id3 = id2.replace('`', '')
This converts the column to a str, but I want that column back in dataframe form.
DataFrames have their own replace() function. Note, for partial replacements you must enable regex=True in the parameters:
import pandas as pd
d = {'id': ["12`3", "32`1"], 'id2': ["004`", "9`99"]}
df = pd.DataFrame(data=d)
df["id"] = df["id"].replace('`','', regex=True)
print(df)
    id   id2
0  123  004`
1  321  9`99
I have a column with addresses, and sometimes it has characters I want to remove: ' (apostrophe), " (double quote), and , (comma).
I would like to remove all of these characters in one shot. I'm using pandas, and this is the code I have so far to replace one of them:
test['Address 1'].map(lambda x: x.replace(',', ''))
Is there a way to modify these code so I can replace these characters in one shot? Sorry for being a noob, but I would like to learn more about pandas and regex.
Your help will be appreciated!
You can use str.replace:
test['Address 1'] = test['Address 1'].str.replace(r"[\"\',]", '', regex=True)
Sample:
import pandas as pd
test = pd.DataFrame({'Address 1': ["'aaa",'sa,ss"']})
print(test)
  Address 1
0      'aaa
1    sa,ss"
test['Address 1'] = test['Address 1'].str.replace(r"[\"\',]", '', regex=True)
print(test)
  Address 1
0       aaa
1      sass
Here's the pandas solution:
To apply it to an entire dataframe, use df.replace. Don't forget to escape the apostrophe (\') inside the single-quoted pattern.
Example:
import pandas as pd
df = ...  # some dataframe
df.replace('\'', '', regex=True, inplace=True)
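For instance, with some made-up data containing apostrophes in two columns:

```python
import pandas as pd

df = pd.DataFrame({'Address 1': ["O'Hara St", "5th Ave"],
                   'name': ["Jim's Diner", "Acme"]})

# regex=True enables partial (substring) replacement inside every cell
df.replace('\'', '', regex=True, inplace=True)
print(df.values.tolist())  # [['OHara St', "Jims Diner"], ['5th Ave', 'Acme']]
```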