In Python, how to sort a dataframe containing accents?

I use sort_values to sort a dataframe. The dataframe contains UTF-8 characters with accents. Here is an example:
>>> df = pd.DataFrame ( [ ['i'],['e'],['a'],['é'] ] )
>>> df.sort_values(by=[0])
0
2 a
1 e
0 i
3 é
As you can see, the accented "é" is placed at the end instead of right after the unaccented "e".
Note that the real dataframe has several columns!

This is one way. The simplest solution, as suggested by @JonClements:
df = df.iloc[df[0].str.normalize('NFKD').argsort()]
An alternative, long-winded solution, with normalization code courtesy of @EdChum:
df = pd.DataFrame([['i'],['e'],['a'],['é']])
# remove accents into a temporary helper column
df[1] = df[0].str.normalize('NFKD')\
             .str.encode('ascii', errors='ignore')\
             .str.decode('utf-8')
# sort by new column, then drop
df = df.sort_values(1, ascending=True)\
       .drop(1, axis=1)
print(df)
0
2 a
1 e
3 é
0 i
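Since the real dataframe has several columns, the same idea can also be expressed with the key argument of sort_values (available in pandas 1.1 and later), which avoids the temporary column. This is a minimal sketch rather than a full locale-aware collation; the column names 'letter' and 'n' and the values in 'n' are made up for illustration.

```python
import pandas as pd

# Hypothetical two-column frame; column 'n' exists only to show that
# whole rows move together when sorting by the accented column.
df = pd.DataFrame({'letter': ['i', 'e', 'a', 'é'], 'n': [1, 2, 3, 4]})

# key receives the column as a Series and must return a Series of the
# same length; NFKD splits 'é' into 'e' plus a combining accent, so it
# sorts right after plain 'e'.
result = df.sort_values(by='letter', key=lambda s: s.str.normalize('NFKD'))
print(result)
```

The original values are kept in the result; only the comparison uses the normalized form.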

Related

How to replace last three characters of a string in a column if it starts with character

I have a pandas dataframe of postcodes which have been concatenated with the two-letter country code. Some of these are Brazilian postcodes and I want to replace the last three characters of any postcode which starts with 'BR' with '000'.
import pandas as pd
data = ['BR86037-890', 'GBBB7', 'BR86071-570','BR86200-000','BR86026-480','BR86082-701', 'GBCW9', 'NO3140']
df = pd.DataFrame(data, columns=['postcode'])
I have tried the below, but it is not changing any of the postcodes:
if df['postcode'].str.startswith('BR').all():
    df["postcode"] = df["postcode"].str.replace(r'.{3}$', '000')
Use str.replace with a capturing group:
df['postcode'] = df['postcode'].str.replace(r'(BR.*)...', r'\g<1>000', regex=True)
# or, more generic
df['postcode'] = df['postcode'].str.replace(r'(BR.*).{3}', r'\g<1>'+'0'*3, regex=True)
Output:
postcode
0 BR86037-000
1 GBBB7
2 BR86071-000
3 BR86200-000
4 BR86026-000
5 BR86082-000
6 GBCW9
7 NO3140
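The code is not working because df['postcode'].str.startswith('BR').all() returns a single boolean indicating whether all postcodes in the column start with 'BR'. Since some of them do not, the condition is False and the replacement never runs.
Try this instead, using a boolean mask to update only the matching rows:
data = ['BR86037-890', 'GBBB7', 'BR86071-570','BR86200-000','BR86026-480','BR86082-701', 'GBCW9', 'NO3140']
df = pd.DataFrame(data, columns=['postcode'])
mask = df['postcode'].str.startswith('BR')
# regex=True is needed: since pandas 2.0, str.replace treats the pattern literally by default
df.loc[mask, 'postcode'] = df.loc[mask, 'postcode'].str.replace(r'.{3}$', '000', regex=True)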
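As a regex-free variation on the mask approach, the last three characters can be dropped by slicing and '000' appended. A minimal sketch on a shortened version of the sample data, assuming every 'BR' postcode has at least three trailing characters to replace:

```python
import pandas as pd

data = ['BR86037-890', 'GBBB7', 'BR86071-570', 'NO3140']
df = pd.DataFrame(data, columns=['postcode'])

# Boolean Series: True only for rows whose postcode starts with 'BR'
mask = df['postcode'].str.startswith('BR')

# Drop the last three characters of the matching rows and append '000'
df.loc[mask, 'postcode'] = df.loc[mask, 'postcode'].str[:-3] + '000'
print(df)
```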

How to find and replace substrings at the end of column headers

I have the following columns, among others, in my dataframe: 'dom_pop', 'an_dom_n', 'an_dom_ncmplt'. Equivalent columns exist in multiple dataframes, with the name of the fuel changing. For example, in another dataframe they may be called 'pa_pop', 'an_pa_n', 'an_pa_ncmplt'. I want to append '_kwh' to these columns across all my dataframes.
I wrote the following code:
cols = ['_n$', '_ncmplt', '_pop']  # the $ is added to indicate a string ending in _n
filterfuel = 'kwh'
for c in cols:
    dfdom.columns = [col.replace(f'{c}', f'{c}_{filterfuel}') for col in dfdom.columns]
    dfpa.columns = [col.replace(f'{c}', f'{c}_{filterfuel}') for col in dfpa.columns]
    dfsw.columns = [col.replace(f'{c}', f'{c}_{filterfuel}') for col in dfsw.columns]
The suffix gets appended to the _ncmplt and _pop columns, but not to the _n column. If I remove the $, the _n column gets the suffix, but then _ncmplt looks like 'an_dom_n_kwh_cmplt'.
For dfdom, the corrected names should look like 'dom_pop_kwh', 'an_dom_n_kwh', 'an_dom_ncmplt_kwh'.
Why is $ not being recognised as an end-of-string anchor?
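Python's built-in str.replace does literal substring replacement, so '$' is treated as an ordinary character rather than as an end-of-string anchor. You can use np.where with a regex instead:
import numpy as np
cols = ['_n$', '_ncmplt', '_pop']
filterfuel = 'kwh'
pattern = fr"(?:{'|'.join(cols)})"
for df in [dfdom, dfpa, dfsw]:
    df.columns = np.where(df.columns.str.contains(pattern, regex=True),
                          df.columns + f"_{filterfuel}", df.columns)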
Output:
>>> pattern
'(?:_n$|_ncmplt|_pop)'
# dfdom = pd.DataFrame([[0]*4], columns=['dom_pop', 'an_dom_n', 'an_dom_ncmplt', 'hello'])
# After:
>>> dfdom
dom_pop_kwh an_dom_n_kwh an_dom_ncmplt_kwh hello
0 0 0 0 0
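Another option that keeps the end-of-string anchor is re.sub applied through DataFrame.rename. A minimal sketch using the sample frame from the output above; the helper name add_suffix is hypothetical, and the pattern assumes all three endings should match only at the very end of a column name:

```python
import re
import pandas as pd

dfdom = pd.DataFrame([[0] * 4],
                     columns=['dom_pop', 'an_dom_n', 'an_dom_ncmplt', 'hello'])
filterfuel = 'kwh'

# Literal replacement: the text '_n$' never occurs in 'an_dom_n',
# so the built-in str.replace leaves the name unchanged.
assert 'an_dom_n'.replace('_n$', '_n_kwh') == 'an_dom_n'

# Anchoring the whole alternation with $ matches only at the end of a name
pattern = re.compile(r'(_n|_ncmplt|_pop)$')

def add_suffix(col):
    # \g<1> re-inserts the matched ending before the new suffix
    return pattern.sub(rf'\g<1>_{filterfuel}', col)

dfdom = dfdom.rename(columns=add_suffix)
print(list(dfdom.columns))
```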

Subset string rows that contain a 'flexible' pattern

I have the following df.
data = [
['DWWWWD'],
['DWDW'],
['WDWWWWWWWWD'],
['DDW'],
['WWD'],
]
df = pd.DataFrame(data, columns=['letter_sequence'])
I want to subset the rows that contain the pattern 'D' + '[whichever number of W's]' + 'D'. Examples of rows I want in my output df: DWD, DWWWWWWWWWWWD, WWWWWDWDW...
I came up with the following, but it does not really work for 'whichever number of W's'.
df[df['letter_sequence'].str.contains(
'DWD|DWWD|DWWWD|DWWWWD|DWWWWWD|DWWWWWWD|DWWWWWWWD|DWWWWWWWWD', regex=True
)]
Desired output new_df:
letter_sequence
0 DWWWWD
1 DWDW
2 WDWWWWWWWWD
Any alternatives?
Use [W]{1,} to match one or more W. regex=True is the default for str.contains, so it can be omitted:
df = df[df['letter_sequence'].str.contains('D[W]{1,}D')]
print (df)
letter_sequence
0 DWWWWD
1 DWDW
2 WDWWWWWWWWD
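You can use the regex D\w+D. Note that \w matches any word character, not only W, so use DW+D if nothing but W may appear between the two Ds.
The code is shown below:
df = df[df['letter_sequence'].str.contains(r'D\w+D')]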
Please let me know if it helps.
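To compare the two patterns side by side, the sketch below runs both on the sample data plus one extra row, 'DDD', which is an assumption added here purely for illustration.

```python
import pandas as pd

# Sample sequences from the question, plus a hypothetical 'DDD' row
s = pd.Series(['DWWWWD', 'DWDW', 'WDWWWWWWWWD', 'DDW', 'WWD', 'DDD'])

strict = s.str.contains(r'DW+D')    # one or more W between two Ds
loose = s.str.contains(r'D\w+D')    # any word characters between two Ds

print(s[strict].tolist())
print(s[loose].tolist())
```

Only the looser pattern keeps 'DDD', because its middle D counts as a word character.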

Replace string in pandas dataframe if it contains specific substring

I have a dataframe generated from a .csv (I use Python 3.5). The df['category'] column contains only strings. I want to check this column and, if a string contains one of a set of specific substrings (anywhere in the string), replace the whole string. I am using this script:
import pandas as pd
df=pd.read_csv('lastfile.csv')
df.dropna(inplace=True)
g='Drugs'
z='Weapons'
c='Flowers'
df.category = df.category.str.lower().apply(lambda x: g if ('mdma' or 'xanax' or 'kamagra' or 'weed' or 'tabs' or 'lsd' or 'heroin' or 'morphine' or 'hci' or 'cap' or 'mda' or 'hash' or 'kush' or 'wax'or 'klonop'or\
'dextro'or'zepam'or'amphetamine'or'ketamine'or 'speed' or 'xtc' or 'XTC' or 'SPEED' or 'crystal' or 'meth' or 'marijuana' or 'powder' or 'afghan'or'cocaine'or'haze'or'pollen'or\
'sativa'or'indica'or'valium'or'diazepam'or'tablet'or'codeine'or \
'mg' or 'dmt'or'diclazepam'or'zepam'or 'heroin' ) in x else(z if ('weapon'or'milit'or'gun'or'grenades'or'submachine'or'rifle'or'ak47')in x else c) )
print(df['category'])
My problem is that some records do not get replaced even though they contain some of the substrings I defined. Is it a regex-related problem?
Thank you in advance.
Create a dictionary that maps each replacement string to its list of substrings. Loop over it, join each list with | (regex OR), test the column with str.contains, and set the matched rows with loc:
df = pd.DataFrame({'category':['sss mdma df','milit ss aa','aa ss']})
a = ['mdma', 'xanax' , 'kamagra']
b = ['weapon','milit','gun']
g='Drugs'
z='Weapons'
c='Flowers'
d = {g:a, z:b}
df['new_category'] = c
for k, v in d.items():
    pat = '|'.join(v)
    mask = df.category.str.contains(pat, case=False)
    df.loc[mask, 'new_category'] = k
print (df)
category new_category
0 sss mdma df Drugs
1 milit ss aa Weapons
2 aa ss Flowers
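For reference, the original lambda misses records because a chain such as ('mdma' or 'xanax' or ...) evaluates to its first truthy operand, so only 'mdma' is ever tested with in. A minimal sketch of the pitfall and of a plain-Python check using any(); the function name contains_any and the test strings are hypothetical:

```python
# A chain of non-empty strings joined by `or` collapses to the first one
keywords = ('mdma' or 'xanax' or 'kamagra')
print(keywords)

def contains_any(text, substrings):
    """Return True if at least one of the substrings occurs in text."""
    return any(sub in text for sub in substrings)

drugs = ['mdma', 'xanax', 'kamagra']
print(contains_any('cheap xanax tabs', drugs))
print(contains_any('red roses', drugs))
```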

How can I split a column into 2 in the correct way?

I am web-scraping tables from a website and writing them to an Excel file.
My goal is to split one column into two columns correctly.
The column I want to split is "FLIGHT".
I want this form:
First example: KL744 --> KL and 0744
Second example: BE1013 --> BE and 1013
So I need to separate the first 2 characters (into the first column) from the remaining characters, which can be 1 to 4 characters long. If there are 4, I keep them as they are; if 3, I want to put a 0 in front; if 2, I want to put 00 in front (so my goal is to always get 4 characters in the second column).
How can I do this?
Here is my relevant code, which already contains some formatting code.
df2 = pd.DataFrame(datatable,columns = cols)
df2["UPLOAD_TIME"] = datetime.now()
mask = np.column_stack([df2[col].astype(str).str.contains(r"Scheduled", na=True) for col in df2])
df3 = df2.loc[~mask.any(axis=1)]
if os.path.isfile("output.csv"):
    df1 = pd.read_csv("output.csv", sep=";")
    df4 = pd.concat([df1,df3])
    df4.to_csv("output.csv", index=False, sep=";")
else:
    df3.to_csv("output.csv", index=False, sep=";")
Here is a screenshot of my table in Excel:
You can use indexing with str with zfill:
df = pd.DataFrame({'FLIGHT':['KL744','BE1013']})
df['a'] = df['FLIGHT'].str[:2]
df['b'] = df['FLIGHT'].str[2:].str.zfill(4)
print (df)
FLIGHT a b
0 KL744 KL 0744
1 BE1013 BE 1013
I believe your code needs:
df2 = pd.DataFrame(datatable,columns = cols)
df2['a'] = df2['FLIGHT'].str[:2]
df2['b'] = df2['FLIGHT'].str[2:].str.zfill(4)
df2["UPLOAD_TIME"] = datetime.now()
...
...
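If the flight codes should be validated as well as split, str.extract with named groups is another option. A minimal sketch, assuming a code is exactly two leading characters followed by one to four digits; the group names carrier and number and the extra 'LH42' row are made up for illustration, and rows that do not match the pattern come back as NaN:

```python
import pandas as pd

# 'LH42' is a hypothetical extra row showing the two-digit case
df = pd.DataFrame({'FLIGHT': ['KL744', 'BE1013', 'LH42']})

# Named groups become the column names of the extracted frame
parts = df['FLIGHT'].str.extract(r'^(?P<carrier>[A-Z0-9]{2})(?P<number>\d{1,4})$')

df['a'] = parts['carrier']
# Left-pad the numeric part with zeros to a width of four
df['b'] = parts['number'].str.zfill(4)
print(df)
```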
