I have a dataframe which I want to sort via sort_values on one column.
The problem is that some of the words start with German umlauts, like Österreich and Zürich.
These currently sort as Zürich, Österreich,
but they should sort as Österreich, Zürich:
Ö should sort between N and O.
I have found out how to do this with plain Python lists using locale and strxfrm.
Can I do this somehow directly on the pandas dataframe?
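For reference, the list-based version mentioned above looks roughly like this (a minimal sketch; the locale name 'de_DE.UTF-8' is an assumption and varies by OS):
import locale
# assumed locale name; on Windows it might be 'German_Germany.1252'
locale.setlocale(locale.LC_COLLATE, 'de_DE.UTF-8')
words = ['Zürich', 'Österreich', 'Bern']
print(sorted(words, key=locale.strxfrm))  # ['Bern', 'Österreich', 'Zürich']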
Edit:
Thank you. Stef's example worked quite well, but my real-life dataframe contained numbers for which his version did not work, so I used alexey's idea.
I did the following; you can probably shorten this (a possible shorter variant is sketched after the code):
import pandas as pd

df = pd.DataFrame({'location': ['Österreich', 'Zürich', 'Bern', 254345],
                   'code': ['ö', 'z', 'b', 'v']})
# turn the index into a column for joining later
df = df.reset_index(drop=False)
# convert int to str
df['location'] = df['location'].astype(str)
# sort by location, handling umlauts via NFD normalization
df_sort_index = df['location'].str.normalize('NFD').sort_values(ascending=True).reset_index(drop=False)
# drop location so we don't have it in both tables
df = df.drop('location', axis=1)
# inner join on the index column
new_df = pd.merge(df_sort_index, df, how='inner', on='index')
# drop the index column again
new_df = new_df.drop('index', axis=1)
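A possibly shorter equivalent (a sketch, not the exact code I used; it relies on the key argument of sort_values, available since pandas 1.1, and converts to str and NFD-normalizes only for the comparison, keeping the original values):
import pandas as pd

df = pd.DataFrame({'location': ['Österreich', 'Zürich', 'Bern', 254345],
                   'code': ['ö', 'z', 'b', 'v']})
# the key is only used for ordering; the stored values stay as they are
new_df = df.sort_values(by='location',
                        key=lambda col: col.astype(str).str.normalize('NFD'))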
You could use sorted with a locale-aware sorting function (in my example, setlocale returned 'German_Germany.1252') to sort the column values. The tricky part is to sort all the other columns accordingly. A somewhat hacky solution is to temporarily set the index to the column to be sorted, reindex on the properly sorted index values, and then reset the index.
import functools
import locale
import pandas as pd

locale.setlocale(locale.LC_ALL, '')
df = pd.DataFrame({'location': ['Österreich', 'Zürich', 'Bern'], 'code': ['ö', 'z', 'b']})
df = df.set_index('location')
df = df.reindex(sorted(df.index, key=functools.cmp_to_key(locale.strcoll))).reset_index()
Output of print(df):
location code
0 Bern b
1 Österreich ö
2 Zürich z
Update for mixed-type columns
If the column to be sorted is of mixed types (e.g. strings and integers), you have two possibilities:
a) convert the column to string and then sort as written above (the result column will be all strings):
locale.setlocale(locale.LC_ALL, '')
df = pd.DataFrame({'location': ['Österreich','Zürich','Bern', 254345],'code':['ö','z','b','v']})
df.location=df.location.astype(str)
df = df.set_index('location')
df = df.reindex(sorted(df.index, key=functools.cmp_to_key(locale.strcoll))).reset_index()
print(df.location.values)
# ['254345' 'Bern' 'Österreich' 'Zürich']
b) sort on a copy of the column converted to string (the result column will retain the mixed types):
locale.setlocale(locale.LC_ALL, '')
df = pd.DataFrame({'location': ['Österreich','Zürich','Bern', 254345],'code':['ö','z','b','v']})
df = df.set_index(df.location.astype(str))
df = df.reindex(sorted(df.index, key=functools.cmp_to_key(locale.strcoll))).reset_index(drop=True)
print(df.location.values)
# [254345 'Bern' 'Österreich' 'Zürich']
You can use the Unicode NFD normal form:
>>> names = pd.Series(['Österreich', 'Ost', 'S', 'N'])
>>> names.str.normalize('NFD').sort_values()
3 N
1 Ost
0 Österreich
2 S
dtype: object
# use the result to rearrange a dataframe with the same index
>>> df.loc[names.str.normalize('NFD').sort_values().index]
It's not quite what you wanted, but for proper ordering you need language knowledge (like the locale approach you mentioned).
NFD represents umlauts with two code points, e.g. Ö becomes O followed by a combining diaeresis (O\xcc\x88 in UTF-8); you can see the difference with names.str.normalize('NFD').str.encode('utf-8').
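A quick way to see the decomposition (a minimal sketch):
import pandas as pd

names = pd.Series(['Österreich', 'Ost'])
print(names.str.encode('utf-8').tolist())
# [b'\xc3\x96sterreich', b'Ost']
print(names.str.normalize('NFD').str.encode('utf-8').tolist())
# [b'O\xcc\x88sterreich', b'Ost']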
Sort with locale:
import pandas as pd
import locale
locale.setlocale(locale.LC_ALL, 'de_de')
#codes: https://github.com/python/cpython/blob/3.10/Lib/locale.py
#create df
df = pd.DataFrame({'location': ['Zürich', 'Österreich', 'Bern', 254345], 'code': ['z', 'ö', 'b', 'v']})
#convert int to str
df['location'] = df['location'].astype(str)
#sort
df_ord = df.sort_values(by='location', key=lambda col: col.map(locale.strxfrm))
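Checking the order (this should match the German-locale result shown earlier):
print(df_ord['location'].tolist())
# expected: ['254345', 'Bern', 'Österreich', 'Zürich']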
Multisort with locale:
import pandas as pd
import locale
locale.setlocale(locale.LC_ALL, 'es_es')
# create df
lista1 = ['sarmiento', 'ñ', 'á', 'sánchez', 'a', 'ó', 's', 'ñ', 'á', 'sánchez']
lista2 = [10, 20, 60, 40, 20, 20, 10, 5, 30, 20]
df = pd.DataFrame(list(zip(lista1, lista2)), columns = ['Col1', 'Col2'])
# sort by Col2 first, then stably by Col1: mergesort is stable, so rows with equal Col1 keep the Col2 order
df_temp = df.sort_values(by='Col2')
df_ord = df_temp.sort_values(by='Col1', key=lambda col: col.map(locale.strxfrm), kind='mergesort')
I would like to change the first digit in the number column to +233 when the first digit is 0; basically I would like all rows in number to be like that of row Paul.
Both columns are string objects.
Expectations:
The first character of the values in the column df["number"] should be replaced with "+233", but only if it is "0".
df = pd.DataFrame([['ken', '080222333222'],
                   ['ben', '+233948433'],
                   ['Paul', '0800000073']],
                  columns=['name', 'number'])
Hope I understood your edit correctly; try this:
Notice: I removed the leading 0 and replaced it with +233.
import pandas as pd

df = pd.DataFrame([["ken", "080222333222"], ["ben", "+233948433"], ["Paul", "0800000073"]],
                  columns=['name', 'number'])

def convert_number(row):
    # row is each individual value of the 'number' column;
    # replace a leading '0' with the '+233' prefix
    if row[0] == '0':
        row = "+233" + row[1:]
    return row

df['number'] = df['number'].apply(convert_number)
print(df)
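The same replacement can also be done without apply, using a boolean mask on the original frame (a sketch):
mask = df['number'].str.startswith('0')
df.loc[mask, 'number'] = '+233' + df.loc[mask, 'number'].str[1:]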
You can use replace directly:
df['relace_Col'] = df.number.str.replace(r'^\d', '+233', n=1, regex=True)
which produces:
name number relace_Col
0 ken 080222333222 +23380222333222
1 ben +233948433 +233948433
2 Paul 0800000073 +233800000073
The full code to reproduce the above:
import pandas as pd

df = pd.DataFrame([['ken', '080222333222'], ['ben', '+233948433'],
                   ['Paul', '0800000073']], columns=['name', 'number'])
df['relace_Col'] = df.number.str.replace(r'^\d', '+233', n=1, regex=True)
print(df)
I need to change the names of my indices:
Country Date (other columns)
/link1/subpath2/Text by Poe/
/link1/subpath2/Text by Wilde/
/link1/subpath2/Text by Whitman/
Country and Date are my indices. I would like to extract the words Poe, Wilde and Whitman from index column Country in order to have
Country Date (other columns)
Poe
Wilde
Whitman
Currently I am doing it one by one:
df=df.rename(index={'/link1/subpath2/Text by Poe/': 'Poe'})
df=df.rename(index={'/link1/subpath2/Text by Wilde/': 'Wilde'})
df=df.rename(index={'/link1/subpath2/Text by Whitman/': 'Whitman'})
It works, but since I have hundreds of datasets, as you can imagine this is not doable by hand.
You can use str.replace:
df['Country'] = df['Country'].str.replace(r'/link1/subpath2/Text by ', '')
df['Country'] = df['Country'].str.replace(r'/', '')
If 'Country' is an Index you can do as follows:
df = df.set_index('Country')
df.index = df.index.str.replace(r'/link1/subpath2/Text by ', '')
If it's a MultiIndex you can use .reset_index:
df = df.reset_index()
df['Country'] = df['Country'].str.replace(r'/link1/subpath2/Text by ', '')
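Alternatively, if you want to keep the MultiIndex in place, you can rewrite just the Country level (a sketch, assuming Country is level 0 of the index):
df.index = df.index.set_levels(
    df.index.levels[0].str.replace('/link1/subpath2/Text by ', '').str.replace('/', ''),
    level=0)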
You can always use regex pattern if things get more complicated:
import re
import pandas as pd
df = pd.DataFrame(['foo', 'bar', 'z'], index=['/link1/subpath2/Text by Poe/',
                                              '/link1/subpath2/Text by Wilde/',
                                              '/link1/subpath2/Text by Whitman/'])
name_pattern = re.compile(r'by (\w+)/')
df.index = [name_pattern.findall(idx)[0] for idx in df.index]
df
where name_pattern captures the word between 'by ' and the trailing '/'.
You can use str.extract with a pattern that captures the last word with (\w*), preceded by a whitespace \s and followed by the character / at the end of the string ($). Because it is an index, you need to rebuild the MultiIndex with MultiIndex.from_arrays.
df.index = pd.MultiIndex.from_arrays([df.index.get_level_values(0)
                                        .str.extract(r'\s(\w*)/$')[0],
                                      df.index.get_level_values(1)],
                                     names=['Country', 'Dates'])
I have an ID column containing a grave accent, like `1234ABC40, and I want to remove just that character from the column but keep the dataframe form.
I tried this on the column only. I have a file read into x here, which has multiple columns; id is the column I want to fix.
import pandas as pd
import unidecode

x = pd.read_csv(r'C:\filename.csv', index_col=False)
id = str(x['id'])
id2 = unidecode.unidecode(id)
id3 = id2.replace('`', '')
This converts the column to a str, but I want that column back in dataframe form.
DataFrames have their own replace() function. Note that for partial (substring) replacements you must enable regex=True in the parameters:
import pandas as pd

d = {'id': ["12`3", "32`1"], 'id2': ["004`", "9`99"]}
df = pd.DataFrame(data=d)
df["id"] = df["id"].replace('`', '', regex=True)
print(df)
id id2
0 123 004`
1 321 9`99
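Applied to the original file from the question (the path and column name come from the question; a sketch):
import pandas as pd

x = pd.read_csv(r'C:\filename.csv', index_col=False)
x['id'] = x['id'].replace('`', '', regex=True)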
This is my DataFrame:
d = {'col1': ['sku 1.1', 'sku 1.2', 'sku 1.3'], 'col2': ['9.876.543,21', 654, '321,01']}
df = pd.DataFrame(data=d)
df
col1 col2
0 sku 1.1 9.876.543,21
1 sku 1.2 654
2 sku 1.3 321,01
Data in col2 are numbers in local format, which I would like to convert into:
col2
9876543.21
654
321.01
I tried df['col2'] = pd.to_numeric(df['col2'], downcast='float'), which returns ValueError: Unable to parse string "9.876.543,21" at position 0.
I also tried df = df.apply(lambda x: x.str.replace(',', '.')), which returns ValueError: could not convert string to float: '5.023.654.46'
The best option, if possible, is to use the thousands and decimal parameters of read_csv:
df = pd.read_csv(file, thousands='.', decimal=',')
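A self-contained check with an in-memory file (a sketch; the semicolon separator is an assumption so the decimal commas in the data don't collide with the CSV delimiter):
import io
import pandas as pd

csv_data = "col1;col2\nsku 1.1;9.876.543,21\nsku 1.2;654\nsku 1.3;321,01\n"
df = pd.read_csv(io.StringIO(csv_data), sep=';', thousands='.', decimal=',')
print(df['col2'].tolist())  # [9876543.21, 654.0, 321.01]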
If not possible, then replace should help:
df['col2'] = (df['col2'].replace(r'\.', '', regex=True)
                        .replace(',', '.', regex=True)
                        .astype(float))
You can try swapping the separators via a placeholder character:
df = df.apply(lambda x: x.replace(',', '&', regex=True))
df = df.apply(lambda x: x.replace(r'\.', ',', regex=True))
df = df.apply(lambda x: x.replace('&', '.', regex=True))
You are always better off using standard system facilities where they exist. Knowing that some locales use commas and decimal points differently, I could not believe that pandas would not use the formats of the locale.
Sure enough, a quick search revealed this gist, which explains how to make use of locales to convert strings to numbers. In essence you need to import locale, call locale.setlocale after you've built the dataframe to establish a locale that uses commas as decimal points and periods as thousands separators, and then apply the dataframe's applymap method.
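A minimal sketch of that approach, assuming a German-style locale such as 'de_DE.UTF-8' is installed (the exact locale name depends on the OS; map is used here on a single column instead of applymap on the whole frame):
import locale
import pandas as pd

locale.setlocale(locale.LC_NUMERIC, 'de_DE.UTF-8')  # assumed locale name

d = {'col1': ['sku 1.1', 'sku 1.2', 'sku 1.3'], 'col2': ['9.876.543,21', 654, '321,01']}
df = pd.DataFrame(data=d)
# locale.atof parses '9.876.543,21' as 9876543.21 under this locale
df['col2'] = df['col2'].astype(str).map(locale.atof)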
I would like to know if there is a function to change specific column names without selecting them by name and without changing all of them.
I have the code:
df = df.rename(columns={'nameofacolumn': 'newname'})
But with it I have to manually change each one of them, writing out each name.
Also, to change all of them I have:
df.columns = ['name1', 'name2', 'etc']
I would like to have a function to change columns 1 and 3 without writing their names, just stating their location.
Say you have a dictionary mapping the old column names to the new names that should replace them:
df.rename(columns={'old_col':'new_col', 'old_col_2':'new_col_2'}, inplace=True)
But, if you don't have that, and you only have the indices, you can do this:
column_indices = [1,4,5,6]
new_names = ['a','b','c','d']
old_names = df.columns[column_indices]
df.rename(columns=dict(zip(old_names, new_names)), inplace=True)
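For example (a hypothetical frame, just to illustrate the index-based renaming):
import pandas as pd

df = pd.DataFrame(columns=['w', 'x', 'y', 'z'])
column_indices = [1, 3]
new_names = ['x_new', 'z_new']
old_names = df.columns[column_indices]
df.rename(columns=dict(zip(old_names, new_names)), inplace=True)
print(df.columns.tolist())  # ['w', 'x_new', 'y', 'z_new']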
You can use a dict comprehension and pass this to rename:
In [246]:
df = pd.DataFrame(columns=list('abc'))
new_cols=['d','e']
df.rename(columns=dict(zip(df.columns[1:], new_cols)),inplace=True)
df
Out[246]:
Empty DataFrame
Columns: [a, d, e]
Index: []
It also works if you pass a list of ordinal positions:
df.rename(columns=dict(zip(df.columns[[1,2]], new_cols)),inplace=True)
You don't need to use the rename method at all.
You can simply replace the old column names with new ones using lists. To rename columns 1 and 3 (with index 0 and 2), you do something like this:
df.columns.values[[0, 2]] = ['newname0', 'newname2']
or, if you are using a version of pandas older than 0.16.0:
df.keys().values[[0, 2]] = ['newname0', 'newname2']
The advantage of this approach is that you don't need to copy the whole dataframe with df = df.rename(...); you just change the index values.
You should be able to reference the columns by index using df.columns[index]:
>> temp = pd.DataFrame(np.random.randn(10, 5),columns=['a', 'b', 'c', 'd', 'e'])
>> print(temp.columns[0])
a
>> print(temp.columns[1])
b
So to change the value of specific columns, first assign the values to an array and change only the values you want
>> newcolumns=temp.columns.values
>> newcolumns[0] = 'New_a'
Assign the new array back to the columns and you'll have what you need
>> temp.columns = newcolumns
>> temp.columns
>> print(temp.columns[0])
New_a
If you have a dict of {position: new_name}, you can use items(),
e.g.,
new_columns = {3: 'fourth_column'}
df.rename(columns={df.columns[i]: new_col for i, new_col in new_columns.items()})
full example:
$ ipython
Python 3.7.10 | packaged by conda-forge | (default, Feb 19 2021, 16:07:37)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.24.1 -- An enhanced Interactive Python. Type '?' for help.
In [1]: import numpy as np
...: import pandas as pd
...:
...: rng = np.random.default_rng(seed=0)
...: df = pd.DataFrame({key: rng.uniform(size=3) for key in list('abcde')})
...: df
Out[1]:
a b c d e
0 0.636962 0.016528 0.606636 0.935072 0.857404
1 0.269787 0.813270 0.729497 0.815854 0.033586
2 0.040974 0.912756 0.543625 0.002739 0.729655
In [2]: new_columns = {3: 'fourth_column'}
...: df.rename(columns={df.columns[i]: new_col for i, new_col in new_columns.items()})
Out[2]:
a b c fourth_column e
0 0.636962 0.016528 0.606636 0.935072 0.857404
1 0.269787 0.813270 0.729497 0.815854 0.033586
2 0.040974 0.912756 0.543625 0.002739 0.729655
In [3]: