I would need to change the name of my indices:
Country Date (other columns)
/link1/subpath2/Text by Poe/
/link1/subpath2/Text by Wilde/
/link1/subpath2/Text by Whitman/
Country and Date are my indices. I would like to extract the words Poe, Wilde and Whitman from index column Country in order to have
Country Date (other columns)
Poe
Wilde
Whitman
Currently I am doing it one by one:
df=df.rename(index={'/link1/subpath2/Text by Poe/': 'Poe'})
df=df.rename(index={'/link1/subpath2/Text by Wilde/': 'Wilde'})
df=df.rename(index={'/link1/subpath2/Text by Whitman/': 'Whitman'})
It works, but since I have hundreds of datasets, as you can imagine is not doable
You can use str.replace:
df['Country'] = df['Country'].str.replace(r'/link1/subpath2/Text by ', '')
df['Country'] = df['Country'].str.replace(r'/', '')
If 'Country' is an Index you can do as follows:
df = df.set_index('Country')
df.index = df.index.str.replace(r'/link1/subpath2/Text by ', '')
If it's a MultiIndex you can use .reset_index:
df = df.reset_index()
df['Country'] = df['Country'].str.replace(r'/link1/subpath2/Text by ', '')
You can always use regex pattern if things get more complicated:
import re
import pandas as pd
df = pd.DataFrame(['foo', 'bar', 'z'], index=['/link1/subpath2/Text by Poe/',
'/link1/subpath2/Text by Wilde/',
'/link1/subpath2/Text by Whitman/'])
name_pattern = re.compile(r'by (\w+)/')
df.index = [name_att.findall(idx)[0] for idx in df.index]
df
where name_pattern will capture all groups between 'by ' and '/'
you can use str.extract with a pattern to catch the last word with (\w*), delimited by a white space \s before and after the character / at the end of the line $. Because it is an index, you need to rebuild the MultiIndex.from_arrays.
df.index = pd.MultiIndex.from_arrays([df.index.get_level_values(0)
.str.extract('\s(\w*)\/$')[0],
df.index.get_level_values(1)],
names=['Country', 'Dates'])
Related
I have a pandas dataframe of postcodes which have been concatenated with the two-letter country code. Some of these are Brazilian postcodes and I want to replace the last three characters of any postcode which starts with 'BR' with '000'.
import pandas as pd
data = ['BR86037-890', 'GBBB7', 'BR86071-570','BR86200-000','BR86026-480','BR86082-701', 'GBCW9', 'NO3140']
df = pd.DataFrame(data, columns=['postcode'])
I have tried the below, but it is not changing any of the postcodes:
if df['postcode'].str.startswith('BR').all():
df["postcode"] = df["postcode"].str.replace(r'.{3}$', '000')
Use str.replace with a capturing group:
df['postcode'] = df['postcode'].str.replace(r'(BR.*)...', r'\g<1>000', regex=True)
# or, more generic
df['postcode'] = df['postcode'].str.replace(r'(BR.*).{3}', r'\g<1>'+'0'*3, regex=True)
Output:
postcode
0 BR86037-000
1 GBBB7
2 BR86071-000
3 BR86200-000
4 BR86026-000
5 BR86082-000
6 GBCW9
7 NO3140
regex demo
The code is not working because df['postcode'].str.startswith('BR').all() will return a boolean value indicating whether all postcodes in the column start with 'BR'.
try this
data = ['BR86037-890', 'GBBB7', 'BR86071-570','BR86200-000','BR86026-480','BR86082-701', 'GBCW9', 'NO3140']
df = pd.DataFrame(data, columns=['postcode'])
mask = df['postcode'].str.startswith('BR')
df.loc[mask, 'postcode'] = df.loc[mask, 'postcode'].str.replace(r'.{3}$', '000')
I have a DataFrame with columns that look like this:
df=pd.DataFrame(columns=['(NYSE_close, close)','(NYSE_close, open)','(NYSE_close, volume)', '(NASDAQ_close, close)','(NASDAQ_close, open)','(NASDAQ_close, volume)'])
df:
(NYSE_close, close) (NYSE_close, open) (NYSE_close, volume) (NASDAQ_close, close) (NASDAQ_close, open) (NASDAQ_close, volume)
I want to remove everything after the underscore and append whatever comes after the comma to get the following:
df:
NYSE_close NYSE_open NYSE_volume NASDAQ_close NASDAQ_open NASDAQ_volume
I tried to strip the column name but it replaced it with nan. Any suggestions on how to do that?
Thank you in advance.
You could use re.sub to extract the appropriate parts of the column names to replace them with:
import re
df=pd.DataFrame(columns=['(NYSE_close, close)','(NYSE_close, open)','(NYSE_close, volume)', '(NASDAQ_close, close)','(NASDAQ_close, open)','(NASDAQ_close, volume)'])
df.columns = [re.sub(r'\(([^_]+_)\w+, (\w+)\)', r'\1\2', c) for c in df.columns]
Output:
Empty DataFrame
Columns: [NYSE_close, NYSE_open, NYSE_volume, NASDAQ_close, NASDAQ_open, NASDAQ_volume]
Index: []
You could:
import re
def cvt_col(x):
s = re.sub('[()_,]', ' ', x).split()
return s[0] + '_' + s[2]
df.rename(columns = cvt_col)
Empty DataFrame
Columns: [NYSE_close, NYSE_open, NYSE_volume, NASDAQ_close, NASDAQ_open, NASDAQ_volume]
Index: []
Use a list comprehension, twice:
step1 = [ent.strip('()').split(',') for ent in df]
df.columns = ["_".join([left.split('_')[0], right.strip()])
for left, right in step1]
df
Empty DataFrame
Columns: [NYSE_close, NYSE_open, NYSE_volume, NASDAQ_close, NASDAQ_open, NASDAQ_volume]
Index: []
So, I'm new to Python and I have this dataframe with company names, country information and activities description. I'm trying to group all this information by names, concatenating the countries and activities strings.
First, I did something like this:
df3_['Country'] = df3_.groupby(['Name', 'Activity'])['Country'].transform(lambda x: ','.join(x))
df4_ = df3_.drop_duplicates()
df4_['Activity'] = df4_.groupby(['Name', 'Country'])['Activity'].transform(lambda x: ','.join(x))
This way, I got a 'SettingWithCopyWarning', so I read a little bit about this error and tried copying the dataframe before applying the functions (didn't work) and using .loc (didn't work as well):
df3_.loc[:, 'Country'] = df3_.groupby(['Name', 'Activity'])['Country'].transform(lambda x: ','.join(x))
Any idea how to fix this?
Edit: I was asked to post an example of my data. The first pic is what I have, the second one is what it should look like
You want to group by the Company Name and then use some aggregating functions for the other columns, like:
df.groupby('Company Name').agg({'Country Code':', '.join, 'Activity':', '.join})
You were trying it the other way around.
Note that the empty string value ('') gets ugly with this aggregation, so you could make it more difficult with an aggregation like such:
df.groupby('Company Name').agg({'Country Code':lambda x: ', '.join(filter(None,x)), 'Activity':', '.join})
Following should work,
import pandas as pd
data = {
'Country Code': ['HK','US','SG','US','','US'],
'Company Name': ['A','A','A','A','B','B'],
'Activity': ['External services','Commerce','Transfer','Others','Others','External services'],
}
df = pd.DataFrame(data)
#grouping
grp = df.groupby('Company Name')
#custom function for replacing space and adding ,
def str_replace(ser):
s = ','.join(ser.values)
if s[0] == ',':
s = s[1:]
if s[len(s)-1] == ',':
s = s[:len(s)-1]
return s
#using agg functions
res = grp.agg({'Country Code':str_replace,'Activity':str_replace}).reset_index()
res
Output:
Company Name Country Code Activity
0 A HK,US,SG,US External services,Commerce,Transfer,Others
1 B US Others,External services
Another approach this time using transform()
# group the companies and concatenate the activities
df['Activities'] = df.groupby(['Company Name'])['Activity'] \
.transform(lambda x : ', '.join(x))
# group the companies and concatenate the country codes
df['Country Codes'] = df.groupby(['Company Name'])['Country Code'] \
.transform(lambda x : ', '.join([i for i in x if i != '']))
# the list comprehension deals with missing country codes (that have the value '')
# take this, drop the original columns and remove all the duplicates
result = df.drop(['Activity', 'Country Code'], axis=1) \
.drop_duplicates().reset_index(drop=True)
# reset index isn't really necessary
Result is
Company Name Activitys Country Codes
0 A External services, Commerce, Transfer, Others HK, US, SG, US
1 B Others, External services US
I have a dataframe which I want to sort via sort_values on one column.
Problem is there are German umlaute as first letter of the words.
Like Österreich, Zürich.
Which will sort to Zürich, Österreich.
It should be sorting Österreich, Zürich.
Ö should be between N and O.
I have found out how to do this with lists in python using locale and strxfrm.
Can I do this in the pandas dataframe somehow directly?
Edit:
Thank You. Stef example worked quite well, somehow I had Numbers where his Version did not work with my real life Dataframe example, so I used alexey's idea.
I did the following, probably you can shorten this..:
df = pd.DataFrame({'location': ['Österreich','Zürich','Bern', 254345],'code':['ö','z','b', 'v']})
#create index as column for joining later
df = df.reset_index(drop=False)
#convert int to str
df['location']=df['location'].astype(str)
#sort by location with umlaute
df_sort_index = df['location'].str.normalize('NFD').sort_values(ascending=True).reset_index(drop=False)
#drop location so we dont have it in both tables
df = df.drop('location', axis=1)
#inner join on index
new_df = pd.merge(df_sort_index, df, how='inner', on='index')
#drop index as column
new_df = new_df.drop('index', axis=1)
You could use sorted with a locale aware sorting function (in my example, setlocale returned 'German_Germany.1252') to sort the column values. The tricky part is to sort all the other columns accordingly. A somewhat hacky solution would be to temporarily set the index to the column to be sorted and then reindex on the properly sorted index values and reset the index.
import functools
import locale
locale.setlocale(locale.LC_ALL, '')
df = pd.DataFrame({'location': ['Österreich','Zürich','Bern'],'code':['ö','z','b']})
df = df.set_index('location')
df = df.reindex(sorted(df.index, key=functools.cmp_to_key(locale.strcoll))).reset_index()
Output of print(df):
location code
0 Bern b
1 Österreich ö
2 Zürich z
Update for mixed type columns
If the column to be sorted is of mixed types (e.g. strings and integers), then you have two possibilities:
a) convert the column to string and then sort as written above (result column will be all strings):
locale.setlocale(locale.LC_ALL, '')
df = pd.DataFrame({'location': ['Österreich','Zürich','Bern', 254345],'code':['ö','z','b','v']})
df.location=df.location.astype(str)
df = df.set_index('location')
df = df.reindex(sorted(df.index, key=functools.cmp_to_key(locale.strcoll))).reset_index()
print(df.location.values)
# ['254345' 'Bern' 'Österreich' 'Zürich']
b) sort on a copy of the column converted to string (result column will retain mixed types)
locale.setlocale(locale.LC_ALL, '')
df = pd.DataFrame({'location': ['Österreich','Zürich','Bern', 254345],'code':['ö','z','b','v']})
df = df.set_index(df.location.astype(str))
df = df.reindex(sorted(df.index, key=functools.cmp_to_key(locale.strcoll))).reset_index(drop=True)
print(df.location.values)
# [254345 'Bern' 'Österreich' 'Zürich']
you can use unicode NFD normal form
>>> names = pd.Series(['Österreich', 'Ost', 'S', 'N'])
>>> names.str.normalize('NFD').sort_values()
3 N
1 Ost
0 Österreich
2 S
dtype: object
# use result to rearrange a dataframe
>>> df[names.str.normalize('NFD').sort_values().index]
It's not quite what you wanted, but for proper ordering you need language knowladge (like locale you mentioned).
NFD employs two symbols for umlauts e.g. Ö becomes O\xcc\x88 (you can see the difference with names.str.normalize('NFD').encode('utf-8'))
Sort with locale:
import pandas as pd
import locale
locale.setlocale(locale.LC_ALL, 'de_de')
#codes: https://github.com/python/cpython/blob/3.10/Lib/locale.py
#create df
df = pd.DataFrame({'location': ['Zürich','Österreich','Bern', 254345],'code':['z','ö','b','v']})
#convert int to str
df['location']=df['location'].astype(str)
#sort
df_ord = df.sort_values(by = 'location', key = lambda col: col.map(lambda x: locale.strxfrm(x)))
Multisort with locale:
import pandas as pd
import locale
locale.setlocale(locale.LC_ALL, 'es_es')
# create df
lista1 = ['sarmiento', 'ñ', 'á', 'sánchez', 'a', 'ó', 's', 'ñ', 'á', 'sánchez']
lista2 = [10, 20, 60, 40, 20, 20, 10, 5, 30, 20]
df = pd.DataFrame(list(zip(lista1, lista2)), columns = ['Col1', 'Col2'])
#sort by Col2 and Col1
df_temp = df.sort_values(by = 'Col2')
df_ord = df_temp.sort_values(by = 'Col1', key = lambda col: col.map(lambda x: locale.strxfrm(x)), kind = 'mergesort')
I have a column with addresses, and sometimes it has these characters I want to remove => ' - " - ,(apostrophe, double quotes, commas)
I would like to replace these characters with space in one shot. I'm using pandas and this is the code I have so far to replace one of them.
test['Address 1'].map(lambda x: x.replace(',', ''))
Is there a way to modify these code so I can replace these characters in one shot? Sorry for being a noob, but I would like to learn more about pandas and regex.
Your help will be appreciated!
You can use str.replace:
test['Address 1'] = test['Address 1'].str.replace(r"[\"\',]", '')
Sample:
import pandas as pd
test = pd.DataFrame({'Address 1': ["'aaa",'sa,ss"']})
print (test)
Address 1
0 'aaa
1 sa,ss"
test['Address 1'] = test['Address 1'].str.replace(r"[\"\',]", '')
print (test)
Address 1
0 aaa
1 sass
Here's the pandas solution:
To apply it to an entire dataframe use, df.replace. Don't forget the \ character for the apostrophe.
Example:
import pandas as pd
df = #some dataframe
df.replace('\'','', regex=True, inplace=True)