Adding conditional prefixes to column names

I have a dataframe whose column names carry odd suffixes, such as _a or _b, that map to certain codes. How would you add a prefix that depends on the suffix, and remove the suffix, so the names are easier to understand?
e.g.:
red_a blue_a green_b
....
....
to
A red A blue B green
....
....
I tried
for col in df.columns:
    if col.endswith('_a'):
        batch_match[col].replace('_a', '')
        batch_match[col].add_prefix('A ')
    else:
        batch_match[col].add_prefix('B ')
But it returns a df of NaN.

You can use pandas.DataFrame.rename with a custom mapper:
df = pd.DataFrame(
{"red_a": ['a', 'b', 'c'], "blue_a": [1, 2, 3], 'green_b': ['x', 'y', 'z']}
)
def renamer(col):
    if any(col.endswith(suffix) for suffix in ['_a', '_b']):
        prefix = col[-1]  # use last char as prefix
        return prefix.upper() + " " + col[:-2]  # add prefix and strip last 2 chars
    else:
        return col
df = df.rename(mapper=renamer, axis='columns')
print(df)
# A blue B green A red
#0 1 x a
#1 2 y b
#2 3 z c

What I would do:
df.columns=df.columns.str.split('_').map(lambda x : '{} {}'.format(x[1].upper(),x[0]))
df
Out[512]:
A red A blue B green
0 a 1 x
1 b 2 y
2 c 3 z
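If the suffix codes do not simply map to their own upper-cased letter, the same idea can be driven by a lookup table. Here is a minimal sketch, assuming a hypothetical dictionary suffix_to_prefix and a hypothetical helper add_prefix_from_suffix (neither comes from the answers above); columns without a known suffix are left untouched:

```python
import pandas as pd

# Hypothetical mapping from suffix code to a readable prefix.
suffix_to_prefix = {"_a": "A", "_b": "B"}

def add_prefix_from_suffix(col):
    """Return 'PREFIX name' if col ends with a known suffix, else col unchanged."""
    for suffix, prefix in suffix_to_prefix.items():
        if col.endswith(suffix):
            return f"{prefix} {col[:-len(suffix)]}"
    return col

df = pd.DataFrame(
    {"red_a": ["a", "b"], "blue_a": [1, 2], "green_b": ["x", "y"], "id": [10, 20]}
)
df = df.rename(columns=add_prefix_from_suffix)
print(list(df.columns))  # ['A red', 'A blue', 'B green', 'id']
```

Passing a function to rename(columns=...) applies it to each header, so no loop over the dataframe is needed.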

Related

Update last column header dynamically - pandas

I'm hoping to update the last column in a pandas df using the first column header as a prefix. Using below as an example I want to update the col Z to include col X as X_Z.
import pandas as pd
df = pd.DataFrame({
'X' : [1,2,3],
'Y' : [1,2,3],
'Z' : [1,2,3],
})
# Update last col to include a hard-coded string
df.columns.values[-1:] = [str(col) + '_Col' for col in df.columns[-1:]]
# Update all cols to include a consistent suffix
df.columns = [str(col) + '_Col' for col in df.columns]
Please note: I don't want to update this manually using the function below. The column headers will vary so I don't want to go back to check what the first column is and add it. I'm hoping to handle all cases.
df.rename(columns={'Z': 'X_Z'}, inplace=True)
Intended Output:
X Y X_Z
0 1 1 1
1 2 2 2
2 3 3 3

We can do this using list indexing and f-strings:
cols = df.columns
df = df.rename(columns={cols[-1]:f'{cols[0]}_{cols[-1]}'})
X Y X_Z
0 1 1 1
1 2 2 2
2 3 3 3
Or we can adjust our columns list by index, then pass this adjusted list back:
cols = df.columns.tolist()
cols[-1] = f'{cols[0]}_{cols[-1]}'
df.columns = cols
X Y X_Z
0 1 1 1
1 2 2 2
2 3 3 3
Bonus: weird out of the box list comprehension:
[df.columns[0] + '_' + col if idx+1 == df.shape[1] else col for idx, col in enumerate(df.columns)]
# Out
['X', 'Y', 'X_Z']
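Since the headers will vary, it may be handy to wrap this in a small function. A rough sketch, where the name prefix_last_with_first and the sep parameter are my own assumptions rather than anything from pandas:

```python
import pandas as pd

def prefix_last_with_first(df, sep="_"):
    """Return a copy of df whose last header is prefixed with the first header."""
    cols = df.columns.tolist()
    if len(cols) > 1:  # nothing to prefix for a single-column frame
        cols[-1] = f"{cols[0]}{sep}{cols[-1]}"
    out = df.copy()
    out.columns = cols
    return out

df = pd.DataFrame({"X": [1, 2, 3], "Y": [1, 2, 3], "Z": [1, 2, 3]})
print(prefix_last_with_first(df).columns.tolist())  # ['X', 'Y', 'X_Z']
```

It returns a copy, so the original dataframe keeps its headers.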

How to give duplicated columns distinct names in Pandas [duplicate]

I have several columns named the same in a df. I need to rename them, but the problem is that the df.rename method renames them all the same way. How can I rename the blah(s) below to blah1, blah4, blah5?
df = pd.DataFrame(np.arange(2*5).reshape(2,5))
df.columns = ['blah','blah2','blah3','blah','blah']
df
# blah blah2 blah3 blah blah
# 0 0 1 2 3 4
# 1 5 6 7 8 9
Here is what happens when using the df.rename method:
df.rename(columns={'blah':'blah1'})
# blah1 blah2 blah3 blah1 blah1
# 0 0 1 2 3 4
# 1 5 6 7 8 9

Starting with Pandas 0.19.0 pd.read_csv() has improved support for duplicate column names
So we can try to use the internal method:
In [137]: pd.io.parsers.ParserBase({'names':df.columns})._maybe_dedup_names(df.columns)
Out[137]: ['blah', 'blah2', 'blah3', 'blah.1', 'blah.2']
Since Pandas 1.3.0:
pd.io.parsers.base_parser.ParserBase({'names':df.columns, 'usecols':None})._maybe_dedup_names(df.columns)
This is the "magic" function:
def _maybe_dedup_names(self, names):
    # see gh-7160 and gh-9424: this helps to provide
    # immediate alleviation of the duplicate names
    # issue and appears to be satisfactory to users,
    # but ultimately, not needing to butcher the names
    # would be nice!
    if self.mangle_dupe_cols:
        names = list(names)  # so we can index
        counts = {}
        for i, col in enumerate(names):
            cur_count = counts.get(col, 0)
            if cur_count > 0:
                names[i] = '%s.%d' % (col, cur_count)
            counts[col] = cur_count + 1
    return names
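Because that method is private, it may be safer not to call it directly. Here is a standalone sketch of the same counting logic; the name dedup_names_sketch is made up for illustration and is not a pandas function:

```python
def dedup_names_sketch(names):
    """Append '.1', '.2', ... to the second and later occurrences of a name."""
    names = list(names)  # copy so the input is not modified
    counts = {}
    for i, col in enumerate(names):
        cur_count = counts.get(col, 0)
        if cur_count > 0:
            names[i] = f"{col}.{cur_count}"
        counts[col] = cur_count + 1
    return names

print(dedup_names_sketch(['blah', 'blah2', 'blah3', 'blah', 'blah']))
# ['blah', 'blah2', 'blah3', 'blah.1', 'blah.2']
```

The result can be assigned back with df.columns = dedup_names_sketch(df.columns).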

I was looking to find a solution within Pandas more than a general Python solution.
The columns' get_loc() function returns a boolean mask if it finds duplicates, with True values marking the locations where the duplicates are found. I then use the mask to assign new values into those locations. In my case, I know ahead of time how many dups I'm going to get and what I'm going to assign to them, but it looks like df.columns.get_duplicates() would return a list of all dups, and you can then use that list in conjunction with get_loc() if you need a more generic dup-weeding action.
'''UPDATED AS-OF SEPT 2020'''
cols = pd.Series(df.columns)
for dup in df.columns[df.columns.duplicated(keep=False)]:
    cols[df.columns.get_loc(dup)] = ([dup + '.' + str(d_idx)
                                      if d_idx != 0
                                      else dup
                                      for d_idx in range(df.columns.get_loc(dup).sum())]
                                     )
df.columns = cols
blah blah2 blah3 blah.1 blah.2
0 0 1 2 3 4
1 5 6 7 8 9
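To see what get_loc() hands back in the duplicated case, here is a small sketch on the example headers (the variable names are just for illustration):

```python
import pandas as pd

idx = pd.Index(['blah', 'blah2', 'blah3', 'blah', 'blah'])

mask = idx.get_loc('blah')   # duplicated label: boolean mask over all positions
pos = idx.get_loc('blah2')   # label that occurs once: a single integer position

print(mask.tolist())  # [True, False, False, True, True]
print(mask.sum())     # 3
print(pos)            # 1
```

That is why the loop above can call .sum() on the result to count the duplicates and use the same result to pick the positions to overwrite.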
New Better Method (Update 03Dec2019)
The code below is better than the code above. Copied from another answer below (@SatishSK):
#sample df with duplicate blah column
df=pd.DataFrame(np.arange(2*5).reshape(2,5))
df.columns=['blah','blah2','blah3','blah','blah']
df
# you just need the following 4 lines to rename duplicates
# df is the dataframe that you want to rename duplicated columns
cols=pd.Series(df.columns)
for dup in cols[cols.duplicated()].unique():
    cols[cols[cols == dup].index.values.tolist()] = [dup + '.' + str(i) if i != 0 else dup for i in range(sum(cols == dup))]
# rename the columns with the cols list.
df.columns=cols
df
Output:
blah blah2 blah3 blah.1 blah.2
0 0 1 2 3 4
1 5 6 7 8 9
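A different technique that stays within pandas is to number the occurrences with groupby().cumcount() instead of looping over the duplicated names. This is a sketch of that alternative, not the method used in the answer above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(2 * 5).reshape(2, 5))
df.columns = ['blah', 'blah2', 'blah3', 'blah', 'blah']

cols = pd.Series(df.columns)
# cumcount numbers the occurrences of each name: 0 for the first, 1 for the second, ...
occurrence = cols.groupby(cols).cumcount()
df.columns = [name if n == 0 else f"{name}.{n}" for name, n in zip(cols, occurrence)]
print(df.columns.tolist())  # ['blah', 'blah2', 'blah3', 'blah.1', 'blah.2']
```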

You could use this:
def df_column_uniquify(df):
    df_columns = df.columns
    new_columns = []
    for item in df_columns:
        counter = 0
        newitem = item
        while newitem in new_columns:
            counter += 1
            newitem = "{}_{}".format(item, counter)
        new_columns.append(newitem)
    df.columns = new_columns
    return df
Then
import numpy as np
import pandas as pd
df=pd.DataFrame(np.arange(2*5).reshape(2,5))
df.columns=['blah','blah2','blah3','blah','blah']
so that df:
blah blah2 blah3 blah blah
0 0 1 2 3 4
1 5 6 7 8 9
then
df = df_column_uniquify(df)
so that df:
blah blah2 blah3 blah_1 blah_2
0 0 1 2 3 4
1 5 6 7 8 9

You could assign directly to the columns:
In [12]:
df.columns = ['blah','blah2','blah3','blah4','blah5']
df
Out[12]:
blah blah2 blah3 blah4 blah5
0 0 1 2 3 4
1 5 6 7 8 9
[2 rows x 5 columns]
If you want to dynamically just rename the duplicate columns then you could do something like the following (code taken from answer 2: Index of duplicates items in a python list):
In [25]:
import collections
dups = collections.defaultdict(list)
col_list = list(df.columns)
for i, e in enumerate(col_list):
    dups[e].append(i)
# rename every group of duplicates, not just the last one found
for k, v in sorted(dups.items()):
    if len(v) >= 2:
        for i in v:
            col_list[i] = col_list[i] + ' ' + str(i)
col_list
Out[25]:
['blah 0', 'blah2', 'blah3', 'blah 3', 'blah 4']
You could then use this to assign back, you could also have a function to generate a unique name that is not present in the columns prior to renaming.
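A function that generates names not already present could look roughly like this. It is only a sketch: the name uniquify and the sep parameter are assumptions of mine, and it works on a plain list of headers:

```python
def uniquify(names, sep="_"):
    """Rename repeated names so they clash neither with each other nor with existing names."""
    taken = set(names)  # every name present before renaming
    seen = set()
    result = []
    for name in names:
        if name in seen:
            n = 1
            while f"{name}{sep}{n}" in taken:  # skip suffixes that are already used
                n += 1
            name = f"{name}{sep}{n}"
            taken.add(name)
        seen.add(name)
        result.append(name)
    return result

print(uniquify(['blah', 'blah2', 'blah3', 'blah', 'blah']))
# ['blah', 'blah2', 'blah3', 'blah_1', 'blah_2']
print(uniquify(['blah', 'blah_1', 'blah', 'blah']))
# ['blah', 'blah_1', 'blah_2', 'blah_3']
```

The second call shows the point of checking against the existing headers: blah_1 is already taken, so the repeats start at blah_2.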

duplicated_idx = dataset.columns.duplicated()
duplicated = dataset.columns[duplicated_idx].unique()
rename_cols = []
counts = {}  # occurrences seen so far, per duplicated name
for col in dataset.columns:
    if col in duplicated:
        counts[col] = counts.get(col, 0) + 1
        rename_cols.append(col + '_' + str(counts[col]))
    else:
        rename_cols.append(col)
dataset.columns = rename_cols

Thank you @Lamakaha for the solution. Your idea gave me a chance to modify it and make it work in all cases.
I am using Python 3.7.3.
I tried your code on my data set, which had only one duplicated column, i.e. two columns with the same name. Unfortunately, the column names remained as-is without being renamed. On top of that, I got a warning that get_duplicates() is deprecated and will be removed in a future version. Using duplicated() coupled with unique() in place of get_duplicates() did not yield the expected result either.
I have modified your code a little, and it now works for my data set as well as for other general cases.
Here are runs of the code with and without the modification on the example data set from the question, along with the results:
df=pd.DataFrame(np.arange(2*5).reshape(2,5))
df.columns=['blah','blah2','blah3','blah','blah']
df
cols=pd.Series(df.columns)
for dup in df.columns.get_duplicates():
    cols[df.columns.get_loc(dup)]=[dup+'.'+str(d_idx) if d_idx!=0 else dup for d_idx in range(df.columns.get_loc(dup).sum())]
df.columns=cols
df
f:\Anaconda3\lib\site-packages\ipykernel_launcher.py:2: FutureWarning:
'get_duplicates' is deprecated and will be removed in a future
release. You can use idx[idx.duplicated()].unique() instead
Output:
blah blah2 blah3 blah blah.1
0 0 1 2 3 4
1 5 6 7 8 9
Two of the three "blah"(s) are not renamed properly.
Modified code
df=pd.DataFrame(np.arange(2*5).reshape(2,5))
df.columns=['blah','blah2','blah3','blah','blah']
df
cols=pd.Series(df.columns)
for dup in cols[cols.duplicated()].unique():
    cols[cols[cols == dup].index.values.tolist()] = [dup + '.' + str(i) if i != 0 else dup for i in range(sum(cols == dup))]
df.columns=cols
df
Output:
blah blah2 blah3 blah.1 blah.2
0 0 1 2 3 4
1 5 6 7 8 9
Here is a run of the modified code on another example:
cols = pd.Series(['X', 'Y', 'Z', 'A', 'B', 'C', 'A', 'A', 'L', 'M', 'A', 'Y', 'M'])
for dup in cols[cols.duplicated()].unique():
    cols[cols[cols == dup].index.values.tolist()] = [dup + '_' + str(i) if i != 0 else dup for i in range(sum(cols == dup))]
cols
Output:
0 X
1 Y
2 Z
3 A
4 B
5 C
6 A_1
7 A_2
8 L
9 M
10 A_3
11 Y_1
12 M_1
dtype: object
Hope this helps anybody who is seeking answer to the aforementioned question.

Since the accepted answer (by Lamakaha) is not working for recent versions of pandas, and because the other suggestions looked a bit clumsy, I worked out my own solution:
def dedupIndex(idx, fmt=None, ignoreFirst=True):
    # fmt: A string format that receives two arguments:
    # name and a counter. By default: fmt='%s.%03d'
    # ignoreFirst: Disable/enable postfixing of first element.
    idx = pd.Series(idx)
    duplicates = idx[idx.duplicated()].unique()
    fmt = '%s.%03d' if fmt is None else fmt
    for name in duplicates:
        dups = idx==name
        ret = [ fmt%(name,i) if (i!=0 or not ignoreFirst) else name
                for i in range(dups.sum()) ]
        idx.loc[dups] = ret
    return pd.Index(idx)
Use the function as follows:
df.columns = dedupIndex(df.columns)
# Result: ['blah', 'blah2', 'blah3', 'blah.001', 'blah.002']
df.columns = dedupIndex(df.columns, fmt='%s #%d', ignoreFirst=False)
# Result: ['blah #0', 'blah2', 'blah3', 'blah #1', 'blah #2']

Here's a solution that also works for multi-indexes
# Take a df and rename duplicate columns by appending number suffixes
def rename_duplicates(df):
    import copy
    new_columns = df.columns.values
    suffix = {key: 2 for key in set(new_columns)}
    dup = pd.Series(new_columns).duplicated()
    if type(df.columns) == pd.core.indexes.multi.MultiIndex:
        # Need to be mutable, make it list instead of tuples
        for i in range(len(new_columns)):
            new_columns[i] = list(new_columns[i])
        for ix, item in enumerate(new_columns):
            item_orig = copy.copy(item)
            if dup[ix]:
                for level in range(len(new_columns[ix])):
                    new_columns[ix][level] = new_columns[ix][level] + f"_{suffix[tuple(item_orig)]}"
                suffix[tuple(item_orig)] += 1
        for i in range(len(new_columns)):
            new_columns[i] = tuple(new_columns[i])
        df.columns = pd.MultiIndex.from_tuples(new_columns)
    # Not a MultiIndex
    else:
        for ix, item in enumerate(new_columns):
            if dup[ix]:
                new_columns[ix] = item + f"_{suffix[item]}"
                suffix[item] += 1
        df.columns = new_columns

I just wrote this code; it uses a list comprehension to update all duplicated names.
df.columns = [x[1] if x[1] not in df.columns[:x[0]] else f"{x[1]}_{list(df.columns[:x[0]]).count(x[1])}" for x in enumerate(df.columns)]

Created a function with some tests so it should be drop in ready; this is a little different than Lamakaha's excellent solution since it renames the first appearance of a duplicate column:
from collections import defaultdict
from typing import Dict, List, Set
import pandas as pd
def rename_duplicate_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Rename column headers to ensure no header names are duplicated.

    Args:
        df (pd.DataFrame): A dataframe with a single index of columns

    Returns:
        pd.DataFrame: The dataframe with headers renamed; inplace
    """
    if not df.columns.has_duplicates:
        return df
    duplicates: Set[str] = set(df.columns[df.columns.duplicated()].tolist())
    indexes: Dict[str, int] = defaultdict(lambda: 0)
    new_cols: List[str] = []
    for col in df.columns:
        if col in duplicates:
            indexes[col] += 1
            new_cols.append(f"{col}.{indexes[col]}")
        else:
            new_cols.append(col)
    df.columns = new_cols
    return df

def test_rename_duplicate_columns():
    df = pd.DataFrame(data=[[1, 2]], columns=["a", "b"])
    assert rename_duplicate_columns(df).columns.tolist() == ["a", "b"]
    df = pd.DataFrame(data=[[1, 2]], columns=["a", "a"])
    assert rename_duplicate_columns(df).columns.tolist() == ["a.1", "a.2"]
    df = pd.DataFrame(data=[[1, 2, 3]], columns=["a", "b", "a"])
    assert rename_duplicate_columns(df).columns.tolist() == ["a.1", "b", "a.2"]

We can just assign each column a different name.
Suppose the duplicated column names are ['a', 'b', 'c', 'd', 'd', 'c'].
Then just create a list of the names you want to assign:
c = ['a', 'b', 'c', 'd', 'D1', 'C1']
df.columns = c
This works for me.

This is my solution:
cols = []  # for tracking whether we have already seen a name
new_cols = []
for col in df.columns:
    cols.append(col)
    count = cols.count(col)
    if count > 1:
        new_cols.append(f'{col}_{count}')
    else:
        new_cols.append(col)
df.columns = new_cols

Here's an elegant solution:
Isolate a dataframe with only the repeated columns (looks like it will be a series but it will be a dataframe if >1 column with that name):
df1 = df['blah']
For each "blah" column, give it a unique number
df1.columns = ['blah_' + str(int(x)) for x in range(len(df1.columns))]
Isolate a dataframe with all but the repeated columns:
df2 = df[[x for x in df.columns if x != 'blah']]
Merge back together on indices:
df3 = pd.merge(df1, df2, left_index=True, right_index=True)
Et voila:
blah_0 blah_1 blah_2 blah2 blah3
0 0 3 4 1 2
1 5 8 9 6 7

Merge specific values in a pandas df

I'm currently merging the first and last string in a row. These strings are merged when they are to the right of a specific value. I'm hoping to change that to below a specific value.
import pandas as pd
d = ({
'A' : ['X','Foo','','X','Big'],
'B' : ['No','','','No',''],
'C' : ['Merge','Bar','','Merge','Cat'],
})
df = pd.DataFrame(data = d)
m = df.A == 'X'
def f(x):
    s = x[x != '']
    x[s.index[1]] = x[s.index[1]] + ' ' + x[s.index[-1]]
    x[s.index[-1]] = ''
    return x
df = df.astype(str).mask(m, df[m].apply(f, axis=1))
This code merges the first and last strings to the right of X, in the rows where column A is X.
Output:
     A         B    C
0    X  No Merge
1  Foo            Bar
2
3    X  No Merge
4  Big            Cat
I'm hoping to change it so that the merge happens in the row beneath the value X.
Intended Output:
         A   B      C
0        X  No  Merge
1  Foo Bar
2
3        X  No  Merge
4  Big Cat

The solution is very similar: the boolean mask is shifted down by one row, with the first row filled with False (a plain shift would leave NaN there), and the index [1] is changed to [0] so that the first value (from column A) is selected:
m = (df.A == 'X').shift(fill_value=False)
def f(x):
    s = x[x != '']
    x[s.index[0]] = x[s.index[0]] + ' ' + x[s.index[-1]]
    x[s.index[-1]] = ''
    return x
df = df.astype(str).mask(m, df[m].apply(f, axis=1))
print (df)
         A   B      C
0        X  No  Merge
1  Foo Bar
2
3        X  No  Merge
4  Big Cat
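To make the effect of shifting the mask concrete, here is a tiny sketch on column A alone (the variable names are only for illustration):

```python
import pandas as pd

a = pd.Series(['X', 'Foo', '', 'X', 'Big'])

on_x = a == 'X'                            # True on the rows that contain X
below_x = on_x.shift(fill_value=False)     # True on the rows directly beneath X

print(on_x.tolist())     # [True, False, False, True, False]
print(below_x.tolist())  # [False, True, False, False, True]
```

Rows 1 and 4 are the ones selected after the shift, which are exactly the rows that get merged in the output.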

Match values in dataframe rows

I have a dataframe (df) that looks like:
name type cost
a apples 1
b apples 2
c orange 1
d banana 4
e orange 6
Apart from using two for loops, is there a way to compare each name and type against every other row and collect the pairs where the names differ (no A vs A), the types are the same (apples vs apples), and each pair appears only once (if we have A vs B, I would not want to see B vs A)? The output list should look like:
name1, name2, status
a b 0
c e 0
Where the first 2 elements are the names where the criteria match and the third element is always a 0.
I have tried to do this with 2 for loops (see below) but can't get it to reject say b vs a if we already have a vs b.
def pairListCreator(staticData):
    for x, row1 in staticData.iterrows():
        name1 = row1['name']
        type1 = row1['type']
        for y, row2 in staticData.iterrows():
            name2 = row2['name']
            type2 = row2['type']
            if name1 != name2 and type1 == type2:
                pairList = name1, name2, 0

Something like this
import pandas as pd
# Data
data = [['a', 'apples', 1],
        ['b', 'apples', 2],
        ['c', 'orange', 1],
        ['d', 'banana', 4],
        ['e', 'orange', 6]]
# Create DataFrame
df = pd.DataFrame(data, columns=['name', 'type', 'cost'])
df.set_index('name', inplace=True)
# Print DataFrame
print(df)
# Count number of rows
nr_of_rows = df.shape[0]
# Collect matching pairs, comparing each row only with the rows after it
res_col_nam = ['name1', 'name2', 'status']
rows = []
for i in range(nr_of_rows):
    x = df.iloc[i]
    for j in range(i + 1, nr_of_rows):
        y = df.iloc[j]
        if x['type'] == y['type']:
            rows.append([x.name, y.name, 0])
# Build the result in one go (DataFrame.append was removed in pandas 2.0)
result = pd.DataFrame(rows, columns=res_col_nam)
# Print result
print('result:')
print(result)
Output:
        type  cost
name
a     apples     1
b     apples     2
c     orange     1
d     banana     4
e     orange     6
result:
  name1 name2  status
0     a     b       0
1     c     e       0

You can first do a self join on the type column, then sort the values of the name columns within each row with apply(sorted).
Then remove the rows where both names are the same with boolean indexing, remove repeated pairs with drop_duplicates, and add the new status column with assign:
df = pd.merge(df, df, on='type', suffixes=('1', '2'))
names = ['name1', 'name2']
# result_type='broadcast' keeps the original shape and column names
df[names] = df[names].apply(sorted, axis=1, result_type='broadcast')
df = (df[df.name1 != df.name2]
      .drop_duplicates(subset=names)[names]
      .assign(status=0)
      .reset_index(drop=True))
print (df)
name1 name2 status
0 a b 0
1 c e 0
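Another way to avoid both the nested loops and the reversed pairs is itertools.combinations, which only ever yields each unordered pair once. This is a sketch of that alternative technique on the sample data, not part of the answer above:

```python
from itertools import combinations

import pandas as pd

df = pd.DataFrame({
    'name': ['a', 'b', 'c', 'd', 'e'],
    'type': ['apples', 'apples', 'orange', 'banana', 'orange'],
    'cost': [1, 2, 1, 4, 6],
})

# For each type, pair up the names in that group; groups of one yield no pairs.
pairs = [
    (name1, name2, 0)
    for _, names in df.groupby('type')['name']
    for name1, name2 in combinations(names, 2)
]
result = pd.DataFrame(pairs, columns=['name1', 'name2', 'status'])
print(result)
```

Because combinations never pairs an item with itself and never emits the reverse order, no extra filtering or drop_duplicates step is needed.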

Pandas - Modify string values in each cell

I have a pandas dataframe and I need to modify all values in a given string column. Each column contains string values of the same length. The user provides the index they want to be replaced for each value
for example: [1:3] and the replacement value "AAA".
This would replace the string from values 1 to 3 with the value AAA.
How can I use the applymap(), map() or apply() function to get this done?
SOLUTION: Here is the final solution I went off of using the answer marked below:
import pandas as pd
df = pd.DataFrame({'A':['ffgghh','ffrtss','ffrtds'],
                   #'B':['ffrtss','ssgghh','d'],
                   'C':['qqttss',' 44','f']})
print(df)
old = ['g', 'r', 'z']
new = ['y', 'b', 'c']
vals = dict(zip(old, new))
pos = 2
for old, new in vals.items():
    # .loc replaces the .ix indexer, which was removed from pandas
    df.loc[df['A'].str[pos] == old, 'A'] = df['A'].str.slice_replace(pos, pos + len(new), new)
print(df)

Use str.slice_replace:
df['B'] = df['B'].str.slice_replace(1, 3, 'AAA')
Sample Input:
A B
0 w abcdefg
1 x bbbbbbb
2 y ccccccc
3 z zzzzzzzz
Sample Output:
A B
0 w aAAAdefg
1 x bAAAbbbb
2 y cAAAcccc
3 z zAAAzzzzz
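Since the user supplies the slice and the replacement, this can be wrapped in a small helper. A minimal sketch, assuming a hypothetical function name replace_slice and the sample data above:

```python
import pandas as pd

def replace_slice(df, column, start, stop, repl):
    """Return a copy of df with characters [start:stop] of a string column replaced by repl."""
    out = df.copy()
    out[column] = out[column].str.slice_replace(start, stop, repl)
    return out

df = pd.DataFrame({
    'A': ['w', 'x', 'y', 'z'],
    'B': ['abcdefg', 'bbbbbbb', 'ccccccc', 'zzzzzzzz'],
})
print(replace_slice(df, 'B', 1, 3, 'AAA')['B'].tolist())
# ['aAAAdefg', 'bAAAbbbb', 'cAAAcccc', 'zAAAzzzzz']
```

The stop position is exclusive, as with normal Python slicing, so [1:3] replaces two characters with the three-character replacement.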

IMO the most straightforward solution:
In [7]: df
Out[7]:
col
0 abcdefg
1 bbbbbbb
2 ccccccc
3 zzzzzzzz
In [9]: df.col = df.col.str[:1] + 'AAA' + df.col.str[4:]
In [10]: df
Out[10]:
col
0 aAAAefg
1 bAAAbbb
2 cAAAccc
3 zAAAzzzz
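Note that this variant overwrites positions 1 to 3 inclusive, i.e. the slice [1:4], so the string length stays the same. A quick sketch to check that it matches str.slice_replace with that slice (variable names are illustrative only):

```python
import pandas as pd

s = pd.Series(['abcdefg', 'bbbbbbb', 'ccccccc', 'zzzzzzzz'])

by_concat = s.str[:1] + 'AAA' + s.str[4:]
by_slice_replace = s.str.slice_replace(1, 4, 'AAA')

print(by_concat.equals(by_slice_replace))  # True
print(by_concat.tolist())  # ['aAAAefg', 'bAAAbbb', 'cAAAccc', 'zAAAzzzz']
```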
