Renaming column names from a data set in pandas - python

I am trying to rename the columns of a DataFrame whose names contain spaces. The DataFrame (df) has 45 columns, and most of them have spaces in the name. For instance, df.columns.values[1] is 'Date Release', and it should become 'Date_Release'. I tried DataFrame.rename() and assigning to DataFrame.columns.values[], but neither worked. I would much appreciate help finding out what I did wrong:
for colmns in df:
    if ' ' in colmns:
        colmns_new = '_'.join(colmns.split())
        df = df.rename (columns = {"\"%s\"" %colmns : "\"%s\"" %colmns_new})
    else:
        print (colmns)
print (df)
or this one:
for i in range (len(df.columns)):
    old= df.columns.values[i]
    if ' ' in old:
        new = '_'.join(old.split())
        df = df.columns.values[i] = ['%s' % new]
        print ("\"%s\"" % new)
print (df)
Error: AttributeError: 'list' object has no attribute 'columns'

import pandas as pd
df.columns = [i.replace(' ','_') for i in df.columns]
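A likely reason the first attempt in the question leaves the frame unchanged: the keys passed to rename are wrapped in literal quote characters, so they never match a real column label, and rename silently ignores keys it cannot find. A minimal sketch, assuming a hypothetical two-column frame in place of the 45-column one:

```python
import pandas as pd

# Hypothetical frame standing in for the 45-column DataFrame from the question
df = pd.DataFrame(columns=["Date Release", "Title"])

# Key wrapped in literal quotes, as "\"%s\"" % colmns produces: it matches no label
unchanged = df.rename(columns={'"Date Release"': '"Date_Release"'})

# Key equal to the real label: the column is renamed
renamed = df.rename(columns={"Date Release": "Date_Release"})

print(list(unchanged.columns))  # ['Date Release', 'Title']
print(list(renamed.columns))    # ['Date_Release', 'Title']
```

The second attempt fails for a different reason: the chained assignment rebinds df to a list, which is why the later df.columns lookup raises the AttributeError.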

You can use a regex to replace spaces with underscores.
Here is an example df with some columns containing spaces:
import re
import pandas as pd
cols = ['col {}'.format(i) for i in range(1, 10, 1)] + ['col10']
df = pd.DataFrame(columns = cols)
df.columns = [re.sub(' ','_',i) for i in df.columns]
You get
col_1 col_2 col_3 col_4 col_5 col_6 col_7 col_8 col_9 col10

You can just assign df.columns = df.columns.str.replace(' ','_') to replace each space with an underscore.
Here's an example. Column a1 does not have a space, but columns b 2 and c 3 do.
>>> df = pd.DataFrame({'a1': range(1,5), 'b 2': list('abcd'), 'c 3': list('pqrs')})
>>> df
   a1 b 2 c 3
0   1   a   p
1   2   b   q
2   3   c   r
3   4   d   s
>>> df.columns = df.columns.str.replace(' ','_')
>>> df
   a1 b_2 c_3
0   1   a   p
1   2   b   q
2   3   c   r
3   4   d   s
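If some names contain runs of several spaces, tabs, or leading and trailing whitespace, the same string accessor can take a regular expression. This is only a sketch: the column names below are made up for illustration, and stripping the ends first is an assumption about what is wanted.

```python
import pandas as pd

# Hypothetical column names with irregular whitespace
df = pd.DataFrame(columns=["Date  Release", " Title ", "Rating"])

# Strip the ends, then collapse every whitespace run into a single underscore
df.columns = df.columns.str.strip().str.replace(r"\s+", "_", regex=True)

print(list(df.columns))  # ['Date_Release', 'Title', 'Rating']
```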

Related

Rename duplicate column name by order in Pandas

I have a dataframe, df, where I would like to rename two duplicate columns in consecutive order:
Data
DD Nice Nice Hello
0 1 1 2
Desired
DD Nice1 Nice2 Hello
0 1 1 2
Doing
df.rename(columns={"Name": "Name1", "Name": "Name2"})
I am running the rename function, however, because both column names are identical, the results are not desirable.
Here's an approach with groupby:
import numpy as np
s = df.columns.to_series().groupby(df.columns)
df.columns = np.where(s.transform('size')>1,
                      df.columns + s.cumcount().add(1).astype(str),
                      df.columns)
Output:
DD Nice1 Nice2 Hello
0 0 1 1 2
You could use an itertools.count() counter and a list comprehension to create new column headers, then assign them to the data frame.
For example:
>>> import itertools
>>> df = pd.DataFrame([[1, 2, 3]], columns=["Nice", "Nice", "Hello"])
>>> df
Nice Nice Hello
0 1 2 3
>>> count = itertools.count(1)
>>> new_cols = [f"Nice{next(count)}" if col == "Nice" else col for col in df.columns]
>>> df.columns = new_cols
>>> df
Nice1 Nice2 Hello
0 1 2 3
(Python 3.6+ required for the f-strings)
EDIT: Alternatively, per the comment below, the list comprehension can replace any label that contains "Nice", in case there are unexpected spaces or other characters:
new_cols = [f"Nice{next(count)}" if "Nice" in col else col for col in df.columns]
You can use:
cols = pd.Series(df.columns)
dup_count = cols.value_counts()
for dup in cols[cols.duplicated()].unique():
    cols[cols[cols == dup].index.values.tolist()] = [dup + str(i) for i in range(1, dup_count[dup]+1)]
df.columns = cols
Input:
col_1 Nice Nice Nice Hello Hello Hello
col_2 1 2 3 4 5 6
Output:
col_1 Nice1 Nice2 Nice3 Hello1 Hello2 Hello3
col_2 1 2 3 4 5 6
Setup to generate duplicate cols:
df = pd.DataFrame(data={'col_1':['Nice', 'Nice', 'Nice', 'Hello', 'Hello', 'Hello'], 'col_2':[1,2,3,4, 5, 6]})
df = df.set_index('col_1').T
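The approaches above share one idea: count how often each label occurs, then append a running number to the labels that repeat. A minimal sketch of that idea in plain Python follows; number_duplicates is a hypothetical helper name, not a pandas function.

```python
from collections import Counter

import pandas as pd


def number_duplicates(labels):
    """Append 1, 2, ... to labels that occur more than once; keep unique labels."""
    totals = Counter(labels)  # how many times each label appears overall
    seen = Counter()          # how many times each label has been seen so far
    renamed = []
    for label in labels:
        if totals[label] > 1:
            seen[label] += 1
            renamed.append(f"{label}{seen[label]}")
        else:
            renamed.append(label)
    return renamed


df = pd.DataFrame([[0, 1, 1, 2]], columns=["DD", "Nice", "Nice", "Hello"])
df.columns = number_duplicates(df.columns)
print(list(df.columns))  # ['DD', 'Nice1', 'Nice2', 'Hello']
```

Because the helper only looks at the labels, it works the same way for any number of duplicated names.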

How I can merge the columns into a single column in Python?

I want to merge 3 columns into a single column. I have tried changing the column types. However, I could not do it.
For example, I have 3 columns such as A: {1,2,4}, B:{3,4,4}, C:{1,1,1}
Output expected: ABC Column {131, 241, 441}
My inputs are like this:
df['ABC'] = df['A'].map(str) + df['B'].map(str) + df['C'].map(str)
df.head()
ABC {13.01.0 , 24.01.0, 44.01.0}
The dtype of ABC appears to be object, and I could not change it via str or int.
df['ABC'].apply(str)
Also, I realized that there are NaN values in columns A, B, and C. Is it possible to merge them even with NaN values?
# Example
import pandas as pd
import numpy as np
df = pd.DataFrame()
# Considering NaNs in the DataFrame
df['colA'] = [1, 2, 4, np.nan, 5]
df['colB'] = [3, 4, 4, 3, np.nan]
df['colC'] = [1, 1, 1, 4, 1]
# Use pd.isna() to leave NaN untouched and turn every other value into an integer string
for col in ['colA', 'colB', 'colC']:
    df[col] = df[col].apply(lambda x: x if pd.isna(x) else str(int(x)))
# Replace the remaining NaN values with an empty string
df = df.fillna('')
# Make sure every column holds strings
df = df.astype(str)
# Concatenate the columns row by row
df['ABC'] = df['colA'] + df['colB'] + df['colC']
A workaround for your NaN problem could look like this, but note that each NaN becomes 0:
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': [1,2,4, np.nan], 'B':[3,4,4,4], 'C':[1,np.nan,1, 3]})
df = df.fillna(0).astype(int).astype(str)
df['ABC'] = df['A'] + df['B'] + df['C']
Output:
   A  B  C  ABC
0  1  3  1  131
1  2  4  0  240
2  4  4  1  441
3  0  4  3  043
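If turning NaN into 0 is not acceptable, one possible alternative is to go through pandas' nullable integer dtype so that missing values stay missing until they are replaced by an empty string. This is a sketch under the assumption that every non-missing value is a whole number; the variable name parts is introduced only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 4, np.nan], "B": [3, 4, 4, 4], "C": [1, np.nan, 1, 3]})

# Nullable Int64 drops the ".0" suffix while keeping missing values as <NA>
parts = df[["A", "B", "C"]].astype("Int64").astype("string").fillna("")

# Missing values now contribute nothing to the concatenated string
df["ABC"] = parts["A"] + parts["B"] + parts["C"]

print(df["ABC"].tolist())  # ['131', '24', '441', '43']
```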

Changing multiple column names

Let's say I have a data frame with such column names:
['a','b','c','d','e','f','g']
And I would like to change the names from 'c' to 'f' (that is, add a string to each column name), so that the data frame's column names look like this:
['a','b','var_c_equal','var_d_equal','var_e_equal','var_f_equal','g']
First, I wrote a call that changes every column name with the string I want:
df.rename(columns=lambda x: 'or_'+x+'_no', inplace=True)
But now I really want to understand how to implement something like this:
df.loc[:,'c':'f'].rename(columns=lambda x: 'var_'+x+'_equal', inplace=True)
You can use a list comprehension for that:
Code:
new_columns = ['var_{}_equal'.format(c) if c in 'cdef' else c for c in df.columns]
Test Code:
import pandas as pd
df = pd.DataFrame({'a':(1,2), 'b':(1,2), 'c':(1,2), 'd':(1,2)})
print(df)
df.columns = ['var_{}_equal'.format(c) if c in 'cdef' else c
              for c in df.columns]
print(df)
Results:
   a  b  c  d
0  1  1  1  1
1  2  2  2  2
   a  b  var_c_equal  var_d_equal
0  1  1            1            1
1  2  2            2            2
One way is to use a dictionary instead of an anonymous function. The first two variations below assume the columns you need to rename are contiguous.
Contiguous columns by position
d = {k: 'var_'+k+'_equal' for k in df.columns[2:6]}
df = df.rename(columns=d)
Contiguous columns by name
If you need to calculate the numerical indices:
cols = df.columns.get_loc
d = {k: 'var_'+k+'_equal' for k in df.columns[cols('c'):cols('f')+1]}
df = df.rename(columns=d)
Specifically identified columns
If you want to provide the columns explicitly:
d = {k: 'var_'+k+'_equal' for k in 'cdef'}
df = df.rename(columns=d)
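The df.loc[:, 'c':'f'].rename(..., inplace=True) attempt from the question has no visible effect because it renames a temporary object returned by the slice, not df itself. The label slice can still be reused to pick the target names. A minimal sketch, assuming a one-row frame with the seven columns from the question:

```python
import pandas as pd

# Hypothetical one-row frame with the column names from the question
df = pd.DataFrame([[0] * 7], columns=list("abcdefg"))

# A label slice with .loc includes both endpoints, so this selects c, d, e, f
targets = df.loc[:, "c":"f"].columns

# Rename only the selected labels on the original frame
df = df.rename(columns={c: f"var_{c}_equal" for c in targets})

print(list(df.columns))
```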

Match values in dataframe rows

I have a dataframe (df) that looks like:
name type cost
a apples 1
b apples 2
c oranges 1
d banana 4
e orange 6
Apart from using two for loops, is there a way to compare each name and type in the list against every other row? I want pairs where the name is not compared with itself (a vs a), the type is the same (apples vs apples), and a pair is not repeated in reverse order (if we have a vs b, I do not want to see b vs a). The output list should look like this:
name1, name2, status
a b 0
c e 0
Where the first 2 elements are the names where the criteria match and the third element is always a 0.
I have tried to do this with 2 for loops (see below) but can't get it to reject say b vs a if we already have a vs b.
def pairListCreator(staticData):
    for x, row1 in df.iterrows():
        name1 = row1['name']
        type1= row1['type']
        for y, row2 in df.iterrows():
            name2 = row['name']
            type2 = row['type']
            if name1<> name2 and type1 = type2:
                pairList = name1,name2,0
Something like this:
import pandas as pd
# Data
data = [['a', 'apples', 1],
        ['b', 'apples', 2],
        ['c', 'orange', 1],
        ['d', 'banana', 4],
        ['e', 'orange', 6]]
# Create DataFrame
df = pd.DataFrame(data, columns=['name', 'type', 'cost'])
df.set_index('name', inplace=True)
# Print DataFrame
print(df)
# Count number of rows
nr_of_rows = df.shape[0]
# Collect matching pairs; starting j at i + 1 skips self-pairs and reversed pairs
res_col_nam = ['name1', 'name2', 'status']
rows = []
for i in range(nr_of_rows):
    x = df.iloc[i]
    for j in range(i + 1, nr_of_rows):
        y = df.iloc[j]
        if x['type'] == y['type']:
            rows.append([x.name, y.name, 0])
# Build the result once (DataFrame.append was removed in pandas 2.0)
result = pd.DataFrame(rows, columns=res_col_nam)
# Print result
print('result:')
print(result)
Output:
        type  cost
name
a     apples     1
b     apples     2
c     orange     1
d     banana     4
e     orange     6
result:
  name1 name2  status
0     a     b       0
1     c     e       0
You can use a self join on the type column first, then sort the values of the name columns within each row with np.sort.
Then remove rows with equal names by boolean indexing, drop the repeated pairs with drop_duplicates, and add the new status column with assign:
import numpy as np
df = pd.merge(df, df, on='type', suffixes=('1','2'))
names = ['name1','name2']
# Sort each pair so that (b, a) becomes (a, b)
df[names] = np.sort(df[names].to_numpy(), axis=1)
df = (df[df.name1 != df.name2].drop_duplicates(subset=names)[names]
      .assign(status=0)
      .reset_index(drop=True))
print (df)
  name1 name2  status
0     a     b       0
1     c     e       0
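Another way to avoid both self-pairs and reversed pairs is to generate unordered pairs directly with itertools.combinations inside each type group. This is a sketch, assuming name is an ordinary column and the data uses the spelling 'orange' for both c and e, as in the answers above.

```python
from itertools import combinations

import pandas as pd

df = pd.DataFrame({
    "name": ["a", "b", "c", "d", "e"],
    "type": ["apples", "apples", "orange", "banana", "orange"],
    "cost": [1, 2, 1, 4, 6],
})

# combinations(names, 2) yields each unordered pair once and never pairs a name with itself
pairs = [
    (name1, name2, 0)
    for _, names in df.groupby("type", sort=False)["name"]
    for name1, name2 in combinations(names, 2)
]

result = pd.DataFrame(pairs, columns=["name1", "name2", "status"])
print(result)
```

Groups with a single member, such as banana, produce no pairs, so they drop out without an explicit filter.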

How to (re)name an empty column header in a pandas dataframe without exporting to csv

I have a pandas dataframe df1 with an index column and an unnamed series of values. I want to assign a name to the unnamed series.
The only way to do this that I know so far is to export to df1.csv using:
df1.to_csv("df1.csv", header = ["Signal"])
and then re-import using:
pd.read_csv("df1.csv", sep=",")
However, this costs time and storage space. How to do this in-memory?
When I run df2 = df1.rename(columns = {"" : "Signal"}, inplace = True)
I get:
AttributeError: "Series" object has no attribute "Signal".
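I think inplace=True has to be removed, because with it rename returns None. Either assign the result:
df2 = df1.rename(columns = {"" : "Signal"})
or rename in place without assigning:
df1.rename(columns = {"" : "Signal"}, inplace = True)
Another solution is to assign the new name by position, by rebuilding the list of labels:
df.columns = ['Signal'] + list(df.columns[1:])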
Sample:
df1 = pd.DataFrame({'':[1,2,3],
                    'B':[4,5,6],
                    'C':[7,8,9]})
print (df1)
      B  C
0  1  4  7
1  2  5  8
2  3  6  9
df2 = df1.rename(columns = {"" : "Signal"})
print (df2)
   Signal  B  C
0       1  4  7
1       2  5  8
2       3  6  9
You can use this if there are multiple empty columns. It renames each empty column to cols_ followed by i, the column position:
df.columns = ["cols_"+str(i) if a == "" else a for i, a in enumerate(df.columns)]
# cols_ -> any prefix you want for the new name
# i -> the position of the column
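The error message in the question mentions a "Series" object, which suggests df1 may be a Series rather than a DataFrame. If that assumption holds, the values have no column header to rename, and the Series itself can be named instead. A minimal sketch with made-up data:

```python
import pandas as pd

# Hypothetical unnamed Series standing in for df1
s = pd.Series([1, 2, 3], index=["x", "y", "z"])

# Passing a scalar to rename sets the Series name and returns a new Series
named = s.rename("Signal")

# to_frame turns the Series into a one-column DataFrame with that header
frame = s.to_frame(name="Signal")

print(named.name)           # Signal
print(list(frame.columns))  # ['Signal']
```

Both calls work in memory, so no round trip through a CSV file is needed.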
