I have a df with a number of columns, and I would like to rename these with incremental numbers, selecting the starting and ending range.
From the dataframe below, I would like to select only columns B-E and rename them to 1-4.
So basically I want to select the headers via index numbers and add incremental numbers instead.
EDIT: The above dataframe can be created with:
data = [['a','b','c','d','e','f'], ['a','b','c','d','e','f'], ['a','b','c','d','e','f'],['a','b','c','d','e','f']]
df = pd.DataFrame(data, columns = ['A','B','C','D','E','F'])
Use rename with columns selected by DataFrame.loc, here between B and E:
c = df.loc[:, 'B':'E'].columns
df = df.rename(columns=dict(zip(c, range(1, len(c) + 1))))
print (df)
A 1 2 3 4 F
0 a b c d e f
1 a b c d e f
2 a b c d e f
3 a b c d e f
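Since the question asks about selecting the headers via index numbers, here is a minimal positional variant of the same idea (a sketch on my part, assuming columns B-E sit at positions 1-4):
c = df.columns[1:5]  # columns at positions 1..4, i.e. B..E
df = df.rename(columns=dict(zip(c, range(1, len(c) + 1))))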
If this is something you have to do frequently, you can write a custom function for that:
def col_rename(df, start, stop, inplace=False):
    cols = list(df.loc[:, start:stop].columns)
    new_cols = df.columns.map(lambda x: {k: v for v, k in
                                         enumerate(cols, start=1)
                                         }.get(x, x))
    if inplace:
        df.columns = new_cols
    else:
        return new_cols
df.columns = col_rename(df, 'B', 'E')
# or
# col_rename(df, 'B', 'E', inplace=True)
output:
A 1 2 3 4 F
0 0 1 2 3 4 5
used input:
df = pd.DataFrame([range(6)], columns=list('ABCDEF'))
# A B C D E F
# 0 0 1 2 3 4 5
I am new to data science. I want to check which elements from one data frame exist in another data frame, e.g.
df1 = [1,2,8,6]
df2 = [5,2,6,9]
# for 1 output should be False
# for 2 output should be True
# for 6 output should be True
etc.
Note: I have a matrix, not a vector.
I have tried using the following code:
import pandas as pd
import numpy as np

priority_dataframe = pd.read_excel(prioritylist_file_path, sheet_name='Sheet1', index=None)
priority_dict = {column: np.array(priority_dataframe[column].dropna(axis=0, how='all').str.lower())
                 for column in priority_dataframe.columns}
keys_found_per_sheet = []
if file_path.lower().endswith(('.csv')):
    file_dataframe = pd.read_csv(file_path)
else:
    file_dataframe = pd.read_excel(file_path, sheet_name=sheet, index=None)
file_cell_array = list()
for column in file_dataframe.columns:
    for file_cell in np.array(file_dataframe[column].dropna(axis=0, how='all')):
        if isinstance(file_cell, str) == 'str':
            file_cell_array.append(file_cell)
        else:
            file_cell_array.append(str(file_cell))
converted_file_cell_array = np.array(file_cell_array)
for key, values in priority_dict.items():
    for priority_cell in values:
        if priority_cell in converted_file_cell_array[:]:
            keys_found_per_sheet.append(key)
            break
Am I doing something wrong in if priority_cell in converted_file_cell_array[:]?
Is there any other efficient way to do that?
You can take the .values from each dataframe, convert them to a set(), and take the set intersection.
set1 = set(df1.values.reshape(-1).tolist())
set2 = set(df2.values.reshape(-1).tolist())
common = set1 & set2
You can flatten all values of the DataFrames with numpy.ravel and then use set.intersection():
df1 = pd.DataFrame({'A':list('abcdef'),
'B':[4,5,4,5,5,4],
'C':[7,8,9,4,2,3],
'D':[1,3,5,7,1,0],
'E':[5,3,6,9,2,4],
'F':list('aaabbb')})
print (df1)
A B C D E F
0 a 4 7 1 5 a
1 b 5 8 3 3 a
2 c 4 9 5 6 a
3 d 5 4 7 9 b
4 e 5 2 1 2 b
5 f 4 3 0 4 b
df2 = pd.DataFrame({'A':[2,3,13,4], 'Z':list('abfr')})
print (df2)
A Z
0 2 a
1 3 b
2 13 f
3 4 r
L = list(set(df1.values.ravel()).intersection(df2.values.ravel()))
print (L)
['f', 2, 3, 4, 'a', 'b']
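If you also need the element-wise True/False output described in the question (rather than just the set of common values), here is a small sketch with DataFrame.isin, assuming df1 and df2 are the DataFrames defined above:
common = set(df2.values.ravel())
print(df1.isin(common))  # True where a value of df1 also occurs somewhere in df2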
For a given dataframe df, imported from a csv file and containing redundant columns, I would like to write a function that performs recursive filtering and subsequent renaming of df.columns, based on the number of arguments given.
Ideally the function should perform as follows.
When the input is (df, 'string1a', 'string1b', 'new_col_name1'), then:
filter1 = [col for col in df.columns if 'string1a' in col and 'string1b' in col]
df_out = df[filter1]
df_out.columns = ['new_col_name1']
return df_out
Whereas, when the input is:
(df, 'string1a', 'string1b', 'new_col_name1', 'string2a', 'string2b', 'new_col_name2', 'string3a', 'string3b', 'new_col_name3')
the function should return:
filter1 = [col for col in df.columns if 'string1a' in col and 'string1b' in col]
filter2 = [col for col in df.columns if 'string2a' in col and 'string2b' in col]
filter3 = [col for col in df.columns if 'string3a' in col and 'string3b' in col]
df_out = df[filter1 + filter2 + filter3]
df_out.columns = ['new_col_name1', 'new_col_name2', 'new_col_name3']
return df_out
I think you can use a dictionary to define the values, and then apply a function with np.logical_and.reduce because you need to check multiple values from a list:
df = pd.DataFrame({'aavfb':list('abcdef'),
'cedf':[4,5,4,5,5,4],
'd':[7,8,9,4,2,3],
'c':[1,3,5,7,1,0],
'abds':[5,3,6,9,2,4],
'F':list('aaabbb')})
print (df)
F aavfb abds c cedf d
0 a a 5 1 4 7
1 a b 3 3 5 8
2 a c 6 5 4 9
3 b d 9 7 5 4
4 b e 2 1 5 2
5 b f 4 0 4 3
def rename1(df, d):
    # loop over the dict
    for k, v in d.items():
        # get a mask for columns containing all values in the list
        m = np.logical_and.reduce([df.columns.str.contains(x) for x in v])
        # set new column names by mask
        df.columns = np.where(m, k, df.columns)
    # filter all columns by the keys of the dict
    return df.loc[:, df.columns.isin(d.keys())]
d = {'new_col_name1':['a', 'b'],
'new_col_name2':['c', 'd']}
print (rename1(df, d))
new_col_name1 new_col_name1 new_col_name2
0 a 5 4
1 b 3 5
2 c 6 4
3 d 9 5
4 e 2 5
5 f 4 4
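If you prefer the exact calling convention from the question (triplets of stringA, stringB, new_col_name), here is a sketch of a hypothetical wrapper that builds the dictionary for rename1 from positional arguments:
def rename_by_args(df, *args):
    # group the flat argument list into (stringA, stringB, new_name) triplets
    d = {args[i + 2]: [args[i], args[i + 1]] for i in range(0, len(args), 3)}
    return rename1(df, d)

print(rename_by_args(df, 'a', 'b', 'new_col_name1', 'c', 'd', 'new_col_name2'))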
I have the following dataframe, where '' is considered empty:
df = pd.DataFrame({1: ['a', 'b', 'c']+ ['']*2, 2: ['']*2+ ['d','e', 'f']})
1 2
0 a ''
1 b ''
2 c d
3 '' e
4 '' f
How can I merge/join/combine (I don't know the correct term) col2 into col1 so that I have:
1 2
0 a ''
1 b ''
2 c d
3 e ''
4 f ''
or if I decide to merge col1 into col2:
1 2
0 '' a
1 '' b
2 c d
3 '' e
4 '' f
I would like to be able to decide which column to merge into, and the other column should contain the conflicting values.
Thank you in advance
You can also use the combine_first method for a vectorized (and simpler) version:
df[1].replace('', np.nan).combine_first(df[2])
results in:
0 a
1 b
2 c
3 e
4 f
You could also get both columns at once:
df.replace('', np.nan).combine_first(df.rename(columns={1: 2, 2: 1}))
results in:
1 2
0 a a
1 b b
2 c d
3 e e
4 f f
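If you also want the other column to keep only the conflicting values, exactly as in the question's expected output, here is a sketch built on the same replace/combine_first idea (my own addition, not part of the original answer):
merged = df[1].replace('', np.nan).combine_first(df[2])
conflict = df[2].where(df[1] != '', '')  # keep df[2] only where df[1] already had a value
print(pd.DataFrame({1: merged, 2: conflict}))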
You can do this using the dataframe method apply():
Sample data:
df
1 2
0 a
1 b
2 c d
3 e
4 f
Define arbitrary variables:
merge_to_column = 2
other_column = 1
Use apply:
df['output'] = df.apply(lambda x: x[other_column] if x[merge_to_column] == '' else x[merge_to_column], axis=1)
Output:
df
1 2 output
0 a a
1 b b
2 c d d
3 e e
4 f f
def merge(col1, col2):
    for x in range(len(col1)):
        if col1[x] == '':
            col1[x] = col2[x]
            col2[x] = ''
This function merges values from col2 into col1 wherever col1 contains an empty string, assuming both columns are the same size. You can handle different sizes as needed.
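A quick usage sketch (an assumption on my part, not shown above), pulling the columns out as plain lists and writing them back:
col1, col2 = df[1].tolist(), df[2].tolist()
merge(col1, col2)  # fills the gaps in col1 from col2, in place
df[1], df[2] = col1, col2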
You can use .fillna():
df[1] = df[1].fillna(df[2])
then you take out the values from df[2] that collide:
df[2] = [None if r[1] == r[2] else r[2] for _, r in df.iterrows()]
output:
1 2
0 a None
1 b None
2 c d
3 e None
4 f None
Note that instead of using '' for empty values, you have to use None in this case:
df = pd.DataFrame({1: ['a', 'b', 'c']+[None]*2, 2: [None]*2+['d','e', 'f']})
Let's say I have a dataframe that looks like this:
df = pd.DataFrame(index=list('abcde'), data={'A': range(5), 'B': range(5)})
df
Out[92]:
A B
a 0 0
b 1 1
c 2 2
d 3 3
e 4 4
Assuming that this dataframe already exists, how can I simply add a level 'C' to the column index so I get this:
df
Out[92]:
A B
C C
a 0 0
b 1 1
c 2 2
d 3 3
e 4 4
I saw SO answers like python/pandas: how to combine two dataframes into one with hierarchical column index?, but these concatenate different dataframes instead of adding a column level to an already existing dataframe.
As suggested by #StevenG himself, a better answer:
df.columns = pd.MultiIndex.from_product([df.columns, ['C']])
print(df)
# A B
# C C
# a 0 0
# b 1 1
# c 2 2
# d 3 3
# e 4 4
option 1
set_index and T
df.T.set_index(np.repeat('C', df.shape[1]), append=True).T
option 2
pd.concat, keys, and swaplevel
pd.concat([df], axis=1, keys=['C']).swaplevel(0, 1, 1)
A solution which adds a name to the new level and is easier on the eyes than other answers already presented:
df['newlevel'] = 'C'
df = df.set_index('newlevel', append=True).unstack('newlevel')
print(df)
# A B
# newlevel C C
# a 0 0
# b 1 1
# c 2 2
# d 3 3
# e 4 4
You could just assign the columns like:
>>> df.columns = [df.columns, ['C', 'C']]
>>> df
A B
C C
a 0 0
b 1 1
c 2 2
d 3 3
e 4 4
>>>
Or for unknown length of columns:
>>> df.columns = [df.columns.get_level_values(0), np.repeat('C', df.shape[1])]
>>> df
A B
C C
a 0 0
b 1 1
c 2 2
d 3 3
e 4 4
>>>
Another way for a MultiIndex (appending 'E'):
df.columns = pd.MultiIndex.from_tuples(map(lambda x: (x[0], 'E', x[1]), df.columns))
A B
E E
C D
a 0 0
b 1 1
c 2 2
d 3 3
e 4 4
I like it explicit (using MultiIndex) and chain-friendly (.set_axis):
df.set_axis(pd.MultiIndex.from_product([df.columns, ['C']]), axis=1)
This is particularly convenient when merging DataFrames with different numbers of column levels, where Pandas (1.4.2) raises a FutureWarning (FutureWarning: merging between different levels is deprecated and will be removed ...):
import pandas as pd
df1 = pd.DataFrame(index=list('abcde'), data={'A': range(5), 'B': range(5)})
df2 = pd.DataFrame(index=list('abcde'), data=range(10, 15), columns=pd.MultiIndex.from_tuples([("C", "x")]))
# df1:
A B
a 0 0
b 1 1
# df2:
C
x
a 10
b 11
# merge while giving df1 another column level:
pd.merge(df1.set_axis(pd.MultiIndex.from_product([df1.columns, ['']]), axis=1),
df2,
left_index=True, right_index=True)
# result:
A B C
x
a 0 0 10
b 1 1 11
Another method, but using a list comprehension of tuples as the arg to pandas.MultiIndex.from_tuples():
df.columns = pd.MultiIndex.from_tuples([(col, 'C') for col in df.columns])
df
A B
C C
a 0 0
b 1 1
c 2 2
d 3 3
e 4 4
I want to group my data set and enrich it with a formatted representation of the aggregated information.
This is my data set:
h = ['A', 'B', 'C']
d = [["a", "x", 1], ["a", "y", 2], ["b", "y", 4]]
rows = pd.DataFrame(d, columns=h)
A B C
0 a x 1
1 a y 2
2 b y 4
I create a pivot table to generate 0 for missing values:
pivot = pd.pivot_table(rows,index=["A"], values=["C"], columns=["B"],fill_value=0)
C
B x y
A
a 1 2
b 0 4
I group by A to remove dimension B:
wanted = rows.groupby("A").sum()
C
A
a 3
b 4
I try to add a column with the string representation of the aggregate details:
wanted["D"] = pivot["C"].applymap(lambda vs: reduce(lambda a,b: str(a)+"+"+str(b), vs.values))
AttributeError: ("'int' object has no attribute 'values'", u'occurred at index x')
It seems that I don't understand applymap.
What I want to achieve is:
C D
A
a 3 1+2
b 4 0+4
You can first remove the [] from the parameters of pivot_table, so you remove the MultiIndex from the columns:
pivot = pd.pivot_table(rows,index="A", values="C", columns="B",fill_value=0)
Then sum the values across the columns:
pivot['C'] = pivot.sum(axis=1)
print (pivot)
B x y C
A
a 1 2 3
b 0 4 4
Cast the int columns x and y to str with astype, and write the output to D:
pivot['D'] = pivot['x'].astype(str) + '+' + pivot['y'].astype(str)
print (pivot)
B x y C D
A
a 1 2 3 1+2
b 0 4 4 0+4
Last, remove the column axis name with rename_axis (new in pandas 0.18.0) and drop the unnecessary columns:
pivot = pivot.rename_axis(None, axis=1).drop(['x', 'y'], axis=1)
print (pivot)
C D
A
a 3 1+2
b 4 0+4
But if you want a MultiIndex in the columns:
pivot = pd.pivot_table(rows,index=["A"], values=["C"], columns=["B"],fill_value=0)
pivot['E'] = pivot["C"].sum(1)
print (pivot)
C E
B x y
A
a 1 2 3
b 0 4 4
pivot["D"] = pivot[('C','x')].astype(str) + '+' + pivot[('C','y')].astype(str)
print (pivot)
C E D
B x y
A
a 1 2 3 1+2
b 0 4 4 0+4
pivot = pivot.rename_axis((None,None), axis=1).drop('C', axis=1).rename(columns={'E':'C'})
pivot.columns = pivot.columns.droplevel(-1)
print (pivot)
C D
A
a 3 1+2
b 4 0+4
EDIT:
Another solution with groupby and MultiIndex.droplevel:
pivot = pd.pivot_table(rows,index=["A"], values=["C"], columns=["B"],fill_value=0)
#remove top level of Multiindex in columns
pivot.columns = pivot.columns.droplevel(0)
print (pivot)
B x y
A
a 1 2
b 0 4
wanted = rows.groupby("A").sum()
wanted['D'] = pivot['x'].astype(str) + '+' + pivot['y'].astype(str)
print (wanted)
C D
A
a 3 1+2
b 4 0+4
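As a side note, if column B can take more than two values, here is a sketch that builds D without hard-coding x and y (assuming pivot still holds the columns left after droplevel above):
wanted['D'] = pivot.astype(str).apply('+'.join, axis=1)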