Automatically rename columns to ensure they are unique - python

I fetch a spreadsheet into a Python DataFrame named df.
Here's a sample:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.random.rand(10), 'b': np.random.rand(10)})
df.columns = ['a', 'a']
a a
0 0.973858 0.036459
1 0.835112 0.947461
2 0.520322 0.593110
3 0.480624 0.047711
4 0.643448 0.104433
5 0.961639 0.840359
6 0.848124 0.437380
7 0.579651 0.257770
8 0.919173 0.785614
9 0.505613 0.362737
When I run df.columns.is_unique I get False.
I would like to automatically rename column 'a' to 'a_2' (or something like that).
I don't expect a solution like df.columns = ['a', 'a_2'];
I'm looking for a solution that works for any number of duplicated columns!

You can uniquify the columns manually:
df_columns = ['a', 'b', 'a', 'a_2', 'a_2', 'a', 'a_2', 'a_2_2']
def uniquify(df_columns):
    seen = set()
    for item in df_columns:
        fudge = 1
        newitem = item
        while newitem in seen:
            fudge += 1
            newitem = "{}_{}".format(item, fudge)
        yield newitem
        seen.add(newitem)
list(uniquify(df_columns))
#>>> ['a', 'b', 'a_2', 'a_2_2', 'a_2_3', 'a_3', 'a_2_4', 'a_2_2_2']
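To apply this to the frame itself, something like the following should work (a sketch reusing the generator above):
df.columns = list(uniquify(df.columns))
df.columns.is_unique
#>>> True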

I fetch a spreadsheet into a Python DataFrame named df... I would like
to automatically rename [duplicate] column [names].
Pandas does that automatically for you without you having to do anything...
test.xls:
import pandas as pd
import numpy as np
df = pd.io.excel.read_excel(
    "./test.xls",
    "Sheet1",
    header=0,
    index_col=0,
)
print df
--output:--
a b c b.1 a.1 a.2
index
0 10 100 -10 -100 10 21
1 20 200 -20 -200 11 22
2 30 300 -30 -300 12 23
3 40 400 -40 -400 13 24
4 50 500 -50 -500 14 25
5 60 600 -60 -600 15 26
print df.columns.is_unique
--output:--
True
If for some reason you are being given a DataFrame with duplicate columns, you can do this:
import pandas as pd
import numpy as np
from collections import defaultdict
df = pd.DataFrame(
    {
        'k': np.random.rand(10),
        'l': np.random.rand(10),
        'm': np.random.rand(10),
        'n': np.random.rand(10),
        'o': np.random.rand(10),
        'p': np.random.rand(10),
    }
)
print df
--output:--
k l m n o p
0 0.566150 0.025225 0.744377 0.222350 0.800402 0.449897
1 0.701286 0.182459 0.661226 0.991143 0.793382 0.980042
2 0.383213 0.977222 0.404271 0.050061 0.839817 0.779233
3 0.428601 0.303425 0.144961 0.313716 0.244979 0.487191
4 0.187289 0.537962 0.669240 0.096126 0.242258 0.645199
5 0.508956 0.904390 0.838986 0.315681 0.359415 0.830092
6 0.007256 0.136114 0.775670 0.665000 0.840027 0.991058
7 0.719344 0.072410 0.378754 0.527760 0.205777 0.870234
8 0.255007 0.098893 0.079230 0.225225 0.490689 0.554835
9 0.481340 0.300319 0.649762 0.460897 0.488406 0.166047
df.columns = ['a', 'b', 'c', 'b', 'a', 'a']
print df
--output:--
a b c b a a
0 0.566150 0.025225 0.744377 0.222350 0.800402 0.449897
1 0.701286 0.182459 0.661226 0.991143 0.793382 0.980042
2 0.383213 0.977222 0.404271 0.050061 0.839817 0.779233
3 0.428601 0.303425 0.144961 0.313716 0.244979 0.487191
4 0.187289 0.537962 0.669240 0.096126 0.242258 0.645199
5 0.508956 0.904390 0.838986 0.315681 0.359415 0.830092
6 0.007256 0.136114 0.775670 0.665000 0.840027 0.991058
7 0.719344 0.072410 0.378754 0.527760 0.205777 0.870234
8 0.255007 0.098893 0.079230 0.225225 0.490689 0.554835
9 0.481340 0.300319 0.649762 0.460897 0.488406 0.166047
print df.columns.is_unique
--output:--
False
name_counts = defaultdict(int)
new_col_names = []
for name in df.columns:
    new_count = name_counts[name] + 1
    new_col_names.append("{}{}".format(name, new_count))
    name_counts[name] = new_count
print new_col_names
--output:--
['a1', 'b1', 'c1', 'b2', 'a2', 'a3']
df.columns = new_col_names
print df
--output:--
a1 b1 c1 b2 a2 a3
0 0.566150 0.025225 0.744377 0.222350 0.800402 0.449897
1 0.701286 0.182459 0.661226 0.991143 0.793382 0.980042
2 0.383213 0.977222 0.404271 0.050061 0.839817 0.779233
3 0.428601 0.303425 0.144961 0.313716 0.244979 0.487191
4 0.187289 0.537962 0.669240 0.096126 0.242258 0.645199
5 0.508956 0.904390 0.838986 0.315681 0.359415 0.830092
6 0.007256 0.136114 0.775670 0.665000 0.840027 0.991058
7 0.719344 0.072410 0.378754 0.527760 0.205777 0.870234
8 0.255007 0.098893 0.079230 0.225225 0.490689 0.554835
9 0.481340 0.300319 0.649762 0.460897 0.488406 0.166047
print df.columns.is_unique
--output:--
True
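If this comes up repeatedly, the loop wraps naturally into a helper (my sketch, not part of the original answer):
from collections import defaultdict

def dedupe_columns(df):
    # suffix every occurrence of each name with its running count: a1, b1, c1, b2, a2, a3
    name_counts = defaultdict(int)
    new_col_names = []
    for name in df.columns:
        name_counts[name] += 1
        new_col_names.append("{}{}".format(name, name_counts[name]))
    df.columns = new_col_names
    return df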

In case anyone needs this in Scala:
def renameDup(header: String): String = {
  val trimmedList: List[String] = header.split(",").toList
  var seen = List[String]()
  for (item <- trimmedList) {
    var fudge = 1
    var newitem = item
    // keep bumping the suffix until the candidate no longer collides with anything seen
    while (seen.contains(newitem)) {
      fudge += 1
      newitem = item + "_" + fudge
    }
    seen = seen :+ newitem
  }
  seen.mkString(",")
}
renameDup("a,b,a,a_2,a_2,a,a_2,a_2_2")
// result: a,b,a_2,a_2_2,a_2_3,a_3,a_2_4,a_2_2_2

Here's a solution that uses pandas all the way through.
import pandas as pd
# create data frame with duplicate column names
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
df.rename({'a': 'col', 'b': 'col'}, axis=1, inplace=True)
df
---output---
col col
0 1 4
1 2 5
2 3 6
# make a new data frame of column headers and number sequentially
dfcolumns = pd.DataFrame({'name': df.columns})
dfcolumns['counter'] = dfcolumns.groupby('name').cumcount().apply(str)
# remove counter for first case (optional) and combine suffixes
dfcolumns.loc[dfcolumns.counter=='0', 'counter'] = ''
df.columns = dfcolumns['name'] + dfcolumns['counter']
df
---output---
col col1
0 1 4
1 2 5
2 3 6
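The same idea also works without the intermediate frame; a compact sketch that, like above, leaves the first occurrence unsuffixed:
cols = pd.Series(df.columns)
dups = cols.groupby(cols).cumcount()
df.columns = cols.where(dups == 0, cols + dups.astype(str))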

I ran into this problem when loading DataFrames from Oracle tables. 7stud is right that pd.read_excel() automatically designates duplicated columns with a .1 suffix, but not all of the read functions do this. One workaround is to save the DataFrame to a CSV (or Excel) file and then reload it to re-designate the duplicated columns.
data = pd.read_sql(SQL, connection)
data.to_csv(r'C:\temp\temp.csv', index=False)
data = pd.read_csv(r'C:\temp\temp.csv')
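If writing to disk is undesirable, the same round trip should also work through an in-memory buffer (a sketch, assuming the same data variable):
import io

buf = io.StringIO()
data.to_csv(buf, index=False)
buf.seek(0)
data = pd.read_csv(buf)  # duplicate names come back suffixed, e.g. 'a' and 'a.1'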

Related

pandas loop to run multiple cross tabs

I have the following dataset and I want to compute all possible combinations of cross-tabulations in the most efficient way. I have been able to calculate pairs against one master variable, but not for all possibilities (I have included what I mean below). Is there a way to get this all in a loop that could handle any number of columns? Thanks so much!
data
import pandas as pd
df1 = pd.DataFrame(data={'id': [1,2,3,4,5,6,7,8,9,10],
                         'a': [1,1,2,2,2,1,1,2,1,1],
                         'b': [1,2,3,3,3,2,1,2,3,1],
                         'c': [1,2,2,1,1,1,1,2,1,2],
                         'd': [1,1,2,2,1,1,1,1,1,2],
                         })
d1={1: 'right', 2: 'left'}
d2={1: '10', 2: '30', 3: '20'}
d3={1: 'green', 2: 'red'}
d4={1: 'yes', 2: 'no'}
df1['a']=df1['a'].map(d1).fillna('Other')
df1['b']=df1['b'].map(d2).fillna('Other')
df1['c']=df1['c'].map(d3).fillna('Other')
df1['d']=df1['d'].map(d4).fillna('Other')
combinations
pd.crosstab(df1.a, df1.b)
pd.crosstab(df1.a, df1.c)
pd.crosstab(df1.a, df1.d)
pd.crosstab(df1.b, df1.c)
pd.crosstab(df1.b, df1.d)
pd.crosstab(df1.c, df1.d)
pd.crosstab(df1.a, [df1.b, df1.c])
pd.crosstab(df1.a, [df1.b, df1.d])
pd.crosstab(df1.a, [df1.c, df1.d])
pd.crosstab(df1.a, [df1.b, df1.c, df1.d])
what I have so far
def cross_tab(data_frame, id_col):
    col_names = ['b', 'c', 'd']
    datasets = {}
    for i in col_names:
        datasets['crosstab_{}'.format(i)] = pd.crosstab(data_frame[id_col], data_frame[i])
    return datasets
cross_tab(df1, 'a')
EDIT
A slightly edited request, now separate from the cross-tabulation: split the output based on whether a table includes a specific value. In this case, dfs with a value of 100 (a) should be stored in a separate list from the rest (b and c).
data
import pandas as pd
import numpy as np
df1 = pd.DataFrame(data={
    'a': [1,1,1,1],
    'b': [1,1,2,1],
    'c': [1,2,2,1]
})
d1={0: 'right', 1: 'left'}
d2={1: 'yes', 2: 'no'}
d3={1: 'up', 2: 'down', 3: 'sideways'}
#d4={1: 'yes', 2: 'no'}
df1['a']=df1['a'].map(d1).fillna('Other')
df1['b']=df1['b'].map(d2).fillna('Other')
df1['c']=df1['c'].map(d3).fillna('Other')
solved (I think)
from collections import defaultdict

def split_cross_tabs(dataframe, cols, contain_val):
    datasets = defaultdict(dict)
    for x in dataframe:
        p = dataframe[x].value_counts(normalize=True) * 100
        datasets['y' if p.eq(contain_val).any() else 'n']['crosstab_{}'.format(x)] = p
    return datasets
output
defaultdict(dict,
{'y': {'crosstab_a': left 100.0
Name: a, dtype: float64},
'n': {'crosstab_b': yes 75.0
no 25.0
Name: b, dtype: float64,
'crosstab_c': down 50.0
up 50.0
Name: c, dtype: float64}})
Try with the itertools recipe for a powerset and modify to only keep combinations of length 2 or greater:
from itertools import chain, combinations

def all_cross_tabs(dataframe, cols):
    datasets = {}
    for s in chain.from_iterable(
        combinations(cols, r) for r in range(2, len(cols) + 1)
    ):
        datasets[f'crosstab_{"_".join(s)}'] = pd.crosstab(
            dataframe[s[0]],
            [dataframe[c] for c in s[1:]]
        )
    return datasets
Sample:
d = all_cross_tabs(df1, ['a', 'b', 'c', 'd'])
d.keys():
dict_keys(['crosstab_a_b', 'crosstab_a_c', 'crosstab_a_d', 'crosstab_b_c',
'crosstab_b_d', 'crosstab_c_d', 'crosstab_a_b_c', 'crosstab_a_b_d',
'crosstab_a_c_d', 'crosstab_b_c_d', 'crosstab_a_b_c_d'])
d['crosstab_a_b']:
b 10 20 30
a
left 0 3 1
right 3 1 2
d['crosstab_a_b_c']:
b 10 20 30
c green red green red green red
a
left 0 0 2 1 0 1
right 2 1 1 0 1 1
d['crosstab_a_b_c_d']:
b 10 20 30
c green red green red green red
d yes no no yes no yes yes
a
left 0 0 1 1 1 0 1
right 2 1 0 1 0 1 1
Edit: Split into two sections based on contain_val
def split_cross_tabs(dataframe, cols, contain_val):
    datasets = defaultdict(dict)
    for s in chain.from_iterable(
        combinations(cols, r) for r in range(2, len(cols) + 1)
    ):
        ct_df = pd.crosstab(
            dataframe[s[0]],
            [dataframe[c] for c in s[1:]]
        )
        datasets[
            'y' if ct_df.eq(contain_val).any().any() else 'n'
        ][f'crosstab_{"_".join(s)}'] = ct_df
    return datasets
d = split_cross_tabs(df1, ['a', 'b', 'c', 'd'], 3)
d.keys():
dict_keys(['y', 'n'])
list(map(lambda a: a.keys(), d.values())):
[dict_keys(['crosstab_a_b', 'crosstab_b_c', 'crosstab_b_d']),
dict_keys(['crosstab_a_c', 'crosstab_a_d', 'crosstab_c_d', 'crosstab_a_b_c',
'crosstab_a_b_d', 'crosstab_a_c_d', 'crosstab_b_c_d',
'crosstab_a_b_c_d'])]

Calculate average of column x if column y meets criteria, for each y

How do I retrieve the value of column Z and its average if any value is > 1?
import pandas as pd
import numpy as np

data = [9,2,3,4,5,6,7,8]
df = pd.DataFrame(np.random.randn(8, 5), columns=['A', 'B', 'C', 'D', 'E'])
fd = pd.DataFrame(data, columns=['Z'])
df = pd.concat([df, fd], axis=1)
l = []
for x, y in df.iterrows():
    for i, s in y.iteritems():
        if s > 1:
            l.append(x)
print(df['Z'])
The expected output will most likely be a dictionary with the column name as key and the average of Z as its values.
Using a dictionary comprehension:
res = {col: df.loc[df[col] > 1, 'Z'].mean() for col in df.columns[:-1]}
# {'A': 9.0, 'B': 5.0, 'C': 8.0, 'D': 7.5, 'E': 6.666666666666667}
Setup used for above:
np.random.seed(0)
data = [9,2,3,4,5,6,7,8]
df = pd.DataFrame(np.random.randn(8, 5),columns=['A', 'B', 'C', 'D','E'])
fd = pd.DataFrame(data, columns=['Z'])
df = pd.concat([df, fd], axis=1)
Do you mean this?
df[df['Z']>1].loc[:,'Z'].mean(axis=0)
or
df[df['Z']>1]['Z'].mean()
I don't know if I understood your question correctly, but do you mean this:
import pandas as pd
import numpy as np
data=[9,2,3,4,5,6,7,8]
columns = ['A', 'B', 'C', 'D','E']
df = pd.DataFrame(np.random.randn(8, 5),columns=columns)
fd=pd.DataFrame(data,columns=['Z'])
df=pd.concat([df,fd], axis=1)
print('df = \n', str(df))
anyGreaterThanOne = (df[columns] > 1).any(axis=1)
print('anyGreaterThanOne = \n', str(anyGreaterThanOne))
filtered = df[anyGreaterThanOne]
print('filtered = \n', str(filtered))
Zmean = filtered['Z'].mean()
print('Zmean = ', str(Zmean))
Result:
df =
A B C D E Z
0 -2.170640 -2.626985 -0.817407 -0.389833 0.862373 9
1 -0.372144 -0.375271 -1.309273 -1.019846 -0.548244 2
2 0.267983 -0.680144 0.304727 0.302952 -0.597647 3
3 0.243549 1.046297 0.647842 1.188530 0.640133 4
4 -0.116007 1.090770 0.510190 -1.310732 0.546881 5
5 -1.135545 -1.738466 -1.148341 0.764914 -1.140543 6
6 -2.078396 0.057462 -0.737875 -0.817707 0.570017 7
7 0.187877 0.363962 0.637949 -0.875372 -1.105744 8
anyGreaterThanOne =
0 False
1 False
2 False
3 True
4 True
5 False
6 False
7 False
dtype: bool
filtered =
A B C D E Z
3 0.243549 1.046297 0.647842 1.188530 0.640133 4
4 -0.116007 1.090770 0.510190 -1.310732 0.546881 5
Zmean = 4.5

pandas data frame sort

I have a pandas dataframe like this which I try to sort by the column 'dist'. The sorted dataframe should start with E or F, as shown below. I use sort_values but it is not working for me. The function computes distances from the 'Start' location to a list of locations ['C', 'B', 'D', 'E', 'A', 'F'] and is then supposed to sort the dataframe in ascending order by the 'dist' column.
Could someone advise me why the sorting is not working?
locations = {'Start':(20,5),'A':(10,3), 'B':(5,3), 'C':(5, 7), 'D':(10,7),'E':(14,4),'F':(14,6)}
loc_list
Out[194]: ['C', 'B', 'D', 'E', 'A', 'F']
from math import hypot

def closest_locations(from_loc_point, to_loc_list):
    lresults = list()
    for list_index in range(len(to_loc_list)):
        # distance from the start location to this location
        dist = hypot(locations[from_loc_point[0]][0] - locations[to_loc_list[list_index]][0],
                     locations[from_loc_point[0]][1] - locations[to_loc_list[list_index]][1])
        lista_dist = [from_loc_point[0], to_loc_list[list_index], dist]
        lresults.append(lista_dist[:])
    RESULTS = pd.DataFrame(np.array(lresults))
    RESULTS.columns = ['from', 'to', 'dist']
    RESULTS.sort_values(['dist'], ascending=[True], inplace=True)
    RESULTS.index = range(len(RESULTS))
    return RESULTS
closest_locations(['Start'], loc_list)
Out[189]:
from to dist
0 Start D 10.19803902718557
1 Start A 10.19803902718557
2 Start C 15.132745950421555
3 Start B 15.132745950421555
4 Start E 6.08276253029822
5 Start F 6.08276253029822
closest_two_loc.dtypes
Out[247]:
from object
to object
dist object
dtype: object
Is this what you want?
locations = {'Start':(20,5),'A':(10,3), 'B':(5,3), 'C':(5, 7), 'D':(10,7),'E':(14,4),'F':(14,6)}
df= pd.DataFrame.from_dict(locations, orient='index').rename(columns={0:'x', 1:'y'})
df['dist'] = df.apply(lambda row: pd.np.sqrt((row['x'] - df.loc['Start', 'x'])**2 + (row['y'] - df.loc['Start', 'y'])**2), axis=1)
df.drop(['Start']).sort_values(by='dist')
x y dist
E 14 4 6.082763
F 14 6 6.082763
A 10 3 10.198039
D 10 7 10.198039
C 5 7 15.132746
B 5 3 15.132746
or if you want to wrap it in a function
def dist_from(df, col):
    df['dist'] = df.apply(lambda row: pd.np.sqrt((row['x'] - df.loc[col, 'x'])**2 + (row['y'] - df.loc[col, 'y'])**2), axis=1)
    df['from'] = col
    df = df.drop([col]).sort_values(by='dist')
    df.index.name = 'to'
    return df.reset_index().loc[:, ['from', 'to', 'dist']]
You need to convert values in "dist" column to float:
df = closest_locations(['Start'], loc_list)
df.dist = list(map(float, df.dist))  # convert each value to float
print(df.sort_values('dist')) # now it will sort properly
Output:
from to dist
4 Start E 6.082763
5 Start F 6.082763
0 Start D 10.198039
1 Start A 10.198039
2 Start C 15.132746
3 Start B 15.132746
Edit: As mentioned by #jezrael in comments, following is a more direct method:
df.dist = df.dist.astype(float)
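For what it's worth, the object dtype comes from the np.array(lresults) call inside the function: a NumPy array mixing strings and floats is coerced entirely to strings. A sketch of fixing it at the source, building the frame straight from the list of lists:
RESULTS = pd.DataFrame(lresults, columns=['from', 'to', 'dist'])
# 'dist' stays float64, so sort_values orders it numerically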

create multiple sub-dataframes in pandas/python [duplicate]

This question already has answers here:
Split pandas dataframe based on groupby
(4 answers)
Closed 5 years ago.
I have a data-frame which looks similar to this (has about 300k rows):
df = pd.DataFrame(dict(name = ['jon', 'jon', 'dany', 'dany', 'mindy', 'mindy', 'mindy'],
                       power = [1, 2, 2, 4, 5, 5, 7],
                       rank = ['a', 'b', 'c', 'd', 'r', 'a', 'g']))
which gives this:
What I want is a list of data-frames (subsets) like this:
df_list = [df_1, df_2, df_3]
where df_1, df_2, df_3 are essentially these:
df_1 = df.query("name == 'jon'")
df_2 = df.query("name == 'dany'")
df_3 = df.query("name == 'mindy'")
In the dataset that I'm working with, there are about 500+ names. So how do I efficiently do this?
Here's one way.
In [1497]: df_list = [x[1] for x in df.groupby('name', sort=False)]
In [1498]: df_list[0]
Out[1498]:
name power rank
0 jon 1 a
1 jon 2 b
In [1499]: df_list[1]
Out[1499]:
name power rank
2 dany 2 c
3 dany 4 d
In [1500]: df_list[2]
Out[1500]:
name power rank
4 mindy 5 r
5 mindy 5 a
6 mindy 7 g
But it's better to store them as a dict:
In [1501]: {g: v for g, v in df.groupby('name', sort=False)}
Out[1501]:
{'dany': name power rank
2 dany 2 c
3 dany 4 d, 'jon': name power rank
0 jon 1 a
1 jon 2 b, 'mindy': name power rank
4 mindy 5 r
5 mindy 5 a
6 mindy 7 g}
In [1502]: df_dict = {g: v for g, v in df.groupby('name', sort=False)}
In [1503]: df_dict['jon']
Out[1503]:
name power rank
0 jon 1 a
1 jon 2 b
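An equivalent shorthand for building that dict, if you prefer (same groupby underneath):
In [1504]: df_dict = dict(tuple(df.groupby('name', sort=False)))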
You can try doing this
import pandas as pd
df = pd.DataFrame(dict(name = ['jon', 'jon', 'dany', 'dany', 'mindy', 'mindy', 'mindy'],
                       power = [1, 2, 2, 4, 5, 5, 7],
                       rank = ['a', 'b', 'c', 'd', 'r', 'a', 'g']))
dfs = []
for each in df.name.unique():
    dfs.append(df.loc[df.name == each, :])
Alternatively, you can use numpy to do this -
import numpy as np
dfs2 = []
array = df.values
for each in np.unique(array[:, 0]):
    dfs2.append(pd.DataFrame(array[array[:, 0] == each, :]))
Speed comparison between the above two methods -
import pandas as pd
import numpy as np
from time import time
df = pd.DataFrame(dict(name = ['jon', 'jon', 'dany', 'dany', 'mindy', 'mindy', 'mindy'],
                       power = [1, 2, 2, 4, 5, 5, 7],
                       rank = ['a', 'b', 'c', 'd', 'r', 'a', 'g']))
t0 = time()
dfs = []
for each in df.name.unique():
    dfs.append(df.loc[df.name == each, :])
t1 = time()
dfs2 = []
array = df.values
for each in np.unique(array[:, 0]):
    dfs2.append(pd.DataFrame(array[array[:, 0] == each, :]))
t2 = time()
t1 - t0 #0.003524303436279297
t2 - t1 #0.0016787052154541016
Numpy is faster and can be helpful in your case since you have a large dataset.
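One caveat (my note, not part of the timing above): df.values coerces every column to a common dtype and the rebuilt frames lose their column labels, so you may want to pass them back in:
dfs2.append(pd.DataFrame(array[array[:, 0] == each, :], columns=df.columns))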

Count frequency of values in pandas DataFrame column

I want to count the number of times each value appears in the dataframe.
Here is my dataframe - df:
status
1 N
2 N
3 C
4 N
5 S
6 N
7 N
8 S
9 N
10 N
11 N
12 S
13 N
14 C
15 N
16 N
17 N
18 N
19 S
20 N
I want a dictionary of counts:
ex. counts = {N: 14, C: 2, S: 4}
I have tried df['status']['N'] but it gives a KeyError, and also df['status'].value_counts, but with no luck.
You can use value_counts and to_dict:
print df['status'].value_counts()
N 14
S 4
C 2
Name: status, dtype: int64
counts = df['status'].value_counts().to_dict()
print counts
{'S': 4, 'C': 2, 'N': 14}
An alternative one liner using underdog Counter:
In [3]: from collections import Counter
In [4]: dict(Counter(df.status))
Out[4]: {'C': 2, 'N': 14, 'S': 4}
You can try this way.
df.stack().value_counts().to_dict()
Can you convert df into a list?
If so:
a = ['a', 'a', 'a', 'b', 'b', 'c']
c = dict()
for i in set(a):
    c[i] = a.count(i)
Using a dict comprehension:
c = {i: a.count(i) for i in set(a)}
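Assuming the status column from the question, the conversion step would look like (my sketch):
a = df['status'].tolist()
counts = {i: a.count(i) for i in set(a)}
#>>> {'N': 14, 'S': 4, 'C': 2}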
See my response in this thread for a Pandas DataFrame output,
count the frequency that a value occurs in a dataframe column
For dictionary output, you can modify as follows:
def column_list_dict(x):
    column_list_df = []
    for col_name in x.columns:
        y = col_name, len(x[col_name].unique())
        column_list_df.append(y)
    return dict(column_list_df)
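Note that this counts distinct values per column rather than how often each value occurs. For a per-column frequency dictionary, a one-line sketch:
freqs = {col: x[col].value_counts().to_dict() for col in x.columns}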
