pandas data frame sort - python

I have a pandas dataframe that I am trying to sort by the column 'dist'. The sorted dataframe should start with E or F, as shown below. I use sort_values, but it is not working for me. The function computes distances from the 'Start' location to a list of locations ['C', 'B', 'D', 'E', 'A', 'F'] and is then supposed to sort the dataframe in ascending order by the 'dist' column.
Could someone advise me why the sorting is not working?
locations = {'Start':(20,5),'A':(10,3), 'B':(5,3), 'C':(5, 7), 'D':(10,7),'E':(14,4),'F':(14,6)}
loc_list
Out[194]: ['C', 'B', 'D', 'E', 'A', 'F']
from math import hypot
import numpy as np
import pandas as pd

def closest_locations(from_loc_point, to_loc_list):
    lresults = list()
    for list_index in range(len(to_loc_list)):
        dist = hypot(locations[from_loc_point[0]][0] - locations[to_loc_list[list_index]][0],
                     locations[from_loc_point[0]][1] - locations[to_loc_list[list_index]][1])  # Euclidean distance
        lista_dist = [from_loc_point[0], to_loc_list[list_index], dist]
        lresults.append(lista_dist[:])
    RESULTS = pd.DataFrame(np.array(lresults))
    RESULTS.columns = ['from', 'to', 'dist']
    RESULTS.sort_values(['dist'], ascending=[True], inplace=True)
    RESULTS.index = range(len(RESULTS))
    return RESULTS
closest_locations(['Start'], loc_list)
Out[189]:
    from to                dist
0  Start  D   10.19803902718557
1  Start  A   10.19803902718557
2  Start  C  15.132745950421555
3  Start  B  15.132745950421555
4  Start  E    6.08276253029822
5  Start  F    6.08276253029822
closest_two_loc.dtypes
Out[247]:
from    object
to      object
dist    object
dtype: object

Is this what you want?
locations = {'Start':(20,5),'A':(10,3), 'B':(5,3), 'C':(5, 7), 'D':(10,7),'E':(14,4),'F':(14,6)}
import numpy as np
import pandas as pd

df = pd.DataFrame.from_dict(locations, orient='index').rename(columns={0: 'x', 1: 'y'})
df['dist'] = df.apply(lambda row: np.sqrt((row['x'] - df.loc['Start', 'x'])**2
                                          + (row['y'] - df.loc['Start', 'y'])**2), axis=1)
df.drop(['Start']).sort_values(by='dist')
    x  y       dist
E  14  4   6.082763
F  14  6   6.082763
A  10  3  10.198039
D  10  7  10.198039
C   5  7  15.132746
B   5  3  15.132746
Or, if you want to wrap it in a function:
def dist_from(df, col):
    df['dist'] = df.apply(lambda row: np.sqrt((row['x'] - df.loc[col, 'x'])**2
                                              + (row['y'] - df.loc[col, 'y'])**2), axis=1)
    df['from'] = col
    df = df.drop([col]).sort_values(by='dist')
    df.index.name = 'to'
    return df.reset_index().loc[:, ['from', 'to', 'dist']]
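A quick usage sketch (my addition, assuming the df built from locations above; the .copy() keeps the original frame intact):
print(dist_from(df.copy(), 'Start'))
# expected result: columns ['from', 'to', 'dist'], sorted ascending,
# so E and F (dist ~6.08) come first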

You need to convert values in "dist" column to float:
df = closest_locations(['Start'], loc_list)
df.dist = list(map(lambda x: float(x), df.dist)) # convert each value to float
print(df.sort_values('dist')) # now it will sort properly
Output:
    from to       dist
4  Start  E   6.082763
5  Start  F   6.082763
0  Start  D  10.198039
1  Start  A  10.198039
2  Start  C  15.132746
3  Start  B  15.132746
Edit: As mentioned by @jezrael in the comments, the following is a more direct method:
df.dist = df.dist.astype(float)
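For completeness, the dtype problem can also be fixed at its source (my note, not part of the answers above): np.array(lresults) coerces the mixed string/float rows to a single string dtype, which is why 'dist' comes out as object and sorts lexicographically. Passing the list of lists straight to the DataFrame constructor lets pandas infer float64 for 'dist':
RESULTS = pd.DataFrame(lresults, columns=['from', 'to', 'dist'])
print(RESULTS.dtypes)  # 'dist' is float64, so sort_values orders numerically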


Verify if cell is the same and build an Excel file

I have some measurements (as dicts) and a list with labels. I need to verify whether the labels are in my measurements and write the result to an Excel file.
My output Excel file needs to look like this:
list1 = ['A', 'B', 'C', 'D']
measurement1 = {'A':1, 'B':1}
measurement2 = {'C':3, 'D':4}
#Output
              'A' 'B' 'C' 'D'
measurement1   1   1   0   0
measurement2   0   0   1   1
I have no idea how to build the matrix of 0s and 1s.
I hope you can help me.
EDIT
Finally I got a solution. First I collected all measurements into a dict named measurements, keyed by measurement name. Then I built a DataFrame of zeros and set ones with .loc at every position where a label from list1 occurs in a measurement:
d = pd.DataFrame(0, index=measurements.keys(), columns=list1)
for y in measurements.keys():
    for z in measurements[y]:
        for x in list1:
            if x == z:
                d.loc[y, z] = 1
Maybe it's possible to do it with only 2 loops.
Use a nested list comprehension with filtering to check membership in list1, and then create the DataFrame with the constructor:
list1 = ['A', 'B', 'C', 'D']
measurement1 = {'A':1, 'B':1}
measurement2 = {'C':3, 'D':4}
L = [measurement1, measurement2]
d = [dict.fromkeys([y for y in x.keys() if y in list1], 1) for x in L]
df = pd.DataFrame(d).fillna(0).astype(int)
print (df)
   A  B  C  D
0  1  1  0  0
1  0  0  1  1
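One hedged caveat I'd add: if list1 can contain a label that appears in no measurement at all, that column will be missing from the constructed frame. A reindex covers that case (my variation on the line above):
df = pd.DataFrame(d).fillna(0).astype(int).reindex(columns=list1, fill_value=0)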
This should work, using only standard Python:
list1 = ['A', 'B', 'C', 'D']
measurement1 = {'A':1, 'B':1}
measurement2 = {'C':3, 'D':4}
measurements = [measurement1, measurement2]
headers = { h: i for i, h in enumerate(list1) }
matrix = []
for measurement in measurements:
    row = [0] * len(headers)
    for header in measurement.keys():
        row[headers[header]] = 1
    matrix.append(row)
For your example, the output will be:
matrix
=> [[1, 1, 0, 0], [0, 0, 1, 1]]
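If you ultimately want the labelled table from the question, the matrix drops straight into a DataFrame (hypothetical row labels assumed here):
df = pd.DataFrame(matrix, columns=list1, index=['measurement1', 'measurement2'])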
You can use a list of the dictionaries to create a DataFrame, then reindex with the list and convert to 0/1 by checking notna:
pd.DataFrame([measurement1,measurement2]).reindex(columns=list1).notna().astype(int)
   A  B  C  D
0  1  1  0  0
1  0  0  1  1

Calculate average of column x if column y meets criteria, for each y

How do I retrieve the values of column Z and their average, for each column, considering only the rows where that column's value is > 1?
data=[9,2,3,4,5,6,7,8]
df = pd.DataFrame(np.random.randn(8, 5),columns=['A', 'B', 'C', 'D','E'])
fd=pd.DataFrame(data,columns=['Z'])
df=pd.concat([df,fd], axis=1)
l = []
for x, y in df.iterrows():
    for i, s in y.items():
        if s > 1:
            l.append(x)
print(df['Z'])
The expected output would most likely be a dictionary with the column name as key and the average of Z as its value.
Using a dictionary comprehension:
res = {col: df.loc[df[col] > 1, 'Z'].mean() for col in df.columns[:-1]}
# {'A': 9.0, 'B': 5.0, 'C': 8.0, 'D': 7.5, 'E': 6.666666666666667}
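The same computation can also be expressed as a single apply over a boolean mask, returning a Series instead of a dict (my variation, not part of the answer above):
res = df.drop(columns='Z').gt(1).apply(lambda m: df.loc[m, 'Z'].mean())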
Setup used for above:
np.random.seed(0)
data = [9,2,3,4,5,6,7,8]
df = pd.DataFrame(np.random.randn(8, 5),columns=['A', 'B', 'C', 'D','E'])
fd = pd.DataFrame(data, columns=['Z'])
df = pd.concat([df, fd], axis=1)
Do you mean this?
df[df['Z']>1].loc[:,'Z'].mean(axis=0)
or
df[df['Z']>1]['Z'].mean()
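A small hedged note on the second form: chained indexing like df[df['Z']>1]['Z'] is fine for reading, but a single .loc is the idiomatic spelling and avoids SettingWithCopy warnings if you ever assign:
df.loc[df['Z'] > 1, 'Z'].mean()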
I don't know if I understood your question correctly, but do you mean this:
import pandas as pd
import numpy as np
data=[9,2,3,4,5,6,7,8]
columns = ['A', 'B', 'C', 'D','E']
df = pd.DataFrame(np.random.randn(8, 5),columns=columns)
fd=pd.DataFrame(data,columns=['Z'])
df=pd.concat([df,fd], axis=1)
print('df = \n', str(df))
anyGreaterThanOne = (df[columns] > 1).any(axis=1)
print('anyGreaterThanOne = \n', str(anyGreaterThanOne))
filtered = df[anyGreaterThanOne]
print('filtered = \n', str(filtered))
Zmean = filtered['Z'].mean()
print('Zmean = ', str(Zmean))
Result:
df =
          A         B         C         D         E  Z
0 -2.170640 -2.626985 -0.817407 -0.389833  0.862373  9
1 -0.372144 -0.375271 -1.309273 -1.019846 -0.548244  2
2  0.267983 -0.680144  0.304727  0.302952 -0.597647  3
3  0.243549  1.046297  0.647842  1.188530  0.640133  4
4 -0.116007  1.090770  0.510190 -1.310732  0.546881  5
5 -1.135545 -1.738466 -1.148341  0.764914 -1.140543  6
6 -2.078396  0.057462 -0.737875 -0.817707  0.570017  7
7  0.187877  0.363962  0.637949 -0.875372 -1.105744  8
anyGreaterThanOne =
0    False
1    False
2    False
3     True
4     True
5    False
6    False
7    False
dtype: bool
filtered =
          A         B         C         D         E  Z
3  0.243549  1.046297  0.647842  1.188530  0.640133  4
4 -0.116007  1.090770  0.510190 -1.310732  0.546881  5
Zmean = 4.5

Pandas Apply Function That returns two new columns

I have a pandas dataframe that I would like to use an apply function on to generate two new columns based on the existing data. I am getting this error:
ValueError: Wrong number of items passed 2, placement implies 1
import pandas as pd
import numpy as np
def myfunc1(row):
    C = row['A'] + 10
    D = row['A'] + 50
    return [C, D]

df = pd.DataFrame(np.random.randint(0, 10, size=(2, 2)), columns=list('AB'))
df['C', 'D'] = df.apply(myfunc1, axis=1)
Starting DF:
   A  B
0  6  1
1  8  4
Desired DF:
   A  B   C   D
0  6  1  16  56
1  8  4  18  58
Based on your latest error, you can avoid it by returning the new columns as a Series:
def myfunc1(row):
    C = row['A'] + 10
    D = row['A'] + 50
    return pd.Series([C, D])

df[['C', 'D']] = df.apply(myfunc1, axis=1)
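A small variation I'd add (not from the answer itself): returning a named Series lets apply produce labelled columns directly, which you can then join back without the double-bracket assignment:
def myfunc2(row):
    # hypothetical variant of myfunc1; the dict keys become the column names
    return pd.Series({'C': row['A'] + 10, 'D': row['A'] + 50})

df = df.join(df.apply(myfunc2, axis=1))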
Please be aware of the huge memory consumption and low speed of the accepted answer: https://ys-l.github.io/posts/2015/08/28/how-not-to-use-pandas-apply/
Using the suggestion presented there, the answer would look like this:
def run_loopy(df):
    Cs, Ds = [], []
    for _, row in df.iterrows():
        c, d = myfunc1(row['A'])
        Cs.append(c)
        Ds.append(d)
    return pd.Series({'C': Cs,
                      'D': Ds})

def myfunc1(a):
    c = a + 10
    d = a + 50
    return c, d

df[['C', 'D']] = run_loopy(df)
It works for me:
def myfunc1(row):
    C = row['A'] + 10
    D = row['A'] + 50
    return C, D

df = pd.DataFrame(np.random.randint(0, 10, size=(2, 2)), columns=list('AB'))
df[['C', 'D']] = df.apply(myfunc1, axis=1, result_type='expand')
df
The key addition is result_type='expand'.
df['C','D'] is treated as a single column (named with the tuple ('C', 'D')) rather than as two columns. To assign two columns you need a list selection, so use df[['C','D']]:
df[['C', 'D']] = df.apply(myfunc1, axis=1)
   A  B   C   D
0  4  6  14  54
1  5  1  15  55
Or you can use chained assignment, unpacking the row results with zip:
df['C'], df['D'] = zip(*df.apply(myfunc1, axis=1))
I believe you can achieve results similar to @Federico Dorato's answer without the use of a for loop. Return a list rather than a Series and use a lambda apply plus to_list() to expand the results.
It's cleaner code, and on a random df of 10,000,000 rows it performs as well or faster.
Federico's code
import time
import numpy as np
import pandas as pd

run_time = []
for i in range(0, 25):
    df = pd.DataFrame(np.random.randint(0, 10000000, size=(2, 2)), columns=list('AB'))

    def run_loopy(df):
        Cs, Ds = [], []
        for _, row in df.iterrows():
            c, d = myfunc1(row['A'])
            Cs.append(c)
            Ds.append(d)
        return pd.Series({'C': Cs,
                          'D': Ds})

    def myfunc1(a):
        c = a / 10
        d = a + 50
        return c, d

    start = time.time()
    df[['C', 'D']] = run_loopy(df)
    end = time.time()
    run_time.append(end - start)

print(np.average(run_time))  # 0.001240386962890625
Using lambda and to_list
run_time = []
for i in range(0, 25):
    df = pd.DataFrame(np.random.randint(0, 10000000, size=(2, 2)), columns=list('AB'))

    def myfunc1(a):
        c = a / 10
        d = a + 50
        return [c, d]

    start = time.time()
    df[['C', 'D']] = df['A'].apply(lambda x: myfunc1(x)).to_list()
    end = time.time()
    run_time.append(end - start)

print(np.average(run_time))  # output 0.0009996891021728516
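For purely arithmetic functions like myfunc1, a hedged aside of my own (not part of either benchmark above): you can skip apply entirely and let pandas vectorize the arithmetic over whole columns, which is usually faster still on large frames:
df['C'] = df['A'] / 10
df['D'] = df['A'] + 50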
Add extra brackets when assigning to multiple columns:
import pandas as pd
import numpy as np
def myfunc1(row):
    C = row['A'] + 10
    D = row['A'] + 50
    return [C, D]

df = pd.DataFrame(np.random.randint(0, 10, size=(2, 2)), columns=list('AB'))
df[['C', 'D']] = df.apply(myfunc1, axis=1)

Pandas assign label based on index value

I have a dataframe with an index and multiple columns. Secondly, I have a few lists containing index values sampled on certain criteria. Now I want to create columns with labels, based on whether the index of a given row is present in a specified list.
Now there are two situations where I am using it:
1) To create a column and give labels based on one list:
df['1_name'] = df.index.map(lambda ix: 'A' if ix in idx_1_model else 'B')
2) To create a column and give labels based on multiple lists:
def assignLabelsToSplit(ix_, random_m, random_y, model_m, model_y):
    if (ix_ in random_m) or (ix_ in model_m):
        return 'A'
    if (ix_ in random_y) or (ix_ in model_y):
        return 'B'
    else:
        return 'not_assigned'

df['2_name'] = df.index.map(lambda ix: assignLabelsToSplit(ix, idx_2_random_m, idx_2_random_y, idx_2_model_m, idx_2_model_y))
This works, but it is quite slow. Each call takes about 3 minutes, and considering I have to execute the functions multiple times, it needs to be faster.
Thank you for any suggestions.
I think you need a double numpy.where with Index.isin:
df['2_name'] = np.where(df.index.isin(random_m + model_m), 'A',
               np.where(df.index.isin(random_y + model_y), 'B', 'not_assigned'))
Sample:
np.random.seed(100)
df = pd.DataFrame(np.random.randint(10, size=(10,1)), columns=['A'])
#print (df)
random_m = [0,1]
random_y = [2,3]
model_m = [7,4]
model_y = [5,6]
print (type(random_m))
<class 'list'>
print (random_m + model_m)
[0, 1, 7, 4]
print (random_y + model_y)
[2, 3, 5, 6]
df['2_name'] = np.where(df.index.isin(random_m + model_m), 'A',
               np.where(df.index.isin(random_y + model_y), 'B', 'not_assigned'))
print (df)
   A        2_name
0  8             A
1  8             A
2  3             B
3  7             B
4  7             A
5  0             B
6  4             B
7  2             A
8  5  not_assigned
9  2  not_assigned
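Two hedged footnotes of my own, not from the answer: part of why the original map is slow is that ix in some_list is an O(n) scan per row, whereas Index.isin hashes the values once; and once there are several condition/label pairs, np.select reads more cleanly than nested np.where, with the same result:
conditions = [df.index.isin(random_m + model_m),
              df.index.isin(random_y + model_y)]
df['2_name'] = np.select(conditions, ['A', 'B'], default='not_assigned')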

Automatically rename columns to ensure they are unique

I fetch a spreadsheet into a Python DataFrame named df.
Let's give a sample:
df=pd.DataFrame({'a': np.random.rand(10), 'b': np.random.rand(10)})
df.columns=['a','a']
          a         a
0  0.973858  0.036459
1  0.835112  0.947461
2  0.520322  0.593110
3  0.480624  0.047711
4  0.643448  0.104433
5  0.961639  0.840359
6  0.848124  0.437380
7  0.579651  0.257770
8  0.919173  0.785614
9  0.505613  0.362737
When I run df.columns.is_unique I get False.
I would like to automatically rename the column 'a' to 'a_2' (or something like that).
I don't expect a solution like df.columns=['a','a_2'];
I am looking for a solution that works for any number of columns!
You can uniquify the columns manually:
df_columns = ['a', 'b', 'a', 'a_2', 'a_2', 'a', 'a_2', 'a_2_2']

def uniquify(df_columns):
    seen = set()
    for item in df_columns:
        fudge = 1
        newitem = item
        while newitem in seen:
            fudge += 1
            newitem = "{}_{}".format(item, fudge)
        yield newitem
        seen.add(newitem)

list(uniquify(df_columns))
#>>> ['a', 'b', 'a_2', 'a_2_2', 'a_2_3', 'a_3', 'a_2_4', 'a_2_2_2']
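Applying it to the DataFrame from the question is then one line (hypothetical usage of the generator above):
df.columns = list(uniquify(df.columns))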
I fetch a spreadsheet into a Python DataFrame named df... I would like
to automatically rename [duplicate] column [names].
Pandas does that automatically for you without you having to do anything...
test.xls:
import pandas as pd
import numpy as np

df = pd.io.excel.read_excel(
    "./test.xls",
    "Sheet1",
    header=0,
    index_col=0,
)
print(df)
--output:--
        a    b   c  b.1  a.1  a.2
index
0      10  100 -10 -100   10   21
1      20  200 -20 -200   11   22
2      30  300 -30 -300   12   23
3      40  400 -40 -400   13   24
4      50  500 -50 -500   14   25
5      60  600 -60 -600   15   26
print(df.columns.is_unique)
--output:--
True
If for some reason you are being given a DataFrame with duplicate columns, you can do this:
import pandas as pd
import numpy as np
from collections import defaultdict

df = pd.DataFrame(
    {
        'k': np.random.rand(10),
        'l': np.random.rand(10),
        'm': np.random.rand(10),
        'n': np.random.rand(10),
        'o': np.random.rand(10),
        'p': np.random.rand(10),
    }
)
print(df)
--output:--
          k         l         m         n         o         p
0  0.566150  0.025225  0.744377  0.222350  0.800402  0.449897
1  0.701286  0.182459  0.661226  0.991143  0.793382  0.980042
2  0.383213  0.977222  0.404271  0.050061  0.839817  0.779233
3  0.428601  0.303425  0.144961  0.313716  0.244979  0.487191
4  0.187289  0.537962  0.669240  0.096126  0.242258  0.645199
5  0.508956  0.904390  0.838986  0.315681  0.359415  0.830092
6  0.007256  0.136114  0.775670  0.665000  0.840027  0.991058
7  0.719344  0.072410  0.378754  0.527760  0.205777  0.870234
8  0.255007  0.098893  0.079230  0.225225  0.490689  0.554835
9  0.481340  0.300319  0.649762  0.460897  0.488406  0.166047
df.columns = ['a', 'b', 'c', 'b', 'a', 'a']
print(df)
--output:--
          a         b         c         b         a         a
0  0.566150  0.025225  0.744377  0.222350  0.800402  0.449897
1  0.701286  0.182459  0.661226  0.991143  0.793382  0.980042
2  0.383213  0.977222  0.404271  0.050061  0.839817  0.779233
3  0.428601  0.303425  0.144961  0.313716  0.244979  0.487191
4  0.187289  0.537962  0.669240  0.096126  0.242258  0.645199
5  0.508956  0.904390  0.838986  0.315681  0.359415  0.830092
6  0.007256  0.136114  0.775670  0.665000  0.840027  0.991058
7  0.719344  0.072410  0.378754  0.527760  0.205777  0.870234
8  0.255007  0.098893  0.079230  0.225225  0.490689  0.554835
9  0.481340  0.300319  0.649762  0.460897  0.488406  0.166047
print(df.columns.is_unique)
--output:--
False
name_counts = defaultdict(int)
new_col_names = []
for name in df.columns:
    new_count = name_counts[name] + 1
    new_col_names.append("{}{}".format(name, new_count))
    name_counts[name] = new_count

print(new_col_names)
--output:--
['a1', 'b1', 'c1', 'b2', 'a2', 'a3']
df.columns = new_col_names
print(df)
--output:--
         a1        b1        c1        b2        a2        a3
0  0.264598  0.321378  0.466370  0.986725  0.580326  0.671168
1  0.938810  0.179999  0.403530  0.675112  0.279931  0.011046
2  0.935888  0.167405  0.733762  0.806580  0.392198  0.180401
3  0.218825  0.295763  0.174213  0.457533  0.234081  0.555525
4  0.891890  0.196245  0.425918  0.786676  0.791679  0.119826
5  0.721305  0.496182  0.236912  0.562977  0.249758  0.352434
6  0.433437  0.501975  0.088516  0.303067  0.916619  0.717283
7  0.026491  0.412164  0.787552  0.142190  0.665488  0.488059
8  0.729960  0.037055  0.546328  0.683137  0.134247  0.444709
9  0.391209  0.765251  0.507668  0.299963  0.348190  0.731980
print(df.columns.is_unique)
--output:--
True
In case anyone needs this in Scala:
def renameDup(header: String): String = {
  val trimmedList: List[String] = header.split(",").toList
  var fudge = 0
  var newitem = ""
  var seen = List[String]()
  for (item <- trimmedList) {
    fudge = 1
    newitem = item
    for (newitem2 <- seen) {
      if (newitem2 == newitem) {
        fudge += 1
        newitem = item + "_" + fudge
      }
    }
    seen = seen :+ newitem
  }
  return seen.mkString(",")
}
Here's a solution that uses pandas all the way through.
import pandas as pd
# create data frame with duplicate column names
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
df.rename({'a': 'col', 'b': 'col'}, axis=1, inplace=True)
df
---output---
   col  col
0    1    4
1    2    5
2    3    6
# make a new data frame of column headers and number sequentially
dfcolumns = pd.DataFrame({'name': df.columns})
dfcolumns['counter'] = dfcolumns.groupby('name').cumcount().apply(str)
# remove counter for first case (optional) and combine suffixes
dfcolumns.loc[dfcolumns.counter=='0', 'counter'] = ''
df.columns = dfcolumns['name'] + dfcolumns['counter']
df
---output---
   col  col1
0    1     4
1    2     5
2    3     6
I ran into this problem when loading DataFrames from Oracle tables. 7stud is right that pd.read_excel() automatically designates duplicated columns with a *.1, but not all of the read functions do this. One workaround is to save the DataFrame to a csv (or excel) file and then reload it so the duplicated columns get re-designated:
data = pd.read_sql(SQL, connection)
data.to_csv(r'C:\temp\temp.csv', index=False)
data = pd.read_csv(r'C:\temp\temp.csv')
