Creating a union of columns based on metrics - python

I have a dataframe-
df = pd.DataFrame({'a': [1, 2, 4], 'b': [0, 3, 5], 'c': [1, 1, 1]})
a b c
0 1 0 1
1 2 3 1
2 4 5 1
and a list [('a', 0.91), ('b', 5), ('c', 2)].
Now I want to iterate through each row, multiply each df element by the matching weight from the list, take the top 2 scores in that row, and append the corresponding column names (if not already present) to a new list.
for example in the first row we have-
1*0.91=0.91, 0*5=0, 1*2=2
therefore the top 2 columns are a and c so we append them to a new list.
second row-
2*0.91=1.82, 3*5=15, 1*2=2
therefore the top 2 are b and c, and b is new, so the list becomes [a,c,b]
and so on...
third row-
4*0.91=3.64, 5*5=25, 1*2=2
so list remains unchanged [a,c,b]
so final output is [a,c,b]

If I understand the question correctly, I think the previous answers are incomplete, so here is a full solution. It uses NumPy, which I hope is acceptable.
Create the weights:
n = [('a', 0.91), ('b', 5), ('c', 2)]
d = {a: b for a, b in n}              # tuple list -> {column: weight}
weights = [d[i] for i in df.columns]  # weights in df column order
Then we create a table with weights multiplied in:
df = pd.DataFrame({'a':[1,2,4], 'b': [0,3,5],'c':[1,1,1]})
df = df*weights
This yields:
a b c
0 0.9 0.0 2.0
1 1.8 15.0 2.0
2 3.6 25.0 2.0
Then we can get the top-two indices with NumPy:
import numpy as np

b = np.argsort(df.values, axis=1)  # column indices, sorted by value within each row
b = b[:, -2:]                      # keep the indices of the two largest per row
This yields:
array([[0, 2],
[2, 1],
[0, 1]], dtype=int64)
Finally we can recover the order of first appearance and map back to column names:
c = b.reshape(-1)                         # flatten row-wise: order of appearance
_, idx = np.unique(c, return_index=True)  # first occurrence of each column index
d = c[np.sort(idx)]                       # unique indices, in appearance order
print(list(df.columns[d].values))
This yields:
['a', 'c', 'b']

Try this:
dict1 = {'a':[1,2,4], 'b': [0,3,5],'c':[1,1,1]} # arrays must all be same length
df = pd.DataFrame(dict1)
list1 = [('a', 0.91), ('b', 5), ('c', 2)]
df2 = pd.DataFrame({k : [j*v[1] for j in dict1[k]] for k in dict1 for v in list1 if k == v[0]})
"""
df2 should be like this:
a b c
0 0.91 0 2
1 1.82 15 2
2 3.64 25 2
"""

IIUC, you need:
a = [('a', 0.91), ('b', 5), ('c', 2)]
m = df.mul(pd.DataFrame(a).set_index(0)[1])  # weights as a Series indexed by column name
a b c
0 0.91 0.0 2.0
1 1.82 15.0 2.0
2 3.64 25.0 2.0
Applying a dense rank along each row, summing per column, then sorting and reading off the index gives the desired output:
m.rank(axis=1, method='dense').sum().sort_values().index.tolist()
#['a', 'c', 'b']
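To see what the rank trick does, the intermediate steps look like this (using m from above; sorting the column sums ascending then yields ['a', 'c', 'b']):
print(m.rank(axis=1, method='dense'))
#      a    b    c
# 0  2.0  1.0  3.0
# 1  1.0  3.0  2.0
# 2  2.0  3.0  1.0
print(m.rank(axis=1, method='dense').sum())
# a    5.0
# b    7.0
# c    6.0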

Extracting rows with most frequent value

I have a dataframe with several columns, and for each "family" of individuals I want to extract the row that has the family's most frequent number ("No"). I have tested this with a for-loop that seems to work, but being a newbie I wanted to know if there is a shorter/smarter way of doing it.
Here is a short example code:
import pandas as pd
ind = [('A', 'a', 0.1, 9),
       ('B', 'b', 0.6, 10),
       ('C', 'b', 0.4, 10),
       ('D', 'b', 0.2, 7),
       ('E', 'a', 0.9, 6),
       ('F', 'b', 0.7, 11)]
df = pd.DataFrame(ind, columns=['Name', 'Family', 'Prob', 'No'])
res = pd.DataFrame(columns=df.columns)
for name, g in df.groupby('Family'):
    v = g['No'].value_counts().idxmax()  # most frequent 'No' within the family
    idx = g['No'] == v
    si = g[idx].iloc[0]                  # first row carrying that value
    res = res.append(si)
print(res)
I have looked at several examples that do part of this, but with those I can only get the "Family" and "No", not the whole row...
Here is an alternative using GroupBy.transform with mode, plus duplicated:
c = df['No'].eq(df.groupby('Family')['No'].transform(lambda x: x.mode().iat[0]))  # rows matching each family's modal 'No'
c1 = df[['Family', 'No']].duplicated()  # repeated (Family, No) pairs
output = df[c & ~c1]
Name Family Prob No
1 B b 0.6 10
4 E a 0.9 6
Use GroupBy.transform with the first mode, then filter, and finally remove duplicates with DataFrame.drop_duplicates:
df1 = (df[df.groupby('Family')['No'].transform(lambda x: x.mode().iat[0]).eq(df['No'])]
         .drop_duplicates(['Family', 'No']))
print(df1)
Name Family Prob No
1 B b 0.6 10
4 E a 0.9 6
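As an addendum, the question's own loop can be written more compactly with groupby/apply (a sketch; it uses mode like the answers above, so a tie resolves to the smallest value):
res = (df.groupby('Family', group_keys=False)
         .apply(lambda g: g[g['No'] == g['No'].mode().iat[0]].head(1)))
print(res)
#   Name Family  Prob  No
# 4    E      a   0.9   6
# 1    B      b   0.6  10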

Pandas conditionally creating a new dataframe using another

I have a list:
orig= [2, 3, 4, -5, -6, -7]
I want to create another where entries corresponding to positive values above are the sum of the positives, and entries corresponding to negative values are the absolute sum of the negatives. So the desired output is:
final = [9, 9, 9, 18, 18, 18]
I am doing this:
raw = pd.DataFrame(orig, columns =['raw'])
raw
raw
0 2
1 3
2 4
3 -5
4 -6
5 -7
sum_pos = raw[raw> 0].sum()
sum_neg = -1*raw[raw < 0].sum()
final = pd.DataFrame(index = raw.index, columns = ['final'])
final
final
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
final.loc[raw >0, 'final'] = sum_pos
KeyError: "[('r', 'a', 'w') ('r', 'a', 'w') ('r', 'a', 'w') ('r', 'a', 'w')\n ('r', 'a', 'w') ('r', 'a', 'w')] not in index"
So basically I was trying to create an empty dataframe like raw and then fill it conditionally. However, the above method fails.
Even when I try to create a new column instead of a new df, it fails:
raw.loc[raw>0, 'final']= sum_pos
KeyError: "[('r', 'a', 'w') ('r', 'a', 'w') ('r', 'a', 'w') ('r', 'a', 'w')\n ('r', 'a', 'w') ('r', 'a', 'w')] not in index"
The best solution I've found so far is this:
pd.DataFrame(np.where(raw > 0, sum_pos, sum_neg), index=raw.index, columns=['final'])
final
0 9.0
1 9.0
2 9.0
3 18.0
4 18.0
5 18.0
However, I don't understand what is wrong with the other approaches. Is there something I am missing here?
You can try grouping on np.sign, then sum and abs:
s = pd.Series(orig)
s.groupby(np.sign(s)).transform('sum').abs().tolist()
Output:
[9, 9, 9, 18, 18, 18]
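To see why this works, the intermediate values are (a sketch):
np.sign(s).tolist()                              # [1, 1, 1, -1, -1, -1]
s.groupby(np.sign(s)).transform('sum').tolist()  # [9, 9, 9, -18, -18, -18]
# abs() then drops the sign from the negative group's sum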
You're not aligning indexes: sum_pos is a Series with a single element whose index label is 'raw', and you are trying to assign it to a slice of the dataframe that has no 'raw' label in its index.
Pandas does almost everything by index alignment. To make this work you need to extract the raw values from the sum_pos Series:
final.loc[raw['raw'] > 0, 'final'] = sum_pos.values
print(final)
Output:
final
0 9.0
1 9.0
2 9.0
3 NaN
4 NaN
5 NaN
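To make the mismatch concrete, this is what sum_pos actually looks like; note also that raw > 0 is a whole DataFrame, while the working version above indexes with the Series raw['raw'] > 0:
print(sum_pos)
# raw    9.0
# dtype: float64
print(sum_pos.values)
# [9.]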

matching two different arrays and making a new array in python

I have two two-dimensional arrays of different sizes, and I need to build a new array from the 2nd array, keeping the rows whose first-column values also appear in the 1st array.
Basically the idea is as follows:
file A
#x y
1 2
3 4
2 2
5 4
6 4
7 4
file B
#x1 y1
0 1
1 1
11 1
5 1
7 1
My expected output 2D array should look like
#newx newy
1 1
5 1
7 1
I tried it following way:
match = []
for i in range(len(x)):
    if x[i] == x1[i]:
        new_array = x1[i]
        match.append(new_array)
print(match)
This does not seem to work. Please suggest a way to create the new 2D array.
Try np.isin.
import numpy as np

arr1 = np.array([[1, 3, 2, 5, 6, 7], [2, 4, 2, 4, 4, 4]])
arr2 = np.array([[0, 1, 11, 5, 7], [1, 1, 1, 1, 1]])
arr2[:, np.isin(arr2[0], arr1[0])]  # keep the columns of arr2 whose x appears in arr1's x
array([[1, 5, 7],
[1, 1, 1]])
np.isin(arr2[0], arr1[0]) checks whether each element of arr2[0] is in arr1[0]. Then, we use the result as the boolean index array to select elements in arr2.
If you make a set of the first element of each pair in A, it is fairly easy to find the elements of B to keep:
Code:
a = ((1, 2), (3, 4), (2, 2), (5, 4), (6, 4), (7, 4))
b = ((0, 1), (1, 1), (11, 1), (5, 1), (7, 1))
in_a = {i[0] for i in a}
new_b = [i for i in b if i[0] in in_a]
print(new_b)
Results:
[(1, 1), (5, 1), (7, 1)]
Output results to file as:
with open('output.txt', 'w') as f:
    for value in new_b:
        f.write(' '.join(str(v) for v in value) + '\n')
#!/usr/bin/env python3
from io import StringIO
import pandas as pd
fileA = """x y
1 2
3 4
2 2
5 4
6 4
7 4
"""
fileB = """x1 y1
0 1
1 1
11 1
5 1
7 1
"""
df1 = pd.read_csv(StringIO(fileA), delim_whitespace=True, index_col="x")
df2 = pd.read_csv(StringIO(fileB), delim_whitespace=True, index_col="x1")
df = pd.merge(df1, df2, left_index=True, right_index=True)
print(df["y1"])
# 1 1
# 5 1
# 7 1
https://pandas.pydata.org/pandas-docs/stable/merging.html#database-style-dataframe-joining-merging
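If the expected 2-D (newx, newy) array is needed rather than a Series, resetting the index back into a column gets there (a sketch):
print(df["y1"].reset_index().values)
# [[1 1]
#  [5 1]
#  [7 1]]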
If you use pandas:
import pandas as pd
A = pd.DataFrame({'x': pd.Series([1,3,2,5,6,7]), 'y': pd.Series([2,4,2,4,4,4])})
B = pd.DataFrame({'x1': pd.Series([0,1,11,5,7]), 'y1': 1})
C = A.join(B.set_index('x1'), on='x')
Then, if you want to drop the unneeded rows/columns and rename the columns:
C = A.join(B.set_index('x1'), on='x')
C = C.drop(['y'], axis=1).dropna()  # drop A's y column and the unmatched rows (NaN in y1)
C.columns = ['newx', 'newy']
which gives you:
>>> C
newx newy
0 1 1.0
3 5 1.0
5 7 1.0
If you are going to work with arrays, dataframes, etc - pandas is definitely worth a look: https://pandas.pydata.org/pandas-docs/stable/10min.html
Assuming that you have (x, y) pairs in your 2-D arrays, a simple loop may work:
arr1 = [[1, 2], [3, 4], [2, 2]]
arr2 = [[0, 1], [1, 1], [11, 1]]
result = []
for pair1 in arr1:
    for pair2 in arr2:
        if pair1[0] == pair2[0]:
            result.append(pair2)
print(result)
Not the best solution for smaller arrays, but for really large arrays it works fast:
import numpy as np
import pandas as pd
n1 = np.transpose(np.array([[1, 3, 2, 5, 6, 7], [2, 4, 2, 4, 4, 4]]))
n2 = np.transpose(np.array([[0, 1, 11, 5, 7], [1, 1, 1, 1, 1]]))
# inner-merge on the shared first column (named 0), then drop n1's y column ('1_x')
np.array(pd.DataFrame(n1).merge(pd.DataFrame(n2), on=0, how='inner').drop('1_x', axis=1))

pandas data frame sort

I have a pandas dataframe which I am trying to sort by the 'dist' column; the sorted dataframe should start with E or F, as below. I use sort_values, but it is not working for me. The function computes distances from the 'Start' location to a list of locations ['C', 'B', 'D', 'E', 'A', 'F'] and is then supposed to sort the dataframe in ascending order by the 'dist' column.
Could someone advise me why the sorting is not working?
locations = {'Start':(20,5),'A':(10,3), 'B':(5,3), 'C':(5, 7), 'D':(10,7),'E':(14,4),'F':(14,6)}
loc_list
Out[194]: ['C', 'B', 'D', 'E', 'A', 'F']
from math import hypot

def closest_locations(from_loc_point, to_loc_list):
    lresults = list()
    for list_index in range(len(to_loc_list)):
        # Euclidean distance from the start point to this location
        dist = hypot(locations[from_loc_point[0]][0] - locations[to_loc_list[list_index]][0],
                     locations[from_loc_point[0]][1] - locations[to_loc_list[list_index]][1])
        lista_dist = [from_loc_point[0], to_loc_list[list_index], dist]
        lresults.append(lista_dist[:])
    RESULTS = pd.DataFrame(np.array(lresults))
    RESULTS.columns = ['from', 'to', 'dist']
    RESULTS.sort_values(['dist'], ascending=[True], inplace=True)
    RESULTS.index = range(len(RESULTS))
    return RESULTS
closest_locations(['Start'], loc_list)
Out[189]:
from to dist
0 Start D 10.19803902718557
1 Start A 10.19803902718557
2 Start C 15.132745950421555
3 Start B 15.132745950421555
4 Start E 6.08276253029822
5 Start F 6.08276253029822
closest_two_loc.dtypes
Out[247]:
from object
to object
dist object
dtype: object
Is this what you want?
locations = {'Start':(20,5),'A':(10,3), 'B':(5,3), 'C':(5, 7), 'D':(10,7),'E':(14,4),'F':(14,6)}
import numpy as np

df = pd.DataFrame.from_dict(locations, orient='index').rename(columns={0: 'x', 1: 'y'})
df['dist'] = df.apply(lambda row: np.sqrt((row['x'] - df.loc['Start', 'x'])**2 + (row['y'] - df.loc['Start', 'y'])**2), axis=1)
df.drop(['Start']).sort_values(by='dist')
x y dist
E 14 4 6.082763
F 14 6 6.082763
A 10 3 10.198039
D 10 7 10.198039
C 5 7 15.132746
B 5 3 15.132746
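As an aside, the squared-terms lambda could be replaced by a single vectorized np.hypot call with the same result (a sketch):
df['dist'] = np.hypot(df['x'] - df.loc['Start', 'x'], df['y'] - df.loc['Start', 'y'])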
Or, if you want to wrap it in a function:
def dist_from(df, col):
    df['dist'] = df.apply(lambda row: np.sqrt((row['x'] - df.loc[col, 'x'])**2 + (row['y'] - df.loc[col, 'y'])**2), axis=1)
    df['from'] = col
    df = df.drop([col]).sort_values(by='dist')  # assign back; drop/sort_values are not in-place here
    df.index.name = 'to'
    return df.reset_index().loc[:, ['from', 'to', 'dist']]
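A hypothetical call, assuming df from above (pass a copy, since the function mutates its argument; ties keep their row order because the sort is stable):
print(dist_from(df.copy(), 'Start'))
#     from to       dist
# 0  Start  E   6.082763
# 1  Start  F   6.082763
# 2  Start  A  10.198039
# ...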
You need to convert values in "dist" column to float:
df = closest_locations(['Start'], loc_list)
df.dist = list(map(lambda x: float(x), df.dist)) # convert each value to float
print(df.sort_values('dist')) # now it will sort properly
Output:
from to dist
4 Start E 6.082763
5 Start F 6.082763
0 Start D 10.198039
1 Start A 10.198039
2 Start C 15.132746
3 Start B 15.132746
Edit: as mentioned by @jezrael in the comments, the following is a more direct method:
df.dist = df.dist.astype(float)
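For context, the object dtype originates in the question's np.array(lresults) call inside closest_locations: mixing strings and floats makes NumPy cast everything to strings. Building the frame straight from the list of lists avoids the cast (a sketch of the fix inside the function):
RESULTS = pd.DataFrame(lresults, columns=['from', 'to', 'dist'])
RESULTS.dtypes  # from/to are object, dist is float64, so sort_values works directly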

Pandas: for loop through columns

My data looks like:
SNP Name ss715583617 ss715592335 ss715591044 ss715598181
4 PI081762 T A A T
5 PI101404A T A A T
6 PI101404B T A A T
7 PI135624 T A A T
8 PI326581 T A A T
9 PI326582A T A A T
10 PI326582B T A A T
11 PI339732 T A A T
12 PI339735A T A A T
13 PI339735B T A A T
14 PI342618A T A A T
In reality I have a dataset of 50,000 columns of 479 rows. My objective is to go through each column with characters and convert the data to integers depending on which is the most abundant character.
Right now I have the data input, and I have more or less written the function I would like to use to analyze each column separately. However, I can't quite understand how to use a for-loop or the apply function across all of the columns in the dataset. I would prefer not to hardcode the columns because I will have 40,000~50,000 columns to analyze.
My code so far is:
import pandas as pd
df = pd.read_csv("/home/dfreese/Desktop/testSNPtext", delimiter='\t')
df.head() # check that the file format fits
# ncol df
df2 = df.iloc[4:-1] # Select the rows you want to analyze in a subset df
print(df2)
My function:
def countAlleles(N):
    # N is a single column; ideally, once optimized, this runs on every column
    # Holds the count of each letter in the column (parallel to letterOrder)
    letterCount = [0, 0, 0, 0, 0, 0]
    letterOrder = ['T', 'A', 'G', 'C', 'H', 'U']
    # Loop through the column to count each letter
    for i in range(len(N)):  # How do I get index information of the column?
        if N[i] == 'T':
            letterCount[0] = letterCount[0] + 1
        elif N[i] == 'A':
            letterCount[1] = letterCount[1] + 1
        elif N[i] == 'G':
            letterCount[2] = letterCount[2] + 1
        elif N[i] == 'C':
            letterCount[3] = letterCount[3] + 1
        elif N[i] == 'H':
            letterCount[4] = letterCount[4] + 1
        else:
            letterCount[5] = letterCount[5] + 1
    max = letterCount[0]  # This will hold the maximum value
    mIndex = 0            # This holds the index position with the max value
    # Determine which one is max
    for i in range(len(letterCount)):
        if letterCount[i] > max:
            max = letterCount[i]
            mIndex = i
So I designed the function to take a column as input, hoping to iterate through all the columns of the dataframe. My main question is:
1) How would I pass in each column as a parameter and loop through its elements?
My major source of confusion is how indexes are used in pandas. I'm familiar with 2-dimensional arrays in C++ and Java, and that is where most of my knowledge stems from.
I'm attempting to use the apply function:
df2 = df2.apply(countAlleles('ss715583617'), axis=2)
but it doesn't seem that my application is correct.
Updated answer: now the dataframe is analyzed and the characters are replaced with int values according to the occurrences of each allele per column. The problem of one allele having the same number of occurrences as another is still the same: the assignment will not be unique.
import pandas as pd
import numpy as np
df = pd.DataFrame.from_dict({"ss1": ["T", "T", "T", "G"],
"ss2": ["G", "G", "T", "A"],
"ss3": ["C", "H", "C", "H"]})
letterOrder = np.array(['T', 'A', 'G', 'C', 'H', 'U'])
for col in df:
    alleles = list()
    for num, allele in enumerate(letterOrder):
        alleles.append(df[col].str.count(allele).sum())
    # keys sorted by count, most frequent first
    repl = letterOrder[np.argsort(alleles)][::-1]
    # directly replace chars by their rank value
    for num, char in enumerate(repl):
        df[col].replace(char, num+1, inplace=True)
print(df)
This will change the initial dataframe
ss1 ss2 ss3
0 T G C
1 T G H
2 T T C
3 G A H
to the new dataframe with ints sorted according to the number of occurrences:
ss1 ss2 ss3
0 1 1 2
1 1 1 1
2 1 3 2
3 2 2 1
For reference, here is the old answer, which gives the most frequent allele per column:
import pandas as pd
import numpy as np
from collections import OrderedDict
df = pd.DataFrame.from_dict({"ss1": ["T", "T", "T", "G"],
"ss2": ["G", "G", "T", "A"],
"ss3": ["C", "H", "C", "H"]})
letterOrder = ['T', 'A', 'G', 'C', 'H', 'U']
full_results = OrderedDict()
for col in df:
    alleles = list()
    for num, allele in enumerate(letterOrder):
        alleles.append(df[col].str.count(allele).sum())
    full_results[col] = [letterOrder[np.argmax(alleles)], np.max(alleles)]
print(full_results)
This will give:
OrderedDict([('ss1', ['T', 3]), ('ss2', ['G', 2]), ('ss3', ['C', 2])])
The key in the dict is the name of your column, and the value is a list [allele, number_of_occurrences].
I used OrderedDict to keep the order of your columns and the name, but if you don't need the order, you can use a dict, or if you don't need the column name (and the implicit ID is enough), use a list.
But be careful: if two (or more) characters in a column have the same count, this will return only one of them. You would need an additional test for that, e.g. the sketch below.
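A possible tie check, to be added inside the for-col loop (a sketch reusing the alleles counts):
counts = np.array(alleles)
tied = np.array(letterOrder)[counts == counts.max()]
if len(tied) > 1:
    print(col, "has a tie between:", tied.tolist())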
To iterate over columns in e.g. a for loop, use list(df). Anyhow, you can easily do what you are attempting using collections.Counter:
assume a dataframe df
df
# Name ss715583617 ss715592335 ss715591044 ss715598181
#0 PI081762 T A A T
#1 PI101404A T A A T
#2 PI101404B T A A T
#3 PI135624 T A A T
#4 PI326581 T A F D
#5 PI326582A G A F T
#6 PI326582B G A A T
#7 PI339732 D H A T
#8 PI339735A D A A T
#9 PI339735B A A A T
#10 PI342618A D A A T
What I gather from the comments section and your original post, you want to replace each character in each column according to its frequency of occurrence. This is one approach:
Make the Counters
from collections import Counter
cols = [col for col in list(df) if col not in ['Name']]  # all the columns you want to operate on
col_counters = {col: Counter(df[col]) for col in cols}
#{'ss715583617': Counter({'T': 5, 'D': 3, 'G': 2, 'A': 1}),
# 'ss715591044': Counter({'A': 9, 'F': 2}),
# 'ss715592335': Counter({'A': 10, 'H': 1}),
# 'ss715598181': Counter({'T': 10, 'D': 1})}
Sort the items in each Counter
sort_func = lambda items: sorted(items, key=lambda x: x[1], reverse=True)  # sort (letter, count) pairs by count, descending
sort_result = {col: sort_func(counter.items()) for col, counter in col_counters.items()}
#{'ss715583617': [('T', 5), ('D', 3), ('G', 2), ('A', 1)],
# 'ss715591044': [('A', 9), ('F', 2)],
# 'ss715592335': [('A', 10), ('H', 1)],
# 'ss715598181': [('T', 10), ('D', 1)]}
Replace letters in dataframe according to sort result
Here we use enumerate to get each letter's position in the sort result:
mapper = {col: {letter: i+1 for i, (letter, count) in enumerate(sort_result[col])} for col in sort_result}
#{'ss715583617': {'A': 4, 'D': 2, 'G': 3, 'T': 1},
# 'ss715591044': {'A': 1, 'F': 2},
# 'ss715592335': {'A': 1, 'H': 2},
# 'ss715598181': {'D': 2, 'T': 1}}
df.replace(to_replace=mapper, inplace=True)
# Name ss715583617 ss715592335 ss715591044 ss715598181
#0 PI081762 1 1 1 1
#1 PI101404A 1 1 1 1
#2 PI101404B 1 1 1 1
#3 PI135624 1 1 1 1
#4 PI326581 1 1 2 2
#5 PI326582A 3 1 2 1
#6 PI326582B 3 1 1 1
#7 PI339732 2 2 1 1
#8 PI339735A 2 1 1 1
#9 PI339735B 4 1 1 1
#10 PI342618A 2 1 1 1
This should be enough to get you on your way. I am not sure how you want to handle tied counts, for instance if a column has the same number of T and G.
