Pandas conditional lookup based on columns from a different dataframe - python

I have searched but found no answers for my problem. My first dataframe looks like:
df1
Item Value
1 23
2 3
3 45
4 65
5 17
6 6
7 18
… …
500 78
501 98
and the second lookup table looks like
df2
L1 H1 L2 H2 L3 H3 L4 H4 L5 H5 Name
1 3 5 6 11 78 86 88 90 90 A
4 4 7 10 79 85 91 99 110 120 B
89 89 91 109 0 0 0 0 0 0 C
...
What I am trying to do is to bring Name from df2 into df1 whenever Item in df1 falls between one of the Low (L) and High (H) column pairs. My attempt (which does not work) looks like:
df1[Name]=np.where((df1['Item']>=df2['L1'] & df1['Item']<=df2['H1'])|
(df1['Item']>=df2['L2'] & df1['Item']<=df2['H2']) |
(df1['Item']>=df2['L3'] & df1['Item']<=df2['H3']) |
(df1['Item']>=df2['L4'] & df1['Item']<=df2['H4']) |
(df1['Item']>=df2['L5'] & df1['Item']<=df2['H5']) |
(df1['Item']>=df2['L6'] & df1['Item']<=df2['H6']), df2['Name'], "Other")
So that the result would be like:
Item Value Name
1 23 A
2 3 A
3 45 A
4 65 B
5 17 A
6 6 A
7 18 A
… … …
500 78 K
501 98 Other
If you have any guidance to share, I would much appreciate it! Thank you in advance!

Try:
Transform df2 using wide_to_long
Create lists of numbers from "L" to "H" for each row using apply and range
explode to have one value in each row
map each "Item" in df1 using a dict created from ranges with the structure {value: name}
ranges = pd.wide_to_long(df2, ["L","H"], i="Name", j="Subset")
ranges["values"] = ranges.apply(lambda x: list(range(x["L"], x["H"]+1)), axis=1)
ranges = ranges.explode("values").reset_index()
df1["Name"] = df1["Item"].map(dict(zip(ranges["values"], ranges["Name"])))
>>> df1
Item Value Name
0 1 23 A
1 2 3 A
2 3 45 A
3 4 65 B
4 5 17 A
5 6 6 A
6 7 18 B
7 500 78 NaN
8 501 98 NaN

A faster option (timings can prove or debunk that) would be to use conditional_join from pyjanitor (conditional_join uses binary search under the hood):
# pip install pyjanitor
import pandas as pd
import janitor

temp = (pd.wide_to_long(df2,
                        stubnames=['L', 'H'],
                        i='Name',
                        j='Num')
          .reset_index('Name')
       )

# the `Num` index is sorted already
(df1.conditional_join(
        temp,
        # left column, right column, join operator
        ('Item', 'L', '>='),
        ('Item', 'H', '<='),
        how='left')
    .loc[:, ['Item', 'Value', 'Name']]
)
Item Value Name
0 1 23 A
1 2 3 A
2 3 45 A
3 4 65 B
4 5 17 A
5 6 6 A
6 7 18 B
7 500 78 NaN
8 501 98 NaN

Related

Dividing one dataframe by another in python using pandas with float values

I have two separate data frames named df1 and df2 as shown below:
Scaffold Position Ref_Allele_Count Alt_Allele_Count Coverage_Depth Alt_Allele_Frequency
0 1 11 7 51 58 0.879310
1 1 16 20 95 115 0.826087
2 2 9 9 33 42 0.785714
3 2 12 86 51 137 0.372263
4 2 67 41 98 139 0.705036
5 3 8 0 0 0 0.000000
6 4 99 32 26 58 0.448276
7 4 101 100 24 124 0.193548
8 4 115 69 26 95 0.273684
9 5 6 40 57 97 0.587629
10 5 19 53 87 140 0.621429
Scaffold Position Ref_Allele_Count Alt_Allele_Count Coverage_Depth Alt_Allele_Frequency
0 1 11 7 64 71 0.901408
1 1 16 10 90 100 0.900000
2 2 9 79 86 165 0.521212
3 2 12 12 73 85 0.858824
4 2 67 54 96 150 0.640000
5 3 8 0 0 0 0.000000
6 4 99 86 28 114 0.245614
7 4 101 32 25 57 0.438596
8 4 115 97 16 113 0.141593
9 5 6 86 43 129 0.333333
10 5 19 59 27 86 0.313953
I have already summed the Alt_Allele_Count and Coverage_Depth values of df1 and df2, but I need to divide the resulting Alt_Allele_Count by the resulting Coverage_Depth to find the total allele frequency (AF). When I tried dividing the two variables I got the error message:
TypeError: float() argument must be a string or a number, not 'DataFrame'
when I tried to convert them to floats, and this table when I left them as a df:
Alt_Allele_Count Coverage_Depth
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
5 NaN NaN
6 NaN NaN
7 NaN NaN
8 NaN NaN
9 NaN NaN
10 NaN NaN
My code so far:
import csv
import pandas as pd
import numpy as np
df1 = pd.read_csv('C:/Users/Tom/Python_CW/file_pairA_1.csv')
df2 = pd.read_csv('C:/Users/Tom/Python_CW/file_pairA_2.csv')
print(df1)
print(df2)
Ref_Allele_Count = (df1[['Ref_Allele_Count']] + df2[['Ref_Allele_Count']])
print(Ref_Allele_Count)
Alt_Allele_Count = (df1[['Alt_Allele_Count']] + df2[['Alt_Allele_Count']])
print(Alt_Allele_Count)
Coverage_Depth = (df1[['Coverage_Depth']] + df2[['Coverage_Depth']]).astype(float)
print(Coverage_Depth)
AF = Alt_Allele_Count / Coverage_Depth
print(AF)
The error stems from the difference between a pandas Series and a DataFrame. A Series is a 1-dimensional structure like a single column, while a DataFrame is a 2-dimensional object like a table. Two Series combine element-wise into a new Series, but two DataFrames align on both their index and their column labels; when the column labels differ, as in your final division (Alt_Allele_Count vs Coverage_Depth), nothing matches and you get the all-NaN table you posted.
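For illustration, a toy example (made-up column names, not your actual data) shows that alignment behaviour:
import pandas as pd

a = pd.DataFrame({'x': [1, 2]})
b = pd.DataFrame({'y': [3, 4]})

# columns 'x' and 'y' do not align, so every resulting value is NaN
print(a / b)
#     x   y
# 0 NaN NaN
# 1 NaN NaN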
Taking slices of a dataframe can either result in a series or dataframe object depending on how you do it:
df['column_name'] -> Series
df[['column_name', 'column_2']] -> DataFrame
So in the line:
Ref_Allele_Count = (df1[['Ref_Allele_Count']] + df2[['Ref_Allele_Count']])
df1[['Ref_Allele_Count']] becomes a single-column DataFrame rather than a Series.
Ref_Allele_Count = (df1['Ref_Allele_Count'] + df2['Ref_Allele_Count'])
Should return the correct result here. Same goes for the rest of the columns you're adding together.
This can be fixed by using one set of brackets '[]' when referring to a column in a pandas df, rather than two.
import csv
import pandas as pd
import numpy as np
df1 = pd.read_csv('C:/Users/Tom/Python_CW/file_pairA_1.csv')
df2 = pd.read_csv('C:/Users/Tom/Python_CW/file_pairA_2.csv')
print(df1)
print(df2)
# note that I changed your double brackets ([["col_name"]]) to single (["col_name"])
# this results in pd.Series objects instead of pd.DataFrame objects
Ref_Allele_Count = (df1['Ref_Allele_Count'] + df2['Ref_Allele_Count'])
print(Ref_Allele_Count)
Alt_Allele_Count = (df1['Alt_Allele_Count'] + df2['Alt_Allele_Count'])
print(Alt_Allele_Count)
Coverage_Depth = (df1['Coverage_Depth'] + df2['Coverage_Depth']).astype(float)
print(Coverage_Depth)
AF = Alt_Allele_Count / Coverage_Depth
print(AF)
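If you would rather keep the double-bracket slices, an alternative sketch (my addition, not required for the fix above) is to squeeze each single-column DataFrame down to a Series before dividing:
# keep the DataFrame slices but convert each to a Series with squeeze()
Alt_Allele_Count = (df1[['Alt_Allele_Count']] + df2[['Alt_Allele_Count']]).squeeze()
Coverage_Depth = (df1[['Coverage_Depth']] + df2[['Coverage_Depth']]).squeeze().astype(float)
AF = Alt_Allele_Count / Coverage_Depth
print(AF)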

Pandas Python highest 2 rows of every 3 and tabling the results

Suppose I have the following dataframe:
. Column1 Column2
0 25 1
1 89 2
2 59 3
3 78 10
4 99 20
5 38 30
6 89 100
7 57 200
8 87 300
I'm not sure whether what I want to do is possible or not. I want to compare every three rows of Column1, take the highest 2 of those three rows, and assign the corresponding 2 Column2 values to a new column. It does not matter exactly which rows the Column3 values land on or how they are arranged, because I know every 2 values of Column3 belong to every 3 rows of Column1.
. Column1 Column2 Column3
0 25 1 2
1 89 2 3
2 59 3
3 78 10 20
4 99 20 10
5 38 30
6 89 100 100
7 57 200 300
8 87 300
You can use np.arange with np.repeat to create a grouping array which groups every 3 rows.
Then use GroupBy.nlargest, extract the indices of those values using pd.Index.get_level_values, and assign them to Column3; pandas handles the index alignment.
import numpy as np

n_grps = len(df) // 3                  # number of complete groups of 3 rows
g = np.repeat(np.arange(n_grps), 3)    # [0,0,0,1,1,1,2,2,2] for 9 rows
idx = df.groupby(g)['Column1'].nlargest(2).index.get_level_values(1)
vals = df.loc[idx, 'Column2']
vals
# 1 2
# 2 3
# 4 20
# 3 10
# 6 100
# 8 300
# Name: Column2, dtype: int64
df['Column3'] = vals
df
Column1 Column2 Column3
0 25 1 NaN
1 89 2 2.0
2 59 3 3.0
3 78 10 10.0
4 99 20 20.0
5 38 30 NaN
6 89 100 100.0
7 57 200 NaN
8 87 300 300.0
To get output like the one you mentioned in the question, you have to sort each group and push NaN to the end, so you have to perform this additional step:
df['Column3'] = df.groupby(g)['Column3'].apply(lambda x:x.sort_values()).values
Column1 Column2 Column3
0 25 1 2.0
1 89 2 3.0
2 59 3 NaN
3 78 10 10.0
4 99 20 20.0
5 38 30 NaN
6 89 100 100.0
7 57 200 300.0
8 87 300 NaN
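As a side note (my addition, not part of the original answer), the grouping key can also be built without computing n_grps first, which should also cope with a row count that is not an exact multiple of 3:
# integer-divide the positional index to group rows 0-2, 3-5, 6-8, ...
g = np.arange(len(df)) // 3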

How to iterate over rows and drop all other rows where column matches?

I'm trying to go row by row in a dataframe and delete any rows that have the same 'hole_ID', while keeping the original row, so that the nearest neighbour only searches in different holes. Here's what I have so far:
import pandas as pd
from io import StringIO

s1 = StringIO(u'''east,north,elev,hole_ID
11,11,5,A
51,51,6,A
61,61,11,A
21,21,2,B
31,31,3,B
71,71,3,B
81,81,4,B''')
df2 = pd.read_csv(s1)
for idx, row in df2.iterrows():
    dftype = df2.drop_duplicates(subset=['hole_ID'], keep='first')
This is what I get:
Out[20]:
east north elev hole_ID
0 11 11 5 A
3 21 21 2 B
And this is what I want to get:
Out[18]:
east north elev hole_ID
0 11 11 5 A
3 21 21 2 B
4 31 31 3 B
5 71 71 3 B
6 81 81 4 B
So for row 1, all the other rows with the same hole_ID ('A') are dropped.
EDIT: I need to do this for every row in the original data frame to perform a nearest neighbour calculation where the hole_ID's do not match.
Thanks in advance.
If you only want to drop duplicates where hole_ID is A, you can pd.concat two pieces: on one side the dataframe filtered to where that is true, with duplicates dropped, and on the other the rest of the rows:
pd.concat([
    df2[df2.hole_ID.eq('A')].drop_duplicates(subset=['hole_ID'], keep='first'),
    df2[df2.hole_ID.ne('A')]],
    axis=0)
east north elev hole_ID
0 11 11 5 A
3 21 21 2 B
4 31 31 3 B
5 71 71 3 B
6 81 81 4 B
I would create a function, and use Series.isin to be able to select more than one ID:
def remove_by_hole_ID(df, hole_ID):
    if not isinstance(hole_ID, list):
        hole_ID = [hole_ID]
    m = df['hole_ID'].isin(hole_ID)
    return pd.concat([df[m].drop_duplicates(subset='hole_ID'), df[~m]], sort=True)
print(remove_by_hole_ID(df2, 'A'))
east elev hole_ID north
0 11 5 A 11
3 21 2 B 21
4 31 3 B 31
5 71 3 B 71
6 81 4 B 81
print(remove_by_hole_ID(df2, ['A', 'B']))
east elev hole_ID north
0 11 5 A 11
3 21 2 B 21
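For the edit in the question (running this for every row so the nearest-neighbour search only looks at other holes), a rough sketch of my own could be to filter the candidates per row; the distance calculation itself is left as a placeholder:
# for each row, keep only candidates from a *different* hole;
# plug your nearest-neighbour / distance logic in where indicated
for idx, row in df2.iterrows():
    candidates = df2[df2['hole_ID'] != row['hole_ID']]
    # ... compute distances from `row` to `candidates` here ...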

How to get the value if the value is present in the xlsx sheet without knowing index number?

I have an unstructured xlsx file. I want to get the full row if a value is present anywhere in the sheet. For example:
A B C D F
abc 10 24 32 54
cdf 9 10 34 98
mgl 11 90 21 98
fgd 1 9 2 10
I want to check whether the value 10 is present in the sheet and, if so, get the full row values:
output =>
abc 10 24 32 54
cdf 9 10 34 98
fgd 1 9 2 10
Thanks for the contributions.
Use DataFrame.eq with DataFrame.any to test whether there is at least one True per row:
df = pd.read_excel('file.xlsx')
df1 = df[df.eq(10).any(axis=1)]
Or:
df1 = df[(df == 10).any(axis=1)]
print (df1)
A B C D F
0 abc 10 24 32 54
1 cdf 9 10 34 98
3 fgd 1 9 2 10
You can use pandas.DataFrame.isin followed by pandas.DataFrame.any:
df[df.isin([10]).any(axis = 1)]
A B C D F
0 abc 10 24 32 54
1 cdf 9 10 34 98
3 fgd 1 9 2 10
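If you ever need to match several values at once, isin already accepts a list, so something like this should work (an extra example from me, not part of the original answer):
# keep rows containing either 10 or 98 in any column
df[df.isin([10, 98]).any(axis=1)]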

DataFrame : Get the top n value of each type

I have a group of data like below
ID Type value_1 value_2
1 A 12 89
2 A 13 78
3 A 11 92
4 A 9 79
5 B 15 83
6 B 34 91
7 B 2 87
8 B 3 86
9 B 7 85
10 C 9 83
11 C 3 85
12 C 2 87
13 C 12 88
14 C 11 82
I want to get the top 3 members of each Type according to value_1. The only solution that occurs to me is: first, split each Type's data into its own dataframe, sort it by value_1 and take the top 3; then merge the results together.
But is there any simpler method to solve it? For easier discussion, I have the code below:
#coding:utf-8
import pandas as pd
_data = [
    ["1","A",12,89],
    ["2","A",13,78],
    ["3","A",11,92],
    ["4","A",9,79],
    ["5","B",15,83],
    ["6","B",34,91],
    ["7","B",2,87],
    ["8","B",3,86],
    ["9","B",7,85],
    ["10","C",9,83],
    ["11","C",3,85],
    ["12","C",2,87],
    ["13","C",12,88],
    ["14","C",11,82]
]
head= ["ID","type","value_1","value_2"]
df = pd.DataFrame(_data, columns=head)
Then we use groupby tail with sort_values:
newdf=df.sort_values(['type','value_1']).groupby('type').tail(3)
newdf
ID type value_1 value_2
2 3 A 11 92
0 1 A 12 89
1 2 A 13 78
8 9 B 7 85
4 5 B 15 83
5 6 B 34 91
9 10 C 9 83
13 14 C 11 82
12 13 C 12 88
Sure! DataFrame.groupby can split a dataframe into groups by the grouping fields, and apply can run a UDF on each group.
df.groupby('type', as_index=False, group_keys=False)\
.apply(lambda x: x.sort_values('value_1', ascending=False).head(3))
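Another option along the same lines (my addition, assuming the same df as above) would be to take the row labels from a per-group nlargest and select them with loc:
# nlargest per group returns a MultiIndex (type, original row label);
# the last level gives back the labels of the rows to select
idx = df.groupby('type')['value_1'].nlargest(3).index.get_level_values(-1)
df.loc[idx]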
