Pandas conditional lookup based on columns from a different dataframe - python

I have searched but found no answers for my problem. My first dataframe looks like:
df1
Item Value
1 23
2 3
3 45
4 65
5 17
6 6
7 18
… …
500 78
501 98
and the second lookup table looks like
df2
L1 H1 L2 H2 L3 H3 L4 H4 L5 H5 Name
1 3 5 6 11 78 86 88 90 90 A
4 4 7 10 79 85 91 99 110 120 B
89 89 91 109 0 0 0 0 0 0 C
...
What I am trying to do is to bring Name from df2 into df1 whenever Item in df1 falls between one of the Low (L) and High (H) column pairs. My attempt (which does not work) looks like:
df1[Name]=np.where((df1['Item']>=df2['L1'] & df1['Item']<=df2['H1'])|
(df1['Item']>=df2['L2'] & df1['Item']<=df2['H2']) |
(df1['Item']>=df2['L3'] & df1['Item']<=df2['H3']) |
(df1['Item']>=df2['L4'] & df1['Item']<=df2['H4']) |
(df1['Item']>=df2['L5'] & df1['Item']<=df2['H5']) |
(df1['Item']>=df2['L6'] & df1['Item']<=df2['H6']), df2['Name'], "Other")
So that the result would be like:
Item Value Name
1 23 A
2 3 A
3 45 A
4 65 B
5 17 A
6 6 A
7 18 A
… … …
500 78 K
501 98 Other
If you have any guidance to share, I would much appreciate it! Thank you in advance!

Try:
Transform df2 using wide_to_long
Create lists of numbers from "L" to "H" for each row using apply and range
explode to have one value in each row
map each "Item" in df1 using a dict created from ranges with the structure {value: name}
ranges = pd.wide_to_long(df2, ["L","H"], i="Name", j="Subset")
ranges["values"] = ranges.apply(lambda x: list(range(x["L"], x["H"]+1)), axis=1)
ranges = ranges.explode("values").reset_index()
df1["Name"] = df1["Item"].map(dict(zip(ranges["values"], ranges["Name"])))
>>> df1
Item Value Name
0 1 23 A
1 2 3 A
2 3 45 A
3 4 65 B
4 5 17 A
5 6 6 A
6 7 18 B
7 500 78 NaN
8 501 98 NaN

A faster option (timings can prove or debunk that) would be to use conditional_join from pyjanitor (conditional_join uses binary search under the hood):
# pip install pyjanitor
import pandas as pd
import janitor

temp = (pd.wide_to_long(df2,
                        stubnames=['L', 'H'],
                        i='Name',
                        j='Num')
          .reset_index('Name')
       )

# the `Num` index is sorted already
(df1.conditional_join(
        temp,
        # left column, right column, join operator
        ('Item', 'L', '>='),
        ('Item', 'H', '<='),
        how='left')
    .loc[:, ['Item', 'Value', 'Name']]
)
Item Value Name
0 1 23 A
1 2 3 A
2 3 45 A
3 4 65 B
4 5 17 A
5 6 6 A
6 7 18 B
7 500 78 NaN
8 501 98 NaN

Related

Dividing one dataframe by another in python using pandas with float values

I have two separate data frames named df1 and df2 as shown below:
Scaffold Position Ref_Allele_Count Alt_Allele_Count Coverage_Depth Alt_Allele_Frequency
0 1 11 7 51 58 0.879310
1 1 16 20 95 115 0.826087
2 2 9 9 33 42 0.785714
3 2 12 86 51 137 0.372263
4 2 67 41 98 139 0.705036
5 3 8 0 0 0 0.000000
6 4 99 32 26 58 0.448276
7 4 101 100 24 124 0.193548
8 4 115 69 26 95 0.273684
9 5 6 40 57 97 0.587629
10 5 19 53 87 140 0.621429
Scaffold Position Ref_Allele_Count Alt_Allele_Count Coverage_Depth Alt_Allele_Frequency
0 1 11 7 64 71 0.901408
1 1 16 10 90 100 0.900000
2 2 9 79 86 165 0.521212
3 2 12 12 73 85 0.858824
4 2 67 54 96 150 0.640000
5 3 8 0 0 0 0.000000
6 4 99 86 28 114 0.245614
7 4 101 32 25 57 0.438596
8 4 115 97 16 113 0.141593
9 5 6 86 43 129 0.333333
10 5 19 59 27 86 0.313953
I have already summed the Alt_Allele_Count and Coverage_Depth values of df1 and df2, but I need to divide the resulting Alt_Allele_Count by the resulting Coverage_Depth to find the total allele frequency (AF). When I tried dividing the two variables I got the error message:
TypeError: float() argument must be a string or a number, not 'DataFrame'
when I tried to convert them to floats, and this table when I left them as a df:
Alt_Allele_Count Coverage_Depth
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
5 NaN NaN
6 NaN NaN
7 NaN NaN
8 NaN NaN
9 NaN NaN
10 NaN NaN
My code so far:
import csv
import pandas as pd
import numpy as np
df1 = pd.read_csv('C:/Users/Tom/Python_CW/file_pairA_1.csv')
df2 = pd.read_csv('C:/Users/Tom/Python_CW/file_pairA_2.csv')
print(df1)
print(df2)
Ref_Allele_Count = (df1[['Ref_Allele_Count']] + df2[['Ref_Allele_Count']])
print(Ref_Allele_Count)
Alt_Allele_Count = (df1[['Alt_Allele_Count']] + df2[['Alt_Allele_Count']])
print(Alt_Allele_Count)
Coverage_Depth = (df1[['Coverage_Depth']] + df2[['Coverage_Depth']]).astype(float)
print(Coverage_Depth)
AF = Alt_Allele_Count / Coverage_Depth
print(AF)
The error stems from the difference between a pandas Series and a DataFrame. A Series is a 1-dimensional structure like a single column, while a DataFrame is a 2-dimensional object like a table. Two Series combine element-wise into a new Series, but two DataFrames align on both their index and their column labels; when the column labels differ, as in your final division (Alt_Allele_Count vs Coverage_Depth), nothing matches and you get the all-NaN table you posted.
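For illustration, a toy example (made-up column names, not your actual data) shows that alignment behaviour:
import pandas as pd

a = pd.DataFrame({'x': [1, 2]})
b = pd.DataFrame({'y': [3, 4]})

# columns 'x' and 'y' do not align, so every resulting value is NaN
print(a / b)
#     x   y
# 0 NaN NaN
# 1 NaN NaN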
Taking slices of a dataframe can either result in a series or dataframe object depending on how you do it:
df['column_name'] -> Series
df[['column_name', 'column_2']] -> DataFrame
So in the line:
Ref_Allele_Count = (df1[['Ref_Allele_Count']] + df2[['Ref_Allele_Count']])
df1[['Ref_Allele_Count']] becomes a single-column DataFrame rather than a Series.
Ref_Allele_Count = (df1['Ref_Allele_Count'] + df2['Ref_Allele_Count'])
Should return the correct result here. Same goes for the rest of the columns you're adding together.
This can be fixed by using one set of brackets '[]' when referring to a column in a pandas df, rather than two.
import csv
import pandas as pd
import numpy as np
df1 = pd.read_csv('C:/Users/Tom/Python_CW/file_pairA_1.csv')
df2 = pd.read_csv('C:/Users/Tom/Python_CW/file_pairA_2.csv')
print(df1)
print(df2)
# note that I changed your double brackets ([["col_name"]]) to single (["col_name"])
# this results in pd.Series objects instead of pd.DataFrame objects
Ref_Allele_Count = (df1['Ref_Allele_Count'] + df2['Ref_Allele_Count'])
print(Ref_Allele_Count)
Alt_Allele_Count = (df1['Alt_Allele_Count'] + df2['Alt_Allele_Count'])
print(Alt_Allele_Count)
Coverage_Depth = (df1['Coverage_Depth'] + df2['Coverage_Depth']).astype(float)
print(Coverage_Depth)
AF = Alt_Allele_Count / Coverage_Depth
print(AF)
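If you would rather keep the double-bracket slices, an alternative sketch (my addition, not required for the fix above) is to squeeze each single-column DataFrame down to a Series before dividing:
# keep the DataFrame slices but convert each to a Series with squeeze()
Alt_Allele_Count = (df1[['Alt_Allele_Count']] + df2[['Alt_Allele_Count']]).squeeze()
Coverage_Depth = (df1[['Coverage_Depth']] + df2[['Coverage_Depth']]).squeeze().astype(float)
AF = Alt_Allele_Count / Coverage_Depth
print(AF)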

Pandas Python highest 2 rows of every 3 and tabling the results

Suppose I have the following dataframe:
. Column1 Column2
0 25 1
1 89 2
2 59 3
3 78 10
4 99 20
5 38 30
6 89 100
7 57 200
8 87 300
I'm not sure whether what I want to do is possible or not. I want to compare every three rows of Column1, take the highest 2 of those three rows, and assign the corresponding 2 Column2 values to a new column. It does not matter exactly which rows the Column3 values land on or how they are arranged, because I know every 2 values of Column3 belong to every 3 rows of Column1.
. Column1 Column2 Column3
0 25 1 2
1 89 2 3
2 59 3
3 78 10 20
4 99 20 10
5 38 30
6 89 100 100
7 57 200 300
8 87 300
You can use np.arange with np.repeat to create a grouping array which groups every 3 rows.
Then use GroupBy.nlargest, extract the indices of those values using pd.Index.get_level_values, and assign them to Column3; pandas handles the index alignment.
import numpy as np

n_grps = len(df) // 3                  # number of complete groups of 3 rows
g = np.repeat(np.arange(n_grps), 3)    # [0,0,0,1,1,1,2,2,2] for 9 rows
idx = df.groupby(g)['Column1'].nlargest(2).index.get_level_values(1)
vals = df.loc[idx, 'Column2']
vals
# 1 2
# 2 3
# 4 20
# 3 10
# 6 100
# 8 300
# Name: Column2, dtype: int64
df['Column3'] = vals
df
Column1 Column2 Column3
0 25 1 NaN
1 89 2 2.0
2 59 3 3.0
3 78 10 10.0
4 99 20 20.0
5 38 30 NaN
6 89 100 100.0
7 57 200 NaN
8 87 300 300.0
To get output like the one you mentioned in the question, you have to sort each group and push NaN to the end, so you have to perform this additional step:
df['Column3'] = df.groupby(g)['Column3'].apply(lambda x:x.sort_values()).values
Column1 Column2 Column3
0 25 1 2.0
1 89 2 3.0
2 59 3 NaN
3 78 10 10.0
4 99 20 20.0
5 38 30 NaN
6 89 100 100.0
7 57 200 300.0
8 87 300 NaN
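As a side note (my addition, not part of the original answer), the grouping key can also be built without computing n_grps first, which should also cope with a row count that is not an exact multiple of 3:
# integer-divide the positional index to group rows 0-2, 3-5, 6-8, ...
g = np.arange(len(df)) // 3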

How to iterate over rows and drop all other rows where column matches?

I'm trying to go row by row in a dataframe and delete any rows that have the same 'hole_ID', while keeping the original row, so that the nearest neighbour only searches in different holes. Here's what I have so far:
import pandas as pd
from io import StringIO

s1 = StringIO(u'''east,north,elev,hole_ID
11,11,5,A
51,51,6,A
61,61,11,A
21,21,2,B
31,31,3,B
71,71,3,B
81,81,4,B''')
df2 = pd.read_csv(s1)
for idx, row in df2.iterrows():
    dftype = df2.drop_duplicates(subset=['hole_ID'], keep='first')
This is what I get:
Out[20]:
east north elev hole_ID
0 11 11 5 A
3 21 21 2 B
And this is what I want to get:
Out[18]:
east north elev hole_ID
0 11 11 5 A
3 21 21 2 B
4 31 31 3 B
5 71 71 3 B
6 81 81 4 B
So for row 1, all the other rows with the same hole_ID ('A') are dropped.
EDIT: I need to do this for every row in the original data frame to perform a nearest neighbour calculation where the hole_ID's do not match.
Thanks in advance.
If you only want to drop duplicates where hole_ID is A, you can pd.concat two pieces: on one side the dataframe filtered to where that is true, with duplicates dropped, and on the other the rest of the rows:
pd.concat([
    df2[df2.hole_ID.eq('A')].drop_duplicates(subset=['hole_ID'], keep='first'),
    df2[df2.hole_ID.ne('A')]],
    axis=0)
east north elev hole_ID
0 11 11 5 A
3 21 21 2 B
4 31 31 3 B
5 71 71 3 B
6 81 81 4 B
I would create a function, and use Series.isin to be able to select more than one ID:
def remove_by_hole_ID(df, hole_ID):
    if not isinstance(hole_ID, list):
        hole_ID = [hole_ID]
    m = df['hole_ID'].isin(hole_ID)
    return pd.concat([df[m].drop_duplicates(subset='hole_ID'), df[~m]], sort=True)
print(remove_by_hole_ID(df2, 'A'))
east elev hole_ID north
0 11 5 A 11
3 21 2 B 21
4 31 3 B 31
5 71 3 B 71
6 81 4 B 81
print(remove_by_hole_ID(df2, ['A', 'B']))
east elev hole_ID north
0 11 5 A 11
3 21 2 B 21
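For the edit in the question (running this for every row so the nearest-neighbour search only looks at other holes), a rough sketch of my own could be to filter the candidates per row; the distance calculation itself is left as a placeholder:
# for each row, keep only candidates from a *different* hole;
# plug your nearest-neighbour / distance logic in where indicated
for idx, row in df2.iterrows():
    candidates = df2[df2['hole_ID'] != row['hole_ID']]
    # ... compute distances from `row` to `candidates` here ...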

How to get the value if the value is present in the xlsx sheet without knowing index number?

I have an unstructured xlsx file. I want to get the full row if a value is present anywhere in the sheet. For example:
A B C D F
abc 10 24 32 54
cdf 9 10 34 98
mgl 11 90 21 98
fgd 1 9 2 10
I want to check whether the value 10 is present in the sheet and, if so, get the full row values:
output =>
abc 10 24 32 54
cdf 9 10 34 98
fgd 1 9 2 10
Thanks for the contributions.
Use DataFrame.eq with DataFrame.any to test whether there is at least one True per row:
df = pd.read_excel('file.xlsx')
df1 = df[df.eq(10).any(axis=1)]
Or:
df1 = df[(df == 10).any(axis=1)]
print (df1)
A B C D F
0 abc 10 24 32 54
1 cdf 9 10 34 98
3 fgd 1 9 2 10
You can use pandas.DataFrame.isin followed by pandas.DataFrame.any:
df[df.isin([10]).any(axis = 1)]
A B C D F
0 abc 10 24 32 54
1 cdf 9 10 34 98
3 fgd 1 9 2 10
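If you ever need to match several values at once, isin already accepts a list, so something like this should work (an extra example from me, not part of the original answer):
# keep rows containing either 10 or 98 in any column
df[df.isin([10, 98]).any(axis=1)]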

DataFrame : Get the top n value of each type

I have a group of data like below
ID Type value_1 value_2
1 A 12 89
2 A 13 78
3 A 11 92
4 A 9 79
5 B 15 83
6 B 34 91
7 B 2 87
8 B 3 86
9 B 7 85
10 C 9 83
11 C 3 85
12 C 2 87
13 C 12 88
14 C 11 82
I want to get the top 3 members of each Type according to value_1. The only solution that occurs to me is: first, split each Type's data into its own dataframe, sort it by value_1 and take the top 3; then merge the results together.
But is there any simpler method to solve it? For easier discussion, I have the code below:
#coding:utf-8
import pandas as pd
_data = [
    ["1","A",12,89],
    ["2","A",13,78],
    ["3","A",11,92],
    ["4","A",9,79],
    ["5","B",15,83],
    ["6","B",34,91],
    ["7","B",2,87],
    ["8","B",3,86],
    ["9","B",7,85],
    ["10","C",9,83],
    ["11","C",3,85],
    ["12","C",2,87],
    ["13","C",12,88],
    ["14","C",11,82]
]
head= ["ID","type","value_1","value_2"]
df = pd.DataFrame(_data, columns=head)
Then we use groupby tail with sort_values:
newdf=df.sort_values(['type','value_1']).groupby('type').tail(3)
newdf
ID type value_1 value_2
2 3 A 11 92
0 1 A 12 89
1 2 A 13 78
8 9 B 7 85
4 5 B 15 83
5 6 B 34 91
9 10 C 9 83
13 14 C 11 82
12 13 C 12 88
Sure! DataFrame.groupby can split a dataframe into groups by the grouping fields, and apply can run a UDF on each group.
df.groupby('type', as_index=False, group_keys=False)\
.apply(lambda x: x.sort_values('value_1', ascending=False).head(3))
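Another option along the same lines (my addition, assuming the same df as above) would be to take the row labels from a per-group nlargest and select them with loc:
# nlargest per group returns a MultiIndex (type, original row label);
# the last level gives back the labels of the rows to select
idx = df.groupby('type')['value_1'].nlargest(3).index.get_level_values(-1)
df.loc[idx]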
