I don't exactly know how to describe the issue I'm having, so I'll just show it.
I have 2 data tables, and I'm using regex to search through and extract values in those tables based on if it matches with the correct word. I'll put the whole script for reference.
import re
import os
import pandas as pd
import numpy as np
os.chdir('C:/Users/Sams PC/Desktop')
f=open('test5.txt', 'w')
NHSQC=pd.read_csv('NHSQC.txt', sep='\s+', header=None)
NHSQC.columns=['Column_1','Column_2','Column_3']
HNCA=pd.read_csv('HNCA.txt', sep='\s+', header=None)
HNCA.columns=['Column_1','Column_2','Column_3','Column_4']
x=re.findall('[A-Z][0-9][0-9][A-Z]-[H][N]',str(NHSQC))
y=re.findall('[A-Z][0-9][0-9][A-Z]-[C][A]-[H][N]',str(HNCA))
print (NHSQC)
print (HNCA)
print(x)
print (y)
data=[]
label=[]
for i in range(0, 6):
    if x[i] in str(NHSQC):
        data2=NHSQC.set_index('Column_1',drop=False)
        data3=(data2.loc[str(x[i]), 'Column_2':'Column_3'])
        data.extend(list(data3))
        a=[x[i]]
        label.extend(a)
        label.extend(a)
        if y[i] in str(HNCA):
            data2=HNCA.set_index('Column_1',drop=False)
            data3=(data2.loc[str(y[i]),'Column_3'])
            data.append(data3)
            a=[y[i]]
            label.extend(a)
        else:
            print('Not Found')
    else:
        print('Not Found')
data6=[label,data]
matrix=data6
data5=np.transpose(matrix)
print(data5)
f.write(str(data5))
f.close()
This script does exactly what I want, and it works as intended on my test data files, but it fails on my actual data files. This is the output:
Column_1 Column_2 Column_3
0 S31N-HN 114.424 7.390
1 Y32N-HN 121.981 7.468
2 Q33N-HN 120.740 8.578
3 A34N-HN 118.317 7.561
4 G35N-HN 106.764 7.870
.. ... ... ...
89 R170N-HN 118.078 7.992
90 S171N-HN 110.960 7.930
91 R172N-HN 119.112 7.268
92 999_XN-HN 116.703 8.096
93 1000_XN-HN 117.530 8.040
[94 rows x 3 columns]
Column_1 Column_2 Column_3 Column_4
0 Assignment w1 w2 w3
1 S31N-A30CA-S31HN 114.424 54.808 7.393
2 S31N-A30CA-S31HN 126.854 53.005 9.277
3 S31N-CA-HN 114.424 61.717 7.391
4 S31N-HA-HN 126.864 59.633 9.287
.. ... ... ... ...
173 R170N-CA-HN 118.016 60.302 7.999
174 S171N-R170CA-S171HN 110.960 60.239 7.932
175 S171N-CA-HN 110.960 60.946 7.931
176 R172N-S171CA-R172HN 119.112 60.895 7.264
177 R172N-CA-HN 119.112 55.093 7.265
[178 rows x 4 columns]
['S31N-HN', 'Y32N-HN', 'Q33N-HN', 'A34N-HN', 'G35N-HN']
['S31N-CA-HN']
Traceback (most recent call last):
File "test.py", line 29, in <module>
if y[i] in str(HNCA):
IndexError: list index out of range
As you can see, there is an issue: my regex for y isn't finding all the values, and my x regex is only finding 5 matches instead of the hundreds it should. Initially I thought this was just a display thing (that it wasn't displaying the hundreds of matches since that would take too long), and I also thought the ... in the middle of the printed table was for display purposes. However, if I copy part of my HNCA.txt data and save it as a separate file, it fixes the issue.
[94 rows x 3 columns]
Column_1 Column_2 Column_3 Column_4
0 Assignment w1 w2 w3
1 S31N-A30CA-S31HN 114.424 54.808 7.393
2 S31N-A30CA-S31HN 126.854 53.005 9.277
3 S31N-CA-HN 114.424 61.717 7.391
4 S31N-HA-HN 126.864 59.633 9.287
5 Y32N-S31CA-Y32HN 121.981 61.674 7.467
6 Y32N-CA-HN 121.981 60.789 7.469
7 Q33N-Y32CA-Q33HN 120.770 60.775 8.582
8 Q33N-CA-HN 120.701 58.706 8.585
9 A34N-Q33CA-A34HN 118.317 58.740 7.559
10 A34N-CA-HN 118.317 52.260 7.565
11 G35N-A34CA-G35HN 106.764 52.195 7.868
12 G35N-CA-HN 106.764 46.507 7.868
13 R36N-G35CA-R36HN 117.833 46.414 8.111
14 R36N-CA-HN 117.833 54.858 8.112
15 G37N-R36CA-G37HN 110.365 54.808 8.482
16 G37N-CA-HN 110.365 44.901 8.484
17 I55N-CA-HN 118.132 65.360 7.935
18 Y56N-I55CA-Y56HN 123.025 65.464 8.088
19 Y56N-CA-HN 123.025 62.195 8.082
20 A57N-Y56CA-A57HN 120.470 62.159 7.978
21 A57N-CA-HN 120.447 55.522 7.980
22 S72N-K71CA-S72HN 117.239 55.390 8.368
23 S72N-CA-HN 117.259 58.583 8.362
24 C73N-S72CA-C73HN 128.142 58.569 9.690
25 C73N-CA-HN 128.142 61.410 9.677
26 G74N-C73CA-G74HN 116.187 61.439 9.439
27 G74N-CA-HN 116.194 46.528 9.437
28 H75N-G74CA-H75HN 122.640 46.307 9.642
29 H75N-CA-HN 122.621 56.784 9.644
30 C76N-H75CA-C76HN 122.775 56.741 7.152
31 C76N-CA-HN 122.738 57.527 7.146
32 R77N-C76CA-R77HN 120.104 57.532 8.724
33 R77N-CA-HN 120.135 59.674 8.731
['S31N-HN', 'Y32N-HN', 'Q33N-HN', 'A34N-HN', 'G35N-HN']
['S31N-CA-HN', 'Y32N-CA-HN', 'Q33N-CA-HN', 'A34N-CA-HN', 'G35N-CA-HN', 'R36N-CA-HN', 'G37N-CA-HN', 'I55N-CA-HN', 'Y56N-CA-HN', 'A57N-CA-HN', 'S72N-CA-HN', 'C73N-CA-HN', 'G74N-CA-HN', 'H75N-CA-HN', 'C76N-CA-HN', 'R77N-CA-HN']
[['S31N-HN' '114.42399999999999']
I won't post the whole output, but as you can see, it now finds all the proper matches. It's also now displaying the entire table, instead of printing ... and showing only the top and bottom halves. I don't exactly understand where this issue is arising from. Why does it display only the top and bottom half of my table, but show the entire thing if I copy and paste the data to another file? And why does regex not search through the entire table even when it isn't all displayed? The fact that it shows the top and bottom halves makes me think the entire table is there and the display is just being simplified, but why would what's being displayed affect what regex searches?
Why is Python only displaying the top and bottom portions of your table?
Python classes can define two "magic" methods:
__repr__(), which is supposed to produce a "representation" of the object as a string, and which has a pretty useless default implementation for most objects; and
__str__(), which is supposed to produce a readable "string" of the object, and which falls back to __repr__().
When the line x=re.findall('[A-Z][0-9][0-9][A-Z]-[H][N]',str(NHSQC)) is run, that last str(NHSQC) bit tells Python to call NHSQC.__str__(), which falls back to NHSQC.__repr__() (both are described in the Python data model documentation).
The developers of the pandas library implemented DataFrame.__repr__() so that, depending on certain global display options, it produces a string that does not fully represent the underlying data. By default a large DataFrame is truncated to show only the first 5 and last 5 rows, with ellipses (...) telling you that there are bits missing. Thus, as you suspected, you were only calling re.findall on the first 5 and last 5 rows of the DataFrame.
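If you just want to see (or string-search) the whole frame, you can lift that limit temporarily; a minimal sketch using pandas' display.max_rows option:

import pandas as pd

# Temporarily remove the row limit so repr()/str() render every row,
# without changing the global setting for the rest of the program.
with pd.option_context("display.max_rows", None):
    print(NHSQC)  # prints all rows, no "..." truncation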
What should you do instead?
Using str(NHSQC) is probably not what you intend to do. This converts the entire DataFrame into an (incomplete) string representation, then runs the regular expression search over that entire string. That's extremely inefficient, so why not use the Series.str methods instead?
For instance, you appear to be lining up Column_2 and Column_3 of the NHSQC rows whose Column_1 matches the first regex with Column_3 of the HNCA rows whose Column_1 matches the second regex, right?
df1 = NHSQC.loc[NHSQC["Column_1"].str.match(re.compile("[A-Z][0-9][0-9][A-Z]-HN"))]
df2 = HNCA.loc[HNCA["Column_1"].str.match(re.compile("[A-Z][0-9][0-9][A-Z]-CA-HN")), ["Column_1", "Column_3"]]
Those lines will select the requisite rows and columns from the two DataFrames using Series.str.match on Column_1.
long1 = df1.melt(id_vars=["Column_1"]).drop("variable", axis="columns")
long2 = df2.rename(columns={"Column_3": "value"})
The first line uses DataFrame.melt to turn the three columns of df1 into a "longer" version, with Column_1 as an identifier, variable holding either the string "Column_2" or "Column_3", and value containing the numbers you actually care about and print at the end of your program. The variable column isn't used afterwards, so it is dropped. The DataFrame df2 doesn't need to be converted to a longer format because it only has two columns, so we just rename Column_3 to value.
extra_long = pd.concat([long1, long2])
print(extra_long.to_numpy())
This just concatenates the two long DataFrames together, turns them into a numpy array, then prints it out.
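Put together on a small made-up frame (column names taken from the question, values invented for illustration), the whole approach might look like this:

import pandas as pd

# Toy stand-ins for NHSQC and HNCA, shaped like the question's tables.
NHSQC = pd.DataFrame({
    "Column_1": ["S31N-HN", "Y32N-HN"],
    "Column_2": [114.424, 121.981],
    "Column_3": [7.390, 7.468],
})
HNCA = pd.DataFrame({
    "Column_1": ["S31N-A30CA-S31HN", "S31N-CA-HN", "Y32N-CA-HN"],
    "Column_2": [114.424, 114.424, 121.981],
    "Column_3": [54.808, 61.717, 60.789],
    "Column_4": [7.393, 7.391, 7.469],
})

# Row selection with Series.str.match instead of str(df) + re.findall.
df1 = NHSQC.loc[NHSQC["Column_1"].str.match("[A-Z][0-9][0-9][A-Z]-HN")]
df2 = HNCA.loc[HNCA["Column_1"].str.match("[A-Z][0-9][0-9][A-Z]-CA-HN"), ["Column_1", "Column_3"]]

# Reshape to long form and stack the two result sets.
long1 = df1.melt(id_vars=["Column_1"]).drop("variable", axis="columns")
long2 = df2.rename(columns={"Column_3": "value"})
extra_long = pd.concat([long1, long2])
print(extra_long.to_numpy())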
Related
I have the following dataframe
index,name,score,attempts,qualify
a,Anastasia,12.5,1,yes
b,Dima,9.0,3,no
c,Katherine,16.5,2,yes
d,James,NaN,3,no
e,Emily,9.0,2,no
I am trying to use the pandas map function to update the name column, changing James and Emily to a test value of 99.
codes = {'James':'99','Emily':'99'}
dff['name'] = dff['name'].map(codes)
dff
I am getting the following output -
index,name,score,attempts,qualify
a,NaN,12.5,1,yes
b,NaN,9.0,3,no
c,NaN,16.5,2,yes
d,99,NaN,3,no
e,99,9.0,2,no
Note that the name values James and Emily have been updated to 99, but the rest of the name values are mapped to NaN.
How can we ignore indexes which are not intended to be mapped?
The issue is that map applies the dictionary to every value in the 'name' column, not just the ones specified; anything missing from the dictionary becomes NaN. To get around this, you can use the replace method instead:
dff['name'] = dff['name'].replace({'James':'99','Emily':'99'})
This will replace only the specified values and leave the others unchanged.
I believe you may be looking for replace instead of map.
import pandas as pd
names = pd.Series([
    "Anastasia",
    "Dima",
    "Katherine",
    "James",
    "Emily"
])
names.replace({"James": "99", "Emily": "99"})
# 0 Anastasia
# 1 Dima
# 2 Katherine
# 3 99
# 4 99
# dtype: object
If you're really set on using map, then you have to provide a function that knows how to handle every single name it might encounter.
codes = {"James": "99", "Emily": "99"}
# If the lookup into `codes` fails,
# return the name that was used for lookup
names.map(lambda name: codes.get(name, name))
codes = {'James':'99',
'Emily':'99'}
dff['name'] = dff['name'].replace(codes)
dff
replace() satisfies the requirement -
index,name,score,attempts,qualify
a,Anastasia,12.5,1,yes
b,Dima,9.0,3,no
c,Katherine,16.5,2,yes
d,99,NaN,3,no
e,99,9.0,2,no
Another way to achieve it is to map and then fill the unmatched names back in:
codes = {'James':'99','Emily':'99'}
dff['name'] = dff['name'].map(codes).fillna(dff['name'])
dff
index name score attempts qualify
0 a Anastasia 12.5 1 yes
1 b Dima 9.0 3 no
2 c Katherine 16.5 2 yes
3 d 99 NaN 3 no
4 e 99 9.0 2 no
I have one dataframe whose format is shown below (it was originally given as an image).
Every row groups columns by data type: the first column is the ticker, the next three columns are one type of data, and columns 5-7 are a second type of data.
Now I want to transform this so that each data type becomes its own column and each group becomes a row.
Expected output is:
is there anyway to do this transformation in pandas using any API? I am doing it very basic way where creating a new dataframe for one group and then appending it.
here is one way to do it
use pd.melt to unstack the table, then split what used to be the column names (and are now rows) on "/" to separate them into two columns (txt, year)
create the new row key by combining ticker and year, then use pivot to get the desired result set
df2=df.melt(id_vars='ticker', var_name='col') # line missed in earlier solution,updated
df2[['txt','year']] = df.melt(id_vars='ticker', var_name='col')['col'].str.split('/', expand=True)
df2.assign(ticker2=df2['ticker'] + '/' + df2['year']).pivot(index='ticker2', columns='txt', values='value').reset_index()
Result set
txt ticker2 data1 data2
0 AAPL/2020 0.824676 0.616524
1 AAPL/2021 0.018540 0.046365
2 AAPL/2022 0.222349 0.729845
3 AMZ/2020 0.122288 0.087217
4 AMZ/2021 0.012168 0.734674
5 AMZ/2022 0.923501 0.437676
6 APPL/2020 0.886927 0.520650
7 APPL/2021 0.725515 0.543404
8 APPL/2022 0.211378 0.464898
9 GGL/2020 0.777676 0.052658
10 GGL/2021 0.297292 0.213876
11 GGL/2022 0.894150 0.185207
12 MICO/2020 0.898251 0.882252
13 MICO/2021 0.141342 0.105316
14 MICO/2022 0.440459 0.811005
Based on the code that you posted in the comment: I unfortunately missed a line when posting the solution. It's added now.
df2 = pd.DataFrame(np.random.randint(0, 100, size=(2, 6)),
                   columns=["data1/2020", "data1/2021", "data1/2022",
                            "data2/2020", "data2/2021", "data2/2022"])
ticker = ['APPL', 'MICO']
df2.insert(loc=0, column='ticker', value=ticker)
df2.head()
df3=df2.melt(id_vars='ticker', var_name='col') # missed line in earlier posting
df3[['txt','year']] = df2.melt(id_vars='ticker', var_name='col')['col'].str.split('/', expand=True)
df3.head()
df3.assign(ticker2=df3['ticker'] + '/' + df3['year']).pivot(index='ticker2', columns='txt', values='value').reset_index()
txt ticker2 data1 data2
0 APPL/2020 26 9
1 APPL/2021 75 59
2 APPL/2022 20 44
3 MICO/2020 79 90
4 MICO/2021 63 30
5 MICO/2022 73 91
To summarize as concisely as I can, I have a data file containing a list of chemical compounds along with their ID numbers ("CID" numbers). My goal is to use pubchempy's pubchempy.get_properties function along with pandas' df.map function to obtain the properties of each compound (there is one compound per row), using the "CID" number as an identifier. The parameters of pubchempy.get_properties are an identifier (the "CID" number in this case) and the property of the chemical that you want to obtain from the PubChem website (molecular weight in this case).
This is the code that I have written currently:
import pandas as pd
import pubchempy
import numpy as np
df = pd.read_csv("Data.tsv.txt", sep="\t")
from pubchempy import get_properties
df['CID'] = df['CID'].astype(str).apply(lambda x: x.replace('.0',''))
df['CID'] = df['CID'].astype(str).apply(lambda x: x.replace('0',''))
df = df.drop(df[df.CID=='nan'].index)
df = df.drop( df.index.to_list()[5:] ,axis = 0 )
df['CID']= df['CID'].map(lambda x: get_properties(identifier=x, properties='MolecularWeight') if float(x) > 0 else pd.NA)
df = df.rename(columns={'CID.': 'MolecularWeight'})
print(df)
This is the output that I was initially getting for that column (only including a few rows, in reality, dataset is very big):
MolecularWeight
[{'CID': 5339, 'MolecularWeight': '398.4'}]
[{'CID': 3889, 'MolecularWeight': '520.5'}]
[{'CID': 2788, 'MolecularWeight': '305.50'}]
[{'CID': 1422517, 'MolecularWeight': '440.5'}]
.
.
.
Now, the code was somewhat working in that it provided the molecular weight of the compound (398.4), but I didn't want all that extra text, nor the quote marks around the molecular weight number (both get in the way of the next bit of code that I plan to write).
So I then added this bit of code:
df['MolecularWeight'] = df.MolecularWeight[0][0].get('MolecularWeight')
This is the output that I am now getting:
MolecularWeight
398.4
398.4
398.4
398.4
.
.
.
What I want to do is pretty much exactly the same it's just that instead of getting the molecular weight of the first row in the MolecularWeight column and copying it onto all the other rows, I want to have the molecular weight value of each individual row in that column as the output.
What I was hoping to get is something like this:
MolecularWeight
398.4
520.5
305.50
440.5
.
.
.
Does anyone know how I can solve this issue? I've spent many hours trying to figure it out myself with no luck. I'd appreciate any help!
A few lines of the text file:
NO. compound_name IUPAC_name SMILES CID Inchi threshold reference group comments
1 sulphasalazine 2-hydroxy-5-[[4-(pyridin-2-ylsulfamoyl)phenyl]diazenyl]benzoic acid O=C(O)c1cc(N=Nc2ccc(S(=O)(=O)Nc3ccccn3)cc2)ccc1O 5339 InChI=1S/C18H14N4O5S/c23-16-9-6-13(11-15(16)18(24)25)21-20-12-4-7-14(8-5-12)28(26,27)22-17-3-1-2-10-19-17/h1-11,23H,(H,19,22)(H,24,25) R2|R2|R25|R46| A
2 moxalactam 7-[[2-carboxy-2-(4-hydroxyphenyl)acetyl]amino]-7-methoxy-3-[(1-methyltetrazol-5-yl)sulfanylmethyl]-8-oxo-5-oxa-1-azabicyclo[4.2.0]oct-2-ene-2-carboxylic acid COC1(NC(=O)C(C(=O)O)c2ccc(O)cc2)C(=O)N2C(C(=O)O)=C(CSc3nnnn3C)COC21 3889 InChI=1S/C20H20N6O9S/c1-25-19(22-23-24-25)36-8-10-7-35-18-20(34-2,17(33)26(18)13(10)16(31)32)21-14(28)12(15(29)30)9-3-5-11(27)6-4-9/h3-6,12,18,27H,7-8H2,1-2H3,(H,21,28)(H,29,30)(H,31,32) R25| A
3 clioquinol 5-chloro-7-iodoquinolin-8-ol Oc1c(I)cc(Cl)c2cccnc12 2788 InChI=1S/C9H5ClINO/c10-6-4-7(11)9(13)8-5(6)2-1-3-12-8/h1-4,13H R18|R26|R27| A
If you cast the column to float, that should help you: df['MolecularWeight'] = df['MolecularWeight'].astype(float).
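For the row-copying problem itself, a minimal sketch (assuming each cell of MolecularWeight still holds the one-element list of dicts that get_properties returned) would extract the value per cell rather than reusing row 0:

# Hypothetical per-cell extraction: pull the weight out of each cell's
# list-of-dicts instead of indexing row 0 for every row.
df['MolecularWeight'] = df['MolecularWeight'].map(
    lambda cell: float(cell[0]['MolecularWeight']) if isinstance(cell, list) else cell
)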
It appears that you may want multiple properties from each CID:
props = ['HBondDonorCount', 'RotatableBondCount', 'MolecularWeight']
df2 = pd.DataFrame(get_properties(identifier=df.CID.to_list(), properties=props))
print(df2)
Output:
    CID MolecularWeight  HBondDonorCount  RotatableBondCount
0  5339           398.4                3                   6
1  3889           520.5                4                   9
2  2788          305.50                1                   0
You can then merge this information onto the original dataframe:
df = df.merge(df2) # df = df.merge(pd.DataFrame(get_properties(identifier=df.CID.to_list(), properties=props)))
print(df)
...
NO. compound_name IUPAC_name SMILES CID Inchi threshold reference group comments MolecularWeight HBondDonorCount RotatableBondCount
0 1 sulphasalazine 2-hydroxy-5-[[4-(pyridin-2-ylsulfamoyl)phenyl]... O=C(O)c1cc(N=Nc2ccc(S(=O)(=O)Nc3ccccn3)cc2)ccc1O 5339 InChI=1S/C18H14N4O5S/c23-16-9-6-13(11-15(16)18... NaN R2|R2|R25|R46| A NaN 398.4 3 6
1 2 moxalactam 7-[[2-carboxy-2-(4-hydroxyphenyl)acetyl]amino]... COC1(NC(=O)C(C(=O)O)c2ccc(O)cc2)C(=O)N2C(C(=O)... 3889 InChI=1S/C20H20N6O9S/c1-25-19(22-23-24-25)36-8... NaN R25| A NaN 520.5 4 9
2 3 clioquinol 5-chloro-7-iodoquinolin-8-ol Oc1c(I)cc(Cl)c2cccnc12 2788 InChI=1S/C9H5ClINO/c10-6-4-7(11)9(13)8-5(6)2-1... NaN R18|R26|R27| A NaN 305.50 1 0
I'm cleaning a csv file with pandas, mainly removing special characters such as ('/', '#', etc). The file has 7 columns (none of which are dates).
In some of the columns, there's just numerical data such as '11/6/1980'.
I've noticed that directly after reading the csv file,
df = pd.read_csv ('report16.csv', encoding ='ANSI')
this data becomes '11/6/80', after cleaning it becomes '11 6 80' (it's the same result in the output file). So wherever the data has ' / ', it's being interpreted as a date and python is eliminating the first 2 digits from the data.
Data        Expected result   Actual Result
11/6/1980   11 6 1980         11 6 80
12/8/1983   12 8 1983         12 8 83
Both of the above rows are wrong in the Actual Result column: I'm losing 2 digits towards the end.
The data looks like this:

Org Name   Code        Code copy
ABC        11/6/1980   11/6/1980
DEF        12/8/1983   12/8/1983
GH         11/5/1987   11/5/1987
OrgName, Code, Code copy
ABC, 11/6/1980, 11/6/1980
DEF, 12/8/1983, 12/8/1983
GH, 11/5/1987, 11/5/1987
KC, 9000494, 9000494
It's worth mentioning that the column contains other data such as '900490', strings, etc., but in those instances there aren't any problems.
What could be done to not allow this conversion?
Not an answer, but comments do not allow including well-presented code and data.
Here is what I call a minimal reproducible example:
Content of the sample.csv file:
Data,Expected result,Actual Result
11/6/1980,11 6 1980,11 6 80
12/8/1983,12 8 1983,12 8 83
Code:
df = pd.read_csv('sample.csv')
print(df)
s = df['Data'].str.replace('/', ' ')
print((df['Expected result'] == s).all())
It gives :
Data Expected result Actual Result
0 11/6/1980 11 6 1980 11 6 80
1 12/8/1983 12 8 1983 12 8 83
True
This proves that read_csv has correctly read the file and has not changed anything.
PLEASE SHOW THE CONTENT OF YOUR CSV FILE AS TEXT, along with enough code to reproduce your problem.
How about trying a string operation? First select the column that you would like to modify and replace "/" or "#" with whitespace: column.str.replace("/", " "). I hope this is going to work!
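A minimal sketch of that suggestion (column names assumed from the question; reading everything as text with dtype=str also rules out any type inference):

import pandas as pd

df = pd.read_csv("report16.csv", encoding="ANSI", dtype=str)  # keep every column as text
for col in ["Code", "Code copy"]:  # column names assumed from the question
    df[col] = df[col].str.replace("/", " ", regex=False)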
The behavior of converting dates is not strictly a Python issue; you are using pandas' read_csv.
Try explicitly declaring a separator. If sep is not declared, read_csv defaults to a comma, but being explicit rules out delimiter surprises.
df = pd.read_csv ('report16.csv', encoding ='ANSI', sep =',')
This is an extension to a question I posted earlier: Python Sum lookup dynamic array table with df column
I'm currently investigating a way to efficiently map a decision variable to a dataframe. The main DF and the lookup table will be dynamic in length (+15,000 lines and +20 lines, respectively). Thus was hoping not to do this with a loop, but happy to hear suggestions.
The DF (DF1) will mostly look like the following, where I would like to lookup/search for the decision.
Where the decision value is found on a separate DF (DF0).
For example: the first DF1["ValuesWhereXYcomefrom"] value is 6.915, which falls between 3.8 <= value < 7.4 in the key table, and thus the corresponding DF0["Decision"] value is -1. The process then repeats until every line is mapped to a decision.
I was thinking to use the python bisect library, but have not prevailed to any working solution and also working with a loop. Now I'm wondering if I am looking at the problem incorrectly as mapping and looping 15k lines is time consuming.
Example Main Data (DF1):

time  Value0     Value1       Value2       ValuesWhereXYcomefrom  Value_toSum   Decision Map
1     41.43      6.579482077                                      0.00531021
2     41.650002  6.756817908  46.72466411  6.915187703            0.001200456   -1
3     41.700001  6.221966706  11.64727001  1.871959552            0.000959257   -1
4     41.740002  6.230847055  46.92753343  7.531485368            0.006228989    1
5     42         6.637399856  8.031374656  1.210018204            0.010238095   -1
6     42.43      7.484894608  16.24547568  2.170434793            -0.007777563  -1
7     42.099998  7.595291765  38.73871244  5.100358702            0.003562993   -1
8     42.25      7.567457423  37.07538953  4.899319211            0.01088755    -1
9     42.709999  8.234795546  64.27986403  7.805884636            0.005151042    1
10    42.93      8.369526407  24.72700129  2.954408659            -0.003028209  -1
11    42.799999  8.146653099  61.52243361  7.55186613             0              1
Example KeyTable (DF0):

ValueX       ValueY       SUM          Decision
0.203627201  3.803627201  0.040294925  -1
3.803627201  7.403627201  0.031630668  -1
7.403627201  11.0036272   0.011841521   1
Here's how I would go about this, assuming your first DataFrame is called df and your second is decision:
import numpy as np

def map_func(x):
    # Return the decision of the first key-table row whose upper bound (ValueY) exceeds x
    for i in range(len(decision)):
        try:
            if x < decision["ValueY"].iloc[i]:
                return decision["Decision"].iloc[i]
        except Exception:
            return np.nan

df["decision"] = df["ValuesWhereXYcomefrom"].apply(lambda x: map_func(x))
This will create a new column in your DataFrame called "decision" that contains the looked-up value. You can then just query it:
df.decision.iloc[row]
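If looping over 15,000 rows turns out to be slow, a vectorized alternative (my suggestion, not part of the answer above) is to bin the values with pd.cut, using the key table's interval bounds as bin edges:

import numpy as np
import pandas as pd

# Bin edges: the first ValueX plus every ValueY. Assumes the DF0 intervals
# are contiguous and sorted, as they are in the example key table.
edges = np.concatenate(([decision["ValueX"].iloc[0]], decision["ValueY"].to_numpy()))
df["decision"] = pd.cut(
    df["ValuesWhereXYcomefrom"],
    bins=edges,
    right=False,                  # intervals are [ValueX, ValueY), matching x < ValueY
    labels=decision["Decision"],
    ordered=False,                # the -1 label repeats, so labels are not unique
)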