Pandas dataframe.at - python

I am doing a project for my faculty and I am getting an error. I am new to this language and am following a book, but it is a bit old, so some functions are outdated and I can't get past one error. I looked up the function dataframe.set_value and saw that it was replaced by dataframe.at.
It goes like this:
for index, row in dataset.iterrows():
    home_team = row["Home"]
    visitor_team = row["Away"]
    row["HomeLastWin"]=won_last[home_team]
    dataset.at(index, "HomeLastWin") = won_last[home_team]
    dataset.at(index, "VisitorLastWin") = won_last[visitor_team]
    won_last[home_team] = int(row["HomeWin"])
    won_last[visitor_team] = 1 - int(row["HomeWin"])
The original code found in the book was:
dataset.set_value(index,"HomeLastWin", won_last[home_team])
I understood that the parameters are dataset.at(What_row,What_column) = change_with_this.
The error I am getting is this:
File "<ipython-input-40-acfeaead26ef>", line 7
dataset.at(index, "HomeLastWin") = won_last[home_team]
^
SyntaxError: cannot assign to function call
Thank you for your time and answers!

See the pandas documentation for DataFrame.at.
.at is an indexer, not a method, so it takes square brackets rather than parentheses. Write .at[row, column] instead of .at(row, column):
dataset.at[index, "HomeLastWin"] = won_last[home_team]
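As a minimal sketch of the corrected loop, the example below runs the question's logic with .at[] on a small made-up table. The sample rows, the defaultdict, and the pre-created result columns are assumptions for illustration and are not taken from the book.

```python
from collections import defaultdict

import pandas as pd

# Hypothetical toy data; only the column names follow the question.
dataset = pd.DataFrame({
    "Home": ["A", "B", "A"],
    "Away": ["B", "C", "C"],
    "HomeWin": [True, False, True],
})
dataset["HomeLastWin"] = 0
dataset["VisitorLastWin"] = 0

# Assumption: a team with no previous game counts as "did not win last time".
won_last = defaultdict(int)

for index, row in dataset.iterrows():
    home_team = row["Home"]
    visitor_team = row["Away"]
    # Square brackets: .at is an indexer, so assignment is allowed.
    dataset.at[index, "HomeLastWin"] = won_last[home_team]
    dataset.at[index, "VisitorLastWin"] = won_last[visitor_team]
    won_last[home_team] = int(row["HomeWin"])
    won_last[visitor_team] = 1 - int(row["HomeWin"])

print(dataset)
```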

Can you use .index for 2D arrays (without using numpy)?

So I've been trying to use .index with 2D arrays just so I can do some tests on my code, but it comes up with an error saying that the value, which is in the list, actually isn't.
For example, in one project (where I was trying to revise network layering whilst practicing some coding), I tried doing this but it didn't work:
answers = [['Application Layer','HTTP','HTTPS','SMTP','IMAP','FTP'],['Transport Layer','TCP','UDP'],['Network Layer','ARP','IP','ICMP'],['Data Link layer']]
correct = 0
incorrect = 0
qs = answers[randint(0,3)][0]
print(answers.index(qs))
print(qs)
As you can see, I'm trying to get back the value of 'qs' by using index but no luck.
I've seen a few other posts saying to use numpy, but how would I do this without using numpy?
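answers.index(qs) only compares qs against the top-level elements of answers, which are the inner lists, so a string is never found. Loop over the inner lists and test membership instead. You can do it like this.
from random import randint

answers = [['Application Layer','HTTP','HTTPS','SMTP','IMAP','FTP'],['Transport Layer','TCP','UDP'],['Network Layer','ARP','IP','ICMP'],['Data Link layer']]
correct = 0
incorrect = 0
qs = answers[randint(0,3)][0]
# print the index of the first inner list that contains qs
for i, answer in enumerate(answers):
    if qs in answer:
        print(i)
        break
print(qs)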
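If both positions are needed, the same idea can be wrapped in a small helper. This is only a sketch: the name find_in_nested is made up here, and it returns the first match only.

```python
def find_in_nested(nested, value):
    """Return (row, column) of the first match in a list of lists, or None."""
    for row_index, row in enumerate(nested):
        if value in row:
            # list.index works here because row is a flat list of strings
            return row_index, row.index(value)
    return None


answers = [
    ['Application Layer', 'HTTP', 'HTTPS', 'SMTP', 'IMAP', 'FTP'],
    ['Transport Layer', 'TCP', 'UDP'],
    ['Network Layer', 'ARP', 'IP', 'ICMP'],
    ['Data Link layer'],
]

print(find_in_nested(answers, 'UDP'))
print(find_in_nested(answers, 'Bluetooth'))
```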

ValueError: BitVects must be same length (rdkit)

I am calculating the structure similarity profile between two molecules using RDKit. When I run the program in Google Colab (rdkit=2020.09.2, python=3.7) it works fine.
I get an error when I run it on my PC (rdkit=2021.03.2, python=3.8.5). The error is a bit strange: the dataframe contains 500 rows, the code works only for the first 10 rows (0-9), and for later rows I get this error:
s = DataStructs.BulkTanimotoSimilarity(fps_2[n], fps_2[n+1:])
    ValueError: BitVects must be same length
The block of code is given below
data = pd.read_csv(os.path.join(os.path.join(os.getcwd(), "dataset"), "test_ssp.csv"), index_col=None)
#Proff and make a list of Smiles and id
c_smiles = []
count = 0
for index, row in data.iterrows():
    try:
        cs = Chem.CanonSmiles(row['SMILES'])
        c_smiles.append([row['ID_Name'], cs])
    except:
        count = count + 1
        print('Count Invalid SMILES:', count, row['ID_Name'], row['SMILES'])
# make a list of id, smiles, and mols
ms = []
df = DataFrame(c_smiles,columns=['ID_Name','SMILES'])
for index, row in df.iterrows():
    mol = Chem.MolFromSmiles(row['SMILES'])
    ms.append([row['ID_Name'], row['SMILES'], mol])
# make a list of id, smiles, mols, and fingerprints (fp)
fps = []
df_fps = DataFrame(ms,columns=['ID_Name','SMILES', 'mol'])
df_fps.head
for index, row in df_fps.iterrows():
    fps_cal = FingerprintMols.FingerprintMol(row['mol'])
    fps.append([row['ID_Name'], fps_cal])
fps_2 = DataFrame(fps,columns=['ID_Name','fps'])
fps_2 = fps_2[fps_2.columns[1]]
fps_2 = fps_2.values.tolist()
# compare all fp pairwise without duplicates
for n in range(len(fps_2)):
    s = DataStructs.BulkTanimotoSimilarity(fps_2[n], fps_2[n+1:])
    for m in range(len(s)):
        qu.append(c_smiles2[n])
        ta.append(c_smiles2[n+1:][m])
        sim.append(s[m])
Can you tell me why I am getting this error on my PC while the code works fine in Google Colab? How can I solve the issue? Is there any way to install rdkit=2020.09.2?
Reproducible Data
DB00607 [H][C@]12SC(C)(C)[C@@H](N1C(=O)[C@H]2NC(=O)C1=C(OCC)C=CC2=CC=CC=C12)C(O)=O
DB01059 CCN1C=C(C(O)=O)C(=O)C2=CC(F)=C(C=C12)N1CCNCC1
DB09128 O=C1NC2=CC(OCCCCN3CCN(CC3)C3=C4C=CSC4=CC=C3)=CC=C2C=C1
DB04908 FC(F)(F)C1=CC(=CC=C1)N1CCN(CCN2C(=O)NC3=CC=CC=C23)CC1
DB09083 COC1=C(OC)C=C2[C@@H](CN(C)CCCN3CCC4=CC(OC)=C(OC)C=C4CC3=O)CC2=C1
DB08820 CC(C)(C)C1=CC(=C(O)C=C1NC(=O)C1=CNC2=CC=CC=C2C1=O)C(C)(C)C
DB08815 [H][C@@]12[C@H]3CC[C@H](C3)[C@]1([H])C(=O)N(C[C@@H]1CCCC[C@H]1CN1CCN(CC1)C1=NSC3=CC=CC=C13)C2=O
DB09143 [H][C@]1(C)CN(C[C@@]([H])(C)O1)C1=CC=C(NC(=O)C2=CC=CC(=C2C)C2=CC=C(OC(F)(F)F)C=C2)C=N1
DB06237 COC1=C(Cl)C=C(CNC2=C(C=NC(=N2)N2CCC[C@H]2CO)C(=O)NCC2=NC=CC=N2)C=C1
DB01166 O=C1CCC2=C(N1)C=CC(OCCCCC1=NN=NN1C1CCCCC1)=C2
DB00813 CCC(=O)N(C1CCN(CCC2=CC=CC=C2)CC1)C1=CC=CC=C1
To answer first how to install a specific version of RDKit, you can run this command:
conda install -c rdkit rdkit=2020.09.2
As for the original question, the error is caused by the function:
FingerprintMols.FingerprintMol()
For whatever internal reason, it converts the first 10 SMILES to 2048-bit vectors but the 11th SMILES to a 1024-bit vector. Older versions are able to handle this mismatch but newer versions can't. There are two options to fix this:
1) Downgrade RDKit to an older version using the command I mentioned above.
2) Fix the length of the vector by passing it as an argument. Basically, replace the line
FingerprintMols.FingerprintMol(row['mol'])
with
FingerprintMols.FingerprintMol(row['mol'], minPath=1, maxPath=7, fpSize=2048,
                               bitsPerHash=2, useHs=True, tgtDensity=0.0,
                               minSize=128)
In the replacement, all arguments other than fpSize are set to their default values and fpSize is fixed to 2048. Please note that you must pass all the arguments and not just fpSize.
Just to extend mnis's answer: since FingerprintMol defaults to RDKFingerprint, you may find it easier to use that directly. It is much more flexible, and you will not have to supply all the arguments. Tested on version 2021.03.3:
Chem.RDKFingerprint(row['mol'], fpSize=2048)
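To see why the vector lengths have to match, here is a plain-Python sketch of Tanimoto similarity over 0/1 lists. It is not RDKit's implementation; the function name, the list representation, and the choice to return 0.0 when no bit is set are assumptions made for illustration.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity of two equal-length 0/1 sequences."""
    if len(bits_a) != len(bits_b):
        # Positions only correspond when both vectors have the same size.
        raise ValueError("bit vectors must be same length")
    both = sum(1 for a, b in zip(bits_a, bits_b) if a and b)
    either = sum(1 for a, b in zip(bits_a, bits_b) if a or b)
    # Assumption of this sketch: two all-zero vectors score 0.0.
    return both / either if either else 0.0


print(tanimoto([1, 1, 0, 0], [1, 0, 1, 0]))
```

Bit i of one fingerprint is compared with bit i of the other, so a 2048-bit and a 1024-bit fingerprint have no meaningful position-by-position comparison until they are brought to the same size.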

Anova Analysis Python_Urgent

I hope I can be as clear as possible.
I have an Excel file with 400 subjects for a study, and for each one of them I have their age, their sex and 40 more columns of biological variables.
E.g.: CODE0001; (age)20; M\F; Biovalue1; BioValue 2 ..... Biovalue 40.
My goal is to analyze these data with a one-way ANOVA because I think it's the best option I have. I'm trying to do it (even using this guide https://www.marsja.se/four-ways-to-conduct-one-way-anovas-using-python/ ) but there's always a problem with the code.
So: how can I set up my data in order to be able to use the code for example from that website?
I've already done Dataset.mean() and Dataset.std() for all the data, but I can't use for example the value "Mean Age" because it seems like Jupyter only reads it as a string and not a value.
I'm in a deep state of confusion, so any kind of help will be super appreciated!!!
Thank you in advance
I'm sorry, but I didn't understand. I'm relatively new to Python, so maybe I didn't explain myself properly.
I need to do an ANOVA analysis:
1) First I did this:
AnalisiISAD.mean()
2) Then I made a list from that:
MeanList = [......]
3) Then I proceeded with the ANOVA script:
AnalisiI.boxplot('MeanList', by='AgeT0', figsize=(12,8))
ctrl = Analisi['MeanList'][Analisi == 'ctrl']
grps = pd.unique(Analisi.group.values)
d_data = {grp:Analisi['MeanList'][Analisi.group ==grp] for grp in grps}
k = len(pd.unique(Analisi.group))
N = len(Analisi.values)
n = Analisi.groupby('AgeT0').size()[0]
but this error occurs: KeyError: 'Column not found: MeanList'
Does this mean I have to create a new column in the excel file? How do I do that?
When using df.mean() or df.std(), try changing the data to pd.Series first and run it.
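On the KeyError itself: a separate Python list such as MeanList is not a column, so the DataFrame cannot find it by name. A minimal sketch of one way around this follows, assuming the per-subject mean of the biological columns is what should be compared between groups. The data, the column names group, Bio1, Bio2 and MeanBio, and that choice of response variable are all hypothetical.

```python
import pandas as pd

# Hypothetical data standing in for the Excel sheet.
df = pd.DataFrame({
    "group": ["ctrl", "ctrl", "trt", "trt"],
    "Bio1": [1.0, 2.0, 3.0, 4.0],
    "Bio2": [3.0, 2.0, 5.0, 6.0],
})

# Store the per-subject mean as a real column so it can be looked up by name.
df["MeanBio"] = df[["Bio1", "Bio2"]].mean(axis=1)

grouped = df.groupby("group")["MeanBio"]
k = grouped.ngroups          # number of groups
N = len(df)                  # number of subjects
grand_mean = df["MeanBio"].mean()

# One-way ANOVA F statistic computed by hand from the sums of squares.
ss_between = (grouped.size() * (grouped.mean() - grand_mean) ** 2).sum()
ss_within = ((df["MeanBio"] - grouped.transform("mean")) ** 2).sum()
f_stat = (ss_between / (k - 1)) / (ss_within / (N - k))
print(f_stat)
```

Once the column exists in the DataFrame, calls such as df.boxplot('MeanBio', by='group') can find it.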

I'm trying to make a simple script that says two different two phrase lines(Python)

So, I'm just starting to program in Python and I wanted to make a very simple script that will say something like "Gabe- Hello, my name is Gabe" (just an example of a sentence) + "Jerry- Hello Gabe, I'm Jerry" OR "Gabe- Goodbye, Jerry" + "Jerry- Goodbye, Gabe". Here's pretty much what I wrote.
answers1 = [
    "James-Hello, my name is James!"
]
answers2 = [
    "Jerry-Hello James, my name is Jerry!"
]
answers3 = [
    "Gabe-Goodbye, Samuel."
]
answers4 = [
    "Samuel-Goodbye, Gabe"
]
Jack1 = (answers1 + answers2)
Jack2 = (answers3 + answers4)
Jacks = ([Jack1,Jack2])
import random
for x in range(2):
    a = random.randint(0,2)
    print (random.sample([Jacks, a]))
I'm quite sure it's a very simple fix, but as I have just started Python (Like, literally 2-3 days ago) I don't quite know what the problem would be. Here's my error message
Traceback (most recent call last):
File "C:/Users/Owner/Documents/Test Python 3.py", line 19, in <module>
print (random.sample([Jacks, a]))
TypeError: sample() missing 1 required positional argument: 'k'
If anyone could help me with this, I would very much appreciate it! Other than that, I shall be searching on ways that may be relevant to fixing this.
The problem is that sample requires a parameter k that indicates how many random samples you want to take. However, in this case it looks like you do not need sample, since you already have the random integer. Note that the integer should be in the range [0,1], because randint includes both endpoints and the list Jacks has only two elements.
a = random.randint(0,1)
print (Jacks[a])
Or, for similar behavior with sample, which returns a list of k randomly chosen elements:
print (random.sample(Jacks,1))
Hope this helps!
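For picking exactly one item, random.choice is another standard-library option. A minimal sketch, with the variable names greetings, farewells and conversations made up here:

```python
import random

greetings = [
    "James-Hello, my name is James!",
    "Jerry-Hello James, my name is Jerry!",
]
farewells = [
    "Gabe-Goodbye, Samuel.",
    "Samuel-Goodbye, Gabe",
]
conversations = [greetings, farewells]

# random.choice returns one element of the list, here one two-line exchange.
chosen = random.choice(conversations)
for line in chosen:
    print(line)
```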
random.sample([Jacks, a])
This call to sample should look like
random.sample(Jacks, a)
However, I am concerned that you may be misunderstanding how lists work. Can you explain why you are using lists of strings and then adding them together? I am losing you here.
If you are going to pick a pair of strings, use the method described by Florian (requesting data by index value).
The k parameter tells random.sample how many samples you need. k cannot be larger than the number of elements in the list, and Jacks has two elements, so you should write:
print (random.sample(Jacks, 1))
which means you need 1 sample from your list. The output will be a list containing one of the two conversations.

Intramolecular protein residue contact map using biopython, KeyError: 'CA'

I am trying to identify amino acid residues in contact in the 3D protein structure. I am new to BioPython but found this helpful website http://www2.warwick.ac.uk/fac/sci/moac/people/students/peter_cock/python/protein_contact_map/
Following their lead (which I will reproduce here for completeness; note, however, that I am using a different protein):
import Bio.PDB
import numpy as np
pdb_code = "1QHW"
pdb_filename = "1qhw.pdb"
def calc_residue_dist(residue_one, residue_two) :
    """Returns the C-alpha distance between two residues"""
    diff_vector = residue_one["CA"].coord - residue_two["CA"].coord
    return np.sqrt(np.sum(diff_vector * diff_vector))
def calc_dist_matrix(chain_one, chain_two) :
    """Returns a matrix of C-alpha distances between two chains"""
    # the np.float alias was removed in NumPy 1.24; the builtin float is equivalent
    answer = np.zeros((len(chain_one), len(chain_two)), float)
    for row, residue_one in enumerate(chain_one) :
        for col, residue_two in enumerate(chain_two) :
            answer[row, col] = calc_residue_dist(residue_one, residue_two)
    return answer
structure = Bio.PDB.PDBParser().get_structure(pdb_code, pdb_filename)
model = structure[0]
dist_matrix = calc_dist_matrix(model["A"], model["A"])
But when I run the above code, I get the following error message:
Traceback (most recent call last):
File "<ipython-input-26-7239fb7ebe14>", line 4, in <module>
dist_matrix = calc_dist_matrix(model["A"], model["A"])
File "<ipython-input-3-730a11883f27>", line 15, in calc_dist_matrix
answer[row, col] = calc_residue_dist(residue_one, residue_two)
File "<ipython-input-3-730a11883f27>", line 6, in calc_residue_dist
diff_vector = residue_one["CA"].coord - residue_two["CA"].coord
File "/Users/anaconda/lib/python3.6/site-packages/Bio/PDB/Entity.py", line 39, in __getitem__
return self.child_dict[id]
KeyError: 'CA'
Any suggestions on how to fix this issue?
You have heteroatoms (water, ions, etc.; anything that isn't an amino acid or nucleic acid) in your structure. Remove them with:
chain = model["A"]
for residue in list(chain):  # iterate over a copy, because detach_child mutates the chain
    if residue.id[0] != ' ':
        chain.detach_child(residue.id)
This will remove them from your entire structure. You may want to modify this if you want to keep the heteroatoms for further analysis.
I believe the problem is that some of the elements in model["A"] are not amino acids and therefore do not contain "CA".
To get around this, I wrote a new function which returns only the amino acid residues:
from Bio.PDB import *
chain = model["A"]
def aa_residues(chain):
    aa_only = []
    for i in chain:
        if i.get_resname() in standard_aa_names:
            aa_only.append(i)
    return aa_only
AA_1 = aa_residues(model["A"])
dist_matrix = calc_dist_matrix(AA_1, AA_1)
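A related sketch filters on the atom itself instead of the residue name, which also skips amino acids whose CA atom is missing from the file. It relies on the membership test `"CA" in residue`, which Biopython residues are assumed to support here; the helper name residues_with_atom is made up, and plain dicts stand in for residues so the snippet runs without a PDB file.

```python
def residues_with_atom(residues, atom_name="CA"):
    """Keep only the residues that contain the named atom."""
    return [residue for residue in residues if atom_name in residue]


# Stand-in data: dicts mapping atom names to dummy values mimic residues.
demo_chain = [{"N": 0, "CA": 1}, {"O": 2}, {"CA": 3}]
kept = residues_with_atom(demo_chain)
print(len(kept))
```

With the question's structure loaded, the filtered list would be passed on in the same way as AA_1, for example ca_only = residues_with_atom(model["A"]) followed by calc_dist_matrix(ca_only, ca_only).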
So I've been testing (bear in mind I know very little about Bio) and it looks like whatever is in your 1qhw.pdb file is very different from the one in that example.
pdb_code = '1qhw'
structure = Bio.PDB.PDBParser().get_structure(pdb_code, pdb_filename)
model = structure[0]
Next, to see what is in it, I did:
print(list(model))
Which gave me:
[<Chain id=A>]
Exploring this, it appears the parsed PDB structure behaves like a dict of dicts. So, using this id,
test = model['A']
gives me the next dict. This level is the level being passed to your function that is causing the error. Printing this with:
print(list(test))
This gave me a huge list of the data inside, including lots of residues and related info. But crucially, no CA. Try using this to see what's inside and modify the line:
diff_vector = residue_one["CA"].coord - residue_two["CA"].coord
to reflect what you are after, replacing CA where appropriate.
I hope this helps; it's a little tricky to get much more specific.
Another solution to obtain the contact map for a protein chain is to use the PdbParser shipped with ConKit.
ConKit is a library specifically designed to work with predicted contacts but has the functionality to extract contacts from a PDB file:
>>> from conkit.io.PdbIO import PdbParser
>>> p = PdbParser()
>>> with open("1qhw.pdb", "r") as pdb_fhandle:
...     pdb = p.read(pdb_fhandle, f_id="1QHW", atom_type="CA")
>>> print(pdb)
ContactFile(id="1QHW_0" nmaps=1)
This reads your PDB file into the pdb variable, which stores an internal ContactFile hierarchy. In this example, two residues are considered to be in contact if the participating CA atoms are within 8Å of each other.
To access the information, you can then iterate through the ContactFile and access each ContactMap, which in your case corresponds to intra-molecular contacts for chain A.
>>> for cmap in pdb:
...     print(cmap)
ContactMap(id="A", ncontacts=1601)
If you had more than one chain, there would be a ContactMap for each chain, and additional ones for inter-molecular contacts between chains.
The ContactMap for chain A contains 1601 contact pairs. You can access the Contact instances in each ContactMap by either iterating or indexing. Both work fine.
>>> print(cmap[0])
Contact(id="(26, 27)" res1="S" res1_chain="A" res1_seq=26 res2="T" res2_chain="A" res2_seq=27 raw_score=0.961895)
Each level in the hierarchy has various functions with which you can manipulate contact maps. Examples can be found in the ConKit documentation.
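The 8 Å rule mentioned above can be sketched in plain Python as a distance threshold over CA coordinates. This is not ConKit's implementation; the function name, the inclusive comparison, and the sample coordinates are assumptions for illustration.

```python
import math


def contact_pairs(coords, cutoff=8.0):
    """Return index pairs (i, j) with i < j whose points lie within cutoff."""
    pairs = []
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            # math.dist (Python 3.8+) gives the Euclidean distance
            if math.dist(coords[i], coords[j]) <= cutoff:
                pairs.append((i, j))
    return pairs


# Hypothetical CA coordinates in angstroms.
demo_coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (20.0, 0.0, 0.0)]
demo_contacts = contact_pairs(demo_coords)
print(demo_contacts)
```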
