I have a list of UniProt IDs and need the corresponding PDB IDs plus the chain IDs.
With the code given on the UniProt website I can get the PDB IDs, but not the chain information.
import urllib.parse
import urllib.request

url = 'https://www.uniprot.org/uploadlists/'
params = {
    'from': 'ACC+ID',
    'to': 'PDB_ID',
    'format': 'tab',
    'query': UniProtIDs  # space-separated string of UniProt accessions
}
data = urllib.parse.urlencode(params)
data = data.encode('utf-8')
req = urllib.request.Request(url, data)
with open('UniProt_PDB_IDs.txt', 'a') as f:
    with urllib.request.urlopen(req) as q:
        response = q.read()
        f.write(response.decode('utf-8'))
So this code gets me this:
From To
A0A075B6N1 5HHM
A0A075B6N1 5HHO
A0A075B6N1 5NQK
A0A075B6T6 1AO7
A0A075B6T6 4ZDH
For the protein A0A075B6N1 with PDB ID 5HHM the chains are E and J, so I also need a way to retrieve the chains, to get something like this:
A0A075B6N1 5HHM_E
A0A075B6N1 5HHM_J
A0A075B6N1 5HHO_E
A0A075B6N1 5NQK_B
It doesn't have to be in this format; later I convert it into a dictionary with the UniProt IDs as keys and the PDB IDs as values.
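For illustration, that conversion could look roughly like this (a sketch that reads the UniProt_PDB_IDs.txt file written above):
uniprot_to_pdb = {}
with open('UniProt_PDB_IDs.txt') as f:
    next(f)  # skip the "From To" header
    for line in f:
        uniprot_id, pdb_id = line.split()
        uniprot_to_pdb.setdefault(uniprot_id, []).append(pdb_id)
# e.g. {'A0A075B6N1': ['5HHM', '5HHO', '5NQK'], ...}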
Thank you for your help in advance!
A tool called localpdb was just recently released that might do exactly what you want: https://labstructbioinf.github.io/localpdb/.
Another way would be to split the structures by segments, which can easily be done with MDAnalysis Universe objects (https://www.mdanalysis.org). Assuming you have a list of PDB IDs:
import MDAnalysis as mda

# fetch structures
universe_objects = []
for pdb_id in pdb_ids:
    mmtf_object = mda.fetch_mmtf(pdb_id)
    universe_objects.append(mmtf_object)

# get rid of water and ligands and split structures into chains
universe_chains = []
for universe_object in universe_objects:
    universe_chain = universe_object.select_atoms('protein').split('segment')
    universe_chains.append(universe_chain)

# flatten nested list
universe_chain_list = [item for sublist in universe_chains for item in sublist]
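If you want identifiers like 5HHM_E, you can read the segment ID off each split chain. A rough sketch, assuming the MMTF segment IDs correspond to the chain letters you are after:
for pdb_id, chains in zip(pdb_ids, universe_chains):
    for chain in chains:
        print(f"{pdb_id}_{chain.segids[0]}")  # one label per chain, e.g. 5HHM_E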
Of course there are other tools you can do this with, e.g. ProDy's HierView function, sketched below.
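A minimal ProDy sketch (5HHM is just an example ID; the chain IDs come straight from the parsed structure):
from prody import parsePDB

structure = parsePDB('5HHM')
for chain in structure.getHierView():
    print(chain.getChid())  # prints each chain identifier, e.g. E, J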
Hope that helps.
I'm asking an API to look up part numbers I get from a user with a barcode scanner. The API returns a much longer document than the code block below; I trimmed a bunch of unnecessary empty elements, but the structure of the document is still the same. I need to put each part number in a dictionary where the value is the text inside of the <mfgr> element. With each run of my program, I generate a list of part numbers and have a loop that asks the API about each item in my list, and each request returns a huge document as expected. I'm a bit stuck on trying to parse the XML to get only the text inside the <mfgr> element and then save it to a dictionary with the part number it belongs to. I'll put the loop that goes through my list below the XML document.
<ArrayOfitem xmlns="WhereDataComesFrom.com" xmlns:i="http://www.w3.org/2001/XMLSchema-instance">
<item>
<associateditem_swap/>
<bulk>false</bulk>
<category>Memory</category>
<clei>false</clei>
<createddate>5/11/2021 7:34:58 PM</createddate>
<description>sample description</description>
<heci/>
<imageurl/>
<item_swap/>
<itemid>1640</itemid>
<itemnumber>**sample part number**</itemnumber>
<listprice>0.0000</listprice>
<manufactureritem/>
<maxavailable>66</maxavailable>
<mfgr>**sample manufacturer**</mfgr>
<minreorderqty>0</minreorderqty>
<noninventory>false</noninventory>
<primarylocation/>
<reorderpoint>0</reorderpoint>
<rep>AP</rep>
<type>Memory </type>
<updateddate>2/4/2022 2:22:51 PM</updateddate>
<warehouse>MAIN</warehouse>
</item>
</ArrayOfitem>
Below is my Python code that loops through the part number list and asks the API to look up each part number.
import http.client
import xml.etree.ElementTree as etree

raw_xml = None
pn_list = ["samplepart1", "samplepart2"]
api_key = **redacted lol**

def getMFGR():
    global raw_xml
    for part_number in pn_list:
        conn = http.client.HTTPSConnection("api.website.com")
        payload = ''
        headers = {
            'session-token': 'api_key',
            'Cookie': 'firstpartofmycookie; secondpartofmycookie'
        }
        conn.request("GET", "/webapi.svc/MI/XML/GetItemsByItemNumber?ItemNumber=" + part_number, payload, headers)
        res = conn.getresponse()
        data = res.read()
        raw_xml = data.decode("utf-8")
        print(raw_xml)
        print()

getMFGR()
Here is some code I tried while trying to get the mfgr. It will go inside the getMFGR() method inside the for loop so that it saves the manufacturer to a variable with each loop. Once the code works I want to have the dictionary look like this: {"samplepart1": "manufacturer1", "samplepart2": "manufacturer2"}.
root = etree.fromstring(raw_xml)
my_ns = {'root': 'WhereDataComesFrom.com'}
mfgr = root.findall('root:mfgr',my_ns)[0].text
The code above gives me a "list index out of range" error when I run it. I don't think it's searching past the namespace node, but I'm not sure how to tell it to search further.
This is where an interactive session becomes very useful. Drop your XML data into a file (say, data.xml), and then start up a Python REPL:
>>> import xml.etree.ElementTree as etree
>>> with open('data.xml') as fd:
... raw_xml=fd.read()
...
>>> root = etree.fromstring(raw_xml)
>>> my_ns = {'root': 'WhereDataComesFrom.com'}
Let's first look at your existing xpath expression:
>>> root.findall('root:mfgr',my_ns)
[]
That returns an empty list, which is why you're getting an "index out of range" error. You're getting an empty list because there is no mfgr element at the top level of the document; it's contained in an <item> element. So this will work:
>>> root.findall('root:item/root:mfgr',my_ns)
[<Element '{WhereDataComesFrom.com}mfgr' at 0x7fa5a45e2b60>]
To actually get the contents of that element:
>>> [x.text for x in root.findall('root:item/root:mfgr',my_ns)]
['**sample manufacturer**']
Hopefully that's enough to point you in the right direction.
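For example, to fold that back into your loop and build the dictionary you want, something like this sketch should work (extract_mfgr is a hypothetical helper, not a library function):
import xml.etree.ElementTree as etree

my_ns = {'root': 'WhereDataComesFrom.com'}

def extract_mfgr(raw_xml):
    # return the first <mfgr> text in one API response, or None
    found = etree.fromstring(raw_xml).findall('root:item/root:mfgr', my_ns)
    return found[0].text if found else None

# inside your getMFGR() loop, after raw_xml is assigned:
# mfgr_dict[part_number] = extract_mfgr(raw_xml)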
I suggest using pandas for this structure of XML:
import pandas as pd

# Read XML rows into a DataFrame
ns = {"xmlns": "WhereDataComesFrom.com", "xmlns:i": "http://www.w3.org/2001/XMLSchema-instance"}
df = pd.read_xml("parNo_plant.xml", xpath=".//xmlns:item", namespaces=ns)

# Print only the columns of interest
df_of_interest = df[['itemnumber', 'mfgr']]
print(df_of_interest, '\n')

# Print the dictionary from the DataFrame
print(df_of_interest.to_dict(orient='records'))

# If I understood right, you want this layout:
dictionary = dict(zip(df.itemnumber, df.mfgr))
print(dictionary)
Result (Pandas dataframe or dictionary):
itemnumber mfgr
0 **sample part number** **sample manufacturer**
[{'itemnumber': '**sample part number**', 'mfgr': '**sample manufacturer**'}]
{'**sample part number**': '**sample manufacturer**'}
from requests import get

res = get("https://subsplease.org/api/?f=latest&tz=canada/central").json()
kek = []
for x in res:
    kek.append(x)
lnk = res[kek[0]]['downloads']
anime_name = res[kek[0]]['show']
for x in lnk:
    quality = x['res']
    links = x['magnet']
    data = f"{anime_name}:\n\n{quality}: {links}\n\n"
    print(data)
In this code, how can I prevent the anime name from repeating?
If I move the print outside of the loop, only one link gets printed.
You can separate your string: print the first half outside the loop and the second half inside the loop:
print(f"{anime_name}:\n\n")
for x in lnk:
    quality = x['res']
    links = x['magnet']
    data = f"{quality}: {links}\n\n"
    print(data)
Rewrote it a bit. Make sure you look at a 'pretty' version of the JSON response using pprint or something similar, to understand where the elements are and where you can loop (remembering to iterate through the dict):
from requests import get

data = get("https://subsplease.org/api/?f=latest&tz=canada/central").json()
for show, info in data.items():
    print(show, '\n')
    for download in info['downloads']:
        print(download['magnet'])
        print(download['res'])
        print('\n')
Also you won't usually be able to just copy these links to get to the download, you usually need to use a torrent website.
I am trying to get a JSON response, decode it with UTF-8, and access the dictionaries in the list. The following is the JSON response:
'[{"id":26769687,"final_price":58.9,"payment_method_cost":"\\u003cem\\u003e+ 0,00 €\\u003c/em\\u003e \\u003cspan\\u003eΑντικαταβολή\\u003c/span\\u003e","net_price":53.9,"net_price_formatted":"53,90 €","final_price_formatted":"58,90 €","shop_id":649,"no_credit_card":false,"sorting_score":[-5.0,-156,-201,649,20],"payment_method_cost_supported":true,"free_shipping_cost_supported":false,"shipping_cost":"\\u003cem\\u003e+ 5,00 €\\u003c/em\\u003e \\u003cspan\\u003eΜεταφορικά\\u003c/span\\u003e","link":"/products/show/26769687"},
{"id":26771682,"final_price":55.17,"payment_method_cost":"\\u003cem\\u003e+ 2,83 €\\u003c/em\\u003e \\u003cspan\\u003eΑντικαταβολή\\u003c/span\\u003e","net_price":48.5,"net_price_formatted":"48,50 €","final_price_formatted":"55,17 €","shop_id":54,"no_credit_card":false,"sorting_score":[-3.6,-169,-84,54,10],"payment_method_cost_supported":true,"free_shipping_cost_supported":false,"shipping_cost":"\\u003cem\\u003e+ 3,84 €\\u003c/em\\u003e \\u003cspan\\u003eΜεταφορικά\\u003c/span\\u003e","link":"/products/show/26771682"}]'
which is produced by the following:
import time
import urllib.request
import numpy as np

url2besearched = 'https://www.skroutz.gr/personalization/20783507/product_prices.js?_=1569161647'
Delays = [25, 18, 24, 26, 20, 22, 19, 30]
no_of_pagedowns = 20
RandomDelays = np.random.choice(Delays)

# WAIT TIME
time.sleep(RandomDelays)

fp = urllib.request.urlopen(url2besearched)
mybytes = fp.read()
post_elems = []
mystr = mybytes.decode("utf8")
fp.close()
mystr1 = mystr.rsplit('=')
mystr2 = mystr1[1].split(";")
# I ADD THE FOLLOWING BECAUSE THE INITIAL DOES NOT HAVE ENDING BRACKETS
mystr3 = mystr2[0] + "}" + "]"
for d in mystr3:
    for key in d:
        post_elems.append([d[key], d['final_price'], d['shop_id']])
When I run the for loop, it iterates over mystr3 character by character instead of treating it as a list of dictionaries.
How can I get a list with the key of each dictionary plus final_price and shop_id?
My desired output is a list like:
post_elems = ['26769687', '58.9', '649']
First, the API you are calling gives a weird response for some reason, so .json() on the response will not work: there is an assignment in front of the JSON. It would be good to understand why, or to check that the URL query strings are correct. Anyway, you have already removed it with your splits, so I'll copy that code:
import requests, json
mystr = requests.get('https://www.skroutz.gr/personalization/20783507/product_prices.js?_=1569161647').text
mystr1 = mystr.rsplit('=')
mystr2 = mystr1[1].split(";")[0]
json.loads(mystr2)
This works. However, there are two things that are not great here. mystr1 is Systems Hungarian notation, which is very unpythonic; use type hinting to remind yourself what class something belongs to, not the variable name. Also, your mystr2 gives a list, a nice example of why Hungarian notation is bad.
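From there, building the list you asked for is just a comprehension over the parsed data. A sketch, assuming the split logic above holds for the live response:
import requests, json

text = requests.get('https://www.skroutz.gr/personalization/20783507/product_prices.js?_=1569161647').text
items = json.loads(text.rsplit('=')[1].split(';')[0])

# one [id, final_price, shop_id] entry per offer, cast to strings
post_elems = [[str(d['id']), str(d['final_price']), str(d['shop_id'])] for d in items]
print(post_elems[0])  # e.g. ['26769687', '58.9', '649']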
I am writing a program to extract text from a website and write it into a text file. Each entry in the text file should have 3 values separated by a tab. The first value is hard-coded to XXXX, the 2nd value should initialize to the first item on the website with a "style4" p class, and the third value is the next item on the website with a "style5" p class. The logic I'm trying to introduce is: look for the first "style4" and write the associated string into the text file, then find the next "style5" and write the associated string into the text file. Then, look for the next p class. If it's "style4", start a new line; if it's another "style5", write it into the text file with the first style5 entry but separated with a comma (alternatively, the program could just skip the next style5).
I'm stuck on the part of the program that looks for the next p class and evaluates it against style4 and style5. Since I was having problems with finding and evaluating the p class tag, I pulled my code out of the loop and just tried to accomplish the first iteration of the task for starters. Here's my code so far:
import urllib2
from bs4 import BeautifulSoup
soup = BeautifulSoup(urllib2.urlopen('http://www.kcda.org/KCDA_Awarded_Contracts.htm').read())
next_vendor = soup.find('p', {'class': 'style4'})
print next_vendor
next_commodity = next_vendor.find_next('p', {'class': 'style5'})
print next_commodity
next = next_commodity.find_next('p')
print next
I'd appreciate any help anybody can provide! Thanks in advance!
I am not entirely sure what you expect your output to look like. I am assuming that you are trying to get the data on the webpage in the format:
Alphabet \t Vendor \t Category
You can do this:
# The basic things
import urllib2
from bs4 import BeautifulSoup
soup = BeautifulSoup(urllib2.urlopen('http://www.kcda.org/KCDA_Awarded_Contracts.htm').read())
Get the td of interest:
table = soup.find('table')
data = table.find_all('tr')[-1]
data = data.find_all('td')[1:]
Now, we will create a nested output dictionary with the alphabet letters as keys and an inner dict as the value. The inner dict has the vendor name as key and the category information as its value:
output_dict = {}
current_alphabet = ""
current_vendor = ""
for td in data:
    for p in td.find_all('p'):
        print p.text.strip()
        if p.get('class')[0] == 'style6':
            current_alphabet = p.text.strip()
            vendors = {}
            output_dict[current_alphabet] = vendors
            continue
        if p.get('class')[0] == 'style4':
            print "Here"
            current_vendor = p.text.strip()
            category = []
            output_dict[current_alphabet][current_vendor] = category
            continue
        output_dict[current_alphabet][current_vendor].append(p.text.strip())
This gets the output_dict in the format:
{ ...
u'W': { u'WTI - Weatherproofing Technologies': [u'Roofing'],
u'Wenger Corporation': [u'Musical Instruments and Equipment'],
u'Williams Scotsman, Inc': [u'Modular/Portable Buildings'],
u'Witt Company': [u'Interactive Technology']
},
u'X': { u'Xerox': [u"Copiers & MFD's", u'Printers']
}
}
Skipping the earlier parts for brevity. Now it is just a matter of accessing this dictionary and writing out to a tab-separated file, sketched below.
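A minimal sketch of that write-out step (XXXX is the hard-coded first column from the question; the filename is just an example):
with open('output.txt', 'w') as f:
    for letter, vendors in output_dict.items():
        for vendor, categories in vendors.items():
            f.write('XXXX\t' + vendor + '\t' + ', '.join(categories) + '\n')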
Hope this helps.
Agree with @shaktimaan. Using a dictionary or list is a good approach here. My attempt is slightly different.
import requests as rq
from bs4 import BeautifulSoup as bsoup
import csv

url = "http://www.kcda.org/KCDA_Awarded_Contracts.htm"
r = rq.get(url)
soup = bsoup(r.content)
primary_line = soup.find_all("p", {"class": ["style4", "style5"]})
final_list = {}
for line in primary_line:
    txt = line.get_text().strip().encode("utf-8")
    if txt != "\xc2\xa0":
        if line["class"][0] == "style4":
            key = txt
            final_list[key] = []
        else:
            final_list[key].append(txt)
with open("products.csv", "wb") as ofile:
    f = csv.writer(ofile)
    for item in final_list:
        f.writerow([item, ", ".join(final_list[item])])
For the scrape, we isolate style4 and style5 tags right away. I did not bother going for the style6 or the alphabet headers. We then get the text inside each tag. If the text is not a whitespace of sorts (these are all over the tables, probably obfuscation or bad mark-up), we then check if it's style4 or style5. If it's the former, we assign it as a key to a blank list. If it's the latter, we append it to the blank list of the most recent key. Obviously the key changes every time we hit a new style4, so it's a relatively safe approach.
The last part is easy: we just use ", ".join on the value part of the key-value pair to concatenate the list as one string. We then write it to a CSV file.
Due to the dictionary being unsorted, the resulting CSV file will not be sorted alphabetically.
Changing it to a tab-delimited file is up to you; that's simple enough (see the small tweak below). Hope this helps!
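For instance, the same write-out with a tab delimiter; only the csv.writer call changes:
with open("products.tsv", "wb") as ofile:
    f = csv.writer(ofile, delimiter='\t')
    for item in final_list:
        f.writerow([item, ", ".join(final_list[item])])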
Noob here. I have a large number of JSON files; each is a series of blog posts in a different language. The key-value pairs are metadata about the posts, e.g. {"author": "John Smith", "translator": "Jane Doe"}. What I want to do is convert each one to a Python dictionary, then extract the values so that I have a list of all the authors and translators across all the posts.
import codecs
import json
import string

for lang in languages:
    f = 'posts-' + lang + '.json'
    file = codecs.open(f, 'rt', 'utf-8')
    line = string.strip(file.next())
    postAuthor[lang] = []
    postTranslator[lang] = []
    while (line):
        data = json.loads(line)
        print data['author']
        print data['translator']
When I tried this method, I kept getting a KeyError for 'translator' and I'm not sure why. I've never worked with the json module before, so I tried a more complex method to see what would happen:
postAuthor[lang].append(data['author'])
for translator in data.keys():
    if not data.has_key('translator'):
        postTranslator[lang] = ""
    postTranslator[lang] = data['translator']
It keeps returning an error that strings do not have an append function. This seems like a simple task and I'm not sure what I'm doing wrong.
See if this works for you:
import json

# you have lots of "posts", so let's assume
# you've stored them in some list. We'll use
# the example text you gave as one of the entries
# in said list
posts = ["{'author':'John Smith', 'translator':'Jane Doe'}"]

# strictly speaking, the single-quotes in your example aren't
# valid json, so you'll want to switch the single-quotes
# out to double-quotes; you can verify this with something
# like http://jsonlint.com/
# luckily, you can easily swap out all the quotes programmatically,
# so let's loop through the posts and store the authors and translators
# in two lists
authors = []
translators = []
for post in posts:
    double_quotes_post = post.replace("'", '"')
    json_data = json.loads(double_quotes_post)
    author = json_data.get('author', None)
    translator = json_data.get('translator', None)
    if author: authors.append(author)
    if translator: translators.append(translator)
# and there you have it, a list of authors and translators
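And a sketch closer to your per-language setup, assuming one JSON object per line in each posts-<lang>.json file (languages comes from your snippet):
import json

postAuthor = {}
postTranslator = {}
for lang in languages:
    postAuthor[lang] = []
    postTranslator[lang] = []
    with open('posts-' + lang + '.json', encoding='utf-8') as f:
        for line in f:
            data = json.loads(line.replace("'", '"'))
            # .get() avoids the KeyError when a post has no translator
            postAuthor[lang].append(data.get('author', ''))
            postTranslator[lang].append(data.get('translator', ''))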