Web scraping table from UniProt database

I have a list of UniProt IDs and would like to use BeautifulSoup to scrape a table containing the structure information. The URL I am using is as follows: https://www.uniprot.org/uniprot/P03496, with accession "P03496".
A snippet of the HTML is as follows.
<div class="main-aside">
<div class="content entry_view_content up_entry swissprot">
<div class="section" id="structure">
<protvista-uniprot-structure accession="P03468">
<div class="protvista-uniprot-structure">
<div class="protvista-uniprot-structure__table">
<protvista-datatable class="feature">
<table>...</table>
</protvista-datatable>
</div>
</div>
</protvista-uniprot-structure>
</div>
</div>
</div>
The information I require is inside the <table>...</table> element.
I tried:
from bs4 import BeautifulSoup
import requests
url='https://www.uniprot.org/uniprot/P03468'
r=requests.get(url)
url=r.content
soup = BeautifulSoup(url,'html.parser')
soup.find("protvista-datatable", {"class": "feature"})
print(soup)

The content is loaded dynamically, so it is not contained in your soup if you take a closer look. You do not need BeautifulSoup to get the data your table is based on; simply use their API / REST interface to get structured data as JSON:
import requests
# UniProtKB entries are served under /uniprotkb/ on the REST host
url = 'https://rest.uniprot.org/uniprotkb/P03468'
# fetch the JSON response
data = requests.get(url).json()
# pick the needed data, e.g. the cross-references
data['uniProtKBCrossReferences']
Output
[{'database': 'EMBL',
'id': 'J02146',
'properties': [{'key': 'ProteinId', 'value': 'AAA43412.1'},
{'key': 'Status', 'value': '-'},
{'key': 'MoleculeType', 'value': 'Genomic_RNA'}]},
{'database': 'EMBL',
'id': 'AF389120',
'properties': [{'key': 'ProteinId', 'value': 'AAM75160.1'},
{'key': 'Status', 'value': '-'},
{'key': 'MoleculeType', 'value': 'Genomic_RNA'}]},
{'database': 'EMBL',
'id': 'EF467823',
'properties': [{'key': 'ProteinId', 'value': 'ABO21711.1'},
{'key': 'Status', 'value': '-'},
{'key': 'MoleculeType', 'value': 'Genomic_RNA'}]},
{'database': 'EMBL',
'id': 'CY009446',
'properties': [{'key': 'ProteinId', 'value': 'ABD77678.1'},
{'key': 'Status', 'value': '-'},
{'key': 'MoleculeType', 'value': 'Genomic_RNA'}]},
{'database': 'EMBL',
'id': 'K01031',
'properties': [{'key': 'ProteinId', 'value': 'AAA43415.1'},
{'key': 'Status', 'value': '-'},
{'key': 'MoleculeType', 'value': 'Genomic_RNA'}]},
{'database': 'RefSeq',
'id': 'NP_040981.1',
'properties': [{'key': 'NucleotideSequenceId', 'value': 'NC_002018.1'}]},
{'database': 'PDB',
'id': '6WZY',
'properties': [{'key': 'Method', 'value': 'X-ray'},
{'key': 'Resolution', 'value': '1.50 A'},
{'key': 'Chains', 'value': 'C=181-190'}]},...]
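To get from this list to something closer to the structure table on the entry page, one option is to keep only the cross-references whose database is 'PDB' and flatten each entry's properties into a plain dict. This is a minimal sketch that assumes the response shape shown in the output above; the helper name `pdb_rows` and the inline `sample` list are illustrative and not part of the UniProt API.

```python
def pdb_rows(cross_references):
    """Return one flat dict per PDB cross-reference."""
    rows = []
    for ref in cross_references:
        if ref.get("database") != "PDB":
            continue
        row = {"PDB": ref["id"]}
        # Each property is a {'key': ..., 'value': ...} pair.
        row.update({p["key"]: p["value"] for p in ref.get("properties", [])})
        rows.append(row)
    return rows


# Two entries copied from the output above, standing in for
# data['uniProtKBCrossReferences'] so the sketch runs offline.
sample = [
    {"database": "RefSeq", "id": "NP_040981.1",
     "properties": [{"key": "NucleotideSequenceId", "value": "NC_002018.1"}]},
    {"database": "PDB", "id": "6WZY",
     "properties": [{"key": "Method", "value": "X-ray"},
                    {"key": "Resolution", "value": "1.50 A"},
                    {"key": "Chains", "value": "C=181-190"}]},
]

for row in pdb_rows(sample):
    print(row)
```

With real data, the same function could be called as `pdb_rows(data['uniProtKBCrossReferences'])`, and the resulting list of dicts could be handed to a table library if one is already in use.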

There's a Python package, Unipressed, by Michael Milton (@multimeric), that allows programmatic access to UniProt's new REST API.
Example:
from unipressed import UniprotkbClient
UniprotkbClient.fetch_one("P03468")["uniProtKBCrossReferences"]
Output
[{'database': 'EMBL',
'id': 'J02146',
'properties': [{'key': 'ProteinId', 'value': 'AAA43412.1'},
{'key': 'Status', 'value': '-'},
{'key': 'MoleculeType', 'value': 'Genomic_RNA'}]},
{'database': 'EMBL',
'id': 'AF389120',
'properties': [{'key': 'ProteinId', 'value': 'AAM75160.1'},
{'key': 'Status', 'value': '-'},
{'key': 'MoleculeType', 'value': 'Genomic_RNA'}]},
{'database': 'EMBL',
'id': 'EF467823',
'properties': [{'key': 'ProteinId', 'value': 'ABO21711.1'},
{'key': 'Status', 'value': '-'},
{'key': 'MoleculeType', 'value': 'Genomic_RNA'}]},
{'database': 'EMBL',
'id': 'CY009446',
'properties': [{'key': 'ProteinId', 'value': 'ABD77678.1'},
{'key': 'Status', 'value': '-'},
{'key': 'MoleculeType', 'value': 'Genomic_RNA'}]},
{'database': 'EMBL',
'id': 'K01031',
'properties': [{'key': 'ProteinId', 'value': 'AAA43415.1'},
{'key': 'Status', 'value': '-'},
{'key': 'MoleculeType', 'value': 'Genomic_RNA'}]},
{'database': 'RefSeq',
'id': 'NP_040981.1',
'properties': [{'key': 'NucleotideSequenceId', 'value': 'NC_002018.1'}]},
{'database': 'PDB',
'id': '6WZY',
'properties': [{'key': 'Method', 'value': 'X-ray'},
{'key': 'Resolution', 'value': '1.50 A'},
{'key': 'Chains', 'value': 'C=181-190'}]}, ... ]
More examples of using Unipressed to access UniProt's new REST API are in my reply to the Biostars post 'Accessing UNIPROT using REST API'. Unipressed can also be used for ID mapping.

Related

Retrieving ECS tags using boto3

I have the boto3 code below, which returns a list of dicts with key/value pairs for each service.
service_paginator = ecs_client.get_paginator('list_services')
for page in service_paginator.paginate(cluster=cluster_name,
                                       launchType='FARGATE'):
    # print(page)
    for service in page['serviceArns']:
        response = ecs_client.list_tags_for_resource(resourceArn=service)['tags']
Each list holds multiple key/value pairs, in the sample format below:
Row-1:[{'key': 'Platform', 'value': 'XX'}, {'key': 'StackVersion', 'value': '1.0.1'},{'key': 'ResourceOwner', 'value': 'TeamA'}, {'key': 'Stackname', 'value': 'myfargate-1'}, {'key': 'Service', 'value': 'Processing'}, {'key': 'Name', 'value': 'someName'},{'key': 'deploy_date', 'value': '2021-07-12'}, {'key': 'Source', 'value': 'somesource'}]
Row-2:[{'key': 'Platform', 'value': 'XX'}, {'key': 'StackVersion', 'value': '1.0.1'},{'key': 'ResourceOwner', 'value': 'TeamA'}, {'key': 'Stackname', 'value': 'myfargate-1'}, {'key': 'Service', 'value': 'Processing'}, {'key': 'Name', 'value': 'someName'},{'key': 'deploy_date', 'value': '2021-07-12'}]
Row-3:[{'key': 'Platform', 'value': 'XXY'}, {'key': 'StackVersion', 'value': '1.0.1'},{'key': 'ResourceOwner', 'value': 'TeamA'}, {'key': 'Stackname', 'value': 'myfargate-1'}, {'key': 'Service', 'value': 'Processing'}, {'key': 'Name', 'value': 'someName'},{'key': 'deploy_date', 'value': '2021-07-12'}, {'key': 'Source', 'value': 'somesource'}]
From these lists, I would like to print the services whose tags include both 'key' == 'Platform' and 'key' == 'Source'. So the output should be Row-1 and Row-3, as Row-2 doesn't have a key called Source.
Checking for one key works, but when I check for multiple keys I get a count of zero.
Is there a Pythonic way to do this for more than one key?

I will answer this myself, but I personally don't like this solution. If anyone has a better one, please post it here.
platform_name = []  # collects the Platform value of every matching service
service_paginator = ecs_client.get_paginator('list_services')
for page in service_paginator.paginate(cluster=cluster_name,
                                       launchType='FARGATE'):
    # print(page)
    for service in page['serviceArns']:
        response = ecs_client.list_tags_for_resource(resourceArn=service)['tags']
        # turn the list of {'key': ..., 'value': ...} dicts into one lookup dict
        tags = {}
        for item in response:
            tags[item['key']] = item['value']
        if 'Platform' in tags and 'Source' in tags:
            platform_name.append(tags['Platform'])
print(set(platform_name))  # to print the unique platform_name
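One possibly tidier variant of the check above is a set comparison: collect the tag keys into a set and test whether the required keys are a subset of it. This is a minimal sketch that uses abbreviated versions of Row-1 and Row-2 from the question in place of a live `list_tags_for_resource` call; `REQUIRED_KEYS` and `has_required_tags` are names chosen here for illustration.

```python
REQUIRED_KEYS = {"Platform", "Source"}


def has_required_tags(tags, required=REQUIRED_KEYS):
    """Return True if every required key appears among the tag dicts."""
    present = {tag["key"] for tag in tags}
    return required <= present  # subset test


# Abbreviated sample rows (assumed shape: a list of {'key', 'value'} dicts).
row_1 = [{"key": "Platform", "value": "XX"},
         {"key": "Source", "value": "somesource"}]
row_2 = [{"key": "Platform", "value": "XX"},
         {"key": "deploy_date", "value": "2021-07-12"}]

print(has_required_tags(row_1))  # True
print(has_required_tags(row_2))  # False
```

Adding a third required key then only means extending the set, rather than chaining another `and` condition.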

add key value in nested dictionary

datainput = {'thissong-fav-user:type1-chan-44-John': [{'Song': 'Rock',
'Type': 'Hard',
'Price': '10'}],
'thissong-fav-user:type1-chan-45-kelly-md': [{'Song': 'Rock',
'Type': 'Soft',
'Price': '5'}]}
Output required:
{'thissong-fav-user:type1-chan-44-John': [{'Key': 'Song', 'Value': 'Rock'},
 {'Key': 'Type', 'Value': 'Hard'},
 {'Key': 'Price', 'Value': '10'}],
'thissong-fav-user:type1-chan-45-kelly-md': [{'Key': 'Song', 'Value': 'Rock'},
 {'Key': 'Type', 'Value': 'Soft'},
 {'Key': 'Price', 'Value': '5'}]}
I started with the code below, which gives me an inner nested pattern; I'm not sure how to get the desired output.
temps = [{'Key': key, 'Value': value} for (key, value) in datainput.items()]

Here is how:
datainput = {'thissong-fav-user:type1-chan-44-John': [{'Song': 'Rock',
'Type': 'Hard',
'Price': '10'}],
'thissong-fav-user:type1-chan-45-kelly-md': [{'Song': 'Rock',
'Type': 'Soft',
'Price': '5'}]}
# for each outer key, turn the single inner dict into a list of Key/Value dicts
temps = {k: [{'Key': a, 'Value': b}
             for a, b in v[0].items()]
         for k, v in datainput.items()}
print(temps)
Output:
{'thissong-fav-user:type1-chan-44-John': [{'Key': 'Song', 'Value': 'Rock'},
{'Key': 'Type', 'Value': 'Hard'},
{'Key': 'Price', 'Value': '10'}],
'thissong-fav-user:type1-chan-45-kelly-md': [{'Key': 'Song', 'Value': 'Rock'},
{'Key': 'Type', 'Value': 'Soft'},
{'Key': 'Price', 'Value': '5'}]}

The input is fine as it is. To get the desired output, iterate over the outer dictionary and, for each key, turn every key-value pair of the inner dictionary into a {'Key': ..., 'Value': ...} entry:
datainput = {'thissong-fav-user:type1-chan-44-John': [{'Song': 'Rock',
'Type': 'Hard',
'Price': '10'}],
'thissong-fav-user:type1-chan-45-kelly-md': [{'Song': 'Rock',
'Type': 'Soft',
'Price': '5'}]}
datainput = {k:[{'Key':a, 'Value':b} for a,b in v[0].items()] for k,v in datainput.items()}
print(datainput)
Most probably, you'll get the desired output in this fashion.
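Both answers read only `v[0]`, the first dict in each list. If a list could ever hold more than one dict (an assumption; the sample input never does), a nested comprehension can flatten all of them. The function name `to_key_value_pairs` and the small `example` input are made up for this sketch.

```python
def to_key_value_pairs(data):
    """Convert {name: [dict, ...]} into {name: [{'Key': k, 'Value': v}, ...]}."""
    return {
        outer_key: [
            {"Key": k, "Value": v}
            for record in records          # every dict in the list, not just the first
            for k, v in record.items()
        ]
        for outer_key, records in data.items()
    }


# Hypothetical input with two dicts under one key and an empty list under another.
example = {"a": [{"Song": "Rock"}, {"Type": "Hard"}], "b": []}
result = to_key_value_pairs(example)
print(result)
```

An empty list simply maps to an empty list here, whereas indexing with `v[0]` would raise an IndexError.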

Convert nested dictionary within JSON from a string

I loaded some JSON data that has a messy structure: the nested dictionaries are wrapped in single quotes and treated as strings, rather than as dictionaries I can loop through. What is the best way to drop the single quotes from the 'value' property?
Provided below is an example of the structure:
for val in json_data:
    print(val)
{'id': 'status6',
'title': 'Estimation',
'text': '> 2 days',
'type': 'color',
'value': '{"index":14,"post_id":null,"changed_at":"2020-06-12T09:04:58.659Z"}',
'name': 'Internal: online course'},
{'id': 'date',
'title': 'Deadline',
'text': '2020-06-26',
'type': 'date',
'value': '{"date":"2020-06-26","changed_at":"2020-06-12T11:33:37.195Z"}',
'name': 'Internal: online course'},
{'id': 'tags',
'title': 'Tags',
'text': 'Internal',
'type': 'tag',
'value': '{"tag_ids":[3223513]}',
'name': 'Internal: online course'},
If I add a nested loop targeting ['value'], it loops character by character and not over the key-value pairs in the dictionary.

Use json.loads to convert each string to a dict:
import json
json_data = [{'id': 'status6',
'title': 'Estimation',
'text': '> 2 days',
'type': 'color',
'value': '{"index":14,"post_id":null,"changed_at":"2020-06-12T09:04:58.659Z"}',
'name': 'Internal: online course'},
{'id': 'date',
'title': 'Deadline',
'text': '2020-06-26',
'type': 'date',
'value': '{"date":"2020-06-26","changed_at":"2020-06-12T11:33:37.195Z"}',
'name': 'Internal: online course'},
{'id': 'tags',
'title': 'Tags',
'text': 'Internal',
'type': 'tag',
'value': '{"tag_ids":[3223513]}',
'name': 'Internal: online course'}]
# json.loads turns each 'value' string into a Python dictionary
for val in json_data:
    print(json.loads(val['value']))
This should work.
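If the parsed dictionaries need to stay attached to their rows rather than only being printed, one option is to build new rows with the 'value' field replaced. This is a minimal sketch; `parse_value_fields` is a hypothetical helper, and leaving unparseable strings untouched is an assumption about the desired behaviour.

```python
import json


def parse_value_fields(rows, field="value"):
    """Return copies of rows with the JSON string in `field` decoded."""
    parsed = []
    for row in rows:
        new_row = dict(row)  # shallow copy so the input rows are not modified
        raw = new_row.get(field)
        if isinstance(raw, str):
            try:
                new_row[field] = json.loads(raw)
            except json.JSONDecodeError:
                pass  # not valid JSON: keep the original string
        parsed.append(new_row)
    return parsed


# One row taken from the question plus one deliberately invalid row.
rows = [
    {"id": "tags", "value": '{"tag_ids":[3223513]}'},
    {"id": "broken", "value": "not json"},
]

for row in parse_value_fields(rows):
    print(row["id"], row["value"])
```

After this step, a nested loop over `row["value"].items()` iterates over key-value pairs instead of characters, for the rows that decoded to a dictionary.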

How to parse VirtualMachinePaged object using Azure SDK for Python?

I am trying to get a list of VMs in a Resource Group using the Azure SDK for Python. I configured Visual Studio Code with all the required Azure Tools. I created a function and used the code below to get the list of VMs.
import os
import random
import string
from azure.identity import DefaultAzureCredential
from azure.mgmt.compute import ComputeManagementClient
from azure.mgmt.network import NetworkManagementClient
from azure.mgmt.resource import ResourceManagementClient
def main():
    SUBSCRIPTION_ID = os.environ.get("SUBSCRIPTION_ID", None)
    GROUP_NAME = "testgroupx"
    VIRTUAL_MACHINE_NAME = "virtualmachinex"
    SUBNET_NAME = "subnetx"
    INTERFACE_NAME = "interfacex"
    NETWORK_NAME = "networknamex"
    VIRTUAL_MACHINE_EXTENSION_NAME = "virtualmachineextensionx"
    resource_client = ResourceManagementClient(
        credential=DefaultAzureCredential(),
        subscription_id=SUBSCRIPTION_ID
    )
    network_client = NetworkManagementClient(
        credential=DefaultAzureCredential(),
        subscription_id=SUBSCRIPTION_ID
    )
    compute_client = ComputeManagementClient(
        credential=DefaultAzureCredential(),
        subscription_id=SUBSCRIPTION_ID
    )
    # list() returns a paged collection, not a single VM
    vm = compute_client.virtual_machines.list(
        'RGName'
    )
    print("Get virtual machine:\n{}".format(vm))
When I check the logs, I see the following printed:
<azure.mgmt.compute.v2019_12_01.models._paged_models.VirtualMachinePaged object at 0x0000024584F92EC8>
I am really trying to get the actual objects, but I am not sure how to parse this. Any ideas?

Since it returns a collection, you need to use a for loop. You can do something like this:
for vm in compute_client.virtual_machines.list('RGName'):
    print("\tVM: {}".format(vm.name))

VirtualMachinePaged is a collection of objects of type VirtualMachine. You can see the source code of that class here: https://github.com/Azure/azure-sdk-for-python/blob/master/sdk/compute/azure-mgmt-compute/azure/mgmt/compute/v2019_12_01/models/_models.py.
From that file, here is the list of attributes:
{
'id': {'key': 'id', 'type': 'str'},
'name': {'key': 'name', 'type': 'str'},
'type': {'key': 'type', 'type': 'str'},
'location': {'key': 'location', 'type': 'str'},
'tags': {'key': 'tags', 'type': '{str}'},
'plan': {'key': 'plan', 'type': 'Plan'},
'hardware_profile': {'key': 'properties.hardwareProfile', 'type': 'HardwareProfile'},
'storage_profile': {'key': 'properties.storageProfile', 'type': 'StorageProfile'},
'additional_capabilities': {'key': 'properties.additionalCapabilities', 'type': 'AdditionalCapabilities'},
'os_profile': {'key': 'properties.osProfile', 'type': 'OSProfile'},
'network_profile': {'key': 'properties.networkProfile', 'type': 'NetworkProfile'},
'diagnostics_profile': {'key': 'properties.diagnosticsProfile', 'type': 'DiagnosticsProfile'},
'availability_set': {'key': 'properties.availabilitySet', 'type': 'SubResource'},
'virtual_machine_scale_set': {'key': 'properties.virtualMachineScaleSet', 'type': 'SubResource'},
'proximity_placement_group': {'key': 'properties.proximityPlacementGroup', 'type': 'SubResource'},
'priority': {'key': 'properties.priority', 'type': 'str'},
'eviction_policy': {'key': 'properties.evictionPolicy', 'type': 'str'},
'billing_profile': {'key': 'properties.billingProfile', 'type': 'BillingProfile'},
'host': {'key': 'properties.host', 'type': 'SubResource'},
'provisioning_state': {'key': 'properties.provisioningState', 'type': 'str'},
'instance_view': {'key': 'properties.instanceView', 'type': 'VirtualMachineInstanceView'},
'license_type': {'key': 'properties.licenseType', 'type': 'str'},
'vm_id': {'key': 'properties.vmId', 'type': 'str'},
'resources': {'key': 'resources', 'type': '[VirtualMachineExtension]'},
'identity': {'key': 'identity', 'type': 'VirtualMachineIdentity'},
'zones': {'key': 'zones', 'type': '[str]'},
}
For Python 3, the code can be found here: https://github.com/Azure/azure-sdk-for-python/blob/master/sdk/compute/azure-mgmt-compute/azure/mgmt/compute/v2019_12_01/models/_models_py3.py.

How to extract common elements from list of lists of dictionaries

I am trying to build dictionaries (one contains all common elements and the other one contains the different elements) out of a list of dictionaries.
Now I've managed to get it working for a list of 2 dictionaries by converting to a set of tuples and then getting the unique keys as well as the differences with the intersection and difference methods but I don't know how to go about a list of varying length (sometimes I'll have 3 or 4 dictionaries in my list).
I'm sure I need to use map or reduce/lambda function but I can't figure it out.
This is my input:
all_maps = [
[{'key': 'target', 'value': 'true'},
{'key': 'region_name', 'value': 'europe'},
{'field': 'AccessToken', 'key': 'token','path': 'test/path'}],
[{'key': 'target', 'value': 'true'},
{'key': 'region_name', 'value': 'usa'},
{'field': 'AccessToken', 'key': 'token', 'path': 'test/path'}],
[{'key': 'target', 'value': 'true'},
{'key': 'region_name', 'value': 'japan'},
{'field': 'AccessToken', 'key': 'token', 'path': 'test/path'}]
]
What I want is to get 4 dictionaries as such:
intersection = {'key': 'target', 'value': 'true'},
{'field': 'AccessToken', 'key': 'token', 'path': 'test/path'}
diff1 = {'key': 'region_name', 'value': 'europe'}
diff2 = {'key': 'region_name', 'value': 'usa'}
diff3 = {'key': 'region_name', 'value': 'japan'}

A simple answer would be to flatten the all_maps list and separate the items based on their list.count() value:
def flatten(map_groups):
    # merge all groups into one flat list of dicts
    items = []
    for group in map_groups:
        items.extend(group)
    return items
def intersection(map_groups):
    # items that occur more than once across all groups
    unique = []
    items = flatten(map_groups)
    for item in items:
        if item not in unique and items.count(item) > 1:
            unique.append(item)
    return unique
def difference(map_groups):
    # items that occur exactly once across all groups
    unique = []
    items = flatten(map_groups)
    for item in items:
        if item not in unique and items.count(item) == 1:
            unique.append(item)
    return unique
Here's the output using these functions:
>>> intersection(all_maps)
[{'key': 'target', 'value': 'true'},
{'field': 'AccessToken', 'key': 'token', 'path': 'test/path'}]
>>> difference(all_maps)
[{'key': 'region_name', 'value': 'europe'},
{'key': 'region_name', 'value': 'usa'},
{'key': 'region_name', 'value': 'japan'}]
For a more advanced implementation, you can look into set().
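A set-based version could look like the sketch below. Dicts are not hashable, so each one is first converted to a frozenset of its items; this assumes every value is itself hashable (plain strings in the question's data) and that `map_groups` is not empty. Unlike the count-based functions above, it treats an item as common only if it appears in every group. The names `freeze` and `split_common` are made up for this example.

```python
def freeze(d):
    """Hashable stand-in for a dict whose values are all hashable."""
    return frozenset(d.items())


def split_common(map_groups):
    """Return (items present in every group, per-group leftovers)."""
    frozen_groups = [{freeze(d) for d in group} for group in map_groups]
    common = set.intersection(*frozen_groups)
    # Preserve the order of the first group for the common items.
    intersection = [d for d in map_groups[0] if freeze(d) in common]
    differences = [[d for d in group if freeze(d) not in common]
                   for group in map_groups]
    return intersection, differences


all_maps = [
    [{'key': 'target', 'value': 'true'},
     {'key': 'region_name', 'value': 'europe'},
     {'field': 'AccessToken', 'key': 'token', 'path': 'test/path'}],
    [{'key': 'target', 'value': 'true'},
     {'key': 'region_name', 'value': 'usa'},
     {'field': 'AccessToken', 'key': 'token', 'path': 'test/path'}],
    [{'key': 'target', 'value': 'true'},
     {'key': 'region_name', 'value': 'japan'},
     {'field': 'AccessToken', 'key': 'token', 'path': 'test/path'}],
]

common_items, diffs = split_common(all_maps)
print(common_items)
for diff in diffs:
    print(diff)
```

Because the leftovers are returned per group, this works for any number of groups and keeps track of which group each differing item came from.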
