Extracting FASTA Moonlighting Protein Sequences with Python

I want to extract the FASTA files that contain the amino acid sequences from the Moonlighting Protein Database (www.moonlightingproteins.org/results.php?search_text=) via Python, since it's an iterative process that I'd rather learn how to program than do manually, because come on, we're in 2016. The problem is I don't know how to write the code, because I'm a rookie programmer :(. The basic pseudocode would be:
for protein_name in site (www.moonlightingproteins.org/results.php?search_text=):
    go to the UniProt option
    download the FASTA file
    store it in a .txt file inside a given folder
Thanks in advance!

I would strongly suggest asking the authors for the database. From the FAQ:
I would like to use the MoonProt database in a project to analyze the amino acid sequences or structures using bioinformatics.
Please contact us at bioinformatics@moonlightingproteins.org if you are interested in using the MoonProt database for analysis of sequences and/or structures of moonlighting proteins.
Assuming you find something interesting, how are you going to cite it in your paper or your thesis?
"The sequences were scraped from a public webpage without the consent of the authors". Much better to give credit to the original researchers.
That said, this task is a good introduction to scraping. But back to your original question.
import requests
from lxml import html

# let's download one protein at a time; change 3 to any other number
page = requests.get('http://www.moonlightingproteins.org/detail.php?id=3')
# convert the HTML document to something we can parse in Python
tree = html.fromstring(page.content)
# get all table cells
cells = tree.xpath('//td')
for i, cell in enumerate(cells):
    if cell.text:
        # if we get something which looks like a FASTA sequence, print it
        if cell.text.startswith('>'):
            print(cell.text)
    # if we find a table cell which has UniProt in it,
    # let's print the link from the next cell
    if 'UniProt' in cell.text_content():
        if cells[i + 1].find('a') is not None and 'href' in cells[i + 1].find('a').attrib:
            print(cells[i + 1].find('a').attrib['href'])
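If you do want every protein rather than one, you could wrap the same logic in a loop and save each sequence to its own file. A minimal sketch, assuming the detail pages are numbered sequentially (the ID range and the output folder name are assumptions, not something the site documents):

import os

import requests
from lxml import html

out_dir = 'fasta_files'  # hypothetical output folder
os.makedirs(out_dir, exist_ok=True)

# 1..50 is a guessed ID range; adjust it to whatever the site actually uses
for protein_id in range(1, 51):
    page = requests.get(
        'http://www.moonlightingproteins.org/detail.php?id=%d' % protein_id)
    tree = html.fromstring(page.content)
    for cell in tree.xpath('//td'):
        # anything starting with '>' looks like a FASTA record
        if cell.text and cell.text.startswith('>'):
            path = os.path.join(out_dir, 'protein_%d.txt' % protein_id)
            with open(path, 'w') as f:
                f.write(cell.text)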

Related

What is the best way to extract the body of an article with Python?

Summary
I am building a text summarizer in Python. The kind of documents I am mainly targeting are scholarly papers, which are usually in PDF format.
What I Want to Achieve
I want to effectively extract the body of the paper (abstract to conclusion), excluding the title of the paper, publisher names, images, equations, and references.
Issues
I have tried looking for effective ways to do this, but I was not able to find anything tangible and useful. The code I currently have tries to split the PDF document into sentences and then filters out the entries that have fewer characters than the average sentence. Below is the code:
from pdfminer import high_level

# input: string (path to the file)
# output: list of sentences
def pdf2sentences(pdf):
    article_text = high_level.extract_text(pdf)
    # splitting on '.' roughly splits on every sentence
    sents = article_text.split('.')
    run_ave = 0
    for s in sents:
        run_ave += len(s)
    run_ave /= len(sents)
    sents_strip = []
    for sent in sents:
        if len(sent.strip()) >= run_ave:
            sents_strip.append(sent)
    return sents_strip
Note: I am using this article as input.
The above code seems to work fine, but I am still not able to effectively filter out things like the title and publisher names that come before the abstract, or the references section that comes after the conclusion. Moreover, images cause gibberish characters to show up in the text, which degrades the overall quality of the output. Because of these weird Unicode characters I am not able to write the output to a .txt file.
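For what it's worth, the write failure is usually an encoding mismatch; opening the output file with an explicit UTF-8 encoding normally sidesteps it. A minimal sketch ('output.txt' and 'paper.pdf' are placeholder names):

# utf-8 can encode any Python string; errors='replace' is extra insurance
with open('output.txt', 'w', encoding='utf-8', errors='replace') as f:
    for sent in pdf2sentences('paper.pdf'):
        f.write(sent.strip() + '.\n')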
Appeal
Are there ways I can improve the performance of this parser and make it more consistent?
Thank you for your answers!

Scraping data from an HTTP & JavaScript site

I currently want to scrape some data from an Amazon page and I'm kind of stuck.
For example, let's take this page:
https://www.amazon.com/NIKE-Hyperfre3sh-Athletic-Sneakers-Shoes/dp/B01KWIUHAM/ref=sr_1_1_sspa?ie=UTF8&qid=1546731934&sr=8-1-spons&keywords=nike+shoes&psc=1
I want to scrape every variant of shoe size and color. That data can be found by opening the source code and searching for 'variationValues'.
There we can see a sort of dictionary containing all the sizes and colors and, below that, in 'asinToDimensionIndexMap', every product code with numbers indicating the variant from the 'variationValues' dictionary.
For example, in asinToDimensionIndexMap we can see
"B01KWIUH5M":[0,0]
This means that the product code B01KWIUH5M is associated with the size '8M US' (position 0 in the variationValues size_name section) and the color 'Teal' (same idea as before).
I want to scrape both the variationValues and the asinToDimensionIndexMap, so I can associate the index-map numbers with the variationValues entries.
Another person on the site (thanks for the help, btw) suggested doing it this way:
import re
import json

# get the text of the first <script> tag as a string
script = response.xpath('//script/text()').extract_first()
# capture everything between {} (non-greedy)
data = re.findall(r'(\{.+?\})', script)
d = json.loads(data[0])
d['products'][0]
I can sort of understand the first part: we get everything that's in a 'script' tag as a string and then grab everything between {}. The issue is what happens after that. My knowledge of JSON is not that great, and reading some material about it didn't help much.
Is there a way to get, from that data, two dictionaries or lists with the variationValues and asinToDimensionIndexMap (maybe using some regular expressions in the middle to pull some data out of a big string)? Or could you explain a little what happens in the JSON part?
Thanks for the help!
EDIT: Added photo of variationValues and asinToDimensionIndexMap
I think you are close, Manuel!
The following code will turn your scraped source into easy-to-select boxes:
import json
d = json.loads(data[0])
JSON is a universal format for storing object information. In other words, it's designed to turn string data into object data, regardless of the platform you are working with.
https://www.w3schools.com/js/js_json_intro.asp
I'm assuming where you may be finding things challenging is when there are errors accessing a particular "box" inside your JSON object.
Your code format looks correct, but your access within "each box" may look different.
E.g. if your 'asinToDimensionIndexMap' object is nested within a smaller box in the larger 'products' object, then you might access it like this (after running the code above):
d['products'][0]['asinToDimensionIndexMap']
I've hacked and slashed a little so you can better understand the structure of your particular JSON file. Take a look at the link below; on the right-hand side, you will see "which boxes are within one another", which is precisely what you need to know for accessing what you need.
JSON Object Viewer
For example, the following would yield "companyCompliancePolicies_feature_div":
import json
d = json.loads(data[0])
d['updateDivLists']['full'][0]['divToUpdate']
The person helping you before outlined a general case for you, but you'll need to go in and look at the structure this way to truly find what you're looking for.
variationValues = re.findall(r'variationValues\" : ({.*?})', ' '.join(script))[0]
asinVariationValues = re.findall(r'asinVariationValues\" : ({.*?}})', ' '.join(script))[0]
dimensionValuesData = re.findall(r'dimensionValuesData\" : (\[.*\])', ' '.join(script))[0]
asinToDimensionIndexMap = re.findall(r'asinToDimensionIndexMap\" : ({.*})', ' '.join(script))[0]
dimensionValuesDisplayData = re.findall(r'dimensionValuesDisplayData\" : ({.*})', ' '.join(script))[0]
Now you can easily convert them to JSON and use or combine them however you wish.
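To make that concrete, here is a minimal sketch of loading two of those captures and joining them (the 'size_name' and 'color_name' keys follow the question's description and are assumptions about the actual page):

import json

variation_values = json.loads(variationValues)
index_map = json.loads(asinToDimensionIndexMap)

# each index-map value is a pair of indices into the variationValues lists,
# e.g. "B01KWIUH5M": [0, 0] -> size '8M US', color 'Teal'
for asin, (size_idx, color_idx) in index_map.items():
    print(asin,
          variation_values['size_name'][size_idx],
          variation_values['color_name'][color_idx])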

How to robustly extract author names from pdf papers?

I'd like to extract author names from pdf papers. Does anybody know a robust way to do so?
For example, I'd like to extract the name Archana Shukla from this pdf https://arxiv.org/pdf/1111.1648
PDF documents contain metadata. It includes information about the document and its contents, such as the author's name, keywords, and copyright information. See the Adobe doc.
You can use PyPDF2 to extract PDF Metadata. See the documentation about the DocumentInformation class.
This information may not be filled in and can appear blank. So one possibility is to parse the beginning or the end of the text and extract what you think is the author name. Of course, this is not reliable. But if you have a bibliographic database, you can try a match.
Nowadays, editors like Microsoft Word or LibreOffice Writer always fill in the author name in the metadata, and it is copied into the PDF when you export your documents. So this should work for you. Give it a try and tell us!
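As a concrete starting point, a minimal PyPDF2 sketch for reading the metadata might look like this (the file name is a placeholder, and as noted above the author field may simply be empty):

import PyPDF2

with open('1111.1648.pdf', 'rb') as f:  # placeholder file name
    reader = PyPDF2.PdfFileReader(f)
    info = reader.getDocumentInfo()  # a DocumentInformation object
    print(info.author)  # None or '' when the field was never filled in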
I am going to presuppose that you have a way to extract text from a PDF document, so the question is really "how can I figure out the author from this text". I think one straightforward solution is to use the correspondence email. Here is an example implementation:
import difflib

# Some sample text
pdf_text = """SENTIMENT ANALYSIS OF DOCUMENT BASED ON ANNOTATION\n
Archana Shukla\nDepartment of Computer Science and Engineering,
Motilal Nehru National Institute of Technology,
Allahabad\narchana@mnnit.ac.in\nABSTRACT\nI present a tool which
tells the quality of document or its usefulness based on annotations."""

def find_author(some_text):
    words = some_text.split(" ")
    # collect every whitespace-separated token that looks like an email
    emails = []
    for word in words:
        if "@" in word:
            emails.append(word)
    emails_clean = emails[0].split("\n")
    actual_email = [a for a in emails_clean if "@" in a]
    actual_email = actual_email[0]
    # the local part of the email is usually close to the author's name
    maybe_name = actual_email.split("@")[0]
    all_words_lists = [a.split("\n") for a in words]
    words = [a for sublist in all_words_lists for a in sublist]
    words.remove(actual_email)
    return difflib.get_close_matches(maybe_name, words)
In this case, find_author(pdf_text) returns ['Archana']. It's not perfect, but it's not incorrect. I think you could likely extend this in some clever ways, perhaps by getting the next word after the result or by combining this guess with metadata, or even by finding the DOI in the document if/when it exists and looking it up through some API, but nonetheless I think this should be a good starting point.
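For instance, "getting the next word after the result" could be a small extension of the function above (a rough sketch, not a robust parser):

def find_author_full(some_text):
    # reuse find_author() and append the word that follows the match,
    # so 'Archana' becomes 'Archana Shukla' for the sample text
    words = some_text.replace('\n', ' ').split()
    matches = find_author(some_text)
    if not matches:
        return None
    first = matches[0]
    if first in words:
        idx = words.index(first)
        if idx + 1 < len(words):
            return first + ' ' + words[idx + 1]
    return first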
First things first: there are some PDFs out there whose pages are images, and I don't know if you can extract text from an image easily. But from the PDF link you mentioned, I think it can be done. There is a package called PyPDF2 which, as far as I know, can extract the text from a PDF. All that's left is to scan the first few pages and parse the author names.
An example of how to use the package is described here. Some of the code listed there is as follows:
import PyPDF2

pdfFileObj = open('meetingminutes.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
print(pdfReader.numPages)
pageObj = pdfReader.getPage(0)
print(pageObj.extractText())

Use BeautifulSoup to Iterate over XML to pull specific tags and store in variable

I'm fairly new to programming and have been trying to find a solution for this, but all I can find are bits and pieces, with no real luck putting it all together.
I'm trying to use BeautifulSoup4 in Python to scrape some XML and store the text between specific tags in variables. The data is from a med student training program, and right now everything needed has to be found manually, so I'm trying to increase efficiency a bit with a scraping program.
Let's say, for example, that I was looking at this kind of test data to experiment with:
<AllergyList>
  <Allergy>
    <Deleted>n</Deleted>
    <Status>
      <Active/>
    </Status>
    <ExternalID/>
    <Patient>
      <ExternalID/>
      <FirstName>Testcase</FirstName>
      <LastName>casetest</LastName>
    </Patient>
    <Allergen>
      <Name>Flagyl (metronidazole)</Name>
      <Drug>
        <NDCID>00025182151,00025182131,00025182150</NDCID>
      </Drug>
    </Allergen>
    <Reaction>difficulty breathing</Reaction>
    <OnsetDate>02/02/2013</OnsetDate>
  </Allergy>
  <Allergy>
    <Deleted>n</Deleted>
    <Status>
      <Active/>
    </Status>
    <ExternalID/>
    <Patient>
      <ExternalID/>
      <FirstName>Testcase</FirstName>
      <LastName>casetest</LastName>
    </Patient>
    <Allergen>
      <Name>Bactrim (sulfamethoxazole-trimethoprim)</Name>
      <Drug>
        <NDCID>13310014501,49999023220</NDCID>
      </Drug>
    </Allergen>
    <Reaction>swelling</Reaction>
    <OnsetDate>05/03/2002</OnsetDate>
  </Allergy>
  <Number>2</Number>
</AllergyList>
I've been trying to pull the <Name> tag from between the multiple <Allergen> tags, as well as the respective data from between the <OnsetDate> and <Reaction> tags, while storing the results in respective variables.
So, for example, I would want to pull Flagyl (metronidazole), difficulty breathing, 02/02/2013, then Bactrim (sulfamethoxazole-trimethoprim), swelling, 05/03/2002, and so on, placing them in separate variables that I can use later.
Pulling the first set from the <Allergen> tag is easy, but I'm having trouble figuring out how to iterate over the XML and store the pulled data in variables. I've been trying a for loop that stores the data in a list, but the way I've written it I always pull the same data over and over, depending on the number of iterations I calculate with the len() function, and I have so far failed to store any of it.
I've been racking my brain about this for a while now, so any help, or even a pointer in the right direction, would be immensely appreciated.
It seems like a simple task because there aren't many nested tags:
from bs4 import BeautifulSoup
import sys

soup = BeautifulSoup(open(sys.argv[1], 'r'), 'xml')

allergies = []
for allergy in soup.find_all('Allergy'):
    d = {
        'name': allergy.Allergen.Name.string,
        'reaction': allergy.Reaction.string,
        'on_set_date': allergy.OnsetDate.string,
    }
    allergies.append(d)

## Use the 'allergies' list of dictionaries as you want.
## Example:
print(allergies[1]['reaction'])
Run it with the XML file as an argument:
python3 script.py xmlfile
And this test yields:
swelling

Extracting data from MS Word

I am looking for a way to extract/scrape data from Word files into a database. Our corporate procedures have minutes of meetings with clients documented in MS Word files, mostly due to history and inertia.
I want to be able to pull the action items from these meeting minutes into a database so that we can access them from a web interface, turn them into tasks, and update them as they are completed.
Which is the best way to do this?
VBA macro from inside Word to create CSV and then upload to the DB?
VBA macro in Word with connection to DB (how does one connect to MySQL from VBA?)
Python script via win32com then upload to DB?
The last one is attractive to me because the web interface is being built with Django, but I've never used win32com or tried scripting Word from Python.
EDIT: I've started extracting the text with VBA because it makes it a little easier to deal with the Word object model. I am having a problem, though: all the text is in tables, and when I pull the strings out of the cells I want, I get a strange little box character at the end of each string. My code looks like:
sFile = "D:\temp\output.txt"
fnum = FreeFile
Open sFile For Output As #fnum
num_rows = Application.ActiveDocument.Tables(2).Rows.Count
For n = 1 To num_rows
    Descr = Application.ActiveDocument.Tables(2).Cell(n, 2).Range.Text
    Assign = Application.ActiveDocument.Tables(2).Cell(n, 3).Range.Text
    Target = Application.ActiveDocument.Tables(2).Cell(n, 4).Range.Text
    If Target = "" Then
        ExportText = ""
    Else
        ExportText = Descr & Chr(44) & Assign & Chr(44) & _
                     Target & Chr(13) & Chr(10)
        Print #fnum, ExportText
    End If
Next n
Close #fnum
What's up with the little control character box? Is some kind of character code coming across from Word?
Word has a little marker thingy that it puts at the end of every cell of text in a table.
It is used just like an end-of-paragraph marker in paragraphs: to store the formatting for the entire paragraph.
Just use the Left() function to strip it out, i.e.
Left(Target, Len(Target) - 1)
By the way, instead of
num_rows = Application.ActiveDocument.Tables(2).Rows.Count
For n = 1 To num_rows
    Descr = Application.ActiveDocument.Tables(2).Cell(n, 2).Range.Text
Try this:
For Each row In Application.ActiveDocument.Tables(2).Rows
    Descr = row.Cells(2).Range.Text
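And since the question leaned toward Python, the same table walk can be sketched with win32com (a rough sketch: the path is a placeholder, Word must be installed, and the slicing strips the two trailing cell-marker characters, Chr(13) & Chr(7)):

from win32com.client import Dispatch

word = Dispatch('Word.Application')
doc = word.Documents.Open(r'D:\temp\minutes.doc')  # placeholder path
rows = []
for row in doc.Tables(2).Rows:
    # each cell's Range.Text ends with Chr(13) & Chr(7); slice them off
    descr = row.Cells(2).Range.Text[:-2]
    assign = row.Cells(3).Range.Text[:-2]
    target = row.Cells(4).Range.Text[:-2]
    if target:
        rows.append((descr, assign, target))
doc.Close()
word.Quit()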
Well, I've never scripted Word, but it's pretty easy to do simple stuff with win32com. Something like:
from win32com.client import Dispatch

word = Dispatch('Word.Application')
doc = word.Documents.Open('d:\\stuff\\myfile.doc')
# 7 = wdFormatUnicodeText (2 = wdFormatText) should save as plain text
doc.SaveAs(FileName='d:\\stuff\\text\\myfile.txt', FileFormat=7)
This is untested, but I think something like that will just open the file and save it as plain text (provided the FileFormat constant is right); you could then read the text into Python and manipulate it from there. There is probably a way to grab the contents of the file directly, too, but I don't know it offhand. Documentation can be hard to find, but if you've got VBA docs or experience, you should be able to carry them across.
Have a look at this post from a while ago: http://mail.python.org/pipermail/python-list/2002-October/168785.html Scroll down to COMTools.py; there are some good examples there.
You can also run makepy.py (part of the pythonwin distribution) to generate Python "signatures" for the available COM functions, and then look through them as a kind of documentation.
You could use OpenOffice. It can open Word files, and it can also run Python macros.
I'd say look at the related questions on the right -->
The top one seems to have some good ideas for going the python route.
How about saving the file as XML, then using Python or something else to pull the data out of Word and into the database?
It is possible to programmatically save a Word document as HTML and to import the table(s) it contains into Access. This requires very little effort.
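The save-as-HTML step can also be driven from Python via win32com; a rough sketch (the paths are placeholders, and 8 is the wdFormatHTML constant):

from win32com.client import Dispatch

word = Dispatch('Word.Application')
doc = word.Documents.Open(r'D:\temp\minutes.doc')  # placeholder path
doc.SaveAs(FileName=r'D:\temp\minutes.html', FileFormat=8)  # 8 = wdFormatHTML
doc.Close()
word.Quit()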
