I am trying to scrape results from a site (no captcha, simple roll-number authentication, and the roll-number pattern is known to me). The problem is that the results are in a table and different students have different subjects. The code I have written so far in Python is:
for row in rows:
    cols = row.findAll('td')                 # list of <td> cells (BeautifulSoup)
    sub = cols[1].text.encode('utf-8')       # subject name (header)
    subjectname.append(sub)
    marks = cols[4].text.encode('utf-8')     # marks for that subject
    markall.append(marks)

# write once, after the loop; otherwise every row appends another copy
csvwriter.writerow(subjectname)
csvwriter.writerow(markall)
I want to generate a .csv file so that I can do some data analysis on it. The problem is that I want a table with one column per subject holding that subject's marks, but the scraper won't know when it has hit a different subject and will append the marks of whatever subject it finds in that row/column pair.
How do I approach this?
Here's a visual representation of the problem.
So if I have Subject A in column 1, I want to get the marks of Subject A only and not of any other subject. Do I need to create a list for every subject's marks?
Edit: Here's the HTML table markup: https://jsfiddle.net/rpmgku7m/
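One way to approach this, as a minimal sketch rather than a tested solution: build a {subject: marks} dict per student and let csv.DictWriter align the columns, so every subject gets one fixed column and students who lack a subject simply get a blank cell. The name student_tables (one parsed result table per student) is a hypothetical stand-in for however you loop over roll numbers:

import csv

# one {subject: marks} dict per student; 'student_tables' is hypothetical
records = []
all_subjects = set()
for table in student_tables:
    record = {}
    for row in table.find_all('tr')[1:]:                # skip the header row
        cols = row.find_all('td')
        subject = cols[1].get_text(strip=True)          # subject name
        record[subject] = cols[4].get_text(strip=True)  # marks for that subject
        all_subjects.add(subject)
    records.append(record)

with open('results.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=sorted(all_subjects), restval='')
    writer.writeheader()                                # one column per subject
    writer.writerows(records)                           # blanks where a subject is missing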
I'm rewriting this question because the first time the problem was not clear.
I'm trying to extract the output of a SOAP request using Python.
The output is structured with a lot of nested subcategories, and that is where I can't extract the information properly.
The output I have is very long, so here is a short example from soapenv:Body.
The original SOAP body:
<mc:ocb-rule id="boic">
  <mc:id>boic</mc:id>
  <mc:ocb-conditions>
    <mc:rule-deactivated>true</mc:rule-deactivated>
    <mc:international>true</mc:international>
  </mc:ocb-conditions>
  <mc:cb-actions>
    <mc:allow>false</mc:allow>
  </mc:cb-actions>
</mc:ocb-rule>
To turn the output into a dictionary, I used:
xmltodict.parse(soap_response)
which produced:
OrderedDict([('#id', 'boic'),
             ('mc:id', 'boic'),
             ('mc:ocb-conditions',
              OrderedDict([('mc:rule-deactivated', 'true'),
                           ('mc:international', 'true')])),
             ('mc:cb-actions',
              OrderedDict([('mc:allow', 'false')]))])
If there is an easier way, I would appreciate guidance
As I mentioned, my goal is ultimately to get each category as its own value; where there is a subcategory, it should be added as a separate category.
In the end I have to put all the values into a DataFrame and display all the categories and their values.
For example: [table example]
I really hope that this time I was able to explain myself properly
Thanks in advance
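For what it's worth, a minimal sketch of one way to get there, assuming the parsed dict looks like the one above (note that xmltodict prefixes attributes with '@' by default, so the first key may come out as '@id' rather than '#id'): recursively flatten the nested dict so every subcategory becomes its own 'parent.child' column, then hand the flat dict to pandas:

import xmltodict
import pandas as pd

def flatten(d, parent_key=''):
    # recursively flatten a nested dict; nested keys become 'parent.child'
    items = {}
    for key, value in d.items():
        name = parent_key + '.' + key if parent_key else key
        if isinstance(value, dict):
            items.update(flatten(value, name))
        else:
            items[name] = value
    return items

parsed = xmltodict.parse(soap_response)   # soap_response as in the question
rule = parsed['mc:ocb-rule']              # top-level key assumed from the snippet
df = pd.DataFrame([flatten(rule)])        # one row, one column per (sub)category
print(df)

pandas.json_normalize(parsed, sep='.') does roughly the same flattening, if you prefer a one-liner.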
My bank only gives me my activity statement as a PDF, whereas business clients can get a .csv.
I need to extract data from the relevant table in that multi-page PDF; the relevant table sits below the bank's letterhead, and each page is prepended with a 1-row, 6-column information table.
The cells look like this:

date | transaction information | debit | credit
13.09./13.09. | payment by card ACME-SHOP Reference {SOME 26 digits code} {SHOP OWNER WITH LINE-B\\REAK} Terminal {some 8 digits code} {DATE-FORMAT LIKE THIS: 2022-09-06T14:25:11} SequenceNr. 012 Expirationd. {4 digit code} | -312,12 |
After the last such row, two tables follow in a horizontal split:

table 1: 3-row, 2-column table with information on your account
table 2: 3-row, 2-column table with the new balance
I have taken a look at the tabula library for Python, but it doesn't seem to offer any option for pattern matching or the like.
So my question is whether there is a more sophisticated open-source solution to this problem. It doesn't have to be in Python, but I took a guess that the Python community would be a good place to start looking for this kind of extraction tool.
Otherwise I guess I have to extract the columns and then do pattern matching to re-assemble the data.
Thanks
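Not a full answer, but a rough sketch of the pattern-matching route using pdfplumber (open source), under two assumptions taken from the sample above: every transaction starts with a date pair like 13.09./13.09., and amounts look like -312,12. The file name statement.pdf is a placeholder:

import re
import pdfplumber

DATE = re.compile(r'^\d{2}\.\d{2}\./\d{2}\.\d{2}\.')   # e.g. 13.09./13.09.
AMOUNT = re.compile(r'-?\d{1,3}(?:\.\d{3})*,\d{2}$')   # e.g. -312,12

transactions, current = [], None
with pdfplumber.open('statement.pdf') as pdf:
    for page in pdf.pages:
        for line in (page.extract_text() or '').splitlines():
            if DATE.match(line):                       # a new transaction begins
                current = {'date': DATE.match(line).group(0), 'text': line}
                transactions.append(current)
            elif current is not None:
                current['text'] += ' ' + line          # continuation lines
            m = AMOUNT.search(line)
            if m and current is not None:
                current['amount'] = m.group(0)

for t in transactions:
    print(t.get('date'), t.get('amount'))

camelot is another open-source option if the tables have ruling lines, but for statements like this one, regex over the extracted text is often the more robust route.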
I'm still a newbie in Python and having a hard time trying to code something.
I have a list with more than 80k URLs, and this is the only thing in my .xls; the URLs look like this:
https://domainexample.com/user-query/credit-card-debit-balance/
https://domainexample.com/user-query/second-invoice-current-debt/
https://domainexample.com/user-query/query-balances/
https://domainexample.com/user-query/where-is-client-portal/
https://domainexample.com/user-query/i-want-to-change-my-password/
https://domainexample.com/user-query/second-invoice-internet/
https://domainexample.com/user-query/print-payment-invoice/
I want to write code that reads this Excel file and, based on certain categories I already wrote, puts the URLs into other columns.
So whenever the code finds "password" it will put that URL in the "password" column, and when it finds "user" it will put the URL in the "user" column.
It would look like this:
debt: https://domainexample.com/user-query/second-invoice-current-debt/
password: https://domainexample.com/user-query/i-want-to-change-my-password/
payment: https://domainexample.com/user-query/print-payment-invoice/
The code doesn't necessarily need to move the URLs to other columns; if it can create a second column stating which category each URL belongs to, that would also be great.
There is no need for the code to fetch the URLs, just to read the Excel file; the URLs can be treated as plain text.
If anyone can help me, thanks a lot!
Try this, where df is your dataframe and 'url_column' is the column with all your URLs. str.contains matches the keyword anywhere in the URL, instead of the exact-equality test, which would never match your real URLs:

df.loc[df['url_column'].str.contains('car'), 'car'] = df['url_column']
df.loc[df['url_column'].str.contains('bike'), 'bike'] = df['url_column']
df.loc[df['url_column'].str.contains('van'), 'van'] = df['url_column']
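If you would rather have a single extra column naming the category, here is a self-contained sketch; the keyword-to-category map and the file names are assumptions. Matching is done on the last URL segment only, because every URL contains 'user-query' and would otherwise always match the 'user' keyword:

import pandas as pd

# hypothetical keyword -> category map; extend with your own categories
KEYWORDS = {'debt': 'debt', 'password': 'password',
            'payment': 'payment', 'user': 'user'}

def categorize(url):
    slug = url.rstrip('/').rsplit('/', 1)[-1]   # e.g. 'i-want-to-change-my-password'
    for keyword, category in KEYWORDS.items():  # first matching keyword wins
        if keyword in slug:
            return category
    return 'uncategorized'

df = pd.read_excel('urls.xls', header=None, names=['url'])  # file name assumed
df['category'] = df['url'].apply(categorize)                # the second column
df.to_excel('urls_categorized.xlsx', index=False)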
I would like to scrape and parse the data on these two pages: here and here into a tab-delimited format using scrapy. I ran these commands:
scrapy shell
fetch("https://www.drugbank.ca/drugs/DB04899")
print(response.text)
My questions:
1. For example, for this page, when I type:
response.css(".sequence::text").extract()
[u'>DB04899: Natriuretic peptides B\nSPKMVQGSGCFGRKMDRISSSSGLGCKVLRRH']
But then when I type:
>>> response.css(".synonyms::text").extract()
[]
>>> response.css(".Synonyms::text").extract()
[]
But you can see that there are synonyms listed on the webpage, so the output should not be empty. Can someone show me what I'm doing wrong? (I also tried other class names such as synonym and Synonym.)
2. When I type response.css(".targets::text").extract(), the output is [u'Targets (3)']. I'm wondering how I can actually parse the data within this section, but I guess this is related to not using the right selectors, as in question 1.
3. This is a vague/advanced question for me at the minute: is it possible to scrape the whole page in one go, instead of having to know each individual tag? The output would be a dictionary called 'identification' with name, accession number, type, etc. as keys; then a dictionary called 'pharmacology' with indication, structured indication, etc. as keys; then another dictionary called 'interactions', another called 'pharmacoeconomics', and so on, one dictionary per page section.
Thanks
There are really no elements with synonyms or Synonyms class attribute value on the page.
You can get to the synonyms by "going to the right" of the dt element with the "Synonyms" text using following-sibling:
In [2]: response.xpath("//dt[. = 'Synonyms']/following-sibling::dd/ul/li/text()").extract()
Out[2]:
['BNP',
'Brain natriuretic peptide 32',
'Natriuretic peptides B',
'Nesiritide recombinant']
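For question 3, a sketch that generalizes the same following-sibling idea: pair every dt with its first following dd and build one dict per definition list. That each page section is rendered as a dl of dt/dd pairs is an assumption based on the Synonyms query above, so verify it against the page source:

def dl_to_dict(response, dl_xpath):
    # turn one <dl> section into a {label: text} dict
    section = {}
    for dt in response.xpath(dl_xpath + '/dt'):
        key = dt.xpath('normalize-space(.)').extract_first()
        value = dt.xpath('normalize-space(following-sibling::dd[1])').extract_first()
        section[key] = value
    return section

# e.g. the first definition list on the page (a hypothetical section choice)
identification = dl_to_dict(response, '//dl[1]')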
I want to extract the FASTA files that have the amino acid sequences from the Moonlighting Protein Database (www.moonlightingproteins.org/results.php?search_text=) via Python, since it's an iterative process that I'd rather learn how to program than do manually, because come on, we're in 2016. The problem is I don't know how to write the code, because I'm a rookie programmer. The basic pseudocode would be:
for protein_name in site www.moonlightingproteins.org/results.php?search_text=:
    go to the uniprot option
    download the fasta file
    store it in a .txt file inside a given folder
Thanks in advance!
I would strongly suggest asking the authors for the database. From the FAQ:
I would like to use the MoonProt database in a project to analyze the amino acid sequences or structures using bioinformatics.
Please contact us at bioinformatics@moonlightingproteins.org if you are interested in using the MoonProt database for analysis of sequences and/or structures of moonlighting proteins.
Assuming you find something interesting, how are you going to cite it in your paper or your thesis?
"The sequences were scraped from a public webpage without the consent of the authors". Much better to give credit to the original researchers.
That's a good introduction to scraping
But back to your original question:
import requests
from lxml import html

# let's download one protein at a time; change 3 to any other id
page = requests.get('http://www.moonlightingproteins.org/detail.php?id=3')
# convert the HTML document to something we can parse in Python
tree = html.fromstring(page.content)
# get all table cells
cells = tree.xpath('//td')
for i, cell in enumerate(cells):
    if cell.text:
        # if we get something which looks like a FASTA sequence, print it
        if cell.text.startswith('>'):
            print(cell.text)
        # if we find a table cell which has UniProt in it,
        # let's print the link from the next cell
        if 'UniProt' in cell.text_content():
            if cells[i + 1].find('a') is not None and 'href' in cells[i + 1].find('a').attrib:
                print(cells[i + 1].find('a').attrib['href'])
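And to match the original goal (download the FASTA and store it in a .txt file per protein), a small extension of the loop above; the id range and the folder name are assumptions, since I don't know how many entries the database has:

import os
import requests
from lxml import html

os.makedirs('fasta', exist_ok=True)
for protein_id in range(1, 11):                 # assumed range; adjust as needed
    url = 'http://www.moonlightingproteins.org/detail.php?id={}'.format(protein_id)
    tree = html.fromstring(requests.get(url).content)
    for cell in tree.xpath('//td'):
        if cell.text and cell.text.startswith('>'):
            # one .txt file per protein, named after its database id
            with open('fasta/{}.txt'.format(protein_id), 'w') as f:
                f.write(cell.text)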