I am creating a news feed scraper so I can collate my favourite football teams' news daily. I'm an apprentice developer and I thought building it would increase my knowledge. It's just a simple thing that scans one or two sites for headlines and returns the text of those headlines. I have downloaded Python, gained a bit of knowledge around Beautiful Soup methods, managed to find a path directly to each headline on my chosen site, and stored these in a list:
`page_soup = soup(page_html, "html.parser")  # "parses" the stored data (page_html)`
`allHeadlines = page_soup.findAll(class_="lakeside__title-text")  # finds all titles on the BBC Liverpool sport page`
`headline1 = allHeadlines[0]  # create a single entry called "headline1" from the first slot in our search results`
`headline1.text  # prints the "headline1" string to show it's working, e.g. "What do you know about Dalglish?" (my result)`
But now I am puzzled as to how to create the loop needed to store and display the data.
for item in allHeadlines:
    # something here. I'm a noob, so all I know around this is usually item = item + 1
Then print to a file, etc.
Any reading material for me around this topic would be greatly appreciated
Sorry for any editing issues, this is my first ever post.
Assuming allHeadlines is a list of objects (each of which has a .text attribute),
we can create a list of the headline text with a list comprehension, for display or for writing to a file.
text_headlines = [item.text for item in allHeadlines if item.text]
print(text_headlines)
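If you also want the "print to file" part, here is a minimal sketch building on that list (the filename is just an example):
# write one headline per line to a text file
with open("headlines.txt", "w", encoding="utf-8") as f:
    for headline in text_headlines:
        f.write(headline + "\n")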
I'm still a beginner, so maybe the answer is very easy, but I could not find a solution online (at least not one I could understand).
Currently I am learning famous works of art through the app "Anki", so I imported a deck for it from online containing over 700 pieces.
Sadly the names of the pieces are in English and I would like to learn them in my native language (German). So I wanted to write a script to automate the process of translating all the names inside the app. I started out by creating a dictionary with every artist and their art pieces (filling this dictionary automatically by reading the app is a task for another time).
art_dictionary = {
"Wassily Kandinsky": "Composition VIII",
"Zhou Fang": "Ladies Wearing Flowers in Their Hair",
}
My plan is to access Wikipedia (or any other database for artworks) that stores the German name of the painting (because translating it with an English-German dictionary often returns wrong results, since the German title can vary drastically):
replacing every space character inside the name with an underscore
letting Python access the Wikipedia page of said painting:
import re
from urllib.request import urlopen
painting_name = "Composition_VIII" #this is manual input of course
url = "wikipedia.org/wiki/" + painting_name
page = urlopen(url)
somehow accessing the German version of the site and extracting the German name of the painting:
html = page.read().decode("utf-8")
pattern = "<title.*?>.*?</title.*?>" #I think Wikipedia stores the title like <i>Title</i>
match_results = re.search(pattern, html, re.IGNORECASE)
title = match_results.group()
title = re.sub("<.*?>", "", title)
storing it in a list or variable
inserting it into the Anki app
maybe this is impossible or "over-engineering", but I'm learning a lot along the way.
I tried to search for a solution online, but could not find anything similar to my problem.
You can use a dictionary comprehension with the str.replace method to update all the values (the names of the art pieces, in this case) of the dictionary.
art_dictionary = {
"Wassily Kandinsky": "Composition VIII",
"Zhou Fang": "Ladies Wearing Flowers in Their Hair",
}
art_dictionary = {key: value.replace(' ', '_') for key, value in art_dictionary.items()}
print(art_dictionary)
# Output: {'Wassily Kandinsky': 'Composition_VIII', 'Zhou Fang': 'Ladies_Wearing_Flowers_in_Their_Hair'}
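For the translation step itself, Wikipedia's interlanguage links are easier to query through the MediaWiki API than by scraping the HTML title. A minimal sketch, assuming the standard public API endpoint and the default JSON response layout (worth verifying against an actual response):
import requests

def german_title(english_title):
    # ask the English Wikipedia for the German ("de") interlanguage link
    params = {
        "action": "query",
        "titles": english_title,
        "prop": "langlinks",
        "lllang": "de",
        "format": "json",
    }
    data = requests.get("https://en.wikipedia.org/w/api.php", params=params).json()
    for page in data["query"]["pages"].values():
        links = page.get("langlinks")
        if links:
            return links[0]["*"]  # the linked German page title
    return None  # no German article found

print(german_title("Composition VIII"))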
I'm new to Python and Scrapy and thought I'd try out a simple review site to scrape. While most of the site structure is straightforward, I'm having trouble extracting the content of the reviews. This portion is visually laid out in sets of 3 (the text to the right of the 良 (good), 悪 (bad), and 感 (impressions) fields), but I'm having trouble pulling this content and associating it with a reviewer or section of a review due to the use of generic divs, <br> tags, \n, and other formatting.
Any help would be appreciated.
Here's the site and code I've tried for the grabbing them, with some results.
http://www.psmk2.net/ps2/soft_06/rpg/p3_log1.html
(1):
response.xpath('//tr//td[@valign="top"]//text()').getall()
This returns the entire set of reviews, but it contains newline markup and, more of a problem, it renders each line as a separate entry. Due to this, I can't figure out where the good, bad, and impression portions end, nor can I easily parse each separate review as entry length varies.
['\n弱点をついた時のメリット、つかれたときのデメリットがはっきりしてて良い', '\nコミュをあげるのが楽しい',
'\n仲間が多くて誰を連れてくか迷う', '\n難易度はやさしめなので遊びやすい', '\nタルタロスしかダンジョンが無くて飽きる。'........and so forth
(2) As an alternative, I tried:
response.xpath('//tr//td[@valign="top"]')[0].get()
Which actually comes close to what I'd like, save for the markup. Here it seems that it returns the entire field of a review section. Every third element should be the "good" points of each separate review (I've replaced the <> with () to show the raw return).
(td valign="top")\n精一杯考えました(br)\n(br)\n戦闘が面白いですね\n主人公だけですが・・・・(br)\n従来のプレスターンバトルの進化なので(br)\n(br)\n以上です(/td)
(3) Figuring I might be able to get just the text, I then tried:
response.xpath('//tr//td[@valign="top"]//text()')[0].get()
But that only provides each line at a time, with the \n at the front. As with (1), a line by line rendering makes it difficult to attribute reviews to reviewers and the appropriate section in their review.
From these (2) seems the closest to what I want, and I was hoping I could get some direction in how to grab each section for each review without the markup. I was thinking that since these sections come in sets of 3, if these could be put in a list that would make pulling them easier in the future (i.e. all "good" reviews follow 0, 0+3; all "bad" ones 1, 1+3 ... etc.)...but first I need to actually get the elements.
I've thought about, and tried, iterating over each line with an "if" conditional (something like:)
i = 0
if i <= len(response.xpath('//tr//td[@valign="top"]//text()').getall()):
    yield {response.xpath('//tr//td[@valign="top"]')[i].get()}
    i + 1
to pull these out, but I'm a bit lost on how to implement something like this. Not sure where it should go. I've briefly looked at Item Loader, but as I'm new to this, I'm still trying to figure it out.
Here's the block where the review code is.
def parse(self, response):
    for table in response.xpath('body'):
        yield {
            # code for other elements in the review
            'date': response.xpath('//td//div[@align="left"]//text()').getall(),
            'name': response.xpath('//td//div[@align="right"]//text()').getall(),
            # this includes the above elements, and is regular enough that I can systematically extract what I want
            'categories': response.xpath('//tr//td[@class="koumoku"]//text()').getall(),
            'scores': response.xpath('//tr//td[@class="tokuten_k"]//text()').getall(),
            'play_time': response.xpath('//td[@align="right"]//span[@id="setumei"]//text()').getall(),
            # reviews code here
        }
Pretty simple task using a part of the text as an anchor (I used string() to get the text content of a whole td):
for review_node in response.xpath('//table[@width="645"]'):
    good = review_node.xpath('string(.//td[b[starts-with(., "良")]]/following-sibling::td[1])').get()
    bad = review_node.xpath('string(.//td[b[starts-with(., "悪")]]/following-sibling::td[1])').get()
...............
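Filling in the third field and tidying the whitespace, the reviews part of parse() might look like the sketch below (the review_field helper and the strip/splitlines cleanup are my own additions, not something the site or Scrapy dictates):
def review_field(review_node, label):
    # take the td that follows the bold label cell and flatten it to plain text
    raw = review_node.xpath(
        'string(.//td[b[starts-with(., "%s")]]/following-sibling::td[1])' % label
    ).get() or ''
    # collapse the \n-separated lines into one cleaned-up string
    return ' '.join(line.strip() for line in raw.splitlines() if line.strip())

def parse(self, response):
    for review_node in response.xpath('//table[@width="645"]'):
        yield {
            'good': review_field(review_node, '良'),
            'bad': review_field(review_node, '悪'),
            'impressions': review_field(review_node, '感'),
        }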
EDIT: Thank you all for the very helpful answers. Indeed, as suggested in the responses to this post, school_list did not in fact contain hundreds of list items; it contained only four. This didn't stop school.text from grabbing all the hundreds of places within those four elements that included the text of a school name.
Original post:
I'm trying to iterate over each school name on a web page containing hundreds of school names, and append each school name to a list called list_of_names. I am using the Python library Selenium to access the web page and locate the HTML element which contains the list of school names.
driver.get('https://www.illinoisreportcard.com/SearchResult.aspx?SearchText=$high%20school$&type=NAME#High-schools')
school_list = driver.find_elements_by_class_name('container.col-sm-12.col-md-12')
list_of_names = []
for school in school_list:
    try:
        name = school.text
        print(name)
        list_of_names.append(name)
    except selenium.common.exceptions.NoSuchElementException:
        pass
As you can see below, where I've included the first three out of hundreds of results, the loop successfully prints the names of the schools plus grade information (which it has grabbed from each specified element of the HTML code).
ALLEN JUNIOR HIGH SCHOOL
(4 - 8)
LA MOILLE CUSD 303
(BUREAU)
LA MOILLE
CENTRALIA JR HIGH SCHOOL
(4 - 8)
The problem is that this line of code -- list_of_names.append(name) -- is not appending each school name as its own list item, separated by commas, as I would have expected. Instead, it is appending every school name to one single list item that merely grows longer and longer. And in place of where the commas should be, it is putting an '\n'.
Below is the first line of output of the command print(list_of_names):
['ALLEN JUNIOR HIGH SCHOOL\n(4 - 8)\nLA MOILLE CUSD 303\n(BUREAU)\nLA MOILLE\nCENTRALIA JR HIGH SCHOOL\n(4 - 8)\nCENTRALIA SD 135\n(MARION)\
(I have tried versions of this on smaller lists of elements outside of HTML and thus without the need for the Selenium try/except code at the very bottom here, and it worked. But that still doesn't get me any closer to being able to deploy this code on the web page with the school names.)
What is going on? Why isn't this code appending each school name to list_of_names as individual items in a list?
Appreciate any help!
The variable "school_list" is not a list rather it's a string.
So essentially the for loop runs only once. "\n" is an escape sequence for "new line", which is why you are getting the output in the print statement
If you want the varaible "list_of_names" to have elements as show in your print statement you can replace the for loop with
for school in school_list.split('\n'):
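Putting it together with the code from the question (a sketch; I've switched to the newer find_elements API and a CSS selector, which handles the compound class name explicitly):
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('https://www.illinoisreportcard.com/SearchResult.aspx?SearchText=$high%20school$&type=NAME#High-schools')

# this matches only a few big containers, so split each container's
# text into individual lines instead of appending it whole
school_list = driver.find_elements(By.CSS_SELECTOR, '.container.col-sm-12.col-md-12')

list_of_names = []
for school in school_list:
    for name in school.text.split('\n'):
        list_of_names.append(name)

print(list_of_names)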
I am new to Python and trying to scrape IMDB. I am scraping a list of the 250 top IMDB movies and want to get information from each movie's unique page, for example the length of each movie.
I already have a list of unique URLs. So, I want to loop over this list, and for every URL in the list I want to retrieve the 'length' of that movie. Is it possible to do this in one script?
for URL in urlofmovie:
    htmlsource = requests.get(URL)
    tree_url = html.fromstring(htmlsource)
    lengthofmovie = tree_url.xpath('//*[@class="subtext"]')
I expect that lengthofmovie will become a list of all the lengths of the movies. However, it already goes wrong at line 2: the htmlsource.
To make it a list you should first create a list and then append each length to that list.
import requests
from lxml import html

length_list = []
for URL in urlofmovie:
    htmlsource = requests.get(URL)
    # requests.get returns a Response object, so hand lxml the raw page bytes
    tree_url = html.fromstring(htmlsource.content)
    length_list.append(tree_url.xpath('//*[@class="subtext"]'))
Small tip: since you are new to Python, I would suggest you go over the PEP 8 conventions. Better variable naming can make your (and other developers') lives easier. (urlofmovie -> urls_of_movies)
However, it already goes wrong at line 2: the htmlsource.
Please provide the exception you are receiving.
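For what it's worth, a likely culprit (an assumption, since no traceback was posted): requests.get returns a Response object, not a string, so lxml cannot parse it directly. Passing the response body fixes it:
htmlsource = requests.get(URL)
tree_url = html.fromstring(htmlsource.content)  # .content is the raw bytes of the page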
I want to extract the FASTA files that have the amino acid sequences from the Moonlighting Protein Database (www.moonlightingproteins.org/results.php?search_text=) via Python, since it's an iterative process which I'd rather learn how to program than do manually, b/c come on, we're in 2016. The problem is I don't know how to write the code, because I'm a rookie programmer :( . The basic pseudocode would be:
for protein_name in site: www.moonlightingproteins.org/results.php?search_text=:
    go to the uniprot option
    download the fasta file
    store it in a .txt file inside a given folder
Thanks in advance!
I would strongly suggest asking the authors for the database. From the FAQ:
I would like to use the MoonProt database in a project to analyze the
amino acid sequences or structures using bioinformatics.
Please contact us at bioinformatics@moonlightingproteins.org if you are
interested in using MoonProt database for analysis of sequences and/or
structures of moonlighting proteins.
Assuming you find something interesting, how are you going to cite it in your paper or your thesis?
"The sequences were scraped from a public webpage without the consent of the authors". Much better to give credit to the original researchers.
That's a good introduction to scraping
But back to your original question.
import requests
from lxml import html
#let's download one protein at a time, change 3 to any other number
page = requests.get('http://www.moonlightingproteins.org/detail.php?id=3')
#convert the html document to something we can parse in Python
tree = html.fromstring(page.content)
#get all table cells
cells = tree.xpath('//td')
for i, cell in enumerate(cells):
    if cell.text:
        # if we get something which looks like a FASTA sequence, print it
        if cell.text.startswith('>'):
            print(cell.text)
    # if we find a table cell which has UniProt in it,
    # let's print the link from the next cell
    if 'UniProt' in cell.text_content():
        if cells[i + 1].find('a') is not None and 'href' in cells[i + 1].find('a').attrib:
            print(cells[i + 1].find('a').attrib['href'])
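To cover the "store it in a .txt file inside a given folder" step from the question, the same loop can write each sequence out instead of printing it. A minimal sketch (the id range and folder name are assumptions; I don't know how many detail pages the site actually has):
import os
import requests
from lxml import html

os.makedirs('fasta_files', exist_ok=True)  # the target folder; an example name

for protein_id in range(1, 11):  # example range; adjust to the real number of entries
    page = requests.get('http://www.moonlightingproteins.org/detail.php?id=%d' % protein_id)
    tree = html.fromstring(page.content)
    for cell in tree.xpath('//td'):
        # write anything that looks like a FASTA sequence to its own file
        if cell.text and cell.text.startswith('>'):
            path = os.path.join('fasta_files', 'protein_%d.txt' % protein_id)
            with open(path, 'w') as f:
                f.write(cell.text)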