I want to scrape review data from a website using scrapy. The code is given below.
The problem is that each time the program goes to the next page, it starts at the beginning (due to the callback) and resets records[]. The array is empty again and every review that was saved in records[] is lost, so when I open my CSV file I only get the reviews of the last page.
What I want is for all the data to be stored in my CSV file, so that records[] does not keep resetting each time a next page is requested. I can't put the line records = [] before the parse method, because then the array is not defined.
Here is my code:
def parse(self, response):
    records = []
    for r in response.xpath('//div[contains(@class, "a-section review")]'):
        rtext = r.xpath('.//div[contains(@class, "a-row review-data")]').extract_first()
        rating = r.xpath('.//span[contains(@class, "a-icon-alt")]/text()').extract_first()
        votes = r.xpath('normalize-space(.//span[contains(@class, "review-votes")]/text())').extract_first()
        if not votes:
            votes = "none"
        records.append((rating, votes, rtext))
    print(records)
    nextPage = response.xpath('//li[contains(@class, "a-last")]/a/@href').extract_first()
    if nextPage:
        nextPage = response.urljoin(nextPage)
        yield scrapy.Request(url=nextPage)
    import pandas as pd
    df = pd.DataFrame(records, columns=['rating', 'votes', 'rtext'])
    df.to_csv('ama.csv', sep='|', index=False, encoding='utf-8')
Moving the records declaration into the method signature exploits a common gotcha in Python, outlined here in the Python docs. In this instance, however, the otherwise weird behaviour of instantiating a list in a method declaration works in your favor.
Python's default arguments are evaluated once when the function is defined, not each time the function is called (like they are in, say, Ruby). This means that if you use a mutable default argument and mutate it, you have mutated that object for all future calls to the function as well.
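As a quick toy illustration of that behaviour (not part of the spider, just a sketch):

def append_to(value, acc=[]):
    # 'acc' is created once, when the function is defined, and reused on every call
    acc.append(value)
    return acc

print(append_to(1))  # [1]
print(append_to(2))  # [1, 2] -- the same list object persists between calls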
def parse(self, response, records=[]):
    for r in response.xpath('//div[contains(@class, "a-section review")]'):
        rtext = r.xpath('.//div[contains(@class, "a-row review-data")]').extract_first()
        rating = r.xpath('.//span[contains(@class, "a-icon-alt")]/text()').extract_first()
        votes = r.xpath('normalize-space(.//span[contains(@class, "review-votes")]/text())').extract_first()
        if not votes:
            votes = "none"
        records.append((rating, votes, rtext))
    print(records)
    nextPage = response.xpath('//li[contains(@class, "a-last")]/a/@href').extract_first()
    if nextPage:
        nextPage = response.urljoin(nextPage)
        yield scrapy.Request(url=nextPage)
    import pandas as pd
    df = pd.DataFrame(records, columns=['rating', 'votes', 'rtext'])
    df.to_csv('ama.csv', sep='|', index=False, encoding='utf-8')
The above method is a little weird. A more general solution would be to simply use a global variable. Here is a post going over how to use globals.
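A minimal sketch of the global-variable approach, assuming the rest of the spider is unchanged (the spider class name here is just a placeholder):

import scrapy

records = []  # module-level, so it is not reset on each callback

class ReviewSpider(scrapy.Spider):
    name = 'reviews'

    def parse(self, response):
        for r in response.xpath('//div[contains(@class, "a-section review")]'):
            rating = r.xpath('.//span[contains(@class, "a-icon-alt")]/text()').extract_first()
            records.append(rating)  # appending mutates the global list; no 'global' statement needed
        nextPage = response.xpath('//li[contains(@class, "a-last")]/a/@href').extract_first()
        if nextPage:
            yield scrapy.Request(response.urljoin(nextPage))

An instance attribute (self.records) set up in the spider's __init__ would work just as well and avoids the module-level global.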
Here parse is a callback that is called again for every page. Try defining records globally, or write an appender function and call it to append values.
Also, Scrapy is capable of generating CSV itself. Here's my little experiment with scraping - https://gist.github.com/lisitsky/c4aac52edcb7abfd5975be067face1bb
So you can have Scrapy write the data to CSV, and pandas can then read it.
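A minimal sketch of that idea: yield each review as a dict from parse and let Scrapy's feed export write the file (this assumes the same XPaths as in the question):

def parse(self, response):
    for r in response.xpath('//div[contains(@class, "a-section review")]'):
        yield {
            'rating': r.xpath('.//span[contains(@class, "a-icon-alt")]/text()').extract_first(),
            'votes': r.xpath('normalize-space(.//span[contains(@class, "review-votes")]/text())').extract_first() or "none",
            'rtext': r.xpath('.//div[contains(@class, "a-row review-data")]').extract_first(),
        }
    nextPage = response.xpath('//li[contains(@class, "a-last")]/a/@href').extract_first()
    if nextPage:
        yield scrapy.Request(response.urljoin(nextPage))

Running the spider with scrapy crawl <spider_name> -o ama.csv then collects every yielded item into one CSV, so no manual accumulation is needed.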
I am wondering how I can print a certain value of a variable inside a text string.
Here is my code.
import requests
import time
import json

urls = ["https://api.opensea.io/api/v1/collection/doodles-official/stats",
        "https://api.opensea.io/api/v1/collection/boredapeyachtclub/stats",
        "https://api.opensea.io/api/v1/collection/mutant-ape-yacht-club/stats",
        "https://api.opensea.io/api/v1/collection/bored-ape-kennel-club/stats"]

for url in urls:
    response = requests.get(url)
    json_data = json.loads(response.text)
    data = json_data["stats"]["floor_price"]
    print("this is the floor price of {} of doodles!".format(data))
    print("this is the floor price of {} of Bored Apes!".format(data))
As you can see in the code, I am using the OpenSea API to extract the floor price value from the JSON returned for each of the NFT collections listed in the urls.
If you print the variable data, you get a different price on each iteration of the loop.
The question is: how can I take a specific value of that output, let's say 7.44 for example, and print it in the following text
print("this is the floor price of {} of doodles!".format(data))
and then use the second value, 85, and print another text such as:
print("this is the floor price of {} of Bored Apes!".format(data))
How can I generate this output?
Thanks everyone!
I tried using data[0] or data[1], but that doesn't work.
The easiest way to do this in my opinion is to make another list, and add all the data to that list. For example, you can say data_list = [] to define a new list.
Then, in your for loop, add data_list.append(data), which adds the data to your list. Now, you can use data_list[0] and data_list[1] to access the different data.
The full code would look something like this:
import requests
import time
import json

urls = ["https://api.opensea.io/api/v1/collection/doodles-official/stats",
        "https://api.opensea.io/api/v1/collection/boredapeyachtclub/stats",
        "https://api.opensea.io/api/v1/collection/mutant-ape-yacht-club/stats",
        "https://api.opensea.io/api/v1/collection/bored-ape-kennel-club/stats"]

data_list = []

for url in urls:
    response = requests.get(url)
    json_data = json.loads(response.text)
    data = json_data["stats"]["floor_price"]
    data_list.append(data)

print("this is the floor price of {} of doodles!".format(data_list[0]))
print("this is the floor price of {} of Bored Apes!".format(data_list[1]))
The reason you cannot use data[0] and data[1] is that the variable data is overwritten on each iteration of the for loop; nothing is ever accumulated.
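As a small variation on the same idea, you could pair each price with a readable label as you go (the label list below is just an assumption based on the URLs):

# assumed labels, in the same order as the urls list
names = ["doodles", "Bored Apes", "Mutant Apes", "Bored Ape Kennel Club"]

for name, url in zip(names, urls):
    floor_price = requests.get(url).json()["stats"]["floor_price"]
    print("this is the floor price of {} of {}!".format(floor_price, name))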
I'm trying to extract coin names, prices, and market caps from coinmarketcap.com. I first tried using soup.find_all to search for certain tags with a specific class, but it always picked up information I didn't need or want. So instead I used find_all to search for 'td' and then planned on using a for loop to look for specific class names, append those to a new list, and print that list, but it returns a data type for some reason.
coin_table = soup.find_all('td')
class_value = 'sc-1eb5slv-0 iJjGCS'
for i in coin_table:
    if class_value in coin_table:
        list.append(i)
print(list)
But this returns:
<class 'list'>
to the console, even though I'm not asking to see the data type. I'm very new to BeautifulSoup and coding in general, so sorry if this is a very basic question. Still trying to get my head around all of this.
As @RJAdriaansen mentioned, you don't need to scrape the website when they provide an API. Here is how you do it with the requests library:
import requests
url = 'https://api.coinmarketcap.com/data-api/v3/cryptocurrency/listing?start=1&limit=100&sortBy=market_cap&sortType=desc&convert=USD,BTC,ETH&cryptoType=all&tagType=all&audited=false&aux=ath,atl,high24h,low24h,num_market_pairs,cmc_rank,date_added,tags,platform,max_supply,circulating_supply,total_supply,volume_7d,volume_30d'
response = requests.get(url)
data = response.json()
This will give you JSON data. Now you can grab everything you need by accessing the correct keys:
final_list = []
temp = []

for each_crypto in data['data']['cryptoCurrencyList']:
    temp.append(each_crypto['name'])
    # each_crypto['quotes'] gives you the price and market cap of each crypto
    for quote in each_crypto['quotes']:
        # assuming you want the USD price of each crypto
        if quote['name'] == "USD":
            temp.append(quote['price'])
            temp.append(quote['marketCap'])
    final_list.append(temp)
    temp = []
The final result would look like this:
[
['Bitcoin', 34497.01819639692, 646704595579.0485],
['Ethereum', 2195.11816422801, 255815488972.87268],
['Tether', 1.0003936138399, 62398426501.02027],
['Binance Coin', 294.2550537711805, 45148405357.003],
...
]
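If you'd rather print one line per coin, similar to what the question was aiming for (just a formatting suggestion, not part of the API response handling above):

for name, price, market_cap in final_list:
    print("{}: price {:.2f} USD, market cap {:.0f} USD".format(name, price, market_cap))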
I'm trying to do a pretty basic web scrape but I'd like to be able to use variables so I don't have to keep repeating code for multiple pages.
example line of code:
elems = results.find_all("span", class_="st_2XVIMK7 st_8u0ePN3 st_2oUi2Vb st_3kXJm4P")
I would like it to read as:
elems = results.find_all(<variable>)
and then have my variable be:
'"span", class_="st_2XVIMK7 st_8u0ePN3 st_2oUi2Vb st_3kXJm4P"'
However, when I do this, I get no results. I've included the rest of the function below. Does anyone know why this will not work?
EDIT:
I've also tried splitting it up like the example below, but I still get the same issue:
elems = results.find_all(variable1 , class_=variable2)
variable1 = '"span"'
variable2 = '"st_2XVIMK7 st_8u0ePN3 st_2oUi2Vb st_3kXJm4P"'
code:
def get_name(sym, result, elem, name):
    url = URL + sym
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    results = soup.find(id=result)
    elems = results.find_all(elem)
    for elem in elems:
        name_elem = elem.find(name)
        print(name_elem.text)

get_name('top', "app", '"span",class_="st_2XVIMK7 st_8u0ePN3 st_2oUi2Vb st_3kXJm4P"', '"span","st_3lrv4Jo"')
The find_all method takes more than one parameter.
You are passing everything as a single string in the first argument of the method, which won't match anything.
You will need to split the variable up: your variable '"span", class_="st_2XVIMK7 st_8u0ePN3 st_2oUi2Vb st_3kXJm4P"' needs to become two variables,
elem = "span" and class_value = "st_2XVIMK7 st_8u0ePN3 st_2oUi2Vb st_3kXJm4P"
and in your code the call will look like
elems = results.find_all(elem, class_=class_value)
Just a few more things:
According to the documentation (https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find-all) and what I can find online, the class filter can also be given as a dict with a list of strings for multiple class values, so your call would look more like
findAll(elem, {'class':['st_2XVIMK7', 'st_8u0ePN3', 'st_2oUi2Vb', 'st_3kXJm4P']})
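A rough sketch of how the original function could accept the tag name and the class string as separate parameters (the parameter names are just placeholders):

def get_name(sym, result_id, elem_tag, elem_class, name_tag, name_class):
    url = URL + sym
    soup = BeautifulSoup(requests.get(url).content, 'html.parser')
    results = soup.find(id=result_id)
    # tag name and class are passed as separate arguments, not one big string
    for elem in results.find_all(elem_tag, class_=elem_class):
        name_elem = elem.find(name_tag, class_=name_class)
        print(name_elem.text)

get_name('top', 'app', 'span', 'st_2XVIMK7 st_8u0ePN3 st_2oUi2Vb st_3kXJm4P', 'span', 'st_3lrv4Jo')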
I'm trying to extract Pubmed IDs from Pubmed URLs, using https://pypi.org/project/pymed/.
Problem: despite proper inputs throughout my for loop, my PubMed query is apparently only being performed on the first iteration. Please help.
url_list = ...  # a pd.Series of URLs, dtype='str'
ID_list = list(range(0, len(url_list)))

for i in url_list:
    spider = Pubmed("MyTool", my_email)
    url = url_list.iloc[i].to_string(index=False)
    print(url)  # CORRECTLY ITERATES
    if re.search(pattern, url):
        lookup = spider.query(url)  # unique <class itertools.chain object>
        results = list(lookup)      # unique <'pymed.article.PubMedArticle' object>
        ID = results.pop().pubmed_id
        # OR
        ID = results[0].pubmed_id
        print(ID)  # RETURNS ONLY THE FIRST ID
        ID_list[i] = ID
    else:
        ID_list[i] = None
    print("Extracted " + url + " with ID: " + ID)
I've tried setting the spider and lookup vars to None at the end of the whole loop, as well as using "del VAR" on both of them.
Nothing works. For whatever reason, the spider.query() method only pulls from the first URL it is fed. Note that I also tried putting the Pubmed() spider outside the for loop, where it probably should go, but this was me trying to be thorough.
Thanks so much for the help, and I apologize for any issues reproducing this, or with anything stylistic that causes you pain and suffering.
I am currently working on a script to scrape data from ClinicalTrials.gov. To do this I have written the following script:
def clinicalTrialsGov(id):
    url = "https://clinicaltrials.gov/ct2/show/" + id + "?displayxml=true"
    data = BeautifulSoup(requests.get(url).text, "lxml")
    studyType = data.study_type.text
    if studyType == 'Interventional':
        allocation = data.allocation.text
        interventionModel = data.intervention_model.text
        primaryPurpose = data.primary_purpose.text
        masking = data.masking.text
    enrollment = data.enrollment.text
    officialTitle = data.official_title.text
    condition = data.condition.text
    minAge = data.eligibility.minimum_age.text
    maxAge = data.eligibility.maximum_age.text
    gender = data.eligibility.gender.text
    healthyVolunteers = data.eligibility.healthy_volunteers.text
    armType = []
    intType = []
    for each in data.findAll('intervention'):
        intType.append(each.intervention_type.text)
    for each in data.findAll('arm_group'):
        armType.append(each.arm_group_type.text)
    citedPMID = tryExceptCT(data, '.results_reference.PMID')
    citedPMID = data.results_reference.PMID
    print(citedPMID)
    return officialTitle, studyType, allocation, interventionModel, primaryPurpose, masking, enrollment, condition, minAge, maxAge, gender, healthyVolunteers, armType, intType
However, the above script won't always work, as not all studies will have all items (i.e. a KeyError will occur). To resolve this, I could simply wrap each statement in a try-except catch, like this:
try:
    studyType = data.study_type.text
except:
    studyType = ""
but it seems like a bad way to implement this. What's a better/cleaner solution?
This is a good question. Before I address it, let me say that you should consider changing the second parameter to the BeautifulSoup (BS) constructor from lxml to xml. Otherwise, BS does not flag the parsed markup as XML (to verify this for yourself, access the is_xml attribute on the data variable in your code).
You can avoid generating an error when attempting to access a non-existent element by passing a list of desired element names to the find_all() method:
subset = ['results_reference', 'allocation', 'intervention_model', 'primary_purpose', 'masking', 'enrollment', 'eligibility', 'official_title', 'arm_group', 'condition']
tag_matches = data.find_all(subset)
Then, if you want to get a specific element from the list of Tags without iterating through it, you can convert it to a dict using the Tag names as keys:
tag_dict = dict((tag_matches[i].name, tag_matches[i]) for i in range(0, len(tag_matches)))
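A brief sketch of how that dict could then be used for lookups that don't raise when a tag is absent (the empty-string fallback is just an assumption on my part):

# .get() returns None for tags that were not present in this study record
title_tag = tag_dict.get('official_title')
officialTitle = title_tag.text if title_tag is not None else ""

enrollment_tag = tag_dict.get('enrollment')
enrollment = enrollment_tag.text if enrollment_tag is not None else ""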