Avoiding rate limit error using tweepy - python

I am trying to get some basic information on all the Twitter friends that a particular user is following. I am using a for loop to get this information, but if the user has many friends I get a rate limit error. I am trying, and struggling, to work a way around the rate limit into my for loop. Thank you for any suggestions!
My original code:
data = []
for follower in followers:
    carrot = api.get_user(follower)
    data.append("[")
    data.append(carrot.screen_name)
    data.append(carrot.description)
    data.append("]")
My attempt at getting around rate limit error:
data = []
for follower in followers:
    carrot = api.get_user(follower, include_entities=True)
    while True:
        try:
            data.append("[")
            data.append(carrot.screen_name)
            data.append(carrot.description)
            data.append("]")
        except tweepy.TweepError:
            time.sleep(60 * 15)
            continue
        except StopIteration:
            break

The problem is that it is most likely get_user that is throwing the error, so try moving api.get_user inside your try/except block.
Code below.
data = []
for follower in followers:
    while True:
        try:
            carrot = api.get_user(follower, include_entities=True)
        except tweepy.TweepError:
            time.sleep(60 * 15)
            continue
        except StopIteration:
            pass
        break
    data.append([carrot.screen_name, carrot.description])
How do you intend to store these values? Isn't the following easier to work with:
[John, Traveller]
as opposed to your code, which stores it as
[ "[", John, Traveller, "]" ]

Related

snscrape twitter using Python

I was told that snscrape for Python still errors. I tried to solve it by adding top = True to the TwitterSearchScraper line, but it still fails.
Here is my code:
import pandas as pd
import snscrape.modules.twitter as sntwitter

pd.options.display.max_colwidth = 500
query = "(music) lang:en since:2023-01-01 until:2023-02-02"
tweets = []
limit = 10

get_ipython().run_line_magic('time', '')
try:
    print("start scraping")
    for tweet in sntwitter.TwitterSearchScraper(query=query, top=True).get_items():
        if len(tweets) == limit:
            break
        else:
            tweets.append([tweet.date, tweet.user.username, tweet.content])
    df = pd.DataFrame(tweets, columns=['datetime', 'username', 'content'])
except Exception as e:
    print(e)
print("Finished")
print("-------")
Can somebody help me solve this error?
Error retrieving https://api.twitter.com/2/search/adaptive.json?include_profile_interstitial_type=1&include_blocking=1&include_blocked_by=1&include_followed_by=1&include_want_retweets=1&include_mute_edge=1&include_can_dm=1&include_can_media_tag=1&skip_status=1&cards_platform=Web-12&include_cards=1&include_ext_alt_text=true&include_quote_count=true&include_reply_count=1&tweet_mode=extended&include_entities=true&include_user_entities=true&include_ext_media_color=true&include_ext_media_availability=true&send_error_codes=true&simple_quoted_tweets=true&q=%28Gempa%29+lang%3Aid++since%3A2023-01-01+until%3A2023-02-02&count=100&query_source=spelling_expansion_revert_click&pc=1&spelling_corrections=1&ext=mediaStats%2ChighlightedLabel: non-200 status code
4 requests to https://api.twitter.com/2/search/adaptive.json?include_profile_interstitial_type=1&include_blocking=1&include_blocked_by=1&include_followed_by=1&include_want_retweets=1&include_mute_edge=1&include_can_dm=1&include_can_media_tag=1&skip_status=1&cards_platform=Web-12&include_cards=1&include_ext_alt_text=true&include_quote_count=true&include_reply_count=1&tweet_mode=extended&include_entities=true&include_user_entities=true&include_ext_media_color=true&include_ext_media_availability=true&send_error_codes=true&simple_quoted_tweets=true&q=%28Gempa%29+lang%3Aid++since%3A2023-01-01+until%3A2023-02-02&count=100&query_source=spelling_expansion_revert_click&pc=1&spelling_corrections=1&ext=mediaStats%2ChighlightedLabel failed, giving up.

Why is no data stored in my list in Python?

I have the following code to get some data using Selenium. It goes through a list of IDs with a for loop and stores the results in my lists (titulos = [] and ids = []). It was working fine until I added the try/except. The code looks like this:
for item in registros:
    found = False
    ids = []
    titulos = []
    try:
        while true:
        #code to request data
        try:
            error = False
            error = #error message
            if error is True:
                break
        except:
            continue
    except:
        continue
    try:
        found = #if id has data
        if found.is_displayed:
            titulo = #locator
            ids.append(item)
            titulos.append(titulo)
    except NoSuchElementException:
        input.clear()
The first inner try block needs to be indented. Also, the error variable will always be set to the text of the error message, so it will always be truthy. Try formatting your code correctly and then identifying the problem.
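Not a drop-in fix, just a minimal sketch of how the corrected structure could look once the indentation is sorted out. The locators are hypothetical placeholders. Also note that in the original, ids and titulos are re-created inside the for loop, so they are emptied on every iteration; the sketch creates them once, before the loop:
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By

ids = []        # created once, so results accumulate across items
titulos = []
for item in registros:
    try:
        while True:
            # ...code to request data for this item...
            try:
                # hypothetical locator for an error banner on the page
                error_box = driver.find_element(By.CSS_SELECTOR, ".error-message")
                if error_box.is_displayed():
                    break      # the page reported an error, stop retrying
            except NoSuchElementException:
                break          # no error banner, the request went through
    except Exception:
        continue               # something else failed, skip this item
    try:
        titulo = driver.find_element(By.ID, "titulo").text   # hypothetical locator
        ids.append(item)
        titulos.append(titulo)
    except NoSuchElementException:
        pass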

Python Web Scraping error - Reading from JSON- IndexError: list index out of range - how do I ignore

I am performing web scraping via Python \ Selenium \ Chrome headless driver. I am reading the results from JSON - here is my code:
CustId = 500
while (CustId <= 510):
    print(CustId)
    # Part 1: Customer REST call:
    urlg = f'https://mywebsite/customerRest/show/?id={CustId}'
    driver.get(urlg)
    soup = BeautifulSoup(driver.page_source, "lxml")
    dict_from_json = json.loads(soup.find("body").text)
    # print(dict_from_json)
    #try:
    CustID = (dict_from_json['customerAddressCreateCommand']['customerId'])
    # Addr = (dict_from_json['customerShowCommand']['customerAddressShowCommandSet'][0]['addressDisplayName'])
    writefunction()
    CustId = CustId + 1
The issue is that sometimes 'addressDisplayName' will be present in the result set and sometimes not. If it's not, the code fails with the error:
IndexError: list index out of range
Which makes sense, as it doesn't exist. How do I ignore this, though, so that if 'addressDisplayName' doesn't exist the loop just continues? I've tried using a try but the code still stops executing.
A try..except block should resolve your issue.
CustId = 500
while (CustId <= 510):
    print(CustId)
    # Part 1: Customer REST call:
    urlg = f'https://mywebsite/customerRest/show/?id={CustId}'
    driver.get(urlg)
    soup = BeautifulSoup(driver.page_source, "lxml")
    dict_from_json = json.loads(soup.find("body").text)
    # print(dict_from_json)
    CustID = (dict_from_json['customerAddressCreateCommand']['customerId'])
    try:
        Addr = (dict_from_json['customerShowCommand']['customerAddressShowCommandSet'][0]['addressDisplayName'])
    except:
        Addr = "NaN"
    CustId = CustId + 1
If you get an IndexError (with an index of 0), it means that your list is empty. So the problem is one step earlier in the path (otherwise you would get a KeyError if 'addressDisplayName' were missing from the dict).
You can check if the list has elements:
if dict_from_json['customerShowCommand']['customerAddressShowCommandSet']:
    # get the data
Otherwise you can indeed use try..except:
try:
    # get the data
except (IndexError, KeyError):
    # handle missing data
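Plugged into the loop from the question, that could look something like this (only a sketch; the "NaN" fallback value is carried over from the first answer):
try:
    Addr = dict_from_json['customerShowCommand']['customerAddressShowCommandSet'][0]['addressDisplayName']
except (IndexError, KeyError):
    Addr = "NaN"   # empty command set or missing key: fall back to a placeholder

# or, without exceptions, using dict.get with defaults:
command_set = dict_from_json.get('customerShowCommand', {}).get('customerAddressShowCommandSet', [])
Addr = command_set[0].get('addressDisplayName', "NaN") if command_set else "NaN"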

How do I handle an exception for just one of several values in Selenium?

I want to get the title, address, and price of some items in an online mall.
But sometimes the address is empty, and my code breaks (below is only the Selenium part):
num = 1
while 1:
    try:
        title = browser.find_element_by_xpath('//*[@id="root"]/div[1]/section/article/div/div[' + str(num) + ']/div/div/a/span').text
        datas_title.append(title)
        address = browser.find_element_by_xpath('//*[@id="root"]/div[1]/section/article/div/div[' + str(num) + ']/div/div/a/div/p[2]').text
        datas_address.append(address)
        price = browser.find_element_by_xpath('//*[@id="root"]/div[1]/section/article/div/div[' + str(num) + ']/div/div/a/p').text
        datas_price.append(price)
        print('crawling....num = ' + str(num))
        num = num + 1
    except Exception as e:
        print("finish get data...")
        break
print(datas_title)
print(datas_address)
print(datas_price)
What should I do if the address is empty? Just ignore it and move on to the next item?
Use this so you can skip the entries with missing information:
num = 1
while 1:
    try:
        title = browser.find_element_by_xpath('//*[@id="root"]/div[1]/section/article/div/div[' + str(num) + ']/div/div/a/span').text
        datas_title.append(title)
        address = browser.find_element_by_xpath('//*[@id="root"]/div[1]/section/article/div/div[' + str(num) + ']/div/div/a/div/p[2]').text
        datas_address.append(address)
        price = browser.find_element_by_xpath('//*[@id="root"]/div[1]/section/article/div/div[' + str(num) + ']/div/div/a/p').text
        datas_price.append(price)
        print('crawling....num = ' + str(num))
        num = num + 1
    except:
        print("an error was encountered")
        continue
print(datas_title)
print(datas_address)
print(datas_price)
address = browser.find_elements_by_xpath('//*[@id="root"]/div[1]/section/article/div/div[' + str(num) + ']/div/div/a/div/p[2]')
if not address:
    address = "None"
else:
    address = address[0].text
datas_address.append(address)
You could use find_elements to check whether the result is empty and then proceed with either value. You can then wrap this in a function that takes the XPath and the target list, so the code is reusable.
I think you need to first check that the returned web element isn't None, and then proceed with fetching the text.
You could write a function for it and catch that exception inside it.
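A rough sketch of that idea follows; the function name and signature are mine, not from the question, and it uses the keyword-style find_elements call (with older Selenium the equivalent is find_elements_by_xpath):
from selenium.webdriver.common.by import By

def append_text_or_default(browser, xpath, target_list, default="None"):
    """Append the first matching element's text, or a default if nothing matches."""
    elements = browser.find_elements(By.XPATH, xpath)   # returns [] instead of raising
    target_list.append(elements[0].text if elements else default)

# usage inside the loop:
append_text_or_default(
    browser,
    '//*[@id="root"]/div[1]/section/article/div/div[' + str(num) + ']/div/div/a/div/p[2]',
    datas_address,
)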

Unable to scrape data after a dropdown

from typing import Text, final
from bs4 import BeautifulSoup
import requests

source = requests.get("https://www.nytimes.com/interactive/2021/us/covid-cases.html").text
soup = BeautifulSoup(source, "lxml")
states = soup.find("tbody", class_="children").find_all("tr")
# print(state.prettify())
for state in states:
    # determining the name of the state
    name = state.a.text
    final_name = ""
    for character in name:
        if character in "qwertyuiopasdfghjklzxcvbnmQWERTYUIOPASDFGHJKLZXCVBNM ":
            final_name += character
    print(final_name)
    # finding the daily number of cases on average
    try:
        daily_cases_avg = state.find("td", class_="bignum cases show-mobile").text
    except Exception as e:
        daily_cases_avg = None
    print(daily_cases_avg)
    # finding the number of cases per 100,000
    try:
        num_cases_per_hunThous = state.find("td", class_="num cases show-mobile").text
    except Exception as e:
        num_cases_per_hunThous = None
    print(num_cases_per_hunThous)
    # finding percent change over the past 14 days
    try:
        pct_change_cases_14 = state.find("td", class_="chart cases wider td-end show-mobile").span.text
    except Exception as e:
        pct_change_cases_14 = None
    print(pct_change_cases_14)
    # daily average of the number of people hospitalized
    try:
        daily_hos_avg = state.find_all("td", class_="bignum")[1].text
    except Exception as e:
        daily_hos_avg = None
    print(daily_hos_avg)
    # number of people hospitalized per 100,000
    try:
        num_hos_hunThous = state.find_all("td", class_="num")[1].text
    except Exception as e:
        num_hos_hunThous = None
    print(num_hos_hunThous)
    # percent change of number of hospitalized people over the past 14 days
    try:
        pct_change_hos_14 = state.find("td", class_="num td-end").text
    except Exception as e:
        pct_change_hos_14 = None
    print(pct_change_hos_14)
    # daily average of deaths
    try:
        daily_death_avg = state.find_all("td", class_="bignum")[2].text
    except Exception as e:
        daily_death_avg = None
    print(daily_death_avg)
    # number of deaths per 100,000
    try:
        deaths_hunThous = state.find_all("td", class_="num td-end")[1].text
    except Exception as e:
        deaths_hunThous = None
    print(deaths_hunThous)
    # percent of people fully vaccinated
    try:
        pct_vac = state.find("td", class_="num vax td-end").text
    except Exception as e:
        pct_vac = None
    print(pct_vac)
All I am trying to do is scrape COVID-19 data off of the New York Times. I am a beginner, so I am just using this as a way to learn how to scrape websites efficiently. However, this only captures the states that show up before a dropdown.
On the website, after the state of Illinois, there is a "Show all" button. The states that appear after clicking that button are not getting scraped, so I was wondering how I can get past that and get data for all of the states.
If you open developer tools and go to the Network tab, you can see all of the requests the page sends. I found one request it makes: https://static01.nyt.com/newsgraphics/2021/coronavirus-tracking/data/counties.json
The site receives a link for every county: each element in the JSON object contains a link to another NYT article with the average cases for that individual county.
It would be more complicated, but you could write a script that goes through each of these counties, scrapes the data, and then adds up the average cases for each state from its counties.
That is what I would do.
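A rough sketch of that approach follows. I have not inspected the JSON schema, so the field names below ("state" and "avg_cases") are hypothetical placeholders that would need to be replaced with the real keys:
import requests
from collections import defaultdict

url = "https://static01.nyt.com/newsgraphics/2021/coronavirus-tracking/data/counties.json"
counties = requests.get(url).json()

# sum the county-level averages into per-state totals
state_totals = defaultdict(float)
for county in counties:
    state_totals[county["state"]] += county["avg_cases"]   # hypothetical keys

for state, total in sorted(state_totals.items()):
    print(state, total)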
