I am starting to learn to program using BeautifulSoup. What I want to achieve with this code is to save prices from different pages. To do this I store the prices of each page in a list, and all those lists in another list. The problem is that some pages have no prices, so some of the inner lists end up completely empty. What I am looking for is to assign the elements of ListaR to those empty lists so that they don't cause problems later. Here's my code:
from bs4 import BeautifulSoup
import requests
import pandas as pd
from decimal import Decimal
from typing import List
AppID = ['495570', '540190', '607210', '575780', '338840', '585830', '637330', '514360', '575760', '530540', '361890', '543170', '346500', '555930', '575700', '595780', '362400', '562360', '745670', '763360', '689360', '363610', '575770', '467310', '380560']
ListaPrecios = list()
ListaUrl = list()  # <------- LIST
Blanco = [""]
ListaR = ["$0.00 USD", "$0.00 USD"]
for x in AppID:  # <--------- For each of the AppIDs...
    # STR
    url = "https://steamcommunity.com/market/search?category_753_Game%5B%5D=tag_app_"+x+"&category_753_cardborder%5B%5D=tag_cardborder_0&category_753_item_class%5B%5D=tag_item_class_2#p1_price_asc"  # <------ Uses the AppID to reach its market links
    ListaUrl += [url]  # <---------- ADDS EACH LINK TO A LIST
PageCromos = [requests.get(x) for x in ListaUrl]
SoupCromos = [BeautifulSoup(x.content, "html.parser") for x in PageCromos]
PrecioCromos = [x.find_all("span", {"data-price": True}) for x in SoupCromos]  # <--------- STORES LISTS INSIDE LISTS, WITH MARKUP
min_CromoList = []
for item in PrecioCromos:
    CromoList = [float(i.text.strip('USD$')) for i in item]
    min_CromoList.append(min(CromoList))  # <---------------- List with the minimum card price of each game
print(min_CromoList)
Output:
ValueError: min() arg is an empty sequence
You can change this line
min_CromoList.append(min(CromoList))
to:
if not CromoList:  # this will evaluate to True if the list is empty
    min_CromoList.append(min(ListaR))
else:
    min_CromoList.append(min(CromoList))
A neat feature of Python is that empty lists evaluate to False and non-empty lists evaluate to True.
Since min(ListaR) will always evaluate to '$0.00 USD', it is probably neater to write this as:
if not CromoList:
    min_CromoList.append('$0.00 USD')
else:
    min_CromoList.append(min(CromoList))
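As an aside, Python's built-in min() also accepts a default keyword argument (available since Python 3.4) that is returned when the sequence is empty, so the whole if/else can collapse to one line. A minimal sketch, keeping the same string placeholder as above:

# min() returns `default` instead of raising ValueError on an empty sequence
min_CromoList.append(min(CromoList, default='$0.00 USD'))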
I am trying to extract a tweet's information from its ID. With the ID, I want to get the tweet's creation date, text, location, followers, friends, favorites, the user's profile description, whether they're verified, and the language, but I'm having trouble doing it. Below I show the steps I follow.
I have written the following code. To start with, I have the tweet IDs in a txt file, which I read as follows:
# Read txt file
txt = '/content/drive/MyDrive/Mini-proyecto Texto/archivo.txt'
with open(txt) as archivo:
    lines = archivo.readlines()
Next, I add each of the IDs to a list:
# Add the IDs to a list
IDs = []
for i in lines:
    IDs.append(i.rsplit())
    #print(i.rsplit())
IDs
#[['1206924075374956547'],
# ['1210912199402819584'],
# ['1210643148998938625'],
# ['1207776839697129472'],
# ['1203627609759920128'],
# ['1205895318212136961'],
# ['1208145724879364100'], ...
Finally, I start extracting the information I need as follows:
# Extract information from tweets
tweets_df2 = pd.DataFrame()
for i in IDs:
    try:
        info_tweet = api.get_status(i, tweet_mode="extended")
    except:
        pass
    tweets_df2 = tweets_df2.append(pd.DataFrame({'ID': info_tweet.id,
                                                 'Tweet': info_tweet.full_text,
                                                 'Creado_tweet': info_tweet.created_at,
                                                 'Locacion_usuario': info_tweet.user.location,
                                                 'Seguidores_usuario': info_tweet.user.followers_count,
                                                 'Amigos_usuario': info_tweet.user.friends_count,
                                                 'Favoritos_usuario': info_tweet.user.favourites_count,
                                                 'Descripcion_usuario': info_tweet.user.description,
                                                 'Verificado_usuario': info_tweet.user.verified,
                                                 'Idioma': info_tweet.lang}, index=[0]))
tweets_df2 = tweets_df2.reset_index(drop=True)
tweets_df2
The following image is the output of the tweets_df2 variable, but I don't understand why the values are repeated over and over again. Does anyone know what's wrong with my code?
If you need the txt I provide you with the link of the drive. https://drive.google.com/file/d/1vyohQMpLqlKqm6b4iTItcVVqL2wBMXWp/view?usp=sharing
Thank you very much in advance for your time :3
Your code basically runs fine for me with a few adjustments. I noticed that your indentation is not correct and that your list of IDs is a list of single-element lists rather than one flat list of elements.
Try this:
api = tweepy.Client(consumer_key=api_key,
                    consumer_secret=api_key_secret,
                    access_token=access_token,
                    access_token_secret=access_token_secret,
                    bearer_token=bearer_token,
                    wait_on_rate_limit=True,
                    )

auth = tweepy.OAuth1UserHandler(
    api_key, api_key_secret, access_token, access_token_secret
)
api = tweepy.API(auth)

txt = "misocorpus-misogyny.txt"
with open(txt) as archivo:
    lines = archivo.readlines()

IDs = []
for i in lines:
    IDs.append(i.strip())  # <-- use strip() to remove \n rather than rsplit()

tweets_df2 = pd.DataFrame()
for i in IDs:
    try:
        info_tweet = api.get_status(i, tweet_mode="extended")
    except:
        pass
    tweets_df2 = tweets_df2.append(pd.DataFrame({'ID': info_tweet.id,
                                                 'Tweet': info_tweet.full_text,
                                                 'Creado_tweet': info_tweet.created_at,
                                                 'Locacion_usuario': info_tweet.user.location,
                                                 'Seguidores_usuario': info_tweet.user.followers_count,
                                                 'Amigos_usuario': info_tweet.user.friends_count,
                                                 'Favoritos_usuario': info_tweet.user.favourites_count,
                                                 'Descripcion_usuario': info_tweet.user.description,
                                                 'Verificado_usuario': info_tweet.user.verified,
                                                 'Idioma': info_tweet.lang}, index=[0]))
tweets_df2 = tweets_df2.reset_index(drop=True)
tweets_df2
Result:
ID Tweet Creado_tweet Locacion_usuario Seguidores_usuario Amigos_usuario Favoritos_usuario Descripcion_usuario Verificado_usuario Idioma
0 1206924075374956547 Las feminazis quieren por poco que este chico ... 2019-12-17 13:08:17+00:00 Argentina 1683 2709 28982 El Progresismo es un Cáncer que quiere destrui... False es
1 1210912199402819584 #CarlosVerareal #Galois2807 Los halagos con pi... 2019-12-28 13:15:40+00:00 Ecuador 398 1668 3123 Cuando te encuentres n una situación imposible... False es
2 1210643148998938625 #drummniatico No se vaya asustar! Ese es el gr... 2019-12-27 19:26:34+00:00 Samborondon - Ecuador 1901 1432 39508 Todo se alinea a nuestro favor. 💙💙💙💙 False es
3 1210643148998938625 #drummniatico No se vaya asustar! Ese es el gr... 2019-12-27 19:26:34+00:00 Samborondon - Ecuador 1901 1432 39508 Todo se alinea a nuestro favor. 💙💙💙💙 False es
4 1203627609759920128 Mostritas #Feminazi amenazando como ellas sabe... 2019-12-08 10:49:19+00:00 Lima, Perú 2505 3825 45087 Latam News Report. Regional and World affairs.... False
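One caveat worth adding: the duplicated rows (ID 1210643148998938625 appearing twice) follow from except: pass, because when get_status fails, info_tweet still holds the previous tweet and it gets appended again. Also, DataFrame.append was deprecated in pandas 1.4 and removed in 2.0. A minimal sketch of both fixes, collecting rows in a list and building the frame once (same fields as above, abbreviated here):

rows = []
for i in IDs:
    try:
        info_tweet = api.get_status(i, tweet_mode="extended")
    except Exception:
        continue  # skip failed lookups instead of re-appending the previous tweet
    rows.append({'ID': info_tweet.id,
                 'Tweet': info_tweet.full_text,
                 'Idioma': info_tweet.lang})  # ...plus the remaining fields shown above
tweets_df2 = pd.DataFrame(rows)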
I'm trying to retrieve Wikipedia pages' character counts for articles in different languages. I'm using a dictionary with the name of the page as key, and as value another dictionary with the language as key and the count as value.
The code is:
pages = ["L'arte della gioia", "Il nome della rosa"]
langs = ["it", "en"]
dicty = {}
dicto = {}
numz = 0
for x in langs:
    wikipedia.set_lang(x)
    for y in pages:
        pagelang = wikipedia.page(y)
        splittedpage = pagelang.content
        dicto[y] = dicty
        for char in splittedpage:
            numz += 1
        dicty[x] = numz
If I print dicto, I get
{"L'arte della gioia": {'it': 72226, 'en': 111647}, 'Il nome della rosa': {'it': 72226, 'en': 111647}}
The count should be different for the two pages.
Please try this code. I didn't run it because I don't have the wikipedia module.
Notes:
Your counts come out identical because both pages are assigned the same dicty object and numz is never reset, so the totals computed last overwrite the entries for every page.
Since your expected result is dict[page, dict[lang, cnt]], I think it is more natural to iterate over pages first, then over languages. If you would rather iterate over languages first for performance reasons, see the update below.
The character count of a text is simply len(text); there is no need to iterate and count characters yourself.
Use descriptive variable names; you will soon get lost among variables like x and y.
pages = ["L'arte della gioia", "Il nome della rosa"]
langs = ["it", "en"]
dicto = {}
for page in pages:
    lang_cnt_dict = {}
    for lang in langs:
        wikipedia.set_lang(lang)
        page_lang = wikipedia.page(page)
        chars_cnt = len(page_lang.content)
        lang_cnt_dict[lang] = chars_cnt
    dicto[page] = lang_cnt_dict
print(dicto)
Update
If you want to iterate over langs first:
pages = ["L'arte della gioia", "Il nome della rosa"]
langs = ["it", "en"]
dicto = {}
for lang in langs:
    wikipedia.set_lang(lang)
    for page in pages:
        page_lang = wikipedia.page(page)
        chars_cnt = len(page_lang.content)
        if page in dicto:
            dicto[page][lang] = chars_cnt
        else:
            dicto[page] = {lang: chars_cnt}
print(dicto)
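As a side note, the if/else at the end of the inner loop can be collapsed with dict.setdefault, which returns the existing inner dict or inserts an empty one first. A minimal sketch:

# setdefault returns dicto[page] if present, otherwise inserts {} and returns it
dicto.setdefault(page, {})[lang] = chars_cnt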
I am working on an activity to scrape a web page. The part of the page's data I need is route_data.
route_data = ["javascript:mostrarFotografiaHemiciclo( '/wc/htdocs/web/img/diputados/peq/215_14.jpg', '/wc/htdocs/web', 'Batet Lamaña, Meritxell (Presidenta del Congreso de los Diputados)', 'Diputada por Barcelona', 'G.P. Socialista' ,'','');",
"javascript:mostrarFotografiaHemiciclo( '/wc/htdocs/web/img/diputados/peq/168_14.jpg', '/wc/htdocs/web', 'Rodríguez Gómez de Celis, Alfonso (Vicepresidente Primero)', 'Diputado por Sevilla', 'G.P. Socialista' ,'','');",]
I create a dictionary with empty values.
dictionary_data = {"Nombre":None, "Territorio":None, "Partido":None, "url":None}
From each line, I have to save into dictionary_data:
url = /wc/htdocs/web/img/diputados/peq/215_14.jpg
Nombre = Batet Lamaña, Meritxell
Territorio = Diputada por Barcelona
Partido = G.P. Socialista
To do this, I loop over route_data:
for i in route_data:
    text = i.split(",")
    nombre = text[2:4]
    territorio = text[4]
    partido = text[5]
But the output is:
[" 'Batet Lama帽a", " Meritxell (Presidenta del Congreso de los Diputados)'"] 'Diputada por Barcelona' 'G.P. Socialista'
[" 'Rodr铆guez G贸mez de Celis", " Alfonso (Vicepresidente Primero)'"] 'Diputado por Sevilla' 'G.P. Socialista'
How can I get this into the dictionary correctly?
A simple solution would be:
import re

all_routes = []
for i in route_data:
    # pull every '...'-quoted fragment out of the javascript call
    text = re.findall("'.+?'", i)
    all_routes.append(
        {"Nombre": re.sub(r'\(.*?\)', '', text[2]).strip(" '"),  # drop the parenthesised title
         "Territorio": text[3].strip("'"),
         "Partido": text[-2].strip("'"),
         "Url": text[0].strip("'")})
Hi there, I've been trying to collect all the information from 10,000 pages of this site for a school project. I thought everything was fine until I got an error on page 4. I checked the page manually and found that it now asks me for a captcha.
What can I do to avoid it? Maybe set a timer between the searches?
Here is my code:
import bs4, requests, csv

g_page = requests.get("http://www.usbizs.com/NY/New_York.html")
m_page = bs4.BeautifulSoup(g_page.text, "lxml")
get_Pnum = m_page.select('div[class="pageNav"]')
MAX_PAGE = int(get_Pnum[0].text[9:16])
print("Collecting information from page 1 of {}.".format(MAX_PAGE))
contador = 0
information_list = []

for k in range(1, MAX_PAGE):
    c_items = m_page.select('div[itemtype="http://schema.org/Corporation"] a')
    c_links = []
    i = 0
    for link in c_items:
        c_links.append(link.get("href"))
        i += 1
    for j in range(len(c_links)):
        temp = []
        s_page = requests.get(c_links[j])
        i_page = bs4.BeautifulSoup(s_page.text, "lxml")
        print("Entering: {}".format(c_links[j]))
        info_t = i_page.select('div[class="infolist"]')
        info_1 = info_t[0].text
        info_2 = info_t[1].text
        temp = [info_1, info_2]
        information_list.append(temp)
        contador += 1
    with open("list_information.csv", "w") as file:
        writer = csv.writer(file)
        for row in information_list:
            writer.writerow(row)
    print("Information from {} clients collected and saved correctly.".format(j+1))
    g_page = requests.get("http://www.usbizs.com/NY/New_York-{}.html".format(k+1))
    m_page = bs4.BeautifulSoup(g_page.text, "lxml")
    print("Collecting information from page {} of {}.".format(k+1, MAX_PAGE))

print("Program finished. Information collected from {} clients.".format(contador))
I have the following text file taken from a csv file. The file is too long to be shown properly here, so here's the line info:
The file has 5 lines: the 1st starts with ETIQUETAS, the 2nd starts with RECURSOS, the 3rd starts with DATOS CLIENTE Y PIEZA, the 4th starts with Numero Referencia, and the 5th and last starts with BRIDA Al.
ETIQUETAS:;;;;;;;;;START;;;;;;;;;;;;;;;;;;;;;END;;
RECURSOS:;;;;;;;;;0;0;0;0;0;0;0;0;0;1;0;0;0;0;0;0;1;1;1;0;1;0;;Nota: 0 equivale a infinito, para decir que no existen recursos usar un numero negativo
DATOS CLIENTE Y PIEZA;;;;PLAZOS Y PROCESOS;;;;;;;;;;hoja de ruta;MU;;;;;;;;;;;;;;;;;
Numero Referencia;Descripcion Referencia;Nombre Cliente;Codigo Cliente;PLAZO DE ENTREGA;piezas;PROCESO;MATERIAL;stock;PROVEEDOR;tiempo ida pulidor;pzas dia;TPO;tiempo vuelta pulidor;TIEMPO RECEPCION;CONTROL CALIDAD DE ENTRADA;TIEMPO CONTROL CALIDAD DE ENTRADA;ALMACEN A (ANTES DE ENTRAR MAQUINA);GRANALLA;TPO;LIMPIADO;TPO;BRILLADO;TPO;;CARGA;MAQUINA;SOLTAR;control;EMPAQUETADO;ALMACENB;TIEMPO;
BRIDA Al;BRIDA Al;AEROGRAFICAS AHE, S.A.;394;;;niquelado;aluminio;;;;matriz;;;5min;NO;;3dias;;;;;;;;1;1;1;;1;4D;;
I want to do two things:
1. Count the elements between START and END of the first line, both inclusive, and save the count as TOTAL_NUMBERS. This means that for START;;END it has to count 3: START itself, the blank field between the two ;, and END itself. In the example from the file, START;;;;;;;;;;;;;;;;;;;;;END, it has to count 22.
What I've tried so far:
f = open("lt.csv", 'r')
array = []
for line in f:
    if 'START' in line:
        for i in line.split(";"):
            array.append(i)

i = 0
while i < len(array):
    if i == 'START':
        pass  # START COUNTING, I DON'T KNOW HOW TO CONTINUE
    i = i + 1
2. Check the file until the word PROVEEDOR appears, and save that word and the following TOTAL_NUMBERS elements (22 in the example) in an array.
This means it has to save:
final array = ['PROVEEDOR', 'tiempo ida pulidor', 'pzas dia', 'TPO', 'tiempo vuelta pulidor', 'TIEMPO RECEPCION', 'CONTROL CALIDAD DE ENTRADA', 'TIEMPO CONTROL CALIDAD DE ENTRADA', 'ALMACEN A (ANTES DE ENTRAR MAQUINA)', 'GRANALLA', 'TPO', 'LIMPIADO', 'TPO', 'BRILLADO', 'TPO', '', 'CARGA', 'MAQUINA', 'SOLTAR', 'control', 'EMPAQUETADO', 'ALMACENB']
Thanks in advance.
I am assuming the file is split into two lines; the first line with START and END and then a long line which needs to be parsed. This should work:
with open('somefile.txt') as f:
    first_row = next(f).strip().split(';')
    TOTAL_NUMBER = len(first_row[first_row.index('START'):first_row.index('END')+1])
    bits = ''.join(line.rstrip() for line in f).split(';')
    final_array = bits[bits.index('PROVEEDOR'):bits.index('PROVEEDOR')+TOTAL_NUMBER]
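A quick sanity check of the counting logic on the toy example from the question; the +1 on index('END') is needed because Python slices exclude the end position:

row = "START;;END".split(';')  # ['START', '', 'END']
print(len(row[row.index('START'):row.index('END')+1]))  # prints 3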