Scrapy using loops in Python

I want to scrape a web page with Scrapy. The relevant part of the page data is route_data:
route_data = ["javascript:mostrarFotografiaHemiciclo( '/wc/htdocs/web/img/diputados/peq/215_14.jpg', '/wc/htdocs/web', 'Batet Lamaña, Meritxell (Presidenta del Congreso de los Diputados)', 'Diputada por Barcelona', 'G.P. Socialista' ,'','');",
"javascript:mostrarFotografiaHemiciclo( '/wc/htdocs/web/img/diputados/peq/168_14.jpg', '/wc/htdocs/web', 'Rodríguez Gómez de Celis, Alfonso (Vicepresidente Primero)', 'Diputado por Sevilla', 'G.P. Socialista' ,'','');",]
I create a dictionary with empty values.
dictionary_data = {"Nombre":None, "Territorio":None, "Partido":None, "url":None}
From each line I have to save the following into dictionary_data:
url = /wc/htdocs/web/img/diputados/peq/215_14.jpg
Nombre = Batet Lamaña, Meritxell
Territorio = Diputada por Barcelona
Partido = G.P. Socialista
To do this, I loop over route_data:
for i in route_data:
    text = i.split(",")
    nombre = text[2:4]
    territorio = text[4]
    partido = text[5]
But the output is:
[" 'Batet Lamaña", " Meritxell (Presidenta del Congreso de los Diputados)'"] 'Diputada por Barcelona' 'G.P. Socialista'
[" 'Rodríguez Gómez de Celis", " Alfonso (Vicepresidente Primero)'"] 'Diputado por Sevilla' 'G.P. Socialista'
How can I get the values into the dictionary correctly?

A simple solution would be:
import re

all_routes = []
for i in route_data:
    text = re.findall(r"'.+?'", i)
    all_routes.append(
        {"Nombre": re.sub(r'\(.*?\)', '', text[2]).strip(),
         "Territorio": text[3],
         "Partido": text[-2],
         "Url": text[0]})

Related

Feature extraction with tweet IDs

I am trying to extract the information of a tweet from its ID. With the ID, I want to get the tweet's creation date, its text, the user's location, followers, friends, favorites, profile description, whether the user is verified, and the language, but I'm having trouble doing it. Below I show the steps I follow to carry out what I want.
I have made the following code. To start with, I have the IDs of the tweets in a txt file and I read them as follows:
# Read txt file
txt = '/content/drive/MyDrive/Mini-proyecto Texto/archivo.txt'
with open(txt) as archivo:
    lines = archivo.readlines()
Next, I add each of the IDs to a list:
# Add the IDs to a list
IDs = []
for i in lines:
    IDs.append(i.rsplit())
    #print(i.rsplit())
IDs
#[['1206924075374956547'],
# ['1210912199402819584'],
# ['1210643148998938625'],
# ['1207776839697129472'],
# ['1203627609759920128'],
# ['1205895318212136961'],
# ['1208145724879364100'], ...
Finally, I start extracting the information I need as follows:
# Extract information from tweets
tweets_df2 = pd.DataFrame()
for i in IDs:
    try:
        info_tweet = api.get_status(i, tweet_mode="extended")
    except:
        pass
    tweets_df2 = tweets_df2.append(pd.DataFrame({'ID': info_tweet.id,
                                                 'Tweet': info_tweet.full_text,
                                                 'Creado_tweet': info_tweet.created_at,
                                                 'Locacion_usuario': info_tweet.user.location,
                                                 'Seguidores_usuario': info_tweet.user.followers_count,
                                                 'Amigos_usuario': info_tweet.user.friends_count,
                                                 'Favoritos_usuario': info_tweet.user.favourites_count,
                                                 'Descripcion_usuario': info_tweet.user.description,
                                                 'Verificado_usuario': info_tweet.user.verified,
                                                 'Idioma': info_tweet.lang}, index=[0]))
tweets_df2 = tweets_df2.reset_index(drop=True)
tweets_df2
The following image is the output of the tweets_df2 variable, but I don't understand why the values are repeated over and over again. Does anyone know what's wrong with my code?
If you need the txt I provide you with the link of the drive. https://drive.google.com/file/d/1vyohQMpLqlKqm6b4iTItcVVqL2wBMXWp/view?usp=sharing
Thank you very much in advance for your time :3
Your code basically runs fine for me with a few adjustments. I noticed that your indentation is not correct, and your list of IDs is a list of lists (each holding a single element) rather than a flat list of strings. Also note that when api.get_status raises an error, your bare except: pass leaves info_tweet holding the previous tweet, so the same row is appended again; that is where the repeated values come from.
Try this:
api = tweepy.Client(consumer_key=api_key,
                    consumer_secret=api_key_secret,
                    access_token=access_token,
                    access_token_secret=access_token_secret,
                    bearer_token=bearer_token,
                    wait_on_rate_limit=True,
                    )

auth = tweepy.OAuth1UserHandler(
    api_key, api_key_secret, access_token, access_token_secret
)
api = tweepy.API(auth)

txt = "misocorpus-misogyny.txt"
with open(txt) as archivo:
    lines = archivo.readlines()

IDs = []
for i in lines:
    IDs.append(i.strip())  # <-- use strip() to remove \n rather than rsplit()

tweets_df2 = pd.DataFrame()

for i in IDs:
    try:
        info_tweet = api.get_status(i, tweet_mode="extended")
    except:
        pass  # a 'continue' here would skip failed IDs instead of re-appending the previous tweet
    tweets_df2 = tweets_df2.append(pd.DataFrame({'ID': info_tweet.id,
                                                 'Tweet': info_tweet.full_text,
                                                 'Creado_tweet': info_tweet.created_at,
                                                 'Locacion_usuario': info_tweet.user.location,
                                                 'Seguidores_usuario': info_tweet.user.followers_count,
                                                 'Amigos_usuario': info_tweet.user.friends_count,
                                                 'Favoritos_usuario': info_tweet.user.favourites_count,
                                                 'Descripcion_usuario': info_tweet.user.description,
                                                 'Verificado_usuario': info_tweet.user.verified,
                                                 'Idioma': info_tweet.lang}, index=[0]))

tweets_df2 = tweets_df2.reset_index(drop=True)
tweets_df2
Result:
ID Tweet Creado_tweet Locacion_usuario Seguidores_usuario Amigos_usuario Favoritos_usuario Descripcion_usuario Verificado_usuario Idioma
0 1206924075374956547 Las feminazis quieren por poco que este chico ... 2019-12-17 13:08:17+00:00 Argentina 1683 2709 28982 El Progresismo es un Cáncer que quiere destrui... False es
1 1210912199402819584 #CarlosVerareal #Galois2807 Los halagos con pi... 2019-12-28 13:15:40+00:00 Ecuador 398 1668 3123 Cuando te encuentres n una situación imposible... False es
2 1210643148998938625 #drummniatico No se vaya asustar! Ese es el gr... 2019-12-27 19:26:34+00:00 Samborondon - Ecuador 1901 1432 39508 Todo se alinea a nuestro favor. 💙💙💙💙 False es
3 1210643148998938625 #drummniatico No se vaya asustar! Ese es el gr... 2019-12-27 19:26:34+00:00 Samborondon - Ecuador 1901 1432 39508 Todo se alinea a nuestro favor. 💙💙💙💙 False es
4 1203627609759920128 Mostritas #Feminazi amenazando como ellas sabe... 2019-12-08 10:49:19+00:00 Lima, Perú 2505 3825 45087 Latam News Report. Regional and World affairs.... False
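As a side note, DataFrame.append is deprecated in recent pandas versions (and removed in 2.0), and appending inside a loop rebuilds the frame on every iteration. A sketch of the usual alternative, assuming the same api and IDs as above: collect plain dicts in a list and build the frame once, using continue so a failed lookup skips the ID instead of re-appending the previous tweet.

rows = []
for i in IDs:
    try:
        info_tweet = api.get_status(i, tweet_mode="extended")
    except Exception:
        continue  # skip IDs that can no longer be fetched
    rows.append({'ID': info_tweet.id,
                 'Tweet': info_tweet.full_text,
                 'Idioma': info_tweet.lang})  # ...plus the remaining fields as above
tweets_df2 = pd.DataFrame(rows)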

How to add a progress bar while executing process in tkinter?

I'm building a little app that generates a Word document with several tables. Since this process takes a while to finish, I'd like to add a progress bar to make the wait a little less boring for the user. So far I've managed to write the following method:
def loading_screen(self, e, data_hold=None):
    """
    Generates a progress bar to display elapsed time
    """
    loading = Toplevel()
    loading.geometry("300x50")
    #loading.overrideredirect(True)
    progreso = Progressbar(loading, orient=HORIZONTAL, length=280, mode="determinate")
    progreso.pack()
    progreso.start(10)
    #progreso.destroy()
This method is supposed to run right after the user clicks a button of the next Toplevel.
def validate_data(self):
    """
    Generates a window to validate the entered data and make any
    corrections before generating the final deposits
    """
    if self.datos:
        venta = Toplevel()
        venta.title("Listado de depositos por envasadora")
        venta.geometry("600x300")
        columnas = ["ID", "Banco", "Monto", "Envasadora"]
        self.Tree_datos = ttk.Treeview(venta, columns=columnas, show="headings")
        self.Tree_datos.pack()
        for i in columnas:
            self.Tree_datos.heading(i, text=i)
        for contact in self.datos:
            self.Tree_datos.insert('', END, values=contact)
        self.Tree_datos.column("ID", width=20)
        #Tree_datos.column("Banco",width=100)
        imprimir_depositos = Button(venta, text="Generar Depositos", command=self.generar_depositos)
        imprimir_depositos.pack(fill=BOTH, side=LEFT)
        editar_deposito = Button(venta, text="Editar seleccion", command=self.edit_view)
        editar_deposito.pack(fill=BOTH, side=RIGHT)
        imprimir_depositos.bind("<Button-1>", self.loading_screen)
        #return get_focus ()

        def get_focus(e):
            self.valor_actualizado = self.Tree_datos.item(self.Tree_datos.focus())["values"]

        self.Tree_datos.bind("<ButtonRelease-1>", get_focus)
    else:
        messagebox.showinfo(message="No hay datos que mostrar por el momento", title="No hay datos!")
The command that generates the doc file is this (along with other methods that are not relevant for now I guess):
def generar_depositos(self):
    documento = Document()
    add_style = documento.styles["Normal"]
    font_size = add_style.font
    font_size.name = "Calibri"
    font_size.size = Pt(9)
    table_dict = {"BPD": self.tabla_bpd, "BHD": self.tabla_bhd, "Reservas": self.tabla_reservas}
    self.tabla_bhd(documento, "La Jagua", "2535")
    #for banco,env,deposito in datos_guardados:
    #    table_dict[banco](documento, env, deposito)
    #    documento.add_paragraph()
    self.set_doc_dimx(documento, margen_der=0.38, margen_izq=0.9, margen_sup=0.3, margen_infe=1)
    sc = documento.sections
    for sec in sc:
        print(sc)
    documento.save("depositos.docx")
So basically, what I want is to display the animated progress bar while this method is running. I read about threading but I don't know how to implement it in my code.
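A minimal sketch of that pattern, not tied to the class above: long_task below is a stand-in for a slow method like generar_depositos, a worker thread runs it off the main thread, and after() polls the thread from the Tk event loop so the progress bar keeps animating (an indeterminate bar fits an operation of unknown length).

import threading
import time
from tkinter import Tk, Toplevel, Button, HORIZONTAL
from tkinter.ttk import Progressbar

def long_task():
    time.sleep(5)  # stand-in for the slow document generation

def run_with_progress(root):
    loading = Toplevel(root)
    loading.geometry("300x50")
    progreso = Progressbar(loading, orient=HORIZONTAL, length=280, mode="indeterminate")
    progreso.pack()
    progreso.start(10)
    worker = threading.Thread(target=long_task, daemon=True)
    worker.start()

    def check_done():
        if worker.is_alive():
            loading.after(100, check_done)  # poll again in 100 ms
        else:
            progreso.stop()
            loading.destroy()

    check_done()

root = Tk()
Button(root, text="Generar Depositos", command=lambda: run_with_progress(root)).pack()
root.mainloop()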

How to set a value for an empty list

I am starting to learn to program using BeautifulSoup. What I want to achieve with this code is to save prices from different pages. To do this I store the prices of each page in a list, and all of those lists in another list. The problem is that some pages have no prices to save, so some of the lists are completely empty. What I am looking for is to assign the elements of ListaR to those empty lists so that I don't run into problems later. Here's my code:
from bs4 import BeautifulSoup
import requests
import pandas as pd
from decimal import Decimal
from typing import List
AppID = ['495570', '540190', '607210', '575780', '338840', '585830', '637330', '514360', '575760', '530540', '361890', '543170', '346500', '555930', '575700', '595780', '362400', '562360', '745670', '763360', '689360', '363610', '575770', '467310', '380560']
ListaPrecios = list()
ListaUrl = list()  # <------- LIST
Blanco = [""]
ListaR = ["$0.00 USD", "$0.00 USD"]

for x in AppID:  # <--------- For each of the AppIDs...
    #STR#
    url = "https://steamcommunity.com/market/search?category_753_Game%5B%5D=tag_app_" + x + "&category_753_cardborder%5B%5D=tag_cardborder_0&category_753_item_class%5B%5D=tag_item_class_2#p1_price_asc"  # <------ Uses the AppID to build its market link
    ListaUrl += [url]  # <---------- ADDS EACH LINK TO A LIST

PageCromos = [requests.get(x) for x in ListaUrl]
SoupCromos = [BeautifulSoup(x.content, "html.parser") for x in PageCromos]
PrecioCromos = [x.find_all("span", {"data-price": True}) for x in SoupCromos]  # <--------- SAVES LISTS OF TAGS INSIDE A LIST

min_CromoList = []
for item in PrecioCromos:
    CromoList = [float(i.text.strip('USD$')) for i in item]
    min_CromoList.append(min(CromoList))  # <---------------- List with the minimum card price of each game

print(min_CromoList)
Output:
ValueError: min() arg is an empty sequence
You can change this line
min_CromoList.append(min(CromoList))
to:
if not CromoList:  # this will evaluate to True if the list is empty
    min_CromoList.append(min(ListaR))
else:
    min_CromoList.append(min(CromoList))
A neat feature of Python is that empty lists evaluate to False and non-empty lists evaluate to True. Since min(ListaR) will always evaluate to '$0.00 USD', it is probably neater to write this as:
if not CromoList:
    min_CromoList.append('$0.00 USD')
else:
    min_CromoList.append(min(CromoList))
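If you are on Python 3.4 or newer, min() also accepts a default argument that is returned when the sequence is empty, which collapses the branch into a single line:

min_CromoList.append(min(CromoList, default='$0.00 USD'))

Keep in mind that CromoList holds floats while the default here is the '$0.00 USD' string from ListaR, so mixing them only makes sense if you normalize the values afterwards.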

Generating a table with docx from a dataframe in python

Hello,
Currently I'm working on a project in which I have to generate some info with the docx library in Python. I want to know how to generate a docx table from a dataframe, so that the output contains all the columns and rows of the dataframe I've created. Here is my code, but it's not working correctly because I can't reach the final output:
table = doc.add_table(rows=len(detalle_operaciones_total1), cols=5)
table.style = 'Table Grid'
table.rows[0].cells[0].text = 'Nombre'
table.rows[0].cells[1].text = 'Operacion Nro'
table.rows[0].cells[2].text = 'Producto'
table.rows[0].cells[3].text = 'Monto en moneda de origen'
table.rows[0].cells[4].text = 'Monto en moneda local'

for y in range(1, len(detalle_operaciones_total1)):
    Nombre = str(detalle_operaciones_total1.iloc[y, 0])
    Operacion = str(detalle_operaciones_total1.iloc[y, 1])
    Producto = str(detalle_operaciones_total1.iloc[y, 2])
    Monto_en_MO = str(detalle_operaciones_total1.iloc[y, 3])
    Monto_en_ML = str(detalle_operaciones_total1.iloc[y, 4])
    table.rows[y].cells[0].text = Nombre
    table.rows[y].cells[1].text = Operacion
    table.rows[y].cells[2].text = Producto
    table.rows[y].cells[3].text = Monto_en_MO
    table.rows[y].cells[4].text = Monto_en_ML
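Two things stand out: the table has len(detalle_operaciones_total1) rows but the header takes one of them, leaving room for only len-1 records; and because the loop starts at 1, the dataframe's first record (iloc[0]) is never written. A generic sketch of the dataframe-to-table pattern with python-docx, where df stands in for detalle_operaciones_total1:

from docx import Document

doc = Document()
table = doc.add_table(rows=len(df) + 1, cols=len(df.columns))  # +1 row for the header
table.style = 'Table Grid'

# header row from the dataframe's column names
for j, col in enumerate(df.columns):
    table.rows[0].cells[j].text = str(col)

# one table row per dataframe row, offset by the header row
for i in range(len(df)):
    for j in range(len(df.columns)):
        table.rows[i + 1].cells[j].text = str(df.iloc[i, j])

doc.save("tablas.docx")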

What can I do to scrape 10,000 pages without triggering captchas?

Hi there, I've been trying to collect all the information on 10,000 pages of this site for a school project. I thought everything was fine until I got an error on page 4. I checked the page manually and found that it now asks me for a captcha.
What can I do to avoid it? Maybe set a timer between the searches?
Here is my code.
import bs4, requests, csv

g_page = requests.get("http://www.usbizs.com/NY/New_York.html")
m_page = bs4.BeautifulSoup(g_page.text, "lxml")
get_Pnum = m_page.select('div[class="pageNav"]')
MAX_PAGE = int(get_Pnum[0].text[9:16])
print("Recolectando información de la página 1 de {}.".format(MAX_PAGE))
contador = 0
information_list = []

for k in range(1, MAX_PAGE):
    c_items = m_page.select('div[itemtype="http://schema.org/Corporation"] a')
    c_links = []
    i = 0
    for link in c_items:
        c_links.append(link.get("href"))
        i += 1
    for j in range(len(c_links)):
        temp = []
        s_page = requests.get(c_links[j])
        i_page = bs4.BeautifulSoup(s_page.text, "lxml")
        print("Ingresando a: {}".format(c_links[j]))
        info_t = i_page.select('div[class="infolist"]')
        info_1 = info_t[0].text
        info_2 = info_t[1].text
        temp = [info_1, info_2]
        information_list.append(temp)
        contador += 1
    with open("list_information.cv", "w") as file:
        writer = csv.writer(file)
        for row in information_list:
            writer.writerow(row)
    print("Información de {} clientes recolectada y guardada correctamente.".format(j+1))
    g_page = requests.get("http://www.usbizs.com/NY/New_York-{}.html".format(k+1))
    m_page = bs4.BeautifulSoup(g_page.text, "lxml")
    print("Recolectando información de la página {} de {}.".format(k+1, MAX_PAGE))

print("Programa finalizado. Información recolectada de {} clientes.".format(contador))
