Feature extraction with tweet IDs - python

I am trying to extract the information of a tweet from an ID. With the ID, I want to get the tweet creation date, the tweet, location, followers, friends, favorites, their profile description, if they're verified, and the language, but I'm having trouble doing it. Next, I will show the steps I follow to carry out what I want.
I have made the following code. To start with, I have the IDs of the tweets in a txt file and I read them as follows:
# Read txt file
txt = '/content/drive/MyDrive/Mini-proyecto Texto/archivo.txt'
with open(txt) as archivo:
lines = archivo.readlines()
Next, I add each of the IDs to a list:
# Add the IDs to a list
IDs = []
for i in lines:
IDs.append(i.rsplit())
#print(i.rsplit())
IDs
#[['1206924075374956547'],
# ['1210912199402819584'],
# ['1210643148998938625'],
# ['1207776839697129472'],
# ['1203627609759920128'],
# ['1205895318212136961'],
# ['1208145724879364100'], ...
Finally, start extracting the information you require as follows:
# Extract information from tweets
tweets_df2 = pd.DataFrame()
for i in IDs:
try:
info_tweet = api.get_status(i, tweet_mode="extended")
except:
pass
tweets_df2 = tweets_df2.append(pd.DataFrame({'ID': info_tweet.id,
'Tweet': info_tweet.full_text,
'Creado_tweet': info_tweet.created_at,
'Locacion_usuario': info_tweet.user.location,
'Seguidores_usuario': info_tweet.user.followers_count,
'Amigos_usuario': info_tweet.user.friends_count,
'Favoritos_usuario': info_tweet.user.favourites_count,
'Descripcion_usuario': info_tweet.user.description,
'Verificado_usuario': info_tweet.user.verified,
'Idioma': info_tweet.lang}, index=[0]))
tweets_df2 = tweets_df2.reset_index(drop=True)
tweets_df2
The following image is the output of the tweets_df2 variable, but I don't understand why the values are repeated over and over again. Does anyone know what's wrong with my code?
If you need the txt I provide you with the link of the drive. https://drive.google.com/file/d/1vyohQMpLqlKqm6b4iTItcVVqL2wBMXWp/view?usp=sharing
Thank you very much in advance for your time :3

Your code basically runs fine for me with a few adjustments. I noticed that your identation is not correct and your list of IDs is a list of lists (with just one element) rather than one list of elements.
Try this:
api = tweepy.Client(consumer_key=api_key,
consumer_secret=api_key_secret,
access_token=access_token,
access_token_secret=access_token_secret,
bearer_token=bearer_token,
wait_on_rate_limit=True,
)
auth = tweepy.OAuth1UserHandler(
api_key, api_key_secret, access_token, access_token_secret
)
api = tweepy.API(auth)
txt = "misocorpus-misogyny.txt"
with open(txt) as archivo:
lines = archivo.readlines()
IDs = []
for i in lines:
IDs.append(i.strip())# <-- use strip() to remove /n rather than split
tweets_df2 = pd.DataFrame()
for i in IDs:
try:
info_tweet = api.get_status(i, tweet_mode="extended")
except:
pass
tweets_df2 = tweets_df2.append(pd.DataFrame({'ID': info_tweet.id,
'Tweet': info_tweet.full_text,
'Creado_tweet': info_tweet.created_at,
'Locacion_usuario': info_tweet.user.location,
'Seguidores_usuario': info_tweet.user.followers_count,
'Amigos_usuario': info_tweet.user.friends_count,
'Favoritos_usuario': info_tweet.user.favourites_count,
'Descripcion_usuario': info_tweet.user.description,
'Verificado_usuario': info_tweet.user.verified,
'Idioma': info_tweet.lang}, index=[0]))
tweets_df2 = tweets_df2.reset_index(drop=True)
tweets_df2
Result:
ID Tweet Creado_tweet Locacion_usuario Seguidores_usuario Amigos_usuario Favoritos_usuario Descripcion_usuario Verificado_usuario Idioma
0 1206924075374956547 Las feminazis quieren por poco que este chico ... 2019-12-17 13:08:17+00:00 Argentina 1683 2709 28982 El Progresismo es un C谩ncer que quiere destrui... False es
1 1210912199402819584 #CarlosVerareal #Galois2807 Los halagos con pi... 2019-12-28 13:15:40+00:00 Ecuador 398 1668 3123 Cuando te encuentres n una situaci贸n imposible... False es
2 1210643148998938625 #drummniatico No se vaya asustar! Ese es el gr... 2019-12-27 19:26:34+00:00 Samborondon - Ecuador 1901 1432 39508 Todo se alinea a nuestro favor. 馃挋馃挋馃挋馃挋 False es
3 1210643148998938625 #drummniatico No se vaya asustar! Ese es el gr... 2019-12-27 19:26:34+00:00 Samborondon - Ecuador 1901 1432 39508 Todo se alinea a nuestro favor. 馃挋馃挋馃挋馃挋 False es
4 1203627609759920128 Mostritas #Feminazi amenazando como ellas sabe... 2019-12-08 10:49:19+00:00 Lima, Per煤 2505 3825 45087 Latam News Report. Regional and World affairs.... False

Related

How to set a value for an empy list

I am starting to learn to program using BeautifulSoup. What I want to achieve with this code is to save prices from different pages. To achieve this I store the prices of each page in a list and all those lists in a list. The problem is some pages do not save the prices so there are some lists that are completely empty. What I am looking for is that those empty lists are assigned the elements of the "ListaR" so that later I do not have problems. Here's my code:
from bs4 import BeautifulSoup
import requests
import pandas as pd
from decimal import Decimal
from typing import List
AppID = ['495570', '540190', '607210', '575780', '338840', '585830', '637330', '514360', '575760', '530540', '361890', '543170', '346500', '555930', '575700', '595780', '362400', '562360', '745670', '763360', '689360', '363610', '575770', '467310', '380560']
ListaPrecios = list()
ListaUrl = list() #<------- LISTA
Blanco = [""]
ListaR = ["$0.00 USD", "$0.00 USD"]
for x in AppID: # <--------- Para cada una de las AppID...
#STR#
url = "https://steamcommunity.com/market/search?category_753_Game%5B%5D=tag_app_"+x+"&category_753_cardborder%5B%5D=tag_cardborder_0&category_753_item_class%5B%5D=tag_item_class_2#p1_price_asc" # <------ Usa AppID para entrar a sus links de mercado
ListaUrl += [url] # <---------- AGREGA CADA LINK A UNA LISTA
PageCromos = [requests.get(x) for x in ListaUrl]
SoupCromos = [BeautifulSoup(x.content, "html.parser") for x in PageCromos]
PrecioCromos = [x.find_all("span", {"data-price": True}) for x in SoupCromos] # <--------- GUARDA LISTAS DENTRO DE LISTAS CON CODIGO
min_CromoList = []
for item in PrecioCromos:
CromoList = [float(i.text.strip('USD$')) for i in item]
min_CromoList.append(min(CromoList)) # <---------------- Lista con todos los precios minimos de cromos de cada juego
print(min_CromoList)
Output:
ValueError: min() arg is an empty sequence
You can change this line
min_CromoList.append(min(CromoList))
to:
if not CromoList: # this will evaluate to True if the list is empty
min_CromoList.append(min(ListaR))
else:
min_CromoList.append(min(CromoList))
A neat feature of python is that empty lists evaluate to False and non-empty lists evaluate to True.
Since min(ListaR) will always evaluate to '$0.00 USD' it is probably neater to write this as:
if not CromoList:
min_CromoList.append('$0.00 USD')
else:
min_CromoList.append(min(CromoList))

Scrapy using loops in Python

I want an activity to scrapy a web page. The part of data web is route_data.
route_data = ["javascript:mostrarFotografiaHemiciclo( '/wc/htdocs/web/img/diputados/peq/215_14.jpg', '/wc/htdocs/web', 'Batet Lama帽a, Meritxell (Presidenta del Congreso de los Diputados)', 'Diputada por Barcelona', 'G.P. Socialista' ,'','');",
"javascript:mostrarFotografiaHemiciclo( '/wc/htdocs/web/img/diputados/peq/168_14.jpg', '/wc/htdocs/web', 'Rodr铆guez G贸mez de Celis, Alfonso (Vicepresidente Primero)', 'Diputado por Sevilla', 'G.P. Socialista' ,'','');",]
I create a dictionary with empty values.
dictionary_data = {"Nombre":None, "Territorio":None, "Partido":None, "url":None}
I have to save in dictionary_data each one line:
url = /wc/htdocs/web/img/diputados/peq/215_14.jpg
Nombre = Batet Lama帽a, Meritxell
Territorio = Diputada por Barcelona
Partido = G.P. Socialista
For thus, and I loop over route_data.
for i in route_data:
text = i.split(",")
nombre = text[2:4]
territorio = text[4]
partido = text[5]
But the output is:
[" 'Batet Lama帽a", " Meritxell (Presidenta del Congreso de los Diputados)'"] 'Diputada por Barcelona' 'G.P. Socialista'
[" 'Rodr铆guez G贸mez de Celis", " Alfonso (Vicepresidente Primero)'"] 'Diputado por Sevilla' 'G.P. Socialista'
How can it get put correct in dictionary?
A simple solution would be:
all_routes = []
for i in route_data:
text = re.findall("'.+?'", i)
all_routes.append(
{"Nombre": re.sub('\(.*?\)', '', text[2]).strip(),
"Territorio": text[3],
"Partido": text[-2],
"Url": text[0]})

What can I do to scrape 10000 pages without appearing captchas?

Hi there i've been trying to collect all the information in 10,000 pages of this page for a school project, I thought everything was fine until on page 4 I got a mistake. I check the page manually and I find that the page now asks me for a captcha.
What can I do to avoid it? Maybe set a timer between the searchs?
Here it is my code.
import bs4, requests, csv
g_page = requests.get("http://www.usbizs.com/NY/New_York.html")
m_page = bs4.BeautifulSoup(g_page.text, "lxml")
get_Pnum = m_page.select('div[class="pageNav"]')
MAX_PAGE = int(get_Pnum[0].text[9:16])
print("Recolectando informaci贸n de la p谩gina 1 de {}.".format(MAX_PAGE))
contador = 0
information_list = []
for k in range(1, MAX_PAGE):
c_items = m_page.select('div[itemtype="http://schema.org/Corporation"] a')
c_links = []
i = 0
for link in c_items:
c_links.append(link.get("href"))
i+=1
for j in range(len(c_links)):
temp = []
s_page = requests.get(c_links[j])
i_page = bs4.BeautifulSoup(s_page.text, "lxml")
print("Ingresando a: {}".format(c_links[j]))
info_t = i_page.select('div[class="infolist"]')
info_1 = info_t[0].text
info_2 = info_t[1].text
temp = [info_1,info_2]
information_list.append(temp)
contador+=1
with open ("list_information.cv", "w") as file:
writer=csv.writer(file)
for row in information_list:
writer.writerow(row)
print("Informaci贸n de {} clientes recolectada y guardada correctamente.".format(j+1))
g_page = requests.get("http://www.usbizs.com/NY/New_York-{}.html".format(k+1))
m_page = bs4.BeautifulSoup(g_page.text, "lxml")
print("Recolectando informaci贸n de la p谩gina {} de {}.".format(k+1,MAX_PAGE))
print("Programa finalizado. Informaci贸n recolectada de {} clientes.".format(contador))

how do i divide list into smallers list

My list is formatted like:
gymnastics_school,participant_name,all-around_points_earned
I need to divide it up by schools but keep the scores.
import collections
def main():
names = ["gymnastics_school", "participant_name", "all_around_points_earned"]
Data = collections.namedtuple("Data", names)
data = []
with open('state_meet.txt','r') as f:
for line in f:
line = line.strip()
items = line.split(',')
items[2] = float(items[2])
data.append(Data(*items))
These are examples of how they're set up:
Lanier City Gymnastics,Ben W.,55.301
Lanier City Gymnastics,Alex W.,54.801
Lanier City Gymnastics,Sky T.,51.2
Lanier City Gymnastics,William G.,47.3
Carrollton Boys,Cameron M.,61.6
Carrollton Boys,Zachary W.,58.7
Carrollton Boys,Samuel B.,58.6
La Fayette Boys,Nate S.,63
La Fayette Boys,Kaden C.,62
La Fayette Boys,Cohan S.,59.1
La Fayette Boys,Cooper J.,56.101
La Fayette Boys,Avi F.,53.401
La Fayette Boys,Frederic T.,53.201
Columbus,Noah B.,50.3
Savannah Metro,Levi B.,52.801
Savannah Metro,Taylan T.,52
Savannah Metro,Jacob S.,51.5
SAAB Gymnastics,Dawson B.,58.1
SAAB Gymnastics,Dean S.,57.901
SAAB Gymnastics,William L.,57.101
SAAB Gymnastics,Lex L.,52.501
Suwanee Gymnastics,Colin K.,57.3
Suwanee Gymnastics,Matthew B.,53.201
After processing it should look like:
Lanier City Gymnastics:participants(4)
as it own list
Carrollton Boys(3)
as it own list
La Fayette Boys(6)
etc.
I would recommend putting them in dictionaries:
data = {}
with open('state_meet.txt','r') as f:
for line in f:
line = line.strip()
items = line.split(',')
items[2] = float(items[2])
if items[0] in data:
data[items[0]].append(items[1:])
else:
data[items[0]] = [items[1:]]
Then access schools could be done in the following way:
>>> data['Lanier City Gymnastics']
[['Ben W.',55.301],['Alex W.',54.801],['Sky T'.,51.2],['William G.',47.3]
EDIT:
Assuming you need the whole dataset as a list first, then you want to divide it into smaller lists you can generate the dictionary from the list:
data = []
with open('state_meet.txt','r') as f:
for line in f:
line = line.strip()
items = line.split(',')
items[2] = float(items[2])
data.append(items)
#perform median or other operation on your data
nested_data = {}
for items in data:
if items[0] in data:
data[items[0]].append(items[1:])
else:
data[items[0]] = [items[1:]]
nested_data[item[0]]
When you need to get a subset of a list you can use slicing:
mylist[start:stop:step]
where start, stop and step are optional (see link for more comprehensive introduction)

Unhandled exception in py2neo: Type error

I am writing an application whose purpose is to create a graph from a journal dataset. The dataset was a xml file which was parsed in order to extract leaf data. Using this list I wrote a py2neo script to create the graph. The file is attached to this message.
As the script was processed an exception was raised:
The debugged program raised the exception unhandled TypeError
"(1676 {"titulo":"reconhecimento e agrupamento de objetos de aprendizagem semelhantes"})"
File: /usr/lib/python2.7/site-packages/py2neo-1.5.1-py2.7.egg/py2neo/neo4j.py, Line: 472
I don't know how to handle this. I think that the code is syntactically correct...but...
I dont know if I shoud post the entire code here, so the code is at: https://gist.github.com/herlimenezes/6867518
There goes the code:
+++++++++++++++++++++++++++++++++++
'
#!/usr/bin/env python
#
from py2neo import neo4j, cypher
from py2neo import node, rel
# calls database service of Neo4j
#
graph_db = neo4j.GraphDatabaseService("DEFAULT_DOMAIN")
#
# following nigel small suggestion in http://stackoverflow.com
#
titulo_index = graph_db.get_or_create_index(neo4j.Node, "titulo")
autores_index = graph_db.get_or_create_index(neo4j.Node, "autores")
keyword_index = graph_db.get_or_create_index(neo4j.Node, "keywords")
dataPub_index = graph_db.get_or_create_index(neo4j.Node, "data")
#
# to begin, database clear...
graph_db.clear() # not sure if this really works...let's check...
#
# the big list, next version this is supposed to be read from a file...
#
listaBase = [['2007-12-18'], ['RECONHECIMENTO E AGRUPAMENTO DE OBJETOS DE APRENDIZAGEM SEMELHANTES'], ['Raphael Ghelman', 'SWMS', 'MHLB', 'RNM'], ['Objetos de Aprendizagem', u'Personaliza\xe7\xe3o', u'Perfil do Usu\xe1rio', u'Padr\xf5es de Metadados', u'Vers\xf5es de Objetos de Aprendizagem', 'Agrupamento de Objetos Similares'], ['2007-12-18'], [u'LOCPN: REDES DE PETRI COLORIDAS NA PRODU\xc7\xc3O DE OBJETOS DE APRENDIZAGEM'], [u'Maria de F\xe1tima Costa de Souza', 'Danielo G. Gomes', 'GCB', 'CTS', u'Jos\xe9 ACCF', 'MCP', 'RMCA'], ['Objetos de Aprendizagem', 'Modelo de Processo', 'Redes de Petri Colorida', u'Especifica\xe7\xe3o formal'], ['2007-12-18'], [u'COMPUTA\xc7\xc3O M\xd3VEL E UB\xcdQUA NO CONTEXTO DE UMA GRADUA\xc7\xc3O DE REFER\xcaNCIA'], ['JB', 'RH', 'SR', u'S\xe9rgio CCSPinto', u'D\xe9bora NFB'], [u'Computa\xe7\xe3o M\xf3vel e Ub\xedqua', u'Gradua\xe7\xe3o de Refer\xeancia', u' Educa\xe7\xe3o Ub\xedqua']]
#
pedacos = [listaBase[i:i+4] for i in range(0, len(listaBase), 4)] # pedacos = chunks
#
# lists to collect indexed nodes: is it really useful???
# let's think about it when optimizing code...
dataPub_nodes = []
titulo_nodes = []
autores_nodes = []
keyword_nodes = []
#
#
for i in range(0, len(pedacos)):
# fill dataPub_nodes and titulo_nodes with content.
#dataPub_nodes.append(dataPub_index.get_or_create("data", pedacos[i][0], {"data":pedacos[i][0]})) # Publication date nodes...
dataPub_nodes.append(dataPub_index.get_or_create("data", str(pedacos[i][0]).strip('[]'), {"data":str(pedacos[i][0]).strip('[]')}))
# ------------------------------- Exception raised here... --------------------------------
# The debugged program raised the exception unhandled TypeError
#"(1649 {"titulo":["RECONHECIMENTO E AGRUPAMENTO DE OBJETOS DE APRENDIZAGEM SEMELHANTES"]})"
#File: /usr/lib/python2.7/site-packages/py2neo-1.5.1-py2.7.egg/py2neo/neo4j.py, Line: 472
# ------------------------------ What happened??? ----------------------------------------
titulo_nodes.append(titulo_index.get_or_create("titulo", str(pedacos[i][1]).strip('[]'), {"titulo":str(pedacos[i][1]).strip('[]')})) # title node...
# creates relationship publicacao
publicacao = graph_db.get_or_create_relationships(titulo_nodes[i], "publicado_em", dataPub_nodes[i])
# now processing autores sublist and collecting in autores_nodes
#
for j in range(0, len(pedacos[i][2])):
# fill autores_nodes list
autores_nodes.append(autores_index.get_or_create("autor", pedacos[i][2][j], {"autor":pedacos[i][2][j]}))
# creates autoria relationship...
#
autoria = graph_db.get_or_create_relationships(titulo_nodes[i], "tem_como_autor", autores_nodes[j])
# same logic...
#
for k in range(0, len(pedacos[i][3])):
keyword_nodes.append(keyword_index.get_or_create("keyword", pedacos[i][3][k]))
# cria o relacionamento 'tem_como_keyword'
tem_keyword = graph_db.get_or_create_relationships(titulo_nodes[i], "tem_como_keyword", keyword_nodes[k])
`
The fragment of py2neo which raised the exception
def get_or_create_relationships(self, *abstracts):
""" Fetch or create relationships with the specified criteria depending
on whether or not such relationships exist. Each relationship
descriptor should be a tuple of (start, type, end) or (start, type,
end, data) where start and end are either existing :py:class:`Node`
instances or :py:const:`None` (both nodes cannot be :py:const:`None`).
Uses Cypher `CREATE UNIQUE` clause, raising
:py:class:`NotImplementedError` if server support not available.
.. deprecated:: 1.5
use either :py:func:`WriteBatch.get_or_create_relationship` or
:py:func:`Path.get_or_create` instead.
"""
batch = WriteBatch(self)
for abstract in abstracts:
if 3 <= len(abstract) <= 4:
batch.get_or_create_relationship(*abstract)
else:
raise TypeError(abstract) # this is the 472 line.
try:
return batch.submit()
except cypher.CypherError:
raise NotImplementedError(
"The Neo4j server at <{0}> does not support " \
"Cypher CREATE UNIQUE clauses or the query contains " \
"an unsupported property type".format(self.__uri__)
)
======
Any help?
I have already fixed it, thanks to Nigel Small. I made a mistake as I wrote the line on creating relationships. I typed:
publicacao = graph_db.get_or_create_relationships(titulo_nodes[i], "publicado_em", dataPub_nodes[i])
and it must to be:
publicacao = graph_db.get_or_create_relationships((titulo_nodes[i], "publicado_em", dataPub_nodes[i]))
By the way, there is also another coding error:
keyword_nodes.append(keyword_index.get_or_create("keyword", pedacos[i][3][k]))
must be
keyword_nodes.append(keyword_index.get_or_create("keyword", pedacos[i][3][k], {"keyword":pedacos[i][3][k]}))

Categories

Resources