Python: List of URLs to text list in Python (Excel)

I have a list of URLs to tweets in Excel. Is it possible to extract the text of these tweets (URLs) in Python, and later save it back to the Excel sheet?
I saw someone use the code below, but it only works for one URL.
from lxml import html
import requests
page = requests.get('https://twitter.com/realDonaldTrump/status/1237448419284783105')
tree = html.fromstring(page.content)
tree.xpath('//div[contains(@class, "permalink-tweet-container")]//p[contains(@class, "tweet-text")]//text()')
The Excel file ('twitter.xlsx') has two columns, Author and URL, and looks like this:
Author      URL
realDon..   https://twitter.com/realDon..
...         ...
I tried this code:
import pandas as pd
from lxml import html
import requests
input_data = pd.read_excel('twitter.xlsx')
input_data1 = input_data[['URL']]
tweets = []
for url in input_data1.values:
    x = requests.get(url)
    tree = html.fromstring(x.content)
    i = tree.xpath('//div[contains(@class, "permalink-tweet-container")]//p[contains(@class, "tweet-text")]//text()')
    tweets.append(i)
Error:
InvalidSchema: No connection adapters were found for '['https://twitter.com/realDonaldTrump/status/1237448419284783105']'

Short answer - yes.
Long answer - yes, it's possible. I suggest you do some reading on the topic.
https://automatetheboringstuff.com/chapter12/ covers how to manage and manipulate Excel files; the openpyxl library is your friend here (see its documentation).
requests is a great library for fetching web pages (see its documentation).
Here's a pseudo-code mock-up of what your program logic could look like:
input_data = read(excel_file)
tweets = []
for url in input_data:
    x = get(url)
    tweets.append(x)
for tweet in tweets:
    write(tweet, excel_file)
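To make that concrete: the InvalidSchema error in the question comes from looping over input_data[['URL']].values, which yields one-element arrays rather than strings, so requests.get() receives a list. Here is a minimal sketch of the whole pipeline with pandas (the output filename is just an example); note that the XPath targets Twitter's old server-rendered markup and may return nothing today, since tweets are now rendered with JavaScript.
import pandas as pd
import requests
from lxml import html

input_data = pd.read_excel('twitter.xlsx')

tweets = []
# Iterate the column itself: input_data[['URL']].values yields one-element
# arrays, and passing an array to requests.get() raises InvalidSchema.
for url in input_data['URL']:
    page = requests.get(url)
    tree = html.fromstring(page.content)
    # This XPath matched Twitter's old server-rendered markup and may return
    # an empty list today, since tweets are now rendered with JavaScript.
    text = tree.xpath('//div[contains(@class, "permalink-tweet-container")]'
                      '//p[contains(@class, "tweet-text")]//text()')
    tweets.append(''.join(text))

# Write the scraped text back alongside the original columns.
input_data['Tweet'] = tweets
input_data.to_excel('twitter_with_text.xlsx', index=False)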

Related

How to convert URL data to CSV using Python

I am trying to download the data from the following URL and save it as CSV, but the output I am getting is a text file. Can anyone please help with what I am doing wrong here? Also, is it possible to add multiple URLs in the same script and download multiple CSV files?
import csv
import pandas as pd
import requests
from datetime import datetime
CSV_URL = ('https://dsv-ops-toolkit.ihsmvals.com/ftp?config=fenics-bgc&file=IRSDATA_20211129_1700_Intra.csv&directory=%2FIRS%2FIntraday%2FDaily')
with requests.Session() as s:
    download = s.get(CSV_URL)
    decoded_content = download.content.decode('utf-8')
    cr = csv.reader(decoded_content.splitlines(), delimiter=',')
    date = datetime.now().strftime('%y%m%d')
    my_list = list(cr)
    df = pd.DataFrame(my_list)
    df.to_csv(f'RFR_{date}')
You can create a list of your necessary URLs like:
urls = ['http://url1.com','http://url2.com','http://url3.com']
Iterate through the list for each url and your requests will be as it is:
for each_url in urls:
    with requests.Session() as s:
        # your_code_here
Hope you'll find this helpful.
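Putting that together, here is a minimal sketch (the URLs are placeholders). One extra point: the original df.to_csv(f'RFR_{date}') call writes a file with no extension, which is why the output shows up as a plain text file; adding .csv, plus a per-URL suffix so the files don't overwrite each other, fixes both issues.
import csv
from datetime import datetime

import pandas as pd
import requests

# Placeholder URLs: substitute the real CSV endpoints.
urls = ['http://url1.com', 'http://url2.com', 'http://url3.com']

date = datetime.now().strftime('%y%m%d')

with requests.Session() as s:
    for i, each_url in enumerate(urls):
        download = s.get(each_url)
        decoded_content = download.content.decode('utf-8')
        cr = csv.reader(decoded_content.splitlines(), delimiter=',')
        df = pd.DataFrame(list(cr))
        # A .csv extension and a per-URL suffix keep the outputs distinct
        # and recognizable as CSV files.
        df.to_csv(f'RFR_{date}_{i}.csv', index=False)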

Jupyter Notebook - The text part is not getting printed

I am able to scrape text from the following website: https://scrapsfromtheloft.com/2020/04/25/chris-d-elia-white-male-black-comic-transcript/
I used the following code in a Jupyter notebook:
import requests
import bs4
import pickle
from bs4 import BeautifulSoup
def url_to_transcript(url):
    page = requests.get(url).text
    soup = BeautifulSoup(page, "lxml")
    text = [p.text for p in soup.find(class_="post-content").find_all('p')]
    print(url)
    return text
urls = ['https://scrapsfromtheloft.com/2020/04/25/chris-d-elia-white-male-black-comic-transcript/']
writer = ['chris']
for i in urls:
    transcript = url_to_transcript(i)
    print(transcript)
After scraping the text from the website, I used this code to pickle the file.
for i, c in enumerate(writer):
    with open("transcripts/" + c + ".txt", "wb") as file:
        pickle.dump("transcripts[i]", file)
But when I checked the stored file, it did not contain the text I scraped, just this fragment: €X transcripts[i]q .
I am totally a newbie here, so I am not sure what I am doing wrong. I just want to save the text I extract from the website into the directory. Please clarify. Thanks.
Your question doesn't show how this variable is generated, but assuming transcripts is a list of lists containing text, see the difference in the following output:
>>> import pickle
>>> transcripts = [["first_{}".format(i), "second_{}".format(i)] for i in range(3)]
>>> transcripts
[['first_0', 'second_0'], ['first_1', 'second_1'], ['first_2', 'second_2']]
>>> i=0
>>> pickle.loads(pickle.dumps("transcripts[i]"))
'transcripts[i]'
>>> pickle.loads(pickle.dumps(transcripts[i]))
['first_0', 'second_0']
>>>
In the first call, pickle simply pickles the text "transcripts[i]", while in the second (without quotes), it pickles the value at position i of transcripts.
Please note that there's no magic in Python that transforms singular names to plural, so you'll need to explicitly declare/populate it, like so:
transcripts = []
for i in urls:
    transcript = url_to_transcript(i)
    print(transcript)
    transcripts.append(transcript)
If your code did not explicitly declare transcripts, then enclosing it in quotation marks would avoid a NameError, but probably not in the way you intended.
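With transcripts declared and populated as above, the original pickling loop only needs the quotation marks removed so the object itself is serialized (a minimal sketch, assuming the transcripts/ directory exists and writer is defined as in the question):
import pickle

for i, c in enumerate(writer):
    with open("transcripts/" + c + ".txt", "wb") as file:
        # Dump the list object itself, not the string "transcripts[i]".
        pickle.dump(transcripts[i], file)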

Python: Extract exact word from a URL

I began learning Python 2 days ago, and I am trying to make a script that extracts some data from a URL and saves it. The problem is that I want to extract only a specific piece of data from a long line.
Example:
{"2019-11-19":{"period":"2019-11-19T00:00:00+00:00","uniqs":"344627","hits":"0","clicked":"4922","pay":126.52971186,"currency":"RON","subs":0},"2019-11-20":{"period":"2019-11-20T00:00:00+00:00","uniqs":"1569983","hits":"0","clicked":"15621","pay":358.43100342,"currency":"RON","subs":0},"2019-11-21":{"period":"2019-11-21T00:00:00+00:00","uniqs":"1699844","hits":"0","clicked":"16172","pay":363.15667371,"currency":"RON","subs":0},"2019-11-22":{"period":"2019-11-22T00:00:00+00:00","uniqs":"1779319","hits":"0","clicked":"17865","pay":384.67092962,"currency":"RON","subs":0},"2019-11-23":{"period":"2019-11-23T00:00:00+00:00","uniqs":"1825346","hits":"0","clicked":"17740","pay":356.72833095,"currency":"RON","subs":0},"2019-11-24":{"period":"2019-11-24T00:00:00+00:00","uniqs":"1732639","hits":"0","clicked":"16870","pay":308.4201041,"currency":"RON","subs":0},"2019-11-25":{"period":"2019-11-25T00:00:00+00:00","uniqs":"1826060","hits":"0","clicked":"17991","pay":346.29137133,"currency":"RON","subs":0},"2019-11-26":{"period":"2019-11-26T00:00:00+00:00","uniqs":"1873961","hits":"0","clicked":"18645","pay":379.17652358,"currency":"RON","subs":0},"2019-11-27":{"period":"2019-11-27T00:00:00+00:00","uniqs":"1734207","hits":"0","clicked":"16187","pay":251.91152953,"currency":"RON","subs":0},"2019-11-28":{"period":"2019-11-28T00:00:00+00:00","uniqs":"1611611","hits":"0","clicked":"12056","pay":158.96447829,"currency":"RON","subs":0},"2019-11-29":{"period":"2019-11-29T00:00:00+00:00","uniqs":"712011","hits":"0","clicked":"6242","pay":85.70053418,"currency":"RON","subs":0},"2019-11-30":{"period":"2019-11-30T00:00:00+00:00","uniqs":"47957","hits":"0","clicked":"427","pay":8.32775435,"currency":"RON","subs":0},"2019-12-01":{"period":"2019-12-01T00:00:00+00:00","uniqs":"1268892","hits":"0","clicked":"11779","pay":217.42321168,"currency":"RON","subs":0},"2019-12-02":{"period":"2019-12-02T00:00:00+00:00","uniqs":"1130724","hits":"0","clicked":"10694","pay":195.44476902,"currency":"RON","subs":0},"2019-12-03":{"period":"2019-12-03T00:00:00+00:00","uniqs":"1058965","hits":"0","clicked":"8123","pay":151.05243751,"currency":"RON","subs":0},"2019-12-04":{"period":"2019-12-04T00:00:00+00:00","uniqs":"1228326","hits":"0","clicked":"12230","pay":230.84154581,"currency":"RON","subs":0},"2019-12-05":{"period":"2019-12-05T00:00:00+00:00","uniqs":"1181029","hits":"0","clicked":"11467","pay":196.21644271,"currency":"RON","subs":0},"2019-12-06":{"period":"2019-12-06T00:00:00+00:00","uniqs":"951828","hits":"0","clicked":"9379","pay":153.35155293,"currency":"RON","subs":0},"2019-12-07":{"period":"2019-12-07T00:00:00+00:00","uniqs":"1172156","hits":"0","clicked":"11776","pay":181.65819439,"currency":"RON","subs":0},"2019-12-08":{"period":"2019-12-08T00:00:00+00:00","uniqs":"912109","hits":"0","clicked":"9240","pay":147.6364827,"currency":"RON","subs":0}}
I am trying to extract the value after "pay": and save it to a file; after that I wrote code that calculates the total and gives me the result :D I worked on this for a day :D
I use this code to extract and save the data from the link:
from urllib.request import urlopen as uReq
url1 = 'http://link.com'
page = uReq(url1).read().decode()
f = open("dataNEW.txt", "w")
f.write(page)
f.close()
but the problem is that it writes all the details there; I want to save only what comes after "pay".
That string is in JSON format and can easily be converted to a Python data structure using the json package. Here is an example:
import json
from urllib.request import urlopen as uReq
url1 = 'http://link.com'
page = uReq(url1).read().decode()
data = json.loads(page)
with open("dataNEW.txt", "w") as f:
for sub_dict in data.values():
f.write("{}\n".format(sub_dict["pay"]))
Your dataNEW.txt should then look like the following:
358.43100342
363.15667371
384.67092962
356.72833095
126.52971186
346.29137133
379.17652358
251.91152953
158.96447829
85.70053418
8.32775435
147.6364827
153.35155293
181.65819439
308.4201041
196.21644271
230.84154581
151.05243751
195.44476902
217.42321168
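Since the stated goal is to total these amounts, the sum can also be computed directly from the parsed data above, without reading the file back (a minimal sketch):
# Sum the "pay" value of every date entry in the parsed JSON.
total = sum(sub_dict["pay"] for sub_dict in data.values())
print(round(total, 2))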

How to input a list of URLs saved in a .txt to a Python program?

I have a list of URLs saved in a .txt file and I would like to feed them, one at a time, to a variable named url to which I apply methods from the newspaper3k python library. The program extracts the URL content, authors of the article, a summary of the text, etc, then prints the info to a new .txt file. The script works fine when you give it one URL as user input, but what should I do in order to read from a .txt with thousands of URLs?
I am only beginning with Python; as a matter of fact, this is my first script. I tried to simply say url = (myfile.txt), but I realized this wouldn't work because I have to read the file one line at a time. So I tried to apply read() and readlines() to it, but that failed because 'str' object has no attribute 'read' or 'readlines'. What should I use to read the URLs saved in a .txt file, each on a new line, as the input of my script? Should I convert the string to something else?
Extract from the code, lines 1-18:
from newspaper import Article
from newspaper import fulltext
import requests
url = input("Article URL: ")
a = Article(url, language='pt')
html = requests.get(url).text
text = fulltext(html)
download = a.download()
parse = a.parse()
nlp = a.nlp()
title = a.title
publish_date = a.publish_date
authors = a.authors
keywords = a.keywords
summary = a.summary
Later I built some functions to display the info in a desired format and save it to a new .txt. I know this is a very basic one, but I am honestly stuck... I have read other similar questions here, but I couldn't properly understand or apply the suggestions. So, what is the best way to read URLs from a .txt file in order to feed them, one at a time, to the url variable, to which the other methods are then applied to extract its content?
This is my first question here and I understand the forum is aimed at more experienced programmers, but I would really appreciate some help. If I need to edit or clarify something in this post, please let me know and I will correct immediately.
Here is one way you could do it:
from newspaper import Article
from newspaper import fulltext
import requests
with open('myfile.txt', 'r') as f:
    for line in f:
        # do not forget to strip the trailing newline
        url = line.rstrip("\n")
        a = Article(url, language='pt')
        html = requests.get(url).text
        text = fulltext(html)
        download = a.download()
        parse = a.parse()
        nlp = a.nlp()
        title = a.title
        publish_date = a.publish_date
        authors = a.authors
        keywords = a.keywords
        summary = a.summary
This could help you:
url_file = open('myfile.txt', 'r')
for url in url_file.readlines():
    print(url)
url_file.close()
You can apply it to your code as follows:
from newspaper import Article
from newspaper import fulltext
import requests
url_file = open('myfile.txt', 'r')
for url in url_file.readlines():
    url = url.strip()  # drop the trailing newline before using the URL
    a = Article(url, language='pt')
    html = requests.get(url).text
    text = fulltext(html)
    download = a.download()
    parse = a.parse()
    nlp = a.nlp()
    title = a.title
    publish_date = a.publish_date
    authors = a.authors
    keywords = a.keywords
    summary = a.summary
url_file.close()

Load JSON data from Google GitHub repo

I am trying to load the following JSON file (from the Google GitHub repo) in Python as follows:
import json
import requests
url = "https://raw.githubusercontent.com/google/vsaq/master/questionnaires/webapp.json"
r = requests.get(url)
data = r.text.splitlines(True)
# skip the first 14 lines, which are commented license text, not JSON
data = ''.join(data[14:])
When I use json.loads(data) I get the following error:
JSONDecodeError: Expecting ',' delimiter: line 725 column 543 (char 54975)
As this has been saved as a .json file by the GitHub repo owner (Google), I'm wondering what I'm doing wrong here.
I found that the text obtained from the request is plain text, not valid JSON (I checked at https://jsonformatter.curiousconcept.com/).
Here is the code I used to extract the JSON-like part from the response, using the re module.
import json
import requests
import re
url = "https://raw.githubusercontent.com/google/vsaq/master/questionnaires/webapp.json"
r = requests.get(url)
text = r.text.strip()
m = re.search(r'\{(.|\s)*\}', text)  # find the JSON-like part of the obtained text
s = m.group(0).replace('false', 'False')  # Python spells it 'False', not 'false'
d = eval(s)
print(d) # {...}
print(type(d)) # <class 'dict'>
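One caveat: eval will execute anything that appears in the downloaded text. A safer drop-in for this trick is ast.literal_eval from the standard library, which parses Python literals but never runs code (a sketch using the same regex and replacement as above):
import ast
import re
import requests

url = "https://raw.githubusercontent.com/google/vsaq/master/questionnaires/webapp.json"
text = requests.get(url).text.strip()

m = re.search(r'\{(.|\s)*\}', text)        # same extraction as above
s = m.group(0).replace('false', 'False')   # same literal replacement as above
d = ast.literal_eval(s)                    # parses literals only, never runs code
print(type(d))                             # <class 'dict'>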
References:
https://docs.python.org/3.6/library/re.html
https://jsonformatter.curiousconcept.com/
