Apologies in advance for the long block of code following. I'm new to BeautifulSoup, but found there were some useful tutorials using it to scrape RSS feeds for blogs. Full disclosure: this is code adapted from this video tutorial which has been immensely helpful in getting this off the ground: http://www.youtube.com/watch?v=Ap_DlSrT-iE.
Here's my problem: the video does a great job of showing how to print the relevant content to the console. I need to write out each article's text to a separate .txt file and save it to some directory (right now I'm just trying to save to my Desktop). I know the problem lies i the scope of the two for-loops near the end of the code (I've tried to comment this for people to see quickly--it's the last comment beginning # Here's where I'm lost...), but I can't seem to figure it out on my own.
Currently what the program does is takes the text from the last article read in by the program and writes that out to the number of .txt files that are indicated in the variable listIterator. So, in this case I believe there are 20 .txt files that get written out, but they all contain the text of the last article that's looped over. What I want the program to do is loop over each article and print the text of each article out to a separate .txt file. Sorry for the verbosity, but any insight would be really appreciated.
from urllib import urlopen
from bs4 import BeautifulSoup
import re
# Read in webpage.
webpage = urlopen('http://talkingpointsmemo.com/feed/livewire').read()
# On RSS Feed site, find tags for title of articles and
# tags for article links to be downloaded.
patFinderTitle = re.compile('<title>(.*)</title>')
patFinderLink = re.compile('<link rel.*href="(.*)"/>')
# Find the tags listed in variables above in the articles.
findPatTitle = re.findall(patFinderTitle, webpage)
findPatLink = re.findall(patFinderLink, webpage)
# Create a list that is the length of the number of links
# from the RSS feed page. Use this to iterate over each article,
# read it in, and find relevant text or <p> tags.
listIterator = []
listIterator[:] = range(len(findPatTitle))
for i in listIterator:
# Print each title to console to ensure program is working.
print findPatTitle[i]
# Read in the linked-to article.
articlePage = urlopen(findPatLink[i]).read()
# Find the beginning and end of articles using tags listed below.
divBegin = articlePage.find("<div class='story-teaser'>")
divEnd = articlePage.find("<footer class='article-footer'>")
# Define article variable that will contain all the content between the
# beginning of the article to the end as indicated by variables above.
article = articlePage[divBegin:divEnd]
# Parse the page using BeautifulSoup
soup = BeautifulSoup(article)
# Compile list of all <p> tags for each article and store in paragList
paragList = soup.findAll('p')
# Create empty string to eventually convert items in paragList to string to
# be written to .txt files.
para_string = ''
# Here's where I'm lost and have some sort of scope issue with my for-loops.
for i in paragList:
para_string = para_string + str(i)
newlist = range(len(findPatTitle))
for i in newlist:
ofile = open(str(listIterator[i])+'.txt', 'w')
ofile.write(para_string)
ofile.close()
The reason why it seems that only the last article is written down, is because all the articles are writer to 20 separate files over and over again. Lets have a look at the following:
for i in paragList:
para_string = para_string + str(i)
newlist = range(len(findPatTitle))
for i in newlist:
ofile = open(str(listIterator[i])+'.txt', 'w')
ofile.write(para_string)
ofile.close()
You are writing parag_string over and over again to the same 20 files for each iteration. What you need to be doing is this, append all your parag_strings to a separate list, say paraStringList, and then write all its contents to separate files, like so:
for i, var in enumerate(paraStringList): # Enumerate creates a tuple
with open("{0}.txt".format(i), 'w') as writer:
writer.write(var)
Now that this needs to be outside of your main loop i.e. for i in listIterator:(...). This is a working version of the program:
from urllib import urlopen
from bs4 import BeautifulSoup
import re
webpage = urlopen('http://talkingpointsmemo.com/feed/livewire').read()
patFinderTitle = re.compile('<title>(.*)</title>')
patFinderLink = re.compile('<link rel.*href="(.*)"/>')
findPatTitle = re.findall(patFinderTitle, webpage)[0:4]
findPatLink = re.findall(patFinderLink, webpage)[0:4]
listIterator = []
listIterator[:] = range(len(findPatTitle))
paraStringList = []
for i in listIterator:
print findPatTitle[i]
articlePage = urlopen(findPatLink[i]).read()
divBegin = articlePage.find("<div class='story-teaser'>")
divEnd = articlePage.find("<footer class='article-footer'>")
article = articlePage[divBegin:divEnd]
soup = BeautifulSoup(article)
paragList = soup.findAll('p')
para_string = ''
for i in paragList:
para_string += str(i)
paraStringList.append(para_string)
for i, var in enumerate(paraStringList):
with open("{0}.txt".format(i), 'w') as writer:
writer.write(var)
Related
I'm compiling some data for a project and I've been using PyPDF4 to read this data from it's source PDF file, but I've been having trouble with certain characters not showing up correctly. Here's my code:
from PyPDF4 import PdfFileReader
import pandas as pd
import numpy as np
import os
import xml.etree.cElementTree as ET
# File name
pdf_path = "PV-9-2020-10-23-RCV_FR.pdf"
# Results storage
results = {}
# Start page
page = 5
# Lambda to assign votes
serify = lambda voters, vote: pd.Series({voter.strip(): vote for voter in voters})
with open(pdf_path, 'rb') as f:
# Get PDF reader for PDF file f
pdf = PdfFileReader(f)
while page < pdf.numPages:
# Get text of page in PDF
text = pdf.getPage(page).extractText()
proposal = text.split("\n+\n")[0].split("\n")[3]
# Collect all pages relevant pages
while text.find("\n0\n") is -1:
page += 1
text += "\n".join(pdf.getPage(page).extractText().split("\n")[3:])
# Remove corrections
text, corrections = text.split("CORRECCIONES")
# Grab relevant text !!! This is where the missing characters show up.
text = "\n, ".join([n[:n.rindex("\n")] for n in text.split("\n:")])
for_list = "".join(text[text.index("\n+\n")+3:text.index("\n-\n")].split("\n")[:-1]).split(", ")
nay_list = "".join(text[text.index("\n-\n")+3:text.index("\n0\n")].split("\n")[:-1]).split(", ")
abs_list = "".join(text[text.index("\n0\n")+3:].split("\n")[:-1]).split(", ")
# Store data in results
results.update({proposal: dict(pd.concat([serify(for_list, 1), serify(nay_list, -1), serify(abs_list, 0)]).items())})
page += 1
print(page)
results = pd.DataFrame(results)
The characters I'm having difficulty don't show up in the text extracted using extractText. Ždanoka for instance becomes "danoka, Štefanec becomes -tefanc. It seems like most of the characters are Eastern European, which makes me think I need one of the latin decoders.
I've looked through some of PyPDF4's capabilities, it seems like it has plenty of relevant codecs, including latin1. I've attempted decoding the file using different functions from the PyPDF4.generic.codecs module, and either the characters don't show still, or the code throws an error at an unrecognised byte.
I haven't yet attempted using multiple codecs on different bytes from the same file, that seems like it would take some time. Am I missing something in my code that can easily fix this? Or is it more likely I will have to tailor fit a solution using PyPDF4's functions?
Use pypdf instead of PyPDF2/PyPDF3/PyPDF4. You will need to apply the migrations.
pypdf has received a lot of updates in December 2022. Especially the text extraction.
To give you a minimal full example for text extraction:
from pypdf import PdfReader
reader = PdfReader("example.pdf")
for page in reader.pages:
print(page.extract_text())
I can able to scrape text from the following website. https://scrapsfromtheloft.com/2020/04/25/chris-d-elia-white-male-black-comic-transcript/
I used the following code in Jypyter notebook,
import requests
import bs4
import pickle
from bs4 import BeautifulSoup
def url_to_transcript(url):
page = requests.get(url).text
soup = BeautifulSoup(page, "lxml")
text = [p.text for p in soup.find(class_="post-content").find_all('p')]
print(url)
return text
urls = ['https://scrapsfromtheloft.com/2020/04/25/chris-d-elia-white-male-black-comic-transcript/']
writer = ['chris']
for i in urls:
transcript=url_to_transcript(i)
print(transcript)
After scraping the text from the website, I used this code to pickle the file.
for i, c in enumerate(writer):
with open("transcripts/" + c + ".txt", "wb") as file:
pickle.dump("transcripts[i]", file)
But when I checked the text file that was stored, there wasn't available the text I scraped, but just these two words alone €X transcripts[i]q .
I am totally a newbie here so I am not sure what I am doing wrong. I just want the anaconda to print the text I extract from a website in the directory. Please clarify me. Thanks
While your question doesn't show how this variable is generated, assuming transcripts is a list of lists containing text, then see the difference in the following output:
>>> import pickle
>>> transcripts = [["first_{}".format(i), "second_{}".format(i)] for i in range(3)]
>>> transcripts
[['first_0', 'second_0'], ['first_1', 'second_1'], ['first_2', 'second_2']]
>>> i=0
>>> pickle.loads(pickle.dumps("transcripts[i]"))
'transcripts[i]'
>>> pickle.loads(pickle.dumps(transcripts[i]))
['first_0', 'second_0']
>>>
In the first call, pickle simply pickles the text "transcripts[i]", while in the second (without quotes), it will pickle the value referenced by transcript in position i.
Please note that there's no magic in python that transforms singular names to plural, so you'll need explicitly declare/populate it, like so:
transcripts = []
for i in urls:
transcript=url_to_transcript(i)
print(transcript)
transcripts.append(transcript)
If your code did not explicitly declare transcripts, then enclosing it with quotation marks would solve the NameError exception, but probably not the way you intended it to.
I have a list of URLs saved in a .txt file and I would like to feed them, one at a time, to a variable named url to which I apply methods from the newspaper3k python library. The program extracts the URL content, authors of the article, a summary of the text, etc, then prints the info to a new .txt file. The script works fine when you give it one URL as user input, but what should I do in order to read from a .txt with thousands of URLs?
I am only beginning with Python, as a matter of fact this is my first script, so I have tried to simply say url = (myfile.txt), but I realized this wouldn't work because I have to read the file one line at a time. So I have tried to apply read() and readlines() to it, but it wouldn't work properly because 'str' object has no attribute 'read' or 'readlines'. What should I use to read those URLs saved in a .txt file, each beginning in a new line, as the input of my simple script? Should I convert string to something else?
Extract from the code, lines 1-18:
from newspaper import Article
from newspaper import fulltext
import requests
url = input("Article URL: ")
a = Article(url, language='pt')
html = requests.get(url).text
text = fulltext(html)
download = a.download()
parse = a.parse()
nlp = a.nlp()
title = a.title
publish_date = a.publish_date
authors = a.authors
keywords = a.keywords
summary = a.summary
Later I have built some functions to display the info in a desired format and save it to a new .txt. I know this is a very basic one, but I am honestly stuck... I have read other similar questions here but I couldn't properly understand or apply the suggestions. So, what is the best way to read URLs from a .txt file in order to feed them, one at a time, to the url variable, to which other methods are them applied to extract its content?
This is my first question here and I understand the forum is aimed at more experienced programmers, but I would really appreciate some help. If I need to edit or clarify something in this post, please let me know and I will correct immediately.
Here is one way you could do it:
from newspaper import Article
from newspaper import fulltext
import requests
with open('myfile.txt',r) as f:
for line in f:
#do not forget to strip the trailing new line
url = line.rstrip("\n")
a = Article(url, language='pt')
html = requests.get(url).text
text = fulltext(html)
download = a.download()
parse = a.parse()
nlp = a.nlp()
title = a.title
publish_date = a.publish_date
authors = a.authors
keywords = a.keywords
summary = a.summary
This could help you:
url_file = open('myfile.txt','r')
for url in url_file.readlines():
print url
url_file.close()
You can apply it on your code as the following
from newspaper import Article
from newspaper import fulltext
import requests
url_file = open('myfile.txt','r')
for url in url_file.readlines():
a = Article(url, language='pt')
html = requests.get(url).text
text = fulltext(html)
download = a.download()
parse = a.parse()
nlp = a.nlp()
title = a.title
publish_date = a.publish_date
authors = a.authors
keywords = a.keywords
summary = a.summary
url_file.close()
For work, I was asked to create a spreadsheet of the names and addresses of all allopathic medical schools in the United States. Being new to python, I thought that this would be the perfect situation to try web scraping. While I eventually wrote a program that returned the data I needed, I know that there is a better way to do it as there were some extraneous characters (eg: ", ], [) that I had to go into excel and manually remove. I would just like to know if there was a better way I could have written this code so I can get what I needed, minus the extraneous characters.
Edit: I have also attached an image of the csv file that was created to show the extraneous characters that I'm speaking about.
from bs4 import BeautifulSoup
import requests
import csv
link = "https://members.aamc.org/eweb/DynamicPage.aspx?site=AAMC&webcode=AAMCOrgSearchResult&orgtype=Medical%20School" # noqa
# link to the site we want to scrape from
page_response = requests.get(link)
# fetching the content using the requests library
soup = BeautifulSoup(page_response.text, "html.parser")
# Calling BeautifulSoup in order to parse our document
data = []
# Empty list for the first scrape. We only get one column with many rows.
# We still have the line break tags here </br>
for tr in soup.find_all('tr', {'valign': 'top'}):
values = [td.get_text('</b>', strip=True) for td in tr.find_all('td')]
data.append(values)
data2 = []
# New list that we'll use to have name on index i, address on index i+1
for i in data:
test = list(str(i).split('</b>'))
# Using the line breaks to our advantage.
name = test[0].strip("['")
'''Here we are saying that the name of the school is the first element
before the first line break'''
addy = test[1:]
# The address is what comes after this first line break
data2.append(name)
data2.append(addy)
# Append the name of the school and address to our new list.
school_name = data2[::2]
# Making a new list that consists of the school name
school_address = data2[1::2]
# Another list that consists of the school's address.
with open("Medschooltest.csv", 'w', encoding='utf-8') as toWrite:
writer = csv.writer(toWrite)
writer.writerows(zip(school_name, school_address))
'''Zip the two together making a 2 column table with the schools name and
it's address'''
print("CSV Completed!")
Created CSV file
It seems applying conditional statements along with string manipulation can do the trick. I think the following script will lead you real close to what you want.
from bs4 import BeautifulSoup
import requests
import csv
link = "https://members.aamc.org/eweb/DynamicPage.aspx?site=AAMC&webcode=AAMCOrgSearchResult&orgtype=Medical%20School" # noqa
res = requests.get(link)
soup = BeautifulSoup(res.text, "html.parser")
with open("membersInfo.csv","w",newline="") as infile:
writer = csv.writer(infile)
writer.writerow(["Name","Address"])
for tr in soup.find_all('table', class_='bodyTXT'):
items = ', '.join([item.string for item in tr.select_one('td') if item.string!="\n" and item.string!=None])
name = items.split(",")[0].strip()
address = items.split(name)[1].strip(",")
writer.writerow([name,address])
If you have knowledge of SQL AND the data is in such a structured manner, it would be the best solution to extract it to a database.
I have a question regarding appending to text file. I have written a script and what this script does is that it will read the URL in JSON format and extract the list of titles and write into the file "WordsInCategory.text".
As this code will be used in a loop thus I used f1 = open('WordsInCategory.text', 'a').
But I encountered a problem, that is it will add in already existing title into the file.
I am having trouble coming out with a solution to solve this problem and using 'w' will overwrite what it is written.
My code is as follows:
import urllib2
import json
url1 ='https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtype=page&cmtitle=Category:Geography&cmlimit=100'
json_obj = urllib2.urlopen(url1)
data1 = json.load(json_obj)
f1 = open('WordsInCategory.text', 'a')
for item in data1['query']:
for i in data1['query']['categorymembers']:
f1.write((i['title']).encode('utf8')+"\n")
Please advice on how I should modify my code.
Thank you.
I would suggest saving every title in an array, before writing to a file (and hence writing only once to the given file). You can modify your code this way :
import urllib2
import json
data = []
f1 = open('WordsInCategory.text', 'w')
url1 ='https://en.wikipedia.org/w/api.php?\
action=query&format=json&list=categorymembers\
&cmtype=page&cmtitle=Category:Geography&cmlimit=100'
json_obj = urllib2.urlopen(url1)
data1 = json.load(json_obj)
for item in data1['query']:
for i in data1['query']['categorymembers']:
data.append(i['title'].encode('utf8')+"\n")
# Do additional requests, and append the new titles to the data array
f1.write(''.join(set(data)))
f1.close()
set allows me to delete any duplicate entry.
If keeping the titles in memory is a problem, you can check if the title already exists before writing it to the file, but it may be awfully time consuming :
import urllib2
import json
data = []
url1 ='https://en.wikipedia.org/w/api.php?\
action=query&format=json&list=categorymembers\
&cmtype=page&cmtitle=Category:Geography&cmlimit=100'
json_obj = urllib2.urlopen(url1)
data1 = json.load(json_obj)
for item in data1['query']:
for i in data1['query']['categorymembers']:
title = (i['title'].encode('utf8')+"\n")
with open('WordsInCategory.text', 'r') as title_check:
if title not in title_check:
data.append(title)
with open('WordsInCategory.text', 'a') as f1:
f1.write(''.join(set(data)))
# Handle additional requests
Hope it'll be helpful.
You can track the titles you added.
titles = []
and then add each title to the list when writing
if title not in titles:
# write to file
titles += title