I have the following script and I would like to retrieve the URLs from a text file rather than an array. I'm new to Python and keep getting stuck!
from bs4 import BeautifulSoup
import requests
urls = ['URL1',
        'URL2',
        'URL3']

for u in urls:
    response = requests.get(u)
    data = response.text
    soup = BeautifulSoup(data, 'lxml')
Could you please be a little more clear about what you want?
Here is a possible answer which might or might not be what you want:
from bs4 import BeautifulSoup
import requests
with open('yourfilename.txt', 'r') as url_file:
    for line in url_file:
        u = line.strip()
        response = requests.get(u)
        data = response.text
        soup = BeautifulSoup(data, 'lxml')
The file was opened with the open() function; the second argument is 'r' to specify we're opening it in read-only mode. The call to open() is encapsulated in a with block so the file is automatically closed as soon as you no longer need it open.
The strip() function removes leading and trailing whitespace (spaces, tabs, newlines) from each line; for instance, ' https://stackoverflow.com '.strip() becomes 'https://stackoverflow.com'.
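If you prefer to build the list of URLs up front (for example, to skip blank lines), here is a minimal variation of the same idea; the file name urls.txt is just a placeholder:

from bs4 import BeautifulSoup
import requests

# Read every non-empty line of the file into a list of URLs.
with open('urls.txt', 'r') as url_file:
    urls = [line.strip() for line in url_file if line.strip()]

for u in urls:
    response = requests.get(u)
    soup = BeautifulSoup(response.text, 'lxml')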
I made a web scraper to get the informative text of a Wikipedia page. I get the text I want but I want to cut off a big part of the bottom text. I already tried some other solutions but with those, I don't get the headers and white-spaces I need.
import requests
from bs4 import BeautifulSoup
import re
website = "https://nl.wikipedia.org/wiki/Kat_(dier)"
request = requests.get(website)
soup = BeautifulSoup(request.text, "html.parser")
text = list()
text.extend(soup.findAll('mw-content-text'))
text_content = soup.text
text_content = re.sub(r'==.*?==+', '', text_content)
# text_content = text.replace('\n', '')
print(text_content)
Here, soup.text is all the text of the wikipedia page with the class='mw-content-text' printed as a string. This prints the overall text I need but I need to cut off the string where it starts showing the text of the sources. I already tried the replace method but it didn't do anything.
Given this page, I want to cut off what's under the red line in the big string of text I have scraped.
I tried something like this, which didn't work:
for content in soup('span', {'class': 'mw-content-text'}):
    print(content.text)
    text = content.findAll('p', 'a')
    for t in text:
        print(t.text)
I also tried this:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
import requests
website = urlopen("https://nl.wikipedia.org/wiki/Kat_(dier)").read()
soup = BeautifulSoup(website,'lxml')
text = ''
for content in soup.find_all('p'):
    text += content.text
text = re.sub(r'\[.*?\]+', '', text)
text = text.replace('\n', '')
# print(text)
but these approaches just gave me an unreadable mess of text. I still want the whitespaces and headers that my base code gives me.
It is still a bit abstract, but you can reach your goal by iterating over all children and breaking if a tag with class appendix appears:
for c in soup.select_one('#mw-content-text > div').find_all(recursive=False):
    if c.get('class') and 'appendix' in c.get('class'):
        break
    print(c.get_text(strip=True))
Example
import requests
from bs4 import BeautifulSoup
website = "https://nl.wikipedia.org/wiki/Kat_(dier)"
request = requests.get(website)
soup = BeautifulSoup(request.text)
for c in soup.select_one('#mw-content-text > div').find_all(recursive=False):
    if c.get('class') and 'appendix' in c.get('class'):
        break
    print(c.get_text(strip=True))
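If you also want to keep the headings and blank lines from your original output while still cutting everything off from the sources onward, the same loop can collect the text instead of printing it stripped. A rough sketch (the selector and the appendix class are taken from the snippet above, not re-verified against the live page):

import requests
from bs4 import BeautifulSoup

website = "https://nl.wikipedia.org/wiki/Kat_(dier)"
soup = BeautifulSoup(requests.get(website).text, "html.parser")

parts = []
for c in soup.select_one('#mw-content-text > div').find_all(recursive=False):
    # Stop as soon as the appendix (sources/references) block starts.
    if c.get('class') and 'appendix' in c.get('class'):
        break
    parts.append(c.get_text())  # keep the element's own line breaks

text_content = "\n".join(parts)
print(text_content)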
There is likely a more efficient solution but here is a list comprehension that solves your issue:
# the rest of your code
references = [line for line in text_content.split('\n') if line.startswith("↑")]
Here's an alternative version that might be easier to understand:
# the rest of your code
# Turn text_content into a list of lines
text_content = text_content.split('\n')
references = []
# Iterate through each line and only save the values that start
# with the symbol used for each reference, on wikipedia: "↑"
# ( or "^" for english wikipedia pages )
for line in text_content:
    if line.startswith("↑"):
        references.append(line)
Both scripts will do the same thing.
I know that this is a repeated question, however from all the answers on the web I could not find the solution, as all of them throw errors.
I am simply trying to scrape headers from the web and save them to a txt file.
The scraping code works well; however, it saves only the last string, discarding all the headers before it.
I have tried looping, putting the writing code before the scraping, appending to a list, etc., and different methods of scraping, but all have the same issue.
Please help.
Here is my code:
def nytscrap():
    from bs4 import BeautifulSoup
    import requests
    url = "http://www.nytimes.com"
    page = BeautifulSoup(requests.get(url).text, "lxml")
    for headlines in page.find_all("h2"):
        print(headlines.text.strip())
    filename = "NYTHeads.txt"
    with open(filename, 'w') as file_object:
        file_object.write(str(headlines.text.strip()))
Every time your for loop runs, it overwrites the headlines variable, so when you get to writing to the file, the headlines variable only stores the last headline. An easy solution to this is to bring the for loop inside your with statement, like so:
with open(filename, 'w') as file_object:
    for headlines in page.find_all("h2"):
        print(headlines.text.strip())
        file_object.write(headlines.text.strip() + "\n")  # write a newline after each headline
Here is the full working code, corrected as per the advice.
from bs4 import BeautifulSoup
import requests

def nytscrap():
    from bs4 import BeautifulSoup
    import requests
    url = "http://www.nytimes.com"
    page = BeautifulSoup(requests.get(url).text, "lxml")
    filename = "NYTHeads.txt"
    with open(filename, 'w') as file_object:
        for headlines in page.find_all("h2"):
            print(headlines.text.strip())
            file_object.write(headlines.text.strip() + "\n")
This code will throw an error in the Jupyter notebook when opening the file from inside it; however, when the file is opened outside Jupyter, the headers are saved...
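An equivalent variation, if you prefer to collect the headlines into a list first and write the file in a single call (a sketch of the same idea, not a drop-in requirement):

from bs4 import BeautifulSoup
import requests

def nytscrap():
    url = "http://www.nytimes.com"
    page = BeautifulSoup(requests.get(url).text, "lxml")
    # Collect all headline strings first, then write them out in one go.
    headlines = [h2.text.strip() for h2 in page.find_all("h2")]
    with open("NYTHeads.txt", "w") as file_object:
        file_object.write("\n".join(headlines))
    return headlines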
I'm currently programming an app to gather information from websites for me using requests and BeautifulSoup. Now I am trying to place that information in a text file, which I managed to do, but only one paragraph was inserted in the text file.
Right now I am using the basic file commands to do this and it hasn't worked. I have searched online for other ways and none of the methods worked in my code.
import requests
from bs4 import BeautifulSoup
r = requests.get("https://en.wikipedia.org/wiki/Somalia")
soup = BeautifulSoup(r.text)
for p in soup.find_all('p'):
    print(p.text)
file = open("Research.txt", "w")
file.write(p.text)
file.close()
Thank you in advance!
The code you posted only prints the last paragraph since it's the last item the loop iterated over. The code below writes all the paragraphs:
with open("Research.txt", "w") as f:
for p in soup.find_all('p'):
f.write(p.text)
Either it is a formatting error in your question, or your f.write() call is outside your for loop. The following code should work:
import requests
from bs4 import BeautifulSoup
r = requests.get("https://en.wikipedia.org/wiki/Somalia")
soup = BeautifulSoup(r.text)
with open("Research.txt", 'a') as f: # 'a' code stands for 'append'
for p in soup.find_all('p'):
f.write(f"{p.text}\n")
NB: If you do not understand the with open statement, take a look at: with-statement-in-python
NB2: the f-string format (i.e. f"{p.text}\n") will only work with Python 3.6+. If you have an earlier version, replace it with "{}\n".format(p.text)
As others have pointed out, your write operation is not in the loop. A simple solution would be:
import requests
from bs4 import BeautifulSoup
r = requests.get("https://en.wikipedia.org/wiki/Somalia")
soup = BeautifulSoup(r.text)
mytext = ""
for p in soup.find_all('p'):
    print(p.text)
    mytext += p.text
file = open("Research.txt", "w")
file.write(mytext)
file.close()
This mostly works like your current code, but it constructs the string in the loop and then writes it to the file.
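A small variation on the same idea, using str.join() so the string is built in one pass rather than by repeated concatenation (just a sketch; the behaviour should be the same):

import requests
from bs4 import BeautifulSoup

r = requests.get("https://en.wikipedia.org/wiki/Somalia")
soup = BeautifulSoup(r.text, "html.parser")

# Build the whole text once, then write it to the file.
mytext = "\n".join(p.text for p in soup.find_all('p'))

with open("Research.txt", "w") as file:
    file.write(mytext)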
I am trying to fetch data from an API, but the response I am getting is plain text. I want to read the text line by line.
This is the url variable: http://www.amfiindia.com/spages/NAVAll.txt?t=23052017073640
First snippet:
from pymongo import MongoClient
import requests
from bs4 import BeautifulSoup as bs
url = "https://www.amfiindia.com/spages/NAVAll.txt?t=23052017073640"
request = requests.get(url)
soup = bs(request.text,"lxml")
for line in soup:
    print line
    break
Result: It prints out the entire text
Second snippet:
request = requests.get(url)
for line in request.text():
    print line
    break
Result: It prints out 1 character
request = requests.get(url)
requestText = request.text()
allMf = requestText.splitlines()
Result: Exception: 'unicode' object is not callable
I have tried few more cases but not able to read text line by line.
request.text is a property, not a method: request.text returns a unicode string, while request.text() throws the error 'unicode' object is not callable.
for line in request.text.splitlines():
    print line
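If the response is large, requests can also stream it and yield it line by line via iter_lines(); a short sketch with the same URL (iter_lines() returns bytes unless decode_unicode=True is passed):

import requests

url = "https://www.amfiindia.com/spages/NAVAll.txt?t=23052017073640"
response = requests.get(url, stream=True)

# iter_lines() yields the body one line at a time, without splitting it yourself.
for line in response.iter_lines(decode_unicode=True):
    print(line)
    break  # just the first line, to show this works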
import requests
from bs4 import BeautifulSoup as bs
url = "https://www.amfiindia.com/spages/NAVAll.txt?t=23052017073640"
request = requests.get(url)
soup = bs(request.text,"lxml")
# soup.text is to get the returned text
# split function, splits the entire text into different lines (using '\n') and stores in a list. You can define your own splitter.
# each line is stored as an element in the allLines list.
allLines = soup.text.split('\n')
for line in allLines:  # you iterate through the list and print the single lines
    print(line)
    break  # to just print the first line, to show this works
Try this:
from pymongo import MongoClient
import requests
from bs4 import BeautifulSoup as bs
url = "https://www.amfiindia.com/spages/NAVAll.txt?t=23052017073640"
request = requests.get(url)
soup = bs(request.text,"lxml")
for line in soup:
    print line.text
    break
So I wrote a crawler for my friend that will go through a large list of web pages that are search results, pull all the links off the page, check if they're in the output file and add if they're not there. It took a lot of debugging but it works great! Unfortunately, the little bugger is really picky about which anchored tags it deems important enough to add.
Here's the code:
#!C:\Python27\Python.exe
from bs4 import BeautifulSoup
from urlparse import urljoin  # urljoin is a function that's included in urlparse
import urllib2
import requests  # not necessary but keeping here in case of additions to code in future

urls_filename = "myurls.txt"    # this is the input text file, a list of urls or objects to scan
output_filename = "output.txt"  # this is the output file that you will export to Excel
keyword = "skin"                # optional keyword, not used for this script. Ignore

with open(urls_filename, "r") as f:
    url_list = f.read()  # this command opens the input text file and reads the information inside it

with open(output_filename, "w") as f:
    for url in url_list.split("\n"):  # this command splits the text file into separate lines so it's easier to scan
        hdr = {'User-Agent': 'Mozilla/5.0'}  # this (attempts) to tell the webpage that the program is a Firefox browser
        try:
            response = urllib2.urlopen(url)  # tells program to open the url from the text file
        except:
            print "Could not access", url
            continue
        page = response.read()  # this assigns a variable to the open page. like algebra, X = page opened
        soup = BeautifulSoup(page)  # we are feeding the variable to BeautifulSoup so it can analyze it
        urls_all = soup('a')  # BeautifulSoup is analyzing all the 'anchor' links in the page
        for link in urls_all:
            if 'href' in dict(link.attrs):
                url = urljoin(url, link['href'])  # this combines the relative link e.g. "/support/contactus.html" and adds the domain
                if url.find("'") != -1:
                    continue  # skip urls that contain a quote character
                url = url.split('#')[0]
                if url[0:4] == 'http' and url not in output_filename:  # this checks if the item is a webpage and if it's already in the list
                    f.write(url + "\n")  # if it's not in the list, it writes it to the output_filename
It works great except for the following link:
https://research.bidmc.harvard.edu/TVO/tvotech.asp
This link has a number of links like "tvotech.asp?Submit=List&ID=796" and it's straight up ignoring them. The only anchor that goes into my output file is the main page itself. It's bizarre because, looking at the source code, their anchors are pretty standard, like-
They have 'a' and 'href', and I see no reason bs4 would just pass over them and only include the main link. Please help. I've tried removing http from line 30 or changing it to https, and that just removes all the results; not even the main page comes into the output.
That's because one of the links there has a mailto in its href; it then gets assigned to the url variable and breaks the rest of the links as well, because they don't pass the url[0:4] == 'http' condition. It looks like this:
mailto:research#bidmc.harvard.edu?subject=Question about TVO Available Technology Abstracts
You should either filter it out or not reuse the same variable url inside the loop; note the change to url1:
for link in urls_all:
    if 'href' in dict(link.attrs):
        url1 = urljoin(url, link['href'])  # this combines the relative link e.g. "/support/contactus.html" and adds the domain
        if url1.find("'") != -1:
            continue
        url1 = url1.split('#')[0]
        if url1[0:4] == 'http' and url1 not in output_filename:  # this checks if the item is a webpage and if it's already in the list
            f.write(url1 + "\n")  # if it's not in the list, it writes it to the output_filename