How to use requests and read URLs from a txt file - Python

I'm writing a piece of code where I have a list of URLs in a file, and I'm using requests to go through the file, send a GET request for each URL, and print the status code, but from what I have written I am not getting any output.
import requests
with open('demofile.txt','r') as http:
for req in http:
page=requests.get(req)
print page.status_code

I see two problems: one is that you forgot to indent the lines inside the for loop, and the second is that you didn't remove the trailing newline \n from each line (assuming the URLs are on separate lines):
import requests

with open('deleteme','r') as urls:
    for url in urls.readlines():
        req = url.strip()
        print(req)
        page = requests.get(req)
        print(page.status_code)
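If some lines in the file are blank or some URLs are unreachable, requests.get will raise an exception and stop the loop. A minimal sketch with basic error handling, assuming the same one-URL-per-line file (the timeout value is arbitrary):

import requests

with open('demofile.txt', 'r') as urls:
    for line in urls:
        url = line.strip()
        if not url:  # skip blank lines
            continue
        try:
            page = requests.get(url, timeout=10)
            print(url, page.status_code)
        except requests.exceptions.RequestException as exc:
            # covers connection errors, invalid URLs, timeouts, ...
            print(url, 'failed:', exc)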

Related

Retrieve URLs to scrape from a text file in BeautifulSoup

I have the following script and I would like to retrieve the URLs from a text file rather than from a hard-coded list. I'm new to Python and keep getting stuck!
from bs4 import BeautifulSoup
import requests

urls = ['URL1',
        'URL2',
        'URL3']

for u in urls:
    response = requests.get(u)
    data = response.text
    soup = BeautifulSoup(data, 'lxml')
Could you please be a little more clear about what you want?
Here is a possible answer which might or might not be what you want:
from bs4 import BeautifulSoup
import requests

with open('yourfilename.txt', 'r') as url_file:
    for line in url_file:
        u = line.strip()
        response = requests.get(u)
        data = response.text
        soup = BeautifulSoup(data, 'lxml')
The file was opened with the open() function; the second argument is 'r' to specify we're opening it in read-only mode. The call to open() is wrapped in a with block so the file is automatically closed as soon as the block is exited.
The strip() method removes leading and trailing whitespace (spaces, tabs, newlines) from each line; for instance, ' https://stackoverflow.com '.strip() becomes 'https://stackoverflow.com'.
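If the goal is to keep something from each page rather than just build the soup object and discard it, one possible pattern is to collect results in a list as you loop; pulling out the page <title> here is only an example of what you might extract:

from bs4 import BeautifulSoup
import requests

results = []
with open('yourfilename.txt', 'r') as url_file:
    for line in url_file:
        u = line.strip()
        response = requests.get(u)
        soup = BeautifulSoup(response.text, 'lxml')
        # store the URL together with whatever was scraped from it
        results.append((u, soup.title.string if soup.title else None))

for u, title in results:
    print(u, title)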

Get the final URL of a page in Python

My question might be a bit weird.
I have some pages with different URLs that all end up on the same page. Can I get that main URL from the old URL in Python? For example:
1) https://www.verisk.com/insurance/products/iso-forms/
2) https://www.verisk.com/insurance/products/forms-library-on-isonet/
Both end up on the same page, which is:
https://www.verisk.com/insurance/products/iso-forms/
So, for each URL, can I find the final URL where it lands, using Python? (I have a list of 1k URLs.) And I want another list of where those URLs land, in the same order.
Here's one way of doing it, using the requests library.
import requests

def get_redirected_url(url):
    response = requests.get(url, stream=True)  # stream=True avoids downloading the response body
    return response.url
This is a very simplified example; in a real implementation you'll want to handle errors, probably do delayed retries, and possibly check what kind of redirection you're getting (permanent redirects only?).
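A rough sketch of what those extra checks might look like, assuming you only want to trust the result when every hop was a permanent (301) redirect; the helper name, the timeout, and the None return on failure are arbitrary choices:

import requests

def get_final_url(url):
    try:
        response = requests.get(url, stream=True, timeout=10)
        response.close()  # with stream=True the body was never downloaded; release the connection
    except requests.exceptions.RequestException:
        return None  # network error, invalid URL, timeout, ...
    # response.history holds the intermediate redirect responses;
    # 301 is a permanent redirect, 302/307 are temporary
    if response.history and not all(r.status_code == 301 for r in response.history):
        return None
    return response.url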
Simple approach with urllib.request:
from urllib.request import urlopen
resp = urlopen("http://sitey.com/redirect")
print(resp.url)
Might want to use threads if you're doing 1,000 URLs...
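For example, turning the list of 1,000 input URLs into a matching list of final URLs with a thread pool might look roughly like this; the helper function and the worker count are just placeholders:

from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

def final_url(u):
    # follow redirects and return the URL we actually landed on
    with urlopen(u) as resp:
        return resp.url

urls = ['https://www.verisk.com/insurance/products/iso-forms/',
        'https://www.verisk.com/insurance/products/forms-library-on-isonet/']

# pool.map preserves input order, so final_urls[i] corresponds to urls[i]
with ThreadPoolExecutor(max_workers=10) as pool:
    final_urls = list(pool.map(final_url, urls))

print(final_urls)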

Error in downloading file with BeautifulSoup

I am trying to download some files from a free dataset using BeautifulSoup.
I repeat the same process for two similar links in the web page.
This is the page address.
import requests
from bs4 import BeautifulSoup
first_url = "http://umcd.humanconnectomeproject.org/umcd/default/download/upload_data.region_xyz_centers_file.bcf53cd53a90f374.55434c415f43434e5f41504f455f4454495f41504f452d335f355f726567696f6e5f78797a5f63656e746572732e747874.txt"
second_url="http://umcd.humanconnectomeproject.org/umcd/default/download/upload_data.connectivity_matrix_file.bfcc4fb8da90e7eb.55434c415f43434e5f41504f455f4454495f41504f452d335f355f636f6e6e6563746d61742e747874.txt"
# labeled as Connectivity Matrix File in the webpage
def download_file(url, file_name):
    myfile = requests.get(url)
    open(file_name, 'wb').write(myfile.content)
download_file(first_url, "file1.txt")
download_file(second_url, "file2.txt")
output files:
file1.txt:
50.118248 53.451775 39.279296
51.417612 67.443649 41.009074
...
file2.txt:
<html><body><h1>Internal error</h1>Ticket issued: umcd/89.41.15.124.2020-04-30.01-59-18.c02951d4-2e85-4934-b2c1-28bce003d562</body><!-- this is junk text else IE does not display the page: xxxxxxxxxxxxxxxxxxxx... //--></html>
But I can download second_url from the Chrome browser properly (the file contains some numbers).
I tried to set a User-Agent header:
headers = {'User-Agent': "Chrome/6.0.472.63 Safari/534.3"}
r = requests.get(url, headers=headers)
but it did not work.
Edit
The site does not need a login to get the data. I opened the page in a private-mode browser and then downloaded the file at second_url.
Pasting second_url directly into the address bar gave an error:
Internal error
Ticket issued: umcd/89.41.15.124.2020-04-30.03-18-34.49c8cb58-7202-4f05-9706-3309b581af76
Do you have any idea?
Thank you in advance for any guidance.
This isn't a Python issue. The second URL gives the same error both in curl and in my browser.
By the way, it's odd to me that the second URL would be shorter. Are you sure you copied it right?
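One quick way to confirm that the failure is on the server side rather than in the Python code is to inspect the status code and the start of the body that comes back (a small diagnostic sketch, separate from the download script):

import requests

second_url = "http://umcd.humanconnectomeproject.org/umcd/default/download/upload_data.connectivity_matrix_file.bfcc4fb8da90e7eb.55434c415f43434e5f41504f455f4454495f41504f452d335f355f636f6e6e6563746d61742e747874.txt"

r = requests.get(second_url)
print(r.status_code)   # a 5xx code here would point to a server-side failure
print(r.text[:200])    # shows the "Internal error" HTML instead of the expected numbers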

How to loop through .txt file links on website, scrape, and store in one malleable csv/excel file

I want to be able to scrape the data from a particular website (https://physionet.org/challenge/2012/set-a/) and subdirectories like it, while also taking each text file and adding it to one giant CSV or Excel file so that I can see all the data in one place.
I have deployed the following code, similar to this article, but it basically just downloads all the text files on the page and stores them in my working directory. And, honestly, it takes too long to run.
import requests
import urllib.request
import time
from bs4 import BeautifulSoup
url = 'https://physionet.org/challenge/2012/set-a/'
response = requests.get(url)
response # 200 indicates that it works...
soup = BeautifulSoup(response.text, "html.parser")
for i in range(5, len(soup.findAll('a'))+1):  # 'a' tags are for links
    one_a_tag = soup.findAll('a')[i]
    link = one_a_tag['href']
    download_url = 'https://physionet.org/challenge/2012/set-a/' + link
    urllib.request.urlretrieve(download_url, './' + link[link.find('/132539.txt')+1:])
    time.sleep(1)  # pause the code for a sec
The actual result is just a bunch of text files crowding my working directory, but what I'd like is to put them all into one large CSV file instead.
If you want to save them, but have to do it a bit at a time (maybe you don't have enough RAM to hold everything in at once), then I would just append the files to a master file one by one.
import requests
from bs4 import BeautifulSoup
from requests.packages.urllib3.exceptions import InsecureRequestWarning
requests.packages.urllib3.disable_warnings(InsecureRequestWarning)
output_file = 'outputfile.txt'
url = 'https://physionet.org/challenge/2012/set-a/'
# Download and find all the links. Check the last 4 characters to verify it's one
# of the files we are looking for
response = requests.get(url, verify=False)
soup = BeautifulSoup(response.text, "html.parser")
links = [a['href'] for a in soup.find_all('a') if a['href'][-4:] == '.txt']
# Clear the current file
with open(output_file, 'w'):
    pass

# Iterate through all the links
for href in links:
    response = requests.get("{}{}".format(url, href), verify=False)
    if response:
        # Open up the output_file in append mode so we can just write to the one file
        with open(output_file, 'a') as f:
            f.write(response.text)
            print(len(response.text.split('\n')))
The one downside to this is that you would keep the header line from each text file. But you can change the f.write() call to the following to drop it:
f.write("\n".join(response.text.split('\n')[1:]))
If you do have the available RAM, you could read in all the files using a list comprehension then use pandas.concat() to put them in one giant dataframe. Then use df.to_csv() to export it to a file.
import pandas as pd

df = pd.concat([pd.read_csv("{}{}".format(url, href)) for href in links])
df.to_csv(output_file)
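One possible refinement, assuming each .txt file is plain comma-separated data and reusing url, links, and output_file from above: tag every chunk with the file it came from so rows stay traceable after concatenation (the source_file column name is just an example):

import pandas as pd

frames = []
for href in links:
    frame = pd.read_csv("{}{}".format(url, href))
    frame["source_file"] = href  # record which file each row came from
    frames.append(frame)

df = pd.concat(frames, ignore_index=True)
df.to_csv(output_file, index=False)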

Python Crawler is ignoring links on page

So I wrote a crawler for my friend that goes through a large list of web pages that are search results, pulls all the links off each page, checks whether they're in the output file, and adds them if they're not there. It took a lot of debugging, but it works great! Unfortunately, the little bugger is really picky about which anchor tags it deems important enough to add.
Here's the code:
#!C:\Python27\Python.exe
from bs4 import BeautifulSoup
from urlparse import urljoin #urljoin is a function that's included in urlparse
import urllib2
import requests #not necessary but keeping here in case additions to code in future
urls_filename = "myurls.txt" #this is the input text file,list of urls or objects to scan
output_filename = "output.txt" #this is the output file that you will export to Excel
keyword = "skin" #optional keyword, not used for this script. Ignore
with open(urls_filename, "r") as f:
    url_list = f.read() #This command opens the input text file and reads the information inside it

with open(output_filename, "w") as f:
    for url in url_list.split("\n"): #This command splits the text file into separate lines so it's easier to scan
        hdr = {'User-Agent': 'Mozilla/5.0'} #This (attempts) to tell the webpage that the program is a Firefox browser
        try:
            response = urllib2.urlopen(url) #tells program to open the url from the text file
        except:
            print "Could not access", url
            continue
        page = response.read() #this assigns a variable to the open page. like algebra, X=page opened
        soup = BeautifulSoup(page) #we are feeding the variable to BeautifulSoup so it can analyze it
        urls_all = soup('a') #beautiful soup is analyzing all the 'anchored' links in the page
        for link in urls_all:
            if('href' in dict(link.attrs)):
                url = urljoin(url, link['href']) #this combines the relative link e.g. "/support/contactus.html" and adds to domain
                if url.find("'")!=-1: continue #explicit statement that the value is not void. if it's NOT void, continue
                url=url.split('#')[0]
                if (url[0:4] == 'http' and url not in output_filename): #this checks if the item is a webpage and if it's already in the list
                    f.write(url + "\n") #if it's not in the list, it writes it to the output_filename
It works great except for the following link:
https://research.bidmc.harvard.edu/TVO/tvotech.asp
This page has a number of links like "tvotech.asp?Submit=List&ID=796" and it's straight up ignoring them. The only anchor that goes into my output file is the main page itself. It's bizarre because, looking at the source code, their anchors are pretty standard:
they have 'a' and 'href', and I see no reason bs4 would just pass over them and only include the main link. Please help. I've tried removing http from line 30 or changing it to https, and that just removes all the results; not even the main page comes into the output.
That's because one of the links there has a mailto: in its href. It then gets assigned to the url variable and breaks the rest of the links as well, because they no longer pass the url[0:4] == 'http' condition. It looks like this:
mailto:research#bidmc.harvard.edu?subject=Question about TVO Available Technology Abstracts
You should either filter it out or stop reusing the same url variable inside the loop; note the change to url1:
for link in urls_all:
    if('href' in dict(link.attrs)):
        url1 = urljoin(url, link['href']) #this combines the relative link e.g. "/support/contactus.html" and adds to domain
        if url1.find("'")!=-1: continue #explicit statement that the value is not void. if it's NOT void, continue
        url1=url1.split('#')[0]
        if (url1[0:4] == 'http' and url1 not in output_filename): #this checks if the item is a webpage and if it's already in the list
            f.write(url1 + "\n") #if it's not in the list, it writes it to the output_filename
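Alternatively, a sketch of the filter-it-out option, reusing urls_all, url, and f from the script above. Note that the original url not in output_filename test only compares against the file name string, not against what has already been written, so this version keeps a set of seen URLs instead:

seen = set()  # URLs already written to the output file
for link in urls_all:
    href = link.get('href')
    if not href or href.startswith('mailto:'):
        continue  # skip mail links and anchors without an href
    full_url = urljoin(url, href).split('#')[0]
    if full_url[0:4] == 'http' and full_url not in seen:
        seen.add(full_url)
        f.write(full_url + "\n")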
