I am trying to get the actual URLs behind shortened URLs by using the following code (I have replaced the shortened URLs with others, as Stack Overflow doesn't allow them):
import requests

url_list = ['https://stackoverflow.com/questions/62242867/php-lumen-request-body-always-empty', 'https://twitter.com/i/web/status/1269102116364324865']

actual_list = []
for link in url_list:
    response = requests.get(link)
    actual_url = response.url
    actual_list = actual_url
print(actual_list)
At the end there is only the last URL left in actual_url, but I need each URL. Can someone tell me what I am doing wrong here?
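For context: requests follows redirects by default, so response.url already holds the final, expanded address. A minimal sketch for inspecting the redirect chain of a single link (using one of the placeholder URLs from the question):

import requests

response = requests.get('https://twitter.com/i/web/status/1269102116364324865')
# any intermediate redirect responses are kept in response.history (may be empty)
for hop in response.history:
    print(hop.status_code, hop.url)
print('final URL:', response.url)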
You need to append the URLs to the list.
actual_list = []
for link in url_list:
    response = requests.get(link)
    actual_url = response.url
    actual_list.append(actual_url)
Please try this; you need to use append to add each item to the list:
import requests

url_list = ['https://stackoverflow.com/questions/62242867/php-lumen-request-body-always-empty', 'https://twitter.com/i/web/status/1269102116364324865']

actual_list = []
for link in url_list:
    response = requests.get(link)
    actual_url = response.url
    actual_list.append(actual_url)
print(actual_list)
You need to append the URL to the list:
actual_list = []
for link in url_list:
    response = requests.get(link)
    actual_url = response.url
    actual_list.append(actual_url)
Right now you are assigning the URL to the actual_list variable instead of appending it.
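The same fix as a one-liner, assuming url_list is defined as in the question:

actual_list = [requests.get(link).url for link in url_list]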
Currently I am putting the full URL in urlList; I want only the string after page_no in urlList, and the rest of the program should carry on as it is.
https://bidplus.gem.gov.in/bidlists?bidlists&page_no=**AMCR24yMNFkfoXF3wKPmGMy_wV8TJPAlxm6oWiTHGOI**
urlList = ["https://bidplus.gem.gov.in/bidlists?bidlists&page_no=AMCR24yMNFkfoXF3wKPmGMy_wV8TJPAlxm6oWiTHGOI",
"https://bidplus.gem.gov.in/bidlists?bidlists&page_no=Hgw0LYpSZdLXow1Wq84uKar1nxXbFhClXQDuAAiPDxU",
"https://bidplus.gem.gov.in/bidlists?bidlists&page_no=rO5Erb90Q_P1S0fL5O6FEShlv20RBXmkHFusZogvUoo",
"https://bidplus.gem.gov.in/bidlists?bidlists&page_no=jiE0kS8e-ghmlmjDMPUJm1OBCRotqJ6n7srXZN99LZc",
"https://bidplus.gem.gov.in/bidlists?bidlists&page_no=MY89EG2RtzpSMlT1wjE61Cv31nAyetQ49kmXfw2AfMo",
]
for url in urlList:
    print('Hold on creating URL to fetch data...')
    url = 'https://bidplus.gem.gov.in/bidlists?bidlists&page_no=' + str(page_no)
    print('URL created: ' + url)
    scraped_data = requests.get(url, verify=False)
    soup_data = bs(scraped_data.text, 'lxml')
    extracted_data = soup_data.find('div', {'id': 'pagi_content'})
Use this line after your urlList variable:
urlList = [x.split('=')[-1] for x in urlList]
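Putting it together with the loop from the question (a sketch; page_no now comes straight from the stripped list, and requests/bs are imported as in your code):

urlList = [x.split('=')[-1] for x in urlList]  # keep only the token after page_no=

for page_no in urlList:
    url = 'https://bidplus.gem.gov.in/bidlists?bidlists&page_no=' + str(page_no)
    print('URL created: ' + url)
    scraped_data = requests.get(url, verify=False)
    soup_data = bs(scraped_data.text, 'lxml')
    extracted_data = soup_data.find('div', {'id': 'pagi_content'})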
You can split the URLs on = and take the part you need:
for url in urls:
    print(url.split("=")[-1])
outputs:
AMCR24yMNFkfoXF3wKPmGMy_wV8TJPAlxm6oWiTHGOI
Hgw0LYpSZdLXow1Wq84uKar1nxXbFhClXQDuAAiPDxU
rO5Erb90Q_P1S0fL5O6FEShlv20RBXmkHFusZogvUoo
jiE0kS8e-ghmlmjDMPUJm1OBCRotqJ6n7srXZN99LZc
MY89EG2RtzpSMlT1wjE61Cv31nAyetQ49kmXfw2AfMo
If you want the page numbers in their own list, this is how:
pagenumbers = [i.split("=")[-1] for i in urls]
>>> pagenumbers
['AMCR24yMNFkfoXF3wKPmGMy_wV8TJPAlxm6oWiTHGOI', 'Hgw0LYpSZdLXow1Wq84uKar1nxXbFhClXQDuAAiPDxU', 'rO5Erb90Q_P1S0fL5O6FEShlv20RBXmkHFusZogvUoo', 'jiE0kS8e-ghmlmjDMPUJm1OBCRotqJ6n7srXZN99LZc', 'MY89EG2RtzpSMlT1wjE61Cv31nAyetQ49kmXfw2AfMo']
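If you would rather parse the URL than split on =, the standard library can pull out the page_no query parameter; a sketch, assuming the URLs keep this exact query format:

from urllib.parse import urlparse, parse_qs

pagenumbers = []
for u in urls:
    query = parse_qs(urlparse(u).query)  # e.g. {'page_no': ['AMCR24yMNFkfoXF3wKPmGMy_wV8TJPAlxm6oWiTHGOI']}
    pagenumbers.append(query['page_no'][0])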
There is no need to split the URLs.
In your for loop you can just use the url directly, since you are iterating over the full URLs.
for url in urlList:
    print('Hold on fetching data...')
    scraped_data = requests.get(url, verify=False)
    soup_data = bs(scraped_data.text, 'lxml')
I need to get all the links from the following list of websites (a dataframe column converted to a list):
urls = df['URLs'].tolist()
saving the results for each URL in a new column (Links) in a copy of the original dataset.
To get info from one of these websites, I am using:
import httplib2
from bs4 import BeautifulSoup, SoupStrainer

http = httplib2.Http()
status, response = http.request('https://www.farmaciairisdiana.it/blog/')  # for example

for link in BeautifulSoup(response, 'html.parser', parse_only=SoupStrainer('a')):
    if link.has_attr('href'):
        print(link['href'])
This code works pretty well (I tested a few cases).
How can I iterate for each of those urls, saving the results into a new column?
You can iterate over the urls list and save each link to a result list. Then you can create a new dataframe or add this list as a new column.
For example:
http = httplib2.Http()

all_links = []
for url in urls:  # `urls` is your list from the question
    status, response = http.request(url)
    for link in BeautifulSoup(response, 'html.parser', parse_only=SoupStrainer('a')):
        if link.has_attr('href'):
            all_links.append(link['href'])

new_df = pd.DataFrame({'Links': all_links})
print(new_df)
# or
# df['Links'] = all_links
EDIT: To create a new dataframe that keeps each URL together with its links, you can use this example:
http = httplib2.Http()

all_links = []
for url in urls:  # `urls` is your list from the question
    status, response = http.request(url)
    l = []
    all_links.append({'URLs': url, 'Links': l})
    for link in BeautifulSoup(response, 'html.parser', parse_only=SoupStrainer('a')):
        if link.has_attr('href'):
            l.append(link['href'])

new_df = pd.DataFrame(all_links)
print(new_df)
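If you later want one row per link instead of a list per URL, pandas can flatten the Links column; a short sketch, assuming pandas 0.25 or newer for DataFrame.explode:

flat_df = new_df.explode('Links').reset_index(drop=True)  # one row per (URL, link) pair
print(flat_df)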
I have written code to extract all URLs from a given site, but the problem is that some URLs are repeated, and I want a list of unique URLs.
from bs4 import BeautifulSoup
from termcolor import colored
import re, os
import requests

url = 'http://example.com'
ext = 'html'
count = 0
countfiles = 0
files = []

def ulist(x):
    return list(dict.fromkeys(x))

def listFD(filename, ext=''):
    print(filename)
    print(url)
    if filename == url:
        page = requests.get(url).text
    else:
        page = requests.get(url + filename).text
    soup = BeautifulSoup(page, 'html.parser')
    return ['/' + node.get('href') for node in soup.find_all('a') if node.get('href').endswith(ext)]

for file in ulist(listFD(url, ext)):
    for unfile in ulist(listFD(file, ext)):
        print(unfile)
You can do the following:
urls = list(set(urls))
Just wrap your list in Python's built-in set:
urls = ['www.google.com', 'www.google.com', 'www.facebook.com']
unique_urls = list(set(urls))
print(unique_urls) # prints >> ['www.facebook.com', 'www.google.com']
Once you have your list of URLs, you can use a set to get the unique elements, wrapped in a list comprehension:
unique_urls = [url for url in set(urls)]
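Note that set() does not preserve the original order. If order matters, dict.fromkeys (which the ulist helper in the question already uses) keeps the first occurrence of each URL in place on Python 3.7+:

urls = ['www.google.com', 'www.google.com', 'www.facebook.com']
unique_urls = list(dict.fromkeys(urls))
print(unique_urls)  # ['www.google.com', 'www.facebook.com']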
I have some code which creates a list of the links on a website. At the end of the function I return the list, but when I reference it later on, the list is blank.
Here is my code:
PagesList = []

Startup = input('What is the website of the startup?')

def GetCleanLinks(Startup):
    headers = {'User-Agent': 'Mozilla/5.0'}
    response = requests.get(Startup, headers=headers)
    soup = BeautifulSoup(response.text, "html.parser")
    PagesList = [a['href'] for a in soup.find_all('a', href=True) if a.text.strip()]
    for i, link in enumerate(PagesList):
        if link[0] in ['/', '#']:
            PagesList[i] = Startup + link
    print(PagesList)
    return(PagesList)

GetCleanLinks(Startup)
print(PagesList)
If I put the print inside the function (before the return statement) it prints a list of the links. However, the print call outside the function prints a blank list. What am I missing?
Many thanks,
Rob
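For reference, a minimal sketch of one fix: the PagesList built inside the function is a new local variable, so the module-level PagesList is never updated; the returned list has to be captured by the caller:

PagesList = GetCleanLinks(Startup)  # store the returned list
print(PagesList)                    # now prints the scraped links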
I want to web scrape a list of urls from a web site and then open them one by one.
I can get the list of all URLs, but when I try to turn them into a list things go wrong.
When I print the list, instead of getting [url1, url2, ...] once, I get something like this in the console, each list on its own line:
[url1, url2, url3]
[url1, url2, url3, url4]
[url1, url2, url3, url4, url5]
Find my script below:
driver = webdriver.Chrome()

my_url = "https://prog.nfz.gov.pl/app-jgp/AnalizaPrzekrojowa.aspx"
driver.get(my_url)
time.sleep(3)

content = driver.page_source.encode('utf-8').strip()
page_soup = soup(content, "html.parser")

links = []
for link in page_soup.find_all('a', href=True):
    url = link['href']
    ai = str(url)
    links.append(ai)
    print(links)
I have rewritten your code a little. First you need to load and scrape the main page to get all the links from the "href" attributes. After that, just use the scraped URLs in a loop to get the next pages.
Also, some of the "href" values aren't URLs, so you have to clean them first.
I prefer requests for doing the GET requests.
http://docs.python-requests.org/en/master/
I hope it helps.
from bs4 import BeautifulSoup
import requests

def main():
    links = []
    url = "https://prog.nfz.gov.pl/app-jgp/AnalizaPrzekrojowa.aspx"

    web_page = requests.get(url)
    soup = BeautifulSoup(web_page.content, "html.parser")
    a_tags = soup.find_all('a', href=True)

    for a in a_tags:
        links.append(a.get("href"))
    print(links)  # just to demonstrate that links are there

    cleaned_list = []
    for link in links:
        if "http" in link:
            cleaned_list.append(link)
    print(cleaned_list)

    return cleaned_list

def load_pages_from_links(urls):
    user_agent = {'User-agent': 'Mozilla/5.0'}
    links = urls
    downloaded_pages = {}

    if len(links) == 0:
        return "There are no links."
    else:
        for nr, link in enumerate(links):
            web_page = requests.get(link, headers=user_agent)
            downloaded_pages[nr] = web_page.content
    print(downloaded_pages)

if __name__ == "__main__":
    links = main()
    load_pages_from_links(links)
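A possible follow-up: returning the dictionary instead of only printing it would let the caller reuse the downloaded pages; a sketch of that variant:

def load_pages_from_links(urls):
    user_agent = {'User-agent': 'Mozilla/5.0'}
    downloaded_pages = {}
    for nr, link in enumerate(urls):
        web_page = requests.get(link, headers=user_agent)
        downloaded_pages[nr] = web_page.content
    return downloaded_pages

pages = load_pages_from_links(links)  # pages[0], pages[1], ... hold the raw HTML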