Cleaning strings for link URLs in Python

So I have Beautiful Soup code which visits the main page of a website
and scrapes the links there. However, when I get the links in Python I can't seem to clean up each link (after it's converted to a string) for concatenation with the root URL.
import re
import requests
import bs4

list1 = []

def get_links():
    regex3 = re.compile(r'/[a-z\-]+/[a-z\-]+')
    response = requests.get('http://noisetrade.com')
    soup = bs4.BeautifulSoup(response.text)
    links = soup.select('div.grid_info a[href]')
    for link in links:
        lk = link.get('href')
        prtLk = regex3.findall(lk)
        list1.append(prtLk)

def visit_pages():
    url1 = str(list1[1])
    print(url1)

get_links()
visit_pages()
This produces the output: "['/stevevantinemusic/unsolicited-material']"
Desired output: "/stevevantinemusic/unsolicited-material"
I have tried .strip(), .replace(), and re.sub/match/etc. I can't seem to isolate the characters [, ', and ], which are the ones I need removed. I had iterated through with sub-strings, but that feels inefficient. I'm sure I'm missing something obvious.

findall returns a list of results, so you can either write:
for link in links:
    lk = link.get('href')
    urls = regex3.findall(lk)
    if urls:
        prtLk = urls[0]
        list1.append(prtLk)
or better, use the search method:
for link in links:
    lk = link.get('href')
    m = regex3.search(lk)
    if m:
        prtLk = m.group()
        list1.append(prtLk)
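With either fix, list1 holds plain strings, so visit_pages can concatenate directly with the root URL. A minimal sketch, using the site from the question:

def visit_pages():
    # list1[1] is now a plain string such as '/stevevantinemusic/unsolicited-material'
    url1 = 'http://noisetrade.com' + list1[1]
    print(url1)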
Those brackets were the result of converting a list with one element to a string.
For example:
l = ['text']
str(l)
results in:
"['text']"

Here I use the regexp r'[\[\'\]]' to replace any of the unwanted characters with the empty string:
$ cat pw.py
import re

def visit_pages():
    url1 = "['/stevevantinemusic/unsolicited-material']"
    url1 = re.sub(r'[\[\'\]]', '', url1)
    print(url1)

visit_pages()
$ python pw.py
/stevevantinemusic/unsolicited-material

Here is an example of what I think you are trying to do:
>>> import bs4
>>> with open('noise.html', 'r') as f:
... lines = f.read()
...
>>> soup = bs4.BeautifulSoup(lines)
>>> root_url = 'http://noisetrade.com'
>>> for link in soup.select('div.grid_info a[href]'):
... print(root_url + link.get('href'))
...
http://noisetrade.com/stevevantinemusic
http://noisetrade.com/stevevantinemusic/unsolicited-material
http://noisetrade.com/jessicarotter
http://noisetrade.com/jessicarotter/winter-sun
http://noisetrade.com/geographermusic
http://noisetrade.com/geographermusic/live-from-the-el-rey-theatre
http://noisetrade.com/kaleo
http://noisetrade.com/kaleo/all-the-pretty-girls-ep
http://noisetrade.com/aviddancer
http://noisetrade.com/aviddancer/an-introduction
http://noisetrade.com/thinkr
http://noisetrade.com/thinkr/quiet-kids-ep
http://noisetrade.com/timcaffeemusic
http://noisetrade.com/timcaffeemusic/from-conversations
http://noisetrade.com/pearl
http://noisetrade.com/pearl/hello
http://noisetrade.com/staceyrandolmusic
http://noisetrade.com/staceyrandolmusic/fables-noisetrade-sampler
http://noisetrade.com/sleepyholler
http://noisetrade.com/sleepyholler/sleepy-holler
http://noisetrade.com/sarahmcgowanmusic
http://noisetrade.com/sarahmcgowanmusic/indian-summer
http://noisetrade.com/briandunne
http://noisetrade.com/briandunne/songs-from-the-hive
Remember also that bs4 has its own types that it uses.
A good way to debug your scripts would be to place:
for link in links:
    import pdb; pdb.set_trace()  # the script will stop for debugging here
    lk = link.get('href')
    prtLk = regex3.findall(lk)
    list1.append(prtLk)
anywhere you want to debug.
And then you could do something like this within pdb:
next
l
print(type(lk))
print(links)
dir()
dir(links)
dir(lk)
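For instance, the types you would see there are bs4's own classes alongside plain Python ones (illustrative; the names come from the script above, and the exact classes can vary by bs4 version):

print(type(links))  # a list-like collection of results from select()
print(type(link))   # <class 'bs4.element.Tag'>
print(type(lk))     # <class 'str'> - get('href') returns a plain string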

Related

How to search multiple keywords in a web page? This only inputs one keyword

import mechanize
from bs4 import BeautifulSoup
import time
import smtplib

# True by default
while True:
    url = "https://www.google.com"
    browser = mechanize.Browser()
    browser.open(url)
    response = browser.response().read()
    soup = BeautifulSoup(response, "lxml")
    count = 1
    if str(soup).find("English") == -1:
        # wait 60 seconds (change the time (in seconds) as you wish),
        print('Checking - ' + str(count) + 'th Time')
        time.sleep(60)
        count += 1
        # continue with the script
        continue
There are a couple of problems here:
Beautiful Soup provides a method get_text() to extract the text, so you do not need to convert the soup to a string.
String's find() returns -1 when no value was found. Are you sure that is what you want?
Why do you use time.sleep()? What is the purpose of pausing the program?
count is reset to 1 on every pass of the while loop, which makes the increment redundant; and if the body is not actually inside a loop, continue will raise an error.
If you want to get the number of occurrences of a string, you can use regex's findall() and then take its length, like: len(re.findall("English", soup_text)).
If you want to find multiple keywords, you can create a list of the keywords and then loop through them, like:
for k in ["a", "b", "c"]:
    print(f'{k}: {len(re.findall(k, soup.get_text()))}')
Full example:
from bs4 import BeautifulSoup
import requests  # simple http request
import re  # regex

url = "https://www.google.com"
doc = requests.get(url)
soup = BeautifulSoup(doc.text, "lxml")
soup_text = soup.get_text()
keywords = ["Google", "English", "a"]
for k in keywords:
    print(f'{k}: {len(re.findall(k, soup_text))}')
I strongly suggest studying Python thoroughly:
Python: w3school tutorial
BeautifulSoup: Documentation
Regex: w3schools tutorial or RegExr

NoneType error problem in BeautifulSoup in Python

I'm a beginner at programming, so I have a problem with the find method in BeautifulSoup when I use it for web scraping. I have this code:
from os import execle, link, unlink, write
from typing import Text
import requests
from bs4 import BeautifulSoup
import csv
from itertools import zip_longest

job_titleL = []
company_nameL = []
location_nameL = []
experience_inL = []
links = []
salary = []
job_requirementsL = []
date = []

result = requests.get(f"https://wuzzuf.net/search/jobs/?a=%7B%7D&q=python&start=1")
source = result.content
soup = BeautifulSoup(source, "lxml")
job_titles = soup.find_all("h2", {"class": "css-m604qf"})
companies_names = soup.find_all("a", {"class": "css-17s97q8"})
locations_names = soup.find_all("span", {"class": "css-5wys0k"})
experience_in = soup.find_all("a", {"class": "css-5x9pm1"})
posted_new = soup.find_all("div", {"class": "css-4c4ojb"})
posted_old = soup.find_all("div", {"class": "css-do6t5g"})
posted = [*posted_new, *posted_old]

for L in range(len(job_titles)):
    job_titleL.append(job_titles[L].text)
    links.append(job_titles[L].find('a').attrs['href'])
    company_nameL.append(companies_names[L].text)
    location_nameL.append(locations_names[L].text)
    experience_inL.append(experience_in[L].text)
    date_text = posted[L].text.replace("-", "").strip()
    date.append(posted[L].text)

for link in links:
    result = requests.get(link)
    source = result.content
    soup = BeautifulSoup(source, "lxml")
    requirements = soup.find("div", {"class": "css-1t5f0fr"}).ul
    requirements1 = soup.find("div", {"class": "css-1t5f0fr"}).p
    respon_text = ""
    if requirements:
        for li in requirements.find_all("li"):
            print(li)
    if requirements1:
        for br in requirements1.find_all("br"):
            print(br)
            respon_text += li.text + "|"
    job_requirementsL.append(respon_text)

file_list = [job_titleL, company_nameL, date, location_nameL, experience_inL, links, job_requirementsL]
exported = zip_longest(*file_list)
with open('newspeard2.csv', "w") as spreadsheet:
    wr = csv.writer(spreadsheet)
    wr.writerow(["job title", "company name", "date", "location", "experience in", "links", "job requirements"])
    wr.writerows(exported)
Note: I'm not very good at English :(
So when I use the find method to get the job requirements from each job on the website (Wuzzuf) and use a for loop to loop through each text in the job requirements, it returns an error that says: "NoneType object has no attribute find_all('li')". After searching for why this happens, and after inspecting each job page, I found that some job pages use br, p, and strong tags in the job requirements. I didn't know what to do, but I used an if statement to test it; it returns the tags, but the br tag is empty, without text. So please can you see where the problem is and answer me? Thanks.
the webpage:
https://wuzzuf.net/search/jobs/?a=hpb&q=python&start=1
a job page that uses p and br tags:
https://wuzzuf.net/jobs/p/T9WuTpM3Mveq-Senior-Data-Scientist-Evolvice-GmbH-Cairo-Egypt?o=28&l=sp&t=sj&a=python|search-v3|hpb
Sorry, I didn't understand the problem with p sooner.
for link in links:
    result = requests.get(link)
    source = result.content
    soup = BeautifulSoup(source, "lxml")
    requirements_div = soup.find("div", {"class": "css-1t5f0fr"})
    respon_text = []
    for child in requirements_div.children:
        if child.name == 'ul':
            for li in child.find_all("li"):
                respon_text.append(li.text)
        elif child.name == 'p':
            for x in child.contents:
                if x.name == 'br':
                    pass
                elif x.name == 'strong':
                    respon_text.append(x.text)
                else:
                    respon_text.append(x)
    job_requirementsL.append('|'.join(respon_text))
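One caveat, since the original error was a NoneType one: if a job page has no div with class css-1t5f0fr at all, requirements_div will be None and the .children access will raise the same kind of AttributeError. A small guard right after the find call would cover it (my suggestion, not part of the original answer):

    requirements_div = soup.find("div", {"class": "css-1t5f0fr"})
    if requirements_div is None:
        job_requirementsL.append("")  # keep the CSV columns aligned for this job
        continue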

AttributeError: type object 'typing.re' has no attribute 'split'

I'm trying to scrape URL links from Google. Users can input any search and then get URL links back, but the main problem is that the split function doesn't work, and I can't fix it. So please help me.
[[Suppose a user inputs "all useless website"; Google shows results, and the user should get only the URL links.]]
from typing import re
from bs4 import BeautifulSoup
import requests

user_input = input('Enter value for search : ')
print('Please Wait')
page_source = requests.get("https://www.google.com/search?q=" + user_input)
soup = BeautifulSoup(page_source.text, 'html.parser')
print(soup.title)
print(soup.title.string)
print(soup.title.parent.name)
all_links = soup.find_all('a')
for link in all_links:
    link_google = re.split(":(?=http)", link["href"].replace("/url?q=", ""))
    print(link_google.find["a"])
You're importing re from the wrong place. You need to use it via import re, as follows:
import re
...
link_google = re.split(":(?=http)", link["href"].replace("/url?q=", ""))
Updates to make your code work:
import re correctly
fix the line all_links = soup.find_all('a') to all_links = soup.find_all('a', href=True)
take the link and clean it up like you did before (re.split() works perfectly, but it returns a list), then add that link to a list (unpack the list) or print it
Here is the code updated to make it work:
# issue 1
import re
from bs4 import BeautifulSoup
import requests

user_input = input('Enter value for search : ')
print('Please Wait')
page_source = requests.get("https://www.google.com/search?q=" + user_input)
soup = BeautifulSoup(page_source.text, 'html.parser')
print(soup.title)
print(soup.title.string)
print(soup.title.parent.name)
# issue 2
all_links = soup.find_all('a', href=True)
for link in all_links:
    link_from_google = re.split(":(?=http)", link["href"].replace("/url?q=", ""))
    # issue 3
    print(link_from_google[0])
>>> {returns all the http links}
One-liner list comprehension, for fun:
google_links = [re.split(":(?=http)", link["href"].replace("/url?q=", ""))[0] for link in soup.find_all('a', href=True)]
>>> {returns all the http links}
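As a side note on the pattern itself: ":(?=http)" splits on a colon only when it is immediately followed by "http"; the lookahead means the "http" is checked but not consumed, so it stays attached to the URL that follows. A small illustration with made-up URLs:

import re

# the lookahead keeps 'http' attached to the second URL instead of consuming it
parts = re.split(":(?=http)", "http://a.example/page:http://b.example/other")
print(parts)  # ['http://a.example/page', 'http://b.example/other']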

Python: unable to get any output using BeautifulSoup

I am trying to scrape some words from a random website, but the following program shows no errors and no output when I try printing the results.
I have checked the code twice and even incorporated an if statement to see whether the program is getting any words or not.
import requests
import operator
from bs4 import BeautifulSoup

def word_count(url):
    wordlist = []
    source_code = requests.get(url)
    source = BeautifulSoup(source_code.text, features="html.parser")
    for post_text in source.findAll('a', {'class': 'txt'}):
        word_string = post_text.string
        if word_string is not None:
            word = word_string.lower().split()
            for each_word in word:
                print(each_word)
                wordlist.append(each_word)
        else:
            print("None")

word_count('https://mumbai.craigslist.org/')
I am expecting all the words under the class "txt" to be displayed in the output.
OP: "I am expecting all the words of the class txt to be displayed in the output"
The culprit:
for post_text in source.findAll('a', {'class': 'txt'}):
The reason:
The anchor tag has no class txt, but the span tag inside it does.
Hence:
import requests
from bs4 import BeautifulSoup

def word_count(url):
    source_code = requests.get(url)
    source = BeautifulSoup(source_code.text, features="html.parser")
    for post_text in source.findAll('a'):
        s_text = post_text.find('span', class_="txt")
        if s_text is not None:
            print(s_text.text)

word_count('https://mumbai.craigslist.org/')
OUTPUT:
community
activities
artists
childcare
classes
events
general
groups
local news
lost+found
missed connections
musicians
pets
.
.
.
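An equivalent way to reach those spans is a CSS selector, which avoids the nested find calls (a sketch, assuming the same markup as above):

for span in source.select('a span.txt'):
    print(span.get_text())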
You are targeting the wrong elements.
If you use
print(source)
everything works fine, but the moment you try to target the element with findAll you are targeting something wrong, because you get an empty list back.
If you replace
for post_text in source.findAll('a', {'class': 'txt'}):
with
for post_text in source.find_all('a'):
everything seems to work fine.
I have visited https://mumbai.craigslist.org/ and found there is no <a class="txt">, only <span class="txt">, so I think you can try this:
def word_count(url):
    wordlist = []
    source_code = requests.get(url)
    source = BeautifulSoup(source_code.text, features="html.parser")
    for post_text in source.findAll('span', {'class': 'txt'}):
        word_string = post_text.text
        if word_string is not None:
            word = word_string.lower().split()
            for each_word in word:
                print(each_word)
                wordlist.append(each_word)
        else:
            print("None")
It will output correctly:
community
activities
artists
childcare
classes
events
general
...
Hope that helps you, and comment if you have further questions. : )

Python index function

I am writing a simple Python program which grabs a webpage and finds all the URL links in it. However, when I try to index the starting and ending delimiter (") of each href link, the ending one is always indexed wrong.
# open a url and find all the links in it
import urllib2

url = urllib2.urlopen('right.html')
urlinfo = url.info()
urlcontent = url.read()
bodystart = urlcontent.index('<body')
print 'body starts at', bodystart
bodycontent = urlcontent[bodystart:].lower()
print bodycontent
linklist = []
n = bodycontent.index('<a href=')
while n:
    print n
    bodycontent = bodycontent[n:]
    a = bodycontent.index('"')
    b = bodycontent[(a+1):].index('"')
    print a, b
    linklist.append(bodycontent[(a+1):b])
    n = bodycontent[b:].index('<a href=')
print linklist
I would suggest using an HTML parsing library instead of manually searching the DOM string.
Beautiful Soup is an excellent library for this purpose. Here is the reference link.
With bs4, your link-searching functionality could look like:
from bs4 import BeautifulSoup
soup = BeautifulSoup(bodycontent, 'html.parser')
linklist = [a.get('href') for a in soup.find_all('a')]
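For completeness, a minimal sketch of how this could replace the whole manual index-based loop (assuming right.html is a local file, as the question's urlopen call suggests):

from bs4 import BeautifulSoup

# read the local file and let the parser find every href attribute
with open('right.html') as f:
    soup = BeautifulSoup(f.read(), 'html.parser')

linklist = [a.get('href') for a in soup.find_all('a', href=True)]
print(linklist)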
