Different outputs when removing characters from a list - Python

I'm doing some web scraping in Python and I want to remove the "." characters from each element of a list. I have two approaches, but only one gives the correct output. The code is below.
import urllib2
from bs4 import BeautifulSoup

first = urllib2.urlopen("http://www.admision.unmsm.edu.pe/res20130914/A.html").read()
soup = BeautifulSoup(first)
w = []
for q in soup.find_all('tr'):
    for link in q.find_all('a'):
        w.append(link["href"])

s = [i.replace(".", "") for i in w]

l = []
for t in w:
    l = t.replace(".", "")
If I run print s, the output is correct, but if I run print l, it isn't.
I would like to know why s gives the correct output and l doesn't.

In the loop, you replace the whole list on each iteration instead of appending to it, as the list comprehension does.
Instead, try:
for t in w:
    l.append(t.replace(".", ""))
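To see the difference in isolation, here is a minimal sketch using a made-up list (no scraping involved):

```python
w = ["a.b", "c.d"]

# Assignment rebinds l to a single string on each pass,
# so only the last element's result survives the loop.
l = []
for t in w:
    l = t.replace(".", "")
print(l)  # 'cd'

# Appending accumulates every result, matching the list comprehension.
l = []
for t in w:
    l.append(t.replace(".", ""))
print(l)  # ['ab', 'cd']
```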

You are assigning to the list variable each time, so it gets overwritten. As a result, you are left with only the last element after the iterations. Hope it helps!
import urllib2
from bs4 import BeautifulSoup

first = urllib2.urlopen("http://www.admision.unmsm.edu.pe/res20130914/A.html").read()
soup = BeautifulSoup(first)
w = []
for q in soup.find_all('tr'):
    for link in q.find_all('a'):
        w.append(link["href"])

s = [i.replace(".", "") for i in w]
print s

l = []
for t in w:
    l.append(t.replace(".", ""))
print l
Cheers!

Related

Regex: Sentence Beginning With Keyword Ending with Sentence Blank Output

If I print "m", there is a result that begins with "Histology" and ends with a period. Despite that, the output comes up empty.
from bs4 import BeautifulSoup
from googlesearch import search
import requests
import re
from goose3 import Goose

def search_google(query):
    parent_ = []
    for j in search(query, tld="co.in", num=10, stop=5, pause=2):
        child_ = []
        link_ = j
        site_name = link_.split("/")[2]
        child_.append(site_name)
        child_.append(link_)
        parent_.append(child_)
        g = Goose()
        article = g.extract(link_)
        m = article.cleaned_text
    Answer = re.findall(r'\bHistology\s+([^.]*)', m)
    print(Answer)
f = search_google("""'Histology'""")
Output: []
It seems your Answer variable has incorrect indentation, and your last result has no matches in the cleaned text; this is why your print produces an empty list.
Since the print sits outside the loop, it only triggers once, and because the final value of Answer has no matches, you get back an empty list.
Indent the Answer assignment by one level and it should output the correct result.
Your regex will also match only the text following "Histology", without the word itself, because you specified a capture group that does not include "Histology". You can resolve this by removing the capture group:
r'\bHistology\s+[^.]*'
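The capture-group difference is easy to see on a small made-up string (the sample text below is an assumption, not from the scraped pages):

```python
import re

m = "Histology is the study of tissues. Other text follows."

# With a capture group, findall returns only the group's contents.
with_group = re.findall(r'\bHistology\s+([^.]*)', m)
print(with_group)     # ['is the study of tissues']

# Without the group, findall returns the whole match, keyword included.
without_group = re.findall(r'\bHistology\s+[^.]*', m)
print(without_group)  # ['Histology is the study of tissues']
```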
from bs4 import BeautifulSoup
from googlesearch import search
import requests
import re
from goose3 import Goose

def search_google(query):
    parent_ = []
    for j in search(query, tld="co.in", num=10, stop=5, pause=2):
        child_ = []
        link_ = j
        site_name = link_.split("/")[2]
        child_.append(site_name)
        child_.append(link_)
        parent_.append(child_)
        g = Goose()
        article = g.extract(link_)
        m = article.cleaned_text
        Answer = re.findall(r'\bHistology\s+[^.]*', m)
        print(Answer)

f = search_google("""'Histology'""")
To print each match on its own line, change print(Answer) to print('\n'.join(Answer)).

How to clean HTML removing repeated paragraphs?

I'm trying to clean an HTML file that has repeated paragraphs within the body. Below I show the input file and the expected output.
Input.html
https://jsfiddle.net/97ptc0Lh/4/
Output.html
https://jsfiddle.net/97ptc0Lh/1/
I've been trying the following code using BeautifulSoup, but I don't know why it is not working: the resulting list CleanHtml still contains the repeated elements (paragraphs) that I'd like to remove.
from bs4 import BeautifulSoup

fp = open("Input.html", "rb")
soup = BeautifulSoup(fp, "html5lib")

Uniques = set()
CleanHtml = []
for element in soup.html:
    if element not in Uniques:
        Uniques.add(element)
        CleanHtml.append(element)
print(CleanHtml)
Could someone help me reach this goal, please?
I think this should do it:
elms = []
for elem in soup.find_all('font'):
    if elem not in elms:
        elms.append(elem)
    else:
        target = elem.findParent().findParent()
        target.decompose()
print(soup.html)
This should get you the desired output.
Edit:
To remove only those paragraphs whose size is not 4 or 5, change the else block to:
else:
    if elem.attrs['size'] != "4" and elem.attrs['size'] != "5":
        target = elem.findParent().findParent()
        target.decompose()
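The order-preserving de-duplication pattern both snippets rely on can be sketched with plain strings standing in for parsed elements (the sample data is hypothetical):

```python
# Keep the first occurrence of each item, drop later repeats,
# preserving the original order.
paragraphs = ["<p>alpha</p>", "<p>beta</p>", "<p>alpha</p>"]
seen = set()
unique = []
for p in paragraphs:
    if p not in seen:
        seen.add(p)
        unique.append(p)
print(unique)  # ['<p>alpha</p>', '<p>beta</p>']
```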

Removing quotes from re.findall output

I am trying to remove the quotes from my re.findall output using Python 3. I tried suggestions from various forums, but nothing worked as expected, so I finally thought of asking here myself.
My code:
import requests
from bs4 import BeautifulSoup
import re
import time

price = []
while True:
    url = "https://api.binance.com/api/v3/ticker/price?symbol=ETHUSDT"
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    data = soup.prettify()
    for p in data:
        match = re.findall(r'\d*\.?\d+', data)
        print("ETH/USDT", match)
        price.append(match)
        break
The output of match gives ['143.19000000']. I would like it to be [143.19000000], but I cannot figure out how to do this.
Another problem I am encountering is that each match is appended to price as its own single-element list, so price ends up like [[a], [b], [c]]. I would like it to be [a, b, c]. I am having a bit of trouble solving these two problems.
Thanks :)
Parse the response from requests.get() as JSON, rather than using BeautifulSoup:
import requests
url = "https://api.binance.com/api/v3/ticker/price?symbol=ETHUSDT"
response = requests.get(url)
response.raise_for_status()
data = response.json()
print(data["price"])
To get floats instead of strings:
float_match = [float(el) for el in match]
To get a list instead of a list of lists:
for el in float_match:
    price.append(el)
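Putting both fixes together on a sample match list (the strings below are made up; a live request would return current prices):

```python
# Strings like those re.findall would return
match = ["143.19000000", "150.25000000"]

price = []
# Convert each string to a float and keep price flat, not nested.
price.extend(float(el) for el in match)
print(price)  # [143.19, 150.25]
```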

How can I group it using the "search" function in regular expressions?

I have been developing a Python web crawler to collect used-car stock data from this website: http://www.bobaedream.co.kr/cyber/CyberCar.php?gubun=I&page=20
First of all, I would like to collect only "BMW" entries from the list, so I used the "search" function from the regular expression module, as in the code below. But it keeps returning None.
Is there anything wrong with my code?
Please give me some advice.
Thanks.
from bs4 import BeautifulSoup
import urllib.request
import re

CAR_PAGE_TEMPLATE = "http://www.bobaedream.co.kr/cyber/CyberCar.php?gubun=I&page="

def fetch_post_list():
    for i in range(20, 21):
        URL = CAR_PAGE_TEMPLATE + str(i)
        res = urllib.request.urlopen(URL)
        html = res.read()
        soup = BeautifulSoup(html, 'html.parser')
        table = soup.find('table', class_='cyber')
        print("Page#", i)
        # 50 listings per page
        lists = table.find_all('tr', itemtype="http://schema.org/Article")
        count = 0
        r = re.compile("[BMW]")
        for lst in lists:
            if lst.find_all('td')[3].find('em').text:
                lst_price = lst.find_all('td')[3].find('em').text
                lst_title = lst.find_all('td')[1].find('a').text
                lst_link = lst.find_all('td')[1].find('a')['href']
                lst_photo_url = ''
                if lst.find_all('td')[0].find('img'):
                    lst_photo_url = lst.find_all('td')[0].find('img')['src']
                count += 1
            else:
                continue
            print('#', count, lst_title, r.search("lst_title"))
    return lst_link

fetch_post_list()
r.search("lst_title")
This searches inside the string literal "lst_title", not the variable named lst_title; that's why it never matches.
r=re.compile("[BMW]")
The square brackets define a character class, meaning you're looking for any one of those characters. So, for example, any string containing an M will match. You just want "BMW". In fact, you don't even need regular expressions; you can simply test:
"BMW" in lst_title

Python BeautifulSoup Getting a column from table - IndexError List index out of range

Python newbie here, using Python 2.7 with BeautifulSoup 4.
I am trying to parse a webpage to extract columns using BeautifulSoup. The webpage has tables nested inside tables; table 4 is the one I want, and it has no headers or th tags. I want to get the data as a column.
from bs4 import BeautifulSoup
import urllib2

url = 'http://finance.yahoo.com/q/op?s=aapl+Options'
htmltext = urllib2.urlopen(url).read()
soup = BeautifulSoup(htmltext)

# Table 8 has the data needed; it is nested under other tables though.
# A specific reference works, as below:
print soup.findAll('table')[8].findAll('tr')[2].findAll('td')[2].contents

# The loop below errors out:
for row in soup.findAll('table')[8].findAll('tr'):
    column2 = row.findAll('td')[2].contents
    print column2
# "IndexError: list index out of range" is what I get on the second line of the for loop.
I saw this as a working solution in another example, but it didn't work for me. I also tried iterating over the tr elements:
mytr = soup.findAll('table')[8].findAll('tr')
for row in mytr:
    print row.find('td')  # works but gives only the first td, as expected
    print row.findAll('td')[2]
which raises an error saying the list index is out of range.
So: the first findAll('table') works, and the second findAll('tr') works, but the third findAll('td') works only if all the [ ] indices are literal numbers, not variables.
e.g.
print soup.findAll('table')[8].findAll('tr')[2].findAll('td')[2].contents
The above works as a specific reference, but not through variables. However, I need it inside a loop to get the full column.
I took a look: the first row in the table is actually a header, so under the first tr there are th elements rather than td. This should work:
>>> mytr = soup.findAll('table')[9].findAll('tr')
>>> for i, row in enumerate(mytr):
...     if i:
...         print i, row.findAll('td')[2]
As in most cases of HTML parsing, consider a more elegant solution such as XPath via lxml, for example:
>>> from lxml import html
>>> print html.parse(url).xpath('//table[@class="yfnc_datamodoutline1"]//td[2]')
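The underlying failure mode, indexing a row whose td list is empty, can be sketched with plain lists standing in for parsed rows (the data below is illustrative):

```python
# The header row yields no td cells, so blind indexing raises IndexError.
rows = [[], ["c1", "c2", "c3"], ["d1", "d2", "d3"]]

column2 = []
for cells in rows:
    if len(cells) > 2:  # skip header/short rows instead of crashing
        column2.append(cells[2])
print(column2)  # ['c3', 'd3']
```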
