I am new to Python and am looking into scraping HTML with the BeautifulSoup library.
I need to fetch the date field as day and date (for example 'Wed 11/1'), and the precip field value together with its measuring unit.
Python code
dates=[]
Precip=[]
for row in right_table.findAll("tr"):
    cells = row.findAll('td')
    th_cells=row.findAll('th') #To store second column data
    if len(cells)==5:
        Precip.append(cells[1].find(text=True))
        dates.append(th_cells[0].find(text=True))
print(dates)
print(Precip)
Code Output
['Wed ', 'Thu ', 'Fri ', 'Sat ', 'Sun ', 'Mon ', 'Tue ', 'Wed ', 'Thu ', 'Fri ', 'Sat ', 'Sun ', 'Mon ', 'Tue ', 'Wed ', 'Thu ', 'Fri ', 'Sat ', 'Sun ', 'Mon ', 'Tue ', 'Wed ', 'Thu ', 'Fri ', 'Sat ', 'Sun ', 'Mon ', 'Tue ', 'Wed ', 'Thu ']
['0 ', '0 ', '0 ', '1 ', '3 ', '3 ', '13 ', '0 ', '0 ', '0 ', '0 ', '0 ', '\xa0', '1 ', '3 ', '0 ', '1 ', '4 ', '2 ', '9 ', '2 ', '0 ', '1 ', '0 ', '0 ', '0 ', '0 ', '0 ', '1 ', '2 ']
Required Output
['Wed 11/1','Thur 11/2'.......]
['0mm','0mm'....]
Below is the HTML which I am trying to parse:
HTML
<class 'list'>: ['\n', <thead>
<tr>
<th>Date</th>
<th>Hi/Lo</th>
<th>Precip</th>
<th>Snow</th>
<th>Forecast</th>
<th>Avg. HI / LO</th>
</tr>
</thead>, '\n', <tbody>
<tr class="pre">
<th scope="row">Wed <time>11/1</time></th>
<td>25°/20°</td>
<td>0 <span class="small">mm</span></td>
<td>0 <span class="small">CM</span></td>
<td> </td>
<td>28°/18°</td>
</tr>
<tr class="pre">
<th scope="row">Thu <time>11/2</time></th>
<td>28°/19°</td>
<td>0 <span class="small">mm</span></td>
<td>0 <span class="small">CM</span></td>
<td> </td>
<td>27°/18°</td>
</tr>
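I'd use .text instead of .find(text=True). .find(text=True) returns only the first text node of the cell, so the text inside child tags such as <time> and <span> is never fetched. .text concatenates all of the text in the cell, including the text of its child tags.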
from bs4 import BeautifulSoup
import requests
html = requests.get("https://www.accuweather.com/en/in/bengaluru/204108/month/204108?view=table").text
soup = BeautifulSoup(html, 'html.parser')
right_table = soup.find("tbody")
dates=[]
Precip=[]
for row in right_table.findAll("tr"):
    cells = row.findAll('td')
    th_cells = row.findAll('th')  # the <th> cell of each row holds the day and date
    if len(cells) == 5:
        Precip.append(cells[1].text)
        dates.append(th_cells[0].text)
print(dates)
print(Precip)
This gives the following output:
['Wed 11/1', 'Thu 11/2', 'Fri 11/3', 'Sat 11/4', 'Sun 11/5', 'Mon 11/6', 'Tue 11/7', 'Wed 11/8', 'Thu 11/9', 'Fri 11/10', 'Sat 11/11', 'Sun 11/12', 'Mon 11/13', 'Tue 11/14', 'Wed 11/15', 'Thu 11/16', 'Fri 11/17', 'Sat 11/18', 'Sun 11/19', 'Mon 11/20', 'Tue 11/21', 'Wed 11/22', 'Thu 11/23', 'Fri 11/24', 'Sat 11/25', 'Sun 11/26', 'Mon 11/27', 'Tue 11/28', 'Wed 11/29', 'Thu 11/30']
['0 mm', '0 mm', '0 mm', '1 mm', '3 mm', '3 mm', '13 mm', '0 mm', '0 mm', '0 mm', '0 mm', '0 mm', '\xa0', '1 mm', '3 mm', '0 mm', '1 mm', '4 mm', '2 mm', '9 mm', '2 mm', '0 mm', '1 mm', '0 mm', '0 mm', '0 mm', '0 mm', '0 mm', '1 mm', '2 mm']
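The required output in the question has no space between the value and the unit ('0mm'), and the list above still contains '\xa0' for the empty cell. A minimal sketch of one way to tidy the precipitation strings afterwards, in plain Python; the helper name compact is made up for illustration, and it should only be applied to the precipitation list because the dates need to keep their space:

```python
def compact(value):
    """Remove all whitespace, including non-breaking spaces, from a cell value."""
    # str.split() with no arguments splits on any Unicode whitespace,
    # and '\xa0' (non-breaking space) counts as whitespace for str.
    return "".join(value.split())


# Sample values copied from the output above.
precip = ['0 mm', '13 mm', '\xa0', '1 mm']
cleaned = [compact(p) for p in precip]
print(cleaned)  # ['0mm', '13mm', '', '1mm']
```

The empty cell becomes an empty string, which is easier to test for than '\xa0'.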
Related
I'm trying to properly label my line plot and set the x-tick labels but have been unsuccessful.
Here is what I've tried so far:
plt.xticks(ticks = ... ,labels =...)
AND
labels = ['8 pcw', '12 pcw', '13 pcw', '16 pcw', '17 pcw', '19 pcw', '21 pcw',
'24 pcw', '35 pcw', '37 pcw', '4 mos', '1 yrs', '2 yrs', '3 yrs',
'4 yrs', '8 yrs', '11 yrs', '13 yrs', '18 yrs', '19 yrs', '21 yrs',
'23 yrs', '30 yrs', '36 yrs', '37 yrs', '40 yrs']
ax.set_xticks(labels)
The code that I've used to transpose this dataframe into a line graph is this:
mean_df.transpose().plot().line(figsize = (25, 10))
plt.xlabel("Age")
plt.ylabel("Raw RPKM")
plt.title("BTRC Expression in V1C")
The dataframe I'm using (mean_df) contains columns that are already named with their respective label (8 pcw, 12 pcw, ... 36yrs, 40yrs) so I would have thought that it would have pulled them automatically from there. However, it looks like matplotlib automatically removes the x-ticks and displays only 5 values for the x-ticks. How can I get it to display all 24 values instead?
I keep getting the following two errors when I try the methods listed above:
Failed to convert value(s) to axis units:
OR
ValueError: The number of FixedLocator locations (n), usually from a
call to set_ticks, does not match the number of ticklabels (n)
Here is an image of my plot:
In pandas (Jupyter) I have a column with dates in string format:
koncerti.Date.values[:20]
array(['15 September 2010', '16 September 2010', '18 September 2010',
'20 September 2010', '21 September 2010', '23 September 2010',
'24 September 2010', '26 September 2010', '28 September 2010',
'30 September 2010', '1 October 2010', '3 October 2010',
'5 October 2010', '6 October 2010', '8 October 2010',
'10 October 2010', '12 October 2010', '13 October 2010',
'15 October 2010', '17 October 2010'], dtype=object)
I try to convert them to date format with the following statement:
koncerti.Date = pd.to_datetime(koncerti.Date, format='%d %B %Y')
Unfortunately, it produces the following error: ValueError: unconverted data remains: [31]
What does this error mean?
Solution: koncerti.Date = pd.to_datetime(koncerti.Date, format='%d %B %Y', exact=False)
An additional parameter was needed: exact=False
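As for what the error means: at least one value contains extra characters after the part that matches the format, here the text "[31]". The standard library raises the same message, so the situation can be reproduced without pandas. The sketch below assumes the extra text is a bracketed number at the end of the string (the second sample value is made up for illustration) and strips it before parsing:

```python
import re
from datetime import datetime

FORMAT = "%d %B %Y"
# The second value is a made-up example of a date followed by extra text.
raw = ["15 September 2010", "1 October 2010[31]"]

try:
    datetime.strptime(raw[1], FORMAT)
except ValueError as exc:
    message = str(exc)
    print(message)  # unconverted data remains: [31]

# Remove a trailing bracketed number, then parse with the exact format.
cleaned = [re.sub(r"\[\d+\]$", "", value) for value in raw]
parsed = [datetime.strptime(value, FORMAT) for value in cleaned]
print(parsed[1].date())  # 2010-10-01
```

Cleaning the strings first keeps the strict format check for every other row, whereas exact=False accepts the format anywhere in the string.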
In the code below I fetch data from a website with requests and store each listing's values in a dictionary called table. When I iterate over the listings and append the dictionary to a list, every entry in the list ends up the same. Any help is appreciated.
import requests
from bs4 import BeautifulSoup
list1 = []
table = {}
r = requests.get("https://www.century21.com/real-estate/rock-springs-wy/LCWYROCKSPRINGS/?k=1")
content = r.content
soup = BeautifulSoup(content,'html.parser')
all = soup.find_all('div',{"class":"property-card-primary-info"})
for item in all:
    print(item.find('a',{"class":"listing-price"}).text.replace('\n','').replace(' ',''))
    table['address'] = item.find('div',{"class":"property-address"}).text.replace('\n','').replace(' ','')
    table['city'] = item.find('div',{"class":"property-city"}).text.replace('\n','').replace(' ','')
    table['beds'] = item.find('div',{"class":"property-beds"}).text.replace('\n','').replace(' ','')
    table['baths'] = item.find('div',{"class":"property-baths"}).text.replace('\n','').replace(' ','')
    try:
        table['half-baths'] = item.find("div",{"class":"property-half-baths"}).text.replace('\n','').replace(' ','')
    except:
        table['half-baths'] = None
    try:
        table['property sq.ft.'] = item.find("div",{"class":"property-sqft"}).text.replace(' ','').replace("\n",'')
    except:
        table['property sq.ft.'] = None
    list1.append(table)
list1
OUTPUT
$325,000
$249,000
$390,000
$274,900
$208,000
$169,000
$127,500
$990,999
I get unique values when I print the prices, but when I append to the list all the values are replicated. Any help would mean a lot.
Question: how do I get rid of this replication and get the values that correspond to each listing?
[{'address': ' 1129 Hilltop Drive',
'city': 'Rock Springs WY 82901 ',
'beds': '4 beds ',
'baths': '5 baths ',
'half-baths': '2 half baths ',
'property sq.ft.': '10,300 sq. ft '},
{'address': ' 1129 Hilltop Drive',
'city': 'Rock Springs WY 82901 ',
'beds': '4 beds ',
'baths': '5 baths ',
'half-baths': '2 half baths ',
'property sq.ft.': '10,300 sq. ft '},
{'address': ' 1129 Hilltop Drive',
'city': 'Rock Springs WY 82901 ',
'beds': '4 beds ',
'baths': '5 baths ',
'half-baths': '2 half baths ',
'property sq.ft.': '10,300 sq. ft '},
{'address': ' 1129 Hilltop Drive',
'city': 'Rock Springs WY 82901 ',
'beds': '4 beds ',
'baths': '5 baths ',
'half-baths': '2 half baths ',
'property sq.ft.': '10,300 sq. ft '},
{'address': ' 1129 Hilltop Drive',
'city': 'Rock Springs WY 82901 ',
'beds': '4 beds ',
'baths': '5 baths ',
'half-baths': '2 half baths ',
'property sq.ft.': '10,300 sq. ft '},
{'address': ' 1129 Hilltop Drive',
'city': 'Rock Springs WY 82901 ',
'beds': '4 beds ',
'baths': '5 baths ',
'half-baths': '2 half baths ',
'property sq.ft.': '10,300 sq. ft '},
{'address': ' 1129 Hilltop Drive',
'city': 'Rock Springs WY 82901 ',
'beds': '4 beds ',
'baths': '5 baths ',
'half-baths': '2 half baths ',
'property sq.ft.': '10,300 sq. ft '},
{'address': ' 1129 Hilltop Drive',
'city': 'Rock Springs WY 82901 ',
'beds': '4 beds ',
'baths': '5 baths ',
'half-baths': '2 half baths ',
'property sq.ft.': '10,300 sq. ft '}]
for item in all:
    table = {}  # important: create a new dict for each listing
    print(item.find('a',{"class":"listing-price"}).text.replace('\n','').replace(' ',''))
    table['address'] = item.find('div',{"class":"property-address"}).text.replace('\n','').replace(' ','')
    table['city'] = item.find('div',{"class":"property-city"}).text.replace('\n','').replace(' ','')
    table['beds'] = item.find('div',{"class":"property-beds"}).text.replace('\n','').replace(' ','')
    table['baths'] = item.find('div',{"class":"property-baths"}).text.replace('\n','').replace(' ','')
    try:
        table['half-baths'] = item.find("div",{"class":"property-half-baths"}).text.replace('\n','').replace(' ','')
    except AttributeError:  # find() returned None because the element is missing
        table['half-baths'] = None
    try:
        table['property sq.ft.'] = item.find("div",{"class":"property-sqft"}).text.replace(' ','').replace("\n",'')
    except AttributeError:
        table['property sq.ft.'] = None
    list1.append(table)
print(list1)  # print the list once, outside the loop (set(list1) would fail: dicts are unhashable)
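The replication happens because the question's code creates one dictionary before the loop and appends that same object on every iteration, so every list entry refers to the dictionary as it looks after the last update. A small self-contained sketch of the difference, using made-up values instead of scraped data:

```python
rows = ["a", "b", "c"]  # stand-in values for the scraped listings

# One dict created outside the loop: every list entry is the same object.
shared = {}
aliased = []
for value in rows:
    shared["value"] = value
    aliased.append(shared)

# A new dict per iteration: each list entry is independent.
separate = []
for value in rows:
    record = {}
    record["value"] = value
    separate.append(record)

print(aliased)   # [{'value': 'c'}, {'value': 'c'}, {'value': 'c'}]
print(separate)  # [{'value': 'a'}, {'value': 'b'}, {'value': 'c'}]
print(aliased[0] is aliased[1])  # True
```

Appending shared.copy() inside the first loop would also work, since it stores a snapshot of the dictionary rather than a reference to it.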
How do I sort this list numerically?
sa = ['3 :mat', '20 :zap', '20 :jhon', '5 :dave', '14 :maya' ]
print(sorted(sa))
This shows
['14 :maya', '20 :jhon', '20 :zap', '3 :mat', '5 :dave']
You can do it like this, since your numbers are part of the string:
sorted(sa, key = lambda x: int(x.split(' ')[0]))
You can do something like the below, which will use the numbers in the string and sort them.
sa.sort(key=lambda x: int(''.join(filter(str.isdigit, x))))
print(sa)
Using regex:
import re
sorted(sa, key=lambda x: int(re.findall(r'\d+', x)[0]))
['3 :mat', '5 :dave', '14 :maya', '20 :zap', '20 :jhon']
Using the natsort module:
from natsort import natsorted
natsorted(sa)
['3 :mat', '5 :dave', '14 :maya', '20 :jhon', '20 :zap']
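The outputs above differ in the order of the two '20' entries. sorted is stable, so a key that returns only the number keeps tied items in their original order, while natsorted also compares the rest of the string. A sketch of getting the second ordering with the standard library alone; the helper name number_then_name is made up, and it assumes every item has the form 'number :name':

```python
sa = ['3 :mat', '20 :zap', '20 :jhon', '5 :dave', '14 :maya']


def number_then_name(item):
    """Return (number, name) so ties on the number are broken by name."""
    number, _, name = item.partition(' :')
    return int(number), name


# Number only: ties keep their original order because sorted() is stable.
by_number = sorted(sa, key=lambda x: int(x.split(' ')[0]))
# Number first, then name: ties are ordered alphabetically.
by_number_and_name = sorted(sa, key=number_then_name)

print(by_number)           # ['3 :mat', '5 :dave', '14 :maya', '20 :zap', '20 :jhon']
print(by_number_and_name)  # ['3 :mat', '5 :dave', '14 :maya', '20 :jhon', '20 :zap']
```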
I am trying to parse the miscellaneous stats table from basketball-reference.com (https://www.basketball-reference.com/leagues/NBA_1980.html). However, the table that I would like to parse is inside an HTML comment.
Using the following code
html = requests.get("http://www.basketball-reference.com/leagues/NBA_2016.html").content
cleaned_soup = BeautifulSoup(re.sub("<!--|-->","", html))
results in the following error:
TypeError Traceback (most recent call last)
<ipython-input-35-93508687bbc6> in <module>()
----> 1 cleaned_soup = BeautifulSoup(re.sub("<!--|-->","", html))
~/.pyenv/versions/3.7.0/lib/python3.7/re.py in sub(pattern, repl, string, count, flags)
190 a callable, it's passed the Match object and must return
191 a replacement string to be used."""
--> 192 return _compile(pattern, flags).sub(repl, string, count)
193
194 def subn(pattern, repl, string, count=0, flags=0):
TypeError: cannot use a string pattern on a bytes-like object
I am using Python 3.7.
Rather than trying to use re to put all of the HTML inside the comments into your HTML, you could instead use BeautifulSoup to return you just the comments from the HTML. These can then also be parsed using BeautifulSoup to extract any table elements as required, for example:
import requests
from bs4 import BeautifulSoup, Comment
html = requests.get("http://www.basketball-reference.com/leagues/NBA_2016.html").content
soup = BeautifulSoup(html, "html.parser")
for comment in soup.find_all(text=lambda t : isinstance(t, Comment)):
    comment_html = BeautifulSoup(comment, "html.parser")
    for table in comment_html.find_all("table"):
        for tr in table.find_all("tr"):
            row = [td.text for td in tr.find_all("td")]
            print(row)
        print()
This would give you rows in the tables starting:
['Finals', 'Cleveland Cavaliers \nover \nGolden State Warriors\n\xa0(4-3)\n', 'Series Stats']
['\n\n\nGame 1\nThu, June 2\nCleveland Cavaliers\n89# Golden State Warriors\n104\n\nGame 2\nSun, June 5\nCleveland Cavaliers\n77# Golden State Warriors\n110\n\nGame 3\nWed, June 8\nGolden State Warriors\n90# Cleveland Cavaliers\n120\n\nGame 4\nFri, June 10\nGolden State Warriors\n108# Cleveland Cavaliers\n97\n\nGame 5\nMon, June 13\nCleveland Cavaliers\n112# Golden State Warriors\n97\n\nGame 6\nThu, June 16\nGolden State Warriors\n101# Cleveland Cavaliers\n115\n\nGame 7\nSun, June 19\nCleveland Cavaliers\n93# Golden State Warriors\n89\n\n\n', 'Game 1', 'Thu, June 2', 'Cleveland Cavaliers', '89', '# Golden State Warriors', '104', 'Game 2', 'Sun, June 5', 'Cleveland Cavaliers', '77', '# Golden State Warriors', '110', 'Game 3', 'Wed, June 8', 'Golden State Warriors', '90', '# Cleveland Cavaliers', '120', 'Game 4', 'Fri, June 10', 'Golden State Warriors', '108', '# Cleveland Cavaliers', '97', 'Game 5', 'Mon, June 13', 'Cleveland Cavaliers', '112', '# Golden State Warriors', '97', 'Game 6', 'Thu, June 16', 'Golden State Warriors', '101', '# Cleveland Cavaliers', '115', 'Game 7', 'Sun, June 19', 'Cleveland Cavaliers', '93', '# Golden State Warriors', '89']
['Game 1', 'Thu, June 2', 'Cleveland Cavaliers', '89', '# Golden State Warriors', '104']
['Game 2', 'Sun, June 5', 'Cleveland Cavaliers', '77', '# Golden State Warriors', '110']
['Game 3', 'Wed, June 8', 'Golden State Warriors', '90', '# Cleveland Cavaliers', '120']
['Game 4', 'Fri, June 10', 'Golden State Warriors', '108', '# Cleveland Cavaliers', '97']
Note: The "cannot use a string pattern on a bytes-like object" error occurs because .content returns bytes while the pattern is a str. To avoid it, you could have used .text instead of .content so that a string is passed to your regular expression.
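The type mismatch can be reproduced without any network request. The sketch below uses a made-up bytes value in place of the downloaded page and shows two ways to make the pattern and the input agree: decode the bytes to str (which is what .text does for you), or use a bytes pattern and a bytes replacement.

```python
import re

# Made-up stand-in for the bytes returned by requests' .content.
raw_bytes = b"<div><!-- <table><tr><td>1</td></tr></table> --></div>"

try:
    re.sub("<!--|-->", "", raw_bytes)  # str pattern on bytes input
except TypeError as exc:
    error = str(exc)
    print(error)  # cannot use a string pattern on a bytes-like object

# Option 1: decode to str first, then use a str pattern.
from_text = re.sub("<!--|-->", "", raw_bytes.decode("utf-8"))
# Option 2: keep bytes and use a bytes pattern and replacement.
from_bytes = re.sub(b"<!--|-->", b"", raw_bytes)

print(from_text)  # <div> <table><tr><td>1</td></tr></table> </div>
```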