Printing out scraped data with python selenium - python

Here is the way I came up with to print out scraped data:
pool_to_search_for_loads = driver.find_element(By.XPATH, '//*[@id="searchResults"]/div[5]/div')
loads_contact = pool_to_search_for_loads.find_elements(By.CLASS_NAME, 'contact')
loads_origin = pool_to_search_for_loads.find_elements(By.CLASS_NAME, 'origin')
loads_dest = pool_to_search_for_loads.find_elements(By.CLASS_NAME, 'dest')
def parse_printer(info1, info2, info3):
    count = 0
    while count < len(info1):
        print(info1[count].text, ' from ', info2[count].text, ' to ', info3[count].text)
        count += 1
parse_printer(loads_contact, loads_origin, loads_dest)
This gives me the following output:
(800) 999-0101 from Hernando, FL to Port Huron, MI
(800) 999-0101 from Albany, GA to Dayton, OH
(800) 999-0101 from Valdosta, GA to Cincinnati, OH
(800) 999-0101 from Tallahassee, FL to Indianapolis, IN
(800) 999-0101 from Macon, GA to Lexington, KY
Writing a function for this seems like overkill. Is there a more elegant way to print out the results?

It depends on whether you need to keep the loads_contact, loads_origin, and loads_dest element lists for other uses. If you don't, you could use list comprehensions to extract the text directly.
loads_contact = [x.text for x in pool_to_search_for_loads.find_elements(By.CLASS_NAME, 'contact')]
loads_origin = [x.text for x in pool_to_search_for_loads.find_elements(By.CLASS_NAME, 'origin')]
loads_dest = [x.text for x in pool_to_search_for_loads.find_elements(By.CLASS_NAME, 'dest')]
Then you could zip those three lists together and format each resulting tuple with an f-string.
for item in zip(loads_contact, loads_origin, loads_dest):
    print(f"{item[0]} from {item[1]} to {item[2]}")
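If you prefer names over indexes, the loop can also unpack each tuple as it goes. This is only a minimal sketch: plain strings stand in for the scraped text (the sample values are copied from the output in the question), so it runs without a browser.

```python
# Stand-in data: in the real script these lists would hold the .text
# of the elements returned by find_elements().
loads_contact = ["(800) 999-0101", "(800) 999-0101"]
loads_origin = ["Hernando, FL", "Albany, GA"]
loads_dest = ["Port Huron, MI", "Dayton, OH"]

# zip() pairs the lists up position by position; unpacking gives each
# field a readable name instead of item[0], item[1], item[2].
lines = [
    f"{contact} from {origin} to {dest}"
    for contact, origin, dest in zip(loads_contact, loads_origin, loads_dest)
]
print("\n".join(lines))
```

Note that zip() stops at the shortest list, so a row with a missing origin or destination would silently shorten the output.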

Related

Is there a better way to find specific value in a python dictionary like in list?

I have been practicing on iterating through dictionary and list in Python.
The source file is a CSV document containing Country and Capital columns. It seems I have to go through two for loops over country_dict to produce the same printed result that I get directly from country_list and capital_list.
Is there a better way to do this with a Python dictionary?
The code:
import csv
path = #Path_to_CSV_File
country_list=[]
capital_list=[]
country_dict={'Country':[],'Capital':[]}
with open(path, mode='r') as data:
    for line in csv.DictReader(data):
        locals().update(line)
        country_dict['Country'].append(Country)
        country_dict['Capital'].append(Capital)
        country_list.append(Country)
        capital_list.append(Capital)
i=14 #set pointer value to the 15th row in the csv document
#---------------------- Iterating through Dictionary using for loops---------------------------
if i >= (len(country_dict['Country'])-1):
    print("out of bound")
for count1, element in enumerate(country_dict['Country']):
    if count1==i:
        print('Country = ' + element)
for count2, element in enumerate(country_dict['Capital']):
    if count2==i:
        print('Capital = ' + element)
#--------------------------------Direct print for list----------------------------------------
print('Country = ' + country_list[i] + '\nCapital = ' + capital_list[i])
The output:
Country = Djibouti
Capital = Djibouti (city)
Country = Djibouti
Capital = Djibouti (city)
The CSV file content:
Country,Capital
Algeria,Algiers
Angola,Luanda
Benin,Porto-Novo
Botswana,Gaborone
Burkina Faso,Ouagadougou
Burundi,Gitega
Cabo Verde,Praia
Cameroon,Yaounde
Central African Republic,Bangui
Chad,N'Djamena
Comoros,Moroni
"Congo, Democratic Republic of the",Kinshasa
"Congo, Republic of the",Brazzaville
Cote d'Ivoire,Yamoussoukro
Djibouti,Djibouti (city)
Egypt,Cairo
Equatorial Guinea,"Malabo (de jure), Oyala (seat of government)"
Eritrea,Asmara
Eswatini (formerly Swaziland),"Mbabane (administrative), Lobamba (legislative, royal)"
Ethiopia,Addis Ababa
Gabon,Libreville
Gambia,Banjul
Ghana,Accra
Guinea,Conakry
Guinea-Bissau,Bissau
Kenya,Nairobi
Lesotho,Maseru
Liberia,Monrovia
Libya,Tripoli
Madagascar,Antananarivo
Malawi,Lilongwe
Mali,Bamako
Mauritania,Nouakchott
Mauritius,Port Louis
Morocco,Rabat
Mozambique,Maputo
Namibia,Windhoek
Niger,Niamey
Nigeria,Abuja
Rwanda,Kigali
Sao Tome and Principe,São Tomé
Senegal,Dakar
Seychelles,Victoria
Sierra Leone,Freetown
Somalia,Mogadishu
South Africa,"Pretoria (administrative), Cape Town (legislative), Bloemfontein (judicial)"
South Sudan,Juba
Sudan,Khartoum
Tanzania,Dodoma
Togo,Lomé
Tunisia,Tunis
Uganda,Kampala
Zambia,Lusaka
Zimbabwe,Harare
I am not sure I understand the question, but please check the code below. It keys the dictionary by row index, so no loop is needed for the lookup.
import csv
path = #Path_to_CSV_File
country_dict={}
with open(path, mode='r') as data:
    lines = csv.DictReader(data)
    for idx,line in enumerate(lines):
        # Read the fields from the row directly instead of injecting them into locals()
        country_dict[idx] = {"Country":line["Country"],"Capital":line["Capital"]}
i=14 #set pointer value to the 15th row in the csv document
#---------------------- Direct lookup in the dictionary by row index ---------------------------
country_info = country_dict.get(i)
#--------------------------------Print the result----------------------------------------
print('Country = ' + country_info['Country'] + '\nCapital = ' + country_info['Capital'])
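Another option, if all you need is lookup by row number, is to keep the rows from csv.DictReader in a list and index it. This is only a sketch: the in-memory sample stands in for the real file, and the helper name describe_row is made up for illustration.

```python
import csv
import io

# Stand-in for the real CSV file (the first rows of the data in the question).
sample = io.StringIO("Country,Capital\nAlgeria,Algiers\nAngola,Luanda\nBenin,Porto-Novo\n")

# Each row becomes a dict such as {'Country': 'Angola', 'Capital': 'Luanda'}.
rows = list(csv.DictReader(sample))

def describe_row(rows, i):
    """Format the country/capital pair at row i, or report an out-of-range index."""
    if not 0 <= i < len(rows):
        return "out of bound"
    row = rows[i]
    return "Country = " + row["Country"] + "\nCapital = " + row["Capital"]

print(describe_row(rows, 1))
```

With the original dictionary of lists, the same direct access is country_dict['Country'][i] and country_dict['Capital'][i], so neither for loop is required.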

BeautifulSoup trying to get text from wrapped divs but empty or "none" is being returned

Here is a picture (sorry) of the HTML that I am trying to parse:
I am using this line:
home_stats = soup.select_one('div', class_='statText:nth-child(1)').text
Thinking that I'd get the 1st child of the class statText and the outcome would be 53%.
But it's not. I get "Loading..." and none of the data that I was trying to use and display.
The full code I have so far:
soup = BeautifulSoup(source, 'lxml')
home_team = soup.find('div', class_='tname-home').a.text
away_team = soup.find('div', class_='tname-away').a.text
home_score = soup.select_one('.current-result .scoreboard:nth-child(1)').text
away_score = soup.select_one('.current-result .scoreboard:nth-child(2)').text
print("The home team is " + home_team, "and they scored " + home_score)
print()
print("The away team is " + away_team, "and they scored " + away_score)
home_stats = soup.select_one('div', class_='statText:nth-child(1)').text
print(home_stats)
This currently does print the home and away teams and the number of goals they scored. But I can't seem to get any of the statistical content from this site.
My output plan is to have:
[home_team] had 53% ball possession and [away_team] had 47% ball possession
However, I would like to remove the "%" symbols from the parse (but that's not essential). My plan is to use these numbers for more stats later on, so the % symbol gets in the way.
Apologies for the noob question - this is the absolute beginning of my Pythonic journey. I have scoured the internet and StackOverflow and just can not find this situation - I also possibly don't know exactly what I am looking for either.
Thanks kindly for your help! May your answer be the one I pick as "correct" ;)
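Assuming that this is the website you are trying to scrape, here is the complete code to scrape all the stats. The statistics appear to be filled in by JavaScript (hence the "Loading..." text), so the code uses Selenium to get the rendered page source: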
from bs4 import BeautifulSoup
from selenium import webdriver
import pandas as pd
driver = webdriver.Chrome('chromedriver.exe')
driver.get('https://www.scoreboard.com/en/match/SO3Fg7NR/#match-statistics;0')
pg = driver.page_source #Gets the source code of the page
driver.close()
soup = BeautifulSoup(pg,'html.parser') #Creates a soup object
statrows = soup.find_all('div',class_ = "statTextGroup") #Finds all the div tags with class statTextGroup -- these div tags contain the stats
#Scrapes the team names
teams = soup.find_all('a',class_ = "participant-imglink")
teamslst = []
for x in teams:
    team = x.text.strip()
    if team != "":
        teamslst.append(team)
stats_dict = {}
count = 0
for x in statrows:
    txt = x.text
    final_txt = ""
    stat = ""
    alphabet = False
    percentage = False
    #Extracts the numbers from the text
    for c in txt:
        if c in '0123456789':
            final_txt+=c
        else:
            if alphabet == False:
                final_txt+= "-"
                alphabet = True
            if c != "%":
                stat += c
            else:
                percentage = True
    values = final_txt.split('-')
    #Appends the values to the dictionary
    for x in values:
        if stat in stats_dict.keys():
            if percentage == True:
                stats_dict[stat].append(x + "%")
            else:
                stats_dict[stat].append(int(x))
        else:
            if percentage == True:
                stats_dict[stat] = [x + "%"]
            else:
                stats_dict[stat] = [int(x)]
    count += 1
    if count == 15:
        break
index = [teamslst[0],teamslst[1]]
#Creates a pandas DataFrame out of the dictionary
df = pd.DataFrame(stats_dict,index = index).T
print(df)
Output:
                    Burnley  Southampton
Ball Possession     53%      47%
Goal Attempts       10       5
Shots on Goal       2        1
Shots off Goal      4        2
Blocked Shots       4        2
Free Kicks          11       10
Corner Kicks        8        2
Offsides            2        1
Goalkeeper Saves    0        2
Fouls               8        10
Yellow Cards        1        0
Total Passes        522      480
Tackles             15       12
Attacks             142      105
Dangerous Attacks   44       29
Hope that this helps!
P.S.: I actually wrote this code for a different question, but I didn't post it because an answer had already been posted. I didn't know it would come in handy now! Anyway, I hope that my answer does what you need.
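Since the question also asks about dropping the "%" so the numbers can be reused later, here is a hedged sketch of the parsing step on its own, using regular expressions instead of the character loop. It assumes each stat row's text looks like '53%Ball Possession47%' (home value, stat name, away value), which is what the loop above implies; the function name parse_stat_row is made up for illustration.

```python
import re

def parse_stat_row(text):
    """Split a row such as '53%Ball Possession47%' into (name, [home, away]).

    The values are returned as plain integers, so the '%' sign is dropped.
    """
    values = [int(n) for n in re.findall(r"\d+", text)]  # every run of digits
    name = re.sub(r"[\d%]", "", text).strip()            # what is left is the stat name
    return name, values

name, (home, away) = parse_stat_row("53%Ball Possession47%")
print(f"Home had {home}% {name.lower()} and away had {away}% {name.lower()}")
```

The integers can then go straight into arithmetic or a DataFrame without any further cleanup.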

Convert in utf16

I am crawling several websites and extracting the names of the products. Some of the names contain errors like this:
Malecon 12 Jahre 0,05 ltr.<br>Reserva Superior
Bols Watermelon Lik\u00f6r 0,7l
Hayman\u00b4s Sloe Gin
Ron Zacapa Edici\u00f3n Negra
Havana Club A\u00f1ejo Especial
Caol Ila 13 Jahre (G&M Discovery)
How can I fix that?
I am using xpath and re.search to get the names.
In every Python file, this is the first code: # -*- coding: utf-8 -*-
Edit:
This is the source code that shows how I get the information.
if '"articleName":' in details:
    closer_to_product = details.split('"articleName":', 1)[1]
    closer_to_product_2 = closer_to_product.split('"imageTitle', 1)[0]
    if debug_product == 1:
        print('product before try:' + repr(closer_to_product_2))
    try:
        # Capture the text between the first double quote and the closing '",'
        found_product = re.search(r'"(.*?)",', closer_to_product_2).group(1)
    except AttributeError:
        found_product = ''
    if debug_product == 1:
        print('cleared product: ', '>>>' + repr(found_product) + '<<<')
    if not found_product:
        print(product_detail_page, found_product)
        items['products'] = 'default'
    else:
        items['products'] = found_product
Details
product_details = information.xpath('/*').extract()
product_details = [details.strip() for details in product_details]
Where is the problem (Python 3.8.3)?
import html
strings = [
    'Bols Watermelon Lik\u00f6r 0,7l',
    'Hayman\u00b4s Sloe Gin',
    'Ron Zacapa Edici\u00f3n Negra',
    'Havana Club A\u00f1ejo Especial',
    'Caol Ila 13 Jahre (G&M Discovery)',
    'Old Pulteney \\u00b7 12 Years \\u00b7 40% vol',
    'Killepitsch Kr\\u00e4uterlik\\u00f6r 42% 0,7 L']
for s in strings:  # renamed from str to avoid shadowing the built-in
    print(html.unescape(s)
          .encode('raw_unicode_escape')
          .decode('unicode_escape'))
Output:
Bols Watermelon Likör 0,7l
Hayman´s Sloe Gin
Ron Zacapa Edición Negra
Havana Club Añejo Especial
Caol Ila 13 Jahre (G&M Discovery)
Old Pulteney · 12 Years · 40% vol
Killepitsch Kräuterlikör 42% 0,7 L
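Edit: Use .encode('raw_unicode_escape').decode('unicode_escape') for strings with doubled backslashes (reverse solidi), that is, where the \uXXXX sequence is literally present in the text. See Python Specific Encodings in the codecs documentation.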
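If the text being searched is actually JSON (the "articleName" and "imageTitle" keys in the question suggest so, but that is an assumption), a JSON parser decodes the \uXXXX escapes on its own, and html.unescape then handles entities such as &amp;. A minimal sketch; the fragment below and its "imageTitle" value are hypothetical stand-ins for the real page data:

```python
import html
import json

# Hypothetical JSON fragment shaped like the data in the question.
# The raw string keeps the backslash, as it would appear in the page source.
details = r'{"articleName": "Bols Watermelon Lik\u00f6r 0,7l", "imageTitle": "bols.jpg"}'

def extract_article_name(fragment):
    """Parse the JSON and return the product name with escapes and entities decoded."""
    data = json.loads(fragment)                # turns \u00f6 into the character ö
    return html.unescape(data["articleName"])  # turns &amp; into &

name = extract_article_name(details)
print(name)
```

This only works when the whole fragment is valid JSON; for text cut out with split() and re.search, the encode/decode approach above still applies.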

Data Analysis using Python

I have 2 CSV files. The first has city name, population and humidity. In the second, cities are mapped to states. I want to get the state-wise total population and average humidity. Can someone help? Here is an example:
CSV 1:
CityName,population,humidity
Austin,1000,20
Sanjose,2200,10
Sacramento,500,5
CSV 2:
State,city name
Ca,Sanjose
Ca,Sacramento
Texas,Austin
I would like to get this output (total population and average humidity per state):
Ca,2700,7.5
Texas,1000,20
The dictionary-based solution doesn't work because a dictionary keeps only one value per key, so each state ends up with a single city. I gave up and finally used a loop. The code below works; the input is shown too.
csv1
state_name,city_name
CA,sacramento
utah,saltlake
CA,san jose
Utah,provo
CA,sanfrancisco
TX,austin
TX,dallas
OR,portland
csv2
city_name,population,humidity
sacramento,1000,1
saltlake,300,5
san jose,500,2
provo,100,7
sanfrancisco,700,3
austin,2000,4
dallas,2500,5
portland,300,6
import csv
from pandas import read_csv
def mapping_within_dataframe(self, file1,file2,file3):
    self.csv1 = file1
    self.csv2 = file2
    self.outcsv = file3
    one_state_data = 0
    # Open the output path itself, not the literal string 'self.outcsv'
    outfile = csv.writer(open(self.outcsv, 'w'), delimiter=',')
    state_city = read_csv(self.csv1)
    city_data = read_csv(self.csv2)
    all_state = list(set(state_city.state_name))
    for one_state in all_state:
        one_state_cities = list(state_city.loc[state_city.state_name == one_state, "city_name"])
        one_state_data = 0
        for one_city in one_state_cities:
            one_city_data = city_data.loc[city_data.city_name == one_city, "population"].sum()
            one_state_data = one_state_data + one_city_data
        print one_state, one_state_data
        # Write one row per state: name and total population
        outfile.writerow([one_state, one_state_data])
def output(file1, file2):
    f = lambda x: x.strip() #strips newline and white space characters
    with open(file1) as cities:
        with open(file2) as states:
            states_dict = {}
            cities_dict = {}
            for line in states:
                line = line.split(',')
                states_dict[f(line[0])] = f(line[1])
            for line in cities:
                line = line.split(',')
                cities_dict[f(line[0])] = (int(f(line[1])) , int(f(line[2])))
    for state , city in states_dict.iteritems():
        try:
            print state, cities_dict[city]
        except KeyError:
            pass
output(CSV1,CSV2) #these are the names of the files
This gives the output you wanted. Just make sure the names of cities in both files are the same in terms of capitalization.
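For comparison, here is a sketch in Python 3 that aggregates per state using only the standard library. It maps each city to its state (rather than each state to one city), so several cities per state are handled. The column names come from the example in the question; the function name state_summary and the in-memory sample files are assumptions made for illustration.

```python
import csv
import io
from collections import defaultdict

def state_summary(cities_file, states_file):
    """Return {state: (total_population, average_humidity)}."""
    # city -> state, so a state can own any number of cities
    city_to_state = {row["city name"]: row["State"]
                     for row in csv.DictReader(states_file)}
    # state -> [population sum, humidity sum, number of cities]
    totals = defaultdict(lambda: [0, 0.0, 0])
    for row in csv.DictReader(cities_file):
        state = city_to_state.get(row["CityName"])
        if state is None:
            continue  # city without a state mapping is skipped
        acc = totals[state]
        acc[0] += int(row["population"])
        acc[1] += float(row["humidity"])
        acc[2] += 1
    return {state: (pop, hum / n) for state, (pop, hum, n) in totals.items()}

# Stand-ins for the two CSV files from the question.
cities = io.StringIO("CityName,population,humidity\nAustin,1000,20\nSanjose,2200,10\nSacramento,500,5\n")
states = io.StringIO("State,city name\nCa,Sanjose\nCa,Sacramento\nTexas,Austin\n")

summary = state_summary(cities, states)
for state in sorted(summary):
    population, humidity = summary[state]
    print(f"{state},{population},{humidity}")
```

With real files, pass the objects returned by open(path, newline='') instead of the StringIO stand-ins. City names must match exactly in both files, including capitalization.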

html scraping using python topboxoffice list from imdb website

URL: http://www.imdb.com/chart/?ref_=nv_ch_cht_2
I want to print the top box office list from the above site (each movie's rank, title, weekend, gross, and weeks, in order).
Example output:
Rank:1
title: godzilla
weekend:$93.2M
Gross:$93.2M
Weeks: 1
Rank: 2
title: Neighbours
This is just a simple way to extract those entities using BeautifulSoup:
from bs4 import BeautifulSoup
import urllib2
url = "http://www.imdb.com/chart/?ref_=nv_ch_cht_2"
data = urllib2.urlopen(url).read()
page = BeautifulSoup(data, 'html.parser')
rows = page.findAll("tr", {'class': ['odd', 'even']})
for tr in rows:
    for data in tr.findAll("td", {'class': ['titleColumn', 'weeksColumn','ratingColumn']}):
        print data.get_text()
P.S. Arrange the output however you like.
There is no need to scrape anything. See the answer I gave here:
How to scrape data from imdb business page?
The Python script below gives you (1) the list of top box office movies from IMDb and (2) the cast list for each of them.
from lxml.html import parse
def imdb_bo(no_of_movies=5):
    bo_url = 'http://www.imdb.com/chart/'
    bo_page = parse(bo_url).getroot()
    bo_table = bo_page.cssselect('table.chart')
    bo_total = len(bo_table[0][2])
    if no_of_movies <= bo_total:
        count = no_of_movies
    else:
        count = bo_total
    movies = {}
    for i in range(0, count):
        mo = {}
        mo['url'] = 'http://www.imdb.com'+bo_page.cssselect('td.titleColumn')[i][0].get('href')
        mo['title'] = bo_page.cssselect('td.titleColumn')[i][0].text_content().strip()
        mo['year'] = bo_page.cssselect('td.titleColumn')[i][1].text_content().strip(" ()")
        mo['weekend'] = bo_page.cssselect('td.ratingColumn')[i*2].text_content().strip()
        mo['gross'] = bo_page.cssselect('td.ratingColumn')[(i*2)+1][0].text_content().strip()
        mo['weeks'] = bo_page.cssselect('td.weeksColumn')[i].text_content().strip()
        m_page = parse(mo['url']).getroot()
        m_casttable = m_page.cssselect('table.cast_list')
        flag = 0
        mo['cast'] = []
        for cast in m_casttable[0]:
            if flag == 0:
                flag = 1
            else:
                m_starname = cast[1][0][0].text_content().strip()
                mo['cast'].append(m_starname)
        movies[i] = mo
    return movies
if __name__ == '__main__':
    no_of_movies = raw_input("Enter no. of Box office movies to display:")
    bo_movies = imdb_bo(int(no_of_movies))
    for k,v in bo_movies.iteritems():
        print '#'+str(k+1)+' '+v['title']+' ('+v['year']+')'
        print 'URL: '+v['url']
        print 'Weekend: '+v['weekend']
        print 'Gross: '+v['gross']
        print 'Weeks: '+v['weeks']
        print 'Cast: '+', '.join(v['cast'])
        print '\n'
Output (run in terminal):
parag@parag-innovate:~/python$ python imdb_bo_scraper.py
Enter no. of Box office movies to display:3
#1 Cinderella (2015)
URL: http://www.imdb.com/title/tt1661199?ref_=cht_bo_1
Weekend: $67.88M
Gross: $67.88M
Weeks: 1
Cast: Cate Blanchett, Lily James, Richard Madden, Helena Bonham Carter, Nonso Anozie, Stellan Skarsgård, Sophie McShera, Holliday Grainger, Derek Jacobi, Ben Chaplin, Hayley Atwell, Rob Brydon, Jana Perez, Alex Macqueen, Tom Edden
#2 Run All Night (2015)
URL: http://www.imdb.com/title/tt2199571?ref_=cht_bo_2
Weekend: $11.01M
Gross: $11.01M
Weeks: 1
Cast: Liam Neeson, Ed Harris, Joel Kinnaman, Boyd Holbrook, Bruce McGill, Genesis Rodriguez, Vincent D'Onofrio, Lois Smith, Common, Beau Knapp, Patricia Kalember, Daniel Stewart Sherman, James Martinez, Radivoje Bukvic, Tony Naumovski
#3 Kingsman: The Secret Service (2014)
URL: http://www.imdb.com/title/tt2802144?ref_=cht_bo_3
Weekend: $6.21M
Gross: $107.39M
Weeks: 5
Cast: Adrian Quinton, Colin Firth, Mark Strong, Jonno Davies, Jack Davenport, Alex Nikolov, Samantha Womack, Mark Hamill, Velibor Topic, Sofia Boutella, Samuel L. Jackson, Michael Caine, Taron Egerton, Geoff Bell, Jordan Long
