Converting from HTML to CSV using Python

I'm trying to convert a table found on a website (full details and photo below) to a CSV. I've started with the code below, but the table lookup isn't returning anything. I think it must have something to do with me not using the right selector for the table, but any additional help toward that ultimate goal would be appreciated.
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
import pandas as pd
url = 'https://www.privateequityinternational.com/database/#/pei-300'
page = requests.get(url) #gets info from page
soup = BeautifulSoup(page.content,'html.parser') #parses information
table = soup.findAll('table',{'class':'au-target pux--responsive-table'}) #collecting blocks of info inside of table
table
Output: []
In addition to the URL provided in the above code, I'm essentially trying to convert the below table (found on the website) to a CSV file:

The data is loaded from an external URL via Ajax. You can use the requests/json modules to get it:
import json
import requests
url = 'https://ra.pei.blaize.io/api/v1/institutions/pei-300s?count=25&start=0'
data = requests.get(url).json()
# uncomment this to print all data:
# print(json.dumps(data, indent=4))
for item in data['data']:
    print('{:<5} {:<30} {}'.format(item['id'], item['name'], item['headquarters']))
Prints:
5611 Blackstone New York, United States
5579 The Carlyle Group Washington DC, United States
5586 KKR New York, United States
6701 TPG Fort Worth, United States
5591 Warburg Pincus New York, United States
1801 NB Alternatives New York, United States
6457 CVC Capital Partners Luxembourg, Luxembourg
6477 EQT Stockholm, Sweden
6361 Advent International Boston, United States
8411 Vista Equity Partners Austin, United States
6571 Leonard Green & Partners Los Angeles, United States
6782 Cinven London, United Kingdom
6389 Bain Capital Boston, United States
8096 Apollo Global Management New York, United States
8759 Thoma Bravo San Francisco, United States
7597 Insight Partners New York, United States
867 BlackRock New York, United States
5471 General Atlantic New York, United States
6639 Permira Advisers London, United Kingdom
5903 Brookfield Asset Management Toronto, Canada
6473 EnCap Investments Houston, United States
6497 Francisco Partners San Francisco, United States
6960 Platinum Equity Beverly Hills, United States
16331 Hillhouse Capital Group Hong Kong, Hong Kong
5595 Partners Group Baar-Zug, Switzerland
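
Since the ultimate goal is a CSV, you can write the same JSON straight to a file. A minimal sketch, assuming the endpoint keeps paging through the count/start query parameters seen in the URL above and eventually returns an empty data list (worth verifying before relying on it):
import csv
import requests

url = 'https://ra.pei.blaize.io/api/v1/institutions/pei-300s?count=25&start={}'

with open('pei300.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['id', 'name', 'headquarters'])       # header row
    start = 0
    while True:
        data = requests.get(url.format(start)).json()
        items = data['data']
        if not items:                                      # stop once a page comes back empty
            break
        for item in items:
            writer.writerow([item['id'], item['name'], item['headquarters']])
        start += 25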

And selenium version:
from bs4 import BeautifulSoup
from selenium import webdriver
import pandas as pd
import time
driver = webdriver.Firefox(executable_path='c:/program/geckodriver.exe')
url = 'https://www.privateequityinternational.com/database/#/pei-300'
driver.get(url) #gets info from page
time.sleep(5)
page = driver.page_source
driver.close()
soup = BeautifulSoup(page,'html.parser') #parses information
table = soup.select_one('table.au-target.pux--responsive-table') #collecting blocks of info inside of table
dfs = pd.read_html(table.prettify())
df = pd.concat(dfs)
df.to_csv('file.csv')
print(df.head(25))
prints:
Ranking Name City, Country (HQ)
0 1 Blackstone New York, United States
1 2 The Carlyle Group Washington DC, United States
2 3 KKR New York, United States
3 4 TPG Fort Worth, United States
4 5 Warburg Pincus New York, United States
5 6 NB Alternatives New York, United States
6 7 CVC Capital Partners Luxembourg, Luxembourg
7 8 EQT Stockholm, Sweden
8 9 Advent International Boston, United States
9 10 Vista Equity Partners Austin, United States
10 11 Leonard Green & Partners Los Angeles, United States
11 12 Cinven London, United Kingdom
12 13 Bain Capital Boston, United States
13 14 Apollo Global Management New York, United States
14 15 Thoma Bravo San Francisco, United States
15 16 Insight Partners New York, United States
16 17 BlackRock New York, United States
17 18 General Atlantic New York, United States
18 19 Permira Advisers London, United Kingdom
19 20 Brookfield Asset Management Toronto, Canada
20 21 EnCap Investments Houston, United States
21 22 Francisco Partners San Francisco, United States
22 23 Platinum Equity Beverly Hills, United States
23 24 Hillhouse Capital Group Hong Kong, Hong Kong
24 25 Partners Group Baar-Zug, Switzerland
It also saves the data to file.csv.
Note: you need selenium and geckodriver, and in this code geckodriver is expected to be at c:/program/geckodriver.exe.
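If you don't want a browser window popping up, Firefox can also be run headless; a short sketch of the same setup (same geckodriver path assumption as above, and note that newer selenium releases may prefer a Service object over executable_path):
from selenium import webdriver
from selenium.webdriver.firefox.options import Options

options = Options()
options.add_argument('-headless')   # run Firefox without opening a window
driver = webdriver.Firefox(executable_path='c:/program/geckodriver.exe', options=options)
# ...the rest of the scraping code stays the same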

Related

Is it possible to explode a string of space-separated substrings when the substrings are countries?

I have a dataset like this. The countries are separated by spaces; how can I explode them into multiple columns?
I am a beginner in Python!
data = {'COUNRTY_LIST': ['UNITED STATES SAUDI ARABIA IRAQ PAKISTAN', 'GERMANY PAKISTAN SWITZERLAND CANADA', 'SAINT LUCIA VIRGIN ISLANDS', 'UNITED ARAB EMIRATES UNITED KINGDOM'], 'UNIQ_ID': [20, 21, 19, 18]}
OUTPUT I NEED
UNIQ_ID | COUNTRY_LIST                             | SPLIT_1              | SPLIT_2        | SPLIT_3     | SPLIT_4
20      | UNITED STATES SAUDI ARABIA IRAQ PAKISTAN | UNITED STATES        | SAUDI ARABIA   | IRAQ        | PAKISTAN
21      | GERMANY PAKISTAN SWITZERLAND CANADA      | GERMANY              | PAKISTAN       | SWITZERLAND | CANADA
19      | SAINT LUCIA VIRGIN ISLANDS               | SAINT LUCIA          | VIRGIN ISLANDS |             |
18      | UNITED ARAB EMIRATES UNITED KINGDOM      | UNITED ARAB EMIRATES | UNITED KINGDOM |             |
I was trying to create an array of countries and then loop through the list and explode. But did not succeed.
Any help would be highly appreciated
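
One possible approach, sketched under the assumption that the multi-word country names are known in advance (the multi_word list below comes from the sample and would need extending for real data): protect their internal spaces with underscores before splitting, then expand into columns.
import pandas as pd

data = {'COUNRTY_LIST': ['UNITED STATES SAUDI ARABIA IRAQ PAKISTAN',
                         'GERMANY PAKISTAN SWITZERLAND CANADA',
                         'SAINT LUCIA VIRGIN ISLANDS',
                         'UNITED ARAB EMIRATES UNITED KINGDOM'],
        'UNIQ_ID': [20, 21, 19, 18]}
df = pd.DataFrame(data)

# multi-word names that must stay together -- an assumption; extend for your own data
multi_word = ['UNITED ARAB EMIRATES', 'UNITED STATES', 'UNITED KINGDOM',
              'SAUDI ARABIA', 'SAINT LUCIA', 'VIRGIN ISLANDS']

def split_countries(s):
    for name in multi_word:
        s = s.replace(name, name.replace(' ', '_'))   # protect internal spaces
    return [c.replace('_', ' ') for c in s.split()]

splits = df['COUNRTY_LIST'].apply(split_countries).apply(pd.Series)
splits.columns = ['SPLIT_{}'.format(i + 1) for i in splits.columns]
df = df.join(splits)
print(df)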

extract `state` from address using geopy

I have a table with 10,000 address entries. I would like to obtain the full formatted address and the state for each entry.
How could I do it with the geopy module? Other modules such as geopandas are not available.
Example
address
------------------------------------------------------------------------------
Prosperity Drive Northwest, Huntsville, Alabama, United States
------------------------------------------------------------------------------
Eureka Drive, Fremont, Newark, Alameda County, 94538, United States
------------------------------------------------------------------------------
Messenger Loop NW, Facebook Data Center, Los Lunas, Valencia County, New Mexico
Desired
address with format | state
----------------------------------------------------------------------------------------------
Prosperity Drive Northwest, Huntsville, Madison County, Alabama, 35773, United States | Alabama
----------------------------------------------------------------------------------------------
Eureka Drive, Fremont, Newark, Alameda County, California, 94538, United States | California
----------------------------------------------------------------------------------------------
Messenger Loop NW, Facebook Data Center, Los Lunas, Valencia County, New Mexico, 87031, United States | New Mexico
Thank you for your time.
Note
I am aware of how to use the Nominatim function and the extra module that can cope with a pandas dataframe, but that extra module is not available in this case.
from geopy.geocoders import Nominatim
geolocator = Nominatim(user_agent="specify_your_app_name_here")
location = geolocator.geocode("Eureka Drive, Fremont, Newark, Alameda County, 94538, United States")
print(location.address)
Output:
Eureka Drive, Fremont, Newark, Alameda County, California, 94538, United States
Exactly as it's used in the documentation...
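
A hedged sketch of one way to get the state: Nominatim can return the address broken into components when addressdetails=True is passed to geocode, and the state then sits under location.raw['address'] (the exact keys vary by country); with 10,000 rows you would also want to throttle requests, e.g. one per second.
import time
from geopy.geocoders import Nominatim

geolocator = Nominatim(user_agent="specify_your_app_name_here")

def full_address_and_state(addr):
    location = geolocator.geocode(addr, addressdetails=True)
    time.sleep(1)                                  # stay under Nominatim's rate limit
    if location is None:
        return None, None
    state = location.raw.get('address', {}).get('state')
    return location.address, state

print(full_address_and_state(
    "Eureka Drive, Fremont, Newark, Alameda County, 94538, United States"))
# expected something like:
# ('Eureka Drive, Fremont, Newark, Alameda County, California, 94538, United States', 'California')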

How to extract only the n-th HTML title tag in a sequence in Python with BeautifulSoup?

I'm trying to extract data from a Wikipedia table (https://en.wikipedia.org/wiki/NBA_Most_Valuable_Player_Award) about the MVP winners over NBA history.
This is my code:
import requests
from bs4 import BeautifulSoup

wik_req = requests.get("https://en.wikipedia.org/wiki/NBA_Most_Valuable_Player_Award")
wik_webpage = wik_req.content
soup = BeautifulSoup(wik_webpage, "html.parser")
my_table = soup('table', {"class":"wikitable plainrowheaders sortable"})[0].find_all('a')
print(my_table)
for x in my_table:
    test = x.get("title")
    print(test)
However, this code prints all HTML title tags of the table as in the following (short version):
'1955–56 NBA season
Bob Pettit
Power Forward (basketball)
United States
St. Louis Hawks
1956–57 NBA season
Bob Cousy
Point guard
Boston Celtics'
Eventually, I want to create a pandas dataframe in which I store all the season years in a column, all the player years in a column, and so on and so forth. What code does the trick to only print one of the HTML tag titles (e.g. only the NBA season years)? I can then store those into a column to set up my dataframe and do the same with player, position, nationality and team.
All you should need for that dataframe is:
import pandas as pd
url = "https://en.wikipedia.org/wiki/NBA_Most_Valuable_Player_Award"
df=pd.read_html(url)[5]
Output:
print(df)
Season Player ... Nationality Team
0 1955–56 Bob Pettit* ... United States St. Louis Hawks
1 1956–57 Bob Cousy* ... United States Boston Celtics
2 1957–58 Bill Russell* ... United States Boston Celtics (2)
3 1958–59 Bob Pettit* (2) ... United States St. Louis Hawks (2)
4 1959–60 Wilt Chamberlain* ... United States Philadelphia Warriors
.. ... ... ... ... ...
59 2014–15 Stephen Curry^ ... United States Golden State Warriors (2)
60 2015–16 Stephen Curry^ (2) ... United States Golden State Warriors (3)
61 2016–17 Russell Westbrook^ ... United States Oklahoma City Thunder (2)
62 2017–18 James Harden^ ... United States Houston Rockets (4)
63 2018–19 Giannis Antetokounmpo^ ... Greece Milwaukee Bucks (4)
[64 rows x 5 columns]
If you really want to stick with BeautifulSoup, here's an example to get you started:
my_table = soup('table', {"class":"wikitable plainrowheaders sortable"})[0]
season_col = []
for row in my_table.find_all('tr')[1:]:
    season = row.findChildren(recursive=False)[0]
    season_col.append(season.text.strip())
I expect there may be some differences between columns, but as you indicated you want to get familiar with BeautifulSoup, that's for you to explore :)
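
If you do want to build the frame from the soup yourself, here is a short sketch continuing from the my_table and pandas import above (it assumes the first two cells of each row are the season and the player, which is worth verifying against the page):
rows = []
for row in my_table.find_all('tr')[1:]:
    cells = row.find_all(['th', 'td'], recursive=False)   # direct children of the row: header cell + data cells
    if len(cells) >= 2:
        rows.append({'Season': cells[0].text.strip(),
                     'Player': cells[1].text.strip()})
df = pd.DataFrame(rows)
print(df.head())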

Filling out empty cells with lists of values

I have a data frame that looks like below:
City State Country
Chicago IL United States
Boston
San Diego CA United States
Los Angeles CA United States
San Francisco
Sacramento
Vancouver BC Canada
Toronto
And I have 3 lists of values that are ready to fill in the None cells:
city = ['Boston', 'San Francisco', 'Sacramento', 'Toronto']
state = ['MA', 'CA', 'CA', 'ON']
country = ['United States', 'United States', 'United States', 'Canada']
The order of the elements in these lists corresponds across all three. Thus, the first items of the lists match each other, and so forth. How can I fill out the empty cells and produce a result like below?
City State Country
Chicago IL United States
Boston MA United States
San Diego CA United States
Los Angeles CA United States
San Francisco CA United States
Sacramento CA United States
Vancouver BC Canada
Toronto ON Canada
My code gives me an error and I'm stuck.
if df.loc[df['City'] == 'Boston']:
    'State' = 'MA'
Any solution is welcome. Thank you.
Create two mappings, one for <city : state>, and another for <city : country>.
city_map = dict(zip(city, state))
country_map = dict(zip(city, country))
Next, set City as the index so the mappings align on the city names -
df = df.set_index('City')
And, finally, use fillna with the mappings (wrapped in a Series so they align on that index) to fill in only the missing values -
df['State'] = df['State'].fillna(pd.Series(city_map))
df['Country'] = df['Country'].fillna(pd.Series(country_map))
As an extra final step, call df.reset_index() to turn City back into a column.
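If you would rather not touch the index at all, the same fill can be done in place with map plus fillna, so only the missing cells are changed:
df['State'] = df['State'].fillna(df['City'].map(city_map))
df['Country'] = df['Country'].fillna(df['City'].map(country_map))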

BeautifulSoup not returning .text in csv and extra unwanted numbers

I have this code
import requests
import csv
from bs4 import BeautifulSoup
from time import sleep
f = csv.writer(open('destinations.csv', 'w'))
f.writerow(['Destinations', 'Country'])
pages = []
for i in range(1, 3):
    url = 'http://www.travelindicator.com/destinations?page=' + str(i)
    pages.append(url.decode('utf-8'))
for item in pages:
    page = requests.get(item, sleep(2))
    soup = BeautifulSoup(page.content.text, 'lxml')
    for destinations_list in soup.select('.news-a header'):
        destination = soup.select('h2 a')
        country = soup.select('p a')
        print(destinations_list.text)
        f.writerow([destinations_list])
which gives me the console answer of:
Ellora
1
3/5
India
Volterra
2
2/5
Italy
Hamilton
3
3/5
New Zealand
London
4
5/5
United Kingdom
Sun Moon Lake
5
5/5
Taiwan
Texel
6
etc...
Firstly, I am unsure why the extra numbers are being added, as I have only specified the parts I want for each country.
Secondly, when I try to format it into a CSV file, it doesn't remove the HTML even though I have told my soup to give me content.text. I've been trying to figure it out for an hour and am at a loss.
import requests
from bs4 import BeautifulSoup

pages = ['http://www.travelindicator.com/destinations?page=' + str(i) for i in range(1, 3)]
destinations_list = []
country = []
for item in pages:
    page = requests.get(item)
    soup = BeautifulSoup(page.content, 'lxml')
    for article in soup.find_all('article'):
        destinations_list.append(article.find('h2').text)   # destination name from the <h2>
        country_tag = article.find('p')                      # the country link sits inside the <p>
        country.append(country_tag.find('a').text)
        print(destinations_list[-1])
        print(country[-1])
Output:
Ellora
India
Volterra
Italy
Hamilton
New Zealand
London
United Kingdom
Sun Moon Lake
Taiwan
Texel
The Netherlands
Zhengzhou
China
Vladivostok
Russia
Charleston
United States
Banska Bystrica
Slovakia
Lviv
Ukraine
Viareggio
Italy
Wakkanai
Japan
Nordkapp
Norway
Jericoacoara
Brazil
Tainan
Taiwan
Boston
United States
Keelung
Taiwan
Stockholm
Sweden
Shaoxing
China
Bohol
Distance to you
Bohol
Philippines
Saint Petersburg
Russia
Malmo
Sweden
Elba
Italy
Gdansk
Poland
Langkawi
Malaysia
Poznan
Poland
Daegu
South Korea
Abu Simbel
Egypt
Melbourne
Australia
Reunion
Reunion
Annecy
France
Colombo
Sri Lanka
Penghu
Taiwan
Conwy
United Kingdom
Monterrico
Guatemala
Janakpur
Nepal
Bimini
Bahamas
Lake Tahoe
United States
Essaouira
Morocco
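
To also get the cleaned text into destinations.csv (the original goal), you can write the two lists out once the loop has finished; a small sketch reusing the destinations_list and country lists built above:
import csv

with open('destinations.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['Destinations', 'Country'])
    for dest, ctry in zip(destinations_list, country):
        writer.writerow([dest, ctry])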
