Output from webscraped page not appended to output from previous page - python

Similar to my last question, I'm having an iteration issue. I'm using the code
df1 = pd.DataFrame({'Username': [name.text for name in (soup.findAll('p',{'class':'profile-name'}))]})
to get the list of names from one web page. However, when I try this for all pages, it creates a new table for each page instead of appending each page's output to the previous one.
So for page 1 I'd get
Username
0 Alice
1 Bob
2 Carl
Page 2 :
Username
0 Sandra
1 Paula
2 Tim
etc. But I want my output to be:
Username
0 Alice
1 Bob
2 Carl
3 Sandra
4 Paula
5 Tim
Below is my full code (with the url omitted) for iterating through all the pages
for pageno in range(0, 99):
    page = requests.get('full url' + str(pageno))
    soup = BeautifulSoup(page.text, 'html.parser')
    df1 = pd.DataFrame({'Username': [name.text for name in soup.findAll('p', {'class': 'profile-name'})]})
How can I fix this?
Thank you.

Your issue is that you are creating a new DataFrame on each pass of the loop, so the previous pages' records are not kept.
You might want to append your usernames to a single list and then load that list into a DataFrame after the loop:
username_list = []
for pageno in range(0, 99):
    page = requests.get('full url' + str(pageno))
    soup = BeautifulSoup(page.text, 'html.parser')
    username_list += [name.text for name in soup.findAll('p', {'class': 'profile-name'})]
df1 = pd.DataFrame({'Username': username_list})

The question is pretty unclear, but I guess this is what you wanted?
output_df = pd.DataFrame()
for pageno in range(0, 99):
    page = requests.get('full url' + str(pageno))
    soup = BeautifulSoup(page.text, 'html.parser')
    df1 = pd.DataFrame({'Username': [name.text for name in soup.findAll('p', {'class': 'profile-name'})]})
    output_df = pd.concat([output_df, df1], ignore_index=True)  # ignore_index=True keeps the row numbering continuous across pages
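As a general note, not part of either answer above: concatenating inside the loop copies the accumulated DataFrame on every iteration, so for many pages it is usually cleaner to collect the per-page frames in a list and call pd.concat once after the loop. A minimal sketch of that pattern, reusing the placeholder URL from the question:
import requests
import pandas as pd
from bs4 import BeautifulSoup

frames = []
for pageno in range(0, 99):
    page = requests.get('full url' + str(pageno))  # same placeholder URL as in the question
    soup = BeautifulSoup(page.text, 'html.parser')
    names = [name.text for name in soup.find_all('p', {'class': 'profile-name'})]
    frames.append(pd.DataFrame({'Username': names}))

df1 = pd.concat(frames, ignore_index=True)  # one concat at the end, continuous 0..N index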

Related

Collect URL and text in HTML tables for Excel files using Pandas

I'm using Pandas read_html() to snag the HTML table from my Bookshelf pages on the website Goodreads.com.
Overall Question: Is there a way to get the links AND the text into the dataframe using Pandas? Otherwise, what's the next best method? I'm saving to Excel but it seems like it should apply to CSV too.
Final Goal: I need the resulting columns to be title, title_url, author, author_url, shelves, shelves_urls, date_started, date_finished, date_added
Version 1: Simple. Uses pd.read_html. Not sure where the data processing should be done: before, after, or during read_html? Can't figure out how to get URLs with this method. :(
dir = os.scandir(DOWNLOAD_PATH)
files = [entry.path for entry in dir if entry.is_dir() or entry.is_file()]
df_shelf = pd.DataFrame()
for file in files:
    with open(file, 'r', encoding='utf-8') as f:
        page_content = f.read()
    df = pd.read_html(page_content)
    df_shelf = pd.concat([df_shelf, df[0]])
test_filename = CWD / "shelf_list.xlsx"
df_shelf.to_excel(test_filename)
Results 1: (csv file, no URLs)
Unnamed: 0 title author my rating shelves date started date finished date added
0 Dead Speak (Cold Case Psychic #1) Pine, Pandora * 1 of 5 stars2 of 5 stars3 of 5 stars[ 4 of 5 stars ]5 of 5 stars currently-reading, 1-book-of-the-month 2022/02/01 2022/02/28 2022/02/02
1 Gifts of the Fairy Queen Crook, Amy * 1 of 5 stars2 of 5 stars[ 3 of 5 stars ]4 of 5 stars5 of 5 stars read, 0-gay, genre-fairytale-f..., genre-fantasy, profession-greent..., profession-mage-w..., theme-fantasy-of-... 2022/01/24
2 A Vanishing Glow (The Mystech Arcanum, #1-2) Radcliff, Alexis * 1 of 5 stars2 of 5 stars[ 3 of 5 stars ]4 of 5 stars5 of 5 stars read, 0-gay, genre-action-adve..., genre-political-i..., genre-sci-fi-fant..., genre-steam-punk-..., profession-captain, profession-military, profession-writer..., species-engineere... 2022/01/19
UPDATED (3/2/22): Version 2: This uses BeautifulSoup to do some data processing to create each column as a list where each element is [content text, [links]] with NaN to occupy empty space. I've overdone it with the lists. Any pointers on where I can simplify?
Hopefully I can proceed from here, but...
a. Is BeautifulSoup the only way to process it?
b. Is there a better approach?
df_shelf = pd.DataFrame([], columns=col_names)
shelf_pages = []
for file in files:
    with open(file, 'r', encoding='utf-8') as f:
        page_content = f.read()
    soup = BeautifulSoup(page_content, 'lxml')
    parsed_table = soup.find_all('table')[0].find_all('tr')
    page_rows = []
    for row in parsed_table:  # collect all rows on a page (list)
        row_data = []
        for td in row.find_all('td'):  # collect columns in each row (list)
            column_data = []
            cell_text = ''.join(td.stripped_strings).strip()
            link = td.find('a')
            if not cell_text and not link:
                continue
            if cell_text and not link:
                column_data = [cell_text, 'NaN']
            else:
                a_tags = td.find_all('a')
                urls = [GR_BASE_URL + str(url['href']) for url in a_tags]
                if not cell_text or 'view activity »edit' in cell_text:
                    column_data = ['NaN', urls]
                else:
                    column_data = [cell_text, urls]
            row_data.append(column_data)
        if not all('' in s for s in row_data):
            page_rows.append(row_data)
    if shelf_pages:  # collect all rows (list)
        shelf_pages = shelf_pages + page_rows
    else:
        shelf_pages = page_rows
    # [shelf_pages.append(row) for row in page_rows]  # slower than itertools
Results: The final table has lists of lists of lists of lists. The only field that should have lists is shelves.
The only info I've found implies that pandas.read_html does not have a way of extracting the href component. No one has replied to my question, so I'm closing it.
Most solutions I've found end up requiring processing via BeautifulSoup, like this example: convert html table to csv using pandas python
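For reference, here is a minimal sketch of that BeautifulSoup route which keeps one text column and one *_url column per cell instead of nested lists. The base URL and the use of each cell's class attribute as a column name are assumptions about the Goodreads markup, not something taken from the post above; pandas 1.5 and later also added an extract_links argument to read_html, which may remove the need for manual parsing altogether.
from bs4 import BeautifulSoup
import pandas as pd

GR_BASE_URL = "https://www.goodreads.com"  # assumed base URL for relative hrefs

def parse_shelf_table(page_content):
    # One output row per <tr>; each cell becomes a text column plus a matching *_url column.
    soup = BeautifulSoup(page_content, 'lxml')
    rows = []
    for tr in soup.find_all('table')[0].find_all('tr'):
        row = {}
        for i, td in enumerate(tr.find_all('td')):
            # use the cell's class as the column name when present, otherwise its position
            name = ' '.join(td.get('class', []) or [f'col{i}'])
            row[name] = ' '.join(td.stripped_strings)
            a = td.find('a', href=True)
            row[name + '_url'] = GR_BASE_URL + a['href'] if a else None
        if row:
            rows.append(row)
    return pd.DataFrame(rows)

Concatenating the per-file results with pd.concat then gives one flat table, leaving only the shelves column to be split into a list afterwards.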

Overwriting values in column created with Python for loop

I'm building an automated MLB schedule from a base URL and a loop through a list of team names as they appear in the URL. Using pd.read_html I get each team's schedule. The only thing I'm missing is, for each team's page, the team name itself, which I'd like as a new column 'team_name'. I have a small sample of my goal at the end of this post.
Below is what I have so far; if you run this, the printout does exactly what I need for just one team.
import pandas as pd

url_base = "https://www.teamrankings.com/mlb/team/"
team_list = ['seattle-mariners']
df = pd.DataFrame()
for team in team_list:
    new_url = url_base + team
    df = df.append(pd.read_html(new_url)[1])
    df['team_name'] = team
print(df[['team_name', 'Opponent']])
The trouble is, when I have all 30 teams in team_list, the value of team_name keeps getting overwritten, so that all 4000+ records list the same team name (the last one in team_list). I've tried dynamically assigning only certain rows the team value by using
df['team_name'][a:b] = team
where a, b are the starting and ending rows on the dataframe for the index team; but this gives KeyError: 'team_name'. I've also tried using placeholder series and dataframes for team_name, then merging with df later, but get duplication errors. On a larger scale, what I'm looking for is this:
team_name opponent
0 seattle-mariners new-york-yankees
1 seattle-mariners new-york-yankees
2 seattle-mariners boston-red-sox
3 seattle-mariners boston-red-sox
4 seattle-mariners san-diego-padres
5 seattle-mariners san-diego-padres
6 cincinatti-reds new-york-yankees
7 cincinatti-reds new-york-yankees
8 cincinatti-reds boston-red-sox
9 cincinatti-reds boston-red-sox
10 cincinatti-reds san-diego-padres
11 cincinatti-reds san-diego-padres
The original code df['team_name'] = team overwrites team_name for the entire df on every pass of the loop. The code below builds a per-team placeholder, df_team, sets team_name on it, and collects the pieces so they can be combined with pd.concat after the loop.
url_base = "https://www.teamrankings.com/mlb/team/"
team_list = ['seattle-mariners', 'houston-astros']
Option A: for loop
df_list = list()
for team in team_list:
    new_url = url_base + team
    df_team = pd.read_html(new_url)[1]
    df_team['team_name'] = team
    df_list.append(df_team)
df = pd.concat(df_list)
Option B: list comprehension:
df_list = [pd.read_html(url_base + team)[1].assign(team_name=team) for team in team_list]
df = pd.concat(df_list)
df.head()
df.tail()
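One small addition, not part of the original answer: the sample output in the question numbers its rows 0 through 11 continuously across teams, while pd.concat keeps each page's original index by default, so passing ignore_index=True reproduces that continuous numbering:
df = pd.concat(df_list, ignore_index=True)  # renumber rows 0..N-1 across all teams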

If-condition is not executed in a for-loop when scraping data from kworb.net

I need to collect data on the countries where artists are streamed most frequently on Spotify. To do that, I am using this source that contains a list of 10,000 artists.
So the aim of my code is to create a table with two columns:
artist name;
country where the artist is streamed the most.
I wrote the code below, which gets this information from each artist's personal page (here is an example for Drake). An artist's name is taken from the page title, and the country code from the table column heading that comes right after the column titled "Global". For some artists there is no column titled "Global", and I need to account for this condition. And here is where my problem comes in.
I am using the following if-condition:
if "<th>Global</th>" not in soup2.find_all('table')[0].find_all('th'):
Country = soup2.find_all('table')[0].find_all('th')[4].text
else:
Country = soup2.find_all('table')[0].find_all('th')[5].text
country.append(Country)
But only the first branch is ever executed, so the code always extracts the text from the 4th column. Alternatively, I tried the reverse condition:
if "<th>Global</th>" in soup2.find_all('table')[0].find_all('th'):
Country = soup2.find_all('table')[0].find_all('th')[5].text
else:
Country = soup2.find_all('table')[0].find_all('th')[4].text
country.append(Country)
But the code still extracts the text from the 4th column, even though I want it to extract the 5th column when the 4th column is titled "Global".
The reproducible code below runs on a subset of artists, some of whom have a column titled "Global" (e.g. LANY) and some of whom do not (e.g. Henrique & Diego) (#391 to #395 as of June 16, 2019):
from time import sleep
from random import randint
from requests import get
from bs4 import BeautifulSoup as bs
import pandas as pd

headers = {'User-Agent': 'Mozilla/5.0'}  # note: headers was not defined in the original snippet

response1 = get('https://kworb.net/spotify/artists.html', headers=headers)
soup1 = bs(response1.text, 'html.parser')
table = soup1.find_all('table')[0]
rows = table.find_all('tr')[391:396]  # selected subset of the 10,000 artists

artist = []
country = []

for row in rows:
    artist_url = row.find('a')['href']
    response2 = get('https://kworb.net/spotify/' + artist_url)
    sleep(randint(8, 15))
    soup2 = bs(response2.text, 'html.parser')
    Artist = soup2.find('title').text[:-24]
    artist.append(Artist)
    if "<th>Global</th>" not in soup2.find_all('table')[0].find_all('th'):  # problem suspected in this if-condition
        Country = soup2.find_all('table')[0].find_all('th')[4].text
    else:
        Country = soup2.find_all('table')[0].find_all('th')[5].text
    country.append(Country)

df = pd.DataFrame({'Artist': artist,
                   'Country': country})
print(df)
As a result, I get the following:
Artist Country
0 YNW Melly Global
1 Henrique & Diego BR
2 LANY Global
3 Parson James Global
4 ANAVITÃRIA BR
While the correct output, as of June 16, 2019, should be:
Artist Country
0 YNW Melly US
1 Henrique & Diego BR
2 LANY PH
3 Parson James US
4 ANAVITÃRIA BR
I suspect the if-condition for the variable country is wrong. I would appreciate any help with that.
You are comparing bs4 Tag objects with a string.
You first need to get the text from each found object, then compare it with the string:
replace:
if "<th>Global</th>" not in soup2.find_all('table')[0].find_all('th'):
with:
# get text options from html
found_options = [item.text for item in soup2.find_all('table')[0].find_all('th')]
if "Global" not in found_options:
Output:
Artist Country
0 YNW Melly US
1 Henrique & Diego BR
2 LANY PH
3 Parson James US
4 ANAVITÃRIA BR
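Putting that fix into the loop from the question, the relevant part would look roughly like this (only the country-lookup block changes; the rest of the loop stays as it was):
ths = soup2.find_all('table')[0].find_all('th')
found_options = [item.text for item in ths]  # plain strings, so the "in" check works

if "Global" not in found_options:
    Country = ths[4].text
else:
    Country = ths[5].text
country.append(Country)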

Scraping wikipedia table to pandas data frame

I need to scrape a wikipedia table to a pandas data frame and create three columns: PostalCode, Borough, and Neighborhoods.
https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M
Here is the code that I have used:
import requests
website_url = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
from bs4 import BeautifulSoup
soup = BeautifulSoup(website_url,'lxml')
print(soup.prettify())
My_table = soup.find('table',{'class':'wikitable sortable'})
My_table
links = My_table.findAll('a')
links
Neighbourhood = [ ]
for link in links:
    Neighbourhood.append(link.get('title'))
print(Neighbourhood)
import pandas as pd
df = pd.DataFrame([])
df['PostalCode', 'Borough', 'Neighborhood'] = Neighbourhood
df
And it returns that:
(PostalCode, Borough, Neighborhood)
0 North York
1 Parkwoods
2 North York
3 Victoria Village
4 Downtown Toronto
5 Harbourfront (Toronto)
6 Downtown Toronto
7 Regent Park
8 North York
I can't figure out how to pick up the postcode and the neighbourhood from the wikipedia table.
Thank you
pandas allows you to do it in one line of code:
df = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')[0]
Provide the error message.
By looking at it, first you have df['Neighbourhoods'] = Neighbourhoods, where your list is actually named Neighborhoods.
You have two small errors:
df = pd.dataframe() should be df = pd.DataFrame([])
You also misspelled Neighborhoods as Neighbourhoods the second time.
You might also need to change soup = BeautifulSoup(website_url,'lxml') to soup = BeautifulSoup(website_url,'xml'), but we can't help you more without knowing your exact error message.
Instead of using
df = pd.dataframe()
df['Neighbourhoods'] = Neighbourhoods
you can use
df['Neighbourhoods'] = pd.Series(Neighbourhoods)
This would solve your error, and you can add new columns the same way with pd.Series(listname). Alternatively, you can pass a list of lists containing PostalCode, Borough, and Neighborhoods:
df = pd.DataFrame(list_of_lists)
It looks like you're only picking up one of the columns here:
links = My_table.findAll('a')
You should be looking for 'tr' rather than 'a', as that signifies a new row in the table.
You should then use a for loop to populate a list of lists; this code should work:
values = My_table.find_all('tr')  # every row in the table
v = []
for tr in values:
    td = tr.find_all('td')
    row = [i.text for i in td]
    v.append(row)
df = pd.DataFrame.from_records(v)
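To get from that list of rows to the three named columns the question asks for, a minimal sketch could look like the following. It assumes the wikitable still has the three columns implied by the question and simply names them PostalCode, Borough, and Neighborhood; if the page layout has changed, the column list would need adjusting.
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
soup = BeautifulSoup(requests.get(url).text, 'lxml')
table = soup.find('table', {'class': 'wikitable sortable'})

rows = []
for tr in table.find_all('tr'):
    cells = [td.get_text(strip=True) for td in tr.find_all('td')]
    if cells:  # the header row only has <th> cells, so it is skipped here
        rows.append(cells)

df = pd.DataFrame(rows, columns=['PostalCode', 'Borough', 'Neighborhood'])  # assumes a 3-column table
print(df.head())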

Compile one DataFrame from a loop sequence of smaller DataFrames

I am looping through a list of 103 FourSquare URLs to find "Coffee Shops."
I can create a DataFrame for each URL and print each DataFrame as I loop through the list (sample output at bottom).
I cannot figure out how to append the DataFrame for each URL into a single DataFrame as I loop through the list. My goal is to compile a single DataFrame from the DataFrames I am printing.
x = 0
while x < 103:
    results = requests.get(URLs[x]).json()

    def get_category_type(row):
        try:
            categories_list = row['categories']
        except:
            categories_list = row['venue.categories']
        if len(categories_list) == 0:
            return None
        else:
            return categories_list[0]['name']

    venues = results['response']['groups'][0]['items']
    nearby_venues = json_normalize(venues)  # flatten JSON

    # filter columns
    filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
    nearby_venues = nearby_venues.loc[:, filtered_columns]

    # filter the category for each row
    nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

    # clean columns
    nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

    dfven = nearby_venues.loc[nearby_venues['categories'] == 'Coffee Shop']
    print(x, '!!!', dfven, '\n')
    x = x + 1
Here is some output (I do get complete results):
0 !!! name categories lat lng
5 Tim Hortons Coffee Shop 43.80200 -79.198169
8 Tim Hortons / Esso Coffee Shop 43.80166 -79.199133
1 !!! Empty DataFrame
Columns: [name, categories, lat, lng]
Index: []
2 !!! name categories lat lng
5 Starbucks Coffee Shop 43.770367 -79.186313
18 Tim Hortons Coffee Shop 43.769591 -79.187081
3 !!! name categories lat lng
0 Starbucks Coffee Shop 43.770037 -79.221156
4 Country Style Coffee Shop 43.773716 -79.207027
I apologize if this is bad form or a breach of etiquette but I solved my problem and figured I should post. Perhaps making an effort to state the problem for StackOverflow helped me solve it?
First I learned how to ignore empty DataFrames:
dfven = nearby_venues.loc[nearby_venues['categories'] == 'Coffee Shop']
if dfven.empty == False :
Once I added this check, my printed output was a clean series of identically formatted data frames, so appending them into one data frame was easy. I created a data frame at the beginning of my code (merge = pd.DataFrame()) and then added this line where I was printing:
merge = merge.append(dfven)
Now my output is perfect.
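A side note, not from the original post: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so the same collect-and-combine idea is now usually written with a list and pd.concat. A small self-contained sketch with stand-in data (the per_url_results list here is hypothetical, standing in for the loop output above):
import pandas as pd

# stand-in for the per-URL results produced by the loop above (hypothetical sample data)
per_url_results = [
    pd.DataFrame({'name': ['Tim Hortons'], 'categories': ['Coffee Shop']}),
    pd.DataFrame(columns=['name', 'categories']),  # an empty page
    pd.DataFrame({'name': ['Starbucks'], 'categories': ['Coffee Shop']}),
]

frames = [dfven for dfven in per_url_results if not dfven.empty]  # skip empty DataFrames
merge = pd.concat(frames, ignore_index=True)                      # replaces merge = merge.append(dfven)
print(merge)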
