Background
I fetched a table from an online source (the Mordred Molecular Descriptors documentation) for a machine learning project.
The code I used to fetch that table is listed below:
import requests
import pandas as pd
from bs4 import BeautifulSoup
# Fetch the HTML content of the webpage
url = "https://mordred-descriptor.github.io/documentation/master/descriptors.html"
html = requests.get(url).content
# Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
# Find the table element in the HTML
table = soup.find('table')
# Convert the table into a pandas DataFrame
df = pd.read_html(str(table))[0]
# Drop the columns that are not needed
df = df.drop(['#', 'constructor', 'dim', 'description'], axis=1)
After running the above code in Python 3, I get a dataframe with "module" and "name" columns.
Now I want to group the names in the "name" column by their corresponding modules in the "module" column. The problem is that the fetched table is rendered like a pivot table: the "module" column is only filled on the first row of each group and contains NaN everywhere else. Ideally, I would like to generate a dictionary with the modules as keys and the lists of grouped names as values.
Example: dict_df = {'ABCIndex': ['ABC','ABCGG'], 'AcidBase': ['nAcid', 'nBase'], ..., 'ZagrebIndex': ['Zagreb1', 'Zagreb2', 'mZagreb1', 'mZagreb2']}
I have tried grouping the names by module using .groupby() in pandas, but the NaN rows are dropped, leaving each dictionary value as a list with a single name: the name from the row where the module was not NaN.
Thank you for your time and assistance.
IIUC, like this? Use ffill to propagate the last non-NaN module value down the column, then groupby and agg with list.
df.groupby(df['module'].ffill())['name'].agg(list)
Output:
module
ABCIndex [ABC, ABCGG]
AcidBase [nAcid, nBase]
AdjacencyMatrix [SpAbs_A, SpMax_A, SpDiam_A, SpAD_A, SpMAD_A, ...
Aromatic [nAromAtom, nAromBond]
AtomCount [nAtom, nHeavyAtom, nSpiro, nBridgehead, nHete...
Autocorrelation [ATS0dv, ATS1dv, ATS2dv, ATS3dv, ATS4dv, ATS5d...
BCUT [BCUTc-1h, BCUTc-1l, BCUTdv-1h, BCUTdv-1l, BCU...
BalabanJ [BalabanJ]
BaryszMatrix [SpAbs_DzZ, SpMax_DzZ, SpDiam_DzZ, SpAD_DzZ, S...
BertzCT [BertzCT]
BondCount [nBonds, nBondsO, nBondsS, nBondsD, nBondsT, n...
CPSA [PNSA1, PNSA2, PNSA3, PNSA4, PNSA5, PPSA1, PPS...
CarbonTypes [C1SP1, C2SP1, C1SP2, C2SP2, C3SP2, C1SP3, C2S...
Chi [Xch-3d, Xch-4d, Xch-5d, Xch-6d, Xch-7d, Xch-3...
Constitutional [SZ, Sm, Sv, Sse, Spe, Sare, Sp, Si, MZ, Mm, M...
DetourMatrix [SpAbs_Dt, SpMax_Dt, SpDiam_Dt, SpAD_Dt, SpMAD...
DistanceMatrix [SpAbs_D, SpMax_D, SpDiam_D, SpAD_D, SpMAD_D, ...
EState [NsLi, NssBe, NssssBe, NssBH, NsssB, NssssB, N...
EccentricConnectivityIndex [ECIndex]
ExtendedTopochemicalAtom [ETA_alpha, AETA_alpha, ETA_shape_p, ETA_shape...
FragmentComplexity [fragCpx]
Framework [fMF]
GeometricalIndex [GeomDiameter, GeomRadius, GeomShapeIndex, Geo...
GravitationalIndex [GRAV, GRAVH, GRAVp, GRAVHp]
HydrogenBond [nHBAcc, nHBDon]
InformationContent [IC0, IC1, IC2, IC3, IC4, IC5, TIC0, TIC1, TIC...
KappaShapeIndex [Kier1, Kier2, Kier3]
Lipinski [Lipinski, GhoseFilter]
LogS [FilterItLogS]
McGowanVolume [VMcGowan]
MoRSE [Mor01, Mor02, Mor03, Mor04, Mor05, Mor06, Mor...
MoeType [LabuteASA, PEOE_VSA1, PEOE_VSA2, PEOE_VSA3, P...
MolecularDistanceEdge [MDEC-11, MDEC-12, MDEC-13, MDEC-14, MDEC-22, ...
MolecularId [MID, AMID, MID_h, AMID_h, MID_C, AMID_C, MID_...
MomentOfInertia [MOMI-X, MOMI-Y, MOMI-Z]
PBF [PBF]
PathCount [MPC2, MPC3, MPC4, MPC5, MPC6, MPC7, MPC8, MPC...
Polarizability [apol, bpol]
RingCount [nRing, n3Ring, n4Ring, n5Ring, n6Ring, n7Ring...
RotatableBond [nRot, RotRatio]
SLogP [SLogP, SMR]
TopoPSA [TopoPSA(NO), TopoPSA]
TopologicalCharge [GGI1, GGI2, GGI3, GGI4, GGI5, GGI6, GGI7, GGI...
TopologicalIndex [Diameter, Radius, TopoShapeIndex, PetitjeanIn...
VdwVolumeABC [Vabc]
VertexAdjacencyInformation [VAdjMat]
WalkCount [MWC01, MWC02, MWC03, MWC04, MWC05, MWC06, MWC...
Weight [MW, AMW]
WienerIndex [WPath, WPol]
ZagrebIndex [Zagreb1, Zagreb2, mZagreb1, mZagreb2]
Name: name, dtype: object
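Since you asked for a dictionary, you can convert the resulting Series directly with .to_dict():
dict_df = df.groupby(df['module'].ffill())['name'].agg(list).to_dict()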
I am trying to extract data from a dynamic table with the following structure:
Team 1 - Score - Team 2 - Minute first goal.
It is a table of soccer match results, with about 10 matches per table and one table per matchday. This is an example of the website I'm working with: https://www.resultados-futbol.com/premier/grupo1/jornada1
For this I am using web scraping with BeautifulSoup in Python. Although I've made good progress, I'm running into a problem. I would like to write code that iterates over each row of the table, cell by cell, and appends each value to a list, so that I have, for example:
List Team 1: Real Madrid, Barcelona
Score list: 1-0, 1-0
List Team 2: Atletico Madrid, Sevilla
First goal minutes list: 17', 64'
Once I have the lists, my intention is to build a complete dataframe with all the extracted data. However, I have the following problem: for matches that end 0-0 there is nothing in the Minute first goal column, so nothing is extracted, I can't 'fill' that value in my dataframe in any way, and I get an error. To continue with the previous example, imagine that the second game ended 0-0, so the 'Minutes first goal' list contains only one item (17').
In my mind the solution would be a loop that takes the data cell by cell, with a condition on 'Score': if it is 0-0, a value such as 'No goals' is added to the Minutes first goal list.
This is the code I am using. I paste only the part in which I would like to create the loop:
page = BeautifulSoup(driver.page_source, 'html.parser')  # Selenium is needed first because some buttons on the page have to be expanded
table = page.find('div', class_= 'contentitem').find_all('tr', class_= 'vevent')
teams1 = []
teams2 = []
scores = []
for cell in table:
    team1 = cell.find('td', class_='team1')
    for name in team1:  # this loop originally iterated over an undefined variable 'local'
        nteam1 = name.text
        teams1.append(nteam1)
    team2 = cell.find('td', class_='team2')
    for name in team2:
        nteam2 = name.text
        teams2.append(nteam2)
    score = cell.find('span', class_='clase')
    for name in score:
        nscore = name.text
        scores.append(nscore)
It is not clear to me how to iterate over the table so that I can store the content of each cell in the lists, and it is essential to include the condition "when the score cell is 0-0, add a no-goals entry to the list".
If someone could help me, I would be very grateful. Best regards
You are close to your goal, but you can optimize your script a bit.
Do not use several different lists; just use one:
data = []
Try to get all the information in one loop: there is a td that contains all of it, and you can push a dict onto your list:
for row in soup.select('tr.vevent .rstd'):
    teams = row.select_one('.summary').get_text().split(' - ')
    score = row.select_one('.clase').get_text()
    data.append({
        'team1': teams[0],
        'team2': teams[1],
        'score': score if score != '0-0' else 'No goals'
    })
Push your data into a DataFrame:
pd.DataFrame(data)
Example
from selenium import webdriver
from bs4 import BeautifulSoup
import pandas as pd
driver = webdriver.Chrome(executable_path=r'C:\Program Files\ChromeDriver\chromedriver.exe')
url = 'https://www.resultados-futbol.com/premier/grupo1/jornada1'
driver.get(url)
soup = BeautifulSoup(driver.page_source, 'html.parser')  # Selenium is needed first because some buttons on the page have to be expanded
data = []

for row in soup.select('tr.vevent .rstd'):
    teams = row.select_one('.summary').get_text().split(' - ')
    score = row.select_one('.clase').get_text()
    data.append({
        'team1': teams[0],
        'team2': teams[1],
        'score': score if score != '0-0' else 'No goals'
    })

pd.DataFrame(data)
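The question also asks for the minute of the first goal. The markup for that cell isn't shown in the thread, so '.minute' below is a hypothetical selector you would need to replace with the real class from the page; the 0-0 fallback works the same way as for the score:

minutes = []
for row in soup.select('tr.vevent .rstd'):
    score = row.select_one('.clase').get_text()
    minute_el = row.select_one('.minute')  # hypothetical selector -- inspect the page for the real one
    if score == '0-0' or minute_el is None:
        minutes.append('No goals')
    else:
        minutes.append(minute_el.get_text())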
I need to collect data on the countries where artists are streamed most frequently on Spotify. To do that, I am using this source, which contains a list of 10,000 artists.
So the aim of my code is to create a table with two columns:
artist name;
country where the artist is streamed the most.
I wrote the code below, which gets this information from each artist's personal page (here is an example for Drake). An artist's name is taken from the page title, and the country code from the table column heading that follows the column titled "Global". For some artists there is no column titled "Global", and I need to account for that condition. And here is where my problem comes in.
I am using the following if-condition:
if "<th>Global</th>" not in soup2.find_all('table')[0].find_all('th'):
Country = soup2.find_all('table')[0].find_all('th')[4].text
else:
Country = soup2.find_all('table')[0].find_all('th')[5].text
country.append(Country)
But only the first branch is executed, and the code extracts the text from the 4th column. Alternatively, I tried the reverse condition:
if "<th>Global</th>" in soup2.find_all('table')[0].find_all('th'):
Country = soup2.find_all('table')[0].find_all('th')[5].text
else:
Country = soup2.find_all('table')[0].find_all('th')[4].text
country.append(Country)
But the code still extracts the text from the 4th column, even though I want it to extract the text from the 5th column when there is a column titled "Global".
This reproducible code runs on a subset of artists, some of whom have a column titled "Global" (e.g. LANY) and some of whom do not (e.g. Henrique & Diego) (#391 to #395 as of June 16, 2019):
from time import sleep
from random import randint
from requests import get
from bs4 import BeautifulSoup as bs
import pandas as pd

# 'headers' was undefined in the original snippet; a generic User-Agent is assumed here
headers = {'User-Agent': 'Mozilla/5.0'}

response1 = get('https://kworb.net/spotify/artists.html', headers=headers)
soup1 = bs(response1.text, 'html.parser')
table = soup1.find_all('table')[0]
rows = table.find_all('tr')[391:396]  # selected subset of the 10,000 artists

artist = []
country = []

for row in rows:
    artist_url = row.find('a')['href']
    response2 = get('https://kworb.net/spotify/' + artist_url)
    sleep(randint(8, 15))
    soup2 = bs(response2.text, 'html.parser')
    Artist = soup2.find('title').text[:-24]
    artist.append(Artist)
    if "<th>Global</th>" not in soup2.find_all('table')[0].find_all('th'):  # problem suspected in this if-condition
        Country = soup2.find_all('table')[0].find_all('th')[4].text
    else:
        Country = soup2.find_all('table')[0].find_all('th')[5].text
    country.append(Country)

df = pd.DataFrame({'Artist': artist,
                   'Country': country})
print(df)
As a result, I get the following:
Artist Country
0 YNW Melly Global
1 Henrique & Diego BR
2 LANY Global
3 Parson James Global
4 ANAVITÓRIA BR
While the actual output, as of June 16, 2019, should be:
Artist Country
0 YNW Melly US
1 Henrique & Diego BR
2 LANY PH
3 Parson James US
4 ANAVITÓRIA BR
I suspect the if-condition for the variable Country is wrong. I would appreciate any help with that.
You are comparing bs4 Tag objects with a string.
You first need to get the text from each found object and then compare that with the string:
replace:
if "<th>Global</th>" not in soup2.find_all('table')[0].find_all('th'):
with:
# get the text of each <th> in the table
found_options = [item.text for item in soup2.find_all('table')[0].find_all('th')]
if "Global" not in found_options:
Output:
Artist Country
0 YNW Melly US
1 Henrique & Diego BR
2 LANY PH
3 Parson James US
4 ANAVITÓRIA BR
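A variant of the same fix is to let BeautifulSoup do the text matching for you: find('th', string='Global') returns the matching header cell, or None when no <th> has exactly that text:

first_table = soup2.find_all('table')[0]
if first_table.find('th', string='Global') is None:
    Country = first_table.find_all('th')[4].text
else:
    Country = first_table.find_all('th')[5].text
country.append(Country)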
I need to scrape a Wikipedia table into a pandas DataFrame and create three columns: PostalCode, Borough, and Neighborhood.
https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M
Here is the code that I have used:
import requests
website_url = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
from bs4 import BeautifulSoup
soup = BeautifulSoup(website_url,'lxml')
print(soup.prettify())
My_table = soup.find('table',{'class':'wikitable sortable'})
My_table
links = My_table.findAll('a')
links
Neighbourhood = [ ]
for link in links:
Neighbourhood.append(link.get('title'))
print (Neighbourhood)
import pandas as pd
df = pd.DataFrame([])
df['PostalCode', 'Borough', 'Neighborhood'] = Neighbourhood
df
And it returns this:
(PostalCode, Borough, Neighborhood)
0 North York
1 Parkwoods
2 North York
3 Victoria Village
4 Downtown Toronto
5 Harbourfront (Toronto)
6 Downtown Toronto
7 Regent Park
8 North York
I can't figure out how to pick up the postcode and the neighbourhood from the wikipedia table.
Thank you
pandas allows you to do it in one line of code:
df = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')[0]
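If you only need the three columns, you can select and rename them after reading. A minimal sketch, assuming the current Wikipedia headers are 'Postal Code', 'Borough', and 'Neighbourhood' (verify with df.columns, since the page layout changes over time):

import pandas as pd

df = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')[0]
# The left-hand names are assumptions about the current page headers -- check df.columns first
df = df.rename(columns={'Postal Code': 'PostalCode', 'Neighbourhood': 'Neighborhood'})
df = df[['PostalCode', 'Borough', 'Neighborhood']]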
Provide the error message.
From a first look, you have df['Neighbourhoods'] = Neighbourhoods, while your list is named Neighborhoods.
You have two small errors:
df = pd.dataframe() should be df = pd.DataFrame([])
You also misspelled Neighborhoods as Neighbourhoods the second time.
You might also need to change soup = BeautifulSoup(website_url,'lxml') to soup = BeautifulSoup(website_url,'xml'), but we can't help you more without knowing your exact error message.
Instead of using
df = pd.dataframe()
df['Neighbourhoods'] = Neighbourhoods
You can use
df['Neighbourhoods'] = pd.Series(Neighbourhoods)
This would solve your error. You can add new columns similarly using pd.Series(listname), or you can build the DataFrame from a list of lists containing PostalCode, Borough, and Neighborhood:
df = pd.DataFrame(list_of_lists)
It looks like you're only picking up one of the columns here:
links = My_table.findAll('a')
You should be looking for 'tr' rather than 'a' as that signifies a new row in the table.
You should then use a for loop to populate a list of lists; this code should work:
values = My_table.find_all('tr')  # this line was missing; 'values' was undefined
v = []
for tr in values:
    td = tr.find_all('td')
    row = [i.text for i in td]
    v.append(row)
df = pd.DataFrame.from_records(v)
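Note that the first tr is the header row and contains th rather than td cells, so it yields an empty record; a sketch that skips it and names the columns (names taken from your question) is:

v = []
for tr in values[1:]:  # skip the header row
    td = tr.find_all('td')
    v.append([i.text.strip() for i in td])
df = pd.DataFrame.from_records(v, columns=['PostalCode', 'Borough', 'Neighborhood'])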
My code is working, which is good lol, but the output needs to be displayed differently.
UPDATED CODE SINCE RECEIVING ANSWER
import pandas as pd
# Import File
YMM = pd.read_excel('C:/Users/PCTR261010/Desktop/OMIX_YMM_2016.xlsx').groupby(['Make','Model']).agg({'StartYear':'min', 'EndYear':'max'})
print(YMM)
The output has the columns Make | Model | StartYear | EndYear, but Make and Model form the index, so each make appears only once per group, like in a pivot table.
I need American Motors next to every American Motors model, Buick next to every Buick model, and so on.
Here is the link to sample data:
http://jmp.sh/KLZKWVZ
Try this:
res = YMM.groupby(['Make','Model'], as_index=False).agg({'StartYear':'min', 'EndYear':'max'})
or
res = YMM.groupby(['Make','Model']).agg({'StartYear':'min', 'EndYear':'max'}).reset_index()
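Both options do the same thing: by default, groupby moves Make and Model into the index, which is why each make appeared only once; as_index=False (or reset_index() afterwards) keeps them as regular columns, so every row shows its make.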
With your own approach, you can compute the min and max separately and combine them (the original snippet assigned into a Series, which does not work):
Min = YMM.groupby(['Make','Model']).StartYear.min()
Max = YMM.groupby(['Make','Model']).EndYear.max()
res = pd.concat([Min, Max], axis=1).reset_index()