Reading tables with Pandas and Selenium - python

I am trying to figure out an elegant way to scrape tables from a website. However, when running the script below I get a ValueError: No tables found error.
from selenium import webdriver
import pandas as pd
driver = webdriver.Chrome(executable_path=r'C:...\chromedriver.exe')
driver.implicitly_wait(30)
driver.get("https://www.gallop.co.za/#meeting#20201125#3")
df_list=pd.read_html(driver.find_element_by_id("eventTab_4").get_attribute('outerHTML'))
When I look at the site's elements, I notice that this kind of code works when the <table> tag lies neatly within the <div id="...">. However, in this case, I think the code is not working for the following reasons:
There is a <div> within a <div>, and only then the <table> tag.
The site uses JavaScript to render the tables.
Grateful for advice on how to pull the tables for all races. That is, there are several tables, each made visible as the user clicks on its tab (race), and I need to extract all of them into separate dataframes.

from selenium import webdriver
import time
import pandas as pd

pd.set_option('display.max_columns', None)

driver = webdriver.Chrome(executable_path='C:/bin/chromedriver.exe')
driver.get("https://www.gallop.co.za/#meeting#20201125#3")
time.sleep(5)

tab = driver.find_element_by_id('tabs')        # All tabs are here
li_list = tab.find_elements_by_tag_name('li')  # They are in "li" elements
a_list = []
for li in li_list[1:]:                         # First tab has nothing; we skip it
    a = li.find_element_by_tag_name('a')       # Extract the "a" element from the "li"
    a_list.append(a)

df = []
for a in a_list:
    a.click()                                  # Next tab
    time.sleep(8)                              # Tables take some time to load fully
    page = driver.page_source                  # Get the HTML of the new tab's page
    source = pd.read_html(page)
    table = source[1]                          # Get the 2nd table
    df.append(table)

print(df)
Output
[ Silk No Horse Unnamed: 3 ACS SH CFR Trainer \
0 NaN 1 JEM ROCK NaN 4bC A NaN Eric Sands
1 NaN 2 MAISON MERCI NaN 3chC A NaN Brett Crawford
2 NaN 3 WORDSWORTH NaN 3bC AB NaN Greg Ennion
3 NaN 4 FOUND THE DREAM NaN 3bG A NaN Adam Marcus
4 NaN 5 IZAPHA NaN 3bG A NaN Andre Nel
5 NaN 6 JACKBEQUICK NaN 3grG A NaN Glen Kotzen
6 NaN 7 MHLABENI NaN 3chG A NaN Eric Sands
7 NaN 8 ORLOV NaN 3bG A NaN Adam Marcus
8 NaN 9 T'CHALLA NaN 3bC A NaN Justin Snaith
9 NaN 10 WEST COAST WONDER NaN 3bG A NaN Piet Steyn
Jockey Wgt MR Dr Odds Last 3 Runs
0 D Dillon 60 0 7 NaN NaN
continued
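A note for anyone on Selenium 4: executable_path and the find_element_by_* helpers used above have since been removed. Below is a sketch of the same tab-clicking loop against the current API, with an explicit wait replacing the first time.sleep; it is untested against this site, and the 'tabs' id and table index are simply taken from the answer above.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd
import time

driver = webdriver.Chrome()  # Selenium 4 can locate chromedriver itself
driver.get("https://www.gallop.co.za/#meeting#20201125#3")
wait = WebDriverWait(driver, 30)

# Wait for the tab strip to appear, then collect the per-race links
tabs = wait.until(EC.presence_of_element_located((By.ID, 'tabs')))
links = tabs.find_elements(By.TAG_NAME, 'a')

dfs = []
for link in links[1:]:          # skip the empty first tab, as above
    link.click()
    time.sleep(8)               # tables still need time to load fully
    dfs.append(pd.read_html(driver.page_source)[1])  # 2nd table, as above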

Related

Extract specific words from dataframe

I have the following dataframe named marketing, from which I would like to extract the source= values. Is there a way to create a general regex function that I can apply to other columns as well, to extract the words after the equals sign?
Data
source=book,social_media=facebook,ads=Facebook
source=book,ads=Facebook,customer=2
cost=2, customer=3
I'm using Python and have tried the following:
df = pd.DataFrame()

def find_keywords(row_string):
    tags = [x for x in row_string if x.startswith('source=')]
    return tags

df['Data'] = marketing['Data'].apply(lambda row: find_keywords(row))
Is there a more efficient way to extract the values and place them into columns, like this:
source social_media ads customer costs
book facebook facebook - -
book - facebook 2 -
You can split each string value into a dict, then use pd.json_normalize to convert the dicts to columns.
out = pd.json_normalize(
    marketing['Data'].apply(
        lambda x: dict([map(str.strip, i.split('=')) for i in x.split(',')])
    )
).dropna(subset='source')
print(out)
source social_media ads customer cost
0 book facebook Facebook NaN NaN
1 book NaN Facebook 2 NaN
Here's another option:
Sample dataframe marketing is:
marketing = pd.DataFrame(
    {"Data": ["source=book,social_media=facebook,ads=Facebook",
              "source=book,ads=Facebook,customer=2",
              "cost=2, customer=3"]}
)
Data
0 source=book,social_media=facebook,ads=Facebook
1 source=book,ads=Facebook,customer=2
2 cost=2, customer=3
Now this
result = (marketing["Data"].str.split(r"\s*,\s*").explode().str.strip()
          .str.split(r"\s*=\s*", expand=True).pivot(columns=0))
does produce
1
0 ads cost customer social_media source
0 Facebook NaN NaN facebook book
1 Facebook NaN 2 NaN book
2 NaN 2 3 NaN NaN
which is almost what you're looking for, except for the extra column level and the column ordering. So the following modification
result = (marketing["Data"].str.split(r"\s*,\s*").explode().str.strip()
          .str.split(r"\s*=\s*", expand=True).rename(columns={0: "columns"})
          .pivot(columns="columns").droplevel(level=0, axis=1))
result = result[["source", "social_media", "ads", "customer", "cost"]]
should fix that:
columns source social_media ads customer cost
0 book facebook Facebook NaN NaN
1 book NaN Facebook 2 NaN
2 NaN NaN NaN 3 2
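Since the question explicitly asks for a regex approach, Series.str.extract is one more option. Here is a minimal sketch built on the marketing dataframe defined above; the helper name and patterns are mine, and each pattern captures the value between key= and the next comma:
def extract_value(df, key):
    # Return the value that follows `key=` in each row, NaN where absent
    return df["Data"].str.extract(rf"{key}=([^,]+)", expand=False).str.strip()

for key in ["source", "social_media", "ads", "customer", "cost"]:
    marketing[key] = extract_value(marketing, key)
print(marketing)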

How to insert the following Beautiful Soup scraped data into Excel?

from bs4 import BeautifulSoup
import pandas as pd
import requests
import time
from datetime import datetime

def extract_source(url):
    agent = {"User-Agent": "Mozilla/5.0"}
    source = requests.get(url, headers=agent).text
    return source

html_text = extract_source('https://www.mpbio.com/us/life-sciences/biochemicals/amino-acids')
soup = BeautifulSoup(html_text, 'lxml')
for a in soup.find_all('a', class_='button button--link button--fluid catalog-list-item__actions-primary-button', href=True):
    # print("Found the URL:", a['href'])
    urlof = a['href']
    html_text = extract_source(urlof)
    soup = BeautifulSoup(html_text, 'lxml')
    table_rows = soup.find_all('tr')
    first_columns = []
    third_columns = []
    for row in table_rows:
        # for row in table_rows[1:]:
        first_columns.append(row.findAll('td')[0])
        third_columns.append(row.findAll('td')[1])
    for first, third in zip(first_columns, third_columns):
        print(first.text, third.text)
Basically, I am trying to scrape table data from multiple links on the website, and I want to insert that data into one Excel/CSV file, turning this per-product format:
SKU 07DE9922
Analyte / Target Corticosterone
Base Catalog Number DE9922
Diagnostic Platforms EIA/ELISA
Diagnostic Solutions Endocrinology
Disease Screened Corticosterone
Evaluation Quantitative
Pack Size 96 Wells
Sample Type Plasma, Serum
Sample Volume 10 uL
Species Reactivity Mouse, Rat
Usage Statement For Research Use Only, not for use in diagnostic procedures.
into the following tabular format in the Excel file:
SKU Analyte/Target Base Catalog Number Pack Size Sample Type
data data data data data
I am having difficulty converting the data into the proper format.
I made a small modification to your code. Instead of printing the data, I created a dictionary per product and added it to a list, then used that list to create a DataFrame:
from bs4 import BeautifulSoup  # this import was missing in the original snippet
import pandas as pd
import requests

def extract_source(url):
    agent = {"User-Agent": "Mozilla/5.0"}
    source = requests.get(url, headers=agent).text
    return source

html_text = extract_source(
    "https://www.mpbio.com/us/life-sciences/biochemicals/amino-acids"
)
soup = BeautifulSoup(html_text, "lxml")

data = []
for a in soup.find_all(
    "a",
    class_="button button--link button--fluid catalog-list-item__actions-primary-button",
    href=True,
):
    urlof = a["href"]
    html_text = extract_source(urlof)
    soup = BeautifulSoup(html_text, "lxml")
    table_rows = soup.find_all("tr")
    first_columns = []
    third_columns = []
    for row in table_rows:
        first_columns.append(row.findAll("td")[0])
        third_columns.append(row.findAll("td")[1])
    # create a dictionary with the values and add it to the list
    d = {}
    for first, third in zip(first_columns, third_columns):
        d[first.get_text(strip=True)] = third.get_text(strip=True)
    data.append(d)

df = pd.DataFrame(data)
print(df)
df.to_csv("data.csv", index=False)
Prints:
SKU Alternate Names Base Catalog Number CAS # EC Number Format Molecular Formula Molecular Weight Personal Protective Equipment Usage Statement Application Notes Beilstein Registry Number Optical Rotation Purity UV Visible Absorbance Hazard Statements RTECS Number Safety Symbol Auto Ignition Biochemical Physiological Actions Density Melting Point pH pKa Solubility Vapor Pressure Grade Boiling Point Isoelectric Point
0 02100078-CF 2-Acetamido-5-Guanidinovaleric acid 100078 210545-23-6 205-846-6 Powder C8H16N4O3· 2H2O 216.241 g/mol Eyeshields, Gloves, respirator filter Unless specified otherwise, MP Biomedical's pr... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 02100142-CF Acetyltryptophan; DL-α-Acetylamino-3-indolepro... 100142 87-32-1 201-739-3 Powder C13H14N2O3 246.266 g/mol Eyeshields, Gloves, respirator filter Unless specified otherwise, MP Biomedical's pr... N-acetyl-DL-tryptophan, is used as stabilizer ... 89478 0° ± 2° (c=1, 1N NaOH, 24 hrs.) ~99% λ max (water)=280 ± 2 nm NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 02100421-CF L-2,5-Diaminopentanoic acid; 2,5-Diaminopentan... 100421 3184-13-2 221-678-6 Powder C5H12N2O2 • HCl 168.621 g/mol Eyeshields, Gloves, respirator filter Unless specified otherwise, MP Biomedical's pr... NaN 3625847 NaN ~99% NaN H319 RM2985000 GHS07 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 02100520-CF Phosphocreatine Disodium Salt Tetrahydrate; So... 100520 922-32-7 213-074-6 Powder C4H8N3Na2O5P·4H2O 255.077 g/mol Eyeshields, Gloves, respirator filter Unless specified otherwise, MP Biomedical's pr... NaN NaN NaN ≥98% NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 02100769-CF Vitamin C; Ascorbate; Sodium ascorbate; L-Xylo... 100769 50-81-7 200-066-2 NaN C6H8O6 176.124 g/mol Eyeshields, Gloves, respirator filter Unless specified otherwise, MP Biomedical's pr... L-Ascorbic Acid is used as an Antimicrobial an... 84272 +18° to +32° (c=1, water) ≥98% NaN NaN CI7650000 NaN 1220° F (NTP, 1992) Ascorbic Acid, also known as Vitamin C, is a s... 1.65 (NTP, 1992) 374 to 378° F (decomposes) (NTP, 1992) Between 2,4 and 2,8 (2 % aqueous solution) pK1: 4.17; pK2: 11.57 greater than or equal to 100 mg/mL at 73° F (... 9.28X10-11 mm Hg at 25 deg C (est) NaN NaN NaN
5 02101003-CF Lycine; Oxyneurine; (Carboxymethyl)trimethylam... 101003 107-43-7 203-490-6 Powder C5H11NO2 117.148 g/mol Eyeshields, Gloves, respirator filter Unless specified otherwise, MP Biomedical's pr... Betaine is a reagent that is used in soldering... 3537113 NaN NaN NaN NaN DS5900000 NaN NaN End-product of oxidative metabolism of choline... NaN Decomposes around 293 deg C NaN 1.83 (Lit.) Solubility (g/100 g solvent): <a class="pubche... 1.36X10-8 mm Hg at 25 deg C (est) Anhydrous NaN NaN
6 02101806-CF (S)-2,5-Diamino-5-oxopentanoic acid; L-Glutami... 101806 56-85-9 200-292-1 NaN C5H10N2O3 146.146 g/mol Eyeshields, Gloves,  respirator filter Unless specified otherwise, MP Biomedical's pr... L-glutamine is an essential amino acid, which ... 1723797 +30 ± 5° (c = 3.5, 1N HCl) ≥99% NaN NaN MA2275100 NaN NaN L-Glutamine is an essential amino acid that is... 1.364 g/cu cm 185.5 dec °C pH = 5-6 at 14.6 g/L at 25 deg C NaN Water Solubility41300 mg/L (at 25 °C) 1.9X10-8 mm Hg at 25 deg C (est) NaN NaN NaN
7 02102158-CF L-2-Amino-4-methylpentanoic acid; Leu; L; α-am... 102158 61-90-5 200-522-0 Powder C6H13NO2 131.175 g/mol Eyeshields, Gloves, respirator filter Unless specified otherwise, MP Biomedical's pr... Leucine has been used as a molecular marker in... 1721722 +14.5 ° to +16.5 ° (Lit.) NaN NaN NaN OH2850000 NaN NaN NaN 1.293 g/cu cm at 18 deg C 293 °C NaN 2.33 (-COOH), 9.74 (-NH2)(Lit.) Water Solubility21500 mg/L (at 25 °C) 5.52X10-9 mm Hg at 25 deg C (est) NaN Sublimes at 145-148 deg C. Decomposes at 293-2... 6.04(Lit.)
8 02102576-CF 4-Hydroxycinnamic acid; 3-(4-Hydroxphenyl)-2-p... 102576 7400-08-0 231-000-0 Powder C9H8O3 164.16 g/mol Dust mask, Eyeshields, Gloves Unless specified otherwise, MP Biomedical's pr... p-Coumaric acid was used as a substrate to stu... NaN NaN ≥98% NaN NaN GD9094000 NaN NaN NaN NaN 211.5 °C NaN NaN NaN NaN NaN NaN NaN
9 02102868-CF DL-2-Amino-3-hydroxypropionic acid; (±)-2-Amin... 102868 302-84-1 206-130-6 Powder C3H7NO3 105.093 g/mol Eyeshields, Gloves, respirator filter Unless specified otherwise, MP Biomedical's pr... NaN 1721405 -1° to + 1° (c = 5, 1N HCl) ≥98% NaN NaN NaN NaN NaN NMDA agonist acting at the glycine site; precu... 1.6 g/cu cm # 22 deg C 228 deg C (decomposes) NaN NaN SOL IN <a class="pubchem-internal-link CID-962... NaN NaN NaN NaN
...and so on.
And it saves data.csv.
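Since the title asks for Excel specifically, the same DataFrame can also be written straight to a workbook. A one-line sketch, assuming the openpyxl package is installed (pandas uses it as the default .xlsx engine):
df.to_excel("data.xlsx", index=False)  # writes an .xlsx instead of a CSV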
Try this -
import pandas as pd
import requests
import time
from datetime import datetime
from bs4 import BeautifulSoup

def extract_source(url):
    agent = {"User-Agent": "Mozilla/5.0"}
    return requests.get(url, headers=agent).text

html_text = extract_source(
    'https://www.mpbio.com/us/life-sciences/biochemicals/amino-acids')
soup = BeautifulSoup(html_text, 'lxml')

result = []
for index, a in enumerate(soup.find_all('a', class_='button button--link button--fluid catalog-list-item__actions-primary-button', href=True)):
    # if index >= 10:
    #     break
    # print("Found the URL:", a['href'])
    urlof = a['href']
    html_text = extract_source(urlof)
    soup = BeautifulSoup(html_text, 'lxml')
    table_rows = soup.find_all('tr')
    first_columns = []
    third_columns = []
    for row in table_rows:
        # for row in table_rows[1:]:
        first_columns.append(row.findAll('td')[0])
        third_columns.append(row.findAll('td')[1])
    temp = {}
    for first, third in zip(first_columns, third_columns):
        third = str(third.text).strip('\n')
        first = str(first.text).strip('\n')
        temp[first] = third
    result.append(temp)

df = pd.DataFrame(result)
# please drop useless columns before saving the output to csv
df.to_csv('out.csv', index=False)
This will give output -
out.csv contains the columns SKU, Alternate Names, Base Catalog Number, CAS #, EC Number, Format, Molecular Formula, Molecular Weight, Personal Protective Equipment, Usage Statement, Application Notes, Beilstein Registry Number, Optical Rotation, Purity, UV Visible Absorbance, Hazard Statements, RTECS Number and Safety Symbol. The first three records, with empty cells omitted:

SKU: 02100078-CF
Alternate Names: 2-Acetamido-5-Guanidinovaleric acid
Base Catalog Number: 100078
CAS #: 210545-23-6
EC Number: 205-846-6
Format: Powder
Molecular Formula: C8H16N4O3· 2H2O
Molecular Weight: 216.241 g/mol
Personal Protective Equipment: Eyeshields, Gloves, respirator filter
Usage Statement: Unless specified otherwise, MP Biomedical's products are for research or further manufacturing use only, not for direct human use. For more information, please contact our customer service department.

SKU: 02100142-CF
Alternate Names: Acetyltryptophan; DL-α-Acetylamino-3-indolepropionic acid
Base Catalog Number: 100142
CAS #: 87-32-1
EC Number: 201-739-3
Format: Powder
Molecular Formula: C13H14N2O3
Molecular Weight: 246.266 g/mol
Personal Protective Equipment: Eyeshields, Gloves, respirator filter
Usage Statement: Unless specified otherwise, MP Biomedical's products are for research or further manufacturing use only, not for direct human use. For more information, please contact our customer service department.
Application Notes: N-acetyl-DL-tryptophan, is used as stabilizer in the human blood-derived therapeutic products normal serum albumin and plasma protein fraction.
Beilstein Registry Number: 89478
Optical Rotation: 0° ± 2° (c=1, 1N NaOH, 24 hrs.)
Purity: ~99%
UV Visible Absorbance: λ max (water)=280 ± 2 nm

SKU: 02100421-CF
Alternate Names: L-2,5-Diaminopentanoic acid; 2,5-Diaminopentanoic acid monohydrochloride
Base Catalog Number: 100421
CAS #: 3184-13-2
EC Number: 221-678-6
Format: Powder
Molecular Formula: C5H12N2O2 • HCl
Molecular Weight: 168.621 g/mol
Personal Protective Equipment: Eyeshields, Gloves, respirator filter
Usage Statement: Unless specified otherwise, MP Biomedical's products are for research or further manufacturing use only, not for direct human use. For more information, please contact our customer service department.
Beilstein Registry Number: 3625847
Purity: ~99%
Hazard Statements: H319
RTECS Number: RM2985000
Safety Symbol: GHS07
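One small addition worth making (my suggestion, not part of either answer): both snippets import time but never use it, and the loop fires one request per product in quick succession. A short pause between requests is a common courtesy and lowers the chance of being rate-limited; a sketch of the fetch step with a delay, reusing the soup from the listing page above:
import time
import requests

agent = {"User-Agent": "Mozilla/5.0"}
product_urls = [a['href'] for a in soup.find_all('a', href=True)]  # as collected above
pages = []
for url in product_urls:
    pages.append(requests.get(url, headers=agent).text)
    time.sleep(1)   # pause about a second between requests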

Getting Table Info From Page Using Python and BeautifulSoup

The page I am trying to get info from is https://www.pro-football-reference.com/teams/crd/2017_roster.htm.
I'm trying to get all the information from the "Roster" table, but for some reason I can't get it through BeautifulSoup. I've tried soup.find("div", {'id': 'div_games_played_team'}) but it doesn't work. When I look at the page's HTML, I can see the table inside a very large comment as well as in a regular div. How can I use BeautifulSoup to get the information from this table?
You don't need Selenium. What you can do (and you correctly identified it) is pull out the comments and then parse the table from within them.
import requests
from bs4 import BeautifulSoup
from bs4 import Comment
import pandas as pd

url = 'https://www.pro-football-reference.com/teams/crd/2017_roster.htm'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

comments = soup.find_all(string=lambda text: isinstance(text, Comment))

tables = []
for each in comments:
    if 'table' in each:
        try:
            tables.append(pd.read_html(each)[0])
        except ValueError as e:
            print(e)
            continue
Output:
print(tables[0].head().to_string())
No. Player Age Pos G GS Wt Ht College/Univ BirthDate Yrs AV Drafted (tm/rnd/yr) Salary
0 54.0 Bryson Albright 23.0 NaN 7 0.0 245.0 6-5 Miami (OH) 3/15/1994 1 0.0 NaN $246,177
1 36.0 Budda Baker*+ 21.0 ss 16 7.0 195.0 5-10 Washington 1/10/1996 Rook 9.0 Arizona Cardinals / 2nd / 36th pick / 2017 $465,000
2 64.0 Khalif Barnes 35.0 NaN 3 0.0 320.0 6-6 Washington 4/21/1982 12 0.0 Jacksonville Jaguars / 2nd / 52nd pick / 2005 $176,471
3 41.0 Antoine Bethea 33.0 db 15 6.0 206.0 5-11 Howard 7/27/1984 11 4.0 Indianapolis Colts / 6th / 207th pick / 2006 $2,000,000
4 28.0 Justin Bethel 27.0 rcb 16 6.0 200.0 6-0 Presbyterian 6/17/1990 5 3.0 Arizona Cardinals / 6th / 177th pick / 2012 $2,000,000
....
The tag you are trying to scrape is dynamically generated by JavaScript. You are most likely using requests to fetch your HTML. Unfortunately, requests will not run JavaScript; it pulls all the HTML in as raw text, so BeautifulSoup cannot find the tag because it is never generated within your scraping program.
I recommend using Selenium. It's not a perfect solution, just the best one for your problem. The Selenium WebDriver will execute the JavaScript to generate the page's HTML, and then you can use BeautifulSoup to parse whatever it is that you are after. See Selenium with Python for further help on how to get started.
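A minimal sketch of that Selenium route (assuming Selenium 4 with chromedriver available; the div id is the one tried in the question, and pd.read_html will raise if the div is not found):
from selenium import webdriver
from bs4 import BeautifulSoup
import pandas as pd

driver = webdriver.Chrome()
driver.get('https://www.pro-football-reference.com/teams/crd/2017_roster.htm')

# After the page's JavaScript runs, the roster table is in the live DOM,
# so the find() from the question now succeeds.
soup = BeautifulSoup(driver.page_source, 'html.parser')
div = soup.find('div', {'id': 'div_games_played_team'})
driver.quit()

roster = pd.read_html(str(div))[0]
print(roster.head())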

bs4 find table by id, returning 'None'

Not sure why this isn't working :( I'm able to pull other tables from this page, just not this one.
import requests
from bs4 import BeautifulSoup as soup

url = requests.get("https://www.basketball-reference.com/teams/BOS/2018.html",
                   headers={'User-Agent': 'Mozilla/5.0'})
page = soup(url.content, 'html')
table = page.find('table', id='team_and_opponent')
print(table)
Appreciate the help.
The page is dynamic, so you have two options in this case.
Side note: if you see <table> tags, you often don't need BeautifulSoup directly; pandas can do that work for you (it actually uses bs4 under the hood) with pd.read_html().
1) Use Selenium to render the page first, and THEN use BeautifulSoup to pull out the <table> tags.
2) Those tables are within comment tags in the HTML. You can use BeautifulSoup to pull out the comments, then just grab the ones containing 'table'.
I chose option 2.
import requests
from bs4 import BeautifulSoup
from bs4 import Comment
import pandas as pd

url = 'https://www.basketball-reference.com/teams/BOS/2018.html'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

comments = soup.find_all(string=lambda text: isinstance(text, Comment))

tables = []
for each in comments:
    if 'table' in each:
        try:
            tables.append(pd.read_html(each)[0])
        except ValueError:  # comment mentions 'table' but holds no parseable table
            continue
I don't know which particular table you want, but they are all there in the list of tables.
Output:
print (tables[1])
Unnamed: 0 G MP FG FGA ... STL BLK TOV PF PTS
0 Team 82.0 19805 3141 6975 ... 604 373 1149 1618 8529
1 Team/G NaN 241.5 38.3 85.1 ... 7.4 4.5 14.0 19.7 104.0
2 Lg Rank NaN 12 25 25 ... 23 18 15 17 20
3 Year/Year NaN 0.3% -0.9% -0.0% ... -2.1% 9.7% 5.6% -4.0% -3.7%
4 Opponent 82.0 19805 3066 6973 ... 594 364 1159 1571 8235
5 Opponent/G NaN 241.5 37.4 85.0 ... 7.2 4.4 14.1 19.2 100.4
6 Lg Rank NaN 12 3 12 ... 7 6 19 9 3
7 Year/Year NaN 0.3% -3.2% -0.9% ... -4.7% -14.4% 1.6% -5.6% -4.7%
[8 rows x 24 columns]
or
print (tables[18])
Rk Unnamed: 1 Salary
0 1 Gordon Hayward $29,727,900
1 2 Al Horford $27,734,405
2 3 Kyrie Irving $18,868,625
3 4 Jayson Tatum $5,645,400
4 5 Greg Monroe $5,000,000
5 6 Marcus Morris $5,000,000
6 7 Jaylen Brown $4,956,480
7 8 Marcus Smart $4,538,020
8 9 Aron Baynes $4,328,000
9 10 Guerschon Yabusele $2,247,480
10 11 Terry Rozier $1,988,520
11 12 Shane Larkin $1,471,382
12 13 Semi Ojeleye $1,291,892
13 14 Abdel Nader $1,167,333
14 15 Daniel Theis $815,615
15 16 Demetrius Jackson $92,858
16 17 Jarell Eddie $83,129
17 18 Xavier Silas $74,159
18 19 Jonathan Gibson $44,495
19 20 Jabari Bird $0
20 21 Kadeem Allen $0
There is no table with id team_and_opponent in the static HTML of that page; rather, there is a span tag with this id. You can get results by changing the id.
This data is loaded dynamically with JavaScript.
You should take a look here: Web-scraping JavaScript page with Python.
For that you can use Selenium, or requests-html, which supports JavaScript:
import requests
import bs4

url = requests.get("https://www.basketball-reference.com/teams/BOS/2018.html",
                   headers={'User-Agent': 'Mozilla/5.0'})
soup = bs4.BeautifulSoup(url.text, "lxml")
page = soup.select(".table_outer_container")
for i in page:
    print(i.text)
and you will get your desired output.
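If the goal is specifically the team_and_opponent table, a variation on the comment approach above avoids guessing positions in the list: search the comments for that id. A sketch reusing the same libraries:
import requests
import pandas as pd
from bs4 import BeautifulSoup, Comment

response = requests.get("https://www.basketball-reference.com/teams/BOS/2018.html",
                        headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(response.text, 'html.parser')

# Find the commented-out markup that contains the table we want
for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
    if 'id="team_and_opponent"' in comment:
        team_and_opponent = pd.read_html(comment)[0]
        break

print(team_and_opponent.head())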

parse text into different columns in pandas

I have a dataframe containing the query part of multiple URLs.
For example:
in=2015-09-19&stars_4=yes&min=4&a=3&city=New+York,+NY,+United+States&out=2015-09-20&search=1\n
in=2015-09-14&stars_3=yes&min=4&a=3&city=London,+United+Kingdom&out=2015-09-15&search=1\n
in=2015-09-26&Filter=175&min=5&a=2&city=New+York,+NY,+United+States&out=2015-09-27&search=2\n
My desired dataframe should be:
in Filter stars min a max city country out search
--------------------------------------------------------------------------------
2015-09-19 NAN stars_4 4 3 NAN NY US 2015-09-20 1
2015-09-14 NAN stars_3 4 3 NAN LONDON UK 2015-09-15 1
2015-09-26 175 NAN 5 2 NAN NY US 2015-09-27 2
Is there an easy way to do this using regex?
Any help will be much appreciated! Thanks in advance!
A quick-and-dirty fix would be to just use list comprehensions:
import pandas as pd

# strip() drops the trailing '\n', which would otherwise end up in the last value
json_data = [{c[0]: c[1] for c in [b.split('=') for b in line.strip().split('&')]}
             for line in open('data_file.txt')]
df = pd.DataFrame.from_records(json_data)
This won't solve your location classification issues, but will get you a better dataframe from which to work.
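For completeness, the standard library already has a query-string parser. A sketch using urllib.parse.parse_qs on the same data_file.txt (it also URL-decodes values, so 'New+York,+NY' becomes 'New York, NY'):
from urllib.parse import parse_qs
import pandas as pd

records = []
with open('data_file.txt') as f:
    for line in f:
        parsed = parse_qs(line.strip())              # values come back as lists
        records.append({k: v[0] for k, v in parsed.items()})

df = pd.DataFrame.from_records(records)
print(df)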
