How to get the right xpath for this node? scrapy

How to get the right xpath for this node? scrapy - python

I'm new at using web scrapy and I've been trying to get the right xpath from this portion of the code.
from this website
hmtl code
I've been using this scrapy commands:
response.xpath('//*[#id="companycontent"]/div/div/div[2]/div/div[6]/div').getall()
This is the Output:
['<div class="address">\r\n <h4>Address <span>1</span></h4>\r\n <strong>Office : </strong>1715 , 1714<br>\r\n <strong>Floor : </strong>Floor 17<br>\r\n <strong>Building : </strong>Shatha Tower<br>\r\n Dubai Internet City<br><br>\r\n \t\t</div>']
response.xpath('//*[#id="companycontent"]/div/div/div2/div/div[6]/div').get()
'\r\n Address 1\r\n Office : 1715 , 1714\r\n Floor : Floor 17\r\n Building : Shatha Tower\r\n Dubai Internet City\r\n \t\t'
And this one:
response.xpath('//div[contains(#class, "address")]/text()').extract()
with the output:
['\r\n \r\n \r\n \t\t\t\t\t\t\t\t ', '\r\n ', '\r\n ', '1715 , 1714', '\r\n ', 'Floor 17', '\r\n ', 'Shatha Tower', '\r\n Dubai Internet City', '\r\n \t\t', ' \r\n \t\t\r\n\r\n\t\t\t\t\t\t \r\n ']
response.xpath('//div[contains(#class, "address")]/text()').getall()
['\r\n \r\n \r\n \t\t\t\t\t\t\t\t ', '\r\n ', '\r\n ', '1715 , 1714', '\r\n ', 'Floor 17', '\r\n ', 'Shatha Tower', '\r\n Dubai Internet City', '\r\n \t\t', ' \r\n \t\t\r\n\r\n\t\t\t\t\t\t \r\n ']
I'm sure the first command will do the job but I was wondering if there's a shorter xpath command to run the script.
Hope anyone can help me.

Finding text by xpath follow as //tag-name[#class="class-name"] you can follow this approach and find the data
Code:
from selenium import webdriver
path="C:\Program Files (x86)\chromedriver.exe"
driver=webdriver.Chrome(path)
driver.get("https://tecomgroup.ae/directory/company.php?company=0016F00001wcgFJQAY&csrt=2648526569298119449")
data=driver.find_element_by_xpath('//div[#class="address"]')
data.text.split("\n")
Output:
['ADDRESS 1',
'Office : 1715 , 1714',
'Floor : Floor 17',
'Building : Shatha Tower',
'Dubai Internet City']

You could use also css selector response.css('div.address > div.address ::text')
print(`[x.strip() for x in response.css('div.address > div.address ::text').getall() if x.strip()]`)

Related

PFF data scraping not recognizing text

I am trying to scrape PFF.com for football grades with selenium, I am trying to get a specific grade for all Quarterbacks. Problem is, it doesn't seem like it's capturing the text as .text isn't working but I am not getting any NoSuchElementException.
Here's my code:
service = Service(executable_path="C:\\chromedriver.exe")
op = webdriver.ChromeOptions()
driver = webdriver.Chrome(service=service, options=op)
driver.get("https://premium.pff.com/nfl/positions/2022/REG/passing?position=QB")
sleep(2)
sign_in = driver.find_element(By.XPATH, '/html/body/div/div/header/div[3]/button')
sign_in.click()
sleep(2)
email = driver.find_element(By.XPATH, '/html/body/div/div/div/div/div/div/form/div[1]/input')
email.send_keys(my_email)
password = driver.find_element(By.XPATH,
'/html/body/div/div/div/div/div/div/form/div[2]/input')
password.send_keys(my_password)
sleep(2)
sign_in_2 = driver.find_element(By.XPATH,
'/html/body/div/div/div/div/div/div/form/button')
sign_in_2.click()
sleep(2)
all_off_grades = driver.find_elements(By.CSS_SELECTOR, '.kyber-table
.kyber-grade-badge__info-text div')
all_qb_names = driver.find_elements(By.CSS_SELECTOR, '.kyber-table .p-1 a')
qb_grades = []
qb_names = []
for grade in all_off_grades:
qb_grades.append(grade.text)
for qb_name in all_qb_names:
qb_names.append(qb_name.text)
print(qb_grades)
print(qb_names)
The lists keep showing as empty.
Here are the elements I am trying to pull, but for every QB, I already confirmed the other QB's have the same class names for their grade and name.
<div class="kyber-grade-badge__info-text">91.5</div>
need to pull the 91.5
<a class="p-1" href="/nfl/players/2022/REG/josh-allen/46601/passing">Josh Allen</a>
need to pull Josh Allen

#Jbuck3 I tried modifying the locator and it works for me. I am also giving the output I am getting. Let me know that is what you were expecting.
all_off_grades = driver.find_elements(By.CSS_SELECTOR, '.kyber-table-body__scrolling-rows-container .kyber-grade-badge__info-text')
all_qb_names = driver.find_elements(By.CSS_SELECTOR, "a[data-gtm-id = 'player_name']")
And the output I got is:
['91.5', '90.3', '74.6', '-', '-', '60.0', '84.3', '78.3', '78.1', '-', '-', '60.0', '82.8', '83.4', '-', '-', '-', '60.0']
['Josh Allen ', 'Geno Smith ', 'Kirk Cousins ', 'Marcus Mariota ', 'Jameis Winston ', 'Trey Lance ', 'Derek Carr ', 'Justin Fields ', 'Trevor Lawrence ', 'Russell Wilson ', 'Ryan Tannehill ', 'Tom Brady ', 'Tua Tagovailoa ', 'Mac Jones ', 'Davis Mills ', 'Matthew Stafford ', 'Baker Mayfield ', 'Lamar Jackson ', 'Joe Flacco ', 'Matt Ryan ', 'Jalen Hurts ', 'Daniel Jones ', 'Kyler Murray ', 'Justin Herbert ', 'Joe Burrow ', 'Aaron Rodgers ', 'Patrick Mahomes ', 'Mitchell Trubisky ', 'Dak Prescott ', 'Jacoby Brissett ', 'Carson Wentz ', 'Jared Goff ']

Flatten a list inside list of tuples

I have the following list of tuples.
lst =
[
('LexisNexis', ['IT Services and IT Consulting ', ' New York City, NY']),
('AbacusNext', ['IT Services and IT Consulting ', ' La Jolla, California']),
('Aderant', ['Software Development ', ' Atlanta, GA']),
('Anaqua', ['Software Development ', ' Boston, MA']),
('Thomson Reuters Elite', ['Software Development ', ' Eagan, Minnesota']),
('Litify', ['Software Development ', ' Brooklyn, New York'])
]
I want to flatten the lists in each tuple to be part of the tuples of lst.
I found this How do I make a flat list out of a list of lists? but have no idea how to make it adequate to my case.

You can use unpacking:
lst = [('LexisNexis', ['IT Services and IT Consulting ', ' New York City, NY']),
('AbacusNext', ['IT Services and IT Consulting ', ' La Jolla, California']),
('Aderant', ['Software Development ', ' Atlanta, GA']),
('Anaqua', ['Software Development ', ' Boston, MA']),
('Thomson Reuters Elite', ['Software Development ', ' Eagan, Minnesota']),
('Litify', ['Software Development ', ' Brooklyn, New York'])]
output = [(x, *l) for (x, l) in lst]
print(output)
# [('LexisNexis', 'IT Services and IT Consulting ', ' New York City, NY'),
# ('AbacusNext', 'IT Services and IT Consulting ', ' La Jolla, California'),
# ('Aderant', 'Software Development ', ' Atlanta, GA'),
# ('Anaqua', 'Software Development ', ' Boston, MA'),
# ('Thomson Reuters Elite', 'Software Development ', ' Eagan, Minnesota'),
# ('Litify', 'Software Development ', ' Brooklyn, New York')]

I've found the answer by Deacon using abc from collections. It is worth to try it too.
from collections import abc
def flatten(obj):
for o in obj:
# Flatten any iterable class except for strings.
if isinstance(o, abc.Iterable) and not isinstance(o, str):
yield from flatten(o)
else:
yield o
[tuple(flatten(i)) for i in lst]
Out[47]:
[('LexisNexis', 'IT Services and IT Consulting ', ' New York City, NY'),
('AbacusNext', 'IT Services and IT Consulting ', ' La Jolla, California'),
('Aderant', 'Software Development ', ' Atlanta, GA'),
('Anaqua', 'Software Development ', ' Boston, MA'),
('Thomson Reuters Elite', 'Software Development ', ' Eagan, Minnesota'),
('Litify', 'Software Development ', ' Brooklyn, New York')]

Scraping .aspx page with Python yields 404

I'm a web-scraping beginner and am trying to scrape this webpage: https://profiles.doe.mass.edu/statereport/ap.aspx
I'd like to be able to put in some settings at the top (like District, 2020-2021, Computer Science A, Female) and then download the resulting data for those settings.
Here's the code I'm currently using:
import requests
from bs4 import BeautifulSoup
url = 'https://profiles.doe.mass.edu/statereport/ap.aspx'
with requests.Session() as s:
s.headers['User-Agent'] = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:100.0) Gecko/20100101 Firefox/100.0"
r = s.get('https://profiles.doe.mass.edu/statereport/ap.aspx')
soup = BeautifulSoup(r.text,"lxml")
data = {i['name']:i.get('value','') for i in soup.select('input[name]')}
data["ctl00$ContentPlaceHolder1$ddReportType"]="DISTRICT",
data["ctl00$ContentPlaceHolder1$ddYear"]="2021",
data["ctl00$ContentPlaceHolder1$ddSubject"]="COMSCA",
data["ctl00$ContentPlaceHolder1$ddStudentGroup"]="F",
p = s.post(url,data=data)
When I print out p.text, I get a page with title '\t404 - Page Not Found\r\n' and message
<h2>We are unable to locate information at: <br /><br '
'/>http://profiles.doe.mass.edu:80/statereport/ap.aspxp?ASP.NET_SessionId=bxfgao54wru50zl5tkmfml00</h2>\r\n'
Here's what data looks like before I modify it:
{'__EVENTVALIDATION': '/wEdAFXz4796FFICjJ1Xc5ZOd9SwSHUlrrW+2y3gXxnnQf/b23Vhtt4oQyaVxTPpLLu5SKjKYgCipfSrKpW6jkHllWSEpW6/zTHqyc3IGH3Y0p/oA6xdsl0Dt4O8D2I0RxEvXEWFWVOnvCipZArmSoAj/6Nog6zUh+Jhjqd1LNep6GtJczTu236xw2xaJFSzyG+xo1ygDunu7BCYVmh+LuKcW56TG5L0jGOqySgRaEMolHMgR0Wo68k/uWImXPWE+YrUtgDXkgqzsktuw0QHVZv7mSDJ31NaBb64Fs9ARJ5Argo+FxJW/LIaGGeAYoDphL88oao07IP77wrmH6t1R4d88C8ImDHG9DY3sCDemvzhV+wJcnU4a5qVvRziPyzqDWnj3tqRclGoSw0VvVK9w+C3/577Gx5gqF21UsZuYzfP4emcqvJ7ckTiBk7CpZkjUjM6Z9XchlxNjWi1LkzyZ8QMP0MaNCP4CVYJfndopwFzJC7kI3W106YIA/xglzXrSdmq6/MDUCczeqIsmRQGyTOkQFH724RllsbZyHoPHYvoSAJilrMQf6BUERVN4ojysx3fz5qZhZE7DWaJAC882mXz4mEtcevFrLwuVPD7iB2v2mlWoK0S5Chw4WavlmHC+9BRhT36jtBzSPRROlXuc6P9YehFJOmpQXqlVil7C9OylT4Kz5tYzrX9JVWEpeWULgo9Evm+ipJZOKY2YnC41xTK/MbZFxsIxqwHA3IuS10Q5laFojoB+e+FDCqazV9MvcHllsPv2TK3N1oNHA8ODKnEABoLdRgumrTLDF8Lh+k+Y4EROoHhBaO3aMppAI52v3ajRcCFET22jbEm/5+P2TG2dhPhYgtZ8M/e/AoXht29ixVQ1ReO/6bhLIM+i48RTmcl76n1mNjfimB8r3irXQGYIEqCkXlUHZ/SNlRYyx3obJ6E/eljlPveWNidFHOaj+FznOh264qDkMm7fF78WBO2v0x+or1WGijWDdQtRy9WRKXchYxUchmBlYm15YbBfMrIB7+77NJV+M6uIVVnCyiDRGj+oPXcTYxqSUCLrOMQyzYKJeu8/hWD0gOdKeoYUdUUJq4idIk+bLYy76sI/N2aK+aXZo/JPQ+23gTHzIlyi4Io7O6kXaULPs8rfo8hpkH1qXyKb/rP2VJBNWgyp8jOMx9px+m4/e2Iecd86E4eN4Rk6OIiwqGp+dMdgntXu5ruRHb1awPlVmDw92dL1P0b0XxJW7EGfMzyssMDhs1VT6K6iMUTHbuXkNGaEG1dP1h4ktnCwGqDLVutU6UuzT6i4nfqnvFjGK9+7Ze8qWIl8SYyhmvzmgpLjdMuF9CYMQ2Aa79HXLKFACsSSm0dyiU1/ZGyII2Fvga9o+nVV1jZam3LkcAPaXEKwEyJXfN/DA7P4nFAaQ+QP+2bSgrcw+/dw+86OhPyG88qyJwqZODEXE1WB5zSOUywGb1/Xed7wq9WoRs6v8rAK5c/2iH7YLiJ4mUVDo+7WCKrzO5+Hsyah3frMKbheY1acRmSVUzRgCnTx7jvcLGR9Jbt6TredqZaWZBrDFcntdg7EHd7imK5PqjUld3iCVjdyO+yLKUkMKiFD85G3vEferg/Q/TtfVBqeTU0ohP9d+CsKOmV/dxVYWEtBcfa9KiN6j4N8pP7+3iUOhajojZ8jV98kxT0zPZlzkpqI4SwR6Ys8d2RjIi5K+oQul4pL5u+zZvX0lsLP9Jl7FeVTfBvST67T6ohz8dl9gBfmmbwnT23SyuFSUGd6ZGaKE+9kKYmuImW7w3ePs7C70yDWHpIpxP/IJ4GHb36LWto2g3Ld3goCQ4fXPu7C4iTiN6b5WUSlJJsWGF4eQkJue8=',
'__VIEWSTATE': '/wEPDwUKLTM0NzY4OTQ4NmRkDwwPzTpuna+yxVhQxpRF4n2+zYKQtotwRPqzuCkRvyU=',
'__VIEWSTATEGENERATOR': '2B6F8D71',
'ctl00$ContentPlaceHolder1$btnViewReport': 'View Report',
'ctl00$ContentPlaceHolder1$hfExport': 'ViewReport',
'leftNavId': '11241',
'quickSearchValue': '',
'runQuickSearch': 'Y',
'searchType': 'QUICK',
'searchtext': ''}
Following suggestions from similar questions, I've tried playing around with the parameters, editing data in various ways (to emulate the POST request that I see in my browser when I navigate the site myself), and specifying an ASP.NET_SessionId, but to no avail.
How can I access the information from this website?

This should be what you are looking for what I did was use bs4 to parse HTML data and then found the table. Then I get the rows and to make it easier to work with the data I put it into a dictionary.
import requests
from bs4 import BeautifulSoup
url = 'https://profiles.doe.mass.edu/statereport/ap.aspx'
with requests.Session() as s:
s.headers['User-Agent'] = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:100.0) Gecko/20100101 Firefox/100.0"
r = s.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
table = soup.find_all('table')
rows = table[0].find_all('tr')
data = {}
for row in rows:
if row.find_all('th'):
keys = row.find_all('th')
for key in keys:
data[key.text] = []
else:
values = row.find_all('td')
for value in values:
data[keys[values.index(value)].text].append(value.text)
for key in data:
print(key, data[key][:10])
print('\n')
The output:
District Name ['Abington', 'Academy Of the Pacific Rim Charter Public (District)', 'Acton-Boxborough', 'Advanced Math and Science Academy Charter (District)', 'Agawam', 'Amesbury', 'Amherst-Pelham', 'Andover', 'Arlington', 'Ashburnham-Westminster']
District Code ['00010000', '04120000', '06000000', '04300000', '00050000', '00070000', '06050000', '00090000', '00100000', '06100000']
Tests Taken [' 100', ' 109', ' 1,070', ' 504', ' 209', ' 126', ' 178', ' 986', ' 893', ' 97']
Score=1 [' 16', ' 81', ' 12', ' 29', ' 27', ' 18', ' 5', ' 70', ' 72', ' 4']
Score=2 [' 31', ' 20', ' 55', ' 74', ' 65', ' 34', ' 22', ' 182', ' 149', ' 23']
Score=3 [' 37', ' 4', ' 158', ' 142', ' 55', ' 46', ' 37', ' 272', ' 242', ' 32']
Score=4 [' 15', ' 3', ' 344', ' 127', ' 39', ' 19', ' 65', ' 289', ' 270', ' 22']
Score=5 [' 1', ' 1', ' 501', ' 132', ' 23', ' 9', ' 49', ' 173', ' 160', ' 16']
% Score 1-2 [' 47.0', ' 92.7', ' 6.3', ' 20.4', ' 44.0', ' 41.3', ' 15.2', ' 25.6', ' 24.7', ' 27.8']
% Score 3-5 [' 53.0', ' 7.3', ' 93.7', ' 79.6', ' 56.0', ' 58.7', ' 84.8', ' 74.4', ' 75.3', ' 72.2']
Process finished with exit code 0

I was able to get this working by adapting the code from here. I'm not sure why editing the payload in this way made the difference, so I'd be grateful for any insights!
Here's my working code, using Pandas to parse out the tables:
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = 'https://profiles.doe.mass.edu/statereport/ap.aspx'
with requests.Session() as s:
s.headers['User-Agent'] = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:100.0) Gecko/20100101 Firefox/100.0"
response = s.get(url)
soup = BeautifulSoup(response.content, 'html5lib')
data = { tag['name']: tag['value']
for tag in soup.select('input[name^=ctl00]') if tag.get('value')
}
state = { tag['name']: tag['value']
for tag in soup.select('input[name^=__]')
}
payload = data.copy()
payload.update(state)
payload["ctl00$ContentPlaceHolder1$ddReportType"]="DISTRICT",
payload["ctl00$ContentPlaceHolder1$ddYear"]="2021",
payload["ctl00$ContentPlaceHolder1$ddSubject"]="COMSCA",
payload["ctl00$ContentPlaceHolder1$ddStudentGroup"]="F",
p = s.post(url,data=payload)
df = pd.read_html(p.text)[0]
df["District Code"] = df["District Code"].astype(str).str.zfill(8)
display(df)

Python: Get text from all HTML child elements texts with lxml xpath

I am using the lxml xpath of python. I am able to extract text if I give the full path to a HTML tag. However I can't extract all the text from a tag and it's child elements into a list. So for example given this html I would like to get all the texts of the "example" class:
<div class="example">
"Some text"
<div>
"Some text 2"
<p>"Some text 3"</p>
<p>"Some text 4"</p>
<span>"Some text 5"</span>
</div>
<p>"Some text 6"</p>
</div>
I would like to get:
["Some text", "Some text 2", "Some text 3", "Some text 4", "Some text 5", "Some text 6"]

mzjn-s anwer is correct. After some trial and error I've managed to get it working. This is what the end code looks like. You need to put //text() to the end of the xpath. It is without refactoring for the moment, so there will definitely be some mistakes and bad practices but it works.
session = requests.Session()
retry = Retry(connect=3, backoff_factor=0.5)
adapter = HTTPAdapter(max_retries=retry)
session.mount('http://', adapter)
session.mount('https://', adapter)
page = session.get("The url you are webscraping")
content = page.content
htmlsite = urllib.request.urlopen("The url you are webscraping")
soup = BeautifulSoup(htmlsite, 'lxml')
htmlsite.close()
tree = html.fromstring(content)
scraped = tree.xpath('//html[contains(#class, "no-js")]/body/div[contains(#class, "container")]/div[contains(#class, "content")]/div[contains(#class, "row")]/div[contains(#class, "col-md-6")]/div[contains(#class, "clearfix")]//text()')
I've tried it out on the team introduction page of keeleyteton.com. It returned the following list which is correct (although needs lots of amending!) because they are in different tags and some are children tags. Thank you for the help!
['\r\n ', '\r\n ', 'Nicholas F. Galluccio', '\r\n ', '\r\n ', 'Managing Director and Portfolio Manager', '\r\n ', 'Teton Small Cap Select Value', '\r\n ', 'Keeley Teton Small Mid Cap Value', '\r\n ', '\r\n ', '\r\n ', 'Scott R. Butler', '\r\n ', '\r\n ', 'Senior Vice President and Portfolio Manager ', '\r\n ', 'Teton Small Cap Select Value', '\r\n ', 'Keeley Teton Small Mid Cap Value', '\r\n ', '\r\n ', '\r\n ', 'Thomas E. Browne, Jr., CFA', '\r\n ', '\r\n ', 'Portfolio Manager', '\r\n ', 'Keeley Teton Small and Mid Cap Dividend Value', '\r\n ', 'Keeley Teton Small and Small Mid Cap Value', '\r\n ', '\r\n ', '\r\n ', 'Brian P. Leonard, CFA', '\r\n ', '\r\n
', 'Portfolio Manager', '\r\n ', 'Keeley Teton Small and Mid Cap Dividend Value', '\r\n ', 'Keeley Teton Small and Small Mid Cap Value', '\r\n ', '\r\n ', '\r\n ', 'Robert M. Goldsborough', '\r\n ', '\r\n ', 'Research Analyst', '\r\n ', 'Keeley Teton Small and Mid Cap Dividend Value', '\r\n ', '\r\n ', '\r\n ', 'Brian R. Keeley, CFA', '\r\n ', '\r\n ', 'Portfolio Manager', '\r\n ', 'Keeley Teton Small and Small Mid Cap Value', '\r\n ', '\r\n ', '\r\n ', 'Edward S. Borland', '\r\n ', '\r\n
', 'Research Analyst', '\r\n ', 'Keeley Teton Small and Small Mid Cap Value', '\r\n ', '\r\n ', '\r\n ', 'Kevin M. Keeley', '\r\n ', '\r\n ', 'President', '\r\n
', '\r\n ', '\r\n ', 'Deanna B. Marotz', '\r\n ', '\r\n ', 'Chief Compliance Officer', '\r\n ']

Trying to read csv file in python and creating separate table

import numpy as np
import pandas as pd
Trying to read a csv file using pandas
This is the data that I scraped.
Please note that there are Brackets start and end [](Maybe its a list). What should I write so entire data to be in table form? I don't know how to separate Brackets from the data.
[]
['Auburn University (Online Master of Business Administration with concentration in Business Analytics)', ' Masters ', ' US', ' AL', ' /Campus ', ' Raymond J. Harbert College of Business ']
['Auburn University (Data Science)', ' Bachelors ', ' US', ' AL', ' /Campus ', ' Business ']
['The University of Alabama (Master of Science in Marketing, Specialization in Marketing Analytics)', ' Masters ', ' US', ' AL', ' Online/ ', ' Manderson Graduate School of Business ']
['The University of Alabama (MS in Operations Management - Decision Analytics Track)', ' Masters ', ' US', ' AL', ' /Campus ', ' Manderson Graduate School of Business ']
['The University of Alabama (M.S. degree in Applied Statistics, Data Mining Track)', ' Masters ', ' US', ' AL', ' /Campus ', ' Manderson Graduate School of Business ']
['The University of Alabama (MBA with concentration in Business Analytics)', ' Masters ', ' US', ' AL', ' Online/ ', ' Culverhouse College of Commerce ']
['Arkansas Tech University (Business Data Analytics)', ' Bachelors ', ' US', ' AR', ' /Campus ', ' Business ']
['University of Arkansas (Graduate Certificate in Business Analytics)', ' Certificate ', ' US', ' AR', ' Online/ ', ' Sam M. Walton College of Business ']
['University of Arkansas (Master of Information Systems with Business Analytics Concentration)', ' Masters ', ' US', ' AR', ' /Campus ', ' Sam M. Walton College of Business ']
['University of Arkansas (Professional Master of Information Systems)', ' Masters ', ' US', ' AR', ' /Campus ', ' Sam M. Walton College of
How should I read CSV file? I want all the data in a table form. Please help

Your problem is exactly what the error message is telling you it is. The error is in parsing this line:
['The University of Alabama (Master of Science in Marketing,
Specialization in Marketing Analytics)', ' Masters ', ' US', ' AL', '
Online/ ', ' Manderson Graduate School of Business ']
The code ignores quote characters and breaks the line up into fields, making a break wherever it finds the delimiter ", ". You're expecting this to be a single field:
The University of Alabama (Master of Science in Marketing,
Specialization in Marketing Analytics
but this "field" has an instance of the delimiter ", " in it, which the CSV parser will honor because it is ignoring the fact that you have this value in quotes. So this piece of data is broken into two fields:
['The University of Alabama (Master of Science in Marketing
and
Specialization in Marketing Analytics)'
This results in the line being broken into 7 fields, and your code is expecting only 6.
Note that in addition, your items are going to include the quotes, which may not be what you're expecting either, and those square braces don't belong there. In short, this isn't a well formed CSV file.
UPDATE: I'm a regex weenie. I do everything with regex expressions, and can't ignore a challenge like this. Here's a regex-based solution that will read exactly what you want out of this data. If you want it to recognize the last line of your data, you should add "']" to the end of that line.
import regex
from pprint import pprint
def parse_file(file):
linepat = regex.compile(r"\[\s*('([^']*)')?(\s*,\s*'([^']*)')*\s*\]")
with open(file) as f:
r = []
while True:
line = f.readline()
if not line:
break
line = line.strip()
if len(line) == 0:
continue
m = linepat.match(line)
if m and m.captures(4):
fields = [m.group(2)] + [s.strip() for s in m.captures(4)]
r.append(fields)
return r
def main():
r = parse_file("/tmp/blah.csv")
pprint(r)
main()
Result:
[['Auburn University (Online Master of Business Administration with '
'concentration in Business Analytics)',
'Masters',
'US',
'AL',
'/Campus',
'Raymond J. Harbert College of Business'],
...
['University of Arkansas (Professional Master of Information Systems)',
'Masters',
'US',
'AR',
'/Campus',
'Sam M. Walton College of']]
Note that this doesn't use the built-in 're' module. That module doesn't deal with repeating groups, which is a must for this kind of problem. Also note that this doesn't involve Pandas. I don't know anything about that module, I assume it is trivial to feed the clean, parsed data from this code into Pandas if that's where you really want it.

the basic method to read file.csv.
def process(string):
print("Processing:",string)
data = []
for line in open("file.csv"):
process(string)
line = line.replace("\n","")
process_code()

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to get the right xpath for this node? scrapy - python

You could use also css selector response.css('div.address > div.address ::text') print(`[x.strip() for x in response.css('div.address > div.address ::text').getall() if x.strip()]`)

Related

PFF data scraping not recognizing text

Flatten a list inside list of tuples

Scraping .aspx page with Python yields 404

Python: Get text from all HTML child elements texts with lxml xpath

Trying to read csv file in python and creating separate table

Categories

Resources