XPath preceding-sibling issue - Python

I'm building a web scraper with the Scrapy Python library to scrape this page http://espn.go.com/nba/teams and fill a database with all the team names and their corresponding divisions. I'm writing my parse function, but I still don't understand exactly how to extract the division name that matches each team.
[1] https://www.dropbox.com/s/jv1n49rg4p6p2yh/2014-12-29%2014.08.07-2.jpg?dl=0
def parse(self, response):
    items = []
    mex = "//div[@class='span-6']/div[@class='span-4']/div/div/div/div[2]/ul/li"
    i = 0
    for sel in response.xpath(mex):
        item = TeamStats()
        item['team'] = sel.xpath(mex + "/div/h5/a/text()")[i]
        item['division'] = sel.xpath("//div[@class='span-6']/div[@class='span-4']/div/div/div/div[1]/h4")
        items.append(item)
        i = i + 1
    return items
My parse function returns a list of teams and a corresponding divisions list that contains ALL divisions. Now I'm not really sure how to pick out the exact division: it seems I must navigate up the DOM from the selected team name (represented by item['team'] = sel.xpath(mex + "/div/h5/a/text()")[i]) using a preceding-sibling relation to get the RIGHT division, but I'm not sure how to write that. (I was going to include the website I've been following as a tutorial, but I don't have 10 reputation points.)
If I'm on the wrong track with this, let me know, as I'm no XPath expert. I'm actually not even sure I need the counter: if I remove the [i], I just get 30 lists, each containing all 30 teams.

Let's make it simpler.
Each division is represented by a div with the mod-teams-list-medium class. Each division div consists of 2 parts:
a div with class="mod-header" containing the division name
a div with class="mod-content" containing the list of teams
Inside your spider it would be reflected this way:
for division in response.xpath('//div[@id="content"]//div[contains(@class, "mod-teams-list-medium")]'):
    division_name = division.xpath('.//div[contains(@class, "mod-header")]/h4/text()').extract()[0]
    print(division_name)
    print()
    for team in division.xpath('.//div[contains(@class, "mod-content")]//li'):
        team_name = team.xpath('.//h5/a/text()').extract()[0]
        print(team_name)
    print("------")
And here is what I'm getting on the console:
Atlantic
Boston Celtics
Brooklyn Nets
New York Knicks
Philadelphia 76ers
Toronto Raptors
------
Pacific
Golden State Warriors
Los Angeles Clippers
Los Angeles Lakers
Phoenix Suns
Sacramento Kings
------
...
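If it helps to experiment outside a spider, the same nested loop can be run against a stub of the page with lxml, the XPath library Scrapy's selectors are built on. The HTML below is a minimal stand-in for the ESPN page structure described above, not the real markup:

```python
from lxml import html as lhtml

# minimal stand-in for the ESPN page structure (assumed, not the real markup)
page = lhtml.fromstring("""
<div id="content">
  <div class="mod-teams-list-medium">
    <div class="mod-header"><h4>Atlantic</h4></div>
    <div class="mod-content"><ul>
      <li><div><h5><a href="#">Boston Celtics</a></h5></div></li>
      <li><div><h5><a href="#">Brooklyn Nets</a></h5></div></li>
    </ul></div>
  </div>
</div>
""")

items = []
for division in page.xpath('//div[@id="content"]//div[contains(@class, "mod-teams-list-medium")]'):
    division_name = division.xpath('.//div[contains(@class, "mod-header")]/h4/text()')[0]
    for team in division.xpath('.//div[contains(@class, "mod-content")]//li'):
        items.append({'team': team.xpath('.//h5/a/text()')[0], 'division': division_name})
print(items)
```

Each item carries its enclosing division's name, which is exactly the pairing the question asks for.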


Using regex to capture substring within a pandas df

I’m trying to extract specific substrings from larger phrases contained in my Pandas dataframe. I have rows formatted like so:
Appointment of DAVID MERRIGAN of Hammonds Plains, Nova Scotia, to be a member of the Inuvialuit Arbitration Board, to hold office during pleasure for a term of three years.
Appointment of CARLA R. CONKIN of Fort Steele, British Columbia, to be Vice-Chairman of the Inuvialuit Arbitration Board, to hold office during pleasure for a term of three years.
Appointment of JUDY A. WHITE, Q.C., of Conne River, Newfoundland and Labrador, to be Chairman of the Inuvialuit Arbitration Board, to hold office during pleasure for a term of three years.
Appointment of GRETA SITTICHINLI of Inuvik, Northwest Territories, to be a member of the Inuvialuit Arbitration Board, to hold office during pleasure for a term of three years.
and I've been able to capture the capitalized names (e.g. DAVID MERRIGAN) with the regex below, but I'm struggling to capture the locations, i.e. the 'of' phrase following the capitalized name that ends at the second comma. I've tried isolating the rest of the string that follows the name with the following code, but it just doesn't seem to work; I keep getting -1 as a response.
df_appointments['Name'] = df_appointments['Precis'].str.find(r'\b[A-Z]+(?:\s+[A-Z]+)')
df_appointments['Location'] = df_appointments['Precis'].str.find(r'\b[A-Z]+(?:\s+[A-Z]+)\b\s([^\n\r]*)')
Any help showing me how to isolate the location substring with regex (after that I can figure out how to get the position, etc) would be tremendously appreciated. Thank you.
The following pattern works for your sample set:
rgx = r'(?:\w\s)+([A-Z\s\.,]+)(?:\sof\s)([A-Za-z\s]+,\s[A-Za-z\s]+)'
It uses capture groups & non-capture groups to isolate only the names & locations from the strings. Rather than requiring two patterns, and having to perform two searches, you can then do the following to extract that information into two new columns:
df[['name', 'location']] = df['precis'].str.extract(rgx)
This then produces:
df
              precis                  name                          location
0  Appointment of...        DAVID MERRIGAN      Hammonds Plains, Nova Scotia
1  Appointment of...       CARLA R. CONKIN     Fort Steele, British Columbia
2  Appointment of...  JUDY A. WHITE, Q.C.,  Conne River, Newfoundland and...
3  Appointment of...     GRETA SITTICHINLI     Inuvik, Northwest Territories
Depending on the exact format of all of your precis values, you might have to tweak the pattern to suit perfectly, but hopefully it gets you going...
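The pattern can be sanity-checked with the standard re module directly, on one sample string from the question:

```python
import re

rgx = r'(?:\w\s)+([A-Z\s\.,]+)(?:\sof\s)([A-Za-z\s]+,\s[A-Za-z\s]+)'

precis = ("Appointment of DAVID MERRIGAN of Hammonds Plains, Nova Scotia, "
          "to be a member of the Inuvialuit Arbitration Board, to hold office "
          "during pleasure for a term of three years.")

# group 1 is the capitalized name, group 2 the "City, Province" location
name, location = re.search(rgx, precis).groups()
print(name)      # DAVID MERRIGAN
print(location)  # Hammonds Plains, Nova Scotia
```

str.extract applies the same search row-wise, placing group 1 and group 2 into the two new columns.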
# Final Answer
import pandas as pd

data = pd.read_csv(r"C:\Users\yueheng.li\Desktop\Up\Question_20220824\Data.csv")
data[['Field_Part1', 'Field_Part2', 'Field_Part3']] = data['Precis'].str.split('of', n=2, expand=True)
data['Address_part1'] = data['Field_Part3'].str.split(',').str[0]
data['Address_part2'] = data['Field_Part3'].str.split(',').str[1]
data['Address'] = data['Address_part1'] + ',' + data['Address_part2']
data.drop(['Field_Part1', 'Field_Part2', 'Field_Part3', 'Address_part1', 'Address_part2'], axis=1, inplace=True)
# Output Below
data
Easy Way to understand
Thanks
Leon
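The split logic here can be checked without pandas, since Series.str.split mirrors plain str.split (sample string from the question; the .strip() tidying is an addition):

```python
precis = ("Appointment of DAVID MERRIGAN of Hammonds Plains, Nova Scotia, "
          "to be a member of the Inuvialuit Arbitration Board, to hold office "
          "during pleasure for a term of three years.")

# split on 'of' at most twice: the third piece starts with the address
part3 = precis.split('of', 2)[2]

# first two comma-separated fields of that piece are "City, Province"
address = ','.join(part3.split(',')[:2]).strip()
print(address)  # Hammonds Plains, Nova Scotia
```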

Why does my program output multiple times?

My Python program works fine, but it keeps printing the answer at least 5 times over, and I am racking my brain as to why. Any ideas?
Text = """The University of Wisconsin–Milwaukee is a public urban research university in Milwaukee, Wisconsin. It is the largest university in the Milwaukee metropolitan area and a member of the University of Wisconsin System. It is also one of the two doctoral degree-granting public universities and the second largest university in Wisconsin. The university consists of 14 schools and colleges, including the only graduate school of freshwater science in the U.S., the first CEPH accredited dedicated school of public health in Wisconsin, and the state"s only school of architecture. As of the 2015–2016 school year, the University of Wisconsin–Milwaukee had an enrollment of 27,156, with 1,604 faculty members, offering 191 degree programs, including 94 bachelor's, 64 master's and 33 doctorate degrees. The university is classified among "R1: Doctoral Universities – Highest research activity". In 2018, the university had a research expenditure of $55 million. The university's athletic teams are the Panthers. A total of 15 Panther athletic teams compete in NCAA Division I. Panthers have won the James J. McCafferty Trophy as the Horizon League's all-sports champions seven times since 2000. They have earned 133 Horizon League titles and made 40 NCAA tournament appearances as of 2016."""
for punc in "–.,\n":
    Text = Text.replace(punc, " ")
Text = Text.lower()
word_list = Text.split()
dict = {}
for word in word_list:
    dict[word] = dict.get(word, 0) + 1
    word_freq = []
    for key, value in sorted(dict.items()):
        if value > 5:
            print(key, value)
You have an indentation issue that leads to a nested for loop. Fix the code to:
Text = """The University of Wisconsin–Milwaukee is a public urban research university in Milwaukee, Wisconsin. It is the largest university in the Milwaukee metropolitan area and a member of the University of Wisconsin System. It is also one of the two doctoral degree-granting public universities and the second largest university in Wisconsin. The university consists of 14 schools and colleges, including the only graduate school of freshwater science in the U.S., the first CEPH accredited dedicated school of public health in Wisconsin, and the state"s only school of architecture. As of the 2015–2016 school year, the University of Wisconsin–Milwaukee had an enrollment of 27,156, with 1,604 faculty members, offering 191 degree programs, including 94 bachelor's, 64 master's and 33 doctorate degrees. The university is classified among "R1: Doctoral Universities – Highest research activity". In 2018, the university had a research expenditure of $55 million. The university's athletic teams are the Panthers. A total of 15 Panther athletic teams compete in NCAA Division I. Panthers have won the James J. McCafferty Trophy as the Horizon League's all-sports champions seven times since 2000. They have earned 133 Horizon League titles and made 40 NCAA tournament appearances as of 2016."""
for punc in "–.,\n":
    Text = Text.replace(punc, " ")
Text = Text.lower()
word_list = Text.split()
freq = {}
for word in word_list:
    freq[word] = freq.get(word, 0) + 1
for key, value in sorted(freq.items()):
    if value > 5:
        print(key, value)
Since the loops are nested, the print(key, value) line gets executed every time the outer loop moves on to the next word. As your freq dictionary grows, it inevitably keeps printing the qualifying entries on every iteration, leading to redundant output.
=> You probably don't want that; you only want to print the frequencies ONCE, after the first loop has finished counting every word. Separating the loops means the second loop only runs after the first one has finished.
Edit: Another thing, pointed out by @random-davis, is that you shouldn't shadow a built-in name like dict with your own variable. Rename it to freq, dictionary, or something else.
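As an aside, the counting loop can be replaced with collections.Counter from the standard library; a minimal sketch on a made-up sentence (threshold lowered to fit the short sample):

```python
from collections import Counter

text = "the university of wisconsin milwaukee is the largest university in the area"

# Counter does the word -> count dictionary in one call
freq = Counter(text.split())
for word, count in sorted(freq.items()):
    if count > 1:
        print(word, count)
```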

Selenium scraping dynamic infinite scroll without AJAX

Intent: Scrape company data from the Inc.5000 list (e.g., rank, company name, growth, industry, state, city, description (via hovering over company name)).
Problem: From what I can see, data from the list is dynamically generated in the browser (no AJAX). Additionally, I can't just scroll to the bottom and then scrape the whole page because only a certain number of companies are available at any one time. In other words, companies 1-10 render, but once I scroll to companies 500-510, companies 1-10 are "de-rendered".
Current effort: The following code is where I'm at now.
driver = webdriver.Chrome()
driver.implicitly_wait(30)
driver.get('https://www.inc.com/inc5000/list/2020')

all_companies = []
scroll_max = 600645  # found via Selenium IDE
curr_scroll = 0
next_scroll = curr_scroll + 2000

for elem in driver.find_elements_by_class_name('franchise-list__companies'):
    while curr_scroll <= scroll_max:
        scroll_fn = ''.join(("window.scrollTo(", str(curr_scroll), ", ", str(next_scroll), ")"))
        driver.execute_script(scroll_fn)
        all_companies.append(elem.text.split('\n'))
        print('Current length: ', len(all_companies))
        curr_scroll += 2000
        next_scroll += 2000
Most SO posts on infinite scroll deal with pages that either keep the generated data rendered as scrolling occurs, or fire AJAX requests that can be tapped. This problem is an exception to both (but if I missed an applicable SO post, feel free to point me in that direction).
Problems:
Redundant data is scraped (e.g. a single company may be scraped twice)
I still have to split out the data afterwards (the final destination is a Pandas dataframe)
It doesn't include the company description (seen by hovering over the company name)
It's slow (I realize this is a caveat of Selenium itself, but I think the code can be optimized)
The data is loaded from an external URL. To print all companies, you can use this example:
import json
import requests

url = 'https://www.inc.com/rest/i5list/2020'
data = requests.get(url).json()

# uncomment this to print all data:
# print(json.dumps(data, indent=4))

for i, company in enumerate(data['companies'], 1):
    print('{:>05d} {}'.format(i, company['company']))
    # the hover text is stored in company['ifc_business_model']
Prints:
00001 OneTrust
00002 Create Music Group
00003 Lovell Government Services
00004 Avalon Healthcare Solutions
00005 ZULIE VENTURE INC
00006 Hunt A Killer
00007 Case Energy Partners
00008 Nationwide Mortgage Bankers
00009 Paxon Energy
00010 Inspire11
00011 Nugget
00012 TRYFACTA
00013 CannaSafe
00014 BRUMATE
00015 Resource Innovations
...and so on.
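Since the final destination mentioned in the question is a Pandas dataframe, note that this JSON is already a list of dicts, which maps straight onto rows. A sketch with a stand-in payload shaped like the response above (field names taken from the answer; the description texts are invented):

```python
# stand-in for requests.get(url).json(), shaped like the answer describes
data = {
    'companies': [
        {'company': 'OneTrust', 'ifc_business_model': 'Privacy software'},
        {'company': 'Create Music Group', 'ifc_business_model': 'Music distribution'},
    ]
}

# one row per company, with the hover text as the description column
rows = [
    {'company': c['company'], 'description': c.get('ifc_business_model', '')}
    for c in data['companies']
]
print(rows[0]['company'])  # OneTrust
```

pandas.DataFrame(rows) then yields the desired dataframe in one call, with no scrolling, de-duplication, or post-hoc splitting.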

Beautiful Soup: Extracting href from HTML ordered list

I am attempting to extract the URLs from within an HTML ordered list using the BeautifulSoup Python module. My code returns a list of None values equal in number to the items in the ordered list, so I know I'm in the right place in the document. What am I doing wrong?
The URL I am scraping from is http://www.dailykos.com/story/2013/04/27/1203495/-GunFAIL-XV
Here are 5 of 50 lines from the HTML list (apologies for the length):
<div id="body" class="article-body">
<ol>
<li>WACO, TX, 3/18/13: Police responding to a domestic disturbance call found a man struggling to restrain his grandson, who was agitated and holding an AR-15. The cops shot grandpa. But that would totally never happen in a crowded theater.</li>
<li>GROSSE POINTE PARK, MI, 4/06/13: Grosse Pointe Park police arrested a 20-year-old Detroit man April 6 after he accidentally shot a 9mm handgun into the floor of a home in the 1000 block of Beaconsfield. The man was trying to make the gun safe when it discharged.</li>
<li>OTTAWA, KS, 4/13/13: No one was injured when a “negligent” rifle shot rang out Saturday night inside a residence in the 1600 block of South Cedar Street in Ottawa. Dylan Spencer, 22, Ottawa, was arrested by Ottawa police about 7 p.m. on suspicion of unlawfully discharging an AR-15 rifle in his apartment, according to a police report. The bullet exited his apartment, passed through both walls of an occupied apartment and lodged into a utility pole. But of course, Dylan didn't think the gun was loaded. So it's cool.</li>
<li>KLAMATH FALLS, OR, 4/13/13: An investigation into the shooting death of Lee Roy Myers, 47, has been ruled accidental. The Klamath County Major Crimes Team was called to investigate a shooting on Saturday, April 13. An autopsy concluded the cause of death was an accidental, self-inflicted handgun wound.</li>
<li>SOUTHAMPTON, NY, 4/13/13: The report states that the detective visited the home and interviewed the man, who legally owned the Ruger 10/22 rifle. The man said he was cleaning the rifle when it accidentally discharged into his big toe. When the rifle was pointed in a downward angle, inertia caused the firing pin to strike the primer, which caused the rifle to fire, according to the incident report. The detective advised the man on safety techniques while cleaning his rifle. (Step one: unload it.)</li>
And here is my code:
page = urllib2.urlopen(url)
soup = BeautifulSoup(page)
li = soup.select("ol > li")
for link in li:
    print(link.get('href'))
You're iterating over li elements, which don't have an href attribute. The a tags inside them do:
import urllib2
from bs4 import BeautifulSoup

url = "http://www.dailykos.com/story/2013/04/27/1203495/-GunFAIL-XV"
page = urllib2.urlopen(url).read()
soup = BeautifulSoup(page)

li = soup.select("ol > li > a")
for link in li:
    print(link.get('href'))
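To see the difference between the two selectors without fetching the page, here is a self-contained snippet on a stub of the article's HTML (the example.com links are placeholders):

```python
from bs4 import BeautifulSoup

html = """<div id="body" class="article-body"><ol>
<li>WACO, TX, 3/18/13: <a href="http://example.com/waco">story</a></li>
<li>OTTAWA, KS, 4/13/13: <a href="http://example.com/ottawa">story</a></li>
</ol></div>"""

soup = BeautifulSoup(html, "html.parser")
# li elements carry no href, so .get('href') yields None for each
print([li.get('href') for li in soup.select("ol > li")])
# the a tags inside them hold the actual links
print([a.get('href') for a in soup.select("ol > li > a")])
```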

Compensating for "variance" in a survey

The title for this one was quite tricky.
I'm trying to solve the following scenario:
Imagine a survey was sent out to XXXXX people, asking them what their favourite football club was.
From the responses, it's obvious that while many are favourites of the same club, they all "expressed" it in different ways.
For example,
For Manchester United, some variations include...
Man U
Man Utd.
Manchester U
Manchester Utd
All are obviously the same club; however, using a simple technique of exact string matching, each would be counted as a separate result.
Now, to further complicate the scenario, let's say that because of the sheer volume of different clubs (e.g. Man City as M. City, Manchester City, etc.), all plagued with the same problem, it's impossible to manually "enter" these variants and use them to build a custom filter that converts Man U -> Manchester United, Man Utd. -> Manchester United, etc. Instead, we want to automate this filter: look for the most likely match and convert the data accordingly.
I'm trying to do this in Python (from a .csv file), but I welcome any pseudo answers that outline a good approach to solving this.
Edit: Some additional information
This isn't working off a set list of clubs; the idea is to "cluster" the ones we have together.
The assumption is there are no spelling mistakes.
There is no assumed number of clubs.
And the survey list is long, long enough that it doesn't warrant doing this manually (1000s of responses).
Google Refine does just this, but I'll assume you want to roll your own.
Note, difflib is built into Python, and has lots of features (including eliminating junk elements). I'd start with that.
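A minimal sketch of what difflib gives you out of the box (the club list and spelling are just examples):

```python
import difflib

# canonical names you are clustering toward (example list)
canonical = ["Manchester United", "Manchester City", "Arsenal"]

# closest canonical name to a survey answer, if similar enough;
# cutoff is the minimum similarity ratio (0..1), tuned by eye here
best = difflib.get_close_matches("Man Utd.", canonical, n=1, cutoff=0.4)
print(best)
```

get_close_matches ranks candidates by SequenceMatcher ratio, so abbreviations that share a long prefix with the canonical name score well.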
You probably don't want to do it in a completely automated fashion. I'd do something like this:
# load corrections file, mapping user input -> output
# load survey
import difflib

possible_values = list(corrections.values())
for answer in survey:
    output = corrections.get(answer, None)
    if output is None:
        likely_outputs = difflib.get_close_matches(answer, possible_values)
        output = get_user_to_select_output_or_add_new(likely_outputs)
        corrections[answer] = output
        possible_values.append(output)
save_corrections_as_csv()
Please edit your question with answers to the following:
You say "we want to automate this filter, to look for the most likely match" -- match to what?? Do you have a list of the standard names of all of the possible football clubs, or do the many variations of each name need to be clustered to create such a list?
How many clubs?
How many survey responses?
After doing very light normalisation (replace . by space, strip leading/trailing whitespace, replace runs of whitespace by a single space, convert to lower case [in that order]) and counting, how many unique responses do you have?
Your focus seems to be on abbreviations of the standard name. Do you need to cope with nicknames e.g. Gunners -> Arsenal, Spurs -> Tottenham Hotspur? Acronyms (WBA -> West Bromwich Albion)? What about spelling mistakes, keyboard mistakes, SMS-dialect, ...? In general, what studies of your data have you done and what were the results?
You say """its impossible to manually "enter" these variances""" -- is it possible/permissible to "enter" some "variances" e.g. to cope with nicknames as above?
What are your criteria for success in this exercise, and how will you measure it?
It seems to me that you could convert many of these into a standard form by taking the string, lower-casing it, removing all punctuation, and then comparing the start of each word.
If you had a list of all the actual club names, you could compare directly against that as well; and for strings which don't match any actual team on their first n letters, you could try a lexicographical comparison against any of the returned strings which do match.
It's not perfect, but it should get you 99% of the way there.
import string

def words(s):
    s = s.lower().strip(string.punctuation)
    return s.split()

def bestMatchingWord(word, matchWords):
    score, best = 0., ''
    for matchWord in matchWords:
        matchScore = sum(w == m for w, m in zip(word, matchWord)) / (len(word) + 0.01)
        if matchScore > score:
            score, best = matchScore, matchWord
    return score, best

def bestMatchingSentence(wordList, matchSentences):
    score, best = 0., []
    for matchSentence in matchSentences:
        total, matched = 0., []
        for word in wordList:
            s, w = bestMatchingWord(word, matchSentence)
            total += s
            matched.append(w)
        if total > score:
            score, best = total, matched
    return score, best

def main():
    data = (
        "Man U",
        "Man. Utd.",
        "Manch Utd",
        "Manchester U",
        "Manchester Utd"
    )
    teamList = (
        ('arsenal',),
        ('aston', 'villa'),
        ('birmingham', 'city', 'bham'),
        ('blackburn', 'rovers', 'bburn'),
        ('blackpool', 'bpool'),
        ('bolton', 'wanderers'),
        ('chelsea',),
        ('everton',),
        ('fulham',),
        ('liverpool',),
        ('manchester', 'city', 'cty'),
        ('manchester', 'united', 'utd'),
        ('newcastle', 'united', 'utd'),
        ('stoke', 'city'),
        ('sunderland',),
        ('tottenham', 'hotspur'),
        ('west', 'bromwich', 'albion'),
        ('west', 'ham', 'united', 'utd'),
        ('wigan', 'athletic'),
        ('wolverhampton', 'wanderers')
    )
    for d in data:
        print("{0:20} {1}".format(d, bestMatchingSentence(words(d), teamList)))

if __name__ == "__main__":
    main()
Run on the sample data, this gets you:
Man U (1.9867767507647776, ['manchester', 'united'])
Man. Utd. (1.7448074166742613, ['manchester', 'utd'])
Manch Utd (1.9946817328797555, ['manchester', 'utd'])
Manchester U (1.989100008901989, ['manchester', 'united'])
Manchester Utd (1.9956787398647866, ['manchester', 'utd'])
