I am scraping a wiki page, but some rows contain empty <td> elements, so I used:
for tr in table1.tbody:
    list = []
    for td in tr:
        try:
            if td.text is None:
                list.append('NA')
            else:
                list.append(td.text.strip())
        except:
            list.append(td.strip())
to store each row's elements in a list. But when I print row_list, the rows with empty <td> values, which should now have 'NA' appended, are still empty, i.e. 'NA' has not been appended to the list.
How could I fix this?
Note: the question needs improvement. While you update it, here are just two options for a fix.
Option #1
Use pandas to get the tables in a quick and proper way:
import pandas as pd
pd.concat(pd.read_html('https://en.wikipedia.org/wiki/List_of_Falcon_9_and_Falcon_Heavy_launches#Past_launches')[2:11])
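If you then want a single DataFrame to work with, a minimal follow-up sketch (the slice [2:11] selecting the launch tables is taken from the line above; resetting the index is an assumption about the desired shape):

import pandas as pd

url = 'https://en.wikipedia.org/wiki/List_of_Falcon_9_and_Falcon_Heavy_launches#Past_launches'
# concatenate tables 2..10 into one frame and renumber the rows
df = pd.concat(pd.read_html(url)[2:11]).reset_index(drop=True)
print(df.head())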
Option #2
Put the list outside, before your loops, so you avoid overwriting it, and check your indentation:
data = []
for tr in table1.tbody:
    for td in tr:
        try:
            if td.text is None:
                data.append('NA')
            else:
                data.append(td.text.strip())
        except:
            data.append(td.strip())
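Since the original goal was one list per row (row_list), here is a small sketch of the same idea keeping one list per <tr>; it assumes iterating with find_all, as the answer further down also does:

data = []
for tr in table1.tbody.find_all('tr'):
    row = []
    for td in tr.find_all('td'):
        text = td.text.strip()
        row.append(text if text else 'NA')
    data.append(row)  # one list per table row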
A few things here:
don't use list as a variable name; it shadows the built-in list type in Python.
td.text is not None; there is actually a string as content (i.e. ' ').
You are not iterating through the tr tags and td tags (or at least not in the code you are providing here). You need to create your lists of tr tags and td elements to use in your for loops.
Try this:
import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/List_of_Falcon_9_and_Falcon_Heavy_launches#Past_launches'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
table1 = soup.find_all('table')[2]

stored_list = []
for tr in table1.tbody.find_all('tr'):
    for td in tr.find_all('td'):
        if td.text.strip() == '':
            stored_list.append('NA')
        else:
            stored_list.append(td.text.strip())
Output:
print(stored_list)
['4 June 2010,18:45', 'F9 v1.0[7]B0003[8]', 'CCAFS,SLC-40', 'Dragon Spacecraft Qualification Unit', 'NA', 'LEO', 'SpaceX', 'Success', 'Failure[9][10](parachute)', 'First flight of Falcon 9 v1.0.[11] Used a boilerplate version of Dragon capsule which was not designed to separate from the second stage.(more details below) Attempted to recover the first stage by parachuting it into the ocean, but it burned up on reentry, before the parachutes even go to deploy.[12]', '8 December 2010,15:43[13]', 'F9 v1.0[7]B0004[8]', 'CCAFS,SLC-40', 'Dragon demo flight C1(Dragon C101)', 'NA', 'LEO (ISS)', 'NASA (COTS)\nNRO', 'Success[9]', 'Failure[9][14](parachute)', "Maiden flight of SpaceX's Dragon capsule, consisting of over 3 hours of testing thruster maneuvering and then reentry.[15] Attempted to recover the first stage by parachuting it into the ocean, but it disintegrated upon reentry, again before the parachutes were deployed.[12] (more details below) It also included two CubeSats,[16] and a wheel of Brouère cheese. Before the launch, SpaceX discovered that there was a crack in the nozzle of the 2nd stage's Merlin vacuum engine. So Elon just had them cut off the end of the nozzle with a pair of shears and launched the rocket a few days later. After SpaceX had trimmed the nozzle, NASA was notified of the change and they agreed to it.[17]", '22 May 2012,07:44[18]', 'F9 v1.0[7]B0005[8]', 'CCAFS,SLC-40', 'Dragon demo flight C2+[19](Dragon C102)', '525\xa0kg (1,157\xa0lb)[20] (excl. Dragon mass)', 'LEO (ISS)', 'NASA (COTS)', 'Success[21]', 'No attempt', 'The Dragon spacecraft demonstrated a series of tests before it was allowed to approach the International Space Station. Two days later, it became the first commercial spacecraft to board the ISS.[18] (more details below)', '8 October 2012,00:35[22]', 'F9 v1.0[7]B0006[8]', 'CCAFS,SLC-40', 'SpaceX CRS-1[23](Dragon C103)', '4,700\xa0kg (10,400\xa0lb) (excl. Dragon mass)', 'LEO (ISS)', 'NASA (CRS)', 'Success', 'No attempt', 'Orbcomm-OG2[24]', '172\xa0kg (379\xa0lb)[25]', 'LEO', 'Orbcomm', 'Partial failure[26]', "CRS-1 was successful, but the secondary payload was inserted into an abnormally low orbit and subsequently lost. This was due to one of the nine Merlin engines shutting down during the launch, and NASA declining a second reignition, as per ISS visiting vehicle safety rules, the primary payload owner is contractually allowed to decline a second reignition. NASA stated that this was because SpaceX could not guarantee a high enough likelihood of the second stage completing the second burn successfully which was required to avoid any risk of secondary payload's collision with the ISS.[27][28][29]", '1 March 2013,15:10', 'F9 v1.0[7]B0007[8]', 'CCAFS,SLC-40', 'SpaceX CRS-2[23](Dragon C104)', '4,877\xa0kg (10,752\xa0lb) (excl. Dragon mass)', 'LEO (ISS)', 'NASA (CRS)', 'Success', 'No attempt', 'Last launch of the original Falcon 9 v1.0 launch vehicle, first use of the unpressurized trunk section of Dragon.[30]', '29 September 2013,16:00[31]', 'F9 v1.1[7]B1003[8]', 'VAFB,SLC-4E', 'CASSIOPE[23][32]', '500\xa0kg (1,100\xa0lb)', 'Polar orbit LEO', 'MDA', 'Success[31]', 'Uncontrolled(ocean)[d]', 'First commercial mission with a private customer, first launch from Vandenberg, and demonstration flight of Falcon 9 v1.1 with an improved 13-tonne to LEO capacity.[30] After separation from the second stage carrying Canadian commercial and scientific satellites, the first stage booster performed a controlled reentry,[33] and an ocean touchdown test for the first time. 
This provided good test data, even though the booster started rolling as it neared the ocean, leading to the shutdown of the central engine as the roll depleted it of fuel, resulting in a hard impact with the ocean.[31] This was the first known attempt of a rocket engine being lit to perform a supersonic retro propulsion, and allowed SpaceX to enter a public-private partnership with NASA and its Mars entry, descent, and landing technologies research projects.[34] (more details below)', '3 December 2013,22:41[35]', 'F9 v1.1B1004', 'CCAFS,SLC-40', 'SES-8[23][36][37]', '3,170\xa0kg (6,990\xa0lb)', 'GTO', 'SES', 'Success[38]', 'No attempt[39]', 'First Geostationary transfer orbit (GTO) launch for Falcon 9,[36] and first successful reignition of the second stage.[40] SES-8 was inserted into a Super-Synchronous Transfer Orbit of 79,341\xa0km (49,300\xa0mi) in apogee with an inclination of 20.55° to the equator.']
Related
I have the following HTML inside the content_list variable:
<h3 class="sds-heading--7 title">Problems with battery capacity long-term</h3>
<div class="review-byline review-section">
<div>July 21, 2014</div>
<div>By Cathie from San Diego</div>
<div class="review-type"><strong>Owns this car</strong></div>
</div>
<div class="review-section">
<p class="review-body">We have owned our Leaf since May 2011. We have loved the car but are now getting quite concerned. My husband drives the car, on average, 20-40 miles/day to and from work and running errands, mostly 100% on city roads. We live in San Diego, so no issue with winter weather and we live 7 miles from the ocean so seldom have daytime temperatures above 85. Originally, we would get 65-70 miles per 80-90% charge. Last fall we noticed that there was considerably less remaining charge left after a day of driving. He began to track daily miles, remaining "bars", as well as started charging it 100%. For 9 months we have only been getting 40-45 miles on a full charge with only 1-2 "bars" remaining at the end of the day. Sometimes it will be blinking and "talking" to us to get to a charging place ASAP. We just had it into the dealership. Though on a full charge, the car gauge shows 12 bars, the dealership states that the batteries have lost 2 bars via the computer diagnostics (which we are told is a different reading from the car gauge itself) and, that they say, is average and excepted for the car at this age. Everything else (software, diagnostics, etc.) shows 100%, so the dealership thinks that the car is functioning as it should. They are unable to explain why we can only go 40-45 miles on a charge, but keep saying that the car tests out fine. If the distance one is able to drive on a full charge decreases any further, it will begin to render the car useless. As someone else recommended, in retrospect, the best way to go is to lease the Leaf so that battery life is not an issue.</p>
</div>
First I used this code to get to the collection of reviews
import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent

ua = UserAgent()
header = {'User-Agent': str(ua.safari)}
url = 'https://www.cars.com/research/nissan-leaf-2011/consumer-reviews/?page=1'
response = requests.get(url, headers=header)
print(response)

html_soup = BeautifulSoup(response.text, 'lxml')
content_list = html_soup.find_all('div', attrs={'class': 'consumer-review-container'})
Now I would like to get the date of the review and the name of the reviewer, which in this case would be
<div class="review-byline review-section">
<div>July 21, 2014</div>
<div>By Cathie from San Diego</div>
The problem is I can't separate those two divs
My code:
data = []
for e in content_list:
    data.append({
        'review_date': e.find_all("div", {"class":"review-byline"})[0].text,
        'overall_rating': e.select_one('span.sds-rating__count').text,
        'review_title': e.h3.text,
        'review_content': e.select_one('p').text,
    })
The result of my code
{'overall_rating': '4.7',
'review_content': 'This is the perfect electric car for driving around town, doing errands or even for a short daily commuter. It is very comfy and very quick. The only issue was the first gen battery. The 2011-2014 battery degraded quickly and if the owner did not have Nissan replace it, all those cars are now junk and can only go 20 miles or so on a charge. We had Nissan replace our battery with the 2nd gen battery and it is good as new!',
'review_date': '\nFebruary 24, 2020\nBy EVs are the future from Tucson, AZ\nOwns this car\n',
'review_title': 'Great Electric Car!'}
For the first one you could grab the <div> directly:
'review_date':e.find("div", {"class":"review-byline"}).div.text,
for the second one use e.g. a CSS selector:
'reviewer_name':e.select_one("div.review-byline div:nth-of-type(2)").text,
Example
url = 'https://www.cars.com/research/nissan-leaf-2011/consumer-reviews/?page=1'
response = requests.get(url, headers=header)
html_soup = BeautifulSoup(response.text, 'lxml')
content_list = html_soup.find_all('div', attrs={'class': 'consumer-review-container'})
data = []
for e in content_list:
    data.append({
        'review_date': e.find("div", {"class":"review-byline"}).div.text,
        'reviewer_name': e.select_one("div.review-byline div:nth-of-type(2)").text,
        'overall_rating': e.select_one('span.sds-rating__count').text,
        'review_title': e.h3.text,
        'review_content': e.select_one('p').text,
    })
data
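If a table is more convenient than a list of dicts, a minimal follow-up sketch (assuming pandas is available):

import pandas as pd

df = pd.DataFrame(data)  # one row per review
print(df[['review_date', 'reviewer_name']])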
Intent: Scrape company data from the Inc.5000 list (e.g., rank, company name, growth, industry, state, city, description (via hovering over company name)).
Problem: From what I can see, data from the list is dynamically generated in the browser (no AJAX). Additionally, I can't just scroll to the bottom and then scrape the whole page because only a certain number of companies are available at any one time. In other words, companies 1-10 render, but once I scroll to companies 500-510, companies 1-10 are "de-rendered".
Current effort: The following code is where I'm at now.
from selenium import webdriver

driver = webdriver.Chrome()
driver.implicitly_wait(30)
driver.get('https://www.inc.com/inc5000/list/2020')

all_companies = []
scroll_max = 600645  # found via Selenium IDE
curr_scroll = 0
next_scroll = curr_scroll + 2000

for elem in driver.find_elements_by_class_name('franchise-list__companies'):
    while curr_scroll <= scroll_max:
        scroll_fn = ''.join(("window.scrollTo(", str(curr_scroll), ", ", str(next_scroll), ")"))
        driver.execute_script(scroll_fn)
        all_companies.append(elem.text.split('\n'))
        print('Current length: ', len(all_companies))
        curr_scroll += 2000
        next_scroll += 2000
Most SO posts related to infinite scroll deal with pages that either retain the data generated as scrolling occurs, or issue AJAX requests that can be tapped. This problem is an exception to both (but if I missed an applicable SO post, feel free to point me in that direction).
Problems:
Redundant data is scraped (e.g. a single company may be scraped twice)
I still have to split out the data afterwards (the final destination is a Pandas dataframe)
Doesn't include the company description (seen by hovering over company name)
It's slow (I realize this is a caveat to Selenium itself, but think the code can be optimized)
The data is loaded from an external URL. To print all companies, you can use this example:
import json
import requests

url = 'https://www.inc.com/rest/i5list/2020'
data = requests.get(url).json()

# uncomment this to print all data:
# print(json.dumps(data, indent=4))

for i, company in enumerate(data['companies'], 1):
    print('{:>05d} {}'.format(i, company['company']))
    # the hover text is stored in company['ifc_business_model']
Prints:
00001 OneTrust
00002 Create Music Group
00003 Lovell Government Services
00004 Avalon Healthcare Solutions
00005 ZULIE VENTURE INC
00006 Hunt A Killer
00007 Case Energy Partners
00008 Nationwide Mortgage Bankers
00009 Paxon Energy
00010 Inspire11
00011 Nugget
00012 TRYFACTA
00013 CannaSafe
00014 BRUMATE
00015 Resource Innovations
...and so on.
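Since the stated destination is a Pandas dataframe, here is a minimal sketch building one from the same endpoint. Only the 'company' and 'ifc_business_model' keys are confirmed by the answer above; any further columns (rank, growth, state, city) would need to be checked against the JSON dump:

import pandas as pd
import requests

url = 'https://www.inc.com/rest/i5list/2020'
companies = requests.get(url).json()['companies']

# build a frame from all keys, then inspect df.columns for rank, growth, etc.
df = pd.DataFrame(companies)
print(df[['company', 'ifc_business_model']].head())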
I would like to count unique words with a function. The unique words I want to count are the words that appear only once, which is why I used a set here. I put the error below. Does anyone know how to fix this?
Here's my code:
import re

def unique_words(corpus_text_train):
    words = re.findall(r'\w+', corpus_text_train)
    uw = len(set(words))
    return uw
unique = unique_words(test_list_of_str)
unique
I got this error
TypeError: expected string or bytes-like object
Here's my bag of words model:
import re
import pandas as pd
from nltk.tokenize import word_tokenize

def BOW_model_relative(df):
    corpus_text_train = []
    for i in range(0, len(df)):  # iterate over the rows in the dataframe
        corpus = df['text'][i]
        #corpus = re.findall(r'\w+', corpus)
        corpus = re.sub(r'[^\w\s]', '', corpus)
        corpus = corpus.lower()
        corpus = corpus.split()
        corpus = ' '.join(corpus)
        corpus_text_train.append(corpus)

    word2count = {}
    for x in corpus_text_train:
        words = word_tokenize(x)
        for word in words:
            if word not in word2count.keys():
                word2count[word] = 1
            else:
                word2count[word] += 1

    total = 0
    for key in word2count.keys():
        total += word2count[key]
    for key in word2count.keys():
        word2count[key] = word2count[key]/total

    return word2count, corpus_text_train
test_dict,test_list_of_str = BOW_model_relative(df)
#test_data = pd.DataFrame(test)
print(test_dict)
Here's my csv data
df = pd.read_csv('test.csv')
,text,title,authors,label
0,"On Saturday, September 17 at 8:30 pm EST, an explosion rocked West 23 Street in Manhattan, in the neighborhood commonly referred to as Chelsea, injuring 29 people, smashing windows and initiating street closures. There were no fatalities. Officials maintain that a homemade bomb, which had been placed in a dumpster, created the explosion. The explosive device was removed by the police at 2:25 am and was sent to a lab in Quantico, Virginia for analysis. A second device, which has been described as a “pressure cooker” device similar to the device used for the Boston Marathon bombing in 2013, was found on West 27th Street between the Avenues of the Americas and Seventh Avenue. By Sunday morning, all 29 people had been released from the hospital. The Chelsea incident came on the heels of an incident Saturday morning in Seaside Heights, New Jersey where a bomb exploded in a trash can along a route where thousands of runners were present to run a 5K Marine Corps charity race. There were no casualties. By Sunday afternoon, law enforcement had learned that the NY and NJ explosives were traced to the same person.
Given that we are now living in a world where acts of terrorism are increasingly more prevalent, when a bomb goes off, our first thought usually goes to the possibility of terrorism. After all, in the last year alone, we have had several significant incidents with a massive number of casualties and injuries in Paris, San Bernardino California, Orlando Florida and Nice, to name a few. And of course, last week we remembered the 15th anniversary of the September 11, 2001 attacks where close to 3,000 people were killed at the hands of terrorists. However, we also live in a world where political correctness is the order of the day and the fear of being labeled a racist supersedes our natural instincts towards self-preservation which, of course, includes identifying the evil-doers. Isn’t that how crimes are solved? Law enforcement tries to identify and locate the perpetrators of the crime or the “bad guys.” Unfortunately, our leadership – who ostensibly wants to protect us – finds their hands and their tongues tied. They are not allowed to be specific about their potential hypotheses for fear of offending anyone.
New York City Mayor Bill de Blasio – who famously ended “stop-and-frisk” profiling in his city – was extremely cautious when making his first remarks following the Chelsea neighborhood explosion. “There is no specific and credible threat to New York City from any terror organization,” de Blasio said late Saturday at the news conference. “We believe at this point in this time this was an intentional act. I want to assure all New Yorkers that the NYPD and … agencies are at full alert”, he said. Isn’t “an intentional act” terrorism? We may not know whether it is from an international terrorist group such as ISIS, or a homegrown terrorist organization or a deranged individual or group of individuals. It is still terrorism. It is not an accident. James O’Neill, the New York City Police Commissioner had already ruled out the possibility that the explosion was caused by a natural gas leak at the time the Mayor made his comments. New York’s Governor Andrew Cuomo was a little more direct than de Blasio saying that there was no evidence of international terrorism and that no specific groups had claimed responsibility. However, he did say that it is a question of how the word “terrorism” is defined. “A bomb exploding in New York is obviously an act of terrorism.” Cuomo hit the nail on the head, but why did need to clarify and caveat before making his “obvious” assessment?
The two candidates for president Hillary Clinton and Donald Trump also weighed in on the Chelsea explosion. Clinton was very generic in her response saying that “we need to do everything we can to support our first responders – also to pray for the victims” and that “we need to let this investigation unfold.” Trump was more direct. “I must tell you that just before I got off the plane a bomb went off in New York and nobody knows what’s going on,” he said. “But boy we are living in a time—we better get very tough folks. We better get very, very tough. It’s a terrible thing that’s going on in our world, in our country and we are going to get tough and smart and vigilant.”
The answer from Kohelet neglects characters such as ',' and '"', which in the OP's case would count "people" and "people," as two unique words. To make sure you only get actual words, you need to take care of the unwanted characters. To remove the ',' and '"', you could add the following:
text = 'aa, aa bb cc'

def unique_words(text):
    words = text.replace('"', '').replace(',', '').split()
    unique = list(set(words))
    return len(unique)

unique_words(text)
# out
3
There are numerous ways to add more characters to be replaced.
s = 'aa aa bb cc'

def unique_words(corpus_text_train):
    splitted = corpus_text_train.split()
    return len(set(splitted))

unique_words(s)
Out[14]: 3
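As for the TypeError itself: BOW_model_relative returns corpus_text_train as a list of strings, while re.findall expects a single string. A minimal sketch of one fix, joining the list first (this assumes the per-document strings can simply be concatenated):

import re

def unique_words(corpus_text_train):
    # corpus_text_train is a list of strings, so join it into one string first
    text = ' '.join(corpus_text_train)
    words = re.findall(r'\w+', text)
    return len(set(words))

unique = unique_words(test_list_of_str)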
Here is my code
import json

data = []
with open("review.json") as f:
    for line in f:
        data.append(json.loads(line))

lst_string = []
lst_num = []
for i in range(len(data)):
    if data[i]["stars"] == 5.0:
        x = data[i]["text"]
        for word in x.split():
            if word in lst_string:
                lst_num[lst_string.index(word)] += 1
            else:
                lst_string.append(word)
                lst_num.append(1)

result = set(zip(lst_string, lst_num))
print(result)

with open("set.txt", "w") as g:
    g.write(str(result))
I'm trying to write out a set of all the words in reviews that were given 5 stars, from a pulled-in JSON file formatted like
{"review_id":"Q1sbwvVQXV2734tPgoKj4Q","user_id":"hG7b0MtEbXx5QzbzE6C_VA","business_id":"ujmEBvifdJM6h6RLv4wQIg","stars":1.0,"useful":6,"funny":1,"cool":0,"text":"Total bill for this horrible service? Over $8Gs. These crooks actually had the nerve to charge us $69 for 3 pills. I checked online the pills can be had for 19 cents EACH! Avoid Hospital ERs at all costs.","date":"2013-05-07 04:34:36"}
{"review_id":"GJXCdrto3ASJOqKeVWPi6Q","user_id":"yXQM5uF2jS6es16SJzNHfg","business_id":"NZnhc2sEQy3RmzKTZnqtwQ","stars":1.0,"useful":0,"funny":0,"cool":0,"text":"I *adore* Travis at the Hard Rock's new Kelly Cardenas Salon! I'm always a fan of a great blowout and no stranger to the chains that offer this service; however, Travis has taken the flawless blowout to a whole new level! \n\nTravis's greets you with his perfectly green swoosh in his otherwise perfectly styled black hair and a Vegas-worthy rockstar outfit. Next comes the most relaxing and incredible shampoo -- where you get a full head message that could cure even the very worst migraine in minutes --- and the scented shampoo room. Travis has freakishly strong fingers (in a good way) and use the perfect amount of pressure. That was superb! Then starts the glorious blowout... where not one, not two, but THREE people were involved in doing the best round-brush action my hair has ever seen. The team of stylists clearly gets along extremely well, as it's evident from the way they talk to and help one another that it's really genuine and not some corporate requirement. It was so much fun to be there! \n\nNext Travis started with the flat iron. The way he flipped his wrist to get volume all around without over-doing it and making me look like a Texas pagent girl was admirable. It's also worth noting that he didn't fry my hair -- something that I've had happen before with less skilled stylists. At the end of the blowout & style my hair was perfectly bouncey and looked terrific. The only thing better? That this awesome blowout lasted for days! \n\nTravis, I will see you every single time I'm out in Vegas. You make me feel beauuuutiful!","date":"2017-01-14 21:30:33"}
{"review_id":"2TzJjDVDEuAW6MR5Vuc1ug","user_id":"n6-Gk65cPZL6Uz8qRm3NYw","business_id":"WTqjgwHlXbSFevF32_DJVw","stars":1.0,"useful":3,"funny":0,"cool":0,"text":"I have to say that this office really has it together, they are so organized and friendly! Dr. J. Phillipp is a great dentist, very friendly and professional. The dental assistants that helped in my procedure were amazing, Jewel and Bailey helped me to feel comfortable! I don't have dental insurance, but they have this insurance through their office you can purchase for $80 something a year and this gave me 25% off all of my dental work, plus they helped me get signed up for care credit which I knew nothing about before this visit! I highly recommend this office for the nice synergy the whole office has!","date":"2016-11-09 20:09:03"}
{"review_id":"yi0R0Ugj_xUx_Nek0-_Qig","user_id":"dacAIZ6fTM6mqwW5uxkskg","business_id":"ikCg8xy5JIg_NGPx-MSIDA","stars":1.0,"useful":0,"funny":0,"cool":0,"text":"Went in for a lunch. Steak sandwich was delicious, and the Caesar salad had an absolutely delicious dressing, with a perfect amount of dressing, and distributed perfectly across each leaf. I know I'm going on about the salad ... But it was perfect.\n\nDrink prices were pretty good.\n\nThe Server, Dawn, was friendly and accommodating. Very happy with her.\n\nIn summation, a great pub experience. Would go again!","date":"2018-01-09 20:56:38"}
{"review_id":"yi0R0Ugj_xUx_Nek0-_Qig","user_id":"dacAIZ6fTM6mqwW5uxkskg","business_id":"ikCg8xy5JIg_NGPx-MSIDA","stars":5.0,"useful":0,"funny":0,"cool":0,"text":"a b aa bb a b","date":"2018-01-09 20:56:38"}
but it uses all the memory on my computer before it can write to the text file. How can I do this in a less memory-intensive way?
Only get text where stars == 5:
Data:
Based on the question, the data is a file containing one JSON dict per row.
Get the text into a list:
Given the data from the Yelp Challenge, getting the 5-star text into a list doesn't take that much memory.
The Windows resource manager showed an increase of about 1.3GB, but the object size of text_list was only about 25MB.
import json

text_list = list()
with open("review.json", encoding="utf8") as f:
    for line in f:
        line = json.loads(line)
        if line['stars'] == 5:
            text_list.append(line['text'])

print(text_list)
>>> ['Test text, example 1!', 'Test text, example 2!']
Extra:
Everything after loading the data seems to require a lot of memory that isn't being released.
When cleaning the text, the Windows resource manager went up by 16GB, though the final size of clean_text was also only about 25MB.
Interestingly, deleting clean_text does not release the 16GB of memory.
In Jupyter Lab, restarting the Kernel will release the memory
In PyCharm, stopping the process also releases the memory
I tried manually running the garbage collector, but that didn't release the memory
Clean text_list:
import string

def clean_string(value: str) -> list:
    value = value.lower()
    value = value.translate(str.maketrans('', '', string.punctuation))
    value = value.split()
    return value

clean_text = [clean_string(item) for item in text_list]
print(clean_text)
>>> [['test', 'text', 'example', '1'], ['test', 'text', 'example', '2']]
Count words in clean_text:
from collections import Counter

words = Counter()
for item in clean_text:
    words.update(item)

print(words)
>>> Counter({'test': 2, 'text': 2, 'example': 2, '1': 1, '2': 1})
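To address the memory question directly, the three steps above can also be fused into a single streaming pass, so that neither text_list nor clean_text is ever held in memory; only the Counter grows. A minimal sketch, assuming the same review.json input and set.txt output as in the question:

import json
import string
from collections import Counter

words = Counter()
table = str.maketrans('', '', string.punctuation)

with open("review.json", encoding="utf8") as f:
    for line in f:
        review = json.loads(line)
        if review['stars'] == 5.0:
            # clean and count this review immediately, then discard it
            words.update(review['text'].lower().translate(table).split())

# mirror the question's set of (word, count) pairs
with open("set.txt", "w") as g:
    g.write(str(set(words.items())))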
I am attempting to extract the URLs from within an HTML ordered list using the BeautifulSoup Python module. My code returns a list of None values equal in number to the number of items in the ordered list, so I know I'm in the right place in the document. What am I doing wrong?
The URL I am scraping from is http://www.dailykos.com/story/2013/04/27/1203495/-GunFAIL-XV
Here are 5 of 50 lines from the HTML list (apologies for the length):
<div id="body" class="article-body">
<ol>
<li>WACO, TX, 3/18/13: Police responding to a domestic disturbance call found a man struggling to restrain his grandson, who was agitated and holding an AR-15. The cops shot grandpa. But that would totally never happen in a crowded theater.</li>
<li>GROSSE POINTE PARK, MI, 4/06/13: Grosse Pointe Park police arrested a 20-year-old Detroit man April 6 after he accidentally shot a 9mm handgun into the floor of a home in the 1000 block of Beaconsfield. The man was trying to make the gun safe when it discharged.</li>
<li>OTTAWA, KS, 4/13/13: No one was injured when a “negligent” rifle shot rang out Saturday night inside a residence in the 1600 block of South Cedar Street in Ottawa. Dylan Spencer, 22, Ottawa, was arrested by Ottawa police about 7 p.m. on suspicion of unlawfully discharging an AR-15 rifle in his apartment, according to a police report. The bullet exited his apartment, passed through both walls of an occupied apartment and lodged into a utility pole. But of course, Dylan didn't think the gun was loaded. So it's cool.</li>
<li>KLAMATH FALLS, OR, 4/13/13: An investigation into the shooting death of Lee Roy Myers, 47, has been ruled accidental. The Klamath County Major Crimes Team was called to investigate a shooting on Saturday, April 13. An autopsy concluded the cause of death was an accidental, self-inflicted handgun wound.</li>
<li>SOUTHAMPTON, NY, 4/13/13: The report states that the detective visited the home and interviewed the man, who legally owned the Ruger 10/22 rifle. The man said he was cleaning the rifle when it accidentally discharged into his big toe. When the rifle was pointed in a downward angle, inertia caused the firing pin to strike the primer, which caused the rifle to fire, according to the incident report. The detective advised the man on safety techniques while cleaning his rifle. (Step one: unload it.)</li>
And here is my code:
page = urllib2.urlopen(url)
soup = BeautifulSoup(page)
li = soup.select("ol > li")

for link in li:
    print(link.get('href'))
You're iterating over li elements, which don't have an href attribute; the a tags inside them do:
import urllib2
from bs4 import BeautifulSoup

url = "http://www.dailykos.com/story/2013/04/27/1203495/-GunFAIL-XV"
page = urllib2.urlopen(url).read()
soup = BeautifulSoup(page)

li = soup.select("ol > li > a")
for link in li:
    print(link.get('href'))
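Note that urllib2 is Python 2 only; under Python 3 the equivalent would use urllib.request (or the requests library). A minimal sketch:

from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "http://www.dailykos.com/story/2013/04/27/1203495/-GunFAIL-XV"
soup = BeautifulSoup(urlopen(url).read(), 'html.parser')

# the anchors inside the ordered list items carry the URLs
for link in soup.select("ol > li > a"):
    print(link.get('href'))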