No Output When Running BeautifulSoup Python Code

I was recently trying out the following Python code using BeautifulSoup from this question, which seems to have worked for the question-asker.
import urllib2
import bs4
import string
from bs4 import BeautifulSoup

badwords = set([
    'cup', 'cups',
    'clove', 'cloves',
    'tsp', 'teaspoon', 'teaspoons',
    'tbsp', 'tablespoon', 'tablespoons',
    'minced'
])

def cleanIngred(s):
    # remove leading and trailing whitespace
    s = s.strip()
    # remove numbers and punctuation in the string
    s = s.strip(string.digits + string.punctuation)
    # remove unwanted words
    return ' '.join(word for word in s.split() if not word in badwords)

def main():
    url = "http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx"
    data = urllib2.urlopen(url).read()
    bs = BeautifulSoup.BeautifulSoup(data)
    ingreds = bs.find('div', {'class': 'ingredients'})
    ingreds = [cleanIngred(s.getText()) for s in ingreds.findAll('li')]
    fname = 'PorkRecipe.txt'
    with open(fname, 'w') as outf:
        outf.write('\n'.join(ingreds))

if __name__ == "__main__":
    main()
For some reason, though, I can't get it to work in my case. I receive the error:
AttributeError Traceback (most recent call last)
<ipython-input-4-55411b0c5016> in <module>()
41
42 if __name__=="__main__":
---> 43 main()
<ipython-input-4-55411b0c5016> in main()
31 url = "http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx"
32 data = urllib2.urlopen(url).read()
---> 33 bs = BeautifulSoup.BeautifulSoup(data)
34
35 ingreds = bs.find('div', {'class': 'ingredients'})
AttributeError: type object 'BeautifulSoup' has no attribute 'BeautifulSoup'
I suspect this is because I'm using bs4 and not BeautifulSoup. I tried replacing the line bs = BeautifulSoup.BeautifulSoup(data) with bs = bs4.BeautifulSoup(data) and no longer receive an error, but get no output. Are there too many possible causes for this to guess?

The original code used BeautifulSoup version 3:
import BeautifulSoup
You switched to BeautifulSoup version 4, but also switched the style of the import:
from bs4 import BeautifulSoup
Either remove that line; you already have the correct import earlier in your file:
import bs4
and then use:
bs = bs4.BeautifulSoup(data)
or change that latter line to:
bs = BeautifulSoup(data)
(and remove the import bs4 line).
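For reference, a minimal sketch of main() using the first option; this assumes the rest of the script, including cleanIngred, stays unchanged, and keeps the Python 2 urllib2 import from the original:

import urllib2
import bs4

def main():
    url = "http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx"
    data = urllib2.urlopen(url).read()
    # bs4 exposes the BeautifulSoup class at the top level of the package
    bs = bs4.BeautifulSoup(data)
    ingreds = bs.find('div', {'class': 'ingredients'})
    ingreds = [cleanIngred(s.getText()) for s in ingreds.findAll('li')]
    with open('PorkRecipe.txt', 'w') as outf:
        outf.write('\n'.join(ingreds))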
You may also want to review the Porting code to BS4 section of the BeautifulSoup documentation, so you can make any other necessary changes upgrading the code you found to get the best out of BeautifulSoup version 4.
The script otherwise works just fine: it produces a new file, PorkRecipe.txt, rather than printing anything to stdout.
The contents of the file after fixing the bs4.BeautifulSoup reference:
READY IN 4+ hrs
Slow Cooker Pork Chops II
Amazing Pork Tenderloin in the Slow Cooker
Jerre's Black Bean and Pork Slow Cooker Chili
Slow Cooker Pulled Pork
Slow Cooker Sauerkraut Pork Loin
Slow Cooker Texas Pulled Pork
Oven-Fried Pork Chops
Pork Chops for the Slow Cooker
Tangy Slow Cooker Pork Roast
Types of Cooking Oil
Garlic: Fresh Vs. Powdered
All about Paprika
Types of Salt
olive oil
chicken broth
garlic,
paprika
garlic powder
poultry seasoning
dried oregano
dried basil
thick cut boneless pork chops
salt and pepper to taste
PREP 10 mins
COOK 4 hrs
READY IN 4 hrs 10 mins
In a large bowl, whisk together the olive oil, chicken broth, garlic, paprika, garlic powder, poultry seasoning, oregano, and basil. Pour into the slow cooker. Cut small slits in each pork chop with the tip of a knife, and season lightly with salt and pepper. Place pork chops into the slow cooker, cover, and cook on High for 4 hours. Baste periodically with the sauce

Related

AttributeError: 'NoneType' object has no attribute 'lower' in Python. How to preprocess before tokenizing the text content?

The data set I am using looks like this: a video-captioning data set with captions under the column 'caption' and multiple captions for a single video clip.
video_id caption
mv89psg6zh4 A bird is bathing in a sink.
mv89psg6zh4 A faucet is running while a bird stands.
mv89psg6zh4 A bird gets washed.
mv89psg6zh4 A parakeet is taking a shower in a sink.
mv89psg6zh4 The bird is taking a bath under the faucet.
mv89psg6zh4 A bird is standing in a sink drinking water.
R2DvpPTfl-E PLAYING GAME ON LAPTOP.
R2DvpPTfl-E THE MAN IS WATCHING LAPTOP.
l7x8uIdg2XU A woman is pouring ingredients into a bowl.
l7x8uIdg2XU A woman is adding milk to some pasta.
l7x8uIdg2XU A person adds ingredients to pasta.
l7x8uIdg2XU the girls are doing the cooking.
The code works on the "CandidateA" JSON file here.
However, it is not working on the "Referencedf" JSON file, which looks like this (the complete file can be found here):
(Excerpt only):
[{"video_id":"mv89psg6zh4_33_46","caption":"A bird in a sink keeps getting under the running water from a faucet."},{"video_id":"mv89psg6zh4_33_46","caption":"A bird is bathing in a sink."},{"video_id":"mv89psg6zh4_33_46","caption":"A bird is splashing around under a running faucet."},{"video_id":"60x_yxy7Sfw_1_7","caption":"A MAN IS WATCHING A LAPTOP."},{"video_id":"60x_yxy7Sfw_1_7","caption":"A man is sitting at his computer."}]
This is the code I am applying:
import json
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

with open("Referencedf.json", 'r') as f:
    datastore = json.load(f)

captions = []
video_id = []
for item in datastore:
    captions.append(item['caption'])

tokenizer = Tokenizer(oov_token="<OOV>")
tokenizer.fit_on_texts(captions)
The error I am getting is this:
AttributeError Traceback (most recent call last)
<ipython-input-25-63fee6e467f1> in <module>
1 tokenizer = Tokenizer(oov_token="<OOV>")
----> 2 tokenizer.fit_on_texts(captions)
3 word_index = tokenizer.word_index
4 print(len(word_index))
~\anaconda3\lib\site-packages\keras_preprocessing\text.py in fit_on_texts(self, texts)
221 self.filters,
222 self.lower,
--> 223 self.split)
224 for w in seq:
225 if w in self.word_counts:
~\anaconda3\lib\site-packages\keras_preprocessing\text.py in text_to_word_sequence(text, filters, lower, split)
41 """
42 if lower:
---> 43 text = text.lower()
44
45 if sys.version_info < (3,):
AttributeError: 'NoneType' object has no attribute 'lower'
Edit:
As suggested by @MahindraSinghMeena, I removed the null rows from the dataframe beforehand to avoid the error, using:
df = df.dropna()
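(For context, since the question's code never builds df: a sketch of how that cleanup might look, assuming the dataframe is loaded from the same JSON file with pandas.)

import pandas as pd

# load the list of {"video_id": ..., "caption": ...} records into a dataframe
df = pd.read_json("Referencedf.json")
df = df.dropna()  # drop rows whose caption is null
captions = df['caption'].tolist()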
This happens when some of the data being fed to the Tokenizer is invalid: the error message indicates that it found an element that is None. The data should be cleaned up to remove such cases.
You can see in the following snippet that an entry has invalid text for its caption.
import json

datastore = json.load(open('/Referencedf.json', 'r'))
for d in datastore:
    if d['caption'] is None:
        print(d)
{'video_id': 'SKhmFSV-XB0_12_18', 'caption': None}
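One way to handle it without touching the file is to skip the None captions while building the list; a minimal sketch against the code from the question:

captions = []
for item in datastore:
    # keep only entries that actually have a caption
    if item['caption'] is not None:
        captions.append(item['caption'])

tokenizer = Tokenizer(oov_token="<OOV>")
tokenizer.fit_on_texts(captions)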

HTML parsing using beautiful soup gives structure different to website

When I view this link https://www.cftc.gov/sites/default/files/files/dea/cotarchives/2015/futures/financial_lf061615.htm the text is displayed in a clear way. However, when I parse the page using Beautiful Soup, the output doesn't look the same: it is all messed up. Here is the code:
import urllib.request
from bs4 import BeautifulSoup
request = urllib.request.Request('https://www.cftc.gov/sites/default/files/files/dea/cotarchives/2015/futures/financial_lf061615.htm')
htm = urllib.request.urlopen(request).read()
soup = BeautifulSoup(htm,'html.parser')
text = soup.get_text()
print(text)
The desired output would look like this:
-----------------------------------------------------------------------------------------------------------------------------------------------------------
Traders in Financial Futures - Futures Only Positions as of June 16, 2015
-----------------------------------------------------------------------------------------------------------------------------------------------------------
Dealer : Asset Manager/ : Leveraged : Other : Nonreportable :
Intermediary : Institutional : Funds : Reportables : Positions :
Long : Short : Spreading: Long : Short : Spreading: Long : Short : Spreading: Long : Short : Spreading: Long : Short :
-----------------------------------------------------------------------------------------------------------------------------------------------------------
DOW JONES UBS EXCESS RETURN - CHICAGO BOARD OF TRADE ($100 X INDEX)
CFTC Code #221602 Open Interest is 19,721
Positions
97 2,934 0 8,941 1,574 973 6,490 11,975 1,694 1,372 539 0 154 32
Changes from: June 9, 2015 Total Change is: 3,505
48 0 0 2,013 1,141 70 447 1,369 923 -64 0 0 68 2
Percent of Open Interest Represented by Each Category of Trader
0.5 14.9 0.0 45.3 8.0 4.9 32.9 60.7 8.6 7.0 2.7 0.0 0.8 0.2
Number of Traders in Each Category Total Traders: 31
. . 0 5 . . 6 9 . 5 . 0
-----------------------------------------------------------------------------------------------------------------------------------------------------------
After viewing the page source, it is not clear to me how a new line is distinguished in the markup, which is where I think the problem comes from.
Is there some type of structure I need to specify in the BeautifulSoup function? I'm very lost here, so any help is much appreciated.
FWIW, I tried installing the html2text module and had no luck on Anaconda using !conda config --append channels conda-forge and !conda install html2text.
Cheers
EDIT: I've figured it out.
request = urllib.request.Request('https://www.cftc.gov/sites/default/files/files/dea/cotarchives/2015/futures/financial_lf061615.htm')
htm = urllib.request.urlopen(request).read()
htm = htm.decode('windows-1252')
htm = htm.replace('\n', '').replace('\r', '')
htm = htm.split('</pre><pre>')

cleaned = []
for i in htm:
    i = BeautifulSoup(i, 'html.parser').get_text()
    cleaned.append(i)

with open('trouble.txt', 'w') as f:
    for line in cleaned:
        f.write('%s\n' % line)
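Roughly the same idea can be written by letting BeautifulSoup find the <pre> blocks itself. A sketch, assuming (as the fix above implies) that each report section sits in its own <pre> element; untested against this page:

import urllib.request
from bs4 import BeautifulSoup

url = 'https://www.cftc.gov/sites/default/files/files/dea/cotarchives/2015/futures/financial_lf061615.htm'
htm = urllib.request.urlopen(url).read().decode('windows-1252')
soup = BeautifulSoup(htm, 'html.parser')

# get_text() on each <pre> keeps the fixed-width layout of the report
sections = [pre.get_text() for pre in soup.find_all('pre')]
with open('trouble.txt', 'w') as f:
    f.write('\n'.join(sections))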

Python: How to read line from file which has two line spaces in between

I am trying to read a file with the format below: it has two '\n' characters between every line.
Great tool for healing your life--if you are ready to change your beliefs!<br /><a href="http
Bought this book for a friend. I read it years ago and it is one of those books you keep forever. Love it!
I read this book many years ago and have heard Louise Hay speak a couple of times. It is a valuable read...
I am using the Python code below to read the lines and convert them into a DataFrame:
import pandas as pd

open_reviews = open("C:\\Downloads\\review_short.txt", "r", encoding="Latin-1").read()
documents = []
for r in open_reviews.split('\n\n'):
    documents.append(r)

df = pd.DataFrame(documents)
print(df.head())
The output I am getting is as below:
0 I was very inspired by Louise's Hay approach t...
1 \n You Can Heal Your Life by
2 \n I had an older version
3 \n I love Louise Hay and
4 \n I thought the book was exellent
Since I split on two '\n' characters, a leftover '\n' gets appended to the beginning of each line. Is there another way to handle this, so that I get output as below:
0 I was very inspired by Louise's Hay approach t...
1 You Can Heal Your Life by
2 I had an older version
3 I love Louise Hay and
4 I thought the book was exellent
This appends every non-blank line.
filename = "..."

lines = []
with open(filename) as f:
    for line in f:
        line = line.strip()
        if line:
            lines.append(line)
>>> lines
['Great tool for healing your life--if you are ready to change your beliefs!<br /><a href="http',
'Bought this book for a friend. I read it years ago and it is one of those books you keep forever. Love it!',
'I read this book many years ago and have heard Louise Hay speak a couple of times. It is a valuable read...']
lines = pd.DataFrame(lines, columns=['my_text'])
>>> lines
my_text
0 Great tool for healing your life--if you are r...
1 Bought this book for a friend. I read it years...
2 I read this book many years ago and have heard...
Try using the .strip() method. It will remove any whitespace characters from the beginning and end of a string.
You can use it like this:
for r in open_reviews.split('\n\n'):
    documents.append(r.strip())
Use readlines() and clean each line with strip().
filename = "C:\\Downloads\\review_short.txt"
open_reviews = open(filename, "r", encoding="Latin-1")

documents = []
for r in open_reviews.readlines():
    r = r.strip()  # clean spaces and \n
    if r:
        documents.append(r)
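If the goal is the DataFrame from the question, the same cleanup can be folded into a single pass; a small sketch (the path and column name are taken from the question and first answer, purely illustrative):

import pandas as pd

with open("C:\\Downloads\\review_short.txt", encoding="Latin-1") as f:
    # strip each chunk and drop the blank ones left by the double newlines
    documents = [chunk.strip() for chunk in f.read().split('\n\n') if chunk.strip()]

df = pd.DataFrame(documents, columns=['my_text'])
print(df.head())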

How to put a string at the front of a file in python

This is my code:
>>> p = open(r'/Users/ericxx/Desktop/dest.txt','r+')
>>> xx = p.read()
>>> xx = xx[:0]+"How many roads must a man walk down\nBefore they call him a man" +xx[0:]
>>> p.writelines(xx)
>>> p.close()
the original file content looks like:
How many seas must a white dove sail
Before she sleeps in the sand
the result looks like :
How many seas must a white dove sail
Before she sleeps in the sand
How many roads must a man walk down
Before they call him a man
How many seas must a white dove sail
Before she sleeps in the sand
expected output :
How many roads must a man walk down
Before they call him a man
How many seas must a white dove sail
Before she sleeps in the sand
You have to "rewind" the file between reading and writing:
p.seek(0)
The whole code will look like this (with other minor changes):
p = open('/Users/ericxx/Desktop/dest.txt','r+')
xx = p.read()
xx = "How many roads must a man walk down\nBefore they call him a man" + xx
p.seek(0)
p.write(xx)
p.close()
Adding to @messa's answer: seeking back to the start and rewriting can leave old data at the end of the file if you ever shortened xx at any point.
This is because p.seek(0) puts the file pointer at the beginning of the file, and any .write() operation overwrites content as it goes. If the content you write is shorter than what is already in the file, some old data will be left at the end, not overwritten.
To avoid this, you could open the file a second time with 'w' as the mode, or store the original content length and pad your new content. Or truncate the file to your new desired length.
To truncate the file, simply call p.truncate() after you've written the data; with no argument it cuts the file off at the current position.
Also, use the with statement:
with open('/Users/ericxx/Desktop/dest.txt', 'r+') as p:
    xx = p.read()
    xx = "How many roads must a man walk down\nBefore they call him a man" + xx
    p.seek(0)
    p.write(xx)
    p.truncate()  # drop any leftover old data past the new content

rstrip() not working as expected

I have written the following Python code. What I expect it to do is add a random word from the file "noise" to each line of "raw" and print the result to the file "dataset".
#! /usr/bin/python
from random import randint

raw = open("raw_dataset_1", "r")
noise = open("random", "r")
dataset = open("raw_noisy", "w")

lines = noise.readlines()
for line in raw:
    a = randint(1, 5449)
    addNoise = lines[a - 1]
    #print a
    #print addNoise
    noisy = (line + addNoise)
    noisy1 = noisy.rstrip()
    #print noisy1
    dataset.write(noisy1)
My expected "dataset" file is:
city mountain sky sun chalk
bay lake sun tree discussions
beach sea sky sun background
But I'm getting:
city mountain sky sun
chalk
bay lake sun tree
discussions
beach sea sky sun
background
Can someone please point out my mistake?
I think you want to do noisy = (line.rstrip("\n") + " " + addNoise)
I tested it and it worked for me.
While reading each line using:
for line in raw:
line contains the newline at the end. You need to remove it.
Try using:
noisy = line.rstrip() + " " + addNoise
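Putting either answer back into the full script gives a loop like the sketch below. Note that addNoise still ends with the '\n' it picked up from readlines(), which now serves as the output line separator, so the extra rstrip() before writing is dropped (file names as in the question):

from random import randint

with open("raw_dataset_1") as raw, open("random") as noise, open("raw_noisy", "w") as dataset:
    lines = noise.readlines()
    for line in raw:
        a = randint(1, 5449)
        addNoise = lines[a - 1]  # keeps its trailing '\n'
        dataset.write(line.rstrip() + " " + addNoise)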
