Web Scraping Get Text to CSV file - python

I'm very new to Python and web scraping. I want to extract text from a website and export it to a CSV file,
but I ran into a problem when I checked the CSV file.
When I run this code (with print):
import requests
from bs4 import BeautifulSoup
import csv
URL = "https://intanseafood.com/demersal-fish"
r = requests.get(URL)
soup = BeautifulSoup(r.content, 'html5lib')
quotes=[]
table = soup.find('div', attrs = {'id':'archive-product'})
for row in table.findAll('div',
                         attrs = {'class':'product-h2'}):
    quote = {}
    quote['product'] = print(row.get_text())
    quotes.append(quote)
Results:
Fish Goldband Snapper Natural Cut
Fish Grouper Portion
Fish Ruby Snaper Natural Cut
Fish Croaker
Fish Grouper WGGS
Fish Pinjalo Snapper Natural Cut
Fish Parrotfish WGGS
Fish Snapper One Cut
But when I change it to this code (export to CSV):
import requests
from bs4 import BeautifulSoup
import csv
URL = "https://intanseafood.com/demersal-fish"
r = requests.get(URL)
soup = BeautifulSoup(r.content, 'html5lib')
quotes=[]
table = soup.find('div', attrs = {'id':'archive-product'})
for row in table.findAll('div',
                         attrs = {'class':'product-h2'}):
    quote = {}
    quote['product'] = row.get_text()
    quotes.append(quote)
filename = 'demersal.csv'
with open(filename, 'w', newline='') as f:
    w = csv.DictWriter(f,['product'])
    w.writeheader()
    for quote in quotes:
        w.writerow(quote)
The CSV file is created, but there is nothing inside except the header. Could anybody help me resolve this? Thanks in advance.

There is a lot of whitespace in your first output, which means there are tabs, spaces, or newlines in the string. A little digging showed that they are newlines and tabs. Remove them, for example:
text = row.get_text()
quote['product'] = text.replace("\t", "").replace("\n","")
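A variation on the same idea, as a minimal sketch: instead of chaining replace() calls, split the string on any whitespace and join it back with single spaces, which also keeps the spaces between words. The helper name clean_text and the sample strings are assumptions for illustration (they stand in for what row.get_text() might return), and the CSV is written to an in-memory buffer rather than to demersal.csv.

```python
import csv
import io


def clean_text(raw):
    """Collapse every run of whitespace (tabs, newlines, spaces) into one space."""
    return " ".join(raw.split())


# Hypothetical raw strings, standing in for what row.get_text() might return.
raw_names = ["\n\t\t\tFish Croaker\n\t\t", "\n\t\tFish Grouper   WGGS\n"]

quotes = [{"product": clean_text(name)} for name in raw_names]

# Write to an in-memory buffer here; a real script would use open(..., newline="").
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, ["product"])
writer.writeheader()
writer.writerows(quotes)
csv_text = buffer.getvalue()
print(csv_text)
```

BeautifulSoup's get_text() also accepts strip=True, which trims the leading and trailing whitespace of each text fragment, so that may be enough on its own if the product name is a single fragment.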

Related

How to remove empty space in the second paragraph?

I'm trying to remove the extra space and "rebtel.bootstrappedData" in the second paragraph, but for some reason it won't work.
This is my output:
"welcome_offer_cuba.block_1_title":"SaveonrechargetoCuba","welcome_offer_cuba.block_1_cta":"Sendrecharge!","welcome_offer_cuba.block_1_cta_prebook":"Pre-bookRecarga","welcome_offer_cuba.block_1_footprint":"Offervalidfornewusersonly.","welcome_offer_cuba.block_2_key":"","welcome_offer_cuba.block_2_title":"Howtosendarecharge?","welcome_offer_cuba.block_2_content":"<ol><li>Simplyenterthenumberyou’dliketosendrechargeinthefieldabove.</li><li>Clickthe“{{buttonText}}”button.</li><li>CreateaRebtelaccountifyouhaven’talready.</li><li>Done!Yourfriendshouldreceivetherechargeshortly.</li></ol>","welcome_offer_cuba.block_3_title":"DownloadtheRebtelapp!","welcome_offer_cuba.block_3_content":"Sendno-feerechargeandenjoythebestcallingratestoCubainoneplace."},"canonical":{"string":"<linkrel=\"canonical\"href=\"https://www.rebtel.com/en/rates/\"/>"}};
rebtel.bootstrappedData={"links":{"summary":{"collection":"country_links","ids":[null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null],"params":{"locale":"en"},"meta":{}},"data":[{"title":"A","links":[{"iso2":"AF","route":"afghanistan","name":"Afghanistan","url":"/en/rates/afghanistan/","callingCardsUrl":"/en/calling-cards/afghanistan/","popular":false},{"iso2":"AL","route":"albania","name":"Albania","url":"/en/rates/albania/
And this is the code I used:
import json
import requests
from bs4 import BeautifulSoup
url = "https://www.rebtel.com/en/rates/"
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")
x = range(132621, 132624)
script = soup.find_all("script")[4].text.strip()[38:]
print(script)
What should I add to "script" so it will remove the empty spaces?
Original answer
You can change the definition of your script variable to:
script = soup.find_all("script")[4].text.replace("\t", "")[38:]
This removes all tab characters from the text, and with them the indentation at the start of each paragraph.
Edit after conversation in the comments
You can use the following code to extract the data as JSON:
import json
import requests
from bs4 import BeautifulSoup
url = "https://www.rebtel.com/en/rates/"
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")
script = list(filter(None, soup.find_all("script")[4].text.replace("\t", "").split("\r\n")))
app_data = json.loads(script[1].replace("rebtel.appData = ", "")[:-1])
bootstrapped_data = json.loads(script[2].replace("rebtel.bootstrappedData = ", ""))
I split the script into lines with split("\r\n") and took the wanted data from there.
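If relying on fixed positions such as script[1] and script[2] feels fragile, the same line-by-line idea can be sketched so that each assignment is looked up by name. This is only an illustration: the sample script text is made up (shaped like the output in the question), parse_assignments is a hypothetical helper, and it assumes every assignment sits on one line in the form "name = {json}" with an optional trailing semicolon.

```python
import json

# Hypothetical script text, shaped like the output in the question:
# one "name = {json};" assignment per line.
script_text = (
    '\r\n\trebtel.appData = {"canonical": {"locale": "en"}};\r\n'
    '\trebtel.bootstrappedData = {"links": {"data": [{"title": "A"}]}}\r\n'
)


def parse_assignments(text):
    """Map each 'name = {json}' line to its decoded JSON value."""
    parsed = {}
    for line in text.splitlines():
        line = line.strip()  # drops tabs and surrounding spaces
        if " = " not in line:
            continue  # skip empty lines and anything that is not an assignment
        name, _, payload = line.partition(" = ")
        parsed[name] = json.loads(payload.rstrip(";"))
    return parsed


data = parse_assignments(script_text)
print(data["rebtel.bootstrappedData"]["links"]["data"][0]["title"])
```

Once the text is decoded with json.loads, whitespace in the raw script no longer matters, so there is nothing left to strip by hand.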

After scraping I can not write the text to a text file

I am trying to scrape the prices from a website, and it's working, but I can't write the result to a text file.
This is my Python code:
import requests
from bs4 import BeautifulSoup as bs
url = "https://www.futbin.com/stc/cheapest"
r = requests.get(url)
soup = bs(r.content, "html.parser")
price = soup.find("div", {"class":"d-flex row col-md-9 px-0"})
name =("example")
f =open(name + '.txt', "a")
f.write(price.text)
This is not working, but if I print it instead of trying to write it to a text file, it works. I have searched for a long time but don't understand it. I think it must be a string to be written to a text file, but I don't know how to change the output to a string.
You're getting an error because of a Unicode character.
Try adding the encoding='utf-8' argument when opening the file.
Your code also gives rather messy output. Try this instead:
import requests
from bs4 import BeautifulSoup as bs
url = "https://www.futbin.com/stc/cheapest"
r = requests.get(url)
soup = bs(r.content, "html.parser")
rows = soup.find("div", {"class":"d-flex row col-md-9 px-0"})
prices = rows.findAll("span",{"class":"price-holder-row"})
names = rows.findAll("div",{"class":"name-holder"})
price_list = []
name_list = []
for price in prices:
    price_list.append(price.text.strip("\n "))
for name in names:
    name_list.append(name.text.split()[0])
name =("example")
with open(f"{name}.txt",mode='w', encoding='utf-8') as f:
    for name, price in zip(name_list,price_list):
        f.write(f"{name}:{price}\n")
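To see the encoding point in isolation, here is a small sketch that needs no network access. The scraped text is a made-up stand-in containing non-ASCII characters, and the file goes into a temporary directory instead of example.txt.

```python
import os
import tempfile

# Hypothetical scraped text containing non-ASCII characters.
scraped_text = "Player É: 1,250 → cheapest"

path = os.path.join(tempfile.mkdtemp(), "example.txt")

# An explicit encoding avoids depending on the platform default,
# which on some systems cannot represent every character.
with open(path, "a", encoding="utf-8") as f:
    f.write(scraped_text + "\n")

with open(path, encoding="utf-8") as f:
    read_back = f.read()

print(read_back == scraped_text + "\n")
```

Using with also closes the file when the block ends, which guarantees the text is flushed to disk; the original code never closes f.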

add extracted data as a new column in the existing csv-python

Could anyone please help? Please point out where I went wrong: the extracted reviews are written into 3 separate columns in hotelreview.csv. How can I fix this so that they are written into 1 column, and how do I add the heading "review" for it, based on the code below?
I also want to add the newly extracted data (the "review" column) to the existing CSV 'hotel_FortWorth.csv'. So far I have only written the extracted information to a new CSV; I don't know how to combine the 2 files, or whether there is another way. The URL can be repeated to match the reviews.
Thank you!
File 'hotel_FortWorth.csv' has 3 columns, for example:
Name link
1 Omni Fort Worth Hotel https://www.tripadvisor.com.au/Hotel_Review-g55857-d777199-Reviews-Omni_Fort_Worth_Hotel-Fort_Worth_Texas.html
2 Hilton Garden Hotel https://www.tripadvisor.com.au/Hotel_Review-g55857-d2533205-Reviews-Hilton_Garden_Inn_Fort_Worth_Medical_Center-Fort_Worth_Texas.html
3......
...
I used the URLs from the existing CSV to extract the reviews, with the code shown below:
import requests
from unidecode import unidecode
from bs4 import BeautifulSoup
import pandas as pd
file = []
data = pd.read_csv('hotel_FortWorth.csv', header = None)
df = data[2]
for url in df[1:]:
    print(url)
    thepage = requests.get(url).text
    soup = BeautifulSoup(thepage, "html.parser")
    resultsoup = soup.find_all("p", {"class": "partial_entry"})
    file.extend(resultsoup)
with open('hotelreview.csv', 'w', newline='') as fid:
    for review in file:
        review_list = review.get_text()
        fid.write(unidecode(review_list+'\n'))
Expected result:
name link review
1 ... ... ...
2
....
You can use pandas to create the new CSV.
For example:
import requests
from unidecode import unidecode
from bs4 import BeautifulSoup
import pandas as pd
data = pd.read_csv('hotel_FortWorth.csv')
review = []
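for url in data["link"]:
    print(url)
    thepage = requests.get(url).text
    soup = BeautifulSoup(thepage, "html.parser")
    resultsoup = soup.find_all("p", {"class": "partial_entry"})
    # find_all returns a list of tags, while unidecode expects a string,
    # so join the text of every review for this hotel into one cell
    review.append(unidecode(" ".join(p.get_text() for p in resultsoup)))
data["review"] = review
data.to_csv('hotelreview.csv')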
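For comparison, the same "one review cell per hotel" idea can be sketched with only the standard csv module instead of pandas. Everything in the sample data is made up for illustration (hotel names, links, review text), and the " | " separator is an arbitrary choice. The reviews in the question most likely spilled into several columns because review text containing commas was written with a plain fid.write(); a csv writer quotes such cells.

```python
import csv
import io

# Hypothetical contents of the existing file and of the scraped reviews;
# the hotel names, links, and review text are made up for illustration.
existing_csv = "Name,link\nHotel A,https://example.com/a\nHotel B,https://example.com/b\n"
reviews_by_link = {
    "https://example.com/a": ["Great stay, clean rooms", "Friendly staff"],
    "https://example.com/b": [],
}

rows = list(csv.DictReader(io.StringIO(existing_csv)))
for row in rows:
    # Join all reviews for one hotel so they occupy a single cell.
    row["review"] = " | ".join(reviews_by_link.get(row["link"], []))

out = io.StringIO(newline="")
writer = csv.DictWriter(out, fieldnames=["Name", "link", "review"])
writer.writeheader()
writer.writerows(rows)
print(out.getvalue())
```

The first review contains a comma, so the writer wraps that cell in quotes and it still reads back as one column.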

python webscraping and write data into csv

I'm trying to save all the data (i.e. all pages) in a single CSV file, but this code only saves the data from the final page. For example, here url[] contains 2 URLs, and the final CSV only contains the data from the 2nd URL.
I'm clearly doing something wrong in the loop, but I don't know what.
Also, this page contains 100 data points, but this code only writes the first 44 rows.
Please help with this issue.
from bs4 import BeautifulSoup
import requests
import csv
url = ["http://sfbay.craigslist.org/search/sfc/npo","http://sfbay.craigslist.org/search/sfc/npo?s=100"]
for ur in url:
    r = requests.get(ur)
    soup = BeautifulSoup(r.content)
    g_data = soup.find_all("a", {"class": "hdrlnk"})
    gen_list=[]
    for row in g_data:
        try:
            name = row.text
        except:
            name=''
        try:
            link = "http://sfbay.craigslist.org"+row.get("href")
        except:
            link=''
        gen=[name,link]
        gen_list.append(gen)
with open ('filename2.csv','wb') as file:
    writer=csv.writer(file)
    for row in gen_list:
        writer.writerow(row)
The gen_list is being initialized again inside your loop that runs over the URLs:
gen_list=[]
Move this line outside the for loop.
...
url = ["http://sfbay.craigslist.org/search/sfc/npo","http://sfbay.craigslist.org/search/sfc/npo?s=100"]
gen_list=[]
for ur in url:
    ...
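The effect of moving the list can be seen in a small sketch that runs without any network access. The pages dictionary is a made-up stand-in for the scraped results of each URL, and the rows are written to an in-memory buffer instead of filename2.csv.

```python
import csv
import io

# Hypothetical stand-in for the scraped pages: each URL maps to the
# (name, link) pairs that would be extracted from it.
pages = {
    "page-1": [("Listing A", "/a.html"), ("Listing B", "/b.html")],
    "page-2": [("Listing C", "/c.html")],
}

gen_list = []  # created once, before the loop, so rows from every page accumulate
for page_url, listings in pages.items():
    for name, link in listings:
        gen_list.append([name, link])

# In Python 3 the csv module expects a text-mode file opened with newline="";
# an in-memory buffer keeps this sketch self-contained.
buffer = io.StringIO(newline="")
csv.writer(buffer).writerows(gen_list)
print(len(gen_list))
```

With gen_list = [] inside the outer loop instead, only the rows of the last page would survive, which matches the behaviour described in the question.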
I found your post later; you may want to try this method:
import requests
from bs4 import BeautifulSoup
import csv
final_data = []
url = "https://sfbay.craigslist.org/search/sss"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, "html.parser")
get_details = soup.find_all(class_="result-row")
for details in get_details:
    getclass = details.find_all(class_="hdrlnk")
    for link in getclass:
        link1 = link.get("href")
        sublist = []
        sublist.append(link1)
        final_data.append(sublist)
print(final_data)
filename = "sfbay.csv"
# newline="" stops the csv module from writing blank lines between rows on Windows
with open("./"+filename, "w", newline="") as csvfile:
    writer = csv.writer(csvfile, delimiter = ",")
    writer.writerow("")  # writes an empty first row
    for i in range(0, len(final_data)):
        writer.writerow(final_data[i])

writing only one row after website scraping

I am trying to extract a list of all the golf courses in the USA through this link. I need to extract the name of the golf course, the address, and the phone number. My script is supposed to extract all the data from the website, but it looks like it only writes one row to my CSV file. I noticed that when I print the "name" field, it only prints once despite the find_all function. I need all the data, not just one field from multiple links on the website.
How do I go about fixing my script so that it writes all the needed data into a CSV file?
Here is my script:
import csv
import requests
from bs4 import BeautifulSoup
courses_list = []
for i in range(1):
    url="http://www.thegolfcourses.net/page/1?ls&location=California&orderby=title&radius=6750#038;location=California&orderby=title&radius=6750" #.format(i)
    r = requests.get(url)
    soup = BeautifulSoup(r.content)
    g_data2=soup.find_all("div",{"class":"list"})
    for item in g_data2:
        try:
            name= item.contents[7].find_all("a",{"class":"entry-title"})[0].text
            print name
        except:
            name=''
        try:
            phone= item.contents[7].find_all("p",{"class":"listing-phone"})[0].text
        except:
            phone=''
        try:
            address= item.contents[7].find_all("p",{"class":"listing-address"})[0].text
        except:
            address=''
        course=[name,phone,address]
        courses_list.append(course)
with open ('PGN_Final.csv','a') as file:
    writer=csv.writer(file)
    for row in courses_list:
        writer.writerow([s.encode("utf-8") for s in row])
Here is a neater implementation of your code. You can use the urllib2 library instead of requests; BeautifulSoup works the same way.
import csv
import urllib2
from BeautifulSoup import *
url="http://www.thegolfcourses.net/page/1?ls&location=California&orderby=title&radius=6750#038;location=California&orderby=title&radius=6750" #.format(i)
r = urllib2.urlopen(url).read()
soup = BeautifulSoup(r)
courses_list = []
courses_list.append(("Course name","Phone Number","Address"))
names = soup.findAll('h2', attrs={'class':'entry-title'})
phones = soup.findAll('p', attrs={'class':'listing-phone'})
address = soup.findAll('p', attrs={'class':'listing-address'})
for na, ph, add in zip(names,phones, address):
    courses_list.append((na.text,ph.text,add.text))
with open ('PGN_Final.csv','a') as file:
    writer=csv.writer(file)
    for row in courses_list:
        writer.writerow([s.encode("utf-8") for s in row])
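One thing to watch with zip() over three separately collected lists: it stops at the shortest one, so a single listing without a phone number silently drops rows from the end. A possible workaround, sketched here in Python 3 with made-up course data, is itertools.zip_longest, which keeps every row and pads the gaps. This is a swapped-in variation, not what the answer above does.

```python
import csv
import io
from itertools import zip_longest

# Hypothetical scraped columns; one phone number is missing on purpose.
names = ["Course A", "Course B", "Course C"]
phones = ["555-0100", "555-0101"]
addresses = ["1 Fairway Rd", "2 Green Ln", "3 Bunker St"]

courses_list = [("Course name", "Phone Number", "Address")]
# zip() would stop at the shortest list and drop "Course C";
# zip_longest() keeps every row and fills the gaps with "".
courses_list.extend(zip_longest(names, phones, addresses, fillvalue=""))

buffer = io.StringIO(newline="")
csv.writer(buffer).writerows(courses_list)
print(buffer.getvalue())
```

Padding only keeps the row count intact; it cannot tell which course actually lacked a phone number, so the values may still be misaligned. Extracting name, phone, and address per listing, as the loop in the question does, avoids that problem.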
