Python scraping encoding excel formulas

I'm trying to scrape websites to CSV and work on this data, but text formulas won't work properly. I don't really understand what I'm doing wrong, but my guess is the encoding part.
This is the Python part:
page = requests.get(url)
# Trust the declared encoding only if the server actually sent a charset header
encoding = page.encoding if 'charset' in page.headers.get('content-type', '').lower() else None
soup = BeautifulSoup(page.content, 'html.parser', from_encoding=encoding)
example = soup.find(class_=htmlClass).get_text()
# Drop blank lines, then split the text into a list of lines
example = "".join([s for s in example.splitlines(True) if s.strip()])
example = example.splitlines()
outputList.append(example)
[...]
with open(outputFile, "w") as fileHandle:
    fileHandle.writelines(outputFileData)
The text in the CSV does look OK, but if I try some MATCH formulas they often won't find the data: =MATCH("*13 MARCH*";F1:F20;0) gives N/A even though the text 13 MARCH is in the column.
I've made many changes and tests, and I noticed that when I use this:
with codecs.open(outputFile, "w", "utf-8") as fileHandle: I get special characters in the CSV file, and this probably explains why the MATCH formulas don't find the text.
If it helps: I actually import the CSV into Google Sheets via a script and then work with the MATCH formulas there. The script is:
function importFromCSV() {
  var file = DriveApp.getFilesByName("menulist.csv");
  var csvFile = file.next().getBlob().getDataAsString();
  var csvData = Utilities.parseCsv(csvFile, ";");
  var ss = SpreadsheetApp.openById("xxx");
  var sheet = ss.getSheetByName('import');
  sheet.getRange('A7:AZ60').clear();
  sheet.getRange(7, 1, csvData.length, csvData[0].length).setValues(csvData);
}
I had garbled characters with the above, so I changed it to var csvFile = file.next().getBlob().getDataAsString('ISO-8859-1'); to avoid them, but the MATCH formula still won't work.
Any idea what I'm doing wrong with the encoding?

Try using the following; hopefully it will solve your problem:
with codecs.open(outputFile, "w", "utf-8-sig") as fileHandle:
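The -sig variant writes a UTF-8 byte order mark (BOM) at the start of the file. The BOM is the cue Excel (and, typically, the Google Sheets import) uses to detect UTF-8 instead of falling back to a legacy code page, so accented characters survive the import and wildcard MATCH lookups can find the text. A minimal sketch, with made-up row data standing in for the scraped output:
import codecs

# Hypothetical rows; the real data comes from the scraping loop above
rows = ["13 MARCH;Menu du jour", "14 MARCH;Entrée;Plat"]

with codecs.open("menulist.csv", "w", "utf-8-sig") as fileHandle:
    fileHandle.write("\n".join(rows) + "\n")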

Related

Decoding problem with fitz.Document in Python 3.7

I want to extract the text of a PDF and use some regular expressions to filter for information.
I am coding in Python 3.7.4, using fitz (PyMuPDF) to parse the PDF. The PDF is written in German. My code looks as follows:
doc = fitz.open(pdfpath)
pagecount = doc.pageCount
page = 0
content = ""
while page < pagecount:
    p = doc.loadPage(page)
    page += 1
    content = content + p.getText()
Printing the content, I realized that the first (and important) half of the document is decoded as a strange mix of Japanese (?) characters and others, like this: ョ。オウキ・ゥエオョァ@ュ.
I tried to solve it with different decodings (latin-1, iso-8859-1); the encoding is definitely UTF-8:
content = content + p.getText().encode("utf-8").decode("utf-8")
I also tried to get the text using minecart:
import minecart

file = open(pdfpath, 'rb')
document = minecart.Document(file)
for page in document.iter_pages():
    for lettering in page.letterings:
        print(lettering)
which results in the same problem.
Using textract, the first half is an empty string:
import textract
text = textract.process(pdfpath)
print(text.decode('utf-8'))
Same thing with PyPDF2:
import PyPDF2
pdfFileObj = open(pdfpath, 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
for index in range(0, pdfReader.numPages) :
pageObj = pdfReader.getPage(index)
print(pageObj.extractText())
I don't understand the problem, as it looks like a normal PDF with normal text. Also, some of the PDFs don't have this problem.
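Mojibake like this from several independent extractors usually means the PDF's embedded fonts lack a usable ToUnicode mapping, so no library can recover the real text; re-encoding on the Python side cannot fix it. A common workaround is to rasterize the affected pages and OCR them. A sketch of that approach, assuming a recent PyMuPDF (1.19+) and pytesseract with German language data installed:
import io

import fitz  # PyMuPDF
import pytesseract
from PIL import Image

doc = fitz.open(pdfpath)
content = ""
for page in doc:
    # Render the page to an image and OCR it instead of trusting extraction
    pix = page.get_pixmap(dpi=300)
    img = Image.open(io.BytesIO(pix.tobytes("png")))
    content += pytesseract.image_to_string(img, lang="deu")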

Python Looping through urls in csv file returns \ufeffhttps://

I am new to Python and I am trying to loop through a list of URLs in a CSV file, grab each website's title using BeautifulSoup, and then save the titles to a file Headlines.csv. But I am unable to grab the webpage titles. If I use a variable with a single URL as follows:
url = 'https://www.space.com/japan-hayabusa2-asteroid-samples-landing-date.html'
resp = req.get(url)
soup = BeautifulSoup(resp.text, 'lxml')
print(soup.title.text)
It works just fine and I get the title Japanese capsule carrying pieces of asteroid Ryugu will land on Earth Dec. 6 | Space
But when I use the loop,
import csv

with open('urls_file2.csv', newline='', encoding='utf-8') as f:
    reader = csv.reader(f)
    for url in reader:
        print(url)
        resp = req.get(url)
        soup = BeautifulSoup(resp.text, 'lxml')
        print(soup.title.text)
I get the following
['\ufeffhttps://www.foxnews.com/us/this-day-in-history-july-16']
and an error message
InvalidSchema: No connection adapters were found for "['\\ufeffhttps://www.foxnews.com/us/this-day-in-history-july-16']"
I am not sure what I am doing wrong.
You have a byte order mark \ufeff on the URL you parse from your file.
It looks like the file was saved with a UTF-8 signature, i.e. with the encoding utf-8-sig.
You need to read the file with encoding='utf-8-sig'.
As the previous answer already mentioned, the \ufeff is a byte order mark, so you need to change the encoding.
The second issue is that when you read a CSV file, you get a list containing all the columns of each row. The keyword here is list: you are passing requests a list instead of a string.
Based on the example you have given, I would assume that your URLs are in the first column of the CSV. Python lists start at index 0, not 1, so to extract the URL you need to take index 0, which refers to the first column.
import csv

with open('urls_file2.csv', newline='', encoding='utf-8-sig') as f:
    reader = csv.reader(f)
    for url in reader:
        print(url[0])
You can add more columns to the CSV file and experiment to see how the results would appear.
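Putting the two fixes together (BOM-aware reading plus indexing the first column), a sketch of the full loop that also writes the titles out to Headlines.csv as the question intends; the two-column output format is an assumption:
import csv

import requests as req
from bs4 import BeautifulSoup

with open('urls_file2.csv', newline='', encoding='utf-8-sig') as f, \
        open('Headlines.csv', 'w', newline='', encoding='utf-8') as out:
    reader = csv.reader(f)
    writer = csv.writer(out)
    for row in reader:
        url = row[0]  # the URL sits in the first column
        resp = req.get(url)
        soup = BeautifulSoup(resp.text, 'lxml')
        writer.writerow([url, soup.title.text.strip()])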

Writing the exact same thing in CSV file using Python

I've encountered an issue with the CSV-writing part of a web-scraping project.
I have data formatted like this:
table = {
    "UR": url,
    "DC": desc,
    "PR": price,
    "PU": picture,
    "SN": seller_name,
    "SU": seller_url
}
I get it from a loop that analyzes an HTML page and returns this table. The table itself is fine; it changes on every iteration of the loop.
The problem is that when I write every table from that loop into my CSV file, the same thing gets written over and over again. The only element written is the first one the loop produces, written about 10 million times instead of about 45 times (the number of articles per page).
I tried to do it plainly with the csv library and then with pandas.
So here's my loop:
if os.path.isfile(file_path) is False:
    open(file_path, 'a').close()

file = open(file_path, "a", encoding="utf-8")

i = 1
while True:
    final_url = website + brand_formatted + "+handbags/?p=" + str(i)
    request = requests.get(final_url)
    soup = BeautifulSoup(request.content, "html.parser")
    articles = soup.find_all("div", {"class": "dui-card searchresultitem"})
    for article in articles:
        table = scrap_it(article)
        write_to_csv(table, file)
    if i == nb_page:
        break
    i += 1
file.close()
and here is my method for writing into the CSV file:
def write_to_csv(table, file):
    import csv
    writer = csv.writer(file, delimiter=" ")
    writer.writerow(table["UR"])
    writer.writerow(table["DC"])
    writer.writerow(table["PR"])
    writer.writerow(table["PU"])
    writer.writerow(table["SN"])
    writer.writerow(table["SU"])
I'm pretty new to writing CSV files and to Python in general, but I can't find out why this isn't working. I've followed many guides and ended up with more or less the same code for writing CSV files.
Edit: here's a screenshot of my CSV file output; you can see that every element is exactly the same, even though my table changes.
EDIT: I fixed my problem by creating a separate file for each article I scrape. That's a lot of files, but apparently it is fine for my project.
This might be the solution you wanted:
import csv

fieldnames = ['UR', 'DC', 'PR', 'PU', 'SN', 'SU']

def write_to_csv(table, file):
    writer = csv.DictWriter(file, fieldnames=fieldnames)
    writer.writerow(table)
Reference: https://docs.python.org/3/library/csv.html
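A note on the original write_to_csv: csv.writer.writerow expects a sequence of fields, so passing a string such as table["UR"] writes every single character as its own column, which by itself produces mangled output. For completeness, a sketch of how the DictWriter version could plug into the scraping loop; newline='' is the csv module's recommended way to open the file, and the write-header-once logic is an assumption about the desired layout:
import csv
import os

fieldnames = ['UR', 'DC', 'PR', 'PU', 'SN', 'SU']

# Label the columns only when starting a fresh (or empty) file
write_header = not os.path.exists(file_path) or os.path.getsize(file_path) == 0

with open(file_path, 'a', newline='', encoding='utf-8') as file:
    writer = csv.DictWriter(file, fieldnames=fieldnames)
    if write_header:
        writer.writeheader()
    for article in articles:
        writer.writerow(scrap_it(article))  # scrap_it is from the question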

Save body text on csv file | Python 3

I am trying to create a database of several articles for text-mining purposes.
I extract the body of each article via web scraping and then save it to a CSV file. However, I couldn't manage to save all the body texts.
The code I came up with saves only the text of the last URL (article), while if I print what I am scraping (and what I am supposed to save) I get the bodies of all the articles.
I included only some of the URLs from the list (which contains a larger number of them), just to give you an idea:
import requests
from bs4 import BeautifulSoup
import csv

r = ["http://www.nytimes.com/2016/10/12/world/europe/germany-arrest-syrian-refugee.html",
     "http://www.nytimes.com/2013/06/16/magazine/the-effort-to-stop-the-attack.html",
     "http://www.nytimes.com/2016/10/06/world/europe/police-brussels-knife-terrorism.html",
     "http://www.nytimes.com/2016/08/23/world/europe/france-terrorist-attacks.html",
     "http://www.nytimes.com/interactive/2016/09/09/us/document-Review-of-the-San-Bernardino-Terrorist-Shooting.html",
     ]

for url in r:
    t = requests.get(url)
    t.encoding = "ISO-8859-1"
    soup = BeautifulSoup(t.content, 'lxml')
    text = soup.find_all(("p", {"class": "story-body-text story-content"}))
    print(text)
    with open('newdb30.csv', 'w', newline='') as csvfile:
        spamwriter = csv.writer(csvfile, delimiter=' ', quotechar='|', quoting=csv.QUOTE_MINIMAL)
        spamwriter.writerow(text)
Try declaring a variable such as all_text = "" before the for loop and appending the text with all_text += text + "\n" at the end of each iteration (the \n creates a new line).
Then, at the end, write all_text instead of text.
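A sketch of that idea, with two extra fixes the question's code needs in practice: find_all takes the tag name and the attributes as separate arguments (the extra parentheses pass a single tuple instead), and it returns a list of tags, so the result has to be flattened into a string before concatenating. Writing one row per article is an assumption about the desired layout:
import csv

import requests
from bs4 import BeautifulSoup

all_rows = []
for url in r:
    t = requests.get(url)
    soup = BeautifulSoup(t.content, 'lxml')
    paragraphs = soup.find_all("p", {"class": "story-body-text story-content"})
    body = "\n".join(p.get_text() for p in paragraphs)
    all_rows.append([url, body])

# Open the file once, after the loop, so earlier articles are not overwritten
with open('newdb30.csv', 'w', newline='') as csvfile:
    spamwriter = csv.writer(csvfile, delimiter=' ', quotechar='|',
                            quoting=csv.QUOTE_MINIMAL)
    spamwriter.writerows(all_rows)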

How to prevent writing the same words into a txt file when using open('text.txt', 'a')?

I have a question regarding appending to a text file. I have written a script that reads a URL returning JSON, extracts the list of titles, and writes them into the file "WordsInCategory.text".
As this code will be used in a loop, I used f1 = open('WordsInCategory.text', 'a').
But I ran into a problem: it will also add titles that already exist in the file.
I am having trouble coming up with a solution, since using 'w' will overwrite what is already written.
My code is as follows:
import urllib2
import json

url1 = 'https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtype=page&cmtitle=Category:Geography&cmlimit=100'
json_obj = urllib2.urlopen(url1)
data1 = json.load(json_obj)

f1 = open('WordsInCategory.text', 'a')

for item in data1['query']:
    for i in data1['query']['categorymembers']:
        f1.write((i['title']).encode('utf8') + "\n")
Please advise on how I should modify my code.
Thank you.
I would suggest saving every title in an array before writing to the file (and hence writing only once to the given file). You can modify your code this way:
import urllib2
import json

data = []
f1 = open('WordsInCategory.text', 'w')

url1 = 'https://en.wikipedia.org/w/api.php?\
action=query&format=json&list=categorymembers\
&cmtype=page&cmtitle=Category:Geography&cmlimit=100'

json_obj = urllib2.urlopen(url1)
data1 = json.load(json_obj)

for item in data1['query']:
    for i in data1['query']['categorymembers']:
        data.append(i['title'].encode('utf8') + "\n")

# Do additional requests, and append the new titles to the data array

f1.write(''.join(set(data)))
f1.close()
set allows me to delete any duplicate entry.
If keeping the titles in memory is a problem, you can check whether the title already exists before writing it to the file, but that may be awfully time-consuming:
import urllib2
import json

data = []

url1 = 'https://en.wikipedia.org/w/api.php?\
action=query&format=json&list=categorymembers\
&cmtype=page&cmtitle=Category:Geography&cmlimit=100'

json_obj = urllib2.urlopen(url1)
data1 = json.load(json_obj)

for item in data1['query']:
    for i in data1['query']['categorymembers']:
        title = i['title'].encode('utf8') + "\n"
        with open('WordsInCategory.text', 'r') as title_check:
            if title not in title_check:
                data.append(title)

with open('WordsInCategory.text', 'a') as f1:
    f1.write(''.join(set(data)))

# Handle additional requests
Hope it'll be helpful.
You can track the titles you have already added:
titles = []
and then check the list before each write:
if title not in titles:
    # write to file
    titles.append(title)
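Wired into the original script, that bookkeeping could look like the sketch below (Python 2, matching the question's urllib2). Pre-loading the set from the existing file, an addition not in the answer above, keeps repeated runs duplicate-free as well:
import urllib2
import json

url1 = ('https://en.wikipedia.org/w/api.php?action=query&format=json'
        '&list=categorymembers&cmtype=page&cmtitle=Category:Geography&cmlimit=100')

# Pre-load titles written by earlier runs so they are never repeated
try:
    with open('WordsInCategory.text') as existing:
        titles = set(line.rstrip('\n') for line in existing)
except IOError:
    titles = set()

data1 = json.load(urllib2.urlopen(url1))

with open('WordsInCategory.text', 'a') as f1:
    for i in data1['query']['categorymembers']:
        title = i['title'].encode('utf8')
        if title not in titles:
            f1.write(title + "\n")
            titles.add(title)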
