Arabic output in Python 3 is stored weirdly in a file - python

I made a small web-scraping bot using Python 3. Currently it takes the content between classes and writes it into a .csv file, but when I open the file, the Arabic part of it looks like this:
وائل ÙتÙ
I tried arabic-reshaper, but it seems to only handle direction and some sort of encoding; when storing the string it produces the same bad characters as above.
Also, the code below successfully writes Arabic content into a text file:
s = "ذهب الطالب الى المدرسة"
with open("file.txt", "w", encoding="utf-8") as myfile:
    myfile.write(s)
Note: I'm using a Selenium driver to get the content:
content = driver.page_source
soup = BeautifulSoup(content)

Try this, it should work:
soup = BeautifulSoup(content.decode('utf-8'))

Answer after digging into the problem some more:
1. I found that if I open the file with plain Windows Notepad, I can see the Arabic content, so Python was writing the website content correctly!
2. I used this video as a reference to correctly show the data in Excel (which was where the problem actually was):
https://www.youtube.com/watch?v=V6AR_Hi7p5Q
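For anyone who lands on the same Excel problem: the usual fix is to write the CSV with the utf-8-sig codec, which prepends a byte order mark that Excel uses to detect UTF-8. A minimal sketch (the row values are placeholders, not the real scraped data):

import csv

rows = [["ذهب الطالب الى المدرسة"]]  # placeholder data, not the real scrape
# "utf-8-sig" writes a BOM so Excel recognizes the file as UTF-8
with open("output.csv", "w", encoding="utf-8-sig", newline="") as f:
    csv.writer(f).writerows(rows)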

Related

Forward slash "/" in string converted to "&#47", is that platform independent behaviour?

I have a Python script that reads an HTML file into lines and then filters out the relevant lines before saving them back as an HTML file. I had some problems until I figured out that a / in the page text was being converted to &#47 when saved as a string.
The source html that I'm parsing through has the following line:
<h3 style="text-align:left">SYDNEY/KINGSFORD SMITH (YSSY)</h3>
which, after passing through file.readlines(), comes out as:
<h3 style='text-align:left'>SYDNEY&#47BANKSTOWN (YSBK)</h3>
which then trips up BeautifulSoup, because the stray "&" symbol confuses all subsequent tags.
What I'm interested in is to know if this replacement value "&#47" is platform independent or not?
It's not hard to run a .replace prior to saving each string, which avoids the issue now that I'm coding and testing on Windows, but will it still work if I deploy my script on a Linux server?
Here's what I have now, which works fine when run under windows:
def getHTML(self, html_source):
    with open(html_source, 'r') as file:
        source_lines = file.readlines()
    relevant = False
    relevant_lines = []
    for line in source_lines:
        if "</table>" in line:
            relevant = False
        if self.airport in line:
            relevant = True
        if relevant:
            line = line.replace("&#47", " ")
            relevant_lines.append(line)
    relevant_lines.append("</table>")
    filename = f"{html_source[:-5]}_{self.airport}.html"
    with open(filename, 'w') as file:
        file.writelines(relevant_lines)
    with open(filename, 'r') as file:
        relevant_html = file.read()
    return relevant_html
Can anyone tell me, without my having to install a Linux virtual machine, whether this will work cross-platform? I tried to find documentation on this, but all I could find was about ways to explicitly escape a / when entering a string, nothing documenting how to deal with / or other invalid characters when reading a source file into strings.
It should be OK everywhere; it is a standard HTML character reference.
See https://www.w3schools.com/charsets/ref_html_ascii.asp
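If you would rather decode the reference than blank it out, Python's built-in html module converts numeric character references even when the trailing semicolon is missing, so a sketch like this should behave the same on Windows and Linux:

import html

line = '<h3 style="text-align:left">SYDNEY&#47BANKSTOWN (YSBK)</h3>'
# html.unescape converts &#47 (with or without the semicolon) back to "/"
print(html.unescape(line))
# <h3 style="text-align:left">SYDNEY/BANKSTOWN (YSBK)</h3>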

PDF - Split Single Words into Individual Lines - Python 3

I am trying to extract words from a PDF into individual lines, but can only do this with text files, as demonstrated below.
Moreover, the rule is that I cannot convert the PDF files to TXT and then perform this operation; it must be done on the PDF files directly.
with open('filename.txt', 'r') as f:
    for line in f:
        for word in line.split():
            print(word)
If filename.txt has just "Hello World!", then this function returns:
Hello
World!
I need to do the same with searchable PDF files as well. Any help would be appreciated.
Check out PyMuPDF. There's loads of stuff you can do, including getting text line by line from a PDF using page.getText().
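A minimal sketch of that approach, assuming a file named filename.pdf (note that newer PyMuPDF releases spell the call page.get_text(); getText() is the older name):

import fitz  # PyMuPDF

doc = fitz.open("filename.pdf")
for page in doc:
    # get_text() returns the page text with its original line breaks
    for line in page.get_text().splitlines():
        for word in line.split():
            print(word)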
For the PDF, you should use pdfminer or PyPDF2.
Here is a good article you can use to extract the text, and then you can use Anilkumar's method to extract it line by line:
https://medium.com/@rqaiserr/how-to-convert-pdfs-into-searchable-key-words-with-python-85aab86c544f
You can use pdfreader to extract the texts (plain and containing PDF operators) from a PDF document.
Here is a sample extracting all of the above from every page of the document:
from pdfreader import SimplePDFViewer, PageDoesNotExist

fd = open(your_pdf_file_name, "rb")
viewer = SimplePDFViewer(fd)

plain_text = ""
pdf_markdown = ""
try:
    while True:
        viewer.render()
        pdf_markdown += viewer.canvas.text_content
        plain_text += "".join(viewer.canvas.strings)
        viewer.next()
except PageDoesNotExist:
    pass
Just want to point out that text in PDFs usually does not come as "words": it comes as commands telling a conforming PDF viewer where and how to put each glyph, which means a single word may be drawn by several commands. Read more on that in the PDF 1.7 docs, sec. 9 - Text.
When I saw filename.txt I got confused.
Since you are working with PDFs, the link below might be helpful:
How to use PDFminer.six with python 3?
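For reference, a minimal pdfminer.six sketch (assuming the PDF is searchable, i.e. it has a text layer): the high-level extract_text call returns the whole document as one string, which can then be split word by word exactly like the text-file version above.

from pdfminer.high_level import extract_text

text = extract_text("filename.pdf")  # whole document as a single string
for line in text.splitlines():
    for word in line.split():
        print(word)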

How to properly extract UTF-8 text (Japanese symbols) from a webpage with BeautifulSoup4

I downloaded webpages using wget, and now I am trying to extract some data I need from those pages. The problem is with the Japanese words contained in this data; the English words are extracted perfectly.
When I try to extract the Japanese words and use them in another app, they appear as gibberish. While testing different methods, there was one solution that fixed only half of the Japanese words.
What I tried: I tried
from_encoding="utf-8"
which had no effect. I also tried multiple ways of extracting the text from the HTML, like
section.get_text(strip=True)
section.text.strip()
and others. I also tried to encode the generated text using URL encoding, which did not work, and I tried every snippet I could find on Stack Overflow.
One of the methods that strangely worked (but not completely) was saving the string in a dictionary, then saving that into a JSON file, then reading the JSON from ANOTHER script. Just using the dictionary as it is would not work; I have to use JSON as a middleman between two scripts. Strange. (Not all the words worked.)
My question may look like a duplicate of another question, but that other question is about scraping from the internet, while I am trying to extract from an offline source.
Here is a simple script demonstrating the main problem:
from bs4 import BeautifulSoup

page = BeautifulSoup(open("page1.html"), 'html.parser', from_encoding="utf-8")
word = page.find('span', {'class': "radical-icon"})
wordtxt = word.get_text(strip=True)

# then save the word to a file
with open("text.txt", "w", encoding="utf8") as text_file:
    text_file.write(wordtxt)
When I open the file, I get gibberish characters.
Here is the part of the HTML that BeautifulSoup searches:
<span class="radical-icon" lang="ja">亠</span>
The expected result is to get the symbols into the text file, or to save them properly in any way.
Is there a better web scraper to use to properly get the UTF-8?
PS: sorry for the bad English.
I think I found an answer: just uninstall beautifulsoup4; I don't need it.
Python has a built-in way to search for strings. I tried something like this:
import codecs
import re

with codecs.open("page1.html", 'r', 'utf-8') as myfile:
    for line in myfile:
        if line.find('<span class="radical-icon"') > -1:
            result = re.search('<span class="radical-icon" lang="ja">(.*)</span>', line)
            s = result.group(1)

with codecs.open("text.txt", 'w', 'utf-8') as textfile:
    textfile.write(s)
This is an over-complicated and non-Pythonic way of doing it, but what works works.
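For reference, the original BeautifulSoup code can also be made to work: from_encoding is only honored when the parser receives bytes, and open() without an encoding argument decodes with the platform default (often cp1252 on Windows), which mangles the Japanese text before BeautifulSoup ever sees it. A minimal sketch, assuming page1.html really is UTF-8:

from bs4 import BeautifulSoup

# open in binary mode so from_encoding actually applies
with open("page1.html", "rb") as f:
    page = BeautifulSoup(f, "html.parser", from_encoding="utf-8")

word = page.find("span", {"class": "radical-icon"})
with open("text.txt", "w", encoding="utf-8") as text_file:
    text_file.write(word.get_text(strip=True))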

How can I extract a text from a bytes file using python

I am trying to code a script that gets the code of a website, saves all the HTML into a file and then extracts some information.
For the moment I've done the first part: I've saved all the HTML into a text file.
Now I have to extract the relevant information and save it in another text file.
But I'm having problems with the encoding, and I also don't know very well how to extract the text in Python.
Parsing a website:

import urllib.request

# file name to store the data
file_name = r'D:\scripts\datos.txt'

# I want to get the text that goes after this tag <p class="item-description">
# and before this other one </p>
tag_starts_with = '<p class="item-description">'
tag_ends_with = '</p>'

# I get the website code and I save it into a text file
with urllib.request.urlopen("http://www.website.com/") as response, open(file_name, 'wb') as out_file:
    data = response.read()
    out_file.write(data)
    print(out_file)  # First question: how can I print the file? It gives me an error, I can't print bytes

# the file is now full of HTML text, so I want to open it and process it
file_for_results = open(r'D:\scripts\datos.txt', encoding="utf8")

# Extract information from the file
# Second question: how do I take a substring of the lines the file contains and get the
# text between <p class="item-description"> and </p>, so I can store it in file_for_results?

Here is the pseudocode that I'm not capable of coding:

for line in file_to_filter:
    if line contains word_starts_with:
        copy into file_for_results until you find </p>
I am assuming this is an assignment of some sort where you need to parse the HTML with a given algorithm; if not, just use Beautiful Soup.
The pseudocode actually translates to Python quite easily:
file_to_filter = open("file.html", 'r')
out_file = open("text_output", 'w')
for line in file_to_filter:
    if word_starts_with in line:
        print(line, end='', file=out_file)  # Store data in another file
    if word_ends_with in line:
        break
And of course you need to close the files, make sure you remove the tags and so on, but this is roughly what your code should be given this algorithm.
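And if hand-rolled parsing is not actually required, a minimal Beautiful Soup sketch of the same task (assuming the page really wraps the text in <p class="item-description"> elements):

import urllib.request
from bs4 import BeautifulSoup

with urllib.request.urlopen("http://www.website.com/") as response:
    soup = BeautifulSoup(response.read(), "html.parser")

# the text of every <p class="item-description">...</p>, tags stripped
for p in soup.find_all("p", class_="item-description"):
    print(p.get_text())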

lxml adds urlencoding in xml?

I'll preface this by indicating I'm using Python 2.7.3 (x64) on Windows 7, with lxml 2.3.6.
I have a little, odd problem that I'm hoping somebody can help with. I haven't found a solution online; perhaps I'm not searching for the right thing.
Anyway, I have a problem where I'm programmatically building some XML with lxml, then outputting it to a text file. The problem is that lxml is converting carriage returns to the text &#13;, almost like urlencoding - but I'm not using HTML, I'm using XML.
For example, I have a simple text file created in Notepad, like this:
This
is
my
text
I then build some xml and add this text into the xml:
from lxml import etree

textstr = ""
fh = open("mytext.txt", "rb")
for line in fh:
    textstr += line

root = etree.Element("root")
a = etree.SubElement(root, "some_element")
a.text = textstr
print etree.tostring(root)
The problem here is the output of the print looks like this:
<root><some_element>This&#13;
is&#13;
my&#13;
text</some_element></root>
For my purposes the line breaks are fine, but the &#13; entities are not.
What I have been able to figure out is that this happens because I'm opening the text file in binary mode "rb" (which I actually need to do, as my app is indexing a large text file). If I don't open the file in binary mode "r", then the output does not contain &#13; (but of course, then my indexing doesn't work).
I've also tried changing the etree.tostring to:
print etree.tostring(root, method="xml")
However there is no difference in the output.
Now, I CAN dump the XML to a string and then replace the &#13; artifacts; however, I was hoping for a more elegant solution, because the text files I parse are not under my control and I'm worried that other parts of the text files might be converted to URL-style encoding without my knowledge.
Does anyone know a way of preventing this encoding from happening?
Windows uses \r\n to represent a line ending; Unix uses \n.
This will remove the \r at the end of a line if there is one there (so the code will also work with Unix text files). It removes at most one \r, so if there is an \r somewhere else in the line it will be preserved.
import re

textstr = ""
with open("mytext.txt", "rb") as fh:
    for line in fh:
        textstr += re.sub(r'\r$', '', line)

print(repr(textstr))
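An equivalent sketch that normalizes all the line endings in one pass (assuming the whole file fits comfortably in memory), so lxml never sees a carriage return to serialize as &#13;:

from lxml import etree

# splitlines() strips \r\n and \n alike; joining with \n normalizes the text
with open("mytext.txt", "rb") as fh:
    textstr = "\n".join(fh.read().splitlines())

root = etree.Element("root")
a = etree.SubElement(root, "some_element")
a.text = textstr
print etree.tostring(root)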
