CSV file not working properly - Python

I wrote a simple script to read a CSV file in Python 3.0 using pandas:
import pandas as pd
df = pd.read_csv('https://www.goodreads.com/review_porter/export/153331182/goodreads_export.csv', on_bad_lines= 'skip')
print(df)
Instead of the CSV data, I ended up with this:
<!DOCTYPE html>
0 <html>
1 <head>
2 <title>Sign Up</title>
3 <meta content='telephone=no' name='format-dete...
4 <link href='https://www.goodreads.com/user/sig...
.. ...
255 }
256 //]]>
257 </script>
258 </html>
259 <!-- This is a random-length HTML comment: xme...
[260 rows x 1 columns]
Can someone help me understand why it is not working in this particular case? I tried another .csv and it worked just fine. The site I am using is https://www.goodreads.com/ and the .csv file comes from the export section.

That's because the link requires you to be authenticated before you can access the CSV file. Since you have not passed any authentication, pandas just reads the sign-up page and displays its HTML.
If the site accepts HTTP basic authentication, you can try this:
import requests
# url, username and password are placeholders for your own values
response = requests.get(url, auth=(username, password))
Alternatively, download the CSV file in your browser while logged in and read the local copy; that should work too.
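A quick way to catch this situation early is to check whether the response is an HTML page before handing it to a CSV parser. The sketch below uses only the standard library; the helper names `looks_like_html` and `parse_csv_text` are made up for illustration, and the check is a heuristic, not a guarantee.

```python
import csv
import io


def looks_like_html(body: str, content_type: str = "") -> bool:
    """Heuristic: True when a response is an HTML page rather than CSV data."""
    if "text/html" in content_type.lower():
        return True
    start = body.lstrip().lower()
    return start.startswith("<!doctype html") or start.startswith("<html")


def parse_csv_text(body: str, content_type: str = "") -> list:
    """Parse CSV text into a list of dict rows, refusing HTML pages."""
    if looks_like_html(body, content_type):
        raise ValueError("Got an HTML page (login or sign-up?) instead of CSV")
    return list(csv.DictReader(io.StringIO(body)))
```

With pandas, the same guard can run first and the text can then be passed on as `pd.read_csv(io.StringIO(body))`.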

Related

convert html file to BytesIO then pass as Flask variable

I'm trying to convert an HTML file to BytesIO so that I don't need to write the file to the filesystem and can get it from memory instead. I've read this about converting an image to BytesIO, but I can't apply it to an HTML file.
I'm using Flask as my framework.
What I have tried:
buffer = io.BytesIO()
merged_df.to_html(buffer, encoding = 'utf-8', table_uuid = 'seasonality_table')
buffer.seek(0)
html_memory = base64.b64encode(buffer.getvalue())
return render_template('summary_01.html', html_data = html_memory.decode('utf-8'))
Then, in the HTML template where I want to output the HTML file:
<img id="picture" src="data:html;base64, {{ html_data }}">
The error message I got:
TypeError: expected str, bytes or os.PathLike object, not _io.BytesIO
First: using <img> to display HTML is the wrong approach.
The <img> tag is only for images such as PNG, JPG, etc.
You can get the HTML directly as a string by calling to_html() without a filename:
html = merged_df.to_html(table_id='seasonality_table')
and pass this string to the template:
return render_template('summary_01.html', html_data=html)
and you need the safe filter in the template to display it as HTML:
{{ html_data | safe }}
BTW:
If you want to offer the data as a file for downloading, you should use <a> instead of <img>, and it may need application/octet-stream instead of html to start the download.
html = merged_df.to_html(table_id='seasonality_table')
html = base64.b64encode(html.encode('utf-8')).decode('utf-8')
return render_template('summary_01.html', html_data=html)
<a href="data:application/octet-stream;base64,{{ html_data }}">DOWNLOAD</a>
Minimal working example
from flask import Flask, render_template_string
import pandas as pd
import base64
app = Flask(__name__)
@app.route('/')
def index():
    data = {
        'A': [1, 2, 3],
        'B': [4, 5, 6],
        'C': [7, 8, 9]
    }
    df = pd.DataFrame(data)
    # HTML table as a string, shown inline in the page
    html = df.to_html(table_id='seasonality_table')
    # the same HTML, base64-encoded for the download link
    html_base64 = base64.b64encode(html.encode()).decode()
    return render_template_string('''<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
</head>
<body>
{{ html_data | safe }}
<a href="data:application/octet-stream;base64,{{ html_base64 }}">DOWNLOAD</a>
</body>
</html>''', html_data=html, html_base64=html_base64)
if __name__ == '__main__':
    #app.debug = True
    #app.run(threaded=False, processes=3)
    #app.run(use_reloader=False)
    app.run()
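The download-link part of this answer boils down to building a base64 data URI from the HTML string. A minimal sketch of just that step, without Flask or pandas, might look like the following; the function names and the default filename `table.html` are assumptions for illustration, and the `download` attribute is a standard HTML hint for the saved filename that the original answer did not use.

```python
import base64


def html_data_uri(html: str, mime: str = "application/octet-stream") -> str:
    """Encode an HTML string as a base64 data URI."""
    encoded = base64.b64encode(html.encode("utf-8")).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def download_link(html: str, filename: str = "table.html", label: str = "DOWNLOAD") -> str:
    """Build an <a> element that offers the HTML string as a file download."""
    return f'<a href="{html_data_uri(html)}" download="{filename}">{label}</a>'
```

The string returned by `download_link` would still need the `safe` filter if it is passed into a Jinja template, for the same reason as the table HTML.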

How to load a yaml file from url to process in python?

I have a YAML file stored at a URL. How do I load it into Python for processing?
This is the code I use to read it and then simply print it out to verify. But I do not see YAML; the output looks like HTML to me.
Code:
import yaml
import urllib
from urllib import request
x = urllib.request.urlopen("https://git.myplace.net/projects/groups%2users.yaml")
User_Object = yaml.load(x)
print(User_Object)
...
The output looks like:
anch:create-branch-action":{"serverCondition":false}});}(_PageDataPlugin));</script><meta name="application-name" content="Bitbucket"><link rel="shortcut icon" type="image/x-icon" href="/s/-1051105741/5ab4b55/261/1.0/_/download/resources/com.atlassian.bitbucket.server.bitbucket-webpack-INTERNAL:favicon/favicon.ico" /><link rel="search" href="https://git.cnvrmedia.net/plugins/servlet/opensearch-descriptor" type="application/opensearchdescription+xml" title="Bitbucket code search"/></head><body class="aui-page-sidebar bitbucket-theme"><ul id="assistive-skip-links" class="assistive"><li>Skip to sidebar navigation</li><li>Skip to content</li></ul><div id="page"><!-- start
The file name is "groups+users.yaml". What is the best way to read in yaml format for python to parse/process?
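Two things stand out in the question. The output is Bitbucket's HTML page for the file rather than the file itself, so the URL probably needs to point at the raw file contents (the exact raw URL form depends on the server and is not shown here). Also, `%2u` is not a valid percent-escape; a literal `+` is encoded as `%2B`. The sketch below is one possible approach under those assumptions: `build_file_url` and `load_yaml_from_url` are hypothetical helper names, and `yaml` is the third-party PyYAML package.

```python
from urllib.parse import quote
from urllib.request import urlopen


def build_file_url(base: str, filename: str) -> str:
    """Append a percent-encoded file name to a base URL ('+' becomes '%2B')."""
    return base.rstrip("/") + "/" + quote(filename, safe="")


def load_yaml_from_url(url: str):
    """Download a document and parse it as YAML, refusing HTML pages."""
    import yaml  # PyYAML (third-party); imported here so the URL helper works without it

    with urlopen(url) as response:
        text = response.read().decode("utf-8")
    if text.lstrip().lower().startswith(("<!doctype", "<html")):
        raise ValueError("Got an HTML page; the URL may not point at the raw file")
    # safe_load avoids constructing arbitrary Python objects from the document
    return yaml.safe_load(text)
```

If the server requires a login, the request would also need credentials, just as with any other protected URL.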

How to write for loop content in html code python?

Below is the code
urls.append('http://google.com')
urls.append('http://stacoverflow.com')
whole = """<html>
<head>
<title>output -</title>
</head>
<body>Below are the list of URLS
%s // here I want to write both urls.
</body>
</html>"""
for x in urls:
    print(x)
f = open('myfile.html', 'w')
f.write(whole)
f.close()
This code saves the file in HTML format, but I can't find a way to get the contents of the for loop into the HTML file. In other words, I want to write the elements of the list, i.e. http://google.com and http://stackoverflow.com, into myfile.html.
I hope the explanation is clearer this time.
How can I do this? Any suggestions would be a big help.
Try the code below:
urls = []
urls.append('http://google.com')
urls.append('http://stackoverflow.com')
whole = """<html>
<head>
<title>output -</title>
</head>
<body>Below are the list of URLS
%s
</body>
</html>"""
# fill the %s placeholder with the comma-separated URLs
with open('myfile.html', 'w') as f:
    f.write(whole % ", ".join(urls))
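If the URLs should appear as a clickable list rather than a comma-separated string, the same idea extends naturally. This is a sketch, not part of the original answer: the `<ul>` markup, the `render_url_list` name, and the use of `html.escape` to keep characters like `&` from breaking the page are all additions for illustration.

```python
from html import escape

PAGE = """<html>
<head>
<title>output -</title>
</head>
<body>Below are the list of URLS
<ul>
{items}
</ul>
</body>
</html>"""


def render_url_list(urls):
    """Return the page with one escaped, clickable <li> entry per URL."""
    items = "\n".join(
        '<li><a href="{0}">{0}</a></li>'.format(escape(u, quote=True)) for u in urls
    )
    return PAGE.format(items=items)
```

The result can be written out the same way as before, with `f.write(render_url_list(urls))`.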

Mismatched Tag Error While Parsing XML?

I'm writing a script that downloads an HTML document from http://example.com/ and attempts to parse it as XML by using:
with urllib.request.urlopen("http://example.com/") as f:
    tree = xml.etree.ElementTree.parse(f)
However, I keep getting a ParseError: mismatched tag error, supposedly at line 1, column 2781. I downloaded the file manually (Ctrl+S in my browser) and checked it, but that position falls in the middle of a string, nowhere near the EOF. There were a few lines before the actual 2781st character, though, so that might have thrown off my calculation of the exact position. I then tried to download the response and write it to a file to parse it later:
response = urllib.request.urlopen("http://example.com/")
f = open("test.html", "wb")
f.write(response.read())
f.close()
html = open("test.html", "r")
tree = xml.etree.ElementTree.parse(html)
And I'm still getting the same mismatched tag error at the same column, but this time I opened the downloaded html and the only stuff near column 2781 is this:
;</script></head><body class
The 2781st column is exactly the first "h" in </head>, so what could be wrong here? Am I missing something?
Edit:
I've been looking into it more and tried to parse the XML using another parser, this time minidom, but I'm still getting the exact same error at the exact same position. What could be the problem here? This happens no matter how I download the file (urllib, curl, wget, even Ctrl+S in the browser); the result is always the same.
Edit 2:
This is what I've tried so far:
This is an example xml I just got from the API doc, and saved it to text.html:
<html>
<head>
<title>Example page</title>
</head>
<body>
<p>Moved to example.org
or example.com.</p>
</body>
</html>
And I tried:
with urllib.request.urlopen("text.html") as f:
    tree = xml.etree.ElementTree.parse(f)
And it works. Then:
with urllib.request.urlopen("text.html") as f:
    tree = xml.etree.ElementTree.fromstring(f.read())
And it also works, but:
with urllib.request.urlopen("http://example.com/") as f:
    xml.etree.ElementTree.parse(f)
doesn't. I also tried:
with urllib.request.urlopen("http://example.com/") as f:
    xml.etree.ElementTree.fromstring(f.read())
That doesn't work either. What could be the problem? As far as I can tell the document doesn't have mismatching tags. Is it perhaps too large? It's only 95.2 KB.
You can use bs4 to parse this page. Like this:
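import bs4
import urllib.request
url = 'http://boards.4chan.org/wsg/thread/629672/i-just-lost-my-marauder-on-eve-i-need-a-ylyl'
# route the request through an HTTP proxy; drop the handler if you connect directly
proxies = {'http': 'http://www-proxy.ericsson.se:8080'}
opener = urllib.request.build_opener(urllib.request.ProxyHandler(proxies))
with opener.open(url) as f:
    info = f.read()
soup = bs4.BeautifulSoup(info, 'html.parser')
print(soup.a)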
OUTPUT:
a
You can install bs4 with pip install beautifulsoup4.
Based on the urllib and ElementTree documentation, a snippet like this parses the response, provided the document is well-formed XML:
import urllib.request
import xml.etree.ElementTree as ET
with urllib.request.urlopen('http://boards.4chan.org/wsg/thread/629672/i-just-lost-my-marauder-on-eve-i-need-a-ylyl') as response:
    html = response.read()
# fromstring() takes the document itself; parse() expects a filename or file object
root = ET.fromstring(html)
If you don't want to read the response into a variable before parsing it with ElementTree, you can pass the response object straight to parse():
with urllib.request.urlopen('http://boards.4chan.org/wsg/thread/629672/i-just-lost-my-marauder-on-eve-i-need-a-ylyl') as response:
    tree = ET.parse(response)
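A likely cause of the mismatched tag error is that real-world HTML is usually not well-formed XML: void elements such as `<meta>` or `<br>` are never closed, so an XML parser reports a mismatch at the next closing tag (here, `</head>`). The sketch below illustrates this with a made-up document and the standard library only; `HTML_DOC`, `parses_as_xml`, and `TagCollector` are illustrative names, not anything from the question.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# A small HTML document with unclosed void elements (<meta>, <br>)
HTML_DOC = (
    '<html><head><meta charset="utf-8"><title>Example</title></head>'
    "<body><p>Hello<br>world</p></body></html>"
)


def parses_as_xml(text: str) -> bool:
    """Return True if the text is well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


class TagCollector(HTMLParser):
    """Lenient HTML parser that records every start tag it sees."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def collect_tags(text: str) -> list:
    """Feed HTML through the lenient parser and return the start tags."""
    parser = TagCollector()
    parser.feed(text)
    parser.close()
    return parser.tags
```

Here `parses_as_xml(HTML_DOC)` fails while `collect_tags(HTML_DOC)` walks the document without complaint, which is why an HTML parser (html.parser, or BeautifulSoup on top of it) is the better fit for pages like this.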

Edit and create HTML file using Python

I am currently working on an assignment for creating an HTML file using python. I understand how to read an HTML file into python and then edit and save it.
table_file = open('abhi.html', 'w')
table_file.write('<!DOCTYPE html><html><body>')
table_file.close()
The problem with the code above is that it replaces the whole HTML file with the string passed to write(). How can I edit the file and at the same time keep its content intact? I mean, writing something like this, but inside the body tags:
<link rel="icon" type="image/png" href="img/tor.png">
I need the link to automatically go in between the opening and closing body tags.
You probably want to read up on BeautifulSoup:
import bs4
# load the file
with open("existing_file.html") as inf:
    txt = inf.read()
# name the parser explicitly so the result does not depend on what is installed
soup = bs4.BeautifulSoup(txt, "html.parser")
# create new link
new_link = soup.new_tag("link", rel="icon", type="image/png", href="img/tor.png")
# insert it into the document
soup.head.append(new_link)
# save the file again
with open("existing_file.html", "w") as outf:
    outf.write(str(soup))
Given a file like
<html>
<head>
<title>Test</title>
</head>
<body>
<p>What's up, Doc?</p>
</body>
</html>
this produces
<html>
<head>
<title>Test</title>
<link href="img/tor.png" rel="icon" type="image/png"/></head>
<body>
<p>What's up, Doc?</p>
</body>
</html>
(note: it has munched the whitespace, but gotten the html structure correct).
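For a very simple file, a dependency-free alternative is to splice the new tag in just before the closing tag with a regular expression. This is only a sketch under the assumption that the closing tag appears exactly once and not inside a comment or script; the function name `insert_before_closing_tag` is made up, and a real parser such as BeautifulSoup is more robust for anything beyond trivial documents.

```python
import re


def insert_before_closing_tag(html: str, tag: str, snippet: str) -> str:
    """Insert snippet on its own line just before the first closing </tag>."""
    pattern = re.compile(r"</\s*" + re.escape(tag) + r"\s*>", re.IGNORECASE)
    match = pattern.search(html)
    if match is None:
        raise ValueError(f"no closing </{tag}> tag found")
    return html[:match.start()] + snippet + "\n" + html[match.start():]
```

Reading the file, calling `insert_before_closing_tag(txt, "head", ...)`, and writing the result back keeps the rest of the content, including its whitespace, untouched.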
