I have a task: I'm using Jupyter and need to combine (merge) multiple HTML files into one HTML file.
Any ideas how?
I did this with Excel files, but the same approach didn't work for HTML files:
import os
import pandas as pd

data_folder = 'C:\\Users\\hhhh\\Desktop\\test'
df = []
for file in os.listdir(data_folder):
    if file.endswith('.xlsx'):
        print('Loading file {0}...'.format(file))
        df.append(pd.read_excel(os.path.join(data_folder, file), sheet_name='sheet1'))
Sounds like a task for Beautiful Soup.
You would get anything inside the <body> tag of each HTML document, I assume, and then combine them.
Maybe something like:
import os
from bs4 import BeautifulSoup

output_doc = BeautifulSoup("", "html.parser")
output_doc.append(output_doc.new_tag("html"))
output_doc.html.append(output_doc.new_tag("body"))

# data_folder as defined in your question
for file in os.listdir(data_folder):
    if not file.lower().endswith('.html'):
        continue
    with open(os.path.join(data_folder, file), 'r') as html_file:
        # Copy the contents of each file's <body> into the combined document
        output_doc.body.extend(BeautifulSoup(html_file.read(), "html.parser").body)

print(output_doc.prettify())
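If you then want to save the merged document to disk, something along these lines should work (the output filename here is just an example):

# Write the merged document out as a single HTML file
with open(os.path.join(data_folder, "combined.html"), "w", encoding="utf-8") as out:
    out.write(str(output_doc))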
My Python script fetches data from the website 'http://api.sl.se/api2/deviations.json?key=c7606e4606f642a380f7fdd75d683448' into a text file.
My aim is to filter out the fields 'Header', 'Details', 'FromDateTime', 'UpToDateTime' and 'Updated'.
I have tried BeautifulSoup with a text-specific search, but I'm not getting there. The code below shows my attempt. Any help would be much appreciated :) Sorry if I missed something obvious.
'''
import requests
from bs4 import BeautifulSoup
import csv
import pandas as pd
import operator
from numpy import *

# Collect and parse first page
page = requests.get('http://api.sl.se/api2/deviations.json?key=c7606e4606f642a380f7fdd75d683448')
soup = BeautifulSoup(page.text, 'html.parser')
#print(soup)

for script in soup(["Header", "Details", "Updated", "UpToDateTime", "FromDateTime"]):
    script.extract()

# get text
text = soup.get_text()
# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split(" "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)

f1 = open("data.txt", "r")
resultFile = open("out.csv", "wb")
wr = csv.writer(resultFile, quotechar=',')
'''
I expect a CSV with the columns "Header", "Details", "Updated", "UpToDateTime" and "FromDateTime".
You are going about this the wrong way: you don't need BeautifulSoup for this task. Your API returns data as JSON, and BeautifulSoup is best suited for HTML. For your purpose you can use the pandas and json libraries.
Pandas can read directly from a web resource as well, but since you only want part of the JSON response, you need both libraries here.
Here is a snippet you can use:
import pandas as pd
import requests
import json

page = requests.get('http://api.sl.se/api2/deviations.json?key=c7606e4606f642a380f7fdd75d683448')
data = json.loads(page.text)

# The deviation records live under the "ResponseData" key of the JSON payload
df = pd.DataFrame(data["ResponseData"])
df.to_csv("file path")
Change the file path and you will get the whole dataset in a CSV.
If you want to drop columns or do any other manipulation, you can do that with the pandas DataFrame as well; it is a very powerful library and well worth reading up on.
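For example, to keep only the columns the question asks for, a minimal sketch (it assumes those key names appear in the records under "ResponseData"):

import pandas as pd
import requests

page = requests.get('http://api.sl.se/api2/deviations.json?key=c7606e4606f642a380f7fdd75d683448')
df = pd.DataFrame(page.json()["ResponseData"])

# Keep just the requested fields before writing the CSV; the key names are assumed to match the API response
wanted = ["Header", "Details", "Updated", "UpToDateTime", "FromDateTime"]
df[wanted].to_csv("deviations.csv", index=False)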
I found this solution for reading the contents of a Word file from a URL:
from urllib.request import urlopen
from bs4 import BeautifulSoup
from io import BytesIO
from zipfile import ZipFile

file = urlopen(url).read()
file = BytesIO(file)
document = ZipFile(file)
content = document.read('word/document.xml')
word_obj = BeautifulSoup(content.decode('utf-8'))
text_document = word_obj.findAll('w:t')
for t in text_document:
    print(t.text)
Does anyone know a similar way to process .pptx files? I have seen several solutions, but they read the file from disk directly, not from a URL.
I don't know if this helps, but with urllib you already obtain the content of the .pptx (the file variable); you can wrap it in cStringIO.StringIO(file) (or io.BytesIO in Python 3) to simulate a file object, and pass that to any function that normally reads a .pptx from a file path.
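For what it's worth, the same ZipFile/BytesIO approach from the Word example can be adapted to .pptx, since a .pptx is also a zip archive with the slide XML under ppt/slides/. A minimal sketch (url is assumed to point at a .pptx file):

from urllib.request import urlopen
from io import BytesIO
from zipfile import ZipFile
from bs4 import BeautifulSoup

data = urlopen(url).read()
pptx = ZipFile(BytesIO(data))

# Each slide lives in ppt/slides/slideN.xml; text runs are in <a:t> elements
for name in pptx.namelist():
    if name.startswith('ppt/slides/slide') and name.endswith('.xml'):
        slide = BeautifulSoup(pptx.read(name).decode('utf-8'), 'html.parser')
        for t in slide.find_all('a:t'):
            print(t.text)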
I am using the following Python / BeautifulSoup code to remove HTML elements from a text file:
from bs4 import BeautifulSoup

with open("textFileWithHtml.txt") as markup:
    soup = BeautifulSoup(markup.read())

with open("strip_textFileWithHtml.txt", "w") as f:
    f.write(soup.get_text().encode('utf-8'))
My question is: how can I apply this code to every text file in a folder (directory), producing for each one a new, processed text file with the HTML elements removed, without having to invoke the script manually for each file?
The glob module lets you list all the files in a directory that match a pattern:
import glob
from bs4 import BeautifulSoup

for path in glob.glob('*.txt'):
    with open(path) as markup:
        soup = BeautifulSoup(markup.read())
    with open("strip_" + path, "w") as f:
        f.write(soup.get_text().encode('utf-8'))
If you also want to do that for every subfolder recursively, check out os.walk.
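For the recursive case, a minimal sketch with os.walk (assuming Python 3, where the text can be written directly without the .encode call):

import os
from bs4 import BeautifulSoup

for dirpath, dirs, files in os.walk('.'):
    for name in files:
        if not name.endswith('.txt'):
            continue
        path = os.path.join(dirpath, name)
        with open(path, encoding='utf-8') as markup:
            soup = BeautifulSoup(markup.read(), 'html.parser')
        # Write the stripped text next to the original file, with a "strip_" prefix
        with open(os.path.join(dirpath, 'strip_' + name), 'w', encoding='utf-8') as f:
            f.write(soup.get_text())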
I would leave that work to the OS (the shell): simply replace the hardcoded input file with filenames taken from the sys.argv array, and invoke the script in a loop or with a shell wildcard that matches many files, like:
from bs4 import BeautifulSoup
import sys

for fi in sys.argv[1:]:
    with open(fi) as markup:
        soup = BeautifulSoup(markup.read())
    with open("strip_" + fi, "w") as f:
        f.write(soup.get_text().encode('utf-8'))
And run it like:
python script.py *.txt
I would like to automate the download of CSV files from the World Bank's dataset.
My problem is that the URL corresponding to a specific dataset does not lead directly to the desired CSV file but is instead a query to the World Bank's API. As an example, this is the URL to get the GDP per capita data: http://api.worldbank.org/v2/en/indicator/ny.gdp.pcap.cd?downloadformat=csv.
If you paste this URL in your browser, it will automatically start the download of the corresponding file. As a consequence, the code I usually use to collect and save CSV files in Python is not working in the present situation:
import csv
import urllib2

baseUrl = "http://api.worldbank.org/v2/en/indicator/ny.gdp.pcap.cd?downloadformat=csv"
remoteCSV = urllib2.urlopen(baseUrl)
myData = csv.reader(remoteCSV)
How should I modify my code in order to download the file coming from the query to the API?
This will download the zip, open it, and give you a csv reader for whichever file inside it you want.
import urllib2
import StringIO
from zipfile import ZipFile
import csv

baseUrl = "http://api.worldbank.org/v2/en/indicator/ny.gdp.pcap.cd?downloadformat=csv"
remoteCSV = urllib2.urlopen(baseUrl)

# We create a StringIO object so that we can work on the results of the request (a string) as though it is a file.
sio = StringIO.StringIO()
sio.write(remoteCSV.read())

# We now create a ZipFile object pointed to by 'z' and we can do a few things here:
z = ZipFile(sio, 'r')

# A list with the names of all the files in the zip you just downloaded
print z.namelist()

# We can use z.namelist()[1] to refer to 'ny.gdp.pcap.cd_Indicator_en_csv_v2.csv'
with z.open(z.namelist()[1]) as f:
    # Opens the 2nd file in the zip
    csvr = csv.reader(f)
    for row in csvr:
        print row
For more information, see the ZipFile docs and StringIO docs.
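If you are on Python 3, where urllib2 and StringIO no longer exist, the same idea works with urllib.request and io.BytesIO; a minimal sketch (not part of the original answer):

import csv
import io
import urllib.request
from zipfile import ZipFile

baseUrl = "http://api.worldbank.org/v2/en/indicator/ny.gdp.pcap.cd?downloadformat=csv"
data = urllib.request.urlopen(baseUrl).read()

# BytesIO plays the role StringIO played above: an in-memory file over the downloaded bytes
z = ZipFile(io.BytesIO(data))
print(z.namelist())

# Index 1 mirrors the answer above; pick whichever file in namelist() you actually want
with z.open(z.namelist()[1]) as f:
    # csv.reader needs text, so wrap the binary stream in a decoder
    for row in csv.reader(io.TextIOWrapper(f, encoding='utf-8')):
        print(row)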
import os
import urllib
import zipfile
from StringIO import StringIO

package = StringIO(urllib.urlopen("http://api.worldbank.org/v2/en/indicator/ny.gdp.pcap.cd?downloadformat=csv").read())
zip = zipfile.ZipFile(package, 'r')

pwd = os.path.abspath(os.curdir)
for filename in zip.namelist():
    csv = os.path.join(pwd, filename)
    with open(csv, 'w') as fp:
        fp.write(zip.read(filename))
    print filename, 'downloaded successfully'
From here you can use your approach to handle CSV files.
We have a script that automates access and data extraction for World Bank World Development Indicators such as: https://data.worldbank.org/indicator/GC.DOD.TOTL.GD.ZS
The script does the following:
Downloading the metadata and data
Extracting the metadata and data
Converting them to a Data Package
The script is Python based and uses Python 3. It has no dependencies outside of the standard library. Try it:
python scripts/get.py
python scripts/get.py https://data.worldbank.org/indicator/GC.DOD.TOTL.GD.ZS
You can also read our analysis of World Bank data:
https://datahub.io/awesome/world-bank
More of a suggestion than a solution: you can use pd.read_csv to read any CSV file directly from a URL.
import pandas as pd
data = pd.read_csv('http://url_to_the_csv_file')
I want to run a Python script that parses HTML files and collects a list of all the links with a target="_blank" attribute.
I've tried the following, but it isn't getting anything out of bs4. The docs say SoupStrainer takes arguments the same way as findAll etc., so should this work? Am I missing some silly error?
import os
import sys

from bs4 import BeautifulSoup, SoupStrainer
from unipath import Path


def main():
    ROOT = Path(os.path.realpath(__file__)).ancestor(3)
    src = ROOT.child("src")
    templatedir = src.child("templates")
    for (dirpath, dirs, files) in os.walk(templatedir):
        for path in (Path(dirpath, f) for f in files):
            if path.endswith(".html"):
                for link in BeautifulSoup(path, parse_only=SoupStrainer(target="_blank")):
                    print link


if __name__ == "__main__":
    sys.exit(main())
I think you need something like this:
if path.endswith(".html"):
    htmlfile = open(path)
    for link in BeautifulSoup(htmlfile, parse_only=SoupStrainer(target="_blank")):
        print link
Your usage of BeautifulSoup is OK, but you should pass in the HTML string, not just the path of the HTML file. BeautifulSoup takes the HTML string as its argument, not a file path; it will not open the file and read its contents for you. You have to do that yourself. If you pass in a.html, the soup will be <html><body><p>a.html</p></body></html>, which is the file name, not the contents of the file, so of course there are no links. You should use BeautifulSoup(open(path).read(), ...).
edit:
It also accepts a file object, so BeautifulSoup(open(path), ...) is enough.
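Putting that together with the SoupStrainer from the question, a minimal sketch of the inner loop (Python 2 print syntax to match the question; path is the HTML file path from the question's loop, and the matched elements are assumed to carry an href attribute):

from bs4 import BeautifulSoup, SoupStrainer

# Parse only elements carrying target="_blank", and pass a file object rather than the path
only_blank = SoupStrainer(target="_blank")
with open(path) as htmlfile:
    soup = BeautifulSoup(htmlfile, "html.parser", parse_only=only_blank)
    for link in soup.find_all(True):
        print link.get("href")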