I have created this code to substitute some strings in an XML file with other text. I used BeautifulSoup for this exercise and, as instructed in the documentation, I used soup.prettify at the end to save the changed XML. However, the prettified XML is not working for me: I get errors when trying to import it back into the CMS.
Is there any other way to save the updated XML without changing the XML structure and without rewriting the whole code? See my code for reference below. Thanks for any advice!
import openpyxl
import sys
# searching for Part Numbers and descriptions in xml
from bs4 import BeautifulSoup

infile = open('name of my file.xml', "r", encoding="utf8")
contents = infile.read()
infile.close()

soup = BeautifulSoup(contents, 'xml')
all_Products = soup.find_all('Product')

# gathering all Part Numbers from xml
for i in all_Products:
    PN = i.find('Name')
    PN_Descr = i.find_all(AttributeID="PartNumberDescription")
    PN_Details = i.find_all(AttributeID="PartNumberDetails")
    for y in PN_Descr:
        PN_Descr_text = y.find("TranslatableText")
        try:
            string = PN_Descr_text.string
            PN_Descr_text.find(text=string).replace_with("New string")
        except AttributeError:
            print("Attribute error in: PN Description for: ", PN)
            continue
    for z in PN_Details:
        PN_Details_text = z.find("TranslatableText")
        try:
            string = PN_Details_text.string
            PN_Details_text.find(text=string).replace_with("New string")
        except AttributeError:
            print("Attribute error in: PN Details for: ", PN)
            continue

xml = soup.prettify("utf-8")
with open('name of my file.xml', "wb") as file:
    file.write(xml)
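A minimal sketch of one alternative, assuming the CMS is choking on the whitespace that prettify inserts: serialize the soup without pretty-printing, via str() (or encode() for bytes), which writes the markup back without re-indenting it.
# Sketch: save the modified soup without prettify's re-indentation.
with open('name of my file.xml', "w", encoding="utf8") as file:
    file.write(str(soup))

# Bytes-based equivalent:
# with open('name of my file.xml', "wb") as file:
#     file.write(soup.encode("utf-8"))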
My objective is to convert an xls file to an xlsx file. The xls file I am trying to convert is actually an HTML file containing tables (this xls file is obtained as the result of a query from Jira). To do the conversion, I open the file, pass the file handle to BeautifulSoup, extract the table of interest, convert the extracted table to a string, and hand it to a pandas dataframe for further processing.
This works fine, but when the file size is large, say around 80 MB, it takes a long time to process. How do I overcome this?
import bs4, os
import pandas as pd

print('Begin')
fileName = 'TestSample.xls'
fileHandler = open(fileName, encoding='utf-8')
soup = bs4.BeautifulSoup(fileHandler, 'html.parser')
tbl = soup.find_all('table', id='issuetable')
df = pd.read_html(str(tbl))
df[0].to_excel("result.xlsx", index=False)
print('Completed')
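One idea outside BeautifulSoup (my suggestion, not from the question): pandas.read_html can locate the table itself via its attrs parameter, which skips the soup-to-string round trip entirely; whether it is faster on an 80 MB file would need measuring. A sketch, reusing the file name and table id from the question:
import pandas as pd

# Sketch: let pandas find the table by its id attribute directly.
with open('TestSample.xls', encoding='utf-8') as fh:
    dfs = pd.read_html(fh, attrs={'id': 'issuetable'})
dfs[0].to_excel('result.xlsx', index=False)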
There is no single good way for large files, but you can try a few different approaches.
from simplified_scrapy import SimplifiedDoc

print('Begin')
fileName = 'TestSample.xls'
html = open(fileName, encoding='utf-8').read()
doc = SimplifiedDoc(html)
start = 0  # If a string can uniquely mark the starting position of the data, performance will be better
tbl = doc.getElement('table', attr='id', value='issuetable', start=start)
print(tbl.outerHtml)
Or read the file in blocks:
f = open(fileName, encoding='utf-8')
html = ''
start = ''  # String marking the start of the block
end = ''  # String marking the end of the block
for line in f:  # iterate lazily instead of f.readlines() to avoid loading the whole file
    if html:  # already inside the block, keep accumulating
        html += line
        if line.find(end) >= 0:
            break
    elif line.find(start) >= 0:  # the block starts on this line
        html = line
        if line.find(end) >= 0:
            break

doc = SimplifiedDoc(html)
tbl = doc.getElement('table', attr='id', value='issuetable')
print(tbl.outerHtml)
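Another option, staying within BeautifulSoup (my addition, not part of the original answer), is a SoupStrainer, which makes the parser build a tree only for the matching tags and can cut both time and memory on large documents. A sketch, reusing the file name and table id from the question:
import bs4
import pandas as pd

# Sketch: parse only the <table id="issuetable"> element instead of the whole file.
only_issuetable = bs4.SoupStrainer('table', id='issuetable')
with open('TestSample.xls', encoding='utf-8') as fh:
    soup = bs4.BeautifulSoup(fh, 'html.parser', parse_only=only_issuetable)
df = pd.read_html(str(soup))[0]
df.to_excel('result.xlsx', index=False)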
I am scraping a website for the course number and the course name. If a course number does not have a name, or vice versa, that row should be skipped in the final output, and I do not know how to do that.
from bs4 import BeautifulSoup
from urllib import urlopen
import csv

source = urlopen('https://www.rit.edu/study/computing-security-bs')

csv_file1 = open('scrape.csv', 'w')
csv_writer = csv.writer(csv_file1)
csv_writer.writerow(['Course Number', 'Course Name'])

soup = BeautifulSoup(source, 'lxml')
table = soup.find('div', class_='processed-table')
#print(table)
curriculum = table.find('curriculum')
#print(curriculum.prettify())
next = curriculum.find('table', class_='table-curriculum')
#print(next.prettify())

for course_num in next.find_all('tr', class_='hidden-row rows-1'):
    num = course_num.find_all('td')[0]
    real = num.get_text()
    # print(real)
    realstr = real.encode('utf-8')
    name = course_num.find('div', class_='course-name')
    realname = name.get_text()
    # print(realname)
    realnamestr = realname.encode('utf-8')
    csv_writer.writerow([realstr, realnamestr])

csv_file1.close()
This is my csv
I want to get rid of the last 4 rows.
As @zvone suggested, a continue will do the job here. I am writing this answer because you mentioned you are not aware of the keyword.
Just before csv_writer.writerow([realstr, realnamestr]), put an if to check realstr and continue:
if realstr.strip() == "":
    continue
I think you should still go through the continue, break and else keywords and how they can be helpful in controlling your loops.
Another approach would be to put data into csv_writer only when realstr has some value. So:
if realstr.strip() != "":
    csv_writer.writerow([realstr, realnamestr])
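A hedged sketch of how the check could sit inside the loop, extended (my reading of the question) to skip a row when either field is blank:
for course_num in next.find_all('tr', class_='hidden-row rows-1'):
    realstr = course_num.find_all('td')[0].get_text().encode('utf-8')
    name = course_num.find('div', class_='course-name')
    realnamestr = name.get_text().encode('utf-8') if name else ''
    # Skip rows where either the course number or the course name is blank
    if realstr.strip() == '' or realnamestr.strip() == '':
        continue
    csv_writer.writerow([realstr, realnamestr])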
I have a bunch of web text that I'd like to scrape and export to a csv file. The problem is that the text is split over multiple lines on the website, and that's how BeautifulSoup reads it. When I export to csv, all the text goes into one cell, but the cell has multiple lines of text. When I try to read the csv into another program, it interprets the multiple lines in a way that yields a nonsensical dataset. The question is: how do I put all the text into a single line after I pull it with BeautifulSoup, but before I export to csv?
Here's a simple working example demonstrating the problem of multiple lines (in fact, the first few lines in the resulting csv are blank, so at first glance it may look empty):
import csv
import requests
from bs4 import BeautifulSoup

def main():
    r = requests.get("https://www.econometricsociety.org/publications/econometrica/2017/03/01/search-yield")
    soup = BeautifulSoup(r.text, "html.parser")
    with open('Temp.csv', 'w', encoding='utf8', newline='') as f:
        writer = csv.writer(f, delimiter=",")
        abstract = soup.find("article").text
        writer.writerow([abstract])

if __name__ == '__main__':
    main()
UPDATE: there have been some good suggestions, but it's still not working. The following code still produces a csv file with line breaks in a cell:
import csv
import requests
from bs4 import BeautifulSoup

with open('Temp.csv', 'w', encoding='utf8', newline='') as f:
    writer = csv.writer(f, delimiter=',')
    r = requests.get("https://www.econometricsociety.org/publications/econometrica/2017/03/01/search-yield")
    soup = BeautifulSoup(r.text, 'lxml')
    find_article = soup.find('article')
    find_2para = find_article.p.find_next_sibling("p")
    find_largetxt = find_article.p.find_next_sibling("p").nextSibling
    writer.writerow([find_2para, find_largetxt])
Here's another attempt based on a different suggestion. This one also ends up producing a line break in the csv file:
import csv
import requests
from bs4 import BeautifulSoup

def main():
    r = requests.get("https://www.econometricsociety.org/publications/econometrica/2017/03/01/search-yield")
    soup = BeautifulSoup(r.text, "html.parser")
    with open('Temp.csv', 'w', encoding='utf8', newline='') as f:
        writer = csv.writer(f, delimiter=",")
        abstract = soup.find("article").get_text(separator=" ", strip=True)
        writer.writerow([abstract])

if __name__ == '__main__':
    main()
Change your abstract = ... line into:
abstract = soup.find("article").get_text(separator=" ", strip=True)
It joins the document's text fragments using the separator parameter (in this case a single space) and, with strip=True, strips the whitespace from each fragment.
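A small illustration of the behaviour (my example, not from the answer):
from bs4 import BeautifulSoup

# Two paragraphs yield two text fragments; the separator joins them with a space.
snippet = BeautifulSoup("<p>one</p>\n<p>two</p>", "html.parser")
print(snippet.get_text(separator=" ", strip=True))  # prints: one two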
The solution that ended up working for me is pretty simple:
abstract = soup.find("article").text.replace("\t", "").replace("\r", "").replace("\n", "")
That gets rid of all line breaks.
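A related idiom (my addition, not from the thread) that also collapses tabs and runs of spaces into single spaces:
# str.split() with no arguments splits on any whitespace run, so joining
# with a single space normalizes line breaks, tabs, and repeated spaces.
abstract = " ".join(soup.find("article").text.split())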
r = requests.get("https://www.econometricsociety.org/publications/econometrica/2017/03/01/search-yield")
soup = BeautifulSoup(r.text, 'lxml')  # I prefer using the lxml parser
find_article = soup.find('article')
# Next line: how to find the title, in this case: Econometrica: Mar 2017, Volume 85, Issue 2
find_title = find_article.h3
# find the article title, "Search for Yield"
find_yeild = find_article.h1
# first paragraph example: DOI: 10.3982/ECTA14057 p. 351-378
find_1para = find_article.p
# second p example: David Martinez-Miera, Rafael Repullo
find_2para = find_article.p.find_next_sibling("p")
# find the large text area, e.g. 'We present a model of the relationship bet...'
find_largetxt = find_article.p.find_next_sibling("p").nextSibling
I used a variety of methods to get to the text areas you want, just for the purpose of education (you can use .text on each of these to get the text without tags, or you can use Zroq's method).
You can then write each of these into the file by doing, for example (note that writerow expects a sequence, so wrap the string in a list):
writer.writerow([find_title.text])
I'm trying to extract the data on the crime rate across states from this webpage: http://www.disastercenter.com/crime/uscrime.htm
I am able to get this into a text file, but I would like to get the response in JSON format. How can I do this in Python?
Here is my code:
import urllib
import re
from bs4 import BeautifulSoup

link = "http://www.disastercenter.com/crime/uscrime.htm"
f = urllib.urlopen(link)
myfile = f.read()
soup = BeautifulSoup(myfile)
soup1 = soup.find('table', width="100%")
soup3 = str(soup1)
result = re.sub("<.*?>", "", soup3)
print(result)
output = open("output.txt", "w")
output.write(result)
output.close()
The following code will get the data from the two tables and output all of it as a json formatted string.
Working Example (Python 2.7.9):
from lxml import html
import requests
import re as regular_expression
import json
page = requests.get("http://www.disastercenter.com/crime/uscrime.htm")
tree = html.fromstring(page.text)
tables = [tree.xpath('//table/tbody/tr[2]/td/center/center/font/table/tbody'),
          tree.xpath('//table/tbody/tr[5]/td/center/center/font/table/tbody')]

tabs = []
for table in tables:
    tab = []
    for row in table:
        for col in row:
            var = col.text_content()
            var = var.strip().replace(" ", "")
            var = var.split('\n')
            if regular_expression.match('^\d{4}$', var[0].strip()):
                tab_row = {}
                tab_row["Year"] = var[0].strip()
                tab_row["Population"] = var[1].strip()
                tab_row["Total"] = var[2].strip()
                tab_row["Violent"] = var[3].strip()
                tab_row["Property"] = var[4].strip()
                tab_row["Murder"] = var[5].strip()
                tab_row["Forcible_Rape"] = var[6].strip()
                tab_row["Robbery"] = var[7].strip()
                tab_row["Aggravated_Assault"] = var[8].strip()
                tab_row["Burglary"] = var[9].strip()
                tab_row["Larceny_Theft"] = var[10].strip()
                tab_row["Vehicle_Theft"] = var[11].strip()
                tab.append(tab_row)
    tabs.append(tab)

json_data = json.dumps(tabs)

output = open("output.txt", "w")
output.write(json_data)
output.close()
This might be what you want, if you can use the requests and lxml modules. The data structure presented here is very simple; adjust it to your needs.
First, get a response from your requested URL and parse the result into an HTML tree:
import requests
from lxml import etree
import json
response = requests.get("http://www.disastercenter.com/crime/uscrime.htm")
tree = etree.HTML(response.text)
Assuming you want to extract both tables, create this XPath and unpack the results. totals is "Number of Crimes" and rates is "Rate of Crime per 100,000 People":
xpath = './/table[@width="100%"][@style="background-color: rgb(255, 255, 255);"]//tbody'
totals, rates = tree.findall(xpath)
Extract the raw data (td.find('./') means first child item, whatever tag it has) and clean the strings (r'' raw strings are needed for Python 2.x):
raw_data = []
for tbody in totals, rates:
    rows = []
    for tr in tbody.getchildren():
        row = []
        for td in tr.getchildren():
            child = td.find('./')
            if child is not None and child.tag != 'br':
                row.append(child.text.strip(r'\xa0').strip(r'\n').strip())
            else:
                row.append('')
        rows.append(row)
    raw_data.append(rows)
Zip together the table headers in the first two rows, then delete the redundant rows, using slice steps of 12 and 11 to drop every 12th and then every 11th item:
data = {}
data['tags'] = [tag0 + tag1 for tag0, tag1 in zip(raw_data[0][0], raw_data[0][1])]
for raw in raw_data:
    del raw[::12]
    del raw[::11]
Store the rest of the raw data and create a JSON file (optional: eliminate whitespace with separators=(',', ':')):
data['totals'], data['rates'] = raw_data[0], raw_data[1]
with open('data.json', 'w') as f:
    json.dump(data, f, separators=(',', ':'))
I have some code that parses an XML file and saves it as a csv. I can do this two ways: one by manually downloading the XML file and then parsing it, the other by taking the XML feed directly using ET.fromstring and then parsing. When I go the direct route I get data errors; it appears to be an integrity issue. I am trying to include the XML download in the code, but I am not quite sure of the best way to approach this.
import xml.etree.ElementTree as ET
import csv
import urllib
url = 'http://www.capitalbikeshare.com/data/stations/bikeStations.xml'
connection = urllib.urlopen(url)
data = connection.read()
#I need code here!!!
tree = ET.parse('bikeStations.xml')
root = tree.getroot()
#for child in root:
#    print child.tag, child.attrib

locations = []
for station in root.findall('station'):
    name = station.find('name').text
    bikes = station.find('nbBikes').text
    docks = station.find('nbEmptyDocks').text
    time = station.find('latestUpdateTime').text
    sublist = [name, bikes, docks, time]
    locations.append(sublist)
    #print 'Station:', name, 'has', bikes, 'bikes and' ,docks, 'docks'
#print locations
s = open('statuslog.csv', 'wb')
w = csv.writer(s)
w.writerows(locations)
s.close()
f = open('filelog.csv', 'ab')
w = csv.writer(f)
w.writerows(locations)
f.close()
What you need is:
root = ET.fromstring(data)
and then omit the line tree = ET.parse('bikeStations.xml'). Note that fromstring returns the root element directly, so the tree.getroot() call is no longer needed either.
Since connection.read() returns a string, you can parse the XML directly with the fromstring method; you can read more in the xml.etree.ElementTree documentation.
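A minimal sketch of the fix in context (matching the question's Python 2 style; the parsing loop below it stays unchanged):
import xml.etree.ElementTree as ET
import urllib

url = 'http://www.capitalbikeshare.com/data/stations/bikeStations.xml'
connection = urllib.urlopen(url)
data = connection.read()

# Parse the downloaded string directly; fromstring returns the root element,
# so no ET.parse()/tree.getroot() pair is needed.
root = ET.fromstring(data)
for station in root.findall('station'):
    print station.find('name').text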