I am trying to parse wikicfp.v1.2008.xml, wikicfp.v1.2009.xml, and wikicfp.v1.2010.xml, the three versions available at the link below:
https://github.com/creswick/wikicfp-parser/tree/master/data
I tried with xml.etree.ElementTree and with BeautifulSoup, but I got a lot of encoding errors like:
not well-formed (invalid token): line 949, column 40
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 63563: character maps to <undefined>
I couldn't make progress because of these errors. My goal is to parse each row and save it to a SQL script file or a CSV file for later use.
from bs4 import BeautifulSoup

out_file = open("final.sql","w")
out_file.write("--DROP TABLE event1;\n")
out_file.write("CREATE TABLE event1 (eventid int, fullname TEXT, location TEXT, begindate TINYTEXT , finishdate TINYTEXT , weblink TEXT, info TEXT, PRIMARY KEY (eventid));\n")
out_file.close()

infile = open("wikicfp.v1.2009.xml",encoding='utf-8-sig')
contents = infile.read()
soup = BeautifulSoup(contents)
rows = soup.find_all('row')

c = 0
for count in rows:
    tempsoup = rows[c]
    try:
        ei = tempsoup.findAll("field", {"name":"eventid"})
        if not ei[0].contents[0].strip():
            ei = "No info"
        eventid = ei[0].contents[0].strip()
    except Exception:
        eventid = 0
    try:
        fn = tempsoup.findAll("field", {"name":"fullname"})
        s = fn[0].contents[0].strip()
        fullname = s.decode('utf-8')
        fullname = fullname.replace("'","_")
    except Exception:
        fullname = "No info"
    try:
        l = tempsoup.findAll("field", {"name":"location"})
        s = l[0].contents[0].strip()
        location = s.decode('utf-8')
        location = location.replace("'","_")
    except Exception:
        location = "No info"
    try:
        bd = tempsoup.findAll("field", {"name":"begindate"})
        s = bd[0].contents[0].strip()
        begindate = s.decode('utf-8')
    except Exception:
        begindate = "No info"
    try:
        fd = tempsoup.findAll("field", {"name":"finishdate"})
        s = fd[0].contents[0].strip()
        finishdate = s.decode('utf-8')
    except Exception:
        finishdate = "No info"
    try:
        wl = tempsoup.findAll("field", {"name":"weblink"})
        s = wl[0].contents[0].strip()
        weblink = s.decode('utf-8')
    except Exception:
        weblink = "No info"
    try:
        i = tempsoup.findAll("field", {"name":"info"})
        s = i[0].contents[0].strip()
        info = s.decode('utf-8')
        info = info.replace("'","_")
    except Exception:
        info = "No info"
    with open("final.sql","a") as out_file:
        out_file.write("INSERT INTO event VALUES (")
        out_file.write(eventid)
        out_file.write(", '")
        out_file.write(fullname)
        out_file.write("', '")
        out_file.write(location)
        out_file.write("','")
        out_file.write(begindate)
        out_file.write("','")
        out_file.write(finishdate)
        out_file.write("','")
        out_file.write(weblink)
        out_file.write("','")
        out_file.write(info)
        out_file.write("');\n")
    c = c + 1
out_file.close()
infile.close()
Another attempt:
from bs4 import BeautifulSoup
with open("wikicfp.v1.2009.xml") as fp:
    soup = BeautifulSoup(fp, 'xml')
    rows = soup.find_all('row')
And another attempt:
import xml.etree.ElementTree as ET
tree = ET.parse('wikicfp.v1.2009.xml')
root = tree.getroot()
It looks like your XML files contain invalid characters. I tried several different text editors (Notepad++, Brackets, Notepad, ...) and all of them hit positions they could not decode properly (e.g. at the end of line 56964 in the 2008 XML), so the XML parser fails right there. You can use lxml and its parser's recover option to skip those characters:
import lxml.etree as ET

tree = ET.parse('wikicfp.v1.2008.xml',
                ET.XMLParser(encoding='ISO-8859-1', ns_clean=True, recover=True))
root = tree.getroot()
rows = root.findall('row')
for row in rows:
    fields = row.findall('field')
    for field in fields:
        print(field)
You can install lxml by simply running pip install lxml in your shell.
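Since the end goal in the question is a SQL or CSV file, here is a minimal sketch of how the recovered tree could be written straight to CSV. It builds on the snippet above; the field names are taken from the question's code, the output file name is arbitrary, and it assumes each row element sits directly under the root as in the snippet:

import csv
import lxml.etree as ET

parser = ET.XMLParser(encoding='ISO-8859-1', ns_clean=True, recover=True)
tree = ET.parse('wikicfp.v1.2008.xml', parser)
root = tree.getroot()

columns = ['eventid', 'fullname', 'location', 'begindate',
           'finishdate', 'weblink', 'info']

with open('wikicfp2008.csv', 'w', newline='', encoding='utf-8') as out:
    writer = csv.writer(out)
    writer.writerow(columns)
    for row in root.findall('row'):
        # Map field name -> text for this row; missing fields become "No info".
        values = {f.get('name'): (f.text or '').strip() for f in row.findall('field')}
        writer.writerow([values.get(col, 'No info') for col in columns])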
Related
I have created this code to substitute some strings in an XML file with other text. I used BeautifulSoup for this exercise and, as instructed in the documentation, I used soup.prettify at the end in order to save the changed XML. However, the prettified XML is not working for me: I get errors when trying to import it back into the CMS.
Is there any other way to save the updated XML without changing the XML structure and without rewriting the whole code? See my code for reference below. Thanks for the advice!
import openpyxl
import sys
#searching for Part Numbers and descriptions in xml
from bs4 import BeautifulSoup
infile = open('name of my file.xml', "r", encoding="utf8")
contents = infile.read()
infile.close()
soup = BeautifulSoup(contents,'xml')
all_Products = soup.find_all('Product')
#gathering all Part Numbers from xml
for i in all_Products:
    PN = i.find('Name')
    PN_Descr = i.find_all(AttributeID="PartNumberDescription")
    PN_Details = i.find_all(AttributeID="PartNumberDetails")
    for y in PN_Descr:
        PN_Descr_text = y.find("TranslatableText")
        try:
            string = PN_Descr_text.string
            PN_Descr_text.find(text=string).replace_with("New string")
        except AttributeError:
            print("Attribute error in: PN Description for: ", PN)
            continue
    for z in PN_Details:
        PN_Details_text = z.find("TranslatableText")
        try:
            string = PN_Details_text.string
            PN_Details_text.find(text=string).replace_with("New string")
        except AttributeError:
            print("Attribute error in: PN Details for: ", PN)
            continue
xml = soup.prettify("utf-8")
with open('name of my file.xml', "wb") as file:
    file.write(xml)
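One option worth trying (my suggestion, not something confirmed in this thread): skip prettify() entirely and serialize the soup as-is, which writes the markup without re-indenting it and so keeps the original structure:

# str(soup) renders the document without pretty-printing.
xml = str(soup)
with open('name of my file.xml', "w", encoding="utf8") as file:
    file.write(xml)

# Or, to keep writing bytes as the original code did:
# with open('name of my file.xml', "wb") as file:
#     file.write(soup.encode("utf-8"))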
I am trying to extract values from JSON-LD to CSV as they are in the file. There are a couple of issues I am facing.
1. The values being read for different fields are getting truncated in most cases. In the remaining cases, the value of one field appears under another field.
2. I am also getting an 'Additional data' error after some 4,000 lines.
The file is quite big (half a GB). I am attaching a shortened version of my code. Please tell me where I am going wrong.
The input file: I have shortened it and kept it here, since there was no way of putting the full file here.
https://github.com/Architsi/json-ld-issue
I tried writing this script and I tried multiple online converters too.
import csv, sys, math, operator, re, os, json, ijson
from pprint import pprint

filelist = []
for file in os.listdir("."):
    if file.endswith(".json"):
        filelist.append(file)

for input in filelist:
    newCsv = []
    splitlist = input.split(".")
    output = splitlist[0] + '.csv'
    newFile = open(output, 'w', newline='')  # wb for windows, else you'll see newlines added to csv
    # initialize csv writer
    writer = csv.writer(newFile)
    # Name of the columns
    header_row = ('Format', 'Description', 'Object', 'DataProvider')
    writer.writerow(header_row)
    with open(input, encoding="utf8") as json_file:
        data = ijson.items(json_file, 'item')
        # passing all the values through try except
        for s in data:
            source = s['_source']
            try:
                source_resource = source['sourceResource']
            except:
                print("Warning: No source resource in record ID: " + id)
            try:
                data_provider = source['dataProvider'].encode()
            except:
                data_provider = "N/A"
            try:
                _object = source['object'].encode()
            except:
                _object = "N/A"
            try:
                descriptions = source_resource['description']
                string = ""
                for item in descriptions:
                    if len(descriptions) > 1:
                        description = item.encode()  # + " | "
                    else:
                        description = item.encode()
                    string = string + description
                description = string.encode()
            except:
                description = "N/A"
            created = ""
            # writing it to csv
            write_tuple = ('format', description, _object, data_provider)
            writer.writerow(write_tuple)
    print("File written to " + output)
    newFile.close()
The error that I am getting is this: raise common.JSONError('Additional data')
The expected result is a CSV file with all the columns and correct values.
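For what it's worth, ijson usually raises 'Additional data' when the file contains more than one top-level JSON value (several documents concatenated back to back) rather than a single array; depending on your ijson version there may be a multiple_values option for this, so check its documentation. A standard-library-only sketch for checking such a file, decoding one top-level value at a time with json.JSONDecoder.raw_decode, is below; the file name and printed fields are placeholders, and it is not a drop-in replacement for the streaming code above:

import json

def iter_json_documents(path):
    """Yield each top-level JSON value from a file that may contain several."""
    decoder = json.JSONDecoder()
    with open(path, encoding="utf8") as f:
        text = f.read()  # fine for the shortened sample; too big for the full half-GB file
    pos = 0
    while pos < len(text):
        # skip whitespace/newlines between documents
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        obj, pos = decoder.raw_decode(text, pos)
        yield obj

for doc in iter_json_documents("sample.json"):  # hypothetical file name
    print(doc.get("_source", {}).get("dataProvider", "N/A"))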
I am working on a project for which I need to download a few thousand citations from PubMed. I am currently using Biopython and have written this code:
from Bio import Entrez
from Bio import Medline
from pandas import *
from sys import argv
import os
Entrez.email = "my_email"
df = read_csv("my_file_path")
i=0
for index, row in df.iterrows():
    print(row.id)
    handle = Entrez.efetch(db="pubmed", rettype="medline", retmode="text", id=row.id)
    records = Medline.parse(handle)
    for record in records:
        try:
            abstract = str(record["AB"])
        except:
            abstract = "none"
        try:
            title = str(record["TI"])
        except:
            title = "none"
        try:
            mesh = str(record["MH"])
        except:
            mesh = "none"
        path = 'my_file_path'
        filename = str(row.id) + '.txt'
        filename = os.path.join(path, filename)
        file = open(filename, "w")
        output = "title: " + str(title) + "\n\n" + "abstract: " + str(abstract) + "\n\n" + "mesh: " + str(mesh) + "\n\n"
        file.write(output)
        file.close()
    print(i)
    i = i + 1
However, I receive the following error when this code is run:
Traceback (most recent call last):
File "my_file_path", line 13, in <module>
handle = Entrez.efetch(db="pubmed",rettype="medline",retmode="text", id=row.id)
File "/.../anaconda/lib/python3.5/site-packages/biopython-1.68-py3.5-macosx-10.6-x86_64.egg/Bio/Entrez/__init__.py", line 176, in efetch
if ids.count(",") >= 200:
AttributeError: 'numpy.int64' object has no attribute 'count'
Here are the first few rows of the CSV file:
id
10029645
10073846
10078088
10080457
10088066
...
Your error is at:
handle = Entrez.efetch(db="pubmed",rettype="medline",retmode="text", id=row.id)
From the documentation:
id
UID list. Either a single UID or a comma-delimited list of UIDs
From the examples I see, id is a string, not a numpy.int64 from a pandas DataFrame. You should convert that row.id to a string.
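For example, the efetch call from the question would then become (just that one line, shown in isolation):

# Cast the pandas value to a plain Python string before passing it to Entrez.
handle = Entrez.efetch(db="pubmed", rettype="medline", retmode="text", id=str(row.id))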
I am trying to scrape data from the PGA website to get a list of all the golf courses in the USA. I want to scrape the data and write it to a CSV file. My problem is that after running my script I get this error. Can anyone help fix this error and show how I can go about extracting the data?
Here is the error message:
File "/Users/AGB/Final_PGA2.py", line 44, in <module>
    writer.writerow(row)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in position 35: ordinal not in range(128)
Script below:
import csv
import requests
from bs4 import BeautifulSoup
courses_list = []
for i in range(906):  # Number of pages plus one
    url = "http://www.pga.com/golf-courses/search?page={}&searchbox=Course+Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0".format(i)
    r = requests.get(url)
    soup = BeautifulSoup(r.content)
    g_data2 = soup.find_all("div", {"class": "views-field-nothing"})
    for item in g_data2:
        try:
            name = item.contents[1].find_all("div", {"class": "views-field-title"})[0].text
            print name
        except:
            name = ''
        try:
            address1 = item.contents[1].find_all("div", {"class": "views-field-address"})[0].text
        except:
            address1 = ''
        try:
            address2 = item.contents[1].find_all("div", {"class": "views-field-city-state-zip"})[0].text
        except:
            address2 = ''
        try:
            website = item.contents[1].find_all("div", {"class": "views-field-website"})[0].text
        except:
            website = ''
        try:
            Phonenumber = item.contents[1].find_all("div", {"class": "views-field-work-phone"})[0].text
        except:
            Phonenumber = ''
        course = [name, address1, address2, website, Phonenumber]
        courses_list.append(course)

with open('PGA_Final.csv', 'a') as file:
    writer = csv.writer(file)
    for row in courses_list:
        writer.writerow(row)
You should not get the error on Python 3. Here's a code example that fixes some unrelated issues in your code. It parses the specified fields on a given web page and saves them as CSV:
#!/usr/bin/env python3
import csv
from urllib.request import urlopen

import bs4  # $ pip install beautifulsoup4

page = 905
url = ("http://www.pga.com/golf-courses/search?page=" + str(page) +
       "&searchbox=Course+Name&searchbox_zip=ZIP&distance=50&price_range=0"
       "&course_type=both&has_events=0")
with urlopen(url) as response:
    field_content = bs4.SoupStrainer('div', 'views-field-nothing')
    soup = bs4.BeautifulSoup(response, parse_only=field_content)

fields = [bs4.SoupStrainer('div', 'views-field-' + suffix)
          for suffix in ['title', 'address', 'city-state-zip', 'website', 'work-phone']]

def get_text(tag, default=''):
    return tag.get_text().strip() if tag is not None else default

with open('pga.csv', 'w', newline='') as output_file:
    writer = csv.writer(output_file)
    for div in soup.find_all(field_content):
        writer.writerow([get_text(div.find(field)) for field in fields])
with open ('PGA_Final.csv','a') as file:
    writer = csv.writer(file)
    for row in courses_list:
        writer.writerow(row)
Change that to:
with open ('PGA_Final.csv','a') as file:
    writer = csv.writer(file)
    for row in courses_list:
        # encode each field; row is a list, so it cannot be encoded as a whole
        writer.writerow([field.encode('utf-8') for field in row])
Or:
import codecs
....
with codecs.open('PGA_Final.csv', 'a', encoding='utf-8') as file:
    writer = csv.writer(file)
    for row in courses_list:
        writer.writerow(row)
I am in the process of stripping sensitive data from a couple million XMLs. How can I add a try/except to get around this error, which seems to have occurred because of a couple of malformed XMLs in the bunch?
xml.parsers.expat.ExpatError: mismatched tag: line 1, column 28691
#!/usr/bin/python
import sys
from xml.dom import minidom
def getCleanString(word):
    str = ""
    dummy = 0
    for character in word:
        try:
            character = character.encode('utf-8')
            str = str + character
        except:
            dummy += 1
    return str

def parsedelete(content):
    dom = minidom.parseString(content)
    for element in dom.getElementsByTagName('RI_RI51_ChPtIncAcctNumber'):
        parentNode = element.parentNode
        parentNode.removeChild(element)
    return dom.toxml()

for line in sys.stdin:
    if line > 1:
        line = line.strip()
        line = line.split(',', 2)
        if len(line) > 2:
            partition = line[0]
            id = line[1]
            xml = line[2]
            xml = getCleanString(xml)
            xml = parsedelete(xml)
            strng = '%s\t%s\t%s' % (partition, id, xml)
            sys.stdout.write(strng + '\n')
Catching the exception is straightforward. Add import xml.parsers.expat to your import statements and wrap the problem code in a try/except handler.
def parsedelete(content):
    try:
        dom = minidom.parseString(content)
    except xml.parsers.expat.ExpatError as e:
        # not sure how you want to handle the error... so just passing it back as a string
        return str(e)
    for element in dom.getElementsByTagName('RI_RI51_ChPtIncAcctNumber'):
        parentNode = element.parentNode
        parentNode.removeChild(element)
    return dom.toxml()
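As a variation on the above (my own sketch, not part of the original answer), the function could return None on failure so that the main loop skips unparseable records instead of writing the error text into the output:

import sys
import xml.parsers.expat
from xml.dom import minidom

def parsedelete(content):
    """Return the cleaned XML, or None if the document is malformed."""
    try:
        dom = minidom.parseString(content)
    except xml.parsers.expat.ExpatError:
        return None
    for element in dom.getElementsByTagName('RI_RI51_ChPtIncAcctNumber'):
        element.parentNode.removeChild(element)
    return dom.toxml()

for line in sys.stdin:
    parts = line.strip().split(',', 2)
    if len(parts) > 2:
        partition, record_id, xml_text = parts
        cleaned = parsedelete(xml_text)
        if cleaned is None:
            continue  # skip records whose XML could not be parsed
        sys.stdout.write('%s\t%s\t%s\n' % (partition, record_id, cleaned))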