Iteratively process large wikipedia dump

I want to parse a large Wikipedia dump iteratively. I found a tutorial for this here: https://towardsdatascience.com/wikipedia-data-science-working-with-the-worlds-largest-encyclopedia-c08efbac5f5c
However, when I try to read in the data like this:
data_path = 'C:\\Users\\Me\\datasets\\dewiki-latest-pages-articles1.xml-p1p262468.bz2'

import subprocess
import xml.sax

class WikiXmlHandler(xml.sax.handler.ContentHandler):
    """Content handler for Wiki XML data using SAX"""
    def __init__(self):
        xml.sax.handler.ContentHandler.__init__(self)
        self._buffer = None
        self._values = {}
        self._current_tag = None
        self._pages = []

    def characters(self, content):
        """Characters between opening and closing tags"""
        if self._current_tag:
            self._buffer.append(content)

    def startElement(self, name, attrs):
        """Opening tag of element"""
        if name in ('title', 'text'):
            self._current_tag = name
            self._buffer = []

    def endElement(self, name):
        """Closing tag of element"""
        if name == self._current_tag:
            self._values[name] = ' '.join(self._buffer)
        if name == 'page':
            self._pages.append((self._values['title'], self._values['text']))

# Object for handling xml
handler = WikiXmlHandler()
# Parsing object
parser = xml.sax.make_parser()
parser.setContentHandler(handler)
# Iteratively process file
for line in subprocess.Popen(['bzcat'],
                             stdin=open(data_path),
                             stdout=subprocess.PIPE, shell=True).stdout:
    parser.feed(line)
    # Stop when 3 articles have been found
    if len(handler._pages) > 3:
        break
it seems like nothing happens: the handler._pages list, where the parsed articles should be stored, stays empty. I added shell=True because otherwise I get the error message FileNotFoundError: [WinError 2].
I have never worked with subprocesses in Python, so I don't know what the problem might be.
I also tried specifying data_path differently (with / and //).
Thank you in advance.
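
One possible explanation, not confirmed anywhere in the question: bzcat is a Unix tool that is usually not installed on Windows, and open(data_path) opens the compressed file in text mode. A minimal sketch that avoids the subprocess entirely by decompressing with the standard-library bz2 module and feeding the chunks to the SAX parser is shown below. The names PageHandler and first_pages and the chunk size are illustrative assumptions, not part of the tutorial.

```python
import bz2
import xml.sax


class PageHandler(xml.sax.ContentHandler):
    """Collects (title, text) pairs, mirroring the handler in the question."""

    def __init__(self):
        super().__init__()
        self._buffer = []
        self._values = {}
        self._current_tag = None
        self.pages = []

    def startElement(self, name, attrs):
        if name in ('title', 'text'):
            self._current_tag = name
            self._buffer = []

    def characters(self, content):
        if self._current_tag:
            self._buffer.append(content)

    def endElement(self, name):
        if name == self._current_tag:
            # characters() may deliver one text node in several pieces,
            # so the pieces are joined without a separator here.
            self._values[name] = ''.join(self._buffer)
            self._current_tag = None
        if name == 'page':
            self.pages.append((self._values['title'], self._values['text']))


def first_pages(path, limit=3, chunk_size=64 * 1024):
    """Feed decompressed chunks to a SAX parser until `limit` pages are found."""
    handler = PageHandler()
    parser = xml.sax.make_parser()
    parser.setContentHandler(handler)
    with bz2.open(path, 'rb') as stream:
        # Read fixed-size binary chunks; the parser handles chunk boundaries.
        for chunk in iter(lambda: stream.read(chunk_size), b''):
            parser.feed(chunk)
            if len(handler.pages) >= limit:
                break
    return handler.pages[:limit]
```

Calling first_pages(data_path) would then return up to three (title, text) tuples without reading the whole dump into memory.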

Related

ValueError: stat: path too long for Windows on Jupyter Notebook parsing a URL request

I am trying to parse my company's OData data to compute the proportion of late suppliers from the CompanyName and LateDays fields.
I fetched the report and decoded it into a string, following a really helpful post on how to request URLs with authentication, so I now have a string containing the whole report. The report is written in XML, and I am using Python 3.7 in Jupyter Notebook to handle it.
I found another post that queries an XML file similar to mine using a handler class, but my output is the error ValueError: stat: path too long for Windows.
How can I fix this?
Thanks!
import requests
import pandas as pd
import numpy as np
import base64
import urllib.request

request = urllib.request.Request('https://myUrl_OData')
base64string = base64.b64encode(bytes('%s:%s' % ('Myusername', 'Mypassword'), 'ascii'))
request.add_header("Authorization", "Basic %s" % base64string.decode('utf-8'))
result = urllib.request.urlopen(request)
resulttext = result.read()
text = resulttext.decode(encoding='utf-8', errors='ignore')

from xml.sax import parse
from xml.sax.handler import ContentHandler

class properties(ContentHandler):
    def __init__(self):
        self.elements = []  # stack of elements
        self.char_data = u''  # string buffer
        self.current_vendor = u''
        self.current_latedays = u''

    def startElement(self, name, attrs):
        if companyname == u'CompanyName':
            self.elements.append(u'CompanyName')
        if latedays == u'LateDays':
            self.elements.append(u'LateDays')

    def characters(self, chars):
        if len(self.elements) > 0 and self.elements[-1] in [u'CompanyName', u'LateDays']:
            self.char_data += chars

    def endElement(self, name):
        self.elements.pop() if len(self.elements) > 0 else None
        if companyname == u'CompanyName':
            self.current_vendor = self.char_data
            self.char_data = ''
        if latedays == u'LateDays':
            self.current_latedays = self.char_data
            self.char_data = ''
        if companyname == 'CompanyName':
            if self.current_latedays == u'LateDays':
                print('Found:', self.current_customer)
            # clear the buffers now that is finished
            self.current_year = u''
            self.current_customer = u''
            self.char_data = u''

parse(r"\\\\?\\" + text, properties())

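Your error doesn't seem to be related to XML parsing but to a limitation of your OS.
On Windows, the path of a file cannot be longer than about 260 characters by default. xml.sax.parse expects a filename or file object as its first argument, so the whole report text passed in your last line is treated as one very long path.
Try to reduce the length of your filename, or reduce the number of nested folders leading to your data. If the XML is already in memory as a string, xml.sax.parseString can parse it directly.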
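
A minimal sketch of parsing XML that is already in memory, using xml.sax.parseString instead of parse. The element names CompanyName and LateDays come from the question; the handler name LateSupplierHandler and the sample document are hypothetical, and a real OData feed may use namespace-prefixed element names, in which case the names below would need adjusting.

```python
import xml.sax


class LateSupplierHandler(xml.sax.ContentHandler):
    """Collects (CompanyName, LateDays) pairs from the parsed document."""

    FIELDS = ('CompanyName', 'LateDays')

    def __init__(self):
        super().__init__()
        self._current = None   # name of the field being read, if any
        self._chars = []       # text chunks of the current field
        self._record = {}
        self.rows = []

    def startElement(self, name, attrs):
        if name in self.FIELDS:
            self._current = name
            self._chars = []

    def characters(self, content):
        if self._current is not None:
            self._chars.append(content)

    def endElement(self, name):
        if name == self._current:
            self._record[name] = ''.join(self._chars)
            self._current = None
        # Emit a row once both fields of an entry have been seen.
        if len(self._record) == len(self.FIELDS):
            self.rows.append((self._record['CompanyName'],
                              int(self._record['LateDays'])))
            self._record = {}


# Hypothetical sample standing in for the decoded response text.
text = """<feed>
  <entry><CompanyName>Acme</CompanyName><LateDays>3</LateDays></entry>
  <entry><CompanyName>Globex</CompanyName><LateDays>0</LateDays></entry>
</feed>"""

handler = LateSupplierHandler()
# parseString accepts bytes, so the decoded text is encoded again.
xml.sax.parseString(text.encode('utf-8'), handler)

# Share of suppliers with at least one late day.
late_share = sum(1 for _, days in handler.rows if days > 0) / len(handler.rows)
print(handler.rows, late_share)
```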

writing to files: switch to new file after X MB file capacity

I have millions of domains to which I will send WHOIS queries, recording each WHOIS response in a .txt file.
I would like to set a maximum size for a single .txt output file. For example, say I start recording responses in out0.txt. I want to switch to out1.txt once out0.txt is >= 100 MB. The same goes for out1.txt: if out1.txt >= 100 MB, start writing to out2.txt, and so on.
I know that I can do an if check after each insertion, but I want my code to be fast; I thought an if check for each domain could slow it down. (It will asynchronously query millions of domains.)
I imagined a try-except block could solve my issue here, like this:
folder_name = "out%s.txt"
folder_number = 0
folder_name = folder_name % folder_number
f = open(folder_name, 'w+')
for domain in millions_of_domains:
    try:
        response_json = send_whois_query(domain)
        f.write(response_json)
    except FileGreaterThan100MbException:
        folder_number += 1
        folder_name = folder_name % folder_number
        f = open(folder_name, 'w+')
        f.write(response_json)
Any suggestions will be appreciated. Thank you for your time.

You can create a wrapper object that tracks how much data has been written and opens a new file once a limit is reached:
class MaxSizeFileWriter(object):
    def __init__(self, filenamepattern, maxdata=2**20,  # default 1 MiB
                 start=0, mode='w', *args, **kwargs):
        self._pattern = filenamepattern
        self._counter = start
        self._mode = mode
        self._args, self._kwargs = args, kwargs
        self._max = maxdata
        self._openfile = None
        self._written = 0

    def _open(self):
        # Open the next numbered file, but only if none is open right now.
        if self._openfile is None:
            filename = self._pattern.format(self._counter)
            self._counter += 1
            self._openfile = open(filename, self._mode, *self._args, **self._kwargs)
            self._written = 0

    def _close(self):
        if self._openfile is not None:
            self._openfile.close()
            self._openfile = None

    def __enter__(self):
        return self

    def __exit__(self, *args, **kwargs):
        self._close()

    def write(self, data):
        if self._written + len(data) > self._max:
            # current file too full to fit data too, close it
            # This will trigger a new file to be opened.
            self._close()
        self._open()  # noop if already open
        self._openfile.write(data)
        self._written += len(data)
The above is a context manager, and can be written to just like a regular file. Pass in a filename with a {} placeholder for the number to be inserted into:
folder_name = "out{}.txt"
with MaxSizeFileWriter(folder_name, maxdata=100 * 2**20) as f:  # 100 MiB per file
    for domain in millions_of_domains:
        response_json = send_whois_query(domain)
        f.write(response_json)
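
On the speed concern: one integer comparison per write is very likely negligible next to a network query, so tracking the size in a counter should be cheap. One detail worth noting is that for a file opened in text mode, len(data) counts characters, not bytes. The sketch below is a variant that counts encoded bytes instead; the class name ByteLimitedWriter and the tiny 10-byte limit are illustrative assumptions, not part of the answer above.

```python
import os
import tempfile


class ByteLimitedWriter:
    """Writes text to numbered files, rolling over before max_bytes is exceeded."""

    def __init__(self, pattern, max_bytes, encoding='utf-8'):
        self._pattern = pattern      # e.g. 'out{}.txt'
        self._max = max_bytes
        self._encoding = encoding
        self._counter = 0
        self._file = None
        self._written = 0

    def write(self, text):
        data = text.encode(self._encoding)
        # Roll over only when a file is already open; a single oversized
        # record still ends up in a file of its own.
        if self._file is not None and self._written + len(data) > self._max:
            self.close()
        if self._file is None:
            self._file = open(self._pattern.format(self._counter), 'wb')
            self._counter += 1
            self._written = 0
        self._file.write(data)
        self._written += len(data)

    def close(self):
        if self._file is not None:
            self._file.close()
            self._file = None

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        self.close()


# Demo with a tiny 10-byte limit in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    with ByteLimitedWriter(os.path.join(tmp, 'out{}.txt'), max_bytes=10) as f:
        for record in ['aaaa\n', 'bbbb\n', 'cccc\n']:
            f.write(record)
    sizes = {name: os.path.getsize(os.path.join(tmp, name))
             for name in sorted(os.listdir(tmp))}
print(sizes)
```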

Manage exceptions handling

I have a class named ExcelFile; its job is to manage Excel files (reading, extracting data, and various other things for the stack).
I want to implement a system for managing errors/exceptions.
For example, ExcelFile has a method load(), which acts like a "setup":
def load(self):
    """
    Setup for excel file
    Load workbook, worksheet and other characteristics (data lines, header...)
    :return: Setup successfully or not
    :rtype: bool
    Current usage
    :Example:
    > excefile = ExcelFile('test.xls')
    > excefile.load()
    True
    > excefile.nb_rows()
    4
    """
    self.workbook = xlrd.open_workbook(self.url)
    self.sheet = self.workbook.sheet_by_index(0)
    self.header_row_index = self.get_header_row_index()
    if self.header_row_index is None:  # If file doesn't have header (or not valid)
        return False
    self.header_fields = self.sheet.row_values(self.header_row_index)
    self.header_fields_col_ids = self.get_col_ids(self.header_fields)  # Mapping between header fields and col ids
    self.nb_rows = self.count_rows()
    self.row_start_data = self.header_row_index + self.HEADER_ROWS
    return True
As you can see, I can encounter 2 different errors:
The file is not an Excel file (raises xlrd.XLRDError)
The file has an invalid header (so I return False)
I want to implement good error management for ExcelFile, because this class is used a lot in the stack.
This is my first idea for handling that:
Implement a standard exception:
class ExcelFileException(Exception):
    def __init__(self, message, type=None):
        self.message = message
        self.type = type

    def __str__(self):
        return "{} : {} ({})".format(self.__class__.__name__, self.message, self.type)
Rewrite the load method:
def load(self):
    """
    Setup for excel file
    Load workbook, worksheet and other characteristics (data lines, header...)
    :return: Setup successfully or not
    :rtype: bool
    Current usage
    :Example:
    > excefile = ExcelFile('test.xls')
    > excefile.load()
    True
    > excefile.nb_rows()
    4
    """
    try:
        self.workbook = xlrd.open_workbook(self.url)
    except xlrd.XLRDError as e:
        raise ExcelFileException("Unsupported file type", e.__class__.__name__)
    self.sheet = self.workbook.sheet_by_index(0)
    self.header_row_index = self.get_header_row_index()
    if self.header_row_index is None:  # If file doesn't have header (or not valid)
        raise ExcelFileException("Invalid or empty header")
    self.header_fields = self.sheet.row_values(self.header_row_index)
    self.header_fields_col_ids = self.get_col_ids(self.header_fields)  # Mapping between header fields and col ids
    self.nb_rows = self.count_rows()
    self.row_start_data = self.header_row_index + self.HEADER_ROWS
    return True
And this is an example in a calling method. A big problem is that I have to manage a dict named "report" containing error descriptions in French, for customer success and others.
...
def foo(self):
    ...
    file = ExcelFile(location)
    try:
        file.load()
    except ExcelFileException as e:
        log.warn(str(e))
        if e.type == 'XLRDError':
            self.report['errors'] = 'Long description of the error, in french (error is about invalid file type)'
        else:
            self.report['errors'] = 'Long description of the error, in french (error is about invalid header)'
    ...
What do you think about that? Do you have a better way?
Thank you

You could change your exception so that it records the error in your dict:
class ExcelFileException(Exception):
    def __init__(self, message, report, type=None):
        super().__init__(message)
        # Collect messages in a list so that several errors can be reported.
        report.setdefault('errors', []).append(message)
        self.message = message
        self.type = type

    def __str__(self):
        return "{} : {} ({})".format(self.__class__.__name__, self.message, self.type)
When you raise the exception:
raise ExcelFileException("Invalid or empty header", report)
The errors will be present in report['errors'], that is, in the dict you passed in (for example self.report).
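
Another option, sketched here under assumptions, is to use one exception subclass per failure mode and look the customer-facing text up in a mapping, so the caller does not have to compare type strings. The names InvalidFileTypeError, InvalidHeaderError, USER_MESSAGES and user_message are hypothetical, the message texts are placeholders, and load() is only a stand-in for ExcelFile.load().

```python
class ExcelFileException(Exception):
    """Base class for all ExcelFile errors."""


class InvalidFileTypeError(ExcelFileException):
    """The file could not be opened as an Excel workbook."""


class InvalidHeaderError(ExcelFileException):
    """The header row is missing or invalid."""


# Placeholder texts; the real long French descriptions would go here.
USER_MESSAGES = {
    InvalidFileTypeError: 'Long description (invalid file type)',
    InvalidHeaderError: 'Long description (invalid header)',
}


def user_message(error):
    """Return the customer-facing text for an ExcelFileException."""
    return USER_MESSAGES.get(type(error), 'Long description (unknown error)')


def load(header_row_index):
    """Stand-in for ExcelFile.load(): fails when no header row was found."""
    if header_row_index is None:
        raise InvalidHeaderError('Invalid or empty header')
    return True


report = {}
try:
    load(None)
except ExcelFileException as e:
    # Catching the base class handles every ExcelFile error in one place.
    report['errors'] = user_message(e)
print(report)
```

Callers that only care about one failure mode can still catch the specific subclass instead of the base class.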

The error can also be fixed by installing the missing optional dependency xlrd:
pip install xlrd
More Python packages are available for working with Excel.

lxml file opening methods

I've generally been in the habit of writing:
with open('path/my_document.xml', 'rb') as xml_document:
    return etree.parse(xml_document)
I only recently realized I can just write
return etree.parse('path/my_document.xml')
Assuming I don't need to use the xml_document variable for anything else, are there real world scenarios in which there is a difference between the two?
Source code of etree.parse:
def parse(self, source, _BaseParser parser=None, *, base_url=None):
    u"""parse(self, source, parser=None, base_url=None)
    Updates self with the content of source and returns its root
    """
    cdef _Document doc = None
    try:
        doc = _parseDocument(source, parser, base_url)
        self._context_node = doc.getroot()
        if self._context_node is None:
            self._doc = doc
    except _TargetParserResult as result_container:
        # raises a TypeError if we don't get an _Element
        self._context_node = result_container.result
    return self._context_node

How to use the Typed unmarshaller in suds?

I have existing code that processes the output from suds.client.Client(...).service.GetFoo(). Now that part of the flow has changed and we are no longer using SOAP, instead receiving the same XML through other channels. I would like to re-use the existing code by using the suds Typed unmarshaller, but so far have not been successful.
I got 90% of the way there using the Basic unmarshaller:
tree = suds.umx.basic.Basic().process(xmlroot)
This gives me the nice tree of objects with attributes, so that the pre-existing code can access tree[some_index].someAttribute, but the value will of course always be a string, rather than an integer or date or whatever, so the code can still not be re-used as-is.
The original class:
class SomeService(object):
    def __init__(self):
        self.soap_client = Client(some_wsdl_url)

    def GetStuff(self):
        return self.soap_client.service.GetStuff()
The drop-in replacement that almost works:
class SomeSourceUntyped(object):
    def __init__(self):
        self.url = some_url

    def GetStuff(self):
        xmlfile = urllib2.urlopen(self.url)
        xmlroot = suds.sax.parser.Parser().parse(xmlfile)
        if xmlroot:
            # because the parser creates a document root above the document root
            tree = suds.umx.basic.Basic().process(xmlroot)[0]
        else:
            tree = None
        return tree
My vain effort to understand suds.umx.typed.Typed():
class SomeSourceTyped(object):
    def __init__(self):
        self.url = some_url
        self.schema_file_name = os.path.realpath(
            os.path.join(os.path.dirname(__file__), 'schema.xsd'))
        with open(self.schema_file_name) as f:
            self.schema_node = suds.sax.parser.Parser().parse(f)
        self.schema = suds.xsd.schema.Schema(self.schema_node, "", suds.options.Options())
        self.schema_query = suds.xsd.query.ElementQuery(('http://example.com/namespace/', 'Stuff'))
        self.xmltype = self.schema_query.execute(self.schema)

    def GetStuff(self):
        xmlfile = urllib2.urlopen(self.url)
        xmlroot = suds.sax.parser.Parser().parse(xmlfile)
        if xmlroot:
            unmarshaller = suds.umx.typed.Typed(self.schema)
            # I'm still running into an exception, so obviously something is missing:
            # " Exception: (document, None, ), must be qref "
            # Do I need to call the Parser differently?
            tree = unmarshaller.process(xmlroot, self.xmltype)[0]
        else:
            tree = None
        return tree
This is an obscure one.
Bonus caveat: Of course I am in a legacy system that uses suds 0.3.9.
EDIT: further evolution on the code, found how to create SchemaObjects.
