Capítulo 9. Procesamiento de XML

9.1. Inmersión

Los dos capítulos siguientes hablan sobre el procesamiento de XML en Python. Sería de ayuda que ya conociese el aspecto de un documento XML, que está compuesto por etiquetas estructuradas que forman una jerarquía de elementos. Si esto no tiene sentido para usted, hay muchos tutoriales sobre XML que pueden explicarle lo básico.

Aunque no esté particularmente interesado en XML debería leer estos capítulos, que cubren aspectos importantes como los paquetes de Python, Unicode, los argumentos de la línea de órdenes y cómo usar getattr para despachar métodos.

No se precisa tener la carrera de filosofía, aunque si tiene la poca fortuna de haber sido sometido a los escritos de Immanuel Kant, apreciará el programa de ejemplo mucho más que si estudió algo útil, como ciencias de la computación.

Hay dos maneras básicas de trabajar con XML. Una se denomina SAX (“Simple API for XML”), y funciona leyendo un poco de XML cada vez, invocando un método por cada elemento que encuentra (si leyó Capítulo 8, Procesamiento de HTML, esto debería serle familiar, porque es la manera en que trabaja el módulo sgmllib). La otra se llama DOM (“Document Object Model”), y funciona leyendo el documento XML completo para crear una representación interna utilizando clases nativas de Python enlazadas en una estructura de árbol. Python tiene módulos estándar para ambos tipos de análisis, pero en este capítulo sólo trataremos el uso de DOM.

Lo que sigue es un programa de Python completo que genera una salida pseudoaleatoria basada en una gramática libre de contexto definida en formato XML. No se preocupe si aún no ha entiende lo que esto significa; examinaremos juntos la entrada y la salida del programa en más profundidad durante los próximos dos capítulos.

Ejemplo 9.1. kgp.py

Si aún no lo ha hecho, puede descargar éste ejemplo y otros usados en este libro.

"""Kant Generator for Python

Generates mock philosophy based on a context-free grammar

Usage: python kgp.py [options] [source]

Options:
  -g ..., --grammar=...   use specified grammar file or URL
  -h, --help              show this help
  -d                      show debugging information while parsing

Examples:
  kgp.py                  generates several paragraphs of Kantian philosophy
  kgp.py -g husserl.xml   generates several paragraphs of Husserl
  kpg.py "<xref id='paragraph'/>"  generates a paragraph of Kant
  kgp.py template.xml     reads from template.xml to decide what to generate
"""
from xml.dom import minidom
import random
import toolbox
import sys
import getopt

_debug = 0

class NoSourceError(Exception): pass

class KantGenerator:
    """generates mock philosophy based on a context-free grammar"""

    def __init__(self, grammar, source=None):
        self.loadGrammar(grammar)
        self.loadSource(source and source or self.getDefaultSource())
        self.refresh()

    def _load(self, source):
        """load XML input source, return parsed XML document

        - a URL of a remote XML file ("http://diveintopython.org/kant.xml")
        - a filename of a local XML file ("~/diveintopython/common/py/kant.xml")
        - standard input ("-")
        - the actual XML document, as a string
        """
        sock = toolbox.openAnything(source)
        xmldoc = minidom.parse(sock).documentElement
        sock.close()
        return xmldoc

    def loadGrammar(self, grammar):                         
        """load context-free grammar"""                     
        self.grammar = self._load(grammar)                  
        self.refs = {}                                      
        for ref in self.grammar.getElementsByTagName("ref"):
            self.refs[ref.attributes["id"].value] = ref     

    def loadSource(self, source):
        """load source"""
        self.source = self._load(source)

    def getDefaultSource(self):
        """guess default source of the current grammar
        
        The default source will be one of the <ref>s that is not
        cross-referenced.  This sounds complicated but it's not.
        Example: The default source for kant.xml is
        "<xref id='section'/>", because 'section' is the one <ref>
        that is not <xref>'d anywhere in the grammar.
        In most grammars, the default source will produce the
        longest (and most interesting) output.
        """
        xrefs = {}
        for xref in self.grammar.getElementsByTagName("xref"):
            xrefs[xref.attributes["id"].value] = 1
        xrefs = xrefs.keys()
        standaloneXrefs = [e for e in self.refs.keys() if e not in xrefs]
        if not standaloneXrefs:
            raise NoSourceError, "can't guess source, and no source specified"
        return '<xref id="%s"/>' % random.choice(standaloneXrefs)
        
    def reset(self):
        """reset parser"""
        self.pieces = []
        self.capitalizeNextWord = 0

    def refresh(self):
        """reset output buffer, re-parse entire source file, and return output
        
        Since parsing involves a good deal of randomness, this is an
        easy way to get new output without having to reload a grammar file
        each time.
        """
        self.reset()
        self.parse(self.source)
        return self.output()

    def output(self):
        """output generated text"""
        return "".join(self.pieces)

    def randomChildElement(self, node):
        """choose a random child element of a node
        
        This is a utility method used by do_xref and do_choice.
        """
        choices = [e for e in node.childNodes
                   if e.nodeType == e.ELEMENT_NODE]
        chosen = random.choice(choices)            
        if _debug:                                 
            sys.stderr.write('%s available choices: %s\n' % \
                (len(choices), [e.toxml() for e in choices]))
            sys.stderr.write('Chosen: %s\n' % chosen.toxml())
        return chosen                              

    def parse(self, node):         
        """parse a single XML node
        
        A parsed XML document (from minidom.parse) is a tree of nodes
        of various types.  Each node is represented by an instance of the
        corresponding Python class (Element for a tag, Text for
        text data, Document for the top-level document).  The following
        statement constructs the name of a class method based on the type
        of node we're parsing ("parse_Element" for an Element node,
        "parse_Text" for a Text node, etc.) and then calls the method.
        """
        parseMethod = getattr(self, "parse_%s" % node.__class__.__name__)
        parseMethod(node)

    def parse_Document(self, node):
        """parse the document node
        
        The document node by itself isn't interesting (to us), but
        its only child, node.documentElement, is: it's the root node
        of the grammar.
        """
        self.parse(node.documentElement)

    def parse_Text(self, node):    
        """parse a text node
        
        The text of a text node is usually added to the output buffer
        verbatim.  The one exception is that <p class='sentence'> sets
        a flag to capitalize the first letter of the next word.  If
        that flag is set, we capitalize the text and reset the flag.
        """
        text = node.data
        if self.capitalizeNextWord:
            self.pieces.append(text[0].upper())
            self.pieces.append(text[1:])
            self.capitalizeNextWord = 0
        else:
            self.pieces.append(text)

    def parse_Element(self, node): 
        """parse an element
        
        An XML element corresponds to an actual tag in the source:
        <xref id='...'>, <p chance='...'>, <choice>, etc.
        Each element type is handled in its own method.  Like we did in
        parse(), we construct a method name based on the name of the
        element ("do_xref" for an <xref> tag, etc.) and
        call the method.
        """
        handlerMethod = getattr(self, "do_%s" % node.tagName)
        handlerMethod(node)

    def parse_Comment(self, node):
        """parse a comment
        
        The grammar can contain XML comments, but we ignore them
        """
        pass
    
    def do_xref(self, node):
        """handle <xref id='...'> tag
        
        An <xref id='...'> tag is a cross-reference to a <ref id='...'>
        tag.  <xref id='sentence'/> evaluates to a randomly chosen child of
        <ref id='sentence'>.
        """
        id = node.attributes["id"].value
        self.parse(self.randomChildElement(self.refs[id]))

    def do_p(self, node):
        """handle <p> tag
        
        The <p> tag is the core of the grammar.  It can contain almost
        anything: freeform text, <choice> tags, <xref> tags, even other
        <p> tags.  If a "class='sentence'" attribute is found, a flag
        is set and the next word will be capitalized.  If a "chance='X'"
        attribute is found, there is an X% chance that the tag will be
        evaluated (and therefore a (100-X)% chance that it will be
        completely ignored)
        """
        keys = node.attributes.keys()
        if "class" in keys:
            if node.attributes["class"].value == "sentence":
                self.capitalizeNextWord = 1
        if "chance" in keys:
            chance = int(node.attributes["chance"].value)
            doit = (chance > random.randrange(100))
        else:
            doit = 1
        if doit:
            for child in node.childNodes: self.parse(child)

    def do_choice(self, node):
        """handle <choice> tag
        
        A <choice> tag contains one or more <p> tags.  One <p> tag
        is chosen at random and evaluated; the rest are ignored.
        """
        self.parse(self.randomChildElement(node))

def usage():
    print __doc__

def main(argv):                         
    grammar = "kant.xml"                
    try:                                
        opts, args = getopt.getopt(argv, "hg:d", ["help", "grammar="])
    except getopt.GetoptError:          
        usage()                         
        sys.exit(2)                     
    for opt, arg in opts:               
        if opt in ("-h", "--help"):     
            usage()                     
            sys.exit()                  
        elif opt == '-d':               
            global _debug               
            _debug = 1                  
        elif opt in ("-g", "--grammar"):
            grammar = arg               
    
    source = "".join(args)              

    k = KantGenerator(grammar, source)
    print k.output()

if __name__ == "__main__":
    main(sys.argv[1:])

Ejemplo 9.2. toolbox.py

"""Miscellaneous utility functions"""

def openAnything(source):            
    """URI, filename, or string --> stream

    This function lets you define parsers that take any input source
    (URL, pathname to local or network file, or actual data as a string)
    and deal with it in a uniform manner.  Returned object is guaranteed
    to have all the basic stdio read methods (read, readline, readlines).
    Just .close() the object when you're done with it.
    
    Examples:
    >>> from xml.dom import minidom
    >>> sock = openAnything("http://localhost/kant.xml")
    >>> doc = minidom.parse(sock)
    >>> sock.close()
    >>> sock = openAnything("c:\\inetpub\\wwwroot\\kant.xml")
    >>> doc = minidom.parse(sock)
    >>> sock.close()
    >>> sock = openAnything("<ref id='conjunction'><text>and</text><text>or</text></ref>")
    >>> doc = minidom.parse(sock)
    >>> sock.close()
    """
    if hasattr(source, "read"):
        return source

    if source == '-':
        import sys
        return sys.stdin

    # try to open with urllib (if source is http, ftp, or file URL)
    import urllib                         
    try:                                  
        return urllib.urlopen(source)     
    except (IOError, OSError):            
        pass                              
    
    # try to open with native open function (if source is pathname)
    try:                                  
        return open(source)               
    except (IOError, OSError):            
        pass                              
    
    # treat source as string
    import StringIO                       
    return StringIO.StringIO(str(source)) 

La ejecución del programa kgp.py analizará la gramática basada en XML por omisión, de kant.xml, y mostrará varios párrafos de filosofía al estilo de Immanuel Kant.

Ejemplo 9.3. Ejemplo de la salida de kgp.py

[usted@localhost kgp]$ python kgp.py
     As is shown in the writings of Hume, our a priori concepts, in
reference to ends, abstract from all content of knowledge; in the study
of space, the discipline of human reason, in accordance with the
principles of philosophy, is the clue to the discovery of the
Transcendental Deduction.  The transcendental aesthetic, in all
theoretical sciences, occupies part of the sphere of human reason
concerning the existence of our ideas in general; still, the
never-ending regress in the series of empirical conditions constitutes
the whole content for the transcendental unity of apperception.  What
we have alone been able to show is that, even as this relates to the
architectonic of human reason, the Ideal may not contradict itself, but
it is still possible that it may be in contradictions with the
employment of the pure employment of our hypothetical judgements, but
natural causes (and I assert that this is the case) prove the validity
of the discipline of pure reason.  As we have already seen, time (and
it is obvious that this is true) proves the validity of time, and the
architectonic of human reason, in the full sense of these terms,
abstracts from all content of knowledge.  I assert, in the case of the
discipline of practical reason, that the Antinomies are just as
necessary as natural causes, since knowledge of the phenomena is a
posteriori.
    The discipline of human reason, as I have elsewhere shown, is by
its very nature contradictory, but our ideas exclude the possibility of
the Antinomies.  We can deduce that, on the contrary, the pure
employment of philosophy, on the contrary, is by its very nature
contradictory, but our sense perceptions are a representation of, in
the case of space, metaphysics.  The thing in itself is a
representation of philosophy.  Applied logic is the clue to the
discovery of natural causes.  However, what we have alone been able to
show is that our ideas, in other words, should only be used as a canon
for the Ideal, because of our necessary ignorance of the conditions.

[...corte...]

Por supuesto, esto no es más que texto sin sentido. Bueno, no totalmente sin sentido. Es correcto sintáctica y gramaticalmente (aunque muy prolijo -- Kant no es lo que diríamos el tipo de hombre que va al grano). Parte de él quizá sea cierta en realidad (o al menos el tipo de cosa con la que Kant hubiera estado de acuerdo), parte es manifiestamente falsa, y la mayoría simplemente incoherente. Pero todo está en el estilo de Immanuel Kant.

Permítame repetir que esto es mucho, mucho más divertido en caso de que esté cursando o haya cursado alguna vez estudios superiores de filosofía.

Lo interesante de este programa es que no hay nada en él específico a Kant. Todo el contenido del ejemplo anterior se ha derivado del fichero de gramática, kant.xml. Si le dice al programa que use un fichero de gramática diferente (que podemos especificar en la línea de ordenes), la salida será totalmente diferente.

Ejemplo 9.4. Salida de kgp.py, más simple

[usted@localhost kgp]$ python kgp.py -g binary.xml
00101001
[usted@localhost kgp]$ python kgp.py -g binary.xml
10110100

Daremos un vistazo más de cerca a la estructura del fichero de gramática más adelante. Por ahora, todo lo que ha de saber es que el fichero de gramática define la estructura de la salida, y que el programa kgp.py lee esa gramática y toma decisiones al azar sobre qué palabras poner.