The map function distributes the processing of XML dump files across multiple cores. This is particularly useful for multi-file XML dumps, but it can also be used to process XML dumps for many wikis at the same time.
mwxml.map(process, paths, threads=None)
Implements a distributed strategy for processing XML files. This
function constructs a set of multiprocessing workers (spread over
multiple cores) and uses an internal queue to aggregate outputs. To use
this function, implement a process() function that takes two arguments:
a mwxml.Dump and the path the dump was loaded
from. Anything that this function yields will be yielded in turn
from the mwxml.map() function.
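The fan-out and aggregation pattern described above can be sketched with the standard library alone. This is an illustration, not mwxml's implementation: it uses threads from `threading` instead of `multiprocessing` workers so that it runs anywhere without pickling concerns, its `process` callable takes a single item instead of a dump and a path, and the names `sketch_map` and `letters` are hypothetical.

```python
import queue
import threading


def sketch_map(process, items, threads=2):
    """Run process(item) for each item on worker threads and yield the outputs."""
    tasks = queue.Queue()
    outputs = queue.Queue()
    done = object()  # sentinel: a worker puts this when it has finished

    # Load all work up front so an empty task queue means "no more work".
    for item in items:
        tasks.put(item)

    def worker():
        try:
            while True:
                try:
                    item = tasks.get_nowait()
                except queue.Empty:
                    return
                # Forward everything the generator yields to the shared queue.
                for value in process(item):
                    outputs.put(value)
        finally:
            # Always signal completion, even if process() raised.
            outputs.put(done)

    workers = [threading.Thread(target=worker) for _ in range(threads)]
    for w in workers:
        w.start()

    # Drain the output queue until every worker has signalled completion.
    finished = 0
    while finished < len(workers):
        value = outputs.get()
        if value is done:
            finished += 1
        else:
            yield value

    for w in workers:
        w.join()


def letters(word):
    # Stand-in for a process() function: yields one tuple per character.
    for ch in word:
        yield word, ch
```

Because workers run concurrently, outputs from different items may interleave, so callers of a function like this should not rely on the order of results across inputs.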
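Parameters:
    process: a function that takes a mwxml.Dump and the path the dump was loaded from, and yields results
    paths: the paths of the dump files to process
    threads: the number of processing threads to start (defaults to None)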
Example:
    >>> import mwxml
    >>> files = ["examples/dump.xml", "examples/dump2.xml"]
    >>>
    >>> def page_info(dump, path):
    ...     for page in dump:
    ...         yield page.id, page.namespace, page.title
    ...
    >>> for page_id, namespace, title in mwxml.map(page_info, files):
    ...     print(page_id, namespace, title)
    ...
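Since mwxml.map() yields results as they are produced, they can be consumed by any function that accepts an iterable. As a minimal sketch, the helper below tallies pages per namespace from (id, namespace, title) tuples like those yielded in the example. The name `count_by_namespace` is hypothetical and not part of mwxml, and the sample rows are made up for illustration.

```python
from collections import Counter


def count_by_namespace(rows):
    """Count how many (page_id, namespace, title) tuples fall in each namespace."""
    counts = Counter()
    for _page_id, namespace, _title in rows:
        counts[namespace] += 1
    return counts


# Made-up rows standing in for the output of mwxml.map(page_info, files).
sample_rows = [(1, 0, "Foo"), (2, 0, "Bar"), (3, 1, "Talk:Foo")]
sample_counts = count_by_namespace(sample_rows)

# With mwxml installed and real dump files, the same helper could be fed directly:
#     counts = count_by_namespace(mwxml.map(page_info, files))
```

Keeping the aggregation separate from the process() function means the per-dump work stays in the workers while only small tuples travel back to the caller.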