Sitemap Generator
=================
This script crawls all available sites on http://docs.openstack.org and
extracts all URLs. Based on these URLs, it generates a sitemap for
search engines according to the sitemaps protocol.
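For reference, a file following the sitemaps protocol is an XML
``<urlset>`` containing one ``<url>`` entry per crawled page. A minimal,
illustrative entry (the actual URLs, timestamps, and priorities depend
on the crawl) looks like this:

.. code-block:: xml

   <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
     <url>
       <loc>http://docs.openstack.org/example/</loc>
       <lastmod>2017-01-01</lastmod>
       <priority>0.5</priority>
     </url>
   </urlset>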
Installation
------------
To install the required modules, you can use pip or the package
management system included in your distribution. When using the package
management system, the package names may differ. Installation in a
virtual environment is recommended.
.. code-block:: console

   $ virtualenv venv
   $ source venv/bin/activate
   $ pip install -r requirements.txt
When using pip, you may also need to install some development packages.
For example, on Ubuntu 16.04, install the following packages:

.. code-block:: console

   $ sudo apt install gcc libssl-dev python-dev python-virtualenv
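To confirm that Scrapy is installed and available on your path, you can
print its version:

.. code-block:: console

   $ scrapy version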
Usage
-----
To generate a new sitemap file, change into your local clone of the
``openstack/openstack-doc-tools`` repository and run the following
commands:

.. code-block:: console

   $ cd sitemap
   $ scrapy crawl sitemap
The script takes several minutes to crawl all available sites on
http://docs.openstack.org. The result is available in the
``sitemap_docs.openstack.org.xml`` file.
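If you have ``xmllint`` (from libxml2) installed, you can sanity-check
that the generated file is well-formed XML:

.. code-block:: console

   $ xmllint --noout sitemap_docs.openstack.org.xml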
Options
-------
``domain=URL``
   Sets the domain to crawl. Default is ``docs.openstack.org``.

   For example, to crawl http://developer.openstack.org use the
   following command:

   .. code-block:: console

      $ scrapy crawl sitemap -a domain=developer.openstack.org

   The result is available in the
   ``sitemap_developer.openstack.org.xml`` file.
``urls=URL``
   Defines a set of additional start URLs using the ``urls``
   attribute. Separate multiple URLs with a comma (``,``).

   For example:

   .. code-block:: console

      $ scrapy crawl sitemap -a domain=developer.openstack.org -a urls="http://developer.openstack.org/de/api-guide/quick-start/"
``LOG_FILE=FILE``
   Writes log messages to the specified file.

   For example, to write to ``scrapy.log``:

   .. code-block:: console

      $ scrapy crawl sitemap -s LOG_FILE=scrapy.log
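The options can also be combined in a single invocation. The following
command is an illustrative combination of the options documented above;
adjust the domain, start URL, and log file name to your needs:

.. code-block:: console

   $ scrapy crawl sitemap -a domain=developer.openstack.org -a urls="http://developer.openstack.org/de/api-guide/quick-start/" -s LOG_FILE=scrapy.log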