Latest commit c99bb681b2 by Andreas Jaeger, 2016-10-06 14:33:34 +02:00

Mitaka is an old release now

Update settings for mitaka

Change-Id: I122aafd4a3136da209ccd26d13d9060a134ce727

Name                    Last commit                                                Date
generator               Mitaka is an old release now                               2016-10-06 14:33:34 +02:00
test                    Change assertTrue(isinstance()) by optimal assert          2016-09-30 16:30:29 +08:00
__init__.py             doc-tools unit tests                                       2016-08-03 07:05:51 +00:00
README.rst              Improve the README file of the sitemap generator           2016-10-06 13:57:49 +02:00
requirements.txt        [sitemap] add a requirements file                          2015-10-02 08:28:37 +02:00
scrapy.cfg              script to generate the sitemap.xml for docs.openstack.org  2014-05-29 01:29:18 +02:00
transform-sitemap.xslt  Remove /draft from sitemap                                 2015-04-18 09:43:10 +02:00

README.rst

Sitemap Generator

This script crawls all available pages on http://docs.openstack.org and extracts their URLs. From these URLs it generates a sitemap for search engines that follows the protocol described at http://www.sitemaps.org/protocol.html.
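To illustrate the output format, the following is a minimal sketch of how a list of crawled URLs could be turned into a sitemap document. It is not the spider's actual code: the function name build_sitemap is hypothetical, and only the required urlset, url, and loc elements of the sitemap protocol are written.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemap protocol at sitemaps.org.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return a sitemap document (as a string) listing the given URLs.

    Hypothetical helper for illustration only. Duplicate URLs are
    dropped and the remaining ones are sorted for stable output.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in sorted(set(urls)):
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(urlset, encoding="unicode")


if __name__ == "__main__":
    print(build_sitemap([
        "http://docs.openstack.org/",
        "http://docs.openstack.org/user-guide/",
    ]))
```

The protocol also allows optional elements such as lastmod, changefreq, and priority for each URL; they are left out here to keep the sketch short.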

Installation

To install the required modules, use pip or the package management system of your distribution. When using the package management system, the package names may differ. Installing into a virtual environment is recommended.

$ virtualenv venv
$ source venv/bin/activate
$ pip install -r requirements.txt

When using pip, it may be necessary to install some development packages first. For example, on Ubuntu 16.04 install the following packages.

$ sudo apt install gcc libssl-dev python-dev python-virtualenv

Usage

To generate a new sitemap file, run the spider with the following command. Crawling all available pages on http://docs.openstack.org takes several minutes. The result is written to the file sitemap_docs.openstack.org.xml.

$ scrapy crawl sitemap

It is also possible to crawl other sites using the attribute domain.

For example, to crawl http://developer.openstack.org, use the following command. The result is written to the file sitemap_developer.openstack.org.xml.

$ scrapy crawl sitemap -a domain=developer.openstack.org

To write log messages into a file append the parameter -s LOG_FILE=scrapy.log.

It is possible to define a set of additional start URLs using the attribute urls. Separate multiple URLs with a comma (,).

$ scrapy crawl sitemap -a domain=developer.openstack.org -a urls="http://developer.openstack.org/de/api-guide/quick-start/"
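The way the two attributes combine can be sketched as follows. This is an assumption-laden illustration, not the spider's implementation: the function spider_settings and the dictionary keys are hypothetical, and using "http://<domain>" as the first start URL is an assumption. Only the default domain, the comma-separated urls value, and the output file name pattern come from the description above.

```python
def spider_settings(domain="docs.openstack.org", urls=""):
    """Derive crawl settings from the domain and urls attributes.

    Hypothetical helper for illustration only.
    """
    # Assumption: the crawl starts at the root of the chosen domain.
    start_urls = ["http://%s" % domain]
    # Additional start URLs arrive as one comma-separated string;
    # surrounding whitespace and empty entries are ignored.
    start_urls += [u.strip() for u in urls.split(",") if u.strip()]
    return {
        "domain": domain,
        "start_urls": start_urls,
        # File name pattern as documented: sitemap_<domain>.xml
        "output_file": "sitemap_%s.xml" % domain,
    }


if __name__ == "__main__":
    settings = spider_settings(
        domain="developer.openstack.org",
        urls="http://developer.openstack.org/de/api-guide/quick-start/",
    )
    print(settings["output_file"])
    print(settings["start_urls"])
```

Passing both attributes on the command line, as in the example above, corresponds to calling the sketch with both keyword arguments; omitting them falls back to docs.openstack.org with no extra start URLs.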