Sitemap Generator
This script crawls all available sites on http://docs.openstack.org and extracts all URLs. Based on the URLs the script generates a sitemap for search engines according to the protocol described at http://www.sitemaps.org/protocol.html.
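As a rough illustration of the output format, the following minimal sketch builds a sitemap document from a list of URLs using only the Python standard library. It is not the code used by this project; the function name build_sitemap and the sample URLs are assumptions made for the example. Only the urlset, url and loc elements and the namespace come from the sitemap protocol.

```python
import xml.etree.ElementTree as ET

# Namespace required by the sitemap protocol for the <urlset> root element.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return a sitemap XML string with one <url><loc> entry per unique URL.

    Hypothetical helper for illustration only; the real spider may emit
    additional fields or structure its output differently.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    # De-duplicate and sort so the output is stable between runs.
    for url in sorted(set(urls)):
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(urlset, encoding="unicode")


# Sample URLs chosen for illustration.
xml_text = build_sitemap([
    "http://docs.openstack.org/",
    "http://docs.openstack.org/mitaka/",
    "http://docs.openstack.org/",
])
print(xml_text)
```

The protocol also allows optional per-URL elements such as lastmod, changefreq and priority, which this sketch leaves out.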
Installation
To install the required modules, use pip or the package management system included in your distribution. When using the package management system, the package names may differ. Installation in a virtual environment is recommended.
$ virtualenv venv
$ source venv/bin/activate
$ pip install -r requirements.txt
When using pip, it may be necessary to install some development packages first. For example, on Ubuntu 16.04 install the following packages.
$ sudo apt install gcc libssl-dev python-dev python-virtualenv
Usage
To generate a new sitemap file simply run the spider using the following command. It will take several minutes to crawl all available sites on http://docs.openstack.org. The result will be available in the file sitemap_docs.openstack.org.xml.
$ scrapy crawl sitemap
It's also possible to crawl other sites using the attribute domain. For example, to crawl http://developer.openstack.org, use the following command. The result will be available in the file sitemap_developer.openstack.org.xml.
$ scrapy crawl sitemap -a domain=developer.openstack.org
To write log messages into a file append the parameter -s LOG_FILE=scrapy.log.
It is possible to define a set of additional start URLs using the attribute urls. Separate multiple URLs with a comma (,).
$ scrapy crawl sitemap -a domain=developer.openstack.org -a urls="http://developer.openstack.org/de/api-guide/quick-start/"
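The way the domain and urls attributes map to crawl settings can be sketched roughly as follows. Scrapy passes each -a key=value pair to the spider as a keyword argument, so a plain function is enough to show the idea. The function name resolve_crawl_settings and the assumption that the crawl starts at http://<domain>/ are illustrative and not taken from this project's spider. The default domain and the output file naming follow the examples above.

```python
def resolve_crawl_settings(domain="docs.openstack.org", urls=""):
    """Turn the command line attributes into crawl settings.

    Hypothetical helper for illustration only. Assumes the crawl starts
    at the root of the given domain; the real spider may differ.
    """
    # Assumption: the domain root is always the first start URL.
    start_urls = ["http://{}/".format(domain)]
    # Additional start URLs arrive as one comma-separated string.
    start_urls += [u.strip() for u in urls.split(",") if u.strip()]
    return {
        "allowed_domains": [domain],
        "start_urls": start_urls,
        "output_file": "sitemap_{}.xml".format(domain),
    }


settings = resolve_crawl_settings(
    domain="developer.openstack.org",
    urls="http://developer.openstack.org/de/api-guide/quick-start/",
)
print(settings["output_file"])
print(settings["start_urls"])
```

Stripping whitespace and skipping empty items keeps the parsing tolerant of input such as a trailing comma.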