Add documentation
23
TODO.rst
Normal file
@@ -0,0 +1,23 @@
.. Note: this list is automatically included in the documentation.

***********************************
To-do list and possible future work
***********************************

This document lists some ideas that the developers thought of, but have not yet
implemented. The topics described below may be implemented (or not) in the
future, depending on time, demand, and technical possibilities.

* Improved error handling instead of just propagating the errors from the
  Thrift layer. Maybe wrap the errors in a HappyBase.Error?

* Automatic retries for failed operations (but only those that can be retried)

* Connection pooling (maybe based on PyCassa's ConnectionPool?)

* Thread safety. This involves at least coordinating access to the socket
  connection to HBase's Thrift gateway.

* Port HappyBase over to the (still experimental) HBase Thrift2 API when it
  becomes mainstream, and expose more of the underlying features nicely in the
  HappyBase API.
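
To make the first two ideas above more concrete, here is a rough sketch of how
wrapped errors and selective retries could fit together. Nothing in it exists in
HappyBase: the ``Error`` and ``RetryableError`` classes, the ``retry`` decorator
and ``flaky_read`` are all hypothetical names, and the sketch is written for
Python 3.

```python
import functools
import time


class Error(Exception):
    """Hypothetical library-level error wrapping a lower-level failure."""


class RetryableError(Error):
    """Hypothetical marker for failures that are safe to retry."""


def retry(attempts=3, delay=0.0):
    """Retry the decorated callable, but only when it raises RetryableError."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, attempts + 1):
                try:
                    return func(*args, **kwargs)
                except RetryableError:
                    if attempt == attempts:
                        raise  # out of attempts: let the caller see the error
                    time.sleep(delay)
        return wrapper
    return decorator


calls = []


@retry(attempts=3)
def flaky_read():
    """Simulated operation that fails twice before succeeding."""
    calls.append(1)
    if len(calls) < 3:
        raise RetryableError('simulated transient failure')
    return 'ok'


result = flaky_read()
```

Any other exception type passes straight through the decorator, which is one
way to honour the "only those that can be retried" restriction.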
|
||||
47
doc/api.rst
Normal file
@@ -0,0 +1,47 @@
*****************
API documentation
*****************

.. py:currentmodule:: happybase

This chapter contains detailed API documentation for HappyBase. It is suggested
to read the :doc:`tutorial <tutorial>` first to get a general idea about how
HappyBase works.

The HappyBase API is organised as follows:

:py:class:`~happybase.Connection`:
   The :py:class:`~happybase.Connection` class is the main entry point for
   application developers. It connects to the HBase Thrift server and provides
   methods for table management.

:py:class:`~happybase.Table`:
   The :py:class:`Table` class is the main class for interacting with data in
   tables. This class offers methods for data retrieval and data manipulation.
   Instances of this class can be obtained using the
   :py:meth:`Connection.table()` method.

:py:class:`~happybase.Batch`:
   The :py:class:`Batch` class implements the batch API for data manipulation,
   and is available through the :py:meth:`Table.batch()` method.


Connection
==========

.. autoclass:: happybase.Connection


Table
=====

.. autoclass:: happybase.Table


Batch
=====

.. autoclass:: happybase.Batch


.. vim: set spell spelllang=en:
|
||||
244
doc/conf.py
Normal file
@@ -0,0 +1,244 @@
# -*- coding: utf-8 -*-
#
# HappyBase documentation build configuration file, created by
# sphinx-quickstart on Tue Mar 20 17:40:16 2012.
#
# This file is execfile()d with the current directory set to its containing dir.
#
# Note that not all possible configuration values are present in this
# autogenerated file.
#
# All configuration values have a default; values that are commented out
# serve to show the default.

import sys, os

# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
#sys.path.insert(0, os.path.abspath('.'))

# -- General configuration -----------------------------------------------------

# If your documentation needs a minimal Sphinx version, state it here.
#needs_sphinx = '1.0'

# Add any Sphinx extension module names here, as strings. They can be extensions
# coming with Sphinx (named 'sphinx.ext.*') or your custom ones.
extensions = ['sphinx.ext.autodoc', 'sphinx.ext.coverage']

# Add any paths that contain templates here, relative to this directory.
templates_path = ['_templates']

# The suffix of source filenames.
source_suffix = '.rst'

# The encoding of source files.
#source_encoding = 'utf-8-sig'

# The master toctree document.
master_doc = 'index'

# General information about the project.
project = u'HappyBase'
copyright = u'2012'

# The version info for the project you're documenting, acts as replacement for
# |version| and |release|, also used in various other places throughout the
# built documents.
#
# The short X.Y version.
version = '0.1'
# The full version, including alpha/beta/rc tags.
release = '0.1'

# The language for content autogenerated by Sphinx. Refer to documentation
# for a list of supported languages.
#language = None

# There are two options for replacing |today|: either, you set today to some
# non-false value, then it is used:
#today = ''
# Else, today_fmt is used as the format for a strftime call.
#today_fmt = '%B %d, %Y'

# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
exclude_patterns = []

# The reST default role (used for this markup: `text`) to use for all documents.
#default_role = None

# If true, '()' will be appended to :func: etc. cross-reference text.
#add_function_parentheses = True

# If true, the current module name will be prepended to all description
# unit titles (such as .. function::).
#add_module_names = True

# If true, sectionauthor and moduleauthor directives will be shown in the
# output. They are ignored by default.
#show_authors = False

# The name of the Pygments (syntax highlighting) style to use.
pygments_style = 'sphinx'

# A list of ignored prefixes for module index sorting.
#modindex_common_prefix = []

autodoc_default_flags = ['members', 'undoc-members']
autodoc_member_order = 'bysource'

# -- Options for HTML output ---------------------------------------------------

# The theme to use for HTML and HTML Help pages. See the documentation for
# a list of builtin themes.
html_theme = 'default'

# Theme options are theme-specific and customize the look and feel of a theme
# further. For a list of options available for each theme, see the
# documentation.
#html_theme_options = {}

# Add any paths that contain custom themes here, relative to this directory.
#html_theme_path = []

# The name for this set of Sphinx documents. If None, it defaults to
# "<project> v<release> documentation".
#html_title = None

# A shorter title for the navigation bar. Default is the same as html_title.
#html_short_title = None

# The name of an image file (relative to this directory) to place at the top
# of the sidebar.
#html_logo = None

# The name of an image file (within the static path) to use as favicon of the
# docs. This file should be a Windows icon file (.ico) being 16x16 or 32x32
# pixels large.
#html_favicon = None

# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
html_static_path = ['_static']

# If not '', a 'Last updated on:' timestamp is inserted at every page bottom,
# using the given strftime format.
#html_last_updated_fmt = '%b %d, %Y'

# If true, SmartyPants will be used to convert quotes and dashes to
# typographically correct entities.
#html_use_smartypants = True

# Custom sidebar templates, maps document names to template names.
#html_sidebars = {}

# Additional templates that should be rendered to pages, maps page names to
# template names.
#html_additional_pages = {}

# If false, no module index is generated.
#html_domain_indices = True

# If false, no index is generated.
#html_use_index = True

# If true, the index is split into individual pages for each letter.
#html_split_index = False

# If true, links to the reST sources are added to the pages.
#html_show_sourcelink = True

# If true, "Created using Sphinx" is shown in the HTML footer. Default is True.
#html_show_sphinx = True

# If true, "(C) Copyright ..." is shown in the HTML footer. Default is True.
#html_show_copyright = True

# If true, an OpenSearch description file will be output, and all pages will
# contain a <link> tag referring to it. The value of this option must be the
# base URL from which the finished HTML is served.
#html_use_opensearch = ''

# This is the file name suffix for HTML files (e.g. ".xhtml").
#html_file_suffix = None

# Output file base name for HTML help builder.
htmlhelp_basename = 'HappyBasedoc'


# -- Options for LaTeX output --------------------------------------------------

latex_elements = {
# The paper size ('letterpaper' or 'a4paper').
#'papersize': 'letterpaper',

# The font size ('10pt', '11pt' or '12pt').
#'pointsize': '10pt',

# Additional stuff for the LaTeX preamble.
#'preamble': '',
}

# Grouping the document tree into LaTeX files. List of tuples
# (source start file, target name, title, author, documentclass [howto/manual]).
latex_documents = [
  ('index', 'HappyBase.tex', u'HappyBase Documentation',
   u' ', 'manual'),
]

# The name of an image file (relative to this directory) to place at the top of
# the title page.
#latex_logo = None

# For "manual" documents, if this is true, then toplevel headings are parts,
# not chapters.
#latex_use_parts = False

# If true, show page references after internal links.
#latex_show_pagerefs = False

# If true, show URL addresses after external links.
#latex_show_urls = False

# Documents to append as an appendix to all manuals.
#latex_appendices = []

# If false, no module index is generated.
#latex_domain_indices = True


# -- Options for manual page output --------------------------------------------

# One entry per manual page. List of tuples
# (source start file, name, description, authors, manual section).
man_pages = [
    ('index', 'happybase', u'HappyBase Documentation',
     [u' '], 1)
]

# If true, show URL addresses after external links.
#man_show_urls = False


# -- Options for Texinfo output ------------------------------------------------

# Grouping the document tree into Texinfo files. List of tuples
# (source start file, target name, title, author,
#  dir menu entry, description, category)
texinfo_documents = [
  ('index', 'HappyBase', u'HappyBase Documentation',
   u' ', 'HappyBase', 'One line description of project.',
   'Miscellaneous'),
]

# Documents to append as an appendix to all manuals.
#texinfo_appendices = []

# If false, no module index is generated.
#texinfo_domain_indices = True

# How to display URL addresses: 'footnote', 'no', or 'inline'.
#texinfo_show_urls = 'footnote'
|
||||
27
doc/index.rst
Normal file
@@ -0,0 +1,27 @@
*********
HappyBase
*********

.. include:: ../README.rst

.. rubric:: Table of contents

.. toctree::
   :maxdepth: 1

   introduction
   installation
   tutorial
   api
   todo
   license


.. rubric:: Indices and tables

* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`


.. vim: set spell spelllang=en:
|
||||
67
doc/installation.rst
Normal file
@@ -0,0 +1,67 @@
************
Installation
************

This guide describes how to install HappyBase.

.. contents:: On this page
   :local:


Setting up a virtual environment
================================

The recommended way to install HappyBase and Thrift is to use a virtual
environment created by `virtualenv`. Set up and activate a new virtual
environment like this:

.. code-block:: sh

   $ virtualenv envname
   $ source envname/bin/activate

If you use the `virtualenvwrapper` scripts, type this instead:

.. code-block:: sh

   $ mkvirtualenv envname


Installing packages
===================

The next step is to install the Thrift package for Python:

.. code-block:: sh

   (envname) $ pip install thrift

…and the HappyBase package:

.. code-block:: sh

   (envname) $ cd /path/to/happybase/
   (envname) $ python setup.py install

.. note::

   Generating and installing the HBase Thrift Python modules (using ``thrift
   --gen py`` on the ``.thrift`` file) is not necessary, since HappyBase
   bundles pregenerated versions of those modules.


Testing the installation
========================

Verify that the packages are installed correctly by starting a ``python`` shell
and entering the following statements::

   >>> import thrift
   >>> import happybase

If you don't see any errors, the installation was successful. Congratulations!
Now that you have HappyBase installed on your machine, continue with the
:doc:`tutorial <tutorial>` to learn how to use it.
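
The same check can also be scripted, for example as part of a deployment step.
The sketch below is only an illustration and assumes Python 3.4 or newer (for
``importlib.util.find_spec``); the helper name ``missing_packages`` is made up
for this example and is not part of HappyBase.

```python
import importlib.util


def missing_packages(names):
    """Return the names from `names` that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Report which of the two required packages, if any, are not importable.
missing = missing_packages(['thrift', 'happybase'])
if missing:
    print('Not installed: ' + ', '.join(missing))
else:
    print('thrift and happybase are both importable')
```

Unlike the interactive check, this only looks the packages up and does not
execute them, so an installation that is present but broken would still need
the real ``import`` statements to surface its errors.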


.. vim: set spell spelllang=en:
|
||||
114
doc/introduction.rst
Normal file
@@ -0,0 +1,114 @@
************
Introduction
************

.. py:currentmodule:: happybase

.. contents:: On this page
   :local:


What is HappyBase?
==================

.. include:: ../README.rst

HappyBase is designed for use in standard HBase setups, and offers
application developers a Pythonic API to interact with HBase.

Below the surface, HappyBase uses the `Python Thrift library
<http://pypi.python.org/pypi/thrift>`_ to connect to HBase's `Thrift
<http://thrift.apache.org/>`_ gateway, which is included in the standard HBase
0.9x releases. HappyBase hides most of the details of the underlying RPC
mechanisms, resulting in application code that is cleaner, more productive to
write, and more maintainable.


What does code using HappyBase look like?
=========================================

The example below illustrates basic usage of the library::

   import happybase

   connection = happybase.Connection('hostname')
   table = connection.table('table-name')

   table.put('row-key', {'family:qual1': 'value1',
                         'family:qual2': 'value2'})

   row = table.row('row-key')
   print row['family:qual1']  # prints 'value1'

   for key, data in table.rows(['row-key-1', 'row-key-2']):
       print key, data  # prints row key and data for each row

   for key, data in table.scan(row_prefix='row'):
       print key, data  # prints row key and data for each matching row

   table.delete('row-key')

Note that the :doc:`tutorial <tutorial>` contains many more examples.


Why not use the HBase Thrift API directly?
==========================================

You may consider using the HBase Thrift API directly instead of adding yet
another library to your project. After all, :pep:`20` taught us that simple is
better than complex, and there should be one, and preferably only one, way to
do it, right? Well, we agree.

While the HBase Thrift API can be used directly from Python using the
(automatically generated) HBase Thrift service classes, application code using
this API is verbose, cumbersome, and hence error-prone. The reason for this is
that the HBase Thrift API is a flat, language-agnostic interface API closely
tied to the RPC going over the wire-level protocol. This means that
applications need to deal with many imports, sockets, transports, protocols,
clients, Thrift types and mutation objects. For instance, look at the code
required to connect to HBase and store two values::

   from thrift import Thrift
   from thrift.transport import TSocket, TTransport
   from thrift.protocol import TBinaryProtocol

   from hbase import ttypes
   from hbase.Hbase import Client, Mutation

   sock = TSocket.TSocket('hostname', 9090)
   transport = TTransport.TBufferedTransport(sock)
   protocol = TBinaryProtocol.TBinaryProtocol(transport)
   client = Client(protocol)
   transport.open()

   mutations = [Mutation(column='family:qual1', value='value1'),
                Mutation(column='family:qual2', value='value2')]
   client.mutateRow('table-name', 'row-key', mutations)

HappyBase hides all the Thrift cruft below a friendly API, and makes the task
in the example above look like this::

   import happybase
   connection = happybase.Connection('hostname')
   table = connection.table('table-name')
   table.put('row-key', {'family:qual1': 'value1',
                         'family:qual2': 'value2'})

Hopefully this example makes it clear that you will be a lot happier using
HappyBase than using the Thrift API directly. If you still have doubts about
this, try to accomplish some other common tasks, e.g. retrieving rows and
scanning over a part of a table, and compare that with the really-easy-to-use
HappyBase equivalents. If you're still not convinced by then, we're sorry to
inform you that HappyBase is not the project for you, and we wish you the best
of luck maintaining your code ‒ or is it Thrift boilerplate? ‒ while your
application evolves.


How do I get started?
=====================

Follow the :doc:`installation guide <installation>` and read the :doc:`tutorial
<tutorial>`.


.. vim: set spell spelllang=en:
|
||||
5
doc/license.rst
Normal file
@@ -0,0 +1,5 @@
*******
License
*******

.. include:: ../LICENSE.rst
|
||||
1
doc/todo.rst
Normal file
@@ -0,0 +1 @@
.. include:: ../TODO.rst
|
||||
414
doc/tutorial.rst
Normal file
@@ -0,0 +1,414 @@
********
Tutorial
********

.. py:currentmodule:: happybase

This tutorial explores the HappyBase API and should provide you with enough
information to get you started. Note that this tutorial is intended as an
introduction to HappyBase, not to HBase in general. Readers should already have
a basic understanding of HBase and its data model.

While the tutorial does cover most features, it is not a complete reference
guide. More information about the HappyBase API is available from the :doc:`API
documentation <api>`.

.. contents:: On this page
   :local:


Opening a :py:class:`Connection`
================================

We'll get started by connecting to HBase::

   import happybase

   connection = happybase.Connection('somehost')

When a :py:class:`Connection` instance is created, it automatically opens a
socket connection to the HBase Thrift server. This behaviour can be disabled by
setting the `autoconnect` argument to `False`, and opening the connection
manually using :py:meth:`Connection.open`::

   connection = happybase.Connection('somehost', autoconnect=False)

   # before first use:
   connection.open()

The :py:class:`Connection` class provides various methods to interact with the
HBase instance. For instance, we can ask for the names of the available
tables using the :py:meth:`Connection.tables` method::

   print connection.tables()

If a single HBase instance is used by multiple applications, table name
collisions may occur because applications use the same table names. A solution
is to add a ‘namespace’ prefix to the names of all tables ‘owned’ by a specific
application. Instead of adding this application-specific prefix each time a
table name is passed to HappyBase, the `table_prefix` parameter can be used.
HappyBase will prepend that prefix (and an underscore) to each table name
handled by the :py:class:`Connection` instance. So, for a project ``myproject``
that should have table names that look like ``myproject_XYZ``, use this::

   connection = happybase.Connection('somehost', table_prefix='myproject')

:py:meth:`Connection.tables` no longer includes tables in other ‘namespaces’;
it only returns tables with a ``myproject_`` prefix in HBase, and also
strips off the prefix::

   print connection.tables()  # Table "myproject_XYZ" in HBase will be
                              # returned as simply "XYZ"
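
The prefix handling described above can be pictured with a few lines of plain
Python. This is only a model of the described behaviour, written for Python 3;
it is not HappyBase's code, and the helper names ``full_name`` and
``visible_tables`` as well as the sample table names are invented for this
illustration.

```python
def full_name(prefix, name):
    """Name sent to HBase: the prefix and an underscore, then the table name."""
    return name if prefix is None else prefix + '_' + name


def visible_tables(prefix, hbase_tables):
    """Tables reported back: only those in the 'namespace', prefix stripped."""
    if prefix is None:
        return list(hbase_tables)
    marker = prefix + '_'
    return [t[len(marker):] for t in hbase_tables if t.startswith(marker)]


# Hypothetical table names as they might exist in a shared HBase instance.
hbase_tables = ['myproject_XYZ', 'myproject_users', 'otherproject_XYZ']

own_tables = visible_tables('myproject', hbase_tables)
stored_as = full_name('myproject', 'XYZ')
```

Here ``own_tables`` contains ``XYZ`` and ``users`` but nothing from
``otherproject``, and ``stored_as`` is the name that would exist in HBase.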

The :py:class:`Connection` class offers various other methods to interact with
HBase, mostly to perform table management tasks like enabling and disabling
tables. This tutorial does not cover those; the :doc:`API documentation <api>`
for the :py:class:`Connection` class contains more information.


Obtaining a :py:class:`Table` instance
======================================

The :py:class:`Table` class provides the main API to retrieve and manipulate
data in HBase. In the example above, we already asked for the available tables
using the :py:meth:`Connection.tables` method, so the next step is to obtain a
:py:class:`.Table` instance. This is done by calling
:py:meth:`Connection.table` with the name of the table::

   table = connection.table('mytable')

Obtaining a :py:class:`Table` instance does *not* result in a round-trip to the
Thrift server, which means application code may ask the :py:class:`Connection`
instance for a new :py:class:`Table` whenever it needs one, without negative
performance consequences. A side effect is that no check is done to ensure that
the table exists, since that would involve a round-trip, so expect errors if
you try to interact with non-existing tables later in your code. For this
tutorial, we assume the table exists.

.. note::

   The ‘heavy’ `HTable` HBase class from the Java HBase API, which does the
   real communication with the region servers, is at the other side of the
   Thrift connection. There is no direct mapping between :py:class:`Table`
   instances on the Python side and `HTable` instances on the server side.


Retrieving data
===============

The HBase data model is a multidimensional sparse map. A table in HBase
contains column families with column qualifiers containing a value and a
timestamp. In most of the HappyBase API, column family and qualifier names are
specified as a single string, e.g. ``cf1:col1``, and not as two separate
arguments. While column families and qualifiers are different concepts in the
HBase data model, they are almost always used together when interacting with
data, so treating them as a single string makes the API a lot simpler.

Retrieving rows
---------------

The :py:class:`Table` class offers various methods to retrieve data from a
table in HBase. The most basic one is :py:meth:`Table.row`, which retrieves a
single row from the table, and returns it as a dictionary mapping columns to
values::

   row = table.row('row-key')
   print row['cf1:col1']   # prints the value of cf1:col1

The :py:meth:`Table.rows` method works just like :py:meth:`Table.row`, but
takes multiple row keys and returns those as `(key, data)` tuples::

   rows = table.rows(['row-key-1', 'row-key-2'])
   for key, data in rows:
       print key, data

If you want the results that :py:meth:`Table.rows` returns as a dictionary or
ordered dictionary, you will have to do this yourself. This is really easy
though, since the return value can be passed directly to the dictionary
constructor. For a normal dictionary, order is lost::

   rows_as_dict = dict(table.rows(['row-key-1', 'row-key-2']))

…whereas for a :py:class:`OrderedDict`, order is preserved::

   from collections import OrderedDict
   rows_as_ordered_dict = OrderedDict(table.rows(['row-key-1', 'row-key-2']))


Making more fine-grained selections
-----------------------------------

HBase's data model allows for more fine-grained selections of the data to
retrieve. If you know beforehand which columns are needed, performance can be
improved by specifying those columns explicitly to :py:meth:`Table.row` and
:py:meth:`Table.rows`. The `columns` argument takes a list (or tuple) of column
names::

   row = table.row('row-key', columns=['cf1:col1', 'cf1:col2'])
   print row['cf1:col1']
   print row['cf1:col2']

Instead of providing both a column family and a column qualifier, items in the
`columns` argument may also be just a column family, which means that all
columns from that column family will be retrieved. For example, to get all
columns and values in the column family `cf1`, use this::

   row = table.row('row-key', columns=['cf1'])
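
To make the two accepted forms of a column name explicit, the following sketch
splits such a string into its family and qualifier parts. It is an
illustration in Python 3 only: ``split_column`` is a hypothetical helper, not a
HappyBase function, and nothing here claims to mirror how HappyBase parses
these strings internally.

```python
def split_column(name):
    """Split 'family:qualifier' into its parts.

    A bare family name (no colon) yields None as the qualifier, which
    stands for "all columns in this family".
    """
    family, separator, qualifier = name.partition(':')
    return family, (qualifier if separator else None)


single_column = split_column('cf1:col1')
whole_family = split_column('cf1')
```

The first call yields the family ``cf1`` with qualifier ``col1``; the second
yields ``cf1`` with no qualifier at all.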

In HBase, each cell has a timestamp attached to it. In case you don't want to
work with the latest version of data stored in HBase, the methods that retrieve
data from the database, e.g. :py:meth:`Table.row`, all accept a `timestamp`
argument that specifies that the results should be restricted to values with a
timestamp up to the specified timestamp::

   row = table.row('row-key', timestamp=123456789)

By default, HappyBase does not include timestamps in the results it returns. If
your application needs access to the timestamps, simply set the
`include_timestamp` parameter to ``True``. Now, each cell in the result will be
returned as a `(value, timestamp)` tuple instead of just a value::

   row = table.row('row-key', columns=['cf1:col1'], include_timestamp=True)
   value, timestamp = row['cf1:col1']
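
Once cells arrive as `(value, timestamp)` tuples, a little post-processing is
often useful. The sketch below works on a hand-written sample dictionary shaped
like such a result, so it runs without an HBase server. It is written for
Python 3 and assumes that the timestamps are milliseconds since the Unix epoch,
which is what HBase assigns when no explicit timestamp is given; adjust the
conversion if your application stores timestamps in another unit.

```python
from datetime import datetime, timezone

# Hypothetical sample shaped like a row fetched with include_timestamp=True.
row = {
    'cf1:col1': ('value1', 1332261616000),
    'cf1:col2': ('value2', 1332261617000),
}

# Unpack a single cell and convert its timestamp (assumed to be in ms).
value, timestamp = row['cf1:col1']
written_at = datetime.fromtimestamp(timestamp / 1000, tz=timezone.utc)

# Drop the timestamps again to get a plain column-to-value mapping.
values_only = {column: cell_value for column, (cell_value, _) in row.items()}
```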

HBase supports storing multiple versions of the same cell. This can be
configured for each column family. To retrieve all versions of a column for a
given row, :py:meth:`Table.cells` can be used. This method returns an ordered
list of cells, with the most recent version coming first. The `versions`
argument specifies the maximum number of versions to return. Just like the
methods that retrieve rows, the `include_timestamp` argument determines whether
timestamps are included in the result. Example::

   values = table.cells('row-key', 'cf1:col1', versions=2)
   for value in values:
       print "Cell data: %s" % value

   cells = table.cells('row-key', 'cf1:col1', versions=3, include_timestamp=True)
   for value, timestamp in cells:
       print "Cell data at %d: %s" % (timestamp, value)

Note that the result may contain fewer cells than requested. The cell may just
have fewer versions, or you may have requested more versions than HBase keeps
for the column family.
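
The ordering and the "fewer cells than requested" behaviour can be mimicked
with a small in-memory model. This Python 3 sketch is only a way to reason
about the semantics described above; ``newest_versions`` and the sample data
are invented for the illustration and say nothing about how HBase or HappyBase
implement versioning.

```python
def newest_versions(stored, versions=None):
    """Return (value, timestamp) cells, most recent first, at most `versions`."""
    ordered = sorted(stored, key=lambda cell: cell[1], reverse=True)
    return ordered if versions is None else ordered[:versions]


# Three stored versions of one cell, in arbitrary order.
stored = [('v1', 100), ('v3', 300), ('v2', 200)]

two_newest = newest_versions(stored, versions=2)
asked_for_five = newest_versions(stored, versions=5)
```

Asking for five versions simply returns the three that exist, which is the
situation the note above warns about.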

Scanning over rows in a table
-----------------------------

In addition to retrieving data for known row keys, rows in HBase can be
efficiently iterated over using a table scanner, created using
:py:meth:`Table.scan`. A basic scanner that iterates over all rows in the table
looks like this::

   for key, data in table.scan():
       print key, data

Doing full table scans like in the example above is prohibitively expensive in
practice. Scans can be restricted in several ways to make more selective range
queries. One way is to specify start or stop keys, or both. To iterate over all
rows from row `aaa` to the end of the table::

   for key, data in table.scan(row_start='aaa'):
       print key, data

To iterate over all rows from the start of the table up to row `xyz`, use this::

   for key, data in table.scan(row_stop='xyz'):
       print key, data

To iterate over all rows between row `aaa` (included) and `xyz` (not included),
supply both::

   for key, data in table.scan(row_start='aaa', row_stop='xyz'):
       print key, data

An alternative is to use a key prefix. For example, to iterate over all rows
starting with `abc`::

   for key, data in table.scan(row_prefix='abc'):
       print key, data

The scanner examples above only limit the results by row key using the
`row_start`, `row_stop`, and `row_prefix` arguments, but scanners can also
limit results to certain columns, column families, and timestamps, just like
:py:meth:`Table.row` and :py:meth:`Table.rows`. For advanced users, a filter
string can be passed as the `filter` argument. Additionally, the optional
`limit` argument defines how much data is at most retrieved, and the
`batch_size` argument specifies how big the transferred chunks should be. The
:py:meth:`Table.scan` API documentation provides more information on the
supported scanner options.
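
The row key semantics of the scanner arguments (start included, stop excluded,
prefix match, and an upper bound on the number of results) can be tried out
without a server using a toy model over a sorted list of keys. The sketch is
written for Python 3; ``scan_keys`` is a made-up function that only imitates
the behaviour described in this section, it treats `limit` as a number of rows
(an assumption), and it does not reproduce any argument validation the real
:py:meth:`Table.scan` may perform.

```python
import itertools


def scan_keys(sorted_keys, row_start=None, row_stop=None, row_prefix=None,
              limit=None):
    """Yield-like model: return the keys a range scan would visit, in order."""
    def matching():
        for key in sorted_keys:
            if row_stop is not None and key >= row_stop:
                break  # the stop key itself is not included
            if row_start is not None and key < row_start:
                continue  # the start key itself is included
            if row_prefix is not None and not key.startswith(row_prefix):
                continue
            yield key
    # islice() with limit=None passes everything through unchanged.
    return list(itertools.islice(matching(), limit))


keys = ['aaa', 'abc1', 'abc2', 'abd', 'xyz']

from_aaa_to_xyz = scan_keys(keys, row_start='aaa', row_stop='xyz')
with_prefix = scan_keys(keys, row_prefix='abc')
first_two = scan_keys(keys, limit=2)
```

In this model ``from_aaa_to_xyz`` contains every key except ``xyz``, and
``with_prefix`` contains only ``abc1`` and ``abc2``.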
|
||||
|
||||
|
||||
Manipulating data
|
||||
=================
|
||||
|
||||
In HBase, all mutations either store data or mark data for deletion; there is
|
||||
no such thing as an `update`. HappyBase provides methods to do single inserts
|
||||
or deletes, and also a batch API for bulk mutations.
|
||||
|
||||
Storing data
------------

To store data in a single row of our table, we can use :py:meth:`Table.put`,
which takes the row key and the data to store. The data should be a dictionary
mapping column names to values::

   table.put('row-key', {'cf:col1': 'value1',
                         'cf:col2': 'value2'})

Use the `timestamp` argument if you want to provide timestamps explicitly::

   table.put('row-key', {'cf:col1': 'value1'}, timestamp=123456789)

If omitted, HBase defaults to the current system time.

Deleting data
-------------

The :py:meth:`Table.delete` method deletes data from a table. To delete a
complete row, just specify the row key::

   table.delete('row-key')

To delete one or more columns instead of a complete row, also specify the
`columns` argument::

   table.delete('row-key', columns=['cf1:col1', 'cf1:col2'])

The optional `timestamp` argument restricts the delete operation to data up to
the specified timestamp.

Performing batch mutations
--------------------------

The :py:meth:`Table.put` and :py:meth:`Table.delete` methods both issue a
command to the HBase Thrift server immediately. This means that using these
methods is not very efficient when storing or deleting multiple values. It is
much more efficient to aggregate a bunch of commands and send them to the
server in one go. This is exactly what the :py:class:`Batch` class, created
using :py:meth:`Table.batch`, does. A :py:class:`Batch` instance has put and
delete methods, just like the :py:class:`Table` class, but the changes are sent
to the server in a single round-trip using :py:meth:`Batch.send`::

   b = table.batch()
   b.put('row-key-1', {'cf:col1': 'value1', 'cf:col2': 'value2'})
   b.put('row-key-2', {'cf:col2': 'value2', 'cf:col3': 'value3'})
   b.put('row-key-3', {'cf:col3': 'value3', 'cf:col4': 'value4'})
   b.delete('row-key-4')
   b.send()

.. note::

   Storing and deleting data for the same row key in a single batch leads to
   unpredictable results, so don't do that.

While the methods on the :py:class:`Batch` instance resemble the
:py:meth:`~Table.put` and :py:meth:`~Table.delete` methods, they do not take a
`timestamp` argument for each mutation. Instead, you can specify a single
`timestamp` argument for the complete batch::

   b = table.batch(timestamp=123456789)
   b.put(...)
   b.delete(...)
   b.send()

:py:class:`Batch` instances can be used as *context managers*, which are most
useful in combination with Python's ``with`` construct. The first batch example
above can be simplified to read::

   with table.batch() as b:
       b.put('row-key-1', {'cf:col1': 'value1', 'cf:col2': 'value2'})
       b.put('row-key-2', {'cf:col2': 'value2', 'cf:col3': 'value3'})
       b.put('row-key-3', {'cf:col3': 'value3', 'cf:col4': 'value4'})
       b.delete('row-key-4')

As you can see, there is no call to :py:meth:`Batch.send` anymore. The batch is
automatically applied when the ``with`` code block terminates, even in case of
errors somewhere in the ``with`` block, so it behaves basically the same as a
``try/finally`` clause. However, some applications require transactional
behaviour, sending the batch only if no exception occurred. Without a context
manager this would look something like this::

   b = table.batch()
   try:
       b.put('row-key-1', {'cf:col1': 'value1', 'cf:col2': 'value2'})
       b.put('row-key-2', {'cf:col2': 'value2', 'cf:col3': 'value3'})
       b.put('row-key-3', {'cf:col3': 'value3', 'cf:col4': 'value4'})
       b.delete('row-key-4')
       raise ValueError("Something went wrong!")
   except ValueError as e:
       # error handling goes here; nothing is sent to HBase
       pass
   else:
       # no exceptions; send data
       b.send()

Obtaining the same behaviour is easier using a ``with`` block. The
`transaction` argument to :py:meth:`Table.batch` is all you need::

   try:
       with table.batch(transaction=True) as b:
           b.put('row-key-1', {'cf:col1': 'value1', 'cf:col2': 'value2'})
           b.put('row-key-2', {'cf:col2': 'value2', 'cf:col3': 'value3'})
           b.put('row-key-3', {'cf:col3': 'value3', 'cf:col4': 'value4'})
           b.delete('row-key-4')
           raise ValueError("Something went wrong!")
   except ValueError:
       # error handling goes here; nothing is sent to HBase
       pass

   # when no error occurred, the transaction succeeded

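The difference between the two modes can be made concrete with a toy context
manager. The sketch below is not HappyBase's implementation: ``ToyBatch`` and
``run`` are stand-ins invented for this example, and "sending" simply means
moving mutations to a list. It only mirrors the behaviour described above,
where a transactional batch discards its mutations if the ``with`` block
raises.

```python
class ToyBatch:
    """Toy stand-in for a batch; NOT HappyBase's implementation."""

    def __init__(self, transaction=False):
        self.transaction = transaction
        self.pending = []   # mutations collected so far
        self.sent = []      # mutations that were "sent to the server"

    def put(self, row, data):
        self.pending.append((row, data))

    def send(self):
        self.sent.extend(self.pending)
        self.pending = []

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        if self.transaction and exc_type is not None:
            # Transaction mode: drop everything if the block raised.
            self.pending = []
        else:
            self.send()
        return False  # never swallow the exception


def run(transaction):
    """Raise inside the with block and report how many mutations were sent."""
    batch = ToyBatch(transaction=transaction)
    try:
        with batch as b:
            b.put('row-key-1', {'cf:col1': 'value1'})
            raise ValueError("Something went wrong!")
    except ValueError:
        pass
    return len(batch.sent)


print(run(transaction=False))  # 1: sent despite the error
print(run(transaction=True))   # 0: nothing sent
```
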
As you may have imagined already, a :py:class:`Batch` keeps all mutations in
memory until the batch is sent, either by calling :py:meth:`Batch.send()`
explicitly, or when the ``with`` block ends. This doesn't work for applications
that need to store huge amounts of data, since it may result in batches that
are too big to send in one round-trip, or in batches that use too much memory.
For these cases, the `batch_size` argument can be specified. The `batch_size`
acts as a threshold: a :py:class:`Batch` instance automatically sends all
pending mutations as soon as the number of pending mutations reaches
`batch_size`. For example, this will result in three round-trips to the server
(two batches with 1000 cells, and one with the remaining 400)::

   with table.batch(batch_size=1000) as b:
       for i in range(1200):
           # this put() will result in two mutations (two cells)
           b.put('row-%04d' % i, {'cf1:col1': 'v1',
                                  'cf1:col2': 'v2',})

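The arithmetic behind the three round-trips can be checked with a small
simulation. This is a sketch under one assumption, namely that a flush happens
once the number of pending mutations reaches the threshold and that leftovers
are flushed when the ``with`` block ends; ``simulate_round_trips`` is a
made-up function and does not talk to HBase.

```python
def simulate_round_trips(mutations_per_put, puts, batch_size):
    """Return the size of each flush for a toy threshold-based batch."""
    flushes = []
    pending = 0
    for _ in range(puts):
        pending += mutations_per_put
        if pending >= batch_size:   # threshold reached: flush
            flushes.append(pending)
            pending = 0
    if pending:                     # leftovers are flushed at the end
        flushes.append(pending)
    return flushes


# 1200 puts of two cells each, as in the example above.
print(simulate_round_trips(mutations_per_put=2, puts=1200, batch_size=1000))
# [1000, 1000, 400]
```
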
The appropriate `batch_size` is very application-specific since it depends on
the data size, so just experiment to see how different sizes work for your
specific use case.

Using atomic counters
---------------------

The :py:meth:`Table.counter_inc` and :py:meth:`Table.counter_dec` methods allow
for atomic incrementing and decrementing of 8-byte values, which are
interpreted as big-endian 64-bit signed integers by HBase. Counters are
automatically initialised to 0 upon first use. When incrementing or
decrementing a counter, the value after modification is returned. Example::

   print(table.counter_inc('row-key', 'cf1:counter'))  # prints 1
   print(table.counter_inc('row-key', 'cf1:counter'))  # prints 2
   print(table.counter_inc('row-key', 'cf1:counter'))  # prints 3

   print(table.counter_dec('row-key', 'cf1:counter'))  # prints 2

The optional `value` argument specifies how much to increment or decrement by::

   print(table.counter_inc('row-key', 'cf1:counter', value=3))  # prints 5

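To get a feeling for what an 8-byte, big-endian, signed 64-bit value looks
like as raw bytes, Python's standard ``struct`` module can be used. The
helpers ``encode_counter`` and ``decode_counter`` below are hypothetical names
for this illustration and are not part of the HappyBase API.

```python
import struct


def encode_counter(value):
    # '>q' means big-endian ('>') signed 64-bit integer ('q'): always 8 bytes.
    return struct.pack('>q', value)


def decode_counter(raw):
    (value,) = struct.unpack('>q', raw)
    return value


raw = encode_counter(5)
print(len(raw))             # 8
print(raw)                  # b'\x00\x00\x00\x00\x00\x00\x00\x05'
print(decode_counter(raw))  # 5
```
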
While counters are typically used with the increment and decrement functions
shown above, the :py:meth:`Table.counter_get` and :py:meth:`Table.counter_set`
methods can be used to retrieve or set a counter value directly::

   print(table.counter_get('row-key', 'cf1:counter'))  # prints 5

   table.counter_set('row-key', 'cf1:counter', 12)

Note that an application should *never* :py:meth:`~Table.counter_get` the
current value, modify it in code and then :py:meth:`~Table.counter_set` the
modified value; use the atomic :py:meth:`~Table.counter_inc` and
:py:meth:`~Table.counter_dec` instead!

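The reason for this warning is the classic lost-update problem: two clients
may read the same value before either of them writes. The sketch below replays
such an interleaving by hand. ``ToyCounterStore`` is an in-memory stand-in
invented for this example; its methods take no row or column arguments and it
is not a real HBase client.

```python
class ToyCounterStore:
    """In-memory stand-in for a single counter cell (illustration only)."""

    def __init__(self):
        self.value = 0

    def counter_get(self):
        return self.value

    def counter_set(self, value):
        self.value = value

    def counter_inc(self, amount=1):
        # Stands in for the server-side atomic increment.
        self.value += amount
        return self.value


# Unsafe: both clients read first, then each writes back "its" result.
unsafe = ToyCounterStore()
seen_by_a = unsafe.counter_get()   # client A reads 0
seen_by_b = unsafe.counter_get()   # client B reads 0 before A has written
unsafe.counter_set(seen_by_a + 1)  # A writes 1
unsafe.counter_set(seen_by_b + 1)  # B also writes 1; A's update is lost

# Safe: each client asks the store to do the increment.
safe = ToyCounterStore()
safe.counter_inc()
safe.counter_inc()

print(unsafe.counter_get(), safe.counter_get())  # 1 2
```
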
.. vim: set spell spelllang=en: