Add documentation

Wouter Bolsterlee
2012-05-20 22:28:14 +02:00
parent e7b271b460
commit de4b0ab0b9
9 changed files with 942 additions and 0 deletions

23
TODO.rst Normal file

@@ -0,0 +1,23 @@
.. Note: this list is automatically included in the documentation.
***********************************
To-do list and possible future work
***********************************
This document lists some ideas that the developers thought of, but have not yet
implemented. The topics described below may be implemented (or not) in the
future, depending on time, demand, and technical possibilities.
* Improved error handling instead of just propagating the errors from the
Thrift layer. Maybe wrap the errors in a HappyBase.Error?
* Automatic retries for failed operations (but only those that can be retried)
* Connection pooling (maybe based on PyCassa's ConnectionPool?)
* Thread safety. This involves at least coordinating access to the socket
connection to HBase's Thrift gateway.
* Port HappyBase over to the (still experimental) HBase Thrift2 API when it
becomes mainstream, and expose more of the underlying features nicely in the
HappyBase API.

47
doc/api.rst Normal file

@@ -0,0 +1,47 @@
*****************
API documentation
*****************
.. py:currentmodule:: happybase
This chapter contains detailed API documentation for HappyBase. We suggest
reading the :doc:`tutorial <tutorial>` first to get a general idea of how
HappyBase works.
The HappyBase API is organised as follows:
:py:class:`~happybase.Connection`:
The :py:class:`~happybase.Connection` class is the main entry point for
application developers. It connects to the HBase Thrift server and provides
methods for table management.
:py:class:`~happybase.Table`:
The :py:class:`Table` class is the main class for interacting with data in
tables. This class offers methods for data retrieval and data manipulation.
Instances of this class can be obtained using the
:py:meth:`Connection.table()` method.
:py:class:`~happybase.Batch`:
The :py:class:`Batch` class implements the batch API for data manipulation,
and is available through the :py:meth:`Table.batch()` method.
Connection
==========
.. autoclass:: happybase.Connection
Table
=====
.. autoclass:: happybase.Table
Batch
=====
.. autoclass:: happybase.Batch
.. vim: set spell spelllang=en:

244
doc/conf.py Normal file

@@ -0,0 +1,244 @@
# -*- coding: utf-8 -*-
#
# HappyBase documentation build configuration file, created by
# sphinx-quickstart on Tue Mar 20 17:40:16 2012.
#
# This file is execfile()d with the current directory set to its containing dir.
#
# Note that not all possible configuration values are present in this
# autogenerated file.
#
# All configuration values have a default; values that are commented out
# serve to show the default.
import sys, os
# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
#sys.path.insert(0, os.path.abspath('.'))
# -- General configuration -----------------------------------------------------
# If your documentation needs a minimal Sphinx version, state it here.
#needs_sphinx = '1.0'
# Add any Sphinx extension module names here, as strings. They can be extensions
# coming with Sphinx (named 'sphinx.ext.*') or your custom ones.
extensions = ['sphinx.ext.autodoc', 'sphinx.ext.coverage']
# Add any paths that contain templates here, relative to this directory.
templates_path = ['_templates']
# The suffix of source filenames.
source_suffix = '.rst'
# The encoding of source files.
#source_encoding = 'utf-8-sig'
# The master toctree document.
master_doc = 'index'
# General information about the project.
project = u'HappyBase'
copyright = u'2012'
# The version info for the project you're documenting, acts as replacement for
# |version| and |release|, also used in various other places throughout the
# built documents.
#
# The short X.Y version.
version = '0.1'
# The full version, including alpha/beta/rc tags.
release = '0.1'
# The language for content autogenerated by Sphinx. Refer to documentation
# for a list of supported languages.
#language = None
# There are two options for replacing |today|: either, you set today to some
# non-false value, then it is used:
#today = ''
# Else, today_fmt is used as the format for a strftime call.
#today_fmt = '%B %d, %Y'
# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
exclude_patterns = []
# The reST default role (used for this markup: `text`) to use for all documents.
#default_role = None
# If true, '()' will be appended to :func: etc. cross-reference text.
#add_function_parentheses = True
# If true, the current module name will be prepended to all description
# unit titles (such as .. function::).
#add_module_names = True
# If true, sectionauthor and moduleauthor directives will be shown in the
# output. They are ignored by default.
#show_authors = False
# The name of the Pygments (syntax highlighting) style to use.
pygments_style = 'sphinx'
# A list of ignored prefixes for module index sorting.
#modindex_common_prefix = []
autodoc_default_flags = ['members', 'undoc-members']
autodoc_member_order = 'bysource'
# -- Options for HTML output ---------------------------------------------------
# The theme to use for HTML and HTML Help pages. See the documentation for
# a list of builtin themes.
html_theme = 'default'
# Theme options are theme-specific and customize the look and feel of a theme
# further. For a list of options available for each theme, see the
# documentation.
#html_theme_options = {}
# Add any paths that contain custom themes here, relative to this directory.
#html_theme_path = []
# The name for this set of Sphinx documents. If None, it defaults to
# "<project> v<release> documentation".
#html_title = None
# A shorter title for the navigation bar. Default is the same as html_title.
#html_short_title = None
# The name of an image file (relative to this directory) to place at the top
# of the sidebar.
#html_logo = None
# The name of an image file (within the static path) to use as favicon of the
# docs. This file should be a Windows icon file (.ico) being 16x16 or 32x32
# pixels large.
#html_favicon = None
# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
html_static_path = ['_static']
# If not '', a 'Last updated on:' timestamp is inserted at every page bottom,
# using the given strftime format.
#html_last_updated_fmt = '%b %d, %Y'
# If true, SmartyPants will be used to convert quotes and dashes to
# typographically correct entities.
#html_use_smartypants = True
# Custom sidebar templates, maps document names to template names.
#html_sidebars = {}
# Additional templates that should be rendered to pages, maps page names to
# template names.
#html_additional_pages = {}
# If false, no module index is generated.
#html_domain_indices = True
# If false, no index is generated.
#html_use_index = True
# If true, the index is split into individual pages for each letter.
#html_split_index = False
# If true, links to the reST sources are added to the pages.
#html_show_sourcelink = True
# If true, "Created using Sphinx" is shown in the HTML footer. Default is True.
#html_show_sphinx = True
# If true, "(C) Copyright ..." is shown in the HTML footer. Default is True.
#html_show_copyright = True
# If true, an OpenSearch description file will be output, and all pages will
# contain a <link> tag referring to it. The value of this option must be the
# base URL from which the finished HTML is served.
#html_use_opensearch = ''
# This is the file name suffix for HTML files (e.g. ".xhtml").
#html_file_suffix = None
# Output file base name for HTML help builder.
htmlhelp_basename = 'HappyBasedoc'
# -- Options for LaTeX output --------------------------------------------------
latex_elements = {
# The paper size ('letterpaper' or 'a4paper').
#'papersize': 'letterpaper',
# The font size ('10pt', '11pt' or '12pt').
#'pointsize': '10pt',
# Additional stuff for the LaTeX preamble.
#'preamble': '',
}
# Grouping the document tree into LaTeX files. List of tuples
# (source start file, target name, title, author, documentclass [howto/manual]).
latex_documents = [
('index', 'HappyBase.tex', u'HappyBase Documentation',
u' ', 'manual'),
]
# The name of an image file (relative to this directory) to place at the top of
# the title page.
#latex_logo = None
# For "manual" documents, if this is true, then toplevel headings are parts,
# not chapters.
#latex_use_parts = False
# If true, show page references after internal links.
#latex_show_pagerefs = False
# If true, show URL addresses after external links.
#latex_show_urls = False
# Documents to append as an appendix to all manuals.
#latex_appendices = []
# If false, no module index is generated.
#latex_domain_indices = True
# -- Options for manual page output --------------------------------------------
# One entry per manual page. List of tuples
# (source start file, name, description, authors, manual section).
man_pages = [
('index', 'happybase', u'HappyBase Documentation',
[u' '], 1)
]
# If true, show URL addresses after external links.
#man_show_urls = False
# -- Options for Texinfo output ------------------------------------------------
# Grouping the document tree into Texinfo files. List of tuples
# (source start file, target name, title, author,
# dir menu entry, description, category)
texinfo_documents = [
('index', 'HappyBase', u'HappyBase Documentation',
u' ', 'HappyBase', 'One line description of project.',
'Miscellaneous'),
]
# Documents to append as an appendix to all manuals.
#texinfo_appendices = []
# If false, no module index is generated.
#texinfo_domain_indices = True
# How to display URL addresses: 'footnote', 'no', or 'inline'.
#texinfo_show_urls = 'footnote'

27
doc/index.rst Normal file

@@ -0,0 +1,27 @@
*********
HappyBase
*********
.. include:: ../README.rst
.. rubric:: Table of contents
.. toctree::
:maxdepth: 1
introduction
installation
tutorial
api
todo
license
.. rubric:: Indices and tables
* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`
.. vim: set spell spelllang=en:

67
doc/installation.rst Normal file

@@ -0,0 +1,67 @@
************
Installation
************
This guide describes how to install HappyBase.
.. contents:: On this page
:local:
Setting up a virtual environment
================================
The recommended way to install HappyBase and Thrift is to use a virtual
environment created by `virtualenv`. Set up and activate a new virtual
environment like this:
.. code-block:: sh
$ virtualenv envname
$ source envname/bin/activate
If you use the `virtualenvwrapper` scripts, type this instead:
.. code-block:: sh
$ mkvirtualenv envname
Installing packages
===================
The next step is to install the Thrift package for Python:
.. code-block:: sh
(envname) $ pip install thrift
…and the HappyBase package:
.. code-block:: sh
(envname) $ cd /path/to/happybase/
(envname) $ python setup.py install
.. note::
Generating and installing the HBase Thrift Python modules (using ``thrift
--gen py`` on the ``.thrift`` file) is not necessary, since HappyBase
bundles pregenerated versions of those modules.
Testing the installation
========================
Verify that the packages are installed correctly by starting a ``python`` shell
and entering the following statements::
>>> import thrift
>>> import happybase
If you don't see any errors, the installation was successful. Congratulations!
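The same check can also be scripted. The snippet below is only a sketch: it assumes a Python 3 interpreter (``importlib.util`` is not available in Python 2), and ``is_installed`` is a hypothetical helper name, not part of HappyBase.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found on the import path."""
    return importlib.util.find_spec(module_name) is not None


# Report the status of the two packages this guide installs.
for name in ('thrift', 'happybase'):
    status = 'ok' if is_installed(name) else 'MISSING'
    print('%-10s %s' % (name, status))
```

Unlike the interactive check, this does not import the packages; it only reports whether they can be located.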
Now that you have HappyBase installed on your machine, continue with the
:doc:`tutorial <tutorial>` to learn how to use it.
.. vim: set spell spelllang=en:

114
doc/introduction.rst Normal file

@@ -0,0 +1,114 @@
************
Introduction
************
.. py:currentmodule:: happybase
.. contents:: On this page
:local:
What is HappyBase?
==================
.. include:: ../README.rst
HappyBase is designed for use in standard HBase setups, and offers
application developers a Pythonic API to interact with HBase.
Below the surface, HappyBase uses the `Python Thrift library
<http://pypi.python.org/pypi/thrift>`_ to connect to HBase's `Thrift
<http://thrift.apache.org/>`_ gateway, which is included in the standard HBase
0.9x releases. HappyBase hides most of the details of the underlying RPC
mechanisms, resulting in application code that is cleaner, more productive to
write, and more maintainable.
What does code using HappyBase look like?
=========================================
The example below illustrates basic usage of the library::
import happybase
connection = happybase.Connection('hostname')
table = connection.table('table-name')
table.put('row-key', {'family:qual1': 'value1',
'family:qual2': 'value2'})
row = table.row('row-key')
print row['family:qual1'] # prints 'value1'
for key, data in table.rows(['row-key-1', 'row-key-2']):
print key, data # prints row key and data for each row
for key, data in table.scan(row_prefix='row'):
print key, data # prints row key and data for each matching row
table.delete('row-key')
Note that the :doc:`tutorial <tutorial>` contains many more examples.
Why not use the HBase Thrift API directly?
==========================================
You may consider using the HBase Thrift API directly instead of adding yet
another library to your project. After all, :pep:`20` taught us that simple is
better than complex, and that there should be one, and preferably only one,
obvious way to do it, right? Well, we agree.
While the HBase Thrift API can be used directly from Python using the
(automatically generated) HBase Thrift service classes, application code using
this API is verbose, cumbersome, and hence error-prone. The reason for this is
that the HBase Thrift API is a flat, language-agnostic interface API closely
tied to the RPC going over the wire-level protocol. This means that
applications need to deal with many imports, sockets, transports, protocols,
clients, Thrift types and mutation objects. For instance, look at the code
required to connect to HBase and store two values::
from thrift import Thrift
from thrift.transport import TSocket, TTransport
from thrift.protocol import TBinaryProtocol
from hbase import ttypes
from hbase.Hbase import Client, Mutation
sock = TSocket.TSocket('hostname', 9090)
transport = TTransport.TBufferedTransport(sock)
protocol = TBinaryProtocol.TBinaryProtocol(transport)
client = Client(protocol)
transport.open()
mutations = [Mutation(column='family:qual1', value='value1'),
Mutation(column='family:qual2', value='value2')]
client.mutateRow('table-name', 'row-key', mutations)
HappyBase hides all the Thrift cruft below a friendly API, and makes the task
in the example above look like this::
import happybase
connection = happybase.Connection('hostname')
table = connection.table('table-name')
table.put('row-key', {'family:qual1': 'value1',
'family:qual2': 'value2'})
Hopefully this example makes it clear that you will be a lot happier using
HappyBase than using the Thrift API directly. If you still have doubts about
this, try to accomplish some other common tasks, e.g. retrieving rows and
scanning over a part of a table, and compare that with the really-easy-to-use
HappyBase equivalents. If you're still not convinced by then, we're sorry to
inform you that HappyBase is not the project for you, and we wish you the best
of luck maintaining your code (or is it Thrift boilerplate?) while your
application evolves.
How do I get started?
=====================
Follow the :doc:`installation guide <installation>` and read the :doc:`tutorial
<tutorial>`.
.. vim: set spell spelllang=en:

5
doc/license.rst Normal file

@@ -0,0 +1,5 @@
*******
License
*******
.. include:: ../LICENSE.rst

1
doc/todo.rst Normal file

@@ -0,0 +1 @@
.. include:: ../TODO.rst

414
doc/tutorial.rst Normal file

@@ -0,0 +1,414 @@
********
Tutorial
********
.. py:currentmodule:: happybase
This tutorial explores the HappyBase API and should provide you with enough
information to get you started. Note that this tutorial is intended as an
introduction to HappyBase, not to HBase in general. Readers should already have
a basic understanding of HBase and its data model.
While the tutorial does cover most features, it is not a complete reference
guide. More information about the HappyBase API is available from the :doc:`API
documentation <api>`.
.. contents:: On this page
:local:
Opening a :py:class:`Connection`
================================
We'll get started by connecting to HBase::
import happybase
connection = happybase.Connection('somehost')
When a :py:class:`Connection` instance is created, it automatically opens a
socket connection to the HBase Thrift server. This behaviour can be disabled by
setting the `autoconnect` argument to `False`, and opening the connection
manually using :py:meth:`Connection.open`::
connection = happybase.Connection('somehost', autoconnect=False)
# before first use:
connection.open()
The :py:class:`Connection` class provides various methods to interact with the
HBase instance. For instance, we can ask for the names of the available
tables using the :py:meth:`Connection.tables` method::
print connection.tables()
If a single HBase instance is used by multiple applications, table name
collisions may occur because applications use the same table names. A solution
is to add a namespace prefix to the names of all tables owned by a specific
application. Instead of adding this application-specific prefix each time a
table name is passed to HappyBase, the `table_prefix` parameter can be used.
HappyBase will prepend that prefix (and an underscore) to each table name
handled by the :py:class:`Connection` instance. So, for a project ``myproject``
that should have table names that look like ``myproject_XYZ``, use this::
connection = happybase.Connection('somehost', table_prefix='myproject')
:py:meth:`Connection.tables` no longer includes tables in other namespaces;
it only returns tables with a ``myproject_`` prefix in HBase, and also
strips off the prefix::
print connection.tables() # Table "myproject_XYZ" in HBase will be
# returned as simply "XYZ"
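To make the prefix handling concrete, here is a small stand-alone sketch of the naming rule described above. It does not use HappyBase itself; ``add_prefix`` and ``visible_tables`` are hypothetical helpers that only mimic the documented behaviour (prefix plus underscore on the way in, filtering and stripping on the way out).

```python
def add_prefix(prefix, name):
    """Build the HBase table name for an application-level table name."""
    return prefix + '_' + name


def visible_tables(prefix, hbase_tables):
    """Keep only tables in the given namespace and strip the prefix."""
    marker = prefix + '_'
    return [t[len(marker):] for t in hbase_tables if t.startswith(marker)]


all_tables = ['myproject_XYZ', 'myproject_users', 'otherproject_XYZ']
print(add_prefix('myproject', 'XYZ'))           # myproject_XYZ
print(visible_tables('myproject', all_tables))  # ['XYZ', 'users']
```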
The :py:class:`Connection` class offers various other methods to interact with
HBase, mostly to perform table management tasks like enabling and disabling
tables. This tutorial does not cover those; the :doc:`API documentation <api>`
for the :py:class:`Connection` class contains more information.
Obtaining a :py:class:`Table` instance
======================================
The :py:class:`Table` class provides the main API to retrieve and manipulate
data in HBase. In the example above, we already asked for the available tables
using the :py:meth:`Connection.tables` method, so the next step is to obtain a
:py:class:`.Table` instance. This is done by calling
:py:meth:`Connection.table` with the name of the table::
table = connection.table('mytable')
Obtaining a :py:class:`Table` instance does *not* result in a round-trip to the
Thrift server, which means application code may ask the :py:class:`Connection`
instance for a new :py:class:`Table` whenever it needs one, without negative
performance consequences. A side effect is that no check is done to ensure that
the table exists, since that would involve a round-trip, so expect errors if
you try to interact with non-existing tables later in your code. For this
tutorial, we assume the table exists.
.. note::
The heavy `HTable` HBase class from the Java HBase API, which does the
real communication with the region servers, is at the other side of the
Thrift connection. There is no direct mapping between :py:class:`Table`
instances on the Python side and `HTable` instances on the server side.
Retrieving data
===============
The HBase data model is a multidimensional sparse map. A table in HBase
contains column families with column qualifiers containing a value and a
timestamp. In most of the HappyBase API, column family and qualifier names are
specified as a single string, e.g. ``cf1:col1``, and not as two separate
arguments. While column families and qualifiers are different concepts in the
HBase data model, they are almost always used together when interacting with
data, so treating them as a single string makes the API a lot simpler.
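As a rough mental model, the "multidimensional sparse map" can be pictured as nested dictionaries. The sketch below is an in-memory toy, not how HBase or HappyBase stores anything; ``table_data``, ``split_column`` and ``latest`` are names invented for this illustration.

```python
# Toy picture of one table:
# row key -> 'family:qualifier' -> {timestamp: value}
table_data = {
    'row-key-1': {
        'cf1:col1': {100: 'old value', 200: 'new value'},
        'cf1:col2': {150: 'other'},
    },
    'row-key-2': {
        'cf2:col9': {120: 'sparse'},  # rows need not share columns
    },
}


def split_column(column):
    """Split a 'family:qualifier' string into its two parts."""
    family, _, qualifier = column.partition(':')
    return family, qualifier


def latest(row_key):
    """Return {column: newest value} for one row of the toy table."""
    return {col: versions[max(versions)]
            for col, versions in table_data[row_key].items()}


print(split_column('cf1:col1'))  # ('cf1', 'col1')
print(latest('row-key-1'))
```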
Retrieving rows
---------------
The :py:class:`Table` class offers various methods to retrieve data from a
table in HBase. The most basic one is :py:meth:`Table.row`, which retrieves a
single row from the table, and returns it as a dictionary mapping columns to
values::
row = table.row('row-key')
print row['cf1:col1'] # prints the value of cf1:col1
The :py:meth:`Table.rows` method works just like :py:meth:`Table.row`, but
takes multiple row keys and returns those as `(key, data)` tuples::
rows = table.rows(['row-key-1', 'row-key-2'])
for key, data in rows:
print key, data
If you want the results that :py:meth:`Table.rows` returns as a dictionary or
ordered dictionary, you will have to do this yourself. This is really easy
though, since the return value can be passed directly to the dictionary
constructor. For a normal dictionary, order is lost::
rows_as_dict = dict(table.rows(['row-key-1', 'row-key-2']))
…whereas for a :py:class:`OrderedDict`, order is preserved::
from collections import OrderedDict
rows_as_ordered_dict = OrderedDict(table.rows(['row-key-1', 'row-key-2']))
Making more fine-grained selections
-----------------------------------
HBase's data model allows for more fine-grained selections of the data to
retrieve. If you know beforehand which columns are needed, performance can be
improved by specifying those columns explicitly to :py:meth:`Table.row` and
:py:meth:`Table.rows`. The `columns` argument takes a list (or tuple) of column
names::
row = table.row('row-key', columns=['cf1:col1', 'cf1:col2'])
print row['cf1:col1']
print row['cf1:col2']
Instead of providing both a column family and a column qualifier, items in the
`columns` argument may also be just a column family, which means that all
columns from that column family will be retrieved. For example, to get all
columns and values in the column family `cf1`, use this::
row = table.row('row-key', columns=['cf1'])
In HBase, each cell has a timestamp attached to it. In case you don't want to
work with the latest version of data stored in HBase, the methods that retrieve
data from the database, e.g. :py:meth:`Table.row`, all accept a `timestamp`
argument that specifies that the results should be restricted to values with a
timestamp up to the specified timestamp::
row = table.row('row-key', timestamp=123456789)
By default, HappyBase does not include timestamps in the results it returns. If
your application needs access to the timestamps, simply set the
`include_timestamp` parameter to ``True``. Now, each cell in the result will be
returned as a `(value, timestamp)` tuple instead of just a value::
row = table.row('row-key', columns=['cf1:col1'], include_timestamp=True)
value, timestamp = row['cf1:col1']
HBase supports storing multiple versions of the same cell. This can be
configured for each column family. To retrieve all versions of a column for a
given row, :py:meth:`Table.cells` can be used. This method returns an ordered
list of cells, with the most recent version coming first. The `versions`
argument specifies the maximum number of versions to return. Just like the
methods that retrieve rows, the `include_timestamp` argument determines whether
timestamps are included in the result. Example::
values = table.cells('row-key', 'cf1:col1', versions=2)
for value in values:
print "Cell data: %s" % value
cells = table.cells('row-key', 'cf1:col1', versions=3, include_timestamp=True)
for value, timestamp in cells:
print "Cell data at %d: %s" % (timestamp, value)
Note that the result may contain fewer cells than requested. The cell may just
have fewer versions, or you may have requested more versions than HBase keeps
for the column family.
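The shape of the result, and why it may be shorter than requested, can be illustrated with a toy model. This is only a sketch over an in-memory dictionary; the ``cells`` function below is a stand-in written for this example and is not the :py:meth:`Table.cells` implementation.

```python
def cells(versions_by_timestamp, versions=None, include_timestamp=False):
    """Return stored versions of one cell, most recent first."""
    ordered = sorted(versions_by_timestamp.items(), reverse=True)
    if versions is not None:
        ordered = ordered[:versions]  # at most `versions` entries
    if include_timestamp:
        return [(value, ts) for ts, value in ordered]
    return [value for ts, value in ordered]


stored = {100: 'v1', 200: 'v2'}  # only two versions exist

print(cells(stored, versions=3))  # ['v2', 'v1']
print(cells(stored, versions=1, include_timestamp=True))  # [('v2', 200)]
```

Asking for three versions yields only two values here, because the toy cell holds just two.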
Scanning over rows in a table
-----------------------------
In addition to retrieving data for known row keys, rows in HBase can be
efficiently iterated over using a table scanner, created using
:py:meth:`Table.scan`. A basic scanner that iterates over all rows in the table
looks like this::
for key, data in table.scan():
print key, data
Doing full table scans like in the example above is prohibitively expensive in
practice. Scans can be restricted in several ways to make more selective range
queries. One way is to specify start or stop keys, or both. To iterate over all
rows from row `aaa` to the end of the table::
for key, data in table.scan(row_start='aaa'):
print key, data
To iterate over all rows from the start of the table up to row `xyz`, use this::
for key, data in table.scan(row_stop='xyz'):
print key, data
To iterate over all rows between row `aaa` (included) and `xyz` (not included),
supply both::
for key, data in table.scan(row_start='aaa', row_stop='xyz'):
print key, data
An alternative is to use a key prefix. For example, to iterate over all rows
starting with `abc`::
for key, data in table.scan(row_prefix='abc'):
print key, data
The scanner examples above only limit the results by row key using the
`row_start`, `row_stop`, and `row_prefix` arguments, but scanners can also
limit results to certain columns, column families, and timestamps, just like
:py:meth:`Table.row` and :py:meth:`Table.rows`. For advanced users, a filter
string can be passed as the `filter` argument. Additionally, the optional
`limit` argument sets the maximum number of rows to retrieve, and the
`batch_size` argument specifies how big the transferred chunks should be. The
:py:meth:`Table.scan` API documentation provides more information on the
supported scanner options.
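The row-key ranges selected by ``row_start``, ``row_stop`` and ``row_prefix`` can be pictured with a sorted list of keys. The sketch below is a toy model: ``scan`` and ``prefix_stop`` are hypothetical functions written for this illustration, and turning a prefix into a start/stop pair is one common approach, not necessarily what HappyBase does internally. It assumes Python 3 byte strings.

```python
from bisect import bisect_left


def prefix_stop(prefix):
    """Smallest key that sorts after every key starting with `prefix`."""
    data = bytearray(prefix)
    # Trailing 0xff bytes cannot be incremented, so drop them first.
    while data and data[-1] == 0xFF:
        data.pop()
    if not data:
        return None  # no upper bound exists
    data[-1] += 1
    return bytes(data)


def scan(sorted_keys, row_start=None, row_stop=None, row_prefix=None):
    """Return the keys in [row_start, row_stop); the stop key is excluded."""
    if row_prefix is not None:
        row_start, row_stop = row_prefix, prefix_stop(row_prefix)
    lo = 0 if row_start is None else bisect_left(sorted_keys, row_start)
    hi = (len(sorted_keys) if row_stop is None
          else bisect_left(sorted_keys, row_stop))
    return sorted_keys[lo:hi]


keys = sorted([b'aaa', b'abc', b'abc1', b'abd', b'xyz', b'zzz'])

print(scan(keys, row_start=b'aaa', row_stop=b'xyz'))
print(scan(keys, row_prefix=b'abc'))
```

The first call returns every key from ``aaa`` up to, but not including, ``xyz``; the second returns only ``abc`` and ``abc1``.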
Manipulating data
=================
In HBase, all mutations either store data or mark data for deletion; there is
no such thing as an `update`. HappyBase provides methods to do single inserts
or deletes, and also a batch API for bulk mutations.
Storing data
------------
To store data in a row of our table, we can use :py:meth:`Table.put`,
which takes the row key and the data to store. The data should be a dictionary
mapping column names to values::
table.put('row-key', {'cf:col1': 'value1',
'cf:col2': 'value2'})
Use the `timestamp` argument if you want to provide timestamps explicitly::
table.put('row-key', {'cf:col1': 'value1'}, timestamp=123456789)
If omitted, HBase defaults to the current system time.
Deleting data
-------------
The :py:meth:`Table.delete` method deletes data from a table. To delete a
complete row, just specify the row key::
table.delete('row-key')
To delete one or more columns instead of a complete row, also specify the
`columns` argument::
table.delete('row-key', columns=['cf1:col1', 'cf1:col2'])
The optional `timestamp` argument restricts the delete operation to data up to
the specified timestamp.
Performing batch mutations
--------------------------
The :py:meth:`Table.put` and :py:meth:`Table.delete` methods both issue a
command to the HBase Thrift server immediately. This means that using these
methods is not very efficient when storing or deleting multiple values. It is
much more efficient to aggregate a bunch of commands and send them to the
server in one go. This is exactly what the :py:class:`Batch` class, created
using :py:meth:`Table.batch`, does. A :py:class:`Batch` instance has put and
delete methods, just like the :py:class:`Table` class, but the changes are sent
to the server in a single round-trip using :py:meth:`Batch.send`::
b = table.batch()
b.put('row-key-1', {'cf:col1': 'value1', 'cf:col2': 'value2'})
b.put('row-key-2', {'cf:col2': 'value2', 'cf:col3': 'value3'})
b.put('row-key-3', {'cf:col3': 'value3', 'cf:col4': 'value4'})
b.delete('row-key-4')
b.send()
.. note::
Storing and deleting data for the same row key in a single batch leads to
unpredictable results, so don't do that.
While the methods on the :py:class:`Batch` instance resemble the
:py:meth:`~Table.put` and :py:meth:`~Table.delete` methods, they do not take a
`timestamp` argument for each mutation. Instead, you can specify a single
`timestamp` argument for the complete batch::
b = table.batch(timestamp=123456789)
b.put(...)
b.delete(...)
b.send()
:py:class:`Batch` instances can be used as *context managers*, which are most
useful in combination with Python's ``with`` construct. The example above can
be simplified to read::
with table.batch() as b:
b.put('row-key-1', {'cf:col1': 'value1', 'cf:col2': 'value2'})
b.put('row-key-2', {'cf:col2': 'value2', 'cf:col3': 'value3'})
b.put('row-key-3', {'cf:col3': 'value3', 'cf:col4': 'value4'})
b.delete('row-key-4')
As you can see, there is no call to :py:meth:`Batch.send` anymore. The batch is
automatically applied when the ``with`` code block terminates, even in case of
errors somewhere in the ``with`` block, so it behaves basically the same as a
``try/finally`` clause. However, some applications require transactional
behaviour, sending the batch only if no exception occurred. Without a context
manager this would look something like this::
b = table.batch()
try:
b.put('row-key-1', {'cf:col1': 'value1', 'cf:col2': 'value2'})
b.put('row-key-2', {'cf:col2': 'value2', 'cf:col3': 'value3'})
b.put('row-key-3', {'cf:col3': 'value3', 'cf:col4': 'value4'})
b.delete('row-key-4')
raise ValueError("Something went wrong!")
except ValueError as e:
# error handling goes here; nothing is sent to HBase
pass
else:
# no exceptions; send data
b.send()
Obtaining the same behaviour is easier using a ``with`` block. The
`transaction` argument to :py:meth:`Table.batch` is all you need::
try:
with table.batch(transaction=True) as b:
b.put('row-key-1', {'cf:col1': 'value1', 'cf:col2': 'value2'})
b.put('row-key-2', {'cf:col2': 'value2', 'cf:col3': 'value3'})
b.put('row-key-3', {'cf:col3': 'value3', 'cf:col4': 'value4'})
b.delete('row-key-4')
raise ValueError("Something went wrong!")
except ValueError:
# error handling goes here; nothing is sent to HBase
pass
# when no error occurred, the transaction succeeded
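The difference between the two modes comes down to what happens when the ``with`` block exits. The following sketch uses a toy, in-memory class to show that decision; ``ToyBatch`` and ``run`` are invented for this example and make no claim about HappyBase's actual implementation.

```python
class ToyBatch(object):
    """In-memory stand-in for a batch; `sent` records what would go out."""

    def __init__(self, transaction=False):
        self.transaction = transaction
        self.pending = []
        self.sent = []

    def put(self, row, data):
        self.pending.append(('put', row, data))

    def delete(self, row):
        self.pending.append(('delete', row))

    def send(self):
        self.sent.extend(self.pending)
        self.pending = []

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # In transaction mode, nothing is sent if the block raised.
        if self.transaction and exc_type is not None:
            return False  # let the exception propagate
        self.send()
        return False


def run(transaction):
    """Run a failing block and report how many mutations were sent."""
    batch = ToyBatch(transaction=transaction)
    try:
        with batch as b:
            b.put('row-key-1', {'cf:col1': 'value1'})
            raise ValueError("Something went wrong!")
    except ValueError:
        pass
    return len(batch.sent)


print(run(transaction=False))  # 1: sent despite the error
print(run(transaction=True))   # 0: nothing sent
```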
As you may have imagined already, a :py:class:`Batch` keeps all mutations in
memory until the batch is sent, either by calling :py:meth:`Batch.send()`
explicitly, or when the ``with`` block ends. This doesn't work for applications
that need to store huge amounts of data, since it may result in batches that
are too big to send in one round-trip, or in batches that use too much memory.
For these cases, the `batch_size` argument can be specified. The `batch_size`
acts as a threshold: a :py:class:`Batch` instance automatically sends all
pending mutations once the number of pending operations reaches `batch_size`. For
example, this will result in three round-trips to the server (two batches with
1000 cells, and one with the remaining 400)::
with table.batch(batch_size=1000) as b:
for i in range(1200):
# this put() will result in two mutations (two cells)
b.put('row-%04d' % i, {'cf1:col1': 'v1',
'cf1:col2': 'v2',})
The appropriate `batch_size` is very application-specific since it depends on
the data size, so just experiment to see how different sizes work for your
specific use case.
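The arithmetic behind the three round-trips can be checked with a small counting model. This is a sketch under one assumption: the batch is flushed as soon as the pending mutation count reaches `batch_size`, which is what the numbers in the example above imply. ``CountingBatch`` is a toy class written for this illustration.

```python
class CountingBatch(object):
    """Counts mutations per simulated round-trip; sends nothing for real."""

    def __init__(self, batch_size):
        self.batch_size = batch_size
        self.pending = 0
        self.round_trips = []  # number of mutations in each send

    def put(self, row, data):
        self.pending += len(data)  # one mutation per cell
        if self.pending >= self.batch_size:
            self.send()

    def send(self):
        if self.pending:
            self.round_trips.append(self.pending)
            self.pending = 0


batch = CountingBatch(batch_size=1000)
for i in range(1200):
    batch.put('row-%04d' % i, {'cf1:col1': 'v1', 'cf1:col2': 'v2'})
batch.send()  # what leaving the ``with`` block would trigger

print(batch.round_trips)  # [1000, 1000, 400]
```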
Using atomic counters
---------------------
The :py:meth:`Table.counter_inc` and :py:meth:`Table.counter_dec` methods allow
for atomic incrementing and decrementing of 8-byte wide values, which are
interpreted as big-endian 64-bit signed integers by HBase. Counters are
automatically initialised to 0 upon first use. When incrementing or
decrementing a counter, the value after modification is returned. Example::
print table.counter_inc('row-key', 'cf1:counter') # prints 1
print table.counter_inc('row-key', 'cf1:counter') # prints 2
print table.counter_inc('row-key', 'cf1:counter') # prints 3
print table.counter_dec('row-key', 'cf1:counter') # prints 2
The optional `value` argument specifies how much to increment or decrement by::
print table.counter_inc('row-key', 'cf1:counter', value=3) # prints 5
While counters are typically used with the increment and decrement functions
shown above, the :py:meth:`Table.counter_get` and :py:meth:`Table.counter_set`
methods can be used to retrieve or set a counter value directly::
print table.counter_get('row-key', 'cf1:counter') # prints 5
table.counter_set('row-key', 'cf1:counter', 12)
Note that an application should *never* :py:meth:`~Table.counter_get` the
current value, modify it in code and then :py:meth:`~Table.counter_set` the
modified value; use the atomic :py:meth:`~Table.counter_inc` and
:py:meth:`~Table.counter_dec` instead!
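For reference, the "big-endian 64-bit signed integer" layout mentioned above can be reproduced with Python's ``struct`` module. This is only a sketch of the byte format; ``encode_counter`` and ``decode_counter`` are hypothetical helpers, and Python 3 byte strings are assumed.

```python
import struct


def encode_counter(value):
    """Pack an integer as an 8-byte big-endian signed value."""
    return struct.pack('>q', value)


def decode_counter(raw):
    """Unpack an 8-byte big-endian signed value into an integer."""
    return struct.unpack('>q', raw)[0]


raw = encode_counter(5)
print(len(raw))             # 8
print(decode_counter(raw))  # 5
```

This matters mostly when inspecting a counter cell through a regular read, where the value appears as raw bytes rather than as a number.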
.. vim: set spell spelllang=en: