For a while now we have been seeing Elasticsearch indexing
quickly fall behind as some log files generated in the gate have become
larger. Currently, we download a full log file into memory and then
emit it line-by-line to be received by a logstash listener. When log
files are large (example: 40M) logstash gets bogged down processing
them.
Instead of downloading full files into memory, we can stream the files
and emit their lines on-the-fly to try to alleviate load on the log
processor.
This:
* Replaces use of urllib2.urlopen with requests using stream=True
* Removes manual decoding of gzip and deflate compression
formats as these are decoded automatically by requests.iter_lines
* Removes unrelated unused imports
* Removes an unused arg 'retry' from the log retrieval method
Change-Id: I6d32036566834da75f3a73f2d086475ef3431165
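A minimal sketch of the streaming approach described above, using
requests with stream=True and iter_lines (the function name and usage
are illustrative, not the actual worker code):

    import requests

    def stream_log_lines(url):
        # Stream the response instead of reading the whole body into memory.
        # requests decodes gzip/deflate automatically based on Content-Encoding.
        resp = requests.get(url, stream=True,
                            headers={'Accept-Encoding': 'gzip, deflate'})
        resp.raise_for_status()
        for line in resp.iter_lines(decode_unicode=True):
            if line:
                yield line

    # for line in stream_log_lines('https://example.org/job-output.txt'):
    #     emit_to_logstash(line)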
This is a thing that puppet has gone back and forth on and now we are on
the wrong side of it. Fix it like we've fixed it elsewhere.
Change-Id: I6d514b2345ff284c57409cc508786b76258d9f4a
Belts and suspenders for cases where content encoding may not be
present. I believe this is possible if the content is served with the
identity encoding. In that case setting the encoding header isn't
required.
Change-Id: If18670d4fd3656a35f818247539b7afad39493e6
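A hedged illustration of the guard this describes: only do
encoding-specific handling when a Content-Encoding header is actually
present, since identity-encoded responses may omit the header entirely
(the helper name is made up for this sketch):

    def response_encoding(headers):
        # With identity encoding the server may send no Content-Encoding
        # header at all, so fall back to 'identity' rather than assuming
        # gzip or deflate.
        return headers.get('Content-Encoding', 'identity').lower()

    # e.g. only run the gzip/deflate decode path when
    # response_encoding(resp.headers) names one of those encodings.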
Zuul gives us the source url to index. Previously we tried to fetch
url + .gz because in many cases we uploaded the file as a gzip file but
logically treated it as unzipped. Now with logs in swift we compress
files without the .gz suffix. This means we should be able to always
fetch the url that zuul provides unmodified.
Depends-On: https://review.opendev.org/678303
Change-Id: I0ea4d9daa905ccb50372b73b5035758fc0963716
The severity filters are passed the entire json event and not just a
string. Update the systemd filter to access the message string out of
the event json dict.
Prior to this we get a type error:
2019-08-19 17:18:48,055 Exception handling log event.
Traceback (most recent call last):
File "/usr/local/bin/log-gearman-worker.py", line 255, in
_handle_event
keep_line = f.process(out_event)
File "/usr/local/bin/log-gearman-worker.py", line 183, in process
m = self.SYSTEMDRE.match(msg)
TypeError: expected string or buffer
Change-Id: I7ab56ac397133f00539d9d3374fa400363ef12d6
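A rough sketch of the fix: process() receives the whole json event
dict, so the message string has to be pulled out of it before the regex
runs (the class, attribute and method names follow the traceback above;
the regex and surrounding code are assumptions):

    import re

    class SystemdSeverityFilter(object):
        # Simplified stand-in; the real pattern matches systemd style lines.
        SYSTEMDRE = re.compile(r'\b(DEBUG|INFO|WARNING|ERROR|CRITICAL)\b')

        def process(self, event):
            # event is a dict like {'message': '...', 'tags': [...]}, not a
            # bare string, which is what triggered the TypeError above.
            msg = event.get('message', '')
            m = self.SYSTEMDRE.search(msg)
            return not (m and m.group(1) == 'DEBUG')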
Now that logs have moved into swift, the os-loganalyze middleware that
stripped DEBUG level logs when the URL was given a ?level= parameter
no longer functions.
We move to filtering DEBUG statements directly. Because services in
devstack now run as systemd services, their log files are actually
journalctl dumps. Thus we add a new filter for systemd style
timestamps and messages (this is loosely based on the zuul log viewer
at [1]).
[1] 8c1f4e9d6b/web/src/actions/logfile.js
Change-Id: I54087c95c809612758139136d5b3e86b1a6372be
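A hedged sketch of such a filter; journalctl dump lines look roughly
like 'Aug 19 17:18:48 host devstack@n-cpu.service[1234]: DEBUG nova ...',
and the exact pattern and behaviour in the real worker may differ:

    import re

    # Loosely: "Mon DD HH:MM:SS host unit[pid]: SEVERITY message"
    SYSTEMD_RE = re.compile(
        r'^(?P<date>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) '
        r'(?P<host>\S+) (?P<unit>\S+?)(\[\d+\])?: '
        r'(?P<severity>DEBUG|INFO|WARNING|ERROR|CRITICAL)? ?(?P<msg>.*)$')

    def keep_line(line):
        # Drop DEBUG lines; keep everything else, including lines that do
        # not match the pattern at all.
        m = SYSTEMD_RE.match(line)
        return not (m and m.group('severity') == 'DEBUG')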
We don't need to worry about the file changing under us any more; this
was all pre-zuul, let alone pre-using-swift for logs. Remove this
workaround.
Change-Id: I5938dcef5550d4c62c8158c5f89ace75ae99aedc
We are now logging to swift and storing the objects as deflate-encoded
data [1]. This means that we get back "Content-Encoding: deflate"
data when downloading the logs (even though we don't advertise that we
accept it). Add a path to the existing code that decodes deflate data
with zlib.
For completeness we also update our Accept-Encoding: header to show we
accept deflate.
[1] 60e7542875/roles/upload-logs-swift/library/zuul_swift_upload.py (L608)
Change-Id: I328bafea3ddae858fd77af043f16c499ddd5a30e
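The zlib path could look roughly like the following; whether the
uploaded objects are zlib-wrapped or raw deflate streams determines the
wbits value, so this sketch tries both:

    import zlib

    def decode_deflate(data):
        # "deflate" in the wild is sometimes zlib-wrapped (RFC 1950) and
        # sometimes a raw deflate stream (RFC 1951); try both forms.
        try:
            return zlib.decompress(data)
        except zlib.error:
            return zlib.decompress(data, -zlib.MAX_WBITS)

    # The request side also advertises the encoding, e.g.
    # Accept-Encoding: gzip, deflate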
This is a mechanically generated change to replace openstack.org
git:// URLs with https:// equivalents.
This is in aid of a planned future move of the git hosting
infrastructure to a self-hosted instance of gitea (https://gitea.io),
which does not support the git wire protocol at this stage.
This update should result in no functional change.
For more information see the thread at
http://lists.openstack.org/pipermail/openstack-discuss/2019-March/003825.html
Change-Id: I218bec35df3cb1ee1a65f05639756a5cfe9a56b6
By default geard only listens on ipv4 0.0.0.0 which means ipv6
connections don't work. Because we run dual stack and things expect ipv6
to work (we have AAAA dns records after all) force geard to listen on ::
which will accept ipv6 and ipv4 connections.
Change-Id: Ibf3bfc5f80ca139b375ee2902dc3149ac791ef96
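The dual-stack behaviour can be illustrated with a plain socket: binding
to :: (with IPV6_V6ONLY disabled, the Linux default) accepts IPv6
connections directly and IPv4 connections as IPv4-mapped addresses,
while 0.0.0.0 only accepts IPv4. This is only an illustration of the
idea, not geard's own code:

    import socket

    sock = socket.socket(socket.AF_INET6, socket.SOCK_STREAM)
    # With V6ONLY off, IPv4 clients appear as mapped addresses
    # (::ffff:a.b.c.d), so one listener serves both address families.
    sock.setsockopt(socket.IPPROTO_IPV6, socket.IPV6_V6ONLY, 0)
    sock.bind(('::', 4730))  # 4730 is the usual gearman port
    sock.listen(5)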
The logstash log processing and subunit2sql tooling is often paired
together on servers. Currently subunit2sql depends on
os-performance-tools, which depends on statsd<3.0. That means our package
resource here for statsd, which tried to install the latest version,
conflicts with our usage of subunit2sql on the same server.
Avoid this conflict by installing 2.1.2 for geard and then subunit2sql
will be happy too.
Change-Id: I3ac04cb93025ae2e2115ed23ba4927c2060f6dc8
This adds severity as a logstash field for every oslo formatted
log line, and does not add any lines which are at DEBUG level.
This means we no longer rely on the level=INFO query parameter
in order to remove DEBUG lines, so we will avoid sending them
to logstash regardless of whether os-loganalyze is used.
Change-Id: I8c4ac76a7fa0c3badd82fc7c54959ef6eb052732
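A hedged sketch of the idea; oslo-formatted lines look roughly like
'2019-08-19 17:18:48.055 1234 DEBUG nova.compute ...', and the real
filter's pattern and field handling may differ:

    import re

    OSLO_RE = re.compile(
        r'^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}(\.\d+)? \d+ '
        r'(?P<severity>DEBUG|INFO|WARNING|ERROR|CRITICAL|TRACE) ')

    def annotate(event):
        # Add a 'severity' field to the logstash event and drop DEBUG
        # lines entirely, so nothing depends on os-loganalyze's ?level=
        # filtering any more.
        m = OSLO_RE.match(event.get('message', ''))
        if not m:
            return event
        if m.group('severity') == 'DEBUG':
            return None  # caller discards the line
        event['severity'] = m.group('severity')
        return event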
The logic in the Gemfile was relying on Zuulv2 variables to find out
whether the spec helper gem was already available on disk, and since
Zuulv3 has changed things it was failing to find it and downloading the
master version instead. This patch ensures the Gemfile looks for the gem
in the right place when running in CI.
Change-Id: If0864708efebd16058707ae102ef1b06b9332c42
We don't use the jenkins log client 0mq events anymore with zuulv3.
Instead zuul jobs submit the log indexing jobs directly to the gearman
queue for log processing. This means we only need a geard to be running
so add support for running just that daemon.
Change-Id: Iedcb5b29875494b8e18fa125adb08ec2e34d0064
Now that we are upgrading to Xenial, we need to take into account
that we're running with systemd and reload it so that it picks up the
new service.
Change-Id: Id02ac2bc51132a8d8d4a77cb05d41fa902765b28
Log files come with many names while still containing the same logical
content. That may be because the path to them differs (eg /var/log/foo.log
and /opt/stack/log/foo.log) or due to file rotations (eg
/var/log/foo.log and /var/log/foo.log.1) or due to compression (eg
/var/log/foo.log and /var/log/foo.log.gz). At the end of the day these
are all the same foo.log log file.
This means when we do machine learning on the log files we can collapse
all these different cases down into a single case that we learn on. This
has become more important since we recently ran out of disk space due
to all the non-unique log paths out there for our log files, but it
should also result in better learning.
Change-Id: I4ba276870b73640909ac469b336a436eb127f611
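A hedged sketch of that normalization: strip leading directories plus
rotation and compression suffixes so the variants above collapse to one
logical name (the exact rules in the real change may differ, e.g. it
may preserve parts of the path):

    import os
    import re

    def logical_name(path):
        name = os.path.basename(path)
        # foo.log.gz, foo.log.1, foo.log.1.gz -> foo.log
        name = re.sub(r'\.(gz|bz2|xz)$', '', name)
        name = re.sub(r'\.\d+$', '', name)
        return name

    # logical_name('/var/log/foo.log.1.gz') == logical_name('/opt/stack/log/foo.log')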
We are having OOM problems with larger log files. Attempt to make this
more robust by having much smaller internal log line message queues (we
reduce the queue size to about 10% of the original size). The idea here is
that if the old 130k entry queue is full and we then grab a large log
file, the overhead is fairly significant, whereas if we have a small 16k
entry queue and grab a large log file we really only have to worry about
the size of the log file itself.
Depends-On: Iddbbab9ea5996df4922bf7927deb8f0354378ab7
Change-Id: I761fabaa1b5aae64790def721980151f9fdc720d
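For illustration, the bounded-queue idea under the sizes quoted above
(the variable names are made up):

    import queue

    # Roughly 10% of the old 130k size: a full queue now holds far fewer
    # buffered lines, so a large log file no longer stacks on top of a
    # huge backlog of already-queued messages.
    log_lines = queue.Queue(maxsize=16384)

    def enqueue(line):
        # put() blocks once the queue is full, applying backpressure to
        # the producer instead of growing memory without bound.
        log_lines.put(line)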
Instead of keeping a local copy of spec_helper_acceptance.rb and
requiring updates to all modules for any change, we can move it into the
common helper gem and require it from there. This will make it easier to
create and review changes that affect all puppet modules. Also change
the Gemfile to look for the gem in the local workspace if running in a
zuul environment.
Change-Id: I39394696ddb30d4c9ab67c7b3598c7a723c89b14
We were previously sending events for every file we attempted to
process, not just those that were processed, and also for every single
log line event. This effectively doubled the io performed by the
logstash workers which seemed to slow the whole pipeline down. Trim it
down to only recording events for log files that are processed which
should significantly trim down the total number of events.
Change-Id: I0daf3eb2e2b3240e3efa4f2c7bac57de99505df0
Previously the mqtt topic generation always assumed a build_change was
present. However there are some cases where there isn't a build_change in
the metadata, like periodic, post, and release jobs. This commit handles
those edge cases so it uses the build queue in the topic instead of the
build_change. If that doesn't work the topic is just the project.
Change-Id: I26dba76e3475749d00a45b076d981778f885c339
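A sketch of that fallback order; the field names and topic layout are
assumptions, not the worker's actual format:

    def generate_topic(base, fields):
        # Prefer the build change, fall back to the build queue (periodic,
        # post and release jobs have no change), then to just the project.
        if fields.get('build_change'):
            return '%s/%s/%s' % (base, fields['project'], fields['build_change'])
        if fields.get('build_queue'):
            return '%s/%s/%s' % (base, fields['project'], fields['build_queue'])
        return '%s/%s' % (base, fields['project'])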
The new paho-mqtt 1.3.0 release brings
https://github.com/eclipse/paho.mqtt.python/commit/0a8cccc
which prevents its use on Ubuntu Trusty's default Python interpreter.
Until we upgrade to a newer Python there, stay on paho-mqtt 1.2.3 so
that things keep working.
Change-Id: I4ffcd8c7906c86a40f3cd8f8d83fb8208944d189
There was bad indentation and a missing '.' in config.get.
_generate_topic() is an object method, not a global function, and it takes
an action argument.
Change-Id: I01c4af83cf98f0d7191041a864618a1608f97647
Add a xenial nodeset and update the spec helper to install puppet 3 from
the Ubuntu repos instead of from puppetlabs.
Change-Id: I0a747faa3b9964227b0f2f2ca4a7bb895eca95e9
This commit adds support to the gearman worker for publishing an mqtt
message when processing a gearman job succeeds or fails. It also adds
a message for when the processor passes the logs to logstash either via
stdout or over a socket. By default this is disabled since it requires
extra configuration to tell the worker how to talk to the mqtt broker.
Depends-On: Id0308d2d4d1843fcca73f459cffa2ae944bebd0c
Change-Id: I43be3562780c61591ebede61f3a8929e8217f199
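A minimal sketch of publishing such a notification with paho-mqtt; the
broker hostname, topic and payload shape are placeholders rather than
the worker's actual configuration:

    import json
    import paho.mqtt.publish as mqtt_publish

    def report(topic, success, build_uuid):
        payload = json.dumps({'build_uuid': build_uuid,
                              'status': 'success' if success else 'failure'})
        # single() connects, publishes one message and disconnects.
        mqtt_publish.single(topic, payload, hostname='mqtt.example.org')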
We had been running at debug level which is incredibly verbose. Remove
the -d flag. This will cause whatever is logged to go to stdout/err,
which should mean that upstart (or whatever init system) will deal with
it for us.
We should properly clean this up so that debug logging is useful again
in the long term.
Change-Id: I613c135ea56507d083df8c66e8846c6fbfa8b2ed
This is needed to support running multiple services on the same host,
otherwise all services will report as running if any single service is
running.
Change-Id: Ie6b7918af846af2189324b0b177b45ac858eadba
As part of the move to logstash 2.0 we are relying on upstream packaging
for logstash. This packaging replaces a lot of the micromanagement of
users and groups and dirs that was done in puppet for logstash. This is
great news because it's less work for us, but it means that the log
processors can't rely on puppet resources for those items, and we don't
actually want to install the logstash package everywhere we run log
processor daemons.
Since the log processors don't need a logstash service running and
actually don't need any of the logstash stuff at all, decouple them
completely and have the log processor daemons use their own user, group,
log dir, config dir, etc. With this in place we can easily switch to
using the logstash packages only where we actually need logstash to be
running.
Change-Id: I2354fbe9d3ab25134c52bfe58f562dfdf9ff6786
log_processor class may be applied to a server along with
other classes also declaring python-daemon as a dependency.
As puppet cannot handle this, add an 'if defined' check.
Change-Id: I40dc68bd93f113912373cb10b376819d30eb3087
This reverts commit b548b141ced9853456bb0c16b90d8ed30d423577.
b548b141ced9853456bb0c16b90d8ed30d423577 was supposed to depend-on
https://review.openstack.org/248868
Change-Id: If3d4ad8a1cd45e6e63155a76dc1477ab38b156e3
The python scripts have been moved to their own project at
openstack-infra/log_processor. Delete the files here and start
installing that project from source. As a part of this split, the
.py extension has been dropped from the filename of the installed
executables.
Change-Id: Ied3025df46b5014a092be0c26e43d4f90699a43f
The node region can be figured out from the build_node very easily and
having a discrete field will make filtering to a single region much
simpler. This commit adds a new metadata field 'node_region' which is
the cloud region that the build_node ran in.
Change-Id: I06bbb62d21871ee61dbfb911143efff376992b98
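For illustration only: assuming build_node names of the form
'<label>-<provider-region>-<serial>' (e.g. devstack-trusty-rax-iad-1234567),
the region could be pulled out roughly like this; the actual parsing in
the change may differ:

    import re

    def node_region(build_node, label='devstack-trusty'):
        # 'devstack-trusty-rax-iad-1234567' -> 'rax-iad'; both the label
        # and the naming scheme here are assumptions for illustration.
        m = re.match(r'^%s-(?P<region>.+)-\d+$' % re.escape(label), build_node)
        return m.group('region') if m else None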
This reverts commit 135ac1809d0182e04c2362458568de14a3cf948d.
EventProcessor was called before being defined. The code also doesn't
look entirely right. Reverting this to fix up the logstash servers.
Change-Id: I2fb8081426646565814090c152d04d7349c16945