poppy/hadoop/log_delivery.pig
Sriram Madapusi Vasudevan 92577d2272 feat: add log delivery pig script
- The Hadoop script splits up the provider's logs that are
  piped into it, based on those domains that have log delivery enabled.

- README.rst contains instructions on how the script is meant to be
  used.

Implements: blueprint log-delivery

Change-Id: I4434175bead26e9b78a3115038af55b25a62163c
2015-06-12 11:39:45 -04:00

12 lines
808 B
Pig

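-- Register piggybank, which provides the MultiStorage store function used below.
REGISTER /usr/lib/pig/piggybank.jar;
-- Load the tab-separated, gzip-compressed provider logs.
logs = LOAD '$INPUT/*.gz' USING PigStorage('\t') AS (date, time, ip, method, uri, status, bytes:long, time_taken, referer, user_agent, cookie, country);
-- Load the list of domains that have log delivery enabled, one domain per line.
log_domains = LOAD '$INPUT/domains_log.tsv' USING PigStorage('\n') AS domains;
-- Reformat each record into an access-log-style line and extract the domain from the URI.
formatted_logs = FOREACH logs GENERATE ip, '-', '-', org.apache.pig.builtin.StringConcat('[', date, ':', time, ' +0000', ']'), org.apache.pig.builtin.StringConcat('"', method, ' ', uri, ' ', 'HTTP/1.1', '"'), status, bytes, referer, user_agent, REGEX_EXTRACT(uri, '/([^/]*).$PROVIDER_URL_EXT(/.*)', 1) AS domain;
-- The inner join keeps only log lines whose domain has log delivery enabled.
delivery_enabled_formatted_logs = JOIN log_domains BY domains, formatted_logs BY domain;
-- Write gzip-compressed output split by field index 10 (the extracted domain), one directory per domain.
STORE delivery_enabled_formatted_logs INTO '$OUTPUT' USING org.apache.pig.piggybank.storage.MultiStorage('$OUTPUT', 10, 'gz', '\\t');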
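
To make the per-record transformation concrete, the following is a minimal Python sketch of what the FOREACH and JOIN steps above appear to do to a single log line. It is an illustration under assumptions, not part of the script: the sample record, the `cdn.example.net` value standing in for `$PROVIDER_URL_EXT`, and the names `format_log`, `delivery_enabled`, and `SAMPLE` are all hypothetical. `re.search` stands in for `REGEX_EXTRACT`, and the JOIN is approximated as a simple membership filter (the real join also prepends the matched `domains` column).

```python
import re

# Hypothetical value standing in for the $PROVIDER_URL_EXT parameter.
PROVIDER_URL_EXT = "cdn.example.net"

# Hypothetical input record, already split on tabs, in the column order
# declared by the LOAD statement.
SAMPLE = [
    "2015-06-12", "15:39:45", "203.0.113.7", "GET",
    "/www.example.com.cdn.example.net/img/logo.png",
    "200", "512", "0.004", "-", "curl/7.40.0", "-", "US",
]


def format_log(fields, provider_url_ext=PROVIDER_URL_EXT):
    """Mirror the FOREACH ... GENERATE step for one record."""
    (date, time, ip, method, uri, status, nbytes,
     time_taken, referer, user_agent, cookie, country) = fields
    # Same pattern as the Pig script with the parameter substituted in;
    # the dots are left unescaped, exactly as in the script.
    match = re.search("/([^/]*)." + provider_url_ext + "(/.*)", uri)
    domain = match.group(1) if match else None
    return [
        ip, "-", "-",
        "[" + date + ":" + time + " +0000]",
        '"' + method + " " + uri + ' HTTP/1.1"',
        status, int(nbytes), referer, user_agent, domain,
    ]


def delivery_enabled(records, enabled_domains):
    """Approximate the JOIN: keep records whose domain is enabled."""
    return [r for r in records if r[-1] in enabled_domains]


if __name__ == "__main__":
    row = format_log(SAMPLE)
    for kept in delivery_enabled([row], {"www.example.com"}):
        print("\t".join(str(field) for field in kept))
```

Records whose URI does not match the pattern get a domain of `None` in this sketch and are dropped by the filter, which mirrors how an inner join discards rows with no matching key.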