Use CloudWright to Build + Share Reporting Tools

January 15, 2020

CloudWright makes it easy to build powerful reporting tools and hand them off to non-technical teams. In this article we show how to use CloudWright to:

  • build a robust data reporting tool backed by BigQuery
  • use the Slack integration to send notifications
  • share and document the report generator to be run by non-technical users

Use-case: Filtering and summarizing purchase records

BigQuery is a powerful GCP service which enables ad-hoc SQL-like queries on massive datasets. One common use of BigQuery is to analyze historical log data to derive product insights.

In this example, our company has purchase records stored as raw logs in cloud object storage (here, Google Cloud Storage). The records contain the purchased item alongside transaction metadata (timestamp and location):

Timestamp,Customer ID,Orders,City,State
1/12/2020 1:41am,1,Pepperoni Pizza,Seattle,WA
1/12/2020 1:45am,3,Hot Wings,Dallas,TX
1/12/2020 1:46am,8,Cheese Bread,Seattle,WA
1/12/2020 1:46am,4,Pineapple Pizza,Seattle,WA
1/12/2020 1:46am,4,Cheese Pizza,Dallas,TX

A Product Manager trying to track the performance of product lines in various markets may want to answer questions like:

  • How well do Hot Wings sales perform on weekends vs weekdays?
  • Are we meeting Pizza sale goals in Dallas?

If the Product Manager cannot directly access the raw logs (for legal or PII access reasons), or lacks the technical skills to write these queries, each of these requests — and every variation on the request — becomes an engineering request, and iterating on these queries is frustrating to both the product and engineering teams.

However, using CloudWright, we can quickly build and hand off a shared, reusable reporting tool which empowers the product team to answer these questions on their own.


Building a shared reporting tool in CloudWright

Writing a robust BigQuery-based application in CloudWright is easy — the CloudWright BigQuery module configures and exposes the BigQuery Python API. We'll talk about the important parts of the report generator in this post.

(There's a bit of boilerplate we won't cover, but if you're interested, check out the whole application below)

See the full application
from google.cloud import bigquery
import time

# Search record path and filter term inputs
search_path = CloudWright.inputs.get('search_path')
filter_term = CloudWright.inputs.get('filter')

# BigQuery to do the analysis, Slack to notify on completion
bq = CloudWright.get_module('bigquery')
slack = CloudWright.get_module('slack')

# Get the project name from the client (to make this script re-usable)
project = next(iter(list(bq.list_projects())), None).project_id

# Either find or create the bigquery dataset we use as tmp space 
dataset_name = 'purchase_summarizer'
destination=f'{project}.{dataset_name}.tmp'

# Create a dataset for temp storage, if it doesn't already exist
report_set = next(filter(lambda ds: ds.dataset_id == dataset_name, bq.list_datasets()), None)
if not report_set:
    report_set = bq.create_dataset(dataset_name)

BUCKET='gs://cloudwright-examples-report-generator'

# We'll read directly from the GS bucket for this job
external_config = bigquery.ExternalConfig("CSV")
external_config.source_uris = [f'{BUCKET}/logs/{search_path}/*']
external_config.autodetect = True

# Configure the job to filter and count per city + state pair
query = "SELECT City,State,COUNT(*) FROM logs WHERE `Orders` LIKE @filter_term GROUP BY City,State"
job_config = bigquery.QueryJobConfig(
    destination=destination,
    table_definitions={'logs': external_config},
    query_parameters=[bigquery.ScalarQueryParameter("filter_term", "STRING", f'%{filter_term}%')]
)

# Directory to extract the result into (timestamped for uniqueness)
report_path=f'{BUCKET}/output/report_{int(round(time.time() * 1000))}'

# (1) run the query (2) extract the result to object storage (3) clean up temp space
bq.query(query, job_config=job_config).result()
bq.extract_table(report_set.table('tmp'), report_path).result()
bq.delete_table(destination)

# Send a message on Slack to let the team know that the job ran successfully

message = f"Purchase order summarizer complete! \n \
    Search path: '{search_path}'\n \
    Filter: '{filter_term}'\n \
    You can download the report here: {report_path}"
slack.chat_postMessage(channel="file-automation", text=message)

CloudWright.response.set_value(f"Report successfully delivered to: {report_path}")

We'll take two inputs to our application — the path to search for logs (we'll assume that logs are structured as YYYY/MM/DD subdirectories), and the string we'll use to filter the records:

    search_path = CloudWright.inputs.get('search_path')
    filter_term = CloudWright.inputs.get('filter')
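
For instance, with a hypothetical date path, the search path is spliced straight into a wildcard URI over the logs bucket (this just restates the f-string from the full application above):

    BUCKET = 'gs://cloudwright-examples-report-generator'
    search_path = '2020/01/12'  # hypothetical input: logs from January 12, 2020
    source_uri = f'{BUCKET}/logs/{search_path}/*'
    # -> gs://cloudwright-examples-report-generator/logs/2020/01/12/*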

We only need to import two Modules here: BigQuery, to run our distributed query, and Slack, to send a notification once the script completes:

    bq = CloudWright.get_module('bigquery')
    slack = CloudWright.get_module('slack')

Setting up the BigQuery and Slack Modules

We'll show how to set these Modules up from scratch, but remember — you'll usually be able to re-use Modules your team has already created.

Our application needs two Modules — our Slack integration and a connection to our BigQuery project (for more details about how to create and use CloudWright Modules, check out the Modules section in our docs).

We'll set up BigQuery to connect using a GCP Service Account Key. You can learn how to create and use Service Accounts in the Google Cloud Platform documentation. Our queries will execute in the cloudwright-examples project:

Configure BigQuery
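
(CloudWright wires this connection up for us; if you're curious what the attached Module roughly amounts to, here's a sketch of the equivalent hand-built client, assuming a hypothetical key file name. This is an illustration, not CloudWright's internals.)

    # Rough sketch: a BigQuery client configured from a Service Account Key
    # (hypothetical key file name; CloudWright handles this for attached Modules).
    from google.cloud import bigquery
    from google.oauth2 import service_account

    credentials = service_account.Credentials.from_service_account_file('cloudwright-examples-key.json')
    bq = bigquery.Client(project='cloudwright-examples', credentials=credentials)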

The Slack Module needs two secrets: an API Token and a Signing Secret. You can get both by creating a Slack app for this integration:

Configure Slack
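
(Likewise, the Slack Module hands our Application a ready-to-use client. Here's a minimal sketch of the equivalent setup with the Slack Python SDK, assuming the API Token configured above; the Signing Secret is only needed when verifying requests that come from Slack.)

    # Minimal sketch: the Slack WebClient the Module roughly wraps (hypothetical token value).
    from slack import WebClient  # slackclient 2.x; newer SDK versions import from slack_sdk

    slack = WebClient(token='xoxb-your-bot-token')
    slack.chat_postMessage(channel="file-automation", text="Slack Module configured!")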

Once these Modules are created, we can move on to writing our Application.


Once our Modules are attached to a new Application, executing a BigQuery query against the attached module is straightforward — our script will filter for the input text, and count records per City+State:

    bq = CloudWright.get_module('bigquery')
    ...
    query = "SELECT City,State,COUNT(*) FROM logs \
        WHERE `Orders` LIKE '@filter_term%' \
        GROUP BY City,State"
    ...
    bq.query(query, job_config=job_config).result()
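
Note that the @filter_term placeholder is not interpolated into the SQL string; it's bound as a query parameter in the job configuration (shown in the full application above), which keeps user input out of the query text:

    job_config = bigquery.QueryJobConfig(
        destination=destination,
        table_definitions={'logs': external_config},
        query_parameters=[bigquery.ScalarQueryParameter("filter_term", "STRING", f'%{filter_term}%')]
    )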

Likewise, sending a message to a Slack channel is a one-line operation on the attached Slack module:

    slack = CloudWright.get_module('slack')
    ...
    slack.chat_postMessage(channel="file-automation", text=message)

We can test out our script in the dry run console to confirm that it runs successfully:

Dry Run

When we check our generated output file, we can confirm it has the record counts we expect:

    City,State,f0_
    Dallas,TX,1
    Seattle,WA,2
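
(The f0_ header is just BigQuery's auto-generated name for the unaliased COUNT(*). If a friendlier column name matters for the hand-off, aliasing the count is a one-line tweak to the query; an optional variation, not something the tool above requires.)

    # Optional tweak: alias the count so the report header reads "order_count" instead of "f0_"
    query = "SELECT City,State,COUNT(*) AS order_count FROM logs WHERE `Orders` LIKE @filter_term GROUP BY City,State"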

The dry-run console is great for testing our application, but it's not where we want end-users to use the tool. For that hand-off, we'll publish the application and point users at the CloudWright Run Console.

In the CloudWright Run Console, users can launch executions of our Application with custom inputs. We've also added documentation in the right sidebar, to guide the user through how to use the tool we built:
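
For example, a Product Manager checking whether we're meeting Pizza sale goals might launch a run with a search_path of 2020/01/12 and a filter of Pizza (hypothetical values, using the YYYY/MM/DD log layout we assumed earlier); the resulting report counts Pizza orders per city and state for that day.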

Run Console

In the History viewer, users can track executions of the application. This gives users a view into how the application is being used, and whether their own executions have completed:

Run History

Once the manual run is complete, we can check out the status and the output, which points us at the delivered report:

Run Details

That's it — our tool is published, and the product team is able to iterate on data queries without proxying SQL queries through the engineering teams.

Conclusion

This article showed how you can use CloudWright to build and share robust data reporting tools in hours or minutes. If you'd like to learn more, see a live demo, or start using CloudWright to build and share your own reporting tools, let's get in touch.