
Learn Cloud Operations

Cloud Operations Overview

As cloud-native microservice architectures, which promise scalability and flexibility benefits, grow more popular, developers and administrators need tools that can work across cloud-based distributed systems.

Cloud Operations provides products for both developers and administrators. This section introduces the products and their general audiences; the tools are covered in more detail later. Application developers need to be able to investigate the cause of problems in applications running in distributed environments, and in this context, the importance of Application Performance Management (APM) has increased. Cloud Operations provides three products for APM: Cloud Trace, Cloud Profiler, and Cloud Debugger.

Similarly, cloud-native, microservice-based applications complicate the traditional approaches administrators use for monitoring system health: it’s harder to observe your system health when the number of instances is flexible and the inter-dependencies among the many components are complicated. In the last few years, Site Reliability Engineering (SRE) has become recognized as a practical approach to managing large-scale, highly complex, distributed systems. Cloud Operations provides the following tools that are useful for SRE: Cloud Monitoring, Cloud Logging, and Cloud Error Reporting.

You can find the Cloud Operations products in the navigation panel on the GCP Console:

image

1 - Cloud Trace

Trace Overview

Cloud Trace (documentation) enables developers to see distributed traces that visually expose latency bottlenecks in requests. Developers instrument application code to collect trace information. You can also include environmental information in traces, and trace information can be included in Cloud Logging logs. The Trace UI can then pull relevant log events into the trace timelines.

For instrumenting your applications, the currently recommended solution is OpenCensus. OpenCensus is an open-source project that supports trace instrumentation in a variety of languages and can export this data to Cloud Operations. You can then use the Cloud Trace UI to analyze the data. Note that OpenCensus is merging with a similar project, OpenTracing, to form OpenTelemetry; see OpenCensus to become OpenTelemetry in this doc.
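To make this concrete, here is a minimal sketch of OpenCensus trace instrumentation in Python; it assumes the opencensus and opencensus-ext-stackdriver packages are installed, and the span names are placeholders rather than the Sandbox’s actual code:

# Minimal OpenCensus tracing sketch (Python). Assumes the opencensus
# and opencensus-ext-stackdriver packages; span names are placeholders.
from opencensus.ext.stackdriver.trace_exporter import StackdriverExporter
from opencensus.trace.samplers import AlwaysOnSampler
from opencensus.trace.tracer import Tracer

# Export spans to Cloud Trace; the project ID is read from the environment.
tracer = Tracer(exporter=StackdriverExporter(), sampler=AlwaysOnSampler())

# Each with-block becomes a span in the trace timeline.
with tracer.span(name='checkout'):
    with tracer.span(name='charge-card'):
        pass  # placeholder for the real work

In production you would typically use a probability sampler rather than AlwaysOnSampler to keep tracing overhead low.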

The HipsterShop microservices are instrumented to collect trace data. In addition to distributed tracing, OpenCensus (Stats) provides a sink for quantifiable data, such as database latency and open file descriptors, that helps you set up monitoring of SLIs and SLOs for the service. This data is available in Cloud Monitoring, and the HipsterShop microservices are instrumented to collect it as well.
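As a sketch of the Stats side, the following records a single latency measurement and exports it as a custom metric; the measure, view name, and bucket bounds are illustrative, not the Sandbox’s actual instrumentation:

# Illustrative OpenCensus Stats sketch (Python); measure, view name,
# and bucket bounds are made up for this example.
from opencensus.ext.stackdriver import stats_exporter
from opencensus.stats import aggregation, measure, stats, view
from opencensus.tags import tag_map

m_latency = measure.MeasureFloat('db_latency', 'Database latency', 'ms')
latency_view = view.View('db_latency_distribution', 'Latency distribution',
                         [], m_latency,
                         aggregation.DistributionAggregation([25, 50, 100, 250, 500]))

# Register the Cloud Monitoring exporter and the view.
view_manager = stats.stats.view_manager
view_manager.register_exporter(stats_exporter.new_stats_exporter())
view_manager.register_view(latency_view)

# Record one latency sample; it surfaces in Cloud Monitoring as a custom metric.
mmap = stats.stats.stats_recorder.new_measurement_map()
mmap.measure_float_put(m_latency, 42.0)
mmap.record(tag_map.TagMap())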

Using Trace

To bring up Cloud Trace, click Trace in the GCP navigation panel. This takes you to the Trace Overview page, where you see the traces generated by the Sandbox microservices:

image

Click Trace List in the navigation panel to see the list of traces captured during a particular time period:

image

Click on any trace in the timeline to get a detailed view and breakdown of the traced call and the subsequent calls that were made:

image

Finally, click Analysis Reports in the navigation menu to see a list of reports that are generated.

If you have just set up the Sandbox environment, you may not have any reports. Click New Report to create one. For an example of a first report: in the Request Filter field, select Recv./cart and leave the other options at their defaults. Once the report is created, you should be able to see it in the Analysis Reports list.

image

View one of the reports that was created (or the one you created yourself) to understand either the density or cumulative distribution of latency for the call you selected:

image

Feel free to explore the tracing data collected from here before moving on to the next section.

2 - SLIs, SLOs, and Burn Rate Alerts

SLIs, SLOs, and Burn Rate Alerts Overview

Cloud Operations Sandbox comes with several predefined SLOs (service level objectives) that allow you to measure your users’ happiness. To learn more about SLIs and SLOs, see SRE fundamentals.

The Cloud Operations suite provides service-oriented monitoring: you configure SLIs, SLOs, and burn rate alerts for a ‘service’.

The first step in creating an SLO is to ingest the data. For GKE services, telemetry and dashboards come out of the box, but you can also ingest additional data and create custom metrics.

Then you need to define your service. The Cloud Operations Sandbox services are already detected, since Istio services are automatically detected and created. But to demonstrate that you can create your own services, the Sandbox also deploys custom services using Terraform.

You can find all the services under Monitoring > Services > Services Overview, where you can also create your own custom service.

image

Services SLOs

The predefined SLOs are also deployed as part of the Terraform code; currently they cover the custom services mentioned above, the Istio service, and the Rating service.

Custom services SLOs

  • Custom service availability SLO: 90% of HTTP requests are successful within the past 30-day windowed period
  • Custom service latency SLO: 90% of requests return in under 500 ms in the previous 30 days

To view the existing SLOs, choose the desired service in the Services Overview screen.

For example, for checkoutservice:

image

image

Additional predefined SLOs:

  • Istio service availability SLO: 99% of HTTP requests are successful within the past 30-day windowed period
  • Istio service latency SLO: 99% of requests return in under 500 ms in the previous 30 days
  • Rating service availability SLO: 99% of HTTP requests are successful within the past 30-day windowed period
  • Rating service latency SLO: 99% of requests return in under 175 ms in the previous 30 days
  • Rating service data freshness SLO: during a day, 99.9% of minutes have at least 1 successful recollect API call

Configure your own SLIs and SLOs

Remember: the purpose of defining SLIs and SLOs is to improve your users’ experience, and the scope of an SLO is a user journey. Therefore, your first step should be to identify the critical user journey (CUJ) that matters most to your business, then identify the metrics that measure your customers’ experience as closely as possible, and ingest that data.

You can configure your own SLIs and SLOs for an existing service or for your own custom service.

Example: configuring the checkout service

  1. In the service screen, choose Create SLO: image
  2. Set your SLI: choose the SLI type and the method (request-based vs. window-based): image
  3. Define your metric; you can also preview its performance based on historical data: image
  4. Configure your SLO: set your target over a specific time window. You can also choose between a rolling window and a calendar window: image
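If you prefer to script SLO creation rather than use the console (the Sandbox itself does this through Terraform), a hedged sketch with the google-cloud-monitoring Python client might look like the following; the project ID, service name, target, and threshold are all placeholders:

# Hypothetical sketch: create a 30-day rolling latency SLO with the
# google-cloud-monitoring client. Project, service, and threshold are
# placeholders.
from google.cloud import monitoring_v3

client = monitoring_v3.ServiceMonitoringServiceClient()
parent = 'projects/my-project/services/checkoutservice'

slo = monitoring_v3.ServiceLevelObjective(
    display_name='90% of requests under 500 ms (30-day rolling)',
    goal=0.9,
    rolling_period={'seconds': 30 * 24 * 3600},
    service_level_indicator=monitoring_v3.ServiceLevelIndicator(
        basic_sli=monitoring_v3.BasicSli(
            latency=monitoring_v3.BasicSli.LatencyCriteria(
                threshold={'seconds': 0, 'nanos': 500_000_000},
            ),
        ),
    ),
)
created = client.create_service_level_objective(
    parent=parent, service_level_objective=slo)
print(created.name)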

Configure Burn Rate Alerts

After you create the SLO, you can create burn rate alerts for it.

Several predefined policies are deployed as part of the Terraform code. You can view them in the service screen, edit them, or create your own.

Let’s continue with the Istio checkoutservice SLO you created in the previous section:

  1. In the service screen, you will see your new SLO; choose ‘Create Alerting Policy’: image
  2. Set the alert’s condition, who will be notified and how, and any additional instructions: image
  3. After the policy is created, you can see it, and any incidents it may trigger, in the service screen and in the Alerting screen: image
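The predefined policies come from the Terraform code, but burn rate alerts can also be created programmatically. Here is a hedged sketch with the google-cloud-monitoring Python client; the project, SLO resource name, lookback window, and threshold are placeholders:

# Hypothetical sketch: an alert policy on SLO burn rate. The project
# and SLO resource name are placeholders.
from google.cloud import monitoring_v3

client = monitoring_v3.AlertPolicyServiceClient()
slo_name = ('projects/my-project/services/checkoutservice'
            '/serviceLevelObjectives/latency-slo')

policy = monitoring_v3.AlertPolicy(
    display_name='Checkout latency SLO burn rate',
    combiner=monitoring_v3.AlertPolicy.ConditionCombinerType.AND,
    conditions=[monitoring_v3.AlertPolicy.Condition(
        display_name='Burn rate over a 1-hour lookback',
        condition_threshold=monitoring_v3.AlertPolicy.Condition.MetricThreshold(
            # select_slo_burn_rate(SLO, lookback) is the filter form used
            # for SLO burn rate conditions.
            filter=f'select_slo_burn_rate("{slo_name}", "3600s")',
            comparison=monitoring_v3.ComparisonType.COMPARISON_GT,
            threshold_value=2.0,
            duration={'seconds': 0},
        ),
    )],
)
client.create_alert_policy(name='projects/my-project', alert_policy=policy)

A burn rate of 2.0 here means the error budget is being consumed at twice the rate that would exhaust it exactly at the end of the SLO window.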

3 - Cloud Profiler

Profiler Overview

Cloud Profiler (documentation) performs statistical sampling on your running application. Depending on the language, it can capture statistical data on CPU utilization, heap size, threads, and so on. You can use the charts created by the Profiler UI to help identify performance bottlenecks in your application code.

You do not have to write any profiling code in your application; you simply need to make the Profiler library available (the mechanism varies by language). This library will sample performance traits and create reports, which you can then analyze with the Profiler UI.
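In a Python service, for example, starting the agent takes a few lines at process startup; this is a sketch, and the service name and version are placeholders (other languages use their own mechanisms):

# Sketch: start the Cloud Profiler agent in a Python service.
# Service name and version are placeholders.
import googlecloudprofiler

try:
    googlecloudprofiler.start(
        service='myservice',
        service_version='1.0.0',
    )
except (ValueError, NotImplementedError) as exc:
    # Profiling is best-effort; don't crash the service if it fails.
    print(f'Profiler not started: {exc}')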

The following Hipster Shop microservices are configured to capture profiling data:

  • Checkout service
  • Currency service
  • Frontend
  • Payment service
  • Product-catalog service
  • Shipping service

Using Profiler

Select Profiler from the GCP navigation menu to open the Profiler home page. It comes up with a default configuration and shows you the profiling graph:

image

You can change the service, the profile type, and many other aspects of the configuration. For example, to select the service you’d like to view Profiler data for, choose a different entry on the Service pulldown menu:

image

Depending on the service you select and the language it’s written in, you can select from multiple metrics collected by Profiler:

image

See “Types of profiling available” for information on the specific metrics available for each language.

Profiler uses a visualization called a flame graph to represent both code execution and resource utilization. See “Flame graphs” for information on how to interpret this visualization. You can read more about how to use the flame graph to understand your service’s efficiency and performance in “Using the Profiler interface”.

4 - Cloud Debugger

Debugger Overview

You might have experienced situations where you see problems in production environments that can’t be reproduced in test environments. To find a root cause, you then need to step into the source code or add more logging to the application as it runs in the production environment. Typically, this would require re-deploying the app, with all the risks associated with a production deployment.

Cloud Debugger (documentation) lets developers debug running code with live request data. You can set breakpoints and log points on the fly. When a breakpoint is hit, a snapshot of the process state is taken, so you can examine what caused the problem. With log points, you can add a log statement to a running app without re-deploying, and without incurring meaningful performance costs.

You do not have to add any instrumentation code to your application to use Cloud Debugger. You start the debugger agent in the container running the application, and you can then use the Debugger UI to step through snapshots of the running code.
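In a Python service, for example, enabling the agent looks roughly like this sketch; the module and version strings are placeholders (other languages use their own mechanisms):

# Sketch: enable the Cloud Debugger agent in a Python service.
# Module and version strings are placeholders.
try:
    import googleclouddebugger
    googleclouddebugger.enable(module='myservice', version='1.0')
except ImportError:
    pass  # the agent is optional; keep running without it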

The following Hipster Shop microservices are configured to capture debugger data:

  • Currency service
  • Email service
  • Payment service
  • Recommendation service

Using Debugger

To bring up the Debugger, select Debugger from the navigation panel on the GCP console:

image

As you can see, Debugger requires access to source code to function. For this exercise, you’ll download the code locally and link it to Debugger.

Download source code

In Cloud Shell, issue these commands to download a release of the Sandbox source code and extract the archive:

cd ~
wget https://github.com/GoogleCloudPlatform/cloud-ops-sandbox/archive/next19.tar.gz
tar -xvf next19.tar.gz
cd cloud-ops-sandbox-next19
Create and configure source repository

To create a Cloud Source Repository for the source code and to configure Git access, issue these commands in Cloud Shell:

gcloud source repos create google-source-captures
git config --global user.email "user@domain.tld" # substitute with your email
git config --global user.name "first last"       # substitute with your name
Upload source code to Debugger

In the Debugger home page, copy the command (don’t click the button!) in the “Upload a source code capture to Google servers” box, but don’t include the LOCAL_PATH variable. (You will replace this with another value before executing the command.)

image

Paste the command into your Cloud Shell prompt and add a space and a period:

gcloud beta debug source upload --project=cloud-ops-sandbox-68291054 --branch=6412930C2492B84D99F3 .

Press RETURN to execute the command.

In the Debugger home page, click the Select Source button under the “Upload a source code capture” option to open the source code:

image

You are now ready to debug your code!

Create a snapshot

Start by using the Snapshot functionality to understand the state of your variables. In the Source capture tree, open the server.js file under src > currencyservice.

Next, click on line 121 to create a snapshot. In a few moments, you should see a snapshot created, and you can view the values of all variables at that point on the right side of the screen:

image

Create a logpoint

Switch to the Logpoint tab on the right side. To create the logpoint:

  1. Again, click on line 121 of server.js to position the logpoint.
  2. In the Message field, type “testing logpoint” to set the message that will be logged.
  3. Click the Add button.

To see all messages that are being generated in Cloud Logging from your logpoint, click the Logs tab in the middle of the UI. This brings up an embedded viewer for the logs:

image

5 - Cloud Monitoring

Monitoring Overview

Cloud Monitoring (documentation) is the go-to place to grasp real-time trends in the system based on SLIs and SLOs. The SRE team and the application development team (and even business teams) can collaborate to set up charts on the monitoring dashboard using metrics sent from the resources and the applications.

Using Monitoring

To get to Cloud Monitoring from the GCP console, select Monitoring on the navigation panel. By default, you reach an overview page:

image

There are many pre-built monitoring pages. For example, the GKE Cluster Details page (select Monitoring > Dashboards > Kubernetes Engine > Infrastructure) brings up a page that provides information about the Sandbox cluster:

image

You can also use the Monitoring console to create alerts and uptime checks, and to create dashboards that chart metrics you are interested in. For example, Metrics Explorer lets you select a specific metric, configure it for charting, and then save the chart. Select Monitoring > Metrics Explorer from the navigation panel to bring it up.

To search for and view metrics, type the name of the metric or the resource type. For example, to find OpenCensus metrics, search for grpc.io in Monitoring > Metrics Explorer:

image

The following chart shows the client-side RPC calls that did not result in an OK status:

image

This chart uses the metric type custom.googleapis.com/opencensus/grpc.io/client/completed_rpcs (display name: “OpenCensus/grpc.io/client/completed_rpcs”), and filters on the grpc_client_status label to keep only time series where the label value does not equal “OK”.

The following example displays results where the grpc_client_status is not “OK” (e.g. PERMISSION_DENIED) and where the grpc_client_method does not include “google”, i.e. errors from application services.

image
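The same data can also be pulled programmatically. Here is a sketch using the google-cloud-monitoring Python client, with a placeholder project ID and the filter described above:

# Sketch: query completed_rpcs for non-OK statuses over the last hour.
# The project ID is a placeholder.
import time

from google.cloud import monitoring_v3

client = monitoring_v3.MetricServiceClient()
now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {'end_time': {'seconds': now}, 'start_time': {'seconds': now - 3600}})

results = client.list_time_series(
    name='projects/my-project',
    filter=('metric.type='
            '"custom.googleapis.com/opencensus/grpc.io/client/completed_rpcs"'
            ' AND metric.label.grpc_client_status != "OK"'),
    interval=interval,
    view=monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
)
for series in results:
    print(series.metric.labels['grpc_client_method'], len(series.points))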

In addition to the default GCP dashboards mentioned above, Cloud Operations Sandbox provisions several dashboards using Terraform code.

In the User Experience Dashboard, you can also view OpenCensus metrics like HTTP Request Count by Method, HTTP Response Errors, and HTTP Request Latency, 99th Percentile.

Additionally, you can edit the dashboard, add more charts, and open a chart in the Metrics Explorer, as shown below: image

Monitoring and logs-based metrics

Cloud Logging provides default, logs-based system metrics, but you can also create your own (see Using logs-based metrics). To see available metrics, select Logging > Logs-based metrics from the navigation panel. You should see both system metrics and some user-defined, logs-based metrics created in Sandbox.

image

All system-defined logs-based metrics are counters. User-defined logs-based metrics can be either counter or distribution metrics.

Creating a logs-based metric

To create a logs-based metric, click the Create Metric button at the top of the Logs-based metrics page or the Logs Viewer. This takes you to the Logs Viewer if needed, and also brings up the Metric Editor panel.

Creating a logs-based metric involves two general steps:

  1. Identifying the set of log entries you want to use as the source of data for your metric by using the Logs Viewer. Using the Logs Viewer is briefly described in the Cloud Logging section of this document.
  2. Describing the metric data to extract from these log entries by using the Metric Editor.

This example creates a logs-based metric that counts the number of times a user (user ID, actually) adds an item to the HipsterShop cart. (This is an admittedly trivial example, though it could be extended. For example, from this same set of records, you can extract the user ID, item, and quantity added.)

First, create a logs query that finds the relevant set of log entries:

  1. For the resource type, select Kubernetes Container > cloud-ops-sandbox > default > server
  2. In the box with default text “Filter by label or text search”, enter “AddItemAsync” (the method used to add an item to the cart), and hit return.

The Logs Viewer display shows the resulting entries:

image

Second, describe the new metric to be based on the logs query. This will be a counter metric. Enter a name and description and click Create Metric:

image
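If you prefer to script this step, here is a sketch with the google-cloud-logging Python client; the resource labels in the filter are assumptions that mirror the Logs Viewer query above:

# Sketch: create the counter logs-based metric from code. The filter
# mirrors the Logs Viewer query; resource labels are assumptions.
from google.cloud import logging

client = logging.Client()
metric = client.metric(
    'purchasing_counter_metric',
    filter_=('resource.type="k8s_container"'
             ' AND resource.labels.cluster_name="cloud-ops-sandbox"'
             ' AND resource.labels.namespace_name="default"'
             ' AND "AddItemAsync"'),
    description='Counts items added to the HipsterShop cart',
)
metric.create()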

It takes a few minutes for metric data to be collected, but once the metric collection has begun, you can chart this metric just like any other.

To chart this metric using Metrics Explorer, select Monitoring from the GCP console, and on the Monitoring console, select Resources > Metrics Explorer.

Search for the metric type using the name you gave it (“purchasing_counter_metric”, in this example):

image

6 - Cloud Logging

Logging Overview

Operators can look at logs in Cloud Logging to find clues explaining any anomalies in the metrics charts.

Using Logging

You can access Cloud Logging by selecting Logging from the GCP navigation menu. This brings up the Logs Viewer interface:

image

The Logs Viewer allows you to view logs emitted by resources in the project using the provided search filters, and it lets you select standard filters from pulldown menus.

An example: server logs

To view all container logs emitted by pods running in the default namespace, use the Resources and Logs filter fields (these default to Audited Resources and All logs):

  1. For the resource type, select GKE Container -> cloud-ops-sandbox -> default
  2. For the log type, select server

The Logs Viewer now displays the logs generated by pods running in the default namespace:

image

Another example: audit logs

To see logs for all audited actions that took place in the project during the specified time interval:

  1. For the resource type, select Audited Resources > All services
  2. For the log type, select All logs
  3. For the time interval, you might have to experiment, depending on how long your project has been up.

The Logs Viewer now shows all audited actions that took place in the project during the specified time interval:

image

Exporting logs

Audit logs contain the records of who did what. For long-term retention of these records, the recommended practice is to create exports for audit logs. You can do that by clicking on Create Sink:

image

Give your sink a name, and select the service and destination to which you will export your logs. We recommend using a less expensive class of storage for exported audit logs, since they are not likely to be accessed frequently. For this example, create an export for audit logs to Google Cloud Storage.

Click Create Sink. Then follow the prompts to create a new storage bucket and export logs there:

image
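Sinks can also be created from code. Here is a sketch with the google-cloud-logging Python client; the bucket name is a placeholder, and the bucket must already exist:

# Sketch: export audit logs to a Cloud Storage bucket. The bucket name
# is a placeholder and must be created beforehand.
from google.cloud import logging

client = logging.Client()
sink = client.sink(
    'audit-logs-export',
    filter_='logName:"cloudaudit.googleapis.com"',
    destination='storage.googleapis.com/my-audit-archive-bucket',
)
sink.create()

When you create a sink from code, you must also grant the sink’s writer identity access to the destination bucket (the console does this for you).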

7 - Cloud Error Reporting

Error Reporting Overview

Cloud Error Reporting (documentation) automatically groups errors based on stack trace message patterns and shows the frequency of each error group.

On opening an error group report, operators can access the exact line in the application code where the error occurred and reason about the cause by navigating to that line of the source code in Google Cloud Source Repository.
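Errors usually arrive through Cloud Logging, but they can also be reported directly from code. A minimal sketch with the google-cloud-error-reporting Python client:

# Sketch: report a handled exception directly to Error Reporting.
from google.cloud import error_reporting

client = error_reporting.Client()
try:
    raise RuntimeError('simulated failure')
except RuntimeError:
    # Captures the current stack trace so Error Reporting can group it.
    client.report_exception()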

Using Error Reporting

You can access Error Reporting by selecting Error Reporting from the GCP navigation menu:

image

Note: Error Reporting can also let you know when new errors are received; see “Notifications for Error Reporting” for details.

To get started, select any open error by clicking on the error in the Error field:

image

The Error Details screen shows you when the error has been occurring in the timeline and provides the stack trace that was captured with the error. Scroll down to see samples of the error:

image

Click View Logs for one of the samples to see the log messages that match this particular error:

image

You can expand any of the messages that match the filter to see the full stack trace:

image

Manufacturing Errors

There are several ways to experiment with the Error Reporting tool and manufacture errors that will be reported and displayed in the UI. For this demonstration, we will use the Cloud Operations Sandbox’s Load Generator and SRE Recipes features to simulate errors in the system.

To simulate requests using the load generator, we can use the UI or the sandboxctl command-line tool.

$ sandboxctl loadgen step
Redeploying Loadgenerator...
Loadgenerator deployed using step pattern
Loadgenerator web UI: http://<ExampleIP>

Then, to break the service, we will use SRE Recipes (recipe2):

$ sandboxctl sre-recipes break recipe2
Breaking service operations...
...done

In this case, you will see a new reported error in the Error Reporting UI: Unhealthy pod, failed probe.

image

You can open it to see additional information; in the example below, you can see that this error repeated several times in the last hour.

image

You can also click View logs to see detailed log information in Cloud Logging.

image

Note: at the end, don’t forget to recover the service using sandboxctl sre-recipes restore.

Another way to break the service is to use the load generator to overload it with too many requests. In the Load Generator UI (at the address provided above, or found via sandboxctl describe), start a test with 500 users.

Note: currently, only load tests with fewer than 100 users will succeed.

image

In the UI, you will see the previous error, Unhealthy pod, failed probe; in addition, you will see a new error, Container Downtime:

image

image