Cloud Insights reference

Note

Cloud Insights is currently in public preview. Grafana Labs offers limited support, and breaking changes might occur prior to the feature being made generally available.

You can enable this feature by clicking on Testing & synthetics > Performance in your Grafana Cloud account and typing ↑, ↑, ↓, ↓, ←, →, ←, →, B, A on your keyboard, and selecting Cloud Insights in the Beta Features dialog box. If you need additional help, contact Support.

This page provides a comprehensive reference for all audits within Cloud Insights, grouped by categories. Each audit entry details its purpose and provides generic recommendations for you to follow.

Reliability

Reliability audits diagnose issues related to your system under test’s ability to handle the load generated by your k6 test.

HTTP Failure Rate

  • Identifier: http-failure-rate
  • Group: http

The HTTP Failure Rate is one of the most important metrics to track when testing your system, and it’s often the first symptom of your system breaking.

How it works

Analyzes failed HTTP requests and lists the URLs with the highest error rates during the test run. What is considered a failed HTTP request can be altered with the setResponseCallback function.
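As a rough illustration of the ranking described above, the following plain JavaScript sketch groups request records by URL and sorts them by failure rate. The record shape (`url`, `status`) and the function names are hypothetical and are not part of Cloud Insights or k6. The default predicate assumes the k6 default of treating status codes 200 through 399 as expected, and the optional second argument plays the role that a custom response callback plays in a k6 script.

```javascript
// Hypothetical record shape for this sketch: { url: string, status: number }.
// Assumption: statuses 200-399 count as expected, mirroring the k6 default.
const isExpectedStatus = (status) => status >= 200 && status <= 399;

// Groups requests by URL and returns the failing URLs, highest failure rate first.
function rankFailureRates(requests, isExpected = isExpectedStatus) {
  const byUrl = new Map();
  for (const { url, status } of requests) {
    const entry = byUrl.get(url) ?? { url, total: 0, failed: 0 };
    entry.total += 1;
    if (!isExpected(status)) entry.failed += 1;
    byUrl.set(url, entry);
  }
  return [...byUrl.values()]
    .map((entry) => ({ ...entry, failureRate: entry.failed / entry.total }))
    .filter((entry) => entry.failed > 0)
    .sort((a, b) => b.failureRate - a.failureRate);
}

// Made-up sample data for illustration only.
const sampleRequests = [
  { url: "https://example.com/login", status: 200 },
  { url: "https://example.com/login", status: 500 },
  { url: "https://example.com/cart", status: 404 },
  { url: "https://example.com/home", status: 200 },
];

const ranked = rankFailureRates(sampleRequests);
console.log(ranked);
```

Passing a different predicate, for example one that also accepts 404, changes which requests count as failed, which is the same effect a custom response callback has on the audit input.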

Common problem sources

  • Invalid URLs, such as URLs with typos or hostnames outside the public DNS system.
  • Missing required request headers, such as authentication, authorization, or user agents.
  • Sending the wrong data in the request body.

Recommendations

  • Run a single script iteration locally to troubleshoot the failing requests before running a load test.
  • In the script, verify that the failing requests are formulated correctly and return the expected response.
  • Verify that the application is publicly accessible. If it is behind a firewall, consider using local execution for your tests.
  • Check whether your application is configured properly.
  • Check whether the errors are caused by saturation of a resource (CPU, memory, disk I/O, or database connections).

Note

Failed responses are often returned much faster than successful responses. Consequently, an increased HTTP failure rate may produce misleading metrics for request rates and response times.

HTTP Spans Failure Rate

  • Identifier: http-spans-failure-rate
  • Group: http

Spans, and more generally, traces are entities that let you follow a request as it traverses the services in your infrastructure.

How it works

Analyzes traces originating from HTTP requests made by k6 and highlights the names of spans with high error rates.

Common problem sources

  • Bugs or logic errors within the application code can lead to unexpected exceptions or failures, resulting in errors being recorded on spans.
  • Communication failures between services, such as timeouts, connection errors, or network partitions, can cause errors to be recorded on spans when expected interactions fail to occur.
  • Errors originating from external services or APIs that the application relies on can propagate to spans within the tracing context, indicating issues with downstream services.
  • Excessive resource utilization, such as high CPU usage, memory leaks, or database connection limits being reached, can lead to errors being recorded on spans due to service degradation or failure.

Recommendations

  • Navigate to the Traces tab and identify spans that have recorded errors. From there, explore your spans and identify interesting traces through exemplars.

Note

This audit only works with the k6 x Tempo integration enabled.

Web Vital CLS

  • Identifier: web-vital-cls
  • Group: browser

Cumulative Layout Shift (CLS) measures the visual stability of a webpage by quantifying the amount of unexpected layout shift of visible page content. Improving CLS is essential for providing a more predictable and pleasant user experience.

How it works

Verifies that the CLS metric falls within the recommended range.
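This page doesn't state the exact range that Cloud Insights applies. As a sketch, the following plain JavaScript helper rates a value against the thresholds Google publishes for Web Vitals. Treat these numbers as an assumption about what "recommended range" means here. The helper name and the lookup table are hypothetical, and the same pattern applies to the other Web Vital audits on this page.

```javascript
// Thresholds published by Google for Web Vitals.
// Time-based metrics are in milliseconds; CLS is a unitless score.
// Assumption: Cloud Insights may use different values.
const WEB_VITAL_THRESHOLDS = {
  cls: { good: 0.1, poor: 0.25 },
  fcp: { good: 1800, poor: 3000 },
  fid: { good: 100, poor: 300 },
  inp: { good: 200, poor: 500 },
  lcp: { good: 2500, poor: 4000 },
  ttfb: { good: 800, poor: 1800 },
};

// Returns "good", "needs improvement", or "poor" for a metric value.
function rateWebVital(metric, value) {
  const thresholds = WEB_VITAL_THRESHOLDS[metric];
  if (!thresholds) throw new Error(`Unknown web vital: ${metric}`);
  if (value <= thresholds.good) return "good";
  if (value <= thresholds.poor) return "needs improvement";
  return "poor";
}

console.log(rateWebVital("cls", 0.05));
console.log(rateWebVital("lcp", 3000));
```

For every metric in the table, lower values are better, so a high CLS value indicates more layout shift.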

Recommendations

  • Use Chrome UX Report (CrUX) and PageSpeed Insights: Check the Chrome UX Report and PageSpeed Insights to help identify the root cause of a poor CLS score.
  • Set dimensions for images and videos: Reserve space for images and videos by specifying their dimensions in the HTML. This prevents the page content from shifting when these elements load.
  • Use CSS transitions for animations: If you have animations on your page, use CSS transitions instead of JavaScript animations. CSS transitions are less likely to cause layout shifts.
  • Preload and preconnect: Use the preload and preconnect hints to give the browser a heads-up about resources that will be needed soon. This can help to load resources earlier and reduce the chance of layout shifts.
  • Lazy load elements below the fold: Defer the loading of images and other non-critical resources until they’re about to be scrolled into the viewport.
  • Avoid dynamically injected content: If you inject content dynamically, reserve space for it in the layout or use placeholders.
  • Optimize font loading: Use the font-display: swap property in your CSS to ensure text is rendered with a fallback font until the custom font is loaded.
  • Optimize third-party embeds: If using third-party embeds, like social media widgets or iframes, ensure they reserve the required space and don’t cause unexpected shifts. Some embed codes provide specific attributes or customization options to control layout behavior.
  • Test on different devices and resolutions: Test your website on various devices and screen resolutions to minimize layout shifts across different contexts.
  • Implement a loading skeleton: Use loading skeletons or placeholders for content that is about to load. This gives users a visual indication of where the content will appear and reduces the perceived impact of shifts.

Web Vital FCP

  • Identifier: web-vital-fcp
  • Group: browser

First Contentful Paint (FCP) measures the time it takes for the first content element to be painted on the screen. A faster FCP contributes to a better user experience.

How it works

Verifies that the FCP metric falls within the recommended range.

Recommendations

  • Minimize server response time: Efficient server responses contribute to faster content rendering. Employ caching mechanisms and consider upgrading your hosting plan for better server performance.
  • Minimize render-blocking resources: Identify and address render-blocking resources, especially JavaScript and CSS files. Utilize techniques such as asynchronous loading or defer loading for non-critical resources.
  • Prioritize above-the-fold content: Ensure that above-the-fold content, the content visible without scrolling, is prioritized for loading. This involves optimizing the delivery of critical resources and minimizing the number of requests needed to render this content.
  • Optimize images and videos: Compress and optimize images to reduce their file size without sacrificing quality. Consider using modern image formats like WebP. Also, lazy load images and videos to defer loading until they’re needed, improving initial page load times.
  • Optimize web fonts: Minimize the use of custom web fonts or use the font-display: swap; property to ensure text is visible even before the custom font has fully loaded.
  • Use browser caching: Implement proper caching strategies for static assets. Utilize cache headers to instruct the browser to store and reuse resources locally, reducing the need to download them on subsequent visits.
  • Use a content delivery network (CDN): Deploy a Content Delivery Network to distribute your content across servers globally. This reduces latency and speeds up content delivery, improving FCP.
  • Reduce third-party scripts: Limit the use of third-party scripts as they can introduce additional delays. Only include essential third-party scripts and load them asynchronously to prevent them from blocking rendering.

Web Vital FID

  • Identifier: web-vital-fid
  • Group: browser

First Input Delay (FID) measures the responsiveness of a web page by quantifying the delay between a user’s first interaction, such as clicking a button, and the browser’s response. Improving FID is essential for providing a more responsive and interactive user experience.

How it works

Verifies that the FID metric falls within the recommended range.

Recommendations

  • Async and defer attribute for scripts: Use the async or defer attribute when including scripts in your HTML. This allows the browser to continue parsing the HTML without blocking, resulting in a faster page load.
  • Optimize JavaScript execution: Minimize and optimize JavaScript code to reduce execution time. Eliminate unnecessary scripts, and consider code splitting to load only the necessary JavaScript for the current view.
  • Optimize third-party scripts: Limit the use of third-party scripts, and if necessary, load them asynchronously. Third-party scripts can significantly contribute to FID if they’re large or cause delays in execution.
  • Use Web Workers: Consider using Web Workers to offload heavy JavaScript tasks to a separate thread, preventing them from blocking the main thread and improving responsiveness.

Web Vital INP

  • Identifier: web-vital-inp
  • Group: browser

Interaction to Next Paint (INP) is a metric that assesses a page’s overall responsiveness to user interactions by observing the latency of all click, tap, and keyboard interactions that occur throughout the lifespan of a user’s visit to a page.

How it works

Verifies that the INP metric falls within the recommended range.

Recommendations

  • Optimize JavaScript execution: Minimize and optimize JavaScript code. Remove unnecessary scripts, and consider using code splitting to load only essential JavaScript for the current view.
  • Async and defer attribute for scripts: Use the async or defer attribute when including scripts in your HTML. This allows the browser to continue parsing the HTML without blocking, resulting in a faster page load and reduced INP.
  • Use Web Workers: Consider using Web Workers to offload heavy JavaScript tasks to a separate thread, preventing them from blocking the main thread and improving responsiveness.
  • Optimize third-party scripts: Limit the use of third-party scripts, and if necessary, load them asynchronously. Third-party scripts can significantly contribute to INP if they’re large or cause delays in execution.
  • Minimize DOM size: When a webpage’s Document Object Model (DOM) is small, rendering tasks are usually quick. As the DOM grows, rendering work increases, and not always linearly. A large DOM requires substantial rendering work during the initial page render and makes updates during user interactions expensive, which delays the presentation of the next frame.

Web Vital LCP

  • Identifier: web-vital-lcp
  • Group: browser

Largest Contentful Paint (LCP) measures the time it takes for the largest content element on a page to become visible. Improving LCP contributes significantly to a better user experience.

How it works

Verifies that the LCP metric falls within the recommended range.

Recommendations

  • Minimize server response time: Efficient server responses contribute to faster content rendering. Employ caching mechanisms and consider upgrading your hosting plan for better server performance.
  • Minimize render-blocking resources: Identify and address render-blocking resources, especially JavaScript and CSS files. Utilize techniques such as asynchronous loading or defer loading for non-critical resources.
  • Prioritize above-the-fold content: Ensure that above-the-fold content, the content visible without scrolling, is prioritized for loading. This involves optimizing the delivery of critical resources and minimizing the number of requests needed to render this content.
  • Optimize images and videos: Compress and optimize images to reduce their file size without sacrificing quality. Consider using modern image formats like WebP. Also, lazy load images and videos to defer loading until they are needed, improving initial page load times.
  • Optimize web fonts: Minimize the use of custom web fonts or use the font-display: swap; property to ensure text is visible even before the custom font has fully loaded.
  • Use browser caching: Implement proper caching strategies for static assets. Utilize cache headers to instruct the browser to store and reuse resources locally, reducing the need to download them on subsequent visits.
  • Use a content delivery network (CDN): Deploy a Content Delivery Network to distribute your content across servers globally. This reduces latency and speeds up content delivery, improving LCP.
  • Reduce third-party scripts: Limit the use of third-party scripts as they can introduce additional delays. Only include essential third-party scripts and load them asynchronously to prevent them from blocking rendering.

Web Vital TTFB

  • Identifier: web-vital-ttfb
  • Group: browser

Time to First Byte (TTFB) measures the time between the request for a resource and when the first byte of a response begins to arrive. It’s a foundational web performance metric that precedes every other meaningful user experience metric such as First Contentful Paint and Largest Contentful Paint. A high TTFB value adds time to the metrics that follow it.

Even though it’s a foundational metric, it isn’t a part of Google’s Core Web Vitals. That’s because, depending on the technology you use for your website, a high TTFB metric might not be as significant as other metrics. For example, for Single Page Applications (SPAs), having a low TTFB score is important so that the client can start rendering elements as soon as possible. On the other hand, for a server-rendered site, a high TTFB score might not impact other metrics as much.

How it works

Verifies that the TTFB metric falls within the recommended range.

Recommendations

  • Upgrade hosting plan: Consider upgrading your hosting plan to one that offers better resources, especially if your website is currently on shared hosting. Dedicated or VPS hosting may provide better performance and a lower TTFB.
  • Optimize server configuration: Ensure your server is configured for optimal performance. This includes adjusting server settings, such as connection timeouts and keep-alive settings, to minimize delays.
  • Use a content delivery network (CDN): Utilize a CDN to distribute static assets geographically closer to users. This helps in reducing latency and improving TTFB by serving content from servers located nearer to the user.
  • Implement server-side caching: Implement server-side caching mechanisms to store and serve frequently requested content without re-generating it for each user. This can significantly reduce the TTFB by delivering cached content when possible.
  • Enable compression: Enable compression for your web server to reduce the size of data sent to the browser. Gzip or Brotli compression can be particularly effective in minimizing the time it takes to transmit resources.
  • Optimize database queries: Review and optimize database queries to ensure efficiency. Use indexes where necessary, and minimize the number of database queries required to generate a page.
  • Minimize redirects: Minimize the use of redirects, as each redirect introduces an additional round-trip delay. Where possible, ensure that URLs are structured to minimize unnecessary redirects.
  • Use browser caching: Leverage browser caching by setting appropriate cache headers for static resources. This can reduce the need for the browser to request the same resources on subsequent visits.
  • Use preconnect and prefetch: Implement preconnect and prefetch directives in your HTML to establish early connections to essential third-party domains and prefetch resources needed for subsequent navigation, respectively.

Best Practice

Best Practice audits ensure your scripts and tests adhere to recommended practices for k6 development, leading to more robust and reliable tests.

Metric Names

  • Identifier: metric-names
  • Group: metrics

Metrics measure how a system performs under test conditions over time. A metric name is a unique identifier for a specific metric and is used to distinguish between different types of metrics being collected and reported.

How it works

Analyzes the metrics generated by your k6 script to ensure they are not excessive and that their names adhere to k6 naming conventions.

Common problem sources

  • Producing too many custom metrics.
  • Using metric names that do not follow the ^[a-zA-Z_][a-zA-Z0-9_]{1,128}$ regular expression.

Recommendations

  • Verify your tests don’t produce a large number of custom metrics.
  • Make sure your metric names include only up to 128 ASCII letters, numbers, or underscores, and start with a letter or an underscore.
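To check names before running a test, you could apply the naming pattern directly. This is a minimal sketch that assumes the pattern is `^[a-zA-Z_][a-zA-Z0-9_]{1,128}$`. The helper name and the candidate names are made up for illustration.

```javascript
// Assumed naming pattern: a letter or underscore, followed by
// 1 to 128 ASCII letters, digits, or underscores.
const METRIC_NAME_PATTERN = /^[a-zA-Z_][a-zA-Z0-9_]{1,128}$/;

// Returns true when the name matches the assumed pattern.
function isValidMetricName(name) {
  return METRIC_NAME_PATTERN.test(name);
}

// Made-up candidate names for illustration.
const candidates = ["checkout_duration", "_errors", "2xx_responses", "login-time", "x"];
for (const name of candidates) {
  console.log(name, isValidMetricName(name));
}
```

In this sketch, `2xx_responses` fails because it starts with a digit and `login-time` fails because of the hyphen. The single-character name `x` also fails, because the pattern as written requires at least one character after the first.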

Metric Tags

  • Identifier: metric-tags
  • Group: metrics

Metrics measure how a system performs under test conditions over time. A metric tag, also called a label, is an attribute attached to a metric data stream, providing additional context or dimensions to the metric data. Labels can be used to filter or aggregate metrics based on specific characteristics or conditions.

How it works

Analyzes both built-in and custom metrics, ensuring your k6 script is not producing excessive unique values for metric tags.

Common problem sources

  • The URL might contain a query parameter or unique ID per iteration, such as tokens, session IDs, etc. This often leads to high cardinality for the name tag.
  • Creating metrics in a loop is often a misuse of the metrics API. If you generate new metrics in a loop, confirm that the number of metrics and tag values stays bounded.

Recommendations

  • Use URL grouping to aggregate data in a single URL metric.
  • Use fewer tag values when using custom tags.
  • If you followed all previous recommendations and still get this alert, consider using drop_metrics and drop_tags to reduce the cardinality of your time series.
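The following sketch shows the idea behind URL grouping in plain JavaScript: dynamic path segments and query strings are replaced before the URL is used as a tag value. The `groupUrl` helper and its placeholder names are hypothetical. In a k6 script, the grouped value would typically be passed as the `name` tag in the request parameters, for example `{ tags: { name: groupUrl(url) } }`.

```javascript
// Path segments that look like UUIDs or plain numbers are treated as dynamic.
const UUID_SEGMENT =
  /\/[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}(?=\/|$)/gi;
const NUMERIC_SEGMENT = /\/\d+(?=\/|$)/g;

// Drops the query string and fragment, then replaces dynamic segments
// with fixed placeholders so that many URLs share one tag value.
function groupUrl(url) {
  const withoutQuery = url.split(/[?#]/)[0];
  return withoutQuery
    .replace(UUID_SEGMENT, "/{uuid}")
    .replace(NUMERIC_SEGMENT, "/{id}");
}

// Made-up URLs with a per-iteration ID and session token.
const urls = [
  "https://example.com/users/1/orders?session=abc",
  "https://example.com/users/2/orders?session=def",
  "https://example.com/users/3/orders?session=ghi",
];

const rawNames = new Set(urls);
const groupedNames = new Set(urls.map(groupUrl));
console.log(rawNames.size, groupedNames.size);
```

Here the three raw URLs collapse into a single grouped value, so the tag produces one time series instead of three.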

Third Party Content

  • Identifier: third-party-content
  • Group: ungrouped

Modern applications frequently depend on several third-party vendors, such as Content Delivery Networks (CDNs) and analytics tools. This reliance makes testing real-world systems significantly more challenging. Starting a load test that makes hundreds of thousands of requests may violate the third party’s terms of service (ToS). Additionally, you often have no control over the third party’s performance, which can lead to throttled requests and skewed test results.

How it works

Monitors all requests made by your k6 script to ensure your test doesn’t make excessive requests to third-party resources.
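To get a feel for how much of a script's traffic goes to third parties, you could count requests per hostname and exclude the hosts you own. This is a plain JavaScript sketch, not the audit's implementation. The function names, the sample URLs, and the first-party list are hypothetical, and the hostname extraction doesn't handle URLs that contain credentials.

```javascript
// Extracts the hostname from an absolute URL, or returns null if none is found.
// Simplification: URLs with embedded credentials (user:pass@host) are not handled.
function hostnameOf(url) {
  const match = /^[a-z][a-z0-9+.-]*:\/\/([^/:?#]+)/i.exec(url);
  return match ? match[1].toLowerCase() : null;
}

// Counts requests per hostname, skipping hosts listed as first party.
function countThirdPartyRequests(urls, firstPartyHosts) {
  const counts = {};
  for (const url of urls) {
    const host = hostnameOf(url);
    if (host === null || firstPartyHosts.includes(host)) continue;
    counts[host] = (counts[host] || 0) + 1;
  }
  return counts;
}

// Made-up request log for illustration.
const requestedUrls = [
  "https://example.com/checkout",
  "https://cdn.example.net/app.js",
  "https://cdn.example.net/app.css",
  "https://analytics.example.org/collect?id=1",
];

const thirdParty = countThirdPartyRequests(requestedUrls, ["example.com"]);
console.log(thirdParty);
```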

Common problem sources

  • Your k6 script makes requests to a lot of different domain names.

Recommendations

  • Remove calls to third-party resources from your k6 script.

Version

  • Identifier: version
  • Group: ungrouped

Grafana k6 is an ever-evolving tool. Using a legacy version of k6 that is significantly older than the latest stable version could lead to subtle issues.

How it works

Checks the version of the k6 binary used to run a test, making sure it is not significantly older than the latest stable version.
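This page doesn't define what "significantly older" means. The following sketch uses a hypothetical rule for illustration: a version counts as outdated when its major version is lower than the latest, or when it is more than a chosen number of minor releases behind. The function names, the default lag of three, and the version strings are all assumptions.

```javascript
// Parses versions such as "v0.49.0" or "1.2.3" into numeric parts.
function parseVersion(version) {
  const match = /^v?(\d+)\.(\d+)\.(\d+)/.exec(version);
  if (!match) throw new Error(`Unrecognized version: ${version}`);
  return {
    major: Number(match[1]),
    minor: Number(match[2]),
    patch: Number(match[3]),
  };
}

// Hypothetical rule: outdated when the major version is lower, or when the
// minor version lags the latest by more than maxMinorLag within the same major.
function isSignificantlyOlder(current, latest, maxMinorLag = 3) {
  const c = parseVersion(current);
  const l = parseVersion(latest);
  if (c.major !== l.major) return c.major < l.major;
  return l.minor - c.minor > maxMinorLag;
}

// Illustrative version strings only.
console.log(isSignificantlyOlder("v0.45.0", "v0.50.0"));
console.log(isSignificantlyOlder("v0.49.0", "v0.50.0"));
```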

Common problem sources

  • Running an outdated k6 version.

Recommendations

  • Install the latest stable k6 version.

System

System audits uncover potential problems with the Grafana Cloud k6 load generation infrastructure.

Load Generator CPU Usage

  • Identifier: load-generator-cpu-usage
  • Group: ungrouped

Load Generators are an abstraction on top of k6 for running your cloud tests. Overutilization of these components can skew your test results, producing data that varies unpredictably from test to test.

How it works

Monitors the CPU metric of cloud load generators to ensure it stays within the recommended range.

Common problem sources

  • Lack of sleep times in your scripts. Sleep times help pace and emulate real user behavior.
  • High RPS per VU. When testing API endpoints at scale, consider using more virtual users.
  • Batching large numbers of requests. Requests in a request batch are parallelized to the default or defined limits.
  • Receiving large amounts of data in responses, resulting in high memory utilization and increased garbage collection efforts.
  • A JavaScript exception is being thrown early in VU execution, resulting in an endless restart loop until all CPU cycles are consumed.

Recommendations

  • Increase sleep() times where appropriate.
  • Increase the number of VUs so that each VU produces fewer requests per second (RPS) while the total load stays the same.
  • Use multiple load zones to spread VUs across multiple regions.
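The trade-off between sleep time, RPS per VU, and VU count can be estimated with simple arithmetic. The following sketch assumes each VU runs iterations back to back, and that an iteration's duration is the sum of its response times and sleep times. The function names and the numbers are illustrative only.

```javascript
// Requests per second produced by one VU, given how many requests an
// iteration makes and how long the iteration takes (including sleeps).
function rpsPerVu(requestsPerIteration, iterationSeconds) {
  return requestsPerIteration / iterationSeconds;
}

// Number of VUs needed to reach a target total request rate.
function requiredVus(targetRps, requestsPerIteration, iterationSeconds) {
  return Math.ceil(targetRps / rpsPerVu(requestsPerIteration, iterationSeconds));
}

// 4 requests per iteration, 2 s per iteration without sleep: 2 RPS per VU.
const withoutSleep = requiredVus(100, 4, 2);
// Adding 2 s of sleep doubles the iteration time: 1 RPS per VU.
const withSleep = requiredVus(100, 4, 4);
console.log(withoutSleep, withSleep);
```

In this example, adding sleep halves the request rate of each VU, so twice as many VUs are needed to keep the same total of 100 RPS.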

Load Generator Memory Usage

  • Identifier: load-generator-memory-usage
  • Group: ungrouped

Load Generators are an abstraction on top of k6 for running your cloud tests. Overutilization of these components can skew your test results, producing data that varies unpredictably from test to test.

How it works

Monitors the memory metric of cloud load generators to ensure it stays within the recommended range.

Common problem sources

  • Reading large files from the filesystem.
  • Receiving large amounts of data in responses.

Recommendations

  • Use the discardResponseBodies test option to throw away the response body by default.
  • Use the responseType request parameter to capture only the response bodies you need.
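The two recommendations above are typically combined: discard bodies by default and opt in for the few requests whose body you inspect. The following sketch shows the shape of those settings as plain JavaScript objects. In a k6 script, `options` would be exported with `export const options`, and the per-request object would be passed as the params argument of an HTTP call. The `paramsFor` helper is hypothetical.

```javascript
// Test-wide default: discard response bodies to reduce memory usage.
// In a k6 script this object would be exported as `export const options`.
const options = {
  discardResponseBodies: true,
};

// Hypothetical helper that builds per-request params.
// responseType "text" keeps the body for that request; "none" discards it.
function paramsFor(needsBody) {
  return { responseType: needsBody ? "text" : "none" };
}

// For example, keep the body only for a login request whose token you parse.
const loginParams = paramsFor(true);
const staticAssetParams = paramsFor(false);
console.log(options, loginParams, staticAssetParams);
```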

Reference table

For a quick reference on all audits, refer to the following table:

| Name | Identifier | Category | Group | Weight |
| --- | --- | --- | --- | --- |
| HTTP Failure Rate | http-failure-rate | reliability | http | 2 |
| HTTP Spans Failure Rate | http-spans-failure-rate | reliability | http | 1 |
| Web Vital CLS | web-vital-cls | reliability | browser | 1 |
| Web Vital FCP | web-vital-fcp | reliability | browser | 1 |
| Web Vital FID | web-vital-fid | reliability | browser | 1 |
| Web Vital INP | web-vital-inp | reliability | browser | 1 |
| Web Vital LCP | web-vital-lcp | reliability | browser | 1 |
| Web Vital TTFB | web-vital-ttfb | reliability | browser | 1 |
| Metric Names | metric-names | best-practice | metrics | 5 |
| Metric Tags | metric-tags | best-practice | metrics | 5 |
| Third Party Content | third-party-content | best-practice | ungrouped | 1 |
| Load Generator CPU Usage | load-generator-cpu-usage | system | ungrouped | 1 |
| Load Generator Memory Usage | load-generator-memory-usage | system | ungrouped | 1 |