What aspects should be considered when implementing a front-end monitoring system?

1. Why do we need front-end monitoring

  • Find problems faster
  • Basis for making product decisions
  • Improve the technical depth and breadth of front-end development
  • Provide more possibilities for business expansion

2. Front-end data classification

Front-end data covers a lot of ground: from the PV, UV, and ad clicks that most people care about, to the client’s network environment and login status, to browser and operating system information, and finally to page performance and JS exceptions. All of this data can be collected on the front end.

2.1 Access related data
  • PV/UV: The most basic metrics, PV (page views) and UV (unique visitors)
  • Page source: The referrer of the page, which identifies the entry point into the page
  • Operating system: Knowing the users’ OS distribution helps characterize the user base, especially the split between iOS and Android on mobile.
  • Browser: The share of each browser can be counted, which informs decisions such as whether to keep supporting IE6 and whether to adopt newer technologies (HTML5, CSS3, etc.)
  • Resolution: Provides reference for page design, especially responsive design
  • Login rate: Login users have higher analysis value, and it is very important to guide users to log in.
  • Geographical distribution: The geographical distribution of visiting users, which can be used for operations and activities in different regions.
  • Network type: wifi/3G/2G, which informs whether the product needs to adapt to different network environments
  • Access period: Understand how visits are distributed over time, so that traffic peaks and troughs can be planned for and bandwidth saved
  • Duration of stay: Indicates whether the page content is attractive; most meaningful for pages that take a long time to read.
  • Scroll depth: Similar to duration of stay. On Baidu Encyclopedia, for example, how far down the page a user scrolls directly reflects the quality of the entry
2.2 Performance-related data
  • White screen time: The time from when the user opens the page until something first appears on it.
  • First screen time: The time it takes for all content in the first screen (the initial viewport) of the user’s browser to be displayed
  • User operable time: The time at which the user can perform normal clicks, input, etc.
  • Total page download time: The time it takes for all resources of the page to be loaded and rendered, that is, the page onload time
  • Custom time points: Developers can define their own time points, such as when a component finishes initializing or when an important module finishes loading.
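
Custom time points like these can be recorded with the standard User Timing API (performance.mark and performance.measure). The following is a minimal sketch; the helper name measureStep, the label "component-init", and the loop that stands in for real initialization work are all assumptions made for illustration.

```javascript
// Measure how long a piece of work takes using the User Timing API.
// `measureStep` and the mark names derived from `label` are hypothetical.
function measureStep(label, work) {
  const startMark = `${label}-start`;
  const endMark = `${label}-end`;
  performance.mark(startMark);
  work(); // e.g. a component's init routine
  performance.mark(endMark);
  const returned = performance.measure(label, startMark, endMark);
  // Older engines return undefined from measure(); fall back to the timeline.
  const entry = returned || performance.getEntriesByName(label, "measure").pop();
  return entry.duration;
}

const initDuration = measureStep("component-init", () => {
  let total = 0;
  for (let i = 0; i < 1e5; i++) total += i; // stand-in for real init work
});
console.log(`component-init took ${initDuration.toFixed(2)} ms`);
```

The resulting duration can be attached to whatever report the monitoring script sends.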
2.3 Click related data
  • Total page clicks
  • Clicks per person: This indicator is very important for navigation web pages.
  • Outbound URL: Also for navigation pages, this shows directly where users go when they leave the page.
  • Click time: The time distribution of all of the user’s clicks, which reflects the user’s clicking habits.
  • First click time: Same as above, but only the user’s first click is counted. A very long time to first click may mean the page was stuck and the user could not click for a long time.
  • Click heat map: From the positions where users click, we can draw a click heat map of the entire page, which shows the page’s hot spots at a glance.
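
As a rough illustration of how click positions become a heat map, the sketch below buckets click coordinates into a grid and counts the clicks per cell. The function name buildHeatmap and the 50-pixel cell size are assumptions, not part of any particular tool.

```javascript
// Bucket click coordinates into a grid; each cell counts the clicks inside it.
// `cellSize` (in CSS pixels) is an assumed tuning parameter.
function buildHeatmap(clicks, cellSize = 50) {
  const cells = new Map();
  for (const { x, y } of clicks) {
    const key = `${Math.floor(x / cellSize)},${Math.floor(y / cellSize)}`;
    cells.set(key, (cells.get(key) || 0) + 1);
  }
  return cells;
}

// Browser wiring (skipped outside a browser): record page coordinates of clicks.
const recordedClicks = [];
if (typeof document !== "undefined") {
  document.addEventListener(
    "click",
    (event) => recordedClicks.push({ x: event.pageX, y: event.pageY, time: Date.now() }),
    true
  );
}

// Two clicks fall into cell "0,0" and one into cell "2,0".
const heatmap = buildHeatmap([
  { x: 10, y: 20 },
  { x: 40, y: 45 },
  { x: 120, y: 30 },
]);
console.log(Object.fromEntries(heatmap));
```

The recorded timestamps can also feed the click time and first click time statistics.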
2.4 Exception-related data

Exceptions here means JS exceptions. A JS error in the user’s browser greatly degrades the user experience.

  • Exception message: The most important basis for identifying an exception, such as: e.src is empty or not an object
  • JS file name
  • Line number of the exception
  • The browser in which the exception occurred
  • Stack information: The function call stack is sometimes needed, but note that it can be fairly large and should be truncated
2.5 Other data

In addition to the four basic categories of data above, we can define other statistics based on the actual situation, such as whether the user’s browser supports canvas or, as a more specialized example, how many slides the user flips through in a carousel. All of these can be collected on the front end, and each result reflects the value of front-end data.

3. Performance indicators

  • FP (First Paint): The moment when pixels are first drawn to the screen, including any user-defined background painting.
  • FCP (First Contentful Paint): The time when the browser renders the first piece of DOM content to the screen, which may be text, an image, SVG, etc. This is effectively the white screen time.
  • FMP (First Meaningful Paint): The time it takes for the meaningful content of the page to render.
  • LCP (Largest Contentful Paint): The render time of the largest content element visible in the viewport.
  • DCL (DOMContentLoaded): The DOMContentLoaded event fires when the HTML document has been completely loaded and parsed, without waiting for stylesheets, images, and subframes to finish loading.
  • L (onload): Fires only after all dependent resources have been loaded.
  • TTI (Time to Interactive): Marks the point in time when the app is visually rendered and reliably responds to user input.
  • FID (First Input Delay): The time from the user’s first interaction with the page (clicking a link, clicking a button, etc.) until the browser is able to respond to that interaction.
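
Several of these indicators can be read from the browser’s performance timeline. The sketch below assumes a navigation entry shaped like PerformanceNavigationTiming and a PerformanceObserver for the paint entries; the helper name computeTimings, the choice to measure TTFB from requestStart, and the sample numbers are assumptions for illustration.

```javascript
// Derive stage timings from an object shaped like PerformanceNavigationTiming.
// TTFB is taken here as responseStart - requestStart, which is one common choice.
function computeTimings(nav) {
  return {
    ttfb: nav.responseStart - nav.requestStart,
    domContentLoaded: nav.domContentLoadedEventEnd - nav.startTime,
    load: nav.loadEventEnd - nav.startTime,
  };
}

// Browser wiring (skipped outside a browser).
if (typeof window !== "undefined" && typeof PerformanceObserver !== "undefined") {
  // FP and FCP arrive as "paint" entries named
  // "first-paint" and "first-contentful-paint".
  new PerformanceObserver((list) => {
    for (const entry of list.getEntries()) {
      console.log(entry.name, entry.startTime);
    }
  }).observe({ type: "paint", buffered: true });

  window.addEventListener("load", () => {
    // loadEventEnd is only filled in once the load handlers have finished.
    setTimeout(() => {
      const [nav] = performance.getEntriesByType("navigation");
      if (nav) console.log(computeTimings(nav));
    }, 0);
  });
}

// Made-up sample values, in milliseconds.
const sampleNavigation = {
  startTime: 0,
  requestStart: 20,
  responseStart: 120,
  domContentLoadedEventEnd: 480,
  loadEventEnd: 900,
};
console.log(computeTimings(sampleNavigation)); // ttfb 100, domContentLoaded 480, load 900
```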

4. Front-end monitoring targets (monitoring classification)

4.1 Stability
  • JS errors: JS execution errors or Promise exceptions
  • Resource exceptions: Failures loading script, link, and other resources
  • Interface errors: Failed ajax or fetch requests
  • White screen: The page is blank
4.2 User experience (experience)
  • Loading time: The loading time of each stage
  • TTFB (Time To First Byte): The time from the browser initiating the first request to receiving the first byte of the response. It includes the network request time and the back-end processing time.
  • FP (First Paint): The time when the first pixel is drawn to the screen, including any user-defined background painting.
  • FCP (First Contentful Paint): The time when the browser renders the first piece of DOM content to the screen, which can be any text, image, SVG, etc.
  • FMP (First Meaningful Paint): The first meaningful paint, a measure of page usability.
  • FID (First Input Delay): The time from the user’s first interaction with the page until the page responds to the interaction.
  • Jank (long tasks): Tasks that take longer than 50ms.
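
Long tasks can be observed in browsers that support the "longtask" performance entry type. The following sketch turns such entries into report objects; the helper name pickLongTasks and the report fields are assumptions, and the 50ms threshold comes from the definition above.

```javascript
const LONG_TASK_MS = 50; // threshold from the definition above

// Keep only entries longer than the threshold and shape them as reports.
// The report fields are assumed, not a fixed format.
function pickLongTasks(entries, threshold = LONG_TASK_MS) {
  return entries
    .filter((entry) => entry.duration > threshold)
    .map((entry) => ({
      kind: "experience",
      type: "longTask",
      startTime: entry.startTime,
      duration: entry.duration,
    }));
}

// Browser wiring: only runs where the "longtask" entry type is supported.
if (
  typeof PerformanceObserver !== "undefined" &&
  (PerformanceObserver.supportedEntryTypes || []).includes("longtask")
) {
  new PerformanceObserver((list) => {
    console.log(pickLongTasks(list.getEntries()));
  }).observe({ type: "longtask", buffered: true });
}

const sampleTasks = pickLongTasks([
  { startTime: 0, duration: 30 },
  { startTime: 100, duration: 80 },
]);
console.log(sampleTasks.length); // 1
```

Browsers only emit "longtask" entries for tasks over 50ms, so the filter mainly matters when a stricter threshold is passed in.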
4.3 Business
  • PV: Page views, that is, the number of times a page is viewed
  • UV: Unique visitors, counted here as the number of distinct IP addresses that visit the site.
  • Page residence time: How long the user stays on each page.
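
To make the PV and UV definitions concrete, here is a small sketch that counts both from a list of visit records, treating each distinct IP address as one visitor as described above. The record shape ({ page, ip }) and the function name countPvUv are assumptions; the IP addresses are documentation placeholders.

```javascript
// Count PV (every visit) and UV (distinct IP addresses) per page.
// The { page, ip } record shape is assumed for illustration.
function countPvUv(visits) {
  const perPage = new Map();
  for (const { page, ip } of visits) {
    if (!perPage.has(page)) perPage.set(page, { pv: 0, ips: new Set() });
    const stats = perPage.get(page);
    stats.pv += 1;
    stats.ips.add(ip);
  }
  const result = {};
  for (const [page, { pv, ips }] of perPage) {
    result[page] = { pv, uv: ips.size };
  }
  return result;
}

const pvUv = countPvUv([
  { page: "/home", ip: "192.0.2.1" },
  { page: "/home", ip: "192.0.2.1" },
  { page: "/home", ip: "192.0.2.2" },
  { page: "/detail", ip: "192.0.2.1" },
]);
console.log(pvUv); // /home: pv 3, uv 2; /detail: pv 1, uv 1
```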

5. Front-end monitoring process

  • Instrumentation: Add tracking points to the page
  • Data reporting
  • Analysis and computation: Process and aggregate the collected data
  • Visual display: Present the data along various dimensions
  • Monitoring and alarms: Trigger alarms under certain conditions once a problem is detected

6. Common tracking approaches

6.1 Code-based tracking
  • Code-based tracking instruments the page by embedding tracking code. For example, to monitor a user’s click event, you insert a piece of code in the click handler that either saves the tracked behavior or sends it directly to the server in some data format.
  • The advantage is that exactly the data needed can be sent or saved at any moment.
  • The disadvantage is the heavy workload.
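
A minimal sketch of this approach follows: the business code calls a tracking function by hand at the point of interest. The names createTracker, track, onBuyButtonClick, and the "/monitor/collect" endpoint are hypothetical; navigator.sendBeacon is used when available, and an in-memory array stands in for the server elsewhere.

```javascript
// Build a `track` function around an injected `send` transport.
function createTracker(send) {
  return function track(eventName, data = {}) {
    const payload = { event: eventName, timestamp: Date.now(), ...data };
    send(payload);
    return payload;
  };
}

// Transport: sendBeacon in a browser, an in-memory list otherwise.
// "/monitor/collect" is a hypothetical endpoint.
const sentEvents = [];
const track = createTracker((payload) => {
  if (typeof navigator !== "undefined" && typeof navigator.sendBeacon === "function") {
    navigator.sendBeacon("/monitor/collect", JSON.stringify(payload));
  } else {
    sentEvents.push(payload);
  }
});

// The tracking call is inserted by hand inside the business handler.
function onBuyButtonClick() {
  // ... business logic ...
  track("click", { target: "buy-button" });
}

onBuyButtonClick();
```

Every tracked interaction needs a call like this, which is where the workload comes from.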
6.2 Visual tracking
  • Use visual interaction in place of hand-written tracking code.
  • Keep the business code and the tracking code separate, and provide a visual interface that takes the business code as input. Through this visual system you can define tracking events on the business code, and the final output combines the business code with the tracking code.
  • In effect, visual tracking uses a system to replace the manual insertion of tracking code.
6.3 Codeless tracking
  • Every event on the front end is bound to an identifier, and all events are recorded.
  • The recorded files are uploaded periodically and then analyzed to extract the data we want and to generate visual reports for specialists to study.
  • The advantage of codeless tracking is that it collects the full set of data, so no tracking point is missed or left out by mistake.
  • The disadvantage is that it puts more pressure on data transmission and on the servers, and the data structure cannot be flexibly customized.
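
One possible way to bind every event to an identifier is to derive a selector path for the event target and record it from a single global listener. The sketch below does this for clicks only; the helper name buildSelector and the selector format (id, else class, else tag name) are assumptions.

```javascript
// Turn a list of elements (outermost first) into a selector-like identifier.
// Prefers #id, then .class, then the tag name.
function buildSelector(elements) {
  return elements
    .map((el) => {
      if (el.id) return `#${el.id}`;
      const classes = typeof el.className === "string" ? el.className.trim() : "";
      if (classes) return "." + classes.split(/\s+/).join(".");
      return el.tagName;
    })
    .join(" ");
}

// Browser wiring (skipped outside a browser): record every click on the page.
const recordedEvents = [];
if (typeof document !== "undefined") {
  document.addEventListener(
    "click",
    (event) => {
      // composedPath() runs from the target outward, so keep elements and reverse.
      const path = event.composedPath().filter((node) => node.nodeType === 1).reverse();
      recordedEvents.push({
        type: "click",
        selector: buildSelector(path),
        timestamp: Date.now(),
      });
    },
    true
  );
}

const exampleSelector = buildSelector([
  { tagName: "HTML" },
  { tagName: "BODY" },
  { tagName: "DIV", id: "container" },
  { tagName: "DIV", className: "content" },
  { tagName: "INPUT" },
]);
console.log(exampleSelector); // HTML BODY #container .content INPUT
```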

7. Write monitoring collection script

7.1 Monitoring errors
  • Error classification

    • JS errors
    • Promise exceptions
  • Resource exceptions

    • Listen for error events
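
The three error types above can be captured with two global listeners. The following is a sketch under stated assumptions: the helper names buildErrorReport and installErrorHandlers, the errorType values "promiseError" and "resourceError", and the ten-line stack limit are illustrative; the remaining field names follow the jsError structure in section 7.2.

```javascript
const MAX_STACK_LINES = 10; // assumed limit to keep reports small

function buildErrorReport(errorType, { message, filename = "", lineno = 0, colno = 0, stack = "" }) {
  return {
    kind: "stability",
    type: "error",
    errorType,
    message,
    filename,
    position: `${lineno}:${colno}`,
    stack: stack.split("\n").slice(0, MAX_STACK_LINES).join("\n"),
    timestamp: Date.now(),
  };
}

// `target` is normally `window`; `report` receives each report object.
function installErrorHandlers(target, report) {
  target.addEventListener(
    "error",
    (event) => {
      const el = event.target;
      if (el && el !== target && (el.src || el.href)) {
        // Resource loading failure (script, link, img, ...).
        report({
          kind: "stability",
          type: "error",
          errorType: "resourceError",
          filename: el.src || el.href,
          tagName: el.tagName,
          timestamp: Date.now(),
        });
      } else {
        // Runtime JS error.
        report(
          buildErrorReport("jsError", {
            message: event.message,
            filename: event.filename,
            lineno: event.lineno,
            colno: event.colno,
            stack: event.error && event.error.stack ? event.error.stack : "",
          })
        );
      }
    },
    true // capture phase: resource error events do not bubble
  );

  target.addEventListener("unhandledrejection", (event) => {
    const reason = event.reason;
    report(
      buildErrorReport("promiseError", {
        message: reason instanceof Error ? reason.message : String(reason),
        stack: reason instanceof Error && reason.stack ? reason.stack : "",
      })
    );
  });
}

// Browser wiring (skipped outside a browser).
if (typeof window !== "undefined") {
  installErrorHandlers(window, (report) => console.log(report));
}
```

Registering the error listener in the capture phase is what lets one handler see both script errors and failed resource loads.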
7.2 Data structure design
  • jsError
let info = {
  title: "Front-end monitoring system", // Page title
  url: "http://localhost:8080", // Page URL
  timestamp: "1212121212121212", // Time of the visit
  userAgent: "chrome", // User's browser
  kind: "stability", // Top-level category
  type: "error", // Subcategory
  errorType: "jsError", // Error type
  message: "uncaught TypeError:blablabla", // Error message
  filename: "http://localhost:8080/", // File in which the error occurred
  position: "0:0", // Line and column
  stack: "btn Click (http://localhost:8080)", // Stack information
  selector: "HTML BODY #container .content INPUT", // Selector
};
  • Interface (API) exception
let info = {
  title: "Front-end monitoring system", // Page title
  url: "http://localhost:8080", // Page URL
  timestamp: "1212121212121212", // Time of the visit
  userAgent: "chrome", // User's browser
  kind: "stability", // Top-level category
  type: "xhr", // Subcategory
  eventType: "load", // Event type
  pathname: "/success", // Request path
  status: "200-OK", // Status code and status text
  duration: "5", // Duration
  response: "hahah", // Response body
  params: "parameters", // Request parameters
};
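
A report like the one above could be produced by wrapping XMLHttpRequest’s open and send methods. This is a sketch, not a complete implementation: the helper name patchXhr, the _monitorMeta property, and the 200-character response limit are assumptions, and the raw request URL is stored under pathname without further parsing.

```javascript
// Wrap open/send on an XMLHttpRequest-like class and report each finished request.
// `now` is injectable so durations can be checked deterministically.
function patchXhr(XhrClass, report, now = Date.now) {
  const originalOpen = XhrClass.prototype.open;
  const originalSend = XhrClass.prototype.send;

  XhrClass.prototype.open = function (method, url, ...rest) {
    this._monitorMeta = { method, url: String(url) }; // hypothetical property name
    return originalOpen.call(this, method, url, ...rest);
  };

  XhrClass.prototype.send = function (body) {
    const meta = this._monitorMeta;
    if (meta) {
      const start = now();
      const reportAs = (eventType) => () => {
        // responseText is only readable for the "" and "text" response types.
        const isText = !this.responseType || this.responseType === "text";
        report({
          kind: "stability",
          type: "xhr",
          eventType,
          pathname: meta.url,
          status: `${this.status}-${this.statusText}`,
          duration: now() - start,
          response: isText ? String(this.responseText || "").slice(0, 200) : "",
          params: typeof body === "string" ? body : "",
        });
      };
      this.addEventListener("load", reportAs("load"));
      this.addEventListener("error", reportAs("error"));
      this.addEventListener("abort", reportAs("abort"));
    }
    return originalSend.call(this, body);
  };
}

// Browser wiring (skipped outside a browser).
if (typeof XMLHttpRequest !== "undefined") {
  patchXhr(XMLHttpRequest, (report) => console.log(report));
}
```

If reports are themselves uploaded with XMLHttpRequest, those upload requests need to be excluded, otherwise each report triggers another report.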
  • White screen

    • screen: Returns the screen object of the current window, which exposes the screen-related properties of the current rendering window.
    • innerWidth: Read-only window property that returns the inner width of the window in pixels
    • innerHeight: The inner height of the window (the height of the layout viewport)
    • layout_viewport
    • elementsFromPoint: Returns all elements at the specified coordinates in the current viewport, ordered from the innermost (topmost) element outward.
let info = {
  title: "Front-end monitoring system", // Page title
  url: "http://localhost:8080/", // Page URL
  timestamp: "1239404040404044", // Time of the visit
  userAgent: "chrome", // User's browser
  kind: "stability", // Top-level category
  type: "blank", // Subcategory
  emptyPoints: "0", // Number of empty sample points
  screen: "2049 * 1152", // Screen resolution
  viewPoint: "2048 * 994", // Viewport size
  selector: "HTML BODY #container", // Selector
};
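
One way to fill in emptyPoints is to sample points across the viewport with elementsFromPoint and count those where the topmost element is only a wrapper. The sketch below assumes that html, body, and #container are the wrappers, that 9 points are sampled on each center line, and that the page counts as blank when every sampled point is empty; all names and thresholds are illustrative.

```javascript
// Elements that do not count as real content (assumed list).
const WRAPPER_SELECTORS = ["html", "body", "#container"];

function isWrapper(element) {
  if (!element) return true; // nothing at this point counts as empty
  const tag = element.tagName ? element.tagName.toLowerCase() : "";
  if (WRAPPER_SELECTORS.includes(tag)) return true;
  return Boolean(element.id) && WRAPPER_SELECTORS.includes(`#${element.id}`);
}

// Sample `count` points on the horizontal and on the vertical center line.
function samplePoints(width, height, count = 9) {
  const points = [];
  for (let i = 1; i <= count; i++) {
    points.push({ x: (width * i) / (count + 1), y: height / 2 });
    points.push({ x: width / 2, y: (height * i) / (count + 1) });
  }
  return points;
}

// `topElementAt(x, y)` returns the topmost element at a point.
function countEmptyPoints(points, topElementAt) {
  return points.filter(({ x, y }) => isWrapper(topElementAt(x, y))).length;
}

// Browser wiring (skipped outside a browser).
if (typeof window !== "undefined" && typeof document !== "undefined") {
  window.addEventListener("load", () => {
    const points = samplePoints(window.innerWidth, window.innerHeight);
    const emptyPoints = countEmptyPoints(
      points,
      (x, y) => document.elementsFromPoint(x, y)[0]
    );
    if (emptyPoints === points.length) {
      console.log({
        kind: "stability",
        type: "blank",
        emptyPoints: String(emptyPoints),
        screen: `${window.screen.width} * ${window.screen.height}`,
        viewPoint: `${window.innerWidth} * ${window.innerHeight}`,
      });
    }
  });
}
```

The threshold for calling a page blank is a judgment call; a looser rule reports more pages, a stricter one misses partially rendered pages.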

The whole process can be roughly divided into four stages: collection, storage, analysis, and alarm.

Collection stage: Collect exception logs, do some processing locally, and report them to the server according to a reporting strategy.

Storage stage: The back end receives the exception logs reported by the front end and, after some processing, stores them according to a storage strategy.

Analysis stage: Divided into automatic analysis and manual analysis. Automatic analysis counts and filters the stored logs using preset conditions and algorithms, discovers problems, and triggers alarms. Manual analysis provides a visual data panel so that system users can inspect specific log data and find the root cause of a problem from it.

Alarm stage: Divided into alarms and early warnings. Alarms are sent automatically through configured channels, by severity level and according to triggering rules. Early warnings predict an abnormality and warn before it occurs.

Performance monitoring: Using the Resource Timing API and the Performance Timing API, many important indicators can be calculated, such as the start time used as the reference for page performance measurements, first screen time, etc.

Exception monitoring: Front-end exception capture is divided into global capture and local capture. Local capture supplements global capture for special cases, but it is scattered and hard to manage. I would therefore choose global capture, that is, write the capture code in one place through a global interface. In an actual project I would use badjs-report, which overrides window.onerror to report exceptions, so no error-capturing code needs to be written by hand.

Front-end tracking: With manual tracking, monitoring logic is inserted wherever monitoring is needed, but the workload can be huge. With codeless tracking, the front end automatically collects all events and reports the tracking data, but the drawback is heavy pressure on the server. I would lean toward declarative tracking, which decouples the tracking code from the business logic: you only care about the controls that need tracking and declare the tracking data they require, mainly to reduce the cost of tracking. The tracking information is added to the DOM element, for example:

<!-- key is the unique identifier of the tracking point; act is the tracking method -->
<button data-stat="{key:'buttonKey', act: 'click'}">Tracked button</button>
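
A declaration like this has to be read back by the tracking script. The value shown is not strict JSON (bare keys, single quotes), so the sketch below converts it leniently before parsing; the helper name parseStat is an assumption, and the conversion is deliberately simple and would not cope with quotes or colons inside values.

```javascript
// Parse a data-stat value such as {key:'buttonKey', act: 'click'}.
// Returns null when the value cannot be understood.
function parseStat(raw) {
  const json = raw
    .replace(/'/g, '"') // single quotes to double quotes
    .replace(/([{,]\s*)([A-Za-z_$][\w$]*)\s*:/g, '$1"$2":'); // quote bare keys
  try {
    return JSON.parse(json);
  } catch {
    return null;
  }
}

// Browser wiring (skipped outside a browser): one delegated listener
// handles every element that declares data-stat with act 'click'.
if (typeof document !== "undefined") {
  document.addEventListener("click", (event) => {
    const el = event.target instanceof Element ? event.target.closest("[data-stat]") : null;
    if (!el) return;
    const stat = parseStat(el.dataset.stat);
    if (stat && stat.act === "click") {
      console.log("track", stat.key); // replace with a real report call
    }
  });
}

const parsedStat = parseStat("{key:'buttonKey', act: 'click'}");
console.log(parsedStat.key, parsedStat.act); // buttonKey click
```

Writing the attribute as strict JSON would remove the need for the lenient conversion.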

Monitoring alarms: I think the most convenient and efficient approach is to hook into the company’s internal alarm group, especially at Alibaba, where there seems to be a ready-made tool for everything. After that, you may need to consider the alarm thresholds and when alarms should fire.

Performance: Using the Performance API, you can get many important indicators, such as the start time used as the reference for page performance measurements, first screen time, etc.

Error reporting: Use onerror and onunhandledrejection, or even try/catch

Operation behavior: Patch the event dispatch functions, or add listeners for specific events

PV/UV: Use browser storage or cookies, the IP address, etc. to identify the user, and send that information along with the request

Device information: Read navigator.userAgent

PV and UV are ever-increasing counters, so they can be recorded with Redis or similar and persisted to the database periodically if necessary. The rest is large volumes of text, which can be consumed through a mature message queue. Because the workload is write-heavy, consider separating reads from writes.

Technical difficulties:

Perhaps the most complicated part of the whole system is how to upload monitoring data efficiently and sensibly. Besides the error information itself, user operation logs also need to be recorded. If every log were reported immediately, it would amount to a self-inflicted DDoS attack. So you need to consider how to store logs on the front end, how to upload them, how to organize them before uploading, and so on.

The collection process on the front end may affect the user experience.

The back end must use appropriate tools to collect the incoming logs, and must decide what to keep and what to drop when the data volume is large.

Possible options

  • Store logs in IndexedDB: it has a large capacity and an asynchronous API, so there is no need to worry about blocking the page.
  • Organize logs in a Web Worker, for example by labeling and classifying each log.
  • Do log reporting in the Web Worker as well, deciding by importance and urgency whether to report a log immediately or delay it.
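
The immediate-versus-delayed decision can be sketched with a small batching queue. For brevity this version keeps pending logs in memory rather than in IndexedDB and runs on the main thread rather than in a Web Worker; the name createLogQueue, the urgent flag, and the batch size of 10 are assumptions.

```javascript
// Queue logs and send them in batches; urgent logs flush immediately.
// `send` receives an array of logs (for example, to upload in one request).
function createLogQueue(send, { batchSize = 10 } = {}) {
  const pending = [];

  function flush() {
    if (pending.length === 0) return;
    send(pending.splice(0, pending.length));
  }

  function push(log, { urgent = false } = {}) {
    pending.push({ ...log, timestamp: Date.now() });
    if (urgent || pending.length >= batchSize) flush();
  }

  return { push, flush, size: () => pending.length };
}

const sentBatches = [];
const logQueue = createLogQueue((batch) => sentBatches.push(batch), { batchSize: 3 });

// Browser wiring (skipped outside a browser): flush what is left
// when the page is hidden, so pending logs are not lost.
if (typeof document !== "undefined") {
  document.addEventListener("visibilitychange", () => {
    if (document.visibilityState === "hidden") logQueue.flush();
  });
}

logQueue.push({ type: "click", target: "nav" }); // waits in the queue
logQueue.push({ type: "error", message: "boom" }, { urgent: true }); // flushes both
console.log(sentBatches.length, logQueue.size()); // 1 0
```

Batching trades a short delay for far fewer requests, which is the point of not reporting every log immediately.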