Optimizing LLMs for Effective Postmortem Writing and Monitoring
Originally published at ssojet
Datadog has integrated structured metadata from its incident management app with Slack messages to create an LLM-driven feature that helps engineers compose incident postmortems. The feature automatically compiles the various sections of the postmortem report, which engineers then review and customize. The team dedicated over 100 hours to refining the report structure and LLM instructions to ensure high-quality output across diverse inputs.
Experimentation with different models, including GPT-3.5 and GPT-4, revealed significant differences in cost, speed, and accuracy. While GPT-4 delivered more precise results, it was also slower and costlier than GPT-3.5. The team ultimately assigned different model versions to different sections, balancing efficiency against quality, and generated the sections in parallel. This strategy reduced the time to generate a complete report from 12 minutes to under 1 minute.
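The article does not publish Datadog's implementation, but the per-section routing and parallel generation can be sketched as follows. The section names, model assignments, and `generate_section` stub are all assumptions for illustration; a real version would call the model provider's API.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical section-to-model routing: accuracy-critical sections get the
# stronger model, simpler sections get the cheaper, faster one.
SECTION_MODELS = {
    "summary": "gpt-4",
    "timeline": "gpt-3.5-turbo",
    "impact": "gpt-3.5-turbo",
}

def generate_section(section: str, model: str, context: str) -> str:
    """Stub for an LLM call; a real implementation would hit the model API."""
    return f"[{model}] draft for {section} from {len(context)} chars of context"

def generate_postmortem(context: str) -> dict:
    # Generating sections concurrently rather than sequentially is the kind of
    # parallelism that cut end-to-end latency from ~12 minutes to under 1.
    with ThreadPoolExecutor() as pool:
        futures = {
            section: pool.submit(generate_section, section, model, context)
            for section, model in SECTION_MODELS.items()
        }
        return {section: f.result() for section, f in futures.items()}
```

Because each section is independent, total latency collapses to roughly that of the slowest single section rather than the sum of all of them.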
A critical aspect of this functionality is the management of trust and privacy. Datadog engineers emphasized the importance of marking AI-generated content clearly to avoid misinterpretation as final drafts. They also implemented mechanisms for secret scanning to protect sensitive data during the LLM processing stage.
"Given the sensitivity of technical incidents, protecting confidential information was paramount. As part of the ingestion API, we implemented secret scanning and filtering mechanisms that scrubbed and replaced suspected secrets with placeholders before feeding data into the LLM."
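The scrub-and-replace step described in the quote can be sketched with pattern-based redaction. The two patterns below are illustrative only; a production scanner would cover many more credential formats (cloud keys, JWTs, connection strings, and so on).

```python
import re

# Illustrative secret patterns; not Datadog's actual rule set.
SECRET_PATTERNS = [
    (re.compile(r"AKIA[0-9A-Z]{16}"), "[REDACTED_AWS_KEY]"),
    (re.compile(r"(?i)(api[_-]?key|token|password)\s*[:=]\s*\S+"),
     r"\1=[REDACTED]"),
]

def scrub_secrets(text: str) -> str:
    """Replace suspected secrets with placeholders before LLM ingestion."""
    for pattern, placeholder in SECRET_PATTERNS:
        text = pattern.sub(placeholder, text)
    return text
```

Running this as part of the ingestion step means the model only ever sees placeholders, never the original values.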
Alongside AI enhancements, postmortem authors can customize templates and receive clear instructions on LLM usage to foster transparency and trust.
Optimizing LLM Use for Cost, Quality, and Safety
Datadog's integration of LLMs into postmortem writing addresses the complexities of documenting incidents while retaining human oversight. This method combines structured metadata from Datadog’s Incident Management app with unstructured discussions from Slack channels to generate preliminary drafts for human authors.
The challenges encountered included data quality, cost-speed-quality trade-offs, and trust and privacy concerns. Accuracy is especially important because the postmortem encapsulates significant organizational learning. Non-deterministic LLM output and hallucinations posed additional challenges, which the team managed by iteratively refining the LLM instructions.
To mitigate these issues, a custom API was developed to extract and structure necessary data, ensuring quick experimentation while maintaining privacy. This framework allowed iterative testing with different datasets and LLM configurations to enhance content quality.
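One way to picture the extraction and structuring step is a payload that merges the incident record with the Slack discussion into a single prompt. The field names and prompt wording here are hypothetical, not Datadog's actual API.

```python
from dataclasses import dataclass

@dataclass
class IncidentContext:
    """Hypothetical shape of the structured payload fed to the LLM."""
    title: str
    severity: str
    timeline: list        # (timestamp, event) pairs from Incident Management
    slack_messages: list  # unstructured channel discussion

def build_prompt(ctx: IncidentContext) -> str:
    # Combine structured metadata with the Slack discussion so the model sees
    # both the official record and the engineers' conversation.
    timeline = "\n".join(f"{ts}: {event}" for ts, event in ctx.timeline)
    messages = "\n".join(ctx.slack_messages)
    return (
        f"Incident: {ctx.title} (severity {ctx.severity})\n"
        f"Timeline:\n{timeline}\n"
        f"Slack discussion:\n{messages}\n"
        "Draft the root-cause section of the postmortem."
    )
```

Keeping the extraction behind one API also gives a single place to apply the secret scrubbing and to swap datasets or LLM configurations during experimentation.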
Postmortems now benefit from both qualitative and quantitative evaluations of AI-generated drafts, comparing them against human-authored versions to assess accuracy, coherence, and coverage.
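A minimal quantitative check of this kind might measure how much of a human-authored draft's vocabulary the AI draft covers. This term-overlap score is an assumption standing in for whatever metrics Datadog actually uses; real evaluations would also assess coherence and factual accuracy, which simple overlap cannot capture.

```python
def coverage_score(ai_draft: str, human_draft: str) -> float:
    """Fraction of terms in the human-authored draft also present
    in the AI-generated draft (crude coverage proxy)."""
    ai_terms = set(ai_draft.lower().split())
    human_terms = set(human_draft.lower().split())
    if not human_terms:
        return 0.0
    return len(ai_terms & human_terms) / len(human_terms)
```

Scores like this are useful for tracking regressions across prompt or model changes, while qualitative review catches errors the numbers miss.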
Addressing Trust and Privacy Issues
Trust in AI-generated content is critical to the postmortem writing process. Datadog implemented strategies to make AI output transparent and credible: LLM-generated drafts are clearly marked as machine-generated, and users can customize and adjust them before finalizing.
Privacy mechanisms were also crucial, with the ingestion API ensuring sensitive information was protected. The AI-generated insights are supported by citations from relevant sources, increasing trust in the information provided.
Datadog Launches LLM Observability to Enhance GenAI Monitoring and Security
Datadog has launched LLM Observability, designed to improve the monitoring and security of generative AI applications. The offering provides enhanced visibility into LLM chains, helping developers identify errors and anomalies in real time.
Kyle Triplett, VP of Product at AppFolio, noted, "The Datadog LLM Observability solution helps our team understand, debug, and evaluate the usage and performance of our GenAI applications."
Key features of LLM Observability include:
- Enhanced Visibility and Monitoring: Detailed insights into LLM operations, allowing for real-time monitoring of operational metrics.
- Quality and Safety Evaluations: Evaluation criteria for AI applications to ensure content integrity and mitigate security risks.
- Integration and Scalability: Seamless integration with Datadog's Application Performance Monitoring (APM) capabilities, supporting platforms like OpenAI and Azure OpenAI.
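The operational metrics in the first bullet can be illustrated with a thin wrapper around an LLM call. This is a generic sketch, not the Datadog SDK; the metric names and the word-count token proxy are assumptions.

```python
import time

def timed_llm_call(model: str, prompt: str, call_fn, metrics: list) -> str:
    """Wrap an LLM call and record the kind of operational metrics an
    observability tool surfaces: latency and rough token counts."""
    start = time.monotonic()
    response = call_fn(prompt)
    metrics.append({
        "model": model,
        "latency_s": time.monotonic() - start,
        "prompt_tokens": len(prompt.split()),      # crude proxy for tokens
        "response_tokens": len(response.split()),
    })
    return response
```

In practice an observability SDK would emit these as traces or spans tied to each step of the LLM chain, rather than appending to an in-process list.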
Yrieix Garnier, VP of Product at Datadog, emphasized the necessity of cost-effective adoption of LLM technologies.
"Datadog LLM Observability provides the deep visibility needed to help teams manage and understand performance, detect drifts or biases, and resolve issues before they have a significant impact."
Public Companies Embracing AI
Many public companies, including Datadog, are actively discussing AI in their earnings calls. Datadog's recent focus on AI includes the introduction of Bits AI and Watchdog, aimed at enhancing operational efficiency through AI tools.
In Q2 2023 earnings calls, "Generative AI" was mentioned 718 times across multiple SaaS companies, indicating a strong industry trend toward integrating AI capabilities.
Olivier Pomel from Datadog noted the transformative potential of AI in developer productivity, stating, “We seek massive improvements in developer productivity that will allow individuals to write more applications and do so faster than ever before.”
Additional Key Use Cases
- HubSpot: HubSpot has discussed integrating AI into its CRM platform, enhancing guided growth for customers.
- CH Robinson: CH Robinson is leveraging AI to modernize its logistics operations, utilizing generative AI for order management and efficiency improvements.
As organizations continue to adapt to AI advancements, those in the IAM sector, such as SSOJet, are well-positioned to provide essential services like secure single sign-on (SSO), multi-factor authentication (MFA), and user management solutions through an API-first platform.
Explore how SSOJet can enhance your authentication processes and contact us for more information at SSOJet.