Key Takeaways
- Unidirectional sync creates status drift, duplicate work, and lost context between security and IT operations
- Bidirectional sync introduces race conditions that naive implementations can't handle — concurrent updates from both systems will corrupt data
- Event-driven architecture with idempotent operations and vector clocks solves the coordination problem
- Eventual consistency with a full audit trail beats strong consistency for cross-system ticket sync
- Edge cases (reopened tickets, bulk ops, field mapping conflicts) are where most sync implementations fail
The Problem: Your Tickets Are Lying to You
Here's a scenario every security operations team has lived through: Your SIEM fires an alert at 2:14 AM. The on-call analyst investigates, determines it's a true positive, and creates a ticket in ServiceNow. They add their initial findings, set the priority, and assign it to the incident response team.
By 9:00 AM, three things have happened: the IR team updated the ticket in ServiceNow with containment steps, the analyst added new IOCs to the SIEM case, and the client-facing team posted a status update in ConnectWise. None of these systems know about the others' updates. The ServiceNow ticket says "containment in progress." The SIEM case still shows "investigating." The client portal shows "acknowledged." Three different truths. Zero confidence in any of them.
This isn't an edge case. This is Tuesday.
Why Unidirectional Sync Is a Trap
The obvious first solution is one-way sync. SIEM creates alert → push to ticketing system. Simple, right? The problem is that security workflows aren't linear. They're collaborative and bidirectional by nature.
Status Drift
When updates only flow in one direction, the source system has no idea what happened downstream. An analyst closes a ticket in ServiceNow after remediation. The SIEM case stays open. A week later, someone sees the open SIEM case, panics, and starts a duplicate investigation into an incident that was resolved days ago. Wasted hours. Wasted trust.
Duplicate Work
Without bidirectional context flow, analysts working in different systems duplicate effort. The SIEM analyst enriches the same IOCs that the ticket assignee already researched. Nobody knows because the enrichment lives in a different system. For an MSSP managing 50+ clients, this duplication compounds into hundreds of wasted analyst hours per month.
Lost Context
The most damaging failure: context that exists in one system never reaches the people who need it. An analyst adds a critical note to the SIEM case — "this host was reimaged last week, check if the malware persisted through reimage" — but the IR team working in ServiceNow never sees it. They waste four hours on the wrong containment strategy.
The Race Condition Problem
Bidirectional sync is the obvious answer. It's also where most implementations fail spectacularly. The fundamental challenge: two systems can update the same logical entity at the same time.
The Classic Race Condition
Consider this sequence:
- T=0ms: Analyst A updates ticket status to "Investigating" in ServiceNow
- T=50ms: Analyst B updates the same case status to "Escalated" in Splunk
- T=200ms: Sync process picks up the ServiceNow change, pushes "Investigating" to Splunk
- T=250ms: Sync process picks up the Splunk change, pushes "Escalated" to ServiceNow
- T=400ms: Sync process sees ServiceNow changed again, pushes "Escalated" to Splunk (no-op)
- T=450ms: Sync process sees Splunk changed again, pushes "Investigating" back to ServiceNow
You now have an infinite loop. Both systems flip-flop between states. Welcome to sync hell.
The Echo Problem
Even without concurrent updates, naive bidirectional sync creates echoes. System A updates → sync pushes to System B → System B's webhook fires → sync pushes back to System A → System A's webhook fires → infinite loop. Every bidirectional sync implementation must solve echo suppression as a first-class concern.
The Architecture: Event-Driven with Idempotent Operations
Here's how we solve this at Quandry Labs. The architecture has four key components:
1. Event Sourcing with Origin Tracking
Every change event carries metadata about its origin. When System A generates an update, the event includes a source identifier and a correlation ID. When the sync layer pushes that update to System B, the resulting webhook event from System B carries the same correlation ID. The sync layer recognizes the echo and drops it.
{
  "event_id": "evt_8f3a2b1c",
  "correlation_id": "corr_abc123",
  "source_system": "servicenow",
  "source_action": "user_update",
  "entity_type": "incident",
  "entity_id": "INC0012345",
  "field": "state",
  "old_value": "new",
  "new_value": "in_progress",
  "timestamp": "2026-02-28T14:32:01.847Z",
  "actor": "[email protected]"
}
The correlation_id follows the event through its entire lifecycle. When the sync layer writes to Splunk and Splunk fires a webhook, we tag that write with the correlation ID. When the webhook arrives back, we check: "Did I cause this change?" If yes, drop it. Echo suppressed.
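The echo check itself is small. Here is a minimal sketch of the idea; the `EchoSuppressor` class and method names are illustrative, not a real API:

```python
class EchoSuppressor:
    """Tracks correlation IDs for writes this sync layer originated."""

    def __init__(self):
        self._pending = set()

    def record_outbound(self, correlation_id: str) -> None:
        # Called just before the sync layer writes to a target system.
        self._pending.add(correlation_id)

    def is_echo(self, correlation_id: str) -> bool:
        # Called when a webhook arrives: if we caused this change, drop it.
        if correlation_id in self._pending:
            self._pending.discard(correlation_id)
            return True
        return False


suppressor = EchoSuppressor()
suppressor.record_outbound("corr_abc123")  # sync layer writes to Splunk
assert suppressor.is_echo("corr_abc123")   # Splunk's webhook fires: echo, drop
assert not suppressor.is_echo("corr_xyz999")  # genuine user update: process
```

In production the pending set needs a TTL (webhooks can be delayed or lost), but the core contract is the same: tag the write, recognize the tag on the way back.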
2. Vector Clocks for Conflict Resolution
When two legitimate (non-echo) updates happen concurrently, you need a conflict resolution strategy. We use a simplified vector clock approach: each system maintains a logical timestamp for every synced entity. When a conflict is detected (two systems updated the same field with different values since the last sync), the resolution follows a deterministic priority order.
For security operations, the priority order is typically:
- Escalation wins: If one update raises severity/priority and the other doesn't, the escalation wins
- Human over automation: A manual analyst update takes priority over an automated status change
- Latest timestamp with audit: When neither rule applies, last-write-wins with a full audit trail so nothing is lost
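The three rules above reduce to a short deterministic function. This is a sketch under assumed field names (`severity_raised`, `actor_type`, `timestamp`), not a production schema:

```python
def resolve_conflict(a: dict, b: dict) -> dict:
    """Pick the winning update per the priority rules.

    Each update carries 'severity_raised' (bool), 'actor_type'
    ('human' or 'automation'), and a numeric 'timestamp'.
    """
    # Rule 1: escalation wins
    if a["severity_raised"] != b["severity_raised"]:
        return a if a["severity_raised"] else b
    # Rule 2: human over automation
    if a["actor_type"] != b["actor_type"]:
        return a if a["actor_type"] == "human" else b
    # Rule 3: last-write-wins; the losing update still goes to the audit trail
    return a if a["timestamp"] >= b["timestamp"] else b
```

Because the function is deterministic, both sides of the sync converge on the same winner regardless of which system's event arrives first.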
3. Idempotent Write Operations
Every sync operation must be idempotent. If the sync layer crashes and replays the last 100 events, the end state must be identical to processing them once. This means:
- Updates are expressed as "set field X to value Y" not "increment field X"
- Every write operation checks current state before applying changes
- Duplicate event IDs are detected and dropped
- Operations that would result in no change are no-ops (no webhook triggered, no audit entry)
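Those four properties can be captured in a single apply function. A minimal sketch, assuming events shaped like the JSON example above:

```python
def apply_event(state: dict, seen: set, event: dict) -> bool:
    """Apply a sync event idempotently. Returns True if state changed.

    'state' maps entity_id -> {field: value}; 'seen' holds processed
    event_ids. Replaying the same events leaves state unchanged.
    """
    if event["event_id"] in seen:  # duplicate delivery: drop
        return False
    seen.add(event["event_id"])
    entity = state.setdefault(event["entity_id"], {})
    if entity.get(event["field"]) == event["new_value"]:
        return False  # no-op: no webhook triggered, no audit entry
    entity[event["field"]] = event["new_value"]  # absolute set, not increment
    return True
```

Replaying the last 100 events through this function produces the same final state as processing them once, which is exactly the crash-recovery guarantee the architecture needs.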
4. The Sync Ledger
Every event, every conflict resolution, every dropped echo gets logged in a central sync ledger. This serves three purposes:
- Debugging: When something looks wrong, you can trace exactly what happened and why
- Audit compliance: For regulated environments, you have a complete chain of custody for every data change
- Recovery: If a system goes down and comes back with stale data, you can replay the ledger to bring it current
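A ledger can start as little more than an append-only log with a replay filter. A hypothetical in-memory sketch (a real one would persist to durable storage):

```python
import time


class SyncLedger:
    """Append-only record of every sync decision."""

    def __init__(self):
        self.entries = []

    def log(self, kind: str, event: dict, note: str = "") -> None:
        # kind: "applied" | "echo_dropped" | "conflict_resolved" (illustrative)
        self.entries.append({
            "ts": time.time(),
            "kind": kind,
            "event": event,
            "note": note,
        })

    def replay_for(self, entity_id: str) -> list:
        # Recovery/debugging: re-derive one entity's history from the ledger.
        return [e for e in self.entries
                if e["event"].get("entity_id") == entity_id]
```

The `replay_for` query is what answers "why does my ticket say X?" in seconds: every decision that touched the entity is there, in order, with a reason attached.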
Handling the Edge Cases
The core architecture handles 90% of sync scenarios. The remaining 10% is where teams spend 90% of their debugging time.
Reopened Tickets
A ticket is closed in both systems. Three days later, it's reopened in ServiceNow because the issue recurred. The SIEM case is also closed. Do you reopen the SIEM case? Create a new one? Link the new ticket to the old case?
Our approach: reopening creates a new linked entity rather than modifying the closed one. The historical record stays intact. The new case/ticket carries a reference to its predecessor. This avoids polluting closed-case metrics and preserves the audit trail of the original incident.
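In code, "new linked entity" means the closed record is never mutated; the successor just carries a back-reference. A hypothetical sketch (field names are illustrative):

```python
import itertools

_ids = itertools.count(1)  # stand-in for the target system's ID generator


def reopen(closed_ticket: dict) -> dict:
    """Create a new linked ticket; the closed record stays untouched."""
    assert closed_ticket["state"] == "closed"
    return {
        "id": f"INC{next(_ids):07d}",
        "state": "open",
        "predecessor": closed_ticket["id"],  # link back to the original incident
        "summary": closed_ticket["summary"],
    }
```

The predecessor link is what lets analysts walk the chain of recurrences without the original incident's closure metrics ever being rewritten.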
Bulk Operations
An analyst selects 30 tickets in ServiceNow and bulk-updates them to "Resolved." If each generates an individual sync event, you'll hit API rate limits on the target system. Worse, if 15 succeed and 15 fail, you have a partial sync that's harder to debug than a total failure.
Solution: bulk operations are batched into a single transactional sync event. Either all 30 sync or none do. The sync layer uses the target system's bulk API when available, falling back to sequential writes with automatic rollback on failure.
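The sequential-write fallback with rollback looks roughly like this. A sketch, assuming caller-supplied `write` and `rollback` callables (external APIs rarely offer true transactions, so this is compensation, not a real transaction):

```python
def sync_bulk(events: list, write, rollback) -> bool:
    """All-or-nothing fallback when the target has no bulk API.

    'write' applies one event to the target; 'rollback' undoes one.
    On any failure, already-written events are undone in reverse order.
    """
    done = []
    try:
        for ev in events:
            write(ev)
            done.append(ev)
        return True
    except Exception:
        for ev in reversed(done):
            rollback(ev)
        return False
```

Either all 30 land or the target is restored to its prior state, so a partial sync never survives past the batch.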
Field Mapping Mismatches
ServiceNow has 7 incident states. Splunk has 5. Jira has 4 (or however many your team configured). Mapping between them isn't always 1:1. "Awaiting Customer" in ServiceNow has no equivalent in Splunk. "Suppressed" in Splunk has no equivalent in ServiceNow.
We handle this with a canonical state model. The sync layer maintains an internal state machine that both systems map to. When a state exists in one system but not the other, the canonical model maps it to the closest equivalent and logs the mapping decision. Analysts can see that "ServiceNow: Awaiting Customer" mapped to "Splunk: In Progress (pending external)" and understand why.
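A canonical model is, at its simplest, two lookup tables plus a logged translation. The mappings below are hypothetical examples, not the real state tables of either product:

```python
# System state -> canonical state (illustrative subset)
TO_CANONICAL = {
    ("servicenow", "Awaiting Customer"): "pending_external",
    ("servicenow", "In Progress"): "in_progress",
    ("splunk", "In Progress"): "in_progress",
}
# Canonical state -> closest equivalent in the target system
FROM_CANONICAL = {
    ("splunk", "pending_external"): "In Progress",
    ("splunk", "in_progress"): "In Progress",
}


def translate(state: str, src: str, dst: str, log: list) -> str:
    """Map a source state to the target via the canonical model, logging why."""
    canonical = TO_CANONICAL[(src, state)]
    mapped = FROM_CANONICAL[(dst, canonical)]
    log.append(f"{src}: {state} -> {dst}: {mapped} (via canonical '{canonical}')")
    return mapped
```

Routing every translation through the canonical layer means adding a new target system is one new pair of tables, not N new pairwise mappings.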
Schema Evolution
ServiceNow ships quarterly updates. Custom fields get added. Required fields change. Your sync can't break every time an admin adds a field to the incident form.
The sync layer operates on a defined contract: a specific set of fields it cares about. New fields are ignored until explicitly mapped. Removed required fields trigger an alert to the operations team rather than a silent failure. The sync degrades gracefully — partial sync is better than no sync.
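The contract can be enforced with a simple projection. A sketch with assumed field names; `alert` is any callable that notifies the operations team:

```python
CONTRACT = {"state", "priority", "assigned_to"}  # fields the sync cares about
REQUIRED = {"state"}                             # absence triggers an alert


def filter_payload(raw: dict, alert) -> dict:
    """Project an incoming record onto the sync contract.

    Unknown fields (e.g. a new custom field) are silently ignored;
    missing required fields raise an alert instead of failing silently.
    """
    missing = REQUIRED - raw.keys()
    if missing:
        alert(f"required fields missing from source payload: {sorted(missing)}")
    return {k: v for k, v in raw.items() if k in CONTRACT}
```

An admin adding `u_custom_field` to the incident form changes nothing here; removing a required field produces an alert and a degraded-but-running sync rather than a crash.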
Real-World Example: Splunk to ServiceNow and Back
Here's the flow for a production implementation we've built:
- Detection: Splunk correlation search fires, creating a Notable Event
- Initial sync: Event-driven trigger creates a ServiceNow Security Incident with mapped fields (severity, description, affected assets, initial IOCs)
- Enrichment loop: As the analyst enriches the Notable Event in Splunk (adding IOCs, running adaptive responses), new context syncs to the ServiceNow ticket as work notes
- Bidirectional updates: IR team updates containment status in ServiceNow → syncs back to Splunk as a Notable Event status change and comment
- Resolution: Ticket closed in ServiceNow → Notable Event disposition set to "True Positive: Resolved" in Splunk → detection metrics update automatically
- Feedback: Post-incident, any tuning recommendations added to ServiceNow's problem record sync back to Splunk as suppression rules or correlation search modifications
Total manual intervention required: zero. Analyst time saved per incident: 15-25 minutes. Across 200+ incidents per month for a mid-size MSSP, that's 50-80 analyst hours recovered — every month.
Implementation Principles
If you're building bidirectional sync (or evaluating someone who is), here are the principles that separate robust implementations from fragile ones:
- Eventual consistency over strong consistency. You don't need real-time. You need correct-within-seconds. Trying to force strong consistency across external APIs leads to locking, timeouts, and cascading failures.
- Assume failure. APIs go down. Webhooks get lost. Rate limits hit. Every sync operation must be retryable, and the system must detect and recover from partial sync states automatically.
- Audit everything. The sync ledger isn't optional. When an analyst asks "why does my ticket say X when I set it to Y?" you need to answer in seconds, not hours.
- Start narrow. Don't try to sync every field on day one. Start with status, priority, and assignment. Add fields as the team stabilizes. A sync that handles three fields perfectly is infinitely more valuable than one that handles thirty fields unreliably.
- Test with chaos. Deliberately inject concurrent updates, API failures, and webhook delays in staging. If your sync can't handle it in testing, it won't handle it in production at 3 AM.
Need Bidirectional Sync That Actually Works?
We design and build cross-system integrations for security teams. No status drift. No race conditions. No lost context.
Learn about our integration services →