# Data Anonymization for Solvers

## Status

Proposed

## Motivation
Some institutions (e.g. universities, academies) are unwilling or unable to expose real operational data (instructors, learners, facilities, locations, internal constraints) to external solver services.
However, DiTA fundamentally relies on third-party or distributed solvers to process scheduling problems. This creates a direct tension between:
- data privacy and confidentiality, and
- solver effectiveness and correctness.
This feature introduces a Data Anonymization System that enables solver participation without exposing real or sensitive institutional data, while preserving correctness guarantees or explicitly declaring deviations.
## Problem Statement

We need a mechanism that:

- prevents solvers from accessing real institutional data,
- allows solvers to operate on anonymized or transformed packets,
- guarantees that the produced solution is either:
  - equivalent to the solution generated on real data, or
  - explicitly marked as potentially divergent, with clear reasons.
This mechanism must be flexible, pluggable, and institution-controlled.
## Core Requirements

### R1. Packet Transformation

The system MUST be able to:

- modify packet data, and
- when required, modify the structure of the packet itself,

so that real data can neither be inferred nor reconstructed.
This includes:
- replacing identifiers,
- abstracting entities,
- removing or generalizing sensitive attributes,
- restructuring references to external resources.
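
For illustration, here is a minimal sketch of identifier replacement and attribute generalization, assuming a dict-based packet and an institution-held secret. All field names and the generalization rule are assumptions for this example, not part of DiTA:

```python
import hashlib
import hmac

# Assumption: the institution holds a secret key, so pseudonyms are stable
# across packets but cannot be reversed or recomputed by the solver.
SECRET_KEY = b"institution-held-secret"

def pseudonymize(identifier: str) -> str:
    """Replace a real identifier with a stable, non-reversible pseudonym."""
    digest = hmac.new(SECRET_KEY, identifier.encode(), hashlib.sha256)
    return "anon-" + digest.hexdigest()[:12]

def anonymize_instructor(record: dict) -> dict:
    """Keep solver-relevant attributes; drop or generalize the rest."""
    return {
        "id": pseudonymize(record["id"]),    # identifier replacement
        "skills": sorted(record["skills"]),  # needed for matching, kept as-is
        "seniority": "senior" if record["years_experience"] >= 5 else "junior",
        # name, email, and office location are deliberately omitted
    }
```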
### R2. Result Equivalence Guarantee

For any anonymized packet, the solution produced by processing the anonymized data MUST be:

- equivalent to the solution that would have been produced using real data, or
- accompanied by an explicit declaration of discrepancies and their causes.
The anonymization layer MUST therefore:
- understand the semantic impact of transformations, or
- expose uncertainty clearly and transparently.
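
One way to make the "explicit declaration" concrete is a small report attached to each result. The class, field names, and the lossy-transformation taxonomy below are illustrative assumptions:

```python
from dataclasses import dataclass, field

@dataclass
class EquivalenceReport:
    """Attached to every solver result produced from an anonymized packet."""
    equivalent: bool  # True: same solution real data would have yielded
    divergences: list[str] = field(default_factory=list)  # causes, if not

# Illustrative taxonomy: transformations known to change solver outcomes.
LOSSY_TRANSFORMS = {"coordinate-coarsening", "attribute-removal"}

def build_report(applied_transforms: list[str]) -> EquivalenceReport:
    """Declare divergence whenever a known-lossy transformation was applied."""
    lossy = [t for t in applied_transforms if t in LOSSY_TRANSFORMS]
    return EquivalenceReport(equivalent=not lossy, divergences=lossy)
```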
### R3. Resource Access Interception

The anonymization system MAY alter resource references inside the packet.

#### Problem Example

A packet may contain a statement such as:

“Use instructors of my university”

If this reference is passed to the solver unchanged:

- the solver may request the real university resource directly,
- leaking sensitive data in the process.
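
As a concrete (hypothetical) illustration, the dangerous reference might look like this; the packet shape and URL are invented for the example:

```python
# Hypothetical packet fragment; field names are not part of DiTA's format.
packet = {
    "constraints": ["use instructors of my university"],
    "resources": {
        # A live endpoint the solver would otherwise dereference directly,
        # revealing the institution and its data.
        "instructors": "https://registry.example-university.edu/api/instructors",
    },
}
```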
#### Required Behavior

The anonymization system MUST handle this using one of the following strategies (the proxy variant is sketched below):

1. Pre-fetch and Embed
   - Fetch the real data from the institution's resource.
   - Anonymize it.
   - Embed the anonymized dataset directly inside the packet.

2. Proxy Resource Redirection
   - Replace the original resource address in the packet.
   - Redirect it to an anonymization-compliant worker or proxy.
   - This worker:
     - fetches the real data,
     - anonymizes it according to defined rules,
     - serves anonymized responses to the solver.
In both cases, the solver MUST never access real institutional endpoints directly.
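
A minimal sketch of the proxy strategy, reusing the hypothetical `anonymize_instructor` rule from the R1 sketch and Python's standard HTTP server; the endpoint and port are assumptions:

```python
import json
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

REAL_ENDPOINT = "https://registry.example-university.edu/api/instructors"  # hypothetical

class AnonymizingProxy(BaseHTTPRequestHandler):
    """The solver talks only to this proxy, never to the real endpoint."""

    def do_GET(self):
        with urllib.request.urlopen(REAL_ENDPOINT) as resp:      # fetch real data
            records = json.load(resp)
        anonymized = [anonymize_instructor(r) for r in records]  # apply R1 rules
        body = json.dumps(anonymized).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

# The packet's resource address is rewritten to point at the proxy, e.g.:
#   packet["resources"]["instructors"] = "http://anonymizer.internal:8080/instructors"
# HTTPServer(("", 8080), AnonymizingProxy).serve_forever()
```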
### R4. Pluggable Anonymization Systems
DiTA MUST support multiple anonymization systems.
Each anonymization system:
- may define its own rules,
- may implement different transformation strategies,
- may offer different trade-offs between privacy and solver effectiveness.
The choice of anonymization system is the responsibility of the institution (customer), not the platform.
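
One plausible plug-in seam, expressed as a structural interface; none of these names are DiTA's actual API:

```python
from typing import Protocol

class AnonymizationSystem(Protocol):
    """Contract each pluggable system would satisfy (names are assumptions)."""
    name: str

    def transform_packet(self, packet: dict) -> dict: ...
    def applied_transforms(self) -> list[str]: ...  # feeds the R2 report

# Institutions register and choose their system; the platform only dispatches.
REGISTRY: dict[str, AnonymizationSystem] = {}

def select_system(institution_choice: str) -> AnonymizationSystem:
    return REGISTRY[institution_choice]
```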
### R5. Hard-to-Anonymize Data Handling

Some data types may be:

- extremely difficult to anonymize, or
- fundamentally non-anonymizable without breaking correctness.

A primary example is location data.

#### Required Capability
The anonymization system MAY provide alternative computational strategies, such as:
- embedding a distance-calculation worker inside the packet,
- exposing derived metrics (e.g. travel time, distance matrix) instead of raw coordinates,
- or introducing innovative abstractions that preserve solver utility.
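
The derived-metrics idea for location data could look like the following sketch: the solver receives a distance matrix keyed by pseudonymous IDs, while raw coordinates never leave the institution. The haversine formula and the rounding are assumptions for this example:

```python
import math

def haversine_km(a: tuple[float, float], b: tuple[float, float]) -> float:
    """Great-circle distance between two (lat, lon) points in kilometres."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * math.asin(math.sqrt(h))

def distance_matrix(sites: dict[str, tuple[float, float]]) -> dict[tuple[str, str], float]:
    """sites maps pseudonymous IDs to coordinates that stay institution-side;
    only the derived matrix is embedded in the packet."""
    return {(i, j): round(haversine_km(p, q), 1)
            for i, p in sites.items() for j, q in sites.items() if i != j}
```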
If no safe strategy exists, the anonymization system MUST declare its limitations explicitly.
## Non-Goals
- This document does NOT define a specific anonymization algorithm.
- This document does NOT mandate cryptographic techniques.
- This document does NOT define implementation details or deployment topology.
These concerns belong in architecture-level decision records (ADRs).
## Open Questions
- Where should anonymization be applied: pre-packet, gateway-level, or per-resource?
- How should equivalence or divergence be formally defined and verified?
- What guarantees can realistically be enforced across heterogeneous solvers?
- How should anonymization metadata be exposed to solvers, if at all?
## Next Steps
- Evaluate architectural approaches for anonymization placement.
- Define trust boundaries between platform, anonymizer, and solver.
- Create ADR(s) to formalize architectural decisions derived from this design.