Theta-lang: Developer Feedback
Questions for Design Validation and Implementation Guidance
As the developer tasked with implementing the Theta VM, I need clarification on several aspects of the design to ensure correct implementation and avoid costly refactoring. These questions are organized by architectural concern.
1. FlatBuffer Schema Design and Evolution
1.1 Schema Versioning Strategy
Q: How should schema evolution be handled when the VM state structure changes between releases?
- Should we use FlatBuffers’ built-in schema evolution (adding optional fields)?
- Do we need explicit version tags in each root table?
- How should we handle incompatible changes (e.g., removing required fields)?
- Should old VM states be forward-convertible, or is it acceptable to reject incompatible versions?
1.2 Nested vs. Flat Table Design
Q: How deeply should we nest FlatBuffer tables?
- Example: Should ExecutionState contain nested CallFrame tables, or should it reference them by ID in a separate table?
- Deep nesting simplifies access but complicates mutation (requires rebuilding the entire tree)
- Flat design with references adds indirection but allows partial updates
- What is the preferred trade-off for this VM?
1.3 String Handling
Q: How should strings be represented and managed?
- Should strings be inline in value tables, or stored in a separate string pool with offsets?
- How do we handle string lifecycle (when can strings be freed)?
- Should we support string interning for deduplication?
- What encoding should be enforced (UTF-8, ASCII, raw bytes)?
1.4 Schema Organization
Q: Should all schemas be in a single .fbs file or split across multiple files?
- Single file: easier to maintain consistency, harder to navigate
- Multiple files: better organization, requires careful dependency management
- What granularity is preferred?
2. Memory Management and Buffer Lifecycle
2.1 Buffer Ownership Model
Q: Who owns FlatBuffer regions, and when are they freed?
- Does the VM own all buffers and free them on shutdown?
- Can the host retain references after VM shutdown?
- Should we use reference counting, or is manual lifetime management expected?
- What happens if the host holds a pointer to a buffer that the VM wants to replace?
2.2 Buffer Replacement Strategy
Q: When VM state changes (e.g., new register values), how do we handle buffer replacement?
- Do we rebuild the entire ExecutionState buffer on every instruction?
- Should we use a “dirty region” approach where only modified portions are rebuilt?
- Can we keep multiple generations of buffers alive simultaneously?
- How do we garbage collect old buffers?
2.3 Allocation Strategy
DECIDED: Arena allocation with bump-pointer allocators.
- Each computation phase gets its own arena (module, execution, dataflow results)
- Open question: Should we support host-provided arena implementations (callback-based)?
- Open question: What is the expected buffer size distribution for arena sizing (small, large, mixed)?
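For concreteness, here is a minimal sketch of the decided bump-pointer arena, assuming one arena per computation phase and 8-byte alignment; the theta_arena_* names are placeholders, not a settled API:

```c
#include <stdint.h>
#include <stdlib.h>

/* Minimal bump-pointer arena: one per computation phase.
 * Everything allocated from it is released at once on reset/destroy. */
typedef struct {
    uint8_t *base;   /* start of the backing block         */
    size_t   cap;    /* total capacity in bytes            */
    size_t   used;   /* bump offset of the next allocation */
} theta_arena_t;

static int theta_arena_init(theta_arena_t *a, size_t cap) {
    a->base = malloc(cap);
    a->cap  = cap;
    a->used = 0;
    return a->base ? 0 : -1;
}

/* Bump allocation, 8-byte aligned; returns NULL when the arena is full
 * (caller decides whether to grow, spill, or report an error). */
static void *theta_arena_alloc(theta_arena_t *a, size_t n) {
    size_t off = (a->used + 7u) & ~(size_t)7u;
    if (off + n > a->cap)
        return NULL;
    a->used = off + n;
    return a->base + off;
}

/* Reset frees the whole phase in O(1). */
static void theta_arena_reset(theta_arena_t *a)   { a->used = 0; }
static void theta_arena_destroy(theta_arena_t *a) { free(a->base); a->base = NULL; }
```

A host-provided arena (the first open question above) would only need to supply the init/alloc/reset trio as callbacks behind the same interface.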
2.4 Memory Limits
Q: How should memory consumption be bounded?
- Should there be a global memory limit for all VM buffers?
- Should individual buffers be size-limited?
- What happens when memory limits are exceeded (error, eviction, spilling)?
3. Procedural Execution Model
3.1 Register File Size and Type
DECIDED: 256 typed registers (r0-r255) with unified register file.
- r0: constant zero (read-only)
- r1: function return value
- r2-r15: argument passing registers
- r16-r255: general purpose
- Types: i8, i16, i32, i64, f32, f64, bool, ptr, string, table_ref, node_ref
- Open question: Should we support register spilling for functions that need more than 256 registers?
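A minimal sketch of the decided register file, assuming a tagged union per register (the tag names and struct layout are illustrative, not final):

```c
#include <stdint.h>
#include <stdbool.h>

/* Value types listed in the decision above. */
typedef enum {
    T_I8, T_I16, T_I32, T_I64, T_F32, T_F64,
    T_BOOL, T_PTR, T_STRING, T_TABLE_REF, T_NODE_REF
} theta_type_t;

/* One typed register: a type tag plus a payload union. */
typedef struct {
    theta_type_t type;
    union {
        int64_t  i;         /* i8/i16/i32/i64 widened to 64 bits */
        double   f;         /* f32/f64 widened to double         */
        bool     b;
        void    *ptr;
        uint32_t str_id;    /* handle into a string pool         */
        uint32_t table_id;  /* handle to a materialized table    */
        uint32_t node_id;   /* handle to a dataflow node         */
    } as;
} theta_reg_t;

/* Unified register file: r0 zero, r1 return, r2-r15 args, r16-r255 general. */
typedef struct {
    theta_reg_t r[256];
} theta_regfile_t;
```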
3.2 Calling Convention
Q: How should function calls and returns be implemented?
- How are arguments passed (registers, stack, both)?
- How are return values communicated?
- Should we support multiple return values?
- How is the caller’s context saved (which registers are callee-saved)?
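Given the register roles already decided in 3.1 (r2-r15 for arguments, r1 for the return value), one possible frame layout is sketched below purely for discussion; all names are hypothetical:

```c
#include <stdint.h>

/* Hypothetical call frame: if arguments travel in r2-r15 and the result
 * in r1, a call only needs to record where to resume in the caller, which
 * function is executing, and any caller registers that were spilled. */
typedef struct {
    uint32_t return_pc;     /* bytecode offset to resume in the caller */
    uint32_t function_id;   /* callee being executed                   */
    uint16_t saved_count;   /* number of caller-saved registers below  */
    /* theta_reg_t saved[];    spilled caller registers, if any        */
} theta_frame_t;
```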
3.3 Control Flow Implementation
Q: How should branching and loops be encoded?
- Should branch targets be absolute offsets, relative offsets, or labels resolved at load time?
- How do we handle backward branches (loop detection)?
- Should we support computed jumps (switch/case optimization)?
3.4 Instruction Encoding Density
Q: How compact should instructions be?
- Should we optimize for space (variable-length encoding) or speed (fixed-width instructions)?
- Should operands be inline or referenced?
- What is the expected code size for typical programs (hundreds of instructions, thousands, millions)?
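Purely to make the trade-off concrete: with 256 registers, an 8-bit opcode plus three 8-bit register operands fit exactly in a 32-bit word. The sketch below illustrates that fixed-width alternative; it is not a decided format, and the opcode value is hypothetical:

```c
#include <stdint.h>

/* One candidate fixed-width layout: 8-bit opcode + three 8-bit operands. */
typedef uint32_t theta_insn_t;

#define INSN(op, a, b, c) \
    ((theta_insn_t)(op) | ((theta_insn_t)(a) << 8) | \
     ((theta_insn_t)(b) << 16) | ((theta_insn_t)(c) << 24))

static inline uint8_t insn_op(theta_insn_t i) { return (uint8_t)(i & 0xff); }
static inline uint8_t insn_a (theta_insn_t i) { return (uint8_t)((i >> 8)  & 0xff); }
static inline uint8_t insn_b (theta_insn_t i) { return (uint8_t)((i >> 16) & 0xff); }
static inline uint8_t insn_c (theta_insn_t i) { return (uint8_t)((i >> 24) & 0xff); }

/* Example: ADD_I64 r16, r17, r18 (opcode value chosen arbitrarily). */
enum { OP_ADD_I64 = 0x10 };
static const theta_insn_t example_add = INSN(OP_ADD_I64, 16, 17, 18);
```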
3.5 Error Handling Model
Q: How should runtime errors be communicated?
- Should the VM trap and halt on errors, or support structured exception handling?
- Can procedural code catch and handle errors from dataflow operations?
- Should we support unwinding the call stack on errors?
- How are error messages and diagnostics communicated to the host?
4. Dataflow Execution Model
4.1 Execution Strategy (Eager vs. Lazy)
DECIDED: Lazy pull-based evaluation with explicit materialization.
- Dataflow nodes are constructed without executing (build DAG only)
- Execution occurs when host/procedural code calls materialize(node)
- Pull model: each operator implements iterator interface (next(), reset())
- Open question: Should we support partial materialization (first N rows)?
- Open question: Should we cache materialized results for repeated access?
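A minimal sketch of the decided pull model: each operator exposes next()/reset() behind a small vtable, and materialize() drains the root operator. The batch type and all names are placeholders:

```c
#include <stddef.h>

struct theta_batch;            /* a chunk of rows; layout defined elsewhere */
typedef struct theta_op theta_op_t;

/* Pull-based operator interface: next() yields the next batch or NULL when
 * exhausted; reset() rewinds the operator for re-execution. */
struct theta_op {
    struct theta_batch *(*next)(theta_op_t *self);
    void                (*reset)(theta_op_t *self);
    theta_op_t          *input;   /* upstream operator, NULL for sources */
    void                *state;   /* operator-private state              */
};

/* materialize(): drain the root operator and hand each batch to a sink. */
static size_t theta_materialize(theta_op_t *root,
                                void (*sink)(struct theta_batch *, void *),
                                void *ctx) {
    size_t batches = 0;
    struct theta_batch *b;
    while ((b = root->next(root)) != NULL) {
        sink(b, ctx);
        ++batches;
    }
    return batches;
}
```

Partial materialization (first N rows) would fall out naturally from this shape: stop pulling after N rows instead of draining to NULL.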
4.2 Operator Fusion
Q: Should the dataflow engine perform operator fusion?
- Example: SCAN → FILTER → PROJECT could be fused into a single pass
- If yes, when does fusion occur (compile-time, runtime, never)?
- How does fusion interact with the FlatBuffer representation (fused nodes vs. separate nodes)?
4.3 Dataflow Node Mutability
Q: Can dataflow nodes be modified after creation?
- Can a node’s parameters be changed (e.g., update a filter predicate)?
- If yes, does this create a new node or mutate in place?
- How does this affect memoization or caching?
4.4 Table Representation
DECIDED: Columnar storage with typed columns.
- Each column is a separate typed array (i8[], i16[], i32[], i64[], f32[], f64[], bool[], string)
- Null handling: bit-packed null bitmap (1 bit per row, 0=null, 1=valid)
- String columns: offset array [u32] + concatenated UTF-8 data buffer
- Open question: Should we support compressed columns (RLE, dictionary encoding)?
- Open question: Should we support nested columns (arrays, structs)?
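A sketch of the decided columnar layout, including the 1-bit-per-row validity bitmap and the offsets-plus-data string column; field names are illustrative, and the row_count + 1 offsets convention is an assumption, not a decided detail:

```c
#include <stdint.h>
#include <stdbool.h>

/* One column: fixed-width values or a string column (offsets + data),
 * plus a bit-packed validity bitmap (bit 1 = valid, bit 0 = null). */
typedef struct {
    uint8_t *validity;          /* ceil(row_count / 8) bytes              */
    union {
        void *fixed;            /* i8[]/i16[]/.../f64[]/bool[] storage    */
        struct {
            uint32_t *offsets;  /* row_count + 1 offsets into data        */
            uint8_t  *data;     /* concatenated UTF-8 bytes               */
        } str;
    } u;
} theta_column_t;

typedef struct {
    uint64_t        row_count;
    uint32_t        column_count;
    theta_column_t *columns;
} theta_table_t;

/* True if the value at `row` in `col` is non-null. */
static bool theta_is_valid(const theta_column_t *col, uint64_t row) {
    return (col->validity[row >> 3] >> (row & 7)) & 1u;
}
```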
4.5 Intermediate Materialization
Q: When should intermediate dataflow results be materialized to FlatBuffers?
- After every operator (safe but slow)?
- Only when requested by the host or procedural code (lazy but complex)?
- Should the VM automatically decide based on memory pressure or access patterns?
4.6 Streaming vs. Batch Processing
Q: Should dataflow operators process entire tables at once or support streaming?
- Batch: simpler to implement, requires full table materialization
- Streaming: memory-efficient, more complex state management
- Should we support both modes?
5. Procedural-Dataflow Integration
5.1 Value Type System
Q: How do procedural values and dataflow results interoperate?
- Can a procedural register hold a reference to a table?
- Can a table cell contain a reference to a procedural value or dataflow node?
- How are types checked at the boundary?
5.2 Dataflow Invocation
Q: How does procedural code trigger dataflow execution?
- Are there special instructions (EXECUTE_DATAFLOW_NODE)?
- Can procedural code iterate over table rows?
- How are dataflow errors propagated to procedural code?
5.3 Procedural Functions in Dataflow
Q: Can user-defined procedural functions be used in dataflow operators?
- Example: MAP operator that calls a procedural function for each row
- If yes, how is the function called (direct C function pointer, bytecode interpreter)?
- How are procedural function signatures validated for dataflow use?
6. Host Integration and API Design
6.1 API Surface
Q: What level of abstraction should the host API provide?
- Low-level: direct access to FlatBuffer accessors (fast, unsafe)
- High-level: wrapper functions that abstract FlatBuffers (safe, slower)
- Should we provide both?
6.2 Thread Safety
Q: Is the VM thread-safe?
- Can multiple threads execute the same VM instance concurrently?
- Can multiple threads access different VM instances?
- Should the VM provide its own locking, or is synchronization the host’s responsibility?
6.3 Resource Cleanup
Q: What is the cleanup contract between VM and host?
- Should the VM automatically free all resources on vm_destroy()?
- Can the host retain buffer references after VM destruction?
- Should we support explicit buffer pinning/unpinning?
6.4 Error Reporting to Host
Q: How should errors be communicated to the host?
- Error codes only (lightweight, requires lookup)?
- Structured error objects in FlatBuffers (consistent, but complex)?
- String error messages (human-readable, but allocation required)?
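One possible hybrid is sketched below for discussion: a stable numeric code for cheap checks plus an optional detail record the host can query. All names and codes are hypothetical:

```c
#include <stdint.h>

/* Hypothetical status codes returned by host-facing calls. */
typedef enum {
    THETA_OK = 0,
    THETA_ERR_OOM,
    THETA_ERR_TYPE,
    THETA_ERR_BOUNDS,
    THETA_ERR_DIV_BY_ZERO,
    THETA_ERR_LIMIT_EXCEEDED
} theta_status_t;

/* Optional structured detail, owned by the VM until the next call; the
 * fixed-size message buffer avoids per-error heap allocation. */
typedef struct {
    theta_status_t code;
    uint32_t       function_id;   /* where the error was raised     */
    uint32_t       pc;            /* bytecode offset, if applicable */
    char           message[128];  /* human-readable diagnostic      */
} theta_error_t;
```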
6.5 Callback Mechanisms
Q: Should the VM support host callbacks?
- Can the VM call back into host code (e.g., for I/O, external functions)?
- If yes, how are callbacks registered and invoked?
- How do we ensure callbacks don’t violate FlatBuffer immutability?
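A sketch of one registration scheme, assuming callbacks are bound by name at module load time and receive read-only views of VM values; every name here is a placeholder, and the same registry could also serve the extension mechanism in Section 10.3:

```c
#include <stdint.h>

typedef struct theta_vm    theta_vm_t;      /* opaque VM handle          */
typedef struct theta_value theta_value_t;   /* read-only view of a value */

/* Host callback: receives read-only argument views and writes one result.
 * Returning non-zero signals an error back to the VM. */
typedef int (*theta_host_fn)(theta_vm_t *vm,
                             const theta_value_t *args, uint32_t argc,
                             theta_value_t *result,
                             void *user_data);

/* Hypothetical registration call: bytecode refers to the function by name,
 * and the binding is resolved when the module is loaded. */
int theta_register_host_fn(theta_vm_t *vm,
                           const char *name,
                           theta_host_fn fn,
                           void *user_data);
```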
7. Type System and Type Safety
7.1 Static vs. Dynamic Typing
DECIDED: Statically typed with type checking at module load time.
- All variables, function parameters, and return values have explicit types
- Type checking occurs during module loading (before execution)
- No implicit type conversions (explicit casts required)
- Open question: Should we support type inference for local variables?
- Open question: Should we allow generic/polymorphic functions?
7.2 Type Inference
Q: Should the system infer types, or must they be explicit?
- For procedural code: do variables have declared types?
- For dataflow operations: are output schemas inferred from inputs?
- How are type errors reported?
7.3 Null Handling
Q: How are null values represented and handled?
- Should we use nullable types throughout, or reserve special sentinel values?
- How do arithmetic operations handle null (propagate, error, coerce)?
- Do we support three-valued logic (true/false/null) for predicates?
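If three-valued logic is adopted, the predicate algebra would be standard Kleene/SQL logic; a minimal sketch, with illustrative type names:

```c
/* Kleene/SQL three-valued logic for predicates over nullable values. */
typedef enum { TL_FALSE = 0, TL_TRUE = 1, TL_NULL = 2 } theta_tribool_t;

/* AND: FALSE dominates; otherwise NULL if either side is NULL. */
static theta_tribool_t tl_and(theta_tribool_t a, theta_tribool_t b) {
    if (a == TL_FALSE || b == TL_FALSE) return TL_FALSE;
    if (a == TL_NULL  || b == TL_NULL)  return TL_NULL;
    return TL_TRUE;
}

/* OR: TRUE dominates; otherwise NULL if either side is NULL. */
static theta_tribool_t tl_or(theta_tribool_t a, theta_tribool_t b) {
    if (a == TL_TRUE || b == TL_TRUE) return TL_TRUE;
    if (a == TL_NULL || b == TL_NULL) return TL_NULL;
    return TL_FALSE;
}

/* NOT: NULL stays NULL. */
static theta_tribool_t tl_not(theta_tribool_t a) {
    if (a == TL_NULL) return TL_NULL;
    return a == TL_TRUE ? TL_FALSE : TL_TRUE;
}
```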
8. Performance and Optimization
8.1 Hot Path Identification
Q: What are the expected performance bottlenecks?
- Instruction dispatch in procedural interpreter?
- FlatBuffer accessor overhead?
- Dataflow operator execution?
- Memory allocation?
- Where should we focus optimization effort?
8.2 SIMD Vectorization
Q: Should dataflow operators use SIMD intrinsics?
- If yes, which SIMD instruction sets (SSE, AVX, NEON, SVE)?
- Should we have portable fallback implementations?
- How much performance improvement is expected?
8.3 Caching and Memoization
Q: Should the VM cache intermediate results?
- Should dataflow nodes memoize their outputs?
- Should we cache decoded instructions?
- How is cache invalidation handled?
8.4 Profiling Hooks
Q: Should the VM provide built-in profiling?
- Instruction counts per opcode?
- Operator execution times?
- Memory allocation tracking?
- Should profiling be always-on or opt-in?
9. Debugging and Observability
9.1 Debugger Support
Q: What debugging capabilities should be built in?
- Breakpoints (at instruction, at function, at dataflow node)?
- Single-stepping?
- Watchpoints (register or memory changes)?
- How is debugger state represented (also in FlatBuffers)?
9.2 Logging and Tracing
Q: Should the VM have internal logging?
- If yes, what logging levels (error, warn, info, debug, trace)?
- Should logs be directed to stdout, a file, or a host callback?
- Should execution traces be recordable for replay?
9.3 Introspection Depth
Q: How much of the VM's internals should be exposed?
- Should the host see every instruction execution?
- Should the host be able to inspect in-progress dataflow nodes?
- What is the expected overhead of introspection?
10. Standard Library and I/O
10.1 I/O Model
Q: How should I/O be performed?
- Should the VM directly read files, or should the host provide data?
- Should we support streaming file reads, or only in-memory buffers?
- What file formats are essential (CSV, JSON, Parquet, custom)?
10.2 Built-in Function Implementation
Q: Should built-in functions be implemented in C or in VM bytecode?
- C: faster, but not introspectable
- Bytecode: slower, but uniform representation
- Hybrid: performance-critical in C, others in bytecode?
10.3 Extension Mechanism
Q: Can users add custom functions without modifying VM code?
- Should we support loading external C functions (shared libraries)?
- Should there be a plugin API?
- How are external functions registered and called?
11. Security and Sandboxing
11.1 Untrusted Code Execution
Q: Is the VM designed to run untrusted bytecode?
- If yes, what security guarantees are provided?
- Should we validate bytecode before execution (type safety, bounds checks)?
- How do we prevent resource exhaustion attacks (infinite loops, memory bombs)?
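One common mitigation for runaway loops, sketched under the assumption of a host-configurable instruction budget ("fuel"); names and the error mapping are hypothetical:

```c
#include <stdint.h>

/* Hypothetical interpreter-loop fragment: every dispatched instruction
 * consumes one unit of fuel, so untrusted bytecode cannot spin forever. */
typedef struct {
    uint64_t fuel;   /* remaining instruction budget, set by the host */
} theta_limits_t;

static int theta_step(theta_limits_t *lim /*, decoded instruction, state */) {
    if (lim->fuel == 0)
        return -1;   /* e.g. a THETA_ERR_LIMIT_EXCEEDED status */
    lim->fuel--;
    /* ... dispatch and execute one instruction ... */
    return 0;
}
```

Memory bombs would be handled separately by the arena/memory limits discussed in Section 2.4.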
11.2 Host Access Restrictions
Q: Can VM code access host resources arbitrarily?
- Should there be a capability-based security model?
- Should I/O operations require explicit host permission?
- How are security violations detected and handled?
12. Testing and Validation
12.1 Test Coverage Goals
Q: What test coverage is required?
- Should we aim for 100% line coverage?
- What are critical paths that must be tested?
- Should we have integration tests, unit tests, or both?
12.2 Fuzzing Strategy
Q: Should we fuzz the VM?
- Fuzz bytecode loading?
- Fuzz dataflow graph construction?
- Fuzz FlatBuffer inputs?
- What fuzzing tools should we use (AFL, libFuzzer, custom)?
12.3 Conformance Testing
Q: How do we verify correctness?
- Should we have a reference implementation?
- Should we compare against Lua or DuckDB for equivalent operations?
- How do we test determinism across platforms?
13. Deployment and Distribution
13.1 Build Artifacts
Q: What should the build produce?
- Static library, shared library, or both?
- Should we distribute source or binaries?
- What platforms should we support (Linux, macOS, Windows, embedded)?
13.2 Dependencies
Q: How should FlatBuffers be included?
- As a git submodule?
- As a system dependency (pkg-config)?
- Vendored directly into the repository?
13.3 Versioning Scheme
Q: How should versions be numbered?
- Semantic versioning (major.minor.patch)?
- What constitutes a breaking change (schema change, API change, behavior change)?
- How are deprecations communicated?
14. Specific Design Ambiguities
14.1 Dataflow Graph Topology
Q: Are dataflow graphs always DAGs, or can they contain cycles?
- If cycles are allowed, how do we prevent infinite recursion?
- Should we support iterative dataflow (fixed-point computation)?
14.2 Procedural Language Syntax
DECIDED: Statically typed, C-like procedural language.
- Imperative control flow: if/else, while, for, break, continue, return
- First-class functions with typed signatures
- No classes/OOP, no garbage collection, no implicit conversions
- Open question: Should we provide a parser/compiler in Phase 1, or start with direct bytecode generation?
- Open question: What is the exact syntax for function definitions, type annotations, and dataflow integration?
14.3 Dataflow Operator Semantics
Q: What are the precise semantics of each operator?
- JOIN: inner, left, right, full outer - which are supported?
- AGGREGATE: how are ties broken in MIN/MAX?
- SORT: stable or unstable sort?
- Should we document semantics formally (operational semantics, denotational)?
14.4 Endianness Handling
Q: How do we handle mixed-endianness environments?
- Does FlatBuffers already solve this (yes for schema-defined fields)?
- What about raw byte arrays in Value unions?
- Should we enforce a canonical byte order?
15. Project-Specific Questions
15.1 Name and Branding
Q: What is the official name of this VM?
- “Theta VM” is used in the plan - is this confirmed?
- Should schemas use “theta” as a namespace prefix?
15.2 License
Q: What license should the project use?
- Permissive (MIT, Apache 2.0)?
- Copyleft (GPL, LGPL)?
- Does the license affect FlatBuffers integration?
15.3 Target Use Cases
Q: What are the primary expected use cases?
- Embedded analytics in applications?
- Scripting layer for data processing?
- Research prototype?
- Understanding target use cases will guide design trade-offs.
Priority Questions Summary
✅ DECIDED (Core Design Decisions)
The following critical questions have been answered and are documented in design-decisions.md:
- ✅ Memory ownership model - Arena allocation with bump-pointer allocators
- ✅ Register file architecture - 256 typed registers (r0-r255)
- ✅ Table storage layout - Columnar storage with typed columns
- ✅ Eager vs. lazy dataflow - Lazy pull-based evaluation with explicit materialization
- ✅ Static vs. dynamic typing - Statically typed with load-time type checking
- ✅ Procedural language syntax - C-like, statically typed, imperative
🟡 HIGH PRIORITY (Blockers for Implementation)
These questions should be answered before beginning Phase 1 implementation:
- Schema organization (Section 1.4) - Single file vs. multiple files for schemas
- String handling (Section 1.3) - Interning, lifecycle, encoding enforcement
- Buffer replacement strategy (Section 2.2) - Frequency and granularity of rebuilds
- Calling convention (Section 3.2) - Argument passing, return values, register saving
- Control flow encoding (Section 3.3) - Branch target format (absolute, relative, labels)
- Instruction encoding density (Section 3.4) - Fixed-width vs. variable-length
- Streaming vs. batch (Section 4.6) - Row-at-a-time vs. batch processing
- Thread safety (Section 6.2) - Concurrent access model
🔵 MEDIUM PRIORITY (Can be deferred to later phases)
These can be decided during implementation:
- Operator fusion strategy (Section 4.2)
- Error handling model (Section 3.5)
- Host API abstraction level (Section 6.1)
- SIMD vectorization strategy (Section 8.2)
- Operator semantics details (Section 14.3)
🟢 LOW PRIORITY (Optimization/Production concerns)
These can be addressed in Phases 8-10:
- Caching and memoization (Section 8.3)
- Profiling hooks (Section 8.4)
- Security sandboxing (Section 11)
- Compression support (Section 4.4 open questions)