Well-Typed Statement Generator Design
Status: proposed
Date: 2026-02-16
Goal
Design `wellTypedProgramGen: Gen[WellTypedProgram]` that produces source `Statement`s plus typing witnesses (`expected`, `typeEnv`) so we can property-test the compiler pipeline with generated programs, not just parser-shaped random syntax.
Motivation
Current test generators have a gap:
- `Generators.genStatements` generates syntactically valid statements, but not type-directed ones.
- `Generators.genTypedExpr` explicitly notes it is random and not well typed.
That means we can stress parsing/printing/proto paths, but we cannot reliably generate source programs to test:
- `SourceConverter` + inference success paths,
- normalization and compilation (`MatchlessFromTypedExpr`),
- backend codegen against broader random inputs.
Non-goals (v1)
- Full language coverage on day one (especially recursion, comprehensions, and complex pattern guards).
- Generating every valid program shape.
- Proving generator soundness independent of the typechecker.
v1 should prioritize high signal and low flake risk.
Proposed API
```scala
object WellTypedGenerators {
  final case class Config(
      maxStatements: Int,
      maxExprDepth: Int,
      maxTypeDepth: Int,
      allowDefs: Boolean,
      allowTypeDefs: Boolean
  )

  final case class WellTypedProgram(
      statements: List[Statement],
      expected: Map[Identifier.Bindable, rankn.Type],
      typeEnv: rankn.TypeEnv[Kind.Arg]
  )

  def wellTypedProgramGen(cfg: Config): Gen[WellTypedProgram]
}
```
`wellTypedProgramGen` is the primary output because statements alone are not enough to assert typing behavior; tests also need value-type witnesses (`expected`) and the type-constant interpretation context (`typeEnv`).
Core Idea
Use type-directed synthesis:
- Generate a target type `t`.
- Generate an expression `e` such that `e : t` under the current environment.
- Emit a statement binding `name = e`.
- Extend the environment with `name : t`.
- Repeat.
This is a standard introduction/elimination strategy:
- Introduction rules build values of a requested type (lambda, literals, constructors).
- Elimination rules consume existing values/functions from scope (var, application).
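The intro/elim loop can be sketched on a toy type language. This is a hypothetical mini-model (the names `Ty`, `Expr`, `genExpr` are illustrative, not the real Bosatsu ASTs), using `scala.util.Random` rather than ScalaCheck `Gen` to stay self-contained:

```scala
import scala.util.Random

sealed trait Ty
case object TInt extends Ty
case object TBool extends Ty
final case class TFun(arg: Ty, res: Ty) extends Ty

sealed trait Expr
final case class LitInt(n: Int) extends Expr
final case class LitBool(b: Boolean) extends Expr
final case class Var(name: String) extends Expr
final case class Lam(param: String, body: Expr) extends Expr

// env maps in-scope names to their types; every result has type `goal`
def genExpr(env: Map[String, Ty], goal: Ty, rng: Random): Expr = {
  // elimination: reuse an in-scope value of exactly the goal type
  val hits = env.collect { case (n, t) if t == goal => n }.toVector
  if (hits.nonEmpty && rng.nextBoolean()) Var(hits(rng.nextInt(hits.size)))
  else
    // introduction: build a value keyed on the goal's shape
    goal match {
      case TInt  => LitInt(rng.nextInt(100))
      case TBool => LitBool(rng.nextBoolean())
      case TFun(a, r) =>
        val p = s"x${env.size}" // fresh by construction in this toy model
        Lam(p, genExpr(env + (p -> a), r, rng))
    }
}
```

Because every recursive call carries the exact goal type, well-typedness holds by construction; the real generator follows the same shape but emits `Declaration` nodes and threads `Ctx`.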
Generation Context
```scala
final case class ValSig(name: Identifier.Bindable, sigma: rankn.Type)

final case class CtorSig(
    pack: PackageName,
    cons: Identifier.Constructor,
    sigma: rankn.Type,
    resultTypeConst: rankn.Type.Const.Defined
)

final case class Ctx(
    packageName: PackageName,
    vals: Vector[ValSig],
    ctors: Vector[CtorSig],
    typeEnv: rankn.TypeEnv[Kind.Arg],
    usedNames: Set[Identifier.Bindable]
)
```
Notes:
- `vals` holds in-scope names generated so far.
- `usedNames` is the global freshness set, and must include all value binders: `vals.forall(v => usedNames(v.name))`.
- `ctors` is a sampling cache; the source of truth is `typeEnv`, and every cached constructor must exist there: `ctors.forall(c => typeEnv.getConstructor(c.pack, c.cons).isDefined)`.
- `CtorSig.resultTypeConst` records the constructor’s final return type constant (the defined type it constructs), so constructor selection can quickly filter by demanded result type root before full instantiation.
- `ctors` and `typeEnv` come from imported interfaces plus generated type definitions.
- v1 uses monomorphic goals by default, but `sigma` storage keeps polymorphism support open.
TypeEnv witness
`WellTypedProgram.typeEnv` is required so `expected: Map[Bindable, Type]` has an explicit interpretation context for `TyConst` references (especially generated local struct/enum types).
Planned construction:
- Incrementally update `ctx.typeEnv` while generating statements (for v1 this is mostly imported/predef; later phases add local type defs).
- On finalize, derive the compiler-view env from generated statements via the same path used by compilation (`SourceConverter.toProgram` -> parsed type env -> `TypeEnv.fromParsed`) and merge with the imported env.
- Return that finalized env as `WellTypedProgram.typeEnv`.
This keeps the witness aligned with how the compiler interprets type constants.
Type Generation
Generate only inhabitable goal types to keep the `genExpr(goal)` failure rate low.

```scala
def genType(ctx: Ctx, depth: Int): Gen[Type]
```
Weighted choices:
- Ground base types: `Int`, `String`, `Char`, `Float64`, `Bool`, `Unit`.
- ADT applications with available constructors: `Option[a]`, `List[a]`, tuples, and local generated ADTs.
- Function types with explicit arity (small arity), i.e. `a -> r`, `(a1, a2) -> r`, `(a1, a2, a3) -> r`, … (internally `Fn1`..`Fn32`, not curried chains).
Inhabited-type filter
Use a memoized predicate `canInhabit(ctx, t)`:
- `True` for base literal types.
- `True` for function types (we can always lambda-introduce at that exact arity if the result type is inhabitable under an extended scope).
- `True` for ADT types when some constructor can be instantiated and all argument types are inhabitable.
- `True` if there is already an in-scope value assignable to `t`.
If no candidate is found at some depth, fall back to base types.
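A sketch of such a predicate over a simplified type model (the `TAdt`/`ctors` encoding is illustrative; the real version would consult `ctx.typeEnv`). Only positive answers are memoized, because a negative answer can depend on the cycle-detection set in flight:

```scala
sealed trait Ty
case object TInt extends Ty
final case class TFun(arg: Ty, res: Ty) extends Ty
final case class TAdt(name: String) extends Ty

// ctors: ADT name -> constructors, each constructor as its list of argument types
def canInhabit(ctors: Map[String, List[List[Ty]]]): Ty => Boolean = {
  val memo = scala.collection.mutable.Map.empty[Ty, Boolean]
  def go(t: Ty, seen: Set[String]): Boolean =
    memo.getOrElse(t, {
      val r = t match {
        case TInt       => true        // base literal types
        case TFun(_, s) => go(s, seen) // lambda-introduce if result is inhabitable
        case TAdt(n) =>
          // inhabited if some constructor has all-inhabitable arguments;
          // `seen` blocks recursive types with no base case
          !seen(n) && ctors.getOrElse(n, Nil).exists(_.forall(go(_, seen + n)))
      }
      if (r) memo.update(t, true) // cache only definite positives
      r
    })
  t => go(t, Set.empty)
}
```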
Expression Generation
```scala
def genExpr(ctx: Ctx, goal: Type, depth: Int): OptionT[Gen, Declaration.NonBinding]
```

`OptionT` is intentional: some goals may be temporarily unreachable under the current context.
Introduction path (genIntro)
Construct directly by type shape:
- If goal is `Type.Fun(args, res)`: generate a lambda with exactly `args.length` parameters, extend `ctx`, generate the body for `res`.
- If goal is a literal type: emit a literal (`Int`, `String`, `Char`, `Float64`).
- If goal is an ADT: choose a constructor whose `resultTypeConst` matches the goal root `TyConst` (when present), then instantiate `sigma` to the full goal and recursively build args.
- If goal is tuple/list/option and syntax sugar is easier, emit tuple/list forms directly.
Elimination path (genElim)
Use available names:
- Choose a value/function `v : sigma` from `ctx.vals` or a constructor from `ctx.ctors`.
- For constructors, optionally prefilter by `resultTypeConst` vs goal root `TyConst`; then instantiate `sigma` toward `goal` (using `Type.instantiate`-style matching on the result type).
- If instantiation yields needed argument types `a1..an`, recursively generate args and build one application node with arity `n` (not nested curried applications).
- If no args are needed and instantiation matches the goal, emit the variable directly.
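For monomorphic goals (the v1 default), candidate selection reduces to an equality check on the result type. A sketch with hypothetical names (`elimCandidates`, and an uncurried `TFun` mirroring `Fn1..Fn32`):

```scala
sealed trait Ty
case object TInt extends Ty
case object TBool extends Ty
final case class TFun(args: List[Ty], res: Ty) extends Ty // uncurried, like Fn1..Fn32

// An in-scope value eliminates toward `goal` either directly (its type equals
// goal) or via ONE application node supplying all arguments at once.
def elimCandidates(vals: List[(String, Ty)], goal: Ty): List[(String, List[Ty])] =
  vals.collect {
    case (n, t) if t == goal                 => (n, Nil)  // emit variable directly
    case (n, TFun(args, res)) if res == goal => (n, args) // one n-ary application
  }
```

The polymorphic version would replace the two equality guards with instantiation of `sigma` against the goal, but the candidate-then-arguments shape stays the same.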
Control flow expressions (phase-gated)
After v1 is stable:
- `if`/ternary by generating a `Bool` condition and same-typed branches.
- `match` with typed pattern generation and exhaustiveness by construction.
Complete Declaration Surface (Exhaustive)
This section lists every Declaration syntax constructor and how to generate it.
Declaration (binding forms)
- `Declaration.Binding(BindingStatement(pattern, value, in))`
  Syntax shape: `<pattern> = <nonbinding>` then continuation declaration.
  Generation: choose `pattern`, generate `value` that matches the pattern type, extend env with pattern bindings for `in`.
- `Declaration.Comment(CommentStatement[Padding[Declaration]])`
  Syntax shape: `# ...` preceding a declaration.
  Generation: wrapper over an already generated declaration; semantics-preserving noise.
- `Declaration.DefFn(DefStatement(...))`
  Syntax shape: local `def` with args/body plus continuation.
  Generation: generate function type, args/patterns/body, optional return annotation, then generate continuation in extended env.
- `Declaration.LeftApply(arg, fn, result)`
  Syntax shape: `<pattern> <- <fn(exprs...)>` then body.
  Generation: choose `fn` returning function-on-last-arg; generate a lambda body for the appended arg. Equivalent to desugared `Apply` with a final lambda argument.
Declaration.NonBinding (expression forms)
- `Declaration.Annotation(fn, tpe)`
  Syntax: `<expr>: <TypeRef>`.
  Generation: generate an expression of subtype/instantiation-compatible type, then annotate/widen to the demanded type.
- `Declaration.Apply(fn, args, kind = Parens)`
  Syntax: `f(a, b, ...)`.
  Generation: pick a callable with arity `n`, generate exactly `n` typed args, build one application.
- `Declaration.Apply(fn, args, kind = Dot)`
  Syntax: `recv.f(a, b, ...)` (parser desugars shape constraints).
  Generation: same as apply, but render in dot form when legal, for readability/syntax coverage.
- `Declaration.ApplyOp(left, op, right)`
  Syntax: `left <op> right`.
  Generation: choose an operator identifier with a known function type and generate typed operands.
- `Declaration.CommentNB(CommentStatement[Padding[NonBinding]])`
  Syntax shape: comment wrapping an NB expression.
  Generation: wrapper/noise over a generated nonbinding.
- `Declaration.IfElse(ifCases, elseCase)`
  Syntax: `if cond: ... elif cond: ... else: ...`.
  Generation: generate `Bool` conditions and branch bodies of the demanded type.
- `Declaration.Ternary(trueCase, cond, falseCase)`
  Syntax: `trueCase if cond else falseCase`.
  Generation: same typing rule as `IfElse`, compact syntax variant.
- `Declaration.Lambda(args, body)`
  Syntax: `(p1, p2, ...) -> body` / `p -> body`.
  Generation: introduction for `Type.Fun(args, res)` with exact arity and typed arg patterns.
- `Declaration.Literal(lit)`
  Syntax: numeric/string/char/float literal.
  Generation: direct for literal-compatible goal types.
- `Declaration.Match(kind, arg, cases)`
  Syntax: `match`/`recur` with `case` branches.
  Generation: choose a scrutinee type, build a total pattern set, generate each branch in a branch-extended env.
- `Declaration.Matches(arg, pattern)`
  Syntax: `expr matches pattern` (returns `Bool`).
  Generation: generate `arg` and pattern at the same type; use when the demanded type is `Bool`.
- `Declaration.Parens(of)`
  Syntax: `(decl)`.
  Generation: wrapper for precedence/roundtrip diversity.
- `Declaration.TupleCons(items)`
  Syntax: `()`, `(a,)`, `(a, b, ...)`.
  Generation: produce tuple values for tuple goal types.
- `Declaration.Var(name)`
  Syntax: identifier reference.
  Generation: elimination path from an env value/constructor.
- `Declaration.RecordConstructor(cons, arg)`
  Syntax: `Cons { field }`, `Cons { field: value, ... }`.
  Generation: choose a struct/constructor with record-style fields; generate each provided field expression.
- `Declaration.StringDecl(items)` with `StringDecl.Part` variants: `StringDecl.Literal(region, str)`, `StringDecl.StrExpr(nonBinding)`, `StringDecl.CharExpr(nonBinding)`.
  Syntax: interpolated string segments `'foo${s}$.{c}'`.
  Generation: mixed literals and embedded expressions (`String` or `Char`) to produce `String`.
- `Declaration.ListDecl(list)` with `ListLang` variants: `ListLang.Cons(items)` where items are `SpliceOrItem.Item(expr)` or `SpliceOrItem.Splice(expr)` (`*expr`), and `ListLang.Comprehension(expr, binding, in, filter)`.
  Syntax: `[a, *rest]`, `[expr for p in src if cond]`.
  Generation: constructor/list-intro path plus an optional comprehension path.
- `Declaration.DictDecl(list)` with `ListLang` variants: `ListLang.Cons(items)` where each item is `KVPair(key, value)`, and `ListLang.Comprehension(KVPair(key, value), binding, in, filter)`.
  Syntax: `{k: v, ...}`, `{k: v for p in src if cond}`.
  Generation: only when the demanded type is `Dict[k, v]`; generate typed keys/values and source/filter.
Internal declaration helper syntax that must also be generated
- `Declaration.MatchBranch(pattern, guard, body)` for `match` cases.
- `Declaration.RecordArg.Simple(field)` and `Declaration.RecordArg.Pair(field, arg)` for record constructors.
- `Declaration.ApplyKind.Parens` and `Declaration.ApplyKind.Dot` for application rendering diversity.
Pattern Generation
`match` generation requires typed pattern synthesis plus branch-local binding environments.
Pattern surface (exhaustive)
Generate all Pattern forms:
- `Pattern.WildCard` syntax `_`.
- `Pattern.Literal(lit)` syntax literal pattern.
- `Pattern.Var(name)` syntax bindable name.
- `Pattern.Named(name, pat)` syntax `pat as name`.
- `Pattern.Annotation(pat, tpe)` syntax `pat: Type`.
- `Pattern.PositionalStruct(name, params)` syntax:
  - constructor tuple-like: `Cons(p1, p2, ...)`,
  - constructor record-like: `Cons { f1: p1, f2, ... }`,
  - tuple patterns: `()`, `(p,)`, `(p1, p2, ...)`,
  - partial constructor patterns: `Cons(...)` and `Cons(p1, p2, ...)` (`NamedPartial`).
- `Pattern.ListPat(parts)` syntax `[p1, p2, *rest]`, with parts: `ListPart.Item(pat)`, `ListPart.NamedList(name)` (`*name`), `ListPart.WildList` (`*_`).
- `Pattern.StrPat(parts)` syntax interpolated string patterns with parts: `StrPart.LitStr(s)`, `StrPart.NamedStr(name)` (`${name}`), `StrPart.WildStr` (`${_}`), `StrPart.NamedChar(name)` (`$.{name}`), `StrPart.WildChar` (`$.{_}`).
- `Pattern.Union(head, rest)` syntax `p1 | p2 | ...`.
Typed branch plan
Use an internal typed plan during generation:
```scala
final case class BranchPattern(
    parsed: Pattern.Parsed,
    typed: Pattern[(PackageName, Identifier.Constructor), rankn.Type],
    scrutineeType: rankn.Type,
    bindings: Map[Identifier.Bindable, rankn.Type]
)
```

`bindings` is computed while building the pattern, not recovered later by guessing.
Total pattern-set generation
Define:
```scala
def totalPatterns(ctx: Ctx, tpe: rankn.Type, depth: Int): NonEmptyList[BranchPattern]
```
Rules (must always return a total set):
- If the `tpe` root is a known defined type in `ctx.typeEnv` (including predef ADTs) and has constructors, generate one branch per constructor.
- Instantiate constructor argument types by substituting the type parameters of the `DefinedType` with the concrete type arguments from `tpe`.
- For each constructor argument type, generate subpatterns with a depth budget:
  - prefer binders (`Var`) when branch expressions should use the value,
  - prefer `_` when binding is unnecessary,
  - optionally recurse into structural subpatterns for additional coverage.
- If no structural decomposition is available (functions, primitive numerics, opaque defined/external types, unsupported shapes), return a single wildcard branch `_`.
- Keep branches unguarded when proving totality by construction; guarded branches can be added only with an unguarded fallback branch.
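These rules can be sketched over a simplified pattern model (hypothetical names; the real version returns `BranchPattern`s with typed bindings and consults `ctx.typeEnv`):

```scala
sealed trait Pat
case object Wild extends Pat
final case class Bind(name: String) extends Pat
final case class Struct(cons: String, args: List[Pat]) extends Pat

// adt: Some(constructor name -> arity) for a defined type with visible
// constructors; None models functions/primitives/opaque types
def totalPatterns(adt: Option[List[(String, Int)]]): List[Pat] =
  adt match {
    case Some(ctors) if ctors.nonEmpty =>
      // one branch per constructor, a binder per argument: total by construction
      ctors.map { case (c, arity) =>
        Struct(c, List.tabulate(arity)(i => Bind(s"a$i")))
      }
    case _ => List(Wild) // no structural decomposition: single wildcard branch
  }
```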
DefinedType-specific totality
Given `dt: DefinedType[Kind.Arg]` and demanded type `T`:
- Let `constructors = dt.constructors`.
- Build a branch pattern for each constructor `cf` in source order.
- Each branch root is `Pattern.PositionalStruct((dt.packageName, cf.name), argsPats)`.
- If `constructors` is empty, emit a wildcard-only fallback (`_`) for that type.
- Otherwise, this constructor-complete set is total for that `DefinedType` (modulo valid arity), so it is the default strategy for ADTs/enums.
Branch environment extension
For each generated branch:
- Compute `branchCtx = ctx` extended with `bindings` from that branch pattern.
- Generate the guard (if any) and the branch body under `branchCtx`.
- Because each branch has different `bindings`, branch generation is independent per branch.
Validation pass for pattern sets
Before finalizing a match:
- Convert planned patterns to typed/internal form.
- Run `TotalityCheck` validation (`validatePattern`, missing/unreachable checks) in test mode.
- If invalid/non-total, regenerate the pattern set with reduced complexity or a wildcard fallback.
Statement Generation
Use a stateful loop over `Ctx`.
v1 statement set
- `Statement.Bind` with variable pattern only.
- Optional comments/padding can be added after semantic generation (they do not affect typing).
This already enables nontrivial typed programs through lambdas/apps/constructors.
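A toy model of the v1 loop (the `BindStmt` record is hypothetical; real generation would thread `Ctx` and emit `Statement.Bind`), showing how the environment is threaded forward so later binds can reuse earlier names via elimination:

```scala
import scala.util.Random

final case class BindStmt(name: String, tpe: String, reuses: Option[String])

def genProgram(n: Int, rng: Random): List[BindStmt] = {
  val baseTypes = Vector("Int", "String", "Bool")
  (0 until n).foldLeft(List.empty[BindStmt]) { (acc, i) =>
    val t = baseTypes(rng.nextInt(baseTypes.size))
    // elimination: sometimes reuse an earlier bind of the same type
    val prior = acc.filter(_.tpe == t)
    val reuse =
      if (prior.nonEmpty && rng.nextBoolean()) Some(prior(rng.nextInt(prior.size)).name)
      else None
    acc :+ BindStmt(s"v$i", t, reuse) // fresh top-level name, env extended
  }
}
```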
v2 statement set
- `Statement.Def` for named function declarations.
- Local `struct`/`enum` generation, then value generation that uses those constructors.
v3 statement set
- Recursive defs constrained to known-safe templates.
- Rich patterns and guards in generated `match`.
Name and Scope Hygiene
- Generate fresh bindable names not in `ctx.usedNames`.
- For lambdas/matches, avoid shadowing collisions by allocating fresh locals and updating the context explicitly.
- Keep top-level names unique in v1 (simplifies witness checking and shrinking).
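Freshness against `usedNames` can be a simple counter probe. A minimal sketch (the helper name `freshName` is assumed, not an existing API):

```scala
// Returns `base` if unused, else the first of `base1`, `base2`, ... not in `used`.
def freshName(base: String, used: Set[String]): String =
  if (!used(base)) base
  else Iterator.from(1).map(i => s"$base$i").find(n => !used(n)).get
```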
Shrinking Strategy
Structural shrinking alone tends to break typing. Use a typing-aware shrinker:
- Remove whole statements while preserving at least one exported/main candidate.
- Shrink expression subtrees with typed rules (replace with in-scope variable of same type, simplify constructor args, simplify literals).
- Recheck candidates with the fast typecheck harness; keep only well-typed shrinks.
This balances reliability and implementation cost.
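One typed shrink step on a toy expression model (hypothetical names): every candidate has the same type as the original, so shrinks stay well typed by construction and only need a cheap recheck:

```scala
sealed trait Expr { def tpe: String }
final case class Lit(n: Int) extends Expr { def tpe = "Int" }
final case class Var(name: String, tpe: String) extends Expr
final case class Add(l: Expr, r: Expr) extends Expr { def tpe = "Int" }

// Candidates that replace `e` with something smaller of the SAME type.
def typedShrinks(e: Expr, env: Map[String, String]): List[Expr] = {
  val sameTypedVars = env.collect { case (n, t) if t == e.tpe => Var(n, t) }.toList
  val structural = e match {
    case Add(l, r)        => List(l, r)   // drop one operand; both sides are Int
    case Lit(n) if n != 0 => List(Lit(0)) // simplify literal toward zero
    case _                => Nil
  }
  sameTypedVars ++ structural
}
```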
How This Checks the Compiler
The generator enables new properties beyond parser roundtrips.
1. Typecheck success property
For all generated programs, full package typechecking succeeds:
- Build a parsed package from generated statements.
- Run `PackageMap.typeCheckParsed` (so predef import behavior matches real compilation).
- Assert no type errors.
This validates both the generator and `SourceConverter` + inference success-path robustness.
2. Witness agreement property
For `wellTypedProgramGen`, compare inferred top-level types with generator expectations:
- For each generated bind `x : t_expected`,
- infer type `t_inferred`,
- validate `t_expected` is closed/resolved in `wellTypedProgram.typeEnv` (all referenced `TyConst` are known in that env),
- assert `t_inferred` subsumes/equates `t_expected` (modulo normalization/quantification), using the same env context.
This catches generator bookkeeping bugs and inference regressions.
3. Normalization equivalence property
For a generated package with a ground main value:
- infer the unoptimized body (`inferBodyUnopt`),
- normalize (`TypedExprNormalization.normalizeProgram`),
- evaluate both via `Evaluation`,
- assert equal resulting values.
This directly tests optimization soundness on random typed source programs.
4. Matchless compilation totality property
For inferred packages:
- run `MatchlessFromTypedExpr.compile`,
- assert no exceptions and that all lets compile.
This is a strong stability property for the middle-end.
5. Backend smoke/semantic checks (nightly)
For a restricted generated subset (ground result types, limited externals):
- compile with backend(s) (C/Python as configured),
- evaluate `main`,
- compare with the interpreter result.
Run at lower test counts initially due to runtime cost.
Incremental Rollout Plan
- Phase 1: `Bind`-only generator, base/ADT/function expressions, typecheck + matchless properties.
- Phase 2: add `Def`, add generated type definitions (struct/enum), add witness agreement checks.
- Phase 3: add `match` generation with typed patterns and guards, add the normalization equivalence property.
- Phase 4: backend cross-checks for a restricted profile.
Risks and Mitigations
- High rejection rate from `OptionT` failures. Mitigation: inhabited-type filtering + fallback to base goals + bounded retries.
- Inference blowups on deep types/expressions. Mitigation: strict depth/arity limits and a weighted small-size bias.
- Shrinker flakiness/perf cost. Mitigation: typed rewrite shrinker first, then a typecheck-filtered fallback.
- Overfitting to current inference behavior. Mitigation: keep generator rules type-theoretic (intro/elim) and avoid solver internals where possible.
Why This Design Matches Bosatsu
- It composes with existing ASTs (`Statement`, `Declaration`) and existing test infra (ScalaCheck, `PackageMap.typeCheckParsed`).
- It follows the same environment-driven typing model used by inference.
- It gives a clear path from “can typecheck generated source” to “can validate compiler transformations and codegen behavior” using the same generated inputs.