
Exploring AI for reproducible experimental designs and power calculations — experiences and tips?

    • #126133
      Ian Investor
      Spectator

      Hello — I’m curious about whether AI tools can help with designing experiments and doing power calculations in a way that’s reproducible and trustworthy. I’m not a technical expert, so practical, plain-language answers are most helpful.

      Specifically, I’m wondering:

      • Can AI generate reproducible experimental designs and power calculations that I can re-run or share with colleagues?
      • What tools or workflows have people used (simple examples or names of apps/scripts are fine)?
      • How do you check or validate AI-generated designs and calculations to avoid mistakes?
      • Any tips for keeping the process transparent and repeatable (versioning, script examples, prompts)?

      If you’ve tried this, could you share a short example or a do/don’t tip? I’d love pointers to beginner-friendly resources or templates, and any cautionary notes. Thanks!

    • #126138
      Jeff Bullas
      Keymaster

      Quick win (5 minutes): Ask an AI for a sample-size estimate for a simple two-group comparison. With just an effect size, standard deviation, alpha and power, you’ll get a usable starting point fast.

      Context: AI won’t replace your statistician, but it’s excellent at turning vague ideas into reproducible designs and simulation-ready code. Use it to speed planning, test assumptions, and produce shareable documents that others can rerun.

      What you’ll need

      • Clear hypothesis (example: difference in means between A and B).
      • Estimates: expected effect size or mean difference, standard deviation, desired power (usually 0.8) and alpha (usually 0.05).
      • An AI chat tool and a spreadsheet, R, Python, or an online runner that can execute simple scripts.

      Step-by-step

      1. Define experiment: outcome, groups, measurement frequency, primary endpoint.
      2. Get a draft sample-size calculation: paste the AI prompt below into your chat tool (copy-paste exactly) and ask for a short justification and assumptions.
      3. Ask for reproducible simulation code: request R or Python code with a fixed random seed and comments so anyone can rerun it.
      4. Run the simulation: copy the code into your environment (or ask the AI to translate to a spreadsheet-friendly version) and confirm the achieved power.
      5. Document everything: save the prompt, AI response, code, seed, software versions and a short README.

      Copy-paste AI prompt (use as-is)

      “I want to design a reproducible experiment comparing two independent groups on a continuous outcome. Assume expected mean difference = 0.5 units, pooled standard deviation = 1.0, two-sided alpha = 0.05, desired power = 0.8. Provide: 1) a brief explanation of the sample-size calculation and the resulting n per group; 2) an R script (or Python) that simulates 10,000 experiments with a fixed random seed showing the achieved power; 3) a short checklist of assumptions to verify. Keep output concise and include comments in the code.”

      Example expectation

      For the numbers above (mean diff 0.5, SD 1.0), you should see roughly n ≈ 64 per group for 80% power. The AI should give simulation code with set.seed(12345) or equivalent so results are repeatable.
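
      For reference, here is a minimal Python sketch of the kind of simulation you might get back (my own illustration, not actual AI output); it assumes numpy and scipy are installed and uses a numeric seed in place of R's set.seed:

      import numpy as np
      from scipy import stats

      rng = np.random.default_rng(12345)              # fixed seed so results repeat
      n_per_group, mean_diff, sd = 64, 0.5, 1.0
      n_sims, alpha = 10_000, 0.05

      rejections = 0
      for _ in range(n_sims):
          a = rng.normal(0.0, sd, n_per_group)        # control group
          b = rng.normal(mean_diff, sd, n_per_group)  # treatment group
          _, p = stats.ttest_ind(a, b)                # two-sided, equal variances
          rejections += p < alpha

      print(f"Achieved power: {rejections / n_sims:.3f}")  # expect roughly 0.80

      Rerunning with the same seed returns the same power estimate, which is exactly the reproducibility property you want to confirm.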

      Mistakes & fixes

      • Mistake: Blindly accept AI output. Fix: Validate with a second method (simple formula, calculator, or colleague).
      • Mistake: No seed or versioning. Fix: Always set random seeds and note software/package versions.
      • Mistake: Vague priors/estimates. Fix: Run sensitivity checks across plausible effect sizes.

      Action plan (next 48 hours)

      1. Use the prompt above to get a draft design and code.
      2. Run the simulation with the provided seed and record results.
      3. Do one sensitivity run (smaller/larger effect) to see how n changes.
      4. Save the prompt, output, code and a one-paragraph summary for collaborators.

      Remember: AI speeds experiments, but your judgment makes them reliable. Start small, validate, document, and iterate.

    • #126153

      Small correction, then a simple approach. I’d tweak one instruction: don’t literally copy-paste a single canned prompt without customizing it. Clarify whether you mean a raw mean difference or a standardized effect (Cohen’s d), state two- vs one-sided test and equal-variance assumptions, and always ask the AI to include the random seed and software/package versions. Those details make results reproducible and avoid subtle mismatches later.

      What you’ll need

      • Clear hypothesis and primary endpoint (what you will compare and how you’ll measure it).
      • Numeric inputs: expected mean difference or effect size, pooled (or group) SD, alpha and target power.
      • A runnable environment (R, Python, spreadsheet or a notebook) and a way to save files (versioned folder or repo).
      • Time for a quick validation with a calculator or colleague.

      Step-by-step routine (calm, repeatable)

      1. Define the experiment: outcome, groups, one- or two-sided test, variance assumptions, primary endpoint and any covariates.
      2. Ask the AI for a concise sample-size estimate and a short justification of the formula/approximation used — then ask for reproducible simulation code (R or Python) with a fixed seed, comments, and package/version notes.
      3. Run the code in your environment, confirm the achieved power and inspect a few simulated datasets for plausibility (means, SDs, distribution shapes).
      4. Do sensitivity checks across plausible smaller/larger effects and variances (one or two extra runs is often enough).
      5. Document the prompt text you used (tailored), the AI response, code, seed, software versions and a one-paragraph README for collaborators.
      6. Validate with a quick hand calculation or an independent tool before finalizing the design (a minimal check of this kind is sketched just below).
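
      One way to do that independent check, assuming you have Python with statsmodels installed (a sketch of my own, not the thread's AI output):

      from statsmodels.stats.power import tt_ind_solve_power

      # Cohen's d = mean difference / pooled SD = 0.5 / 1.0
      n_per_group = tt_ind_solve_power(effect_size=0.5, alpha=0.05, power=0.8,
                                       alternative="two-sided")
      print(round(n_per_group))  # roughly 64 per group

      If this analytic number and the AI's simulation disagree by more than a couple of participants per group, dig into the assumptions before trusting either.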

      Prompt variants to adapt (keep them brief and contextual)

      • Independent two-group continuous: Request n per group, brief derivation, and an R/Python simulation with set seed and comments.
      • Paired measurements: Ask for paired-t sample-size logic and a paired simulation that preserves within-subject correlation (see the sketch after this list).
      • Proportions: Ask for sample sizes for two proportions and a binomial-simulation using fixed seeds.
      • Multi-arm/ANOVA: Ask for overall F-test sizing and a simulation that reports per-comparison power or adjusted alpha.
      • Spreadsheet-friendly: Ask for formulas or small tables you can paste into Excel/Sheets instead of code.
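
      For the paired-measurements variant, a minimal simulation sketch (my illustration; the pair count, correlation, and seed are assumed numbers, not recommendations) that preserves within-subject correlation by drawing pre/post pairs from a bivariate normal:

      import numpy as np
      from scipy import stats

      rng = np.random.default_rng(12345)
      n_pairs, mean_diff, sd, rho = 34, 0.5, 1.0, 0.5   # hypothetical inputs
      n_sims, alpha = 10_000, 0.05

      cov = [[sd**2, rho * sd**2],
             [rho * sd**2, sd**2]]                      # correlated pre/post values

      hits = 0
      for _ in range(n_sims):
          pre, post = rng.multivariate_normal([0.0, mean_diff], cov, n_pairs).T
          _, p = stats.ttest_rel(pre, post)             # paired t-test
          hits += p < alpha

      print(f"Achieved power: {hits / n_sims:.3f}")     # around 0.8 with these inputs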

      What to expect and a calming routine

      You’ll usually get a sensible n and a reproducible script; the simulation’s achieved power should be close but may differ if assumptions are off. To keep stress low, use a 5-step checklist (define, ask, run, sensitivity, document), set a fixed seed and version notes every time, and save a one-paragraph summary for stakeholders. Small, consistent habits protect your work and make collaboration painless.

    • #126158
      Jeff Bullas
      Keymaster

      Quick win (under 5 minutes): Copy the prompt below into your AI chat, replacing only the numbers for effect and SD. Ask for R or Python code with a fixed seed. You’ll get an immediate n per group and a runnable simulation.

      Nice point about customizing the prompt — absolutely essential. I’d add one small practical habit: always state whether your effect is a raw mean difference or Cohen’s d, and whether you want a one- or two-sided test. That prevents surprises.

      What you’ll need

      • Clear primary hypothesis (what you compare and the outcome measure).
      • Numeric inputs: mean difference or Cohen’s d, group SDs (or pooled), alpha, target power.
      • A runnable environment (R, Python, or spreadsheet) and somewhere to save files (versioned folder).

      Step-by-step (do-first mindset)

      1. Decide whether your input is raw mean difference or standardized (Cohen’s d). Note one- vs two-sided and equal-variance assumptions.
      2. Paste the tailored prompt below into the AI. Ask for a short justification, an explicit formula reference, and reproducible code with seed and package versions.
      3. Run the code in your environment. Check the achieved power and inspect a few simulated datasets for plausibility.
      4. Do one sensitivity check changing effect size ±20% or SD ±20% to see how n shifts (a small sweep of this kind is sketched after this list).
      5. Document the prompt, AI output, code, seed, software versions and a one-paragraph summary.
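
      A compact sensitivity sweep of that kind, assuming Python with statsmodels installed (my own sketch, not part of the post):

      from math import ceil
      from statsmodels.stats.power import tt_ind_solve_power

      base_diff, base_sd = 0.5, 1.0
      for diff in (0.8 * base_diff, base_diff, 1.2 * base_diff):
          for sd in (0.8 * base_sd, base_sd, 1.2 * base_sd):
              d = diff / sd                              # standardized effect
              n = tt_ind_solve_power(effect_size=d, alpha=0.05, power=0.8,
                                     alternative="two-sided")
              print(f"diff={diff:.2f}, sd={sd:.2f} -> n per group = {ceil(n)}")

      Watching how quickly n climbs when the effect shrinks is usually the most sobering, and most useful, part of the exercise.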

      Copy-paste AI prompt (edit only the numbers and test type)

      “I want to design a reproducible experiment comparing two independent groups on a continuous outcome. Clarify: this is a raw mean difference (not Cohen’s d). Expected mean difference = 0.5 units, pooled SD = 1.0, two-sided test, equal variances, alpha = 0.05, desired power = 0.8. Provide: 1) a brief explanation of the sample-size formula used and resulting n per group; 2) R (or Python) code that simulates 10,000 experiments with a fixed random seed (set.seed(12345) or equivalent), prints the achieved power, and includes package/version notes in comments; 3) a short checklist of assumptions to verify. Keep output concise and include comments in the code.”

      Example expectation

      With mean diff 0.5 and SD 1.0 (Cohen’s d = 0.5), you should see roughly n ≈ 64 per group for 80% power. The simulation with a fixed seed should reproduce the same power each time.

      Mistakes & fixes

      • Mistake: Using a prompt that’s too vague. Fix: Explicitly state raw vs standardized, sidedness, and variance assumptions.
      • Mistake: No seed or versions. Fix: Always request a fixed seed and have package versions noted in comments.
      • Mistake: Accepting AI code without inspection. Fix: Run a handful of simulated samples and compare sample means/SDs to your inputs.

      Action plan (next 48 hours)

      1. Run the prompt above with your numbers and get a draft n and code.
      2. Execute the code, record achieved power, and save outputs with the prompt text.
      3. Run one sensitivity scenario and write a one-paragraph summary for collaborators.

      Keep it small and repeatable: tailor the prompt, set a seed, run a quick sanity check, and document. That routine turns AI help into reliable, reproducible designs.

    • #126170
      aaron
      Participant

      5‑minute quick win: Ask AI to size your study with covariate adjustment (ANCOVA) and tell you how many subjects you save vs. a plain two‑group t‑test. Paste the prompt below, swap in your numbers, hit run.

      The problem: Most designs ignore baseline covariates, blocks, or clustering. You overpay on sample size, or worse, end up underpowered. Reproducibility slips because assumptions and seeds aren’t locked.

      Why it matters: A 10–30% sample reduction (common when a baseline explains some variance) is real money and lead time. Locked seeds, version notes, and a one‑page design contract make your work rerunnable and defensible.

      Field lesson: The fastest ROI isn’t a fancier test; it’s a standard routine. Use AI to 1) size analytically, 2) confirm by simulation, 3) generate a sensitivity grid and a design contract, 4) store everything in a versioned folder. Do this once; reuse forever.

      What you’ll need

      • Primary outcome and comparison (two groups, paired, or proportions).
      • Numeric inputs: effect (mean diff or Cohen’s d), SD (or proportion), alpha, power.
      • Optional leverage: baseline covariate with estimated correlation to outcome, or blocks/clusters.
      • R or Python environment; a folder you can version by date; willingness to run a quick validation.

      Step-by-step (reproducible and fast)

      1. Choose effect scale: state explicitly “raw mean difference” or “Cohen’s d.” Decide one- vs two‑sided test, equal-variance assumption, and whether you’ll adjust for a baseline covariate.
      2. Analytic sizing: ask AI for n per group using both an unadjusted t‑test and ANCOVA (given an R² or correlation for the baseline). Request the percent sample savings (the arithmetic is sketched after this list).
      3. Simulation: request R/Python code with a fixed seed to simulate 10,000 trials and report achieved power for both methods.
      4. Sensitivity grid: have AI produce a compact table: effect ±20% and SD ±20% (or R² from 0.2–0.6). This shows how n moves.
      5. Design contract: generate a one‑page, plain‑English summary: hypotheses, test, assumptions, inputs, planned analysis, exclusions, seed, software versions, and filenames. Save it alongside the code.
      6. Validate: cross‑check n with a second calculator or a colleague before you commit.
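
      To make step 2 concrete, here is the rough arithmetic in Python (my own sketch using a common approximation, not aaron's script): scale the unadjusted n by (1 − r²) when an ANCOVA adjusts for a baseline with correlation r to the outcome.

      from math import ceil
      from statsmodels.stats.power import tt_ind_solve_power

      r = 0.5                                       # assumed baseline-outcome correlation
      n_unadj = tt_ind_solve_power(effect_size=0.5, alpha=0.05, power=0.8,
                                   alternative="two-sided")
      n_ancova = n_unadj * (1 - r**2) + 1           # +1 as a small-sample safeguard

      print(f"Unadjusted t-test: {ceil(n_unadj)} per group")
      print(f"ANCOVA (r={r}): {ceil(n_ancova)} per group, "
            f"about {100 * (1 - n_ancova / n_unadj):.0f}% fewer subjects")

      With these assumed numbers the saving lands in the 20–25% range, squarely inside the 10–30% band mentioned above.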

      Copy‑paste AI prompt (covariate‑adjusted sizing + simulation)

      “Design a reproducible two‑group experiment on a continuous outcome using a raw mean difference (not Cohen’s d). Inputs: expected mean difference = 0.5 units, pooled SD = 1.0, two‑sided alpha = 0.05, target power = 0.80. We will adjust for a baseline measure correlated with the outcome; assume correlation r = 0.5 (state assumptions). Provide: 1) n per group for an unadjusted t‑test; 2) n per group for ANCOVA using the baseline (use R² = r^2) and the percent reduction vs. unadjusted; 3) an R or Python script that simulates 10,000 experiments with a fixed random seed (12345), compares achieved power for both methods, and prints results; 4) a compact sensitivity table varying effect ±20% and r from 0.3 to 0.6; 5) a short checklist of assumptions to verify. Include comments, the exact seed, and software/package versions in comments.”

      Bonus prompt (generate a one‑page design contract)

      “Create a one‑page Design Contract for my study. Include: objective, primary endpoint, analysis population, test (unadjusted t‑test and ANCOVA), inputs (effect, SD, alpha, power, sidedness), covariates/blocks, clustering or pairing if any, missing‑data/attrition plan, multiple‑testing adjustments, simulation seed, software/package versions, filenames/paths for scripts and outputs, and a decision rule (go/no‑go). Write in plain English with bullet points. Keep it concise and reproducible.”

      What to expect

      • For moderate baseline correlation (r ≈ 0.4–0.6), ANCOVA often cuts the required n meaningfully. Your simulation should echo the analytic estimate within a few percentage points (a simulation sketch follows these bullets).
      • The sensitivity grid will show how fragile your n is to weaker effects or higher SD. Use it to set realistic timelines and budgets.
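
      Here is what the head-to-head simulation might look like in Python (a sketch under assumed inputs: r = 0.5, 49 per group from the ANCOVA sizing above, numpy/scipy/statsmodels installed):

      import numpy as np
      from scipy import stats
      import statsmodels.api as sm

      rng = np.random.default_rng(12345)
      n, diff, sd, r = 49, 0.5, 1.0, 0.5
      n_sims, alpha = 2_000, 0.05                   # fewer sims to keep it quick

      group = np.repeat([0, 1], n)                  # 0 = control, 1 = treatment
      ttest_hits = ancova_hits = 0
      for _ in range(n_sims):
          baseline = rng.normal(0.0, sd, 2 * n)
          noise = rng.normal(0.0, sd * np.sqrt(1 - r**2), 2 * n)
          outcome = diff * group + r * baseline + noise   # corr(outcome, baseline) = r

          _, p_t = stats.ttest_ind(outcome[group == 1], outcome[group == 0])
          ttest_hits += p_t < alpha

          X = sm.add_constant(np.column_stack([group, baseline]))
          p_a = sm.OLS(outcome, X).fit().pvalues[1]       # p-value for the group term
          ancova_hits += p_a < alpha

      print(f"t-test power: {ttest_hits / n_sims:.2f}")   # roughly 0.70 at this n
      print(f"ANCOVA power: {ancova_hits / n_sims:.2f}")  # roughly 0.80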

      KPIs to track

      • Power delta (sim vs analytic): aim ≤2 percentage points.
      • Reproducibility pass rate: any colleague reruns and matches outputs with the same seed and versions.
      • Sample efficiency: percent reduction from covariate/blocks vs. unadjusted design.
      • Design cycle time: hours from first prompt to frozen Design Contract.
      • Sensitivity coverage: at least 6 scenarios saved (effect ±20%, SD ±20%, r sweep).

      Common mistakes & fixes

      • Mixing effect scales: Cohen’s d vs raw difference. Fix: declare the scale in every prompt and document.
      • Ignoring covariates or blocks: you leave power on the table. Fix: include baseline r or block factors; compare n with/without.
      • Assuming normality blindly: outcome is skewed or bounded. Fix: ask for non‑normal simulations (e.g., log‑normal) and verify robustness.
      • Forgetting clustering/pairing: teams, sites, or repeated measures inflate variance. Fix: specify ICC or pairing and request the correct formula/simulation.
      • No attrition plan: real‑world dropouts happen. Fix: inflate n by the expected attrition and simulate missingness (a two-line adjustment is sketched after this list).
      • Multiple looks/tests: alpha creep. Fix: declare adjustments up front (e.g., Holm) or a group‑sequential plan; simulate it.
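
      The attrition fix is two lines of arithmetic (the dropout rate here is an assumed example):

      from math import ceil

      n_per_group, dropout = 64, 0.15           # sized n and an expected 15% dropout
      print(ceil(n_per_group / (1 - dropout)))  # enroll about 76 per group to finish with 64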

      One‑week action plan

      1. Day 1: Run the covariate‑adjusted prompt with your numbers. Save the output, code, seed, and a timestamped folder.
      2. Day 2: Execute the simulation. Record achieved power for unadjusted vs ANCOVA. Note any gap >2pp.
      3. Day 3: Generate and save the sensitivity grid (effect ±20%, SD ±20%, r 0.3–0.6). Decide your MDE.
      4. Day 4: Add realism: attrition rate and any clustering. Re‑size and re‑simulate.
      5. Day 5: Produce the Design Contract. Include decision rules and filenames/paths.
      6. Day 6: Peer validation: a colleague reruns with the same seed and versions. Resolve any mismatch.
      7. Day 7: Freeze the design. Share the one‑pager and archive the folder. Schedule the build.

      Use AI to sharpen the decision, not replace your judgment. Lock assumptions, simulate truthfully, document once, and reuse the pattern.

      Your move.

      — Aaron

    • #126179

      Nice plan — practical and time-smart. If you want a one-hour routine that actually reduces sample size and improves confidence, follow a tiny repeatable workflow: size analytically, confirm with a seeded simulation, run one sensitivity sweep, and save a one‑page design contract. Do it once and you’ll reuse the folder forever.

      What you’ll need

      • Primary endpoint and clear hypothesis (what you compare and how).
      • Numeric inputs: either a raw mean difference or Cohen’s d, pooled SD (or proportions), alpha (usually 0.05), target power (usually 0.8).
      • Optional: baseline covariate correlation (r) if you’ll use ANCOVA, or ICC for clustering.
      • A runnable environment (R, Python, or a spreadsheet) and a versioned folder to save outputs.

      How to do it — 6 micro-steps (30–90 minutes)

      1. Decide scale: state “raw mean difference” or “Cohen’s d,” one- vs two-sided, equal-variance assumption.
      2. Analytic sizing: get n per group for an unadjusted t-test and for ANCOVA (if you have r). Record formulas or references the AI cites.
      3. Request reproducible code from the AI: R or Python, set a fixed seed, include package/version comments, simulate 5–10k trials, and print achieved power for both methods.
      4. Run and sanity-check: execute the code, inspect a few simulated datasets (means, SDs, histograms) and confirm power is within ~2 percentage points of analytic sizing.
      5. Sensitivity sweep: test effect ±20% and SD ±20% (and r from, say, 0.3–0.6). Save the small table the AI produces.
      6. Archive: save the exact prompt wording you used, AI outputs, scripts, seed, package versions, and a one-page design contract (hypotheses, tests, assumptions, decision rule).

      Prompt guide & variants (keep it conversational when you paste)

      • Tell the AI: your scale (raw vs d), sidedness, numbers (effect, SD, alpha, power), and whether to adjust for a baseline with correlation r.
      • Ask explicitly for: the analytic n for both methods, simulation code with a fixed seed and comments, a compact sensitivity table, and a short checklist of assumptions to verify.
      • Variants: request a paired-sample version (preserve within-subject correlation), a proportions/binomial version, a cluster/ICC-aware version (a small cluster adjustment is sketched after this list), or spreadsheet-ready formulas instead of code.
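
      For the cluster/ICC-aware case, a small Python sketch (my illustration; the ICC and cluster size are assumptions) applying the standard design effect 1 + (m − 1) × ICC:

      from math import ceil
      from statsmodels.stats.power import tt_ind_solve_power

      icc, cluster_size = 0.05, 10          # assumed ICC and subjects per cluster
      n_flat = tt_ind_solve_power(effect_size=0.5, alpha=0.05, power=0.8,
                                  alternative="two-sided")
      deff = 1 + (cluster_size - 1) * icc   # design effect for clustering
      n_clustered = ceil(n_flat * deff)

      print(f"Independent subjects: {ceil(n_flat)} per arm")
      print(f"Clustered: {n_clustered} per arm, "
            f"i.e. about {ceil(n_clustered / cluster_size)} clusters of {cluster_size}")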

      What to expect

      • ANCOVA with moderate r (≈0.4–0.6) often reduces sample size by 10–30% — simulation should mirror analytic estimates within a few percent.
      • Key reproducibility items: fixed random seed, package/version comments, saved prompt text, and a one-page design contract.
      • Quick KPIs: power delta (sim vs analytic) ≤2pp, colleague rerun match, and a saved sensitivity grid covering at least 6 scenarios.

      Small routine, big payoff: size analytically, validate once with a seed, run a quick sensitivity, and archive. That pattern protects time and budget while keeping the design defensible.
