Author: Yusaku Horiuchi
Affiliation: Syde P. Deeb Eminent Scholar in Political Science, Florida State University
Created: May 10, 2026
Last revised: May 15, 2026
This repository is an instruction page and template collection for building high-quality replication packages for social science research projects. It is designed to be read by humans and by coding agents such as Codex or Claude Code before they prepare, audit, or repair a replication package.
The guide and templates assume an R-based workflow, with master.R, R scripts, and session_info.log as the default examples. The underlying standard is not limited to R. If a project uses Stata, Python, Julia, MATLAB, or another toolchain, users can ask an agentic AI to read this guide and prepare an analogous replication package with the appropriate single entry point, logs, software-environment record, and figure/table crosswalk.
The lightweight templates in this repository illustrate the recommended package designs:
- templates/README_TEMPLATE.md: a copyable starting point for a project’s one and only README.md.
- templates/compact/: compact project structure for smaller projects.
- templates/build-analyze/: larger project structure with separate build/ and analyze/ stages.
- examples/horiuchi_tago/: a finished compact replication package example.

The structure templates intentionally contain only folder structure and essential example files such as .Rproj, README.md, master.R, script stubs, and logging helpers. They do not include full replication packages or large data files.
The example package shows what a completed compact package can look like after applying the guide. It includes a real README.md, master.R, numbered scripts, logs, generated figures and outputs, and a paper-source consistency note.
Use standard Markdown as the authoritative instruction format. Markdown is easiest for agents to read, easiest to host on GitHub or another static site, and does not require R to render. Each replication package should commit one and only one README file: README.md.
The public guide is available at:
https://yhoriuchi.github.io/replication-package-guide/
When using Codex, Claude Code, or another coding agent, give the agent this URL and ask it to read the guide before changing any files. The GitHub repository is available at:
https://github.com/yhoriuchi/replication-package-guide
Keep this Markdown file as the authoritative source so the public page, repository, agents, and human readers use the same instructions.
Before asking an AI agent to polish a replication package for publication, clean the project as much as possible yourself. AI is useful for checking, reorganizing, documenting, and catching inconsistencies, but it should not be treated as a substitute for the author’s judgment about which files, scripts, data sources, and results are actually part of the replication record.
At minimum, remove clearly obsolete files, label exploratory scripts, identify the scripts that generate reported results, gather the paper source files when available, and decide which data can legally be shared. The cleaner the starting point, the more reliable the AI-assisted audit will be.
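As a starting point for that cleanup, the following is a minimal sketch that lists files which often do not belong in a public package. The function name `find_cleanup_candidates()` and the file patterns are illustrative assumptions, not part of this guide's templates, and the function only reports candidates; it never deletes anything.

```r
# Hypothetical helper: list likely clutter under a project root.
# The patterns are illustrative assumptions, not an exhaustive list.
find_cleanup_candidates <- function(root = ".") {
  patterns <- c(
    "\\.Rhistory$", "\\.RData$", "\\.DS_Store$", "~$",
    "\\.tmp$", "\\.bak$", "(^|/)Thumbs\\.db$"
  )
  # Include hidden files and search all subfolders.
  files <- list.files(root, recursive = TRUE, all.files = TRUE)
  hits <- files[grepl(paste(patterns, collapse = "|"), files)]
  sort(hits)
}

# Review the list by hand before removing anything.
print(find_cleanup_candidates("."))
```

The author still decides what is part of the replication record; the list is only a prompt for that decision.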
For a new replication package:
1. Copy templates/compact/ or templates/build-analyze/ into the new project and use its included README.md as the starting README.
2. Copy templates/README_TEMPLATE.md to the project root as README.md.
3. README.md.
4. Run source("master.R") from a fresh R session.
5. Confirm that session_info.log and one log per public script were created.
6. README.md.

To see a concrete finished package, inspect examples/horiuchi_tago/. It is included as an example, not as a template to copy blindly.
When preparing a replication package with Codex, Claude Code, or another coding agent, set the agent’s working directory so it can see both:
For Overleaf users, one practical workflow is to use Overleaf’s Dropbox integration and create the replication package inside, or immediately beside, the synced Overleaf project folder. This lets the agent inspect the manuscript source and the replication package in one workspace.
This integration makes the most important final check much easier: consistency between the paper and the replication package. The agent should verify that:
For public release, include paper source files only when appropriate and permitted. If the paper source cannot be included in the public archive, use it during preparation for the consistency check and document in README.md that the manuscript source was checked against the replication outputs.
Use these copy-paste prompts with Codex, Claude Code, or another coding agent. Each prompt assumes the agent can read this guide and inspect the project files. Use one prompt at a time so the task is specific and easy to verify.
Please read the Replication Package Guide before making changes. Then inspect my project and prepare a complete replication package. If the paper source files are available in the working directory, also check consistency between the paper and the replication outputs.
First decide whether the project should use the compact structure or the build/analyze structure. Use the compact structure when the project is small and all public inputs can be shared directly. Use the build/analyze structure when data construction is complex, uses restricted sources, involves scraping/APIs, or produces analysis-ready datasets that should be treated as the public replication inputs.
Every replication package must include master.R, script-specific log files, session_info.log, a self-contained README, and a complete crosswalk for all figures and tables reported in the paper or appendix. Check that every figure, table, and in-text numerical claim in the paper can be traced to the replication package, including estimates, standard errors, p-values, sample sizes, sampling dates, completion times, response rates, and descriptive statistics. Do not use absolute paths. Do not require manual steps unless they are documented as unavoidable.
Use templates/compact/ or templates/build-analyze/ as the starting structure. Use the selected template's README.md, or templates/README_TEMPLATE.md, as the starting point for README.md. Replace all placeholder text with project-specific documentation.
Please read the Replication Package Guide, then inspect all public R scripts in this project. Add or repair per-script logging so every public script writes a matching log file to logs/ or analyze/logs/.
Each log should record the script name, start and end time, important row counts, sample sizes, reported estimates or test results, warnings, and any other numbers reported in the paper.
Use the project's existing logging style if one exists. Do not change substantive analysis code unless needed to make logging reliable. After editing, run the public replication path and confirm that every public script produces its expected log file.
Please read the Replication Package Guide and prepare or repair the project's single authoritative README.md. Use templates/README_TEMPLATE.md as the model.
The README must include the paper title and authors, description, folder tree, files included, data sources and restrictions, paper source consistency status, how to run master.R, computing environment, session information, recommended citation, last verified date, and a paper-order crosswalk for every figure and table.
Do not create additional README files. Do not include README.html or README.pdf in the repository. Embedded figure/table previews are optional; the crosswalk is required.
Please read the Replication Package Guide, then audit the project for files and code that should not be in the public replication package.
Identify and remove temporary files, caches, old exploratory outputs, obsolete scripts, unused helper functions, personal files, absolute-path artifacts, and generated files that can be recreated by scripts. Keep source data, public scripts, documentation, final outputs, logs, and files needed to reproduce results.
Before deleting anything substantial, list what you plan to remove and why. Do not remove raw data, analysis-ready public inputs, manuscript source files, or scripts needed for reported results unless I explicitly approve.
Please read the Replication Package Guide, then compare the paper source files with the replication package outputs and logs.
Check every figure, table, and in-text numerical claim in the paper and appendix, including estimates, standard errors, p-values, confidence intervals, sample sizes, sampling dates, field dates, completion times, response rates, missing-data counts, and descriptive statistics.
For each reported item, verify that the value in the paper matches a script, log file, generated table, or generated figure. Report any mismatch with the paper source location, the replication source location, the paper value, and the replication value. Do not silently change paper text or analysis code; explain the discrepancy first.
Please read the Replication Package Guide, then review the public replication scripts for coding errors that could affect reported results.
Focus on data filtering, merges, joins, recoding, factor levels, missing-data handling, weights, clustered or robust standard errors, random seeds, model formulas, multiple-testing adjustments, output paths, and whether scripts run in the documented order from a clean R session.
Prioritize bugs, reproducibility risks, and missing tests or logs. Report findings with file paths, line references, severity, and suggested fixes. Make fixes only when they are clearly safe and within the replication package standard.
Please read the Replication Package Guide, then inspect the Overleaf/LaTeX source files and compare them with the replication package.
Check figure references, table references, labels, captions, file paths, appendix numbering, citations to results, and all in-text numerical claims. Verify that the manuscript points to the correct generated figures and tables and that the reported values match logs or generated outputs.
Report likely reporting errors with the TeX file path, label or nearby text, the value or reference in the paper, the corresponding replication source, and a recommended correction. Do not rewrite the manuscript unless I explicitly ask you to make the edits.
Please read the Replication Package Guide and perform a final pre-release audit of this replication package.
Verify that source("master.R") runs from a fresh R session; every public script creates a log; session_info.log exists; README.md is the only README file; the figure/table crosswalk is complete; paper source consistency has been checked when source files are available; no absolute paths, personal files, caches, or temporary files remain; and restricted data are documented.
Return a concise release-readiness report with pass/fail items, remaining risks, and exact files that need attention.
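Some of the checks in the final audit prompt can also be run mechanically. The sketch below assumes the compact layout (scripts/ and logs/ at the project root); `audit_package()` is a hypothetical helper, not a function shipped with the templates, and it covers only the file-existence checks, not the content checks.

```r
# Hypothetical helper: file-level checks for a compact package.
audit_package <- function(root = ".", script_dir = "scripts", log_dir = "logs") {
  # Public scripts directly under script_dir (subfolders such as
  # scripts/archive/ are deliberately not included).
  scripts <- list.files(file.path(root, script_dir), pattern = "\\.R$")
  expected_logs <- sub("\\.R$", ".log", scripts)
  has_log <- file.exists(file.path(root, log_dir, expected_logs))

  # Any file in the root whose name starts with README, in any case.
  readmes <- list.files(root, pattern = "^README", ignore.case = TRUE)

  list(
    master_exists       = file.exists(file.path(root, "master.R")),
    session_info_exists = file.exists(file.path(root, "session_info.log")),
    only_readme_md      = identical(readmes, "README.md"),
    missing_logs        = expected_logs[!has_log]
  )
}

str(audit_package("."))
```

For a build/analyze package, the same function could be called with `script_dir = "analyze/scripts"` and `log_dir = "analyze/logs"`.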
A replication package is successful when a reader can unzip it, open the project root, run one command, and see exactly how the reported results were produced.
The package should satisfy these requirements:
- master.R.
- A README.md that explains the package, the workflow, the required software, and every figure/table output.
- A session_info.log file from a successful full run.

Use the compact structure for small or medium projects when:
Use the build/ and analyze/ structure for larger projects when:
When uncertain, choose the simpler structure unless the build stage creates real complexity for users.
Recommended for smaller packages. See templates/compact/ for a lightweight starter version.
replication_package/
|-- README.md
|-- master.R
|-- project.Rproj                 # optional but recommended
|-- session_info.log
|
|-- data/
|   `-- public input data
|
|-- documents/
|   |-- paper/
|   |-- questionnaires/
|   `-- other supporting documents
|
|-- scripts/
|   |-- 01_prepare_data.R
|   |-- 02_analyze_main_results.R
|   `-- 03_make_figures_tables.R
|
|-- functions/                    # optional helper functions
|
|-- figures/
|   `-- generated figures
|
|-- tables/
|   `-- generated tables
|
|-- output/
|   `-- intermediate reproducible objects
|
`-- logs/
    `-- one log per script
The compact structure should still include logs/. A functions/ folder is optional, but recommended when multiple scripts reuse the same helpers.
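If the template folder is not at hand, the compact layout can be created directly from R. This is a minimal sketch; `create_compact_skeleton()` is a hypothetical helper, and it creates only the empty folders from the tree above, not README.md, master.R, or any scripts.

```r
# Hypothetical helper: create the empty folders of the compact layout.
create_compact_skeleton <- function(root) {
  folders <- c(
    "data", "documents", "scripts", "functions",
    "figures", "tables", "output", "logs"
  )
  for (folder in folders) {
    # recursive = TRUE also creates `root` itself if it does not exist yet.
    dir.create(file.path(root, folder), recursive = TRUE, showWarnings = FALSE)
  }
  invisible(file.path(root, folders))
}

# Example: build the skeleton in a temporary location.
demo_root <- file.path(tempdir(), "replication_package")
create_compact_skeleton(demo_root)
list.files(demo_root)
```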
Recommended for larger packages. See templates/build-analyze/ for a lightweight starter version.
replication_package/
|-- README.md
|-- master.R
|-- project.Rproj                 # optional but recommended
|-- session_info.log
|
|-- build/
|   |-- data/
|   |   `-- raw or received inputs, when distributable
|   |-- documents/
|   |   `-- source documentation and data provenance files
|   |-- scripts/
|   |   `-- scripts that create analysis-ready data
|   |-- output/
|   |   |-- analysis_ready/
|   |   `-- other build outputs
|   `-- logs/                     # use if build scripts are public and runnable
|
`-- analyze/
    |-- scripts/
    |-- functions/
    |-- figures/
    |-- tables/
    |-- output/
    `-- logs/
The build/ stage constructs analysis-ready datasets. The analyze/ stage produces the manuscript and appendix results. The public replication workflow should normally run from build/output/analysis_ready/ into analyze/.
If the build stage depends on restricted data, do not force users to run it. Keep the build scripts for transparency, remove restricted inputs, include the analysis-ready public files when legally permitted, and explain the limitation in README.md.
The README is the user’s map. It should be complete enough that a reader can understand and verify the package without opening every script.
Every replication README should include:
- master.R;

Use exactly one README.md regardless of package size. For build/analyze packages, keep the documentation in one file and organize it with internal sections:
Commit only README.md. If an archive or journal requires HTML or PDF documentation, generate those files from README.md at release time and make clear that README.md remains the source.
The README must include a crosswalk that maps every reported figure and table to its output file, script, and log. Embedded previews are optional. Use them only when they make the package easier to inspect; they are not required when the crosswalk is complete.
For figures:
- .pdf or .eps;

Example:
| Paper item | Output | Script | Log | Notes |
|---|---|---|---|---|
| Figure 1 | No output file | No code | Not applicable | Conceptual figure. |
| Figure 2 | `figures/main_effect.pdf` | `scripts/02_analyze_main_results.R` | `logs/02_analyze_main_results.log` | Main treatment effect. |
For tables:
- .csv;
- .tex, .html, or .docx;

Example:
| Paper item | Output | Script | Log | Notes |
|---|---|---|---|---|
| Table 1 | `tables/main_results.csv` | `scripts/03_make_tables.R` | `logs/03_make_tables.log` | Main regression table. |
For compact projects, this can be one table. For large projects, separate manuscript and appendix items if that makes the README easier to scan:
## Replication Guide: Figures And Tables
### Manuscript
| Paper item | Output | Script | Log | Notes |
|---|---|---|---|---|
| Figure 1 | `figures/main_effect.pdf` | `scripts/02_analyze_main_results.R` | `logs/02_analyze_main_results.log` | Main result. |
| Table 1 | `tables/main_results.csv` | `scripts/03_make_tables.R` | `logs/03_make_tables.log` | Main regression table. |
### Appendix
| Paper item | Output | Script | Log | Notes |
|---|---|---|---|---|
| Figure A.1 | `figures/appendix_balance.pdf` | `scripts/04_appendix_checks.R` | `logs/04_appendix_checks.log` | Balance check. |
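A crosswalk in this form can be checked mechanically. The sketch below is one possible approach, not part of the templates: `check_crosswalk()` is a hypothetical helper that assumes the five-column layout shown above, a header whose first cell is "Paper item", and file paths that contain a forward slash. Entries such as "No output file" are skipped.

```r
# Hypothetical helper: confirm that files named in the crosswalk exist.
check_crosswalk <- function(readme = "README.md", root = dirname(readme)) {
  lines <- readLines(readme, warn = FALSE)
  # Keep Markdown table rows; drop separator rows such as |---|---|.
  rows <- lines[grepl("^\\|", lines) & !grepl("^\\|[ :]*-", lines)]
  cells <- lapply(strsplit(rows, "|", fixed = TRUE), function(x) trimws(x[-1]))
  # Keep data rows only (assumes the header's first cell is "Paper item").
  cells <- Filter(function(x) length(x) >= 4 && x[1] != "Paper item", cells)

  checks <- lapply(cells, function(x) {
    # Columns 2 to 4 are Output, Script, and Log.
    paths <- gsub("`", "", x[2:4], fixed = TRUE)
    paths <- paths[grepl("/", paths, fixed = TRUE)]
    data.frame(
      item = rep(x[1], length(paths)),
      file = paths,
      exists = file.exists(file.path(root, paths)),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, checks)
}
```

Calling `check_crosswalk("README.md")` from the project root returns one row per referenced file; rows with `exists == FALSE` point to crosswalk entries that need attention. If the README contains no crosswalk rows, the result is `NULL`.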
Use script names that make the execution order and purpose obvious:
00_list_inputs.R
01_prepare_data.R
02_estimate_main_results.R
03_make_figures.R
04_make_tables.R
05_robustness_checks.R
Rules:
- scripts/not_in_paper/ or scripts/archive/ and explain that they are not required.

Every public script should write a log file. Logs are not just debugging artifacts; they are part of the replication record.
Logs should include:
The log filename should match the script filename:
scripts/02_estimate_main_results.R
logs/02_estimate_main_results.log
For a build/analyze package:
analyze/scripts/02_estimate_main_results.R
analyze/logs/02_estimate_main_results.log
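The naming rule can be written as a small function. This is a sketch under the assumption that logs/ always sits beside scripts/, as in both layouts above; `expected_log_path()` is a hypothetical name.

```r
# Hypothetical helper: map a script path to its matching log path.
expected_log_path <- function(script_path) {
  # logs/ is assumed to be a sibling of the folder that holds the script.
  log_dir <- file.path(dirname(dirname(script_path)), "logs")
  log_name <- paste0(tools::file_path_sans_ext(basename(script_path)), ".log")
  # Drop the leading "./" that appears for top-level folders.
  sub("^\\./", "", file.path(log_dir, log_name))
}

expected_log_path("scripts/02_estimate_main_results.R")
# "logs/02_estimate_main_results.log"
expected_log_path("analyze/scripts/02_estimate_main_results.R")
# "analyze/logs/02_estimate_main_results.log"
```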
Place a logging helper in functions/logging.R for compact packages or analyze/functions/logging.R for build/analyze packages.
start_script_log <- function(script_name, log_dir = "logs") {
  dir.create(log_dir, recursive = TRUE, showWarnings = FALSE)
  log_file <- file.path(log_dir, paste0(script_name, ".log"))
  sink(log_file, split = TRUE)
  cat("############################################################\n")
  cat("Script:", paste0(script_name, ".R"), "\n")
  cat("Started:", format(Sys.time(), "%Y-%m-%d %H:%M:%S %Z"), "\n")
  cat("############################################################\n\n")
  invisible(log_file)
}

end_script_log <- function() {
  cat("\n############################################################\n")
  cat("Ended:", format(Sys.time(), "%Y-%m-%d %H:%M:%S %Z"), "\n")
  cat("############################################################\n")
  cat("\n--- warnings() at end of script ---\n")
  w <- warnings()
  if (is.null(w) || length(w) == 0) {
    cat("None\n")
  } else {
    print(w)
  }
  while (sink.number() > 0) sink()
}
Use this pattern in each public script:
source("functions/logging.R")
start_script_log("02_estimate_main_results")
tryCatch({
  # Script body goes here.
}, error = function(e) {
  cat("\nERROR:", conditionMessage(e), "\n")
  stop(e)
}, finally = {
  end_script_log()
})
For build/analyze packages, adjust paths:
source("analyze/functions/logging.R")
start_script_log("02_estimate_main_results", log_dir = "analyze/logs")
tryCatch({
  # Script body goes here.
}, error = function(e) {
  cat("\nERROR:", conditionMessage(e), "\n")
  stop(e)
}, finally = {
  end_script_log()
})
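What goes inside the script body matters as much as the wrapper. The sketch below shows one way to write row counts, sample sizes, and estimates to the active log with labels. It also records warnings at the moment they occur, because `warnings()` reports what R stored after the last completed top-level call, so warnings raised while a script is still running under `source()` may not appear in an end-of-script listing. Both `log_value()` and `with_logged_warnings()` are hypothetical helpers, and `mtcars` stands in for project data.

```r
# Hypothetical helper: print a labelled value so it lands in the active log.
log_value <- function(label, value) {
  cat(sprintf("%-28s %s\n", paste0(label, ":"), format(value)))
  invisible(value)
}

# Hypothetical wrapper: write each warning to the log as it occurs.
with_logged_warnings <- function(expr) {
  withCallingHandlers(
    expr,
    warning = function(w) {
      cat("WARNING:", conditionMessage(w), "\n")
      invokeRestart("muffleWarning")
    }
  )
}

# Example script body, using a built-in dataset as a stand-in.
with_logged_warnings({
  dat <- mtcars
  log_value("Rows in analysis data", nrow(dat))
  fit <- lm(mpg ~ wt, data = dat)
  log_value("N used in model", nobs(fit))
  log_value("Coefficient on wt", round(coef(fit)[["wt"]], 3))
  invisible(as.numeric("not a number"))  # deliberately triggers a warning
})
```

Placed inside the `tryCatch()` pattern, these calls write to the same log file that `start_script_log()` opened, because `cat()` output is diverted by the active sink.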
Every package should include master.R. It is the reproducibility entry point and should run the full public replication path from a clean R session.
The master script should:
- session_info.log;

Suggested skeleton:
# Master file
# Run from the project root after restarting R.
start_time <- Sys.time()
stopifnot(sink.number() == 0)
safe_source <- function(file) {
  if (!file.exists(file)) stop("File not found: ", file)
  cat("\n============================================================\n")
  cat("Running:", file, "\n")
  cat("============================================================\n")
  source(file, echo = FALSE, print.eval = FALSE)
}
update_readme_environment <- function(readme = "README.md") {
  if (!file.exists(readme)) return(invisible(FALSE))

  session <- sessionInfo()
  r_version <- paste("R version", paste(R.version$major, R.version$minor, sep = "."))
  operating_system <- session$running
  if (length(operating_system) == 0 || is.na(operating_system[1]) || !nzchar(operating_system[1])) {
    sys <- Sys.info()
    operating_system <- paste(sys[["sysname"]], sys[["release"]])
  }

  environment_block <- c(
    "### Computing Environment",
    "",
    paste("Software:", r_version),
    paste("Platform:", session$platform),
    paste("Computer Operating System:", operating_system)
  )

  lines <- readLines(readme, warn = FALSE)
  existing <- which(grepl("^#{1,6} Computing Environment$", lines))

  trim_blank_edges <- function(x) {
    if (length(x) == 0) return(x)
    nonblank <- which(nzchar(x))
    if (length(nonblank) == 0) return(character())
    x[seq.int(min(nonblank), max(nonblank))]
  }

  if (length(existing) > 0) {
    start <- existing[1]
    following_heading <- which(seq_along(lines) > start & grepl("^#{1,6} ", lines))
    end <- if (length(following_heading) > 0) following_heading[1] - 1 else length(lines)
    existing_block <- if (start < end) lines[(start + 1):end] else character()
    extra_environment_lines <- existing_block[
      !grepl("^(Software:|Platform:|Computer Operating System:)", existing_block)
    ]
    extra_environment_lines <- trim_blank_edges(extra_environment_lines)
    if (length(extra_environment_lines) > 0) {
      environment_block <- c(environment_block, "", extra_environment_lines)
    }
    environment_block <- c(environment_block, "")
    before <- if (start > 1) lines[seq_len(start - 1)] else character()
    after <- if (end < length(lines)) lines[(end + 1):length(lines)] else character()
    lines <- c(before, environment_block, after)
  } else {
    environment_block <- c(environment_block, "")
    session_heading <- which(grepl("^## .*Session Information$", lines))
    if (length(session_heading) > 0) {
      following_section <- which(seq_along(lines) > session_heading[1] & grepl("^## ", lines))
      insert_before <- if (length(following_section) > 0) following_section[1] else length(lines) + 1
      before <- if (insert_before > 1) lines[seq_len(insert_before - 1)] else character()
      after <- if (insert_before <= length(lines)) lines[insert_before:length(lines)] else character()
      if (length(before) > 0 && nzchar(tail(before, 1))) before <- c(before, "")
      lines <- c(before, environment_block, after)
    } else {
      if (length(lines) > 0 && nzchar(tail(lines, 1))) lines <- c(lines, "")
      lines <- c(
        lines,
        "## Session Information",
        "",
        "The file `session_info.log` records the R version, platform, loaded packages, and runtime from a successful full run.",
        "",
        environment_block
      )
    }
  }

  writeLines(lines, readme, useBytes = TRUE)
  invisible(TRUE)
}
scripts <- c(
  "scripts/01_prepare_data.R",
  "scripts/02_estimate_main_results.R",
  "scripts/03_make_figures.R",
  "scripts/04_make_tables.R"
)

for (script in scripts) {
  safe_source(script)
}
end_time <- Sys.time()
while (sink.number() > 0) sink()
sink("session_info.log", split = FALSE)
cat("Run Time\n")
cat("Started: ", format(start_time, "%Y-%m-%d %H:%M:%S"), "\n", sep = "")
cat("Ended: ", format(end_time, "%Y-%m-%d %H:%M:%S"), "\n", sep = "")
cat("Elapsed: ", format(end_time - start_time), "\n\n", sep = "")
cat("Session Information\n")
print(sessionInfo())
sink()
update_readme_environment("README.md")
For large packages, master.R should normally run the public analysis path only:
scripts <- c(
  "analyze/scripts/00_list_inputs.R",
  "analyze/scripts/01_estimate_main_results.R",
  "analyze/scripts/02_make_figures.R",
  "analyze/scripts/03_make_tables.R"
)
If a small public subset of the build stage can be rebuilt, master.R may check for missing required files and rebuild only those public inputs. Do not require restricted data in the public run.
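One way to implement that check is sketched below. `ensure_public_inputs()` is a hypothetical helper, and the file and script names in the usage comment are placeholders rather than names from the templates. It takes a named character vector that maps each required input file to the public build script that can recreate it, and it stops with a clear message when an input is missing and cannot be rebuilt.

```r
# Hypothetical helper: rebuild only the public inputs that are missing.
# `rebuild_map` is a named character vector: names are required input
# files, values are the public build scripts that create them.
ensure_public_inputs <- function(rebuild_map) {
  for (input in names(rebuild_map)) {
    if (file.exists(input)) next
    builder <- rebuild_map[[input]]
    if (!file.exists(builder)) {
      stop("Missing input and no public build script: ", input, call. = FALSE)
    }
    cat("Rebuilding missing input:", input, "\n")
    source(builder, echo = FALSE)
    if (!file.exists(input)) {
      stop("Build script did not create: ", input, call. = FALSE)
    }
  }
  invisible(TRUE)
}

# Usage in master.R, with placeholder names:
# ensure_public_inputs(c(
#   "build/output/analysis_ready/analysis_data.csv" =
#     "build/scripts/01_build_public_subset.R"
# ))
```

Inputs that depend on restricted sources should not appear in the map; they belong in the restricted-data documentation instead.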
Use clear data layers:
- data/: public raw or received data for compact packages.
- build/data/: raw or received data for large packages.
- build/output/analysis_ready/: authoritative analysis inputs for large public replication packages.
- output/ or analyze/output/: reproducible intermediate objects.
- figures/ and tables/: final generated results.

Data rules:
- output/.
- documents/.
- Call set.seed() before simulations, bootstraps, random splits, random forests, MCMC starts, or any stochastic procedure.

If any source cannot be redistributed, include a restricted-data section in README.md.
It should explain:
The public package should be designed so that users can reproduce published results without restricted access whenever legally and ethically possible.
When manuscript source files are available, treat them as part of the working context for package preparation. This is especially useful for Overleaf projects synced through Dropbox, because the paper source files and the R replication package can be inspected together.
The final consistency pass should check:
If the paper source files are not included in the public replication archive, state in README.md whether they were used during preparation for consistency checks.
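Part of the consistency pass can be expressed as a comparison of named values. In this sketch, `compare_reported()` is a hypothetical helper and the numbers in the example are made up for illustration. Values are compared after formatting both sides to the number of decimal places used in the paper, so a reported 0.12 matches a replicated 0.1234.

```r
# Hypothetical helper: compare paper values with replication values.
# Both arguments are named numeric vectors; names identify the item.
compare_reported <- function(paper, replication, digits = 2) {
  items <- union(names(paper), names(replication))
  paper_value <- unname(paper[items])
  replication_value <- unname(replication[items])
  # Compare as fixed-decimal strings to avoid floating-point surprises.
  same <- formatC(paper_value, format = "f", digits = digits) ==
    formatC(replication_value, format = "f", digits = digits)
  data.frame(
    item = items,
    paper = paper_value,
    replication = replication_value,
    match = same & !is.na(paper_value) & !is.na(replication_value),
    stringsAsFactors = FALSE
  )
}

# Illustrative (made-up) values: the standard error does not match.
paper <- c(ate = 0.12, se = 0.05, n = 1500)
replication <- c(ate = 0.1234, se = 0.061, n = 1500)
compare_reported(paper, replication)
```

Items that appear on only one side are reported with `NA` and `match = FALSE`, which flags claims that cannot yet be traced.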
At minimum, include a short computing environment summary in README.md and a session_info.log file from a successful full run. The template master.R files automatically refresh the software, platform, and operating-system lines after writing session_info.log.
Suggested README format:
### Computing Environment
Software: R version [version]
Platform: [R platform]
Computer Operating System: [operating system and version]
Additional details: [RAM, processor/GPU, external tools, or other project-specific requirements when relevant.]
The values should come from the run that produced session_info.log. For example, use the R version, Platform, and Running under lines printed by sessionInfo().
Some projects should report additional computing details when they affect reproducibility or runtime, such as RAM, CPU/GPU, external command-line tools, licensed software, high-performance-computing settings, or non-R language versions.
For stronger reproducibility, also consider:
- renv.lock for R package versions;

Do not assume the user has the same local folder structure. Avoid setwd() to personal paths.
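A quick scan can surface the most common path problems before release. The sketch below is a heuristic, not a complete detector: `find_absolute_paths()` is a hypothetical helper that flags quoted strings beginning with `/`, `~/`, or a Windows drive letter, plus any call to `setwd()`, so that a person can review each hit.

```r
# Hypothetical helper: flag lines that look like absolute paths or setwd().
find_absolute_paths <- function(script_dir = "scripts") {
  files <- list.files(script_dir, pattern = "\\.[Rr]$",
                      recursive = TRUE, full.names = TRUE)
  # Quoted string starting with "/x", "~/", or "C:/" (or "C:\"), or setwd(.
  pattern <- "[\"'](/[A-Za-z]|~/|[A-Za-z]:[/\\\\])|setwd\\("
  hits <- lapply(files, function(f) {
    lines <- readLines(f, warn = FALSE)
    idx <- grep(pattern, lines)
    data.frame(
      file = rep(f, length(idx)),
      line = idx,
      code = trimws(lines[idx]),
      stringsAsFactors = FALSE
    )
  })
  out <- do.call(rbind, hits)
  if (is.null(out)) {
    out <- data.frame(file = character(), line = integer(), code = character())
  }
  out
}
```

An empty result does not prove that the package is free of machine-specific paths; paths assembled with `paste()` or `file.path()` from absolute components, for example, would not be caught.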
Before releasing a replication package, verify:
- source("master.R") runs from a fresh R session.
- session_info.log exists and comes from a successful full run.

When an agent prepares a package, it should follow this sequence:
- master.R.
- session_info.log.

For a compact package:
README.md
master.R
session_info.log
data/
documents/
scripts/
functions/
figures/
tables/
output/
logs/
For a large package:
README.md
master.R
session_info.log
build/
analyze/
The final archive should feel boring in the best way: obvious structure, one command to run, traceable outputs, and no surprises.