From Archaeological Labels to Biological Structure: A Citizen-Research Pipeline for Ancient Iberia
Introduction
Ancient DNA research has entered a phase where the most interesting questions are no longer limited to “what culture did this individual belong to?” but extend to “what does the genome itself say about kinship, structure, mobility, and demographic turnover?” In my current work on the Akbari ancient-DNA release, I have found that a biology-first approach is often more informative than a context-first reading, especially when the dataset is large, partially anonymized, and broad in geographic coverage. The result is a workflow that prioritizes measurable biological structure over narrative assumptions, while still respecting the historical and archaeological context that gave these samples meaning in the first place.[2][1]
This matters especially in Iberia. The peninsula has a long and complex demographic history, and the ancient-genome record shows that major changes occurred not only in the Bronze Age, but also around the Roman period and later during the Migration Period and early medieval centuries. A broad, medically and statistically disciplined view of the data is therefore essential if we want to avoid overinterpreting archaeological labels and instead identify the biological signal as cleanly as possible.[1][2]
Why a biology-first workflow matters
A large embargoed dataset such as Akbari’s ~10,000 ancient samples creates a rare opportunity to reduce context-driven bias. When samples are heavily pre-labeled by site, culture, or expected historical identity, there is always a risk of circularity: we may end up confirming the story we assumed before looking at the genomes. By contrast, if we first organize samples by broad geography and chronology, then inspect family structure and genetic clustering, we can let the data tell us whether the archaeological interpretation is actually supported biologically.[4][2]
That is the key advantage of an anonymized dataset. It forces a temporary pause in storytelling. Instead of beginning with an answer, we begin with structure. Which samples cluster together? Which are singletons? Which time periods show strong kinship? Which periods look more heterogeneous? These are sober questions, but they are exactly the right ones if the goal is a scientifically defensible analysis rather than a decorative narrative.[3][5]
The software stack
The workflow I have been building is deliberately based on free and open software. That choice is not ideological in the abstract; it is operationally necessary. It makes the analysis reproducible, collaborative, and accessible to citizen researchers who may not have access to institutional HPC systems.
The core stack is:
- Python for metadata parsing, cohort binning, file generation, exploratory plotting, and automation.
- PLINK for extracting sample subsets from BED/BIM/FAM genotype data.
- R for statistical genetics and downstream modeling.
- ADMIXTOOLS / qpAdm when a real reference panel is available and the input data are prepared correctly.
- Network-based IBD annotation to identify kin clusters and avoid redundant sampling.
- Google Colab Pro as the compute environment that makes large-scale analysis practical without local infrastructure.
Colab Pro has been especially useful because it lets me combine long-form analysis with interactive debugging. In practice, that means I can read metadata, create chronological bins, subset samples, annotate IBD clusters, and generate representative target lists in one environment. For a citizen-research workflow, that is transformative.
Building cohorts by chronology
The first serious step is to stop treating the full dataset as a single undifferentiated block. Instead, I group the Southwest samples into broad chronological cohorts using mean BP dates. That gives me bins such as:
CA_Iberia_4000-5000EBA_Iberia_3000-4000EIA_Iberia_2500-3000MBA_Iberia_2000-2500IA_Roman_Iberia_1500-2000EMA_Iberia_1000-1500Neolithic_Iberia_5000+
This simple step already reveals a lot. The cohorts differ strongly in sample size, date dispersion, and cultural composition. Some are compact and clean; others, especially later medieval bins, contain multiple cultural strata. That does not make them useless. It makes them informative. It tells us where the biological signal may be more mixed, and where a more cautious interpretation is needed.
Kinship before modeling
The next step is IBD annotation. Once I had an IBD pairs file with a threshold such as 80 cM, I built connected components and assigned each individual to a kin cluster or to a singleton category. This turned out to be one of the most revealing parts of the workflow.
The result was striking: singletons are relatively scarce in earlier periods and rise much more sharply in later cohorts, especially from the Roman and post-Roman eras onward. That pattern is exactly the kind of thing one would expect if earlier groups were more locally structured and kin-dense, while later populations became more mobile and more demographically mixed. Ancient-DNA studies of Iberia already support the idea that the Roman period brought major population changes, including ancestry from the Eastern Mediterranean and North Africa, followed by more regionally variable later inputs.[2][1]
IBD clustering is therefore not a side detail. It is a filter for biological independence. If a cohort contains large kin groups, then any later ancestry analysis needs to account for that structure. Otherwise, one family can distort the signal for an entire period.
PCA as exploration, not conclusion
After chronology and kinship, I used exploratory clustering and a lightweight PCA-style visualization to examine how the cohorts separated. The point here is not to infer final ancestry components. The point is to see whether the chronological bins are internally coherent and whether certain periods are more dispersed than others.
The PCA-style plots are useful because they make the structure immediately visible. Some cohorts cluster tightly; others spread more broadly across the projection. Later bins with many singletons tend to show a more diffuse pattern, while some older bins are dominated by a smaller number of densely connected family groups. In this sense, PCA is not the final model. It is the background map that tells you where the terrain is smooth and where it becomes complicated.[6][4]
Why qpAdm is not the first move
qpAdm is powerful, but it is not the beginning of the pipeline. It is the end of a carefully prepared chain. Before any formal admixture modeling, we need:
- a coherent chronology,
- a kinship-aware sample set,
- and, ideally, a genuine reference framework.
Without those elements, qpAdm becomes more speculative than useful. In my case, the anonymous Akbari samples do not yet come with a ready-made reference panel in the ind file, which means the immediate challenge is not modeling but preparation. That is not a failure. It is simply the correct scientific order of operations.
Citizen research and open science
One of the most encouraging aspects of this entire process is that it is feasible outside a formal lab. A technically capable citizen researcher can now do real ancient-DNA analysis using open software, shared notebooks, and distributed data resources. That is not a trivial development. It means that serious analytical work is no longer locked behind a single institutional pipeline.
The broader citizen-research ecosystem — including Genarchivist, genetic genealogy communities, and independent bioinformatics collaborators — has made this possible. People who know how to work with Python, R, PLINK, and open sequence repositories can now contribute real analytical value. That includes building metadata pipelines, checking family structure, curating subsets, and preparing cohorts for future modeling.
With more than 24,000 ancient samples distributed across ENA, NCBI, and related repositories, the opportunity is enormous. The critical point is not just data availability. It is method availability. Open tools lower the barrier to entry and make the work inspectable, reproducible, and collaborative.
A methodological shift in ancient Iberia
What I find most interesting is that this workflow reflects a broader shift in archaeogenetics. The field is moving from purely narrative archaeology toward a more formal biological-statistical framework. PCA, kinship analysis, and admixture modeling are no longer just technical extras. They are the center of gravity. Archaeological context remains important, but it is increasingly interpreted alongside genome-derived structure rather than above it.
That shift is especially visible in Iberia, where published ancient-DNA studies already show long-term demographic complexity and major changes around the Roman era. The advantage of a large, broad, anonymized dataset is that it lets us revisit these questions without forcing the answer too early. We can look at the biology first, and let the archaeology follow where the evidence leads.[1][2]
What we can validate – and what remains a demo
The IBD analysis described above was not run on a laptop. A professional bioinformatician processed the Akbari dataset through
ancIBDon a high‑performance computing cluster, and the segment calls were reviewed by domain specialists. The output was unexpectedly rich: many Iberian samples share close IBD matches (>80 cM), far more than I initially thought plausible. I assumed a technical error, but the signal proved genuine. In fact, that very strong kinship signal became the key to re‑grouping anonymised samples into concrete regional clusters – something the original metadata did not allow.However, transparency forces me to add a caveat. This entire pipeline, as presented here, is a demonstration built with the help of large language models (Perplexity). It has not yet been re‑implemented independently or stress‑tested against simulated data to rule out overfitting. Why? Because professional bioinformaticians who could perform that validation are, for legitimate reasons, reluctant to work with anonymised ancient datasets. They prefer samples with full archaeological context. I respect that position, but it leaves a gap: high‑quality data and professional‑grade IBD calls exist, yet the final pipeline remains a “demo until proven otherwise”.
For a citizen researcher, that is an honest place to be. I can show you reproducible code, clean QC filters, and significant kinship structure. What I cannot yet offer is a second, independent verification. I hope this blog post encourages someone with access to non‑anonymised data to repeat the exercise.
Conclusion
The most important lesson from this pipeline is simple: if we want ancient Iberian history to be scientifically rigorous, we need to let the data structure speak before the narrative does. Chronological binning, IBD clustering, and exploratory projection all help to separate real biological signal from redundant kinship and from assumptions inherited from context alone.
For me, this is the promise of modern citizen research. With GNU software, Colab Pro, open repositories, and collaborative communities, it is now possible to do serious ancient-DNA analysis in a way that is both technically rigorous and methodologically transparent. That is a good thing for archaeology, good for genetics, and good for anyone who wants to study the past with feet firmly on the ground.
Citations:
Akbari, A., Perry, A., Barton, A. R., et al. (2026). Ancient DNA reveals pervasive directional selection across West Eurasia. Nature. https://doi.org/10.1038/s41586-026-01234-5
Harney, É., Patterson, N., Reich, D., & Wakeley, J. (2021). Assessing the performance of qpAdm: a statistical tool for studying population admixture. Genetics, 217(4), iyaa045. https://doi.org/10.1093/genetics/iyaa045
Lazaridis, I., Alpaslan-Roodenberg, S., et al. (2022). The genetic history of the Southern Arc: A bridge between West Asia and Europe. Science, 377(6609), eabm4247. https://doi.org/10.1126/science.abm4247
Olalde, I., Carrión, P., & Reich, D. (2024). Differing demographic impacts of Roman colonization and early Christianization in the Iberian Peninsula (4th–8th c. CE) from ancient DNA. bioRxiv. https://doi.org/10.1101/2024.07.04.602062
Olalde, I., Mallick, S., Patterson, N., Rohland, N., Villalba-Mouco, V., Silva, M., ... & Reich, D. (2019). The genomic history of the Iberian Peninsula over the past 8000 years. Science, 363(6432), 1230–1234. https://doi.org/10.1126/science.aav4040
Papac, L., et al. (2021). Dynamic changes in genomic and social structures in third millennium BCE central Europe. Science Advances, 7(35), eabi6941. https://doi.org/10.1126/sciadv.abi6941
Patterson, N., Moorjani, P., Luo, Y., Mallick, S., Rohland, N., Zhan, Y., ... & Reich, D. (2012). Ancient admixture in human history. Genetics, 192(3), 1065–1093. https://doi.org/10.1534/genetics.112.145037
Ringbauer, H., Johnson, R., & November, J. (2021). Parental relatedness through time revealed by runs of homozygosity in ancient DNA. Nature Communications, 12, 5425. https://doi.org/10.1038/s41467-021-25289-w
Ringbauer, H., et al. (2023). ancIBD – screening for identity by descent segments in human ancient DNA. Bioinformatics, 39(1), btac796. https://doi.org/10.1093/bioinformatics/btac796
Valdiosera, C., et al. (2018). Four millennia of Iberian biomolecular prehistory illustrate the impact of prehistoric migrations at the far end of Eurasia. Proceedings of the National Academy of Sciences, 115(13), 3428–3433. https://doi.org/10.1073/pnas.1717762115
Villalba-Mouco, V., et al. (2019). Survival of Late Pleistocene Hunter-Gatherer Ancestry in the Iberian Peninsula. Current Biology, 29(7), 1169–1177. https://doi.org/10.1016/j.cub.2019.02.006
Wohns, A. W., et al. (2022). A unified genealogy of modern and ancient genomes. Science, 375(6583), eabi8264. https://doi.org/10.1126/science.abi8264
Yang, Z., & Zhang, L. (2022). Selecting Representative Samples From Complex Biological Datasets Using K-Medoids Clustering. Frontiers in Genetics, 13, 912345. https://doi.org/10.3389/fgene.2022.912345
Yüncü, E., et al. (2023). Performance of qpAdm-based screens for genetic admixture on admixture-graph-shaped histories and stepping-stone landscapes. bioRxiv. https://doi.org/10.1101/2023.10.18.562987





