Sunday, 17 May 2026

 




From Archaeological Labels to Biological Structure: A Citizen-Research Pipeline for Ancient Iberia

Introduction

Ancient DNA research has entered a phase where the most interesting questions are no longer limited to “what culture did this individual belong to?” but extend to “what does the genome itself say about kinship, structure, mobility, and demographic turnover?” In my current work on the Akbari ancient-DNA release, I have found that a biology-first approach is often more informative than a context-first reading, especially when the dataset is large, partially anonymized, and broad in geographic coverage. The result is a workflow that prioritizes measurable biological structure over narrative assumptions, while still respecting the historical and archaeological context that gave these samples meaning in the first place.[2][1]

This matters especially in Iberia. The peninsula has a long and complex demographic history, and the ancient-genome record shows that major changes occurred not only in the Bronze Age, but also around the Roman period and later during the Migration Period and early medieval centuries. A broad, methodologically and statistically disciplined view of the data is therefore essential if we want to avoid overinterpreting archaeological labels and instead identify the biological signal as cleanly as possible.[1][2]

Why a biology-first workflow matters

A large embargoed dataset such as Akbari’s ~10,000 ancient samples creates a rare opportunity to reduce context-driven bias. When samples are heavily pre-labeled by site, culture, or expected historical identity, there is always a risk of circularity: we may end up confirming the story we assumed before looking at the genomes. By contrast, if we first organize samples by broad geography and chronology, then inspect family structure and genetic clustering, we can let the data tell us whether the archaeological interpretation is actually supported biologically.[4][2]

That is the key advantage of an anonymized dataset. It forces a temporary pause in storytelling. Instead of beginning with an answer, we begin with structure. Which samples cluster together? Which are singletons? Which time periods show strong kinship? Which periods look more heterogeneous? These are sober questions, but they are exactly the right ones if the goal is a scientifically defensible analysis rather than a decorative narrative.[3][5]

The software stack

The workflow I have been building is deliberately based on free and open software. That choice is not ideological in the abstract; it is operationally necessary. It makes the analysis reproducible, collaborative, and accessible to citizen researchers who may not have access to institutional HPC systems.

The core stack is:

  • Python for metadata parsing, cohort binning, file generation, exploratory plotting, and automation.
  • PLINK for extracting sample subsets from BED/BIM/FAM genotype data.
  • R for statistical genetics and downstream modeling.
  • ADMIXTOOLS / qpAdm when a real reference panel is available and the input data are prepared correctly.
  • Network-based IBD annotation to identify kin clusters and avoid redundant sampling.
  • Google Colab Pro as the compute environment that makes large-scale analysis practical without local infrastructure.

Colab Pro has been especially useful because it lets me combine long-form analysis with interactive debugging. In practice, that means I can read metadata, create chronological bins, subset samples, annotate IBD clusters, and generate representative target lists in one environment. For a citizen-research workflow, that is transformative.
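As a minimal sketch of the PLINK subsetting step in that stack, the helper below writes a keep-list from a .fam file and assembles the command line without running it. The function name, file paths and prefixes are hypothetical placeholders; `--bfile`, `--keep`, `--make-bed` and `--out` are standard PLINK 1.9 options, and a keep-list holds one family ID and one individual ID per line.

```python
from pathlib import Path


def build_plink_subset(fam_path, wanted_ids, keep_path, in_prefix, out_prefix):
    """Write a PLINK keep-list and return the command that would apply it.

    All paths and prefixes are illustrative placeholders, not the real files.
    """
    wanted = set(wanted_ids)
    keep_lines = []
    for line in Path(fam_path).read_text().splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        fid, iid = fields[0], fields[1]  # .fam columns 1 and 2
        if iid in wanted:
            keep_lines.append(f"{fid}\t{iid}")
    Path(keep_path).write_text("\n".join(keep_lines) + "\n")
    # The command is returned as a list so it can be inspected before running.
    return [
        "plink",
        "--bfile", str(in_prefix),
        "--keep", str(keep_path),
        "--make-bed",
        "--out", str(out_prefix),
    ]
```

In a Colab cell where PLINK is installed, the returned list could be passed to `subprocess.run(cmd, check=True)`; returning it first keeps the step easy to log and review.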

Building cohorts by chronology

The first serious step is to stop treating the full dataset as a single undifferentiated block. Instead, I group the Southwest samples into broad chronological cohorts using each sample's mean date in years before present (BP). That gives me bins such as:

  • CA_Iberia_4000-5000
  • EBA_Iberia_3000-4000
  • EIA_Iberia_2500-3000
  • MBA_Iberia_2000-2500
  • IA_Roman_Iberia_1500-2000
  • EMA_Iberia_1000-1500
  • Neolithic_Iberia_5000+

This simple step already reveals a lot. The cohorts differ strongly in sample size, date dispersion, and cultural composition. Some are compact and clean; others, especially the later bins, contain multiple cultural strata. That does not make them useless. It makes them informative. It tells us where the biological signal may be more mixed, and where a more cautious interpretation is needed.
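The binning itself can be sketched in a few lines of plain Python. The labels and edges are copied from the list above; treating each bin as a half-open interval (lower edge included, upper edge excluded) and leaving samples younger than 1000 BP unassigned are my assumptions, and the function names are hypothetical.

```python
from bisect import bisect_right

# Lower edges in mean years BP, paired with the cohort labels listed above.
BIN_EDGES = [1000, 1500, 2000, 2500, 3000, 4000, 5000]
BIN_LABELS = [
    "EMA_Iberia_1000-1500",
    "IA_Roman_Iberia_1500-2000",
    "MBA_Iberia_2000-2500",
    "EIA_Iberia_2500-3000",
    "EBA_Iberia_3000-4000",
    "CA_Iberia_4000-5000",
    "Neolithic_Iberia_5000+",
]


def assign_cohort(mean_bp):
    """Return the cohort label for a mean BP date, or None if below 1000 BP."""
    idx = bisect_right(BIN_EDGES, mean_bp) - 1
    return BIN_LABELS[idx] if idx >= 0 else None


def build_cohorts(mean_bp_by_sample):
    """Group sample IDs by cohort label; unassigned samples are left out."""
    cohorts = {}
    for sample, mean_bp in sorted(mean_bp_by_sample.items()):
        label = assign_cohort(mean_bp)
        if label is not None:
            cohorts.setdefault(label, []).append(sample)
    return cohorts
```

Keeping the edges in one sorted list makes it easy to audit where a boundary sample lands before any downstream step depends on it.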


 

Kinship before modeling

The next step is IBD annotation. Once I had an IBD pairs file with a threshold such as 80 cM, I built connected components and assigned each individual to a kin cluster or to a singleton category. This turned out to be one of the most revealing parts of the workflow.
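One way to build those connected components is a small union-find over the IBD pairs. This is a sketch under stated assumptions: the `(id_a, id_b, shared_cm)` tuple layout, the `kin_001` naming scheme and the choice to link pairs at or above the threshold are mine, not a description of the actual pairs file.

```python
from collections import defaultdict


def kin_clusters(ibd_pairs, all_ids, min_cm=80.0):
    """Label each individual with a kin-cluster ID or "singleton".

    ibd_pairs: iterable of (id_a, id_b, shared_cm) tuples (assumed layout).
    all_ids:   every individual in the cohort, so unlinked ones are kept.
    """
    parent = {i: i for i in all_ids}

    def find(x):
        # Follow parent links to the root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, cm in ibd_pairs:
        if cm >= min_cm:
            parent.setdefault(a, a)
            parent.setdefault(b, b)
            parent[find(a)] = find(b)  # merge the two components

    members = defaultdict(list)
    for i in parent:
        members[find(i)].append(i)

    labels = {}
    cluster_no = 0
    # Number clusters in a stable order (by their smallest member ID).
    for root in sorted(members, key=lambda r: min(members[r])):
        group = members[root]
        if len(group) == 1:
            labels[group[0]] = "singleton"
        else:
            cluster_no += 1
            for i in group:
                labels[i] = f"kin_{cluster_no:03d}"
    return labels
```

Passing the full ID list matters: an individual who never appears in the pairs file is still a sample, and should come out as a singleton rather than vanish.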

The result was striking: singletons are relatively scarce in the earlier cohorts and become markedly more common in the later ones, especially from the Roman and post-Roman eras onward. That is the pattern one would expect if earlier groups were more locally structured and kin-dense, while later populations became more mobile and more demographically mixed. Ancient-DNA studies of Iberia already support the idea that the Roman period brought major population changes, including ancestry from the Eastern Mediterranean and North Africa, followed by more regionally variable later inputs.[2][1]
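The comparison across periods comes down to one number per cohort: the share of its individuals with no close IBD link. A hedged sketch, assuming two hypothetical dictionaries that map sample IDs to cohort labels and to kin labels (with the string "singleton" marking unlinked individuals):

```python
from collections import Counter


def singleton_share(cohort_of, label_of):
    """Return, per cohort, the fraction of samples labelled "singleton".

    Samples missing from label_of are treated as singletons (an assumption).
    """
    totals = Counter()
    singles = Counter()
    for sample, cohort in cohort_of.items():
        totals[cohort] += 1
        if label_of.get(sample, "singleton") == "singleton":
            singles[cohort] += 1
    return {cohort: singles[cohort] / totals[cohort] for cohort in totals}
```

Because the share is sensitive to how densely each period was sampled, it is best read next to the raw cohort sizes rather than on its own.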

IBD clustering is therefore not a side detail. It is a filter for biological independence. If a cohort contains large kin groups, then any later ancestry analysis needs to account for that structure. Otherwise, one family can distort the signal for an entire period.
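A simple way to act on this is to keep one representative per kin cluster, plus every singleton, before any ancestry modeling. The sketch below picks the member with the most covered SNPs; that selection rule, the `snps_covered` dictionary and the function name are illustrative assumptions, not the rule used in the pipeline.

```python
def pick_representatives(label_of, snps_covered):
    """Return one sample per kin cluster plus all singletons, sorted by ID.

    label_of:     sample ID -> kin-cluster label or "singleton".
    snps_covered: sample ID -> number of covered SNPs (hypothetical metric).
    """
    best = {}
    for sample, label in label_of.items():
        # Singletons are keyed by their own ID so each one is kept.
        key = sample if label == "singleton" else label
        candidate = (snps_covered.get(sample, 0), sample)
        if key not in best or candidate > best[key]:
            best[key] = candidate
    return sorted(sample for _, sample in best.values())
```

Ties on coverage fall back to the sample ID, so the output is deterministic from one run to the next.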

PCA as exploration, not conclusion

After chronology and kinship, I used exploratory clustering and a lightweight PCA-style visualization to examine how the cohorts separated. The point here is not to infer final ancestry components. The point is to see whether the chronological bins are internally coherent and whether certain periods are more dispersed than others.

The PCA-style plots are useful because they make the structure immediately visible. Some cohorts cluster tightly; others spread more broadly across the projection. Later bins with many singletons tend to show a more diffuse pattern, while some older bins are dominated by a smaller number of densely connected family groups. In this sense, PCA is not the final model. It is the background map that tells you where the terrain is smooth and where it becomes complicated.[6][4]
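"Tight" versus "diffuse" can also be put into a number: the mean distance of each cohort's samples from that cohort's centroid in the projected space. This is a minimal sketch that assumes the PC coordinates already exist; the input dictionaries and the function name are hypothetical.

```python
from math import dist
from statistics import mean


def cohort_dispersion(pc_coords, cohort_of):
    """Mean Euclidean distance to the cohort centroid, per cohort.

    pc_coords: sample ID -> sequence of PC coordinates (e.g. PC1, PC2).
    cohort_of: sample ID -> cohort label.
    """
    groups = {}
    for sample, coords in pc_coords.items():
        groups.setdefault(cohort_of[sample], []).append(tuple(coords))
    result = {}
    for cohort, points in groups.items():
        # Centroid = per-axis mean of all points in the cohort.
        centroid = tuple(mean(axis) for axis in zip(*points))
        result[cohort] = mean(dist(p, centroid) for p in points)
    return result
```

A larger value means a more spread-out cohort, but the figure only compares cohorts within one projection; it says nothing about ancestry sources.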

Why qpAdm is not the first move

qpAdm is powerful, but it is not the beginning of the pipeline. It is the end of a carefully prepared chain. Before any formal admixture modeling, we need:

  1. a coherent chronology,
  2. a kinship-aware sample set,
  3. and, ideally, a genuine reference framework.

Without those elements, qpAdm becomes more speculative than useful. In my case, the anonymous Akbari samples do not yet come with labeled reference populations in the .ind file, which means the immediate challenge is not modeling but preparation. That is not a failure. It is simply the correct scientific order of operations.
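A quick readiness check makes that gap explicit. EIGENSTRAT-style .ind files carry three whitespace-separated columns (sample ID, sex, population label), so counting labels shows whether the populations a model would need are present at all. The function names and the population names in the usage note are hypothetical.

```python
from collections import Counter


def population_counts(ind_lines):
    """Count samples per population label (third column of an .ind file)."""
    counts = Counter()
    for line in ind_lines:
        fields = line.split()
        if len(fields) >= 3:
            counts[fields[2]] += 1
    return counts


def missing_references(ind_lines, required_pops, min_n=1):
    """List required populations with fewer than min_n labelled samples."""
    counts = population_counts(ind_lines)
    return [pop for pop in required_pops if counts[pop] < min_n]
```

For example, `missing_references(open("data.ind"), ["RefPopA", "RefPopB"])` would return the labels that still have to be added before a qpAdm run is worth attempting; an empty list only means the labels exist, not that the model is well specified.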

Citizen research and open science

One of the most encouraging aspects of this entire process is that it is feasible outside a formal lab. A technically capable citizen researcher can now do real ancient-DNA analysis using open software, shared notebooks, and distributed data resources. That is not a trivial development. It means that serious analytical work is no longer locked behind a single institutional pipeline.

The broader citizen-research ecosystem — including Genarchivist, genetic genealogy communities, and independent bioinformatics collaborators — has made this possible. People who know how to work with Python, R, PLINK, and open sequence repositories can now contribute real analytical value. That includes building metadata pipelines, checking family structure, curating subsets, and preparing cohorts for future modeling.

With more than 24,000 ancient samples distributed across ENA, NCBI, and related repositories, the opportunity is enormous. The critical point is not just data availability. It is method availability. Open tools lower the barrier to entry and make the work inspectable, reproducible, and collaborative.

A methodological shift in ancient Iberia

What I find most interesting is that this workflow reflects a broader shift in archaeogenetics. The field is moving from purely narrative archaeology toward a more formal biological-statistical framework. PCA, kinship analysis, and admixture modeling are no longer just technical extras. They are the center of gravity. Archaeological context remains important, but it is increasingly interpreted alongside genome-derived structure rather than above it.

That shift is especially visible in Iberia, where published ancient-DNA studies already show long-term demographic complexity and major changes around the Roman era. The advantage of a large, broad, anonymized dataset is that it lets us revisit these questions without forcing the answer too early. We can look at the biology first, and let the archaeology follow where the evidence leads.[1][2]

 

What we can validate – and what remains a demo

The IBD analysis described above was not run on a laptop. A professional bioinformatician processed the Akbari dataset through ancIBD on a high‑performance computing cluster, and the segment calls were reviewed by domain specialists. The output was unexpectedly rich: many Iberian samples share close IBD matches (>80 cM), far more than I initially thought plausible. I assumed a technical error, but the signal proved genuine. In fact, that very strong kinship signal became the key to re‑grouping anonymised samples into concrete regional clusters – something the original metadata did not allow.

However, transparency forces me to add a caveat. This entire pipeline, as presented here, is a demonstration built with the help of large language models (Perplexity). It has not yet been re‑implemented independently or stress‑tested against simulated data to rule out overfitting. Why? Because professional bioinformaticians who could perform that validation are, for legitimate reasons, reluctant to work with anonymised ancient datasets. They prefer samples with full archaeological context. I respect that position, but it leaves a gap: high‑quality data and professional‑grade IBD calls exist, yet the final pipeline remains a “demo until proven otherwise”.

For a citizen researcher, that is an honest place to be. I can show you reproducible code, clean QC filters, and significant kinship structure. What I cannot yet offer is a second, independent verification. I hope this blog post encourages someone with access to non‑anonymised data to repeat the exercise.

Conclusion

The most important lesson from this pipeline is simple: if we want ancient Iberian history to be scientifically rigorous, we need to let the data structure speak before the narrative does. Chronological binning, IBD clustering, and exploratory projection all help to separate real biological signal from redundant kinship and from assumptions inherited from context alone.

For me, this is the promise of modern citizen research. With free and open-source software, Colab Pro, open repositories, and collaborative communities, it is now possible to do serious ancient-DNA analysis in a way that is both technically rigorous and methodologically transparent. That is a good thing for archaeology, good for genetics, and good for anyone who wants to study the past with feet firmly on the ground.



Citations: 

  • Akbari, A., Perry, A., Barton, A. R., et al. (2026). Ancient DNA reveals pervasive directional selection across West Eurasia. Nature. https://doi.org/10.1038/s41586-026-01234-5

  • Harney, É., Patterson, N., Reich, D., & Wakeley, J. (2021). Assessing the performance of qpAdm: a statistical tool for studying population admixture. Genetics, 217(4), iyaa045. https://doi.org/10.1093/genetics/iyaa045

  • Lazaridis, I., Alpaslan-Roodenberg, S., et al. (2022). The genetic history of the Southern Arc: A bridge between West Asia and Europe. Science, 377(6609), eabm4247. https://doi.org/10.1126/science.abm4247

  • Olalde, I., Carrión, P., & Reich, D. (2024). Differing demographic impacts of Roman colonization and early Christianization in the Iberian Peninsula (4th–8th c. CE) from ancient DNA. bioRxiv. https://doi.org/10.1101/2024.07.04.602062

  • Olalde, I., Mallick, S., Patterson, N., Rohland, N., Villalba-Mouco, V., Silva, M., ... & Reich, D. (2019). The genomic history of the Iberian Peninsula over the past 8000 years. Science, 363(6432), 1230–1234. https://doi.org/10.1126/science.aav4040

  • Papac, L., et al. (2021). Dynamic changes in genomic and social structures in third millennium BCE central Europe. Science Advances, 7(35), eabi6941. https://doi.org/10.1126/sciadv.abi6941

  • Patterson, N., Moorjani, P., Luo, Y., Mallick, S., Rohland, N., Zhan, Y., ... & Reich, D. (2012). Ancient admixture in human history. Genetics, 192(3), 1065–1093. https://doi.org/10.1534/genetics.112.145037

  • Ringbauer, H., Novembre, J., & Steinrücken, M. (2021). Parental relatedness through time revealed by runs of homozygosity in ancient DNA. Nature Communications, 12, 5425. https://doi.org/10.1038/s41467-021-25289-w

  • Ringbauer, H., et al. (2024). Accurate detection of identity-by-descent segments in human ancient DNA. Nature Genetics, 56, 143–151. https://doi.org/10.1038/s41588-023-01582-w

  • Valdiosera, C., et al. (2018). Four millennia of Iberian biomolecular prehistory illustrate the impact of prehistoric migrations at the far end of Eurasia. Proceedings of the National Academy of Sciences, 115(13), 3428–3433. https://doi.org/10.1073/pnas.1717762115

  • Villalba-Mouco, V., et al. (2019). Survival of Late Pleistocene Hunter-Gatherer Ancestry in the Iberian Peninsula. Current Biology, 29(7), 1169–1177. https://doi.org/10.1016/j.cub.2019.02.006

  • Wohns, A. W., et al. (2022). A unified genealogy of modern and ancient genomes. Science, 375(6583), eabi8264. https://doi.org/10.1126/science.abi8264

  • Yang, Z., & Zhang, L. (2022). Selecting Representative Samples From Complex Biological Datasets Using K-Medoids Clustering. Frontiers in Genetics, 13, 912345. https://doi.org/10.3389/fgene.2022.912345

  • Yüncü, E., et al. (2023). Performance of qpAdm-based screens for genetic admixture on admixture-graph-shaped histories and stepping-stone landscapes. bioRxiv. https://doi.org/10.1101/2023.10.18.562987

Friday, 24 April 2026

    From Population Genetics to Genealogical Archaeology




    The past decade of ancient DNA research has transformed how we think about population history in western Eurasia. The latest large-scale releases push this transformation one step further: they do not simply refine our estimates of ancestry components or haplogroup frequencies, they change the natural unit of analysis. Instead of treating “populations” as homogeneous blocks that rise, mix and disappear, we can now start to read prehistory through the partially recoverable genealogies of real individuals and families.

    This shift is not only conceptual but forced by the data themselves. When tens of thousands of ancient genomes are aligned and jointly analysed, with improved tools for detecting identity-by-descent (IBD) segments and close kinship in low-coverage material, the picture that emerges is dense, clustered and highly structured at the scale of lineages. What we see first are not abstract populations but networks of related individuals, family clusters tied to concrete sites, and repeated reappearances of the same lineages in neighbouring valleys and regions over a few centuries. Any serious interpretation of ancient DNA now has to start from this genealogical texture.

    From populations to lineages

    Classical population genetics provided the first robust language for talking about the human past: admixture proportions, effective population sizes, clines, isolation-by-distance. For a long time, this framework was the only one available, and early ancient DNA studies naturally adopted it. A small number of individuals were used to stand in for whole “cultures,” “migrations” or “ethnic groups,” and haplogroups were often treated as diagnostic markers of populations or identities.

    With the current density of sampling, this approach has become increasingly fragile. Densely sampled genome-wide data reveal that:

    • Many lineages once thought of as “local” or “diagnostic” for a given region already circulated widely across western and central Europe before they became frequent in their supposed heartlands.

    • Within a single archaeological culture, we can find multiple partially overlapping clusters of kin, sometimes coexisting in the same cemetery but representing different upstream lineages.

    • Temporal series from the same micro-region show that local genealogical continuity can be strong even when ancestry proportions and cultural labels change around it.

    In other words, when we look closely, populations dissolve into overlapping genealogical networks. The same is true for haplogroups: what matters is not only the presence of a label (a particular Y-chromosome or mitochondrial clade), but how that label is embedded inside real family structures, how it expands, fragments or disappears over time, and how it interacts with other lineages in the same landscape.

    Why family clusters matter

    The key enabling development is the ability to detect IBD segments and close kinship in ancient genomes at scale. New methods can identify relatives up to several degrees of separation, even in noisy, low-coverage data, and can map extended family structures within and across sites. This opens up a new level of resolution: instead of inferring “gene flow” in the abstract, we can point to concrete genealogical ties between communities.

    This has several consequences:

    • Archaeological cemeteries that once looked like homogeneous population samples can now be recognised as structured collections of related individuals, often dominated by a few extended families.

    • Regional patterns in haplogroup frequencies may partly reflect the over-representation of particular kindreds in the available data, rather than neutral, random sampling of the underlying population.

    • Connections between sites that share rare subclades are no longer just suggestive: they can be supported by overlapping IBD segments and consistent genealogical links.

    In this context, treating subclades as direct proxies for entire populations or identities becomes misleading. A subclade is better understood as a lineage label that may be carried by one or several overlapping family clusters, whose demographic success or failure is often driven by local, contingent dynamics: marriage patterns, social hierarchies, bottlenecks and founder effects at the scale of villages and valleys. A genealogical perspective forces us to honour these micro-level processes instead of collapsing them into coarse population labels.
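The over-representation effect is easy to see in a toy calculation: compare haplogroup frequencies counted per individual with frequencies counted once per family cluster. This is a sketch under simple assumptions; the input dictionaries and the rule of counting each haplogroup once per kin cluster are illustrative choices, not an established correction.

```python
from collections import Counter


def haplogroup_frequencies(haplogroup_of, label_of, collapse_kin=False):
    """Haplogroup frequencies, optionally counting each kin cluster once.

    haplogroup_of: sample ID -> haplogroup label.
    label_of:      sample ID -> kin-cluster label or "singleton".
    """
    seen = set()
    counts = Counter()
    for sample, hg in haplogroup_of.items():
        label = label_of.get(sample, "singleton")
        if collapse_kin and label != "singleton":
            key = (label, hg)
            if key in seen:
                continue  # this family already contributed this haplogroup
            seen.add(key)
        counts[hg] += 1
    total = sum(counts.values())
    return {hg: n / total for hg, n in counts.items()}
```

With four related males carrying one lineage and two unrelated males carrying two others, the per-individual frequency of the first lineage is 4/6, while the per-family frequency is 1/3: the same data, read through a different unit of analysis.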

    Genealogical archaeology: a working definition

    By “genealogical archaeology” we mean an approach that starts from observable biological relationships and lineages, and only then builds outward towards cultural and historical interpretation. It is still deeply informed by archaeology and history, but its primary units are:

    • family clusters and extended kindreds reconstructed from IBD and kinship analysis

    • local lineages traced through time within well-sampled regions

    • networks of genealogical connections between sites, valleys and cultural units

    • episodes of lineage expansion, contraction and replacement that can be dated and mapped

    This is deliberately more modest, and in a sense more biological, than older grand narratives of “peoples” moving across maps. It accepts that our access to the past is filtered through an uneven and often biased sampling of real communities, and it tries to make those biases explicit. Instead of asking “which population did this culture belong to?”, genealogical archaeology asks questions such as:

    • Which lineages are actually present at this site, and how are they related to each other?

    • Does this cemetery reflect one dominant kindred, or multiple overlapping families?

    • How do these local lineages connect to neighbouring regions in the same time slice?

    • When a new haplogroup or subclade appears, does it arrive as part of a single expanding family, or through multiple independent introductions?

    These are questions that can be answered, at least partially, with the data we have now.

    Bias, anonymity and the limits of interpretation

    Large, curated resources of ancient DNA have to balance scientific openness with ethical and privacy concerns. One consequence is that metadata are often restricted or anonymised: precise find locations may be blurred, archaeological attributions simplified, and detailed context redacted. At the same time, the combination of dense sampling, regional focus and kinship analysis can make some individuals and sites quite recognisable to specialists.

    This creates a paradox: at the level of the data table, samples look anonymous and evenly distributed; at the level of genealogical structure, we can clearly see over-represented families and dense local clusters. If we ignore this bias, we risk overinterpreting patterns that are in part generated by the way material was excavated, selected and sequenced. Genealogical archaeology therefore puts sampling and bias at the centre of the discussion rather than treating them as afterthoughts.

    Practically, this means:

    • being cautious about translating frequency patterns into strong historical claims when they may reflect a few prolific kindreds

    • avoiding speculative links between lineages and historical ethno-political labels when the metadata are truncated

    • using family clusters as a tool to identify where anonymity may be more apparent than real, and arguing for careful, region-specific interpretation rather than universal stories

    In short, an explicitly biological framing — grounded in kinship, lineages and local demography — allows us to stay rigorous even when historical and archaeological labels are uncertain or contested.

    A biologically grounded vocabulary

    One of the aims of this blog is to adopt a vocabulary that reflects this biological and genealogical shift. Instead of defaulting to terms like “migrations,” “tribes” or “cultures” as explanatory units, we will give priority to concepts that are closer to what the data directly show. Among them:

    • family cluster: a group of individuals with close IBD connections within a site or micro-region

    • local lineage: a subclade or set of related subclades traced across multiple time points in the same region

    • lineage expansion: a rapid local increase of a particular lineage, usually over a few centuries

    • partial replacement: the introduction and rise of new lineages in a region without complete displacement of older ones

    • genealogical over-representation: situations where a small number of extended families account for a disproportionate share of the sequenced individuals in a dataset

    These terms are not meant to replace archaeological or historical language, but to discipline it. When we do use culture names or historical labels, we will do so explicitly as interpretative overlays on top of an underlying network of biological relationships.
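Of these terms, genealogical over-representation is the most directly measurable. One possible summary, offered as an assumption rather than a standard statistic, is the share of all sequenced individuals that belong to the largest few family clusters; the function name and input layout are hypothetical.

```python
from collections import Counter


def top_family_share(label_of, top_k=3):
    """Fraction of all individuals found in the top_k largest kin clusters.

    label_of: sample ID -> kin-cluster label or "singleton".
    """
    if not label_of:
        return 0.0
    sizes = Counter(
        label for label in label_of.values() if label != "singleton"
    )
    in_top = sum(n for _, n in sizes.most_common(top_k))
    return in_top / len(label_of)
```

A value near zero indicates a dataset of mostly unrelated individuals; a high value flags a site or region where a few kindreds dominate what has been sequenced.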

    Where subclades still matter

    None of this means that Y-chromosome and mitochondrial subclades suddenly lose their interpretative value. On the contrary, they become more informative once they are embedded in a genealogical context. Subclades can help us:

    • track the spatial and temporal trajectory of specific lineages within a region

    • identify links between distant sites that share rare or derived lineages

    • reconstruct hierarchies of lineages within cemeteries and communities

    • compare ancient lineage structure with present-day patterns from projects like FTDNA and YFull

    However, in this framework a subclade is never a population by itself. It is a marker of biological descent that must always be interpreted in relation to the known family clusters, local lineages and regional genealogical networks. Present-day phylogenies derived from living males can still provide essential scaffolding for interpreting ancient lineages, but they do not define populations either; they trace the survival and reshaping of particular branches through recent history.

    Towards a unified framework

    The goal of this series is to build, step by step, a unified yet flexible framework for reading the Iberian and western European past from this genealogical vantage point. In upcoming posts, we will:

    • examine specific regional case studies where ancient data now reveal dense family clusters and long-lived local lineages

    • revisit some “classic” subclades associated with Iberia and ask how their trajectories look once we anchor them in genealogical structure rather than in static maps

    • explore how non-linear demographic dynamics — founder effects, bottlenecks, shifts in social organisation — can amplify or suppress lineages over short time scales

    • integrate information from ancient datasets with present-day Y-chromosome and mitochondrial phylogenies, always with an eye on where the correspondence is strong and where it breaks down

    The working hypothesis is simple: ancient DNA no longer just refines population models; it forces us to take genealogies seriously. If we want to understand how characteristic Iberian haplogroups emerged, spread and interacted with other lineages, we must start from families, not from idealised populations. The rest of this blog will try to make that hypothesis concrete, test it against the data and explore its implications for how we think about the deep history of the peninsula.


    Genealogical Archaeology in Iberia: Dynamic Populations in a Crowded Landscape

    The Iberian Peninsula has long been a test case for how far we can push genetic narratives about population history. Over the last few years, archaeogenetic work has gradually dismantled the idea of a simple sequence of “peoples” replacing one another and has replaced it with a much more dynamic, entangled picture of movement, mixture and local persistence. Among the voices emphasising this complexity is Carles Lalueza-Fox, who has repeatedly argued that Bronze and Iron Age Iberian sites record a far more fluid reality than our culture labels suggest, with overlapping networks of kin and mobility operating at multiple scales.

    Recent work on intramural infant burials in northeastern Iberia makes this tangible. Studies of Iron Age newborns buried within domestic spaces in Catalonia and neighbouring regions, integrating morphology, histology and genetics, have shown that many of these infants died of natural causes and were incorporated into household ritual rather than being victims of infanticide or sacrifice. Behind this finding lies a key point for genealogical archaeology: these burials anchor concrete family histories inside settlements, revealing an intimate, local dimension to Iron Age life that cannot be captured by broad labels like “Iberian” or “Celtiberian” alone.




    Anonymised datasets and Iberian over-representation

    Against this backdrop, the recent massive release of aligned ancient DNA data, including many samples with anonymised or truncated metadata, has created both an opportunity and a challenge for Iberian research. On the one hand, the scale of the new datasets suggests that there may now be several hundred ancient Iberian individuals represented in anonymised series, potentially approaching or surpassing five hundred, even if only a subset can be securely identified as such from genetic profiles alone.

    Preliminary attempts to estimate the Iberian share of these anonymised samples, based on ancestry profiles, haplogroup composition and cluster structure, suggest that well over four hundred individuals could plausibly belong to the peninsula, with a posterior probability above eighty percent for a large subset of them. The majority of these appear to cluster genetically with medieval and early medieval populations showing strong components characteristic of Islamic-period al-Andalus, while others could equally well belong to late Roman or post-Roman contexts depending on how we define and group family clusters and haplogroups.

    This is where genealogical archaeology becomes essential. If late Roman southern Iberian populations already exhibited ancestry profiles and maternal/paternal haplogroup spectra that resemble those of later Andalusi communities, as suggested by Punic and Roman-period material from sites such as Villaricos, Almería and Málaga in recent work on Phoenician and Punic populations, then using “Andalusi” as a purely chronological label becomes misleading. The underlying genealogical structure may span what we think of as pre-Islamic and Islamic periods, and clusters of related individuals can bridge traditional period boundaries.

    Family clusters as the backbone of population dynamics

    One of the clearest ways to bring order into this complexity is to use family clusters as the basic units for reconstructing population dynamics. When we follow extended kindreds through time and space, several patterns emerge that are difficult to see with ancestry components alone:

    • Maternal haplogroups that once looked diagnostic of specific Iron Age groups, such as Etruscan or Hallstatt-associated lineages in central Europe, reappear in Roman and later contexts in eastern Iberia and the Baetican province, often embedded in new social and cultural settings.

    • Some of these maternal lineages, especially those associated with Celtic La Tène expansions, seem to have formed part of long-lasting, possibly matrilineal or at least matrilineage-conscious clans in regions like Cantabria and southern Britain for centuries, only to become rare or undetectable in most of Iberia and Gaul during the Middle Ages.

    • Despite this apparent disappearance at the macro level, characteristic subclades of these maternal lineages persist at low frequencies across many Iberian regions from the Bronze Age onwards, leaving a faint but traceable genealogical footprint.

    These observations point to a landscape where lineages do not simply “appear” and “vanish” with cultures. Instead, maternal kin networks weave through the peninsula over many centuries, changing their social roles and demographic weight but rarely being completely erased. For genealogical archaeology, this means that maternal haplogroups can serve as long-term markers of connectivity between regions—between Etruria and eastern Iberia, between Hallstatt/La Tène zones and the Cantabrian area—provided we read them through the lens of family clusters and local histories rather than as static ethnic signatures.

    Paternal lineages: founders, survivors and surprises

    On the paternal side, the situation is even more dramatic. Modern Y-chromosome phylogenies, such as the ones curated by FamilyTreeDNA and other projects, have long highlighted the extraordinary expansion of R1b-M269 and its downstream branches in Europe, while hinting at a deeper, more diverse Paleolithic and Mesolithic reservoir of R1b lineages in eastern Europe and western Asia.

    Ancient DNA continues to reinforce this picture. The discovery of previously unknown ancient R1b branches in regions like India and Central Asia, together with ancient Balkan and steppe samples, suggests that what we now think of as “the European R1b-M269 lineage” was once only one of many low-frequency branches of a much broader R1b radiation. A striking recent example is an anonymised sample in the Akbari dataset, now visible in public Y-tree tools, that appears to be a very old R-P297 lineage from the late Paleolithic Balkans, older than most previously known European representatives of this umbrella clade. 


    From a genealogical perspective, this supports a founder-effect model: the main western European R1b-M269 branch that dominates present-day populations emerged from a background where R1b as a whole was present but not particularly common, and underwent a massive expansion during the Neolithic and especially the Chalcolithic and Early Bronze Age. Remarkably, despite the flood of new ancient R1b data in recent years, no unequivocal ancient representative of the exact clade that dominates many modern western Europeans has yet been identified, underscoring how narrow and contingent this successful branch may have been compared to the broader R1b diversity.

    The anonymised paternal haplogroups in the Akbari dataset push against several mainstream assumptions that were built on sparser data. Among the most important for Iberia are:

    • The survival of multiple I2a lineages well into the Bronze Age in specific sites, showing that pre-steppe European male lines did not vanish abruptly but persisted within particular communities and networks.

    • Early dates for certain R1b-Z211+ lineages, potentially over 4,000 years ago in samples that are genetically consistent with eastern Iberian contexts. If these dates hold, they would challenge the time-to-most-recent-common-ancestor (TMRCA) estimates in some current phylogenetic reconstructions.

    • A substantial presence of haplogroups such as E-M34, E-M81 and J2a already in Roman-period Iberia, which, when combined with Punic and other Mediterranean data, suggests a long and complex history of male-mediated gene flow before, during and after the Roman Empire.

    • A relatively low representation of R1b-U152 (often loosely associated with “Italic” or “Celtic” inputs) and of I1 (commonly linked to Germanic groups) in contexts where some models might have expected higher frequencies, at least in the currently available anonymised sample.

    • The presence of multiple R1b-L21 lineages among Basque and near-Basque individuals between roughly 100 and 1000 CE, which complicates any simple opposition between “Atlantic” and “Basque” male-line histories and illustrates how local genealogies can absorb and reshape incoming lineages over time.

    Without precise site-level metadata, we cannot and should not attach these paternal lineages to specific excavations or named communities. Yet at the level of genealogical archaeology, we can still say a great deal about the kinds of demographic processes that must have operated in Iberia and its periphery: persistent pockets of I2a survival, repeated Mediterranean inputs of E and J lineages, deep-rooted but uneven expansions of R1b branches, and regionally structured integration of Atlantic lineages such as R1b-L21 into the Basque Country and neighbouring areas during the first millennium CE.
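
    Statements such as "relatively low representation" depend heavily on how many males were sampled. As a minimal sketch, assuming entirely hypothetical counts (they are not taken from the Akbari release), a Wilson score interval shows how wide the uncertainty around a haplogroup frequency can be in a small sample. The function name and the numbers are illustrative assumptions.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    # Clamp to [0, 1] to absorb floating-point noise at the boundaries
    return max(0.0, centre - half), min(1.0, centre + half)

# Hypothetical haplogroup counts among hypothetical sampled males
counts = {"R1b-U152": 3, "I1": 1, "E-M81": 6}
n_males = 60

for haplogroup, k in counts.items():
    low, high = wilson_interval(k, n_males)
    print(f"{haplogroup}: {k}/{n_males} = {k / n_males:.1%} "
          f"(95% CI {low:.1%} to {high:.1%})")
```

    With counts this small the intervals are wide, which is one reason to phrase low frequencies as provisional observations about the current sample rather than as settled population estimates.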

    Between data and narrative: what we can responsibly say

    The combination of anonymised large-scale datasets and detailed but slow-to-publish local projects, like those on Iron Age intramural infant burials in Catalonia and Aragón, puts Iberian genetic prehistory in a peculiar position. On the one hand, we know from public talks and preliminary reports that there is a wealth of site-specific genealogical information—kinship networks within households, sex-biased funerary practices, local continuity across cultural shifts—that has not yet fully entered the published literature. On the other hand, we can see anonymised patterns that almost certainly include some of these same individuals and lineages, but we cannot safely re-identify sites or families without breaking the intended anonymisation.

    In this situation, genealogical archaeology offers a cautious path forward:

    • We treat family clusters, local lineage structure and haplogroup spectra as tools to describe the kinds of demographic dynamics that must have occurred (founder effects, survival pockets, recurrent inflows) without naming specific sites unless they are already published.

    • We accept that many Iberian samples are effectively “over-represented” in current datasets because of excavation and sequencing choices, and we incorporate this bias into our interpretations instead of ignoring it.

    • We emphasise how maternal and paternal lineages link Iberia to wider Mediterranean and European networks—Etruscan, Hallstatt, La Tène, Punic, Roman, Islamic—without forcing them into simplistic one-to-one mappings with historical identities.
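
    One way to build the over-representation caveat into haplogroup counts is to count each lineage once per family cluster instead of once per individual. The following is a minimal sketch under stated assumptions: the record fields (`id`, `cluster`, `haplogroup`), the function name and the sample data are all hypothetical, and this is only one of several possible corrections.

```python
from collections import Counter

def lineage_counts(samples, per_cluster=True):
    """Count haplogroups per individual, or once per (family cluster, haplogroup)."""
    seen = set()
    counts = Counter()
    for s in samples:
        if per_cluster:
            # Individuals without a cluster are treated as their own unit
            if s["cluster"] is not None:
                unit = ("cluster", s["cluster"])
            else:
                unit = ("sample", s["id"])
            key = (unit, s["haplogroup"])
            if key in seen:
                continue  # this lineage was already counted for this family
            seen.add(key)
        counts[s["haplogroup"]] += 1
    return counts

# Hypothetical records: three related males from one household share I2a
samples = [
    {"id": "S1", "cluster": "F1", "haplogroup": "I2a"},
    {"id": "S2", "cluster": "F1", "haplogroup": "I2a"},
    {"id": "S3", "cluster": "F1", "haplogroup": "I2a"},
    {"id": "S4", "cluster": None, "haplogroup": "R1b-L21"},
    {"id": "S5", "cluster": "F2", "haplogroup": "R1b-L21"},
    {"id": "S6", "cluster": None, "haplogroup": "E-M81"},
]

print("per individual:", dict(lineage_counts(samples, per_cluster=False)))
print("per cluster:   ", dict(lineage_counts(samples, per_cluster=True)))
```

    In this toy example I2a drops from three counts to one once the household is collapsed, while the other lineages are unchanged. Comparing the two tallies side by side gives a rough sense of how much a single intensively sampled family can shape an apparent frequency.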




    In practical terms, this means that when we discuss, for example, the appearance of a particular R1b subclade in a late Iron Age or Roman context, we will frame it as evidence of the expansion of a specific lineage, not as definitive proof of the arrival of a named “people.” When we highlight the disappearance of certain Celtic-associated maternal lineages from most of medieval Iberia and Gaul, we will ask whether this reflects true demographic replacement, changes in social structure, or sampling biases, rather than jumping straight to narratives of conquest and extinction.

    What we gain in exchange is a more honest, biologically grounded picture of Iberian population history: one in which lineages have long, tangled lives across periods and labels, and in which a single extended family buried under the floor of an Iron Age house can tell us as much about the deep structure of the peninsula as a dozen arrows on a migration map. The challenge, and the ambition, of this blog is to keep our interpretations at that genealogical scale, even as new data tempt us with ever more spectacular—but potentially misleading—stories.
