Data
Miller, Charles, 2022, "Historical Conflict Event Dataset", https://doi.org/10.7910/DVN/6ZFC0V, Harvard Dataverse, V3
Probabilities
The dataset contains 8,881 battles between the years 1468 BC and 2003. Percentages were calculated in this order:
- A war in any given century may have been waged across more than one territory. Each territory has its own probability, based on how many battles occurred in that war and location.
- Battle features — whether on land or at sea, its size and participants — depend on the records of that specific war in that specific location.
- The likelihood of fighting for any participant depends on how often that side engaged in battle.
- The likelihood of winning, losing, and/or a massacre depends on the track record of that side in that war and location.
- Survival rates were made up for a bit of personal stake: the default is 50%, rising to 75% if your side won the battle, dropping to 30% if it lost, and to under 10% if it lost and was massacred. The displayed rate was randomized (give or take a few percentage points) for variety.
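The steps above come down to counting battles within a group and dividing by the group total. Here is a minimal sketch in plain Python, not the actual analysis code. The field names (`century`, `war`, `location`, `side`, `outcome`), the toy records, the 5% base for a massacre and the 3-point jitter are all assumptions made for illustration.

```python
import random
from collections import Counter

# Hypothetical toy records, one row per side per battle.
# The real dataset's field names and values may differ.
battles = [
    {"century": 19, "war": "War A", "location": "X", "side": "Red", "outcome": "won"},
    {"century": 19, "war": "War A", "location": "X", "side": "Red", "outcome": "lost"},
    {"century": 19, "war": "War A", "location": "Y", "side": "Red", "outcome": "won"},
    {"century": 19, "war": "War B", "location": "Z", "side": "Blue", "outcome": "won"},
]


def conditional_share(records, given, target):
    """Return {(given values..., target value): share within the given group}."""
    joint = Counter(tuple(r[k] for k in given) + (r[target],) for r in records)
    group = Counter(tuple(r[k] for k in given) for r in records)
    return {key: count / group[key[:-1]] for key, count in joint.items()}


# Share of a war's battles fought in each location, within a century.
p_location = conditional_share(battles, ["century", "war"], "location")

# Track record of a side in a given war and location.
p_outcome = conditional_share(battles, ["war", "location", "side"], "outcome")


def survival_rate(outcome, massacred=False, jitter=0.03):
    """Made-up survival rate with a small random offset for variety."""
    if outcome == "lost" and massacred:
        base = 0.05  # the text only says "under 10%"; 0.05 is an assumption
    else:
        # 75% if won, 30% if lost, 50% otherwise (the default)
        base = {"won": 0.75, "lost": 0.30}.get(outcome, 0.50)
    return base + random.uniform(-jitter, jitter)
```

For example, `p_location[(19, "War A", "X")]` is the fraction of War A's 19th-century battles that took place in location X, and `survival_rate("won")` returns a value within a few points of 75%.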
Data Cleaning
I manually filled a few missing values for battle locations based on the historical description provided for each data point.
Names of the larger wars to which the battles belong were edited for consistency; for example, "World War One" and "World War 1" became "World War I". Some were changed to make the probabilities easier to calculate: several Napoleonic campaigns, each with varied names, were grouped into one long Napoleonic war.
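A renaming pass like this could be done with a short list of regex rules. The sketch below is illustrative only. The "World War I" variants come from the example above, while the `Napoleon` pattern and the "Napoleonic Wars" label are assumptions, since the campaign names in the dataset and the final label are not given here.

```python
import re

# (pattern, canonical name) pairs, checked in order.
WAR_NAME_RULES = [
    (re.compile(r"^World War (One|1)$", re.IGNORECASE), "World War I"),
    # Assumed pattern and label: the real campaign names may need an explicit list.
    (re.compile(r"Napoleon", re.IGNORECASE), "Napoleonic Wars"),
]


def normalize_war_name(name):
    """Collapse extra whitespace, then map known variants to one canonical name."""
    cleaned = " ".join(name.split())
    for pattern, canonical in WAR_NAME_RULES:
        if pattern.search(cleaned):
            return canonical
    return cleaned
```

Names that match no rule come back unchanged apart from the whitespace cleanup, so the rules can be added to gradually as inconsistencies turn up.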
Some names of battle participants were edited for clarity, especially in civil wars: fighting in the "Argentinian civil war" on the side of "Argentina" means fighting for the ruling government against opposing parties, and the participant names were edited to reflect this.
A big change was to battle scale, or size, called the Lehmann Zhukov scale in the dataset: more than 43% of the data had no information on scale. I made the big assumption that larger battles would likely get more detailed reports, and so assigned the smallest scale (up to 50,000 soldiers on either side) to all these data points.
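The fill described above could be sketched as follows. The `scale` field name and the code `1` for the smallest category are assumptions, as the dataset's actual encoding of the scale is not shown here.

```python
SMALLEST_SCALE = 1  # assumed code for "up to 50,000 soldiers on either side"


def is_missing(value):
    """Treat None and empty strings as missing."""
    return value is None or value == ""


def missing_share(records, key="scale"):
    """Fraction of records with no value for `key`."""
    return sum(is_missing(r.get(key)) for r in records) / len(records)


def fill_missing_scale(records, key="scale", default=SMALLEST_SCALE):
    """Return copies of the records with missing scales set to the smallest one."""
    return [
        {**r, key: default if is_missing(r.get(key)) else r[key]}
        for r in records
    ]
```

Checking `missing_share` before and after the fill is a quick way to confirm how much of the data the assumption touches.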
These changes and other typo fixes were based on very quick searches, with results from Wikipedia and other online sources. There are surely many errors owing to my lack of familiarity with the historical contexts.
Other Notes
On what the dataset's author called the battle's "theatre" (whether it took place on land, at sea, or in the air): all land battles of the 20th century were assumed to have tactical air support since, according to the author, it is largely ubiquitous in post-1918 warfare.
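As a rough sketch, the air-support assumption could be applied per battle like this. The `theatre` and `year` field names, the lowercase theatre values and the 1901 to 2000 bounds for "20th century" are assumptions made for illustration.

```python
def effective_theatres(record):
    """Return the set of theatres a battle is treated as having."""
    theatres = {record["theatre"]}
    # Assumed reading of "20th century": years 1901 through 2000.
    if record["theatre"] == "land" and 1901 <= record["year"] <= 2000:
        theatres.add("air")
    return theatres
```

A land battle from 1944 would then be treated as fought on land and in the air, while one from 1815 stays land-only.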
A battle has three possible outcomes: won, lost, or inconclusive. The third is rare and means something like a tie, where, according to the author, both sides achieved their tactical goals.
Tech Stack
Analysis: Python (Pandas, Regex, Seaborn, Matplotlib)
Visualization: JavaScript (D3, mo.js), HTML, CSS