“Big data” is about finding patterns in large amounts of data, and it is used in many fields. Supermarkets analyze purchase data to optimize store layout, so they know to put the beer next to the nachos. Medical researchers look for correlations between genetic mutations and illnesses to help diagnose hereditary diseases.
Big data can help us as genealogists too. Genealogical data is being made available as open data, allowing researchers to use it for their own analyses. Some patterns will be predictable, for example that infant mortality is lower when the parents’ income is higher, or that families have fewer children if the parents were older when they married. But unexpected patterns can provide new insights.
I have made a modest start by analyzing the data from my one-place study of Winterswijk. I discovered that one in four children born in 1839 died before the age of eighteen, and that one in three of the surviving children emigrated to America. By analyzing family connections, I found that more than half of these emigrants settled in a place where a relative already lived. This “chain migration” was much more common than I had realized, and it emphasizes the importance of researching other emigrating relatives.
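A count like the chain-migration share can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the analysis used for Winterswijk: the `Emigrant` record, its fields, and the placeholder people and places below are all hypothetical, and an emigrant is counted as part of a chain when at least one listed relative arrived at the same destination in an earlier year.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Emigrant:
    # Hypothetical record layout, invented for this sketch.
    name: str
    year: int                   # year of emigration
    destination: str            # place of settlement
    relatives: tuple[str, ...]  # names of known relatives in the data set


def chain_migration_share(emigrants):
    """Fraction of emigrants who settled where a relative had already arrived."""
    if not emigrants:
        return 0.0
    by_name = {e.name: e for e in emigrants}
    followers = 0
    for e in emigrants:
        for rel_name in e.relatives:
            rel = by_name.get(rel_name)
            # A relative counts only if they went to the same place earlier.
            if rel and rel.destination == e.destination and rel.year < e.year:
                followers += 1
                break
    return followers / len(emigrants)


# Placeholder records, not real people or places.
sample = [
    Emigrant("A", 1845, "Place X", ("B",)),
    Emigrant("B", 1847, "Place X", ("A",)),
    Emigrant("C", 1848, "Place Y", ()),
    Emigrant("D", 1850, "Place X", ("A", "B")),
]

print(chain_migration_share(sample))
```

In this toy set, B and D follow a relative to Place X, while A was the first to arrive and C has no known relatives. Real data would need more care, for example relatives who emigrated in the same year or places recorded under different spellings.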
Insights based on statistics can also help us prove relationships. When trying to identify parents, I can use the fact that the names of the candidate parents appear among the children’s names. That argument is stronger if I can show that 99% of families in that time and place named their children after their grandparents.
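How often a naming pattern occurs could be measured along these lines. This is a minimal sketch with assumptions of my own: the dictionary layout and the sample families are invented for illustration, and a family is counted as following the pattern only when every known grandparent's name appears among the children. Other definitions, such as requiring only the first few children to match, would give different percentages.

```python
def follows_naming_pattern(child_names, grandparent_names):
    """True if every known grandparent's name appears among the children."""
    children = {name.strip().lower() for name in child_names}
    return all(g.strip().lower() in children for g in grandparent_names)


def naming_pattern_share(families):
    """Share of families with known grandparents that follow the pattern."""
    # Families without known grandparents cannot be judged, so skip them.
    eligible = [f for f in families if f["grandparents"]]
    if not eligible:
        return 0.0
    hits = sum(
        follows_naming_pattern(f["children"], f["grandparents"])
        for f in eligible
    )
    return hits / len(eligible)


# Invented toy records, not taken from any real data set.
families = [
    {"children": ["Jan", "Janna", "Hendrik", "Willemina"],
     "grandparents": ["Jan", "Janna", "Hendrik", "Willemina"]},
    {"children": ["Gerrit", "Aaltjen", "Derk"],
     "grandparents": ["Gerrit", "Aaltjen"]},
    {"children": ["Berend", "Dela"],
     "grandparents": ["Berend", "Dela"]},
    {"children": ["Pieter", "Maria"],
     "grandparents": ["Willem", "Grietje"]},
    {"children": ["Arend"], "grandparents": []},  # grandparents unknown
]

print(naming_pattern_share(families))
```

Exact string matching is a simplification. Historical records often spell the same name in several ways, so real data would need name standardization before a percentage like this could support an argument.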
Statistical analysis by itself is not enough to come to a conclusion. Our ancestors may well be among the 1% who did not conform, so we have to demonstrate that the general pattern applies in their situation by finding supporting evidence in other sources. After all, we don’t want 1% of our tree to be incorrect.