Kurt Thearling defines data mining, in its simplest form, as ‘the process of efficient discovery of nonobvious valuable patterns from a large collection of data.’[1] His definition was intended for businesses. Although the definition is astute, digital historians and other scholars working with academic data expand on it, describing data mining as the ‘mining of knowledge’, as distinct from the ‘retrieval of data’.[2] This view is a general consensus that extends beyond historians, and it has added depth to the understanding of data mining. It marks the difference between methodologies such as ‘keyword’ searches, which highlight a specific piece of data (a word), and approaches such as the Semantic Web, which highlight information and so imply a meaning within the results.[3] Data mining can be summarised in three processes: classification, clustering and regression. Examining each of these processes will demonstrate the different methodologies within them. Furthermore, presenting data through methods such as Ngram and Topic Modeling will make it possible to evaluate how the results of data mining are presented and how historians interpret them. Finally, the question of whether analysing data online creates a different kind of history will be considered.
[1] A Data Mining Glossary, http://www.thearling.com/glossary.htm; consulted 15 April 2012.
[2] Fabio Ciravegna, Mark Greengrass, Tim Hitchcock, Sam Chapman, Jamie McLaughlin and Ravish Bhagdev, ‘Finding Needles in the Haystacks: Data-mining in Distributed Historical Datasets’, in Mark Greengrass and Lorna Hughes, The Virtual Representation of the Past (Surrey, 2008), p. 66.
[3] Fabio Ciravegna, Mark Greengrass, Tim Hitchcock, Sam Chapman, Jamie McLaughlin and Ravish Bhagdev, ‘Finding Needles in the Haystacks’, pp. 65-67.