Guest Post: Location Matters – Data Mining Research to Enhance Accuracy, Benefit Future Studies

Minitab Guest Blogger | 11 September, 2018

Topics: Automation, Design of Experiments - DOE, Machine Learning, Predictive Analytics, Data Analysis

Falk Huettman is a professor of wildlife ecology at the Institute of Arctic Biology, Biology and Wildlife Department at the University of Alaska Fairbanks. He uses machine learning extensively in his research.

How accurate is the location of an endangered Aleutian shield-fern specimen collected in Alaska in 1982? Research institutions and museums seek to provide the most accurate data possible to record the past and choose where to explore (and re-explore!) in the future. Modeling the accuracy of the data is essential, and doing it effectively requires nailing down the geographic locations of specimen samples as precisely as possible.

These days, thanks to Open Access policies, data from around the world are posted in web portals – free to query and download as they are published. Take, for example, the Global Biodiversity Information Facility (GBIF). Critical answers hidden in complex but exciting data can sit at our fingertips if the right tools are at hand and applied correctly!
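To get a feel for how accessible these portals are, here is a minimal sketch of querying GBIF's public occurrence API with Python. The species name and filters are illustrative choices, not the query used in the study:

```python
import requests

# Query the public GBIF occurrence API (no authentication needed for searches).
# The species and filters below are illustrative, not from the actual study.
resp = requests.get(
    "https://api.gbif.org/v1/occurrence/search",
    params={
        "scientificName": "Polystichum aleuticum",  # Aleutian shield-fern
        "country": "US",
        "hasCoordinate": "true",
        "limit": 50,
    },
    timeout=30,
)
resp.raise_for_status()

for record in resp.json()["results"]:
    # coordinateUncertaintyInMeters is the Darwin Core field describing
    # how precise a record's geo-reference is.
    print(
        record.get("catalogNumber"),
        record.get("year"),
        record.get("decimalLatitude"),
        record.get("decimalLongitude"),
        record.get("coordinateUncertaintyInMeters"),
    )
```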

So why not use data mining to check the trustworthiness of the data?

Along with my colleague Prof Stefanie M. Ickert-Bond, I mined over 100 years of data and 260,000 specimens using the University of Alaska Museum's ARCTOS database. Unique and precious, the data cover specimens from many Arctic nations – including Russia, Canada, Norway, Finland, Sweden and Iceland – as well as tropical collections (e.g. Papua New Guinea and Mexico) and many parts of the United States.

The question here is: how do you synthesize such data effectively and in a timely way?

While many tools and approaches can be pursued, here Salford Predictive Modeler (SPM) was used to track specimen quality and geo-referencing errors with great success. Machine learning let us quickly screen the data for potential errors, which helped us prioritize specimens in our study while leveraging the abundant data points widely available in these databases. It also uncovered signals and factors in these international databases that helped us validate data quality and generalize the findings in space and time across many dimensions.

Take this example of how TreeNet in SPM was used to show impacts of collection year and catalogue number to derive a spatial uncertainty index.

Figure: TreeNet output in SPM showing the impacts of collection year and catalogue number on the derived spatial uncertainty index.
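TreeNet is SPM's implementation of stochastic gradient boosting, so a rough open-source analogue of this analysis can be sketched with scikit-learn. The file name, column names, and model settings below are assumptions for illustration, not the study's actual variables:

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import PartialDependenceDisplay

# Hypothetical export of specimen records; column names are illustrative.
df = pd.read_csv("specimens.csv").dropna(
    subset=["collection_year", "catalogue_number", "spatial_error_km"]
)

# Catalogue number is treated as numeric here, a rough proxy for accession order.
X = df[["collection_year", "catalogue_number"]]
y = df["spatial_error_km"]  # geo-referencing uncertainty per specimen

# Stochastic gradient boosting, the same family of methods as TreeNet.
model = GradientBoostingRegressor(
    n_estimators=500, learning_rate=0.05, subsample=0.5, random_state=42
)
model.fit(X, y)

# Partial dependence shows how each predictor drives the uncertainty index,
# analogous in spirit to the TreeNet plots in the figure above.
PartialDependenceDisplay.from_estimator(
    model, X, features=["collection_year", "catalogue_number"]
)
plt.show()
```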

We found that errors greater than 11 km (about 6.8 miles) were generally a red flag for areas that might need to be examined again in order to draw more accurate spatial and collection conclusions. Such errors can result from a number of factors – from human error and inconsistent sampling protocols to technological limits (e.g. specimens collected before the pinpoint accuracy of GPS was available).
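Applying that kind of threshold to prioritize re-examination is straightforward. A minimal sketch, assuming the hypothetical DataFrame from the example above and the 11 km cutoff:

```python
# Flag specimens whose estimated spatial error exceeds the 11 km threshold;
# column names follow the hypothetical DataFrame in the previous sketch.
FLAG_KM = 11.0

df["needs_review"] = df["spatial_error_km"] > FLAG_KM
priority = df[df["needs_review"]].sort_values("spatial_error_km", ascending=False)

print(f"{len(priority)} of {len(df)} specimens flagged for re-examination")
```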

As this research has shown, web data are very complex but precious. Machine learning allowed us to get answers quickly, in this case helping to unlock information that might otherwise sit in museum basements for centuries.


Would you like to see more details on this research? See the full study: On open access, data mining and plant conservation in the Circumpolar North with an online data example of the Herbarium, University of Alaska Museum of the North