Access to data is essential for research and development. Higher-quality data result in higher-quality research, but data that contain sensitive or proprietary information might be omitted from public products to protect sensitive locations, innovative plans and proprietary elements of important databases and analyses.
To address this challenge, NETL developed the Geospatial and Information Substitution and Anonymization tool (GISA), a new solution that leverages artificial intelligence (AI) to analyze and remove sensitive information from products prior to publishing. Datasets shared between partners and the U.S. Department of Energy (DOE) can contain sensitive information that may prevent or delay data from becoming public or being shared with other entities. GISA helps the producers of those data prepare anonymized versions for public use and reuse.
NETL researchers recognized the need for a data anonymization tool several years ago, when they received a sensitive dataset that had to be anonymized before it could be shared and analyzed internally. GISA capabilities were used to anonymize the locations in the data, allowing the data to be shared with a broader internal group and analyzed across the research team.
Building on that earlier need to anonymize and curate datasets, GISA anonymizes a subset of information within data while preserving important variables, allowing meaningful analysis and research without exposing sensitive information. GISA supports redaction and anonymization through multiple methods, including randomization of geospatial point data, find-and-replace functions across multiple file types, and recommendation and redaction of terms and images in PDFs.
Using LUKE (Language Understanding with Knowledge-based Embeddings), a natural language processing (NLP) model, GISA produces recommendations of location, company and entity names for the user to review and select for redaction within PDF text. When anonymizing a PDF, GISA uses the recommended terms selected by the data owner and creates a copy of the PDF with the selected terms redacted and completely removed.
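The review-then-redact step can be illustrated with a minimal Python sketch. It is not GISA's implementation: it assumes the text has already been extracted from the PDF as a plain string, it does not call LUKE, and the function name `redact_terms` and the `[REDACTED]` mask are hypothetical choices made for illustration. The list of terms stands in for the recommendations a data owner has reviewed and approved.

```python
import re


def redact_terms(text, selected_terms, mask="[REDACTED]"):
    """Replace every reviewer-approved term in `text` with `mask`.

    Hypothetical helper for illustration only; GISA itself works on PDFs.
    """
    # Try longer terms first so "Acme Energy Corp" is matched before "Acme".
    terms = sorted({t for t in selected_terms if t}, key=len, reverse=True)
    if not terms:
        return text
    # re.escape keeps punctuation in a term from being read as regex syntax.
    # Matching is case-insensitive and also hits substrings of longer words.
    pattern = re.compile("|".join(re.escape(t) for t in terms), re.IGNORECASE)
    return pattern.sub(mask, text)


if __name__ == "__main__":
    sample = "Acme Energy Corp drilled near Springfield. acme staff reviewed it."
    approved = ["Acme", "Acme Energy Corp", "Springfield"]  # made-up terms
    print(redact_terms(sample, approved))
```

Replacing the term in a copy of the text, rather than drawing a box over it, mirrors the article's point that selected terms are removed from the output and not merely hidden.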
GISA also enables review and redaction of images within PDFs, including logos, figures and select images. The ability to review and redact terms and images using GISA is agnostic to research topic area and can be applied generally by data owners to review and redact information as needed prior to publicly releasing or sharing PDFs.
GISA also supports multiple methods of anonymizing geospatial points by converting them into approximate coordinates for effective obfuscation, and it can change text within file names and file content using a bulk find-and-replace function.
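One common way to turn an exact point into an approximate coordinate is random displacement within a maximum radius. The Python sketch below shows that general technique under stated assumptions. The article does not say which methods GISA implements, so the function name `jitter_point`, the `max_offset_m` parameter and the sample coordinates are all hypothetical.

```python
import math
import random

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in meters


def jitter_point(lat, lon, max_offset_m, rng):
    """Return (lat, lon) moved a random distance of at most max_offset_m.

    Hypothetical helper for illustration; not taken from GISA.
    """
    # The square root spreads points evenly over the disk's area
    # instead of clustering them near the original location.
    distance = max_offset_m * math.sqrt(rng.random())
    bearing = rng.uniform(0.0, 2.0 * math.pi)
    # Flat-Earth approximation, reasonable for small offsets away from the poles.
    dlat = distance * math.cos(bearing) / EARTH_RADIUS_M
    dlon = distance * math.sin(bearing) / (
        EARTH_RADIUS_M * math.cos(math.radians(lat))
    )
    return lat + math.degrees(dlat), lon + math.degrees(dlon)


if __name__ == "__main__":
    rng = random.Random()  # pass random.Random(seed) for repeatable output
    # Illustrative coordinates only, displaced by up to 500 meters.
    print(jitter_point(39.63, -79.96, 500.0, rng))
```

A larger `max_offset_m` gives stronger obfuscation at the cost of spatial accuracy, so the radius would be chosen according to how much location detail the analysis needs to keep.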
The use of AI and NLP is supported by the Science-Based Artificial Intelligence and Machine Learning Institute (SAMI) to accelerate the research at NETL. By enabling users to publish data without compromising sensitive information, GISA promotes open data sharing practices.
This tool development highlights NETL’s commitment to addressing the nation’s energy, economic and environmental challenges, with a focus on building a sustainable future for all Americans.
The GISA tool is available to the public on NETL’s Energy Data eXchange (EDX).
NETL is a U.S. Department of Energy (DOE) national laboratory dedicated to advancing the nation's energy future by creating innovative solutions that strengthen the security, affordability and reliability of energy systems and natural resources. With laboratories in Albany, Oregon; Morgantown, West Virginia; and Pittsburgh, Pennsylvania, NETL creates advanced energy technologies that support DOE’s mission while fostering collaborations that will lead to a resilient and abundant energy future for the nation.