Step 1: From Messy Titles to Standardized DOIs
My first goal was to get a standardized identifier for each article. The Digital Object Identifier (DOI) is the gold standard. Instead of manually searching for each of the 86 titles, I turned to Google Apps Script. I wrote a simple function, GET_DOI, that takes an article title from a cell and queries the CrossRef API to find its DOI.
Here’s the code that did the heavy lifting:
/**
 * Finds the DOI URL for a given article title using the CrossRef API.
 * @param {string} title The article title to search for.
 * @return {string} The full DOI URL.
 * @customfunction
 */
function GET_DOI(title) {
  const myEmail = "your-email@example.com"; // It's good practice to add your email
  if (!title) return "Input title is empty.";
  const url = `https://api.crossref.org/works?query.bibliographic=${encodeURIComponent(title)}&rows=1&mailto=${myEmail}`;
  try {
    const response = UrlFetchApp.fetch(url, { 'muteHttpExceptions': true });
    if (response.getResponseCode() === 200) {
      const data = JSON.parse(response.getContentText());
      if (data && data.message && data.message.items && data.message.items.length > 0) {
        return `https://doi.org/${data.message.items[0].DOI}`;
      } else {
        return "DOI not found.";
      }
    }
    return "API Error.";
  } catch (e) {
    return `Error: ${e.toString()}`;
  }
}
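Once the script is saved, the function behaves like any built-in formula. Assuming the original titles sit in column A, a formula like this fills a neighboring column with DOIs:

=GET_DOI(A2)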
Step 2: Getting the "Official" Title
With a column full of DOIs, I was halfway there. But to verify my original list, I needed to get the official, publisher-registered title for each DOI. This would serve as my ground truth. So I wrote another function, GET_TITLE_FROM_DOI, to reverse the process.
/**
 * Retrieves the full title of an article by its DOI using the CrossRef API.
 * @param {string} doiInput The DOI of the article.
 * @return {string} The full title of the article.
 * @customfunction
 */
function GET_TITLE_FROM_DOI(doiInput) {
  const myEmail = "your-email@example.com";
  if (!doiInput) return "Input DOI is empty.";
  // Strip any https://doi.org/ or dx.doi.org/ prefix so only the bare DOI remains.
  const doi = doiInput.replace(/^(https?:\/\/)?(dx\.)?doi\.org\//, "");
  const url = `https://api.crossref.org/works/${encodeURIComponent(doi)}?mailto=${myEmail}`;
  try {
    const response = UrlFetchApp.fetch(url, { 'muteHttpExceptions': true });
    if (response.getResponseCode() === 200) {
      const data = JSON.parse(response.getContentText());
      if (data && data.message && data.message.title && data.message.title.length > 0) {
        return data.message.title[0];
      } else {
        return "Title not found.";
      }
    }
    return "DOI not found or API Error.";
  } catch (e) {
    return `Error: ${e.toString()}`;
  }
}
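This one is used the same way; assuming the DOIs landed in column B:

=GET_TITLE_FROM_DOI(B2)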
Now my sheet had two columns to compare: my original title and the official title pulled from CrossRef. The end of my troubles? Not quite.
Step 3: The Flaw of Exact Matching and the Power of "Fuzzy" Logic
My first instinct was to use Google Sheets' built-in =EXACT(A2, C2) function. The result was a sea of 'FALSE' values. A closer look revealed why: minor differences in capitalization, a stray comma, or a simple typo were enough to throw it off. Comparing all 86 pairs by eye was the tedious task I was trying to avoid in the first place.
I needed a "fuzzy" comparison—a way to measure similarity instead of demanding perfection. This led me to the Levenshtein distance, an algorithm that calculates the number of single-character edits (insertions, deletions, or substitutions) needed to change one string into another. A lower score means a higher similarity. A quick search gave me the logic to implement it directly in Apps Script.
/**
 * Calculates the Levenshtein distance between two strings.
 * @param {string} string1 The first string.
 * @param {string} string2 The second string.
 * @return {number} The Levenshtein distance.
 * @customfunction
 */
function LEVENSHTEIN(string1, string2) {
  const s1 = String(string1), s2 = String(string2);
  if (s1.length === 0) return s2.length;
  if (s2.length === 0) return s1.length;
  const matrix = [];
  for (let i = 0; i <= s1.length; i++) matrix[i] = [i];
  for (let j = 0; j <= s2.length; j++) matrix[0][j] = j;
  for (let i = 1; i <= s1.length; i++) {
    for (let j = 1; j <= s2.length; j++) {
      const cost = s1[i - 1] === s2[j - 1] ? 0 : 1;
      matrix[i][j] = Math.min(
        matrix[i - 1][j] + 1,        // Deletion
        matrix[i][j - 1] + 1,        // Insertion
        matrix[i - 1][j - 1] + cost  // Substitution
      );
    }
  }
  return matrix[s1.length][s2.length];
}
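In the sheet, the comparison becomes one more column (for intuition: turning "kitten" into "sitting" takes three edits, so their distance is 3):

=LEVENSHTEIN(A2, C2)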
The Final Workflow: A Simple Heuristic to Finish the Job
With the LEVENSHTEIN function, I could finally quantify the difference between my original titles and the official ones. I noticed that a low distance (e.g., under 10) almost always indicated a match. To create a simple filter, I developed a rough heuristic: I took the Levenshtein distance and subtracted the number of words in my original title. If the result was a low number (especially negative), I could be confident it was the same article.
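Wrapped as one more custom function, the whole heuristic fits in a single column. Here's a minimal sketch of the idea (the MATCH_SCORE name and the whitespace-based word count are just one way to express it, not the exact formula from my sheet):

/**
 * Heuristic match score: Levenshtein distance minus the word count
 * of the original title. Low or negative scores suggest a match.
 * @param {string} original The original (messy) title.
 * @param {string} official The official title from CrossRef.
 * @return {number} The heuristic score.
 * @customfunction
 */
function MATCH_SCORE(original, official) {
  // Count words by splitting the trimmed title on runs of whitespace.
  const words = String(original).trim().split(/\s+/).length;
  return LEVENSHTEIN(original, official) - words;
}

Putting =MATCH_SCORE(A2, C2) in a fourth column and sorting it in descending order brings the questionable rows straight to the top.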
This simple, semi-automated workflow in Google Sheets transformed a multi-hour manual slog into a manageable, 20-minute task. It allowed me to quickly sort the definite matches from the questionable ones, leaving only a handful that required a true manual check.
For any researcher drowning in data, I can't recommend this approach enough. A little bit of scripting in a tool you already use can save you an incredible amount of time and effort, letting you focus on the research itself.