Step 1: From Messy Titles to Standardized DOIs
My first goal was to get a standardized identifier for each article. The Digital Object Identifier (DOI) is the gold standard. Instead of manually searching for each of the 86 titles, I turned to Google Apps Script. I wrote a simple function, GET_DOI, that takes an article title from a cell and queries the CrossRef API to find its DOI.
Here’s the code that did the heavy lifting:
/**
 * Finds the DOI URL for a given article title using the CrossRef API.
 * @param {string} title The article title to search for.
 * @return {string} The full DOI URL.
 * @customfunction
 */
function GET_DOI(title) {
  const myEmail = "your-email@example.com"; // It's good practice to add your email
  if (!title) return "Input title is empty.";
  const url = `https://api.crossref.org/works?query.bibliographic=${encodeURIComponent(title)}&rows=1&mailto=${myEmail}`;
  try {
    const response = UrlFetchApp.fetch(url, { 'muteHttpExceptions': true });
    if (response.getResponseCode() === 200) {
      const data = JSON.parse(response.getContentText());
      if (data && data.message && data.message.items && data.message.items.length > 0) {
        return `https://doi.org/${data.message.items[0].DOI}`;
      } else {
        return "DOI not found.";
      }
    }
    return "API Error.";
  } catch (e) {
    return `Error: ${e.toString()}`;
  }
}
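Once the script is saved, the function behaves like any built-in formula. Assuming the original titles sit in column A, a formula like this fills a neighboring column with DOIs:

=GET_DOI(A2)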
Step 2: Getting the "Official" Title
With a column full of DOIs, I was halfway there. But to verify my original list, I needed to get the official, publisher-registered title for each DOI. This would serve as my ground truth. So I wrote another function, GET_TITLE_FROM_DOI, to reverse the process.
/**
 * Retrieves the full title of an article by its DOI using the CrossRef API.
 * @param {string} doiInput The DOI of the article.
 * @return {string} The full title of the article.
 * @customfunction
 */
function GET_TITLE_FROM_DOI(doiInput) {
  const myEmail = "your-email@example.com";
  if (!doiInput) return "Input DOI is empty.";
  // Strip any https://doi.org/ or dx.doi.org/ prefix so only the bare DOI remains.
  const doi = doiInput.replace(/^(https?:\/\/)?(dx\.)?doi\.org\//, "");
  const url = `https://api.crossref.org/works/${encodeURIComponent(doi)}?mailto=${myEmail}`;
  try {
    const response = UrlFetchApp.fetch(url, { 'muteHttpExceptions': true });
    if (response.getResponseCode() === 200) {
      const data = JSON.parse(response.getContentText());
      if (data && data.message && data.message.title && data.message.title.length > 0) {
        return data.message.title[0];
      } else {
        return "Title not found.";
      }
    }
    return "DOI not found or API Error.";
  } catch (e) {
    return `Error: ${e.toString()}`;
  }
}
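This one is used the same way; assuming the DOIs landed in column B:

=GET_TITLE_FROM_DOI(B2)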
Now my sheet had two columns to compare: my original title and the official title pulled from CrossRef. The end of my troubles? Not quite.
Step 3: The Flaw of Exact Matching and the Power of "Fuzzy" Logic
My first instinct was to use Google Sheets' built-in =EXACT(A2, C2) function. The result was a sea of 'FALSE' values. A closer look revealed why: minor differences in capitalization, a stray comma, or a simple typo were enough to throw it off. Comparing all 86 pairs by eye was the tedious task I was trying to avoid in the first place.
I needed a "fuzzy" comparison—a way to measure similarity instead of demanding perfection. This led me to the Levenshtein distance, an algorithm that calculates the number of single-character edits (insertions, deletions, or substitutions) needed to change one string into another. A lower score means a higher similarity. A quick search gave me the logic to implement it directly in Apps Script.
/**
 * Calculates the Levenshtein distance between two strings.
 * @param {string} string1 The first string.
 * @param {string} string2 The second string.
 * @return {number} The Levenshtein distance.
 * @customfunction
 */
function LEVENSHTEIN(string1, string2) {
  const s1 = String(string1), s2 = String(string2);
  if (s1.length === 0) return s2.length;
  if (s2.length === 0) return s1.length;
  const matrix = [];
  for (let i = 0; i <= s1.length; i++) matrix[i] = [i];
  for (let j = 0; j <= s2.length; j++) matrix[0][j] = j;
  for (let i = 1; i <= s1.length; i++) {
    for (let j = 1; j <= s2.length; j++) {
      const cost = s1[i - 1] === s2[j - 1] ? 0 : 1;
      matrix[i][j] = Math.min(
        matrix[i - 1][j] + 1,        // Deletion
        matrix[i][j - 1] + 1,        // Insertion
        matrix[i - 1][j - 1] + cost  // Substitution
      );
    }
  }
  return matrix[s1.length][s2.length];
}
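In the sheet, the comparison becomes one more column (for intuition: turning "kitten" into "sitting" takes three edits, so their distance is 3):

=LEVENSHTEIN(A2, C2)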
The Final Workflow: A Simple Heuristic to Finish the Job
With the LEVENSHTEIN function, I could finally quantify the difference between my original titles and the official ones. I noticed that a low distance (e.g., under 10) almost always indicated a match. To create a simple filter, I developed a rough heuristic: I took the Levenshtein distance and subtracted the number of words in my original title. If the result was a low number (especially negative), I could be confident it was the same article.
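Wrapped as one more custom function, the whole heuristic fits in a single column. Here's a minimal sketch of the idea (the MATCH_SCORE name and the whitespace-based word count are just one way to express it, not the exact formula from my sheet):

/**
 * Heuristic match score: Levenshtein distance minus the word count
 * of the original title. Low or negative scores suggest a match.
 * @param {string} original The original (messy) title.
 * @param {string} official The official title from CrossRef.
 * @return {number} The heuristic score.
 * @customfunction
 */
function MATCH_SCORE(original, official) {
  // Count words by splitting the trimmed title on runs of whitespace.
  const words = String(original).trim().split(/\s+/).length;
  return LEVENSHTEIN(original, official) - words;
}

Putting =MATCH_SCORE(A2, C2) in a fourth column and sorting it in descending order brings the questionable rows straight to the top.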
This simple, semi-automated workflow in Google Sheets transformed a multi-hour manual slog into a manageable, 20-minute task. It allowed me to quickly sort the definite matches from the questionable ones, leaving only a handful that required a true manual check.
For any researcher drowning in data, I can't recommend this approach enough. A little bit of scripting in a tool you already use can save you an incredible amount of time and effort, letting you focus on the research itself.