week 8: break the algorithm!

Oh boy oh boy, is this week’s discussion near, dear, and feared in my heart! I was so happy to see Dr. Safiya Noble’s work on the reading list, as her contributions are crucial to information science/LIS studies, and have heavily informed my own approaches to teaching search and research in library settings (and specifically the need for critical interventions in artists’ and designers’ research processes). A few years ago, a colleague and I organized a workshop for cartoonists and illustrators that interrogated stereotypes and the use of Google Images in visual research and development. Beyond the tendency to find the same, stale images that everyone else might be drawing on (literally), we discussed how these searches tend to reinforce our preconceived ideas of people and things, with no counter-narrative to challenge what we see. In this case, our intent wasn’t necessarily to ask artists to stop using Google Images, but to consider the questions they’re asking of themselves when using these tools. (In short: “doing research” is not enough; you need to understand how your biased perspectives are going to interact with, and be amplified by, biased search systems, and how to read against those results.)

There’s another kind of insidious creep of “the algorithm” that worries me, though, noted briefly in Sharon Block’s article.1 While Block focuses on the specifics of JSTOR, Matthew Reidsma’s Masked by Trust investigates the adoption and proliferation of Google-like search interfaces and algorithms in library discovery systems, which are usually the products of commercial vendors. These systems are largely driven by proprietary software, with no real transparency into how they operate or how the vendors use patron data. Because these discovery systems are implemented by libraries (with all their perceived values of “neutrality” and work for a larger, public good), they’re assumed to operate from that same position of objectivity, and don’t receive the same critical scrutiny as commercial search engines like Google. (And in case I haven’t said it here enough: libraries are not, and have never been, neutral.)

After reading Block’s article and her questioning of JSTOR’s (pretty troubling) handling of topic modeling, I was curious to see if our own new library discovery system provided any information on how it ranks searches and handles subject terms. The vendor provides a brochure that, while not extensive with technical specs, gives some useful information on its treatment of user searches.2 Things start off as expected: the system checks search terms against existing metadata fields first (title, author, subject headings), with some sophisticated use of synonym/word-stem expansion. The next steps start to feel a little uncomfortable, though. When users search, they’re offered autocomplete suggestions (not unlike Google’s own suggestion feature), drawn from popular searches in the system’s logs and limited to content available through that library. Results are also ranked by an item’s “academic significance”:

“The item’s academic significance is calculated from factors unrelated to the query, such as whether the item was published in a peer-reviewed journal, how many times it has been cited, and what type of material it is, for example a journal article is considered more significant than a newspaper article.” 

Finally, materials are ranked by publication date, with further options for customization based on a user’s personally set subject or disciplinary interests. At Mason, this means you can tailor your search to prioritize content from within your own field, filtering out the “noise” of irrelevant research or authors.
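To make the vendor’s description a little more concrete, here’s a rough back-of-the-envelope sketch of how a pipeline like this might combine its signals. To be clear: this is my own illustration, not Ex Libris’s actual implementation; every name, weight, and formula below is invented, since the real system is proprietary and opaque.

```python
# Hypothetical sketch of the ranking pipeline described above.
# None of these names, weights, or formulas come from the vendor's
# documentation; they only illustrate the general shape of a system
# that blends query-dependent and query-independent signals.

from dataclasses import dataclass, field

@dataclass
class Item:
    title: str
    material_type: str          # e.g. "journal_article", "newspaper_article"
    peer_reviewed: bool
    citation_count: int
    pub_year: int
    subjects: list = field(default_factory=list)

# Query-independent "academic significance," per the vendor's description:
# peer review, citations, and material type all matter; the query doesn't.
TYPE_WEIGHTS = {"journal_article": 1.0, "book": 0.8, "newspaper_article": 0.4}

def academic_significance(item: Item) -> float:
    score = TYPE_WEIGHTS.get(item.material_type, 0.5)
    if item.peer_reviewed:
        score += 0.5
    score += min(item.citation_count, 100) / 100  # cap citation influence
    return score

def match_score(query: str, item: Item) -> float:
    # Stand-in for the real metadata matching (title/author/subject fields,
    # synonym and word-stem expansion). Here: crude term overlap on title.
    terms = set(query.lower().split())
    title_terms = set(item.title.lower().split())
    return len(terms & title_terms) / max(len(terms), 1)

def rank(query: str, items: list, user_subjects: set = frozenset(),
         current_year: int = 2020) -> list:
    def score(item: Item) -> float:
        s = match_score(query, item)            # query-dependent relevance
        s += 0.5 * academic_significance(item)  # query-independent boost
        s += max(0.0, 1 - (current_year - item.pub_year) / 50)  # recency
        if user_subjects & set(item.subjects):  # personal discipline boost
            s += 0.25
        return s
    return sorted(items, key=score, reverse=True)
```

Even in this toy version, the uncomfortable part is visible: the “academic significance” terms operate regardless of what the user actually asked for, so a newspaper article or a primary source starts every search already demoted.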

There are a few things that we could investigate here, given this week’s concerns about ethics and bias. One of the most obvious (and probably unsurprising) issues is apparent from the start: this is not a system designed for historians, or users looking for primary source content, or… really, anything that might fall outside of traditional academic publishing paradigms. The presumed “standard” user seems to be a non-specialist who might not start with a subject database, but is nonetheless looking for a fast and direct way to connect with new research. In fact, these discovery systems are often the first way that new students or researchers connect with our library’s catalog and resources. For this reason, the similarities to Google are probably intentional, and usually welcome: because the library system looks and acts like what new users are familiar with, they may be more likely to return and fully exploit the library’s resources.

But as with Google, how much do our users know about how this system works? How much do they care? When they log in as instructed by librarians (and as required to access full-text content), do they know if and how their searches are connected to their student data? Do librarians know whether those searches are connected to users’ personally identifiable data? One important feature of many older library catalog and circulation systems was their short-term memory; once a patron returned an item, it was wiped from the record. Searches, likewise, were not connected to anything that might identify a user. Librarians have fought (and been threatened with jail) to protect user privacy; your reading history is your own, even when the PATRIOT Act might try to say otherwise. How does that change in this new environment, though, when searches from the system logs are used to suggest results to other users?
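For what it’s worth, log-driven suggestions don’t have to be tied to identity. Here’s a hypothetical sketch of how a system could aggregate search logs for autocomplete while discarding user IDs entirely; whether our vendor does anything like this is exactly what we can’t see.

```python
# Hypothetical illustration only: one way a discovery system *could*
# feed search logs into autocomplete without tying queries to patrons.
# We have no visibility into what the vendor actually does.

from collections import Counter

MIN_USERS = 5  # suppress queries too rare to be anonymous

def build_suggestions(log_entries):
    """log_entries: iterable of (user_id, query) pairs from the search log."""
    users_per_query = {}
    for user_id, query in log_entries:
        users_per_query.setdefault(query.lower(), set()).add(user_id)
    # Keep only queries issued by enough distinct users, then discard
    # the IDs entirely -- counts survive, identities do not.
    return Counter({q: len(u) for q, u in users_per_query.items()
                    if len(u) >= MIN_USERS})

def autocomplete(prefix, counts, limit=5):
    prefix = prefix.lower()
    hits = [(q, n) for q, n in counts.items() if q.startswith(prefix)]
    return [q for q, _ in sorted(hits, key=lambda x: -x[1])[:limit]]
```

The point of the threshold is that a query only surfaces once enough distinct people have typed it, so rare (and potentially identifying) searches never become suggestions.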

Beyond concerns of data privacy, how do these kinds of systems reinforce disciplinary bubbles or cognitive/confirmation biases, surfacing only the kinds of results users expect to see, without any troublesome interdisciplinary challenges? Or information hierarchies that privilege traditional publishing systems and sources? What does it mean when the company building the discovery system is also providing much of the content? Even our vendor seems to recognize that there’s a concern about algorithms; the word is avoided entirely in its discussion of “intelligent ranking technology.”3 For librarians, this puts us in an awkward position when it comes to teaching students or new users, especially when those requests often come from faculty who might ask us to simply “teach them how to use the databases.” How can we support a learner’s critical engagement with our systems, while also supporting their immediate class or research needs? Block and Noble call attention to these issues in external systems, but how do we address the problem when it’s now built into the systems that manage and deliver our own library content?

(To try to wrap this ramble up in a spooky pre-Halloween analogy:)

[Animated gif from When a Stranger Calls. A woman holds up a telephone; the text below reads “We’ve traced the call. It’s coming from inside the house.”]
  1. Sharon Block, “Erasure, Misrepresentation and Confusion: Investigating JSTOR Topics on Women’s and Race Histories,” Digital Humanities Quarterly 14, no. 1 (2020).
  2. Ex Libris, “Primo Discovery: Search, Ranking, and Beyond,” Ex Libris, March 2015, accessed 16 October 2020.
  3. Ex Libris, “Primo Relevance Ranking: Technology Overview,” exlibrisgroup.com, ProQuest, accessed 16 October 2020.


5 Comments

  1. Terence V

    Stephanie, I really appreciate the urgency and gravity in your writing. The unease is palpable and it makes for a much more engaging read – thank you! Most important, I believe, is how you ask that heavy question that many people certainly aren’t comfortable addressing: “How much do [users] care?”

    As a library user (public, university, local, or otherwise), I usually walk in with a clear objective that must be met. As long as the tools made available to me deliver the information I want from a relatively straightforward query, then I (honestly) couldn’t care less about how it came about. I don’t think it would be an extreme claim to say that many others feel the same.

    But as the keepers of that knowledge, librarians have a vested interest in the operation of that system. They (you) are fighting major and consequential battles that the public and/or your audience might not even know about. When search algorithms employed for library/archive databases are functionally or structurally similar to those found on the internet, the stakes extend beyond just marginalization or disciplinary isolation. The status of libraries as neutral-ish institutions in the public mind is placed directly under siege. If the tendency to build ideological bubbles is transmitted from online arenas to local libraries, then we could be robbing society of a bastion of (relatively) untarnished space for unbiased learning.

    I’d hate to end on such a downer, but Noble’s (and company’s) advocacy and awareness work has yielded some early and positive victories. Perhaps the outlook isn’t so … grimm?

  2. Madison Morrow

    I was looking forward to reading your blog post to get your take on how algorithms influence research on sites like JSTOR, and you did not disappoint! Data privacy should definitely be an important consideration in research, but unfortunately it has been shown that past searches and interactions within a database influence the search results. I think you raise a lot of essential questions about what users and librarians need to know regarding the algorithms behind these databases and their impact on research. Another issue you raised is how the searches that come up tend to reinforce our preconceived notions. We need to look beyond those results and challenge them by acknowledging the bias within algorithms.

  3. Stephanie, this is a wonderful blog post following up on all of the readings for this week! I really appreciate your point of view as a librarian, as you’ve worked with and considered these issues in a way I never had prior to this week. I am very intrigued by the workshop you led, and hope more things like that will become commonplace. You’re right: people don’t know these things, don’t consider them, and don’t know how to fix them.
    I have frequently heard that ‘archives are never neutral,’ and it is something I have thought about a good bit while doing research and working with digital collections. To point out that libraries aren’t neutral either is something I needed to hear and grapple with; woof, always more issues. Thank you.

  4. Robert Carlock

    Terence is right, the tone of your post definitely makes it a compelling read. For me, that sense of urgency peaked at the point where you note that these database systems may be using our personal data to provide search results. I suppose I had not considered the effects of algorithms and personal data in relation to library databases before; I think I just assumed they did not mine personal data in the same way. But if they do, I shudder to think of the potential results. That might be a useful tool in a fully functioning democracy, but in an increasingly threatening police state where the government is using its power to attack protestors, I worry what would happen if our information ended up in the wrong hands. Academic freedom is supposed to provide a barrier around us, to an extent, but if a fascist or other authoritarian regime came to power, would I be sent to jail for researching activism?

  5. Stephanie, great post, and I love the whole look of your website. I couldn’t help but see some similarities with the movie The Social Dilemma. One of the people profiled is an ex-YouTube developer who had a hand in creating the algorithm that recommends videos for the autofeed. He has since come out against this algorithm, saying it is problematic and dangerous, and is now advocating for “algorithmic transparency,” working toward this on his website algotransparency.org.
