Reflections on publicETHOS #13, written by Mace Ojala. A student, a Software Studies scholar, and all-around academic elf. University of Tampere / IT University of Copenhagen / University of Copenhagen.

A fruitful invitation for us in Science and Technology Studies (STS) to reflect on our engagement with tools was presented in a publicETHOS workshop on 1.12.2016. The afternoon workshop, run by Mathieu Jacomy (@jacomyma), was a walkthrough of web scraping with Hyphe, an open-source, free tool developed at Sciences Po médialab where Jacomy works. The médialab designs and builds tools for social scientists in particular, Hyphe being one of them. What does Hyphe propose as a tool, and what would it take for it to be entrenched in an academic community like the ITU?

Marie Blønd (@ITU_Marie), lab manager of ETHOS Lab and a graduate of the Digital Innovations Management (DIM) programme, says Hyphe has developed since she learned about it on the Navigating Complexity course during her studies. Compared with other tools such as Tableau, Hyphe feels less like a black box. The difference stems from the iterative, reflexive engagement with the data that Hyphe affords. Seen as a web search, an alternative to results-oriented tools such as the ubiquitous Google or even REX, Hyphe postpones results until the laborious work of corpus curation has been engaged with. Instead of querying a pre-given set of webpages or sites, the workflow starts with the researcher explicitly selecting URLs into the set to be scraped, which is then crawled for further links.
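The seed-then-crawl workflow described above can be sketched in a few lines. This is not Hyphe's actual implementation, only a minimal, hypothetical illustration of the principle: the researcher supplies the seed URLs, and the crawler follows hyperlinks outward from there, recording each link as an edge. The `fetch` callable is an assumption of the sketch, injected so that any HTTP client (or a test stub) can be plugged in.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect href targets from <a> tags, resolved against a base URL."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))

def crawl(seeds, fetch, max_pages=100):
    """Breadth-first crawl starting from researcher-chosen seed URLs.

    `fetch` is any callable mapping a URL to an HTML string (e.g. a thin
    wrapper around urllib.request). Returns hyperlink (source, target)
    edges, the raw material of a hyperlink network."""
    frontier, seen, edges = list(seeds), set(), []
    while frontier and len(seen) < max_pages:
        url = frontier.pop(0)
        if url in seen:
            continue
        seen.add(url)
        parser = LinkExtractor(url)
        parser.feed(fetch(url))
        for target in parser.links:
            edges.append((url, target))
            if target not in seen:
                frontier.append(target)
    return edges
```

The point the sketch makes is the same one Marie makes: nothing happens until the researcher has chosen the seeds, so corpus curation is built into the very first step rather than deferred to a results page.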

Superficially speaking, the output of Hyphe is a typical force-directed network visualisation, now familiar to all of us and part of the repertoire of so-called digital methods. Such visualisations attract critique, and rightly so. Marie, however, argues that the valuable output of Hyphe is the reflexive process with the data. Working with corpus curation keeps the data and the data sources from being hidden behind a surface, and helps verify and analyse them qualitatively. The removal of automatic crawling of “all the web” seems to be the heart of Hyphe’s value proposition.

Beyond visualising the hyperlinks between web entities, the graph can be exported to Gephi or other tools for further qualitative or quantitative analysis. Hyphe thus sits as one tool among others, to be assembled into the instrumentation of a given research project. As such, Hyphe is one implementation of the kind of tool for investigating what Richard Rogers (@richardrogers) from the Digital Methods Initiative (DMI, @digitalmethods) at an earlier publicETHOS event characterised as natively digital data. There is tremendous value in the extra design effort that DMI, Sciences Po and others purposefully invest in such software to make it more accessible to the wider research community, as social sciences and humanities alike adopt computational thinking into research. Despite this, the digital/non-digital remains a complex intertwine. Neither data, methods, nor tools warrant a natural discrimination between the one and the other, associate professor Luca Rossi (@lr) from the Digital Society and Communication (DiSCo) section reminds us.
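The handover to Gephi mentioned above typically happens via the GEXF file format, which Gephi opens natively. As a hedged illustration of that interoperability (not Hyphe's own export code), a hyperlink edge list can be serialised into a minimal GEXF document with nothing but the standard library:

```python
import xml.etree.ElementTree as ET

def edges_to_gexf(edges):
    """Serialise (source, target) hyperlink pairs into a minimal
    GEXF 1.2 document that Gephi can open for further analysis."""
    gexf = ET.Element("gexf", xmlns="http://www.gexf.net/1.2draft",
                      version="1.2")
    graph = ET.SubElement(gexf, "graph", defaultedgetype="directed")
    nodes_el = ET.SubElement(graph, "nodes")
    edges_el = ET.SubElement(graph, "edges")

    # Assign a stable numeric id to every URL that appears in an edge.
    node_ids = {}
    for source, target in edges:
        for url in (source, target):
            if url not in node_ids:
                node_ids[url] = str(len(node_ids))
                ET.SubElement(nodes_el, "node", id=node_ids[url], label=url)

    for i, (source, target) in enumerate(edges):
        ET.SubElement(edges_el, "edge", id=str(i),
                      source=node_ids[source], target=node_ids[target])
    return ET.tostring(gexf, encoding="unicode")
```

Because the interchange format is plain XML, the same edge list can just as easily be routed into other tools for the quantitative or qualitative analysis the researcher has in mind, which is exactly the "one tool among others" role described above.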

The crucial question when working with Hyphe is of course what to include in the corpus: which criteria to adopt? Lea Schick (@schicku), postdoc at the Technologies in Practice (TiP) research group, raised this question to Mathieu at the workshop, who refrained from providing a generic answer. Instead, such decisions would necessarily be made from the perspective of the particular research undertaken. The afternoon workshop centered around an artificial research scenario of mapping controversies around AIDS. A follow-up workshop where people would bring their own research could be useful, Lea hinted, grounding the Hyphe workflow and discussing specific questions.

I was recently working as a research assistant for Marisa Cohn at TiP, and we were looking at legacy software, legacyness, decay, discourses, sustenance and maintenance, entanglements and torques of bodies, biographies and softwares, how time plays out in software and that sort of thing. Were we using Hyphe? No, we were not. We had an eclectic corpus we were working with, but the fundamental Hyphe assumption of hyperlinks carrying information did not hold. The research object we collected was hyperlink-poor, consisting of web forum discussion threads, individual blogposts, PDFs, images and other sorts of digital documents which don’t express much hypertextuality. Therefore web crawling and hyperlink analysis did not seem like an appropriate idea. Had we chosen our collection differently, to match the assumptions, Hyphe might have been valuable. Web crawling itself is a non-trivial problem, and accepting some “white box” (semi-opaque, interrogatable) instruments would have been welcome, instead of building research automation from scratch. Similarly, had we committed to Hyphe early on, that choice of a tool would surely have guided us towards data sources compatible with the tool’s assumptions, and a different research object would have been created.
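Whether a collection is hyperlink-rich enough for this kind of analysis can even be checked empirically before committing to a tool. The following is a hypothetical sketch of such a check, not anything we actually ran: it simply counts outgoing `<a href>` links per HTML document and averages across the corpus, so a hyperlink-poor collection like ours would score near zero.

```python
from html.parser import HTMLParser

class LinkCounter(HTMLParser):
    """Count <a href=...> occurrences in one HTML document."""
    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a" and any(name == "href" for name, _ in attrs):
            self.count += 1

def hyperlink_density(documents):
    """Mean number of outgoing hyperlinks per document in a corpus
    of HTML strings; near zero suggests hyperlink analysis won't
    carry much information."""
    if not documents:
        return 0.0
    total = 0
    for html in documents:
        counter = LinkCounter()
        counter.feed(html)
        total += counter.count
    return total / len(documents)
```

A low score would make the mismatch with Hyphe's core assumption visible early, before the choice of tool starts shaping the research object.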

/Mace (@xmacex)