OLNet Fellowship Week 2 – Initial Thoughts on Tracking Downloaded OERs

As I mentioned when I first posted that I was coming to the UK for this fellowship, my main focus is how to generate some data on OER usage after content has been downloaded from a repository. In looking at the issue, it became clear that the primary mechanism for doing so is the same one used to track content use on websites themselves: a “web bug,” inserted much the way many web analytics apps work, except that instead of the tracking code being inserted into the repository software/site itself, it needs to be inserted into each piece of content. The trick then becomes:

  • how do we get authors to insert these as part of their regular workflow?
  • how do we make sure they are all unique, and at what level do they need to be unique?
  • how do we easily give the tracking data back to the authors?

My goal was to do all this without really altering the current SOL*R workflow or requiring any additional user accounts.

The solution I’ve hit upon (in conversation with folks here at the OU) is to use Piwik, an open source analytics package with an extensive API, to do the majority of the work, and then to work out how to insert this into the existing SOL*R workflow. So the scenario looks like this:

1a. Content owners are encouraged (as we do now) to use the BC Commons license generator to insert a license tag into their content. As part of the revised license generator, we insert an additional question – “Do you wish to enable tracking for this resource?”

1b. If they answer yes, the license code is amended with a small HTML comment:

<!--insert tracking code here-->

1c. The content owner then pastes the license code and tracking placeholder into their content as they normally would. We let them know that the more places they paste it into their content, the more detailed the tracking data will be. We can also note that this only works for web-based (e.g. HTML) content.

2. The content owner then uploads the finished product as they normally would.

3a. Each night a script (which I am writing now) runs on the server. It goes through the filesystem, and every time it finds the tracking placeholder it does the following (a rough sketch of the script appears after this list):

  • based on the file’s location in the filesystem, deconstructs the UUID assigned to it in SOL*R
  • uses the UUID to get the resource name from SOL*R through the Equella web services
  • re-constructs the resource’s home URL from its UUID
  • sends both of these to the Piwik web service, which in turn creates a new tracking site as well as the JavaScript to insert in the resource
  • finally, writes this JavaScript where the tracking placeholder was.
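
To make this concrete, below is a rough sketch (in Python) of what that nightly script could look like. It is not the actual implementation: the filestore path, the rule for recovering the UUID from a file’s location, the resource URL scheme, the Equella lookup, and the Piwik install URL and token are all stand-ins, and the SitesManager.addSite call is written against the Piwik API as I understand it, so details may differ by version.

    import json
    import os
    import urllib.parse
    import urllib.request

    PLACEHOLDER = "<!--insert tracking code here-->"
    CONTENT_ROOT = "/path/to/solr/filestore"       # hypothetical filestore root
    PIWIK_BASE = "https://analytics.example.org/"  # hypothetical Piwik install
    PIWIK_TOKEN = "replace-with-token_auth"        # Piwik API token

    def uuid_from_path(path):
        """Assumption: the resource UUID can be recovered from the directory layout."""
        return os.path.basename(os.path.dirname(path))

    def resource_name_from_equella(uuid):
        """Stand-in for the lookup through the Equella web services."""
        return "Resource " + uuid  # real call elided

    def create_piwik_site(name, url):
        """Register a new tracked 'site' in Piwik and return its numeric id."""
        query = urllib.parse.urlencode({
            "module": "API",
            "method": "SitesManager.addSite",
            "siteName": name,
            "urls": url,
            "format": "json",
            "token_auth": PIWIK_TOKEN,
        })
        with urllib.request.urlopen(PIWIK_BASE + "index.php?" + query) as resp:
            return int(json.load(resp)["value"])

    def tracking_snippet(site_id):
        """Simplified stand-in for the per-site JavaScript snippet Piwik generates."""
        return "<!-- Piwik tracking code for site %d goes here -->" % site_id

    def run():
        for dirpath, _dirs, files in os.walk(CONTENT_ROOT):
            for filename in files:
                if not filename.lower().endswith((".html", ".htm")):
                    continue
                path = os.path.join(dirpath, filename)
                with open(path, encoding="utf-8", errors="replace") as fh:
                    text = fh.read()
                if PLACEHOLDER not in text:
                    continue
                uuid = uuid_from_path(path)
                name = resource_name_from_equella(uuid)
                home_url = "https://solr.example.org/items/" + uuid  # hypothetical URL scheme
                site_id = create_piwik_site(name, home_url)
                with open(path, "w", encoding="utf-8") as fh:
                    fh.write(text.replace(PLACEHOLDER, tracking_snippet(site_id)))

    if __name__ == "__main__":
        run()

A real version would also need to remember which resources already have a Piwik site so it does not register them twice.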

4a. Finally, in modifying the SOL*R records, we also include a link to the new tracking results for each record that has it enabled.
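
As a side note, that link can be built from nothing more than the Piwik base URL and the site id created by the nightly script; the URL shape below follows how Piwik addresses a site in its own interface as far as I can tell, so treat it as a sketch rather than gospel.

    PIWIK_BASE = "https://analytics.example.org/"  # same hypothetical install as above

    def tracking_results_link(site_id):
        """Link to the 'stock' Piwik reports for one resource's tracking site."""
        return (PIWIK_BASE + "index.php?module=CoreHome&action=index"
                "&idSite=%d&period=month&date=today" % site_id)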

4b. For tracking data, the main things we will get are (a sample API query follows the list):

  • what new servers this content lives on
  • how many times each page of content in the resource has been viewed (depending on how extensively they have pasted the tracking code), both total and unique views
  • other details about the end users of the content, for instance their location and other client details
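
To give a sense of where those numbers would come from, here is a minimal sketch of pulling them straight from the Piwik reporting API. The method names (VisitsSummary.get, Actions.getPageUrls, UserCountry.getCountry) reflect the Piwik API as I understand it and may vary by version; the base URL, token, and site id are the same stand-ins used above.

    import json
    import urllib.parse
    import urllib.request

    PIWIK_BASE = "https://analytics.example.org/"  # hypothetical Piwik install
    PIWIK_TOKEN = "replace-with-token_auth"

    def piwik_report(method, site_id, period="month", date="today"):
        """Fetch one report from the Piwik reporting API as parsed JSON."""
        query = urllib.parse.urlencode({
            "module": "API",
            "method": method,
            "idSite": site_id,
            "period": period,
            "date": date,
            "format": "json",
            "token_auth": PIWIK_TOKEN,
        })
        with urllib.request.urlopen(PIWIK_BASE + "index.php?" + query) as resp:
            return json.load(resp)

    # total and unique visits, per-page views, and visitor locations
    visits = piwik_report("VisitsSummary.get", site_id=5)
    pages = piwik_report("Actions.getPageUrls", site_id=5)
    countries = piwik_report("UserCountry.getCountry", site_id=5)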

I ran a test last week. This resource has a tracking code in it. The “stock” reports for this resource are at http://u.nu/3q66d. It should be noted that we are fully able to customize a dashboard that shows only the *useful* reports (without all the cruft), as well as potentially incorporate the data from inside Equella on resource views / license acceptances. One of the HUGE benefits of using the SOL*R UUID in the tracking is that it is consistent both inside and outside of SOL*R.

I am pretty happy with how this is working so far. While I have expressed numerous times that I think the repository model is flawed for a host of reasons, to the extent that it can be improved, this starts to provide content owners (and funders) with details on how often resources are being used after they are downloaded, and (much like links and trackbacks in blogs) it offers content owners a way to follow up with re-users, to start conversations that are currently absent.

But… I can hear the objections already. Some are easy to deal with: we plan to implement this in such a way that it will not be totally dependent on JavaScript. Others are much stickier: does this infringe on the idea of “openness”? What level of disclosure is required? (The last question especially, given that potentially 2nd- and 3rd-generation re-users will be sending data back to the original server if the license remains intact.)
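
On the JavaScript point, the rough idea is for the nightly script to also drop in Piwik’s image-beacon fallback inside a noscript tag, so basic view counts still get recorded when JavaScript is off. The sketch below assumes Piwik’s tracking endpoint (piwik.php) and its idsite/rec parameters, with the same hypothetical install URL as above.

    PIWIK_BASE = "https://analytics.example.org/"  # hypothetical Piwik install

    def noscript_fallback(site_id):
        """Image-beacon fallback for clients without JavaScript."""
        return ('<noscript><p><img src="%spiwik.php?idsite=%d&rec=1" '
                'style="border:0" alt="" /></p></noscript>' % (PIWIK_BASE, site_id))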

I do want to respect these concerns, but at the same time, I wonder how valid they are. You are reading this content right now, and it has a number of “web bugs” inserted in it to track usage, yet it is shared under a license that permits reuse. Even if tracking is seen as a “cost,” it seems like a small one to pay, with a large potential benefit in terms of reinforcing the motivations of the people who have shared. But what do you think? Setting aside for a second arguments about “what is OER?” and “the content’s not important,” does this seem like a problem to you? Would you be less likely to use content like this if you knew it sent usage data back? Would anonymizing the data (something Piwik can easily do) ease your mind about this?