Technical Approaches

There are numerous technical challenges and milestones relating to the RISE project. This post addresses some of the most important aspects.

Database Structure

The database’s design is probably the most important aspect of the RISE Project’s development. As a heavily used source of data in the system, it has to be capable of generating the required recommendation types as quickly and efficiently as possible.

The data in this database initially came from an archive of access logs generated by EZProxy. Going forward, it also has to accommodate data generated by the MyRecommendations web service. The database’s schema, which can be found on the Technical Resources page, is therefore designed to accept and use data from both of these sources effectively.

Throughout development it has also been imperative to bear in mind that some of the data contained within this database is to be released publicly, and as such must be useful to external entities while also being fully functional internally. One of the main concerns here is ensuring anonymity, which is described in further detail on the Technical Resources page.

Parsing Data

As mentioned in the previous section, the system has to accept data from gzipped EZProxy log files. This is done via a PHP parser which extracts the relevant information. Some of this information is then used to gather further details via the EBSCO Discovery Service (EDS) API, which stores full information on the resources available to the OU.

The data received by the parser is in the following format:

<REMOTE_HOST>|||<DATE_TIME>|||<USER_ID>|||<HTTP_REQUEST>
|||<HTTP_REFERER>|||<HTTP_RESPONSE>|||<RESPONSE_SIZE>
|||<SESSION_ID>

With the exception of the response size variable, all of the above are used by the database in the process of generating recommendations; more specific uses for each are outlined later in this post.
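The field layout above can be sketched in code. The project's parser is written in PHP; the following is only a minimal Python illustration, and it assumes each record occupies a single line in the log file (the line breaks in the format shown above being for display only). The names `LogEntry`, `parse_log_line` and `read_log` are hypothetical and do not come from the RISE codebase.

```python
import gzip
from typing import Iterator, NamedTuple, Optional

FIELD_SEPARATOR = "|||"


class LogEntry(NamedTuple):
    """One log record, with fields in the order given by the format above."""
    remote_host: str
    date_time: str
    user_id: str
    http_request: str
    http_referer: str
    http_response: str
    response_size: str  # parsed, but not used when generating recommendations
    session_id: str


def parse_log_line(line: str) -> Optional[LogEntry]:
    """Split one line into its eight fields; return None if it is malformed."""
    parts = line.rstrip("\r\n").split(FIELD_SEPARATOR)
    if len(parts) != len(LogEntry._fields):
        return None
    return LogEntry(*parts)


def read_log(path: str) -> Iterator[LogEntry]:
    """Yield well-formed entries from a gzipped log file, skipping bad lines."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            entry = parse_log_line(line)
            if entry is not None:
                yield entry
```

Keeping every field as a string defers any interpretation (dates, status codes) to later stages, which keeps this step tolerant of unexpected values.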

Notably, at this stage no variable contains information relating to a particular resource. The most salient variable is therefore HTTP_REQUEST, from which we extract the requested URL’s ‘AN’ parameter. This parameter, if present, contains an Accession Number, which is a resource identifier used by EDS (EBSCO Discovery Service, used to index resources). Using this AN (an 8-digit integer), the parser then requests further resource details from the EDS API.
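As a rough illustration of the extraction step, the sketch below pulls the ‘AN’ parameter out of a request string. It is written in Python rather than the project's PHP, and it rests on several assumptions: that HTTP_REQUEST has the usual "<METHOD> <URL> <PROTOCOL>" shape, that the parameter name is the upper-case `AN`, and that a valid value is exactly eight digits as described above. The function name and the example URLs are hypothetical.

```python
import re
from typing import Optional
from urllib.parse import parse_qs, urlsplit

# Assumption taken from the text: an Accession Number is exactly eight digits.
AN_PATTERN = re.compile(r"\d{8}")


def extract_accession_number(http_request: str) -> Optional[str]:
    """Return the 'AN' query parameter from a request string, or None."""
    # Expect "<METHOD> <URL> <PROTOCOL>"; otherwise treat the input as a bare URL.
    parts = http_request.split()
    url = parts[1] if len(parts) >= 2 else http_request
    query = parse_qs(urlsplit(url).query)
    for value in query.get("AN", []):
        if AN_PATTERN.fullmatch(value):
            return value
    return None
```

The value is returned as a string rather than an integer so that any leading zeros survive until it is passed on to the EDS API.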

The EDS API returns resource data in an XML format, an example of which can be found here.

From this XML result the parser extracts the required resource information, such as Name, ISSN/DOI, Author(s) and Publication Dates. This data is then used to search for matching metadata from the Crossref service.

This Crossref data, combined with the data extracted from the log file entries, is then formatted appropriately and inserted into the RISE database.

A flowchart depicting the logic behind this parser can be found on the Technical Resources page.

Authentication

Part of this project is to develop and release a Google Gadget that provides functionality similar to the MyRecommendations interface in a lighter and more portable form.

It is important to note that The Open University’s online e-resource collections are only to be accessed by current staff and students. Any online service that aims to index these resources must therefore perform its own suitable authentication and authorization of users. On pages hosted on Open University servers this is accomplished using SAMS, which is a proprietary server-side authentication system.

The initial plan for the Google Gadget was to authenticate using a token-based method similar to the following:

  1. The user is passed to a SAMS-protected page hosted on OU servers.
  2. The SAMS-protected page generates a hash token, stores it in the database, and redirects the user to a gadget authentication page with the token as a URL parameter.
  3. The gadget authentication page detects this URL parameter and sets a cookie on the user’s machine that is accessible by the gadget itself.
  4. When a user accesses the gadget, the token stored in the cookie is checked against the database entry stored in step 2.
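The token handling in steps 2 and 4 of this initial plan can be sketched as follows. This is a hypothetical Python illustration only: an in-memory dictionary stands in for the database table, a random token from the `secrets` module is used in place of the unspecified hash token, and the function names and the `token` URL parameter name are assumptions. It does not cover SAMS itself, cookie handling, or token expiry.

```python
import secrets
from typing import Dict, Optional
from urllib.parse import urlencode

# Stand-in for the database table from step 2 (hypothetical: token -> user id).
token_store: Dict[str, str] = {}


def issue_token(user_id: str) -> str:
    """Step 2: create an unguessable token and record it for this user."""
    token = secrets.token_urlsafe(32)
    token_store[token] = user_id
    return token


def build_redirect_url(gadget_auth_url: str, token: str) -> str:
    """Step 2: redirect target carrying the token as a URL parameter."""
    return gadget_auth_url + "?" + urlencode({"token": token})


def verify_token(cookie_token: str) -> Optional[str]:
    """Step 4: return the user id for a known token, or None if unknown."""
    return token_store.get(cookie_token)
```

A random token is used here because it carries no information about the user; whatever token scheme was originally intended, the check in step 4 reduces to a lookup against what was stored in step 2.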

Fortunately, the RISE Project was able to host the gadgets directly on Open University servers, and as such can use a much more efficient and secure authentication approach:

  1. If the user isn’t authenticated with SAMS, the gadget page displays a link (to open in a new window) to a SAMS-protected page.
  2. In order to view this page, a user must be successfully authenticated. When viewed, the page therefore instructs the user to close the external window and return to the gadget.
  3. The gadget will now be able to access the user’s credentials (as a successful SAMS authentication has taken place), perform searches and provide recommendations.

Tracking & Analytics

The MyRecommendations and Google Gadget aspects of the RISE Project use AJAX and JavaScript within their UIs, both for displaying search results and for displaying recommendations. This makes it necessary to implement additional tracking measures that aren’t included by default with analytics implementations.

The main package in use for the RISE Project is Google Analytics.

As an example, if a user lands on the MyRecommendations interface, their initial pageview is tracked. Their searches, however, use AJAX to asynchronously fetch the results without having to reload the static parts of the interface. Because the actual page isn’t being reloaded (the Google Analytics JavaScript isn’t being requested again by the user), it’s necessary to manually “push” a pageview to the Google Analytics servers with information about the current action.

In the above example, we would push a pageview with the path “/search/<search terms>”. This allows users of the Google Analytics interface not only to see the number of actual searches, but also to drill down and see which pages have been requested within the /search/ section (i.e. the most commonly used search terms).
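The push itself happens in browser JavaScript, but the construction of the virtual page path can be illustrated separately. The Python sketch below shows one way the “/search/<search terms>” path might be built. The function name is hypothetical, and trimming and percent-encoding the search terms are assumptions; the post does not say how the terms are encoded.

```python
from urllib.parse import quote


def search_pageview_path(search_terms: str) -> str:
    """Build the virtual page path reported for one AJAX search."""
    # Encode every reserved character, including "/", so that a search term
    # cannot introduce extra path levels under /search/ in the reports.
    return "/search/" + quote(search_terms.strip(), safe="")
```

Keeping all searches under a single /search/ prefix is what makes the drill-down described above possible, since the analytics interface can group pages by path.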

This approach is also used to track searches on the Google Gadget interface.

Providing Recommendations

The recommendation types outlined in previous posts are generated based on the logged-in user’s credentials, and resources which may have relationships with said credentials.

Relationships are stored in tables within the database. An example of a relationship stored within one of these tables is as follows:

Field Name     Value
course_id      32
resource_id    54645
value          14

This example depicts a relationship between resource 54645 and course 32, having a value of 14. This means people on course 32 have given resource 54645 a value of 14. Values are assigned based on resource views and subsequent relevancy ratings (if available).

For example, a user following such a recommendation would increment the above relationship to 15 by simply following the link, indicating at least the resource title was somewhat interesting to the user.

If this user then chooses to rate the resource, each rating choice would result in the following relationship value:

Rating Choice      Resulting Relationship Value    Change (after +1 for resource visit)
Very Useful        16                              +1
Somewhat Useful    15                              0
Not Useful         13                              -2

From the table above, it is evident that the logic is weighted in favour of the resource being relevant (i.e. the value will always go up by at least 1, unless ‘Not Useful’ is selected). This is based on the theory above that a resource link appearing useful enough to follow gives some indication that the resource is more likely to be relevant than irrelevant. Giving the ‘Not Useful’ rating a weight of -2 means the least useful resources can still ‘sink’ and appear less frequently. This approach also ensures that the actions of users who choose not to provide feedback can still be used by the recommendations engine.
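The arithmetic described above can be written out as a short sketch. The weights (+1 for a visit, then +1, 0 or -2 for a rating) come from the table; the function names and the rating keys are hypothetical and chosen only for illustration.

```python
# Weights taken from the table above; the key names are illustrative.
VISIT_INCREMENT = 1
RATING_ADJUSTMENTS = {
    "very_useful": 1,
    "somewhat_useful": 0,
    "not_useful": -2,
}


def value_after_visit(current_value: int) -> int:
    """Following a recommendation link always adds one to the relationship."""
    return current_value + VISIT_INCREMENT


def value_after_rating(visited_value: int, rating: str) -> int:
    """Apply the rating adjustment on top of the value reached after the visit."""
    if rating not in RATING_ADJUSTMENTS:
        raise ValueError("unknown rating: " + rating)
    return visited_value + RATING_ADJUSTMENTS[rating]
```

Starting from the example value of 14, a visit gives 15, and a subsequent rating gives 16, 15 or 13 respectively, which matches the table.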

It is also important to note that the ratings reflect only the relevance of a resource to a particular user, and not the quality of the resource itself. The ability to rate both relevance and quality may be implemented in the future.

There is a relationship table for each of the three recommendation types. All are controlled and manipulated by the actions of users in the RISE system and by the access logs generated by other library systems.

