Text and data mining

Text and data mining (TDM) is the automatic (bot) analysis and extraction of information from large numbers of documents. TDM through a common API is more effective than screen-scraping, which is inefficient, error-prone, and fragile. Screen-scraping puts an unnecessary load on member sites (downloading HTML, CSS, JavaScript, and other superfluous web assets), often breaks when members redesign their websites, even slightly, and is typically tied to specific members’ page layouts, so it needs to be adapted on a member-by-member basis.

Using the DOI as the basis for TDM in a common API provides several benefits:

  • An easy way to de-duplicate documents that may be found on several sites. Processing the same document on multiple sites could easily skew TDM results, and traditional techniques for eliminating duplicates (such as hashes) will not work reliably if the document exists in several representations (such as PDF, HTML, and ePub) or versions (such as the author’s accepted manuscript and the version of record).
  • Persistent provenance information. Using the DOI as a key allows researchers to retrieve and verify the provenance of the items in the TDM corpus many years into the future, when traditional HTTPS URLs may already have broken.
  • An easy way to document, share, and compare corpora without having to exchange the actual documents.
  • A mechanism to ensure the reproducibility of TDM results using the source documents.
  • A mechanism to track the impact of updates, corrections, retractions, and withdrawals on corpora.
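
The de-duplication and corpus-comparison benefits above can be illustrated with a minimal Python sketch. It assumes each harvested document is a dictionary carrying a `doi` field; the field names, the sample values, and the `normalize_doi`, `dedupe_by_doi`, and `corpus_difference` helpers are hypothetical and not part of any API described here. The sketch relies on DOIs being case-insensitive, so keys are lower-cased before comparison.

```python
from typing import Dict, Iterable, List, Set, Tuple

# Common ways a DOI is written as a link or label (assumed list, not exhaustive).
_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(raw: str) -> str:
    """Reduce a DOI string to a bare, lower-cased form usable as a key."""
    doi = raw.strip().lower()
    for prefix in _PREFIXES:
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
            break
    return doi


def dedupe_by_doi(documents: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep the first document seen for each DOI, whatever its format or host."""
    seen: Dict[str, Dict[str, str]] = {}
    for doc in documents:
        seen.setdefault(normalize_doi(doc["doi"]), doc)
    return list(seen.values())


def corpus_difference(a: Iterable[str], b: Iterable[str]) -> Tuple[Set[str], Set[str]]:
    """Compare two corpora by DOI alone: (only in a, only in b)."""
    set_a = {normalize_doi(d) for d in a}
    set_b = {normalize_doi(d) for d in b}
    return set_a - set_b, set_b - set_a


# Hypothetical harvest: the same article found as PDF and HTML on two sites.
harvested = [
    {"doi": "10.5555/12345678", "format": "pdf", "host": "publisher.example"},
    {"doi": "https://doi.org/10.5555/12345678", "format": "html", "host": "repository.example"},
    {"doi": "doi:10.5555/ABCDEF", "format": "pdf", "host": "publisher.example"},
]
unique_documents = dedupe_by_doi(harvested)
```

Because the key is the DOI rather than a content hash, the PDF and HTML copies collapse into one entry even though their bytes differ, and two corpora can be compared by exchanging DOI lists only.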

Researchers are increasingly interested in performing TDM with scholarly content. This requires automated access to the full-text content of large numbers of articles. The format of the full-text content varies by member. Our metadata helps researchers get access to this content and enables members to provide it.

How TDM works

  1. A member deposits URLs for their full-text content and licenses or waivers (along with other publication metadata) with us.
  2. A researcher finds relevant content registered with us (such as journal articles) using a discovery service.
  3. The researcher retrieves metadata for each item of registered content, including license information.
  4. The researcher makes a full-text request from the member.
  5. The member checks the subscription rights of the researcher and returns the full text to them.
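
Steps 3 and 4 above can be sketched roughly as follows. This minimal Python example assumes the metadata for an item arrives as a JSON-like record with a `license` list and a `link` list whose entries carry `URL` and `intended-application` fields. That record shape, the sample values, and the `text_mining_links` helper are assumptions for illustration, so check the actual API response before relying on them.

```python
from typing import Any, Dict, List, Set

# Hypothetical metadata record for one registered item (all values are made up).
SAMPLE_RECORD: Dict[str, Any] = {
    "DOI": "10.5555/12345678",
    "license": [
        {"URL": "https://creativecommons.org/licenses/by/4.0/"},
    ],
    "link": [
        {
            "URL": "https://publisher.example/fulltext/10.5555/12345678.xml",
            "content-type": "application/xml",
            "intended-application": "text-mining",
        },
        {
            "URL": "https://publisher.example/check/10.5555/12345678.pdf",
            "content-type": "application/pdf",
            "intended-application": "similarity-checking",
        },
    ],
}


def text_mining_links(record: Dict[str, Any], allowed_licenses: Set[str]) -> List[str]:
    """Return full-text URLs meant for text mining, if the license is acceptable.

    An empty list means either no acceptable license was deposited or the
    member supplied no text-mining link for this item.
    """
    license_urls = {entry["URL"] for entry in record.get("license", [])}
    if not license_urls & allowed_licenses:
        return []
    return [
        link["URL"]
        for link in record.get("link", [])
        if link.get("intended-application") == "text-mining"
    ]


# Licenses this (hypothetical) researcher is prepared to mine under.
acceptable = {"https://creativecommons.org/licenses/by/4.0/"}
candidate_urls = text_mining_links(SAMPLE_RECORD, acceptable)
```

The returned URLs are only candidates. The full-text request itself still goes to the member, which checks the researcher's rights before returning any content.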

Researchers and text miners can access content URLs and license information via our API. If you are a member and would like to begin depositing URLs and access indicators, please contact us.

Page owner: Martyn Rittman   |   Last updated 2020-April-08