
Sep 19 2019 - 02:09pm
Text mining for scholarly resources

As part of the research team's work on the Search and Recommendation System, finding an efficient, legal, and dynamic technique for obtaining the metadata of scholarly resources plays an essential role. In this post, we share some useful resources on text mining for scholarly resources and discuss the text mining project at EdLab.


Resources that may be helpful 

Please check out this summary from MIT Libraries of application programming interfaces (APIs) for scholarly resources. Among these APIs, we want to emphasize CrossRef. MIT Libraries also provides a summary of other text and data mining (TDM) corpora. Among these, we want to emphasize the Google Books API.
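To give a feel for what a CrossRef lookup involves, here is a minimal sketch rather than our production code. It assumes the public CrossRef REST endpoint `https://api.crossref.org/works/<doi>` and the response fields `DOI`, `title`, `container-title`, and `author` as we understand them. The function names (`crossref_url`, `summarize`, `fetch_metadata`) and the choice of fields are our own illustrative assumptions.

```python
import json
import urllib.parse
import urllib.request

CROSSREF_WORKS = "https://api.crossref.org/works/"


def crossref_url(doi, mailto=None):
    """Build the CrossRef REST API URL for a single DOI."""
    url = CROSSREF_WORKS + urllib.parse.quote(doi, safe="/")
    if mailto:
        # Assumption: passing a contact address is the polite way to call the API.
        url += "?" + urllib.parse.urlencode({"mailto": mailto})
    return url


def summarize(record):
    """Pick a few common fields out of the "message" object of a response."""
    authors = [
        " ".join(p for p in (a.get("given"), a.get("family")) if p)
        for a in record.get("author", [])
    ]
    return {
        "doi": record.get("DOI"),
        # Titles and journal names arrive as lists; keep the first entry if any.
        "title": (record.get("title") or [None])[0],
        "journal": (record.get("container-title") or [None])[0],
        "authors": authors,
    }


def fetch_metadata(doi, mailto=None, timeout=10):
    """Fetch and summarize one record (needs network access)."""
    with urllib.request.urlopen(crossref_url(doi, mailto), timeout=timeout) as resp:
        return summarize(json.load(resp)["message"])
```

Calling `fetch_metadata("10.1000/xyz123")` with a real DOI in place of the placeholder would return a small dictionary. Splitting URL construction and parsing from the network call keeps the first two easy to check offline.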


Before text mining, researchers should also follow the license agreements for the different library resources and scrape responsibly. Check out this guide from the UC Berkeley Library (other sources about text mining are also available).



APIs for Search and Recommendation System projects in EdLab

Based on the URLs in the EZProxy logs (04/19/2018-07/21/2019), we captured 347232 links. The top vendors are ProQuest (81006, 23.33%), EBSCOhost (70551, 20.31%), Gale (49851, 14.35%), SAGE (39009, 11.23%), Taylor & Francis (22887, 6.15%), JSTOR (16408, 4.73%), and PsycNET (11336, 3.26%). Among all the URLs, 60110 (17.31%) are in OpenURL format and 77937 (22.44%) provide a DOI. Here is a summary of our current solution for these links:

  • OpenURL: we first try the CrossRef API to obtain the metadata. If that is not available, we take the main information directly from the fields contained in the URL.
  • DOI: we use the CrossRef API.
  • ProQuest (including Ebook Central): if the link cannot be handled via DOI or OpenURL, we can obtain the metadata by web scraping.
  • Other prominent vendors for which we have web-scraping solutions: SAGE, EBSCOhost, Gale, and JSTOR.
  • Most URLs related to search behavior: we can obtain the query keywords directly from the URL.
  • Most junk links (e.g., "http:///menu", which has no real meaning) from the most prominent vendors are discarded.
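
The routing described in the list above can be sketched roughly as follows. This is an illustration under stated assumptions, not our actual pipeline: the DOI pattern is a loose heuristic, OpenURL detection only covers the key/value style with `url_ver` and `rft.*` parameters, and the search parameter names in `SEARCH_KEYS` are hypothetical, since each vendor uses its own.

```python
import re
from urllib.parse import urlparse, parse_qs, unquote

# Loose DOI heuristic (an assumption): covers most modern DOIs, not all of them.
DOI_RE = re.compile(r"10\.\d{4,9}/[^\s?&#\"<>]+")

# Hypothetical search parameter names; real vendor URLs vary.
SEARCH_KEYS = ("q", "query", "searchTerm")


def classify(url):
    """Return a dict describing how a logged URL could be handled."""
    parsed = urlparse(url)
    if not parsed.netloc:
        # e.g. "http:///menu": no host, so there is nothing to resolve.
        return {"kind": "junk"}

    # A DOI anywhere in the (percent-decoded) URL can be sent to CrossRef.
    doi = DOI_RE.search(unquote(url))
    if doi:
        return {"kind": "doi", "doi": doi.group(0).rstrip(".,;)")}

    # OpenURL without a DOI: fall back to the fields carried in the URL.
    params = parse_qs(parsed.query)
    if "url_ver" in params or any(k.startswith("rft.") for k in params):
        fields = {k[4:]: v[0] for k, v in params.items() if k.startswith("rft.")}
        return {"kind": "openurl", "fields": fields}

    # Search pages: the query keywords sit in the URL itself.
    for key in SEARCH_KEYS:
        if key in params:
            return {"kind": "search", "keywords": params[key][0]}

    return {"kind": "other"}
```

Checking for a DOI before anything else mirrors the preference for CrossRef over URL-embedded fields. Links that end up as "other" are the ones that would need a vendor-specific solution.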


We do not recommend web scraping as a solution for all URL links because:

  • there are usually no standardized patterns across pages, and
  • depending on each vendor's license agreement, large-scale text mining may be rejected or even illegal.


Currently, we are focusing on the prominent vendors. For now, we only update the metadata that can be obtained without web scraping or bulk download. We are also looking into other content providers' in-platform tools.


Many vendors may be willing to provide a resource based on their contractual agreement. If you know of any, please contact us.

Posted in: Library, Research, Work Progress | By: Yi Chen | 99 Reads