Using Text Mining to apply thesaurus-based Indexing to digitized Print Material

Thomson Scientific: Rescuing “lost” data

Industry: Publishing

Customer challenge

..............................................................................

Thomson Scientific had 49 bound volumes of Biological Abstracts® covering 1926 to 1968 and needed to make the data searchable with the BIOSIS controlled vocabulary in order to offer its clients a new web product.


Solution

..............................................................................

The approach

  • Use entity extraction to obtain candidate terms from the titles and abstracts
  • Map the extracted entities to the BIOSIS vocabulary
  • Output the resulting indexing
  • Publish the new product
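The extraction and mapping steps above can be illustrated with a minimal sketch. It is not the Luxid or ITM implementation: the tiny vocabulary, the variant lists, and the function names `build_lookup`, `extract_candidates`, and `index_record` are all hypothetical, and simple dictionary matching stands in for a real entity-extraction engine.

```python
import re

# Hypothetical miniature controlled vocabulary: preferred term -> known variants.
# A real thesaurus such as the BIOSIS vocabulary is far larger and richer.
VOCABULARY = {
    "Escherichia coli": ["escherichia coli", "e. coli", "bacterium coli"],
    "Photosynthesis": ["photosynthesis", "photosynthetic"],
    "Enzymes": ["enzyme", "enzymes"],
}


def build_lookup(vocabulary):
    """Flatten the vocabulary into a variant -> preferred-term mapping."""
    lookup = {}
    for preferred, variants in vocabulary.items():
        for variant in variants:
            lookup[variant.lower()] = preferred
    return lookup


def extract_candidates(text, lookup):
    """Return every known variant that occurs in the text as a whole term."""
    lowered = text.lower()
    found = []
    for variant in lookup:
        # Require non-word characters (or text boundaries) on both sides
        # so that "enzyme" does not match inside "enzymes".
        pattern = r"(?<!\w)" + re.escape(variant) + r"(?!\w)"
        if re.search(pattern, lowered):
            found.append(variant)
    return found


def index_record(title, abstract, lookup):
    """Map extracted candidates to preferred terms and return sorted indexing."""
    candidates = extract_candidates(title + " " + abstract, lookup)
    return sorted({lookup[candidate] for candidate in candidates})


lookup = build_lookup(VOCABULARY)
terms = index_record(
    "Photosynthetic activity in E. coli",
    "Enzymes were measured.",
    lookup,
)
print(terms)
```

Mapping variants to a single preferred term is what keeps the resulting indexing consistent with the controlled vocabulary, whatever spelling the original abstract used.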

Key ITM Benefits

..............................................................................

Top Line Results

  • Project completed on schedule (5 months)
  • 1.9M indexed records
  • All records received at least the required minimum indexing
  • Throughput: 500 ms per item
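To put the throughput figure in perspective, the short calculation below estimates the pure processing time for the whole collection. The source does not say whether records were processed sequentially or in parallel, so the single-worker default is an assumption, and the function name `processing_days` is hypothetical.

```python
RECORDS = 1_900_000        # indexed records, from the results above
SECONDS_PER_ITEM = 0.5     # 500 ms per item, from the results above
SECONDS_PER_DAY = 86_400


def processing_days(records, seconds_per_item, workers=1):
    """Estimate wall-clock days, assuming work splits evenly across workers."""
    total_seconds = records * seconds_per_item / workers
    return total_seconds / SECONDS_PER_DAY


# Assumption: one worker running continuously.
print(f"{processing_days(RECORDS, SECONDS_PER_ITEM):.1f} days")
```

Under that assumption the run takes about 11 days of continuous processing, a small share of the 5-month project.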


Partner

..............................................................................

Intelligent Topic Manager (ITM) from MONDECA for:

  • BIOSIS thesaurus management (more than 2 million concepts in 6 thesauri)
  • Normalizing and publishing content metadata

Luxid from TEMIS for entity extraction on digitized content