Information retrieval
CONTENTS:
Information retrieval resources | Search Engine Standards Project | Case-based reasoning (CBR) | Case-based reasoning and the web | Identifying resources in distributed environments | "Invisible web" revealed | XML and metadata searching | Integrating multiple directories | Directories and XML | Digital imaging and cultural heritage | Indexing images report | Searchable video | Searchable video news | The cost of digitisation | TWAIN source site | Centre for Intelligent Information Retrieval | Image digitisation study | BCS Information Retrieval Specialist Group | Multimedia content analysis | Information retrieval resources | Library information portal | UK information retrieval research | Standards | Z39.50 | TREC-6 - Text Retrieval Conference
The Search Engine Standards Project has the remit to: "encourage all search engines to support some basic standard functions, which make it easier for researchers".
URL: http://searchenginewatch.internet.com/standards/990204.html
URL: http://searchenginewatch.internet.com/standards/index.html
Case-based reasoning (CBR) uses similarity measures and domain-specific knowledge for information retrieval and problem solving. The University of Kaiserslautern maintains a CBR home page that provides general information on CBR, its theory and applications, and links to related resources. In addition, the group carries out research into the use of CBR in web-based applications.
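Retrieval in a CBR system is typically a nearest-neighbour search over a case base using a weighted similarity measure. The sketch below is a minimal illustration of that idea only - the feature names, weights and cases are invented for the example and are not taken from the Kaiserslautern work.

```python
# Minimal case-based retrieval sketch: rank stored cases by weighted
# similarity to a query case. Feature names, weights and case data are
# purely illustrative.

def similarity(query, case, weights):
    """Weighted average of per-feature similarities (1.0 = identical)."""
    total = sum(weights.values())
    score = 0.0
    for feature, weight in weights.items():
        q, c = query.get(feature), case.get(feature)
        if isinstance(q, (int, float)) and isinstance(c, (int, float)):
            # Numeric features: similarity falls off with relative distance.
            score += weight * (1.0 - abs(q - c) / max(abs(q), abs(c), 1.0))
        else:
            # Symbolic features: exact match or nothing.
            score += weight * (1.0 if q == c else 0.0)
    return score / total

def retrieve(query, case_base, weights, k=3):
    """Return the k most similar cases (the classic CBR 'retrieve' step)."""
    ranked = sorted(case_base, key=lambda c: similarity(query, c, weights), reverse=True)
    return ranked[:k]

# Invented fault-diagnosis cases for illustration.
case_base = [
    {"fault": "no_dial_tone", "line_age": 12, "fix": "replace_socket"},
    {"fault": "no_dial_tone", "line_age": 2,  "fix": "reset_exchange"},
    {"fault": "crackling",    "line_age": 15, "fix": "replace_cable"},
]
weights = {"fault": 2.0, "line_age": 1.0}
print(retrieve({"fault": "no_dial_tone", "line_age": 10}, case_base, weights, k=2))
```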
CBRnet is a site currently "under construction" by a number of researchers concerned with the utility of case-based reasoning (CBR) on the web. A number of papers are available from the site, including a couple that explore the possibility of a distributed CBR system using XML, and a review paper outlining the CBR methodology and its potential role in e-commerce.
URL: CBRnet http://www.cs.tcd.ie/Conor.Hayes/cbrnet/
URL: papers http://www.cs.tcd.ie/Conor.Hayes/cbrnet/publish.html
People and Resource Identification in Distributed Environments (PRIDE) is an EU funded research project exploring the use of directory services to provide "environment" knowledge to distributed information services. It will investigate metadata and protocol issues in several service scenarios.
The Search Engine Report of July 6, 1999 (an e-zine which monitors search engine technologies) ran a brief article entitled: "The Invisible Web Revealed" which highlights how to search for resources on the web that are currently "locked away" in databases. Such information is usually invisible to search engines and the article explains how such resources may be made more "visible" and hence accessible. The article gives links to some initial "catalogues" of such databases.
URL: current issue http://searchenginewatch.com/sereport/current.html
URL: Search Engine Report http://searchenginewatch.com/
XML.com has published a two-part tutorial showing how XML can be used in the development of metasearch engines - tools that aggregate data from several web databases and present the results in a consistent way for users. The demonstration which accompanies the article requires the use of IE 5 because the integrated XML results are processed on the client side.
URL: http://xml.com
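The tutorial itself relies on IE 5's client-side XML processing; as a rough illustration of the same aggregation idea done in script (the feed layouts, field names and URLs below are invented, not taken from the tutorial), the sketch parses result records from two differently structured XML sources and normalises them into one consistent list.

```python
# Illustrative metasearch aggregation: parse result records from two
# differently structured XML feeds and normalise them into one common
# record format. The XML layouts and field names are invented.
import xml.etree.ElementTree as ET

FEED_A = """<results>
  <item><title>XML indexing</title><link>http://example.org/a1</link></item>
  <item><title>Metadata searching</title><link>http://example.org/a2</link></item>
</results>"""

FEED_B = """<searchResponse>
  <hit url="http://example.org/b1" name="Search engine standards"/>
</searchResponse>"""

def parse_feed_a(xml_text):
    root = ET.fromstring(xml_text)
    return [{"title": i.findtext("title"), "url": i.findtext("link")}
            for i in root.findall("item")]

def parse_feed_b(xml_text):
    root = ET.fromstring(xml_text)
    return [{"title": h.get("name"), "url": h.get("url")}
            for h in root.findall("hit")]

# Merge into a single, consistently structured result list.
merged = parse_feed_a(FEED_A) + parse_feed_b(FEED_B)
for record in merged:
    print(f'{record["title"]}: {record["url"]}')
```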
The GoXML Search Engine carries out XML-specific searching by adding a second step to the query: a popup menu of "context", the markup tag enclosing the text to be searched.
URL: GoXML http://www.goxml.com/
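GoXML's "context" step amounts to restricting a full-text match to a chosen element. A minimal sketch of that idea follows, using an invented document and invented tag names rather than GoXML's own data.

```python
# Sketch of XML context-restricted search: only match query text that
# occurs inside the chosen element ("context"). Document and tag names
# are invented for illustration.
import xml.etree.ElementTree as ET

DOC = """<catalogue>
  <book><title>Information Retrieval</title><author>Smith</author></book>
  <book><title>Databases</title><author>Retrieval Jones</author></book>
</catalogue>"""

def search_in_context(xml_text, context_tag, term):
    """Return the text of elements named context_tag that contain term."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.iter(context_tag)
            if el.text and term.lower() in el.text.lower()]

# "retrieval" appears twice as free text, but restricting the context
# to <title> returns only the title occurrence.
print(search_in_context(DOC, "title", "retrieval"))   # ['Information Retrieval']
print(search_in_context(DOC, "author", "retrieval"))  # ['Retrieval Jones']
```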
Isode (an X.500 directory development company) have published a white paper, written by Steve Kille, CEO, which examines the primary techniques for integrating multiple directories within large organisations. The paper questions the use of the term "meta-directory", observing the confusion that surrounds the meta-directory concept. It discusses the benefits of considering the available integration techniques individually, and determining long-term enterprise directory requirements at the outset.
There appears to be confusion amongst the XML development community concerning the relative merits of the Directory Services Markup Language (DSML) and Novell's DirXML proposal. Details of both recent proposals are available on the web. As reported previously, a number of vendors have established the Directory Interoperability Forum with the intention of speeding the development and deployment of directory-enabled applications that run across different computing environments. The member companies will work closely with industry associations such as the Internet Engineering Task Force (IETF), The Open Group and the DMTF to speed the enhancement and adoption of directory standards.
URL: DirXML http://www.novell.com/products/nds/dirxmlfaq.html
URL: DSML http://www.dsml.org
URL: Directory Interoperability Forum http://www.directoryforum.org
The Digital Imaging Initiative at the University of Illinois, USA, is exploring the use of multimedia and network technology to promote preservation and provide widespread access to "cultural" collections. It runs an excellent site which provides a wealth of resources based on current project work.
Alongside detailed descriptions of the projects and their aims, there are resources and links to related research, along with appraisals and critiques of base technologies and their potential utility; for example, there are links to some content-based retrieval sites. There is also a very useful list of online information points and a section entitled "Questions to Consider Before Beginning an Image Database Project".
The design of the site and its content closely reflect the goals of the program, which have many similarities to those expressed in proposals relating to Information Engineering, including:
- Establish "best practices" for digitising various classes of visual and textual materials.
- Develop multimedia databases that deliver visual resources and other media in innovative ways.
- Conduct research on the ways in which visual information (photographs, drawings, illustrations, etc.) is used in the digital environment.
URL: http://images.grainger.uiuc.edu/
URL: information points http://rod.grainger.uiuc.edu:8001/links/
"Description and indexing of images: report of a survey of ARLIS members, 1998/99", presents the findings of a survey of UK art and picture libraries into the description and indexing of images, carried out within the Institute for Image Data Research, University of Northumbria at Newcastle, UK, during the period November 1998 to January 1999. The report covers background information on the context of the survey; the methodology adopted; presentation and discussion of the findings; and, a summary and conclusions.
In the Autumn of 1998, the Institute for Image Data Research was commissioned by the Joint Information Systems Committee (JISC) of the Higher Education Funding Councils to prepare a state-of-the-art report on content-based image retrieval, with particular emphasis on the capabilities and limitations of current technology and the extent to which it is likely to prove of practical use to users in higher education and elsewhere. The ARLIS survey was carried out to inform a section of the report dealing with current techniques for image and video retrieval. It also gave the researchers the opportunity to explore some of the issues around the management of image collections and current cataloguing and indexing practices.
URL: Content-Based Image Retrieval http://www.unn.ac.uk/iidr/CBIR/cbir.html
URL: report http://www.unn.ac.uk/iidr/ARLIS/
Pictron is a US company developing software to make video searchable on the Internet. The company has developed proprietary video analysis and "artificial intelligence" technology to segment videos into "meaningful clips" based on visual and audio cues.
Video is automatically indexed based on scene changes, text transcripts, human faces and names, and objects in the scene. The company claims that "this process makes video searchable and interactive". 27/06/00
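Pictron's analysis technology is proprietary and not described in detail, but a common baseline for the scene-change component of such systems is frame-to-frame histogram differencing. The sketch below illustrates only that generic baseline, with synthetic greyscale frames and an arbitrary threshold.

```python
# Illustrative scene-change (shot boundary) detection by histogram
# differencing between consecutive frames. Pictron's actual method is
# proprietary; this is a generic baseline. Frames are greyscale arrays.
import numpy as np

def grey_histogram(frame, bins=32):
    hist, _ = np.histogram(frame, bins=bins, range=(0, 256))
    return hist / hist.sum()

def scene_changes(frames, threshold=0.4):
    """Return frame indices where the histogram difference exceeds threshold."""
    cuts = []
    prev = grey_histogram(frames[0])
    for i, frame in enumerate(frames[1:], start=1):
        cur = grey_histogram(frame)
        # L1 distance between normalised histograms, in the range [0, 2].
        if np.abs(cur - prev).sum() > threshold:
            cuts.append(i)
        prev = cur
    return cuts

# Synthetic example: a dark "scene" followed by a bright one.
dark = [np.full((120, 160), 30, dtype=np.uint8) for _ in range(5)]
bright = [np.full((120, 160), 200, dtype=np.uint8) for _ in range(5)]
print(scene_changes(dark + bright))  # [5] - a cut where the scenes meet
```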
NewsHunter.net provides a subscription-based video search engine for US political coverage. NewsHunter's proprietary software, architecture, database and GUI have been combined with Virage video cataloguing software to provide "real-time, fully indexed, searchable video of news and public policy broadcasts".
Currently NewsHunter gives subscribers instant access to television images and data from the White House, the House and Senate Floor, and from news feeds. 27/06/00
URL: NewsHunter.net http://www.newshunter.net/
"Digitisation: How Much Does it Really Cost?", is the title of a paper given at the Digital Resources for the Humanities 1999, Conference, held in September 1999. The paper looks at the factors which influence the cost of undertaking a digitisation project. Practical tips on ways in which to minimise costs are included, as well as a matrix exploring the relative cost factors of digitising differing media. It is available in pdf format, along with a number of other papers from the UK's Higher Education Digitisation Service (HEDS) web site.
URL: "Digitisation: How Much Does it Really Cost" http://heds.herts.ac.uk/HEDCinfo/Papers/drh99.pdf
URL: papers http://heds.herts.ac.uk/HEDCinfo/Papers.html
URL: HEDS home page http://heds.herts.ac.uk/
The TWAIN Working Group is a not-for-profit organisation which represents the imaging industry. TWAIN's purpose is to provide and foster a universal public standard which links applications and image acquisition devices. The ongoing mission of the organisation is to continue to enhance the standard to accommodate future technologies. TWAIN is the standard that most scanners use to connect to software packages.
The US-based Center for Intelligent Information Retrieval (CIIR) is developing tools to provide access to large, heterogeneous, distributed, text and multimedia databases. The site provides a wealth of information and resources relating to research in the areas of:
- text representation and retrieval strategies,
- distributed retrieval,
- translingual retrieval,
- document filtering and routing,
- information extraction,
- case-based reasoning,
- agent architectures,
- and image retrieval.
The research includes both low-level systems issues such as the design of protocols and architectures for distributed search, as well as more human-centered topics such as user interface design, visualization and data mining with text, and multimedia retrieval.
The site includes software demos and research information under the following headings: information retrieval; topic detection & tracking; multimodal IR; multimedia indexing and retrieval; multi-agent systems; case-based reasoning; and natural language processing. There is also a well-stocked list of research papers, many of which are available as downloadable PostScript files.
URL: http://ciir.cs.umass.edu/index.html
URL: research papers http://ciir.cs.umass.edu/cgi-bin/w3-msql/publication_database/publications.html
Weiguo Fan of the University of Michigan Business School has created a web site devoted to text data mining. The site has links to papers on the subject, tools and research projects.
A feasibility study prepared by the UK's Higher Education Digitisation Service (HEDS) for the JISC Image Digitisation Initiative (JIDI) is now available on the HEDS web site. The study offers practical solutions for the issues presented by the range of image types involved in the JIDI. It details the background, method, technical baselines, findings and results, proposed production processes, procedures and potential costs.
The study looks at the particular challenges presented by each of the sample collections - challenges such as colour matching, photographs, textiles, and large-format or extremely fragile originals. Issues such as the use of photographic surrogates, transport and handling, and metadata requirements are also presented.
The study's authors believe it will be of interest to any project team preparing an image digitisation project, as it describes many of the issues that should be considered.
URL: study http://heds.herts.ac.uk/Guidance/JIDI_fs.html
URL: JIDI http://www.ilrt.bris.ac.uk/projects/mru.html#jidi
The British Computer Society's Information Retrieval Specialist Group includes links to significant events in IR, associated groups in the UK and some IR resources.
URL: http://irsg.eu.org/
A broadcast news analysis project being run by the Mitre Corporation addresses the challenges users face in processing increasing volumes of digital imagery, audio, video and text. It recognises that today's manual content creation techniques frequently result in inconsistent, error-prone and cumbersome products. The research highlights the requirement for automated mechanisms to capture, annotate, summarize, browse, search, visualize and disseminate multimedia information. In addition to project information, the site includes papers, presentations, publications and links to related external web sites.
URL: http://www.mitre.org/support/papers/mm_interact/index03.html
NEC Corporation researchers have developed the first experimental prototype of a highly accurate graphics search engine capable of locating digital images, photographs and video scenes regardless of data format. The search engine technology has been selected as part of an experimental model for the next-generation MPEG-7 format, and is being proposed as a contender for the basis of this world-wide standard.
According to the company, the prototype technology can distinguish visual features in images from any format, is 30 times faster, and is able to locate images 10 times more accurately than any previous technology. NEC hopes that the technology will find its way into next-generation digital broadcasting applications.
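NEC has not disclosed its descriptors here, but the general principle behind format-independent visual search is to compare compact feature vectors extracted from the images rather than the encoded pixel data. The sketch below illustrates only that general principle, using a simple colour-histogram intersection on synthetic images; it is not NEC's method.

```python
# Generic content-based image matching sketch: describe each image by a
# colour histogram and rank a collection by histogram intersection with
# the query. Illustrates the general principle only, not NEC's
# proprietary descriptors. Images are H x W x 3 RGB arrays.
import numpy as np

def colour_histogram(image, bins=8):
    """Concatenated per-channel histograms, normalised to sum to 1."""
    hists = [np.histogram(image[..., c], bins=bins, range=(0, 256))[0]
             for c in range(3)]
    hist = np.concatenate(hists).astype(float)
    return hist / hist.sum()

def rank_by_similarity(query_img, collection):
    """Rank (name, image) pairs by histogram intersection with the query."""
    q = colour_histogram(query_img)
    scored = [(name, np.minimum(q, colour_histogram(img)).sum())
              for name, img in collection]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Synthetic images: a mostly-red query should match the reddish image best.
red = np.zeros((64, 64, 3), dtype=np.uint8); red[..., 0] = 220
blue = np.zeros((64, 64, 3), dtype=np.uint8); blue[..., 2] = 220
print(rank_by_similarity(red, [("reddish", red), ("bluish", blue)]))
```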
This site provides a collection of online resources for research in the field of information retrieval and information extraction from the web. The pages contain materials related to "state-of-the-art" IR and IE techniques used for and on the web. The site also runs a mailing list.
URL: http://www.mri.mq.edu.au/~einat/web_ir/
URL: mailing list http://www.eGroups.com/list/webir/
Library Link is styled as a free online information and discussion forum for Librarians and Information Professionals worldwide. The site is divided into sections which provide tips and resources on: providing improved library services; managing information; news on events and listserv summaries. There is a discussion forum on the site and a resources section with links to library related sites.
LibrarySpot.com, an award-winning vertical information portal of the best library and reference resources on the web, was one of 33 Web sites selected by Forbes magazine as a "Forbes Favorite" web site in the publication's new "Best of the Web guide". The site was selected alongside sites such as ESPN.com, CNET.com and Yahoo Finance, as the "best of the best" in the reference category.
The guide evaluated more than 5,000 sites on five criteria: design, navigation, content, speed and customisation, with 1,200 sites selected for the online guide - 33 of these were designated as "Forbes Favorites". 29/02/00
The UK's Library and Information Commission published details, during December 1999, of the results of their call for proposals in Information Retrieval Research. Descriptions of the eleven successful projects have been published on the web.
Z39.50 is an application layer protocol to facilitate the interconnection of computer systems, particularly intended for use by systems supporting information retrieval services such as libraries, information utilities, and union catalogue centers. Z39.50 is not constrained to any particular computer system or network and can be implemented without limitation on TCP/IP networks.
- Library of Congress Z39.50 Maintenance Agency Homepage, which includes links to the ANSI/NISO standards, reports, and other links related to the Z39.50 standard.
URL: http://lcweb.loc.gov/z3950/agency/
- A review paper of the protocol and its application.
URL: http://www.cni.org/pub/NISO/docs/Z39.50-brochure/50.brochure.toc.html
- Additional links to information about the Z39.50 standard.
URL: http://ils.unc.edu/~stahk/Z3950.html
- A presentation on Z39.50, in the form of slides online or downloadable as a PowerPoint presentation, including bibliographies of further reading on the web and in printed publications, and links to Z39.50 software.
URL: http://www.musiconline.ac.uk/z3950
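For readers who want to try the protocol from code, the sketch below shows the general shape of a Z39.50 session (connect, choose a database, search, fetch records) using the ZOOM-style interface of the PyZ3950 package. Treat this as an assumption-laden illustration: the package, the Library of Congress host and port, the database name and the query are examples rather than a tested recipe.

```python
# Sketch of a Z39.50 search session using PyZ3950's ZOOM-style API.
# Assumptions: the PyZ3950 package is installed, and the server details
# (host z3950.loc.gov, port 7090, database "Voyager") are purely an
# illustrative target.
from PyZ3950 import zoom

conn = zoom.Connection('z3950.loc.gov', 7090)
conn.databaseName = 'Voyager'          # target database on the server
conn.preferredRecordSyntax = 'USMARC'  # ask for MARC records back

# A CCL query: title keyword search for "information retrieval".
query = zoom.Query('CCL', 'ti="information retrieval"')
results = conn.search(query)

print(len(results), 'records found')
for record in results[:5]:             # fetch and display the first few
    print(record)

conn.close()
```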
The CaseLibrary consortium funded by the European Union Directorate General XIII has mounted ZNavigator, a beta version of its information retrieval software for free download from the Web. The software allows users of PCs running Windows 3.x or Windows 95 (and Windows NT in most cases) to search a wide variety of library catalogues and databases via the Internet using the Z39.50 international standard for information retrieval.
URL: CaseLibrary http://www.sbu.ac.uk/litc/caselib/
URL: ZNavigator ftp://ftp.sbu.ac.uk/pub/znavig/zn10f.exe
The following report on the TREC-6 conference has been contributed by Alan Smeaton (asmeaton@compapp.dcu.ie) - Dublin City University.
For many years information retrieval and its associated applications formed a minor corner of computing, nestling between computer, information and library science. The reasons for this are not surprising: information, and the computing resources to manipulate it, was centralised; information management was a profession restricted to skilled practitioners; and knowledge in general was not regarded as a valuable commodity.
Now, in the late 1990s, things are different. Computing and networking are becoming ubiquitous and distributed, and computing technology is in the hands of a large and increasing section of the world's population, with a large number of novices and comparatively few experienced users. Knowledge and information are now recognised as important and valuable, in text and in other media.
This development of the computing landscape has led to a situation in which all aspects of information management are important. This includes the broad area of information retrieval (IR), defined as the content-based manipulation of text and other information. IR has moved from being an obscure and niche application for specialists to becoming an enabling technology, underpinning many of the distributed, knowledge-intensive applications of the present and near future.
Apart from the general fast-paced evolution of computing and networking technology which we have seen lead to developments like personal computing, multimedia and the Internet, two specific events have occurred in the last 5 years which have highlighted the difficulties and non-trivial nature of the information retrieval task.
The first of these has been the enormous growth of the WWW, with huge volumes of (mostly textual) information deployed on a globally distributed network of computers and accessed incessantly by a population of mostly untrained users. For the most part, Web users are inadequately served by the current crop of Web search and navigation tools, which has led to user dissatisfaction because relevant information is hard to locate.
The second event to highlight the importance of IR in the 1990s has been TREC, the event itself and its fallout on the IR research community. TREC is an annual series of benchmarking exercises coordinated by the National Institute of Standards and Technology (NIST). In TREC, up to 50 research centres, universities, companies and other organisations from around the world participate annually in a worldwide coordinated scientific exercise to benchmark the retrieval effectiveness of different approaches to indexing and retrieval of text documents. The participants work on the same 2 GByte collection of text documents and run the same queries at the same time, and the pooled output from the participants' submissions is manually assessed for relevance and then evaluated.
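Evaluation of the pooled, judged output is done with standard effectiveness measures; the sketch below shows one of the most common, uninterpolated average precision for a single topic, using invented document identifiers.

```python
# Minimal sketch of TREC-style evaluation for one topic: uninterpolated
# average precision of a ranked run against a set of relevance
# judgments. Document identifiers are invented for illustration.

def average_precision(ranked_docs, relevant):
    """Mean of precision values at each rank where a relevant doc appears."""
    hits, precisions = 0, []
    for rank, doc in enumerate(ranked_docs, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant) if relevant else 0.0

run = ["d3", "d7", "d1", "d9", "d4"]        # a system's ranked output
qrels = {"d3", "d9", "d5"}                  # judged relevant documents
print(round(average_precision(run, qrels), 3))  # (1/1 + 2/4) / 3 = 0.5
```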
The tangible outcomes of the series of annual TREC exercises are the experimental test collections of documents and queries now being made available to the wider research community, the large amount of experimental data accumulated over the years and the knowledge culled from these experiments on which IR techniques work better than others and under what circumstances. A more intangible outcome from TREC, which the reader may argue would have happened anyway, is that IR research is now regularly carried out and reported on large, multi-gigabyte collections of text. TREC has been both a carrot and a stick in helping to make this happen.
As information retrieval has diversified into multiple directions, the TREC initiative has also introduced several divergent areas of information retrieval technology, called "tracks". For TREC-6, held during 1997, these included the following:
- Interactive: covering detailed and in-depth analysis of user search behaviour for a small number of topics, with each participant using a benchmark IR system as well as their own.
- Cross-lingual: covering retrieval across languages where the topics in one language were run against document collections in different languages involving English, French, German, Italian, Spanish and Dutch.
- Spoken document retrieval, where typed text queries were run against an archive of the audio of about 100 hours of radio and TV news broadcasts. This was the first time this track had been run, and the document set was provided as raw audio as well as manual (correct) transcriptions and the output from a commercial speech recogniser.
- Very large collection where a dataset of 20 Gbytes of text, as opposed to the usual 2 Gbytes, was used to measure the scalability of IR techniques to larger document bases.
- High precision where the task was to find 10 (relevant) documents within a 5 minute window. Like the interactive track, this involved using real users to perform searches as opposed to running queries in batch mode.
- The Natural Language Processing (NLP) track concentrated on evaluating the impact of NLP techniques on retrieval performance.
- The Chinese language track involved running Chinese topics against a dataset of Chinese documents.
- Routing: where the task is to filter an incoming stream of documents against a static pre-defined topic (a minimal filtering sketch appears below).
All these specialist tracks were run in addition to the main task, referred to as ad hoc retrieval, which is the conventional information retrieval application, running a new topic or query against a collection of 2 Gbytes of (English) documents.
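As a concrete illustration of the routing task listed above, the sketch below scores each incoming document against a static topic profile (a simple term-weight vector) and passes through only those that reach a threshold. The profile, documents and threshold are invented, and real TREC routing systems used far richer models than this.

```python
# Illustrative routing/filtering: score each incoming document against a
# static topic profile (term weights) and pass those above a threshold.
# Profile terms, documents and threshold are invented for illustration.

topic_profile = {"election": 2.0, "senate": 1.5, "vote": 1.0}
THRESHOLD = 2.0

def route(document, profile, threshold):
    """Return True if the document's profile score reaches the threshold."""
    terms = document.lower().split()
    score = sum(profile.get(term, 0.0) for term in terms)
    return score >= threshold

stream = [
    "senate vote delayed until spring",
    "new results in image retrieval research",
    "election coverage continues on the senate floor",
]
for doc in stream:
    if route(doc, topic_profile, THRESHOLD):
        print("ROUTE:", doc)
    else:
        print("drop: ", doc)
```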
The results from the annual series of TREC events are documented in the annual TREC proceedings published by the US Department of Commerce which are over 1000 pages long, and recent TREC results and papers are also available on the TREC website. The documentation includes an exhaustive series of raw data on each retrieval run submitted by each participant, and a paper from each participating group describing what they have done, how they have done it and what retrieval results they have obtained.
No groups in TREC-6 took part in all tracks, though a few came close, because of the huge effort involved in participation. Costs for labour, equipment and travel required to take part in TREC are normally borne by the participating institution, and this is the limiting factor to even greater participation. Nonetheless there are few areas of computing where such a global, coordinated and voluntary effort is made to make progress in a given field.
TREC has attempted to perform a common evaluation of performance on a common task with common data, carried out in a single timeframe, and in the main this has been successful. However, because of the complexity of the task and the evaluation, and the diversity of approaches taken, an interested investigator has to do a lot of digging into the official results in order to make direct one-to-one comparisons between systems and their performance. TREC was never intended to be a direct competition between groups or approaches to information retrieval, and the whole exercise, unlike other language technology benchmarking exercises, has a scientific rather than a competitive feel to it. Commercial participants do not use their relative performances in TREC for marketing or advertising, and indeed many use TREC as a vehicle for experimenting with new ideas.
The reader may have turned to this report hoping to find out who the "winner" was in TREC-6 or any other TREC, or which technique works best. The simple answer to both questions is that there are no winners in TREC; rather, there is an accumulation of techniques which can be mixed together into what is almost a cocktail of facets making up a contemporary information retrieval system.
What makes the above questions impossible to answer with a single answer is that different "cocktails" work better or worse in different retrieval situations, and that is what makes information retrieval hard. Even so, there are some trends that one can take away from TREC-6, such as:
- data fusion (combining different retrievals into a single retrieval) generally works well (a small fusion sketch follows this list),
- NLP for the moment is having a limited impact on retrieval performance,
- retrieval is now a fast operation and scalable to large collections,
- information retrieval is diversifying into different applications.
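To make the first of these trends concrete, the sketch below implements one simple and widely used fusion rule, CombSUM, which min-max normalises each run's scores and sums them per document. The runs and scores are invented, and TREC participants used a variety of fusion methods, not necessarily this one.

```python
# Illustrative data fusion with CombSUM: normalise each run's scores and
# sum them per document, so documents retrieved by several systems rise
# in the fused ranking. The runs and scores are invented for illustration.

def normalise(run):
    """Min-max normalise a {doc: score} run to the [0, 1] range."""
    lo, hi = min(run.values()), max(run.values())
    return {doc: (s - lo) / (hi - lo) if hi > lo else 0.0
            for doc, s in run.items()}

def combsum(runs):
    """Sum normalised scores across runs and return a fused ranking."""
    fused = {}
    for run in runs:
        for doc, score in normalise(run).items():
            fused[doc] = fused.get(doc, 0.0) + score
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)

run_a = {"d1": 12.0, "d2": 9.5, "d3": 4.0}   # e.g. a probabilistic system
run_b = {"d2": 0.81, "d4": 0.77, "d1": 0.20} # e.g. a vector-space system
print(combsum([run_a, run_b]))
# d2 and d1, found by both systems, outrank the single-run documents,
# which is the intuition behind why fusion "generally works well".
```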
Most of the techniques benchmarked in TREC-6 which make positive contributions to retrieval effectiveness have yet to make their way into mainstream IR, e.g. for use in web search engines, but this will surely happen. It would be nice to think of TREC as a window into what IR will look like, but this author does not believe this to be true. What is missing from TREC, and has been absent since the first TREC in 1993, is measurement of retrieval efficiency - the time taken to return the answer to a query. This is now being addressed, indicating that TREC has matured to consider all aspects of the IR task, which is healthy.
Another interesting and important lesson from TREC-6 is that the quality of retrieval in the late 1990s is appreciably better than 5 or 6 years ago, meaning that IR research has had an impact on the quality of retrieval. We look forward to this technology transfer into operational information retrieval.
The TREC initiative continues to grow every year and to attract new groups into the field, such as the groups from Sheffield, Siemens Munich, Twentyone, Harris Corp., CSIRO and Moscow State University who joined TREC for the first time in TREC-6. In addition there were the "usual suspects" who have been present for many, most or all of the TRECs. TREC will continue for at least another two annual cycles, and thereafter its future is uncertain. Perhaps it will have served its purpose and will wind down, but more likely it will metamorphose into something similar, possibly involving official European participation rather than the sporadic participation of individual European groups seen to date.
Alan F. Smeaton is a Professor of Computing at Dublin City University. He has been on the program committee for TREC since it started and his Multimedia Information Retrieval (MMIR) research group has participated in TREC for the last four years.
URL: TREC proceedings http://www.nist.gov/itl/div894/894.02/
URL: TREC http://trec.nist.gov