The explosion of Web documents, many of which are different descriptions of the
same facts, will also bring about the need to recognize which facts are conceptually
equivalent. Craven et al. (2002) refer to this as the multiple Elvis problem. In
our current work, we extract facts from multiple Web sources, including not only
the WSJ but also Reuters, filter out the duplicates, and use the result to create
a knowledge base that contains only novel facts. Semantically conflicting
facts are identified and quarantined until new information validates or refutes
one or the other and the conflict can be resolved. In this approach, the multiple
sources of a given fact are remembered (via URL references to the source articles)
for verification purposes, but each fact is stored only once.
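
The store-once, remember-all-sources policy described above can be illustrated with a minimal sketch. It is not the system described in this work: the names Fact, KnowledgeBase, add and resolve are hypothetical, a fact is assumed to be a (subject, predicate, value) triple, and two facts are treated as equivalent only when all three parts match exactly, which sidesteps the much harder problem of recognizing conceptual equivalence. Conflict resolution is left to an explicit call, since the source does not say how validating information is recognized.

```python
from dataclasses import dataclass, field


@dataclass
class Fact:
    # Assumed representation: one (subject, predicate, value) triple.
    subject: str
    predicate: str
    value: str
    sources: set = field(default_factory=set)  # URLs of the source articles


class KnowledgeBase:
    """Hypothetical store: each fact kept once, conflicts quarantined."""

    def __init__(self):
        self.facts = {}       # (subject, predicate) -> accepted Fact
        self.quarantine = {}  # (subject, predicate) -> {value: Fact}

    def add(self, subject, predicate, value, url):
        key = (subject, predicate)

        # A key already in conflict stays quarantined; only record the source.
        if key in self.quarantine:
            candidates = self.quarantine[key]
            fact = candidates.setdefault(value, Fact(subject, predicate, value))
            fact.sources.add(url)
            return "quarantined"

        known = self.facts.get(key)
        if known is None:
            self.facts[key] = Fact(subject, predicate, value, {url})
            return "novel"

        if known.value == value:
            # Duplicate fact: keep a single copy, remember the extra source.
            known.sources.add(url)
            return "duplicate"

        # Conflicting values: withdraw the accepted fact and hold both.
        del self.facts[key]
        self.quarantine[key] = {
            known.value: known,
            value: Fact(subject, predicate, value, {url}),
        }
        return "conflict"

    def resolve(self, subject, predicate, value):
        # Promote the validated value; the competing values are discarded.
        key = (subject, predicate)
        candidates = self.quarantine.pop(key)
        self.facts[key] = candidates[value]
        return self.facts[key]
```

In this sketch a second article reporting an already known fact only adds its URL to the existing entry, while an article reporting a different value for the same subject and predicate moves both versions into quarantine until resolve is called.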
Since Web information providers may be slow to convert their existing content into
a rich XML format, much of the semantic encoding may have to be done by
third-party e-business service providers, or by end users themselves, using
browser-side extraction and encoding tools such as the Thresher tool proposed by
Hogue and Karger (2005).