The following
subsection describes our corpus, the KWIC index generator, the CMU-SLM toolkit,
and the part-of-speech tagger. WordNet, which contains information on lexical
semantic relations, is discussed after that.
Figure 2. System architecture of FIRST
Automatically Extracting and Tagging Business Information for E-Business Systems
Copyright © 2007, Idea Group Inc. Copying or distributing in print or electronic forms without written permission
of Idea Group Inc. is prohibited.
The Corpus and Rule Extraction Process
To generate rules that enable FIRST to extract information from online documents,
we look for written patterns in a number of articles in the same domain. FIRST's
current goal is to extract information from the WSJ in the domain of corporate
finance. Figure 3 shows a sample WSJ document published in 1987.
We use articles from the WSJ written in 1987 as a training data set to help us find
patterns in the articles. Each article is tagged using Standard Generalized Markup
Language (SGML). SGML is an international standard for the definition of device-
independent, system-independent methods of representing texts in electronic
form.
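
To illustrate how an SGML-tagged article might be read before patterns are sought in it, the following is a minimal Python sketch, not the procedure FIRST itself uses. The tag names (DOC, DOCNO, HL, TEXT), the sample article, and the helper functions extract_field and parse_article are assumptions made for illustration only; the actual markup of the WSJ articles may differ.

```python
import re
from typing import Dict, Optional, Sequence

# Hypothetical SGML-tagged article. The tag names and the content are
# assumed for illustration and are not taken from the WSJ corpus.
SAMPLE = """<DOC>
<DOCNO> EXAMPLE-0001 </DOCNO>
<HL> Example Headline </HL>
<TEXT>
Example Corp. said it agreed to acquire Sample Inc.
</TEXT>
</DOC>"""


def extract_field(doc: str, tag: str) -> Optional[str]:
    """Return the stripped text between <tag> and </tag>, or None if absent."""
    name = re.escape(tag)
    match = re.search(rf"<{name}>(.*?)</{name}>", doc, flags=re.DOTALL)
    return match.group(1).strip() if match else None


def parse_article(doc: str,
                  tags: Sequence[str] = ("DOCNO", "HL", "TEXT")) -> Dict[str, Optional[str]]:
    """Collect the requested tagged fields of one article into a dictionary."""
    return {tag: extract_field(doc, tag) for tag in tags}


if __name__ == "__main__":
    article = parse_article(SAMPLE)
    for tag, value in article.items():
        print(f"{tag}: {value}")
```

Because the markup separates fields such as the headline from the body text, a rule extractor can restrict its pattern search to the field of interest rather than scanning the raw file.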