
A. F. Salam and Jason R. Stevens

"Semantic Web Technologies and E-Business: Toward the Integrated Virtual Organization and Business Process Automation"

The following subsection describes our corpus, the KWIC index generator, the
CMU-SLM toolkit, and the part-of-speech tagger. WordNet, which contains
information on lexical semantic relations, is discussed after that.
Figure 2. System architecture of FIRST
Automatically Extracting and Tagging Business Information for E-Business Systems
Copyright © 2007, Idea Group Inc. Copying or distributing in print or electronic forms without written permission
of Idea Group Inc. is prohibited.
The Corpus and Rule Extraction Process
To generate rules that enable FIRST to extract information from online documents,
we look for written patterns in a number of articles in the same domain. FIRST's
current goal is to extract information from the WSJ in the domain of corporate
finance. Figure 3 shows a sample WSJ document published in 1987.
We use WSJ articles published in 1987 as a training data set for finding these
patterns. Each article is tagged using Standard Generalized Markup Language
(SGML), an international standard for the definition of device-independent,
system-independent methods of representing texts in electronic form.
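
As a rough illustration of how a tagged article can be prepared for pattern finding, the sketch below pulls individual fields out of an SGML-style document and lists each occurrence of a keyword with its surrounding words, in the spirit of a KWIC index. It is not the FIRST implementation: the tag names (DOC, DOCNO, HL, TEXT), the sample article, and the helper names extract_field and kwic are all assumptions made for this example. The regular expression handles only simple, non-nested tags, whereas full SGML would need a proper parser.

```python
import re

# Hypothetical sample article. The tag names and the content are
# illustrative assumptions, not taken from the actual corpus.
SAMPLE = """<DOC>
<DOCNO> EXAMPLE-0001 </DOCNO>
<HL> Acme Corp. Posts Higher Profit </HL>
<TEXT>
Acme Corp. said net income rose to $12 million. Net income at Beta Inc. fell.
</TEXT>
</DOC>"""


def extract_field(document: str, tag: str) -> str:
    """Return the whitespace-normalized text between <tag> and </tag>, or ''."""
    name = re.escape(tag)
    match = re.search(rf"<{name}>(.*?)</{name}>", document, flags=re.DOTALL)
    return " ".join(match.group(1).split()) if match else ""


def kwic(text: str, keyword: str, width: int = 3) -> list[str]:
    """List each keyword occurrence with up to `width` words on either side."""
    words = text.split()
    lines = []
    for i, word in enumerate(words):
        # Compare case-insensitively, ignoring trailing punctuation.
        if word.lower().strip(".,;:") == keyword.lower():
            left = " ".join(words[max(0, i - width):i])
            right = " ".join(words[i + 1:i + 1 + width])
            lines.append(f"{left} [{word}] {right}".strip())
    return lines


if __name__ == "__main__":
    body = extract_field(SAMPLE, "TEXT")
    for line in kwic(body, "income", width=2):
        print(line)
```

Aligning the contexts of a recurring term such as "income" in this way makes repeated phrasings easier to spot, which is the kind of written pattern an extraction rule could be built around.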

