Chinese Web Content Extraction Based on Naïve Bayes Model
Abstract
As web content extraction becomes increasingly difficult, this paper
proposes a method that uses a Naive Bayes model trained on the attribute
feature values of web page blocks. The method first denoises the web
page, represents it as a DOM tree, and divides the page into blocks; it
then uses the Naive Bayes model to compute a probability from the
statistical features of each block. Finally, it extracts the theme
blocks and composes them into the content of the web page. Tests show
that the algorithm extracts web page content accurately, with an average
accuracy of 96.2%. The method has been adopted to extract content for
the off-portal search of the Hunan Farmer Training Website, where it
runs efficiently.
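
The classification step described above can be illustrated with a minimal
sketch. It is not the authors' implementation: the abstract does not list
the block features, so the three used here (text length, link density,
punctuation count), the Gaussian likelihood, the labels "theme" and
"noise", the training values, and all function names are assumptions
chosen for illustration only.

```python
import math
from collections import defaultdict

# Hypothetical training data: (text length, link density, punctuation count).
# These features and values are illustrative assumptions, not from the paper.
TRAINING = [
    ((420.0, 0.02, 18.0), "theme"),
    ((380.0, 0.05, 15.0), "theme"),
    ((510.0, 0.01, 22.0), "theme"),
    ((40.0, 0.90, 0.0), "noise"),
    ((65.0, 0.75, 1.0), "noise"),
    ((30.0, 0.95, 0.0), "noise"),
]


def train(samples):
    """Estimate a log prior and per-feature Gaussian parameters per class."""
    by_label = defaultdict(list)
    for features, label in samples:
        by_label[label].append(features)
    model = {}
    total = len(samples)
    for label, rows in by_label.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # Floor the variance so a constant feature never yields zero.
        variances = [
            max(sum((x - m) ** 2 for x in col) / n, 1e-6)
            for col, m in zip(columns, means)
        ]
        model[label] = (math.log(n / total), means, variances)
    return model


def log_posterior(model, features):
    """Unnormalised log posterior per class, assuming independent features."""
    scores = {}
    for label, (log_prior, means, variances) in model.items():
        score = log_prior
        for x, m, v in zip(features, means, variances):
            score += -0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
        scores[label] = score
    return scores


def classify(model, features):
    """Return the class with the highest log posterior."""
    scores = log_posterior(model, features)
    return max(scores, key=scores.get)


def extract_content(model, blocks):
    """Join the text of every block classified as a theme block."""
    return "\n".join(
        text for text, features in blocks if classify(model, features) == "theme"
    )


model = train(TRAINING)
```

Given a list of (text, features) pairs produced by some block-splitting
step, extract_content(model, blocks) keeps only the blocks labelled
"theme". Working in log space avoids numerical underflow when many
feature likelihoods are multiplied.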
Origin: Files produced by the author(s)