Li, W. and Tavanapong, W. (2000) Content-based Web Site Characterization and Its Implications. Technical Report TR00-13, Department of Computer Science, Iowa State University.

Content-based Web Site Characterization and Its Implications
Wenlu Li and Wallapak Tavanapong
In recent years, the World-Wide Web has rapidly evolved from a simple
information sharing mechanism offering only static text and images to
a rich assortment of dynamic and interactive services such as
video/audio conferencing, electronic commerce, and distance learning.
Since images in Web pages account for up to 70% of Internet
traffic, this paper investigates another, as yet unexplored, aspect of
Web usage, the use of image content, in the hope of designing novel
techniques that can further reduce Web service delays and WAN
bandwidth consumption.  Images are currently used on the Web for
various purposes such as text-related content, decoration, navigation,
and advertisement.  In this paper, we first discuss ways of employing
image content to improve Web performance, and the potential benefits
of doing so.  To encourage future investigations, we then present a
systematic approach to analyzing the files published on a Web site,
along with the results of applying it to two popular commercial
sites.  First,
images are classified into five categories according to their
purpose. Probability plots are then used to test whether file sizes of
each image category fit any commonly used simple statistical
distributions.  Next, we employ the Anderson-Darling test to determine
how well the assumed models describe the empirical data.  For
categories that fail the aforementioned tests, log-log complementary
distribution plots are used to test whether the distribution is
heavy-tailed.  The study revealed a number of interesting results.  Most
importantly, we observed that the average HTML file size is larger
than that of the images on the same site.  On average, an HTML file
includes fewer than one unique content image and nine unique
non-content images, of which navigational and decorative images are
the most often embedded.  The file sizes of these two categories
follow lognormal distributions, which may help explain the lognormal
body of the heavy-tailed distributions found in previous trace
analyses.
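
The two statistical checks named in the abstract can be illustrated
with a minimal Python sketch.  It is not the authors' implementation:
the function names, the Stephens small-sample adjustment, the 5%
critical value of 0.752 (the commonly tabulated value when both
parameters are estimated), the 10% tail fraction, and the synthetic
samples are all assumptions made for illustration.  The sketch tests
lognormality by applying the Anderson-Darling statistic to the
logarithms of the file sizes, and builds the points of a log-log
complementary distribution (LLCD) plot, whose upper tail is roughly
linear for a heavy-tailed distribution.

```python
import math
import random

SQRT2 = math.sqrt(2.0)
TINY = 1e-300  # guards against log(0) for extreme observations


def anderson_darling_lognormal(sizes):
    """Adjusted Anderson-Darling statistic for a lognormal fit.

    Lognormality of the sizes is tested as normality of their logs,
    with mean and standard deviation estimated from the sample.
    """
    logs = sorted(math.log(s) for s in sizes)
    n = len(logs)
    mean = sum(logs) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in logs) / (n - 1))
    total = 0.0
    for i in range(1, n + 1):
        z_lo = (logs[i - 1] - mean) / sd   # i-th smallest observation
        z_hi = (logs[n - i] - mean) / sd   # i-th largest observation
        log_cdf = math.log(max(0.5 * math.erfc(-z_lo / SQRT2), TINY))
        log_sf = math.log(max(0.5 * math.erfc(z_hi / SQRT2), TINY))
        total += (2 * i - 1) * (log_cdf + log_sf)
    a2 = -n - total / n
    # Small-sample adjustment (assumed here) for estimated parameters.
    return a2 * (1.0 + 0.75 / n + 2.25 / n ** 2)


def llcd_points(sizes):
    """Return (log10 x, log10 P[X > x]) points of the empirical LLCD."""
    xs = sorted(sizes)
    n = len(xs)
    points = []
    for i, x in enumerate(xs, start=1):
        if i < n and xs[i] == x:
            continue  # tied values: keep only the last occurrence
        survivors = n - i
        if survivors > 0:  # the largest value has P[X > x] = 0
            points.append((math.log10(x), math.log10(survivors / n)))
    return points


def tail_slope(points, tail_fraction=0.1):
    """Least-squares slope of the upper tail of the LLCD points."""
    k = max(2, int(len(points) * tail_fraction))
    tail = points[-k:]
    mx = sum(p[0] for p in tail) / len(tail)
    my = sum(p[1] for p in tail) / len(tail)
    sxx = sum((p[0] - mx) ** 2 for p in tail)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in tail)
    return sxy / sxx


# Synthetic file sizes for illustration only; not data from the report.
random.seed(0)
lognormal_sizes = [random.lognormvariate(8.0, 1.0) for _ in range(2000)]
pareto_sizes = [random.paretovariate(1.2) for _ in range(2000)]

for name, sample in (("lognormal", lognormal_sizes),
                     ("pareto", pareto_sizes)):
    a2 = anderson_darling_lognormal(sample)
    verdict = "not rejected" if a2 < 0.752 else "rejected"
    slope = tail_slope(llcd_points(sample))
    print(f"{name}: A2* = {a2:.3f} (lognormal {verdict} at 5%), "
          f"LLCD tail slope = {slope:.2f}")
```

For a category that fails the lognormal test, a roughly constant
negative tail slope on the LLCD plot is the usual sign of a heavy
tail; the slope estimate here is sensitive to the assumed tail
fraction and should be read alongside the plot itself.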
Keywords: World Wide Web, Web characterization, image classification
1999 CR Categories:
H.4.3 [Information Systems Applications]
Communications Applications --- design, information browsers;
H.5.1 [Information Interfaces and Presentation]
Multimedia Information Systems --- evaluation, methodology;
I.6.4 [Simulation and Modeling]
Model Validation and Analysis --- Web site modeling;
I.2.1 [Artificial Intelligence]
Applications and Expert Systems --- image classification;
Copyright (c) 2000 by Wenlu Li and Wallapak Tavanapong

