Ruigang Yang
Department of Computer Science
Columbia University
Dec, 1997
Over the last decade, the World-Wide Web has become the largest information
database in the world. Almost every kind of information we can possibly imagine
can be found somewhere on the web. With the introduction of the HyperText Markup
Language (HTML), navigating the web has become easy and less confusing, because
information is organized by the links between relevant pages. However,
since the links between web pages can be arbitrary, copying or moving a compound
document which consists of multiple pages becomes a non-trivial problem for the
World-Wide Web community. Even the leading web browsers cannot guarantee that
a complete page is saved without broken links. So here I designed and implemented
a piece of software to copy, move and distribute compound documents on the web; we call
it Web Copy (WCP). It can be used as a tool for web document backup, off-line
browsing or the creation of mirror sites. It is also submitted as a course
project to meet the requirements of COMS E6118.
In the proposal of WCP, we finished the requirements analysis and outlined WCP's high-level architecture. Here let us briefly recap it. The primary function of WCP is to copy, move and distribute compound documents on the web. A compound document is a collection of inter-related files which conceptually forms one document. Because a compound document may be distributed over different directories or even sites, and we want to keep the downloaded version under one single directory, the downloaded document may have to be modified to keep the integrity of its links. This is an inherent problem of the HTML standard, because it keeps links as part of the file; this is also where WCP's major technical challenge lies. There are some existing applications that do a similar job, but most of them just recursively copy the files under one directory: they cannot guarantee the integrity of the links, and the downloaded files can only be saved to a local directory. WCP is oriented toward overcoming these two major deficiencies of existing programs.
To accomplish such a goal, WCP is decomposed into several components, as illustrated in the following figure.
DocumentBase is the central repository where all the temporary files produced during the copying process are indexed. Because it is common that two pages refer to each other, forming a link loop, the DocumentBase must have the intelligence to tell whether a page has already been stored, so that no duplicate copies of the same page exist in the database.
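A minimal sketch of this duplicate check, assuming the repository is keyed by the page's absolute URL (the class and method names here are illustrative, not the actual WCP code):

    import java.util.Hashtable;

    // Illustrative sketch: a repository keyed by absolute URL.
    // insertIfAbsent returns false when the page is already stored,
    // so pages that link to each other are only downloaded once.
    class DocumentBase {
        private Hashtable pages = new Hashtable();   // URL string -> page object

        synchronized boolean insertIfAbsent(String url, Object page) {
            if (pages.containsKey(url))
                return false;            // already stored, skip to avoid loops
            pages.put(url, page);
            return true;
        }

        boolean contains(String url) {
            return pages.containsKey(url);
        }
    }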
UserInterface provides a user-friendly graphical interface which shall be straightforward and easy to use.
Parser parses the input HTML page to pick out all the links in it. The HTML tags and attributes it shall recognize include, but are not limited to, HREF, IMG and SRC. It returns a list of links to the DocumentBase. The DocumentBase filters out the links which already exist, then passes the list to the Collector.
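A simplified sketch of such link extraction might look like the following; it uses java.util.regex for brevity, which is my own illustration rather than WCP's actual parsing code:

    import java.util.*;
    import java.util.regex.*;

    // Illustrative sketch: pull out of an HTML page the attribute values
    // named in the report (A HREF, IMG SRC, BODY BACKGROUND).
    class LinkExtractor {
        private static final Pattern LINK = Pattern.compile(
            "(?i)(?:href|src|background)\\s*=\\s*[\"']?([^\"'\\s>]+)");

        static List extractLinks(String html) {
            List links = new ArrayList();
            Matcher m = LINK.matcher(html);
            while (m.find())
                links.add(m.group(1));   // the raw link, possibly relative
            return links;
        }
    }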
Collector is a set of protocol clients which can talk to a remote host or server using the appropriate protocol to retrieve documents. At least an HTTP and an FTP client should be implemented in the prototype. In future extensions, new clients may be added, such as an SQL client or a POP client. Because these protocol clients are encapsulated inside the Collector component, WCP can incorporate new protocols seamlessly.
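Because Java's built-in URL handlers already hide the difference between HTTP and FTP, a sketch of such a client can be quite short (an illustration, not the actual Collector code; error handling is trimmed):

    import java.io.*;
    import java.net.URL;

    // Illustrative sketch: fetch the raw bytes of a document over whatever
    // protocol the URL names (http:, ftp:, ...), using Java's built-in handlers.
    class Collector {
        static byte[] fetch(String address) throws IOException {
            URL url = new URL(address);
            InputStream in = url.openStream();
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1)
                buf.write(chunk, 0, n);
            in.close();
            return buf.toByteArray();
        }
    }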
Distributor saves/distributes downloaded documents to one or multiple destinations. It contains two agents, local and remote. The local agent handles the normal local file system operations. The remote agent is able to send the document to a remote destination, or upload it to a server. The remote agent may use some of the protocol clients to send or upload documents, so the protocol clients are shared by the Collector and the Distributor.
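As a design-level sketch of that split (the names here are hypothetical, not WCP's actual code), the two agents can share one interface so the rest of WCP does not care where a document ends up:

    import java.io.*;

    // Illustrative sketch: both agents expose the same operation.
    interface DistributorAgent {
        void deliver(String fileName, byte[] content) throws IOException;
    }

    // The local agent is plain file I/O; a remote agent would instead wrap
    // the FTP or mail client and push the bytes to the remote destination.
    class LocalAgent implements DistributorAgent {
        private final File destDir;
        LocalAgent(File destDir) { this.destDir = destDir; }

        public void deliver(String fileName, byte[] content) throws IOException {
            FileOutputStream out = new FileOutputStream(new File(destDir, fileName));
            out.write(content);
            out.close();
        }
    }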
Filter modifies the document if necessary. Annotation, repackaging (changing the directory structure) and format conversion (GIF to JPEG, HTML to text, HTML to PostScript, etc.) will be implemented inside this component.
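For the link-rewriting part of the Filter, a simplified sketch (the approach and names are illustrative only) could replace each original link in the page text with the local filename assigned to that document:

    // Illustrative sketch: rewrite every occurrence of one original link
    // to the local filename chosen for that document.
    class LinkRewriter {
        static String rewrite(String html, String originalLink, String localName) {
            StringBuffer out = new StringBuffer();
            int from = 0, pos;
            while ((pos = html.indexOf(originalLink, from)) != -1) {
                out.append(html.substring(from, pos)).append(localName);
                from = pos + originalLink.length();
            }
            out.append(html.substring(from));
            return out.toString();
        }
    }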
WCP is written in Java as a Java application. The core components of WCP
are the two classes DB and HTMLpage. The DB class is the implementation of the
DocumentBase component in the system diagram. It contains a HashTable; every
downloaded page is inserted into the table using its URL as the index. HTMLpage
is a very heavy-duty class: it integrates the functions of the Parser, the Collector
and the Filter. Because Java's URL class provides a consistent interface to download
a page by HTTP or FTP, the Collector component becomes relatively trivial, so
I integrated it into HTMLpage. In the current version, the Filter only modifies
the internal link structure of an HTML file, so it is also integrated into the
HTMLpage class. However, the most important function of HTMLpage is the Parser.
Currently, the parser can recognize three types of tags, <A HREF...>,
<IMG SRC=...> and <BODY ... BACKGROUND=...>, which, in my own
experience, cover the most frequently used link tags in HTML.
Here is how WCP works: for every single URL, an HTMLpage object is created and inserted into the DB object if the download rule is satisfied (i.e. the URL is within the maximum recursion depth, or the page contains a certain keyword). Then the parser function is called to find all the links in that HTMLpage. Every link is resolved to a complete URL. If that URL should be downloaded, the DB object is checked to see whether a local copy already exists; if not, the page is downloaded and inserted, otherwise it is skipped and WCP moves on to the next link. This avoids the problem of possible link loops among HTML files. When output files need to be saved to the local disk, every link in an HTML page is checked again against the DB object, and modified if necessary.
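Put as code, the loop looks roughly like the sketch below. It reuses the illustrative DocumentBase, Collector and LinkExtractor classes from the earlier sketches, which stand in for WCP's actual DB and HTMLpage classes; the depth rule shown is one of the download rules mentioned above.

    // Illustrative sketch of the crawl: fetch a page, register it in the
    // repository, then follow its links up to the maximum recursion depth.
    class Crawler {
        private final DocumentBase db = new DocumentBase();
        private final int maxDepth;
        Crawler(int maxDepth) { this.maxDepth = maxDepth; }

        void download(String url, int depth) throws java.io.IOException {
            if (depth > maxDepth || db.contains(url))
                return;                          // rule not satisfied, or already stored
            byte[] raw = Collector.fetch(url);
            db.insertIfAbsent(url, raw);
            if (!looksLikeHtml(url))
                return;                          // images etc. have no links to follow
            java.util.List links = LinkExtractor.extractLinks(new String(raw));
            for (int i = 0; i < links.size(); i++) {
                String absolute = new java.net.URL(new java.net.URL(url),
                                                   (String) links.get(i)).toString();
                download(absolute, depth + 1);   // link loops stop at the db.contains check
            }
        }

        private static boolean looksLikeHtml(String url) {
            String u = url.toLowerCase();
            return u.endsWith("/") || u.endsWith(".html") || u.endsWith(".htm");
        }
    }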
WCP uses ORO, Inc.'s Java package NetComponents to implement email and FTP distribution of copied documents. The UUencoder class used for email transfer was developed by Matthew Phillip Hixson.
Next I will discuss two major technical difficulties I ran into.
Filename/directory structure reordering: This problem appeared
when I tried to save files to a local disk. Because a web server has
some default rules for retrieving an HTML file, a URL sometimes does not contain
a filename, so we need a scheme to generate a name when there is none. Links
may point to files in different directories. Most existing applications
create subdirectories if necessary. But I found that some links look like
this: <a href = ../foo/foo.html> or <img src = /image/bar.gif>.
Applying the "create subdirectories if necessary" rule to names like these will
cause problems: files may end up saved outside the designated directory,
or may not be saved at all due to file access permissions. So my decision
is to save every HTML file under one directory and images under a subdirectory
named "images". The saved filename is the encoded path information plus the original
filename. I admit this is not the best solution available, but it keeps every
HTML file under one directory, and it guarantees that every file can be saved.
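One possible encoding along these lines (a sketch of the idea, not necessarily the exact scheme WCP uses): take the path portion of the URL, flatten the separators, and invent a name when the URL carries no filename.

    // Illustrative sketch: flatten a URL path into a single filename so that
    // every HTML file can live in one directory.
    class NameEncoder {
        static String encode(java.net.URL url) {
            String path = url.getPath();
            if (path.length() == 0 || path.endsWith("/"))
                path = path + "index.html";       // URL without a filename: generate one
            // ../foo/foo.html or /image/bar.gif becomes a flat, legal filename
            return url.getHost() + "_" + path.replace('/', '_');
        }
    }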
Multi-threaded programming: I am a newcomer to the world of Java programming, so I spent a lot of time figuring out how to use Java's multi-threading features. In WCP, during the download process a window pops up to show the progress. In the window, there is a Cancel button to let the user stop the process, which can sometimes take too long to wait for. So I created two threads to do this: one for the dialog, the other for the ongoing download process. One thing I am still not clear about: I have two classes like this:
class A extends Thread { ... }
class B extends Thread { ... }

Then I write the following code:

A a = new A(...);
B b = new B(...);
a.start(); b.start();
Only A's run function is called; B's run function is never called. Multi-threaded programming is very powerful and useful, but it is still pretty complicated.
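For comparison, here is a minimal two-thread sketch in which both run methods do get invoked; the usual pitfalls in this situation are calling run() directly instead of start(), or not overriding run() in one of the subclasses.

    // Minimal sketch: two threads, each overriding run(), both started with start().
    class Downloader extends Thread {
        public void run() {
            System.out.println("download thread running");
            // ... fetch pages here ...
        }
    }

    class ProgressDialog extends Thread {
        public void run() {
            System.out.println("progress dialog thread running");
            // ... update the progress window, watch the Cancel button ...
        }
    }

    class ThreadDemo {
        public static void main(String[] args) {
            new Downloader().start();
            new ProgressDialog().start();   // both run() methods are invoked
        }
    }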
I implemented WCP in about one week of intensive work. Following is a screen shot of WCP. This version implements almost all of the functions proposed in the initial proposal. It can start downloading from a single URL and save the files to a local disk, or distribute them remotely by email or FTP. The user can run WCP in a batch mode, filling in a couple of fields and then just clicking the "Go" button to get the document, or in an interactive mode, checking every link to decide whether to keep it remote or download it; WCP provides a Sync Display function to display the actual page in a web browser. In either mode, WCP guarantees the integrity of anchors and links in every output HTML file.
There have been a couple of similar applications to collect sets of web documents,
such as WebRider or Snarfapp. They all just download
the HTML files as they are, and keep the directory structure. They are, in
some sense, XCOPYs for the Internet. WCP does a much more complicated
job than those existing applications. During the downloading process, it
keeps a central repository of all the files (HTML, GIF, JPG, etc.)
downloaded. When writing the output files, it changes the content
of the HTML files to make sure that links inside each page point to
the correct place. This also involves a rearrangement of the directory structure
of the files: all the images are stored in one directory. Besides this, WCP
also provides rich means of saving by incorporating a variety of Internet
protocols. Downloaded files can be saved to a local disk, or be zipped and
emailed or ftped to remote recipients who might not have the WCP utility. Based
on Java's AWT, WCP provides a friendly user interface which is very easy
to use. Because WCP is a multi-threaded program, the user can interrupt the
downloading process at will.
Of course, still being a prototype, WCP has some places that can be improved. For practical use, a configuration file is needed to keep track of the external programs WCP may invoke. Every single URL connection could be implemented as a separate thread, so WCP could retrieve pages in parallel, which would greatly speed up the downloading process. A more sophisticated set of search options may be useful for advanced users of WCP.
In this report, we introduced a software tool to copy, move and distribute compound documents over the web. It uses a HashTable indexed by URL as its temporary repository to store downloaded HTML pages, in order to avoid duplication of the same file and looping downloads. Compared with existing similar applications, it has two dominant features: guaranteed integrity of the hyperlinks inside downloaded pages, and rich means of distribution. It satisfies the specification described in the initial proposal.
Future improvements of the software may include recognition of more tags, multi-threaded downloading and expansion of the download rule set, to make WCP faster, more reliable and smarter.