|
This web page presents material related to the
publication Archiving websites. General considerations and
strategies by Niels Brügger (The Centre for Internet Research,
Aarhus 2005).
This book treats the micro archiving of websites, i.e. archiving by
researchers, students or others without special technical knowledge
who, using a standard computer, wish to save a website for further
study. The phenomenon is discussed from the standpoint that
Internet research must be able to stabilise and save the object of
its analysis. However, the Internet is endowed with certain
fundamental media-specific dynamics that make stabilisation
difficult. Based on an account and discussion of these dynamics
(linked as they are to sender, text and recipient) the following
double conclusion is reached.
|
|
|
Firstly, unlike other well-known media, the
Internet does not simply exist in a form suited to being archived,
but rather is first formed as an object of study in the archiving,
and it is formed differently depending on who does the archiving,
when, and for what purpose. Secondly, this means that there is an
element of subjective creation in the archived material, so that
methodological reflection is necessary: in other words, an account
of why and how the archived material has been created.
These conclusions form the starting point for the last section of
the book, which, based on comprehensive tests of archiving
software, discusses in depth the elements that can be included in
an archiving strategy.
|
The book is free of charge. As long as it is in print, copies may
be obtained by contacting cfi@imv.au.dk; please specify your
complete address. An electronic version can be downloaded from this
site (for the purpose of citation, please note that the printed and
electronic versions are identical).
|
|
|
Please address any comments on
the book or the website to nb@imv.au.dk.
|
A test of archiving software has been carried
out as a supplement to the book. The test was done by graduate
student Bo Hovgaard Thomasen, and its premises and main results are
explicated in the text Test of software and strategies for
micro-archiving websites
[download the text].
Note: We do not have the resources to offer technical support or
other advice on the use of the tested archiving programmes beyond
what can be found in the individual tests.
|
General conclusions
|
The test of 18 different software
programmes for micro-archiving websites, covering 4 archiving methods,
was carried out from July to December 2004 by graduate student Bo
Hovgaard Thomasen, against the background of Niels Brügger's book
Archiving websites. General considerations and
strategies (2005).
As regards software that archives
a 'complete website', we can conclude that the programmes
in the test that archive most completely are WebHTTrack 3.33-beta-3
and WinHTTrack 3.33-beta-3. DeepVacuum 1.24, wget 1.9, and
WebReaper 9.8 can also be used, but their archiving results have
more deficiencies than those of the first two programmes. There
are considerable differences between the programmes, in archiving
speed among other things, but they can all archive websites so that
they appear more or less as they do online. The exception to this
rule is that content elements requiring an online presence for
viewing cannot be archived using this method. It was found to be
advantageous to restrict link levels and domains (internal and
external) and to apply filtering, in order to ensure that archiving
was limited to the desired web pages.
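The three restrictions named above (link levels, domains and filtering) can be illustrated with a minimal Python sketch. It is not taken from any of the tested programmes; the function name `in_scope`, its parameters and the example domain are assumptions made purely for illustration.

```python
from urllib.parse import urlparse


def in_scope(url, depth, max_depth, allowed_domains, excluded_suffixes=()):
    """Decide whether a link found at a given link level should be archived.

    Hypothetical helper: depth is the number of links followed from the
    start page, allowed_domains lists the domains to stay within, and
    excluded_suffixes is a simple file-type filter.
    """
    # Restriction 1: link levels
    if depth > max_depth:
        return False

    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()

    # Restriction 2: domains (the domain itself or one of its subdomains)
    if not any(host == d or host.endswith("." + d) for d in allowed_domains):
        return False

    # Restriction 3: filtering on file type
    return not parsed.path.lower().endswith(tuple(excluded_suffixes))


if __name__ == "__main__":
    print(in_scope("http://www.example.org/index.html", 1, 2, ["example.org"]))
```

Each of the archiving programmes expresses these limits through its own settings; the sketch only shows the decision that such settings amount to.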
A further strong point of the five
above-mentioned programmes is that they can be used free of
charge and that they are continually being developed and
updated. The remaining programmes for archiving a complete website
that were tested cannot be recommended, either because they cannot
archive a sufficient number of content elements, so that the
archived website does not appear acceptably correct, or because
they archive in a format specific to one operating system and one
programme, or because their cost is prohibitive compared to their
archiving capability, especially considering that the first five
programmes are free.
All the programmes for archiving
individual web pages in a static state, screenshots, and screen
recordings can be used to archive. Some of the programmes are more
flexible than others, for instance as regards editing options, but
this is then usually reflected in their purchase cost. SnagIt 7.1.2
deserves special mention, as the programme can archive all of the
above while being fast and easy to use. However, it is one of the
most expensive programmes in the test. One programme, Web2Pic 1.1
('individual web page in a static state'), is unable to
archive all types of web pages.
|
Test results
|
Software for archiving complete websites
|
|
Adobe Acrobat Professional
6.01
|
Windows/
Mac OS X
|
5187.50 DKK; 30-day test version
can be downloaded for Windows
|
Adobe
Acrobat
Professional archives complete websites: the user keys in a URL,
after which the website is downloaded and saved. The strength of
Adobe Acrobat Professional lies in its archiving format,
'PDF' (Portable Document Format), which can be viewed on
all platforms and is probably relatively future-proof.
The archiving attained with this programme is not perfect: a
relatively large number of elements are often missing from the
archived web pages, and content elements are often incorrectly
positioned in the archived version, so that the
experience of the web page is far from the original
experience in the browser. Although some pages are archived
remarkably correctly, the archived material appears relatively
chaotic and incomplete. In addition, setting options are extremely
limited, so that the programme either easily
includes irrelevant web pages or fails to include a sufficient
number of pages. A further disadvantage is that the
programme first downloads all the archived material to the computer's
memory and only then stores it on the hard disk. It
should also be noted that Adobe Acrobat lays out the archived web pages
as paper pages, in formats such as A4, A3, etc., so that a web page
does not appear as it does in the browser. Finally, we must
conclude that the cost of Adobe Acrobat Professional is
very high compared to its ability to archive web pages.
see test details
|
|
DeepVacuum 1.24
|
Mac OS X
|
7.00 USD
|
DeepVacuum
(vers. 1.24) uses the wget 1.9 programme to archive
the source code and other content of web pages, and to convert
these elements to a navigable offline version. The programme cannot
archive content elements requiring an online connection for
viewing. The graphical interface makes the command-line based wget
less complicated to use, but at the same time it reduces
flexibility and configuration speed. Furthermore, archiving is
relatively slow. The archived websites often appear correct, but
the programme does not archive as correctly as, for instance,
WebHTTrack and WinHTTrack. DeepVacuum allows several websites to be
archived simultaneously.
see test details
|
|
Microsoft Internet Explorer
5.2.3
|
Mac OS X
|
Free
|
The Microsoft Internet Explorer
5.2.3 browser can archive up to five levels
of a website's hyperstructure. The web page is archived in the
'Web Archive' format (optional file name extension .waff),
which is both a platform-specific (Mac OS X) and programme-specific
format. Web pages are usually archived correctly. A disadvantage
is that web pages archived with this programme can
only be opened in Internet Explorer 5.2.3, and only
under Mac OS X. A further complaint is that archiving is
very slow and occasionally freezes, after which the process
must be started again.
see test details
|
|
Microsoft Internet Explorer
6.0
|
Windows
|
Free
|
The Microsoft Internet Explorer 6
browser can archive one level of a website,
i.e. a single web page. The web page is archived in the 'Web
Archive, single file' format (filename extension .mht), which
is both a platform-specific (Windows) and programme-specific
format. Web pages are usually archived correctly. A disadvantage
is that web pages archived with this programme can
only be opened in Internet Explorer 6, and only
under Windows.
see test details
|
|
MM3-WebAssistant
Private 2005
|
Java
|
Free; 'Professional'
version with additional options costs 29.95 EUR
|
MM3-WebAssistant
saves web pages visited in the browser in a cache for
offline use. The archived material often appears very accurate.
However, a disadvantage of this archiving programme is that the
archived web pages can only be viewed on a computer with
MM3-WebAssistant installed. This archiving programme is most
applicable in cases where the archived material consists of personal
(working) copies, rather than serving as documentation or
appendices. The programme can be used on all platforms, provided a
Java Virtual Machine is
installed (which is the case for most newer operating
systems).
see test details
|
|
WebHTTrack Website
Copier 3.33-beta-3
|
UNIX/
Mac OS X
|
Free
|
WebHTTrack
is
a command-line operated offline browser that can archive
websites' source code and other content, and
convert these elements so that the archived versions are
navigable. An advantage of archiving with HTTrack is that the
material is stored in the format in which it was written,
so that the archived pages can be viewed in the browser, and one can work
with the archived material just as with the online version. HTTrack
cannot archive material requiring an online connection for viewing
(typically chat, polls, test-yourself features, streamed elements, most
games). On the other hand, the programme converts
web pages remarkably well, so that links usually work internally in
the archived version, and web pages usually appear as they did
online, with the exception of the online elements mentioned.
Several archiving processes can be carried out at the same time
with this programme, and archiving can be automated via
scripts (see an example of an AppleScript that starts 25 simultaneous archiving
processes).
see test details
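As an illustration of running several archiving processes at once, here is a minimal Python sketch. It is not the AppleScript referred to above. It assumes that the `httrack` command-line programme is installed and on the path, and it uses only HTTrack's `-O` (output directory) and `-rN` (mirror depth) options; the function names, the `archive` directory and the example URLs are hypothetical.

```python
import subprocess
from urllib.parse import urlparse


def httrack_command(url, base_dir, depth):
    """Build the argument list for one HTTrack run (assumed invocation)."""
    host = urlparse(url).hostname or "site"
    # One output directory per host keeps simultaneous runs apart.
    return ["httrack", url, "-O", f"{base_dir}/{host}", f"-r{depth}"]


def archive_all(urls, base_dir="archive", depth=3):
    """Start one HTTrack process per URL, then wait for all of them."""
    processes = [
        subprocess.Popen(httrack_command(url, base_dir, depth)) for url in urls
    ]
    # Exit codes are returned in the same order as the URLs.
    return [process.wait() for process in processes]


if __name__ == "__main__":
    # Printing the commands lets them be checked before anything is downloaded.
    for site in ["http://www.example.org/", "http://www.example.net/"]:
        print(" ".join(httrack_command(site, "archive", 3)))
```

Calling `archive_all` with a list of URLs would start the runs in parallel. With many simultaneous processes, the speed of the network connection is likely to become the limiting factor.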
|
|
WebReaper 9.8
|
Windows
|
Free
|
WebReaper
archives websites' source code and other
elements, and converts these files so that the archived
elements can be used offline. Elements requiring an online
connection for viewing cannot be archived with this programme.
WebReaper archives rapidly, but individual pages are often missing
or appear defective. A further complaint against the programme is
that the material to be archived cannot be delimited in certain
ways; among other things, external web pages (domain
boundaries) can only be excluded from archiving with difficulty. In
spite of this, it should be noted that the programme is capable of
archiving many web pages correctly.
see test details
|
|
wget 1.9/
(+ wGetGUI 1.05)
|
UNIX/ Windows/
Mac OS X
|
Free
|
wget
archives
a copy of web pages' source code and other elements, and
converts the pages' links so that the copy can be used in an offline
version. Elements requiring an online connection for viewing cannot
be archived with wget. This test covers wget for MS-DOS and the
graphical interface wGetGUI (only for Windows). The programme
archives relatively correctly: many pages are correctly archived.
On the other hand, archiving speed is extremely low, which is a
serious deficiency in the programme. Several archiving processes
can be carried out at the same time, and archiving can be automated
with wget (using scripts or batch files).
see test details
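A minimal Python sketch of how such automation might look. It only builds and prints a wget command line rather than running it. The options used (`--recursive`, `--level`, `--page-requisites`, `--convert-links`, `--domains`, `--directory-prefix`) are standard wget options, but the function name, parameter values and example URL are assumptions for illustration and not the settings used in the test.

```python
import shlex


def wget_command(url, depth, domains, out_dir):
    """Build a wget argument list for a depth- and domain-limited archive."""
    return [
        "wget",
        "--recursive",                      # follow links from the start page
        f"--level={depth}",                 # limit the number of link levels
        "--page-requisites",                # also fetch images, stylesheets, etc.
        "--convert-links",                  # rewrite links for offline viewing
        f"--domains={','.join(domains)}",   # stay within the listed domains
        f"--directory-prefix={out_dir}",    # where the archived copy is stored
        url,
    ]


if __name__ == "__main__":
    command = wget_command("http://www.example.org/", 2, ["example.org"], "archive")
    # shlex.join quotes the arguments so the line can be pasted into a shell script.
    print(shlex.join(command))
```

The printed line could be placed in a script or batch file, or the list could be passed to `subprocess.run` to start the archiving directly.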
|
|
WinHTTrack Website
Copier 3.33-beta-3
|
Windows
|
Free
|
WinHTTrack
is
an offline browser that can
archive websites' source code and other content, and
convert these elements so that the archived versions are
navigable. An advantage of
archiving with WinHTTrack is that the material is stored
in the format in which it was written, so that the archived pages
can be viewed in the browser. WinHTTrack cannot
archive material requiring an online connection for viewing
(typically chat, polls, test-yourself features, streamed elements, most
games). On the other hand, the programme converts
web pages remarkably well, so that links usually work internally in
the archived version, and web pages usually appear as they did
online, with the exception of the online elements mentioned. It is
possible to carry out several archiving processes at the same time
with this programme. A command-line
version for MS-DOS is included in the WinHTTrack software
package, which makes it possible to automate the archiving process
with batch scripts.
see test details
|
|
Software for archiving individual web pages
in a static state
|
|
The function Save As PDF...
|
Mac OS X
|
Integral in Mac OS X
|
With the function
'Save As PDF', Mac OS X makes it possible to
print a web page to a PDF file instead of sending it to the printer. The
function is well suited to archiving the visual parts of a
web page, which then appear as a static snapshot of the web page.
The method cannot be used to archive dynamic elements. The (static)
visual parts of the archived web page appear remarkably correct with
this archiving function, and the only disadvantage is that the web
page is laid out in a printable format (A4, A3, etc.).
see test
details
|
|
Paparazzi! 0.1.8
|
Mac OS X
|
Free
|
Paparazzi!
is
a programme that can make a screenshot of a single web page. One
advantage of using this programme is that the page visually appears
exactly as experienced in the browser, with the exception of
dynamic elements, which are not archived by the programme. Another
strong point of Paparazzi! is that it is designed for the sole
purpose of archiving individual web pages. This solves one of the
problems with screenshot programmes, which is that they typically
only allow for shots of windows or areas of the screen, not the
whole web page. A disadvantage of the programme, and of
screenshots/screen recording in general, is that a person must be present
during the entire archiving process, as this is done manually, one
web page at a time.
see test details
|
|
PrimoPDF 1.0
|
Windows
|
Free
|
PrimoPDF
installs a printer that permits printing a web page as
a PDF file. The programme is suitable for use in archiving the
visual parts of a web page as static snapshots. Dynamic elements
are not archived with this method. The (static) visual parts of the
archived web page appear remarkably correct when archived using
PrimoPDF, although a functional disadvantage of the programme is
that it is often necessary to set the scale manually;
otherwise the full width of the web page is not included
in the archived version. It is a disadvantage that the web page is laid out
in printable format (A4, A3, etc.).
see test details
|
|
SnagIt
7.1.2
|
Windows
|
39.95 EUR
|
SnagIt
can
make a screenshot of an individual web page so that the archived
version visually appears the same as when seen in the browser,
except that the archived web page is static. A disadvantage of the
programme, and of screenshots/screen recording in general, is that a
person must be present during the entire archiving process, as this
takes place manually, one web page at a time. A further
disadvantage is the absence of sound, video and other dynamics in
the archived material. One great advantage of the programme is that
it can be integrated with the Internet Explorer 6.0 browser, making
archiving of a web page very simple.
see test details
|
|
Web2Pic 1.1
|
Windows
|
Free
|
Web2Pic
can
make a screenshot of an individual web page. The archived material
appears as a static snapshot of the web page, and dynamics are thus
not archived. The programme is not always able to archive a web
page (in this test this was the case for, among others, http://www.dr.dk/skum and http://tv2.dk/);
the archived page then appears severely
lacking. It should nevertheless be noted that the programme archives a great
number of web pages correctly. It is also an advantage that Web2Pic
is free.
see test details
|
|
Webkit2png 0.4
|
Mac OS X
|
Free
|
Webkit2png
can make a screenshot of an individual web page. An
advantage of using this programme to archive web pages is that the
page visually appears exactly as experienced in the
browser.
However, the programme does not archive dynamic
elements. A strong point of Webkit2png is that the programme is
operated from the command line, which enables automatic archiving
with the aid of scripts. See an example of a simple UNIX script for archiving 17 websites
or the script photourl.sh that archives the URLs
specified in a text file using the webkit2png programme.
see test details
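The idea of archiving the URLs listed in a text file can be sketched in Python as follows. This is not the photourl.sh script mentioned above. It assumes that `webkit2png` is on the path and accepts a URL as its only argument; the file name `urls.txt`, the function names and the convention of skipping lines that begin with `#` are assumptions made for illustration.

```python
import subprocess
from pathlib import Path


def parse_urls(text):
    """Return the URLs in a text, one per line, skipping blanks and # comments."""
    urls = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            urls.append(line)
    return urls


def screenshot_commands(urls):
    """Build one webkit2png invocation per URL (assumed basic usage)."""
    return [["webkit2png", url] for url in urls]


def archive_from_file(path):
    """Take a screenshot of every URL listed in the given text file."""
    urls = parse_urls(Path(path).read_text(encoding="utf-8"))
    for command in screenshot_commands(urls):
        # check=False: a page that fails does not stop the remaining ones.
        subprocess.run(command, check=False)


if __name__ == "__main__":
    print(screenshot_commands(parse_urls("http://www.example.org/\n")))
```

Calling `archive_from_file("urls.txt")` would process the list unattended, which is the main practical difference from the manual screenshot programmes described in this section.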
|
|
Software
for archiving
screenshots
|
|
The Print Screen utility
|
Windows
|
Integral in Windows
|
Windows
has a
built-in utility that makes a screenshot of the screen image
when the Print Screen hotkey is pressed. The utility archives
satisfactorily, but is inflexible: it is only possible to
capture the entire screen image, so the screenshot cannot be
limited to regions or windows. It is also a
disadvantage that two work processes are necessary before the
screenshot is stored on the hard disk (first taking the screenshot, then
saving it with the aid of a photo-editing programme).
see test details
|
|
Grab 1.2
|
Mac OS X
|
Included with Mac OS X
|
Grab
(Danish:
'Skærmbillede') can photograph the entire screen image,
regions or windows. It is also possible to photograph with a time
delay. The utility archives satisfactorily, although it is not
possible to choose the image format; archiving is
always done in the uncompressed TIFF format.
see test details
|
|
SnagIt 7.1.2
|
Windows
|
39.95 EUR
|
SnagIt
can
make a screenshot of the full screen image, objects on the screen
(such as windows), or areas of the screen image. The programme
offers a high degree of flexibility, with editing options and a
choice of archiving format as well as defining keyboard shortcuts
for various archiving methods. The archived web pages (of which
parts are visible on the screen at the time of archiving) are displayed
correctly. The programme also includes an integrated image browser
in which the archived screenshots can be reviewed.
see test details
|
|
Snapz Pro X 2.0
|
Mac OS X
|
69.00 USD
|
Snapz
Pro X
can make a screenshot of the entire screen image, objects on the
screen such as windows, or areas of the screen image. The programme
offers a high degree of flexibility, among other things the option
of choosing archiving format, scale, frame, colour palette, and a
preview of the screenshot before archiving. The archived web pages
(of which parts are visible on the screen during archiving) are
viewed correctly.
see test details
|
|
Software
for archiving screen
recordings
|
|
SnagIt 7.1.2
|
Windows
|
39.95 EUR
|
SnagIt can be used to record what
is happening in an area of the screen image, an object on the
screen, such as a window, or the entire screen image, as well as
recording any sound occurring while recording is taking place. The
programme offers a high degree of flexibility, with among other
things, the option of choosing archiving format and quality as well
as defining keyboard shortcuts for various archiving methods. The
archived web pages and content elements appear correctly. One of
the programme's strong points is that it compresses the
archived material very rapidly.
see test details
|
|
Snapz Pro X
2.0
|
Mac OS X
|
69.00 USD
|
Snapz Pro X can be used to record
what is taking place in an area of the screen image, an object on
the screen such as a window, or the entire screen image, as well as
recording any sound occurring while recording is taking place. The
programme offers a high degree of flexibility, among other things
the option of choosing archiving format and quality. The archived
web pages or content elements appear correctly; all types of
content elements are correctly archived. A disadvantage of the
programme is that it stores and compresses the recorded content
elements or web pages very slowly; in one case in this test, it
took approx. 45 min. to compress a 15 min. recording (30 fps, 22500
KHz mono sound). New recordings cannot be made during the
compression and storage process.
see test
details
|
|