Thursday, October 14, 2004

Slurping Kho Ping Ho's Novel Books

Are you one of the many Indonesians who love reading Kho Ping Ho's classic martial arts novels? If so, you probably already know that an online edition of his novels is being published. As I write this post, the site is publishing "Harta Karun Jenghis Khan".

Unfortunately, the site publishes only a few pages of the current novel each day (although past novels are archived there), and one album can run to hundreds of these pages. I am far too lazy to visit the site every day. If you are like me, I have written two simple scripts that download a whole album so it can be read offline. Note that to get a complete novel, you have to wait until the last episode has been published. Currently, the default link for the site you need to pass to the geturl.tcl script is [ name of the album ] /episode1.shtml

To run the script:

- do: geturl.tcl [albumname]/episode1.shtml. For example: geturl.tcl
- type: merge.tcl
- Enter the number of episodes (the number of episode*.shtml files you just downloaded, or any sufficiently large number such as 5000)
- The result will be: albumname.html, and there will be a directory called "images" which will contain all the pictures (if any).
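
If you would rather not guess the episode count, a small helper can work it out from the downloaded files. This is only a sketch: the proc name highestEpisode is made up for illustration, and it assumes the files are named episode<N>.shtml as described above.

```tcl
# Hypothetical helper (not part of the original scripts): return the highest
# N for which a file named episode<N>.shtml exists in the given directory.
proc highestEpisode {{dir .}} {
    set max 0
    foreach f [glob -nocomplain -directory $dir -tails episode*.shtml] {
        # Pull the number out of the file name; skip names that do not fit.
        if {[regexp {^episode(\d+)\.shtml$} $f -> num]} {
            # scan with %d avoids reading numbers such as 08 as octal
            scan $num %d num
            if {$num > $max} { set max $num }
        }
    }
    return $max
}

puts "Highest episode found: [highestEpisode]"
```

The number it prints is what you would type at the "Number of files" prompt of merge.tcl. It reports the highest episode number rather than the file count, so a gap in the numbering does not cut the merge short.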

OK, enough talk. Save the following script as geturl.tcl:

#----------- geturl.tcl -------------------
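#!/bin/sh
# The next line restarts the script with tclsh \
exec tclsh "$0" ${1+"$@"}

# Print a usage message and stop if no URL was given
if {[lindex $argv 0] eq ""} {
    puts "Usage: $argv0 <url>"
    exit 1
}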

set url [lindex $argv 0]
set urlpath [file dirname $url]
set logfile "[file tail $urlpath]\.log"

puts "Getting $url ..."
puts "log file: $logfile"
puts "You can see the progress by typing \"tail -F $logfile\""

set par "-nv --force-html --tries=0 --cache=on --convert-links --recursive --accept=shtml \
--no-directories --glob=on -L -p -m --page-requisites -np -nd -o $logfile $url"

set res [ exec sh -c "wget $par" ]

#----------------------- end of geturl.tcl -----------------------
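
As an aside, the wget options can also be kept in a Tcl list and passed to exec directly, which avoids the extra sh -c and any shell quoting trouble with odd characters in the URL. The sketch below is one way to do that, not the script as published: buildWgetArgs is a made-up name, the duplicated flags (-p/--page-requisites and -nd/--no-directories) are written once, and I have not checked every flag against current wget releases.

```tcl
# Sketch only: build the wget arguments as a Tcl list instead of one string.
# The flags are copied from geturl.tcl; buildWgetArgs is a hypothetical name.
proc buildWgetArgs {url} {
    # Same log file naming as geturl.tcl: <last directory of the URL>.log
    set logfile "[file tail [file dirname $url]].log"
    return [list -nv --force-html --tries=0 --cache=on --convert-links \
        --recursive --accept=shtml --no-directories --glob=on \
        -L -p -m -np -o $logfile $url]
}

# Usage (needs Tcl 8.5 or later for the argument expansion syntax):
#   if {[catch {exec wget {*}[buildWgetArgs $url]} msg]} {
#       puts stderr "wget failed: $msg"
#   }
```

Wrapping the exec in catch also keeps the script from dying with a raw Tcl error when wget exits with a non-zero status.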

and the following as merge.tcl:

#----------------------- start of merge.tcl ----------------------

# Print a prompt and return the line the user types
proc AskAndGet { msg } {
    puts -nonewline $msg
    flush stdout
    return [gets stdin]
}

set n [AskAndGet "Number of files: "]
# Use the name of the current directory as the title
set title [file tail [pwd]]

puts "Title = $title"
set fho [open "merged.html" w]
# The HTML tags originally written here are missing from this listing;
# the minimal header below is an assumption.
puts $fho "<html>"
puts $fho "<head>"
puts $fho "<title>$title</title>\n</head>\n"
puts $fho "<body>"

# NOTE: several patterns below matched HTML markers that are missing from this
# listing (the empty {} and { } patterns, the start of the .jpg pattern, and
# the base URL in imgurl). They are left as they appear here and must be
# filled in before the script will work.
for {set i 1} {$i <= $n} {incr i} {
    set fn "episode${i}.shtml"
    if {[file exists $fn]} {
        puts "File $fn exists...wait while I merge it ...."
        set fhi [open $fn r]
        set line [gets $fhi]
        set line [string trim $line]
        set line "$line\n"
        # Skip everything up to the start marker
        while {![eof $fhi] && ![regexp -nocase {} $line]} {
            set line [gets $fhi]
            set line [string trim $line]
            set line "$line\n"
        }
        ## found the start point, now read it until we find
        while {![eof $fhi] && !([regexp -nocase { } $line] || [regexp -nocase { } $line])} {
            set line [gets $fhi]
            # Drop the site's "episode not available" notice
            if {[regexp -nocase {Episode belum ada atau sudah habis} $line]} {
                continue
            }
            if {[regsub -all {\xC2} $line "" line]} {
                puts "0xC2 found and been removed"
                #exit
            }
            # Replace curly open/close quotes with plain double quotes
            if {[regsub -all {[\x93]} $line {"} line]} {
                #puts "OPENQUOTE: \{$line\}"
            }
            if {[regsub -all "\x94" $line {"} line]} {
                #puts "CLOSEQUOTE: \{$line\}"
            }
            # Fetch any picture the line refers to and point the line at ./images
            if {[regexp -nocase {*).jpg} $line dummy imgname]} {
                if {![file exists "./images"]} {
                    file mkdir "./images"
                }
                set imgname "${imgname}.jpg"
                if {![file exists "./images/$imgname"]} {
                    puts "Downloading picture: $imgname"
                    set imgurl "$imgname"
                    exec wget -q $imgurl
                    if {[file exists $imgname]} {
                        exec mv $imgname "./images"
                    }
                } else {
                    puts "$imgname exists in ./images; not downloaded"
                }
                regsub -nocase {*).jpg} $line "./images/${imgname}" line
            }
            puts $fho $line
            #puts $line
            if {[regexp -nocase {} $line] || [regexp -nocase {.*>[ ]*[ ]*TAMAT[ ]*} $line]} {
                puts "End of episode $fn"
            }
        }
        close $fhi
    } else {
        ; #puts "File $fn does not exist"
    }
}

# Close the document (assumed footer, matching the assumed header above)
puts $fho "</body>\n</html>"
close $fho

#------------------------ end of merge.tcl -----------------------
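
The character clean-up in merge.tcl (dropping the 0xC2 bytes and turning the 0x93/0x94 curly quotes into plain double quotes) can also be written as one small proc. Here is a minimal sketch, assuming those three byte values are the only ones that need handling; cleanLine is a hypothetical name, and string map is swapped in for the three regsub calls.

```tcl
# Sketch of the clean-up step as a standalone proc (cleanLine is a made-up name).
proc cleanLine {line} {
    # \xC2 is dropped; \x93 and \x94 (Windows-1252 curly quotes) become a plain ".
    return [string map [list \xC2 "" \x93 \" \x94 \"] $line]
}

puts [cleanLine "\x93Halo\x94 kata Suma Han."]
```

string map does all the substitutions in a single pass over the line, which keeps the loop body in merge.tcl a little shorter if you decide to use it there.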

