TOC of the series
I finish this series with a couple of new modules (each on its own file) that minimally accomplish the rest of the sub-tasks mentioned in the first post of the series.
Let's recall those tasks this time with a reference to the responsible module.
- Read the text contained in the pdf pages
- Search for the ISBN on those pages
- Make a query for the ISBN to a remote server that provides information about books.
- Parse the response of the server to represent it as a Racket data type.
The first and second tasks above are carried out by "pdf-read.rkt" with the aid of "isbn.rkt".
The third and forth tasks, in turn, are solved by "book-info.rkt".
"pdf-isbn" does the same thing as "isbn" but over pdf files. It uses the latter along with the pdf-read
package, a Racket interface to the popular libpoppler
library, available by default on Linux and, to my knowledge, on MacOS systems. You probably need to install pdf-read
before running the code. Don't worry, installing packages from DrRacket is very easy and self-explanatory.
As for relevant Racket constructs used in "pdf-read" it is worth to mention error
, that raises an exception when the pdf file doesn't exist. For testing exceptions you should use check-exn
from rackunit
.
Another interesting divergence between Racket and *SL languages has to do with or
and conditionals in general. In Racket everything there except#false
is treated as #true
. This fact allows us to use something like:
(or (extract/format 'isbn-13) (extract/format 'isbn-10)))
The module "book-info" requests information to a remote provider (via functions belonging to the net/url
package) to get bibliographic information about a book and parses that information in two phases: 1. parses the JSON response via read-json
(from the json
package), and 2. parses the output of read-json
into the structure book-info
. For the second phase I have resorted to the package json-pointer
that you may need to install on your system.
Note that only the Open Library provider is supported. Other providers like Google Books, The Library of Congress, etc. could be also supported in a similar way.
A few new frequently used Racket constructs applied by "book-info" are define-values
, which allows to bind multiple identifiers at once, and with-handlers
, that allows to handle an exception as wished.
; ---------------------------------------------------------- ; pdf-isbn.rkt ; ---------------------------------------------------------- #lang racket/base (require racket/contract) ; ---------------------------------------------------------- (provide (contract-out ; extracts all isbn's from the given pdf document ; raises exception when the pdf file does not exist [extract-isbn-from-pdf/list (-> pdf-document? (listof isbn?))] ; extracts the first isbn from the given pdf document, if any ; favors isbn-13 over isbn-10 ; raises exception when the pdf file does not exist [extract-isbn-from-pdf (-> pdf-document? (or/c isbn? #f))])) ; ---------------------------------------------------------- (require (only-in racket/function curry)) (require (only-in racket/port call-with-input-string)) (require pdf-read) (require "isbn.rkt") (module+ test (require rackunit) (require (only-in racket/function thunk)) ; examples for tests (define pdf-with-isbn "test-isbn-examples.pdf") (define pdf-with-isbn-10-only "test-with-isbn-10-only.pdf") (define pdf-without-isbn "test-without-isbn.pdf")) ; ---------------------------------------------------------- ; PDF-Document -> [List-of ISBN] (module+ test (check-exn exn:fail? (thunk (extract-isbn-from-pdf "hi.pdf"))) (check-equal? (extract-isbn-from-pdf/list pdf-without-isbn) '()) (check-equal? (extract-isbn-from-pdf/list pdf-with-isbn) (list "0262062186" "026256114X" "1593274912" "9781593274917" "0201896834" "9780201896831" "0262062186" "026256114X" "0262062186" "0201896834" "9780201896831" "026256114X" "9780201896831"))) (define (extract-isbn-from-pdf/list f) (check-input f 'extract-isbn-from-pdf/list) (for/fold ([isbns '()]) ([pg# (in-range (pdf-count-pages f))]) (append isbns (call-with-input-string (page-text (pdf-page f pg#)) isbn-find/list)))) ; PDF-Document -> [Maybe ISBN] (module+ test (check-exn exn:fail? (thunk (extract-isbn-from-pdf "hi.pdf"))) (check-equal? (extract-isbn-from-pdf pdf-with-isbn) "9781593274917") (check-equal? (extract-isbn-from-pdf pdf-with-isbn-10-only) "0262062186") (check-false (extract-isbn-from-pdf pdf-without-isbn))) (define (extract-isbn-from-pdf f) (check-input f 'extract-isbn-from-pdf) (define (extract/format format) (for/or ([pg# (in-range (pdf-count-pages f))]) (call-with-input-string (page-text (pdf-page f pg#)) (curry isbn-find #:format format)))) (or (extract/format 'isbn-13) (extract/format 'isbn-10))) ; PDF-Document Symbol -> PDF-Document ; effect: report error for src if f doesn't exist (module+ test (check-exn exn:fail? (thunk (check-input "not-avail.pdf"))) (check-pred file-exists? (check-input pdf-with-isbn 'fun))) (define/contract (check-input f src) (-> pdf-document? symbol? pdf-document?) (unless (file-exists? f) (error src "~s not found" f)) f)
; ---------------------------------------------------------- ; book-info.rkt ; ---------------------------------------------------------- #lang racket/base (require racket/contract) ; ---------------------------------------------------------- (provide (contract-out ; record of book information [struct book-info ([isbn isbn?] [authors (listof string?)] [date string?] [title string?] [places (listof string?)] [publishers (listof string?)])] ; gets information about a book of given isbn from given provider ; produces a book-info with only the isbn filled if no info avail [book-retrieve-info (-> isbn? provider? book-info?)] ; determines whether the given is a provider of book information [provider? predicate/c])) ; ---------------------------------------------------------- (require (only-in racket/function curry)) (require (only-in racket/match match)) (require (only-in racket/string string-replace)) (require (only-in json read-json jsexpr?)) (require (only-in json-pointer json-pointer? json-pointer-expression? json-pointer-value)) (require (only-in net/url call/input-url get-pure-port string->url url?)) (require (only-in "isbn.rkt" isbn?)) (module+ test (require rackunit) (require (only-in racket/function thunk))) ; ---------------------------------------------------------- ; Data Types (struct book-info [isbn authors date title places publishers] #:transparent) (define book-info-template (book-info "" '() "" "" '() '())) ; Providers: ; ol: openlibrary ; gb: google-books (not implemented) ; ... (define (provider? v) (or (equal? v 'ol) (equal? v 'gb))) ; ---------------------------------------------------------- ; Book Info Builder ; ISBN Provider -> Book-Info ; TODO: test (define (book-retrieve-info isbn provider) (define-values (request reader parser) (match provider ['ol (values (book-query isbn uri-ol) read-json (curry parse/ol isbn))] [_ (error "Not implemented")])) (call/input-url request get-pure-port (compose1 parser reader))) ; ISBN String -> Url ; produces the url from given template and isbn (module+ test (check-equal? (book-query "0262062186" "http://example.com?id:$$isbn$$") (string->url "http://example.com?id:0262062186"))) (define/contract (book-query isbn url) (-> isbn? string? url?) (string->url (string-replace url "$$isbn$$" isbn))) ; ---------------------------------------------------------- ; Providers ; - Open Library (ol) ; query template (define uri-ol (string-append "http://openlibrary.org/api/books?bibkeys=ISBN:$$isbn$$" "&format=json" "&jscmd=data")) ; parser ; ISBN JSExpr -> Book-Info ; parses the given jsexpr to get book info from OL ; produces a book info with only the isbn filled when OL knows ; nothing about it ; TODO: test (define/contract (parse/ol isbn jsexpr) (-> isbn? jsexpr? book-info?) (define (build-book-info) (define base-point (symbol->string (hash-iterate-key jsexpr 0))) (define authors (json-pointer-value/index (cons base-point '("authors")) '("name") jsexpr)) (define date (json-pointer-value (cons base-point '("publish_date")) jsexpr)) (define title (json-pointer-value (cons base-point '("title")) jsexpr)) (define places (json-pointer-value/index (cons base-point '("publish_places")) '("name") jsexpr)) (define publishers (json-pointer-value/index (cons base-point '("publishers")) '("name") jsexpr)) (book-info isbn authors date title places publishers)) (define (build-book-info/not-avail) (struct-copy book-info book-info-template [isbn isbn])) (match jsexpr [(? hash-empty?) (build-book-info/not-avail)] [_ (build-book-info)])) ; ---------------------------------------------------------- ; Helpers (extending json-pointer) ; [JSON-Pointer | JSON-Pointer-Expr] JSExpr -> JSExpr ; wrapper to produce #f when json-pointer raises an exception (module+ test (check-equal? (json-pointer-value/false "/a" (hash 'a 1)) (json-pointer-value "/a" (hash 'a 1))) (check-false (json-pointer-value/false "/a" (hash 'b 2)))) (define/contract (json-pointer-value/false jp jsexpr) (-> (or/c json-pointer? json-pointer-expression?) jsexpr? jsexpr?) (with-handlers ([exn:fail? (lambda (e) #f)]) (json-pointer-value jp jsexpr))) ; JSON-Pointer-Expr JSON-Pointer-Expr JSExpr -> JSExpr ; gets all the values at /pre/n/post in jsexpr for all legal n's (module+ test ;json example (define jse-ex '#hasheq((a . #hasheq((b . (#hasheq((x . 1) (y . 2)) #hasheq((x . 3) (y . 4)))) (c . (#hasheq((y . 0)))))))) (check-equal? (json-pointer-value/index '("a" "b") '("y") jse-ex) '(2 4)) (check-equal? (json-pointer-value/index '("a" "c") '("y") jse-ex) '(0))) (define/contract (json-pointer-value/index pre post jsexpr) (-> (or/c json-pointer? json-pointer-expression?) (or/c json-pointer? json-pointer-expression?) jsexpr? jsexpr?) (for*/list ([n (in-naturals)] [v (in-value (json-pointer-value/false (append pre (list (number->string n)) post) jsexpr))] #:break (not v)) v))