martes, 16 de abril de 2019

From HtDP to Racket. ISBN extraction in ISL+

WARNING: This series is a work in progress, both text and code can change over time.

If you have read the book How to Design Programs (HtDP) you already know some parts of Racket, the language on which BSL, ISL, etc. are built. And you may wish to get more stuff about Racket itself and find out how to translate your HtDP skills into full Racket.

You could read the book Realm of Racket, where Racket is introduced as needed in a friendly setting familiar to HtDP readers. You could even try The Racket Guide, a gently and complete introduction to the language. But you may miss something in the middle, something not so terse as The Guide, written from (professional) programmers to (professional) programmers, but more systematic than the Realm.

That's the intent of this series of posts. Its purpose is to show by means of an example how to start the transition from HtDP languages (in particular, ISL+) to Racket. At least, what new things to consider when working directly in Racket.

An example cannot be complete at all, though. Racket is a very rich language, and sooner than later you'll need to go deeply into The Guide and The Racket Reference. So take them as a first step into Racket for whom already masters HtDP.

A public repository for the code presented in this series is available at FromHtDPtoRacket (git repo)

The sample problem we are going to tackle says as follows:

Create a program to obtain bibliographic information about pdf books. It is assumed that those pdfs are readable (meaning that they are not made of photos of pages) and, for simplicity, unencrypted.

This task can be naturally designed as a composition of the following sub-tasks:

  1. Read the text contained in the pdf pages.
  2. Search for the ISBN on those pages.
  3. Retrieve from a remote provider information about the book with that ISBN.
  4. Parse the response to represent it as a Racket data type.

The second sub-task will be implemented entirely in ISL+, with the aid of a few Racket batteries. From the ISL+ implementation a Racket version will be derived. How to go from the ISL version to the Racket version is the main subject of these posts.

For completeness, an initial solution to the rest of the steps will be given directly in Racket.

Regarding the exposition I don't want to be verbose in describing all Racket constructs presented. The aim is not to explain Racket. Instead, I'll introduce those constructs when needed, and point to places in the Racket documentation that explain them.

Let's get started with the second step, the extraction of the ISBN, if any, from an input string.

This involves, of course, domain knowledge. See https://www.isbn-international.org/content/isbn-users-manual for more information (specifically, Section 5).

In summary, an ISBN can be of two types (formats): ISBN-10 and ISBN-13. The number itself is a string of 13 or 10 digits (from '0' to '9'), respectively. ISBN-10 accepts 'X' as last digit, too. An ISBN-10 number consists of four groups of digits, three of them of variable length (the first group can have up to 5 digits; the second up to 7 digits; the third up to 6 digits), the last group is a single digit. In turn, an ISBN-13 has one more group, a prefix of 3 digits, currently: 978 or 979. Groups may appear separated by hyphens or spaces. So we could see forms like the following:

0262062186 
0-262-06218-6 
0 262 06218 6 
9780201896831 
978-0-201-89683-1 
978 0 201 89683 1 
978-0201896831

In printed books, that number is typically preceded by an identifier that can appear under different guises:

ISBN 
ISBN: 
ISBN-10 
ISBN-13 
ISBN-10: 
ISBN-13:

followed by a space before the number itself.

Additionally, a string matching that format is an actual ISBN if it passes a validity check, that involves an arithmetic operation. See the ISBN User Manual cited above for a description of the algorithm.

Although it is possible to devise a pure ISL+ implementation, it is better to resort to Racket regular expressions support, if only because the resulting Racket version will share the very same design with the ISL-with-regexes version. Otherwise, the ISL code would be quite different, and the translation into Racket not so seamless.

Regular expressions support is in the core of Racket, so we need to require racket/base to use it. A warning, though: don't require racket/base in your ISL projects. if you really need it, just get into full Racket.

We will also require a couple of libraries to apply a few functions: file->string from racket/file for some test cases (you could use in place the function read-file from 2htdp/batch-io), and, for brevity, three string functions, that you could design on your own by following the HtDP recipe (string-split, string-replace, and string-normalize-spaces).

The first section of the code is, as HtDP mandates, a description of the data types we will use to represent the information at hand.

Next, patterns and regular expressions embedding the description of ISBN forms described above are created.

For details about regular expressions patterns and the functions regexp-match-exact? and regexp-match* see Regular Expressions [Racket Guide], and Regular Expressions [Racket Reference]

The next part contains predicates concerning all the data types of the first section. In fact, those predicates should be enough for any programmer as a precise description of the data types involved.

The rest proposes an implementation. It begins with functions to validate isbn-string's according to the required mathematical algorithm. (Note, by the way, that isbn-checksumf is an abstract function obtained by following the design recipe explained in HtDP/2e (Part III))

The main functions are isbn-find/list and isbn-find, which produce a list of all isbn's found in a string, or the first isbn of a given isbn-format found in a string, respectively. The basic strategy is to search for sub-strings in all the lines of the input that look like an ISBN, and select from the candidates those that, previously normalized, are actual ISBNs.

This is the code:

; ----------------------------------------------------------
; isbn-isl-with-batteries.rkt
; ----------------------------------------------------------

(require racket/base) ;for regexes support
(require racket/string) ;string-normalize-spaces
                        ;string-replace
                        ;string-split
(require racket/file) ;file->string (for tests)

; ----------------------------------------------------------
; Data Types

; An ISBN is one of:
; - ISBN-13
; - ISBN-10

; An ISBN-String is one of:
; - ISBN-13-String
; - ISBN-10-String

; An ISBN-13 is an valid ISBN-13-String
; An ISBN-10 is an valid ISBN-10-String
; where valid means that the numbers in that string
; fullfil a certain mathematical computation. (See
; ISBN User Manual for information about this computation)
(define isbn-13-ex "9781593274917")
(define isbn-10-ex "0262062186")

; An ISBN-13-String is a String consisting of 13 digits,
; where a digit is one of '0', '1', ..., '9'.
(define isbn-13-str-ex "1234567890123")

; An ISBN-10-String is a String consisting of 9 digits,
; and a last letter that can be a digit or 'X'.
(define isbn-10-str-ex-1 "1234567890")
(define isbn-10-str-ex-2 "123456789X")

; An ISBN-Format is one of:
; - 'isbn-13
; - 'isbn-10

; ----------------------------------------------------------
; Patterns and Regexes
; [See ISBN International User Manual 7e. Sect. 5]

; - Pattern Components
(define pat-isbn-sep "[ -]")
(define pat-isbn-id "ISBN(-1[03])?:? ")
(define pat-isbn-13-id "(?:ISBN(?:-13)?:? )?")
(define pat-isbn-10-id "(?:ISBN(?:-10)?:? )?")
(define pat-isbn-prefix "97[89][ -]")
(define pat-isbn-registration "\\d{1,5}[ -]")
(define pat-isbn-registrant "\\d{1,7}[ -]")
(define pat-isbn-publication "\\d{1,6}[ -]")
(define pat-isbn-13-check "\\d")
(define pat-isbn-10-check "[X\\d]")

; - Look ahead to ISBN groups
(define pat-isbn-13-look-ahead
  (string-append "(?=" pat-isbn-prefix pat-isbn-registration ")"))

(define pat-isbn-10-look-ahead
  (string-append "(?=" pat-isbn-registration ")")) 

; - Main patterns
(define pat-isbn-13/groups
  (string-append pat-isbn-13-id
                 pat-isbn-13-look-ahead
                 pat-isbn-prefix
                 pat-isbn-registration
                 pat-isbn-registrant
                 pat-isbn-publication
                 pat-isbn-13-check))

(define pat-isbn-10/groups
  (string-append pat-isbn-10-id
                 pat-isbn-10-look-ahead
                 pat-isbn-registration
                 pat-isbn-registrant
                 pat-isbn-publication
                 pat-isbn-10-check))

(define pat-isbn-13/prefix
  (string-append pat-isbn-13-id pat-isbn-prefix "\\d{10}"))

(define pat-isbn-13-norm "\\d{13}")

(define pat-isbn-10-norm "\\d{9}[X\\d]")

(define pat-isbn-13
  (string-append pat-isbn-13-norm "|"
                 pat-isbn-13/prefix "|"
                 pat-isbn-13/groups))

(define pat-isbn-10
  (string-append pat-isbn-10-norm "|"
                 pat-isbn-10/groups))

(define pat-isbn
  (string-append pat-isbn-13 "|"
                 pat-isbn-10))

; - Regexes
(define re-isbn-id (regexp pat-isbn-id))
(define re-isbn-sep (regexp pat-isbn-sep))
(define re-isbn-13 (pregexp pat-isbn-13))
(define re-isbn-10 (pregexp pat-isbn-10))
(define re-isbn-13-norm (pregexp pat-isbn-13-norm))
(define re-isbn-10-norm (pregexp pat-isbn-10-norm))
(define re-isbn (pregexp pat-isbn))

; ----------------------------------------------------------
; Predicates

(check-expect (isbn? "9781593274917") #t)
(check-expect (isbn? "0262062186") #t)
(check-expect (isbn? #f) #f)
  
(check-expect (isbn-13? "9781593274917") #t)
(check-expect (isbn-13? "0262062186") #f)
(check-expect (isbn-13? "") #f)
  
(check-expect (isbn-10? "9781593274917") #f)
(check-expect (isbn-10? "0262062186") #t)
(check-expect (isbn-10? 1) #f)
  
(check-expect (isbn-string? "9781593274912") #t)
(check-expect (isbn-string? "026206218X") #t)
(check-expect (isbn-string? "97815932749122") #f) ;too long
(check-expect (isbn-string? "978159327491") #f)   ;too short
(check-expect (isbn-string? "0262062189X") #f)    ;too long
(check-expect (isbn-string? "026206218") #f)      ;too short
(check-expect (isbn-string? "0-262-06218-6") #f)
  
(check-expect (isbn-13-string? "9781593274912") #t)
(check-expect (isbn-13-string? "97815932749122") #f) ;too long
(check-expect (isbn-13-string? "978159327491") #f)   ;too short
(check-expect (isbn-13-string? #f) #f)
  
(check-expect (isbn-10-string? "026206218X") #t)
(check-expect (isbn-10-string? "0262062189X") #f) ;too long
(check-expect (isbn-10-string? "026206218") #f)   ;too short
(check-expect (isbn-10-string? #f) #f)
  
(check-expect (isbn-format? 'isbn-13) #t)
(check-expect (isbn-format? 'isbn-10) #t)
(check-expect (isbn-format? "isbn-10") #f)
  
(define (isbn? v)
  (or (isbn-13? v) (isbn-10? v)))

(define (isbn-13? v)
  (and (isbn-13-string? v) (isbn-13-valid? v)))

(define (isbn-10? v)
  (and (isbn-10-string? v) (isbn-10-valid? v)))

(define (isbn-string? v)
  (or (isbn-13-string? v) (isbn-10-string? v)))

(define (isbn-13-string? v)
  (and (string? v) (regexp-match-exact? re-isbn-13-norm v)))

(define (isbn-10-string? v)
  (and (string? v) (regexp-match-exact? re-isbn-10-norm v)))

(define (isbn-format? v)
  (or (equal? v 'isbn-13) (equal? v 'isbn-10)))

; ----------------------------------------------------------
; ISBN Validation

; ISBN-13-String -> Boolean
; is the given isbn-13 string a valid isbn-13

(check-expect (isbn-13-valid? "9781593274917") #t)
(check-expect (isbn-13-valid? "9780201896831") #t)
(check-expect (isbn-13-valid? "9781593274912") #f)
(check-expect (isbn-13-valid? "9780201896834") #f)

(define (isbn-13-valid? isbn-str)
  (isbn-checksumf (isbn-string->numbers isbn-str)
                  '(1 3 1 3 1 3 1 3 1 3 1 3 1)
                  10))

; ISBN-10-String -> Boolean
; is the given isbn-10 string a valid isbn-10

(check-expect (isbn-10-valid? "0262062186") #t)
(check-expect (isbn-10-valid? "026256114X") #t)
(check-expect (isbn-10-valid? "026206218X") #f)
(check-expect (isbn-10-valid? "0262561141") #f)

(define (isbn-10-valid? isbn-str)
  (isbn-checksumf (isbn-string->numbers isbn-str)
                  (range 10 0 -1)
                  11))

; [List-of N] [List-of N] N -> Boolean
; abstract checksum algorithm for isbn validation

(check-expect (isbn-checksumf '(9 7 8 1 5 9 3 2 7 4 9 1 7)
                              '(1 3 1 3 1 3 1 3 1 3 1 3 1)
                              10) #t)
(check-expect (isbn-checksumf '(0 2 6 2 0 6 2 1 8 10)
                              (range 10 0 -1)
                              11) #f)

(define (isbn-checksumf multiplicands multipliers mod)
  (local [(define sum
            (foldl + 0 (map * multiplicands multipliers)))]
    (zero? (modulo sum mod))))

; ISBN-String -> [List-of N]
; translates str into the numbers their isbn letters represent

(check-expect (isbn-string->numbers "026256114X")
              '(0 2 6 2 5 6 1 1 4 10))

(define (isbn-string->numbers str)
  (local [(define (isbn-digit->number d)
            (cond
              [(string=? d "X") 10]
              [else (string->number d)]))]
    (map isbn-digit->number (explode str))))

; ----------------------------------------------------------
; ISBN Extraction

; String -> [List-of ISBN]
; extracts all isbns from str
  
(check-expect (isbn-find/list "") '())
(check-expect (isbn-find/list "none") '())
(check-expect
 (isbn-find/list (file->string "test-isbn-examples"))
 (list
  ;isbn normalized
  "0262062186" "026256114X" "1593274912"
  "9781593274917" "0201896834" "9780201896831"
  ;isbn w/ several id's a sep's
  "0262062186" "026256114X" "0262062186" "0201896834"
  "9780201896831" "026256114X" "9780201896831"))

(define (isbn-find/list str)
  (local [(define candidates
            (foldr append '()
                   (map isbn-match* (string-split str "\n"))))]
    (filter isbn? (map isbn-normalize candidates))))

; String ISBN-Format -> [Maybe ISBN]
; extracts the first isbn of given format from str, if any

(check-expect (isbn-find "" 'isbn-13) #f)
(check-expect (isbn-find "" 'isbn-10) #f)
(check-expect (isbn-find "0262062186" 'isbn-13) #f)
(check-expect (isbn-find "9781593274917" 'isbn-10) #f)
(check-expect (isbn-find (file->string "test-isbn-examples")
                         'isbn-13)
              "9781593274917")
(check-expect (isbn-find (file->string "test-isbn-examples")
                         'isbn-10)
              "0262062186")

(define (isbn-find str format)
  (local [(define p?
            (cond [(symbol=? format 'isbn-13) isbn-13?]
                  [(symbol=? format 'isbn-10) isbn-10?]))
          (define isbns
            (filter p? (isbn-find/list str)))]
    (cond
      [(empty? isbns) #f]
      [else (first isbns)])))
    
; ----------------------------------------------------------
; Helpers

; String -> [List-of String]
; matches substrings in the given string looking like isbn tags

(check-expect (isbn-match* "") '())
(check-expect (isbn-match* "abc\nd") '())
(check-expect
 (isbn-match* (file->string "test-isbn-examples"))
 (list
  ;isbn normalized (all matched)
  "0262062186" "026256114X" "1593274912"
  "9781593274917" "0201896834" "9780201896831"
  ;isbn w/ several id's and sep's (all matched)
  "ISBN 0-262-06218-6"
  "ISBN: 0 262 56114 X"
  "ISBN-10 0 262 06218-6"
  "ISBN-10: 0-201-89683-4"
  "ISBN-13: 978-0-201-89683-1"
  "ISBN-10: 0 262 56114 X"
  "ISBN-13: 978-0201896831"
  ;not isbn strings (usually impossible in real-world)
  "026206218X" "0262561141" "9780201896834" ;partially matched
  ;isbn strings, but isbn invalid (all matched)
  "026206218X" "0262561141" "1593274913"
  "9781593274912" "0201896833" "9780201896834"))

(define (isbn-match* str)
  (regexp-match* re-isbn (string-normalize-spaces str)))

; String -> String
; removes the isbn-id and, then, the isbn separators from str

(check-expect (isbn-normalize "123-45 67") "1234567")
(check-expect (isbn-normalize "ISBN 123") "123")
(check-expect (isbn-normalize "ISBN: 123") "123")
(check-expect (isbn-normalize "ISBN-10 123") "123")
(check-expect (isbn-normalize "ISBN-10: 123") "123")
(check-expect (isbn-normalize "ISBN-13 123") "123")
(check-expect (isbn-normalize "ISBN-13: 123") "123")
(check-expect (isbn-normalize "ISBN-14: hi") "ISBN14:hi")

(define (isbn-normalize str)
  (string-replace
   (string-replace str re-isbn-id "") re-isbn-sep ""))

Next article in the series: Racket: #lang, local definitions

No hay comentarios:

Publicar un comentario