First Haskell Parser (for Kindle Highlights)
Kindle saves highlights, bookmarks and notes in a My Clippings.txt file. I needed a way to parse it and filter out the content I wanted. I also wanted to write a parser in Haskell, so this was a great opportunity to get my hands dirty.
I used stack to manage the project by running:
stack new kindle-highlights simple
cd kindle-highlights
stack setup
stack ghci # launches haskell repl for project
Within the ghci repl, the following commands were useful when making changes:
:load Main
:r # reload changes made to code
Parsec is the library I chose to use. To start off, I wanted to parse a simple string that had a similar structure to the Kindle file:
exampleString = "this\nis\ngood\n==========\nanther\ngroup\n=========="
The exampleString above contains two highlights, each followed by the ========== separator. To get a single group, I look for this separator and collect the characters before it. To do this in Parsec:
import Text.Parsec
eogString :: String
eogString = "=========="
endOfGroup :: Parsec String st String
endOfGroup = string eogString
kindleGroup :: Parsec String st String
kindleGroup = manyTill anyChar (try endOfGroup)
test = parse kindleGroup "failed" exampleString
Running test in ghci returns Right "this\nis\ngood\n".
endOfGroup is a parser that only matches the eogString. kindleGroup combines it with manyTill to collect every character (using anyChar) until endOfGroup matches. The try lets the parser backtrack when a run of = characters turns out not to be the full separator.
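To see how manyTill and try behave on other inputs, here is a minimal sketch. The firstGroup helper is hypothetical (it is not part of the original project); it wraps the parser so that a missing separator shows up as Nothing rather than a parse error.

```haskell
import Text.Parsec

-- Same parser as above, restated so this snippet loads on its own.
kindleGroup :: Parsec String st String
kindleGroup = manyTill anyChar (try (string "=========="))

-- Hypothetical helper: Just the text before the first separator,
-- or Nothing when the input never contains the full separator.
firstGroup :: String -> Maybe String
firstGroup input = either (const Nothing) Just (parse kindleGroup "example" input)
```

In ghci, firstGroup "a=b\n==========rest" should give back the text before the separator, including the lone = sign, while firstGroup "no separator here" should give Nothing because anyChar runs out of input before the separator is seen.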
To get all the groups, I built on this parser:
groups :: Parsec String st [String]
groups = do
  first <- kindleGroup
  next <- remainingGroups
  return (first : next)
remainingGroups :: Parsec String st [String]
remainingGroups = (char '\n' >> groups) <|> return []
test = parse groups "fail" (exampleString)
Running test now results in Right ["this\nis\ngood\n","anther\ngroup\n"]. groups parses one group from the string and hands the rest of the input to remainingGroups. That parser checks whether the next character is \n; if so, it consumes it and calls groups on what remains, otherwise it returns an empty list.
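The same "item, separator, item" recursion can also be written with one of Parsec's built-in combinators. The sketch below is an alternative, not the code I used: groupsSepEnd is a made-up name, and it relies on sepEndBy1, which additionally tolerates a trailing newline after the last group.

```haskell
import Text.Parsec

-- Same single-group parser as above, restated so this snippet loads on its own.
kindleGroup :: Parsec String st String
kindleGroup = manyTill anyChar (try (string "=========="))

-- One or more groups separated by a newline; a newline after the
-- final group is accepted as well.
groupsSepEnd :: Parsec String st [String]
groupsSepEnd = kindleGroup `sepEndBy1` char '\n'
```

Calling parse groupsSepEnd "fail" exampleString should produce the same two groups as the hand-written recursion.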
The above gave me a good framework for working on the Kindle highlights file, which has the general form:
book title
- Your Highlight on page 818-810 | Added on Wednesday, 24 October 2018 04:41:47
the actual highlighted sections
==========
book title
- Your Highlight on page 818-810 | Added on Wednesday, 24 October 2018 04:41:47
the actual highlighted sections
==========
This is what I came up with for that. Note that it targets entries whose second line starts with "- Your Highlight at location ":
-- test
test = parse highlight "fail" exampleGroup
title :: Parsec String st String
title = manyTill anyChar newline
location :: Parsec String st [String]
location = between (string "- Your Highlight at location ") (oneOf " |") locationGroupings
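-- Parses a range such as "3722-3722" into [start, end].
locationGroupings :: Parsec String st [String]
locationGroupings = do
  start <- many1 digit
  char '-'
  end <- many1 digit
  return [start, end]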
highlight :: Parsec String st [String]
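highlight = do
  t <- title
  l <- location
  title -- skip the rest of the metadata line ("| Added on ...")
  title -- skip the blank line
  h <- title -- the highlighted text itself
  let x = [t, h] ++ l
  return x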
exampleGroup = "Axiomatic (Greg Egan)\n" ++
  "- Your Highlight at location 3722-3722 | Added on Sunday, 28 October 2018 08:42:11\n" ++
  "\n" ++
  "mind; maybe some dreams take shape only in the\n"
This results in Right ["Axiomatic (Greg Egan)","mind; maybe some dreams take shape only in the","3722","3722"]
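A flat list of strings works, but the position of each field has to be remembered by whoever consumes it. As a sketch of one alternative, the same steps can fill in a record instead. Everything named here (Highlight, its fields, restOfLine, locationRange, highlightRecord) is hypothetical and only for illustration.

```haskell
import Text.Parsec

-- Hypothetical record type; the field names are illustrative.
data Highlight = Highlight
  { hlTitle :: String
  , hlStart :: Int
  , hlEnd   :: Int
  , hlText  :: String
  } deriving (Show, Eq)

-- Consume everything up to the next newline and drop the newline.
restOfLine :: Parsec String st String
restOfLine = manyTill anyChar newline

-- Parse "- Your Highlight at location 3722-3722" into a numeric range.
locationRange :: Parsec String st (Int, Int)
locationRange = do
  _ <- string "- Your Highlight at location "
  start <- many1 digit
  _ <- char '-'
  end <- many1 digit
  return (read start, read end) -- safe: both strings contain only digits

highlightRecord :: Parsec String st Highlight
highlightRecord = do
  t <- restOfLine
  (s, e) <- locationRange
  _ <- restOfLine -- rest of the metadata line (" | Added on ...")
  _ <- restOfLine -- blank line
  h <- restOfLine
  return (Highlight t s e h)
```

Running parse highlightRecord "fail" exampleGroup in ghci should yield a Highlight whose fields can then be accessed by name, for example hlText.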
To parse multiple groups, I reused the same recursive pattern as before:
groups :: Parsec String st [[String]]
groups = do
  first <- highlight
  next <- remainingGroups
  return (first : next)
remainingGroups :: Parsec String st [[String]]
remainingGroups = (char '\n' >> groups) <|> return []
test = parse groups "fail" (exampleGroup ++ "\n" ++ exampleGroup)
This returns:
Right [["Axiomatic (Greg Egan)","mind; maybe some dreams take shape only in the","3722","3722"],["Axiomatic (Greg Egan)","mind; maybe some dreams take shape only in the","3722","3722"]]
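The test input above joins entries with a bare newline rather than the ========== line from the general form. A rough sketch of how the separator could be folded back in follows; it assumes every entry ends with a ========== line and uses plain \n line endings, which a real file may not. The names entry and entries are made up for this example.

```haskell
import Text.Parsec

-- Consume everything up to the next newline and drop the newline.
restOfLine :: Parsec String st String
restOfLine = manyTill anyChar newline

-- One entry: title, metadata line, blank line, highlight text, separator line.
entry :: Parsec String st [String]
entry = do
  t <- restOfLine
  _ <- string "- Your Highlight at location "
  start <- many1 digit
  _ <- char '-'
  end <- many1 digit
  _ <- restOfLine -- rest of the metadata line
  _ <- restOfLine -- blank line
  h <- restOfLine
  _ <- string "=========="
  optional newline -- the last separator may or may not end with a newline
  return [t, h, start, end]

-- Zero or more entries, then the end of the input.
entries :: Parsec String st [[String]]
entries = many entry <* eof
```

The trailing eof makes the parser fail loudly on leftover input instead of silently stopping at the first entry it cannot read.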
The Kindle format has more nuances than the above demonstrates, but I had a lot of fun working on this nevertheless.