-- 050923 -- High-level, executable Haskell specification of the MapReduce parallel -- programming model -- An interesting exjobb (too hard, though?) would be to automatically -- translate a subset of Haskell, using the mapreduce primitives, into C++ -- code with calls to the MapReduce API. module Mapreduce where import IO -- needed to get access to the more advanced file handling -- operations used below -- "shuffle" = group values with same key shuffle [] = [] shuffle (x:xs) = insert x (shuffle xs) where insert (k,v) [] = [(k,[v])] insert (k,v) ((k1,l):xs) | k == k1 = (k1,v:l):xs | otherwise = (k1,l):insert (k,v) xs -- "reduce" = for each key, fold over its values reduce op init = map (\(k,l)->(k, foldr op init l)) -- "mapreduce" = putting it all together. (m is a parameter for the "map" -- function, which may vary.) mapreduce op init m = reduce op init . shuffle . m -- connecting the purely functional mapreduce to an input and output file: mapreduce_f op init m infile outfile = do f <- openFile infile ReadMode s <- hGetContents f writeFile outfile (show (mapreduce op init m s)) hClose f -- Some examples of MapReduce uses -- 1. Count occurrences of words in file: no_words infile outfile = mapreduce_f (+) (0::Int) (map (\x->(x,1)) . words) infile outfile -- (Note the explicit typing 0::Int. Without it, we would get the following -- type error: -- ERROR "Mapreduce.hs":46 - Ambiguous type signature in inferred type -- *** ambiguous type : (Show [([Char],a)], Num a) => [Char] -> [Char] -> IO () -- *** assigned to : no_words -- The reason is that the exact numeric type for a, of the numeric constants -- and (+) needs to be known at latest when the function is run. Here, it -- cannot be decided even at runtime, since the arguments to no_words -- cannot provide any further info about a. Thus, the type system must -- reject the function. However, if we give an explicit typing somewhere -- that helps the type system decide what type a should be, then no_words -- typechecks fine.) -- 2. Count of URL access frequency: -- test if a string is a substring of another string: substr s1 [] = False substr [] s2 = True substr (c1:s1) (c2:s2) | c1 == c2 = substr s1 s2 | otherwise = substr (c1:s1) s2 -- A predicate that tests if a word is an URL (if it contains "http://"): is_url = substr "http://" no_urls infile outfile = mapreduce_f (+) (0::Int) (map (\x->(x,1)) . filter is_url . words) infile outfile