The string and I

Numbers are magnificent, but have you seen what can be done with strings? In this section, we will discuss strings: in particular, how to split, join, and otherwise manipulate them. We start off with characters in the section Characters because a string can be conceptualized as a… um… string of characters. The section Strings then discusses common operations on strings. We will learn how to join strings together, split a string apart, and embed strings and numbers into another string.

Characters

The character for elation is written in octal notation as \o/.

Haskell uses the type Char to represent characters. Like some other languages,[1] a character literal is delimited by single quotation marks. The character literal “a” is written in Haskell as 'a'. Each value of the type Char is one Unicode character, which can be given using one of the following formats:

  • Glyph. The glyph of a Unicode character is its visual representation. The glyph for the character “a” is a and we write it in Haskell as 'a'. The Greek letter beta is written as 'β'. Many characters, e.g. control characters, do not have their own glyphs and we must use special escape sequences to specify such characters.
  • Decimal. A character can be given by its decimal notation. The character “a” has the decimal code 97 and we write it in Haskell as '\97'. Notice the use of the backslash symbol \. The newline (or line feed) character can be written as '\n' or via its decimal notation as '\10'.
  • Octal. A character can be written in terms of its octal notation. The character “a” has the octal code 0141 and we write it in Haskell as '\o0141'. Notice the use of the escape sequence \o (with the lowercase version of O) to specify that we are using octal notation. The newline character is written in octal notation as '\o012'. Yay \o/! Time to pet :3.
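
As a quick check of the three notations above, here is a minimal sketch that can be saved in a file and loaded into GHCi. The names sameLetter, sameNewline, codeOfA, and charOf97 are made up for illustration. The functions ord and chr come from Data.Char and are not otherwise covered in this section; they convert between a character and its decimal code.

```haskell
import Data.Char (chr, ord)

-- The glyph, decimal, and octal notations for "a" denote the same Char.
sameLetter :: Bool
sameLetter = 'a' == '\97' && 'a' == '\o141'

-- The escape sequence, decimal, and octal notations for newline agree too.
sameNewline :: Bool
sameNewline = '\n' == '\10' && '\n' == '\o012'

-- ord gives the decimal code of a character.
codeOfA :: Int
codeOfA = ord 'a'

-- chr goes the other way, from a decimal code to a character.
charOf97 :: Char
charOf97 = chr 97
```

Leading zeros in the octal notation do not change the value, which is why '\o141' and '\o0141' are the same character.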

The module Data.Char

To do anything interesting with characters, we must import the module Data.Char, which ships with GHC as part of the package base. Importing is done via the keyword import. In the GHCi session below, we perform some basic operations on a character.

ghci> import Data.Char
ghci> c = 'a'
ghci> isNumber c
False
ghci> isAlpha c
True
ghci> isLower c
True
ghci> toUpper c
'A'
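
The same functions can be applied to every character of a string. The sketch below is only an illustration: the names shout and countLower are hypothetical, and map and filter are standard Prelude functions that this section does not otherwise cover.

```haskell
import Data.Char (isLower, toUpper)

-- Convert every character of a string to uppercase.
shout :: String -> String
shout s = map toUpper s

-- Count the lowercase letters in a string.
countLower :: String -> Int
countLower s = length (filter isLower s)
```

For instance, shout "tabby" evaluates to "TABBY" and countLower "Tabby 2" evaluates to 4.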

Strings

Strings in Haskell are represented in many different ways. A simple and easy to understand representation is as a list of characters.[2] A list in Haskell is represented using square brackets, e.g. ["abc","defg"]. The type String is defined as a list of characters, i.e. [Char]. Unlike the functions of the module Data.Char, a few functions for processing strings, such as lines and words, are exported by the Prelude, which is loaded by default. The type String is simple to understand, easy to explain to beginners of the Haskell language, and values of type String can be manipulated by means of standard Haskell functions and operators.[3]

In the GHCi session below, we create a string s and use the function lines to split the string into multiple lines. The split occurs at the newline character \n. The result is a list of two elements, each being a substring of s. As its name implies, the function unlines joins multiple strings into one string. It also appends a newline to each string, so the result of unlines (lines s) ends with \n even though s does not. We then use the function words to decompose the string s into words. This is similar to the function lines, except that words splits a string at any whitespace, not only at newlines. The function unwords joins multiple words, each represented as a string, into one string, placing a single space between consecutive words.

ghci> s = "The five boxing wizards\njump quickly."
ghci> lines s
["The five boxing wizards","jump quickly."]
ghci> unlines (lines s)
"The five boxing wizards\njump quickly.\n"
ghci> words s
["The","five","boxing","wizards","jump","quickly."]
ghci> unwords (words s)
"The five boxing wizards jump quickly."

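The four functions combine naturally with each other and with length. Here is a minimal sketch, with the hypothetical names wordCount, normalizeSpaces, and roundTrip, of three small helpers built from them.

```haskell
-- Count the words in a string, where words are separated by whitespace.
wordCount :: String -> Int
wordCount s = length (words s)

-- Collapse every run of whitespace, including newlines, into one space.
normalizeSpaces :: String -> String
normalizeSpaces s = unwords (words s)

-- unlines appends "\n" to every line, so this round trip adds a trailing
-- newline when the input does not already end with one.
roundTrip :: String -> String
roundTrip s = unlines (lines s)
```

With the string s from the session above, wordCount s evaluates to 6. Note that roundTrip "a\nb" evaluates to "a\nb\n", so unlines is not an exact inverse of lines.
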
Care to join?

“Jones, you’d better join our cooperative life insurance company before that cough of yours gets any worse.”
“I’d like to do it, Furguson, but I don’t believe I would pass the medical examination.”
“That’s all right. I’m on the examining board. I can get you in.”
“Then I won’t join it, Furguson. I don’t want to have anything to do with a company that would take a risk on me.”
The Chicago Daily Tribune, “In a Minor Key: Unanswerable”, 6th June 1891, p. 4, column 5.

In the section Printing numbers, we learnt that the operator[4] ++ can join two strings together. Another way to concatenate strings is to use the aptly named function concat. Both ++ and concat have the same effect. Here are some differences between ++ and concat:

  1. The function concat is used in prefix notation, whereas ++ is used in infix notation.
  2. The function concat takes a list of strings as its argument, whereas ++ is a binary operator that takes exactly two strings.
  3. The operator ++ can be used in prefix notation by surrounding it with parentheses, i.e. (++). The function concat takes a single argument, so infix notation does not apply to it.

The GHCi session below should clarify the similarity and differences between ++ and concat.

ghci> "battle" ++ "beetle" -- infix notation
"battlebeetle"
ghci> (++) "battle" "beetle" -- prefix notation
"battlebeetle"
ghci> concat ["any", "way"] -- prefix notation
"anyway"

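The difference in the number of arguments shows up as soon as more than two strings are involved. The sketch below is illustrative only: the names are made up, the third word "bottle" is an arbitrary addition, and foldr is a standard Prelude function not covered in this section. The definition myConcat is one possible way to express concat for strings in terms of ++, not necessarily how the library defines it.

```haskell
-- Join three strings by chaining the binary operator ++.
joinedWithOperator :: String
joinedWithOperator = "battle" ++ "beetle" ++ "bottle"

-- Join the same three strings with a single call to concat.
joinedWithConcat :: String
joinedWithConcat = concat ["battle", "beetle", "bottle"]

-- A hand-rolled equivalent of concat for strings, built from ++.
-- Starting from the empty string, each string is joined to the rest.
myConcat :: [String] -> String
myConcat strings = foldr (++) "" strings
```

All three approaches produce "battlebeetlebottle".
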
The cons operator : is similar to ++ insofar as : allows for concatenation. The cons operator is a binary operator that prepends a single element to a list. To use : for concatenating to a string, the left operand of : must be a character and the right operand must be a string. Like ++, the cons operator can be used in prefix or infix notation.

ghci> 'b' : "c"
"bc"
ghci> 'a' : ('b' : "c")
"abc"
ghci> (:) 'b' "c"
"bc"
ghci> (:) 'a' ((:) 'b' "c")
"abc"

Splitting headache

Dick: Come on, Harry. Let’s make like a banana split and leave.
3rd Rock from the Sun, season 2, episode 19, 1997

The functions lines and words split a string at newlines and at whitespace, respectively. What if you want to split a string s at a given index? Use the function splitAt. The function requires two arguments:

  • An integer $k$ representing the index at which to split s. In Haskell, indices start from zero.
  • The string s itself.

The function returns a tuple of two elements. A tuple in Haskell is represented using parentheses, e.g. ("abc","defg"). The first element is a substring that consists of the first $k$ characters of s. The second element is a substring made up of the rest of s. Let’s see splitAt in action. We use the function length to query the number of characters in a string.

ghci> s = "Split"
ghci> t = " me like a log."
ghci> u = s ++ t
ghci> u
"Split me like a log."
ghci> length s
5
ghci> splitAt (length s) u
("Split"," me like a log.")

We now have another problem. The function splitAt returns a tuple of two elements. How are we to extract the individual elements? We have two ways to extract the individual elements of a tuple: a quick way and a laborious way. Let’s consider the laborious way first. The tuple functions fst and snd extract the first and second elements, respectively, of a tuple having two elements. Below we use the functions fst and snd to extract the left and right parts of a split string.

ghci> str = "Split me like a log."
ghci> left = fst (splitAt 5 str)
ghci> right = snd (splitAt 5 str)
ghci> left
"Split"
ghci> right
" me like a log."

Enough of the laboriously fun way. A quick way to extract elements from a tuple is to destructure the tuple.[5] The function splitAt returns a tuple of two elements. It makes sense to assign the result to another tuple that has two elements.[6] Observe destructuring in action.

ghci> str = "Split me like a log."
ghci> (left, right) = splitAt 5 str
ghci> left
"Split"
ghci> right
" me like a log."

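Destructuring also works inside a function definition in a source file. The following sketch assumes two hypothetical helpers, halve and swapHalves, and uses a where clause to name the two parts of the tuple returned by splitAt.

```haskell
-- Split a string into two halves. When the length is odd, the right
-- half receives the extra character.
halve :: String -> (String, String)
halve s = splitAt (length s `div` 2) s

-- Swap the two halves of a string by destructuring the result of halve.
swapHalves :: String -> String
swapHalves s = right ++ left
  where
    (left, right) = halve s
```

For example, halve "headache" evaluates to ("head","ache") and swapHalves "headache" evaluates to "achehead".
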
More index fun

The function splitAt is not alone in splitting a string based on an integer index. Similar to splitAt, the function take accepts two arguments: an integer index $k$ at which to split a string, and the string itself. In contrast to splitAt, the function take returns only the substring made up of the first $k$ characters of the given string. No need to use fst with splitAt.

ghci> status = "splitting headache"
ghci> take 5 status
"split"

Here’s a quick quiz. The opposite of take is [blank]. If you answer “drop”, well done! In case your response is “leave”, please stay. Don’t make like a tree and leave. The function drop takes an integer $k$ and a string s, and returns all but the first $k$ characters of s. Equivalently, drop removes the first $k$ characters of s. The function is more convenient than using snd with splitAt. See it for yourself.

ghci> status = "splitting headache"
ghci> take 5 status
"split"
ghci> drop 5 status
"ting headache"
ghci> (take 5 status) ++ (drop 5 status)
"splitting headache"

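The relationship between splitAt, take, and drop can be stated as a small check. This is a sketch with made-up names (agreesWithSplitAt, takeTooMany, dropTooMany). It also shows that an index beyond the end of the string is not an error for take and drop.

```haskell
-- splitAt k s gives the same two pieces as take k s and drop k s.
agreesWithSplitAt :: Int -> String -> Bool
agreesWithSplitAt k s = splitAt k s == (take k s, drop k s)

-- Asking for more characters than the string has yields the whole string.
takeTooMany :: String
takeTooMany = take 100 "split"

-- Dropping more characters than the string has yields the empty string.
dropTooMany :: String
dropTooMany = drop 100 "split"
```

Here agreesWithSplitAt 5 "splitting headache" evaluates to True, takeTooMany is "split", and dropTooMany is "".
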
In general, the functions take and drop give us the left and right portions, respectively, of a string. The leftmost section of a string is its very first character. The rightmost section of a string is its very last character. The leftmost and rightmost elements of a string can conveniently be accessed via the functions head and last, respectively. Each of these functions returns a character rather than a string. Time for a biology lesson with a caterpillar.

ghci> insect = "8caterpillar-"
ghci> head insect
'8'
ghci> last insect
'-'

Dropping the head character, we would have the rest of the string, as returned by the function tail. Dropping the last character, we would have all characters except the rightmost one, as returned by the function init. The caterpillar is fed up with being dissected. It uses the function reverse to do a 180-degree turn and crawl away.

ghci> insect = "8caterpillar-"
ghci> tail insect
"caterpillar-"
ghci> (head insect) : (tail insect)
"8caterpillar-"
ghci> init insect
"8caterpillar"
ghci> (init insect) ++ ((last insect) : "")
"8caterpillar-"
ghci> reverse insect
"-rallipretac8"

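One caveat: head, last, tail, and init all raise an error when given the empty string. The sketch below shows one way to guard against that by pattern matching on the string, together with a small use of reverse. The names firstOr and isPalindrome are hypothetical, and pattern matching in function definitions is not otherwise covered in this section.

```haskell
-- Return the first character of a string, or a fallback character when
-- the string is empty. The pattern (c : _) uses the cons operator to
-- name the first character and ignore the rest.
firstOr :: Char -> String -> Char
firstOr fallback ""      = fallback
firstOr _        (c : _) = c

-- A string is a palindrome when it reads the same after reverse.
isPalindrome :: String -> Bool
isPalindrome s = s == reverse s
```

For example, firstOr '?' "8caterpillar-" evaluates to '8', whereas firstOr '?' "" evaluates to '?' instead of raising an error.
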
String format

The operator ++ can be used to concatenate as well as format strings. This can be problematic if a string is to include numbers, because both operands of ++ must be strings. A workaround is to convert each number to a string by means of the method show. Here is an example.

ghci> name = "Tabby"
ghci> age = 2
ghci> food = "fish"
ghci> name ++ " is " ++ show age ++ " years old and likes " ++ food ++ "."
"Tabby is 2 years old and likes fish."

As you can see, using ++ together with show to format a string can be cumbersome. The resulting code can be difficult to read. What we need is a more convenient way of formatting strings. The module Text.Printf provides functions to format strings, using much the same formatting conventions as the function printf in the C programming language. The format specifier %s is a placeholder for a string and %d is a placeholder for an integer. You can use the Haskell function printf to achieve the same output as per the above GHCi session.

ghci> import Text.Printf
ghci> name = "Tabby"
ghci> age = 2
ghci> food = "fish"
ghci> printf "%s is %d years old and likes %s.\n" name age food
Tabby is 2 years old and likes fish.

We can format floating-point numbers as well. Prepare to be dazzled:

ghci> import Text.Printf
ghci> printf "%s is yummy, but pi is %f\n" "pie" pi
pie is yummy, but pi is 3.141592653589793
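
The function printf can also build a string instead of printing it, provided the result is used where a String is expected. The sketch below, with the hypothetical names describePet, describePi, and paddedAge, relies on that and on two further specifiers: %.2f keeps two digits after the decimal point, and %5d pads an integer with spaces to a width of five. The type annotations are there because, outside of GHCi, the compiler needs to know which numeric type is meant.

```haskell
import Text.Printf (printf)

-- Build the sentence about a pet as a String rather than printing it.
describePet :: String -> Int -> String -> String
describePet name age food =
  printf "%s is %d years old and likes %s." name age food

-- Keep only two digits after the decimal point.
describePi :: String
describePi = printf "pi is roughly %.2f" (pi :: Double)

-- Pad an integer on the left with spaces to a width of five characters.
paddedAge :: String
paddedAge = printf "%5d" (2 :: Int)
```

Calling describePet "Tabby" 2 "fish" gives the same sentence as the GHCi session above, without the trailing newline.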

Exercises

Exercise 1. The cons operator : can be used for string concatenation. The left operand of : is a character, whereas the right operand is a string. Can the left operand be an empty character, i.e. ''? Why or why not? Can the right operand be an empty string, i.e. ""? Why or why not?

Exercise 2. Rewrite the programs age.hs and circle.hs to use the function printf for string formatting.

Exercise 3. In the section Splitting headache, you learnt about destructuring a tuple. Does destructuring work with lists as well? Split the following strings at whitespace and destructure the results: "battle beetle" and "Peter panning gold".

Exercise 4. A phrase when abbreviated is commonly written using the uppercase of the first letter of each word. For example, ante meridiem (meaning before noon) is abbreviated as AM, post meridiem (meaning after noon) is abbreviated as PM, and your significant other (i.e. your bae) is abbreviated as SO. Write a program to prompt for a two-word phrase and output the abbreviation of the phrase.

Exercise 5. Use one or more functions discussed in this section to extract the following words from the string "chopper": chop, hopper, cop, copper, her, Cher.

Exercise 6. Write a program that prompts for a string s, and two integers $i$ and $j$ where $i < j$. Extract the substring in s between the indices $i$ and $j$, inclusive.

Exercise 7. The type Text from the module Data.Text, provided by the package text, should be used instead of the type String whenever you require efficient text processing. Import the module and use its functions to work through various code examples in this section.

Exercise 8. Use one or more string manipulation functions discussed in this section to extract the word “caterpillar” from the string "8caterpillar-". Repeat the exercise, but use the module Data.Text.

Exercise 9. Write a program that prompts for a string. Exchange the first and last characters with each other and output the resulting string. For example, the input string "lived" should be transformed to "divel". What happens if you enter an empty string or a string having one character?

Exercise 10. Write a program that prompts for a string s and an integer $i$. Remove the character at index $i$ in s and output the resulting string.

  1. The languages C, C++, and Java each use single quotation marks to delimit a character literal.

  2. This is similar to how strings are represented in C, where a string is merely an array of characters. 

  3. Use the module Data.Text, from the package text, to manipulate strings if performance is a top priority.

  4. Ackchyually, ++ is a function. 

  5. JavaScript has the destructuring assignment syntax. The analogous notion in Python is unpacking.

  6. Like equality of vectors in mathematics. Everyone loves mathematics, yes?