Representing Text in C

Computers only speak binary, so how can we represent text?

Posted 15 January 2020 at 1:26 PM

By Joseph Mellor

This is the seventh article in the Making Sense of C series. In this article, we're going to figure out how to represent text. We aren't going to worry about displaying text, choosing fonts, or printing text out to the console; we're just going to worry about how we can put text into a format the computer can understand.

So far, we've determined that we're going to give the compiler a file with a specified format, we've established the syntax for comments, we've set up a system that allows us to perform arithmetic and store values in variables for later use, and we've invented types that allow us to work with numbers of different sizes. Text, however, isn't built into the hardware the same way as numerical types. For example, 0000 0000 can represent 0 since it's exactly how you would represent 0 in eight bits, but how can we make those eight bits have something to do with text?

Defining Text

First, we need a clear definition of "text" so we know exactly what we're doing. In the context of computers, text is just a list of characters, where characters consist of letters, digits, punctuation, spaces, etc. For a computer to handle text, it needs to know how to convert bits into a character and which bits to convert into characters. Once again, the computer quite literally just sees a bunch of ones and zeros (not even ones and zeros, just electricity or not electricity), so it has to know that you want it to treat certain bits as text, which is why I had to use the awkward phrase "know which bits to convert into characters".

The programming term for a specific list of characters is a string, which is actually quite close to one of the original meanings of the word "string", which was "a number of objects arranged in a line", like beads on a string.

How to Map Bits to Characters

In the first part of the article, we're going to come up with a system to map bits to characters. We're going to use a process similar to the process we used in coming up with how computers represent integers. First, we're going to figure out what characters to represent, then we're going to come up with a mapping from bits to characters.

What Characters Do We Want to Represent?

First, we need to figure out what characters to represent. Since a lot of early programming languages and computers were developed in America, let's just use the characters used in American text. With both upper and lower case letters, we're at 52 characters already. Next, let's add all the digits from zero to nine, which leads to a total of 62 characters. We'll need a space character, so let's add one, putting us at 63 characters. Now, we'll add some standard punctuation, which consists of 20 more characters (.";/\'[]{}()^:,?!-&*) for a total of 83 characters. We can also add in a few math characters (+=%<>), putting us at 88 characters. Since computers are going to be used for money, we can also add a ($) character, putting us at 89 characters. We'll also need a character that looks like a space but functions like a letter, since a space breaks things apart, so let's add an (_), putting us at 90 characters. The only characters we haven't mentioned on your keyboard are @#~`|, where the first two are carried over from English shorthand and typewriters, the next two allow you to represent characters with a mark over them in plaintext (for example, we could represent ñ as n~, è as e`, or ö as o``; the exact system you set up doesn't matter, as long as you can set one up), and the last one is useful for a lot of math purposes. Of course, all of these symbols have other uses (~ means similar in math), but I'm just trying to hit some practical uses.

We're now at 95 characters. We've established that we're going to need at least seven bits of memory to represent a character in English and it looks like we've represented all the normal characters that we can see on our keyboard, but we're missing several characters. For example, we completely skipped over the tab character, the newline character (move down a line), the carriage return character (move back to the left of the screen), the escape character (what happens when you hit Esc), the Backspace character, and the Delete character, which means we're now at 101 characters. We have 27 characters left within seven bits.

Mapping Characters to Bits

We're going to use the ASCII character mapping, since it's the de facto standard on nearly every modern computer and its 128 values are also the first 128 values of Unicode.

At the time when ASCII was created, computers were a lot closer to the metal, meaning that ASCII includes characters that won't make much sense to us now, such as Acknowledge and Negative Acknowledge. Each character was an instruction to a device, such as a printer, which had to be told to move to the proper position. That's why what the modern Enter key does was originally two separate characters: a line feed (move to the next line) and a carriage return (move to the beginning of the line). These characters are still part of ASCII because standards like that aren't supposed to change; otherwise, they're not standards.

ASCII only goes from 0000 0000 to 0111 1111, which is 0 to 127, but since we like powers of two, we'll make a character eight bits and let people do whatever they want outside of ASCII with the remaining bit. Some computers used them as other characters, other computers used them to indicate italic letters, etc.

Unicode

Letting people use the eighth bit to do whatever they wanted turned out to be a terrible idea, which is why we now use Unicode, which assigns a number to each character and lets an encoding like UTF-8 figure out how to store that number as bytes. The 128 characters of ASCII map directly to the Unicode values 0 to 127, and UTF-8 stores each of those values as a single byte, so it is backwards compatible with ASCII. In fact, there is literally no difference between an ASCII file and a UTF-8 file that only uses the ASCII characters.

While Unicode is awesome, right now, we're going to focus on just standard ASCII.

Anyway, we've now established that we're going to represent a character using an individual byte. Seven of its bits are enough to store English text, and the spare eighth bit is what let various extensions cover most other languages that use the Latin alphabet (Spanish, French, German, etc). If you remember from the article on types, we just so happen to have a type called char that takes up a byte. As you might expect, char was made to represent characters. So now that we have a type built into C specifically to represent characters, we just need a way to map the characters we have to the range 0000 0000→0111 1111. People can do whatever they want with the leftmost bit, so we're not going to worry about it.

Once again, that mapping is ASCII, which has a lot of benefits, though most of them were more useful way back when computers still ran on tape and punch cards.

All the letters are in order and all the digits are in order, as you would expect.
A is 0100 0001 and a is 0110 0001 and all letters follow the same pattern, which means that you can make an ASCII letter upper or lower case by flipping the third bit from the left, which is a trivial operation in C that we will cover later. Originally, your Shift key would just set the third bit from the left to zero if Caps Lock was off and one if Caps Lock was on entirely through the hardware.
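The first 32 characters are control characters, which can be accessed by holding down the control key and typing a letter. Originally, the control key would just zero out the second and third bits from the left (the top two bits of the seven-bit code) in the hardware, so Ctrl with I (0100 1001) gives you Tab (0000 1001).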

In modern computers, each key is registered independently from everything else, so the ASCII encoding loses a bit of the efficiency of setting individual bits to achieve the effect of Shift and Ctrl.

Printing Out Characters

Now that we've established a mapping from a char to a character, all we need to do if we want a computer to print out a letter is to draw a picture for each character and write a program that, when given a char, finds the proper picture and draws it on the screen. Lucky for you, most computers already have that feature built in, so we don't need to worry about it. If we were making a video game or anything else that has to draw its own pictures (i.e. not a terminal), we might need to implement our own char to picture feature.

How to Know Which Bits to Convert to Characters

For the numeric types, an int was exactly four bytes, so you would just need to know where it was in memory, but strings can vary in how many bytes you need to represent them, so you need some way of telling the computer where to start reading and where to stop reading.

First, since a string is a list of characters, we will have to make a feature that we can use to represent lists in C. We'll take care of actually implementing that feature later, but for now, we'll act as if we have that feature when coming up with how to represent strings in C. For a list of characters, we need some way of identifying the beginning of the list and the end of the list, but how we iterate through the list depends on how we implement the list, so we'll leave that part alone. We can, however, figure out how we know where to start reading characters and when to stop reading characters since we just need to know what the start and end of the list look like.

Let's try to think of real life examples of how we know when to start and stop something:

How do you know when to go to and come home from work?
How do you know when you've finished dealing out cards for a game of poker?
How do you know when a conversation ends (starting a conversation is too varied for us to use)?

Each of these approaches uses a different system to start and stop a process.

You agree on a time to start working and a time to stop working.
You deal out cards starting from the top of the deck until each person has five cards.
You keep talking with the person until one of you says something that indicates that you want to end the conversation ("I have to leave now, but it was fun talking to you.").

We can apply these systems to knowing when to start and stop reading a string like so:

Specify where to start reading a string and where to stop reading a string.
Specify where to start reading a string and how many characters to read.
Specify where to start reading a string until we hit a special character. Remember that we only specified 101 out of 128 available characters, so we can absolutely make up a special character that just tells the computer to stop reading a string.

All of them have to specify where to start reading a string, but how they specify where to end a string differs. Each has its pros and cons, but the special null character (called NUL in ASCII) won out, since memory was quite expensive and a terminator is guaranteed to use exactly one byte, while the first two methods need room for a second position or a count, which could take four bytes to be safe. In modern C programs, however, we have plenty of memory, so we often use all three methods to reap the benefits and not have to suffer the cons.

WARNING

In general, because text handling is so close to the metal and users will generally interact with your program through text, text handling is one of the most dangerous parts of a program if you don't know what you're doing. If a malicious user can somehow get your program to not put the null character at the end of a string, then the computer will keep reading data until it happens to find one somewhere in memory or it will just crash the program. When we start talking more about the details of how your computer handles text, we can go into more depth.

Representing Characters in C

I don't know about you, but I would hate if I had to look up the ASCII value for each individual character to set a char. Instead, I would much rather use some shorthand and let the compiler figure out what value to set. Since it's not really a heavy burden on the compiler, let's just give it the character.

char a = b;

Do you see the problem? The compiler will think that you want to set a to the value stored in b, not to the ASCII value for b, which is not at all what we wanted. Instead, let's put the character in single quotes 'b' to tell the compiler that we're talking about a character.

char a = 'b';
a = 98;         // Less clear than the last line, but does the same thing
                // (98 is the ASCII value for b; 66 would be B).

Much better.

You might think the choice of single quotes for characters is arbitrary, but it's because a character is directly related to a string, which will use quotation marks "" since quotation marks indicate that someone is speaking. If you were to go through all the keys on your keyboard, you would find that quotes make the most sense for text.

Escape Characters

Because there are several characters that you can't really type (like the Backspace, newline, and carriage return), we should have some functionality that allows us to say, "We want this character to be a newline." We'll pick a character that we can type and say that if another character follows it, we'll interpret them together as a different character. In this case, we'll use backslash \ since we don't think it will show up in text that often (unless you're on a Windows file system, which didn't exist at the time).

char a = '\n';          // Newline (what happens when you hit Enter)
a = '\r';               // Carriage return (move to the front of the line)
a = '\0';               // Null character, NUL (the character that ends a string
                        // that we talked about earlier)
a = 0;                  // The null character has the ASCII value of zero, so
                        // this line is equivalent to the previous line.
a = '\b';               // Backspace (what happens when you hit backspace)
a = '\t';               // Tab
a = '\'';               // Single quote
a = '\"';               // Double quote
a = '\\';               // Backslash

There are a few more escape characters that I won't mention because they're really uncommon (such as the vertical tab, '\v'), but I will go more in depth on what are known as ANSI escape sequences after I finish the C tutorial. ANSI escape sequences are an advanced topic, so don't worry about them for now.

Summary

In this article, we decided that we're going to represent text in C by using a list of characters, and we will represent each character with a char, which is a type that represents a byte of memory. We're going to use ASCII for now, but it can be extended in the future with Unicode. We're going to mark a string by somehow keeping track of where it starts and storing a null character as the last character of the string.

What's Next

In representing text, we have two currently unsolved issues: how to keep track of where a list starts and how to implement a list. In the next article, we're going to take care of both of these issues by introducing memory addresses.