This is the seventh article in the Making Sense of C series. In this article, we're going to figure out how to represent text. We aren't going to worry about displaying text or fonts or printing it out to the console, we're just going to worry about how we can put text into a format the computer can understand.
So far, we've determined that we're going to give the compiler a file with a
specified format, we've established the syntax for comments, we've set up a
system that allows us to perform arithmetic and store values in variables for
later use, and we've invented types that allow us to work with numbers of
different sizes.
Text, however, isn't built into the hardware the same way as numerical types.
For example, 0000 0000
can represent 0
since it's exactly how you would
represent 0
in eight bits, but how can we make those eight bits have something
to do with text?
First, we need a clear definition of "text" so we know exactly what we're doing. In the context of computers, text is just a list of characters, where characters consist of letters, digits, punctuation, spaces, etc. For a computer to handle text, it needs to know how to convert bits into a character and which bits to convert into characters. Once again, the computer quite literally just sees a bunch of ones and zeros (not even ones and zeros, just electricity or not electricity), so it has to know that you want it to treat certain bits as text, which is why I had to use the awkward phrase "know which bits to convert into characters".
The programming term for a specific list of characters is a string, which is actually quite close to one of the original meaning of the word "string", which was "a number of objects arranged in a line", like beads on a string.
In the first part of the article, we're going to come up with a system to map bits to characters. We're going to use a process similar to the process we used in coming up with how computers represent integers. First, we're going to figure out what characters to represent, then we're going to come up with a mapping from bits to characters.
First, we need to figure out what characters to represent.
Since a lot of early programming languages and computers were developed in
America, let's just use the characters used in American text.
With both upper and lower case letters, we're at 52 characters already.
Next, let's add all the digits from zero to nine, which leads to a total of 62
characters.
We'll need a space character, so let's add a space character, putting us at 63
characters.
Now, we'll add some standard punctuation, which consists of 20 more characters
(.";/\'[]{}()^:,?!-&*
) for a total of 83 characters.
We can also add in a few math characters (+=%<>
), putting us at 88
characters.
Since computers are going to be used for money, we can also add a ($
)
character, putting us at 89 characters.
We'll also need a character that looks like a space, but functions like a
letter since a space breaks things apart, so let's add an (_
).
The only characters we haven't mentioned on your keyboard are @#~`|
, where the
first two are carried over from English shorthand and typewriters, the next two
allow you to represent characters with a mark over them (For example, we could
represent ñ as n~
, é as e`
, or ö as o``
. The exact
system you set up doesn't matter, but you should be able to set up a system.) in
plaintext, and the last one is useful for a lot of math purposes.
Of course, all of these symbols have other uses (~ means similar in math), but
I'm just trying to hit some practical uses.
We're now at 95 characters. We've established that we're going to need at least seven bits of memory to represent a character in English and it looks like we've represented all the normal characters that we can see on our keyboard, but we're missing several characters. For example, we completely skipped over the tab character, the newline character (move down a line), the carriage return character (move back to the left of the screen), the escape character (what happens when you hit Esc), the Backspace character, and the Delete character, which means we're now at 101 characters. We have 27 characters left within seven bits.
We're going to use the ASCII character mapping, since it's the standard on every computer and in Unicode.
At the time when ASCII was created, computers were a lot closer to the metal, meaning that they often had characters that won't make sense to us now, such as the Acknowledge and Negative Acknowledge. The fact that the modern Enter key represented a line feed (move to the next line) and a carriage return (move to the beginning of the line) was because each character was an instruction to a device, such as a printer, and it had to be told to move to the proper position. These characters are still part of ASCII because standards like that aren't supposed to change, otherwise, they're not standards.
ASCII only goes from 0000 0000
to 0111 1111
, which is 0
to 127
, but
since we like powers of two, we'll make a character eight bits and let people do
whatever they want outside of ASCII with the remaining bit.
Some computers used them as other characters, other computers used them to
indicate italic letters, etc.
Letting people use the eighth bit to do whatever they wanted turned out to be a
terrible idea, which is why we now use Unicode, which assigns a character to a
number and lets a system like utf-8 figure out how to encode the number.
The 128 characters of ASCII map directly to the Unicode values 0
to 127
so
that most encodings of Unicode are backwards compatible with ASCII.
In fact, there is literally no difference between an ASCII file and utf-8 file
that only uses the ASCII characters.
While Unicode is awesome, right now, we're going to focus on just standard ASCII.
Anyway, we've now established that we're going to represent a character using an
individual byte, which will have enough memory to store English text and most
text derived from the Latin alphabet (Spanish, French, German, etc).
If you remember from the article on types, we just so happen to have a type
called char
that takes up a byte.
As you might expect, char
was made to represent char
acters.
So now that we have a type built into C
specifically to represent characters,
we just need a way to map the characters we have to the range 0000
0000
→0111 1111
.
People can do whatever they want with the leftmost bit, so we're not going to
worry about it.
Once again, that mapping is ASCII, which has a lot of benefits, though most of them were more useful way back when computers still ran on tape and punch cards.
A
is 0100 0001
and a
is 0110 0001
and all
letters follow the same pattern, which means that you can make an ASCII letter
upper or lower case by flipping the third bit from the left, which is a trivial
operation in C
that we will cover later.
Originally, your Shift key would just set the third bit from the left to zero if
Caps Lock was off and one if Caps Lock was on entirely through the hardware.
In modern computers, each key is registered independently from everything else, so the ASCII encoding loses a bit of the efficiency of setting individual bits to achieve the effect of Shift and Ctrl.
Now that we've established a mapping from a char
to a character, all we need
to do if we want a computer to print out a letter is to draw a picture for each
character and write a program that, when given a char
finds the proper picture
and draws it on the screen.
Lucky for you, most computers already have that feature built in, so we don't
need to worry about it.
If we were making a video game or anything else that has to draw a picture (i.e.
not a terminal), we might need to implement our own char
to picture feature.
For the numeric types, an int
was exactly four bytes, so you would just need
to know where it was in memory, but strings can vary in how many bytes you need
to represent them, so you need some way of telling the computer where to start
reading and where to stop reading.
First, since a string is a list of characters, we will have to make a feature
that we can use to represent lists in C
.
We'll take care of actually implementing that feature later, but for now, we'll
act as if we have that feature when coming up with how to represent strings in
C
.
For a list of characters, we need some way of identifying the beginning of the
list and the end of the list, but how we iterate through the list depends on how
we implement the list, so we'll leave that part alone.
We can, however, figure out how we know where to start reading characters and
when to stop reading characters since we just need to know what the start and
end of the list look like.
Let's try to think of real life examples of how we know when to start and stop something?
Each of these approaches uses a different system to start and stop a process.
We can apply these systems to knowing when to start and stop reading a string like so:
All of them have to specify where to start reading a string, but how they
specify where to end a string differs.
Each has its pros and cons, but the special (NULL
) character won out, since
memory was quite expensive and it's guaranteed to use exactly one byte, while
the first two methods have to use four to be safe.
In modern C
programs, however, we have the memory on modern computers, so we
often use all three methods to reap the benefits and not have to suffer the
cons.
In general, because text handling is so close to the metal and users will
generally interact with your program through text, text handling is one of the
most dangerous parts of a program if you don't know what you're doing.
If a malicious user can somehow get your program to not put the NULL
character
at the end of a string, then the computer will keep reading data until it
happens to find one somewhere in memory or it will just crash the program.
When we start talking more about the details of how your computer handles text,
we can go into more depth.
C
I don't know about you, but I would hate if I had to look up the ASCII value for
each individual character to set a char
.
Instead, I would much rather use some shorthand and let the compiler figure out
what value to set.
Since it's not really a heavy burden on the compiler, let's just give it the
character.
char a = b;
Do you see the problem?
The compiler will think that you want to set a
to the value stored in b
, not
to the ASCII value for b
, which is not at all what we wanted.
Instead, let's put the character in single quotes 'b'
to tell the compiler
that we're talking about a character.
char a = 'b'; a = 66; // Less clear than the last line, but does the same thing.
Much better.
You might think the choice of single quotes for characters is arbitrary, but its
because a character is directly related to a string, which will use quotation
marks ""
since quotation marks indicate that someone is speaking.
If you were to go through all the keys on your keyboard, you would find that
quotes make the most sense for text.
Because there are several characters that you can't really type (like the
Backspace, newline, and carriage return), we should have some functionality that
allows us to say, "We want this character to be a newline."
We'll pick a character that we can type and say that if another character
follows it, we'll interpret them together as a different character.
In this case, we'll use backslash \
since we don't think it will show up in
text that often (unless you're on a Windows file system, which didn't exist at
the time).
char a = '\n'; // Newline (what happens when you hit Enter) a = '\r'; // Carriage return (move to the front of the line) a = '\0'; // NULL character (the character that ends a string that // we talked about earlier) a = 0; // The NULL character has the ASCII value of zero, so // this line is equivalent to the previous line. a = '\b'; // Backspace (what happens when you hit backspace) a = '\t'; // Tab a = '\''; // Single quote a = '\"'; // Double quote a = '\\'; // Backslash
There are a few more escape characters that I won't mention because they're
really uncommon (VERTICAL TAB), but I will go more indepth on what are known as
ANSI escape sequences after I finish the C
tutorial.
ANSI escape sequences are an advanced topic, so don't worry about them for now.
In this article, we decided that we're going to represent text in C
by using a
list of characters, and we will represent each character with a char
, which is
a type that represents a byte of memory.
We're going to use ASCII for now, but it can be extended in the future with
Unicode.
We're going to mark a string by somehow keeping track of where it starts and
storing a NULL
character as the last character of the string.
In representing text, we have two currently unsolved issues: how to keep track of where a list starts and how to implement a list. In the next article, we're going to take care of both of these issues by introducing memory addresses.