Reverse engineering PDP-11 BASIC: Part 6
This post is the beginning of the description of the main BASIC syntax parsing loop. For context and a list of other posts on this topic, see the PDP-11 BASIC reverse engineering project page.
Let's pick up the story from where it was left off in Part 5. The setup code has completed and "READY" has been displayed. The code now enters a loop reading input from the user and interpreting it as BASIC commands. Let's get started with the analysis.
003124 005067 CLR 13702
003130 104500 TRAP 100
003132 104472 TRAP 72
Firstly, the memory address 13702 is cleared. This memory address is used as a flag to indicate whether execution has been interrupted by the user pressing Ctrl-P. It is set in the TTY receive interrupt handling routine. Here the value is cleared to indicate that the code has not been interrupted, since we are just entering the parsing loop.
Then, TRAP 100 is used to read a line of input from the user. The resulting string is stored in the string buffer, which is at memory address 13540. The location of the resulting string (i.e. 13540) is also stored in R1. Next, TRAP 72 is used to get the next non-whitespace character in the string pointed to by R1. The ASCII code of that character is stored in R2.
Handling a newline on its own
003134 020227 CMP R2, #12
003140 001771 BEQ 3124
The first test carried out is to see whether the user just pressed return. If so, the string entered will consist of only the character linefeed (ASCII 12). Therefore the code tests whether R2 contains 12 and if so, branches back to the beginning of the loop to wait for the user to enter more commands.
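The blank-line test can be sketched in Python as follows. This is only an illustration under stated assumptions: `next_non_space` is a hypothetical stand-in for TRAP 72, and the sketch treats only the space character as whitespace, which the analysis so far has not confirmed.

```python
LF = 0o12     # linefeed; the constants in the listing are octal
SPACE = 0o40


def next_non_space(buf: bytes, r1: int = 0) -> int:
    """Hypothetical stand-in for TRAP 72: return the next non-space byte."""
    while buf[r1] == SPACE:  # assumption: only ASCII space counts as whitespace
        r1 += 1
    return buf[r1]


def is_blank_line(buf: bytes) -> bool:
    """Mirror CMP R2, #12 / BEQ 3124: true when nothing precedes the LF."""
    r2 = next_non_space(buf)
    return r2 == LF
```

With these definitions, `is_blank_line(b"\n")` is true and `is_blank_line(b"10 PRINT\n")` is false.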
Handling a number at the beginning of the line
003142 012701 MOV #13540, R1
003146 104410 TRAP 10
003150 121127 CMPB (R1), #12
003154 001675 BEQ 2750
The next test carried out is to see whether the user entered a number at the beginning of the command. If so, there are two possibilities:
The number could be the line-number for a command, which might be a new command or a replacement for an existing line number.
The number could be the only thing on a line on its own. In BASIC, that would mean to delete the specified line number from the program.
Here's how it works. Firstly, the memory location of the input buffer (address 13540) is stored in register R1. TRAP 10 is then used to convert bytes of the string pointed to by R1 into a number, which will be stored in R0. If the characters at the beginning of the string are non-numeric, R0 will contain zero after this call.
After the TRAP 10 call, R1 points at the character immediately after any digits that were parsed from the input. The next test therefore checks whether that character is a linefeed, in other words whether the number was entered on a line on its own. If so, the specified line number should be deleted from the program, and control jumps to address 2750 to carry this out. I will examine what happens there in a subsequent post.
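As a rough model of this step, the Python sketch below parses the leading digits and then applies the linefeed test. `parse_line_number` is a hypothetical stand-in for TRAP 10; it assumes line numbers are decimal and that TRAP 10 does not skip leading whitespace, neither of which has been verified against the disassembly here.

```python
LF = 0o12


def parse_line_number(buf: bytes, r1: int = 0) -> tuple[int, int]:
    """Hypothetical stand-in for TRAP 10: return (R0, R1 after the digits)."""
    r0 = 0
    while 0o60 <= buf[r1] <= 0o71:       # ASCII '0'..'9'
        r0 = r0 * 10 + (buf[r1] - 0o60)  # assumption: decimal line numbers
        r1 += 1
    return r0, r1


def is_delete_request(buf: bytes) -> bool:
    """Mirror CMPB (R1), #12 / BEQ 2750: the digits are followed directly by LF."""
    _, r1 = parse_line_number(buf)
    return buf[r1] == LF
```

For `b"10 PRINT\n"` the sketch returns the value 10 with R1 at index 2, the space after the digits, so the delete test fails. For `b"10\n"` the test succeeds. A completely empty line never reaches this point, because it was already handled by the earlier newline check.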
Figuring out what BASIC command has been entered
003156 010103 MOV R1, R3
003160 012700 MOV #3626, R0
003164 005002 CLR R2
003166 122327 CMPB (R3)+, #40
003172 001775 BEQ 3166
003174 124320 CMPB -(R3), (R0)+
003176 001005 BNE 3212
003200 005203 INC R3
003202 121027 CMPB (R0), #44
003206 001430 BEQ 3270
003210 000766 BR 3166
003212 122027 CMPB (R0)+, #44
003216 001375 BNE 3212
003220 121027 CMPB (R0), #44
003224 001420 BEQ 3266
003226 010103 MOV R1, R3
003230 005202 INC R2
003232 000755 BR 3166
If this code is being executed, R1 points at the next character after any digits at the beginning of the line. Let's see what happens next.
003156 010103 MOV R1, R3
Firstly, R1 is copied into R3.
003160 012700 MOV #3626, R0
Next, the address 3626 is moved into R0. At this memory address is the following string:
This is a "$"-separated list of the BASIC commands, terminated with "$$". You can probably imagine what happens now: the code iterates through this list and compares the characters on the input line to each valid command in turn until one matches or the end of the list is reached.
003164 005002 CLR R2
R2 is set to zero. This will be used to track the current index into the list of valid commands. When a match is found, R2 will contain the index of the matching command.
003166 122327 CMPB (R3)+, #40
003172 001775 BEQ 3166
This code skips any whitespace that the user may have entered. The byte at R3 is compared to the space character (ASCII 40), and R3 is then incremented. If the byte was a space, the code branches around again to check the next one.
When the first non-space character is found, R3 has already been incremented past it, so it points one character beyond that character (two characters after the last space, if there was one).
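To check where R3 ends up, here is a small Python model of the autoincrement loop. It is a sketch only: the buffer is an ordinary `bytes` object, R3 is an index into it rather than a memory address, and the sample line is made up for illustration.

```python
def skip_spaces(buf: bytes, r3: int) -> int:
    """Emulate CMPB (R3)+, #40 / BEQ 3166 and return the final value of R3."""
    while True:
        ch = buf[r3]    # (R3): read the byte R3 points at
        r3 += 1         # +: the post-increment happens whether or not it matched
        if ch != 0o40:  # BEQ falls through on the first non-space byte
            return r3


line = b"10  PRINT\n"      # hypothetical input with two spaces after the number
r3 = skip_spaces(line, 2)  # start at the first space after "10"
```

Here the last space is at index 3 and "P" is at index 4, and the loop leaves R3 at index 5. That is why the next instruction has to pre-decrement R3 before using it.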
003174 124320 CMPB -(R3), (R0)+
R3 is pre-decremented (so that it now points at the first non-space character) and the byte it points to is compared with the byte pointed to by R0. R0 is then incremented.
003176 001005 BNE 3212
If the two characters do not match, that means that the entered command does not match the current entry in the list of commands. Therefore, control branches to 3212 to skip the remainder of the current entry in the list of commands.
003200 005203 INC R3
Otherwise, the current characters matched. The pre-decrement in the compare instruction at 003174 left R3 pointing at the character that was just compared, so R3 is incremented here to step past it. When control loops back to 3166, the whitespace-skipping compare post-increments R3 once more, and the pre-decrement at 003174 then brings it back to the next character to be tested.
003202 121027 CMPB (R0), #44
003206 001430 BEQ 3270
003210 000766 BR 3166
The character pointed to by R0 is compared to "$" (ASCII 44). If they are equal, the whole entry has been matched and we have found the command, in which case we branch to 3270. Otherwise we loop back to 3166 to test the next character.
003212 122027 CMPB (R0)+, #44
003216 001375 BNE 3212
This code is executed when the two characters being compared (the current character in the input string and the current character in the list of commands) do not match. In this case, the byte pointed to by R0 is compared to "$" and R0 is autoincremented. This is repeated until a "$" is found. At this point, R0 points at the character after the "$". This will either be the first character of the next command in the list or, if the end of the list has been reached, another "$".
003220 121027 CMPB (R0), #44
003224 001420 BEQ 3266
We therefore compare the byte pointed to by R0 to "$" again to see if we are at the end of the list of valid commands (remember the list ends with "$$"). If it is a "$", we have reached the end of the list and the input string has not matched any of the commands in it. The user has therefore not entered a valid command, so control jumps to the error handling further down at address 3266.
003226 010103 MOV R1, R3
003230 005202 INC R2
003232 000755 BR 3166
Otherwise, we're not at the end of the list of valid commands, so we copy R1 (the start of the command string after any initial digits have been parsed) into R3 to go back to the beginning of the user's command. R2 is then incremented to hold the index of the next command in the list, and control branches back to 3166 to start testing against that command.
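The whole matching loop can be modelled in Python, register by register. This is a sketch under stated assumptions: buffers are `bytes` objects indexed from zero instead of memory addresses, and `TABLE` is a made-up three-entry list for illustration, not the real table at 3626, so the indexes it produces differ from the real ones.

```python
TABLE = b"LIST$RUN$PRINT$$"  # hypothetical table; the real one lives at 3626


def match_command(buf: bytes, r1: int, table: bytes = TABLE):
    """Return (R2, R3) for the matching entry, or None when the list is exhausted."""
    r3, r0, r2 = r1, 0, 0          # MOV R1, R3 / MOV #3626, R0 / CLR R2
    while True:
        while True:                # 3166: CMPB (R3)+, #40 / BEQ 3166
            ch = buf[r3]
            r3 += 1
            if ch != 0o40:
                break
        r3 -= 1                    # 3174: CMPB -(R3), (R0)+
        same = buf[r3] == table[r0]
        r0 += 1
        if same:
            r3 += 1                # 3200: INC R3
            if table[r0] == 0o44:  # 3202: "$" ends the entry, so it matched
                return r2, r3      # 3206: BEQ 3270
            continue               # 3210: BR 3166
        while True:                # 3212: CMPB (R0)+, #44 / BNE 3212
            ch = table[r0]
            r0 += 1
            if ch == 0o44:
                break
        if table[r0] == 0o44:      # 3220: "$$" marks the end of the list
            return None            # 3224: BEQ 3266, unrecognised command
        r3 = r1                    # 3226: MOV R1, R3
        r2 += 1                    # 3230: INC R2
```

Calling `match_command(b'10 PRINT "HI"\n', 2)` returns index 2 in this made-up table, with R3 at index 8, the character immediately after the "T" of "PRINT". One thing the model makes visible is that the space-skipping compare runs before every character test, so in this sketch an input such as `P RINT` matches as well.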
Tokenising the BASIC command
In the interest of saving memory, the BASIC commands are replaced with unique single character representations, based on the index of the command in the "$"-separated list above.
Here's the code that does that:
003270 062702 ADD #140, R2
003274 110221 MOVB R2, (R1)+
003276 010104 MOV R1, R4
003300 111321 MOVB (R3), (R1)+
003302 122327 CMPB (R3)+, #12
003306 001374 BNE 3300
Here's what you need to remember:
R2 contains the index of the matching command from the list of valid commands, starting with 0 for "LIST".
R1 points at the next location in the string after any digits that may be at the beginning of the line.
R3 points at the character immediately after the last character of the command that has been identified.
For example, suppose the command being considered was "10 PRINT "HELLO WORLD"". R2 would contain 10 (which is the index of "PRINT" in the list of commands). R1 points at the space after the number "10" and R3 points at the space after "PRINT".
Now, let's walk through the token mapping code.
003270 062702 ADD #140, R2
Firstly, 140 (octal, which is the ASCII code for "`") is added to the value in R2. This means that the BASIC commands have the following token representations:
003274 110221 MOVB R2, (R1)+
The value of the token, now stored in R2, is placed in the string at the location after any numbers on the line, then R1 is incremented.
003276 010104 MOV R1, R4
The value in R1 is copied to R4.
003300 111321 MOVB (R3), (R1)+ ; copy the character pointed to by R3 to the address pointed to by R1, then increment R1
003302 122327 CMPB (R3)+, #12 ; compare the character just copied with LF, then increment R3
003306 001374 BNE 3300 ; if it was not LF, loop back to copy the next character
Now, the character pointed to by R3 is moved to the address pointed to by R1, then R1 is incremented. The same character is then compared to linefeed, and R3 is incremented. If the character was not a linefeed, control branches back to 3300 to copy another character. The linefeed itself is copied before the loop ends.
These three lines copy everything after the command into the string buffer pointed to by R1. So, the command:
10 PRINT "HELLO WORLD"
will be translated into
10j "HELLO WORLD"
Referring to the table above, the "j" represents the PRINT command. This saves memory because all of the commands are represented by a single byte.
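The tokenising step can be reproduced with a short Python sketch. The register values passed in are taken from the example above (R1 at the space after "10", R2 equal to 10, R3 just past "PRINT"). The `bytearray` and the final trimming step are conveniences of the sketch and are not part of the original code.

```python
def tokenise(buf: bytearray, r1: int, r2: int, r3: int) -> int:
    """Replace the keyword in place with its one-byte token and return R4."""
    buf[r1] = 0o140 + r2   # ADD #140, R2 / MOVB R2, (R1)+
    r1 += 1
    r4 = r1                # MOV R1, R4
    while True:
        buf[r1] = buf[r3]  # 3300: MOVB (R3), (R1)+
        r1 += 1
        ch = buf[r3]       # 3302: CMPB (R3)+, #12
        r3 += 1
        if ch == 0o12:     # BNE 3300 falls through once LF has been copied
            break
    del buf[r1:]           # not in the original: drop stale bytes for display only
    return r4


line = bytearray(b'10 PRINT "HELLO WORLD"\n')
r4 = tokenise(line, r1=2, r2=10, r3=8)
```

After the call, `line` holds `10j "HELLO WORLD"` followed by the linefeed. The token is octal 140 plus 10, which is octal 152, the ASCII code for "j". The copy is safe to do in place because the destination always trails the source.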
So far, the code has:
Determined whether the user has pressed return, in which case control returns to the beginning of the loop waiting for another line of input.
Checked whether the line of input begins with a number. If so, determined whether the line consists of a number on its own, which means delete that line number from the program.
Identified which BASIC command the user has entered, or generated an error if the command is not recognised.
Tokenised the BASIC command to save memory.