Reverse engineering PDP-11 BASIC: Part 8

daveor
Feb 3, 2021
16 min read

In Part 6 I analysed the code that:

Determined whether the user has pressed return, in which case control returns to the beginning of the loop waiting for another line of input.
Checked whether the line of input begins with a number. If so, determines whether the line consists of a number on its own, which means delete that line number from the program.
Identified which BASIC command the user has entered, or generated an error if the command is not recognised.
Tokenised the BASIC command to save memory.

This post continues the analysis of the main syntax parsing loop, looking at some more of the input validation, execution of commands in immediate mode and storage of commands in deferred mode.

NOTE: For context and a list of other posts on this topic, see the PDP-11 BASIC reverse engineering project page.

Skipping REMarks

Firstly, any REM commands in the code can skip further processing. Therefore, the syntax parsing code first checks whether the command being analysed is REM, which is tokenised to "c" (see Part 6 for a description of the tokenisation process):

003310 120227 CMPB R2, #143
003314 001465 BEQ 3470

R2 contains the command token value and it is compared to 143, which is the ASCII code for "c". If R2 equals "c" that means the current line is a REM command and the remainder of the testing described in this post is skipped. Control jumps directly down to the code described in the section "Execute/store command check" below.

Checking (some!) IF syntax

The next check carried out in the code is to make sure that the syntax of an IF statement is correct. This is a bit strange because some, but not all, of the valid IF syntax formats are checked here. For example "IF <something> GOTO <number>" is a valid for of statement, but not checked by this code.

I think that the point of this code is actually principally to check for cases where there is another command on the same line as IF. For example there's the structure of IF consisting of "IF <something> THEN <command>". My suspicion is that the main purpose of this code is to identify the <command> in these situations, so that it can be parsed and validated.

Anyway, here's the code:

003316 020227 CMP R2, #155
003322 001035 BNE 3416
003324 010401 MOV R4, R1
003326 104472 TRAP 72
003330 120227 CMPB R2, #124
003334 001022 BNE 3402
003336 104472 TRAP 72
003340 120227 CMPB R2, #110
003344 001016 BNE 3402
003346 104472 TRAP 72
003350 120227 CMPB R2, #105
003354 001012 BNE 3402
003356 104472 TRAP 72
003360 120227 CMPB R2, #116
003364 001006 BNE 3402
003366 104472 TRAP 72
003370 005301 DEC R1
003372 010104 MOV R1, R4
003374 104470 TRAP 70
003376 001407 BEQ 3416
003400 000666 BR 3156
003402 120227 CMPB R2, #12
003406 001430 BEQ 3470
003410 120227 CMPB R2, #72
003414 001344 BNE 3326

Let's see what this does.

003316 020227 CMP R2, #155
003322 001035 BNE 3416

Firstly, the value in R2, representing the current command, is compared to 155, which is the ASCII code for "m". This is the tokenised value for the IF command (again, see Part 6 for a description of the tokenisation process). If the value in R2 is not "m" then this set of instructions is skipped.

003324 010401 MOV R4, R1

Next, R4 is copied into R1. After replacing the command with its corresponding lower-case letter token, but before copying the remainder of the instruction to overwrite the original command word, R1 was backed up to R4 (see Part 6, and look for the instruction at address 3276). This means that R4 contains a pointer to the next character in the input string after the command token. This is moved into R1, so now R1 points to the same place.

003326 104472 TRAP 72
003330 120227 CMPB R2, #124
003334 001022 BNE 3402

We now enter a loop validating the remaining characters in the IF statement. The first thing we're looking for is "THEN", which must be on the same line as IF.

First, we use TRAP 72 to get the next non-whitespace character in the command. The character will be stored in R2. We compare the value in R2 to "T". If it's not equal, we jump down to 3402, which basically checks if we're at the end of the line and if not, loops back around to try the next character.

003336 104472 TRAP 72
003340 120227 CMPB R2, #110
003344 001016 BNE 3402
003346 104472 TRAP 72
003350 120227 CMPB R2, #105
003354 001012 BNE 3402
003356 104472 TRAP 72
003360 120227 CMPB R2, #116
003364 001006 BNE 3402

If we found a "T", we then do the exact same series of steps looking for "H" (ASCII 110), "E" (ASCII 105) and "N" (ASCII 116); use TRAP 72 to get the next non-whitespace character, compare it to the ASCII code we're looking for and if not equal branch to 3402 to check if we're at the end of the line or loop back to the top.

003366 104472 TRAP 72
003370 005301 DEC R1
003372 010104 MOV R1, R4
003374 104470 TRAP 70
003376 001407 BEQ 3416
003400 000666 BR 3156

At this point in the code we have found the "THEN" part of the IF statement. The code then checks for a couple of possible (but not all!) valid structures of an "IF" statement. Specifically:

IF <something> THEN <number>
IF <something> THEN <command>

First we get the next non-whitespace character using TRAP 72, with the character being placed into R2. As a side-effect TRAP 72 will also increment R1 to point at the next character, after the character that has just been placed into R2. Therefore, R1 is decremented again, in case we need to convert the value into a number.

R1 now points at the next non-whitespace character after "THEN", and this location is stored in R4.

A TRAP 70 is used to check whether the value in R2 is numeric. If it is, the zero flag is set. If the zero flag is set we have identified the "IF <something> THEN <number>" structure and control branches to 3416, because we have identified a valid structure. Otherwise, the code branches to address 3156, which is the top of the command parsing loop to try and parse a command following "THEN".

003402 120227 CMPB R2, #12
003406 001430 BEQ 3470
003410 120227 CMPB R2, #72
003414 001344 BNE 3326

Finally we have the code that is run in the case where we didn't find "THEN". First we compare R2 to line feed (ASCII 12) and if R2 equals 12 branch to 3470, which is the end of the parsing of the statement. See below in the section "Execute/store command check".

Otherwise, we compare the value in R2 to ":" (ASCII 72). If it's not equal, we loop back up to continue looking for the "T" in "THEN".

Other syntax checks

Following on from the code just described in the previous section, control may move to this block of code when a ":" is identified after an IF statement. However, this is also the same code that is executed when the test for the IF token failed (i.e. ASCII "m" was checked by the instruction at address 3316). In other words, this code is executed for all commands except REM and IF.

Here's the code:

003416 010401 MOV R4, R1
003420 121127 CMPB (R1), #42
003424 001410 BEQ 3446
003426 122127 CMPB (R1)+, #72
003432 001755 BEQ 3366
003434 124127 CMPB -(R1), #12
003440 001412 BEQ 3466
003442 005201 INC R1
003444 000765 BR 3420
003446 005201 INC R1
003450 121127 CMPB (R1), #42
003454 001772 BEQ 3442
003456 121127 CMPB (R1), #12
003462 001371 BNE 3446
003464 104463 TRAP 63

Let's see what's going on here.

003416 010401 MOV R4, R1

Firstly, R4 is copied into R1. For most instructions R4 contains a pointer to the next character in the input string after the command token (see Part 6, and look for the instruction at address 3276).

In the case of the IF instruction (but only in cases where a ":" is found after the IF statement), things are slightly more complicated. Initially, R4 will contain a pointer to the next non-whitespace character after "THEN", if "THEN" was found as part of the IF command (see instruction at address 3372 above). In this case, the code will loop, as described below, until R1 points at the ":" after the IF statement.

It is probably easiest to think of what happens in each of these two cases separately.

003420 121127 CMPB (R1), #42
003424 001410 BEQ 3446

The value pointed to by R1 is compared to the double quote character (ASCII 42). If it equals, the code seeks forwards looking for the corresponding closing double quote character by branching to the instruction at address 3446.

003426 122127 CMPB (R1)+, #72
003432 001755 BEQ 3366

If the character pointed to by R1 is not a double quote, it is compared to the ":" character (ASCII 72) and then R1 is incremented, so it now points at the character after the colon. If the character equals a colon then control jumps back to the instruction at address 3366. This will, in turn, cause a branch to the instruction at address 3156 to parse the command after the colon.

003434 124127 CMPB -(R1), #12
003440 001412 BEQ 3466

For the next test we have determined that R1 did not point at a double quote character or a colon character, so R1 is pre-decremented again (because of the post-increment in the previous instruction) and compared against the linefeed character (ASCII 12), to see if we are at the end of the line. If we are, control branches to address 3466, which is there the "execute/store command check" dicussed below can be found.

003442 005201 INC R1
003444 000765 BR 3420

If none of the three tests above caused a jump to somewhere else in the code, the value in R1 is incremented and control branches back to 3420 to compare the next character to double quote, colon and newline.

This is the most relevant case for the purposes of the "IF" situation. What will happen is that the three tests above will fail until the colon after the IF statement is encountered, at which point the colon test above will succeed and control will pass to the instructions for parsing the command after the IF statement.

In any case, control will loop until one of the three conditions above is met.

003446 005201 INC R1
003450 121127 CMPB (R1), #42
003454 001772 BEQ 3442
003456 121127 CMPB (R1), #12
003462 001371 BNE 3446
003464 104463 TRAP 63
003466 005201 INC R1

The remaining code in this block is executed to seek for a closing double quotation mark once an opening quotation mark has been identified.

First R1 is incremented to move on to the next character. Then the character is compared to the double quote character (ASCII 42). If it equals we have found a closing quote and control branches to address 3442 in order to parse the remainder of the line. On the other hand, if a linefeed is encountered (ASCII 12) that means we have encountered a newline before a closing double quotation mark, which means there is an open string constant. Therefore, in the situation where we do not identify a linefeed, control continues looping reading characters until either a closing quotation mark or a linefeed is identified. If a linefeed is identified before a closing quotation mark, this causes an error message, represented by the odd-valued TRAP instruction, TRAP 63 (remember that all odd TRAP values are errors).

R1 is then incremented. Since this happens only after a linefeed at the end of the command has been reached, R1 now points at the next space in the string buffer immediately following the command being parsed.

Execute/store command check

Now we need to decide whether the command needs to be executed or stored. If the command has a number at the beginning it is part of a program and should be stored for future execution. This is called deferred mode. On the other hand, if the command has no number at the beginning then it should be executed immediately and this is called immediate mode.

In this section we'll look at the very simple code that checks which of these two situations applies. Here's the code:

003470 010103 MOV R1, R3
003472 012701 MOV #13540, R1
003476 104472 TRAP 72
003500 104470 TRAP 70
003502 001402 BEQ 3510
003504 005301 DEC R1
003506 000652 BR 3234

Let's see what happens.

003470 010103 MOV R1, R3

Firstly R1 is copied to R3. R1 points at the character immediately following the linefeed at the end of the current command.

003472 012701 MOV #13540, R1

Next, the memory address 13540 is moved into R1. This is a pointer to the beginning of the string buffer. In other words, to be beginning of the command that has just been entered.

003476 104472 TRAP 72
003500 104470 TRAP 70

TRAP 72 is used to get the next non-whitespace character pointed to by R1, which will be the first non-whitespace character of user input. The character is stored in R2. TRAP 70 is then used to determine whether or not the character in R2 is represents the ASCII code of a number. If it does then the zero flag is set, otherwise the zero flag is not set.

003502 001402 BEQ 3510
003504 005301 DEC R1
003506 000652 BR 3234

If the zero flag is set, that means there is a number at the beginning of the line, so control jumpt to 3510 to handle the command in deferred mode. Otherwise, there is not a number at the beginning of the line, so R1 is decremented to point at the beginning of the string buffer again, and control jumps to address 3234 to handle the command in immediate mode.

Executing commands in immediate mode

If the command is to be executed in immediate mode, this is the code that is used:

003234 005767 TST 13702
003240 001306 BNE 3056
003242 004767 JSR PC, 1224
003246 162702 SUB #140, R2
003252 100405 BMI 3266
003254 006302 ASL R2
003256 000172 JMP @4020(R2)
003262 000167 JMP 4106
003266 104403 TRAP 3

Let's see what happens.

003234 005767 TST 13702
003240 001306 BNE 3056

First the value at memory address 13702 is tested. This value is set if code execution has been interrupted by the user pressing Ctrl-P. Otherwise this should be zero. If it is non-zero, control jumps to the beginning of the syntax parsing loop at address 3056.

003242 004767 JSR PC, 1224

Next the subroutine at memory address 1224 is executed. This is the same as calling TRAP 72. In other words, when this returns, R2 will contain the next non-whitespace character in the user input. This will be the token character representing the command that has been entered (see Part 6 for a description of the tokenisation process).

003246 162702 SUB #140, R2

Now, 140 is subtracted from the value in R2. The value in R2 is now the index of the command that has been entered in the list of possible commands.

003252 100405 BMI 3266

The value in R2 should be zero or greater. If the value in R2 is negative, this is an error, so in this case, control jumps to address 3266. This causes the generation of a TRAP 3 - remember all odd-numbered TRAPs are error conditions.

003254 006302 ASL R2

Now, the value in R2 is multiplied by 2 by shifting it left by one bit. So the values in R2 will be 0, 2, 4, 8, 10, 12, ... depending on the index of the command that was selected by the user.

003256 000172 JMP @4020(R2)

The next command gets the value this is contained in the memory address 4020+R2 and jumps to that address. In other words, starting at address 4020 is a table of function pointers to code for handling each of the BASIC commands. Here is the table, annotated with the corresponding BASIC commands.

004020 002440 ; LIST
004022 006070 ; LET
004024 007150 ; READ
004026 006324 ; REM
004030 004150 ; RUN
004032 004240 ; RESTORE
004034 004246 ; RETURN
004036 006324 ; DATA
004040 004332 ; DIM
004042 002670 ; DELETE
004044 006430 ; PRINT
004046 004204 ; GOSUB
004050 004216 ; GOTO
004052 006140 ; IF
004054 007302 ; FOR
004056 007620 ; NEXT
004060 006660 ; INPUT
004062 002060 ; SAVE 
004064 004106 ; STOP
004066 004106 ; END
004070 004474 ; DEF
004072 002072 ; OLD
004074 010054 ; RANDOMIZE

In other words, what this says is that the code for handling the LIST command, for example, can be found at address 002440. Likewise for all of the other commands.

003262 000167 JMP 4106
003266 104403 TRAP 3

Finally, there is a jump to memory address 4106. As can be seen in the table above, this is the memory location of the STOP/END command. The TRAP 3 is used to indicate an error if an invalid command token is encountered.

That's how immediate mode works.

Storing commands in deferred mode

If the command is being handled in deferred mode, then it is stored in the program memory area until the user enters the RUN command to execute the stored program.

Here's the code:

003510 104516 TRAP 116
003512 012701 MOV #13540, R1
003516 160103 SUB R1, R3
003520 010346 MOV R3, -(SP)
003522 104410 TRAP 10
003524 005700 TST R0
003526 001436 BEQ 3624
003530 020027 CMP R0, #17777
003534 003033 BGT 3624
003536 104474 TRAP 74
003540 001001 BNE 3544
003542 104476 TRAP 76
003544 012603 MOV (SP)+, R3
003546 104516 TRAP 116
003550 010300 MOV R3, R0
003552 104504 TRAP 104
003554 103003 BCC 3564
003556 104401 TRAP 1
003560 060005 ADD R0, R5
003562 000410 BR 3604
003564 020105 CMP R1, R5
003566 103374 BCC 3560
003570 010502 MOV R5, R2
003572 060005 ADD R0, R5
003574 010504 MOV R5, R4
003576 114244 MOVB -(R2), -(R4)
003600 020102 CMP R1, R2
003602 101775 BLOS 3576
003604 012702 MOV #13540, R2
003610 111221 MOVB (R2), (R1)+
003612 122227 CMPB (R2)+, #12
003616 001374 BNE 3610
003620 000167 JMP 3004
003624 104441 TRAP 41

Let's work through this and see how it works.

003510 104516 TRAP 116

First we have mystery TRAP 116. I haven't figured out what this does yet, but I will.

003512 012701 MOV #13540, R1
003516 160103 SUB R1, R3
003520 010346 MOV R3, -(SP)

The size of the command is then calculated. The beginning of the input buffer (address 13540) is moved into R1. Then, R1 is subtracted from R3. Remember that R3 contains the memory location of the end of the command. Therefore, subtracting R1 will leave the length of the command, in bytes, in R3. This value is then pushed onto the stack.

003522 104410 TRAP 10
003524 005700 TST R0
003526 001436 BEQ 3624

Next, TRAP 10 is used to convert the digits at the beginning of the command into a number, which will be stored in R0. This value is then checked and if it equals zero, in other words the command did not begin with a number, control branches to address 3624 which causes an error to be generated. This is because we are in the deferred mode execution code at the moment and therefore every line should begin with a line number.

003530 020027 CMP R0, #17777
003534 003033 BGT 3624

We then compare the line number in the command to the value 17777 (which is 8192 in octal). This is presumably the largest allowed line number and if the line number is greater than this, control branches to address 3624 which causes an error to be generated.

003536 104474 TRAP 74
003540 001001 BNE 3544
003542 104476 TRAP 76

Now we use TRAP 74 (described in Part 7) to see whether a command with the line number of the command we are adding already exists in the program. If it does not we skip the TRAP 76. On the other hand, if it does already exist in the program, TRAP 76 is used to delete the current command with the same line from the program.

003544 012603 MOV (SP)+, R3
003546 104516 TRAP 116

The value in R3 is restored from the stack, so now R3 will contain the length of the command being added to the program. Then there is another TRAP 116 and, as I mentioned a few minutes ago, I'm still not sure what this does.

003550 010300 MOV R3, R0
003552 104504 TRAP 104
003554 103003 BCC 3564
003556 104401 TRAP 1

The length of the command being added to the program is now copied into R0. Then TRAP 104 is used to check whether there is enough space in the program memory area for the command. See Part 7 for a description of how TRAP 104 works. If there is enough space, the Carry flag will be clear, in which case the TRAP instruction is skipped and control jumps to address 3564. Otherwise, if there is not enough space, the TRAP 1 is invoked and an error is generated.

003564 020105 CMP R1, R5
003566 103374 BCC 3560

Next R1 is compared to R5. The TRAP 74 and TRAP 76 calls will have set R1 to point at the value where the line number of the current command used to exist in the program memory, or at the end of the program memory if the line number of the command to be added was not found. Therefore, what we are, in effect, checking here is whether the new command is to be added at the end of the program or whether there are any commands after the location where the current command is to be added to the program memory.

If R1 is greater than or equal to R5 then control jumps to 3560. This means we are adding the new line to the end of the program, in which case the following instructions are executed:

003560 060005 ADD R0, R5
003562 000410 BR 3604

R0, the size of the current command, is added to R5. This basically adds space for the new command onto the end of the current program code. Then we branch to address 3604.

Otherwise we have to make room somewhere in the middle of the current program for the new line, in which case these instructions are executed:

003570 010502 MOV R5, R2
003572 060005 ADD R0, R5
003574 010504 MOV R5, R4
003576 114244 MOVB -(R2), -(R4)
003600 020102 CMP R1, R2
003602 101775 BLOS 3576

Firstly the current value of R5 is moved to R2, so R2 will hold the "old" end of the program location. Then R0 (the length of the new command) is added to R5. Then R5 is moved to R4, so R4 will hold the "new" end of the program location.

Now, a byte is moved from the "old" end of the program location (pointed to by R2) to the "new" end of the program location (pointed to by R4), and both registers are decremented.

The R2 value is then compared to the value in R1, which points at the location where the new program command is to be inserted. We loop until R1 is greater than R2.

What we have now done is create a gap in the program code, of size specified by the value in R0 (i.e. the size of the new command) at the location specified by R1 (i.e. the location where the new command is to be inserted), and moved all subsequent bytes up in memory by an offset of R0 bytes, to make room.

003604 012702 MOV #13540, R2
003610 111221 MOVB (R2), (R1)+
003612 122227 CMPB (R2)+, #12
003616 001374 BNE 3610

Now, finally, we copy the new command into place.

We move the location of the string input buffer (address 13540) into R2. Then we move a byte from address specified in R2 to the address specified in R1, which is the location in the program code storage area, then increment the value in R1.

We check if we have reached the line feed character (ASCII 12) in the input buffer, representing the end of the command to be copied, and increment R2. If not, we loop and copy another byte into the program code storage area.

003620 000167 JMP 3004
003624 104441 TRAP 41

Now that the command has been copied into the program code storage area we jump back to the beginning of the syntax parsing loop to allow the user enter another command.

The TRAP 41 is the error code that is generated at various locations throughout the code.

Conclusion

That concludes the walkthrough of the main syntax parsing loop. For those who are still with me, there's a small bit of the syntax parsing loop left that I haven't written about. It seems to relate to running programs (i.e. programs started by the RUN command). I don't really understand how that works yet, so I'll come back to it in a later post. In the meantime, I think the next thing is to look at how the BASIC commands themselves are executed.

Thanks for reading, and if you've gotten this far, well done!