Elliott 900-Series BCPL Implementation. ======================================= Terry Froggatt, 18th December 2018. My Interest in BCPL. -------------------- Throughout 1977 & 1978, I was the research fellow in real-time programming languages at the University of York. My key activity was to attend meetings at the Berlaymont building in Brussels, to discuss the proposed standard "Long-Term Procedural language for Europe" (LTPL-E). At the same time, the US Department of Defense was developing its own standard real-time language, later called "Ada". So the LTPL-E group transformed itself into an Ada liaison group, which is how I later became involved with the Ada language at Systems Designers, from 1979 to 1991. But amongst the background tasks which I undertook at York was the porting of a BCPL implementation from a PDP-11 running RSX to a PDP-11 running Unix, to support York's development of Modula. This was not a major task. I didn't really have to understand much about the internal workings of the BCPL compiler. Also, in my own time at York, I was writing a BASIC interpreter for the Elliott 900-series Computer User's Association, having worked on the 900-series from 1966 to 1976. To help me with this task, DDT, a 3rd-party maintenance organisation, loaned me an Elliott Arch 9000, which I subsequently bought from them. So it was natural to wonder whether I could implement BCPL on the 900-series. Before I left York, I made paper-tape copies of the source files (GET & BCPL) and O-code files, of each of: the BCPL Library, the BCPL compiler front-end (TSYN & NTRN) and the BCPL compiler back-end or code-generator (TRAN). A chance meeting with Martin Richards at an Oxford conference in November 2016 made me consider doing this again. In November 2001 I'd interfaced my Arch 9000 to the printer port of a Pentium, and in June 2008 I'd uploaded the paper tapes, via the Arch 9000 paper-tape reader and this link, to the Pentium. So I could now develop the implementation using any modern screen editor, and test my code in my Elliott simulator on the Pentium, making it all a lot easier than it would have been 30 years earlier. Starting Point. --------------- Although Martin Richards is still developing BCPL, I very much wanted to start from my York tapes, not only because I had some knowledge of them, but because the PDP-11 implementation used a 16-bit word, so would port more easily to the 18-bit Elliott word than a modern 32-bit or 64-bit versions would. The modern BCPL versions also contain many facilities not in the York version, such as floating-point, and fitting BCPL into an 8192-word or 16384-word Elliott was already a challenge without these extras. The modern preference is to port BCPL by writing an IntCode interpreter, which is claimed to make porting easier. But having already implemented a BASIC text-interpreter and a pidgin-Ada threaded-code interpreter for the Elliott 900-series, writing yet another interpreter was of no real interest. I'd already sketched out an O-code to Elliott SIR mapping, and I didn't consider implementing this difficult. A fundamental property of BCPL is that it supports recursion. 900-series ALGOL 60 and my 900-series BASIC both support recursion easily, being interpretive. MECSL's & CAP's 900-series CORAL 66 implementations both generate native code, but both omit recursion. (In CORAL 66, recursive procedures have to be flagged as such, which avoids the overhead on other procedures, but the official definition notes that recursion "may or may not be considered worth having"). What I wanted to do was generate true native code from O-code, with a view to mapping rather more of Ada onto it someday. So the challenge was to implement recursion in native code, on an Elliott 900-series machine which has no stack support. This was not as hard as I expected (see "Stack Addressing" below). What to Target? --------------- The usual minimum store size for a 900-series machine is 8192 words, which can be extended in 8192-word multiples. There are currently a handful of working machines, all with either 8192 or 16384 words, and my Arch 9000 now has 32768 words. Of course, the available simulators can simulate more store. In BCPL it is assumed that all code & data is addressed in a single flat address space. Certainly BCPL programs of up to a certain size will fit into an 8192-word machine, with no addressing problems. My implementation puts everything into the first 8192 words when it is clear that it will fit. For larger programs, the stack, and data structures such as strings, can be placed into higher store modules when these are available, accessed by B-modification, provided that the code and the simple data (global vector, static variables, literal constants, and pointers to the stack & strings and to procedures) still fit into the first 8192 words. My implementation implements this too. In the case of even larger programs, it is unlikely that the simple data would exceed 8192 words before the code exceeded 8192 words, so the next step would be to place some or all of the code into higher store modules, with the simple data still in the first 8192 words. The default on the 900-series is that data and code addresses are relative to the start of the 8192 words containing the code being obeyed, and accessing elsewhere requires B-modification. Machines from the 920C or 905 onwards have an additional address mode in which data (but not code) addresses are absolute. So placing any code outside the first 8192 words would require: * the use of the absolute address mode for data, which is not available on any of the currently working machines, * more complicated slower system routines for the likes of procedure entry & exit, to convert code addresses from absolute to relative, * a modification to 2-PASS SIR to always locate literals in the first 8192 words, rather than in the current module. Accordingly, I've not considered this option any further. There are some differences between machines in the 900-series. On the 920A, Functions 6 7 8 9 all leave the Q-register corrupted, and B-modification corrupts the Q-register before the instruction is executed. On later machines there are fewer Q-register corruptions, but I've not found anywhere in my BCPL implementation where I could usefully exploit these improvements. The later 905 & 920C have some extra instructions which the earlier 920A 920B 920M & 903 do not support, including the potentially useful absolute mode just mentioned above. But I've not found anywhere in my BCPL implementation where using any of the extra instructions would give a significant saving. I resisted the temptation to implement BCPL's GOTO command by writing the address calculated in the A-register directly into the program counter, which would work on all currently working machines (but NOT on a 905 or 920C or in some simulators). Formally, the effect of writing to the active program counter is undefined. Accordingly, I've aimed to produce code which will run on ANY 18-bit 900-series machine with at least an 8192-word store, provided that the program code & simple data together fit into just under 8192 words. The challenge here is to get the compiler passes themselves to fit. The GET Command. ---------------- The BCPL language has a GET command, which switches the input of the BCPL front-end from its current source to a named file, then switches back to the current source at the end of the named file. In the York implementation, only a single depth of GET is supported. In small programs, a small GET file is often used to define the global vector offsets of library procedures, to give access to a separately-compiled library. In larger, multi-Section programs, another GET file may be used to provide communication between the Sections, giving global vector offsets of shared procedures and manifest constant declarations for shared enumeration types. Although the 900-series does have file system support in the shape of FAS for magnetic tapes and RADOS for disks, the standard 900-series machine has neither, just paper tape. So, given that I'm developing this BCPL implementation on a host Pentium which does have a file system, I've taken the pragmatic approach, for now, of writing a pre-processor to run on the Pentium, which inserts the GET files into the BCPL files producing expanded BCPL files (which I call GOT files). Possible GET implementations on a real 900-series machine: * Given that only one level of GET is expected, use one tape reader for the BCPL files and a second tape reader (or at a pinch the on-line TeleType's reader) for the GET files. The GET files could be in the form of a searchable paper tape loop to facilitate reading once per Section (like subroutines on TNMoC's Harwell WITCH). However, whilst the 900-series instructions for a second paper tape station are defined, I've never seen a machine with more than one attached high-speed reader. * A pre-processor to run on the 900-series machine, which reads the GET files (and their names) into store first as text, so that it can then can copy a BCPL file line-by-line, inserting the GET files into the BCPL file producing an expanded BCPL file. On an 8192-word machine, around 7000 words would be available as file store, each holding at least two bytes, which is more than enough, for example, to hold the four GET files of the BCPL front-end, back-end, and library simultaneously (8197 bytes). The downside of this method is the amount of paper tape used, remembering that the GET files can be included once per Section. (The four BCPL source files expand from 92,680 bytes in total to 164,736 bytes in total: a 78% increase). * It should be possible to read a GET file into upper store first as text, then implement the GET command in the BCPL front-end by reading this back, avoiding producing an expanded paper tape. However, unless more than 16384 words of store are available, this would reduce the tree space available within the front-end. * When the BCPL front-end finds a GET line it could print the file name on the TeleType, then wait for BCPL tape to be marked and removed and for the GET file tape to be loaded. At the end of the GET tape, it would wait for the BCPL tape to be reloaded at the mark. * As a slight variant to the previous bullet, the BCPL tape could be reloaded somewhat before the mark (or at its start), and the restarted front-end could scan up to (and ignore) the GET command. Other Obstacles. ---------------- There are several other aspects of the 900-series instruction set which can give problems when implementing high-level languages. The 900-series has no hardware overflow detection. Detecting overflow by software typically requires testing the signs of both operands and of the result. The 900-Series ALGOL 60 interpreter does do this. Ada requires overflow detection unless you say otherwise, although saying otherwise not guaranteed to suppress it (for example when it is done by hardware). CORAL 66 recognises that "overflow may occur", and CAP's 900-Series CORAL 66 does not detect integer overflow. In BCPL "overflow in arithmetic operations is ignored" and "the treatment of arithmetic overflow is undefined". The 900-series has no hardware compare instruction or condition codes. So generally subtract has to be used to implement the relational operators, which is OK for Equal and for Not Equal. But for the other four relational operators, the subtraction result in the A-register is not sufficient to determine the result when overflow occurs. This not the same problem as the lack of overflow detection. Ada 83 paragraph 4.5.2(12) recognised this problem and specifically states that "no exception is ever raised by a predefined relational operator", the right Boolean result is expected for all six relational operators, whether or not subtraction overflow occurs. Early 900-series ALGOL 60 gave wrong results, until I reported the error, then it was fixed by adding extra code to all six operators, not just to the faulty four. CAP 900 CORAL uses subtract with no overflow check, and BCPL is silent on the topic. The 900-series has an infamously messy hardware divide. This gives the nearest ODD number to the right answer, (rounding in the direction of the denominator if there is a tie), allegedly to minimise integration drift (when used for fixed-point arithmetic). ALGOL 60 requires integer divide to round towards zero, defined in terms of real divide as "sign(a/b) * entier(abs(a/b))". Only in 2014 did we discover that there was still a bug in the division subroutine used by 900-series ALGOL & 900-series FORTRAN. But this subroutine, when fixed, does give the "round towards zero". Ada also requires integer divide to round towards zero. CORAL 66 is silent on the topic. For integers (but not for fixed-point numbers), CAP 900 CORAL doubles the numerator (held double-length), divides by the denominator, and halves the result with a shift, which gets rid of the ODD problem, giving "round towards zero" when neither operand is negative (but often off by one otherwise). However this method can overflow when dividing by +1 or -1, despite the correct result usually being in range. BCPL explicitly states that "/" is truncating integer division, but that the result is implementation dependent unless both operands are positive. Consequently in BCPL, division by powers of two CAN be implemented by shifting. The 900-series division does not provide a remainder. The divide instruction calculates its result in the Q-register and its remainder in the A-register. But then for convenience the result is moved into the A-register, so inconveniently the remainder is lost. BCPL states that REM is the integer remainder (or modulo) operator, but that the result is implementation dependent unless both operands are positive. BCPL later defined MOD as a synonym of REM. Consequently in BCPL, REM or MOD by powers of two CAN be implemented by masking. There does not appear to be any requirement in BCPL that A = (A/B)*B + (A REM B) unless both operands are positive. On the 900-series, the remainder is usually calculated by dividing, multiplying back, and subtracting, thus (A REM B) = A - (A/B)*B. Ada explicitly requires that A = (A/B)*B + (A REM B) where integer divide rounds towards zero. It also defines a different MOD, which corresponds to a division rounding to minus infinity, (not provided by Ada but closer to what the 900-series does). Languages such as ALGOL and Ada are intended to give correct results in all cases, despite the obstacles listed above, even when the cost is to make the code larger & slower than it need be for typical cases. Programmers working in SIR should be aware of these obstacles, and will know that it is usually safe to write simpler code. For example: * they will know that a given relational operator will not overflow because one operand is zero, or because both are the same sign, as is usually the case with loop counters or array indices. * they will avoid division altogether wherever possible, changing division by a constant to multiplication by its reciprocal, and they will use double/divide/halve for integer division with positive operands (and where the result is known to be below 65536). The aim of CAP's 900 CORAL 66 appears to be to enable a programmer who knows the limitations of the machine to express their assembly code program in high-level language terms. (CORAL is aimed at embedded applications, typically interfaced by bespoke hardware to the likes of a steel mill, with no expectation that the program could be ported to a different computer without changes). The objectives which I set for my 900-Series BCPL implementation are: * My code should give correct results wherever CAP 900-series CORAL does. * My code should give correct results for all 16-bit operands, and thus give correct results whenever the York PDP-11 implementation would. Relational operators which work for 16-bit operands on a PDP-11 using its compare will work on an 18-bit 900 using subtract. Divisions by +1 or -1 which could overflow using double/divide/halve will not overflow with 16-bit operands. This approach enables me to encapsulate some of the optimisations that a SIR programmer would use when, for example, it is known that one operand of an operation is a constant. The fact that the 900-series only has "Jump if Zero" (Function 7) and "Jump if Negative" (Function 9) is again something which is not much of a problem when programming at the SIR level. "Jump if Positive" just involves reversing the operands (do-able given O-code), and the other three relations can be implemented by swapping the THEN & OR parts of a TEST (although not do-able given O-code, so occasionally one extra instruction is needed: such as a conditional jump around an unconditional jump). Work done in 2016. ------------------ Porting BCPL to a new machine involves writing a new complier back-end or code-generator (TRAN), to convert O-code into code for that machine. The machine-independent compiler front-end can then be generated from its O-code (TSYN & NTRN), to run on the new machine. O-code operations such as PLUS & MULT operate on the stack and are addressless, whereas most instructions on machines like the PDP-11 and the Elliott 900-series include an operand address. So BCPL code generators often contain a simulated stack, (specifically York's TRAN, targetted to the PDP-11, does). When an O-code operation is read which should stack an operand, the operand is recorded in the simulated stack and no code is generated, in the expectation that the operation which consumes the operand later can address it directly. Soon after meeting Martin Richards (on 19th November 2016), I'd written a translator from O-code to SIR, the symbolic assembly language of the Elliott 900-series (by 5th December 2016). (Purely for my convenience this was written in Ada 83 to run under DOS on a 16-bit or 32-bit Windows PC). This was a very simple program which did NOT include a simulated stack. It only contained two optimisations: * If an operation wrote a result to the top of the stack, and the following operation consumed its second or only operand from the top of stack, the resulting write-then-read instruction pair was deleted. (Even if the operation was commutative, no attempt was made to check the first operand). * If one of the six relational operators was followed immediately by Jump-if-False or Jump-if-True, the appropriate code was generated, without specifically calculating & testing a Boolean value. At this point the only serious way to test this code generator was to translate the O-code of the York implementation. I was able to translate the O-code of both TSYN & NTRN to SIR. They just fitted, separately, with the string constants and the stack in the 2nd 8K, and everything else in the 1st 8K. The translation of TRAN to SIR did not quite fit, but since it generates PDP-11 code, I didn’t actually need it. Unfortunately TSYN & NTRN are not separate passes. TSYN builds a tree for each Section of BCPL in store, then NTRN converts each tree in store to O-code. Processing a BCPL file typically involves multiple switches between TSYN & NTRN. They are compiled separately in the York system, but they are linked together, TSYN calls NTRN as a subroutine, and they share the location & contents of globals and the stack. So I was not able to create a working front-end. However, TSYN has an option to suppress O-code generation and print an ASCII representation of each tree instead. So by assembling just the SIR of TSYN, I was able to generate the three tree files from the BCPL of TSYN & NTRN & TRAN, which became a useful reference for checking my subsequent implementation. Rather than attempting to hand-optimise the SIR of TSYN & NTRN, or attempting to write the front-end from scratch in SIR or Ada, at this point it was more useful to put the missing simulated stack into the back-end, TRAN, which would benefit all programs. And rather than attempting to put the simulated stack into the simple O-code to SIR translator which I'd just written, I thought I would learn more by starting from York's O-code to PDP-11 translator. Given that I could not yet compile any BCPL, my first step was to translate York's TRAN from BCPL into Ada 83 (which I'd done by 28th December 2016), then I paused for a while. The 2016 back-end above worked largely by text manipulation; each Section was a separate SIR block to avoid label name clashes. The 2018 back-end below, based on York's implementation, evaluates & adjusts label numbers to avoid label name clashes. Work done in 2018. ------------------ I returned to the project in 2018, and between February & June I edited TRAN many times, eventually generating SIR with various optimisations, rather than generating PDP-11 code. I also developed the SIR assembler routines, BCPL_ENV.SIR, to replace the PDP-11 assembler system & library routines. "In BCPL it is defined that consecutive words of store have numerically consecutive addresses", so a minor advantage that my word-addressed Elliott implementation has over the byte-addressed PDP-11 implementation is that there is no need to use a register for doubling or halving addresses, for example in procedure LOADLV. There were two confusing points regarding York's simulated stack. * I was not sure why the simulated stack starts with two dummy entries. There are two pointers, ARG1 & ARG2, which point to the top two items on the stack, and the dummy entries appear to give them something to point to when the stack is empty. This matters, for example, in York's procedure SAME, which compares the addresses of the top two stack items (when optimising X := X+1 for example) without checking that they exist. I've removed the dummy entries, and instead I check for the PLUS operator before I compare the addresses. If the O-code has failed to stack two items before the PLUS, Ada's array bound checking will spot this, rather than just generating code to increment a dummy. * Stack discontinuity, the H3 field, and the "N >= SSP+6" test. The H1=Mode & H2=Arg fields are straightforward, representing the code needed to access an item in the simulated stack. The H3 field is the stack offset of where it belongs in the real stack. Thus (M_NUMBER, 5, 8) can be flushed from the simulated stack by "4 +5, /5 8", and (M_LOCAL, 5, 8) by "/4 5, /5 8". This 3-field mechanism permits the simulated stack to represent discontinuous portions of the real stack. If the H3 field is omitted, only a continuous portion can be represented, where the real offset of ARG1 is SSP-1, and the real offset of TEMPV'FIRST is SSP-1-(ARG1-TEMPV'FIRST)/TEMPSIZE, which can be held in BASE. Discontinuities arise when vectors are allocated & deallocated: it would be unwise to simulate a 6000-word tree vector, and there is little point in simulating a 120-word string vector, since if anything only the first two words are addressed. Perhaps the STACK routine tests "N >= SSP+6" to allow for the optimisation of the parameters to front-end LIST routines up to LIST6. The York implementation is inconsistent in that it has the H3 field but does not use it to represent discontinuities. So I've omitted H3, which was only used in VACATE and in STORE, (where subsequently VACATE was assumed into FREE_A) and I now effectively calculate H3 as (H2-H2'FIRST)/TEMPSIZE+BASE At best, the simulated stack can only work on one stack frame, it cannot track run-time calls at compile-time. But in fact, in York's implementation and in mine, it only tracks part-frames. Now I translated the O-code of both TSYN & NTRN to SIR again. This time, (each including a global vector and its own copy of the library, with the string constants and the stack in the 2nd 8K, and everything else in the 1st 8K), each used about 6K of the first 8K. And I estimated that TSYN & NTRN translated together with a single global vector and single library, and still containing the tree printing routine, would only overrun the first 8K by 1068 words. This was a breakthrough: getting the back-end to the point where the SIR generated from the O-code of the front-end was good enough to be hand-optimised to fit, so that then the BCPL sources of the front-end could be used, and altered. Producing a useful front-end happened in two stages. Firstly, I cut the tree-printing routine out of the O-code of TSYN, translated the resulting O-code of TSYN & NTRN to SIR again, and performed extensive hand-optimisations to the SIR to make it fit. However, it would be tedious to repeat these hand-optimisations whenever I wanted to change the front-end BCPL code. But now that I had the ability to run the front-end, the second stage was to implement some of the optimisations in the BCPL code. My TSYN is as the York version, but: * With the testing of options removed, thus ignoring the "print tree" option (built into BCPL_ENV.SIR to test TSYN unaltered), and thus forcing the O-code defaults, and with the remaining uses of NUMARG & GETARG, for reading input file names, removed, * With PERFORMGET removed, any GET will now give error 93, "NESTED GET", although perhaps it should act as a HaltCode, * With PRINTNAMES & PRINTTREE removed, and with the whole of PLIST (which actually prints the tree) and IN removed, this is a substantial saving, * With LIST1 removed, it is never used. My NTRN is as the York version, but: * In EVALCONST, the code for evaluating the operands of the binary operators is shared, then another case statement is used to call the operator, this is a substantial saving in 1st-8K code, although costing 32 words of any-8K data, * In EVALCONST, the code for evaluating the operand of NEG & NOT and the first operand of COND is shared, then another case statement is used to call the operator, this is a slight saving in 1st-8K code, although costing 8 words of any-8K data. I also wrote a small program, OPTFRONT, to apply my other two significant hand-optimisations (which cannot be expressed at the BCPL level) to the front-end O-code. It is written so as to survive certain changes to the BCPL source &/or to TRAN, specifically it does not assume any J/K/L label names. * It is probably not possible in BCPL to statically initialise a table with a list of string addresses. The code for DECLSYSWORDS in Martin Richards' book does not do this, and the York implementation uses distinct code in for each keyword. This program changes DECLSYSWORDS into a loop using a table, this is a substantial saving in 1st-8K code. * This program modifies LIST2 to LIST6, to share their code in an obvious way, saving 71 words of 1st-8K code. Finally, I made some changes to the front-end and library to accept some modern BCPL syntax, as described below. The remaining notes apply to the 2018 implementation. Further changes made during the remainder of 2018 include: * In the back-end, I completely revised my implementation of Relational Operators, unifying the code used when followed by JT or JF with the code used otherwise. * In TRANSFOR in NTRN, I altered the implementation of FOR loops, to better exploit the asymmetry of available 900-series conditional jump instructions. York Oddities. -------------- As stated above, the "N >= SSP+6" test in STACK in York's TRAN seemed odd, but I now understand what it’s for, and that the value of 6 is somewhat arbitrary. CGAPPLY in York's TRAN declares the frame size as "2*N+6" bytes, which would be N+3 words, which seemed odd. However this value is autodecremented by the ENTER system code as it sets up the 3-word frame header. In my CGAPPLY I've simply used a frame size of +N. In SAME in York's TRAN, "TEST (B=1 & OP=S.PLUS..." should test PENDINGOP rather than OP, consequently some optimisations using INC & DEC get missed. I've corrected this in my implementation of STOREIN, which generates the appropriate number of count instructions (Function 10) for X:=X+N and X:=N+X, for 0<=N<=2. A latent bug was found in the York implementation (shortly before I left) related to sign-bit regeneration when halving byte addresses. This is unlikely to affect my word-based Elliott implementation. NTRN necessarily evaluates constant expressions in MANIFEST & VEC declarations at compile-time. Possibly the only significant functional change that I made during my RSX-to-Unix port at York was also to fold ANY expression or subexpression with constant operands. But as I noted at the time, LOGAND & LOGOR had been overlooked. My TRAN assumes that NTRN always folds constants, so I've corrected LOGAND & LOGOR in my 900 implementation of NTRN. CGBLOCK in York's TRAN carelessly LOADTs TRUE as octal 177777, which is the same as -1 on the PDP-11, but not on the Elliott. In my implementation, I use LOADT (-1) which works for both. The York implementation provides Infix function calls. "The construct A !PLUS B is equivalent to the function call PLUS (A,B)." It looks like a York special and is not used in any York code. For the reason explained below, I have removed this facility. York's Strings start with a length word, rather than the length byte which BCPL expects. I have retained this. York's Strings end with one or two null bytes for word alignment. In my implementation there are zero or one null bytes. BCPL Changes. ------------- In my implementation I have added three extra keywords, used in modern BCPL, which were not in York's implementation: ELSE as a synonym for OR, which is as in Algol & Ada, MOD as a synonym for REM, which are distinct in Ada, XOR as a synonym for NEQV, which is useful and clearer. The York implementation inherited some character set oddities from the earlier Elliott/ICL 4130 implementation on which it was based: "%" and "|" are both available for vector indexing, although only "%" is used in the York source files, "!" is used for infix calls where A !PLUS B means PLUS (A,B), although this is not used in the York source files, there is no string indexing. In modern BCPL: "!" is used for vector indexing, "%" is used for string indexing, "|" is an alternative to LOGOR and "\/", there is no syntax for infix calls. In my implementation: "%" and "!" are both available for vector indexing, to accept both York's sources and modern sources. "|" is an alternative to LOGOR and "\/", there is no syntax for infix calls. there is no string indexing, but the example ENIGMA.BCP shows a work around for this. In all three cases: "&" is an alternative to LOGAND and "/\". In my implementation: the symbols "NOT" and "\" and tilde "~" are equivalent, the symbols "NE" and "\=" and "~=" are equivalent (but not Ada's "/="). York's implementation did not support "~=". The York implementation and mine still allow dots in identifiers, my implementation also now permits the underscores of modern BCPL. The York implementation only permitted upper-case letters in string layout escapes, *N *S *T *P *E *C *L, and in WRITEF format insertion escapes, %C %N %O %S %D. My implementation also allows lower-case in these contexts. The escape to output a decimal number with # digits can now be either York's %D# (or %d#) or the modern %i# (or %I#). Modern BCPL defines "ug" for the first "User Global", so I have added the definition "ug=60" to LIBHDR. Modern BCPL treats global and static declarations with no number as one larger than the previous one, this is not in York's implementation or in mine. Modern BCPL's uses "?" for "Undefined", like SIR's ">1", this is not in York's implementation or in mine. Store Allocation. ----------------- As explained in "What to Target?" above, my implementation puts everything into the first 8192 words when it is clear that it will fit. For larger programs, the stack, and data structures such as strings, can be placed into higher store modules when these are available, accessed by B-modification, provided that the code and the simple data (global vector, static variables, literal constants, and pointers to the stack & strings and to procedures) still fit into the first 8192 words. My implementation implements this too. The back-end assumes that any portion of the library written in BCPL is already present in the O-code, either by splicing that library BCPL onto the main program BCPL code, or by precompiling the library and splicing its O-code onto the O-code of the main program. As my back-end reads the O-code, it writes to three temporary files: "BCPLCODE.TMP" holds code to be placed in the 1st 8K, "BCPLDATA.TMP" holds data to be placed in the 1st 8K, "BCPLHIGH.TMP" holds data not constrained to the 1st 8K. It counts the number of words written to each of these files, and makes an overestimate of the number of literals used. It then counts the words and overestimates the literals in the SIR assembler system & library routines in BCPL_ENV.SIR. (Each BCPL TABLE appears in the O-code as a list of static variables, and so they are constrained to the 1st 8K code). If it is then clear that everything, ignoring the stack, will fit into the first 8K, store is allocated thus: Registers and Entry code from 0 to 11 Global Vector skip of 60 or more words 1st 8K Code from "BCPLCODE.TMP" 1st 8K Data from "BCPLDATA.TMP" Higher Data from "BCPLHIGH.TMP" System Code from "BCPL_ENV.SIR" Literals as placed by 2-PASS SIR. The Stack follows the Literals. Initially the stack pointer points to the start of the literals, but the tape autotriggers and advances the pointer to the second zero, the first being a literal and the second being free store, cleared by an 8K clear-store on the front of the tape. (The code to do this, and the ACD binary loader, occupy locations 8157..8179, and may be overwritten by the stack). If it is not clear that everything will fit, and provided there is at least SOME Higher Data, store is allocated thus: Registers and Entry code from 0 to 11 Global Vector skip of 60 or more words 1st 8K Code from "BCPLCODE.TMP" 1st 8K Data from "BCPLDATA.TMP" System Code from "BCPL_ENV.SIR" Literals as placed by 2-PASS SIR. Starting from location 8192: Higher Data from "BCPLHIGH.TMP" The Stack follows the Higher Data. I make no attempt to fit SOME of the higher data into the first 8K, (especially as I only have an estimate of the number of literals). There is no check that the items now placed into the first 8K actually fit with the Higher Data removed, or that the Higher Data actually fits into the second 8K. However, 2-PASS SIR will check this later. MASIR has a notation for a symbolic address anywhere in store, but in 2-PASS SIR, symbolic addresses are only recorded modulo 8192. Pointers have to be written in instruction format, either as "0 LABEL" to the first 8K or as "1 LABEL" to the second 8K. Pointers to the Higher Data are written to the temporary files as "? LABEL". The back-end then concatenates the three temporary files and BCPL_ENV.SIR into a complete SIR program with an option and a "%", converting "? LABEL" into "0 LABEL" or "1 LABEL" as appropriate. Regardless of which store allocation method is used, the stack can run on, not just through the remainder of the 8K module in which it starts, but also though any further 8K store modules. Specifically to cover the case where everything (ignoring the stack) fits into 8K but more than 8K is actually available, the program is run on level 4, so as to turn initial instructions off. Interrupts should be disabled: no handlers are provided. The Global Vector. ------------------ As my back-end reads the O-code, it also records the highest global vector number encountered, either by declaration or by use. This does not include global vector numbers in "BCPL_ENV.SIR", but, as in the York implementation, these are all below 60, so the highest number is set to 59 before reading the O-code. After reading the O-code the back-end generates a suitable skip (for example "GV >60") in the SIR output. In the York implementation, at the end of processing an O-code file, any global vector items declared in that file are set appropriately, and entries which are not declared, but are used (or are the START entry) are set to -1, which I presume gives a run-time hardware address trap. Global vector items neither declared nor used are not altered. I think this is to permit the linking of multiple files sharing the global vector, but it will only work if files are linked in a certain order. Since the global vector entries are buffered to be output at the end of the O-code, it would be possible to report duplicate declarations within one file. This was not done, although there is a comment suggesting it. My implementation assumes that the program is in a single O-code file: I do not support linking at the SIR level. The global vector is set to all zeros by the 8K clear-store on the start of the tape. Location 0 (being the top-level SCR) is not a valid program address. The run-time system detects procedure calls, and GOTOs & SWITCHONs, to location 0, and enters a dynamic stop (causing the simulator to quit). Global vector entries are output, as SIR patches into the global vector skip, when they appear in the O-code, unbuffered, so I do not report duplicates. Regardless of the amount of store actually available, the clear-store on the start of the tape only clears the first 8K, which is adequate for clearing both the global vector and the first word after the literals. Presetting. ----------- "Presetting" is a term which has always worried me. To my mind there are variables, which are undefined when a program is loaded, and should be set by the program before they are used, and there are constants, often built into the program image, which should not be altered by the program. The difference is important, because it is not unusual to load a program into an Elliott 900 from paper tape, and run it, and then expect to be able to run it again without reloading it. This will not work if the program relies on an initial value which is set by loading the tape but is altered by the program. I've seen code in the Harrier flight program, which runs several times per second, where a variable was set to its initial value immediately AFTER being used in one cycle, ready for the next cycle, relying for the first cycle on the variable being set when the program was loaded. But the program was only reloaded into core memory (from ground equipment) when it was updated, not before each flight. The Harrier could be powered down at any point within the program cycle, which left that variable undefined when it was used in the first cycle of the next flight. In my own SIR programs, I am careful to use a skip ">1" for a variable, using "+0" only to denote a constant. Sadly much of MECSL's standard SIR library uses "+0" for both. Modern BCPL does provide "?" for "when a variable does not need an initial value". (This issue is not to be confused with the coding style which mandates that all variables should be given some value at the point of declaration. Whilst this does make tests repeatable, it can hide errors, and can lead to a less clear programming style.) In BCPL, dynamic variables are given their declared value (unless flagged "?") necessarily dynamically, and "there is no way of initialising the contents of a vector when it is declared", so these are OK. Static variables are initialised to their declared values but it is not clear whether this occurs at load-time or at the start of run-time. Where global variables are associated with procedures, this "takes place at load time", at least in modern BCPL, and this is presumably also the case for a "global:value" statement, given that "global:=value" is a run-time assignment. It is not clear when global vector entries which are unused and which have no association are supposed to be initialised, if anywhere. In my implementation, as explained above, all globals are zeroed by an 8K clear-store at the start of the tape, to enable calls to unset globals to be trapped. Then the static & global variables are all set to the values given, at load-time. As I've suggested, this is not ideal, but is OK if the program is only being run in simulator and is always loaded before each run. (I did it this way to assist fitting the front-end code into 8K, but it would be possible to set the initial values dynamically, and since all static & global variables are in 1st 8K and not on the stack, the code to do this could be above 8K and use the B-register to modify back to the first 8K.) Meanwhile, the correct programming style for re-runability is that the programmer should give static variables and global variables a value by assignment, not by declaration, before they are read. There is, however, a problem with globals associated with procedures, in that BCPL states that these are set at load-time, yet it is also clear that the language permits their alteration. This is a BCPL problem, not a limitation of my present implementation. For example, it may be required to temporarily point the standard ERROR routine at an exception handler, then restore its initial value which the user cannot know in advance, leading to the risk that the program will not restart if aborted (as for the Harrier). The work around, given that my implementation does clear the globals, is to reserve one global as a first time flag, FTF, and to write at the start of the program: TEST FTF=0 {SAVERROR:=ERROR; FTF:=-1} OR ERROR:=SAVERROR; (In the York front-end and consequently in mine, the REPORT:151 pointer is saved/altered/restored by ISCONST, this having been set to TRANSREPORT by COMPILEAE in NTRN, it is also set to CAEREPORT by FORMTREE in TSYN, both of these settings also save & restore what may be an undefined and unused initial value). Finally, Strings are preset at load-time, and BCPL allows the programmer to alter them, but if this is done then the program cannot be rerun, (and if I did set initial values dynamically using the mechanism suggested above, it would not apply to Strings, due to their size and because they might not be in the first 8K). My ENIGMA work around is OK: I don't use the initial string contents. Interrupt Safety. ----------------- Various 900-series facts cards give the instructions for servicing an interrupt as: LOOP 0 LLQ Reset lower-level Qreg. 14 1 Shift left 1 place. 4 LLA Reset lower-level Areg. 15 7168 Terminate. ENTRY 5 LLA Store lower-level Areg. 3 LLQ Store lower-level Qreg. Required higher-level program. 8 LOOP Jump to reset for lower-level. This code does not preserve the least-significant bit of the Q-register. Function 3 right-shifts the Q-register as it stores it, and the "14 1" left-shifts it when it is reloaded. Potentially, the least-significant bit of the Q-register could be changed from a 1 to a 0, anywhere in an interruptible program. Specifically this could happen if another interrupt were to occur in the above code between the "0 LLQ" and the "14 1". To prevent this, the 900 hardware defers any interrupt which occurs during a Function 0 instruction. This is only a partial solution. Both Functions 0 and 2 have the defined effect of loading the operand into the Q-register, Function 0 also alters the B-register but not the A-register, Function 2 also alters the A-register but not the B-register, so both are sometimes useful, and Function 2 is usually faster. It is OK to CLEAR the Q-register with either a "0 +0" or "2 +0". But it is only safe to USE this bit if it is set by Function 0, not by Function 2 or by a right-shift, and that bit must be consumed immediately by the following instruction. In Airborne Computing Division, we chose to use interrupt code which saved the whole Q-register, despite it being slower and despite the fact that we were working on real-time programs: LOOP 2 LLQ Reset lower level Qreg. 4 LLA Reset lower level Areg. 15 7168 Terminate. ENTRY 5 LLA Store lower level Areg. 14 18 Shift left 18 places. 5 LLQ Store lower level Qreg. Required higher-level program. 8 LOOP Jump to reset for lower level. This gave us freedom about how to load the Q-register, and removed the need to use it immediately: we could for example use A & Q to assemble all 6 characters of a SIR identifier. In my BCPL implementation, the generated code runs on level 4, but no interrupt code is provided because no interrupts are expected. Nevertheless, where my Ada code of the back-end generates SIR code to clear Q, I've commented that this would be "interrupt safe" even using the facts card interrupt code. The only place where it is currently not interrupt safe is in shift instructions, where, to meet the BCPL requirement that shifts do not regenerate the sign-bit, I load the value to be shifted, "LHS", into the Q-register rather than into the A-register. I use "2 LHS" (possibly B-modified) to (possibly corrupt then) load Q, first, because it alters A, followed by a "4 +0" to clear A without corrupting Q, then the planted shift. To be interrupt safe, the "4 +0" would have to come first, to meet the immediate use rule, then "0 LHS" (possibly B-modified) to (possibly corrupt then) load Q without altering A, then the planted shift. But as explained below, a "0 P" would also usually be needed to reset the B-register to the stack frame pointer. Stack Addressing. ----------------- Implementing a stack in native code does NOT require incrementing or decrementing a stack pointer whenever an operand is pushed or popped, as one might in an interpreter. (On the 900-series, such a stack might be inverted, to make offsets relative to the stack pointer non-negative). In BCPL (by design) operands either have static addresses or they can be addressed as non-negative offsets from the frame base, and there is no need to track the top-of-stack address at run-time. This exactly matches the capability of the 900-series, using unmodified instructions or B-modified instructions respectively, with BCPL's frame pointer "P" held in the B-register. (Accessing non-local non-global data, as required in Ada for enclosing scopes, would be messier). The main cost is that a good many instructions have to be B-modified which otherwise would not be, slowing these instructions down (typically by 25% on a 920B or 903), but importantly occupying no more space. Also B-modification corrupts the Q-register. There is some cost associated with the loss of freedom as to whether to clear the Q-register using Function 0 or Function 2. Where a "0 +0" might have been used before a left-shift to clear Q (without altering A), a Function 6 Mask instruction might have to be used after the shift, potentially costing an extra literal. There is a cost when the B-register is required for another purpose, such as to index into a vector. The usual code is used to load the index into B (and may itself be B-modified), but an extra instruction is generally required to reload the frame pointer into B after the indexed operation. (But see "B-register Tracking" below). Stack Frames. ------------- My implementation retains the stack frame layout of the York implementation, in that the frame starts with a three word header. Word 0 holds the link, the return address to the point of call. Word 1 holds the frame pointer "old P" of the enclosing frame. Word 2 in the York implementation holds a pointer to the textual name of the current procedure, which I have not implemented. Words 3 onwards hold the parameters to the current procedure, if any, followed by its local variables, if any. The York implementation only stores the procedure names as a diagnostic aid, and I've omitted them purely to save space. However, I do output the procedure name as a SIR title which will also appear in the SIR label list, enabling the names to be recovered from a stack dump. Remember that the bodies of BCPL procedures can be nested, so some part of an outer procedure might follow the title of an inner procedure, (for example in TSYN: most of DECLSYSWORDS follows title D). Now that I can edit the BCPL front-end, I could reduce the frame head to two words, although the York sources do not have a manifest "savespacesize" which some BCPL implementations have to make this easy. Modern BCPL retains word 2 for "the function entry address". Entry & Exit Code. ------------------ Some system code is required to adjust the frame pointer when entering and leaving procedures (functions & routines). This is only a one-off shared overhead in space, but does cost time, 14 instructions for entry, 9 for routine exit & 10 for function exit. In the generated code, the normal SIR subroutine call 11 LINK 8 ENTRY (Subroutine address) is replaced by 11 LINK (Shared link) 8 APPLY (System routine) +N with the address of the procedure in the A-register, which may have been calculated dynamically in BCPL, although more often it is read from a static word of the form 0 ENTRY (Subroutine address) and where the parameter +N gives the amount by which the frame pointer has to be increased. In the generated code, the normal SIR subroutine exit 0 LINK /8 1 for routine returns is reduced to 8 RTRN (System routine) for function returns is reduced to 8 FNRN (System routine) with the result in the A-register. The system code for entering procedures & for leaving procedures and for implementing GOTOs & SWITCHONs are all written so that: * they can be called from the generated code regardless of whether or not the B-register holds the frame pointer, * when they return control to the generated code, the B-register DOES hold the frame pointer, (the returns being achieved by planted jump instructions, which works given that all code is in the same 8K, rather than by the usual B-modified jumps). This not only avoids the need to reset the B-register with a "0 P" at the start of every procedure and following every call, it provides an opportunity for further optimisation, (see "B-register Tracking" below). And in BCPL the target of a GOTO must be in the current procedure body so there is no need for a "0 P" after every label. Although the system calls follow the rules given above, library routines (whether written in SIR or in BCPL) can assume that the B-register is set to P on entry, (as I do in LEVEL & LONGJUMP and in PACKSTRING & UNPACKSTRING, all in BCPL_ENV.SIR). Jump Optimisation. ------------------ As noted in the previous section, procedures are called via the "APPLY" system routine, with the procedure address in the A-register, which may have been calculated dynamically in BCPL. So, for example, a call to the READC=RDCH routine, without this optimisation, would take four words of store: 4 GV+27 11 LINK 8 APPLY +N The Jump Optimisation changes this into three words per call: 11 LINK 8 J4 +N (GV+27) and a further two words shared by all calls to READC=RDCH: J4 4 GV+27 8 APPLY If a procedure is called both as a routine and as a function then they share the same two-word entry in the jump table. The name of the location holding the address of the procedure is retained (within the comment) for clarity. This saves store for any procedure called three or more times, and breaks even for a procedure only called twice. It does cost one word for a procedure only called once, but such procedures would be better in-lined if store is tight. (It is not possible for the back-end to tell how often a procedure is called when code is being generated for its first call). The optimisation is only performed if the address is known to be either in a labelled location or in the global vector. (The transformation would work for addresses on the stack, but the varying offsets would be unlikely to give a saving). A similar optimisation is performed for GOTO commands. One might not expect to see many GOTOs in a modern program, but for example, the BCPL front-end procedure NEXTSYMB uses RETURN at the end of a token which has performed a look-ahead, and a GOTO L at the end of other tokens, where L reads a character before RETURNing. Without this optimisation, the following would occur 25 times: 4 L80 8 GOTO The Jump Optimisation changes this into one word per GOTO: 8 J23 and a further two words shared by all GOTOs to that label: J23 4 L80 8 GOTO (It's a pity that the optimisation can't spot that all 25 GOTOs in this example are preceded by "5 GV+115"). Luckily these 25 GOTOs do all jump to the same label, rather than to 25 different labels on the same location, in contrast to what can happen with SWITCHONs, (see the "Switches" section below). I've not (yet) attempted to convert the above GOTOs into direct jumps. However "8 L80" would be wrong: L80 holds the address of the location to be jumped to, namely it is "L80 0 L79". The O-code for the address (DATALAB L4, ITEML L3) and for the GOTO (LL L4, GOTO) are separate, and to simply assume that the address and destination labels differ by one (without understanding the front-end code) would probably be unwise. And it might be necessary to insert a "0 P" before some direct jumps (see "B-Register Tracking" below). The jump table is copied into BCPLCODE.TMP which is later merged into the *.SIR file, rather than being copied into the *.SIR file at the appropriate point, so that its store usage is counted. The simulated stack and the jump table are both implemented using arrays within the back-end whose size could limit the complexity of the BCPL program. Whereas the simulated stack only has to cater for at most one BCPL procedure at a time, the jump table has to cover the whole program. Function Returns, and Duplication. ---------------------------------- At present I make no attempt to pass parameters in registers, given that the A-register already holds the procedure address. At a pinch, I could pass a parameter or the entry address in the B-register. I do, however, pass the switch value in A and the address of the switch table in B into the SWITCHON system code. At present I return the result of a function in the A-register, which is correctly written to the stack if not needed immediately, as in the York implementation (which always returns values in R0). An alternative would be to return the result on the stack, which might require it to be read from the stack if needed immediately. In fact, if FNRT were changed to return the result on the stack (by uncommenting one instruction) it would nevertheless also be in the A-register, so neither a stack write nor a stack read would be needed. But the simulated stack has no mode to express that an item is in two places. (The result of a function also happens to be in the LINK location, which I've exploited in the optimisation of the LEVEL procedures by OPTFRONT). The opposite problem occurs when I need to access an operand twice. For example, in EQV & NEQV and in REM, I access both operands twice. The code for these operations starts with a FREE_A, so that if any operand is in the A-register, it is written to store (on the stack or otherwise) to make it addressable. The first operand is read into the A-register with an explicit Function 4, regardless of whether it has just been written, not by using LOAD_A which would change its mode. This will sometimes generate a "5 X, 4 X" sequence, which the peephole optimiser will reduce to just "5 X". (See "A-Register Tracking & Peephole Optimisation" below). I remain slightly confused about the O-code used to load the result of a function. The only way out of a function is via O-code FNRN, which loads the result into A then calls the FNRN system code. But there is also O-code RES: why does this also load the result into A, before jumping to a label which can only lead to an O-code FNRN? This occurs in the York implementation and others, so I've kept it, calling LOAD_A a second time with the same operand has no effect. B-Register Tracking. -------------------- As described above, the SIR which I generate assumes that the B-register normally contains P, the stack frame pointer. It is set at the start of the program by a "0 P" in BCPL_ENV.SIR. The system code for entering procedures & for leaving procedures and for implementing GOTOs & SWITCHONs are all written so that, when they return control to the generated code, the B-register again does hold the frame pointer, which avoids the need to reset the B-register with a "0 P" at the start of every procedure and following every call. The system code for entering procedures & for leaving procedures and for implementing GOTOs & SWITCHONs are also all written so that they can be called from the generated code regardless of whether or not the B-register holds the frame pointer. It follows that, if the B-register is required for another purpose, such as to index into a vector, it is not always necessary to reload the frame pointer into B with a "0 P" after the indexed operation, if no stack-related operations occur before the next system call, or if the B-register is again required for another purpose. The B-register is tracked by a back-end flag, BREG, initially set OK (=TRUE). If an instruction is coded in LOCAL mode and BREG is not OK, it is set OK and a "0 P" is generated, in SIR_WORD, before the requested B-modified instruction accesses the stack. If the B-register is required for another purpose, such as to index into a vector (RV & STIND), the usual code is used to load the index into B. At this point BREG could be either value. BREG is then set OK before generating the indexed instruction, to avoid an incorrect "0 P" (if BREG was not OK), and BREG is set not OK immediately after it. If a system call is encountered, BREG is set OK. They are recognised by a Function 8 unconditional jump to a textual system routine name, or by a call via GET_JUMP_NUMBER to APPLY (FNAP or RTAP) or to GOTO. This paragraph is applied before the following paragraph. The tracking is performed on the code as it is translated, not when it is run, so a global decision has to be made about the state of the B-register when jumps occur, namely that it should hold P. So if a jump is encountered, conditional or unconditional, and BREG is not OK, it is set OK and a "0 P" is generated, before the jump. (The code for EQV and NEQV=XOR, when not followed by JT or JF, includes "7 ;+2, 4 -1", which could generate an unnecessary "0 P"). An interesting situation occurs at labels. If arriving at the label from the instruction above, BREG will be set appropriately. If arriving at the label from a jump, the B-register will hold P. So there is no need to change BREG or to generate a "0 P", yet. But if a stack operation subsequently does need a "0 P", it would have been better to generate it before the label. Rather than generate "0 P" in anticipation before any label where BREG is not OK (potentially wasting store), I've accepted that sometimes an extra "0 P" might be obeyed following a jump (potentially wasting time). (The system code for GOTO does not need to use the B-register, so it could simply leave the B-register unchanged. However, it seemed more useful to make it follow the same rules as the other system calls, in the ways described above. Including the "0 P" cost no store because GOTOs & SWITCHONs share exit code, and SWITCHONs do use the B-register and reset it with "0 P" ). Given that calls to the back-end procedure SIR_WORD can generate a "0 P" as well as the requested SIR word, using instructions with SCR-relative address mode SCRREL needs care. Specifically there was a problem with "5 ;+3" used in Shifts, which might need to be "5 ;+4". Rather than labelling the planted instruction, which would make the code less clear, I've added code to flush out the "0 P" sooner, if it is required (by copying a line from SIR_WORD, exactly, even though calling SIR rather than SIR_WORD would be OK here, but adding a "flush 0 P only" mode to SIR_WORD might be neater). A-Register Tracking & Peephole Optimisation. -------------------------------------------- Operations where the only operand or both operands are constants are performed ("folded") by the front-end. Since the front-end is a 900-series BCPL program compiled by itself and my back-end, these calculations will give the same results as if performed at run-time (although more quickly), including division (regardless of the signs of the operands). But the front-end preserves the binding order of operations. So in the statement "LET C = 'a'<=CH<='z' -> CH-'a'+'A', CH", the generated code includes "4 GV+76, 1 -97, 1 +65, /5 5", even though ""4 GV+76, 1 -32, /5 5" would be better, and would be generated by "LET C = 'a'<=CH<='z' -> CH+('A'-'a'), CH". (It is generally thought legitimate for compilers to implement optimisations like this for integers but not for real numbers). The Elliott 900-series do not have lots of registers to track. The A-register is tracked by a back-end variable, AREG, initially set FREE (= an off-stack pointer), which records which simulated stack entry is currently in A. This is used to avoid reading the first or only operand of an operator, or either operand of a commutative operator, which has been left in the A-register by the previous operation. The simulated stack mechanism rearranges, for example, "P := A * (C+D)", from "4 A, /5 T, 4 C, 1 D, /12 T, 14 17". to "4 C, 1 D, 12 A, 14 17", which reduces the code size, although the stack frame might still contain a slot for T. Sometimes when one operand of a binary operator is a constant, the result can be determined from the other operand without using the A-register. For example, adding +0 or multiplying by +1 return the other operand, and multiplying by +0 or LOGAND by +0 return zero. Such operations can arise when the constant is MANIFEST, where the special values is not immediately obvious (and is best not exploited) by the programmer. This is achieved by copying entries within the simulated stack (using LOSE1 with parameters, rather than the default parameters). Thus "P := (A+B) * (C+D)" typically generates (when A B C D are static variables) "4 A, 1 B, /5 T, 4 C, 1 D, /12 T, 14 17". But if B happens to be manifestly zero, not only is the "1 B" omitted, the sequence again becomes "4 C, 1 D, 12 A, 14 17". The simulated stack works well within statements, but not between statements, which is where the "peephole" optimiser comes in. This is built into the routine which outputs the SIR code. It records the most recent Read instruction (Function 4) and it records the most recent Write instruction (Function 5). Both are set "Unknown" initially, or by a label, or by any instruction other than a Read or Write or Jump, and the Write record is set "Unknown" by a Read instruction. A Read from, or Write to, either of the recorded addresses can then be suppressed. For example, if one statement ends with "5 X" and the next statement starts with "4 X", the "4 X" is omitted. Initialisations such as "4 +0, 5 X, 4 +0, 5 Y, 4 +0, 5 Z" become "4 +0, 5 X, 5 Y, 5 Z". (The statements "ARG2:=ARG1; ARG1:=ARG1+3", which appear in the PDP11 back-end, would also get optimised). Logically, the most recent two or more locations written to could be recorded, but this might not save much more code. A Write to the B-register does have set the Read record "Unknown", so that "/4 0, 5 BREG, /4 0" does not get optimised. Conditional jumps do not need to set either record "Unknown", whilst the next label following an unconditional jump (and possible dead code) will set both records "Unknown". Also built into the routine which outputs the SIR code, any remaining no-ops "1 +0" and "14 0" are eliminated. Most will have been removed by better optimisations described above, but for example the back-end generates "1 +0" in LOADLV if the offset is zero and it generates "14 0" in MULT if either operand is -1. Finally built into the routine which outputs the SIR code, is a dead-code eliminator. Any code following an unconditional jump (Function 8) unless that jump follows "7 ;+2 or "9 ;+2" or "11 LINK", up to the next label, is unreachable and so is deleted. The BCPL front-end does generate a small amount of such code. I have not attempted to merge adjacent Add literal instructions (such as the "1 -97, 1 +65" above), or Negate-&-Add literal instructions (such as pairs of "2 +0"), or combinations thereof. I have not attempted to optimise multiple tests of the form "4 CHAR, 1 -43, 7 PLUS, 4 CHAR, 1 -45, 7 MINUS" into "4 CHAR, 1 -43, 7 PLUS, 1 -2, 7 MINUS". Arithmetic Operators. --------------------- MINUS. On the 900-series, Add (using Function 1) is slightly faster than Subtract (using Function 2), so subtraction with a constant second operand is converted into addition of the negated constant, which can benefit from commutativity and from other optimisations as for PLUS. I've not (yet) implemented an obvious optimisation here. Where the "wrong" LHS operand is already in the A-register, I generate "/5 T, 4 RHS, /2 T" by default, but I could generate "2 RHS, 2 +0". However, possibly because of the many other optimisations performed, this change would only reduce the SIR of the front-end by one word. PLUS. I do optimise the cases where the either operand is known statically to be zero, by returning the other operand. Addition being commutative simplifies testing for these cases, and for recognising when either operand is already in the A-register. Code of the form X:=X+N or X:=N+X, for 0<=N<=2, is recognised and generates the appropriate number of Count instructions (Function 10, B-modified or otherwise). This happens, wherever X may be, and without disturbing the A-register. (See the loops in "Relational Operators" below). There could be places where it is worth including N=3. Three "10 X"s are slightly slower than "4 X, 1 +3, 5 X" but save a literal and do avoid losing the A-register. However, in many cases, the addressed variable might be used in the previous or subsequent statement, so I've not done this. PLUS followed by RV or STIND is recognised, and optimised when possible to use non-zero B-register offsets, so that, for example, the code to read from "A!3" becomes just "0 A, /4 3", rather than "4 A, 1 +3, 5 BREG, /4 0". MULT. I do optimise the cases where the either operand is known statically to be zero, by returning zero, or to be one, by returning the other operand. Multiplication being commutative simplifies testing for these cases, and for recognising when either operand is already in the A-register. Multiplication by powers of two can be implemented by placing the other operand in the A-register, and obeying a left shift instruction, followed by a mask instruction with a constant which can be calculated statically by the back-end, to remove garbage shifted in from the Q-register. Multiplication by powers of two and left-shift do share a subroutine, which optimises three cases where the mask literal can be avoided, see the "Shifts" section below. Multiplying by negated powers of two is surprisingly simpler. The other operand is placed in the A-register and negated, which also clears the Q-register, and is interrupt safe, then the left shift instruction is obeyed. No mask instruction or literal is required. The peephole optimiser will remove the shift for multiplication by -1. (See "A-Register Tracking & Peephole Optimisation" above). When multiplying by other constants, the constant is normalised, so that the subsequent shift can be made shorter, saving time. For example "*10" is not coded as "4 N, 12 +10, 14 17", rather it is coded as "4 N, 12 +81920, 14 4". Multiplying two variables, however, still does require a 17-place shift. (Function 3 could only be used when the product is known to be positive). Some time (but not store) could be saved by optimising multiplication by three as either "4 X, 1 X, 1 X" or "5 X, 1 X, 1 X" as appropriate, but I've not (yet) done this. DIV & REM. When the dividend (LHS) is known statically to be zero, both DIV and REM-or-MOD return zero. When the divisor (RHS) is known statically to be +1, DIV returns the dividend, and REM-or-MOD returns zero. I've not treated division by zero as a special case, because it isn't, overflow is being ignored everywhere, so whatever the hardware returns will do. Divisions by powers of two are implemented by placing the LHS in the A-register, and obeying a right shift instruction. Right shifts in BCPL require the deletion of the sign bits, so right-shift and division do not share a subroutine. REM-or-MOD with a power-of-two RHS are implemented by placing the LHS in the A-register, and obeying a "6 RHS-1" mask instruction. Even for negative LHS, these optimisations do not alter the values returned, and A = (A/B)*B + (A REM B) is preserved. I have not (yet) provided optimisations for DIV and REM-or-MOD by negated powers of two, including a RHS of -1. REM-or-MOD could be implemented as for powers of two above, observing that A REM (-B) = A REM B, as the Ada LRM says, giving a correct REM corresponding to a round-to-zero DIV. Division could be implemented as for powers of two above, preceded (rather than followed) by "2 +0", giving the same (often-off-by-one) result as the unoptimised code. However, A = (A/B)*B + (A REM B) would not be preserved. Otherwise, division uses the "double/divide/halve" method, as used in CAP's 900 CORAL and described in "Other Obstacles" above, "4 LHS, 14 8176, 13 RHS, 14 8191". The right shift is 16 places rather than the usual pre-divide 17, and so implements the doubling. Because of the doubling, this method can overflow when dividing by +1 or -1, despite the correct result usually being in range. The "13 RHS" must not be modified, because this would corrupt the Q-register before the division, so if the RHS is on the stack or in the A-register it must be moved to a static location beforehand. This code does not clear the least-significant TWO bits of the Q-register, and I have a vague memory (from the 1970s) that CAP said that this was OK. Certainly it is well published that division ignores the least-significant ONE bit of the Q-register, and I have recently established that it does ignore the next bit, which is why division only returns a sign and 16 data bits. (This is useful, because clearing these bits by changing "14 8176" to "14 8174, 14 2" would not be interrupt safe). The 16-place shift makes this code quite slow. Where the LHS is known statically, it would be possible to generate code which places the LHS directly into A & Q, already shifted by the back-end. This saves no space, so I've not (yet) implemented it. I spent some time investigating in what circumstances dividing by a constant can be replaced by multiplication by a constant (and a shift). Certainly for fixed-point real-number calculations, it would be usual to do this, but integer arithmetic is more exacting. The best multiplying constant is found by dividing 131072.0 by the divisor, normalising it (retaining the fractional part), then rounding up, and optionally denormalising it whilst even (to reduce the shift). Division by powers of two needs just a shift (as above). Division by +3 can be implemented as "12 +43691" and division by +43691 can be implemented as "12 +3" for all positive dividends, correctly (with no shift). Division by any other positive number (such as +10) can be implemented by a multiply and a shift ("12 +52429, 14 8190"). For 69% of divisors the result is right for all positive dividends, for 31% it fails for positive dividends somewhere above 65535. This certainly meets the "better than PDP-11" test but it might not meet the "at least as good as CAP CORAL" test for some values. Where it is known that the dividend has a reduced range (which is not generally the case in BCPL) the shift can be reduced or eliminated (for dividing by +10, "12 +13108" works for dividends up to 16389). Division by a negative number could at worst be implemented with a negative multiplier, but I've not investigated how to round this. The results may be worse for negative dividends and for negative divisors, but BCPL has no specific requirements for them. I did not implement any of these (except powers of two). Except in the special cases covered earlier, and as explained in "Other Obstacles" above, REM-or-MOD has to be implemented by "Divide/Multiply/Subtract". But when the RHS is a constant, it is NOT normalised for the multiplication, because this would make the divide and multiply literals different. If I were to implement dividing by a constant as multiplication by a different constant, then it would be appropriate to normalise the RHS constant used for multiplying back (since they differ anyway), or it might be possible to recover the remainder from the Q-register, without subtraction (as I do when unpacking text in my 900 BASIC). Shifts. ------- On many computers, there are separate opcodes for left shifts and for right shifts, so a given instruction can only shift in one direction. The Elliott 900-series has a single shift instruction, so for example a "/14 0" instruction will shift the combined A&Q registers either left or right (arithmetically) according to the value held in the B-register at run-time, positive for left and negative for right. For shifts in either direction, BCPL requires that the shift distance "must evaluate to yield a non-negative integer" and "vacated positions are filled with zeros". Modern BCPL changes the first requirement to "negative shifts or ones of more than the word length return 0". Given the ability of the 900-series to implement negative shift distances as shifts in the opposite direction, it seems sense to do exactly this and to ignore the new requirement. (In fact to do otherwise could be slightly messy for right shifts of distance zero). Sadly "/14 0", being modified, corrupts the Q-register before it gets shifted, so for dynamic shift distances, RHS, it is usual to plant a shift instruction. The 900-series right shift regenerates the sign of the A-register, but the BCPL requirement to shift in zeros can be neatly met by loading the value to be shifted, LHS, into Q rather than into A, and shifting 18 places further left than requested. Thus: Left shift, RHS in A. Right shift, RHS in A. 1 =15 18 2 =15 18 6 =14 8191 6 =14 8191 5 ;+3 5 ;+3 2 LHS 2 LHS 4 +0 4 +0 >1 >1 (This code is prefixed by "0 P" if BREG is false and the LHS is on the stack, otherwise the "5 ;+3" might have to become "5 ;+4, 0 P"). Because the A-register is used to calculate the planted instruction, the LHS will not be in the A-register, so loading it into Q costs no more than loading it into A. (This is not the case when the RHS is known statically, to be covered below). As noted in "Interrupt Safety" above, this code is not interrupt safe. To make it safe, change "2 LHS, 4 +0" to "4 +0, 0 LHS", and add "0 P" at the end if needed. (If we were prepared to assume that the shift distance, whether positive or negative, was never more than 18 places either way, both "=15 18" above could be changed to "=14 18", giving left shift instructions in the range 14 0 to 14 36, and then the mask instructions could be removed). The potentially long shifts make this code quite slow. Where the LHS is known statically, it would be possible to generate code which only displaces the LHS by 1 place rather than by 18 places to the right and then generate what will often be shorter run-time shifts, (where LHS<<17 will be either +0 or &400000): Left shift, RHS in A. Right shift, RHS in A. 1 =15 1 2 =15 1 6 =14 8191 6 =14 8191 5 ;+3 5 ;+3 2 LHS<<17 2 LHS<<17 4 LHS>>1 4 LHS>>1 >1 >1 Or, when the LHS is known statically and is not negative: Left shift, RHS in A. Right shift, RHS in A. 1 =15 0 2 =15 0 6 =14 8191 6 =14 8191 5 ;+3 5 ;+3 2 +0 2 +0 4 LHS 4 LHS >1 >1 Both of these pairs avoid setting the least significant bit of Q to 1, so they are interrupt safe. But neither of them saves space, so I've not (yet) implemented them. (If we forgo negative shift distances: the "=15" in both versions of left-shift code above could be changed to "=14" and then the mask instructions could be removed; whilst in the lower right-shift code, the "5 ;+3, 2 +0" could be changed to just "5 ;+2", or the whole sequence changed to "2 +0, 5 BREG, 4 LHS, /14 0" and possibly "0 P"). I do optimise the cases where the value to be shifted (the LHS) is known statically to be zero, by returning zero. When the distance to be shifted (the RHS) is known statically, there is no need for planted instructions. Where it is known statically to be zero, no operation is required (the LHS is returned), where it is known to be over 17 places either way, zero is returned. Right shifts of other static distances can be implemented by placing the LHS in the A-register, and obeying the shift instruction "14 8192-RHS", followed by a mask instruction with a constant which can be calculated statically by the back-end, to remove the regenerated sign bits. Division by powers of two requires the retention of the sign bits, so right-shift and division by powers of two do not share a subroutine. Left shifts of other static distances can be implemented by placing the LHS in the A-register, and obeying the shift instruction "14 RHS", followed by a mask instruction with a constant which can be calculated statically by the back-end, to remove garbage shifted in from the Q-register. Multiplication by powers of two and left-shift do share a subroutine, which optimises three cases where the mask literal can be avoided: * If shifting one place left, the LHS is added to itself, using "4 LHS, 1 LHS" or "5 LHS, 1 LHS" as appropriate, which avoids shifting garbage in from the Q-register. * If the LHS is not already in the A-register, but it can be read without B-modification, Q can be cleared first, using "2 +0, 4 LHS, 14 RHS". * Otherwise, having loaded the LHS into the A-register, if BREG is not OK, Q can be cleared, using "0 +0, 14 RHS". Formally, for some computers in the Elliott 900-series, the effect of shifting more than 36 places is undefined. The code described above will give correct results whenever the distance to be shifted is known statically, but otherwise, excessive shift distances could give wrong results or invoke a hardware block transfer instruction. Logical Operators. ------------------ Logical operators have to be evaluated left-to-right in a "boolean context", but otherwise require bit-wise evaluation. Only the latter generate the O-codes: EQV NEQV LOGAND LOGOR. Cases where an operand is +0 or -1 are optimised to return: +0, -1, the other operand, or its complement, as appropriate; and the fact that these operators are all commutative simplifies testing for these cases, and for recognising when either operand is already in the A-register. LOGAND is trivially implemented by Function 6, Collate or Mask. It is tempting to implement LOGOR using DeMorgan's laws: 4 X 2 -1 5 TEMP 4 Y 2 -1 6 TEMP 2 -1 But this is simpler: 4 X 6 Y (common bits) 2 X (bits in X not in Y) 1 Y or in a different order if more convenient: 4 X 6 Y (common bits) 2 Y (bits in Y not in X) 1 X or slightly faster if X and Y are on the stack and need B-modification whereas the constant does not: 4 X 2 -1 6 Y (bits in Y not in X) 1 X which is what I've used in my implementation, or any of these with X and Y reversed. When one operand is a constant C, the last of these becomes: 4 -1-C 6 Y 1 +C or probably more usefully: 4 Y (if Y not already in A-register) 6 -1-C 1 +C for example for C=10, the value -11 has all bits except those worth 2 and 8, so they are masked out from Y if they are present, then they are added in with no risk of carry: 4 Y 6 -11 1 +10 To show how effective this non-DeMorgan LOGOR code is: an example from the QUEENS program: ... is longer modified to avoid LOGOR: LET poss = all & ~(ld | col | rd) LET poss = all & ~ld & ~col & ~rd /4 4 /4 3 2 -1 2 -1 /6 3 6 ALL /1 4 /5 6 /5 6 /4 4 2 -1 2 -1 /6 5 /6 6 /1 6 /5 6 2 -1 /4 5 6 ALL 2 -1 /5 6 /6 6 /5 6 NEQV or XOR is bit-wise binary addition, without carry, and EQV is the complement of this. Carry will occur wherever a 1 occurs in both operands, which can be found by LOGAND, Function 6, enabling the carry bits to be subtracted out of add-with-carry: 4 X 6 Y (gives bits common to both) 2 +0 (also clears Q, interrupt safe) 14 1 (negated carried bits) 1 X 1 Y (An initial read can be avoided if either operand is already in the A-register, and the two final adds can be in either order). Relational Operators. --------------------- This section considers: O-codes LS, GR, EQ, NE, LE, GE, not followed by JT or JF, O-codes JT & JF, not preceded by LS, GR, EQ, NE, LE, GE, O-codes LS, GR, EQ, NE, LE, GE, when followed by JT or JF, and a slight rearrangement of FOR-loops in the front-end. O-codes JT & JF, not preceded by LS, GR, EQ, NE, LE, GE: JF is required to jump if the top of stack is FALSE = zero, JT is required to jump if the top of stack is ANY other value. O-codes JT & JF, not preceded by LS, GR, EQ, NE, LE, GE: are required to stack TRUE specifically as -1, if LHS OP RHS is true, and FALSE as zero otherwise. O-codes LS, GR, EQ, NE, LE, GE, when followed by JT or JF are optimised to Jump if LHS OP RHS, without stacking and testing the Boolean result. The relational operators are arranged within the O_CODE type in the order given here, which has the property that reversing the order negates the condition, and this enables the procedure CGBRANCH to map JF onto JT, using just a subtraction. Implementation of EQ & NE in either context is straightforward, and exploits the commutativity of these operators. If any non-zero value were permitted for TRUE, the code for NE could be simplified. Implementation of the other four operators is straightforward in some cases, for example, "jump if LHS < 0" maps onto "9 LAB". But when it is less straightforward, there are often several possibilities, for example, "jump if LHS <= 0" or "jump if LHS < 1" can be mapped onto "9 LAB, 7 LAB" or "7 LAB, 9 LAB" or "1 -1, 9 LAB". In the Boolean context, these four operators can be implemented by arranging to set the A-register negative if & only if the condition is true, followed by a "14 8175" to set the A-register to -1 or +0. For example, "LHS <= 0" or "LHS < 1" can be mapped onto "1 -1, 14 8175". By selecting the "1 -1, 9 LAB" style in the Jumping context, the two contexts can share common back-end code (in procedure COMPARE) for the four operators, just differing in the final "9 LAB" versus "14 8175". This choice eliminates the other possibilities from the Jumping context, which may have given faster code, but this choice never costs more store. The four operators use at most two instructions to compare the operands (sometimes less if either is in the A-register), possibly followed by an adjustment (which often can be absorbed into a constant operand) of "2 +0" or "1 -1" or "2 -1", followed by the "9 LAB" or "14 8175". And where possible, as for MINUS, comparison by subtraction of a constant operand is converted into addition of the negated constant. As explained in "Other Obstacles" above, this implementation ignores any overflow which occurs when comparing the operands of the four operators by subtraction, and the adjustment instructions above might slightly alter the points at which the overflows occur. With this implementation of the relational operators, the O-code for a typical loop, as generated by the York front-end, translates to the left column below. Although the test is correctly performed before the first loop, textually it is at the end of the loop. This is generally faster, because the unconditional jump is only obeyed once, not once per loop. But putting the test at the top of the loop reverses the condition, which, on the 900-series (due to the asymmetry of available conditional jumps) uses one less word of store, as in the right column below, and both methods have the same loop overhead, of five instructions executed. York front-end: Rearranged front-end: 4 FIRST 4 FIRST /5 8 /5 8 4 LAST 4 LAST /5 9 /5 9 8 L25 L25 /4 8 L26 BODY OF LOOP /2 9 /10 8 9 L26 L25 /4 9 BODY OF LOOP /2 8 /10 8 1 -1 8 L25 9 L26 L26 The York front-end generates slightly different O-code when LAST is a constant, which, with this implementation of the relational operators, translates the O-code for "FOR A = 0 TO 6" as in the left column below. In this case, putting the test at the top of the loop again reverses the condition, as in the right column below, but it gives no store saving, and the York arrangement has the lower loop execution overhead. York front-end: Rearranged front-end: 4 +0 4 +0 /5 8 /5 8 8 L66 L66 /4 8 L67 BODY OF LOOP 2 +6 /10 8 9 L67 L66 /4 8 BODY OF LOOP 1 -7 /10 8 9 L67 8 L66 L67 Accordingly, I have modified the front-end to rearrange the O-code for FOR loops, but only when the LAST value is not a constant. I do not (yet) know whether any other BCPL loop constructs would benefit, on the 900-series, from a similar modification. Compatibility is retained: my back-end will still work with York's front-end O-code (it will just produce less efficient code), and the York PDP-11 and other back-ends should work with my new front-end. When programming in SIR, assuming in the Variable case that FIRST & LAST can be safely evaluated in the opposite order, it would be usual to slightly rearrange the above chosen code, to avoid executing the "/4 8" the first time round the loop: Variable LAST: Constant LAST (=+6): 4 LAST 4 +0 /5 9 /5 8 4 FIRST 8 L66 /5 8 L67 BODY OF LOOP L25 /2 9 /10 8 9 L26 /4 8 BODY OF LOOP L66 1 -7 /10 8 9 L67 /4 8 8 L25 L26 This code can also be modified as follows, which is slower, but takes no more store, and is suitable for increments other than +1: Variable LAST: Constant LAST (=+6): 4 LAST 4 +0 /5 9 8 L66 4 FIRST L67 BODY OF LOOP L25 /5 8 /4 8 /2 9 1 +or-INCREMENT 9 L26 L66 /5 8 BODY OF LOOP 1 -7 /4 8 9 L67 1 +or-INCREMENT 8 L25 L26 And in all three cases above where "1 -7" appears, this could be removed by changing the "4 +0" to "4 -7", provided that the loop variable is not used within the loop (or by suitably altering such references). Unfortunately loops are not visible as such in O-code, so none of these SIR optimisations can be applied. (However, the loop could be written "FOR A = -7 TO -1"). Switches. --------- In the York implementation, the front-end does check for duplicate case labels, but it does this without sorting them. However, the York back-end does sort the case labels. If 3*number_of_cases > 16+highest_case-lowest_case, then it calls LSWITCH otherwise it calls BSWITCH: * BSWITCH is recursive. If it is dealing with more than about 6 cases, it compiles code for the middle case, and it codes two calls of itself for the higher & lower slices of cases, otherwise it compiles a "CMP" & "JEQ" for each case, and finally a "JBR" to the default, in every slice. * LSWITCH checks if the value is within the bounds, then jumps indirect via a table containing a list of case addresses interspersed with default addresses. BSWITCH does not call LSWITCH to deal with a compact subrange. In my 2016 implementation, I simply compiled a linear list of "1 constant" and "7 label" pairs, with a final "8 default", (the constants being the differences between unsorted adjacent case labels), thus costing 2 words & 1 literal per case, and 2 executed instructions for each case tested, which is clearly potentially much slower than the York implementation. In my 2018 implementation, I use a system call which performs a linear scan of a table containing pairs of: value to compare with (unsorted, of either sign), address to jump to (always positive), finishing with the pair: default address to jump to, -1 (always negative), thus costing 2 words but no literals per case. The system call is entered with the switch value in the A-register and the table address in the B-register. It is only a one-off shared overhead in space, of 15 instructions, executing 8 instructions per case tested. I chose this even slower implementation, not for the saving of the literals, but because the table is not confined to the first 8K (unlike the code of the 2016 implementation). One could argue that this is interpretive not native code. The lexical analyser NEXTSYMB in TSYN switches on the first character of a token, and 52 of the 128 possible values, namely "A" to "Z" and "a" to "z", label the same case, which processes an identifier. And in O-code, BCPL allocates a separate label for each value, so, even though the York back-end sorts the cases, it is not apparent to the back-end that the sorted list contains two runs of 26 labels all pointing at the same case. (In fact York uses LSWITCH here). Here BCPL switches lack the range facility of Ada's case statements: "case CH is when 'A'..'Z | 'a'..'z' => statement;". So whilst I could modify my SWITCHON system code to perform a BSWITCH-style tree search of a sorted table of values, I'm more interested in developing good code for ranges (or a mixture of single values & ranges), for Ada. Strings. -------- Strings are packed big-endian as two 9-bit characters per word. 9 bits are used rather than 8 to make the values more readable in octal (although this does make PACKSTRING & UNPACKSTRING marginally slower). 6-bit SIR internal code could be used but then the layout escapes such as "*N" would have to be processed on output rather than on input and lower-case letters would be lost. Because strings are not constrained to the first 8K, the less efficient packing is not so important. The characters of the string are prefixed by a whole word, S!0, holding the length of the string in characters (as in the York implementation) and so strings can be of any length at run-time (although the front-end has a 255-character limit). If and only if there are an odd number of characters, there is a final null character, excluded from the length. (In the York implementation there appear to be either 1 or 2 final nulls). Note that BCPL usually expects the length to occupy just the first byte. PACKSTRING & UNPACKSTRING are both written so that the input & output can be the same, with the string overlaying the start of the vector. Finally, as noted in "Presetting" above, Strings are preset at load-time, and BCPL allows the programmer to alter them, but if this is done, then the program cannot be rerun without reloading it. Use on a Real 900 System. ------------------------- The *.BIN files, as produced by my current implementation, can be run on a real 900-Series computer. Input & output will default to the paper tape reader and punch, but can be directed to other addresses (for example to the TeleType) by changing INPUT=GV+3 (from 2048 to 2052) & OUTPUT=GV+4 (from 6144 to 6148). Input ignores parity, output generates correct (even) parity. However it might be worth changing the implementation of READC=RDCH & WRITEC=WRCH in BCPL_ENV.SIR, to call the ACD standard CHIP & CHOP subroutines (or their MESCL equivalents QCHIN & QCHOP). On input, CHIP provides parity checking and line buffering (for non-stop-on-character readers). On output, CHOP provides blanks after newlines, and leading & trailing blanks, and it prefixes each tab (which TeleTypes ignore) with a space. Given the implementation which I now have, I'm sure that an implementation could be produced which ran entirely on a 16K 903. GET remains a problem, see "The GET Command" above for possible solutions. The front-end currently treats GET as a syntax error, and perhaps as a minimum it should treat it as a HaltCode. The front-end is a 900-series program, which currently requires 18K words, including a TREE vector of 6000 words, whether or not it is all used. Error message strings could be shortened, and the TREE vector could easily be reduced, to just fit into 16K. Reducing the TREE vector reduces the maximum size of every segment, rather than the size of a whole program. I don't know if such a 16K front-end would be able to compile its own segments as they now are. As suggested above, it might be worth changing the implementation of READC=RDCH in BCPL_ENV.SIR in the front-end, to call the ACD standard CHIP subroutine, to provides parity checking and line buffering. The back-end is currently a host Ada (not BCPL) program. But I do have York's PDP11 back-end in BCPL, which I can compile, and run in the simulator, which, when applied to its own O-code, produces PDP11 code recognisably like YORK-RSX\10027\TRAN#4.MAC, (which I have in my magnetic tape HIXTRACT). So rather oddly, my 32K 903 could be used to produce PDP11 code (without needing a PC). (The manifest constant for the PDP11 opcode XOR had to be changed to XXX to avoid a clash with the new BCPL keyword, but the string output is still "xor"; and TRUE being printed in unsigned octal appears as $777777 rather than $177777). In "Work done in 2016" above, I noted that "the translation of TRAN to SIR did not quite fit, but since it generates PDP-11 code, I didn't actually need it". With my 2018 implementation, the SIR of this PDP11 back-end does now fit, and finishes with NEXT1=7306, which includes high data of 1016 bytes, excluding this NEXT1=6290. My equivalent Ada code (TRAN_PDP.AD6) was 4827+37662 bytes, my current back-end Ada code (TRAN_BCL.A53) is 4825+48645 bytes. 6290 * 53470 / 42489 = 7916 words. So there is a good chance that, if implemented in BCPL, the code of my current back-end would just fit into 8K, with higher data in the 2nd 8K, and without removing any of the optimisations which it currently performs. The remainder of the 2nd 8K will be more than adequate to hold the simulated stack and jump table of any program which otherwise fits. The current textual O-code interface could be changed to binary, in which opcodes would be one byte, labels would be two bytes, constants would be three bytes, and names & strings would use one byte per character. Some blanks would be needed where reading might pause, at the end of each complete O-code statement and after each constant-and-label pair of a switch list (actually corresponding to the newlines of the textual interface). This would make both front-end & back-end slightly smaller, and save paper tape. The SIR generated by the back-end, and the tape of BCPL_ENV.SIR, could both use tabs rather than multiple spaces, to save paper tape. In BCPL_ENV.SIR, the comments containing PDP11 code should be removed, and the code & comments for NUMARG & GETARG could be removed (which are only there for regression testing), to make the tape shorter. The two SIR global lists in the generated code and the SIR global list at the start of BCPL_ENV.SIR could be replaced by a single "[]" after the option in the generated code, simplifying the back-end. As my current back-end reads the O-code, it writes to three temporary files, it then chooses the 8K or 16K store allocation strategy, and then concatenates the files, changing the higher store pointers from "? LABEL" to either "0 LABEL" or "1 LABEL". On a real 900 System, the O-code could be read three times, generating the CODE & DATA & HIGH file outputs respectively, to produce a similar concatenated file. The store allocation strategy would be known by the end of the CODE pass, enabling the correct pointers to be generated from the DATA to HIGH. It might still be appropriate for the back-end to measure BCPL_ENV.SIR to choose its strategy, but to save paper tape there is no need to copy it into the concatenated file. Alternatively, the back-end could generate the three streams, interlaced into a single SIR tape, in a single scan of the O-code, inserting a SIR patch whenever switching between the three streams. At the end of the O-code, the back-end could then output a separate short tape of the form "*8211, CODE >CU, DATA >DU, HIGH >HU" or "*8211, CODE >CU, DATA >DU, ^8192, HIGH >HU, $", to be read into 2-PASS SIR before the main interlaced SIR tape. The patches in the main tape would locate each portion of SIR into the right place within the appropriate skip of the three. (Rather than patching to large offsets of the three labels, it would be more symbolic for each patch to be to a label inserted at the end of the previous SIR portion in the same stream). With this scheme (short of modifying 2-PASS SIR to understand "?"), it would be up to the user to select the 8K or 16K store allocation method initially. There would be no need to measure BCPL_ENV.SIR. (With either approach to the back-end, to avoid needing two versions of BCPL_ENV.SIR, the line "ADSTCK ? STACK" therein could be deleted, and "ADSTCK 0 STACK" or "ADSTCK 1 STACK" added to the generated code). When assembling with 2-PASS SIR, paper tape could be saved by changing the initial option from *8211 to *8208, to minimise the label list. Small enough programs could be loaded using 1-PASS SIR, avoiding the need to make a binary paper tape, in which case the option should be changed to *23 for labels or to *22 to minimise the label list. (The code used to find the end of the literals would need to be located somewhere else and triggered differently, or, if more than 8K of store is available, it would suffice to set ADSTCK to "1 0"). The front-end and back-end will each be over 8K words of paper tape, (24K bytes, 200 feet, 1/5th of a reel), and implementing them in SIR would reduce the amount of paper tape handling required. Of course nobody nowadays has enough paper tape to want to run BCPL compilations on a real 900-series machine anyway, because of the amount of tape used per compilation, rather than the amount used to make the compilers. *** EOF ***