Elliott 900-Series BCPL Implementation.
        =======================================
        Terry Froggatt, 18th December 2018.


My Interest in BCPL.
--------------------

Throughout 1977 & 1978, I was the research fellow in
real-time programming languages at the University of York.
My key activity was to attend meetings at the Berlaymont
building in Brussels, to discuss the proposed standard
"Long-Term Procedural language for Europe" (LTPL-E).

At the same time, the US Department of Defense was developing
its own standard real-time language, later called "Ada". So
the LTPL-E group transformed itself into an Ada liaison group,
which is how I later became involved with the Ada language
at Systems Designers, from 1979 to 1991.

But amongst the background tasks which I undertook at York
was the porting of a BCPL implementation from a PDP-11 running
RSX to a PDP-11 running Unix, to support York's development of
Modula. This was not a major task. I didn't really have to
understand much about the internal workings of the BCPL compiler.

Also, in my own time at York, I was writing a BASIC interpreter
for the Elliott 900-series Computer User's Association, having
worked on the 900-series from 1966 to 1976. To help me with
this task, DDT, a 3rd-party maintenance organisation, loaned
me an Elliott Arch 9000, which I subsequently bought from them.

So it was natural to wonder whether I could implement BCPL on
the 900-series. Before I left York, I made paper-tape copies
of the source files (GET & BCPL) and O-code files, of each of:
the BCPL Library, the BCPL compiler front-end (TSYN & NTRN)
and the BCPL compiler back-end or code-generator (TRAN).

A chance meeting with Martin Richards at an Oxford conference
in November 2016 made me consider doing this again.

In November 2001 I'd interfaced my Arch 9000 to the printer port
of a Pentium, and in June 2008 I'd uploaded the paper tapes, via
the Arch 9000 paper-tape reader and this link, to the Pentium.
So I could now develop the implementation using any modern screen
editor, and test my code in my Elliott simulator on the Pentium,
making it all a lot easier than it would have been 30 years earlier.


Starting Point.
---------------

Although Martin Richards is still developing BCPL,
I very much wanted to start from my York tapes,
not only because I had some knowledge of them, but
because the PDP-11 implementation used a 16-bit word,
so would port more easily to the 18-bit Elliott word
than a modern 32-bit or 64-bit versions would.

The modern BCPL versions also contain many facilities
not in the York version, such as floating-point,
and fitting BCPL into an 8192-word or 16384-word
Elliott was already a challenge without these extras.

The modern preference is to port BCPL by writing an IntCode
interpreter, which is claimed to make porting easier.
But having already implemented a BASIC text-interpreter and a
pidgin-Ada threaded-code interpreter for the Elliott 900-series,
writing yet another interpreter was of no real interest.
I'd already sketched out an O-code to Elliott SIR mapping,
and I didn't consider implementing this difficult.

A fundamental property of BCPL is that it supports recursion.
900-series ALGOL 60 and my 900-series BASIC both
support recursion easily, being interpretive.
MECSL's & CAP's 900-series CORAL 66 implementations both
generate native code, but both omit recursion.
(In CORAL 66, recursive procedures have to be flagged as such, which
avoids the overhead on other procedures, but the official definition
notes that recursion "may or may not be considered worth having").

What I wanted to do was generate true native code from O-code,
with a view to mapping rather more of Ada onto it someday.
So the challenge was to implement recursion in native code,
on an Elliott 900-series machine which has no stack support.
This was not as hard as I expected (see "Stack Addressing" below).


What to Target?
---------------

The usual minimum store size for a 900-series machine is
8192 words, which can be extended in 8192-word multiples.
There are currently a handful of working machines, all with
either 8192 or 16384 words, and my Arch 9000 now has 32768 words.
Of course, the available simulators can simulate more store.

In BCPL it is assumed that all code & data is addressed in a single
flat address space. Certainly BCPL programs of up to a certain size
will fit into an 8192-word machine, with no addressing problems.
My implementation puts everything into the first 8192 words when
it is clear that it will fit.

For larger programs, the stack, and data structures such as strings,
can be placed into higher store modules when these are available,
accessed by B-modification, provided that the code and the simple
data (global vector, static variables, literal constants, and
pointers to the stack & strings and to procedures) still fit into
the first 8192 words. My implementation implements this too.

In the case of even larger programs, it is unlikely that the simple
data would exceed 8192 words before the code exceeded 8192 words, so
the next step would be to place some or all of the code into higher
store modules, with the simple data still in the first 8192 words.

The default on the 900-series is that data and code addresses
are relative to the start of the 8192 words containing the code
being obeyed, and accessing elsewhere requires B-modification.
Machines from the 920C or 905 onwards have an additional address
mode in which data (but not code) addresses are absolute.

So placing any code outside the first 8192 words would require:
*  the use of the absolute address mode for data, which is
   not available on any of the currently working machines,
*  more complicated slower system routines for the likes of procedure
   entry & exit, to convert code addresses from absolute to relative,
*  a modification to 2-PASS SIR to always locate literals in
   the first 8192 words, rather than in the current module.
Accordingly, I've not considered this option any further.

There are some differences between machines in the 900-series.

On the 920A, Functions 6 7 8 9 all leave the Q-register corrupted,
and B-modification corrupts the Q-register before the instruction is
executed. On later machines there are fewer Q-register corruptions,
but I've not found anywhere in my BCPL implementation where I could
usefully exploit these improvements.

The later 905 & 920C have some extra instructions which the earlier
920A 920B 920M & 903 do not support, including the potentially useful
absolute mode just mentioned above. But I've not found anywhere
in my BCPL implementation where using any of the extra instructions
would give a significant saving.

I resisted the temptation to implement BCPL's GOTO command by
writing the address calculated in the A-register directly into
the program counter, which would work on all currently working
machines (but NOT on a 905 or 920C or in some simulators). Formally,
the effect of writing to the active program counter is undefined.

Accordingly, I've aimed to produce code which will run on ANY 18-bit
900-series machine with at least an 8192-word store, provided that
the program code & simple data together fit into just under 8192 words.
The challenge here is to get the compiler passes themselves to fit.


The GET Command.
----------------

The BCPL language has a GET command, which switches the input of
the BCPL front-end from its current source to a named file, then
switches back to the current source at the end of the named file.
In the York implementation, only a single depth of GET is supported.

In small programs, a small GET file is often used to define the
global vector offsets of library procedures, to give access to a
separately-compiled library. In larger, multi-Section programs,
another GET file may be used to provide communication between the
Sections, giving global vector offsets of shared procedures and
manifest constant declarations for shared enumeration types.

Although the 900-series does have file system support in
the shape of FAS for magnetic tapes and RADOS for disks,
the standard 900-series machine has neither, just paper tape.
So, given that I'm developing this BCPL implementation on a
host Pentium which does have a file system, I've taken the
pragmatic approach, for now, of writing a pre-processor to run
on the Pentium, which inserts the GET files into the BCPL files
producing expanded BCPL files (which I call GOT files).

Possible GET implementations on a real 900-series machine:
*  Given that only one level of GET is expected, use one tape reader
   for the BCPL files and a second tape reader (or at a pinch the
   on-line TeleType's reader) for the GET files. The GET files could
   be in the form of a searchable paper tape loop to facilitate reading
   once per Section (like subroutines on TNMoC's Harwell WITCH).
   However, whilst the 900-series instructions for a second paper
   tape station are defined, I've never seen a machine with more
   than one attached high-speed reader.
*  A pre-processor to run on the 900-series machine, which reads
   the GET files (and their names) into store first as text, so that
   it can then can copy a BCPL file line-by-line, inserting the
   GET files into the BCPL file producing an expanded BCPL file.
   On an 8192-word machine, around 7000 words would be available
   as file store, each holding at least two bytes, which is more
   than enough, for example, to hold the four GET files of the BCPL
   front-end, back-end, and library simultaneously (8197 bytes).
   The downside of this method is the amount of paper tape used,
   remembering that the GET files can be included once per Section.
   (The four BCPL source files expand from 92,680 bytes in total
   to 164,736 bytes in total: a 78% increase).
*  It should be possible to read a GET file into upper store first
   as text, then implement the GET command in the BCPL front-end
   by reading this back, avoiding producing an expanded paper tape.
   However, unless more than 16384 words of store are available,
   this would reduce the tree space available within the front-end.
*  When the BCPL front-end finds a GET line it could print the file
   name on the TeleType, then wait for BCPL tape to be marked and
   removed and for the GET file tape to be loaded. At the end of the
   GET tape, it would wait for the BCPL tape to be reloaded at the mark.
*  As a slight variant to the previous bullet, the BCPL tape could
   be reloaded somewhat before the mark (or at its start), and the
   restarted front-end could scan up to (and ignore) the GET command.


Other Obstacles.
----------------

There are several other aspects of the 900-series instruction set
which can give problems when implementing high-level languages.

The 900-series has no hardware overflow detection.
Detecting overflow by software typically requires
testing the signs of both operands and of the result.
The 900-Series ALGOL 60 interpreter does do this.
Ada requires overflow detection unless you say otherwise,
although saying otherwise not guaranteed to suppress it
(for example when it is done by hardware).
CORAL 66 recognises that "overflow may occur", and
CAP's 900-Series CORAL 66 does not detect integer overflow.
In BCPL "overflow in arithmetic operations is ignored"
and "the treatment of arithmetic overflow is undefined".

The 900-series has no hardware compare instruction or condition codes.
So generally subtract has to be used to implement the relational
operators, which is OK for Equal and for Not Equal. But for the other
four relational operators, the subtraction result in the A-register
is not sufficient to determine the result when overflow occurs.
This not the same problem as the lack of overflow detection. Ada 83
paragraph 4.5.2(12) recognised this problem and specifically states
that "no exception is ever raised by a predefined relational operator",
the right Boolean result is expected for all six relational operators,
whether or not subtraction overflow occurs. Early 900-series ALGOL 60
gave wrong results, until I reported the error, then it was fixed by
adding extra code to all six operators, not just to the faulty four.
CAP 900 CORAL uses subtract with no overflow check,
and BCPL is silent on the topic.

The 900-series has an infamously messy hardware divide.
This gives the nearest ODD number to the right answer, (rounding
in the direction of the denominator if there is a tie), allegedly to
minimise integration drift (when used for fixed-point arithmetic).
ALGOL 60 requires integer divide to round towards zero,
defined in terms of real divide as "sign(a/b) * entier(abs(a/b))".
Only in 2014 did we discover that there was still a bug in the
division subroutine used by 900-series ALGOL & 900-series FORTRAN.
But this subroutine, when fixed, does give the "round towards zero".
Ada also requires integer divide to round towards zero.
CORAL 66 is silent on the topic.
For integers (but not for fixed-point numbers), CAP 900 CORAL doubles
the numerator (held double-length), divides by the denominator, and
halves the result with a shift, which gets rid of the ODD problem,
giving "round towards zero" when neither operand is negative (but
often off by one otherwise). However this method can overflow when
dividing by +1 or -1, despite the correct result usually being in range.
BCPL explicitly states that "/" is truncating integer division,
but that the result is implementation dependent unless both operands
are positive. Consequently in BCPL, division by powers of two CAN be
implemented by shifting.

The 900-series division does not provide a remainder.
The divide instruction calculates its result in the Q-register and its
remainder in the A-register. But then for convenience the result is
moved into the A-register, so inconveniently the remainder is lost.
BCPL states that REM is the integer remainder (or modulo) operator,
but that the result is implementation dependent unless both operands
are positive. BCPL later defined MOD as a synonym of REM. Consequently
in BCPL, REM or MOD by powers of two CAN be implemented by masking.
There does not appear to be any requirement in BCPL that
A = (A/B)*B + (A REM B) unless both operands are positive.
On the 900-series, the remainder is usually calculated by dividing,
multiplying back, and subtracting, thus (A REM B) = A - (A/B)*B.
Ada explicitly requires that A = (A/B)*B + (A REM B) where integer
divide rounds towards zero. It also defines a different MOD,
which corresponds to a division rounding to minus infinity,
(not provided by Ada but closer to what the 900-series does).

Languages such as ALGOL and Ada are intended to give correct results
in all cases, despite the obstacles listed above, even when the cost
is to make the code larger & slower than it need be for typical cases.
Programmers working in SIR should be aware of these obstacles, and
will know that it is usually safe to write simpler code. For example:
*  they will know that a given relational operator will not overflow
   because one operand is zero, or because both are the same sign,
   as is usually the case with loop counters or array indices.
*  they will avoid division altogether wherever possible, changing
   division by a constant to multiplication by its reciprocal, and
   they will use double/divide/halve for integer division with positive
   operands (and where the result is known to be below 65536).
The aim of CAP's 900 CORAL 66 appears to be to enable a programmer
who knows the limitations of the machine to express their assembly
code program in high-level language terms. (CORAL is aimed at
embedded applications, typically interfaced by bespoke hardware
to the likes of a steel mill, with no expectation that the program
could be ported to a different computer without changes).

The objectives which I set for my 900-Series BCPL implementation are:
*  My code should give correct results wherever CAP 900-series CORAL does.
*  My code should give correct results for all 16-bit operands, and thus
   give correct results whenever the York PDP-11 implementation would.
   Relational operators which work for 16-bit operands on a PDP-11
   using its compare will work on an 18-bit 900 using subtract.
   Divisions by +1 or -1 which could overflow using
   double/divide/halve will not overflow with 16-bit operands.
This approach enables me to encapsulate some of the optimisations
that a SIR programmer would use when, for example, it is known that
one operand of an operation is a constant.

The fact that the 900-series only has "Jump if Zero" (Function 7)
and "Jump if Negative" (Function 9) is again something which
is not much of a problem when programming at the SIR level.
"Jump if Positive" just involves reversing the operands
(do-able given O-code), and the other three relations can be
implemented by swapping the THEN & OR parts of a TEST (although
not do-able given O-code, so occasionally one extra instruction
is needed: such as a conditional jump around an unconditional jump).


Work done in 2016.
------------------

Porting BCPL to a new machine involves writing a new complier
back-end or code-generator (TRAN), to convert O-code into code
for that machine. The machine-independent compiler front-end
can then be generated from its O-code (TSYN & NTRN), to run
on the new machine.

O-code operations such as PLUS & MULT operate on the stack and
are addressless, whereas most instructions on machines like the
PDP-11 and the Elliott 900-series include an operand address.
So BCPL code generators often contain a simulated stack,
(specifically York's TRAN, targetted to the PDP-11, does).
When an O-code operation is read which should stack an operand,
the operand is recorded in the simulated stack and no code
is generated, in the expectation that the operation which
consumes the operand later can address it directly.

Soon after meeting Martin Richards (on 19th November 2016),
I'd written a translator from O-code to SIR, the symbolic
assembly language of the Elliott 900-series (by 5th December 2016).
(Purely for my convenience this was written in Ada 83
to run under DOS on a 16-bit or 32-bit Windows PC).

This was a very simple program which did NOT include a
simulated stack. It only contained two optimisations:
*  If an operation wrote a result to the top of the stack,
   and the following operation consumed its second or only operand
   from the top of stack, the resulting write-then-read instruction
   pair was deleted. (Even if the operation was commutative,
   no attempt was made to check the first operand).
*  If one of the six relational operators was followed
   immediately by Jump-if-False or Jump-if-True,
   the appropriate code was generated, without
   specifically calculating & testing a Boolean value.

At this point the only serious way to test this code generator
was to translate the O-code of the York implementation.
I was able to translate the O-code of both TSYN & NTRN to SIR.
They just fitted, separately, with the string constants and
the stack in the 2nd 8K, and everything else in the 1st 8K.
The translation of TRAN to SIR did not quite fit, but
since it generates PDP-11 code, I didn�t actually need it.

Unfortunately TSYN & NTRN are not separate passes. TSYN builds
a tree for each Section of BCPL in store, then NTRN converts
each tree in store to O-code. Processing a BCPL file
typically involves multiple switches between TSYN & NTRN.
They are compiled separately in the York system, but they
are linked together, TSYN calls NTRN as a subroutine, and
they share the location & contents of globals and the stack.
So I was not able to create a working front-end.

However, TSYN has an option to suppress O-code generation
and print an ASCII representation of each tree instead.
So by assembling just the SIR of TSYN, I was able to generate the
three tree files from the BCPL of TSYN & NTRN & TRAN, which became
a useful reference for checking my subsequent implementation.

Rather than attempting to hand-optimise the SIR of TSYN & NTRN,
or attempting to write the front-end from scratch in SIR or Ada,
at this point it was more useful to put the missing simulated
stack into the back-end, TRAN, which would benefit all programs.
And rather than attempting to put the simulated stack into
the simple O-code to SIR translator which I'd just written,
I thought I would learn more by starting from York's O-code to
PDP-11 translator. Given that I could not yet compile any BCPL,
my first step was to translate York's TRAN from BCPL into Ada 83
(which I'd done by 28th December 2016), then I paused for a while.

The 2016 back-end above worked largely by text manipulation;
each Section was a separate SIR block to avoid label name clashes.
The 2018 back-end below, based on York's implementation,
evaluates & adjusts label numbers to avoid label name clashes.


Work done in 2018.
------------------

I returned to the project in 2018, and between February & June
I edited TRAN many times, eventually generating SIR with
various optimisations, rather than generating PDP-11 code.
I also developed the SIR assembler routines, BCPL_ENV.SIR,
to replace the PDP-11 assembler system & library routines.

"In BCPL it is defined that consecutive words of store have
numerically consecutive addresses", so a minor advantage that my
word-addressed Elliott implementation has over the byte-addressed
PDP-11 implementation is that there is no need to use a register
for doubling or halving addresses, for example in procedure LOADLV.

There were two confusing points regarding York's simulated stack.

*  I was not sure why the simulated stack starts with two dummy
   entries. There are two pointers, ARG1 & ARG2, which point to
   the top two items on the stack, and the dummy entries appear
   to give them something to point to when the stack is empty.
   This matters, for example, in York's procedure SAME,
   which compares the addresses of the top two stack items (when
   optimising X := X+1 for example) without checking that they exist.
   I've removed the dummy entries, and instead I check for the PLUS
   operator before I compare the addresses. If the O-code has failed
   to stack two items before the PLUS, Ada's array bound checking will
   spot this, rather than just generating code to increment a dummy.

*  Stack discontinuity, the H3 field, and the "N >= SSP+6" test.

   The H1=Mode & H2=Arg fields are straightforward, representing
   the code needed to access an item in the simulated stack. The
   H3 field is the stack offset of where it belongs in the real
   stack. Thus (M_NUMBER, 5, 8) can be flushed from the simulated
   stack by "4 +5, /5 8", and (M_LOCAL, 5, 8) by "/4 5, /5 8".

   This 3-field mechanism permits the simulated stack to represent
   discontinuous portions of the real stack. If the H3 field is
   omitted, only a continuous portion can be represented, where the
   real offset of ARG1 is SSP-1, and the real offset of TEMPV'FIRST
   is SSP-1-(ARG1-TEMPV'FIRST)/TEMPSIZE, which can be held in BASE.

   Discontinuities arise when vectors are allocated & deallocated:
   it would be unwise to simulate a 6000-word tree vector, and
   there is little point in simulating a 120-word string vector,
   since if anything only the first two words are addressed. Perhaps
   the STACK routine tests "N >= SSP+6" to allow for the optimisation
   of the parameters to front-end LIST routines up to LIST6.

   The York implementation is inconsistent in that it has the
   H3 field but does not use it to represent discontinuities.
   So I've omitted H3, which was only used in VACATE and in STORE,
   (where subsequently VACATE was assumed into FREE_A) and I
   now effectively calculate H3 as (H2-H2'FIRST)/TEMPSIZE+BASE

   At best, the simulated stack can only work on one stack frame,
   it cannot track run-time calls at compile-time. But in fact, in
   York's implementation and in mine, it only tracks part-frames.

Now I translated the O-code of both TSYN & NTRN to SIR again.
This time, (each including a global vector and its own copy of the
library, with the string constants and the stack in the 2nd 8K, and
everything else in the 1st 8K), each used about 6K of the first 8K.
And I estimated that TSYN & NTRN translated together with a single
global vector and single library, and still containing the tree
printing routine, would only overrun the first 8K by 1068 words.

This was a breakthrough: getting the back-end to the point where
the SIR generated from the O-code of the front-end was good enough
to be hand-optimised to fit, so that then the BCPL sources of the
front-end could be used, and altered.

Producing a useful front-end happened in two stages. Firstly,
I cut the tree-printing routine out of the O-code of TSYN,
translated the resulting O-code of TSYN & NTRN to SIR again, and
performed extensive hand-optimisations to the SIR to make it fit.
However, it would be tedious to repeat these hand-optimisations
whenever I wanted to change the front-end BCPL code. But now
that I had the ability to run the front-end, the second stage
was to implement some of the optimisations in the BCPL code.

My TSYN is as the York version, but:
*  With the testing of options removed, thus ignoring the
      "print tree" option (built into BCPL_ENV.SIR to test
      TSYN unaltered), and thus forcing the O-code defaults,
   and with the remaining uses of NUMARG & GETARG,
      for reading input file names, removed,
*  With PERFORMGET removed,
      any GET will now give error 93, "NESTED GET",
      although perhaps it should act as a HaltCode,
*  With PRINTNAMES & PRINTTREE removed,
   and with the whole of PLIST (which actually prints the
      tree) and IN removed, this is a substantial saving,
*  With LIST1 removed, it is never used.

My NTRN is as the York version, but:
*  In EVALCONST, the code for evaluating the operands
   of the binary operators is shared, then another
   case statement is used to call the operator,
   this is a substantial saving in 1st-8K code,
   although costing 32 words of any-8K data,
*  In EVALCONST, the code for evaluating the operand of NEG &
   NOT and the first operand of COND is shared, then another
   case statement is used to call the operator,
   this is a slight saving in 1st-8K code,
   although costing 8 words of any-8K data.

I also wrote a small program, OPTFRONT, to apply my other two
significant hand-optimisations (which cannot be expressed at
the BCPL level) to the front-end O-code. It is written so as
to survive certain changes to the BCPL source &/or to TRAN,
specifically it does not assume any J/K/L label names.
*  It is probably not possible in BCPL to statically
   initialise a table with a list of string addresses.
   The code for DECLSYSWORDS in Martin Richards' book
   does not do this, and the York implementation uses
   distinct code in for each keyword. This program
   changes DECLSYSWORDS into a loop using a table,
   this is a substantial saving in 1st-8K code.
*  This program modifies LIST2 to LIST6, to share their
   code in an obvious way, saving 71 words of 1st-8K code.

Finally, I made some changes to the front-end and library
to accept some modern BCPL syntax, as described below.
The remaining notes apply to the 2018 implementation.

Further changes made during the remainder of 2018 include:
*  In the back-end, I completely revised my implementation
   of Relational Operators, unifying the code used when
   followed by JT or JF with the code used otherwise.
*  In TRANSFOR in NTRN, I altered the implementation
   of FOR loops, to better exploit the asymmetry of
   available 900-series conditional jump instructions.


York Oddities.
--------------

As stated above, the "N >= SSP+6" test in STACK in York's
TRAN seemed odd, but I now understand what it�s for,
and that the value of 6 is somewhat arbitrary.

CGAPPLY in York's TRAN declares the frame size as "2*N+6" bytes,
which would be N+3 words, which seemed odd. However this value is
autodecremented by the ENTER system code as it sets up the 3-word
frame header. In my CGAPPLY I've simply used a frame size of +N.

In SAME in York's TRAN, "TEST (B=1 & OP=S.PLUS..." should test
PENDINGOP rather than OP, consequently some optimisations using
INC & DEC get missed. I've corrected this in my implementation
of STOREIN, which generates the appropriate number of count
instructions (Function 10) for X:=X+N and X:=N+X, for 0<=N<=2.

A latent bug was found in the York implementation (shortly before I
left) related to sign-bit regeneration when halving byte addresses.
This is unlikely to affect my word-based Elliott implementation.

NTRN necessarily evaluates constant expressions in MANIFEST & VEC
declarations at compile-time. Possibly the only significant
functional change that I made during my RSX-to-Unix port at York was
also to fold ANY expression or subexpression with constant operands.
But as I noted at the time, LOGAND & LOGOR had been overlooked.
My TRAN assumes that NTRN always folds constants, so I've
corrected LOGAND & LOGOR in my 900 implementation of NTRN.

CGBLOCK in York's TRAN carelessly LOADTs TRUE as octal 177777,
which is the same as -1 on the PDP-11, but not on the Elliott.
In my implementation, I use LOADT (-1) which works for both.

The York implementation provides Infix function calls. "The
construct A !PLUS B is equivalent to the function call PLUS (A,B)."
It looks like a York special and is not used in any York code.
For the reason explained below, I have removed this facility.

York's Strings start with a length word, rather than the
length byte which BCPL expects. I have retained this.
York's Strings end with one or two null bytes for word alignment.
In my implementation there are zero or one null bytes.


BCPL Changes.
-------------

In my implementation I have added three extra keywords,
used in modern BCPL, which were not in York's implementation:
   ELSE as a synonym for OR,   which is as in Algol & Ada,
   MOD  as a synonym for REM,  which are distinct in Ada,
   XOR  as a synonym for NEQV, which is useful and clearer.

The York implementation inherited some character set oddities from
the earlier Elliott/ICL 4130 implementation on which it was based:
   "%" and "|" are both available for vector indexing,
      although only "%" is used in the York source files,
   "!" is used for infix calls where A !PLUS B means PLUS (A,B),
      although this is not used in the York source files,
   there is no string indexing.
In modern BCPL:
   "!" is used for vector indexing,
   "%" is used for string indexing,
   "|" is an alternative to LOGOR and "\/",
   there is no syntax for infix calls.
In my implementation:
   "%" and "!" are both available for vector indexing,
      to accept both York's sources and modern sources.
   "|" is an alternative to LOGOR and "\/",
   there is no syntax for infix calls.
   there is no string indexing,
      but the example ENIGMA.BCP shows a work around for this.
In all three cases:
   "&" is an alternative to LOGAND and "/\".

In my implementation:
   the symbols "NOT" and "\" and tilde "~" are equivalent,
   the symbols "NE" and "\=" and "~=" are equivalent (but not Ada's "/=").
York's implementation did not support "~=".

The York implementation and mine still allow dots in identifiers,
   my implementation also now permits the underscores of modern BCPL.

The York implementation only permitted upper-case letters
   in string layout escapes, *N *S *T *P *E *C *L, and
   in WRITEF format insertion escapes, %C %N %O %S %D.
My implementation also allows lower-case in these contexts.
The escape to output a decimal number with # digits can now
   be either York's %D# (or %d#) or the modern %i# (or %I#).

Modern BCPL defines "ug" for the first "User Global",
   so I have added the definition "ug=60" to LIBHDR.
Modern BCPL treats global and static declarations
   with no number as one larger than the previous one,
   this is not in York's implementation or in mine.
Modern BCPL's uses "?" for "Undefined", like SIR's ">1",
   this is not in York's implementation or in mine.


Store Allocation.
-----------------

As explained in "What to Target?" above, my implementation puts
everything into the first 8192 words when it is clear that it will fit.
For larger programs, the stack, and data structures such as strings,
can be placed into higher store modules when these are available,
accessed by B-modification, provided that the code and the simple
data (global vector, static variables, literal constants, and
pointers to the stack & strings and to procedures) still fit into
the first 8192 words. My implementation implements this too.

The back-end assumes that any portion of the library written in BCPL
is already present in the O-code, either by splicing that library
BCPL onto the main program BCPL code, or by precompiling the library
and splicing its O-code onto the O-code of the main program.

As my back-end reads the O-code, it writes to three temporary files:
   "BCPLCODE.TMP" holds code to be placed in the 1st 8K,
   "BCPLDATA.TMP" holds data to be placed in the 1st 8K,
   "BCPLHIGH.TMP" holds data not constrained to the 1st 8K.
It counts the number of words written to each of these files,
and makes an overestimate of the number of literals used.
It then counts the words and overestimates the literals in
the SIR assembler system & library routines in BCPL_ENV.SIR.
(Each BCPL TABLE appears in the O-code as a list of static
variables, and so they are constrained to the 1st 8K code).

If it is then clear that everything, ignoring the stack,
will fit into the first 8K, store is allocated thus:
   Registers and Entry code from 0 to 11
   Global Vector skip of 60 or more words
   1st 8K Code from "BCPLCODE.TMP"
   1st 8K Data from "BCPLDATA.TMP"
   Higher Data from "BCPLHIGH.TMP"
   System Code from "BCPL_ENV.SIR"
   Literals as placed by 2-PASS SIR.
The Stack follows the Literals. Initially the stack pointer points
to the start of the literals, but the tape autotriggers and advances
the pointer to the second zero, the first being a literal and the
second being free store, cleared by an 8K clear-store on the front
of the tape. (The code to do this, and the ACD binary loader,
occupy locations 8157..8179, and may be overwritten by the stack).

If it is not clear that everything will fit, and provided
there is at least SOME Higher Data, store is allocated thus:
   Registers and Entry code from 0 to 11
   Global Vector skip of 60 or more words
   1st 8K Code from "BCPLCODE.TMP"
   1st 8K Data from "BCPLDATA.TMP"
   System Code from "BCPL_ENV.SIR"
   Literals as placed by 2-PASS SIR.
Starting from location 8192:
   Higher Data from "BCPLHIGH.TMP"
   The Stack follows the Higher Data.
I make no attempt to fit SOME of the higher data into the first 8K,
(especially as I only have an estimate of the number of literals).
There is no check that the items now placed into the first 8K actually
fit with the Higher Data removed, or that the Higher Data actually
fits into the second 8K. However, 2-PASS SIR will check this later.

MASIR has a notation for a symbolic address anywhere in store,
but in 2-PASS SIR, symbolic addresses are only recorded modulo 8192.
Pointers have to be written in instruction format, either
as "0 LABEL" to the first 8K or as "1 LABEL" to the second 8K.
Pointers to the Higher Data are written to the temporary files as
"? LABEL". The back-end then concatenates the three temporary files
and BCPL_ENV.SIR into a complete SIR program with an option and a "%",
converting "? LABEL" into "0 LABEL" or "1 LABEL" as appropriate.

Regardless of which store allocation method is used, the stack
can run on, not just through the remainder of the 8K module in
which it starts, but also though any further 8K store modules.
Specifically to cover the case where everything (ignoring the
stack) fits into 8K but more than 8K is actually available, the
program is run on level 4, so as to turn initial instructions off.
Interrupts should be disabled: no handlers are provided.


The Global Vector.
------------------

As my back-end reads the O-code, it also records the highest
global vector number encountered, either by declaration or by use.
This does not include global vector numbers in "BCPL_ENV.SIR",
but, as in the York implementation, these are all below 60,
so the highest number is set to 59 before reading the O-code.
After reading the O-code the back-end generates a suitable
skip (for example "GV >60") in the SIR output.

In the York implementation, at the end of processing an O-code file,
any global vector items declared in that file are set appropriately,
and entries which are not declared, but are used (or are the START entry)
are set to -1, which I presume gives a run-time hardware address trap.
Global vector items neither declared nor used are not altered. I think
this is to permit the linking of multiple files sharing the global
vector, but it will only work if files are linked in a certain order.
Since the global vector entries are buffered to be output at the end of
the O-code, it would be possible to report duplicate declarations within
one file. This was not done, although there is a comment suggesting it.

My implementation assumes that the program is in a single O-code file:
I do not support linking at the SIR level. The global vector is set to
all zeros by the 8K clear-store on the start of the tape. Location 0
(being the top-level SCR) is not a valid program address. The run-time
system detects procedure calls, and GOTOs & SWITCHONs, to location 0,
and enters a dynamic stop (causing the simulator to quit). Global vector
entries are output, as SIR patches into the global vector skip, when
they appear in the O-code, unbuffered, so I do not report duplicates.

Regardless of the amount of store actually available, the clear-store
on the start of the tape only clears the first 8K, which is adequate for
clearing both the global vector and the first word after the literals.


Presetting.
-----------

"Presetting" is a term which has always worried me.
To my mind there are variables, which are undefined when a
program is loaded, and should be set by the program before
they are used, and there are constants, often built into
the program image, which should not be altered by the program.

The difference is important, because it is not unusual to load
a program into an Elliott 900 from paper tape, and run it, and
then expect to be able to run it again without reloading it.
This will not work if the program relies on an initial value
which is set by loading the tape but is altered by the program.

I've seen code in the Harrier flight program, which runs several
times per second, where a variable was set to its initial value
immediately AFTER being used in one cycle, ready for the next cycle,
relying for the first cycle on the variable being set when the
program was loaded. But the program was only reloaded into core
memory (from ground equipment) when it was updated, not before
each flight. The Harrier could be powered down at any point
within the program cycle, which left that variable undefined
when it was used in the first cycle of the next flight.

In my own SIR programs, I am careful to use a skip ">1" for a
variable, using "+0" only to denote a constant. Sadly much of
MECSL's standard SIR library uses "+0" for both. Modern BCPL does
provide "?" for "when a variable does not need an initial value".
(This issue is not to be confused with the coding style which
mandates that all variables should be given some value at the point
of declaration. Whilst this does make tests repeatable, it can
hide errors, and can lead to a less clear programming style.)

In BCPL, dynamic variables are given their declared value (unless
flagged "?") necessarily dynamically, and "there is no way of
initialising the contents of a vector when it is declared", so
these are OK. Static variables are initialised to their declared
values but it is not clear whether this occurs at load-time or at
the start of run-time. Where global variables are associated with
procedures, this "takes place at load time", at least in modern BCPL,
and this is presumably also the case for a "global:value" statement,
given that "global:=value" is a run-time assignment. It is not
clear when global vector entries which are unused and which have
no association are supposed to be initialised, if anywhere.

In my implementation, as explained above, all globals are zeroed
by an 8K clear-store at the start of the tape, to enable calls to
unset globals to be trapped. Then the static & global variables
are all set to the values given, at load-time. As I've suggested,
this is not ideal, but is OK if the program is only being run in
simulator and is always loaded before each run. (I did it this way to
assist fitting the front-end code into 8K, but it would be possible
to set the initial values dynamically, and since all static & global
variables are in 1st 8K and not on the stack, the code to do this could
be above 8K and use the B-register to modify back to the first 8K.)
Meanwhile, the correct programming style for re-runability is that
the programmer should give static variables and global variables
a value by assignment, not by declaration, before they are read.

There is, however, a problem with globals associated with procedures,
in that BCPL states that these are set at load-time, yet it is also
clear that the language permits their alteration. This is a BCPL
problem, not a limitation of my present implementation. For example,
it may be required to temporarily point the standard ERROR routine at
an exception handler, then restore its initial value which the user
cannot know in advance, leading to the risk that the program will not
restart if aborted (as for the Harrier). The work around, given that
my implementation does clear the globals, is to reserve one global as
a first time flag, FTF, and to write at the start of the program:
   TEST FTF=0 {SAVERROR:=ERROR; FTF:=-1} OR ERROR:=SAVERROR;

(In the York front-end and consequently in mine, the REPORT:151
pointer is saved/altered/restored by ISCONST, this having been set
to TRANSREPORT by COMPILEAE in NTRN, it is also set to CAEREPORT
by FORMTREE in TSYN, both of these settings also save & restore
what may be an undefined and unused initial value).

Finally, Strings are preset at load-time, and BCPL allows the
programmer to alter them, but if this is done then the program
cannot be rerun, (and if I did set initial values dynamically
using the mechanism suggested above, it would not apply to Strings,
due to their size and because they might not be in the first 8K).
My ENIGMA work around is OK: I don't use the initial string contents.


Interrupt Safety.
-----------------

Various 900-series facts cards give the
instructions for servicing an interrupt as:

LOOP      0     LLQ       Reset lower-level Qreg.
          14    1         Shift left 1 place.
          4     LLA       Reset lower-level Areg.
          15    7168      Terminate.
ENTRY     5     LLA       Store lower-level Areg.
          3     LLQ       Store lower-level Qreg.
          Required higher-level program.
          8     LOOP      Jump to reset for lower-level.

This code does not preserve the least-significant bit of the
Q-register. Function 3 right-shifts the Q-register as it
stores it, and the "14 1" left-shifts it when it is reloaded.
Potentially, the least-significant bit of the Q-register could
be changed from a 1 to a 0, anywhere in an interruptible program.
Specifically this could happen if another interrupt were to
occur in the above code between the "0 LLQ" and the "14 1".
To prevent this, the 900 hardware defers any interrupt which
occurs during a Function 0 instruction.

This is only a partial solution. Both Functions 0 and 2 have
the defined effect of loading the operand into the Q-register,
Function 0 also alters the B-register but not the A-register,
Function 2 also alters the A-register but not the B-register,
so both are sometimes useful, and Function 2 is usually faster.
It is OK to CLEAR the Q-register with either a "0 +0" or "2 +0".
But it is only safe to USE this bit if it is set by Function 0,
not by Function 2 or by a right-shift, and that bit must be
consumed immediately by the following instruction.

In Airborne Computing Division, we chose to use interrupt code
which saved the whole Q-register, despite it being slower and
despite the fact that we were working on real-time programs:

LOOP      2     LLQ       Reset lower level Qreg.
          4     LLA       Reset lower level Areg.
          15    7168      Terminate.
ENTRY     5     LLA       Store lower level Areg.
          14    18        Shift left 18 places.
          5     LLQ       Store lower level Qreg.
          Required higher-level program.
          8     LOOP      Jump to reset for lower level.

This gave us freedom about how to load the Q-register, and
removed the need to use it immediately: we could for example
use A & Q to assemble all 6 characters of a SIR identifier.

In my BCPL implementation, the generated code runs on level 4, but
no interrupt code is provided because no interrupts are expected.
Nevertheless, where my Ada code of the back-end generates
SIR code to clear Q, I've commented that this would be
"interrupt safe" even using the facts card interrupt code.

The only place where it is currently not interrupt safe is in
shift instructions, where, to meet the BCPL requirement that
shifts do not regenerate the sign-bit, I load the value to be
shifted, "LHS", into the Q-register rather than into the A-register.
I use "2 LHS" (possibly B-modified) to (possibly corrupt then)
load Q, first, because it alters A, followed by a "4 +0" to
clear A without corrupting Q, then the planted shift.

To be interrupt safe, the "4 +0" would have to come first, to
meet the immediate use rule, then "0 LHS" (possibly B-modified)
to (possibly corrupt then) load Q without altering A, then the
planted shift. But as explained below, a "0 P" would also usually
be needed to reset the B-register to the stack frame pointer.


Stack Addressing.
-----------------

Implementing a stack in native code does NOT require
incrementing or decrementing a stack pointer whenever an
operand is pushed or popped, as one might in an interpreter.
(On the 900-series, such a stack might be inverted, to
make offsets relative to the stack pointer non-negative).

In BCPL (by design) operands either have static addresses or they
can be addressed as non-negative offsets from the frame base, and
there is no need to track the top-of-stack address at run-time.
This exactly matches the capability of the 900-series, using
unmodified instructions or B-modified instructions respectively,
with BCPL's frame pointer "P" held in the B-register.
(Accessing non-local non-global data, as required
in Ada for enclosing scopes, would be messier).

The main cost is that a good many instructions have to be B-modified
which otherwise would not be, slowing these instructions down
(typically by 25% on a 920B or 903), but importantly occupying
no more space. Also B-modification corrupts the Q-register.

There is some cost associated with the loss of freedom as to
whether to clear the Q-register using Function 0 or Function 2.
Where a "0 +0" might have been used before a left-shift to clear Q
(without altering A), a Function 6 Mask instruction might have to
be used after the shift, potentially costing an extra literal.

There is a cost when the B-register is required for another purpose,
such as to index into a vector. The usual code is used to load the
index into B (and may itself be B-modified), but an extra instruction
is generally required to reload the frame pointer into B after the
indexed operation. (But see "B-register Tracking" below).


Stack Frames.
-------------

My implementation retains the stack frame layout of the York
implementation, in that the frame starts with a three word header.

Word 0 holds the link, the return address to the point of call.
Word 1 holds the frame pointer "old P" of the enclosing frame.
Word 2 in the York implementation holds a pointer to the textual
   name of the current procedure, which I have not implemented.
Words 3 onwards hold the parameters to the current procedure,
   if any, followed by its local variables, if any.

The York implementation only stores the procedure names as a
diagnostic aid, and I've omitted them purely to save space.
However, I do output the procedure name as a SIR title which
will also appear in the SIR label list, enabling the names
to be recovered from a stack dump. Remember that the bodies
of BCPL procedures can be nested, so some part of an outer
procedure might follow the title of an inner procedure,
(for example in TSYN: most of DECLSYSWORDS follows title D).

Now that I can edit the BCPL front-end, I could reduce the frame
head to two words, although the York sources do not have a manifest
"savespacesize" which some BCPL implementations have to make this
easy. Modern BCPL retains word 2 for "the function entry address".


Entry & Exit Code.
------------------

Some system code is required to adjust the frame pointer when
entering and leaving procedures (functions & routines). This is
only a one-off shared overhead in space, but does cost time, 14
instructions for entry, 9 for routine exit & 10 for function exit.

In the generated code, the normal SIR subroutine call
          11    LINK
          8     ENTRY     (Subroutine address)
is replaced by
          11    LINK      (Shared link)
          8     APPLY     (System routine)
          +N
with the address of the procedure in the A-register,
which may have been calculated dynamically in BCPL,
although more often it is read from a static word of the form
          0     ENTRY     (Subroutine address)
and where the parameter +N gives the amount by which
the frame pointer has to be increased.

In the generated code, the normal SIR subroutine exit
          0     LINK
          /8    1
for routine returns is reduced to
          8     RTRN      (System routine)
for function returns is reduced to
          8     FNRN      (System routine)
with the result in the A-register.

The system code for entering procedures & for leaving procedures
and for implementing GOTOs & SWITCHONs are all written so that:
*  they can be called from the generated code regardless
   of whether or not the B-register holds the frame pointer,
*  when they return control to the generated code,
   the B-register DOES hold the frame pointer,
   (the returns being achieved by planted jump instructions,
   which works given that all code is in the same 8K,
   rather than by the usual B-modified jumps).

This not only avoids the need to reset the B-register with a "0 P" at
the start of every procedure and following every call, it provides an
opportunity for further optimisation, (see "B-register Tracking" below).
And in BCPL the target of a GOTO must be in the current procedure body
so there is no need for a "0 P" after every label.

Although the system calls follow the rules given above, library
routines (whether written in SIR or in BCPL) can assume that the
B-register is set to P on entry, (as I do in LEVEL & LONGJUMP
and in PACKSTRING & UNPACKSTRING, all in BCPL_ENV.SIR).


Jump Optimisation.
------------------

As noted in the previous section, procedures are called via
the "APPLY" system routine, with the procedure address in the
A-register, which may have been calculated dynamically in BCPL.

So, for example, a call to the READC=RDCH routine,
without this optimisation, would take four words of store:
          4     GV+27
          11    LINK
          8     APPLY
          +N
The Jump Optimisation changes this into three words per call:
          11    LINK
          8     J4
          +N    (GV+27)
and a further two words shared by all calls to READC=RDCH:
J4        4     GV+27
          8     APPLY
If a procedure is called both as a routine and as a function
then they share the same two-word entry in the jump table.
The name of the location holding the address of the procedure
is retained (within the comment) for clarity.

This saves store for any procedure called three or more times,
and breaks even for a procedure only called twice.
It does cost one word for a procedure only called once,
but such procedures would be better in-lined if store is tight.
(It is not possible for the back-end to tell how often a procedure
is called when code is being generated for its first call).

The optimisation is only performed if the address is known
to be either in a labelled location or in the global vector.
(The transformation would work for addresses on the stack,
but the varying offsets would be unlikely to give a saving).

A similar optimisation is performed for GOTO commands.
One might not expect to see many GOTOs in a modern program, but for
example, the BCPL front-end procedure NEXTSYMB uses RETURN at the
end of a token which has performed a look-ahead, and a GOTO L at the
end of other tokens, where L reads a character before RETURNing.
Without this optimisation, the following would occur 25 times:
          4     L80
          8     GOTO
The Jump Optimisation changes this into one word per GOTO:
          8     J23
and a further two words shared by all GOTOs to that label:
J23       4     L80
          8     GOTO
(It's a pity that the optimisation can't spot that
all 25 GOTOs in this example are preceded by "5 GV+115").

Luckily these 25 GOTOs do all jump to the same label, rather than
to 25 different labels on the same location, in contrast to what
can happen with SWITCHONs, (see the "Switches" section below).

I've not (yet) attempted to convert the above GOTOs into direct
jumps. However "8 L80" would be wrong: L80 holds the address of
the location to be jumped to, namely it is "L80 0 L79".
The O-code for the address (DATALAB L4, ITEML L3) and for the
GOTO (LL L4, GOTO) are separate, and to simply assume that
the address and destination labels differ by one (without
understanding the front-end code) would probably be unwise.
And it might be necessary to insert a "0 P" before some
direct jumps (see "B-Register Tracking" below).

The jump table is copied into BCPLCODE.TMP which is later merged
into the *.SIR file, rather than being copied into the *.SIR file
at the appropriate point, so that its store usage is counted.

The simulated stack and the jump table are both implemented using
arrays within the back-end whose size could limit the complexity
of the BCPL program. Whereas the simulated stack only has to cater
for at most one BCPL procedure at a time, the jump table has to
cover the whole program.


Function Returns, and Duplication.
----------------------------------

At present I make no attempt to pass parameters in registers,
given that the A-register already holds the procedure address.
At a pinch, I could pass a parameter or the entry address in the
B-register. I do, however, pass the switch value in A and the
address of the switch table in B into the SWITCHON system code.

At present I return the result of a function in the A-register,
which is correctly written to the stack if not needed immediately,
as in the York implementation (which always returns values in R0).
An alternative would be to return the result on the stack, which
might require it to be read from the stack if needed immediately.
In fact, if FNRT were changed to return the result on the stack
(by uncommenting one instruction) it would nevertheless also be
in the A-register, so neither a stack write nor a stack read
would be needed. But the simulated stack has no mode to express
that an item is in two places. (The result of a function also
happens to be in the LINK location, which I've exploited
in the optimisation of the LEVEL procedures by OPTFRONT).

The opposite problem occurs when I need to access an operand twice.
For example, in EQV & NEQV and in REM, I access both operands twice.
The code for these operations starts with a FREE_A, so that if any
operand is in the A-register, it is written to store (on the stack or
otherwise) to make it addressable. The first operand is read into the
A-register with an explicit Function 4, regardless of whether it has
just been written, not by using LOAD_A which would change its mode.
This will sometimes generate a "5 X, 4 X" sequence,
which the peephole optimiser will reduce to just "5 X".
(See "A-Register Tracking & Peephole Optimisation" below).

I remain slightly confused about the O-code used to load the result
of a function. The only way out of a function is via O-code FNRN,
which loads the result into A then calls the FNRN system code. But
there is also O-code RES: why does this also load the result into A,
before jumping to a label which can only lead to an O-code FNRN?
This occurs in the York implementation and others, so I've kept it,
calling LOAD_A a second time with the same operand has no effect.


B-Register Tracking.
--------------------

As described above, the SIR which I generate assumes that
the B-register normally contains P, the stack frame pointer.
It is set at the start of the program by a "0 P" in BCPL_ENV.SIR.

The system code for entering procedures & for leaving procedures
and for implementing GOTOs & SWITCHONs are all written so that,
when they return control to the generated code, the B-register
again does hold the frame pointer, which avoids the need to reset
the B-register with a "0 P" at the start of every procedure and
following every call.

The system code for entering procedures & for leaving procedures
and for implementing GOTOs & SWITCHONs are also all written so
that they can be called from the generated code regardless of
whether or not the B-register holds the frame pointer.

It follows that, if the B-register is required for another purpose,
such as to index into a vector, it is not always necessary to reload
the frame pointer into B with a "0 P" after the indexed operation,
if no stack-related operations occur before the next system call,
or if the B-register is again required for another purpose.

The B-register is tracked by a back-end flag, BREG, initially set
OK (=TRUE). If an instruction is coded in LOCAL mode and BREG
is not OK, it is set OK and a "0 P" is generated, in SIR_WORD,
before the requested B-modified instruction accesses the stack.

If the B-register is required for another purpose, such as to index
into a vector (RV & STIND), the usual code is used to load the index
into B. At this point BREG could be either value. BREG is then set OK
before generating the indexed instruction, to avoid an incorrect "0 P"
(if BREG was not OK), and BREG is set not OK immediately after it.

If a system call is encountered, BREG is set OK. They are recognised
by a Function 8 unconditional jump to a textual system routine name,
or by a call via GET_JUMP_NUMBER to APPLY (FNAP or RTAP) or to GOTO.
This paragraph is applied before the following paragraph.

The tracking is performed on the code as it is translated, not when
it is run, so a global decision has to be made about the state of
the B-register when jumps occur, namely that it should hold P.
So if a jump is encountered, conditional or unconditional, and BREG
is not OK, it is set OK and a "0 P" is generated, before the jump.
(The code for EQV and NEQV=XOR, when not followed by JT or JF,
includes "7 ;+2, 4 -1", which could generate an unnecessary "0 P").

An interesting situation occurs at labels. If arriving at the
label from the instruction above, BREG will be set appropriately.
If arriving at the label from a jump, the B-register will hold P.
So there is no need to change BREG or to generate a "0 P", yet.
But if a stack operation subsequently does need a "0 P", it would
have been better to generate it before the label. Rather than
generate "0 P" in anticipation before any label where BREG is not OK
(potentially wasting store), I've accepted that sometimes an extra
"0 P" might be obeyed following a jump (potentially wasting time).

(The system code for GOTO does not need to use the B-register,
so it could simply leave the B-register unchanged. However,
it seemed more useful to make it follow the same rules as the
other system calls, in the ways described above. Including the
"0 P" cost no store because GOTOs & SWITCHONs share exit code,
and SWITCHONs do use the B-register and reset it with "0 P" ).

Given that calls to the back-end procedure SIR_WORD can generate
a "0 P" as well as the requested SIR word, using instructions with
SCR-relative address mode SCRREL needs care. Specifically there
was a problem with "5 ;+3" used in Shifts, which might need to be
"5 ;+4". Rather than labelling the planted instruction, which would
make the code less clear, I've added code to flush out the "0 P"
sooner, if it is required (by copying a line from SIR_WORD, exactly,
even though calling SIR rather than SIR_WORD would be OK here,
but adding a "flush 0 P only" mode to SIR_WORD might be neater).


A-Register Tracking & Peephole Optimisation.
--------------------------------------------

Operations where the only operand or both operands are constants
are performed ("folded") by the front-end. Since the front-end
is a 900-series BCPL program compiled by itself and my back-end,
these calculations will give the same results as if performed at
run-time (although more quickly), including division (regardless
of the signs of the operands).

But the front-end preserves the binding order of operations.
So in the statement "LET C = 'a'<=CH<='z' -> CH-'a'+'A', CH",
the generated code includes "4 GV+76, 1 -97, 1 +65, /5 5",
even though ""4 GV+76, 1 -32, /5 5" would be better, and would
be generated by "LET C = 'a'<=CH<='z' -> CH+('A'-'a'), CH".
(It is generally thought legitimate for compilers to implement
optimisations like this for integers but not for real numbers).

The Elliott 900-series do not have lots of registers to track.
The A-register is tracked by a back-end variable, AREG,
initially set FREE (= an off-stack pointer), which
records which simulated stack entry is currently in A.
This is used to avoid reading the first or only operand of
an operator, or either operand of a commutative operator,
which has been left in the A-register by the previous operation.

The simulated stack mechanism rearranges, for example,
"P := A * (C+D)", from "4 A, /5 T, 4 C, 1 D, /12 T, 14 17".
to "4 C, 1 D, 12 A, 14 17", which reduces the code size,
although the stack frame might still contain a slot for T.

Sometimes when one operand of a binary operator is a constant, the result
can be determined from the other operand without using the A-register.
For example, adding +0 or multiplying by +1 return the other operand,
and multiplying by +0 or LOGAND by +0 return zero. Such operations can
arise when the constant is MANIFEST, where the special values is not
immediately obvious (and is best not exploited) by the programmer.

This is achieved by copying entries within the simulated stack
(using LOSE1 with parameters, rather than the default parameters).
Thus "P := (A+B) * (C+D)" typically generates (when A B C D
are static variables) "4 A, 1 B, /5 T, 4 C, 1 D, /12 T, 14 17".
But if B happens to be manifestly zero, not only is the "1 B"
omitted, the sequence again becomes "4 C, 1 D, 12 A, 14 17".

The simulated stack works well within statements, but not between
statements, which is where the "peephole" optimiser comes in.
This is built into the routine which outputs the SIR code.

It records the most recent Read instruction (Function 4) and
it records the most recent Write instruction (Function 5).
Both are set "Unknown" initially, or by a label,
or by any instruction other than a Read or Write or Jump,
and the Write record is set "Unknown" by a Read instruction.

A Read from, or Write to, either of the recorded addresses can
then be suppressed. For example, if one statement ends with "5 X"
and the next statement starts with "4 X", the "4 X" is omitted.
Initialisations such as "4 +0, 5 X, 4 +0, 5 Y, 4 +0, 5 Z" become
"4 +0, 5 X, 5 Y, 5 Z". (The statements "ARG2:=ARG1; ARG1:=ARG1+3",
which appear in the PDP11 back-end, would also get optimised).

Logically, the most recent two or more locations written to
could be recorded, but this might not save much more code.
A Write to the B-register does have set the Read record "Unknown",
so that "/4 0, 5 BREG, /4 0" does not get optimised.
Conditional jumps do not need to set either record "Unknown",
whilst the next label following an unconditional jump
(and possible dead code) will set both records "Unknown".

Also built into the routine which outputs the SIR code, any
remaining no-ops "1 +0" and "14 0" are eliminated. Most will
have been removed by better optimisations described above, but
for example the back-end generates "1 +0" in LOADLV if the offset
is zero and it generates "14 0" in MULT if either operand is -1.

Finally built into the routine which outputs the SIR code,
is a dead-code eliminator. Any code following an unconditional
jump (Function 8) unless that jump follows "7 ;+2 or "9 ;+2" or
"11 LINK", up to the next label, is unreachable and so is deleted.
The BCPL front-end does generate a small amount of such code.

I have not attempted to merge adjacent Add literal instructions
(such as the "1 -97, 1 +65" above), or Negate-&-Add literal
instructions (such as pairs of "2 +0"), or combinations thereof.
I have not attempted to optimise multiple tests of the
form "4 CHAR, 1 -43, 7 PLUS, 4 CHAR, 1 -45, 7 MINUS"
into "4 CHAR, 1 -43, 7 PLUS, 1 -2, 7 MINUS".


Arithmetic Operators.
---------------------

MINUS.

On the 900-series, Add (using Function 1) is slightly faster than
Subtract (using Function 2), so subtraction with a constant second
operand is converted into addition of the negated constant, which can
benefit from commutativity and from other optimisations as for PLUS.

I've not (yet) implemented an obvious optimisation here. Where
the "wrong" LHS operand is already in the A-register, I generate
"/5 T, 4 RHS, /2 T" by default, but I could generate "2 RHS, 2 +0".
However, possibly because of the many other optimisations performed,
this change would only reduce the SIR of the front-end by one word.

PLUS.

I do optimise the cases where the either operand is
known statically to be zero, by returning the other operand.
Addition being commutative simplifies testing for these cases,
and for recognising when either operand is already in the A-register.

Code of the form X:=X+N or X:=N+X, for 0<=N<=2, is recognised and
generates the appropriate number of Count instructions (Function 10,
B-modified or otherwise). This happens, wherever X may be, and without
disturbing the A-register. (See the loops in "Relational Operators" below).

There could be places where it is worth including N=3.
Three "10 X"s are slightly slower than "4 X, 1 +3, 5 X"
but save a literal and do avoid losing the A-register.
However, in many cases, the addressed variable might be used
in the previous or subsequent statement, so I've not done this.

PLUS followed by RV or STIND is recognised, and optimised when
possible to use non-zero B-register offsets, so that, for example,
the code to read from "A!3" becomes just "0 A, /4 3",
rather than "4 A, 1 +3, 5 BREG, /4 0".

MULT.

I do optimise the cases where the either operand is known statically to
be zero, by returning zero, or to be one, by returning the other operand.
Multiplication being commutative simplifies testing for these cases,
and for recognising when either operand is already in the A-register.

Multiplication by powers of two can be implemented by placing the
other operand in the A-register, and obeying a left shift instruction,
followed by a mask instruction with a constant which can be calculated
statically by the back-end, to remove garbage shifted in from the Q-register.
Multiplication by powers of two and left-shift do share a subroutine,
which optimises three cases where the mask literal can be avoided,
see the "Shifts" section below.

Multiplying by negated powers of two is surprisingly simpler. The
other operand is placed in the A-register and negated, which also
clears the Q-register, and is interrupt safe, then the left shift
instruction is obeyed. No mask instruction or literal is required.
The peephole optimiser will remove the shift for multiplication
by -1. (See "A-Register Tracking & Peephole Optimisation" above).

When multiplying by other constants, the constant is normalised,
so that the subsequent shift can be made shorter, saving time.
For example "*10" is not coded as "4 N, 12 +10, 14 17", rather
it is coded as "4 N, 12 +81920, 14 4". Multiplying two variables,
however, still does require a 17-place shift. (Function 3
could only be used when the product is known to be positive).

Some time (but not store) could be saved by optimising multiplication
by three as either "4 X, 1 X, 1 X" or "5 X, 1 X, 1 X" as appropriate,
but I've not (yet) done this.

DIV & REM.

When the dividend (LHS) is known statically to be zero, both DIV and
REM-or-MOD return zero. When the divisor (RHS) is known statically to
be +1, DIV returns the dividend, and REM-or-MOD returns zero. I've not
treated division by zero as a special case, because it isn't, overflow
is being ignored everywhere, so whatever the hardware returns will do.

Divisions by powers of two are implemented by placing the
   LHS in the A-register, and obeying a right shift instruction.
   Right shifts in BCPL require the deletion of the sign bits,
   so right-shift and division do not share a subroutine.
REM-or-MOD with a power-of-two RHS are implemented by placing the
   LHS in the A-register, and obeying a "6 RHS-1" mask instruction.
Even for negative LHS, these optimisations do not alter the
   values returned, and A = (A/B)*B + (A REM B) is preserved.

I have not (yet) provided optimisations for DIV and REM-or-MOD
   by negated powers of two, including a RHS of -1.
REM-or-MOD could be implemented as for powers of two above,
   observing that A REM (-B) = A REM B, as the Ada LRM says,
   giving a correct REM corresponding to a round-to-zero DIV.
Division could be implemented as for powers of two above,
   preceded (rather than followed) by "2 +0", giving the
   same (often-off-by-one) result as the unoptimised code.
However, A = (A/B)*B + (A REM B) would not be preserved.

Otherwise, division uses the "double/divide/halve" method, as
used in CAP's 900 CORAL and described in "Other Obstacles" above,
"4 LHS, 14 8176, 13 RHS, 14 8191". The right shift is 16 places
rather than the usual pre-divide 17, and so implements the doubling.
Because of the doubling, this method can overflow when dividing
by +1 or -1, despite the correct result usually being in range.
The "13 RHS" must not be modified, because this would corrupt the
Q-register before the division, so if the RHS is on the stack or
in the A-register it must be moved to a static location beforehand.

This code does not clear the least-significant TWO bits of the
Q-register, and I have a vague memory (from the 1970s) that
CAP said that this was OK. Certainly it is well published that
division ignores the least-significant ONE bit of the Q-register,
and I have recently established that it does ignore the next bit,
which is why division only returns a sign and 16 data bits.
(This is useful, because clearing these bits by changing
"14 8176" to "14 8174, 14 2" would not be interrupt safe).

The 16-place shift makes this code quite slow. Where the
LHS is known statically, it would be possible to generate code
which places the LHS directly into A & Q, already shifted by the
back-end. This saves no space, so I've not (yet) implemented it.

I spent some time investigating in what circumstances dividing by
a constant can be replaced by multiplication by a constant (and a
shift). Certainly for fixed-point real-number calculations, it
would be usual to do this, but integer arithmetic is more exacting.
The best multiplying constant is found by dividing 131072.0 by the
divisor, normalising it (retaining the fractional part), then rounding
up, and optionally denormalising it whilst even (to reduce the shift).

Division by powers of two needs just a shift (as above).
Division by +3 can be implemented as "12 +43691" and
division by +43691 can be implemented as "12 +3" for
   all positive dividends, correctly (with no shift).
Division by any other positive number (such as +10) can be
   implemented by a multiply and a shift ("12 +52429, 14 8190").
   For 69% of divisors the result is right for all positive dividends,
   for 31% it fails for positive dividends somewhere above 65535.
This certainly meets the "better than PDP-11" test but it might not
   meet the "at least as good as CAP CORAL" test for some values.

Where it is known that the dividend has a reduced range (which is not
   generally the case in BCPL) the shift can be reduced or eliminated
   (for dividing by +10, "12 +13108" works for dividends up to 16389).
Division by a negative number could at worst be implemented with a
   negative multiplier, but I've not investigated how to round this.
The results may be worse for negative dividends and for negative
   divisors, but BCPL has no specific requirements for them.
I did not implement any of these (except powers of two).

Except in the special cases covered earlier, and as explained
in "Other Obstacles" above, REM-or-MOD has to be implemented
by "Divide/Multiply/Subtract". But when the RHS is a constant,
it is NOT normalised for the multiplication, because this
would make the divide and multiply literals different.

If I were to implement dividing by a constant as multiplication by
a different constant, then it would be appropriate to normalise the
RHS constant used for multiplying back (since they differ anyway),
or it might be possible to recover the remainder from the Q-register,
without subtraction (as I do when unpacking text in my 900 BASIC).


Shifts.
-------

On many computers, there are separate opcodes for left shifts and for
right shifts, so a given instruction can only shift in one direction.
The Elliott 900-series has a single shift instruction, so for example
a "/14 0" instruction will shift the combined A&Q registers either
left or right (arithmetically) according to the value held in the
B-register at run-time, positive for left and negative for right.

For shifts in either direction, BCPL requires that the shift distance
"must evaluate to yield a non-negative integer" and "vacated positions
are filled with zeros". Modern BCPL changes the first requirement
to "negative shifts or ones of more than the word length return 0".
Given the ability of the 900-series to implement negative shift distances
as shifts in the opposite direction, it seems sense to do exactly this
and to ignore the new requirement. (In fact to do otherwise could be
slightly messy for right shifts of distance zero).

Sadly "/14 0", being modified, corrupts the Q-register before it gets
shifted, so for dynamic shift distances, RHS, it is usual to plant a
shift instruction. The 900-series right shift regenerates the sign of
the A-register, but the BCPL requirement to shift in zeros can be neatly
met by loading the value to be shifted, LHS, into Q rather than into A,
and shifting 18 places further left than requested. Thus:

   Left shift, RHS in A.        Right shift, RHS in A.
      1 =15 18                     2 =15 18
      6 =14 8191                   6 =14 8191
      5 ;+3                        5 ;+3
      2 LHS                        2 LHS
      4 +0                         4 +0
      >1                           >1

(This code is prefixed by "0 P" if BREG is false and the LHS is on
the stack, otherwise the "5 ;+3" might have to become "5 ;+4, 0 P").

Because the A-register is used to calculate the planted instruction,
the LHS will not be in the A-register, so loading it into Q costs
no more than loading it into A. (This is not the case when the RHS is
known statically, to be covered below). As noted in "Interrupt Safety"
above, this code is not interrupt safe. To make it safe, change
"2 LHS, 4 +0" to "4 +0, 0 LHS", and add "0 P" at the end if needed.

(If we were prepared to assume that the shift distance, whether
positive or negative, was never more than 18 places either way,
both "=15 18" above could be changed to "=14 18",
giving left shift instructions in the range 14 0 to 14 36,
and then the mask instructions could be removed).

The potentially long shifts make this code quite slow. Where the
LHS is known statically, it would be possible to generate code
which only displaces the LHS by 1 place rather than by 18 places
to the right and then generate what will often be shorter run-time
shifts, (where LHS<<17 will be either +0 or &400000):

   Left shift, RHS in A.        Right shift, RHS in A.
      1 =15 1                      2 =15 1
      6 =14 8191                   6 =14 8191
      5 ;+3                        5 ;+3
      2 LHS<<17                    2 LHS<<17
      4 LHS>>1                     4 LHS>>1
      >1                           >1

Or, when the LHS is known statically and is not negative:

   Left shift, RHS in A.        Right shift, RHS in A.
      1 =15 0                      2 =15 0
      6 =14 8191                   6 =14 8191
      5 ;+3                        5 ;+3
      2 +0                         2 +0
      4 LHS                        4 LHS
      >1                           >1

Both of these pairs avoid setting the least significant bit of Q to 1,
so they are interrupt safe. But neither of them saves space, so
I've not (yet) implemented them.

(If we forgo negative shift distances: the "=15" in both versions
of left-shift code above could be changed to "=14" and then the mask
instructions could be removed; whilst in the lower right-shift code,
the "5 ;+3, 2 +0" could be changed to just "5 ;+2", or the whole
sequence changed to "2 +0, 5 BREG, 4 LHS, /14 0" and possibly "0 P").

I do optimise the cases where the value to be shifted (the LHS)
is known statically to be zero, by returning zero.

When the distance to be shifted (the RHS) is known statically,
there is no need for planted instructions. Where it is known
statically to be zero, no operation is required (the LHS is returned),
where it is known to be over 17 places either way, zero is returned.

Right shifts of other static distances can be implemented by placing
the LHS in the A-register, and obeying the shift instruction "14 8192-RHS",
followed by a mask instruction with a constant which can be calculated
statically by the back-end, to remove the regenerated sign bits.
Division by powers of two requires the retention of the sign bits,
so right-shift and division by powers of two do not share a subroutine.

Left shifts of other static distances can be implemented by placing
the LHS in the A-register, and obeying the shift instruction "14 RHS",
followed by a mask instruction with a constant which can be calculated
statically by the back-end, to remove garbage shifted in from the Q-register.
Multiplication by powers of two and left-shift do share a subroutine,
which optimises three cases where the mask literal can be avoided:
*  If shifting one place left, the LHS is added to itself,
   using "4 LHS, 1 LHS" or "5 LHS, 1 LHS" as appropriate,
   which avoids shifting garbage in from the Q-register.
*  If the LHS is not already in the A-register,
   but it can be read without B-modification,
   Q can be cleared first, using "2 +0, 4 LHS, 14 RHS".
*  Otherwise, having loaded the LHS into the A-register,
   if BREG is not OK, Q can be cleared, using "0 +0, 14 RHS".

Formally, for some computers in the Elliott 900-series, the effect
of shifting more than 36 places is undefined. The code described
above will give correct results whenever the distance to be shifted
is known statically, but otherwise, excessive shift distances could
give wrong results or invoke a hardware block transfer instruction.


Logical Operators.
------------------

Logical operators have to be evaluated left-to-right in a
"boolean context", but otherwise require bit-wise evaluation.
Only the latter generate the O-codes: EQV NEQV LOGAND LOGOR.

Cases where an operand is +0 or -1 are optimised to return: +0, -1,
the other operand, or its complement, as appropriate; and the fact that
these operators are all commutative simplifies testing for these cases,
and for recognising when either operand is already in the A-register.

LOGAND is trivially implemented by Function 6, Collate or Mask.

It is tempting to implement LOGOR using DeMorgan's laws:
          4     X
          2     -1
          5     TEMP
          4     Y
          2     -1
          6     TEMP
          2     -1
But this is simpler:
          4     X
          6     Y         (common bits)
          2     X         (bits in X not in Y)
          1     Y
or in a different order if more convenient:
          4     X
          6     Y         (common bits)
          2     Y         (bits in Y not in X)
          1     X
or slightly faster if X and Y are on the stack and
need B-modification whereas the constant does not:
          4     X
          2     -1
          6     Y         (bits in Y not in X)
          1     X
which is what I've used in my implementation,
or any of these with X and Y reversed.

When one operand is a constant C, the last of these becomes:
          4     -1-C
          6     Y
          1     +C
or probably more usefully:
          4     Y         (if Y not already in A-register)
          6     -1-C
          1     +C
for example for C=10, the value -11 has all bits except
those worth 2 and 8, so they are masked out from Y if they
are present, then they are added in with no risk of carry:
          4     Y
          6     -11
          1     +10

To show how effective this non-DeMorgan LOGOR code is:
an example from the QUEENS program: ... is longer modified to avoid LOGOR:
LET poss = all & ~(ld | col | rd)       LET poss = all & ~ld & ~col & ~rd
          /4    4                                 /4    3
          2     -1                                2     -1
          /6    3                                 6     ALL
          /1    4                                 /5    6
          /5    6                                 /4    4
          2     -1                                2     -1
          /6    5                                 /6    6
          /1    6                                 /5    6
          2     -1                                /4    5
          6     ALL                               2     -1
          /5    6                                 /6    6
                                                  /5    6

NEQV or XOR is bit-wise binary addition, without carry, and EQV
is the complement of this. Carry will occur wherever a 1 occurs
in both operands, which can be found by LOGAND, Function 6,
enabling the carry bits to be subtracted out of add-with-carry:
          4     X
          6     Y         (gives bits common to both)
          2     +0        (also clears Q, interrupt safe)
          14    1         (negated carried bits)
          1     X
          1     Y
(An initial read can be avoided if either operand is already in
the A-register, and the two final adds can be in either order).


Relational Operators.
---------------------

This section considers:
   O-codes LS, GR, EQ, NE, LE, GE, not followed by JT or JF,
   O-codes JT & JF, not preceded by LS, GR, EQ, NE, LE, GE,
   O-codes LS, GR, EQ, NE, LE, GE, when followed by JT or JF,
and a slight rearrangement of FOR-loops in the front-end.

O-codes JT & JF, not preceded by LS, GR, EQ, NE, LE, GE:
   JF is required to jump if the top of stack is FALSE = zero,
   JT is required to jump if the top of stack is ANY other value.
O-codes JT & JF, not preceded by LS, GR, EQ, NE, LE, GE:
   are required to stack TRUE specifically as -1,
   if LHS OP RHS is true, and FALSE as zero otherwise.
O-codes LS, GR, EQ, NE, LE, GE, when followed by JT or JF
   are optimised to Jump if LHS OP RHS, without stacking and testing
   the Boolean result. The relational operators are arranged within
   the O_CODE type in the order given here, which has the property
   that reversing the order negates the condition, and this enables
   the procedure CGBRANCH to map JF onto JT, using just a subtraction.

Implementation of EQ & NE in either context is straightforward,
and exploits the commutativity of these operators. If any non-zero
value were permitted for TRUE, the code for NE could be simplified.

Implementation of the other four operators is straightforward
in some cases, for example, "jump if LHS < 0" maps onto "9 LAB".
But when it is less straightforward, there are often several
possibilities, for example, "jump if LHS <= 0" or "jump if LHS < 1"
can be mapped onto "9 LAB, 7 LAB" or "7 LAB, 9 LAB" or "1 -1, 9 LAB".

In the Boolean context, these four operators can be implemented by
arranging to set the A-register negative if & only if the condition
is true, followed by a "14 8175" to set the A-register to -1 or +0.
For example, "LHS <= 0" or "LHS < 1" can be mapped onto "1 -1, 14 8175".

By selecting the "1 -1, 9 LAB" style in the Jumping context, the two
contexts can share common back-end code (in procedure COMPARE) for the
four operators, just differing in the final "9 LAB" versus "14 8175".
This choice eliminates the other possibilities from the Jumping context,
which may have given faster code, but this choice never costs more store.

The four operators use at most two instructions to compare the operands
(sometimes less if either is in the A-register), possibly followed by
an adjustment (which often can be absorbed into a constant operand)
of "2 +0" or "1 -1" or "2 -1", followed by the "9 LAB" or "14 8175".
And where possible, as for MINUS, comparison by subtraction of a
constant operand is converted into addition of the negated constant.

As explained in "Other Obstacles" above, this implementation
ignores any overflow which occurs when comparing the operands of
the four operators by subtraction, and the adjustment instructions
above might slightly alter the points at which the overflows occur.

With this implementation of the relational operators, the O-code for a
typical loop, as generated by the York front-end, translates to the left
column below. Although the test is correctly performed before the first
loop, textually it is at the end of the loop. This is generally faster,
because the unconditional jump is only obeyed once, not once per loop.
But putting the test at the top of the loop reverses the condition, which,
on the 900-series (due to the asymmetry of available conditional jumps)
uses one less word of store, as in the right column below, and both
methods have the same loop overhead, of five instructions executed.

   York front-end:                 Rearranged front-end:
             4     FIRST                     4     FIRST
             /5    8                         /5    8
             4     LAST                      4     LAST
             /5    9                         /5    9
             8     L25             L25       /4    8
   L26       BODY OF LOOP                    /2    9
             /10   8                         9     L26
   L25       /4    9                         BODY OF LOOP
             /2    8                         /10   8
             1     -1                        8     L25
             9     L26             L26

The York front-end generates slightly different O-code when LAST is a
constant, which, with this implementation of the relational operators,
translates the O-code for "FOR A = 0 TO 6" as in the left column below.
In this case, putting the test at the top of the loop again reverses the
condition, as in the right column below, but it gives no store saving,
and the York arrangement has the lower loop execution overhead.

   York front-end:                 Rearranged front-end:
             4     +0                        4     +0
             /5    8                         /5    8
             8     L66             L66       /4    8
   L67       BODY OF LOOP                    2     +6
             /10   8                         9     L67
   L66       /4    8                         BODY OF LOOP
             1     -7                        /10   8
             9     L67                       8     L66
                                   L67

Accordingly, I have modified the front-end to rearrange the O-code
for FOR loops, but only when the LAST value is not a constant.
I do not (yet) know whether any other BCPL loop constructs would
benefit, on the 900-series, from a similar modification.

Compatibility is retained: my back-end will still work with York's
front-end O-code (it will just produce less efficient code), and the
York PDP-11 and other back-ends should work with my new front-end.

When programming in SIR, assuming in the Variable case that
FIRST & LAST can be safely evaluated in the opposite order,
it would be usual to slightly rearrange the above chosen code,
to avoid executing the "/4 8" the first time round the loop:

   Variable LAST:                  Constant LAST (=+6):
             4     LAST                      4     +0
             /5    9                         /5    8
             4     FIRST                     8     L66
             /5    8               L67       BODY OF LOOP
   L25       /2    9                         /10   8
             9     L26                       /4    8
             BODY OF LOOP          L66       1     -7
             /10   8                         9     L67
             /4    8
             8     L25
   L26

This code can also be modified as follows, which is slower, but
takes no more store, and is suitable for increments other than +1:

   Variable LAST:                  Constant LAST (=+6):
             4     LAST                      4     +0
             /5    9                         8     L66
             4     FIRST           L67       BODY OF LOOP
   L25       /5    8                         /4    8
             /2    9                         1     +or-INCREMENT
             9     L26             L66       /5    8
             BODY OF LOOP                    1     -7
             /4    8                         9     L67
             1     +or-INCREMENT
             8     L25
   L26

And in all three cases above where "1 -7" appears,
this could be removed by changing the "4 +0" to "4 -7",
provided that the loop variable is not used within the loop
(or by suitably altering such references).
Unfortunately loops are not visible as such in O-code,
so none of these SIR optimisations can be applied.
(However, the loop could be written "FOR A = -7 TO -1").


Switches.
---------

In the York implementation, the front-end does check for
duplicate case labels, but it does this without sorting them.
However, the York back-end does sort the case labels.
If 3*number_of_cases > 16+highest_case-lowest_case,
then it calls LSWITCH otherwise it calls BSWITCH:
*  BSWITCH is recursive. If it is dealing with more than about
   6 cases, it compiles code for the middle case, and it codes
   two calls of itself for the higher & lower slices of cases,
   otherwise it compiles a "CMP" & "JEQ" for each case,
   and finally a "JBR" to the default, in every slice.
*  LSWITCH checks if the value is within the bounds,
   then jumps indirect via a table containing a list of
   case addresses interspersed with default addresses.
BSWITCH does not call LSWITCH to deal with a compact subrange.

In my 2016 implementation, I simply compiled a linear list
of "1 constant" and "7 label" pairs, with a final "8 default",
(the constants being the differences between unsorted adjacent
case labels), thus costing 2 words & 1 literal per case,
and 2 executed instructions for each case tested, which is
clearly potentially much slower than the York implementation.

In my 2018 implementation, I use a system call which
performs a linear scan of a table containing pairs of:
   value to compare with (unsorted, of either sign),
   address to jump to (always positive),
finishing with the pair:
   default address to jump to,
   -1 (always negative),
thus costing 2 words but no literals per case.

The system call is entered with the switch value in
the A-register and the table address in the B-register.
It is only a one-off shared overhead in space, of 15
instructions, executing 8 instructions per case tested.
I chose this even slower implementation, not for the saving
of the literals, but because the table is not confined to
the first 8K (unlike the code of the 2016 implementation).
One could argue that this is interpretive not native code.

The lexical analyser NEXTSYMB in TSYN switches on the first character
of a token, and 52 of the 128 possible values, namely "A" to "Z" and
"a" to "z", label the same case, which processes an identifier.
And in O-code, BCPL allocates a separate label for each value, so,
even though the York back-end sorts the cases, it is not apparent
to the back-end that the sorted list contains two runs of 26 labels
all pointing at the same case. (In fact York uses LSWITCH here).

Here BCPL switches lack the range facility of Ada's case statements:
"case CH is when 'A'..'Z | 'a'..'z' => statement;". So whilst I could
modify my SWITCHON system code to perform a BSWITCH-style tree search
of a sorted table of values, I'm more interested in developing good
code for ranges (or a mixture of single values & ranges), for Ada.


Strings.
--------

Strings are packed big-endian as two 9-bit characters per word.
9 bits are used rather than 8 to make the values more readable in octal
(although this does make PACKSTRING & UNPACKSTRING marginally slower).
6-bit SIR internal code could be used but then the layout escapes such
as "*N" would have to be processed on output rather than on input and
lower-case letters would be lost. Because strings are not constrained
to the first 8K, the less efficient packing is not so important.

The characters of the string are prefixed by a whole word, S!0, holding
the length of the string in characters (as in the York implementation)
and so strings can be of any length at run-time (although the front-end
has a 255-character limit). If and only if there are an odd number of
characters, there is a final null character, excluded from the length.
(In the York implementation there appear to be either 1 or 2 final nulls).
Note that BCPL usually expects the length to occupy just the first byte.

PACKSTRING & UNPACKSTRING are both written so that the input & output
can be the same, with the string overlaying the start of the vector.

Finally, as noted in "Presetting" above, Strings are preset at
load-time, and BCPL allows the programmer to alter them, but if this
is done, then the program cannot be rerun without reloading it.


Use on a Real 900 System.
-------------------------

The *.BIN files, as produced by my current implementation,
can be run on a real 900-Series computer. Input & output will
default to the paper tape reader and punch, but can be directed
to other addresses (for example to the TeleType) by changing
INPUT=GV+3 (from 2048 to 2052) & OUTPUT=GV+4 (from 6144 to 6148).
Input ignores parity, output generates correct (even) parity.

However it might be worth changing the implementation of
READC=RDCH & WRITEC=WRCH in BCPL_ENV.SIR, to call the ACD standard
CHIP & CHOP subroutines (or their MESCL equivalents QCHIN & QCHOP).
On input, CHIP provides parity checking and line buffering
(for non-stop-on-character readers). On output, CHOP provides
blanks after newlines, and leading & trailing blanks, and
it prefixes each tab (which TeleTypes ignore) with a space.

Given the implementation which I now have, I'm sure that an
implementation could be produced which ran entirely on a 16K 903.
GET remains a problem, see "The GET Command" above for possible
solutions. The front-end currently treats GET as a syntax error,
and perhaps as a minimum it should treat it as a HaltCode.

The front-end is a 900-series program, which currently requires
18K words, including a TREE vector of 6000 words, whether or not
it is all used. Error message strings could be shortened, and
the TREE vector could easily be reduced, to just fit into 16K.
Reducing the TREE vector reduces the maximum size of every segment,
rather than the size of a whole program. I don't know if such a 16K
front-end would be able to compile its own segments as they now are.
As suggested above, it might be worth changing the implementation of
READC=RDCH in BCPL_ENV.SIR in the front-end, to call the ACD standard
CHIP subroutine, to provides parity checking and line buffering.

The back-end is currently a host Ada (not BCPL) program.
But I do have York's PDP11 back-end in BCPL, which I can compile,
and run in the simulator, which, when applied to its own O-code,
produces PDP11 code recognisably like YORK-RSX\10027\TRAN#4.MAC,
(which I have in my magnetic tape HIXTRACT). So rather oddly, my
32K 903 could be used to produce PDP11 code (without needing a PC).
(The manifest constant for the PDP11 opcode XOR had to be
changed to XXX to avoid a clash with the new BCPL keyword,
but the string output is still "xor"; and TRUE being printed
in unsigned octal appears as $777777 rather than $177777).

In "Work done in 2016" above, I noted that "the translation of
TRAN to SIR did not quite fit, but since it generates PDP-11 code,
I didn't actually need it". With my 2018 implementation, the SIR
of this PDP11 back-end does now fit, and finishes with NEXT1=7306,
which includes high data of 1016 bytes, excluding this NEXT1=6290.
My equivalent Ada code (TRAN_PDP.AD6) was 4827+37662 bytes, my
current back-end Ada code (TRAN_BCL.A53) is 4825+48645 bytes.
6290 * 53470 / 42489 = 7916 words. So there is a good chance that,
if implemented in BCPL, the code of my current back-end would
just fit into 8K, with higher data in the 2nd 8K, and without
removing any of the optimisations which it currently performs.
The remainder of the 2nd 8K will be more than adequate to hold the
simulated stack and jump table of any program which otherwise fits.

The current textual O-code interface could be changed to binary,
in which opcodes would be one byte, labels would be two bytes,
constants would be three bytes, and names & strings would use one
byte per character. Some blanks would be needed where reading might
pause, at the end of each complete O-code statement and after each
constant-and-label pair of a switch list (actually corresponding
to the newlines of the textual interface). This would make both
front-end & back-end slightly smaller, and save paper tape.

The SIR generated by the back-end, and the tape of BCPL_ENV.SIR,
could both use tabs rather than multiple spaces, to save paper tape.
In BCPL_ENV.SIR, the comments containing PDP11 code should be removed,
and the code & comments for NUMARG & GETARG could be removed (which
are only there for regression testing), to make the tape shorter.
The two SIR global lists in the generated code and the SIR global
list at the start of BCPL_ENV.SIR could be replaced by a single "[]"
after the option in the generated code, simplifying the back-end.

As my current back-end reads the O-code, it writes to three
temporary files, it then chooses the 8K or 16K store allocation
strategy, and then concatenates the files, changing the higher
store pointers from "? LABEL" to either "0 LABEL" or "1 LABEL".
On a real 900 System, the O-code could be read three times,
generating the CODE & DATA & HIGH file outputs respectively,
to produce a similar concatenated file. The store allocation
strategy would be known by the end of the CODE pass, enabling
the correct pointers to be generated from the DATA to HIGH.
It might still be appropriate for the back-end to measure
BCPL_ENV.SIR to choose its strategy, but to save paper tape
there is no need to copy it into the concatenated file.

Alternatively, the back-end could generate the three streams,
interlaced into a single SIR tape, in a single scan of the O-code,
inserting a SIR patch whenever switching between the three streams.
At the end of the O-code, the back-end could then output a separate
short tape of the form "*8211, CODE >CU, DATA >DU, HIGH >HU"
or "*8211, CODE >CU, DATA >DU, ^8192, HIGH >HU, $", to be
read into 2-PASS SIR before the main interlaced SIR tape.
The patches in the main tape would locate each portion of SIR
into the right place within the appropriate skip of the three.
(Rather than patching to large offsets of the three labels, it
would be more symbolic for each patch to be to a label inserted
at the end of the previous SIR portion in the same stream).
With this scheme (short of modifying 2-PASS SIR to understand "?"),
it would be up to the user to select the 8K or 16K store allocation
method initially. There would be no need to measure BCPL_ENV.SIR.

(With either approach to the back-end, to avoid needing two versions
of BCPL_ENV.SIR, the line "ADSTCK ? STACK" therein could be deleted,
and "ADSTCK 0 STACK" or "ADSTCK 1 STACK" added to the generated code).

When assembling with 2-PASS SIR, paper tape could be saved by changing
the initial option from *8211 to *8208, to minimise the label list.
Small enough programs could be loaded using 1-PASS SIR, avoiding the
need to make a binary paper tape, in which case the option should
be changed to *23 for labels or to *22 to minimise the label list.
(The code used to find the end of the literals would need to be
located somewhere else and triggered differently, or, if more than
8K of store is available, it would suffice to set ADSTCK to "1 0").

The front-end and back-end will each be over 8K words of paper tape,
(24K bytes, 200 feet, 1/5th of a reel), and implementing them in SIR
would reduce the amount of paper tape handling required. Of course
nobody nowadays has enough paper tape to want to run BCPL compilations
on a real 900-series machine anyway, because of the amount of tape used
per compilation, rather than the amount used to make the compilers.


*** EOF ***