491 lines
19 KiB
Plaintext
491 lines
19 KiB
Plaintext
Tiny Cobol Compiler Overview
|
||
----------------------------
|
||
|
||
|
||
|
||
Introduction
|
||
------------
|
||
|
||
The COBOL compiler was intended to be constructed with proven tools like
|
||
lex/flex and byacc (standard yacc from Berkeley) with a C code generator.
|
||
They are applied in that order for the files:
|
||
scan.l (scanner), htcobol.y (parser), htcobgen.c (code generator, listings
|
||
generator and symbol table management functions).
|
||
The output is assembler with AT&T syntax, standard for the Linux environment.
|
||
We plan to do first a ANSI-1974 compliant version, with extensions for
|
||
embedding SQL access and an access to tcl/tk libraries for writing visual
|
||
(GUI) applications, but later we expect to evolve to a full '85 version.
|
||
|
||
|
||
|
||
|
||
|
||
|
||
Scanner (tokenizer)
|
||
------------------
|
||
|
||
You will find a large table, reserved_symbols[], that defines all tokens
|
||
recognized by the scanner as reserved symbols.
|
||
When adding a new token at the parser, the same token must be added here.
|
||
The first entry is the token string, as it will be read by the scanner.
|
||
It shall be in uppercase, because the lookup() function convert everything
|
||
before looking at the table (but don't write back at the token buffer).
|
||
If something is not found as a reserved word, the case matters (upper/lower).
|
||
The second entry is the same token definition you entered in the parser
|
||
(htcobol.y). The third entry (minor) is a minor token number and may be
|
||
removed in a future release. It was needed in our limited version
|
||
of lex for the Ms-DOS(TM) environment, so we shared tokens with different
|
||
minor codes. Now they are history and gradually will be removed.
|
||
|
||
The scanner is controlled by the parser, to make it context-dependent, so
|
||
become easier to know what to looking for.
|
||
|
||
(>>>> Note: for instance, there was found a hard to fix reduce/reduce
|
||
conflict because of "NOT ON OVERFLOW", that confused the parser because
|
||
the token NOT was usually found in conditions, for instance in the IF
|
||
statement. That conflict was removed by a dirty trick, defining a new
|
||
state and changing NOT to another "kind of NOT", we have called it
|
||
NOTEXCEP. It's another token, dependent on the context, but is the very
|
||
same string: 'N','O','T'. In the user point of view it is
|
||
indistinguishable from the other.)
|
||
|
||
At the very beginning of the scanner
|
||
code, there is a large switch statement:
|
||
|
||
switch (curr_division) {
|
||
case CDIV_IDENT:
|
||
scdebug("-> IDENT_ST\n");
|
||
BEGIN IDENT_ST;
|
||
break;
|
||
...
|
||
}
|
||
curr_division = 0; /* to avoid new state switching */
|
||
|
||
So each state only look for the tokens it is expected to find. For instance,
|
||
at the COMMENT_ST state, everything that matches the regular expression
|
||
|
||
{letters}(({alphanum}|-)*{alphanum}+)?
|
||
|
||
will be matched and no token is returned to the parser (when it calls yylex()),
|
||
except if one reserved token DIVISNUM (any COBOL division identifier) is found.
|
||
This way, it consumes any input from the source program until a new division
|
||
is found. It is entered in the IDENTIFICATION DIVISION, when the parser
|
||
(htcobol.y) executes the following:
|
||
|
||
identification_division:
|
||
PROGRAM_ID "." IDSTRING EOS {
|
||
curr_division = CDIV_COMMENT;
|
||
pgm_header($3); }
|
||
;
|
||
|
||
As we see, first the program-id is parsed and stored. As it's the only thing
|
||
that's really matters here, we discard anything else after the program-id,
|
||
until we find the next division token: ENVIRONMENT.
|
||
|
||
Several recognizers may be grouped with the same starting states in the
|
||
scanner, surrounding them with something like:
|
||
|
||
<COMMENT_ST>{
|
||
... (several lex declarations)
|
||
}
|
||
|
||
Instead of COMMENT_ST, we can put several states for which the same
|
||
declarations apply, like in <INITIAL,ENVIR_ST,DATA_ST,REDEF_ST>, or even
|
||
<*> to enter a global (all states) declaration. This last case must be done
|
||
with extreme care, because it affects already tested states. Probably you don't
|
||
want to do it. Another special starting state we use is <<EOF>>. Please
|
||
consult flex manual or any good lex/flex book to read more about it. In our
|
||
case, it signals the end of a "copy" (include file) operation.
|
||
|
||
One of the goals of making this scanner with several main states, was because
|
||
we can distinguish between variables and labels (paragraphs and sections), so
|
||
we can avoid several tokens look-ahead in the parser. (There is a discussion
|
||
at the mailing list of GNU-Cobol2C compiler project about this topic)
|
||
Our compiler stores all variables during the DATA_ST state (corresponding to
|
||
data division) and already knows what is a variable during INITIAL state
|
||
(procedure division, default state), so it returns 2 distinct tokens for a
|
||
variable or a paragraph or section identifier (VARIABLE and LABELSTR,
|
||
respectively).
|
||
This is needed because, for instance, a
|
||
|
||
PERFORM IDENT-1 OF IDENT-2 IDENT-3 TIMES
|
||
|
||
and an statement like
|
||
|
||
PERFORM IDENT-1 OF IDENT-2 TIMES <statements> END-PERFORM
|
||
|
||
where in the first case, IDENT-1 is a paragraph name and in the second, IDENT-1
|
||
is a variable (field) name. This example need to lookahead 3 tokens
|
||
(if IDENT-1 and IDENT-2 is not qualified), and in general as much as
|
||
50 lookahead! The other possible solution is to do our
|
||
parsing with a better tool than yacc (btyacc is a backtracking yacc, see at
|
||
tiny-cobol's home page links section). With our simple solution, IDENT-1 is
|
||
know to be a VARIABLE or a LABELSTR (if no variable was found), so the
|
||
lookahead is not needed, and we can stay with regular yacc, besides the parser
|
||
generated is much faster than with btyacc (if lookahead is needed).
|
||
|
||
|
||
|
||
|
||
|
||
Symbol table organization
|
||
------------------------
|
||
|
||
The symbol table use a hash to choose one of HASHLEN entries and store <20>ach
|
||
symbol in this thread, according a hash() function applied on the symbol name
|
||
characters. For each value of "litflag" we can have:
|
||
|
||
litflag | actual structure | it is meant for
|
||
---------+-------------------+---------------------------------------------
|
||
0 | struct sym | general symbols (files, fields, ws, linkage)
|
||
1 | struct lit | literals
|
||
2,',', | struct vref | variable references to compute array indices
|
||
'+','-'| " " | " " " " "
|
||
4 | struct refmod | refmod's which encapsulate litflag = 0 or 2
|
||
---------+-------------------+---------------------------------------------
|
||
|
||
All those structures must have "litflag" as the first field so we can make a
|
||
pointer conversion when accessing the actual storage. In addition, the
|
||
structures "lit" and "sym" must share the following representation:
|
||
|
||
struct XXX {
|
||
char litflag;
|
||
struct XXX *next;
|
||
char *name;
|
||
char type;
|
||
int decimals;
|
||
unsigned location;
|
||
unsigned descriptor;
|
||
... /* the rest of the particular structure */
|
||
};
|
||
|
||
where XXX = lit or sym.
|
||
|
||
How the subscripting works? Let's see the representation of a subscripted
|
||
variable reference.
|
||
Suppose the following COBOL statement: MOVE 5 TO VAR ( I + 1, J - 2 )
|
||
|
||
where I,J are numeric variables (anyone, not just "indexed by").
|
||
In the parser we need a "struct sym" to reference it in a call to
|
||
|
||
gen_move( struct sym *sy_src, struct sym *sy_dst )
|
||
|
||
where sy_src is the source of the moved field (can be a literal too, of
|
||
course), and sy_dst is the destination variable. The definition of this
|
||
function is very simple indeed:
|
||
|
||
gen_move( struct sym *sy_src, struct sym *sy_dst ) {
|
||
gen_loadvar( sy_dst );
|
||
gen_loadvar( sy_src );
|
||
asm_call("move");
|
||
}
|
||
that will generate a "push" in the stack for the representation of the 2
|
||
references and call the runtime library function "move". The hard work is done
|
||
by the function gen_loadvar. It inspects first the litflag of the received
|
||
argument to see if it is really a symbol (struct sym), or a literal (struct
|
||
lit) or yet a subscripted/indexed variable reference (struct ref) and decides
|
||
what to do depending on it's value.
|
||
The result will be the generation of code to push two values (unless the
|
||
reference is a NULL) at the runtime stack:
|
||
(1) the "struct fld_desc" of the field; (2) a pointer (char *) to the field
|
||
storage. (please look also the section on code generation below)
|
||
|
||
Returning to the subscripting/indexing stuff, when the gen_loadvar find
|
||
(really at gen_loadloc) a litflag=2, meaning a "struct vref" is aliased,
|
||
it calls gen_subscripted to generate the code for computing the offset for
|
||
accessing the array element following the list of variable references in vref.
|
||
In the example given above, VAR ( I + 1, J - 2 ) will be represented as a list
|
||
with the following values:
|
||
|
||
(Notes: the headers are the fields of "struct vref"
|
||
the addresses are fictitious)
|
||
|
||
address | litflag | next | sym->name
|
||
--------+---------+-------+-----------
|
||
80001 | '\x2' | 80012 | VAR
|
||
80012 | '+' | 80035 | I
|
||
80035 | ',' | 80047 | 1
|
||
80047 | '-' | 80059 | J
|
||
80059 | ',' | NULL | 2
|
||
|
||
Another related function is value_to_eax, that generates code for loading the
|
||
register %eax with the value of a subscript variable (I or J above). This
|
||
function will be extended to include variables with the "usage is comp" clause
|
||
(for working with real indices, not only subscripts).
|
||
|
||
|
||
--- more to be added later ---
|
||
|
||
|
||
|
||
|
||
The parser
|
||
----------
|
||
|
||
--- to be written (any takers?) ---
|
||
|
||
|
||
|
||
|
||
|
||
|
||
Code generation
|
||
---------------
|
||
|
||
Our output is assembly language, using the cdecl calling convention. In
|
||
COBOL, we cannot handle null-terminated strings like we do in C, because a
|
||
character may contain any binary value (including the null char), and the
|
||
fields in COBOL are fixed length. Most of the library functions require a
|
||
description of the field that's not supposed to be altered in any way. Each
|
||
COBOL variable is represented by two pointers:
|
||
|
||
* a fld_desc (field descriptor) pointer
|
||
|
||
* a storage pointer (the real buffer contents)
|
||
|
||
Here is a description of each entry (please note that VarStructure.Info.txt
|
||
contains a more up to date description of this structure):
|
||
|
||
struct fld_desc {
|
||
unsigned short len;
|
||
char type;
|
||
unsigned char decimals;
|
||
unsigned int all:1;
|
||
unsigned int just_r:1;
|
||
unsigned int reserved:6;
|
||
char *pic;
|
||
};
|
||
|
||
Field Types:
|
||
|
||
'8': DISPLAY : 88 field
|
||
'9': DISPLAY : numeric display
|
||
'A': ALPHA : alpha
|
||
'B': BININT : binary (computational, computational-5)
|
||
'C': PACKED : packed-decimal (computational-3)
|
||
'D': ACCEPT_DISPLAY : screen data display (screen section only)
|
||
'E': EDITED : edited
|
||
'F': : file entity
|
||
'G': GROUP : group
|
||
'I': : ??? Index variable ???
|
||
'J': : global variable
|
||
'K': : external variable
|
||
'L': ???
|
||
'O': ???
|
||
'P': : paragraph
|
||
'Q': : report item
|
||
'R': ???
|
||
'S': : section
|
||
'T': ???
|
||
'U': FLOAT : float (computational-1/2 - 4/8 bytes)
|
||
'V': : ??? decimal point in picture clause ???
|
||
'W': : report descriptor
|
||
'X': ALPHANUMERIC : alphanumeric
|
||
'Z': : ??? suppress zero in picture clause ???
|
||
|
||
This is a static structure (seen by the library code), and should not be
|
||
changed in any way by library code. Its components are the description of our
|
||
variable pointed to by the second argument.
|
||
|
||
"len" is the length of the field in bytes.
|
||
"type" is the field type (see the 'Field Types' table above).
|
||
"decimals" is, for numeric fields, how many decimals positions there are after
|
||
the "V" assumed decimal point, if positive. Otherwise, how many "P"s to the
|
||
left there are for negative values. In other words, the scaling of the
|
||
numeric variable.
|
||
"all" is a flag that indicates if the 'ALL' flags was defined (for literals).
|
||
If this is the case, the variable should be continued as required by a move
|
||
operation (wrap-around at the end).
|
||
"just_r" is a flag that indicates if the field has been declared JUST RIGHT.
|
||
|
||
Why not simply put the pointer to the variable's buffer in its descriptor?
|
||
This wouldn't work, because variables may be passed to sub-programs (calling
|
||
another COBOL program) and its storage is defined as a stack frame, so it's
|
||
very volatile when externally linked.
|
||
|
||
Compressed fields and signs are stored like IBM does in its compilers, with
|
||
the sign at the rightmost half-byte, and all digits bcd-coded.
|
||
|
||
Files are different things, because they need different information. Please
|
||
look at the "struct file_desc" (htcoblib.h) to see its components.
|
||
|
||
-- to be better described later --
|
||
|
||
|
||
|
||
|
||
Memory allocation
|
||
-----------------
|
||
|
||
There are several xxxx_offset integer variables that track the positions of
|
||
code in several data segments, and also at the stack. The table below show the
|
||
usage of them:
|
||
|
||
offset track var | storage kind or segment
|
||
-------------------+-----------------------------------
|
||
stack_offset | automatic, or local data.
|
||
literal_offset | persistent (constant) data.
|
||
| used for descriptors (fld_desc),
|
||
| literals, and compressed pictures.
|
||
using_offset | for cobol subroutine received parameters
|
||
| (procedure division using ...)
|
||
linkage_offset | for linkage section variables (*)
|
||
global_offset | this is mostly historical, holding variables
|
||
| with common storage for several program
|
||
| modules (shared). Now it is being used for
|
||
| collecting all file descriptors (**)
|
||
file_offset | (??) there seems to be of no use (***)
|
||
|
||
(*) linkage section variables does not have local storage. Instead they access
|
||
variables arriving from the calling program, as pointers to the actual data
|
||
storage and descriptor.
|
||
(**) This area can be used to allocate static variable storage, saved
|
||
between calls of a subprogram.
|
||
(***) It doesn't exist in the original compiler and I don't remeber why I (who
|
||
defined it?) created it. Anyway, it is doing nothing, because it is storing a
|
||
value at an aliased variable that is being overwritten :->
|
||
|
||
When the parser finishes parsing the data division and just before reading the
|
||
procedure division, the stack_offset was updated with all automatic storage
|
||
variables already in place and it reserves space in the stack for them,
|
||
generating the following code (function proc_header):
|
||
|
||
fprintf(o_src,"\tpushl\t%%ebp\n\tmovl\t%%esp, %%ebp\n");
|
||
fprintf(o_src,"\tsubl\t$%u, %%esp\n",stack_offset);
|
||
fprintf(o_src,"\tmovl\t%%ebx, -%u(%%ebp)\n",stack_offset - 16);
|
||
|
||
(this last line just saves %ebx, in case we need to use that register,
|
||
in particular when accessing subscripted variables)
|
||
Register usage is the same as in a C program, where %ebp holds our
|
||
stack frame.
|
||
|
||
To know which kind of structure a symbol is, we look at
|
||
(struct sym *)sy->litflag, as reported above (section "symbol table
|
||
organization"). The structures sym, lit, and vref are as a kind of union
|
||
of three different things, selected by litflag's value. Just by using a
|
||
cast we convert from one form to the other, as in the following fragment of
|
||
code:
|
||
|
||
void gen_loadloc( struct sym *sy ) {
|
||
...
|
||
if (sy->litflag == 2) {
|
||
gen_subscripted( (struct vref *)sy );
|
||
...
|
||
|
||
|
||
this code says, if the symbol is actually a vref (subscript expression
|
||
reference), cast it as a (struct vref *) and proceed with the required code
|
||
generation.
|
||
|
||
|
||
|
||
Notes on interfacing with the library functions
|
||
-----------------------------------------------
|
||
|
||
|
||
Let us see a code for a typical function generation:
|
||
|
||
At the parser, we detect the ADD COBOL verb and it's arguments
|
||
|
||
statement:
|
||
...
|
||
| ADD { }
|
||
gname req_to { $<ival>$=ADD; }
|
||
var_list
|
||
...
|
||
|
||
Here "gname" is a non-terminal describing any variable name or literal, or some
|
||
figurative constants; "req_to" is a non-terminal that ensures a TO was
|
||
detected (it's not simply TO, because of the minor codes I've told about when
|
||
explaining the scanner); the action { $<ival>$=ADD } makes the stacked value of
|
||
this action equals to the token code ADD, so we can share several statements
|
||
with the same productions in "var_list"; finally "var_list" is _the_ code
|
||
generating production. Let's see how it works:
|
||
|
||
var_list:
|
||
var_list opt_sep gname
|
||
...
|
||
else if ($<ival>0 == ADD)
|
||
gen_add($<sval>-2,$<sval>3);
|
||
...
|
||
|
||
It's a recursive declaration that generates and ADD instruction for each
|
||
variable detected at the list. For instance, suppose we are parsing:
|
||
|
||
ADD 1 TO VAR-1 VAR-2 VAR-3
|
||
|
||
this will generate the same code as if we have done instead:
|
||
|
||
ADD 1 TO VAR-1
|
||
ADD 1 TO VAR-2
|
||
ADD 1 TO VAR-3
|
||
|
||
Of course, this could be much optimized , but let's keep things simple
|
||
for now.
|
||
The test (if condition) of ($<ival>0 == ADD) will tell us if this is really
|
||
the ADD statement (not MOVE, nor SUBTRACT, ...), because it looks one token
|
||
before reaching the present yacc stack position. This is called an "inherited
|
||
attribute" in compiler theory notation. We are really looking at that action
|
||
value we talked above. The need of typing the value with $<ival>$ is because
|
||
an action cannot be named as we do with other non-terminals (it's typeless),
|
||
but share the same stack space as all other terminals and non-terminals, as
|
||
defined by the %union yacc statement. Please look at a good compiler book to
|
||
understand better that, or I have no way to help you.
|
||
|
||
Now we need to use another inherited attribute to access our left-hand variable
|
||
(before the action and before the "req_to" at the ADD production), counting
|
||
back we get it's value -2 stack positions far away, that's why the first
|
||
argument for gen_add() will be $<sval>-2. The other argument is the right-hand
|
||
variable we are just parsing, or $3. Here there is no need to typify it,
|
||
because it's a known non-terminal of the type "sval" (for "symbol value").
|
||
BTW, the "ival" means "integer value". See the %union statements at the
|
||
beginning of htcobol.y to get a full picture of this.
|
||
|
||
At the code generation side, we have the following code-generating function:
|
||
|
||
void gen_add( struct sym *s1, struct sym *s2 ) {
|
||
gen_loadvar( s2 );
|
||
gen_loadvar( s1 );
|
||
asm_call("add");
|
||
}
|
||
|
||
the function gen_loadvar() generate the code for pushing the "struct fld_desc
|
||
*" and "char *" (the buffer) for the variable which was given (s2 or s1).
|
||
Remember that in C calling conventions, the first variable seen must be at the
|
||
top of stack, so we push it in reversed order. Each variable occupy 8 bytes of
|
||
the stack as discussed before (2 pointers). The asm_call() function generate
|
||
the code for calling the library function and take care of cleaning the stack.
|
||
This auto-cleaning is only possible if you don't write code to push variables
|
||
manually (like fprintf(o_src,"\tpushl\t%%eax\n") for pushing %eax register), as
|
||
this keeps the counter with the wrong value. You shall use the function
|
||
push_eax() instead. Please look for this section at htcoblib.c. (search
|
||
push_eax and look around!)
|
||
|
||
|
||
-- I'll write more later. Please be patient. ...or write it yourself! --
|
||
|
||
|
||
|
||
|
||
|
||
Some random notes
|
||
-----------------
|
||
|
||
As we work within a very heterogeneous group, we have to ensure that the
|
||
compiler is always usable (runnable). Otherwise, other developers working
|
||
on another part of the compiler or run-time library, would not be able to
|
||
check-in their implementations with the CVS server.
|
||
|
||
So if you want to do a large number of changes, that will make the compiler
|
||
temporarily unusable, please create a new branch on your computer, and do your
|
||
changes and tests there. But please, don't update the main development branch
|
||
with unusable code. Read the CVS manual for more information.
|
||
|
||
Our first rule is: "the compiler must compile all times !"
|
||
|
||
|
||
Rildo Pragana
|
||
|
||
Modified by: David Essex
|
||
Bernard Giroud
|