Skip to content

第54章 编译器与解释器

学习目标

完成本章学习后,你将能够:

  1. 理解编译原理:词法分析、语法分析、语义分析、代码生成
  2. 实现词法分析器:正则表达式、有限状态机、Token生成
  3. 实现语法分析器:上下文无关文法、递归下降、AST构建
  4. 构建解释器:AST遍历、环境管理、求值执行
  5. 开发领域特定语言:DSL设计、语法设计、解释执行
  6. 实现编译器后端:中间表示、优化、目标代码生成
  7. 进行代码分析:静态分析、类型检查、代码转换
  8. 构建字节码虚拟机:指令集设计、栈式虚拟机、执行引擎

54.1 词法分析

54.1.1 Token定义

python
from dataclasses import dataclass
from typing import List, Optional, Dict, Any
from enum import Enum, auto
import re


class TokenType(Enum):
    NUMBER = auto()
    STRING = auto()
    IDENTIFIER = auto()
    KEYWORD = auto()
    OPERATOR = auto()
    DELIMITER = auto()
    EOF = auto()

    PLUS = auto()
    MINUS = auto()
    MULTIPLY = auto()
    DIVIDE = auto()
    ASSIGN = auto()
    EQUAL = auto()
    NOT_EQUAL = auto()
    LESS = auto()
    GREATER = auto()
    LESS_EQUAL = auto()
    GREATER_EQUAL = auto()

    LPAREN = auto()
    RPAREN = auto()
    LBRACE = auto()
    RBRACE = auto()
    LBRACKET = auto()
    RBRACKET = auto()
    COMMA = auto()
    SEMICOLON = auto()
    DOT = auto()
    COLON = auto()

    IF = auto()
    ELSE = auto()
    WHILE = auto()
    FOR = auto()
    FUNCTION = auto()
    RETURN = auto()
    VAR = auto()
    LET = auto()
    CONST = auto()
    TRUE = auto()
    FALSE = auto()
    NULL = auto()

    AND = auto()
    OR = auto()
    NOT = auto()


@dataclass
class Token:
    type: TokenType
    value: Any
    line: int
    column: int

    def __repr__(self) -> str:
        return f"Token({self.type.name}, {self.value!r}, {self.line}:{self.column})"


KEYWORDS: Dict[str, TokenType] = {
    "if": TokenType.IF,
    "else": TokenType.ELSE,
    "while": TokenType.WHILE,
    "for": TokenType.FOR,
    "function": TokenType.FUNCTION,
    "return": TokenType.RETURN,
    "var": TokenType.VAR,
    "let": TokenType.LET,
    "const": TokenType.CONST,
    "true": TokenType.TRUE,
    "false": TokenType.FALSE,
    "null": TokenType.NULL,
    "and": TokenType.AND,
    "or": TokenType.OR,
    "not": TokenType.NOT,
}


class Lexer:
    def __init__(self, source: str):
        self.source = source
        self.pos = 0
        self.line = 1
        self.column = 1
        self.tokens: List[Token] = []

    def tokenize(self) -> List[Token]:
        while self.pos < len(self.source):
            self._skip_whitespace()
            if self.pos >= len(self.source):
                break

            char = self.source[self.pos]

            if char.isdigit():
                self._read_number()
            elif char == '"' or char == "'":
                self._read_string(char)
            elif char.isalpha() or char == '_':
                self._read_identifier()
            else:
                self._read_symbol()

        self.tokens.append(Token(TokenType.EOF, None, self.line, self.column))
        return self.tokens

    def _skip_whitespace(self) -> None:
        while self.pos < len(self.source):
            char = self.source[self.pos]
            if char in ' \t\r':
                self._advance()
            elif char == '\n':
                self.line += 1
                self.column = 1
                self.pos += 1
            elif char == '/' and self.pos + 1 < len(self.source):
                if self.source[self.pos + 1] == '/':
                    self._skip_line_comment()
                elif self.source[self.pos + 1] == '*':
                    self._skip_block_comment()
                else:
                    break
            else:
                break

    def _skip_line_comment(self) -> None:
        while self.pos < len(self.source) and self.source[self.pos] != '\n':
            self._advance()

    def _skip_block_comment(self) -> None:
        self._advance()
        self._advance()

        while self.pos < len(self.source) - 1:
            if self.source[self.pos] == '*' and self.source[self.pos + 1] == '/':
                self._advance()
                self._advance()
                break
            self._advance()

    def _advance(self) -> str:
        char = self.source[self.pos]
        self.pos += 1
        self.column += 1
        return char

    def _read_number(self) -> None:
        start_col = self.column
        num_str = ""

        while self.pos < len(self.source) and (self.source[self.pos].isdigit() or self.source[self.pos] == '.'):
            num_str += self._advance()

        if '.' in num_str:
            value = float(num_str)
        else:
            value = int(num_str)

        self.tokens.append(Token(TokenType.NUMBER, value, self.line, start_col))

    def _read_string(self, quote: str) -> None:
        start_col = self.column
        self._advance()
        value = ""

        while self.pos < len(self.source) and self.source[self.pos] != quote:
            if self.source[self.pos] == '\\':
                self._advance()
                if self.pos < len(self.source):
                    escape_char = self._advance()
                    if escape_char == 'n':
                        value += '\n'
                    elif escape_char == 't':
                        value += '\t'
                    elif escape_char == '\\':
                        value += '\\'
                    elif escape_char == quote:
                        value += quote
                    else:
                        value += escape_char
            else:
                value += self._advance()

        if self.pos < len(self.source):
            self._advance()

        self.tokens.append(Token(TokenType.STRING, value, self.line, start_col))

    def _read_identifier(self) -> None:
        start_col = self.column
        value = ""

        while self.pos < len(self.source) and (self.source[self.pos].isalnum() or self.source[self.pos] == '_'):
            value += self._advance()

        token_type = KEYWORDS.get(value, TokenType.IDENTIFIER)
        self.tokens.append(Token(token_type, value, self.line, start_col))

    def _read_symbol(self) -> None:
        start_col = self.column
        char = self._advance()

        symbols: Dict[str, TokenType] = {
            '+': TokenType.PLUS,
            '-': TokenType.MINUS,
            '*': TokenType.MULTIPLY,
            '/': TokenType.DIVIDE,
            '(': TokenType.LPAREN,
            ')': TokenType.RPAREN,
            '{': TokenType.LBRACE,
            '}': TokenType.RBRACE,
            '[': TokenType.LBRACKET,
            ']': TokenType.RBRACKET,
            ',': TokenType.COMMA,
            ';': TokenType.SEMICOLON,
            '.': TokenType.DOT,
            ':': TokenType.COLON,
        }

        double_symbols: Dict[str, TokenType] = {
            '==': TokenType.EQUAL,
            '!=': TokenType.NOT_EQUAL,
            '<=': TokenType.LESS_EQUAL,
            '>=': TokenType.GREATER_EQUAL,
            '&&': TokenType.AND,
            '||': TokenType.OR,
        }

        if self.pos < len(self.source):
            two_char = char + self.source[self.pos]
            if two_char in double_symbols:
                self._advance()
                self.tokens.append(Token(double_symbols[two_char], two_char, self.line, start_col))
                return

        if char == '=':
            self.tokens.append(Token(TokenType.ASSIGN, char, self.line, start_col))
        elif char == '<':
            self.tokens.append(Token(TokenType.LESS, char, self.line, start_col))
        elif char == '>':
            self.tokens.append(Token(TokenType.GREATER, char, self.line, start_col))
        elif char == '!':
            self.tokens.append(Token(TokenType.NOT, char, self.line, start_col))
        elif char in symbols:
            self.tokens.append(Token(symbols[char], char, self.line, start_col))
        else:
            raise SyntaxError(f"Unexpected character: {char} at line {self.line}, column {start_col}")

54.2 语法分析

54.2.1 AST节点

python
from dataclasses import dataclass, field
from typing import List, Optional, Any, Union
from abc import ABC, abstractmethod


class ASTNode(ABC):
    pass


@dataclass
class Program(ASTNode):
    statements: List[ASTNode]


@dataclass
class NumberLiteral(ASTNode):
    value: Union[int, float]


@dataclass
class StringLiteral(ASTNode):
    value: str


@dataclass
class BooleanLiteral(ASTNode):
    value: bool


@dataclass
class NullLiteral(ASTNode):
    pass


@dataclass
class Identifier(ASTNode):
    name: str


@dataclass
class BinaryExpression(ASTNode):
    operator: str
    left: ASTNode
    right: ASTNode


@dataclass
class UnaryExpression(ASTNode):
    operator: str
    operand: ASTNode


@dataclass
class CallExpression(ASTNode):
    callee: ASTNode
    arguments: List[ASTNode]


@dataclass
class MemberExpression(ASTNode):
    object: ASTNode
    property: ASTNode
    computed: bool = False


@dataclass
class ArrayExpression(ASTNode):
    elements: List[ASTNode]


@dataclass
class ObjectExpression(ASTNode):
    properties: List[tuple]


@dataclass
class VariableDeclaration(ASTNode):
    kind: str
    name: str
    init: Optional[ASTNode]


@dataclass
class AssignmentExpression(ASTNode):
    left: ASTNode
    right: ASTNode


@dataclass
class IfStatement(ASTNode):
    condition: ASTNode
    consequent: ASTNode
    alternate: Optional[ASTNode]


@dataclass
class WhileStatement(ASTNode):
    condition: ASTNode
    body: ASTNode


@dataclass
class ForStatement(ASTNode):
    init: Optional[ASTNode]
    condition: Optional[ASTNode]
    update: Optional[ASTNode]
    body: ASTNode


@dataclass
class FunctionDeclaration(ASTNode):
    name: str
    params: List[str]
    body: ASTNode


@dataclass
class ReturnStatement(ASTNode):
    value: Optional[ASTNode]


@dataclass
class BlockStatement(ASTNode):
    statements: List[ASTNode]


@dataclass
class ExpressionStatement(ASTNode):
    expression: ASTNode


class Parser:
    def __init__(self, tokens: List[Token]):
        self.tokens = tokens
        self.pos = 0

    def parse(self) -> Program:
        statements = []

        while not self._is_at_end():
            stmt = self._parse_statement()
            if stmt:
                statements.append(stmt)

        return Program(statements)

    def _current(self) -> Token:
        return self.tokens[self.pos]

    def _peek(self, offset: int = 1) -> Token:
        pos = self.pos + offset
        if pos < len(self.tokens):
            return self.tokens[pos]
        return self.tokens[-1]

    def _is_at_end(self) -> bool:
        return self._current().type == TokenType.EOF

    def _advance(self) -> Token:
        if not self._is_at_end():
            self.pos += 1
        return self.tokens[self.pos - 1]

    def _match(self, *types: TokenType) -> bool:
        for t in types:
            if self._current().type == t:
                return True
        return False

    def _consume(self, type: TokenType, message: str) -> Token:
        if self._current().type == type:
            return self._advance()
        raise SyntaxError(f"{message} at line {self._current().line}")

    def _parse_statement(self) -> Optional[ASTNode]:
        if self._match(TokenType.VAR, TokenType.LET, TokenType.CONST):
            return self._parse_variable_declaration()
        elif self._match(TokenType.IF):
            return self._parse_if_statement()
        elif self._match(TokenType.WHILE):
            return self._parse_while_statement()
        elif self._match(TokenType.FOR):
            return self._parse_for_statement()
        elif self._match(TokenType.FUNCTION):
            return self._parse_function_declaration()
        elif self._match(TokenType.RETURN):
            return self._parse_return_statement()
        elif self._match(TokenType.LBRACE):
            return self._parse_block_statement()
        else:
            return self._parse_expression_statement()

    def _parse_variable_declaration(self) -> VariableDeclaration:
        kind = self._advance().value
        name = self._consume(TokenType.IDENTIFIER, "Expected variable name").value

        init = None
        if self._match(TokenType.ASSIGN):
            self._advance()
            init = self._parse_expression()

        if self._match(TokenType.SEMICOLON):
            self._advance()

        return VariableDeclaration(kind, name, init)

    def _parse_if_statement(self) -> IfStatement:
        self._advance()
        self._consume(TokenType.LPAREN, "Expected '(' after 'if'")
        condition = self._parse_expression()
        self._consume(TokenType.RPAREN, "Expected ')' after condition")

        consequent = self._parse_statement()
        alternate = None

        if self._match(TokenType.ELSE):
            self._advance()
            alternate = self._parse_statement()

        return IfStatement(condition, consequent, alternate)

    def _parse_while_statement(self) -> WhileStatement:
        self._advance()
        self._consume(TokenType.LPAREN, "Expected '(' after 'while'")
        condition = self._parse_expression()
        self._consume(TokenType.RPAREN, "Expected ')' after condition")
        body = self._parse_statement()

        return WhileStatement(condition, body)

    def _parse_for_statement(self) -> ForStatement:
        self._advance()
        self._consume(TokenType.LPAREN, "Expected '(' after 'for'")

        init = None
        if not self._match(TokenType.SEMICOLON):
            init = self._parse_variable_declaration()
        else:
            self._advance()

        condition = None
        if not self._match(TokenType.SEMICOLON):
            condition = self._parse_expression()
        self._consume(TokenType.SEMICOLON, "Expected ';' after condition")

        update = None
        if not self._match(TokenType.RPAREN):
            update = self._parse_expression()
        self._consume(TokenType.RPAREN, "Expected ')' after for clauses")

        body = self._parse_statement()

        return ForStatement(init, condition, update, body)

    def _parse_function_declaration(self) -> FunctionDeclaration:
        self._advance()
        name = self._consume(TokenType.IDENTIFIER, "Expected function name").value

        self._consume(TokenType.LPAREN, "Expected '(' after function name")
        params = []

        if not self._match(TokenType.RPAREN):
            params.append(self._consume(TokenType.IDENTIFIER, "Expected parameter name").value)
            while self._match(TokenType.COMMA):
                self._advance()
                params.append(self._consume(TokenType.IDENTIFIER, "Expected parameter name").value)

        self._consume(TokenType.RPAREN, "Expected ')' after parameters")
        body = self._parse_block_statement()

        return FunctionDeclaration(name, params, body)

    def _parse_return_statement(self) -> ReturnStatement:
        self._advance()
        value = None

        if not self._match(TokenType.SEMICOLON, TokenType.RBRACE):
            value = self._parse_expression()

        if self._match(TokenType.SEMICOLON):
            self._advance()

        return ReturnStatement(value)

    def _parse_block_statement(self) -> BlockStatement:
        self._advance()
        statements = []

        while not self._match(TokenType.RBRACE) and not self._is_at_end():
            stmt = self._parse_statement()
            if stmt:
                statements.append(stmt)

        self._consume(TokenType.RBRACE, "Expected '}' after block")
        return BlockStatement(statements)

    def _parse_expression_statement(self) -> ExpressionStatement:
        expr = self._parse_expression()

        if self._match(TokenType.SEMICOLON):
            self._advance()

        return ExpressionStatement(expr)

    def _parse_expression(self) -> ASTNode:
        return self._parse_assignment()

    def _parse_assignment(self) -> ASTNode:
        expr = self._parse_or()

        if self._match(TokenType.ASSIGN):
            self._advance()
            value = self._parse_assignment()
            return AssignmentExpression(expr, value)

        return expr

    def _parse_or(self) -> ASTNode:
        left = self._parse_and()

        while self._match(TokenType.OR):
            op = self._advance().value
            right = self._parse_and()
            left = BinaryExpression(op, left, right)

        return left

    def _parse_and(self) -> ASTNode:
        left = self._parse_equality()

        while self._match(TokenType.AND):
            op = self._advance().value
            right = self._parse_equality()
            left = BinaryExpression(op, left, right)

        return left

    def _parse_equality(self) -> ASTNode:
        left = self._parse_comparison()

        while self._match(TokenType.EQUAL, TokenType.NOT_EQUAL):
            op = self._advance().value
            right = self._parse_comparison()
            left = BinaryExpression(op, left, right)

        return left

    def _parse_comparison(self) -> ASTNode:
        left = self._parse_term()

        while self._match(TokenType.LESS, TokenType.GREATER, TokenType.LESS_EQUAL, TokenType.GREATER_EQUAL):
            op = self._advance().value
            right = self._parse_term()
            left = BinaryExpression(op, left, right)

        return left

    def _parse_term(self) -> ASTNode:
        left = self._parse_factor()

        while self._match(TokenType.PLUS, TokenType.MINUS):
            op = self._advance().value
            right = self._parse_factor()
            left = BinaryExpression(op, left, right)

        return left

    def _parse_factor(self) -> ASTNode:
        left = self._parse_unary()

        while self._match(TokenType.MULTIPLY, TokenType.DIVIDE):
            op = self._advance().value
            right = self._parse_unary()
            left = BinaryExpression(op, left, right)

        return left

    def _parse_unary(self) -> ASTNode:
        if self._match(TokenType.NOT, TokenType.MINUS):
            op = self._advance().value
            operand = self._parse_unary()
            return UnaryExpression(op, operand)

        return self._parse_call()

    def _parse_call(self) -> ASTNode:
        expr = self._parse_primary()

        while True:
            if self._match(TokenType.LPAREN):
                self._advance()
                args = self._parse_arguments()
                self._consume(TokenType.RPAREN, "Expected ')' after arguments")
                expr = CallExpression(expr, args)
            elif self._match(TokenType.DOT):
                self._advance()
                prop = self._consume(TokenType.IDENTIFIER, "Expected property name")
                expr = MemberExpression(expr, Identifier(prop.value), False)
            elif self._match(TokenType.LBRACKET):
                self._advance()
                prop = self._parse_expression()
                self._consume(TokenType.RBRACKET, "Expected ']' after index")
                expr = MemberExpression(expr, prop, True)
            else:
                break

        return expr

    def _parse_arguments(self) -> List[ASTNode]:
        args = []

        if not self._match(TokenType.RPAREN):
            args.append(self._parse_expression())
            while self._match(TokenType.COMMA):
                self._advance()
                args.append(self._parse_expression())

        return args

    def _parse_primary(self) -> ASTNode:
        if self._match(TokenType.NUMBER):
            return NumberLiteral(self._advance().value)
        elif self._match(TokenType.STRING):
            return StringLiteral(self._advance().value)
        elif self._match(TokenType.TRUE):
            self._advance()
            return BooleanLiteral(True)
        elif self._match(TokenType.FALSE):
            self._advance()
            return BooleanLiteral(False)
        elif self._match(TokenType.NULL):
            self._advance()
            return NullLiteral()
        elif self._match(TokenType.IDENTIFIER):
            return Identifier(self._advance().value)
        elif self._match(TokenType.LPAREN):
            self._advance()
            expr = self._parse_expression()
            self._consume(TokenType.RPAREN, "Expected ')' after expression")
            return expr
        elif self._match(TokenType.LBRACKET):
            return self._parse_array()
        elif self._match(TokenType.LBRACE):
            return self._parse_object()
        else:
            raise SyntaxError(f"Unexpected token: {self._current().type} at line {self._current().line}")

    def _parse_array(self) -> ArrayExpression:
        self._advance()
        elements = []

        if not self._match(TokenType.RBRACKET):
            elements.append(self._parse_expression())
            while self._match(TokenType.COMMA):
                self._advance()
                if self._match(TokenType.RBRACKET):
                    break
                elements.append(self._parse_expression())

        self._consume(TokenType.RBRACKET, "Expected ']' after array elements")
        return ArrayExpression(elements)

    def _parse_object(self) -> ObjectExpression:
        self._advance()
        properties = []

        if not self._match(TokenType.RBRACE):
            key = self._parse_object_key()
            self._consume(TokenType.COLON, "Expected ':' after key")
            value = self._parse_expression()
            properties.append((key, value))

            while self._match(TokenType.COMMA):
                self._advance()
                if self._match(TokenType.RBRACE):
                    break
                key = self._parse_object_key()
                self._consume(TokenType.COLON, "Expected ':' after key")
                value = self._parse_expression()
                properties.append((key, value))

        self._consume(TokenType.RBRACE, "Expected '}' after object properties")
        return ObjectExpression(properties)

    def _parse_object_key(self) -> ASTNode:
        if self._match(TokenType.STRING):
            return StringLiteral(self._advance().value)
        elif self._match(TokenType.IDENTIFIER):
            return StringLiteral(self._advance().value)
        else:
            raise SyntaxError(f"Expected property name at line {self._current().line}")

54.3 解释器

54.3.1 执行引擎

python
from typing import Dict, Any, List, Optional, Callable
from dataclasses import dataclass, field


@dataclass
class Environment:
    variables: Dict[str, Any] = field(default_factory=dict)
    parent: Optional["Environment"] = None

    def define(self, name: str, value: Any) -> None:
        self.variables[name] = value

    def get(self, name: str) -> Any:
        if name in self.variables:
            return self.variables[name]
        if self.parent:
            return self.parent.get(name)
        raise NameError(f"Undefined variable: {name}")

    def set(self, name: str, value: Any) -> None:
        if name in self.variables:
            self.variables[name] = value
            return
        if self.parent:
            self.parent.set(name, value)
            return
        raise NameError(f"Undefined variable: {name}")

    def has(self, name: str) -> bool:
        if name in self.variables:
            return True
        if self.parent:
            return self.parent.has(name)
        return False


@dataclass
class ReturnValue:
    value: Any


@dataclass
class Function:
    name: str
    params: List[str]
    body: ASTNode
    closure: Environment


class Interpreter:
    def __init__(self):
        self.global_env = Environment()
        self._setup_builtins()

    def _setup_builtins(self) -> None:
        builtins = {
            "print": lambda *args: print(*args),
            "len": lambda x: len(x),
            "type": lambda x: type(x).__name__,
            "str": lambda x: str(x),
            "int": lambda x: int(x),
            "float": lambda x: float(x),
            "abs": lambda x: abs(x),
            "min": lambda *args: min(args),
            "max": lambda *args: max(args),
            "range": lambda *args: list(range(*args)),
        }

        for name, func in builtins.items():
            self.global_env.define(name, func)

    def interpret(self, program: Program) -> Any:
        result = None
        for stmt in program.statements:
            result = self._execute(stmt, self.global_env)
            if isinstance(result, ReturnValue):
                return result.value
        return result

    def _execute(self, node: ASTNode, env: Environment) -> Any:
        method_name = f"_execute_{type(node).__name__}"
        method = getattr(self, method_name, self._execute_generic)
        return method(node, env)

    def _execute_generic(self, node: ASTNode, env: Environment) -> Any:
        raise NotImplementedError(f"No execution method for {type(node).__name__}")

    def _execute_Program(self, node: Program, env: Environment) -> Any:
        result = None
        for stmt in node.statements:
            result = self._execute(stmt, env)
            if isinstance(result, ReturnValue):
                return result.value
        return result

    def _execute_NumberLiteral(self, node: NumberLiteral, env: Environment) -> Any:
        return node.value

    def _execute_StringLiteral(self, node: StringLiteral, env: Environment) -> Any:
        return node.value

    def _execute_BooleanLiteral(self, node: BooleanLiteral, env: Environment) -> Any:
        return node.value

    def _execute_NullLiteral(self, node: NullLiteral, env: Environment) -> Any:
        return None

    def _execute_Identifier(self, node: Identifier, env: Environment) -> Any:
        return env.get(node.name)

    def _execute_BinaryExpression(self, node: BinaryExpression, env: Environment) -> Any:
        left = self._execute(node.left, env)
        right = self._execute(node.right, env)

        op = node.operator

        if op == '+':
            return left + right
        elif op == '-':
            return left - right
        elif op == '*':
            return left * right
        elif op == '/':
            return left / right
        elif op == '==':
            return left == right
        elif op == '!=':
            return left != right
        elif op == '<':
            return left < right
        elif op == '>':
            return left > right
        elif op == '<=':
            return left <= right
        elif op == '>=':
            return left >= right
        elif op == 'and':
            return left and right
        elif op == 'or':
            return left or right

        raise RuntimeError(f"Unknown operator: {op}")

    def _execute_UnaryExpression(self, node: UnaryExpression, env: Environment) -> Any:
        operand = self._execute(node.operand, env)

        if node.operator == '-':
            return -operand
        elif node.operator == 'not':
            return not operand

        raise RuntimeError(f"Unknown unary operator: {node.operator}")

    def _execute_CallExpression(self, node: CallExpression, env: Environment) -> Any:
        callee = self._execute(node.callee, env)
        args = [self._execute(arg, env) for arg in node.arguments]

        if callable(callee):
            return callee(*args)

        if isinstance(callee, Function):
            return self._call_function(callee, args)

        raise RuntimeError(f"Cannot call {type(callee).__name__}")

    def _call_function(self, func: Function, args: List[Any]) -> Any:
        local_env = Environment(parent=func.closure)

        for param, arg in zip(func.params, args):
            local_env.define(param, arg)

        try:
            self._execute(func.body, local_env)
        except ReturnValue as ret:
            return ret.value

        return None

    def _execute_MemberExpression(self, node: MemberExpression, env: Environment) -> Any:
        obj = self._execute(node.object, env)

        if node.computed:
            prop = self._execute(node.property, env)
        else:
            prop = node.property.name

        if isinstance(obj, dict):
            return obj.get(prop)
        elif isinstance(obj, list):
            return obj[int(prop)]
        elif hasattr(obj, prop):
            return getattr(obj, prop)

        raise RuntimeError(f"Cannot access property {prop}")

    def _execute_ArrayExpression(self, node: ArrayExpression, env: Environment) -> Any:
        return [self._execute(elem, env) for elem in node.elements]

    def _execute_ObjectExpression(self, node: ObjectExpression, env: Environment) -> Any:
        obj = {}
        for key, value in node.properties:
            k = self._execute(key, env)
            v = self._execute(value, env)
            obj[k] = v
        return obj

    def _execute_VariableDeclaration(self, node: VariableDeclaration, env: Environment) -> Any:
        value = None
        if node.init:
            value = self._execute(node.init, env)
        env.define(node.name, value)
        return value

    def _execute_AssignmentExpression(self, node: AssignmentExpression, env: Environment) -> Any:
        value = self._execute(node.right, env)

        if isinstance(node.left, Identifier):
            env.set(node.left.name, value)
        elif isinstance(node.left, MemberExpression):
            obj = self._execute(node.left.object, env)
            if node.left.computed:
                prop = self._execute(node.left.property, env)
            else:
                prop = node.left.property.name

            if isinstance(obj, dict):
                obj[prop] = value
            elif isinstance(obj, list):
                obj[int(prop)] = value

        return value

    def _execute_IfStatement(self, node: IfStatement, env: Environment) -> Any:
        condition = self._execute(node.condition, env)

        if condition:
            return self._execute(node.consequent, env)
        elif node.alternate:
            return self._execute(node.alternate, env)

        return None

    def _execute_WhileStatement(self, node: WhileStatement, env: Environment) -> Any:
        result = None

        while self._execute(node.condition, env):
            result = self._execute(node.body, env)
            if isinstance(result, ReturnValue):
                return result

        return result

    def _execute_ForStatement(self, node: ForStatement, env: Environment) -> Any:
        local_env = Environment(parent=env)

        if node.init:
            self._execute(node.init, local_env)

        result = None

        while True:
            if node.condition:
                if not self._execute(node.condition, local_env):
                    break

            result = self._execute(node.body, local_env)
            if isinstance(result, ReturnValue):
                return result

            if node.update:
                self._execute(node.update, local_env)

        return result

    def _execute_FunctionDeclaration(self, node: FunctionDeclaration, env: Environment) -> Any:
        func = Function(node.name, node.params, node.body, env)
        env.define(node.name, func)
        return func

    def _execute_ReturnStatement(self, node: ReturnStatement, env: Environment) -> ReturnValue:
        value = None
        if node.value:
            value = self._execute(node.value, env)
        return ReturnValue(value)

    def _execute_BlockStatement(self, node: BlockStatement, env: Environment) -> Any:
        local_env = Environment(parent=env)
        result = None

        for stmt in node.statements:
            result = self._execute(stmt, local_env)
            if isinstance(result, ReturnValue):
                return result

        return result

    def _execute_ExpressionStatement(self, node: ExpressionStatement, env: Environment) -> Any:
        return self._execute(node.expression, env)

54.4 字节码编译器

54.4.1 指令集设计

python
from dataclasses import dataclass
from typing import List, Any
from enum import Enum, auto


class OpCode(Enum):
    LOAD_CONST = auto()
    LOAD_NAME = auto()
    STORE_NAME = auto()
    BINARY_ADD = auto()
    BINARY_SUB = auto()
    BINARY_MUL = auto()
    BINARY_DIV = auto()
    COMPARE_EQ = auto()
    COMPARE_NE = auto()
    COMPARE_LT = auto()
    COMPARE_GT = auto()
    COMPARE_LE = auto()
    COMPARE_GE = auto()
    POP = auto()
    DUP = auto()
    JUMP = auto()
    JUMP_IF_FALSE = auto()
    JUMP_IF_TRUE = auto()
    CALL = auto()
    RETURN = auto()
    MAKE_FUNCTION = auto()
    BUILD_LIST = auto()
    BUILD_MAP = auto()
    GET_ITEM = auto()
    SET_ITEM = auto()
    UNARY_NEGATIVE = auto()
    UNARY_NOT = auto()
    PRINT = auto()


@dataclass
class Instruction:
    opcode: OpCode
    operand: Any = None
    line: int = 0

    def __repr__(self) -> str:
        if self.operand is not None:
            return f"{self.opcode.name} {self.operand}"
        return self.opcode.name


class BytecodeCompiler:
    def __init__(self):
        self.instructions: List[Instruction] = []
        self.constants: List[Any] = []
        self.names: List[str] = []
        self._current_line = 0

    def compile(self, node: ASTNode) -> List[Instruction]:
        self.instructions = []
        self._compile(node)
        return self.instructions

    def _emit(self, opcode: OpCode, operand: Any = None) -> int:
        instruction = Instruction(opcode, operand, self._current_line)
        self.instructions.append(instruction)
        return len(self.instructions) - 1

    def _add_constant(self, value: Any) -> int:
        if value not in self.constants:
            self.constants.append(value)
        return self.constants.index(value)

    def _add_name(self, name: str) -> int:
        if name not in self.names:
            self.names.append(name)
        return self.names.index(name)

    def _compile(self, node: ASTNode) -> None:
        method_name = f"_compile_{type(node).__name__}"
        method = getattr(self, method_name, self._compile_generic)
        method(node)

    def _compile_generic(self, node: ASTNode) -> None:
        raise NotImplementedError(f"No compile method for {type(node).__name__}")

    def _compile_NumberLiteral(self, node: NumberLiteral) -> None:
        idx = self._add_constant(node.value)
        self._emit(OpCode.LOAD_CONST, idx)

    def _compile_StringLiteral(self, node: StringLiteral) -> None:
        idx = self._add_constant(node.value)
        self._emit(OpCode.LOAD_CONST, idx)

    def _compile_BooleanLiteral(self, node: BooleanLiteral) -> None:
        idx = self._add_constant(node.value)
        self._emit(OpCode.LOAD_CONST, idx)

    def _compile_Identifier(self, node: Identifier) -> None:
        idx = self._add_name(node.name)
        self._emit(OpCode.LOAD_NAME, idx)

    def _compile_BinaryExpression(self, node: BinaryExpression) -> None:
        self._compile(node.left)
        self._compile(node.right)

        op_map = {
            '+': OpCode.BINARY_ADD,
            '-': OpCode.BINARY_SUB,
            '*': OpCode.BINARY_MUL,
            '/': OpCode.BINARY_DIV,
            '==': OpCode.COMPARE_EQ,
            '!=': OpCode.COMPARE_NE,
            '<': OpCode.COMPARE_LT,
            '>': OpCode.COMPARE_GT,
            '<=': OpCode.COMPARE_LE,
            '>=': OpCode.COMPARE_GE,
        }

        if node.operator in op_map:
            self._emit(op_map[node.operator])

    def _compile_IfStatement(self, node: IfStatement) -> None:
        self._compile(node.condition)

        else_jump = self._emit(OpCode.JUMP_IF_FALSE, None)

        self._compile(node.consequent)

        if node.alternate:
            end_jump = self._emit(OpCode.JUMP, None)

        self.instructions[else_jump].operand = len(self.instructions)

        if node.alternate:
            self._compile(node.alternate)
            self.instructions[end_jump].operand = len(self.instructions)

    def _compile_WhileStatement(self, node: WhileStatement) -> None:
        start = len(self.instructions)

        self._compile(node.condition)
        exit_jump = self._emit(OpCode.JUMP_IF_FALSE, None)

        self._compile(node.body)
        self._emit(OpCode.JUMP, start)

        self.instructions[exit_jump].operand = len(self.instructions)

    def _compile_VariableDeclaration(self, node: VariableDeclaration) -> None:
        if node.init:
            self._compile(node.init)
        else:
            self._emit(OpCode.LOAD_CONST, self._add_constant(None))

        idx = self._add_name(node.name)
        self._emit(OpCode.STORE_NAME, idx)

    def _compile_FunctionDeclaration(self, node: FunctionDeclaration) -> None:
        for param in node.params:
            self._add_name(param)

        body_compiler = BytecodeCompiler()
        body_instructions = body_compiler.compile(node.body)

        func_data = {
            "name": node.name,
            "params": node.params,
            "code": body_instructions,
            "constants": body_compiler.constants,
            "names": body_compiler.names
        }

        idx = self._add_constant(func_data)
        self._emit(OpCode.MAKE_FUNCTION, idx)

        name_idx = self._add_name(node.name)
        self._emit(OpCode.STORE_NAME, name_idx)

    def _compile_CallExpression(self, node: CallExpression) -> None:
        self._compile(node.callee)

        for arg in node.arguments:
            self._compile(arg)

        self._emit(OpCode.CALL, len(node.arguments))

    def _compile_ReturnStatement(self, node: ReturnStatement) -> None:
        if node.value:
            self._compile(node.value)
        else:
            self._emit(OpCode.LOAD_CONST, self._add_constant(None))

        self._emit(OpCode.RETURN)

54.4.2 栈式虚拟机

python
from typing import List, Any, Dict, Optional
from dataclasses import dataclass, field


@dataclass
class Frame:
    code: List[Instruction]
    constants: List[Any]
    names: List[str]
    pc: int = 0
    stack: List[Any] = field(default_factory=list)
    locals: Dict[str, Any] = field(default_factory=dict)
    parent: Optional["Frame"] = None


class VirtualMachine:
    def __init__(self):
        self.globals: Dict[str, Any] = {}
        self._setup_builtins()

    def _setup_builtins(self) -> None:
        self.globals["print"] = lambda *args: print(*args)
        self.globals["len"] = lambda x: len(x)
        self.globals["type"] = lambda x: type(x).__name__
        self.globals["str"] = lambda x: str(x)
        self.globals["int"] = lambda x: int(x)
        self.globals["float"] = lambda x: float(x)
        self.globals["abs"] = lambda x: abs(x)
        self.globals["min"] = lambda *args: min(args)
        self.globals["max"] = lambda *args: max(args)
        self.globals["range"] = lambda *args: list(range(*args))

    def run(self, frame: Frame) -> Any:
        while frame.pc < len(frame.code):
            instruction = frame.code[frame.pc]
            frame.pc += 1

            self._execute_instruction(frame, instruction)

        return frame.stack[-1] if frame.stack else None

    def _execute_instruction(self, frame: Frame, instruction: Instruction) -> None:
        opcode = instruction.opcode
        operand = instruction.operand

        if opcode == OpCode.LOAD_CONST:
            frame.stack.append(frame.constants[operand])

        elif opcode == OpCode.LOAD_NAME:
            name = frame.names[operand]
            if name in frame.locals:
                frame.stack.append(frame.locals[name])
            elif name in self.globals:
                frame.stack.append(self.globals[name])
            else:
                raise NameError(f"Undefined variable: {name}")

        elif opcode == OpCode.STORE_NAME:
            name = frame.names[operand]
            value = frame.stack.pop()
            frame.locals[name] = value

        elif opcode == OpCode.BINARY_ADD:
            right = frame.stack.pop()
            left = frame.stack.pop()
            frame.stack.append(left + right)

        elif opcode == OpCode.BINARY_SUB:
            right = frame.stack.pop()
            left = frame.stack.pop()
            frame.stack.append(left - right)

        elif opcode == OpCode.BINARY_MUL:
            right = frame.stack.pop()
            left = frame.stack.pop()
            frame.stack.append(left * right)

        elif opcode == OpCode.BINARY_DIV:
            right = frame.stack.pop()
            left = frame.stack.pop()
            frame.stack.append(left / right)

        elif opcode == OpCode.COMPARE_EQ:
            right = frame.stack.pop()
            left = frame.stack.pop()
            frame.stack.append(left == right)

        elif opcode == OpCode.COMPARE_LT:
            right = frame.stack.pop()
            left = frame.stack.pop()
            frame.stack.append(left < right)

        elif opcode == OpCode.JUMP:
            frame.pc = operand

        elif opcode == OpCode.JUMP_IF_FALSE:
            value = frame.stack.pop()
            if not value:
                frame.pc = operand

        elif opcode == OpCode.CALL:
            args = [frame.stack.pop() for _ in range(operand)][::-1]
            func = frame.stack.pop()

            if callable(func):
                result = func(*args)
                frame.stack.append(result)
            else:
                result = self._call_function(frame, func, args)
                frame.stack.append(result)

        elif opcode == OpCode.RETURN:
            value = frame.stack.pop() if frame.stack else None
            raise ReturnValue(value)

        elif opcode == OpCode.MAKE_FUNCTION:
            func_data = frame.constants[operand]
            frame.stack.append(func_data)

        elif opcode == OpCode.POP:
            frame.stack.pop()

        elif opcode == OpCode.DUP:
            frame.stack.append(frame.stack[-1])

        elif opcode == OpCode.PRINT:
            value = frame.stack.pop()
            print(value)

    def _call_function(self, parent_frame: Frame, func_data: Dict, args: List[Any]) -> Any:
        frame = Frame(
            code=func_data["code"],
            constants=func_data.get("constants", []),
            names=func_data.get("names", []),
            parent=parent_frame
        )

        for param, arg in zip(func_data["params"], args):
            frame.locals[param] = arg

        try:
            self.run(frame)
        except ReturnValue as ret:
            return ret.value

        return None

学术注记:编译器设计是计算机科学的核心领域之一。现代编译器通常采用多阶段编译架构:前端(词法分析→语法分析→语义分析)→中端(中间表示→优化)→后端(代码生成)。LLVM框架的成功证明了这种分层架构的强大之处——前端可以针对多种语言,后端可以针对多种目标平台。


54.5 本章小结

本章详细介绍了Python编译器与解释器的核心概念和实践:

  1. 词法分析:Token定义、词法分析器、关键字处理
  2. 语法分析:AST节点、递归下降解析、表达式解析
  3. 解释器:环境管理、AST遍历、求值执行
  4. 内置函数:print、len、type等内置功能
  5. 字节码编译:指令集设计、编译器实现
  6. 虚拟机:栈式执行、函数调用、控制流

54.5.1 知识图谱

编译器与解释器
├── 词法分析
│   ├── Token定义(类型、值、位置)
│   ├── 有限状态机(DFA、NFA)
│   ├── 正则表达式匹配
│   └── 关键字与标识符识别
├── 语法分析
│   ├── 文法定义(BNF、EBNF)
│   ├── 递归下降解析
│   ├── 算符优先解析
│   └── AST构建与遍历
├── 语义分析
│   ├── 符号表管理
│   ├── 类型检查
│   ├── 作用域分析
│   └── 语义错误检测
├── 中间表示
│   ├── 三地址码
│   ├── SSA形式
│   ├── 控制流图
│   └── 数据流分析
├── 代码生成
│   ├── 字节码生成
│   ├── 目标代码生成
│   ├── 寄存器分配
│   └── 指令调度
└── 运行时系统
    ├── 解释器执行
    ├── 虚拟机实现
    ├── 内存管理
    └── 垃圾回收

54.5.2 最佳实践清单

场景推荐做法避免做法
词法分析使用正则表达式或状态机手写字符匹配
语法分析递归下降或解析器生成器完全手写解析
错误处理提供详细的错误位置和信息简单抛出异常
AST设计使用访问者模式遍历在节点类中硬编码逻辑
内存管理对象池复用AST节点频繁创建销毁对象
性能优化字节码编译+虚拟机执行纯AST解释执行
类型系统渐进式类型检查无类型或全静态类型

54.5.3 技术选型指南

需求场景推荐方案备选方案
学习编译原理手写递归下降解析器ANTLR
快速原型Python ast模块PLY
生产级解析器ANTLR、LarkPLY、PyParsing
DSL开发Lark、textX自定义解析器
字节码编译自定义指令集Python字节码
性能关键编译到C/RustNumba JIT
嵌入式语言Lua、Tcl自定义语言

练习题

基础题

  1. 扩展词法分析器,支持+=-=*=/=等复合赋值运算符。

  2. 为语法分析器添加对switch/case语句的支持。

  3. 实现一个简单的类型检查器,在编译时检测类型错误。

进阶题

  1. 实现一个完整的闭包系统,支持函数捕获外部变量。

  2. 开发一个简单的优化器,实现常量折叠和死代码消除。

  3. 实现一个垃圾回收器,支持标记-清除算法。

项目实践

  1. 迷你编程语言:实现一个完整的迷你编程语言,要求:
    • 支持变量、函数、条件、循环
    • 实现词法分析、语法分析、解释执行
    • 支持闭包和递归
    • 提供基本的错误处理
    • 实现简单的标准库(数学函数、字符串操作)

思考题

  1. 解释器与编译器的主要区别是什么?各自的优缺点是什么?

  2. 为什么现代语言实现(如Python、Java)通常采用字节码+虚拟机的架构?

  3. JIT编译器是如何工作的?它如何平衡解释执行的灵活性和编译执行的性能?

扩展阅读

54.6.1 编译原理

  • 《编译原理》(龙书) — 编译器设计的经典教材
  • 《Engineering a Compiler》 — 现代编译器工程实践
  • 《Crafting Interpreters》 (https://craftinginterpreters.com/) — 解释器实现指南
  • 《Writing An Interpreter In Go》 — 从零实现解释器

54.6.2 解析器工具

54.6.3 Python内部机制

54.6.4 虚拟机设计

54.6.5 进阶书籍

  • 《程序设计语言理论基础》 — PLT理论基础
  • 《Types and Programming Languages》 — 类型系统理论
  • 《Advanced Compiler Design and Implementation》 — 高级编译技术
  • 《Garbage Collection》 — 垃圾回收算法

下一章:第55章 高性能计算

Python技术丛书 - 江苏省宿城中等专业学校