第54章 编译器与解释器
学习目标
完成本章学习后,你将能够:
- 理解编译原理:词法分析、语法分析、语义分析、代码生成
- 实现词法分析器:正则表达式、有限状态机、Token生成
- 实现语法分析器:上下文无关文法、递归下降、AST构建
- 构建解释器:AST遍历、环境管理、求值执行
- 开发领域特定语言:DSL设计、语法设计、解释执行
- 实现编译器后端:中间表示、优化、目标代码生成
- 进行代码分析:静态分析、类型检查、代码转换
- 构建字节码虚拟机:指令集设计、栈式虚拟机、执行引擎
54.1 词法分析
54.1.1 Token定义
python
from dataclasses import dataclass
from typing import List, Optional, Dict, Any
from enum import Enum, auto
import re
class TokenType(Enum):
NUMBER = auto()
STRING = auto()
IDENTIFIER = auto()
KEYWORD = auto()
OPERATOR = auto()
DELIMITER = auto()
EOF = auto()
PLUS = auto()
MINUS = auto()
MULTIPLY = auto()
DIVIDE = auto()
ASSIGN = auto()
EQUAL = auto()
NOT_EQUAL = auto()
LESS = auto()
GREATER = auto()
LESS_EQUAL = auto()
GREATER_EQUAL = auto()
LPAREN = auto()
RPAREN = auto()
LBRACE = auto()
RBRACE = auto()
LBRACKET = auto()
RBRACKET = auto()
COMMA = auto()
SEMICOLON = auto()
DOT = auto()
COLON = auto()
IF = auto()
ELSE = auto()
WHILE = auto()
FOR = auto()
FUNCTION = auto()
RETURN = auto()
VAR = auto()
LET = auto()
CONST = auto()
TRUE = auto()
FALSE = auto()
NULL = auto()
AND = auto()
OR = auto()
NOT = auto()
@dataclass
class Token:
type: TokenType
value: Any
line: int
column: int
def __repr__(self) -> str:
return f"Token({self.type.name}, {self.value!r}, {self.line}:{self.column})"
KEYWORDS: Dict[str, TokenType] = {
"if": TokenType.IF,
"else": TokenType.ELSE,
"while": TokenType.WHILE,
"for": TokenType.FOR,
"function": TokenType.FUNCTION,
"return": TokenType.RETURN,
"var": TokenType.VAR,
"let": TokenType.LET,
"const": TokenType.CONST,
"true": TokenType.TRUE,
"false": TokenType.FALSE,
"null": TokenType.NULL,
"and": TokenType.AND,
"or": TokenType.OR,
"not": TokenType.NOT,
}
class Lexer:
def __init__(self, source: str):
self.source = source
self.pos = 0
self.line = 1
self.column = 1
self.tokens: List[Token] = []
def tokenize(self) -> List[Token]:
while self.pos < len(self.source):
self._skip_whitespace()
if self.pos >= len(self.source):
break
char = self.source[self.pos]
if char.isdigit():
self._read_number()
elif char == '"' or char == "'":
self._read_string(char)
elif char.isalpha() or char == '_':
self._read_identifier()
else:
self._read_symbol()
self.tokens.append(Token(TokenType.EOF, None, self.line, self.column))
return self.tokens
def _skip_whitespace(self) -> None:
while self.pos < len(self.source):
char = self.source[self.pos]
if char in ' \t\r':
self._advance()
elif char == '\n':
self.line += 1
self.column = 1
self.pos += 1
elif char == '/' and self.pos + 1 < len(self.source):
if self.source[self.pos + 1] == '/':
self._skip_line_comment()
elif self.source[self.pos + 1] == '*':
self._skip_block_comment()
else:
break
else:
break
def _skip_line_comment(self) -> None:
while self.pos < len(self.source) and self.source[self.pos] != '\n':
self._advance()
def _skip_block_comment(self) -> None:
self._advance()
self._advance()
while self.pos < len(self.source) - 1:
if self.source[self.pos] == '*' and self.source[self.pos + 1] == '/':
self._advance()
self._advance()
break
self._advance()
def _advance(self) -> str:
char = self.source[self.pos]
self.pos += 1
self.column += 1
return char
def _read_number(self) -> None:
start_col = self.column
num_str = ""
while self.pos < len(self.source) and (self.source[self.pos].isdigit() or self.source[self.pos] == '.'):
num_str += self._advance()
if '.' in num_str:
value = float(num_str)
else:
value = int(num_str)
self.tokens.append(Token(TokenType.NUMBER, value, self.line, start_col))
def _read_string(self, quote: str) -> None:
start_col = self.column
self._advance()
value = ""
while self.pos < len(self.source) and self.source[self.pos] != quote:
if self.source[self.pos] == '\\':
self._advance()
if self.pos < len(self.source):
escape_char = self._advance()
if escape_char == 'n':
value += '\n'
elif escape_char == 't':
value += '\t'
elif escape_char == '\\':
value += '\\'
elif escape_char == quote:
value += quote
else:
value += escape_char
else:
value += self._advance()
if self.pos < len(self.source):
self._advance()
self.tokens.append(Token(TokenType.STRING, value, self.line, start_col))
def _read_identifier(self) -> None:
start_col = self.column
value = ""
while self.pos < len(self.source) and (self.source[self.pos].isalnum() or self.source[self.pos] == '_'):
value += self._advance()
token_type = KEYWORDS.get(value, TokenType.IDENTIFIER)
self.tokens.append(Token(token_type, value, self.line, start_col))
def _read_symbol(self) -> None:
start_col = self.column
char = self._advance()
symbols: Dict[str, TokenType] = {
'+': TokenType.PLUS,
'-': TokenType.MINUS,
'*': TokenType.MULTIPLY,
'/': TokenType.DIVIDE,
'(': TokenType.LPAREN,
')': TokenType.RPAREN,
'{': TokenType.LBRACE,
'}': TokenType.RBRACE,
'[': TokenType.LBRACKET,
']': TokenType.RBRACKET,
',': TokenType.COMMA,
';': TokenType.SEMICOLON,
'.': TokenType.DOT,
':': TokenType.COLON,
}
double_symbols: Dict[str, TokenType] = {
'==': TokenType.EQUAL,
'!=': TokenType.NOT_EQUAL,
'<=': TokenType.LESS_EQUAL,
'>=': TokenType.GREATER_EQUAL,
'&&': TokenType.AND,
'||': TokenType.OR,
}
if self.pos < len(self.source):
two_char = char + self.source[self.pos]
if two_char in double_symbols:
self._advance()
self.tokens.append(Token(double_symbols[two_char], two_char, self.line, start_col))
return
if char == '=':
self.tokens.append(Token(TokenType.ASSIGN, char, self.line, start_col))
elif char == '<':
self.tokens.append(Token(TokenType.LESS, char, self.line, start_col))
elif char == '>':
self.tokens.append(Token(TokenType.GREATER, char, self.line, start_col))
elif char == '!':
self.tokens.append(Token(TokenType.NOT, char, self.line, start_col))
elif char in symbols:
self.tokens.append(Token(symbols[char], char, self.line, start_col))
else:
raise SyntaxError(f"Unexpected character: {char} at line {self.line}, column {start_col}")54.2 语法分析
54.2.1 AST节点
python
from dataclasses import dataclass, field
from typing import List, Optional, Any, Union
from abc import ABC, abstractmethod
class ASTNode(ABC):
pass
@dataclass
class Program(ASTNode):
statements: List[ASTNode]
@dataclass
class NumberLiteral(ASTNode):
value: Union[int, float]
@dataclass
class StringLiteral(ASTNode):
value: str
@dataclass
class BooleanLiteral(ASTNode):
value: bool
@dataclass
class NullLiteral(ASTNode):
pass
@dataclass
class Identifier(ASTNode):
name: str
@dataclass
class BinaryExpression(ASTNode):
operator: str
left: ASTNode
right: ASTNode
@dataclass
class UnaryExpression(ASTNode):
operator: str
operand: ASTNode
@dataclass
class CallExpression(ASTNode):
callee: ASTNode
arguments: List[ASTNode]
@dataclass
class MemberExpression(ASTNode):
object: ASTNode
property: ASTNode
computed: bool = False
@dataclass
class ArrayExpression(ASTNode):
elements: List[ASTNode]
@dataclass
class ObjectExpression(ASTNode):
properties: List[tuple]
@dataclass
class VariableDeclaration(ASTNode):
kind: str
name: str
init: Optional[ASTNode]
@dataclass
class AssignmentExpression(ASTNode):
left: ASTNode
right: ASTNode
@dataclass
class IfStatement(ASTNode):
condition: ASTNode
consequent: ASTNode
alternate: Optional[ASTNode]
@dataclass
class WhileStatement(ASTNode):
condition: ASTNode
body: ASTNode
@dataclass
class ForStatement(ASTNode):
init: Optional[ASTNode]
condition: Optional[ASTNode]
update: Optional[ASTNode]
body: ASTNode
@dataclass
class FunctionDeclaration(ASTNode):
name: str
params: List[str]
body: ASTNode
@dataclass
class ReturnStatement(ASTNode):
value: Optional[ASTNode]
@dataclass
class BlockStatement(ASTNode):
statements: List[ASTNode]
@dataclass
class ExpressionStatement(ASTNode):
expression: ASTNode
class Parser:
def __init__(self, tokens: List[Token]):
self.tokens = tokens
self.pos = 0
def parse(self) -> Program:
statements = []
while not self._is_at_end():
stmt = self._parse_statement()
if stmt:
statements.append(stmt)
return Program(statements)
def _current(self) -> Token:
return self.tokens[self.pos]
def _peek(self, offset: int = 1) -> Token:
pos = self.pos + offset
if pos < len(self.tokens):
return self.tokens[pos]
return self.tokens[-1]
def _is_at_end(self) -> bool:
return self._current().type == TokenType.EOF
def _advance(self) -> Token:
if not self._is_at_end():
self.pos += 1
return self.tokens[self.pos - 1]
def _match(self, *types: TokenType) -> bool:
for t in types:
if self._current().type == t:
return True
return False
def _consume(self, type: TokenType, message: str) -> Token:
if self._current().type == type:
return self._advance()
raise SyntaxError(f"{message} at line {self._current().line}")
def _parse_statement(self) -> Optional[ASTNode]:
if self._match(TokenType.VAR, TokenType.LET, TokenType.CONST):
return self._parse_variable_declaration()
elif self._match(TokenType.IF):
return self._parse_if_statement()
elif self._match(TokenType.WHILE):
return self._parse_while_statement()
elif self._match(TokenType.FOR):
return self._parse_for_statement()
elif self._match(TokenType.FUNCTION):
return self._parse_function_declaration()
elif self._match(TokenType.RETURN):
return self._parse_return_statement()
elif self._match(TokenType.LBRACE):
return self._parse_block_statement()
else:
return self._parse_expression_statement()
def _parse_variable_declaration(self) -> VariableDeclaration:
kind = self._advance().value
name = self._consume(TokenType.IDENTIFIER, "Expected variable name").value
init = None
if self._match(TokenType.ASSIGN):
self._advance()
init = self._parse_expression()
if self._match(TokenType.SEMICOLON):
self._advance()
return VariableDeclaration(kind, name, init)
def _parse_if_statement(self) -> IfStatement:
self._advance()
self._consume(TokenType.LPAREN, "Expected '(' after 'if'")
condition = self._parse_expression()
self._consume(TokenType.RPAREN, "Expected ')' after condition")
consequent = self._parse_statement()
alternate = None
if self._match(TokenType.ELSE):
self._advance()
alternate = self._parse_statement()
return IfStatement(condition, consequent, alternate)
def _parse_while_statement(self) -> WhileStatement:
self._advance()
self._consume(TokenType.LPAREN, "Expected '(' after 'while'")
condition = self._parse_expression()
self._consume(TokenType.RPAREN, "Expected ')' after condition")
body = self._parse_statement()
return WhileStatement(condition, body)
def _parse_for_statement(self) -> ForStatement:
self._advance()
self._consume(TokenType.LPAREN, "Expected '(' after 'for'")
init = None
if not self._match(TokenType.SEMICOLON):
init = self._parse_variable_declaration()
else:
self._advance()
condition = None
if not self._match(TokenType.SEMICOLON):
condition = self._parse_expression()
self._consume(TokenType.SEMICOLON, "Expected ';' after condition")
update = None
if not self._match(TokenType.RPAREN):
update = self._parse_expression()
self._consume(TokenType.RPAREN, "Expected ')' after for clauses")
body = self._parse_statement()
return ForStatement(init, condition, update, body)
def _parse_function_declaration(self) -> FunctionDeclaration:
self._advance()
name = self._consume(TokenType.IDENTIFIER, "Expected function name").value
self._consume(TokenType.LPAREN, "Expected '(' after function name")
params = []
if not self._match(TokenType.RPAREN):
params.append(self._consume(TokenType.IDENTIFIER, "Expected parameter name").value)
while self._match(TokenType.COMMA):
self._advance()
params.append(self._consume(TokenType.IDENTIFIER, "Expected parameter name").value)
self._consume(TokenType.RPAREN, "Expected ')' after parameters")
body = self._parse_block_statement()
return FunctionDeclaration(name, params, body)
def _parse_return_statement(self) -> ReturnStatement:
self._advance()
value = None
if not self._match(TokenType.SEMICOLON, TokenType.RBRACE):
value = self._parse_expression()
if self._match(TokenType.SEMICOLON):
self._advance()
return ReturnStatement(value)
def _parse_block_statement(self) -> BlockStatement:
self._advance()
statements = []
while not self._match(TokenType.RBRACE) and not self._is_at_end():
stmt = self._parse_statement()
if stmt:
statements.append(stmt)
self._consume(TokenType.RBRACE, "Expected '}' after block")
return BlockStatement(statements)
def _parse_expression_statement(self) -> ExpressionStatement:
expr = self._parse_expression()
if self._match(TokenType.SEMICOLON):
self._advance()
return ExpressionStatement(expr)
def _parse_expression(self) -> ASTNode:
return self._parse_assignment()
def _parse_assignment(self) -> ASTNode:
expr = self._parse_or()
if self._match(TokenType.ASSIGN):
self._advance()
value = self._parse_assignment()
return AssignmentExpression(expr, value)
return expr
def _parse_or(self) -> ASTNode:
left = self._parse_and()
while self._match(TokenType.OR):
op = self._advance().value
right = self._parse_and()
left = BinaryExpression(op, left, right)
return left
def _parse_and(self) -> ASTNode:
left = self._parse_equality()
while self._match(TokenType.AND):
op = self._advance().value
right = self._parse_equality()
left = BinaryExpression(op, left, right)
return left
def _parse_equality(self) -> ASTNode:
left = self._parse_comparison()
while self._match(TokenType.EQUAL, TokenType.NOT_EQUAL):
op = self._advance().value
right = self._parse_comparison()
left = BinaryExpression(op, left, right)
return left
def _parse_comparison(self) -> ASTNode:
left = self._parse_term()
while self._match(TokenType.LESS, TokenType.GREATER, TokenType.LESS_EQUAL, TokenType.GREATER_EQUAL):
op = self._advance().value
right = self._parse_term()
left = BinaryExpression(op, left, right)
return left
def _parse_term(self) -> ASTNode:
left = self._parse_factor()
while self._match(TokenType.PLUS, TokenType.MINUS):
op = self._advance().value
right = self._parse_factor()
left = BinaryExpression(op, left, right)
return left
def _parse_factor(self) -> ASTNode:
left = self._parse_unary()
while self._match(TokenType.MULTIPLY, TokenType.DIVIDE):
op = self._advance().value
right = self._parse_unary()
left = BinaryExpression(op, left, right)
return left
def _parse_unary(self) -> ASTNode:
if self._match(TokenType.NOT, TokenType.MINUS):
op = self._advance().value
operand = self._parse_unary()
return UnaryExpression(op, operand)
return self._parse_call()
def _parse_call(self) -> ASTNode:
expr = self._parse_primary()
while True:
if self._match(TokenType.LPAREN):
self._advance()
args = self._parse_arguments()
self._consume(TokenType.RPAREN, "Expected ')' after arguments")
expr = CallExpression(expr, args)
elif self._match(TokenType.DOT):
self._advance()
prop = self._consume(TokenType.IDENTIFIER, "Expected property name")
expr = MemberExpression(expr, Identifier(prop.value), False)
elif self._match(TokenType.LBRACKET):
self._advance()
prop = self._parse_expression()
self._consume(TokenType.RBRACKET, "Expected ']' after index")
expr = MemberExpression(expr, prop, True)
else:
break
return expr
def _parse_arguments(self) -> List[ASTNode]:
args = []
if not self._match(TokenType.RPAREN):
args.append(self._parse_expression())
while self._match(TokenType.COMMA):
self._advance()
args.append(self._parse_expression())
return args
def _parse_primary(self) -> ASTNode:
if self._match(TokenType.NUMBER):
return NumberLiteral(self._advance().value)
elif self._match(TokenType.STRING):
return StringLiteral(self._advance().value)
elif self._match(TokenType.TRUE):
self._advance()
return BooleanLiteral(True)
elif self._match(TokenType.FALSE):
self._advance()
return BooleanLiteral(False)
elif self._match(TokenType.NULL):
self._advance()
return NullLiteral()
elif self._match(TokenType.IDENTIFIER):
return Identifier(self._advance().value)
elif self._match(TokenType.LPAREN):
self._advance()
expr = self._parse_expression()
self._consume(TokenType.RPAREN, "Expected ')' after expression")
return expr
elif self._match(TokenType.LBRACKET):
return self._parse_array()
elif self._match(TokenType.LBRACE):
return self._parse_object()
else:
raise SyntaxError(f"Unexpected token: {self._current().type} at line {self._current().line}")
def _parse_array(self) -> ArrayExpression:
self._advance()
elements = []
if not self._match(TokenType.RBRACKET):
elements.append(self._parse_expression())
while self._match(TokenType.COMMA):
self._advance()
if self._match(TokenType.RBRACKET):
break
elements.append(self._parse_expression())
self._consume(TokenType.RBRACKET, "Expected ']' after array elements")
return ArrayExpression(elements)
def _parse_object(self) -> ObjectExpression:
self._advance()
properties = []
if not self._match(TokenType.RBRACE):
key = self._parse_object_key()
self._consume(TokenType.COLON, "Expected ':' after key")
value = self._parse_expression()
properties.append((key, value))
while self._match(TokenType.COMMA):
self._advance()
if self._match(TokenType.RBRACE):
break
key = self._parse_object_key()
self._consume(TokenType.COLON, "Expected ':' after key")
value = self._parse_expression()
properties.append((key, value))
self._consume(TokenType.RBRACE, "Expected '}' after object properties")
return ObjectExpression(properties)
def _parse_object_key(self) -> ASTNode:
if self._match(TokenType.STRING):
return StringLiteral(self._advance().value)
elif self._match(TokenType.IDENTIFIER):
return StringLiteral(self._advance().value)
else:
raise SyntaxError(f"Expected property name at line {self._current().line}")54.3 解释器
54.3.1 执行引擎
python
from typing import Dict, Any, List, Optional, Callable
from dataclasses import dataclass, field
@dataclass
class Environment:
variables: Dict[str, Any] = field(default_factory=dict)
parent: Optional["Environment"] = None
def define(self, name: str, value: Any) -> None:
self.variables[name] = value
def get(self, name: str) -> Any:
if name in self.variables:
return self.variables[name]
if self.parent:
return self.parent.get(name)
raise NameError(f"Undefined variable: {name}")
def set(self, name: str, value: Any) -> None:
if name in self.variables:
self.variables[name] = value
return
if self.parent:
self.parent.set(name, value)
return
raise NameError(f"Undefined variable: {name}")
def has(self, name: str) -> bool:
if name in self.variables:
return True
if self.parent:
return self.parent.has(name)
return False
@dataclass
class ReturnValue:
value: Any
@dataclass
class Function:
name: str
params: List[str]
body: ASTNode
closure: Environment
class Interpreter:
def __init__(self):
self.global_env = Environment()
self._setup_builtins()
def _setup_builtins(self) -> None:
builtins = {
"print": lambda *args: print(*args),
"len": lambda x: len(x),
"type": lambda x: type(x).__name__,
"str": lambda x: str(x),
"int": lambda x: int(x),
"float": lambda x: float(x),
"abs": lambda x: abs(x),
"min": lambda *args: min(args),
"max": lambda *args: max(args),
"range": lambda *args: list(range(*args)),
}
for name, func in builtins.items():
self.global_env.define(name, func)
def interpret(self, program: Program) -> Any:
result = None
for stmt in program.statements:
result = self._execute(stmt, self.global_env)
if isinstance(result, ReturnValue):
return result.value
return result
def _execute(self, node: ASTNode, env: Environment) -> Any:
method_name = f"_execute_{type(node).__name__}"
method = getattr(self, method_name, self._execute_generic)
return method(node, env)
def _execute_generic(self, node: ASTNode, env: Environment) -> Any:
raise NotImplementedError(f"No execution method for {type(node).__name__}")
def _execute_Program(self, node: Program, env: Environment) -> Any:
result = None
for stmt in node.statements:
result = self._execute(stmt, env)
if isinstance(result, ReturnValue):
return result.value
return result
def _execute_NumberLiteral(self, node: NumberLiteral, env: Environment) -> Any:
return node.value
def _execute_StringLiteral(self, node: StringLiteral, env: Environment) -> Any:
return node.value
def _execute_BooleanLiteral(self, node: BooleanLiteral, env: Environment) -> Any:
return node.value
def _execute_NullLiteral(self, node: NullLiteral, env: Environment) -> Any:
return None
def _execute_Identifier(self, node: Identifier, env: Environment) -> Any:
return env.get(node.name)
def _execute_BinaryExpression(self, node: BinaryExpression, env: Environment) -> Any:
left = self._execute(node.left, env)
right = self._execute(node.right, env)
op = node.operator
if op == '+':
return left + right
elif op == '-':
return left - right
elif op == '*':
return left * right
elif op == '/':
return left / right
elif op == '==':
return left == right
elif op == '!=':
return left != right
elif op == '<':
return left < right
elif op == '>':
return left > right
elif op == '<=':
return left <= right
elif op == '>=':
return left >= right
elif op == 'and':
return left and right
elif op == 'or':
return left or right
raise RuntimeError(f"Unknown operator: {op}")
def _execute_UnaryExpression(self, node: UnaryExpression, env: Environment) -> Any:
operand = self._execute(node.operand, env)
if node.operator == '-':
return -operand
elif node.operator == 'not':
return not operand
raise RuntimeError(f"Unknown unary operator: {node.operator}")
def _execute_CallExpression(self, node: CallExpression, env: Environment) -> Any:
callee = self._execute(node.callee, env)
args = [self._execute(arg, env) for arg in node.arguments]
if callable(callee):
return callee(*args)
if isinstance(callee, Function):
return self._call_function(callee, args)
raise RuntimeError(f"Cannot call {type(callee).__name__}")
def _call_function(self, func: Function, args: List[Any]) -> Any:
local_env = Environment(parent=func.closure)
for param, arg in zip(func.params, args):
local_env.define(param, arg)
try:
self._execute(func.body, local_env)
except ReturnValue as ret:
return ret.value
return None
def _execute_MemberExpression(self, node: MemberExpression, env: Environment) -> Any:
obj = self._execute(node.object, env)
if node.computed:
prop = self._execute(node.property, env)
else:
prop = node.property.name
if isinstance(obj, dict):
return obj.get(prop)
elif isinstance(obj, list):
return obj[int(prop)]
elif hasattr(obj, prop):
return getattr(obj, prop)
raise RuntimeError(f"Cannot access property {prop}")
def _execute_ArrayExpression(self, node: ArrayExpression, env: Environment) -> Any:
return [self._execute(elem, env) for elem in node.elements]
def _execute_ObjectExpression(self, node: ObjectExpression, env: Environment) -> Any:
obj = {}
for key, value in node.properties:
k = self._execute(key, env)
v = self._execute(value, env)
obj[k] = v
return obj
def _execute_VariableDeclaration(self, node: VariableDeclaration, env: Environment) -> Any:
value = None
if node.init:
value = self._execute(node.init, env)
env.define(node.name, value)
return value
def _execute_AssignmentExpression(self, node: AssignmentExpression, env: Environment) -> Any:
value = self._execute(node.right, env)
if isinstance(node.left, Identifier):
env.set(node.left.name, value)
elif isinstance(node.left, MemberExpression):
obj = self._execute(node.left.object, env)
if node.left.computed:
prop = self._execute(node.left.property, env)
else:
prop = node.left.property.name
if isinstance(obj, dict):
obj[prop] = value
elif isinstance(obj, list):
obj[int(prop)] = value
return value
def _execute_IfStatement(self, node: IfStatement, env: Environment) -> Any:
condition = self._execute(node.condition, env)
if condition:
return self._execute(node.consequent, env)
elif node.alternate:
return self._execute(node.alternate, env)
return None
def _execute_WhileStatement(self, node: WhileStatement, env: Environment) -> Any:
result = None
while self._execute(node.condition, env):
result = self._execute(node.body, env)
if isinstance(result, ReturnValue):
return result
return result
def _execute_ForStatement(self, node: ForStatement, env: Environment) -> Any:
local_env = Environment(parent=env)
if node.init:
self._execute(node.init, local_env)
result = None
while True:
if node.condition:
if not self._execute(node.condition, local_env):
break
result = self._execute(node.body, local_env)
if isinstance(result, ReturnValue):
return result
if node.update:
self._execute(node.update, local_env)
return result
def _execute_FunctionDeclaration(self, node: FunctionDeclaration, env: Environment) -> Any:
func = Function(node.name, node.params, node.body, env)
env.define(node.name, func)
return func
def _execute_ReturnStatement(self, node: ReturnStatement, env: Environment) -> ReturnValue:
value = None
if node.value:
value = self._execute(node.value, env)
return ReturnValue(value)
def _execute_BlockStatement(self, node: BlockStatement, env: Environment) -> Any:
local_env = Environment(parent=env)
result = None
for stmt in node.statements:
result = self._execute(stmt, local_env)
if isinstance(result, ReturnValue):
return result
return result
def _execute_ExpressionStatement(self, node: ExpressionStatement, env: Environment) -> Any:
return self._execute(node.expression, env)54.4 字节码编译器
54.4.1 指令集设计
python
from dataclasses import dataclass
from typing import List, Any
from enum import Enum, auto
class OpCode(Enum):
LOAD_CONST = auto()
LOAD_NAME = auto()
STORE_NAME = auto()
BINARY_ADD = auto()
BINARY_SUB = auto()
BINARY_MUL = auto()
BINARY_DIV = auto()
COMPARE_EQ = auto()
COMPARE_NE = auto()
COMPARE_LT = auto()
COMPARE_GT = auto()
COMPARE_LE = auto()
COMPARE_GE = auto()
POP = auto()
DUP = auto()
JUMP = auto()
JUMP_IF_FALSE = auto()
JUMP_IF_TRUE = auto()
CALL = auto()
RETURN = auto()
MAKE_FUNCTION = auto()
BUILD_LIST = auto()
BUILD_MAP = auto()
GET_ITEM = auto()
SET_ITEM = auto()
UNARY_NEGATIVE = auto()
UNARY_NOT = auto()
PRINT = auto()
@dataclass
class Instruction:
opcode: OpCode
operand: Any = None
line: int = 0
def __repr__(self) -> str:
if self.operand is not None:
return f"{self.opcode.name} {self.operand}"
return self.opcode.name
class BytecodeCompiler:
def __init__(self):
self.instructions: List[Instruction] = []
self.constants: List[Any] = []
self.names: List[str] = []
self._current_line = 0
def compile(self, node: ASTNode) -> List[Instruction]:
self.instructions = []
self._compile(node)
return self.instructions
def _emit(self, opcode: OpCode, operand: Any = None) -> int:
instruction = Instruction(opcode, operand, self._current_line)
self.instructions.append(instruction)
return len(self.instructions) - 1
def _add_constant(self, value: Any) -> int:
if value not in self.constants:
self.constants.append(value)
return self.constants.index(value)
def _add_name(self, name: str) -> int:
if name not in self.names:
self.names.append(name)
return self.names.index(name)
def _compile(self, node: ASTNode) -> None:
method_name = f"_compile_{type(node).__name__}"
method = getattr(self, method_name, self._compile_generic)
method(node)
def _compile_generic(self, node: ASTNode) -> None:
raise NotImplementedError(f"No compile method for {type(node).__name__}")
def _compile_NumberLiteral(self, node: NumberLiteral) -> None:
idx = self._add_constant(node.value)
self._emit(OpCode.LOAD_CONST, idx)
def _compile_StringLiteral(self, node: StringLiteral) -> None:
idx = self._add_constant(node.value)
self._emit(OpCode.LOAD_CONST, idx)
def _compile_BooleanLiteral(self, node: BooleanLiteral) -> None:
idx = self._add_constant(node.value)
self._emit(OpCode.LOAD_CONST, idx)
def _compile_Identifier(self, node: Identifier) -> None:
idx = self._add_name(node.name)
self._emit(OpCode.LOAD_NAME, idx)
def _compile_BinaryExpression(self, node: BinaryExpression) -> None:
self._compile(node.left)
self._compile(node.right)
op_map = {
'+': OpCode.BINARY_ADD,
'-': OpCode.BINARY_SUB,
'*': OpCode.BINARY_MUL,
'/': OpCode.BINARY_DIV,
'==': OpCode.COMPARE_EQ,
'!=': OpCode.COMPARE_NE,
'<': OpCode.COMPARE_LT,
'>': OpCode.COMPARE_GT,
'<=': OpCode.COMPARE_LE,
'>=': OpCode.COMPARE_GE,
}
if node.operator in op_map:
self._emit(op_map[node.operator])
def _compile_IfStatement(self, node: IfStatement) -> None:
self._compile(node.condition)
else_jump = self._emit(OpCode.JUMP_IF_FALSE, None)
self._compile(node.consequent)
if node.alternate:
end_jump = self._emit(OpCode.JUMP, None)
self.instructions[else_jump].operand = len(self.instructions)
if node.alternate:
self._compile(node.alternate)
self.instructions[end_jump].operand = len(self.instructions)
def _compile_WhileStatement(self, node: WhileStatement) -> None:
start = len(self.instructions)
self._compile(node.condition)
exit_jump = self._emit(OpCode.JUMP_IF_FALSE, None)
self._compile(node.body)
self._emit(OpCode.JUMP, start)
self.instructions[exit_jump].operand = len(self.instructions)
def _compile_VariableDeclaration(self, node: VariableDeclaration) -> None:
if node.init:
self._compile(node.init)
else:
self._emit(OpCode.LOAD_CONST, self._add_constant(None))
idx = self._add_name(node.name)
self._emit(OpCode.STORE_NAME, idx)
def _compile_FunctionDeclaration(self, node: FunctionDeclaration) -> None:
for param in node.params:
self._add_name(param)
body_compiler = BytecodeCompiler()
body_instructions = body_compiler.compile(node.body)
func_data = {
"name": node.name,
"params": node.params,
"code": body_instructions,
"constants": body_compiler.constants,
"names": body_compiler.names
}
idx = self._add_constant(func_data)
self._emit(OpCode.MAKE_FUNCTION, idx)
name_idx = self._add_name(node.name)
self._emit(OpCode.STORE_NAME, name_idx)
def _compile_CallExpression(self, node: CallExpression) -> None:
self._compile(node.callee)
for arg in node.arguments:
self._compile(arg)
self._emit(OpCode.CALL, len(node.arguments))
def _compile_ReturnStatement(self, node: ReturnStatement) -> None:
if node.value:
self._compile(node.value)
else:
self._emit(OpCode.LOAD_CONST, self._add_constant(None))
self._emit(OpCode.RETURN)54.4.2 栈式虚拟机
python
from typing import List, Any, Dict, Optional
from dataclasses import dataclass, field
@dataclass
class Frame:
code: List[Instruction]
constants: List[Any]
names: List[str]
pc: int = 0
stack: List[Any] = field(default_factory=list)
locals: Dict[str, Any] = field(default_factory=dict)
parent: Optional["Frame"] = None
class VirtualMachine:
def __init__(self):
self.globals: Dict[str, Any] = {}
self._setup_builtins()
def _setup_builtins(self) -> None:
self.globals["print"] = lambda *args: print(*args)
self.globals["len"] = lambda x: len(x)
self.globals["type"] = lambda x: type(x).__name__
self.globals["str"] = lambda x: str(x)
self.globals["int"] = lambda x: int(x)
self.globals["float"] = lambda x: float(x)
self.globals["abs"] = lambda x: abs(x)
self.globals["min"] = lambda *args: min(args)
self.globals["max"] = lambda *args: max(args)
self.globals["range"] = lambda *args: list(range(*args))
def run(self, frame: Frame) -> Any:
while frame.pc < len(frame.code):
instruction = frame.code[frame.pc]
frame.pc += 1
self._execute_instruction(frame, instruction)
return frame.stack[-1] if frame.stack else None
def _execute_instruction(self, frame: Frame, instruction: Instruction) -> None:
opcode = instruction.opcode
operand = instruction.operand
if opcode == OpCode.LOAD_CONST:
frame.stack.append(frame.constants[operand])
elif opcode == OpCode.LOAD_NAME:
name = frame.names[operand]
if name in frame.locals:
frame.stack.append(frame.locals[name])
elif name in self.globals:
frame.stack.append(self.globals[name])
else:
raise NameError(f"Undefined variable: {name}")
elif opcode == OpCode.STORE_NAME:
name = frame.names[operand]
value = frame.stack.pop()
frame.locals[name] = value
elif opcode == OpCode.BINARY_ADD:
right = frame.stack.pop()
left = frame.stack.pop()
frame.stack.append(left + right)
elif opcode == OpCode.BINARY_SUB:
right = frame.stack.pop()
left = frame.stack.pop()
frame.stack.append(left - right)
elif opcode == OpCode.BINARY_MUL:
right = frame.stack.pop()
left = frame.stack.pop()
frame.stack.append(left * right)
elif opcode == OpCode.BINARY_DIV:
right = frame.stack.pop()
left = frame.stack.pop()
frame.stack.append(left / right)
elif opcode == OpCode.COMPARE_EQ:
right = frame.stack.pop()
left = frame.stack.pop()
frame.stack.append(left == right)
elif opcode == OpCode.COMPARE_LT:
right = frame.stack.pop()
left = frame.stack.pop()
frame.stack.append(left < right)
elif opcode == OpCode.JUMP:
frame.pc = operand
elif opcode == OpCode.JUMP_IF_FALSE:
value = frame.stack.pop()
if not value:
frame.pc = operand
elif opcode == OpCode.CALL:
args = [frame.stack.pop() for _ in range(operand)][::-1]
func = frame.stack.pop()
if callable(func):
result = func(*args)
frame.stack.append(result)
else:
result = self._call_function(frame, func, args)
frame.stack.append(result)
elif opcode == OpCode.RETURN:
value = frame.stack.pop() if frame.stack else None
raise ReturnValue(value)
elif opcode == OpCode.MAKE_FUNCTION:
func_data = frame.constants[operand]
frame.stack.append(func_data)
elif opcode == OpCode.POP:
frame.stack.pop()
elif opcode == OpCode.DUP:
frame.stack.append(frame.stack[-1])
elif opcode == OpCode.PRINT:
value = frame.stack.pop()
print(value)
def _call_function(self, parent_frame: Frame, func_data: Dict, args: List[Any]) -> Any:
frame = Frame(
code=func_data["code"],
constants=func_data.get("constants", []),
names=func_data.get("names", []),
parent=parent_frame
)
for param, arg in zip(func_data["params"], args):
frame.locals[param] = arg
try:
self.run(frame)
except ReturnValue as ret:
return ret.value
return None学术注记:编译器设计是计算机科学的核心领域之一。现代编译器通常采用多阶段编译架构:前端(词法分析→语法分析→语义分析)→中端(中间表示→优化)→后端(代码生成)。LLVM框架的成功证明了这种分层架构的强大之处——前端可以针对多种语言,后端可以针对多种目标平台。
54.5 本章小结
本章详细介绍了Python编译器与解释器的核心概念和实践:
- 词法分析:Token定义、词法分析器、关键字处理
- 语法分析:AST节点、递归下降解析、表达式解析
- 解释器:环境管理、AST遍历、求值执行
- 内置函数:print、len、type等内置功能
- 字节码编译:指令集设计、编译器实现
- 虚拟机:栈式执行、函数调用、控制流
54.5.1 知识图谱
编译器与解释器
├── 词法分析
│ ├── Token定义(类型、值、位置)
│ ├── 有限状态机(DFA、NFA)
│ ├── 正则表达式匹配
│ └── 关键字与标识符识别
├── 语法分析
│ ├── 文法定义(BNF、EBNF)
│ ├── 递归下降解析
│ ├── 算符优先解析
│ └── AST构建与遍历
├── 语义分析
│ ├── 符号表管理
│ ├── 类型检查
│ ├── 作用域分析
│ └── 语义错误检测
├── 中间表示
│ ├── 三地址码
│ ├── SSA形式
│ ├── 控制流图
│ └── 数据流分析
├── 代码生成
│ ├── 字节码生成
│ ├── 目标代码生成
│ ├── 寄存器分配
│ └── 指令调度
└── 运行时系统
├── 解释器执行
├── 虚拟机实现
├── 内存管理
└── 垃圾回收54.5.2 最佳实践清单
| 场景 | 推荐做法 | 避免做法 |
|---|---|---|
| 词法分析 | 使用正则表达式或状态机 | 手写字符匹配 |
| 语法分析 | 递归下降或解析器生成器 | 完全手写解析 |
| 错误处理 | 提供详细的错误位置和信息 | 简单抛出异常 |
| AST设计 | 使用访问者模式遍历 | 在节点类中硬编码逻辑 |
| 内存管理 | 对象池复用AST节点 | 频繁创建销毁对象 |
| 性能优化 | 字节码编译+虚拟机执行 | 纯AST解释执行 |
| 类型系统 | 渐进式类型检查 | 无类型或全静态类型 |
54.5.3 技术选型指南
| 需求场景 | 推荐方案 | 备选方案 |
|---|---|---|
| 学习编译原理 | 手写递归下降解析器 | ANTLR |
| 快速原型 | Python ast模块 | PLY |
| 生产级解析器 | ANTLR、Lark | PLY、PyParsing |
| DSL开发 | Lark、textX | 自定义解析器 |
| 字节码编译 | 自定义指令集 | Python字节码 |
| 性能关键 | 编译到C/Rust | Numba JIT |
| 嵌入式语言 | Lua、Tcl | 自定义语言 |
练习题
基础题
扩展词法分析器,支持
+=、-=、*=、/=等复合赋值运算符。为语法分析器添加对
switch/case语句的支持。实现一个简单的类型检查器,在编译时检测类型错误。
进阶题
实现一个完整的闭包系统,支持函数捕获外部变量。
开发一个简单的优化器,实现常量折叠和死代码消除。
实现一个垃圾回收器,支持标记-清除算法。
项目实践
- 迷你编程语言:实现一个完整的迷你编程语言,要求:
- 支持变量、函数、条件、循环
- 实现词法分析、语法分析、解释执行
- 支持闭包和递归
- 提供基本的错误处理
- 实现简单的标准库(数学函数、字符串操作)
思考题
解释器与编译器的主要区别是什么?各自的优缺点是什么?
为什么现代语言实现(如Python、Java)通常采用字节码+虚拟机的架构?
JIT编译器是如何工作的?它如何平衡解释执行的灵活性和编译执行的性能?
扩展阅读
54.6.1 编译原理
- 《编译原理》(龙书) — 编译器设计的经典教材
- 《Engineering a Compiler》 — 现代编译器工程实践
- 《Crafting Interpreters》 (https://craftinginterpreters.com/) — 解释器实现指南
- 《Writing An Interpreter In Go》 — 从零实现解释器
54.6.2 解析器工具
- ANTLR (https://www.antlr.org/) — 强大的解析器生成器
- Lark (https://lark-parser.readthedocs.io/) — 现代Python解析库
- PLY (https://www.dabeaz.com/ply/) — Python Lex-Yacc
- textX (https://textx.github.io/textX/) — DSL开发框架
54.6.3 Python内部机制
- Python AST模块 (https://docs.python.org/3/library/ast.html) — Python抽象语法树
- Python dis模块 (https://docs.python.org/3/library/dis.html) — 字节码反汇编
- Python inspect模块 (https://docs.python.org/3/library/inspect.html) — 内省工具
- CPython源码 (https://github.com/python/cpython) — Python官方实现
54.6.4 虚拟机设计
- LLVM教程 (https://llvm.org/docs/tutorial/) — LLVM编译器基础设施
- Lua虚拟机设计 (https://www.lua.org/doc.html) — 高效虚拟机实现
- JVM规范 (https://docs.oracle.com/javase/specs/) — Java虚拟机规范
- WebAssembly (https://webassembly.org/) — 现代字节码格式
54.6.5 进阶书籍
- 《程序设计语言理论基础》 — PLT理论基础
- 《Types and Programming Languages》 — 类型系统理论
- 《Advanced Compiler Design and Implementation》 — 高级编译技术
- 《Garbage Collection》 — 垃圾回收算法
下一章:第55章 高性能计算