You have reached the objective observer. I hope you will have an informative journey.
Steffen "RmbRT" Rattay
Ambassador, Kingdom of Heaven
Over the past months, I didn't really make much progress on my code; however, I have been thinking a lot about the CPU ISA I want to design for my language, and made good progress there. I have already settled on a lot of things:
- Registers are merely fast scratchpad storage; they aren't required at all. There are special instructions dedicated to loading immediate values or register values into the ALU, reducing the complexity of the actual ALU instructions. The result of the previous calculation automatically becomes the first argument of the next ALU instruction, and does not have to be written back to a register, allowing for very compact encodings of chained calculations, such as compiling ~LOAD4(a + b) into ARG0 a; ARG1 b; ADD; LOAD4; NOT. There are 16 registers. Since most operations are now operand-less, they are more compact and easier to decode, and the few that do have operands are also easier to decode because they are few in number.
- Registers and words are 5 bytes large, but all memory accesses from 1 to 5 bytes are supported. Because non-power-of-two data types are supported, there is also no alignment requirement for data, meaning no padding bytes are required within data structures, leading to better cache usage. 64-bit integers are too large for almost all practical use cases. 40-bit integers support over a trillion values and can address a TiB of memory with byte resolution, which is more than any reasonable system should ever require.
- Special loop instructions that allow for zero-overhead loop jumps: instead of decoding relative jump targets ahead of time and prefetching them, the CPU keeps track of each loop's start (and optionally its end) using a stack, and keeps the first instructions of the loop (and optionally the first instructions after the loop) decoded and available, enabling zero-friction loops without unrolling. Tracking the end of a loop is only required for loops that break from within. At least 2 loop nesting levels per stack frame are guaranteed to be supported this way.
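To make the accumulator-style chaining concrete, here is a minimal interpreter for it in Python. The instruction names are taken from the example above; the exact semantics (that an ALU result implicitly becomes the first operand of the next instruction, and the 40-bit word width) are my reading of the design, not a final specification:

```python
# Minimal interpreter for the accumulator-style ALU sketched above.
# ARG0/ARG1 load operands; ALU ops are operand-less; the result of each
# ALU op implicitly becomes ARG0 of the next one. Illustrative only.

MASK40 = (1 << 40) - 1  # 5-byte words

def run(program, memory):
    arg0 = arg1 = acc = 0
    for op, *operand in program:
        if op == "ARG0":
            arg0 = operand[0] & MASK40
        elif op == "ARG1":
            arg1 = operand[0] & MASK40
        elif op == "ADD":
            acc = (arg0 + arg1) & MASK40
            arg0 = acc          # result feeds the next instruction
        elif op == "LOAD4":     # load 4 bytes from the address in acc
            acc = int.from_bytes(memory[acc:acc + 4], "little")
            arg0 = acc
        elif op == "NOT":
            acc = ~acc & MASK40
            arg0 = acc
    return acc

# ~LOAD4(a + b) with a = 2, b = 3, and memory[5:9] holding the value 1:
mem = bytearray(16)
mem[5:9] = (1).to_bytes(4, "little")
result = run([("ARG0", 2), ("ARG1", 3), ("ADD",), ("LOAD4",), ("NOT",)], mem)
print(hex(result))  # -> 0xfffffffffe (bitwise NOT of 1 in a 40-bit word)
```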
Instead of the traditional multi-core paradigm, we leverage two single-core features for speed:
- There's a general SIMD mode, which works like a GPU shader in that it works on separate register sets and executes arbitrary ALU instructions, with the same constraint that all SIMD units execute the same instructions, meaning they cannot branch heterogeneously. The SIMD mode is not fully specified yet, but it guarantees at least 4 SIMD units.
- There will be no true multi-threading; instead, there will be short-lived micro-threads that are intended to speed up two independent computations within the critical path of a single procedure. This makes concurrency primitives/atomics superfluous, and will speed up the critical path, which is the greatest bottleneck for most computations. Micro-threads are forked/joined efficiently, are intended to pay off for anything over 5 ALU instructions, and can be spawned without requiring any OS or library calls, or any elevated permissions. At least two micro-threads (a master and at least one slave) are available.
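A rough model of the fork/join idea (the fork/join names and the closure-based "dispatch" are purely illustrative; the real mechanism would be an in-core hardware feature, not function calls):

```python
# Conceptual model of micro-thread fork/join (hypothetical API).
# The point: two independent subexpressions of one procedure run in
# parallel, then join; no OS threads, locks, or atomics are involved.
# Python closures stand in for the hardware mechanism here.

def fork(computation):
    """Hand an independent computation to a slave micro-thread.
    Modeled lazily: the 'slave' result is produced at join time."""
    return computation  # in hardware: dispatch to the slave's register set

def join(handle):
    """Wait until the slave finishes and fetch its result."""
    return handle()

def f(x, y):
    # The two factors below are independent, so they are worth a
    # micro-thread once each side exceeds ~5 ALU instructions.
    slave = fork(lambda: (y * y + 7) * y)   # runs on the slave
    master = (x * x + 3) * x                # master continues meanwhile
    return master + join(slave)

print(f(2, 3))  # -> 62, i.e. (4+3)*2 + (9+7)*3
```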
Parallel throughput is accelerated via SIMD mode on homogeneous data in tight loops. Linear throughput is accelerated via micro-threading, which replaces the need for dedicated out-of-order execution units. Of course, when stalling, future instructions can be executed ahead of time, but nothing fancy like branch prediction units is required, leading to simpler CPU circuitry. Compilers should manually direct out-of-order execution via micro-threads. Independent processes are implemented via pre-emptive scheduling, while independent threads are implemented via cooperatively scheduled green threads.
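The cooperatively scheduled green threads mentioned above can be modelled with generators; this round-robin sketch is an assumption about how such a runtime could interleave tasks, not the actual design:

```python
from collections import deque

def scheduler(tasks):
    """Round-robin cooperative scheduler: each task runs until it
    yields, then the next task gets the core. No pre-emption."""
    queue = deque(tasks)
    order = []
    while queue:
        task = queue.popleft()
        try:
            order.append(next(task))  # run until the task yields
            queue.append(task)        # re-queue: it yielded voluntarily
        except StopIteration:
            pass                      # task finished
    return order

def worker(name, steps):
    for i in range(steps):
        yield f"{name}{i}"  # yield control back to the scheduler

print(scheduler([worker("a", 2), worker("b", 3)]))
# -> ['a0', 'b0', 'a1', 'b1', 'b2']
```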
Why no multi-core support? Once you achieve full ALU pipeline utilisation using SIMD and micro-threads, you already have a massive performance boost on data and code that is highly localised, meaning all available cache capacity is put to good use and reduces memory bandwidth consumption. Once you go multi-core, you effectively halve (or worse) the cache capacity available per core (for the same amount of silicon), and you require twice the previous memory bandwidth for instruction and data fetching in order to avoid slow-downs. Instead of running two cores that interfere with each other, we aim to make the single core we have as fast as possible, fully consuming all memory bandwidth and cache capacity for a single stream of execution. And once we have maxed out the memory bandwidth with a single thread, it makes no sense to add another core anyway. Instead, we would add more SIMD units, redundant ALUs for faster instruction dispatch, larger ALUs for faster processing of complex instructions, a bigger cache, deeper loop nesting support, etc. All of these scale better with respect to memory bandwidth than just adding more cores.
The compiler rewrite is going well; I'm making good progress. Today, I seemingly finished the scoper stage of the compiler, next up are the symbol resolver, template instantiator, type checker/overload resolver, and code generation stages. Due to the better compiler architecture, adding new stages has very little overhead, and only the actually new parts have to be written.
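The low overhead of adding stages can be illustrated with a simple pass-pipeline pattern; the stage names are from the post, but their interfaces and the dict-based IR here are my own placeholder assumptions:

```python
# Sketch of a stage pipeline where adding a new stage is cheap: each
# stage is a function from the previous stage's output to its own IR.
# Stage names follow the post; the dict "IR" is a placeholder.

def scoper(ast):                return {"scoped": ast}
def symbol_resolver(ir):        return {**ir, "symbols": "resolved"}
def template_instantiator(ir):  return {**ir, "templates": "instantiated"}
def type_checker(ir):           return {**ir, "types": "checked"}
def code_generator(ir):         return {**ir, "code": "emitted"}

# Adding a stage = inserting one entry here; nothing else changes.
PIPELINE = [scoper, symbol_resolver, template_instantiator,
            type_checker, code_generator]

def run_pipeline(ast):
    ir = ast
    for stage in PIPELINE:
        ir = stage(ir)
    return ir

print(run_pipeline("parse tree")["code"])  # -> emitted
```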
The paper for a clock-based alternative to blockchains is on hold until further notice because I hit a major roadblock with my design. I realised that while it works in clearly honest and clearly adversarial scenarios, it breaks down when someone is right on the edge of the allowed clock discrepancy: decisions cannot be clearly categorised as honest or malicious by all honest participants, and disputes cannot be resolved without voting, or at least I didn't find a way to do so. I will still look into it from time to time and try out new approaches, since I really like the project, but for now, I'll focus entirely on my compiler until I get some new inspiration. I'm looking into web-of-trust designs similar to PGP, which may also be viable. I'm also rethinking the transaction model: whether I need a total global ordering of all transactions, or whether something like a causally consistent ordering is sufficient, as well as the general system design that would arise from that.
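For the causal-ordering direction, vector clocks are the standard construction: they give a partial order that captures causality without forcing all participants to agree on one total order. This is an illustration of the general technique, not the paper's design:

```python
# Vector clocks: a standard way to get a causal (partial) order on
# transactions without agreeing on a total global order. Purely an
# illustration of the design space, not the on-hold paper's mechanism.

def happened_before(a, b):
    """a causally precedes b iff a <= b componentwise and a != b."""
    return all(x <= y for x, y in zip(a, b)) and a != b

def concurrent(a, b):
    """Neither precedes the other: no ordering between them is needed."""
    return not happened_before(a, b) and not happened_before(b, a)

t1 = (1, 0, 0)  # clock of a transaction issued at node 0
t2 = (1, 1, 0)  # node 1 saw t1, then issued t2
t3 = (0, 0, 1)  # node 2, issued independently

print(happened_before(t1, t2))  # -> True: t2 causally depends on t1
print(concurrent(t2, t3))       # -> True: no ordering required
```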
P.S.: I'm probably also going to write a tool for publishing small blog posts like this and putting them into an RSS feed.