A different way to share the memory bus between the CPU and the Video

Considering the ZX Spectrum, part of the memory is accessible to both the ULA and the CPU, and the CPU is slowed down when it is using that area so that the framebuffer can be read out. As I understand it, some Amigas also have a region of memory which slows the CPU down when it reads from or writes to it, because that memory is connected not only to the CPU but also to the video, sound, etc.



But it has occurred to me that the vast majority of interaction with the framebuffer on these systems will be writes. Only rarely will it be necessary to read from the framebuffer. So it seems to me that whenever the ZX Spectrum wrote a byte to the 8 KB containing the display file, instead of slowing the CPU down to allow that to happen, the value and the address could have been latched, and the write to memory could happen later on, say, when the Z80 is fetching the next instruction. It should be possible so long as the video memory is on a separate bus from the rest of memory. (Of course, it would require a little extra circuitry, and would possibly preclude executing code from this region, but it's probably worth doing, right?)



So what's with this design decision? The ZX Spectrum, the Amiga, the Commodore 64, all slow the CPU down so that the video can be read! Did any retrocomputing system buffer the write so as to let the CPU run at full speed whenever it's not reading from the framebuffer (which is practically always)?










  • I'm curious about a variant of this - interleaved/banked DRAM and two busses, so the slowdown is only necessary when the video readout and the CPU access the same bank. Shouldn't need that much extra logic. Of course programs with cycle-exact timing on this would have been a PITA. And DRAM refresh still needs to be taken into account. Maybe I should make that a question, but I don't think that model was ever used...
    – dirkt
    3 hours ago










  • @dirkt AFAIK that's already the case with a 48 KiB Spectrum due to its two banks, of which only the first (16 KiB) is also used by video. IIRC, there was already a question about the way the Spectrum does video access, highlighting the mechanics.
    – Raffzahn
    2 hours ago











  • @Raffzahn do you mean this one? retrocomputing.stackexchange.com/questions/7133/…
    – Wilson
    2 hours ago










  • @Wilson Nope, there was another one, more specific about the sharing scheme. I remember having looked at the schematics for that. Somehow I fail to come up with the right keywords for searching - and there are already way too many questions (and answers) on RC to just browse. We should stop adding more.
    – Raffzahn
    2 hours ago











  • The very first assumption, 'usually VRAM is only written to', is actually wrong: ZX Spectrum, Commodore 64, Amiga 500 (to name a few) all massively execute code from VRAM (i.e. from the RAM also used for video display fetches). Otherwise you are right: if the CPU only writes to VRAM (meaning relatively infrequently in comparison with video fetches), those writes could be done without waits for the CPU most of the time.
    – lvd
    7 mins ago














hardware






edited 6 hours ago
asked 6 hours ago
Wilson











5 Answers














So what's with this design decision?




Not as helpful as it looks at first sight.



For one, it needs at least three chips for latching the data (8-bit data and an 11- to 16-bit address), plus a considerable number to handle the access whenever there is time, plus some muxes to switch the buses, as the CPU still needs to be able to read data from screen memory. That's quite a lot of pins to solder (and pay for).



Second, the gain is rather meager. Only a single (write) access gets buffered. As soon as there is a second one (before the screen memory can be accessed again), the CPU gets halted, just as it would without the buffer. Equally important, the CPU still gets put on hold for a read access.



Timing



A Line



To validate this, it helps to take a look at (average) screen timing. Let's assume a system where the video part needs the whole RAM bandwidth during display (or at least so much that there can't be any CPU access in between). Within a picture this is only true while a scan line is being displayed. Let's stay with a 1980s TV-sized picture. Here a line is defined as 64 (63.6 for NTSC) µs, of which 12 (10.9) µs are used for synchronisation and 52 (52.7) µs are for a potentially visible signal. Let's just assume the VDC uses all of that.



52 µs is quite some time in a CPU's life, and more than enough to attempt to write more than one byte to screen memory. On a 1 MHz 6502 that equals up to 6–8 sequential writes in a tight copy loop. A 4 MHz Z80 can likewise do up to 9 writes during that time. (For simplicity, let's assume it takes 10 µs per single meaningful write.) That's about the maximum, and it will of course overrun the single-transaction buffer. To really use it, some FIFO for address and data is needed, increasing cost again.



Of course, even a FIFO would only postpone the access into the gap between lines. Then again, memory fast enough for either of these CPUs will be able to squeeze a good 20 write accesses into the line retrace. So yes, such a FIFO could resolve it and give the CPU seemingly unhindered access. As noted, that's only true as long as there is no intermediate read access, which would halt the CPU again. So no bitblit, please. Not to mention that there needs to be some priority logic to maybe let a waiting read slip in at the beginning of a line gap before the buffered writes are done, so the CPU only needs to wait for a minimum … err … no, bad idea, as it could deliver old data that has already been changed but not yet committed. Drats.
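To make the stale-read problem concrete, here is a minimal Python sketch of such a posted-write FIFO. It is only a behavioural model under my own assumptions: the names `WriteFifo`, `cpu_write`, `drain`, `read_bypassing` and `read_safe` are made up for illustration and do not describe any real chip or circuit.

```python
from collections import deque


class WriteFifo:
    """Hypothetical posted-write buffer between a CPU and screen RAM."""

    def __init__(self, depth):
        self.depth = depth
        self.pending = deque()  # (address, value) pairs not yet committed to RAM

    def cpu_write(self, address, value):
        """Return True if the write was posted, False if the CPU would have to wait."""
        if len(self.pending) >= self.depth:
            return False
        self.pending.append((address, value))
        return True

    def drain(self, ram, slots):
        """Commit up to `slots` pending writes, oldest first (e.g. during line retrace)."""
        done = 0
        while self.pending and done < slots:
            address, value = self.pending.popleft()
            ram[address] = value
            done += 1
        return done

    def read_bypassing(self, ram, address):
        """Read RAM directly, ignoring pending writes: this can return stale data."""
        return ram[address]

    def read_safe(self, ram, address):
        """Return the newest pending value for the address, else the RAM content."""
        for pending_address, value in reversed(self.pending):
            if pending_address == address:
                return value
        return ram[address]


ram = bytearray(16)
fifo = WriteFifo(depth=1)             # depth=1 models the single latch
first_ok = fifo.cpu_write(3, 0xAA)    # posted: the buffer was empty
second_ok = fifo.cpu_write(4, 0xBB)   # refused: the single entry is already in use
stale = fifo.read_bypassing(ram, 3)   # RAM still holds the old value
fresh = fifo.read_safe(ram, 3)        # value forwarded from the buffer
committed = fifo.drain(ram, slots=20)
print(first_ok, second_ok, stale, fresh, committed, ram[3])
```

Letting a read slip in ahead of the buffered writes corresponds to `read_bypassing`. Getting correct data needs either the address comparison in `read_safe`, which is yet more hardware, or waiting for the drain.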



The Whole Picture



A picture isn't just lines, but also a frame structure. There are 286 (243) visible lines for a total of 18.3 (15.5) ms within a 20 (16.6) ms frame (50/60 Hz). With our model CPUs that allows for a maximum of ~170 (110) bytes written outside the visible part. In addition, two writes per line can be done (*1), adding up to 456 (353) accesses per frame or 22.8 (21.2) KiB/s (*2).



Adding a single-access buffer would increase this by ~62 (68) percent to 37.1 (35.7) KiB/s. That doesn't sound bad. A 10-entry FIFO would even get it close to full speed (something like 100 KiB/s). Even better, isn't it? Except that this breaks down quite hard if the job is not just a tight copy loop, but maybe some bitblitting with decisions and transformations in between.
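The per-frame figures can be re-derived with a few lines of Python. This is a sketch under assumptions of mine: the quoted totals come out when counting one write per visible line without the buffer and two with it, and when reading "KiB" as 1000 bytes. The function and dictionary names are only illustrative.

```python
def writes_per_frame(visible_lines, blank_writes, writes_per_line):
    """Writes during blanking plus writes that fit around each visible line."""
    return blank_writes + visible_lines * writes_per_line


def rate_kb_per_s(writes, frames_per_second):
    """One byte per write, reported in units of 1000 bytes per second."""
    return writes * frames_per_second / 1000


# Figures taken from the text; PAL first, NTSC second.
SYSTEMS = {
    "PAL": {"visible_lines": 286, "blank_writes": 170, "fps": 50},
    "NTSC": {"visible_lines": 243, "blank_writes": 110, "fps": 60},
}

results = {}
for name, s in SYSTEMS.items():
    base = writes_per_frame(s["visible_lines"], s["blank_writes"], 1)
    buffered = writes_per_frame(s["visible_lines"], s["blank_writes"], 2)
    results[name] = {
        "base_writes": base,
        "buffered_writes": buffered,
        "base_rate": rate_kb_per_s(base, s["fps"]),
        "buffered_rate": rate_kb_per_s(buffered, s["fps"]),
        "gain_percent": 100 * (buffered - base) / base,
    }
    print(name, results[name])
```

Under these assumptions the base case gives 456 (353) writes per frame, and the buffered case gives 742 (596), a gain of roughly 62.7 (68.8) percent, which agrees with the quoted numbers to within rounding.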



So, What to do Instead?



Full Interleave



As long as the timing requirement isn't too tight (as in 'already using the fastest RAM affordable'), it's better to avoid collisions altogether by using RAM with double the bandwidth needed for screen refresh (or for the CPU, whichever is higher), so the CPU can access it with no (or little) speed penalty. This is very common on 6502 systems, where RAM has to be twice as fast as the CPU anyway. Then there is no additional cost for perfect access.



On the downside, this requires RAM that is always double the speed, even when that's not needed (during retrace).



Only Partial Take Over



The C64's VIC shows an in-between solution: it mostly acts when the CPU isn't accessing the RAM and only stops the CPU when it needs additional bandwidth.



Just Live With It



As seen in the whole-picture calculation, while we are talking about a 60+ percent speed-up in tight loops, it is still a meager transfer rate in absolute numbers. A lot of hardware for little gain.



Better Improve the Rest



Instead of spending a handful of TTL chips on a small and often unreached gain, using the same number of gates for a more capable VDC could be far more rewarding. A (very simple) DMA copy circuit can move 5 to 10 times as much data per unit time as the CPU can (*3). Adding a full DMA controller to the system might cost even less. Then again, with a dedicated circuit additional benefits are possible, like bitblitting and so on.
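As a rough check on the lower end of that range, here is the arithmetic in Python. It assumes the 4 MHz Z80 used as the model CPU, 21 cycles per byte for LDIR, and a DMA circuit that needs one read and one write of two cycles each per byte. The constant names are mine.

```python
CLOCK_HZ = 4_000_000          # model Z80 clock used in this answer
LDIR_CYCLES_PER_BYTE = 21     # Z80 LDIR, per byte moved
DMA_CYCLES_PER_BYTE = 2 + 2   # one 2-cycle read plus one 2-cycle write
VISIBLE_LINE_US = 52          # PAL visible part of a scan line


def bytes_per_second(clock_hz, cycles_per_byte):
    """Sustained copy rate when every byte costs a fixed number of cycles."""
    return clock_hz / cycles_per_byte


ldir_rate = bytes_per_second(CLOCK_HZ, LDIR_CYCLES_PER_BYTE)
dma_rate = bytes_per_second(CLOCK_HZ, DMA_CYCLES_PER_BYTE)
speedup = dma_rate / ldir_rate

# Whole LDIR iterations that fit into the visible part of one line.
visible_cycles = CLOCK_HZ * VISIBLE_LINE_US // 1_000_000
ldir_writes_per_visible_line = visible_cycles // LDIR_CYCLES_PER_BYTE

print(round(ldir_rate), round(dma_rate), speedup, ldir_writes_per_visible_line)
```

With these inputs the DMA path is 5.25 times faster than LDIR, and nine LDIR iterations fit into 52 µs, matching the "up to 9 writes" figure for the Z80.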




Did any retrocomputing system buffer the write so as to let the CPU run at full speed whenever it's not reading from the framebuffer (which is practically always)?




Leaving the interleave/partial-blocking systems aside, there are all the machines using a 9918. It buffers one transaction, and a 3.5 MHz Z80 could access it (virtually) without being stopped (*4). Similar systems that used a second CPU for I/O had that advantage, though not many were made.




*1 – That's one lovely part, and crucial to remember: any access done during a line will stop the CPU (in the basic configuration) and release it one memory cycle after the end of the line. That gives the loop the whole 12 (10.9) µs for another iteration, which in a tight loop makes the next write happen instantly and then puts the CPU on hold again. So as long as the horizontal retrace is longer than one loop iteration plus one memory cycle, two writes will always fit, already creating parallel action for up to 2/6th of a line.



*2 – Here also lies one major reason why early computers had to use tile graphics and sprites (besides not having enough RAM): not enough bandwidth. A CPU of that kind cannot move more than about 2 KiB per frame even in a tight loop, and that's without being halted. That's not enough to brute-force an acceptable frame rate. Using tiles and sprites reduced the amount to (maybe) less than 1/8th, making it possible again to sustain a good frame rate (still needing a good tile design :))



*3 – No, using a Z80 LDIR is not the same as DMA. It takes 21 cycles to move a byte. With two cycles per memory access, DMA can do the same in merely 4 cycles.



*4 – The calculation is a bit complex here, as the 9918 has a basic access time of 2 µs plus a 0–6 µs wait, depending on internal action (frame or line) and, within a line, on which graphics mode is used. With a Z80 and LDIR as a (fast) use case, this comes down to a maximum effective CPU clock of 2–10 MHz depending on when the access happens. For all practical purposes, a 3.5 MHz Z80 can write to a 9918's screen memory virtually without being slowed down.
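One way to arrive at a range like the one in *4 is to ask how fast the CPU may run before two consecutive LDIR accesses come closer together than the 9918's access time. The sketch below assumes exactly that reading, with 21 cycles between accesses; the function name is hypothetical.

```python
LDIR_CYCLES_PER_BYTE = 21  # Z80 cycles between two consecutive LDIR writes


def max_unstalled_clock_mhz(access_time_us, cycles_between_accesses=LDIR_CYCLES_PER_BYTE):
    """Highest CPU clock (MHz) at which accesses stay at least access_time_us apart."""
    return cycles_between_accesses / access_time_us


best_case = max_unstalled_clock_mhz(2.0)         # 2 µs basic access, no extra wait
worst_case = max_unstalled_clock_mhz(2.0 + 6.0)  # 2 µs basic access plus 6 µs wait

print(worst_case, best_case)
```

This gives about 2.6 MHz in the worst case and 10.5 MHz in the best case, consistent with the rough 2–10 MHz quoted and with a 3.5 MHz Z80 being slowed down only rarely.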






  • There was also another solution: using dual-port RAM chips. But IIRC those came up much later and were expensive.
    – Spektre
    4 hours ago







  • @Spektre That would be (more generalized) any kind of dual-port RAM. But they have always been, and still are, expensive. In fact, for this video application a perfect RAM 'only' needs one write port and two read ports, as video does not write. This would help a lot with synchronisation, maybe only adding a wait state when writing exactly the location that is being accessed at the same time. In an FPGA implementation that would be the way to go. For single chips it's way too expensive.
    – Raffzahn
    4 hours ago






  • I agree, double-speed RAM will be much cheaper and more available than DPRAM, and the needed change would not be as big, especially if a modern ULA is an FPGA. I just wanted to point out that DPRAMs exist, are used in present-day graphics cards, and are perfect for this apart from the cost.
    – Spektre
    4 hours ago































Ultimately these computers were designed with price as a primary design goal, rather than squeezing the absolute maximum performance possible.



First you'd have to separate the video memory. The screen is 7k, not 8k, and the Spectrum uses a 16+32 configuration of RAM chips. So you'd have to go to 8+8+32. That alone would probably make this idea a non-starter as you're adding 8 more chips and their routing to the board area. And it "wastes" about a kilobyte.



Then you'd have to do the write buffering somewhere. Maybe within the ULA, but if not you're adding yet more chips to the board.



(The Amiga does make this distinction for add-on memory between "chip" and "fast": What is the benefit of increasing Amiga chip memory?)



















Re: did any computer use this scheme, which I think is still unanswered: yes, several.



The TMS9918[/a] was used by the TI99/4[/a], MSX, ColecoVision, and many more. It doesn't share RAM with the CPU; it has its own. The CPU isn't synchronised to the TMS's memory windows when it wants to read or write. Instead, it writes to (or reads from) a latch that is written to (or read from) video memory whenever there is next a spare slot to do so.



The problem is that the access slots are far enough apart that you can't write to the latch at full speed, or you'll start overwriting values that haven't been written yet. If you sped up the RAM so that slots were always available, you might as well eliminate the latch.



That being said, the chip has a diverse family tree, at least some of which block on write only if the latch is full; that's an effective strategy but starting to get a little electronically complicated for a machine of the Spectrum and C64 vintage.




























  • The TMS9918 runs at 5.4 MHz, and guarantees CPU access at least once per 32 cycles (or once per 6 cycles if in text mode). For a 4 MHz Z80, you don't have to do much between memory accesses for that not to be a problem. It's also somewhat better than the Spectrum manages. :)
    – Jules
    3 hours ago











  • @Jules Keep in mind, the access is only about video. Even with an LDIR, video access would happen only once per 21 cycles on a Z80, which would make it perfect for a 3.5 MHz Z80 without adding any wait cycle. ((5.4 MHz / 32 cycles worst case) * 21 cycles per LDIR) = 3.54375 MHz. Looks like a nice number for such a system, doesn't it :))
    – Raffzahn
    2 hours ago






























In addition to pjc50's answer (i.e. that performance simply wasn't a high-priority design goal for these systems), there are more things to consider, at least for the Spectrum:



  • A cheaper way of increasing performance would simply have been to use the full performance of the CPU: the Spectrum ran at 3.5 MHz but its processor is rated for 4 MHz. The modifications to allow this would have increased complexity, but less than the modification outlined in pjc50's answer to allow for separate framebuffer access. This would have made a much larger improvement for lower cost.

  • Another simpler improvement would have been to use slightly faster memory and a small readahead buffer. The Spectrum used 200 ns memory, but if it had used 150 ns memory it would have been able to squeeze all of its screen access operations into the gaps between Z80 memory accesses, allowing the CPU to run unimpeded. It wouldn't be able to guarantee getting framebuffer data at exactly the right time, though, so it would have had to buffer a handful of bytes ahead in an internal buffer. This would have meant a minimal cost increase for the memory and only a small complexity increase in the ULA (I don't know how much of the available capacity of the ULA was used, so that may or may not have resulted in a cost increase for producing them).

  • And, at the end of the day, the 48K Spectrum had 2/3 of its memory that didn't have the penalty on access anyway, so performance-critical applications could simply use that memory and avoid the penalty (and perform display updates during the retrace interval). The 16K model only existed as a concession to the budget end of the market, and its performance was therefore a lot less critical.



















You seem to assume a single latch could suffice for buffering video memory writes from the CPU while the RAM is busy shifting out bits to the DAC.

That would assume that a single CPU instruction is only capable of pushing out a single byte to video RAM, but this is not the case: 16-bit accesses (which you could buffer with two latches) or even more complex instructions like LDIR (which would need way more buffers) on a Z80 transfer lots more bytes in a single instruction fetch. In order to catch these, you would need more buffers, in effect a second video RAM. This ends up as an actual component, dual-ported memory, that you can actually buy, albeit at much higher prices than "normal" RAM. Dual-ported memory has been used as video memory on contemporary (expensive) computers and is even used today.

Dual-ported memory is actually the thing that you are describing, just extended to cover any real application.


























  • "LDIR (that would need way more buffers)" Not really, since an LDIR can be put on hold like any other access when the one-transaction buffer is filled. "[LDIR] on a Z80 transfer lots more bytes in a single instruction fetch." Again, not really, as the LDIR gets fetched for each and every byte transferred, over and over again. And no, he isn't describing a dual-port memory, but a write-back buffer.
    – Raffzahn
    3 hours ago










      Your Answer







      StackExchange.ready(function()
      var channelOptions =
      tags: "".split(" "),
      id: "648"
      ;
      initTagRenderer("".split(" "), "".split(" "), channelOptions);

      StackExchange.using("externalEditor", function()
      // Have to fire editor after snippets, if snippets enabled
      if (StackExchange.settings.snippets.snippetsEnabled)
      StackExchange.using("snippets", function()
      createEditor();
      );

      else
      createEditor();

      );

      function createEditor()
      StackExchange.prepareEditor(
      heartbeatType: 'answer',
      convertImagesToLinks: false,
      noModals: false,
      showLowRepImageUploadWarning: true,
      reputationToPostImages: null,
      bindNavPrevention: true,
      postfix: "",
      noCode: true, onDemand: true,
      discardSelector: ".discard-answer"
      ,immediatelyShowMarkdownHelp:true
      );



      );













       

      draft saved


      draft discarded


















      StackExchange.ready(
      function ()
      StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fretrocomputing.stackexchange.com%2fquestions%2f7655%2fa-different-way-to-share-the-memory-bus-between-the-cpu-and-the-video%23new-answer', 'question_page');

      );

      Post as a guest






























      5 Answers














      So what's with this design decision?




      It's not as helpful as it looks at first sight.



      For one, it needs at least three chips for latching the data (8 bits of data and 11 to 16 bits of address), plus a considerable number to perform the deferred access whenever there is time, plus some muxes to switch the buses, as the CPU still needs to be able to read data from screen memory. Quite a lot of pins to solder (and pay for).



      Second, the gain is rather meager. Only a single (write) access gets buffered. As soon as there is a second one (before the screen memory can be accessed again), the CPU gets halted, just as it would without the buffer. Equally important, the CPU still gets put on hold for any read access.



      Timing



      A Line



      To validate this it helps to take a look at (average) screen timing. Let's assume a system where the video part needs the whole RAM bandwidth during display (or at least so much that no CPU access fits in between). Within a picture this is only true while a scan line is being displayed. Let's stay with a 1980s TV-sized picture. Here a line is defined as 64 (63.6 for NTSC) µs. Of that, 12 (10.9) µs are used for synchronisation and 52 (52.7) µs are for a potentially visible signal. Let's just assume the VDC uses all of that.



      52 µs is quite some time in a CPU's life, and more than enough to attempt to write more than one byte to screen memory. On a 1 MHz 6502 that equals up to 6–8 sequential writes in a tight copy loop. A 4 MHz Z80 can likewise do up to 9 writes during that time. (For simplicity, let's assume from here on that it takes 10 µs per single meaningful write.) That's about the maximum. And it'll of course overrun the single transaction buffer. To really use it, some FIFO for address and data is needed, increasing cost again.
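A rough sketch of that per-line arithmetic for the Z80 case. It assumes the 21 cycles per byte that footnote *3 gives for LDIR as the cost of one write, and that nothing can be drained to screen memory while the line is displayed; the function names are made up for illustration.

```python
import math

VISIBLE_US = 52.0   # visible part of a PAL line, as assumed in the text
LDIR_CYCLES = 21    # Z80 cycles per byte moved by LDIR (footnote *3)


def writes_attempted(clock_mhz, cycles_per_write=LDIR_CYCLES):
    """Number of complete writes a copy loop can attempt during one visible line."""
    us_per_write = cycles_per_write / clock_mhz
    return math.floor(VISIBLE_US / us_per_write)


def absorbed_without_stall(attempted, buffer_depth):
    """Writes a buffer of the given depth can take before the CPU has to be halted."""
    return min(attempted, buffer_depth)


attempted = writes_attempted(4.0)            # 4 MHz Z80
print(attempted)                             # 9
print(absorbed_without_stall(attempted, 1))  # 1
```

With a one-entry buffer, only the first of the nine attempted writes goes through without a stall, which is the overrun described above.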



      Of course even a FIFO would only postpone the accesses into the gap between lines. Then again, a memory good enough for either of these CPUs will be able to squeeze a good 20 write accesses into the line retrace. So yes, such a FIFO could resolve it and give the CPU seemingly unhindered access. As noted, that's only true as long as there is no intermediate read access, which would halt the CPU again. So no bitblit, please. Not to mention that there would need to be some priority logic to maybe let a waiting read slip in at the beginning of a line gap before the buffered writes are done, so the CPU only needs to wait for a minimum … err … no, bad idea, as it could deliver old data that has already been changed but just not committed yet. Drat.
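The stale-read problem can be shown with a toy model. This is only a software sketch of the idea, not a description of any real circuit; the class and method names are hypothetical.

```python
from collections import deque


class BufferedVideoRam:
    """Toy model: CPU writes go into a FIFO and reach RAM only when drained."""

    def __init__(self, size):
        self.ram = [0] * size
        self.fifo = deque()

    def write(self, addr, value):
        # The CPU continues immediately; the write is only queued.
        self.fifo.append((addr, value))

    def drain(self, slots):
        # Commit up to `slots` queued writes, e.g. during the line retrace.
        for _ in range(min(slots, len(self.fifo))):
            addr, value = self.fifo.popleft()
            self.ram[addr] = value

    def read_bypassing(self, addr):
        # A read that slips in ahead of the queued writes: may return old data.
        return self.ram[addr]

    def read_in_order(self, addr):
        # A safe read: every queued write is committed first, so the CPU waits.
        self.drain(len(self.fifo))
        return self.ram[addr]


vram = BufferedVideoRam(16)
vram.write(3, 0xAA)
print(hex(vram.read_bypassing(3)))  # 0x0, the write is still sitting in the FIFO
print(hex(vram.read_in_order(3)))   # 0xaa
```

Only the in-order read is correct, and it is the one that makes the CPU wait for the whole queue.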



      The Whole Picture



      A picture isn't just lines, but also a frame structure. There are 286 (243) visible lines for a total of 18.3 (15.5) ms within a 20 (16.6) ms frame (50/60 Hz). With our model CPUs that allows for a maximum of ~170 (110) bytes written outside the visible part. In addition two writes per line can be done (*1), adding it up to 456 (353) accesses per frame or 22.8 (21.2) KiB/s (*2).
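The totals can be re-derived with a few lines of arithmetic. This sketch assumes the 10 µs per write simplification; note that the quoted 456 (353) come out when one write is counted per visible line on top of the writes outside the visible part, and that the rates correspond to thousands of bytes per second. The function name is made up for illustration.

```python
US_PER_WRITE = 10.0  # simplification used in the text


def writes_per_frame(frame_ms, visible_ms, visible_lines, writes_per_line=1):
    """Writes outside the visible part plus the writes counted per visible line."""
    blank_us = (frame_ms - visible_ms) * 1000.0
    return round(blank_us / US_PER_WRITE) + visible_lines * writes_per_line


pal = writes_per_frame(20.0, 18.3, 286)    # 170 + 286
ntsc = writes_per_frame(16.6, 15.5, 243)   # 110 + 243
print(pal, pal * 50)     # 456 22800  (bytes per frame, bytes per second at 50 Hz)
print(ntsc, ntsc * 60)   # 353 21180  (bytes per frame, bytes per second at 60 Hz)
```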



      Adding a single access buffer would increase this by ~62 (68) percent to 37.1 (35.7) KiB/s. Doesn't sound bad. A 10-entry FIFO would even get it close to full speed (something like 100 KiB/s). Even better, isn't it? Except that this breaks down quite hard if the job is not just a tight copy loop, but maybe some bitblitting with decisions and transformations in between.
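A quick check of these percentages, assuming the buffer adds exactly one more write per visible line to the 456 (353) accesses per frame, and that "full speed" means one write every 10 µs. The helper names are hypothetical.

```python
def gain_percent(baseline_per_frame, visible_lines, extra_per_line=1):
    """Relative gain from the extra buffered writes per visible line."""
    return 100.0 * visible_lines * extra_per_line / baseline_per_frame


def full_speed_bytes_per_s(us_per_write=10.0):
    """Upper bound when the CPU is never halted."""
    return 1_000_000 / us_per_write


print(round(gain_percent(456, 286), 1))   # 62.7 (PAL)
print(round(gain_percent(353, 243), 1))   # 68.8 (NTSC)
print((456 + 286) * 50)                   # 37100 bytes per second
print((353 + 243) * 60)                   # 35760 bytes per second
print(full_speed_bytes_per_s())           # 100000.0
```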



      So, What to do Instead?



      Full Interleave



      As long as the timing requirement isn't too tight (as in 'already using the fastest RAM affordable'), it's better to avoid collisions altogether by using RAM with double the bandwidth needed for screen refresh (or for the CPU, whichever is higher), so the CPU can access it with no (or little) speed penalty. This was very common on 6502 systems, where RAM has to be twice as fast as the CPU anyway. Then there is no additional cost for perfect access.



      On the downside, this requires RAM that is always double the speed, even when that isn't needed (during retrace).
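As a minimal sketch of the interleave idea, assume every CPU cycle is split into two RAM slots, the first for video and the second for the CPU. The slot order and the function names are assumptions made for illustration only.

```python
def ram_cycle_ns(cpu_mhz, accesses_per_cpu_cycle=2):
    """RAM cycle time needed to fit the given number of accesses into one CPU cycle."""
    return 1000.0 / cpu_mhz / accesses_per_cpu_cycle


def slot_owners(cpu_cycles):
    """Owner of each RAM slot when video and CPU strictly alternate."""
    owners = []
    for _ in range(cpu_cycles):
        owners.extend(["video", "cpu"])
    return owners


print(ram_cycle_ns(1.0))   # 500.0 ns for a 1 MHz CPU
print(slot_owners(2))      # ['video', 'cpu', 'video', 'cpu']
```

The CPU gets one slot in every cycle, so it is never halted; the price is RAM that cycles twice per CPU cycle all the time.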



      Only Partial Take Over



      The C64's VIC shows an in-between solution: it mostly acts when the CPU isn't accessing the RAM and only stops the CPU when it needs additional bandwidth.



      Just Live With It



      As seen in the whole-picture calculation, while we are talking about a 60+ percent speed-up in tight loops, in absolute numbers it is still a meager transfer rate. A lot of hardware to gain a little.



      Better Improve the Rest



      Instead of spending a handful of TTLs on a small and often unrealized gain, using the same number of gates for a more capable VDC could be way more rewarding. Even a (very simple) DMA copy circuit can move something like 5 to 10 times as much per unit of time as the CPU can (*3). Adding a full DMA controller to the system might cost even less. Then again, a dedicated circuit makes additional benefits possible, like bitblitting and so on.
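The lower end of that range follows from the cycle counts given in footnote *3 (21 cycles per byte for LDIR, 4 for DMA). A small sketch, assuming a 4 MHz clock for both; the names are illustrative.

```python
def bytes_per_second(clock_hz, cycles_per_byte):
    """Transfer rate when every byte costs a fixed number of clock cycles."""
    return clock_hz / cycles_per_byte


CLOCK_HZ = 4_000_000                         # assumed 4 MHz system clock
ldir_rate = bytes_per_second(CLOCK_HZ, 21)   # Z80 LDIR, per footnote *3
dma_rate = bytes_per_second(CLOCK_HZ, 4)     # read cycle plus write cycle, 2 clocks each
speedup = dma_rate / ldir_rate

print(round(ldir_rate))    # 190476
print(round(dma_rate))     # 1000000
print(round(speedup, 2))   # 5.25
```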




      Did any retrocomputing system buffer the write so as to let the CPU run at full speed whenever it's not reading from the framebuffer (which is practically always)?




      Leaving the interleave/partial blocking systems aside, there are all the machines using a 9918. It does buffer one transaction, and a 3.5 MHz Z80 could access it (virtually) without being stopped (*4). Similar systems that used a second CPU for I/O had that advantage too. Not many were made, though.




      *1 – That's one lovely part, and crucial to remember: any access done during a line will stop the CPU (in the basic configuration) and release it one memory cycle after the end of the line. That leaves the whole 12 (10.9) µs for another interaction, which can be done in a tight loop, making the next write happen instantly and then putting the CPU on hold again. So as soon as the horizontal retrace is longer than one loop iteration plus one memory cycle, two will always fit – already creating parallel action for up to 2/6th of a line.



      *2 – Here also lies one major reason why early computers had to use tile graphics and sprites (besides not having enough RAM): not enough bandwidth. A CPU of that kind cannot move more than about 2 KiB per frame, even in a tight loop – and that is without being halted. Not enough to brute-force an acceptable frame rate. Using tiles and sprites reduced the amount to (maybe) less than 1/8th, making it possible again to sustain a good frame rate (still needing a good tile design :))



      *3 – No, using a Z80 LDIR is not the same as DMA. It takes 21 cycles to move a byte. With two cycles per memory access, DMA can do the same in merely 4 cycles.



      *4 – The calculation is a bit complex here, as the 9918 has a basic access time of 2 µs plus 0–6 µs of wait, depending on internal activity (frame or line) and, within a line, on which graphics mode is used. With a Z80 and LDIR as (a fast) use case, this comes down to a maximum effective CPU clock of 2–10 MHz depending on when the access happens. For all practical purposes a 3.5 MHz Z80 can write to a 9918's screen memory virtually without being slowed down.
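One way to read the 2–10 MHz range in *4: the CPU is not slowed down as long as its copy loop needs at least as long per byte as the 9918 does. A sketch under that assumption, using the 21 cycles per byte from *3; the function name is made up.

```python
def max_unslowed_clock_mhz(access_us, cycles_per_byte=21):
    """Highest CPU clock at which one loop iteration still lasts access_us or longer."""
    return cycles_per_byte / access_us


print(max_unslowed_clock_mhz(2.0))        # 10.5  (best case: 2 µs, no wait)
print(max_unslowed_clock_mhz(2.0 + 6.0))  # 2.625 (worst case: 2 µs plus 6 µs wait)
```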
























      • There was also another solution: using dual-port RAM chips. But IIRC those came up much later and were expensive.
        – Spektre
        4 hours ago







      • @Spektre That would be (more generalized) any kind of dual-port RAM. But they always have been, and still are, expensive. In fact, for this video application a perfect RAM 'only' needs one write port and two read ports, as video does not write. This would help a lot with synchronisation, maybe only adding a wait state when writing exactly the location that is being read at the same time. In an FPGA implementation that would be the way to go. As discrete chips, way too expensive.
        – Raffzahn
        4 hours ago






      • I agree, double-speed RAM will be much cheaper and more available than DPRAM, and the needed change would not be as big, especially if a modern ULA is an FPGA. I just wanted to point out that DPRAMs exist, are used in today's gfx cards, and are perfect for this apart from the cost.
        – Spektre
        4 hours ago























      edited 10 mins ago









      LangLangC











      answered 5 hours ago









      Raffzahn








      • 1
        There was also another solution: using dual-port RAM chips. But IIRC those came along much later and were expensive.
        – Spektre
        4 hours ago

      • 2
        @Spektre That would be (more generally) any kind of dual-port RAM. But they always have been, and still are, expensive. In fact, for this video application a perfect RAM 'only' needs one write port and two read ports, as video does not write. This would help a lot with synchronisation, perhaps adding a wait state only when writing exactly the location that is being read at the same time. In an FPGA implementation that would be the way to go. For single chips it is way too expensive.
        – Raffzahn
        4 hours ago

      • 1
        I agree, double-speed RAM would be much cheaper and more readily available than DPRAM, and the needed change would not be as big, especially if a modern ULA is an FPGA. I just wanted to point out that DPRAMs exist, are used in today's graphics cards, and are perfect for this apart from the cost.
        – Spektre
        4 hours ago
























      up vote
      1
      down vote













      Ultimately these computers were designed with price as a primary design goal, rather than squeezing out the absolute maximum performance.



      First you'd have to separate the video memory. The screen is just under 7K (6,912 bytes), not 8K, and the Spectrum uses a 16+32 configuration of RAM chips, so you'd have to go to 8+8+32. That alone would probably make the idea a non-starter, as you're adding eight more chips and their routing to the board area. And it "wastes" about a kilobyte.
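The "about a kilobyte" figure can be checked with a little arithmetic. The sketch below uses the standard Spectrum display layout (a 256×192 one-bit-per-pixel bitmap plus one attribute byte per 8×8 cell); the constant names are only illustrative.

```python
# Standard ZX Spectrum display layout.
WIDTH_PIXELS = 256
HEIGHT_PIXELS = 192

BITMAP_BYTES = WIDTH_PIXELS * HEIGHT_PIXELS // 8             # 1 bit per pixel
ATTRIBUTE_BYTES = (WIDTH_PIXELS // 8) * (HEIGHT_PIXELS // 8)  # 1 byte per 8x8 cell
SCREEN_BYTES = BITMAP_BYTES + ATTRIBUTE_BYTES

BANK_BYTES = 8 * 1024  # the hypothetical dedicated 8 KB video bank
UNUSED_BYTES = BANK_BYTES - SCREEN_BYTES

print(f"bitmap:     {BITMAP_BYTES} bytes")
print(f"attributes: {ATTRIBUTE_BYTES} bytes")
print(f"screen:     {SCREEN_BYTES} bytes ({SCREEN_BYTES / 1024:.2f} KiB)")
print(f"unused in an 8 KiB bank: {UNUSED_BYTES} bytes")
```

So a dedicated 8 KB bank would leave 1,280 bytes that the display never uses, unless the CPU were also given a use for them.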



      Then you'd have to do the write buffering somewhere. Maybe within the ULA, but if not you're adding yet more chips to the board.



      (The Amiga does make this distinction for add-on memory between "chip" and "fast" RAM; see "What is the benefit of increasing Amiga chip memory?")
















          answered 6 hours ago
          pjc50




















              up vote
              1
              down vote













              Re: did any computer use this scheme? I think that part is still unanswered. Yes: several.



              The TMS9918 was used by the TI99/4, MSX, ColecoVision, and many more. It doesn't share RAM with the CPU; it has its own. The CPU isn't synchronised to the TMS's memory windows when it wants to read or write. Instead it writes to (or reads from) a latch, which is written to (or read from) video memory whenever there is next a spare slot to do so.



              The problem is that the access slots are far enough apart that you can't write to the latch at full speed, or you'll start overwriting values that haven't yet been written to video memory. If you sped up the RAM so that slots were always available, you might as well eliminate the latch.



              That being said, the chip has a diverse family tree, at least some of which block on write only if the latch is full; that's an effective strategy but starting to get a little electronically complicated for a machine of the Spectrum and C64 vintage.
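The difference between the two latch behaviours can be seen in a toy simulation. This is only a sketch under simplifying assumptions: it is not a model of the TMS9918, CPU and video chip are counted in the same clock units, writes and access slots are strictly periodic, and the function name and the numbers in the example (a write every 21 cycles, a slot every 32 cycles) are illustrative.

```python
def simulate_latch(n_writes, cpu_interval, slot_interval, block_when_full):
    """Simulate a one-entry write latch in front of video RAM.

    The CPU tries to write every cpu_interval cycles. The video chip
    empties the latch at every multiple of slot_interval cycles.
    Returns (lost_writes, stall_cycles).
    """
    pending = False            # True while the latch holds an unwritten value
    lost = 0
    stalled = 0
    now = 0
    next_slot = slot_interval
    for _ in range(n_writes):
        now += cpu_interval
        # Let the video chip use every access slot that has passed.
        while next_slot <= now:
            pending = False
            next_slot += slot_interval
        if pending:
            if block_when_full:
                # Hold the CPU until the next slot has drained the latch.
                stalled += next_slot - now
                now = next_slot
                next_slot += slot_interval
            else:
                # The old value is overwritten before it reached video RAM.
                lost += 1
        pending = True
    return lost, stalled


lost_plain, _ = simulate_latch(10, 21, 32, block_when_full=False)
lost_blocking, stall_cycles = simulate_latch(10, 21, 32, block_when_full=True)
print("plain latch:    lost writes =", lost_plain)
print("blocking latch: lost writes =", lost_blocking, "stall cycles =", stall_cycles)
```

In this toy setup the plain latch loses writes as soon as the CPU writes faster than slots arrive, while the blocking variant loses none and instead slows the CPU down to the slot rate.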














              edited 2 hours ago

              answered 3 hours ago
              Tommy











              • The TMS9918 runs at 5.4 MHz and guarantees CPU access at least once per 32 cycles (or once per 6 cycles in text mode). For a 4 MHz Z80, you don't have to do much between memory accesses for that not to be a problem. It's also somewhat better than the Spectrum manages. :)
                – Jules
                3 hours ago

              • @Jules Keep in mind that this is only about video access. Even with an LDIR, video access would happen only once per 21 cycles on a Z80, which would make it perfect for a 3.5 MHz Z80 without adding any wait cycles: (5.4 MHz / 32 cycles worst case) × 21 cycles per LDIR iteration = 3.54375 MHz. Looks like a nice number for such a system, doesn't it :))
                – Raffzahn
                2 hours ago


























              up vote
              0
              down vote













              In addition to pjc50's answer (i.e. that performance simply wasn't a high-priority design goal for these systems), there are more things to consider, at least for the Spectrum:



              • A cheaper way of increasing performance would simply have been to use the full performance of the CPU: the Spectrum ran at 3.5 MHz, but its processor is rated for 4 MHz. The modifications to allow this would have increased complexity, but less so than the modification outlined in pjc50's answer to allow for separate framebuffer access. This would have made a much larger improvement for lower cost.


              • Another, simpler improvement would have been to use slightly faster memory and a small read-ahead buffer. The Spectrum used 200 ns memory, but with 150 ns memory it would have been able to squeeze all of its screen access operations into the gaps between Z80 memory accesses, allowing the CPU to run unimpeded. It wouldn't be able to guarantee getting framebuffer data at exactly the right time, though, so it would have had to fetch a handful of bytes ahead into an internal buffer. This would have meant a minimal cost increase for the memory and only a small complexity increase in the ULA (I don't know how much of the ULA's available capacity was used, so that may or may not have increased the cost of producing it).


              • And, at the end of the day, two thirds of the 48K Spectrum's memory didn't have the access penalty anyway, so performance-critical applications could simply use that memory and avoid the penalty (and perform display updates during the retrace interval). The 16K model only existed as a concession to the budget end of the market, and its performance was therefore a lot less critical.
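The last point can be made concrete with a small helper. The sketch below assumes the 48K Spectrum memory map, in which the 16 KB of RAM shared with the ULA sits at 0x4000–0x7FFF and the remaining 32 KB at 0x8000–0xFFFF; `is_contended` is a hypothetical name, not part of any toolchain.

```python
ROM_END = 0x4000          # RAM starts above the 16 KB ROM
CONTENDED_START = 0x4000  # first byte of the RAM shared with the ULA
CONTENDED_END = 0x7FFF    # last byte of that shared 16 KB
ADDRESS_SPACE = 0x10000   # 64 KB Z80 address space


def is_contended(address):
    """True if the address lies in the RAM the ULA also reads for the display."""
    return CONTENDED_START <= address <= CONTENDED_END


ram_bytes = ADDRESS_SPACE - ROM_END
uncontended_bytes = sum(
    1 for address in range(ROM_END, ADDRESS_SPACE) if not is_contended(address)
)

print(f"RAM: {ram_bytes} bytes, uncontended: {uncontended_bytes} bytes")
print(f"uncontended fraction: {uncontended_bytes / ram_bytes:.3f}")
```

Code and data placed at 0x8000 or above fall in the uncontended two thirds. Even in the lower 16 KB, accesses are only delayed while the ULA is actually fetching display data.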














                  edited 5 hours ago

                  answered 5 hours ago
                  Jules




















                      up vote
                      -1
                      down vote













                      You seem to assume that a single latch would suffice to buffer video memory writes from the CPU while the RAM is busy shifting out bits to the DAC.



                      That would assume that a single CPU instruction is only capable of pushing out a single byte to video RAM, but this is not the case: 16-bit accesses (which you could buffer with two latches) or even more complex instructions like LDIR (which would need far more buffers) on a Z80 transfer many more bytes in a single instruction fetch. To catch these you would need more buffers, in effect a second video RAM. This ends up as an actual component, dual-ported memory, which you can buy, albeit at much higher prices than "normal" RAM. Dual-ported memory was used as video memory in contemporary (expensive) computers and is still used today.



                      Dual-ported memory is actually the thing that you are describing, just extended to cover any real application.
















                      answered 4 hours ago
                      tofro











                      • "LDIR (that would need way more buffers)" Not really, since an LDIR can be put on hold like any other access when the one transaction buffer is filled. "[LDIR] on a Z80 transfer lots more bytes in a single instruction fetch." Again, not really, as the LDIR gets fetched for each and every byte transferred, over and over again. And no, he isn't describing a dual-port memory, but a write-back buffer.
                        – Raffzahn
                        3 hours ago















