How does write cache work with a filesystem spanning disks with different speeds?

On a modern Linux system with multiple disks and a software RAID spanning both slow (HDD) and fast (SSD) drives, how are writes to the filesystem cached?



For md-raid RAID1, the array can be configured with disks marked --write-mostly and --write-behind, which suggests that reads are performed from the faster disk and that writes to the slower disk can lag behind. But how is that cached at the kernel level? Does the kernel cache the disk writes before or after the md-raid layer? At the end of a write() call, is the data guaranteed to have been written to one of the non-write-behind disks?
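
For concreteness, such an array might be created along the following lines; this is only a sketch, and /dev/ssd1 and /dev/hdd1 are placeholder device names, not from the original question:

    # RAID1 with the SSD as the primary device and the HDD flagged
    # write-mostly. A write-intent bitmap is required for write-behind.
    mdadm --create /dev/md0 --level=1 --raid-devices=2 \
        --bitmap=internal --write-behind=256 \
        /dev/ssd1 --write-mostly /dev/hdd1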



For a btrfs RAID1, how would the same situation play out? There's no --write-behind functionality, so are dirty pages counted at the device level or the filesystem level? At what point would a write() return?
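
For comparison, a btrfs RAID1 across the same two drives would be created roughly like this (again a sketch with placeholder device names); btrfs writes both mirrors itself, with no write-behind equivalent:

    # btrfs RAID1 for both data (-d) and metadata (-m) across two drives.
    mkfs.btrfs -m raid1 -d raid1 /dev/ssd1 /dev/hdd1
    # Mounting any member device mounts the whole filesystem.
    mount /dev/ssd1 /mnt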



How do the vm.dirty_*ratio tunables affect these setups?
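
For reference, these tunables can be inspected and changed with sysctl; the value set below is purely illustrative:

    # Percentage of reclaimable memory that may be dirty before writers
    # are throttled (dirty_ratio) or background flushing starts
    # (dirty_background_ratio).
    sysctl vm.dirty_ratio vm.dirty_background_ratio

    # Example: start background writeback earlier.
    sysctl -w vm.dirty_background_ratio=5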










Tags: linux, filesystems, btrfs, mdadm, software-raid






asked 40 mins ago by Steven Davies


2 Answers







The --write-mostly and --write-behind flags are handled by the md driver internally. md keeps metadata, like the write-intent bitmap (which is mandatory for the write-behind feature), which records which data has already been written and which is still outstanding. This is necessary in case of a power-loss event while the data hasn't yet reached the write-mostly devices. In that case the affected data area will be re-synced (in your case: read from the SSD, written to the HDD).
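
This metadata can be observed from userspace; a quick sketch (substitute your own array and member device for the placeholder names):

    # /proc/mdstat shows whether the array has a bitmap and how much of
    # it is currently dirty.
    cat /proc/mdstat

    # Dump the write-intent bitmap stored on a member device.
    mdadm --examine-bitmap /dev/hdd1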




But how is that cached at the kernel level?




For the write-behind case, the md driver basically duplicates the write request internally. The master write request goes to the primary drive(s) and tells the upper layers "OK, I've done this already"; the copied write request then stays around for the write-mostly (write-behind) side of the RAID and may take longer to complete, hopefully without anyone noticing.



The RAID layer then takes a lot of steps to make sure no data is read from the write-mostly device while there are still pending write-behind requests in the queue. Why would data be read from a write-mostly device at all? Well, the SSD might have failed, so it's all that's left. It's complicated; write-behind introduces some corner cases.



That is probably also why it's only supported for the RAID-1 level, not for any of the others. Although it might make sense in theory to have SSDs essentially as a RAID-0 with two parity HDDs in write-behind mode, there's no support for a write-behind RAID-6 like that. It's RAID-1 only, and rarely used even there.



The other cache settings remain unaffected by this; the overall caching mechanism does not care in the least how the md driver has implemented things internally. The cache does its thing and md does its thing, so a filesystem cache works the same for a filesystem on top of md as for a filesystem on top of a bare drive. (The reality is a tad more complicated than that, but you can think of it this way.)
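
One way to see this separation: the dirty-page counters live in the generic VM layer, above md, so they look the same whatever block device sits underneath. A sketch:

    # Page-cache writeback counters, maintained independently of md.
    grep -E '^(Dirty|Writeback):' /proc/meminfo

    # Force writeback and watch the counters drain.
    sync
    grep -E '^(Dirty|Writeback):' /proc/meminfo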






answered 12 mins ago by frostschutz
For md-raid RAID1, the array can be configured with disks marked --write-mostly and --write-behind, which suggests that reads are performed from the faster disk and that writes to the slower disk can lag behind. But how is that cached at the kernel level? Does the kernel cache the disk writes before or after the md-raid layer?




            After, since this feature is specific to md-raid.



            You should think about this md-raid feature as buffering, not caching. It is bounded by the following mdadm option:




            --write-behind=



            Specify that write-behind mode should be enabled (valid for RAID1 only). If an argument is specified, it will set the maximum number of outstanding writes allowed. The default value is 256.
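
The same knobs are exposed at runtime through md's sysfs interface; the paths below are a sketch, assuming a hypothetical array md0 with a member sdb1 and an active bitmap:

    # Per-device flags; writing "writemostly" sets the write-mostly flag.
    cat /sys/block/md0/md/dev-sdb1/state
    echo writemostly > /sys/block/md0/md/dev-sdb1/state

    # Maximum number of outstanding write-behind writes (the value set
    # by --write-behind); only present while the array has a bitmap.
    cat /sys/block/md0/md/bitmap/backlog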




I can only assume that it is also limited by the normal kernel and hardware buffering (i.e. whichever limit is smaller applies). The normal kernel buffering is bounded by nr_requests and max_hw_sectors_kb; see /sys/class/block/$write_behind_device/queue/. By hardware buffering, I mean the write cache on the drive itself.
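
Those limits are easy to read per device; a sketch with a hypothetical member sdb:

    # Block-layer queue limits for the write-behind member device.
    cat /sys/class/block/sdb/queue/nr_requests
    cat /sys/class/block/sdb/queue/max_hw_sectors_kb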




At the end of a write() call, is the data guaranteed to have been written to one of the non-write-behind disks?




            Of course, assuming you mean the write() was on a file opened with O_SYNC / O_DSYNC, or you actually meant write()+fsync(). If not, no guarantees apply at all.
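
From a shell the difference is easy to demonstrate; a sketch, with an illustrative output path:

    # Buffered write: returns once the data is in the page cache, with
    # no guarantee that it has reached any disk.
    dd if=/dev/zero of=/mnt/testfile bs=1M count=16

    # oflag=sync opens the output file O_SYNC, so each write() returns
    # only after the data is durable on the array.
    dd if=/dev/zero of=/mnt/testfile bs=1M count=16 oflag=sync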






answered 26 mins ago, edited 7 mins ago, by sourcejedi
• Thanks, but that does pose another question: if the file was opened with O_SYNC, does the write() return after the first disk has been written to, or only after all disks have been written to? – Steven Davies, 15 mins ago

• The sub-writes to the non-write-behind disks must complete first. – sourcejedi, 12 mins ago









