How does write cache work with a filesystem spanning disks with different speeds?
On a modern Linux system with multiple disks and a software RAID spanning both slow (HDD) and fast (SSD) drives, how are writes to the filesystem cached?
For md-raid RAID1, the array can be configured with disks marked --write-mostly and --write-behind, which suggests that reads are performed from the faster disk and that writes to the slower disk can lag behind. But how is that cached at the kernel level? Does the kernel cache the disk writes before or after the md-raid layer? At the end of a write() call, is the data guaranteed to have been written to at least one of the non-write-behind disks?
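For concreteness, the kind of array I have in mind could be created like this (a sketch; the device names and the write-behind limit are placeholders):

    # RAID1 with a fast SSD and a slow HDD; the HDD is marked write-mostly
    # so reads prefer the SSD, and write-behind (which requires a bitmap)
    # lets up to 1024 writes to the HDD lag behind.
    mdadm --create /dev/md0 --level=1 --raid-devices=2 \
        --bitmap=internal --write-behind=1024 \
        /dev/nvme0n1p1 --write-mostly /dev/sda1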
For a btrfs RAID1, how would the same situation play out? There's no --write-behind functionality, so are dirty pages counted at the device level or at the filesystem level? At what point would a write() return?
How do the vm.dirty_*ratio tunables affect these setups?
linux filesystems btrfs mdadm software-raid
asked 40 mins ago by Steven Davies
2 Answers
--write-mostly and --write-behind are handled internally by the md driver. md keeps metadata, such as the write-intent bitmap (which is mandatory for the write-behind feature), that essentially logs which data has already been written and which data is still missing. This is necessary in case of a power-loss event while the data has not yet reached the write-mostly devices. In that case the affected data area will be re-synced (in your case: read from the SSD, write to the HDD).
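(For reference, that bitmap state can be inspected from userspace; a sketch, assuming a component device /dev/sda1:)

    cat /proc/mdstat                  # active arrays show a "bitmap: ..." line
    mdadm --examine-bitmap /dev/sda1  # dump bitmap metadata from a component device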
But how is that cached at kernel level?
For the write-behind case, the md driver basically duplicates the write request internally. The master write request goes to the primary drive(s) and tells the upper layers "OK, I've done this already"; the copied write request then stays around for the write-mostly (write-behind) side of the RAID and may take longer to complete, hopefully without anyone noticing.
The RAID layer then takes a lot of steps to make sure no data is read from the write-mostly device while there are still pending write-behind requests in the queue. Why would data ever be read from a write-mostly device? Well, the SSD might have failed, so it's all that's left. It's complicated, and write-behind introduces some corner cases.
That is probably also why it's only supported for the RAID-1 level and none of the others. Although it might make sense in theory to have SSDs essentially as RAID-0 plus two parity HDDs in write-behind mode, there's no support for a write-behind RAID-6 like that. It's RAID-1 only, and rarely used even there.
The other cache settings remain unaffected by this; the overall caching mechanism does not care in the least how the md driver has implemented things internally. The cache does its thing and md does its thing, so a filesystem cache works the same for a filesystem on top of md as for a filesystem on top of a bare drive. (Reality is a tad more complicated than that, but you can think of it this way.)
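In other words, the generic writeback knobs the question asks about sit above md and apply unchanged; a sketch of how to inspect them:

    # Global page-cache writeback thresholds, independent of the block
    # layer underneath (md, bare drive, ...):
    sysctl vm.dirty_ratio vm.dirty_background_ratio
    # Current amount of dirty and in-flight writeback data:
    grep -E '^(Dirty|Writeback):' /proc/meminfo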
answered 12 mins ago by frostschutz
For md-raid RAID1 the array can be configured with disks as --write-mostly and --write-behind, which suggests that reads are performed from the faster disk, and that writes to the slower disk can lag behind. But how is that cached at kernel level? Does the kernel cache the disk writes before or after the md-raid layer?
After, since this feature is specific to md-raid.
You should think about this md-raid feature as buffering, not caching. It is bounded by the following mdadm option:
--write-behind=
Specify that write-behind mode should be enabled (valid for RAID1 only). If an argument is specified, it will set the maximum number of outstanding writes allowed. The default value is 256.
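The same limit is also exposed at runtime through sysfs; a sketch, assuming the array is /dev/md0:

    cat /sys/block/md0/md/bitmap/backlog         # current max outstanding write-behind writes
    echo 512 > /sys/block/md0/md/bitmap/backlog  # adjust it (as root)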
I can only assume that it is also limited by the normal kernel and hardware buffering (i.e. if that is smaller). The normal kernel buffering is bounded by nr_requests and max_hw_sectors_kb; see /sys/class/block/$write_behind_device/queue/. By hardware buffering, I mean the write cache on the drive.
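A sketch of checking those limits ($dev is a placeholder for the write-behind member, e.g. sda):

    cat /sys/class/block/$dev/queue/nr_requests        # request-queue depth
    cat /sys/class/block/$dev/queue/max_hw_sectors_kb  # largest I/O the hardware accepts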
At the end of a write() call, is the data guaranteed to have been written to at least one of the non-write-behind disks?
Of course, assuming you mean the write() was on a file opened with O_SYNC/O_DSYNC, or that you actually meant write() followed by fsync(). If not, no guarantees apply at all.
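A sketch of the difference, using dd on a file that lives on the array (/mnt/array is a placeholder mount point):

    # oflag=sync makes every write() behave as if O_SYNC were set, so it
    # returns only once the data is durable on the array:
    dd if=/dev/zero of=/mnt/array/testfile bs=4k count=1 oflag=sync
    # Without it, write() just dirties the page cache; conv=fsync forces
    # durability once, at the end:
    dd if=/dev/zero of=/mnt/array/testfile bs=4k count=1 conv=fsync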
answered 26 mins ago, edited 7 mins ago by sourcejedi

Thanks, but that does pose another question: if the file was opened with O_SYNC, does the write() return after the first disk has been written to, or only after all disks have been written to? – Steven Davies, 15 mins ago
The sub-writes to the non-write-behind disks must complete first. – sourcejedi, 12 mins ago