How does write cache work with a filesystem spanning disks with different speeds?

On a modern Linux system with multiple disks and a software RAID spanning both slow (HDD) and fast (SSD) drives, how are writes to the filesystem cached?



For md-raid RAID1, the array can be configured with disks marked --write-mostly and --write-behind, which suggests that reads are performed from the faster disk and that writes to the slower disk can lag behind. But how is that cached at the kernel level? Does the kernel cache the disk writes before or after the md-raid layer? At the end of a write() call, is the data guaranteed to have been written to one of the non-write-behind disks?
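
For concreteness, such an array might be created along the following lines; this is only a sketch, and /dev/ssd1 and /dev/hdd1 are placeholder device names, not from the original question:

    # RAID1 with the SSD as the primary device and the HDD flagged
    # write-mostly. A write-intent bitmap is required for write-behind.
    mdadm --create /dev/md0 --level=1 --raid-devices=2 \
        --bitmap=internal --write-behind=256 \
        /dev/ssd1 --write-mostly /dev/hdd1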



For a btrfs RAID1, how would the same situation play out? There's no --write-behind functionality, so are dirty pages counted at the device level or the filesystem level? At what point would a write() return?
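
For comparison, a btrfs RAID1 across the same two drives would be created roughly like this (again a sketch with placeholder device names); btrfs writes both mirrors itself, with no write-behind equivalent:

    # btrfs RAID1 for both data (-d) and metadata (-m) across two drives.
    mkfs.btrfs -m raid1 -d raid1 /dev/ssd1 /dev/hdd1
    # Mounting any member device mounts the whole filesystem.
    mount /dev/ssd1 /mnt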



How do the vm.dirty_*ratio tunables affect these setups?
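
For reference, these tunables can be inspected and changed with sysctl; the value set below is purely illustrative:

    # Percentage of reclaimable memory that may be dirty before writers
    # are throttled (dirty_ratio) or background flushing starts
    # (dirty_background_ratio).
    sysctl vm.dirty_ratio vm.dirty_background_ratio

    # Example: start background writeback earlier.
    sysctl -w vm.dirty_background_ratio=5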










Tags: linux, filesystems, btrfs, mdadm, software-raid






asked 40 mins ago by Steven Davies


2 Answers







The --write-mostly and --write-behind flags are handled by the md driver internally. md keeps metadata, like the write-intent bitmap (which is mandatory for the write-behind feature), which records which data has already been written and which is still outstanding. This is necessary in case of a power-loss event while the data hasn't yet reached the write-mostly devices. In that case the affected data area will be re-synced (in your case: read from the SSD, written to the HDD).
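
This metadata can be observed from userspace; a quick sketch (substitute your own array and member device for the placeholder names):

    # /proc/mdstat shows whether the array has a bitmap and how much of
    # it is currently dirty.
    cat /proc/mdstat

    # Dump the write-intent bitmap stored on a member device.
    mdadm --examine-bitmap /dev/hdd1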




But how is that cached at the kernel level?




For the write-behind case, the md driver basically duplicates the write request internally. The master write request goes to the primary drive(s) and tells the upper layers "OK, I've done this already"; the copied write request then stays around for the write-mostly (write-behind) side of the RAID and may take longer to complete, hopefully without anyone noticing.



The RAID layer then takes a lot of steps to make sure no data is read from the write-mostly device while there are still pending write-behind requests in the queue. Why would data be read from a write-mostly device at all? Well, the SSD might have failed, so it's all that's left. It's complicated; write-behind introduces some corner cases.



That is probably also why it's only supported for the RAID-1 level, not for any of the others. Although it might make sense in theory to have SSDs essentially as a RAID-0 with two parity HDDs in write-behind mode, there's no support for a write-behind RAID-6 like that. It's RAID-1 only, and rarely used even there.



The other cache settings remain unaffected by this; the overall caching mechanism does not care in the least how the md driver has implemented things internally. The cache does its thing and md does its thing, so a filesystem cache works the same for a filesystem on top of md as for a filesystem on top of a bare drive. (The reality is a tad more complicated than that, but you can think of it this way.)
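
One way to see this separation: the dirty-page counters live in the generic VM layer, above md, so they look the same whatever block device sits underneath. A sketch:

    # Page-cache writeback counters, maintained independently of md.
    grep -E '^(Dirty|Writeback):' /proc/meminfo

    # Force writeback and watch the counters drain.
    sync
    grep -E '^(Dirty|Writeback):' /proc/meminfo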






answered 12 mins ago by frostschutz
For md-raid RAID1, the array can be configured with disks marked --write-mostly and --write-behind, which suggests that reads are performed from the faster disk and that writes to the slower disk can lag behind. But how is that cached at the kernel level? Does the kernel cache the disk writes before or after the md-raid layer?




            After, since this feature is specific to md-raid.



            You should think about this md-raid feature as buffering, not caching. It is bounded by the following mdadm option:




            --write-behind=



            Specify that write-behind mode should be enabled (valid for RAID1 only). If an argument is specified, it will set the maximum number of outstanding writes allowed. The default value is 256.
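
The same knobs are exposed at runtime through md's sysfs interface; the paths below are a sketch, assuming a hypothetical array md0 with a member sdb1 and an active bitmap:

    # Per-device flags; writing "writemostly" sets the write-mostly flag.
    cat /sys/block/md0/md/dev-sdb1/state
    echo writemostly > /sys/block/md0/md/dev-sdb1/state

    # Maximum number of outstanding write-behind writes (the value set
    # by --write-behind); only present while the array has a bitmap.
    cat /sys/block/md0/md/bitmap/backlog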




I can only assume that it is also limited by the normal kernel and hardware buffering (i.e. whichever limit is smaller applies). The normal kernel buffering is bounded by nr_requests and max_hw_sectors_kb; see /sys/class/block/$write_behind_device/queue/. By hardware buffering, I mean the write cache on the drive itself.
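
Those limits are easy to read per device; a sketch with a hypothetical member sdb:

    # Block-layer queue limits for the write-behind member device.
    cat /sys/class/block/sdb/queue/nr_requests
    cat /sys/class/block/sdb/queue/max_hw_sectors_kb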




At the end of a write() call, is the data guaranteed to have been written to one of the non-write-behind disks?




            Of course, assuming you mean the write() was on a file opened with O_SYNC / O_DSYNC, or you actually meant write()+fsync(). If not, no guarantees apply at all.
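
From a shell the difference is easy to demonstrate; a sketch, with an illustrative output path:

    # Buffered write: returns once the data is in the page cache, with
    # no guarantee that it has reached any disk.
    dd if=/dev/zero of=/mnt/testfile bs=1M count=16

    # oflag=sync opens the output file O_SYNC, so each write() returns
    # only after the data is durable on the array.
    dd if=/dev/zero of=/mnt/testfile bs=1M count=16 oflag=sync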






answered 26 mins ago, edited 7 mins ago, by sourcejedi
• Thanks, but that does pose another question: if the file was opened with O_SYNC, does the write() return after the first disk has been written to, or only after all disks have been written to? – Steven Davies, 15 mins ago

• The sub-writes to the non-write-behind disks must complete first. – sourcejedi, 12 mins ago









