SQL Server 2019 UTF-8 Support Benefits
I'm already quite comfortable with using COMPRESS() and DECOMPRESS() in an internal forum application for our company (currently on SQL Server 2017). Trying to make the database as efficient as possible, is there an advantage to adding the _UTF8 suffix to my current collation, as in Latin1_General_100_CI_AS_SC_UTF8, upon a future migration to SQL Server 2019?
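For reference, the kind of change being asked about would look roughly like this on SQL Server 2019 (an illustrative sketch only; dbo.Posts and Body are hypothetical names, not the actual schema):

    -- _UTF8 collations require SQL Server 2019 or later; they do not exist in 2017.
    -- Hypothetical table and column, shown only to make the question concrete.
    ALTER TABLE dbo.Posts
        ALTER COLUMN Body VARCHAR(MAX)
        COLLATE Latin1_General_100_CI_AS_SC_UTF8;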
sql-server collation encoding utf-8
migrated from stackoverflow.com 1 hour ago
This question came from our site for professional and enthusiast programmers.
asked 4 hours ago by John Titor (164)
edited 28 mins ago by Solomon Rutzky (46.6k)
Don't rush it. UTF-8 is most useful when the data needs that encoding, e.g. web content, or data that comes from or is sent to UTF-8 endpoints (REST services, UTF-8 data files, etc.). It's also needed in Linux environments where UTF-8 is assumed at the system level; programs like R use single-byte arrays and assume the environment code page is set to UTF-8.
– Panagiotis Kanavos, 3 hours ago
2 Answers
Here's a list of recommended uses taken from here:

The UTF-8 encoding, being a variable-length encoding, can be a huge benefit in some scenarios, but it can also make things worse in others. Unfortunately, there is very little use for a "_UTF8" encoding given that Data Compression and Clustered Columnstore Indexes are available across all editions of SQL Server. The only scenario that truly benefits from a UTF-8 encoding is one in which all of the following conditions are true:

- Data is mostly standard ASCII (values 0 – 127), but either has, or might have, a small amount of a varying range of Unicode characters (more than would be found on a single 8-bit Code Page, or might not exist on any 8-bit Code Page).
- Column is currently (or otherwise would be) NVARCHAR(MAX) (meaning, data won't fit into NVARCHAR(4000)).
- There is a lot of data for this column or set of columns (1 GB or more when stored in NVARCHAR).
- Performance would be negatively impacted by making the table a Clustered Columnstore table (due to how the table is used), OR data is typically < 8000 bytes.
- There is no desire to make the column VARBINARY(MAX), use COMPRESS() for INSERT and UPDATE operations, and use DECOMPRESS() for SELECT queries (no need to worry about lack of ability to index the VARBINARY value since it is MAX data anyway that cannot be indexed). Keep in mind that the Gzipped value will be much smaller than even the UTF-8 version of the string, though it would require decompressing before the value could be filtered on (outside of "=") or manipulated.
- The benefits of reducing the size of backups and reducing the time it takes to backup and restore, and reducing the impact on the buffer pool, outweigh the cost of the likely negative impact on query performance (for both CPU and elapsed times). Just keep in mind that Backup Compression (available in Enterprise and Standard Editions) might help here.

Storing HTML pages is a good example of a scenario that fits this description. UTF-8 is, of course, the preferred encoding for the interwebs precisely due to it using the minimal space for the most common characters while still allowing for the full range of Unicode characters.
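As a rough illustration of the COMPRESS() / DECOMPRESS() alternative described in that list (a minimal sketch; dbo.ForumPost and its columns are made-up names for this example):

    -- Hypothetical table: body text stored Gzip-compressed in VARBINARY(MAX).
    CREATE TABLE dbo.ForumPost
    (
        PostId   INT IDENTITY PRIMARY KEY,
        BodyGzip VARBINARY(MAX) NULL
    );

    -- INSERT / UPDATE: compress on the way in.
    INSERT INTO dbo.ForumPost (BodyGzip)
    VALUES (COMPRESS(N'<html><body>Hello, world!</body></html>'));

    -- SELECT: decompress and cast back to NVARCHAR on the way out.
    SELECT PostId,
           CAST(DECOMPRESS(BodyGzip) AS NVARCHAR(MAX)) AS Body
    FROM   dbo.ForumPost;

Note that the value has to be decompressed before it can be filtered or manipulated, which is exactly the trade-off the list calls out.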
answered 3 hours ago by Outman (464); accepted
"trying to make the database as efficient as possible"

There are at least two different types of efficiency that are really at play here:

- space (disk and memory)
- speed

Under certain conditions (as described in Outman's answer, which is a copy/paste of the "Recommended Uses / Guidance" section of my blog post, linked at the top of that answer) you can save space, but that is entirely dependent on the type and per-row quantity of characters.

However, at least in its current implementation, you are more likely than not to see a decrease in speed. This could be due to how the UTF-8 data is being handled internally. I know that when comparing UTF-8 data to non-UTF-8 VARCHAR data, both values are converted to UTF-16 LE (i.e. NVARCHAR). I wouldn't be surprised if other (perhaps even most) operations needed to convert the UTF-8 data into NVARCHAR, given that this is how Windows / SQL Server / .NET have always handled Unicode.

So, assuming that you have a scenario that could possibly benefit from using UTF-8, you need to choose which efficiency is more important.

Also keep in mind that there are several bugs in the current UTF-8 implementation that make it very much not ready for production use. They have supposedly fixed two of the minor ones in the next CTP, but there is still a major issue with NULLs not always being handled properly by drivers (i.e. client connectivity libraries) released prior to SQL Server 2019.

Now, whether or not UTF-8 will benefit scenarios where the environment itself is natively UTF-8 (e.g. Linux) remains to be seen. Typically the database driver (ODBC, SQL Native Client, etc.) handles the translation between client and server. I suppose there could be a performance / efficiency gain here if doing this would result in the driver software skipping the additional steps (and CPU cycles) it takes to do those encoding translations. So far this is just a theory, as I have not tested it.
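If you want to see the space side of that trade-off for yourself, a quick comparison could look like this (a minimal sketch, assuming a SQL Server 2019 CTP or later; the temp table and sample string are arbitrary):

    -- Hypothetical demo: the same mostly-ASCII text stored as UTF-8 VARCHAR and as NVARCHAR.
    CREATE TABLE #EncodingDemo
    (
        Utf8Val  VARCHAR(100) COLLATE Latin1_General_100_CI_AS_SC_UTF8,
        Utf16Val NVARCHAR(100)
    );

    INSERT INTO #EncodingDemo (Utf8Val, Utf16Val)
    VALUES (N'SQL Server 2019 adds UTF-8 collations',
            N'SQL Server 2019 adds UTF-8 collations');

    -- For mostly-ASCII data the UTF-8 column should report roughly half the bytes.
    SELECT DATALENGTH(Utf8Val)  AS Utf8Bytes,
           DATALENGTH(Utf16Val) AS Utf16Bytes
    FROM   #EncodingDemo;

    DROP TABLE #EncodingDemo;

Whether the CPU cost of the internal conversions described above outweighs that space saving is exactly the part that still needs to be measured.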
answered 29 mins ago by Solomon Rutzky (46.6k); edited 21 mins ago