SQL Server 2019 UTF-8 Support Benefits
I'm already quite comfortable with using COMPRESS() and DECOMPRESS() in an internal forum application for our company (currently on SQL Server 2017). Trying to make the database as efficient as possible, is there an advantage to adding the _UTF8 suffix to my current collation, as in Latin1_General_100_CI_AS_SC_UTF8, upon a future migration to SQL Server 2019?
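For reference, the kind of change being asked about would look roughly like this on SQL Server 2019 (an illustrative sketch only; dbo.Posts and Body are hypothetical names, not the actual schema):

    -- _UTF8 collations require SQL Server 2019 or later; they do not exist in 2017.
    -- Hypothetical table and column, shown only to make the question concrete.
    ALTER TABLE dbo.Posts
        ALTER COLUMN Body VARCHAR(MAX)
        COLLATE Latin1_General_100_CI_AS_SC_UTF8;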
sql-server collation encoding utf-8
migrated from stackoverflow.com 1 hour ago
This question came from our site for professional and enthusiast programmers.
asked 4 hours ago by John Titor (164)
edited 28 mins ago by Solomon Rutzky (46.6k)
Don't rush it. UTF-8 is most useful when the data needs that encoding, e.g. web content, or data that comes from or is sent to UTF-8 endpoints (REST services, UTF-8 data files, etc.). It's also needed in Linux environments where UTF-8 is assumed at the system level; programs like R use single-byte arrays and assume the environment code page is set to UTF-8.
– Panagiotis Kanavos, 3 hours ago
2 Answers
Here's a list of recommended uses taken from here:

The UTF-8 encoding, being a variable-length encoding, can be a huge benefit in some scenarios, but it can also make things worse in others. Unfortunately, there is very little use for a "_UTF8" encoding given that Data Compression and Clustered Columnstore Indexes are available across all editions of SQL Server. The only scenario that truly benefits from a UTF-8 encoding is one in which all of the following conditions are true:

- Data is mostly standard ASCII (values 0 – 127), but either has, or might have, a small amount of a varying range of Unicode characters (more than would be found on a single 8-bit Code Page, or might not exist on any 8-bit Code Page).
- Column is currently (or otherwise would be) NVARCHAR(MAX) (meaning, data won't fit into NVARCHAR(4000)).
- There is a lot of data for this column or set of columns (1 GB or more when stored in NVARCHAR).
- Performance would be negatively impacted by making the table a Clustered Columnstore table (due to how the table is used), OR data is typically < 8000 bytes.
- There is no desire to make the column VARBINARY(MAX), use COMPRESS() for INSERT and UPDATE operations, and use DECOMPRESS() for SELECT queries (no need to worry about lack of ability to index the VARBINARY value since it is MAX data anyway that cannot be indexed). Keep in mind that the Gzipped value will be much smaller than even the UTF-8 version of the string, though it would require decompressing before the value could be filtered on (outside of "=") or manipulated.
- The benefits of reducing the size of backups and reducing the time it takes to backup and restore, and reducing the impact on the buffer pool, outweigh the cost of the likely negative impact on query performance (for both CPU and elapsed times). Just keep in mind that Backup Compression (available in Enterprise and Standard Editions) might help here.

Storing HTML pages is a good example of a scenario that fits this description. UTF-8 is, of course, the preferred encoding for the interwebs precisely due to it using the minimal space for the most common characters while still allowing for the full range of Unicode characters.
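As a rough illustration of the COMPRESS() / DECOMPRESS() alternative described in that list (a minimal sketch; dbo.ForumPost and its columns are made-up names for this example):

    -- Hypothetical table: body text stored Gzip-compressed in VARBINARY(MAX).
    CREATE TABLE dbo.ForumPost
    (
        PostId   INT IDENTITY PRIMARY KEY,
        BodyGzip VARBINARY(MAX) NULL
    );

    -- INSERT / UPDATE: compress on the way in.
    INSERT INTO dbo.ForumPost (BodyGzip)
    VALUES (COMPRESS(N'<html><body>Hello, world!</body></html>'));

    -- SELECT: decompress and cast back to NVARCHAR on the way out.
    SELECT PostId,
           CAST(DECOMPRESS(BodyGzip) AS NVARCHAR(MAX)) AS Body
    FROM   dbo.ForumPost;

Note that the value has to be decompressed before it can be filtered or manipulated, which is exactly the trade-off the list calls out.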
answered 3 hours ago by Outman (464); accepted
"trying to make the database as efficient as possible"

There are at least two different types of efficiency that are really at play here:

- space (disk and memory)
- speed

Under certain conditions (as described in Outman's answer, which is a copy/paste of the "Recommended Uses / Guidance" section of my blog post, linked at the top of that answer) you can save space, but that is entirely dependent on the type and per-row quantity of characters.

However, at least in its current implementation, you are more likely than not to see a decrease in speed. This could be due to how the UTF-8 data is being handled internally. I know that when comparing UTF-8 data to non-UTF-8 VARCHAR data, both values are converted to UTF-16 LE (i.e. NVARCHAR). I wouldn't be surprised if other (perhaps even most) operations needed to convert the UTF-8 data into NVARCHAR, given that this is how Windows / SQL Server / .NET have always handled Unicode.

So, assuming that you have a scenario that could possibly benefit from using UTF-8, you need to choose which efficiency is more important.

Also keep in mind that there are several bugs in the current UTF-8 implementation that make it very much not ready for production use. They have supposedly fixed two of the minor ones in the next CTP, but there is still a major issue with NULLs not always being handled properly by drivers (i.e. client connectivity libraries) released prior to SQL Server 2019.

Now, whether or not UTF-8 will benefit scenarios where the environment itself is natively UTF-8 (e.g. Linux) remains to be seen. Typically the database driver (ODBC, SQL Native Client, etc.) handles the translation between client and server. I suppose there could be a performance / efficiency gain here if doing this would result in the driver software skipping the additional steps (and CPU cycles) it takes to do those encoding translations. So far this is just a theory, as I have not tested it.
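If you want to see the space side of that trade-off for yourself, a quick comparison could look like this (a minimal sketch, assuming a SQL Server 2019 CTP or later; the temp table and sample string are arbitrary):

    -- Hypothetical demo: the same mostly-ASCII text stored as UTF-8 VARCHAR and as NVARCHAR.
    CREATE TABLE #EncodingDemo
    (
        Utf8Val  VARCHAR(100) COLLATE Latin1_General_100_CI_AS_SC_UTF8,
        Utf16Val NVARCHAR(100)
    );

    INSERT INTO #EncodingDemo (Utf8Val, Utf16Val)
    VALUES (N'SQL Server 2019 adds UTF-8 collations',
            N'SQL Server 2019 adds UTF-8 collations');

    -- For mostly-ASCII data the UTF-8 column should report roughly half the bytes.
    SELECT DATALENGTH(Utf8Val)  AS Utf8Bytes,
           DATALENGTH(Utf16Val) AS Utf16Bytes
    FROM   #EncodingDemo;

    DROP TABLE #EncodingDemo;

Whether the CPU cost of the internal conversions described above outweighs that space saving is exactly the part that still needs to be measured.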
answered 29 mins ago by Solomon Rutzky (46.6k); edited 21 mins ago