Slice a string containing Unicode chars
I have a piece of text with characters of different byte lengths.
let text = "Hello привет";
I need to take a slice of the string given start (included) and end (excluded) character indices. I tried this
let slice = &text[start..end];
and got the following error
thread 'main' panicked at 'byte index 7 is not a char boundary; it is inside 'п' (bytes 6..8) of `Hello привет`'
I suppose it happens since Cyrillic letters are multi-byte and the [..] notation takes chars using byte indices. What can I use if I want to slice using character indices, like I do in Python with slice = text[start:end]?
I know I can use the chars() iterator and manually walk through the desired substring, but is there a more concise way?
string unicode rust slice
I think chars() is the way to go here: text.chars().take(end).skip(start) - Tim Diekmann, Aug 23 at 10:10
@TimDiekmann how do I convert the Take<Chars> to &str then if the API needs it? - Sasha Tsukanov, Aug 23 at 10:17
You should call collect(). See this question: stackoverflow.com/questions/37157926/… - ozkriff, Aug 23 at 10:18
@ozkriff collect() will result in String, not in &str. This is why I didn't mark this as a duplicate of your linked question. - Tim Diekmann, Aug 23 at 10:31
edited Aug 23 at 19:16 by Matthieu M.
asked Aug 23 at 9:52 by Sasha Tsukanov
2 Answers
Accepted answer
Possible solutions to codepoint slicing
"I know I can use the chars() iterator and manually walk through the desired substring, but is there a more concise way?"
If you know the exact byte indices, you can slice a string:
let text = "Hello привет";
println!("{}", &text[2..10]);
This prints "llo пр". So the problem is to find out the exact byte position. You can do that fairly easily with the char_indices() iterator (alternatively you could use chars() with char::len_utf8()):
let text = "Hello привет";
let end = text.char_indices().map(|(i, _)| i).nth(8).unwrap();
println!("{}", &text[2..end]);
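The alternative mentioned in passing, chars() combined with char::len_utf8(), could look roughly like the sketch below. The helper name byte_offset_of_char is made up for this illustration and is not part of the standard library. Note that a character index past the end simply yields text.len() here, which is an assumption about the desired behaviour rather than something the answer specifies.

```rust
/// Byte offset at which the character with index `char_idx` starts.
/// `byte_offset_of_char` is a hypothetical helper name, not a std API.
/// Indices past the last character yield `text.len()`.
fn byte_offset_of_char(text: &str, char_idx: usize) -> usize {
    // Add up the UTF-8 length of every character before `char_idx`.
    text.chars().take(char_idx).map(char::len_utf8).sum()
}

fn main() {
    let text = "Hello привет";
    let start = byte_offset_of_char(text, 2);
    let end = byte_offset_of_char(text, 8);
    // Slices character indices 2..8 by converting them to byte offsets first.
    println!("{}", &text[start..end]);

    // 12 is one past the last character, so this slices to the end of the string.
    let tail_end = byte_offset_of_char(text, 12);
    println!("{}", &text[start..tail_end]);
}
```

Like char_indices(), this still walks the string from the start, so the cost stays O(n).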
As another alternative, you can first collect the string into a Vec<char>. Then, indexing is simple, but to print it as a string, you have to collect it again or write your own function to do it.
let text = "Hello привет";
let text_vec = text.chars().collect::<Vec<_>>();
println!("{}", text_vec[2..8].iter().cloned().collect::<String>());
Why is this not easier?
As you can see, neither of these solutions is all that great. This is intentional, for two reasons:
As str is simply a UTF-8 buffer, indexing by Unicode codepoints is an O(n) operation. Usually, people expect the [] operator to be an O(1) operation. Rust makes this runtime complexity explicit and doesn't try to hide it. In both solutions above you can clearly see that it's not O(1).
But the more important reason:
Unicode codepoints are generally not a useful unit
What Python does (and what you think you want) is not all that useful. It all comes down to the complexity of language and thus the complexity of Unicode. Python slices Unicode codepoints. This is what a Rust char represents. It is 32 bits wide (a few fewer bits would suffice, but it is rounded up to a power of 2).
What you actually want to do is slice user-perceived characters, but this is an explicitly loosely defined term. Different cultures and languages regard different things as "one character". The closest approximation is a "grapheme cluster". Such a cluster can consist of one or more Unicode codepoints. Consider this Python 3 code:
>>> s = "Jürgen"  # written as "Ju\u0308rgen"
>>> s[0:2]
'Ju'
Surprising, right? This is because the string above is:
- 0x004A LATIN CAPITAL LETTER J
- 0x0075 LATIN SMALL LETTER U
- 0x0308 COMBINING DIAERESIS
- ...
This is an example of a combining character that is rendered as part of the previous character. Python slicing does the "wrong" thing here.
Another example:
>>> s = "ﬁre"
>>> s[0:2]
'ﬁr'
Also not what you'd expect. This time, "fi" is actually the ligature ﬁ (U+FB01), which is one codepoint.
There are far more examples where Unicode behaves in a surprising way. See the links at the bottom for more information and examples.
So if you want to work with international strings that should be able to work everywhere, don't do codepoint slicing! If you really need to semantically view the string as a series of characters, use grapheme clusters. To do that, the crate unicode-segmentation is very useful.
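The two Python examples can be reproduced in Rust with only the standard library, which shows that the problem lies in codepoint slicing itself and not in Python. This is a small sketch; first_n_chars is a name invented for the illustration. It does not do grapheme segmentation, which would need a crate such as unicode-segmentation.

```rust
/// Takes the first `n` codepoints of `s`, the way a Python slice `s[0:n]` would.
/// `first_n_chars` is a hypothetical helper for this illustration.
fn first_n_chars(s: &str, n: usize) -> String {
    s.chars().take(n).collect()
}

fn main() {
    // "Jürgen" written with a combining diaeresis: J, u, U+0308, r, g, e, n.
    let decomposed = "Ju\u{0308}rgen";
    println!("{} codepoints", decomposed.chars().count());
    // The diaeresis is cut off, leaving a plain "Ju".
    println!("{}", first_n_chars(decomposed, 2));

    // "fire" written with the single-codepoint ligature U+FB01.
    let ligature = "\u{FB01}re";
    println!("{} codepoints", ligature.chars().count());
    // Two codepoints cover what a reader sees as three letters.
    println!("{}", first_n_chars(ligature, 2));
}
```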
Further resources on this topic:
- Blogpost "Let's stop ascribing meaning to unicode codepoints"
- Blogpost "Breaking our Latin-1 assumptions"
- http://utf8everywhere.org/
To make let end = text.char_indices().map(|(i, _)| i).nth(8).unwrap(); work when we want to slice up to the last codepoint in the string (say, with index 11) by using, say, 12 as the excluded bound, we need more work. One can add something like let end = if end_codepoint_idx == text.chars().count() { text.len() } else { text.char_indices().map(|(i, _)| i).nth(end_codepoint_idx).unwrap() }; - Sasha Tsukanov, Aug 23 at 13:58
A UTF-8 encoded string may contain characters that consist of multiple bytes. In your case, п starts at byte index 6 (inclusive) and ends at byte index 8 (exclusive), so byte index 7 is not the start of a character. This is why your error occurred.
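The boundary rule can be checked directly with str::is_char_boundary, and str::get returns None instead of panicking when a byte range cuts through a character. The wrapper try_byte_slice below is only a name chosen for this sketch.

```rust
/// Byte-range slicing that reports a bad range as `None` instead of panicking.
/// `try_byte_slice` is a hypothetical wrapper around `str::get`.
fn try_byte_slice(text: &str, start: usize, end: usize) -> Option<&str> {
    text.get(start..end)
}

fn main() {
    let text = "Hello привет";

    // 'п' occupies bytes 6..8, so 6 and 8 are boundaries and 7 is not.
    println!("{}", text.is_char_boundary(6));
    println!("{}", text.is_char_boundary(7));
    println!("{}", text.is_char_boundary(8));

    // A range on boundaries yields a slice; a range starting at byte 7 does not.
    println!("{:?}", try_byte_slice(text, 6, 8));
    println!("{:?}", try_byte_slice(text, 7, 9));
}
```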
You may use str::char_indices to solve this (remember that getting to a position in UTF-8 is O(n)):
fn get_utf8_slice(string: &str, start: usize, end: usize) -> Option<&str> {
    assert!(end >= start);
    string.char_indices().nth(start).and_then(/* ... */)
}
playground
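One way a helper with this kind of signature could be written in full is sketched below. It is not necessarily the code behind the playground link. The name slice_by_chars and the choice to return None for any out-of-range or reversed index are assumptions made for this illustration.

```rust
/// Slices `s` by character indices `start..end` and borrows from the original string.
/// `slice_by_chars` is a hypothetical name; returning `None` for out-of-range or
/// reversed indices is an assumption of this sketch.
fn slice_by_chars(s: &str, start: usize, end: usize) -> Option<&str> {
    if start > end {
        return None;
    }
    // Byte offset of every character start, plus the one-past-the-end position,
    // so that `end == number of characters` is a valid bound.
    let mut boundaries = s
        .char_indices()
        .map(|(i, _)| i)
        .chain(std::iter::once(s.len()));
    let start_byte = boundaries.nth(start)?;
    // The iterator has already moved past `start`, so step the remaining distance.
    let end_byte = match end - start {
        0 => start_byte,
        n => boundaries.nth(n - 1)?,
    };
    Some(&s[start_byte..end_byte])
}

fn main() {
    let text = "Hello привет";
    println!("{:?}", slice_by_chars(text, 6, 12)); // whole Cyrillic word
    println!("{:?}", slice_by_chars(text, 6, 13)); // end is out of range
}
```

The string is walked only once, and the result is a &str rather than a newly allocated String.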
You may use str::chars() if you are fine with getting a String:
let string: String = text.chars().take(end).skip(start).collect();
Accepted answer: answered Aug 23 at 10:23 by Lukas Kalbertodt, edited Aug 23 at 10:37
Second answer: answered Aug 23 at 10:26 by Tim Diekmann, edited Aug 23 at 10:37