Loop over a list of strings and increment letter count in a corresponding sublist
I have a 2D list as follows:
counts = {{"A", "R", "N", "D", "C", "E", "Q", "G", "H", "I", "L", "K",
   "M", "F", "P", "S", "T", "W", "Y", "V"},
  {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0},
  {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0},
  {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}, ...};
The first sub-list consists of a heading, and the following sub-lists contain counts, initialized at zero.
I need to loop over another list, sequences, that contains strings plus a heading, and access the corresponding sub-list in counts to increment the appropriate letter count.
For example, take a string from sequences:
MKTIIALSYILCLVFAQKLPGNDNSTATLCLGHHAVPNGTIVKTITNDQIEVTNATELVQSSSTGEICDSPHQILDGKNCTLIDALLGDPQCDGFQNKKWDLFVERSKAYSNCYPYDVPDYASLRSLVASSGTLEFNNESFNWTGVTQNGTSSACIRRSKNSFFSRLNWLTHLNFKYPALNVTMPNNEQFDKLYIWGVHHPGTDKDQIFLYAQASGRITVSTKRSQQTVSPNIGSRPRVRNIPSRISIYWTIVKPGDILLINSTGNLIAPRGYFKIRSGKSSIMRSDAPIGKCNSECITPNGSIPNDKPFQNVNRITYGACPRYVKQNTLKLATGMRNVPEKQTRGIFGAIAGFIENGWEGMVDGWYGFRHQNSEGRGQAADLKSTQAAIDQINGKLNRLIGKTNEKFHQIEKEFSEVEGRIQDLEKYVEDTKIDLWSYNAELLVALENQHTIDLTDSEMNKLFEKTKKQLRENAEDMGNGCFKIYHKCDNACIGSIRNGTYDHDVYRDEALNNRFQIKGVELKSGYKDWILWISFAISCFLLCVALLGFIMWACQKGNIRCNICI
Its corresponding sub-list in counts would be incremented to {31, 27, 45, 30, 18, 27, 25, 25, 42, 11, 48, 44, 37, 8, 23, 20, 41, 34, 11, 19, 25}.
I obtained this via StringCount[sequences[[1]], #] & /@ counts[[1]] but am struggling to scale this code, and to make it update the sub-lists in counts instead of returning a new list.
list-manipulation numerics counting
edited 59 mins ago
asked 1 hour ago
briennakh
This works at counting, but if I map it over all sequences as Transpose@Tally@Characters@# & /@ sequences it will output multiple headings + counts.
– briennakh
32 mins ago
3 Answers
You can use LetterCounts as follows:
letters = {"A", "R", "N", "D", "C", "E", "Q", "G", "H", "I", "L",
   "K", "M", "F", "P", "S", "T", "W", "Y", "V"};
sequences = StringJoin /@ RandomChoice[Capitalize@Alphabet[], {10, 100}];
lcs = letters /. LetterCounts /@ sequences /. Thread[letters -> 0];
counts = Join[{letters}, lcs];
counts // Grid
answered 48 mins ago
kglr (accepted answer)
I like the pretty output!
– briennakh
39 mins ago
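To see why the answer above chains two replacements, here is a minimal sketch on a toy alphabet. The five-letter `letters` list, the string "ARNDA", and the names `lc`, `step1` and `step2` are made up for illustration and are not from the thread.

```mathematica
(* Toy data, assumed for illustration only *)
letters = {"A", "R", "N", "D", "C"};

(* LetterCounts returns an association from letters to their counts *)
lc = LetterCounts["ARNDA"];

(* First replacement: letters that occur are replaced by their counts;
   "C" does not occur in the string, so it is left as a string *)
step1 = letters /. lc
(* {2, 1, 1, 1, "C"} *)

(* Second replacement: any letter still present had no count, so it becomes 0 *)
step2 = step1 /. Thread[letters -> 0]
(* {2, 1, 1, 1, 0} *)
```

Mapping LetterCounts over a list of strings, as in the answer, gives one such row per string.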
sequences = {"MKTIIALSYILCLVFAQKLPGNDNSTATLCLGHHAVPNGTIVKTITNDQIEVTNATELVQSSSTGEIC
DSPHQILDGKNCTLIDALLGDPQCDGFQNKKWDLFVERSKAYSNCYPYDVPDYASLRSLVASSGTLEFNN
ESFNWTGVTQNGTSSACIRRSKNSFFSRLNWLTHLNFKYPALNVTMPNNEQFDKLYIWGVHHPGTDKDQI
FLYAQASGRITVSTKRSQQTVSPNIGSRPRVRNIPSRISIYWTIVKPGDILLINSTGNLIAPRGYFKIRS
GKSSIMRSDAPIGKCNSECITPNGSIPNDKPFQNVNRITYGACPRYVKQNTLKLATGMRNVPEKQTRGIF
GAIAGFIENGWEGMVDGWYGFRHQNSEGRGQAADLKSTQAAIDQINGKLNRLIGKTNEKFHQIEKEFSEV
EGRIQDLEKYVEDTKIDLWSYNAELLVALENQHTIDLTDSEMNKLFEKTKKQLRENAEDMGNGCFKIYHK
CDNACIGSIRNGTYDHDVYRDEALNNRFQIKGVELKSGYKDWILWISFAISCFLLCVALLGFIMWACQKG
NIRCNICI"};
counts = {{"A", "R", "N", "D", "C", "E", "Q", "G", "H", "I", "L", "K",
   "M", "F", "P", "S", "T", "W", "Y", "V"}, {0, 0, 0, 0, 0, 0, 0, 0,
   0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}};
and the code:
new = Values[
   (CharacterCounts /@ sequences)[[All, First@counts]]
   ];
counts[[2 ;;]] += new;
counts
{{"A", "R", "N", "D", "C", "E", "Q", "G", "H", "I", "L", "K", "M",
  "F", "P", "S", "T", "W", "Y", "V"}, {31, 27, 45, 30, 18, 27, 25, 42,
  11, 48, 44, 37, 8, 23, 20, 41, 34, 11, 19, 25}}
answered 47 mins ago
Kuba♦
Thank you, this works as well!
– briennakh
35 mins ago
But I would need to change the code to accommodate a list of sequences.
– briennakh
30 mins ago
@briennakh it should work for longer counts and sequences. If not, please add examples to work with to your question.
– Kuba♦
21 mins ago
This is also much faster than kglr's solution (see my post for timing examples).
– Henrik Schumacher
15 mins ago
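For the case raised in the comments, where sequences holds several strings plus a heading, a hedged sketch of the same in-place update might look like this. The toy data and the heading string "seq" are assumptions, and Lookup with a default of 0 is swapped in for the Part-based extraction so that letters absent from a string count as zero.

```mathematica
(* Toy data, assumed for illustration: the first element of sequences is a heading *)
sequences = {"seq", "ARNDA", "CCDR"};
counts = {{"A", "R", "N", "D", "C"}, {0, 0, 0, 0, 0}, {0, 0, 0, 0, 0}};

(* Skip the heading with Rest, count characters in each string,
   and look up each header letter with a default of 0 *)
new = Lookup[CharacterCounts[#], First@counts, 0] & /@ Rest@sequences;

(* Add the new counts to the existing rows in place; the header row is untouched *)
counts[[2 ;;]] += new;
counts
(* {{"A", "R", "N", "D", "C"}, {2, 1, 1, 1, 0}, {0, 1, 0, 1, 2}} *)
```

This assumes counts has exactly one row per non-heading string, in the same order as sequences.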
I can propose two things that speed up the letter counting tremendously:
1.) Use ToCharacterCode to convert your strings to packed arrays of integers.
2.) Use a compiled function for additive matrix assembly.
Additive assembly of each row can be obtained with this little function.
cAssembleRow = Compile[{{a, _Integer, 1}, {max, _Integer}},
   Block[{b},
    b = Table[0, {max}];
    Do[b[[Compile`GetElement[a, i]]]++, {i, 1, Length[a]}];
    b
    ],
   CompilationTarget -> "C",
   RuntimeAttributes -> {Listable},
   Parallelization -> True,
   RuntimeOptions -> "Speed"
   ];
Borrowing a bit of code from kglr but cranking up the number of strings and their length:
sequences = StringJoin /@ RandomChoice[Capitalize@Alphabet[], {1000, 1000}];
letters = {"A", "R", "N", "D", "C", "E", "Q", "G", "H", "I", "L", "K", "M", "F", "P", "S", "T", "W", "Y", "V"};
Here is how kglr's and Kuba's very elegant solutions perform:
lcs = letters /. LetterCounts /@ sequences /. Thread[letters -> 0]; // RepeatedTiming // First
lcs2 = Values[(CharacterCounts /@ sequences)[[All, letters]]]; // RepeatedTiming // First
3.65
0.076
My version is a bit more clunky, but it does the job several times faster:
i0 = ToCharacterCode["A"][[1]] - 1;
letterpos = ToCharacterCode[StringJoin[letters]] - i0;
lcs3 = cAssembleRow[ToCharacterCode[sequences] - i0, 26][[All, letterpos]]; // RepeatedTiming // First
0.0090
When all letters occur in each element of sequences, then all results are equal:
lcs == lcs2 == lcs3
True
edited 1 min ago
answered 20 mins ago
Henrik Schumacher
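The indexing idea behind the answer above (shift character codes so that "A" maps to 1, count per code, then pick out the wanted letters) can be sketched without Compile. This is only an illustration on an assumed toy string, with the built-in BinCounts standing in for the compiled cAssembleRow; the names codes and row are hypothetical.

```mathematica
(* Toy alphabet, assumed for illustration *)
letters = {"A", "R", "N", "D", "C"};

(* Offset so that "A" maps to index 1, "B" to 2, and so on *)
i0 = ToCharacterCode["A"][[1]] - 1;

(* Index of each wanted letter within the 26-letter alphabet *)
letterpos = ToCharacterCode[StringJoin[letters]] - i0
(* {1, 18, 14, 4, 3} *)

(* Shifted character codes of one toy string *)
codes = ToCharacterCode["ARNDA"] - i0
(* {1, 18, 14, 4, 1} *)

(* One unit-width bin per letter index 1..26, then select the wanted letters *)
row = BinCounts[codes, {1, 27, 1}][[letterpos]]
(* {2, 1, 1, 1, 0} *)
```

Because every index has its own bin, letters that never occur come out as 0 without extra processing. The sketch assumes the strings contain only the uppercase letters A to Z.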
Henrik, if some letters have 0 count in some sequences, Kuba's lcs will have Missing[KeyAbsent] instead of 0; so some additional processing is needed.
– kglr
5 mins ago