Can machine learning decode the SHA256 hashes?

I have a 64-character SHA256 hash.

I'm hoping to train a model that can predict if the plaintext used to generate the hash begins with a 1 or not.

Regardless of whether this is "possible", what algorithm would be the best approach?



My initial thoughts:



  • Generate a large sample of hashes of plaintexts that begin with a 1, and a large sample of hashes of plaintexts that do not

  • Set each of the 64 characters of a hash as a parameter for some sort of unsupervised logistic regression model.

  • Train the model by telling it when it is right/wrong.

  • Hopefully be able to create a model that can predict if the plaintext begins with a 1 or not with a high enough accuracy (and with a decent kappa); a rough sketch of this plan follows below
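Here is a minimal sketch of that plan in Python (hashlib plus scikit-learn; note it uses ordinary supervised logistic regression, since the labels are known when the data is generated, and the alphanumeric plaintext alphabet is an arbitrary assumption):

import hashlib
import random
import string

import numpy as np
from sklearn.linear_model import LogisticRegression

def random_plaintext(starts_with_one: bool) -> str:
    # Hypothetical alphanumeric plaintexts; the first character is fixed by the label.
    alphabet = string.ascii_letters + string.digits
    first = "1" if starts_with_one else random.choice(alphabet.replace("1", ""))
    return first + "".join(random.choices(alphabet, k=random.randint(0, 49)))

def hash_bits(text: str) -> np.ndarray:
    # One feature per digest bit: 32 digest bytes -> 256 binary features.
    digest = hashlib.sha256(text.encode()).digest()
    return np.unpackbits(np.frombuffer(digest, dtype=np.uint8))

labels = np.random.rand(20_000) < 0.5
X = np.stack([hash_bits(random_plaintext(bool(y))) for y in labels])

model = LogisticRegression(max_iter=1000).fit(X[:10_000], labels[:10_000])
print("held-out accuracy:", model.score(X[10_000:], labels[10_000:]))  # expect ~0.5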









Tags: machine-learning, logistic

Asked 2 days ago by John (a new contributor); edited yesterday by Tim♦.
Comments:

  • For this to be tractable, you’d essentially have to have the ML algorithm carry out the hashing function. Maybe this operation could be represented in some form or fashion, but you’ll have to do most of the work by engineering features. – Sycorax, 2 days ago

  • @John, it's likely not possible, but if you're still determined to test it out, you should make sure that whatever algorithm you use is powerful enough to do the opposite. Given the plaintext it should determine the first digit of the hash, or in other words learn the hashing function from examples. – csiz, 2 days ago

  • FYI: This is likely motivated by Bitcoin mining. – ClojureMostly, yesterday

  • “How would I train a model that allows me to time travel, regardless of whether this is ‘possible’?” – Konrad Rudolph, yesterday

  • @Joshua OP wants to invert SHA-256. I’ll let him publish even if it takes a thousand times as many steps as SHA-256. I’ll also cease existing since the solution almost certainly exploits a bug within the very fabric of reality to defeat logic. – Konrad Rudolph, yesterday










11 Answers
















Answer 1 (42 votes), by pvlkmrv, answered 2 days ago:
SHA256 is designed to be as random as possible, so it is unlikely you would be able to separate hashes that came from 1-prefixed plaintext from those that do not; there should simply be no feature of the hash string that would give that information away.






Comments:

  • "unlikely" and "should" - That's what the algorithm will tell me. At first glance it does appear to be impossible, but I would like to know what algorithm and approach to take to test this hypothesis. – John, 2 days ago

  • +1 It is guaranteed that any kind of "unsupervised logistic regression model" will be unable to do better than guessing unless it can be supplied a truly astronomical number of cases. This problem is tilting at windmills. – whuber♦, 2 days ago

  • You can try it, but the learner will be trying to find a statistical relationship that is intentionally designed not to exist. – pvlkmrv, 2 days ago

  • "Designed to be as random as possible" is an understatement. Specifically, the design aims to have a maximally non-linear dependency, where each input bit influences approximately 50% of the output bits, and each output bit depends on approximately 50% of the input bits. This is known as confusion and diffusion. It makes the task here (recovering just the first bit) just as hard as recovering the whole message. – MSalters, yesterday

  • I think you can strengthen "unlikely" in this answer. The OP has zero chance of applying statistics-based techniques to predict any part of a SHA256 hash from content, or vice versa, with even a detectable improvement over random guessing. The practical solution is basically to pre-calculate the whole target population of original content. – Neil Slater, yesterday


















Answer 2 (41 votes), by Chris H, answered yesterday:
This isn't really a stats answer, but:



No, you can't determine the first character of the plaintext from the hash, because there's no such thing as "the plaintext" for a given hash.



SHA-256 is a hashing algorithm. No matter what your plaintext, you get out a 32-byte signature, often expressed as a 64-character hex string. There are far more possible plaintexts than there are possible 64-character hex strings - the same hash can be generated from any number of different plaintexts. There's no reason to believe that the first character being/not being a '1' is uniform across all plaintexts producing a given hash.
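To make the counting concrete (my arithmetic, added for illustration): there are exactly $16^{64} = 2^{256}$ possible 64-character hex digests, yet there are $2^{257}$ distinct bitstrings of length 257 bits alone, twice the number of digests, so by the pigeonhole principle collisions must exist; with unbounded input length, one expects every digest to have infinitely many preimages.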






Comments:

  • This is so far (in my opinion) the only correct answer. All the other answers seem to deal more with the problem of learning the hash function rather than learning to invert the hash (the actual question). They all seem to ignore that hashing is not an injective function. – Luca Citi, yesterday

  • Could you not predict the probability that the first char is a one? Lack of one-to-one-ness is not an uncommon problem in statistical learning. – Matthew Drury, yesterday

  • @MatthewDrury Given that SHA256 is designed to make all inputs equally likely for a given hash, there will hopefully be infinitely many inputs starting with 1, for any given hash. So if you want to estimate the probability, your best estimate is going to be roughly $\frac{1}{256} \pm \varepsilon$. – Konrad Rudolph, yesterday

  • Yup, agree. I just wanted to note that the lack of injectivity is not really a structural problem with the application of machine learning. – Matthew Drury, yesterday

  • @IMil The reason I mentioned specifically the fact it's a hash function isn't to suggest no hash function could possibly ever reveal that information, but to motivate the statement that there's no such thing as "the plaintext". Sure, a (bad) hash function could be partially reversible and clearly tell us something about the whole set of plaintexts which would produce it, but there's no reason to believe that of SHA-256. – Chris H, 23 hours ago


















Answer 3 (32 votes), by IMil:
Regardless of whether this is "possible", what algorithm would be the best approach?



Sorry, but that's a nonsensical question. If something is impossible, then you can't search for the best approach to the problem.



In this case, this definitely should be impossible, because hashing is a one-way function: several inputs (infinitely many, in fact) can produce the same output. If the first bit of the input on its own somehow influenced the probability of a specific hash value, this would mean that the hash algorithm is completely flawed.



You certainly can train a neural network, linear classifier, SVM and whatnot to attempt prediction. And if you are able to reliably predict the input from the output for a certain hashing algorithm, this would prove that the algorithm is worthless. I'd say that for a widely used algorithm like SHA256 such a possibility is vanishingly low. However, it's a reasonable approach to quickly rule out new, unproven and untested hashing algorithms.






Comments:

  • It’s (almost) irrelevant that a hash is a one-way function: Off the top of my head I can think of lots of one-way functions that nevertheless allow me to infer features of the original input ($\operatorname{sign}(x)$, famously a non-linear one-way function, tells me a lot about the input). – Konrad Rudolph, yesterday

  • @KonradRudolph: "One-way function" has a specific meaning in this context that is not the meaning you're thinking of. sign(x) is not a one-way function in this sense, because finding preimages is trivial. – user2357112, yesterday

  • That said, I don't think the answer is using "one-way function" correctly either. – user2357112, yesterday

  • @user2357112 Thanks, I didn't know that. I only knew the meaning as a function that's surjective but not bijective. That's also the definition given in the answer, which is what I objected to. – Konrad Rudolph, yesterday

  • Yeah, sorry, I'm a bit lax with definitions. However, I believe that 'one-way' is more understandable to novices than the more strict terms. – IMil, yesterday

















Answer 4 (18 votes):
While one can't prove a negative with an example, I still feel an example would be suggestive, and perhaps useful. And it does show how one would (attempt to) solve similar problems.

In the case of "I want to make binary predictions, using features that are binary vectors", a Random Forest is a solid choice. I guess this kind of answers the second part of your question: what is a good algorithm.

We will want to preprocess the SHA256 strings into binary (Boolean) vectors; as each bit is statistically independent, each bit is a good feature. That will make our inputs 256-element Boolean vectors.



Demo



Here is a demonstration of how the whole thing can be done using the Julia DecisionTree.jl library.



You can copy-paste the below into the Julia prompt.



using SHA
using DecisionTree
using Statistics: mean
using Random: randstring

const maxlen = 10_000 # longest string (document) to be hashed.

gen_plaintext(x) = gen_plaintext(Val{x}())
gen_plaintext(::Val{true}) = "1" * randstring(rand(0:maxlen-1))
gen_plaintext(::Val{false}) = randstring(rand(1:maxlen))


bitvector(x) = BitVector(digits(x, base=2, pad=8sizeof(x)))
bitvector(x::AbstractVector) = reduce(vcat, bitvector.(x))

function gen_observation(class)
    plaintext = gen_plaintext(class)
    obs = bitvector(sha256(plaintext))
    obs
end

function feature_mat(obs)
    convert(Array, reduce(hcat, obs)')
end

########################################

const train_labels = rand(Bool, 100_000)
const train_obs = gen_observation.(train_labels)
const train_feature_mat = feature_mat(train_obs)

const test_labels = rand(Bool, 100_000)
const test_obs = gen_observation.(test_labels)
const test_feature_mat = feature_mat(test_obs)


# Train the model
const model = build_forest(train_labels, train_feature_mat)
@show model


# Training Set accuracy:
@show mean(apply_forest(model, train_feature_mat) .== train_labels)

# Test Set accuracy:
@show mean(apply_forest(model, test_feature_mat) .== test_labels)


Results



When I did this, training on 100,000 random ASCII strings of length up to 10,000, here are the results I saw:



Train the model



julia> const model = build_forest(train_labels, train_feature_mat)
Ensemble of Decision Trees
Trees: 10
Avg Leaves: 16124.7
Avg Depth: 17.9


Training Set accuracy:



julia> mean(apply_forest(model, train_feature_mat) .== train_labels)
0.95162


Test Set accuracy:



julia> mean(apply_forest(model, test_feature_mat) .== test_labels)
0.5016


Discussion



So that is basically nothing. We went from 95% on the training set to barely over 50% on the test set. Someone could apply proper hypothesis tests to see if we can reject the null hypothesis, but I am pretty certain we can't: it is a tiny improvement over the guess rate.

That suggests it can't be learned. If a Random Forest can go from well fitted to hitting just the guess rate, there is nothing there; Random Forests are pretty capable of learning difficult inputs. If there were something to learn, I would expect at least a few percent.

You can play around with different hash functions by changing the code, which could be interesting. I got basically the same results when using Julia's built-in hash function (which is not a cryptographically secure hash, but is still a good hash, so it should indeed send similar strings apart). I also got basically the same results for CRC32c.



































Answer 5 (10 votes):
Hash functions are (by design) extremely badly suited for doing any machine learning with them.

ML is essentially a family of methods for modelling / estimating locally continuous functions. I.e., you're trying to describe some physical system that, while it may have certain discontinuities, is in some sense in most of the parameter space smooth enough that only a scattered sample of test data is needed to predict the result for other inputs. To do that, the AI algorithms need to somehow decompose the data into a clever basis representation, for which the training has suggested that e.g. if you see such and such shape (which appears to correlate with the result of such and such convolution) then there's a good chance that the output should have, in the corresponding region, such and such structure (which can again be described by a convolution or something).

(I know, many ML approaches aren't like convolution at all, but the general idea is always the same: you have some input space that's so high-dimensional it's impossible to sample exhaustively, so you find a clever decomposition that allows you to extrapolate results from a comparatively sparse sample.)

The idea behind a cryptographic hash function, however, is that any change to the plaintext should result in a completely different digest. So no matter how you decompose the function, local estimators will not allow you to extrapolate how small fluctuations around that part influence the result. Unless of course you actually process all of the information of a limited set, but this wouldn't be called machine learning: you'd just be building a rainbow table.
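To see that property concretely, here is a minimal illustration (my sketch, standard-library Python only): flip a single input bit and count how many of the 256 digest bits change. For SHA-256 the count hovers around 128, so there is no local structure for an estimator to extrapolate from.

import hashlib

def bit_diff(a: bytes, b: bytes) -> int:
    # Number of differing bits between two equal-length byte strings.
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

msg = bytearray(b"hello world, this is a test message")
base = hashlib.sha256(msg).digest()

msg[0] ^= 0x01  # flip one input bit
print(bit_diff(base, hashlib.sha256(msg).digest()), "of 256 digest bits changed")  # typically ~128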






Comments:

  • Reading this, it occurred to me that a) to build a rainbow table, you need to know what hash function to use, and b) it might be possible for a machine learning algorithm to identify which algorithm is in use, given a large enough sample of inputs and outputs (at least if the algorithm has identifiable flaws). So if the original problem were restated to be about an unknown hash function that needs to be identified, it might be more practically interesting. – senderle, yesterday

















Answer 6 (5 votes):
This is an interesting question because it raises issues about what counts as "machine learning." There is certainly an algorithm that will eventually solve this problem if it can be solved. It goes like this:

1. Pick your favorite programming language, and decide on an encoding that maps every string to a (potentially very large) integer.

2. Pick a random number and convert it into a string. Check to see if it's a valid program in your language. If it's not, pick another number and try again. If it is, start it, immediately pause it, and add it to a list of paused programs.

3. Run all the paused programs for a little while. If any of them halt without producing an adequate solution, remove them from the list. If one produces an adequate solution, you're done! Otherwise, return to 2 after letting them all run for a bit.

There's no question that if you have infinite storage and infinite time, the above algorithm will eventually find a good solution. But that's probably not what you mean by "machine learning."

Here's the rub: if you consider all possible problems, no machine learning algorithm can do better on average! This is known as the no free lunch theorem. It proves that among all the possible problems you could throw at any given machine learning algorithm, the number that it can solve quickly is vanishingly small.

It can solve those problems quickly only because they are governed by patterns that the algorithm can anticipate. For example, many successful algorithms assume the following:

1. Solutions can be described by some complex series of matrix multiplications and nonlinear distortions, governed by a set of parameters.

2. Good solutions will be clustered together in parameter space, so that all you have to do is pick a search neighborhood, find the best solution there, shift your search neighborhood so that the best solution is in the center, and repeat.

Obviously neither of these assumptions holds in general. The second is particularly suspect. And the no free lunch theorem tells us that these assumptions don't even hold most of the time. In fact they almost never hold! It's just our good fortune that they do hold for certain problems that actually matter.

The problem you've chosen is designed from the beginning to violate assumption 2. Hash functions are specifically designed so that similar inputs give completely different outputs.

So your question—what is the best machine learning algorithm for solving this problem?—probably has a very straightforward answer: random search.






Comments:

  • I wonder how quantum computing will affect the no-free-lunch theorem. Presumably, quantum computing is constrained by it, too. – Max Vernon, yesterday

  • @MaxVernon oh, interesting. I would expect that all quantum algorithms have the same property when compared to other quantum algorithms. I do not know whether all quantum optimization algorithms have an average-case speedup over classical ones. They might! I have a question and self-answer that talks about a "free lunch" theorem that could be relevant. (tl;dr: the lunch is only free if you ignore some of the work done... but I wonder if that changes in the quantum case.) – senderle, yesterday

















Answer 7 (5 votes), by IndieSolver:
It is next to impossible. However, people have observed some patterns in SHA256 which might suggest non-randomness: A distinguisher for SHA256 using Bitcoin (mining faster along the way). Their tl;dr:

"To distinguish between an ideal random permutation hash and SHA256, hash a large amount (~2^80) of candidate 1024-bit blocks twice, as done in Bitcoin. Ensure that the bits of the candidate blocks are sparsely set (much fewer than the 512 mean expected), according to the Bitcoin protocol, discarding candidate blocks that do not meet the Bitcoin “difficulty” standard (where the resultant hashes start with a large number of 0’s). With the remaining set of valid input candidates (467369 when this analysis was done), observe a particular set of 32 bits in the input block (located where Bitcoin has the nonce, input bits 607-639). Note that the mean number of bits set in the nonce field is skewed to the left, i.e. fewer than the expected value of 16 bits set (estimated mean 15.428)."

See a discussion on lobste.rs. One possible explanation is a bias introduced by the miners.
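As a quick plausibility check on the quoted numbers (my back-of-the-envelope, not part of the linked analysis): under the null hypothesis the 32 nonce bits are fair coins, so the sample mean over 467369 blocks is extremely tightly concentrated around 16, and the reported 15.428 is far outside that range.

import math

n_samples = 467_369          # valid candidates quoted above
n_bits = 32                  # nonce width
observed_mean = 15.428       # reported mean number of set bits

sd_single = math.sqrt(n_bits * 0.25)        # binomial sd per block, ~2.83
sd_mean = sd_single / math.sqrt(n_samples)  # sd of the sample mean, ~0.004

z = (observed_mean - n_bits * 0.5) / sd_mean
print(f"z-score: {z:.0f}")  # about -138: far beyond chance, consistent with
                            # a miner-induced bias rather than a SHA256 flaw

Such an enormous z-score supports the comment below: a bias this large would be easy to discover, which makes a miner-side explanation far more plausible than a weakness in SHA256 itself.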






Comments:

  • This is interesting. But the reply on lobste.rs is probably right. That's a huge bias, easily discoverable. The notion that it has gone unnoticed for this long is pretty far-fetched. – senderle, yesterday

  • @senderle In order to exploit the bias (if any), one should come up with an algorithm (essentially a ML/optimization algorithm) which is computationally cheap, so that its own overhead when implemented/measured on state-of-the-art hardware is compensated by the speedup it provides. My very rough guess would be that the factor in terms of #hashtrials should be greater than 10x to beat brute force and its superoptimized implementations. The implications might be very serious, especially for people betting on crypto and security protocols. – IndieSolver, yesterday

















Answer 8 (2 votes), by Phil Frost:
I'll answer with a program. To reduce computational requirements I'll use a variant of SHA-256 that I call sha16, which is just the first 16 bits of SHA-256.



#!/usr/bin/python3

import hashlib
from itertools import count

def sha16(plaintext):
    h = hashlib.sha256()
    h.update(plaintext)
    return h.hexdigest()[:4]

def has_plaintext_start_with_1(digest):
    """Return True if and only if the given digest can be generated from a
    plaintext starting with "1" first bit."""
    return True

def plaintext_starting_with_1(digest):
    """Return a plaintext starting with '1' matching the given digest."""
    for c in count():
        plaintext = (b'\x80' + str(c).encode('ascii'))
        d = sha16(plaintext)
        if d == digest:
            return plaintext

for digest in range(0x10000):
    digest = "%04x" % (digest,)
    plain = plaintext_starting_with_1(digest)
    print("%s hashes to %s" % (plain, digest))


This produces the output:



b'\x8094207' hashes to 0000
b'\x8047770' hashes to 0001
b'\x8078597' hashes to 0002
b'\x8025129' hashes to 0003
b'\x8055307' hashes to 0004
b'\x80120019' hashes to 0005
b'\x8062700' hashes to 0006
b'\x8036411' hashes to 0007
b'\x80135953' hashes to 0008
b'\x8044091' hashes to 0009
b'\x808968' hashes to 000a
b'\x8039318' hashes to 000b
[...]


I'll leave the full proof as an exercise for the reader, but take my word for it: there's an input that starts with a "1" for each possible digest from 0000 to ffff.

There's also an input that doesn't start with "1". And there's one that starts with the complete works of Shakespeare, too.

This holds for any reasonably good hash function, although my brute force proof may become computationally infeasible.






























Answer 9 (1 vote):
Hashing functions are purposefully designed to be difficult to model, so (as pointed out already) this is likely to be very hard. Nevertheless, any weakness in the hashing function will reduce its entropy, making it more predictable.

Regardless of whether this is "possible", what algorithm would be the best approach?

A useful example is the Physically Unclonable Function, or PUF, which is analogous to a hardware hashing function. Typically, manufacturing variations are purposefully used to give each PUF a slightly different response, so that its 'hashed' output is different for a given input. Design weaknesses limit the entropy, however, and given enough challenge-response pairs it is often possible to build a black-box model of the PUF such that the response for a new, previously unseen challenge can be predicted.

Logistic regression is the most commonly used approach for these modelling attacks, such as in this paper by Rührmair; a toy sketch follows below.

Genetic algorithms (or more generally evolutionary strategies) may be an alternative approach, as they are applicable to problems that are not differentiable and/or linearly separable. They are also discussed in the above paper.
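As a toy illustration of such a modelling attack (my sketch, not the construction from the Rührmair paper: it simulates an idealized linear-threshold, arbiter-style PUF and fits scikit-learn's logistic regression to challenge-response pairs):

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_stages = 64  # hypothetical number of PUF stages

# Manufacturing variation: secret per-device delay weights.
secret_w = rng.normal(size=n_stages)

def puf_response(challenges: np.ndarray) -> np.ndarray:
    # Idealized model: response is the sign of a weighted sum of the
    # challenge bits (mapped to +/-1), plus a little measurement noise.
    phi = 1 - 2 * challenges
    return (phi @ secret_w + rng.normal(scale=0.1, size=len(challenges))) > 0

challenges = rng.integers(0, 2, size=(20_000, n_stages))
responses = puf_response(challenges)

model = LogisticRegression(max_iter=1000).fit(challenges[:15_000], responses[:15_000])
print("accuracy on unseen challenges:",
      model.score(challenges[15_000:], responses[15_000:]))  # near 1.0

A linear-threshold device leaks its structure to a linear model almost immediately; SHA-256 has no such exploitable structure, which is the contrast being drawn here.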

































Answer 10 (1 vote):
What you describe is basically a pre-image attack. You're trying to find an input such that, when it is hashed, the output has some property like "a leading 1".*

It is an explicit goal of cryptographic hashes to prevent such pre-image attacks. If you can make such an attack, we tend to consider that algorithm to be insecure and stop using it.

So while that means it's not impossible, it means your machine learning algorithm would have to simultaneously outwit a large fraction of the mathematicians in the world, and their supercomputers. It is unlikely that you will do so.

However, if you did, you would become known as someone who broke a major cryptographic hash algorithm. That fame is worth something!

*Technically a "first preimage attack" tries to find a match for a specific hash. However, to show that a hash algorithm has first preimage attack resistance, they typically show that you can't find any meaningful information about the input from the hash.

































Answer 11 (0 votes):
Almost all the answers here are telling you why you can't do this, but here's the direct answer to:

Regardless of whether this is "possible", what algorithm would be the best approach?

Assuming the input is sufficiently large:

1. Take the count of the set of valid characters.

2. Take the reciprocal of the number from step 1.

That's the probability that the input string starts with '1'. You don't even need to look at the input. If you can do better than that, it would mean the hash is very broken. You can save a lot of CPU cycles over trying to train an algorithm to pick random numbers.
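In code, the whole "model" amounts to this (my illustration; using printable ASCII as the valid character set is an assumption, since the question never specifies the plaintext alphabet):

import string

valid_chars = string.printable.strip()  # assumed alphabet: 94 printable non-whitespace chars
p = 1 / len(valid_chars)                # best achievable guess for P(first char == '1')
print(f"{p:.4f}")                       # ~0.0106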



You could train an algorithm, and it might come up with a different answer because of overfitting. That is, unless there's something really wrong with the hash algorithm, using that model will go wrong more often than simply guessing the base rate.


























            Your Answer




            StackExchange.ifUsing("editor", function ()
            return StackExchange.using("mathjaxEditing", function ()
            StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix)
            StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["$", "$"], ["\\(","\\)"]]);
            );
            );
            , "mathjax-editing");

            StackExchange.ready(function()
            var channelOptions =
            tags: "".split(" "),
            id: "65"
            ;
            initTagRenderer("".split(" "), "".split(" "), channelOptions);

            StackExchange.using("externalEditor", function()
            // Have to fire editor after snippets, if snippets enabled
            if (StackExchange.settings.snippets.snippetsEnabled)
            StackExchange.using("snippets", function()
            createEditor();
            );

            else
            createEditor();

            );

            function createEditor()
            StackExchange.prepareEditor(
            heartbeatType: 'answer',
            convertImagesToLinks: false,
            noModals: false,
            showLowRepImageUploadWarning: true,
            reputationToPostImages: null,
            bindNavPrevention: true,
            postfix: "",
            onDemand: true,
            discardSelector: ".discard-answer"
            ,immediatelyShowMarkdownHelp:true
            );



            );






            John is a new contributor. Be nice, and check out our Code of Conduct.









             

            draft saved


            draft discarded


















            StackExchange.ready(
            function ()
            StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstats.stackexchange.com%2fquestions%2f366272%2fcan-machine-learning-decode-the-sha256-hashes%23new-answer', 'question_page');

            );

            Post as a guest






























            11 Answers
            11






            active

            oldest

            votes








            11 Answers
            11






            active

            oldest

            votes









            active

            oldest

            votes






            active

            oldest

            votes








            up vote
            42
            down vote













            SHA256 is designed to be as random as possible, so it is unlikely you would be able to separate hashes that came from 1-prefixed plaintext from those that do not; there should simply be no feature of the hash string that would give that information away.






            share|cite|improve this answer
















            • 5




              "unlikely" and "should" - Thats what the algorithm will tell me. At first glance it does appear to be impossible, but I would like to know what algorithm and approach to take to test this hypothesis.
              – John
              2 days ago






            • 19




              +1 It is guaranteed that any kind of "unsupervised logistic regression model" will be unable to do better than guessing unless it can be supplied a truly astronomical number of cases. This problem is tilting at windmills.
              – whuber♦
              2 days ago






            • 30




              You can try it, but the learner will be trying to find a statistical relationship that is intentionally designed not to exist.
              – pvlkmrv
              2 days ago






            • 24




              "Designed to be as random as possible" is an understatement. Specifically, the design aims to have a maximally non-linear dependency, where each input bit influences approximately 50% of the output bits, and each output bit depends on approximately 50%of the input bits. This is known as confusion and diffusion. It makes the task here (recovering just the first bit) just as hard as recovering the whole message.
              – MSalters
              yesterday






            • 10




              I think you can strengthen "unlikely" in this answer. The OP has zero chance of applying statistics-based techniques to predict any part of a SHA256 hash from content, or vice-versa, with even a detectable improvement over random guessing. The practical solution is basically to pre-calculate the whole target population of original content.
              – Neil Slater
              yesterday















            up vote
            42
            down vote













            SHA256 is designed to be as random as possible, so it is unlikely you would be able to separate hashes that came from 1-prefixed plaintext from those that do not; there should simply be no feature of the hash string that would give that information away.






            share|cite|improve this answer
















            • 5




              "unlikely" and "should" - Thats what the algorithm will tell me. At first glance it does appear to be impossible, but I would like to know what algorithm and approach to take to test this hypothesis.
              – John
              2 days ago






            • 19




              +1 It is guaranteed that any kind of "unsupervised logistic regression model" will be unable to do better than guessing unless it can be supplied a truly astronomical number of cases. This problem is tilting at windmills.
              – whuber♦
              2 days ago






            • 30




              You can try it, but the learner will be trying to find a statistical relationship that is intentionally designed not to exist.
              – pvlkmrv
              2 days ago






            • 24




              "Designed to be as random as possible" is an understatement. Specifically, the design aims to have a maximally non-linear dependency, where each input bit influences approximately 50% of the output bits, and each output bit depends on approximately 50%of the input bits. This is known as confusion and diffusion. It makes the task here (recovering just the first bit) just as hard as recovering the whole message.
              – MSalters
              yesterday






            • 10




              I think you can strengthen "unlikely" in this answer. The OP has zero chance of applying statistics-based techniques to predict any part of a SHA256 hash from content, or vice-versa, with even a detectable improvement over random guessing. The practical solution is basically to pre-calculate the whole target population of original content.
              – Neil Slater
              yesterday













            up vote
            42
            down vote










            up vote
            42
            down vote









            SHA256 is designed to be as random as possible, so it is unlikely you would be able to separate hashes that came from 1-prefixed plaintext from those that do not; there should simply be no feature of the hash string that would give that information away.






            share|cite|improve this answer












            SHA256 is designed to be as random as possible, so it is unlikely you would be able to separate hashes that came from 1-prefixed plaintext from those that do not; there should simply be no feature of the hash string that would give that information away.







            share|cite|improve this answer












            share|cite|improve this answer



            share|cite|improve this answer










            answered 2 days ago









            pvlkmrv

            601214




            601214







            • 5




              "unlikely" and "should" - Thats what the algorithm will tell me. At first glance it does appear to be impossible, but I would like to know what algorithm and approach to take to test this hypothesis.
              – John
              2 days ago






            • 19




              +1 It is guaranteed that any kind of "unsupervised logistic regression model" will be unable to do better than guessing unless it can be supplied a truly astronomical number of cases. This problem is tilting at windmills.
              – whuber♦
              2 days ago






            • 30




              You can try it, but the learner will be trying to find a statistical relationship that is intentionally designed not to exist.
              – pvlkmrv
              2 days ago






            • 24




              "Designed to be as random as possible" is an understatement. Specifically, the design aims to have a maximally non-linear dependency, where each input bit influences approximately 50% of the output bits, and each output bit depends on approximately 50%of the input bits. This is known as confusion and diffusion. It makes the task here (recovering just the first bit) just as hard as recovering the whole message.
              – MSalters
              yesterday






            • 10




              I think you can strengthen "unlikely" in this answer. The OP has zero chance of applying statistics-based techniques to predict any part of a SHA256 hash from content, or vice-versa, with even a detectable improvement over random guessing. The practical solution is basically to pre-calculate the whole target population of original content.
              – Neil Slater
              yesterday













            • 5




              "unlikely" and "should" - Thats what the algorithm will tell me. At first glance it does appear to be impossible, but I would like to know what algorithm and approach to take to test this hypothesis.
              – John
              2 days ago






            • 19




              +1 It is guaranteed that any kind of "unsupervised logistic regression model" will be unable to do better than guessing unless it can be supplied a truly astronomical number of cases. This problem is tilting at windmills.
              – whuber♦
              2 days ago






            • 30




              You can try it, but the learner will be trying to find a statistical relationship that is intentionally designed not to exist.
              – pvlkmrv
              2 days ago






            • 24




              "Designed to be as random as possible" is an understatement. Specifically, the design aims to have a maximally non-linear dependency, where each input bit influences approximately 50% of the output bits, and each output bit depends on approximately 50%of the input bits. This is known as confusion and diffusion. It makes the task here (recovering just the first bit) just as hard as recovering the whole message.
              – MSalters
              yesterday






            • 10




              I think you can strengthen "unlikely" in this answer. The OP has zero chance of applying statistics-based techniques to predict any part of a SHA256 hash from content, or vice-versa, with even a detectable improvement over random guessing. The practical solution is basically to pre-calculate the whole target population of original content.
              – Neil Slater
              yesterday








            5




            5




            "unlikely" and "should" - Thats what the algorithm will tell me. At first glance it does appear to be impossible, but I would like to know what algorithm and approach to take to test this hypothesis.
            – John
            2 days ago




            "unlikely" and "should" - Thats what the algorithm will tell me. At first glance it does appear to be impossible, but I would like to know what algorithm and approach to take to test this hypothesis.
            – John
            2 days ago




            19




            19




            +1 It is guaranteed that any kind of "unsupervised logistic regression model" will be unable to do better than guessing unless it can be supplied a truly astronomical number of cases. This problem is tilting at windmills.
            – whuber♦
            2 days ago




            +1 It is guaranteed that any kind of "unsupervised logistic regression model" will be unable to do better than guessing unless it can be supplied a truly astronomical number of cases. This problem is tilting at windmills.
            – whuber♦
            2 days ago




            30




            30




            You can try it, but the learner will be trying to find a statistical relationship that is intentionally designed not to exist.
            – pvlkmrv
            2 days ago




            You can try it, but the learner will be trying to find a statistical relationship that is intentionally designed not to exist.
            – pvlkmrv
            2 days ago




            24




            24




            "Designed to be as random as possible" is an understatement. Specifically, the design aims to have a maximally non-linear dependency, where each input bit influences approximately 50% of the output bits, and each output bit depends on approximately 50%of the input bits. This is known as confusion and diffusion. It makes the task here (recovering just the first bit) just as hard as recovering the whole message.
            – MSalters
            yesterday




            "Designed to be as random as possible" is an understatement. Specifically, the design aims to have a maximally non-linear dependency, where each input bit influences approximately 50% of the output bits, and each output bit depends on approximately 50%of the input bits. This is known as confusion and diffusion. It makes the task here (recovering just the first bit) just as hard as recovering the whole message.
            – MSalters
            yesterday




            10




            10




            I think you can strengthen "unlikely" in this answer. The OP has zero chance of applying statistics-based techniques to predict any part of a SHA256 hash from content, or vice-versa, with even a detectable improvement over random guessing. The practical solution is basically to pre-calculate the whole target population of original content.
            – Neil Slater
            yesterday





            I think you can strengthen "unlikely" in this answer. The OP has zero chance of applying statistics-based techniques to predict any part of a SHA256 hash from content, or vice-versa, with even a detectable improvement over random guessing. The practical solution is basically to pre-calculate the whole target population of original content.
            – Neil Slater
            yesterday













            up vote
            41
            down vote













            This isn't really a stats answer, but:



            No, you can't determine the first character of the plaintext from the hash, because there's no such thing as "the plaintext" for a given hash.



            SHA-256 is a hashing algorithm. No matter what your plaintext, you get out a 32-byte signature, often expressed as a 64-character hex string. There are far more possible plaintexts than there are possible 64 character hex strings - the same hash can be generated from any number of different plaintexts. There's no reason to believe that the first character being/not being a '1' is uniform across all plaintexts producing a given hash.






            share|cite|improve this answer








            New contributor




            Chris H is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
            Check out our Code of Conduct.













            • 11




              This is so far (in my opinion) the only correct answer. All the other answers seem to deal more with the problem of learning the hash function rather than learning to invert the hash (the actual question). They all seem to ignore that hashing its not an injective function.
              – Luca Citi
              yesterday






            • 5




              Could you not predict the probability that the first char is a one? Lack of one-to-one-ness is not an uncommon problem in statistical learning.
              – Matthew Drury
              yesterday







            • 8




              @MatthewDrury Given that SHA256 is designed to make all inputs equally likely for a given hash, there will hopefully be infinitely many inputs starting with 1, for any given hash. So if you want to estimate the probability, your best estimate is going to be roughly $frac1256pmvarepsilon$.
              – Konrad Rudolph
              yesterday







            • 8




              Yup, agree. I just wanted to note that the lack of injectivity is not really a structural problem with the application of machine learning.
              – Matthew Drury
              yesterday






            • 4




              @IMil The reason I mentioned specifically the fact it's a hash function isn't to suggest no hash function could possibly ever reveal that information, but to motivate the statement that there's no such thing as "the plaintext". Sure, a (bad) hash function could be partially reversible and clearly tell us something about the whole set of plaintexts which would produce it, but there's no reason to believe that of SHA-256.
              – Chris H
              23 hours ago















            up vote
            41
            down vote













            This isn't really a stats answer, but:



            No, you can't determine the first character of the plaintext from the hash, because there's no such thing as "the plaintext" for a given hash.



            SHA-256 is a hashing algorithm. No matter what your plaintext, you get out a 32-byte signature, often expressed as a 64-character hex string. There are far more possible plaintexts than there are possible 64 character hex strings - the same hash can be generated from any number of different plaintexts. There's no reason to believe that the first character being/not being a '1' is uniform across all plaintexts producing a given hash.






            share|cite|improve this answer








            New contributor




            Chris H is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
            Check out our Code of Conduct.













            • 11




              This is so far (in my opinion) the only correct answer. All the other answers seem to deal more with the problem of learning the hash function rather than learning to invert the hash (the actual question). They all seem to ignore that hashing its not an injective function.
              – Luca Citi
              yesterday






            • 5




              Could you not predict the probability that the first char is a one? Lack of one-to-one-ness is not an uncommon problem in statistical learning.
              – Matthew Drury
              yesterday







            • 8




              @MatthewDrury Given that SHA256 is designed to make all inputs equally likely for a given hash, there will hopefully be infinitely many inputs starting with 1, for any given hash. So if you want to estimate the probability, your best estimate is going to be roughly $frac1256pmvarepsilon$.
              – Konrad Rudolph
              yesterday







            • 8




              Yup, agree. I just wanted to note that the lack of injectivity is not really a structural problem with the application of machine learning.
              – Matthew Drury
              yesterday






            • 4




              @IMil The reason I mentioned specifically the fact it's a hash function isn't to suggest no hash function could possibly ever reveal that information, but to motivate the statement that there's no such thing as "the plaintext". Sure, a (bad) hash function could be partially reversible and clearly tell us something about the whole set of plaintexts which would produce it, but there's no reason to believe that of SHA-256.
              – Chris H
              23 hours ago













            up vote
            32
            down vote













            Regardless if this is "Possible", what algorithm would be the best approach?



            Sorry, but that's a nonsensical question. If something is impossible, then you can't search for the best approach to the problem.



In this case, it should definitely be impossible, because hashing is a one-way function: several inputs (infinitely many, in fact) can produce the same output. If the first bit of the input on its own somehow influenced the probability of a specific hash value, that would mean the hash algorithm is completely flawed.

You certainly can train a neural network, linear classifier, SVM and whatnot to attempt the prediction. And if you were able to reliably predict the input from the output for a certain hashing algorithm, that would prove the algorithm is worthless. I'd say that for a widely used algorithm like SHA256 the chance of this is vanishingly small. However, it's a reasonable approach for quickly ruling out new, unproven and untested hashing algorithms.
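
To illustrate how cheaply such an attempt can be set up, here is a rough sketch of mine (not IMil's code; it assumes Julia with the SHA and Random standard packages): a from-scratch logistic regression on the 256 digest bits, trained by plain gradient descent. On held-out data its accuracy hovers around 0.5, i.e. chance, as expected.

using SHA
using Random: randstring

# Unpack a digest into 256 features, one per bit.
bits(s) = Float64[(b >> k) & 0x01 for b in sha256(s) for k in 0:7]

function make_data(n)
    y = rand(Bool, n)   # label: does the plaintext start with '1'?
    cols = [bits((y[i] ? "1" : "x") * randstring(16)) for i in 1:n]
    Matrix(reduce(hcat, cols)'), Float64.(y)   # n×256 feature matrix, n labels
end

sigmoid(z) = 1 / (1 + exp(-z))

function train(X, y; epochs=100, lr=0.5)
    w, b = zeros(size(X, 2)), 0.0
    for _ in 1:epochs
        p = sigmoid.(X * w .+ b)                     # predicted P(starts with '1')
        w .-= lr .* (X' * (p .- y)) ./ length(y)     # log-loss gradient step
        b -= lr * sum(p .- y) / length(y)
    end
    w, b
end

X, y = make_data(20_000)
w, b = train(X, y)
Xt, yt = make_data(5_000)
println("held-out accuracy: ",
        sum((sigmoid.(Xt * w .+ b) .> 0.5) .== (yt .> 0.5)) / length(yt))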






answered 2 days ago – IMil













            • 6




It’s (almost) irrelevant that a hash is a one-way function: Off the top of my head I can think of lots of one-way functions that nevertheless allow me to infer features of the original input ($\operatorname{sign}(x)$, famously a non-linear one-way function, tells me a lot about the input).
              – Konrad Rudolph
              yesterday







            • 7




              @KonradRudolph: "One-way function" has a specific meaning in this context that is not the meaning you're thinking of. sign(x) is not a one-way function in this sense, because finding preimages is trivial.
              – user2357112
              yesterday






            • 3




              That said, I don't think the answer is using "one-way function" correctly either.
              – user2357112
              yesterday






            • 1




              @user2357112 Thanks, I didn't know that. I only knew the meaning as a function that's surjective but not bijective. That's also the definition given in the answer, which is what I objected to.
              – Konrad Rudolph
              yesterday











• Yeah, sorry, I'm a bit lax with definitions. However, I believe that 'one-way' is more understandable to novices than the stricter terms.
              – IMil
              yesterday














            up vote
            18
            down vote













While one can't prove a negative with an example, I still feel an example would be suggestive, and perhaps useful. It also shows how one would (attempt to) solve similar problems.



For the case of "I want to make binary predictions, using features that are binary vectors", a Random Forest is a solid choice. I guess this kind of answers the second part of your question: what is a good algorithm.



We will want to preprocess the SHA256 strings into binary (Boolean) vectors: each bit is statistically independent of the others, so each bit makes a good feature. Our inputs will thus be 256-element Boolean vectors.



            Demo



            Here is a demonstration of how the whole thing can be done using the Julia DecisionTree.jl library.



You can copy and paste the code below into the Julia prompt.



using SHA
using DecisionTree
using Statistics: mean
using Random: randstring

const maxlen = 10_000  # longest string (document) to be hashed

gen_plaintext(x) = gen_plaintext(Val{x}())
gen_plaintext(::Val{true}) = "1" * randstring(rand(0:maxlen-1))
gen_plaintext(::Val{false}) = randstring(rand(1:maxlen))

# Unpack bytes (or vectors of bytes) into bit features.
bitvector(x) = BitVector(digits(x, base=2, pad=8sizeof(x)))
bitvector(x::AbstractVector) = reduce(vcat, bitvector.(x))

function gen_observation(class)
    plaintext = gen_plaintext(class)
    obs = bitvector(sha256(plaintext))
    obs
end

function feature_mat(obs)
    convert(Array, reduce(hcat, obs)')
end

########################################

const train_labels = rand(Bool, 100_000)
const train_obs = gen_observation.(train_labels)
const train_feature_mat = feature_mat(train_obs)

const test_labels = rand(Bool, 100_000)
const test_obs = gen_observation.(test_labels)
const test_feature_mat = feature_mat(test_obs)

# Train the model
const model = build_forest(train_labels, train_feature_mat)
@show model

# Training set accuracy:
@show mean(apply_forest(model, train_feature_mat) .== train_labels)

# Test set accuracy:
@show mean(apply_forest(model, test_feature_mat) .== test_labels)


            Results



I trained on 100,000 random ASCII strings of length up to 10,000. Here are the results I saw:



            Train the model



            julia> const model = build_forest(train_labels, train_feature_mat)
            Ensemble of Decision Trees
            Trees: 10
            Avg Leaves: 16124.7
            Avg Depth: 17.9


            Training Set accuracy:



            julia> mean(apply_forest(model, train_feature_mat) .== train_labels)
            0.95162


            Test Set accuracy:



            julia> mean(apply_forest(model, test_feature_mat) .== test_labels)
            0.5016


            Discussion



So that is basically nothing: we went from 95% on the training set to barely over 50% on the test set. One could apply a proper hypothesis test to see whether we can reject the null hypothesis, but I am pretty certain we can't. It is a tiny improvement over the guess rate.
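
For concreteness, here is that back-of-the-envelope check (my addition, not part of the original demo): under the null hypothesis that the model is guessing, accuracy on $n$ test cases has mean 0.5 and standard error $\sqrt{0.25/n}$.

n   = 100_000
acc = 0.5016
z   = (acc - 0.5) / sqrt(0.25 / n)
println("z = ", z)   # ≈ 1.01, one-sided p ≈ 0.16

A z-score of about 1.01 is nowhere near any conventional significance threshold, so we indeed cannot reject the hypothesis that the model is guessing.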



That suggests it can't be learned: the Random Forest went from well fitted to hitting just the guess rate. Random Forests are pretty capable of learning difficult inputs, so if there were something to learn, I would expect at least a few percent above chance.



You can play around with different hash functions by changing the code, which could be interesting. I got basically the same results when using Julia's built-in hash function (which is not a cryptographically secure hash, but is still a good hash, so it should indeed send similar strings apart). I also got basically the same results for CRC32c.






answered yesterday, edited 22 hours ago – Lyndon White

                    up vote
                    10
                    down vote













Hash functions are (by design) extremely badly suited for doing any machine learning with them.



ML is essentially a family of methods for modelling / estimating locally continuous functions. That is, you're trying to describe some system that, while it may have certain discontinuities, is smooth enough over most of the parameter space that a scattered sample of test data can be used to predict the result for other inputs. To do that, the algorithms need to decompose the data into a clever basis representation, for which training has suggested that, e.g., if you see such-and-such a shape (which appears to correlate with the result of such-and-such a convolution), then there's a good chance the output should have, in the corresponding region, such-and-such a structure (which can again be described by a convolution or something similar).



                    (I know, many ML approaches aren't like convolution at all, but the general idea is always the same: you have some input space that's so high dimensional it's impossible to sample exhaustively, so you find a clever decomposition that allows you to extrapolate results from a comparatively sparse sample.)



The idea behind a cryptographic hash function, however, is that any change to the plaintext should result in a completely different digest. So no matter how you decompose the function, local estimators will not allow you to extrapolate how small fluctuations in the input influence the result. Unless, of course, you actually process all of the information of a limited set; but this wouldn't be called machine learning, you'd just be building a rainbow table.
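
A tiny sketch of mine to make that concrete (assuming Julia with the SHA standard package): flip a single bit of the input and count how many of the 256 digest bits change. For a hash with a good avalanche property the count should hover around 128, i.e. the new digest is statistically unrelated to the old one.

using SHA

# Number of differing bits between two equal-length byte vectors.
hamming(a, b) = sum(count_ones, a .⊻ b)

msg     = Vector{UInt8}("The quick brown fox")
flipped = copy(msg)
flipped[1] ⊻= 0x01   # flip one bit in the first byte

println(hamming(sha256(msg), sha256(flipped)), " of 256 digest bits differ")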






answered yesterday – leftaroundabout
















                    • 3




                      Reading this, it occurred to me that a) to build a rainbow table, you need to know what hash function to use, and b) it might be possible for a machine learning algorithm to identify which algorithm is in use, given a large enough sample of inputs and outputs (at least if the algorithm has identifiable flaws). So if the original problem were restated to be about an unknown hash function that needs to be identified, it might be more practically interesting.
                      – senderle
                      yesterday


















                    up vote
                    5
                    down vote













                    This is an interesting question because it raises issues about what counts as "machine learning." There is certainly an algorithm that will eventually solve this problem if it can be solved. It goes like this:



                    1. Pick your favorite programming language, and decide on an encoding that maps every string to a (potentially very large) integer.


                    2. Pick a random number and convert it into a string. Check to see if it's a valid program in your language. If it's not, pick another number and try again. If it is, start it, immediately pause it, and add it to a list of paused programs.


                    3. Run all the paused programs for a little while. If any of them halt without producing an adequate solution, remove them from the list. If one produces an adequate solution, you're done! Otherwise, return to 2 after letting them all run for a bit.


                    There's no question that if you have infinite storage and infinite time, the above algorithm will eventually find a good solution. But that's probably not what you mean by "machine learning."



Here's the rub: if you consider all possible problems, no machine learning algorithm can do better than the brute-force procedure above on average! This is known as the no free lunch theorem. It proves that among all the possible problems you could throw at any given machine learning algorithm, the number it can solve quickly is vanishingly small.



                    It can solve those problems quickly only because they are governed by patterns that the algorithm can anticipate. For example, many successful algorithms assume the following:



                    1. Solutions can be described by some complex series of matrix multiplications and nonlinear distortions, governed by a set of parameters.


                    2. Good solutions will be clustered together in parameter space, so that all you have to do is pick a search neighborhood, find the best solution there, shift your search neighborhood so that the best solution is in the center, and repeat.


Obviously, neither of these assumptions holds in general. The second is particularly suspect. And the no free lunch theorem tells us that these assumptions don't even hold most of the time; in fact they almost never hold! It's just our good fortune that they do hold for certain problems that actually matter.



                    The problem you've chosen is designed from the beginning to violate assumption 2. Hash functions are specifically designed so that similar inputs give completely different outputs.



                    So your question—what is the best machine learning algorithm for solving this problem?—probably has a very straightforward answer: random search.
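
To make that concrete, here is a toy random search of my own (not part of the original answer; it assumes Julia with the SHA and Random standard packages): sample random linear predictors over the 256 digest bits, keep whichever scores best on one sample, and then evaluate it on fresh data. The winner's apparent edge is pure selection luck and evaporates out of sample, exactly as the argument above predicts.

using SHA
using Random: randstring

bits(s) = [(b >> k) & 0x01 for b in sha256(s) for k in 0:7]

function make_sample(n)
    y = rand(Bool, n)   # label: does the plaintext start with '1'?
    X = [bits((y[i] ? "1" : "x") * randstring(8)) for i in 1:n]
    X, y
end

predict(w, x) = sum(w .* x) > 0
accuracy(w, X, y) = sum(predict(w, X[i]) == y[i] for i in eachindex(y)) / length(y)

function random_search(X, y; tries=500)
    best_w, best_acc = randn(256), 0.0
    for _ in 1:tries
        w = randn(256)
        a = accuracy(w, X, y)
        if a > best_acc
            best_w, best_acc = w, a
        end
    end
    best_w, best_acc
end

Xv, yv = make_sample(2_000)
w, selection_acc = random_search(Xv, yv)
Xt, yt = make_sample(2_000)
println("accuracy while selecting: ", selection_acc)       # a bit above 0.5 by luck
println("accuracy on fresh data:   ", accuracy(w, Xt, yt)) # back to ≈ 0.5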


























• I wonder how quantum computing will affect the no-free-lunch theorem. Presumably, quantum computing is constrained by it, too.
  – Max Vernon
  yesterday







                    • 1




                      @MaxVernon oh, interesting. I would expect that all quantum algorithms have the same property when compared to other quantum algorithms. I do not know whether all quantum optimization algorithms have an average-case speedup over classical ones. They might! I have a question and self-answer that talks about a "free lunch" theorem that could be relevant. (tldr; the lunch is only free if you ignore some of the work done... but I wonder if that changes in the quantum case.)
                      – senderle
                      yesterday














                    up vote
                    5
                    down vote













It is next to impossible. However, people have observed some patterns in SHA256 which might suggest its non-randomness: A distinguisher for SHA256 using Bitcoin (mining faster along the way). Their tldr:



"To distinguish between an ideal random permutation hash and SHA256, hash a large amount (~2^80) of candidate 1024-bit blocks twice, as done in Bitcoin. Ensure that the bits of the candidate blocks are sparsely set (much fewer than the 512 mean expected), according to the Bitcoin protocol, discarding candidate blocks that do not meet the Bitcoin “difficulty” standard (where the resultant hashes start with a large number of 0’s). With the remaining set of valid input candidates (467369 when this analysis was done), observe a particular set of 32 bits in the input block (located where Bitcoin has the nonce, input bits 607-639). Note that the mean number of bits set in the nonce field is skewed to the left, i.e. fewer than the expected value of 16 bits set (estimated mean 15.428)."



See the discussion on lobste.rs. One possible explanation is a bias introduced by the miners.
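
A minimal sketch of the bias check they describe, assuming the standard 80-byte Bitcoin block header layout with the nonce in bytes 76-80; reproducing the reported 15.428 would require a corpus of real winning headers, so the smoke test below just uses random bytes.

import os

def nonce_popcount_mean(headers):
    """Mean number of set bits in the 32-bit nonce field (bytes 76..80)
    of each 80-byte block header."""
    counts = [bin(int.from_bytes(h[76:80], "little")).count("1")
              for h in headers]
    return sum(counts) / len(counts)

# Smoke test on uniformly random "headers": the mean should hover near
# 16.0, unlike the 15.428 estimated from the 467369 real blocks.
fake_headers = [os.urandom(80) for _ in range(100000)]
print(nonce_popcount_mean(fake_headers))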






answered yesterday

IndieSolver

New contributor

IndieSolver is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.

                    • 2




                      This is interesting. But the reply on lobste.rs is probably right. That's a huge bias, easily discoverable. The notion that it has gone unnoticed for this long is pretty far-fetched.
                      – senderle
                      yesterday






                    • 1




                      @senderle In order to exploit the bias (if any), one should come up with an algorithm (essentially a ML/optimization algorithm) which is computationally cheap so that its own overhead when implemented/measured on state-of-the-art hardware is compensated by the speedup it provides. My very rough guess would be that the factor in terms of #hashtrials should be greater than 10x to beat brute force and its superoptimized implementations. The implications might be very serious, especially for people betting on crypto and security protocols.
                      – IndieSolver
                      yesterday














                    up vote
                    2
                    down vote













                    I'll answer with a program. To reduce computational requirements I'll use a variant of sha256 I call sha16, which is just the first 16 bits of sha256.



#!/usr/bin/python3

import hashlib
from itertools import count

def sha16(plaintext):
    h = hashlib.sha256()
    h.update(plaintext)
    return h.hexdigest()[:4]

def has_plaintext_start_with_1(digest):
    """Return True if and only if the given digest can be generated from a
    plaintext whose first bit is 1."""
    return True

def plaintext_starting_with_1(digest):
    """Return a plaintext whose first bit is 1 matching the given digest."""
    for c in count():
        # b'\x80' = 0b10000000, so the plaintext's first bit is 1.
        plaintext = b'\x80' + str(c).encode('ascii')
        d = sha16(plaintext)
        if d == digest:
            return plaintext

for digest in range(0x10000):
    digest = "%04x" % (digest,)
    plain = plaintext_starting_with_1(digest)
    print("%s hashes to %s" % (plain, digest))


                    This produces the output:



b'\x8094207' hashes to 0000
b'\x8047770' hashes to 0001
b'\x8078597' hashes to 0002
b'\x8025129' hashes to 0003
b'\x8055307' hashes to 0004
b'\x80120019' hashes to 0005
b'\x8062700' hashes to 0006
b'\x8036411' hashes to 0007
b'\x80135953' hashes to 0008
b'\x8044091' hashes to 0009
b'\x808968' hashes to 000a
b'\x8039318' hashes to 000b
[...]


                    I'll leave the full proof as an exercise for the reader, but take my word for it: there's an input that starts with a "1" for each possible digest from 0000 to ffff.



                    There's also an input that doesn't start with "1". And there's one that starts with the complete works of Shakespeare, too.



                    This holds for any reasonably good hash function, although my brute force proof may become computationally infeasible.






answered 9 hours ago

Phil Frost

New contributor

Phil Frost is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.

                            up vote
                            1
                            down vote













                            Hashing functions are purposefully designed to be difficult to model, so (as pointed out already) this is likely to be very difficult. Nevertheless, any weakness in the hashing function will reduce its entropy, making it more predictable.




                            Regardless if this is "Possible", what algorithm would be the best approach?




A useful example is the Physically Unclonable Function, or PUF - which is analogous to a hardware hashing function. Typically, manufacturing variations are purposefully used to give each PUF a slightly different response, so that each one's 'hashed' output is different for a given input. Design weaknesses limit the entropy, however, and given enough challenge-response pairs it is often possible to build a black-box model of the PUF so that the response for a new, previously unseen challenge can be predicted.



                            Logistic regression is the most commonly used approach for these modelling attacks, such as in this paper by Rührmair.



                            Genetic algorithms (or more generally evolutionary strategies) may be an alternative approach, as they are applicable to problems that are not differentiable and/or linearly separable. They are also discussed in the above paper.
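
A minimal sketch of such a modelling attack, assuming the standard linear additive delay model of a 64-stage arbiter PUF and the parity-feature transform used in that literature; the sizes and the simulated weights are illustrative only, not data from the paper.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_stages, n_crps = 64, 20000

w = rng.normal(size=n_stages + 1)                   # hidden delay parameters
challenges = rng.integers(0, 2, (n_crps, n_stages))

def parity_features(c):
    """Map 0/1 challenges to +/-1 parity features Phi_i = prod_{j>=i}(1-2c_j)."""
    phi = np.cumprod((1 - 2 * c)[:, ::-1], axis=1)[:, ::-1]
    return np.hstack([phi, np.ones((len(c), 1))])

X = parity_features(challenges)
y = (X @ w > 0).astype(int)                         # simulated PUF responses

model = LogisticRegression(max_iter=1000).fit(X[:15000], y[:15000])
print(model.score(X[15000:], y[15000:]))            # typically ~1.0: PUF cloned

The point of the parity transform is that it makes the PUF linearly separable in feature space, which is exactly the kind of exploitable structure that a cryptographic hash is designed not to have.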






answered yesterday

4Oh4

                                    up vote
                                    1
                                    down vote













                                    What you describe is basically a pre-image attack. You're trying to find an input such that, when it is hashed, the output has some property like "a leading 1".*



                                    It is an explicit goal of cryptographic hashes to prevent such pre-image attacks. If you can make such an attack, we tend to consider that algorithm to be insecure and stop using it.



So while that means it's not impossible, it does mean your machine learning algorithm would have to simultaneously outwit a large fraction of the world's mathematicians and their supercomputers. It is unlikely that you will do so.



                                    However, if you did, you would become known as someone who broke a major cryptographic hash algorithm. That fame is worth something!



*Technically a "first preimage attack" tries to find a match for a specific hash. However, to show that a hash algorithm has first preimage attack resistance, they typically show that you can't find any meaningful information about the input from the hash.






answered 2 hours ago

Cort Ammon

                                            up vote
                                            0
                                            down vote













Almost all the answers here are telling you why you can't do this, but here's the direct answer to:




                                            Regardless if this is "Possible", what algorithm would be the best approach?




                                            Assuming the input is sufficiently large:



                                            1. Take the count of the set of valid characters.

                                            2. Take the reciprocal of the number from step 1.

That's the probability that the input string starts with '1'. You don't even need to look at the hash. If you can do better than that, it would mean the hash is very broken. You can save a lot of CPU cycles over trying to train an algorithm to pick random numbers.



You could train an algorithm, and because of overfitting it might come up with a different answer. But unless there's something really wrong with the hash algorithm, using that model will go wrong more often than if you simply pick a random value.
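
A minimal sketch of that baseline, assuming binary plaintexts so the set of valid first characters is just {'0', '1'}; the helper name is illustrative, not from the original answer.

import hashlib
import random

def baseline_predict(hex_digest, charset_size=2):
    """Ignore the hash entirely; guess 'starts with 1' with probability 1/charset_size."""
    return random.random() < 1 / charset_size

# Evaluate against freshly generated (plaintext, hash) pairs.
trials = 10000
hits = 0
for _ in range(trials):
    plaintext = format(random.getrandbits(64), '064b')   # random 64-bit string
    digest = hashlib.sha256(plaintext.encode()).hexdigest()
    hits += baseline_predict(digest) == plaintext.startswith('1')
print(hits / trials)  # ~0.5; a model that reliably beats this has broken SHA256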






answered 11 hours ago

JimmyJames

                                                    John is a new contributor. Be nice, and check out our Code of Conduct.