Detect file type using file signatures

My program has to detect the real file type of a given file using signatures. For now I'm just checking JPG, but I want to add more.



    Dim files() As String = IO.Directory.GetFiles(pictures)
    Dim file_data As Byte()

    Dim jpg_file_extension() As Byte = {&HFF, &HD8, &HFF}
    Dim office_file_extension() As Byte = {&H50, &H4B, &H3, &H4, &H14, &H0, &H6, &H0}

    Dim check As Integer = 0

    For Each file As String In files
        file_data = IO.File.ReadAllBytes(file)
        If file_data.Length > 2 Then
            check = 0 ' reset the match counter for each file
            For i = 0 To jpg_file_extension.Length - 1
                If file_data(i) = jpg_file_extension(i) Then
                    check += 1
                Else
                    check = 0
                    Exit For
                End If
            Next
            ' A match means every signature byte was equal
            If check = jpg_file_extension.Length Then
                MsgBox(file.Split("\"c).Last & ": its jpg")
            End If
        End If
    Next


The code looks a bit messy right now and it only checks one file type. My questions are:

  1. How can I improve this code to make it cleaner and more efficient?

  2. Is there a way to implement this code in such a way that I can have a function, give it the file data and check if the signature is "whitelisted"?









Tags: file, vb.net

asked 5 hours ago by Milton Cardoso (new contributor), edited 1 hour ago by 200_success

2 Answers

Your first step is to wind your thinking back a few steps and re-approach your code with a fresh line of thinking. You say you want "to detect the real file type of a given file", but you have written code to detect a JPEG(*) file.

There is a subtlety here, and once you have mastered it you can approach complex problems with more confidence. You want a generic approach, but your thinking at the moment is constrained to and focussed on a particular example, so your solution is tailored to that example. More specifically, your current code answers the question "Is this a JPEG file?", whereas you want your solution to answer the question "What is the file type of this file?".

Signatures

You define your signatures early. This is a good approach because it lends itself to a future implementation where you can import a tailored list of signatures.

However, you are currently using separate arrays to store the signature data. The use of multiple arrays is going to be inefficient for any improvements or even for checking multiple files.

The use of separate static arrays implies looping through all arrays. In a small implementation this is not that noticeable, but if you have a hundred arrays with a size ranging from 3 to 15 bytes, you will start to notice a performance hit. Basically, you will continue to check arrays that you have already eliminated as irrelevant to your quest.

A suggested way to improve the performance initially is to put the signatures in a collection (e.g. List(Of List(Of Byte))). This way, once you eliminate a signature for a given file you can remove it from that file's working copy of the collection, thus quickly removing the unnecessary checks with a commensurate improvement in performance.

The use of the inner collection removes the need to check array lengths, but having a List(Of Byte()) could also work.
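
As a minimal sketch of that idea (certainly not the only way to do it), the signatures could live in one list of name/bytes pairs, with a function that returns every type whose signature the data starts with. The FileSignature class, the SignatureTable module and the MatchingTypes and IsWhitelisted functions are hypothetical names, and the "Office (ZIP-based)" label is only a guess at what the second array in the question is meant to identify:

```vb
Imports System.Collections.Generic

Module SignatureTable
    ' Hypothetical helper type: a display name plus the leading bytes to look for.
    Public Class FileSignature
        Public ReadOnly Name As String
        Public ReadOnly Bytes As Byte()

        Public Sub New(typeName As String, signatureBytes As Byte())
            Name = typeName
            Bytes = signatureBytes
        End Sub
    End Class

    ' The two signatures from the question, held in a single collection.
    Public ReadOnly KnownSignatures As New List(Of FileSignature) From {
        New FileSignature("JPEG", New Byte() {&HFF, &HD8, &HFF}),
        New FileSignature("Office (ZIP-based)", New Byte() {&H50, &H4B, &H3, &H4, &H14, &H0, &H6, &H0})
    }

    ' True when data begins with every byte of signature.
    Private Function StartsWith(data As Byte(), signature As Byte()) As Boolean
        If data.Length < signature.Length Then Return False
        For i As Integer = 0 To signature.Length - 1
            If data(i) <> signature(i) Then Return False
        Next
        Return True
    End Function

    ' Names of all known types whose signature matches the start of data.
    Public Function MatchingTypes(data As Byte()) As List(Of String)
        Dim matches As New List(Of String)
        For Each candidate As FileSignature In KnownSignatures
            If StartsWith(data, candidate.Bytes) Then matches.Add(candidate.Name)
        Next
        Return matches
    End Function

    ' True when at least one known ("whitelisted") signature matches.
    Public Function IsWhitelisted(data As Byte()) As Boolean
        Return MatchingTypes(data).Count > 0
    End Function
End Module
```

Calling MatchingTypes(IO.File.ReadAllBytes(file)) would then return an empty list for an unknown file and one or more names otherwise. Adding a new type becomes a one-line change to KnownSignatures rather than a new loop.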



Looping

You manually loop through your array. This is always a simple first approach and reflects the basic solution to identifying a signature. Your code is set up to first loop through the first signature and I assume you were thinking of duplicating this kind of loop for the other signatures.

Sitting here, I can think of two simple approaches:

• Looping through the file bytes individually, removing signatures from the collection as they fail

• Looping through the signatures and doing an array check against the first x bytes of each file

Intuitively, I think the second option will be less efficient but I could be wrong.

Some example code for the second option (not guaranteed to be compilable):



    ' Assumes MasterSignatureList is a List(Of Byte()) declared elsewhere
    ' and that System.Linq is imported (for Take and SequenceEqual).
    For Each file As String In files
        Dim file_data As Byte() = IO.File.ReadAllBytes(file)
        ' Work on a per-file copy so that eliminations for one file do not affect the next
        Dim candidates As New List(Of Byte())(MasterSignatureList)
        ' Used a For loop going backwards because in this example we are going to remove elements from the collection
        For signatureIterator As Integer = candidates.Count - 1 To 0 Step -1
            Dim signature As Byte() = candidates(signatureIterator) ' the shorter text makes my example easier to read.
            If file_data.Length < signature.Length Then
                candidates.RemoveAt(signatureIterator)
            ElseIf Not file_data.Take(signature.Length).SequenceEqual(signature) Then
                ' Compares only the first signature.Length bytes; file_data itself is left untouched
                candidates.RemoveAt(signatureIterator)
            End If
        Next signatureIterator
        ' **** do something here with the remaining candidates as these are the valid ones for that particular file!
    Next file


And an example for the first option:



    ' Assumes MasterSignatureList is a List(Of Byte()) declared elsewhere
    ' and that System.Linq is imported (for Where, Any and ToList).
    For Each file As String In files
        Dim file_data As Byte() = IO.File.ReadAllBytes(file)
        ' Per-file copy of the list, minus signatures that are longer than the file itself
        Dim candidates As List(Of Byte()) = MasterSignatureList.Where(Function(s) s.Length <= file_data.Length).ToList()
        For position As Integer = 0 To file_data.Length - 1 ' we should exit the loop before getting to the end of most files!
            Dim index As Integer = position ' local copy for use inside the lambdas
            ' Remove signatures that cover this position but do not match it;
            ' shorter signatures that have already passed are retained
            candidates.RemoveAll(Function(s) index < s.Length AndAlso s(index) <> file_data(index))
            ' Exit if nothing is left, or no remaining signature extends past this position
            If candidates.Count = 0 OrElse Not candidates.Any(Function(s) index + 1 < s.Length) Then Exit For
        Next position
        ' **** do something here with the remaining candidates as these are the valid ones for that particular file!
    Next file


In both of those examples, the signatures remaining in the list are the potential file types. In these examples, the possibility of multiple signatures passing is allowed; how you handle that is up to your programming logic.

As already noted, I have not tested the above code, so also check for the dreaded Jedi array error condition (off-by-1) in my iterations.

(*) The correct nomenclature is JPEG; the file extension in traditional 8.3 style is ".jpg". Why this is so, I leave up to your own research.






IO.File.ReadAllBytes(file) seems like overkill. Most file formats have signatures that appear within the first few kilobytes. There are, however, formats whose signature does not appear at the start of the file (e.g. TAR archives), as well as signatures with subtype information at discontinuous locations (e.g. DOS / Windows executables). Depending on how ambitious you want to be, you may need to generalize how the signatures are specified.
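
A rough sketch of both points, assuming a fixed-size header read is enough for the formats you care about. ReadHeader and MatchesAt are hypothetical helper names, not anything from the question; the offset parameter is one possible way to generalize the signature check:

```vb
Imports System
Imports System.IO

Module HeaderReader
    ' Reads at most maxBytes from the start of the file instead of the whole file.
    Public Function ReadHeader(filePath As String, maxBytes As Integer) As Byte()
        Using stream As New FileStream(filePath, FileMode.Open, FileAccess.Read)
            Dim buffer(maxBytes - 1) As Byte
            Dim total As Integer = 0
            ' Read may return fewer bytes than requested, so loop until done.
            While total < maxBytes
                Dim count As Integer = stream.Read(buffer, total, maxBytes - total)
                If count = 0 Then Exit While ' end of file reached
                total += count
            End While
            ' Trim the buffer when the file is shorter than maxBytes.
            If total < maxBytes Then Array.Resize(buffer, total)
            Return buffer
        End Using
    End Function

    ' True when data contains signature starting at the given offset.
    Public Function MatchesAt(data As Byte(), offset As Integer, signature As Byte()) As Boolean
        If offset < 0 OrElse data.Length - offset < signature.Length Then Return False
        For i As Integer = 0 To signature.Length - 1
            If data(offset + i) <> signature(i) Then Return False
        Next
        Return True
    End Function
End Module
```

For example, ReadHeader(file, 512) followed by MatchesAt(header, 257, System.Text.Encoding.ASCII.GetBytes("ustar")) would test for the "ustar" magic that TAR archives carry at offset 257, while MatchesAt(header, 0, ...) covers the ordinary start-of-file case. The 512-byte figure is just an assumption that comfortably covers that offset.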








                  In both of those examples, the signatures remaining the signature list are the potential file types. In these examples, the possibility of multiple signatures passing is allowed - how you handle that is up to your programming logic.



                  As already noted - I have not tested the above code, so also check for the dreaded Jedi array error condition (off-by-1) in my iterations.



                  (*) The correct nomenclature is JPEG, the file extension in traditional 8.3 style is ".jpg". Why this is so, I leave up to your own research.






                  share|improve this answer












                  Your first step is to wind your thinking back a few steps and re-approach your code with a fresh line of thinking. Looking at your code, you say "to detect the real file type of a given file" but you have written code to detect a JPEG(*) file.



                  There is a subtlety here, but once you have mastered that you can approach complex problems with more confidence. The subtlety is you want a generic approach, but your thinking at the moment is constrained to and focussed on a particular example - your solution is tailored to that example. More specifically, your current code answers the question "Is this a JPEG file?", you want your solution to answer the question "What is the file type of this file?".



                  Signatures



                  You define your signatures early. This is a good approach because it lends itself to a future implementation where you can import a tailored list of signatures.



However, you are currently using separate arrays to store the signature data. The use of multiple arrays is going to be inefficient for any improvements or even for checking multiple files.



                  The use of static arrays implies looping through all arrays. In a small implementation this is not that noticeable, but if you have a hundred arrays with a size ranging from 3 to 15 bytes, you will start to notice a performance hit. Basically, you will be continuing to check arrays that you have already eliminated as being relevant to your quest.



A suggested way to improve the performance initially is to put the signatures in a collection (e.g. List(Of List(Of Byte))). This way, once you eliminate a signature you can remove it from the collection, which avoids the unnecessary checks and gives a commensurate improvement in performance.



The use of the inner collection removes the need to check array lengths, but having a List(Of Byte()) could also work.
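As a rough illustration of keeping all signatures in one collection, the sketch below pairs each byte sequence with a label. The FileSignature class, the KnownSignatures list and the label strings are hypothetical names chosen for this example; the byte values are the two signatures from the question.

```vb
Imports System.Collections.Generic

Module SignatureTable
    ' Hypothetical container: a label plus the leading bytes that identify the format
    Public Class FileSignature
        Public ReadOnly Name As String
        Public ReadOnly Bytes As Byte()

        Public Sub New(name As String, bytes As Byte())
            Me.Name = name
            Me.Bytes = bytes
        End Sub
    End Class

    ' One list instead of one variable per format; supporting a new format is one more entry
    Public ReadOnly KnownSignatures As New List(Of FileSignature) From {
        New FileSignature("JPEG", New Byte() {&HFF, &HD8, &HFF}),
        New FileSignature("Office Open XML", New Byte() {&H50, &H4B, &H3, &H4, &H14, &H0, &H6, &H0})
    }
End Module
```

Keeping the label next to the bytes means that whatever survives the elimination can be reported directly as a file type.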



                  Looping



                  You manually loop through your array. This is always a simple first approach and reflects the basic solution to identifying a signature. Your code is set up to first loop through the first signature and I assume you were thinking of duplicating this kind of loop for the other signatures.



                  Sitting here, I can think of two simple approaches:



                  • Looping through the file bytes individually, removing signatures from the collection as they fail

                  • Looping through the signatures and doing an array check against the first x bytes of each file

                  Intuitively, I think the second option will be less efficient but I could be wrong.
                  Some example code (not guaranteed to be compilable):



For Each file As String In files
    file_data = IO.File.ReadAllBytes(file)
    ' Work on a per-file copy so that eliminations for one file do not affect the next file
    ' (assumes MasterSignatureList is a List(Of Byte()); declare and implement as required)
    Dim candidates = MasterSignatureList.ToList()
    ' Loop backwards (Step -1) because we remove elements from the collection as we go
    For signatureIterator = candidates.Count - 1 To 0 Step -1
        Dim signature = candidates(signatureIterator) ' the shorter text makes my example easier to read.
        If file_data.Length < signature.Length Then
            candidates.RemoveAt(signatureIterator)
        ElseIf Not file_data.Take(signature.Length).SequenceEqual(signature) Then
            ' Take and SequenceEqual (System.Linq) compare the leading bytes without resizing file_data
            candidates.RemoveAt(signatureIterator)
        End If
    Next signatureIterator
    ' **** do something here with the remaining candidates as these are the valid ones for that particular file!
Next file


                  And an example for the first option



For Each file As String In files
    file_data = IO.File.ReadAllBytes(file)
    ' Per-file working copy without the signatures that are longer than the file
    ' (assumes MasterSignatureList is a List(Of Byte()))
    Dim candidates = MasterSignatureList.Where(Function(s) s.Length <= file_data.Length).ToList()
    For signatureIterator = 0 To file_data.Length - 1 ' we should exit the loop before getting to the end of most files!
        Dim position = signatureIterator ' local copy so the lambdas do not capture the loop variable
        ' Remove signatures that fail at this byte; shorter signatures have already passed and are retained
        candidates.RemoveAll(Function(s) position < s.Length AndAlso file_data(position) <> s(position))
        ' Exit if nothing is left, or if every remaining signature has been checked in full
        If candidates.Count = 0 OrElse Not candidates.Any(Function(s) position + 1 < s.Length) Then Exit For
    Next signatureIterator
    ' **** do something here with the remaining candidates as these are the valid ones for that particular file!
Next file


In both of those examples, the signatures remaining in the signature list are the potential file types. These examples allow more than one signature to pass; how you handle that is up to your programming logic.
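To address the second question (a function that takes the file data and says whether its signature is "whitelisted"), one possible shape is sketched below. The module name, the Whitelist dictionary, and the MatchingTypes and IsWhitelisted functions are hypothetical names for this example; only the byte values come from the question.

```vb
Imports System.Collections.Generic
Imports System.Linq

Module SignatureCheck
    ' Hypothetical whitelist: label -> leading bytes (values taken from the question)
    Private ReadOnly Whitelist As New Dictionary(Of String, Byte()) From {
        {"JPEG", New Byte() {&HFF, &HD8, &HFF}},
        {"Office Open XML", New Byte() {&H50, &H4B, &H3, &H4, &H14, &H0, &H6, &H0}}
    }

    ' Returns the label of every whitelisted signature that the data starts with
    Public Function MatchingTypes(data As Byte()) As List(Of String)
        Dim matches As New List(Of String)
        For Each entry In Whitelist
            Dim signature = entry.Value
            If data.Length >= signature.Length AndAlso
               data.Take(signature.Length).SequenceEqual(signature) Then
                matches.Add(entry.Key)
            End If
        Next
        Return matches
    End Function

    ' True when at least one whitelisted signature matches
    Public Function IsWhitelisted(data As Byte()) As Boolean
        Return MatchingTypes(data).Count > 0
    End Function
End Module
```

A caller could then write something like If SignatureCheck.IsWhitelisted(file_data) Then ... inside the loop over files. Returning a list rather than a single label leaves room for the case where several signatures match.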



                  As already noted - I have not tested the above code, so also check for the dreaded Jedi array error condition (off-by-1) in my iterations.



                  (*) The correct nomenclature is JPEG, the file extension in traditional 8.3 style is ".jpg". Why this is so, I leave up to your own research.

















                  answered 1 hour ago









                  AJD














IO.File.ReadAllBytes(file) seems like overkill, since it loads the entire file just to inspect a handful of bytes. Most file formats have signatures that appear within the first few kilobytes. There are, however, formats whose signature does not appear at the start of the file (e.g. TAR archives, where the "ustar" magic sits at offset 257), as well as signatures with subtype information at discontinuous locations (e.g. DOS / Windows executables). Depending on how ambitious you want to be, you may need to generalize how the signatures are specified.
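A minimal sketch of both ideas, reading only the head of the file and allowing a signature to sit at an offset, might look like the following. ReadHeader and MatchesAt are hypothetical helper names, and the 512-byte budget in the usage note is an arbitrary assumption rather than a recommendation from this answer.

```vb
Imports System
Imports System.IO

Module HeaderReader
    ' Reads at most maxBytes from the start of the file instead of the whole file
    Public Function ReadHeader(path As String, maxBytes As Integer) As Byte()
        Using stream As New FileStream(path, FileMode.Open, FileAccess.Read)
            Dim buffer(maxBytes - 1) As Byte
            Dim total As Integer = 0
            ' Read may return fewer bytes than requested, so loop until full or end of file
            While total < maxBytes
                Dim bytesRead As Integer = stream.Read(buffer, total, maxBytes - total)
                If bytesRead = 0 Then Exit While
                total += bytesRead
            End While
            ' Shrink the buffer when the file is shorter than maxBytes
            Array.Resize(buffer, total)
            Return buffer
        End Using
    End Function

    ' True when data contains signature starting at the given offset
    Public Function MatchesAt(data As Byte(), offset As Integer, signature As Byte()) As Boolean
        If offset < 0 OrElse data.Length < offset + signature.Length Then Return False
        For i = 0 To signature.Length - 1
            If data(offset + i) <> signature(i) Then Return False
        Next
        Return True
    End Function
End Module
```

With these helpers, a TAR check could read ReadHeader(path, 512) and call MatchesAt(header, 257, New Byte() {&H75, &H73, &H74, &H61, &H72}), where the bytes spell "ustar", while the JPEG check becomes MatchesAt(header, 0, ...). Storing an offset next to each signature is one way to generalize how signatures are specified.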
















                          answered 1 hour ago









                          200_success




















