Intersection of rows from text files using python

I have a number of text files which contain both text and numerical entries. More specifically, the first two rows and the first column of each of these files are text, and all other entries are numbers. As an example, let's consider three separate text files (files A, B and C) in the formats shown in the pictures below.



(picture of file A)

(picture of file B)

(picture of file C)



Text files for the above pictures can be found here (file_A), here (file_B) and here (file_C).



I want to use python to create a separate text file which contains only those rows (starting from row 3) which are present in all three files (i.e. the intersection of rows). This final text file should keep the relevant columns from all three files. The following is a picture of the final file that I want: (picture of the final file). (A text file for this final file can be found here.)



I realize that I can use pandas to import each of the mentioned text files using a command like the following:



df = pd.read_csv('filename.txt', sep='\t', header=[0,1], index_col=0)
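For concreteness, loading the three example files this way might look like the following sketch (the .txt file names are assumptions based on the naming in the post; the actual files are only linked above):

import pandas as pd

# hypothetical file names, following the naming used in the post
A = pd.read_csv('file_A.txt', sep='\t', header=[0,1], index_col=0)
B = pd.read_csv('file_B.txt', sep='\t', header=[0,1], index_col=0)
C = pd.read_csv('file_C.txt', sep='\t', header=[0,1], index_col=0)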


How can I use python to make this final file once I have files A, B and C? I would like to use python 2.7 for this.










python python-2.7 pandas






asked 1 hour ago by Ling Guo
  • So concatenating didn't help? i.e. pd.concat([A,B,C], axis=1).dropna(), with dropna to remove the rows with missing values.
    – Dark
    1 hour ago
2 Answers






Use pd.merge() -



import pandas as pd
from functools import reduce  # reduce is also a builtin in Python 2.7
dfs = [df1, df2, df3]
# successive inner merges on the index keep only the rows present in every frame
df_final = reduce(lambda left, right: pd.merge(left, right, left_index=True, right_index=True), dfs)


OR



df1.join(df2, how='inner').join(df3, how='inner')  # chained inner joins on the index


This gives you a foolproof way of joining by keys -



Type        A1     A2     A3     A4     A5     B1     B2     C1     C2     C3     C4
Tag      TagA1  TagA1  TagA1  TagA2  TagA2  TagB1  TagB2  TagC1  TagC1  TagC2  TagC2
object1    1.1    2.1    3.1    4.1    5.1   11.1   12.1   21.1   22.1   23.1   24.1
object4    1.4    2.4    3.4    4.4    5.4   11.4   12.4   21.4   22.4   23.4   24.4
object7    1.7    2.7    3.7    4.7    5.7   11.7   12.7   21.7   22.7   23.7   24.7
object9    1.9    2.9    3.9    4.9    5.9   11.9   12.9   21.9   22.9   23.9   24.9


Timings



@Dark: 2.63 ms ± 598 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

@Vivek: 2.9 ms ± 252 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

@Vivek (2nd, using join): 5.4 ms ± 662 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


Since there is no hassle of matching keys, @Dark's solution is slightly more efficient. But since the OP asked for only those rows (starting from row 3) which are present in all three files, I went the merge way rather than the concat way.
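For completeness, a minimal sketch of writing the merged result back out as a tab-separated text file, which is the final artifact the OP asked for (the output name final.txt is an assumption):

# with header=[0,1] columns, to_csv writes both header rows plus the index column
df_final.to_csv('final.txt', sep='\t')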






answered 1 hour ago, edited 58 mins ago by Vivek Kalyanarangan
Use concat and dropna. dropna will remove the rows with missing values, so you will be left with only the rows that are found in all the files, i.e.

# the outer concat aligns rows on the index; dropna then keeps only complete rows
df_m = pd.concat([A, B, C], axis=1).dropna()

Type        A1     A2     A3     A4     A5     B1     B2     C1     C2     C3     C4
Tag      TagA1  TagA1  TagA1  TagA2  TagA2  TagB1  TagB2  TagC1  TagC1  TagC2  TagC2
object1    1.1    2.1    3.1    4.1    5.1   11.1   12.1   21.1   22.1   23.1   24.1
object4    1.4    2.4    3.4    4.4    5.4   11.4   12.4   21.4   22.4   23.4   24.4
object7    1.7    2.7    3.7    4.7    5.7   11.7   12.7   21.7   22.7   23.7   24.7
object9    1.9    2.9    3.9    4.9    5.9   11.9   12.9   21.9   22.9   23.9   24.9

This solution holds good if you can confirm that there are no NaNs in any of the dataframes. If the data may contain NaNs, you can go for @Vivek Kalyanarangan's solution instead.






answered 1 hour ago, edited 18 mins ago by Dark
  • Don't you think the dropna() is a little risky? What if the Type/Tag is common but the data contains NaN? I think in this case the OP would like to retain those rows.
    – Vivek Kalyanarangan
    53 mins ago

  • @VivekKalyanarangan I didn't think of that. Well, this is what I could come up with when I saw the question :). Perhaps filling the NaNs of each dataframe with some arbitrary value and then dropping the NaNs would be the solution.
    – Dark
    23 mins ago
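A minimal sketch of that fill-then-drop idea from the comment, assuming A, B and C are the dataframes loaded earlier (the sentinel value -999 is an arbitrary assumption; any value that cannot occur in the real data works):

import numpy as np
import pandas as pd

sentinel = -999  # hypothetical placeholder that must not appear in the data
# Filling genuine NaNs first means dropna() only removes rows that are
# absent from one of the files, not rows whose data happens to contain NaN.
df_m = pd.concat([df.fillna(sentinel) for df in (A, B, C)], axis=1).dropna()
df_m = df_m.replace(sentinel, np.nan)  # restore the original NaNs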














