Best way to flatten dataframe based on values on column

I have to process a dataframe with several hundred thousand rows, but it can be simplified as below:



import pandas as pd

df = pd.DataFrame([
    ('a', 1, 1),
    ('a', 0, 0),
    ('a', 0, 1),
    ('b', 0, 0),
    ('b', 1, 0),
    ('b', 0, 1),
    ('c', 1, 1),
    ('c', 1, 0),
    ('c', 1, 0)
], columns=['A', 'B', 'C'])

print(df)

   A  B  C
0  a  1  1
1  a  0  0
2  a  0  1
3  b  0  0
4  b  1  0
5  b  0  1
6  c  1  1
7  c  1  0
8  c  1  0


My goal is to flatten the columns "B" and "C" based on the label they have in the "A" column:



   A  B_1  B_2  B_3  C_1  C_2  C_3
0  a    1    0    0    1    0    1
3  b    0    1    0    0    0    1
6  c    1    1    1    1    0    0


The code I wrote gives the result I want, but it is pretty slow, as it uses a plain for loop over the unique labels.
The solution I see is to write some vectorized function that optimizes the code. Does anyone have an idea?
The code is appended below.



added_col = ['B_1', 'B_2', 'B_3', 'C_1', 'C_2', 'C_3']

new_df = df.drop(['B', 'C'], axis=1).copy()
new_df = new_df.iloc[[x for x in range(0, len(df), 3)], :]
new_df = pd.concat([new_df, pd.DataFrame(columns=added_col)], sort=False)

for e, elem in new_df['A'].iteritems():
    new_df.loc[e, added_col] = df[df['A'] == elem].loc[:, ['B', 'C']].T.values.flatten()









Tags: pandas, dataframe, vectorization






asked 1 hour ago by el_Rinaldo











  • upvote for a nicely constructed question. :) – anky_91, 1 hour ago




4 Answers






Answer by Psidom (score 8):

Here is one way:



# create a row number by group
df['rn'] = df.groupby('A').cumcount() + 1

# pivot the table
new_df = df.set_index(['A', 'rn']).unstack()

# rename columns
new_df.columns = [x + '_' + str(y) for (x, y) in new_df.columns]

new_df.reset_index()
#    A  B_1  B_2  B_3  C_1  C_2  C_3
# 0  a    1    0    0    1    0    1
# 1  b    0    1    0    0    0    1
# 2  c    1    1    1    1    0    0





  • newdf.columns.map('{0[0]}_{0[1]}'.format) – Wen, 58 mins ago
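
For context, the comment suggests a shorter way to flatten the MultiIndex columns that unstack produces in the answer above. A minimal sketch of what that .map call does (the sample MultiIndex below is hypothetical, built only to illustrate):

import pandas as pd

# Hypothetical MultiIndex of (column, row-number) pairs, like the one unstack produces.
cols = pd.MultiIndex.from_tuples([('B', 1), ('B', 2), ('B', 3), ('C', 1)])
print(cols.map('{0[0]}_{0[1]}'.format).tolist())  # ['B_1', 'B_2', 'B_3', 'C_1']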


















Answer by piRSquared (score 3):

In an effort to improve performance, I've used numba and numpy assignment



from numba import njit
import numpy as np

@njit
def f(i, vals, n, m, k):
    # out[group, column, occurrence], zero-filled
    out = np.empty((n, k, m), vals.dtype)
    out.fill(0)

    # running count of rows seen per group
    c = np.zeros(n, np.int64)

    for j in range(len(i)):
        x = i[j]
        out[x, :, c[x]] = vals[j]
        c[x] += 1

    return out.reshape(n, m * k)


d0 = df.drop('A', 1)
cols = [*d0]

i, r = pd.factorize(df.A)

n = len(r)
m = np.bincount(i).max()
k = len(cols)

vals = d0.values

pd.DataFrame(
    f(i, vals, n, m, k),
    pd.Index(r, name='A'),
    [f"{c}_{i}" for c in cols for i in range(1, m + 1)]
).reset_index()



   A  B_1  B_2  B_3  C_1  C_2  C_3
0  a    1    0    0    1    0    1
1  b    0    1    0    0    0    1
2  c    1    1    1    1    0    0
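
For reference, the i and r fed into f come from pd.factorize; a quick hedged sketch of what that returns for the 'A' column of the sample frame:

import pandas as pd

codes, uniques = pd.factorize(pd.Series(['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c']))
print(codes)    # [0 0 0 1 1 1 2 2 2] -- integer group id per row (the i above)
print(uniques)  # Index(['a', 'b', 'c'], dtype='object') -- unique labels (the r above)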





Answer by RafaelC (score 2):

    Another approach using groupby and ravel()



>>> df.groupby('A')[['B', 'C']].apply(lambda s: pd.Series(s.T.values.ravel(),
...     index=[f'{x}_{i}' for x in s.columns for i in range(1, len(s) + 1)]))

   B_1  B_2  B_3  C_1  C_2  C_3
A
a    1    0    0    1    0    1
b    0    1    0    0    0    1
c    1    1    1    1    0    0
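
To see why the ravelled values line up with the B_1 … C_3 labels, here is a small hedged illustration of the s.T.values.ravel() step for the 'a' group alone:

import numpy as np

block = np.array([[1, 1],
                  [0, 0],
                  [0, 1]])   # the three 'a' rows, columns B and C
print(block.T.ravel())       # [1 0 0 1 0 1] -> B_1 B_2 B_3 C_1 C_2 C_3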





Answer by Wen (score 2):

      Modify your index by using %



df.index = df.index % 3 + 1
df.set_index('A', append=True, inplace=True)
newdf = df.unstack(level=0)
newdf.columns = newdf.columns.map('{0[0]}_{0[1]}'.format)
newdf
Out[291]:
   B_1  B_2  B_3  C_1  C_2  C_3
A
a    1    0    0    1    0    1
b    0    1    0    0    0    1
c    1    1    1    1    0    0
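
A small hedged sketch of the index arithmetic, which assumes every label occupies exactly three consecutive rows (as in the sample data):

import pandas as pd

idx = pd.RangeIndex(9)           # the original 0..8 row index
print((idx % 3 + 1).tolist())    # [1, 2, 3, 1, 2, 3, 1, 2, 3] -- within-group row number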




