Create a new column only if values differ

The name of the pictureThe name of the pictureThe name of the pictureClash Royale CLAN TAG#URR8PPP











up vote
6
down vote

favorite












My dataframe looks like this :



pd.DataFrame([["t1","d2","e3","r4"],
["t1","d2","e2","r4"],
["t1","d2","e1","r4"]],columns=["a","b","c","d"])


and I want:



pd.DataFrame([["t1","d2","e3","r4","e1","e2"]],
columns=["a","b","c","d","c1","c2"])


ie I have only 1 column that values differs and I want to create a new dataframe with columns added when new values are observed. Is there an easy way to do this ?










share|improve this question

























    up vote
    6
    down vote

    favorite












    My dataframe looks like this :



    pd.DataFrame([["t1","d2","e3","r4"],
    ["t1","d2","e2","r4"],
    ["t1","d2","e1","r4"]],columns=["a","b","c","d"])


    and I want:



    pd.DataFrame([["t1","d2","e3","r4","e1","e2"]],
    columns=["a","b","c","d","c1","c2"])


    ie I have only 1 column that values differs and I want to create a new dataframe with columns added when new values are observed. Is there an easy way to do this ?










    share|improve this question























      up vote
      6
      down vote

      favorite









      up vote
      6
      down vote

      favorite











      My dataframe looks like this :



      pd.DataFrame([["t1","d2","e3","r4"],
      ["t1","d2","e2","r4"],
      ["t1","d2","e1","r4"]],columns=["a","b","c","d"])


      and I want:



      pd.DataFrame([["t1","d2","e3","r4","e1","e2"]],
      columns=["a","b","c","d","c1","c2"])


      ie I have only 1 column that values differs and I want to create a new dataframe with columns added when new values are observed. Is there an easy way to do this ?










      share|improve this question













      My dataframe looks like this :



      pd.DataFrame([["t1","d2","e3","r4"],
      ["t1","d2","e2","r4"],
      ["t1","d2","e1","r4"]],columns=["a","b","c","d"])


      and I want:



      pd.DataFrame([["t1","d2","e3","r4","e1","e2"]],
      columns=["a","b","c","d","c1","c2"])


      ie I have only 1 column that values differs and I want to create a new dataframe with columns added when new values are observed. Is there an easy way to do this ?







      python pandas dataframe






      share|improve this question













      share|improve this question











      share|improve this question




      share|improve this question










      asked 47 mins ago









      FFL75

      97311




      97311






















          4 Answers
          4






          active

          oldest

          votes

















          up vote
          4
          down vote













          Edit: To generalize for any single non-unique column:



          Ucols = df.columns[(df.nunique() == 1)].tolist()
          df_out = df.set_index(Ucols).set_index(df.groupby(Ucols).cumcount(), append=True).unstack()
          df_out.columns = [f'ij' if j != 0 else f'i' for i,j in df_out.columns]
          print(df_out.reset_index())


          Output:



           a b d c c1 c2
          0 t1 d2 r4 e3 e2 e1


          Original Answer



          Use:



          df_out = df.set_index(['a','b','d',df.groupby(['a','b','d']).cumcount()]).unstack()

          df_out.columns = [f'ij' if j != 0 else f'i' for i,j in df_out.columns]

          df_out.reset_index()


          Output:



           a b d c c1 c2
          0 t1 d2 r4 e3 e2 e1





          share|improve this answer


















          • 2




            It looks to me that this doesn't generalize. I mean you know already that column c has more values.
            – user32185
            31 mins ago






          • 1




            Updated solution to handle not knowing 'c' has more than one value. Thanks.
            – Scott Boston
            20 mins ago


















          up vote
          4
          down vote













          You can use a dictionary comprehension. For consistency, I've included integer labeling on all columns.



          res = pd.DataFrame(f'colidx': val for col in df for idx, val in 
          enumerate(df[col].unique(), 1), index=[0])

          print(res)

          a1 b1 c1 c2 c3 d1
          0 t1 d2 e3 e2 e1 r4


          An alternative to df[col].unique() is df[col].drop_duplicates(), though the latter may incur an overhead for iterating a pd.Series object versus np.ndarray.






          share|improve this answer






















          • Here you can unique instead of drop_duplicates()
            – user32185
            25 mins ago






          • 1




            @user32185, Good point, have included that alternative. I believe it should be more efficient as unique returns an array, i.e. no pd.Series overhead.
            – jpp
            23 mins ago











          • I'm curious to know if it's possible to use a comprehension with a for inside the else in order to obtain the OP requested output. See my answer as reference.
            – user32185
            8 mins ago










          • @user32185, You can probably do something like d = 1: ''; f'cold.get(idx, idx)', since you can use dict.get within an f-string. So if it's a 1 it'll just be an empty string, otherwise return the integer.
            – jpp
            2 mins ago


















          up vote
          2
          down vote













          Not as beautiful as Scott answer but the logic you are looking for is:



          out = pd.DataFrame()
          for col in df.columns:
          values =df[col].unique()
          if len(values)==1:
          out[col]=values
          else:
          for i,value in enumerate(values):
          out[col+str(i+1)]= value





          share|improve this answer



























            up vote
            0
            down vote













            Using drop_duplicates



            s=df.reset_index().melt('index').drop_duplicates(['variable','value'],keep='first')


            pd.DataFrame([s.value.values.tolist()],columns=s['variable']+s['index'].astype(str))
            Out[1151]:
            a0 b0 c0 c1 c2 d0
            0 t1 d2 e3 e2 e1 r4




            share




















              Your Answer





              StackExchange.ifUsing("editor", function ()
              StackExchange.using("externalEditor", function ()
              StackExchange.using("snippets", function ()
              StackExchange.snippets.init();
              );
              );
              , "code-snippets");

              StackExchange.ready(function()
              var channelOptions =
              tags: "".split(" "),
              id: "1"
              ;
              initTagRenderer("".split(" "), "".split(" "), channelOptions);

              StackExchange.using("externalEditor", function()
              // Have to fire editor after snippets, if snippets enabled
              if (StackExchange.settings.snippets.snippetsEnabled)
              StackExchange.using("snippets", function()
              createEditor();
              );

              else
              createEditor();

              );

              function createEditor()
              StackExchange.prepareEditor(
              heartbeatType: 'answer',
              convertImagesToLinks: true,
              noModals: false,
              showLowRepImageUploadWarning: true,
              reputationToPostImages: 10,
              bindNavPrevention: true,
              postfix: "",
              onDemand: true,
              discardSelector: ".discard-answer"
              ,immediatelyShowMarkdownHelp:true
              );



              );













               

              draft saved


              draft discarded


















              StackExchange.ready(
              function ()
              StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f52480841%2fcreate-a-new-column-only-if-values-differ%23new-answer', 'question_page');

              );

              Post as a guest






























              4 Answers
              4






              active

              oldest

              votes








              4 Answers
              4






              active

              oldest

              votes









              active

              oldest

              votes






              active

              oldest

              votes








              up vote
              4
              down vote













              Edit: To generalize for any single non-unique column:



              Ucols = df.columns[(df.nunique() == 1)].tolist()
              df_out = df.set_index(Ucols).set_index(df.groupby(Ucols).cumcount(), append=True).unstack()
              df_out.columns = [f'ij' if j != 0 else f'i' for i,j in df_out.columns]
              print(df_out.reset_index())


              Output:



               a b d c c1 c2
              0 t1 d2 r4 e3 e2 e1


              Original Answer



              Use:



              df_out = df.set_index(['a','b','d',df.groupby(['a','b','d']).cumcount()]).unstack()

              df_out.columns = [f'ij' if j != 0 else f'i' for i,j in df_out.columns]

              df_out.reset_index()


              Output:



               a b d c c1 c2
              0 t1 d2 r4 e3 e2 e1





              share|improve this answer


















              • 2




                It looks to me that this doesn't generalize. I mean you know already that column c has more values.
                – user32185
                31 mins ago






              • 1




                Updated solution to handle not knowing 'c' has more than one value. Thanks.
                – Scott Boston
                20 mins ago















              up vote
              4
              down vote













              Edit: To generalize for any single non-unique column:



              Ucols = df.columns[(df.nunique() == 1)].tolist()
              df_out = df.set_index(Ucols).set_index(df.groupby(Ucols).cumcount(), append=True).unstack()
              df_out.columns = [f'ij' if j != 0 else f'i' for i,j in df_out.columns]
              print(df_out.reset_index())


              Output:



               a b d c c1 c2
              0 t1 d2 r4 e3 e2 e1


              Original Answer



              Use:



              df_out = df.set_index(['a','b','d',df.groupby(['a','b','d']).cumcount()]).unstack()

              df_out.columns = [f'ij' if j != 0 else f'i' for i,j in df_out.columns]

              df_out.reset_index()


              Output:



               a b d c c1 c2
              0 t1 d2 r4 e3 e2 e1





              share|improve this answer


















              • 2




                It looks to me that this doesn't generalize. I mean you know already that column c has more values.
                – user32185
                31 mins ago






              • 1




                Updated solution to handle not knowing 'c' has more than one value. Thanks.
                – Scott Boston
                20 mins ago













              up vote
              4
              down vote










              up vote
              4
              down vote









              Edit: To generalize for any single non-unique column:



              Ucols = df.columns[(df.nunique() == 1)].tolist()
              df_out = df.set_index(Ucols).set_index(df.groupby(Ucols).cumcount(), append=True).unstack()
              df_out.columns = [f'ij' if j != 0 else f'i' for i,j in df_out.columns]
              print(df_out.reset_index())


              Output:



               a b d c c1 c2
              0 t1 d2 r4 e3 e2 e1


              Original Answer



              Use:



              df_out = df.set_index(['a','b','d',df.groupby(['a','b','d']).cumcount()]).unstack()

              df_out.columns = [f'ij' if j != 0 else f'i' for i,j in df_out.columns]

              df_out.reset_index()


              Output:



               a b d c c1 c2
              0 t1 d2 r4 e3 e2 e1





              share|improve this answer














              Edit: To generalize for any single non-unique column:



              Ucols = df.columns[(df.nunique() == 1)].tolist()
              df_out = df.set_index(Ucols).set_index(df.groupby(Ucols).cumcount(), append=True).unstack()
              df_out.columns = [f'ij' if j != 0 else f'i' for i,j in df_out.columns]
              print(df_out.reset_index())


              Output:



               a b d c c1 c2
              0 t1 d2 r4 e3 e2 e1


              Original Answer



              Use:



              df_out = df.set_index(['a','b','d',df.groupby(['a','b','d']).cumcount()]).unstack()

              df_out.columns = [f'ij' if j != 0 else f'i' for i,j in df_out.columns]

              df_out.reset_index()


              Output:



               a b d c c1 c2
              0 t1 d2 r4 e3 e2 e1






              share|improve this answer














              share|improve this answer



              share|improve this answer








              edited 28 mins ago

























              answered 39 mins ago









              Scott Boston

              46.2k52351




              46.2k52351







              • 2




                It looks to me that this doesn't generalize. I mean you know already that column c has more values.
                – user32185
                31 mins ago






              • 1




                Updated solution to handle not knowing 'c' has more than one value. Thanks.
                – Scott Boston
                20 mins ago













              • 2




                It looks to me that this doesn't generalize. I mean you know already that column c has more values.
                – user32185
                31 mins ago






              • 1




                Updated solution to handle not knowing 'c' has more than one value. Thanks.
                – Scott Boston
                20 mins ago








              2




              2




              It looks to me that this doesn't generalize. I mean you know already that column c has more values.
              – user32185
              31 mins ago




              It looks to me that this doesn't generalize. I mean you know already that column c has more values.
              – user32185
              31 mins ago




              1




              1




              Updated solution to handle not knowing 'c' has more than one value. Thanks.
              – Scott Boston
              20 mins ago





              Updated solution to handle not knowing 'c' has more than one value. Thanks.
              – Scott Boston
              20 mins ago













              up vote
              4
              down vote













              You can use a dictionary comprehension. For consistency, I've included integer labeling on all columns.



              res = pd.DataFrame(f'colidx': val for col in df for idx, val in 
              enumerate(df[col].unique(), 1), index=[0])

              print(res)

              a1 b1 c1 c2 c3 d1
              0 t1 d2 e3 e2 e1 r4


              An alternative to df[col].unique() is df[col].drop_duplicates(), though the latter may incur an overhead for iterating a pd.Series object versus np.ndarray.






              share|improve this answer






















              • Here you can unique instead of drop_duplicates()
                – user32185
                25 mins ago






              • 1




                @user32185, Good point, have included that alternative. I believe it should be more efficient as unique returns an array, i.e. no pd.Series overhead.
                – jpp
                23 mins ago











              • I'm curious to know if it's possible to use a comprehension with a for inside the else in order to obtain the OP requested output. See my answer as reference.
                – user32185
                8 mins ago










              • @user32185, You can probably do something like d = 1: ''; f'cold.get(idx, idx)', since you can use dict.get within an f-string. So if it's a 1 it'll just be an empty string, otherwise return the integer.
                – jpp
                2 mins ago















              up vote
              4
              down vote













              You can use a dictionary comprehension. For consistency, I've included integer labeling on all columns.



              res = pd.DataFrame(f'colidx': val for col in df for idx, val in 
              enumerate(df[col].unique(), 1), index=[0])

              print(res)

              a1 b1 c1 c2 c3 d1
              0 t1 d2 e3 e2 e1 r4


              An alternative to df[col].unique() is df[col].drop_duplicates(), though the latter may incur an overhead for iterating a pd.Series object versus np.ndarray.






              share|improve this answer






















              • Here you can unique instead of drop_duplicates()
                – user32185
                25 mins ago






              • 1




                @user32185, Good point, have included that alternative. I believe it should be more efficient as unique returns an array, i.e. no pd.Series overhead.
                – jpp
                23 mins ago











              • I'm curious to know if it's possible to use a comprehension with a for inside the else in order to obtain the OP requested output. See my answer as reference.
                – user32185
                8 mins ago










              • @user32185, You can probably do something like d = 1: ''; f'cold.get(idx, idx)', since you can use dict.get within an f-string. So if it's a 1 it'll just be an empty string, otherwise return the integer.
                – jpp
                2 mins ago













              up vote
              4
              down vote










              up vote
              4
              down vote









              You can use a dictionary comprehension. For consistency, I've included integer labeling on all columns.



              res = pd.DataFrame(f'colidx': val for col in df for idx, val in 
              enumerate(df[col].unique(), 1), index=[0])

              print(res)

              a1 b1 c1 c2 c3 d1
              0 t1 d2 e3 e2 e1 r4


              An alternative to df[col].unique() is df[col].drop_duplicates(), though the latter may incur an overhead for iterating a pd.Series object versus np.ndarray.






              share|improve this answer














              You can use a dictionary comprehension. For consistency, I've included integer labeling on all columns.



              res = pd.DataFrame(f'colidx': val for col in df for idx, val in 
              enumerate(df[col].unique(), 1), index=[0])

              print(res)

              a1 b1 c1 c2 c3 d1
              0 t1 d2 e3 e2 e1 r4


              An alternative to df[col].unique() is df[col].drop_duplicates(), though the latter may incur an overhead for iterating a pd.Series object versus np.ndarray.







              share|improve this answer














              share|improve this answer



              share|improve this answer








              edited 24 mins ago

























              answered 31 mins ago









              jpp

              65.6k173881




              65.6k173881











              • Here you can unique instead of drop_duplicates()
                – user32185
                25 mins ago






              • 1




                @user32185, Good point, have included that alternative. I believe it should be more efficient as unique returns an array, i.e. no pd.Series overhead.
                – jpp
                23 mins ago











              • I'm curious to know if it's possible to use a comprehension with a for inside the else in order to obtain the OP requested output. See my answer as reference.
                – user32185
                8 mins ago










              • @user32185, You can probably do something like d = 1: ''; f'cold.get(idx, idx)', since you can use dict.get within an f-string. So if it's a 1 it'll just be an empty string, otherwise return the integer.
                – jpp
                2 mins ago

















              • Here you can unique instead of drop_duplicates()
                – user32185
                25 mins ago






              • 1




                @user32185, Good point, have included that alternative. I believe it should be more efficient as unique returns an array, i.e. no pd.Series overhead.
                – jpp
                23 mins ago











              • I'm curious to know if it's possible to use a comprehension with a for inside the else in order to obtain the OP requested output. See my answer as reference.
                – user32185
                8 mins ago










              • @user32185, You can probably do something like d = 1: ''; f'cold.get(idx, idx)', since you can use dict.get within an f-string. So if it's a 1 it'll just be an empty string, otherwise return the integer.
                – jpp
                2 mins ago
















              Here you can unique instead of drop_duplicates()
              – user32185
              25 mins ago




              Here you can unique instead of drop_duplicates()
              – user32185
              25 mins ago




              1




              1




              @user32185, Good point, have included that alternative. I believe it should be more efficient as unique returns an array, i.e. no pd.Series overhead.
              – jpp
              23 mins ago





              @user32185, Good point, have included that alternative. I believe it should be more efficient as unique returns an array, i.e. no pd.Series overhead.
              – jpp
              23 mins ago













              I'm curious to know if it's possible to use a comprehension with a for inside the else in order to obtain the OP requested output. See my answer as reference.
              – user32185
              8 mins ago




              I'm curious to know if it's possible to use a comprehension with a for inside the else in order to obtain the OP requested output. See my answer as reference.
              – user32185
              8 mins ago












              @user32185, You can probably do something like d = 1: ''; f'cold.get(idx, idx)', since you can use dict.get within an f-string. So if it's a 1 it'll just be an empty string, otherwise return the integer.
              – jpp
              2 mins ago





              @user32185, You can probably do something like d = 1: ''; f'cold.get(idx, idx)', since you can use dict.get within an f-string. So if it's a 1 it'll just be an empty string, otherwise return the integer.
              – jpp
              2 mins ago











              up vote
              2
              down vote













              Not as beautiful as Scott answer but the logic you are looking for is:



              out = pd.DataFrame()
              for col in df.columns:
              values =df[col].unique()
              if len(values)==1:
              out[col]=values
              else:
              for i,value in enumerate(values):
              out[col+str(i+1)]= value





              share|improve this answer
























                up vote
                2
                down vote













                Not as beautiful as Scott answer but the logic you are looking for is:



                out = pd.DataFrame()
                for col in df.columns:
                values =df[col].unique()
                if len(values)==1:
                out[col]=values
                else:
                for i,value in enumerate(values):
                out[col+str(i+1)]= value





                share|improve this answer






















                  up vote
                  2
                  down vote










                  up vote
                  2
                  down vote









                  Not as beautiful as Scott answer but the logic you are looking for is:



                  out = pd.DataFrame()
                  for col in df.columns:
                  values =df[col].unique()
                  if len(values)==1:
                  out[col]=values
                  else:
                  for i,value in enumerate(values):
                  out[col+str(i+1)]= value





                  share|improve this answer












                  Not as beautiful as Scott answer but the logic you are looking for is:



                  out = pd.DataFrame()
                  for col in df.columns:
                  values =df[col].unique()
                  if len(values)==1:
                  out[col]=values
                  else:
                  for i,value in enumerate(values):
                  out[col+str(i+1)]= value






                  share|improve this answer












                  share|improve this answer



                  share|improve this answer










                  answered 36 mins ago









                  user32185

                  9941722




                  9941722




















                      up vote
                      0
                      down vote













                      Using drop_duplicates



                      s=df.reset_index().melt('index').drop_duplicates(['variable','value'],keep='first')


                      pd.DataFrame([s.value.values.tolist()],columns=s['variable']+s['index'].astype(str))
                      Out[1151]:
                      a0 b0 c0 c1 c2 d0
                      0 t1 d2 e3 e2 e1 r4




                      share
























                        up vote
                        0
                        down vote













                        Using drop_duplicates



                        s=df.reset_index().melt('index').drop_duplicates(['variable','value'],keep='first')


                        pd.DataFrame([s.value.values.tolist()],columns=s['variable']+s['index'].astype(str))
                        Out[1151]:
                        a0 b0 c0 c1 c2 d0
                        0 t1 d2 e3 e2 e1 r4




                        share






















                          up vote
                          0
                          down vote










                          up vote
                          0
                          down vote









                          Using drop_duplicates



                          s=df.reset_index().melt('index').drop_duplicates(['variable','value'],keep='first')


                          pd.DataFrame([s.value.values.tolist()],columns=s['variable']+s['index'].astype(str))
                          Out[1151]:
                          a0 b0 c0 c1 c2 d0
                          0 t1 d2 e3 e2 e1 r4




                          share












                          Using drop_duplicates



                          s=df.reset_index().melt('index').drop_duplicates(['variable','value'],keep='first')


                          pd.DataFrame([s.value.values.tolist()],columns=s['variable']+s['index'].astype(str))
                          Out[1151]:
                          a0 b0 c0 c1 c2 d0
                          0 t1 d2 e3 e2 e1 r4





                          share











                          share


                          share










                          answered 4 mins ago









                          Wen

                          81.7k72246




                          81.7k72246



























                               

                              draft saved


                              draft discarded















































                               


                              draft saved


                              draft discarded














                              StackExchange.ready(
                              function ()
                              StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f52480841%2fcreate-a-new-column-only-if-values-differ%23new-answer', 'question_page');

                              );

                              Post as a guest













































































                              Comments

                              Popular posts from this blog

                              What does second last employer means? [closed]

                              List of Gilmore Girls characters

                              Confectionery