How to unnest a column in a pandas DataFrame?

I have the following DataFrame, in which one of the columns is of object dtype (each cell holds a list).

df = pd.DataFrame({'A': [1, 2], 'B': [[1, 2], [1, 2]]})
df
Out[458]:
   A       B
0  1  [1, 2]
1  2  [1, 2]

My expected output is as below:

   A  B
0  1  1
1  1  2
3  2  1
4  2  2

How can I achieve this?










  • Happy to up-vote 2 times (actually 1.5 times tho :D) :-)
    – U9-Forward
    3 hours ago






  • @U9-Forward thank you man :-)
    – W-B
    3 hours ago










  • Haha, YW :D good question here
    – U9-Forward
    3 hours ago














python pandas dataframe






asked 4 hours ago









W-B


3 Answers
As a user of both R and Python who has spent a year on this site, I have seen this type of question a couple of times.

In R, the tidyr package provides a built-in function for this called unnest, but pandas had no built-in function for this type of problem at the time of writing.

Object-dtype columns always make the data hard to convert with pandas functions. When I receive data like this, the first thing that comes to mind is to 'flatten' or unnest the columns.
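Since this answer was written, pandas 0.25 added DataFrame.explode, which covers the single-column case directly. The following is only a minimal sketch, assuming pandas 0.25 or later; the variable names exploded and tidy are illustrative and not part of the original answer.

```python
import pandas as pd

# Same toy frame as in the question.
df = pd.DataFrame({'A': [1, 2], 'B': [[1, 2], [1, 2]]})

# DataFrame.explode (pandas >= 0.25) emits one row per list element and
# repeats the original index label for each of them.
exploded = df.explode('B')

# Optional: renumber the rows and turn the object column back into integers.
tidy = exploded.reset_index(drop=True).astype({'B': 'int64'})
print(tidy)
```

On older pandas versions explode is not available, which is where the manual methods in this answer remain useful.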




Method 1
apply + pd.Series (easy to understand, but not recommended in terms of performance)

df.set_index('A').B.apply(pd.Series).stack().reset_index(level=0).rename(columns={0: 'B'})
Out[463]:
   A  B
0  1  1
1  1  2
0  2  1
1  2  2



Method 2 using repeat with the DataFrame constructor to re-create your DataFrame (good performance, but awkward with multiple columns)

pd.DataFrame({'A': df.A.repeat(df.B.str.len()), 'B': np.concatenate(df.B.values)})
Out[465]:
   A  B
0  1  1
0  1  2
1  2  1
1  2  2


Method 2.1: suppose that besides A we also have A.1, ..., A.n. With Method 2 above we would have to re-create those columns one by one, which is tedious.

Solution: join or merge on the index after unnesting the single column

s = pd.DataFrame({'B': np.concatenate(df.B.values)}, index=df.index.repeat(df.B.str.len()))
s.join(df.drop('B', axis=1), how='left')
Out[477]:
   B  A
0  1  1
0  2  1
1  1  2
1  2  2

If you need the column order to be exactly the same as before, add reindex at the end

s.join(df.drop('B', axis=1), how='left').reindex(columns=df.columns)



Method 3: re-create the rows with a list comprehension

pd.DataFrame([[x] + [z] for x, y in df.values for z in y], columns=df.columns)
Out[488]:
   A  B
0  1  1
1  1  2
2  2  1
3  2  2

If there are more than two columns

s = pd.DataFrame([[x] + [z] for x, y in zip(df.index, df.B) for z in y])
s.merge(df, left_on=0, right_index=True)
Out[491]:
   0  1  A       B
0  0  1  1  [1, 2]
1  0  2  1  [1, 2]
2  1  1  2  [1, 2]
3  1  2  2  [1, 2]
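The merge output above still carries the helper columns 0 and 1 as well as the original list column B. A possible clean-up is sketched below; the extra column D and the chain of drop, rename and reindex are my own assumptions for illustration, not part of the original answer.

```python
import pandas as pd

# Hypothetical frame with one extra column besides A and B.
df = pd.DataFrame({'A': [1, 2], 'B': [[1, 2], [1, 2]], 'D': ['x', 'y']})

# Column 0 holds the original index label, column 1 the unnested value.
s = pd.DataFrame([[i, z] for i, y in zip(df.index, df['B']) for z in y])

out = (
    s.merge(df.drop(columns='B'), left_on=0, right_index=True)
     .drop(columns=0)              # the index label was only needed for the merge
     .rename(columns={1: 'B'})     # give the unnested values their original name
     .reindex(columns=df.columns)  # restore the original column order
)
print(out)
```

Dropping B before the merge avoids carrying the list column along, so the result has the same columns as the input.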



Method 4 using reindex or loc

df.reindex(df.index.repeat(df.B.str.len())).assign(B=np.concatenate(df.B.values))
Out[554]:
   A  B
0  1  1
0  1  2
1  2  1
1  2  2

#df.loc[df.index.repeat(df.B.str.len())].assign(B=np.concatenate(df.B.values))
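Because Method 4 repeats whole rows, any other columns come along without being named. A small sketch of that behaviour, assuming a hypothetical extra column D and lists of different lengths (neither appears in the original example):

```python
import numpy as np
import pandas as pd

# Hypothetical frame: an extra column 'D' and lists of different lengths.
df = pd.DataFrame({'A': [1, 2], 'B': [[1, 2, 3], [4]], 'D': ['x', 'y']})

# Number of elements in each list; .str.len() also works on list cells.
lengths = df['B'].str.len()

# Repeat every row once per list element, then overwrite B with the
# flattened values; A and D are carried along without naming them.
out = (
    df.loc[df.index.repeat(lengths)]
      .assign(B=np.concatenate(df['B'].values))
      .reset_index(drop=True)
)
print(out)
```

The final reset_index is optional and only renumbers the repeated index labels.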


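Method 5, when the lists only contain unique values (on recent Python versions the row order may differ from the output shown, because it follows the iteration order of ChainMap):

df = pd.DataFrame({'A': [1, 2], 'B': [[1, 2], [3, 4]]})
from collections import ChainMap
d = dict(ChainMap(*map(dict.fromkeys, df['B'], df['A'])))
pd.DataFrame(list(d.items()), columns=df.columns[::-1])
Out[574]:
   B  A
0  1  1
1  2  1
2  3  2
3  4  2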
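The uniqueness requirement matters because the list values become dictionary keys. The sketch below uses a hypothetical input in which the value 2 appears in both lists, to show that one row is silently lost:

```python
from collections import ChainMap

import pandas as pd

# Hypothetical input where the value 2 appears in both lists.
df = pd.DataFrame({'A': [1, 2], 'B': [[1, 2], [2, 3]]})

# Each list becomes a dict {value: A}; ChainMap then merges them by key,
# so a value that occurs in several lists survives only once.
d = dict(ChainMap(*map(dict.fromkeys, df['B'], df['A'])))
out = pd.DataFrame(list(d.items()), columns=df.columns[::-1])

n_rows = len(out)                      # rows produced by Method 5
n_elements = sum(map(len, df['B']))    # elements actually present in the lists
print(n_rows, n_elements)
```

When duplicates are possible, one of the other methods is the safer choice.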


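Special case: two columns of object type

df = pd.DataFrame({'A': [1, 2], 'B': [[1, 2], [3, 4]], 'C': [[1, 2], [3, 4]]})
df
Out[592]:
   A       B       C
0  1  [1, 2]  [1, 2]
1  2  [3, 4]  [3, 4]

Self-defined function

def unnesting(df, explode):
    # Repeat each index label once per element of the first exploded column.
    idx = df.index.repeat(df[explode[0]].str.len())
    # Flatten every exploded column and place the results side by side.
    df1 = pd.concat([pd.DataFrame({x: np.concatenate(df[x].values)}) for x in explode], axis=1)
    df1.index = idx
    # Bring the remaining columns back by joining on the repeated index.
    return df1.join(df.drop(explode, axis=1), how='left')

unnesting(df, ['B', 'C'])
Out[609]:
   B  C  A
0  1  1  1
0  2  2  1
1  3  3  2
1  4  4  2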
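The unnesting function takes its row counts from the first exploded column only, so it assumes that, row by row, the lists in all exploded columns have the same length. Below is a sketch of a variant that checks this up front; the name unnesting_checked and the error message are my own assumptions, not part of the original answer.

```python
import numpy as np
import pandas as pd


def unnesting_checked(df, explode):
    """Hypothetical variant of unnesting() that validates list lengths first."""
    lengths = df[explode[0]].str.len()
    for col in explode[1:]:
        # Every exploded column must hold lists of the same length per row,
        # otherwise the flattened columns cannot be aligned side by side.
        if not df[col].str.len().equals(lengths):
            raise ValueError(f"list lengths in {col!r} differ from {explode[0]!r}")
    idx = df.index.repeat(lengths)
    flat = pd.DataFrame({col: np.concatenate(df[col].values) for col in explode}, index=idx)
    return flat.join(df.drop(columns=explode), how='left')


ok = pd.DataFrame({'A': [1, 2], 'B': [[1, 2], [3]], 'C': [[5, 6], [7]]})
result = unnesting_checked(ok, ['B', 'C'])
print(result)

bad = pd.DataFrame({'A': [1], 'B': [[1, 2]], 'C': [[5]]})
try:
    unnesting_checked(bad, ['B', 'C'])
except ValueError as err:
    print('rejected:', err)
```

Lists may still differ in length from one row to the next; only the columns within a single row have to agree.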



Summary:

I am using pandas and plain Python functions for this type of question. If you are worried about the speed of the solutions above, check user3483203's answer, since it uses numpy, and most of the time numpy is faster. As a suggestion, if speed really matters in your case, I recommend Cython and numba.






  • Good one! I like the answers here. Perhaps you could enumerate on some situations where multiple columns need unnesting, so how would a solution like this generalise to N arbitrary columns with even (or uneven) length lists.
    – coldspeed
    2 hours ago

















Option 1



If all of the sublists in the other column are the same length, numpy can be an efficient option here:

vals = np.array(df.B.values.tolist())
a = np.repeat(df.A, vals.shape[1])

pd.DataFrame(np.column_stack((a, vals.ravel())), columns=df.columns)

   A  B
0  1  1
1  1  2
2  2  1
3  2  2



Option 2



If the sublists have different lengths, you need an additional step:

vals = df.B.values.tolist()
rs = [len(r) for r in vals]
a = np.repeat(df.A, rs)

pd.DataFrame(np.column_stack((a, np.concatenate(vals))), columns=df.columns)

   A  B
0  1  1
1  1  2
2  2  1
3  2  2
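The output above still comes from equal-length lists, so here is a sketch of Option 2 on a hypothetical input whose sublists really do differ in length. The input values are my own assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical input with sublists of different lengths.
df = pd.DataFrame({'A': [1, 2, 3], 'B': [[1, 2], [3, 4, 5], [6]]})

vals = df['B'].tolist()
rs = [len(r) for r in vals]              # one length per sublist
a = np.repeat(df['A'].to_numpy(), rs)    # each A value repeated per sublist length
b = np.concatenate(vals)                 # flattened sublists

out = pd.DataFrame(np.column_stack((a, b)), columns=df.columns)
print(out)
```

Note that np.column_stack builds a single array, so columns of mixed types would be coerced to a common dtype.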



Option 3



I took a shot at generalizing this to flatten N columns and tile M columns. I'll work later on making it more efficient:

df = pd.DataFrame({'A': [1, 2, 3], 'B': [[1, 2], [1, 2, 3], [1]],
                   'C': [[1, 2, 3], [1, 2], [1, 2]], 'D': ['A', 'B', 'C']})

   A          B          C  D
0  1     [1, 2]  [1, 2, 3]  A
1  2  [1, 2, 3]     [1, 2]  B
2  3        [1]     [1, 2]  C

def unnest(df, tile, explode):
    # Summing list cells across columns concatenates the lists row by row.
    vals = df[explode].sum(axis=1)
    rs = [len(r) for r in vals]
    # Repeat the tiled columns once per element of the combined list.
    a = np.repeat(df[tile].values, rs, axis=0)
    b = np.concatenate(vals.values)
    d = np.column_stack((a, b))
    return pd.DataFrame(d, columns=tile + ['_'.join(explode)])

unnest(df, ['A', 'D'], ['B', 'C'])

    A  D B_C
0   1  A   1
1   1  A   2
2   1  A   1
3   1  A   2
4   1  A   3
5   2  B   1
6   2  B   2
7   2  B   3
8   2  B   1
9   2  B   2
10  3  C   1
11  3  C   1
12  3  C   2



Functions



def wen1(df):
    return df.set_index('A').B.apply(pd.Series).stack().reset_index(level=0).rename(columns={0: 'B'})

def wen2(df):
    return pd.DataFrame({'A': df.A.repeat(df.B.str.len()), 'B': np.concatenate(df.B.values)})

def wen3(df):
    s = pd.DataFrame({'B': np.concatenate(df.B.values)}, index=df.index.repeat(df.B.str.len()))
    return s.join(df.drop('B', axis=1), how='left')

def wen4(df):
    return pd.DataFrame([[x] + [z] for x, y in df.values for z in y], columns=df.columns)

def chris1(df):
    vals = np.array(df.B.values.tolist())
    a = np.repeat(df.A, vals.shape[1])
    return pd.DataFrame(np.column_stack((a, vals.ravel())), columns=df.columns)

def chris2(df):
    vals = df.B.values.tolist()
    rs = [len(r) for r in vals]
    a = np.repeat(df.A.values, rs)
    return pd.DataFrame(np.column_stack((a, np.concatenate(vals))), columns=df.columns)


Timings



import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from timeit import timeit

res = pd.DataFrame(
    index=['wen1', 'wen2', 'wen3', 'wen4', 'chris1', 'chris2'],
    columns=[10, 50, 100, 500, 1000, 5000, 10000],
    dtype=float
)

for f in res.index:
    for c in res.columns:
        df = pd.DataFrame({'A': [1, 2], 'B': [[1, 2], [1, 2]]})
        df = pd.concat([df]*c)
        # Build the statement and setup strings from the function name.
        stmt = '{}(df)'.format(f)
        setp = 'from __main__ import df, {}'.format(f)
        res.at[f, c] = timeit(stmt, setp, number=50)

ax = res.div(res.min()).T.plot(loglog=True)
ax.set_xlabel("N")
ax.set_ylabel("time (relative)")


Performance



[Figure: log-log plot of relative time against N for wen1, wen2, wen3, wen4, chris1 and chris2; image not available]






Something I would not really recommend (it works in this case only because every list is identical and has two elements):

df=pd.concat([df]*2).sort_index()
it=iter(df['B'].tolist()[0]+df['B'].tolist()[0])
df['B']=df['B'].apply(lambda x:next(it))

This uses concat + sort_index + iter + apply + next.

Now:

print(df)

Is:

   A  B
0  1  1
0  1  2
1  2  1
1  2  2

If you care about the index:

df=df.reset_index(drop=True)

Now:

print(df)

Is:

   A  B
0  1  1
1  1  2
2  2  1
3  2  2





First answer (Methods 1 to 5): answered 4 hours ago by W-B, edited 1 hour ago

      90.7k72755




      90.7k72755











      • Good one! I like the answers here. Perhaps you could enumerate on some situations where multiple columns need unnesting, so how would a solution like this generalise to N arbitrary columns with even (or uneven) length lists.
        – coldspeed
        2 hours ago
















      • Good one! I like the answers here. Perhaps you could enumerate on some situations where multiple columns need unnesting, so how would a solution like this generalise to N arbitrary columns with even (or uneven) length lists.
        – coldspeed
        2 hours ago















      Good one! I like the answers here. Perhaps you could enumerate on some situations where multiple columns need unnesting, so how would a solution like this generalise to N arbitrary columns with even (or uneven) length lists.
      – coldspeed
      2 hours ago




      Good one! I like the answers here. Perhaps you could enumerate on some situations where multiple columns need unnesting, so how would a solution like this generalise to N arbitrary columns with even (or uneven) length lists.
      – coldspeed
      2 hours ago












      up vote
      3
      down vote













      Option 1



      If all of the sublists in the other column are the same length, numpy can be an efficient option here:



      vals = np.array(df.B.values.tolist()) 
      a = np.repeat(df.A, vals.shape[1])

      pd.DataFrame(np.column_stack((a, vals.ravel())), columns=df.columns)




       A B
      0 1 1
      1 1 2
      2 2 1
      3 2 2



      Option 2



      If the sublists have different length, you need an additional step:



      vals = df.B.values.tolist()
      rs = [len(r) for r in vals]
      a = np.repeat(df.A, rs)

      pd.DataFrame(np.column_stack((a, np.concatenate(vals))), columns=df.columns)




       A B
      0 1 1
      1 1 2
      2 2 1
      3 2 2



      Option 3



      I took a shot at generalizing this to work to flatten N columns and tile M columns, I'll work later on making it more efficient:



      df = pd.DataFrame('A': [1,2,3], 'B': [[1,2], [1,2,3], [1]],
      'C': [[1,2,3], [1,2], [1,2]], 'D': ['A', 'B', 'C'])




       A B C D
      0 1 [1, 2] [1, 2, 3] A
      1 2 [1, 2, 3] [1, 2] B
      2 3 [1] [1, 2] C




      def unnest(df, tile, explode):
      vals = df[explode].sum(1)
      rs = [len(r) for r in vals]
      a = np.repeat(df[tile].values, rs, axis=0)
      b = np.concatenate(vals.values)
      d = np.column_stack((a, b))
      return pd.DataFrame(d, columns = tile + ['_'.join(explode)])

      unnest(df, ['A', 'D'], ['B', 'C'])




       A D B_C
      0 1 A 1
      1 1 A 2
      2 1 A 1
      3 1 A 2
      4 1 A 3
      5 2 B 1
      6 2 B 2
      7 2 B 3
      8 2 B 1
      9 2 B 2
      10 3 C 1
      11 3 C 1
      12 3 C 2



      Functions



      def wen1(df):
          return df.set_index('A').B.apply(pd.Series).stack().reset_index(level=0).rename(columns={0: 'B'})

      def wen2(df):
          return pd.DataFrame({'A': df.A.repeat(df.B.str.len()), 'B': np.concatenate(df.B.values)})

      def wen3(df):
          s = pd.DataFrame({'B': np.concatenate(df.B.values)}, index=df.index.repeat(df.B.str.len()))
          return s.join(df.drop('B', axis=1), how='left')

      def wen4(df):
          return pd.DataFrame([[x] + [z] for x, y in df.values for z in y], columns=df.columns)

      def chris1(df):
          vals = np.array(df.B.values.tolist())
          a = np.repeat(df.A, vals.shape[1])
          return pd.DataFrame(np.column_stack((a, vals.ravel())), columns=df.columns)

      def chris2(df):
          vals = df.B.values.tolist()
          rs = [len(r) for r in vals]
          a = np.repeat(df.A.values, rs)
          return pd.DataFrame(np.column_stack((a, np.concatenate(vals))), columns=df.columns)
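Before timing anything, it may be worth confirming that the candidates agree on the values they return. The sketch below compares two of the approaches on the question's frame; the function names `by_comprehension` and `by_concatenate` are placeholders I chose, mirroring the logic of `wen4` and `chris2`.

```python
import numpy as np
import pandas as pd


def by_comprehension(df):
    # One output row per (A value, element of that row's list).
    return pd.DataFrame([[x, z] for x, y in df.values for z in y],
                        columns=df.columns)


def by_concatenate(df):
    # Repeat A by list length and flatten B with numpy.
    vals = df.B.values.tolist()
    rs = [len(r) for r in vals]
    a = np.repeat(df.A.values, rs)
    return pd.DataFrame(np.column_stack((a, np.concatenate(vals))),
                        columns=df.columns)


df = pd.DataFrame({'A': [1, 2], 'B': [[1, 2], [1, 2]]})

# Compare raw values only; the index and dtypes may differ between approaches.
same = by_comprehension(df).values.tolist() == by_concatenate(df).values.tolist()
```

Comparing values rather than whole frames is deliberate here, since some of the approaches keep the original index while others reset it.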


      Timings



      import pandas as pd
      import matplotlib.pyplot as plt
      import numpy as np
      from timeit import timeit

      res = pd.DataFrame(
          index=['wen1', 'wen2', 'wen3', 'wen4', 'chris1', 'chris2'],
          columns=[10, 50, 100, 500, 1000, 5000, 10000],
          dtype=float
      )

      for f in res.index:
          for c in res.columns:
              df = pd.DataFrame({'A': [1, 2], 'B': [[1, 2], [1, 2]]})
              df = pd.concat([df]*c)
              # Build the statement and setup strings from the function name
              stmt = '{}(df)'.format(f)
              setp = 'from __main__ import df, {}'.format(f)
              res.at[f, c] = timeit(stmt, setp, number=50)

      ax = res.div(res.min()).T.plot(loglog=True)
      ax.set_xlabel("N")
      ax.set_ylabel("time (relative)")
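The harness above depends on the functions living in `__main__`. If that is inconvenient (for example inside a module or a notebook helper), a variation is to pass callables to `timeit` directly. This is only a sketch: the two functions, the smaller size grid and `number=5` are assumptions chosen to keep it quick, not the settings used for the plot.

```python
from timeit import timeit

import numpy as np
import pandas as pd


def flatten_concat(df):
    # numpy route: repeat A by list length, concatenate the lists.
    lengths = [len(r) for r in df['B']]
    return pd.DataFrame({'A': np.repeat(df['A'].to_numpy(), lengths),
                         'B': np.concatenate(df['B'].tolist())})


def flatten_loop(df):
    # Pure-Python route: one output row per list element.
    return pd.DataFrame([[a, b] for a, bs in zip(df['A'], df['B']) for b in bs],
                        columns=['A', 'B'])


candidates = {'flatten_concat': flatten_concat, 'flatten_loop': flatten_loop}
sizes = [10, 100]
res = pd.DataFrame(index=list(candidates), columns=sizes, dtype=float)

base = pd.DataFrame({'A': [1, 2], 'B': [[1, 2], [1, 2]]})
for n in sizes:
    df_n = pd.concat([base] * n, ignore_index=True)
    for name, func in candidates.items():
        # timeit accepts a zero-argument callable as the statement.
        res.at[name, n] = timeit(lambda: func(df_n), number=5)

# Same normalisation as above: each column relative to its fastest entry.
relative = res.div(res.min())
```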


      Performance



      [Plot: relative time versus N for each function, log-log axes]






          edited 2 hours ago

























          answered 3 hours ago









          user3483203


































              Something I wouldn't really recommend (but it at least works in this case):



              df=pd.concat([df]*2).sort_index()
              it=iter(df['B'].tolist()[0]+df['B'].tolist()[0])
              df['B']=df['B'].apply(lambda x:next(it))


              concat + sort_index + iter + apply + next.



              Now, printing the result:



              print(df)


              gives:



               A B
              0 1 1
              0 1 2
              1 2 1
              1 2 2


              If you care about the index:



              df=df.reset_index(drop=True)


              Now, printing the result:



              print(df)


              gives:



               A B
              0 1 1
              1 1 2
              2 2 1
              3 2 2
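The iterator trick above only lines up because every list has the same length as the first one. As a rough sketch of the same "duplicate the rows, then overwrite B" idea that also copes with lists of different lengths, one option is to repeat each row by its own list length. The frame below is a made-up example and `lengths`/`out` are illustrative names.

```python
from itertools import chain

import pandas as pd

# Hypothetical input with lists of different lengths.
df = pd.DataFrame({'A': [1, 2], 'B': [[1, 2], [3, 4, 5]]})

# How many times each row has to be duplicated.
lengths = df['B'].map(len)

# Duplicate each row by selecting its label that many times.
out = df.loc[df.index.repeat(lengths.to_numpy())].copy()

# Overwrite B with the flattened elements, which are in the same row order.
out['B'] = list(chain.from_iterable(df['B']))

# Drop the repeated index labels if a clean 0..n-1 index is wanted.
out = out.reset_index(drop=True)
```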





                  answered 3 hours ago









                  U9-Forward




























                       
