Iyfjky

Question

I have the following DataFrame, in which one of the columns has object dtype (each cell holds a list).

df=pd.DataFrame({'A':[1,2],'B':[[1,2],[1,2]]})
df
Out[458]: 
   A       B
0  1  [1, 2]
1  2  [1, 2]

My expected output has each list element of B on its own row, with the matching value of A repeated.

How can I achieve this?

Happy to up-vote 2 times (actually 1.5 times tho :D) :-)
- U9-Forward, 3 hours ago
Haha, YW :D good question here
- U9-Forward, 3 hours ago

score 6 · Answer 1 · 2018-11-09 05:06:32Z

As a user of both R and Python who has spent a year on this site, I have seen this type of question a couple of times.

R has a built-in function for this, unnest from the tidyr package, but pandas had no built-in equivalent at the time of writing.

Object-dtype columns are always hard to convert with pandas functions. When I receive data like this, the first thing that comes to mind is to 'flatten' or unnest the column.
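For readers on a newer pandas: versions 0.25 and later ship DataFrame.explode, which postdates this answer. The following is a minimal sketch using the question's data; the variable name exploded is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [[1, 2], [1, 2]]})

# DataFrame.explode (pandas >= 0.25) puts each list element on its own row
# and repeats the original index label for every element.
exploded = df.explode('B')

# explode leaves B as object dtype, so cast it when integers are needed.
exploded['B'] = exploded['B'].astype(int)

# Optional: replace the repeated index labels with a fresh 0..n-1 index.
exploded = exploded.reset_index(drop=True)
print(exploded)
```

The methods below remain useful on older pandas versions, or when you need more control over how several list columns are combined.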

Method 1
apply + pd.Series (easy to understand, but not recommended in terms of performance)

df.set_index('A').B.apply(pd.Series).stack().reset_index(level=0).rename(columns={0:'B'})
Out[463]: 
   A  B
0  1  1
1  1  2
0  2  1
1  2  2

Method 2: use repeat with the DataFrame constructor to re-create the DataFrame (good performance, but awkward with multiple columns)

df=pd.DataFrame({'A':df.A.repeat(df.B.str.len()),'B':np.concatenate(df.B.values)})
df
Out[465]: 
   A  B
0  1  1
0  1  2
1  2  1
1  2  2
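Method 2 does not depend on the lists having equal length, because repeat takes one count per row. A small sketch with a hypothetical ragged input (the data and the names lengths and out are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical input whose lists have different lengths.
df = pd.DataFrame({'A': [1, 2, 3], 'B': [[1, 2], [3], [4, 5, 6]]})

lengths = df['B'].str.len()            # elements per row: 2, 1, 3
out = pd.DataFrame({
    'A': df['A'].repeat(lengths),      # Series.repeat keeps the original index labels
    'B': np.concatenate(df['B'].values),
})
print(out)
```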

Method 2.1: suppose that besides A we also have A.1 ... A.n. With Method 2 we would have to re-create those columns one by one, which is tedious.

Solution: unnest the single column, then join or merge back on the index

s=pd.DataFrame({'B':np.concatenate(df.B.values)},index=df.index.repeat(df.B.str.len()))
s.join(df.drop(columns='B'),how='left')
Out[477]: 
   B  A
0  1  1
0  2  1
1  1  2
1  2  2

If you need the column order to be exactly the same as before, add reindex at the end:

s.join(df.drop(columns='B'),how='left').reindex(columns=df.columns)
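To make the multi-column case concrete, here is a self-contained sketch of the same join approach. The extra column A1 and its values are assumptions added for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with two ordinary columns besides the list column.
df = pd.DataFrame({'A': [1, 2], 'A1': ['x', 'y'], 'B': [[1, 2], [3, 4]]})

# Flatten B and label each element with the index of the row it came from.
s = pd.DataFrame({'B': np.concatenate(df['B'].values)},
                 index=df.index.repeat(df['B'].str.len()))

# Joining on the index brings back every other column in one step,
# and reindex restores the original column order.
out = s.join(df.drop(columns='B'), how='left').reindex(columns=df.columns)
print(out)
```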

Method 3: re-create the rows with a list comprehension

pd.DataFrame([[x] + [z] for x, y in df.values for z in y],columns=df.columns)
Out[488]: 
   A  B
0  1  1
1  1  2
2  2  1
3  2  2

If there are more than two columns:

s=pd.DataFrame([[x] + [z] for x, y in zip(df.index,df.B) for z in y])
s.merge(df,left_on=0,right_index=True)
Out[491]: 
   0  1  A       B
0  0  1  1  [1, 2]
1  0  2  1  [1, 2]
2  1  1  2  [1, 2]
3  1  2  2  [1, 2]
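The merged result above still carries the helper columns 0 and 1 and the original list column B. One possible cleanup is sketched below; the helper column name row and the extra column A1 are assumptions, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'A1': ['x', 'y'], 'B': [[1, 2], [1, 2]]})

# Name the helper columns up front instead of leaving them as 0 and 1.
s = pd.DataFrame([[i, z] for i, y in zip(df.index, df['B']) for z in y],
                 columns=['row', 'B'])

# Merge against the frame without its list column, then drop the helper
# column and restore the original column order.
out = (s.merge(df.drop(columns='B'), left_on='row', right_index=True)
        .drop(columns='row')
        .reindex(columns=df.columns))
print(out)
```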

Method 4: using reindex or loc

df.reindex(df.index.repeat(df.B.str.len())).assign(B=np.concatenate(df.B.values))
Out[554]: 
   A  B
0  1  1
0  1  2
1  2  1
1  2  2

#df.loc[df.index.repeat(df.B.str.len())].assign(B=np.concatenate(df.B.values))

Method 5: when the lists only contain unique values

df=pd.DataFrame({'A':[1,2],'B':[[1,2],[3,4]]})
from collections import ChainMap
d = dict(ChainMap(*map(dict.fromkeys, df['B'], df['A'])))
pd.DataFrame(list(d.items()),columns=df.columns[::-1])
Out[574]: 
   B  A
0  1  1
1  2  1
2  3  2
3  4  2
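The uniqueness requirement comes from the list elements being used as dictionary keys. A short sketch of what goes wrong otherwise, using made-up data in which the value 2 appears in both lists:

```python
from collections import ChainMap

import pandas as pd

# The value 2 appears in both lists, so it cannot be a unique dictionary key.
df = pd.DataFrame({'A': [1, 2], 'B': [[1, 2], [2, 3]]})

d = dict(ChainMap(*map(dict.fromkeys, df['B'], df['A'])))
out = pd.DataFrame(list(d.items()), columns=df.columns[::-1])

total = sum(len(x) for x in df['B'])   # number of list elements in the input
print(len(out), total)                 # prints: 3 4
```

One row is silently lost, so this method is only safe when no value repeats across the lists.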

Special case: two columns of object (list) type

df=pd.DataFrame({'A':[1,2],'B':[[1,2],[3,4]],'C':[[1,2],[3,4]]})
df
Out[592]: 
   A       B       C
0  1  [1, 2]  [1, 2]
1  2  [3, 4]  [3, 4]

Self-defined function

def unnesting(df, explode):
    # repeat each index label once per list element
    idx=df.index.repeat(df[explode[0]].str.len())
    # flatten every list column, then put the flattened columns side by side
    df1=pd.concat([pd.DataFrame({x:np.concatenate(df[x].values)}) for x in explode],axis=1)
    df1.index=idx
    # bring back the remaining columns by joining on the index
    return df1.join(df.drop(columns=explode),how='left')

unnesting(df,['B','C'])
Out[609]: 
   B  C  A
0  1  1  1
0  2  2  1
1  3  3  2
1  4  4  2
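The function takes the repeat counts from the first list column only, so it assumes that every list column has the same list length in each row. If the per-row lengths differ, the flattened columns no longer line up with the repeated index. A hedged sketch of a variant that checks this first (the name unnesting_checked and the error message are my own, not part of the answer):

```python
import numpy as np
import pandas as pd

def unnesting_checked(df, explode):
    """Hypothetical variant of unnesting that validates list lengths first."""
    lengths = df[explode[0]].str.len()
    for col in explode[1:]:
        # Every list column must match the first one row by row.
        if not df[col].str.len().equals(lengths):
            raise ValueError(
                f"list lengths in {col!r} differ from {explode[0]!r}")
    idx = df.index.repeat(lengths)
    flat = pd.concat(
        [pd.DataFrame({col: np.concatenate(df[col].values)}) for col in explode],
        axis=1,
    )
    flat.index = idx
    return flat.join(df.drop(columns=explode), how='left')

df = pd.DataFrame({'A': [1, 2], 'B': [[1, 2], [3, 4]], 'C': [[1, 2], [3, 4]]})
out = unnesting_checked(df, ['B', 'C'])
print(out)
```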

Summary:

I am using pandas and plain Python functions for this type of question. If you are worried about the speed of the solutions above, check user3483203's answer: it uses numpy, and numpy is faster most of the time. If speed really matters in your case, I recommend Cython and numba.

Good one! I like the answers here. Perhaps you could elaborate on situations where multiple columns need unnesting: how would a solution like this generalise to N arbitrary columns with even (or uneven) length lists? (2 hours ago)
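One possible way to handle the uneven case raised in this comment is sketched below. It is not from either answer: the function name unnest_uneven and the choice to pad shorter lists with NaN are assumptions. Each list column is flattened into a Series keyed by (original row, position in list), and the Series are then aligned on that key.

```python
import pandas as pd

def unnest_uneven(df, explode):
    """Hypothetical helper: unnest several list columns, padding with NaN."""
    parts = []
    for col in explode:
        # Key every element by (original row label, position inside its list).
        keys = [(i, k) for i, lst in zip(df.index, df[col])
                for k in range(len(lst))]
        values = [v for lst in df[col] for v in lst]
        parts.append(pd.Series(values,
                               index=pd.MultiIndex.from_tuples(keys),
                               name=col))
    # Aligning on the (row, position) key leaves NaN where a list was shorter.
    flat = pd.concat(parts, axis=1).sort_index()
    flat = flat.reset_index(level=1, drop=True)
    return flat.join(df.drop(columns=explode))

df = pd.DataFrame({'A': [1, 2, 3],
                   'B': [[1, 2], [1, 2, 3], [1]],
                   'C': [[1, 2, 3], [1, 2], [1, 2]],
                   'D': ['A', 'B', 'C']})
out = unnest_uneven(df, ['B', 'C'])
print(out)
```

Because of the NaN padding, the unnested columns come back as floats here; whether padding is the right behaviour depends on the data.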

score 3 · Answer 2 · 2018-11-09 04:15:38Z

Option 1

If all of the sublists in the other column are the same length, numpy can be an efficient option here:

vals = np.array(df.B.values.tolist()) 
a = np.repeat(df.A, vals.shape[1])

pd.DataFrame(np.column_stack((a, vals.ravel())), columns=df.columns)

Option 2

If the sublists have different length, you need an additional step:

vals = df.B.values.tolist()
rs = [len(r) for r in vals] 
a = np.repeat(df.A, rs)

pd.DataFrame(np.column_stack((a, np.concatenate(vals))), columns=df.columns)
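One thing to keep in mind with both options: np.column_stack returns a single array with one common dtype, so columns of different types get converted. A small sketch with made-up data (integer A, float lists in B) that compares this with building the columns separately:

```python
import numpy as np
import pandas as pd

# Hypothetical input: integer column A, lists of floats in B.
df = pd.DataFrame({'A': [1, 2], 'B': [[0.5, 1.5], [2.5]]})
vals = df['B'].tolist()
rs = [len(r) for r in vals]

# column_stack needs one dtype for the whole array, so A is upcast to float.
stacked = np.column_stack((np.repeat(df['A'].values, rs), np.concatenate(vals)))
via_stack = pd.DataFrame(stacked, columns=df.columns)

# Passing the columns separately lets each one keep its own dtype.
per_column = pd.DataFrame({
    'A': np.repeat(df['A'].values, rs),
    'B': np.concatenate(vals),
})
print(via_stack.dtypes)
print(per_column.dtypes)
```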

Option 3

I took a shot at generalizing this to flatten N columns and tile M columns; I'll work on making it more efficient later:

df = pd.DataFrame({'A': [1,2,3], 'B': [[1,2], [1,2,3], [1]],
                   'C': [[1,2,3], [1,2], [1,2]], 'D': ['A', 'B', 'C']})

   A          B          C  D
0  1     [1, 2]  [1, 2, 3]  A
1  2  [1, 2, 3]     [1, 2]  B
2  3        [1]     [1, 2]  C

def unnest(df, tile, explode):
    # summing lists row-wise concatenates them into one list per row
    vals = df[explode].sum(axis=1)
    rs = [len(r) for r in vals]
    a = np.repeat(df[tile].values, rs, axis=0)
    b = np.concatenate(vals.values)
    d = np.column_stack((a, b))
    return pd.DataFrame(d, columns = tile + ['_'.join(explode)])

unnest(df, ['A', 'D'], ['B', 'C'])

Functions

def wen1(df):
    return df.set_index('A').B.apply(pd.Series).stack().reset_index(level=0).rename(columns={0: 'B'})

def wen2(df):
    return pd.DataFrame({'A':df.A.repeat(df.B.str.len()),'B':np.concatenate(df.B.values)})

def wen3(df):
    s = pd.DataFrame({'B': np.concatenate(df.B.values)}, index=df.index.repeat(df.B.str.len()))
    return s.join(df.drop(columns='B'), how='left')

def wen4(df):
    return pd.DataFrame([[x] + [z] for x, y in df.values for z in y],columns=df.columns)

def chris1(df):
    vals = np.array(df.B.values.tolist())
    a = np.repeat(df.A, vals.shape[1])
    return pd.DataFrame(np.column_stack((a, vals.ravel())), columns=df.columns)

def chris2(df):
    vals = df.B.values.tolist()
    rs = [len(r) for r in vals]
    a = np.repeat(df.A.values, rs)
    return pd.DataFrame(np.column_stack((a, np.concatenate(vals))), columns=df.columns)

Timings

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from timeit import timeit

res = pd.DataFrame(
    index=['wen1', 'wen2', 'wen3', 'wen4', 'chris1', 'chris2'],
    columns=[10, 50, 100, 500, 1000, 5000, 10000],
    dtype=float
)

for f in res.index:
    for c in res.columns:
        df = pd.DataFrame({'A': [1, 2], 'B': [[1, 2], [1, 2]]})
        df = pd.concat([df]*c)
        # build the statement and setup strings from the function name
        stmt = '{}(df)'.format(f)
        setp = 'from __main__ import df, {}'.format(f)
        res.at[f, c] = timeit(stmt, setp, number=50)

ax = res.div(res.min()).T.plot(loglog=True)
ax.set_xlabel("N")
ax.set_ylabel("time (relative)")
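The string-based setup above relies on the functions living in __main__. As an alternative sketch, timeit also accepts a callable, which avoids the import string. Only two of the functions are repeated here to keep the example self-contained, and the sizes and repeat count are reduced arbitrarily.

```python
from timeit import timeit

import numpy as np
import pandas as pd

def wen4(df):
    return pd.DataFrame([[x] + [z] for x, y in df.values for z in y],
                        columns=df.columns)

def chris2(df):
    vals = df.B.values.tolist()
    rs = [len(r) for r in vals]
    a = np.repeat(df.A.values, rs)
    return pd.DataFrame(np.column_stack((a, np.concatenate(vals))),
                        columns=df.columns)

funcs = {'wen4': wen4, 'chris2': chris2}
sizes = [10, 100]                      # smaller than the original benchmark
res = pd.DataFrame(index=list(funcs), columns=sizes, dtype=float)

base = pd.DataFrame({'A': [1, 2], 'B': [[1, 2], [1, 2]]})
for name, func in funcs.items():
    for n in sizes:
        df = pd.concat([base] * n)
        # timeit accepts a zero-argument callable instead of a code string
        res.at[name, n] = timeit(lambda: func(df), number=5)

print(res)
```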

Performance

[Plot: time of each function relative to the fastest one, versus N, on log-log axes]

U9-Forward · Answer 3 · 2018-11-09 02:40:08Z

Something I would not really recommend (but it at least works in this case):

df=pd.concat([df]*2).sort_index()
it=iter(df['B'].tolist()[0]+df['B'].tolist()[0])
df['B']=df['B'].apply(lambda x:next(it))

This combines concat + sort_index + iter + apply + next.

print(df) now shows the unnested frame, with each original index label repeated.

If you care about the index, reset it:

df=df.reset_index(drop=True)

print(df) then shows the same rows with a fresh 0 to 3 index.

How to unnest a column in a pandas DataFrame?
