How to unnest a column in a pandas DataFrame?
I have the following DataFrame, in which one of the columns has object dtype (each cell holds a list).
df = pd.DataFrame({'A': [1, 2], 'B': [[1, 2], [1, 2]]})
df
Out[458]:
   A       B
0  1  [1, 2]
1  2  [1, 2]
My expected output is as follows:
   A  B
0  1  1
1  1  2
3  2  1
4  2  2
How can I achieve this?
python pandas dataframe
asked 4 hours ago
W-B
Happy to up-vote 2 times (actually 1.5 times tho :D) :-)
- U9-Forward, 3 hours ago
@U9-Forward thank you man :-)
- W-B, 3 hours ago
Haha, YW :D good question here
- U9-Forward, 3 hours ago
3 Answers
As a user of both R and Python who has spent a year on this site, I have seen this type of question a couple of times.
R has a built-in function for this, unnest from the tidyr package, but in Python (pandas) there is no built-in function for this type of question.
I know that object-dtype columns always make data hard to convert with pandas functions. When I receive data like this, the first thing that comes to mind is to 'flatten' or unnest the columns.
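A note for readers on newer pandas: as far as I know, `DataFrame.explode` was added in pandas 0.25, after this answer was written, and it handles the single-column case directly. A minimal sketch, assuming pandas 0.25 or later (the name `out` is only illustrative):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [[1, 2], [1, 2]]})

# explode() emits one row per list element and repeats the other columns.
# The original index labels are repeated, so reset the index if a fresh
# 0..n-1 index is wanted.
out = df.explode('B').reset_index(drop=True)
print(out)
```

The exploded column keeps object dtype, so a follow-up `astype(int)` may be needed before doing arithmetic on it.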
Method 1: apply + pd.Series (easy to understand, but not recommended in terms of performance)
df.set_index('A').B.apply(pd.Series).stack().reset_index(level=0).rename(columns={0:'B'})
Out[463]:
   A  B
0  1  1
1  1  2
0  2  1
1  2  2
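To make the Method 1 chain easier to follow, here is the same one-liner split into its steps. This is only a sketch of what each call contributes; the intermediate names `s`, `wide` and `long` are mine, not part of the answer.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [[1, 2], [1, 2]]})

# Step 1: index by A so that A survives the reshaping.
s = df.set_index('A')['B']

# Step 2: expand each list into one row of a wide frame (columns 0, 1, ...).
wide = s.apply(pd.Series)

# Step 3: stack the wide columns into a long Series indexed by (A, position).
long = wide.stack()

# Step 4: move A back into a column and name the value column B.
out = long.reset_index(level=0).rename(columns={0: 'B'})
print(out)
```

Step 2 builds one Series per row, which is the likely reason this method is slow on large frames.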
Method 2: repeat with the DataFrame constructor to re-create the DataFrame (good performance, but awkward with multiple columns)
pd.DataFrame({'A':df.A.repeat(df.B.str.len()),'B':np.concatenate(df.B.values)})
Out[465]:
   A  B
0  1  1
0  1  2
1  2  1
1  2  2
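Method 2 does not need the lists to have equal lengths, because `repeat` takes one count per row. A small sketch with uneven lists (the data here is made up for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [[1, 2], [3], [4, 5, 6]]})

lens = df['B'].str.len()  # elements per list: 2, 1, 3
out = pd.DataFrame({
    'A': df['A'].repeat(lens),             # repeat each A once per list element
    'B': np.concatenate(df['B'].values),   # flatten all lists into one array
})
print(out)
```

The result keeps the repeated original index labels (0, 0, 1, 2, 2, 2), just like the output shown above.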
Method 2.1: suppose that besides A we also have columns A.1, ..., A.n. With Method 2 it is hard to re-create those columns one by one.
Solution: unnest the single column first, then join or merge with the original DataFrame on the index.
s=pd.DataFrame({'B':np.concatenate(df.B.values)},index=df.index.repeat(df.B.str.len()))
s.join(df.drop(columns='B'),how='left')
Out[477]:
   B  A
0  1  1
0  2  1
1  1  2
1  2  2
If you need the column order to be exactly the same as before, add reindex at the end:
s.join(df.drop(columns='B'),how='left').reindex(columns=df.columns)
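Here is a sketch of the join approach with several extra columns, which is the case Method 2.1 is meant for. The columns `X` and `Y` are hypothetical stand-ins for A.1, ..., A.n.

```python
import numpy as np
import pandas as pd

# 'X' and 'Y' are made-up extra columns standing in for A.1 ... A.n.
df = pd.DataFrame({'A': [1, 2], 'X': ['a', 'b'], 'Y': [0.5, 1.5],
                   'B': [[1, 2], [3, 4]]})

# Unnest only B, keeping the repeated original index labels.
s = pd.DataFrame({'B': np.concatenate(df['B'].values)},
                 index=df.index.repeat(df['B'].str.len()))

# Join every other column back on the index, then restore the column order.
out = s.join(df.drop(columns='B'), how='left').reindex(columns=df.columns)
print(out)
```

None of the extra columns has to be named explicitly, which is the point of joining on the index.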
Method 3: re-create the list
pd.DataFrame([[x] + [z] for x, y in df.values for z in y],columns=df.columns)
Out[488]:
   A  B
0  1  1
1  1  2
2  2  1
3  2  2
If there are more than two columns:
s=pd.DataFrame([[x] + [z] for x, y in zip(df.index,df.B) for z in y])
s.merge(df,left_on=0,right_index=True)
Out[491]:
   0  1  A       B
0  0  1  1  [1, 2]
1  0  2  1  [1, 2]
2  1  1  2  [1, 2]
3  1  2  2  [1, 2]
Method 4: using reindex or loc
df.reindex(df.index.repeat(df.B.str.len())).assign(B=np.concatenate(df.B.values))
Out[554]:
   A  B
0  1  1
0  1  2
1  2  1
1  2  2
# Equivalent with loc:
# df.loc[df.index.repeat(df.B.str.len())].assign(B=np.concatenate(df.B.values))
Method 5: when the lists contain only unique values
df=pd.DataFrame({'A':[1,2],'B':[[1,2],[3,4]]})
from collections import ChainMap
d = dict(ChainMap(*map(dict.fromkeys, df['B'], df['A'])))
pd.DataFrame(list(d.items()),columns=df.columns[::-1])
Out[574]:
   B  A
0  1  1
1  2  1
2  3  2
3  4  2
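The uniqueness requirement in Method 5 matters because the list elements become dictionary keys. A small sketch of what happens when a value appears in two lists (the data is made up to trigger the problem):

```python
from collections import ChainMap

import pandas as pd

# 2 appears in both lists, so the values are not unique across rows.
df = pd.DataFrame({'A': [1, 2], 'B': [[1, 2], [2, 3]]})

# One dict per row: {element: A value}; ChainMap then merges them by key.
d = dict(ChainMap(*map(dict.fromkeys, df['B'], df['A'])))
out = pd.DataFrame(list(d.items()), columns=df.columns[::-1])

print(len(out))  # 3 rows, although the lists hold 4 elements in total
```

The duplicated key keeps the A value of the first row that contains it, so one row is silently lost. Also be aware that the row order of the result follows ChainMap's iteration order, which may vary between Python versions.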
Special case: two columns of object type
df=pd.DataFrame({'A':[1,2],'B':[[1,2],[3,4]],'C':[[1,2],[3,4]]})
df
Out[592]:
   A       B       C
0  1  [1, 2]  [1, 2]
1  2  [3, 4]  [3, 4]
Self-defined function:
def unnesting(df, explode):
    # repeat each index label once per element of the first exploded column
    idx=df.index.repeat(df[explode[0]].str.len())
    # flatten every exploded column and place the results side by side
    df1=pd.concat([pd.DataFrame({x:np.concatenate(df[x].values)}) for x in explode],axis=1)
    df1.index=idx
    # bring the remaining columns back by joining on the index
    return df1.join(df.drop(columns=explode),how='left')
unnesting(df,['B','C'])
Out[609]:
   B  C  A
0  1  1  1
0  2  2  1
1  3  3  2
1  4  4  2
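The function above takes the row counts from the first exploded column only, so it assumes that every exploded column has lists of the same length within each row. A sketch of a variant that checks this first; `unnesting_checked` is a hypothetical name, not part of the answer:

```python
import numpy as np
import pandas as pd

def unnesting_checked(df, explode):
    """Hypothetical variant of unnesting() that validates list lengths."""
    lens = df[explode[0]].str.len()
    for col in explode[1:]:
        # Every exploded column must have the same list length in each row.
        if not df[col].str.len().equals(lens):
            raise ValueError(
                f"list lengths in {col!r} differ from {explode[0]!r}")
    idx = df.index.repeat(lens)
    df1 = pd.concat(
        [pd.DataFrame({x: np.concatenate(df[x].values)}) for x in explode],
        axis=1)
    df1.index = idx
    return df1.join(df.drop(columns=explode), how='left')

df = pd.DataFrame({'A': [1, 2], 'B': [[1, 2], [3, 4]], 'C': [[1, 2], [3, 4]]})
out = unnesting_checked(df, ['B', 'C'])
print(out)
```

Raising early seems preferable to me over a confusing length-mismatch error further down, but that is a design choice rather than a requirement.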
Summary:
I am using pandas and Python functions for this type of question. If you are worried about the speed of the solutions above, check user3483203's answer: he uses numpy, and most of the time numpy is faster. If speed really matters for your case, I recommend Cython and numba.
Good one! I like the answers here. Perhaps you could enumerate on some situations where multiple columns need unnesting, so how would a solution like this generalise to N arbitrary columns with even (or uneven) length lists.
- coldspeed, 2 hours ago
Option 1
If all of the sublists in the other column are the same length, numpy can be an efficient option here:
vals = np.array(df.B.values.tolist())
a = np.repeat(df.A, vals.shape[1])
pd.DataFrame(np.column_stack((a, vals.ravel())), columns=df.columns)
   A  B
0  1  1
1  1  2
2  2  1
3  2  2
Option 2
If the sublists have different lengths, you need an additional step:
vals = df.B.values.tolist()
rs = [len(r) for r in vals]
a = np.repeat(df.A, rs)
pd.DataFrame(np.column_stack((a, np.concatenate(vals))), columns=df.columns)
   A  B
0  1  1
1  1  2
2  2  1
3  2  2
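The Option 2 output above still uses equal-length lists, so here is a self-contained sketch with genuinely uneven sublists. The data is made up, and I pass a plain array to `np.repeat` as the benchmark version of this function does.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [[1, 2], [1, 2, 3], [1]]})

vals = df['B'].tolist()
rs = [len(r) for r in vals]            # per-row repeat counts: [2, 3, 1]
a = np.repeat(df['A'].to_numpy(), rs)  # each A repeated once per element
b = np.concatenate(vals)               # all sublists flattened in order
out = pd.DataFrame(np.column_stack((a, b)), columns=df.columns)
print(out)
```

Unlike the pandas-based methods, this builds a fresh 0..n-1 index because the result is constructed from a plain array.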
Option 3
I took a shot at generalizing this to flatten N columns and tile M columns; I'll work on making it more efficient later:
df = pd.DataFrame({'A': [1,2,3], 'B': [[1,2], [1,2,3], [1]],
                   'C': [[1,2,3], [1,2], [1,2]], 'D': ['A', 'B', 'C']})
   A          B          C  D
0  1     [1, 2]  [1, 2, 3]  A
1  2  [1, 2, 3]     [1, 2]  B
2  3        [1]     [1, 2]  C
def unnest(df, tile, explode):
    # concatenate the lists of all exploded columns row by row
    vals = df[explode].sum(axis=1)
    rs = [len(r) for r in vals]
    # repeat the tiled columns once per flattened element
    a = np.repeat(df[tile].values, rs, axis=0)
    b = np.concatenate(vals.values)
    d = np.column_stack((a, b))
    return pd.DataFrame(d, columns = tile + ['_'.join(explode)])
unnest(df, ['A', 'D'], ['B', 'C'])
    A  D B_C
0   1  A   1
1   1  A   2
2   1  A   1
3   1  A   2
4   1  A   3
5   2  B   1
6   2  B   2
7   2  B   3
8   2  B   1
9   2  B   2
10  3  C   1
11  3  C   1
12  3  C   2
Functions
def wen1(df):
    return df.set_index('A').B.apply(pd.Series).stack().reset_index(level=0).rename(columns={0: 'B'})
def wen2(df):
    return pd.DataFrame({'A':df.A.repeat(df.B.str.len()),'B':np.concatenate(df.B.values)})
def wen3(df):
    s = pd.DataFrame({'B': np.concatenate(df.B.values)}, index=df.index.repeat(df.B.str.len()))
    return s.join(df.drop(columns='B'), how='left')
def wen4(df):
    return pd.DataFrame([[x] + [z] for x, y in df.values for z in y],columns=df.columns)
def chris1(df):
    vals = np.array(df.B.values.tolist())
    a = np.repeat(df.A, vals.shape[1])
    return pd.DataFrame(np.column_stack((a, vals.ravel())), columns=df.columns)
def chris2(df):
    vals = df.B.values.tolist()
    rs = [len(r) for r in vals]
    a = np.repeat(df.A.values, rs)
    return pd.DataFrame(np.column_stack((a, np.concatenate(vals))), columns=df.columns)
Timings
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from timeit import timeit
# one row per function, one column per input size multiplier
res = pd.DataFrame(
    index=['wen1', 'wen2', 'wen3', 'wen4', 'chris1', 'chris2'],
    columns=[10, 50, 100, 500, 1000, 5000, 10000],
    dtype=float
)
for f in res.index:
    for c in res.columns:
        df = pd.DataFrame({'A': [1, 2], 'B': [[1, 2], [1, 2]]})
        df = pd.concat([df]*c)
        # build the call, e.g. 'wen1(df)', and import that function for timeit
        stmt = '{}(df)'.format(f)
        setp = 'from __main__ import df, {}'.format(f)
        res.at[f, c] = timeit(stmt, setp, number=50)
# plot each timing relative to the fastest function at that size
ax = res.div(res.min()).T.plot(loglog=True)
ax.set_xlabel("N")
ax.set_ylabel("time (relative)")
Performance
[Figure: time (relative) against N for each function, log-log axes]
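As an aside on the timing setup: `timeit` also accepts a zero-argument callable, which avoids building statement strings and importing from `__main__`. A minimal sketch under my own assumptions: `repeat_concat` is a hypothetical stand-in for one of the benchmarked functions, and the sizes and repeat count are kept small on purpose.

```python
from timeit import timeit

import numpy as np
import pandas as pd

def repeat_concat(df):
    """Hypothetical stand-in for one of the benchmarked functions."""
    return pd.DataFrame({'A': df['A'].repeat(df['B'].str.len()),
                         'B': np.concatenate(df['B'].values)})

base = pd.DataFrame({'A': [1, 2], 'B': [[1, 2], [1, 2]]})
timings = {}
for n in (10, 100):
    # ignore_index=True is my choice here; the benchmark above keeps the
    # duplicated index labels produced by concat.
    df_n = pd.concat([base] * n, ignore_index=True)
    # timeit runs the callable `number` times and returns total seconds.
    timings[n] = timeit(lambda: repeat_concat(df_n), number=5)

print(timings)
```

The absolute numbers depend on the machine, so only the relative comparison between functions is meaningful.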
Something that is not really recommended (but at least works in this case):
df=pd.concat([df]*2).sort_index()
it=iter(df['B'].tolist()[0]+df['B'].tolist()[0])
df['B']=df['B'].apply(lambda x:next(it))
This uses concat + sort_index + iter + apply + next.
Now print(df) gives:
   A  B
0  1  1
0  1  2
1  2  1
1  2  2
If you care about the index:
df=df.reset_index(drop=True)
Now print(df) gives:
   A  B
0  1  1
1  1  2
2  2  1
3  2  2
The first answer above (Methods 1 to 5) was answered 4 hours ago by W-B and edited 1 hour ago.
Good one! I like the answers here. Perhaps you could enumerate on some situations where multiple columns need unnesting, so how would a solution like this generalise to N arbitrary columns with even (or uneven) length lists.
â coldspeed
2 hours ago
add a comment |Â
Good one! I like the answers here. Perhaps you could enumerate on some situations where multiple columns need unnesting, so how would a solution like this generalise to N arbitrary columns with even (or uneven) length lists.
â coldspeed
2 hours ago
Good one! I like the answers here. Perhaps you could enumerate on some situations where multiple columns need unnesting, so how would a solution like this generalise to N arbitrary columns with even (or uneven) length lists.
â coldspeed
2 hours ago
Good one! I like the answers here. Perhaps you could enumerate on some situations where multiple columns need unnesting, so how would a solution like this generalise to N arbitrary columns with even (or uneven) length lists.
â coldspeed
2 hours ago
add a comment |Â
up vote
3
down vote
Option 1
If all of the sublists in the other column are the same length, numpy
can be an efficient option here:
vals = np.array(df.B.values.tolist())
a = np.repeat(df.A, vals.shape[1])
pd.DataFrame(np.column_stack((a, vals.ravel())), columns=df.columns)
A B
0 1 1
1 1 2
2 2 1
3 2 2
Option 2
If the sublists have different length, you need an additional step:
vals = df.B.values.tolist()
rs = [len(r) for r in vals]
a = np.repeat(df.A, rs)
pd.DataFrame(np.column_stack((a, np.concatenate(vals))), columns=df.columns)
A B
0 1 1
1 1 2
2 2 1
3 2 2
Option 3
I took a shot at generalizing this to work to flatten N
columns and tile M
columns, I'll work later on making it more efficient:
df = pd.DataFrame('A': [1,2,3], 'B': [[1,2], [1,2,3], [1]],
'C': [[1,2,3], [1,2], [1,2]], 'D': ['A', 'B', 'C'])
A B C D
0 1 [1, 2] [1, 2, 3] A
1 2 [1, 2, 3] [1, 2] B
2 3 [1] [1, 2] C
def unnest(df, tile, explode):
vals = df[explode].sum(1)
rs = [len(r) for r in vals]
a = np.repeat(df[tile].values, rs, axis=0)
b = np.concatenate(vals.values)
d = np.column_stack((a, b))
return pd.DataFrame(d, columns = tile + ['_'.join(explode)])
unnest(df, ['A', 'D'], ['B', 'C'])
A D B_C
0 1 A 1
1 1 A 2
2 1 A 1
3 1 A 2
4 1 A 3
5 2 B 1
6 2 B 2
7 2 B 3
8 2 B 1
9 2 B 2
10 3 C 1
11 3 C 1
12 3 C 2
Functions
def wen1(df):
return df.set_index('A').B.apply(pd.Series).stack().reset_index(level=0).rename(columns=0: 'B')
def wen2(df):
return pd.DataFrame('A':df.A.repeat(df.B.str.len()),'B':np.concatenate(df.B.values))
def wen3(df):
s = pd.DataFrame('B': np.concatenate(df.B.values), index=df.index.repeat(df.B.str.len()))
return s.join(df.drop('B', 1), how='left')
def wen4(df):
return pd.DataFrame([[x] + [z] for x, y in df.values for z in y],columns=df.columns)
def chris1(df):
vals = np.array(df.B.values.tolist())
a = np.repeat(df.A, vals.shape[1])
return pd.DataFrame(np.column_stack((a, vals.ravel())), columns=df.columns)
def chris2(df):
vals = df.B.values.tolist()
rs = [len(r) for r in vals]
a = np.repeat(df.A.values, rs)
return pd.DataFrame(np.column_stack((a, np.concatenate(vals))), columns=df.columns)
Timings
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from timeit import timeit
res = pd.DataFrame(
index=['wen1', 'wen2', 'wen3', 'wen4', 'chris1', 'chris2'],
columns=[10, 50, 100, 500, 1000, 5000, 10000],
dtype=float
)
for f in res.index:
for c in res.columns:
df = pd.DataFrame('A': [1, 2], 'B': [[1, 2], [1, 2]])
df = pd.concat([df]*c)
stmt = '(df)'.format(f)
setp = 'from __main__ import df, '.format(f)
res.at[f, c] = timeit(stmt, setp, number=50)
ax = res.div(res.min()).T.plot(loglog=True)
ax.set_xlabel("N")
ax.set_ylabel("time (relative)")
Performance
add a comment |Â
up vote
3
down vote
Option 1
If all of the sublists in the other column are the same length, numpy
can be an efficient option here:
vals = np.array(df.B.values.tolist())
a = np.repeat(df.A, vals.shape[1])
pd.DataFrame(np.column_stack((a, vals.ravel())), columns=df.columns)
A B
0 1 1
1 1 2
2 2 1
3 2 2
Option 2
If the sublists have different length, you need an additional step:
vals = df.B.values.tolist()
rs = [len(r) for r in vals]
a = np.repeat(df.A, rs)
pd.DataFrame(np.column_stack((a, np.concatenate(vals))), columns=df.columns)
A B
0 1 1
1 1 2
2 2 1
3 2 2
Option 3
I took a shot at generalizing this to work to flatten N
columns and tile M
columns, I'll work later on making it more efficient:
df = pd.DataFrame('A': [1,2,3], 'B': [[1,2], [1,2,3], [1]],
'C': [[1,2,3], [1,2], [1,2]], 'D': ['A', 'B', 'C'])
A B C D
0 1 [1, 2] [1, 2, 3] A
1 2 [1, 2, 3] [1, 2] B
2 3 [1] [1, 2] C
def unnest(df, tile, explode):
vals = df[explode].sum(1)
rs = [len(r) for r in vals]
a = np.repeat(df[tile].values, rs, axis=0)
b = np.concatenate(vals.values)
d = np.column_stack((a, b))
return pd.DataFrame(d, columns = tile + ['_'.join(explode)])
unnest(df, ['A', 'D'], ['B', 'C'])
A D B_C
0 1 A 1
1 1 A 2
2 1 A 1
3 1 A 2
4 1 A 3
5 2 B 1
6 2 B 2
7 2 B 3
8 2 B 1
9 2 B 2
10 3 C 1
11 3 C 1
12 3 C 2
Functions
def wen1(df):
return df.set_index('A').B.apply(pd.Series).stack().reset_index(level=0).rename(columns=0: 'B')
def wen2(df):
return pd.DataFrame('A':df.A.repeat(df.B.str.len()),'B':np.concatenate(df.B.values))
def wen3(df):
s = pd.DataFrame('B': np.concatenate(df.B.values), index=df.index.repeat(df.B.str.len()))
return s.join(df.drop('B', 1), how='left')
def wen4(df):
return pd.DataFrame([[x] + [z] for x, y in df.values for z in y],columns=df.columns)
def chris1(df):
vals = np.array(df.B.values.tolist())
a = np.repeat(df.A, vals.shape[1])
return pd.DataFrame(np.column_stack((a, vals.ravel())), columns=df.columns)
def chris2(df):
vals = df.B.values.tolist()
rs = [len(r) for r in vals]
a = np.repeat(df.A.values, rs)
return pd.DataFrame(np.column_stack((a, np.concatenate(vals))), columns=df.columns)
Timings
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from timeit import timeit
res = pd.DataFrame(
index=['wen1', 'wen2', 'wen3', 'wen4', 'chris1', 'chris2'],
columns=[10, 50, 100, 500, 1000, 5000, 10000],
dtype=float
)
for f in res.index:
for c in res.columns:
df = pd.DataFrame('A': [1, 2], 'B': [[1, 2], [1, 2]])
df = pd.concat([df]*c)
stmt = '(df)'.format(f)
setp = 'from __main__ import df, '.format(f)
res.at[f, c] = timeit(stmt, setp, number=50)
ax = res.div(res.min()).T.plot(loglog=True)
ax.set_xlabel("N")
ax.set_ylabel("time (relative)")
Performance
add a comment |Â
up vote
3
down vote
up vote
3
down vote
Option 1
If all of the sublists in the other column are the same length, numpy
can be an efficient option here:
vals = np.array(df.B.values.tolist())
a = np.repeat(df.A, vals.shape[1])
pd.DataFrame(np.column_stack((a, vals.ravel())), columns=df.columns)
A B
0 1 1
1 1 2
2 2 1
3 2 2
Option 2
If the sublists have different length, you need an additional step:
vals = df.B.values.tolist()
rs = [len(r) for r in vals]
a = np.repeat(df.A, rs)
pd.DataFrame(np.column_stack((a, np.concatenate(vals))), columns=df.columns)
A B
0 1 1
1 1 2
2 2 1
3 2 2
Option 3
I took a shot at generalizing this to work to flatten N
columns and tile M
columns, I'll work later on making it more efficient:
df = pd.DataFrame('A': [1,2,3], 'B': [[1,2], [1,2,3], [1]],
'C': [[1,2,3], [1,2], [1,2]], 'D': ['A', 'B', 'C'])
A B C D
0 1 [1, 2] [1, 2, 3] A
1 2 [1, 2, 3] [1, 2] B
2 3 [1] [1, 2] C
def unnest(df, tile, explode):
vals = df[explode].sum(1)
rs = [len(r) for r in vals]
a = np.repeat(df[tile].values, rs, axis=0)
b = np.concatenate(vals.values)
d = np.column_stack((a, b))
return pd.DataFrame(d, columns = tile + ['_'.join(explode)])
unnest(df, ['A', 'D'], ['B', 'C'])
A D B_C
0 1 A 1
1 1 A 2
2 1 A 1
3 1 A 2
4 1 A 3
5 2 B 1
6 2 B 2
7 2 B 3
8 2 B 1
9 2 B 2
10 3 C 1
11 3 C 1
12 3 C 2
Functions
def wen1(df):
return df.set_index('A').B.apply(pd.Series).stack().reset_index(level=0).rename(columns=0: 'B')
def wen2(df):
return pd.DataFrame('A':df.A.repeat(df.B.str.len()),'B':np.concatenate(df.B.values))
def wen3(df):
s = pd.DataFrame('B': np.concatenate(df.B.values), index=df.index.repeat(df.B.str.len()))
return s.join(df.drop('B', 1), how='left')
def wen4(df):
return pd.DataFrame([[x] + [z] for x, y in df.values for z in y],columns=df.columns)
def chris1(df):
vals = np.array(df.B.values.tolist())
a = np.repeat(df.A, vals.shape[1])
return pd.DataFrame(np.column_stack((a, vals.ravel())), columns=df.columns)
def chris2(df):
vals = df.B.values.tolist()
rs = [len(r) for r in vals]
a = np.repeat(df.A.values, rs)
return pd.DataFrame(np.column_stack((a, np.concatenate(vals))), columns=df.columns)
Timings
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from timeit import timeit
res = pd.DataFrame(
index=['wen1', 'wen2', 'wen3', 'wen4', 'chris1', 'chris2'],
columns=[10, 50, 100, 500, 1000, 5000, 10000],
dtype=float
)
for f in res.index:
for c in res.columns:
df = pd.DataFrame('A': [1, 2], 'B': [[1, 2], [1, 2]])
df = pd.concat([df]*c)
stmt = '(df)'.format(f)
setp = 'from __main__ import df, '.format(f)
res.at[f, c] = timeit(stmt, setp, number=50)
ax = res.div(res.min()).T.plot(loglog=True)
ax.set_xlabel("N")
ax.set_ylabel("time (relative)")
Performance
edited 2 hours ago
answered 3 hours ago
user3483203
28.2k72351
Something I would not really recommend (but it at least works in this case):
df = pd.concat([df] * 2).sort_index()
it = iter(df['B'].tolist()[0] + df['B'].tolist()[0])
df['B'] = df['B'].apply(lambda x: next(it))
This combines concat + sort_index + iter + apply + next.
Now print(df) gives:
   A  B
0  1  1
0  1  2
1  2  1
1  2  2
If you care about the index:
df = df.reset_index(drop=True)
Now print(df) gives:
   A  B
0  1  1
1  1  2
2  2  1
3  2  2
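The iterator trick above is tied to this exact frame (two rows, each holding the same two-element list). As a hedged alternative that is not part of this answer, pandas 0.25 and later provide DataFrame.explode, which handles lists of any length. A minimal sketch, assuming such a pandas version is installed; the sample frame and the name result are made up for illustration:

```python
import pandas as pd

# Hypothetical sample with lists of different lengths.
df = pd.DataFrame({'A': [1, 2], 'B': [[1, 2], [3]]})

# explode() emits one row per list element and repeats the other columns.
# It keeps the original index labels, so reset them afterwards.
result = df.explode('B').reset_index(drop=True)

# The exploded column has object dtype; cast it back to integers.
result['B'] = result['B'].astype(int)

print(result)
```

Resetting the index separately, rather than relying on a keyword argument of explode, keeps the sketch usable on the earliest versions that include the method.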
answered 3 hours ago
U9-Forward
8,7912733