Best way to flatten dataframe based on values on column
Clash Royale CLAN TAG#URR8PPP
up vote
7
down vote
favorite
I have to process a whole dataframe with some hundered thousands rows, but I can simplify it as below:
df = pd.DataFrame([
('a', 1, 1),
('a', 0, 0),
('a', 0, 1),
('b', 0, 0),
('b', 1, 0),
('b', 0, 1),
('c', 1, 1),
('c', 1, 0),
('c', 1, 0)
], columns=['A', 'B', 'C'])
print (df)
A B C
0 a 1 1
1 a 0 0
2 a 0 1
3 b 0 0
4 b 1 0
5 b 0 1
6 c 1 1
7 c 1 0
8 c 1 0
My goal it to flatten the columns "B" and "C" based on the label they have in the "A" column
A B_1 B_2 B_3 C_1 C_2 C_3
0 a 1 0 0 1 0 1
3 b 0 1 0 0 0 1
6 c 1 1 1 1 0 0
The code I wrote gives the result I want, but it is pretty slow as it uses a simple for loop on the unique labels.
The solution I see is to write some vectorized function that optimize my code. Anyone has some idea?
Below I append the code.
added_col = ['B_1', 'B_2', 'B_3', 'C_1', 'C_2', 'C_3']
new_df = df.drop(['B', 'C'], axis=1).copy()
new_df = new_df.iloc[[x for x in range(0, len(df), 3)], :]
new_df = pd.concat([new_df,pd.DataFrame(columns=added_col)], sort=False)
for e, elem in new_df['A'].iteritems():
new_df.loc[e, added_col] = df[df['A'] == elem].loc[:,['B','C']].T.values.flatten()
pandas dataframe vectorization
add a comment |Â
up vote
7
down vote
favorite
I have to process a whole dataframe with some hundered thousands rows, but I can simplify it as below:
df = pd.DataFrame([
('a', 1, 1),
('a', 0, 0),
('a', 0, 1),
('b', 0, 0),
('b', 1, 0),
('b', 0, 1),
('c', 1, 1),
('c', 1, 0),
('c', 1, 0)
], columns=['A', 'B', 'C'])
print (df)
A B C
0 a 1 1
1 a 0 0
2 a 0 1
3 b 0 0
4 b 1 0
5 b 0 1
6 c 1 1
7 c 1 0
8 c 1 0
My goal it to flatten the columns "B" and "C" based on the label they have in the "A" column
A B_1 B_2 B_3 C_1 C_2 C_3
0 a 1 0 0 1 0 1
3 b 0 1 0 0 0 1
6 c 1 1 1 1 0 0
The code I wrote gives the result I want, but it is pretty slow as it uses a simple for loop on the unique labels.
The solution I see is to write some vectorized function that optimize my code. Anyone has some idea?
Below I append the code.
added_col = ['B_1', 'B_2', 'B_3', 'C_1', 'C_2', 'C_3']
new_df = df.drop(['B', 'C'], axis=1).copy()
new_df = new_df.iloc[[x for x in range(0, len(df), 3)], :]
new_df = pd.concat([new_df,pd.DataFrame(columns=added_col)], sort=False)
for e, elem in new_df['A'].iteritems():
new_df.loc[e, added_col] = df[df['A'] == elem].loc[:,['B','C']].T.values.flatten()
pandas dataframe vectorization
upvote for a nicely constructed question. :)
â anky_91
1 hour ago
add a comment |Â
up vote
7
down vote
favorite
up vote
7
down vote
favorite
I have to process a whole dataframe with some hundered thousands rows, but I can simplify it as below:
df = pd.DataFrame([
('a', 1, 1),
('a', 0, 0),
('a', 0, 1),
('b', 0, 0),
('b', 1, 0),
('b', 0, 1),
('c', 1, 1),
('c', 1, 0),
('c', 1, 0)
], columns=['A', 'B', 'C'])
print (df)
A B C
0 a 1 1
1 a 0 0
2 a 0 1
3 b 0 0
4 b 1 0
5 b 0 1
6 c 1 1
7 c 1 0
8 c 1 0
My goal it to flatten the columns "B" and "C" based on the label they have in the "A" column
A B_1 B_2 B_3 C_1 C_2 C_3
0 a 1 0 0 1 0 1
3 b 0 1 0 0 0 1
6 c 1 1 1 1 0 0
The code I wrote gives the result I want, but it is pretty slow as it uses a simple for loop on the unique labels.
The solution I see is to write some vectorized function that optimize my code. Anyone has some idea?
Below I append the code.
added_col = ['B_1', 'B_2', 'B_3', 'C_1', 'C_2', 'C_3']
new_df = df.drop(['B', 'C'], axis=1).copy()
new_df = new_df.iloc[[x for x in range(0, len(df), 3)], :]
new_df = pd.concat([new_df,pd.DataFrame(columns=added_col)], sort=False)
for e, elem in new_df['A'].iteritems():
new_df.loc[e, added_col] = df[df['A'] == elem].loc[:,['B','C']].T.values.flatten()
pandas dataframe vectorization
I have to process a whole dataframe with some hundered thousands rows, but I can simplify it as below:
df = pd.DataFrame([
('a', 1, 1),
('a', 0, 0),
('a', 0, 1),
('b', 0, 0),
('b', 1, 0),
('b', 0, 1),
('c', 1, 1),
('c', 1, 0),
('c', 1, 0)
], columns=['A', 'B', 'C'])
print (df)
A B C
0 a 1 1
1 a 0 0
2 a 0 1
3 b 0 0
4 b 1 0
5 b 0 1
6 c 1 1
7 c 1 0
8 c 1 0
My goal it to flatten the columns "B" and "C" based on the label they have in the "A" column
A B_1 B_2 B_3 C_1 C_2 C_3
0 a 1 0 0 1 0 1
3 b 0 1 0 0 0 1
6 c 1 1 1 1 0 0
The code I wrote gives the result I want, but it is pretty slow as it uses a simple for loop on the unique labels.
The solution I see is to write some vectorized function that optimize my code. Anyone has some idea?
Below I append the code.
added_col = ['B_1', 'B_2', 'B_3', 'C_1', 'C_2', 'C_3']
new_df = df.drop(['B', 'C'], axis=1).copy()
new_df = new_df.iloc[[x for x in range(0, len(df), 3)], :]
new_df = pd.concat([new_df,pd.DataFrame(columns=added_col)], sort=False)
for e, elem in new_df['A'].iteritems():
new_df.loc[e, added_col] = df[df['A'] == elem].loc[:,['B','C']].T.values.flatten()
pandas dataframe vectorization
pandas dataframe vectorization
asked 1 hour ago
el_Rinaldo
466213
466213
upvote for a nicely constructed question. :)
â anky_91
1 hour ago
add a comment |Â
upvote for a nicely constructed question. :)
â anky_91
1 hour ago
upvote for a nicely constructed question. :)
â anky_91
1 hour ago
upvote for a nicely constructed question. :)
â anky_91
1 hour ago
add a comment |Â
4 Answers
4
active
oldest
votes
up vote
8
down vote
Here is one way:
# create a row number by group
df['rn'] = df.groupby('A').cumcount() + 1
# pivot the table
new_df = df.set_index(['A', 'rn']).unstack()
# rename columns
new_df.columns = [x + '_' + str(y) for (x, y) in new_df.columns]
new_df.reset_index()
# A B_1 B_2 B_3 C_1 C_2 C_3
#0 a 1 0 0 1 0 1
#1 b 0 1 0 0 0 1
#2 c 1 1 1 1 0 0
4
newdf.columns.map('0[0]_0[1]'.format)
â Wen
58 mins ago
add a comment |Â
up vote
3
down vote
In an effort to improve performance, I've used numba and numpy assignment
from numba import njit
@njit
def f(i, vals, n, m, k):
out = np.empty((n, k, m), vals.dtype)
out.fill(0)
c = np.zeros(n, np.int64)
for j in range(len(i)):
x = i[j]
out[x, :, c[x]] = vals[j]
c[x] += 1
return out.reshape(n, m * k)
d0 = df.drop('A', 1)
cols = [*d0]
i, r = pd.factorize(df.A)
n = len(r)
m = np.bincount(i).max()
k = len(cols)
vals = d0.values
pd.DataFrame(
f(i, vals, n, m, k),
pd.Index(r, name='A'),
[f"c_i" for c in cols for i in range(1, m + 1)]
).reset_index()
A B_1 B_2 B_3 C_1 C_2 C_3
0 a 1 0 0 1 0 1
1 b 0 1 0 0 0 1
2 c 1 1 1 1 0 0
add a comment |Â
up vote
2
down vote
Another approach using groupby
and ravel()
>>> df.groupby('A')[['B','C']].apply(lambda s: pd.Series(s.T.values.ravel(),
index=[f'x_i' for x in s.columns for i in range(1, len(s)+1)]))
B_1 B_2 B_3 C_1 C_2 C_3
A
a 1 0 0 1 0 1
b 0 1 0 0 0 1
c 1 1 1 1 0 0
add a comment |Â
up vote
2
down vote
Modify your index by using %
df.index=df.index%3+1
df.set_index('A',append=True,inplace=True)
newdf=df.unstack(level=0)
newdf.columns=newdf.columns.map('0[0]_0[1]'.format)
newdf
Out[291]:
B_1 B_2 B_3 C_1 C_2 C_3
A
a 1 0 0 1 0 1
b 0 1 0 0 0 1
c 1 1 1 1 0 0
add a comment |Â
4 Answers
4
active
oldest
votes
4 Answers
4
active
oldest
votes
active
oldest
votes
active
oldest
votes
up vote
8
down vote
Here is one way:
# create a row number by group
df['rn'] = df.groupby('A').cumcount() + 1
# pivot the table
new_df = df.set_index(['A', 'rn']).unstack()
# rename columns
new_df.columns = [x + '_' + str(y) for (x, y) in new_df.columns]
new_df.reset_index()
# A B_1 B_2 B_3 C_1 C_2 C_3
#0 a 1 0 0 1 0 1
#1 b 0 1 0 0 0 1
#2 c 1 1 1 1 0 0
4
newdf.columns.map('0[0]_0[1]'.format)
â Wen
58 mins ago
add a comment |Â
up vote
8
down vote
Here is one way:
# create a row number by group
df['rn'] = df.groupby('A').cumcount() + 1
# pivot the table
new_df = df.set_index(['A', 'rn']).unstack()
# rename columns
new_df.columns = [x + '_' + str(y) for (x, y) in new_df.columns]
new_df.reset_index()
# A B_1 B_2 B_3 C_1 C_2 C_3
#0 a 1 0 0 1 0 1
#1 b 0 1 0 0 0 1
#2 c 1 1 1 1 0 0
4
newdf.columns.map('0[0]_0[1]'.format)
â Wen
58 mins ago
add a comment |Â
up vote
8
down vote
up vote
8
down vote
Here is one way:
# create a row number by group
df['rn'] = df.groupby('A').cumcount() + 1
# pivot the table
new_df = df.set_index(['A', 'rn']).unstack()
# rename columns
new_df.columns = [x + '_' + str(y) for (x, y) in new_df.columns]
new_df.reset_index()
# A B_1 B_2 B_3 C_1 C_2 C_3
#0 a 1 0 0 1 0 1
#1 b 0 1 0 0 0 1
#2 c 1 1 1 1 0 0
Here is one way:
# create a row number by group
df['rn'] = df.groupby('A').cumcount() + 1
# pivot the table
new_df = df.set_index(['A', 'rn']).unstack()
# rename columns
new_df.columns = [x + '_' + str(y) for (x, y) in new_df.columns]
new_df.reset_index()
# A B_1 B_2 B_3 C_1 C_2 C_3
#0 a 1 0 0 1 0 1
#1 b 0 1 0 0 0 1
#2 c 1 1 1 1 0 0
answered 59 mins ago
Psidom
116k1069114
116k1069114
4
newdf.columns.map('0[0]_0[1]'.format)
â Wen
58 mins ago
add a comment |Â
4
newdf.columns.map('0[0]_0[1]'.format)
â Wen
58 mins ago
4
4
newdf.columns.map('0[0]_0[1]'.format)
â Wen
58 mins ago
newdf.columns.map('0[0]_0[1]'.format)
â Wen
58 mins ago
add a comment |Â
up vote
3
down vote
In an effort to improve performance, I've used numba and numpy assignment
from numba import njit
@njit
def f(i, vals, n, m, k):
out = np.empty((n, k, m), vals.dtype)
out.fill(0)
c = np.zeros(n, np.int64)
for j in range(len(i)):
x = i[j]
out[x, :, c[x]] = vals[j]
c[x] += 1
return out.reshape(n, m * k)
d0 = df.drop('A', 1)
cols = [*d0]
i, r = pd.factorize(df.A)
n = len(r)
m = np.bincount(i).max()
k = len(cols)
vals = d0.values
pd.DataFrame(
f(i, vals, n, m, k),
pd.Index(r, name='A'),
[f"c_i" for c in cols for i in range(1, m + 1)]
).reset_index()
A B_1 B_2 B_3 C_1 C_2 C_3
0 a 1 0 0 1 0 1
1 b 0 1 0 0 0 1
2 c 1 1 1 1 0 0
add a comment |Â
up vote
3
down vote
In an effort to improve performance, I've used numba and numpy assignment
from numba import njit
@njit
def f(i, vals, n, m, k):
out = np.empty((n, k, m), vals.dtype)
out.fill(0)
c = np.zeros(n, np.int64)
for j in range(len(i)):
x = i[j]
out[x, :, c[x]] = vals[j]
c[x] += 1
return out.reshape(n, m * k)
d0 = df.drop('A', 1)
cols = [*d0]
i, r = pd.factorize(df.A)
n = len(r)
m = np.bincount(i).max()
k = len(cols)
vals = d0.values
pd.DataFrame(
f(i, vals, n, m, k),
pd.Index(r, name='A'),
[f"c_i" for c in cols for i in range(1, m + 1)]
).reset_index()
A B_1 B_2 B_3 C_1 C_2 C_3
0 a 1 0 0 1 0 1
1 b 0 1 0 0 0 1
2 c 1 1 1 1 0 0
add a comment |Â
up vote
3
down vote
up vote
3
down vote
In an effort to improve performance, I've used numba and numpy assignment
from numba import njit
@njit
def f(i, vals, n, m, k):
out = np.empty((n, k, m), vals.dtype)
out.fill(0)
c = np.zeros(n, np.int64)
for j in range(len(i)):
x = i[j]
out[x, :, c[x]] = vals[j]
c[x] += 1
return out.reshape(n, m * k)
d0 = df.drop('A', 1)
cols = [*d0]
i, r = pd.factorize(df.A)
n = len(r)
m = np.bincount(i).max()
k = len(cols)
vals = d0.values
pd.DataFrame(
f(i, vals, n, m, k),
pd.Index(r, name='A'),
[f"c_i" for c in cols for i in range(1, m + 1)]
).reset_index()
A B_1 B_2 B_3 C_1 C_2 C_3
0 a 1 0 0 1 0 1
1 b 0 1 0 0 0 1
2 c 1 1 1 1 0 0
In an effort to improve performance, I've used numba and numpy assignment
from numba import njit
@njit
def f(i, vals, n, m, k):
out = np.empty((n, k, m), vals.dtype)
out.fill(0)
c = np.zeros(n, np.int64)
for j in range(len(i)):
x = i[j]
out[x, :, c[x]] = vals[j]
c[x] += 1
return out.reshape(n, m * k)
d0 = df.drop('A', 1)
cols = [*d0]
i, r = pd.factorize(df.A)
n = len(r)
m = np.bincount(i).max()
k = len(cols)
vals = d0.values
pd.DataFrame(
f(i, vals, n, m, k),
pd.Index(r, name='A'),
[f"c_i" for c in cols for i in range(1, m + 1)]
).reset_index()
A B_1 B_2 B_3 C_1 C_2 C_3
0 a 1 0 0 1 0 1
1 b 0 1 0 0 0 1
2 c 1 1 1 1 0 0
answered 12 mins ago
piRSquared
144k19124256
144k19124256
add a comment |Â
add a comment |Â
up vote
2
down vote
Another approach using groupby
and ravel()
>>> df.groupby('A')[['B','C']].apply(lambda s: pd.Series(s.T.values.ravel(),
index=[f'x_i' for x in s.columns for i in range(1, len(s)+1)]))
B_1 B_2 B_3 C_1 C_2 C_3
A
a 1 0 0 1 0 1
b 0 1 0 0 0 1
c 1 1 1 1 0 0
add a comment |Â
up vote
2
down vote
Another approach using groupby
and ravel()
>>> df.groupby('A')[['B','C']].apply(lambda s: pd.Series(s.T.values.ravel(),
index=[f'x_i' for x in s.columns for i in range(1, len(s)+1)]))
B_1 B_2 B_3 C_1 C_2 C_3
A
a 1 0 0 1 0 1
b 0 1 0 0 0 1
c 1 1 1 1 0 0
add a comment |Â
up vote
2
down vote
up vote
2
down vote
Another approach using groupby
and ravel()
>>> df.groupby('A')[['B','C']].apply(lambda s: pd.Series(s.T.values.ravel(),
index=[f'x_i' for x in s.columns for i in range(1, len(s)+1)]))
B_1 B_2 B_3 C_1 C_2 C_3
A
a 1 0 0 1 0 1
b 0 1 0 0 0 1
c 1 1 1 1 0 0
Another approach using groupby
and ravel()
>>> df.groupby('A')[['B','C']].apply(lambda s: pd.Series(s.T.values.ravel(),
index=[f'x_i' for x in s.columns for i in range(1, len(s)+1)]))
B_1 B_2 B_3 C_1 C_2 C_3
A
a 1 0 0 1 0 1
b 0 1 0 0 0 1
c 1 1 1 1 0 0
edited 43 mins ago
answered 48 mins ago
RafaelC
24.1k72447
24.1k72447
add a comment |Â
add a comment |Â
up vote
2
down vote
Modify your index by using %
df.index=df.index%3+1
df.set_index('A',append=True,inplace=True)
newdf=df.unstack(level=0)
newdf.columns=newdf.columns.map('0[0]_0[1]'.format)
newdf
Out[291]:
B_1 B_2 B_3 C_1 C_2 C_3
A
a 1 0 0 1 0 1
b 0 1 0 0 0 1
c 1 1 1 1 0 0
add a comment |Â
up vote
2
down vote
Modify your index by using %
df.index=df.index%3+1
df.set_index('A',append=True,inplace=True)
newdf=df.unstack(level=0)
newdf.columns=newdf.columns.map('0[0]_0[1]'.format)
newdf
Out[291]:
B_1 B_2 B_3 C_1 C_2 C_3
A
a 1 0 0 1 0 1
b 0 1 0 0 0 1
c 1 1 1 1 0 0
add a comment |Â
up vote
2
down vote
up vote
2
down vote
Modify your index by using %
df.index=df.index%3+1
df.set_index('A',append=True,inplace=True)
newdf=df.unstack(level=0)
newdf.columns=newdf.columns.map('0[0]_0[1]'.format)
newdf
Out[291]:
B_1 B_2 B_3 C_1 C_2 C_3
A
a 1 0 0 1 0 1
b 0 1 0 0 0 1
c 1 1 1 1 0 0
Modify your index by using %
df.index=df.index%3+1
df.set_index('A',append=True,inplace=True)
newdf=df.unstack(level=0)
newdf.columns=newdf.columns.map('0[0]_0[1]'.format)
newdf
Out[291]:
B_1 B_2 B_3 C_1 C_2 C_3
A
a 1 0 0 1 0 1
b 0 1 0 0 0 1
c 1 1 1 1 0 0
answered 42 mins ago
Wen
85.9k72452
85.9k72452
add a comment |Â
add a comment |Â
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f52839666%2fbest-way-to-flatten-dataframe-based-on-values-on-column%23new-answer', 'question_page');
);
Post as a guest
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
upvote for a nicely constructed question. :)
â anky_91
1 hour ago