Best way to flatten dataframe based on values on column

up vote
7
down vote

favorite

I have to process a whole dataframe with some hundered thousands rows, but I can simplify it as below:

df = pd.DataFrame([
('a', 1, 1),
('a', 0, 0),
('a', 0, 1),
('b', 0, 0),
('b', 1, 0),
('b', 0, 1),
('c', 1, 1),
('c', 1, 0),
('c', 1, 0)
], columns=['A', 'B', 'C'])

print (df)

 A B C
0 a 1 1
1 a 0 0
2 a 0 1
3 b 0 0
4 b 1 0
5 b 0 1
6 c 1 1
7 c 1 0
8 c 1 0

My goal it to flatten the columns "B" and "C" based on the label they have in the "A" column

 A B_1 B_2 B_3 C_1 C_2 C_3
0 a 1 0 0 1 0 1
3 b 0 1 0 0 0 1
6 c 1 1 1 1 0 0

The code I wrote gives the result I want, but it is pretty slow as it uses a simple for loop on the unique labels.
The solution I see is to write some vectorized function that optimize my code. Anyone has some idea?
Below I append the code.

added_col = ['B_1', 'B_2', 'B_3', 'C_1', 'C_2', 'C_3']

new_df = df.drop(['B', 'C'], axis=1).copy()
new_df = new_df.iloc[[x for x in range(0, len(df), 3)], :]
new_df = pd.concat([new_df,pd.DataFrame(columns=added_col)], sort=False)

for e, elem in new_df['A'].iteritems():
 new_df.loc[e, added_col] = df[df['A'] == elem].loc[:,['B','C']].T.values.flatten()

asked 1 hour ago

el_Rinaldo

466213

upvote for a nicely constructed question. :)
â€“Â anky_91
1 hour ago

add a commentÂ |Â

up vote
7
down vote

favorite

I have to process a whole dataframe with some hundered thousands rows, but I can simplify it as below:

df = pd.DataFrame([
('a', 1, 1),
('a', 0, 0),
('a', 0, 1),
('b', 0, 0),
('b', 1, 0),
('b', 0, 1),
('c', 1, 1),
('c', 1, 0),
('c', 1, 0)
], columns=['A', 'B', 'C'])

print (df)

 A B C
0 a 1 1
1 a 0 0
2 a 0 1
3 b 0 0
4 b 1 0
5 b 0 1
6 c 1 1
7 c 1 0
8 c 1 0

My goal it to flatten the columns "B" and "C" based on the label they have in the "A" column

 A B_1 B_2 B_3 C_1 C_2 C_3
0 a 1 0 0 1 0 1
3 b 0 1 0 0 0 1
6 c 1 1 1 1 0 0

added_col = ['B_1', 'B_2', 'B_3', 'C_1', 'C_2', 'C_3']

new_df = df.drop(['B', 'C'], axis=1).copy()
new_df = new_df.iloc[[x for x in range(0, len(df), 3)], :]
new_df = pd.concat([new_df,pd.DataFrame(columns=added_col)], sort=False)

for e, elem in new_df['A'].iteritems():
 new_df.loc[e, added_col] = df[df['A'] == elem].loc[:,['B','C']].T.values.flatten()

asked 1 hour ago

el_Rinaldo

466213

upvote for a nicely constructed question. :)
â€“Â anky_91
1 hour ago

add a commentÂ |Â

up vote
7
down vote

favorite

I have to process a whole dataframe with some hundered thousands rows, but I can simplify it as below:

df = pd.DataFrame([
('a', 1, 1),
('a', 0, 0),
('a', 0, 1),
('b', 0, 0),
('b', 1, 0),
('b', 0, 1),
('c', 1, 1),
('c', 1, 0),
('c', 1, 0)
], columns=['A', 'B', 'C'])

print (df)

 A B C
0 a 1 1
1 a 0 0
2 a 0 1
3 b 0 0
4 b 1 0
5 b 0 1
6 c 1 1
7 c 1 0
8 c 1 0

My goal it to flatten the columns "B" and "C" based on the label they have in the "A" column

 A B_1 B_2 B_3 C_1 C_2 C_3
0 a 1 0 0 1 0 1
3 b 0 1 0 0 0 1
6 c 1 1 1 1 0 0

added_col = ['B_1', 'B_2', 'B_3', 'C_1', 'C_2', 'C_3']

new_df = df.drop(['B', 'C'], axis=1).copy()
new_df = new_df.iloc[[x for x in range(0, len(df), 3)], :]
new_df = pd.concat([new_df,pd.DataFrame(columns=added_col)], sort=False)

for e, elem in new_df['A'].iteritems():
 new_df.loc[e, added_col] = df[df['A'] == elem].loc[:,['B','C']].T.values.flatten()

asked 1 hour ago

el_Rinaldo

466213

I have to process a whole dataframe with some hundered thousands rows, but I can simplify it as below:

df = pd.DataFrame([
('a', 1, 1),
('a', 0, 0),
('a', 0, 1),
('b', 0, 0),
('b', 1, 0),
('b', 0, 1),
('c', 1, 1),
('c', 1, 0),
('c', 1, 0)
], columns=['A', 'B', 'C'])

print (df)

 A B C
0 a 1 1
1 a 0 0
2 a 0 1
3 b 0 0
4 b 1 0
5 b 0 1
6 c 1 1
7 c 1 0
8 c 1 0

My goal it to flatten the columns "B" and "C" based on the label they have in the "A" column

 A B_1 B_2 B_3 C_1 C_2 C_3
0 a 1 0 0 1 0 1
3 b 0 1 0 0 0 1
6 c 1 1 1 1 0 0

added_col = ['B_1', 'B_2', 'B_3', 'C_1', 'C_2', 'C_3']

new_df = df.drop(['B', 'C'], axis=1).copy()
new_df = new_df.iloc[[x for x in range(0, len(df), 3)], :]
new_df = pd.concat([new_df,pd.DataFrame(columns=added_col)], sort=False)

for e, elem in new_df['A'].iteritems():
 new_df.loc[e, added_col] = df[df['A'] == elem].loc[:,['B','C']].T.values.flatten()

pandas dataframe vectorization

asked 1 hour ago

el_Rinaldo

466213

asked 1 hour ago

el_Rinaldo

466213

asked 1 hour ago

el_Rinaldo

466213

asked 1 hour ago

el_Rinaldo

466213

asked 1 hour ago

el_Rinaldo

466213

upvote for a nicely constructed question. :)
â€“Â anky_91
1 hour ago

add a commentÂ |Â

upvote for a nicely constructed question. :)
â€“Â anky_91
1 hour ago

upvote for a nicely constructed question. :)
â€“Â anky_91
1 hour ago

add a commentÂ |Â

4 Answers
4

active

oldest

votes

up vote
8
down vote

Here is one way:

# create a row number by group
df['rn'] = df.groupby('A').cumcount() + 1

# pivot the table
new_df = df.set_index(['A', 'rn']).unstack()

# rename columns
new_df.columns = [x + '_' + str(y) for (x, y) in new_df.columns]

new_df.reset_index()
# A B_1 B_2 B_3 C_1 C_2 C_3
#0 a 1 0 0 1 0 1
#1 b 0 1 0 0 0 1
#2 c 1 1 1 1 0 0

answered 59 mins ago

Psidom

116k1069114

4

newdf.columns.map('0[0]_0[1]'.format)
â€“Â Wen
58 mins ago

add a commentÂ |Â

up vote
3
down vote

In an effort to improve performance, I've used numba and numpy assignment

from numba import njit

@njit
def f(i, vals, n, m, k):

 out = np.empty((n, k, m), vals.dtype)
 out.fill(0)

 c = np.zeros(n, np.int64)

 for j in range(len(i)):
 x = i[j]
 out[x, :, c[x]] = vals[j]
 c[x] += 1

 return out.reshape(n, m * k)


d0 = df.drop('A', 1)
cols = [*d0]

i, r = pd.factorize(df.A)

n = len(r)
m = np.bincount(i).max()
k = len(cols)

vals = d0.values

pd.DataFrame(
 f(i, vals, n, m, k),
 pd.Index(r, name='A'),
 [f"c_i" for c in cols for i in range(1, m + 1)]
).reset_index()

 A B_1 B_2 B_3 C_1 C_2 C_3
0 a 1 0 0 1 0 1
1 b 0 1 0 0 0 1
2 c 1 1 1 1 0 0

answered 12 mins ago

piRSquared

144k19124256

add a commentÂ |Â

up vote
2
down vote

Another approach using groupby and ravel()

>>> df.groupby('A')[['B','C']].apply(lambda s: pd.Series(s.T.values.ravel(), 
 index=[f'x_i' for x in s.columns for i in range(1, len(s)+1)]))

 B_1 B_2 B_3 C_1 C_2 C_3
A 
a 1 0 0 1 0 1
b 0 1 0 0 0 1
c 1 1 1 1 0 0

edited 43 mins ago

answered 48 mins ago

RafaelC

24.1k72447

add a commentÂ |Â

up vote
2
down vote

Modify your index by using %

df.index=df.index%3+1
df.set_index('A',append=True,inplace=True)
newdf=df.unstack(level=0)
newdf.columns=newdf.columns.map('0[0]_0[1]'.format)
newdf
Out[291]: 
 B_1 B_2 B_3 C_1 C_2 C_3
A 
a 1 0 0 1 0 1
b 0 1 0 0 0 1
c 1 1 1 1 0 0

answered 42 mins ago

Wen

85.9k72452

add a commentÂ |Â

Your Answer

StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");

StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "1"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);

else
createEditor();

);

function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
convertImagesToLinks: true,
noModals: false,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);

);

draft saved

draft discarded

StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f52839666%2fbest-way-to-flatten-dataframe-based-on-values-on-column%23new-answer', 'question_page');

);

Post as a guest

Name

4 Answers
4

active

oldest

votes

4 Answers
4

active

oldest

votes

up vote
8
down vote

Here is one way:

# create a row number by group
df['rn'] = df.groupby('A').cumcount() + 1

# pivot the table
new_df = df.set_index(['A', 'rn']).unstack()

# rename columns
new_df.columns = [x + '_' + str(y) for (x, y) in new_df.columns]

new_df.reset_index()
# A B_1 B_2 B_3 C_1 C_2 C_3
#0 a 1 0 0 1 0 1
#1 b 0 1 0 0 0 1
#2 c 1 1 1 1 0 0

answered 59 mins ago

Psidom

116k1069114

4

newdf.columns.map('0[0]_0[1]'.format)
â€“Â Wen
58 mins ago

add a commentÂ |Â

up vote
8
down vote

Here is one way:

# create a row number by group
df['rn'] = df.groupby('A').cumcount() + 1

# pivot the table
new_df = df.set_index(['A', 'rn']).unstack()

# rename columns
new_df.columns = [x + '_' + str(y) for (x, y) in new_df.columns]

new_df.reset_index()
# A B_1 B_2 B_3 C_1 C_2 C_3
#0 a 1 0 0 1 0 1
#1 b 0 1 0 0 0 1
#2 c 1 1 1 1 0 0

answered 59 mins ago

Psidom

116k1069114

4

newdf.columns.map('0[0]_0[1]'.format)
â€“Â Wen
58 mins ago

add a commentÂ |Â

up vote
8
down vote

Here is one way:

# create a row number by group
df['rn'] = df.groupby('A').cumcount() + 1

# pivot the table
new_df = df.set_index(['A', 'rn']).unstack()

# rename columns
new_df.columns = [x + '_' + str(y) for (x, y) in new_df.columns]

new_df.reset_index()
# A B_1 B_2 B_3 C_1 C_2 C_3
#0 a 1 0 0 1 0 1
#1 b 0 1 0 0 0 1
#2 c 1 1 1 1 0 0

answered 59 mins ago

Psidom

116k1069114

Here is one way:

# create a row number by group
df['rn'] = df.groupby('A').cumcount() + 1

# pivot the table
new_df = df.set_index(['A', 'rn']).unstack()

# rename columns
new_df.columns = [x + '_' + str(y) for (x, y) in new_df.columns]

new_df.reset_index()
# A B_1 B_2 B_3 C_1 C_2 C_3
#0 a 1 0 0 1 0 1
#1 b 0 1 0 0 0 1
#2 c 1 1 1 1 0 0

answered 59 mins ago

Psidom

116k1069114

answered 59 mins ago

Psidom

116k1069114

answered 59 mins ago

Psidom

116k1069114

answered 59 mins ago

Psidom

116k1069114

4

newdf.columns.map('0[0]_0[1]'.format)
â€“Â Wen
58 mins ago

add a commentÂ |Â

4

newdf.columns.map('0[0]_0[1]'.format)
â€“Â Wen
58 mins ago

newdf.columns.map('0[0]_0[1]'.format)
â€“Â Wen
58 mins ago

add a commentÂ |Â

up vote
3
down vote

In an effort to improve performance, I've used numba and numpy assignment

from numba import njit

@njit
def f(i, vals, n, m, k):

 out = np.empty((n, k, m), vals.dtype)
 out.fill(0)

 c = np.zeros(n, np.int64)

 for j in range(len(i)):
 x = i[j]
 out[x, :, c[x]] = vals[j]
 c[x] += 1

 return out.reshape(n, m * k)


d0 = df.drop('A', 1)
cols = [*d0]

i, r = pd.factorize(df.A)

n = len(r)
m = np.bincount(i).max()
k = len(cols)

vals = d0.values

pd.DataFrame(
 f(i, vals, n, m, k),
 pd.Index(r, name='A'),
 [f"c_i" for c in cols for i in range(1, m + 1)]
).reset_index()

 A B_1 B_2 B_3 C_1 C_2 C_3
0 a 1 0 0 1 0 1
1 b 0 1 0 0 0 1
2 c 1 1 1 1 0 0

answered 12 mins ago

piRSquared

144k19124256

add a commentÂ |Â

up vote
3
down vote

In an effort to improve performance, I've used numba and numpy assignment

from numba import njit

@njit
def f(i, vals, n, m, k):

 out = np.empty((n, k, m), vals.dtype)
 out.fill(0)

 c = np.zeros(n, np.int64)

 for j in range(len(i)):
 x = i[j]
 out[x, :, c[x]] = vals[j]
 c[x] += 1

 return out.reshape(n, m * k)


d0 = df.drop('A', 1)
cols = [*d0]

i, r = pd.factorize(df.A)

n = len(r)
m = np.bincount(i).max()
k = len(cols)

vals = d0.values

pd.DataFrame(
 f(i, vals, n, m, k),
 pd.Index(r, name='A'),
 [f"c_i" for c in cols for i in range(1, m + 1)]
).reset_index()

 A B_1 B_2 B_3 C_1 C_2 C_3
0 a 1 0 0 1 0 1
1 b 0 1 0 0 0 1
2 c 1 1 1 1 0 0

answered 12 mins ago

piRSquared

144k19124256

add a commentÂ |Â

up vote
3
down vote

In an effort to improve performance, I've used numba and numpy assignment

from numba import njit

@njit
def f(i, vals, n, m, k):

 out = np.empty((n, k, m), vals.dtype)
 out.fill(0)

 c = np.zeros(n, np.int64)

 for j in range(len(i)):
 x = i[j]
 out[x, :, c[x]] = vals[j]
 c[x] += 1

 return out.reshape(n, m * k)


d0 = df.drop('A', 1)
cols = [*d0]

i, r = pd.factorize(df.A)

n = len(r)
m = np.bincount(i).max()
k = len(cols)

vals = d0.values

pd.DataFrame(
 f(i, vals, n, m, k),
 pd.Index(r, name='A'),
 [f"c_i" for c in cols for i in range(1, m + 1)]
).reset_index()

 A B_1 B_2 B_3 C_1 C_2 C_3
0 a 1 0 0 1 0 1
1 b 0 1 0 0 0 1
2 c 1 1 1 1 0 0

answered 12 mins ago

piRSquared

144k19124256

In an effort to improve performance, I've used numba and numpy assignment

from numba import njit

@njit
def f(i, vals, n, m, k):

 out = np.empty((n, k, m), vals.dtype)
 out.fill(0)

 c = np.zeros(n, np.int64)

 for j in range(len(i)):
 x = i[j]
 out[x, :, c[x]] = vals[j]
 c[x] += 1

 return out.reshape(n, m * k)


d0 = df.drop('A', 1)
cols = [*d0]

i, r = pd.factorize(df.A)

n = len(r)
m = np.bincount(i).max()
k = len(cols)

vals = d0.values

pd.DataFrame(
 f(i, vals, n, m, k),
 pd.Index(r, name='A'),
 [f"c_i" for c in cols for i in range(1, m + 1)]
).reset_index()

 A B_1 B_2 B_3 C_1 C_2 C_3
0 a 1 0 0 1 0 1
1 b 0 1 0 0 0 1
2 c 1 1 1 1 0 0

answered 12 mins ago

piRSquared

144k19124256

answered 12 mins ago

piRSquared

144k19124256

answered 12 mins ago

piRSquared

144k19124256

answered 12 mins ago

piRSquared

144k19124256

add a commentÂ |Â

up vote
2
down vote

Another approach using groupby and ravel()

>>> df.groupby('A')[['B','C']].apply(lambda s: pd.Series(s.T.values.ravel(), 
 index=[f'x_i' for x in s.columns for i in range(1, len(s)+1)]))

 B_1 B_2 B_3 C_1 C_2 C_3
A 
a 1 0 0 1 0 1
b 0 1 0 0 0 1
c 1 1 1 1 0 0

edited 43 mins ago

answered 48 mins ago

RafaelC

24.1k72447

add a commentÂ |Â

up vote
2
down vote

Another approach using groupby and ravel()

>>> df.groupby('A')[['B','C']].apply(lambda s: pd.Series(s.T.values.ravel(), 
 index=[f'x_i' for x in s.columns for i in range(1, len(s)+1)]))

 B_1 B_2 B_3 C_1 C_2 C_3
A 
a 1 0 0 1 0 1
b 0 1 0 0 0 1
c 1 1 1 1 0 0

edited 43 mins ago

answered 48 mins ago

RafaelC

24.1k72447

add a commentÂ |Â

up vote
2
down vote

Another approach using groupby and ravel()

>>> df.groupby('A')[['B','C']].apply(lambda s: pd.Series(s.T.values.ravel(), 
 index=[f'x_i' for x in s.columns for i in range(1, len(s)+1)]))

 B_1 B_2 B_3 C_1 C_2 C_3
A 
a 1 0 0 1 0 1
b 0 1 0 0 0 1
c 1 1 1 1 0 0

edited 43 mins ago

answered 48 mins ago

RafaelC

24.1k72447

Another approach using groupby and ravel()

>>> df.groupby('A')[['B','C']].apply(lambda s: pd.Series(s.T.values.ravel(), 
 index=[f'x_i' for x in s.columns for i in range(1, len(s)+1)]))

 B_1 B_2 B_3 C_1 C_2 C_3
A 
a 1 0 0 1 0 1
b 0 1 0 0 0 1
c 1 1 1 1 0 0

edited 43 mins ago

answered 48 mins ago

RafaelC

24.1k72447

edited 43 mins ago

answered 48 mins ago

RafaelC

24.1k72447

answered 48 mins ago

RafaelC

24.1k72447

answered 48 mins ago

RafaelC

24.1k72447

add a commentÂ |Â

up vote
2
down vote

Modify your index by using %

df.index=df.index%3+1
df.set_index('A',append=True,inplace=True)
newdf=df.unstack(level=0)
newdf.columns=newdf.columns.map('0[0]_0[1]'.format)
newdf
Out[291]: 
 B_1 B_2 B_3 C_1 C_2 C_3
A 
a 1 0 0 1 0 1
b 0 1 0 0 0 1
c 1 1 1 1 0 0

answered 42 mins ago

Wen

85.9k72452

add a commentÂ |Â

up vote
2
down vote

Modify your index by using %

df.index=df.index%3+1
df.set_index('A',append=True,inplace=True)
newdf=df.unstack(level=0)
newdf.columns=newdf.columns.map('0[0]_0[1]'.format)
newdf
Out[291]: 
 B_1 B_2 B_3 C_1 C_2 C_3
A 
a 1 0 0 1 0 1
b 0 1 0 0 0 1
c 1 1 1 1 0 0

answered 42 mins ago

Wen

85.9k72452

add a commentÂ |Â

up vote
2
down vote

Modify your index by using %

df.index=df.index%3+1
df.set_index('A',append=True,inplace=True)
newdf=df.unstack(level=0)
newdf.columns=newdf.columns.map('0[0]_0[1]'.format)
newdf
Out[291]: 
 B_1 B_2 B_3 C_1 C_2 C_3
A 
a 1 0 0 1 0 1
b 0 1 0 0 0 1
c 1 1 1 1 0 0

answered 42 mins ago

Wen

85.9k72452

Modify your index by using %

df.index=df.index%3+1
df.set_index('A',append=True,inplace=True)
newdf=df.unstack(level=0)
newdf.columns=newdf.columns.map('0[0]_0[1]'.format)
newdf
Out[291]: 
 B_1 B_2 B_3 C_1 C_2 C_3
A 
a 1 0 0 1 0 1
b 0 1 0 0 0 1
c 1 1 1 1 0 0

answered 42 mins ago

Wen

85.9k72452

answered 42 mins ago

Wen

85.9k72452

answered 42 mins ago

Wen

85.9k72452

answered 42 mins ago

Wen

85.9k72452

add a commentÂ |Â

draft saved

draft discarded

draft saved

draft discarded

Post as a guest

Name

Search This Blog

Iyfjky