Create a new column only if values differ
My dataframe looks like this:
pd.DataFrame([["t1", "d2", "e3", "r4"],
              ["t1", "d2", "e2", "r4"],
              ["t1", "d2", "e1", "r4"]], columns=["a", "b", "c", "d"])
and I want:
pd.DataFrame([["t1", "d2", "e3", "r4", "e1", "e2"]],
             columns=["a", "b", "c", "d", "c1", "c2"])
i.e. only one column has values that differ, and I want to create a new dataframe with extra columns added as new values are observed. Is there an easy way to do this?
python pandas dataframe
asked 47 mins ago by FFL75
4 Answers
Edit: To generalize for any single non-unique column:
Ucols = df.columns[df.nunique() == 1].tolist()
df_out = df.set_index(Ucols).set_index(df.groupby(Ucols).cumcount(), append=True).unstack()
df_out.columns = [f'{i}{j}' if j != 0 else f'{i}' for i, j in df_out.columns]
print(df_out.reset_index())
Output:
    a   b   d   c  c1  c2
0  t1  d2  r4  e3  e2  e1
Original Answer
Use:
df_out = df.set_index(['a', 'b', 'd', df.groupby(['a', 'b', 'd']).cumcount()]).unstack()
df_out.columns = [f'{i}{j}' if j != 0 else f'{i}' for i, j in df_out.columns]
df_out.reset_index()
Output:
    a   b   d   c  c1  c2
0  t1  d2  r4  e3  e2  e1
edited 28 mins ago, answered 39 mins ago by Scott Boston
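For reference, here is the generalized version as a self-contained sketch on the question's sample frame: `Ucols` collects every column holding a single unique value, and `cumcount()` numbers the rows within each group so `unstack()` can pivot the differing column.

```python
import pandas as pd

df = pd.DataFrame([["t1", "d2", "e3", "r4"],
                   ["t1", "d2", "e2", "r4"],
                   ["t1", "d2", "e1", "r4"]], columns=["a", "b", "c", "d"])

# Columns whose values never differ become the (repeated) index;
# the cumulative count disambiguates the rows before unstacking.
Ucols = df.columns[df.nunique() == 1].tolist()          # ['a', 'b', 'd']
df_out = df.set_index(Ucols).set_index(df.groupby(Ucols).cumcount(),
                                       append=True).unstack()
# Flatten the ('c', 0), ('c', 1), ... MultiIndex into c, c1, c2.
df_out.columns = [f'{i}{j}' if j != 0 else f'{i}' for i, j in df_out.columns]
result = df_out.reset_index()
print(result)  # one row: a, b, d, c, c1, c2 -> t1, d2, r4, e3, e2, e1
```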
2
It looks to me that this doesn't generalize. I mean, you already know that column c has more values.
– user32185
31 mins ago
1
Updated solution to handle not knowing 'c' has more than one value. Thanks.
– Scott Boston
20 mins ago
You can use a dictionary comprehension. For consistency, I've included integer labeling on all columns.
res = pd.DataFrame({f'{col}{idx}': val for col in df for idx, val in
                    enumerate(df[col].unique(), 1)}, index=[0])
print(res)
   a1  b1  c1  c2  c3  d1
0  t1  d2  e3  e2  e1  r4
An alternative to df[col].unique() is df[col].drop_duplicates(), though the latter may incur an overhead from iterating a pd.Series object versus an np.ndarray.
edited 24 mins ago, answered 31 mins ago by jpp
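As a self-contained sketch on the sample frame, the comprehension builds one `{label: value}` pair per distinct value per column; `enumerate(..., 1)` starts labeling at 1, so every column gets at least a "1" suffix:

```python
import pandas as pd

df = pd.DataFrame([["t1", "d2", "e3", "r4"],
                   ["t1", "d2", "e2", "r4"],
                   ["t1", "d2", "e1", "r4"]], columns=["a", "b", "c", "d"])

# One key per distinct value per column: a1, b1, c1, c2, c3, d1.
res = pd.DataFrame({f'{col}{idx}': val
                    for col in df
                    for idx, val in enumerate(df[col].unique(), 1)},
                   index=[0])
print(res)  # t1, d2, e3, e2, e1, r4
```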
Here you can use unique instead of drop_duplicates()
– user32185
25 mins ago
1
@user32185, Good point, have included that alternative. I believe it should be more efficient as unique returns an array, i.e. no pd.Series overhead.
– jpp
23 mins ago
I'm curious to know if it's possible to use a comprehension with a for inside the else in order to obtain the OP's requested output. See my answer as reference.
– user32185
8 mins ago
@user32185, You can probably do something like d = {1: ''}; f'{col}{d.get(idx, idx)}', since you can use dict.get within an f-string. So if it's a 1 it'll just be an empty string, otherwise return the integer.
– jpp
2 mins ago
Not as beautiful as Scott's answer, but the logic you are looking for is:
out = pd.DataFrame()
for col in df.columns:
    values = df[col].unique()
    if len(values) == 1:
        out[col] = values
    else:
        for i, value in enumerate(values):
            out[col + str(i + 1)] = value
answered 36 mins ago by user32185
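Running the loop on the sample frame shows one detail worth noting: every value of a differing column gets a numbered label (c1, c2, c3), rather than keeping the first one under the original name c as in the unstack-based answer.

```python
import pandas as pd

df = pd.DataFrame([["t1", "d2", "e3", "r4"],
                   ["t1", "d2", "e2", "r4"],
                   ["t1", "d2", "e1", "r4"]], columns=["a", "b", "c", "d"])

out = pd.DataFrame()
for col in df.columns:
    values = df[col].unique()
    if len(values) == 1:
        out[col] = values                   # constant column: keep its name
    else:
        for i, value in enumerate(values):
            out[col + str(i + 1)] = value   # numbered column per distinct value

print(out)  # columns a, b, c1, c2, c3, d -> t1, d2, e3, e2, e1, r4
```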
Using drop_duplicates:
s = df.reset_index().melt('index').drop_duplicates(['variable', 'value'], keep='first')
pd.DataFrame([s.value.values.tolist()], columns=s['variable'] + s['index'].astype(str))
Out[1151]:
   a0  b0  c0  c1  c2  d0
0  t1  d2  e3  e2  e1  r4
answered 4 mins ago by Wen
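A runnable version of the same idea, as a sketch on the sample frame: `melt` stacks every column into (index, variable, value) rows, `drop_duplicates` keeps the first occurrence of each (variable, value) pair, and the surviving row index becomes the numeric suffix on each column name.

```python
import pandas as pd

df = pd.DataFrame([["t1", "d2", "e3", "r4"],
                   ["t1", "d2", "e2", "r4"],
                   ["t1", "d2", "e1", "r4"]], columns=["a", "b", "c", "d"])

# Long form: one row per (original row index, column, value),
# then drop repeated values within each column.
s = (df.reset_index()
       .melt('index')
       .drop_duplicates(['variable', 'value'], keep='first'))

# Rebuild a single row; the kept row index labels each column (a0, c1, ...).
flat = pd.DataFrame([s.value.values.tolist()],
                    columns=s['variable'] + s['index'].astype(str))
print(flat)  # t1, d2, e3, e2, e1, r4
```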