propagate conditional column value in pandas
Clash Royale CLAN TAG#URR8PPP
up vote
6
down vote
favorite
I want to create an indicator variable that will propagate to all rows with the same customer-period value pair as the indicator. Specifically, if baz
is yes
, I want all rows of that same customer and period email to show my indicator.
df
Customer Period Question Score
A 1 foo 2
A 1 bar 3
A 1 baz yes
A 1 biz 1
B 1 bar 2
B 1 baz no
B 1 qux 3
A 2 foo 5
A 2 baz yes
B 2 baz yes
B 2 biz 2
I've tried
df['Indicator'] = np.where(
(df.Question.str.contains('baz') & (df.Score == 'yes')),
1, 0)
which returns
Customer Period Question Score Indicator
A 1 foo 2 0
A 1 bar 3 0
A 1 baz yes 1
A 1 biz 1 0
B 1 bar 2 0
B 1 baz no 0
B 1 qux 3 0
A 2 foo 5 0
A 2 baz yes 1
B 2 baz yes 1
B 2 biz 2 0
But this is the desired output:
Customer Period Question Score Indicator
A 1 foo 2 1
A 1 bar 3 1
A 1 baz yes 1
A 1 biz 1 1
B 1 bar 2 0
B 1 baz no 0
B 1 qux 3 0
A 2 foo 5 1
A 2 baz yes 1
B 2 baz yes 1
B 2 biz 2 1
I'm not sure how to go about getting what I want. Maybe groupby with ffill and another with bfill?
python pandas
add a comment |Â
up vote
6
down vote
favorite
I want to create an indicator variable that will propagate to all rows with the same customer-period value pair as the indicator. Specifically, if baz
is yes
, I want all rows of that same customer and period email to show my indicator.
df
Customer Period Question Score
A 1 foo 2
A 1 bar 3
A 1 baz yes
A 1 biz 1
B 1 bar 2
B 1 baz no
B 1 qux 3
A 2 foo 5
A 2 baz yes
B 2 baz yes
B 2 biz 2
I've tried
df['Indicator'] = np.where(
(df.Question.str.contains('baz') & (df.Score == 'yes')),
1, 0)
which returns
Customer Period Question Score Indicator
A 1 foo 2 0
A 1 bar 3 0
A 1 baz yes 1
A 1 biz 1 0
B 1 bar 2 0
B 1 baz no 0
B 1 qux 3 0
A 2 foo 5 0
A 2 baz yes 1
B 2 baz yes 1
B 2 biz 2 0
But this is the desired output:
Customer Period Question Score Indicator
A 1 foo 2 1
A 1 bar 3 1
A 1 baz yes 1
A 1 biz 1 1
B 1 bar 2 0
B 1 baz no 0
B 1 qux 3 0
A 2 foo 5 1
A 2 baz yes 1
B 2 baz yes 1
B 2 biz 2 1
I'm not sure how to go about getting what I want. Maybe groupby with ffill and another with bfill?
python pandas
add a comment |Â
up vote
6
down vote
favorite
up vote
6
down vote
favorite
I want to create an indicator variable that will propagate to all rows with the same customer-period value pair as the indicator. Specifically, if baz
is yes
, I want all rows of that same customer and period email to show my indicator.
df
Customer Period Question Score
A 1 foo 2
A 1 bar 3
A 1 baz yes
A 1 biz 1
B 1 bar 2
B 1 baz no
B 1 qux 3
A 2 foo 5
A 2 baz yes
B 2 baz yes
B 2 biz 2
I've tried
df['Indicator'] = np.where(
(df.Question.str.contains('baz') & (df.Score == 'yes')),
1, 0)
which returns
Customer Period Question Score Indicator
A 1 foo 2 0
A 1 bar 3 0
A 1 baz yes 1
A 1 biz 1 0
B 1 bar 2 0
B 1 baz no 0
B 1 qux 3 0
A 2 foo 5 0
A 2 baz yes 1
B 2 baz yes 1
B 2 biz 2 0
But this is the desired output:
Customer Period Question Score Indicator
A 1 foo 2 1
A 1 bar 3 1
A 1 baz yes 1
A 1 biz 1 1
B 1 bar 2 0
B 1 baz no 0
B 1 qux 3 0
A 2 foo 5 1
A 2 baz yes 1
B 2 baz yes 1
B 2 biz 2 1
I'm not sure how to go about getting what I want. Maybe groupby with ffill and another with bfill?
python pandas
I want to create an indicator variable that will propagate to all rows with the same customer-period value pair as the indicator. Specifically, if baz
is yes
, I want all rows of that same customer and period email to show my indicator.
df
Customer Period Question Score
A 1 foo 2
A 1 bar 3
A 1 baz yes
A 1 biz 1
B 1 bar 2
B 1 baz no
B 1 qux 3
A 2 foo 5
A 2 baz yes
B 2 baz yes
B 2 biz 2
I've tried
df['Indicator'] = np.where(
(df.Question.str.contains('baz') & (df.Score == 'yes')),
1, 0)
which returns
Customer Period Question Score Indicator
A 1 foo 2 0
A 1 bar 3 0
A 1 baz yes 1
A 1 biz 1 0
B 1 bar 2 0
B 1 baz no 0
B 1 qux 3 0
A 2 foo 5 0
A 2 baz yes 1
B 2 baz yes 1
B 2 biz 2 0
But this is the desired output:
Customer Period Question Score Indicator
A 1 foo 2 1
A 1 bar 3 1
A 1 baz yes 1
A 1 biz 1 1
B 1 bar 2 0
B 1 baz no 0
B 1 qux 3 0
A 2 foo 5 1
A 2 baz yes 1
B 2 baz yes 1
B 2 biz 2 1
I'm not sure how to go about getting what I want. Maybe groupby with ffill and another with bfill?
python pandas
asked Aug 21 at 14:28


Andrew
466216
466216
add a comment |Â
add a comment |Â
3 Answers
3
active
oldest
votes
up vote
5
down vote
accepted
You can use
In [954]: df['Indicator'] = (df.assign(eq=df.Question.eq('baz') & df.Score.eq('yes'))
.groupby(['Customer', 'Period'])['eq']
.transform('any').astype(int))
In [955]: df
Out[955]:
Customer Period Question Score Indicator
0 A 1 foo 2 1
1 A 1 bar 3 1
2 A 1 baz yes 1
3 A 1 biz 1 1
4 B 1 bar 2 0
5 B 1 baz no 0
6 B 1 qux 3 0
7 A 2 foo 5 1
8 A 2 baz yes 1
9 B 2 baz yes 1
10 B 2 biz 2 1
Details
In [956]: df.Question.eq('baz') & df.Score.eq('yes')
Out[956]:
0 False
1 False
2 True
3 False
4 False
5 False
6 False
7 False
8 True
9 True
10 False
dtype: bool
In [957]: df.assign(eq=df.Question.eq('baz') & df.Score.eq('yes'))
Out[957]:
Customer Period Question Score Indicator eq
0 A 1 foo 2 1 False
1 A 1 bar 3 1 False
2 A 1 baz yes 1 True
3 A 1 biz 1 1 False
4 B 1 bar 2 0 False
5 B 1 baz no 0 False
6 B 1 qux 3 0 False
7 A 2 foo 5 1 False
8 A 2 baz yes 1 True
9 B 2 baz yes 1 True
10 B 2 biz 2 1 False
perfect! It took me a second to make it work with my real set, but this is a great solution.
– Andrew
Aug 21 at 14:47
add a comment |Â
up vote
4
down vote
Here's one way. The idea is to use a Boolean mask with MultiIndex
. Then use pd.Series.isin
to compare against your filtered indices.
mask = (df['Question'] == 'baz') & (df['Score'] == 'yes')
idx_cols = ['Customer', 'Period']
idx = df.set_index(idx_cols).loc[mask.values].index
df['Indicator'] = pd.Series(df.set_index(idx_cols).index.values).isin(idx).astype(int)
print(df)
Customer Period Question Score Indicator
0 A 1 foo 2 1
1 A 1 bar 3 1
2 A 1 baz yes 1
3 A 1 biz 1 1
4 B 1 bar 2 0
5 B 1 baz no 0
6 B 1 qux 3 0
7 A 2 foo 5 1
8 A 2 baz yes 1
9 B 2 baz yes 1
10 B 2 biz 2 1
add a comment |Â
up vote
4
down vote
You can factorize the tuples of Customer
and Period
. Then use np.logical_or.at
to get group-wise any
i, r = pd.factorize([*zip(df.Customer, df.Period)])
a = np.zeros(len(r), dtype=np.bool8)
np.logical_or.at(a, i, df.eval('Question == "baz" and Score == "yes"'))
df.assign(Indicator=a[i].astype(np.int64))
Customer Period Question Score Indicator
0 A 1 foo 2 1
1 A 1 bar 3 1
2 A 1 baz yes 1
3 A 1 biz 1 1
4 B 1 bar 2 0
5 B 1 baz no 0
6 B 1 qux 3 0
7 A 2 foo 5 1
8 A 2 baz yes 1
9 B 2 baz yes 1
10 B 2 biz 2 1
Explanation
i, r = pd.factorize([*zip(df.Customer, df.Period)])
produces unique (Customer, Period)
pairs in r
where i
is an array keeping track of which element of r
went where in order to produce the original list of tuples
Original list of tuples
[*zip(df.Customer, df.Period)]
[('A', 1),
('A', 1),
('A', 1),
('A', 1),
('B', 1),
('B', 1),
('B', 1),
('A', 2),
('A', 2),
('B', 2),
('B', 2)]After factorizing, unique tuples
r
r
array([('A', 1), ('B', 1), ('A', 2), ('B', 2)], dtype=object)And the positions
i
i
array([0, 0, 0, 0, 1, 1, 1, 2, 2, 3, 3])
I can now use i
as indices for evaluating grouped any
in Numpy using Numpy's at
method on ufuncs
. Basically, this allows me to create an array upfront whose values may change based on my at
operation. Then specify an array of indices (that's what i
will be) and an array matching the size of i
that is the second part of my operation at that index.
I end up using as my matching array
df.eval('Question == "baz" and Score == "yes"')
0 False
1 False
2 True
3 False
4 False
5 False
6 False
7 False
8 True
9 True
10 False
dtype: bool
Let me show this in painstaking detail
Flag GroupIndex Group State of a
0 False 0 (A, 1) [0, 0, 0, 0] # Flag is False, So do Nothing
1 False 0 (A, 1) [0, 0, 0, 0] # Flag is False, So do Nothing
2 True 0 (A, 1) [1, 0, 0, 0] # Flag is True, or_eq for Index 0
3 False 0 (A, 1) [1, 0, 0, 0] # Flag is False, So do Nothing
4 False 1 (B, 1) [1, 0, 0, 0] # Flag is False, So do Nothing
5 False 1 (B, 1) [1, 0, 0, 0] # Flag is False, So do Nothing
6 False 1 (B, 1) [1, 0, 0, 0] # Flag is False, So do Nothing
7 False 2 (A, 2) [1, 0, 0, 0] # Flag is False, So do Nothing
8 True 2 (A, 2) [1, 0, 1, 0] # Flag is True, or_eq for Index 2
9 True 3 (B, 2) [1, 0, 1, 1] # Flag is True, or_eq for Index 3
10 False 3 (B, 2) [1, 0, 1, 1] # Flag is False, So do Nothing
The final State
is [1, 0, 1, 1]
or in boolean terms [True, False, True, True]
. And that represents the or
accumulation within each unique group that is housed in a
a
array([ True, False, True, True])
If I slice this with the index positions in i
and cast as integers, I get
a[i].astype(np.int64)
array([1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1])
Which is precisely what we were looking for.
Finally, I use assign
to produce a copy of the dataframe with its new column.
df.assign(Indicator=a[i].astype(np.int64))
Customer Period Question Score Indicator
0 A 1 foo 2 1
1 A 1 bar 3 1
2 A 1 baz yes 1
3 A 1 biz 1 1
4 B 1 bar 2 0
5 B 1 baz no 0
6 B 1 qux 3 0
7 A 2 foo 5 1
8 A 2 baz yes 1
9 B 2 baz yes 1
10 B 2 biz 2 1
Why Do it This Way?!
Numpy is often faster.
Below is a slightly more optimized approach. (basically the same)
i, r = pd.factorize([*zip(df.Customer, df.Period)])
a = np.zeros(len(r), dtype=np.bool8)
q = df.Question.values == 'baz'
s = df.Score.values == 'yes'
m = q & s
np.logical_or.at(a, i, m)
df.assign(Indicator=a[i].astype(np.int64))
add a comment |Â
3 Answers
3
active
oldest
votes
3 Answers
3
active
oldest
votes
active
oldest
votes
active
oldest
votes
up vote
5
down vote
accepted
You can use
In [954]: df['Indicator'] = (df.assign(eq=df.Question.eq('baz') & df.Score.eq('yes'))
.groupby(['Customer', 'Period'])['eq']
.transform('any').astype(int))
In [955]: df
Out[955]:
Customer Period Question Score Indicator
0 A 1 foo 2 1
1 A 1 bar 3 1
2 A 1 baz yes 1
3 A 1 biz 1 1
4 B 1 bar 2 0
5 B 1 baz no 0
6 B 1 qux 3 0
7 A 2 foo 5 1
8 A 2 baz yes 1
9 B 2 baz yes 1
10 B 2 biz 2 1
Details
In [956]: df.Question.eq('baz') & df.Score.eq('yes')
Out[956]:
0 False
1 False
2 True
3 False
4 False
5 False
6 False
7 False
8 True
9 True
10 False
dtype: bool
In [957]: df.assign(eq=df.Question.eq('baz') & df.Score.eq('yes'))
Out[957]:
Customer Period Question Score Indicator eq
0 A 1 foo 2 1 False
1 A 1 bar 3 1 False
2 A 1 baz yes 1 True
3 A 1 biz 1 1 False
4 B 1 bar 2 0 False
5 B 1 baz no 0 False
6 B 1 qux 3 0 False
7 A 2 foo 5 1 False
8 A 2 baz yes 1 True
9 B 2 baz yes 1 True
10 B 2 biz 2 1 False
perfect! It took me a second to make it work with my real set, but this is a great solution.
– Andrew
Aug 21 at 14:47
add a comment |Â
up vote
5
down vote
accepted
You can use
In [954]: df['Indicator'] = (df.assign(eq=df.Question.eq('baz') & df.Score.eq('yes'))
.groupby(['Customer', 'Period'])['eq']
.transform('any').astype(int))
In [955]: df
Out[955]:
Customer Period Question Score Indicator
0 A 1 foo 2 1
1 A 1 bar 3 1
2 A 1 baz yes 1
3 A 1 biz 1 1
4 B 1 bar 2 0
5 B 1 baz no 0
6 B 1 qux 3 0
7 A 2 foo 5 1
8 A 2 baz yes 1
9 B 2 baz yes 1
10 B 2 biz 2 1
Details
In [956]: df.Question.eq('baz') & df.Score.eq('yes')
Out[956]:
0 False
1 False
2 True
3 False
4 False
5 False
6 False
7 False
8 True
9 True
10 False
dtype: bool
In [957]: df.assign(eq=df.Question.eq('baz') & df.Score.eq('yes'))
Out[957]:
Customer Period Question Score Indicator eq
0 A 1 foo 2 1 False
1 A 1 bar 3 1 False
2 A 1 baz yes 1 True
3 A 1 biz 1 1 False
4 B 1 bar 2 0 False
5 B 1 baz no 0 False
6 B 1 qux 3 0 False
7 A 2 foo 5 1 False
8 A 2 baz yes 1 True
9 B 2 baz yes 1 True
10 B 2 biz 2 1 False
perfect! It took me a second to make it work with my real set, but this is a great solution.
– Andrew
Aug 21 at 14:47
add a comment |Â
up vote
5
down vote
accepted
up vote
5
down vote
accepted
You can use
In [954]: df['Indicator'] = (df.assign(eq=df.Question.eq('baz') & df.Score.eq('yes'))
.groupby(['Customer', 'Period'])['eq']
.transform('any').astype(int))
In [955]: df
Out[955]:
Customer Period Question Score Indicator
0 A 1 foo 2 1
1 A 1 bar 3 1
2 A 1 baz yes 1
3 A 1 biz 1 1
4 B 1 bar 2 0
5 B 1 baz no 0
6 B 1 qux 3 0
7 A 2 foo 5 1
8 A 2 baz yes 1
9 B 2 baz yes 1
10 B 2 biz 2 1
Details
In [956]: df.Question.eq('baz') & df.Score.eq('yes')
Out[956]:
0 False
1 False
2 True
3 False
4 False
5 False
6 False
7 False
8 True
9 True
10 False
dtype: bool
In [957]: df.assign(eq=df.Question.eq('baz') & df.Score.eq('yes'))
Out[957]:
Customer Period Question Score Indicator eq
0 A 1 foo 2 1 False
1 A 1 bar 3 1 False
2 A 1 baz yes 1 True
3 A 1 biz 1 1 False
4 B 1 bar 2 0 False
5 B 1 baz no 0 False
6 B 1 qux 3 0 False
7 A 2 foo 5 1 False
8 A 2 baz yes 1 True
9 B 2 baz yes 1 True
10 B 2 biz 2 1 False
You can use
In [954]: df['Indicator'] = (df.assign(eq=df.Question.eq('baz') & df.Score.eq('yes'))
.groupby(['Customer', 'Period'])['eq']
.transform('any').astype(int))
In [955]: df
Out[955]:
Customer Period Question Score Indicator
0 A 1 foo 2 1
1 A 1 bar 3 1
2 A 1 baz yes 1
3 A 1 biz 1 1
4 B 1 bar 2 0
5 B 1 baz no 0
6 B 1 qux 3 0
7 A 2 foo 5 1
8 A 2 baz yes 1
9 B 2 baz yes 1
10 B 2 biz 2 1
Details
In [956]: df.Question.eq('baz') & df.Score.eq('yes')
Out[956]:
0 False
1 False
2 True
3 False
4 False
5 False
6 False
7 False
8 True
9 True
10 False
dtype: bool
In [957]: df.assign(eq=df.Question.eq('baz') & df.Score.eq('yes'))
Out[957]:
Customer Period Question Score Indicator eq
0 A 1 foo 2 1 False
1 A 1 bar 3 1 False
2 A 1 baz yes 1 True
3 A 1 biz 1 1 False
4 B 1 bar 2 0 False
5 B 1 baz no 0 False
6 B 1 qux 3 0 False
7 A 2 foo 5 1 False
8 A 2 baz yes 1 True
9 B 2 baz yes 1 True
10 B 2 biz 2 1 False
answered Aug 21 at 14:32


Zero
35.6k75584
35.6k75584
perfect! It took me a second to make it work with my real set, but this is a great solution.
– Andrew
Aug 21 at 14:47
add a comment |Â
perfect! It took me a second to make it work with my real set, but this is a great solution.
– Andrew
Aug 21 at 14:47
perfect! It took me a second to make it work with my real set, but this is a great solution.
– Andrew
Aug 21 at 14:47
perfect! It took me a second to make it work with my real set, but this is a great solution.
– Andrew
Aug 21 at 14:47
add a comment |Â
up vote
4
down vote
Here's one way. The idea is to use a Boolean mask with MultiIndex
. Then use pd.Series.isin
to compare against your filtered indices.
mask = (df['Question'] == 'baz') & (df['Score'] == 'yes')
idx_cols = ['Customer', 'Period']
idx = df.set_index(idx_cols).loc[mask.values].index
df['Indicator'] = pd.Series(df.set_index(idx_cols).index.values).isin(idx).astype(int)
print(df)
Customer Period Question Score Indicator
0 A 1 foo 2 1
1 A 1 bar 3 1
2 A 1 baz yes 1
3 A 1 biz 1 1
4 B 1 bar 2 0
5 B 1 baz no 0
6 B 1 qux 3 0
7 A 2 foo 5 1
8 A 2 baz yes 1
9 B 2 baz yes 1
10 B 2 biz 2 1
add a comment |Â
up vote
4
down vote
Here's one way. The idea is to use a Boolean mask with MultiIndex
. Then use pd.Series.isin
to compare against your filtered indices.
mask = (df['Question'] == 'baz') & (df['Score'] == 'yes')
idx_cols = ['Customer', 'Period']
idx = df.set_index(idx_cols).loc[mask.values].index
df['Indicator'] = pd.Series(df.set_index(idx_cols).index.values).isin(idx).astype(int)
print(df)
Customer Period Question Score Indicator
0 A 1 foo 2 1
1 A 1 bar 3 1
2 A 1 baz yes 1
3 A 1 biz 1 1
4 B 1 bar 2 0
5 B 1 baz no 0
6 B 1 qux 3 0
7 A 2 foo 5 1
8 A 2 baz yes 1
9 B 2 baz yes 1
10 B 2 biz 2 1
add a comment |Â
up vote
4
down vote
up vote
4
down vote
Here's one way. The idea is to use a Boolean mask with MultiIndex
. Then use pd.Series.isin
to compare against your filtered indices.
mask = (df['Question'] == 'baz') & (df['Score'] == 'yes')
idx_cols = ['Customer', 'Period']
idx = df.set_index(idx_cols).loc[mask.values].index
df['Indicator'] = pd.Series(df.set_index(idx_cols).index.values).isin(idx).astype(int)
print(df)
Customer Period Question Score Indicator
0 A 1 foo 2 1
1 A 1 bar 3 1
2 A 1 baz yes 1
3 A 1 biz 1 1
4 B 1 bar 2 0
5 B 1 baz no 0
6 B 1 qux 3 0
7 A 2 foo 5 1
8 A 2 baz yes 1
9 B 2 baz yes 1
10 B 2 biz 2 1
Here's one way. The idea is to use a Boolean mask with MultiIndex
. Then use pd.Series.isin
to compare against your filtered indices.
mask = (df['Question'] == 'baz') & (df['Score'] == 'yes')
idx_cols = ['Customer', 'Period']
idx = df.set_index(idx_cols).loc[mask.values].index
df['Indicator'] = pd.Series(df.set_index(idx_cols).index.values).isin(idx).astype(int)
print(df)
Customer Period Question Score Indicator
0 A 1 foo 2 1
1 A 1 bar 3 1
2 A 1 baz yes 1
3 A 1 biz 1 1
4 B 1 bar 2 0
5 B 1 baz no 0
6 B 1 qux 3 0
7 A 2 foo 5 1
8 A 2 baz yes 1
9 B 2 baz yes 1
10 B 2 biz 2 1
answered Aug 21 at 14:37


jpp
63.3k173680
63.3k173680
add a comment |Â
add a comment |Â
up vote
4
down vote
You can factorize the tuples of Customer
and Period
. Then use np.logical_or.at
to get group-wise any
i, r = pd.factorize([*zip(df.Customer, df.Period)])
a = np.zeros(len(r), dtype=np.bool8)
np.logical_or.at(a, i, df.eval('Question == "baz" and Score == "yes"'))
df.assign(Indicator=a[i].astype(np.int64))
Customer Period Question Score Indicator
0 A 1 foo 2 1
1 A 1 bar 3 1
2 A 1 baz yes 1
3 A 1 biz 1 1
4 B 1 bar 2 0
5 B 1 baz no 0
6 B 1 qux 3 0
7 A 2 foo 5 1
8 A 2 baz yes 1
9 B 2 baz yes 1
10 B 2 biz 2 1
Explanation
i, r = pd.factorize([*zip(df.Customer, df.Period)])
produces unique (Customer, Period)
pairs in r
where i
is an array keeping track of which element of r
went where in order to produce the original list of tuples
Original list of tuples
[*zip(df.Customer, df.Period)]
[('A', 1),
('A', 1),
('A', 1),
('A', 1),
('B', 1),
('B', 1),
('B', 1),
('A', 2),
('A', 2),
('B', 2),
('B', 2)]After factorizing, unique tuples
r
r
array([('A', 1), ('B', 1), ('A', 2), ('B', 2)], dtype=object)And the positions
i
i
array([0, 0, 0, 0, 1, 1, 1, 2, 2, 3, 3])
I can now use i
as indices for evaluating grouped any
in Numpy using Numpy's at
method on ufuncs
. Basically, this allows me to create an array upfront whose values may change based on my at
operation. Then specify an array of indices (that's what i
will be) and an array matching the size of i
that is the second part of my operation at that index.
I end up using as my matching array
df.eval('Question == "baz" and Score == "yes"')
0 False
1 False
2 True
3 False
4 False
5 False
6 False
7 False
8 True
9 True
10 False
dtype: bool
Let me show this in painstaking detail
Flag GroupIndex Group State of a
0 False 0 (A, 1) [0, 0, 0, 0] # Flag is False, So do Nothing
1 False 0 (A, 1) [0, 0, 0, 0] # Flag is False, So do Nothing
2 True 0 (A, 1) [1, 0, 0, 0] # Flag is True, or_eq for Index 0
3 False 0 (A, 1) [1, 0, 0, 0] # Flag is False, So do Nothing
4 False 1 (B, 1) [1, 0, 0, 0] # Flag is False, So do Nothing
5 False 1 (B, 1) [1, 0, 0, 0] # Flag is False, So do Nothing
6 False 1 (B, 1) [1, 0, 0, 0] # Flag is False, So do Nothing
7 False 2 (A, 2) [1, 0, 0, 0] # Flag is False, So do Nothing
8 True 2 (A, 2) [1, 0, 1, 0] # Flag is True, or_eq for Index 2
9 True 3 (B, 2) [1, 0, 1, 1] # Flag is True, or_eq for Index 3
10 False 3 (B, 2) [1, 0, 1, 1] # Flag is False, So do Nothing
The final State
is [1, 0, 1, 1]
or in boolean terms [True, False, True, True]
. And that represents the or
accumulation within each unique group that is housed in a
a
array([ True, False, True, True])
If I slice this with the index positions in i
and cast as integers, I get
a[i].astype(np.int64)
array([1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1])
Which is precisely what we were looking for.
Finally, I use assign
to produce a copy of the dataframe with its new column.
df.assign(Indicator=a[i].astype(np.int64))
Customer Period Question Score Indicator
0 A 1 foo 2 1
1 A 1 bar 3 1
2 A 1 baz yes 1
3 A 1 biz 1 1
4 B 1 bar 2 0
5 B 1 baz no 0
6 B 1 qux 3 0
7 A 2 foo 5 1
8 A 2 baz yes 1
9 B 2 baz yes 1
10 B 2 biz 2 1
Why Do it This Way?!
Numpy is often faster.
Below is a slightly more optimized approach. (basically the same)
i, r = pd.factorize([*zip(df.Customer, df.Period)])
a = np.zeros(len(r), dtype=np.bool8)
q = df.Question.values == 'baz'
s = df.Score.values == 'yes'
m = q & s
np.logical_or.at(a, i, m)
df.assign(Indicator=a[i].astype(np.int64))
add a comment |Â
up vote
4
down vote
You can factorize the tuples of Customer
and Period
. Then use np.logical_or.at
to get group-wise any
i, r = pd.factorize([*zip(df.Customer, df.Period)])
a = np.zeros(len(r), dtype=np.bool8)
np.logical_or.at(a, i, df.eval('Question == "baz" and Score == "yes"'))
df.assign(Indicator=a[i].astype(np.int64))
Customer Period Question Score Indicator
0 A 1 foo 2 1
1 A 1 bar 3 1
2 A 1 baz yes 1
3 A 1 biz 1 1
4 B 1 bar 2 0
5 B 1 baz no 0
6 B 1 qux 3 0
7 A 2 foo 5 1
8 A 2 baz yes 1
9 B 2 baz yes 1
10 B 2 biz 2 1
Explanation
i, r = pd.factorize([*zip(df.Customer, df.Period)])
produces unique (Customer, Period)
pairs in r
where i
is an array keeping track of which element of r
went where in order to produce the original list of tuples
Original list of tuples
[*zip(df.Customer, df.Period)]
[('A', 1),
('A', 1),
('A', 1),
('A', 1),
('B', 1),
('B', 1),
('B', 1),
('A', 2),
('A', 2),
('B', 2),
('B', 2)]After factorizing, unique tuples
r
r
array([('A', 1), ('B', 1), ('A', 2), ('B', 2)], dtype=object)And the positions
i
i
array([0, 0, 0, 0, 1, 1, 1, 2, 2, 3, 3])
I can now use i
as indices for evaluating grouped any
in Numpy using Numpy's at
method on ufuncs
. Basically, this allows me to create an array upfront whose values may change based on my at
operation. Then specify an array of indices (that's what i
will be) and an array matching the size of i
that is the second part of my operation at that index.
I end up using as my matching array
df.eval('Question == "baz" and Score == "yes"')
0 False
1 False
2 True
3 False
4 False
5 False
6 False
7 False
8 True
9 True
10 False
dtype: bool
Let me show this in painstaking detail
Flag GroupIndex Group State of a
0 False 0 (A, 1) [0, 0, 0, 0] # Flag is False, So do Nothing
1 False 0 (A, 1) [0, 0, 0, 0] # Flag is False, So do Nothing
2 True 0 (A, 1) [1, 0, 0, 0] # Flag is True, or_eq for Index 0
3 False 0 (A, 1) [1, 0, 0, 0] # Flag is False, So do Nothing
4 False 1 (B, 1) [1, 0, 0, 0] # Flag is False, So do Nothing
5 False 1 (B, 1) [1, 0, 0, 0] # Flag is False, So do Nothing
6 False 1 (B, 1) [1, 0, 0, 0] # Flag is False, So do Nothing
7 False 2 (A, 2) [1, 0, 0, 0] # Flag is False, So do Nothing
8 True 2 (A, 2) [1, 0, 1, 0] # Flag is True, or_eq for Index 2
9 True 3 (B, 2) [1, 0, 1, 1] # Flag is True, or_eq for Index 3
10 False 3 (B, 2) [1, 0, 1, 1] # Flag is False, So do Nothing
The final State
is [1, 0, 1, 1]
or in boolean terms [True, False, True, True]
. And that represents the or
accumulation within each unique group that is housed in a
a
array([ True, False, True, True])
If I slice this with the index positions in i
and cast as integers, I get
a[i].astype(np.int64)
array([1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1])
Which is precisely what we were looking for.
Finally, I use assign
to produce a copy of the dataframe with its new column.
df.assign(Indicator=a[i].astype(np.int64))
Customer Period Question Score Indicator
0 A 1 foo 2 1
1 A 1 bar 3 1
2 A 1 baz yes 1
3 A 1 biz 1 1
4 B 1 bar 2 0
5 B 1 baz no 0
6 B 1 qux 3 0
7 A 2 foo 5 1
8 A 2 baz yes 1
9 B 2 baz yes 1
10 B 2 biz 2 1
Why Do it This Way?!
Numpy is often faster.
Below is a slightly more optimized approach. (basically the same)
i, r = pd.factorize([*zip(df.Customer, df.Period)])
a = np.zeros(len(r), dtype=np.bool8)
q = df.Question.values == 'baz'
s = df.Score.values == 'yes'
m = q & s
np.logical_or.at(a, i, m)
df.assign(Indicator=a[i].astype(np.int64))
add a comment |Â
up vote
4
down vote
up vote
4
down vote
You can factorize the tuples of Customer
and Period
. Then use np.logical_or.at
to get group-wise any
i, r = pd.factorize([*zip(df.Customer, df.Period)])
a = np.zeros(len(r), dtype=np.bool8)
np.logical_or.at(a, i, df.eval('Question == "baz" and Score == "yes"'))
df.assign(Indicator=a[i].astype(np.int64))
Customer Period Question Score Indicator
0 A 1 foo 2 1
1 A 1 bar 3 1
2 A 1 baz yes 1
3 A 1 biz 1 1
4 B 1 bar 2 0
5 B 1 baz no 0
6 B 1 qux 3 0
7 A 2 foo 5 1
8 A 2 baz yes 1
9 B 2 baz yes 1
10 B 2 biz 2 1
Explanation
i, r = pd.factorize([*zip(df.Customer, df.Period)])
produces unique (Customer, Period)
pairs in r
where i
is an array keeping track of which element of r
went where in order to produce the original list of tuples
Original list of tuples
[*zip(df.Customer, df.Period)]
[('A', 1),
('A', 1),
('A', 1),
('A', 1),
('B', 1),
('B', 1),
('B', 1),
('A', 2),
('A', 2),
('B', 2),
('B', 2)]After factorizing, unique tuples
r
r
array([('A', 1), ('B', 1), ('A', 2), ('B', 2)], dtype=object)And the positions
i
i
array([0, 0, 0, 0, 1, 1, 1, 2, 2, 3, 3])
I can now use i
as indices for evaluating grouped any
in Numpy using Numpy's at
method on ufuncs
. Basically, this allows me to create an array upfront whose values may change based on my at
operation. Then specify an array of indices (that's what i
will be) and an array matching the size of i
that is the second part of my operation at that index.
I end up using as my matching array
df.eval('Question == "baz" and Score == "yes"')
0 False
1 False
2 True
3 False
4 False
5 False
6 False
7 False
8 True
9 True
10 False
dtype: bool
Let me show this in painstaking detail
Flag GroupIndex Group State of a
0 False 0 (A, 1) [0, 0, 0, 0] # Flag is False, So do Nothing
1 False 0 (A, 1) [0, 0, 0, 0] # Flag is False, So do Nothing
2 True 0 (A, 1) [1, 0, 0, 0] # Flag is True, or_eq for Index 0
3 False 0 (A, 1) [1, 0, 0, 0] # Flag is False, So do Nothing
4 False 1 (B, 1) [1, 0, 0, 0] # Flag is False, So do Nothing
5 False 1 (B, 1) [1, 0, 0, 0] # Flag is False, So do Nothing
6 False 1 (B, 1) [1, 0, 0, 0] # Flag is False, So do Nothing
7 False 2 (A, 2) [1, 0, 0, 0] # Flag is False, So do Nothing
8 True 2 (A, 2) [1, 0, 1, 0] # Flag is True, or_eq for Index 2
9 True 3 (B, 2) [1, 0, 1, 1] # Flag is True, or_eq for Index 3
10 False 3 (B, 2) [1, 0, 1, 1] # Flag is False, So do Nothing
The final State
is [1, 0, 1, 1]
or in boolean terms [True, False, True, True]
. And that represents the or
accumulation within each unique group that is housed in a
a
array([ True, False, True, True])
If I slice this with the index positions in i
and cast as integers, I get
a[i].astype(np.int64)
array([1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1])
Which is precisely what we were looking for.
Finally, I use assign
to produce a copy of the dataframe with its new column.
df.assign(Indicator=a[i].astype(np.int64))
Customer Period Question Score Indicator
0 A 1 foo 2 1
1 A 1 bar 3 1
2 A 1 baz yes 1
3 A 1 biz 1 1
4 B 1 bar 2 0
5 B 1 baz no 0
6 B 1 qux 3 0
7 A 2 foo 5 1
8 A 2 baz yes 1
9 B 2 baz yes 1
10 B 2 biz 2 1
Why Do it This Way?!
Numpy is often faster.
Below is a slightly more optimized approach. (basically the same)
i, r = pd.factorize([*zip(df.Customer, df.Period)])
a = np.zeros(len(r), dtype=np.bool8)
q = df.Question.values == 'baz'
s = df.Score.values == 'yes'
m = q & s
np.logical_or.at(a, i, m)
df.assign(Indicator=a[i].astype(np.int64))
You can factorize the tuples of Customer
and Period
. Then use np.logical_or.at
to get group-wise any
i, r = pd.factorize([*zip(df.Customer, df.Period)])
a = np.zeros(len(r), dtype=np.bool8)
np.logical_or.at(a, i, df.eval('Question == "baz" and Score == "yes"'))
df.assign(Indicator=a[i].astype(np.int64))
Customer Period Question Score Indicator
0 A 1 foo 2 1
1 A 1 bar 3 1
2 A 1 baz yes 1
3 A 1 biz 1 1
4 B 1 bar 2 0
5 B 1 baz no 0
6 B 1 qux 3 0
7 A 2 foo 5 1
8 A 2 baz yes 1
9 B 2 baz yes 1
10 B 2 biz 2 1
Explanation
i, r = pd.factorize([*zip(df.Customer, df.Period)])
produces unique (Customer, Period)
pairs in r
where i
is an array keeping track of which element of r
went where in order to produce the original list of tuples
Original list of tuples
[*zip(df.Customer, df.Period)]
[('A', 1),
('A', 1),
('A', 1),
('A', 1),
('B', 1),
('B', 1),
('B', 1),
('A', 2),
('A', 2),
('B', 2),
('B', 2)]After factorizing, unique tuples
r
r
array([('A', 1), ('B', 1), ('A', 2), ('B', 2)], dtype=object)And the positions
i
i
array([0, 0, 0, 0, 1, 1, 1, 2, 2, 3, 3])
I can now use i
as indices for evaluating grouped any
in Numpy using Numpy's at
method on ufuncs
. Basically, this allows me to create an array upfront whose values may change based on my at
operation. Then specify an array of indices (that's what i
will be) and an array matching the size of i
that is the second part of my operation at that index.
I end up using as my matching array
df.eval('Question == "baz" and Score == "yes"')
0 False
1 False
2 True
3 False
4 False
5 False
6 False
7 False
8 True
9 True
10 False
dtype: bool
Let me show this in painstaking detail
Flag GroupIndex Group State of a
0 False 0 (A, 1) [0, 0, 0, 0] # Flag is False, So do Nothing
1 False 0 (A, 1) [0, 0, 0, 0] # Flag is False, So do Nothing
2 True 0 (A, 1) [1, 0, 0, 0] # Flag is True, or_eq for Index 0
3 False 0 (A, 1) [1, 0, 0, 0] # Flag is False, So do Nothing
4 False 1 (B, 1) [1, 0, 0, 0] # Flag is False, So do Nothing
5 False 1 (B, 1) [1, 0, 0, 0] # Flag is False, So do Nothing
6 False 1 (B, 1) [1, 0, 0, 0] # Flag is False, So do Nothing
7 False 2 (A, 2) [1, 0, 0, 0] # Flag is False, So do Nothing
8 True 2 (A, 2) [1, 0, 1, 0] # Flag is True, or_eq for Index 2
9 True 3 (B, 2) [1, 0, 1, 1] # Flag is True, or_eq for Index 3
10 False 3 (B, 2) [1, 0, 1, 1] # Flag is False, So do Nothing
The final State
is [1, 0, 1, 1]
or in boolean terms [True, False, True, True]
. And that represents the or
accumulation within each unique group that is housed in a
a
array([ True, False, True, True])
If I slice this with the index positions in i
and cast as integers, I get
a[i].astype(np.int64)
array([1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1])
Which is precisely what we were looking for.
Finally, I use assign
to produce a copy of the dataframe with its new column.
df.assign(Indicator=a[i].astype(np.int64))
Customer Period Question Score Indicator
0 A 1 foo 2 1
1 A 1 bar 3 1
2 A 1 baz yes 1
3 A 1 biz 1 1
4 B 1 bar 2 0
5 B 1 baz no 0
6 B 1 qux 3 0
7 A 2 foo 5 1
8 A 2 baz yes 1
9 B 2 baz yes 1
10 B 2 biz 2 1
Why Do it This Way?!
Numpy is often faster.
Below is a slightly more optimized approach. (basically the same)
i, r = pd.factorize([*zip(df.Customer, df.Period)])
a = np.zeros(len(r), dtype=np.bool8)
q = df.Question.values == 'baz'
s = df.Score.values == 'yes'
m = q & s
np.logical_or.at(a, i, m)
df.assign(Indicator=a[i].astype(np.int64))
edited Aug 21 at 15:23
answered Aug 21 at 14:43


piRSquared
138k19117244
138k19117244
add a comment |Â
add a comment |Â
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f51950944%2fpropagate-conditional-column-value-in-pandas%23new-answer', 'question_page');
);
Post as a guest
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password