How does pandas handle a comparison between a column of type "object" and an integer?

My question is about the rule that pandas uses to compare a column with type "object" with an integer. Here is my code:



In [334]: df
Out[334]:
     c1    c2        c3  c4
id1   1    li -0.367860   5
id2   2  zhao -0.596926   5
id3   3   sun  0.493806   5
id4   4  wang -0.311407   5
id5   5  wang  0.253646   5

In [335]: df < 2
Out[335]:
        c1    c2    c3     c4
id1   True  True  True  False
id2  False  True  True  False
id3  False  True  True  False
id4  False  True  True  False
id5  False  True  True  False

In [336]: df.dtypes
Out[336]:
c1      int64
c2     object
c3    float64
c4      int64
dtype: object


Why is every value in the "c2" column True?



P.S. I also tried:



In [333]: np.less(np.array(["s","b"]),2)
Out[333]: NotImplemented






asked Aug 18 at 12:46 – BO.LI
edited Aug 18 at 14:16 – Alex Riley

  • Funnily enough, both df > 2 and df < 2 yield all True
    – RafaelC
    Aug 18 at 13:46

  • I tried overriding the __le__ (less than or equal) and __ge__ (greater than or equal) methods of a class to always return False, and df > 2 still returns True. My guess is that pandas overrides every object to return True on comparison for some reason.
    – user2653663
    Sep 3 at 7:55












1 Answer (accepted)

For DataFrames, comparison with a scalar always returns a DataFrame having all Boolean columns.



I don't think it's documented anywhere officially, but there's a comment in the source code (see below) confirming the intended behaviour:




[for] straight boolean comparisons [between a DataFrame and a scalar] we want to allow all columns (regardless of dtype to pass thru) See #4537 for discussion.




In practice, this means that all comparisons for every column must return either True or False. Any invalid comparison (such as 'li' < 2) should default to one of these Boolean values.
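To make "invalid comparison" concrete, here is a minimal sketch. It assumes Python 3, where ordering a str against an int raises TypeError instead of producing a Boolean, so there is no natural True or False for pandas to pass through. The helper name try_less_than is hypothetical and only used for illustration.

```python
def try_less_than(left, right):
    """Hypothetical helper: return left < right, or the error name if the types cannot be ordered."""
    try:
        return left < right
    except TypeError as exc:
        # Python 3 refuses to order unrelated types such as str and int.
        return type(exc).__name__


print(try_less_than(1, 2))     # a valid comparison between two ints
print(try_less_than("li", 2))  # str vs int: no Boolean result is available
```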



Put simply, the pandas developers decided that it should default to True.



There's some discussion of this behaviour in #4537 and some argument to use False instead, or restrict the comparison to only columns with compatible types, but the ticket was closed and no code was changed.
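If you would rather not rely on a default for incomparable columns, one option along the lines of the "compatible types only" idea is to select the numeric columns yourself before comparing. This is a sketch of a possible workaround, not something the ticket adopted; the frame below is a cut-down, illustrative version of the one in the question.

```python
import pandas as pd

# Illustrative data modelled on the question (values rounded).
df = pd.DataFrame(
    {"c1": [1, 2, 3], "c2": ["li", "zhao", "sun"], "c3": [-0.37, -0.60, 0.49]},
    index=["id1", "id2", "id3"],
)

# Keep only the numeric columns, so every element can be compared with 2.
numeric = df.select_dtypes(include="number")
mask = numeric < 2
print(mask)
```

The object column c2 simply does not appear in the result, which avoids having to decide what 'li' < 2 should mean.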



If you're interested, you can see where the default value is used for invalid comparisons in an internal method found in ops.py:



def _comp_method_FRAME(cls, func, special):
    str_rep = _get_opstr(func, cls)
    op_name = _get_op_name(func, special)

    @Appender('Wrapper for comparison method {name}'.format(name=op_name))
    def f(self, other):
        if isinstance(other, ABCDataFrame):
            # Another DataFrame
            if not self._indexed_same(other):
                raise ValueError('Can only compare identically-labeled '
                                 'DataFrame objects')
            return self._compare_frame(other, func, str_rep)

        elif isinstance(other, ABCSeries):
            return _combine_series_frame(self, other, func,
                                         fill_value=None, axis=None,
                                         level=None, try_cast=False)
        else:

            # straight boolean comparisons we want to allow all columns
            # (regardless of dtype to pass thru) See #4537 for discussion.
            res = self._combine_const(other, func,
                                      errors='ignore',
                                      try_cast=False)
            return res.fillna(True).astype(bool)

    f.__name__ = op_name
    return f


The else block is the one we're interested in for the scalar case.



Note the errors='ignore' argument, meaning an invalid comparison will return NaN (instead of raising an error). The res.fillna(True) fills these failed comparisons with True.
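The two steps in that else block can be imitated by hand. The sketch below is not the pandas implementation: safe_compare and compare_frame_to_scalar are hypothetical helpers that assume a failed element comparison raises TypeError (as in Python 3), turn it into NaN, and then apply the same fillna(True).astype(bool) as the source above. It reproduces the mechanism explicitly rather than relying on what df < 2 does in any particular pandas release.

```python
import operator

import numpy as np
import pandas as pd


def safe_compare(value, other, op):
    """Hypothetical helper: compare one element, yielding NaN if the types are incomparable."""
    try:
        return op(value, other)
    except TypeError:
        return np.nan


def compare_frame_to_scalar(df, other, op):
    """Sketch of the scalar branch: failed comparisons become NaN, then NaN becomes True."""
    res = df.apply(lambda col: col.map(lambda v: safe_compare(v, other, op)))
    return res.fillna(True).astype(bool)


# Illustrative data modelled on the question.
df = pd.DataFrame(
    {"c1": [1, 2, 3], "c2": ["li", "zhao", "sun"], "c4": [5, 5, 5]},
    index=["id1", "id2", "id3"],
)

less = compare_frame_to_scalar(df, 2, operator.lt)
greater = compare_frame_to_scalar(df, 2, operator.gt)
print(less)
print(greater)
```

Because the fill value does not depend on the operator, the c2 column comes out all True for both < and >, which matches the observation in the comments on the question.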






answered Aug 18 at 14:14 – Alex Riley
edited Aug 18 at 16:01

  • Seems like a weird design decision to me to not keep the NaNs. Especially as the library liberally uses NaNs elsewhere when there's no sensible value for a field.
    – timgeb
    Aug 18 at 17:22
















