Groupby class and count missing values in features












16















I have a problem and I cannot find any solution in the web or documentation, even if I think that it is very trivial.



What do I want to do?



I have a dataframe like this



CLASS FEATURE1 FEATURE2 FEATURE3
X A NaN NaN
X NaN A NaN
B A A A


I want to group by the label(CLASS) and display the number of NaN-Values that are counted in every feature so that it looks like this. The purpose of this is to get a general idea how missing values are distributed over the different classes.



CLASS FEATURE1 FEATURE2 FEATURE3
X 1 1 2
B 0 0 0


I know how to recieve the amount of nonnull-Values - df.groupby['CLASS'].count()



Is there something similar for the NaN-Values?



I tried to subtract the count() from the size() but it returned an unformatted output filled with the value NaN










share|improve this question





























    16















    I have a problem and I cannot find any solution in the web or documentation, even if I think that it is very trivial.



    What do I want to do?



    I have a dataframe like this



    CLASS FEATURE1 FEATURE2 FEATURE3
    X A NaN NaN
    X NaN A NaN
    B A A A


    I want to group by the label(CLASS) and display the number of NaN-Values that are counted in every feature so that it looks like this. The purpose of this is to get a general idea how missing values are distributed over the different classes.



    CLASS FEATURE1 FEATURE2 FEATURE3
    X 1 1 2
    B 0 0 0


    I know how to recieve the amount of nonnull-Values - df.groupby['CLASS'].count()



    Is there something similar for the NaN-Values?



    I tried to subtract the count() from the size() but it returned an unformatted output filled with the value NaN










    share|improve this question



























      16












      16








      16








      I have a problem and I cannot find any solution in the web or documentation, even if I think that it is very trivial.



      What do I want to do?



      I have a dataframe like this



      CLASS FEATURE1 FEATURE2 FEATURE3
      X A NaN NaN
      X NaN A NaN
      B A A A


      I want to group by the label(CLASS) and display the number of NaN-Values that are counted in every feature so that it looks like this. The purpose of this is to get a general idea how missing values are distributed over the different classes.



      CLASS FEATURE1 FEATURE2 FEATURE3
      X 1 1 2
      B 0 0 0


      I know how to recieve the amount of nonnull-Values - df.groupby['CLASS'].count()



      Is there something similar for the NaN-Values?



      I tried to subtract the count() from the size() but it returned an unformatted output filled with the value NaN










      share|improve this question
















      I have a problem and I cannot find any solution in the web or documentation, even if I think that it is very trivial.



      What do I want to do?



      I have a dataframe like this



      CLASS FEATURE1 FEATURE2 FEATURE3
      X A NaN NaN
      X NaN A NaN
      B A A A


      I want to group by the label(CLASS) and display the number of NaN-Values that are counted in every feature so that it looks like this. The purpose of this is to get a general idea how missing values are distributed over the different classes.



      CLASS FEATURE1 FEATURE2 FEATURE3
      X 1 1 2
      B 0 0 0


      I know how to recieve the amount of nonnull-Values - df.groupby['CLASS'].count()



      Is there something similar for the NaN-Values?



      I tried to subtract the count() from the size() but it returned an unformatted output filled with the value NaN







      python pandas dataframe group-by nan






      share|improve this question















      share|improve this question













      share|improve this question




      share|improve this question








      edited Dec 27 '18 at 16:11









      coldspeed

      131k23135221




      131k23135221










      asked Dec 27 '18 at 15:15









      FelTry2FelTry2

      1057




      1057
























          3 Answers
          3






          active

          oldest

          votes


















          15














          Compute a mask with isna, then group and find the sum:



          df.drop('CLASS', 1).isna().groupby(df.CLASS, sort=False).sum().reset_index()

          CLASS FEATURE1 FEATURE2 FEATURE3
          0 X 1.0 1.0 2.0
          1 B 0.0 0.0 0.0




          Another option is to subtract the size from the count using rsub along the 0th axis for index aligned subtraction:



          df.groupby('CLASS').count().rsub(df.groupby('CLASS').size(), axis=0)


          Or,



          g = df.groupby('CLASS')
          g.count().rsub(g.size(), axis=0)




                 FEATURE1  FEATURE2  FEATURE3
          CLASS
          B 0 0 0
          X 1 1 2




          There are quite a few good answers, so here are some timeits for your perusal:



          df_ = df
          df = pd.concat([df_] * 10000)

          %timeit df.drop('CLASS', 1).isna().groupby(df.CLASS, sort=False).sum()
          %timeit df.set_index('CLASS').isna().sum(level=0)
          %%timeit
          g = df.groupby('CLASS')
          g.count().rsub(g.size(), axis=0)

          11.8 ms ± 108 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
          9.47 ms ± 379 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
          6.54 ms ± 81.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


          Actual performance depends on your data and setup, so your mileage may vary.






          share|improve this answer





















          • 2





            Amazing - thank you very, very much!

            – FelTry2
            Dec 27 '18 at 15:37



















          12














          You can use set_index and sum:



          df.set_index('CLASS').isna().sum(level=0)


          Output:



                 FEATURE1  FEATURE2  FEATURE3
          CLASS
          X 1.0 1.0 2.0
          B 0.0 0.0 0.0





          share|improve this answer































            7














            Using the diff between count and size



            g=df.groupby('CLASS')

            -g.count().sub(g.size(),0)

            FEATURE1 FEATURE2 FEATURE3
            CLASS
            B 0 0 0
            X 1 1 2


            And we can transform this question to the more generic question how to count how many NaN in dataframe with for loop



            pd.DataFrame({x: y.isna().sum()for x , y in g }).T.drop('CLASS',1)
            Out[468]:
            FEATURE1 FEATURE2 FEATURE3
            B 0 0 0
            X 1 1 2





            share|improve this answer

























              Your Answer






              StackExchange.ifUsing("editor", function () {
              StackExchange.using("externalEditor", function () {
              StackExchange.using("snippets", function () {
              StackExchange.snippets.init();
              });
              });
              }, "code-snippets");

              StackExchange.ready(function() {
              var channelOptions = {
              tags: "".split(" "),
              id: "1"
              };
              initTagRenderer("".split(" "), "".split(" "), channelOptions);

              StackExchange.using("externalEditor", function() {
              // Have to fire editor after snippets, if snippets enabled
              if (StackExchange.settings.snippets.snippetsEnabled) {
              StackExchange.using("snippets", function() {
              createEditor();
              });
              }
              else {
              createEditor();
              }
              });

              function createEditor() {
              StackExchange.prepareEditor({
              heartbeatType: 'answer',
              autoActivateHeartbeat: false,
              convertImagesToLinks: true,
              noModals: true,
              showLowRepImageUploadWarning: true,
              reputationToPostImages: 10,
              bindNavPrevention: true,
              postfix: "",
              imageUploader: {
              brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
              contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
              allowUrls: true
              },
              onDemand: true,
              discardSelector: ".discard-answer"
              ,immediatelyShowMarkdownHelp:true
              });


              }
              });














              draft saved

              draft discarded


















              StackExchange.ready(
              function () {
              StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53947196%2fgroupby-class-and-count-missing-values-in-features%23new-answer', 'question_page');
              }
              );

              Post as a guest















              Required, but never shown

























              3 Answers
              3






              active

              oldest

              votes








              3 Answers
              3






              active

              oldest

              votes









              active

              oldest

              votes






              active

              oldest

              votes









              15














              Compute a mask with isna, then group and find the sum:



              df.drop('CLASS', 1).isna().groupby(df.CLASS, sort=False).sum().reset_index()

              CLASS FEATURE1 FEATURE2 FEATURE3
              0 X 1.0 1.0 2.0
              1 B 0.0 0.0 0.0




              Another option is to subtract the size from the count using rsub along the 0th axis for index aligned subtraction:



              df.groupby('CLASS').count().rsub(df.groupby('CLASS').size(), axis=0)


              Or,



              g = df.groupby('CLASS')
              g.count().rsub(g.size(), axis=0)




                     FEATURE1  FEATURE2  FEATURE3
              CLASS
              B 0 0 0
              X 1 1 2




              There are quite a few good answers, so here are some timeits for your perusal:



              df_ = df
              df = pd.concat([df_] * 10000)

              %timeit df.drop('CLASS', 1).isna().groupby(df.CLASS, sort=False).sum()
              %timeit df.set_index('CLASS').isna().sum(level=0)
              %%timeit
              g = df.groupby('CLASS')
              g.count().rsub(g.size(), axis=0)

              11.8 ms ± 108 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
              9.47 ms ± 379 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
              6.54 ms ± 81.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


              Actual performance depends on your data and setup, so your mileage may vary.






              share|improve this answer





















              • 2





                Amazing - thank you very, very much!

                – FelTry2
                Dec 27 '18 at 15:37
















              15














              Compute a mask with isna, then group and find the sum:



              df.drop('CLASS', 1).isna().groupby(df.CLASS, sort=False).sum().reset_index()

              CLASS FEATURE1 FEATURE2 FEATURE3
              0 X 1.0 1.0 2.0
              1 B 0.0 0.0 0.0




              Another option is to subtract the size from the count using rsub along the 0th axis for index aligned subtraction:



              df.groupby('CLASS').count().rsub(df.groupby('CLASS').size(), axis=0)


              Or,



              g = df.groupby('CLASS')
              g.count().rsub(g.size(), axis=0)




                     FEATURE1  FEATURE2  FEATURE3
              CLASS
              B 0 0 0
              X 1 1 2




              There are quite a few good answers, so here are some timeits for your perusal:



              df_ = df
              df = pd.concat([df_] * 10000)

              %timeit df.drop('CLASS', 1).isna().groupby(df.CLASS, sort=False).sum()
              %timeit df.set_index('CLASS').isna().sum(level=0)
              %%timeit
              g = df.groupby('CLASS')
              g.count().rsub(g.size(), axis=0)

              11.8 ms ± 108 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
              9.47 ms ± 379 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
              6.54 ms ± 81.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


              Actual performance depends on your data and setup, so your mileage may vary.






              share|improve this answer





















              • 2





                Amazing - thank you very, very much!

                – FelTry2
                Dec 27 '18 at 15:37














              15












              15








              15







              Compute a mask with isna, then group and find the sum:



              df.drop('CLASS', 1).isna().groupby(df.CLASS, sort=False).sum().reset_index()

              CLASS FEATURE1 FEATURE2 FEATURE3
              0 X 1.0 1.0 2.0
              1 B 0.0 0.0 0.0




              Another option is to subtract the size from the count using rsub along the 0th axis for index aligned subtraction:



              df.groupby('CLASS').count().rsub(df.groupby('CLASS').size(), axis=0)


              Or,



              g = df.groupby('CLASS')
              g.count().rsub(g.size(), axis=0)




                     FEATURE1  FEATURE2  FEATURE3
              CLASS
              B 0 0 0
              X 1 1 2




              There are quite a few good answers, so here are some timeits for your perusal:



              df_ = df
              df = pd.concat([df_] * 10000)

              %timeit df.drop('CLASS', 1).isna().groupby(df.CLASS, sort=False).sum()
              %timeit df.set_index('CLASS').isna().sum(level=0)
              %%timeit
              g = df.groupby('CLASS')
              g.count().rsub(g.size(), axis=0)

              11.8 ms ± 108 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
              9.47 ms ± 379 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
              6.54 ms ± 81.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


              Actual performance depends on your data and setup, so your mileage may vary.






              share|improve this answer















              Compute a mask with isna, then group and find the sum:



              df.drop('CLASS', 1).isna().groupby(df.CLASS, sort=False).sum().reset_index()

              CLASS FEATURE1 FEATURE2 FEATURE3
              0 X 1.0 1.0 2.0
              1 B 0.0 0.0 0.0




              Another option is to subtract the size from the count using rsub along the 0th axis for index aligned subtraction:



              df.groupby('CLASS').count().rsub(df.groupby('CLASS').size(), axis=0)


              Or,



              g = df.groupby('CLASS')
              g.count().rsub(g.size(), axis=0)




                     FEATURE1  FEATURE2  FEATURE3
              CLASS
              B 0 0 0
              X 1 1 2




              There are quite a few good answers, so here are some timeits for your perusal:



              df_ = df
              df = pd.concat([df_] * 10000)

              %timeit df.drop('CLASS', 1).isna().groupby(df.CLASS, sort=False).sum()
              %timeit df.set_index('CLASS').isna().sum(level=0)
              %%timeit
              g = df.groupby('CLASS')
              g.count().rsub(g.size(), axis=0)

              11.8 ms ± 108 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
              9.47 ms ± 379 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
              6.54 ms ± 81.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


              Actual performance depends on your data and setup, so your mileage may vary.







              share|improve this answer














              share|improve this answer



              share|improve this answer








              edited Dec 27 '18 at 15:29

























              answered Dec 27 '18 at 15:16









              coldspeedcoldspeed

              131k23135221




              131k23135221








              • 2





                Amazing - thank you very, very much!

                – FelTry2
                Dec 27 '18 at 15:37














              • 2





                Amazing - thank you very, very much!

                – FelTry2
                Dec 27 '18 at 15:37








              2




              2





              Amazing - thank you very, very much!

              – FelTry2
              Dec 27 '18 at 15:37





              Amazing - thank you very, very much!

              – FelTry2
              Dec 27 '18 at 15:37













              12














              You can use set_index and sum:



              df.set_index('CLASS').isna().sum(level=0)


              Output:



                     FEATURE1  FEATURE2  FEATURE3
              CLASS
              X 1.0 1.0 2.0
              B 0.0 0.0 0.0





              share|improve this answer




























                12














                You can use set_index and sum:



                df.set_index('CLASS').isna().sum(level=0)


                Output:



                       FEATURE1  FEATURE2  FEATURE3
                CLASS
                X 1.0 1.0 2.0
                B 0.0 0.0 0.0





                share|improve this answer


























                  12












                  12








                  12







                  You can use set_index and sum:



                  df.set_index('CLASS').isna().sum(level=0)


                  Output:



                         FEATURE1  FEATURE2  FEATURE3
                  CLASS
                  X 1.0 1.0 2.0
                  B 0.0 0.0 0.0





                  share|improve this answer













                  You can use set_index and sum:



                  df.set_index('CLASS').isna().sum(level=0)


                  Output:



                         FEATURE1  FEATURE2  FEATURE3
                  CLASS
                  X 1.0 1.0 2.0
                  B 0.0 0.0 0.0






                  share|improve this answer












                  share|improve this answer



                  share|improve this answer










                  answered Dec 27 '18 at 15:18









                  Scott BostonScott Boston

                  54.8k73056




                  54.8k73056























                      7














                      Using the diff between count and size



                      g=df.groupby('CLASS')

                      -g.count().sub(g.size(),0)

                      FEATURE1 FEATURE2 FEATURE3
                      CLASS
                      B 0 0 0
                      X 1 1 2


                      And we can transform this question to the more generic question how to count how many NaN in dataframe with for loop



                      pd.DataFrame({x: y.isna().sum()for x , y in g }).T.drop('CLASS',1)
                      Out[468]:
                      FEATURE1 FEATURE2 FEATURE3
                      B 0 0 0
                      X 1 1 2





                      share|improve this answer






























                        7














                        Using the diff between count and size



                        g=df.groupby('CLASS')

                        -g.count().sub(g.size(),0)

                        FEATURE1 FEATURE2 FEATURE3
                        CLASS
                        B 0 0 0
                        X 1 1 2


                        And we can transform this question to the more generic question how to count how many NaN in dataframe with for loop



                        pd.DataFrame({x: y.isna().sum()for x , y in g }).T.drop('CLASS',1)
                        Out[468]:
                        FEATURE1 FEATURE2 FEATURE3
                        B 0 0 0
                        X 1 1 2





                        share|improve this answer




























                          7












                          7








                          7







                          Using the diff between count and size



                          g=df.groupby('CLASS')

                          -g.count().sub(g.size(),0)

                          FEATURE1 FEATURE2 FEATURE3
                          CLASS
                          B 0 0 0
                          X 1 1 2


                          And we can transform this question to the more generic question how to count how many NaN in dataframe with for loop



                          pd.DataFrame({x: y.isna().sum()for x , y in g }).T.drop('CLASS',1)
                          Out[468]:
                          FEATURE1 FEATURE2 FEATURE3
                          B 0 0 0
                          X 1 1 2





                          share|improve this answer















                          Using the diff between count and size



                          g=df.groupby('CLASS')

                          -g.count().sub(g.size(),0)

                          FEATURE1 FEATURE2 FEATURE3
                          CLASS
                          B 0 0 0
                          X 1 1 2


                          And we can transform this question to the more generic question how to count how many NaN in dataframe with for loop



                          pd.DataFrame({x: y.isna().sum()for x , y in g }).T.drop('CLASS',1)
                          Out[468]:
                          FEATURE1 FEATURE2 FEATURE3
                          B 0 0 0
                          X 1 1 2






                          share|improve this answer














                          share|improve this answer



                          share|improve this answer








                          edited Dec 27 '18 at 16:25

























                          answered Dec 27 '18 at 15:19









                          Wen-BenWen-Ben

                          111k83266




                          111k83266






























                              draft saved

                              draft discarded




















































                              Thanks for contributing an answer to Stack Overflow!


                              • Please be sure to answer the question. Provide details and share your research!

                              But avoid



                              • Asking for help, clarification, or responding to other answers.

                              • Making statements based on opinion; back them up with references or personal experience.


                              To learn more, see our tips on writing great answers.




                              draft saved


                              draft discarded














                              StackExchange.ready(
                              function () {
                              StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53947196%2fgroupby-class-and-count-missing-values-in-features%23new-answer', 'question_page');
                              }
                              );

                              Post as a guest















                              Required, but never shown





















































                              Required, but never shown














                              Required, but never shown












                              Required, but never shown







                              Required, but never shown

































                              Required, but never shown














                              Required, but never shown












                              Required, but never shown







                              Required, but never shown







                              Popular posts from this blog

                              Bressuire

                              Cabo Verde

                              Gyllenstierna