# Introduction

Following non-melanoma skin cancer, breast cancer is the most common type of cancer among women and the second most common worldwide, corresponding to 25.2% of all cancers in world statistics and 29.5% in Brazil. Breast cancer is rare in men, representing less than 1% of cases (American Cancer Society, 2019, Instituto Nacional de Cancer, Brazil, 2017). To successfully treat and control breast cancer in the female population, it is essential to identify risk factors for the disease. Moreover, early diagnosis and immediate access to treatment are decisive for the prognosis of the disease (American Cancer Society, 2019, Instituto Nacional de Cancer, Brazil, 2017).

The histological grade of malignancy proposed by Scarff, Bloom, and Richardson and further modified by Elston and Ellis, known as the Nottingham Classification System, is considered one of the main factors for determining the prognosis of breast cancer (Beck et al., 2011, Chen et

Machine learning is advantageous because, once appropriate accuracy and precision are achieved, it can gather a large volume of information on a specific disease in a single digital tool. It suppresses the subjectivity of human evaluation and speeds up the analysis of the material to be studied, aiming at safe and quick diagnoses, and could even be used as a "second specialized opinion" in cases of greater complexity (Wernick et al., 2010, Mulrane et al., 2008, Jones et al., 2009, Misselwitz et al., 2010).

The present study aimed to perform an automated and reproducible classification of the parameters used by pathologists to diagnose breast cancer: nuclear score, tubular score, and mitotic index. The software used in the present study for image analysis and classification (CellProfiler and Tanagra) is free. The results obtained by the automated analysis were compared with a pathologist's diagnosis (Jones et al., 2009, Carpenter et al., 2006, Lamprecht et al., 2007, Lenz et al., 2017).
# Materials and Methods

# a) The samples - Inclusion and exclusion criteria

The study targeted women with breast cancer of the most frequent histological types (infiltrating ductal carcinoma, invasive lobular carcinoma, and the mixed infiltrating lobular ductal form) who underwent surgical treatment for this disease in 2015 and who, until the time of surgery, had not undergone adjuvant chemotherapy or radiotherapy. Complete epidemiological, diagnosis, and treatment data had to be available, and the histological slides, stained by the Hematoxylin & Eosin method, had to show preserved staining that enabled digital photographs of adequate quality.

The Santa Rita de Cássia Hospital, located in the city of Vitória, is considered the main reference hospital for cancer treatment in the Espírito Santo state and provided medical care for 625 women with breast cancer in 2015. Of the 276 cases selected for meeting the inclusion and exclusion criteria, 52 were additionally excluded by the pathologist at the Hospital Santa Rita de Cássia because the patients had "in situ" breast cancers. Since these cases could compromise machine learning and, consequently, the automated analysis of the images, the study finally included 224 cases.

The year 2015 was selected because, at the beginning of the study, the Tumor Record Sheets for that year represented the most recent and complete data released by the Health Information System - Hospital Cancer Registry of the Ministry of Health of the Federal Government of Brazil.

# b) Digitization of histological slides

All histological slides from the 224 selected cases were reviewed in random order by a pathologist at the Hospital Santa Rita de Cássia, without access to patient data, to select the samples with the best-preserved color. Twenty images of breast tissue from each selected patient were obtained using a digital camera (Moticam 1000 1.3 MPixel MTC 1000) attached to a light microscope.
# c) Loading images to CellProfiler

Of the 4,480 photographs digitized at 40-fold magnification, only the artifact-free images that the CellProfiler program recognized as adequate after upload were retained. Therefore, 1,937 images were transferred to the CellProfiler software and submitted to its algorithm, which generated for each digitized image an output with 47 quantitative parameters, called attributes. These attributes are aspects and characteristics identified by the CellProfiler software that express the averages of the quantitative parameters of the study's objects (the images) and that enabled the automated identification and classification of each object.

# d) CellProfiler algorithm

Following an algorithm developed for treating digitized images in the CellProfiler computational environment, all 1,937 images were processed through the 9-step sequence shown in Chart 1.

Chart 1: CellProfiler algorithm.

The 1,937 digitized photographs treated according to this algorithm resulted in a data set that was exported to the Tanagra data analysis software. This dataset was then arranged in an Excel spreadsheet (Microsoft®), and the automated classifications of the tubular, nuclear, and mitotic indexes, as well as the histological degree of malignancy, were obtained.

# e) CellProfiler Algorithm

# i. Phase 1 - Load Images

All the digitized images taken from histological slides at 40-fold magnification were transferred to the CellProfiler software (Figure 1a).

# ii. Phase 2 - Color to Gray

The original scanned images were converted to the white/gray/black spectrum (Figure 1b).

# iii. Phase 3 - ImageMath

Since the CellProfiler software identifies the study's objects according to light intensity, the image intensities had to be inverted: the cell nuclei, initially stained black, became white, and the other elements became black (Figure 1c).

# iv. Phase 4 - Apply Threshold

In this stage, a binary image (i.e., an image with only two pixel intensities, 0 and 1) was created.

# v. Phase 5 - Identify Primary Objects

Cell nuclei were defined and identified as the primary objects of the study in this step of the algorithm (Figure 1d).

# vi. Phase 6 - Measure Object Size and Shape

Primary objects were measured in this step, and the parameters (attributes) identified by the CellProfiler software for each study object were obtained as the average of these measurements.

# vii. Phase 7 - Filter Objects

An image filter was applied to suppress changes that could interfere with the analysis of the primary objects, eliminating artifacts and preserving only the cell nuclei (Figure 1e).

# viii. Phase 8 - Measure Object Size and Shape

After applying the image filter and eliminating artifactual changes, a new measurement of the attributes of the primary objects (cell nuclei) was performed.

# ix. Phase 9 - Export to Database

After the CellProfiler algorithm steps, 47 quantitative attributes were obtained for each primary object from the qualitative information in the digitized images. Defined as parameters, they enabled the individual identification and analysis of each primary object. This list of attributes constituted the database exported to the Tanagra data analysis software.

# f) Classification after machine learning

Tanagra is open-source software for data analysis and statistical analysis, designed around machine learning methods. In the present study, Tanagra was used to perform the automated classification of the malignancy degree of breast cancers for the tubular, nuclear, and mitotic index scores, as well as for the histological grade. The three parameters used to define the histological grade in breast cancer (the tubular aspect, the nuclear morphology, and the count of cells in mitosis) were analyzed from the database containing 47 quantitative parameters for each analyzed object of the study.
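Phases 2 to 7 of the CellProfiler algorithm described in section e) can be illustrated with a minimal sketch in plain Python. This is not CellProfiler's implementation: the function names, the channel-averaging grayscale conversion, the fixed cutoff of 128, the 4-connected labeling, the three attributes measured, and the minimum area of 4 pixels are all assumptions chosen for illustration, and the tiny synthetic image merely stands in for a digitized slide.

```python
from collections import deque

LIGHT = (255, 255, 255)  # background pixel (hypothetical values)
DARK = (30, 30, 60)      # hematoxylin-like dark pixel (hypothetical values)

def to_gray(rgb):
    """Phase 2: average the R, G, B channels of each pixel."""
    return [[sum(px) / 3.0 for px in row] for row in rgb]

def invert(gray, max_value=255.0):
    """Phase 3: invert intensities so dark nuclei become bright."""
    return [[max_value - v for v in row] for row in gray]

def threshold(img, cutoff):
    """Phase 4: binary image, 1 where intensity exceeds the cutoff."""
    return [[1 if v > cutoff else 0 for v in row] for row in img]

def label_objects(binary):
    """Phase 5: group 4-connected foreground pixels into objects."""
    h, w = len(binary), len(binary[0])
    seen = [[False] * w for _ in range(h)]
    objects = []
    for r in range(h):
        for c in range(w):
            if binary[r][c] != 1 or seen[r][c]:
                continue
            seen[r][c] = True
            queue, pixels = deque([(r, c)]), []
            while queue:  # breadth-first flood fill of one object
                y, x = queue.popleft()
                pixels.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < h and 0 <= nx < w
                            and binary[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            objects.append(pixels)
    return objects

def measure(pixels):
    """Phase 6: a few simple size and shape attributes for one object."""
    ys = [y for y, _ in pixels]
    xs = [x for _, x in pixels]
    return {
        "area": len(pixels),
        "height": max(ys) - min(ys) + 1,
        "width": max(xs) - min(xs) + 1,
    }

def filter_objects(objects, min_area):
    """Phase 7: discard small objects, treated here as artifacts."""
    return [obj for obj in objects if len(obj) >= min_area]

def make_demo_image():
    """6x6 synthetic image: one 2x3 'nucleus' and one 1-pixel speck."""
    img = [[LIGHT for _ in range(6)] for _ in range(6)]
    for r in (1, 2):
        for c in (1, 2, 3):
            img[r][c] = DARK
    img[4][5] = DARK
    return img

binary = threshold(invert(to_gray(make_demo_image())), cutoff=128.0)
objects = label_objects(binary)
nuclei = filter_objects(objects, min_area=4)
attributes = [measure(obj) for obj in nuclei]
print(len(objects), "objects found;", len(nuclei), "kept after filtering")
print(attributes)
```

In this sketch the single-pixel speck plays the role of an artifact removed in Phase 7, and the attributes are measured only on the objects that remain, mirroring the second measurement described for Phase 8. In the study itself, each image yielded 47 attributes rather than the three shown here.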
# Statistical Analysis

The tubular, nuclear, and mitotic index scores, which together define the histological degree of malignancy in breast cancer, were determined. The statistical parameters of Predictive Values, Accuracy, Error, and the Kappa Index of agreement between the pathologist and the automated analysis program were also used in this phase. The programs Tanagra and MedCalc were used for statistical processing. The statistical parameters gathered were used to determine the histological degree of malignancy.

# Results

The present study aimed to perform an automated and reproducible classification of the pathological parameters used to diagnose breast cancer: nuclear score, tubular score, and mitotic index. The automated classification results are depicted in Table 1, while the outcomes comparing the pathological and the automated diagnoses are shown in Table 2. A scatter plot of the automated classification resulting from machine learning is shown in Figure 2.

# Discussion

Artificial Intelligence, particularly machine learning, has been increasingly used as a safe and effective tool in disease diagnosis and prognosis, especially in studies assessing breast cancer, a disease with a high impact on many women's lives. This study stands out as a pioneering publication using free-access software to diagnose the histological degree of malignancy in breast cancer. Thus, automated analysis is a feasible tool for obtaining safe diagnoses of histopathological parameters, since a dataset with sufficient information for adequate machine learning can support an efficient analysis with high accuracy.

In conclusion, digitized images of breast cancer histological slides enabled the automated analysis of histopathological parameters, converting them into quantitative parameters for the diagnosis and defining the histological degree of malignancy.
Expanding the database is necessary to optimize the analysis and to give the machine learning process sufficient information and data to support all requested aspects of the analysis. In this sense, further multidisciplinary studies covering machine learning and breast cancer in women may lead to significant novel contributions.

![Figure 1a: Original image of breast cancer tissue](image-2.png "Figure 1a")

![Figure 1b: Figure 1a converted to greyscale. Figure 1c: Figure 1b with inverted intensities. Figure 1d & e: Initial identification of nuclei. Figure 1f: Remaining objects after filtering for subsequent analysis](image-3.png "Figure 1b")

![Figure 2a: Classification of malignancy using the tubular score](image-4.png "Figure 2a")

![](image-5.png "") ![](image-6.png "") ![](image-7.png "")

## Conflicts of interest: None declared.

Author contributions: PCRB: Taking images, writing, cooperation with pathology. RDJ: Pathological diagnosis. CSMN: Image analysis, writing. IPPQ: Image analysis, writing. SSS: Image analysis, writing. DL: Supervision, statistical processing, machine learning.

* American Cancer Society. Breast Cancer Facts & Figures 2019-2020. Atlanta:
American Cancer Society, Inc.; 2019.
* Araújo T, Aresta G, Castro E, Rouco J, Aguiar P, Eloy C, Polónia A, Campilho A. Classification of breast cancer histology images using Convolutional Neural Networks. PLoS One. 2017;12(6):177544.
* Beck AH, Sangoi AR, Leung S, Marinelli RJ, Nielsen TO, van de Vijver MJ, West RB, van de Rijn M, Koller D. Systematic analysis of breast cancer morphology uncovers stromal features associated with survival. Sci Transl Med. 2011;(108).
* Buzin R, Pinto FE, Nieschke K, Mittag A, de Andrade TU, Endringer DC, Tarnok A, Lenz D. Replacement of specific markers for apoptosis and necrosis by nuclear morphology for affordable cytometry. Journal of Immunological Methods. 2015;1.
* Carpenter AE, Jones TR, Lamprecht MR, Clarke C, Kang IH, Friman O, Guertin DA, Chang JH, Lindquist RA, Moffat J, Golland P, Sabatini DM. CellProfiler: image analysis software for identifying and quantifying cell phenotypes. Genome Biol. 2006 Oct 31;7(10).
* Chen JM, Li Y, Xu J, Gong L, Wang LW, Liu WL, Liu J. Computer-aided prognosis on breast cancer with hematoxylin and eosin histopathology images: A review. Tumour Biol. 2017:1010428317694550.
* Ching T, Himmelstein DS, Beaulieu-Jones BK, Kalinin AA, Do BT, Way GP, Ferrero E, Agapow PM, Zietz M, Hoffman MM, Xie W, Rosen GL, Lengerich BJ, Israeli J, Lanchantin J, Woloszynek S, Carpenter AE, Shrikumar A, Xu J, Cofer EM, Lavender CA, Turaga SC, Alexandari AM, Lu Z, Harris DJ, DeCaprio D, Qi Y, Kundaje A, Peng Y, Wiley LK, Segler MHS, Boca SM, Swamidass SJ, Huang A, Gitter A, Greene CS. Opportunities and obstacles for deep learning in biology and medicine. J R Soc Interface. 15(141):20170387.
* Dordea AC, Bray MA, Allen K, Logan DJ, Fei F, Malhotra R, Gregory MS, Carpenter AE, Buys ES. An open-source computational tool to automatically quantify immunolabeled retinal ganglion cells. Exp Eye Res. 2016;147.
* Eulenberg P, Köhler N, Blasi T, Filby A, Carpenter AE, Rees P, Theis FJ, Wolf FA. Reconstructing cell cycle and disease progression using deep learning. Nat Commun. 2017;8:463.
* Han Z, Wei B, Zheng Y, Yin Y, Li K, Li S. Breast Cancer Multi-classification from Histopathological Images with Structured Deep Learning Model. Sci Rep. 2017;7(1):4172.
* Hennig H, Rees P, Blasi T, Kamentsky L, Hung J, Dao D, Carpenter AE, Filby A. An open-source solution for advanced imaging flow cytometry data analysis using machine learning. Methods. 2017;112.
* Hitchcock CL. The future of telepathology for the developing world. Arch Pathol Lab Med. 2011;135(2).
* INCA. Incidência do Câncer no Brasil 2018. Rio de Janeiro, Brasil: INCA; 2017.
* Jones TR, Carpenter AE, Lamprecht MR, Moffat J, Silver SJ, Grenier JK, Castoreno AB, Eggert US, Root DE, Golland P, Sabatini DM. Scoring diverse cellular morphologies in image-based screens with iterative feedback and machine learning. Proc Natl Acad Sci U S A. 2009;106.
* Lamprecht MR, Sabatini DM, Carpenter AE. CellProfiler: free, versatile software for automated biological image analysis. Biotechniques. 2007;42(1).
* Lenz D, Gasparini LS, Macedo ND, Pimentel EF, Fronza M, Junior VL, Borges WS, Cole ER, Andrade TU, Endringer DC. In vitro cell viability by CellProfiler® software as equivalent to MTT assay. 2017;13:365.
* Loukas C, Kostopoulos S, Tanoglidi A, Glotsos D, Sfikas C, Cavouras D. Breast cancer characterization based on image classification of tissue sections visualized under low magnification. Comput Math Methods Med. 2013:829461.
* Lu C, Romo-Bucheli D, Wang X, Janowczyk A, Ganesan S, Gilmore H, Rimm D, Madabhushi A. Nuclear shape and orientation features from H&E images predict survival in early-stage estrogen receptor-positive breast cancers. Lab Invest. 2018;98(11).
* Misselwitz B, Strittmatter G, Periaswamy B, Schlumberger MC, Rout S, Horvath P, Kozak K, Hardt WD. Enhanced CellClassifier: a multi-class classification tool for microscopy images. BMC Bioinformatics. 2010;11:30.
* Mulrane L, Rexhepaj E, Penney S, Callanan JJ, Gallagher WM. Automated image analysis in histopathology: a valuable tool in medical diagnostics. Expert Rev Mol Diagn. 2008 Nov;8.
* Pesapane F, Codari M, Sardanelli F. Artificial intelligence in medical imaging: threat or opportunity? Radiologists again at the forefront of innovation in medicine. Eur Radiol Exp. 2018;2(1):35.
* Romo-Bucheli D, Janowczyk A, Gilmore H, Romero E, Madabhushi A. A Deep Learning Based Strategy for Identifying and Associating Mitotic Activity with Gene Expression Derived Risk Categories in Estrogen Receptor Positive Breast Cancers. Cytometry A. 2017 Jun;91(6).
* Singh S, Carpenter AE, Genovesio A. Increasing the Content of High-Content Screening: An Overview. J Biomol Screen. 2014 Jun;19(5).
* Sommer C, Gerlich DW. Machine learning in cell biology - teaching computers to recognize phenotypes. J Cell Sci. 2013 Dec 15;126(Pt 24).
* Wernick MN, Yang Y, Brankov JG, Yourganov G, Strother SC. Machine Learning in Medical Imaging. IEEE Signal Processing Magazine. 2010 Jul;27(4).
* Whitney J, Corredor G, Janowczyk A, Ganesan S, Doyle S, Tomaszewski J, Feldman M, Gilmore H, Madabhushi A. Quantitative nuclear histomorphometry predicts oncotype DX risk categories for early stage ER+ breast cancer. BMC Cancer. 2018 May 30;18(1):610.
* Xu J, Lei XQ, Gilmore H, Wu J, Tang J, Madabhushi A. Stacked Sparse Autoencoder (SSAE) for Nuclei Detection on Breast Cancer Histopathology Images. IEEE Trans Med Imaging. 2016 Jan;35(1).
* Yu KH, Zhang C, Berry GJ, Altman RB, Ré C, Rubin DL, Snyder M. Predicting non-small cell lung cancer prognosis by fully automated microscopic pathology image features. Nat Commun. 2016 Aug 16;7:12474.