github.com/vlifesystems/rulehunter@v0.0.0-20180501090014-673078aa4a83/examples/csv/breast_cancer_wisconsin.txt

     1  1. Title: Wisconsin Diagnostic Breast Cancer (WDBC)
     2  
     3  2. Source Information
     4  
     5  a) Creators:
     6  
     7  	Dr. William H. Wolberg, General Surgery Dept., University of
     8  	Wisconsin, Clinical Sciences Center, Madison, WI 53792
     9  	wolberg@eagle.surgery.wisc.edu
    10  
    11  	W. Nick Street, Computer Sciences Dept., University of
    12  	Wisconsin, 1210 West Dayton St., Madison, WI 53706
    13  	street@cs.wisc.edu  608-262-6619
    14  
    15  	Olvi L. Mangasarian, Computer Sciences Dept., University of
    16  	Wisconsin, 1210 West Dayton St., Madison, WI 53706
    17  	olvi@cs.wisc.edu
    18  
    19  b) Donor: Nick Street
    20  
    21  c) Date: November 1995
    22  
    23  Data Source:
    24    UCI Machine Learning Repository [http://archive.ics.uci.edu/ml].
    25    Irvine, CA: University of California, School of Information and
    26    Computer Science.
    27    https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)
    28  
    29  3. Past Usage:
    30  
    31  First usage:
    32  
    33  	W.N. Street, W.H. Wolberg and O.L. Mangasarian
    34  	Nuclear feature extraction for breast tumor diagnosis.
    35  	IS&T/SPIE 1993 International Symposium on Electronic Imaging: Science
    36  	and Technology, volume 1905, pages 861-870, San Jose, CA, 1993.
    37  
    38  Operations research literature:
    39  
    40  	O.L. Mangasarian, W.N. Street and W.H. Wolberg.
    41  	Breast cancer diagnosis and prognosis via linear programming.
    42  	Operations Research, 43(4), pages 570-577, July-August 1995.
    43  
    44  Medical literature:
    45  
    46  	W.H. Wolberg, W.N. Street, and O.L. Mangasarian.
    47  	Machine learning techniques to diagnose breast cancer from
    48  	fine-needle aspirates.
    49  	Cancer Letters 77 (1994) 163-171.
    50  
    51  	W.H. Wolberg, W.N. Street, and O.L. Mangasarian.
    52  	Image analysis and machine learning applied to breast cancer
    53  	diagnosis and prognosis.
    54  	Analytical and Quantitative Cytology and Histology, Vol. 17
    55  	No. 2, pages 77-87, April 1995.
    56  
    57  	W.H. Wolberg, W.N. Street, D.M. Heisey, and O.L. Mangasarian.
    58  	Computerized breast cancer diagnosis and prognosis from fine
    59  	needle aspirates.
    60  	Archives of Surgery 1995;130:511-516.
    61  
    62  	W.H. Wolberg, W.N. Street, D.M. Heisey, and O.L. Mangasarian.
    63  	Computer-derived nuclear features distinguish malignant from
    64  	benign breast cytology.
    65  	Human Pathology, 26:792-796, 1995.
    66  
    67  See also:
    68  	http://www.cs.wisc.edu/~olvi/uwmp/mpml.html
    69  	http://www.cs.wisc.edu/~olvi/uwmp/cancer.html
    70  
    71  Results:
    72  
    73  	- predicting field 2, diagnosis: B = benign, M = malignant
    74  	- sets are linearly separable using all 30 input features
    75  	- best predictive accuracy obtained using one separating plane
    76  		in the 3-D space of Worst Area, Worst Smoothness and
    77  		Mean Texture.  Estimated accuracy 97.5% using repeated
    78  		10-fold cross-validations.  Classifier has correctly
    79  		diagnosed 176 consecutive new patients as of November
    80  		1995.
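
The repeated 10-fold cross-validation mentioned above can be sketched as
follows. This is a minimal illustration, not the authors' procedure: the
number of repeats (3), the seed, and the function name repeated_kfold are
assumptions, since the source does not say how the folds were drawn.

```python
import random

def repeated_kfold(n_samples, k=10, repeats=3, seed=0):
    """Yield (repeat, fold, train_idx, test_idx) for repeated k-fold CV."""
    rng = random.Random(seed)
    for r in range(repeats):
        order = list(range(n_samples))
        rng.shuffle(order)  # a fresh shuffle for every repeat
        # Fold i takes every k-th shuffled index, so fold sizes differ by at most 1.
        folds = [order[i::k] for i in range(k)]
        for i, test_idx in enumerate(folds):
            train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
            yield r, i, train_idx, test_idx

# 569 instances, as stated in this file.
splits = list(repeated_kfold(569))
```

Each instance is held out exactly once per repeat; an accuracy estimate
would average the per-fold test accuracy over all folds and repeats.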
    81  
    82  4. Relevant information
    83  
    84  	Features are computed from a digitized image of a fine needle
    85  	aspirate (FNA) of a breast mass.  They describe
    86  	characteristics of the cell nuclei present in the image.
    87  	A few of the images can be found at
    88  	http://www.cs.wisc.edu/~street/images/
    89  
    90  	The separating plane described above was obtained using
    91  	Multisurface Method-Tree (MSM-T) [K. P. Bennett, "Decision Tree
    92  	Construction Via Linear Programming." Proceedings of the 4th
    93  	Midwest Artificial Intelligence and Cognitive Science Society,
    94  	pp. 97-101, 1992], a classification method which uses linear
    95  	programming to construct a decision tree.  Relevant features
    96  	were selected using an exhaustive search in the space of 1-4
    97  	features and 1-3 separating planes.
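
To give a sense of the size of that exhaustive search, the sketch below
counts the candidate configurations. It assumes that all 30 input features
are eligible and that every feature subset is paired with every plane
count; the source does not describe how the search was actually organised.

```python
from math import comb

N_FEATURES = 30    # input features, per the attribute count in this file
MAX_FEATURES = 4   # subsets of 1-4 features
MAX_PLANES = 3     # 1-3 separating planes

# Number of feature subsets of each size.
subsets_by_size = {k: comb(N_FEATURES, k) for k in range(1, MAX_FEATURES + 1)}
n_subsets = sum(subsets_by_size.values())
# Assumption: each subset is tried with each plane count.
n_configs = n_subsets * MAX_PLANES

print(subsets_by_size)  # {1: 30, 2: 435, 3: 4060, 4: 27405}
print(n_subsets)        # 31930
print(n_configs)        # 95790
```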
    98  
    99  	The actual linear program used to obtain the separating plane
   100  	in the 3-dimensional space is that described in:
   101  	[K. P. Bennett and O. L. Mangasarian: "Robust Linear
   102  	Programming Discrimination of Two Linearly Inseparable Sets",
   103  	Optimization Methods and Software 1, 1992, 23-34].
   104  
   105  
   106  	This database is also available through the UW CS ftp server:
   107  
   108  	ftp ftp.cs.wisc.edu
   109  	cd math-prog/cpo-dataset/machine-learn/WDBC/
   110  
   111  5. Number of instances: 569
   112  
   113  6. Number of attributes: 32 (ID, diagnosis, 30 real-valued input features)
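
A record with this layout could be read along the following lines. The
sketch assumes comma-separated rows with no header and fields in the order
documented here; the CSV shipped alongside this file may differ (for
example, it may carry a header row). The helper name parse_wdbc_row and the
sample record are hypothetical.

```python
import csv
import io

def parse_wdbc_row(row):
    """Split one 32-field record into (id, diagnosis, features)."""
    if len(row) != 32:
        raise ValueError(f"expected 32 fields, got {len(row)}")
    record_id, diagnosis = row[0], row[1]
    if diagnosis not in ("M", "B"):
        raise ValueError(f"unknown diagnosis {diagnosis!r}")
    features = [float(v) for v in row[2:]]  # the 30 real-valued inputs
    return record_id, diagnosis, features

# Synthetic record for illustration only, not a row from the dataset.
sample = "1,B," + ",".join(["0.5"] * 30) + "\n"
for row in csv.reader(io.StringIO(sample)):
    record_id, diagnosis, features = parse_wdbc_row(row)
```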
   114  
   115  7. Attribute information
   116  
   117  1) ID number
   118  2) Diagnosis (M = malignant, B = benign)
   119  3-32) Thirty real-valued input features, described below
   120  
   121  Ten real-valued features are computed for each cell nucleus:
   122  
   123  	a) radius (mean of distances from center to points on the perimeter)
   124  	b) texture (standard deviation of gray-scale values)
   125  	c) perimeter
   126  	d) area
   127  	e) smoothness (local variation in radius lengths)
   128  	f) compactness (perimeter^2 / area - 1.0)
   129  	g) concavity (severity of concave portions of the contour)
   130  	h) concave points (number of concave portions of the contour)
   131  	i) symmetry
   132  	j) fractal dimension ("coastline approximation" - 1)
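
As a rough illustration of definitions a), c), d) and f), the sketch below
computes them for a contour given as a closed polygon. It is not the
authors' implementation (the papers listed above describe that): taking the
center as the centroid of the boundary points, and the function name
contour_features, are assumptions made here.

```python
import math

def contour_features(points):
    """Radius, perimeter, area and compactness of a closed polygonal contour."""
    n = len(points)
    # Assumption: the center is the centroid of the boundary points.
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    radius = sum(math.hypot(x - cx, y - cy) for x, y in points) / n
    perimeter = 0.0
    twice_area = 0.0
    for i in range(n):
        x0, y0 = points[i]
        x1, y1 = points[(i + 1) % n]  # wrap around to close the contour
        perimeter += math.hypot(x1 - x0, y1 - y0)
        twice_area += x0 * y1 - x1 * y0  # shoelace formula
    area = abs(twice_area) / 2.0
    compactness = perimeter ** 2 / area - 1.0  # definition f) above
    return radius, perimeter, area, compactness

# A 2 x 2 square: perimeter 8, area 4, compactness 8^2 / 4 - 1 = 15.
square = [(0, 0), (2, 0), (2, 2), (0, 2)]
radius, perimeter, area, compactness = contour_features(square)
```

For comparison, a perfect circle gives perimeter^2 / area = 4*pi, so its
compactness under definition f) is 4*pi - 1, about 11.57, the smallest
value any closed contour can attain.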
   133  
   134  Several of the papers listed above contain detailed descriptions of
   135  how these features are computed.
   136  
   137  The mean, standard error, and "worst" or largest (mean of the three
   138  largest values) of these features were computed for each image,
   139  resulting in 30 features.  For instance, field 3 is Mean Radius, field
   140  13 is Radius SE, field 23 is Worst Radius.
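
The field layout and the three per-image statistics can be sketched as
below. Two assumptions are made: within each block of ten fields the
features follow the a) to j) order listed above, and the standard error is
the sample standard deviation (n - 1 denominator) divided by sqrt(n). The
names field_name and summarise are illustrative only.

```python
import math

FEATURES = ["Radius", "Texture", "Perimeter", "Area", "Smoothness",
            "Compactness", "Concavity", "Concave Points", "Symmetry",
            "Fractal Dimension"]

def field_name(field):
    """Name of input field 3..32 (1-based, as numbered in this file)."""
    if not 3 <= field <= 32:
        raise ValueError("input features occupy fields 3-32")
    # Fields 3-12 are means, 13-22 standard errors, 23-32 worst values.
    group, feature = divmod(field - 3, 10)
    name = FEATURES[feature]
    return [f"Mean {name}", f"{name} SE", f"Worst {name}"][group]

def summarise(values):
    """Mean, standard error and 'worst' of one feature over an image's nuclei."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)  # sample variance
    se = math.sqrt(var / n)
    worst = sum(sorted(values)[-3:]) / 3  # mean of the three largest values
    return mean, se, worst

print(field_name(3), "|", field_name(13), "|", field_name(23))
# Mean Radius | Radius SE | Worst Radius
```

Under the same ordering assumption, the three features used by the
separating plane in the Results section (Worst Area, Worst Smoothness and
Mean Texture) would be fields 26, 27 and 4.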
   141  
   142  All feature values are recorded with four significant digits.
   143  
   144  8. Missing attribute values: none
   145  
   146  9. Class distribution: 357 benign, 212 malignant
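
As a quick consistency check on these counts, and to put the 97.5%
accuracy estimate in context, the accuracy of always predicting the
majority class can be worked out directly. This baseline is added here for
illustration and is not part of the original results.

```python
benign, malignant = 357, 212

total = benign + malignant                 # should match the instance count
baseline = max(benign, malignant) / total  # always predict "benign"

print(total)               # 569
print(round(baseline, 4))  # 0.6274
```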