The flashcards below were created by user tttran1 on FreezingBlue Flashcards.

  1. Why Data Mining?
    • -More intense competition at the global scale
    • -Recognition of the value in data sources
    • -Availability of quality data on customers, vendors, transactions, Web, etc.
    • -Consolidation and integration of data repositories into data warehouses
    • -The exponential increase in data processing and storage capabilities, and the decrease in their cost
    • -Movement toward conversion of information resources into nonphysical form
  2. Definition of Data Mining
    • -The nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data stored in structured databases (Fayyad et al., 1996)
    • -Keywords in this definition: Process, nontrivial, valid, novel, potentially useful, understandable.
    • -Data mining: a misnomer?
    • -Other names: knowledge extraction, pattern analysis, knowledge discovery, information harvesting, pattern searching, data dredging,…
  3. Data Mining at the Intersection of Many Disciplines
    Slide 8
  4. Data Mining Characteristics/Objectives
    • -Source of data for DM is often a consolidated data warehouse (not always!)
    • -DM environment is usually a client-server or a Web-based information systems architecture
    • -Data is the most critical ingredient for DM, and it may include soft/unstructured data
    • -The miner is often an end user
    • -Striking it rich requires creative thinking
    • -Data mining tools’ capabilities and ease of use are essential (Web, Parallel processing, etc.)
  5. Data in Data Mining
    • Data: a collection of facts usually obtained as the result of experiences, observations, or experiments
    • -Data may consist of numbers, words, images, …
    • -Data: lowest level of abstraction (from which information and knowledge are derived)
    • Slide 10
    • Examples
    • -Categorical: Male, Female
    • -Ordinal: Freshman, Sophomore, Junior, Senior
  6. Data Mining Applications
  7. Customer Relationship Management
    • -Maximize return on marketing campaigns
    • -Improve customer retention (churn analysis)
    • -Maximize customer value (cross-, up-selling)
    • -Identify and treat most valued customers
  8. Banking and Other Financial
    • -Automate the loan application process
    • -Detect fraudulent transactions
    • -Maximize customer value (cross-, up-selling)
    • -Optimize cash reserves with forecasting
  9. Retailing and Logistics
    • -Optimize inventory levels at different locations
    • -Improve the store layout and sales promotions
    • -Optimize logistics by predicting seasonal effects
    • -Minimize losses due to limited shelf life
  10. Manufacturing and Maintenance
    • -Predict/prevent machinery failures
    • -Identify anomalies in production systems to optimize the use of manufacturing capacity
    • -Discover novel patterns to improve product quality
  11. Brokerage and Securities Trading
    • -Predict changes in certain bond prices
    • -Forecast the direction of stock fluctuations
    • -Assess the effect of events on market movements
    • -Identify and prevent fraudulent activities in trading
  12. Insurance
    • -Forecast claim costs for better business planning
    • -Determine optimal rate plans
    • -Optimize marketing to specific customers
    • -Identify and prevent fraudulent claim activities
  13. Data Mining Process
    • -A manifestation of best practices
    • -A systematic way to conduct DM projects
    • -Different groups have different versions
    • -Most common standard processes:
    • -CRISP-DM (Cross-Industry Standard Process for Data Mining)
    • -SEMMA (Sample, Explore, Modify, Model, and Assess)
    • -KDD (Knowledge Discovery in Databases)
  14. Data Mining Process
    Slide 16
  15. Data Mining Process: CRISP-DM
    • Slide 17
    • Step 1: Business Understanding
    • Step 2: Data Understanding
    • Step 3: Data Preparation (!)
    • Step 4: Model Building
    • Step 5: Testing and Evaluation
    • Step 6: Deployment
    • -The process is highly repetitive and experimental (DM: art versus science?)
  16. Data Preparation – A Critical DM Task
  17. Data Mining Process: SEMMA

    • Data Mining Methods: Classification
    • -Most frequently used DM method
    • -Part of the machine-learning family
    • -Employs supervised learning
    • -Learn from past data, classify new data
    • -The output variable is categorical (nominal or ordinal) in nature
    • -Classification versus regression?
    • -Classification versus clustering?
  18. Assessment Methods for Classification
    • -Predictive accuracy
    • --Hit rate
    • -Speed
    • --Model building; predicting
    • -Robustness
    • -Scalability
    • -Interpretability
    • --Transparency, explainability
  19. Accuracy of Classification Models
    • -In classification problems, the primary source for accuracy estimation is the confusion matrix
    • Slide 23
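To make the link between the confusion matrix and predictive accuracy (hit rate) concrete, here is a minimal sketch. The function names and the sample labels are made up for illustration; they do not come from the slides.

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Count how often each (actual, predicted) label pair occurs."""
    return Counter(zip(actual, predicted))

def accuracy(matrix):
    """Hit rate: correctly classified cases divided by all cases."""
    total = sum(matrix.values())
    correct = sum(n for (a, p), n in matrix.items() if a == p)
    return correct / total

# Hypothetical labels, for illustration only
actual    = ["yes", "yes", "no", "no", "yes", "no"]
predicted = ["yes", "no",  "no", "no", "yes", "yes"]

cm = confusion_matrix(actual, predicted)
print(cm[("yes", "yes")], cm[("no", "no")])  # diagonal cells: correct predictions
print(accuracy(cm))                          # 4 correct out of 6 cases
```

The diagonal cells (actual equals predicted) are the hits; every other cell is a misclassification.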

    • Estimation Methodologies for Classification
    • -Simple split (or holdout or test sample estimation)
    • -Split the data into two mutually exclusive sets: training (~70%) and testing (~30%)
    • Slide 24
    • -For artificial neural networks (ANN), the data is split into three subsets (training [~60%], validation [~20%], testing [~20%])
    • -k-Fold Cross Validation (rotation estimation)
    • -Split the data into k mutually exclusive subsets
    • -Use each subset as testing while using the rest of the subsets as training
    • -Repeat the experimentation for k times
    • -Aggregate the test results for a true estimation of prediction accuracy
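The two estimation methodologies above can be sketched in plain Python. This is an illustrative sketch, not a reference implementation: the helper names, the fixed random seed, and the majority-class "model" used in the demo are all assumptions made for the example.

```python
import random
from collections import Counter

def holdout_split(rows, train_fraction=0.7, seed=0):
    """Simple split: shuffle, then cut into training and testing sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def k_fold_splits(rows, k, seed=0):
    """Yield (training, testing) pairs; each fold is the test set exactly once."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]  # k mutually exclusive subsets
    for i in range(k):
        training = [row for j, fold in enumerate(folds) if j != i for row in fold]
        yield training, folds[i]

def cross_validated_score(rows, k, fit, score):
    """Repeat the experiment k times and average the test results."""
    results = [score(fit(train), test) for train, test in k_fold_splits(rows, k)]
    return sum(results) / len(results)

# Hypothetical demo: a "model" that always predicts the majority class
def fit_majority(train):
    return Counter(label for _, label in train).most_common(1)[0][0]

def hit_rate(model, test):
    return sum(label == model for _, label in test) / len(test)

data = [(i, "a" if i % 3 else "b") for i in range(30)]  # 20 "a", 10 "b"
train, test = holdout_split(data)
print(len(train), len(test))
print(cross_validated_score(data, 5, fit_majority, hit_rate))
```

Passing `fit` and `score` as functions keeps the splitting logic independent of any particular classification technique.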
  20. Classification Techniques
    • -Decision tree analysis
    • -Statistical analysis
    • -Neural networks
    • -Support vector machines
    • -Case-based reasoning
    • -Bayesian classifiers
    • -Genetic algorithms
    • -Rough sets
  21. Decision Trees
    • -Employs the divide and conquer method
    • -Recursively divides a training set until each division consists of examples from one class
    • -1. Create a root node and assign all of the training data to it
    • -2. Select the best splitting attribute
    • -3. Add a branch to the root node for each value of the split. Split the data into mutually exclusive subsets along the lines of the specific split
    • -4. Repeat steps 2 and 3 for each and every leaf node until the stopping criteria are reached
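The four steps above can be sketched as a short recursive function. This sketch makes several assumptions that are not in the cards: attributes are categorical, rows are dictionaries, the "best" split is the one with the lowest size-weighted Gini impurity, and the stopping criteria are a pure node or no attributes left. The weather-style data is invented for illustration.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a single class)."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def partition(rows, attribute):
    """Split rows into mutually exclusive subsets, one per attribute value."""
    subsets = {}
    for row in rows:
        subsets.setdefault(row[attribute], []).append(row)
    return subsets

def split_score(rows, attribute, target):
    """Size-weighted Gini impurity of the subsets produced by a split."""
    return sum(
        len(subset) / len(rows) * gini([r[target] for r in subset])
        for subset in partition(rows, attribute).values()
    )

def build_tree(rows, attributes, target):
    """Return a class label (leaf) or a tuple (attribute, {value: subtree})."""
    labels = [row[target] for row in rows]
    # Stopping criteria: the node is pure, or no attributes remain
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Step 2: select the best splitting attribute
    best = min(attributes, key=lambda a: split_score(rows, a, target))
    remaining = [a for a in attributes if a != best]
    # Steps 3 and 4: one branch per value, then recurse on each subset
    return (best, {value: build_tree(subset, remaining, target)
                   for value, subset in partition(rows, best).items()})

def classify(tree, row):
    """Follow branches until a leaf label is reached."""
    while isinstance(tree, tuple):
        attribute, branches = tree
        tree = branches[row[attribute]]
    return tree

# Hypothetical training data, for illustration only
rows = [
    {"outlook": "sunny", "windy": "no",  "play": "yes"},
    {"outlook": "sunny", "windy": "yes", "play": "yes"},
    {"outlook": "rainy", "windy": "no",  "play": "yes"},
    {"outlook": "rainy", "windy": "yes", "play": "no"},
    {"outlook": "rainy", "windy": "yes", "play": "no"},
    {"outlook": "sunny", "windy": "no",  "play": "yes"},
    {"outlook": "rainy", "windy": "no",  "play": "yes"},
]
tree = build_tree(rows, ["outlook", "windy"], "play")
print(tree)
print(classify(tree, {"outlook": "rainy", "windy": "yes"}))
```

The sketch does no pruning, and `classify` raises a `KeyError` for an attribute value that never appeared in the training data.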
  22. DT algorithms mainly differ on
    • -Splitting criteria
    • --Which variable to split first?
    • --What values to use to split?
    • --How many splits to form for each node?
    • -Stopping criteria
    • --When to stop building the tree
    • -Pruning (generalization method)
    • --Pre-pruning versus post-pruning
  23. Most popular DT algorithms include
    -ID3, C4.5, C5; CART; CHAID; M5
  24. Alternative splitting criteria
    • -Gini index determines the purity of a specific class as a result of a decision to branch along a particular attribute/value
    • --Used in CART
    • -Information gain uses entropy to measure the extent of uncertainty or randomness of a particular attribute/value split
    • --Used in ID3, C4.5, C5
    • -Chi-square statistics (used in CHAID)
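As a rough illustration of the first two criteria, the sketch below computes the Gini index and entropy-based information gain for a candidate split. The formulas are the standard textbook ones; the function names and the toy labels are assumptions for the example.

```python
import math
from collections import Counter

def gini_index(labels):
    """1 minus the sum of squared class proportions; 0.0 means a pure node."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def entropy(labels):
    """Uncertainty of the class distribution, in bits."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(parent, subsets):
    """Entropy of the parent minus the size-weighted entropy of its subsets."""
    total = len(parent)
    remainder = sum(len(s) / total * entropy(s) for s in subsets)
    return entropy(parent) - remainder

# Hypothetical node with 4 "yes" and 4 "no" cases, split perfectly in two
parent = ["yes"] * 4 + ["no"] * 4
subsets = [["yes"] * 4, ["no"] * 4]

print(gini_index(parent))                 # maximally mixed two-class node
print(gini_index(subsets[0]))             # pure node
print(information_gain(parent, subsets))  # a perfect split removes all uncertainty
```

A split that leaves the class mix unchanged in every subset would score an information gain of zero.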
  25. Cluster Analysis for Data Mining
    • -Used for automatic identification of natural groupings of things
    • -Part of the machine-learning family
    • -Employs unsupervised learning
    • -Learns the clusters of things from past data, then assigns new instances
    • -There is not an output variable
    • -Also known as segmentation
  26. Clustering results may be used to
    • -Identify natural groupings of customers
    • -Identify rules for assigning new cases to classes for targeting/diagnostic purposes
    • -Provide characterization, definition, labeling of populations
    • -Decrease the size and complexity of problems for other data mining methods
    • -Identify outliers in a specific domain (e.g., rare-event detection)
  27. Analysis methods
    • -Statistical methods (including both hierarchical and nonhierarchical), such as k-means, k-modes, and so on
    • -Neural networks (adaptive resonance theory [ART], self-organizing map [SOM])
    • -Fuzzy logic (e.g., fuzzy c-means algorithm)
    • -Genetic algorithms
  28. Divisive versus Agglomerative methods
  29. How many clusters?
    • -There is not a “truly optimal” way to calculate it
    • -Heuristics are often used
    • --Look at the sparseness of clusters
    • --Number of clusters = (n/2)^(1/2), i.e., the square root of n/2 (n: number of data points)
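A quick worked example of the rule of thumb above. Rounding to the nearest whole number and enforcing a minimum of one cluster are assumptions added here, since the heuristic itself does not say how to handle fractional results.

```python
import math

def rule_of_thumb_k(n):
    """Heuristic cluster count: square root of n/2, rounded, at least 1."""
    return max(1, round(math.sqrt(n / 2)))

print(rule_of_thumb_k(200))  # square root of 100
```

The result is only a starting point; as the card notes, there is no truly optimal way to calculate the number of clusters.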
  30. Most cluster analysis methods involve the use of a distance measure to calculate the closeness between pairs of items
    -Euclidean versus Manhattan (rectilinear) distance
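The difference between the two measures is easy to see on a pair of points. This is a minimal sketch; the function names and the sample points are chosen for illustration.

```python
import math

def euclidean(p, q):
    """Straight-line distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    """Rectilinear distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))

print(euclidean((0, 0), (3, 4)))  # 5.0
print(manhattan((0, 0), (3, 4)))  # 7
```

Manhattan distance is never smaller than Euclidean distance for the same pair of points, because it measures a path along the axes instead of a straight line.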
  31. k-Means Clustering Algorithm
    • -k : pre-determined number of clusters
    • -Algorithm (Step 0: determine value of k)
    • Step 1: Randomly generate k points as initial cluster centers
    • Step 2: Assign each point to the nearest cluster center
    • Step 3: Re-compute the new cluster centers
    • Repetition step: Repeat steps 2 and 3 until some convergence criterion is met (usually that the assignment of points to clusters becomes stable)
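The steps above can be sketched in a few lines of Python. Several choices here are assumptions and not part of the cards: initial centers are sampled from the data points themselves (one common way to "randomly generate" them), distance is Euclidean, the random seed is fixed, and an iteration cap guards against long runs. The sample points are invented.

```python
import math
import random

def k_means(points, k, seed=0, max_iterations=100):
    """Return (centers, assignment) where assignment[i] is the cluster of points[i]."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # Step 1: k random initial centers
    assignment = None
    for _ in range(max_iterations):
        # Step 2: assign each point to the nearest cluster center
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        # Convergence criterion: the assignment is stable
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 3: re-compute each center as the mean of its members
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centers, assignment

# Hypothetical data: two well-separated groups of three points
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, assignment = k_means(points, k=2)
print(assignment)
print(centers)
```

Which group ends up as cluster 0 depends on the random starting centers, so only the grouping itself is meaningful, not the label numbers.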
  32. Cluster Analysis for Data Mining - k-Means Clustering Algorithm

    • Association Rule Mining
    • -A very popular DM method in business
    • -Finds interesting relationships (affinities) between variables (items or events)
    • -Part of the machine-learning family
    • -Employs unsupervised learning
    • -There is no output variable
    • -Also known as market basket analysis
    • -Often used as an example to describe DM to ordinary people, such as the famous “relationship between diapers and beer!”
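A small market basket sketch can show what "finding affinities" looks like in practice. Support and confidence are the measures commonly used for this, but they are not defined in these cards, so treat them as an assumption of the example. The function names and the baskets are invented.

```python
from collections import Counter
from itertools import combinations

def pair_support(baskets):
    """Fraction of baskets that contain each unordered pair of items."""
    counts = Counter()
    for basket in baskets:
        counts.update(combinations(sorted(set(basket)), 2))
    return {pair: n / len(baskets) for pair, n in counts.items()}

def confidence(baskets, antecedent, consequent):
    """Share of baskets with the antecedent that also contain the consequent."""
    with_antecedent = [b for b in baskets if antecedent in b]
    both = [b for b in with_antecedent if consequent in b]
    return len(both) / len(with_antecedent)

# Hypothetical transactions, for illustration only
baskets = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"milk", "bread"},
]
support = pair_support(baskets)
print(support[("beer", "diapers")])            # 2 of 4 baskets
print(confidence(baskets, "diapers", "beer"))  # 2 of the 3 diaper baskets
```

No output variable is involved: the rule "diapers → beer" is read straight off the co-occurrence counts, which is what makes the method unsupervised.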
  33. Data Mining Software
    • Commercial
    • -SPSS - PASW (formerly Clementine)
    • -SAS - Enterprise Miner
    • -IBM - Intelligent Miner
    • -StatSoft – Statistical Data Miner
    • -… many more
    • Free and/or Open Source
    • -Weka
    • -RapidMiner…
Card Set:
2013-03-04 16:35:49
