Big Data - Exam II

Card Set Information

Author:
mjweston
ID:
237541
Filename:
Big Data - Exam II
Updated:
2013-10-01 07:28:33
Tags:
Data Mining
Folders:

Description:
Data Mining

The flashcards below were created by user mjweston on FreezingBlue Flashcards.


  1. data mining
    the nontrivial process of identifying valid, novel, potentially useful, & ultimately understandable patterns in any type of data
  2. categorical data
    represent the labels of multiple classes used to divide a variable into specific groups - also called discrete data implying that it represents a finite number of values with no continuum between them (ex: race, sex, age group, education level)
  3. nominal data
    simple codes assigned to objects as labels; the codes are not measurements (ex: marital status categorized as (1) single, (2) married, and (3) divorced)
  4. ordinal data
    contain codes assigned to objects or events as labels that also represent the rank order among them (ex: credit score - low, medium, high or age group - child, young, middle-aged, elderly)
  5. numeric data
    represent the numeric values of specific variables (ex: age, number of children, income, travel distance, temperature)
  6. interval data
    variables that can be measured on interval scales and have an arbitrary zero point (ex: a temperature of 0 degrees Celsius doesn't mean "no temperature")
  7. ratio data
    include measurement variables commonly found in the physical sciences and engineering where the data have a true (absolute) zero point, where 0 = none (ex: zero weight = no weight; also mass, length, time, plane angle, energy)
  8. categorical - nominal & ordinal
    numerical - interval & ratio
    two main types of data
  9. associations
    predictions
    clusters
    sequential relationships
    four major types of patterns identified by data mining
  10. - coupons & discounts
    - product placement
    - timing & cross-marketing
    actions that are based on association discovery
  11. associations (affinity grouping)
    find the commonly co-occurring groupings of things, such as beer and diapers going together in market-basket analysis
  12. - support
    - confidence
    - lift
    measures of predictive ability
  13. support
    refers to the percentage of baskets where the rule was true (both condition & result products were present) or the result was true (result products were present irrespective of condition)
  14. confidence
    measures the probability that the result product is present given that we know the condition product is present
  15. lift
    measures how much more likely the result product is when the condition product is present than it is across all baskets (if > 1 indicates that transactions containing the condition tend to contain the result more often than transactions that do not contain the condition)
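  16. support for rule - (baskets with cola + pizza) / (all baskets)
    support for condition - (baskets with cola) / (all baskets)
    support for result - (baskets with pizza) / (all baskets)
    confidence - (support for rule) / (support for condition)
    lift - (confidence) / (support for result)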
    computing measures of association
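
A minimal sketch of the formulas on this card, assuming a made-up list of five baskets (the items and counts are illustrative only, not from the course material). The function names are hypothetical.

```python
# Toy market baskets; contents are invented for illustration.
baskets = [
    {"cola", "pizza"},
    {"cola", "pizza", "chips"},
    {"cola"},
    {"pizza"},
    {"chips"},
]


def support(items, baskets):
    """Fraction of baskets that contain every item in `items`."""
    items = set(items)
    return sum(items <= basket for basket in baskets) / len(baskets)


def association_measures(condition, result, baskets):
    """Return (rule support, confidence, lift) for the rule condition -> result."""
    rule = support({condition, result}, baskets)  # cola + pizza / all
    cond = support({condition}, baskets)          # cola / all
    res = support({result}, baskets)              # pizza / all
    confidence = rule / cond                      # rule / condition
    lift = confidence / res                       # confidence / result
    return rule, confidence, lift


rule_support, confidence, lift = association_measures("cola", "pizza", baskets)
print(rule_support, confidence, lift)  # 0.4, about 0.67, about 1.11
```

Here the lift is slightly above 1, so in this toy data baskets with cola contain pizza a little more often than baskets overall.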
  17. - understanding of business problem
    - what data are relevant for study
    - identify missing data fields, data noise, etc.
    - develop model to explain relationships
    - search for patterns of interest
    - review results to refine model
    - use refined model to predict output for set of inputs where output is not yet known
    - take action on discovered patterns
    general process of knowledge discovery
  18. Sample, Explore, Modify, Model, and Assess - a common standard data mining process
    SEMMA
  19. Knowledge Discovery in Databases - common standard data mining process
    KDD
  20. predictions
    tell the nature of future occurrences of certain events based on what has happened in the past such as predicting the winner of the Super Bowl or forecasting the absolute temperature of a particular day
  21. clusters
    identify natural groupings of things based on their known characteristics, such as assigning customers to different segments based on their demographics and past purchase behaviors
  22. sequential relationships
    discover time-ordered events, such as predicting that an existing banking customer who already has a checking account will open a savings account followed by an investment account within a year
  23. prediction
    association
    clustering
    three main categories of data mining
  24. supervised learning algorithms
    type of learning algorithms whose training data include both the descriptive attributes (i.e., independent variables or decision variables) and the class attribute (i.e., output variable or result variable)
  25. unsupervised learning algorithms
    type of learning algorithms whose training data include only the descriptive attributes (i.e., independent variables or decision variables)
  26. prediction
    takes into account the experiences, opinions, and other relevant information in conducting the task of foretelling
  27. prediction, forecasting
    _____ is largely experience and opinion based, whereas ____ is data and model based.
  28. classification & regression
    two major types of prediction
  29. classification (supervised induction)
    where the predicted thing, such as tomorrow's forecast, is a class label such as "rainy" or "sunny"; the objective is to analyze the historical data stored in a database and automatically generate a model that can predict future behavior
  30. regression
    where the predicted thing, such as tomorrow's temperature, is a real number, such as "65 degrees F"
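
As one possible illustration of regression, the sketch below fits a straight line by ordinary least squares and uses it to predict a real number. The temperature readings are invented, and using today's temperature as the only input is an assumption made to keep the example small.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up history: today's temperature and the next day's, in degrees F.
today = [60, 62, 64, 66]
tomorrow = [61, 63, 65, 67]

slope, intercept = fit_line(today, tomorrow)
predicted = slope * 64 + intercept  # predicted temperature after a 64-degree day
print(predicted)  # 65.0
```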
  31. decision trees
    classify data into a finite number of classes based on the values of the input variables - a hierarchy of if-then statements most appropriate for categorical & interval data
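
The "hierarchy of if-then statements" can be pictured with a hand-written tree like the one below. This is only a sketch: the input variables and thresholds are hypothetical, and a real decision tree algorithm would learn the splits from historical data rather than have them typed in.

```python
def classify_day(cloud_cover, humidity_pct):
    """Walk a tiny hand-built decision tree and return a class label."""
    if cloud_cover == "overcast":              # first split: categorical input
        if humidity_pct >= 80:                 # second split: numeric input
            return "rainy"
        return "sunny"
    if cloud_cover == "partly cloudy":
        if humidity_pct >= 95:
            return "rainy"
        return "sunny"
    return "sunny"                             # clear skies


print(classify_day("overcast", 85))       # rainy
print(classify_day("partly cloudy", 60))  # sunny
```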
  32. clustering
    partitions a collection of things into segments (or natural groupings) whose members share similar characteristics
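
The card does not name a specific algorithm. As an assumption for illustration, the sketch below uses k-means, one common clustering method, on made-up one-dimensional data with two starting centers chosen by hand.

```python
def k_means_1d(points, centers, iterations=10):
    """Very small k-means for numbers: assign to nearest center, then re-average."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Move each center to the mean of its members (keep it if the cluster is empty).
        centers = [
            sum(members) / len(members) if members else centers[i]
            for i, members in enumerate(clusters)
        ]
    return centers, clusters


# Made-up customer spending figures that fall into two natural groups.
spending = [1, 2, 3, 20, 21, 22]
centers, clusters = k_means_1d(spending, centers=[0, 10])
print(centers)   # [2.0, 21.0]
print(clusters)  # [[1, 2, 3], [20, 21, 22]]
```

No class labels are supplied; the groups come only from how close the values are to one another.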
  33. associations (association rule learning) or market-basket analysis
    a technique for discovering interesting relationships among variables in large databases
  34. link analysis & sequence mining
    two commonly used derivatives of association rule mining
  35. link analysis
    an association type of data mining where the linkage among many objects of interest is discovered automatically, such as the link between Web pages & referential relationships among groups of academic publication authors
  36. sequence mining
    an association type of data mining where the relationships are examined in terms of their order of occurrence to identify associations over time
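
A rough sketch of the idea, assuming each customer's history is a time-ordered list of products opened (the data are invented). It simply counts how many customers show one event before another; real sequence mining algorithms are more elaborate.

```python
from collections import Counter


def ordered_pair_counts(sequences):
    """Count, per sequence, each (earlier, later) pair of distinct events once."""
    counts = Counter()
    for seq in sequences:
        pairs = set()
        for i, earlier in enumerate(seq):
            for later in seq[i + 1:]:
                if earlier != later:
                    pairs.add((earlier, later))
        counts.update(pairs)
    return counts


# Made-up account histories, one list per customer, oldest event first.
histories = [
    ["checking", "savings", "investment"],
    ["checking", "savings"],
    ["savings", "checking"],
]

counts = ordered_pair_counts(histories)
print(counts[("checking", "savings")])  # 2
print(counts[("savings", "checking")])  # 1
```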
  37. visualization
    technique of presenting information in graphical form, can be used in conjunction with other data mining techniques to gain a clearer understanding of underlying relationships
  38. time-series forecasting
    the data are a series of values of the same variable that are captured and stored over time - they are used to develop models that extrapolate future values of the same variable
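
One of the simplest ways to extrapolate a series is a moving average. The sketch below assumes that technique and a made-up monthly sales series; the card itself does not prescribe a particular model.

```python
def moving_average_forecast(series, window=3):
    """Forecast the next value as the mean of the last `window` observations."""
    if len(series) < window:
        raise ValueError("series is shorter than the window")
    recent = series[-window:]
    return sum(recent) / window


# Made-up monthly sales, oldest first.
monthly_sales = [100, 110, 120, 130]
forecast = moving_average_forecast(monthly_sales)
print(forecast)  # 120.0
```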
  39. hypothesis-driven data mining
    begins with a proposition by the user, who then seeks to validate the truthfulness of the proposition
  40. discovery-driven data mining
    finds patterns, associations, and other relationships hidden within datasets - can uncover facts that an organization had not previously known or even contemplated
  41. Cross Industry Standard Process for Data Mining
    CRISP-DM
  42. 1. Business Understanding
    2. Data Understanding
    3. Data Preparation
    4. Model Building
    5. Testing & Evaluation
    6. Deployment
    six steps of the CRISP-DM data mining process
  43. business understanding
    CRISP-DM step that involves gaining a thorough understanding of the managerial need for new knowledge and an explicit specification of the business objective regarding the study to be conducted
  44. data understanding
    CRISP-DM step that includes being clear about the description of the data mining task so that the most relevant data can be identified, and understanding the data sources & variables
  45. data preparation (data preprocessing)
    CRISP-DM step used to prepare data for analysis - takes the most time and effort
  46. data consolidation
    data preprocessing step consisting of:
    • collecting data
    • selecting data
    • integrating data
  47. data cleaning
    data preprocessing step consisting of:
    • imputing missing values
    • reducing noise in data
    • eliminating inconsistencies
  48. data transformation
    data preprocessing step consisting of:
    • normalizing data
    • discretizing/aggregating data
    • constructing new attributes
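
Two of the transformations listed on this card can be sketched as follows. Min-max scaling is assumed as the normalization method, and the cut points and labels used for discretizing are hypothetical.

```python
def min_max_normalize(values):
    """Rescale numbers to the range 0..1 (all zeros if every value is equal)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def discretize(value, cut_points, labels):
    """Map a number to a label; `labels` has one more entry than `cut_points`."""
    for cut, label in zip(cut_points, labels):
        if value < cut:
            return label
    return labels[-1]


incomes = [20000, 50000, 80000]
print(min_max_normalize(incomes))  # [0.0, 0.5, 1.0]

# Hypothetical age boundaries for the ordinal age groups.
ages = [8, 25, 50, 70]
age_groups = [
    discretize(a, [18, 40, 65], ["child", "young", "middle-aged", "elderly"])
    for a in ages
]
print(age_groups)  # ['child', 'young', 'middle-aged', 'elderly']
```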
  49. data reduction (dimensional reduction)
    data preprocessing step consisting of:
    • reducing number of variables
    • reducing number of cases
    • balancing skewed data
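
Balancing skewed data can be done in several ways. The sketch below assumes random undersampling, which keeps only as many cases from each class as the smallest class has (so it also reduces the number of cases). The records and the `label_of` helper are hypothetical.

```python
import random


def undersample(records, label_of, seed=0):
    """Randomly keep the same number of records from every class."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    by_class = {}
    for record in records:
        by_class.setdefault(label_of(record), []).append(record)
    smallest = min(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(rng.sample(group, smallest))
    return balanced


# Made-up transactions: 8 legitimate, 2 fraudulent.
transactions = [(i, "ok") for i in range(8)] + [(i, "fraud") for i in range(8, 10)]
balanced = undersample(transactions, label_of=lambda t: t[1])
print(len(balanced))  # 4
```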
  50. data consolidation
    data cleaning
    data transformation
    data reduction
    phases of data preprocessing (data preparation) - for converting raw real-world data into mine-able data sets
  51. impute
    to fill a missing value in a dataset with the most probable value
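
A minimal sketch of imputation, assuming the "most probable value" is approximated by the mean for numeric data and the mode for categorical data. That is one simple choice among many, and the sample values are made up.

```python
from statistics import mean, mode


def impute(values, numeric=True):
    """Replace None with the mean (numeric) or mode (categorical) of known values."""
    known = [v for v in values if v is not None]
    fill = mean(known) if numeric else mode(known)
    return [fill if v is None else v for v in values]


ages = [20, None, 30, 40]
colors = ["red", "blue", None, "red"]

print(impute(ages))                   # [20, 30, 30, 40]
print(impute(colors, numeric=False))  # ['red', 'blue', 'red', 'red']
```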
  52. model building
    CRISP-DM step in which various modeling techniques are selected and applied to an already prepared dataset in order to address the specific business need - using a variety of viable model types to identify the "best" method for a given purpose
  53. testing and evaluation
    CRISP-DM step in which models are assessed and evaluated for their accuracy and generality and the degree and extent to which the model(s) meets the business objective
  54. deployment
    CRISP-DM step in which the knowledge gained from the project is organized and presented in a way that the end user can understand and benefit from - in many cases is carried out by the customer rather than the data analyst - may also include maintenance activities for the deployed models
  55. classification
    a predictive model that segments data by assigning them to groups that are already defined
  56. clustering
    an unsupervised learning method that segments data into groups that are not previously defined
  57. accuracy - agree w/ identified source
    completeness - all the info
    timeliness - up to date
    consistency - values agree between data sets
    data quality dimensions
  58. - eliminate duplicate records
    - parsing (divide & analyze parts)
    - standardization (ex: names-Bob, Rob, Bobby)
    - abbreviation expansion (clearly define abbreviations)
    - correction of data quality issues
    - update missing fields
    data cleansing
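
Several of the cleansing steps above (parsing, standardization, abbreviation expansion, duplicate elimination) can be sketched together. The lookup tables are hypothetical, including the assumption that Bob, Rob, and Bobby should all be standardized to "Robert", and the records are made up.

```python
# Hypothetical lookup tables; a real project would maintain much larger ones.
NICKNAMES = {"bob": "Robert", "rob": "Robert", "bobby": "Robert"}
ABBREVIATIONS = {"st": "Street", "ave": "Avenue"}


def clean_record(record):
    """Parse, standardize, and expand one record into a (name, address) tuple."""
    first, last = record["name"].strip().split(maxsplit=1)   # parsing
    first = NICKNAMES.get(first.lower(), first.title())      # standardization
    words = [
        ABBREVIATIONS.get(w.lower().rstrip("."), w)          # abbreviation expansion
        for w in record["address"].split()
    ]
    return (f"{first} {last.title()}", " ".join(words))


def cleanse(records):
    """Clean every record and drop duplicates, keeping first occurrences."""
    seen, cleaned = set(), []
    for record in records:
        row = clean_record(record)
        if row not in seen:
            seen.add(row)
            cleaned.append(row)
    return cleaned


customers = [
    {"name": "Bob Smith", "address": "12 Main St."},
    {"name": "bobby smith", "address": "12 Main Street"},
    {"name": "Ann Lee", "address": "4 Oak Ave"},
]
cleaned = cleanse(customers)
print(cleaned)
# [('Robert Smith', '12 Main Street'), ('Ann Lee', '4 Oak Avenue')]
```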
