# Big Data - Exam II

The flashcards below were created by user mjweston on FreezingBlue Flashcards.

1. data mining
the nontrivial process of identifying valid, novel, potentially useful, & ultimately understandable patterns in any type of data
2. categorical data
represent the labels of multiple classes used to divide a variable into specific groups - also called discrete data implying that it represents a finite number of values with no continuum between them (ex: race, sex, age group, education level)
3. nominal data
simple codes assigned to objects as labels, which are not measurements (ex: marital status categorized as (1) single, (2) married, and (3) divorced)
4. ordinal data
contain codes assigned to objects or events as labels that also represent the rank order among them (ex: credit score - low, medium, high or age group - child, young, middle-aged, elderly)
5. numeric data
represent the numeric values of specific variables (ex: age, number of children, income, travel distance, temperature)
6. interval data
variables that can be measured on interval scales and have an arbitrary 0 point (ex: a temperature of 0 degrees doesn't mean "no temperature")
7. ratio data
include measurement variables commonly found in the physical sciences and engineering where the data provide a true (absolute) zero point where 0 = none (ex: zero weight = no weight; also mass, length, time, plane angle, energy)
8. categorical - nominal & ordinal
numerical - interval & ratio
two main types of data
9. associations
predictions
clusters
sequential relationships
four major types of patterns identified by data mining
10. - coupons & discounts
- product placement
- timing & cross-marketing
actions that are based on association discovery
11. associations (affinity grouping)
find the commonly co-occurring groupings of things, such as beer and diapers going together in market-basket analysis
12.  - support
- confidence
- lift
measures of predictive ability
13. support
refers to the percentage of baskets where the rule was true (both condition & result products were present) or the result was true (result products were present irrespective of condition)
14. confidence
measures the probability that the result product is present given that we know the condition product is present
15. lift
measures how much more likely the result product is to be present when the condition product is present than it is across all baskets (if > 1 indicates that transactions containing the condition tend to contain the result more often than transactions that do not contain the condition)
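16. support for rule - (baskets with cola & pizza) / (all baskets)
support for condition - (baskets with cola) / (all baskets)
support for result - (baskets with pizza) / (all baskets)
confidence - (support for rule) / (support for condition)
lift - (confidence) / (support for result)
computing measures of association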
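The formulas on this card can be sketched in a few lines of Python. This is only an illustration: the basket data and the function names (`support`, `association_measures`) are made up for the example, and it assumes the condition and result each appear in at least one basket.

```python
# Hypothetical market-basket data: each set is one transaction.
baskets = [
    {"cola", "pizza"},
    {"cola", "pizza", "chips"},
    {"cola"},
    {"pizza"},
    {"chips"},
]


def support(items, baskets):
    """Fraction of baskets that contain every item in `items`."""
    items = set(items)
    return sum(items <= basket for basket in baskets) / len(baskets)


def association_measures(condition, result, baskets):
    """Support, confidence, and lift for the rule condition -> result."""
    rule = support({condition, result}, baskets)   # both present
    cond = support({condition}, baskets)           # condition present
    res = support({result}, baskets)               # result present
    confidence = rule / cond                       # rule / condition
    lift = confidence / res                        # confidence / result
    return {"support": rule, "confidence": confidence, "lift": lift}


measures = association_measures("cola", "pizza", baskets)
print(measures)
```

With this sample data the lift comes out slightly above 1, so baskets with cola contain pizza a little more often than baskets in general.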
17. - understanding of business problem
- what data are relevant for study
- identify missing data fields, data noise, etc.
- develop model to explain relationships
- search for patterns of interest
- review results to refine model
- use refined model to predict output for set of inputs where output is not yet known
- take action on discovered patterns
general process of knowledge discovery
18. Sample, Explore, Modify, Model, and Assess - a common standard data mining process
SEMMA
19. Knowledge Discovery in Databases - common standard data mining process
KDD
20. predictions
tell the nature of future occurrences of certain events based on what has happened in the past such as predicting the winner of the Super Bowl or forecasting the absolute temperature of a particular day
21. clusters
identify natural groupings of things based on their known characteristics, such as assigning customers in different segments based on their demographics and past purchase behaviors
22. sequential relationships
discover time-ordered events, such as predicting that an existing banking customer who already has a checking account will open a savings account followed by an investment account within a year
23. prediction
association
clustering
three main categories of data mining
24. supervised learning algorithms
type of learning algorithms that include both the descriptive attributes (i.e., independent variables or decision variables) as well as the class attribute (i.e., output variable or result variable)
25. unsupervised learning algorithms
type of learning algorithms that include only the descriptive attributes (i.e., independent variables or decision variables)
26. prediction
takes into account the experiences, opinions, and other relevant information in conducting the task of foretelling
27. prediction, forecasting
_____ is largely experience and opinion based, whereas ____ is data and model based.
28. classification & regression
two major types of prediction
29. classification (supervised induction)
where the predicted thing, such as tomorrow's forecast, is a class label such as "rainy" or "sunny"; the objective is to analyze the historical data stored in a database and automatically generate a model that can predict future behavior
30. regression
where the predicted thing, such as tomorrow's temperature, is a real number, such as "65 degrees F"
31. decision trees
classify data into a finite number of classes based on the values of the input variables - a hierarchy of if-then statements most appropriate for categorical & interval data
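To make the "hierarchy of if-then statements" concrete, here is a hand-written tree in Python. It is a sketch only: the function name, the income threshold, and the risk classes are hypothetical, and a real decision tree would learn its splits from training data rather than have them typed in.

```python
def classify_credit_risk(credit_score, income):
    """Tiny hand-built decision tree: each branch tests one input variable.

    credit_score is an ordinal label ("low", "medium", "high");
    income is numeric. The 50_000 cutoff is an invented example value.
    """
    if credit_score == "low":
        return "high risk"
    if credit_score == "medium":
        # Second-level split on a numeric input.
        if income >= 50_000:
            return "medium risk"
        return "high risk"
    # Remaining branch: credit_score == "high"
    return "low risk"


print(classify_credit_risk("medium", 62_000))
```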
32. clustering
partitions a collection of things into segments (or natural groupings) whose members share similar characteristics
33. associations (association rule learning) or market-basket analysis
a technique for discovering interesting relationships among variables in large databases
34. link analysis & sequence mining
two commonly used derivatives of association rule mining
35. link analysis
an association type of data mining where the linkage among many objects of interest is discovered automatically, such as the link between Web pages & referential relationships among groups of academic publication authors
36. sequence mining
an association type of data mining where the relationships are examined in terms of their order of occurrence to identify associations over time
37. visualization
technique of presenting information in graphical form, can be used in conjunction with other data mining techniques to gain a clearer understanding of underlying relationships
38. time-series forecasting
the data are a series of values of the same variable that is captured and stored over time - which is used to develop models to extrapolate future values of the same thing
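One very simple way to extrapolate a series is a moving average: forecast the next value as the mean of the last few observations. The sketch below assumes that approach (the card itself does not name a method), and the function name, window size, and temperature values are invented for illustration.

```python
def moving_average_forecast(series, window=3):
    """Forecast the next value as the mean of the last `window` observations."""
    if window < 1 or len(series) < window:
        raise ValueError("need at least `window` observations")
    recent = series[-window:]
    return sum(recent) / window


# Hypothetical daily temperatures (degrees F), oldest first.
temps = [60, 62, 64, 66]
next_temp = moving_average_forecast(temps, window=3)
print(next_temp)
```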
39. hypothesis-driven data mining
begins with a proposition by the user, who then seeks to validate the truthfulness of the proposition
40. discovery-driven data mining
finds patterns, associations, and other relationships hidden within datasets - can uncover facts that an organization had not previously known or even contemplated
41. Cross Industry Standard Process for Data Mining
CRISP-DM
42. 1. Business Understanding
2. Data Understanding
3. Data Preparation
4. Model Building
5. Testing & Evaluation
6. Deployment
six steps of the CRISP-DM data mining process
43. business understanding
CRISP-DM step that involves gaining a thorough understanding of the managerial need for new knowledge and an explicit specification of the business objective regarding the study to be conducted
44. data understanding
CRISP-DM step that includes being clear about the description of the data mining task so that the most relevant data can be identified, and understanding the data sources & variables
45. data preparation (data preprocessing)
CRISP-DM step used to prepare data for analysis - takes the most time and effort
46. data consolidation
• data preprocessing step consisting of:
• collecting data
• selecting data
• integrating data
47. data cleaning
• data preprocessing step consisting of:
• imputing missing values
• reducing noise in data
• eliminating inconsistencies
48. data transformation
• data preprocessing step consisting of:
• normalizing data
• discretizing/aggregating data
• constructing new attributes
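Two of the transformation steps above, normalizing and discretizing, can be sketched as follows. Min-max scaling is one common normalization choice and is assumed here; the age cut points are hypothetical and reuse the child / young / middle-aged / elderly groups from the ordinal data card.

```python
def min_max_normalize(values):
    """Rescale numeric values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: map everything to 0.0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def discretize_age(age):
    """Turn a numeric age into an ordinal age group (cut points are assumptions)."""
    if age < 18:
        return "child"
    if age < 40:
        return "young"
    if age < 65:
        return "middle-aged"
    return "elderly"


ages = [10, 25, 50, 70]
normalized = min_max_normalize(ages)
groups = [discretize_age(a) for a in ages]
print(normalized, groups)
```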
49. data reduction (dimensional reduction)
• data preprocessing step consisting of:
• reducing number of variables
• reducing number of cases
• balancing skewed data
50. data consolidation
data cleaning
data transformation
data reduction
phases of data preprocessing (data preparation) - for converting raw real-world data into mine-able data sets
51. impute
to fill a missing value in a dataset with the most probable value
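A minimal imputation sketch, assuming missing values are stored as `None`: it fills numeric gaps with the mean and categorical gaps with the mode (the most frequent value). Both are common stand-ins for "the most probable value", not the only options, and the function name and sample data are invented.

```python
from statistics import mean, mode


def impute(values, strategy="mean"):
    """Replace None entries with the mean or mode of the known values."""
    known = [v for v in values if v is not None]
    if not known:
        raise ValueError("no known values to impute from")
    fill = mean(known) if strategy == "mean" else mode(known)
    return [fill if v is None else v for v in values]


# Hypothetical columns with missing entries.
incomes = impute([40, None, 60], strategy="mean")
marital = impute(["single", "married", None, "married"], strategy="mode")
print(incomes, marital)
```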
52. model building
CRISP-DM step in which various modeling techniques are selected and applied to an already prepared dataset in order to address the specific business need - using a variety of viable model types to identify the "best" method for a given purpose
53. testing and evaluation
CRISP-DM step in which models are assessed and evaluated for their accuracy and generality and the degree and extent to which the model(s) meets the business objective
54. deployment
CRISP-DM step in which the knowledge gained from the project is organized and presented in a way that the end user can understand and benefit from - in many cases is carried out by the customer rather than the data analyst - may also include maintenance activities for the deployed models
55. classification
a predictive model that segments data by assigning them to groups that are already defined
56. clustering
an unsupervised learning way to segment data into groups that are not previously defined
57. accuracy - agree w/ identified source
completeness - all the info
timeliness - up to date
consistency - values agree between data sets
data quality dimensions
58. - eliminate duplicate records
- parsing (divide & analyze parts)
- standardization (ex: names-Bob, Rob, Bobby)
- abbreviation expansion (clearly define abbreviations)
- correction of data quality issues
- update missing fields
data cleansing
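Standardization and duplicate elimination from the list above might look like this in Python. The nickname table, the record layout (first name, city), and the function names are all hypothetical; real cleansing tools use much larger reference tables and fuzzier matching.

```python
# Hypothetical lookup table mapping name variants to one standard form.
NICKNAMES = {"bob": "Robert", "rob": "Robert", "bobby": "Robert"}


def standardize_name(name):
    """Map known nicknames to a standard form; otherwise just tidy the casing."""
    key = name.strip().lower()
    return NICKNAMES.get(key, name.strip().title())


def cleanse(records):
    """Standardize each (first_name, city) record, then drop exact duplicates."""
    seen = set()
    cleaned = []
    for first_name, city in records:
        row = (standardize_name(first_name), city.strip().title())
        if row not in seen:  # keep only the first copy of each record
            seen.add(row)
            cleaned.append(row)
    return cleaned


raw = [("Bob", "austin"), ("bobby", "Austin "), ("Ann", "Dallas")]
cleaned = cleanse(raw)
print(cleaned)
```

Note that the two "Bob" records only become duplicates after standardization, which is why standardizing comes before the duplicate check in this sketch.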
Author: mjweston ID: 237541 Card Set: Big Data - Exam II Updated: 2013-10-01 11:28:33 Tags: Data Mining Description: Data Mining