Book *Artificial Intelligence: A Modern Approach, 3rd Ed.*

**In the book, the actual page** = page_no - 20

underline page 54: to develop a small set of design principles for building successful agents

underline page 103: there are two other significant difference

underline page 103: first is that the goal test is applied to a node when it is selected for expansion

underline page 105: Therefore, it will get stuck in an infinite loop if there is a path with an infinite sequence of zero-cost actions

underline page 107: A variant of depth-first search called backtracking search

underline page 192: the evaluation function cannot know which states are which

underline page 236: it picks a variable that is most likely to cause a failure soon, thereby pruning the search tree.

underline page 237: In general, the heuristic is trying to leave the maximum flexibility for subsequent variable assignments.

page 9: We have completely rewritten the introductory machine-learning chapter, stressing a wider variety of more modern learning algorithms and placing them on a firmer theoretical footing.

page 54: Our plan in this book is to use this concept to develop a small set of design principles for building successful agents

page 54: In which we discuss the nature of agents, perfect or otherwise, the diversity of environments, and the resulting menagerie of agent types.

page 54: 2INTELLIGENT AGENTS

page 54: An agent is anything that can be viewed as perceiving its environment through sensors and acting upon that environment through actuators.

page 57: a performance measure

page 60: 2.3.1 Specifying the task environment

page 60: PEAS (Performance, Environment, Actuators, Sensors)

page 61: 2.3.2 Properties of task environments

page 62: Fully observable vs. partially observable:

page 62: Single agent vs. multiagent:

page 63: Deterministic vs. stochastic.

page 63: Episodic vs. sequential:

page 64: Static vs. dynamic:

page 64: Discrete vs. continuous:

page 64: Known vs. unknown:

page 66: 2.4 THE STRUCTURE OF AGENTS

page 66: agent = architecture + program

page 72: 2.4.4 Goal-based agents

page 72: agent needs some sort of goal information that describes situations that are desirable

page 73: 2.4.5 Utility-based agents

page 73: An agent’s utility function is essentially an internalization of the performance measure.

page 73: a rational utility-based agent chooses the action that maximizes the expected utility

page 79: 2.5 SUMMARY

page 79: The performance measure evaluates the behavior of the agent in an environment.

page 79: An agent is something that perceives and acts in an environment.

page 79: The agent function for an agent specifies the action taken by the agent in response to any percept sequence.

page 79: A task environment specification includes the performance measure

page 79: All agents can improve their performance through learning.

page 82: For each of the following activities, give a PEAS description

page 84: In which we see how an agent can find a sequence of actions that achieves its goals when no single action will do.

page 84: 3SOLVING PROBLEMS BY SEARCHING

page 85: Goal formulation

page 85: Problem formulation is the process of deciding what actions and states to consider, given a goal.

page 86: 3.1.1 Well-defined problems and solutions

page 86: A problem can be defined formally by five components:

page 86: initial state

page 87: actions

page 87: transition model,

page 87: goal test

page 88: path cost

page 88: 3.1.2 Formulating problems

page 101: UNINFORMED SEARCH STRATEGIES

page 101: 3.4.1 Breadth-first search

page 101: using a FIFO queue for the frontier

page 103: 3.4.2 Uniform-cost search

page 103: uniform-cost search expands the node n with the lowest path cost g(n).

page 105: 3.4.3 Depth-first search

page 105: uses a LIFO queue

page 107: 3.4.4 Depth-limited search

page 108: 3.4.5 Iterative deepening depth-first search

page 110: 3.4.6 Bidirectional search

page 111: 3.4.7 Comparing uninformed search strategies

page 112: INFORMED (HEURISTIC) SEARCH STRATEGIES

page 112: 3.5.1 Greedy best-first search

page 113: 3.5.2 A* search: Minimizing the total estimated solution cost

page 119: 3.5.3 Memory-bounded heuristic search

page 122: 3.6 HEURISTIC FUNCTIONS

page 123: we need a heuristic function that never overestimates the number of steps to the goal.

page 123: Manhattandistance.

page 128: This chapter has introduced methods that an agent can use to select actions in environments that are deterministic, observable, static, and completely known.

page 128: n such cases, the agent can construct sequences of actions that achieve its goals; this process is called search.

page 128: A problem consists of five parts: the initial state, a set of actions, a transition model describing the results of those actions, a goal test function, and a path cost function.

page 128: Uninformed search methods have access only to the problem definition. The basic algorithms are as follows:

page 128: Breadth-first search expands the shallowest nodes first; it is complete, optimal for unit step costs, but has exponential space complexity.

page 128: Uniform-cost search expands the node with lowest path cost, g(n), and is optimal for general step costs.

page 128: Depth-first search expands the deepest unexpanded node first. It is neither com- plete nor optimal, but has linear space complexity. Depth-limited search adds a depth bound.

page 128: Iterative deepening search calls depth-first search with increasing depth limits until a goal is found. It is complete, optimal for unit step costs, has time complexity comparable to breadth-first search, and has linear space complexity.

page 128: Bidirectional search can enormously reduce time complexity, but it is not always applicable and may require too much space.

page 128: Informed search methods may have access to a heuristic function h(n) that estimates the cost of a solution from n.

page 128: The generic best-first search algorithm selects a node for expansion according to an evaluation function.

page 128: Greedy best-first search expands nodes with minimal h(n). It is not optimal but is often efficient.

page 128: 3.7 SUMMARY

page 129: ∗ search expands nodes with minimal f (n) = g(n) + h(n). A∗ is complete and optimal, provided that h(n) is admissible (for TREE-SEARCH) or consistent (for GRAPH-SEARCH). The space complexity of A∗ is still prohibitive.

page 129: RBFS (recursive best-first search) and SMA∗ (simplified memory-bounded A∗) are robust,

page 129: The performance of heuristic search algorithms depends on the quality of the heuristic function.

page 140: In which we relax the simplifying assumptions of the previous chapter, thereby getting closer to the real world.

page 140: 4BEYOND CLASSICAL SEARCH

page 141: Local search

page 141: global minimum

page 141: global maximum

page 141: objective function

page 142: 4.1.1 Hill-climbing search

page 142: sometimes called greedy local search

page 144: Random-restart hill climbing

page 144: Stochastic hill climbing

page 144: First-choice hill climbing

page 145: 4.1.2 Simulated annealing

page 145: gradient descent (i.e., minimizing cost)

page 145: 4.1.3 Local beam search

page 145: combine hill climbing with a random walk in some way that yields both efficiency and completeness. Simulated annealing is such an algorithm.

page 146: 4.1.4 Genetic algorithms

page 146: is a variant of stochastic beam search

page 170: 4.5.3 Online local search

page 170: hill-climbing search has the property of locality in its node expan- sions.

page 170: hill-climbing search is already an online search algorithm!

page 173: 4.6 SUMMARY

page 181: 5ADVERSARIAL SEARCH

page 181: In which we examine the problems that arise when we try to plan ahead in a world where other agents are planning against us.

page 182: Pruning

page 182: heuristic evaluation functions

page 182: utility function

page 182: terminal test

page 182: transition model

page 182: game tree

page 183: game of tic-tac-toe

page 183: 5.2 page 183: OPTIMAL DECISIONS IN GAMES

page 184: minimax value

page 185: The minimax algorithm (Figure 5.3) computes the minimax decision from the current state.

page 185: 5.2.1 The minimax algorithm

page 185: backed up

page 187: The problem with minimax search is that the number of game states it has to examine is exponential in the depth of the tree.

page 187: Unfortunately, we can’t eliminate the exponent, but it turns out we can effectively cut it in half.

page 187: The trick is that it is possible to compute the correct minimax decision without looking at every node in the game tree.

page 187: That is, we can borrow the idea of pruning

page 187: technique we examine is called alpha–beta pruning

page 187: When applied to a standard minimax tree, it returns the same move as minimax would, but prunes away branches that cannot possibly influence the final decision

page 187: 5.3 ALPHA–BETA PRUNING

page 191: 5.4.1 Evaluation functions

page 192: expected value:

page 194: 5.4.3 Forward pruning

page 200: 5.6.1 Kriegspiel: Partially observable chess

page 203: equilibrium

page 209: We have looked at a variety of games to understand what optimal play means and to under- stand how to play well in practice.

page 209: most important ideas are

page 209: A game can be defined by the initial state (how the board is set up), the legal actions in each state, the result of each action, a terminal test (which says when the game is over), and a utility function that applies to terminal states.

page 209: In two-player zero-sum games with perfect information, the minimax algorithm can select optimal moves by a depth-first enumeration of the game tree.

page 209: The alpha–beta search algorithm computes the same optimal move as minimax, but achieves much greater efficiency by eliminating subtrees that are provably irrelevant.

page 209: Usually, it is not feasible to consider the whole game tree

page 209: 5.9 SUMMARY

page 210: need to cut the search off at some point and apply a heuristic evaluation function that estimates the utility of a state.

page 217: 5.9 This problem exercises the basic concepts of game playing, using tic-tac-toe

page 222: 6CONSTRAINT SATISFACTION PROBLEMS

page 236: minimum- remaining-values

page 236: “most constrained variable”

page 237: least-constraining-value

page 237: forward checking

page 251: 6.5 Solve the cryptarithmetic problem in Figure 6.2

page 251: 6.3 Consider the problem of constructing (not solving) crossword puzzles:5 fitting words into a rectangular grid. The grid, which is given as part of the problem, specifies which squares are blank and which are shaded. Assume that a list of words (i.e., a dictionary) is provided and that the task is to fill in the blank squares by using any subset of the list. Formulate this problem precisely in two ways:

page 252: 6.9 Explain why it is a good heuristic to choose the variable t

page 254: 7LOGICAL AGENTS

page 294: 7.8 SUMMARY

page 305: 8FIRST-ORDER LOGIC

page 336: 8.9 This exercise uses the function MapColor and predicates

page 337: 8.10Consider a vocabulary with the following

page 340: 8.28 Consider a first-order logical knowledge base that describes worlds containing people,

page 475: Inheritance becomes complicated when an object can belong to more than one category or when a category can be a subset of more than one other category

page 476: 12.5.2 Description logics

page 476: De- scription logics page 476: are notations that are designed to make it easier to describe definitions and properties of categories.

page 487: 12.8 SUMMARY

page 500: 13QUANTIFYING UNCERTAINTY

page 503: The fundamental idea of decision theory is that an agent is rational if and only if it chooses the action that yields the highest expected utility, averaged over all the possible outcomes of the action.

page 523: 13.7 SUMMARY

page 527: 13.8

page 527: 13.10 Deciding to put probability theory to good use,

page 527: CHERRY

page 528: 13.13

page 530: Bayesian networks can represent essentially any full joint probability distribution and in many cases can do so very concisely.

page 530: 14PROBABILISTIC REASONING

page 531: A Bayesian network is a directed graph

page 532: a conditional probability table, or CPT

page 534: the chain rule.

page 537: d-separation

page 537: The topological semantics2

page 571: Bayesian networks play a role roughly analogous to that of propositional logic for definite knowledge.

page 571: 14.8 SUMMARY

page 578: EXERCISES

page 578: 14.1 We have a bag of three biased coins a, b, and c

page 579: Consider the Bayesian network

page 579: 14.4

page 713: 18LEARNING FROM EXAMPLES

page 713: 18.1 FORMS OF LEARNING

page 715: SUPERVISED LEARNING

page 717: Decision tree induction is one of the simplest and yet most successful forms of machine learning.

page 717: 18.3 LEARNING DECISION TREES

page 718: 18.3.1 The decision tree representation

page 718: 18.3.2 Expressiveness of decision trees

page 722: We can evaluate the accuracy of a learning algorithm with a learning curve

page 723: Entropy is a measure of the uncertainty of a random variable; acquisition of information corresponds to a reduction in entropy.

page 723: 18.3.4 Choosing attribute tests

page 723: entropy

page 724: The information gain from the attribute test on A is the expected reduction in entropy:

page 724: The restaurant training set in Figure 18.3 has p = n = 6, so the corresponding entropy is B(0.5) or exactly 1 bit.

page 725: 18.3.5 Generalization and overfitting

page 725: decision tree pruning

page 725: This problem is called overfitting.

page 728: EVALUATING AND CHOOSING THE BEST HYPOTHESIS

page 728: error rate

page 728: stationarity assumption: that there is a probability distribution over examples that remains stationary over time.

page 728: called holdout cross-validation,

page 728: leave-one-out cross-validation or LOOCV

page 729: 18.4.1 Model selection: Complexity versus goodness of fit

page 729: validation set

page 730: 18.4.2 From error rates to loss

page 730: loss function

page 733: 18.5 THE THEORY OF LEARNING

page 734: probably approximately correct

page 734: PAC learning algorithm

page 737: 18.6 REGRESSION AND CLASSIFICATION WITH LINEAR MODELS

page 738: linear regression

page 739: weight space

page 739: gradient descent

page 739: learning rate

page 740: batch gradient descent

page 740: stochastic gradient descent,

page 740: 18.6.2 Multivariate linear regression

page 743: 18.6.3 Linear classifiers with a hard threshold

page 743: decision boundary

page 743: inearly separable

page 744: threshold function:

page 744: perceptron learning rule

page 744: training curve

page 745: 18.6.4 Linear classification with logistic regression

page 746: logistic regression

page 747: 18.7 ARTIFICIAL NEURAL NETWORKS

page 748: 18.7.1 Neural network structures

page 748: units

page 748: links

page 748: activation

page 748: weight

page 749: perceptron

page 749: feed-forward network

page 749: recurrent network

page 749: Feed-forward networks are usually arranged in layers

page 749: multilayer networks, which have one or more layers of hidden units

page 749: single-layer neural network, or a perceptron network.

page 749: 18.7.2 Single-layer feed-forward neural networks (perceptrons)

page 750: logistic regression

page 753: back-propagate

page 753: 18.7.4 Learning in multilayer networks

page 757: cross-validation

page 757: 18.8 NONPARAMETRIC MODELS

page 757: A nonparametric model is one that cannot be characterized by a bounded set of param- eters.

page 758: 18.8.1 Nearest neighbor models

page 758: Minkowski distance

page 758: k-nearest neighbors looku

page 758: Ham- ming distance.

page 759: Mahalanobis distance

page 759: normalization

page 759: curse of dimensionality.

page 761: 18.8.4 Nonparametric regression

page 763: kernel

page 764: 18.9

page 764: SUPPORT VECTOR MACHINES

page 764: maximum margin separator

page 764: kernel trick

page 766: quadratic programming

page 767: kernel function

page 768: This then is the clever kernel trick:

page 768: optimal linear separators can be found efficiently in feature spaces with billions of (or, in some cases, infinitely many) dimensions.

page 768: 18.10

page 768: ENSEMBLE LEARNING

page 768: The idea of ensemble learning methods is to select a collection, or ensemble, of hypotheses from the hypothesis space and combine their predictions.

page 769: boosting

page 769: weak learning

page 770: ADABOOST

page 772: Bayesian learning

page 773: 18.11 PRACTICAL MACHINE LEARNING

page 773: 18.11.1 Case study: Handwritten digit recognition

page 773: NIST

page 775: 18.11.2 Case study: Word senses and house prices

page 777: Decision trees can represent all Boolean functions. The information-gain heuristic provides an efficient method for finding a simple, consistent decision tree.

page 777: 18.12 SUMMARY

page 777: Learning takes many forms, depending on the nature of the agent, the component to be improved, and the available feedback.

page 777: If the available feedback provides the correct answer for example inputs, then the learn- ing problem is called supervised learning.

page 777: The task is to learn a function y = h(x). Learning a discrete-valued function is called classification;

page 777: learning a continuous func- tion is called regression.

page 777: The performance of a learning algorithm is measured by the learning curve

page 777: which shows the prediction accuracy on the test set as a function of the training-set size

page 777: When there are multiple models to choose from, cross-validation can be used to select a model that will generalize well.

page 777: Sometimes not all errors are equal. A loss function tells us how bad each error is; the goal is then to minimize loss over a validation set.

page 777: Linear regression is a widely used model. The optimal parameters of a linear regres- sion model can be found by gradient descent search, or computed exactly.

page 777: A linear classifier with a hard threshold—also known as a perceptron—can be trained by a simple weight update rule to fit data that are linearly separable. In other cases, the rule fails to converge.

page 778: Logistic regression replaces the perceptron’s hard threshold with a soft threshold de- fined by a logistic function.

page 778: Gradient descent works well even for noisy data that are not linearly separable.

page 778: Neural networks represent complex nonlinear functions with a network of linear- threshold units. termMultilayer feed-forward neural networks can represent any func- tion, given enough units.

page 778: The back-propagation algorithm implements a gradient de- scent in parameter space to minimize the output error.

page 778: Nonparametric models use all the data to make each prediction, rather than trying to summarize the data first with a few parameters.

page 778: Support vector machines find linear separators with maximum margin to improve the generalization performance of the classifier.

page 778: Kernel methods implicitly transform the input data into a high-dimensional space where a linear separator may exist, even if the original data are non-separable.

page 778: Ensemble methods such as boosting often perform better than individual methods.

page 778: In online learning we can aggregate the opinions of experts to come arbitrarily close to the best expert’s performance, even when the distribution of the data is constantly shifting.

_{Updated: Wed May 20 15:36:33 EDT 2015}