TM426 The T396 Project
Project title: Artificial intelligence for data interpretation
Sample project description
Specific title
Artificial intelligence tools for predicting the fuel economy of cars
Description
[712 words]
The data
I was fascinated to find a large amount
of data about the fuel economy of a range of different cars at the UCI
(University of California, Irvine) Repository of Machine Learning Databases and
Domain Theories. The repository contains two car databases, one from the
StatLib library maintained at Carnegie Mellon University and the other compiled
by Jeffrey C. Schlimmer from published trade and insurance sources. The StatLib
data contains 398 examples, which should be ample to train and test a neural
network. For each example, there are nine attributes:
1 fuel economy
(city cycle, mpg) continuous parameter;
2 cylinders
multi-valued discrete parameter;
3 displacement
continuous parameter;
4 horsepower
continuous parameter;
5 mass
continuous parameter;
6 acceleration
continuous parameter;
7 model year
multi-valued discrete parameter;
8 origin
multi-valued discrete parameter;
9 car name
a unique string for each example.
I plan to investigate alternative
artificial intelligence approaches to predicting the first attribute from the
others. The last attribute is unique for each example and therefore cannot
contribute to the task. Seven attributes (2–8) can, therefore, be used
as the inputs to an AI system for predicting fuel economy (attribute 1), which
will be the single numerical output. The data have been successfully used by
Quinlan in his investigations into the use of instance-based and model-based
learning methods, which suggest the data have sufficient depth, despite the
relatively small number of attributes. In the event of difficulties with the
data, I can fall back to the other database, which has much greater depth (26
attributes), though less breadth (only 205 examples).
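The split into seven inputs and one output can be sketched in Python as below. This is only an illustration under stated assumptions: the column names, the `parse_line` helper, the use of `?` to mark a missing value, and the three sample rows are all hypothetical and are not taken from the repository files.

```python
import shlex

# Names for the nine attributes listed above (the names are this sketch's own).
COLUMNS = ["mpg", "cylinders", "displacement", "horsepower", "mass",
           "acceleration", "model_year", "origin", "car_name"]

def parse_line(line):
    """Return (inputs, target) for one record, or None if a value is missing.

    Assumes whitespace-separated fields with the car name in double quotes,
    and '?' standing for a missing numeric value (both are assumptions).
    """
    fields = shlex.split(line)
    if len(fields) != len(COLUMNS):
        raise ValueError("expected %d fields, got %d" % (len(COLUMNS), len(fields)))
    numeric = fields[:8]          # attribute 9 (car name) is ignored
    if "?" in numeric:
        return None
    values = [float(v) for v in numeric]
    return values[1:], values[0]  # attributes 2-8 as inputs, attribute 1 as target

# Hypothetical rows, made up for illustration only.
SAMPLE = '''\
18.0 8 307.0 130.0 3504.0 12.0 70 1 "example car a"
25.0 4 98.0 ? 2046.0 19.0 71 1 "example car b"
31.5 4 89.0 71.0 1990.0 14.9 78 2 "example car c"
'''

records = [r for r in map(parse_line, SAMPLE.splitlines()) if r is not None]
inputs = [r[0] for r in records]
targets = [r[1] for r in records]
```

Dropping incomplete records is one possible policy; substituting an average value would be another.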
Neural networks
This application is different from most
of those covered in T396 in that it is not a classification task. Instead of
mapping an input pattern onto one from a number of classes, the aim is to map
it onto a real-numbered output value. A neural network therefore requires only
a single output node to generate a fuel-economy value. This will be a value in
the range 0–1, which will be mapped by NeuralWorks' MinMax facility
onto the range of economies of vehicles in the data set. Both databases are in
plain text format, separated by commas or spaces, with one example per line.
This is precisely the format used by NeuralWorks, so the data can be used with
little or no pre-processing. The data will be divided into two sets: one
for training and one for testing. That way, the networks can be tested against
previously unseen data.
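The output scaling and the training/testing split described above can be sketched as follows. The sketch assumes that the mapping between the 0–1 range and fuel economy is a simple linear min-max mapping, and that a quarter of the examples are held back for testing; neither detail is fixed by the proposal, and the function names are hypothetical.

```python
import random

def min_max_fit(values):
    """Return the (lowest, highest) value seen in the data."""
    return min(values), max(values)

def scale(value, lo, hi):
    """Map a fuel economy in [lo, hi] onto the range 0-1."""
    return (value - lo) / (hi - lo)

def unscale(value, lo, hi):
    """Map a network output in 0-1 back onto a fuel economy."""
    return lo + value * (hi - lo)

def split(examples, test_fraction=0.25, seed=0):
    """Shuffle a copy of the examples and return (training set, test set)."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical fuel economies, used only to show the arithmetic.
lo, hi = min_max_fit([10.0, 20.0, 40.0])
scaled = scale(25.0, lo, hi)        # 0.5
restored = unscale(scaled, lo, hi)  # 25.0
```

Fixing the random seed keeps the split repeatable, so every network is tested against the same unseen examples.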
Initial experiments will be based on a
three-layered perceptron comprising seven input nodes (for the seven input
values), seven nodes in the hidden layer, and a single output node. Some
experimentation will be required with the number of nodes in the hidden layer
and the learning parameters. If convergence is difficult to achieve, a second
hidden layer may be required.
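For orientation, a 7-7-1 perceptron of the kind described above can be sketched in plain Python. The real experiments would be run in NeuralWorks; the sigmoid transfer function, the initial weight range, the learning rate and the class name `MLP` are all assumptions made for this illustration.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class MLP:
    """Inputs -> one sigmoid hidden layer -> a single sigmoid output node."""

    def __init__(self, n_in=7, n_hidden=7, seed=0):
        rng = random.Random(seed)
        # Each weight list ends with a bias term.
        self.w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
                         for _ in range(n_hidden)]
        self.w_out = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]

    def forward(self, x):
        """Return (hidden activations, output in the range 0-1)."""
        hidden = [sigmoid(sum(w * v for w, v in zip(ws, x)) + ws[-1])
                  for ws in self.w_hidden]
        out = sigmoid(sum(w * h for w, h in zip(self.w_out, hidden))
                      + self.w_out[-1])
        return hidden, out

    def train_step(self, x, target, rate=0.5):
        """One back-propagation update; returns the squared error before it."""
        hidden, out = self.forward(x)
        delta_out = (target - out) * out * (1.0 - out)
        # Hidden deltas use the output weights before they are changed.
        delta_hidden = [delta_out * self.w_out[j] * h * (1.0 - h)
                        for j, h in enumerate(hidden)]
        for j, h in enumerate(hidden):
            self.w_out[j] += rate * delta_out * h
        self.w_out[-1] += rate * delta_out
        for j, ws in enumerate(self.w_hidden):
            for i, v in enumerate(x):
                ws[i] += rate * delta_hidden[j] * v
            ws[-1] += rate * delta_hidden[j]
        return (target - out) ** 2

net = MLP()
example = [0.5] * 7   # seven scaled input values (hypothetical)
error = net.train_step(example, 0.9)
```

Changing `n_hidden` corresponds to the planned experiments with the size of the hidden layer; a second hidden layer would need a further weight matrix and delta calculation.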
The use of a Kohonen network will also
be investigated. Conventionally, the Kohonen network detects clusters in the
data during an unsupervised learning phase, and a mapping network is used
during a supervised learning phase to associate clusters with classes. It will
be interesting to ascertain whether distinct clusters will form, or whether
firing neurons will spread from one corner of the Kohonen layer for the most
efficient cars to the opposite corner for the least efficient.
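The unsupervised phase can be illustrated with a minimal self-organizing map in plain Python. This is a sketch only: the 5 by 5 grid, the Gaussian neighbourhood, the linear decay of the learning rate and radius, and the function names are assumptions, and the supervised mapping phase is not shown.

```python
import math
import random

def best_matching_unit(weights, x):
    """Return (row, col) of the node whose weight vector is closest to x."""
    best, best_dist = None, float("inf")
    for r, row in enumerate(weights):
        for c, w in enumerate(row):
            dist = sum((wi - xi) ** 2 for wi, xi in zip(w, x))
            if dist < best_dist:
                best, best_dist = (r, c), dist
    return best

def train_som(data, rows=5, cols=5, epochs=50, rate=0.5, radius=2.0, seed=0):
    """Train a rows x cols Kohonen layer on input vectors scaled to 0-1."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = [[[rng.random() for _ in range(dim)] for _ in range(cols)]
               for _ in range(rows)]
    for epoch in range(epochs):
        # Shrink the learning rate and neighbourhood as training proceeds.
        frac = 1.0 - epoch / epochs
        cur_rate = rate * frac
        cur_radius = max(radius * frac, 0.5)
        for x in data:
            br, bc = best_matching_unit(weights, x)
            for r in range(rows):
                for c in range(cols):
                    grid_dist2 = (r - br) ** 2 + (c - bc) ** 2
                    influence = math.exp(-grid_dist2 / (2.0 * cur_radius ** 2))
                    w = weights[r][c]
                    for i in range(dim):
                        w[i] += cur_rate * influence * (x[i] - w[i])
    return weights

# Two hypothetical, well-separated input patterns.
som = train_som([[0.0] * 7, [1.0] * 7])
```

Recording the best-matching unit for each car alongside its fuel economy would show whether economy varies smoothly across the layer or falls into distinct clusters.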
Both types of neural network will be
evaluated using a scoring algorithm. This will require modification from the
algorithm used in T396, because the output cannot be simply classed as right or
wrong. Instead, the scoring algorithm will be modified to take account of the
closeness of the network output to the desired output.
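One possible form for such a score is sketched below. It does not reproduce the T396 scoring algorithm: the linear fall-off, the 5 mpg tolerance and the function names are assumptions chosen for illustration.

```python
def closeness_score(predicted, desired, tolerance=5.0):
    """Average credit per example: 1 for an exact match, falling linearly
    to 0 once the error reaches `tolerance` (in mpg)."""
    if len(predicted) != len(desired):
        raise ValueError("predicted and desired must be the same length")
    credits = [max(0.0, 1.0 - abs(p - d) / tolerance)
               for p, d in zip(predicted, desired)]
    return sum(credits) / len(credits)

def mean_absolute_error(predicted, desired):
    """Average size of the prediction error, in mpg."""
    return sum(abs(p - d) for p, d in zip(predicted, desired)) / len(predicted)

# Hypothetical outputs: one exact prediction and one that is 5 mpg out.
score = closeness_score([20.0, 30.0], [20.0, 35.0])    # 0.5
error = mean_absolute_error([20.0, 30.0], [20.0, 35.0])  # 2.5
```

Because the score depends only on predicted and desired values, the same function can be applied unchanged to the neural networks and to the knowledge-based system.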
Knowledge-based systems
A knowledge-based system will be
constructed, using Flex, to predict the fuel economy from the set of seven
input parameters. Fuzzy rules will scale the fuel economy, based on the
continuous parameters (displacement, horsepower, mass, and acceleration). Some
experimentation will be required with the number, shape and distribution of the
fuzzy sets. Either crisp or fuzzy rules can be used with the discrete
parameters (number of cylinders, model year, and origin), so experiments will
be carried out with both. Indeed, it may be that some parameters, such as
origin, are best ignored altogether. The same scoring system used with neural
networks will be used for the KBS.
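The idea of fuzzy rules scaling the fuel economy can be illustrated in Python for a single parameter. This is not Flex or FLINT syntax, and the set boundaries, the fuel economy attached to each rule, the mass units and the weighted-average combination are all hypothetical choices made for the sketch.

```python
def falling(x, start, end):
    """Shoulder set: membership 1 below start, falling linearly to 0 at end."""
    if x <= start:
        return 1.0
    if x >= end:
        return 0.0
    return (end - x) / (end - start)

def rising(x, start, end):
    """Shoulder set: membership 0 below start, rising linearly to 1 at end."""
    return 1.0 - falling(x, start, end)

def triangle(x, left, peak, right):
    """Triangular set: membership 0 outside (left, right) and 1 at peak."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)

# Hypothetical rules: (membership function for mass, fuel economy concluded).
MASS_RULES = [
    (lambda m: falling(m, 2000, 3000), 35.0),         # light  -> high economy
    (lambda m: triangle(m, 2000, 3000, 4000), 22.0),  # medium -> medium economy
    (lambda m: rising(m, 3000, 4000), 14.0),          # heavy  -> low economy
]

def predict_mpg(mass):
    """Combine the rule conclusions, weighted by how strongly each rule fires."""
    fired = [(member(mass), mpg) for member, mpg in MASS_RULES]
    total = sum(weight for weight, _ in fired)
    return sum(weight * mpg for weight, mpg in fired) / total

estimate = predict_mpg(2500)  # halfway between light and medium: 28.5
```

Experimenting with the number, shape and distribution of the fuzzy sets then amounts to changing the entries in `MASS_RULES` and adding equivalent rule sets for the other continuous parameters.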
Project schedule
Start date: February 200x
End date: December 200x
Total time allocation: 260 hours
Fixed intermediate and final deadlines and approximate time allocation
| TMA 01 | [the cut-off date] | 48 hours |
| TMA 02 | [the cut-off date] | 48 hours |
| TMA 03 | [the cut-off date] | 60 hours |
| Report submission | [the cut-off date] | 84 hours |
| TMA 04 | [the cut-off date] | 20 hours |
|
| Month | Planned activity |
| February | Read project briefing notes. Identify requirements for project lifecycle, deadlines, deliverables etc. Determine project topic. |
| March | Literature survey. Review T396 material. Identify a suitable data source for experimentation with both KBSs and computational intelligence (i.e. neural networks and/or genetic algorithms); revisiting data from the T396 project is OK if new work is proposed. Produce TMA 01. |
| April | Rework proposal following comments from tutor. Identify skills weaknesses. Plan detailed project schedule. Ensure familiarity with current versions of Flex and NeuralWorks. Construct an MLP and a simple rule-based system to verify usability of the data set. |
| May | Ensure weak skills areas are supported. Finalize project plan. Produce TMA 02. Literature search focused on related applications. |
| June | Perform a range of MLP experiments. Document all tests and record calculated scores. |
| July | Implement fuzzy rules for predicting fuel economy. Test fuzzy rules using consistent scoring method. Document all tests and record calculated scores. Produce TMA 03. |
| August | Perform a range of Kohonen network experiments. Document all tests and record calculated scores. |
| September | Revise fuzzy rules and experiment with crisp rules. Start work on project report: plan content, structure and update bibliographic records. |
| October | Decide on what is implementable in remaining time. Consolidate existing work: focus on essential incomplete features, tidy program layout and prioritize remaining activities. Evaluation of techniques, results and overall project. Ongoing project report work should be almost complete for last week of October. Proofreading. |
| November | Review and revise project report: check presentation and ensure that the implications of the work are fully discussed. Submit project report. |
| December | Produce TMA 04. End. |
Bibliography
Gurney, K. (1997)
Neural Networks: An Introduction, UCL Press,
ISBN 1857286731.
Quinlan, R. (1993) Combining
instance-based and model-based learning,
Machine Learning 93: Proceedings of the 10th International Conference on
Machine Learning, University of Massachusetts, Amherst, Morgan Kaufmann,
pp. 236–43, ISBN 1558603077.
UCI Repository of Machine-learning
Databases, University of California, Department of Information and Computer
Science, Irvine, CA. http://www.ics.uci.edu/~mlearn/MLRepository.html
(Accessed October 2003.)
Quinlan distinguishes instance-based
learning, in which predictive capability is learnt from a set of examples, from
model-based learning, in which high-level principles are explicitly stated.
Neural networks are therefore a form of instance-based learning, and rule
induction is a form of model-based learning. Quinlan compares the performance
of purely instance-based methods (including neural networks) with a hybrid
method in which the instances are firstly modified by applying model-based
learning.
Unfortunately, Quinlan's model-based
modifications to the training data are unclear and provide no clues to the
construction of my knowledge-based system. Nevertheless, I have found the paper
helpful for the following reasons.
· Quinlan explicitly acknowledges that prediction of a continuous
variable, as in the project proposed here, has different characteristics from a
classification task.
· Quinlan looks at eight separate example problems, one of which is
the fuel economy dataset that I propose to use, and another is my
reserve data set compiled by Schlimmer. (In fact, all eight of
Quinlan's datasets are available from the UCI Repository.)
· The paper demonstrates that it should be possible to apply a
neural network to the data that I propose to use. (Quinlan's neural
network was a three-layer perceptron, but its learning algorithm and transfer
function were non-standard and therefore his results will not be exactly
reproducible.)
· Although Quinlan's model-based modifications improved the
performance of some of his instance-based techniques, they had an adverse
effect on his neural networks. In fact, a pure neural network was
his most accurate technique with the fuel economy data.
Equipment or software
No additional software or hardware will
be used beyond the standard course software packages and a Windows PC that
meets the current T396 requirement. The project involves the development of a
fuzzy and crisp rule-based system that can be implemented using the features
and facilities found in the current T396 release of Flex (including FLINT
extensions for the fuzzy rules). A range of neural networks will be produced
using the current OU release of NeuralWorks. The networks are not large, and
will not come close to the size restrictions imposed by NeuralWorks.