public class Dataset extends java.lang.Object implements java.io.Serializable, java.lang.Cloneable, Printable, Savable, Options, Commandable, Citable
To implement a new Dataset, you first need to create an extension of BaseEntry that represents the new kind of data. Then, you need to override the following operations:
Usage: *No options to set*
Implemented Commands:
add <entries...> - Add entries to a dataset
add $<dataset> [-force] - Add entries from another dataset
<output> = clone [-empty] - Create a copy of this dataset
combine $<dataset> - Add all entries from another dataset
duplicates <resolver> [<resolver options>] - Eliminate duplicates within a dataset
filter <include|exclude> <method> [<options...>] - Run dataset through a filter
generate <method> [<options>] - Generate new entries
match $<dataset> <num> - Find most similar entries in this dataset
modify <method> [<options>] - Modify the dataset
import <filename> [<options...>] - Import data by reading a file
subtract $<data> - Subtract another dataset from this one
rank <number> <maximum|minimum> <measured|predicted> <method> [<options>] - Print the top ranked entries based on some measure
<output> = split <number|fraction> - Randomly select and remove entries from dataset
<output> = subset <number|fraction> - Generate a random subset from this dataset
attributes - Print all attributes
attributes clear - Clear all attribute data
attributes expanders add <method> [<options...>] - Add an attribute expander to be run after generating attributes
attributes expanders clear - Clear the current list of attribute expanders
attributes expanders run - Run the currently-defined list of attribute expanders
attributes generators add <method> [<options...>] - Add a new attribute generator to list of generators
attributes generators run - Run the currently-defined list of attribute generators
attributes generators clear - Clear the current list of attribute generators
attributes generate - Generate attributes for each entry
attributes rank <number> <method> [<options...>] - Rank attributes based on predictive power
If attributes, classes, or properties are different, attributes and class values
in new entries (i.e., those from the other dataset) will be deleted and properties
will be merged
Prints the most similar entries in this dataset to those in the dataset passed as the argument.
BaseEntryRanker
Implemented Print Commands:
details - Print details about this class
dist - Print distribution of entries between known classes
description - Print description of this dataset
Implemented Save Formats: TBD
csv - Comma-separated value format. The value of each attribute and the measured class variable, if defined.
arff - Weka's ARFF format. Requires that a measured value is available for the class variable of each entry.
stats - Writes predicted and measured class variables. This is intended to allow an external program to evaluate model performance.
template - Save an empty clone of the dataset using serialization
json - Save dataset into JSON format
Modifier and Type | Field and Description |
---|---|
protected java.util.ArrayList<java.lang.String> | AttributeName - Names of attributes that describe each entry |
protected java.util.ArrayList<BaseEntry> | Entries - Internal array that stores entries |
protected java.util.List<BaseAttributeExpander> | Expanders - Tools to generate new attributes based on existing ones |
protected java.util.List<BaseAttributeGenerator> | Generators - Tools to generate special-purpose attributes |
Constructor and Description |
---|
Dataset() - Generate a blank dataset |
Modifier and Type | Method and Description |
---|---|
java.lang.String | about() - Prints a simple status message about this object |
void | addAttribute(java.lang.String name, double[] values) - Add in a new attribute. |
void | addAttributeExpander(BaseAttributeExpander expander) - Add a new tool to expand the number of attributes for this dataset. |
void | addAttributeGenerator(BaseAttributeGenerator generator) - Add a new tool to generate additional attributes |
void | addAttributes(java.util.List<java.lang.String> names) - Add new attributes. |
void | addEntries(java.util.Collection<? extends BaseEntry> entries) - Add many entries to the dataset |
void | addEntries(Dataset otherDataset, boolean forceMerge) - Add entries from another dataset. |
void | addEntry(BaseEntry e) - Add an entry. |
BaseEntry | addEntry(java.lang.String input) - Add a new entry by parsing an input string. |
protected void | calculateAttributes() - Compute attributes that are specific to this class. |
void | clearAttributeExpanders() - Reset the list of attribute expanders |
void | clearAttributeGenerators() - Reset the list of attribute generators |
void | clearAttributes() - Clear attribute data. |
void | clearData() - Clear out all entries |
Dataset | clone() - Creates a new instance of this dataset, and clones of each entry. |
void | combine(java.util.Collection<Dataset> d) - Combine the data structure with a collection of other data structures. |
void | combine(Dataset d) - Combine the data structure with another. |
void | combine(Dataset[] d) - Combine the data structure with an array of other Datasets. |
boolean | containsEntry(BaseEntry Entry) - Determine whether a dataset contains a certain entry |
Dataset | createTemplate() - Create a template designed to be saved using serialization |
void | deleteMeasuredClasses() - Delete all measured classes from entries |
void | deletePredictedClasses() - Delete all predicted classes from entries |
Dataset | emptyClone() - Creates a new instance with the same class and attribute names, but without any entries. |
protected void | finalizeGeneration() - Run after generating attributes. |
void | generateAttributes() - Generate attributes for this dataset |
double[][] | getAttributeArray() - Output the attributes of each entry into an array |
java.util.List<BaseAttributeExpander> | getAttributeExpanders() - Get a copy of the list of currently-employed attribute expanders. |
java.util.List<BaseAttributeGenerator> | getAttributeGenerators() - Get a copy of the list of currently-employed attribute generators. |
int | getAttributeIndex(java.lang.String Name) - Get index of a certain attribute |
java.lang.String | getAttributeName(int index) - Get name of a specific attribute |
java.lang.String[] | getAttributeNames() |
java.util.List<org.apache.commons.lang3.tuple.Pair<java.lang.String,Citation>> | getCitations() - Return a list of citations for this object and any underlying objects. |
java.lang.String | getClassName(int value) - Get the name of a certain class (for data with multiple possible classes) |
java.lang.String[] | getClassNames() |
double[][] | getClassProbabilityArray() - Get an array of class probabilities |
double[] | getDistribution() - Get the distribution of entries between known classes |
int[] | getDistributionCount() - Get the distribution of entries between known classes |
java.util.Map<BaseEntry,java.util.List<BaseEntry>> | getDuplicates() - Get a map of an example of a duplicate entry to all duplicates. |
java.util.List<BaseEntry> | getEntries() - Return copy of list of entries. |
java.util.List<BaseEntry> | getEntriesWriteAccess() - Get the internal list of entries from this dataset. |
BaseEntry | getEntry(int index) - Retrieve a single entry from the dataset |
double[][] | getEntryArray() - Output the attributes and class of each entry |
double[] | getMeasuredClassArray() - Output an array of the measured classes for each entry. |
double[] | getPredictedClassArray() - Get the predicted class for each entry |
Dataset | getRandomSplit(double fraction) - Split off a certain number of entries into a separate dataset. |
Dataset | getRandomSplit(double fraction, long seed, boolean stratified) - Split off a certain number of entries into a separate dataset. |
Dataset | getRandomSplit(int number) - Split off a certain number of entries into a separate dataset. |
Dataset | getRandomSplit(int number, long seed, boolean stratified) - Split off a certain number of entries into a separate dataset. |
Dataset | getRandomSubset(double fraction) - Generate a random subset of the original data, which is left intact |
Dataset | getRandomSubset(int number) - Generate a random subset of the original data, which is left intact |
double[] | getSingleAttributeArray(int Attribute) - Output a single attribute for each entry |
Dataset | getSubset(int[] indicies) - Get a specific list of entries from the dataset. |
Dataset | getTrainingExamples() - Get a dataset that only contains entries with a measured class variable. |
java.util.Map<BaseEntry,java.util.List<BaseEntry>> | getUniqueEntries() - Get a map of all unique entries, and their duplicates. |
void | importText(java.lang.String filename, java.lang.Object[] options) - Imports data from a text file. |
static Dataset | loadState(java.lang.String filename) - Read the state from file using serialization |
java.util.List<BaseEntry> | matchEntries(BaseEntry entry, int n) - Find entries in this dataset with attributes closest to a user-provided entry. |
int | NAttributes() |
int | NClasses() |
int | NEntries() |
Dataset[] | partition(int[] labels) - Given a list of labels, separate Dataset into multiple subsets |
Dataset[] | partition(int[] labels, int number) - Given a list of labels, separate Dataset into multiple subsets (some may be empty) |
java.lang.String | printAttributeDescription(boolean htmlFormat) - Print out description of attributes. |
java.lang.String | printCommand(java.util.List<java.lang.String> Command) - Handles more complicated printing commands. |
java.lang.String | printDescription(boolean htmlFormat) - Print out name of dataset and what attributes are generated. |
java.lang.String | printDistribution() - Print out the distribution of entries in the known classes |
java.lang.String | printEntries(int[] list) - Print out data regarding a list of entries. |
java.lang.String | printEntryDescription(boolean htmlFormat) - Print out what the entries to this dataset are. |
java.lang.String | printUsage() - Print out required format for options. |
void | removeDuplicates() - Remove all duplicate entries without any selection strategy |
void | resolveDuplicates(BaseDuplicateResolver resolver) - Resolve duplicates according to some strategy |
void | restoreAttributes(weka.core.Instances weka) - Restore attribute data to each entry. |
protected java.lang.Object | runAttributeCommand(java.util.List<java.lang.Object> Command) - Run commands related to attributes of each entry. |
void | runAttributeExpanders() - Expand the list of attributes using the currently-set list of attribute expanders. |
protected void | runAttributeExpansionCommand(java.util.List<java.lang.Object> Command) - Run commands relating to expanding the attribute pool. |
protected void | runAttributeGeneratorCommand(java.util.List<java.lang.Object> Command) - Run commands relating to generating new attributes. |
void | runAttributeGenerators() - Expand the list of attributes using the currently-set list of attribute generators. |
java.lang.Object | runCommand(java.util.List<java.lang.Object> command) - Process some command described by a list of Objects. |
java.lang.String | saveCommand(java.lang.String Basename, java.lang.String Command) - Handles complicated saving commands. |
void | saveState(java.lang.String filename) - Save the state of this object using serialization |
void | saveTemplate(java.lang.String filename) - Save a template of this dataset to disk using |
void | setAttributeNames(java.util.List<java.lang.String> attributeNames) - Set the name of each attribute. |
void | setClassNames(java.lang.String[] newClassNames) - Set name of class variable (or possible classes) |
void | setClassProbabilities(double[][] predictions) - Set class probabilities for each entry |
void | setMeasuredClasses(double[] measurements) - Set measured class for each entry, given an array of measurements |
void | setOptions(java.util.List<java.lang.Object> Options) - Set any options for this object. |
void | setPredictedClasses(double[] predictions) - Set predicted class for each entry, given an array of predictions |
Dataset[] | splitForThreading(int NThreads) - Split for threading purposes. |
Dataset[] | splitIntoFolds(int folds) - Split the dataset into multiple folds for cross-validation, empties the original test set |
Dataset[] | splitIntoFolds(int folds, long seed) - Split the dataset into multiple folds for cross-validation, empties the original test set |
void | subtract(Dataset data) - Remove all entries that are in another dataset from this dataset |
org.json.JSONObject | toJSON() - Print dataset to JSON format |
java.lang.String | toString() |
weka.core.Instances | transferToWeka(boolean useClass, boolean useDiscreteClass) - Convert to Weka Instances object, delete attribute information in each entry. |
protected java.util.ArrayList<java.lang.String> AttributeName
protected java.util.ArrayList<BaseEntry> Entries
protected java.util.List<BaseAttributeExpander> Expanders
protected java.util.List<BaseAttributeGenerator> Generators
public static Dataset loadState(java.lang.String filename) throws java.lang.Exception
filename - Filename for input
java.lang.Exception
public void setOptions(java.util.List<java.lang.Object> Options) throws java.lang.Exception
Options
setOptions in interface Options
Options - Array of options as Objects - can be null
java.lang.Exception - if problem with inputs
public java.lang.String printUsage()
Options
printUsage in interface Options
public Dataset clone()
clone in class java.lang.Object
public Dataset emptyClone()
public Dataset createTemplate()
public Dataset getTrainingExamples()
public void clearData()
public void clearAttributes()
public void addAttributeExpander(BaseAttributeExpander expander)
expander - New expander
public void clearAttributeExpanders()
public java.util.List<BaseAttributeExpander> getAttributeExpanders()
addAttributeExpander(magpie.attributes.expanders.BaseAttributeExpander)
public void runAttributeExpanders()
getAttributeExpanders()
public void addAttributeGenerator(BaseAttributeGenerator generator)
generator - New generator
public void clearAttributeGenerators()
public java.util.List<BaseAttributeGenerator> getAttributeGenerators()
addAttributeGenerator(BaseAttributeGenerator)
public void runAttributeGenerators() throws java.lang.Exception
java.lang.Exception
getAttributeGenerators()
public final void generateAttributes() throws java.lang.Exception
java.lang.Exception - If any error is encountered
protected void calculateAttributes() throws java.lang.Exception
java.lang.Exception - If attribute calculation fails
public java.lang.String[] getAttributeNames()
public void setAttributeNames(java.util.List<java.lang.String> attributeNames)
NOTE: This will not affect the number of attributes of each entry. Make sure to update those if needed!
attributeNames - New names for each attribute
public java.lang.String getAttributeName(int index)
index - Attribute number
public int getAttributeIndex(java.lang.String Name)
Name - Name of desired attribute
public void importText(java.lang.String filename, java.lang.Object[] options) throws java.lang.Exception
Attribute1Name, Attribute2Name, ..., AttributeNName, Class
Attribute1, Attribute2, ..., AttributeN, ClassVariable
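The two lines above give the expected layout: a header naming each attribute followed by the class, then one row of values per entry. As a rough illustration only (this is not Magpie's importer), the sketch below parses that layout from a list of lines. The class `DelimitedTextSketch` and its fields are hypothetical, and it assumes comma delimiters and purely numeric values.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of Magpie: parses the header-plus-rows layout. */
final class DelimitedTextSketch {
    final List<String> attributeNames = new ArrayList<>();
    String className;
    final List<double[]> attributes = new ArrayList<>();
    final List<Double> classValues = new ArrayList<>();

    /** Assumes comma-separated columns with the class variable in the last column. */
    static DelimitedTextSketch parse(List<String> lines) {
        if (lines.isEmpty()) {
            throw new IllegalArgumentException("Missing header line");
        }
        DelimitedTextSketch out = new DelimitedTextSketch();
        String[] header = lines.get(0).split(",");
        if (header.length < 2) {
            throw new IllegalArgumentException("Need at least one attribute and a class");
        }
        // Every column except the last names an attribute
        for (int i = 0; i < header.length - 1; i++) {
            out.attributeNames.add(header[i].trim());
        }
        out.className = header[header.length - 1].trim();
        for (String line : lines.subList(1, lines.size())) {
            if (line.trim().isEmpty()) {
                continue; // Skip blank lines
            }
            String[] words = line.split(",");
            if (words.length != header.length) {
                throw new IllegalArgumentException("Wrong number of columns: " + line);
            }
            double[] row = new double[words.length - 1];
            for (int i = 0; i < row.length; i++) {
                row[i] = Double.parseDouble(words[i].trim());
            }
            out.attributes.add(row);
            out.classValues.add(Double.parseDouble(words[words.length - 1].trim()));
        }
        return out;
    }
}
```

A row whose column count differs from the header is rejected here; whether the real importer is that strict is not stated in this page.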
filename - Path to data file
options - Any options used to control import
java.lang.Exception - If text import fails
public java.lang.String[] getClassNames()
public void setClassNames(java.lang.String[] newClassNames)
newClassNames - New name(s) to use
public java.lang.String getClassName(int value)
value - Value of class variable
public void addAttribute(java.lang.String name, double[] values)
name - Name to be added
values - Value of attribute for each entry
public void addAttributes(java.util.List<java.lang.String> names)
names - Names of new attributes
BaseEntry.addAttributes(double[])
public int NAttributes()
public int NClasses()
public int NEntries()
public void addEntry(BaseEntry e)
e - Entry to be added
public BaseEntry addEntry(java.lang.String input) throws java.lang.Exception
input - String describing the entry
java.lang.Exception - If conversion fails
public void addEntries(java.util.Collection<? extends BaseEntry> entries)
entries - Any collection type of entries
public void addEntries(Dataset otherDataset, boolean forceMerge) throws java.lang.Exception
Entries will be cloned before adding them to this dataset.
otherDataset - Dataset to be added to this one
forceMerge - Whether to force merge if attributes / class are different.
java.lang.Exception - If datasets have different classes or attributes, and forceMerge is false.
public void removeDuplicates()
public void resolveDuplicates(BaseDuplicateResolver resolver)
resolver - Duplicate resolution strategy
public java.util.Map<BaseEntry,java.util.List<BaseEntry>> getUniqueEntries()
public java.util.Map<BaseEntry,java.util.List<BaseEntry>> getDuplicates()
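The two map-returning methods above pair one representative entry with the entries that duplicate it. A minimal sketch of that grouping idea follows, using plain `equals`/`hashCode` on generic items. The class `DuplicateGroupingSketch` is hypothetical, and including the representative itself in each list is an assumption, since this page does not say how Magpie fills the lists.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper, not part of Magpie: groups equal items together. */
final class DuplicateGroupingSketch {
    /** Map each distinct item (first occurrence) to every item equal to it, itself included. */
    static <T> Map<T, List<T>> uniqueEntries(List<T> entries) {
        Map<T, List<T>> groups = new LinkedHashMap<>();
        for (T entry : entries) {
            groups.computeIfAbsent(entry, key -> new ArrayList<>()).add(entry);
        }
        return groups;
    }

    /** Keep only the groups that contain more than one item. */
    static <T> Map<T, List<T>> duplicates(List<T> entries) {
        Map<T, List<T>> groups = uniqueEntries(entries);
        groups.values().removeIf(group -> group.size() < 2);
        return groups;
    }
}
```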
public boolean containsEntry(BaseEntry Entry)
Entry - Entry to be tested
Entry
public java.util.List<BaseEntry> matchEntries(BaseEntry entry, int n)
entry - Entry to be matched
n - Number of top entries to list
public void combine(Dataset d)
Entries are not cloned during combination.
If you are looking to add entries from another dataset, also consider using addEntries(magpie.data.Dataset, boolean). This function will clone each entry of the dataset so that the two datasets can be treated as completely independent after the merge.
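To make the difference between the two merge styles concrete, here is a small sketch with a stand-in entry type. `CombineVersusCloneSketch` and its nested `Entry` class are hypothetical and only mimic the behavior described above: a shared merge keeps the same objects, while a cloning merge copies them first.

```java
import java.util.List;

/** Hypothetical illustration, not part of Magpie. */
final class CombineVersusCloneSketch {
    /** Mutable stand-in for BaseEntry. */
    static final class Entry implements Cloneable {
        double measured;

        Entry(double measured) {
            this.measured = measured;
        }

        @Override
        public Entry clone() {
            try {
                return (Entry) super.clone();
            } catch (CloneNotSupportedException e) {
                throw new AssertionError(e); // Cannot happen: Entry is Cloneable
            }
        }
    }

    /** Like combine: the target ends up holding the very same objects as the source. */
    static void combineShared(List<Entry> target, List<Entry> source) {
        target.addAll(source);
    }

    /** Like addEntries: each entry is cloned first, so the two lists stay independent. */
    static void addCloned(List<Entry> target, List<Entry> source) {
        for (Entry entry : source) {
            target.add(entry.clone());
        }
    }
}
```

After a shared merge, changing an entry through one list is visible through the other; after a cloning merge it is not.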
d - Dataset to be added
public void combine(Dataset[] d)
d - Array of Datasets
public void combine(java.util.Collection<Dataset> d)
d - Collection of Datasets
public void subtract(Dataset data)
data - Second dataset
public BaseEntry getEntry(int index)
index - Index of entry
public java.util.List<BaseEntry> getEntries()
public java.util.List<BaseEntry> getEntriesWriteAccess()
public Dataset[] partition(int[] labels)
labels - Label defining the subset in which to place each entry
max(label) + 1, where each member, i, contains entries with label[i] == i.
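The labeling rule above can be sketched on plain lists: subset i receives every entry whose label equals i, and the number of subsets defaults to max(label) + 1. `PartitionSketch` is a hypothetical helper, not Magpie code; the second method mirrors the variant that takes an explicit subset count, where some subsets may stay empty.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of Magpie: label-based partitioning. */
final class PartitionSketch {
    /** Produce max(label) + 1 subsets. */
    static <T> List<List<T>> partition(List<T> entries, int[] labels) {
        int max = -1;
        for (int label : labels) {
            max = Math.max(max, label);
        }
        return partition(entries, labels, max + 1);
    }

    /** Produce exactly `number` subsets; `number` must exceed every label. */
    static <T> List<List<T>> partition(List<T> entries, int[] labels, int number) {
        if (labels.length != entries.size()) {
            throw new IllegalArgumentException("Need one label per entry");
        }
        List<List<T>> subsets = new ArrayList<>();
        for (int i = 0; i < number; i++) {
            subsets.add(new ArrayList<>());
        }
        for (int j = 0; j < labels.length; j++) {
            if (labels[j] < 0 || labels[j] >= number) {
                throw new IllegalArgumentException("Label out of range: " + labels[j]);
            }
            // Entry j goes to the subset named by its label
            subsets.get(labels[j]).add(entries.get(j));
        }
        return subsets;
    }
}
```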
public Dataset[] partition(int[] labels, int number)
labels - Label defining the subset in which to place each entry
number - Number of subsets to produce, must be greater than max(labels)
public Dataset getSubset(int[] indicies)
indicies - List of entry IDs to be removed
public Dataset getRandomSplit(int number)
Does not specifically ensure that split has same distribution of classes as this dataset
number - Number of entries from original set to move
getRandomSplit(int, long, boolean)
public Dataset getRandomSplit(int number, long seed, boolean stratified)
number - Number of entries in new set
seed - Seed used for random number generator
stratified - Ensure split has the same distribution of classes as the host dataset
public Dataset getRandomSplit(double fraction)
Does not specifically ensure that split has same distribution of classes as this dataset
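For contrast with the plain random split described above, the sketch below shows one common way to make a split stratified: group entry indices by class, shuffle each group with a seeded generator, and take the same fraction from every group. This is an assumption about the general technique, not Magpie's actual algorithm, and `StratifiedSplitSketch` is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.TreeMap;

/** Hypothetical helper, not part of Magpie: seeded, stratified index selection. */
final class StratifiedSplitSketch {
    /** Returns sorted indices to move into the split, about `fraction` of each class. */
    static List<Integer> chooseSplit(int[] classes, double fraction, long seed) {
        if (fraction < 0 || fraction > 1) {
            throw new IllegalArgumentException("Fraction must be between 0 and 1");
        }
        // Group entry indices by class value
        Map<Integer, List<Integer>> byClass = new TreeMap<>();
        for (int i = 0; i < classes.length; i++) {
            byClass.computeIfAbsent(classes[i], key -> new ArrayList<>()).add(i);
        }
        Random random = new Random(seed);
        List<Integer> chosen = new ArrayList<>();
        for (List<Integer> members : byClass.values()) {
            // Shuffle within the class, then take the same fraction from each
            Collections.shuffle(members, random);
            int count = (int) Math.round(fraction * members.size());
            chosen.addAll(members.subList(0, count));
        }
        Collections.sort(chosen);
        return chosen;
    }
}
```

Reusing the same seed reproduces the same selection, which is the usual reason for exposing a seed parameter.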
fraction - Fraction of entries from original set to move
getRandomSplit(double, long, boolean)
public Dataset getRandomSplit(double fraction, long seed, boolean stratified)
fraction - Fraction of entries of original set to move to new set
seed - Seed used for random number generator
stratified - Ensure split has the same distribution of classes as the host dataset
public Dataset getRandomSubset(int number)
number - Number of entries to move over
public Dataset getRandomSubset(double fraction)
fraction - Fraction of entries used in new set
public Dataset[] splitForThreading(int NThreads)
Example: 4 threads, thread 1 has entry #1, 5, 9, ...
The entries in this object are untouched.
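The example above (4 threads, the first receiving entries #1, 5, 9, ...) describes a round-robin assignment. A minimal sketch of that pattern on a plain list follows. `RoundRobinSplitSketch` is a hypothetical helper rather than Magpie code, and like the method described here it leaves the input list untouched.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of Magpie: round-robin split for threading. */
final class RoundRobinSplitSketch {
    /** Entry at zero-based index i goes to subset i % nThreads. */
    static <T> List<List<T>> split(List<T> entries, int nThreads) {
        if (nThreads < 1) {
            throw new IllegalArgumentException("Need at least one thread");
        }
        List<List<T>> parts = new ArrayList<>();
        for (int t = 0; t < nThreads; t++) {
            parts.add(new ArrayList<>());
        }
        for (int i = 0; i < entries.size(); i++) {
            parts.get(i % nThreads).add(entries.get(i));
        }
        return parts;
    }
}
```

Dealing entries out in turn keeps the subsets within one entry of each other in size, which helps balance work across threads.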
NThreads - Number of subsets to create
public Dataset[] splitIntoFolds(int folds)
folds - Number of folds
public Dataset[] splitIntoFolds(int folds, long seed)
folds - Number of folds
seed - Random seed
public weka.core.Instances transferToWeka(boolean useClass, boolean useDiscreteClass)
useClass - Whether to output class data. Note: If there is no measured class data and useDiscreteClass is true, value will be set to zero. This allows the Instances to contain information about how many classes are available regardless of whether this Dataset contains any measurements.
useDiscreteClass - Whether to treat class variable as discrete
restoreAttributes(weka.core.Instances)
public void restoreAttributes(weka.core.Instances weka) throws java.lang.Exception
weka - Weka object containing attribute information. Assumes the last variable is the class variable
java.lang.Exception
transferToWeka(boolean, boolean)
public double[][] getEntryArray()
public double[][] getAttributeArray()
public double[] getSingleAttributeArray(int Attribute)
Attribute - Which Attribute to output
public double[] getMeasuredClassArray()
java.lang.RuntimeException - If an entry is missing a measured value
public double[] getPredictedClassArray()
public double[][] getClassProbabilityArray()
public void deleteMeasuredClasses()
public void deletePredictedClasses()
public void setPredictedClasses(double[] predictions)
predictions - Predictions in the same order as generated by getFeatures
public void setMeasuredClasses(double[] measurements)
measurements - Measurements in the same order as generated by getFeatures
public void setClassProbabilities(double[][] predictions)
predictions - Probabilities in the same order as generated by getFeatures
public java.lang.String about()
Printable
public java.lang.String printDescription(boolean htmlFormat)
printDescription in interface Printable
htmlFormat - Whether to print in HTML format
#printModel()
public java.util.List<org.apache.commons.lang3.tuple.Pair<java.lang.String,Citation>> getCitations()
Citable
getCitations in interface Citable
public java.lang.String printAttributeDescription(boolean htmlFormat)
Implementation Guide
Subclasses should describe what kind of attributes are generated by default. If it uses a separate generator class, those are captured by the printDescription(boolean) section.
htmlFormat - Whether to print in HTML format
public java.lang.String printEntryDescription(boolean htmlFormat)
htmlFormat - Whether to print in HTML format
public java.lang.String toString()
toString in class java.lang.Object
public org.json.JSONObject toJSON()
public int[] getDistributionCount()
public double[] getDistribution()
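The two distribution methods above return an `int[]` and a `double[]` respectively. One plausible reading, sketched below, is that the first holds the number of entries in each class and the second the corresponding fractions. That interpretation is an assumption, and `ClassDistributionSketch` is a hypothetical helper, not Magpie code.

```java
/** Hypothetical helper, not part of Magpie: per-class counts and fractions. */
final class ClassDistributionSketch {
    /** Count how many entries fall in each class, indexed 0..nClasses-1. */
    static int[] count(int[] classes, int nClasses) {
        int[] counts = new int[nClasses];
        for (int value : classes) {
            counts[value]++;
        }
        return counts;
    }

    /** Fraction of entries in each class; all zeros when there are no entries. */
    static double[] fraction(int[] classes, int nClasses) {
        int[] counts = count(classes, nClasses);
        double[] fractions = new double[nClasses];
        if (classes.length == 0) {
            return fractions;
        }
        for (int i = 0; i < nClasses; i++) {
            fractions[i] = (double) counts[i] / classes.length;
        }
        return fractions;
    }
}
```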
public java.lang.String printDistribution()
public java.lang.String printEntries(int[] list)
list - ID numbers of entries to be printed.
public void saveState(java.lang.String filename) throws java.io.IOException
filename - Path to output file
java.io.IOException
public void saveTemplate(java.lang.String filename) throws java.io.IOException
filename - Path to output file
java.io.IOException
public java.lang.String printCommand(java.util.List<java.lang.String> Command) throws java.lang.Exception
Printable
printCommand in interface Printable
Command - Command specifying what to print
java.lang.Exception - If command not understood
public java.lang.String saveCommand(java.lang.String Basename, java.lang.String Command) throws java.lang.Exception
Savable
Dev Note: Make sure to add save format to Javadoc. See Dataset as an example. Required format:
<save><p><b>format</b> - Description
<br>Optional room to talk more about format </save>
saveCommand in interface Savable
Basename - Name of file without extension
Command - Command specifying format in which to print
java.lang.Exception - If command not understood
public java.lang.Object runCommand(java.util.List<java.lang.Object> command) throws java.lang.Exception
Commandable
runCommand in interface Commandable
command - Command as a list of objects
java.lang.Exception - If something goes wrong
protected java.lang.Object runAttributeCommand(java.util.List<java.lang.Object> Command) throws java.lang.Exception
Command - Operation to be run on/about attributes
java.lang.Exception - On any error
protected void runAttributeExpansionCommand(java.util.List<java.lang.Object> Command) throws java.lang.Exception
Command - Attribute expansion command (e.g., "run")
java.lang.Exception
protected void runAttributeGeneratorCommand(java.util.List<java.lang.Object> Command) throws java.lang.Exception
Command - Attribute generator command (e.g., "run")
java.lang.Exception
protected void finalizeGeneration()