RDFStats Manual for v2.0-beta

Previous manuals are available as part of the corresponding source code.

Important note: for v2.0 we temporarily dropped the feature of class-specific histogram generation. If you need class-specific histograms, you can check out revision 10 from trunk and read the previous manual.

Contents

  1. Quick Facts
  2. Sample Statistics Document
  3. Statistics Vocabulary
  4. Download
  5. Usage / Generation
  6. Usage / Decoding
  7. Using Configuration Files
  8. RDFStatsModel and Histogram API
  9. Embedding RDFStats
  10. Note on String Histograms
  11. Open Issues
  12. License
  13. Credits & Acknowledgements

Quick Facts

Sample Statistics Document

A typical statistics document has a stats:RDFStatsDataset (a sub-class of scv:Dataset) which represents a stats:RDFDocument or stats:SPARQLEndpoint by its stats:sourceUrl:

_:b1  a       stats:RDFStatsDataset ;
      <http://purl.org/dc/elements/1.1/creator>
              "dorgon@midearth" ;
      <http://purl.org/dc/elements/1.1/date>
              "2009-08-22T12:57:16.589Z"^^<http://www.w3.org/2001/XMLSchema#dateTime> ;
      stats:sourceType stats:SPARQLEndpoint ;
      stats:sourceUrl <http://localhost:8888/sparql> .

The histograms are represented as SCOVO data items, linked to the dataset resource, and associated to several dimensions. For instance, a stats:SubjectHistogram looks like this:

[]    a       stats:SubjectHistogram ;
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#value>
              """ATKM/emz97IAAAAFaHR0cDovL3d3dy53My5vcmcvMjAwMC8wMS9yZGYtc2NoZW1hI1Jlc291cmNl
A2h0dHA6Ly9leGFtcGxlLmNvbS9yZXNvdXJjZS9jb25mZXJlbmNlcy8yMzU0MQNodHRwOi8vZXhh
bXBsZS5jb20vcmVzb3VyY2UvdG9waWNzLzkDAAAAAAAAAAIAAAAAAAAACQAAAAAAAAAGAAAAAAAA
AAoAAAAAAAAADwAAAAAAAAACAAAAAAAAAAkAAAAAAAAABgAAAAAAAAAKAAAAAAAAAA9odHRwOi8v
ZXhhbXBsZS5jb20vcmVzb3VyY2UvY29uZmVyZW5jZXMDaHR0cDovL2V4YW1wbGUuY29tL3Jlc291
cmNlL29yZ2FuaXphdGlvbnMDaHR0cDovL2V4YW1wbGUuY29tL3Jlc291cmNlL3BhcGVycwNodHRw
Oi8vZXhhbXBsZS5jb20vcmVzb3VyY2UvcGVyc29ucwNodHRwOi8vZXhhbXBsZS5jb20vcmVzb3Vy
Y2UvdG9waWNzAw==""" ;
      <http://purl.org/NET/scovo#dataset>
              _:b1 ;
      stats:rangeDimension
              <http://www.w3.org/2000/01/rdf-schema#Resource> .

It has only a single dimension <http://www.w3.org/2000/01/rdf-schema#Resource>, which means the histogram refers to subject URI resources.

A stats:PropertyHistogram item may look like:

[]    a       stats:PropertyHistogram ;
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#value>
              """ATKM/TMuamQAAAAFaHR0cDovL3d3dy53My5vcmcvMjAwMS9YTUxTY2hlbWEjc3RyaW5nA0dlcm1h
bnkDVW5pdGVkIFN0YXRlcwMAAAAAAAAAAQAAAAAAAAABAAAAAAAAAAIAAAAAAAAAAQAAAAAAAAAB
AAAAAAAAAAEAAAAAAAAAAQAAAAAAAAABAAAAAAAAAAEAAAAAAAAAAUdlcm1hbnkDSXRhbHkDVGhl
IE5ldGhlcmxhbmRzA1VLA1VuaXRlZCBTdGF0ZXMD""" ;
      <http://purl.org/NET/scovo#dataset>
              _:b1 ;
      stats:propertyDimension
              <http://www.w3.org/2001/vcard-rdf/3.0#Country> ;
      stats:rangeDimension
              <http://www.w3.org/2001/XMLSchema#string> .

It has both a property and a range dimension. Here is the complete example.

Statistics Vocabulary

The vocabulary is based on a simplified version of SCOVO and available at http://purl.org/rdfstats/stats.

Note: we don't use instances per dimension but instead link directly to property and range URIs - see also ISSUE 18.

Also note that we don't use SCOVO to represent the individual histogram buckets, and it is unclear whether we will use SCOVO at all in the future. Until now we are not aware of generic tools that can parse and visualize SCOVO, and since our histograms are very specific types of statistical information, it is unlikely that a generic SCOVO parser will ever support them.

Download RDFStats Generator and Java Histogram Decoder

Version v2.0 is currently in preparation, please be patient.

The newest version can be checked out from the Subversion repository.

Usage: Statistics Generation

For your convenience, there are shell scripts available in the /bin folder. Use them on Linux/Mac OS X or with Cygwin on Windows, or compose the classpath manually by adding all the JARs from the /lib directory and execute java -cp $CP rdfstats.generate.

usage: parameters:
 -c,--config-file <filename>    RDFStats configuration file (either use
                                this and optionally -e OR only use the other command line parameters)
 -d,--document <document-url>   RDF document URL (format will be guessed by extension)
 -e,--endpoint <endpoint-uri>   SPARQL endpoint URI
 -f,--format <key>              File format (RDF/XML, N3, or N-TRIPLES),
                                guessed based on file extension if omitted
 -m,--strhist-maxlen <length>   Maximum length of strings processed for
                                StringOrderedHistogram, default is 2147483647
 -o,--out <filename>            Model file (loaded if exists as base
                                model; output is written to screen if omitted)
 -q,--quick                     Only generate histograms for new classes
                                or if the number of total instances has changed
 -s,--size <size>               Size of histograms (amount of bins), default is 50
 -t,--timezone <timezone>       The time zone to use when parsing date
                                values (default is your locale: Central European Time)

Example:

bin/generate -e http://localhost:8888/sparql -o statistics.n3 -f N3 -s 50

The statistics will be written into statistics.n3.

Debugging

Logback is used via SLF4J as the logging framework, but you can switch to any other framework by simply replacing JARs as described here. To change the log configuration, edit lib/logback/logback.xml.
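As an illustration, a minimal lib/logback/logback.xml might look like the following sketch (the pattern and log level shown are assumptions; adapt them to your needs):

```xml
<configuration>
  <!-- console appender; adjust the pattern as needed -->
  <appender name="STDOUT" class="ch.qos.logback.core.ConsoleAppender">
    <encoder>
      <pattern>%d{HH:mm:ss.SSS} %-5level %logger{36} - %msg%n</pattern>
    </encoder>
  </appender>
  <!-- set to debug while diagnosing statistics generation -->
  <root level="debug">
    <appender-ref ref="STDOUT" />
  </root>
</configuration>
```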

Usage: Decoding Statistics

For a quick look into a statistics graph, you can use bin/decode:

usage: parameters:
 -c,--config <filename>         Configuration file
 -e,--endpoint <endpoint-uri>   Only print statistics for this endpoint URI
 -d,--document <document-url>   Only print statistics for this document URL
 -i,--in <filename>             Input RDF file or Web URI
 -f,--format <format>           Input format (RDF/XML, N3, or N-TRIPLES),
                                default: auto-detect based on file extension
 -t,--timezone <timezone>       The time zone to use when parsing date
                                values (default is your locale: Central European Time)

This just pretty-prints the statistics and histogram data to the screen.

To process the statistics in your application, you usually use the RDFStatsModelFactory to create an RDFStatsModel:

RDFStatsModel m = RDFStatsModelFactory.create("file:statistics.n3", "N3");

Alternatively, you can specify another Jena model as the wrapped model for the RDFStatsModel:

Model m = ModelFactory.createDefaultModel();
m.read("file:statistics.n3", "N3");
RDFStatsModel stats = RDFStatsModelFactory.create(m);

Using Configuration Files

Instead of using command line parameters, RDFStats can be configured using RDF-based configuration files. The vocabulary for RDFStats configuration files can be found at http://purl.org/rdfstats/config. A sample configuration file is provided as part of the release in the root folder.

This is also useful when embedding RDFStats into your applications. If your application also uses RDF-based configuration, you may even embed the RDFStats configuration data into your own configuration and pass the model read from the file system (or the Web) to the constructor of RDFStatsConfiguration.

RDFStatsModel and Histogram API

Statistics and estimation functions are available via three interfaces: RDFStatsModel, RDFStatsDataset, and Histogram.

An RDFStatsDataset can be obtained from the RDFStatsModel by calling getDataset(String sourceUrl).

There are several other interfaces which are extended by these three. The interfaces are printed below:

public interface RDFStatsModel extends GlobalGraphStatistics {
	
	/** get the actual Jena model wrapped by the RDFStatsModel class */
	public Model getWrappedModel();

	/** get a list of all available SCOVO datasets describing RDF sources */
	public List<RDFStatsDataset> getDatasets();

	/** get the SCOVO dataset for an RDF source
	 * @throws RDFStatsModelException */
	public RDFStatsDataset getDataset(String sourceUrl) throws RDFStatsModelException;
	
	/** get subject histogram
	 * 
	 * @param sourceUrl of the dataset
	 * @param blankNodes if true, get GenericSingleBinHistogram over blank nodes
	 * @return the subject histogram if exists or null
	 */
	public Histogram<String> getSubjectHistogram(String sourceUrl, boolean blankNodes) throws RDFStatsModelException;

	/** get subject histogram as encoded string */
	public String getSubjectHistogramEncoded(String sourceUrl, boolean blankNodes) throws RDFStatsModelException;

	/**
	 * @param sourceUrl of the dataset
	 * @return list of all properties for which histograms are available */
	public List<String> getPropertyHistogramProperties(String sourceUrl);
	
	/**
	 * @param sourceUrl of the dataset
	 * @param rangeUri a specific range URI (e.g. http://www.w3.org/2001/XMLSchema#int or http://www.w3.org/2000/01/rdf-schema#Resource)
	 * @return list of all properties with the given range (URI) of property values for which histograms are available */
	public List<String> getPropertyHistogramProperties(String sourceUrl, String rangeUri);

	/**
	 * @param sourceUrl of the dataset
	 * @param property a specific property URI (e.g. http://xmlns.com/foaf/0.1/name or http://www.w3.org/1999/02/22-rdf-syntax-ns#type)
	 * @return list of all range URIs of the given property p for which histograms are available */
	public List<String> getPropertyHistogramRanges(String sourceUrl, String property);
	
	/** get histogram for property, range URI
	 * 
	 * @param sourceUrl of the dataset (must not be null)
	 * @param p a property
	 * @param rangeUri
	 * @return the histogram if exists or null
	 */
	public Histogram<?> getPropertyHistogram(String sourceUrl, String p, String rangeUri) throws RDFStatsModelException;

	/** get histogram as encoded string */
	public String getPropertyHistogramEncoded(String sourceUrl, String p, String rangeUri) throws RDFStatsModelException;
}

public interface GlobalGraphStatistics {

	/** get datasets that possibly have information about r 
	 * @throws RDFStatsModelException */
	public List<RDFStatsDataset> getDatasetsDescribingResource(String r) throws RDFStatsModelException;

	/** get the set of all properties */
	public Set<String> getProperties();
}

RDFStatsDataset:

public interface RDFStatsDataset extends JavaResourceView, GraphStatistics, QueryStatistics {

	public String getSourceType();	
	public String getSourceUrl();
	public String getCreator();
	public Calendar getCalendar();
	public Date getDate();

}

public interface JavaResourceView {

	public String getURI();
	public String getLocalName();
	public String getLabel();
	public Resource getWrappedResource();
	
}

public interface GraphStatistics {

	/** @return a list of all properties used */
	public Set<String> getProperties();
	
	/** @return total number of distinct subjects including blank nodes, exact value (no estimation) */
	public Long getSubjectsTotal() throws RDFStatsModelException;

	/** @return total number of blank nodes */
	public Long getAnonymousSubjectsTotal() throws RDFStatsModelException;

	/** @return total number of URI subjects */
	public Long getURISubjectsTotal() throws RDFStatsModelException;
	
	/** @return true if data source has no information about a subject (guaranteed), false positives possible, but no false negatives */
	public Boolean subjectNotExists(String uri) throws RDFStatsModelException;

// triple pattern estimation
	
	/** 
	 * @param s subject
	 * @param p predicate
	 * @param o object
	 * @return estimated amount of triples to expect from the triple pattern 
	 * @throws RDFStatsModelException */
	public Long triplesForPattern(Node s, Node p, Node o) throws RDFStatsModelException;
	
	/**
	 * @param s subject
	 * @param p predicate
	 * @param o object
	 * @param filter a list of filter expressions
	 * @return estimated amount of triples to expect from the filtered triple pattern 
	 * @throws RDFStatsModelException */
	public Long triplesForFilteredPattern(Node s, Node p, Node o, ExprList filter) throws RDFStatsModelException;

}
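To make the estimation contract concrete, here is a toy sketch of how an estimate for a triple pattern with a bound object could be read off an equal-width property histogram. All names and the binning scheme are illustrative assumptions, not the RDFStats implementation:

```java
// Toy sketch (illustrative, NOT the RDFStats implementation) of estimating
// the number of triples matching (?s, p, val) from a property histogram.
public class PatternEstimateSketch {

    /** equal-width bin index over [min, max); -1 if val is outside the data range */
    static int binIndex(long val, long min, long max, int numBins) {
        if (val < min || val >= max) return -1;
        return (int) ((val - min) * numBins / (max - min));
    }

    /** estimated triples for (?s, p, val): the bin quantity divided by the
     *  distinct values assumed per bin, at least 1 for a non-empty bin */
    static long estimatedQuantity(long[] binData, long distinctValues,
                                  long val, long min, long max) {
        int idx = binIndex(val, min, max, binData.length);
        if (idx < 0 || binData[idx] == 0) return 0;
        long distinctPerBin = Math.max(1, distinctValues / binData.length);
        return Math.max(1, binData[idx] / distinctPerBin);
    }

    public static void main(String[] args) {
        long[] bins = {10, 0, 6};  // 3 bins over the value range [0, 30)
        // value 25 falls into bin 2 (quantity 6); 6 distinct values -> 2 per bin
        System.out.println(estimatedQuantity(bins, 6, 25, 0, 30)); // prints 3
    }
}
```

Note how the empty middle bin yields an estimate of 0, matching the documented behavior of returning at least 1 only "if there is any value in the bin".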

/**
 * each estimate function returns 3 values as an array with indexes:
 *   0: expected minimum triples
 *   1: expected average triples (good estimate)
 *   2: expected maximum triples (if this is 0, it is guaranteed that there are no false negatives but maybe false positives)
 */
public interface QueryStatistics {

	public Long[] triplesForBGP(BasicPattern bgp) throws RDFStatsModelException;
	public Long[] triplesForFilteredBGP(BasicPattern bgp, ExprList exprs) throws RDFStatsModelException;
	public Long[] triplesForQuery(String qry) throws RDFStatsModelException;
	public Long[] triplesForQuery(Query qry) throws RDFStatsModelException;
	public Long[] triplesForQueryPlan(Op plan) throws RDFStatsModelException;
	
}
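A small helper makes the meaning of the returned Long[] triple explicit (the helper names are illustrative assumptions, not part of the RDFStats API):

```java
// Illustrative helper for reading the Long[] returned by the QueryStatistics
// methods: index 0 = expected minimum, 1 = average (good estimate),
// 2 = expected maximum number of triples.
public class EstimateTriple {
    public static final int MIN = 0, AVG = 1, MAX = 2;

    /** a source can be safely skipped when the maximum estimate is 0 */
    static boolean canPrune(Long[] estimate) {
        return estimate[MAX] == 0L;
    }

    /** use the average as the planning estimate */
    static long planningEstimate(Long[] estimate) {
        return estimate[AVG];
    }

    public static void main(String[] args) {
        Long[] est = {0L, 12L, 40L};   // hypothetical result of triplesForBGP(...)
        System.out.println(canPrune(est));         // false: up to 40 triples possible
        System.out.println(planningEstimate(est)); // 12
    }
}
```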

There is also an extended model for the concurrent (multi-threaded) manipulation of data: the RDFStatsUpdatableModel:

public interface RDFStatsUpdatableModel extends RDFStatsModel {

	/** get the actual Jena model wrapped by the RDFStatsModel class
	 * 
	 * Attention! The obtained model must not be altered while other processes
	 * may alter it concurrently (usually under the exclusive write lock, which may be
	 * obtained by requestExclusiveWriteLock(RDFStatsDataset ds)).
	 * 
	 * @return the wrapped Jena model
	 */
	public Model getWrappedModel();		

// locking
	
	/**
	 * request the exclusive write lock for an RDFStatsDataset
	 * 
	 * This is a simple lock over the complete updatable model which can only be acquired by one process at a time.
	 * An additional MRSW lock provided by Jena is used, so during this exclusive write lock any other process may still access the
	 * underlying RDFStatsModel as long as none of the actually updating (writing) methods is currently executing (they use the Jena Lock.WRITE).
	 * 
	 * The process must return the exclusive lock after it has finished the update process by calling returnExclusiveWriteLock().
	 * 
	 * @param ds if null, request the write lock for all statistics
	 */
	public void requestExclusiveWriteLock(RDFStatsDataset ds);
	
	/**
	 * returns the exclusive write lock
	 * 
	 * @param ds if null, return the write lock for all statistics
	 * @throws RDFStatsModelException 
	 */
	public void returnExclusiveWriteLock(RDFStatsDataset ds) throws RDFStatsModelException;

	
// modifications
	
	/**
	 * updates a dataset
	 * 
	 * @param ds
	 * @param creator
	 * @param date
	 * @return again the ds reference
	 * @throws RDFStatsModelException
	 */
	public RDFStatsDataset updateDataset(RDFStatsDataset ds, String creator, Calendar date) throws RDFStatsModelException;
	
	/**
	 * create a new dataset and acquire the lock for it
	 * returns the new dataset reference, which must be used for further calls to modifying methods
	 * 
	 * 
	 * @param sourceUrl the URI (either to a document or SPARQL endpoint)
	 * @param sourceType URI reference to {@link Stats}.SPARQLEndpoint or .RDFDocument
	 * @param creator
	 * @param date
	 * @return the new dataset
	 * @throws RDFStatsModelException
	 */
	public RDFStatsDataset addDatasetAndLock(String sourceUrl, String sourceType, String creator, Calendar date) throws RDFStatsModelException;
	
	/** 
	 * removes SCOVO items which are part of the dataset ds and have not been changed since the last call to requestExclusiveWriteLock()
	 * 
	 * @param ds if null, removes all unchanged items regardless of the dataset
	 * @return the number of removed items
	 * @throws RDFStatsModelException 
	 */
	public int removeUnchangedItems(RDFStatsDataset ds) throws RDFStatsModelException;

	/**
	 * explicitly tell the updatable model to keep this histogram when calling removeUnchangedItems(RDFStatsDataset ds);
	 * 
	 * @param dataset
	 * @param p
	 * @param rangeUri
	 * @throws RDFStatsModelException 
	 */
	public void keepPropertyHistogram(RDFStatsDataset dataset, String p, String rangeUri) throws RDFStatsModelException;

	/**
	 * explicitly tell the updatable model to keep this subject histogram when calling removeUnchangedItems(RDFStatsDataset ds);
	 * 
	 * @param dataset
	 * @param blankNodes
	 * @throws RDFStatsModelException 
	 */
	public void keepSubjectHistogram(RDFStatsDataset dataset, boolean blankNodes) throws RDFStatsModelException;

	/**
	 * create a new or update existing histogram for specific dataset, property, and rangeUri
	 *
	 * @param dataset
	 * @param p
	 * @param rangeUri
	 * @param encodedHistogram
	 * @throws RDFStatsModelException
	 */
	public boolean addOrUpdatePropertyHistogram(RDFStatsDataset dataset, String p, String rangeUri, String encodedHistogram) throws RDFStatsModelException;

	/**
	 * create a new or update existing subject histogram for specific dataset
	 *
	 * @param dataset
	 * @param blankNodes
	 * @param encodedHistogram
	 * @throws RDFStatsModelException
	 */
	public boolean addOrUpdateSubjectHistogram(RDFStatsDataset dataset, boolean blankNodes, String encodedHistogram) throws RDFStatsModelException;

	/** merge (optionally only newer) statistics from Model newModel into this model 
	 * 
	 * @param newModel the new model containing one or more RDFStats statistics datasets
	 * @param onlyNewer if true, statistics are only merged if the dc:date of a new dataset is newer than that of the possibly existing dataset for the same RDF source
	 * 
	 * @return true if the update fully succeeded, false if only partly
	 * @throws RDFStatsModelException
	 */
	public boolean updateFrom(RDFStatsModel newModel, boolean onlyNewer) throws RDFStatsModelException;
	
	/**
	 * merge (optionally only newer) statistics from Model newModel into this model
	 * similar to updateFrom(RDFStatsModel newModel, boolean onlyNewer), but restricted to the datasets for sourceUrl
	 * 
	 * @param sourceUrl only import statistics for this RDF source
	 * @param newModel
	 * @param onlyNewer
	 * 
	 * @return true if import finished successfully
	 * @throws RDFStatsModelException
	 */
	public boolean updateFrom(String sourceUrl, RDFStatsModel newModel, boolean onlyNewer) throws RDFStatsModelException;

}

This is the Histogram API:

/**
 * @author dorgon
 *
 * A histogram using the NATIVE Java type.
 * 
 * There are different implementations for the different internally used Java data types, such as Integer, Float, and String.
 * Each histogram also stores the datatype URI of the original RDF node (see {@link RDF2JavaMapper}.getType(Node val)).
 * 
 * Methods are either implemented by {@link AbstractHistogram} or by the concrete implementations (e.g. all methods
 * with NATIVE attributes are implemented specifically).
 */
public interface Histogram<NATIVE> {

	/** 
	 * @return total number of bins used (i.e. size of the histogram)
	 */
	public int getNumBins();
	
	/**
	 * @return histogram data as long[] (bin data)
	 */
	public long[] getBinData();

	/**
	 * @return data type URI of the source values (see {@link RDF2JavaMapper}.getType(Node val) for details on this URI)
	 */
	public String getDatatypeUri();
	
	/**
	 * @param idx the bin index
	 * @return absolute bin quantity (size of the bin with index idx)
	 */
	public long getBinQuantity(int idx);
	
	/**
	 * @param idx the bin index
	 * @return relative bin quantity (bin quantity / totalValues) in the range [0..1]
	 */
	public float getBinQuantityRelative(int idx);
	
	/**
	 * @param val a NATIVE value
	 * @return estimated quantity for the value; an estimate, but at least 1 if there is any value in the bin
	 */
	public long getEstimatedQuantity(NATIVE val);
	
	/**
	 * @param val a NATIVE value
	 * @return estimated relative quantity in the range [0..1]; an estimate, but > 0 if there is any value in the bin
	 */
	public float getEstimatedQuantityRelative(NATIVE val);
	
	/**
	 * @return the total amount of values in the source distribution, also used as divisor for normalization
	 */
	public long getTotalValues();
	
	/**
	 * @return the number of distinct values in the source distribution
	 */
	public long getDistinctValues();
	
	/**
	 * @return true if the source values are unique (e.g. a primary key of a database)
	 */
	public boolean hasUniqueValues();
	
	/**
	 * @param val a NATIVE value
	 * @return the bin index the NATIVE value goes into; returns -1 if the value is outside of the histogram data range
	 */
	public int getBinIndex(NATIVE val);

	/**
	 * parse a node value to native representation
	 * 
	 * this method must have a static version parseNodeValueImpl which can also be used by the
	 * corresponding HistogramBuilder without instantiating the concrete Histogram class
	 *
	 * @param val
	 * @return native Java type
	 * @throws ParseException
	 */
	public NATIVE parseNodeValue(Node val) throws ParseException;
}

Histogram types with a comparable domain (metric scale) also implement:


	public NATIVE getMin();
	public NATIVE getMax();
	
	/**
	 * @param val a NATIVE value
	 * @return the cumulative quantity from 0 to a NATIVE value
	 */
	public long getCumulativeQuantity(NATIVE val);
	
	/**
	 * @param idx bin index
	 * @return the cumulative bin quantity from bin 0 to bin index idx
	 */
	public long getCumulativeBinQuantity(int idx);
	
	/**
	 * @param val a NATIVE value
	 * @return the cumulative relative quantity in the range [0..1]
	 */
	public float getCumulativeQuantityRelative(NATIVE val);
	
	/**
	 * @param idx bin index
	 * @return the cumulative relative quantity in the range [0..1]
	 */
	public float getCumulativeBinQuantityRelative(int idx);	
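The assumed semantics of the cumulative accessors can be sketched as follows (a toy illustration, not the RDFStats implementation; the bin data is hypothetical):

```java
// Sketch of the cumulative accessors: the cumulative bin quantity up to
// index idx is the sum of bin quantities 0..idx, and the relative variant
// divides by the total number of values.
public class CumulativeSketch {

    static long cumulativeBinQuantity(long[] binData, int idx) {
        long sum = 0;
        for (int i = 0; i <= idx; i++) sum += binData[i]; // sum bins 0..idx
        return sum;
    }

    static float cumulativeBinQuantityRelative(long[] binData, long totalValues, int idx) {
        return (float) cumulativeBinQuantity(binData, idx) / totalValues;
    }

    public static void main(String[] args) {
        long[] bins = {2, 9, 6, 10, 15};     // hypothetical bin data, 42 values total
        System.out.println(cumulativeBinQuantity(bins, 2));             // prints 17
        System.out.println(cumulativeBinQuantityRelative(bins, 42, 2)); // ~0.405
    }
}
```

Differences of two cumulative quantities yield range selectivities, e.g. getCumulativeQuantity(b) - getCumulativeQuantity(a) estimates the number of values in (a, b].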

Embedding RDFStats into other Applications

There are several ways to embed RDFStats, depending on your requirements. The following examples are illustrated in the class at.faw.rdfstats.samples.EmbeddingSamples.

Monitor a set of endpoints

This is similar to the standalone program: you want to monitor several SPARQL endpoints and make the statistics available via a central RDFStatsModel, but embedded into your application. In this case, just use the class GeneratorMultiple and supply an RDFStatsConfiguration object:

Model cfgModel = FileManager.get().loadModel("sample-config.ttl"); // or use your application config model
RDFStatsConfiguration cfg = new RDFStatsConfiguration(cfgModel);
GeneratorMultiple multiGen = new GeneratorMultiple(cfg);
Model stats = multiGen.generate();
stats.commit(); // required when using FileModel with Jena assembler to flush all data to disk

// now access via RDFStatsModel API
RDFStatsModel s = RDFStatsModelFactory.create(stats);
RDFStatsDataset ds = s.getDataset(null);
ds.get...

The RDFStatsModel stats may be accessed by other processes (see API).

Another way is to use the RDFStatsGeneratorSPARQL class (obtained by calling RDFStatsGeneratorFactory.generatorSPARQL(String)), which only fetches statistics for a single endpoint. If you don't have a configuration, just generate a default one as in this example and specify an endpoint:

RDFStatsConfiguration conf = RDFStatsConfiguration.getDefault();
RDFStatsGeneratorSPARQL gen = RDFStatsGeneratorFactory.generatorSPARQL(conf, "http://localhost:8888/sparql");
gen.generate();

// access data
RDFStatsModel s2 = RDFStatsModelFactory.create(conf.getStatsModel()); // model is referenced in configuration
s2.get...

Note on String Histograms

It is generally not trivial to generate meaningful histograms for strings, mainly because strings cannot easily be mapped to a metric scale like real numbers or integers. It is also difficult to find literature about auto-scaling string histograms, which was a major requirement for RDFStats (if you have pointers, I would be pleased to receive them!).

Basically, there are three approaches:

  1. no compression (big histograms, one bin for each distinct string)
  2. hash compression (order-preserving??)
  3. compression to common prefixes

Approach 1 is implemented in SimpleStringHistogram, which is used for URIs. Approach 3 is implemented in OrderedStringHistogram, because it is order-preserving and the histogram size can be arbitrarily scaled based on occurring common prefixes. Approach 2 is not trivial for the general case and for sparse distributions with run-away values.
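The idea behind common-prefix compression can be sketched in a few lines. This toy example (NOT the OrderedStringHistogram implementation) groups sorted strings by their first k characters, so the bins stay ordered and the bin count shrinks as k decreases; the country names are taken from the PropertyHistogram example above:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy sketch of prefix-based binning for string histograms: group values
// by a fixed-length prefix; a TreeMap keeps the bins in string order.
public class PrefixBinSketch {
    static Map<String, Integer> binsByPrefix(List<String> values, int prefixLen) {
        Map<String, Integer> bins = new TreeMap<>();
        for (String v : values) {
            String key = v.length() <= prefixLen ? v : v.substring(0, prefixLen);
            bins.merge(key, 1, Integer::sum); // count values per prefix bin
        }
        return bins;
    }

    public static void main(String[] args) {
        List<String> vals = Arrays.asList("Germany", "Germany", "Italy",
                "The Netherlands", "UK", "United States");
        System.out.println(binsByPrefix(vals, 2));
        // prints {Ge=2, It=1, Th=1, UK=1, Un=1} -- 5 ordered bins for 6 values
    }
}
```

A real implementation additionally has to choose prefix lengths adaptively per bin so that the histogram hits a target size, which is what makes the approach auto-scaling.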

Open Issues

License

RDFStats is licensed under the Apache Software License 2.0.

Credits & Acknowledgements

Contact: aka AndyL

This work is funded by the Austrian BMBWK (Federal Ministry for Education, Science and Culture), contract GZ BMWF-10.220/0002-II/10/2007.

Thanks to SourceForge.net for providing the infrastructure.
