Where to get the data sets used in PRML (sources that currently work)

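I wanted to try implementing PRML chapter 12, but the distribution page for the Oil Flow data is gone, so I had given up. After some determined searching, though, I managed to pull it out of webarchive. They even archived the tar.gz file... Amazing. Does anyone want it? Would it be a problem to redistribute it?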
2010-05-24 @shuyo Nakatani Shuyo

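The distribution page for the Oil Flow data has been a dead link for a long time... Everything else can be obtained through the links on Prof. Bishop's page, but I am listing it all here by way of introduction.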

Handwritten Digits (the MNIST data set)

The MNIST digits data are available from Yann LeCun’s MNIST page, which also contains a detailed description of the data. There's also a Matlab function to read the data into Matlab under Windows.

http://yann.lecun.com/exdb/mnist/
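
For anyone who would rather read the files from Python than from Matlab, here is a minimal sketch of a reader. It assumes the header layout as I understand it from the description on the MNIST page: two zero bytes, a type code (0x08 for unsigned bytes), the number of dimensions, and then one big-endian 32-bit size per dimension. The function name parse_idx is my own, and the demo input is a tiny hand-built byte string, not real MNIST data.

```python
import struct

import numpy as np


def parse_idx(raw: bytes) -> np.ndarray:
    """Parse an IDX-format byte string holding unsigned bytes."""
    # Header: two zero bytes, a type code, and the number of dimensions.
    zero, type_code, ndim = struct.unpack(">HBB", raw[:4])
    if zero != 0 or type_code != 0x08:
        raise ValueError("not an unsigned-byte IDX file")
    # Each dimension size is a big-endian 32-bit unsigned integer.
    header_size = 4 + 4 * ndim
    shape = struct.unpack(">" + "I" * ndim, raw[4:header_size])
    data = np.frombuffer(raw, dtype=np.uint8, offset=header_size)
    # reshape raises if the payload length does not match the header.
    return data.reshape(shape)


# Hand-built stand-in for an image file: 2 "images" of 2x2 pixels.
demo_raw = struct.pack(">HBBIII", 0, 0x08, 3, 2, 2, 2) + bytes(range(8))
demo = parse_idx(demo_raw)
print(demo.shape)
```

The downloads are gzip-compressed, so for a real file something like parse_idx(gzip.open("train-images-idx3-ubyte.gz", "rb").read()) should work, assuming the file name on the MNIST page is unchanged.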

Oil Flow

This data set can be retrieved in various formats from the GTM data web-page.

Unfortunately, the data can no longer be downloaded from that "GTM data web-page".

→→ I found a copy that has been converted to MATLAB format:
http://code.google.com/p/pmtkdata/source/browse/trunk/oilFlow3Class/oilFlow3Class.mat

It turns out SciPy can read MATLAB files... loadmat() is a lifesaver.
After that, a short script like this does the rest.
mat2txt.py:

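import numpy
import scipy.io

# Load every variable stored in the MATLAB file into a dict of arrays.
x = scipy.io.loadmat("./oilFlow3Class.mat")
for k, v in x.items():
    # loadmat also returns metadata entries such as __header__; skip them.
    if k.startswith("__"):
        continue
    # Report the size as columns x rows, then write the array as plain text.
    print("%s.txt: %dx%d" % (k, v.shape[1], v.shape[0]))
    numpy.savetxt(k + ".txt", v, fmt="%15.7e")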

This produces the following nine files:

training data
- DataTrn.txt: 12x1000
  // 1000 measurements
- DataTrnFrctns.txt: 2x1000
  // the corresponding fractions of water and oil (in that order)
- DataTrnLbls.txt: 3x1000
  // the corresponding configuration labels, given in a 1-of-3 coding scheme, where
  [1 0 0] == Homogeneous configuration
  [0 1 0] == Annular configuration
  [0 0 1] == Stratified configuration

validation data
- DataVdn.txt: 12x1000
- DataVdnFrctns.txt: 2x1000
- DataVdnLbls.txt: 3x1000

test data
- DataTst.txt: 12x1000
- DataTstFrctns.txt: 2x1000
- DataTstLbls.txt: 3x1000

// the three file sets all contain 1000 samples. The fractions and configurations are picked at random from corresponding uniform distributions.
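
The 1-of-3 label files are easier to work with once they are turned into class indices. Here is a minimal sketch of that step. The function name decode_labels and the CONFIGURATIONS tuple are my own, the order of the names follows the coding scheme listed above, and the demo array is hand-written rather than taken from the data.

```python
import numpy as np

# Order follows the 1-of-3 coding scheme: [1 0 0], [0 1 0], [0 0 1].
CONFIGURATIONS = ("Homogeneous", "Annular", "Stratified")


def decode_labels(one_hot):
    """Convert rows of 1-of-3 coded labels into class indices 0, 1, 2."""
    one_hot = np.asarray(one_hot)
    if one_hot.ndim != 2 or one_hot.shape[1] != len(CONFIGURATIONS):
        raise ValueError("expected an array with 3 columns")
    # Every row should contain exactly one 1 and otherwise 0s.
    if not np.all(one_hot.sum(axis=1) == 1):
        raise ValueError("rows are not in 1-of-3 coding")
    return one_hot.argmax(axis=1)


# Hand-written demo rows, not real data.
demo_labels = np.array([[1, 0, 0], [0, 0, 1], [0, 1, 0]])
demo_index = decode_labels(demo_labels)
print([CONFIGURATIONS[i] for i in demo_index])
```

With the files written by mat2txt.py in the current directory, decode_labels(np.loadtxt("DataTrnLbls.txt")) should give one index per training sample.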

The data looks like this:
f:id:n4_t:20111229201309p:plain

Old Faithful (geyser data)

There are several Old Faithful data sets in existence. The one used in PRML, which seems to be the most widely adopted, is available here.

http://research.microsoft.com/en-us/um/people/cmbishop/PRML/webdatasets/faithful.txt
Column 1: duration of the most recent eruption
Column 2: waiting time until the next eruption
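
A minimal sketch of splitting the two columns with NumPy. It assumes the file is whitespace-separated with no header line, which I have not verified for every copy of the file. The function name load_faithful is my own, and the demo rows are made-up placeholder values, not real measurements.

```python
import io

import numpy as np


def load_faithful(source):
    """Read a two-column text file and return (duration, waiting)."""
    # ndmin=2 keeps the result two-dimensional even for a single row.
    data = np.loadtxt(source, ndmin=2)
    if data.shape[1] != 2:
        raise ValueError("expected exactly 2 columns")
    return data[:, 0], data[:, 1]


# Made-up placeholder rows standing in for faithful.txt.
demo_file = io.StringIO("3.5 80\n2.0 55\n4.5 85\n")
duration, waiting = load_faithful(demo_file)
print(duration, waiting)
```

For the real data, pass the path of the downloaded file instead of the StringIO object: load_faithful("faithful.txt").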

Synthetic Data

Curve Fitting

The curve fitting data contains 10 data, uniformly spaced on [0,1] in x-space and with
y = sin(2πx) + N(0,0.3),
i.e., with Gaussian noise of variance 0.09. The file has 10 rows of 2 columns ([x,y]). This is the actual data that was used to generate the plots in figure 1.4 (and others).

http://research.microsoft.com/en-us/um/people/cmbishop/PRML/webdatasets/curvefitting.txt
X, Y

f:id:n4_t:20111229051319p:plain
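
If the file ever disappears as well, data of the same kind can be regenerated from the recipe above. This is only a sketch: it will not reproduce the actual numbers in curvefitting.txt, since the original random draws are unknown. The function name make_curve_data and the seed are my own, and I am assuming "uniformly spaced on [0,1]" includes both endpoints.

```python
import numpy as np


def make_curve_data(n=10, noise_std=0.3, seed=0):
    """Return an (n, 2) array of [x, y] with y = sin(2*pi*x) + noise."""
    rng = np.random.default_rng(seed)
    # n points evenly spaced on [0, 1], endpoints included (assumption).
    x = np.linspace(0.0, 1.0, n)
    # Gaussian noise with standard deviation 0.3, i.e. variance 0.09.
    noise = rng.normal(0.0, noise_std, size=n)
    y = np.sin(2.0 * np.pi * x) + noise
    return np.column_stack([x, y])


curve = make_curve_data()
print(curve.shape)
```

The result has the same layout as the file: 10 rows of 2 columns.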

Classification

The classification data contains 200 data, sampled from a 3-component Gaussian mixture in 2D. This data was generated using the gmmsamp function from Netlab. The corresponding Gaussian mixture model had the parameters:

mix.priors = [0.5 0.25 0.25];
mix.centres = [0 -0.1; 1 1; 1 -1];
mix.covars(:,:,1) = [0.625 -0.2165; -0.2165 0.875];
mix.covars(:,:,2) = [0.2241 -0.1368; -0.1368 0.9759];
mix.covars(:,:,3) = [0.2375 0.1516; 0.1516 0.4125];

The first component represents class 1 (blue circles, o, in the left panel of Figure A.7), the other components class 0 (red crosses, ×). The file has 200 rows of 3 columns, the first two columns giving datum position, the last column containing the label (0/1).

http://research.microsoft.com/en-us/um/people/cmbishop/PRML/webdatasets/classification.txt
X, Y, class (0/1)
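
The same kind of data can be sampled with NumPy instead of Netlab's gmmsamp. This is a sketch of that substitution and will not reproduce the actual rows of classification.txt. The mixture parameters are copied from the listing above; the function name sample_classification and the seed are my own.

```python
import numpy as np

# Mixture parameters copied from the description of the data set.
PRIORS = np.array([0.5, 0.25, 0.25])
CENTRES = np.array([[0.0, -0.1], [1.0, 1.0], [1.0, -1.0]])
COVARS = np.array([
    [[0.625, -0.2165], [-0.2165, 0.875]],
    [[0.2241, -0.1368], [-0.1368, 0.9759]],
    [[0.2375, 0.1516], [0.1516, 0.4125]],
])


def sample_classification(n=200, seed=0):
    """Return an (n, 3) array of [x, y, label] drawn from the mixture."""
    rng = np.random.default_rng(seed)
    # Pick a mixture component for every sample according to the priors.
    component = rng.choice(len(PRIORS), size=n, p=PRIORS)
    points = np.empty((n, 2))
    for k in range(len(PRIORS)):
        mask = component == k
        points[mask] = rng.multivariate_normal(
            CENTRES[k], COVARS[k], size=int(mask.sum())
        )
    # The first component is class 1, the other two are class 0.
    labels = (component == 0).astype(float)
    return np.column_stack([points, labels])


samples = sample_classification()
print(samples.shape)
```

The output has the same layout as the file: 200 rows of 3 columns, with the label in the last column.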

naoya_t@hatenablog

Scribblings on the back of a flyer, as they say