Where to get the datasets used in PRML (sources that currently work)
2010-05-24 @shuyo Nakatani Shuyo
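I wanted to try implementing PRML chapter 12, but the distribution page for the Oil Flow data is gone, so I had given up. After some determined searching, though, I managed to pull it from webarchive. They even archived the tar.gz file... incredible. Does anyone want it? Would redistributing it be a problem?
The distribution page for the Oil Flow data has been a dead link for ages... The other datasets are all available from Bishop's links, but I'm listing them here as well by way of introduction.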
Handwritten Digits - MNIST data set
The MNIST digits data are available from Yann LeCun’s MNIST page, which also contains a detailed description of the data. There's also a Matlab function to read the data into Matlab under Windows.
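For anyone not on Matlab, here is a minimal Python sketch of a reader for the IDX format that the MNIST page describes. The function name `parse_idx` is my own, and it only handles the unsigned-byte case used by the MNIST image and label files; treat it as a sketch under those assumptions rather than a complete reader. The demo input is a tiny hand-made byte string, not real MNIST data.

```python
import struct
import numpy as np

def parse_idx(raw):
    """Parse an IDX-format byte string into a NumPy uint8 array (hypothetical helper)."""
    # Header: two zero bytes, a data-type code (0x08 = unsigned byte),
    # and the number of dimensions.
    zero, dtype_code, ndim = struct.unpack(">HBB", raw[:4])
    if zero != 0 or dtype_code != 0x08:
        raise ValueError("not an unsigned-byte IDX file")
    # Each dimension size is stored as a big-endian 32-bit integer.
    shape = struct.unpack(">" + "I" * ndim, raw[4:4 + 4 * ndim])
    data = np.frombuffer(raw, dtype=np.uint8, offset=4 + 4 * ndim)
    return data.reshape(shape)

# Hand-made stand-in for an image file: 2 "images" of 2x3 pixels.
fake = struct.pack(">HBBIII", 0, 0x08, 3, 2, 2, 3) + bytes(range(12))
images = parse_idx(fake)
print(images.shape)  # (2, 2, 3)
```

The downloadable files are gzip-compressed, so with a real file you would pass something like `gzip.open(path, "rb").read()` to the function, where `path` points at wherever you saved the download.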
Oil Flow
This data set can be retrieved in various formats from the GTM data web-page.
Downloads from that "GTM data web-page" no longer work ><
→→ Found a copy that has been converted to MATLAB format
http://code.google.com/p/pmtkdata/source/browse/trunk/oilFlow3Class/oilFlow3Class.mat
So SciPy can read MATLAB files... loadmat() is a godsend
After that, a script as small as this does the job:
mat2txt.py:
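    import numpy
    import scipy.io

    # Load every variable stored in the MATLAB file into a dict of arrays.
    x = scipy.io.loadmat("./oilFlow3Class.mat")
    for k, v in x.items():
        if k.startswith("__"):
            continue  # skip loadmat metadata such as __header__
        # Report columns x rows, then write the array out as text.
        print("%s.txt: %dx%d" % (k, len(v[0]), len(v)))
        numpy.savetxt(k + ".txt", v, fmt="%15.7e")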
This produces the following nine files (sizes are columns x rows):
- training data
- DataTrn.txt: 12x1000
// 1000 measurements
- DataTrnFrctns.txt: 2x1000
// the corresponding fractions of water and oil (in that order)
- DataTrnLbls.txt: 3x1000
// the corresponding configuration labels, given in a 1-of-3 coding scheme, where
[1 0 0] == Homogeneous configuration
[0 1 0] == Annular configuration
[0 0 1] == Stratified configuration
- validation data
- DataVdn.txt: 12x1000
- DataVdnFrctns.txt: 2x1000
- DataVdnLbls.txt: 3x1000
- test data
- DataTst.txt: 12x1000
- DataTstFrctns.txt: 2x1000
- DataTstLbls.txt: 3x1000
// the three file sets all contain 1000 samples. The fractions and configurations are picked at random from corresponding uniform distributions.
So that is the kind of data it is.
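Since the labels come in a 1-of-3 coding, most code will want them as plain class indices. Below is a minimal sketch of that conversion; `labels_to_index` and `CONFIGURATIONS` are names I made up for illustration, and the sample array is a hand-written stand-in for what `numpy.loadtxt("DataTrnLbls.txt")` would return (1000 rows of 3 columns, assuming the script above wrote the file).

```python
import numpy as np

# Order follows the 1-of-3 coding listed above.
CONFIGURATIONS = ["Homogeneous", "Annular", "Stratified"]

def labels_to_index(one_hot):
    """Convert rows of 1-of-3 coded labels into integer class indices 0..2."""
    one_hot = np.asarray(one_hot)
    # The position of the single 1 in each row is the class index.
    return one_hot.argmax(axis=1)

# Hand-written stand-in for a few rows of DataTrnLbls.txt.
sample = np.array([[0, 1, 0], [1, 0, 0], [0, 0, 1], [0, 1, 0]], dtype=float)
idx = labels_to_index(sample)
print([CONFIGURATIONS[i] for i in idx])
# ['Annular', 'Homogeneous', 'Stratified', 'Annular']
```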
Old Faithful - geyser data
There are several Old Faithful data sets in existence. The one used in PRML, which seems to be the most widely adopted, is available here.
http://research.microsoft.com/en-us/um/people/cmbishop/PRML/webdatasets/faithful.txt
Column 1: duration of the most recent eruption (minutes)
Column 2: waiting time until the next eruption (minutes)
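A rough sketch of reading the two columns with NumPy follows. It assumes the file is plain whitespace-separated numbers with no header line, which I have not re-checked against the actual download; `load_faithful` is a hypothetical helper, and the sample text is just a few illustrative rows typed in by hand rather than the file itself.

```python
import io
import numpy as np

def load_faithful(text):
    """Split two-column text into (duration, waiting) arrays (hypothetical helper)."""
    # Assumption: whitespace-separated numbers, one sample per line, no header.
    data = np.loadtxt(io.StringIO(text))
    return data[:, 0], data[:, 1]

# Illustrative rows typed in by hand, not read from faithful.txt.
sample = "3.6 79\n1.8 54\n3.333 74\n"
duration, waiting = load_faithful(sample)

# Rescale each column to zero mean and unit variance, a common
# preprocessing step before clustering this kind of data.
standardized = np.column_stack([
    (duration - duration.mean()) / duration.std(),
    (waiting - waiting.mean()) / waiting.std(),
])
print(standardized.shape)  # (3, 2)
```

With the real file you would pass `open("faithful.txt").read()` instead of the sample string.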
Synthetic Data
Curve Fitting
The curve fitting data contains 10 data points, uniformly spaced on [0,1] in x-space and with
y = sin(2πx) + N(0,0.3),
i.e., with Gaussian noise of standard deviation 0.3 (variance 0.09). The file has 10 rows of 2 columns ([x,y]). This is the actual data that was used to generate the plots in figure 1.4 (and others).
http://research.microsoft.com/en-us/um/people/cmbishop/PRML/webdatasets/curvefitting.txt
X, Y
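If you just want data of the same kind rather than the exact file, the recipe above is easy to reproduce. This is a minimal sketch: `make_curve_data` and the seed are my own choices, and because the random draws differ it will not reproduce the values in curvefitting.txt.

```python
import numpy as np

def make_curve_data(n=10, noise_std=0.3, seed=0):
    """Generate [x, y] rows with y = sin(2*pi*x) + Gaussian noise (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # n points uniformly spaced on [0, 1], endpoints included.
    x = np.linspace(0.0, 1.0, n)
    # Noise with standard deviation 0.3, i.e. variance 0.09.
    y = np.sin(2 * np.pi * x) + rng.normal(0.0, noise_std, size=n)
    return np.column_stack([x, y])

data = make_curve_data()
print(data.shape)  # (10, 2)
```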
Classification
The classification data contains 200 data, sampled from a 3-component Gaussian mixture in 2D. This data was generated using the gmmsamp function from Netlab. The corresponding Gaussian mixture model had the parameters:
mix.priors = [0.5 0.25 0.25];
mix.centres = [0 -0.1; 1 1; 1 -1];
mix.covars(:,:,1) = [0.625 -0.2165; -0.2165 0.875];
mix.covars(:,:,2) = [0.2241 -0.1368; -0.1368 0.9759];
mix.covars(:,:,3) = [0.2375 0.1516; 0.1516 0.4125];
The first component represents class 1 (blue circles, o, in the left panel of Figure A.7), the other components class 0 (red crosses, ×). The file has 200 rows of 3 columns, the first two columns giving datum position, the last column containing the label (0/1).
http://research.microsoft.com/en-us/um/people/cmbishop/PRML/webdatasets/classification.txt
X, Y, class (0/1)
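The same kind of data can be drawn in Python without Netlab. The sketch below uses the mixture parameters quoted above, but it is a NumPy stand-in for `gmmsamp`, not a port of it; `sample_classification` and the seed are my own, so the samples will not match classification.txt.

```python
import numpy as np

# Mixture parameters copied from the description above.
PRIORS = np.array([0.5, 0.25, 0.25])
CENTRES = np.array([[0.0, -0.1], [1.0, 1.0], [1.0, -1.0]])
COVARS = np.array([
    [[0.625, -0.2165], [-0.2165, 0.875]],
    [[0.2241, -0.1368], [-0.1368, 0.9759]],
    [[0.2375, 0.1516], [0.1516, 0.4125]],
])

def sample_classification(n=200, seed=0):
    """Draw n rows of [x, y, label] from the 3-component mixture (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # Pick a mixture component for each sample according to the priors.
    comp = rng.choice(len(PRIORS), size=n, p=PRIORS)
    # Draw each point from the Gaussian of its component.
    points = np.array([rng.multivariate_normal(CENTRES[k], COVARS[k]) for k in comp])
    # The first component is class 1, the other two are class 0.
    labels = (comp == 0).astype(float)
    return np.column_stack([points, labels])

data = sample_classification()
print(data.shape)  # (200, 3)
```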