naoya_t@hatenablog

What you might call scribbles on the back of a flyer

Where to get the data sets used in PRML (sources that currently work)


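I wanted to try implementing PRML chapter 12, but the distribution page for the Oil Flow data is gone, so I had given up. After some digging, though, I managed to pull it from webarchive. They even archived the tar.gz file... Amazing. Does anyone want it? Would redistributing it be a problem?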

2010-05-24 @shuyo Nakatani Shuyo

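The distribution page for the Oil Flow data has been a dead link for a long time... The other data sets can all be obtained through Prof. Bishop's links, but I am listing them here anyway by way of introduction.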

Handwritten Digits (MNIST data set)

The MNIST digits data are available from Yann LeCun’s MNIST page, which also contains a detailed description of the data. There's also a Matlab function to read the data into Matlab under Windows.

http://yann.lecun.com/exdb/mnist/
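For reading the files without Matlab, a minimal Python sketch follows. It assumes the IDX layout described on the MNIST page (a big-endian header with a type code and the dimension sizes, followed by unsigned bytes). The function name `parse_idx` is an assumption of this sketch, not part of the original post.

```python
import struct

import numpy as np


def parse_idx(buf):
    """Parse an uncompressed IDX byte string into a NumPy array.

    Assumed layout: two zero bytes, a type code (0x08 = unsigned byte),
    the number of dimensions, then one big-endian 32-bit size per
    dimension, then the raw data.
    """
    zeros, dtype_code, ndim = struct.unpack(">HBB", buf[:4])
    if zeros != 0 or dtype_code != 0x08:
        raise ValueError("not an unsigned-byte IDX file")
    shape = struct.unpack(">" + "I" * ndim, buf[4:4 + 4 * ndim])
    data = np.frombuffer(buf, dtype=np.uint8, offset=4 + 4 * ndim)
    return data.reshape(shape)


# Tiny hand-made example: two 2x2 "images" in IDX form.
header = struct.pack(">HBBIII", 0, 0x08, 3, 2, 2, 2)
demo = parse_idx(header + bytes(range(8)))
print(demo.shape)
```

For the real files, decompress first, for example `parse_idx(gzip.open("train-images-idx3-ubyte.gz").read())`.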

Oil Flow

This data set can be retrieved in various formats from the GTM data web-page.

That "GTM data web-page" no longer offers the download.

→→ Found a copy converted to MATLAB format:
http://code.google.com/p/pmtkdata/source/browse/trunk/oilFlow3Class/oilFlow3Class.mat

So SciPy can read MATLAB files... loadmat()
After that, a script along these lines does the job.
mat2txt.py:

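import numpy
import scipy.io

# loadmat() returns a dict that maps MATLAB variable names to NumPy arrays.
x = scipy.io.loadmat("./oilFlow3Class.mat")
for k, v in x.items():
    # Skip metadata entries such as __header__, __version__ and __globals__.
    if k.startswith("__"):
        continue
    # Report each array as columns x rows, then write it out as plain text.
    print("%s.txt: %dx%d" % (k, v.shape[1], v.shape[0]))
    numpy.savetxt(k + ".txt", v, fmt="%15.7e")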

This produces the following nine files:

  • training data
    • DataTrn.txt: 12x1000
      // 1000 measurements
    • DataTrnFrctns.txt: 2x1000
      // the corresponding fractions of water and oil (in that order)
    • DataTrnLbls.txt: 3x1000
      // the corresponding configuration labels, given in a 1-of-3 coding scheme, where
      [1 0 0] == Homogeneous configuration
      [0 1 0] == Annular configuration
      [0 0 1] == Stratified configuration
  • validation data
    • DataVdn.txt: 12x1000
    • DataVdnFrctns.txt: 2x1000
    • DataVdnLbls.txt: 3x1000
  • test data
    • DataTst.txt: 12x1000
    • DataTstFrctns.txt: 2x1000
    • DataTstLbls.txt: 3x1000

// the three file sets all contain 1000 samples. The fractions and configurations are picked at random from corresponding uniform distributions.
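The 1-of-3 coding above can be turned into class indices with an argmax. A minimal sketch, assuming the label array has one row per sample and three columns, as written by `numpy.savetxt` in mat2txt.py. The names `CONFIGURATIONS` and `decode_labels` are made up for illustration.

```python
import numpy as np

# Order follows the 1-of-3 coding listed above.
CONFIGURATIONS = ["Homogeneous", "Annular", "Stratified"]


def decode_labels(one_hot):
    """Convert 1-of-3 coded rows into integer class indices (0, 1, 2)."""
    one_hot = np.atleast_2d(np.asarray(one_hot))
    if one_hot.shape[1] != len(CONFIGURATIONS):
        raise ValueError("expected 3 columns, one per configuration")
    return one_hot.argmax(axis=1)


# Hand-written rows standing in for np.loadtxt("DataTrnLbls.txt").
sample = np.array([[1, 0, 0], [0, 0, 1], [0, 1, 0]])
classes = decode_labels(sample)
print([CONFIGURATIONS[c] for c in classes])
```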

It looks like this:
[Figure: plot of the Oil Flow data]

Old Faithful (geyser data)

There are several Old Faithful data sets in existence. The one used in PRML, which seems to be the most widely adopted, is available here.

http://research.microsoft.com/en-us/um/people/cmbishop/PRML/webdatasets/faithful.txt
Column 1: duration of the most recent eruption
Column 2: waiting time until the next eruption
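Loading the two columns with NumPy could look like the sketch below. It assumes the file is plain whitespace-separated text with one eruption per row. The rows in `sample_text` are invented stand-ins, not values from the real file, and the name `load_faithful` is hypothetical.

```python
import io

import numpy as np


def load_faithful(source):
    """Return (duration, waiting) arrays from a two-column text source."""
    data = np.loadtxt(source, ndmin=2)
    if data.shape[1] != 2:
        raise ValueError("expected exactly two columns")
    return data[:, 0], data[:, 1]


# Invented rows for illustration; pass "faithful.txt" to read the real file.
sample_text = "2.0 50\n4.0 80\n3.0 65\n"
duration, waiting = load_faithful(io.StringIO(sample_text))
print(duration.mean(), waiting.mean())
```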

Synthetic Data

Curve Fitting

The curve fitting data contains 10 data, uniformly spaced on [0,1] in x-space and with

y = sin(2πx) + N(0,0.3),

i.e., with Gaussian noise of variance 0.09. The file has 10 rows of 2 columns ([x,y]). This is the actual data that was used to generate the plots in figure 1.4 (and others).

http://research.microsoft.com/en-us/um/people/cmbishop/PRML/webdatasets/curvefitting.txt
X, Y
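To regenerate data of the same form (not the same values, since the noise is random), the recipe quoted above can be sketched as follows. The seed and the function name `make_curve_data` are arbitrary choices of this sketch.

```python
import numpy as np


def make_curve_data(n=10, noise_std=0.3, seed=0):
    """Sample y = sin(2*pi*x) + Gaussian noise at n evenly spaced x in [0, 1]."""
    rng = np.random.default_rng(seed)
    x = np.linspace(0.0, 1.0, n)
    # Standard deviation 0.3 corresponds to the variance 0.09 quoted above.
    y = np.sin(2.0 * np.pi * x) + rng.normal(0.0, noise_std, size=n)
    return np.column_stack([x, y])


data = make_curve_data()
print(data.shape)
```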

[Figure: plot of the curve fitting data]

Classification

The classification data contains 200 data, sampled from a 3-component Gaussian mixture in 2D. This data was generated using the gmmsamp function from Netlab. The corresponding Gaussian mixture model had the parameters:


mix.priors = [0.5 0.25 0.25];
mix.centres = [0 -0.1; 1 1; 1 -1];
mix.covars(:,:,1) = [0.625 -0.2165; -0.2165 0.875];
mix.covars(:,:,2) = [0.2241 -0.1368; -0.1368 0.9759];
mix.covars(:,:,3) = [0.2375 0.1516; 0.1516 0.4125];


The first component represents class 1 (blue circles, o, in the left panel of Figure A.7), the other components class 0 (red crosses, ×). The file has 200 rows of 3 columns, the first two columns giving datum position, the last column containing the label (0/1).

http://research.microsoft.com/en-us/um/people/cmbishop/PRML/webdatasets/classification.txt
X, Y, class (0/1)
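A rough NumPy equivalent of drawing such a sample is sketched below, using the mixture parameters quoted above. It does not call Netlab's gmmsamp and will not reproduce the exact values in the file. The seed and the function name `sample_classification` are assumptions.

```python
import numpy as np

PRIORS = np.array([0.5, 0.25, 0.25])
CENTRES = np.array([[0.0, -0.1], [1.0, 1.0], [1.0, -1.0]])
COVARS = np.array([
    [[0.625, -0.2165], [-0.2165, 0.875]],
    [[0.2241, -0.1368], [-0.1368, 0.9759]],
    [[0.2375, 0.1516], [0.1516, 0.4125]],
])


def sample_classification(n=200, seed=0):
    """Draw n rows of [x, y, label] from the 3-component mixture."""
    rng = np.random.default_rng(seed)
    # Pick a mixture component for every point according to the priors.
    comp = rng.choice(len(PRIORS), size=n, p=PRIORS)
    points = np.empty((n, 2))
    for k in range(len(PRIORS)):
        mask = comp == k
        points[mask] = rng.multivariate_normal(
            CENTRES[k], COVARS[k], size=int(mask.sum())
        )
    # The first component is class 1, the other two are class 0.
    labels = (comp == 0).astype(float)
    return np.column_stack([points, labels])


data = sample_classification()
print(data.shape, data[:, 2].mean())
```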