Just another Ruby porter, 2015-2-b

■ sort through a pipe

Sorting a text file of about 30 million lines, 2.8GB.

% ls -oh bigfile.txt      
-rw-rw-r-- 2 eban 2.8G Feb 12 17:26 bigfile.txt
% wc -l bigfile.txt
30750762 bigfile.txt
% time sort bigfile.txt > /dev/null
sort bigfile.txt > /dev/null  15.35s user 18.82s system 191% cpu 17.886 total

Next, cat it into a pipe.

% time cat bigfile.txt | sort > /dev/null
cat bigfile.txt  0.00s user 2.16s system 20% cpu 10.766 total
sort > /dev/null  13.82s user 13.24s system 53% cpu 50.711 total

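For some reason it gets slower. The cause is that sort can no longer tell the size of the original file.
At this size it is a merge sort from the start, and the temporary files are created in /tmp.
If the size is known, the input can be split into suitably sized chunks,
but through a pipe it is not, so sort ends up creating lots of smallish temporary files.
That inefficiency accounts for the difference in speed.
Watching /tmp during the runs, the temporary files were several GB each with the file argument and several MB each with the pipe.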
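
To put the ways of feeding sort side by side, here is a minimal sketch on a much smaller stand-in file (sample.txt and its size are made up for illustration; it is not the bigfile.txt above). As far as I know, GNU sort can still stat its standard input when it is a redirected regular file, so the `<` form should behave like the file-argument form. Treat that as an assumption and check it with time on your own data.

```shell
# Build a small sample file; the name and the 20000-line size are arbitrary.
tmpdir=$(mktemp -d)
seq 1 20000 | awk '{ print ($1 * 7919) % 20011 }' > "$tmpdir/sample.txt"

# 1. File argument: sort can look up the size of the file.
LC_ALL=C sort "$tmpdir/sample.txt" > "$tmpdir/by_arg.txt"

# 2. Redirect: stdin is still a regular file, so its size is still visible
#    (assumption about GNU sort, see the note above).
LC_ALL=C sort < "$tmpdir/sample.txt" > "$tmpdir/by_redirect.txt"

# 3. Pipe: stdin is a pipe, so sort has to guess the input size.
cat "$tmpdir/sample.txt" | LC_ALL=C sort > "$tmpdir/by_pipe.txt"

# All three produce the same result; only the speed can differ on big inputs.
cmp "$tmpdir/by_arg.txt" "$tmpdir/by_pipe.txt" && echo "same output"
```

On a file this small no difference in speed should be expected; the point is only that the result is identical while the information available to sort is not.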

2015-02-13 (Fri)

■ A fix for sort through a pipe

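Looking through the options, the one that stands out is -S, --buffer-size=SIZE.
The description says "use SIZE for main memory buffer",
so I thought it had nothing to do with file size, but it seems to do the job.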

$ time cat bigfile.txt | sort -S 1G > /dev/null

real    0m16.754s
user    0m14.206s
sys     0m12.613s

% time cat bigfile.txt | sort -S 10G > /dev/null

real    0m12.594s
user    0m14.043s
sys     0m15.625s
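
When the data can only arrive through a pipe but its rough size is known, that size can be passed along as the -S value. The following is a minimal sketch under that assumption; sort_with_hint and small.txt are hypothetical names, and the byte count is only a hint, so sort may still spill to temporary files if the data does not fit in the buffer.

```shell
# sort_with_hint BYTES: sort stdin, telling sort how large the input is
# expected to be. BYTES is supplied by the caller and is only a hint.
sort_with_hint() {
    kib=$(( $1 / 1024 + 1 ))   # round up to whole KiB for the K suffix
    sort -S "${kib}K"
}

# Example: the size is taken from a file, but the data arrives via a pipe.
tmpdir=$(mktemp -d)
printf '%s\n' c a b > "$tmpdir/small.txt"
bytes=$(wc -c < "$tmpdir/small.txt")
sorted=$(cat "$tmpdir/small.txt" | sort_with_hint "$bytes")
printf '%s\n' "$sorted"
```

A fixed value such as -S 1G, as in the runs above, is simpler when memory is plentiful; computing the value only matters when the input size varies a lot.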

2015-02-14 (Sat)

■ Finding out when a command was started with ps

Normally u is enough to find out, but things change once more than a day has passed:

% ps u 1804      
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
eban      1804  0.0  0.0  54088  2744 pts/23   Ss   Feb07   0:00 zsh

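As shown here, only the date is displayed and the time of day is dropped.
etime gives the elapsed time since startup, but doing the subtraction is a hassle.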

% ps o etime 1804
    ELAPSED
 7-01:50:26

That apparently means 7 days and 1:50:26. After some googling, it turns out lstart does the trick.

% ps o lstart 1804
                 STARTED
Sat Feb  7 23:49:48 2015

man ps only offers this rough description, though.

lstart      STARTED   time the command started.  See also bsdstart, start, start_time, and stime.
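
If the start time is needed in a sortable form, the lstart string can be handed to date. This is a minimal sketch that assumes GNU date (for -d) and the procps ps shown above; lstart_to_iso and start_time_of are hypothetical helper names.

```shell
# Reformat an lstart string (as printed by ps) into "YYYY-MM-DD HH:MM:SS".
# Relies on GNU date's -d parsing; BSD date would need a different approach.
lstart_to_iso() {
    LC_ALL=C date -d "$1" '+%Y-%m-%d %H:%M:%S'
}

# Print the start time of a PID. "lstart=" suppresses the STARTED header.
start_time_of() {
    lstart_to_iso "$(ps -o lstart= -p "$1")"
}

# Start time of the current shell.
start_time_of "$$"
```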

2015-02-15 (Sun)

■ Using join in place of comm

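With -t '', join treats the whole line as a single field, so it can be used in place of comm.
The trouble with comm is that lines found only in the first file are shown with -23: you name the columns to suppress rather than the one you want, which is backwards and hard to follow.
With join it is -v1, so the 1 is right there in the option and easier to read.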

% seq 10 | paste -d' ' - -                            
1 2
3 4
5 6
7 8
9 10
% seq 9 | paste -d' ' - - 
1 2
3 4
5 6
7 8
9 
% comm -23 <(seq 10 | paste -d' ' - -) <(seq 9 | paste -d' ' - -)
9 10
% join -v1 <(seq 10 | paste -d' ' - -) <(seq 9 | paste -d' ' - -)
% join -v1 -t '' <(seq 10 | paste -d' ' - -) <(seq 9 | paste -d' ' - -)
9 10

Without -t '', join looks only at field 1, the same as -j1, so every line is treated as common.
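
The three comm columns map onto join as follows. This is a minimal sketch using temporary files instead of process substitution so that it also runs under plain sh; a.txt and b.txt are made-up names, the empty -t argument is the GNU join behaviour described above, and both inputs must already be sorted, as with comm.

```shell
tmpdir=$(mktemp -d)
printf '%s\n' '1 2' '3 4' '9 10' > "$tmpdir/a.txt"
printf '%s\n' '1 2' '3 4' '9 '   > "$tmpdir/b.txt"

# Lines only in the first file (comm -23).
only_a=$(LC_ALL=C join -t '' -v1 "$tmpdir/a.txt" "$tmpdir/b.txt")
# Lines only in the second file (comm -13).
only_b=$(LC_ALL=C join -t '' -v2 "$tmpdir/a.txt" "$tmpdir/b.txt")
# Lines common to both files (comm -12).
common=$(LC_ALL=C join -t '' "$tmpdir/a.txt" "$tmpdir/b.txt")

printf 'only in a: [%s]\n' "$only_a"
printf 'only in b: [%s]\n' "$only_b"
printf 'common:\n%s\n' "$common"
```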

2015-02-16 (Mon)

■ A needless grep turned out to be surprisingly slow

The TSV files start with a header line, and I had casually settled on grep -v to strip it:

% find . -type f -name '*.tsv' | xargs env LANG=C grep -vh '^F1' > all.tsv

While running something like this, a doubt crept in.
Only the first line needs to go, so isn't grep being used wastefully here?
So I tried it with grep and awk.

% time LANG=C grep -vh '^F1' bigfile.tsv > /dev/null
LANG=C grep -vh '^F1' bigfile.tsv > /dev/null  5.70s user 0.42s system 99% cpu 6.128 total
% time LANG=C awk 'FNR>1' bigfile.tsv > /dev/null
LANG=C awk 'FNR>1' bigfile.tsv > /dev/null  7.34s user 0.63s system 99% cpu 7.981 total

Huh, awk loses. Come to think of it, awk here is gawk. Let's try mawk.

% time mawk 'FNR>1' bigfile.tsv > /dev/null
mawk 'FNR>1' bigfile.tsv > /dev/null  3.82s user 0.56s system 99% cpu 4.379 total

Oh, mawk really is fast.
As for fgrep, it is somehow slow.

% time LANG=C fgrep -vh 'F1' bigfile.tsv > /dev/null 
LANG=C fgrep -vh 'F1' bigfile.tsv > /dev/null  7.64s user 0.38s system 99% cpu 8.023 total

Maybe because anchors like ^ are not available. Appending \t puts it on par with grep.

% time LANG=C fgrep -vh $'F1\t' bigfile.tsv > /dev/null
LANG=C fgrep -vh $'F1\t' bigfile.tsv > /dev/null  5.46s user 0.50s system 99% cpu 5.977 total
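
For the original multi-file case, the awk form drops in where grep was. This is a minimal sketch with two made-up files (x.tsv, y.tsv) in a temporary directory; it uses find -exec ... + rather than xargs, which is my choice and not what the original command did. Unlike grep -v '^F1', it removes exactly the first line of each file and keeps any data line that happens to start with F1.

```shell
tmpdir=$(mktemp -d)
printf 'F1\tF2\n1\ta\n2\tb\n' > "$tmpdir/x.tsv"
printf 'F1\tF2\n3\tc\n'       > "$tmpdir/y.tsv"

# FNR restarts at 1 for every input file, so FNR>1 skips each file's header.
# The trailing sort only makes the order independent of find's traversal.
merged=$(find "$tmpdir" -type f -name '*.tsv' -exec awk 'FNR>1' {} + | sort)
printf '%s\n' "$merged"
```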

2015-02-17 (Tue)

■ A small bc trick

Having to set scale when asking bc for a quick calculation is a chore.
In that case, use the -l option.

% echo 10/3 | bc
3
% echo 10/3 | bc -l
3.33333333333333333333

scale is set to 20.

MATH LIBRARY
    If  bc  is  invoked  with the -l option, a math library is preloaded and the default scale is set to 20.   The
    math functions will calculate their results to the scale set at the time of  their  call.   The  math  library
    defines the following functions:

    s (x)  The sine of x, x is in radians.

    c (x)  The cosine of x, x is in radians.

    a (x)  The arctangent of x, arctangent returns radians.

    l (x)  The natural logarithm of x.

    e (x)  The exponential function of raising e to the value x.

    j (n,x)
           The Bessel function of integer order n of x.
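
For comparison, a short sketch of the explicit-scale form next to the -l form, plus one of the math library functions listed above. The variable name third is arbitrary.

```shell
# Default: scale is 0, so division truncates to an integer.
echo '10/3' | bc

# Explicit scale, for when 20 digits is more than needed.
third=$(echo 'scale=3; 10/3' | bc)
echo "$third"

# With -l the math library is loaded, so a(x) (arctangent) is available.
# 4*a(1) is a common way to get an approximation of pi.
echo '4*a(1)' | bc -l
```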

2015-02-18 (Wed)

■ gzexe

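gzip ships with a command called gzexe.
It compresses an executable and prepends a script that decompresses and runs it.
gzexe itself is just a bash script, so you can simply read it,
but the way it uses tail and sed caught my attention.
It skips the script portion at the top and passes the rest to gzip -dc,
and tail is used for that skip.
It goes out of its way to check whether -n accepts a +number argument.
Writing tail -n +44 means "print from line 44 onward",
so why not just use sed 1,43d? With sed there would be no need for the check.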
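
The two forms can be compared on plain text with a quick sketch. skip=44 simply mirrors the number in the text; the real value depends on the header that gzexe generates. This only checks text input and says nothing about how either tool treats the binary payload that gzexe actually skips into.

```shell
# Line number at which the payload would start (taken from the text above).
skip=44

# "From line 44 onward" with tail, and "delete lines 1 to 43" with sed.
by_tail=$(seq 50 | tail -n +"$skip")
by_sed=$(seq 50 | sed "1,$((skip - 1))d")

printf '%s\n' "$by_tail"
[ "$by_tail" = "$by_sed" ] && echo "tail and sed agree"
```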


2015-02-11 (Wed)

■ Hancho

2015-02-12 (Thu)