- 2008-02-15 (Fri) 22:07
- Programming ( sed/wake/awk )
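The input is a list of strings made up only of lowercase letters, separated by spaces. The output is the sorted list. I wrote this for anarchy golf's Word frequency count (FIXED), but... just sorting the words that appear in the problem's input takes a little under 6 seconds even on my own machine.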
# input: word1 word2 ... wordn
# should be lower-case letters. no spaces except for the separator
s/ /\n/g
s/$/ /gm
# each line is a sorted list
:a
/\n/!q
s/^\(.*\)\n\(.*\)$/@\1: %\2/gm
# each line: @word11 word12 ... word1n : %word21 word22 ... word2m
# ':' is a separator of two lists
# '@' is the head of non-merged part of the first list
# '%' is the head of non-merged part of the second list
:b
# each line: sortedpart @word1i ... word1n : %word2j ... word2m
s/^[^%@:\n]*$/~&/gm # already done
s/^\(.*\)@\(.*\): % *$/~\1\2/gm # done (2nd list is consumed completely)
s/^\(.*\)@: \(.*\)%\(.*\)/~\1\2\3/gm # done (1st list is consumed completely)
/@/!be
h
s/^~.*$/~/gm # ignore merged lists
s/^.*@\(\w*\).*%\(\w*\).*$/\1 \2/gm # prepare for comparison
#------------------- parallel comparison -----
# input: word11 word12\nword21 word22\n...\nwordn1 wordn2
# should be lower-case letters. no spaces except for the separator
# the entry '~' is ignored
# algorithm:
#   make pairs of corresponding characters in two words
#   compare all pairs at the same time
#   take the first non-equal result of the comparisons
s/^\| /&#/gm
s/^#~ *$/~/gm
:5
s/#\(\w\)\(.*\)#\(\w\)/\1\3 #\2#/mg
s/#\w.*#$/>/mg
s/# #..*/</mg
s/# #$/=/mg
t5
:1
:2
/a/b9
y/bcdefghijklmnopqrstuvwxyz/abcdefghijklmnopqrstuvwxy/
b2
:9
s/a[^a] /</g
s/[^a]a />/g
s/aa /=/g
/[^\n=<>~]/b1
s/^=*\(.\)/\1/gm
s/^\(.\).*/\1/gm
#----------- parallel comparison done ------
s/\n//g
s/$/#/
G
# [<>=]*#\nmergedpart @rest_of_list1 %rest_of_list2\n...
# according to the results of the comparisons held in the head,
# move a head of either the rest of list1 or list2 to the tail of merged_part.
:m
s/^[<=]\(.*\)\n\(.*\)@\(\w*\) \(.*\)%\(\w*\)/\1-\2\3 @\4%\5/m
s/^>\(.*\)\n\(.*\)@\(\w*\)\(.*\)%\(\w*\) /\1-\2\5 @\3\4%/m
s/^~\([^\n]*\)\n~/\1-/
tm
s/-/\n/g
s/#\n//
bb
:e
s/~//g
ba
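For readers who find the sed hard to follow, here is a rough Python model of what the script seems to do. It is a sketch under assumptions, not a translation: the function names (`compare_letters`, `compare_words`, `merge`, `merge_sort_bottom_up`) are made up for illustration, and the model handles one pair of words at a time, whereas the sed script compares every pair in the pattern space at once. It assumes non-empty words of lowercase ASCII letters.

```python
def compare_letters(c1, c2):
    """Compare two lowercase letters by shifting both down until one is 'a'."""
    # Models the y/bcd...z/abc...y/ loop: whichever letter reaches 'a'
    # first is the smaller one; if both reach it together they are equal.
    while c1 != "a" and c2 != "a":
        c1 = chr(ord(c1) - 1)
        c2 = chr(ord(c2) - 1)
    if c1 == c2:
        return "="
    return "<" if c1 == "a" else ">"


def compare_words(w1, w2):
    """Return '<', '>' or '=' for two lowercase words."""
    # One verdict per pair of corresponding characters ...
    verdicts = [compare_letters(a, b) for a, b in zip(w1, w2)]
    # ... plus a final verdict from the leftover length (longer word is greater).
    if len(w1) > len(w2):
        verdicts.append(">")
    elif len(w1) < len(w2):
        verdicts.append("<")
    else:
        verdicts.append("=")
    # The first non-equal verdict decides; all '=' means the words are equal.
    return next((v for v in verdicts if v != "="), "=")


def merge(left, right):
    """Merge two sorted lists, taking from `left` on '<' or '='."""
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if compare_words(left[i], right[j]) != ">":
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


def merge_sort_bottom_up(words):
    """Start from one-word runs and merge neighbouring runs until one is left."""
    runs = [[w] for w in words]
    while len(runs) > 1:
        # An unpaired last run is carried over to the next pass unchanged.
        runs = [
            merge(runs[k], runs[k + 1]) if k + 1 < len(runs) else runs[k]
            for k in range(0, len(runs), 2)
        ]
    return runs[0] if runs else []


if __name__ == "__main__":
    print(" ".join(merge_sort_bottom_up("pear fig apple fig kiwi".split())))
```

The model makes the cost visible: every comparison step in the sed version rewrites the whole pattern space, so the work per step grows with the total length of all the lists being merged.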
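It looks like I would have to bucket-sort by first letter and then apply merge sort to each bucket, or something along those lines. In any case, unless the string being manipulated gets shorter, the speed...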
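The bucket idea can be sketched in Python as follows. This only illustrates the shape of the approach and is not something the post actually implemented or measured. The names `merge_sort` and `bucket_then_merge_sort` are hypothetical, and the per-bucket sort is an ordinary top-down merge sort standing in for the sed one. It assumes non-empty lowercase words.

```python
from collections import defaultdict


def merge_sort(words):
    """Plain top-down merge sort, standing in for the per-bucket sort."""
    if len(words) <= 1:
        return list(words)
    mid = len(words) // 2
    left, right = merge_sort(words[:mid]), merge_sort(words[mid:])
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    return merged + left[i:] + right[j:]


def bucket_then_merge_sort(words):
    """Group words by first letter, sort each group, then concatenate."""
    buckets = defaultdict(list)
    for word in words:
        buckets[word[0]].append(word)  # assumes every word is non-empty
    result = []
    # Visiting buckets in letter order keeps the concatenation sorted.
    for letter in sorted(buckets):
        result.extend(merge_sort(buckets[letter]))
    return result


if __name__ == "__main__":
    sample = "pear fig apple fig kiwi plum avocado".split()
    print(" ".join(bucket_then_merge_sort(sample)))
```

Because each bucket is sorted on its own, the string being worked on at any moment is at most one bucket long, which is the reduction in working-string length the note above is after.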