WordNet

2017-03-28

最近在coursera上学习Princeton大学的Algorithm PartII，这个系列的两门课是我见过最好的算法课。主讲Robert Sedgewick师承高德纳，声名远播，虽然年纪大了，但是讲课思路清晰，深入浅出。与Kevin Wayne共同编著的《算法》第四版作为教材，没有《算法导论》那么偏理论、晦涩难懂，能够把常见的数据结构和算法讲得很透彻，容易理解。
更值得称道的是这门课编程作业及其评分系统。上课的内容是基础的数据结构和算法，但编程作业往往是这些算法的实际应用，能真正理解为什么在给定的情况下要使用这样的数据结构、算法。作业的评分不仅仅是看正确与否，还有运行时间、内存的要求，会给出详细的分析报告，如果不是ACM大神，这些作业都值得花时间去仔细琢磨。

下面进入正题，第一周的作业WordNet

##题目
WordNet is a semantic lexicon for theEnglish language that is used extensively by computational linguistsand cognitive scientists; for example, it was a key component in IBM’sWatson.WordNet groups words into sets of synonyms called synsets and describes semantic relationships between them.One such relationship is the is-a relationship, which connects a hyponym(more specific synset) to a hypernym (more general synset).For example, locomotion is a hypernym of runningand running is a hypernym of dash.

The WordNet digraph.Your first task is to build the wordnet digraph: each vertex v is an integer that represents a synset, and each directed edge v→w represents that w is a hypernym of v.The wordnet digraph is a rooted DAG: it is acylic and has one vertex thatis an ancestor of every other vertex.However, it is not necessarily a tree because a synset can have more than onehypernym. A small subgraph of the wordnet digraph is illustrated below.

The WordNet input file formats.We now describe the two data files that you will use to create the wordnet digraph.The files are in CSV format: each line contains a sequence of fields,separated by commas.

List of noun synsets.The file synsets.txtlists all the (noun) synsets in WordNet.The first field is the synset id (an integer),the second field is the synonym set (or synset), and thethird field is its dictionary definition (or gloss).For example, the line
36,AND_circuit AND_gate,a circuit in a computer that fires only when all of its inputs fire
means that the synset { AND_circuit, AND_gate }has an id number of 36 and it’s gloss isa circuit in a computer that fires only when all of its inputs fire.The individual nouns that comprise a synset are separatedby spaces (and a synset element is not permitted to contain a space).The S synset ids are numbered 0 through S − 1;the id numbers will appear consecutively in the synset file.
List of hypernyms.The file hypernyms.txtcontains the hypernym relationships:The first field is a synset id; subsequent fields are the id numbersof the synset’s hypernyms. For example, the following line
164,21012,56099
means that the the synset 164 (“Actifed”) has two hypernyms:21012 (“antihistamine”) and56099 (“nasal_decongestant”),representing that Actifed is both an antihistamine and a nasal decongestant.The synsets are obtained from the corresponding lines in the file synsets.txt.

164,Actifed,trade name for a drug containing an antihistamine and a decongestant…
21012,antihistamine,a medicine used to treat allergies…
56099,nasal_decongestant,a decongestant that provides temporary relief of nasal…
WordNet data type.Implement an immutable data type WordNet with the following API:

// constructor takes the name of the two input files
public WordNet(String synsets, String hypernyms)

// the set of nouns (no duplicates), returned as an Iterable
public Iterable nouns()

// is the word a WordNet noun?
public boolean isNoun(String word)

// distance between nounA and nounB (defined below)
public int distance(String nounA, String nounB)

// a synset (second field of synsets.txt) that is the common ancestor of nounA and nounB
// in a shortest ancestral path (defined below)
public String sap(String nounA, String nounB)

// for unit testing of this class
public static void main(String[] args)
The constructor should throw a java.lang.IllegalArgumentExceptionif the input does not correspond to a rooted DAG.The distance() and sap() methodsshould throw a java.lang.IllegalArgumentExceptionunless both of the noun arguments are WordNet nouns.
Your data type should use space linear in the input size(size of synsets and hypernyms files).The constructor should take time linearithmic (or better) in the input size.The method isNoun() should run in time logarithmic (or better) inthe number of nouns.The methods distance() and sap() should run in time linear in thesize of the WordNet digraph.

Shortest ancestral path.An ancestral path between two verticesv and w in a digraph is a directed path fromv to a common ancestor x, together witha directed path from w to the same ancestor x. A shortest ancestral path is an ancestral path of minimum total length.For example, in the digraph at left(digraph1.txt),the shortest ancestral path between3 and 11 has length 4 (with common ancestor 1).In the digraph at right (digraph2.txt),one ancestral path between 1 and 5 has length 4(with common ancestor 5), but the shortest ancestral path has length 2(with common ancestor 0).

SAP data type.Implement an immutable data type SAP with the following API:

// constructor takes a digraph (not necessarily a DAG)
public SAP(Digraph G)

// length of shortest ancestral path between v and w; -1 if no such path
public int length(int v, int w)

// a common ancestor of v and w that participates in a shortest ancestral path; -1 if no such path
public int ancestor(int v, int w)

// length of shortest ancestral path between any vertex in v and any vertex in w; -1 if no such path
public int length(Iterable v, Iterable w)

// a common ancestor that participates in shortest ancestral path; -1 if no such path
public int ancestor(Iterable v, Iterable w)

// for unit testing of this class (such as the one below)
public static void main(String[] args)
All methods should throw a java.lang.IndexOutOfBoundsException if one (or more) of the input arguments is not between 0 and G.V() - 1.You may assume that the iterable arguments contain at least one integer.All methods (and the constructor) should take time at mostproportional to E + Vin the worst case, where E and V are the number of edges and verticesin the digraph, respectively.Your data type should use space proportional to E + V.
Test client.The following test client takes the name of a digraph input file asas a command-line argument, constructs the digraph,reads in vertex pairs from standard input,and prints out the length of the shortest ancestral path between the two verticesand a common ancestor that participates in that path:

public static void main(String[] args) {
In in = new In(args[0]);
Digraph G = new Digraph(in);
SAP sap = new SAP(G);
while (!StdIn.isEmpty()) {
int v = StdIn.readInt();
int w = StdIn.readInt();
int length = sap.length(v, w);
int ancestor = sap.ancestor(v, w);
StdOut.printf(“length = %d, ancestor = %d\n”, length, ancestor);
}
}
Here is a sample execution:
% more digraph1.txt % java SAP digraph1.txt
13 3 11
11 length = 4, ancestor = 1
7 3
8 3 9 12
3 1 length = 3, ancestor = 5
4 1
5 1 7 2
9 5 length = 4, ancestor = 0
10 5
11 10 1 6
12 10 length = -1, ancestor = -1
1 0
2 0
Measuring the semantic relatedness of two nouns.Semantic relatedness refers to the degree to which two concepts are related. Measuring semantic relatedness is a challenging problem. For example, most of us agree that George Bush and John Kennedy (two U.S. presidents)are more related than are George Bushand chimpanzee (two primates). However, not most of us agree that George Bush and Eric Arthur Blair are related concepts. But if one is aware that George Bush and Eric Arthur Blair (aka George Orwell) are both communicators, then it becomes clear that the two concepts might be related.

We define the semantic relatednessof two wordnet nouns A and B as follows:

distance(A, B) = distance is the minimum length of any ancestral path betweenany synset v of A and any synset w of B.
This is the notion of distance that you will use to implement thedistance() and sap() methods in the WordNet data type.

Outcast detection.Given a list of wordnet nouns A1, A2,…, An, which nounis the least related to the others? To identify an outcast,compute the sum of the distances between each noun and every other one:

di = dist(Ai, A1) + dist(Ai, A2) + … + dist(Ai, An)
and return a noun Atfor which dt is maximum.
Implement an immutable data type Outcast with the following API:

// constructor takes a WordNet object
public Outcast(WordNet wordnet)

// given an array of WordNet nouns, return an outcast
public String outcast(String[] nouns)

// for unit testing of this class (such as the one below)
public static void main(String[] args)
Assume that argument array to the outcast() methodcontains only valid wordnet nouns (and that it contains at least two such nouns).
The following test client takes from the command line the name of a synset file, the name of a hypernym file, followed by thenames of outcast files, and prints out an outcast in each file:

public static void main(String[] args) {
WordNet wordnet = new WordNet(args[0], args[1]);
Outcast outcast = new Outcast(wordnet);
for (int t = 2; t < args.length; t++) {
In in = new In(args[t]);
String[] nouns = in.readAllStrings();
StdOut.println(args[t] + “: “ + outcast.outcast(nouns));
}
}
Here is a sample execution:
% more outcast5.txt
horse zebra cat bear table

% more outcast8.txt
water soda bed orange_juice milk apple_juice tea coffee

% more outcast11.txt
apple pear peach banana lime lemon blueberry strawberry mango watermelon potato

% java Outcast synsets.txt hypernyms.txt outcast5.txt outcast8.txt outcast11.txt
outcast5.txt: table
outcast8.txt: bed
outcast11.txt: potato

##题目分析
wordnet是一个有上下位关系的英语词典，可以描述词语间的关系。下位词是上位词的一种具体描述，是一种isa关系。例如running是dash的上位词。从数据结构上来看wordnet是一个有向无环图。每个节点是一个同义词集合（synset），一个英语单词或复合词可能有多个意思，因此可能会出现在多个不同节点中。每条边都是由下位词指向上位词，有且仅有从下位词到上位词的路径，因此不可能存在环。
题目中wordnet是以synsets.txt和hypernyms.txt的形式给出。synsets.txt中按字母顺序给出了所有同义词集合的序号、集合、描述。hypernyms.txt按同义词集合的序号给出了上下位的关系，每行第一个序号是下位词，之后的序号都是上位词。一个下位词可能存在多个上位词。
题目的要求是实现三个类：

WordNet.java
SAP.java
Outcast.java

SAP.java是一个基础类，给定一个有向图，计算两点之间的最短距离，以及两点的共同祖先。首先应该完成这个类。
WordNet.java中实现计算wordnet结构中两个词的最短距离及共同祖先。
Outcast.java就比较简单了，找出一组词中最不相关的一个，调用WordNet.java很容易实现。

##SAP.java
下面具体分析每个类的实现
首先实现SAP.java，在这个类中实现计算最短距离和共同祖先的算法，然后可以应用到wordnet中。

###分析
求最短路径和共同祖先的思想就是从两个不同节点出发，找到都能到达且距离最短的节点。要计算距离，使用广度优先搜索，并且保存和更新每个节点的到出发点的距离。

成员变量：

private Digraph digraph;
private int ancestor;
private int length;

构造方法：

public SAP(Digraph G) {
       if (G == null) throw new NullPointerException();
       this.digraph = G;
   }

由于最短距离是两个节点到某公共节点的距离组合，可能出现多种情况，因此自定义一个数据结构Node来表示公共节点，把所有可能的公共节点放入一个优先队列中，最后从中取最小值。

private class Node implements Comparable<Node> {
        private int length;
        private int id;
        Node(int length, int id) {
            this.length = length;
            this.id = id;
        }
        @Override
        public int compareTo(Node o) {
            if (length > o.length) {
                return 1;
            } else if (length == o.length) {
                return 0;
            } else {
                return -1;
            }
        }
    }
```   
求最短距离的完整代码：
1. 先对一个节点进行广度优先搜素，计算到所有节点的距离。
2. 对另一个节点进行广度优先搜素，计算到所有节点的距离，如果某节点可以从两个节点分别到达，则存在优先队列中。
3. 从优先队列中取出最小值，即为最短距离。

// length of shortest ancestral path between v and w; -1 if no such path
public int length(int v, int w) {
if (v < 0 || v >= digraph.V() || w < 0 || w >= digraph.V())
throw new IndexOutOfBoundsException();
MinPQ possibleLength = new MinPQ();
boolean[] marked = new boolean[this.digraph.V()];
marked[v] = true;
int[] pathV = new int[digraph.V()];
int[] pathW = new int[digraph.V()];
for (int i = 0; i < digraph.V(); i++) {
pathV[i] = -1;
pathW[i] = -1;
}
pathV[v] = 0;
pathW[w] = 0;

    // bfs on v
    Queue<Integer> queue = new Queue<>();
    queue.enqueue(v);
    while (!queue.isEmpty()) {
        int vertex = queue.dequeue();
        for (int nextVertex : digraph.adj(vertex)) {
            if (!marked[nextVertex]) {
                marked[nextVertex] = true;
                queue.enqueue(nextVertex);
                if (pathV[nextVertex] == -1 || pathV[nextVertex] > pathV[vertex] + 1) {
                    pathV[nextVertex] = pathV[vertex] + 1;
                }

            }
        }
    }

    // bfs on w
    marked = new boolean[this.digraph.V()];
    marked[w] = true;
    queue.enqueue(w);
    while (!queue.isEmpty()) {
        int vertex = queue.dequeue();
        if (pathV[vertex] > -1) {
            Node node = new Node(pathV[vertex] + pathW[vertex], vertex);
            possibleLength.insert(node);
        }
        for (int nextVertex : digraph.adj(vertex)) {
            if (!marked[nextVertex]) {
                marked[nextVertex] = true;
                queue.enqueue(nextVertex);
                if (pathW[nextVertex] == -1 || pathW[nextVertex] > pathW[vertex] + 1) {
                    pathW[nextVertex] = pathW[vertex] + 1;
                }

            }
        }
    }
    if (possibleLength.size() > 0) {
        Node node = possibleLength.delMin();
        this.length = node.length;
        this.ancestor = node.id;
    } else {
        this.length = -1;
        this.ancestor = -1;
    }
    return this.length;
}

1 2	共同祖先在计算最短距离时也能获得，Node中的id就是共同祖先，因此可以复用。

// a common ancestor of v and w that participates in a shortest ancestral path; -1 if no such path
public int ancestor(int v, int w) {
length(v, w);
return this.ancestor;
}

1
2
3


输入的v,w为多个节点时，直接用暴力for循环找出每两个节点间的最短距离和共同祖先，存到优先队列中，最后取出的就是两组节点间的最短纪录和共同祖先。
代码：

// length of shortest ancestral path between any vertex in v and any vertex in w; -1 if no such path
public int length(Iterable<Integer> v, Iterable<Integer> w) {
    if (v == null || w == null) throw new NullPointerException();
    MinPQ<Node> possibleNodes = new MinPQ();
    for (int nodeV : v) {
        if (nodeV < 0 || nodeV > this.digraph.V())
            throw new IndexOutOfBoundsException();
        for (int nodeW : w) {
            if (nodeW < 0 || nodeW > this.digraph.V())
                throw new IndexOutOfBoundsException();

            Node node = new Node(length(nodeV, nodeW), ancestor(nodeV, nodeW));
            possibleNodes.insert(node);
        }
    }

    if (possibleNodes.size() > 0) {
        Node node = possibleNodes.delMin();
        this.length = node.length;
        this.ancestor = node.id;
    } else {
        this.length = -1;
        this.ancestor = -1;
    }
    return this.length;
}

// a common ancestor that participates in shortest ancestral path; -1 if no such path
public int ancestor(Iterable<Integer> v, Iterable<Integer> w) {
    length(v, w);
    return this.ancestor;
}


## WordNet.java
要把sap中的算法实际高效地应用到wordnet中并不简单，需要找到适合的数据结构来储存我们需要的信息并且能够高效地查询。
题目要求
1. 返回wordnet中所有的词
2. 判断一个词是否在wordnet中
3. 两个词的最短距离
4. 两个词最短路径上的共同上位词的同义词集合
要做到1和2两个要求，需要一个没有重复词的集合，而且能够根据词来快速搜索。提到快速搜索，立即想到时间复杂度log(n)的二叉搜索树，正好符合题目的时间要求。因此，使用PartI中讲到过的SET<key>(在Alg4.jar中，用平衡二叉树实现log(n)的插入、搜索)。
SET需要一个key来实现搜索。要求搜索的是词，但不能直接以string作为key，因为还要考虑到最短路径的实现。因此，用一个自定义结构：实现Comparable接口时比较的是词的字母顺序，另外由于词的多义性用一个ArrayList存放出现在wordnet中的可能位置。

private class Node implements Comparable {
private ArrayList ids;
private String noun;
Node(String noun) {
this.noun = noun;
this.ids = new ArrayList<>();
}

    private void addId(int id) {
        this.ids.add(id);
    }

    private ArrayList<Integer> getIds() {
        return this.ids;
    }

    @Override
    public int compareTo(Node o) {
        return this.noun.compareTo(o.noun);
    }
}

1
2


成员变量：

private SAP wordnetSap;
private Digraph digraph;
private boolean hasCycle;
private boolean[] onStack;
private boolean[] marked;
private ArrayList possibleRoots; // a DAG has only one root
private ArrayList synsets; // synsets for each wordnet node
private SET allNouns; // SET to store Node


构造方法：读取synsets.txt和hypernyms.txt,构建表示wordnet的图。Node表示词以及词可能出现的位置，建立SET。与sap中不同，共同祖先需要根据序号返回同义词组，与SET的思路不一致，因此还需要一个ArrayList来表示每个序号对应的同义词组。
根据题目要求，还需要检查构建的wordnet图是否合法，即检查是否无环、是否只有一个root。这个检查合法性同样用广度优先搜索可以完成：
* 是否无环：搜索时如果之前访问过某节点，则构成环。
* 是否只有一个root：无论从哪个节点开始搜索，最后一个节点（没有指向其他节点的边）是唯一的。

public WordNet(String synsets, String hypernyms) {
if (synsets == null || hypernyms == null)
throw new NullPointerException();
possibleRoots = new ArrayList<>();
this.synsets = new ArrayList<>();
allNouns = new SET();
int count = 0;
try {
BufferedReader in = new BufferedReader(new FileReader(synsets));
String line;
while ((line = in.readLine()) != null) {
String[] parts = line.split(“,”);
String aSynset = parts[1];
String[] strings = aSynset.split(“ “);
for (String str : strings) {
Node node = new Node(str);
if (this.allNouns.contains(node)) {
this.allNouns.ceiling(node).addId(Integer.parseInt(parts[0]));
}else {
node.addId(Integer.parseInt(parts[0]));
this.allNouns.add(node);
}
}
this.synsets.add(strings);

            count++;
        }
        in.close();

        this.digraph = new Digraph(count);

        BufferedReader hypernymsReader = new BufferedReader(new FileReader(hypernyms));
        while ((line = hypernymsReader.readLine()) != null){
            String[] temp = line.split(",");
            int id = Integer.parseInt(temp[0]);
            for (int i = 1; i < temp.length; i++) {
                this.digraph.addEdge(id, Integer.parseInt(temp[i]));
            }
        }
        hypernymsReader.close();
    } catch (IOException e) {
        e.printStackTrace();
    }

    this.wordnetSap = new SAP(this.digraph);

    onStack = new boolean[digraph.V()];
    marked = new boolean[digraph.V()];
    if (hasCycle(this.digraph)) throw new IllegalArgumentException();
    if (this.possibleRoots.size() > 1) throw new IllegalArgumentException();
}

private boolean hasCycle(Digraph digraph) {
    for (int i = 0; i < digraph.V(); i++) {
        if (!this.marked[i]) dfs(digraph, i);
    }
    return hasCycle;
}

private void dfs(Digraph digraph, int v) {

    onStack[v] = true;
    marked[v] = true;

    if (!digraph.adj(v).iterator().hasNext()) {
        if (!this.possibleRoots.contains(v))
            this.possibleRoots.add(v);
    }
    for (int w : digraph.adj(v)) {
        if (this.hasCycle) return;
        else if (!marked[w]) {
            dfs(digraph, w);
        }
        else if (onStack[w]) this.hasCycle = true;
    }
    onStack[v] = false;
}


选对了合适的数据结构，完成了构造方法，其实就完成了绝大部分工作，剩下的实现就相对容易了。
返回wordnet中所有的词：

public Iterable nouns() {
ArrayList nouns = new ArrayList<>();
for (Node node : this.allNouns) {
nouns.add(node.noun);
}
return nouns;
}

1 2	判断一个词是否在wordnet中：

public boolean isNoun(String word) {
if (word == null) throw new NullPointerException();
Node node = new Node(word);
return this.allNouns.contains(node);
}

1 2	两个词的最短距离：用SET和Node迅速找到词所对应的可能位置，用sap中的方法可以马上计算。

public int distance(String nounA, String nounB) {
if (nounA == null || nounB == null)
throw new NullPointerException();
if (!isNoun(nounA) || !isNoun(nounB))
throw new IllegalArgumentException();
ArrayList idAs;
ArrayList idBs;

    Node nodeA = new Node(nounA);
    Node nodeB = new Node(nounB);
    nodeA = this.allNouns.ceiling(nodeA);
    nodeB = this.allNouns.ceiling(nodeB);
    idAs = nodeA.getIds();
    idBs = nodeB.getIds();

    return wordnetSap.length(idAs, idBs);
}

1 2	两个词最短路径上的共同上位词的同义词集合：同样用sap中的方法找到位置，然后从事先存好的arraylist中获取同义词集合，拼接成给定的形式。

public String sap(String nounA, String nounB) {
if (nounA == null || nounB == null)
throw new NullPointerException();
if (!isNoun(nounA) || !isNoun(nounB))
throw new IllegalArgumentException();

    ArrayList<Integer> idAs;
    ArrayList<Integer> idBs;

    Node nodeA = new Node(nounA);
    Node nodeB = new Node(nounB);
    nodeA = this.allNouns.ceiling(nodeA);
    nodeB = this.allNouns.ceiling(nodeB);
    idAs = nodeA.getIds();
    idBs = nodeB.getIds();

    int id = wordnetSap.ancestor(idAs, idBs);
    String[] strings = this.synsets.get(id);
    String result = "";
    for (int i = 0; i< strings.length; i++) {
        result += strings[i];
        if (i != strings.length - 1)
            result += " ";
    }
    return result;
}


## Outcast.java
outcast就简单了，从给定的词组中，计算每个词与其他词的最短距离之和。和最大的词自然就是与别的词最不同的了。

public class Outcast {
private WordNet wordNet;

public Outcast(WordNet wordnet) {
    this.wordNet = wordnet;
}

public String outcast(String[] nouns) {
    int[] distances = new int[nouns.length];
    for (int i = 0; i < nouns.length; i++) {
        String noun = nouns[i];
        for (int j = 0; j < nouns.length; j++) {
            if (i != j) {
                distances[i] += wordNet.distance(noun, nouns[j]);
            }
        }
    }
    int max = 0;
    int id = 0;
    for (int i = 0; i < distances.length; i++) {
        if (distances[i] > max) {
            max = distances[i];
            id = i;
        }
    }

    return nouns[id];
}

}
```

总结

笔者水平有限，上述答案其实只有91分，运行时间不达标。抛砖引玉，希望大神们能够提出改进方法，大家一起学习进步，谢谢！

项目源码在GitHub中，请点这里