最近在coursera上学习Princeton大学的Algorithm PartII,这个系列的两门课是我见过最好的算法课。主讲Robert Sedgewick师承高德纳,声名远播,虽然年纪大了,但是讲课思路清晰,深入浅出。与Kevin Wayne共同编著的《算法》第四版作为教材,没有《算法导论》那么偏理论、晦涩难懂,能够把常见的数据结构和算法讲得很透彻,容易理解。
更值得称道的是这门课编程作业及其评分系统。上课的内容是基础的数据结构和算法,但编程作业往往是这些算法的实际应用,能真正理解为什么在给定的情况下要使用这样的数据结构、算法。作业的评分不仅仅是看正确与否,还有运行时间、内存的要求,会给出详细的分析报告,如果不是ACM大神,这些作业都值得花时间去仔细琢磨。
下面进入正题,第一周的作业WordNet
##题目
WordNet is a semantic lexicon for theEnglish language that is used extensively by computational linguistsand cognitive scientists; for example, it was a key component in IBM’sWatson.WordNet groups words into sets of synonyms called synsets and describes semantic relationships between them.One such relationship is the is-a relationship, which connects a hyponym(more specific synset) to a hypernym (more general synset).For example, locomotion is a hypernym of runningand running is a hypernym of dash.
The WordNet digraph.Your first task is to build the wordnet digraph: each vertex v is an integer that represents a synset, and each directed edge v→w represents that w is a hypernym of v.The wordnet digraph is a rooted DAG: it is acylic and has one vertex thatis an ancestor of every other vertex.However, it is not necessarily a tree because a synset can have more than onehypernym. A small subgraph of the wordnet digraph is illustrated below.
The WordNet input file formats.We now describe the two data files that you will use to create the wordnet digraph.The files are in CSV format: each line contains a sequence of fields,separated by commas.
List of noun synsets.The file synsets.txtlists all the (noun) synsets in WordNet.The first field is the synset id (an integer),the second field is the synonym set (or synset), and thethird field is its dictionary definition (or gloss).For example, the line
36,AND_circuit AND_gate,a circuit in a computer that fires only when all of its inputs fire
means that the synset { AND_circuit, AND_gate }has an id number of 36 and it’s gloss isa circuit in a computer that fires only when all of its inputs fire.The individual nouns that comprise a synset are separatedby spaces (and a synset element is not permitted to contain a space).The S synset ids are numbered 0 through S − 1;the id numbers will appear consecutively in the synset file.
List of hypernyms.The file hypernyms.txtcontains the hypernym relationships:The first field is a synset id; subsequent fields are the id numbersof the synset’s hypernyms. For example, the following line
164,21012,56099
means that the the synset 164 (“Actifed”) has two hypernyms:21012 (“antihistamine”) and56099 (“nasal_decongestant”),representing that Actifed is both an antihistamine and a nasal decongestant.The synsets are obtained from the corresponding lines in the file synsets.txt.
164,Actifed,trade name for a drug containing an antihistamine and a decongestant…
21012,antihistamine,a medicine used to treat allergies…
56099,nasal_decongestant,a decongestant that provides temporary relief of nasal…
WordNet data type.Implement an immutable data type WordNet with the following API:
// constructor takes the name of the two input files
public WordNet(String synsets, String hypernyms)
// the set of nouns (no duplicates), returned as an Iterable
public Iterable
// is the word a WordNet noun?
public boolean isNoun(String word)
// distance between nounA and nounB (defined below)
public int distance(String nounA, String nounB)
// a synset (second field of synsets.txt) that is the common ancestor of nounA and nounB
// in a shortest ancestral path (defined below)
public String sap(String nounA, String nounB)
// for unit testing of this class
public static void main(String[] args)
The constructor should throw a java.lang.IllegalArgumentExceptionif the input does not correspond to a rooted DAG.The distance() and sap() methodsshould throw a java.lang.IllegalArgumentExceptionunless both of the noun arguments are WordNet nouns.
Your data type should use space linear in the input size(size of synsets and hypernyms files).The constructor should take time linearithmic (or better) in the input size.The method isNoun() should run in time logarithmic (or better) inthe number of nouns.The methods distance() and sap() should run in time linear in thesize of the WordNet digraph.
Shortest ancestral path.An ancestral path between two verticesv and w in a digraph is a directed path fromv to a common ancestor x, together witha directed path from w to the same ancestor x. A shortest ancestral path is an ancestral path of minimum total length.For example, in the digraph at left(digraph1.txt),the shortest ancestral path between3 and 11 has length 4 (with common ancestor 1).In the digraph at right (digraph2.txt),one ancestral path between 1 and 5 has length 4(with common ancestor 5), but the shortest ancestral path has length 2(with common ancestor 0).
SAP data type.Implement an immutable data type SAP with the following API:
// constructor takes a digraph (not necessarily a DAG)
public SAP(Digraph G)
// length of shortest ancestral path between v and w; -1 if no such path
public int length(int v, int w)
// a common ancestor of v and w that participates in a shortest ancestral path; -1 if no such path
public int ancestor(int v, int w)
// length of shortest ancestral path between any vertex in v and any vertex in w; -1 if no such path
public int length(Iterable
// a common ancestor that participates in shortest ancestral path; -1 if no such path
public int ancestor(Iterable
// for unit testing of this class (such as the one below)
public static void main(String[] args)
All methods should throw a java.lang.IndexOutOfBoundsException if one (or more) of the input arguments is not between 0 and G.V() - 1.You may assume that the iterable arguments contain at least one integer.All methods (and the constructor) should take time at mostproportional to E + Vin the worst case, where E and V are the number of edges and verticesin the digraph, respectively.Your data type should use space proportional to E + V.
Test client.The following test client takes the name of a digraph input file asas a command-line argument, constructs the digraph,reads in vertex pairs from standard input,and prints out the length of the shortest ancestral path between the two verticesand a common ancestor that participates in that path:
public static void main(String[] args) {
In in = new In(args[0]);
Digraph G = new Digraph(in);
SAP sap = new SAP(G);
while (!StdIn.isEmpty()) {
int v = StdIn.readInt();
int w = StdIn.readInt();
int length = sap.length(v, w);
int ancestor = sap.ancestor(v, w);
StdOut.printf(“length = %d, ancestor = %d\n”, length, ancestor);
}
}
Here is a sample execution:
% more digraph1.txt % java SAP digraph1.txt
13 3 11
11 length = 4, ancestor = 1
7 3
8 3 9 12
3 1 length = 3, ancestor = 5
4 1
5 1 7 2
9 5 length = 4, ancestor = 0
10 5
11 10 1 6
12 10 length = -1, ancestor = -1
1 0
2 0
Measuring the semantic relatedness of two nouns.Semantic relatedness refers to the degree to which two concepts are related. Measuring semantic relatedness is a challenging problem. For example, most of us agree that George Bush and John Kennedy (two U.S. presidents)are more related than are George Bushand chimpanzee (two primates). However, not most of us agree that George Bush and Eric Arthur Blair are related concepts. But if one is aware that George Bush and Eric Arthur Blair (aka George Orwell) are both communicators, then it becomes clear that the two concepts might be related.
We define the semantic relatednessof two wordnet nouns A and B as follows:
distance(A, B) = distance is the minimum length of any ancestral path betweenany synset v of A and any synset w of B.
This is the notion of distance that you will use to implement thedistance() and sap() methods in the WordNet data type.
Outcast detection.Given a list of wordnet nouns A1, A2,…, An, which nounis the least related to the others? To identify an outcast,compute the sum of the distances between each noun and every other one:
di = dist(Ai, A1) + dist(Ai, A2) + … + dist(Ai, An)
and return a noun Atfor which dt is maximum.
Implement an immutable data type Outcast with the following API:
// constructor takes a WordNet object
public Outcast(WordNet wordnet)
// given an array of WordNet nouns, return an outcast
public String outcast(String[] nouns)
// for unit testing of this class (such as the one below)
public static void main(String[] args)
Assume that argument array to the outcast() methodcontains only valid wordnet nouns (and that it contains at least two such nouns).
The following test client takes from the command line the name of a synset file, the name of a hypernym file, followed by thenames of outcast files, and prints out an outcast in each file:
public static void main(String[] args) {
WordNet wordnet = new WordNet(args[0], args[1]);
Outcast outcast = new Outcast(wordnet);
for (int t = 2; t < args.length; t++) {
In in = new In(args[t]);
String[] nouns = in.readAllStrings();
StdOut.println(args[t] + “: “ + outcast.outcast(nouns));
}
}
Here is a sample execution:
% more outcast5.txt
horse zebra cat bear table
% more outcast8.txt
water soda bed orange_juice milk apple_juice tea coffee
% more outcast11.txt
apple pear peach banana lime lemon blueberry strawberry mango watermelon potato
% java Outcast synsets.txt hypernyms.txt outcast5.txt outcast8.txt outcast11.txt
outcast5.txt: table
outcast8.txt: bed
outcast11.txt: potato
##题目分析
wordnet是一个有上下位关系的英语词典,可以描述词语间的关系。下位词是上位词的一种具体描述,是一种isa关系。例如running是dash的上位词。从数据结构上来看wordnet是一个有向无环图。每个节点是一个同义词集合(synset),一个英语单词或复合词可能有多个意思,因此可能会出现在多个不同节点中。每条边都是由下位词指向上位词,有且仅有从下位词到上位词的路径,因此不可能存在环。
题目中wordnet是以synsets.txt和hypernyms.txt的形式给出。synsets.txt中按字母顺序给出了所有同义词集合的序号、集合、描述。hypernyms.txt按同义词集合的序号给出了上下位的关系,每行第一个序号是下位词,之后的序号都是上位词。一个下位词可能存在多个上位词。
题目的要求是实现三个类:
- WordNet.java
- SAP.java
- Outcast.java
SAP.java是一个基础类,给定一个有向图,计算两点之间的最短距离,以及两点的共同祖先。首先应该完成这个类。
WordNet.java中实现计算wordnet结构中两个词的最短距离及共同祖先。
Outcast.java就比较简单了,找出一组词中最不相关的一个,调用WordNet.java很容易实现。
##SAP.java
下面具体分析每个类的实现
首先实现SAP.java,在这个类中实现计算最短距离和共同祖先的算法,然后可以应用到wordnet中。
###分析
求最短路径和共同祖先的思想就是从两个不同节点出发,找到都能到达且距离最短的节点。要计算距离,使用广度优先搜索,并且保存和更新每个节点的到出发点的距离。
成员变量:
|
|
构造方法:
|
|
由于最短距离是两个节点到某公共节点的距离组合,可能出现多种情况,因此自定义一个数据结构Node来表示公共节点,把所有可能的公共节点放入一个优先队列中,最后从中取最小值。
|
|
// length of shortest ancestral path between v and w; -1 if no such path
public int length(int v, int w) {
if (v < 0 || v >= digraph.V() || w < 0 || w >= digraph.V())
throw new IndexOutOfBoundsException();
MinPQ
boolean[] marked = new boolean[this.digraph.V()];
marked[v] = true;
int[] pathV = new int[digraph.V()];
int[] pathW = new int[digraph.V()];
for (int i = 0; i < digraph.V(); i++) {
pathV[i] = -1;
pathW[i] = -1;
}
pathV[v] = 0;
pathW[w] = 0;
// bfs on v
Queue<Integer> queue = new Queue<>();
queue.enqueue(v);
while (!queue.isEmpty()) {
int vertex = queue.dequeue();
for (int nextVertex : digraph.adj(vertex)) {
if (!marked[nextVertex]) {
marked[nextVertex] = true;
queue.enqueue(nextVertex);
if (pathV[nextVertex] == -1 || pathV[nextVertex] > pathV[vertex] + 1) {
pathV[nextVertex] = pathV[vertex] + 1;
}
}
}
}
// bfs on w
marked = new boolean[this.digraph.V()];
marked[w] = true;
queue.enqueue(w);
while (!queue.isEmpty()) {
int vertex = queue.dequeue();
if (pathV[vertex] > -1) {
Node node = new Node(pathV[vertex] + pathW[vertex], vertex);
possibleLength.insert(node);
}
for (int nextVertex : digraph.adj(vertex)) {
if (!marked[nextVertex]) {
marked[nextVertex] = true;
queue.enqueue(nextVertex);
if (pathW[nextVertex] == -1 || pathW[nextVertex] > pathW[vertex] + 1) {
pathW[nextVertex] = pathW[vertex] + 1;
}
}
}
}
if (possibleLength.size() > 0) {
Node node = possibleLength.delMin();
this.length = node.length;
this.ancestor = node.id;
} else {
this.length = -1;
this.ancestor = -1;
}
return this.length;
}
|
|
// a common ancestor of v and w that participates in a shortest ancestral path; -1 if no such path
public int ancestor(int v, int w) {
length(v, w);
return this.ancestor;
}
// length of shortest ancestral path between any vertex in v and any vertex in w; -1 if no such path
public int length(Iterable<Integer> v, Iterable<Integer> w) {
if (v == null || w == null) throw new NullPointerException();
MinPQ<Node> possibleNodes = new MinPQ();
for (int nodeV : v) {
if (nodeV < 0 || nodeV > this.digraph.V())
throw new IndexOutOfBoundsException();
for (int nodeW : w) {
if (nodeW < 0 || nodeW > this.digraph.V())
throw new IndexOutOfBoundsException();
Node node = new Node(length(nodeV, nodeW), ancestor(nodeV, nodeW));
possibleNodes.insert(node);
}
}
if (possibleNodes.size() > 0) {
Node node = possibleNodes.delMin();
this.length = node.length;
this.ancestor = node.id;
} else {
this.length = -1;
this.ancestor = -1;
}
return this.length;
}
// a common ancestor that participates in shortest ancestral path; -1 if no such path
public int ancestor(Iterable<Integer> v, Iterable<Integer> w) {
length(v, w);
return this.ancestor;
}
|
|
private class Node implements Comparable
private ArrayList
private String noun;
Node(String noun) {
this.noun = noun;
this.ids = new ArrayList<>();
}
private void addId(int id) {
this.ids.add(id);
}
private ArrayList<Integer> getIds() {
return this.ids;
}
@Override
public int compareTo(Node o) {
return this.noun.compareTo(o.noun);
}
}
|
|
private SAP wordnetSap;
private Digraph digraph;
private boolean hasCycle;
private boolean[] onStack;
private boolean[] marked;
private ArrayList
private ArrayList
private SET
public WordNet(String synsets, String hypernyms) {
if (synsets == null || hypernyms == null)
throw new NullPointerException();
possibleRoots = new ArrayList<>();
this.synsets = new ArrayList<>();
allNouns = new SET
int count = 0;
try {
BufferedReader in = new BufferedReader(new FileReader(synsets));
String line;
while ((line = in.readLine()) != null) {
String[] parts = line.split(“,”);
String aSynset = parts[1];
String[] strings = aSynset.split(“ “);
for (String str : strings) {
Node node = new Node(str);
if (this.allNouns.contains(node)) {
this.allNouns.ceiling(node).addId(Integer.parseInt(parts[0]));
}else {
node.addId(Integer.parseInt(parts[0]));
this.allNouns.add(node);
}
}
this.synsets.add(strings);
count++;
}
in.close();
this.digraph = new Digraph(count);
BufferedReader hypernymsReader = new BufferedReader(new FileReader(hypernyms));
while ((line = hypernymsReader.readLine()) != null){
String[] temp = line.split(",");
int id = Integer.parseInt(temp[0]);
for (int i = 1; i < temp.length; i++) {
this.digraph.addEdge(id, Integer.parseInt(temp[i]));
}
}
hypernymsReader.close();
} catch (IOException e) {
e.printStackTrace();
}
this.wordnetSap = new SAP(this.digraph);
onStack = new boolean[digraph.V()];
marked = new boolean[digraph.V()];
if (hasCycle(this.digraph)) throw new IllegalArgumentException();
if (this.possibleRoots.size() > 1) throw new IllegalArgumentException();
}
private boolean hasCycle(Digraph digraph) {
for (int i = 0; i < digraph.V(); i++) {
if (!this.marked[i]) dfs(digraph, i);
}
return hasCycle;
}
private void dfs(Digraph digraph, int v) {
onStack[v] = true;
marked[v] = true;
if (!digraph.adj(v).iterator().hasNext()) {
if (!this.possibleRoots.contains(v))
this.possibleRoots.add(v);
}
for (int w : digraph.adj(v)) {
if (this.hasCycle) return;
else if (!marked[w]) {
dfs(digraph, w);
}
else if (onStack[w]) this.hasCycle = true;
}
onStack[v] = false;
}
|
|
public Iterable
ArrayList
for (Node node : this.allNouns) {
nouns.add(node.noun);
}
return nouns;
}
public boolean isNoun(String word) {
if (word == null) throw new NullPointerException();
Node node = new Node(word);
return this.allNouns.contains(node);
}
public int distance(String nounA, String nounB) {
if (nounA == null || nounB == null)
throw new NullPointerException();
if (!isNoun(nounA) || !isNoun(nounB))
throw new IllegalArgumentException();
ArrayList
ArrayList
Node nodeA = new Node(nounA);
Node nodeB = new Node(nounB);
nodeA = this.allNouns.ceiling(nodeA);
nodeB = this.allNouns.ceiling(nodeB);
idAs = nodeA.getIds();
idBs = nodeB.getIds();
return wordnetSap.length(idAs, idBs);
}
|
|
public String sap(String nounA, String nounB) {
if (nounA == null || nounB == null)
throw new NullPointerException();
if (!isNoun(nounA) || !isNoun(nounB))
throw new IllegalArgumentException();
ArrayList<Integer> idAs;
ArrayList<Integer> idBs;
Node nodeA = new Node(nounA);
Node nodeB = new Node(nounB);
nodeA = this.allNouns.ceiling(nodeA);
nodeB = this.allNouns.ceiling(nodeB);
idAs = nodeA.getIds();
idBs = nodeB.getIds();
int id = wordnetSap.ancestor(idAs, idBs);
String[] strings = this.synsets.get(id);
String result = "";
for (int i = 0; i< strings.length; i++) {
result += strings[i];
if (i != strings.length - 1)
result += " ";
}
return result;
}
|
|
public class Outcast {
private WordNet wordNet;
public Outcast(WordNet wordnet) {
this.wordNet = wordnet;
}
public String outcast(String[] nouns) {
int[] distances = new int[nouns.length];
for (int i = 0; i < nouns.length; i++) {
String noun = nouns[i];
for (int j = 0; j < nouns.length; j++) {
if (i != j) {
distances[i] += wordNet.distance(noun, nouns[j]);
}
}
}
int max = 0;
int id = 0;
for (int i = 0; i < distances.length; i++) {
if (distances[i] > max) {
max = distances[i];
id = i;
}
}
return nouns[id];
}
}
```
总结
笔者水平有限,上述答案其实只有91分,运行时间不达标。抛砖引玉,希望大神们能够提出改进方法,大家一起学习进步,谢谢!
项目源码在GitHub中,请点这里