Implement two solutions for the problem given below using map-reduce: one on Apache Hadoop and one on Apache Spark. Test the solutions with at least 1,000 documents.
₹600-1500 INR
Paid on delivery
You are given a large collection of (English) text documents (as files).
For each document, compute the top 20 keywords by relevance score. For a keyword w and a
document d, the relevance score is T(w,d)/D(w,d), where:
• T(w,d) = count(w,d)^0.5, where ^ denotes exponentiation;
• count(w,d) is the number of occurrences of w in d;
• D(w,d) is the fraction of documents in the collection in which w occurs (i.e. x/N if w occurs in
x documents out of N, the total number of documents in the collection).
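As a sanity check for the Hadoop and Spark outputs, the scoring rule above can be sketched in plain single-machine Python. This is only a reference sketch, not the required map-reduce solution. The names `tokenize` and `top_keywords` are hypothetical, and both the tokenization (lowercase alphabetic runs) and the alphabetical tie-breaking are assumptions, since the task does not specify either.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Assumption: a keyword is a lowercase run of ASCII letters.
    return re.findall(r"[a-z]+", text.lower())


def top_keywords(docs, k=20):
    """docs maps document name -> text; returns name -> list of (word, score)."""
    n_docs = len(docs)

    # count(w, d): occurrences of each word in each document.
    counts = {name: Counter(tokenize(text)) for name, text in docs.items()}

    # x: number of documents in which each word occurs.
    doc_freq = Counter()
    for word_counts in counts.values():
        doc_freq.update(word_counts.keys())

    result = {}
    for name, word_counts in counts.items():
        scored = [
            # T(w, d) / D(w, d) = count(w, d)^0.5 / (x / N)
            (word, math.sqrt(cnt) / (doc_freq[word] / n_docs))
            for word, cnt in word_counts.items()
        ]
        # Highest score first; ties broken alphabetically (assumption).
        scored.sort(key=lambda pair: (-pair[1], pair[0]))
        result[name] = scored[:k]
    return result
```

For example, with two documents "spark spark spark spark hadoop" and "hadoop map reduce", the word "spark" occurs 4 times in one of the 2 documents, so its score there is 4^0.5 / (1/2) = 4, while "hadoop" occurs in both documents and scores 1^0.5 / (2/2) = 1.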
Also compute the intersection of the top-20 keyword sets of all documents, i.e. the keywords that appear in every document's top 20.
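The intersection step could be checked with a small Python sketch like the one below. It assumes the per-document top-20 lists are already available as a dictionary from document name to keywords; the name `common_keywords` is hypothetical.

```python
def common_keywords(top_by_doc):
    """top_by_doc maps document name -> iterable of its top keywords.

    Returns the set of keywords present in every document's list.
    """
    keyword_sets = [set(words) for words in top_by_doc.values()]
    if not keyword_sets:
        # No documents: define the intersection as empty (assumption).
        return set()
    return set.intersection(*keyword_sets)
```

Because the relevance score divides by the fraction of documents containing a word, words shared by many documents tend to score low, so on a large collection this intersection may well be empty.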
Project ID: #33941230