Implement two solutions for the problem given below using map-reduce: one on Apache Hadoop and one on Apache Spark. Test the solutions with at least 1000 documents.

You are given a large collection of (English) text documents (as files).

For each document, compute the top 20 keywords by relevance score. For a keyword w and a document d, the relevance score is T(w,d) / D(w,d), where:

• T(w,d) = count(w,d)^0.5, where ^ denotes exponentiation and count(w,d) is the number of occurrences of w in d, and

• D(w,d) is the fraction of documents in the collection in which w occurs (i.e. x/N if w occurs in x documents out of N, the total number of documents in the collection).
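As a sketch of the scoring formula in plain Python (the function and parameter names are my own, chosen for illustration):

```python
import math

def relevance(count_in_doc: int, docs_containing: int, total_docs: int) -> float:
    """Relevance of word w in document d: T(w,d) / D(w,d).

    count_in_doc    -- count(w,d), occurrences of w in d
    docs_containing -- x, the number of documents containing w
    total_docs      -- N, the total number of documents in the collection
    """
    t = math.sqrt(count_in_doc)       # T(w,d) = count(w,d)^0.5
    d = docs_containing / total_docs  # D(w,d) = x / N
    return t / d

# A word occurring 4 times in d and present in half of a 1000-document
# collection: sqrt(4) / (500/1000) = 2 / 0.5
print(relevance(4, 500, 1000))  # → 4.0
```

Note that the score rewards words frequent in d but rare across the collection, much like TF-IDF.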

Also compute the intersection of all the per-document top-20 keyword sets.
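The whole pipeline can be prototyped locally before porting it to Hadoop or Spark. The sketch below is a plain-Python stand-in (all names and the tokenizer regex are my own assumptions, not part of the brief): a "map" phase counts words per document, a "reduce" phase aggregates document frequencies, then each document is scored and the top-k sets are intersected.

```python
import math
import re
from collections import Counter

def top_keywords(docs: dict, k: int = 20):
    """Per-document top-k keywords by T(w,d)/D(w,d), plus their intersection.

    docs maps a document name to its text; returns (top, common) where
    top[name] is the ranked keyword list and common is the intersection.
    """
    # "Map" phase: tokenize each document and count word occurrences.
    counts = {name: Counter(re.findall(r"[a-z']+", text.lower()))
              for name, text in docs.items()}
    n = len(docs)
    # "Reduce" phase: document frequency x of each word across the collection.
    df = Counter()
    for c in counts.values():
        df.update(c.keys())
    # Score each word per document and keep the top k.
    top = {}
    for name, c in counts.items():
        scored = {w: math.sqrt(cnt) / (df[w] / n) for w, cnt in c.items()}
        top[name] = sorted(scored, key=scored.get, reverse=True)[:k]
    common = set.intersection(*(set(ws) for ws in top.values()))
    return top, common

docs = {"a": "spark hadoop spark", "b": "hadoop flink"}
top, common = top_keywords(docs, k=2)
print(common)  # → {'hadoop'}
```

One design point worth noting for the real implementations: D(w,d) needs the global document frequencies, so on Hadoop this is naturally two chained MapReduce jobs (one to compute df, one to score), while on Spark the df table can be collected and broadcast to the scoring stage.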

Skills: Apache Hadoop, Apache Spark

Project ID: #33941230
