Cloudera CCA Spark and Hadoop Developer (CCA175) Free Practice Test
Question 1
CORRECT TEXT
Problem Scenario 42 : You have been given a file (spark10/sales.txt), with the content as given below.
spark10/sales.txt
Department,Designation,costToCompany,State
Sales,Trainee,12000,UP
Sales,Lead,32000,AP
Sales,Lead,32000,LA
Sales,Lead,32000,TN
Sales,Lead,32000,AP
Sales,Lead,32000,TN
Sales,Lead,32000,LA
Sales,Lead,32000,LA
Marketing,Associate,18000,TN
Marketing,Associate,18000,TN
HR,Manager,58000,TN
You want to produce the output as a CSV grouped by Department, Designation and State, with additional columns for sum(costToCompany) and TotalEmployeeCount.
The result should look like
Dept,Desg,state,empCount,totalCost
Sales,Lead,AP,2,64000
Sales,Lead,LA,3,96000
Sales,Lead,TN,2,64000
Correct Answer:
See the explanation for Step by Step Solution and configuration.
Explanation:
Solution :
Step 1 : First create the file in HDFS using Hue.
Step 2 : Load the file as an RDD
val rawlines = sc.textFile("spark10/sales.txt")
Step 3 : Create a case class that represents the columns.
case class Employee(dep: String, des: String, cost: Double, state: String)
Step 4 : Split the data and create an RDD of Employee objects. The filter skips the header line, if the file contains one, because "costToCompany".toDouble would fail.
val employees = rawlines.filter(!_.startsWith("Department,")).map(_.split(",")).map(row => Employee(row(0), row(1), row(2).toDouble, row(3)))
Step 5 : Build the key-value pairs. The group-by fields form the key; the value holds a count of 1 for each employee together with the employee's cost.
val keyVals = employees.map(em => ((em.dep, em.des, em.state), (1, em.cost)))
Step 6 : Aggregate the records with reduceByKey, which sums both the number of employees and their total cost per key.
val results = keyVals.reduceByKey { (a, b) => (a._1 + b._1, a._2 + b._2) } // (a.count + b.count, a.cost + b.cost)
Step 7 : Save the results in a text file as below.
results.repartition(1).saveAsTextFile("spark10/group.txt")
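Note that saveAsTextFile writes each element with its toString, so the saved lines are tuples such as ((Sales,Lead,AP),(2,64000.0)) rather than CSV rows. The following is a minimal sketch of the aggregation and CSV formatting in plain Scala collections, without Spark. The in-memory sample `lines` and the names `csv`, `count` and `total` are hypothetical and chosen for illustration only.

```scala
// Hypothetical in-memory sample in the Department,Designation,costToCompany,State layout.
val lines = List(
  "Sales,Lead,32000,AP",
  "Sales,Lead,32000,LA",
  "Sales,Lead,32000,AP",
  "HR,Manager,58000,TN")

// Key each record by (dept, desg, state) with a (count, cost) value, as in Step 5.
val keyVals = lines.map(_.split(",")).map(r => ((r(0), r(1), r(3)), (1, r(2).toLong)))

// Sum counts and costs per key, then render each group as one CSV line.
val csv: List[String] = keyVals
  .groupBy(_._1)
  .map { case ((dep, des, state), rows) =>
    val count = rows.map(_._2._1).sum
    val total = rows.map(_._2._2).sum
    s"$dep,$des,$state,$count,$total"
  }
  .toList
  .sorted
```

In spark-shell, a similar string-building map could presumably be applied to the aggregated RDD before saving, so that the output file holds CSV rows.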
Question 2
CORRECT TEXT
Problem Scenario 63 : You have been given below code snippet.
val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "panther", "eagle"), 2)
val b = a.map(x => (x.length, x))
operation1
Write a correct code snippet for operation1 which will produce desired output, shown below.
Array[(Int, String)] = Array((4,lion), (3,dogcat), (7,panther), (5,tigereagle))
Correct Answer:
See the explanation for Step by Step Solution and configuration.
Explanation:
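Solution :
b.reduceByKey(_ + _).collect
reduceByKey [Pair] : This function provides the well-known reduce functionality in Spark.
Please note that any function f you provide should be commutative and associative in order to generate reproducible results. String concatenation is not commutative, so the order inside a value such as dogcat depends on how the elements are partitioned and merged.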
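To make the per-key reduction concrete, here is a small sketch that mimics reduceByKey on an ordinary Scala list, without Spark. The helper `reduceByKeyLocal` is a hypothetical name and not part of the Spark API; it only illustrates the idea of folding all values that share a key.

```scala
// Plain-Scala stand-in for reduceByKey: group the values by key, then reduce each group with f.
def reduceByKeyLocal[K, V](pairs: Seq[(K, V)])(f: (V, V) => V): Map[K, V] =
  pairs.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).reduce(f) }

val words = List("dog", "tiger", "lion", "cat", "panther", "eagle")
val byLength = words.map(w => (w.length, w))

// On a local list the element order is fixed, so the concatenation order is deterministic.
val merged: Map[Int, String] = reduceByKeyLocal(byLength)(_ + _)
```

Unlike this local version, a distributed reduce merges partial results from several partitions, which is why the order of a non-commutative operation is not guaranteed there.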
Question 3
CORRECT TEXT
Problem Scenario 46 : You have been given below list in scala (name,sex,cost) for each work done.
List(("Deeapak", "male", 4000), ("Deepak", "male", 2000), ("Deepika", "female", 2000), ("Deepak", "female", 2000), ("Deepak", "male", 1000), ("Neeta", "female", 2000))
Now write a Spark program to load this list as an RDD and sum the cost for each combination of name and sex (as key).
Correct Answer:
See the explanation for Step by Step Solution and configuration.
Explanation:
Solution :
Step 1 : Create an RDD out of this list
val rdd = sc.parallelize(List(("Deeapak", "male", 4000), ("Deepak", "male", 2000), ("Deepika", "female", 2000), ("Deepak", "female", 2000), ("Deepak", "male", 1000), ("Neeta", "female", 2000)))
Step 2 : Convert this RDD into a pair RDD
val byKey = rdd.map({case (name,sex,cost) => (name,sex)->cost})
Step 3 : Now group by key
val byKeyGrouped = byKey.groupByKey
Step 4 : Now sum the cost for each group
val result = byKeyGrouped.map{case ((id1,id2),values) => (id1,id2,values.sum)}
Step 5 : Save the results
result.repartition(1).saveAsTextFile("spark12/result.txt")
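The solution above collects every cost per key before summing. As an illustration of the alternative, the sketch below compares that group-then-sum approach with a running-total approach on a plain Scala list; the latter is the idea behind Spark's reduceByKey, which would presumably let the pair RDD be summed with `byKey.reduceByKey(_ + _)`. The names `work`, `grouped` and `reduced` are hypothetical.

```scala
val work = List(
  ("Deeapak", "male", 4000), ("Deepak", "male", 2000), ("Deepika", "female", 2000),
  ("Deepak", "female", 2000), ("Deepak", "male", 1000), ("Neeta", "female", 2000))

// groupByKey style: collect all rows per (name, sex) first, then sum each group.
val grouped: Map[(String, String), Int] =
  work.groupBy { case (name, sex, _) => (name, sex) }
    .map { case (key, rows) => key -> rows.map(_._3).sum }

// reduceByKey style: keep one running total per key and never materialize the groups.
val reduced: Map[(String, String), Int] =
  work.foldLeft(Map.empty[(String, String), Int]) { case (acc, (name, sex, cost)) =>
    val key = (name, sex)
    acc.updated(key, acc.getOrElse(key, 0) + cost)
  }
```

Both approaches yield the same totals; the running-total form simply avoids holding all values of a key in memory at once.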
Question 4
CORRECT TEXT
Problem Scenario 44 : You have been given 4 files, with the content as given below:
spark11/file1.txt
Apache Hadoop is an open-source software framework written in Java for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware. All the modules in Hadoop are designed with a fundamental assumption that hardware failures are common and should be automatically handled by the framework
spark11/file2.txt
The core of Apache Hadoop consists of a storage part known as Hadoop Distributed File System (HDFS) and a processing part called MapReduce. Hadoop splits files into large blocks and distributes them across nodes in a cluster. To process data, Hadoop transfers packaged code for nodes to process in parallel based on the data that needs to be processed.
spark11/file3.txt
This approach takes advantage of data locality nodes manipulating the data they have access to to allow the dataset to be processed faster and more efficiently than it would be in a more conventional supercomputer architecture that relies on a parallel file system where computation and data are distributed via high-speed networking
spark11/file4.txt
Apache Storm is focused on stream processing or what some call complex event processing. Storm implements a fault tolerant method for performing a computation or pipelining multiple computations on an event as it flows into a system. One might use Storm to transform unstructured data as it flows into a system into a desired format
(spark11/file1.txt)
(spark11/file2.txt)
(spark11/file3.txt)
(spark11/file4.txt)
Write a Spark program that gives the highest occurring word in each file, together with the file name.
Correct Answer:
See the explanation for Step by Step Solution and configuration.
Explanation:
Solution :
Step 1 : First create all 4 files in HDFS using Hue.
Step 2 : Load each file as an RDD
val file1 = sc.textFile("spark11/file1.txt")
val file2 = sc.textFile("spark11/file2.txt")
val file3 = sc.textFile("spark11/file3.txt")
val file4 = sc.textFile("spark11/file4.txt")
Step 3 : Now do the word count for each file and sort in descending order of count.
val content1 = file1.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _).map(item => item.swap).sortByKey(false).map(e => e.swap)
val content2 = file2.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _).map(item => item.swap).sortByKey(false).map(e => e.swap)
val content3 = file3.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _).map(item => item.swap).sortByKey(false).map(e => e.swap)
val content4 = file4.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _).map(item => item.swap).sortByKey(false).map(e => e.swap)
Step 4 : Take the most frequent word of each file (the first element after sorting) and create a one-element RDD holding the file name, the word and its count.
val file1word = sc.makeRDD(Array(file1.name + "->" + content1.first()._1 + "-" + content1.first()._2))
val file2word = sc.makeRDD(Array(file2.name + "->" + content2.first()._1 + "-" + content2.first()._2))
val file3word = sc.makeRDD(Array(file3.name + "->" + content3.first()._1 + "-" + content3.first()._2))
val file4word = sc.makeRDD(Array(file4.name + "->" + content4.first()._1 + "-" + content4.first()._2))
Step 5 : Union all the RDDs
val unionRDDs = file1word.union(file2word).union(file3word).union(file4word)
Step 6 : Save the results in a text file as below.
unionRDDs.repartition(1).saveAsTextFile("spark11/union.txt")
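The count, sort and take-first logic of Steps 3 and 4 can be sketched on plain strings as follows. This is only an illustration without Spark: the helper `topWord`, the file names `a.txt` and `b.txt` and their contents are hypothetical, and the alphabetical tie-break is an arbitrary choice made so the result is deterministic.

```scala
// Count words split on single spaces and return the most frequent one with its count.
// Ties are broken alphabetically, an assumption made here for determinism.
def topWord(text: String): (String, Int) =
  text.split(" ").filter(_.nonEmpty)
    .groupBy(identity)
    .map { case (w, ws) => (w, ws.length) }
    .toList
    .sortBy { case (w, n) => (-n, w) }
    .head

// Hypothetical file contents standing in for the HDFS files.
val files = Map(
  "a.txt" -> "Hadoop splits files and Hadoop distributes blocks",
  "b.txt" -> "Storm is focused on stream processing and Storm pipelines computations")

// One "fileName->word-count" line per file, in the same shape as Step 4.
val report: List[String] = files.map { case (name, text) =>
  val (word, count) = topWord(text)
  s"$name->$word-$count"
}.toList.sorted
```

Splitting on a single space keeps punctuation attached to words, exactly as the Spark solution does, so "processed." and "processed" would count as different words.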
Question 5
CORRECT TEXT
Problem Scenario 81 : You have been given MySQL DB with following details. You have been given following product.csv file
product.csv
productID,productCode,name,quantity,price
1001,PEN,Pen Red,5000,1.23
1002,PEN,Pen Blue,8000,1.25
1003,PEN,Pen Black,2000,1.25
1004,PEC,Pencil 2B,10000,0.48
1005,PEC,Pencil 2H,8000,0.49
1006,PEC,Pencil HB,0,9999.99
Now accomplish the following activities.
1. Create a Hive ORC table using SparkSQL.
2. Load this data into the Hive table.
3. Create a Hive parquet table using SparkSQL and load data into it.
Correct Answer:
See the explanation for Step by Step Solution and configuration.
Explanation:
Solution :
Step 1 : Create this file in HDFS at the following path (without the header)
/user/cloudera/he/exam/task1/product.csv
Step 2 : Now using spark-shell read the file as an RDD
// load the data into a new RDD
val products = sc.textFile("/user/cloudera/he/exam/task1/product.csv")
// Return the first element in this RDD
products.first()
Step 3 : Now define the schema using a case class
case class Product(productid: Int, code: String, name: String, quantity: Int, price: Float)
Step 4 : Create an RDD of Product objects
val prdRDD = products.map(_.split(",")).map(p => Product(p(0).toInt, p(1), p(2), p(3).toInt, p(4).toFloat))
prdRDD.first()
prdRDD.count()
Step 5 : Now create a data frame
val prdDF = prdRDD.toDF()
Step 6 : Now store the data in the Hive warehouse directory. (However, the table will not be created.)
import org.apache.spark.sql.SaveMode
prdDF.write.mode(SaveMode.Overwrite).format("orc").saveAsTable("product_orc_table")
Step 7 : Now create a table over the data stored in the warehouse directory, with the help of Hive.
hive
show tables;
CREATE EXTERNAL TABLE products (productid int, code string, name string, quantity int, price float)
STORED AS orc
LOCATION '/user/hive/warehouse/product_orc_table';
Step 8 : Now create a parquet table
import org.apache.spark.sql.SaveMode
prdDF.write.mode(SaveMode.Overwrite).format("parquet").saveAsTable("product_parquet_table")
Step 9 : Now create a table over this data
CREATE EXTERNAL TABLE products_parquet (productid int, code string, name string, quantity int, price float)
STORED AS parquet
LOCATION '/user/hive/warehouse/product_parquet_table';
Step 10 : Check whether the data has been loaded.
Select * from products;
Select * from products_parquet;
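The parsing in Step 4 throws an exception if a line is malformed or if the header row is still present. A more defensive variant can be sketched in plain Scala as below; `parseProduct` and the sample `rows` are hypothetical names used only for this illustration, and the case class mirrors the one from Step 3.

```scala
import scala.util.Try

case class Product(productid: Int, code: String, name: String, quantity: Int, price: Float)

// Parse one CSV line; returns None when the line does not have five well-formed fields.
def parseProduct(line: String): Option[Product] =
  line.split(",").map(_.trim) match {
    case Array(id, code, name, qty, price) =>
      Try(Product(id.toInt, code, name, qty.toInt, price.toFloat)).toOption
    case _ => None
  }

// Sample lines from product.csv, including the header row that must be skipped.
val rows = List(
  "productID,productCode,name,quantity,price",
  "1001,PEN,Pen Red,5000,1.23",
  "1006,PEC,Pencil HB,0,9999.99")

// flatMap silently drops every line that failed to parse.
val parsed: List[Product] = rows.flatMap(line => parseProduct(line))
```

With this approach the header is rejected because "productID" is not a number, so the file would not need to be stripped of its header beforehand.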
Question 6
CORRECT TEXT
Problem Scenario 95 : You have to run your Spark application on YARN, with the maximum heap size of each executor set to 512MB and 1 processor core allocated to each executor. Your main application requires three values as input arguments: V1 V2 V3.
Please replace XXX, YYY, ZZZ
./bin/spark-submit --class com.hadoopexam.MyTask --master yarn-cluster --num-executors 3 \
--driver-memory 512m XXX YYY lib/hadoopexam.jar ZZZ
Correct Answer:
See the explanation for Step by Step Solution and configuration.
Explanation:
Solution :
XXX : --executor-memory 512m
YYY : --executor-cores 1
ZZZ : V1 V2 V3
Notes : spark-submit on YARN options
Option : Description
archives : Comma-separated list of archives to be extracted into the working directory of each executor. The path must be globally visible inside your cluster; see Advanced Dependency Management.
executor-cores : Number of processor cores to allocate on each executor. Alternatively, you can use the spark.executor.cores property.
executor-memory : Maximum heap size to allocate to each executor. Alternatively, you can use the spark.executor.memory property.
num-executors : Total number of YARN containers to allocate for this application. Alternatively, you can use the spark.executor.instances property.
queue : YARN queue to submit to. For more information, see Assigning Applications and Queries to Resource Pools. Default: default.
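The notes mention that each option has an equivalent Spark property. As a rough sketch, the same resource settings could be passed as `--conf key=value` arguments instead of the dedicated flags. The snippet below only builds the command line as a string; the names `executorSettings`, `confArgs` and `command` are hypothetical.

```scala
// Resource settings from the scenario, keyed by the Spark property names listed in the notes.
val executorSettings = List(
  "spark.executor.memory" -> "512m",
  "spark.executor.cores" -> "1",
  "spark.executor.instances" -> "3")

// Render each property as a pair of arguments: --conf key=value.
val confArgs: List[String] =
  executorSettings.flatMap { case (k, v) => List("--conf", s"$k=$v") }

// Assemble the full command; the application arguments V1 V2 V3 come after the jar.
val command: String =
  (List("./bin/spark-submit", "--class", "com.hadoopexam.MyTask", "--master", "yarn-cluster",
    "--driver-memory", "512m") ++ confArgs ++ List("lib/hadoopexam.jar", "V1", "V2", "V3"))
    .mkString(" ")
```

Whichever form is used, everything after the jar path is handed to the application's main method, which is why ZZZ must come last.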
Question 7
CORRECT TEXT
Problem Scenario 60 : You have been given below code snippet.
val a = sc.parallelize(List("dog", "salmon", "salmon", "rat", "elephant"), 3)
val b = a.keyBy(_.length)
val c = sc.parallelize(List("dog","cat","gnu","salmon","rabbit","turkey","wolf","bear","bee"), 3)
val d = c.keyBy(_.length)
operation1
Write a correct code snippet for operation1 which will produce desired output, shown below.
Array[(Int, (String, String))] = Array((6,(salmon,salmon)), (6,(salmon,rabbit)),
(6,(salmon,turkey)), (6,(salmon,salmon)), (6,(salmon,rabbit)),
(6,(salmon,turkey)), (3,(dog,dog)), (3,(dog,cat)), (3,(dog,gnu)), (3,(dog,bee)), (3,(rat,dog)),
(3,(rat,cat)), (3,(rat,gnu)), (3,(rat,bee)))
Correct Answer:
See the explanation for Step by Step Solution and configuration.
Explanation:
Solution :
b.join(d).collect
join [Pair] : Performs an inner join using two key-value RDDs. Please note that the keys must be generally comparable to make this work.
keyBy : Constructs two-component tuples (key-value pairs) by applying a function on each data item. The result of the function becomes the key and the original data item becomes the value of the newly created tuples.
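The inner-join semantics can be illustrated on plain Scala lists, without Spark: every value on the left is paired with every value on the right that shares its key, and keys present on only one side are dropped. The helper `keyByLength` and the name `joined` are hypothetical.

```scala
// Pair every element with its length as the key, like keyBy(_.length).
def keyByLength(xs: List[String]): List[(Int, String)] = xs.map(x => (x.length, x))

val b = keyByLength(List("dog", "salmon", "salmon", "rat", "elephant"))
val d = keyByLength(List("dog", "cat", "gnu", "salmon", "rabbit", "turkey", "wolf", "bear", "bee"))

// Inner join: emit one (key, (left, right)) tuple for every pair of values sharing a key.
val joined: List[(Int, (String, String))] =
  for {
    (k, left) <- b
    (k2, right) <- d
    if k == k2
  } yield (k, (left, right))
```

Key 6 contributes 2 x 3 tuples and key 3 contributes 2 x 4 tuples, which matches the 14 elements in the expected output; "elephant" (length 8) and the four-letter words have no partner and disappear.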
Question 8
CORRECT TEXT
Problem Scenario 36 : You have been given a file named spark8/data.csv (type,name).
data.csv
1 ,Lokesh
2 ,Bhupesh
2 ,Amit
2 ,Ratan
2 ,Dinesh
1 ,Pavan
1 ,Tejas
2 ,Sheela
1 ,Kumar
1 ,Venkat
1. Load this file from hdfs and save it back as (id, (all names of same type)) in results directory. However, make sure while saving it should be
Correct Answer:
See the explanation for Step by Step Solution and configuration.
Explanation:
Solution :
Step 1 : Create the file in HDFS (we will use Hue). Alternatively, you can first create it in the local filesystem and then upload it to HDFS.
Step 2 : Load the data.csv file from HDFS and create a pair RDD
val name = sc.textFile("spark8/data.csv")
val namePairRDD = name.map(x => (x.split(",")(0), x.split(",")(1)))
Step 3 : Now swap namePairRDD RDD. (The swapped RDD is not used in the remaining steps.)
val swapped = namePairRDD.map(item => item.swap)
Step 4 : Now combine the RDD by key.
val combinedOutput = namePairRDD.combineByKey(List(_), (x: List[String], y: String) => y :: x, (x: List[String], y: List[String]) => x ::: y)
Step 5 : Save the output as a text file; the output must be written to a single file.
combinedOutput.repartition(1).saveAsTextFile("spark8/result.txt")
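The three functions passed to combineByKey in Step 4 play different roles: the first creates a combiner from the first value seen for a key in a partition, the second merges a further value into that combiner, and the third merges combiners coming from different partitions. The sketch below models this on nested Scala sequences, without Spark. `combineByKeyLocal` is a hypothetical helper, and the split of the names into two partitions is an assumption made for illustration.

```scala
// Local model of combineByKey: each inner Seq plays the role of one partition.
def combineByKeyLocal[K, V, C](partitions: Seq[Seq[(K, V)]])(
    createCombiner: V => C,
    mergeValue: (C, V) => C,
    mergeCombiners: (C, C) => C): Map[K, C] = {
  // Phase 1: build one combiner per key inside each partition.
  val perPartition: Seq[Map[K, C]] = partitions.map { part =>
    part.foldLeft(Map.empty[K, C]) { case (acc, (k, v)) =>
      val combined = acc.get(k) match {
        case Some(c) => mergeValue(c, v)
        case None    => createCombiner(v)
      }
      acc.updated(k, combined)
    }
  }
  // Phase 2: merge the per-partition combiners key by key.
  perPartition.foldLeft(Map.empty[K, C]) { (acc, partial) =>
    partial.foldLeft(acc) { case (merged, (k, c)) =>
      val combined = merged.get(k) match {
        case Some(prev) => mergeCombiners(prev, c)
        case None       => c
      }
      merged.updated(k, combined)
    }
  }
}

// Hypothetical partitioning of a few (type, name) records.
val partitions = Seq(
  Seq("1" -> "Lokesh", "2" -> "Bhupesh", "2" -> "Amit"),
  Seq("1" -> "Pavan", "2" -> "Sheela"))

val names: Map[String, List[String]] =
  combineByKeyLocal[String, String, List[String]](partitions)(
    v => List(v),            // createCombiner
    (c, v) => v :: c,        // mergeValue prepends, as in Step 4
    (c1, c2) => c1 ::: c2)   // mergeCombiners concatenates
```

Because mergeValue prepends, names within one partition end up in reverse order of arrival ("Amit" before "Bhupesh"), which explains why the order of names in the saved output may look shuffled.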
Question 9
CORRECT TEXT
Problem Scenario 52 : You have been given below code snippet.
val b = sc.parallelize(List(1,2,3,4,5,6,7,8,2,4,2,1,1,1,1,1))
Operation_xyz
Write a correct code snippet for Operation_xyz which will produce below output.
scala.collection.Map[Int,Long] = Map(5 -> 1, 8 -> 1, 3 -> 1, 6 -> 1, 1 -> 6, 2 -> 3, 4 -> 2, 7 -> 1)
Correct Answer:
See the explanation for Step by Step Solution and configuration.
Explanation:
Solution :
b.countByValue
countByValue
Returns a map that contains all unique values of the RDD and their respective occurrence counts. (Warning: This operation will finally aggregate the information in a single reducer.)
Listing Variants
def countByValue(): Map[T, Long]
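The counting behaviour can be reproduced on a plain Scala list as a quick check of the expected output. This sketch does not use Spark; `values` and `counts` are hypothetical names.

```scala
val values = List(1, 2, 3, 4, 5, 6, 7, 8, 2, 4, 2, 1, 1, 1, 1, 1)

// Group equal values together and count the size of each group.
val counts: Map[Int, Long] =
  values.groupBy(identity).map { case (v, vs) => v -> vs.size.toLong }
```

The value 1 occurs six times, 2 three times and 4 twice, and the counts add up to the 16 elements of the list.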
Question 10
CORRECT TEXT
Problem Scenario 15 : You have been given following mysql database details as well as other info.
user=retail_dba
password=cloudera
database=retail_db
jdbc URL = jdbc:mysql://quickstart:3306/retail_db
Please accomplish following activities.
1. In the mysql departments table, please insert the following record. Insert into departments values(9999, '"Data Science"');
2. There is a downstream system which will process dumps of this table. However, that system can only process files in which fields are enclosed in single quotes ('), the field separator is a hyphen (-) and each line is terminated by a colon (:).
3. If the data itself contains a double quote ("), it should be escaped by \.
4. Please import the departments table into a directory called departments_enclosedby so that the file can be processed by the downstream system.
Correct Answer:
See the explanation for Step by Step Solution and configuration.
Explanation:
Solution :
Step 1 : Connect to mysql database.
mysql --user=retail_dba --password=cloudera
show databases; use retail_db; show tables;
Insert record
Insert into departments values(9999, '"Data Science"');
select * from departments;
Step 2 : Import data as per requirement.
sqoop import \
--connect jdbc:mysql://quickstart:3306/retail_db \
--username=retail_dba \
--password=cloudera \
--table departments \
--target-dir /user/cloudera/departments_enclosedby \
--enclosed-by \' --escaped-by \\ --fields-terminated-by '-' --lines-terminated-by :
Step 3 : Check the result.
hdfs dfs -cat /user/cloudera/departments_enclosedby/part*