EMR Series 2: Running and Troubleshooting an EMR Cluster

This is part two of our three-part blog series on sending Amazon Elastic MapReduce (EMR) logs to SolarWinds® Loggly®. In the first part, we gave a brief introduction to EMR, Amazon's Apache Hadoop distribution. An EMR cluster is a managed environment that differs slightly from a traditional Hadoop cluster. We saw how EMR generates different types of log files and where these logs are stored. These logs are invaluable in troubleshooting Hadoop job failures.

In this post, we will create and configure an EMR cluster to run Apache Hive jobs in multiple steps. We intentionally introduce an error in one of the steps and use EMR logs to find the cause.

The setup

Apache Hive is installed on the EMR cluster. This cluster uses EMRFS as the file system, so the data input and output locations are mapped to an S3 bucket. The cluster also uses the same S3 bucket to store logs.

We will create several EMR steps on the cluster to process a sample dataset. Each of these steps runs a Hive script, and the final output is stored in the S3 bucket. These steps generate MapReduce logs because Hive commands are converted to MapReduce jobs at runtime. Log files for each step are collected from the generated containers.

The sample data

The sample dataset for this use case is publicly available at the Australian government's Open Data website. This dataset covers threatened animal and plant species from various states and territories in Australia. A description of the fields in this dataset and the CSV file can be viewed and downloaded here.

The processing steps

The first step of the EMR job involves creating a Hive table as a schema for the underlying source file in S3. In the second step of the job, we run a query on the data that succeeds. Then we run a third and a fourth query; the third will fail and the fourth will succeed.

We repeat these four steps a few times an hour, simulating sequential executions of a multi-step batch job. However, in a real scenario, the time difference between each batch run can be much larger. The small time lag between successive runs here is meant to speed up our tests.

The S3 bucket folders

Before creating our EMR cluster, we have to create an S3 bucket to host our files. In our example, we named this bucket "loggly-emr". The folders in this bucket are shown below in the AWS console for S3; a sketch of the equivalent CLI commands follows the list:


AWS S3, © Amazon.com, Inc.

  • The input folder contains the sample data
  • The scripts folder contains the Hive script files for the EMR job steps
  • The output folder contains output from the Hive program
  • The logs folder is used by the EMR cluster to store its logs. As we saw in the first part of this series, EMR creates several subfolders within this folder. Log files from this folder will be sent to Loggly.
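If you prefer to script this part, the same bucket layout can be created with the AWS CLI. This is only a minimal sketch: the bucket and folder names match this post, but the local file names (for example, threatened_species.csv for the downloaded dataset) are assumptions you would replace with your own.

aws s3 mb s3://loggly-emr

# Upload the sample dataset (local file name assumed) and the four Hive scripts.
aws s3 cp threatened_species.csv s3://loggly-emr/input/
aws s3 cp createTable.q s3://loggly-emr/scripts/
aws s3 cp endangeredSpeciesNSW.q s3://loggly-emr/scripts/
aws s3 cp endangeredPlantSpecies.q s3://loggly-emr/scripts/
aws s3 cp extinctAnimalsQLD.q s3://loggly-emr/scripts/

# The output and logs folders don't need to be created up front; EMR and Hive
# create them when they first write objects under those prefixes.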

Hive scripts for the EMR job steps

Step 1

This job step runs a Hive script, createTable.q, to create an external Hive table. This table describes the schema of the underlying CSV data file. The script is shown below:

CREATE EXTERNAL TABLE `threatened_species`(
  `scientific name` string,
  `common name` string,
  `current scientific name` string,
  `threatened status` string,
  `act` string,
  `nsw` string,
  `nt` string,
  `qld` string,
  `sa` string,
  `tas` string,
  `vic` string,
  `wa` string,
  `aci` string,
  `cki` string,
  `ci` string,
  `csi` string,
  `jbt` string,
  `nfi` string,
  `hmi` string,
  `aat` string,
  `cma` string,
  `listed sprat taxonid` bigint,
  `current sprat taxonid` bigint,
  `kingdom` string,
  `class` string,
  `profile` string,
  `date extracted` string,
  `nsl name` string,
  `family` string,
  `genus` string,
  `species` string,
  `infraspecific rank` string,
  `infraspecies` string,
  `species author` string,
  `infraspecies author` string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS
  INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://loggly-emr/input/'

Step 2

This job step runs a query to find the five most threatened species in the state of New South Wales (NSW). The file name of the Hive query is endangeredSpeciesNSW.q, and it's shown below:

SELECT species, COUNT(nsw) AS number_of_threatened_species
FROM threatened_species
WHERE (nsw = 'Yes' OR nsw = 'Endangered') AND `threatened status` = 'Endangered'
GROUP BY species
HAVING COUNT(nsw) > 1
ORDER BY number_of_threatened_species DESC
LIMIT 5

Step 3

This job step runs a query to calculate the total number of endangered plant species for each plant family in Australia. However, this query contains an intentional code error: a field was not added to the GROUP BY clause. This is the error we want Loggly to flag. The query fails when Hive tries to run it. The file name of the Hive query is endangeredPlantSpecies.q, and it's shown below:

SELECT family, COUNT(species) AS number_of_threatened_species
FROM threatened_species
WHERE kingdom = 'Plantae' AND `threatened status` = 'Endangered'

Step 4

This step succeeds again. It shows the common and scientific names of extinct animal species in the Australian state of Queensland. The script file is called extinctAnimalsQLD.q and is shown below:

SELECT "common name", "scientific name" FROM endangered_speciesWHERE realm = 'Animalia'AND (qld = 'Yes' OR qld = 'Extinct') AND "endangered status" = 'Extinct'

Log aggregation

We also uploaded a JSON file called logAggregation.json to the scripts folder of the S3 bucket. This file is used to configure YARN log aggregation. Log aggregation is configured in the yarn-site.xml configuration file when the cluster starts up. The contents of the logAggregation.json file are shown below:

[ { "Classificatie": "yarn-site", "Propriedades": { "yarn.log-aggregation-enable": "true", "yarn.log-aggregation.retain-seconds": "-1", "yarn.nodemanager.remote-app-log-dir": "slogly-emr}/loggly"}

Set up the EMR cluster

Once the S3 bucket is created and the data and script files are copied to their respective folders, we create an EMR cluster. The following images walk through the process as we create the cluster with mostly default settings.

On the first screen for configuring the cluster in the AWS console, we keep all the EMR-recommended applications, including Hive. We don't use AWS Glue to store Hive metadata, nor do we add any job steps at this time. However, we do add a software configuration. Notice how we specify the path to the logAggregation.json file in this field:


AWS EMR, © Amazon.com, Inc.

On the next screen, we keep all the default settings. For our testing purposes, the cluster has one master node and two core nodes. Each node is an m3.xlarge instance with a 10 GB root volume.


AWS EMR, © Amazon.com, Inc.

We name the cluster Loggly-EMR on the next screen and specify the custom S3 location for the logs.


AWS EMR, © Amazon.com, Inc.

We leave the defaults for EMRFS consistent view, custom AMI ID, and bootstrap actions.


AWS EMR, © Amazon.com, Inc.

Finally, we specify an EC2 key pair for accessing the cluster's master node. The default IAM roles for EMR, the EC2 instance profile, and auto-scaling are left unchanged. The master and core nodes also use the default security groups.


AWS EMR, © Amazon.com, Inc.

In general, this is a standard configuration for an EMR cluster. When it's done, the cluster will be in a "waiting" state, as shown below:


AWS EMR, © Amazon.com, Inc.
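For readers who script their clusters, roughly the same configuration can be created from the AWS CLI. The following is a hedged sketch rather than the exact command behind the screenshots above: the release label (emr-5.11.0) and the key pair name (my-key-pair) are assumptions, while the instance type, node count, log location, and configuration file mirror the settings described in this post. Here the logAggregation.json file is supplied as a local copy of the same JSON we uploaded to the scripts folder.

aws emr create-cluster \
  --name "Loggly-EMR" \
  --release-label emr-5.11.0 \
  --applications Name=Hadoop Name=Hive \
  --instance-type m3.xlarge \
  --instance-count 3 \
  --use-default-roles \
  --ec2-attributes KeyName=my-key-pair \
  --log-uri s3://loggly-emr/logs/ \
  --configurations file://logAggregation.json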

Submit Hive job steps

Now that the EMR cluster is up and running, let's add the four job steps from the command line using the AWS CLI command aws emr add-steps. Notice how the command specifies the Hive script file and the input and output directories for each step:

aws emr add-steps
--cluster-id j-2TFSCG8AY15CK
--steps
Type=HIVE,Name='createTable',ActionOnFailure=CONTINUE,Args=[-f,s3://loggly-emr/scripts/createTable.q,-d,INPUT=s3://loggly-emr/input,-d,OUTPUT=s3://loggly-emr/output]
Type=HIVE,Name='endangeredSpeciesNSW',ActionOnFailure=CONTINUE,Args=[-f,s3://loggly-emr/scripts/endangeredSpeciesNSW.q,-d,INPUT=s3://loggly-emr/input,-d,OUTPUT=s3://loggly-emr/output]
Type=HIVE,Name='endangeredPlantSpecies',ActionOnFailure=CONTINUE,Args=[-f,s3://loggly-emr/scripts/endangeredPlantSpecies.q,-d,INPUT=s3://loggly-emr/input,-d,OUTPUT=s3://loggly-emr/output]
Type=HIVE,Name='extinctAnimalsQLD',ActionOnFailure=CONTINUE,Args=[-f,s3://loggly-emr/scripts/extinctAnimalsQLD.q,-d,INPUT=s3://loggly-emr/input,-d,OUTPUT=s3://loggly-emr/output]

Line breaks have been added for clarity; there should be no line breaks when you run this command. The output of this command is a list of job step IDs:

{ "StepIds": [ "s-27S2V7H36F11A", "s-20S96C57D979O", "s-1OZI9O3LPFMIH", "s-2128EHDHIPD3M" ]}

These are the steps EMR will perform in sequence. The image below shows the steps in the AWS EMR console:


AWS EMR, © Amazon.com, Inc.

Here, one of the steps is running and the other three are pending. As each step completes, it will be marked as "Completed".
As expected, the third step fails, as shown in the following screenshot.

Log files for each step are stored in subfolders under the s3://loggly-emr/logs//steps folder.

Each node saves its logs under the s3://loggly-emr/logs//node folder.
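A quick way to confirm these folders are being populated is to list the log prefix with the AWS CLI (a sketch; EMR pushes logs to S3 with a delay of a few minutes, so the listing may lag behind the console):

aws s3 ls s3://loggly-emr/logs/ --recursive | head -n 20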

This console image shows the status of each step after execution:


AWS EMR, © Amazon.com, Inc.

Resolve an error using the log files

Before using Loggly for troubleshooting, let's review these logs in the EMR console. We can also download the log files from the S3 bucket and open them in a text editor. For each step, the controller log file contains the most relevant information.
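Downloading and searching the logs can also be scripted. The sketch below reuses the cluster ID from earlier and assumes the failed step's ID from the StepIds output above; the files land in S3 gzip-compressed, so zgrep is used to search them.

CLUSTER_ID=j-2TFSCG8AY15CK          # from the add-steps command above
STEP_ID=s-1OZI9O3LPFMIH             # assumed ID of the failed third step

# Pull the controller, stdout, stderr, and syslog files for that step.
aws s3 cp --recursive "s3://loggly-emr/logs/${CLUSTER_ID}/steps/${STEP_ID}/" ./step-logs/

# Look for failure markers in the compressed logs.
zgrep -iE "error|failed" ./step-logs/*.gz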

On the Steps tab of the cluster, if we click the "controller" link for the failed job step, the log file opens in a separate browser tab.

Two lines in this log file are important. First the log will show a message like this:

INFO startExec 'hadoop jar /var/lib/aws/emr/step-runner/hadoop-jars/command-runner.jar hive-script --run-hive-script --args -f s3://loggly-emr/scripts/endangeredPlantSpecies.q -d INPUT=s3://loggly-emr/input -d OUTPUT=s3://loggly-emr/output'

This shows the Hadoop command runner being launched with the name of the Hive script.

Towards the end of the file, there will be another message like this:

WARN step failed with exitCode

This indicates that the step failed. But why did it fail? To find the answer, we can look in the Hive application subfolder of the logs folder. This folder contains a file called hive.log.gz. The Hive log file contains detailed information about each HiveQL command executed. In our case, the file is in the following location:
s3://loggly-emr/logs//node//applications/hive/user/hadoop:


AWS S3, © Amazon.com, Inc.
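If you'd rather search this file locally, you can copy it down with the AWS CLI and use zgrep, since it's stored compressed. A sketch, reusing the cluster ID from earlier; the instance ID below is a placeholder for the node that ran the Hive driver (usually the master node), which isn't shown in this post.

CLUSTER_ID=j-2TFSCG8AY15CK
INSTANCE_ID="i-xxxxxxxxxxxxxxxxx"   # placeholder; use your node's EC2 instance ID

aws s3 cp "s3://loggly-emr/logs/${CLUSTER_ID}/node/${INSTANCE_ID}/applications/hive/user/hadoop/hive.log.gz" .

# The compile-time failure is reported under the ERROR keyword.
zgrep "ERROR" hive.log.gz
zgrep "endangeredPlantSpecies.q" hive.log.gz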

Searching the log for "endangeredPlantSpecies.q" shows us that Hive opens the script and then starts the compilation phase:

2017-12-22T10:02:00,209 INFO [ff202b74-8efe-46eb-b715-5849a72caa45 main([])]: s3n.S3NativeFileSystem (S3NativeFileSystem.java:open(1210)) - Opening 's3://loggly-emr/scripts/endangeredPlantSpecies.q' for reading
2017-12-22T10:02:00,261 INFO [ff202b74-8efe-46eb-b715-5849a72caa45 main([])]: conf.HiveConf (HiveConf.java:getLogIdVar(3947)) - Using the default value passed in for log id: ff202b74-8efe-46eb-b715-5849a72caa45
2017-12-22T10:02:00,323 INFO [ff202b74-8efe-46eb-b715-5849a72caa45 main([])]: ql.Driver (Driver.java:compile) - Compiling command(queryId=hadoop_20171222100200_95a19046-d8ad-4625-8182-f3a2e9014040): SELECT family, COUNT(species) AS number_of_threatened_species FROM threatened_species WHERE kingdom = 'Plantae' AND `threatened status` = 'Endangered'

Then semantic analysis of the query begins:

2017-12-22T10:02:01,561 INFO [ff202b74-8efe-46eb-b715-5849a72caa45 main([])]: parse.CalcitePlanner (SemanticAnalyzer.java:analyzeInternal) - Starting Semantic Analysis
2017-12-22T10:02:01,569 INFO [ff202b74-8efe-46eb-b715-5849a72caa45 main([])]: parse.CalcitePlanner (SemanticAnalyzer.java:genResolvedParseTree) - Completed phase 1 of Semantic Analysis
2017-12-22T10:02:01,569 INFO [ff202b74-8efe-46eb-b715-5849a72caa45 main([])]: parse.CalcitePlanner (SemanticAnalyzer.java:getMetaData) - Get metadata for source tables
2017-12-22T10:02:01,664 INFO [ff202b74-8efe-46eb-b715-5849a72caa45 main([])]: parse.CalcitePlanner (SemanticAnalyzer.java:getMetaData(2095)) - Get metadata for subqueries
2017-12-22T10:02:01,664 INFO [ff202b74-8efe-46eb-b715-5849a72caa45 main([])]: parse.CalcitePlanner (SemanticAnalyzer.java:getMetaData(2119)) - Get metadata for destination tables
2017-12-22T10:02:01,688 INFO [ff202b74-8efe-46eb-b715-5849a72caa45 main([])]: ql.Context (Context.java:getMRScratchDir) - New scratch dir is hdfs://ip-10-0-8-31.ec2.internal:8020/tmp/hive/hadoop/ff202b74-8efe-46eb-b715-5849a72caa45/hive_2017-12-22_10-02-01_343_...
2017-12-22T10:02:01,688 INFO [ff202b74-8efe-46eb-b715-5849a72caa45 main([])]: parse.CalcitePlanner (SemanticAnalyzer.java:genResolvedParseTree) - Completed getting MetaData in Semantic Analysis

Finally, it displays an error, flagged by the keyword "ERROR", followed by a Java stack trace:

2017-12-22T10:02:02,800 ERROR [ff202b74-8efe-46eb-b715-5849a72caa45 main([])]: parse.CalcitePlanner (CalcitePlanner.java:genOPTree(423)) - CBO failed, skipping CBO.
org.apache.hadoop.hive.ql.parse.SemanticException: Line 1:7 Expression not in GROUP BY key 'family'
	at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genAllExprNodeDesc(SemanticAnalyzer.java:11620) ~[hive-exec-2.3.2-amzn-0.jar:2.3.2-amzn-0]
……...

And then the actual error:

(SessionState.java:printError(1126)) - FAILED: SemanticException [Error 10025]: Line 1:7 Expression not in GROUP BY key 'family'

Now we can see why the step failed: a field is missing from the GROUP BY clause of the query.
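The fix is to add the missing column to a GROUP BY clause, upload the corrected script, and resubmit just that step. A minimal sketch, reusing the cluster ID from earlier; the corrected script is written locally here before being copied over the old one in S3.

# Corrected version of endangeredPlantSpecies.q with the GROUP BY clause added.
cat > endangeredPlantSpecies.q <<'EOF'
SELECT family, COUNT(species) AS number_of_threatened_species
FROM threatened_species
WHERE kingdom = 'Plantae' AND `threatened status` = 'Endangered'
GROUP BY family;
EOF

aws s3 cp endangeredPlantSpecies.q s3://loggly-emr/scripts/

# Resubmit only the previously failed step.
aws emr add-steps --cluster-id j-2TFSCG8AY15CK --steps Type=HIVE,Name='endangeredPlantSpecies',ActionOnFailure=CONTINUE,Args=[-f,s3://loggly-emr/scripts/endangeredPlantSpecies.q,-d,INPUT=s3://loggly-emr/input,-d,OUTPUT=s3://loggly-emr/output]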

Conclusion

Obviously, if we resubmit the job with the correct syntax, it will succeed. But what about other mistakes in the future? This example was a simple demo. In real life there can be dozens of steps with complex logic, each generating very large log files. Manually searching through thousands of log lines may not be practical for debugging purposes. The following questions naturally arise:

  • Is there a way to collect all logs in one place?
  • Is there an easy way to search for errors?
  • Can we be notified if an error is logged?
  • Can we try to find an error pattern in the logs?

This is where a tool like SolarWinds Loggly can help. We can use it to debug our EMR job errors. We will explore this topic in the third and final part of this blog series.

Ready to try Loggly now? Go ahead and sign up here for a 14-day trial of the Enterprise tier and take it for a spin.

The Loggly and SolarWinds trademarks, service marks and logos are the exclusive property of SolarWinds Worldwide, LLC or its affiliates. All other trademarks belong to their respective owners.
