HDP Sandbox Installation and Setup

// Install Oracle VM VirtualBox

https://www.virtualbox.org/wiki/Downloads

// Install HDP

http://hortonworks.com/products/hortonworks-sandbox/#install

// Enable the GNOME desktop environment on CentOS 6.2

Here are the instructions for installing GNOME under CentOS:
1. After logging in to your root account in CLI mode, install the X Window System (X11) with this command:
yum groupinstall -y 'X Window System'
2. Now install GNOME with the following command:
yum groupinstall -y 'Desktop'
3. Now configure CentOS to start in GUI mode (the GNOME login screen) instead of CLI mode. To do this, edit the /etc/inittab file with this command:
vi /etc/inittab
Then change this line:
id:3:initdefault:
to this:
id:5:initdefault:
4. Save the file (Esc -> :wq -> Return) and reboot your system. CentOS will now load the GNOME desktop environment.
5. To switch between CLI and GUI modes, you can use these keyboard shortcuts:

Ctrl + Alt + F6 -> GUI to CLI

Ctrl + Alt + F1 -> CLI to GUI

Or simply start the GUI mode by running this command:
startx
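The runlevel edit in step 3 can also be scripted with sed instead of vi. This sketch applies the same substitution to a throwaway copy of the file so it is safe to try anywhere; as root, the identical sed command works in place on the real /etc/inittab.

```shell
# Non-destructive demo of the step-3 runlevel change: apply the same sed
# substitution to a temporary copy instead of the real /etc/inittab.
tmp=$(mktemp)
echo 'id:3:initdefault:' > "$tmp"    # the default-runlevel line from step 3
sed -i 's/^id:3:initdefault:/id:5:initdefault:/' "$tmp"
cat "$tmp"                           # prints: id:5:initdefault:
rm -f "$tmp"
```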

// Install the Firefox browser

yum install -y firefox

// Start the Ambari server and agent

ambari-server start

ambari-agent start

// Verify the Ambari UI works (userid: admin | password: admin)

http://127.0.0.1:8080/

// Verify the Hue UI works

http://127.0.0.1:8000/

// If there are any issues, restart all the services and check their status

service startup_script restart
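Services can take a while to come back after a restart. A small retry helper (illustrative, not part of the sandbox) makes the checks less flaky; the curl example in the comment assumes the Ambari UI address noted above.

```shell
# Hypothetical helper: retry a command up to N times, sleeping between attempts.
retry() {
  local max=$1 n=0
  shift
  until "$@"; do
    n=$((n + 1))
    [ "$n" -ge "$max" ] && return 1
    sleep 1
  done
}

# Assumed usage: wait up to 30 attempts for the Ambari UI to answer.
# retry 30 curl -sf http://127.0.0.1:8080/
retry 3 true && echo "service is up"
```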


Difference between Hive Internal and External tables

Internal tables:
  • Dropping the table deletes both the data and the metadata.
  • Best used for temporary tables.
  • Metadata is shared with a single program or process.
  • Hive has full control over the data.

External tables:
  • Dropping the table deletes only the metadata; the underlying data is left in place.
  • Best used for permanent tables.
  • Metadata is shared with many programs or processes.
  • Users have full control over the data.
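The difference is easiest to see in the DDL. A sketch with made-up table names (the LOCATION path is an assumption for the example):

```sql
-- Managed (internal) table: Hive owns the data; DROP TABLE removes data + metadata.
CREATE TABLE page_views_internal (url STRING, hits INT);

-- External table: Hive tracks only metadata; DROP TABLE leaves the files
-- at the LOCATION path untouched. The path below is illustrative.
CREATE EXTERNAL TABLE page_views_external (url STRING, hits INT)
LOCATION '/user/hue/page_views';
```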

Difference between Kafka and Flume

Flume:
  • Distributed log transport and processing system; very good integration with HDFS and HBase.
  • Best used to transport data into Hadoop.
  • Many built-in source and sink options are available.
  • Can process data on the fly.

Kafka:
  • Distributed publish/subscribe messaging system that persists data, similar to MSMQ.
  • Best used to transport data into multiple applications.
  • Limited built-ins; you typically have to write custom producers and consumers for your use case.
  • Requires an external streaming system for data processing.
  • Streaming data from Kafka to Hadoop using a Flume agent is a good design pattern.
  • Storm and Spark cover event processing.
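The Kafka-to-Hadoop pattern in the first bullet can be sketched as a Flume agent config. Everything here is illustrative (agent name, topic, ZooKeeper host, HDFS path); property names follow the Flume 1.6 KafkaSource and HDFS sink.

```properties
# Hypothetical Flume agent "a1": Kafka source -> memory channel -> HDFS sink.
a1.sources  = kafka-src
a1.channels = mem-ch
a1.sinks    = hdfs-sink

# Kafka source (topic and ZooKeeper host are placeholders)
a1.sources.kafka-src.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.kafka-src.zookeeperConnect = sandbox.hortonworks.com:2181
a1.sources.kafka-src.topic = events
a1.sources.kafka-src.channels = mem-ch

a1.channels.mem-ch.type = memory

# HDFS sink (path is a placeholder)
a1.sinks.hdfs-sink.type = hdfs
a1.sinks.hdfs-sink.hdfs.path = /user/flume/events
a1.sinks.hdfs-sink.hdfs.fileType = DataStream
a1.sinks.hdfs-sink.channel = mem-ch
```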