Sunday, December 15, 2019

Pull Data from Azure Data Lake Gen 2 and Write Data to Azure SQL Data Warehouse Using Spark and Scala

This article gives you complete working steps to pull data from Azure Data Lake Gen 2 and write/push that data to Azure SQL Data Warehouse from an Azure Spark notebook, using the Scala language.

Prerequisites:

  1. An Azure subscription
  2. Azure SQL Data Warehouse
  3. Azure Blob Storage
  4. A blob container
  5. Azure Spark
  6. Azure Data Lake Gen 2 with a sample Parquet file in a blob container
  7. A master key on the database (created in Step 0 below)
Step 0:

Create a Master Key in Azure SQL Data Warehouse

Open SQL Server Management Studio and connect using:
  • Azure SQL Data Warehouse server // fully qualified name, suffixed with ".database.windows.net"
  • SQL authentication (username and password)
  • Azure SQL Data Warehouse username // should be suffixed with @yourAzureWareHouseServerName
  • Azure SQL Data Warehouse password
Once you are successfully connected, select your data warehouse, open a new query window, and execute the SQL command below; it will create a master key in the database.

To execute this query you need the CONTROL permission on the database.

CREATE MASTER KEY ENCRYPTION BY PASSWORD = '23987hxJ#KL95234nl0zBe';


Step 1:

Pull Data into a Data Frame Using Scala

Write the code below in a Spark notebook.
Note: First you have to mount your Azure Data Lake Gen 2 container to the Spark context. In my case the mounted path is "/mnt/blobTestContainer/emp".
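If the container is not mounted yet, it can be mounted along the lines of the sketch below. This is only a minimal example using a service principal (OAuth); the account name, container name, client id, secret, and tenant id are placeholders you must replace with your own values.

// Mount an ADLS Gen 2 container to /mnt/blobTestContainer (sketch; all <...> values are placeholders)
val configs = Map(
  "fs.azure.account.auth.type" -> "OAuth",
  "fs.azure.account.oauth.provider.type" -> "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
  "fs.azure.account.oauth2.client.id" -> "<application-client-id>",
  "fs.azure.account.oauth2.client.secret" -> "<client-secret>",
  "fs.azure.account.oauth2.client.endpoint" -> "https://login.microsoftonline.com/<tenant-id>/oauth2/token")

dbutils.fs.mount(
  source = "abfss://blobTestContainer@<adls-gen2-account>.dfs.core.windows.net/",
  mountPoint = "/mnt/blobTestContainer",
  extraConfigs = configs)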


val pardf = spark.read.parquet("/mnt/blobTestContainer/emp") //load data frame from parquet file
pardf.createOrReplaceTempView("tempEmp") //create temporary view
val dfEmp = spark.sql("select * from tempEmp") //load data frame from sql view


After executing the notebook command, your data frame "dfEmp" will be ready.
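
As an optional sanity check before writing anything, you can inspect the schema and preview a few rows:

dfEmp.printSchema() // columns read from the parquet file
dfEmp.show(5) // preview the first five rows
println(dfEmp.count()) // number of rows that will be written to the warehouse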

Step 2:

Storage Settings for Writing Data to Azure SQL Data Warehouse

Write the code below in the Spark notebook. The following settings are required to write the data.

val blobStorage = "xxxxxx.blob.core.windows.net" //This should be your fully qualified blob storage name

val blobContainer = "test" //The blob container is used as temporary storage; the data warehouse then pulls the data from this temp location

val blobAccessKey = "cH/xxxxxxxxxxxFBEd3878QPZM9EoYPrkSXXXXXXXXXXXX/eg+rh/IfjxxxxxxxxF6A3BA=="

val tempDir = "wasbs://" + blobContainer + "@" + blobStorage + "/tempDirs" //temp dir storage location

val acntInfo = "fs.azure.account.key."+ blobStorage

sc.hadoopConfiguration.set(acntInfo, blobAccessKey)
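
Optionally, you can verify that the account key actually grants access by probing the temp path with the Hadoop FileSystem API. This is just a sketch: fs.exists returning false is fine (the connector creates the folder later), but an exception here means the storage name or key is wrong.

import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(new java.net.URI(tempDir), sc.hadoopConfiguration)
println(fs.exists(new Path(tempDir))) // false is OK; an authentication error means the key/account is wrong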

Step 3:

Azure SQL Data Warehouse Related Settings

Write the code below in the Spark notebook. Make sure the username is fully qualified, meaning it is suffixed with @yourAzureWareHouseServerName.

The data warehouse server name should be suffixed with ".database.windows.net".

val dwDatabase = "dwTestDB" //This is Data ware house database Name

val dwServer = "xxxxx.database.windows.net" //This is Data ware house server name

val dwUser = "dwuser@dwserver" //This is user name to connect to database. You can get this from azure data ware house property

val dwPass = "XXXXX" //This is password to connect to database. You can get username and password while creating azure data warehouse 

val dwJdbcPort = "1433"

val dwJdbcExtraOptions = "encrypt=true;trustServerCertificate=true;hostNameInCertificate=*.database.windows.net;loginTimeout=30"

val sqlDwUrl = "jdbc:sqlserver://" + dwServer + ":" + dwJdbcPort + ";database=" + dwDatabase + ";user=" + dwUser + ";password=" + dwPass + ";" + dwJdbcExtraOptions

val sqlDwUrlSmall = "jdbc:sqlserver://" + dwServer + ":" + dwJdbcPort + ";database=" + dwDatabase + ";user=" + dwUser + ";password=" + dwPass
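
Before the bulk write, you can optionally test the JDBC URL and credentials with a trivial read. This is only a sketch using Spark's generic JDBC source; it checks nothing except that the connection and login succeed.

val connTest = spark.read
  .format("jdbc")
  .option("url", sqlDwUrlSmall)
  .option("dbtable", "(SELECT 1 AS ok) t") // trivial subquery; fails fast on a bad server name, user, or password
  .load()
connTest.show()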

Step 4:

Code to Write Data to Azure SQL Data Warehouse

Write the code below to push the data to Azure SQL Data Warehouse. In this code we write the data frame "dfEmp" to the warehouse. It will create the table "Emp" in the database by staging the data in tempDir, and it automatically creates the PolyBase connection underneath.

spark.conf.set("spark.sql.parquet.writeLegacyFormat","true")

dfEmp.write
  .format("com.databricks.spark.sqldw")
  .option("url", sqlDwUrlSmall)
  .option("dbtable", "Emp")
  .option("forward_spark_azure_storage_credentials", "True")
  .option("tempdir", tempDir)
  .mode("overwrite")
  .save()
After all these five steps you can connect to your Azure SQL Data Warehouse using SSMS (SQL Server Management Studio) and query the Emp table.
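
You can also read the table back from the notebook to confirm the load, reusing the connector and the variables defined above (again just a sketch):

val empCheck = spark.read
  .format("com.databricks.spark.sqldw")
  .option("url", sqlDwUrlSmall)
  .option("tempdir", tempDir)
  .option("forward_spark_azure_storage_credentials", "True")
  .option("dbtable", "Emp")
  .load()
empCheck.show(5) // should display the rows loaded from the parquet file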

Final Code

Follow all the settings above, then copy the code below into the Spark notebook and execute it as a whole.

val pardf = spark.read.parquet("/mnt/blobTestContainer/emp") //load data frame from parquet file

pardf.createOrReplaceTempView("tempEmp") //create temporary view

val dfEmp = spark.sql("select * from tempEmp") //load data frame from sql view

val blobStorage = "xxxxxx.blob.core.windows.net" //This should be your fully qualified blob storage name

val blobContainer = "test" //The blob container is used as temporary storage; the data warehouse then pulls the data from this temp location

val blobAccessKey = "cH/xxxxxxxxxxxFBEd3878QPZM9EoYPrkSXXXXXXXXXXXX/eg+rh/IfjxxxxxxxxF6A3BA=="

val tempDir = "wasbs://" + blobContainer + "@" + blobStorage + "/tempDirs" //temp dir storage location

val acntInfo = "fs.azure.account.key."+ blobStorage

sc.hadoopConfiguration.set(acntInfo, blobAccessKey)


val dwDatabase = "dwTestDB" //This is Data ware house database Name

val dwServer = "xxxxx.database.windows.net" //This is Data ware house server name

val dwUser = "dwuser@dwserver" //This is user name to connect to database. You can get this from azure data ware house property

val dwPass = "XXXXX" //This is password to connect to database. You can get username and password while creating azure data warehouse 

val dwJdbcPort = "1433"

val dwJdbcExtraOptions = "encrypt=true;trustServerCertificate=true;hostNameInCertificate=*.database.windows.net;loginTimeout=30"

val sqlDwUrl = "jdbc:sqlserver://" + dwServer + ":" + dwJdbcPort + ";database=" + dwDatabase + ";user=" + dwUser + ";password=" + dwPass + ";" + dwJdbcExtraOptions

val sqlDwUrlSmall = "jdbc:sqlserver://" + dwServer + ":" + dwJdbcPort + ";database=" + dwDatabase + ";user=" + dwUser + ";password=" + dwPass


spark.conf.set("spark.sql.parquet.writeLegacyFormat","true")

dfEmp.write
  .format("com.databricks.spark.sqldw")
  .option("url", sqlDwUrlSmall)
  .option("dbtable", "Emp")
  .option("forward_spark_azure_storage_credentials", "True")
  .option("tempdir", tempDir)
  .mode("overwrite")
  .save()

Hope this code helps you. Leave a comment if you need any help or have any suggestion.

Happy coding. 
