This article gives you complete, working steps to pull data from Azure Data Lake Gen 2 and write/push that data to Azure SQL Data Warehouse through Azure Spark using the Scala language.
Prerequisite:
- An Azure subscription
- Azure SQL Data Warehouse
- Azure Blob Storage
- A blob container
- Azure Spark
- Azure Data Lake Gen 2 with a sample parquet file in a blob container
- A master key on the database
Step 0:
Create a Master Key on Azure SQL Data Warehouse
Open SQL Server Management Studio and connect using:
- Azure SQL Data Warehouse server //fully qualified name, suffixed with ".database.windows.net"
- SQL authentication (username and password)
- Azure SQL Data Warehouse username //should be suffixed with @yourAzureWareHouseServerName
- Azure SQL Data Warehouse password
Once you have successfully connected, select your data warehouse, open a new query window, and execute the SQL command below, which creates a master key in the database.
To execute this query you need CONTROL permission on the database.
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '23987hxJ#KL95234nl0zBe';
Step 1:
Pull Data into a Data Frame using Scala
Write the below code in a Spark notebook.
Note:- First you have to mount your Azure Data Lake Gen 2 container to the Spark context. In my case it is "/mnt/blobTestContainer/emp".
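This article assumes the mount already exists. If you still need to create it, the sketch below shows one way to mount the container in a Databricks notebook using a service principal; the client ID, tenant ID, secret scope, and storage account name are placeholders of mine, not values from this article.
// Minimal sketch: mount an Azure Data Lake Gen 2 container in Databricks using a service principal.
// The IDs, secret scope, and storage account below are placeholders; replace them with your own.
val configs = Map(
  "fs.azure.account.auth.type" -> "OAuth",
  "fs.azure.account.oauth.provider.type" -> "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
  "fs.azure.account.oauth2.client.id" -> "<application-client-id>",
  "fs.azure.account.oauth2.client.secret" -> dbutils.secrets.get(scope = "<secret-scope>", key = "<secret-key>"),
  "fs.azure.account.oauth2.client.endpoint" -> "https://login.microsoftonline.com/<tenant-id>/oauth2/token"
)
dbutils.fs.mount(
  source = "abfss://blobTestContainer@<storage-account>.dfs.core.windows.net/",
  mountPoint = "/mnt/blobTestContainer",
  extraConfigs = configs
) //after this, the parquet folder is reachable at "/mnt/blobTestContainer/emp"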
val pardf = spark.read.parquet("/mnt/blobTestContainer/emp") //load data frame from parquet file
pardf.createOrReplaceTempView("tempEmp") //create temporary view
val dfEmp = spark.sql("select * from tempEmp") //load data frame from sql view
After executing the notebook command, your data frame "dfEmp" will be ready.
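Optionally, a quick sanity check confirms the data frame loaded as expected:
dfEmp.printSchema() //optional: confirm the columns and types read from the parquet file
dfEmp.show(5, truncate = false) //optional: preview the first few rows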
Step 2:
Storage Settings to Write Data to Azure SQL Data Warehouse
Write the below code in the Spark notebook. The settings below are required to write the data.
val blobStorage = "xxxxxx.blob.core.windows.net" //This should be your fully qualified blob storage name
val blobContainer = "test" //Blob container is required for temporary storage of files; the data warehouse pulls data from this temp storage
val blobAccessKey = "cH/xxxxxxxxxxxFBEd3878QPZM9EoYPrkSXXXXXXXXXXXX/eg+rh/IfjxxxxxxxxF6A3BA=="
val tempDir = "wasbs://" + blobContainer + "@" + blobStorage + "/tempDirs" //temp dir storage location
val acntInfo = "fs.azure.account.key." + blobStorage
sc.hadoopConfiguration.set(acntInfo, blobAccessKey)
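Optionally, you can also register the same key at the session level and list the temp location to confirm access before the write; this is an extra check I'm adding, not part of the original steps, and whether it is needed can vary by Databricks runtime.
spark.conf.set(acntInfo, blobAccessKey) //optional: session-level copy of the account key; some runtimes want this in addition to sc.hadoopConfiguration
dbutils.fs.ls("wasbs://" + blobContainer + "@" + blobStorage + "/").foreach(println) //optional: list the container root to confirm the key works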
Step 3:
SQL Data Warehouse Related Settings
Write the below code in the Spark notebook. Make sure the username is a fully qualified name, meaning it is suffixed with @yourAzureWareHouseServerName.
The data warehouse server name should be suffixed with ".database.windows.net".
val dwDatabase = "dwTestDB" //This is the data warehouse database name
val dwServer = "xxxxx.database.windows.net" //This is the data warehouse server name
val dwUser = "dwuser@dwserver" //This is the username to connect to the database. You can get this from the Azure data warehouse properties
val dwPass = "XXXXX" //This is the password to connect to the database. You set the username and password while creating the Azure data warehouse
val dwJdbcPort = "1433"
val dwJdbcExtraOptions = "encrypt=true;trustServerCertificate=true;hostNameInCertificate=*.database.windows.net;loginTimeout=30"
val sqlDwUrl = "jdbc:sqlserver://" + dwServer + ":" + dwJdbcPort + ";database=" + dwDatabase + ";user=" + dwUser + ";password=" + dwPass + ";" + dwJdbcExtraOptions
val sqlDwUrlSmall = "jdbc:sqlserver://" + dwServer + ":" + dwJdbcPort + ";database=" + dwDatabase + ";user=" + dwUser + ";password=" + dwPass
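Optionally, before the full write you can validate the URL, credentials, and temp storage with a tiny read through the same connector; this sketch of mine pushes a trivial query through the warehouse using the blob temp directory from Step 2.
//Optional pre-flight check: run a trivial query through the connector to validate connectivity
val checkDf = spark.read.format("com.databricks.spark.sqldw").option("url", sqlDwUrlSmall).option("tempdir", tempDir).option("forward_spark_azure_storage_credentials", "True").option("query", "SELECT 1 AS connectivity_check").load()
checkDf.show()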
Step 4:
Code to Write Data to Azure SQL Data Warehouse
Write the below code to push data to the Azure SQL data warehouse. In the below code we write the data frame "dfEmp" to the warehouse. It will create the table "Emp" in the database by reading from tempDir, and it will automatically create the PolyBase connection underneath.
spark.conf.set("spark.sql.parquet.writeLegacyFormat","true")
dfEmp.write.format("com.databricks.spark.sqldw").option("url", sqlDwUrlSmall).option("dbtable","Emp").option("forward_spark_azure_storage_credentials","True").option("tempdir",tempDir).mode("overwrite").save()
After all these five steps you can connect to your Azure SQL data warehouse using SSMS (SQL Server Management Studio) and query the Emp table.
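You can also verify the result from the notebook itself by reading the new table back through the same connector; this is a sketch reusing the settings from Steps 2 and 3.
//Optional: read the newly created Emp table back through the connector and preview a few rows
val empReadBack = spark.read.format("com.databricks.spark.sqldw").option("url", sqlDwUrlSmall).option("tempdir", tempDir).option("forward_spark_azure_storage_credentials", "True").option("dbtable", "Emp").load()
empReadBack.show(10)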
Final Code
Follow all of the above settings, then copy the below code into a Spark notebook and execute it all together.
val pardf = spark.read.parquet("/mnt/blobTestContainer/emp") //load data frame from parquet file
pardf.createOrReplaceTempView("tempEmp") //create temporary view
val dfEmp = spark.sql("select * from tempEmp") //load data frame from sql view
val blobStorage = "xxxxxx.blob.core.windows.net" //This should be your fully qualified blob storage name
val blobContainer = "test" //Blob container is required for temporary storage of files; the data warehouse pulls data from this temp storage
val blobAccessKey = "cH/xxxxxxxxxxxFBEd3878QPZM9EoYPrkSXXXXXXXXXXXX/eg+rh/IfjxxxxxxxxF6A3BA=="
val tempDir = "wasbs://" + blobContainer + "@" + blobStorage + "/tempDirs" //temp dir storage location
val acntInfo = "fs.azure.account.key."+ blobStorage
sc.hadoopConfiguration.set(acntInfo, blobAccessKey)
val dwDatabase = "dwTestDB" //This is the data warehouse database name
val dwServer = "xxxxx.database.windows.net" //This is the data warehouse server name
val dwUser = "dwuser@dwserver" //This is the username to connect to the database. You can get this from the Azure data warehouse properties
val dwPass = "XXXXX" //This is the password to connect to the database. You set the username and password while creating the Azure data warehouse
val dwJdbcPort = "1433"
val dwJdbcExtraOptions = "encrypt=true;trustServerCertificate=true;hostNameInCertificate=*.database.windows.net;loginTimeout=30"
val sqlDwUrl = "jdbc:sqlserver://" + dwServer + ":" + dwJdbcPort + ";database=" + dwDatabase + ";user=" + dwUser + ";password=" + dwPass + ";" + dwJdbcExtraOptions
val sqlDwUrlSmall = "jdbc:sqlserver://" + dwServer + ":" + dwJdbcPort + ";database=" + dwDatabase + ";user=" + dwUser + ";password=" + dwPass
spark.conf.set("spark.sql.parquet.writeLegacyFormat","true")
dfEmp.write.format("com.databricks.spark.sqldw").option("url", sqlDwUrlSmall).option("dbtable","Emp").option("forward_spark_azure_storage_credentials","True").option("tempdir",tempDir).mode("overwrite").save()
I hope this code helps you. Leave a comment if you need any help or have any suggestions.
Happy coding.