Written by — Mika Heino, Data Architect
AI and Machine Learning tools are finally coming into the hands of Data Engineers and business users. This opens a multitude of possibilities! Could AI, for example, help us write better, more sustainable SQL or Python code in the future? Let's investigate the newest buzz in town, ChatGPT-3, and see how well it can write SQL and Terraform code.
At Recordly, we have strong expertise in how Machine Learning and AI can help create better data solutions, and I am blessed to have worked with one of our top ML engineers, Rasmus Toivanen. He is renowned for his research on Finnish NLP/ASR models and has been featured in the Finnish IT magazine Tivi (behind a paywall) for the same topic.
Rasmus has been a great help to me in my past Snowflake Machine Learning projects. After office hours, we've been testing the increasingly popular Stable Diffusion text-to-image solution at the office to generate images of us, Recordly employees. Recently, I've been using the Colab Dreambooth text-to-image generator to create images for my blogs. It's a great way to get interesting visuals without having to purchase a license from a stock image service. For example, the illustration on this blog is created with the help of the latest Stable Diffusion 2.0 from Stability AI.
The last time we caught up with Rasmus, we started discussing Werner Vogels' introduction of Neural Radiance Fields (NeRF) at the AWS re:Invent keynote. I've been fascinated with NeRFs ever since I saw them featured on the popular YouTube series Corridor Crew. I think NeRFs are an incredible advancement in the field of image processing technology, but I haven't yet figured out how NeRFs could help us with real-life issues.
Rasmus told me about an even more interesting new feature than NeRF, called OpenAI ChatGPT-3, which can actually help us right away. I was intrigued by ChatGPT-3 because Rasmus said it can write Terraform code so well that you can even set up a data platform with it. When I asked how it performs with SQL, the lingua franca of our data engineers, I was in for a surprise.
Before doing the actual testing, let's first understand what ChatGPT-3 is. OpenAI's ChatGPT belongs to Conversational AI (or CAI) solutions, which are part of artificial intelligence technology that enables machines and humans to have natural conversations.
Conversational AI combines natural language processing (NLP), machine learning (ML), and deep learning (DL) to build systems that can understand and respond to user input in a natural and human-like way. Conversational AI is used in applications such as chatbots, virtual assistants, and automated customer service systems.
With ChatGPT, you can have a realistic, two-way conversation that feels more like talking to a real person. What makes ChatGPT better than current CAI solutions is the usage of reinforcement learning. To train ChatGPT, the OpenAI team used a combination of reinforcement learning from human feedback (RLHF) and supervised fine-tuning. In the RLHF process, human trainers played both sides of a conversation and were given access to model-generated suggestions to help them compose their responses. The team then created a reward model for reinforcement learning by collecting comparison data consisting of ranked responses from the chatbot. This data was used to fine-tune the model using Proximal Policy Optimization.
Let's then try out what ChatGPT-3 can do. The values in Prompt are the real language I gave to ChatGPT-3, and the values in Reply are what ChatGPT-3 replied (along with my own comments about the quality of the reply).
Let's start with the easy stuff. I'll use the SQL syntax that I used in the Sustainable SQL with Jaromir Jagr blog and see how ChatGPT-3 performs.
"I have Snowflake table, Show me how to query the Average and Min values from Players table where I have attribute Player Height"
To query the average and minimum values of the Player Height attribute in the Players table in Snowflake, you can use the following SQL statement:
SELECT AVG(PlayerHeight) AS AvgPlayerHeight, MIN(PlayerHeight) AS MinPlayerHeight FROM Players;
This statement will return the average and minimum values for Player Height in the Players table. You can adjust the query as needed to include any additional criteria or to include other attributes from the table. For example, if you only want to consider players who are taller than a certain height, you could add a WHERE clause to the query:
SELECT AVG(PlayerHeight) AS AvgPlayerHeight, MIN(PlayerHeight) AS MinPlayerHeight FROM Players WHERE PlayerHeight > X;
where X is the minimum height you want to consider.
Not bad. Not bad at all. I especially like the CAPITAL notation and that ChatGPT even gives me further instructions on how to use a WHERE clause for further filtering.
Let's try SQL which needs grouping.
"I have Snowflake table Players which has attributes for Player Name, Player time on ice, Player team name. Can you create me a SQL query that shows the player from each team who has the max time on ice on each team?"
To create a SQL query that shows the player from each team who has the maximum time on ice for their team in the Players table in Snowflake, you can use the following statement:
SELECT PlayerName, PlayerTimeOnIce, PlayerTeamName FROM Players WHERE PlayerTimeOnIce = (SELECT MAX(PlayerTimeOnIce) FROM Players WHERE PlayerTeamName = Players.PlayerTeamName);
This query will return the player name, time on ice, and team name for each player who has the maximum time on ice for their team. You can adjust the query as needed to include other attributes or to add additional criteria. For example, if you only want to consider players who have played a certain number of games, you could add a HAVING clause to the query.
Again, not bad, even though my question was somewhat vague and could be understood in multiple ways. The SQL works, and ChatGPT again gives suggestions on how to improve the query.
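For comparison, here is how I would typically write the same query myself in Snowflake, using a window function and the QUALIFY clause. This is my own sketch, not ChatGPT's output, assuming the same table and column names as in the prompt; if two players on the same team tie for the maximum, you would want RANK instead of ROW_NUMBER.
-- My own sketch: pick the player with the max time on ice per team
SELECT PlayerName, PlayerTimeOnIce, PlayerTeamName
FROM Players
QUALIFY ROW_NUMBER() OVER (PARTITION BY PlayerTeamName ORDER BY PlayerTimeOnIce DESC) = 1;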
How about administrative SQL?
"I have a Snowflake table called users, I need to clone it a make a copy of it. Whats the SQL command for doing this?"
To create a copy of the users table in Snowflake, you can use the CREATE TABLE statement with the LIKE clause. The basic syntax for this statement is:
CREATE TABLE users_copy AS SELECT * FROM users;
This will create a new table called new_table_name with the same schema as the users table. You can then use the INSERT statement with a SELECT query to copy the data from the users table into the new table:
INSERT INTO new_table_name SELECT * FROM users;
This will insert all rows from the users table into the new new_table_name table. You can adjust the query as needed to include only certain columns or to add additional criteria for the data to be copied. For example, if you only want to copy users who have a certain attribute value, you could add a WHERE clause to the SELECT query:
INSERT INTO new_table_name
SELECT * FROM users
WHERE attribute = value;
where attribute is the name of the attribute you want to filter on, and value is the value you want to consider.
This works as well, though I was expecting the usage of the CLONE command, a Snowflake specialty. I was also surprised by the usage of the LIKE clause instead of AS SELECT *. ChatGPT again goes beyond the original question and instructs on a proper INSERT INTO clause with filtering options.
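For reference, the zero-copy clone I was expecting would look roughly like this in Snowflake. This is my own example, not ChatGPT's; the name of the copy is just illustrative.
-- My own sketch: a Snowflake zero-copy clone of the users table
CREATE TABLE users_copy CLONE users;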
ChatGPT-3 can handle simple SQL queries, so let's try something more specific.
"I have Snowflake table which contains deliveries per month for months between 012022 to 122022. I need to PIVOT this table, how to do this?"
To pivot a table in Snowflake, you can use the PIVOT function. The basic syntax for this function is:
SELECT * FROM table_name PIVOT ( aggregation_function(column_to_aggregate) FOR pivot_column IN (list_of_values) );
where table_name is the name of the table you want to pivot, aggregation_function is the function you want to apply to the data (such as SUM or COUNT), column_to_aggregate is the column you want to aggregate, and pivot_column is the column you want to pivot on. The list_of_values should be a comma-separated list of the values you want to include in the pivot.
For example, if you have a table called deliveries with columns month, product, and quantity, and you want to pivot the data on the month column and sum the quantities for each product, you could use the following query:
SELECT * FROM deliveries PIVOT ( SUM(quantity) FOR month IN ('012022', '022022', '032022', '042022', '052022', '062022', '072022', '082022', '092022', '102022', '112022', '122022') );
This will pivot the data on the month column and return the total quantity for each product for each month. You can adjust the query as needed to include different columns or to use different aggregation functions.
Yep. Even though I didn't give ChatGPT enough information, it managed to create valid PIVOT SQL. This is looking pretty good. With these kinds of capabilities, you could easily build a prompter that helps with SQL writing.
Consider, for example, that you have to PIVOT the data but do not remember the syntax. Instead of going into the documentation, the prompter would build you a proper example of how to PIVOT the data.
Let's try something that is always needed in data projects. Could I create the necessary calendar using ChatGPT-3?
"Create me Snowflake table D_CALENDAR -dimension table for joining with YYYY, MM-DD-YYYY and Day of the week attributes and insert few values into it"
To create a Snowflake table called D_CALENDAR with attributes for year, date, and day of the week, you can use the following CREATE TABLE statement:
CREATE TABLE D_CALENDAR ( YYYY INT, MM_DD_YYYY DATE, DAY_OF_WEEK VARCHAR(20) );
This will create a table with three columns for year, date, and day of the week. You can then use the INSERT statement to add some data to the table:
INSERT INTO D_CALENDAR (YYYY, MM_DD_YYYY, DAY_OF_WEEK) VALUES (2022, '01-01-2022', 'Saturday'), (2022, '01-02-2022', 'Sunday'), (2022, '01-03-2022', 'Monday'), (2022, '01-04-2022', 'Tuesday'), (2022, '01-05-2022', 'Wednesday'), (2022, '01-06-2022', 'Thursday'), (2022, '01-07-2022', 'Friday');
This will insert seven rows of data into the D_CALENDAR table for the first week of January 2022. You can adjust the INSERT statement as needed to include different data or to add additional rows.
That's D_CALENDAR, or at least the start of it. ChatGPT doesn't know why I need it (it doesn't have an understanding of the context), but it created a proper DDL clause.
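As a side note from me (not ChatGPT): in a real project I would populate a calendar like this with Snowflake's row generator instead of typing the rows by hand. Below is a minimal sketch against the same D_CALENDAR table, assuming one year of dates is enough; note that DAYNAME returns abbreviated names such as 'Sat' rather than 'Saturday'.
-- My own sketch: generate 365 consecutive dates starting from 2022-01-01
-- and derive the year and weekday for each of them
INSERT INTO D_CALENDAR (YYYY, MM_DD_YYYY, DAY_OF_WEEK)
SELECT YEAR(d), d, DAYNAME(d)
FROM (
    SELECT DATEADD(day, ROW_NUMBER() OVER (ORDER BY SEQ4()) - 1, '2022-01-01'::DATE) AS d
    FROM TABLE(GENERATOR(ROWCOUNT => 365))
) AS gen;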
Let's try asking the question in a different format.
"If I'm building a data warehouse solution, what's the best modeling option on getting the weekday (Mon, Tue) of a certain date in history?"
To get the weekday (e.g. Monday, Tuesday) for a certain date in a data warehouse solution, you can use a dimension table that contains the mapping between the numeric day of the week and the corresponding weekday name.
In a data warehouse, a dimension table is a table that contains descriptive attributes of the entities in the data warehouse, such as dates, product categories, or customer demographics. Dimension tables are typically used in combination with fact tables, which contain the quantitative data that is used to measure business performance.
To create a dimension table for the weekdays, you can use the following CREATE TABLE statement:
CREATE TABLE dim_weekdays ( dow INT, weekday VARCHAR(20) );
This will create a dimension table called dim_weekdays with two columns: dow for the numeric day of the week (1-7, where 1 is Monday and 7 is Sunday) and weekday for the corresponding weekday name. You can then use the INSERT statement to add the weekday names to the table:
INSERT INTO dim_weekdays (dow, weekday) VALUES (1, 'Monday'), (2, 'Tuesday'), (3, 'Wednesday'), (4, 'Thursday'), (5, 'Friday'), (6, 'Saturday'), (7, 'Sunday');
This will insert seven rows into the dim_weekdays table, one for each day of the week. You can then use a JOIN operation to link this table to a fact table that contains the dates for which you want to get the weekdays.
For example, if you have a fact table called sales with a column date that contains the dates of the sales transactions, you could use the following query to get the weekday for each transaction:
SELECT s.date, w.weekday FROM sales s JOIN dim_weekdays w ON s.date_part('isodow', s.date) = w.dow;
This will return the date and weekday for each transaction in the sales table, using the dim...
Holy moly. ChatGPT does know what a D_CALENDAR is, but it doesn't have the knowledge to link D_CALENDAR into the proper context here. Still, this is a roughly correct answer, and this should be general knowledge for Data Engineers working on Data Warehouse projects.
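To make that linking explicit, here is how I would join a fact table to the D_CALENDAR dimension created earlier. This is my own sketch, not ChatGPT's output; the sales table and its sale_date column are made up for the example.
-- My own sketch: look up the weekday for each sale from the calendar dimension
SELECT s.sale_date, c.DAY_OF_WEEK
FROM sales s
JOIN D_CALENDAR c
  ON s.sale_date = c.MM_DD_YYYY;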
Now, let's ask something that most of the Data Engineers I know might not know. I'm expecting a certain answer to this question.
"In the Snowflake database, what is the best table type for staging tables?"
In Snowflake, the best table type for staging tables is the TRANSIENT table. Transient tables are temporary tables that are automatically dropped when the session that created them ends. They are the ideal choice for staging tables because they do not consume storage space in the Snowflake account and are automatically cleaned up when they are no longer needed.
To create a transient table in Snowflake, you can use the CREATE TRANSIENT TABLE statement. The basic syntax for this statement is:
CREATE TRANSIENT TABLE table_name ( column1 datatype, column2 datatype, ... );
where table_name is the name of the table you want to create and column1, column2, etc. are the columns in the table along with their data types. For example, to create a transient table called staging_data with two columns, col1 and col2, you could use the following statement:
CREATE TRANSIENT TABLE staging_data ( col1 INT, col2 VARCHAR(100) );
This will create a transient table called staging_data with two columns, col1 and col2, of the specified data types. You can then use the INSERT statement to load data into the table. When the session ends, the table and its data will be automatically dropped.
Now I'm mind-blown. This is the answer I was looking for, and it's even better: ChatGPT even tells why the TRANSIENT table is the best option.
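As a side note of my own: in practice, the reason transient tables suit staging is that they skip Snowflake's Fail-safe storage, and for data you can always reload you can also turn Time Travel off. Here is a minimal sketch of how I would define such a staging table; the table and column names are made up for the example.
-- My own sketch: a transient staging table with Time Travel set to zero days
CREATE TRANSIENT TABLE stg_orders (
    order_id INT,
    payload  VARIANT
)
DATA_RETENTION_TIME_IN_DAYS = 0;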
Let's move on to Terraform and start with the basics. Here I am giving ChatGPT quite a lot of information, such as the general knowledge of TCP port 1433.
"Create me a Terraform script to set up AWS network security groups and deploy the MSSQL database inside the network security group. Open 1433 firewall for MSSQL database"
Here is a simple example of a Terraform script that can create an AWS network security group and deploy an MSSQL database inside it.
# Configure the AWS provider
provider "aws" {
  region = "us-east-1"
}

# Create a new VPC
resource "aws_vpc" "my_vpc" {
  cidr_block = "10.0.0.0/16"
}

# Create a new network security group
resource "aws_security_group" "mssql_sg" {
  name        = "mssql-sg"
  description = "Security group for MSSQL database"
  vpc_id      = aws_vpc.my_vpc.id

  # Open the MSSQL database port
  ingress {
    from_port   = 1433
    to_port     = 1433
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

# Create a new MSSQL database instance
resource "aws_db_instance" "mssql_db" {
  engine                  = "sqlserver-ex"
  engine_version          = "14.00.3223.3.v1"
  instance_class          = "db.t2.micro"
  name                    = "my_mssql_db"
  username                = "mssqluser"
  password                = "mssqlpass"
  storage_type            = "gp2"
  storage_encrypted       = true
  publicly_accessible     = true
  vpc_security_group_ids  = [aws_security_group.mssql_sg.id]
  backup_retention_period = 7
}
It's been a few years since I have written Terraform (I try to favor SaaS as much as possible), but to my eye this looks promising. There's the necessary VPC, security group, and database server.
Let's try something else. Could I set up a simple Linux server for testing out PyTorch?
"Create me an AWS Terraform script for deploying Linux virtual machine. Inside the Linux, I want to install the latest Python version and Pytorch and create the user pythontestuser"
Here is a simple example of a Terraform script that can create an AWS Linux virtual machine, install the latest version of Python and PyTorch, and create a new user named pythontestuser.
# Configure the AWS provider
provider "aws" {
  region = "us-east-1"
}

# Create a new Linux AMI
data "aws_ami" "ubuntu" {
  most_recent = true

  filter {
    name   = "name"
    values = ["ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-*"]
  }

  filter {
    name   = "virtualization-type"
    values = ["hvm"]
  }

  owners = ["099720109477"] # Canonical
}

# Create a new EC2 instance
resource "aws_instance" "web" {
  ami           = data.aws_ami.ubuntu.id
  instance_type = "t2.micro"

  # Install the latest version of Python and PyTorch
  user_data = <<-EOF
    #!/bin/bash

    # Install Python and PyTorch
    apt-get update
    apt-get install -y python3-pip
    pip3 install torch torchvision

    # Create a new user
    useradd pythontestuser
  EOF
}
Ok, it's again been a few years since I logged onto any Debian variant, but those look like proper apt-get commands.
As a final question: can I finally focus on creating data products instead of creating data pipelines?
"Create me an example data pipeline using using Azure Data Factory"
Here is a simple example of an Azure Data Factory data pipeline that copies data from a CSV file stored in Azure Blob Storage to a table in Azure SQL Database.
First, create a new Azure Data Factory instance and give it a name. Then, create a new pipeline and add the following steps:
- Get CSV file from Azure Blob Storage: Use the Get metadata and Copy data activities to retrieve the CSV file from Azure Blob Storage and store it in an Azure Data Lake Storage Gen2 account.
- Extract data from CSV file: Use the Extract data activity to extract the data from the CSV file and store it in a structured format in Azure Data Lake Storage Gen2.
- Load data into Azure SQL Database: Use the Copy data activity to copy the structured data from Azure Data Lake Storage Gen2 and load it into a table in Azure SQL Database.
Here is the code for the pipeline:
I didn't even bother to paste the code for the pipeline, because you can already see that ChatGPT gives the correct answers for creating a really simple data pipeline.
What I can say right away is that if solutions like ChatGPT-3 were added as helping tools for business users or Data Engineers, we could get data pipelines created much faster, and Data Engineers could spend their time solving actual problems with the data. The results that ChatGPT gives are not always 100% right, but they don't need to be, as they point in the right direction. The bigger question, though, is whether ChatGPT gives correct answers. I knew the answers I was looking for, but Data Engineers at the start of their careers might not.
This has also raised concerns elsewhere, as ChatGPT has apparently already been so popular that Stack Overflow has temporarily banned its usage for producing coding-related answers, and OpenAI's ChatGPT is already seen as a rival to Google's search.
At the pace AI solutions are being created, I expect that by 2025 we will have AI solutions more and more embedded into data tools as well.
Conversational AI is set to revolutionize the way data engineers work in the future. By giving data engineers the ability to ask questions and receive answers in natural language, they can quickly access and analyze data.
Moreover, conversational AI can be used to automate mundane tasks, freeing up data engineers to focus on more strategic projects and enabling them to work more efficiently and effectively.
With conversational AI, data engineers will be able to create more powerful data models and insights, transforming the way they work and paving the way for a brighter future. This is just in the field of Data Engineering.
Let's pose this question to ChatGPT and see what other answers we can get.
Yes, OpenAI provides a free site where you can try out its ChatGPT technology at chat.openai.com. All you need to do is create an account and you're ready to start conversing with it. Since ChatGPT is a conversational AI, you'll need to ask it the right questions to get the best results.
So - here comes the more intriguing question, did I write this blog myself or not? The right answer is that I created the first version and structure, but I asked ChatGPT to rewrite it and make it better. After that, I modified the text again. Even this sentence isn't completely my work - you can tell by the lack of Finnish language structure. This is just one example of how solutions like ChatGPT can help in the future. Whether you need help writing good SQL, working with Python script syntax, or simply creating more readable blog text, ChatGPT can help you out.
The use of AI tools such as ChatGPT and the currently very popular art generators, from DALL·E 2 to Lensa, does not come without issues. Far from it. Our team has been doing some testing with them, and we will be sharing our experiences and thoughts on the matter later. If you're interested in learning more about what you should take into consideration when (or if) using these tools, sign up: