DataRockie

เริ่มเรียน Data Science ง่ายๆด้วยตัวเอง 🍟

ก้าวแรกสู่โลก Machine Learning กับ Gradient Descent

February 6, 2026

tutorial วันนี้แอดอธิบายและสอนเขียนฟังชั่น implement Gradient Descent ใน R เพื่ออัพเดทค่า intercept และ slope ของ simple linear regression มาทบทวน concept ของ SLR กันก่อน

y_hat = mx + c where m = slope and c = constant (i.e. y intercept)

SLR คือสมการเส้นตรงง่ายๆนี่เอง เช่น y = 0.5x + 2 ที่ลากตัดผ่าน (ทำนาย) ข้อมูลของเราได้แม่นยำที่สุด โดย x คือตัวแปรต้นและ y คือตัวแปรตาม เราสามารถวัดความแม่นยำของโมเดลด้วยค่า error แบบต่างๆ เช่น mse, mae, rmse และ r-squared บทความวันนี้จะอธิบายการคำนวณ mse เป็นหลัก

Intro to Gradient Descent

Gradient Descent คือ algorithm ยอดนิยมที่ data scientist ใช้อัพเดทค่า weights ใน machine learning models ต่างๆ ตั้งแต่ linear regression, logistic regression, neural networks ไปจนถึง deep learning

Gradient Descent แปลว่า slope ที่ลาดลง i.e. ไหลลงด้านล่าง

เราสามารถประยุกต์ใช้ gradient descent ได้กับทุกปัญหา supervised regression & classification ด้วยไอเดียของ algorithm ที่เข้าใจง่าย และ implement ไม่ยากเมื่อเทียบกับเทคนิคอื่นๆ

gradient descent ช่วยพาเราลงไปที่จุด ideal (best effort) แกนนอน theta แทนค่า weight ของโมเดล

โดยทั่วไปเราจะวัด performance ของโมเดลเราด้วยค่า error หรือศัพท์เทคนิคทาง ML คือ Cost Function โดย gradient descent จะช่วยอัพเดทค่า weights ในโมเดลของเราให้ cost function ลดต่ำลงไปที่จุดต่ำสุด (ideal position) ภาษาอังกฤษอธิบายสั้นๆว่า “Gradient descent helps minimize cost function.“

cost function ของ linear regression ที่เรานิยมใช้คือ MSE ย่อมาจาก mean squared error

# MSE calculation in R
mse <- (1/n) * sum((y_hat - y) ** 2)

# equivalent to this formula
mse <- mean((y_hat - y) ** 2)

เราต้องยกกำลังสองค่า error เสมอเพื่อทำให้ error ทั้งหมดมีค่าเป็นบวกก่อนที่จะหาค่าเฉลี่ยด้วยฟังชั่น mean() และเป็นที่มาของชื่อ MSE – mean squared error

ปล. ถ้าใครเคยเรียนคอร์ส Machine Learning กับ Andrew Ng ในคอร์สนั้นเค้าสอนให้ใช้ cost function แบบนี้ (1/2n) * sum((y_hat - y) ** 2) ด้วยเหตุผลทางคณิตศาสตร์ i.e. เราจะดิฟสมการนี้ออกมาหน้าตาดูง่ายขึ้น

How to Lower Cost?

พอคำนวณ mse เสร็จแล้ว ก็ถึงเวลาของ gradient descent มาจัดการค่า weights ของโมเดล i.e. จะปรับ intercept และ slope ยังไงให้ mse ลดลงไปที่จุดต่ำสุด i.e. minimize cost

math ที่เราใช้ในการ implement gradient descent คือแคลคูลัสพื้นฐานเรื่อง Partial Derivatives หรือที่เรียกสั้นๆว่า “ดิฟ” ภาษาคนคือการหา slope ว่าถ้าตัวแปร x เปลี่ยนแล้ว cost function จะเปลี่ยนเท่าไร?

ถ้าเรารู้ slope ของตัวแปรต่างๆในโมเดลที่ส่งผลต่อ cost function เราสามารถใช้ค่านี้ไปอัพเดท intercept และ slope ของ SLR ด้วยสมการในรูปด้านล่าง ตรงนี้ต้องใช้ความรู้แคลคูลัสนิดหน่อย Andrew Ng บอกว่านักเรียน (ที่ยังไม่เก่งแคลเท่าไร) สามารถจำสูตรด้านล่างเลยก็ได้เพราะใช้บ่อยมาก

Learning Rate

นอกจาก gradient ที่ได้จากการดิฟ cost function ยังมีอีกหนึ่ง parameter ที่เราต้อง optimize ตอนเทรน gradient descent algorithm นั้นคือ Learning Rate หรือเรียกสั้นๆว่า alpha ในงานวิจัย machine learning

# update weights using alpha and gradient
c_new <- c - alpha * (1/n) * sum(y_hat - y)
m_new <- m - alpha * (1/n) * sum((y_hat - y) * x)

ถ้าเราเลือก learning rate ต่ำเกินไป gradient descent จะใช้เวลานานกว่าที่โมเดลจะ converged i.e. ลงมาที่จุดต่ำสุดของ cost function แต่ถ้าเราเลือก learning rate สูงเกินไปมีโอกาสที่จะเกิดปัญหา overshoot คือโมเดลจะไม่สามารถเดินทางไปที่จุดต่ำสุดได้เลย (เราอธิบายปัญหา overshoot ใน appendix ของบทความนี้)

small learning rate จะใช้เวลานานในการ train กว่าจะลงมาถึงจุดต่ำสุดของ cost function

การเลือก learning rate ส่งผลโดยตรงกับ performance ของ gradient descent algorithm ปกติเราจะเห็นคนกำหนด alpha ในช่วง 0.001 – 0.010 แต่ไม่ได้มีกฏตายตัว

Implementation

เราสามารถ implement gradient descent ได้ใน 4 ขั้นตอน

แรนดอมค่า intercept และ slope
คำนวณ MSE ของโมเดลที่ใช้ intercept และ slope นี้
ดิฟหา gradient เพื่อปรับค่า intercept และ slope ให้ MSE ต่ำลง
ทำซ้ำ step 2-3 ไปเรื่อยๆจนกว่าโมเดลจะ converged หรือครบรอบ iterations ที่เรากำหนด

# function template to implement gradient descent in R
gradDesc <- function(df, x, y, alpha = 0.01, max_iter = 100000){

  # step1 - initialize random weights
  # step2 - compute cost function
  # step3 - update weights with gradient
  # step4 - loop through step 2-3 until converged

}

ฟังชั่นที่เราจะเขียนชื่อว่า gradDesc() มีทั้งหมด 5 arguments โดย alpha = 0.01 และ max_iter = 100000 เป็น default arguments ที่เราตั้งค่าไว้แล้ว แต่ตอนเทรนจริงสามารถปรับได้ตามความเหมาะสม โดยทั่วไป alpha ยิ่งต่ำ เราจะเพิ่มรอบ max_iter ให้เวลา algorithm ทำงานนานขึ้นในการหาจุด convergence

arguments	ความหมาย
`df`	data frame
`x`	ตัวแปร x หรือ predictor
`y`	ตัวแปร y หรือ target variable ที่เราต้องการทำนาย
`alpha`	learning rate ที่เราใช้ในการอัพเดทค่า weights
`max_iter`	จำนวนรอบสูงสุด (iterations) ที่เราให้ gradient descent อัพเดทค่า weights ในโมเดล

Full R Code

ตัวอย่างวันนี้ใช้ built-in data frame ชื่อว่า mtcars รูปด้านล่างคือ output ที่ได้จากการ implement gradient descent ด้วยฟังชั่น gradDesc() default learning rate = 0.01 และ max iterations = 100,000

`gradDesc()` จะหาค่า optimal intercept และ slope ที่ทำให้ค่า mse ต่ำที่สุด

code line 7 เราใช้ฟังชั่น scale() เพื่อ standardize ตัวแปร x อันนี้เป็นเทคนิคที่ช่วยให้ gradient descent ทำงานได้เร็วขึ้นมาก

code line 21-28 เราเขียน while loop เพื่ออัพเดทค่า weights ด้วย alpha * gradient จนกว่าโมเดลจะ converged หรือครบรอบ max_iter ที่เรากำหนด

code line 45 เราทดสอบฟังชั่นที่เราเขียนเพื่อหาค่า optimal weights สำหรับโมเดล mpg = f(wt) ซึ่งค่า intercept และ slope ที่ได้จะมีค่าใกล้เคียงกับการรันฟังชั่น lm(mpg ~ scale(wt), data = mtcars)

	# gradient descent function takes five arguments
	gradDesc <- function(df, x, y, alpha = 0.01, max_iter = 100000){

	# scale x for faster training
	start_time <- proc.time()
	n <- nrow(df)
	x <- as.vector(scale(df[[x]]))
	y <- df[[y]]
	plot(x, y, pch = 16)

	# initialize random weights
	m <- runif(1,0,1)
	c <- runif(1,0,1)
	yhat <- m * x + c
	mse <- (1/n) * sum((y – yhat) ** 2)
	converged <- F
	iteration <- 0

	# update weight using GD algorithm
	cat("=== Implementing gradient descent algorithm ===")
	while(converged == F){
	iteration <- iteration + 1
	m_new <- m – alpha * (1/n) * sum((yhat – y) * x)
	c_new <- c – alpha * (1/n) * sum(yhat – y)
	m <- m_new
	c <- c_new
	yhat <- m * x + c
	mse_new <- (1/n) * sum((y – yhat) ** 2)

	# if iteration hits max_iter, program ends
	if(iteration == max_iter){
	converged <- T
	abline(c, m)
	return(cat("\nOptimal intercept:", c,
	"\nOptimal slope:", m,
	"\nIteration:", iteration,
	"\nFinal MSE:", mse_new,
	"\nTime for training:",
	(proc.time() – start_time)[1], "seconds."))
	}
	}
	}

	# test function x=wt, y=mpg
	gradDesc(mtcars, x = "wt", y = "mpg", alpha = 0.001, max_iter = 100000)

view raw gradDesc.R hosted with ❤ by GitHub

List of Functions

สรุปฟังชั่นทั้งหมดที่เราจะใช้ใน tutorial นี้เป็น base R ทั้งหมดเลย ไม่จำเป็นต้องลง package อะไรเพิ่มเติม ถ้าเพื่อนๆคุ้นเคยกับ control flow เช่น if กับ while อยู่แล้ว การ implement gradient descent ใน R ง่ายนิดเดียว

ชื่อฟังชั่น	ใช้ทำอะไร?
`runif()`	สุ่มตัวเลข random number มีค่าอยู่ระหว่าง min = 0 max = 1
`if`	`control flow` ทดสอบเงื่อนไขใน R
`while`	`control flow` กำหนดจำนวน iterations ของโปรแกรม ในตัวอย่างวันนี้คือ `max_iter`
`scale()`	standardize ตัวแปรด้วยสูตร `(x - mean(x)) / sd(x)`
`plot()`	สร้าง scatter plot ง่ายๆ
`abline()`	สร้างสมการเส้นตรงด้วยค่า intercept และ slope ซ้อนบน scatter plot อีกที
`return()`	return output จากฟังชั่นที่เราเขียน และ terminate โปรแกรม
`cat()`	แสดงผล text output ที่เราต้องการใน console
`proc.time()`	จับเวลาโปรแกรมของเรา

Summary

Data scientist ใช้ gradient descent algorithm อัพเดทค่า weights ในโมเดล โดยปัจจัยที่ส่งผลโดยตรงต่อ performance ของ algorithm นี้คือ learning rate – how fast the algorithm learns to solve a problem

References

Implementing the Gradient Descent Algorithm in R by S. Richter-Walsh (12 Feb 2017)
Machine Learning 101: An Intuitive Introduction to Gradient Descent by Thalles Silva (7 March 2018)

Appendix – Overshooting Problem

สมมติ current cost = 10, learning rate = 4 (assume gradient = 1) และ cost ต่ำสุดที่เป็นไปได้คือ 0 ในตัวอย่างนี้ไม่มีทางเลยที่ cost จะลงไปถึงจุดต่ำสุด ถ้าเรารัน gradient descent 2 iterations → cost จะลดลงเหลือ 10 – 4 – 4 = 2 แต่ถ้าเราปรับ learning rate ลดลงเหลือ 2 โมเดลเราจะ converged ภายใน 5 iterations

2 responses to “ก้าวแรกสู่โลก Machine Learning กับ Gradient Descent”

Anonymous

October 9, 2020

i.e. ไหลลงด้านล่าง i.e คืออะไรคะแอดทอย

Reply
1. Kasidis Satangmongkol
  
  October 9, 2020
  
  ไหลลงล่าง เพราะต้องการหาทิศทางที่ทำให้ cost ต่ำสุดครับผม
  
  Reply