5.4 Capstone Briefing

Time to bring everything together: you will receive a "messy" CSV file and must turn it into reliable data in three systems, along with a type contract.

Time required: ~15 minutes (to read this briefing)

Project Overview

You are a Data Type Translator: the person who takes a messy CSV from another team and turns it into correct, reliable data in Google Sheets, Python (pandas), and SQL.

Input file: `messy_students.csv`

This CSV file contains the following problems (on purpose):

Problem	Example	Should be
Leading zeros lost	`student_id = 1234`	`001234`
Mixed date formats	`15/01/2025`, `2025-01-15`, `Jan 15, 2025`	`2025-01-15` (ISO)
Empty, null, and zero mixed together	`""`, blank field, `0`	Distinguished correctly
Numbers that look like text	`"2,500.00"` (contains a comma)	`2500.00` (numeric)
Mixed phone formats	`0812345678`, `+66812345678`, `081-234-5678`	One chosen format
Mixed boolean values	`TRUE`, `yes`, `1`, `Y`	One chosen format
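The first problem in the table is easy to reproduce. The sketch below uses a tiny inline CSV (illustrative data, not the real file) to show pandas inferring an integer column by default, which drops the zeros, while an explicit `string` dtype keeps the characters as written.

```python
import io

import pandas as pd

# Two-row inline CSV standing in for messy_students.csv (illustrative only).
raw = "student_id,name\n001235,Somsri Deejai\n1236,Somying Sabaidi\n"

# Default inference: the column looks numeric, so pandas stores integers.
inferred = pd.read_csv(io.StringIO(raw))

# Explicit string dtype: the characters in the file are kept as written.
as_text = pd.read_csv(io.StringIO(raw), dtype={"student_id": "string"})

# zfill pads IDs whose zeros were already lost before the file reached you.
padded = as_text["student_id"].str.zfill(6)

print(inferred["student_id"].tolist())  # [1235, 1236]
print(as_text["student_id"].tolist())   # ['001235', '1236']
print(padded.tolist())                  # ['001235', '001236']
```

Reading as text first and padding afterwards handles both cases: IDs that still have their zeros and IDs that lost them upstream.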

Sample Data

student_id,name,phone,price,enrolled_on,active,note
1234,Somchai Jaidee,0812345678,"2,500.00",15/01/2025,yes,
001235,Somsri Deejai,+66812345679,1800.50,2025-01-16,TRUE,นักเรียนทุน
1236,Somying Sabaidi,,0,Jan 17 2025,1,
001237,Somkid Jaijing,081-234-5680,"3,200.00",2025-01-18,Y,note ปกติ
001238,Sompong Jaiyen,"","1,500.00",19/01/2025,yes,""
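The sample rows above contain a blank field, a quoted empty string (`""`), and a literal `0`. As a rough illustration of why these need an explicit decision in the type contract, the sketch below loads three of the rows (reduced to four columns) and shows that pandas, with default settings, reads both kinds of empty field as missing while `0` stays a real value. The variable names are illustrative.

```python
import io

import pandas as pd

# Three of the sample rows, reduced to the columns that matter here.
raw = (
    "student_id,phone,price,note\n"
    "1236,,0,\n"
    '001238,"","1,500.00",""\n'
    "001235,+66812345679,1800.50,นักเรียนทุน\n"
)

# Default settings: the blank field and the quoted "" both become missing.
df = pd.read_csv(io.StringIO(raw), dtype="string")
print(df["phone"].isna().tolist())   # [True, True, False]

# A price of "0" is a real value, not a missing one.
print(df["price"].isna().tolist())   # [False, False, False]

# keep_default_na=False keeps empty fields as empty strings instead.
df_raw = pd.read_csv(io.StringIO(raw), dtype="string", keep_default_na=False)
print(df_raw["phone"].tolist())      # ['', '', '+66812345679']
```

Either way, pandas does not tell the blank field and `""` apart here, so whether an empty phone means "unknown" or "none" is something the contract has to state.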

Deliverables (what you must submit)

Cleaned Google Sheet
- Every column has the correct data type
- student_id is 6-character text (leading zeros kept)
- price is a number with no commas
- enrolled_on uses a single date format
- active is TRUE/FALSE
- Data validation rules are set
Cleaned pandas Notebook
- Read the CSV with the correct dtype mapping
- Clean every column to match the type contract
- Run the import checklist (5 checks) and print the results
- Export a cleaned CSV
SQL Schema + Load Script
- CREATE TABLE statement with correct types and constraints
- INSERT statements or a COPY command to load the data
- Verification queries (row count, nulls, totals)
Type Contract Memo
- A one-page table listing each column's type, required/nullable status, and constraints
- Same format as the example in lesson 5.1
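One possible way to draft the memo is to write the contract down as plain data first and render the table from it. The sketch below is only an illustration: the column names, types, and constraints mirror the CREATE TABLE statement later in this briefing, but the dictionary layout and the `render` helper are assumptions and are not the lesson 5.1 format.

```python
# Hypothetical type contract for messy_students.csv, written as plain data.
# Types and constraints mirror the CREATE TABLE statement in this briefing.
CONTRACT = [
    {"column": "student_id", "type": "text(6)", "nullable": False,
     "constraint": "6 digits, leading zeros kept"},
    {"column": "name", "type": "text(100)", "nullable": False, "constraint": ""},
    {"column": "phone", "type": "text(20)", "nullable": True,
     "constraint": "one chosen format"},
    {"column": "price", "type": "numeric(10,2)", "nullable": False,
     "constraint": "price >= 0, no commas"},
    {"column": "enrolled_on", "type": "date", "nullable": False,
     "constraint": "ISO 8601 (YYYY-MM-DD)"},
    {"column": "active", "type": "boolean", "nullable": False,
     "constraint": "TRUE/FALSE, default TRUE"},
    {"column": "note", "type": "text", "nullable": True, "constraint": ""},
]


def render(contract):
    """Return the contract as tab-separated lines, one per column."""
    header = "column\ttype\trequired\tconstraint"
    rows = [
        f"{c['column']}\t{c['type']}\t"
        f"{'nullable' if c['nullable'] else 'required'}\t{c['constraint']}"
        for c in contract
    ]
    return "\n".join([header, *rows])


print(render(CONTRACT))
```

Keeping the contract as data means the same list could later drive checks in the notebook, although that step is not required by the brief.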

Grading Criteria (Rubric)

Criterion	Points	Details
Row counts match	20%	Row counts in Sheets, pandas, and SQL are identical
IDs preserved	15%	`student_id` keeps its leading zeros, 6 digits, in every system
Dates correct	20%	Every date is ISO 8601 (YYYY-MM-DD) in every system
Numeric types correct	20%	`price` has no rounding error and no commas, and the totals match
Type contract complete	25%	Covers every column, with type, nullable, and constraints specified
Total	100%
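The measurable rubric rows can be checked mechanically before you submit. The sketch below is one hedged way to do that in pandas. The `cleaned` frame is hypothetical stand-in data, `rubric_checks` is an illustrative helper, and these four checks are derived from the rubric rather than being the five-item import checklist from the earlier lesson.

```python
from decimal import Decimal

import pandas as pd

# Hypothetical cleaned frame; in the capstone this would be your own output.
cleaned = pd.DataFrame({
    "student_id": ["001234", "001235", "001236"],
    "price": [Decimal("2500.00"), Decimal("1800.50"), Decimal("0")],
    "enrolled_on": pd.to_datetime(["2025-01-15", "2025-01-16", "2025-01-17"]),
})


def rubric_checks(df, expected_rows):
    """Return a dict of check name -> bool for the measurable rubric rows."""
    return {
        "row_count_matches": len(df) == expected_rows,
        "ids_are_6_digits": bool(df["student_id"].str.fullmatch(r"\d{6}").all()),
        "dates_all_parsed": bool(df["enrolled_on"].notna().all()),
        "prices_are_decimal": all(isinstance(p, Decimal) for p in df["price"]),
    }


results = rubric_checks(cleaned, expected_rows=3)
for name, ok in results.items():
    print(f"{name}: {'PASS' if ok else 'FAIL'}")

# Exact total to compare against the Sheets and SQL totals.
total = sum(cleaned["price"], Decimal("0"))
print(total)  # 4300.50
```

Comparing the printed total with the SUM from Sheets and SQL covers the "totals match" part of the numeric criterion.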

Recommended Tools

Import: File > Import > Upload > set the column type to “Do not convert”
Clean IDs: =TEXT(A2, "000000") to restore leading zeros
Clean Dates: =DATEVALUE(), or fix by hand, then format as YYYY-MM-DD
Clean Numbers: Find & Replace → remove the commas
Validate: Data > Data validation for every column

For reference only. You do not need to write a C program for the capstone:

// To see how leading zeros get lost:
int id = 1234;            // an int cannot carry leading zeros
char id_str[] = "001234"; // stored as a string, so the zeros survive

// Why NUMERIC is safer than float for money:
float price = 2500.00f;   // may pick up rounding error
int cents = 250000;       // exact
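The same point can be seen in Python, which the capstone does use. This short sketch compares binary floats with `decimal.Decimal` and with integer cents. The amounts are arbitrary illustrative values.

```python
from decimal import Decimal

# Binary floats cannot represent most decimal fractions exactly.
float_total = 0.1 + 0.2
print(float_total == 0.3)                # False
print(float_total)                       # 0.30000000000000004

# Decimal built from strings keeps the digits as written.
decimal_total = Decimal("0.1") + Decimal("0.2")
print(decimal_total == Decimal("0.3"))   # True

# Integer cents, as in the C snippet, are exact as well.
cents = 250000 + 180050
print(cents)                             # 430050
```

Note that `Decimal` should be built from strings here. Building it from a float would carry the float's error along.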

import pandas as pd
from decimal import Decimal

# Read the CSV according to the contract
df = pd.read_csv('messy_students.csv', dtype={
    'student_id': 'string',
    'name': 'string',
    'phone': 'string',
    'price': 'string',       # read as string first because of the commas
    'enrolled_on': 'string', # parse dates later
    'active': 'string',      # normalize later
    'note': 'string',
})

# Clean: student_id → pad to 6 digits
df['student_id'] = df['student_id'].str.zfill(6)

# Clean: price → remove commas, convert to Decimal (exact, unlike float)
# price is required by the contract, so no missing values are expected here
df['price'] = df['price'].str.replace(',', '', regex=False).map(Decimal)

# Clean: dates → normalize to ISO
# dayfirst=True because the file uses DD/MM/YYYY; format='mixed' needs pandas 2.0+
df['enrolled_on'] = pd.to_datetime(df['enrolled_on'], format='mixed', dayfirst=True)

# Clean: active → normalize to boolean
bool_map = {'true': True, 'yes': True, '1': True, 'y': True,
            'false': False, 'no': False, '0': False, 'n': False}
df['active'] = df['active'].str.strip().str.lower().map(bool_map)
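The pandas steps above leave `phone` untouched, although the problem table asks for a single format. The sketch below shows one way that could be done. The target format (`+66` followed by the national number), the helper name `normalize_phone`, and the treatment of empty strings as missing are all assumptions for illustration, and it assumes Thai numbers only.

```python
import re

import pandas as pd


def normalize_phone(value):
    """Return a Thai number as +66XXXXXXXXX, or <NA> when the input is missing.

    Hypothetical helper: the target format is an assumption, not part of the brief.
    """
    if pd.isna(value) or value.strip() == "":
        return pd.NA
    digits = re.sub(r"\D", "", value)   # drop '+', '-', and spaces
    if digits.startswith("66"):
        digits = digits[2:]             # already carries the country code
    elif digits.startswith("0"):
        digits = digits[1:]             # drop the national trunk prefix
    return "+66" + digits


# Phone values in the shapes that appear in the sample data.
phones = pd.Series(
    ["0812345678", "+66812345679", pd.NA, "081-234-5680", ""],
    dtype="string",
)
normalized = pd.Series([normalize_phone(v) for v in phones], dtype="string")

print(normalized.dropna().tolist())
# ['+66812345678', '+66812345679', '+66812345680']
```

Whichever format you pick, record it in the type contract so that the Sheets and SQL versions can follow the same rule.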

-- Schema per the contract
CREATE TABLE students (
    student_id   VARCHAR(6)     NOT NULL,
    name         VARCHAR(100)   NOT NULL,
    phone        VARCHAR(20),
    price        NUMERIC(10,2)  NOT NULL CHECK (price >= 0),
    enrolled_on  DATE           NOT NULL,
    active       BOOLEAN        NOT NULL DEFAULT TRUE,
    note         TEXT
);

-- After loading the data, verify:
SELECT COUNT(*) FROM students;                    -- row count
SELECT student_id FROM students
    WHERE LENGTH(student_id) != 6;                -- IDs
SELECT SUM(price) FROM students;                  -- totals
SELECT * FROM students WHERE enrolled_on IS NULL; -- dates
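The schema and verification queries above can be tried without a database server. The sketch below runs them in an in-memory SQLite database from Python, purely as a stand-in: SQLite treats the declared types as affinities, so length limits such as `VARCHAR(6)` are not enforced and a server database may behave differently. The three rows are a hypothetical cleaned version of the first three sample rows.

```python
import sqlite3

# SQLite is used only because it ships with Python; it is a stand-in for
# whichever SQL database the capstone targets.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE students (
        student_id   VARCHAR(6)     NOT NULL,
        name         VARCHAR(100)   NOT NULL,
        phone        VARCHAR(20),
        price        NUMERIC(10,2)  NOT NULL CHECK (price >= 0),
        enrolled_on  DATE           NOT NULL,
        active       BOOLEAN        NOT NULL DEFAULT TRUE,
        note         TEXT
    )
""")

# Hypothetical cleaned rows (phone left in its original format).
rows = [
    ("001234", "Somchai Jaidee", "0812345678", "2500.00", "2025-01-15", 1, None),
    ("001235", "Somsri Deejai", "+66812345679", "1800.50", "2025-01-16", 1, "นักเรียนทุน"),
    ("001236", "Somying Sabaidi", None, "0", "2025-01-17", 1, None),
]
conn.executemany("INSERT INTO students VALUES (?, ?, ?, ?, ?, ?, ?)", rows)

# The verification queries from this briefing.
row_count = conn.execute("SELECT COUNT(*) FROM students").fetchone()[0]
bad_ids = conn.execute(
    "SELECT student_id FROM students WHERE LENGTH(student_id) != 6"
).fetchall()
total = conn.execute("SELECT SUM(price) FROM students").fetchone()[0]
missing_dates = conn.execute(
    "SELECT COUNT(*) FROM students WHERE enrolled_on IS NULL"
).fetchone()[0]

print(row_count, bad_ids, total, missing_dates)  # 3 [] 4300.5 0
```

Parameterized inserts (`?` placeholders) keep the IDs as text on the way in, which is what preserves the leading zeros.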

Suggested Timeline

Phase	Task	Time
1	Read the CSV, analyze the problems, write the type contract	20 min
2	Clean in Google Sheets	25 min
3	Clean in the pandas notebook	30 min
4	Write the SQL schema, load, and verify	25 min
5	Cross-platform check: totals match in all three systems	10 min
6	Final review of the type contract memo	10 min
Total		~120 min