Scenario
Analyse ~36 months of daily market data for a small portfolio (1–3 assets) and one benchmark index. Clean data, build returns, run ANOVA and Chi‑Square tests, and fit linear/multiple regressions to produce concise, decision‑ready conclusions.
Data Sources & Preparation
Deprecated caret-index symbols (e.g., ^GSPC, ^NDX, ^STI) and futures tickers have been removed to avoid data issues. Use ETF benchmarks instead — SPY (S&P 500), QQQ (NASDAQ‑100), and IWM (Russell 2000).
Google Sheets and yfinance fully support these.
Ticker selection rules (to encourage diversity)
- Pick 1–3 assets you care about — examples: NVDA, AMZN, META, TSLA (examples; you may use these or others).
- Pick a benchmark that fits — e.g., SPY or QQQ (choose what makes sense and justify).
- In your report, write one sentence per choice: “We chose ___ because ___.”
Path A — Google Sheets (no code)
- Close price for asset XXX: =GOOGLEFINANCE("XXX","price",TODAY()-1095,TODAY(),"DAILY")
- Volume for XXX: =GOOGLEFINANCE("XXX","volume",TODAY()-1095,TODAY(),"DAILY")
- Benchmark ETF — e.g., SPY, QQQ, IWM: =GOOGLEFINANCE("SPY","price",TODAY()-1095,TODAY(),"DAILY")
- Export to CSV: File → Download → Comma-separated values (.csv)
Path B — Python (yfinance) — recommended for more choices
import yfinance as yf
assets = ["XXX","YYY"] # examples: ["NVDA","AMZN","META"]
bench = ["SPY"] # examples: ["SPY"], ["QQQ"], or ["IWM"]
tickers = assets + bench
df = yf.download(tickers, period="3y", interval="1d", auto_adjust=True)["Close"]
df.to_csv("prices_3y.csv")
Optional macro drivers (for Task 4):
- Volatility proxy: VIXY (ETF) — use daily % change
- Oil: WTI futures — use daily % return
- Rates proxy: IEF (7–10Y US Treasuries ETF) — use daily % return
- Dollar proxy: UUP (USD ETF) — use daily % return
Finance Crash Course (fast, practical)
Returns (daily) & why we use them
Return = the % change from one day to the next. Simple return: (P_t - P_{t-1})/P_{t-1}. Log return: ln(P_t/P_{t-1}) — additive over time. Returns let us compare different assets fairly and run statistics on a comparable scale.
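A minimal sketch of both return definitions, assuming a prices_clean.csv with Date and Close columns (as produced by the data-pull snippet further down):
import numpy as np, pandas as pd
close = pd.read_csv("prices_clean.csv", parse_dates=["Date"]).set_index("Date")["Close"]
simple_ret = close.pct_change()                     # (P_t - P_{t-1}) / P_{t-1}
log_ret = np.log(close / close.shift(1))            # ln(P_t / P_{t-1})
# time-additivity: daily log returns sum to the log of total growth over the window
print(round(float(log_ret.sum()), 6), round(float(np.log(close.iloc[-1] / close.iloc[0])), 6))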
Benchmark, α, β (sensitivity)
A benchmark is the market yardstick (e.g., SPY, QQQ). β is the slope of the asset's return on the benchmark's return; α is the intercept (the average return left unexplained by the benchmark).
Volatility & clustering
Use the rolling 20‑day standard deviation of returns as a volatility proxy. Markets show volatility clustering: calm periods, then stormy periods.
ANOVA & Chi‑Square
ANOVA tests if mean returns differ across groups (assumes similar variances and approx. normal residuals). Chi‑Square tests independence; expected counts in each cell should be about ≥5.
Regression essentials
Report coefficients (α, β’s), standard errors, t‑stats, p‑values, confidence intervals, R² and Adjusted R². Compare models with AIC (or BIC). Beware overfitting; consider a simple train/test split or out‑of‑sample check.
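A minimal out-of-sample check sketch, assuming prices_clean.csv and a Lag1 predictor as placeholder inputs:
import pandas as pd, statsmodels.api as sm
d = pd.read_csv("prices_clean.csv", parse_dates=["Date"]).sort_values("Date")
d["Lag1"] = d["Return"].shift(1)
d = d.dropna()
split = int(len(d) * 0.8)                            # train on the first 80%, test on the last 20%
train, test = d.iloc[:split], d.iloc[split:]
m = sm.OLS(train["Return"], sm.add_constant(train["Lag1"])).fit()
pred = m.predict(sm.add_constant(test["Lag1"]))
rmse = ((test["Return"] - pred) ** 2).mean() ** 0.5
print("Adj-R2 (train):", round(m.rsquared_adj, 4), " RMSE (test):", round(float(rmse), 5))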
Motivation — Why this matters for your life
Short version: learning to analyse markets is a practical life skill. You won’t just pass a subject—you’ll build a system for smarter saving, investing, and decision‑making.
3 stories to keep in mind
- Compounding works quietly, then suddenly. A $100/month habit at ~7% annualized can cross $50,000 in ~20 years and ~$120,000 in ~30 years—without “big” bets, just consistency.
- Process beats predictions. Professionals rarely “call” the market; they test hypotheses, measure uncertainty, and follow data. That’s exactly what you’ll practice here.
- Transferable skills. Cleaning data, formulating hypotheses (H0/H1), choosing tests, and communicating results translate to any domain—ecology, IT, healthcare, policy.
What you’ll take away
- A reproducible Python workflow to pull, clean, and analyse real market data.
- Confidence in hypothesis testing (ANOVA, Chi‑Square, regression) and how to report decisions responsibly.
- A personal playbook for evaluating claims: “Is this statistically meaningful or just noise?”
Treat this like a starter kit for your future self. The sooner you build these habits, the more time compounding has to work for you.
Quick starter snippets
Copy‑paste these into your notebooks as needed. They align with the Tasks and rubric.
Data pull (~36 months with yfinance)
import pandas as pd, yfinance as yf
TICKER = "QQQ" # change me
end = pd.Timestamp.today().normalize()
start = end - pd.DateOffset(months=36)
df = yf.download(TICKER, start=start, end=end, interval="1d", auto_adjust=True)[["Close"]].reset_index()
df.columns = ["Date", "Close"]  # flatten column names in case yfinance returns a MultiIndex
df = df.sort_values("Date")
df["Return"] = df["Close"].pct_change()
df.to_csv("prices_clean.csv", index=False)
ANOVA (Task 1)
import pandas as pd, statsmodels.api as sm
from statsmodels.formula.api import ols
df = pd.read_csv("prices_clean.csv", parse_dates=["Date"]).sort_values("Date").dropna()
df["Weekday"] = df["Date"].dt.day_name()
m = ols("Return ~ C(Weekday)", data=df).fit()
anova = sm.stats.anova_lm(m, typ=2)
# eta-squared (η²) = SS_between / SS_total
eta2 = anova["sum_sq"]["C(Weekday)"] / anova["sum_sq"].sum()
print(anova)
print("eta-squared:", round(float(eta2), 4))
Chi‑Square (Task 2)
import pandas as pd, numpy as np, scipy.stats as st
df = pd.read_csv("prices_clean.csv", parse_dates=["Date"]).sort_values("Date")
df["UpDown"] = np.where(df["Return"]>0,"UP","DOWN")
df["Vol20"] = df["Return"].rolling(20).std()
q = df["Vol20"].quantile([0.33,0.66])
def reg(v):
    if pd.isna(v): return None
    if v <= q.iloc[0]: return "Low"
    if v <= q.iloc[1]: return "Med"
    return "High"
df["Regime"] = df["Vol20"].apply(reg)
ct = pd.crosstab(df["UpDown"], df["Regime"]).fillna(0)
chi2, p, dof, exp = st.chi2_contingency(ct)
n = ct.values.sum()
phi2 = chi2 / n
r, k = ct.shape
cramers_v = np.sqrt(phi2 / min(r - 1, k - 1))  # Cramer's V effect size (optional / bonus)
print(ct, "\nchi2=", chi2, " p=", p, " CramerV=", cramers_v)
Linear regression (Task 3) + 95% CI
import pandas as pd, statsmodels.api as sm
df = pd.read_csv("prices_clean.csv", parse_dates=["Date"]).sort_values("Date")
df["Lag1"] = df["Return"].shift(1)
d = df.dropna()
X = sm.add_constant(d["Lag1"]); y = d["Return"]
m = sm.OLS(y, X).fit()
print(m.summary())
print("\n95% CI for beta:", m.conf_int().loc["Lag1"].tolist())
Multiple regression (Task 4) — compare specs
import pandas as pd, numpy as np, statsmodels.api as sm
df = pd.read_csv("prices_clean.csv", parse_dates=["Date"]).sort_values("Date")
df["Lag1"] = df["Return"].shift(1); df["Lag2"] = df["Return"].shift(2); df["Vol20"] = df["Return"].rolling(20).std()
d = df.dropna()
def fit(cols):
    X = sm.add_constant(d[cols]); y = d["Return"]
    return sm.OLS(y, X).fit()
m1 = fit(["Lag1","Lag2"]); m2 = fit(["Lag1","Lag2","Vol20"])
print("M1:", dict(adjR2=m1.rsquared_adj, AIC=m1.aic, BIC=m1.bic))
print("M2:", dict(adjR2=m2.rsquared_adj, AIC=m2.aic, BIC=m2.bic))
Polynomial regression (Task 5) — degree 2 (keep it simple)
import pandas as pd, statsmodels.api as sm
df = pd.read_csv("prices_clean.csv", parse_dates=["Date"]).dropna().sort_values("Date")
# Predict asset return using a quadratic of benchmark return
df["BenchRet"] = df["Return"].shift(0) # or use benchmark return if you merged it
X = pd.DataFrame({"BenchRet": df["BenchRet"], "BenchRet2": df["BenchRet"]**2}).dropna()
y = df.loc[X.index, "Return"]
X = sm.add_constant(X)
m = sm.OLS(y, X).fit()
print(m.summary())
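Task 5 also asks for a chart of the fitted curve over the scatter; a minimal matplotlib sketch reusing m, X, and y from the block above:
import numpy as np, matplotlib.pyplot as plt
order = np.argsort(X["BenchRet"].values)             # sort x so the fitted line plots smoothly
plt.scatter(X["BenchRet"], y, s=8, alpha=0.4, label="daily returns")
plt.plot(X["BenchRet"].values[order], m.fittedvalues.values[order], color="red", label="quadratic fit")
plt.xlabel("Benchmark return"); plt.ylabel("Asset return"); plt.legend(); plt.show()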
Logistic regression (Task 6) — classify Up vs Down
import pandas as pd, numpy as np, statsmodels.api as sm
df = pd.read_csv("prices_clean.csv", parse_dates=["Date"]).sort_values("Date")
df["Up"] = (df["Return"] > 0).astype(int)
df["Lag1"] = df["Return"].shift(1)
d = df.dropna(subset=["Up","Lag1"])
X = sm.add_constant(d[["Lag1"]]) # you may add one macro predictor if available
y = d["Up"]
logit = sm.Logit(y, X).fit(disp=False)
print(logit.summary())
pred = (logit.predict(X) >= 0.5).astype(int)
acc = (pred == y).mean()
print("Accuracy:", round(float(acc),3))
Hypothesis Playbook (use for Tasks 1–2 and wherever relevant)
Formulating H0 and H1 (templates)
- Define population/time-frame: “daily returns over the last ~36 months”.
- Task 1 (ANOVA) — H0: mean return is the same across all groups; H1: at least one group’s mean differs.
- Task 2 (Chi‑Square) — H0: direction (Up/Down) is independent of the factor; H1: they are dependent.
- Regression — H0: chosen coefficient = 0; H1: coefficient ≠ 0 (two‑tailed unless justified).
Significance level, decision, reporting
- Set α = 0.05 (unless justified).
- Decision rule: if p ≤ α → reject H0; else fail to reject H0.
- Provide the relevant test statistic(s), p‑value(s), and a plain‑English sentence for a manager audience. (Effect sizes are not required for Part 2.)
Logistic regression (classification) notes
- Outcome: Up (1) vs Down (0). Predictors: e.g., Lag1 return (and optionally one macro).
- Report coefficient signs and a simple performance metric (e.g., accuracy). No ROC/AUC required.
- Watch for class imbalance; if heavily imbalanced, note it briefly.
Assumptions checklist (a diagnostics sketch follows)
- ANOVA: approx. normal residuals; similar variances (Levene); independent observations.
- Chi‑Square: expected counts ≥ 5 in most cells; mutually exclusive categories.
- Regression: linearity, independent errors, homoskedasticity, approx. normal residuals.
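A minimal diagnostics sketch for these checks; df, m, and exp are assumed to come from the Task 1–3 snippets above:
import scipy.stats as st, statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson
# ANOVA: similar variances across groups (Levene), using the Weekday factor from Task 1
groups = [g["Return"].dropna().values for _, g in df.groupby("Weekday")]
print("Levene:", st.levene(*groups))
# Chi-Square: expected counts from chi2_contingency should be roughly >= 5 per cell
print("Min expected count:", exp.min())
# Regression: residual normality (QQ plot) and error independence (Durbin-Watson)
sm.qqplot(m.resid, line="s")
print("Durbin-Watson:", durbin_watson(m.resid))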
Variables You Will Construct (a construction sketch follows this list)
- Daily return: (P_t - P_{t-1}) / P_{t-1}, or the log‑return
- Up/Down label: Up if return > 0, else Down
- Day‑of‑Week (DoW): Monday … Friday
- Regime: Bull vs Bear via 50‑SMA vs 200‑SMA
- Volatility proxy: rolling 20‑day stdev of returns
- Momentum: prior 20‑day return
- Volume z‑score: (Volume − mean)/stdev across a rolling window
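A minimal construction sketch for these variables, assuming prices_clean.csv with Date, Close, Return (and, if exported, Volume) columns; the 60-day z-score window is an assumption:
import numpy as np, pandas as pd
df = pd.read_csv("prices_clean.csv", parse_dates=["Date"]).sort_values("Date")
df["UpDown"] = np.where(df["Return"] > 0, "Up", "Down")
df["DoW"] = df["Date"].dt.day_name()
# Bull/Bear regime via 50-day vs 200-day SMA (in practice compute this on the benchmark)
df["SMA50"], df["SMA200"] = df["Close"].rolling(50).mean(), df["Close"].rolling(200).mean()
df["Regime"] = np.where(df["SMA50"] > df["SMA200"], "Bull", "Bear")  # first ~200 rows are unreliable; drop before grouping
df["Vol20"] = df["Return"].rolling(20).std()         # volatility proxy
df["Mom20"] = df["Close"].pct_change(20)             # prior 20-day return (momentum)
if "Volume" in df.columns:                           # volume z-score over a rolling window
    roll = df["Volume"].rolling(60)
    df["VolZ"] = (df["Volume"] - roll.mean()) / roll.std()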
Tasks
Task 1 — ANOVA Module 5
- Factor (choose one): Day‑of‑Week; or Regime (benchmark 50‑SMA above/below 200‑SMA); or Volatility terciles.
- Response: Daily return of one chosen asset.
- Deliverables: ANOVA table, diagnostics, 1‑paragraph investor takeaway.
Task 2 — Chi‑Square (Independence) Module 6
- Is Up/Down independent of Day‑of‑Week (or Regime)? Provide contingency table, expected counts, χ² & p‑value; interpret.
Task 3 — Simple Linear Regression Module 7
- Model: AssetReturn = α + β × BenchmarkReturn + ε. Report α, β, R²; check residuals; explain β (sensitivity).
- Benchmark examples: SPY, QQQ.
Task 4 — Multiple Regression (REQUIRES 1–2 macro drivers) Modules 7–8
- Response: Portfolio return (equal‑weight of your assets) or a single asset.
- Predictors: Benchmark return plus 1–2 macro drivers chosen from the list in Data (ΔVIX, oil return, Δ10Y, USD return); see the macro-merge sketch after this list.
- Fit ≥3 model variants; select “best” via R²/Adj‑R²/AIC; justify choice; write a short manager memo.
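A minimal macro-merge and comparison sketch, using the ETF proxies listed under Data (VIXY, IEF, UUP) and NVDA/SPY as assumed example tickers:
import yfinance as yf, statsmodels.api as sm
px = yf.download(["NVDA", "SPY", "VIXY", "IEF", "UUP"], period="3y", interval="1d", auto_adjust=True)["Close"]
rets = px.pct_change().dropna()                      # aligned daily % changes, NAs removed
y = rets["NVDA"]                                     # or an equal-weight average of your assets' returns
specs = {"benchmark only": ["SPY"], "+ VIXY": ["SPY", "VIXY"], "+ VIXY + IEF": ["SPY", "VIXY", "IEF"]}
for name, cols in specs.items():
    m = sm.OLS(y, sm.add_constant(rets[cols])).fit()
    print(name, "Adj-R2:", round(m.rsquared_adj, 4), "AIC:", round(m.aic, 2))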
Task 5 — Polynomial Regression (degree 2) Module 9
- Augment a simple regression with a quadratic term (degree 2) to capture mild curvature.
- Model: Return = α + β₁·Benchmark + β₂·Benchmark² + ε (keep predictors minimal).
- Deliverables: coefficients, p‑values, R²/Adj‑R², and a chart overlaying the fitted curve on the scatter.
Task 6 — Logistic Regression (Up vs Down) Module 10
- Classify daily direction (Up=1, Down=0) using a simple logistic model.
- Predictors: Lag1 return (and optionally one macro driver). Keep it simple.
- Deliverables: coefficient signs, a confusion matrix (2×2), and accuracy. No ROC/AUC required.
Marking: T1–T4: 20 pts each (5% each). T5–T6: 10 pts each (2.5% each). Optional Bonus: up to +5 pts (does not exceed 100 pts / 25%).
Guiding Questions (submit answers in a Word file)
Answer Q1–Q14 (plus Q15–Q18 for Tasks 5–6) succinctly in a Word document; name it per the convention below.
- Choice & rationale
Q1. Which assets (1–3) did you pick and why are they economically related?
Q2. Which benchmark fits best and why not the alternatives listed?
- Data hygiene
Q3. Show how you aligned dates and removed NAs when assets/benchmarks had different trading days.
Q4. Did you use returns/changes (not levels) for regressions? Show one row example.
- Task 1 — ANOVA
Q5. Which factor did you choose (DoW, Regime, or Vol‑terciles) and why is it sensible for your asset?
Q6. What does the ANOVA say and what is the single‑sentence “so what?”
- Task 2 — Chi‑Square
Q7. Build the Up/Down vs chosen factor table. Are counts adequate in each cell?
Q8. What do the χ² statistic and the p‑value tell you? State your interpretation.
- Task 3 — Simple regression
Q9. What are α, β, and R²? Provide a one‑sentence PM interpretation for β and R².
Q10. Show one diagnostic (residual plot or normal QQ) and comment.
- Task 4 — Multiple regression (REQUIRES 1–2 macros)
Q11. Which macro drivers did you choose (ΔVIX, oil return, Δ10Y, USD return) and why do they matter for your assets?
Q12. Compare three model variants (benchmark‑only vs +macro1 vs +macro1+macro2). Which wins and why (R²/Adj‑R²/AIC)?
Q13. What would change your conclusion (regime change, out‑of‑sample check)?
- Task 5 — Polynomial regression
Q15. Why is a quadratic term plausible for your asset/benchmark relation? Show the fitted curve.
Q16. Did the quadratic term materially improve Adj‑R² or AIC? Briefly justify.
- Task 6 — Logistic regression
Q17. What predictors did you use and why? Report the confusion matrix and accuracy.
Q18. How would misclassification risk affect a simple trading decision?
- Communication
Q14. What’s your final manager memo (3–4 bullets) that a non‑technical audience can use?
Visualisation Requirements
- Task 1: box/violin + mean markers per group
- Task 2: stacked bar or mosaic plot
- Task 3: scatter with fitted line + residual plot
- Task 4: coefficient table/plot + residual diagnostics
- Task 5: scatter with the fitted quadratic curve overlaid
- Task 6: confusion matrix (2×2)
Always label axes/legends with tickers, units, and date range; a plotting sketch follows.
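A minimal sketch for the Task 1 box plot with mean markers (the Weekday factor and ticker label are assumptions; adapt to your chosen factor):
import pandas as pd, matplotlib.pyplot as plt
df = pd.read_csv("prices_clean.csv", parse_dates=["Date"]).dropna()
df["Weekday"] = df["Date"].dt.day_name()
order = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"]
data = [df.loc[df["Weekday"] == d, "Return"] for d in order]
plt.boxplot(data, labels=order, showmeans=True)      # showmeans adds a mean marker per group
plt.ylabel("Daily return (QQQ, ~36 months)")         # label ticker, units, and date range
plt.title("Daily returns by day of week")
plt.show()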
Delivery Template (copy into a Word file)
Use the following outline verbatim in your Word document. Keep the headings and order. Replace ALL CAPS placeholders with your details. Submit the completed Word file to Blackboard.
Title: CP2403/CP3413 — Assignment Part 2 (Portfolio Analytics Edition)
STUDENT NAME: FIRSTNAME LASTNAME
STUDENT ID: XXXXXXXXX
PRACTICAL GROUP: A / B / C
DATE: YYYY-MM-DD
1. CHOICES & RATIONALE
1.1 Assets (1–3): TICKER1, TICKER2, ...
Rationale (1–2 sentences): ...
1.2 Benchmark (ETF): SPY / QQQ / IWM
Rationale (1 sentence): ...
2. DATA ACQUISITION & HYGIENE
2.1 Source & Window: Google Sheets / yfinance, ~36 months ending TODAY
2.2 Cleaning Steps: aligned trading days, removed NAs, outliers handling (if any)
2.3 Files Provided: prices CSV(s) attached (filenames)
3. TASK 1 — ANOVA (Day-of-Week / Regime / Vol-terciles)
3.1 Hypotheses (H0/H1): ...
3.2 Method & Diagnostics (1–2 lines): ...
3.3 Results (F, p-value): ...
3.4 Investor Takeaway (1 paragraph): ...
4. TASK 2 — Chi‑Square (Independence)
4.1 Table (Up/Down vs Factor): (paste table or screenshot)
4.2 Result (χ², p-value): ...
4.3 Interpretation (1–2 lines): ...
5. TASK 3 — Simple Linear Regression
5.1 Model: AssetReturn = α + β × BenchmarkReturn + ε
5.2 Results (α, β, R²): ...
5.3 Diagnostic (1 plot + 1 sentence): ...
5.4 Interpretation for PM (1–2 lines): ...
6. TASK 4 — Multiple Regression (with 1–2 macro drivers)
6.1 Predictors used: Benchmark + MACRO1 (+ MACRO2)
6.2 Model Comparison (≥3 specs): brief table with Adj-R² / AIC
6.3 Chosen Model & Why: ...
6.4 Manager Memo (3–4 bullets): ...
7. TASK 5 — Polynomial Regression (degree 2)
7.1 Why quadratic is plausible (1–2 lines): ...
7.2 Results (βs, Adj‑R²/AIC) + fitted curve screenshot: ...
7.3 Takeaway (1–2 lines): ...
8. TASK 6 — Logistic Regression (Up vs Down)
8.1 Predictors: Lag1 (+ optional macro)
8.2 Results: coefficients, confusion matrix (2×2), accuracy
8.3 Takeaway (1–2 lines): ...
9. VISUALS
9.1 T1 figure(s): ...
9.2 T2 figure(s): ...
9.3 T3 figure(s): ...
9.4 T4 figure(s): ...
9.5 T5/T6: ...
10. REPRODUCIBILITY CHECKLIST
- I included the CSV(s) used and named files per convention.
- All charts label axes/units/date range.
- I answered Q1–Q14 (and Q15–Q18 if applicable) clearly.
- I stated any assumptions or limitations briefly.
APPENDIX A — Key Tables/Printouts
APPENDIX B — Code snippets (optional)
File Naming (enforce exactly)
- Main answers (this template): task-answers-FirstNameLastName-StudentID.docx
- Task 1: task1-ANOVA-FirstNameLastName-StudentID.docx
- Task 2: task2-ChiSquare-FirstNameLastName-StudentID.docx
- Task 3: task3-LinearReg-FirstNameLastName-StudentID.docx
- Task 4: task4-MultipleReg-FirstNameLastName-StudentID.docx
- Task 5: task5-PolynomialReg-FirstNameLastName-StudentID.docx
- Task 6: task6-LogisticReg-FirstNameLastName-StudentID.docx
Assessment criteria
- Correctness of method & results reported for each task.
- Clarity: concise 1–2 line interpretations; manager memo quality.
- Reproducibility: CSV(s) provided, naming followed, charts labelled.
- Professionalism: formatting consistent with this template.
Submission & Naming
Submit to LearnJCU. Include your CSV(s) used. If you hosted data on GitHub (optional), paste the CSV link into your Word file for reproducibility. Deliverables:
- Word (answers): responses to Q1–Q14 — name: task-answers-FirstNameLastName-StudentID.docx
- Task 1: task1-ANOVA-FirstNameLastName-StudentID.docx (and ...ipynb if used)
- Task 2: task2-ChiSquare-FirstNameLastName-StudentID.docx (and ...ipynb if used)
- Task 3: task3-LinearReg-FirstNameLastName-StudentID.docx (and ...ipynb if used)
- Task 4: task4-MultipleReg-FirstNameLastName-StudentID.docx (and ...ipynb if used)
- Task 5: task5-PolynomialReg-FirstNameLastName-StudentID.docx (and ...ipynb if used)
- Task 6: task6-LogisticReg-FirstNameLastName-StudentID.docx (and ...ipynb if used)
Glossary
Core finance & analysis terms (for this brief)
DoW (Day‑of‑Week)
Weekday (Mon–Fri) for each row; used to group/compare returns.
Regime
Market state, e.g., Bull vs Bear via 50‑SMA vs 200‑SMA on the benchmark.
Remove NAs
Drop/fill rows with missing values after aligning dates across tickers.
Return vs Log‑return
Simple: (Pₜ − Pₜ₋₁)/Pₜ₋₁; Log: ln(Pₜ/Pₜ₋₁) — additive over time.
α (Alpha)
Intercept: average return unexplained by the benchmark; daily units.
β (Beta)
Sensitivity to the benchmark; slope of asset on benchmark returns.
Residuals
Errors (actual − fitted). Random residuals suggest model adequacy.
Heteroskedasticity
Residual variance changes with level/fitted value (fan‑shape in plots).
Autocorrelation
Residuals correlated over time; check with Durbin‑Watson or ACF plot.
Multicollinearity
Predictors highly correlated; inflates SEs. Check VIFs; drop/transform.
R² / Adjusted R²
Explained variance; adjusted penalises extra predictors.
AIC / BIC
Model scores that penalise complexity; lower is better (same data).
t‑stat / p‑value
Signal‑to‑noise and its significance probability for a coefficient.
Confidence Interval
Range that plausibly contains the true coefficient value (e.g., 95%).
QQ Plot
Graph to assess if residuals are approximately normal.
Leverage & Influence
Points that can sway the fit; inspect Cook’s distance.
Standardisation
Scale variables (z‑score) for comparability of coefficients.
Winsorise / Outliers
Cap extreme values to reduce undue influence; report if used.
Stationarity
Statistical properties don't change over time; returns are closer to stationary than prices are.
Overfitting
Model fits noise; keep parsimonious, validate out‑of‑sample.
Statistics Glossary (what to report for each test)
Report the key test statistic(s), p‑value(s), fit metrics where relevant, and a short, manager‑friendly conclusion. Effect sizes are not required for Part 2.
ANOVA (one‑way)
- Hypothesis test: F‑test of mean differences across groups.
- Report: F‑statistic, p‑value, brief interpretation; include basic diagnostics (residuals, variance similarity).
Chi‑Square (independence)
- Hypothesis test: Pearson’s χ² on a contingency table; H0: variables are independent.
- Report: contingency table, χ², p‑value, and a 1–2 sentence interpretation; ensure expected counts are adequate.
Regression (OLS)
- Hypothesis tests: coefficient t‑tests (H0: β=0), overall F‑test.
- Report: coefficients, standard errors, t‑stats, p‑values, R²/Adj‑R², 95% CI for key β’s.
- Diagnostics: linearity, homoskedasticity, residual normality (QQ‑plot).
Reminder: the code snippets above are only examples. You may use them or others — just justify your choices briefly.
Bonus Exploration (Optional, up to +5 pts)
Advanced measures not required in Weeks 5–10. Attempt only if you wish to go beyond core requirements.
- For χ², compute and discuss Cramér’s V.
- For ANOVA, compute and discuss η² or ω².
- For logistic, explore a ROC/AUC sketch or threshold tuning. (Optional)
Bonus points cannot push total above the Part 2 maximum (25%).