Scenario
Analyse ~36 months of daily market data for a small portfolio (1–3 assets) and one benchmark index. Clean data, build returns, run ANOVA and Chi‑Square tests, and fit linear/multiple regressions to produce concise, decision‑ready conclusions.
Data Sources & Preparation
Deprecated caret-index symbols (e.g., ^GSPC, ^NDX, ^STI) and futures tickers have been removed to avoid data issues. Use ETF benchmarks instead — SPY (S&P 500), QQQ (NASDAQ‑100), and IWM (Russell 2000).
Google Sheets and yfinance fully support these.
Ticker selection rules (to encourage diversity)
- Pick 1–3 assets you care about — examples: NVDA, AMZN, META, TSLA (examples; you may use these or others).
- Pick a benchmark that fits — e.g., SPY or QQQ (choose what makes sense and justify).
- In your report, write one sentence per choice: “We chose ___ because ___.”
Path A — Google Sheets (no code)
- Close price for asset XXX: =GOOGLEFINANCE("XXX","price",TODAY()-1095,TODAY(),"DAILY")
- Volume for XXX: =GOOGLEFINANCE("XXX","volume",TODAY()-1095,TODAY(),"DAILY")
- Benchmark ETF — e.g., SPY, QQQ, IWM: =GOOGLEFINANCE("SPY","price",TODAY()-1095,TODAY(),"DAILY")
- Export to CSV: File → Download → Comma-separated values (.csv)
Path B — Python (yfinance) — recommended for more choices
import yfinance as yf
assets = ["XXX","YYY"] # examples: ["NVDA","AMZN","META"]
bench = ["SPY"] # examples: ["SPY"], ["QQQ"], or ["IWM"]
tickers = assets + bench
df = yf.download(tickers, period="3y", interval="1d", auto_adjust=True)["Close"]
df.to_csv("prices_3y.csv")
Optional macro drivers (for Task 4):
- Volatility proxy: VIXY (ETF) — use daily % change
- Oil: WTI futures — use daily % return
- Rates proxy: IEF (7–10Y US Treasuries ETF) — use daily % return
- Dollar proxy: UUP (USD ETF) — use daily % return
Finance Crash Course (fast, practical)
Returns (daily) & why we use them
Return = the % change from one day to the next. Simple return: (P_t - P_{t-1})/P_{t-1}. Log return: ln(P_t/P_{t-1}) — additive over time. Returns let us compare different assets fairly and run statistics on a comparable scale.
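A minimal sketch of both return definitions, assuming a prices_clean.csv with Date and Close columns (as produced by the data-pull snippet further down):
import numpy as np, pandas as pd
close = pd.read_csv("prices_clean.csv", parse_dates=["Date"]).set_index("Date")["Close"]
simple_ret = close.pct_change()                     # (P_t - P_{t-1}) / P_{t-1}
log_ret = np.log(close / close.shift(1))            # ln(P_t / P_{t-1})
# time-additivity: daily log returns sum to the log of total growth over the window
print(round(float(log_ret.sum()), 6), round(float(np.log(close.iloc[-1] / close.iloc[0])), 6))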
Benchmark, α, β (sensitivity)
A benchmark is the market yardstick (e.g., SPY, QQQ). β is the slope of the asset's return on the benchmark's return; α is the intercept (the average return left unexplained by the benchmark).
Volatility & clustering
Use the rolling 20‑day standard deviation of returns as a volatility proxy. Markets show volatility clustering: calm periods, then stormy periods.
ANOVA & Chi‑Square
ANOVA tests if mean returns differ across groups (assumes similar variances and approx. normal residuals). Chi‑Square tests independence; expected counts in each cell should be about ≥5.
Regression essentials
Report coefficients (α, β’s), standard errors, t‑stats, p‑values, confidence intervals, R² and Adjusted R². Compare models with AIC (or BIC). Beware overfitting; consider a simple train/test split or out‑of‑sample check.
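A minimal out-of-sample check sketch, assuming prices_clean.csv and a Lag1 predictor as placeholder inputs:
import pandas as pd, statsmodels.api as sm
d = pd.read_csv("prices_clean.csv", parse_dates=["Date"]).sort_values("Date")
d["Lag1"] = d["Return"].shift(1)
d = d.dropna()
split = int(len(d) * 0.8)                            # train on the first 80%, test on the last 20%
train, test = d.iloc[:split], d.iloc[split:]
m = sm.OLS(train["Return"], sm.add_constant(train["Lag1"])).fit()
pred = m.predict(sm.add_constant(test["Lag1"]))
rmse = ((test["Return"] - pred) ** 2).mean() ** 0.5
print("Adj-R2 (train):", round(m.rsquared_adj, 4), " RMSE (test):", round(float(rmse), 5))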
Motivation — Why this matters for your life
Short version: learning to analyse markets is a practical life skill. You won’t just pass a subject—you’ll build a system for smarter saving, investing, and decision‑making.
3 stories to keep in mind
- Compounding works quietly, then suddenly. A $100/month habit at ~7% annualized can cross $50,000 in ~20 years and ~$120,000 in ~30 years—without “big” bets, just consistency.
- Process beats predictions. Professionals rarely “call” the market; they test hypotheses, measure uncertainty, and follow data. That’s exactly what you’ll practice here.
- Transferable skills. Cleaning data, formulating hypotheses (H0/H1), choosing tests, and communicating results translate to any domain—ecology, IT, healthcare, policy.
What you’ll take away
- A reproducible Python workflow to pull, clean, and analyse real market data.
- Confidence in hypothesis testing (ANOVA, Chi‑Square, regression) and how to report decisions responsibly.
- A personal playbook for evaluating claims: “Is this statistically meaningful or just noise?”
Treat this like a starter kit for your future self. The sooner you build these habits, the more time compounding has to work for you.
Quick starter snippets
Copy‑paste these into your notebooks as needed. They align with the Tasks and rubric.
Data pull (~36 months with yfinance)
import pandas as pd, yfinance as yf
TICKER = "QQQ" # change me
end = pd.Timestamp.today().normalize()
start = end - pd.DateOffset(months=36)
df = yf.download(TICKER, start=start, end=end, interval="1d", auto_adjust=True)[["Close"]].reset_index()
df.columns = ["Date", "Close"]  # flatten column names in case yfinance returns a MultiIndex
df = df.sort_values("Date")
df["Return"] = df["Close"].pct_change()
df.to_csv("prices_clean.csv", index=False)
ANOVA (Task 1)
import pandas as pd, statsmodels.api as sm
from statsmodels.formula.api import ols
df = pd.read_csv("prices_clean.csv", parse_dates=["Date"]).sort_values("Date").dropna()
df["Weekday"] = df["Date"].dt.day_name()
m = ols("Return ~ C(Weekday)", data=df).fit()
anova = sm.stats.anova_lm(m, typ=2)
# eta-squared (η²) = SS_between / SS_total
eta2 = anova["sum_sq"]["C(Weekday)"] / anova["sum_sq"].sum()
print(anova)
print("eta-squared:", round(float(eta2), 4))
Chi‑Square (Task 2)
import pandas as pd, numpy as np, scipy.stats as st
df = pd.read_csv("prices_clean.csv", parse_dates=["Date"]).sort_values("Date")
df["UpDown"] = np.where(df["Return"]>0,"UP","DOWN")
df["Vol20"] = df["Return"].rolling(20).std()
q = df["Vol20"].quantile([0.33,0.66])
def reg(v):
    if pd.isna(v): return None
    if v <= q.iloc[0]: return "Low"
    if v <= q.iloc[1]: return "Med"
    return "High"
df["Regime"] = df["Vol20"].apply(reg)
ct = pd.crosstab(df["UpDown"], df["Regime"]).fillna(0)
chi2, p, dof, exp = st.chi2_contingency(ct)
n = ct.values.sum()
phi2 = chi2 / n
r, k = ct.shape
cramers_v = np.sqrt(phi2 / min(r - 1, k - 1))  # Cramer's V effect size (optional / bonus)
print(ct, "\nchi2=", chi2, " p=", p, " CramerV=", cramers_v)
Linear regression (Task 3) + 95% CI
import pandas as pd, statsmodels.api as sm
df = pd.read_csv("prices_clean.csv", parse_dates=["Date"]).sort_values("Date")
df["Lag1"] = df["Return"].shift(1)
d = df.dropna()
X = sm.add_constant(d["Lag1"]); y = d["Return"]
m = sm.OLS(y, X).fit()
print(m.summary())
print("\n95% CI for beta:", m.conf_int().loc["Lag1"].tolist())
Multiple regression (Task 4) — compare specs
import pandas as pd, numpy as np, statsmodels.api as sm
df = pd.read_csv("prices_clean.csv", parse_dates=["Date"]).sort_values("Date")
df["Lag1"] = df["Return"].shift(1); df["Lag2"] = df["Return"].shift(2); df["Vol20"] = df["Return"].rolling(20).std()
d = df.dropna()
def fit(cols):
    X = sm.add_constant(d[cols]); y = d["Return"]
    return sm.OLS(y, X).fit()
m1 = fit(["Lag1","Lag2"]); m2 = fit(["Lag1","Lag2","Vol20"])
print("M1:", dict(adjR2=m1.rsquared_adj, AIC=m1.aic, BIC=m1.bic))
print("M2:", dict(adjR2=m2.rsquared_adj, AIC=m2.aic, BIC=m2.bic))
Polynomial regression (Task 5) — degree 2 (keep it simple)
import pandas as pd, statsmodels.api as sm
df = pd.read_csv("prices_clean.csv", parse_dates=["Date"]).dropna().sort_values("Date")
# Predict asset return using a quadratic of benchmark return
df["BenchRet"] = df["Return"].shift(0) # or use benchmark return if you merged it
X = pd.DataFrame({"BenchRet": df["BenchRet"], "BenchRet2": df["BenchRet"]**2}).dropna()
y = df.loc[X.index, "Return"]
X = sm.add_constant(X)
m = sm.OLS(y, X).fit()
print(m.summary())
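Task 5 also asks for a chart of the fitted curve over the scatter; a minimal matplotlib sketch reusing m, X, and y from the block above:
import numpy as np, matplotlib.pyplot as plt
order = np.argsort(X["BenchRet"].values)             # sort x so the fitted line plots smoothly
plt.scatter(X["BenchRet"], y, s=8, alpha=0.4, label="daily returns")
plt.plot(X["BenchRet"].values[order], m.fittedvalues.values[order], color="red", label="quadratic fit")
plt.xlabel("Benchmark return"); plt.ylabel("Asset return"); plt.legend(); plt.show()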
Logistic regression (Task 6) — classify Up vs Down
import pandas as pd, numpy as np, statsmodels.api as sm
df = pd.read_csv("prices_clean.csv", parse_dates=["Date"]).sort_values("Date")
df["Up"] = (df["Return"] > 0).astype(int)
df["Lag1"] = df["Return"].shift(1)
d = df.dropna(subset=["Up","Lag1"])
X = sm.add_constant(d[["Lag1"]]) # you may add one macro predictor if available
y = d["Up"]
logit = sm.Logit(y, X).fit(disp=False)
print(logit.summary())
pred = (logit.predict(X) >= 0.5).astype(int)
acc = (pred == y).mean()
print("Accuracy:", round(float(acc),3))
Hypothesis Playbook (use for Tasks 1–2 and wherever relevant)
Formulating H0 and H1 (templates)
- Define population/time-frame: “daily returns over the last ~36 months”.
- Task 1 (ANOVA) — H0: mean return is the same across all groups; H1: at least one group’s mean differs.
- Task 2 (Chi‑Square) — H0: direction (Up/Down) is independent of the factor; H1: they are dependent.
- Regression — H0: chosen coefficient = 0; H1: coefficient ≠ 0 (two‑tailed unless justified).
Significance level, decision, reporting
- Set α = 0.05 (unless justified).
- Decision rule: if p ≤ α → reject H0; else fail to reject H0.
- Provide the relevant test statistic(s), p‑value(s), and a plain‑English sentence for a manager audience. (Effect sizes are not required for Part 2.)
Logistic regression (classification) notes
- Outcome: Up (1) vs Down (0). Predictors: e.g., Lag1 return (and optionally one macro).
- Report coefficient signs and a simple performance metric (e.g., accuracy). No ROC/AUC required.
- Watch for class imbalance; if heavily imbalanced, note it briefly.
Assumptions checklist (a diagnostics sketch follows)
- ANOVA: approx. normal residuals; similar variances (Levene); independent observations.
- Chi‑Square: expected counts ≥ 5 in most cells; mutually exclusive categories.
- Regression: linearity, independent errors, homoskedasticity, approx. normal residuals.
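A minimal diagnostics sketch for these checks; df, m, and exp are assumed to come from the Task 1–3 snippets above:
import scipy.stats as st, statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson
# ANOVA: similar variances across groups (Levene), using the Weekday factor from Task 1
groups = [g["Return"].dropna().values for _, g in df.groupby("Weekday")]
print("Levene:", st.levene(*groups))
# Chi-Square: expected counts from chi2_contingency should be roughly >= 5 per cell
print("Min expected count:", exp.min())
# Regression: residual normality (QQ plot) and error independence (Durbin-Watson)
sm.qqplot(m.resid, line="s")
print("Durbin-Watson:", durbin_watson(m.resid))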
Variables You Will Construct (a construction sketch follows this list)
- Daily return: (P_t - P_{t-1}) / P_{t-1}, or the log‑return
- Up/Down label: Up if return > 0, else Down
- Day‑of‑Week (DoW): Monday … Friday
- Regime: Bull vs Bear via 50‑SMA vs 200‑SMA
- Volatility proxy: rolling 20‑day stdev of returns
- Momentum: prior 20‑day return
- Volume z‑score: (Volume − mean)/stdev across a rolling window
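A minimal construction sketch for these variables, assuming prices_clean.csv with Date, Close, Return (and, if exported, Volume) columns; the 60-day z-score window is an assumption:
import numpy as np, pandas as pd
df = pd.read_csv("prices_clean.csv", parse_dates=["Date"]).sort_values("Date")
df["UpDown"] = np.where(df["Return"] > 0, "Up", "Down")
df["DoW"] = df["Date"].dt.day_name()
# Bull/Bear regime via 50-day vs 200-day SMA (in practice compute this on the benchmark)
df["SMA50"], df["SMA200"] = df["Close"].rolling(50).mean(), df["Close"].rolling(200).mean()
df["Regime"] = np.where(df["SMA50"] > df["SMA200"], "Bull", "Bear")  # first ~200 rows are unreliable; drop before grouping
df["Vol20"] = df["Return"].rolling(20).std()         # volatility proxy
df["Mom20"] = df["Close"].pct_change(20)             # prior 20-day return (momentum)
if "Volume" in df.columns:                           # volume z-score over a rolling window
    roll = df["Volume"].rolling(60)
    df["VolZ"] = (df["Volume"] - roll.mean()) / roll.std()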
Tasks
Task 1 — ANOVA Module 5
- Factor (choose one): Day‑of‑Week; or Regime (benchmark 50‑SMA above/below 200‑SMA); or Volatility terciles.
- Response: Daily return of one chosen asset.
- Deliverables: ANOVA table, diagnostics, 1‑paragraph investor takeaway.
Task 2 — Chi‑Square (Independence) Module 6
- Is Up/Down independent of Day‑of‑Week (or Regime)? Provide contingency table, expected counts, χ² & p‑value; interpret.
Task 3 — Simple Linear Regression Module 7
- Model: AssetReturn = α + β × BenchmarkReturn + ε. Report α, β, R²; check residuals; explain β (sensitivity).
- Benchmark examples: SPY, QQQ.
Task 4 — Multiple Regression (REQUIRES 1–2 macro drivers) Modules 7–8
- Response: Portfolio return (equal‑weight of your assets) or a single asset.
- Predictors: Benchmark return plus 1–2 macro drivers chosen from the list in Data (ΔVIX, oil return, Δ10Y, USD return); see the macro-merge sketch after this list.
- Fit ≥3 model variants; select “best” via R²/Adj‑R²/AIC; justify choice; write a short manager memo.
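A minimal macro-merge and comparison sketch, using the ETF proxies listed under Data (VIXY, IEF, UUP) and NVDA/SPY as assumed example tickers:
import yfinance as yf, statsmodels.api as sm
px = yf.download(["NVDA", "SPY", "VIXY", "IEF", "UUP"], period="3y", interval="1d", auto_adjust=True)["Close"]
rets = px.pct_change().dropna()                      # aligned daily % changes, NAs removed
y = rets["NVDA"]                                     # or an equal-weight average of your assets' returns
specs = {"benchmark only": ["SPY"], "+ VIXY": ["SPY", "VIXY"], "+ VIXY + IEF": ["SPY", "VIXY", "IEF"]}
for name, cols in specs.items():
    m = sm.OLS(y, sm.add_constant(rets[cols])).fit()
    print(name, "Adj-R2:", round(m.rsquared_adj, 4), "AIC:", round(m.aic, 2))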
Task 5 — Polynomial Regression (degree 2) Module 9
- Augment a simple regression with a quadratic term (degree 2) to capture mild curvature.
- Model: Return = α + β₁·Benchmark + β₂·Benchmark² + ε (keep predictors minimal).
- Deliverables: coefficients, p‑values, R²/Adj‑R², and a chart overlaying the fitted curve on the scatter.
Task 6 — Logistic Regression (Up vs Down) Module 10
- Classify daily direction (Up=1, Down=0) using a simple logistic model.
- Predictors: Lag1 return (and optionally one macro driver). Keep it simple.
- Deliverables: coefficient signs, a confusion matrix (2×2), and accuracy. No ROC/AUC required.
Marking: T1–T4: 20 pts each (5% each). T5–T6: 10 pts each (2.5% each). Optional Bonus: up to +5 pts (does not exceed 100 pts / 25%).
Guiding Questions (submit answers in a Word file)
Answer Q1–Q14 (plus Q15–Q18 for Tasks 5–6) succinctly in a Word document; name it per the convention below.
- Choice & rationale
Q1. Which assets (1–3) did you pick and why are they economically related?
Q2. Which benchmark fits best and why not the alternatives listed?
- Data hygiene
Q3. Show how you aligned dates and removed NAs when assets/benchmarks had different trading days.
Q4. Did you use returns/changes (not levels) for regressions? Show one row example.
- Task 1 — ANOVA
Q5. Which factor did you choose (DoW, Regime, or Vol‑terciles) and why is it sensible for your asset?
Q6. What does the ANOVA say and what is the single‑sentence “so what?”
- Task 2 — Chi‑Square
Q7. Build the Up/Down vs chosen factor table. Are counts adequate in each cell?
Q8. What do the χ² statistic and the p‑value tell you? State your interpretation.
- Task 3 — Simple regression
Q9. What are α, β, and R²? Provide a one‑sentence PM interpretation for β and R².
Q10. Show one diagnostic (residual plot or normal QQ) and comment.
- Task 4 — Multiple regression (REQUIRES 1–2 macros)
Q11. Which macro drivers did you choose (ΔVIX, oil return, Δ10Y, USD return) and why do they matter for your assets?
Q12. Compare three model variants (benchmark‑only vs +macro1 vs +macro1+macro2). Which wins and why (R²/Adj‑R²/AIC)?
Q13. What would change your conclusion (regime change, out‑of‑sample check)?
- Task 5 — Polynomial regression
Q15. Why is a quadratic term plausible for your asset/benchmark relation? Show the fitted curve.
Q16. Did the quadratic term materially improve Adj‑R² or AIC? Briefly justify.
- Task 6 — Logistic regression
Q17. What predictors did you use and why? Report the confusion matrix and accuracy.
Q18. How would misclassification risk affect a simple trading decision?
- Communication
Q14. What’s your final manager memo (3–4 bullets) that a non‑technical audience can use?
Visualisation Requirements
- Task 1: box/violin + mean markers per group
- Task 2: stacked bar or mosaic plot
- Task 3: scatter with fitted line + residual plot
- Task 4: coefficient table/plot + residual diagnostics
- Task 5: scatter with the fitted quadratic curve overlaid
- Task 6: confusion matrix (2×2)
Always label axes/legends with tickers, units, and date range; a plotting sketch follows.
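A minimal sketch for the Task 1 box plot with mean markers (the Weekday factor and ticker label are assumptions; adapt to your chosen factor):
import pandas as pd, matplotlib.pyplot as plt
df = pd.read_csv("prices_clean.csv", parse_dates=["Date"]).dropna()
df["Weekday"] = df["Date"].dt.day_name()
order = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"]
data = [df.loc[df["Weekday"] == d, "Return"] for d in order]
plt.boxplot(data, labels=order, showmeans=True)      # showmeans adds a mean marker per group
plt.ylabel("Daily return (QQQ, ~36 months)")         # label ticker, units, and date range
plt.title("Daily returns by day of week")
plt.show()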
Delivery Template (copy into a Word file)
Use the following outline verbatim in your Word document. Keep the headings and order. Replace ALL CAPS placeholders with your details. Submit the completed Word file to Blackboard.
Title: CP2403/CP3413 — Assignment Part 2 (Portfolio Analytics Edition)
STUDENT NAME: FIRSTNAME LASTNAME
STUDENT ID: XXXXXXXXX
PRACTICAL GROUP: A / B / C
DATE: YYYY-MM-DD
1. CHOICES & RATIONALE
1.1 Assets (1–3): TICKER1, TICKER2, ...
Rationale (1–2 sentences): ...
1.2 Benchmark (ETF): SPY / QQQ / IWM
Rationale (1 sentence): ...
2. DATA ACQUISITION & HYGIENE
2.1 Source & Window: Google Sheets / yfinance, ~36 months ending TODAY
2.2 Cleaning Steps: aligned trading days, removed NAs, outliers handling (if any)
2.3 Files Provided: prices CSV(s) attached (filenames)
3. TASK 1 — ANOVA (Day-of-Week / Regime / Vol-terciles)
3.1 Hypotheses (H0/H1): ...
3.2 Method & Diagnostics (1–2 lines): ...
3.3 Results (F, p-value): ...
3.4 Investor Takeaway (1 paragraph): ...
4. TASK 2 — Chi‑Square (Independence)
4.1 Table (Up/Down vs Factor): (paste table or screenshot)
4.2 Result (χ², p-value): ...
4.3 Interpretation (1–2 lines): ...
5. TASK 3 — Simple Linear Regression
5.1 Model: AssetReturn = α + β × BenchmarkReturn + ε
5.2 Results (α, β, R²): ...
5.3 Diagnostic (1 plot + 1 sentence): ...
5.4 Interpretation for PM (1–2 lines): ...
6. TASK 4 — Multiple Regression (with 1–2 macro drivers)
6.1 Predictors used: Benchmark + MACRO1 (+ MACRO2)
6.2 Model Comparison (≥3 specs): brief table with Adj-R² / AIC
6.3 Chosen Model & Why: ...
6.4 Manager Memo (3–4 bullets): ...
7. TASK 5 — Polynomial Regression (degree 2)
7.1 Why quadratic is plausible (1–2 lines): ...
7.2 Results (βs, Adj‑R²/AIC) + fitted curve screenshot: ...
7.3 Takeaway (1–2 lines): ...
8. TASK 6 — Logistic Regression (Up vs Down)
8.1 Predictors: Lag1 (+ optional macro)
8.2 Results: coefficients, confusion matrix (2×2), accuracy
8.3 Takeaway (1–2 lines): ...
9. VISUALS
9.1 T1 figure(s): ...
9.2 T2 figure(s): ...
9.3 T3 figure(s): ...
9.4 T4 figure(s): ...
9.5 T5/T6: ...
10. REPRODUCIBILITY CHECKLIST
- I included the CSV(s) used and named files per convention.
- All charts label axes/units/date range.
- I answered Q1–Q14 (and Q15–Q18 if applicable) clearly.
- I stated any assumptions or limitations briefly.
APPENDIX A — Key Tables/Printouts
APPENDIX B — Code snippets (optional)
File Naming (enforce exactly)
- Main answers (this template): task-answers-FirstNameLastName-StudentID.docx
- Task 1: task1-ANOVA-FirstNameLastName-StudentID.docx
- Task 2: task2-ChiSquare-FirstNameLastName-StudentID.docx
- Task 3: task3-LinearReg-FirstNameLastName-StudentID.docx
- Task 4: task4-MultipleReg-FirstNameLastName-StudentID.docx
- Task 5: task5-PolynomialReg-FirstNameLastName-StudentID.docx
- Task 6: task6-LogisticReg-FirstNameLastName-StudentID.docx
Assessment criteria
- Correctness of method & results reported for each task.
- Clarity: concise 1–2 line interpretations; manager memo quality.
- Reproducibility: CSV(s) provided, naming followed, charts labelled.
- Professionalism: formatting consistent with this template.
Submission & Naming
Submit to LearnJCU. Include your CSV(s) used. If you hosted data on GitHub (optional), paste the CSV link into your Word file for reproducibility. Deliverables:
- Word (answers): responses to Q1–Q14 — name: task-answers-FirstNameLastName-StudentID.docx
- Task 1: task1-ANOVA-FirstNameLastName-StudentID.docx (and ...ipynb if used)
- Task 2: task2-ChiSquare-FirstNameLastName-StudentID.docx (and ...ipynb if used)
- Task 3: task3-LinearReg-FirstNameLastName-StudentID.docx (and ...ipynb if used)
- Task 4: task4-MultipleReg-FirstNameLastName-StudentID.docx (and ...ipynb if used)
- Task 5: task5-PolynomialReg-FirstNameLastName-StudentID.docx (and ...ipynb if used)
- Task 6: task6-LogisticReg-FirstNameLastName-StudentID.docx (and ...ipynb if used)
Glossary
Core finance & analysis terms (for this brief)
DoW (Day‑of‑Week)
Weekday (Mon–Fri) for each row; used to group/compare returns.
Regime
Market state, e.g., Bull vs Bear via 50‑SMA vs 200‑SMA on the benchmark.
Remove NAs
Drop/fill rows with missing values after aligning dates across tickers.
Return vs Log‑return
Simple: (Pₜ − Pₜ₋₁)/Pₜ₋₁; Log: ln(Pₜ/Pₜ₋₁) — additive over time.
α (Alpha)
Intercept: average return unexplained by the benchmark; daily units.
β (Beta)
Sensitivity to the benchmark; slope of asset on benchmark returns.
Residuals
Errors (actual − fitted). Random residuals suggest model adequacy.
Heteroskedasticity
Residual variance changes with level/fitted value (fan‑shape in plots).
Autocorrelation
Residuals correlated over time; check with Durbin‑Watson or ACF plot.
Multicollinearity
Predictors highly correlated; inflates SEs. Check VIFs; drop/transform.
R² / Adjusted R²
Explained variance; adjusted penalises extra predictors.
AIC / BIC
Model scores that penalise complexity; lower is better (same data).
t‑stat / p‑value
Signal‑to‑noise and its significance probability for a coefficient.
Confidence Interval
Range that plausibly contains the true coefficient value (e.g., 95%).
QQ Plot
Graph to assess if residuals are approximately normal.
Leverage & Influence
Points that can sway the fit; inspect Cook’s distance.
Standardisation
Scale variables (z‑score) for comparability of coefficients.
Winsorise / Outliers
Cap extreme values to reduce undue influence; report if used.
Stationarity
Statistical properties don't change over time; returns are closer to stationary than prices are.
Overfitting
Model fits noise; keep parsimonious, validate out‑of‑sample.
Statistics Glossary (what to report for each test)
Report the key test statistic(s), p‑value(s), fit metrics where relevant, and a short, manager‑friendly conclusion. Effect sizes are not required for Part 2.
ANOVA (one‑way)
- Hypothesis test: F‑test of mean differences across groups.
- Report: F‑statistic, p‑value, brief interpretation; include basic diagnostics (residuals, variance similarity).
Chi‑Square (independence)
- Hypothesis test: Pearson’s χ² on a contingency table; H0: variables are independent.
- Report: contingency table, χ², p‑value, and a 1–2 sentence interpretation; ensure expected counts are adequate.
Regression (OLS)
- Hypothesis tests: coefficient t‑tests (H0: β=0), overall F‑test.
- Report: coefficients, standard errors, t‑stats, p‑values, R²/Adj‑R², 95% CI for key β’s.
- Diagnostics: linearity, homoskedasticity, residual normality (QQ‑plot).
Reminder: the code snippets above are only examples. You may use them or others — just justify your choices briefly.
Bonus Exploration (Optional, up to +5 pts)
Advanced measures not required in Weeks 5–10. Attempt only if you wish to go beyond core requirements.
- For χ², compute and discuss Cramér’s V.
- For ANOVA, compute and discuss η² or ω².
- For logistic, explore a ROC/AUC sketch or threshold tuning. (Optional)
Bonus points cannot push total above the Part 2 maximum (25%).