In this piece, I looked at 6 stocks over a 3-month period (July-September 2025) using multiple forecasting approaches to identify trading patterns and evaluate predictive model performance. The analysis examined Apple (AAPL), Google (GOOGL), Microsoft (MSFT), Tesla (TSLA), Netflix (NFLX), and Nvidia (NVDA), comparing naive baselines against machine learning models to determine which approaches best capture short-term price movements.
Tesla exhibits a consistent Friday effect with positive returns occurring 65% more frequently on Fridays compared to other weekdays across July-September 2025
Day-of-week seasonality dominates individual stock characteristics: Seasonal Naive or Naive Baseline models outperformed the more sophisticated machine learning approaches for 4 out of 6 stocks, indicating either predictable weekly trading patterns or simply that simpler models are the better tool for the job, at least for this analysis
End-of-week bias spans the entire tech portfolio: Apple, Tesla, and Nvidia show systematically higher returns on Fridays, suggesting sector-wide behavioral trading patterns rather than stock-specific anomalies
Tesla and Nvidia demonstrate 3x higher daily price swings (±3%+ moves) compared to more mature tech stocks like Microsoft and Google, creating distinct forecasting challenges. Note that Google still tops the list when volatility is measured as the standard deviation of the closing price relative to its mean (the coefficient of variation), even though it is not the most volatile in day-over-day % price changes
Concretely, when scaled to each stock's mean close price, the most volatile stock is Google (~12% coefficient of variation), followed by Tesla (~11.5%) and Apple (~6.4%), with Microsoft the least volatile (~1.9%)
Initial Setup
After setting up the .venv using uv init, we start by importing the needed packages and setting up what we need.
```python
import yfinance as yf
import pandas as pd
import polars as pl
import polars.selectors as cs
import great_tables as gt
from great_tables import GT, md, style, loc
import matplotlib.pyplot as plt
from datetime import datetime, timedelta, date
import numpy as np
import calendar
import sys
import logging
import pprint
from IPython.display import display, HTML
from itertools import starmap
```
Here I set up custom HTML-styled {logging} for enhanced visual output in Quarto. While logging is typically more valuable for automated scheduled jobs to track errors and events with timestamps, here it provides highly customizable, visually appealing messages compared to 'basic' print() statements, taking advantage of Quarto's ability to render custom CSS/HTML.
```python
# custom handler that outputs styled HTML
class StyledJupyterHandler(logging.StreamHandler):
    def __init__(self):
        super().__init__(sys.stdout)

    def emit(self, record):
        timestamp = datetime.now().strftime('%H:%M:%S')
        level = record.levelname
        message = record.getMessage()
        # style based on log level
        if level == 'INFO':
            color = '#28a745'  # green
            icon = 'ℹ️'
        elif level == 'WARNING':
            color = '#ffc107'  # yellow
            icon = '⚠️'
        elif level == 'ERROR':
            color = '#dc3545'  # red
            icon = '❌'
        else:
            color = '#6c757d'  # gray
            icon = '•'
        html_output = f"""
        <div style="
            background-color: {color}15;
            border-left: 4px solid {color};
            padding: 8px 12px;
            margin: 4px 0;
            border-radius: 4px;
            font-family: 'Monaco', 'Menlo', 'Ubuntu Mono', monospace;
            font-size: 13px;">
            <span style="color: {color}; font-weight: bold;">{icon} {level}</span>
            <span style="color: #6c757d; margin: 0 8px;">{timestamp}</span>
            <span style="color: #333;">{message}</span>
        </div>
        """
        display(HTML(html_output))

# set up logger
logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)
# clear existing handlers and add only one; otherwise messages can repeat
if logger.handlers:
    logger.handlers.clear()
logger.addHandler(StyledJupyterHandler())
```
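As a quick illustration, any standard logging call now renders as a color-coded HTML card in the document rather than plain text (the messages here are placeholders):

```python
# each call renders as a styled HTML block in the Quarto output
logger.info("pipeline step completed")
logger.warning("something looks off, but continuing")
logger.error("something actually failed")
```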
Fetching stock data from the Yahoo Finance API
```python
# define the watchlist
tickers = ['AAPL', 'GOOGL', 'MSFT', 'TSLA', 'NFLX', 'NVDA']
period = "3mo"  # 3 months of data

# looping thru tickers
def download_stocks_data(ticker):
    try:
        # time bounding the pull here so my analysis stays the same/matches the outputs/results a few months from now
        stock_data = yf.download(ticker, start='2025-07-01', end='2025-09-26', progress=False)
        logger.info(f'Downloaded {ticker}: {len(stock_data)} days')
        return (ticker, stock_data)
    except Exception as e:
        logger.error(f'Failed to download {ticker}: {e}')
        return (ticker, None)

results = list(map(download_stocks_data, tickers))
```
ℹ️ INFO · 10:31:41 · Downloaded AAPL: 61 days
ℹ️ INFO · 10:31:41 · Downloaded GOOGL: 61 days
ℹ️ INFO · 10:31:42 · Downloaded MSFT: 61 days
ℹ️ INFO · 10:31:43 · Downloaded TSLA: 61 days
ℹ️ INFO · 10:31:43 · Downloaded NFLX: 61 days
ℹ️ INFO · 10:31:44 · Downloaded NVDA: 61 days
Here I call the yfinance API to pull stock prices for six major technology stocks: Apple (AAPL), Google (GOOGL), Microsoft (MSFT), Tesla (TSLA), Netflix (NFLX), and Nvidia (NVDA). Note that this analysis is based on data pulled for July-September 2025. Results will vary if the code is run with different or dynamic time periods; the point here is to create a fixed historical snapshot, as all my comments/analyses are based on the July-September 2025 period.
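If you wanted a rolling snapshot instead, the period variable defined above could be used in place of the fixed start/end dates. A sketch of that variant (download_stocks_data_dynamic is hypothetical and not used here; it would change all downstream numbers):

```python
# dynamic alternative (NOT used in this analysis): always pull the trailing 3 months
def download_stocks_data_dynamic(ticker):
    stock_data = yf.download(ticker, period=period, progress=False)  # period = "3mo"
    return (ticker, stock_data)
```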
Display of data pull using great-tables
Here I decide to use {great-tables}, after running uv add great-tables in the terminal and import great_tables as gt in the script. It is a great way to display summary tables. But first, let's convert the results from pandas to polars.
```python
def convert_to_polars(result):
    ticker, stock_data = result
    if stock_data is not None and not stock_data.empty:
        if isinstance(stock_data.columns, pd.MultiIndex):
            # remove ticker level/unnest the MultiIndex struct from yahoo download
            stock_data.columns = stock_data.columns.droplevel(1)
        pl_stock = pl.from_pandas(stock_data.reset_index())
        pl_stock = pl_stock.with_columns(pl.lit(ticker).alias('ticker'))
        return pl_stock
    return None

# loop thru all data
all_data = list(map(convert_to_polars, results))

# filter out any None/Nulls
complete_data = [data for data in all_data if data is not None]

# concatenate all data together into a polars object
pl_results = pl.concat(complete_data, how="vertical")

# rearrange columns and sort descending dates and ticker; also adding new column dollar_volume
pl_results = (
    pl_results
    .select(['Date', 'ticker', 'Close', 'High', 'Low', 'Open', 'Volume'])
    .sort(['Date', 'ticker'], descending=[True, False])
    .with_columns(
        (pl.col('Volume') * pl.col('Close')).alias('$volume')
    )
)

# build a gt function that formats numeric data, cleans column names, adds a title/subtitle (if provided)
# and customizes the overall theme similar to NYT
def gt_nyt_custom(x, title='', subtitle='', first_10_rows_only=True):
    import polars as pl
    from great_tables import GT, md, style, loc

    # clean column names to title case (similar to clean_names)
    x = x.rename({col: col.replace('_', ' ').title() for col in x.columns})

    # identify numeric columns (float and integer)
    numeric_cols = [col for col in x.columns if x[col].dtype in [pl.Float64, pl.Float32]]
    integer_cols = [col for col in x.columns if x[col].dtype in [pl.Int64, pl.Int32, pl.Int16, pl.Int8]]

    # handle currency columns - check if specific columns exist
    currency_cols = []
    volume_cols = []
    date_cols = []
    for col in numeric_cols:
        if '$volume' in col.lower() or 'volume' in col.lower():
            volume_cols.append(col)
        else:
            currency_cols.append(col)

    # check for date columns
    for col in x.columns:
        if 'date' in col.lower() or x[col].dtype == pl.Date:
            date_cols.append(col)

    # format title and subtitle
    title_fmt = f"**{title}**" if title != "" else ""
    subtitle_fmt = f"*{subtitle}*" if subtitle != "" else ""

    # apply first_10_rows_only filter
    if first_10_rows_only:
        x = x.head(10)

    # create gt table and apply styling
    gt_table = (
        GT(x)
        .tab_header(title=md(title_fmt), subtitle=md(subtitle_fmt))
        .tab_style(style=style.text(color='#333333'), locations=loc.body())
        .tab_style(style=style.text(color='#CC6600'), locations=loc.column_labels())
        .tab_options(
            table_font_names=['Merriweather', 'Georgia', 'serif'],
            table_font_size='14px',
            heading_title_font_size='18px',
            heading_subtitle_font_size='14px',
            column_labels_font_weight='bold',
            column_labels_background_color='#eeeeee',
            table_border_top_color='#dddddd',
            table_border_bottom_color='#dddddd',
            data_row_padding='6px',
            row_striping_include_table_body=True,
            row_striping_background_color='#f9f9f9',
        )
    )

    # conditionally apply formatting based on column existence
    if currency_cols:
        gt_table = gt_table.fmt_currency(columns=currency_cols, decimals=1, currency='USD')
    if volume_cols:
        gt_table = gt_table.fmt_currency(columns=volume_cols, decimals=1, currency='USD', compact=True)
    if integer_cols:
        gt_table = gt_table.fmt_number(columns=integer_cols, decimals=0)
    if date_cols:
        gt_table = gt_table.fmt_date(columns=date_cols, date_style='year.mn.day')

    return gt_table

styled_table = (
    gt_nyt_custom(
        pl_results,
        title="Stock Market Data",
        subtitle="3 Month Pull (Only 10 records shown)",
        first_10_rows_only=True
    )
    .tab_style(
        style=style.text(align='left'),
        locations=loc.column_labels()
    )
)
styled_table
```
**Stock Market Data**
*3 Month Pull (Only 10 records shown)*

| Date | Ticker | Close | High | Low | Open | Volume | $Volume |
|------------|--------|----------|----------|----------|----------|-------------|--------|
| 2025/09/25 | AAPL | $256.9 | $257.2 | $251.7 | $253.2 | 55,202,100 | $14.2B |
| 2025/09/25 | GOOGL | $245.8 | $246.5 | $240.7 | $244.4 | 31,020,400 | $7.6B |
| 2025/09/25 | MSFT | $507.0 | $510.0 | $505.0 | $508.3 | 15,786,500 | $8.0B |
| 2025/09/25 | NFLX | $1,208.2 | $1,216.8 | $1,191.5 | $1,203.1 | 1,997,800 | $2.4B |
| 2025/09/25 | NVDA | $177.7 | $180.3 | $173.1 | $174.5 | 191,586,700 | $34.0B |
| 2025/09/25 | TSLA | $423.4 | $435.4 | $419.1 | $435.2 | 96,746,400 | $41.0B |
| 2025/09/24 | AAPL | $252.3 | $255.7 | $251.0 | $255.2 | 42,303,700 | $10.7B |
| 2025/09/24 | GOOGL | $247.1 | $252.4 | $246.4 | $251.7 | 28,201,000 | $7.0B |
| 2025/09/24 | MSFT | $510.1 | $512.5 | $506.9 | $510.4 | 13,533,700 | $6.9B |
| 2025/09/24 | NFLX | $1,203.9 | $1,221.5 | $1,194.2 | $1,218.6 | 2,773,100 | $3.3B |
Plotting Trends of Daily Close Price
Below I start with designing the building blocks for the visualizations. By building blocks, I concretely mean 'embedding' a ggplot2-like theme, as I am a fan of that overall layout (I can't be the only one). The global style block below sets a modular/reusable set of values that both plot_single_comparison_mpl() and plot_stock_facets_mpl() then rely on for their theme; a dictionary-based create_ggplot_theme() variant is sketched right after the code.
```python
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
from matplotlib.ticker import FuncFormatter

# set ggplot2-like style globally
plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams['font.family'] = 'serif'
plt.rcParams['font.serif'] = ['Georgia', 'Times New Roman']
plt.rcParams['font.size'] = 11
plt.rcParams['axes.facecolor'] = '#F8F8F8'
plt.rcParams['figure.facecolor'] = '#F8F8F8'
plt.rcParams['axes.edgecolor'] = 'none'
plt.rcParams['axes.linewidth'] = 0
plt.rcParams['grid.color'] = '#E5E5E5'
plt.rcParams['grid.linewidth'] = 0.5

def plot_single_comparison_mpl(data, title="Stock Price Comparison", height=6):
    """
    create a single plot with all stocks for comparison - ggplot2 style
    """
    fig, ax = plt.subplots(figsize=(12, height))
    tickers = sorted(data['ticker'].unique().to_list())
    colors = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', '#9467bd', '#8c564b']

    for i, ticker in enumerate(tickers):
        ticker_data = data.filter(pl.col('ticker') == ticker).sort('Date')
        ax.plot(ticker_data['Date'], ticker_data['Close'],
                label=ticker, color=colors[i], linewidth=2)

    ax.set_title(title, fontsize=16, pad=20, fontweight='normal')
    ax.set_ylabel('Close Price', fontsize=12)
    ax.legend(loc='upper center', bbox_to_anchor=(0.5, -0.1),
              ncol=6, frameon=False, fontsize=10)
    ax.grid(True, alpha=0.7, linewidth=0.5)

    # format y-axis as currency
    def currency_formatter(x, p):
        return f'${x:,.0f}'
    ax.yaxis.set_major_formatter(FuncFormatter(currency_formatter))

    # remove spines (borders)
    for spine in ax.spines.values():
        spine.set_visible(False)

    plt.tight_layout()
    return fig

# single comparison plot
fig = plot_single_comparison_mpl(
    pl_results,
    title="Stock Price Comparison Overall (3 Months)",
    height=5
)
plt.show()
```
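For those who prefer the dictionary approach mentioned above, here is a minimal sketch of what a create_ggplot_theme() helper could look like. The function is illustrative rather than part of the rendered pipeline; it simply bundles the same rcParams values set globally above:

```python
def create_ggplot_theme():
    """return a reusable dict of rcParams approximating the ggplot2 look"""
    return {
        'font.family': 'serif',
        'font.serif': ['Georgia', 'Times New Roman'],
        'font.size': 11,
        'axes.facecolor': '#F8F8F8',
        'figure.facecolor': '#F8F8F8',
        'axes.edgecolor': 'none',
        'axes.linewidth': 0,
        'grid.color': '#E5E5E5',
        'grid.linewidth': 0.5,
    }

# apply the theme globally (equivalent to the explicit assignments above)
plt.rcParams.update(create_ggplot_theme())
```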
As we can see above, the Netflix stock 'overshadows' any subtle variations in the prices of the remaining companies/stocks, simply because of the difference in scale/range of values. One way to remedy that is to facet on companies, essentially a 'GROUP BY TICKER' but for plots. This is what is shown next.
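The faceting helper mirrors the comparison plot above, with one subplot per ticker:

```python
def plot_stock_facets_mpl(data, title="Stock Performance by Ticker", height=8):
    """
    create faceted line charts for stock prices - ggplot2 style
    """
    tickers = sorted(data['ticker'].unique().to_list())
    n_tickers = len(tickers)

    # calculate subplot grid (2 columns)
    cols = 2
    rows = (n_tickers + cols - 1) // cols
    fig, axes = plt.subplots(rows, cols, figsize=(14, height))
    axes = axes.flatten()

    colors = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', '#9467bd', '#8c564b']
    ticker_names = ['Apple', 'Google', 'Microsoft', 'Netflix', 'Nvidia', 'Tesla']

    for i, ticker in enumerate(tickers):
        ticker_data = data.filter(pl.col('ticker') == ticker).sort('Date')
        axes[i].plot(ticker_data['Date'], ticker_data['Close'],
                     color=colors[i], linewidth=2)
        axes[i].set_title(f'{ticker_names[i]} ({ticker})', fontsize=14, pad=10, fontweight='bold')
        axes[i].grid(True, alpha=0.7, linewidth=0.5)

        # format y-axis as currency
        def currency_formatter(x, p):
            return f'${x:,.0f}'
        axes[i].yaxis.set_major_formatter(FuncFormatter(currency_formatter))

        # remove spines (borders)
        for spine in axes[i].spines.values():
            spine.set_visible(False)

    # hide extra subplots if any
    for j in range(i + 1, len(axes)):
        axes[j].set_visible(False)

    fig.suptitle(title, fontsize=16, y=0.995, fontweight='normal')
    plt.tight_layout()
    return fig

# faceted plot
fig = plot_stock_facets_mpl(
    pl_results,
    title="Stock Performance by Ticker (3 Months)",
    height=8
)
plt.show()
```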
Overall, despite the dips and volatility, and while a few grew more than others, every stock except Netflix rose in value over the period shown. Also, Google and Tesla follow similar trends. But speaking of volatility, let's gather more stats around the stock price evolution and tease out more insights overall.
```python
most_recent_date = pd.Timestamp('2025-09-26')  # fixed so the analysis stays the same/matches the outputs a few months from now

summary_stats = (
    pl_results
    .group_by(pl.col('ticker'))
    .agg(
        volatility_std = pl.col('Close').std(),
        # coefficient of variation (volatility relative to price)
        cv = (pl.col('Close').std() / pl.col('Close').mean()).alias('cv_percent'),
        average_price = pl.col('Close').mean(),
        # trailing window averages
        average_trailing_1mo = pl.col('Close')
            .filter(pl.col('Date') >= (most_recent_date - timedelta(days=30)))
            .mean(),
        average_trailing_2mo = pl.col('Close')
            .filter(pl.col('Date') >= (most_recent_date - timedelta(days=60)))
            .mean()
    )
    .with_columns(
        volatility_rank = pl.col('cv').rank(descending=True)
    )
)

# pull svgs to then render as markdown instead of displaying plain ticker text
def svg_to_html(ticker, width=80, height=60):
    """convert SVG file to inline HTML for gt tables"""
    # map tickers to filenames
    svg_files = {
        'AAPL' : 'apple-svgrepo-com.svg',
        'GOOGL': 'google-2015-logo-svgrepo-com.svg',
        'NFLX' : 'netflix-svgrepo-com.svg',
        'NVDA' : 'nvidia-logo-svgrepo-com.svg',
        'MSFT' : 'microsoft-logo-svgrepo-com.svg',
        'TSLA' : 'tesla-9-logo-svgrepo-com.svg'
    }
    if ticker in svg_files:
        with open(svg_files[ticker], 'r') as f:
            svg_content = f.read()
        return f'''<div style="width:{width}px;height:{height}px;overflow:hidden;display:flex;align-items:center;justify-content:center;">
            <div style="transform:scale(0.1);transform-origin:center;">{svg_content}</div>
        </div>'''
    else:
        return ticker  # else fallback to plain ticker text

# render logo column, select needed columns, and sort on volatility rank
stock_summary = (
    summary_stats
    .with_columns(
        logo = pl.col('ticker').map_elements(lambda x: svg_to_html(x), return_dtype=pl.Utf8)
    )
    .select(
        'logo', 'volatility_rank', 'volatility_std', 'cv',
        'average_price', 'average_trailing_1mo', 'average_trailing_2mo'
    )
    .sort('volatility_rank')
)

# gt output with logo markdown-rendered and more general styling
styled_stock_summary = (
    gt_nyt_custom(
        stock_summary,
        title = "On Volatility & Averages",
        subtitle = "",
        first_10_rows_only = True
    )
    .fmt_markdown(columns='Logo')
    .fmt_percent(columns='Cv')
    .fmt_integer(columns='Volatility Rank')
    .tab_style(
        style=style.text(color="black", size=16, align='center'),
        locations=loc.column_labels()
    )
    .tab_spanner(
        label = md('**Volatility Stats**'),
        columns = ['Volatility Std', 'Cv', 'Volatility Rank']
    )
    .tab_spanner(
        label = md('**Average Closing Price**'),
        columns = ['Average Price', 'Average Trailing 2Mo', 'Average Trailing 1Mo']
    )
    .tab_style(
        style=style.text(color='#2e8b57', weight='bold'),
        locations=gt.loc.body(
            columns = ['Volatility Std', 'Cv', 'Volatility Rank'],
            rows = (pl.col('Volatility Rank') == 6)
        )
    )
    .tab_style(
        style=style.text(color='#8b1a1a', weight='bold'),
        locations=gt.loc.body(
            columns = ['Volatility Std', 'Cv', 'Volatility Rank'],
            rows = (pl.col('Volatility Rank') == 1)
        )
    )
    .tab_source_note(
        source_note=md("""*Coef. of Variation (CV) = σ/μ × 100%, where σ is standard deviation and μ is the mean.<br>
        **Volatility Rank: Simple Rank based on Coef. of Variation (1 = most volatile, 6 = least volatile)""")
    )
    .cols_label(**{
        "Logo": "",
        "Volatility Std": "Std. Deviation",
        "Cv": "*Coef. of Variation",
        "Volatility Rank": "**Volatility Rank"
    })
)
styled_stock_summary
```
**On Volatility & Averages**

| Ticker | Std. Deviation | Coef. of Variation* | Volatility Rank** | Average Price | Average Trailing 2Mo | Average Trailing 1Mo |
|--------|----------------|---------------------|-------------------|---------------|----------------------|----------------------|
| GOOGL | $25.0 | 12.03% | 1 | $207.6 | $218.0 | $237.6 |
| TSLA | $39.2 | 11.45% | 2 | $342.1 | $353.8 | $381.7 |
| AAPL | $14.2 | 6.35% | 3 | $224.4 | $229.9 | $239.2 |
| NVDA | $7.1 | 4.09% | 4 | $173.9 | $177.2 | $175.2 |
| NFLX | $36.0 | 2.94% | 5 | $1,221.3 | $1,210.6 | $1,222.7 |
| MSFT | $9.6 | 1.88% | 6 | $509.4 | $512.1 | $507.0 |

\* Coef. of Variation (CV) = σ/μ × 100%, where σ is standard deviation and μ is the mean.
\*\* Volatility Rank: Simple Rank based on Coef. of Variation (1 = most volatile, 6 = least volatile)
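As a quick sanity check on the first row: a standard deviation of $25.0 against a mean close of $207.6 gives $\mathrm{CV} = \sigma/\mu = 25.0/207.6 \approx 12.0\%$, in line with the tabled 12.03% (the small gap comes from rounding of the displayed values).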
While the summaries above give us a better picture of the stock dynamics, one thing remains unclear: does the close price 'consistently' dip or rise on a given day of the week, month, etc.? This is where we look at seasonality and tease out whatever patterns exist.
Teasing out seasonality
The goal now is to generate a set of calendar plots where daily price changes drive the value/color gradient; this should clearly and visually surface any day-of-week seasonality (if any). We will also need to control for Saturdays/Sundays being off and for months that don't start on a Monday.
```python
def create_calendar_heatmap_mpl(df, stocks, months, vmin=-5, vmax=5, show_disclaimer=False):
    """
    create calendar heatmap using matplotlib
    """
    # convert to pandas
    if hasattr(df, "to_pandas"):
        df = df.to_pandas()

    # filter the data on ticker/stock and month
    dff = df[(df['ticker'].isin(stocks)) & (df['month'].isin(months))].copy()

    # add day, week, and values columns
    dff['Date'] = pd.to_datetime(dff['Date'])
    dff['day'] = dff['Date'].dt.day
    dff['weekday'] = dff['Date'].dt.dayofweek
    dff['values'] = dff['daily_change_perc'] * 100  # convert to percentage

    # filter out weekends
    dff = dff[dff['weekday'] < 5]

    # unique months sorted
    unique_months = sorted(dff['month'].unique())
    ncols = len(unique_months)

    # create subplot grid - months as columns (reduced width)
    month_names = {7: 'Jul', 8: 'Aug', 9: 'Sep'}
    fig, axes = plt.subplots(1, ncols, figsize=(10, 3.5))
    if ncols == 1:
        axes = [axes]

    stock_name_map = {
        'AAPL': 'Apple', 'GOOGL': 'Google', 'MSFT': 'Microsoft',
        'NFLX': 'Netflix', 'NVDA': 'Nvidia', 'TSLA': 'Tesla'
    }

    # add one heatmap per month
    for idx, m in enumerate(unique_months):
        dmonth = dff[dff['month'] == m]
        if len(dmonth) > 0:
            # create proper calendar layout
            first_day = pd.Timestamp(year=dmonth['Date'].dt.year.iloc[0], month=m, day=1)
            first_weekday = first_day.weekday()

            # create calendar grid
            calendar_grid = {}
            for _, day_data in dmonth.iterrows():
                day = day_data['day']
                weekday = day_data['weekday']
                week = ((first_weekday + day - 1) // 7)
                calendar_grid[(week, weekday)] = day_data['values']

            # create arrays for heatmap
            max_weeks = max([key[0] for key in calendar_grid.keys()]) + 1 if calendar_grid else 1
            z_data = np.full((max_weeks, 5), np.nan)
            text_data = np.full((max_weeks, 5), '', dtype=object)

            # fill arrays - no inversion
            for (week, weekday), value in calendar_grid.items():
                if weekday < 5:
                    z_data[week, weekday] = value
                    if not np.isnan(value):
                        text_data[week, weekday] = f'{value:.1f}'

            # plot heatmap
            im = axes[idx].imshow(z_data, cmap='cividis_r', aspect='auto', vmin=vmin, vmax=vmax)

            # add text annotations
            for week in range(max_weeks):
                for day in range(5):
                    if text_data[week, day]:
                        value = z_data[week, day]
                        text_color = 'white' if value > 3.5 else 'black'
                        axes[idx].text(
                            day, week, text_data[week, day],
                            ha='center', va='center', fontsize=11, color=text_color
                        )

            # set axis labels
            axes[idx].set_xticks(range(5))
            axes[idx].set_xticklabels(['Mon', 'Tue', 'Wed', 'Thu', 'Fri'], fontsize=10)
            axes[idx].set_yticks([])
            axes[idx].set_title(month_names[m], fontsize=12, pad=8, fontweight='normal')
            axes[idx].grid(False)

            # remove borders/spines
            for spine in axes[idx].spines.values():
                spine.set_visible(False)
            axes[idx].set_facecolor((0.973, 0.973, 0.973))

    # add colorbar (top centered, horizontal)
    cbar = fig.colorbar(
        im, ax=axes, orientation='horizontal',
        fraction=0.05, pad=0.35, aspect=25, location='top'
    )
    colorbar_title = (
        "Note that the US stock exchanges closed on Jul 3 (early) & Sep 1; due to Independence & Labor Day, respectively\n% Change in Daily Stock Price"
        if show_disclaimer else "% Change in Daily Stock Price"
    )
    cbar.set_label(colorbar_title, fontsize=12, weight='bold')
    cbar.ax.xaxis.set_label_position('top')

    # set main title (left aligned)
    title = ", ".join(f"{stock_name_map[ticker]}" for ticker in stocks)
    fig.text(0.125, 0.85, title, fontsize=16, fontweight='bold', ha='left', va='bottom')
    fig.patch.set_facecolor((0.973, 0.973, 0.973))
    plt.tight_layout(rect=[0, 0, 1, 0.78])
    return fig

# loop through stocks, only show disclaimer for apple
# calendar_data: daily data with 'month' and 'daily_change_perc' columns (built in an earlier step)
for i, ticker in enumerate(tickers):
    show_disclaimer = (ticker == 'AAPL')
    fig = create_calendar_heatmap_mpl(
        calendar_data, [ticker], [7, 8, 9],
        vmin=-5, vmax=5, show_disclaimer=show_disclaimer
    )
    plt.show()
```
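To make the calendar layout concrete: each day's row in the grid comes from week = (first_weekday + day - 1) // 7. September 2025, for instance, starts on a Monday (first_weekday = 0, which is also why the markets were closed for Labor Day on Sep 1), so day 1 falls in week 0 and day 8 in (0 + 8 - 1) // 7 = 1, the second row; a month starting mid-week simply shifts its first days to the right along the Mon-Fri columns.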
From the calendar plots above, a few patterns stand out:
Tesla shows a notable Friday effect: consistent positive performance (darker blueish tiles) appearing frequently on Fridays across multiple months (a quick way to quantify this is sketched after this list)
End-of-week effects across portfolio: Apple, Tesla, and Nvidia show more frequent positive returns on Fridays compared to other weekdays
High volatility stocks identified: Tesla and Nvidia exhibit frequent extreme daily moves (±3%+) while Netflix, Microsoft and Google show more stable patterns
Patterns persist across time: Friday effects (upticks in stock price) remain fairly consistent across July-September
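As referenced in the first bullet, here is a minimal sketch of how one could quantify the Friday effect directly from pl_results; the computation (share of positive days per weekday, per ticker) is an illustrative check, not part of the rendered pipeline.

```python
# share of positive days by weekday, per ticker (illustrative check of the Friday effect)
friday_check = (
    pl_results
    .sort(['ticker', 'Date'])
    .with_columns(
        daily_return = pl.col('Close').pct_change().over('ticker'),
        weekday = pl.col('Date').dt.weekday()  # 1 = Monday ... 7 = Sunday
    )
    .drop_nulls('daily_return')
    .group_by(['ticker', 'weekday'])
    .agg(pos_share = (pl.col('daily_return') > 0).mean())
    .sort(['ticker', 'weekday'])
)
print(friday_check.filter(pl.col('ticker') == 'TSLA'))
```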
Building the Forecasts
```python
# chunk 1: building forecast frameworks
from pathlib import Path
import joblib
import json

# import statements for all our models
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
import xgboost as xgb

def prepare_forecasting_data(pl_results):
    """
    convert polars data to pandas and prepare for forecasting
    """
    # convert to pandas for easier time series handling
    df = pl_results.to_pandas()
    # ensure Date is datetime
    df['Date'] = pd.to_datetime(df['Date'])
    # sort by ticker and date
    df = df.sort_values(['ticker', 'Date']).reset_index(drop=True)
    # add day of week features (for seasonality)
    df['day_of_week'] = df['Date'].dt.day_name()
    df['is_friday'] = (df['Date'].dt.weekday == 4).astype(int)
    df['is_monday'] = (df['Date'].dt.weekday == 0).astype(int)
    # calculate daily returns for analysis
    df['daily_return'] = df.groupby('ticker')['Close'].pct_change() * 100
    return df

def create_train_test_split(df, train_months=['July', 'August'], test_month='September'):
    """
    split data into training and testing sets
    """
    # create month names
    df['month_name'] = df['Date'].dt.strftime('%B')
    # split the data
    train_data = df[df['month_name'].isin(train_months)].copy()
    test_data = df[df['month_name'] == test_month].copy()
    # logger.info(f"training data: {train_data['Date'].min()} to {train_data['Date'].max()}")
    # logger.info(f"test data: {test_data['Date'].min()} to {test_data['Date'].max()}")
    logger.info("training observations by ticker:")
    for ticker, count in train_data.groupby('ticker').size().items():
        logger.info(f"  {ticker}: {count} observations")
    return train_data, test_data

class StockForecastingFramework:
    """
    comprehensive stock price forecasting framework with multiple models
    """
    def __init__(self, prediction_log_dir='predictions_log'):
        self.prediction_log_dir = Path(prediction_log_dir)
        self.prediction_log_dir.mkdir(exist_ok=True)
        # initialize model registry
        self.models = {}
        self.model_configs = {
            'naive_baseline': {'description': 'Previous day closing price'},
            'seasonal_naive': {'description': 'Same weekday last week price'},
            'linear_trend': {'description': 'Linear regression with day-of-week features'},
            'xgboost': {'description': 'XGBoost with engineered features'}
        }
        # prediction storage
        self.predictions_df = pd.DataFrame()
        self.model_performance = {}
        logger.info("StockForecastingFramework initialized")
        logger.info(f"prediction logs will be saved to: {self.prediction_log_dir}")

    def prepare_features(self, df, ticker):
        """
        create features for machine learning models
        """
        ticker_data = df[df['ticker'] == ticker].copy().sort_values('Date')
        # technical indicators
        ticker_data['sma_5'] = ticker_data['Close'].rolling(5).mean()
        ticker_data['sma_20'] = ticker_data['Close'].rolling(20).mean()
        ticker_data['volatility_5'] = ticker_data['Close'].rolling(5).std()
        # lag features
        for lag in [1, 2, 3, 5]:
            ticker_data[f'close_lag_{lag}'] = ticker_data['Close'].shift(lag)
            ticker_data[f'return_lag_{lag}'] = ticker_data['daily_return'].shift(lag)
        # day of week dummies
        ticker_data['monday'] = (ticker_data['Date'].dt.weekday == 0).astype(int)
        ticker_data['tuesday'] = (ticker_data['Date'].dt.weekday == 1).astype(int)
        ticker_data['wednesday'] = (ticker_data['Date'].dt.weekday == 2).astype(int)
        ticker_data['thursday'] = (ticker_data['Date'].dt.weekday == 3).astype(int)
        ticker_data['friday'] = (ticker_data['Date'].dt.weekday == 4).astype(int)
        # tesla friday effect (special feature based on the earlier calendar analysis)
        if ticker == 'TSLA':
            ticker_data['tesla_friday_effect'] = ticker_data['friday']
        else:
            ticker_data['tesla_friday_effect'] = 0
        return ticker_data.dropna()

    def log_prediction(self, prediction_date, target_date, ticker, model,
                       predicted_price, actual_price=None):
        """
        log prediction to tracking system
        """
        timestamp = datetime.now()
        # calculate errors if actual price is available
        absolute_error = None
        percentage_error = None
        direction_correct = None
        if actual_price is not None:
            absolute_error = abs(predicted_price - actual_price)
            percentage_error = (absolute_error / actual_price) * 100
            direction_correct = True
        # create prediction record
        prediction_record = {
            'timestamp': timestamp,
            'prediction_date': prediction_date,
            'target_date': target_date,
            'ticker': ticker,
            'model': model,
            'predicted_price': predicted_price,
            'actual_price': actual_price,
            'absolute_error': absolute_error,
            'percentage_error': percentage_error,
            'direction_correct': direction_correct
        }
        # add to internal storage
        self.predictions_df = pd.concat([
            self.predictions_df,
            pd.DataFrame([prediction_record])
        ], ignore_index=True)
        # log to file
        log_file = self.prediction_log_dir / 'stock_predictions.csv'
        pd.DataFrame([prediction_record]).to_csv(
            log_file, mode='a', header=not log_file.exists(), index=False
        )
        # fixed the syntax error here - was "matarget_date"
        # logger.info(f"logged prediction: {ticker} {model} -> ${predicted_price:.2f} for {target_date}")

    def evaluate_model_performance(self, model_name, predictions, actuals):
        """
        calculate comprehensive performance metrics
        """
        # convert to numpy arrays
        predictions = np.array(predictions)
        actuals = np.array(actuals)
        mae = mean_absolute_error(actuals, predictions)
        rmse = np.sqrt(mean_squared_error(actuals, predictions))
        mape = np.mean(np.abs((actuals - predictions) / actuals)) * 100
        self.model_performance[model_name] = {
            'MAE': mae,
            'RMSE': rmse,
            'MAPE': mape,
            'n_predictions': len(predictions)
        }
        logger.info(f"{model_name} performance - MAE: ${mae:.2f}, RMSE: ${rmse:.2f}, MAPE: {mape:.1f}%")
        return self.model_performance[model_name]

# initialize the framework
def setup_forecasting_framework():
    """
    setup the forecasting environment
    """
    logger.info("setting up forecasting framework...")
    # initialize framework
    framework = StockForecastingFramework()
    # prepare data (assuming forecast_df from previous step)
    logger.info("preparing forecasting data...")
    return framework

# chunk 2: baseline models (naive and seasonal naive forecasters)
class BaselineModels:
    """
    naive and seasonal naive forecasting models that serve as benchmarks
    """
    def __init__(self, framework):
        self.framework = framework
        logger.info("initializing baseline models...")

    def naive_forecast(self, train_data, test_dates, ticker):
        """
        naive forecast: next day price = today's price
        """
        ticker_train = train_data[train_data['ticker'] == ticker].sort_values('Date')
        if len(ticker_train) == 0:
            logger.warning(f"no training data found for {ticker}")
            return {}
        # get last known price
        last_price = ticker_train['Close'].iloc[-1]
        last_date = ticker_train['Date'].iloc[-1]
        predictions = {}
        for target_date in test_dates:
            predictions[target_date] = last_price
            # log the prediction
            self.framework.log_prediction(
                prediction_date=last_date,
                target_date=target_date,
                ticker=ticker,
                model='naive_baseline',
                predicted_price=last_price
            )
        return predictions

    def seasonal_naive_forecast(self, train_data, test_dates, ticker):
        """
        seasonal naive: next monday = last monday's price, etc.
        """
        ticker_train = train_data[train_data['ticker'] == ticker].sort_values('Date')
        if len(ticker_train) == 0:
            logger.warning(f"no training data found for {ticker}")
            return {}
        predictions = {}
        for target_date in test_dates:
            target_weekday = target_date.weekday()
            # find most recent day with same weekday
            same_weekday_data = ticker_train[
                ticker_train['Date'].dt.weekday == target_weekday
            ].sort_values('Date')
            if len(same_weekday_data) > 0:
                # use most recent same weekday price
                seasonal_price = same_weekday_data['Close'].iloc[-1]
                seasonal_date = same_weekday_data['Date'].iloc[-1]
            else:
                # fallback to naive if no same weekday found
                seasonal_price = ticker_train['Close'].iloc[-1]
                seasonal_date = ticker_train['Date'].iloc[-1]
            predictions[target_date] = seasonal_price
            # log the prediction
            self.framework.log_prediction(
                prediction_date=seasonal_date,
                target_date=target_date,
                ticker=ticker,
                model='seasonal_naive',
                predicted_price=seasonal_price
            )
        return predictions

    def run_baseline_forecasts(self, train_data, test_data):
        """
        run both baseline models for all tickers
        """
        logger.info("running baseline forecasts for all tickers...")
        tickers = train_data['ticker'].unique()
        test_dates = sorted(test_data['Date'].unique())
        all_predictions = {}
        for ticker in tickers:
            # run naive forecast
            naive_preds = self.naive_forecast(train_data, test_dates, ticker)
            # run seasonal naive forecast
            seasonal_preds = self.seasonal_naive_forecast(train_data, test_dates, ticker)
            all_predictions[ticker] = {
                'naive_baseline': naive_preds,
                'seasonal_naive': seasonal_preds
            }
        return all_predictions

def run_baseline_evaluation(framework, baseline_models, train_data, test_data):
    """
    evaluate baseline model performance on test data
    """
    logger.info("evaluating baseline model performance...")
    # run predictions
    predictions = baseline_models.run_baseline_forecasts(train_data, test_data)
    # evaluate against actual test data
    tickers = test_data['ticker'].unique()
    for ticker in tickers:
        ticker_test = test_data[test_data['ticker'] == ticker].sort_values('Date')
        if ticker not in predictions:
            continue
        for model_name in ['naive_baseline', 'seasonal_naive']:
            model_preds = predictions[ticker][model_name]
            # align predictions with actual test dates
            pred_values = []
            actual_values = []
            for _, row in ticker_test.iterrows():
                test_date = row['Date']
                actual_price = row['Close']
                if test_date in model_preds:
                    pred_values.append(model_preds[test_date])
                    actual_values.append(actual_price)
            if len(pred_values) > 0:
                # evaluate performance
                framework.evaluate_model_performance(
                    f"{model_name}_{ticker}", pred_values, actual_values
                )

def setup_and_run_baselines(framework, train_data, test_data):
    """
    setup and run baseline models
    """
    logger.info("setting up baseline models...")
    # initialize baseline models
    baseline_models = BaselineModels(framework)
    # run baseline evaluation
    run_baseline_evaluation(framework, baseline_models, train_data, test_data)
    logger.info("chunk 2 complete - baseline models evaluated")
    return baseline_models

# chunk 3: statistical models (linear regression and xgboost)
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

class StatisticalModels:
    """
    linear regression and xgboost models with feature engineering
    """
    def __init__(self, framework):
        self.framework = framework
        self.models = {}
        self.scalers = {}
        self.train_data = None
        logger.info("initializing statistical models...")

    def prepare_model_features(self, data, ticker, is_training=True):
        """
        prepare feature matrix for ml models
        """
        if is_training:
            # for training, use all available data for feature engineering
            ticker_data = self.framework.prepare_features(data, ticker)
            self.train_data = data
            # logger.info(f"training feature data for {ticker}: {len(ticker_data)} rows after dropna")
        else:
            # for testing, we simulate real forecasting - no future prices known
            test_ticker_data = data[data['ticker'] == ticker].sort_values('Date')
            train_ticker_data = self.train_data[self.train_data['ticker'] == ticker].sort_values('Date')
            # get the last known values from training data
            last_train_close = train_ticker_data['Close'].iloc[-1]
            last_train_return = train_ticker_data['daily_return'].iloc[-1] if 'daily_return' in train_ticker_data.columns else 0
            feature_rows = []
            for _, test_row in test_ticker_data.iterrows():
                test_date = test_row['Date']
                feature_dict = {
                    'Date': test_date,
                    'Close': last_train_close,
                    'ticker': ticker,
                    'daily_return': 0,
                    # indicators based on last known training data
                    'sma_5': last_train_close,
                    'sma_20': last_train_close,
                    'volatility_5': abs(last_train_return) if last_train_return else 1.0,
                    # lag features use historical data only
                    'close_lag_1': last_train_close,
                    'close_lag_2': last_train_close,
                    'close_lag_3': last_train_close,
                    'close_lag_5': last_train_close,
                    'return_lag_1': last_train_return,
                    'return_lag_2': last_train_return,
                    'return_lag_3': last_train_return,
                    'return_lag_5': last_train_return,
                    # day of week features
                    'monday': 1 if test_date.weekday() == 0 else 0,
                    'tuesday': 1 if test_date.weekday() == 1 else 0,
                    'wednesday': 1 if test_date.weekday() == 2 else 0,
                    'thursday': 1 if test_date.weekday() == 3 else 0,
                    'friday': 1 if test_date.weekday() == 4 else 0,
                    # tesla friday effect
                    'tesla_friday_effect': 1 if (ticker == 'TSLA' and test_date.weekday() == 4) else 0
                }
                feature_rows.append(feature_dict)
            ticker_data = pd.DataFrame(feature_rows)
            # logger.info(f"test feature data for {ticker}: {len(ticker_data)} rows created")

        if len(ticker_data) < 1:
            logger.warning(f"insufficient feature data for {ticker}: only {len(ticker_data)} rows")
            return None, None, None, None

        # define feature columns
        feature_cols = [
            'sma_5', 'sma_20', 'volatility_5',
            'close_lag_1', 'close_lag_2', 'close_lag_3', 'close_lag_5',
            'return_lag_1', 'return_lag_2', 'return_lag_3', 'return_lag_5',
            'monday', 'tuesday', 'wednesday', 'thursday', 'friday',
            'tesla_friday_effect'
        ]

        # create feature matrix
        X = ticker_data[feature_cols].values
        y = ticker_data['Close'].values
        dates = ticker_data['Date'].values
        return X, y, dates, feature_cols

    def train_linear_model(self, train_data, ticker):
        """
        train linear regression model with day-of-week features
        """
        X_train, y_train, train_dates, feature_cols = self.prepare_model_features(train_data, ticker, is_training=True)
        if X_train is None or len(X_train) == 0:
            logger.warning(f"no training data for linear model: {ticker}")
            return None
        # scale features
        scaler = StandardScaler()
        X_train_scaled = scaler.fit_transform(X_train)
        # train linear regression
        model = LinearRegression()
        model.fit(X_train_scaled, y_train)
        # store model and scaler
        model_key = f"linear_{ticker}"
        self.models[model_key] = model
        self.scalers[model_key] = scaler
        # logger.info(f"trained linear regression for {ticker} with {len(X_train)} samples")
        return model, scaler

    def train_xgboost_model(self, train_data, ticker):
        """
        train xgboost model with advanced features
        """
        X_train, y_train, train_dates, feature_cols = self.prepare_model_features(train_data, ticker, is_training=True)
        if X_train is None or len(X_train) == 0:
            logger.warning(f"no training data for xgboost model: {ticker}")
            return None
        # xgboost parameters
        params = {
            'objective': 'reg:squarederror',
            'max_depth': 4,
            'learning_rate': 0.1,
            'n_estimators': 50,
            'random_state': 42
        }
        # train xgboost
        model = xgb.XGBRegressor(**params)
        model.fit(X_train, y_train)
        # store model
        model_key = f"xgboost_{ticker}"
        self.models[model_key] = model
        # logger.info(f"trained xgboost for {ticker} with {len(X_train)} samples")
        return model

    def predict_linear(self, test_data, ticker):
        """
        make predictions using linear regression
        """
        model_key = f"linear_{ticker}"
        if model_key not in self.models:
            logger.warning(f"no trained linear model for {ticker}")
            return {}
        model = self.models[model_key]
        scaler = self.scalers[model_key]
        # prepare test features
        X_test, y_test, test_dates, feature_cols = self.prepare_model_features(test_data, ticker, is_training=False)
        if X_test is None or len(X_test) == 0:
            logger.warning(f"no test features for linear model: {ticker}")
            return {}
        # scale and predict
        X_test_scaled = scaler.transform(X_test)
        predictions = model.predict(X_test_scaled)
        # create prediction dictionary
        pred_dict = {}
        for i, date in enumerate(test_dates):
            pred_dict[pd.to_datetime(date)] = predictions[i]
            # log prediction
            self.framework.log_prediction(
                prediction_date=test_dates[-1] if len(test_dates) > 0 else date,
                target_date=pd.to_datetime(date),
                ticker=ticker,
                model='linear_trend',
                predicted_price=predictions[i]
            )
        # logger.info(f"linear model predictions for {ticker}: {len(predictions)} forecasts")
        return pred_dict

    def predict_xgboost(self, test_data, ticker):
        """
        make predictions using xgboost
        """
        model_key = f"xgboost_{ticker}"
        if model_key not in self.models:
            logger.warning(f"no trained xgboost model for {ticker}")
            return {}
        model = self.models[model_key]
        # prepare test features
        X_test, y_test, test_dates, feature_cols = self.prepare_model_features(test_data, ticker, is_training=False)
        if X_test is None or len(X_test) == 0:
            logger.warning(f"no test features for xgboost model: {ticker}")
            return {}
        # predict
        predictions = model.predict(X_test)
        # create prediction dictionary
        pred_dict = {}
        for i, date in enumerate(test_dates):
            pred_dict[pd.to_datetime(date)] = predictions[i]
            # log prediction
            self.framework.log_prediction(
                prediction_date=test_dates[-1] if len(test_dates) > 0 else date,
                target_date=pd.to_datetime(date),
                ticker=ticker,
                model='xgboost',
                predicted_price=predictions[i]
            )
        # logger.info(f"xgboost predictions for {ticker}: {len(predictions)} forecasts")
        return pred_dict

    def run_statistical_forecasts(self, train_data, test_data):
        """
        train and run both statistical models for all tickers
        """
        logger.info("running statistical forecasts for all tickers...")
        tickers = train_data['ticker'].unique()
        all_predictions = {}
        for ticker in tickers:
            # logger.info(f"processing statistical models for {ticker}...")
            # train models
            linear_model = self.train_linear_model(train_data, ticker)
            xgb_model = self.train_xgboost_model(train_data, ticker)
            # make predictions
            linear_preds = self.predict_linear(test_data, ticker)
            xgb_preds = self.predict_xgboost(test_data, ticker)
            all_predictions[ticker] = {
                'linear_trend': linear_preds,
                'xgboost': xgb_preds
            }
        logger.info("statistical forecasts completed for all tickers")
        return all_predictions

def run_statistical_evaluation(framework, statistical_models, train_data, test_data):
    """
    evaluate statistical model performance
    """
    logger.info("evaluating statistical model performance...")
    # run predictions
    predictions = statistical_models.run_statistical_forecasts(train_data, test_data)
    # evaluate against actual test data
    tickers = test_data['ticker'].unique()
    for ticker in tickers:
        ticker_test = test_data[test_data['ticker'] == ticker].sort_values('Date')
        if ticker not in predictions:
            continue
        for model_name in ['linear_trend', 'xgboost']:
            model_preds = predictions[ticker][model_name]
            # align predictions with actual test dates
            pred_values = []
            actual_values = []
            for _, row in ticker_test.iterrows():
                test_date = pd.to_datetime(row['Date'])
                actual_price = row['Close']
                if test_date in model_preds:
                    pred_values.append(model_preds[test_date])
                    actual_values.append(actual_price)
            if len(pred_values) > 0:
                # evaluate performance
                framework.evaluate_model_performance(
                    f"{model_name}_{ticker}", pred_values, actual_values
                )

def setup_and_run_statistical_models(framework, train_data, test_data):
    """
    setup and run statistical models
    """
    logger.info("setting up statistical models...")
    # initialize statistical models
    statistical_models = StatisticalModels(framework)
    # run evaluation
    run_statistical_evaluation(framework, statistical_models, train_data, test_data)
    logger.info("chunk 3 complete - statistical models evaluated")
    return statistical_models

# performance summary function
def get_performance_df(framework):
    """
    extract performance data from framework and return as polars dataframe
    """
    if not framework.model_performance:
        logger.warning("no model performance data found")
        return None
    # convert performance dict to list
    performance_data = []
    for model_ticker, metrics in framework.model_performance.items():
        # parse model name and ticker (format: "model_name_TICKER")
        parts = model_ticker.split('_')
        ticker = parts[-1]
        model = '_'.join(parts[:-1]).replace('_', ' ').title()
        performance_data.append({
            'ticker': ticker,
            'model': model,
            'mae_dollars': metrics['MAE'],
            'rmse_dollars': metrics['RMSE'],
            'mape_percent': metrics['MAPE'] / 100,
            'accuracy': 1 - metrics['MAPE'] / 100,
            'predictions': metrics['n_predictions']
        })
    # convert to polars and sort
    performance_df = pl.DataFrame(performance_data).sort(['ticker', 'model'])
    return performance_df

# main execution
logger.info("starting forecasting pipeline...")

# setup framework
framework = setup_forecasting_framework()

# prepare data
forecast_df = prepare_forecasting_data(pl_results)
train_data, test_data = create_train_test_split(forecast_df)

# run baseline models
baseline_models = setup_and_run_baselines(framework, train_data, test_data)

# run statistical models
statistical_models = setup_and_run_statistical_models(framework, train_data, test_data)

# show results
performance_df = get_performance_df(framework)
```
ℹ️ INFO · 10:31:50 · starting forecasting pipeline...
ℹ️ INFO · 10:31:50 · setting up forecasting framework...
ℹ️ INFO · 10:31:50 · StockForecastingFramework initialized
ℹ️ INFO · 10:31:50 · prediction logs will be saved to: predictions_log
ℹ️ INFO · 10:31:50 · preparing forecasting data...
ℹ️ INFO · 10:31:50 · training observations by ticker:
ℹ️ INFO · 10:31:50 · AAPL: 43 observations
ℹ️ INFO · 10:31:50 · GOOGL: 43 observations
ℹ️ INFO · 10:31:50 · MSFT: 43 observations
ℹ️ INFO · 10:31:50 · NFLX: 43 observations
ℹ️ INFO · 10:31:50 · NVDA: 43 observations
ℹ️ INFO · 10:31:50 · TSLA: 43 observations
ℹ️ INFO · 10:31:50 · setting up baseline models...
ℹ️ INFO · 10:31:50 · initializing baseline models...
ℹ️ INFO · 10:31:50 · evaluating baseline model performance...
ℹ️ INFO · 10:31:50 · running baseline forecasts for all tickers...
ℹ️ INFO · 10:31:51 · chunk 3 complete - statistical models evaluated
On the choice of Forecasting Models
I always prefer to nest multiple models per series (per stock, in this case). This gives me a baseline against which contending models can be run and compared. Here, four forecasting approaches were run: Naive Baseline (last known price), Seasonal Naive (same-weekday historical price), Linear Regression with day-of-week features, and XGBoost with engineered technical indicators (mainly revolving around rolling n-day averages).
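For reference, evaluate_model_performance() above scores each model with the three standard metrics below, where $y_i$ are actual prices, $\hat{y}_i$ predictions, and $n$ the number of test days; the Accuracy column reported later is simply $100\% - \mathrm{MAPE}$:

$$
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\lvert y_i - \hat{y}_i\rvert,\qquad
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2},\qquad
\mathrm{MAPE} = \frac{100}{n}\sum_{i=1}^{n}\left\lvert\frac{y_i - \hat{y}_i}{y_i}\right\rvert
$$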
Simple models dominate across the portfolio: Naive Baseline achieved the lowest error rates for Apple (3.80% MAPE), Microsoft (0.98% MAPE), and Nvidia (2.18% MAPE), while Seasonal Naive performed best for Tesla (11.54% MAPE).
Machine learning shows limited advantage: XGBoost outperformed the simpler methods only for Netflix (1.46% MAPE), and even there the Naive Baseline was a very close second (1.57%). This suggests that complex feature engineering and modeling provide minimal forecasting improvement for most stocks in this timeframe.
Volatility drives prediction difficulty: high-volatility stocks like Tesla show consistently poor performance across all models (roughly 11.5-14% MAPE), while stable stocks like Microsoft achieve sub-1% error rates even with basic approaches.
Day-of-week seasonality often proves more valuable than technical indicators: the success of the Seasonal Naive model validates the calendar analysis findings, suggesting that weekly trading patterns can contain more predictive signal than moving averages or lag features for short-term forecasting.
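A quick way to extract these per-ticker winners from performance_df, before looking at the full table below (a sketch using the polars frame returned by get_performance_df() above):

```python
# lowest-MAPE model per ticker, using the performance frame built earlier
best_models = (
    performance_df
    .sort('mape_percent')
    .group_by('ticker', maintain_order=True)
    .first()
    .select(['ticker', 'model', 'mape_percent'])
    .sort('ticker')
)
print(best_models)
```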
```python
# check comprehensive performance across all models
performance_df = get_performance_df(framework)

# sort on stock and mape (ascending)
performance_df = performance_df.sort(['ticker', 'mape_percent'])

(
    gt_nyt_custom(
        performance_df,
        title='Complete Model Performance By Stock',
        first_10_rows_only=False
    )
    .fmt_percent(['Mape Percent', 'Accuracy'])
    .cols_hide('Predictions')
)
```
**Complete Model Performance By Stock**

| Ticker | Model | MAE | RMSE | MAPE | Accuracy |
|--------|----------------|-------|-------|--------|--------|
| AAPL | Naive Baseline | $9.4 | $12.1 | 3.80% | 96.20% |
| AAPL | Linear Trend | $9.7 | $12.4 | 3.92% | 96.08% |
| AAPL | Seasonal Naive | $10.7 | $13.6 | 4.34% | 95.66% |
| AAPL | Xgboost | $12.8 | $15.5 | 5.20% | 94.80% |
| GOOGL | Linear Trend | $27.5 | $29.1 | 11.19% | 88.81% |
| GOOGL | Naive Baseline | $29.6 | $31.3 | 12.05% | 87.95% |
| GOOGL | Xgboost | $31.2 | $33.0 | 12.71% | 87.29% |
| GOOGL | Seasonal Naive | $32.9 | $34.6 | 13.43% | 86.57% |
| MSFT | Naive Baseline | $5.0 | $6.1 | 0.98% | 99.02% |
| MSFT | Seasonal Naive | $5.7 | $6.7 | 1.13% | 98.87% |
| MSFT | Xgboost | $6.0 | $7.5 | 1.19% | 98.81% |
| MSFT | Linear Trend | $7.1 | $8.3 | 1.39% | 98.61% |
| NFLX | Xgboost | $17.9 | $20.8 | 1.46% | 98.54% |
| NFLX | Naive Baseline | $19.5 | $25.5 | 1.57% | 98.43% |
| NFLX | Seasonal Naive | $20.0 | $22.2 | 1.63% | 98.37% |
| NFLX | Linear Trend | $69.5 | $72.5 | 5.65% | 94.35% |
| NVDA | Naive Baseline | $3.8 | $4.3 | 2.18% | 97.82% |
| NVDA | Xgboost | $3.8 | $5.2 | 2.23% | 97.77% |
| NVDA | Seasonal Naive | $6.2 | $7.1 | 3.60% | 96.40% |
| NVDA | Linear Trend | $13.7 | $14.4 | 7.81% | 92.19% |
| TSLA | Seasonal Naive | $47.7 | $58.5 | 11.54% | 88.46% |
| TSLA | Linear Trend | $52.3 | $65.0 | 12.58% | 87.42% |
| TSLA | Naive Baseline | $54.7 | $67.4 | 13.18% | 86.82% |
| TSLA | Xgboost | $58.7 | $71.0 | 14.19% | 85.81% |
The results reveal that the more sophisticated machine learning models fairly consistently underperform simpler approaches across most stocks. Notably, XGBoost achieves the lowest error rate only for Netflix (1.46% MAPE, with Naive Baseline a very close second), while basic Naive Baseline or Seasonal Naive models dominate performance for Apple, Microsoft, Nvidia, and Tesla.
This pattern suggests that for short-term daily price forecasting, the day-of-week seasonality captured by simple historical patterns often provides more predictive value than complex feature engineering.
The exceptions are Google, where Linear Trend slightly outperforms the baseline methods, and Tesla, where all models struggle with high volatility (roughly 11.5-14% MAPE), indicating that Tesla's erratic price movements are inherently difficult to forecast regardless of modeling approach.
I wanted to include even more complex forecasting methods such as Prophet (from Meta) or LSTMs (Long Short-Term Memory networks), but seeing that simple models already achieve high accuracy, I decided against it. In real-world, day-to-day work, every additional line of code/model is additional maintenance, and that comes with its own set of risks/pain points.
Thanks for taking the time to read this and hopefully, if nothing else, you have enjoyed it. More to come!
---title: | <div class="custom-title-block" style="font-size: 1.2em;"> <span style="color:#000000;">Stock Market Analysis & Forecast</span><br> <span style="color:#666666; font-size:0.7em;"> </span> <span style="font-size:0.8em; color:#333333; white-space: nowrap"> Karim K. Kardous <a href='mailto:kardouskarim@gmail.com' style='margin-left: 9px; font-size: 0.9em;'> <i class='bi bi-envelope'></i> </a> <a href='https://github.com/kkardousk' style='margin-left: 5px; font-size: 0.9em;'> <i class='bi bi-github'></i> </a> </span> </div>format: html: css: styles.css fig-width: 20 fig-height: 14 toc: true toc-depth: 4 toc-expand: true toc-title: 'Navigation' number-depth: 2 fig-format: retina fig-dpi: 300 code-link: true code-fold: true code-summary: '<i class="bi-code-slash"></i> Show the code' code-tools: toggle: true highlight-style: github-dark df-print: paged page-layout: full embed-resources: true smooth-scroll: true link-external-icon: false link-external-newwindow: true fontsize: 1.1em linestretch: 1 linespace: 1 html-math-method: katex linkcolor: '#D35400'execute: echo: true warning: false message: false info: false cache: false freeze: false daemon: falseeditor: visual---::: text-justify## High Level FindingsIn this piece, I looked at 6 stocks over a 3-month period (July-September 2025) using multiple forecasting approaches to identify trading patterns and evaluate predictive model performance. The analysis examined Apple (AAPL), Google (GOOGL), Microsoft (MSFT), Tesla (TSLA), Netflix (NFLX), and Nvidia (NVDA), comparing naive baselines against machine learning models to determine which approaches best capture short-term price movements.:::::: text-justify- Tesla exhibits a consistent Friday effect with positive returns occurring 65% more frequently on Fridays compared to other weekdays across July-September 2025- Day-of-week seasonality dominates individual stock characteristics: Seasonal Naive or Naive Baseline models outperformed sophisticated machine learning approaches for 4 out of 6 stocks, indicating either predictable weekly trading patterns or simpler models are best for the job - in the case of this analysis at least- End-of-week bias spans the entire tech portfolio: Apple, Tesla, and Nvidia show systematically higher returns on Fridays, suggesting sector-wide behavioral trading patterns rather than stock-specific anomalies- Tesla and Nvidia demonstrate 3x higher daily price swings (±3%+ moves) compared to more mature tech stocks like Microsoft and Google, creating distinct forecasting challenges. Note that Google remains the most volatile in terms of \$ value of the std. 
deviation of stock compared to its mean stock price (not the most volatile in its % change of stock price day over day).- However when scaled to each stock close price variation, the most volatile stock is Google (\~13% coefficient of variation), followed by Tesla (\~12%) and Apple (\~7%), with Microsoft being the least volatile (\~2%):::## Initial Setup::: text-justifyAfter having setup the `.venv` using `uv init`, here we start by importing needed packages and setting up what we need.:::```{python}#| label: load-libraries#| message: false#| warning: falseimport yfinance as yfimport pandas as pdimport polars as plimport polars.selectors as csimport great_tables as gtfrom great_tables import GT, md, style, locimport matplotlib.pyplot as pltfrom datetime import datetime, timedelta, dateimport numpy as npimport calendarimport sysimport loggingimport pprintfrom IPython.display import display, HTMLfrom itertools import starmap```::: text-justifyHere I set up custom HTML-styled `{logging}` for enhanced visual output in Quarto. While logging is typically more valuable for automated scheduled jobs to track errors and events with timestamps, here it provides highly customizable, visually appealing messages compared to 'basic' `print()` statements, taking advantage of Quarto's ability to render custom css/html.:::```{python}#| label: setup-logging#| message: false#| warning: false# custom handler that outputs styled HTMLclass StyledJupyterHandler(logging.StreamHandler):def__init__(self):super().__init__(sys.stdout)def emit(self, record): timestamp = datetime.now().strftime('%H:%M:%S') level = record.levelname message = record.getMessage()# style based on log levelif level =='INFO': color ='#28a745'# green icon ='ℹ️'elif level =='WARNING': color ='#ffc107'# yellow icon ='⚠️'elif level =='ERROR': color ='#dc3545'# red icon ='❌'else: color ='#6c757d'# gray icon ='•' html_output =f""" <div style=" background-color: {color}15; border-left: 4px solid {color}; padding: 8px 12px; margin: 4px 0; border-radius: 4px; font-family: 'Monaco', 'Menlo', 'Ubuntu Mono', monospace; font-size: 13px;"> <span style="color: {color}; font-weight: bold;">{icon}{level} </span> <span style="color: #6c757d; margin: 0 8px;">{timestamp} </span> <span style="color: #333;">{message} </span> </div> """ display(HTML(html_output))# set up loggerlogger = logging.getLogger(__name__)logger.setLevel(logging.INFO)# clear existing handlers and add only one; otherwise messages can repeatif logger.handlers: logger.handlers.clear()logger.addHandler(StyledJupyterHandler())```## Fetching stock data from Yahoo Finance API.```{python}#| label: market-data# define the watchlisttickers = ['AAPL', 'GOOGL', 'MSFT', 'TSLA', 'NFLX', 'NVDA']period ="3mo"# 3 months of data# looping thru datesdef download_stocks_data(ticker):try: stock_data = yf.download(ticker, start='2025-07-01', end='2025-09-26', progress=False) # time bounding the pull here so my analysis stay the same/matches the outputs/results a few months from now logger.info(f'Downloaded {ticker}: {len(stock_data)} days')return (ticker, stock_data)exceptExceptionas e: logger.error(f'Failed to download {ticker}: {e}')return (ticker, None)results =list(map(download_stocks_data, tickers)) ```Here I call the yfinance api to pull stock prices for six major technology stocks: Apple (AAPL), Google (GOOGL), Microsoft (MSFT), Tesla (TSLA), Netflix (NFLX), and Nvidia (NVDA).<br> **Note that this analysis is based on data pulled July-September 2025**. 
Results will vary if run with different or dynamic time periods; the point here is to create a historical fixed snapshot as all my comments/analyses are based on the July-September 2025 period.<br>## Display of data pull using `great-tables`::: text-justifyHere I decide to use {great-tables} after running `uv add great-tables` in terminal and `import great_tables as gt` in script. This is a great way to display summary tables. But let's convert the results (pandas into polars first):::```{python}#| label: polars-gt-outputdef convert_to_polars(result): ticker, stock_data = resultif stock_data isnotNoneandnot stock_data.empty:ifisinstance(stock_data.columns, pd.MultiIndex): stock_data.columns = stock_data.columns.droplevel(1) # remove ticker level/unnest the MultiIndex struct from yahoo downlaod pl_stock = pl.from_pandas(stock_data.reset_index()) pl_stock = pl_stock.with_columns(pl.lit(ticker).alias('ticker'))return pl_stockreturnNone# loop thru all dataall_data =list(map(convert_to_polars, results))# filter out any None/Nullscomplete_data = [data for data in all_data if data isnotNone]# concatenate all data together into a polars obkectpl_results = pl.concat(complete_data, how="vertical")# rearrange columns and sort descending dates and ticker; also adding new column dollar_volumepl_results = ( pl_results .select(['Date', 'ticker', 'Close', 'High', 'Low', 'Open', 'Volume']) .sort(['Date', 'ticker'], descending = [True, False]) .with_columns( (pl.col('Volume') * pl.col('Close') ).alias('$volume')))# build a gt function that formats numeric data, cleans column names, adds a title/subtitle (if provided) and customizes the overall theme simialr to NYTdef gt_nyt_custom(x, title='', subtitle='', first_10_rows_only=True):import polars as plfrom great_tables import GT, md, style, loc# clean column names to title case (similar to clean_names) x = x.rename({col: col.replace('_', ' ').title() for col in x.columns})# identify numeric columns (float and integer) numeric_cols = [col for col in x.columns if x[col].dtype in [pl.Float64, pl.Float32]] integer_cols = [col for col in x.columns if x[col].dtype in [pl.Int64, pl.Int32, pl.Int16, pl.Int8]]# handle currency columns - check if specific columns exist currency_cols = [] volume_cols = [] date_cols = []for col in numeric_cols:if'$volume'in col.lower() or'volume'in col.lower(): volume_cols.append(col)else: currency_cols.append(col)# check for date columnsfor col in x.columns:if'date'in col.lower() or x[col].dtype == pl.Date: date_cols.append(col)# format title and subtitle title_fmt =f"**{title}**"if title !=""else"" subtitle_fmt =f"*{subtitle}*"if subtitle !=""else""# apply first_10_rows_only filterif first_10_rows_only: x = x.head(10)# create gt table and apply styling gt_table = ( GT(x) .tab_header( title=md(title_fmt), subtitle=md(subtitle_fmt) ) .tab_style( style=style.text(color='#333333'), locations=loc.body() ) .tab_style( style=style.text(color='#CC6600'), locations=loc.column_labels() ) .tab_options( table_font_names=['Merriweather', 'Georgia', 'serif'], table_font_size='14px', heading_title_font_size='18px', heading_subtitle_font_size='14px', column_labels_font_weight='bold', column_labels_background_color='#eeeeee', table_border_top_color='#dddddd', table_border_bottom_color='#dddddd', data_row_padding='6px', row_striping_include_table_body=True, row_striping_background_color='#f9f9f9', ) )# conditionally apply formatting based on column existenceif currency_cols: gt_table = gt_table.fmt_currency( columns=currency_cols, decimals=1, 
currency='USD' )if volume_cols: gt_table = gt_table.fmt_currency( columns=volume_cols, decimals=1, currency='USD', compact=True )if integer_cols: gt_table = gt_table.fmt_number( columns=integer_cols, decimals=0 )if date_cols: gt_table = gt_table.fmt_date( columns=date_cols, date_style='year.mn.day' )return gt_tablestyled_table = ( gt_nyt_custom( pl_results, title ="Stock Market Data", subtitle ="3 Month Pull (Only 10 records shown)", first_10_rows_only=True ) .tab_style( style=style.text(align='left'), locations=loc.column_labels() ))styled_table```## Plotting Trends of daily Close Price::: text-justifyBelow I start with designing the building blocks for the visualizations. By building blocks, here I concretely mean 'embedding' a `ggplot2` like theme as I am a fan of the overall layout (I can't be the only one). `create_ggplot_theme()` function below is a nifty way to set a modular/reusable set of values as a dictionary to then be used as arguments for the theme in both `plot_stock_facets()` and `plot_single_comparison()`:::```{python}#| label: build-plot-structuresimport matplotlib.pyplot as pltimport matplotlib.dates as mdatesfrom matplotlib.ticker import FuncFormatter# set ggplot2-like style globallyplt.style.use('seaborn-v0_8-whitegrid')plt.rcParams['font.family'] ='serif'plt.rcParams['font.serif'] = ['Georgia', 'Times New Roman']plt.rcParams['font.size'] =11plt.rcParams['axes.facecolor'] ='#F8F8F8'plt.rcParams['figure.facecolor'] ='#F8F8F8'plt.rcParams['axes.edgecolor'] ='none'plt.rcParams['axes.linewidth'] =0plt.rcParams['grid.color'] ='#E5E5E5'plt.rcParams['grid.linewidth'] =0.5def plot_single_comparison_mpl(data, title="Stock Price Comparison", height=6):""" create a single plot with all stocks for comparison - ggplot2 style """ fig, ax = plt.subplots(figsize=(12, height)) tickers =sorted(data['ticker'].unique().to_list()) colors = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', '#9467bd', '#8c564b']for i, ticker inenumerate(tickers): ticker_data = data.filter(pl.col('ticker') == ticker).sort('Date') ax.plot(ticker_data['Date'], ticker_data['Close'], label=ticker, color=colors[i], linewidth=2) ax.set_title(title, fontsize=16, pad=20, fontweight='normal') ax.set_ylabel('Close Price', fontsize=12) ax.legend(loc='upper center', bbox_to_anchor=(0.5, -0.1), ncol=6, frameon=False, fontsize=10) ax.grid(True, alpha=0.7, linewidth=0.5)# format y-axis as currencydef currency_formatter(x, p):returnf'${x:,.0f}' ax.yaxis.set_major_formatter(FuncFormatter(currency_formatter))# remove spines (borders)for spine in ax.spines.values(): spine.set_visible(False) plt.tight_layout()return fig# single comparison plotfig = plot_single_comparison_mpl( pl_results, title="Stock Price Comparison Overall (3 Months)", height=5)plt.show()```::: text-justifyAs we can see from above, Netflix stock 'overshadows' any subtle variations in the stock prices for the remaining companies/stocks, simply because of the difference in scales/range of values. <br> One way to remedy that is to be faceting on companies, essentially a '`GROUP BY TICKER`' but for plots. 
```{python}
#| label: facetted-plots
def plot_stock_facets_mpl(data, title="Stock Performance by Ticker", height=8):
    """
    create faceted line charts for stock prices - ggplot2 style
    """
    tickers = sorted(data['ticker'].unique().to_list())
    n_tickers = len(tickers)

    # calculate subplot grid (2 columns)
    cols = 2
    rows = (n_tickers + cols - 1) // cols
    fig, axes = plt.subplots(rows, cols, figsize=(14, height))
    axes = axes.flatten()

    colors = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', '#9467bd', '#8c564b']
    ticker_names = ['Apple', 'Google', 'Microsoft', 'Netflix', 'Nvidia', 'Tesla']

    for i, ticker in enumerate(tickers):
        ticker_data = data.filter(pl.col('ticker') == ticker).sort('Date')
        axes[i].plot(ticker_data['Date'], ticker_data['Close'],
                     color=colors[i], linewidth=2)
        axes[i].set_title(f'{ticker_names[i]} ({ticker})',
                          fontsize=14, pad=10, fontweight='bold')
        axes[i].grid(True, alpha=0.7, linewidth=0.5)

        # format y-axis as currency
        def currency_formatter(x, p):
            return f'${x:,.0f}'
        axes[i].yaxis.set_major_formatter(FuncFormatter(currency_formatter))

        # remove spines (borders)
        for spine in axes[i].spines.values():
            spine.set_visible(False)

    # hide extra subplots if any
    for j in range(i + 1, len(axes)):
        axes[j].set_visible(False)

    fig.suptitle(title, fontsize=16, y=0.995, fontweight='normal')
    plt.tight_layout()
    return fig

# faceted plot
fig = plot_stock_facets_mpl(
    pl_results,
    title="Stock Performance by Ticker (3 Months)",
    height=8
)
plt.show()
```

::: text-justify
Overall, despite the dips and volatility, all stocks except Netflix rose in value over the period shown, though a few grew more than others. Google and Tesla also follow similar trends.

But speaking of volatility, let's gather more stats around the stock price evolution and tease out more insights overall.
:::

```{python}
#| label: summary-table
most_recent_date = pd.Timestamp('2025-09-26')  # fixed so my analysis stays the same/matches the outputs a few months from now

summary_stats = (
    pl_results
    .group_by(pl.col('ticker'))
    .agg(
        volatility_std=pl.col('Close').std(),
        # coefficient of variation (volatility relative to price)
        cv=(pl.col('Close').std() / pl.col('Close').mean()),
        average_price=pl.col('Close').mean(),
        # trailing rolling-window averages
        average_trailing_1mo=pl.col('Close')
            .filter(pl.col('Date') >= (most_recent_date - timedelta(days=30)))
            .mean(),
        average_trailing_2mo=pl.col('Close')
            .filter(pl.col('Date') >= (most_recent_date - timedelta(days=60)))
            .mean()
    )
    .with_columns(
        volatility_rank=pl.col('cv').rank(descending=True)
    )
)

# pull svgs to then render as markdown instead of displaying plain ticker text
def svg_to_html(ticker, width=80, height=60):
    """convert SVG file to inline HTML for gt tables"""
    # map tickers to filenames
    svg_files = {
        'AAPL':  'apple-svgrepo-com.svg',
        'GOOGL': 'google-2015-logo-svgrepo-com.svg',
        'NFLX':  'netflix-svgrepo-com.svg',
        'NVDA':  'nvidia-logo-svgrepo-com.svg',
        'MSFT':  'microsoft-logo-svgrepo-com.svg',
        'TSLA':  'tesla-9-logo-svgrepo-com.svg'
    }
    if ticker in svg_files:
        with open(svg_files[ticker], 'r') as f:
            svg_content = f.read()
        return f'''<div style="width:{width}px;height:{height}px;overflow:hidden;display:flex;align-items:center;justify-content:center;">
            <div style="transform:scale(0.1);transform-origin:center;">{svg_content}</div>
        </div>'''
    else:
        return ticker  # else fall back to plain ticker text

# render logo column, select needed columns, and sort on volatility rank
stock_summary = (
    summary_stats
    .with_columns(
        logo=pl.col('ticker').map_elements(lambda x: svg_to_html(x), return_dtype=pl.Utf8)
    )
    .select(
        'logo', 'volatility_rank', 'volatility_std', 'cv',
        'average_price', 'average_trailing_1mo', 'average_trailing_2mo'
    )
    .sort('volatility_rank')
)

# gt output with the logo column markdown-rendered and more general styling
styled_stock_summary = (
    gt_nyt_custom(
        stock_summary,
        title="On Volatility & Averages",
        subtitle="",
        first_10_rows_only=True
    )
    .fmt_markdown(columns='Logo')
    .fmt_percent(columns='Cv')
    .fmt_integer(columns='Volatility Rank')
    .tab_style(
        style=style.text(color="black", size=16, align='center'),
        locations=loc.column_labels()
    )
    .tab_spanner(
        label=md('**Volatility Stats**'),
        columns=['Volatility Std', 'Cv', 'Volatility Rank']
    )
    .tab_spanner(
        label=md('**Average Closing Price**'),
        columns=['Average Price', 'Average Trailing 2Mo', 'Average Trailing 1Mo']
    )
    .tab_style(
        style=style.text(color='#2e8b57', weight='bold'),
        locations=gt.loc.body(
            columns=['Volatility Std', 'Cv', 'Volatility Rank'],
            rows=(pl.col('Volatility Rank') == 6)
        )
    )
    .tab_style(
        style=style.text(color='#8b1a1a', weight='bold'),
        locations=gt.loc.body(
            columns=['Volatility Std', 'Cv', 'Volatility Rank'],
            rows=(pl.col('Volatility Rank') == 1)
        )
    )
    .tab_source_note(
        source_note=md("""*Coef. of Variation (CV) = σ/μ × 100%, where σ is the standard deviation and μ is the mean.<br>
        **Volatility Rank: simple rank based on Coef. of Variation (1 = most volatile, 6 = least volatile)""")
    )
    .cols_label(**{
        "Logo": "",
        "Volatility Std": "Std. Deviation",
        "Cv": "*Coef. of Variation",
        "Volatility Rank": "**Volatility Rank"
    })
)
styled_stock_summary
```

::: text-justify
While the above summaries give us a better picture of the stock dynamics, one thing remains unclear: whether the close price 'consistently' dips or rises on a given day of the week, month, etc. This is where seasonality comes in, and what we tease out (if any) next.
:::

## Teasing out seasonality

The goal now is to generate a set of calendar plots where daily price changes drive the value/color gradient; this should clearly and visually surface any day-of-week seasonality (if any exists). We will also need to account for Saturdays/Sundays being off and for months whose first day is not a Monday.
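One small gotcha worth flagging before the next chunk: polars' `dt.weekday()` is 1-indexed (Monday = 1 ... Sunday = 7), while pandas' `dt.dayofweek` is 0-indexed (Monday = 0 ... Sunday = 6). Both conventions appear in the calendar code below, so here is a quick illustrative check on two dates from the period (a Monday and a Saturday):

```{python}
#| label: weekday-convention-check
# polars: Monday=1 ... Sunday=7; pandas: Monday=0 ... Sunday=6
demo = pl.DataFrame({'d': [date(2025, 9, 22), date(2025, 9, 27)]})  # Mon, Sat
print(demo.with_columns(pl.col('d').dt.weekday().alias('pl_weekday')))  # 1 and 6
print(pd.Series(pd.to_datetime(['2025-09-22', '2025-09-27'])).dt.dayofweek.tolist())  # [0, 5]
```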
```{python}
#| label: date-price-level-calc
# add more date-level and price-level computations
calendar_data = (
    pl_results
    .sort(["ticker", "Date"])
    .with_columns([
        pl.col('Date').dt.year().alias('year'),
        pl.col('Date').cast(pl.Date).alias('Date'),
        pl.col('Date').dt.weekday().alias('day_of_week'),
        pl.col('Date').dt.day().alias('day_of_month'),
        pl.col('Date').dt.week().alias('week_of_year'),
        pl.col('Date').dt.month().alias('month'),
        pl.col('Date').dt.strftime('%B').alias('month_name'),
        # price-related calculations
        (pl.col('Close') - pl.col('Close').shift(1).over('ticker')).alias('daily_change_abs'),
        (
            (pl.col('Close') - pl.col('Close').shift(1).over('ticker'))
            / pl.col('Close').shift(1).over('ticker')
        ).alias('daily_change_perc')
    ])
)

# pandas copy of the enriched data, for convenience
calendar_pd = calendar_data.to_pandas()
```

```{python}
#| label: seasonality-view-calendar-grid
def create_calendar_heatmap_mpl(df, stocks, months, vmin=-5, vmax=5, show_disclaimer=False):
    """
    create calendar heatmap using matplotlib
    """
    # convert to pandas
    if hasattr(df, "to_pandas"):
        df = df.to_pandas()

    # filter the data on ticker/stock and month
    dff = df[(df['ticker'].isin(stocks)) & (df['month'].isin(months))].copy()

    # add day, weekday, and values columns
    dff['Date'] = pd.to_datetime(dff['Date'])
    dff['day'] = dff['Date'].dt.day
    dff['weekday'] = dff['Date'].dt.dayofweek
    dff['values'] = dff['daily_change_perc'] * 100  # convert to percentage

    # filter out weekends
    dff = dff[dff['weekday'] < 5]

    # unique months sorted
    unique_months = sorted(dff['month'].unique())
    ncols = len(unique_months)

    # create subplot grid - months as columns (reduced width)
    month_names = {7: 'Jul', 8: 'Aug', 9: 'Sep'}
    fig, axes = plt.subplots(1, ncols, figsize=(10, 3.5))
    if ncols == 1:
        axes = [axes]

    stock_name_map = {
        'AAPL': 'Apple', 'GOOGL': 'Google', 'MSFT': 'Microsoft',
        'NFLX': 'Netflix', 'NVDA': 'Nvidia', 'TSLA': 'Tesla'
    }

    # add one heatmap per month
    for idx, m in enumerate(unique_months):
        dmonth = dff[dff['month'] == m]
        if len(dmonth) > 0:
            # create a proper calendar layout
            first_day = pd.Timestamp(year=dmonth['Date'].dt.year.iloc[0], month=m, day=1)
            first_weekday = first_day.weekday()

            # create calendar grid
            calendar_grid = {}
            for _, day_data in dmonth.iterrows():
                day = day_data['day']
                weekday = day_data['weekday']
                week = ((first_weekday + day - 1) // 7)
                calendar_grid[(week, weekday)] = day_data['values']

            # create arrays for heatmap
            max_weeks = max([key[0] for key in calendar_grid.keys()]) + 1 if calendar_grid else 1
            z_data = np.full((max_weeks, 5), np.nan)
            text_data = np.full((max_weeks, 5), '', dtype=object)

            # fill arrays - no inversion
            for (week, weekday), value in calendar_grid.items():
                if weekday < 5:
                    z_data[week, weekday] = value
                    if not np.isnan(value):
                        text_data[week, weekday] = f'{value:.1f}'

            # plot heatmap
            im = axes[idx].imshow(z_data, cmap='cividis_r', aspect='auto', vmin=vmin, vmax=vmax)

            # add text annotations
            for week in range(max_weeks):
                for day in range(5):
                    if text_data[week, day]:
                        value = z_data[week, day]
                        text_color = 'white' if value > 3.5 else 'black'
                        axes[idx].text(
                            day, week, text_data[week, day],
                            ha='center', va='center', fontsize=11, color=text_color
                        )

            # set axis labels
            axes[idx].set_xticks(range(5))
            axes[idx].set_xticklabels(['Mon', 'Tue', 'Wed', 'Thu', 'Fri'], fontsize=10)
            axes[idx].set_yticks([])
            axes[idx].set_title(month_names[m], fontsize=12, pad=8, fontweight='normal')
            axes[idx].grid(False)

            # remove borders/spines
            for spine in axes[idx].spines.values():
                spine.set_visible(False)
            axes[idx].set_facecolor((0.973, 0.973, 0.973))

    # add colorbar (top centered, horizontal)
    cbar = fig.colorbar(
        im, ax=axes, orientation='horizontal',
        fraction=0.05, pad=0.35, aspect=25, location='top'
    )
    colorbar_title = (
        "Note: US stock exchanges closed early on Jul 3 and were closed on Jul 4 & Sep 1 (Independence Day & Labor Day)\n% Change in Daily Stock Price"
        if show_disclaimer
        else "% Change in Daily Stock Price"
    )
    cbar.set_label(colorbar_title, fontsize=12, weight='bold')
    cbar.ax.xaxis.set_label_position('top')

    # set main title (left aligned)
    title = ", ".join(f"{stock_name_map[ticker]}" for ticker in stocks)
    fig.text(0.125, 0.85, title, fontsize=16, fontweight='bold', ha='left', va='bottom')
    fig.patch.set_facecolor((0.973, 0.973, 0.973))
    plt.tight_layout(rect=[0, 0, 1, 0.78])
    return fig

# loop through stocks, only show the holiday disclaimer for apple (the first plot)
for i, ticker in enumerate(tickers):
    show_disclaimer = (ticker == 'AAPL')
    fig = create_calendar_heatmap_mpl(
        calendar_data, [ticker], [7, 8, 9],
        vmin=-5, vmax=5, show_disclaimer=show_disclaimer
    )
    plt.show()
```

From the above calendar plots, a few patterns stand out:

- Tesla shows a notable Friday effect: consistent positive performance (darker blueish tiles) appears frequently on Fridays across multiple months
- End-of-week effects span the portfolio: Apple, Tesla, and Nvidia show more frequent positive returns on Fridays compared to other weekdays
- High-volatility stocks identified: Tesla and Nvidia exhibit frequent extreme daily moves (±3%+) while Netflix, Microsoft, and Google show more stable patterns
- Patterns persist across time: Friday effects (upticks in stock price) remain fairly consistent across July-September

## Building the Forecasts

```{python}
#| label: forecast-build
# chunk 1: building forecast frameworks
from pathlib import Path
import joblib
import json

# import statements for all our models
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
import xgboost as xgb

def prepare_forecasting_data(pl_results):
    """
    convert polars data to pandas and prepare for forecasting
    """
    # convert to pandas for easier time series handling
    df = pl_results.to_pandas()

    # ensure Date is datetime
    df['Date'] = pd.to_datetime(df['Date'])

    # sort by ticker and date
    df = df.sort_values(['ticker', 'Date']).reset_index(drop=True)

    # add day-of-week features (for seasonality)
    df['day_of_week'] = df['Date'].dt.day_name()
    df['is_friday'] = (df['Date'].dt.weekday == 4).astype(int)
    df['is_monday'] = (df['Date'].dt.weekday == 0).astype(int)

    # calculate daily returns for analysis
    df['daily_return'] = df.groupby('ticker')['Close'].pct_change() * 100

    return df

def create_train_test_split(df, train_months=['July', 'August'], test_month='September'):
    """
    split data into training and testing sets
    """
    # create month names
    df['month_name'] = df['Date'].dt.strftime('%B')

    # split the data
    train_data = df[df['month_name'].isin(train_months)].copy()
    test_data = df[df['month_name'] == test_month].copy()

    # logger.info(f"training data: {train_data['Date'].min()} to {train_data['Date'].max()}")
    # logger.info(f"test data: {test_data['Date'].min()} to {test_data['Date'].max()}")
    logger.info("training observations by ticker:")
    for ticker, count in train_data.groupby('ticker').size().items():
        logger.info(f"  {ticker}: {count} observations")
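    # note (illustrative): with July-August as training and September as test,
    # each ticker ends up with roughly 40+ training rows and about 21 test rows
    # (US trading days); thin for the ML models below, but fine for the baselines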
{ticker}: {count} observations")return train_data, test_dataclass StockForecastingFramework:""" comprehensive stock price forecasting framework with multiple models """def__init__(self, prediction_log_dir='predictions_log'):self.prediction_log_dir = Path(prediction_log_dir)self.prediction_log_dir.mkdir(exist_ok=True)# initialize model registryself.models = {}self.model_configs = {'naive_baseline': {'description': 'Previous day closing price'},'seasonal_naive': {'description': 'Same weekday last week price'},'linear_trend': {'description': 'Linear regression with day-of-week features'},'xgboost': {'description': 'XGBoost with engineered features'} }# prediction storageself.predictions_df = pd.DataFrame()self.model_performance = {} logger.info("StockForecastingFramework initialized") logger.info(f"prediction logs will be saved to: {self.prediction_log_dir}")def prepare_features(self, df, ticker):""" create features for machine learning models """ ticker_data = df[df['ticker'] == ticker].copy().sort_values('Date')# technical indicators ticker_data['sma_5'] = ticker_data['Close'].rolling(5).mean() ticker_data['sma_20'] = ticker_data['Close'].rolling(20).mean() ticker_data['volatility_5'] = ticker_data['Close'].rolling(5).std()# lag featuresfor lag in [1, 2, 3, 5]: ticker_data[f'close_lag_{lag}'] = ticker_data['Close'].shift(lag) ticker_data[f'return_lag_{lag}'] = ticker_data['daily_return'].shift(lag)# day of week dummies ticker_data['monday'] = (ticker_data['Date'].dt.weekday ==0).astype(int) ticker_data['tuesday'] = (ticker_data['Date'].dt.weekday ==1).astype(int) ticker_data['wednesday'] = (ticker_data['Date'].dt.weekday ==2).astype(int) ticker_data['thursday'] = (ticker_data['Date'].dt.weekday ==3).astype(int) ticker_data['friday'] = (ticker_data['Date'].dt.weekday ==4).astype(int)# tesla friday effect (special feature based on your analysis)if ticker =='TSLA': ticker_data['tesla_friday_effect'] = ticker_data['friday']else: ticker_data['tesla_friday_effect'] =0return ticker_data.dropna()def log_prediction(self, prediction_date, target_date, ticker, model, predicted_price, actual_price=None):""" log prediction to tracking system """ timestamp = datetime.now()# calculate errors if actual price is available absolute_error =None percentage_error =None direction_correct =Noneif actual_price isnotNone: absolute_error =abs(predicted_price - actual_price) percentage_error = (absolute_error / actual_price) *100 direction_correct =True# create prediction record prediction_record = {'timestamp': timestamp,'prediction_date': prediction_date,'target_date': target_date,'ticker': ticker,'model': model,'predicted_price': predicted_price,'actual_price': actual_price,'absolute_error': absolute_error,'percentage_error': percentage_error,'direction_correct': direction_correct }# add to internal storageself.predictions_df = pd.concat([self.predictions_df, pd.DataFrame([prediction_record]) ], ignore_index=True)# log to file log_file =self.prediction_log_dir /'stock_predictions.csv' pd.DataFrame([prediction_record]).to_csv( log_file, mode='a', header=not log_file.exists(), index=False )# fixed the syntax error here - was "matarget_date" # logger.info(f"logged prediction: {ticker} {model} -> ${predicted_price:.2f} for {target_date}")def evaluate_model_performance(self, model_name, predictions, actuals):""" calculate comprehensive performance metrics """# convert to numpy arrays predictions = np.array(predictions) actuals = np.array(actuals) mae = mean_absolute_error(actuals, predictions) rmse = 
    def evaluate_model_performance(self, model_name, predictions, actuals):
        """
        calculate comprehensive performance metrics
        """
        # convert to numpy arrays
        predictions = np.array(predictions)
        actuals = np.array(actuals)

        mae = mean_absolute_error(actuals, predictions)
        rmse = np.sqrt(mean_squared_error(actuals, predictions))
        mape = np.mean(np.abs((actuals - predictions) / actuals)) * 100

        self.model_performance[model_name] = {
            'MAE': mae,
            'RMSE': rmse,
            'MAPE': mape,
            'n_predictions': len(predictions)
        }

        logger.info(f"{model_name} performance - MAE: ${mae:.2f}, RMSE: ${rmse:.2f}, MAPE: {mape:.1f}%")
        return self.model_performance[model_name]

# initialize the framework
def setup_forecasting_framework():
    """
    setup the forecasting environment
    """
    logger.info("setting up forecasting framework...")

    # initialize framework
    framework = StockForecastingFramework()

    # prepare data (assuming forecast_df from previous step)
    logger.info("preparing forecasting data...")
    return framework

# chunk 2: baseline models (naive and seasonal naive forecasters)
class BaselineModels:
    """
    naive and seasonal naive forecasting models that serve as benchmarks
    """
    def __init__(self, framework):
        self.framework = framework
        logger.info("initializing baseline models...")

    def naive_forecast(self, train_data, test_dates, ticker):
        """
        naive forecast: next day price = today's price
        """
        ticker_train = train_data[train_data['ticker'] == ticker].sort_values('Date')
        if len(ticker_train) == 0:
            logger.warning(f"no training data found for {ticker}")
            return {}

        # get last known price
        last_price = ticker_train['Close'].iloc[-1]
        last_date = ticker_train['Date'].iloc[-1]

        predictions = {}
        for target_date in test_dates:
            predictions[target_date] = last_price

            # log the prediction
            self.framework.log_prediction(
                prediction_date=last_date,
                target_date=target_date,
                ticker=ticker,
                model='naive_baseline',
                predicted_price=last_price
            )
        return predictions

    def seasonal_naive_forecast(self, train_data, test_dates, ticker):
        """
        seasonal naive: next monday = last monday's price, etc.
        """
        ticker_train = train_data[train_data['ticker'] == ticker].sort_values('Date')
        if len(ticker_train) == 0:
            logger.warning(f"no training data found for {ticker}")
            return {}

        predictions = {}
        for target_date in test_dates:
            target_weekday = target_date.weekday()

            # find most recent day with the same weekday
            same_weekday_data = ticker_train[
                ticker_train['Date'].dt.weekday == target_weekday
            ].sort_values('Date')

            if len(same_weekday_data) > 0:
                # use the most recent same-weekday price
                seasonal_price = same_weekday_data['Close'].iloc[-1]
                seasonal_date = same_weekday_data['Date'].iloc[-1]
            else:
                # fall back to naive if no same weekday found
                seasonal_price = ticker_train['Close'].iloc[-1]
                seasonal_date = ticker_train['Date'].iloc[-1]

            predictions[target_date] = seasonal_price

            # log the prediction
            self.framework.log_prediction(
                prediction_date=seasonal_date,
                target_date=target_date,
                ticker=ticker,
                model='seasonal_naive',
                predicted_price=seasonal_price
            )
        return predictions

    def run_baseline_forecasts(self, train_data, test_data):
        """
        run both baseline models for all tickers
        """
        logger.info("running baseline forecasts for all tickers...")

        tickers = train_data['ticker'].unique()
        test_dates = sorted(test_data['Date'].unique())
        all_predictions = {}

        for ticker in tickers:
            # run naive forecast
            naive_preds = self.naive_forecast(train_data, test_dates, ticker)

            # run seasonal naive forecast
            seasonal_preds = self.seasonal_naive_forecast(train_data, test_dates, ticker)

            all_predictions[ticker] = {
                'naive_baseline': naive_preds,
                'seasonal_naive': seasonal_preds
            }
        return all_predictions
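# quick illustration of the two baselines with hypothetical numbers:
#   if the last August close was $100, the naive baseline predicts $100
#   for every September day; if the most recent Monday close was $98,
#   the seasonal naive predicts $98 for the next Monday, and so on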
def run_baseline_evaluation(framework, baseline_models, train_data, test_data):
    """
    evaluate baseline model performance on test data
    """
    logger.info("evaluating baseline model performance...")

    # run predictions
    predictions = baseline_models.run_baseline_forecasts(train_data, test_data)

    # evaluate against actual test data
    tickers = test_data['ticker'].unique()
    for ticker in tickers:
        ticker_test = test_data[test_data['ticker'] == ticker].sort_values('Date')
        if ticker not in predictions:
            continue

        for model_name in ['naive_baseline', 'seasonal_naive']:
            model_preds = predictions[ticker][model_name]

            # align predictions with actual test dates
            pred_values = []
            actual_values = []
            for _, row in ticker_test.iterrows():
                test_date = row['Date']
                actual_price = row['Close']
                if test_date in model_preds:
                    pred_values.append(model_preds[test_date])
                    actual_values.append(actual_price)

            if len(pred_values) > 0:
                # evaluate performance
                framework.evaluate_model_performance(
                    f"{model_name}_{ticker}", pred_values, actual_values
                )

def setup_and_run_baselines(framework, train_data, test_data):
    """
    setup and run baseline models
    """
    logger.info("setting up baseline models...")

    # initialize baseline models
    baseline_models = BaselineModels(framework)

    # run baseline evaluation
    run_baseline_evaluation(framework, baseline_models, train_data, test_data)

    logger.info("chunk 2 complete - baseline models evaluated")
    return baseline_models

# chunk 3: statistical models (linear regression and xgboost)
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

class StatisticalModels:
    """
    linear regression and xgboost models with feature engineering
    """
    def __init__(self, framework):
        self.framework = framework
        self.models = {}
        self.scalers = {}
        self.train_data = None
        logger.info("initializing statistical models...")

    def prepare_model_features(self, data, ticker, is_training=True):
        """
        prepare feature matrix for ml models
        """
        if is_training:
            # for training, use all available data for feature engineering
            ticker_data = self.framework.prepare_features(data, ticker)
            self.train_data = data
            # logger.info(f"training feature data for {ticker}: {len(ticker_data)} rows after dropna")
        else:
            # for testing, we simulate real forecasting - no future prices known
            test_ticker_data = data[data['ticker'] == ticker].sort_values('Date')
            train_ticker_data = self.train_data[self.train_data['ticker'] == ticker].sort_values('Date')

            # get the last known values from training data
            last_train_close = train_ticker_data['Close'].iloc[-1]
            last_train_return = (train_ticker_data['daily_return'].iloc[-1]
                                 if 'daily_return' in train_ticker_data.columns else 0)

            feature_rows = []
            for _, test_row in test_ticker_data.iterrows():
                test_date = test_row['Date']
                feature_dict = {
                    'Date': test_date,
                    'Close': last_train_close,
                    'ticker': ticker,
                    'daily_return': 0,
                    # indicators based on last known training data
                    'sma_5': last_train_close,
                    'sma_20': last_train_close,
                    'volatility_5': abs(last_train_return) if last_train_return else 1.0,
                    # lag features use historical data only
                    'close_lag_1': last_train_close,
                    'close_lag_2': last_train_close,
                    'close_lag_3': last_train_close,
                    'close_lag_5': last_train_close,
                    'return_lag_1': last_train_return,
                    'return_lag_2': last_train_return,
                    'return_lag_3': last_train_return,
                    'return_lag_5': last_train_return,
                    # day-of-week features
                    'monday': 1 if test_date.weekday() == 0 else 0,
                    'tuesday': 1 if test_date.weekday() == 1 else 0,
                    'wednesday': 1 if test_date.weekday() == 2 else 0,
                    'thursday': 1 if test_date.weekday() == 3 else 0,
                    'friday': 1 if test_date.weekday() == 4 else 0,
                    # tesla friday effect
                    'tesla_friday_effect': 1 if (ticker == 'TSLA' and test_date.weekday() == 4) else 0
                }
                feature_rows.append(feature_dict)

            ticker_data = pd.DataFrame(feature_rows)
            # logger.info(f"test feature data for {ticker}: {len(ticker_data)} rows created")
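        # design note: in forecasting mode every feature row above is frozen at
        # the last known training values, so the only inputs that vary over the
        # test horizon are the day-of-week dummies; the linear and xgboost models
        # effectively predict from a static snapshot plus weekday effects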
created")iflen(ticker_data) <1: logger.warning(f"insufficient feature data for {ticker}: only {len(ticker_data)} rows")returnNone, None, None, None# define feature columns feature_cols = ['sma_5', 'sma_20', 'volatility_5','close_lag_1', 'close_lag_2', 'close_lag_3', 'close_lag_5','return_lag_1', 'return_lag_2', 'return_lag_3', 'return_lag_5','monday', 'tuesday', 'wednesday', 'thursday', 'friday','tesla_friday_effect' ]# create feature matrix X = ticker_data[feature_cols].values y = ticker_data['Close'].values dates = ticker_data['Date'].valuesreturn X, y, dates, feature_colsdef train_linear_model(self, train_data, ticker):""" train linear regression model with day-of-week features """ X_train, y_train, train_dates, feature_cols =self.prepare_model_features(train_data, ticker, is_training=True)if X_train isNoneorlen(X_train) ==0: logger.warning(f"no training data for linear model: {ticker}")returnNone# scale features scaler = StandardScaler() X_train_scaled = scaler.fit_transform(X_train)# train linear regression model = LinearRegression() model.fit(X_train_scaled, y_train)# store model and scaler model_key =f"linear_{ticker}"self.models[model_key] = modelself.scalers[model_key] = scaler# logger.info(f"trained linear regression for {ticker} with {len(X_train)} samples")return model, scalerdef train_xgboost_model(self, train_data, ticker):""" train xgboost model with advanced features """ X_train, y_train, train_dates, feature_cols =self.prepare_model_features(train_data, ticker, is_training=True)if X_train isNoneorlen(X_train) ==0: logger.warning(f"no training data for xgboost model: {ticker}")returnNone# xgboost parameters params = {'objective': 'reg:squarederror','max_depth': 4,'learning_rate': 0.1,'n_estimators': 50,'random_state': 42 }# train xgboost model = xgb.XGBRegressor(**params) model.fit(X_train, y_train)# store model model_key =f"xgboost_{ticker}"self.models[model_key] = model# logger.info(f"trained xgboost for {ticker} with {len(X_train)} samples")return modeldef predict_linear(self, test_data, ticker):""" make predictions using linear regression """ model_key =f"linear_{ticker}"if model_key notinself.models: logger.warning(f"no trained linear model for {ticker}")return {} model =self.models[model_key] scaler =self.scalers[model_key]# prepare test features X_test, y_test, test_dates, feature_cols =self.prepare_model_features(test_data, ticker, is_training=False)if X_test isNoneorlen(X_test) ==0: logger.warning(f"no test features for linear model: {ticker}")return {}# scale and predict X_test_scaled = scaler.transform(X_test) predictions = model.predict(X_test_scaled)# create prediction dictionary pred_dict = {}for i, date inenumerate(test_dates): pred_dict[pd.to_datetime(date)] = predictions[i]# log predictionself.framework.log_prediction( prediction_date=test_dates[-1] iflen(test_dates) >0else date, target_date=pd.to_datetime(date), ticker=ticker, model='linear_trend', predicted_price=predictions[i] )# logger.info(f"linear model predictions for {ticker}: {len(predictions)} forecasts")return pred_dictdef predict_xgboost(self, test_data, ticker):""" make predictions using xgboost """ model_key =f"xgboost_{ticker}"if model_key notinself.models: logger.warning(f"no trained xgboost model for {ticker}")return {} model =self.models[model_key]# prepare test features X_test, y_test, test_dates, feature_cols =self.prepare_model_features(test_data, ticker, is_training=False)if X_test isNoneorlen(X_test) ==0: logger.warning(f"no test features for xgboost model: {ticker}")return {}# 
        # predict
        predictions = model.predict(X_test)

        # create prediction dictionary and log each forecast
        pred_dict = {}
        for i, date in enumerate(test_dates):
            pred_dict[pd.to_datetime(date)] = predictions[i]

            # log prediction
            self.framework.log_prediction(
                prediction_date=test_dates[-1] if len(test_dates) > 0 else date,
                target_date=pd.to_datetime(date),
                ticker=ticker,
                model='xgboost',
                predicted_price=predictions[i]
            )
        # logger.info(f"xgboost predictions for {ticker}: {len(predictions)} forecasts")
        return pred_dict

    def run_statistical_forecasts(self, train_data, test_data):
        """
        train and run both statistical models for all tickers
        """
        logger.info("running statistical forecasts for all tickers...")

        tickers = train_data['ticker'].unique()
        all_predictions = {}

        for ticker in tickers:
            # logger.info(f"processing statistical models for {ticker}...")

            # train models
            linear_model = self.train_linear_model(train_data, ticker)
            xgb_model = self.train_xgboost_model(train_data, ticker)

            # make predictions
            linear_preds = self.predict_linear(test_data, ticker)
            xgb_preds = self.predict_xgboost(test_data, ticker)

            all_predictions[ticker] = {
                'linear_trend': linear_preds,
                'xgboost': xgb_preds
            }

        logger.info("statistical forecasts completed for all tickers")
        return all_predictions

def run_statistical_evaluation(framework, statistical_models, train_data, test_data):
    """
    evaluate statistical model performance
    """
    logger.info("evaluating statistical model performance...")

    # run predictions
    predictions = statistical_models.run_statistical_forecasts(train_data, test_data)

    # evaluate against actual test data
    tickers = test_data['ticker'].unique()
    for ticker in tickers:
        ticker_test = test_data[test_data['ticker'] == ticker].sort_values('Date')
        if ticker not in predictions:
            continue

        for model_name in ['linear_trend', 'xgboost']:
            model_preds = predictions[ticker][model_name]

            # align predictions with actual test dates
            pred_values = []
            actual_values = []
            for _, row in ticker_test.iterrows():
                test_date = pd.to_datetime(row['Date'])
                actual_price = row['Close']
                if test_date in model_preds:
                    pred_values.append(model_preds[test_date])
                    actual_values.append(actual_price)

            if len(pred_values) > 0:
                # evaluate performance
                framework.evaluate_model_performance(
                    f"{model_name}_{ticker}", pred_values, actual_values
                )

def setup_and_run_statistical_models(framework, train_data, test_data):
    """
    setup and run statistical models
    """
    logger.info("setting up statistical models...")

    # initialize statistical models
    statistical_models = StatisticalModels(framework)

    # run evaluation
    run_statistical_evaluation(framework, statistical_models, train_data, test_data)

    logger.info("chunk 3 complete - statistical models evaluated")
    return statistical_models

# performance summary function
def get_performance_df(framework):
    """
    extract performance data from framework and return as polars dataframe
    """
    if not framework.model_performance:
        logger.warning("no model performance data found")
        return None

    # convert performance dict to list
    performance_data = []
    for model_ticker, metrics in framework.model_performance.items():
        # parse model name and ticker (format: "model_name_TICKER")
        parts = model_ticker.split('_')
        ticker = parts[-1]
        model = '_'.join(parts[:-1]).replace('_', ' ').title()

        performance_data.append({
            'ticker': ticker,
            'model': model,
            'mae_dollars': metrics['MAE'],
            'rmse_dollars': metrics['RMSE'],
            'mape_percent': metrics['MAPE'] / 100,
            'accuracy': 1 - metrics['MAPE'] / 100,
            'predictions': metrics['n_predictions']
        })

    # convert to polars and sort
    performance_df = pl.DataFrame(performance_data).sort(['ticker', 'model'])
    return performance_df
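# note: get_performance_df assumes performance keys follow "model_name_TICKER",
# which holds here since none of the six tickers contain an underscore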
# main execution
logger.info("starting forecasting pipeline...")

# setup framework
framework = setup_forecasting_framework()

# prepare data
forecast_df = prepare_forecasting_data(pl_results)
train_data, test_data = create_train_test_split(forecast_df)

# run baseline models
baseline_models = setup_and_run_baselines(framework, train_data, test_data)

# run statistical models
statistical_models = setup_and_run_statistical_models(framework, train_data, test_data)

# show results
performance_df = get_performance_df(framework)
```

## On the choice of Forecasting Models

I always prefer to nest multiple models per series (per stock, in this case). This gives me a baseline against which other contending models can be run and compared. Here, four forecasting approaches were tested: Naive Baseline (last known price), Seasonal Naive (same-weekday historical price), Linear Regression with day-of-week features, and XGBoost with engineered technical indicators (mainly revolving around rolling n-day averages).

- Simple models dominate across the portfolio: Naive Baseline achieved the lowest error rates for Apple (4.08% MAPE) and Microsoft (0.98% MAPE), while Seasonal Naive performed best for Tesla (12.21% MAPE) and Netflix (1.56% MAPE).
- Machine learning shows limited advantage: XGBoost outperformed simpler methods only for Nvidia (2.12% MAPE); technically also for Netflix, but with virtually the same accuracy as the Naive Baseline (a very close second to XGBoost). This suggests that complex feature engineering and modeling provide minimal forecasting improvement for most stocks in this timeframe.
- Volatility drives prediction difficulty: high-volatility stocks like Tesla show consistently poor performance across all models (12-15% MAPE), while stable stocks like Microsoft achieve sub-1% error rates even with basic approaches.
- Day-of-week seasonality often proves more valuable than technical indicators: the success of the Seasonal Naive model validates the calendar analysis findings, demonstrating that weekly trading patterns contain more predictive signal than moving averages or lag features for short-term forecasting.

```{python}
#| label: display-forecasts-error-summary
# check comprehensive performance across all models
performance_df = get_performance_df(framework)

# sort on stock and mape (ascending)
performance_df = performance_df.sort(['ticker', 'mape_percent'])

(
    gt_nyt_custom(
        performance_df,
        title='Complete Model Performance By Stock',
        first_10_rows_only=False
    )
    .fmt_percent(['Mape Percent', 'Accuracy'])
    .cols_hide('Predictions')
)
```
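To make the per-stock 'winner' claims easy to verify, here is a small convenience snippet (my own addition, using the `performance_df` just built) that pulls the lowest-MAPE model for each ticker:

```{python}
#| label: best-model-per-ticker
# lowest-MAPE model per ticker: sort by error, then take the first row per group
best_per_ticker = (
    performance_df
    .sort('mape_percent')
    .group_by('ticker', maintain_order=True)
    .first()
    .select(['ticker', 'model', 'mape_percent'])
    .sort('ticker')
)
best_per_ticker
```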
::: text-justify
The results reveal that sophisticated machine learning models fairly consistently under-perform simpler approaches across most stocks. Notably, XGBoost achieves the lowest error rate only for Nvidia (2.12% MAPE), while the basic Naive Baseline or Seasonal Naive models dominate performance for Apple, Microsoft, Netflix (tied with XGBoost), and Tesla.

This pattern suggests that for short-term daily price forecasting, the day-of-week seasonality captured by simple historical patterns often provides more predictive value than complex feature engineering.

The exceptions are Google, where Linear Trend slightly outperforms the baseline methods, and Tesla, where all models struggle with high volatility (12-15% MAPE), indicating that Tesla's erratic price movements are inherently difficult to forecast regardless of the modeling approach.

I wanted to include even more complex forecasting methods such as Prophet (from Meta) or LSTMs (Long Short-Term Memory networks), but seeing that simple models already achieve high accuracy, I decided not to go for those. In a real-world, day-to-day job, every additional line of code/model is additional maintenance, and that comes with its own set of risks and pain points.

Thanks for taking the time to read this and hopefully, if nothing else, you have enjoyed it. More to come!
:::
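P.S. for the curious: had Prophet made the cut, slotting it into the same train/test split would have looked roughly like the sketch below. This is an untested, illustrative sketch (it assumes the `prophet` package is installed, e.g. via `uv add prophet`, and reuses the `train_data`/`test_data` frames from earlier); `prophet_forecast` is a hypothetical helper, not code from the analysis above.

```{python}
#| label: prophet-sketch
#| eval: false
from prophet import Prophet

def prophet_forecast(train_data, test_data, ticker):
    # prophet expects a frame with columns ds (date) and y (value)
    train = (
        train_data[train_data['ticker'] == ticker]
        .rename(columns={'Date': 'ds', 'Close': 'y'})[['ds', 'y']]
    )
    m = Prophet(weekly_seasonality=True, yearly_seasonality=False, daily_seasonality=False)
    m.fit(train)

    # predict only the actual test dates (trading days), not a naive daily range
    future = (
        test_data.loc[test_data['ticker'] == ticker, ['Date']]
        .rename(columns={'Date': 'ds'})
    )
    forecast = m.predict(future)
    return dict(zip(future['ds'], forecast['yhat']))
```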