A machine-learning-based decision system for the Indian mainboard IPO market. The model predicts the probability that a freshly issued IPO will list at more than a 5% gain over its issue price, and a backtested strategy converts those probabilities into daily portfolio allocations.
Live dashboard:
Source: https://github.com/Nityunj-Goel/ipo-ml-decision-system
- Task framing: binary classification of P(listing_gain > 5%) from pre-listing public signals (subscription multiples, issue size, price band, GMP, year, etc.).
- Model: logistic regression (elastic-net), chosen over random forest, XGBoost, and LightGBM on time-series CV ROC-AUC.
- Decision rule: on each IPO listing day, allocate capital equally across all IPOs whose predicted probability exceeds a learned threshold `t_min` ≈ 0.41. Per-IPO allotment is approximated as `1 / max(1, NII subscription multiple)` (assuming the NII category).
- Holdout result (unseen 2025 data, 75 trade days, 108 IPOs): cumulative return +16.4%, mean per-trade-day return +0.20%, win rate 61.3%, Sharpe-like 0.43.
- Full backtest (2017–2025, 444 IPOs, 358 trade days): cumulative return +242.5%, mean per-trade-day return +0.35%, win rate 62%, Sharpe-like 0.33.
- The Problem We're Solving
- Problem Framing
- Data
- Pipeline Architecture
- Modeling
- Decision Engine
- Business Performance Metrics: Definitions
- Backtesting & Results
- Prediction API
- Repository Structure
- Getting Started
- Limitations & Assumptions
- Future Scope
- FAQ
- Data Sources & Credits
- Project Learnings
- Disclaimer
Indian IPO listings are a high-variance event. Some double on day one, others nosedive 30%, and most land somewhere in between. The only things an investor actually sees before bidding closes are a handful of public signals: subscription numbers, issue size, price band, grey market premium.
So, on every IPO closing day, an investor faces two very practical questions:
"Is this IPO worth subscribing to at all?"
"If multiple IPOs close today, how do I split my capital between them?"
This project is an end-to-end answer to both. It does not try to predict exact listing-day returns, since that turned out to be unreliable on this dataset (see Decisions.md: Why Classification Instead of Regression). Instead, it learns where the positive expected value lives in the IPO opportunity space and routes capital there with a simple, robust rule.
The deliverable is three things stitched together:
- a calibrated probability model,
- a backtested decision rule on top of it, and
- a live dashboard + prediction API to actually use it.
Listing gain is the percentage difference between an IPO's listing-day open price and its issue price:
listing_gain_% = ((listing_day_open − issue_price) / issue_price) × 100
It's the standard market measure of IPO underpricing: how much value the opening auction discovered above what the issuer charged.
target = 1 if listing_gain_% > 5 else 0
The 5% buffer absorbs slippage, brokerage, and taxes, making the target closer to realizable profit than to theoretical underpricing.
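For concreteness, here is a minimal sketch of the target construction in pandas; the column names are illustrative and not necessarily the repository's actual schema:

```python
import pandas as pd

# Illustrative columns; the repo's real schema may differ.
df = pd.DataFrame({
    "issue_price": [100.0, 250.0, 80.0],
    "listing_day_open": [132.0, 248.0, 85.0],
})

# Listing gain: % difference between listing-day open and issue price.
df["listing_gain_pct"] = (df["listing_day_open"] - df["issue_price"]) / df["issue_price"] * 100

# Binary target: 1 if the IPO clears the 5% buffer, else 0.
df["target"] = (df["listing_gain_pct"] > 5).astype(int)
print(df)
```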
A quick read of listing_gain_% across the dataset:
- Heavy right skew, with a heavy tail of extreme outliers (a few IPOs with very large gains dominate the variance).
- Dense cluster in the 0–10% range, where most IPOs land.
- At the 5% threshold the positive class is ~57% of samples, so no class imbalance to fight.
The first two facts are why exact-magnitude regression is unreliable on this dataset (see next subsection). The third is why the classifier can be trained without resampling or class weights.
Regression on listing gain produced cross-validation R² ≈ 0 and holdout R² < 0 due to the heavy-tailed return distribution and weak signal for exact magnitude. Reframing as "is this IPO likely to clear a meaningful threshold?" turned an unstable problem into a stable one (holdout AUC ≈ 0.84). Full reasoning in Decisions.md: Why Classification Instead of Regression.
On a multi-IPO day this looks like a classic portfolio allocation problem (which subset to fund, and in what proportions). EDA revealed it usually isn't. ~85% of IPO closing days have only a single IPO, so the dominant decision is trade vs. no-trade, not how to weight a basket. The system therefore reduces to a two-step rule: filter by a learned probability threshold, then split capital equally among the survivors.
Full design history (including the original wᵢ ∝ pᵢ^α rule that was dropped) lives in Decisions.md: Allocation Strategy.
- Chittorgarh: IPO subscription, issue, and listing data
- Investorgain: grey market premium (GMP) snapshots
- NSE: listing-day open prices
- BSE: IPO start/end dates and price band (low/high), used to backfill missing fields in the NSE-aggregated dataset. ~407 of 720 IPOs were enriched this way (393 exact-name matches + 14 fuzzy matches via `rapidfuzz`).
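The fuzzy-name backfill might look roughly like the sketch below. This is only an illustration of `rapidfuzz` matching under assumed names and an assumed score cutoff, not the repository's actual scraper or matching code:

```python
from rapidfuzz import fuzz, process

# Hypothetical company-name lists; real data comes from the NSE/BSE scrapes.
nse_names = ["Tata Technologies Limited", "Netweb Technologies India Ltd"]
bse_names = ["TATA TECHNOLOGIES LTD", "NETWEB TECHNOLOGIES INDIA LIMITED"]

def match_name(name: str, candidates: list[str], cutoff: int = 90):
    """Return the best fuzzy match above `cutoff`, or None if nothing is close enough."""
    result = process.extractOne(name, candidates, scorer=fuzz.token_sort_ratio, score_cutoff=cutoff)
    return result[0] if result else None

for name in nse_names:
    print(name, "->", match_name(name, bse_names))
```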
- Indian mainboard IPOs only (SME segment excluded)
- Years: 2006–2025
- 720 IPOs after cleaning
Pre-listing, publicly observable at end of bidding window:
| Feature | Notes |
|---|---|
| `qib`, `nii`, `retail`, `total` | Raw subscription multiples by category (and overall) |
| `qib_ratio`, `nii_ratio`, `retail_ratio` | Each category's share of total subscription; captures the composition of demand independent of magnitude |
| `issue_amount` | Issue size (Rs. crores) |
| `price_band_high`, `price_band_low` | Issue price band |
| `gmp` | Grey market premium (nullable; not always available) |
| `is_gmp_missing` | Binary indicator for missing GMP; turns missingness itself into a signal |
| `year` | Captures regime / cohort effects |
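As a rough illustration of how the derived features above could be computed (the repo's real implementation lives under src/features/ and may differ in details such as the exact ratio denominator):

```python
import pandas as pd

# Illustrative raw fields for two IPOs; values are made up.
df = pd.DataFrame({
    "qib": [30.0, 2.1], "nii": [50.0, 0.8], "retail": [10.0, 1.5],
    "total": [25.0, 1.4], "gmp": [45.0, None],
})

# One plausible construction of the demand-composition ratios.
for cat in ("qib", "nii", "retail"):
    df[f"{cat}_ratio"] = df[cat] / df["total"]

# Turn GMP missingness itself into a signal.
df["is_gmp_missing"] = df["gmp"].isna().astype(int)
print(df)
```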
- Subscription numbers used at training time are end-of-window snapshots; live inference uses the same field captured ~30 minutes before close, so a small training-serving skew exists.
- GMP coverage is incomplete and reliability varies by source (handled via the `is_gmp_missing` flag).
- See Limitations below.
```
Raw scraped data
        │
        ▼
[Data Pipeline]     data collection → aggregation → cleaning → feature engineering
        │
        ▼
[Model Pipeline]    preprocessing → feature selection → training → evaluation (ROC-AUC) → calibration
        │
        ▼
[Decision Engine]   probability → threshold filter → equal-weight allocation
        │
        ▼
[Action]            API response / dashboard / (future: broker execution)
```
<placeholder: architecture diagram, to be added by author>
- Logistic regression (elastic-net): selected
- Random forest
- XGBoost
- LightGBM
On this dataset (~700 IPOs, ~10 features), the tree-based models, including hyperparameter-tuned XGBoost and LightGBM, failed to clearly outperform a well-regularized logistic regression on time-series CV ROC-AUC, and showed more variance across folds. The combination of a small dataset, mostly-monotonic feature-to-outcome relationships, and the need for clean calibrated probabilities downstream all favored a simpler model. Logistic regression also gives:
- well-behaved, interpretable coefficients
- calibrated probabilities out of the box (with minor post-fit calibration)
- a much smaller artifact and faster inference path
Final hyperparameters live in `configs/config.yml` under `logistic_regression`:

```yaml
C: 0.0091
l1_ratio: 0.183   # elastic-net (mix of L1 and L2)
solver: saga
class_weight: balanced
```

- Sort the dataset by IPO date; reserve the last 15% as a holdout test set.
- On the remaining 85%, run `TimeSeriesSplit` CV with `n_splits=5` and `gap=30` days to prevent leakage across folds.
- For each model × hyperparameter set: feature selection (where applicable), train, evaluate ROC-AUC.
- Hyperparameters tuned with Optuna.
- Select top model based on best mean ROC-AUC across folds.
- Top model retrained on the full pre-holdout set, scored once on holdout.
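A condensed sketch of this protocol, using scikit-learn with synthetic data standing in for the real feature matrix (the repo's actual runner, feature selection, and Optuna search live in src/models/ and src/pipelines/):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import TimeSeriesSplit
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; X is assumed to be sorted by IPO date.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 10))
y = (rng.random(600) > 0.43).astype(int)

model = make_pipeline(
    StandardScaler(),
    LogisticRegression(
        penalty="elasticnet", solver="saga",
        C=0.0091, l1_ratio=0.183, class_weight="balanced", max_iter=5000,
    ),
)

# gap=30 leaves a buffer between train and validation indices to limit leakage.
cv = TimeSeriesSplit(n_splits=5, gap=30)
aucs = []
for train_idx, val_idx in cv.split(X):
    model.fit(X[train_idx], y[train_idx])
    probs = model.predict_proba(X[val_idx])[:, 1]
    aucs.append(roc_auc_score(y[val_idx], probs))
print(f"mean CV ROC-AUC: {np.mean(aucs):.3f}")
```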
AUC measures separability between the positive and negative class across all thresholds. Desirable here because the business threshold is learned downstream by the backtester. AUC rewards a model that ranks IPOs well, which is exactly what the allocation engine needs.
| Split | ROC-AUC |
|---|---|
| Time-series CV (validation) | 0.867 |
| Holdout (unseen) | 0.839 |
On each IPO listing day, only IPOs with `prob ≥ t_min` are considered for trading. `t_min` is learned, not assumed: it's tuned in the backtester to maximize cumulative return / stability.
```yaml
portfolio.trade_threshold: 0.4091   # learned via backtesting
```

Capital is split equally among all IPOs that pass the threshold on a given day. This is a deliberate simplification of the originally proposed wᵢ ∝ pᵢ^α rule; see Decisions.md: Allocation Strategy.

Per-IPO allotment is approximated as `1 / max(1, NII_subscription_multiple)`. Oversubscribed issues yield a pro-rata fraction of the requested shares, which removes the unrealistic "100% allotment" assumption from earlier iterations.
contributionᵢ = weightᵢ × allotmentᵢ × listing_gain_%ᵢ
day_return = Σ contributionᵢ
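Putting the threshold filter, equal weighting, allotment approximation, and day-return formula together, a simplified version of the decision engine could look like this (field names and values are illustrative; the real allocator and backtester live in src/portfolio/):

```python
T_MIN = 0.4091  # learned trade threshold

def allocate(ipos: list[dict], t_min: float = T_MIN) -> list[dict]:
    """Filter IPOs by predicted probability, then split capital equally among survivors."""
    selected = [ipo for ipo in ipos if ipo["probability"] >= t_min]
    weight = 1.0 / len(selected) if selected else 0.0
    out = []
    for ipo in ipos:
        w = weight if ipo in selected else 0.0
        # Allotment approximation: pro-rata fraction based on NII subscription multiple.
        allotment = 1.0 / max(1.0, ipo["nii"])
        out.append({**ipo, "weight": w, "allotment": allotment})
    return out

def day_return(allocated: list[dict]) -> float:
    """Sum of weight x allotment x listing-day gain over the day's IPOs."""
    return sum(a["weight"] * a["allotment"] * a["listing_gain_pct"] for a in allocated)

# Example: one multi-IPO day (made-up values); only IPO A passes the threshold.
ipos = [
    {"name": "A", "probability": 0.72, "nii": 50.0, "listing_gain_pct": 24.0},
    {"name": "B", "probability": 0.35, "nii": 0.8,  "listing_gain_pct": -6.0},
]
print(day_return(allocate(ipos)))
```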
| Metric | Definition |
|---|---|
| Cumulative Return | (∏(1 + r_d/100) − 1) × 100 over all daily portfolio returns |
| Mean Daily Return | (1/N) · Σ(Σ wᵢ × allotmentᵢ × rᵢ)_day |
| Win Rate | Fraction of trade days with positive portfolio return |
| % Days Traded | Fraction of trade days where Σwᵢ > 0 |
| Volatility | std(portfolio_return_day); needs ≥ 5 days |
| Sharpe-like | mean_daily_return / volatility; no risk-free adjustment; needs ≥ 5 days |
| Avg Return per Calendar Day | mean_daily_return × num_ipo_days / calendar_days; adjusts for inactive days |
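A small numerical sketch of how these metrics combine, using made-up daily portfolio returns (in %):

```python
import numpy as np

daily_returns = np.array([0.8, -0.4, 2.1, 0.0, -1.2, 3.5, 0.6])  # illustrative values

cumulative_return = (np.prod(1 + daily_returns / 100) - 1) * 100
mean_daily_return = daily_returns.mean()
win_rate = (daily_returns > 0).mean()
volatility = daily_returns.std()                # meaningful only with >= 5 trade days
sharpe_like = mean_daily_return / volatility    # no risk-free adjustment

print(f"cumulative {cumulative_return:.2f}%  win rate {win_rate:.0%}  sharpe-like {sharpe_like:.2f}")
```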
- Walk-forward simulation across 2017–2025
- Daily granularity (one decision point per IPO listing day)
- Strategy parameters (`t_min`, allocation rule) tuned on CV folds, then applied to the unseen holdout window
| Metric | Value |
|---|---|
| Calendar days | 366 |
| IPO trade days | 75 |
| IPOs evaluated | 108 |
| Cumulative return | +16.38% |
| Avg return / trade day | +0.20% |
| Avg return / calendar day | +0.04% |
| Win rate (trade days) | 61.3% |
| % trade days deployed | 72.0% |
| Volatility (trade days) | 0.47% |
| Sharpe-like | +0.43 |
| Metric | Value |
|---|---|
| Calendar days | 3,256 |
| IPO trade days | 358 |
| IPOs evaluated | 444 |
| Cumulative return (Strategy) | +242.54% |
| Cumulative return (equal-weight allocation, no-filter baseline) | −54% |
| Avg return / trade day | +0.35% |
| Win rate (trade days) | 62.0% |
| % trade days deployed | 73.5% |
| Sharpe-like | +0.33 |
The selection-filtered strategy compounds positively across the full window while the unfiltered "equal-weight on every IPO" baseline ends sharply negative, confirming the model's value is in filtering, not in fine-grained allocation.
| Probability bucket | Avg listing-day return |
|---|---|
| 0.3–0.4 | −3.2% |
| 0.4–0.5 | +4.1% |
| 0.5–0.6 | +14.6% |
| 0.6–0.7 | +24.0% |
| 0.7+ | +55.3% |
Two things to read from this:

- Expected return crosses zero around p ≈ 0.4, which is why the learned `t_min` lands there, not at the naive 0.5.
- Predicted probability monotonically tracks realized return: the model produces a useful ranking signal, in addition to acting as a classifier.
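A bucket analysis like the table above can be reproduced with a simple pandas groupby over predicted probabilities; the snippet below uses synthetic trades purely to show the mechanics, and the column names are assumptions:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
trades = pd.DataFrame({
    "probability": rng.uniform(0.2, 0.9, 200),
    "listing_gain_pct": rng.normal(8, 20, 200),
})

# Average realized listing-day return per predicted-probability bucket.
buckets = pd.cut(trades["probability"], bins=[0.3, 0.4, 0.5, 0.6, 0.7, 1.0])
print(trades.groupby(buckets, observed=True)["listing_gain_pct"].mean().round(1))
```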
A FastAPI service exposes the trained model as an HTTP endpoint.
```bash
uvicorn app.main:app --host 127.0.0.1 --port 8000
```

Swagger UI: http://127.0.0.1:8000/docs
Request

```json
{
  "ipos": [
    {
      "nii": 50.0,
      "qib": 30.0,
      "retail": 10.0,
      "total": 25.0,
      "year": 2025,
      "issue_amount": 500.0,
      "price_band_high": 500.0,
      "price_band_low": 475.0,
      "gmp": null
    }
  ]
}
```

Response
```json
{
  "allocations": [
    {"probability": 0.7321, "allocation_weight": 1.0}
  ]
}
```

The service returns one (probability, allocation_weight) pair per submitted IPO. Weights are computed assuming all submitted IPOs share the same listing day. IPOs below `t_min` receive `allocation_weight = 0`.
```
.
├── README.md            # this file
├── Decisions.md         # deeper "why" behind key technical choices
├── requirements.txt
├── configs/
│   ├── config.yml       # paths, model hyperparams, t_min, threshold
│   └── feature_config.py
├── data/                # raw, filtered, aggregated CSVs
├── notebooks/           # eda, models, backtesting, BSEscraper
├── src/
│   ├── data/            # cleaning + aggregation
│   ├── features/        # engineering + selection
│   ├── models/          # trainer, eval, experiment runner
│   ├── pipelines/       # data, model, inference, prediction pipelines
│   ├── portfolio/       # allocator, backtester
│   └── utils/
├── app/                 # FastAPI inference service
├── dashboard/           # Streamlit app + artifact builder
├── artifacts/
│   ├── models/          # trained, joblib-serialized prediction pipeline
│   └── dashboard/       # trades.csv, meta.json for the dashboard
└── logs/
```
```bash
# 1. Install
pip install -r requirements.txt

# 2. (Optional) Re-run the modeling notebook end-to-end
jupyter notebook notebooks/models.ipynb

# 3. (Optional) Re-run the backtest and regenerate dashboard artifacts
jupyter notebook notebooks/backtesting.ipynb
python -m dashboard.build_artifacts

# 4. Serve the model
uvicorn app.main:app --host 127.0.0.1 --port 8000

# 5. Launch the dashboard (in a separate shell)
streamlit run dashboard/app.py
```

- Dataset is small by ML standards (720 IPOs after cleaning).
- Temporal drift exists: listing-day returns trend higher in the recent regime.
- GMP missingness is non-random; some IPOs have no signal there (partially captured via `is_gmp_missing`).
- Allotment is approximated from the NII subscription multiple; actual allotment depends on category-specific lottery rules.
- No transaction costs, taxes, or slippage are modeled. The 5% gain buffer is the only cushion.
- Trade execution at issue price is assumed (i.e., assumes successful application + allotment + sell at listing-day open).
- No capital-blocking model: funds are assumed available even on consecutive-IPO days.
- No microstructure modeling: listing gain is the auction-discovered open price; actual fills can deviate.
A black-swan IPO whose features fall outside the training distribution can produce a confidently wrong prediction. Backtesting suggests the strategy remains profitable on average over multi-year windows, but a single bad day can be material. A human-in-the-loop layer at decision time is recommended for live use.
- Monitoring: drift detection on input features and the prediction distribution; automated alerts.
- Retraining pipeline: scheduled retrains as new IPOs list, with model registry / versioning.
- Live decisioning: scheduled job that fetches open IPOs, runs predictions ~30 min before bidding close, and emits notifications (or trades).
- Pre-flight check: a dry run at issue open to validate the end-to-end inference path before the real prediction run.
- Broker integration: order placement and exit-strategy automation.
- CI/CD: model registry, artifact versioning, deployment automation.
- Multi-year fundamentals: encode growth trends from prospectus financials (currently excluded due to schema inconsistency across filings; see Decisions.md).
Q: What is listing gain? Listing-day open price minus issue price, expressed as a percentage of issue price.

Q: Why a 5% threshold instead of >0? A minimum economically meaningful return: it covers brokerage, slippage, taxes, and a risk buffer. It also makes the target closer to realizable profit than to theoretical underpricing.

Q: If listing gain is positive, can I actually capture it by selling at market open? Mostly, yes: historically, the bulk of IPO underpricing is captured in the opening auction and early trading, and the 5% buffer protects against execution noise. But this system models pricing efficiency, not intraday microstructure; precise fills depend on order flow and liquidity not represented here.

Q: Why not a regression model? Tried it. R² ≈ 0 in CV, < 0 on holdout. Heavy-tailed, low-signal returns make exact-magnitude prediction unreliable on this dataset. Classification + ranking is far more stable. Full breakdown in Decisions.md.

Q: Why logistic regression over XGBoost / LightGBM? Small dataset (~700 IPOs), mostly-monotonic feature/outcome relationships, and a need for clean calibrated probabilities downstream. Tuned tree ensembles didn't clearly beat a well-regularized elastic-net logistic regression on time-series CV AUC, and showed higher fold-to-fold variance. Simpler model + smaller artifact + faster inference won.

Q: Why not include 3-year company financials as features? Three reasons: (1) prospectus financial coverage is inconsistent across IPOs, (2) demand-side signals like subscription ratios already absorb most of the fundamental signal indirectly, and (3) IPO listing-day price action is more sentiment-driven than fundamentals-driven. A future iteration could fold in summarized growth metrics.

Q: What if a black-swan IPO arrives? That's classic data drift. 100% loss avoidance isn't possible; the strategy is built to be profitable on average over multi-year windows. Mitigations: scheduled retraining, drift monitoring, and a human-in-the-loop approval step before live trades (feasible because IPO events are rare).
- IPO data: Chittorgarh
- Grey market premium: Investorgain
- Listing-day prices: NSE
- IPO start/end dates and price band backfill: BSE
- Scope first, model later. Pinning down the business metric, the input/output contract, and verifying that the dataset can plausibly support the target: that's the most leveraged hour of the project.
- Modeling is ~10% of the work. Data, evaluation, and the decision layer around the model are the other 90%.
- Get a baseline working end-to-end before iterating. Error analysis on a working baseline beats over-investing in any single stage.
- Simpler often wins. A learned threshold + equal weighting beat a more complex pᵢ^α allocation rule once the data structure (one IPO/day on most days) was understood, and a regularized logistic regression beat tuned tree ensembles on this small dataset.
- Centre on business logic, work backwards. "What decision are we improving?" is a sharper compass than "what model fits best?"
This project is for educational and research purposes only. It is not investment advice. Past performance does not guarantee future results. The author accepts no responsibility for any financial loss arising from use of this system or its outputs. Consult a qualified financial advisor before making any investment decision.