ETF Screening Process and Key Points Overview

    1. Basic Data Acquisition and Preliminary Filtering

Retrieve ETF list: Call get_all_securities(['etf']) to get every ETF on the market, then keep only those established before January 1, 2023 (start_date < 2023-01-01) so that each has sufficient historical data.
Exclude low-liquidity ETFs: Manually remove specific ETFs with very low average trading volume (e.g., 159003.XSHE China Merchants Fast Track ETF, 159005.XSHE Harvest Fund Quick Money ETF, etc., average volume ≤ 2.92k).
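
The two preliminary filters can be sketched with pandas as follows. This is only an illustration: the `etfs` table is a hypothetical stand-in for what get_all_securities(['etf']) returns (assumed here to be indexed by ETF code with a start_date column), and the codes, names, and blacklist are invented.

```python
import pandas as pd

# Hypothetical stand-in for the ETF list; all rows are invented.
etfs = pd.DataFrame(
    {
        "display_name": ["ETF A", "ETF B", "ETF C", "ETF D"],
        "start_date": pd.to_datetime(
            ["2012-05-28", "2019-11-01", "2023-06-15", "2014-03-10"]
        ),
    },
    index=["AAA.XSHG", "BBB.XSHE", "CCC.XSHG", "DDD.XSHE"],
)


def initial_filter(etfs, cutoff, blacklist):
    """Keep ETFs established before `cutoff` that are not on the manual blacklist."""
    established_early = etfs["start_date"] < pd.Timestamp(cutoff)
    not_blacklisted = ~etfs.index.isin(blacklist)
    return etfs[established_early & not_blacklisted]


# DDD.XSHE plays the role of a manually excluded low-liquidity ETF.
pool = initial_filter(etfs, "2023-01-01", blacklist=["DDD.XSHE"])
print(list(pool.index))
```

Keeping the blacklist as an explicit argument makes the manual exclusions easy to review and update.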

    2. Daily ETF Data and Return Calculation
      Data Range: Obtain closing prices for the most recent 240 trading days up to today.
      Return Processing: Calculate daily returns (pchg = close.pct_change()) and collect them into an ETF return matrix (stored as prices despite holding returns; rows = trading days, columns = ETF codes).
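
A minimal sketch of the return calculation, assuming the closing prices are already in a DataFrame with one column per ETF. The price values and column names are invented, and `daily_returns` is a hypothetical helper name.

```python
import pandas as pd


def daily_returns(close: pd.DataFrame) -> pd.DataFrame:
    """Turn a closing-price table (rows = trading days, columns = ETF codes)
    into daily percentage returns, dropping the first all-NaN row."""
    return close.pct_change().dropna(how="all")


# Invented closing prices for two ETFs over three trading days.
close = pd.DataFrame(
    {"A": [1.00, 1.10, 1.21], "B": [2.00, 1.90, 1.90]},
    index=pd.to_datetime(["2024-01-02", "2024-01-03", "2024-01-04"]),
)
returns = daily_returns(close)
print(returns)
```

The first row is dropped because a return needs a previous close to compare against.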
    3. K-Means Clustering for Deduplication (Based on Similarity in Trends)
      Clustering Goal: Group ETFs with similar trends to reduce duplicates.
      Parameters: Run the KMeans algorithm with n_clusters=30 and random_state=42. The cluster count is kept fairly high because too few clusters would merge dissimilar ETFs.
      Within-Cluster Selection: Keep only the earliest established ETF in each cluster, because:
    • Earlier establishment → usually higher trading volume (better liquidity);
    • Earlier establishment → more historical data (better for model training).
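
The clustering step might look roughly like the sketch below, using scikit-learn. Treating each ETF's return series as one sample (so the matrix is transposed), filling gaps with zero, and the helper name `dedupe_by_cluster` are all assumptions; the data is synthetic, with two clusters instead of 30 to keep the example small.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans


def dedupe_by_cluster(returns, start_dates, n_clusters, random_state=42):
    """Cluster ETFs by their return series and keep the earliest-established
    ETF in each cluster. Returns (kept codes, per-ETF cluster table)."""
    # Each ETF is one sample: rows = ETFs, columns = trading days.
    features = returns.T.fillna(0.0)
    model = KMeans(n_clusters=n_clusters, random_state=random_state, n_init=10)
    labels = model.fit_predict(features.to_numpy())
    table = pd.DataFrame(
        {"cluster": labels, "start_date": start_dates.reindex(features.index)},
        index=features.index,
    )
    earliest = table.sort_values("start_date").groupby("cluster").head(1)
    return sorted(earliest.index), table


# Synthetic returns: three ETFs track factor A, three track factor B.
rng = np.random.default_rng(0)
factor_a = rng.normal(0.0, 0.01, 240)
factor_b = rng.normal(0.0, 0.01, 240)
returns = pd.DataFrame(
    {
        "A1": factor_a + rng.normal(0.0, 0.001, 240),
        "A2": factor_a + rng.normal(0.0, 0.001, 240),
        "A3": factor_a + rng.normal(0.0, 0.001, 240),
        "B1": factor_b + rng.normal(0.0, 0.001, 240),
        "B2": factor_b + rng.normal(0.0, 0.001, 240),
        "B3": factor_b + rng.normal(0.0, 0.001, 240),
    }
)
start_dates = pd.Series(
    pd.to_datetime(
        ["2015-01-05", "2012-03-01", "2018-07-02",
         "2016-04-11", "2019-09-09", "2013-06-03"]
    ),
    index=["A1", "A2", "A3", "B1", "B2", "B3"],
)

kept, table = dedupe_by_cluster(returns, start_dates, n_clusters=2)
print(kept)
```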
    4. Silhouette Score Evaluation of Clustering Effectiveness
      Calculate silhouette score: approximately 0.45 (moderate level, indicating decent compactness and separation, but room for improvement).
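
One way to explore the "room for improvement" is to compare silhouette scores across several cluster counts. The scan below is not part of the original process; it is a sketch on synthetic data (three underlying groups of four series each), and `silhouette_by_k` is a hypothetical helper.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def silhouette_by_k(features, k_values, random_state=42):
    """Fit KMeans for each candidate k and return {k: silhouette score}."""
    scores = {}
    for k in k_values:
        model = KMeans(n_clusters=k, random_state=random_state, n_init=10)
        labels = model.fit_predict(features)
        scores[k] = silhouette_score(features, labels)
    return scores


# Synthetic return series: 3 latent factors, 4 noisy copies of each.
rng = np.random.default_rng(0)
factors = rng.normal(0.0, 0.01, size=(3, 240))
features = np.vstack(
    [factor + rng.normal(0.0, 0.001, size=(4, 240)) for factor in factors]
)

scores = silhouette_by_k(features, range(2, 7))
best_k = max(scores, key=scores.get)
print(best_k, round(float(scores[best_k]), 3))
```

Scores range from -1 to 1, with higher values meaning tighter, better-separated clusters. On real ETF data the best-scoring k will not necessarily match the chosen n_clusters=30, since that value was set to limit merging of dissimilar ETFs rather than to maximize the score.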
    5. Secondary Filtering Based on Correlation (Further Reduce Correlation)
      Correlation matrix: Compute correlation of ETF returns (corr = prices[df.code].corr()).
      Handling highly correlated pairs: For each pair with correlation > 0.85, keep only the ETF established earlier and remove the other (e.g., remove 159922.XSHE, 512100.XSHG, etc.).
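
The pairwise rule can be sketched as below. This applies the rule literally to every pair above the threshold; the original code may handle ordering or ties differently. The data is synthetic and `drop_correlated` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd


def drop_correlated(returns, start_dates, threshold=0.85):
    """For every pair of ETFs whose return correlation exceeds `threshold`,
    drop the one established later. Returns the surviving codes."""
    corr = returns.corr()
    codes = list(corr.columns)
    to_drop = set()
    for i, first in enumerate(codes):
        for second in codes[i + 1:]:
            if corr.loc[first, second] > threshold:
                # On equal dates this drops `second`.
                later = first if start_dates[first] > start_dates[second] else second
                to_drop.add(later)
    return [code for code in codes if code not in to_drop]


# Synthetic returns: Y is a near-copy of X, Z is independent.
rng = np.random.default_rng(1)
x = rng.normal(0.0, 0.01, 240)
returns = pd.DataFrame(
    {
        "X": x,
        "Y": x + rng.normal(0.0, 0.001, 240),
        "Z": rng.normal(0.0, 0.01, 240),
    }
)
start_dates = pd.Series(
    pd.to_datetime(["2015-01-05", "2012-03-01", "2018-07-02"]),
    index=["X", "Y", "Z"],
)

kept = drop_correlated(returns, start_dates)
print(kept)
```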
    6. Optional: Filter Out Recently Established ETFs (Improve Data Quality)
      Set threshold: Remove ETFs established after 2020 (e.g., 513060.XSHG Hang Seng Healthcare, 515790.XSHG Photovoltaic ETF), to ensure remaining ETFs have richer historical data (useful for model training).
    7. Notes and Additional Recommendations
      Special handling for government bond ETFs: If the pool is used for modeling, exclude the 511010.XSHG government bond ETF. Its price trend is nearly linear (similar to Yu’ebao) with minimal volatility, which can interfere with the model’s learning of volatility features and offers no predictive value.
      Handling declining ETFs: The results may include long-term declining ETFs (e.g., healthcare ETF, real estate ETF). Whether to exclude depends on strategy goals:
    • For stable returns, consider removing;
    • If the strategy performs well even with declining ETFs included, that suggests robustness (but beware of look-ahead bias, the “future function” risk: whether a declining ETF will reverse cannot be known in advance).
      Visualization validation: Plot remaining ETFs’ price charts (e.g., since 2017) to manually verify if correlations and distributions meet expectations (low correlation, reasonable spread).
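
The government bond note in this section could also be checked numerically rather than by eye. The sketch below flags ETFs whose annualized volatility falls under a floor. The 2% floor, the 252-day annualization convention, and the helper name are assumptions not taken from the original, and the data is synthetic.

```python
import numpy as np
import pandas as pd


def flag_low_volatility(returns, min_annual_vol=0.02, periods_per_year=252):
    """Return codes whose annualized return volatility is below `min_annual_vol`."""
    annual_vol = returns.std() * np.sqrt(periods_per_year)
    return sorted(annual_vol[annual_vol < min_annual_vol].index)


# Synthetic returns: one near-linear series and one equity-like series.
rng = np.random.default_rng(2)
returns = pd.DataFrame(
    {
        "BOND_LIKE": 0.0001 + rng.normal(0.0, 0.0002, 240),
        "EQUITY_LIKE": rng.normal(0.0, 0.012, 240),
    }
)

flagged = flag_low_volatility(returns)
print(flagged)
```

Anything flagged this way is only a candidate for exclusion; the price chart remains the final check.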

Final filtering logic summary:
Through “initial filtering → clustering deduplication → secondary correlation filtering → (optional) establishment time filtering,” obtain a pool of ETFs with good liquidity, low trend correlation, and ample historical data. The core goal is to provide diverse, high-quality underlying assets for strategies or models.
