[HELP] The Application of Deep Neural Networks

MistyhV1 · May-05-2024, 02:22 PM

Hello everyone,

I'm currently working on a regression model to predict financial asset returns, building upon the research by Welch and Goyal and the work of Gu et al. (2020). However, I'm facing a few challenges and would really appreciate any guidance you can offer.

R² Comparison: I am using the r2_score from sklearn.metrics to calculate the R² of my model. In the paper by Gu et al. (2020), they also mention an R², but I want to ensure that the formula they used is identical to that of r2_score. Could someone confirm whether this function adheres to the standard R² formula, or are there specific nuances to consider?
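
For reference, here is a minimal sketch contrasting the two definitions. It assumes that the out-of-sample R² in Gu et al. (2020) is read as 1 - Σ(r - r̂)² / Σ r², with a denominator that is not demeaned, whereas sklearn's r2_score divides by the sum of squared deviations from the mean of y_true. The helper name r2_oos_zero_benchmark and the toy returns are made up for illustration.

```python
import numpy as np
from sklearn.metrics import r2_score


def r2_oos_zero_benchmark(actual, predicted):
    """1 - sum((y - yhat)^2) / sum(y^2); the denominator is NOT demeaned."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return 1.0 - np.sum((actual - predicted) ** 2) / np.sum(actual ** 2)


# Toy returns (made-up numbers, for illustration only)
y_true = np.array([0.02, -0.01, 0.03, 0.00, 0.01])

# A naive forecast: predict the sample mean for every observation
y_hat = np.full_like(y_true, y_true.mean())

r2_sklearn = r2_score(y_true, y_hat)            # benchmark = mean of y_true
r2_zero = r2_oos_zero_benchmark(y_true, y_hat)  # benchmark = forecast of zero

print("sklearn r2_score:      ", r2_sklearn)
print("non-demeaned R2 (OOS): ", r2_zero)
```

Predicting the sample mean scores 0 under r2_score but a positive value under the non-demeaned version, so the two numbers are not directly comparable.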

Explanatory Variables: Currently, my model only uses years and months as explanatory variables. I'm looking to incorporate individual asset characteristics, macroeconomic variables, and potentially the products of these variables.

Here is what the ticker_10000 file looks like, with all the features:

permno	DATE	mvel1	beta	betasq	chmom	dolvol	idiovol	indmom	mom1m	mom6m	mom12m	mom36m	pricedelay	turn	absacc	acc	age	agr	bm	bm_ia	cashdebt	cashpr	cfp	cfp_ia	chatoia	chcsho	chempia	chinv	chpmia	convind	currat	depr	divi	divo	dy	egr	ep	gma	grcapx	grltnoa	herf	hire	invest	lev	lgr	mve_ia	operprof	orgcap	pchcapx_ia	pchcurrat	pchdepr	pchgm_pchsale	pchquick	pchsale_pchinvt	pchsale_pchrect	pchsale_pchxsga	pchsaleinv	pctacc	ps	quick	rd	rd_mve	rd_sale	realestate	roic	salecash	saleinv	salerec	secured	securedind	sgr	sin	sp	tang	tb	aeavol	cash	chtx	cinvest	ear	nincr	roaq	roavol	roeq	rsup	stdacc	stdcf	ms	baspread	ill	maxret	retvol	std_dolvol	std_turn	zerotrade	sic2
10000	19860228	16100						0.2111585701																																																																																	0.0769982184	1.2440505E-6	0.25	0.0652783882	1.2312885045	2.1208045717	4.7851753E-8	39
10000	19860331	11960						0.2624713557	-0.257142872																																																																																0.0555114619	1.8917602E-6	0.0447761193	0.0310041349	1.0210892479	1.0797738144	1.0233918E-7	39

Here is the returns file for ticker_10000:

PERMNO	date	NCUSIP	TICKER	COMNAM	PRC	RET	RETX
10000	19851231						
10000	19860131	68391610	OMFGA	OPTIMUM MANUFACTURING INC	-4.375	C	C
10000	19860228	68391610	OMFGA	OPTIMUM MANUFACTURING INC	-3.25	-0.257143	-0.257143

Here are the macroeconomic series:

yyyymm	b/m	tbl	ntis	Rfree	svar	dp	ep	tms	dfy
195701	0.567242675	0.0311	0.027991994	0.0027	0.000901942	-3.248451342	-2.574685554	0.0017	0.0072
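
One possible way to combine the three files above is sketched below, assuming the characteristics and returns are joined on permno and a year-month key, and the macro series on that key alone. The column names follow the samples; every value in the toy frames is made up, and the choice of mvel1, mom1m, tbl and dfy for the interaction terms is only an example.

```python
import pandas as pd

# Toy frames that mimic the column names of the samples (values are made up)
chars = pd.DataFrame({
    'permno': [10000, 10000],
    'DATE': [19860228, 19860331],
    'mvel1': [16100.0, 11960.0],
    'mom1m': [None, -0.257142872],
})
returns = pd.DataFrame({
    'PERMNO': [10000, 10000],
    'date': [19860228, 19860331],
    'RET': [-0.257143, 0.365385],
})
macro = pd.DataFrame({
    'yyyymm': [198602, 198603],
    'tbl': [0.0706, 0.0656],
    'dfy': [0.0130, 0.0150],
})

# Build a common year-month key (YYYYMMDD -> YYYYMM) and align the id column name
chars['yyyymm'] = chars['DATE'] // 100
returns['yyyymm'] = returns['date'] // 100
returns = returns.rename(columns={'PERMNO': 'permno'})

# Characteristics + macro series (same for every asset) + returns
panel = (
    chars
    .merge(macro, on='yyyymm', how='left')
    .merge(returns[['permno', 'yyyymm', 'RET']], on=['permno', 'yyyymm'], how='inner')
)

# Interaction terms: each characteristic times each macro variable
char_cols = ['mvel1', 'mom1m']
macro_cols = ['tbl', 'dfy']
for c in char_cols:
    for m in macro_cols:
        panel[f'{c}_x_{m}'] = panel[c] * panel[m]

print(panel.columns.tolist())
```

This joins features and returns dated in the same month. For a forecasting setup the features of month t would normally be matched with the return of month t+1, so whether the characteristics file is already lagged is something to check before using the merge as is.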

Analysis by Assets and Sub-Periods: I also want to extend the analysis to cover multiple assets and different sub-periods. What would be the best method to structure my code to loop over multiple assets and sub-periods? Any specific code examples or frameworks you would recommend?
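
A rough sketch of one way to structure such a loop: group the panel by asset, define each sub-period as a set of dates, split train and test chronologically inside it, and collect one result row per (asset, sub-period). The data here is synthetic, the sub-period boundaries and the single column named feature are hypothetical, and LinearRegression stands in for any of the models.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic monthly panel for two hypothetical assets, 1990-01 to 2009-12
dates = pd.date_range('1990-01-01', periods=240, freq='MS')
frames = []
for permno in (10000, 10001):
    x = rng.normal(size=len(dates))
    frames.append(pd.DataFrame({
        'permno': permno,
        'date': dates,
        'feature': x,
        'RET': 0.01 * x + rng.normal(scale=0.05, size=len(dates)),
    }))
panel = pd.concat(frames, ignore_index=True)

# Hypothetical sub-periods: (train start, train end, test end)
sub_periods = {
    '1990s': (pd.Timestamp('1990-01-01'), pd.Timestamp('1997-12-31'), pd.Timestamp('1999-12-31')),
    '2000s': (pd.Timestamp('2000-01-01'), pd.Timestamp('2007-12-31'), pd.Timestamp('2009-12-31')),
}


def r2_zero_benchmark(actual, predicted):
    """R2 with a non-demeaned denominator: 1 - RSS / sum(actual^2)."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return 1.0 - np.sum((actual - predicted) ** 2) / np.sum(actual ** 2)


rows = []
for permno, asset in panel.groupby('permno'):
    for label, (start, train_end, test_end) in sub_periods.items():
        # Chronological split: train on the early part, test on the later part
        train = asset[(asset['date'] >= start) & (asset['date'] <= train_end)]
        test = asset[(asset['date'] > train_end) & (asset['date'] <= test_end)]

        model = LinearRegression().fit(train[['feature']], train['RET'])
        pred = model.predict(test[['feature']])

        rows.append({
            'permno': permno,
            'period': label,
            'n_test': len(test),
            'r2_oos': r2_zero_benchmark(test['RET'], pred),
        })

results = pd.DataFrame(rows)
print(results)
```

Keeping the results in one DataFrame makes it easy to pivot by asset or by period afterwards. The split is chronological rather than random, so the test months always come after the training months.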

Thank you in advance for your help and suggestions. Any advice or code examples would be greatly appreciated and could really help improve the robustness and efficacy of my model.

Best regards,

My code:

# -*- coding: utf-8 -*-
"""
Created on Wed May  1 15:26:41 2024

@author: Lucas
"""

import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error, r2_score
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM

# Path to the data files
path = r"C:\Users\Lucas\Desktop\Base de données\Stocks Prices - Returns"
files = [f for f in os.listdir(path) if f.startswith('return_')]
data_frames = []

for file in files:
    full_path = os.path.join(path, file)
    df = pd.read_csv(full_path, delimiter='\t', engine='python', dtype={'RETX': str})
    data_frames.append(df)

data = pd.concat(data_frames, ignore_index=True)  # unique row index across all files
data.columns = data.columns.str.replace('"', '').str.strip()
data.dropna(subset=['RETX'], inplace=True)
data = data[~data['RETX'].str.contains('[^0-9.-]', regex=True)]
data['date'] = pd.to_datetime(data['date'], format='%Y%m%d')
data['RETX'] = data['RETX'].astype(float)
data['year'] = data['date'].dt.year
data['month'] = data['date'].dt.month

# Select the columns for X and y
X = data[['year', 'month']]  # Add other relevant numeric features here
y = data['RETX']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardization for the neural networks and other models where needed
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Machine learning models
model_ols = LinearRegression().fit(X_train, y_train)
model_lasso = Lasso(alpha=0.1).fit(X_train, y_train)
model_ridge = Ridge(alpha=1.0).fit(X_train, y_train)
model_svr = SVR(kernel='rbf').fit(X_train_scaled, y_train)  # SVR is fitted on the standardized data
model_rf = RandomForestRegressor(n_estimators=100).fit(X_train, y_train)

# Neural network model
# Note: model_nn is trained here but never evaluated below; only model_dnn is scored
model_nn = Sequential([
    Dense(128, activation='relu', input_shape=(X_train_scaled.shape[1],)),
    Dense(64, activation='relu'),
    Dense(1)
])
model_nn.compile(optimizer='adam', loss='mse')
model_nn.fit(X_train_scaled, y_train, epochs=100, batch_size=32)

# Deep neural network (same architecture as model_nn, trained for fewer epochs)
model_dnn = Sequential([
    Dense(128, activation='relu', input_shape=(X_train_scaled.shape[1],)),
    Dense(64, activation='relu'),
    Dense(1)
])
model_dnn.compile(optimizer='adam', loss='mse')
model_dnn.fit(X_train_scaled, y_train, epochs=10, batch_size=32)

# LSTM network (each sample is reshaped to a sequence of length 1)
X_train_lstm = X_train_scaled.reshape(X_train_scaled.shape[0], 1, X_train_scaled.shape[1])
model_lstm = Sequential([
    LSTM(50, activation='relu', input_shape=(1, X_train_scaled.shape[1])),
    Dense(1)
])
model_lstm.compile(optimizer='adam', loss='mse')
model_lstm.fit(X_train_lstm, y_train, epochs=100, batch_size=32)


# Predictions and evaluation
y_pred_ols = model_ols.predict(X_test)
y_pred_lasso = model_lasso.predict(X_test)
y_pred_ridge = model_ridge.predict(X_test)
y_pred_svr = model_svr.predict(X_test_scaled)
y_pred_rf = model_rf.predict(X_test)
y_pred_dnn = model_dnn.predict(X_test_scaled).flatten()
X_test_lstm = X_test_scaled.reshape(X_test_scaled.shape[0], 1, X_test_scaled.shape[1])
y_pred_lstm = model_lstm.predict(X_test_lstm).flatten()

# Print the errors
print("MSE - OLS:", mean_squared_error(y_test, y_pred_ols))
print("MSE - Lasso:", mean_squared_error(y_test, y_pred_lasso))
print("MSE - Ridge:", mean_squared_error(y_test, y_pred_ridge))
print("MSE - SVR:", mean_squared_error(y_test, y_pred_svr))
print("MSE - RandomForest:", mean_squared_error(y_test, y_pred_rf))
print("MSE - DNN:", mean_squared_error(y_test, y_pred_dnn))
print("MSE - LSTM:", mean_squared_error(y_test, y_pred_lstm))

# Comparison plot
plt.figure(figsize=(10, 5))
plt.plot(y_test.index, y_test, label='Real')
plt.plot(y_test.index, y_pred_ols, label='OLS Predicted')
plt.legend()
plt.show()

# Compute R2 for each model
r2_ols = r2_score(y_test, y_pred_ols)
r2_lasso = r2_score(y_test, y_pred_lasso)
r2_ridge = r2_score(y_test, y_pred_ridge)
r2_svr = r2_score(y_test, y_pred_svr)
r2_rf = r2_score(y_test, y_pred_rf)
r2_dnn = r2_score(y_test, y_pred_dnn)
r2_lstm = r2_score(y_test, y_pred_lstm)

# Display the R2 results
model_names = ['OLS', 'LASSO', 'Ridge', 'SVR', 'Random Forest', 'DNN', 'LSTM']
r2_scores = [r2_ols, r2_lasso, r2_ridge, r2_svr, r2_rf, r2_dnn, r2_lstm]

plt.figure(figsize=(10, 6))
plt.bar(model_names, r2_scores, color='blue')
plt.xlabel('Model Type')
plt.ylabel('R2 Score')
plt.title('Comparison of R2 Scores Across Different Models')
plt.axhline(0, color='black', linewidth=0.8)  # R2 can be negative out of sample, so the y-axis is not clipped at 0
plt.show()

def calculate_r2_total(actual, predicted):
    # R2 with a non-demeaned denominator (sum of squared actual values),
    # unlike sklearn's r2_score, which subtracts the mean of `actual`
    rss = np.sum((actual - predicted) ** 2)
    tss = np.sum(actual ** 2)
    return 1 - rss / tss

def calculate_r2_predictive(actual, predicted):
    # This function is the same as R2 total in this simplified context
    return calculate_r2_total(actual, predicted)

# R2 total and R2 predictive for each model
# (the two functions return the same value in this simplified setup)

# R2 for OLS
r2_total_ols = calculate_r2_total(y_test, y_pred_ols)
r2_predictive_ols = calculate_r2_predictive(y_test, y_pred_ols)

print(f"R2 Total OLS: {r2_total_ols}")
print(f"R2 Predictive OLS: {r2_predictive_ols}")

# R2 for Lasso
r2_total_lasso = calculate_r2_total(y_test, y_pred_lasso)
r2_predictive_lasso = calculate_r2_predictive(y_test, y_pred_lasso)

# R2 for Ridge
r2_total_ridge = calculate_r2_total(y_test, y_pred_ridge)
r2_predictive_ridge = calculate_r2_predictive(y_test, y_pred_ridge)

# R2 for SVR (predictions come from the standardized data)
r2_total_svr = calculate_r2_total(y_test, y_pred_svr)
r2_predictive_svr = calculate_r2_predictive(y_test, y_pred_svr)

# R2 for Random Forest
r2_total_rf = calculate_r2_total(y_test, y_pred_rf)
r2_predictive_rf = calculate_r2_predictive(y_test, y_pred_rf)

# R2 for DNN (predictions were flattened above)
r2_total_dnn = calculate_r2_total(y_test, y_pred_dnn)
r2_predictive_dnn = calculate_r2_predictive(y_test, y_pred_dnn)

# R2 for LSTM (predictions were flattened above)
r2_total_lstm = calculate_r2_total(y_test, y_pred_lstm)
r2_predictive_lstm = calculate_r2_predictive(y_test, y_pred_lstm)

# Model names for the axis labels
model_names = ['OLS', 'LASSO', 'Ridge', 'SVR', 'Random Forest', 'DNN', 'LSTM']

# R2 scores for the visualization
r2_totals = [r2_total_ols, r2_total_lasso, r2_total_ridge, r2_total_svr, r2_total_rf, r2_total_dnn, r2_total_lstm]
r2_predictives = [r2_predictive_ols, r2_predictive_lasso, r2_predictive_ridge, r2_predictive_svr, r2_predictive_rf, r2_predictive_dnn, r2_predictive_lstm]

# Grouped bar chart of R2 total vs. R2 predictive
x = np.arange(len(model_names))  # bar positions
width = 0.35  # bar width

fig, ax = plt.subplots(figsize=(12, 6))  # a separate plt.figure() call would leave an empty figure
rects1 = ax.bar(x - width/2, r2_totals, width, label='R2 Total')
rects2 = ax.bar(x + width/2, r2_predictives, width, label='R2 Predictive')

# Add labels, title, etc.
ax.set_xlabel('Model Type')
ax.set_ylabel('R2 Score')
ax.set_title('Comparison of R2 Total and Predictive Across Different Models')
ax.set_xticks(x)
ax.set_xticklabels(model_names)
ax.legend()

plt.show()


# Test-set indices for the x-axis
test_indices = y_test.index

# Predictions of each model, keyed by the name used in titles and legends
model_predictions = {
    'OLS': y_pred_ols,
    'LASSO': y_pred_lasso,
    'Ridge': y_pred_ridge,
    'SVR': y_pred_svr,
    'Random Forest': y_pred_rf,
    'DNN': y_pred_dnn,
    'LSTM': y_pred_lstm,
}

# Figure setup
plt.figure(figsize=(18, 12))

# One scatter panel per model: real vs. predicted returns
for i, (name, y_pred) in enumerate(model_predictions.items(), start=1):
    plt.subplot(3, 3, i)
    plt.scatter(test_indices, y_test, label='Real', color='blue', marker='o')
    plt.scatter(test_indices, y_pred, label=f'{name} Predicted', color='red', marker='o')
    plt.title(f'{name} Predictions')
    plt.legend()

# Adjust the spacing between the panels
plt.tight_layout()

# Display the figure
plt.show()

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import time

# Function to simulate model predictions
def simulate_model_predictions(data, params):
    """ Simulate predictions from a simple, fictitious linear regression, for the example. """
    return data['feature'] * params['slope'] + np.random.normal(0, params['noise_level'], size=len(data))

# Function to plot the predictions
def plot_predictions(real, predicted, title):
    plt.figure(figsize=(10, 6))
    plt.scatter(real.index, real, color='blue', label='Real Returns')
    plt.scatter(real.index, predicted, color='red', label='Predicted Returns', alpha=0.5)
    plt.title(title)
    plt.xlabel('Index')
    plt.ylabel('Returns')
    plt.legend()
    plt.show()

# Simulated data
data = pd.DataFrame({
    'feature': np.random.rand(100),
    'RET': np.random.rand(100) * 10
})

# Fictitious parameters and model simulation
models = {
    'GLM': {'slope': 10, 'noise_level': 1},
    'LightGBM': {'slope': 9, 'noise_level': 1.5},
    'Random Forest': {'slope': 8, 'noise_level': 2},
    'XGBoost': {'slope': 7, 'noise_level': 2.5}
}

for model_name, params in models.items():
    start_time = time.time()
    predictions = simulate_model_predictions(data, params)
    elapsed_time = time.time() - start_time
    print(f'{model_name} simulation finished! Execution time: {time.strftime("%H:%M:%S", time.gmtime(elapsed_time))}')
    plot_predictions(data['RET'], predictions, f'{model_name} Model Predictions')
