Nomad2018 Predicting Transparent Conductors

A top-10% finish in this Kaggle competition.
Posted by Bigzhao on February 26, 2018

What did work:

  • In this competition, a simple weighted sum was more effective than stacking or blending. It may be that the models I added were too highly correlated, which is why stacking did not do well.

  • The final models were XGBoost, CatBoost, LightGBM, sklearn's GBDT, several DNNs (only so-so), and the GP from a public kernel. At best, weighting these models together reached a root mean squared log error (RMSLE) of 0.0615; unfortunately, when picking the final submissions I missed it and chose a 0.0651 file instead...

Here is an example of the simple weighted sum:

import numpy as np
import pandas as pd

# Collect each model's predictions, aligned by id
p_buf = []
for df in [sub_dnn, sub_stack_nn]:
    # df = pd.read_csv(file)
    df = df.sort_values('id')
    ids = df['id'].values
    p = df[['formation_energy_ev_natom', 'bandgap_energy_ev']].values
    p_buf.append(p)

# Average the predictions and save to file
preds = np.mean(p_buf, axis=0)
subm = pd.DataFrame()
subm['id'] = ids
subm['formation_energy_ev_natom'] = preds[:, 0]
subm['bandgap_energy_ev'] = preds[:, 1]
subm.to_csv('stacknn+dnn.csv', index=False)
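Note that np.mean above weights both submissions equally. For a true weighted sum, np.average takes an explicit weights argument; a minimal sketch, with made-up weights that would be tuned against the CV score:

# Weighted average instead of a plain mean.
# The 0.7/0.3 weights are hypothetical, not the weights used in the competition.
preds = np.average(p_buf, axis=0, weights=[0.7, 0.3])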
  • Hyperparameter tuning: examples below.

grid search:

def parameters():
    for learning_rate in [0.03, 0.05, 0.1]:
        for n_estimators in [100, 200, 300]:
            yield {
                'n_estimators': n_estimators,
                'learning_rate': learning_rate,
                'random_state': 1234,
            }

from sklearn.ensemble import GradientBoostingRegressor

# Try every combination and record its CV score
params_set = []
mses = []
for param_dist in parameters():
    model = GradientBoostingRegressor(**param_dist)
    res = run_model_without_test(model, train_1, y_1)
    params_set.append(param_dist.copy())
    mses.append(res)

# Widen the display so the full params dict is visible, then list the best sets
pd.set_option('max_colwidth', 300)
res = pd.DataFrame({'mse': mses, 'params': params_set})
res.sort_values('mse').head()
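The hand-rolled generator does the job, but sklearn's ParameterGrid expands the same grid without the nested loops; an equivalent sketch:

from sklearn.model_selection import ParameterGrid

# Same grid as above, expanded by sklearn instead of nested loops
param_grid = {
    'n_estimators': [100, 200, 300],
    'learning_rate': [0.03, 0.05, 0.1],
    'random_state': [1234],
}
for param_dist in ParameterGrid(param_grid):
    model = GradientBoostingRegressor(**param_dist)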

random search:

from random import choice

def parameters(n_iter=50):
    # Draw n_iter random candidates from the search space
    for _ in range(n_iter):
        yield {
            'n_estimators': choice(range(100, 1000, 10)),
            'learning_rate': choice([...]),  # candidate list elided in the original
            'random_state': 1234,
        }

params_set = []
mses = []
for param_dist in parameters():
    model = GradientBoostingRegressor(**param_dist)
    res = run_model_without_test(model, train_1, y_1)
    params_set.append(param_dist.copy())
    mses.append(res)

pd.set_option('max_colwidth', 300)
res = pd.DataFrame({'mse': mses, 'params': params_set})
res.sort_values('mse').head()
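For learning_rate it is usually better to sample log-uniformly than to pick from a fixed list; a small sketch of that idea (the 0.001–0.01 bounds mirror the commented-out line in the network-search code further below):

import numpy as np

def sample_learning_rate(low=0.001, high=0.01):
    # Log-uniform draw: uniform in log space, then exponentiate
    return float(np.exp(np.random.uniform(np.log(low), np.log(high))))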

What did not work:

  • NNs: I tried BNNs, ensembles of DNNs, and other architectures, using random search over the network structure. Probably because the dataset is so small, plain neural networks did not seem to work well in this competition. I eventually found a few DNNs with decent CV scores, but they were still far behind the tree models... In the end I kept them anyway since NN models add a little diversity, but judging from the final private score they helped little and, if anything, dragged the ensemble down.

  • Along the way I also tried Google's free GPU (Colaboratory). After double-checking several times that the GPU really was enabled, I found it ran slower than my pure-CPU setup... capitalist wool is harder to fleece than it looks...

Code for randomly searching network architectures:

import gc
from random import choice

import numpy as np
import pandas as pd

import keras
import tensorflow as tf
import keras.backend as K
from keras.optimizers import Adam
from keras.models import Model
from keras.layers import Dense, Dropout, BatchNormalization, Activation, Input
from keras.callbacks import EarlyStopping, ModelCheckpoint

from sklearn.model_selection import train_test_split

def rmsle(h, y):
    """
    Compute the Root Mean Squared Log Error for hypothesis h and targets y

    Args:
        h - numpy array containing predictions with shape (n_samples, n_targets)
        y - numpy array containing targets with shape (n_samples, n_targets)
    """
    return np.sqrt(np.square(np.log(h + 1) - np.log(y + 1)).mean())

def rmsle_all(y_true, y_pred):
    # Competition metric: the mean of the per-target RMSLEs
    return np.mean([rmsle(y_true[:, 0], y_pred[:, 0]), rmsle(y_true[:, 1], y_pred[:, 1])])
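# Quick sanity check (hypothetical values): identical predictions give an RMSLE of 0
_h = np.array([[0.5, 1.0], [0.2, 2.0]])
assert rmsle_all(_h, _h) == 0.0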

X = pd.read_csv('train_v2.csv').values
y = pd.read_csv('y12_v2.csv').values
X_test = pd.read_csv('test_v2.csv').values

# Standardize train and test together, then split back (2396 training rows)
all_data = np.vstack([X, X_test])

from sklearn.preprocessing import StandardScaler

all_data = StandardScaler().fit_transform(all_data)

X = all_data[:2396]
X_test = all_data[2396:]

def get_model(n_layer, n_neural, lr):
    """Return a model with n_layer hidden blocks of n_neural units each."""
    X_input = Input(shape=(23,))
    X = Dense(n_neural, kernel_initializer='he_normal')(X_input)
    X = BatchNormalization()(X)
    X = Activation('sigmoid')(X)

    for i in range(n_layer):
        X = Dense(n_neural)(X)
        X = BatchNormalization()(X)
        X = Activation('sigmoid')(X)

    # ReLU output keeps the two log1p-transformed targets non-negative
    X = Dense(2, activation='relu')(X)

    model = Model(inputs=X_input, outputs=X)
    optimizer = Adam(lr=lr, decay=0.001)
    model.compile(optimizer=optimizer, loss='mse', metrics=[rmsle_K])
    return model

# RMSLE as a Keras/TensorFlow metric. Since the targets are already
# log1p-transformed, log1p(expm1(y)) == y and this reduces to RMSE in log space.
def rmsle_K(y, y0):
    return K.sqrt(K.mean(K.square(tf.log1p(tf.expm1(y)) - tf.log1p(tf.expm1(y0)))))

def assess_nn(X, y, n_layer, n_neural, batch_size, lr):
    """Score one architecture over three random train/valid splits."""
    res = []
    seeds = [30, 2018, 123]
    for seed in seeds:
        X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.1, random_state=seed)
        model = get_model(n_layer, n_neural, lr)
        model.fit(x=X_train, y=y_train, epochs=2000, batch_size=batch_size,
                  validation_data=(X_valid, y_valid), verbose=0)

        pred = model.predict(X_valid)
        res.append(rmsle_all(np.expm1(y_valid), np.expm1(pred)))
        print("  rmsle = ", res[-1])
    return np.array(res)

# Draw one random architecture, score it, and append the result to a log file
n_layer = choice(np.arange(2, 8))
n_neural = choice(np.arange(16, 90, 4))
batch_size = choice([8, 16, 32, 64])
# lr = np.exp((np.log(0.01) - np.log(0.001)) * np.random.rand() + np.log(0.001))
lr = 0.001

rmsle_scores = assess_nn(X, y, n_layer, n_neural, batch_size, lr)
print('rmsle:{} n_layer:{} n_neural:{} batch_size:{} lr:{}\n'.format(rmsle_scores, n_layer, n_neural, batch_size, lr))
with open("keras_random_search_cv.txt", "a+") as f:
    f.write('rmsle:{} n_layer:{} n_neural:{} batch_size:{} lr:{}\n'.format(rmsle_scores, n_layer, n_neural, batch_size, lr))
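EarlyStopping is imported above but never wired in; with 2000 epochs per candidate, stopping on validation loss would cut the search time considerably. A minimal sketch of how the model.fit call inside assess_nn could be changed (the patience value is my assumption, not from the original code):

# Stop once validation loss has not improved for 50 epochs (patience is a guess)
early_stop = EarlyStopping(monitor='val_loss', patience=50, verbose=0)
model.fit(x=X_train, y=y_train, epochs=2000, batch_size=batch_size,
          validation_data=(X_valid, y_valid), verbose=0,
          callbacks=[early_stop])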

What other people did:

  • “I noticed that materials are similar to documents. The lattice constants are similar to the sentence length or the sentence count. The number of atoms is similar to the word count. The atom pairs are similar to the bigram.” Combining the geometry information with this sentence analogy, solved with an RNN: https://www.kaggle.com/c/nomad2018-predict-transparent-conductors/discussion/46293

  • Combining the geometry information with a CNN: https://www.kaggle.com/c/nomad2018-predict-transparent-conductors/discussion/49884

  • More feature engineering: around 350 new features were created; because the new features were strongly correlated, the final models subsampled 20% of the features and 40% of the rows (much like the row/column subsampling in tree models; see the sketch below): https://www.kaggle.com/c/nomad2018-predict-transparent-conductors/discussion/49844
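A rough sketch of what that 20%/40% subsampling looks like when expressed through sklearn's built-in parameters (my own mapping of the description above, not the original author's code):

from sklearn.ensemble import GradientBoostingRegressor

# max_features=0.2 subsamples ~20% of the columns per split,
# subsample=0.4 draws ~40% of the rows per boosting stage
model = GradientBoostingRegressor(max_features=0.2, subsample=0.4, random_state=1234)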