Kernel: Python 2
Implementation of Salisman's Don't Overfit submission
From Kaggle
In order to achieve this we have created a simulated data set with 200 variables and 20,000 cases. An ‘equation’ based on this data was created in order to generate a Target to be predicted. Given the all 20,000 cases, the problem is very easy to solve – but you only get given the Target value of 250 cases – the task is to build a model that gives the best predictions on the remaining 19,750 cases.
In [30]:
In [31]:
In [32]:
In [38]:
['overfitting.csv']
In [69]:
'case_id,train,Target_Practice,Target_Leaderboard,Target_Evaluate,var_1,var_2,var_3,var_4,var_5,var_6,var_7,var_8,var_9,var_10,var_11,var_12,var_13,var_14,var_15,var_16,var_17,var_18,var_19,var_20,var_21,var_22,var_23,var_24,var_25,var_26,var_27,var_28,var_29,var_30,var_31,var_32,var_33,var_34,var_35,var_36,var_37,var_38,var_39,var_40,var_41,var_42,var_43,var_44,var_45,var_46,var_47,var_48,var_49,var_50,var_51,var_52,var_53,var_54,var_55,var_56,var_57,var_58,var_59,var_60,var_61,var_62,var_63,var_64,var_65,var_66,var_67,var_68,var_69,var_70,var_71,var_72,var_73,var_74,var_75,var_76,var_77,var_78,var_79,var_80,var_81,var_82,var_83,var_84,var_85,var_86,var_87,var_88,var_89,var_90,var_91,var_92,var_93,var_94,var_95,var_96,var_97,var_98,var_99,var_100,var_101,var_102,var_103,var_104,var_105,var_106,var_107,var_108,var_109,var_110,var_111,var_112,var_113,var_114,var_115,var_116,var_117,var_118,var_119,var_120,var_121,var_122,var_123,var_124,var_125,var_126,var_127,var_128,var_129,var_130,var_131,var_132,var_133,var_134,var_135,var_136,var_137,var_138,var_139,var_140,var_141,var_142,var_143,var_144,var_145,var_146,var_147,var_148,var_149,var_150,var_151,var_152,var_153,var_154,var_155,var_156,var_157,var_158,var_159,var_160,var_161,var_162,var_163,var_164,var_165,var_166,var_167,var_168,var_169,var_170,var_171,var_172,var_173,var_174,var_175,var_176,var_177,var_178,var_179,var_180,var_181,var_182,var_183,var_184,var_185,var_186,var_187,var_188,var_189,var_190,var_191,var_192,var_193,var_194,var_195,var_196,var_197,var_198,var_199,var_200\r\n'
In [70]:
In [63]:
In [71]:
23919756
In [75]:
In [ ]:
In [4]:
In [5]:
There are also 5 other fields,
case_id - 1 to 20,000, a unique identifier for each row
train - 1/0, this is a flag for the first 250 rows which are the training dataset
Target_Practice - we have provided all 20,000 Targets for this model, so you can develop your method completely off line.
Target_Leaderboard - only 250 Targets are provided. You submit your predictions for the remaining 19,750 to the Kaggle leaderboard.
Target_Evaluate - again only 250 Targets are provided. Those competitors who beat the 'benchmark' on the Leaderboard will be asked to make one further submission for the Evaluation model.
(20000L, 205L)
In [6]:
training: (250L, 200L) (250L,)
testing: (19750L, 200L) (19750L,)
Develop Tim's model
He mentions that the X variables are from a Uniform distribution. Let's investigate this:
In [7]:
In [8]:
50000
looks pretty right
In [127]:
In [128]:
In [129]:
In [130]:
In [132]:
Warning: Stochastic to_include's value is neither numerical nor array with floating-point dtype. Recommend fitting method fmin (default).
In [133]:
In [176]:
[****************100%******************] 100000 of 100000 complete
In [177]:
0.73999999999999999
In [178]:
0.72399999999999998
In [179]:
In [180]:
<matplotlib.colorbar.Colorbar instance at 0x0000000013270208>
In [181]:
<matplotlib.image.AxesImage at 0x19ce2780>
In [183]:
In [184]:
<matplotlib.image.AxesImage at 0x18d8ef60>
In [ ]: