Predicting Prices for House Shares using Deep Convolutional Neural Networks

Adam Lesnikowski


What is the price of a housing share, given only one image?


A primary motivation for this problem is to explore how well deep convolutional neural networks trained on one kind of visual data set perform on images of an unseen type. A secondary goal is to have a convolutional neural network whose weights are initialized on a dataset like ImageNet also work on a heterogeneous, messy, real-world dataset like the one that we collected from webpages. A commercial application of our approach is building a data-driven computer vision system that can algorithmically detect over-priced or under-priced assets. The generalizability of neural networks, the model that we use in our approach, means that the model we build would be amenable not only to house share prices, but also to houses, cars, financial assets, internet products, or anything that has a price and some relatively rich image, audio or time series data connected to it.


A deep convolutional neural network was chosen as the price predictor for this problem. This choice was motivated by the following considerations. First, there is an abundant source of labelled data available for the task. We collected more than 100K labelled images for our training set, and more are available. Deep neural networks have been shown to achieve top performance against other models, provided that enough training data is available to train them and avoid overfitting. Sufficient data is crucially important because these networks often contain millions of parameters, and regularization techniques such as dropout and data set augmentation so far lack substantive theoretical guarantees. Second is the ongoing promise of neural networks to eliminate the need for hand-engineered features. For the price regression problem, it is not at all clear which features of a photo are most important for price. Hence we sidestep the problem of feature selection by collecting enough data for a deep convolutional network to automatically find the right features for our price prediction task.


We collected a dataset of 117,746 user-submitted photos from a popular house share website. The dataset covers six large cities from the U.S. and Canada, and three additional cities from Europe. The number of photos ranged from just over 28,000 for New York City and London to roughly 2,500 for Boston and Washington D.C. The minimum dimension of the collected images was 600 pixels on a side. Images of this size allow for a variety of data augmentation and multi-scale training techniques, such as taking various crops of the training images, as has been described in the literature. The prices of house shares in New York were found to have extreme outliers, ranging from $30 per night up to an $8000 per night penthouse overlooking Central Park. Hence we prepared the data set by filtering out prices more than three standard deviations from the mean price of $170 per night.
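
The three-standard-deviation filter described above can be sketched as follows. This is a minimal illustration, not our actual preprocessing code: the function name, the use of the population standard deviation, and the sample prices are all assumptions made for the example.

```python
from statistics import mean, pstdev


def filter_price_outliers(prices, num_std=3.0):
    """Keep only prices within num_std standard deviations of the mean."""
    mu = mean(prices)
    sigma = pstdev(prices)  # population standard deviation (an assumption)
    lower, upper = mu - num_std * sigma, mu + num_std * sigma
    return [p for p in prices if lower <= p <= upper]


# Made-up nightly prices in dollars, not taken from the collected dataset.
nightly_prices = [
    30, 60, 75, 90, 95, 110, 120, 130, 140, 150,
    160, 170, 175, 185, 200, 220, 240, 260, 310, 400,
    8000,  # an extreme listing far above the rest
]
kept_prices = filter_price_outliers(nightly_prices)
print(len(nightly_prices), "listings before filtering,", len(kept_prices), "after")
```

Note that with very few listings a single extreme price inflates the standard deviation enough to escape the filter, so a rule like this only behaves sensibly on a reasonably large sample.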


For our initial set of experiments, we used both house and apartment shares in New York City. We used a variant of AlexNet, also known as "SuperVision", a deep convolutional neural network, together with Caffe, the deep learning package from the Berkeley Vision and Learning Center. This network was trained on ImageNet, a large collection of images in one thousand categories. Convolutional neural networks trained on one dataset have been shown to generalize well to categories of images not seen in training, hence we believe using this pretrained network to be a mild simplifying step to take for our run of experiments.

From Classification Networks to Regressors

AlexNet was trained to perform classification, and has 1000 output nodes, one for each of the ImageNet categories. Given this, how do we turn it into a regressor, that is, a model that gives us a real-valued price prediction? We used a technique that has been used before on similar kinds of problems: sample intermediate outputs of the network and train an auxiliary regressor to output a price. In particular, we ran a forward pass of our modified AlexNet on each of our 28,000 New York images, and took the 58 six-by-six pixel convolutional activations that are computed just before the set of fully connected layers. These are called the deep features. We note that these deep features were about 170KB each, larger than the typical input image size of 50-70KB. We then trained a Support Vector Regressor (SVR) on these deep features as the final step of our regressor.
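
As a rough sketch of the step between the network and the regressor, the convolutional activations for one image can be flattened into a single feature vector before being handed to the SVR. The code below uses random numbers as a stand-in for real activations; the function name and the nested-list representation are assumptions for illustration, and a real pipeline would read the values from the network's forward pass instead.

```python
import random

# Activation shape as described in the text: 58 maps of six by six values.
NUM_MAPS, HEIGHT, WIDTH = 58, 6, 6


def flatten_activations(activations):
    """Flatten a maps x height x width nested list into one feature vector."""
    return [value for fmap in activations for row in fmap for value in row]


# Stand-in for the output of a forward pass on one image (random values).
random.seed(0)
fake_activations = [
    [[random.random() for _ in range(WIDTH)] for _ in range(HEIGHT)]
    for _ in range(NUM_MAPS)
]

deep_feature = flatten_activations(fake_activations)
print("feature vector length:", len(deep_feature))
```

Each image then contributes one such vector as a row of the regressor's training matrix, paired with its nightly price as the target.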

Support Vector Regressors

Support vector regressors, or SVRs, are a type of regressor similar to their better-known cousins, support vector machines, or SVMs. An SVR with a linear kernel is essentially linear regression with a linear loss instead of a squared loss, together with a tunable epsilon parameter that disregards regression errors within that epsilon range. For instance, we can set epsilon so that we do not penalize predictions that are within one dollar, or whichever other amount, of the correct price. The motivation for an SVR, we would argue, is that a linear loss tracks what a price predictor should be doing better than a squared loss does. The psychology of price prediction suggests to us that a price that is off by $20 is twice, not four times, as bad as a price that is off by $10.

We performed a train-test split of 8000 training images and 1000 test images. SVRs with increasingly flexible regression curves were tried: a linear kernel, then polynomial kernels of degree 2, 4, 8, and 16, and also a radial basis function (RBF) kernel. The linear kernel was found to perform nearly as well as its more flexible cousins with the advantage of faster training times, hence for our initial experiments only linear kernels were explored further. Parameter fitting for epsilon and C, the data-fitting term for the SVR, was performed using cross-validation and a grid search over C between 10^-10 and 10^10 and epsilon between 0 and 25. We noticed a dramatic slowdown in training our linear SVRs with more than 8000 training examples of deep features, which we suspect to be caused by a memory bottleneck somewhere in our pipeline. The metric we optimized for was mean absolute error, or MAE, rather than mean squared error or a similar metric.
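
The contrast between the epsilon-insensitive linear loss and the squared loss can be made concrete with a small sketch. The function names and the example prices are illustrative assumptions; the sketch shows only the per-example loss, not the full SVR training objective, which also includes a regularization term weighted against C.

```python
def epsilon_insensitive_loss(predicted, actual, epsilon=1.0):
    """Linear loss that ignores errors of at most epsilon dollars."""
    return max(0.0, abs(predicted - actual) - epsilon)


def squared_loss(predicted, actual):
    """Ordinary squared-error loss, shown for comparison."""
    return (predicted - actual) ** 2


actual_price = 150.0  # made-up nightly price in dollars

# With epsilon = 0 the linear loss doubles when the error doubles,
# while the squared loss quadruples.
for error in (10.0, 20.0):
    predicted_price = actual_price + error
    print(
        "error:", error,
        "linear loss:", epsilon_insensitive_loss(predicted_price, actual_price, epsilon=0.0),
        "squared loss:", squared_loss(predicted_price, actual_price),
    )

# With epsilon = 1, a prediction within one dollar is not penalized at all.
print(epsilon_insensitive_loss(150.5, actual_price, epsilon=1.0))
```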


The optimal linear kernel SVR had a mean absolute error, or MAE, of $68.26 on the 1000 test images. This predictor also correctly classified 64% of house shares as either above or below the mean price. We found this to be an extremely encouraging performance for our initial run of experiments.
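
For concreteness, the two metrics reported above can be computed along the following lines. This is a sketch under stated assumptions: the function names and prices are made up, and we assume here that a listing counts as correctly classified when its predicted and actual prices fall on the same side of the mean price, with the mean taken over the actual prices in the example.

```python
def mean_absolute_error(predicted, actual):
    """Average absolute difference between predicted and actual prices."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def above_below_accuracy(predicted, actual, threshold):
    """Fraction of listings whose predicted and actual prices fall on the
    same side of the threshold price."""
    hits = sum((p > threshold) == (a > threshold) for p, a in zip(predicted, actual))
    return hits / len(actual)


# Made-up nightly prices in dollars, for illustration only.
actual_prices = [80.0, 120.0, 170.0, 250.0]
predicted_prices = [100.0, 180.0, 160.0, 230.0]

mean_price = sum(actual_prices) / len(actual_prices)  # 155.0
mae = mean_absolute_error(predicted_prices, actual_prices)
accuracy = above_below_accuracy(predicted_prices, actual_prices, mean_price)
print("MAE:", mae, "above/below accuracy:", accuracy)
```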

Further Steps

For future experiments, we plan to train the SVR on all the New York data, and then on all 118K images to make maximal use of the training data. In the latter case of cross-city training, the regressor would predict z-scores instead of prices, which can be calibrated to a local market by fitting a distribution to the local house share prices. More careful parameter selection for the SVRs and more trials with the more flexible polynomial and RBF kernel regressors in the larger data set case are expected to lead to improved performance. Another data set on which to run the same experiments is the pictures of room shares instead of whole house or apartment shares. Calculating deep features through other, more recent deep convolutional neural networks such as VGG or various iterations of GoogLeNet would be extremely interesting and will likely improve results. Fine-tuning the network on our labelled dataset, and then training an SVR on these fine-tuned deep features, is another intriguing further direction.
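
One way the z-score calibration for cross-city training might look is sketched below. Everything here is an assumption for illustration: we summarize each local market by the mean and population standard deviation of its prices, the helper names and city labels are hypothetical, and the prices are made up. Other choices of fitted distribution are possible.

```python
from statistics import mean, pstdev


def fit_market(prices):
    """Summarize a local market by the mean and standard deviation of its prices."""
    return mean(prices), pstdev(prices)


def price_to_zscore(price, market):
    """Express a price relative to its local market."""
    mu, sigma = market
    return (price - mu) / sigma


def zscore_to_price(zscore, market):
    """Map a predicted z-score back to a price in the given market."""
    mu, sigma = market
    return mu + zscore * sigma


# Hypothetical markets with made-up nightly prices in dollars.
markets = {
    "city_a": fit_market([100.0, 150.0, 200.0]),
    "city_b": fit_market([60.0, 80.0, 100.0]),
}

# A listing priced at $200 in city_a, re-expressed at the same relative
# position in city_b's price distribution.
z = price_to_zscore(200.0, markets["city_a"])
equivalent_price = zscore_to_price(z, markets["city_b"])
print("z-score:", round(z, 3), "equivalent city_b price:", round(equivalent_price, 2))
```

Under this scheme the regressor's training targets would be the z-scores, and the final conversion back to dollars would use the statistics of whichever city the listing belongs to.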



Sampling a neural network. Distribution of activation levels of a deep feature computed from a test image. The x axis is intensity, the y axis is the number of neurons with that activation level.


Questions, access to our data, comments, more info, or latest results? Email me at first name dot last name at gmail, adam at math dot berkeley dot edu, or visit my website at