Predictive Modeling of Housing Prices

Charlotte, North Carolina | Spatial Data Analysis

Overview

This project developed a predictive model for housing prices in Charlotte, NC by integrating spatial data on property attributes, neighborhood crime rates, park proximity, and other key features. The model aimed to identify factors influencing home values and uncover spatial patterns that impact property pricing.

Objective

To build a robust predictive model that estimates home prices using spatial and non-spatial predictors and to evaluate spatial autocorrelation patterns in residuals to improve model accuracy.

Methodology

Data Collection and Cleaning

Collected geospatial data on:

Housing attributes (price, size, year built, etc.)
Crime incidents (local crime counts by ZIP code)
Park locations (distance from each property)

Cleaned data by:

Removing extreme outliers using interquartile range (IQR) filtering
Filtering records with incomplete or invalid property characteristics
Aligning all datasets to Charlotte ZIP code boundaries
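
The IQR filtering step can be sketched as follows. This is a minimal illustration in plain Python, not the project's R code; the 1.5 multiplier is the conventional default and is an assumption here, since the write-up does not state the threshold used. The prices are made-up values.

```python
import statistics

def iqr_filter(values, k=1.5):
    """Keep values inside [Q1 - k*IQR, Q3 + k*IQR].

    k=1.5 is an assumed, conventional multiplier, not a documented project setting.
    """
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]

# Hypothetical sale prices; the last value is an extreme outlier.
prices = [210_000, 245_000, 260_000, 275_000, 290_000,
          310_000, 335_000, 4_500_000]
filtered = iqr_filter(prices)
```

In this toy sample only the 4,500,000 record falls outside the fences, so the other seven prices are retained.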

Feature Engineering

Created engineered variables for improved model performance:

Log-transformed home price and home size to address skewness
Interaction terms such as bedrooms x full baths to capture joint effects
Distance to nearest park calculated using spatial distance metrics in R
Encoded categorical variables for heating type, property type, and building grade
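
A rough sketch of the engineered variables above, written in plain Python for illustration (the project itself worked in R). The haversine formula stands in for whatever spatial distance function was actually used, and the field names, coordinates, and kilometre units are all assumptions.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))

def engineer_features(home, parks):
    """Build log, interaction, and nearest-park features for one home."""
    return {
        "log_price": math.log(home["price"]),
        "log_area": math.log(home["area"]),
        # Interaction term: bedrooms x full baths.
        "bed_bath": home["bedrooms"] * home["fullbaths"],
        # Distance to the closest park among all candidates.
        "distance_to_nearest_park": min(
            haversine_km(home["lat"], home["lon"], plat, plon)
            for plat, plon in parks
        ),
    }

# Hypothetical home and park coordinates near central Charlotte.
home = {"price": 350_000, "area": 2_000, "bedrooms": 3, "fullbaths": 2,
        "lat": 35.2271, "lon": -80.8431}
parks = [(35.2300, -80.8400), (35.3000, -80.7500)]
features = engineer_features(home, parks)
```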

Exploratory Data Analysis

Conducted correlation analysis to identify key predictors:

Correlation matrix showing relationships between housing variables

Relationship between property prices and neighborhood crime counts

Relationship between property prices and distance to nearest park

Relationship between property prices and year built

Relationship between property prices and logarithm of property area
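
The correlation analysis behind these plots can be illustrated with a small Pearson correlation matrix. This is a plain-Python sketch, not the project's R code, and the column values are invented purely to show the computation.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

def correlation_matrix(columns):
    """Return a nested dict of pairwise correlations between named columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names}
            for a in names}

# Hypothetical values, not taken from the Charlotte dataset.
data = {
    "log_price": [12.1, 12.4, 12.6, 12.9, 13.2],
    "log_area": [7.0, 7.2, 7.3, 7.6, 7.8],
    "crime_count": [40, 35, 30, 22, 15],
}
corr = correlation_matrix(data)
```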

Model Building and Evaluation

Trained an OLS regression model with selected features:

Key predictors: shape_Area, yearbuilt, fullbaths, crime_count, distance_to_nearest_park, and engineered interaction terms
Performed 10-fold cross-validation for model validation

Observed vs predicted housing prices showing model fit

Log-transformed observed vs predicted values showing improved model fit

Diagnostic plots for the regression model showing residual patterns
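
To make the fit-and-validate loop concrete, here is a minimal plain-Python sketch of OLS with 10-fold cross-validation. The project used caret in R; this version solves the normal equations directly and runs on synthetic data, so the coefficients, seed, and per-fold RMSE metric are assumptions chosen for illustration.

```python
import random

def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_ols(X, y):
    """Least-squares coefficients (intercept first) via the normal equations."""
    Z = [[1.0] + list(row) for row in X]
    k = len(Z[0])
    xtx = [[sum(z[i] * z[j] for z in Z) for j in range(k)] for i in range(k)]
    xty = [sum(z[i] * t for z, t in zip(Z, y)) for i in range(k)]
    return solve(xtx, xty)

def predict(beta, row):
    return beta[0] + sum(b * x for b, x in zip(beta[1:], row))

def cross_validate(X, y, k=10, seed=0):
    """Return the root-mean-squared error on each of k held-out folds."""
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    scores = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        beta = fit_ols([X[i] for i in train], [y[i] for i in train])
        errors = [(y[i] - predict(beta, X[i])) ** 2 for i in fold]
        scores.append((sum(errors) / len(errors)) ** 0.5)
    return scores

# Synthetic data (not the project's): y = 2 + 0.8*x1 - 0.3*x2 + noise.
rng = random.Random(42)
X = [[rng.uniform(6, 9), rng.uniform(0, 5)] for _ in range(200)]
y = [2 + 0.8 * a - 0.3 * b + rng.gauss(0, 0.1) for a, b in X]
beta = fit_ols(X, y)
rmse_per_fold = cross_validate(X, y)
```

Each fold is held out once while the model is refit on the remaining nine, so the spread of `rmse_per_fold` gives a sense of how stable the fit is across subsets.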

Residual Analysis and Spatial Autocorrelation

Spatial distribution of model residuals showing areas of under and over prediction

Moran's I test results showing significant spatial autocorrelation in residuals

Visualized residuals using spatial mapping to identify clustering patterns
Conducted Moran's I test to assess spatial dependence in residuals
Moran's I = 0.214 (p < 0.05), indicating moderate positive spatial autocorrelation: nearby homes tended to have similar residuals
Identified underpredicted homes in central Charlotte and overpredicted values in peripheral areas
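
For reference, the Moran's I statistic itself can be computed in a few lines. The sketch below is plain Python rather than the spdep call used in the project. It computes only the statistic, not the significance test, and the four-location chain with binary adjacency weights is a made-up example.

```python
def morans_i(values, weights):
    """Global Moran's I for values with a spatial weights matrix.

    I = (n / S0) * sum_ij w_ij * d_i * d_j / sum_i d_i^2,
    where d_i is the deviation from the mean and S0 is the sum of all weights.
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    s0 = sum(sum(row) for row in weights)
    num = sum(weights[i][j] * dev[i] * dev[j]
              for i in range(n) for j in range(n))
    den = sum(d * d for d in dev)
    return (n / s0) * (num / den)

# Hypothetical layout: four locations in a row, each adjacent to its neighbours.
chain = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
clustered = morans_i([1, 1, -1, -1], chain)    # similar residuals sit together
alternating = morans_i([1, -1, 1, -1], chain)  # neighbours always differ
```

Clustered residuals give a positive value and alternating residuals a negative one, which is why a positive Moran's I on model residuals points to spatial structure the model has missed.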

Key Findings

Home Size and Building Grade: These were the strongest predictors of higher property values, confirming the importance of property characteristics.
Crime Rates: Areas with higher crime rates showed lower property values, reinforcing the role of safety in home pricing.
Park Proximity: Proximity to parks had a slight positive effect on price, but the effect was weaker than anticipated, suggesting that other neighborhood features matter more.
Year Built: Newer homes generally commanded higher prices, with a clear upward trend for properties built after 1950.
Spatial Bias: Central neighborhoods were frequently underpredicted, while some suburban areas were overestimated.

Spatial Insights

Spatial distribution of home prices across Charlotte neighborhoods

Spatial distribution of home sizes (square footage) across Charlotte

Spatial distribution of crime counts across Charlotte neighborhoods

Distance to nearest park across Charlotte neighborhoods

Higher home prices clustered in suburban regions, where larger properties and newer builds are common.
Lower property values aligned with areas reporting higher crime counts.
The presence of residual spatial patterns suggests additional neighborhood-level characteristics may impact pricing.
Property size (area) showed a strong logarithmic relationship with price, indicating diminishing returns on very large properties.

Challenges and Limitations

Multicollinearity: Variables like home size, bedrooms, and bathrooms had high VIF scores, indicating potential redundancy.
Outlier Sensitivity: The model struggled to predict high-value homes accurately, likely due to non-linear effects not captured in the model.
Unaccounted Neighborhood Factors: Unobserved variables such as school quality, walkability, and transit access may contribute to residual spatial autocorrelation.
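
As a brief illustration of the multicollinearity point, the variance inflation factor for a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on the others. With only two predictors this reduces to 1 / (1 - r²). The plain-Python sketch below uses that two-predictor special case on invented numbers; it is not the project's VIF calculation.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

def vif_two_predictors(x1, x2):
    """VIF when the model has exactly two predictors: 1 / (1 - r^2)."""
    r = pearson(x1, x2)
    return 1.0 / (1.0 - r * r)

# Hypothetical home sizes and bedroom counts that move together.
sqft = [1000, 1500, 2000, 2500, 3000]
bedrooms = [2, 3, 3, 4, 5]
vif_correlated = vif_two_predictors(sqft, bedrooms)

# Two uncorrelated toy predictors give the minimum VIF of 1.
vif_independent = vif_two_predictors([1, 2, 3, 4, 5], [1, -1, 0, -1, 1])
```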

Urban Planning Implications

The model's insights can guide:

Targeted investment strategies in underpredicted central Charlotte neighborhoods where additional amenities may improve property values.
Crime prevention initiatives in areas with historically suppressed property values.
Park improvement efforts in residential areas where green space has stronger correlations with home pricing.
Housing development policies that consider the strong relationship between property age, size, and market value.

Conclusion

This project shows how spatial data and OLS regression can be combined for housing price prediction. The combination of engineered features, distance metrics, and spatial diagnostics provided valuable insights into Charlotte's housing market, informing planners and policymakers about factors driving property values. The correlation analysis revealed complex relationships between housing attributes and prices, while the spatial analysis highlighted neighborhood-level patterns that can guide targeted urban development strategies.

Project Details

Location

Charlotte, North Carolina

Tools Used

R (Version 4.4.2)
sf for spatial data handling
caret for model training
spdep for Moran's I test
ggplot2 for visualization

Model Performance

R²: 0.599

Moran's I: 0.214

Key Variables

Building Grade: Strong +
Home Size: Strong +
Full Baths: Moderate +
Year Built: Moderate +
Crime Count: Moderate -
Park Distance: Weak -

Research Highlights

Data Points

Analyzed over 10,000 property records across the Charlotte metropolitan area

Time Period

Housing data from 2018-2022, providing recent market insights

Key Innovation

Integration of crime data and park proximity metrics with traditional housing variables