Identifying the Optimal Location for a New Business

MIDTERM PRESENTATION

November 13, 2017

W. Dai, A. Srivastava, R. Castellanes, D. First

Columbia University: Y. Garg, A. Mueller

Synergic Partners: G. Ribeiro

Agenda

The Problem and Our Approach
Literature Review
Data Sources
Initial Data Exploration
Goals and Next Steps

THE PROBLEM AND OUR APPROACH

Problem Statement
Our Approach

The objective of this project is to identify the optimal location for opening a new business.

ASSUMPTIONS

Begin with Chinese Restaurants and then expand
Optimize towards maximizing profit
Limited to Manhattan, New York
We want the user to be able to decide for herself what factors to prioritize

PROBLEM STATEMENT

METHODOLOGY

Approach 1: Acquire a dataset of NYC Chinese restaurants with their profitability, then understand the drivers of profitability and model it

Problem: Dataset?

Approach 2: Focus on business considerations that are drivers of profitability: revenue, cost, competition, and closeness to transportation

Example

We do not know the costs of real estate for restaurants, so we will use publicly available real estate prices instead

Recommending a location based on user preference

The user will rank how much they care about each of these factors
For each NYC location area, we derive a score for each of four categories:
- Expected popularity
- Cost
- Competition
- Distance to subways
A location is then selected that scores highest on these factors


Recommending a location based on user preference

The user will rank how much they care about each of these factors
For each restaurant, we assign a static score for each of four categories:
- Profit and Popularity
- Cost
- Competition
- Distance to subways
A location is then selected that scores highest on these factors

Example:

"I want a low-cost restaurant in a popular area, somewhat close to subways. I don't care about competition, because I'll differentiate."

Coefficients
Cost: 1
Popularity: .5
Comp.: 0
Transportation: .5

Location scores (Locations 1-4; Cost, Popularity, Comp., Transportation), as listed on the slide: 1, 0, .5, .75, .25, .30, .60, .15, .80, .20, .15, .40, .60, .80, .90

Total Score
Location 1: 1.325
Location 2: .45
Location 3: 1.2
Location 4: .506

Recommend Location 1
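The recommendation step can be sketched as a weighted sum of per-location scores. This is a minimal illustration, not the project's implementation: the function names and the two sample locations are hypothetical, and it assumes every score is scaled so that higher is better (for example, a high cost score means cheap real estate). Only the coefficients come from the example above.

```python
from typing import Dict

METRICS = ("cost", "popularity", "competition", "transportation")


def total_score(coefficients: Dict[str, float], scores: Dict[str, float]) -> float:
    """Linear combination of the user's coefficients and a location's scores."""
    return sum(coefficients[m] * scores[m] for m in METRICS)


def recommend(coefficients: Dict[str, float],
              locations: Dict[str, Dict[str, float]]) -> str:
    """Return the name of the location with the highest total score."""
    return max(locations, key=lambda name: total_score(coefficients, locations[name]))


# Coefficients from the example: cost matters most, competition not at all.
coefficients = {"cost": 1.0, "popularity": 0.5, "competition": 0.0, "transportation": 0.5}

# Hypothetical per-location scores, invented for illustration only.
locations = {
    "A": {"cost": 0.9, "popularity": 0.4, "competition": 0.2, "transportation": 0.6},
    "B": {"cost": 0.3, "popularity": 0.8, "competition": 0.9, "transportation": 0.8},
}

best = recommend(coefficients, locations)
print(best, total_score(coefficients, locations[best]))
```

Because the competition coefficient is 0, location B's strong competition score does not affect the ranking.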

Literature Review

Unsupervised Learning: Collaborative Filtering

In Eravci et al., the authors divided NYC into neighborhoods, then used collaborative filtering to recommend businesses to new neighborhoods.

Recommendations based on Linear combination of Drivers of Profitability

Two approaches have been taken in the literature

In Khateryna et al., the authors generate a location recommendation for Ukrainian businesses by combining the estimated profits and costs for different locations.

Our Approach

Problem: We want to give more granular recommendations
We want to allow the user to prioritize cost vs. popularity, and the like

We will integrate various publicly available datasets

In the CF approach, which is not in our model, the authors split up NYC into neighborhoods based on an unsupervised clustering method

EXAMPLE: CLUSTERING

DISTANCE

http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=7836791

PROBABILISTIC NEIGHBORHOOD SELECTION

COLLABORATIVE FILTERING

DATA SOURCES

Yelp
Foursquare
NYC Open Data
LoopNet

We integrated various data sources in order to score each location on four metrics

Metric: Dataset

Cost of Real Estate: LoopNet
Expected Popularity: Yelp, Foursquare, Demographic Data (NYC Open Data)
Competition: Yelp, Foursquare
Distance to Subways: NYC Open Data

Yelp

Dataset: ~1,000 Chinese Restaurants in NYC

Fields Collected:

Name
Latitude
Longitude
Price
Rating
Number of Reviews

Foursquare

Dataset: ~600 Chinese Restaurants in NYC

Fields Collected:

Name
Latitude
Longitude
Check-ins
Visitors

NYC OPEN DATA

Datasets Collected:

Population by Zip Code (cut by Age and Sex)
Subway Latitude and Longitude
Bus Stop Latitude and Longitude
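With station coordinates available, a distance-to-subway metric could be computed as the great-circle distance from a candidate location to its nearest station. The sketch below uses the haversine formula; the function names and the sample coordinates are hypothetical, and the source does not state which distance measure the project uses.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometers between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest_station_km(lat: float, lon: float, stations) -> float:
    """Distance from a candidate location to the closest station."""
    return min(haversine_km(lat, lon, s_lat, s_lon) for s_lat, s_lon in stations)


# Hypothetical station coordinates and candidate location, for illustration only.
stations = [(40.7527, -73.9772), (40.7580, -73.9855)]
d = nearest_station_km(40.7549, -73.9840, stations)
print(f"nearest station: {d:.2f} km")
```

The same helper would work for bus stops by passing the bus stop coordinates instead.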

LOOPNET

Background: LoopNet lists commercial real estate, including retail spaces

Parameters: We limited our search to ground-level NYC listings under 2,000 sq. ft.

Vacant Address
Rental Price per Sq. Ft. per Year
Derived latitude and longitude

INITIAL DATA EXPLORATION

Price per Square Foot
Density of Chinese Restaurants
Number of Reviews
Ratings
Initial Profit Score

PRICES ARE HIGHEST IN MARQUEE RETAIL NEIGHBORHOODS

Tribeca and Midtown Manhattan, expensive retail neighborhoods, show the highest price per square foot per year

As expected, Chinatown has the highest density of Chinese restaurants

Other neighborhoods have a significantly lower and consistent density of Chinese Restaurants

Downtown Manhattan's Chinese restaurants seem to be the most popular

Both the number of reviews (Yelp) and the number of check-ins (Foursquare) seem to be highest in lower Manhattan

RATINGS ARE INCONSISTENT ACROSS NEIGHBORHOODS

In general, downtown generated higher ratings than uptown

OUR INITIAL PROFIT SCORE PROJECTS HIGHEST PROFIT ON THE UPPER WEST SIDE

Areas with high density of Chinese Restaurants and fewer reviews show lower profit scores at this point

For our initial model, we weighted each factor equally: expected popularity, expected cost, competition, and distance to subways: Coef = [1, 1, 1, 1]
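One way to combine raw metrics with equal coefficients is to rescale each metric to [0, 1] first, flipping those where lower is better, and then sum. The sketch below assumes min-max scaling, which the source does not specify; the raw values, metric names and helper function are hypothetical.

```python
def min_max(values, higher_is_better=True):
    """Scale values to [0, 1]; flip the scale when lower raw values are better."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.5 for _ in values]  # no spread: give every area the same score
    scaled = [(v - lo) / (hi - lo) for v in values]
    return scaled if higher_is_better else [1.0 - s for s in scaled]


# Hypothetical raw values for three location areas, for illustration only.
raw = {
    "popularity": [120, 300, 60],          # e.g. review counts
    "cost": [150.0, 400.0, 100.0],         # price per square foot per year
    "competition": [5, 20, 2],             # nearby Chinese restaurants
    "subway_distance": [0.2, 0.1, 0.8],    # kilometers to nearest station
}
higher_is_better = {
    "popularity": True,
    "cost": False,
    "competition": False,
    "subway_distance": False,
}

coef = [1, 1, 1, 1]  # equal weights, as in the initial model
metrics = list(raw)
scaled = {m: min_max(raw[m], higher_is_better[m]) for m in metrics}
profit_scores = [
    sum(c * scaled[m][i] for c, m in zip(coef, metrics))
    for i in range(3)
]
print(profit_scores)
```

Changing `coef` to a user-supplied list turns the same computation into the preference-weighted score.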

GOALS AND NEXT STEPS

Define Location Areas
Model Target Scores
Create Web App

DEFINE LOCATION AREAS

In this step, we will cluster locations based on price per square foot per year and distance

Once we see clusters that make sense, we will create their respective shapefiles, which will serve as location areas.
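The clustering method is not named here, so the following is only a sketch under the assumption that k-means is used. Each listing is represented by hypothetical, already-rescaled features (x position, y position, price per square foot per year), and the first k points serve as initial centroids so the result is deterministic.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain k-means: returns a cluster label per point and the final centroids."""
    centroids = [list(p) for p in points[:k]]  # deterministic start: first k points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its closest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids


# Hypothetical listings as (x, y, price), all scaled to [0, 1] for illustration.
listings = [
    (0.0, 0.0, 0.1),
    (0.1, 0.0, 0.2),
    (0.9, 1.0, 0.9),
    (1.0, 0.9, 0.8),
]
labels, centroids = kmeans(listings, k=2)
print(labels)
```

Scaling the features beforehand matters, because otherwise whichever feature has the largest numeric range dominates the distance.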

FINE TUNE TARGET SCORE

We will fine-tune a target score that meaningfully represents estimated profit at the locations we have defined.

Example:

"I want a low-cost restaurant in a popular area, somewhat close to subways. I don't care about competition, because I'll differentiate."

Coefficients (Cost, Popularity, Comp., Transportation): ?

Location scores (Locations 1-4; Cost, Popularity, Comp., Transportation), as listed on the slide: 1, 0, .5, .75, .25, .30, .60, .15, .80, .20, .15, .40, .60, .80, .90

Total Score: ?

Recommend ?

Final output - WEB APP OF RECOMMENDED LOCATION AREAS BASED ON USER INPUT - INPUT PAGE 1

Final output - WEB APP OF RECOMMENDED LOCATION AREAS BASED ON USER INPUT - INPUT PAGE 2

Final output - WEB APP OF RECOMMENDED LOCATION AREAS BASED ON USER INPUT - RESULTS PAGE

APPENDIX

Estimating Revenue

Revenue = # of Customers x $$ per Order

# of Customers: Our target variable. We will need to estimate this with data we have collected.
$$ per Order: Can be assumed constant (e.g. Chinese Restaurants can expect $20 per order)

Estimating COSTS

Total Cost = Fixed Costs + Variable Costs

Variable Costs: Our target variable. The main variable cost is retail space rental, which varies by location.
Fixed Costs: A lot can be assumed to be fixed. For Chinese Restaurants:
- Overhead
- Maintenance
- Size of Restaurant
- Staff
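The revenue and cost relations can be put together in a short worked example. This sketch assumes profit is revenue minus total cost and that rent is the only variable cost; apart from the $20 per order figure, every number below is hypothetical, and the function names are illustrative.

```python
def annual_rent(price_per_sqft_per_year: float, area_sqft: float) -> float:
    """Variable cost: yearly rent for a retail space."""
    return price_per_sqft_per_year * area_sqft


def estimated_profit(customers_per_year: float, dollars_per_order: float,
                     fixed_costs: float, rent: float) -> float:
    """Profit = Revenue - Total Cost, with Total Cost = Fixed Costs + Variable Costs."""
    revenue = customers_per_year * dollars_per_order
    total_cost = fixed_costs + rent
    return revenue - total_cost


# Hypothetical inputs: 1,500 sq. ft. at $150 per sq. ft. per year, 30,000 orders a year.
rent = annual_rent(150.0, 1500.0)
profit = estimated_profit(
    customers_per_year=30_000,
    dollars_per_order=20.0,   # constant per-order value from the revenue assumption
    fixed_costs=300_000.0,    # overhead, maintenance, staff (hypothetical total)
    rent=rent,
)
print(rent, profit)
```

With these inputs the only location-dependent terms are the number of customers and the rent, which is why those are treated as the quantities to estimate.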

MODELING APPROACH

We will look into a sample of the following features that could potentially impact overall profit:

Distance from nearby subways and bus stops
Population by zip code, by age and gender
- Population density
Number of competitor businesses in the vicinity
Estimates of popularity
- Yelp rating
- Foursquare check-ins
And, of course, price per square foot per year
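As one illustration of the competition feature, competitors in the vicinity could be counted within a fixed radius of a candidate location. The sketch below uses an equirectangular distance approximation, which is adequate at city scale; the 0.5 km radius, the function names and the coordinates are assumptions, not values from the project.

```python
import math

EARTH_RADIUS_KM = 6371.0


def approx_distance_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Equirectangular approximation of distance, suitable for short distances."""
    x = math.radians(lon2 - lon1) * math.cos(math.radians((lat1 + lat2) / 2))
    y = math.radians(lat2 - lat1)
    return EARTH_RADIUS_KM * math.hypot(x, y)


def competitors_within(lat: float, lon: float, restaurants, radius_km: float = 0.5) -> int:
    """Count restaurants whose distance to the candidate location is within the radius."""
    return sum(
        1
        for r_lat, r_lon in restaurants
        if approx_distance_km(lat, lon, r_lat, r_lon) <= radius_km
    )


# Hypothetical restaurant coordinates: two nearby, one several kilometers away.
restaurants = [(40.7160, -73.9975), (40.7170, -73.9960), (40.7800, -73.9800)]
count = competitors_within(40.7158, -73.9970, restaurants)
print(count)
```

The radius is a tuning choice: a larger value smooths the feature across neighborhoods, a smaller one makes it more local.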

Zillow

Dataset: All neighborhoods and zip codes in NYC

Fields Collected:

Neighborhood
Zip Code
Zestimate rank (gives an idea of home price)
