Killin' It at the Box Office

What Factors Drive Movie Revenue?

Five Things You Should Know

“It’s no trick to make a lot of money, if what you want to do is make a lot of money.” - Mr. Bernstein, Citizen Kane

Having explored what it takes to clinch that Oscar trophy, Team Nuns decided to pursue another question that might interest the movie exec: What can be done to maximize a movie’s sales? After scouring public data sources for a range of information, we arrived at some important takeaways:

Budget's Not Gonna Help

“I like working on big budget films.” ― Freida Pinto

When it comes to movie revenue, there’s nothing more straightforward than putting money into the product. Conventional wisdom holds that the dollars spent on a movie are the clearest indicator of its overall success. However, the Nuns model suggests that, once other factors have been controlled for, a movie makes only 15 cents for every dollar spent in the budget. Before movie houses rush into throwing money at the problem, they should be warned that other elements of the production and release have more impact. So while you’re not making the next billion-dollar movie without a lot of cash backing it, you could stand to make an 85% loss if you’re not careful.


Image credit: blogspot

Timing, timing, timing

Another piece of conventional wisdom posits that releasing around the holiday season is a big bet for many in Tinseltown. This time, the prevailing philosophy wins out, with a December launch associated with much stronger results than equivalent movies achieve during the rest of the year. In fact, a movie released in December is likely to make $20M more than an otherwise identical one released at another time of the year. Unsurprisingly, the 'wishy-washy' period of January-May is associated with the lowest gross revenues, even when controlling to ensure like-for-like movies.


Image credit: quickmeme

Mass Appeal

If there's anything Pixar has taught us over the last few years, it's that making movies that kids enjoy but that also sing to adults is a formula for the big bucks. Movies that have a G or PG rating enjoy a nearly $18M advantage, all else equal, over their PG-13 or R-rated counterparts. Sounds like it pays to focus on the feel-good family fun, and perhaps limit the release of films with mind-bending grotesquery (like Sharknado).


Image credit: tumblr

The Opening Weekend

“The stakes are high on every film now because there's the opening weekend. The first week is extremely crucial... People are going berserk promoting their films.” ― Vidya Balan

It seems Ms. Balan was right on the money: our analysis bears out that the opening weekend’s sales are a strong indicator of how the movie will perform. We wouldn’t draw any causal conclusions here, but for every dollar made during the first four days of release, a movie is likely to make $2.70 over its entire run. That seems like a good enough reason to push for that marketing blitz.


Image credit: tumblr

Finally, the Movie Quality

“I'd rather make a show 100 people need to see, than a show that 1000 people want to see.” ― Joss Whedon

Oh, and yes, about how good the movie is. While the quality of a movie is open to interpretation, we relied on the masses and pulled the IMDb rating (with thousands of IMDb-ans contributing) to give us a sense of how a movie’s perceived calibre influenced its box office sales. The answer: very little. Moving up an entire point on IMDb’s rating scale (no mean feat given the internet populace’s exacting standards) yields only an extra $10 million. It is important to note, however, that this holds while simultaneously considering all the other factors. Movies with a huge budget, the right opening weekend strategy, and an appropriately targeted audience tend to be pretty decent, leaving little room for differentiation. But that means you don’t need the next Citizen Kane for folks to cough up the dough. Just ask Joss Whedon, who despite his insistence above directed the thoroughly average, but box-office-smashing, Avengers 2.


Image credit: giphy

How did we do it?

Background & Motivation

While the prestige of an Oscar lends a unique importance to the award in Tinseltown, what most movie production houses truly care about is the revenue a movie brings in. This is with good reason, since movies are expensive to make! We were able to uncover a substantial amount of public information about the revenue grossed by the top 250 movies annually for the last few years, along with features of the movies themselves, and decided to put it together and explore whether there were commonalities among those that performed better at the box office. Questions we explicitly addressed included:

  • Movie Budget: Wealth begets wealth, goes the old adage. For studios looking to make a splash, how important are the investment dollars put against a production?
  • Opening Weekend: Film industry experts love the 'Opening Weekend' phenomenon, but just how much do the first four days of a movie's release truly matter?
  • Opening Theaters: Does the scale of release during the opening weekend affect a movie?
  • IMDb rating: Does how 'good' the movie is actually matter? We leverage the IMDb rating as a proxy for the 'inherent level of quality' of the movie.
  • Seasonality: How does a release around Christmas or the summer vacation affect the outcomes of a movie?
  • MPAA Rating: For the kids, the parents, or the young adults?
  • Power Studios: Do movie production houses like Warner Bros. or Universal have a power of their own, besides the factors listed above?

Data Gathering and Processing

The data was collected from a variety of sources, with the three primary ones being:

  • Box Office Mojo: A repository of domestic and international sales by release year, for all movies that would be worth talking about. The annual lists stretch from 1985 to the present date, and datatables include opening weekend gross revenue, the # of opening theaters, release and close dates.
  • This simple but effective website contains a list of movies and their estimated budgets for the last decade.
  • IMDb Database: IMDb provides access to a number of features at both the movie level and the person level (actors, actresses, directors, etc.). While this data is available as large datasets through an external FTP portal, an enterprising group of movie analysts has put together the ‘IMDbPY’ package, which provides a number of classes for accessing the data more directly through a Python interface.

Scraping the first two websites relied on methods similar to those learned during Homework 1, and inserting the results into data dictionaries was relatively straightforward. The challenge emerged when we attempted to combine these external data sources with the IMDb information. To avoid having to match the data by hand, we needed to find keys that would match records across the data sources while handling as many edge cases as possible. In similar fashion to the Oscar Prediction analysis, we ran the data through a series of helper functions and transformations to arrive at a combined dataset. Details can be found in the Box Office Scraper notebook provided.
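To give a flavor of the matching problem, here is a minimal sketch of joining two sources on a normalized (title, year) key. It is an illustration only, not the code from the Box Office Scraper notebook: the helper names `make_key` and `merge_sources`, the field names, and the sample rows (fictional titles, made-up values) are all assumptions.

```python
import re
import unicodedata


def make_key(title, year):
    """Build a normalized (title, year) join key (hypothetical helper)."""
    # Strip accents so "Café" and "Cafe" produce the same key.
    text = unicodedata.normalize("NFKD", title)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = text.lower().replace("&", " and ").strip()
    # Move a trailing article to the front: "example, the" -> "the example".
    m = re.match(r"^(.*),\s*(the|a|an)$", text)
    if m:
        text = f"{m.group(2)} {m.group(1)}"
    # Drop punctuation and collapse runs of whitespace.
    text = re.sub(r"[^a-z0-9 ]", " ", text)
    return (" ".join(text.split()), int(year))


def merge_sources(box_office, imdb):
    """Join two lists of dicts on the key; also return rows with no match."""
    imdb_by_key = {make_key(r["title"], r["year"]): r for r in imdb}
    merged, unmatched = [], []
    for row in box_office:
        match = imdb_by_key.get(make_key(row["title"], row["year"]))
        if match is None:
            unmatched.append(row)  # left over for manual review
        else:
            merged.append({**match, **row})  # box-office fields win on conflict
    return merged, unmatched


# Fictional titles and made-up values, purely for illustration.
box_office = [
    {"title": "Example, The", "year": 2012, "gross_musd": 100.0},
    {"title": "Café Story", "year": 2010, "gross_musd": 20.0},
    {"title": "Cats & Dogs Forever", "year": 2011, "gross_musd": 55.0},
    {"title": "Unlisted Film", "year": 2013, "gross_musd": 5.0},
]
imdb = [
    {"title": "The Example", "year": 2012, "rating": 7.0},
    {"title": "Cafe Story", "year": 2010, "rating": 6.5},
    {"title": "Cats and Dogs Forever", "year": 2011, "rating": 5.0},
]

merged, unmatched = merge_sources(box_office, imdb)
print(len(merged), "matched;", len(unmatched), "unmatched")
```

Including the year in the key keeps remakes and same-titled films apart, and returning the unmatched rows makes it easy to see which edge cases still need a rule of their own.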

Exploratory Data Analysis

Prior to determining the appropriate modeling techniques, we sought to explore the data received from these disparate sources. Given the continuous nature of our outcome variable - the gross revenue from a movie - we produced a series of scatter plots to identify key relationships.

Gross Revenue vs. Budget

As expected, there is a visible linear relationship between the gross revenue of a movie and the budget that was put into it. However, as we see in the plot above, the spread in revenue widens as budget and revenue both increase. This may result in budget being a positive predictor, but one with lower significance.

Gross Revenue vs. Rating

We leverage the IMDb star rating as our closest proxy for the true 'quality' of a movie. Based on the plot above, the flat nature of the linear relationship indicates that rating may end up mattering less than we might believe.

Modeling and Results

The analysis involved for these questions is markedly different from that employed for the Oscar prediction scenario. Here, we are less interested in classification or the prediction of groups, and more concerned with describing the factors that are associated with higher revenues. This is driven both by the nature of the data (having a continuous outcome variable) and by the philosophy behind the question. We are not looking to find a way to directly target a customer or an activity, but to inform a movie executive’s strategy in thinking about the business of a movie.

To that end, we employ a multiple linear regression model, with the gross revenue as the outcome determined by the variables indicated under the ‘objectives’ page. Before engaging in any Machine Learning hyperparameter optimization, our model building process had to determine which variables to include in the regression. To do so we used the OLS class from statsmodels.api and began with a simple model that included budget, the number of opening theaters, and the season of release.
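As a rough illustration of that starting model, the sketch below fits ordinary least squares on budget, opening theaters, and dummy-coded season, leaving December out as the reference level. It is a stand-in rather than the project's code: it uses NumPy's least-squares solver instead of statsmodels, the data is synthetic, and the season bucket names and "true" effects are assumptions chosen only to generate a toy outcome.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Synthetic, made-up data (gross and budget in $M); not the project's dataset.
budget = rng.uniform(5, 200, n)
theaters = rng.uniform(500, 4000, n)
season = rng.choice(["dec", "jan_may", "summer", "fall"], size=n)

# Dummy-code season, omitting December so it becomes the holdout level.
# Each season coefficient then reads as "difference versus December".
levels = ["jan_may", "summer", "fall"]  # assumed bucket names
dummies = np.column_stack([(season == lv).astype(float) for lv in levels])

# Design matrix: intercept, budget, opening theaters, season dummies.
X = np.column_stack([np.ones(n), budget, theaters, dummies])

# Assumed effects, used only to simulate the outcome for this toy example.
true_coef = np.array([30.0, 0.3, 0.01, -25.0, -10.0, -15.0])
gross = X @ true_coef + rng.normal(0, 5, n)

# Ordinary least squares: choose b to minimize ||X b - gross||^2.
coef, *_ = np.linalg.lstsq(X, gross, rcond=None)

names = ["intercept", "budget", "opening_theaters"]
names += [f"season_{lv}" for lv in levels]
for name, b in zip(names, coef):
    print(f"{name:>18}: {b: .3f}")
```

Because December is the omitted category, negative coefficients on every remaining season dummy mean December releases are associated with the highest gross, all else equal.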

As the plot above shows, this rudimentary model does not account for enough factors to fit the data well. The blue line represents the best fit line based on the factors we provided, and the green line represents the lowess line for the data using these factors. The lowess line is a non-parametric curve that fits a weighted regression locally around each point using its nearest neighbors, much like a k-nn methodology, and its divergence from the best fit line indicates the relatively poor fit of the model.

After a series of additional steps, we arrived at a model that also incorporates the opening weekend gross sales, MPAA rating, and IMDb rating, and removes the opening theaters variable.

This model achieves a much better fit of the data, as indicated by the convergence of the blue and green lines above. With this in mind, we recreated the model in sklearn in order to be able to tune the hyperparameters and achieve a tighter model fit. We invoked the Lasso procedure and wrote a function to find the optimal tuning parameters. The score function used was mean squared error, which suits the regression used in the analysis. Even after the optimization, the final results of the model yielded coefficients that were very similar to those from the OLS in statsmodels. The final coefficient tables were:

Conclusion and Next Steps

The results of the model provided interesting outputs that are worth discussing:

  • The 'budget' variable is both statistically significant and meaningful. The coefficient of 0.30 indicates that every dollar in the budget is associated with 30 cents of revenue. This means that simply throwing money at a movie in the hopes that it works is not a wise strategy, and the other factors in the model are very important in ensuring that a movie does well financially.
  • The coefficients on all the season variables are negative, which means that the holdout month - December - appears to be associated with the best results for a movie's release, with movies released during that month having an average 20M dollar boost compared to other seasons.
  • The movie's MPAA rating also seems to matter, with both PG-13 and R-rated movies performing worse than their G-rated counterparts. The magnitude of the difference, controlling for all other factors, is about 18M dollars.
  • The gross during the opening weekend is highly associated with the amount that the movie ends up making. Its coefficient of 2.7 emphasizes the need to start off with a bang.
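To make these coefficients concrete, the sketch below compares the predicted gross of two hypothetical movies using the figures quoted in the list above. Two simplifications are assumptions on our part: the separate season dummies are collapsed into a single "not December" indicator worth about -20M dollars, and PG-13 and R are collapsed into a single indicator worth about -18M dollars. Since the intercept is not quoted here, only differences between movies are computed; the intercept cancels.

```python
# Coefficients as quoted in the list above (revenue in dollars).
COEF = {
    "budget": 0.30,          # per dollar of budget
    "opening_weekend": 2.7,  # per dollar of opening-weekend gross
    "non_december": -20e6,   # assumed single indicator for all other seasons
    "pg13_or_r": -18e6,      # assumed single indicator for PG-13 or R
}


def predicted_difference(a, b):
    """Predicted gross of movie a minus movie b under the linear model."""
    return sum(COEF[k] * (a[k] - b[k]) for k in COEF)


# Two hypothetical movies with identical budget and opening weekend.
family_december = {
    "budget": 50e6, "opening_weekend": 10e6,
    "non_december": 0, "pg13_or_r": 0,
}
r_rated_spring = {
    "budget": 50e6, "opening_weekend": 10e6,
    "non_december": 1, "pg13_or_r": 1,
}

gap = predicted_difference(family_december, r_rated_spring)
print(f"Timing + rating gap: ${gap / 1e6:.0f}M")

# Same movie, but with an extra $10M opening weekend versus an extra $10M budget.
bigger_opening = dict(r_rated_spring, opening_weekend=20e6)
bigger_budget = dict(r_rated_spring, budget=60e6)
print(f"Extra $10M opening: ${predicted_difference(bigger_opening, r_rated_spring) / 1e6:.0f}M")
print(f"Extra $10M budget:  ${predicted_difference(bigger_budget, r_rated_spring) / 1e6:.0f}M")
```

Under these quoted figures, the timing and rating effects together are worth roughly 38M dollars, while an extra 10M dollars of opening-weekend gross is associated with about nine times as much additional revenue as an extra 10M dollars of budget.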

There are many interesting directions in which this project can be continued:

  • Further pre-processing to allow for better data matches between disparate datasets, including information about the cast itself, or factors that may be more atypical.
  • Incorporation of features beyond those included here, such as a more detailed breakdown of budget specifics (e.g. marketing budget vs. cast pay vs. special effects).

For even more details, see our full work on GitHub!