Having explored what it takes to clinch that Oscar trophy, Team Nuns decided to pursue another question that might interest the movie exec: What can be done to maximize a movie’s sales? After scouring public data sources for a range of information, here are some important takeaways:
When it comes to movie revenue, there’s nothing more straightforward than putting money into the product. Conventional wisdom would indicate that dollars spent against a movie are the clearest indicator of overall success. However, the Nuns model suggests that, once other factors have been controlled for, a movie makes only 15 cents for every dollar spent in the budget. Before movie houses rush into throwing money at the problem, they should be warned that other elements of the production and release are more impactful. So while you’re not making the next billion dollar movie without a lot of cash backing it, you could stand to make an 85% loss if you’re not careful.
Image credit: blogspot
Another piece of conventional wisdom posits that releasing around the holiday season is a big bet for many in tinseltown. This time, the prevailing philosophy wins out, with a December launch indicating much stronger results than equivalent movies released during the rest of the year. In fact, a movie released in December is likely to make $20M more than an otherwise identical one released at another time of the year. Unsurprisingly, the 'wishy-washy' period of January-May is associated with the lowest gross revenues, even when controlling to ensure like-for-like movies.
Image credit: quickmeme
If there's anything Pixar has taught us over the last few years, it's that making movies that kids enjoy but that also sing to adults is a formula for the big bucks. Movies that have a G or PG rating enjoy a nearly $18M advantage, all else equal, over their PG-13 or R-rated counterparts. Sounds like it pays to focus on the feel-good family fun, and perhaps limit the release of films with mind-bending grotesquery (like Sharknado).
Image credit: tumblr
It seems Ms. Balan was right on the money: our analysis bears out that opening weekend sales are a strong indicator of how the movie will perform. We wouldn’t draw any causal conclusions here, but for every dollar made during the first four days of release, a movie is likely to make $2.70 over its entire run. That seems like a good enough reason to push for that marketing blitz.
Image credit: tumblr
Oh, and yes, about how good the movie is. While determining the quality of a movie is open to interpretation, we relied on the masses and pulled the IMDb rating (with thousands of IMDb-ans contributing) to give us a sense of how a movie’s perceived calibre influenced its box office sales. The answer: very little. Moving up an entire point on IMDb’s rating scale (no mean feat given the internet populace’s exacting standards) yields only an extra $10 million. It is important to note, however, that this holds while simultaneously considering all the other factors in the model. Movies with a huge budget, the right opening weekend strategy, and an appropriately targeted audience tend to be pretty decent, leaving little room for differentiation. But that means you don’t need the next Citizen Kane for folks to cough up the dough. Just ask Joss Whedon, who despite his insistence above directed the thoroughly average, but box-office-smashing, Avengers 2.
Image credit: giphy
While the prestige of an Oscar lends a unique importance to the award in tinseltown, what most movie production houses truly care about is the revenue value of a movie. This is with good reason, since movies are expensive to make! We were able to uncover a substantial amount of public information about the revenue grossed by the top 250 movies annually for the last few years, along with features about the movies themselves, and decided to put it together and explore whether there were commonalities among those that performed better at the Box Office. Questions we explicitly addressed included:
The data was collected from a variety of sources, with the three primary ones being:
Prior to determining the appropriate modeling techniques, we sought to explore the data received from these disparate sources. Given the continuous nature of our outcome variable - the gross revenue from a movie - we produced a series of scatter plots to identify key relationships.
As expected, there is a visible linear relationship between the gross revenue of the movie and the budget that was put into it. However, as we see in the plot above, the spread of revenue around that trend widens as budget increases. This may result in budget being a positive predictor but one with lower significance.
We leverage the IMDb star rating as our closest proxy for the true 'quality' of a movie. Based on the plot above, the flat nature of the linear relationship indicates that rating may end up mattering less than we might believe.
The analysis involved for these questions is markedly different from that employed for the Oscar prediction scenario. Here, we are less interested in classification or the prediction of groups, and more concerned with describing the factors that are associated with higher revenues. This is driven not only by the nature of the data (having a continuous outcome variable), but also by the philosophy behind the question. We are not looking to find a way to directly target a customer or an activity, but to inform a movie executive’s strategy in thinking about the business of a movie.
To that end, we employ a multiple linear regression model, with the gross revenue as the outcome determined by the variables indicated under the ‘objectives’ page. Before engaging in any machine learning hyperparameter optimization, our model building process first had to determine which variables to include in the regression. To do so, we used OLS from statsmodels.api and began with a simple model that included budget, the number of opening theaters, and the season of release.
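A minimal sketch of what such a starting model might look like is shown here. It is not our actual code: the data is synthetic, and the column names (`budget`, `opening_theaters`, `season`, `gross`) and season labels are hypothetical stand-ins. It uses the statsmodels formula interface, which handles the categorical season variable through `C(...)`.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-in data; names and values are illustrative assumptions,
# not the movie dataset described in this write-up.
movies = pd.DataFrame({
    "budget": rng.uniform(5e6, 2e8, n),
    "opening_theaters": rng.integers(500, 4500, n),
    "season": rng.choice(["jan_may", "summer", "fall", "december"], n),
})
movies["gross"] = (
    0.5 * movies["budget"]
    + 10_000 * movies["opening_theaters"]
    + rng.normal(0, 2e7, n)
)

# Ordinary least squares: gross revenue explained by budget,
# opening theaters, and a dummy-coded season of release.
model = smf.ols(
    "gross ~ budget + opening_theaters + C(season)", data=movies
).fit()

print(model.params)      # one coefficient per term, plus the intercept
print(model.rsquared)    # share of variance explained
```

Each coefficient is read as the change in gross revenue associated with a one-unit change in that variable, holding the others fixed, which is the "all else equal" framing used throughout this page.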
As evidenced by the plot above, this rudimentary model does not account for enough factors to fit the data well. The blue line represents the best fit line based on the factors we provided, and the green line represents the lowess line for the data using these factors. The lowess line is a non-parametric curve that fits weighted local regressions over each point's nearest neighbors, and its divergence from the best fit line indicates the relatively poor fit of the model.
After a series of additional steps, we arrived at a model that also incorporates the opening weekend gross sales, MPAA rating, and IMDb rating, and removes the opening theaters variable.
This model achieved a much better fit of the data, as indicated by the convergence of the blue and green lines above. With this in mind, we recreated this model in sklearn in order to tune the hyperparameters and achieve a tighter model fit. We invoked the Lasso procedure, and wrote a function to find the optimal tuning parameters. The scoring function used was mean squared error, which suits the regression setting of the analysis. Even after the optimization, the final model yielded coefficients that were very similar to those from the OLS in statsmodels. The final coefficient tables were:
The model's results include several outputs that are worth discussing:
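The tuning step can be sketched roughly as follows. This is not the function we wrote: it swaps in sklearn's `GridSearchCV` as a stand-in, and the synthetic data, the `alpha` grid, the 5-fold split, and the feature scaling step are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)

# Synthetic data: five features, two of which truly have no effect
X = rng.normal(size=(200, 5))
true_coefs = np.array([3.0, 0.0, -2.0, 0.0, 1.0])
y = X @ true_coefs + rng.normal(0, 0.5, 200)

# Scale features so the Lasso penalty treats them comparably
pipeline = make_pipeline(StandardScaler(), Lasso(max_iter=10_000))

# Cross-validated search over the penalty strength, scored by
# (negated) mean squared error, since sklearn maximizes scores
search = GridSearchCV(
    pipeline,
    param_grid={"lasso__alpha": [0.001, 0.01, 0.1, 1.0, 10.0]},
    scoring="neg_mean_squared_error",
    cv=5,
)
search.fit(X, y)

best_alpha = search.best_params_["lasso__alpha"]
coefs = search.best_estimator_.named_steps["lasso"].coef_
print("best alpha:", best_alpha)
print("coefficients:", coefs)
```

When the selected `alpha` is small, the penalty barely shrinks the coefficients, so they land close to the plain OLS estimates, which is consistent with the similarity we observed between the two approaches.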
There are many interesting directions in which this project could be continued:
For even more details, see our full work on GitHub!