Predicting Social Media Shares by using MARS, Ridge Regression and LASSO Regression in R Studio

Updated: Jan 31, 2022



Content

  1. Abstract

  2. Introduction

  3. Methodology

  4. MARS Model

  5. Ridge Regression Model

  6. LASSO Regression Model

  7. Conclusion

Abstract


This research uses a data set from Kaggle [1] about online articles published by Mashable [2]. The data set contains 39,644 articles and 61 variables; two of the variables are non-predictive and one is the response variable. With so many variables, the data set can be difficult to work with for a researcher who does not know the domain. There are plausible reasons for the number of variables. First, the articles are published online, so gathering information about readers is feasible. Second, the sharing volume of an article might depend on whether it was published on a weekday. The original data set appears to have been built on reasoning of this kind. In this research, three regression methods were applied to predict the shares of online articles: MARS, Ridge Regression and LASSO Regression, in that order. The data set was divided into 80% train and 20% test, and cross-validation was applied in every model. 10 of the 61 variables were chosen for the models. Mean Squared Error (MSE), Root Mean Square Error (RMSE), Mean Absolute Error (MAE) and Mean Absolute Percentage Error (MAPE) were used to measure the models' performance. The results suggest that all three models may have an overfitting problem. Adjusting the models might fix that issue, but it may not be the best suggestion, since other machine learning models might be needed to capture the structure of this data set.


Introduction


Measuring online metrics is not new in the academic field (Hood, 1987), and different disciplines have tried to come up with solutions to the existing issues. In the social media world, likes, dislikes, comments, and shares carry different weight depending on the platform on which they are used. On LinkedIn, for example, the "like" button offers 6 different reaction categories. Perhaps in the business world, showing some love with an ordinary reaction is not enough: one has to like, celebrate, support, be insightful, or be curious. People put a lot of thought into the "like" button while browsing LinkedIn. Every platform measures its metrics differently. On Instagram, there is only one "like" emoji. YouTube recently hid the public "dislike" count from its users. Maybe they saw the same complication on social media, or maybe social media is not about being a "philosopher". Either way, a thumbs up seems to be the only visible choice left to users. This matters because there is ongoing research on "likes on social media" (Burrow & Rainone, 2017; Poon & Jiang, 2020; Marengo et al., 2021). More examples about likes and dislikes could be given, but those topics are outside the scope of this research. Instead, we tried to analyze the importance of shares, because it might not be wrong to state that influencers only ask for likes or subscriptions at the beginning or at the end of their videos.


Methodology


As stated above, the data was downloaded from Kaggle; the original data set comes from the UC Irvine Machine Learning Repository. Since there are 61 different variables, we had to select which ones to use in our models. This selection can receive legitimate criticism for not using two-phase models (Dey et al., 2017; Kaur et al., 2020) to reduce the number of variables. After all, how can one know the mathematical result without using mathematics? Principal Component Analysis, for example, could have reduced the dimensionality of the variables and made the choice of variables more scientific. With that caveat, the selected variables are listed below:


n_tokens_content: Number of words in the content

num_imgs: Number of images

num_videos: Number of videos

global_subjectivity: Text subjectivity

global_sentiment_polarity: Text sentiment polarity

global_rate_positive_words: Rate of positive words in the content

global_rate_negative_words: Rate of negative words in the content

abs_title_subjectivity: Absolute subjectivity level

abs_title_sentiment_polarity: Absolute polarity level

shares: Number of shares (response variable)


Our first step was to produce descriptive statistics for this data set. Our second step was to draw scatterplots (Becker, Chambers & Wilks, 1988) of the variables we used; to do that, we applied the pairs method, which gives a matrix of scatterplots [3]. In our third step, prior to building any models, we divided the data into 80% train and 20% test. Prior to building each model, we also applied cross-validation, with the two goals below:


"1) To estimate performance of the learned model from available data using one algorithm. In other words, to gauge the generalizability of an algorithm


2) To compare the performance of two or more different algorithms and find out the best algorithm for the available data or alternatively to compare the performance of two or more variants of a parameterized model (Refaeilzadeh, Tang & Liu, 2009, p.2)"
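The 80/20 split and the 10-fold cross-validation described above can be sketched as follows. This is a minimal illustration in plain Python, not the R code behind this article; the function names, the seed and the 100-row toy table are assumptions made for the example.

```python
import random


def train_test_split(rows, train_fraction=0.8, seed=42):
    """Shuffle the rows and split them into a train list and a test list."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    cut = round(len(rows) * train_fraction)
    train = [rows[i] for i in indices[:cut]]
    test = [rows[i] for i in indices[cut:]]
    return train, test


def k_fold_indices(n_rows, k=10, seed=42):
    """Yield (fit_idx, validation_idx) pairs; every row is validated exactly once."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for held_out in range(k):
        validation_idx = folds[held_out]
        fit_idx = [i for j, fold in enumerate(folds) if j != held_out for i in fold]
        yield fit_idx, validation_idx


# Toy stand-in for the article table: 100 hypothetical rows.
rows = list(range(100))
train_rows, test_rows = train_test_split(rows)
cv_folds = list(k_fold_indices(len(train_rows), k=10))
print(len(train_rows), len(test_rows), len(cv_folds))
```

In this sketch the folds are drawn from the training rows only, so the 20% test rows stay untouched until the final comparison.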


Our fourth step was to build and apply the models. We selected MARS, Ridge Regression and LASSO Regression as our predictors, mainly because these three models have been used to predict response variables in different disciplines (Gepp et al., 2018; Bennett, 2019; Kerckhoffs et al., 2019; Fernández-Delgado et al., 2019; García-Nieto et al., 2021; Ozmen, 2022). Finally, we compared the train and test results of these models and wrote up our interpretation in the conclusion.
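The train and test comparison relies on the four error measures named in the abstract. A minimal sketch of how they are usually defined is given below in plain Python (the article's own analysis was run in R). The function name and the toy share counts are hypothetical, and skipping rows with a zero actual value in MAPE is an assumption of this sketch rather than something the article states.

```python
import math


def regression_metrics(actual, predicted):
    """Return MSE, RMSE, MAE and MAPE (in percent) for two equal-length lists."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mae = sum(abs(e) for e in errors) / n
    # MAPE divides by the actual value, so rows with actual == 0 are skipped here.
    pct = [abs(e / a) for e, a in zip(errors, actual) if a != 0]
    mape = 100.0 * sum(pct) / len(pct) if pct else float("nan")
    return {"MSE": mse, "RMSE": rmse, "MAE": mae, "MAPE": mape}


# Hypothetical share counts, not values from the Mashable data.
actual_shares = [100, 200, 400, 800]
predicted_shares = [110, 180, 440, 720]
metrics = regression_metrics(actual_shares, predicted_shares)
for name, value in metrics.items():
    print(name, round(value, 3))
```

Computing the same dictionary once on the training rows and once on the test rows gives the kind of side-by-side comparison used to judge overfitting.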


Descriptive Statistics



Pairs Graphic


Our first observation about the pairs graphic is that it is difficult to find linear relationships between the variables. The second, related observation is that classification methods might have been a better fit for this data, such as Logistic Regression (Wright, 1995), Classification Trees (Buntine, 1992), Support Vector Machines (Hearst et al., 1998), Random Forest (Breiman, 1999), and so on. The third observation is that almost every pair of variables shows a different relationship, which is quite interesting.
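A numeric companion to the pairs graphic is the pairwise Pearson correlation, which measures only linear association. The sketch below is plain Python written for illustration (the article itself used R), and the three short columns are hypothetical toy values, not rows from the data set.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy columns that reuse three of the variable names.
columns = {
    "num_imgs": [1, 2, 3, 4, 5],
    "num_videos": [2, 4, 6, 8, 10],
    "shares": [5, 1, 4, 1, 5],
}
names = list(columns)
for i, first in enumerate(names):
    for second in names[i + 1:]:
        r = pearson(columns[first], columns[second])
        print(first, second, round(r, 3))
```

In this toy data the first two columns are perfectly linearly related, while the correlation between num_imgs and shares is essentially zero even though the shares values clearly follow a pattern. A weak linear correlation therefore does not rule out a relationship, which is one reason a model with non-linear terms such as MARS is worth trying.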


MARS Model

+ Fold01: degree=1, nprune=10

- Fold01: degree=1, nprune=10

+ Fold01: degree=2, nprune=10

- Fold01: degree=2, nprune=10

+ Fold01: degree=3, nprune=10

- Fold01: degree=3, nprune=10

-

-

-

- Fold10: degree=2, nprune=10

+ Fold10: degree=3, nprune=10

- Fold10: degree=3, nprune=10


10-fold cross-validation was applied. Showing every fold would take a huge amount of space, so we decided not to show the whole log. Degrees 1 to 3 were tried, and pruning was limited to 10.
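To make the two tuning parameters in the log concrete: a MARS model is a sum of hinge functions, the degree is the largest number of hinge functions multiplied together in one term, and nprune caps the number of terms kept after pruning. The sketch below is plain Python for illustration only (the article's model was fit in R); the knots, coefficients and the two inputs are hypothetical and are not the fitted model.

```python
def hinge(x, knot, direction=1):
    """MARS basis function: max(0, x - knot), or max(0, knot - x) if direction is -1."""
    return max(0.0, direction * (x - knot))


def mars_predict(x1, x2):
    """Hypothetical MARS-style model: an intercept plus three terms."""
    intercept = 1000.0
    term1 = 50.0 * hinge(x1, knot=3)                  # degree 1
    term2 = -20.0 * hinge(x1, knot=3, direction=-1)   # degree 1, mirrored hinge
    term3 = 5.0 * hinge(x1, knot=3) * hinge(x2, knot=1)  # degree 2 interaction
    return intercept + term1 + term2 + term3


print(mars_predict(5, 2))
print(mars_predict(1, 0))
```

With degree=1 only terms like the first two would be allowed; degree=2 and degree=3 additionally allow products of two or three hinge functions, such as the third term.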

Before jumping to any conclusion about the model, our first observation is that the R-squared in the output is very low.


RMSE Graphic

The RMSE graphic suggests that the grid search should select 6. However, reading results with the naked eye can sometimes be misleading: the MARS model actually selected 7 while pruning.


Important Variables


One can observe that only a few variables seem to be important in this model. The number of images sits at the top of the importance list, which is in line with the idea that social media is all about images.


Ridge Regression Model


Just like in the MARS model, we had to cut off the output, because showing the full results of the model would take pages and pages.


RMSE Graphic


The RMSE scores resemble a straight line as the regularization parameter keeps increasing.
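What the regularization parameter does to a Ridge coefficient can be seen in the simplest case of one centered predictor, where the penalized slope has a closed form. The sketch below is plain Python for illustration (the article's model was fit in R). The toy data and the lambda grid are hypothetical, and the penalty is written as lambda times the squared slope, which may be scaled differently from the R output.

```python
def ridge_slope(x, y, lam):
    """Slope minimizing sum((y - b0 - b*x)^2) + lam * b^2 for a single predictor."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / (sxx + lam)


# Hypothetical toy data with a true slope of 2.
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
for lam in [0, 10, 90]:
    print(lam, ridge_slope(x, y, lam))
```

With lambda at zero the ordinary least squares slope is recovered, and larger values pull the slope toward zero without ever making it exactly zero.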


Important Variables


The number of images does not seem to be important for Ridge Regression. Global subjectivity appears in the importance list, as in the previous model. Ridge Regression may also perform better, since more than 3 variables showed their importance.


LASSO Regression Model


Just like in the MARS and Ridge Regression models, we had to cut off the output. A side note is needed here, since the outputs of Ridge Regression and LASSO Regression look similar: both report lambda, RMSE, R-squared and MAE as their performance metrics. The only difference lies in the mixing parameter alpha rather than in lambda: Ridge Regression takes alpha as "zero", whereas LASSO Regression takes alpha as "one".
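In the elastic-net formulation used by the glmnet package, the quantity that separates the two models is the mixing weight alpha, while lambda sets the overall strength of the penalty. Assuming the models here were fit that way, the penalty term can be sketched in plain Python as below; the function name and the coefficient values are hypothetical.

```python
def elastic_net_penalty(coefs, lam, alpha):
    """Elastic-net penalty: lam * (alpha * L1 norm + (1 - alpha) / 2 * squared L2 norm)."""
    l1 = sum(abs(b) for b in coefs)
    l2_squared = sum(b * b for b in coefs)
    return lam * (alpha * l1 + (1 - alpha) / 2 * l2_squared)


coefs = [3.0, -4.0]  # hypothetical coefficients
print(elastic_net_penalty(coefs, lam=0.5, alpha=0))    # pure Ridge penalty
print(elastic_net_penalty(coefs, lam=0.5, alpha=1))    # pure LASSO penalty
print(elastic_net_penalty(coefs, lam=0.5, alpha=0.5))  # a mix of the two
```

Both models are then tuned over a grid of lambda values, which is why their outputs list the same columns.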


RMSE Graphic


As with Ridge Regression, the RMSE scores resemble a straight line as the regularization parameter keeps increasing. The only difference between the two models in terms of RMSE is that LASSO Regression's RMSE score is lower than Ridge Regression's.


Important Variables


LASSO Regression seems to give less informative results when it comes to variable importance: in the other two models, at least three variables made it into the importance lists. In addition, the response variable itself appears in the importance list for LASSO Regression, which makes the result difficult to comment on. This could be either a good thing or a bad one.
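One possible reason for a short LASSO importance list is that the L1 penalty sets small coefficients exactly to zero, whereas Ridge only shrinks them. In the textbook special case of orthonormal predictors this has a closed form, sketched below in plain Python (not the R code used for this article). The unpenalized coefficients and the lambda value are hypothetical.

```python
def soft_threshold(z, lam):
    """LASSO solution for one coefficient when predictors are orthonormal."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def ridge_shrink(z, lam):
    """Ridge solution in the same orthonormal setting: proportional shrinkage."""
    return z / (1 + lam)


# Hypothetical unpenalized (least squares) coefficients.
ols_coefs = [3.0, -0.4, 0.2, -1.5]
lam = 0.5
lasso_coefs = [soft_threshold(z, lam) for z in ols_coefs]
ridge_coefs = [ridge_shrink(z, lam) for z in ols_coefs]
print(lasso_coefs)
print(ridge_coefs)
```

The two small coefficients are dropped entirely by the LASSO rule but remain non-zero under Ridge, so an importance list built from LASSO coefficients can be much shorter.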


Train Results



Test Results



Conclusion

Overall, the test errors of all three models exceed their train errors. Choosing any one of these models over the others would therefore not make a radical difference; all three need adjustments and would have to be rebuilt. Another splitting method could be applied to address this issue. However, this suggestion might not work either, because three models that consistently overfit (Cohen & Jensen, 1997; Ng, 1997) are not a good sign.


For future studies, two-phase models should be applied to reduce the dimensionality of the variables. Other Machine Learning methods can be added to the analysis, such as Logistic Regression, Classification Trees, Support Vector Machines, Random Forest, Bagging, Boosting, Artificial Neural Networks and so on, or Deep Learning methods can be applied to this data set. Another suggestion is to build different models and leave MARS, Ridge Regression and LASSO Regression out of the comparison. An alternative is to use Time Series Analysis for this data set, because there may be seasonality in the articles' publishing times, meaning that readers might prefer to read certain articles in certain seasons. All in all, this research has limitations and restrictions, and the author conducted it with awareness of them.



Bibliography
  • Becker, R. A., Chambers, J. M., & Wilks, A. R. (1988). The New S Language. Wadsworth & Brooks/Cole.

  • Bennett, J. B. (2019). Attitude and Adoption: Understanding Climate Change Through Predictive Modeling (Doctoral dissertation, Purdue University Graduate School).

  • Breiman, L. (1999). Random forests. UC Berkeley TR567.

  • Buntine, W. (1992). Learning classification trees. Statistics and computing, 2(2), 63-73.

  • Burrow, A. L., & Rainone, N. (2017). How many likes did I get?: Purpose moderates links between positive social media feedback and self-esteem. Journal of Experimental Social Psychology, 69, 232-236.

  • Cohen, P. R., & Jensen, D. (1997, January). Overfitting explained. In Sixth International Workshop on Artificial Intelligence and Statistics (pp. 115-122). PMLR.

  • Dey, K., Shrivastava, R., & Kaushik, S. (2017, November). Twitter stance detection—A subjectivity and sentiment polarity inspired two-phase approach. In 2017 IEEE international conference on data mining workshops (ICDMW) (pp. 365-372). IEEE.

  • Fernández-Delgado, M., Sirsat, M. S., Cernadas, E., Alawadi, S., Barro, S., & Febrero-Bande, M. (2019). An extensive experimental survey of regression methods. Neural Networks, 111, 11-34.

  • García-Nieto, P. J., García-Gonzalo, E., & Paredes-Sánchez, J. P. (2021). Prediction of the critical temperature of a superconductor by using the WOA/MARS, Ridge, Lasso and Elastic-net machine learning techniques. Neural Computing and Applications, 33(24), 17131-17145.

  • Gepp, A., Linnenluecke, M. K., O’Neill, T. J., & Smith, T. (2018). Big data techniques in auditing research and practice: Current trends and future opportunities. Journal of Accounting Literature, 40, 102-115.

  • Hearst, M. A., Dumais, S. T., Osuna, E., Platt, J., & Scholkopf, B. (1998). Support vector machines. IEEE Intelligent Systems and their applications, 13(4), 18-28.

  • Hood, W. (1987). Online Databases: Pricing, Downloading and Front End Software. LASIE: Library Automated Systems Information Exchange, 17(4), 87-95.

  • Kaur, S., Kumar, P., & Kumaraguru, P. (2020). Detecting clickbaits using two-phase hybrid CNN-LSTM biterm model. Expert Systems with Applications, 151, 113350.

  • Kerckhoffs, J., Hoek, G., Portengen, L., Brunekreef, B., & Vermeulen, R. C. (2019). Performance of prediction algorithms for modeling outdoor air pollution spatial surfaces. Environmental science & technology, 53(3), 1413-1421.

  • Marengo, D., Montag, C., Sindermann, C., Elhai, J. D., & Settanni, M. (2021). Examining the links between active Facebook use, received likes, self-esteem and happiness: A study using objective social media data. Telematics and Informatics, 58, 101523.

  • Nesi, P., Pantaleo, G., Paoli, I., & Zaza, I. (2018). Assessing the reTweet proneness of tweets: predictive models for retweeting. Multimedia Tools and Applications, 77(20), 26371-26396.

  • Ng, A. Y. (1997, July). Preventing "overfitting" of cross-validation data. In ICML (Vol. 97, pp. 245-253).

  • Ozmen, A. (2022). Multi-objective regression modeling for natural gas prediction with ridge regression and CMARS. An International Journal of Optimization and Control: Theories & Applications (IJOCTA), 12(1), 56-65.

  • Poon, K. T., & Jiang, Y. (2020). Getting less likes on social media: Mindfulness ameliorates the detrimental effects of feeling left out online. Mindfulness, 11(4), 1038-1048.

  • Refaeilzadeh, P., Tang, L., & Liu, H. (2009). Cross-validation. Encyclopedia of database systems, 5, 532-538.

  • Wright, R. E. (1995). Logistic regression.


Internet Sources

1) https://www.kaggle.com/yamqwe/predicting-number-of-shares-of-news-articles

3) https://www.rdocumentation.org/packages/graphics/versions/3.6.2/topics/pairs

Disclaimer

The author would like to thank UC Irvine Machine Learning Repository for making this data set available.

