Exploring the relationship between continuous and Likert-scale data

I am working on a project and looking for some help from the community. The goal is to find any kind of relationship between MetricA (integer data, e.g., number of incidents) and 5-10 survey questions. Each survey question is scored from 1 to 10. As you would expect with survey data, it is sparse: many surveys come back with no answer.

I grouped both datasets by date and merged them, using the average survey score for each question as the daily value. This may not be the best approach, but it is where I started. I then calculated the correlation between MetricA and the averaged survey scores. The correlation was pretty weak.
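For anyone following along, the aggregation described above might look roughly like the sketch below. The column names (`date`, `q1`, `q2`, `MetricA`) and the simulated data are assumptions for illustration only, not the poster's actual schema. It also keeps the number of answers behind each daily average, since an average of two answers is much noisier than an average of twenty.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical survey responses: one row per response, scores 1-10.
dates = pd.date_range("2024-01-01", periods=30, freq="D")
surveys = pd.DataFrame({
    "date": dates[rng.integers(0, len(dates), size=300)],
    "q1": rng.integers(1, 11, size=300).astype(float),
    "q2": rng.integers(1, 11, size=300).astype(float),
})
# Simulate unanswered questions (assumed ~30% missing for q1).
surveys.loc[rng.random(300) < 0.3, "q1"] = np.nan

# Hypothetical daily incident counts (MetricA).
metric = pd.DataFrame({"date": dates, "MetricA": rng.poisson(4, size=30)})

questions = ["q1", "q2"]
grouped = surveys.groupby("date")[questions]

# Daily average per question; mean() skips NaN by default.
daily_mean = grouped.mean().reset_index()
# Number of non-missing answers behind each daily average.
daily_n = grouped.count().add_suffix("_n").reset_index()

merged = (
    metric.merge(daily_mean, on="date", how="left")
          .merge(daily_n, on="date", how="left")
)

# Pearson correlation of each averaged question with MetricA.
corr = merged[questions].corrwith(merged["MetricA"])
print(corr)
```

Because the simulated scores are independent of MetricA, the correlations here should come out near zero; the point is only the shape of the pipeline.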

Another approach was to use XGBoost to predict MetricA and then use SHAP values to see whether high or low survey scores explain the predicted MetricA counts.

Has any of you worked on anything like this? Any guidance would be appreciated!

3 Likes

Traditional statistics methodology:

Depending on the dispersion, either a negative binomial or a Poisson regression with the survey items as predictors, maybe with zero inflation.

To address the missingness, you can either add a binary indicator for "missing" and replace the missing values with a constant such as 1, or impute them (there are a lot of flavors to choose from, depending on your assumptions about why the values are missing).
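The indicator approach could be sketched as follows. The helper name `add_missing_indicators` and the tiny example frame are hypothetical. The fill value itself carries no meaning: the indicator column absorbs the difference between "answered" and "not answered".

```python
import numpy as np
import pandas as pd

def add_missing_indicators(df, fill_value=1.0):
    """Add a 0/1 '<col>_missing' column per question and fill the gaps.

    Hypothetical helper for illustration.
    """
    out = df.copy()
    for col in df.columns:
        out[f"{col}_missing"] = df[col].isna().astype(int)
        out[col] = df[col].fillna(fill_value)
    return out

# Small made-up example with unanswered questions.
survey = pd.DataFrame({
    "q1": [7.0, np.nan, 3.0, np.nan, 9.0],
    "q2": [5.0, 6.0, np.nan, 2.0, 8.0],
})

encoded = add_missing_indicators(survey)
print(encoded)
```

Both the filled score and its indicator then go into the regression as predictors.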

As an aside, don’t expect stellar performance. Surveys carry a lot of measurement error, even the most trustworthy ones. In the behavioral sciences, the r² between a survey and the kind of behavior you’d expect to go with it (like trait anger and violence when provoked) is rarely greater than 0.3.

3 Likes

You make a valid point about r². There are additional factors besides MetricA that could affect the results of this particular survey. To test the waters, I tried a basic linear regression, but the results weren’t very good.

On this, management has been really clear.

2 Likes

What sample size would you consider “sparse” if your survey consisted of five to ten items on a scale of 1 to 10?

If all you’re looking for is correlations with Likert-scale items, here are a few things you could try:

Group the answers into fewer categories (e.g., 1–5, 6–8, 9–10). If there is variance in the responses to survey questions, this could help. Alternatively, you could treat the variables as categorical rather than numeric or ordinal.

Use Spearman’s correlation coefficient instead of Pearson’s. It only takes a second to check, but unless your data has a particularly strange shape, it probably won’t make much of a difference. A discernible rise in a correlation’s significance indicates that you could
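Both suggestions can be tried in a few lines of pandas. The sketch below uses made-up numbers in which MetricA rises monotonically but non-linearly with a hypothetical question `q1`, which is the kind of shape where Spearman and Pearson diverge; the bin labels are likewise only illustrative.

```python
import numpy as np
import pandas as pd

# Made-up data: MetricA increases with q1, but far from linearly.
df = pd.DataFrame({
    "q1": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "MetricA": [1, 1, 1, 2, 2, 3, 5, 9, 20, 60],
})

# Suggestion 1: collapse the 1-10 scale into three bins.
# Intervals are right-closed: (0, 5], (5, 8], (8, 10].
df["q1_bin"] = pd.cut(
    df["q1"],
    bins=[0, 5, 8, 10],
    labels=["low (1-5)", "mid (6-8)", "high (9-10)"],
)
mean_by_bin = df.groupby("q1_bin", observed=True)["MetricA"].mean()
print(mean_by_bin)

# Suggestion 2: compare Pearson (linear) with Spearman (rank-based).
pair = df[["q1", "MetricA"]]
pearson = pair.corr(method="pearson").loc["q1", "MetricA"]
spearman = pair.corr(method="spearman").loc["q1", "MetricA"]
print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

On data like this, Spearman comes out noticeably higher than Pearson because it only cares about rank order, not about the relationship being a straight line.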

2 Likes

Should I switch from software to data science?

1 Like

I guess you should do what you think is best for you.

1 Like