Tim Shanahan is a social scientist and consultant who works with the Statistical Training and Techniques (STAT) team of the Fors Marsh Group in Arlington, VA. Tim has several years of experience in survey and opinion research, primarily in the educational sector. Tim’s research has been used to support organizational membership engagement and explore novel recruiting strategies for graduate and undergraduate students. Tim’s research has also evaluated the effectiveness of graduate admissions advising as well as the role that gender plays in occupational decision making.
Few fields have been as radically transformed by people’s access to the internet as survey research and consumer insight mining. The web has made surveys inexpensive to develop and distribute, substantially increasing the number of respondents participating in survey-based research. Social scientists and market researchers alike can now field surveys across social media platforms, mobile phones, and website panels. Unfortunately, data from internet-based surveys and opt-in panels are increasingly contaminated by fraudulent responses generated by “survey bots.”
Bots are simple computer programs that impersonate people by interacting with computer systems, automating tasks, and independently completing a wide range of online operations. The sophistication, ease of use, and range of tasks that bots can complete are growing rapidly—while the costs are steadily declining. One study estimates that just over half of all web traffic is now generated by automated bots. Bots can be very useful for web scraping, monitoring websites, or aggregating news; however, they are destructive when they are used to impersonate humans. For example, “ad fraud” or “click fraud” has long been a plague on the digital advertising industry, as increasing numbers of bots drive up ad click rates and reduce the actual return on advertising spend.
Similarly, bots negatively affect the accuracy of survey research: commercially available survey bots can now generate survey responses, usually to collect the digital compensation offered for completed surveys. Fake data from bots often enter surveys via unscrupulous members of opt-in survey panels. An individual on a panel can purchase a survey bot that dramatically increases the number of completed surveys attributable to that panel member, generating compensation for each completion. Even the most well-established panel and respondent providers have been affected by the problem. The prevalence of bot-generated data has not been formally estimated, but the problem appears to be growing as the number of commercially available survey bots increases. It is safe to conclude that bots are amplifying some of the existing problems arising from low-effort responding, such as “speeding” and “straight-lining,” found in nonprobability survey panels.
Since the survey bot problem is not going away, how can survey researchers manage this problem?
Fortunately, survey bots aren’t as effective at disguising themselves as they are at taking surveys. At several points in the survey lifecycle, countermeasures can be employed to help maintain the integrity of your data or identify fraudulent responses.
One of the most effective ways of dealing with survey bots is preventing them from taking surveys in the first place. You should work with your panel provider to make sure they have procedures in place to screen and verify the identity of respondents. Fors Marsh Group (FMG) employs many of these measures, such as engaging in routine name and address validations and mailing incentives to physical addresses.
Evidence of bot activity can often be found in the distribution of survey responses, in open-ended questions, and in back-end paradata such as IP addresses or time signatures. Unsophisticated bots generate responses with repeating patterns, so examine your data for suspiciously high counts of particular response choices. Make “consistency edits” by checking for implausible response patterns, such as young people reporting long histories of certain behaviors. Data from open-ended question fields should also be inspected for duplicate phrases or completely off-topic responses.
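The checks above can be sketched in a few lines of code. The snippet below is a minimal illustration, not a production screener: the respondent records, field names, and the straight-lining rule (identical answers across every grid item) are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical respondent records: five Likert grid items plus one open-ended field.
responses = [
    {"id": "r1", "items": [3, 3, 3, 3, 3], "open_end": "Great product, love it!"},
    {"id": "r2", "items": [1, 4, 2, 5, 3], "open_end": "Shipping was slow."},
    {"id": "r3", "items": [3, 3, 3, 3, 3], "open_end": "Great product, love it!"},
]

def flag_straight_lining(record):
    """Flag a respondent who gave the identical answer to every grid item."""
    return len(set(record["items"])) == 1

def duplicate_open_ends(records):
    """Return the set of open-ended texts that appear more than once verbatim."""
    counts = Counter(r["open_end"].strip().lower() for r in records)
    return {text for text, n in counts.items() if n > 1}

dupes = duplicate_open_ends(responses)
for r in responses:
    r["flag_straightline"] = flag_straight_lining(r)
    r["flag_dupe_text"] = r["open_end"].strip().lower() in dupes
```

In this toy data, r1 and r3 are flagged on both checks while r2 passes; in practice these rules would be tuned to the survey design, since some legitimate respondents straight-line short grids.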
Paradata such as IP addresses and time signatures can also indicate bot activity. Multiple surveys from the same IP address are a reliable but not foolproof indicator that responses were automatically generated, so examine IP addresses and time signatures for duplicates. Where possible, also look for surveys completed well outside the typical completion time, whether extremely quickly or extremely slowly.
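These paradata checks can be sketched as follows. The records, field names, and the cutoff of roughly three robust deviations from the median are illustrative assumptions; the median absolute deviation is used here simply because it is less distorted by the very outliers being hunted.

```python
import statistics
from collections import Counter

# Hypothetical paradata: respondent IP address and completion time in seconds.
paradata = [
    {"id": "r1", "ip": "203.0.113.7",   "seconds": 45},
    {"id": "r2", "ip": "198.51.100.2",  "seconds": 540},
    {"id": "r3", "ip": "203.0.113.7",   "seconds": 48},
    {"id": "r4", "ip": "192.0.2.9",     "seconds": 610},
    {"id": "r5", "ip": "198.51.100.14", "seconds": 2900},
]

ip_counts = Counter(p["ip"] for p in paradata)
times = [p["seconds"] for p in paradata]
med = statistics.median(times)
mad = statistics.median(abs(t - med) for t in times)  # median absolute deviation

for p in paradata:
    # Shared IPs are suspicious but not conclusive (households, offices, VPNs).
    p["flag_shared_ip"] = ip_counts[p["ip"]] > 1
    # Flag completion times more than ~3 robust deviations from the median.
    p["flag_time_outlier"] = mad > 0 and abs(p["seconds"] - med) / mad > 3
```

Here the duplicated IP (r1 and r3) and the extreme completion time (r5) are flagged, while typical-looking responses pass both checks.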
There is rarely a single trait that exposes a bot, but looking at all the evidence in context can paint a picture of bot activity. If you suspect you have fraudulent responses, make sure to flag the data rather than removing cases outright so that differences between flagged and unflagged analyses can be studied. Although the number of bot-generated responses found in panel-provided data is likely to increase, researchers can stay a step ahead of bots by taking the preceding actions into consideration when preparing or analyzing their data. The features that make bots powerful are often the same ones that can give them away, such as fast responses, stilted or boilerplate answers, and high rates of “productivity.”
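The flag-don’t-delete approach can be made concrete with a simple quality score. The flag names and the two-flag cutoff below are hypothetical; the point is that every case is retained so flagged and unflagged analyses can be compared.

```python
# Hypothetical per-response quality flags collected from earlier checks.
responses = [
    {"id": "r1", "flags": {"straightline": True,  "dupe_text": True,  "shared_ip": True}},
    {"id": "r2", "flags": {"straightline": False, "dupe_text": False, "shared_ip": False}},
    {"id": "r3", "flags": {"straightline": False, "dupe_text": True,  "shared_ip": False}},
]

THRESHOLD = 2  # illustrative cutoff: responses failing 2+ checks are marked suspect

for r in responses:
    score = sum(r["flags"].values())
    # Mark the case rather than dropping it, so analyses can be run both ways.
    r["suspect"] = score >= THRESHOLD
```

Running the analysis with and without the suspect cases then shows whether the questionable data materially change the results.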