Don’t do your experiments double-blind: the importance of checking your data

Nelleke Oostdijk and Hans van Halteren

When developing software, you want to do just that: develop software. You do not want to spend more than half of your project on collecting data. As a result, you are tempted to download one of the many ready-made datasets that can be found on the web. This sentiment has been around for a long time, but it seems to have gained strength now that deep learning software is also widely available. Today everybody can do natural language processing, using mix-and-match downloaded data and software. Everybody can run interesting experiments and report on them, sometimes while barely (or not at all) understanding the software or the data.

Now, we quite accept that many of us have started to use software we do not fully understand. With the current state of the art, we cannot expect otherwise. However, we posit that understanding your data is quite another matter: you should always strive to inspect your data. Only then can you be sure that you are in fact doing what you think you are doing. And maybe you can even improve your modeling.

To test whether this, possibly prejudiced, opinion has any merit, we downloaded a dataset unknown to us, consisting of 515K hotel reviews scraped from a hotel booking site. The dataset had already been used in several theses, papers, and blogs, none of which pointed out any shortcomings. At most, it was remarked that results with this dataset were worse than those with other datasets, with no clear reason given. This remark was made in relation to the task of recognizing whether a (short) text is positive or negative, a task for which this dataset seems ideal, as reviewers place their comments in separate columns for positive and negative remarks.

We inspected the dataset and reannotated part of the texts for their polarity. On this basis, we will first discuss some shortcomings of the dataset which become quite clear if you only take the trouble to look. These can stem from the users of the site, e.g. positive comments in the negative column or positive and negative remarks mixed together in one column, or from the scraping process, e.g. the removal of all punctuation, hyphens, and special characters. We then use both the original annotation (the column in which the text was apparently placed) and our reannotation to see how well the task of polarity recognition can be performed.
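Such an inspection need not be elaborate. As a minimal sketch, one can flag reviews whose negative column contains strong positive cue words (or vice versa); the column layout, cue-word lists, and function names below are illustrative assumptions, not part of the dataset specification:

```python
# Illustrative sanity check for misplaced polarity; the cue-word lists
# and the (positive_text, negative_text) row layout are assumptions.
POSITIVE_CUES = {"excellent", "great", "lovely", "perfect"}
NEGATIVE_CUES = {"terrible", "awful", "dirty", "rude"}

def suspicious(text, wrong_cues):
    """Flag a text containing cue words of the opposite polarity."""
    return bool(set(text.lower().split()) & wrong_cues)

def audit(rows):
    """Count texts in the negative column with positive cues, and
    texts in the positive column with negative cues."""
    pos_in_neg = sum(suspicious(neg, POSITIVE_CUES) for _, neg in rows)
    neg_in_pos = sum(suspicious(pos, NEGATIVE_CUES) for pos, _ in rows)
    return pos_in_neg, neg_in_pos

rows = [
    ("great location", "nothing"),
    ("perfect stay", "the staff were excellent"),  # misplaced positive text
]
print(audit(rows))  # → (1, 0): one positive comment in the negative column
```

Even a crude check of this kind surfaces the kinds of user-induced noise discussed above, and a similar pass over the texts quickly reveals the systematic absence of punctuation.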

Preliminary results with an odds-based learner (similar to the one with which we reached 2nd place in the VarDial2 shared task on distinguishing between Dutch and Flemish subtitles) show that the reannotation leads to significantly better results, both for polarity classification and for recognizing mixed texts. At the conference, we will present the final results for various learners as well as an in-depth analysis of the errors and how these might be caused by the noise in the data.
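To make the idea of an odds-based learner concrete, here is one simple instantiation: a smoothed unigram log-odds classifier. This is a generic sketch under our own assumptions, not necessarily the learner used in the experiments:

```python
import math
from collections import Counter

def train(pos_texts, neg_texts, alpha=1.0):
    """Per-word log-odds log P(w|pos) - log P(w|neg),
    with add-alpha smoothing over the joint vocabulary."""
    pos = Counter(w for t in pos_texts for w in t.lower().split())
    neg = Counter(w for t in neg_texts for w in t.lower().split())
    vocab = set(pos) | set(neg)
    pos_total = sum(pos.values()) + alpha * len(vocab)
    neg_total = sum(neg.values()) + alpha * len(vocab)
    return {w: math.log((pos[w] + alpha) / pos_total)
              - math.log((neg[w] + alpha) / neg_total)
            for w in vocab}

def classify(text, odds):
    """Sum the log-odds of known words; positive sum -> 'pos'."""
    score = sum(odds.get(w, 0.0) for w in text.lower().split())
    return "pos" if score > 0 else "neg"

# Toy training data for illustration only.
odds = train(["clean room great staff", "great location"],
             ["dirty room rude staff", "noisy location"])
print(classify("great clean room", odds))  # → pos
```

A learner of this form makes the effect of label noise easy to see: misplaced texts push words of the wrong polarity into the counts, flattening the log-odds that the classifier relies on.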