This is an Up-goer Five translation of a writeup I did on a satire recognition project. Some precision was lost, but it was a great thought exercise in simplifying unnecessarily dry and verbose academic writing. As the hours wore on into inanity, it also became painfully obvious that pictures would have better served to illustrate many of these concepts.
Many people read the news. Some people make fun of real things that happened by writing not-real things. This writing sounds like the news. It is usually easy for a human to figure out if writing is real news or funny not-real news. But it is very hard for a computer. We show how to make a computer figure it out a lot of the time.
We train our computer using a lot of real news, and a lot of funny not-real news. We got these from a place where anyone can go on a computer and read them (the internet). Also, for each of the not-real news, the things that it talks about are like the things that at least one of the real news talks about. This is so that when we look at what is different between the not-real and the real news, we will only get things that make them real or not real, instead of just things that make them different because they talk about different things.
//nltk part of speech tagger
Sometimes, words that look the same can be used in different ways. Like 'lead' can mean a thing you write with, or it can mean when you have someone follow you. But people have made a free computer thing that, when you give it writing, can quickly figure out what ways the words are used. We use this before looking at the words.
A good way to start this kind of problem is to simply see how many of each word there is in each writing, and look at the numbers for words that we think will help.
1. Some words are not serious. A lot of not-real funny news uses these not-serious words to show that it is not real. There is a place anyone can go to on a computer that has many words and tells you if the words are not serious (Wiktionary). It also tells you other things about the words. We took the not-serious words from this place, and looked at how many of them there were in each writing.
We then took that number over the number of all words in that writing. We did this for every writing. We use this number to say how not-serious the writing is and call it a not-serious-number.
This is the reason why we had to make each not-serious-number smaller by the number of all the words in the writing. Suppose one writing has a hundred words, and twenty not-serious words. Another writing has ten hundred words, and a hundred not-serious words. The second writing has more not-serious words than the first, but words are more often not-serious in the first writing than the second. One in every five words in the first writing is not-serious, while one in every ten words in the second writing is not-serious. If we make the not-serious number smaller by the number of all the words in the writing, then it does not matter if the writing is big or small; we will just get how often the writing is not-serious.
2. There are also words that are used to tell you if things are VERY something. Very big, or very small, or very round, or very strange. Some not-real funny news uses these words more than real news to show that it is not real. The computer place where we got the not-serious words can also tell you if a word is VERY something. We took these words, and looked at how many of them were in each writing. As with the not-serious words, we took that number over the number of all words in the writing. We call this a VERY-number.
3. Of course, not-real funny news talks about things that do not happen in real life. We make our computer figure this out by looking for words that do not go together a lot in real news. If we have never seen the words go together in the real news, it might mean that the writing is made up. We see words like 'the' and 'and' all the time, so those words will not matter much. But if we see animals playing a human game, or people that are very well known doing strange things, that will not show up a lot in real news.
To figure this out, we see how many news a word is in if we already know that another word is in that news. We do this for every pair of words that we have. This gives us a number for each pair of words that tells us if they are together a lot. Let us call this a together-number.
Then for every writing where we want to check if it is not-real news, we take a word "A". We look at its together-number for every other word in the writing. We take each together-number times the next together-number. This gives us one number for "A", which we call a together-together-number. Now we do this for every other word. Once we have a together-together-number for every word, we take them all times each other. This gives us a together-number for the whole writing.
We do this for every writing. It takes a long time. We will talk about ways to make it fast.
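Here is a small sketch of how the together-number could be made. It is not the code from the project. The names (`train_together_numbers`, `together_number`, `writing_score`) are made up for this example. Two things here are assumptions that the text above does not state: a little is added to every count so that a pair we have never seen does not give a zero, and tree-piece-numbers (logs) are added up instead of taking many small numbers times each other, which a computer is bad at.

```python
import math
from collections import Counter
from itertools import permutations

def train_together_numbers(real_news):
    """Count how many real news writings hold each word and each ordered word pair."""
    word_docs = Counter()
    pair_docs = Counter()
    for words in real_news:
        seen = set(words)  # a word counts once per writing
        word_docs.update(seen)
        pair_docs.update(permutations(seen, 2))
    return word_docs, pair_docs

def together_number(a, b, word_docs, pair_docs, smooth=1.0):
    """Chance that b is in a real news writing if we know a is in it.

    The smoothing is an assumption of this sketch: it keeps never-seen
    pairs from giving exactly zero.
    """
    return (pair_docs[(a, b)] + smooth) / (word_docs[a] + 2 * smooth)

def writing_score(words, word_docs, pair_docs):
    """Add up the log of the together-number for every ordered pair of words.

    Adding logs stands in for taking all the together-numbers times each
    other. A number closer to zero means the words go together more in real news.
    """
    seen = sorted(set(words))
    return sum(
        math.log(together_number(a, b, word_docs, pair_docs))
        for a, b in permutations(seen, 2)
    )

# Made-up real news, already cut into words.
real_news = [
    ["president", "signs", "law"],
    ["president", "signs", "bill"],
    ["dog", "bites", "man"],
]
word_docs, pair_docs = train_together_numbers(real_news)
usual = writing_score(["president", "signs"], word_docs, pair_docs)
strange = writing_score(["dog", "signs"], word_docs, pair_docs)
```

With this made-up news, `usual` comes out closer to zero than `strange`, because "dog" and "signs" were never seen together. Looking at every pair is what makes this slow: the work grows with the number of words times itself.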
Note that the together-number tells us about the chance that a writing is real news, while the not-serious-number and the VERY-number tell us about the chance that a writing is funny not-real news.
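The not-serious-number and the VERY-number are made the same way, so one small function can stand for both. This is only a sketch: `word_rate` and the word set `NOT_SERIOUS` are made up for the example, and the real word sets came from Wiktionary. The two writings are the ones from the example above, with a hundred words and ten hundred words.

```python
def word_rate(words, word_set):
    """Number of words in the writing that are in word_set, over the number of all words."""
    if not words:
        return 0.0  # an empty writing has no words to count
    hits = sum(1 for w in words if w.lower() in word_set)
    return hits / len(words)

# Made-up stand-in for the not-serious word set.
NOT_SERIOUS = {"dude", "gonna"}

first = ["dude"] * 20 + ["the"] * 80     # a hundred words, twenty not-serious
second = ["dude"] * 100 + ["the"] * 900  # ten hundred words, a hundred not-serious

first_rate = word_rate(first, NOT_SERIOUS)
second_rate = word_rate(second, NOT_SERIOUS)
```

The second writing has more not-serious words, but `first_rate` is the bigger number, which is the point of taking the count over the number of all words.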
People who work on problems like this agree on ways to make numbers that tell you how well your computer did in figuring things out. This is good, because if everyone makes these numbers the same way, we can use them to tell if one way to figure out a problem is better than another way. We will tell you what some of these numbers are, and give short-names for them. We will also give short-names for other things we have to say a lot.
- The close-answers-number: the number of things the computer thought were right that were really right, over the number of all of the things it thought were right. This tells us how close together the things it thought were right are.
- The close-to-the-middle-number: the number of things the computer thought were right that were really right, added to the number of things the computer thought were wrong that were really wrong. That number over the number of all the things that the computer looked at.
- The things-that-matter-number: the number of things the computer thought were right that were really right, over the number of all the things that were really right.
- The good-job-number: the close-answers-number times the things-that-matter-number, times two. That over the close-answers-number added to the things-that-matter-number. A lot of people use this one number to see how well they did, because both the close-answers-number and the things-that-matter-number are important.
- The tree-piece-number of a number you give is the number of tens you have to take times each other to get the number you give. For a hundred it is two, and for ten hundred it is three.
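The numbers in this list can be written down in a few lines. This is a sketch with made-up names (`how_well`, `tree_piece_number`). Outside of Up-goer Five talk, these numbers are usually called precision, accuracy, recall, F-score, and the base-ten logarithm.

```python
import math

def how_well(guesses, truths):
    """guesses and truths are lists of True ("right") and False ("wrong"), in the same order."""
    pairs = list(zip(guesses, truths))
    said_right_was_right = sum(1 for g, t in pairs if g and t)
    said_right_was_wrong = sum(1 for g, t in pairs if g and not t)
    said_wrong_was_right = sum(1 for g, t in pairs if not g and t)
    said_wrong_was_wrong = sum(1 for g, t in pairs if not g and not t)

    said_right = said_right_was_right + said_right_was_wrong
    really_right = said_right_was_right + said_wrong_was_right

    # Each number falls back to 0.0 when there is nothing to take it over.
    close_answers = said_right_was_right / said_right if said_right else 0.0
    things_that_matter = said_right_was_right / really_right if really_right else 0.0
    close_to_the_middle = (
        (said_right_was_right + said_wrong_was_wrong) / len(pairs) if pairs else 0.0
    )
    both = close_answers + things_that_matter
    good_job = 2 * close_answers * things_that_matter / both if both else 0.0

    return {
        "close_answers": close_answers,              # precision
        "close_to_the_middle": close_to_the_middle,  # accuracy
        "things_that_matter": things_that_matter,    # recall
        "good_job": good_job,                        # F-score
    }

def tree_piece_number(n):
    """Base-ten logarithm of n."""
    return math.log10(n)

# Made-up guesses: True means "this is funny not-real news".
guesses = [True, True, True, False, False]
truths = [True, True, False, True, False]
result = how_well(guesses, truths)
```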
We have two word sets from the word place: the not-serious words and the VERY words. For each of those sets, we look for the words in it that give us the best good-job-number when we use only those words to figure out if the writing we have is not-real funny news. Even though the place we got the words from said that they were not serious, seeing them in a writing might not mean that the writing is not serious. We have to find out which words are good at telling us if writing is real news or not.
//dynamic programming, optimizing for F-score
There is a way to do this kind of thing fast. We look at the good-job-number for every word set we can make with one word. We save those numbers and use each of them to find the best good-job-numbers we can get for every word set we can make with two words. For each one-word-set, we see what the good-job-number becomes if we add any other word to the set. But we only save the best new word for each word set. The best word is the one that gives us the biggest good-job-number when we add it to the first word-set. Then we use those best two-word-sets to find the best word sets with three words, the same way as we used the best one-word-sets to find the best two-word-sets. All the way up to all of the words. After we are done, we see which word set got the biggest good-job-number. That is the word set that we will use to make the not-serious-numbers and the VERY-numbers.
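The idea of adding one best word at a time can be sketched like this. It is a simpler stand-in, not the project's code: it follows only one growing word set (starting from the best single word) instead of keeping a best set for every starting word. The names `greedy_word_set`, `f_score`, and `toy_score`, and the toy way of guessing (a writing is called not-real if it has any word from the set), are assumptions made up for the example.

```python
def f_score(guesses, truths):
    """The good-job-number for a list of guesses against the true answers."""
    tp = sum(1 for g, t in zip(guesses, truths) if g and t)
    fp = sum(1 for g, t in zip(guesses, truths) if g and not t)
    fn = sum(1 for g, t in zip(guesses, truths) if not g and t)
    if tp == 0:
        return 0.0
    p, r = tp / (tp + fp), tp / (tp + fn)
    return 2 * p * r / (p + r)

def greedy_word_set(candidates, score):
    """Grow a word set one word at a time, always adding the word that scores best.

    Returns the best word set seen on the way, and its score.
    """
    chosen, left = [], list(candidates)
    best_set, best_score = [], float("-inf")
    while left:
        # Try every word that is left and keep only the best one.
        word = max(left, key=lambda w: score(chosen + [w]))
        chosen.append(word)
        left.remove(word)
        current = score(chosen)
        if current > best_score:
            best_set, best_score = list(chosen), current
    return best_set, best_score

# Made-up writings: (words, True if it is funny not-real news).
writings = [
    (["dude", "eats", "moon"], True),
    (["yo", "cat", "runs", "city"], True),
    (["gonna", "vote", "today"], False),
    (["city", "vote", "today"], False),
]
truths = [is_fake for _, is_fake in writings]

def toy_score(word_set):
    """Guess not-real if a writing has any word from word_set, then take the F-score."""
    guesses = [any(w in word_set for w in words) for words, _ in writings]
    return f_score(guesses, truths)

best_set, best_score = greedy_word_set(["dude", "gonna", "yo"], toy_score)
```

Here "gonna" is left out of the best set, because it only shows up in real news and adding it makes the good-job-number smaller. Saving only the best word each time is not sure to find the very best word set, but it checks far fewer sets.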
Because we only save the word that gives us the best good-job-number each time we add more words, the biggest number of times we will have to do this is the number of words in the word set, times the number of words in the word set, times the number of news we look at, times the tree-piece-number of the number of news we look at. This only took an hour for our computer to do.
//why the brute force approach is terrible
If we looked at every word set we could make, instead of ignoring the words that don't work well, we would have to make a good-job-number for a lot more word sets. To get how many, take two times two times two and so on, with as many twos as there are words, and then take away one. If we use the not-serious word set, which has six times ten hundred words, that would be a very, very big number. It is many, many times more than the number of every thing that has ever been.
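A few lines show how big that number is. This sketch first counts the word sets for a tiny made-up word list by really making each one, to check the "twos times each other, less one" rule, and then uses the rule for six times ten hundred words. The name `count_word_sets` is made up for the example.

```python
from itertools import combinations

def count_word_sets(words):
    """Count every non-empty word set by making each one (only sane for a few words)."""
    return sum(
        1
        for size in range(1, len(words) + 1)
        for _ in combinations(words, size)
    )

small = ["dude", "gonna", "yo", "nope"]
small_count = count_word_sets(small)
assert small_count == 2 ** len(small) - 1  # the rule holds for four words

all_word_sets = 2 ** 6000 - 1     # every word set from six times ten hundred words
digits = len(str(all_word_sets))  # how long that number is when written out
greedy_checks = 6000 * 6000       # word sets checked when saving only the best word
```

`digits` comes out to more than 1,800, while the number of atoms in everything we can see is usually guessed to need only about 80. `greedy_checks` is 36,000,000, which a computer can do.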
We now have a lot of different ways to make a computer figure out if news is real or not-real. There is the not-serious-number, the VERY-number, and the together-number. But we have to use them together. Some of these might be better at telling us if news is not-real than others. There is a way a lot of people use to see how much better some numbers are. We will not talk about it here because we might shit rocks trying to do it in Up-goer Five talk.
//support vector machines
Now we have many numbers telling us things about each writing. But they do not tell us yes or no, if a writing is real or not-real news. We need to figure out a deciding-number. If the number for a writing is at least as big as that deciding-number, we say yes. If it is smaller, we say no.
Since we know the real news and the not-real news, we can train a computer to figure out what the deciding-numbers should be in order to get the most answers right.
Suppose we think that every writing with a not-serious-number of at least five is not-real news, and every writing with a not-serious-number of less than five is real news. A writing with a not-serious-number of six could still be real news, just because of the way it was written. But we will think whatever the not-serious-number tells us.
If the things we get right or wrong make a better good-job-number when we decide with a not-serious-number of five, than if we use a not-serious-number of six, or four, or any other number, then we have picked the best not-serious number we can use to decide if a writing is real or not-real news.
We have to figure out what the best deciding-number is for the not-serious-numbers, the VERY-numbers, and the together-numbers. Only then can we use them to decide if new writing is real or not-real news.
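Picking a deciding-number for one kind of number can be sketched as below. The `//support vector machines` note says the project used a support vector machine; this sketch is a much simpler stand-in that tries every number it has seen as the deciding-number and keeps the one with the best good-job-number. The names `best_deciding_number` and `f_score` are made up, and so are the not-serious-numbers, which follow the five-and-six example above.

```python
def f_score(guesses, truths):
    """The good-job-number for a list of guesses against the true answers."""
    tp = sum(1 for g, t in zip(guesses, truths) if g and t)
    fp = sum(1 for g, t in zip(guesses, truths) if g and not t)
    fn = sum(1 for g, t in zip(guesses, truths) if not g and t)
    if tp == 0:
        return 0.0
    p, r = tp / (tp + fp), tp / (tp + fn)
    return 2 * p * r / (p + r)

def best_deciding_number(numbers, truths):
    """Try each seen number as the deciding-number and keep the one with the best F-score.

    A writing is guessed to be not-real news if its number is at least
    as big as the deciding-number.
    """
    best_cut, best_f = None, -1.0
    for cut in sorted(set(numbers)):
        guesses = [n >= cut for n in numbers]
        f = f_score(guesses, truths)
        if f > best_f:
            best_cut, best_f = cut, f
    return best_cut, best_f

# Made-up not-serious-numbers, and whether each writing really is not-real news.
not_serious_numbers = [1, 3, 4, 5, 6, 8]
is_not_real = [False, False, False, True, False, True]

cut, score = best_deciding_number(not_serious_numbers, is_not_real)
```

With these made-up numbers the best deciding-number is five. The writing with a six is real news but still gets called not-real, which is the kind of wrong answer the example above talks about.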
When we looked at what funny not-real news our computer figured out, we got a very, very good close-answers-number, which means that our computer said very little actual news was not-real. We got an okay things-that-matter-number, which means the computer said that some not-real news was real news. This is what we were trying to do. Sometimes, it is hard for even people to tell if not-real news is not real, so we only want to pick the not-real-funny news that we are very sure about. //optimized for few false positives
Only one other try on making computers figure out funny not-real news has been written about, and their numbers were close to ours, but we did better at picking funny news right. We did not use very hard ways of making our computer think, so there should be ways to figure out funny not-real news much better.
Another thing people do in funny not-real news is say things that are not true, when they mean it as a joke. One way we could maybe figure this out is by looking at words that usually mean good things, and words that usually mean bad things, and then seeing if the good or bad words go together when they should not.
Still another thing we could do is look at the names of real things. Things that are not in a serious word-book, like names of people and places. Our computer might need to know a lot about what those things mean, though, which would be very hard.