Datafilos: května 2023

úterý 16. května 2023

Summary evaluation for Wikipedia

Wikipedia articles, at least in English, tend to be overgrown - they contain a lot of information of mixed importance. However, we do not always have time to go thru all the content. It helps that articles are structured to have the most important things in the first sentence/paragraph. However, the importance is not really differentiated within the body. If you have to read the body, you get swamp. I use two tricks to deal with that: 1. Switch to a different language. The idea is that articles in different languages are smaller. However, they still contain the most important information. 2. Use a historical version of the article. The idea is that the most important information was entered before the less important information. People are obsessed these days with text-generative AI. Hence a proposal to use AI for shortening of English articles. Do you need a short description? Generate just a single sentence. Was it not enough? Generate the rest of the paragraph. Need even more? Write a subtopic, which interests you. How to evaluate the quality of the summaries? A. Machine translate all different language variants of the article into English and check the information overlap between the summary and the language variants. Ideally, the overlap will be large. This exploits trick #1. B. Check the overlap between the summary and historical versions of the article. Ideally, the information in the summary will be present even in the old versions of the article. This exploits trick #2. Limitations: 1. Some important information is known only from some date. For example, election results are not available before the results are announced. This can be corrected by observing how quickly given information spreads across different language versions. If the information spreads quickly, it is likely important information, even though it is young information. 2. Language variants are highly correlated because they copy from each other. However, it is reasonable to assume that, for example, English and Spanish are more correlated than, for example, Tuu and Thai, simply because fewer people speak both Tuu and Thai than English and Spanish. If the compensation of these differences is necessary, estimate a correlation matrix on the data and use it to weight the signal.