As part of the literature review for my first project in Fall 2019, "Auto-tagging of TCR records", I began to notice a distinct pattern in the papers that I read. Most of them hinted that the tags produced by the machine learning algorithms they developed for blogs, research papers, news links and other textual data were better than the tags provided by the authors of those documents. I am writing this blog post to explain document tagging and autotagging, and how the claim mentioned above might be true in a sense.
Wikipedia defines a tag as a keyword or term assigned to a piece of information which helps describe the item [1]. Tags are used extensively today, especially for storing digital information. Tagging serves many purposes:
- Describing the content
- Organising and managing one's own content
- Searching content that others have shared
- Easy navigation
- Recommending resources that a person may like
What happens if a resource does not have a tag?
Today, the tags on a digital resource are a key component of search and recommendation systems. When a resource does not have a tag, it becomes unsearchable and is referred to as an orphan resource, or, as I like to call it, a dangling resource (in memory of pointers [3,4]). I recall the development team's summer intern, Bingjun Li, presenting a similar case in his seminar in July.
Because so many documents have no tags, researchers have been working on methods to autotag (automatically tag) documents so that they become more accessible and organised. Though hundreds of research papers have been written on this topic, I will explain one method, proposed by Brooks and Montanez [5]. The researchers used blog data from Technorati [6,7], a popular blog search engine of that time, and tried to form a hierarchy of tags in order to create better tags:
- First, they tested the similarity of articles that contained similar keywords by applying an NLP approach called TF-IDF [8] to the content of the articles. As part of this, they randomly selected 250 of the top 1000 tags from Technorati. For each tag, they collected 20 blog articles, and these articles form a tag cluster: a cluster of articles whose content is assumed to be similar.
- After they found that there was overlap in tag usage among similar articles (that is, the tags could be grouped into categories), they tried to determine whether a hierarchical structure for tags could be found. This was done using a process called agglomerative clustering.
- In agglomerative clustering, they compared each tag cluster to every other tag cluster using a cosine similarity metric [10]. The two most similar clusters are then removed from the list of tag clusters and replaced with a new abstract tag cluster, which contains all of the articles in each original cluster. This cluster is assigned an abstract tag, which is made up of the tags that the merged clusters have in common. They continued this process until they were left with one single global cluster that contains all of the articles (Figure 1).
Figure 1: Sample of the hierarchical tag clusters created
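To make the steps above more concrete, here is a minimal Python sketch of the same idea on a toy scale. It is not the code from the paper: the three tags, the six one-line "articles", the whitespace tokenizer, the function names, and the choice to represent each tag cluster by the sum of its articles' TF-IDF vectors are all assumptions made for illustration. The label given to a merged cluster is simply the two original labels joined together, a placeholder rather than the abstract tag described above.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each term appears in.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster_vector(vectors):
    """Assumption: a tag cluster is summarised by the sum of its article vectors."""
    total = Counter()
    for vec in vectors:
        for term, w in vec.items():
            total[term] += w
    return dict(total)


def agglomerate(tag_clusters):
    """Repeatedly merge the two most similar clusters until one remains.

    tag_clusters maps a tag to the list of TF-IDF vectors of its articles.
    Returns the merges in order as (label_a, label_b, similarity) tuples.
    """
    clusters = dict(tag_clusters)
    merges = []
    while len(clusters) > 1:
        labels = list(clusters)
        summary = {label: cluster_vector(clusters[label]) for label in labels}
        # Compare every cluster with every other cluster and keep the best pair.
        sim, a, b = max(
            (cosine(summary[a], summary[b]), a, b)
            for i, a in enumerate(labels)
            for b in labels[i + 1:]
        )
        # Replace the pair with one cluster holding all of their articles.
        clusters[f"({a}+{b})"] = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b, sim))
    return merges


# Toy data (hypothetical): each tag has two very short "articles".
tagged_articles = {
    "python": ["python code function loop", "python script code variable"],
    "java": ["java code class method", "java code object compile"],
    "baking": ["flour sugar oven cake", "bread flour yeast oven"],
}

# TF-IDF is computed over the whole corpus, then regrouped by tag.
all_texts = [text for texts in tagged_articles.values() for text in texts]
all_vectors = iter(tfidf_vectors(all_texts))
tag_clusters = {
    tag: [next(all_vectors) for _ in texts]
    for tag, texts in tagged_articles.items()
}

merges = agglomerate(tag_clusters)
for a, b, sim in merges:
    print(f"merge {a} + {b} (cosine similarity {sim:.3f})")
```

On this toy data the two programming tags are merged first, because their articles share vocabulary, and the baking cluster only joins at the final step. With real blog articles one would want a proper tokenizer and stop-word handling, and probably a library implementation of TF-IDF and hierarchical clustering rather than this hand-rolled version.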
Now let us look at the question that we wanted to answer: does machine learning know your work better than you do?
The studies that I have read claim that the tags produced by their autotagging models were better received by a sample population than the tags created by the authors themselves.
Tags assigned by the author of a document can be inconsistent and idiosyncratic, both because of the author's personal terminology and because of the different purposes tags fulfil [11,12]. The tags given by the author may also sometimes be clouded and biased by their own fields of interest and study. For example, a "data science and education" paper written by a data scientist would possibly have more data-science-related keywords, while one written by an educator may have more education-related keywords.
Autotagging, if performed properly, would create tags and keywords which are free of these biases. It can also help in assigning tags to content that has not been assigned tags yet (referred to as the 'cold start problem' in most of the studies). Uniformity in tags can lead to better search and recommendation, the ability to group similar content together, easy navigation and better content management.
We also have to keep in mind that autotagging may be ineffective at creating tags for novel content or for very specific content. Even so, it might still be our best chance of placing such a document into broader categories.
The debate continues...
As part of our research, we are trying to come up with an algorithm that can autotag the content present in TCR.
1 - Wikipedia - https://en.wikipedia.org/wiki/Tag_(metadata)
2 - Unsupervised Auto-tagging for Learning Object Enrichment - https://link.springer.com/chapter/10.1007/978-3-642-23985-4_8
3 - Wikipedia - https://en.wikipedia.org/wiki/Pointer_(computer_programming)
4 - Wikipedia - https://en.wikipedia.org/wiki/Dangling_pointer
5 - Improved Annotation of the Blogosphere via Autotagging and Hierarchical Clustering - http://www2006.org/programme/files/pdf/583.pdf
6 - Wikipedia - https://en.wikipedia.org/wiki/Technorati
7 - Technorati - https://technorati.com/
8 - Wikipedia - https://en.wikipedia.org/wiki/Tf%E2%80%93idf
10 - Wikipedia - https://en.wikipedia.org/wiki/Cosine_similarity
11 - Usage patterns of collaborative tagging systems - https://journals.sagepub.com/doi/pdf/10.1177/0165551506062337
12 - Tag Recommendation using Probabilistic Topic Models - https://pdfs.semanticscholar.org/0284/11180e1cd67849c9e1218bf69de6f9a8d5f7.pdf
13 - LDA for On-the-Fly Autotagging - https://dl.acm.org/citation.cfm?id=1864774