Machine Learning Projects
1. Vector space models for sentiment analysis.
This project addresses sentiment analysis of song lyrics. It is a step towards exploring different vector space models that can be useful for this task. Dataset creation is not part of the project; the dataset will be provided.
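As a point of reference, one of the simplest vector space models is a bag-of-words representation compared by cosine similarity. The sketch below is only an illustration under that assumption, not the model the project prescribes; the function names `tf_vector` and `cosine` and the sample lyric fragments are made up for the example.

```python
import math
from collections import Counter


def tf_vector(text):
    """Represent a lyric as a sparse term-count vector (bag of words)."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse term-count vectors."""
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    norm = norm_u * norm_v
    # An empty lyric has no direction, so similarity is defined as 0 here.
    return dot / norm if norm else 0.0


# Invented lyric fragments, used only to exercise the functions.
lyric_a = tf_vector("dil udaas hai aaj")
lyric_b = tf_vector("aaj dil khush hai")
similarity = cosine(lyric_a, lyric_b)
```

Richer models (TF-IDF weighting, word embeddings) keep the same interface: each lyric becomes a vector, and a classifier or similarity measure operates on those vectors.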
2. ML and Neural Networks for sentiment analysis.
For the given annotated dataset of Hindi song lyrics, different ML and NN techniques are to be employed and compared in a thorough study. The final outcome should be a robust system that gives substantially accurate results.
3.Context2Vec and IMS+embeddings supervised methods of WSD on Hindi and Marathi datasets
Project Description: Implement the Context2Vec and IMS+embeddings supervised methods and, if possible, other WSD methods on Hindi and Marathi datasets, as described in the reference paper "http://lcl.uniroma1.it/wsdeval/data/EACL17_WSD_EvaluationFramework.pdf".
Paper detail:
Alessandro Raganato, Jose Camacho-Collados and Roberto Navigli.
Word Sense Disambiguation: A Unified Evaluation Framework and Empirical Comparison.
Proceedings of EACL 2017, Valencia, Spain
http://lcl.uniroma1.it/wsdeval/
Prerequisites (Specific Programming Language, Knowledge about ML, NN etc): Programming in Python and basic knowledge of the WSD task, classifiers and
4. Code-Mixed Tweet Identifier
The mere presence of some language-2 (say English) words in a sentence of language-1 (say Hindi) does not necessarily make the sentence code-mixed. For example, "Ram ne Mohan ko book di" may not be considered code-mixed even though it contains a proper English word, "book". Similarly, a sentence like "kal hawa kafi tez chal rahi thi I was so scared" may also not be considered code-mixed. In the first sentence, "book" is simply a borrowed word; it brings nothing from English grammar. The second sentence is actually composed of two monolingual sentences, "kal hawa kafi tez chal rahi thi" and "I was so scared". The aim of this project is to identify code-mixed sentences that mix grammatical constructions from both languages. For example: "This baarish ki raat always scares me", "Sunday is the weekly ghar ka saaf safai day" etc.
In this project you will explore the problems of Language Identification and Transliteration. For the original problem, a CNN architecture is recommended.
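The three cases distinguished above (borrowing, two monolingual sentences joined together, and genuine mixing) can be illustrated with a crude rule-based baseline. This is a sketch under strong assumptions, not the recommended CNN approach: it assumes per-token language tags are already available, that only two languages are involved, and that named entities are tagged with the surrounding language. The function name `classify_mixing` and the thresholds are hypothetical.

```python
from itertools import groupby


def classify_mixing(tags):
    """Label a sentence from its per-token language tags (e.g. 'hi', 'en')."""
    counts = {lang: tags.count(lang) for lang in set(tags)}
    if len(counts) < 2:
        return "monolingual"
    # Collapse consecutive tokens of the same language into runs.
    runs = [lang for lang, _ in groupby(tags)]
    switches = len(runs) - 1
    # A single token from the other language is treated as a borrowed word.
    if min(counts.values()) < 2:
        return "borrowing"
    # One switch point: two monolingual stretches placed side by side.
    if switches == 1:
        return "sentence-level switch"
    return "code-mixed"


# Hand-assigned tags for the example sentences from the description.
book = ["hi", "hi", "hi", "hi", "en", "hi"]          # Ram ne Mohan ko book di
scared = ["hi"] * 7 + ["en"] * 4                     # kal hawa ... I was so scared
baarish = ["en", "hi", "hi", "hi", "en", "en", "en"] # This baarish ki raat always scares me

labels = [classify_mixing(t) for t in (book, scared, baarish)]
# labels == ["borrowing", "sentence-level switch", "code-mixed"]
```

A rule like this ignores grammar entirely, which is exactly the gap a learned model is expected to close; it is mainly useful as a sanity check on the annotation.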
5. Develop a Language Identification system for code-mixed Social Media Text Analysis
In this project you can explore different sequence labeling systems like CRF, Structured Perceptron, RNNs etc.
6. Factoid QA engine for unstructured football data
In this project, we will scrape and gather a large dataset of unstructured football data from the internet, then run various (at least 4) factoid QA engines on this dataset and compare their results. Our aim is to spot improvements, particular to this domain, that can be made to any of these approaches. How does making the dataset semi-structured affect results? What are the features of our dataset which favour some approaches over others? We will start with the survey of approaches presented in ACL 2006 (Mengqiu Wang), and then proceed to more recent methods.
Pre-Requisites: Python, CL-1, CL-2 (basic knowledge of ML is a bonus)
7. Project Title: Web-based tool for collecting Tip-Of-Tongue instances
Project Description: To develop a web-based data collection tool for gathering information related to a linguistic phenomenon named TOT (Tip-Of-Tongue). We have a sample tool structure that walks through the various tool phases and can be used as a reference to start the development. The project does not require much linguistic or technical background. However, the tool is expected to support multi-language input and display (English/Hindi/Marathi).
Prerequisites: HTML/any front-end language, JavaScript, MS-Excel
8. Joint Modeling for POS Tagging and Chunking for Hindi.
For Hindi, a simple feed-forward neural network reaches a POS tagging accuracy of 97.2. Its chunking accuracy is 97.5 without POS features and ~99 with gold POS features. When automatically predicted POS tags (AUTO-POS) are used for chunking at test time, the 1.5% improvement vanishes; in fact, the chunker's accuracy drops slightly with AUTO-POS features (97.3). Learning POS tagging and chunking in a joint model is expected to improve both POS and chunk accuracy.
9. Hybrid Morph Analyser for Hindi.
The Hindi Morph Analyzer gives multiple analyses for a word. The current system selects the first analysis. Develop a system that learns to select the correct morph output from the possible choices.
10. Corpus Preprocessing Toolkit for Social Media Text
Most corpora available online have been created from web sources such as newspapers and social media, so they contain a large amount of noise: meaningless sentences, unprocessed HTML tags, and different canonical forms of the same characters. The aim of the project is to develop a generic module that performs step-by-step corpus cleaning and normalization.
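A step-by-step cleaning pass of the kind described could look roughly like the sketch below. It uses only the Python standard library; the function names `clean_line` and `clean_corpus`, the order of the steps, and the `min_tokens` threshold (a very crude stand-in for filtering meaningless sentences) are assumptions made for illustration.

```python
import html
import re
import unicodedata

TAG_RE = re.compile(r"<[^>]+>")   # leftover HTML tags such as <p> or <br/>
SPACE_RE = re.compile(r"\s+")     # any run of whitespace


def clean_line(text):
    """Apply the cleaning steps in order and return the normalized line."""
    text = html.unescape(text)                 # 1. decode entities such as &amp;
    text = TAG_RE.sub(" ", text)               # 2. drop unprocessed HTML tags
    text = unicodedata.normalize("NFC", text)  # 3. one canonical form per character
    text = SPACE_RE.sub(" ", text).strip()     # 4. collapse whitespace
    return text


def clean_corpus(lines, min_tokens=3):
    """Yield cleaned lines, skipping those with fewer than min_tokens tokens."""
    for line in lines:
        cleaned = clean_line(line)
        if len(cleaned.split()) >= min_tokens:
            yield cleaned


sample = ["<p>Hello &amp; welcome to the corpus</p>", "<br>"]
cleaned = list(clean_corpus(sample))
# cleaned == ["Hello & welcome to the corpus"]
```

Unicode normalization matters for Indic scripts in particular: the precomposed Devanagari letter U+0958 and the sequence U+0915 U+093C render identically, and after NFC both are stored as the same code point sequence.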
11. Dialog Data Creation and Information Retrieval
The project involves creating realistic dialog data in the tourist domain using a Slack interface. This will be followed by retrieving information from the JSON dump of the dialog data created.
Prerequisites - Python3
12. Interactive Authoring Tool for Multilingual Generation
We propose that an MT system achieves better translation when it is given structurally and lexically unambiguous input. In this project, we aim to develop an interactive system that, in collaboration with the author, produces such unambiguous input.
Pre-requisite : Python
13. Event Annotation and building event chain from a text
14. Improving WSD resources using Corpus and Knowledge rich resources
WSD resources are very useful for machine translation. To create a reliable WSD resource and improve the existing one, this project aims to extract information from corpora and use existing knowledge-rich resources such as FrameNet and WordNet.
Pre-requisite : Knowledge about corpus processing and machine learning
15. Developing word/phrase aligned parallel corpora
A good word/phrase-aligned parallel corpus is useful for SMT and for creating bilingual lexical resources. Participants in this project are expected to improve upon an existing alignment algorithm.
Pre-requisite : Programming and familiarity with SMT
16. Evaluation of Semantic Textual Similarity (STS) systems
The aim of the project is to evaluate the existing STS systems on English and SMT Systems.
Pre-requisite : Programming and familiarity with SMT; knowledge about STS systems is a plus
17. Neural Machine Translation for Indic languages.
Exploring Neural Machine Translation for various Indic languages, especially resource-scarce languages. Various modes of supervision will be experimented with.
Prerequisites: Basics of Machine Learning - Neural Networks, Supervision methods, etc. Text processing skills - Python / Any other language of your choice.
Familiarity with a deep learning framework like Torch / Tensorflow preferred.
18. Hierarchical Machine Translation Workbench for Indian Languages
Building a single framework for translation between 8 Indian Languages. Identifying challenges, collecting resources for improvement of these systems.
19. Beyond Word2Vec - Embedding Words and Phrases in Same Vector Space
Building deep learning architectures to embed multi-word units into a vector space maximizing similarity between units of different sizes.
20. Generating Factoid Questions from Wikipedia
Generating Who, What, Where and When questions from Wikipedia text.
21. Shallow Parsing For Telugu Code-Mixed Text
Tokenization, Language Identification, Normalization, Transliteration, POS Tagging, Chunking Pipeline for Telugu Code-Mixed Text
22. Text Denormalization for Code-Mixed Text Generation
Generating code-mixed text from raw monolingual text.
23. Sentence Reordering for Machine Translation
Converting an English sentence into Indian-language word order, e.g. Ram hit Shyam. -> Ram Shyam hit.
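For a single simple clause, the reordering in the example amounts to moving the verb after its object (SVO to SOV). The toy rule below illustrates only that case and assumes the tokens already carry part-of-speech tags; the function name `svo_to_sov` and the tag labels are chosen for the example, and real sentences would need syntactic parsing rather than a flat rule.

```python
def svo_to_sov(tagged):
    """Move verb tokens to the end of a simple clause, keeping other order.

    `tagged` is a list of (word, tag) pairs for one clause.
    """
    verbs = [word for word, tag in tagged if tag == "VERB"]
    punct = [word for word, tag in tagged if tag == "PUNCT"]
    rest = [word for word, tag in tagged if tag not in ("VERB", "PUNCT")]
    # Subject and object keep their relative order; the verb follows them.
    return rest + verbs + punct


clause = [("Ram", "PROPN"), ("hit", "VERB"), ("Shyam", "PROPN"), (".", "PUNCT")]
reordered = " ".join(svo_to_sov(clause))
# reordered == "Ram Shyam hit ."
```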
24. Sentence Compression
Reduce complex sentences into their simpler counterparts.
e.g.:
The tall boy, wearing a green shirt, suddenly hit the white dog on the head with a long stick. ->
1. the boy wearing a shirt hit the dog with a stick (level-1)
2. the boy hit the dog (level-2)
3. boy hit dog (level-3)
25. Building An Attention based Neural Machine Translation System for Indian Languages