MojoQA

Objective: A RAG-based LLM application that answers queries about the Mojo programming language

Description: MojoQA is a RAG (Retrieval-Augmented Generation) based LLM application that answers queries about the Mojo programming language. To build the Mojo QA bot, I extracted the official Mojo documentation and created a vector store of the corresponding embeddings. To answer each query, the most similar embeddings are retrieved and provided to the LLM as context.

Tools & Methods used: Python, PyTorch, NLP, LangChain, DeepLake, Streamlit
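
The retrieve-then-augment flow behind the bot can be sketched as below. The actual project uses LangChain with a DeepLake vector store and learned embeddings; here a toy bag-of-words vector and the helper names (`embed`, `retrieve`, `build_prompt`) are illustrative assumptions, not the project's API.

```python
# Minimal sketch of retrieval-augmented generation (RAG):
# embed documents, retrieve the most similar ones for a query,
# and pass them to the LLM as context in the prompt.
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy embedding: a bag-of-words term-frequency vector."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Return the k documents most similar to the query."""
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

def build_prompt(query: str, docs: list[str]) -> str:
    """Provide the retrieved chunks to the LLM as context."""
    context = "\n".join(retrieve(query, docs))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

# Stand-in "documentation" chunks.
docs = [
    "Mojo fn declarations require typed arguments.",
    "Python lists are dynamically sized.",
    "Mojo structs are fixed at compile time.",
]
prompt = build_prompt("How do fn declarations work in Mojo?", docs)
```

In the real pipeline, `embed` would be a learned embedding model and the similarity search would run against the DeepLake store rather than an in-memory list.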

Github Repository

Joke Generation & Rating using LLM

Objective: An interactive chatbot that generates jokes in various categories and assesses their quality.

Data Sources: rJokesDataset

Description: Using the rJokes dataset, a GPT-2 model was fine-tuned for joke generation, and a BERT model was fine-tuned to predict the humor level of each joke.

Tools & Methods used: Python, PyTorch, Pandas, NLP, Transformers, pytest, LLM
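
The humor-rating half of the project amounts to regressing a score from text. A minimal stand-in for that idea is sketched below: the real model is a fine-tuned BERT with a regression head trained on rJokes upvote scores, while here a tiny least-squares model over toy word-count features (`VOCAB`, `featurize`, and the scores are all invented for illustration) plays the same role.

```python
# Sketch of the humor-rating idea: learn to map joke text to a humor score.
# A least-squares linear model over word counts stands in for the
# fine-tuned BERT regression head used in the actual project.
import numpy as np

VOCAB = ["chicken", "road", "why", "the", "did", "cross"]

def featurize(text: str) -> np.ndarray:
    """Toy features: counts of vocabulary words in the joke."""
    words = text.lower().split()
    return np.array([words.count(w) for w in VOCAB], dtype=float)

# Toy training data: (joke, humor score, e.g. derived from upvotes).
train = [
    ("why did the chicken cross the road", 3.0),
    ("the chicken", 1.0),
    ("why why why", 0.5),
]
X = np.stack([featurize(t) for t, _ in train])
y = np.array([s for _, s in train])

# Fit the regression weights by least squares.
w, *_ = np.linalg.lstsq(X, y, rcond=None)

def predict_humor(text: str) -> float:
    """Predicted humor level for a new joke."""
    return float(featurize(text) @ w)
```

Swapping the toy featurizer for BERT's pooled output and the least-squares fit for gradient-descent fine-tuning recovers the shape of the actual approach.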

Github Repository

Content Diffusion from Social Media to News Quotes - MSc Thesis

Objective: Measure the diffusion of tweets as speaker-attributed quotations in news articles in the context of social movements, and analyze how social media have impacted the news gatekeeping process employed by news media.

Data Sources: Twitter, Quotebank, Wikidata, Media Bias/Fact Check

Description: Extracted user information and Twitter handles of US politicians and Hollywood celebrities from Wikidata, and collected the social-movement tweets these users posted on Twitter. By leveraging Quotebank, we measured how many social-movement tweets were reported as direct quotations by news media. Various experiments were then conducted to understand the impact of social media on the news gatekeeping process.

Tools & Methods used: Author emotion detection, Stance detection, Statistical analysis, Various ML classifiers, Time series analysis, Propensity score matching, LLM, Data modelling
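
One of the methods listed above, propensity score matching, can be sketched as follows: estimate each unit's probability of treatment from its covariates with a logistic model, then pair treated units with the nearest controls by that score. The data, covariates, and hyperparameters below are synthetic illustrations, not the thesis setup.

```python
# Sketch of propensity-score matching on synthetic data:
# 1) fit a logistic regression P(treated | covariates),
# 2) greedily match each treated unit to the closest control by score.
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 2))                                  # covariates
treated = (X[:, 0] + rng.normal(size=n) > 0).astype(float)   # treatment depends on X

# Fit logistic regression by gradient descent (the propensity model).
w = np.zeros(2)
b = 0.0
for _ in range(500):
    p = 1 / (1 + np.exp(-(X @ w + b)))
    w -= 0.5 * (X.T @ (p - treated) / n)
    b -= 0.5 * float(np.mean(p - treated))

scores = 1 / (1 + np.exp(-(X @ w + b)))

# Greedy 1:1 nearest-neighbour matching on the propensity score.
controls = np.flatnonzero(treated == 0)
used: set[int] = set()
pairs = []
for i in np.flatnonzero(treated == 1):
    free = [j for j in controls if j not in used]
    if not free:
        break
    j = min(free, key=lambda j: abs(scores[i] - scores[j]))
    used.add(j)
    pairs.append((int(i), int(j)))
```

The matched pairs can then be compared on the outcome of interest, with the covariate imbalance between groups largely removed.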

Report

Latent Variables in Computer Vision

Objective: Understand latent variables in computer vision and explore the difference between latent variables in Variational AutoEncoder (VAE) and vanilla Generative Adversarial Network (GAN).

Description: Trained a vanilla GAN on the UTKFace dataset, which contains over 23,000 images of human faces spanning different ethnic groups, ages, genders, and emotions. Generated fake images, and created new images with specific features by exploring the latent space and performing basic vector arithmetic.
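
The latent-space vector operations mentioned above can be sketched as below. The real generator is the GAN trained on UTKFace; here a fixed random linear map stands in for it, and the attribute names (`z_smiling`, `smile_direction`) are illustrative assumptions, so that only the vector arithmetic itself is on display.

```python
# Sketch of latent-space arithmetic with a GAN generator.
# A random linear map followed by tanh stands in for the trained generator.
import numpy as np

rng = np.random.default_rng(42)
LATENT_DIM, IMG_DIM = 100, 32 * 32 * 3
G = rng.normal(size=(LATENT_DIM, IMG_DIM))    # stand-in "generator" weights

def generate(z: np.ndarray) -> np.ndarray:
    """Map a latent code to a fake 'image' with values in [-1, 1]."""
    return np.tanh(z @ G)

# Sample latent codes, as when generating fake faces.
z_smiling = rng.normal(size=LATENT_DIM)
z_neutral = rng.normal(size=LATENT_DIM)
z_person = rng.normal(size=LATENT_DIM)

# Attribute vector: a direction in latent space associated with a feature
# (in practice, the mean difference between codes of smiling and neutral faces).
smile_direction = z_smiling - z_neutral

# Adding the direction to another code should add the feature to that face.
img = generate(z_person + smile_direction)

# Linear interpolation between two codes gives a smooth morph between faces.
morph = [generate((1 - t) * z_smiling + t * z_neutral) for t in np.linspace(0, 1, 5)]
```

With a trained generator, these same operations produce the "new images with specific features" described above.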

Github Repository

AutoEncoder based Image Retrieval

Description: Trained an autoencoder on the CIFAR-10 dataset. The encoder takes the 32 x 32 x 3 images as input and maps each to a 10-dimensional vector in latent space; the decoder was trained to reconstruct the original image from this 10-dimensional vector. The encoder of the trained autoencoder was then used to build the image retrieval system: every query image was converted into a 10-dimensional vector, and the most similar image was found by computing the cosine distance between this 10-D vector and the 10-D vectors of all images in the training dataset. Various sanity checks were conducted to measure the robustness of the system.
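
The retrieval step described above reduces to a cosine-similarity search over the 10-D latent vectors, sketched below. Random vectors stand in for real encoder outputs, and the function names are illustrative assumptions.

```python
# Sketch of latent-space image retrieval: find the training image whose
# 10-D encoder output is most similar (by cosine) to the query's.
import numpy as np

rng = np.random.default_rng(1)
train_latents = rng.normal(size=(1000, 10))   # encoder outputs for 1000 train images

def retrieve_most_similar(query_latent: np.ndarray) -> int:
    """Index of the training image with the closest latent vector."""
    sims = train_latents @ query_latent / (
        np.linalg.norm(train_latents, axis=1) * np.linalg.norm(query_latent)
    )
    return int(np.argmax(sims))

# A sanity check of the kind mentioned above: a training image's own
# latent vector should retrieve that image (cosine similarity of 1).
idx = retrieve_most_similar(train_latents[123])
```

In the real system, `train_latents` comes from running the trained encoder over the CIFAR-10 training set, and the query vector from encoding the search image.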

Github Repository

Summarized Line Graph

The summarized line graph is a visualization technique designed specifically to help data analysts communicate data to decision-makers more effectively and efficiently. In this project, I implemented an interactive summarized line graph using the D3.js library.

Github Repository

Contact

Email:
raigonkunnathaugustin@gmail.com
