Company: Sapient Logic
Role: Machine Learning Engineer
Project: Semantic Similarity Analysis for ATT&CK Framework and Digital Vaccine Mapping
Developed and implemented natural language processing (NLP) and machine learning techniques to assess how well TippingPoint's digital vaccines cover the MITRE ATT&CK framework. Measured the semantic similarity between ATT&CK technique descriptions and the corresponding digital vaccine descriptions to identify gaps in coverage and improve the vaccines' effectiveness against malicious threats.
Techniques and Algorithms:
- Word Embedding: Utilized word embedding techniques such as Word2Vec and GloVe to represent words and phrases as dense vectors, capturing semantic and syntactic relationships.
- Sentence Embeddings and Topic Modeling: Employed Sentence-BERT (SBERT) to generate sentence-level embeddings, and used them to identify latent topics within the ATT&CK framework and digital vaccine descriptions.
- Document-Term Matrix (DTM): Constructed a matrix representation of the documents, with each row representing a document and each column representing a term, to facilitate further analysis and similarity calculations.
- Semantic Textual Similarity (STS): Implemented algorithms to measure the semantic similarity between sentences and paragraphs from the ATT&CK framework and digital vaccine descriptions, leveraging cosine similarity and other distance metrics.
- Bag of Words: Utilized the bag-of-words representation to convert text data into numerical feature vectors, enabling machine learning algorithms to process and analyze the data effectively.
- Doc2Vec (Paragraph Vector): Employed document-level embedding to generate dense vector representations of entire documents and paragraphs, capturing their semantic meaning and context.
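
As an illustration of how the bag-of-words, document-term matrix, and cosine similarity steps listed above fit together, here is a minimal sketch using only the Python standard library. It is not the project's code: the function names (`tokenize`, `build_dtm`, `cosine_similarity`) and the sample descriptions are hypothetical, and the tokenizer is a simplified stand-in for the NLTK preprocessing.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens.

    Simplified stand-in for real preprocessing (no stemming or lemmatization).
    """
    return re.findall(r"[a-z0-9]+", text.lower())


def build_dtm(documents):
    """Build a document-term matrix: one row per document, one column per term."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    matrix = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, matrix


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical descriptions, invented for illustration only.
technique = "Adversaries may use PowerShell commands to execute malicious scripts"
vaccine_a = "Detects malicious PowerShell scripts and commands"
vaccine_b = "Blocks SQL injection attempts in HTTP requests"

_, rows = build_dtm([technique, vaccine_a, vaccine_b])
score_a = cosine_similarity(rows[0], rows[1])
score_b = cosine_similarity(rows[0], rows[2])
print(f"technique vs vaccine_a: {score_a:.2f}")
print(f"technique vs vaccine_b: {score_b:.2f}")
```

Raw term counts only reward exact word overlap, which is one reason dense embeddings such as Word2Vec, GloVe, or SBERT are typically preferred when two descriptions use different wording for the same behavior.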
Libraries and Technologies:
- NLTK (Natural Language Toolkit): Utilized NLTK for text preprocessing tasks such as tokenization, normalization (lemmatization and stemming), and regular expression matching.
- Gensim: Leveraged the Gensim library for topic modeling, word embedding, and similarity calculations.
- spaCy: Employed the spaCy library for advanced NLP tasks such as named entity recognition, dependency parsing, and part-of-speech tagging.
- Pandas and NumPy: Used Pandas and NumPy for efficient data manipulation, analysis, and numerical computations.
- TensorFlow and PyTorch: Utilized deep learning frameworks TensorFlow and PyTorch to train and deploy neural network models, leveraging CUDA for GPU acceleration.
- React: Developed a user interface using React to visualize the aggregation results and provide an intuitive way to explore the mapped relationships between ATT&CK techniques and digital vaccines.
Key Accomplishments:
- Analyzed the content of the ATT&CK framework and the description of each digital vaccine by computing semantic textual similarity scores between them.
- Preprocessed and normalized the text data using tokenization, lemmatization, stemming, and regular expressions to ensure data consistency and quality.
- Applied word and sentence embedding techniques (Word2Vec, GloVe, and Sentence-BERT) to capture the semantic meaning of words and phrases in the ATT&CK framework and digital vaccine descriptions.
- Trained and fine-tuned deep learning models to determine the semantic similarity between sentences and paragraphs from the datasets.
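
The coverage-gap analysis described in the project summary can be sketched as follows, assuming a precomputed similarity matrix with one row per ATT&CK technique and one column per digital vaccine. The function name `find_coverage_gaps`, the identifiers, the scores, and the 0.5 threshold are all hypothetical placeholders, not values from the project.

```python
def find_coverage_gaps(similarity, technique_ids, vaccine_ids, threshold=0.5):
    """Map each technique to its best-matching vaccine, or flag it as a gap.

    similarity[i][j] is the similarity score between technique i and vaccine j.
    A technique whose best score falls below `threshold` is reported as a gap.
    """
    matches = {}
    gaps = []
    for tech_id, row in zip(technique_ids, similarity):
        if not row:
            # No vaccines to compare against: the technique is uncovered.
            gaps.append(tech_id)
            continue
        best_idx = max(range(len(row)), key=row.__getitem__)
        best_score = row[best_idx]
        if best_score >= threshold:
            matches[tech_id] = (vaccine_ids[best_idx], best_score)
        else:
            gaps.append(tech_id)
    return matches, gaps


# Hypothetical scores, invented for illustration only.
scores = [
    [0.8, 0.1],  # technique_A is close to vaccine_1
    [0.2, 0.3],  # technique_B has no close match
]
matches, gaps = find_coverage_gaps(
    scores, ["technique_A", "technique_B"], ["vaccine_1", "vaccine_2"]
)
print("matched:", matches)
print("gaps:", gaps)
```

In practice the threshold would need to be calibrated against pairs that analysts have already labeled as covered or uncovered, since similarity scores from different embedding models are not directly comparable.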