{"id":1913,"date":"2019-09-04T16:07:25","date_gmt":"2019-09-05T00:07:25","guid":{"rendered":"https:\/\/live-cometml.pantheonsite.io\/blog\/nlp-twitter-airline-blog\/"},"modified":"2019-09-04T16:07:25","modified_gmt":"2019-09-05T00:07:25","slug":"nlp-twitter-airline-blog","status":"publish","type":"post","link":"https:\/\/www.comet.com\/site\/blog\/nlp-twitter-airline-blog\/","title":{"rendered":"Getting Started with Natural Language Processing: US Airline Sentiment Analysis"},"content":{"rendered":"\n\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Sections<\/strong><\/h2>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Introduction to NLP<\/li>\n<li>Dataset Exploration<\/li>\n<li>NLP Processing<\/li>\n<li>Training<\/li>\n<li>Hyperparameter Optimization<\/li>\n<li>Resources for Future Learning<\/li>\n<\/ol>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Introduction to NLP<\/strong><\/h2>\n\n\n\n<p>Natural Language Processing (NLP) is a subfield of machine learning concerned with processing and analyzing natural language data, usually in the form of text or audio. 
Some common challenges within NLP include speech recognition, text generation, and sentiment analysis, while some high-profile products deploying NLP models include Apple\u2019s Siri, Amazon\u2019s Alexa, and many of the chatbots one might interact with online.<\/p>\n\n\n\n<p>To get started with NLP and introduce some of the core concepts in the field, we\u2019re going to build a model that tries to predict the sentiment (positive, neutral, or negative) of tweets relating to US Airlines, using the popular\u00a0<a href=\"https:\/\/www.kaggle.com\/crowdflower\/twitter-airline-sentiment\" target=\"_blank\" rel=\"noreferrer noopener\">Twitter US Airline Sentiment dataset<\/a>.<\/p>\n\n\n\n<p>Code snippets will be included in this post, but for fully reproducible notebooks and scripts, view all of the notebooks and scripts associated with this project on its Comet project\u00a0<a href=\"https:\/\/www.comet.com\/demo\/nlp-twitter-airline\" target=\"_blank\" rel=\"noreferrer noopener\">page<\/a>.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Dataset Exploration<\/strong><\/h3>\n\n\n\n<p>Let\u2019s start by importing some libraries. 
Make sure to install\u00a0<a href=\"http:\/\/comet.ml\/\" target=\"_blank\" rel=\"noreferrer noopener\">Comet<\/a>\u00a0for experiment management, visualizations, code tracking and hyperparameter optimization.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code># Comet\nfrom comet_ml import Experiment<\/code><\/pre>\n\n\n\n<p>A few standard packages: pandas, numpy, matplotlib, etc.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code># Standard packages\nimport os\nimport pickle\nimport numpy as np\nimport pandas as pd\nimport matplotlib.pyplot as plt<\/code><\/pre>\n\n\n\n<p><a href=\"https:\/\/www.nltk.org\/\" target=\"_blank\" rel=\"noreferrer noopener\">Nltk<\/a>\u00a0for natural language processing functions:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code># nltk\nimport nltk\nfrom nltk.tokenize import sent_tokenize, word_tokenize\nfrom nltk.corpus import stopwords\nfrom nltk.stem.snowball import SnowballStemmer<\/code><\/pre>\n\n\n\n<p><a href=\"https:\/\/scikit-learn.org\/stable\/\" target=\"_blank\" rel=\"noreferrer noopener\">Sklearn<\/a>\u00a0and\u00a0<a href=\"https:\/\/keras.io\/\" target=\"_blank\" rel=\"noreferrer noopener\">keras<\/a>\u00a0for machine learning models:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code># sklearn for preprocessing and machine learning models\nfrom sklearn.model_selection import train_test_split\nfrom sklearn.ensemble import GradientBoostingClassifier\nfrom sklearn.metrics import accuracy_score\nfrom sklearn.utils import shuffle\nfrom sklearn.preprocessing import OneHotEncoder\nfrom sklearn.feature_extraction.text import TfidfVectorizer\n\n# XGBoost\nimport xgboost as xgb\n\n# Keras for neural networks\nimport keras\nfrom keras.models import Sequential\nfrom keras.layers import Dense, Dropout, BatchNormalization, Flatten\nfrom keras.layers.embeddings import Embedding\nfrom keras.preprocessing import sequence\nfrom keras.utils import to_categorical\nfrom keras.callbacks import EarlyStopping<\/code><\/pre>\n\n\n\n<p>Now we\u2019ll load the data:<\/p>\n\n\n\n<pre 
class=\"wp-block-code\"><code>raw_df = pd.read_csv('twitter-airline-sentiment\/Tweets.csv')<\/code><\/pre>\n\n\n\n<p>Let\u2019s check the shape of the dataframe:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>raw_df.shape\n&gt;&gt;&gt; (14640, 15)<\/code><\/pre>\n\n\n\n<p>So we\u2019ve got 14,640 samples (tweets), each with 15 features. Let\u2019s take a look at what features this dataset contains.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>raw_df.columns<\/code><\/pre>\n\n\n\n<p><code>'tweet_id'<\/code>\u00a0,\u00a0<code>'airline_sentiment'<\/code>\u00a0,\u00a0<code>'airline_sentiment_confidence'<\/code>\u00a0,\u00a0<code>'negativereason'<\/code>\u00a0,\u00a0<code>'negativereason_confidence'<\/code>\u00a0,\u00a0<code>'airline'<\/code>\u00a0,\u00a0<code>'airline_sentiment_gold'<\/code>\u00a0,\u00a0<code>'name'<\/code>\u00a0,\u00a0<code>'negativereason_gold'<\/code>\u00a0,\u00a0<code>'retweet_count'<\/code>\u00a0,\u00a0<code>'text'<\/code>\u00a0,\u00a0<code>'tweet_coord'<\/code>\u00a0,\u00a0<code>'tweet_created'<\/code>\u00a0,\u00a0<code>'tweet_location'<\/code>\u00a0,\u00a0<code>'user_timezone'<\/code><\/p>\n\n\n\n<p>Let\u2019s also take a look at airline sentiment for each airline (code can be found on\u00a0<a href=\"https:\/\/www.comet.com\/demo\/nlp-airline\/99bcfee71c74405c84d2da1766ee4374?experiment-tab=code\" target=\"_blank\" rel=\"noreferrer noopener\">Comet<\/a>):<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code># Create a Comet experiment to start tracking our work\nexperiment = Experiment(\n    api_key='&lt;HIDDEN&gt;',\n    project_name='nlp-airline',\n    workspace='demo')\nexperiment.add_tag('plotting')\nairlines = ['US Airways',\n           'United',\n           'American',\n           'Southwest',\n           'Delta',\n           'Virgin America']\nfor i in airlines:\n     new_df = raw_df[raw_df['airline'] == i]\n     count = new_df['airline_sentiment'].value_counts()\n     experiment.log_metric('{} 
negative'.format(i), count['negative'])\n     experiment.log_metric('{} neutral'.format(i), count['neutral'])\n     experiment.log_metric('{} positive'.format(i), count['positive'])\nexperiment.end()<\/code><\/pre>\n\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter\"><img decoding=\"async\" class=\"wp-image-1017\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2022\/06\/chart-1.jpg\" alt=\"\" \/><\/figure>\n<\/div>\n\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter\"><img decoding=\"async\" class=\"wp-image-1016\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2022\/06\/chart-2.jpg\" alt=\"\" \/><\/figure>\n<\/div>\n\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter\"><img decoding=\"async\" class=\"wp-image-1015\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2022\/06\/chart-3.jpg\" alt=\"\" \/><\/figure>\n<\/div>\n\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter\"><img decoding=\"async\" class=\"wp-image-1014\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2022\/06\/chart-4.jpg\" alt=\"\" \/><\/figure>\n<\/div>\n\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter\"><img decoding=\"async\" class=\"wp-image-1013\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2022\/06\/chart-5.jpg\" alt=\"\" \/><\/figure>\n<\/div>\n\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter\"><img decoding=\"async\" class=\"wp-image-1012\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2022\/06\/chart-6.jpg\" alt=\"\" \/><\/figure>\n<\/div>\n\n\n\n<p>Every airline has more negative tweets than either neutral or positive tweets, with Virgin America receiving the most balanced spread of positive, neutral and negative of all the US airlines. While we\u2019re going to focus on NLP-specific analysis in this write-up, there are excellent sources of further feature-engineering and exploratory data analysis. 
Kaggle kernels\u00a0<a href=\"https:\/\/www.kaggle.com\/parthsharma5795\/comprehensive-twitter-airline-sentiment-analysis\" target=\"_blank\" rel=\"noreferrer noopener\">here<\/a>\u00a0and\u00a0<a href=\"https:\/\/www.kaggle.com\/mrisdal\/exploring-audience-text-length\" target=\"_blank\" rel=\"noreferrer noopener\">here<\/a>\u00a0are particularly instructive in analyzing features such as audience and tweet length as related to sentiment.<\/p>\n\n\n\n<p>Let\u2019s create a new dataframe with only\u00a0<code>tweet_id<\/code>\u00a0,\u00a0<code>text<\/code>\u00a0, and\u00a0<code>airline_sentiment<\/code>\u00a0features.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>df = raw_df[['tweet_id', 'text', 'airline_sentiment']]<\/code><\/pre>\n\n\n\n<p>And now let\u2019s take a look at a few of the tweets themselves. What\u2019s the data look like?<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>df['text'][1]\n\n&gt; \"@VirginAmerica plus you've added commercials to the experience... tacky.\"\n\ndf['text'][750]\n\n&gt; \"@united you are offering us 8 rooms for 32 people #FAIL\"\n\ndf['text'][5800]\n\n&gt; \"@SouthwestAir Your #Android Wi-Fi experience is terrible! $8 is a ripoff! I can't get to @NASCAR or MRN for @DISupdates #BudweiserDuels\"<\/code><\/pre>\n\n\n\n<p>Next, we\u2019re going to conduct a few standard NLP preprocessing techniques to get our dataset ready for training.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>NLP Processing<\/strong><\/h2>\n\n\n\n<p>For the purposes of constructing NLP models, one must conduct some basic steps of text preprocessing in order to transfer text from human language to a machine readable format for further processing. Here we will cover some of the standard practices:\u00a0<em>tokenization, stopword removal, and stemming<\/em>. 
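Assembled into a single class, the pipeline might look like the minimal sketch below (each step is explained individually in the sections that follow). The regex tokenizer and tiny inline stop list here are stand-ins for the NLTK pieces, so the sketch runs without any corpus downloads, and the `full_preprocess` method name is an assumption matching the call used later in the post:

```python
import re

import pandas as pd
from nltk.stem.snowball import SnowballStemmer


class PreProcessor:
    """Tokenize, drop stopwords, and stem one text column of a DataFrame."""

    # Tiny inline stop list so this sketch needs no NLTK corpus download;
    # the post uses NLTK's full English stopword list instead.
    STOPWORDS = {'the', 'a', 'an', 'is', 'are', 'to', 'and', 'you'}

    def __init__(self, df, column_name):
        self.df = df
        self.column_name = column_name
        self.stemmer = SnowballStemmer('english')

    def full_preprocess(self):
        cleaned = []
        for sentence in self.df[self.column_name]:
            # Regex tokenizer stands in for nltk.word_tokenize here
            tokens = re.findall(r"[a-z']+", sentence.lower())
            tokens = [w for w in tokens
                      if w not in self.STOPWORDS and len(w) > 1]
            cleaned.append(' '.join(self.stemmer.stem(w) for w in tokens))
        return cleaned


df = pd.DataFrame({'text': ['The flights are delayed']})
print(PreProcessor(df, 'text').full_preprocess())  # -> ['flight delay']
```
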
You can consult\u00a0<a href=\"https:\/\/medium.com\/@datamonsters\/text-preprocessing-in-python-steps-tools-and-examples-bf025f872908\" target=\"_blank\" rel=\"noreferrer noopener\">this post<\/a>\u00a0to learn about additional text preprocessing techniques.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Tokenization<\/h3>\n\n\n\n<p>Given a character sequence and a defined document unit, tokenization is the task of chopping it up into discrete pieces called\u00a0<em>tokens<\/em>. In the process of chopping up text, tokenization also commonly involves throwing away certain characters, such as punctuation.<\/p>\n\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter\"><img decoding=\"async\" class=\"wp-image-1011\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2022\/06\/input-output.png\" alt=\"\" \/><\/figure>\n<\/div>\n\n\n\n<p>It is simple (and often useful) to think of tokens simply as words, but to fine tune your understanding of the specific terminology of NLP tokenization, the\u00a0<a href=\"https:\/\/nlp.stanford.edu\/IR-book\/html\/htmledition\/tokenization-1.html\" target=\"_blank\" rel=\"noreferrer noopener\">Stanford NLP group\u2019s overview<\/a>\u00a0is quite useful.<\/p>\n\n\n\n<p>The NLTK library has a built-in\u00a0<a href=\"https:\/\/www.nltk.org\/api\/nltk.tokenize.html\" target=\"_blank\" rel=\"noreferrer noopener\">tokenizer<\/a>\u00a0we will use to tokenize the US Airline Tweets.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>from nltk.tokenize import word_tokenize\ndef tokenize(sentence):\n    tokenized_sentence = word_tokenize(sentence)\n    return tokenized_sentence<\/code><\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">Stopword Removal<\/h3>\n\n\n\n<p>Sometimes, common words that may be of little value in determining the semantic quality of a document are excluded entirely from the vocabulary. These are called\u00a0<em>stop words<\/em>. 
A general strategy for determining a list of stop words is to sort the terms by\u00a0<em>collection frequency<\/em>\u00a0(total number of times each term appears in the document) and then to filter out the most frequent terms as a stop list \u2014 hand-filtered by semantic content.<\/p>\n\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter\"><img decoding=\"async\" class=\"wp-image-1010\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2022\/06\/stop-list.png\" alt=\"\" \/><\/figure>\n<\/div>\n\n\n\n<p>NLTK has a standard stopword list we will adopt here.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>from nltk.corpus import stopwords\nclass PreProcessor:\n    def __init__(self, df, column_name):\n        self.stopwords = set(stopwords.words('english'))\n    def remove_stopwords(self, sentence):\n        filtered_sentence = []\n        for w in sentence:\n            if ((w not in self.stopwords) and\n                (len(w) &gt; 1) and\n                (w[:2] != '\/\/') and\n                (w != 'https')):\n                filtered_sentence.append(w)\n        return filtered_sentence<\/code><\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">Stemming<\/h3>\n\n\n\n<p>For grammatical purposes, documents use different forms of a word (look, looks, looking, looked) that in many situations have very similar semantic qualities. Stemming is a rough process by which variants or related forms of a word are reduced (stemmed) to a common base form. 
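A quick check with NLTK's SnowballStemmer (which needs no corpus download) makes the behavior concrete; the example words are assumptions chosen to mirror the examples below:

```python
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer('english')

# Several surface forms collapse to the same stem...
print([stemmer.stem(w) for w in ['look', 'looks', 'looking', 'looked']])
# -> ['look', 'look', 'look', 'look']

# ...and the output need not be a dictionary word
print([stemmer.stem(w) for w in ['different', 'colors']])
# -> ['differ', 'color']
```
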
As stemming is a removal of prefixed or suffixed letters from a word, the output may or may not be a word belonging to the language corpus.\u00a0<em>Lemmatization<\/em>\u00a0is a more precise process by which words are properly reduced to the base word from which they came.<\/p>\n\n\n\n<p>Examples:<\/p>\n\n\n\n<p><strong><em>Stemming<\/em><\/strong>: car, cars, car\u2019s, cars\u2019\u00a0<em>become<\/em>\u00a0car<\/p>\n\n\n\n<p><strong><em>Lemmatization<\/em><\/strong>: am, are, is\u00a0<em>become<\/em>\u00a0be<\/p>\n\n\n\n<p><strong><em>Stemmed and Lemmatized Sentence<\/em><\/strong>: \u2018the boy\u2019s cars are different colors\u2019\u00a0<em>become<\/em>\u00a0\u2018the boy car is differ color\u2019<\/p>\n\n\n\n<p>The most common algorithm for stemming English text is Porter\u2019s algorithm.\u00a0<a href=\"http:\/\/snowball.tartarus.org\/texts\/introduction.html\" target=\"_blank\" rel=\"noreferrer noopener\">Snowball<\/a>, a language for stemming algorithms, was developed by Porter in 2001 and is the basis for the NLTK implementation of the SnowballStemmer, which we will use here.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>from nltk.stem.snowball import SnowballStemmer\nclass PreProcessor:\n\n    def __init__(self, df, column_name):\n        self.stemmer = SnowballStemmer('english')\n    def stem(self, sentence):\n        return [self.stemmer.stem(word) for word in sentence]<\/code><\/pre>\n\n\n\n<p>Code for these preprocessing steps can be found on\u00a0<a href=\"https:\/\/www.comet.com\/demo\/nlp-airline\/ed77f2a005a740b09fc50f02c326f080?experiment-tab=code\" target=\"_blank\" rel=\"noreferrer noopener\">Comet<\/a>.<\/p>\n\n\n\n<p>Next we\u2019ll create a PreProcessor object, containing methods for each of these steps, and run it on the\u00a0<code>text<\/code>\u00a0column of our data frame to tokenize, stem and remove stopwords from the tweets.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>preprocessor = PreProcessor(df, 'text')\ndf['cleaned_text'] = preprocessor.full_preprocess()<\/code><\/pre>\n\n\n\n<p>And now we\u2019ll split our data into training, validation and test sets.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>df = shuffle(df, random_state=seed)\n\n# Keep 1000 samples of the data as test set\n\ntest_set = df[:1000]\n\n# Get training and validation data\n\nX_train, X_val, y_train, y_val = train_test_split(df['cleaned_text'][1000:], df['airline_sentiment'][1000:], test_size=0.2, random_state=seed)\n\n# Get sentiment labels for test set\n\ny_test = test_set['airline_sentiment']<\/code><\/pre>\n\n\n\n<p>Now that we\u2019ve split our data into train, validation and test sets, we\u2019ll TF-IDF vectorize them.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">TF-IDF Vectorization<\/h3>\n\n\n\n<p>TFIDF, or\u00a0<em>term frequency \u2014 inverse document frequency<\/em>, is a numerical statistic that reflects how important a word is to a document in a collection or corpus. It is often used to produce weights associated with words that are useful in information retrieval and text mining. The tf-idf value of a word increases proportionally to the number of times a word appears in a document, and is offset by the number of documents in the corpus that contain that word. 
This offset helps adjust for the fact that some words appear more frequently in general (think of how stopwords like \u2018a\u2019, \u2018the\u2019, \u2018to\u2019 might have incredibly high tf-idf values if not for offsetting).<\/p>\n\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter\"><img decoding=\"async\" class=\"wp-image-1009\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2022\/06\/tf-idf-1024x672-1.jpg\" alt=\"\" \/>\n<figcaption>Source:\u00a0<a href=\"https:\/\/becominghuman.ai\/word-vectorizing-and-statistical-meaning-of-tf-idf-d45f3142be63\" target=\"_blank\" rel=\"noreferrer noopener\">https:\/\/becominghuman.ai\/word-vectorizing-and-statistical-meaning-of-tf-idf-d45f3142be63<\/a><\/figcaption>\n<\/figure>\n<\/div>\n\n\n\n<p>We will use scikit-learn\u2019s implementation of\u00a0<a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.feature_extraction.text.TfidfVectorizer.html\" target=\"_blank\" rel=\"noreferrer noopener\">TfidfVectorizer<\/a>, which converts a collection of raw documents (our Twitter dataset) into a matrix of TF-IDF features.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>vectorizer = TfidfVectorizer()\nX_train = vectorizer.fit_transform(X_train)\nX_val = vectorizer.transform(X_val)\nX_test = vectorizer.transform(test_set['cleaned_text'])<\/code><\/pre>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Training<\/strong><\/h3>\n\n\n\n<p>We are ready to start training our model. 
The first thing we\u2019ll do is create a Comet experiment object:<\/p>\n\n\n\n<p><code>experiment = Experiment(api_key='your-personal-key', project_name='nlp-airline', workspace='demo')<\/code><\/p>\n\n\n\n<p>Next, we\u2019ll build a\u00a0<a href=\"https:\/\/lightgbm.readthedocs.io\/en\/latest\/\" target=\"_blank\" rel=\"noreferrer noopener\">Light Gradient-Boosting classifier (LGBM)<\/a>, an\u00a0<a href=\"https:\/\/xgboost.readthedocs.io\/en\/latest\/\" target=\"_blank\" rel=\"noreferrer noopener\">XGBoost classifier<\/a>, and a relatively straightforward\u00a0<a href=\"https:\/\/keras.io\/models\/sequential\/\" target=\"_blank\" rel=\"noreferrer noopener\">neural network with keras<\/a>\u00a0and compare how each of these models performs. Oftentimes it\u2019s hard to tell which architecture will perform best without testing them out. Comet\u2019s project-level view makes it easy to compare how different experiments are performing and lets you easily move from model selection to model tuning.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">LGBM<\/h3>\n\n\n\n<pre class=\"wp-block-code\"><code># sklearn's Gradient Boosting Classifier (GBM)\n\ngbm = GradientBoostingClassifier(n_estimators=200, max_depth=6, random_state=seed)\n\ngbm.fit(X_train, y_train)\n\n# Check results\n\ntrain_pred = gbm.predict(X_train)\n\nval_pred = gbm.predict(X_val)\n\nval_accuracy = round(accuracy_score(y_val, val_pred), 4)\n\ntrain_accuracy = round(accuracy_score(y_train, train_pred), 4)\n\n# log to comet\n\nexperiment.log_metric('val_acc', val_accuracy)\n\nexperiment.log_metric('Accuracy', train_accuracy)<\/code><\/pre>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>XGBoost<\/strong><\/h3>\n\n\n\n<pre class=\"wp-block-code\"><code>xgb_params = {'objective' : 'multi:softmax',\n    'eval_metric' : 'mlogloss',\n    'eta' : 0.1,\n    'max_depth' : 6,\n    'num_class' : 3,\n    'lambda' : 0.8,\n    'estimators' : 200,\n    'seed' : seed\n}\n\ntarget_train = 
y_train.astype('category').cat.codes\n\ntarget_val = y_val.astype('category').cat.codes\n\n# Transform data into a matrix so that we can use XGBoost\n\nd_train = xgb.DMatrix(X_train, label = target_train)\n\nd_val = xgb.DMatrix(X_val, label = target_val)\n\n# Fit XGBoost\n\nwatchlist = [(d_train, 'train'), (d_val, 'validation')]\n\nbst = xgb.train(xgb_params, d_train, 400, watchlist,\n\nearly_stopping_rounds = 50, verbose_eval = 0)\n\n# Check results for XGBoost\n\ntrain_pred = bst.predict(d_train)\n\nval_pred = bst.predict(d_val)\n\nexperiment.log_metric('val_acc', round(accuracy_score(target_val, val_pred)*100, 4))\n\nexperiment.log_metric('Accuracy', round(accuracy_score(target_train, train_pred)*100, 4))<\/code><\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">Neural Net<\/h3>\n\n\n\n<pre class=\"wp-block-code\"><code># Generator so we can easily feed batches of data to the neural network\n\ndef batch_generator(X, y, batch_size, shuffle):\n    number_of_batches = int(np.ceil(X.shape[0] \/ batch_size))\n    counter = 0\n    sample_index = np.arange(X.shape[0])\n\n    if shuffle:\n        np.random.shuffle(sample_index)\n    while True:\n        batch_index = sample_index[batch_size*counter:batch_size*(counter+1)]\n        X_batch = X[batch_index,:].toarray()\n        y_batch = y[batch_index]\n        counter += 1\n\n        yield X_batch, y_batch\n        if (counter == number_of_batches):\n            if shuffle:\n                np.random.shuffle(sample_index)\n            counter = 0\n\n# Initialize sklearn's one-hot encoder class\n\nonehot_encoder = OneHotEncoder(sparse=False)\ninteger_encoded_train = np.array(y_train).reshape(len(y_train), 1)\nonehot_encoded_train = onehot_encoder.fit_transform(integer_encoded_train)\ninteger_encoded_val = np.array(y_val).reshape(len(y_val), 1)\n# Use transform (not fit_transform) so validation labels share the training encoding\nonehot_encoded_val = onehot_encoder.transform(integer_encoded_val)\nexperiment.add_tag('NN')\n\n# Neural network architecture\n\ninitializer = keras.initializers.he_normal(seed=seed)\nactivation = 
keras.activations.elu\noptimizer = keras.optimizers.Adam(lr=0.0002, beta_1=0.9, beta_2=0.999, epsilon=1e-8)\nes = EarlyStopping(monitor='val_acc', mode='max', verbose=1, patience=4)\n\n# Build model architecture\n\nmodel = Sequential()\nmodel.add(Dense(20, activation=activation, kernel_initializer=initializer, input_dim=X_train.shape[1]))\nmodel.add(Dropout(0.5))\nmodel.add(Dense(3, activation='softmax', kernel_initializer=initializer))\n# Three-class softmax output, so use categorical (not binary) cross-entropy\nmodel.compile(optimizer=optimizer, loss='categorical_crossentropy', metrics=['accuracy'])\n\n# Hyperparameters\n\nepochs = 15\nbatch_size = 32\n\n# Fit the model using the batch_generator\n\nhist = model.fit_generator(generator=batch_generator(X_train, onehot_encoded_train, batch_size=batch_size, shuffle=True), epochs=epochs, validation_data=(X_val, onehot_encoded_val), steps_per_epoch=int(np.ceil(X_train.shape[0] \/ batch_size)), callbacks=[es])<\/code><\/pre>\n\n\n\n<p>Comparing our models using Comet\u2019s project view, we can see that our Neural Network models are outperforming the XGBoost and LGBM experiments by a considerable margin.<\/p>\n\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter\"><img decoding=\"async\" class=\"wp-image-1008\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2022\/06\/experiment-list.jpg\" alt=\"\" \/>\n<figcaption><a href=\"https:\/\/www.comet.com\/demo\/nlp-airline\/view\/j1ZRx1zuXUmju7PBvRKlZEzlV\" target=\"_blank\" rel=\"noreferrer noopener\">Comet Experiment List View<\/a><\/figcaption>\n<\/figure>\n<\/div>\n\n\n\n<p>Let\u2019s select the neural net architecture for now and fine-tune it.\u00a0<em>Note<\/em>: since we\u2019ve stored all of our experiments \u2014 including the XGBoost and LGBM runs we\u2019re not going to use right now \u2014 if we decide we\u2019d like to revisit those architectures in the future, all we\u2019ll have to do is view those experiments in the Comet project page and we\u2019ll be able to reproduce them instantly.<\/p>\n\n\n\n<h3 
class=\"wp-block-heading\"><strong>Hyperparameter Optimization<\/strong><\/h3>\n\n\n\n<p>Now that we\u2019ve selected our architecture from an initial search of XGBoost, LGBM and a simple keras implementation of a neural network, we\u2019ll need to conduct a hyperparameter optimization to fine-tune our model. Hyperparameter optimization can be an incredibly difficult, computationally expensive, and slow process for complicated modeling tasks. Comet has built an\u00a0<a href=\"https:\/\/www.comet.com\/docs\/python-sdk\/introduction-optimizer\/\" target=\"_blank\" rel=\"noreferrer noopener\">optimization service<\/a>\u00a0that can conduct this search for you. Simply pass in the algorithm you\u2019d like to use to sweep the hyperparameter space, the hyperparameters and ranges to search, and a metric to minimize or maximize, and Comet can handle this part of your modeling process for you.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>from comet_ml import Optimizer\nimport logging\n\nconfig = {\n    \"algorithm\": \"bayes\",\n    \"parameters\": {\n        \"batch_size\": {\"type\": \"integer\", \"min\": 16, \"max\": 128},\n        \"dropout\": {\"type\": \"float\", \"min\": 0.1, \"max\": 0.5},\n        \"lr\": {\"type\": \"float\", \"min\": 0.0001, \"max\": 0.001},\n    },\n    \"spec\": {\n        \"metric\": \"loss\",\n        \"objective\": \"minimize\",\n    },\n}\n\nopt = Optimizer(config, api_key=\"&lt;HIDDEN&gt;\", project_name=\"nlp-airline\", workspace=\"demo\")\n\nfor experiment in opt.get_experiments():\n    experiment.add_tag('LR-Optimizer')\n\n    # Neural network architecture\n\n    initializer = keras.initializers.he_normal(seed=seed)\n    activation = keras.activations.elu\n    optimizer = keras.optimizers.Adam(\n         lr=experiment.get_parameter(\"lr\"),\n         beta_1=0.99,\n         beta_2=0.999,\n         epsilon=1e-8)\n\n    es = EarlyStopping(monitor='val_acc',\n                       mode='max',\n                       verbose=1,\n                       
patience=4)\n\n    batch_size = experiment.get_parameter(\"batch_size\")\n\n    # Build, compile and fit the model exactly as in the previous section\n\n    model = Sequential()\n    # ... add the same layers, compile, and fit as above ...\n    score = model.evaluate(X_val, onehot_encoded_val, verbose=0)\n    logging.info(\"Score %s\", score)<\/code><\/pre>\n\n\n\n<p>After running our optimization, it is straightforward to select the hyperparameter configuration that yielded the highest accuracy, lowest loss, or whatever metric you were seeking to optimize. Here we keep the optimization problem rather simple: we only search\u00a0<code>batch_size<\/code>,\u00a0<code>dropout<\/code>, and\u00a0<code>lr<\/code>. The parallel coordinates chart shown below, another native Comet feature, provides a useful visualization of the underlying hyperparameter space our optimizer has traversed:<\/p>\n\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter\"><img decoding=\"async\" class=\"wp-image-1007\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2022\/06\/comet-vis-dashboard.png\" alt=\"\" \/>\n<figcaption><a href=\"https:\/\/www.comet.com\/demo\/nlp-airline\/view\/j1ZRx1zuXUmju7PBvRKlZEzlV\" target=\"_blank\" rel=\"noreferrer noopener\">Comet Visualizations Dashboard<\/a><\/figcaption>\n<\/figure>\n<\/div>\n\n\n\n<p>Let\u2019s run another optimization sweep, this time including a wider range of learning rates to test.<\/p>\n\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter\"><img decoding=\"async\" class=\"wp-image-1006\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2022\/06\/comet-vis-dashboard-2.jpg\" alt=\"\" \/>\n<figcaption><a href=\"https:\/\/www.comet.com\/demo\/nlp-airline\/view\/j1ZRx1zuXUmju7PBvRKlZEzlV\" target=\"_blank\" rel=\"noreferrer noopener\">Comet Visualizations Dashboard<\/a><\/figcaption>\n<\/figure>\n<\/div>\n\n\n\n<p>And again we get a view into the regions of the underlying hyperparameter space that are yielding higher\u00a0<code>val_acc<\/code>\u00a0values.<\/p>\n\n\n\n<p>Say 
now we\u2019d like to compare the performance of two of our better models to keep fine-tuning. Simply select two experiments from your list and click the\u00a0<code>Diff<\/code>\u00a0button, and Comet will allow you to visually inspect every code and hyperparameter change, as well as side-by-side visualizations of both experiments.<\/p>\n\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter\"><img decoding=\"async\" class=\"wp-image-1005\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2022\/06\/diff-view.jpg\" alt=\"\" \/>\n<figcaption><a href=\"https:\/\/www.comet.com\/demo\/nlp-airline\/258a9e3df84346e3bb503aff758cb134\/ee2949dac5d74dc789103f03b986ff80\/compare\" target=\"_blank\" rel=\"noreferrer noopener\">Comet Experiment Diff View<\/a><\/figcaption>\n<\/figure>\n<\/div>\n\n\n\n<p>From here you can continue your model building. Fine-tune one of the models we\u2019ve pulled out of the architecture comparison and parameter optimization sweeps, or go back to the start and compare new architectures against our baseline models. All of your work is saved in your Comet project space.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Resources for Future Learning<\/strong><\/h3>\n\n\n\n<p>For additional learning resources in NLP, check out fastai\u2019s new\u00a0<a href=\"https:\/\/www.fast.ai\/2019\/07\/08\/fastai-nlp\/\" target=\"_blank\" rel=\"noreferrer noopener\">NLP course<\/a>\u00a0or this\u00a0<a href=\"https:\/\/medium.com\/huggingface\/the-best-and-most-current-of-modern-natural-language-processing-5055f409a1d1\" target=\"_blank\" rel=\"noreferrer noopener\">blog post<\/a>\u00a0published by Hugging Face that covers some of the best recent papers and trends in NLP. 
MonkeyLearn has also published a nice article covering <a href=\"https:\/\/monkeylearn.com\/sentiment-analysis\/\">sentiment analysis<\/a>.<\/p>\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n<h2 class=\"wp-block-heading\"><em>Want to stay in the loop?\u00a0<a href=\"https:\/\/info.comet.ml\/newsletter-signup\/?utm_campaign=tensorboard-integration&amp;utm_source=blog&amp;utm_medium=CTA\">Subscribe to the Comet Newsletter<\/a>\u00a0for weekly insights and perspective on the latest ML news, projects, and more.<\/em><\/h2>