Saturday, March 11, 2017

Google Next '17 Recap

I just returned from Google Next '17. It was my first time at Next, or at a Google event. I had already been looking at Google services for both professional and personal work, and the agenda promised some answers to questions I had, and probably more questions that I'd need to answer.

Some of the reactions to Google Next are interesting. This one was disappointed in the focus on enterprise customers, wanting more visionary focus and less SAP :) I get the sentiment, but the reality I live in is the one that Google wants to transform.

The one thing I came away with from Next '17 is that there is a viable on-ramp for teams to move faster by delegating a lot of the undifferentiated heavy lifting they do on premise today to managed services in the cloud. That's because of what those services are and how they are implemented. Those services are what is pulling teams to public cloud.

Managed services have always been one of the core values of public cloud, along with general purpose IaaS. In the past few years, and especially in the past year, the service diversity and capability of all three major public cloud providers have made managed services the overwhelming core value of public cloud for most companies. IaaS is becoming more and more of an implementation detail.

The graph below from this post by Subbu Allamaraju explains where the value of public cloud is for most companies. Most companies have fewer than 1000 servers, and would benefit more from getting product to customers faster (and the market share that comes with it) than from any economy of scale. That's true regardless of which public cloud services you consume. They're managed for you, so you get to spend more time building your product and less time operating it.




Services are great, but not if they lock you into a single cloud. Ideally, a team uses services that exist on another cloud provider or are open source, so that at worst they could be run on another provider's IaaS. If a service is proprietary, it needs to provide more value than a non-proprietary version would.

So services are awesome, except when they're proprietary. Sort of. I have grouped Google offerings into three sections: Open - based on OSS, Proprietary - not OSS but still attractive, and Future State. I will try to explain the value proposition I see in each one. This is not the entire list of services, but the ones that really fit the problems I've been running into lately.

Open


Google Container Engine (GKE), based on Kubernetes, went GA in August 2015, and momentum really started to pick up in 2016. Kubernetes has caught and passed Cloud Foundry as the default OSS way to think about running apps, both on prem and across all public clouds. Microsoft offers managed Kubernetes, and many teams run their own clusters on AWS.

One reason for the shift in momentum is that it is much easier to move legacy architectures to Kubernetes than it is to move them to Cloud Foundry.  While Cloud Foundry is now working with containers, it doesn't work with state - state is assumed to be outside of the cluster. This makes migrating any typical architecture over more involved.

At a minimum, moving an  app to run on Cloud Foundry would require it to be refactored to use a service broker for connections to stateful backend services. In contrast, moving a legacy application to Kubernetes would require a config change to point to the service endpoint of the stateful service.

I'm not saying it's completely plug and play -- the stateful service needs to be configured to meet required replication policies, and leverage persistent volumes -- but the hardcore 12factor requirements around statelessness are not present in Kubernetes. Because of this, migration to 12factor microservices can become much more incremental, which increases the chance of migration success.

Kubernetes, like clustered anything, is hard to configure and operate. The discussion on Kubernetes networking I attended on the last day of Next '17 reinforced that there is significant complexity that makes running Kubernetes on premise much harder than using a managed service. I'm very excited to start experimenting with Kubernetes via Google Container Engine because I don't want to trip over that complexity.

I'm also excited to play with Minikube as part of a rational Kubernetes dev stack. Minikube is a great step forward for Kubernetes developer tooling, and I think it's going to massively accelerate adoption by greatly simplifying how to develop more complex Kubernetes apps that have multiple services. Developers can accurately replicate cluster state and service access patterns with Minikube, and push changes to a deploy pipeline with more confidence.

Dataflow (based on Apache Beam) is exciting to me because of the unified processing model across batch and streaming. Prior to Beam, the cognitive overhead of the number of options for both batch and streaming was overwhelming. Being able to reason across a single system, even if that system uses different underlying processing technologies, makes it possible for teams to provide value faster because they don't have to ramp up on very diverse APIs and concepts. We can provide the same insights regardless of delivery mode.
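
To make the unified model concrete, here is a rough sketch using the Beam Python SDK (the event format, bucket paths, and the count_events helper are illustrative assumptions, not a working Dataflow job). The point is that the same transform graph applies whether the input is a bounded file or an unbounded source like Pub/Sub; only the source changes.

    # A hedged sketch of Beam's unified batch/streaming model -- illustrative only.
    import apache_beam as beam
    from apache_beam.transforms.window import FixedWindows

    def count_events(events):
        # the same windowed count works for bounded and unbounded inputs
        return (events
                | 'KeyByType' >> beam.Map(lambda line: (line.split(',')[0], 1))
                | 'Window' >> beam.WindowInto(FixedWindows(60))
                | 'Count' >> beam.CombinePerKey(sum))

    with beam.Pipeline() as p:
        # batch: a bounded file source; for streaming, swap in an unbounded
        # source (e.g. a Pub/Sub topic) and keep count_events unchanged
        events = p | 'Read' >> beam.io.ReadFromText('gs://my-bucket/events-*.csv')
        count_events(events) | 'Write' >> beam.io.WriteToText('gs://my-bucket/counts')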

Proprietary


Despite what I said about not being locked into a single cloud, there are Google-specific technologies that make me consider making exceptions to the rule. All of these are in the data processing/machine learning realm. Google, IMO, is far ahead of the other providers with respect to getting intelligence from data. In this case I think the lock-in is worth it due to the significant leverage gained.

The scale potential of Spanner is very exciting. The implementation is not a silver bullet - it forces you to reason about locality of data at the schema level - but in my opinion this is much better than the locality-ignorant schemas of traditional RDBMS systems, which force people to horizontally partition and therefore silo data in response to scale.

BigTable is very appealing because of its natural fit for time series data. A lot of the data I play with is event based, and so most insights come from aggregating events. The kinds of in-stream insights done in Dataflow would be greatly expanded if it could access state in BigTable.

The ability to make queries over BigTable with BigQuery is also really exciting. I had previously thought of BigQuery as a way to reason over object storage alone, but the ability to unify how to reason across multiple sources of data is simplifying, much in the same way that reasoning over batch vs stream processing is simplifying.

At this point it would be logical to ask if the services I'm most excited about exist on other public clouds. Lock-in needs to be weighed against immediate value. Microsoft does offer Kubernetes as an option in its Container Service, along with Mesos DC/OS and Docker Swarm. Apache Beam can be run on other IaaS and is in fact being built to use Spark as a runner.

Spanner, BigTable, and BigQuery are unique to Google, but the patterns (RDBMS, columnar storage, batch SQL across heterogeneous data sources) have OSS analogues. I don't feel as locked in as I would be building services at Amazon, using their purely proprietary services. But I am more locked in on the data processing side because Google is ahead of the scaled data processing curve when compared to the other public cloud vendors.

Future State


Beyond processing and storing data at scale, these services are the most exciting to me because they have the potential to democratize machine learning the way public cloud originally democratized self service compute, storage, and network.  One is based on open source, the other is proprietary, both are examples of how Google operates from a  future state.

Google Machine Learning Engine (Tensorflow) is intriguing. When I wrote a neural network as part of a class, I spent so much time struggling with setting it up that I lost perspective on the problem we were solving. Anything that purports to help me keep this perspective by making neural net construction (and the tuning associated with backpropagation) easier is something that gets me excited.

The Vision API - I'm assuming this is (partially) based on Tensorflow, because of the limited image recognition work I've done in the past. Just playing around with the API gives me about 20 new ideas I'd like to work into current personal projects. Now I just need a time management API...

I'm hoping to get some time to play with a lot of these in the next few weeks, and document my (mis) adventures here. It's been fun reading about the tech and doing 'hello world' applications, but I'd like to apply them to my current problem domain and see how far I get by leveraging them. 

Saturday, November 26, 2016

Changing Roles...again

So much for that New Year's resolution to post more -- it's coming up on a year since my last post. About two months ago I moved from Product Management back into Engineering. In September, I left the HPE Stackato product management team and took a senior engineering management position at Zonar Systems, a Seattle-based vehicle telematics company.

The last two years were hard, but the lessons I learned were ultimately worth the struggle.

I learned to execute better, from being very specific about what I wanted out of every meeting, to always driving discussions to closure, even if it is across several meetings and emails.

I learned how important it is as a Product Owner to define the arc of a backlog - the way product themes are implemented via epics, stories, and ultimately tasks -- and how important it is to keep the backlog well defined.

I learned how inspirational leaders can bring out the best in a team, and how uninspiring leaders can demotivate the same team.

I also learned a lot about myself. I did have some misgivings about the role, and as time went on those misgivings were shown to be true. Unfortunately I ignored those initial instincts, and by the time I started listening to them, the compromises I had made to stay in the role were significant, and I didn't know if I had compromised so much that I was 'trapped' in the role.

The one thing I really missed was thinking about the 'How'. As a Product Manager I loved thinking about the Why and the What, but to be effective in the role I had to delegate the How completely.

When I realized that being a Product Manager was not for me,  I sat back and made a list of what I wanted to do next, and after some searching, found that position at Zonar.

I've been very happy digging into the new job. Ironically, it's the things that I learned as a product manager that have been the most critical to my success in the first two months. That's great, because there was a period of time where I was really questioning the move to product management and whether it was a mistake. At this point it looks like the best possible move to have made.

I would like to say that I was that smart and forward looking, and intentional about my career path. The truth was that I was never thinking that far ahead. What I really did was follow my curiosity, which is what I've always done. I started in software  because I was interested in writing the logic behind the applications I used.  The job choices I made as a software engineer were motivated by the chance to learn new technologies and skills to solve harder and more interesting problems. That's been the story of my career and my life, and so far it has panned out well. 

Monday, January 11, 2016

Changing Roles

It has been almost 2 years since I've posted anything in this blog. In July 2014 I decided to leave Disney, and more significantly, change roles. At Disney I had been leading the engineering and analytics teams of a central data service. In my not-so-new role at HP I'm on the product management team for the Helion Development Platform, which in its current incarnation is known as HPE Helion Stackato, a Cloud Foundry based Platform as a Service.

I was (and still am!) excited to do Product Management. As an engineer I have seen how essential the consistent application of product vision and stewardship can be to an engineering effort. I did not know if I really had product vision or was just delusional, but I wanted to find out either way...

The best part about this job is that I get to think - a lot - about how to make software development better for the average enterprise engineer. The struggle is real! Most engineering teams still write IaaS-based applications as if they were running on bare metal servers, and that leads to a world of hurt, because an application distributed across IaaS-provided resources has to contend with underlying network, compute, and storage service failures.

The promise of Platform as a Service is that it enables easy development and deployment of cloud native applications - applications that take advantage of the elasticity of the cloud while dealing with the ephemerality of the underlying IaaS. Cloud native is a concept that takes most engineering organizations some time to get their head around.

As a result, there is a significant educational aspect to my role, which I love. I get to help people focus on creating value with software, something that has enthralled me since I was 14 years old teaching myself BASIC on my Dad's IBM PC/AT.

I get to present my thoughts to various captive audiences, either at customer onsite visits or conferences. Here is a presentation from HP Discover 2015 that I think captures the problem we are trying to solve and the best approaches to take to solve the problem.



In 2016 I'm really excited because of the acceleration and rapid maturity of several key technologies that have the potential to play very well together. 

Containers, mostly via Docker, have brought easy authoring and immutable infrastructure into mainstream software development. As an example, the other day, instead of hand installing Kafka and Zookeeper onto a VM and then doing it again by hand when I needed to grow my test Kafka cluster, I just typed "docker run...", pointing to a Kafka/ZK image built and published to Docker Hub by Spotify. I got to take advantage of all of their hard work, and save the potential multiple hours required to get that image working correctly.  

Kubernetes and Cloud Foundry are viable orchestration mechanisms for distributed, container based applications, handling deployment, scaling, and failure remediation -- deployment and scaling are things that we used to do by hand, late at night, on pins and needles. Dealing with failure usually meant doubling your hardware, or writing complex startup scripts customized to each application. Both approaches are quite differently opinionated, and I can see the merits of each one for different use cases, sometimes in the same overall application stack! 

Mesos has emerged as an intermediate resource management layer that abstracts the underlying IaaS away. The emergence of next generation big data workloads running on Mesos is something that really excites me as an ex service owner who struggled to justify the value of the insights provided by big data technologies against the steep costs of hardware. Having a layer that allocates finite resources and maximizes utilization across very diverse workloads makes the startup cost of new, experimental data investigations much lower, and therefore makes them much more likely to happen.

All of these technologies are evolving at an incredible rate. I'm excited to see, and hopefully play a role in delivering, the next generation of platforms that make these technologies easy to consume and manage, and allow engineering teams to focus on features instead of infrastructure. 

I'm trying to post more this year - the last 18 months have been a heads down, get it done, tactical march. That was great, but I think I miss key insights when I don't occasionally digest and reflect on what is going on around me. I hope to do more of that here over the next year. It's a New Year's resolution, hopefully one that will last longer than the one I made about not eating sugar :)



Tuesday, March 18, 2014

Making Sense of Unstructured Text In Online Reviews Part 4: Sentiment Analysis: Is More Data The Cure?

A Swing And A Miss

In my last post I trained a Bayesian classifier using a dataset pulled from sitejabber.com, which provides reviews of ecommerce sites. I had pulled that data for a single site. I then trained and tested the classifier -- and found that even though it scored 83% accuracy overall, it had completely mis-classified all positive reviews.

As noted in the last post, only 20% of my original review data was positive -- 55 records out of the total of approximately 275 records. This leads me to three questions:
  1. Did I really test and validate the data in the most effective way?
  2. As this is a bayesian classifier, will increasing the amount of positive data help the classifier identify positive data more effectively?
  3. If so, how do I go about increasing data from a finite set of data? 

Giving the classifier another chance with K fold validation

Before trying anything, I'd like to understand whether my test and validation approach could be made more deterministic. I had previously run several iterations using randomly selected test and train splits. That doesn't give me guaranteed coverage of my entire data set or a valid, reproducible process upon which I can try improvements.

I can get that coverage and reproducibility by using k fold validation across the data.

K fold validation works like this:

  1. break the dataset into K equivalent subsets (folds).
  2. hold one of the folds out for testing.
  3. train on the other k-1 folds.
  4. test on the held-out fold.
  5. rotate through all folds - repeat 2-4, holding out a different fold each time.
  6. average the accuracy of all test+train runs.
When I re-ran my tests using K-fold validation with 10 folds, I got an average accuracy across the entire dataset of 84.6%, which is different from the 83% score I had gotten doing 'randomized' tests.

In this baseline run I implicitly 'stratified' my test and training data -- all test and training data folds had the same proportion of positive and negative reviews; in this case there was roughly a 1:5 ratio between positive and negative reviews.

Getting data to be k foldable involves two steps: dividing into k folds, and building test data from one fold and training data from all of the rest. I've done these steps separately so that I can iterate through all k test and training sets with the same k folds.

This is the method I used to split an array into k folds:

def partitionArray(self, partitions, array):
        """
        @param partitions - the number of partitions to divide array into
        @param array - the array to divide
        @return an array of the partitioned array parts (array of subarrays)
        """
        nextOffset = incrOffset = len(array)/partitions
        lastOffset = 0
        partitionedArray = []

        for i in range(partitions):
            partitionedArray.append(array[lastOffset:nextOffset])
            lastOffset = nextOffset
            nextOffset += incrOffset

        # when len(array) isn't evenly divisible by partitions, the remainder
        # elements are appended to the last partition
        partitionedArray[-1].extend(array[lastOffset:])

        return partitionedArray
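
As a quick sanity check of the remainder handling (hypothetical numbers, assuming this method lives on the same analysis helper object, asd, used throughout this series):

        # 10 items into 3 folds: the remainder lands in the last fold
        folds = asd.partitionArray(3, range(10))
        # folds == [[0, 1, 2], [3, 4, 5], [6, 7, 8, 9]]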

This is the method I used to build test and train sets, holding out the partition specified by the iteration parameter. It assumes I'm handing it the k-partitioned folds for both ratings, keyed by rating (1 and 5).

 def buildKFoldValidationSets(self,folds,iteration, reviewsByRating):
        """
        build test and training sets
        @param folds - the total number of folds
        @param iteration - the offset of the fold to hold out for testing
        @param reviewsByRating - the k-partitioned folds of reviews, keyed by rating
        @return training and test arrays
        """
        
        test = []
        test.extend(reviewsByRating[1][iteration])
        test.extend(reviewsByRating[5][iteration])
        
        training = []
    
        for i in range(folds):
            if i == iteration:
                continue
            training.extend(reviewsByRating[1][i])
            training.extend(reviewsByRating[5][i])

        return training, test
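
Tying the two methods together, the driving loop looks roughly like this -- a sketch that reuses the asd/sjr helpers and the emitDefaultFeatures encoder from the earlier posts, and assumes reviewsByRating holds the (textBag, rating) tuples for each rating:

        folds = 10
        # partition each rating's reviews into k folds up front
        partitioned = {1: asd.partitionArray(folds, reviewsByRating[1]),
                       5: asd.partitionArray(folds, reviewsByRating[5])}

        accuracies = []
        for i in range(folds):
            training, test = asd.buildKFoldValidationSets(folds, i, partitioned)
            classifier = nltk.NaiveBayesClassifier.train(
                asd.encodeData(training, emitDefaultFeatures))
            accuracies.append(nltk.classify.accuracy(
                classifier, asd.encodeData(test, emitDefaultFeatures)))

        print sum(accuracies) / len(accuracies)  # average accuracy across folds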

Increasing The Data Set with Sampling

How do I increase the set of positive data if there is no more data to be used? I can take advantage of the fact that I am using a Bayesian classifier, which takes a 'bag of words' approach.  In Bayesian classification, there is no information that depends on the sentence structure of the review text or the sequence of words, just words and word frequency counts. And the features (the words) are assumed to be independent from one another.

How does that help? My theory is that mis-classification happened because there wasn't enough positive review data to help the classifier recognize positive vs negative reviews.  In order to increase the positive data set I need to generate more positive reviews.

Knowing that the Bayesian classifier doesn't care about sentence structure or word interdependence allows me to treat reviews as bags of words and nothing more. The word frequency counts in those generated bags of words just need to line up with the word frequency distribution of the existing positive reviews.

One way to do this is to build the data from the data that already exists, by taking random samples from an array that contains all the words across the set of positive reviews in the training data.

Pretend the following sentence is actually a review:
         The big big green caterpillar ate the small green leaf.

putting the words in an array that looks like this: 
        somearray = ['the','big','big','green','caterpillar','ate','the','small','green','leaf']

I can sample that array to build up another sentence. Each sampled word has a 1/10 chance of being 'leaf', and a 1/5 chance of being 'big'. I can extend the sample set to be as large as I want -- covering multiple sentences, a review, multiple reviews, etc.

In this case I'm 'sampling with replacement', meaning that I don't remove the sample I get from the sampled set, which means that the probability of picking a word does not change across samples. This is important because I want the words in any generated data to have the same probability distribution that they do in the real data, and my sample set is built from the real data.

In Python sampling with replacement looks like this (random.randint is inclusive on both ends, so the upper bound needs a -1; random.choice(somearray) would also work):

        word = somearray[random.randint(0, len(somearray) - 1)]

I use this method to create reviews comprised of words randomly selected from the distribution of positive training words, and make sure the review length is the average length of all real positive reviews:

def createReview(self,textFreqDist,reviewLength):
        """
        @param textFreqDist -  the array containing the frequency distribution of words to choose from.
        @param reviewLength -  the length of the review (in words) to build
        @return the generated review as a string
        """
        randLen = len(textFreqDist)
        reviewStr = ""
        
        for i in range(reviewLength):
            reviewStr += (textFreqDist[random.randint(0,randLen-1)] + ' ')
        


        return reviewStr
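
Putting createReview() to work, the 'boost' step looks roughly like this (a sketch; positiveTraining and negativeTraining are the held-out training tuples, and the 50% target matches the ratio described below):

        # build the sampling pool from the positive *training* reviews only,
        # so no test data leaks into the generated reviews
        positiveWords = [w for (textBag, rating) in positiveTraining for w in textBag]
        avgLength = len(positiveWords) / len(positiveTraining)

        # generate synthetic positive reviews until positives are ~50% of training data
        boosted = list(positiveTraining)
        while len(boosted) < len(negativeTraining):
            boosted.append((asd.createReview(positiveWords, avgLength).split(), 5))

        trainingData = boosted + negativeTraining
        random.shuffle(trainingData)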

A Cautionary Note on Overfitting

When I first did the positive 'boost', I was getting really good results....really, really good results.  99% accuracy on a test set was a number that seemed too good to be true. And it was. 

In my code I had not 'held out' the test data prior to growing the training data. So my training data was being seeded with words from my test data, and I was 'polluting' my training and test process. While the classifier performed incredibly well on the test set, it would have performed relatively poorly on other data when compared to a classifier trained and tested on data that has been held apart. 

When I rewrote the training and testing process, I made sure to hold out test data prior to sampling from the training set. This meant that the terms in the positive review test data did not factor into the overall training data sample set. While those words may have been present in the training data sample set, they would be counted at a lower frequency, so the test process wouldn't be biased. 

New Test Results

I ran the same 10 fold validation process over training data whose positive review set had been boosted to be 50% of the overall training set. This isn't stratified K fold validation -- by boosting the number of positive reviews via resampling of the training word data, I am altering the positive to negative ratio of the training set. Because the test data was held out of the boosting process, the ratio of positive to negative reviews in the test data remains the same. The code used to train and test the data is the same as before.

My test results averaged to 89.4%, an improvement from 84.6%. However, when I look at the errors more closely, I see that most of the errors are still due to mis-classifying positive reviews, which is interesting, given that I've boosted positive training data to be 50% of the training set. In the base training run my best effort mis-classified 60% of the positive reviews, and my worst effort mis-classified 100% of the positive reviews. In the boosted training run my best effort mis-classified 20% of the positive reviews, and my worst effort mis-classified 60% of the positive reviews.

Summary

This improvement makes sense because word frequency directly affects how the Bayesian classifier works. My 'boosting' effort worked because of the naive assumption of word independence in the classifier -- I didn't have to account for word dependencies, I only had to account for word frequency. 

If I were to do this over again, I would do the following: 
  1. If at all possible, get more data. Having only 5-10 positive reviews in the test set didn't give me a lot to work with -- it is hard to draw conclusions from such a small positive review set.
  2. k-fold validation from the beginning to get the average accuracy per approach.
  3. investigate mis-classification errors before doing any optimization! 
Most of my time was spent analyzing and building the optimal training data set. The biggest improvement came not from tweaking the algorithm, but from 'boosting' the positive training data to increase the recognition of positive reviews. The biggest mistake I made was not examining training errors immediately.

To get improvements, I couldn't treat the algorithm as a black box; I had to know enough about how it functioned to prepare the data for an optimal classification score. Note that this approach wouldn't work with an algorithm that assumed some level of dependence between the words in a review text -- I'd have to model that dependency in order to generate reviews.

A final note: this is a classifier that was trained on a single source of reviews. That's great to classify more reviews about that ecommerce site, but the classifier would probably suck tremendously on a travel review site. However, the approaches taken would work if we had travel review site training data. 

Potential next steps include:
  1. Getting (more) new data from a different source.
  2. Trying the bayes classifier on that data
  3. Trying a different classifier, e.g. the maxent classifier on the same data
  4. Going deeper into sentiment: what entities were positive / negative sentiment directed at?


Tuesday, February 4, 2014

Making Sense of Unstructured Text in Online Reviews Part 3: Trying to Improve Classifier Accuracy

This post is part of a series where I try to classify online review text in more and more concrete ways. Right now I'm training a classifier to accurately classify one (bad) vs five (good) starred reviews. In the last post I had done some initial training and testing of an NLTK Bayesian classifier. In this post I want to see if I can improve the accuracy score of my classifier by getting smarter about which features I include.

In the last post I had experimented with varying the size of the feature set, and had found that while encoding more features into a classifier during training helps accuracy, there is an eventual accuracy ceiling. My feature set came from taking the top N words from a frequency distribution of all words in the review text. Here is what the accuracy curve looks like:

One other way to improve accuracy is to address the 'quality' of the feature set by looking at features not only in terms of their frequency across the training corpus, but also in terms of their relative frequencies across classifications.

In the review classification done so far, individual words are the features. I'm going to try to 'tune' feature sets in several different ways -- I have no idea if these will work, but they seem reasonable. I'm going to call these attempts hypotheses, because my goal is to prove them to be true or false, with relatively minimal effort.

Hypothesis 1: Throw away features with a low 'frequency differential'

My hypothesis is that there are features that have a much higher chance of being in a negative review than a positive review, and vice versa. Those are the features that we want to keep. Other features are ones that have approximately the same chance of being in either type of review (positive or negative).

    P(review rating | features) ∝ P(features, review rating)

In the equation above, the P(features, review rating) term is the product of the individual P(feature, review rating) probabilities (thanks to the naive independence assumption). If I'm looking for a higher overall probability that a document is one star rather than five star, or vice versa, then having per-feature probabilities that are similar for one star and five star reviews means that my overall probabilities for one and five star will be close to equal, which could tip classification results 'the other way' and increase my error rate.

I can validate this hypothesis by filtering out those low probability differential features and keeping the ones that have a high probability differential: a high difference between P(feature, review rating) for {1 star, 5 star} ratings.

Building The Feature Set

I had trained and tested the classifier by taking input data, splitting it into a test and a training set, then training and testing. I will recreate that process now to get the raw data so that I can 'remove' the common terms with a low probability differential:

            sjr = SiteJabberReviews(pageUrl,filename)
            sjr.load()
            asd = AnalyzeSiteData()
           
I ended up recoding the building of the training and test data so that the data sets being built had a more even distribution of ratings across them:

def generateLearningSetsFromReviews(self,reviews, ratings,buckets):
        
        # check to see that percentages sum to 1
        # get collated sets of reviews by rating. 
        
        val = 0.0
        for pct in buckets.values():
            val += pct
            
        if val > 1.0:
            raise ValueError('percentage values must be floats and must sum to 1.0')
        
        reviewsByRating = defaultdict(list)
            
        for reviewSet in reviews:
            for rating in ratings:
                reviewList = [(self.textBagFromRawText(review.text), rating) 
                      for review in reviewSet.reviewsByRating[rating]]
                reviewsByRating[rating].extend(reviewList)
                random.shuffle(reviewsByRating[rating]) # mix up reviews from different reviewSets
            
        
        # break collated sets across all ratings into percentage buckets
        learningSets = defaultdict(list) 
        
        for rating in ratings:
            sz = len(reviewsByRating[rating]) 
            
            lastidx = 0
            for (bucketName, pct) in buckets.items():
                idx=lastidx + int(pct*sz)
                
                learningSets[bucketName].extend(reviewsByRating[rating][lastidx:idx])
                
                lastidx  = idx
                

        return learningSets


When I built up the training data using this method, the sets were returned in the buckets map, keyed by bucket name:

        buckets = asd.generateLearningSetsFromReviews([sjr],[1,5],{'training': 0.8,'test':0.2})


Each entry in that map is an array of (textBag, rating) tuples:
     buckets = {'training': [(bagOfText, rating), ...], 'test': [(bagOfText, rating), ...]}

I want to get frequency distributions of common terms from one and five star reviews in the training data, so that I can find terms with a high probability differential:                         

            # get common terms and frequency differentials
            # (assumes: from nltk import FreqDist; from operator import itemgetter)

            allWords1 = [w for (textBag,rating) in buckets['training'] for w in textBag if rating == 1]
            fd1 = FreqDist(allWords1)
            
            allWords5 = [w for (textBag,rating) in buckets['training'] for w in textBag if rating == 5]
            fd5 = FreqDist(allWords5)

            commonTerms = [w for w in fd1.keys() if w in fd5.keys()]

            # now get frequency differentials

            commonTermFreqs = [(w,fd1.freq(w),fd5.freq(w),abs(fd1.freq(w)-fd5.freq(w))) 
                for w in commonTerms]

            commonTermFreqs.sort(key=itemgetter(3),reverse=True)

Now we've got common terms, sorted by their absolute differential between frequency distributions in 1 and 5 star reviews.

If I plot this distribution (with matplotlib's pyplot imported as plt):

            freqdiffs = [diff for (a,b,c,diff) in commonTermFreqs]
            plt.plot(freqdiffs)
            plt.show()

I can see that it falls off sharply:

This looks like a Zipfian distribution: "given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table. Thus the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word..." 

The shape of this distribution implies that only a small subset of the terms have a frequency differential that really 'matters' for the hypothesis -- not all terms are needed. To quickly test the hypothesis, I can start by arbitrarily keeping all terms with a frequency differential > 0.001. That leaves 131 of the original 688 common terms.

Note that in getting and filtering common terms, I have not retained the terms that most strongly signal one review rating or the other: the terms that exist in only one of the two review corpora. Even though those terms are absent from one corpus -- which would ordinarily drive the Bayesian calculation to zero -- they get 'smoothed out': a very small value is added to the frequency of every term in that corpus, including the absent ones, which guarantees that there are no zero-frequency terms and the calculation won't zero out.

I need to add those strongly signaling terms to the set of terms I filter by.

The full set of filter terms is composed of both the single-corpus words and the filtered common words:

            filterTerms = [w for (w,x,y,diff) in commonTermFreqs if diff > 0.001]
            fd1Only = [w for w in fd1.keys() if w not in fd5.keys()]
            filterTerms.extend(fd1Only)
            fd5Only = [w for w in fd5.keys() if w not in fd1.keys()]
            filterTerms.extend(fd5Only)

            defaultWordSet = set(filterTerms) # rename so I don't have to rewrite the encoding method

And I use those words as features identified at encoding time:

            def emitDefaultFeatures(tokenizedText):
                '''
                @param tokenizedText: an array of text features
                @return: a feature map from that text.
                '''
                tokenizedTextSet = set(tokenizedText)
                featureSet = {}
                for text in defaultWordSet:
                    featureSet['contains:%s'%text] = text in tokenizedTextSet
                

                return featureSet

Testing The Hypothesis

Now I can train the classifier: asd.encodeData() takes care of encoding features from the training and test sets by calling emitDefaultFeatures() for each review.

            encodedTrainSet = asd.encodeData(rawTrainingSetData,emitDefaultFeatures )
            classifier = nltk.NaiveBayesClassifier.train(encodedTrainSet)
            
            encodedTestSet = asd.encodeData(rawTestSetData, emitDefaultFeatures)

            print nltk.classify.accuracy(classifier, encodedTestSet)

And I get an accuracy of 0.83, the same accuracy I got with no manipulation of the feature set, which is 0.02 less than my optimal accuracy. Whoops.

Detailed Error Analysis

There is one other step I can take to understand the accuracy of the classifier, and that is to analyze the errors made on the test set. If I know how I mis-classified the data, that can help me improve the classifier.

            shouldBeClassed1 = []
            shouldBeClassed5 = []
            
            for (textbag, rating) in buckets['test']:
                testRating = classifier.classify(emitDefaultFeatures(textbag))
                if testRating != rating:
                    if rating == 1:
                        shouldBeClassed1.append(textbag)
                    else:
                        shouldBeClassed5.append(textbag)

A quick check on the error arrays shows me that I've only made mistakes on the reviews that should be classified as positive:

          >>> print len(shouldBeClassed5)
          11

Wait a minute. That number looks familiar. Let me review the raw data again: 
          >>> print len(sjr.reviewsByRating[5])
          55
          >>> print int(0.2*len(sjr.reviewsByRating[5]))
          11

This shows that I mis-classified every positive review in the test data: the error analysis found eleven mis-classified positive reviews, and the test set only contained 11 positive reviews to begin with, based on an 80% training / 20% testing split.

A quick reversion to the original test method (which collected features from a FreqDist of all terms in the training data) shows that it mis-classified all 11 positive reviews as well.

Summary

This was one attempt to improve classifier accuracy by trying something reasonable with the feature set -- removing features whose probability differential across 1 star and 5 star review corpuses was very small.

While the numbers initially looked 'decent', deeper analysis shows that my classifier completely mis-classified positive reviews. In the future I'll do error analysis of classifiers before trying to theorize about what could make the classifier more accurate.

Looking closer at the data, the data set had 55 total  positive reviews and 273 total negative reviews.  In other words only 20% of my data was actually positive review data.

I had originally scraped only one reviewed site for data, but now I think I'm going to need to scrape more sites to get a more representative set of positive review data so that the classifier has more training examples.

In my next post I'm going to try to collect a more representative set of data, and also take a slightly different approach to validating my classifier. I'm going to do error analysis up front and attempt to correct my classifier based on the errors I see, then test the classifier against new test data -- testing a fixed classifier against the data I used to fix it would give me a false sense of accuracy, because the test data used for error analysis has in effect become training data.




Wednesday, January 22, 2014

Making Sense of Unstructured Text in Online Reviews, Part 2: Sentiment Analysis

In part 1 I spent time explaining my motivations for exploring online reviews and talked about getting the data with BeautifulSoup, then saving it with Pickle. Now that I have the raw text and the associated rating for a set of reviews, I want to see if I can leverage the text and the ratings to classify other review text. This is a bit of a detour from finding out 'why' people liked a specific site or not, but it was a very good learning process for me (that is still going on).

To do classification I'm going to stand on the shoulders of giants -- specifically the giants who wrote and maintain the NLTK package. In its own words, "NLTK -- the Natural Language Toolkit -- is a suite of open source Python modules, data sets and tutorials supporting research and development in natural language processing."

Brief Recap

I put together some code to download and save, then reload and analyze the data. I wanted to build a set of classes I could easily manipulate from the command prompt, so that  I could explore the data interactively. The source is at https://github.com/arunxarun/reviewanalysis.

To review: Here is how I would download data for all reviews from a review site:

(note I've only implemented a 'scraper' for one site, http://sitejabber.com)

    from sitejabberreviews import SiteJabberReviews
    from analyzesitedata import AnalyzeSiteData


    pageUrl = 'reviews/www.zulily.com'
    filename = "siteyreviews.pkl"
    
    sjr = SiteJabberReviews(pageUrl,filename)    

    sjr.download(True) # this saves the reviews to the file specified above

Once I've downloaded the data, I can always load it up from that file later:

    pageUrl = 'reviews/www.zulily.com'
    filename = "siteyreviews.pkl"
    
    sjr = SiteJabberReviews(pageUrl,filename)    
    sjr.load()

Next Step: Bayesian Classification

NLTK comes with several built in classifiers, including a Bayesian classifier. There are much better explanations of Bayes theory than I could possibly provide, but the basic theory as it applies to text classification is this:  the occurrence of a word across bodies of previously classified documents can be used to classify other documents as being in one of the input classifications. The existence of previously classified documents implies that the Bayesian classifier is a supervised classifier, which means it must be trained with data that has already been classified.

This is a bastardized version of Bayes' theorem as it applies to determining the probability that a review has a specific rating given the features (words) in it:

    P(review rating | features) = P(features, review rating)/P(features)

In other words, the probability that a review has a specific rating given its features is the probability of those features as previously observed in documents with that rating, divided by the probability of those features overall. Since the features are the words in the review, the denominator is the same no matter what the rating is, so that term effectively 'drops out'. So the probability that a review has a specific rating is proportional to the multiplied probabilities of the review's terms appearing in previously observed documents with the same rating.

    P(review rating | features) ∝ P(features, review rating)

This isn't completely true: there are some complexities in the details. For example, while the strongest features would be the ones that have no presence in one of the review classes, a Bayesian classifier can't work with P(feature) = 0, as that would make the above equation go to zero. To avoid that, smoothing techniques can be applied. These techniques apply a very small increment to the count of all features (including zero valued ones) so that there are no zero values, while the probability distribution essentially stays the same. The size of the increment depends on the values in the probability distribution of P(feature, label) for all features for a specific label.
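
Here's a toy illustration of that smoothing idea -- not NLTK's actual implementation (it has its own probability distribution classes), just the additive trick in miniature, with made-up words and counts:

        # additive smoothing: every vocabulary word gets a small constant added to
        # its count, so no P(word | rating) is ever exactly zero
        def smoothedProbs(wordCounts, vocabulary, alpha=0.01):
            total = sum(wordCounts.get(w, 0) for w in vocabulary) + alpha * len(vocabulary)
            return dict((w, (wordCounts.get(w, 0) + alpha) / total) for w in vocabulary)

        vocab = ['great', 'terrible', 'refund']
        fiveStarCounts = {'great': 10, 'refund': 1}   # 'terrible' never appears

        probs = smoothedProbs(fiveStarCounts, vocab)
        # probs['terrible'] is tiny but non-zero, so a review containing 'terrible'
        # can't force the five star probability all the way to zero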

Review data is awesome training data because there's lots of it, I can get it easily, and it's all been rated. I'm going to use NLTK's Bayesian classifier to help me distinguish between positive and negative reviews, training it with one star and five star review data. This is a pretty simple, binary approach to review classification.

Feature Set Generation, Training, and Testing

To train and initially test the NLTK Bayesian classifier, I need to do the following:
  1. Extract train and test data from my review data.
  2. Encode train and test data.
  3. Train the classifier with the encoded training data
  4. Test the classifier with the encoded test data.
  5. Investigate errors during the test
  6. Modify training set and repeat as needed.
I've written a helper method to generate training and test data:

def generateTestAndTrainingSetsFromReviews(self, reviews, key, trainSetPercentage):
        # generate tuples of (textbag, rating)
        reviewList = [(self.textBagFromRawText(review.text), key)
            for review in reviews.reviewsByRating[key]]

        return (reviewList[:int(trainSetPercentage*len(reviewList))],
                reviewList[int(trainSetPercentage*len(reviewList)):])

The generateTestAndTrainingSetsFromReviews() method calls textBagFromRawText(), where I create an array of words by splitting the raw text into sentences, then stripping punctuation and stop words:

 def textBagFromRawText(self,rawText):
        '''
        @param rawText: a string of whitespace delimited text, 1..n sentences
        @return: the word tokens in the text, stripped of non text chars including punctuation
        '''
        rawTextBag = []        
        sentences = re.split('[\.\(\)?!&,]',rawText)
        for sentence in sentences:
            lowered = sentence.lower()
            parts = lowered.split()
            rawTextBag.extend(parts)
         
        
        textBag = [w for w in rawTextBag if w not in stopwords.words('english')]    

        return textBag
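
For example, running a made-up review fragment through it on the AnalyzeSiteData helper (asd); the exact output depends on NLTK's English stopword list:

        >>> asd.textBagFromRawText("Great prices, but the shipping was SLOW. Never again!")
        ['great', 'prices', 'shipping', 'slow', 'never']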

I generate test and training data for one and five star reviews using  generateTestAndTrainingSetsFromReviews():

            # load helper objects
            sjr = SiteJabberReviews(pageUrl,filename)
            sjr.load()
            asd = AnalyzeSiteData()

            trainingSet1, testSet1 = asd.generateTestAndTrainingSetsFromReviews(sjr, 1, 0.8)
            
            trainingSet5, testSet5 = asd.generateTestAndTrainingSetsFromReviews(sjr, 5, 0.8)
            
            rawTrainingSetData = []
            rawTrainingSetData.extend(trainingSet1)
            rawTrainingSetData.extend(trainingSet5)
            random.shuffle(rawTrainingSetData)

            rawTestSetData = []
            rawTestSetData.extend(testSet1)

            rawTestSetData.extend(testSet5)
            random.shuffle(rawTestSetData)

With training and test data built I need to encode features with their associated ratings. For the Bayesian classifier, I need to encode the same set of features across multiple documents. The presence (or absence) of those features in each document is what helps classify the document. I'm flagging those features as True if they are in the review text and False if they are not -- which allows the classifier to build up feature frequency across the entire corpus and calculate the feature frequency per review type.

            # for raw Training Data, generate all words in the data
            
            all_words = [w for (words, condition) in rawTrainingSetData for w in words]
            fdTrainingData = FreqDist(all_words)
            
            # take an arbitrary subset of these
            defaultWordSet = fdTrainingData.keys()[:1000]
            
            def emitDefaultFeatures(tokenizedText):
                '''
                @param tokenizedText: an array of text features
                @return: a feature map from that text.
                '''
                tokenizedTextSet = set(tokenizedText)
                featureSet = {}
                for text in defaultWordSet:
                    featureSet['contains:%s'%text] = text in tokenizedTextSet
                
                return featureSet

That featureSet needs to be associated with the rating of the review, which I've already done during test set generation. The method that takes raw text to encoded feature set is here: 

      def encodeData(self,trainSet,encodingMethod):
          return [(encodingMethod(tokenizedText), rating) for (tokenizedText, rating) in trainSet]

(Aside: I love list comprehensions!) With the encoding method in place, we can encode the training data and train the classifier as follows:

       encodedTrainSet = asd.encodeData(rawTrainingSetData, emitDefaultFeatures)
       classifier = nltk.NaiveBayesClassifier.train(encodedTrainSet)

Once we have trained the classifier, we can test its accuracy against the test data. As we already know the classification of the test data, accuracy is simple to calculate.

       encodedTestSet = asd.encodeData(rawTestSetData, emitDefaultFeatures)
       print nltk.classify.accuracy(classifier, encodedTestSet)

This gives me an accuracy of 0.83, meaning 83% of the time I will be correct. That's pretty good, but I'm wondering if I can do better. I picked an arbitrary set of features (the top 1000 by frequency): what happens if I use all of the approximately 3000 words in the reviews as features?

It turns out that I get the same level of accuracy (83%) with 3000 features as I do with 1000 features. If I go the other way and shorten the feature set to use the top 100 features only, the accuracy  drops to 75%.
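
The sweep over feature set sizes is just a loop (a sketch reusing the variables above; the exact sizes are arbitrary):

            for topN in [100, 500, 1000, 2000, 3000]:
                wordSet = fdTrainingData.keys()[:topN]

                def emitFeatures(tokenizedText, words=set(wordSet)):
                    tokens = set(tokenizedText)
                    return dict(('contains:%s' % w, w in tokens) for w in words)

                classifier = nltk.NaiveBayesClassifier.train(
                    asd.encodeData(rawTrainingSetData, emitFeatures))
                print topN, nltk.classify.accuracy(
                    classifier, asd.encodeData(rawTestSetData, emitFeatures))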

Summary

The number of features obviously plays a role in accuracy, but only to a point. I wonder what happens if we start looking at removing features that could dilute accuracy. For the Bayesian classifier, those kinds of features would be ones that have close to the same probability in both good and bad reviews. I'm going to investigate whether this kind of feature grooming results in better performance, not only on the test set but on a larger set of data, in my next post.



Saturday, January 4, 2014

Making Sense of Unstructured Text in Online Reviews, Part 1

Introduction

I just returned from a meticulously researched vacation to a small fishing village an hour north of Cabo San Lucas, Mexico. The main reason for the great time we had was the amount of up front research that we put into finding the right places to stay, by researching the hell out of them via tripadvisor reviews.

After reading 100s of reviews, it occurred to me that if I were running a hotel, I would want to know why people liked me or why they didn't. I would want to be able to rank their likes and dislikes by type and magnitude, and make business decisions on whether to address them or not. I would also be interested in whether the same kinds of issues (focusing on the dislikes here) grew or abated over time.

I could say the same thing about e-commerce sites. If I were in the business of selling someone something, and they really didn't like the way the transaction went, I'd like to know what they didn't like, and whether/how many other people felt the same way, so I could respond in a way that reduces customer dissatisfaction.

One nice thing about reviews is that they come with a quantitative summary: a rating. Every paragraph in a review section of a review site maps to a rating. This is great because it allows me to pre-categorize text. It's free training data!

I've broken this effort into two+ phases: getting the data, analyzing/profiling the data, and tbd next steps. I'm very sure I need to get the data, I'm pretty sure I can take some first steps at profiling the data, and from there on out it gets hazy. I know I want to determine why people like or don't like a site, but I don't have a very clear way to get there. Consider that a warning :)

Phase 1: Getting The Data

I had been out of the screen scraping loop for a while. I had heard of BeautifulSoup, the Python web scraping utility, but I had never used it, and thought I was in for a long night of toggling between my editor and the documentation. Boy was I wrong. I had data flowing in 30 minutes. Beautiful Soup is the easy button as far as web scraping is concerned.

Here is the bulk of the logic I used to pull pagination data and then use that to navigate to the review pages from sitejabber.com (I'm focusing on ecommerce sites first):

        # (assumes: import re, urllib2; from bs4 import BeautifulSoup)
        # first get the pages we need to navigate to to get all reviews for this site.
        page = urllib2.urlopen(self.pageUrl)
        soup = BeautifulSoup(page)
        
        pageNumDiv = soup.find('div',{'class':'page_numbers'})
        
        anchors = pageNumDiv.find_all('a')
        
        urlList = []
        urlList.append(self.pageUrl)
        for anchor in anchors:
            urlList.append(self.base + anchor['href'])
        
        # with all pages set, pull each page down and extract review text and rating.
        for url in urlList:    
            page = urllib2.urlopen(url)
            soup = BeautifulSoup(page)
            divs = soup.find_all('div',id=re.compile('ReviewRow-.*'))
            
                    
            for div in divs:
                text = div.find('p',id=re.compile('ReviewText-.*')).text
                rawRating = div.find(itemprop='ratingValue')['content']

Note the need to download the page first: I used urllib2.urlopen() to get the page, then created a BeautifulSoup representation of it:

    soup = BeautifulSoup(page)

Once I had that, it was a matter of finding what I needed. I used find() and find_all() to get to the elements I needed. Any element returned is itself searchable, and has different ways to access its attributes:

    for div in divs:
                text = div.find('p',id=re.compile('ReviewText-.*')).text
               rawRating = div.find(itemprop='ratingValue')['content']

The .text attribute above retrieves the inner text from any element. Element attributes are accessed as keys on the element, like the 'content' one above. The rawRating value was actually pulled from a meta tag that was in the ReviewText div above:

<meta itemprop="ratingValue" content="1.0"/>

find()/find_all() are very powerful; a lot more detail is provided in the documentation. They can search by element ID or by specific attributes (the itemprop attribute above is an example), and regexes can be used to match multiple elements.

Crawling all of that data is fun but time consuming. I stored review text and rating data in a wrapper class, mapped by rating into a reviewsByRating map:

         for div in divs:
                text = div.find('p',id=re.compile('ReviewText-.*')).text
                rawRating = div.find(itemprop='ratingValue')['content']
                
                
                r = Review(text,rawRating)
                
                if self.reviewsByRating.has_key(r.rating):
                    reviews = self.reviewsByRating[r.rating]
                else:
                    reviews = []
                    self.reviewsByRating[r.rating] = reviews
                
                reviews.append(r) 

and flushed that map to disk using pickle:

     def saveToDisk(self):
          with open(self.filename,'w') as f:
              pickle.dump(self.reviewsByRating,f)

this let me load the data from file without having to scrape it again:

    def load(self):
          with open(self.filename,'r') as f:
              self.reviewsByRating = pickle.load(f)

Next step will be to start investigating the data.