OK, I didn't really “hack” GH's Projects API... but this is about how you can put together server-side interactions for endpoints that don't have public APIs using PhantomJS and Node.js.

This technique has far-reaching implications for apps you need to interface with when they don't have APIs. There are a few ethics debates involved here, but that's an entirely different can of worms. Let's get to the work I did to make our apps play nice with GH's Projects app.

The Problem

I’m in the throes of ramping up the growth hacking practice here at Dialexa, and I recently became obsessed with improving the engagement rate and health of our idea backlog. I need ideas to come from fresh eyes and minds that aren’t just mine. Growth Hacker’s (GH) Projects (currently in beta) is excellent at this. Adoption of new tools at our office is just abysmal, though. We’re drowning in tools. Engagement is my biggest hurdle here, and thankfully, that is something I’m quite adept at solving as a growth hacker!

The Ideal Solution

Simply put: I need my external team members to use what they already use (Slack) and for 'Projects' to catch it. You'd typically do this with public APIs that both products would release. From there, a guy like me would come along and glue them together.

Enter Slack, the golden child of all things communication these days. I was so excited to see the “integrate slack” message pop up when I got a new notebook started in Projects. Then totally deflated when I realized it was just the chatter of tickets being created or moved. It was not the “send an idea to the growth notebook” like I needed. 

So, Projects is in beta… guaranteed no public API yet. After rooting around under the hood of the network traffic, you can see it’s a well-done self-describing API though. I can read requests and keep up with how I would send one, except for the auth. Secret / undocumented APIs are just an absolute bear to tackle, mostly because of this authentication work. I just didn’t have time for that.

The Duct Taped Actual Solution

My Hail Mary was having a server drive a browser. I’ve always toyed around with this idea for when something I want to hack on doesn’t have an API, but the more I researched how Slack was going to set up integrations and how little work I needed to do, the more I got that “just hack it in a few hours” itch I couldn’t leave unscratched. Naturally, a few hours turned into 12, but such is the way of things like this. And I enjoyed every problem-solving second of it!

When I found PhantomJS, my first thought was “Why haven’t growth hackers been ‘automating’ front-end hacks with this thing?” It’s a real unholy use of the tool if we’re all just going to be honest. This thing was meant for QA testers to be able to test front-end design and code changes. Or for folks to automate some of their testing or simple tasks for their projects. But, as a growth hacker, all I saw was this:

“A programmable browser that can hit any website and do my bidding… in JavaScript.”

I time-boxed myself to 24 hours and threw the Hail Mary pass by learning enough PhantomJS and Node.js to glue together an integration. Below are the details, step-by-step discoveries, and all the resources I read and linked to so I could take tech I’d never seen before and attack this issue.

This post should honestly be titled “The life of a Growth Hacker,” because this is pretty much how I accomplish a lot of my tasks and experiments: running into a problem that should be easy but isn’t, learning something on the fly to prove it out, and finally drafting a playbook for my Dialexa colleagues.

Hello World

I skipped straight to the first set of code I felt I could work with, something with a gentle enough learning curve for me to get dangerous. One of the first things this course taught was error handling. As you test in PhantomJS, you’re browsing a page and executing JS at various checkpoints without ever seeing a thing. It’s really important you get the error handling right. If not, you’ll just sit there watching your server do nothing while on the other end your code completely explodes without error logging. Working in Phantom feels like working blind. We're so accustomed to our console logs and inspectors / debuggers that when you're completely stripped of them, things get confusing fast. Error handling is key.

So, after customizing my example from the book, I landed on this (don’t judge my code cleanliness, remember I’m going for a 24-hour or less hack because we move fast and break things at Dialexa):

// target URL; page.open needs the scheme (the app redirects to login first)
var url = 'https://growthhackers.com/app/canvas/1234/notebook';

// load system modules
var system = require('system');
var args = system.args;

// load scraping modules and settings
var page = require('webpage').create();
page.viewportSize = {
  width: 1920,
  height: 1280,
};

// Open and do things, Hello World!
page.open(url, function(status) {
  // log message with status
  console.log('--== Page Loaded:' + status + ' ==--');

  // Snap a picture to see what the JS sees
  page.render('image.png');
});

// ERROR HANDLING!
phantom.onError = function(msg, trace) {
  // build up error message stack
  var msgStack = ['PHANTOM ERROR:' + msg];

  // if a stack trace is available push formatted messages to msgStack

  if (trace && trace.length) {
    msgStack.push('TRACE:');
    trace.forEach(function(t) {
      msgStack.push(' -> ' + (t.file || t.sourceURL) + ': ' + t.line + (t.function ? ' (in function ' + t.function + ')' : ''));
    });
  }

  // log stackTrace
  console.error(msgStack.join('\n'));

  // exit PhantomJS with error code
  phantom.exit(1);
};

Let’s break this script down into (giant) understandable chunks:

  1. Tell Phantom all the things it needs to run and any config type variables early on. Since I’m always going to post Slack messages to a single board, I put the board URL in directly so that it redirects me to log in and I end up landing on the idea board.
  2. Use the page object to open a URL and react to its success or failure.
  3. Handle errors.

And you execute your scripts in PhantomJS with this in the terminal (I’m coding on a Unix-based MacBook Pro):

$ phantomjs Hello-world.js

Immediately, it fails. Since my results didn’t match the teacher’s, I took to Google and found this article: PhantomJS failing to open HTTPS site.

This Stack Overflow article details a couple of flags you need to set because of some new things happening under the hood dealing with SSL protocols (oh, ye gods of Stack Overflow, how you have saved my bacon so many times).

Simply use these switches: --ignore-ssl-errors=true and --ssl-protocol=tlsv1 before any other parameters. Our PhantomJS terminal command becomes this:

$ phantomjs --ignore-ssl-errors=true --ssl-protocol=tlsv1 Hello-world.js

And now we’ll see our console logs over in terminal, and our image outputs the login screen. Huzzah! The next step / giant leap in the tutorials is poking the buttons.
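
The same ordering rule matters once a Node.js process launches the script instead of a human at a terminal: switches first, then the script path, then anything meant for the script itself. Here is a minimal sketch of that, assuming a hypothetical helper called buildPhantomArgs and a phantomjs binary on the PATH. Neither is part of the original hack.

```javascript
// Hypothetical helper: builds the argument list for the phantomjs binary.
// PhantomJS switches must come before the script path; anything after the
// script path is handed to the script through system.args.
function buildPhantomArgs(scriptPath, scriptArgs) {
  var switches = ['--ignore-ssl-errors=true', '--ssl-protocol=tlsv1'];
  return switches.concat([scriptPath], scriptArgs || []);
}

// Usage from Node.js (assumes `phantomjs` is installed and on the PATH):
// var execFile = require('child_process').execFile;
// execFile('phantomjs', buildPhantomArgs('Hello-world.js', ['my idea']),
//   function(err, stdout, stderr) {
//     console.log(stdout);
//   });
```

Keeping the switches in one place means nobody has to remember them when the command gets retyped or moved into a server later.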

Your JavaScript is my JavaScript now!

Up next, do a couple of things to our page.open:

// Open and do things, Hello World!
page.open(url, function(status) {
  // log message with status
  console.log('--== Page Loaded:' + status + ' ==--');

  // Snap a picture to see what the JS sees
  page.render('image.png');

  page.evaluate(function() {
    $('#user_email').val('ken@dialexa.com');
    $('#user_password').val('this_isnt_my_password');
    $('button').click();
  });
});

The page.evaluate call is where the meat of the work gets done. This is where we can take advantage of whatever libraries are available (or even insert our own) and start slinging out JavaScript that affects or reads the page. This is the part that had me geeked out the most about PhantomJS, and as you can see here, I’m setting form values and clicking buttons. Namely, the login button.

Now, we’ll be triggering more than just the initial page.open network traffic, and so we need to implement some logs to actually be able to see everything going on. I hooked into a few of the events and created logs that were very noisy to begin with while I learned what was going to happen. Up until this point, I hadn’t dug into what our dear old GH Projects app did to make a login happen and redirect to the page.

To be brief, I hooked into this to get the most bang for my buck:

page.onUrlChanged = function(newURL) {
  console.log('URL Changed: ' + newURL);
};

I also snapped a screenshot anytime we fully finished loading a new URL:

var screenshotCounter = 0;
page.onLoadFinished = function(status) {
  screenshotCounter++;
  console.log('--== Page Load Finished | Snapping Pic ==--');
  page.render('image' + screenshotCounter + '.png');
};
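
Two more hooks help with the "working blind" problem: console output and script errors from inside the page never reach the terminal unless you relay them. The sketch below wires up page.onConsoleMessage, page.onError, and page.onResourceError, which are standard PhantomJS page callbacks. The wrapper name attachDebugLogging and the log prefixes are made up for illustration.

```javascript
// Hypothetical helper: wires a PhantomJS page object to a log function.
// `page` is the object returned by require('webpage').create().
function attachDebugLogging(page, log) {
  // console.log calls made inside page.evaluate run in the page, not in
  // the Phantom script, so they only show up if we relay them.
  page.onConsoleMessage = function(msg) {
    log('PAGE CONSOLE: ' + msg);
  };

  // JavaScript errors thrown by the page's own scripts.
  page.onError = function(msg, trace) {
    log('PAGE ERROR: ' + msg);
  };

  // Resources that failed to load (SSL trouble, missing scripts, and so on).
  page.onResourceError = function(resourceError) {
    log('RESOURCE ERROR: ' + resourceError.url + ' (' + resourceError.errorString + ')');
  };

  return page;
}

// Usage inside a PhantomJS script:
// attachDebugLogging(page, function(line) { console.log(line); });
```

Passing the log function in keeps the helper quiet or noisy depending on how much you want to see while you learn what the target site does.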

Truth time: I was hoping that it’d just take me to the page of ideas where I could scrape data as well as submit one. What happened was a bit of a train wreck and presented about 4 hours of hurdle jumping. The console turned into a barrage of resource request logs as it redirected a good deal and then ultimately settled in on the first hurdle, which was the most challenging.

Lace Up Those Nikes: Projects Seriously Doesn’t Want PhantomJS to Work

Hurdle #1: Projects is “Only available for Chrome.”

I’ve honestly got to laugh about this one because the only browser I typically use every single day is Chrome, so the user-agent sniffing caught me by surprise. And I only figured out that this was where my script bottomed out because of the screenshot snippet above.

“No problem,” I thought, “This is PhantomJS, surely the geniuses that made this knew we’d need to fake out browsers at some point.” So I used http://www.esolutions.se/whatsmyinfo to grab a legit-looking User Agent for Chrome and then got to Googling. It’s readily available, so I had a ton of resources to look through. Here’s the result of my first and last Google search on it:

http://phantomjs.org/api/webpage/property/settings.html

The first link in the Google search is the documentation on PhantomJS and points right at the Settings object for Phantom. In fact, even better, they show in their example how to override it.

page.settings.userAgent = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.120 Safari/537.36';

Stick this right below our page setup at the top and we’re cooking with Crisco again. Firing up this change though led to the same results.

Hurdle #2: Projects is still kicking me out, despite me looking just like Chrome

GH’s Projects app kicked me right back into /support/chrome_only.html

At this point, it was time to start digging through source code. I lucked out, and it only took me about an hour to track it down using Chrome’s inspector and hunting through each page that was showing in my network logs. It was painful when I found it because I was being let into the page I wanted to get to, but immediately being kicked out with ruthless efficiency. I mean, there are meta tags in the head, and then there’s this script. So close, but so far away!

Here’s the offender, and I found it by going to each file in the “Sources” tab of Chrome’s inspector and doing a Cmd-F search for “/support/chrome_only.html”. I did waste a ton of time unpacking Angular files and trying to read through those, but ultimately I got a hit on the search term, and I saw this glorious “1 match” when I clicked over.

In a script block at the top, we find the gatekeeper:

(function() {
  var isChrome,
    exceptions = ['/app/s/'];

  for (var i = 0; i < exceptions.length; i++) {
    if (location.href.match(exceptions[i])) {
      return;
    }
  }

  isChrome = /Chrome/.test(navigator.userAgent) && /Google Inc/.test(navigator.vendor);
  if (device.mobile() || device.tablet()) {
    location = "/support/desktops_only.html";
  } else {
    if (!isChrome) {
      location = "/support/chrome_only.html";
    }
  }
})();

The “isChrome” bit there checks my navigator’s userAgent property, which should have been passing no problem. That vendor test, though… I hadn’t expected that. No problem, we’ve got the settings object! But alas, it just wasn’t going to go down without a fight, because you can’t set the vendor there.

After adding some console logs to the in-page scripts using evaluate, I saw that I was coming across as Chrome in the user agent but showing up as Apple when I checked the vendor I was sending.
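
To see why the user agent override alone wasn't enough, it helps to pull the gatekeeper's test out into a plain function and feed it both combinations. This is only a sketch: the function name passesChromeCheck is made up, and the exact vendor string "Apple Computer, Inc." is an assumption about what WebKit reports (the logs above only showed that it was Apple).

```javascript
// The browser test from the gatekeeper script, pulled out as a pure function.
function passesChromeCheck(nav) {
  return /Chrome/.test(nav.userAgent) && /Google Inc/.test(nav.vendor);
}

// Illustrative values: a spoofed Chrome user agent, paired first with an
// assumed WebKit vendor string and then with the one Chrome reports.
var chromeUa = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 ' +
  '(KHTML, like Gecko) Chrome/37.0.2062.120 Safari/537.36';
var spoofedUaOnly = { userAgent: chromeUa, vendor: 'Apple Computer, Inc.' };
var spoofedBoth = { userAgent: chromeUa, vendor: 'Google Inc.' };

console.log(passesChromeCheck(spoofedUaOnly)); // false
console.log(passesChromeCheck(spoofedBoth));   // true
```

Both halves of the && have to pass, so the vendor has to be dealt with before the page's own scripts get a chance to run.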

Some Googling and head scratching later, I came across this SAINT of a person who had posted code solving this exact need on the fly. Here’s his solutions repo: https://github.com/JamesMGreene/phantomjs-solutions

And we’re looking at change-client-userAgent.js, particularly this bit here:

page.onInitialized = function() {
  page.evaluate(function() {
    var newNavigator = Object.create(window.navigator);
    newNavigator.userAgent = "blah";
    window.navigator = newNavigator;
  });
};

Previously, I hadn’t even paid attention to this event, but oh man it’s powerful. According to the documentation:

“This callback is invoked after the web page is created but before a URL is loaded. The callback may be used to change global objects.”

I’m not entirely sure when this happens exactly, but according to James Greene’s scripts, it happens before any script tags have the chance to kick me out. And it worked. I adjusted his script like so:

page.onInitialized = function() {
  page.evaluate(function() {
    var newNavigator = Object.create(window.navigator);
    newNavigator.vendor = "Google Inc";
    newNavigator.userAgent = "Chrome";
    window.navigator = newNavigator;
  });
};

For this scenario, I didn’t even bother trying to figure out what Google typically reports, because I’ve got the gatekeeper’s code here and can just see what it wants to let me through.
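
If you'd rather send realistic values than the bare minimum, the same trick generalizes. The sketch below is a hypothetical variant (spoofNavigator is not from James Greene's repo): it takes the overrides as an argument, which works because page.evaluate accepts extra JSON-serializable arguments after the function. It uses Object.defineProperty rather than plain assignment, on the assumption that some engines expose navigator fields as read-only inherited properties.

```javascript
// Runs inside the page: swaps window.navigator for a copy whose listed
// properties are overridden. Everything else falls through to the real
// navigator via the prototype chain.
function spoofNavigator(overrides) {
  var newNavigator = Object.create(window.navigator);
  Object.keys(overrides).forEach(function(key) {
    Object.defineProperty(newNavigator, key, {
      value: overrides[key],
      enumerable: true
    });
  });
  window.navigator = newNavigator;
  return newNavigator;
}

// Usage inside a PhantomJS script. The function is self-contained, so it
// survives being serialized and shipped into the page:
// page.onInitialized = function() {
//   page.evaluate(spoofNavigator, {
//     vendor: 'Google Inc.',
//     userAgent: page.settings.userAgent
//   });
// };
```

Whether swapping window.navigator is allowed at all depends on the browser engine; it worked in PhantomJS here, and that's the only claim being made.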

Success! It’s a very flimsy success, but since we’re essentially driving a website to get at an “API” early, I’m OK with a little flimsy today.

Hurdle #3: A Metric Ton of JS / Angular Errors

At this point, I fully realized exactly why they’re keeping me out. I never really thought that GH’s programmers were trying to keep my bot specifically from getting in, but I hadn’t tried to think through why they were blocking out browsers. It becomes apparent quickly though when a browser isn’t supported with all the fancy stuff they’re doing in Angular because the entire page just falls flat on its face. You end up with headings and page structure, lots of JS errors about Modules not being able to be loaded, and an empty page.

This led to about 30 minutes of learning a bit of Angular (I’m an Ember man myself), unpacking / un-minifying scripts to better understand the failures, and then just asking myself this important question:

“Do I even need the page to fully load? What JS do I have at my disposal now that I’m into the page with a fully authenticated Cookie?” 

An aside: I also couldn’t help but wonder how badly I was screwing up their analytics because Phantom does indeed load tracking pixels and scripts like Optimizely. In fact, you can see a lot of their variables setup to A/B test us in the cookie values. This thought entertained me as someone who has crawled through data points during late nights and been mystified by results.

So, we’re in. How do we execute some jQuery at the right time? The page.onLoadFinished only takes a “status” parameter, and has no notion of what page you’re actually on. For all intents and purposes, PhantomJS is just trudging on through redirects and follows.

At this point, I really started to understand that this tool was legitimately created for testing. Here’s the issue:

I have events that give me URLs. I have events that let me know when I can execute code. I do not have this data mashed together that would make it convenient to know when I’ve arrived on the right page with the right cookie. Enter the ultra kludgey hack, tied together with duct tape and bubblegum:

// Track the last URL our events told us they saw us get changed to
var currentURL;
page.onUrlChanged = function(newURL) {
  currentURL = newURL;
};

// Check what our current url is, and if it's the notebook page, execute some jQuery
page.onLoadFinished = function(status) {
  if (currentURL === 'https://beta-projects.growthhackers.com/app/canvas/2201/notebook') {
    var jqueryVersion = page.evaluate(function() {
      return $.fn.jquery;
    });

    console.log(jqueryVersion);
  }
};
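
The duct tape can be wrapped into one small helper so the URL tracking and the load check live together. This is a sketch under a couple of assumptions: the name onArrival is invented, the status check relies on PhantomJS passing 'success' or 'fail' to onLoadFinished, and the one-shot guard is an addition so a second load of the same page can't trigger the work twice.

```javascript
// Hypothetical helper: calls `callback` once, the first time a load
// finishes successfully while the page is sitting on `targetUrl`.
function onArrival(page, targetUrl, callback) {
  var currentURL = null;
  var fired = false;

  // Remember the last URL the page told us about.
  page.onUrlChanged = function(newURL) {
    currentURL = newURL;
  };

  // Only act when the load succeeded and we're on the page we want.
  page.onLoadFinished = function(status) {
    if (fired || status !== 'success' || currentURL !== targetUrl) {
      return;
    }
    fired = true; // guard against a reload submitting the idea twice
    callback(page);
  };
}

// Usage inside a PhantomJS script:
// onArrival(page, 'https://beta-projects.growthhackers.com/app/canvas/2201/notebook',
//   function(page) {
//     console.log(page.evaluate(function() { return $.fn.jquery; }));
//   });
```

Note that it replaces any onUrlChanged and onLoadFinished handlers already set on the page, just like the snippet above does.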

We do in fact come back with jQuery. Next up, let’s do some real work that we’ve been trying to do the entire time: Submit an Idea!

By going to the website, preserving the network log, and creating a quick entry of an idea, we can see exactly how the underlying API wants data to get to it. Here’s the inspector’s results and the important bits:

General Request Info:
Request URL: https://beta-projects.growthhackers.com/canvas/2201/quick
Request Method: POST

Request Payload:
{
  "experiment": {
    "name": "test 2",
    "category_id": null
  }
}

This was, far and away, the easiest part of this whole “hack”.

Our onLoadFinished work now becomes:

// Check what our current url is, and if it's the notebook page, execute some jQuery
page.onLoadFinished = function(status) {
  if (currentURL === 'https://beta-projects.growthhackers.com/app/canvas/2201/notebook') {

    var response = page.evaluate(function() {
      $.ajax({
        url: '/canvas/2201/quick',
        data: {
          experiment: {
            name: 'this is a test from a bot',
            category_id: null
          }
        },
        method: "POST",
        dataType: "json"
      });
    });
  }
};

Running this script in the terminal, I actually do see over in my Projects notebook a new idea show up. I feel like Dr. Frankenstein, stitching together some crazy wild beast of a machine and watching its heart beat for the first time!
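
One detail worth knowing: passing an object as data makes jQuery send it form-encoded, while the inspector showed the app itself sending a JSON body. The form-encoded version worked here, so this is optional, but if the endpoint ever gets picky, a request that mirrors the inspector more closely might look like the sketch below. The helper name buildIdeaRequest is made up, and whether the server requires JSON is unknown.

```javascript
// Hypothetical helper: builds $.ajax options that send the idea as a JSON
// body, mirroring the request payload seen in the network inspector.
function buildIdeaRequest(canvasId, ideaName) {
  return {
    url: '/canvas/' + canvasId + '/quick',
    type: 'POST', // `type` works on older jQuery too; `method` needs 1.9+
    contentType: 'application/json',
    dataType: 'json',
    data: JSON.stringify({
      experiment: { name: ideaName, category_id: null }
    })
  };
}

// Usage inside a PhantomJS script. The options object is plain JSON, so it
// can be passed into the page as an extra page.evaluate argument:
// page.evaluate(function(options) {
//   $.ajax(options);
// }, buildIdeaRequest(2201, 'this is a test from a bot'));
```

Building the options outside the page also gives you one obvious place to swap in the real idea text later.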

Hurdle #4: Are we there yet?

Ok, now we need to kill this Phantom instance using phantom.exit(); but if we do it too soon, the AJAX won’t finish out. So… how do we know when a page.evaluate’s work has finished?

A couple of Googles later and I came across this post: Does Phantom.js capture all AJAX?

Linked in this answer is the beautifully simple, brutally obvious “WaitFor.js” I’d seen mentioned in other Stack Overflow posts. Here’s the bit that I grabbed and added to my crawler script:

function waitFor(testFx, onReady, timeOutMillis) {
  var maxtimeOutMillis = timeOutMillis ? timeOutMillis : 3000, // Default Max Timeout is 3s
    start = new Date().getTime(),
    condition = false,
    interval = setInterval(function() {
      if ((new Date().getTime() - start < maxtimeOutMillis) && !condition) {
        // If not time-out yet and condition not yet fulfilled
        condition = (typeof(testFx) === "string" ? eval(testFx) : testFx());
        //^ defensive code
      } else {
        if (!condition) {
          // If condition still not fulfilled (timeout but condition is 'false')
          console.log("'waitFor()' timeout");
          phantom.exit(1);
        } else {
          // Condition fulfilled (timeout and/or condition is 'true')
          console.log("'waitFor()' finished in " + (new Date().getTime() - start) + "ms.");
          typeof(onReady) === "string" ? eval(onReady): onReady(); // Do what it's supposed to do once the condition is fulfilled
          clearInterval(interval); // Stop this interval
        }
      }
    }, 250); // repeat check every 250ms
};

The basic idea is that you have something that you can test on the page, and you just wait and keep checking for it. So, I decided just to add a .done to my ajax call and append a div that was unique to just my app.

It reads like this: if you don’t see this div, then my AJAX isn’t done, and you shouldn’t exit Phantom yet. As soon as you do see it, I’m all done, so fire away.
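
The polling idea itself is tiny, and it can be written without hard-coding phantom.exit into the loop. The sketch below is a hypothetical variant, not the WaitFor.js from the PhantomJS examples: it reports a timeout through a callback, fires onReady on the same tick the test passes, and takes an optional timers argument that exists only so the loop can be driven by hand.

```javascript
// Hypothetical variant of waitFor: polls `test` every 250ms until it returns
// truthy or `timeoutMs` elapses, then calls exactly one of onReady/onTimeout.
function pollUntil(test, onReady, onTimeout, timeoutMs, timers) {
  var set = (timers && timers.setInterval) || setInterval;
  var clear = (timers && timers.clearInterval) || clearInterval;
  var now = (timers && timers.now) || Date.now;

  var start = now();
  var handle = set(function() {
    if (test()) {
      clear(handle); // stop polling before handing off
      onReady();
    } else if (now() - start >= timeoutMs) {
      clear(handle);
      onTimeout();
    }
  }, 250);
  return handle;
}

// Usage inside a PhantomJS script:
// pollUntil(function() {
//   return page.evaluate(function() { return $('#kenIsDone').length > 0; });
// }, function() {
//   phantom.exit();
// }, function() {
//   console.log('gave up waiting');
//   phantom.exit(1);
// }, 3000);
```

Three seconds is the same default WaitFor.js uses; a slow AJAX call may need more.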

I implemented it like this, just after the AJAX evaluate bit. Here’s the full onLoadFinished event that posts the idea, including the waitFor call.

// Check what our current url is, and if it's the notebook page, execute some jQuery
page.onLoadFinished = function(status) {
  if (currentURL === 'https://beta-projects.growthhackers.com/app/canvas/2201/notebook') {

    var response = page.evaluate(function() {
      $.ajax({
        url: '/canvas/2201/quick',
        data: {
          experiment: {
            name: 'this is a test from a bot',
            category_id: null
          }
        },
        method: "POST",
        dataType: "json"
      }).done(function() {
        // When the ajax function is done, add this div to the body so waitFor has something
        // to watch for.
        $('body').append('<div id="kenIsDone">Finished!</div>');
      });
    });

    waitFor(function() {
      // Look for our div and return it up the chain to the first param for waitFor as truthy
      return page.evaluate(function() {
        return $('#kenIsDone').is(':visible');
      });
      // If the first param was truthy, do this down here
    }, function() {
      console.log('Saw KenIsDone -----------------------');
      phantom.exit();
    });
  }
};

We got it working! This will successfully post the static idea string we’re using for testing, and then quit itself once it’s done. We’re a very trivial Node.js server away from passing the script a param we get from working with the Slack API and firing off responses to Slack’s API about our progress as we work. That's for the next post!
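
As a small preview of the "passing the script a param" part: the script already loads system.args, and the first entry there is the script name, with any command-line arguments after it. A minimal sketch of reading the idea text, assuming a hypothetical helper called ideaFromArgs (the next post may well do this differently):

```javascript
// Hypothetical helper: pulls the idea text out of PhantomJS's system.args.
// args[0] is the script name; everything after it is joined so an unquoted
// multi-word idea still arrives as one string.
function ideaFromArgs(args) {
  var idea = Array.prototype.slice.call(args, 1).join(' ').trim();
  return idea.length > 0 ? idea : null;
}

// Usage inside a PhantomJS script:
// var idea = ideaFromArgs(require('system').args);
// if (idea === null) {
//   console.error('No idea text given');
//   phantom.exit(1);
// }
```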

Ok, let’s recap!

  1. Using PhantomJS we created a script to type in the form fields and click the log in button from a server.
  2. Once logged in, and immediately on page initialization, before any in-page scripts can run, we blow away the navigator object, create a new one, and spoof that we’re Chrome and our Vendor (the problem property for us) is Google Inc.
  3. This lets us through the Browser Gatekeeper, and into a very broken page. We don’t care though because we’ve got an auth cookie and access to jQuery inside that page.
  4. We run an AJAX post to submit a new idea and when the AJAX request finishes we append a div to the page that says we finished.
  5. We kill the service only when we see the div appended to the page, enabling us to wait for non-traditional monitoring feedback to come through outside of our Event Emitter loops. 

Alright! Next post, we’ll show the much easier portions of this:

  1. Write a Node.js server to handle requests from Slack’s API.
  2. Pass along the idea submitted by Slack to the Script we just wrote.
  3. Modify the script to replace the string we’re testing within the AJAX request.
  4. Radio out progress from the script to the Node.js server to send to Slack, so the end-user knows things are happening and gets feedback.
