Nick Fishman

Node.js HTTP requests with gzip/deflate compression

One of my recent projects involved scraping some web data for offline processing. I started using the excellent request library by Mikeal Rogers, which has a number of nice and convenient improvements over the default Node http library.

As I unleashed my first prototype on the web, the database started growing much faster than I had planned. I started by storing raw and uncompressed response data, so an immediate optimization was to use the Accept-Encoding HTTP request header to fetch compressed data from the server.

Unfortunately, some of my target servers sometimes sent back uncompressed data. They’re entitled to do that under the HTTP spec, but it’s slightly annoying. I needed a way to conditionally handle compressed data based on the Content-Encoding response header. I found a solution that worked with the default Node.js HTTP library, but it wasn’t immediately obvious how to port that to Mikeal’s request library.

Approach 1: no streams

My first solution collected data chunks into a Buffer, then passed that into the relevant zlib functions if needed. It’s more code than I wanted, but it works well.

Note: for simplicity, I’ve left out the logic that writes the compressed response body to the database.

https://gist.github.com/5499763
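
In case the gist is unavailable, here is a minimal sketch of the buffered approach. It is not the original gist: decodeBody is a name made up for illustration, and it assumes the whole response body has already been collected into a Buffer. With request, passing encoding: null should hand the callback a Buffer rather than a string.

```javascript
var zlib = require('zlib');

// Decompress a fully buffered response body based on its Content-Encoding
// header. decodeBody is a hypothetical helper name, not part of request.
function decodeBody(body, contentEncoding, callback) {
  var encoding = (contentEncoding || '').trim().toLowerCase();

  if (encoding === 'gzip') {
    zlib.gunzip(body, callback);
  } else if (encoding === 'deflate') {
    zlib.inflate(body, callback);
  } else {
    // The server ignored Accept-Encoding and sent the body as-is.
    process.nextTick(function() {
      callback(null, body);
    });
  }
}

// Sketch of how it would plug into request (not executed here):
//
//   request({
//     url: url,
//     headers: {'Accept-Encoding': 'gzip, deflate'},
//     encoding: null  // keep the body as a Buffer
//   }, function(err, res, body) {
//     if (err) return callback(err);
//     decodeBody(body, res.headers['content-encoding'], callback);
//   });
```

The fallback branch is what covers servers that reply without compression: the body is passed through unchanged, so the caller never has to check the header itself.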

Approach 2: streams

The downside to the first approach is that all response data is buffered in memory. This was fine for my use case, but in general this can cause memory issues if you’re scraping websites with really large response bodies.

A better approach is to use streams, as Mikeal suggested. Streams are a wonderful abstraction that can help you manage memory consumption better, among other things. There are two great introductions to Node streams here and here. Keep in mind that streams in Node.js are somewhat intricate and still evolving (for example, Node 0.10 introduced streams2, which is not entirely backwards compatible with older versions of Node).

Here’s a working solution that pipes response data into a zlib stream, then pipes that into a final destination (a file, in this case). Notice that the code is cleaner and more readable.

https://gist.github.com/5515364
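
As a rough sketch of the same idea (not the gist itself), the helper below picks a zlib stream based on the Content-Encoding header and falls back to a pass-through stream for uncompressed responses. createDecoder and the output.html filename are hypothetical names used only for illustration.

```javascript
var zlib = require('zlib');
var stream = require('stream');

// Pick a decompression stream for a given Content-Encoding header value.
// createDecoder is a hypothetical helper name used for illustration.
function createDecoder(contentEncoding) {
  var encoding = (contentEncoding || '').trim().toLowerCase();

  if (encoding === 'gzip') {
    return zlib.createGunzip();
  }
  if (encoding === 'deflate') {
    return zlib.createInflate();
  }
  // Uncompressed response: pass the bytes through untouched.
  return new stream.PassThrough();
}

// Sketch of the wiring with request (not executed here):
//
//   request({url: url, headers: {'Accept-Encoding': 'gzip, deflate'}})
//     .on('response', function(res) {
//       res.pipe(createDecoder(res.headers['content-encoding']))
//          .pipe(fs.createWriteStream('output.html'));
//     });
```

Because every branch returns a stream, the calling code is the same pipe chain regardless of what the server sent. In real code you would also want an error handler on each stream in the chain.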

Summary

Both of those approaches will get the job done with Mikeal’s library, and the one you choose depends on the use case. In my project, I needed to save the compressed response data as a field of a Mongoose document, then further process the decompressed data. Streams don’t suit this use case well, so I used the first approach.

    • #gzip
    • #deflate
    • #http
    • #tech
    • #zlib
    • #nodejs
  • 1 month ago

Plink, a collaborative HTML5 music game

tulpinspiration:

Plink is a very original idea. It’s an online, collaborative, multiplayer music making toy made by Dinahmoe. It uses Node.js and WebSockets to create a multi-user “chatroom” but instead of entering text to have a chat, the interface generates music!

Pulsating circles are generated by moving the mouse over a <canvas> element. Clicking and holding the mouse generates a musical tone. The colour of the circle determines the type of audio you play such as high or low notes, and is created using Google’s Web Audio JavaScript API. What a great combo of web technologies, and great fun too!

Check out a video of it in action: http://vimeo.com/26271666

There are some cool things going on here. On the client, they’re using the Web Audio API (available in recent versions of Chrome and Safari) to dynamically play sounds, and WebSockets to make the experience live and interactive. They’re using Node.js on the server.

It’s worth noting that the sounds all come from a pentatonic scale. This is why the music miraculously doesn’t sound cluttered or discordant, even when lots of people are playing. You’ve probably experienced something similar if you’ve ever tried playing just the black keys on a piano: no matter what order you play them, you just can’t go wrong.

Each instrument is sampled across 16 tones. Try some of them to hear what I mean:

  • http://labs.dinahmoe.com/plink/sounds/bziaou_11.ogg
  • http://labs.dinahmoe.com/plink/sounds/bziaou_12.ogg
  • http://labs.dinahmoe.com/plink/sounds/bziaou_13.ogg
  • http://labs.dinahmoe.com/plink/sounds/bziaou_14.ogg
  • http://labs.dinahmoe.com/plink/sounds/bziaou_15.ogg
  • http://labs.dinahmoe.com/plink/sounds/bziaou_16.ogg
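
To make the pentatonic idea concrete, here is a small sketch that computes 16 ascending pitches from a five-note scale. Everything specific in it is an assumption: I don't know which pentatonic mode or base pitch Plink actually uses, so the major pentatonic offsets, equal temperament, and the 220 Hz starting note are placeholders.

```javascript
// Semitone offsets of a major pentatonic scale within one octave.
// This is an assumption; Plink may use a different pentatonic mode.
var PENTATONIC_STEPS = [0, 2, 4, 7, 9];

// Return `count` frequencies (in Hz) climbing the scale from `baseHz`,
// using equal temperament: each semitone multiplies by 2^(1/12).
function pentatonicFrequencies(baseHz, count) {
  var frequencies = [];
  for (var i = 0; i < count; i++) {
    var octave = Math.floor(i / PENTATONIC_STEPS.length);
    var step = PENTATONIC_STEPS[i % PENTATONIC_STEPS.length];
    var semitones = octave * 12 + step;
    frequencies.push(baseHz * Math.pow(2, semitones / 12));
  }
  return frequencies;
}

// 16 tones starting from A3 (220 Hz, an arbitrary choice).
var tones = pentatonicFrequencies(220, 16);
```

With five notes per octave, 16 tones span exactly three octaves from the first note to the last. No two notes in the scale are a semitone apart, which is a large part of why random combinations still sound pleasant.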

On a tech note, the client code isn’t the cleanest thing in the world. It would be easier to read (and maintain) if it used Socket.IO instead of raw WebSockets, and if it used jQuery or some other JavaScript library to manipulate the DOM.

Still, a very innovative use of the Web Audio API and quite fun to play with.

    • #music
    • #nodejs
    • #socket.io
    • #tech
    • #websockets
    • #javascript
  • 1 year ago > tulpinspiration

Speeding up Mongoose queries by requesting only the fields you need

I’m currently building a startup (ampcloud) with Node.js, MongoDB, Mongoose, and a handful of other tools. After spending quite a few years in the Django world, it’s been fun doing a mental context switch into the land of JavaScript, callbacks, and closures. Occasionally I’ve run into some gotchas, and this particular one is a great example.

Let’s say you’re building a blog, and part of your database schema looks something like this:

var mongoose = require('mongoose');
var Schema = mongoose.Schema;

// Comments are embedded inside each post (see PostSchema.comments).
var CommentSchema = new Schema({
  title: {type: String},
  body: {type: String},
  createdAt: {type: Date}
});

var PostSchema = new Schema({
  author: {type: String},
  title: {type: String},
  createdAt: {type: Date},
  slug: {type: String},
  comments: [CommentSchema]
});

module.exports.Post = mongoose.model('Post', PostSchema);

Every post is stored as a separate document in MongoDB, but all comments are embedded within it. This means that when you fetch a post, you’ll get all the comments back with it.

Now let’s say you want to display a list of the 20 most recent blog posts on your home page. Assuming you’re using Express, you would write a view like:

app.get('/', function(req, res) {
  Post
    .find()
    .desc('createdAt')  // newest first
    .limit(20)
    .run(function(err, posts) {
      if (err) {
        res.render('error', {status: 500});
      } else {
        res.render('allposts', {posts: posts});
      }
    });
});

You’d also want to add an index to allow efficient querying by date created:

PostSchema.index({createdAt: 1});

Your blog will probably work well at first, but you’ll run into problems as soon as one of your amazing posts goes viral and gets thousands of comments. You’ll notice that your main page starts taking a lot longer to load. Even when you’re the only one browsing your blog, it just won’t feel as snappy anymore.

Beware: Mongoose fetches all fields by default

The culprit is the comments field. Because a Mongoose query requests all fields of a document by default, every site visitor will cause it to request and parse the entire list of comments. Every time. You don’t even need the list of comments to render the main page.

Let’s get rid of the comments field by adding the following line to the query chain:

    .exclude('comments')

The final result:

app.get('/', function(req, res) {
  Post
    .find()
    .desc('createdAt')  // newest first
    .limit(20)
    .exclude('comments')
    .run(function(err, posts) {
      if (err) {
        res.render('error', {status: 500});
      } else {
        res.render('allposts', {posts: posts});
      }
    });
});

You’ll find that this performs a lot better. The problem isn’t so much that MongoDB can’t return the data quickly enough. Rather, Node.js has to spend much of its time deserializing the extra BSON into JavaScript objects, which is both unnecessary and time-consuming.

Not surprisingly, I recently encountered this issue in production. I made the fix right at 3:00 GMT, and the load dropped dramatically.

Takeaway: think about your queries

When your models start accumulating lots of data, think about whether you can request a subset of fields when making queries. See the Mongoose query documentation for details.
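
For reference, the asc, exclude, and run helpers belong to an early Mongoose release; as far as I know, Mongoose 3 and later replaced them with sort, select, and exec. Here is a sketch of the same query in that newer style. recentPosts is a hypothetical helper name, and the model is passed in as an argument purely so the function can be exercised without a database.

```javascript
// Fetch the 20 most recent posts without their embedded comments.
// `Post` is the Mongoose model from the schema above, passed in so the
// helper stays easy to test. recentPosts is a hypothetical name.
function recentPosts(Post) {
  return Post
    .find()
    .sort({createdAt: -1})   // newest first
    .limit(20)
    .select('-comments')     // a leading "-" excludes the field
    .exec();                 // returns a promise
}

// Usage inside an Express handler (not executed here):
//
//   recentPosts(Post).then(function(posts) {
//     res.render('allposts', {posts: posts});
//   }, function(err) {
//     res.render('error', {status: 500});
//   });
```

The projection is sent to MongoDB along with the query, so the comments never leave the database server in the first place.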

Caveat: Keep in mind that you won’t gain much by excluding fields that store primitive types like Strings, Numbers, or Dates. Even worse, your code will probably get harder to read and maintain. Only make such optimizations when you have to.

Some final notes

The above schema suffers from a fundamental flaw: it doesn’t scale well. If a blog post gets thousands of comments, you’ll probably want to paginate the comments and only show several hundred at a time. But with this schema, you can’t ask MongoDB for a subset of comments. You can only get all or nothing.

To make this production ready, you’d probably want to split Comment and Post into separate Mongoose models, instead of nesting Comments within Posts as embedded documents. Each Comment would be a separate MongoDB document, you’d store the Post id within the Comment, and you could efficiently query for arbitrary subsets of comments on a particular blog post.
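
As a sketch of what paginated comments could look like with that layout, assume a hypothetical standalone Comment model whose post field stores the parent Post's id. The helper name commentsPage and the zero-based page argument are illustrative choices, and the query uses the sort/skip/limit/exec style of newer Mongoose versions rather than the older helpers shown earlier.

```javascript
// Fetch one page of comments for a post, oldest first.
// `Comment` is a hypothetical standalone Mongoose model whose `post`
// field stores the parent Post's id; commentsPage is an illustrative name.
function commentsPage(Comment, postId, page, perPage) {
  return Comment
    .find({post: postId})
    .sort({createdAt: 1})
    .skip(page * perPage)   // page is zero-based
    .limit(perPage)
    .exec();                // returns a promise
}
```

A compound index on post and createdAt would likely be needed to keep this query fast once a post has thousands of comments.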

    • #mongodb
    • #mongoose
    • #nodejs
    • #tech
    • #ampcloud
  • 1 year ago

About

I'm a software engineer and entrepreneur. I like to solve high-impact problems with technology. I'm also the CTO and co-founder of sonicpanther.

© 2013 Nick Fishman. All rights reserved.
