Algorithms to Live By

A few weeks ago I started reading Algorithms to Live By: The Computer Science of Human Decisions and have been fascinated by it ever since:

What should we do, or leave undone, in a day or a lifetime? How much messiness should we accept? What balance of the new and familiar is the most fulfilling? These may seem like uniquely human quandaries, but they are not. Computers, like us, confront limited space and time, so computer scientists have been grappling with similar problems for decades. And the solutions they’ve found have much to teach us.

In a dazzlingly interdisciplinary work, Brian Christian and Tom Griffiths show how algorithms developed for computers also untangle very human questions. They explain how to have better hunches and when to leave things to chance, how to deal with overwhelming choices and how best to connect with others. From finding a spouse to finding a parking spot, from organizing one’s inbox to peering into the future, Algorithms to Live By transforms the wisdom of computer science into strategies for human living.

I'll try to give you a little taste of the book - which I'm still reading (more precisely, listening to) - so you'll know what you're getting into.

<!-- more -->

Anyway, Chapter II talks about the "when to stop" (Optimal Stopping) dilemma:

  • Should I keep searching for an apartment?
  • Should I keep dating, or settle down with my current partner?
  • Should I keep searching for a better hire, or take the current candidate?
  • Should I keep searching for a better parking spot?
  • and more ...

These problems aren't theoretical at all, and I kept "inventing" algorithms to solve them. I never really thought that someone, somewhere, spent considerable time looking into these problems and finding the best algorithms to tackle them.
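Take the hiring flavor of the problem. The book's famous answer is the 37% rule: look at roughly the first 37% (1/e) of the candidates without committing, remember the best of them, then take the first candidate who beats that benchmark. Here's a minimal simulation of the rule - the pool size and the random scores below are made up purely for illustration:

import random

def best_via_37_rule(candidates):
    """Reject the first ~37% of candidates, then take the first one
    that beats everyone seen so far (or the last one if nobody does)."""
    cutoff = int(len(candidates) / 2.718281828)  # n / e
    benchmark = max(candidates[:cutoff], default=float("-inf"))
    for score in candidates[cutoff:]:
        if score > benchmark:
            return score
    return candidates[-1]

# crude experiment: how often does the rule land the single best candidate?
trials, hits = 10000, 0
for _ in range(trials):
    pool = [random.random() for _ in range(100)]
    if best_via_37_rule(pool) == max(pool):
        hits += 1

print("picked the best candidate in {:.0%} of trials".format(hits / trials))  # ~37%

Even with this crude setup, the rule finds the single best candidate about 37% of the time, which is provably the best you can do when you can't go back to candidates you've already passed on.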

Each one of the 13 chapters (except the intro) uncovers a new algorithm, walks through day-to-day problems and explains how one can apply it to tackle them in a fascinating, optimal way.

The book talks about the following subjects:

By the way, Audible offers a 30-day trial which you can use to buy this book. I was a bit concerned that it would be hard to narrate properly, but after spending many hours listening to it, I can honestly say that the book is well narrated and very fun to listen to! Are you listening to a good tech book too? Sharing is caring :)

Making Hexo Blazing Fast

A week ago I migrated my blog from Ghost to Hexo to gain better performance and save money.

Hexo is said to be "Blazing Fast", but while I did "feel" that my Hexo based site was snappier than its predecessor, it was far from "Blazing Fast".

Performance is extremely important. There are a lot of articles on the subject, most of which point out that website performance & uptime are key to user satisfaction. WebpageFX wrote a nice summary of the subject - Why Website Speed is Important.

I'm not a web developer, and have almost zero knowledge of website optimization. Nonetheless, I've optimized more than a few apps in my career and know how to approach such problems.

All I need is to figure out the tooling, find the bottlenecks and fix them to gain good enough performance. That is, I'm not looking to optimize every single piece of the website, only to make it fast enough that it feels snappy.

This blog post explains the steps I took in order to dramatically decrease the average page size to less than 350KB.

<!-- more -->

Benchmarks

First of all, I needed to figure out what to test. I've had Google Analytics set up for as long as I can remember, so my first step was figuring out what my users are doing. My conclusion was that most users find my blog either by organically searching for a specific topic, or by finding the content on social networks. In either case, they land directly on one of my posts.

I saw a lot of traffic to my main page as well, and after some digging I figured out that most main page visits came from users who had first landed on one of my posts, rather than from direct access.

Honestly, none of the above was a surprise, but a good performance investigation is always based on real data.

Chrome DevTools

The Chrome Developer Tools are a set of web authoring and debugging tools built into Google Chrome. I've used them before so that was my first step.

I fired them up, disabled caching and surfed to one of my posts. I saw a few things:

  • a lot of Render Blocking CSS
  • a lot of content was loaded sequentially
  • a lot of external javascript
  • around 100 requests per page, many to ad-related websites.

I also looked at the source of several pages and noticed that -

  • Nothing is minified / optimized
  • There are a lot of duplicate code blocks
  • There are several CSS and JS calls in the header (render blocking)

Ok, now what?

  • I need to figure out how to decrease the number of requests.
  • I need to remove all render-blocking resources.
  • I need to understand what all of these scripts are used for.
  • I need to optimize the HTML, JS and CSS.

Pingdom Website Speed Test

Pingdom has a neat benchmark tool that gives a lot of insight into your website's performance.

I fired it up, entered a url and gained a few new insights:

  • My fonts are really big, as in 800KB big.
  • My images are REALLY big, as in 1MB+ big.
  • My scripts & CSS are REALLY big, as in 500KB+ big.
  • There are A LOT of redirects.

Ok, now what?

  • I need to optimize images and fonts.
  • I need to figure out what's causing all the requests and decrease that number.

Google PageSpeed Insights

Google has its own benchmarking tool which helped me gain a few more insights in addition to the former:

  1. I already knew I had render blocking content, but this tool explained exactly which.
  2. The server response time wasn't fast enough.
  3. I had landing page redirects.
  4. All redirect chains were caused by Disqus.

Ok, now what?

  • I need to figure out how to decrease the response time
  • I need to understand why I have so many landing page redirects

Conclusion

Recap -

  1. I need to figure out how to decrease the number of requests.
  2. I need to remove all render-blocking resources.
  3. I need to understand what all of these scripts are used for.
  4. I need to optimize the HTML, JS and CSS.
  5. I need to optimize images and fonts.
  6. I need to figure out what's causing all the requests and decrease that number.
  7. I need to figure out how to decrease the response time.
  8. I need to understand why I have so many landing page redirects.

These can be divided into three groups:

  1. performance problems that were caused by the theme I used.
  2. performance problems that were caused by my own content.
  3. performance problems that were "caused by" Github Pages caching strategy.

Thankfully, with enough effort, I can fix all of the above. How?

  1. I have the source code for the theme; I need to get my hands dirty and optimize it.
  2. I can alter my content and/or run optimizers to make serving my content faster.
  3. I actually had Cloudflare set up, but it was wrongly configured.

Experiments

I needed to conduct experiments to figure out the best course of action for each problem. Conducting them had to be easy, which meant automating the build and deploy steps.

I did some googling and eventually decided to use Gulp.
Gulp is a toolkit for automating painful or time-consuming tasks in the development workflow.

Gulp was easy to understand and tinker with. I started by building the workflow I needed, then added all the optimizations I needed.

I found out that I can run Google PageSpeed Insights locally, using a tool called psi, which made the whole process really easy.

After each phase I checked to see the results. At some point I started deploying the blog to make sure my CDN changes didn't break anything. Plus, I wanted to run benchmark tools I couldn't run locally.
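By the way, psi is a Node tool; if you'd rather poke at the same data from Python, something along these lines should work. This is just a sketch against Google's public PageSpeed Insights v5 REST endpoint, not part of my Gulp setup, and the URL being tested is only an example:

import requests

API = "https://www.googleapis.com/pagespeedonline/v5/runPagespeed"

def pagespeed_score(url, strategy="mobile"):
    """Return the Lighthouse performance score (0..1) for a single page."""
    resp = requests.get(API, params={"url": url, "strategy": strategy})
    resp.raise_for_status()
    data = resp.json()
    return data["lighthouseResult"]["categories"]["performance"]["score"]

print(pagespeed_score("https://oded.blog/"))  # e.g. 0.92

Running something like this after every optimization phase gives you a single number to track, which is exactly what I used psi for.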

Optimizations

All the steps detailed below are part of my Gulpfile. Feel free to look at it & even offer suggestions. Remember, I'm not a WebDev so I'm probably doing some things wrong!

Minify html, js and css

I tried different tools and different configurations to gain the best result.

Optimize images

I tried different tools and eventually settled on a few.
I actually use two JPEG optimizers to compress the images further.

A side note: I also used a lossy compression mechanism.
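The actual optimizers run as Gulp plugins, but the idea is easy to show in a few lines. The sketch below uses Pillow and is only an illustration - the quality value and the directory name are assumptions, not my real settings:

from pathlib import Path
from PIL import Image  # pip install Pillow

def recompress_jpegs(root, quality=80):
    """Lossy re-encode every JPEG under `root` in place."""
    for path in Path(root).rglob("*.jpg"):
        with Image.open(path) as img:
            img.load()  # read pixel data before overwriting the file in place
            img.save(path, "JPEG", quality=quality, optimize=True, progressive=True)

recompress_jpegs("public/images")  # hypothetical output directory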

Fix Render Blocking things

I removed render-blocking CSS by inlining the critical path and loading the rest asynchronously. Furthermore, I removed render-blocking JavaScript by concatenating some scripts and inlining others.

Fix Redirects

Disqus is the comment system I use. It looked like it was the reason I had so many (50+) redirects on each post. The solution was simple: all I needed was to disable anonymous cookie targeting.

Remove unused or duplicate code paths.

The original theme had a lot of plugins that I didn't use. Removing them made the theme a lot slimmer and more readable.

Replace slow plugins

I replaced the custom share button with AddToAny. Actually, Icarus already supported AddToAny, but it didn't really work.

I also replaced MathJax with KaTeX, which is significantly faster.

Optimize search

The theme came with custom search functionality which works really well. The problem was that it downloaded a JSON representation of all my posts (including text!), which was redundant.

I removed everything except the bare minimum and updated the code to only search titles, categories and tags.

Optimize fonts

I used IcoMoon to remove unused FontAwesome icons. FontAwesome went from 500KB to 40KB (!). I also removed unused Source Code Pro & Open Sans fonts.

Tune Cloudflare Performance

Cloudflare was my CDN of choice. Not because it's the best (it might be), but because it's free :)

Anyhow, I configured Cloudflare to cache my entire website, and as a result, added a Gulp task to invalidate Cloudflare's cache on deploy.
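The purge task itself is just an authenticated call to Cloudflare's v4 API. My version lives in the Gulpfile, but a rough Python sketch of the same call looks like this (the environment variable names are hypothetical):

import os
import requests

# Cloudflare v4 API: purge everything for a zone
zone = os.environ["CF_ZONE_ID"]
resp = requests.post(
    "https://api.cloudflare.com/client/v4/zones/{}/purge_cache".format(zone),
    headers={"Authorization": "Bearer " + os.environ["CF_API_TOKEN"]},
    json={"purge_everything": True},
)
resp.raise_for_status()
print(resp.json()["success"])  # True if the cache was invalidated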

I also turned on Rocket Loader. Rocket Loader is a general-purpose asynchronous JavaScript loader coupled with a lightweight virtual browser that almost always improves a web page's window.onload time.

Remove Disqus

Most of my posts weigh around 500KB: 200KB for content, and 300KB to load Disqus (!).

I decided to remove Disqus and replace it with Gitment, a comment system based on GitHub Issues, which can be used in the frontend without any server-side implementation.

Gitment takes around 90KB after compression, which is 60% less than Disqus! Moreover, it's completely based on GitHub issues, which is pretty cool IMO.

Oh, right. It's not perfect either -

Public client secret

Gitment uses OAuth authentication from the client side to do its magic.
It needs the client secret to do so, which according to GitHub:

Your client ID and client secret keys come from your application's configuration page. You should never, ever store these values in GitHub--or any other public place, for that matter. We recommend storing them as environment variables.

Although GitHub makes sure the client id and secret are only used for the configured callback, it's still not a good idea in my opinion.

gh-oauth-server

Every login request to Gitment is proxied through gh-oauth.imsun.net.
gh-oauth-server is needed because GitHub doesn't attach a CORS header to the login requests.

The service doesn't record or store anything (I checked), but having a global service that's controlled by some guy is not my cup of tea.

I decided to set up my own gh-oauth-server instance on DigitalOcean, and added logic that injects the client keys on the server, instead of sending them from the client. That basically solved the previous issue!
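gh-oauth-server itself is a small Node project; the sketch below is only meant to show the idea - a tiny, hypothetical Flask proxy that keeps the client secret on the server and exchanges the temporary OAuth code for a token against GitHub's /login/oauth/access_token endpoint:

import os
import requests
from flask import Flask, jsonify, request

app = Flask(__name__)

# hypothetical environment variables holding the GitHub OAuth app credentials
CLIENT_ID = os.environ["GH_CLIENT_ID"]
CLIENT_SECRET = os.environ["GH_CLIENT_SECRET"]

@app.route("/token", methods=["POST"])
def token():
    # the browser only sends the temporary ?code=... it got from GitHub;
    # the client secret never leaves the server
    code = request.get_json()["code"]
    resp = requests.post(
        "https://github.com/login/oauth/access_token",
        headers={"Accept": "application/json"},
        data={"client_id": CLIENT_ID, "client_secret": CLIENT_SECRET, "code": code},
    )
    return jsonify(resp.json())

if __name__ == "__main__":
    app.run(port=8080)

A real deployment would also need to attach the CORS headers that GitHub's endpoint lacks - which is the whole reason the proxy exists in the first place.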

Fixes

During my experiments I also found out that I broke some parts of my website when migrating from Ghost.

AMP

One of the reasons I moved to Hexo was to stop using Accelerated Mobile Pages.
Unfortunately, some people posted links pointing to AMP content on my blog, which led to 404s.

I added a custom URL Forwarding rule in Cloudflare to redirect AMP content to the right page.

I also added a custom 404, because why not?

Broken Links

I had a feeling some links would break after migrating from Ghost. What I didn't know is that I had broken links inside my posts!

TL;DR: I used a neat tool called broken-link-checker that crawls a given website and checks for broken links.

Why were links broken?

Some were broken because of a migration error that caused some posts to get the wrong dates, leading to unreachable posts.

Others were broken as a result of differences between Ghost's Markdown rendering engine and Hexo's. For example, look at the following valid markdown:

[Ajax](http://en.wikipedia.org/wiki/Ajax_(programming))

Ghost would create a link named Ajax pointing to

http://en.wikipedia.org/wiki/Ajax_(programming)

While Hexo would create a link named Ajax pointing to

http://en.wikipedia.org/wiki/Ajax_(programming

There's an issue on the subject with a solution.

SEO Optimizations

I'm not an SEO wizard, but I do know that submitting your sitemap to search engines is a good idea, so I did that.

I also added a robots.txt file, which is completely redundant in my opinion, but why not?

Automatic Deployment

All the steps I outlined are really manual. I hate manual work. Instead, I set up CircleCI to do all the manual work for me.

Now every time I push something to the blog's GitHub repository (oded.blog) the following happens:

  1. CircleCI checks out the code from GitHub
  2. CircleCI installs all the dependencies Hexo needs to build the website
  3. CircleCI runs gulp, which in turn -
    • generates the static website
    • runs all the optimization steps
    • deploys the blog to odedlaz.github.io
    • invalidates Cloudflare's cache to make sure the content is up to date.
    • fires a webhook that tells IFTTT to send me an update

Further work

Remove unused CSS

I tried to run uncss to remove unused CSS, but it broke most of the website.
Nevertheless, I'm pretty sure there are a lot of unused CSS selectors that I can safely remove.

Search

Currently I'm using a custom search "engine" that works on the client side.
I might replace that with Algolia at some point, but currently that's not an issue.

From Ghost to Hexo

After a few months of hosting my blog at DigitalOcean, I decided to ditch Ghost.
Instead, I'm moving my blog to Hexo, hosted by GitHub Pages.

TL;DR -

  • Ghost doesn't offer any dynamic server features, yet it requires running a server.
  • Ghost wasn't fast enough.
  • Unnecessary server costs.
  • Too much maintenance.

<!-- more -->

Why Ghost?

I didn't like WordPress and found Ghost a good alternative.
A full explanation can be found at - From WordPress.com to Ghost on DigitalOcean.

Why Move (again)?

Server runtime with no server features

Ghost doesn't offer any dynamic features, and honestly - I don't really need any.
My entire blog is static, except for comments which are provided by Disqus.

That means I needed to set up a server for a (de facto) static website.

Speed

My blog wasn't snappy enough. There are a number of "blame" factors:

  • Ghost isn't fast enough (I never checked that)
  • The droplet I was paying for didn't supply enough resources for Ghost

I decided to redirect my users to AMP to gain extra speed. I set up rules on nginx to only redirect mobile users to AMP, which I didn't like for several reasons:

  1. AMP removed Disqus and MathJax (which made some posts unreadable on mobile)
  2. It gives my users two distinct websites, with a completely different design.

Plus, there's everything Alex Kras wrote on the subject in I decided to disable AMP on my site.

Maintenance

I found myself doing a lot of work to make sure my blog was working like I expect it to, and that everything was backed up.

Price

I felt fine with paying $10 a month for a droplet, but once I had to pay more to gain performance I went back to the drawing board.

Hexo?

There are a ton of static page generators out there, but I like Hexo the most -

  • Hexo is fast
  • Hexo is simple
  • Hexo is powerful
    • It's easy to hack
    • It's pluggable, with a huge plugin ecosystem
    • It has great tooling

All you need to do in order to get up and running is to type the following:

$ npm install hexo-cli -g
$ hexo init blog
$ cd blog
$ npm install
$ hexo server

Plus, Elad Zelingher, who is a good friend of mine and a rockstar, migrated his blog to Hexo.

Gains

Once I move to Hexo I'll gain -

  • Better performance
  • Zero maintenance - everything is hosted on GitHub
  • Free hosting - the site is served by GitHub Pages
  • Global distribution via GitHub and Cloudflare
  • IMO, a better editing experience - I can finally edit with vim!

Migration Process

Hexo provides tooling for migration from other platforms, like WordPress, Joomla and Jekyll.

Unfortunately, there are no "official" tools to migrate from Ghost. Fortunately, a solution is always a google search away: hexo-migrator-ghost.

The process was much easier than last time, but I still had to do some work:

  1. Fix Bugs: tags weren't properly migrated & some posts wouldn't pass migration, so I had to fix those in the migrator.
  2. Custom Migration: I used prism.js to highlight code blocks in Ghost. Hexo uses highlight.js instead. I had to take care of syntax changes.
  3. Manual Work: fixing tags, adding categories, etc.

Once I was done, and everything worked locally, I deployed the blog, then followed Cloudflare's Secure and fast GitHub Pages with CloudFlare.

coroutines: basic building blocks for concurrency

This part of the series explains the basic building blocks that allow writing concurrent programs in Python.

Later in the series I'll show how to use different async paradigms using the new async syntax that was (finally) introduced in Python 3.5.

Prerequisites

  1. You're using Python 3.6.x
  2. You're familiar with coroutines. Otherwise, read coroutines: Introduction.

<!-- more -->

A bit of history

How does async & await work?

I'm not a fan of duplicate work, and a colleague of mine told me that Brett Cannon, a Python core developer, already wrote a great post on the subject - How the heck does async/await work in Python 3.5?

I HIGHLY RECOMMEND READING ALL OF IT. It's beautifully written and explains all the new things that were added to the language in the past few years.

What’s New In Python 3.6

Python 3.6 is considered by many to be the first release that makes it worth moving over from Python 2.x.

The (not very long) list of features that were added in Python 3.6 can be found here.

You can also watch Brett's talk on the subject:

<iframe width="560" height="315" src="https://www.youtube-nocookie.com/embed/hk85RUtQsBI?rel=0" frameborder="0" allowfullscreen></iframe>

Anyhow, the following are the new features related to async.

PEP 525: Asynchronous Generators

PEP 492 introduced support for native coroutines and async / await syntax to Python 3.5. A notable limitation of the Python 3.5 implementation is that it was not possible to use await and yield in the same function body. In Python 3.6 this restriction has been lifted, making it possible to define asynchronous generators:

import asyncio

async def ticker(delay, to):
    """Yield numbers from 0 to *to* every *delay* seconds."""
    for i in range(to):
        yield i
        await asyncio.sleep(delay)

The new syntax allows for faster and more concise code.
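Here's a minimal, self-contained way to drive such a generator (using the pre-3.7 event-loop API, since this post targets Python 3.6):

import asyncio

async def ticker(delay, to):
    """Same generator as above, repeated so the snippet runs on its own."""
    for i in range(to):
        yield i
        await asyncio.sleep(delay)

async def main():
    async for tick in ticker(0.5, 3):
        print(tick)  # prints 0, 1, 2, half a second apart

# Python 3.6 has no asyncio.run() yet, so drive the loop manually
loop = asyncio.get_event_loop()
loop.run_until_complete(main())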

PEP 530: Asynchronous Comprehensions

PEP 530 adds support for using async for in list, set, dict comprehensions and generator expressions:

result = [i async for i in aiter() if i % 2]

Additionally, await expressions are supported in all kinds of comprehensions:

result = [await fun() for fun in funcs if await condition()]
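A tiny runnable sketch (the double() helper is made up for the example) shows what this looks like in practice - note that the awaits inside a comprehension still run one after the other, not in parallel:

import asyncio

async def double(n):
    """Pretend-async helper: waits a bit, then returns n * 2."""
    await asyncio.sleep(0.1)
    return n * 2

async def main():
    result = [await double(n) for n in range(5) if n % 2]
    print(result)  # [2, 6]

loop = asyncio.get_event_loop()
loop.run_until_complete(main())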

The death of Twisted and such

There are a bunch of async frameworks in the wild, the most notable being Twisted and Tornado. All of them build on Python's awesome generators & coroutines, which were introduced into the language more than 15 years ago!

I'm not a fan of either. I've worked with Twisted a lot at my last workplace - we wrote our generic crawler using Scrapy, which is based on Twisted.

These frameworks work great - until something breaks, that is. Then you get the worst errors imaginable, and debugging them is a nightmare.

I was ecstatic when Python introduced asyncio, then async & await. Many in the developer community, including myself, believed that was the end for Twisted, but we were wrong - The report of Twisted's death was (and still is) an exaggeration.

Becoming the go-to guy for Linux internals

7 months ago I made a New Year's resolution to master vim: The Road to Mastering Vim. I'm not a master just yet - it's going to take a few years. After 7 months of exclusively editing text & code with vim, I can honestly say that I'm feeling at home and I can't go back.

A few days ago I told the world that I'm moving to Cybereason. I didn't say that I'm going as hardcore as it gets - joining the team that develops the agent on Linux endpoints.

<h3><center> New Role → New Challenges. </center></h3>


<!-- more -->

What am I going to do there?

Honestly, I don't know much. I'll probably know more once I'm done with Cybereason's "Boot Camp".

What I do know at this point is that:

  • I'm going to write a lot of C++
  • I'll need to dive deep into Linux internals

Reconnaissance

If you know me, or read my blog, you probably guessed that I'm not the passive kind. I want my ramp up session to be as quick as possible, and I want to provide value as fast as possible. I want to become the go-to guy at the company for everything Linux internals related.

In order to achieve that, I contacted my soon-to-be team leader, Gal Kaplan, and asked him for some reading materials and pointers, which he gladly provided.

Gal told me I would be doing a lot of Linux monitoring & performance work, so it would be best to read about the subject.

Reconnaissance, Continued.

Now that I had talked to my team leader and had a basic idea of what I was looking for, I reached out to my own go-to guy for anything performance, debugging and monitoring related - Sasha Goldstein.

Sasha is a wizard when it comes to Windows internals, and lately he has started to dive into Linux internals. I've been following his blog for a few years now, and I'm a huge fan.

Sasha gave a talk about performance at a recruiting event a few days ago, and I went there to chat a bit. Long story short, I found myself chatting with him for more than an hour. He's awesome.

I asked him about relevant material I might need. That was his answer:

Pretty much everything that Brendan wrote in the last few years is relevant for Linux performance investigation and debugging. His older posts have a bunch of stuff about dtrace and Solaris, which isn't that hot anymore, but the last couple of years he's 100% hardcore Linux.

You should absolutely look into the following building blocks, which serve as the foundation for a bunch of other tools: uprobes/kprobes, ftrace, perf_events and perf. Also, some good familiarity with gdb can't hurt -- it's still the debugger of choice.

I also sent a similar query on a Facebook group, and more go-to guys of mine for anything Linux related, Amit Serper & Nati Cohen, said pretty much the same as well.

Recap: three people I trust told me the same thing - read Brendan's stuff.

Who is this Brendan guy?

Basically, he's the guy you go to when you want to read about anything related to Linux monitoring, performance and internals.

His work is incredible. I can't believe I've never heard of him up to this point.

You can read more about him on his bio page.

Asking Brendan for Mentoring

This might sound crazy, but what do I have to lose? If Brendan doesn't respond or doesn't care, nothing happens. Otherwise, I get to talk to an industry expert, get a few pointers and maybe even mentoring!

TL;DR: I never talked to him. From his about page:

I get a lot of emails, which build up into the hundreds when I've been traveling for a number of weeks. I do eventually read all the emails I'm sent, and while I want to reply to them all, I don't have the time to do so. If I didn't reply to you, sorry, it's likely because I'm just busy.

... I can't currently offer individual mentoring outside of my work. If you'd like to learn from me, I have shared a lot of content online, much of which is linked from this homepage. This includes over twenty hours of video presentations. There are also my books....

Ok, got it. No mentoring.

Steps to Mastering Linux Performance & Monitoring

  1. Brendan created a portfolio page with a selection of his most useful and popular material. I need to go through all of it.
  2. Read Brendan's Systems Performance: Enterprise and the Cloud book.
  3. Get acquainted with the tools that Sasha talked about.
  4. Understand the internals that these tools use to operate, using Brendan Gregg's website, LWN.net, linux-insides and general googling.
  5. Step up my game and get good familiarity with gdb.

Sounds like a great challenge for 2017, and probably 2018 as well 😅

Running, Editing & Debugging .NET Core Apps Inside a Container

Today I needed to add a few features to an existing .NET Core application. I'm running Fedora 25, but that shouldn't be an issue, right?

It appears that .NET Core doesn't love Fedora 25: it's still not officially supported. Instead of hacking around, trying to get this thing working and wasting my whole day doing so, I thought - why not use Docker?

The idea was simple - fire up Visual Studio Code inside a container that has .NET Core installed.

<!-- more -->

The Dockerfile is pretty straightforward -

FROM microsoft/dotnet

RUN apt update && apt install -y wget \
        libnotify4 \
        libgconf-2-4 \
        libnss3 \
        libgtk2.0-0 \
        libxss1 \
        libgconf-2-4 \
        libasound2 \
        libxtst6 \
        libcanberra-gtk-dev \
        libxkbfile1 \
        libgl1-mesa-glx \
        libgl1-mesa-dri && \
    rm -rf /var/lib/apt/lists/* && \
    wget https://go.microsoft.com/fwlink/?LinkID=760868 -O vscode.deb && \
    dpkg -i vscode.deb && \
    rm vscode.deb && \
    useradd -m vscode -s /bin/bash

USER vscode

VOLUME [ "/home/vscode" ]
CMD [ "code", "--verbose" ]

Just build it, and you're halfway there:

docker build --rm -t vscode -f /path/to/Dockerfile .

Now comes the fun part. Figuring out how to fire up Visual Studio Code from inside a container, as if it's running natively in the host. Challenges:

  • Making Visual Studio Code connect to my running x server
  • Giving Visual Studio Code all the privileges it needs to run & debug an app
  • Running with the current user's id, to avoid ownership issues

I googled some, figured out the rest, and came up with the following script. It's a little rough around the edges, and I probably could've used Docker Compose, but it does the job!

#!/usr/bin/env bash

# lookup the path where vscode files would reside
VSCODEPATH="${VSCODEPATH:-$HOME/Dev/vscode}"

if [ ! -d "$VSCODEPATH" ]; then
    echo "\$VSCODEPATH ($VSCODEPATH) doesn't exist!"
    exit 1
fi

# create a directory that'll hold projects
# it'll be mounted to the container's home directory
# so all fetched packages and vscode configuration will be loaded between sessions
mkdir -p "$VSCODEPATH/dev"

# add docker to X server access control list (acl)
xhost +local:docker 2> /dev/null

# --rm            remove the container when exiting the app
# --detach        run detached from the console
# --net=host      use the host's network stack
# --user          run with the current user's id, to avoid ownership issues
# --privileged    enable access to all devices on the host
# --hostname      use the host's name
# -e DISPLAY      connect to the host's display
# -v .X11-unix    mount X's Unix-domain socket
# -v $VSCODEPATH  mount the container's home directory on the host
# vscode          the name of the image that was built
docker run \
    --rm \
    --detach \
    --net=host \
    --user "$(id -u)" \
    --privileged \
    --hostname "$(hostname -s)" \
    -e DISPLAY="$DISPLAY" \
    -v "/tmp/.X11-unix:/tmp/.X11-unix" \
    -v "$VSCODEPATH/dev:/home/vscode" \
    vscode 2> /dev/null

And voilà!

Goodbye Gartner, Hello Cybereason

TL;DR: I'm leaving Gartner Innovation Center and joining Cybereason 🎉🎉🎉

The blog post is split into four parts: The Past, The Present, The decision to move, and The Future.

<!-- more -->

The Past


Senexx was acquired by Gartner three years ago. Originally, Gartner meant to move the development to the U.S., while Zeevi Michel (CEO, Senexx) & Michael Gelfand (CTO, Senexx) wanted to stay in Israel.

Nir Polonsky (back then Gartner's New Markets Group VP) had an idea. He believed that Gartner should start expanding to new markets it had previously left untapped, and pitched the idea to Eugene Hall (CEO, Gartner).

Eugene gave them the go-ahead and GICI (short for Gartner Innovation Center Israel) was born.

I joined just five months after the acquisition, and as far as I remember, was the fifth employee. We were a startup of our own - starting fresh, building the infrastructure for innovation.

I was tasked to quickly implement various proof-of-concept projects and to pitch them to management. It was very important to get something out the door before 2015 in order to show our capability.

We succeeded in pitching two ideas, and two products were born. My pitch for a Business Intelligence system gave me "Dev Lead" responsibilities, and together with Eran Katoni (now VP R&D), we formed a team that would take the project from concept to reality.

We grew fast, but were still a family. We ate breakfast, lunch and dinner together. Company-wide lunches and human Nerf-Gun targets weren't unusual. We goofed around a lot and built some amazing things.

The Present


A lot has changed since then. We renovated our building in Lilinblum, Tel Aviv, and grew 10x. We've grown so much that our building couldn't fit everyone, so we had to lease extra space. We've also received a lot of recognition inside the company and gained a lot of respect for the things we delivered.

Today GICI is split into two distinct groups:

  • Digital Markets: working on gaining ground in new markets. One of their key products is CloudAdvice, but they also lead the work to converge Gartner's latest acquisitions in the digital market space: GetApp, Software Advice and Capterra.
  • Core: finding new ways to make Gartner work more efficiently. From a fully autonomous talent scout, to a search engine for analysts - Most of the products are heavily based on data science, and have been extremely disruptive.

I was always at Core, and always working closely with Zeevi and Michael. I actually reported directly to Zeevi (now Managing Vice President, Head of GICI) for two years, until I moved to the "CTO Office" and started reporting to Michael, the CTO.

Working with Michael has been amazing. As his sidekick, I shared my time between writing infrastructure code, bug hunting, writing Proof-of-Concept projects, being part of significant architectural decisions, DevOps and around-the-clock evangelism.

<span style="color:#2C66C2">I LOVED IT</span>.

The decision to move

I love GICI. GICI is literally my second home. I love the people, the products, the management, the perks. I even love the building. Did I forget to mention that WE ARE HIRING?

The problem is that after three years, I got into a comfort zone. If you ever visited my about page you've probably noticed Rumi's quote at the beginning of the page: "Run from what's comfortable. Forget safety. Live where you fear to live. Destroy your reputation. Be notorious."

As much as I love GICI, it was too comfortable. Moreover, being a security fanatic, I had a dream of moving to the cyber security field.

A few weeks ago a dear friend of mine, Amit Serper, said he's coming for a visit from Boston.

Amit is the Principal Security Researcher at Cybereason, a startup that I've been following for a few years now:

"Cybereason is the leader in endpoint protection, offering endpoint detection and response, next-generation antivirus, managed monitoring and IR services.

Founded by elite intelligence professionals born and bred in offense-first hunting, Cybereason gives enterprises the upper hand over cyber adversaries.

The Cybereason platform is powered by a custom-built in-memory graph, the only truly automated hunting engine anywhere. It detects behavioral patterns across every endpoint and surfaces malicious operations in an exceptionally user-friendly interface.

Cybereason is privately held and headquartered in Boston with offices in London, Tel Aviv, and Tokyo."

I told Amit what was on my mind, and he told me to send him my resume. Actually, my romance with Cybereason started more than three years ago, right after their first round of funding, but that's completely out of scope. I might share that story someday.

Fast forward a few days, and a lot of enthusiasm on both sides later, I signed.

Instead of just catching up with a friend over beer, I found myself switching jobs!

The Future


I won't lie to you guys. I'm scared as hell. I built a name for myself at GICI. I have the respect of my colleagues and management. I know everyone. I'm familiar with all the products and all the technologies.

Now I'm moving to Cybereason. I'm not familiar with most of the technology stack. I don't know anything about the internals of the product. I almost don't know anyone at the company and, to top it all, Aviv Laufer, Cybereason's VP of Engineering, pretty much told me that he expects me to be a 10x programmer.

I believe in Cybereason. I think they're doing exceptional work, and I'm not the only one. They have just completed a very successful series D funding round that added an extra $100 Million to their pocket. All in all, they've raised an impressive $190 Million to date!

Furthermore, I'm certain their product actually works. Why? Because Amit works there, and I know he would've never stayed there for so long if it didn't.

I'm starting from scratch, and <span style="color:#009CC4">extremely thrilled</span>!

P.S: I hope I won't fail miserably and find myself out of a job in a few months :)

names generator à la docker

I really like docker's names generator. It makes remembering IDs easier, and as of version 0.7.x, it generates names from notable scientists and hackers, which gives it an extra bonus!

There are a number of ports out there (shamrin/namesgenerator for example), but all of them just copy-and-paste the names, which is not cool at all.

I decided to parse names-generator.go and extract the names from it, thus making sure it's always up to date.

I wrote two implementations, one in python and one in go.

  • Python: parses the code directly using regular expressions
  • Go: parses the code using Go's AST package, and spits Python code to stdout.

^ The hyperlinks lead to the relevant section in the blog post.

The amount of code needed to do all that work is relatively small, which shows how powerful these languages are.

<!-- more -->

Regular Expressions

The following code is straightforward: first I download the text, then I parse it using regular expressions.

I did use a cool trick - I update the module's locals with the values parsed from the downloaded code.

import random
import requests
import re


URL = ("https://raw.githubusercontent.com"
       "/moby/moby/master/pkg/namesgenerator/"
       "names-generator.go")

# get the variable name (left|right) and text in the curly braces:
#  - the first part captures 'left|right'
#  - the second part looks behind for the '{' character, then captures
#    anything between it and the '}' character
_var_and_curly = re.compile("(left|right).*((?<={)[^}]*)")

# extract the strings inside the curly braces text
_extract_in = re.compile("\"(.+)\"")

# update the locals of this package with '_left' and '_right' values
# that were parsed from the url
locals().update({"_" + var_name.strip(): _extract_in.findall(txt_in_brackets)
                 for var_name, txt_in_brackets
                 in _var_and_curly.findall(requests.get(URL).text)})

_sr = random.SystemRandom()


def get(retry=False):
    """
    generates a random name from the list of adjectives and surnames
    formatted as "adjective_surname". For example 'focused_turing'.
    If retry is True, a random integer between 0 and 10 will be
    added to the end of the name, e.g 'focused_turing3'
    """

    while True:
        name = _sr.choice(_left) + "_" + _sr.choice(_right)

        # Steve Wozniak is not boring
        if name != "boring_wozniak":
            break

    if retry:
        name += str(random.randint(0, 10))

    return name

Abstract Syntax Tree

When Go's compiler was ported to Go a few years ago, the standard library gained a bunch of packages that the compiler needs to do its work. One of them is the excellent go/ast package, which makes parsing Go code a breeze.

The following snippet spits PEP8-conformant Python code to stdout:

  1. Downloads the names-generator.go file
  2. Parses the code
  3. Walks over the AST and prints out the variables it finds
  4. Prints the python get function

package main

import (
    "fmt"
    "go/ast"
    "go/parser"
    "go/token"
    "io"
    "io/ioutil"
    "net/http"
    "os"
)

//FuncVisitor ...
type FuncVisitor struct {
}

//getEmptyString generates an empty string of size n
func getEmptyString(n int) string {
    spaces := make([]rune, n+1)
    for i := range spaces {
        spaces[i] = ' '
    }
    return string(spaces)
}

//getValue gets the string value of a basic ast node
func getValue(elt interface{}) string {
    return elt.(*ast.BasicLit).Value
}

//Visit visits all nodes in the ast and prints variable content
func (v *FuncVisitor) Visit(node ast.Node) (w ast.Visitor) {
    switch t := node.(type) {
    case *ast.ValueSpec:
        // iterate all variables (should be "left", "right")
        for _, id := range t.Names {
            // print the variable, for example: _left = [
            txt := fmt.Sprintf("%s = [", id.Name)
            fmt.Printf("_%s", txt)

            // extract the number of leading spaces
            spaces := getEmptyString(len(txt))
            names := id.Obj.Decl.(*ast.ValueSpec).Values[0]
            elts := names.(*ast.CompositeLit).Elts

            // print the first value
            fmt.Printf("%s,\n", getValue(elts[0]))

            // iterate all inner values
            for _, x := range elts[1 : len(elts)-1] {
                value := getValue(x)

                fmt.Printf("%s%s,\n", string(spaces), value)
            }

            // print the last value
            fmt.Printf("%s%s]",
                spaces,
                getValue(elts[len(elts)-1]))
        }

        fmt.Printf("\n\n")
    }

    return v
}

var pyImports = `import random`

var pySystemRandom = `_sr = random.SystemRandom()`

var pyGetFunction = `def get(retry=False):
    """
    generates a random name from the list of adjectives and surnames
    formatted as "adjective_surname". For example 'focused_turing'.
    If retry is True, a random integer between 0 and 10 will be
    added to the end of the name, e.g 'focused_turing3'
    """

    while True:
        name = _sr.choice(_left) + "_" + _sr.choice(_right)

        # Steve Wozniak is not boring
        if name != "boring_wozniak":
            break

    if retry:
        name += str(random.randint(0, 10))

    return name
`

func downloadMobyProjectContainerNameGenerator() string {
    url := "https://raw.githubusercontent.com/moby/moby/master/" +
        "pkg/namesgenerator/names-generator.go"

    tmpfile, _ := ioutil.TempFile(os.TempDir(),
        "names-generator.go.")
    defer tmpfile.Close()

    response, _ := http.Get(url)
    defer response.Body.Close()

    io.Copy(tmpfile, response.Body)
    return tmpfile.Name()
}

func main() {
    // download names generator and parse it
    path := downloadMobyProjectContainerNameGenerator()
    file, _ := parser.ParseFile(token.NewFileSet(), path, nil, 0)

    // generate python code that does the same
    fmt.Printf("%s\n\n", pyImports)
    ast.Walk(new(FuncVisitor), file)
    fmt.Printf("%s\n\n\n", pySystemRandom)
    fmt.Printf(pyGetFunction)
}

The production killer file descriptor

A few days ago one of our (Gartner Innovation Center) production servers died as a result of a log file that wasn't properly rotated. This might sound like an easy problem to figure out and fix, but the situation was a bit more complex!

This blog post walks through all the steps we took to investigate and fix the issue. I find it extremely interesting & hope you will too!

<!-- more -->

Investigating the issue

We connected to the server and ran a few "checklist" commands -

$ df -alh
Filesystem Size Used Avail Use% Mounted on
...
/dev/sda3 225G 225G 0 100% /var/lib/docker

As you can see, /var/lib/docker, got filled up. But which part?

$ du -ahd 1 /var/lib/docker

du: cannot read directory '/var/lib/docker': Permission denied
4.0K /var/lib/docker

Snap! We can't get statistics on /var/lib/docker because we don't have the right privileges. We don't have root either. Ideas?

Luckily for us, IT messed up and we can run docker in privileged mode:

"When the operator executes docker run --privileged, Docker will enable access to all devices on the host as well as set some configuration in AppArmor or SELinux to allow the container nearly all the same access to the host as processes running outside containers on the host" - Docker docs

So we fired up docker, mounting /var to /host_var:

$ docker run -t -i --privileged -v /var:/host_var alpine sh
# inside the container, running sh
$ du -ahd 1 /host_var/lib/docker
72K /host_var/lib/docker/network
4.2G /host_var/lib/docker/overlay2
1.7G /host_var/lib/docker/volumes
32M /host_var/lib/docker/image
4.0K /host_var/lib/docker/tmp-old
4.0K /host_var/lib/docker/swarm
20K /host_var/lib/docker/plugins
4.0K /host_var/lib/docker/trust
4.0K /host_var/lib/docker/tmp
219.1G /host_var/lib/docker/containers
225G /host_var/lib/docker

Interesting. du says /var/lib/docker/containers is full. Let's try and find out which container is the problematic one:

$ docker run -t -i --privileged -v /var:/host_var alpine sh
# inside the container...
$ du -hs /host_var/lib/docker/containers/* | sort -hr | head -n 1
218.3G /host_var/lib/docker/containers/11bcb8ab547d177

Back to the host, we ran:

docker ps -a | grep 11bcb8ab547d177

and found out who the troublemaker was. But that doesn't solve anything! Back to the docker container:

$ du -hs /host_var/lib/docker/containers/11bcb8ab547d177/*
4.0K    /host_var/lib/docker/containers/11bcb8ab547d177/checkpoints
4.0K    /host_var/lib/docker/containers/11bcb8ab547d177/config.v2.json
4.0K    /host_var/lib/docker/containers/11bcb8ab547d177/hostconfig.json
4.0K    /host_var/lib/docker/containers/11bcb8ab547d177/hostname
4.0K    /host_var/lib/docker/containers/11bcb8ab547d177/hosts
4.0K    /host_var/lib/docker/containers/11bcb8ab547d177/resolv.conf
4.0K    /host_var/lib/docker/containers/11bcb8ab547d177/resolv.conf.hash
4.0K    /host_var/lib/docker/containers/11bcb8ab547d177/shm

We suspected that the problem was logging, because our docker log driver rotates logs after they reach a certain size. We should've seen the file at the following location, but didn't: /var/lib/docker/containers/11bcb8ab547d177/11bcb8ab547d177.json

What if a log wasn't rotated properly? What if the container holds a file descriptor to a log file that already got deleted?

It's really easy to find out! Let's fire up lsof and search for deleted files.

$ lsof
lsof: WARNING: can't stat() overlay file system /var/lib/docker/overlay2/9ce66914ee2bbfcaa7646a87c74c772d5a90b7236fb1e84cfcc4a410e544afa4/merged
Output information may be incomplete.
lsof: WARNING: can't stat() tmpfs file system /var/lib/docker/containers/66ab0db40286ff7964fa0770d1dd660c611c2025be72067dea5d8982d73ec071/shm
Output information may be incomplete.
lsof: WARNING: can't stat() nsfs file system /run/docker/netns/594e57d254a8
Output information may be incomplete.
COMMAND PID TID USER FD TYPE DEVICE SIZE/OFF NODE NAME
systemd 1 root cwd unknown /proc/1/cwd (readlink: Permission denied)
systemd 1 root rtd unknown /proc/1/root (readlink: Permission denied)
systemd 1 root txt unknown /proc/1/exe (readlink: Permission denied)
systemd 1 root NOFD /proc/1/fd (opendir: Permission denied)
...

Snap! we're not running as root, which means we can only see output for our own processes. Some information about a process, such as its current directory, its root directory, the location of its executable and its file descriptors can only be viewed by the user running the process (or root).

You know the drill, right? We can mount /proc and run lsof. Well, not exactly: lsof will list the open files inside the container, not the files on the host.

We can, on the other hand, search for deleted files manually:

$ docker run -t -i --privileged -v /proc:/host_proc alpine sh
# inside the container...
$ find /host_proc/*/fd -ls | grep '(deleted)' | grep /var/ | grep -v /var/lib/docker
952833633 0 l-wx------ 1 root root 64 Jun 3 08:39 /host_proc/991/fd/4 -> /var/log/docker.log-20170530 (deleted)

$ stat -Lc %s /host_proc/991/fd/4
102362472879

Awesome. We found that pid 991 holds a reference to a ~102GB log file. That means that the file wasn't rotated properly and filled up the disk. But why is the file descriptor pointing to /var/log? We'll discuss that later.

Fixing the issue

That's easy - save the log, then release the space by truncating the file!

$ docker run -t -i --privileged -v /proc:/host_proc \
-v /var/log:/host_log alpine sh
# inside the container...
$ cp /host_proc/991/fd/4 /host_log/docker.log-20170530

# this one truncates the file
$ : > /host_proc/991/fd/4

And voila!

$ df -alh /var/lib/docker
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda3       225G  227M  225G   1% /var/lib/docker

Problem fixed.

Permanent fix

Why did the issue occur in the first place? We use docker's stock JSON File logging driver, which also rotates the files. Is that a bug?

It looks like our IT department messed up again, and set up logrotate in parallel to json logging:

$ cat /etc/logrotate.d/docker
/var/log/docker.log {
rotate 2
compress
missingok
notifempty
size 100M
}

There was a race between docker's log rotation and logrotate. But still, at worst that should mean files get rotated twice - not deleted.

The problem? logrotate moved the old log file, instead of copying it and truncating the original. That caused the docker daemon to keep writing to a file descriptor that points to a file that doesn't exist anymore!
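You can reproduce the phenomenon in a few lines of Python (a toy demonstration, not part of our fix): as long as a process holds a file descriptor, the data keeps consuming disk space even after the directory entry is gone.

import os

f = open("/tmp/demo.log", "w")
os.remove("/tmp/demo.log")           # the file "disappears" from the directory
f.write("x" * 1024 * 1024)           # ...but writes still consume disk space
f.flush()
print(os.fstat(f.fileno()).st_size)  # ~1048576 bytes, with no visible file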

The fix? Disable either logrotate or the stock logging driver.

We chose to disable the stock driver, but the problem wasn't fixed just yet - the daemon can still leak file descriptors. That's where copytruncate comes in:

Truncate the original log file to zero size in place after creating a copy, instead of moving the old log file and optionally creating a new one. It can be used when some program cannot be told to close its logfile and thus might continue writing (appending) to the previous log file forever. Note that there is a very small time slice between copying the file and truncating it, so some logging data might be lost. When this option is used, the create option will have no effect, as the old log file stays in place.

All we had to do was add a copytruncate line to our configuration and the issue was resolved. Tada!
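For completeness, the updated configuration looks roughly like the original one above, with the new directive added:

$ cat /etc/logrotate.d/docker
/var/log/docker.log {
rotate 2
compress
copytruncate
missingok
notifempty
size 100M
}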

Accidentally destroyed production database on first day of a job, how screwed am I?

I just read a post on reddit titled: "Accidentally destroyed production database on first day of a job, and was told to leave, on top of this I was told by the CTO that they need to get legal involved, how screwed am I?"

TL;DR

  • Guy gets a document detailing how to set up a local development environment.
  • Guy sets up a development database.
  • Guy runs a tool that performs tests on the application, accidentally pointing the tool at the production database.
  • The credentials for the production database were in the development document.
  • The tool clears the database between tests.
  • Guy gets fired.

I'm completely pissed! The guy made an honest mistake that could happen to anyone, and got fired!

<!-- more -->

Thoughts

The guy to blame is the CTO. Why? Here's a partial list of my thoughts:

  1. The document should never have had the production password in it.
  2. Actually, there shouldn't be "a production" password in the first place.
  3. Developers shouldn't have write privileges for production databases. Only read privileges. Only ops and/or a certain subset of the team should have write privileges.
  4. You should ALWAYS have TESTED backups.
  5. You should always be prepared for a "total fuck up" and be able to recover quickly.

I can keep going, but the point is that the company's CTO "fucked up" - not just because he failed technically, but also because he created an environment that kills innovation.

What the CTO should've done

He should have rolled back the database, then assembled key personnel to investigate the issue and figure out why it happened.

Then, he should have told the guy that "shit happens" and that he expects him to be more careful next time.

Also, I would've appreciated a company-wide email or a public blog post that details exactly what happened, jokes about the new guy's mistake and clarifies that their job is to make sure such incidents don't happen again.

Here's a few examples of companies that "fucked up" miserably, and handled them gracefully:

  • A few months ago someone at GitLab completely fucked up and made them lose six hours of database data (GitLab.com Database Incident). You know what they did to him? NOTHING.

Instead, they were completely transparent about the incident, and even wrote a blog post, Postmortem of database outage of January 31, detailing how they'll make sure such thing won't happen again.

  • Remember AWS's 4-hour outage a few months ago? It caused a loss of a few hundred million dollars of revenue for S&P 500 companies alone. Why? An engineer debugging an issue with S3's billing system accidentally mistyped a command. Was the guy fired? NO.

None of these companies should get a medal for dealing with incidents the way they did - that's simply how our industry is supposed to handle things. Making mistakes is human; our job is to learn from them and make our processes better.