*(True story: in my interview I was asked how I would extract entities from an H...

PaulHoule · on Nov 30, 2013

Uh, it is easy to make an entity extraction program that will get you a passing grade for a CS course.

It's not easy to make one that is good enough for commercial use unless somebody is hand curating the results. For the record, OpenCalais isn't good enough for commercial use (I've tried), and like most text analysis vendors, the people there blame their customers for the lack of adoption, not their product.

If you look at the leading entity extraction products they tend to be by huge companies like BBN and IBM; the state of the art open source product is UIMA, which out of the box does precisely nothing. What it does do is make it possible to coordinate the work of 100+ developers, linguists, scientists and other people in a bunch of different timezones so you can run a sweat shop that piles up a pyramid of heuristics at a price that can only be afforded by large organizations that expect to pay a lot and get very little for it.

Now the question of how you can cheaply build a knowledge base that can do the same is an interesting one that I've thought about a lot.

You can answer 90% of the questions that show up in a data science interview with "look it up in a hash table" or "look it up in the literature". Off the top of my head I'd have a hard time explaining how to make a bloom filter or how to turn text into a suffix tree, although I probably could derive an algorithm to train an HMM. I can look up all this stuff in the literature so why bother?
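For reference, the Bloom filter mentioned above fits in a few lines. This is a minimal sketch, not a tuned implementation: the class name, the default sizes, and the choice of salted SHA-256 as the hash family are all assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        # Sizes are arbitrary illustration values, not tuned for any workload.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, item):
        # Derive several hash positions by salting one hash with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        # True only if every position for this item has been set.
        return all(self.bits[pos] for pos in self._positions(item))


seen = BloomFilter()
seen.add("OpenCalais")
print("OpenCalais" in seen)  # an added item is always reported as present
```

The trade-off is the usual one: membership tests take constant space per item, at the cost of occasional false positives that grow as the bit array fills up.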

tptacek · on Nov 30, 2013

What's your point? Are you suggesting that an interviewer expects you to write a production quality DOM parser in the course of a one-hour interview on a blackboard? No, they don't. So, move on: citing a library is an inadequate answer. You need to write code.

PaulHoule · on Dec 2, 2013

My point is that I'm an expert on the above problem, working on stuff that is beyond state-of-the-art.

If somebody hires me to work on that I can produce exceptional results.

If somebody hires me to fill in for the last guy who burned out on a project that is two years behind schedule, my performance is average.

Asking an average person a question like that will get you an average answer and, at best, average results in real life.

etler · on Nov 30, 2013

My problem with that answer is that it doesn't tell me much about the engineer. It's a one-sentence response and, other than showing me that you know the right library to use, it doesn't show any depth. It's one step above "I would google it". As you said, libraries require a good deal of machinery to integrate into your project, so just saying you'd use a library doesn't touch on any of the potential problems that project might have. If you happened to know that library's API and could explain in some more depth how you would use that library, I would accept it, but I highly doubt most people would have API knowledge off the top of their heads, and I would be suitably impressed if they did.

Ignoring all the other potential problems with Google's interview process, and the question of whether he should have been asked this for the position he was interviewing for in the first place (I can't say without knowing more details), saying "I would use library X" is a terrible answer to a question. The point of the interview is to get to know more about your skills and abilities, and a one-sentence answer tells me almost nothing. Of course it is also partially the interviewer's responsibility to ask the right questions, look for the right qualities, and lead the interview in a way where the interviewee can properly demonstrate their skills, but it's also partially or equally the interviewee's responsibility to demonstrate that.

AndrewDucker · on Nov 30, 2013

The point, in this case, being that they weren't going for a coding position. They would not expect to solve that kind of problem as part of their job - because they were going for a community liaison/management role, not a technical one.

tptacek · on Nov 30, 2013

I can't help you do a better job getting a job at Google, but I can help you with tech job interviews in general: don't suggest big libraries as solutions to small programming problems, and especially don't stick to your guns after the interviewer asks for the DIY solution.

leokun · on Nov 30, 2013

I think the gist of the problem is that when Google acquires a company the way they handle the non-technical staff can be a bit off-putting. Instead of working hard to find them a place within Google during the acquisition process, or just laying them off with some kind of severance, they put the burden on the individual. You have a year to find a place? Basically that just means, you have a year to find a new place to work.

quesera · on Nov 30, 2013

> when Google acquires a company the way they handle the non-technical staff can be a bit off-putting.

True enough.

Depending on the acquisition and the acquiring company, non-technical staff might not even get an interview -- they're just sent on their way with some thank you money, sometimes.

The standard year is the "don't say bad things about us" move done by companies that are sensitive to that kind of negative PR.

But really, finding a place for staff members that you would not have hired involves a lot of energy which can almost certainly be put to better use elsewhere in the business.

jfb · on Nov 30, 2013

Yeah, I think you can open your answer with, "well, I'd use libfoo, because there are a bunch of hairy edge cases", but then follow on immediately with a toy implementation to show you understand the problem domain. There's a certain sort of nerd-arrogance to the "I'm too good to write it myself" sort of answer.

michaelt · on Nov 30, 2013

Where you hear "I'm too good to write it myself" I hear "I'm too bad to write it myself"

I mean, working on a whiteboard with no reference material and no test data I'd be glad if I had time in an interview to write a bug-free DOM parser that would work on a modern, standards-compliant web page - I wouldn't expect to even start on dealing with malformed inputs, javascript, images, nested documents...

tptacek · on Nov 30, 2013

Sure. But writing a reliable bug-free DOM parser isn't the point of the question.

michaelt · on Nov 30, 2013

Do you mean you'd answer the question with a DOM parser that wasn't reliable and bug-free, that you wouldn't use a DOM parser at all, or that you'd use a DOM parser from a library (but consider such a library more acceptable than Calais, which is understandable)?

Seems to me you're going to have to get a handle on the page to figure out which entities are going to be treated as such and which are e.g. nested inside comments, or within javascript strings, or within inline CSS, or in the head, or after the body close tag, or resemble entities but aren't valid ones, and so on.
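To make those cases concrete, here is a rough sketch. It assumes "entities" means HTML character references such as `&amp;` (the thread also uses the word for named entities in the OpenCalais sense, which this does not cover). It leans on the standard library's `html.parser` rather than a hand-written parser, and the `EntityCollector` and `collect_entities` names are made up for the example.

```python
from html.entities import html5
from html.parser import HTMLParser


class EntityCollector(HTMLParser):
    """Collect character references that appear in ordinary text content."""

    def __init__(self):
        # convert_charrefs=False makes the parser report references
        # through the handlers below instead of silently decoding them.
        super().__init__(convert_charrefs=False)
        self.valid = []
        self.invalid = []

    def handle_entityref(self, name):
        # html5 maps known names (with trailing ';') to their characters.
        if name + ";" in html5:
            self.valid.append(name)
        else:
            self.invalid.append(name)

    def handle_charref(self, name):
        # Numeric references are accepted as-is; range checks are omitted.
        self.valid.append("#" + name)


def collect_entities(markup):
    parser = EntityCollector()
    parser.feed(markup)
    parser.close()
    return parser.valid, parser.invalid


page = ("<p>Fish &amp; chips &bogus; &#65;</p>"
        "<!-- &lt; hidden -->"
        "<script>var s = '&gt;';</script>")
valid, invalid = collect_entities(page)
print(valid, invalid)
```

On this sample the references inside the comment and the script block should not be reported, because the parser hands those regions to other handlers. Malformed markup, references inside attribute values, and nested documents are exactly the parts this sketch leaves out.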

wglb · on Dec 1, 2013

This is in an interview context. The basic point of the question is at the highest level--how would you write such a parser, what might be some of the steps, what exactly is an entity.

It seems like you are addressing a question of total implementation. But in an interview, I would expect a candidate to be able to answer--not fully implement--what some of the basic steps are. As opposed to "use a library".

jfb · on Dec 2, 2013

More to the point, if I'm asking you in an interview how to build a DOM parser, take it as given that, if you get the job, and we need a DOM parser, you will be actively encouraged to go use a library rather than build from scratch.

wglb · on Dec 2, 2013

And if I am asking you in an interview, it would be a given that you would be actively encouraged to figure out how to break it. One key skill in knowing how to break it is knowing how one might be built.

Learning how to think abstractly is one thing, but the ability to "puncture" abstractions is quite another. The latter is more valuable in this context.