Why new Secure Internet solutions are technically Hard

Information Security is both very hard and very easy at the same time.

Not only are Internet Nasties a nuisance, or worse, they also hold back new, useful Applications and Networks like e-Commerce, i-EDI, e-Health, e-Banking, e-Government and other business/commercial transaction systems.

Perfect Security isn't possible: ask any bank.

Defenders need to be 100.00% correct, every minute of every day.
Attackers need just one weakness for a moment to get in.
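
The asymmetry can be put in rough numbers. This is only a sketch: it assumes a system made of n components, each independently having the same small chance p of being exploitable. Real systems rarely behave that way, and the function name and figures are hypothetical, not measurements.

```python
def chance_of_at_least_one_weakness(p: float, n: int) -> float:
    """Probability that at least one of n independent components is weak.

    Assumes every component is weak independently with the same
    probability p. This only illustrates the trend, it is not a model
    of any real system.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if n < 0:
        raise ValueError("n must be non-negative")
    # The defender wins only if all n components hold: (1 - p) ** n.
    return 1.0 - (1.0 - p) ** n


if __name__ == "__main__":
    for n in (1, 10, 100, 1000):
        print(n, round(chance_of_at_least_one_weakness(0.001, n), 3))
```

Under these assumptions, a one-in-a-thousand chance per component still gives the attacker better than even odds once there are a thousand components.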

Not all compromises/breaches are equal: they range from nothing of consequence up to an intruder being in full control without the system owners being aware of it.

All 'Security Systems' can only be "good enough" for their role, which depends on many factors.
How long do you need to keep your secrets? Minutes or Decades?


Quality and Excellence: Two sides of the same coin

Quality is predicated on Caring.
High Performance, also called "Excellence", first requires people to Care about their results.

They are related through the Feedback Loop of Continuous Improvement, also known as O-O-D-A (Observe, Orient, Decide, Act) and Plan-Do-Check-Act (from W. Edwards Deming).

The Military take OODA to another level with After-Action-Reviews or After-Action-Reports (AARs), a structured approach to acquiring "Lessons Learned".

High Performance has two aspects: work-rate and consistency.
It's not enough to produce identical/consistent goods or results every time; you also have to do it with speed.

There's an inviolate Quality Dictum:
You can't Check your own work.

For Organisations, this Dictum becomes:
 Objective assessment requires an Independent Expert Body.

From which follows the necessity for an External Auditor:
  Only Independent persons/bodies can check an Organisation and its people/processes for compliance and performance.

For around 80 years, Aviation has separated the roles of Investigation, or Root Cause Analysis, from Regulation, Compliance and Consequences. In the USA the NTSB Investigates and the FAA Regulates. This has led to consistent, demonstrable improvement in both Safety and Performance. Profitability is linked to Marketing, Financial Management and Administration, not just Performance.

All of which leads to the basic Professional Test for individuals:
 "Never Repeat, or allow to be repeated, Known Errors, Faults and Failures".

And the Raison d'être of Professional Associations or Bodies:
 To collect, preserve and disseminate Professional Learnings of Successes, Failures, Discovery and Invention.

Barry Boehm neatly summarises the importance of the Historical Perspective as:
Santayana half-truth: “Those who cannot remember the past are condemned to repeat it”

Don’t remember failures?
  • Likely to repeat them
Don’t remember successes?
  • Not likely to repeat them

All these statements are about Organisations as Adaptive Control Systems.

To effect change/improvement, there have to be reliable, objective measures of outputs and the means to effect change: Authority, the Right to Direct and Control, the ability to adjust Inputs or Direct work.

Which points the way as to why Outsourcing is often problematic:
  The Feedback Loop is broken because the hirer gives up Control of the Process.

Most Organisations that Outsource critical functions, like I.T., completely divest themselves of all technical capability and, from a multitude of stories, don't contract for effective Quality, Performance or Improvement processes.

They give up both the capability to properly assess Outputs and Processes and Control mechanisms to effect change. Monthly "management reports" aren't quite enough...


Business Metrics and "I.T. Event Horizons"

Is there any reason the "Public Service", as we call paid Government Administration in Australia, isn't the benchmark for good Management and Governance??

Summary: This piece proposes 5 simple metrics that reflect, but are not in themselves pay or performance measures for, management effectiveness and competence:
  • Meeting efficiency and effectiveness,
  • Time Planning/Use and Task Prioritisation,
  • Typing Speed,
  • Tool/I.T. Competence: speed and skill in basic PC, Office Tools and Internet tools and tasks, and
  • E-mail use (sent, read, completed, in-progress, pending, never resolved, personal, social, other).


Top Computing Problems

The 7 Millennium Prize Problems don't resonate for me...

These are the areas that do engage me:
The piece for the second item, "Multi-level memory" is old and not specifically written for this set of questions. Expect it to be updated at some time.

    Internetworking protocols

    Placemarker for a piece on Internetworking protocols and problems with IPv4 (security and facilities) and IPv6 (overheads, availability).

    "The Internet changes everything" - the Web 2.0 world we have is very different to where we started in 1996, the break-through year of 'The Internet' with IPv4.

    But it is creaking and groaning.
    Around 90% of all email sent is SPAM (Symantec quarterly intelligence report).

    And since 2004 when the "Hackers Turned Pro", Organised Crime makes the Internet a very dangerous place for most people.

    IPv6 protocols have been around for some time, but like Group 4 Fax before them, are a Great Idea, but nobody is interested...

    What are the problems?
    What shape could solutions have?
    Are there (general) solutions to all problems?

    Systems Design

    Are these new sorts of systems possible with current commercial or FOSS systems?
    What Design and Implementation changes might be needed?

    How do they interact with the other 'Computing Challenges' in this series?

    Flexible, Adaptable Hardware Organisations

    Placemarker for a piece on flexible hardware designs.

    I'd like to be able to buy a CPU 'brick' at home for on-demand compute-intensive work, like Spreadsheets.
    I'd like to be able to easily transfer an application, then bring it back again.

    Secondly, if my laptop has enough CPU grunt, it won't have the Graphics processing or Displays (type, size, number) needed for some work... I'd like to be able to 'dock' my laptop and happily get on with it.
    The current regime is to transfer files and have separate environments that operate independently and I have to go through that long login-start-apps-setup-environment cycle.

    I prefer KDE (and other X-11 Window Managers) to Aqua on Snow Leopard (OS X 10.6) because they remember what was running in a login 'session', and recreate it when I log in again.

    In 1995, I first used HP's CDE (IIRC) on X-11, that provided multiple work-spaces. This was mature technology then.

    It was only this year, 15 years on, that Apple provided "Spaces" for their users.

    We already have good flexible storage options for most types of sites.
    Cheap NAS appliances are available for home use, up to high-end SAN solutions for large Enterprises.

    For micro- and portable-devices, the main uses are "transactional" web-based.
    These scale well already, and little, if anything, can be done to improve this.

    Systems Design

    What flows from this 'wish list' is that no current Operating System design will support it well.
    The closest, "Plan 9", developed around 1990, allows for users to connect different elements to a common network and Authentication Domain:
    • (graphic) Terminals
    • Storage
    • CPU
    The design doesn't support the live migration of applications.

    Neither do the current designs of Virtual Machines (migrate the whole machine) or 'threads' and multi-processors.

    Datacentre Hardware organisation

    Senior Google staffers wrote The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines, which I thought showed break-through thinking.

    The importance of this piece is it wasn't theoretical, but a report of What Works in practice, particularly at 'scale'.

    Anything that Google, one of the star performers of the Internet Revolution, does differently is worthy of close examination.  What do they know that the rest of us don't get?

    While the book is an extraordinary blueprint, I couldn't help but ask a few questions:
    • Why do they stick with generic-design 1RU servers when they buy enough for custom designs?
    • How could 19-inch racks, designed for mechanical telephone exchanges a century ago, still be a good, let alone best, packaging choice when you build warehouse-sized datacentres?
    • Telecommunications sites use DC power and batteries. Why take AC, convert to DC, back to AC, distribute AC to every server with inefficient, over-dimensioned power-supplies?
    Part of the management problem with datacentres is minimising input costs whilst maximising 'performance' (throughput and latency).


    What can you learn from a self-proclaimed "World's Greatest"?

    Note: This document is copyright© Steve Jenkin 1998-2010. It may not be
    reproduced, modified or distributed in any way without the explicit
    permission of the author. [Which you can expect to be given.]

    Lessons from the World's Greatest Sys Admin - July 1998
    Presented at SAGE-AU Conference, July 1998
    Principles of System Admin
    Some WGSA Attributes
    About The WGSA
    Sayings of the WGSA.
    Some Sound Management Laws
    So What?
    How do you work with a "World's Greatest ..."
    Some "Good Stuff" I learnt from friends.
    Some of the WGSA's work



    (2005) The most frequent comment I received on the 'WGSA' talk was:
    "So you think you are the World's Best Sys Admin?"

    Answer: No, I am not the "WGSA".

    This paper is about someone, who is really an amalgam of a number of people, who regarded themselves as "The World's Best Sys Admin". They never verbalised this opinion - they just lived it.

    Unfortunately, as is the case with all self-appointed 'gurus' I've met, they had limited raw talent and an arrogance that prevented them admitting less-than-perfect performance, taking on board any useful criticism or correction, or learning new tools, techniques, processes, and organisations from others they didn't consider an authority.

    I apologise in advance to the reader that the paper is mostly about "negative" learning,
    or What Not To Do...

    I included a section on Good Things I've Learned to show that I wasn't totally preoccupied with the negative :-) But there were just too many good stories, and I really had to let steam off over this...

    Do you think you know the identity of "The WGSA"?

    You don't. For those that may even think it's them - No, it is not you.

    The observations and opinions here came over a considerable period of time. It's not a single person - and it's not just Administrators. I've met "WG Programmer", Architect, Designer, Tester, Integrator, Networker, Technical Manager and CIO.

    So - onto the main game - the paper as presented to SAGE-AU in Old Parliament House.
    If you were there - did you catch any of the lollies I threw, or even a chocolate egg?

    Feedback is something I'm interested in.

    Drop me a line if you have your own stories, can add more useful models, or if you're late to this and found it useful. Suggestions for improvement gratefully accepted and acknowledged.


    To codify and inform.

    Once a problem is recognised and named, you can start to understand and address it.


    Junior Sys Admins
    - If you work for one.
    Senior Sys Admins
    - If you work with one.
    - If you have one working for you.


    • Talk
    • War stories
    • Feedback
    • And lots of Opinion.

    The BIG Questions

    • So What?
    • How do you work with one?


    I spent a year in 96/97 contracting in Sydney for what should've been a large, prosperous Australian multinational. They hadn't paid a dividend since 1990 and were taken over by a Dutch company at the end of 1996.

    The I.T. group was appalling.
    Staff turnover in the Unix Support and Networking areas was high - close to 100% in 12-18 months! The company spent under 1% of turnover on I.T. - against the industry average of 5+%.

    It felt like we were doing the impossible - and we were.

    They'd outsourced their mainframe, embraced 'open systems', installed a large scale WAN, gone client-server, were developing GUI and O-O applications and had an Internet presence.

    They'd also radically downsized in two steps:- from 200+ to 30-40 staff in 18 months.

    They were fully Buzzword Compliant, but were going nowhere.

    I was privileged to meet two people - both ex-telecoms technicians who had moved into computing. One, the self-proclaimed WGSA, had been responsible for setting up the Unix environment, and its associated X.25 network, and had been the Unix support manager for a couple of years, until finally taking a job in 'Technology Planning' - but just doing more of the same.

    The other ex-tech I've remained good friends with. He moved into Networking after a career in Civil Aviation, then a TAFE. He had enough PC, Unix, and Internet knowledge 'to be dangerous'. He'd left behind at the TAFE an environment where just 2 of them supported and ran the whole state TAFE network, +1 for Unix, 1 for printers and passwords, and 2 on the HelpDesk. When the lot was outsourced and a crack systems company took over - they boast they can cut 10%-20% from any operation - they ended up having to spend more.

    The contrast was stark and savage - one had left behind a legacy of chaos and disorder, the other was undoing the damage and providing real business productivity.

    This talk is about that experience and what I've learned.

    Principles of System Admin

    These are my values and principles. Your mileage may vary.
    • Know why you're there - To satisfy others' business needs
    • Know what you Know, Know what you Don't Know, and don't be afraid to get assistance.
    • Obey Sound Management Laws.
    • Learn, Develop, and Stay Current.
      We learn through Invention, Discovery, and Failure
      "That which isn't growing, is dying"
    • Give Value for Money.
      • Actively seek ways to put yourself out of work.
      • Minimise recurrent costs - wages, maintenance/support/rental charges
      • Maximise Reusability, Flexibility, Functionality, Reliability/Robustness.
    • Provide what's needed, not apparently wanted.
    • Listen and communicate with your users.
      • Provide Solutions.
      • Focus on Outcomes.

    Some WGSA Attributes

    • They don't exist outside fertile ground. They have to be allowed
      and encouraged by management and peers.
    • Only Dysfunctional people thrive and rise in dysfunctional
      Good people leave broken places - possibly after fighting for a time.
      The only other alternative is to withdraw and retreat into minimal
    • People are the ONLY asset of I.T. Organisations

      • Hardware: $1M to zero in 3 years
      • Software: $100k to zero in 3 minutes
      • Network: $250/point to zero in 3 seconds
    • Indicators:
      • High staff turnover
      • High contractor ratio
      • N.I.H. - Resistance to Change
      • Lack of "Professional" work habits - Defined Processes,
        Designated Responsibilities, Delegated Authority
      • Lack of History, Documentation, Policy, Procedures, Config Mgt,
        Version Control, Handovers, Induction
      • Chaos and Frenzy. Apparently understaffed and overworked - definitely unorganised
        "No time to fix problems, too busy fixing faults."
      • Nobody tasked with automating jobs or passing work back to level 1 support.
      • Every install project goes into crash mode.
        No standard, fast, system builds.
      • Maintenance frenzy - never seems to get any better.
      • Lack of fault analysis, reviews, Post Mortems, Post Implementation Reviews, capacity planning.
      • Single source of innovation and improvement - The "Guru"
      • The "BIG BANG FIX" is coming [Or the "Silver Bullet"]
        Nothing can be done, because "someone" [WGSA or friend] is creating "the Solution to All Our Problems".
      • (Senior) Management "Swooping" is allowed and tolerated.
      • "Don't show me problems, Show me solutions"
      • Mentoring and skills transfer absent
      • Constant reactive, not proactive, administration.
      • No organisation accountability - fail to do any task - routine or project - with impunity.
      • Blaming and Recrimination normal. No attempt to perform 'root cause analysis' and rectify faults.
      • No recognition or rewards for work well done.
      • Few Diagnostic, debugging, or troubleshooting Tools - even for common failure modes.
      • No Communication - up or down.
      • No Performance Indicators or Measurement/Assessment

    About The WGSA

    Of course he was the best. He had read every single 'white paper' from the vendor, and with his photographic memory, could recite it all back. All he needed to know was in those papers, and the manuals he'd read.

    He didn't need to meet and talk with his peers, he had none anyway! He had no need of professional organisations or finding out what had worked, or not, for other people.

    If he didn't have time to do something himself, he would get in a contractor, create a project, or hire a consultant. Funnily, these people were always only of very modest ability. The projects mostly ran out of money in "phase 1", when only the basic work was being done and well before the real benefits were to accrue.

    He'd written 25,000 lines of shell script to provide a "common" menuing and execution environment. It was a most flexible, adaptable, and configurable environment - and surprisingly similar to that run by his previous employer. Just the thing to control 12 machines... It was a real engineering triumph - for 1982! He'd built and deployed all this with no version control, configuration management, or documented release and maintenance procedures - and certainly no review.

    His crowning glory, "Xferutility", 7,000 lines in a single script, heavily utilised 'comes from' control files [they just appeared places, with no trace of whence they came], and could use 'rcp', 'ftp', and e-mail to achieve the functionality of uucp. Plus, it was the transfer mechanism, the interactive menu, the scheduler, and the status reporter. All things to all people bar those left to maintain it.

    Having not apparently done "Programming 1A", he'd not been introduced to the concepts of "coupling and cohesion" - put together everything that belongs together, separate unrelated concerns - and least necessary complexity.

    To go from the login prompt to the first displayed menu, over a dozen files or scripts were executed - often in perverse order. The system drive defaults would overwrite the local definitions!
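
The ordering problem is easy to show in miniature. A minimal sketch, assuming settings are applied in layers where whatever is applied last wins; the function and the setting names and values are hypothetical, not taken from the actual scripts.

```python
from typing import Mapping


def resolve_settings(*layers: Mapping[str, str]) -> dict[str, str]:
    """Merge settings layers; later layers override earlier ones."""
    merged: dict[str, str] = {}
    for layer in layers:
        merged.update(layer)
    return merged


# Hypothetical values, for illustration only.
system_defaults = {"PRINTER": "lp0", "EDITOR": "vi"}
local_definitions = {"PRINTER": "floor3_laser"}

# Sensible order: general defaults first, local definitions last.
settings = resolve_settings(system_defaults, local_definitions)

# Perverse order: the defaults silently clobber the local choice.
clobbered = resolve_settings(local_definitions, system_defaults)
```

With the sensible order the local printer survives and the default editor still applies. With the perverse order the local printer is lost and nothing reports it.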

    He also seemed unaware of basic capacity planning issues - like tracking the number of systems in the machine room and providing adequate rack space and cabling. Backups were another story entirely.

    The I.T. department policy was to have separate small systems for every division, no two the same. In 12 months it went from 12 systems in the machine room, to 23. And then to 35+ in the next 6 months.

    Having labelled me "a cannonball contractor who won't be around in the long term", he resigned the week he penned it, took an overseas holiday [run in the same flexible fashion], and rejoined his previous employer, through a services company, performing Network Management.

    Sayings of the WGSA.

    A few of these are paraphrases.

    What I find myself saying often is :-
    Why would you want it any other way?, and
    (2005) Would you expect any less?.

    The answers to these questions are usually: Yes, any other way, and "NO!".

    Sayings and tactics of WGSA and friends:
    • A basic tactic: Plan, Plan, Plan - and produce massive documents everyone else has to review.
      Nothing will actually get done.
    • Another basic tactic: Reality is at Fault, Adjust your Perceptions.
    • "It's worked that way for 3 years - it couldn't be broken now." [A basic tactic. You obviously have got the nature of the fault wrong. Ignorance and Rigidity are a powerful combination.]
    • Another basic tactic: Concentrate on the trivial, the Big Issues will fix themselves.
    • "You don't understand the full range of issues or complexities." [I know, you don't.]
    • "It works/worked fine for me..." [Hasn't told you or Reality is at fault.]
    • "Read the documentation I wrote." [But hasn't told you about.]
    • "You have to fully document that." [An attempt to divert, stall, or put you off.]
    • "The client doesn't want that." [Were they ever asked? Were they ever given options?]
    • "They [the clients] never asked." [Deflection. Clients are expected to be technical experts.]
    • "It's UnAuthorised." [But where is the Policy on that?]
    • "It's not Standard" [It's free. We have to pay heaps or the other boys will think we're not cool.]
    • "We can't afford that." [May be true, but unlikely based on the money chucked around on junkets/trinkets for the favoured few.]
    • "It's freeware. It's not supported." [Often said without a hint of irony in response to 'costs too much'.]
    • "If you can Cost Justify that..." [A stalling tactic. Nothing you put up will ever get approved.]
    • "You Just ..." [Makes you out to be a fool/incompetent, even though there is no way you could've known.]
    • "Why haven't you ... <said angrily>" [So how would you know to do that, when you haven't been told about it?]
    • "It's really flexible/efficient/configurable/Easy when you use it... " [Defending a wildly over-complicated script]
    • "We need it because ...or We have to do it that way." [Of course there is nothing written to back it up. The WGSA wrote it, so it's going to stay.]
    • "We won't discuss that [now]." [No argument if there is no discussion.]
    • "That's not the way we do it around here." [No change is possible. Of course, nothing is written down and there is no Policy to back that up.]
    • "You can't do/say that." [Controlling.]
    • "What is the Vendor's policy on replacing that?" [Deflect and control. Of course the vendor doesn't have a written policy on when something is broken.]
    • "The Vendor's White Paper/Documentation says ..." [Appeal to another Authority. Stifle argument. Don't let facts or prior experience get in the way.]
    • "The Consultant's Report says ..." [Appeal to another Authority.]
    Remember, there are rules for him and another set for you.
    He will ignore e-mail, talk about you behind your back, set impossible deadlines [for you], and not keep his promises. Don't expect to be told about important stuff that impacts you, or that you happen to be expert in. You won't get invited to meetings, see reports, or be involved in the 'discussions' held before major decisions are announced.

    Rumour, disinformation, and 'Need to Know' are powerful tools for the WGSA.

    He will casually drop bombshells, regularly spring 'surprises' on you, and practise 'Divide and Conquer' extremely well. He allocates work, but will never help or clarify what he wants. And of course, won't follow up on it. He may fly into a 'justifiable rage' if he comes back in a month and something hasn't been done to his satisfaction... It's not easy being so perfect and all-knowing all the time.

    Rational argument won't work with the WGSA. What matters is that he thought it up, he's important, and the bosses [his mates], think he is an absolute Guru on everything.

    And if you ever get close to criticising him or winning an argument - slander and libel work just fine for him.

    Some Sound Management Laws

    (2005) Note: I don't try to come up with any principles or Laws that the "WGSA" follows.
    There is probably only one: "Seize Every Opportunity". Which isn't a bad dictum, if it respects other people, fulfills your business's needs and goals and isn't only about advancing your personal agenda.

    My version of "Sound Management Laws" is presented for you to consider, and to understand where I am "Coming From":
    • Delegate Authority with Responsibility and Accountability.
    • Follow up, Follow through, Be Consistent.
    • Value and Empower your staff: People are your only asset.
    • Do It NOW!
    • Follow The Quality Circle: Plan, Act, Evaluate.
    • Encourage and Reward Professional Behaviour, deal quickly with repeats of poor behaviour.
    • Lead by Example.
    • Forge, maintain, and support Teams.
    In I.T. there are special management considerations:
    • Users come first
      • Satisfy Business Needs
      • Actively sell your successes and services to your users.
      • Constantly set and manage users' expectations.
      • Inform, advise, consult
      • Be Honest and forthright - especially about your mistakes and failures.
        Take care to explain Why it won't happen again.
      • Be Proactive. You get to drive the technology, they drive the business operations.
    • Know Yourself, Your Staff, Your Tools.
    • Never take on a job you cannot do.
    • Don't give others jobs they can't do.
    • Risk Management, Reviews, and 'Performance Audits' are your chief tools in establishing a Learning organisation.
    Good working relationships between management and staff take time and effort to develop. They proceed through the following stages and are fragile. The whole lot, years of work, can be destroyed in an instant with a lie.

    What management want are people they trust, who work very hard and consistently produce quality work. People who hold the company's best interests to heart.

    Development Stages of People and Teams:
    • Honesty, Integrity, Openness, Frankness, Consistency
    • TRUST

    So What?

    Since the advent of the 486 in ~91, cheap LANs in ~94, and the Net in ~96, I.T. systems and infrastructure have become essential and critical for all business operations. Systems Administration, Networking, Help Desk, and Database Admin are the glue that holds it all together from day to day.

    There is a myth that software doesn't wear out like machinery.
    The bits don't change, so it must be OK! By implication, you don't need to "maintain" systems and software, like you do machines.

    So why aren't we all running 286's and DOS 3.3?

    It's called 'bit rot'. The software doesn't change, but the environment does - which gives the same net effect. Year 2000 isn't a problem until your clock says 01/01/00.
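
The Year 2000 case shows the effect in a few lines. A sketch only: the function names are hypothetical, and the pivot of 70 used to expand two-digit years is an assumption chosen for illustration, not a rule from any particular system.

```python
def years_between_two_digit(start_yy: int, end_yy: int) -> int:
    """Naive interval on two-digit years, as much legacy code did it."""
    return end_yy - start_yy


def years_between_windowed(start_yy: int, end_yy: int, pivot: int = 70) -> int:
    """Same interval after expanding each year with a fixed pivot window.

    Assumed window: 00..pivot-1 means 20xx, pivot..99 means 19xx.
    """
    def expand(yy: int) -> int:
        if not 0 <= yy <= 99:
            raise ValueError("expected a two-digit year")
        return 2000 + yy if yy < pivot else 1900 + yy

    return expand(end_yy) - expand(start_yy)


# The code is identical on both days; only the clock has moved on.
before_rollover = years_between_two_digit(97, 99)   # 2 years, as expected
after_rollover = years_between_two_digit(97, 0)     # -97 years
repaired = years_between_windowed(97, 0)            # 3 years
```

Not one bit of the naive function changed between the two calls. The environment did, and that was enough to break it.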

    My argument is that, ignoring management and leadership issues, company profitability is related directly to staff efficiency [$ cost / $ sales] and new product evolution. These are driven directly by I.T. capability, which requires systems be constantly upgraded and enhanced - just to stay where you are! Similarly, I.T. operations staff must be continually increasing their own efficiency just to keep up.

    (2005) See the 2003 Harvard Business Review article "I.T. Doesn't Matter" by Nicholas G. Carr.

    Effective Systems Administration is the single greatest point of leverage in the I.T. infrastructure - which is itself the single greatest point of leverage in an organisation. It amplifies and extends the
    thinking, analysis, and decision making ability of the people in the organisation. Even sometimes the managers. It can even provide some corporate memory - a prerequisite for Knowledge.

    It's obvious the software in airplanes, spaceships, nuclear reactors, medical instruments, weapons systems, banks, and ATMs has to be correct, robust, and dependable or there are disastrous, often immediate, consequences. People die or billions go missing. [Roll on NT - reactor control!]

    What's not obvious is the long, lingering decline and demise of businesses - large and small.

    The cost to Australia of losing a multi-billion dollar multinational company is incalculable. Well managed and well led, it could still be a potent force on the global stage. Instead we have lost profits, destroyed assets, and put a few thousand people out of work.

    (2005) On May 28 2001, Australia's fourth largest telco, One.Tel, ceased trading on the ASX.
    The Packer and Murdoch families, who control the media conglomerates PBL and News Corporation, lost about A$1 billion in the debacle. A major factor in the failure was uncollected "receivables". The computer billing system was faulty.

    One.Tel closely followed the failure of HIH Insurance and Impulse Airlines.

    That's a disaster 10 times bigger than TWA-800 going down outside New York just after take-off in 96, and they are still fishing out pieces. Just because it is in glorious slow motion - taking a decade, not a minute to unfold - doesn't mean we shouldn't still be as concerned with businesses going down as with aircraft crashes. People's lives are destroyed and assets lost just as thoroughly in both types of crashes.

    The government and professional bodies should be just as concerned with these outcomes and ensuring they can never happen again.

    How do you work with a "World's Greatest ..."

    I don't have an answer.

    My style has been described as "Straight Up the Middle, with lots of smoke and noise."

    My only response is to recognise an intractable situation early and leave as quickly as you can. A luxury I can afford, having no dependants and a low level of debt.

    I Need an answer and would like - Your Feedback.

    Some "Good Stuff" I learnt from friends.

    • Know what's important. Focus on that, ignore the trivia.
    • Practice - Order, Discipline, Rigour.
    • The job isn't done until your records are up to date.
    • Professionals do for $100k what anyone can do for $1M.
    • Remember Good Ways to do things when you see them.
    • ASK other people - what works? What doesn't.
    • STANDARDISE. Make it so there is just one way things happen.
    • Be prepared to work odd hours to not impact your users.
    • Hit your deadlines.
    • Clean up as you go.
    • The details are important.
    • DON'T accept a job you can't do.
    • Be Proactive, not Reactive.
    • Practice 'Root Cause Analysis' - fix faults and processes, not just
    • You have to stay on the leading edge. This takes lots of time and experimentation.
    • There is NO substitute for ability, experience, and general knowledge.
    • Aim for 100% reliability. Know what you have to do to achieve it.
    • If you make Rules, apply them without exception.
      You may get called The Network Nazi, but it will all work and you will be respected.
    • Be personally flexible when dealing with users. Meet their needs, not just their expressed desire. This may involve some education.
    • Let users know what's happening.
    • Protect your staff from the vicissitudes of Management.
    • Freeware is FINE. If it meets the need, use it.
    • Know and Explore your tools.

    Some of the WGSA's work

    Here is a [longish] list of some of the wonderful technical and process problems I came across. Remember this was a largish, not huge, enterprise. There were only 75 Unix hosts, a thousand or so users [total], and a network that went to less than 100 sites.

    Many of the systems were front-ends to the mainframe or a production system for the business.
    The Unix support team was mostly 3 people, sometimes with a manager, sometimes with people doing performance analysis/reporting, or 'implementations' - such as HP Openview [I.T. Operations].

    • Common Environment: 25,000 lines of Shell Script. A good technology for 1982, not 1997.
      Very poorly written. Basic programming rules of 'Coupling and Cohesion' violated.
    • All actions implemented as shell functions, but merged with interactive menu system. Extremely heavy reliance on Environment variables, with perverse re-mapping of names.
    • 'Standard Operating Environment'. More shell script! No concept of standard builds, current patch levels, consistent program versions, or automatic software updates. 12 or more months of wasted effort. [Sold to the management initially on the great results from HP's internal network.
      With 100,000 PC's and 23,000 Unix hosts spread over 660 sites, they saved US$200M/year in support costs alone by adopting a 'Common Operating Environment'. That was based on keeping all systems up to the same versions of software and config files.]
    • Xferutility: 7,000 lines of shell script, doing a subset of uucp's functionality.
      Insidious bugs like:-

      • Using the (local) return code of 'rsh' and thinking it was all working.
      • Using rcp and not checking for a previous aborted transfer.
      • [Destination file ends up with zero modes. Not writable by owner. Copy aborts, but script keeps chugging along.]
    • HP-UX 10 'bug'. #!/bin/ksh missing. Default '/sbin/sh' used with surprising results - 'exit' doesn't work.
    • /usr/local/bin banned. All executables and tools to reside in admin's home directories.
    • Common Admin logons banned. But 'essential utilities', like Xferutility, used a common account with .rhosts trusted all over the place, and even privileged access possible with sudo.
    • Common User Home directories basic to functioning of 'Common Environment' scripts. Ran ~/.profile to start menu, which [eventually] ran ~/$LOGNAME.profile.
    • NO master passwd file. No unique UID's, but notionally unique LOGNAME's.
    • NO mechanism to add or remove users from multiple machines.
    • NO shadow password files.
    • NO password aging.
    • NO retiring of unused accounts. No checking for intrusions.
    • Default password of LOGNAME. Never checked and never reset.
    • Help Desk's 'Password reset' function broken on most machines. No corrective action taken.
    • Crack broke 80% of the passwords on the central admin hub. [Including that of WGSA]. Nothing was done.
    • WGSA login setup on all systems, with .rhosts back to the admin hub, and 'sudo' access to 'mv' and 'cp'. WGSA had two passwords, family member names + digit. These were well publicised to all admins, and others.
    • NO definitive list of managed hosts.
    • DNS control files rebuilt every time from a 'hosts' file with 'host_to_named'.
    • NO alternate DNS primary.
      A single central machine contained all the network services - DNS, e-mail, dial-in access, administration, master copies of scripts and system config files, root
      passwords for all machines. This 'admin hub' was trusted, and could access all other systems. There was no fail-over system or contingency plan for massive failure.
    • Crippled DNS secondaries.
      This was for 'security'. There was NO IP access control in the network. A user with only a little knowledge could navigate the entire network. There was an IP path back to the central DNS, and the IP numbers were allocated in an orderly fashion.
    • Internal domain left at: XXXXX.com. Even where a firewall was installed with the domain of XXXXX.com.au!
    • Even with over 2000 device entries in the DNS, and a strong numbering plan initiated by Networks, running sub-domains was firmly and frequently rejected.
    • Win-NT and DHCP posed no problem for the DNS. Permanent number leases were granted.
    • 10 or more IP address ranges in use. Including a Class-B [the company owned], and other cute addresses like, 150.150.x.x [Wells Fargo's!]
    • IP over X.25 was chosen in 1994.
      Routers were 'too expensive'. By March 1996 there were massive network failures - morning and afternoon - due to overload of the $250k X.25 switches.
      Expensive terminal servers were deployed widely, 'because they handle IP over X.25'. Most production support problems related to config mgt, Network, printers, or terminal servers.
    • Untested backup tapes. In spite of a failure resulting in almost total loss of backup tapes for a system, no testing of readability of backups was performed.
    • Configuration Management consisted of copies of scripts in the WGSA's home directory.
      NO mechanism for rolling out fixes to faults as found.
    • Version Control consisted of block comments at the start of the scripts.
    • Common Code duplicated across 'menus'.
    • Hard coded 'user types' in Common Environment scripts.
    • File names not distinguished by hostname. All called 'AdminMenu' for the 'Admin' user.
    • nonStandard capitalisation of file Names and environment variables.
    • Very early version of 'sudo' used and modified. Non-standard config files. No repository of config files. No version control. [And WGSA didn't believe me when I found a long standing bug in his code.]
    • No reviews of code, scripts, systems.
    • Little testing of new code. Try it live!
    • No documented procedures for standard tasks.
    • No records of faults fixed.
    • No regular analysis or reporting of production faults.
    • No running sheets on production faults.
    • No weekly section meeting. No dissemination of information, plans.
    • No standard machine builds. [Complex and long procedures to build the production systems - with many variants.]
    • No capability to track or report critical file changes on production systems.
    • Network Naming Standard defined [but not for Printers and print queues]:
      ux div 2 loc nr : 11 chars. Accepted by hostname, not by uname

      • ux = Unix,
      • div = Division 3 letter code,
      • 2 = 1st digit of state postcode,
      • loc = 3 letter code for town/suburb, arbitrarily assigned,
      • nr = 2 digit machine number
      WGSA Response: Set hostname to the long name, and uname to the old short
      [So what's the standard??]
    • X.400 was chosen as the 'Standard external E-mail system'.
    • HP Openview [@$100k ?], was chosen as the corporate mail system - 'because it could make an address a program'.
    • External E-mail addresses were:- Firstname_Lastname@XXXXX.com.au
      It took a long and bloody fight to get a script into production that used the Net standard of 'First.Last@XXXXX.com.au', generated all the usual abbreviations, and allowed specific people to be included/excluded. This of course was removed a week or so after I left... [Only for them to hurriedly fall back to a manual list once they found a mail-loop problem.]
    • There were over 10 printing mechanisms, no map of network printers, and no naming standard for printers. [There was a printer called 'printer', and more than one called 'laser'.] Of course, nothing was documented on how it all worked, what got changed, or subtle faults found.
    • No disaster recovery or contingency plans existed. Hardware in the old AIX boxes occasionally died and caused not inconsiderable panic to the new admins.
    • The machine room had no sensible layout - even though it was newly installed in 96. There was a single ethernet for all the production, development, accounting, and maintenance systems.
    • Disk Layouts were recorded nowhere.
    • There was no consistency or standard way to lay out Logical Volumes on disks.
    • The Journalling Filesystem [Veritas], was supposedly 'banned' from all HP-UX 10 systems. The defrag and on-the-fly extend utilities were an extra [pay for] package, so the 'free' part couldn't be used.
    The watchword for the I.T. branch was 'CHEAP'.
    [Do you think that was in any way related to the company dying?]
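
    The first Xferutility bug in the list above arises because classic 'rsh' reports its own (local) exit status, not that of the remote command. One common workaround, sketched here in Python, is to have the remote shell print the command's status as a final marker line and to parse that locally. The marker string and both function names are hypothetical, chosen only for illustration; this is a sketch of the idea, not what Xferutility did.

```python
from typing import Tuple

# Hypothetical sentinel; any string unlikely to appear in real output will do.
STATUS_MARKER = "__REMOTE_RC__="


def wrap_remote_command(command: str) -> str:
    """Build a shell line that runs `command`, then prints its exit status."""
    return f"{command}; echo {STATUS_MARKER}$?"


def parse_remote_status(output: str) -> Tuple[str, int]:
    """Split captured remote output into (payload, remote exit status).

    A missing marker means the remote shell never reached the echo,
    so it is reported as a failure rather than assumed to be success.
    """
    lines = output.rstrip("\n").split("\n")
    last = lines[-1]
    if not last.startswith(STATUS_MARKER):
        raise RuntimeError("remote status marker missing: treat as failure")
    return "\n".join(lines[:-1]), int(last[len(STATUS_MARKER):])
```

    The wrapped string is what would be passed to rsh as the remote command; the script then acts on the parsed status instead of rsh's own return code.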


    There are some people out there that don't just think, but know, they are the best.

    They are dangerous.

    Left unchecked they will not only make life a misery for everyone around, they help bring companies,
    even very large ones, down.

    What singles them out is their inability to take input from others.

    Typical behaviours are:
    • Rigidity. Nothing can be changed.
    • Control. They have to say how everything is done.
    • Fixation. Things have to be done their way or not at all.
    • Discipline, rigour, defined processes. Usually absent. Always perverted.
    • Favoured Few. There is always an inner sanctum who control everything.
    If they are well settled and well regarded, the organisation is dysfunctional. It will be soul destroying staying.

    The only defense I know against them, once entrenched, is to leave.

    And thank you all for your patience. I hope you have taken something away from all this...

    Questions and Comments, please.

    Page Last Updated:
    Fri 30 Jul 2010 09:19:47 EST (to blogger)
    Wed Feb 1 19:17:47 EST 2006
    02-Jul-98  (first version)


    Microsoft Troubles - IX, the story unfolds with Apple closing in on Microsoft size.

    Three pieces in the trade press showing how things are unfolding.

    Om Malik points out that Intel's and Microsoft's fortunes are closely intertwined.
    Jean-Louis Gassée suggests that "Personal Computing" (on those pesky Personal Computers) is downsizing and changing.
    Joe Wilcox analyses Microsoft's latest results and contrasts them a little with Apple's.


    Everything Old is New Again: Cray's CPU design

    I found myself writing, during a commentary on the evolution of SSD's in servers, that large, slow memory like Seymour Cray used (not cache) would affect the design of Operating Systems. The new scheduling paradigm:
    Allocate a thread to a core, let it run until it finishes and waits for (network) input, or it needs to read/write to the network.
    This leads into how Seymour Cray dealt with Multi-Processing. He used multi-level CPU's:
    • Application Processors (AP's), with many bits and many complex features like Floating Point and other fancy stuff, but with no kernel-mode features or access to protected regions of hardware or memory, and
    • Peripheral Processors (PP's), really a single very simple, very high-speed processor multiplexed to look like 10 small, slower processors, which performed all kernel functions and controlled the operation of the Application Processors.
    Not only did this organisation result in very fast systems (Cray's designs were the fastest in the world for around 2 decades), but very robust and secure ones as well: the NSA and other TLA's used them extensively.

    The common received wisdom is that interrupt-handling is the definitive way to interface unpredictable hardware events with the O/S and the rest of the system, and that polling devices, the old way, is inefficient and expensive.

    Creating a fixed overhead scheme is more expensive in compute cycles than an on-demand, or queuing, system, until the utilisation rate is very high. Then the cost of all the flexibility (or Variety in W. Ross Ashby's Cybernetics term) comes home to roost.

    Piers Lauder of Sydney University and Bell Labs improved total system throughput of a VAX-11/780 running Unix V8 under continuous full (student/teaching) load by 30% by changing the serial-line device driver from 'interrupt handling' to polling.

    All those expensive context-switches went away, to be replaced by a predictable, fixed overhead.
    Yes, when the system was idle or at low load, it spent a little more time polling, but only marginally more.
    And if the system isn't flat-out, what's the meaning of an efficiency metric?
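
    The trade-off described above can be put into a toy cost model. The sketch below is in Python and every number fed to it is an illustrative assumption, not a measurement from the VAX-11/780 work; the function names are hypothetical. It shows a fixed polling overhead losing at low event rates and winning once the event rate passes a break-even point.

```python
def interrupt_cost(events_per_sec: float, cost_per_interrupt_us: float) -> float:
    """CPU microseconds used per second when every event raises an interrupt."""
    return events_per_sec * cost_per_interrupt_us


def polling_cost(polls_per_sec: float, cost_per_poll_us: float,
                 events_per_sec: float, cost_per_event_us: float) -> float:
    """Fixed polling overhead plus the (cheaper) per-event handling cost."""
    return polls_per_sec * cost_per_poll_us + events_per_sec * cost_per_event_us


def break_even_rate(polls_per_sec: float, cost_per_poll_us: float,
                    cost_per_interrupt_us: float, cost_per_event_us: float) -> float:
    """Event rate above which polling uses less CPU than interrupts."""
    saving_per_event = cost_per_interrupt_us - cost_per_event_us
    if saving_per_event <= 0:
        return float("inf")  # interrupts are never more expensive per event
    return polls_per_sec * cost_per_poll_us / saving_per_event


if __name__ == "__main__":
    # Assumed figures: 1000 polls/s at 5us each, 50us per interrupt
    # (context switch included), 10us to handle an event found by polling.
    for rate in (100, 1000):
        print(rate, interrupt_cost(rate, 50.0), polling_cost(1000, 5.0, rate, 10.0))
    print("break-even events/sec:", break_even_rate(1000, 5.0, 50.0, 10.0))
```

    With these assumed figures, interrupts are cheaper at 100 events per second and polling is cheaper at 1000, which is the shape of the result reported above for a system under continuous full load.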

    Dr Neil J Gunther has written about this effect extensively with his Universal Scalability Law and other articles showing the equivalence of the seemingly disparate approaches of Vector Processing and SMP systems in the limit of their performance.
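
    For readers who haven't met it, Gunther's Universal Scalability Law gives relative capacity at N processors as C(N) = N / (1 + α(N-1) + βN(N-1)), where α models contention and β models coherency delay. A minimal Python sketch follows; the parameter values used in the example are assumptions for illustration only, not fitted to any real system.

```python
import math


def usl_capacity(n: float, alpha: float, beta: float) -> float:
    """Relative capacity C(N) under the Universal Scalability Law."""
    return n / (1.0 + alpha * (n - 1) + beta * n * (n - 1))


def usl_peak(alpha: float, beta: float) -> float:
    """Processor count at which C(N) peaks: sqrt((1 - alpha) / beta)."""
    if beta == 0:
        return math.inf  # no coherency penalty, so capacity never turns down
    return math.sqrt((1.0 - alpha) / beta)


if __name__ == "__main__":
    alpha, beta = 0.05, 0.001  # assumed contention and coherency terms
    for n in (1, 8, 31, 64):
        print(n, round(usl_capacity(n, alpha, beta), 2))
    print("peak near N =", round(usl_peak(alpha, beta), 1))
```

    With both terms at zero the curve is perfectly linear; any non-zero β means that adding processors past the peak reduces throughput.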

    My comment about big, slow memory changing Operating System scheduling can be combined with the Cray PP/AP organisation.

    In the modern world of CMOS, micro-electronics and multi-core chips, we are still facing the same Engineering problem Seymour Cray was attempting to address/find an optimal solution to:
    For a given technology, how do you balance maximum performance with the Power/Heat Wall?
    More power gives you more speed, but this creates more Heat, which results in self-destruction: the "Halt and Catch Fire" problem. Silicon junctions/transistors are subject to thermal run-away: as they get hotter, they consume more power and get hotter still. At some point that becomes a vicious cycle (positive feedback loop) and it's game over. Good chip/system designs balance on just the right side of this knife edge.

    How could the Cray PP/AP organisation be applied to current multi-core chip designs?
    1. Separate the CPU designs for kernel-mode and Application Processors.
      A single chip needs only have a single kernel-mode CPU controlling a number of Application CPU's. With its constant overhead cost already "paid for", scaling of Application performance is going to be very close to linear right up until the limit.
    2. Application CPU's don't have forced context switches. They roar along as fast as they can for as long as they can, or until the kernel scheduler decides they've had their fair share.
    3. System Performance and Security both improve by using different instruction sets and processor architectures for different applications. While a virus/malware might be able to compromise an Application, it can't migrate into the kernel unless the kernel itself is buggy. The Security Boundary and Partitioning Model is very strong.
    4. There doesn't have to be competition between the kernel-mode CPU and the AP's for cache memory 'lines'. In fact, the same memory cell designs/organisations used for L1/L2 cache can be provided as small (1-2MB) amounts of very fast direct access memory. The modern equivalent of "all register" memory.
    5. Because the kernel-mode CPU and AP's don't contend for cache lines, each will benefit hugely in raw performance.
      Another, more subtle, benefit is the kernel can avoid both the 'snoopy cache' (shared between all CPU's) and VM systems. It means a much simpler, much faster and smaller (= cooler) design.
    6. The instruction set for the kernel-mode CPU will be optimised for speed, simplicity and minimal transistor count. You can forget about speculative execution and other really heavy-weight solutions necessary in the AP world.
    7. The AP instruction set must be fixed and well-known, while the kernel-mode CPU instruction set can be tweaked or entirely changed for each hardware/fabrication iteration. The kernel-mode CPU runs what we'd now call either a hypervisor or a micro-kernel. Very small, very fast and with just enough capability. A side effect is that the chip manufacturers can do what they do best - fiddle with the internals - and provide a standard hypervisor for other O/S vendors to build upon.
    Cheaper, Faster, Cooler, more robust and Secure and able to scale better.

    What's not to like in this organisation?

    A Good Question: When will Computer Design 'stabilise'?

    The other night I was talking to my non-Geek friend about computers and he formulated what I thought was A Good Question:
    When will they stop changing??
    This was in reaction to me talking about my experience in suggesting a Network Appliance, a high-end Enterprise Storage device, as shared storage for a website used by a small research group.
    It comes with a 5 year warranty, which leads to the obvious question:
    will it be useful, relevant or 'what we usually do' in 5 years?
    I think most of the elements in current systems are here to stay, at least for the evolution of Silicon/Magnetic recording. We are staring at 'the final countdown', i.e. hitting physical limits of these technologies, not necessarily their design limits. Engineers can be very clever.

    The server market has already fractioned into "budget", "value" and "premium" species.
    The desktop/laptop market continues to redefine itself - and more 'other' devices arise. The 100M+ iPhones, in particular, already out there demonstrate this.

    There's a new major step in server evolution just breaking:
    Flash memory for large-volume working and/or persistent storage.
    What now may be called internal or local disk.
    This implies a major re-organisation of even low-end server installations:
    Fast local storage and large slow network storage - shared and reliable.
    When the working set of Application data in databases and/or files will fit on (affordable) local flash memory, response times improve dramatically because all that latency is removed. By definition, data outside the working set isn't a rate limiting step, so its latency only slightly affects system response time. However, throughput, the other side of the Performance Coin, has to match or beat that of the local storage, or it will become the system bottleneck.
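
    The working-set argument can be made concrete with the usual weighted-average latency formula. The Python sketch below assumes a hit ratio and two latencies purely for illustration (the function name and all figures are hypothetical); it shows why slow network storage barely moves the average once almost every access is served from local flash.

```python
def mean_access_latency(hit_ratio: float, local_latency_ms: float,
                        remote_latency_ms: float) -> float:
    """Average latency when `hit_ratio` of accesses are served locally."""
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must be between 0 and 1")
    return hit_ratio * local_latency_ms + (1.0 - hit_ratio) * remote_latency_ms


if __name__ == "__main__":
    # Assumed figures: 0.1ms for local flash, 10ms for network storage.
    for hit_ratio in (0.50, 0.90, 0.99):
        print(hit_ratio, round(mean_access_latency(hit_ratio, 0.1, 10.0), 3))
```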

    An interesting side question:
     How will Near-Zero-Latency local storage impact system 'performance', both response times (a.k.a. latency) and throughput?

    I conjecture that both system latency and throughput will improve markedly, possibly super-linearly, because one of the bug-bears of Operating Systems, the context switch, will be removed. Systems have to expend significant effort/overhead in 'saving their place', deciding what to do next, then when the data is finally ready/available, to stop what they were doing and start again where they left off.

    The new processing model, especially for multi-core CPU's, will be:
    Allocate a thread to a core, let it run until it finishes and waits for (network) input, or it needs to read/write to the network.
    Near zero-latency storage removes the need for complex scheduling algorithms and associated queuing. It improves both latency and throughput by removing a bottleneck.
    It would seem that Operating Systems might benefit from significant redesign to exploit this effect, in much the same way that RAM is now large and cheap enough that system 'swap space' is now either an anachronism or unused.

    The evolution of USB flash drives saw prices/Gb halving every year. I've recently seen 4Gb SDHC cards at the supermarket for ~$15, whereas in 2008, I paid ~$60 for USB 4Gb.

    Rough server pricing for RAM in 2010 is A$65/Gb ±$15.
    List prices by Tier 1/2 vendors for 64Gb SSD is $750-$1000 (around 2-4 times cheaper from 'white box' suppliers).
    I've seen this firmware-limited to 50Gb to bring performance and reliability up to levels comparable with current production HDD specs.
    This is $12-$20/Gb, depending on which base size and price are used.

    Disk drives are ~A$125 for 7200rpm SATA and $275-$450 for 15K SAS drives.
    With 2.5" drives priced in-between.
    I.e. $0.125/Gb for 'big slow' disks and $1/Gb for fast SAS disks.

    Roll forward 5 years to 2015 and 'SSD' might've doubled in size three times, plus seen the unit price drop. Hard disks will likely follow the same trend of 2-3 doublings.
    Say SSD 400Gb for $300: $0.75/Gb
    2.5" drives might be up to 2-4Tb in 2015 (from 500Gb in 2010) and cost $200: $0.05-0.10/Gb
    RAM might be down to $15-$30/Gb.
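
    The per-Gb figures for 2010 can be checked with a few lines of Python. The sketch below uses the prices quoted above; the 1000Gb capacity for the SATA drive is an assumption implied by the $0.125/Gb figure rather than stated, and the function names are hypothetical.

```python
def price_per_gb(price: float, capacity_gb: float) -> float:
    """Unit price of storage in dollars per Gb."""
    return price / capacity_gb


def price_after_halvings(price: float, years: float,
                         halving_period_years: float = 1.0) -> float:
    """Project a price forward, assuming it halves every `halving_period_years`."""
    return price / 2 ** (years / halving_period_years)


if __name__ == "__main__":
    print("SSD, $750 over 64Gb:     ", round(price_per_gb(750, 64), 2))
    print("SSD, $1000 over 50Gb:    ", round(price_per_gb(1000, 50), 2))
    print("SATA, $125 over 1000Gb:  ", price_per_gb(125, 1000))
    print("4Gb flash, 2008 to 2010: ", price_after_halvings(60.0, 2))
```

    The last line reproduces the USB flash observation: $60 in 2008, halving each year, gives $15 in 2010.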

    A caveat with disk storage pricing: 10 years ago RAID 5 became necessary for production servers to avoid permanent data loss.
    We've now passed another event horizon: Dual-parity, as a minimum, is required on production RAID sets.

    On production servers, price of storage has to factor in the multiple overheads of building high-reliability storage (redundant {disks, controllers, connections}, parity and hot-swap disks and even fully mirrored RAID volumes plus software, licenses and their Operations, Admin and Maintenance) from unreliable parts. A problem solved by electronics engineers 50+ years ago with N+1 redundancy.

    Multiple Parity is now needed because in the time taken to recreate a failed drive, there's a significant chance of a second drive failure and total data loss. [Something NetApp has been pointing out and addressing for some years.] The reason for this is simple: the time to read/write a whole drive has steadily increased since ~1980. Capacity grows with linear recording density (bits per inch) times track density (tracks per inch), while read/write speed grows only with linear recording density times rotational speed, so capacity has increased faster than transfer rate.
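
    A rough feel for the rebuild-window risk can be had from the Python sketch below. It assumes independent drive failures at a constant rate (an exponential model) and a rebuild that runs at full sequential speed; both are simplifications, and the drive sizes, speeds and MTBF are illustrative assumptions rather than vendor figures. The model ignores unrecoverable read errors and correlated failures, so it likely understates the real risk.

```python
import math


def full_drive_read_hours(capacity_gb: float, throughput_mb_s: float) -> float:
    """Hours needed to read or write an entire drive at a sustained rate."""
    return capacity_gb * 1000.0 / throughput_mb_s / 3600.0


def second_failure_probability(surviving_drives: int, rebuild_hours: float,
                               mtbf_hours: float) -> float:
    """Chance that at least one surviving drive fails during the rebuild.

    Assumes independent failures with a constant rate of 1/mtbf_hours.
    """
    return 1.0 - math.exp(-surviving_drives * rebuild_hours / mtbf_hours)


if __name__ == "__main__":
    # Assumed drives: a small old one versus a large recent one.
    for capacity_gb, mb_s in ((36, 40), (2000, 100)):
        hours = full_drive_read_hours(capacity_gb, mb_s)
        risk = second_failure_probability(7, hours, 500_000)
        print(capacity_gb, "Gb:", round(hours, 2), "hours,", f"{risk:.2e}")
```

    Whatever figures are plugged in, the risk scales with the rebuild time, which is why growing capacity without matching transfer rates pushes production RAID sets towards dual parity.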

    Which makes running triple-mirrors a much easier entry point, or some bright spark has to invent a cheap-and-cheerful N-way data replication system. Like a general use Google File System.

    Another issue is that current SSD offerings don't impress me.

    They make great local disk or non-volatile buffers in storage arrays, but are not yet, in my opinion, quite ready for 'prime time'.

    I'd like to see 2 things changed:
    • RAID-3 organisation with field-replaceable mini-drives. Hot-swap preferred.
    • PCI, not SAS or SATA connection. I.e. they appear as directly addressable memory.

    This way the hardware can access flash as large, slow memory and the Operating System can fabricate that into a filesystem if it chooses - plus if it has some knowledge of the on-chip flash memory controller, it can work much better with it. It saves multiple sets of interfaces and protocol conversions.

    Direct access flash memory will always be cheaper and faster than SATA or SAS pseudo-drives.

    We would then see the following hierarchy of memory in servers:

    • Internal to server
      • L1/2/3 cache on-chip
      • RAM
      • Flash persistent storage
      • optional local disk (RAID-dual parity or triple mirrored)
    • External and site-local
      • network connected storage array, optimised for size, reliability, streaming IO rate and price not IO/sec. Hot swap disks and in-place/live expansion with extra controllers or shelves are taken as a given.
      • network connected near-line archival storage (MAID - Massive Array of Idle Disks)
    • External and off-site
      • off-site snapshots, backups and archives.
        Which implies a new type of business similar to Amazon's Storage Cloud.
    The local network/LAN is going to be Ethernet (1Gbps or 10Gbps Ethernet, a.k.a. 10GE), or Infiniband if 10GE remains very expensive. Infiniband delivers 3-6Gbps over short distances on copper; external SAS currently uses the "multi-lane" connector to deliver four channels per cable. This is exactly right for use in a single rack.

    I can't see a role for Fibre Channel outside storage arrays, and these will go if Infiniband speed and pricing continue to improve. Storage Arrays have used SCSI/SAS drives with internal copper wiring and external Fibre interfaces for a decade or more.
    Already the premium network vendors, like Cisco, are selling "Fibre Channel over Ethernet" switches (FCoE using 10GE).

    Nary a tape to be seen. (Hooray!)

    Servers should tend to be 1RU either full-width or half-width, though there will still be 3-4 styles of servers:
    • budget: mostly 1-chip
    • value: 1 and 2-chip systems
    • lower power value systems: 65W/CPU-chip, not 80-90W.
    • premium SMP: fast CPU's, large RAM and many CPU's (90-130W ea)
    If you want removable backups, stick 3+ drives in a RAID enclosure and choose between USB, firewire/IEEE 1394, e-SATA or SAS.

    Being normally powered down, you'd expect extended lifetimes for disks and electronics.
    But they'll need regular (3-6-12 months) read/check/rewrite cycling or the data will degrade and be permanently lost. Random 'bit-flipping' due to thermal activity, cosmic rays/particles and stray magnetic fields is the price we pay for very high density on magnetic media.
    Which is easy to do if they are kept in a remote access device, not unlike "tape robots" of old.
    Keeping archival storage "on a shelf" implies manual processes for data checking/refresh, and that is problematic to say the least.

    3-5 2.5" drives will make a nice 'brick' for these removable backup packs.
    Hopefully commodity vendors like Vantec will start selling multiple-interface RAID devices in the near future. Using current commodity interfaces should ensure they are readable at least a decade into the future. I'm not a fan of hardware RAID controllers in this application because if it breaks, you need to find a replacement - which may be impossible at a future date. (fails 'single point of failure' test).

    Which presents another question using a software RAID and filesystem layout: Will it still be available in your O/S of the future?
    You're keeping copies of your applications, O/S, licences and hardware to recover/access archived data, aren't you? So this won't be a question... If you don't intend to keep the environment and infrastructure necessary to access archived data, you need to rethink what you're doing.

    These enclosures won't be expensive, but shan't be cheap and cheerful:
    Just what is your data worth to you?
    If it has little value, then why are you spending money on keeping it?
    If it is a valuable asset, potentially irreplaceable, then you must be prepared to pay for its upkeep in time, space and dollars. Just like packing old files into archive boxes and shipping them to a safe off-site facility costs money, it isn't over once they are out of your sight.

    Electronic storage is mostly cheaper than paper, but it isn't free and comes with its own limits and problems.

    • SSD's are best suited and positioned as local or internal 'disks', not in storage arrays.
    • Flash memory is better presented to an Operating System as directly accessible memory.
    • Like disk arrays and RAM, flash memory needs to seamlessly cater for failure of bits and whole devices.
    • Hard disks have evolved to need multiple parity drives to keep the risk of total data loss acceptably low in production environments.
    • Throughput of storage arrays, not latency, will become their defining performance metric.
      New 'figures of merit' will be:
      • Volumetric: Gb per cubic-inch
      • Power: Watts per Gb
      • Throughput: Gb per second per read/write-stream
      • Bandwidth: Total Gb per second
      • Connections: Number of simultaneous connections.
      • Price: $ per Gb available and $ per Gb/sec per server and total
      • Reliability: probability of 1 byte lost per year per Gb
      • Archive and Recovery features: snapshots, backups, archives and Mean-Time-to-Restore
      • Expansion and Scalability: maximum size (Gb, controllers, units, I/O rate) and incremental pricing
      • Off-site and removable storage: RAID-5 disk-packs with multiple interfaces are needed.
    • Near Zero-latency storage implies reorganising and simplifying Operating Systems and their scheduling/multi-processing algorithms. Special CPU support may be needed, like for Virtualisation.
    • Separating networks {external access, storage/database, admin, backups} becomes mandatory for performance, reliability, scaling and security.
    • Pushing large-scale persistent storage onto the network requires a commodity network faster than 1Gbps ethernet. This will either be 10Gbps ethernet or multi-lane 3-6Gbps Infiniband.
    Which leads to another question:
    What might Desktops look like in 5 years?

    Other Reading:
    For a definitive theoretical treatment of aspects of storage hierarchies, Dr. Neil J Gunther, ex-Xerox PARC, now Performance Dynamics, has been writing about "The Virtualization Spectrum" for some time.

    Footnote 1:
    Is this idea of multi-speed memory (small/fast and big/slow) new or original?
    No: Seymour Cray, the designer of the world's fastest computers for ~2 decades, based his designs on it. It appears to me to be an old idea whose time has come again.

    From a 1995 interview with the Smithsonian:
    SC: Memory was the dominant consideration. How to use new memory parts as they appeared at that point in time. There were, as there are today large dynamic memory parts and relatively slow and much faster smaller static parts. The compromise between using those types of memory remains the challenge today to equipment designers. There's a factor of four in terms of memory size between the slower part and the faster part. Its not at all obvious which is the better choice until one talks about specific applications. As you design a machine you're generally not able to talk about specific applications because you don't know enough about how the machine will be used to do that.
    There is also a great PPT presentation on Seymour Cray by Gordon Bell entitled "A Seymour Cray Perspective", probably written as a tribute after Cray's untimely death in an auto accident.

    Footnote 2:
    The notion of "all files on the network" and invisible multi-level caches was built in 1990 at Bell Labs in their Unix successor, "Plan 9" (named for one of the worst movies of all time).
    Wikipedia has a useful intro/commentary, though the original on-line docs are pretty accessible.

    Ken Thompson and co built Plan 9 around 3 elements:
    • A single protocol (9P) of around 14 elements (read, write, seek, close, clone, cd, ...)
    • The Network connects everything.
    • Four types of device: terminals, CPU servers, Storage servers and the Authentication server.
    Ken's original storage server had 3 levels of transparent storage (in sizes unheard of at the time):
    • 1Gb of RAM (more?)
    • 100Gb of disk (in an age where 1Gb drives were very large and exotic)
    • 1Tb of WORM storage (write-once optical disk. Unheard of in a single device)
    The usual comment was, "you can go away for the weekend and all your files are still in either memory or disk cache".

    They also pioneered permanent point-in-time archives on disk in something appearing to the user as similar to NetApp's 'snapshots' (though they didn't replicate inode tables and super-blocks).

     My observations in this piece can be paraphrased as:
    • re-embrace Cray's multiple-memory model, and
    • embrace commercially the Plan 9 "network storage" model.

    Promises and Appraising Work Capability and Proficiency

    Max Wideman, PMI Distinguished Contributor and Person of the Year, and Canadian author of several Project Management books plus a slew of published papers, not only responded to and published some comments and conversations between us, he then edited some more emails into a Guest Article on his site.

    Many thanks to you Max for all your fine work and for seeing something useful in what I penned.


    Australia and the Researchers' Workbench

    This is a pitch for something new: the "Researchers' Workbench".
    Australia has the wealth and inventiveness to do it, but most probably, not the political will.
    Chalk that up to "the Cultural Cringe".

    I.T./Computers are "Cognitive Amplifiers".

    They aren't just a 'good' fit to Research, but a Perfect 'fit':
    Researchers are the definitive "Knowledge Workers".
    This idea isn't new or exclusive, here's what's gone before.

    Theme 1: Augmenting Human Intellect.
    In 1968, Doug Engelbart of Stanford/SRI and his Augmenting Human Intellect Research Centre did a demo (available on-line) of their work in the area.
    40 years on, this initial work has languished.
    Could the ANU or CSIRO repeat the demo, even given some time? I doubt it...

    Theme 2: Bush's Memex.
    Couple this with what Vannevar Bush actually proposed in his 1945 article, "As We May Think": the "Memex".

    This is much more than simple "Hypertext": It was a way to distil 'threads' of knowledge into accessible units, index and search them, and give/swap with others.

    There is a huge obstacle in the way of implementing true "Memex" capabilities these days:
    Copyright and Digital Library access.
    These obstacles are trivially solvable for people working within a single institution which has paid for collective access. But much more is needed to expand the scope to a general solution for swappable "threads" incorporating copyrighted digital material.

    For instance, why does every University and Research organisation, or Libraries, in Australia separately license access to the many digital resources on-line?

    Especially when DEST (Dept of Education, Science and Training), and ARC (Australian Research Council), who fund the vast bulk of work, could copy the "Pharmaceutical Benefits Scheme" (PBS) and negotiate single licenses for all Australian Institutions.

    Theme 3: Researchers' Workbench.
    The world of Programming was transformed around 1973 by Bell Labs and its "Programmers' Workbench".
    It consisted of a few tools, now considered staples by the Open Source project community.
    The power of the work was distilling essential processes into fewer than a dozen tools, applicable to many fields.

    Where is the "Researchers' Workbench", or the studies identifying what's needed or optimal at the individual, group/team, departmental, Institutional and Subject Area level?

    Theme 4: Research into Research.
    Lastly, the organisation and execution of Researcher work, time and processes needs definitive investigation.
    Mihály Csíkszentmihályi suggested, after studying what made Nobel Laureates different, "Flow" as a critical difference.
    Is it? i.e. Has that finding been tested and proven or refuted?
    Are there others?
    If "Flow" is real, why isn't it well known and practised as a matter of course in Academia?

    There have to be some simple disciplines that underpin the output of the best, most prolific Researchers.
    Something that allows them to use the equivalent of Einstein's "most powerful force in the Universe": compound interest.

    What techniques, disciplines or processes would allow Researchers to improve their output/performance by a significant fraction every year?
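
    To put rough numbers on the compound-interest point, here is a minimal sketch. The yearly improvement rates are hypothetical, picked only for illustration, and the function name is my own.

```python
def compound_growth(rate: float, years: int) -> float:
    """Cumulative multiplier after improving by `rate` every year for `years` years."""
    return (1.0 + rate) ** years


# Hypothetical yearly improvement rates, chosen only for illustration.
for rate in (0.05, 0.10, 0.20):
    after = {years: round(compound_growth(rate, years), 2) for years in (5, 10, 20)}
    print(f"{rate:.0%} a year: x{after[5]} after 5 years, "
          f"x{after[10]} after 10, x{after[20]} after 20")
```

    Even a modest 10% a year compounds to roughly 2.6 times the starting output after a decade, which is why a small, sustained improvement matters more than a one-off jump.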

    I posit that there are only a very few tools needed for a Researchers' Workbench, such as:
    • Shared Annotated and Ranked Bibliographies
    • Mind-Map representations of Knowledge Areas
    • a (single) definitive searchable document repository (docs as PDFs) for selected Bodies of Knowledge, i.e. "Google Books" for academic books/journals/papers.
    • PDF annotation, link/reference tools, and
    • appropriate search and information/knowledge organisation tools & representations.
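
    As one possible shape for the first item in that list, here is a minimal sketch of a shared, annotated and ranked bibliography. Everything in it is an assumption for illustration: the Entry class, the merge function, the identifier keys and the 1-to-5 rating scale are hypothetical, not an existing tool.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class Entry:
    """One bibliography item plus the notes and ratings colleagues attach to it."""
    title: str
    annotations: list[str] = field(default_factory=list)
    ratings: list[int] = field(default_factory=list)  # assumed scale: 1 (poor) to 5 (essential)

    def score(self) -> float:
        # Unrated entries get 0.0 so they sort last instead of raising an error.
        return mean(self.ratings) if self.ratings else 0.0


def merge(*bibliographies: dict[str, Entry]) -> list[Entry]:
    """Combine several researchers' lists, keyed by a shared identifier,
    and return the entries ranked by average rating, best first."""
    combined: dict[str, Entry] = {}
    for bib in bibliographies:
        for key, entry in bib.items():
            target = combined.setdefault(key, Entry(entry.title))
            target.annotations.extend(entry.annotations)
            target.ratings.extend(entry.ratings)
    return sorted(combined.values(), key=lambda e: e.score(), reverse=True)


# Two researchers' personal lists; keys, notes and ratings are made up.
alice = {
    "bush1945": Entry("As We May Think", ["Origin of the Memex idea"], [5]),
    "finkelstein2003": Entry("Why Smart Executives Fail", ["Case studies"], [3]),
}
bob = {
    "bush1945": Entry("As We May Think", ["Read alongside the 1968 demo"], [4]),
}

for entry in merge(alice, bob):
    print(f"{entry.score():.1f}  {entry.title}  ({len(entry.annotations)} notes)")
```

    The point of the sketch is only that the core of such a tool is small: a shared key per document, pooled annotations, and a ranking derived from the group's ratings.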

    Applying these approaches together (Engelbart's Augmentation, Bush's Memex, the Workbench and Research into Research) would be transformative in the Academic/Research world.
    Also surprisingly cheap.

    Here's a simple test/question:
    Is there any evidence that Microsoft Word is even a suitable, let alone effective, tool for creating Academic Papers?

    If not, why is it used almost exclusively throughout major Universities?
    What are the characteristics needed of "Perfect Academic Writing Tools"?
    Why aren't these needs well known?

    The world of Academia and Research is an "Elite Sport": highly competitive, intense on-going activity requiring the best/most effective training techniques of the most suited and most capable individuals, with only occasional "crystallisation" points where performance/outcomes are unequivocally judged (not unlike the Olympics).

    The A.I.S. (Australian Institute of Sport) proved two things:
    • "It takes a village to raise a child": or a large co-operative/coordinated organisation to create Great Sportspeople, and
    • the process can be replicated. [other countries now do it]

    The A.I.S. result wasn't accidental or unplanned: They applied the "Scientific Method" to their task/goal.
    What matters in producing "winning performances"? e.g. {Selection, Coaching, Technique, Training/Learning, Training Regime and Work-load, Nutrition, Mental/psych Factors, Recuperation and Recovery from Injuries}
    "What Works" in each of the areas?
    How do we measure and improve each area?

    After around 25 years of outstanding and obvious achievement of the A.I.S., I don't understand the seemingly complete blindness of Academia/Research to applying their own methods/principles to their own work. If Human Performances can be systematically improved in Sports by applying 'Science', then isn't that enough evidence to at least trial the idea on Academia and Research itself?

    Where is the Academic Discipline of "Research into Research"?
    Where is the "Office for Improving Research" in each and every University?
    Clearly, the simple, coarse metrics used now to characterise "Researcher Outputs" for DEST funding are not sufficient for intensive study of the Science of Research. One of the priority research areas of the A.I.S. initially was breaking down performances and the factors contributing.

    Alan Kay, a noted Computing pioneer at XEROX PARC, commented "Point of View is worth 80 IQ points".
    PARC managed to redefine the world of Computing/Networking in the early '70s with a small team and some very powerful organisational ideas and approaches.

    I suggest that applying the "I.T. as a Cognitive Amplifier" Point of View to Academic Research in a disciplined, organised way would significantly raise the Collective I.Q. of Australian Research, and in certain fields really produce that "80 IQ point" advantage Kay suggests.

    This is the equivalent of Roger Bannister in 1954 not just breaking the record by 2 seconds and achieving the first "four minute mile", but beating Sebastian Coe's time of 30 years later.

    Australia could "steal a march" on the rest of the world.
    Once ahead, the work should feed on itself and push our Researchers further ahead.
    A secondary effect is that the Best Centres attract the best, brightest and most ambitious. A virtuous circle.

    The A.I.S. experience shows the cost of creating "super-stars" is calculable and within the reach of a wealthy small nation, such as Australia.
    It also shows there is a very strong "First Mover" advantage, that can be maintained and extended for quite some time.

    The obvious starting place is Canberra, home of the Australian National University, the same place as the A.I.S.
    Like the A.I.S., the program has to become trans-national, with decentralised areas of excellence and expertise.

    Why now and where's the urgency?

    The USA's National Science Foundation, via its Office of Integrative Activities, is now funding projects for a new program:
    Cyber-Enabled Discovery and Innovation (CDI).
    ... is NSF’s bold five-year initiative to create revolutionary science and engineering research outcomes made possible by innovations and advances in computational thinking.
    Australians have always been very inventive and occasionally this has translated into "innovation".
    We have shown we "punch above our weight" in areas where the Research Outputs can be directly applied, as suggested here...

    We get one shot at being "The First" to Augment and Amplify Researcher Intelligence.
    We can choose to be Leaders or Followers and enjoy the on-going consequences of either.


    Death by Success II

    There is another, much more frequent "Death by Success" cause, first introduced to me by Jerry Weinberg and Wayne Strider and Elaine Cline (Strider and Cline).

    It's the same process that some herbicides use: unconstrained growth.
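    Synthetic-auxin herbicides such as 2,4-D are exactly this sort of agent; Monsanto's flagship herbicide Roundup (glyphosate) works differently, by blocking an enzyme that plants need.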

    If you are very good at what you do and much sought after, this can lead directly to massive Failure - personally and in business.

    Growth is Good, but too much, too fast is a Killer.

    The only protection is awareness.
    As Virginia Satir pointed out, "We can't see inside other people's heads, nor can we see ourselves as others see us" (courtesy again of Jerry and "Strider and Cline").

    Typically you need objective, external help in recognising this condition.
    Once you have restored Situational Awareness, you can choose your response. Which may be "I'm outa here", Denial or something in between.

    There is an alternative form of "Death by Success", which again we see in the Plant Kingdom.

    Your initial approach, solution or technique may not Scale-Up, or may have a fixed Upper-Bound.
    E.g. if you sell "factory seconds", there is a limited supply that sets your maximum turnover.
    Or selling fragments of the Berlin Wall - at some point the Genuine Article is all gone...

    The example in the Plant Kingdom is when tree seedlings 'set' in unsuitable places, like a small pot or within a bottle. Down the road, they will become "root bound", which slows growth, then they'll consume all the nutrients and, having converted 'everything' into plant material, die.

    That's it for that plant - all of one resource has been exhausted and it's Game Over.

    Death by Success

    The things you do in the beginning, when you're the minnow-against-the-giants, to start and build a business may not work well when you're successful, when you've become The Giant.

    Exactly what leads to Success can eventually lead to your downfall.

    You become very good at the things that have gained and seemingly maintained Success. Every problem and challenge you've met has been solved with your brilliance and individual style.

    Why would you ever want or need to vary that approach?

    Until something new comes along and it all goes wrong:
      Inevitably in Business and Life, things change (perturbations arise in Control Systems terms).
      Responding with "More of the Same", as in the past, will, at some point, not work.
      If you've grown large, it will take time to fail, and you'll have notice that "things aren't great".
      Many companies only ever do "More of the Same", often amping-it-up as results don't appear.
      The results are as predictable as throwing oil on a fire.

    Often I mention Sydney Finkelstein's book, "Why Smart Executives Fail" in which Finkelstein describes the results of 6 years of research.  He self-describes as "Steven Roth Professor of Management at the Tuck School of Business at Dartmouth College, where I teach courses on Leadership and Strategy".

    In Smart Executives, Finkelstein and his team document a whole slew of companies (50) that burned bright and collapsed. This book was published in 2003, covering a turbulent period of US and global business, as well as some famous cases going back decades.

    The subjects of the research were chosen precisely because they were wildly successful and suffered a notable collapse. Enron and WorldCom are on the list, plus many I.T. companies such as Wang Computers. The common thread is that the collapse was avoidable and predictable.

    Would the conclusions, Lessons Learned and "Early Warning Signs" be different post the 2008 GFC (Global Financial Crisis)?  I think not...

    Finkelstein lists 7 naive causes of failure:
    1. The Executives were Stupid.
    2. The Executives couldn't have known What was Coming.
    3. It was a Failure to Execute.
    4. The Executives weren't trying Hard Enough.
    5. The Executives lacked Leadership Ability.
    6. The Company lacked the Necessary Resources.
    7. The Executives were simply a Bunch of Crooks.
    and comments in a para entitled "Failure to understand Failure":
    All seven of these standard explanations for why executives fail are clearly insufficient. (Because the companies had demonstrated excellence in becoming highly successful.)
    The next 300 pages are his answer. Part I describes "Great Corporate Failures" and Part II their Causes.
    This research ends with a positive message, Part III is "Learning from Mistakes":
    • Predicting the Future, Early Warning Signs.
    • How Smart Executives Learn, Living and Surviving in a World of Mistakes.
    His "Seven Habits of Spectacularly Unsuccessful People"  are worth reiterating:
    1. They see themselves and their companies as dominating their environments.
    2. They identify so completely with the company that there is no clear boundary between their personal interests and their corporation's interests.
    3. They think they have All the Answers.
    4. They ruthlessly eliminate anyone who isn't 100% behind them.
    5. They are consummate company spokespersons, obsessed with the company image.
    6. They underestimate major obstacles.
    7. They stubbornly rely on what worked for them in the past.
    Each of the 11 chapters has 30-50 references.  Although written and published for the general market, this isn't any "Puff piece".


    MMC - the Microsoft death blow for non-Enterprise markets

    MMC, "Mostly Macintosh Compatible", the equivalent for OS X of WINE for Windows, doesn't yet exist, as far as I'm aware.


    Why Microsoft is being left behind

    Paul Budde recently questioned, "Will Microsoft be able to make the jump?"
    [04-Apr-2010] For other comments see my pieces "Death by Success" and "Death by Success II".

    He quotes the marketing "S-curve" and Summer Players by Carol Velthuis describing company performance and market maturity in seasons of the year.


    ICT Productivity and the Failure of Australian Management

    Prior Related Posts:
    Quantifying the Business Benefits of I.T. Operations
    The Triple Whammy - the true cost of I.T. Waste
    Force Multipliers - Tools as Physical and Cognitive Amplifiers
    I.T. in context

    Alan Kohler and Robert Gottliebsen have been writing in "Business Spectator" about the relationship between jobs and Economic Productivity.

    They note that the USA has improved productivity in the last year while in Australia it has declined (+4% and -3% respectively).  My take on this is: a gross Failure of Australian Management.

    There is solid research/evidence that "ICT" is the single largest contributor to both partial and multi-factor Productivity, and is expected to be so for the next 20 years. This is a big issue.


    Microsoft Troubles - VIII, MS-Office challenged

    "Microsoft Office is obsolete, or soon will be" By Joe Wilcox.

    I hadn't picked this trend, and it's quite important.
    It squeezes their 2nd "birthright" (the other is the PC Operating System, which I'd focussed on).


    Microsoft Troubles - VII, An Insider's View

    A friend sent me this link to a New York Times Op-Ed 'contribution'.
    Huge news...
    February 4, 2010
    Op-Ed Contributor
    Microsoft’s Creative Destruction
    Dick Brass was a vice president at Microsoft from 1997 to 2004.
    This guy was a VP in the glory years - either side of Y2K, and before the 2004/5 Longhorn 'reset'.
    The failure to build the successor to XP was a breaking-point: the forced upgrade cycle was gone.

    He's likely to have a bunch of stock, or options, and a vested interest in the company's success/survival. His comments are likely to be both informed and as positive as they can be...