Tag: Performance

Stuff The Internet Says On Scalability For August 16th, 2019

Wake up! It’s HighScalability time:

Do you like this sort of Stuff? I’d love your support on Patreon. I wrote Explain the Cloud Like I’m 10 for people who need to understand the cloud. And who doesn’t these days? On Amazon it has 53 mostly 5 star reviews (124 on Goodreads). They’ll learn a lot and likely add you to their will.

Number Stuff:

  • $1 million: Apple finally using their wealth to improve security through bigger bug bounties.
  • $4B: Alibaba cloud service yearly run rate, growth of 66%. Says they’ll overtake Amazon in 4 years. 
  • 200 billion: Pinterest pins pinned across more than 4 billion boards by 300 million users.
  • 21: technology startups took in mega-rounds of $100 million or more. 
  • 3%: of users pass their queries through resolvers that actively work to minimize the extent of leakage of superfluous information in DNS queries.
  • < 50%: Google searches result in a click. SEO dies under walled garden shade.
  • 4 million: DDoS attacks in the last 6 months, frequency grew by 39 percent in the first half of 2019. IoT devices are under attack within minutes. Rapid weaponization of vulnerable services continued. 
  • 200: distributed microservices in S3, up from 8 when it started 13 years ago.
  • 50%: cumulative improvement to Instagram.com’s feed page load time.
  • $318 million: Fortnite monthly revenue, likely had more than six consecutive months with at least one million concurrent active users.
  • $18,000: in fines because you just had to have the license plate NULL. 
  • $6.1 billion: size of the Dutch tax weapon Uber created to avoid paying taxes.
  • 14.5%: drop in 1H19 global semiconductor sales.
  • 13%: fall in ad revenue for newspapers. 

Quotable Stuff:

  • Donald Hoffman: That is what evolution has done. It has endowed us with senses that hide the truth and display the simple icons we need to survive long enough to raise offspring. Space, as you perceive it when you look around, is just your desktop—a 3D desktop. Apples, snakes, and other physical objects are simply icons in your 3D desktop. These icons are useful, in part, because they hide the complex truth about objective reality.
  • rule11: First lesson of security: there is (almost) always a back door.
  • Paul Ormerod: A key discovery in the maths of how things spread across networks is that in any networked system, any shock, no matter how small, has the potential to create a cascade across the system as a whole. Watts coined the phrase “robust yet fragile” to describe this phenomenon. Most of the time, a network is robust when it is given a small shock. But a shock of the same size can, from time to time, percolate through the system. I collaborated with Colbaugh on this seeming paradox. We showed that it is in fact an inherent property of networked systems. Increasing the number of connections causes an improvement in the performance of the system, yet at the same time, it makes it more vulnerable to catastrophic failures on a system-wide scale.
  • @jeremiahg: InfoSec is ~$127B industry, yet there’s no price tags on any vendor website. For some reason it’s easier to find out what a private plane costs than a ‘next-gen’ security product. Oh yah, and let’s not forget the lack of warranties.
  • Hall’s Law:  the maximum complexity of artifacts that can be manufactured at scales limited only by resource availability doubles every 10 years. 
  • YouTube moderator: “Our responsibility was never to the creators or to the users,” one former moderator told the Post. “It was to the advertisers.”
  • reaperducer: It’s for this reason that I’ve stopped embedding micro data in the HTML I write. Micro data only serves Google. Not my clients. Not my sites. Just Google. Every month or so I get an e-mail from a Google bot warning me that my site’s micro data is incomplete. Tough. If Google wants to use my content, then Google can pay me. If Google wants to go back to being a search engine instead of a content thief and aggregator, then I’m on board.
  • Maxime Puteaux: The small satellite launch market has grown to account for “69% of the satellites launched last year in number of satellites but only 4% of the total mass launched (i.e 372 tons). … The smallsat market experienced a 23% compound annual growth rate (CAGR) from 2009 to 2018” with even greater growth expected in the future, dominated by the launch needs of constellations.
  • @Electric_Genie: San Diego has a huge, machine-intelligence-powered smart streetlight network that monitors traffic to time traffic signals. Now, they’ve added ability to detect pedestrians and cyclists
  • Simon Wardley: How to create a map? Well, I start off with a systems diagram, I give it an anchor at the top. In this case, I put customer and then I describe position through a value chain. A customer wants online photo storage, which needs website, which needs platform, which needs computer, which needs power, and of course, the stuff at the bottom is less visible to the customer than the stuff at the top.
  • Charity Majors: When we blew up the monolith into many services, we lost the ability to step through our code with a debugger: it now hops the network.  Our tools are still coming to grips with this seismic shift.
  • Livia Gershon: According to McLaren, from 1884 to 1895, the Matrimonial Herald and Fashionable Marriage Gazette promised to provide “HIGH CLASS MATCHES” to U.K. men and women looking for wives and husbands. Prospective spouses could place ads in the paper or work directly with staff of the associated Word’s Great Marriage Association to privately make a connection.
  • @KarlBode: There is absolutely ZERO technical justification for bandwidth caps and overage fees on cable networks. Zero. It’s a glorified price hike on captive US customers who already pay more for bandwidth than most developed nations due to limited competition.
  • Fowler: That’s the other piece of app trackers, is that they do a whole bunch of bad things for our phone. Over the course of a week, I found 5,400 different trackers activated on my iPhone. Yours might be different. I may have more apps than you. But that’s still quite a lot. If you multiplied that out by an entire month, it would have taken up 1.5 gigabytes of data just going to trackers from my phone. To put that in some context, the basic data plan from AT&T is only 3 gigabytes.
  • Kate Green: Starshot is straightforward, at least in theory. First, build an enormous array of moderately powerful lasers. Yoke them together—what’s called “phase lock”—to create a single beam with up to 100 gigawatts of power. Direct the beam onto highly reflective light sails attached to spacecraft weighing less than a gram and already in orbit. Turn the beam on for a few minutes, and the photon pressure blasts the spacecraft to relativistic speeds.
  • Markham Heid: Beeman says activities that are too demanding of our brain or attention — checking email, reading the news, watching TV, listening to podcasts, texting a friend, etc. — tend to stifle the kind of background thinking or mind-wandering that leads to creative inspiration. 
  • @ben11kehoe: Aurora never downsizes storage. Continue to pay at the highest roll you’ve ever made.
  • John Allspaw: Resilience is not preventative design, it is not fault-tolerance, it is not redundancy. If you want to say fault-tolerance, just say fault-tolerance. If you want to say redundancy, just say redundancy. You don’t have to say resilience just because, you can, and you absolutely are able to. I wish you wouldn’t, but you absolutely can, and that’ll be fine as well.
  • Matthew Ball: But, again, lucrative “free-to-play” games have been around for more than a decade. In fact, it turns out the most effective way to generate billions of dollars is to not require a player spend a single one (all of the aforementioned billion-dollar titles are also free-to-play). 
  • TrailofBits: Smart contract vulnerabilities are more like vulnerabilities in other systems than the literature would suggest. A large portion (about 78%) of the most important flaws (those with severe consequences that are also easy to exploit) could probably be detected using automated static or dynamic analysis tools.
  • @sfiscience: 1/2 “Once you induce [auto safety] regulatory protection, there is a decline in the number of highway deaths. And then in 3-4 years, it goes right up to where it was before the safety regulation is imposed.” 2/2 There’s a kind of “risk homeostasis” with regulation: as people feel safer, they take more risks (e.g., seatbelts led to faster driving and more pedestrian deaths). One exception: @NASCAR deaths went UP with safety innovations. “People are not dumb, but they’re not rational-expectations-efficient either.”
  • Michael F. Cohen: It may be hard to believe, but only a few years ago we debated when the first computer graphics would appear in a movie such that you could not tell if what you were looking at was real or CG. Of course, now this question seems silly, as almost everything we see in action movies is CG and you have no chance of knowing what is real or not.
  • Dropbox: Much like our data storage goals, the actual cost savings of switching to SMR (Shingled Magnetic Recording) have met our expectations. We’re able to store roughly 10 to 20 percent more data on an SMR drive than on a PMR drive of the same capacity at little to no cost difference. But we also found that moving to the high-capacity SMR drives we’re using now has resulted in more than a 20 percent savings overall compared to the last generation storage design.
  • Riot Games: The patch size was 68 MB for RADS and 83 MB for the new patcher. Despite the larger download size, the average player was able to update the game in less than 40 seconds, compared to over 8 minutes with the old patcher.
  • @grossdm: For a decade, VCs have been subsidizing the below-market provision of services to urban-dwellers: transport, food delivery, office space. Now the baton is being passed to public shareholders, who will likely have less patience. 20 years ago, public investors very quickly walked away from the below-market provision of e-commerce and delivery services  — i.e. Webvan. 
  • Julia Grace: Looking back, I should have done a lot more reorgs [at Slack] and I should’ve broken up a lot more parts of the organization so that they could have more specialization, but instead, it was working so we kept it all together.
  • Thomas Claburn: “No iCloud subscriber bargained for or agreed to have Apple turn his or her data – whether encrypted or not – to others for storage,” the complaint says. “…The subscribers bargained for, agreed, and paid to have Apple – an entity they trusted – store their data. Instead, without their knowledge or consent, these iCloud subscribers had their data turned over by Apple to third-parties for these third-parties to store the data in a manner completely unknown to the subscribers.”
  • @glitchx86: Some merit to TM: it solves the problem of the correctness of lock-based concurrent programs. TM hides all the complexity of verifying deadlock-free software .. and it isn’t an easy task 
  • @narayanarjun: We were experiencing 40ms latency spikes on queries at @MaterializeInc and @nikhilbenesch tracked it down to TCP_NODELAY, and his PR just cracks me up. The canonical cite is a Hacker News comment (https://news.ycombinator.com/item?id=10608356) signed by John Nagle himself, and I can’t even.
  • Donald Hoffman: Perhaps the universe itself is a massive social network of conscious agents that experience, decide, and act. If so, consciousness does not arise from matter; this is a big claim that we will explore in detail. Instead, matter and spacetime arise from consciousness—as a perceptual interface.
  • MacCárthaigh: From the very beginning at AWS, we were building for internet scale. AWS came out of amazon.com and had to support amazon.com as an early customer, which is audacious and ambitious. They’re a pretty tough customer, as you can imagine, one of the busiest websites on Earth. At internet scale, it’s almost all uncoordinated. If you think about CDNs, they’re just distributed caches, and everything’s eventually consistent, and that’s handling the vast majority of things.
  • Jack Clark: Being able to measure all the ways in which AI systems fail is a superpower, because such measurements can highlight the ways existing systems break and point researchers towards problems that can be worked on.
  • Google: We investigated the remote attack surface of the iPhone, and reviewed SMS, MMS, VVM, Email and iMessage. Several tools which can be used to further test these attack surfaces were released. We reported a total of 10 vulnerabilities, all of which have since been fixed. The majority of vulnerabilities occurred in iMessage due to its broad and difficult to enumerate attack surface. Most of this attack surface is not part of normal use, and does not have any benefit to users. Visual Voicemail also had a large and unintuitive attack surface that likely led to a single serious vulnerability being reported in it.  Overall, the number and severity of the remote vulnerabilities we found was substantial. Reducing the remote attack surface of the iPhone would likely improve its security.
  • sleepydog: I work in GCP support. I think you would be surprised. Of course Linux is more common, but we still support a lot of customers who use Windows Server, SQL Server, and .NET for production.
  • Laurence Tratt: performance nondeterminism increasingly worries me, because even a cursory glance at computing history over the last 20 years suggests that both hardware (mostly for performance) and software (mostly for security) will gradually increase the levels of performance nondeterminism over time. In other words, using the minimum time of a benchmark is likely to become more inaccurate and misleading in the future…
  • Geoff Tate: A year ago, if you talked to 10 automotive customers, they all had the same plan. Everyone was going straight to fully autonomous, 7nm, and they needed boatloads of inference throughput. They wanted to license IP that they would integrate into a full ADAS chip they would design themselves. They didn’t want to buy chips. That story has backpedaled big time. Now they’re probably going to buy off-the-shelf silicon, stitch it together to do what they want, and they’re going to take baby steps rather than go to Level 5 right away.
  • Ann Steffora Mutschler: In discussions with one of the Tier 0.5 suppliers about whether sensor fusion is the way to go or if it makes better sense to do more of the computation at the sensor itself, one CTO remarked that certain types of sensor data are better handled centrally, while other types of sensor data are better handled at the edge of the car, namely the sensor, Fritz said.
  • Dai Zovi: A software engineering team would write security features, then actively go to the security team to talk about it and for advice. We want to develop generative cultures, where risk is shared. It’s everyone’s concern. If you build security responsibility into every team, you can scale much more powerfully than if security is only the security staff’s responsibility.
  • Nitasha Tiku: But that didn’t mean things would go back to normal at Google. Over the past three years, the structures that once allowed executives and internal activists to hash out tensions had badly eroded. In their place was a new machinery that the company’s activists on the left had built up, one that skillfully leveraged media attention and drew on traditional organizing tactics. Dissent was no longer a family affair. And on the right, meanwhile, the pipeline of leaks running through Google’s walls was still going as strong as ever.
  • Graham Allan: There’s another bottleneck that SoC designers are starting to struggle with, and it’s not just about bandwidth. It’s bandwidth per millimeter of die edge. So if you have a bandwidth budget that you need for your SoC, a very easy exercise is to look at all the major technologies you can find. If you have HBM2E, you can get on the order of 60+ gigabytes per second per millimeter of die edge. You can only get about a sixth of that for GDDR6. And I can only get about a tenth of that with LPDDR5.
  • Brian Bailey: If the industry is willing to give von Neumann the boot, it should perhaps go the whole way and stop considering memory to be something shared between instructions and data and start thinking about it as an accelerator. Viewed that way, it no longer has to be compared against logic or memory, but should be judged on its own merits. If it accelerates the task and uses less power, then it is a purely economic decision if the area used is worth it, which is the same as every other accelerator.
  • Barbara Tversky: This brings us to our First Law of Cognition: There are no benefits without costs. Searching through many possibilities to find the best can be time consuming and exhausting. Typically, we simply don’t have enough time or energy to search and consider all the possibilities. The evidence on action is sufficient to declare the Second Law of Cognition: Action molds perception. There are those who go farther and declare that perception is for action. Yes, perception serves action, but perception serves so much more. 
  • Jez Humble: testing is for known knowns, monitoring is for known unknowns, observability is for unknown unknowns
  • @briankrebs: Being in infosec for so long takes its toll. I’ve come to the conclusion that if you give a data point to a company, they will eventually sell it, leak it, lose it or get hacked and relieved of it. There really don’t seem to be any exceptions, and it gets depressing
  • Brendon Foye: The hyperscale giant today released a new co-branding guide (pdf), instructing partners in the AWS Partner Network (APN) how to position their marketing material when going to market with AWS. Among the guidelines, AWS said it won’t approve the use of terms like “multi-cloud,” “cross cloud,” “any cloud,” “every cloud,” “or any other language that implies designing or supporting more than one cloud provider.”
  • Newley Purnell: Startup Engineer.ai says it uses artificial-intelligence technology to largely automate the development of mobile apps, but several current and former employees say the company exaggerates its AI capabilities to attract customers and investors.
  • George Dyson: If you look at the most interesting computation being done on the Internet, most of it now is analog computing, analog in the sense of computing with continuous functions rather than discrete strings of code. The meaning is not in the sequence of bits; the meaning is just relative. Von Neumann very clearly said that relative frequency was how the brain does its computing. It’s pulse frequency coded, not digitally coded. There is no digital code.
  • Brendon Dixon: Because they’ve chosen to not deeply learn their deep learning systems—continuing to believe in the “magic”—the limitations of the systems elude them. Failures “are seen as merely the result of too little training data rather than existential limitations of their correlative approach” (Leetaru). This widespread lack of understanding leads to misuse and abuse of what can be, in the right venue, a useful technology.
  • Ewan Valentine: I could be completely wrong on this, but over the years, I’ve found that OO is great for mapping concepts and domain models together, and holding state. Therefore I tend to use classes to give a name to a concept and map data to it. For example, entities, repositories, and services, things which deal with data and state, I tend to create classes for. Whereas deliveries and use cases, I tend to treat functionally. The way this ends up looking, I have functions, which have instances of classes injected through a higher-order function. The functional code then interacts with the various objects and classes passed into it, in a functional manner. I may fetch a list of items from a repository class, map through them, filter them, and pass the results into another class which will store them somewhere, or put them in a bucket.
  • Timothy Morgan: But what we do know is that the [Cray] machine will weigh in at around 30 megawatts of power consumption, which means it will have more than 10X the sustained performance of the current Sierra system on DOE applications and around 4X the performance per watt. This is a lot better energy efficiency than many might have been expecting – a few years back there was talk of exascale systems requiring as much as 80 megawatts of juice, which would have been very rough to pay for at a $1 per kilowatt per year. With those power consumption numbers, it would have cost $500 million to build El Capitan but it would have cost around $400 million to power it for five years; at 30 megawatts, you are in the range of $150 million, which is a hell of a lot more feasible even if it is an absolutely huge electric bill by any measure.
  • Timothy Prickett Morgan: All of us armchair architecture quarterbacks have been thinking the CPU of the future looks like a GPU card, with some sort of high bandwidth memory that’s really close. 
  • Garrett Heinlen (Netflix): I believe GraphQL also goes a step further beyond REST and it helps an entire organization of teams communicate in a much more efficient way. It really does change the paradigm of how we build systems and interact with other teams, and that’s where the power truly lies. Instead of the back end dictating, “Here are the APIs you receive and here’s the shape in the format you’re going to get,” they express what’s possible to access. The clients have all the power, pulling in just the data they need. The schema is the API contract between all teams and it’s a living, evolving source of truth for your organization. Gone are the days of people throwing code over the wall saying, “Good luck, it’s done.” Instead, GraphQL promotes more of a uniform working experience between front end and back end, and I would go further to say even product and design could be involved in this process as well, to understand the business domain that you’re all working within.

Useful Stuff:

  • Fun thread. @jessfraz: Tell me about the weirdest bug you had that caused a datacenter outage, can be anywhere in the stack including human error. @dormando: one day all the sun servers fired temp alarms and shut off. thought AC had died or there was a fire. Turns out cleaners had wedged the DC door open, causing a rapid humidity shift, tricking the sensors. @ewindisch: connection pool leak in a distributed message queue I wrote caused the cascade failure of a datacenter’s network switches. This brought a large independent cloud provider offline around 2013. @davidbrunelle: Unexpected network latency caused TCP sockets to stay open indefinitely on a fleet of servers running an application. This eventually led to PAT exhaustion causing around ~50% of outbound calls from the datacenter to fail causing a DC-wide brownout.
  • What happens when you go from LAMP to serverless: case study of externals.io. 90% of the requests are below 100ms. $17.37/month. Generally low effort migration.
  • By continuously monitoring increases in spend, we end up building scalable, secure and resilient Lambda-based solutions while maintaining maximum cost-effectiveness. How We Reduced Lambda Functions Costs by Thousands of Dollars: In the last 7 months, we started using Lambda-based functions heavily in production. It allowed us to scale quickly and brought agility to our development activities…We were serving 80M+ Lambda invocations per day across multiple AWS regions with an unpleasant surprise in the form of a significant bill…once we started running heavy workloads in production, the cost became significant and we spent thousands of dollars daily…to reduce AWS Lambda costs, we monitored Lambda memory usage and execution time based on logs stored in CloudWatch…we created dynamic visualizations on Grafana based on metrics available in the timeseries database and we were able to monitor Lambda runtime usage in near real time…we gained insights into the right sizing of each Lambda function deployed in our AWS account and avoided excessive over-allocation of memory, significantly reducing the Lambdas’ cost…To gather more insights and uncover hidden costs, we had to identify the most expensive functions. That’s where Lambda tags come into play. We leveraged that metadata to break down the cost per stack…By reducing the invocation frequency (controlling concurrency with SQS), we reduced the cost by up to 99%…we’re evaluating alternative services like Spot Instances & Batch Jobs to run heavy non-critical workloads considering the hidden costs of Serverless…we were using SNS and had issues with handling errors and Lambda timeouts, so we changed our architecture to use SQS instead and configured a dead letter queue to reduce the number of times the same message can be handled by the Lambda function (avoiding recursion), thereby reducing the number of invocations.
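The memory right-sizing arithmetic behind that story can be sketched with a toy cost model. The rates below are illustrative 2019-era list prices (an assumption; check current AWS pricing), but the shape of the calculation is what matters:

```python
# Toy AWS Lambda cost model (illustrative rates, not current pricing).
GB_SECOND_RATE = 0.0000166667    # USD per GB-second (assumed list price)
REQUEST_RATE = 0.20 / 1_000_000  # USD per invocation (assumed list price)

def monthly_cost(invocations_per_day, avg_duration_ms, memory_mb, days=30):
    """Estimate monthly spend for one Lambda function."""
    invocations = invocations_per_day * days
    gb_seconds = invocations * (avg_duration_ms / 1000.0) * (memory_mb / 1024.0)
    return invocations * REQUEST_RATE + gb_seconds * GB_SECOND_RATE

# 80M daily invocations at 200 ms average, right-sized from 1024 MB to 256 MB:
before = monthly_cost(80_000_000, 200, 1024)
after = monthly_cost(80_000_000, 200, 256)
```

Since the compute charge scales linearly with allocated memory, quartering the allocation quarters the GB-second bill while the per-request charge stays fixed, which is why memory monitoring alone can recover so much spend.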
  • Six Shades of Coupling: Content Coupling, Common Coupling, External Coupling, Control Coupling, Stamp Coupling and Data Coupling. 
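A hypothetical Python sketch of the two shades at the extremes of that list, control coupling (the caller steers the callee with a flag) versus data coupling (the callee receives only the data it needs):

```python
# Control coupling: 'as_html' is a flag that steers the callee's internal
# branching, so the caller must know about the callee's logic.
def render(report, as_html):
    return to_html(report) if as_html else to_text(report)

# Data coupling: each function receives only the data it operates on --
# the loosest (and generally most desirable) of the six shades.
def to_text(report):
    return "\n".join(report)

def to_html(report):
    return "<br>".join(report)
```

Replacing the flag with two direct calls (`to_text(report)` / `to_html(report)`) removes the control coupling entirely.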
  • When does redundancy actually help availability?: The complexity added by introducing redundancy mustn’t cost more availability than it adds. The system must be able to run in degraded mode. The system must reliably detect which of the redundant components are healthy and which are unhealthy. The system must be able to return to fully redundant mode.
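Those conditions can be sketched in a few lines of Python (hypothetical names; in practice the hard part is making `is_healthy` reliable):

```python
# Redundancy only buys availability if the system can reliably tell healthy
# replicas from unhealthy ones and keep serving in degraded mode.
def route(key, replicas, is_healthy):
    healthy = [r for r in replicas if is_healthy(r)]
    if not healthy:
        # Health detection failed us or everything is down: redundancy
        # added complexity without adding availability.
        raise RuntimeError("no healthy replicas")
    # Degraded mode: the surviving replicas absorb all the traffic.
    return healthy[hash(key) % len(healthy)]
```

Returning to fully redundant mode corresponds to a recovered replica passing `is_healthy` again and rejoining the rotation without operator intervention.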
  • AI Algorithms Need FDA-Style Drug Trials. The problem with this idea is that molecules do not change, whereas software continuously changes, and learning software by definition changes reactively. No static process like a one-and-done drug trial will yield meaningful results. We need a different approach that considers the unique role software plays in systems. Certainly vendors can’t be trusted. Any AI will tell you that. Perhaps create a set of test courses that platforms can be continuously tested and fuzzed against?
  • AWS Lambda is not ready to replace conventional EC2. Why we didn’t brew our Chai on AWS Lambda: Chai Point, India’s largest organized Chai retailer, with 150+ stores and 1000+ boxC machines (IoT-enabled Chai and Coffee vending machines) designed for corporates, serves approximately 250k cups of chai per day across all channels…Most of Chai Point’s stores and boxC machines typically run between 7 AM and 9 PM…[Lambda cold start is] one of the most critical and deciding factors for us in moving the Shark infrastructure back to EC2…AWS Lambda has a limit of 50 MB for the maximum deployment package…there is a delay of 1–2 minutes before logs appear in CloudWatch, which makes immediate debugging in a test environment difficult…when it comes to deploying it in enterprise solutions where there are inter-service dependencies, I think there is still time, especially for languages like Java.
  • Facebook Performance @Scale 2019 recap videos are now available. 
  • Sharing is caring until it becomes overbearing. Dropbox no longer shares code between platforms. Their policy now is to use the native language on each platform. It is simply easier and quicker to write code twice. And you don’t have to train people on using a custom stack. The tools are native. So when people move on you have not lost critical expertise. The one codebase to rule them all dream dies hard. No doubt it will be back in short order, filtered through some other promising stack.
  • Everyone these days wants your functions. Oracle Functions Now Generally Available. It’s built on the Apache 2.0 licensed Fn Project. Didn’t see much in the way of reviews or cost details.
  • On the LeanXcale database. Interview with Patrick Valduriez and Ricardo Jimenez-Peris: There is a new class of NewSQL databases in the market, called Hybrid Transaction and Analytics Processing (HTAP). NewSQL is a recent class of DBMS that seeks to combine the scalability of NoSQL systems with the strong consistency and usability of RDBMSs. LeanXcale’s architecture is based on three layers that scale out independently: 1) KiVi, the storage layer, a relational key-value data store; 2) the distributed transactional manager that provides ultra-scalable transactions; and 3) the distributed query engine that scales out both OLTP and OLAP workloads. The storage layer is a proprietary relational key-value data store, called KiVi, which we have developed. Unlike traditional key-value data stores, KiVi is not schemaless, but relational. Thus, KiVi tables have a relational schema, but can also have a part that is schemaless. The relational part enabled us to enrich KiVi with predicate filtering, aggregation, grouping, and sorting. As a result, we can push down all algebraic operators below a join to KiVi and execute them in parallel, thus saving the movement of a very large fraction of rows between the storage layer and the query engine layer.
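A toy illustration of that pushdown idea (plain Python standing in for KiVi’s operators): when the storage layer applies the predicate and the aggregate locally, only a scalar crosses to the query engine instead of a stream of rows:

```python
# Rows held by a storage node (toy data, hypothetical schema).
rows = [{"region": "EU", "amount": 10},
        {"region": "US", "amount": 7},
        {"region": "EU", "amount": 5}]

# Pushed-down plan: filter + aggregate execute where the data lives,
# so a single scalar -- not the matching rows -- is shipped upward.
def storage_filter_sum(rows, predicate, column):
    return sum(r[column] for r in rows if predicate(r))

eu_total = storage_filter_sum(rows, lambda r: r["region"] == "EU", "amount")
```

Without pushdown, the query engine would pull all three rows over the network and do the same filter and sum itself; the savings grow with table size.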
  • Apollo Day New York City 2019 Recap
    • During his keynote, DeBergalis announced one of Apollo’s most anticipated innovations, Federation, which utilizes the idea of a new layer in the data stack to directly meet developers’ needs for a more scalable, reliable, and structured solution to a centralized data graph.
    • Federation paired with existing features of Apollo’s platform, like schema change validation, creates a flow where teams can independently push updates to product microservices. This triggers re-computation of the whole graph, which is validated and then pushed into the gateway. Once completed, all applications contain changes in the part of the graph that is available to them. These events happen independently, which allows each team to be responsible solely for its piece.
    • Another key concept that DeBergalis detailed was the idea that a “three-legged” stack is emerging in front-end development. The “legs” of this new “stool” that form the basis of this stack are React, Apollo, and Typescript. React provides developers with a system for managing user components, Apollo provides developers a system for managing data, and Typescript provides a foundation underneath that provides static typing end-to-end through the stack.
  • Lesson: sticker shock—in Google Cloud everything costs more than you think it will, but it’s still worth it. Etsy’s Big Data Cloud Migration. Etsy generates a terabyte of data a day; they run hundreds of Hadoop workflows and thousands of jobs daily. Started out on-prem. They migrated to the cloud over a year and a half ago, driven by needing both the machine and people resources required to keep up with machine learning and data processing tasks. Moving into the cloud decoupled systems so groups can operate independently. With their on-prem system they didn’t worry about optimization, but in the cloud you must, because the cloud will do whatever you tell it to do—at a price. In the cloud there’s a business case for making things more efficient. They rearchitected as they moved over. Managed services were a huge win. As they grew bigger they simply didn’t have the resources and the expertise to run all the needed infrastructure. That’s now Google’s job. This allowed having more generalized teams. It would be impossible for their team of 4 to manage all the things they use in GCP. Specialization is not required to run things; if you need it you just turn it on. That includes services like BigTable, k8s, Cloud Pub/Sub, Cloud Dataflow, and AI. It allows Etsy to punch above their weight class. They have a high level of support, with Google employees embedded on their team. Etsy didn’t lift and shift; they remade the platform as they moved over. If they had to do it over again they might have tried for a middle road, changing things before the migration.
  • Facebook Systems @Scale 2019 recap videos are now available.
  • The human skills we need in an unpredictable world. Efficiency and robustness trade off against each other. The more efficient something is, the less slack there is to handle the unexpected. When you target efficiency you may be making yourself more vulnerable to shocks.
  • The lesson is, you can’t wait around for Netflix or anyone else to promote your show. It’s up to you to create the buzz. How a Norwegian Viking Comedy Producer Hacked Netflix’s Algorithm: The key to landing on Netflix’s radar, he knew, would be to hack its recommendation engine: get enough people interested in the show early…Three weeks before launch, he set up a campaign on Facebook, paying for targeted posts and Facebook promotions. The posts were fairly simple — most included one of six short (20- to 25-second) clips of the show and a link, either to the show’s webpage or to media coverage. They used so-called A/B testing — showing two versions of a campaign to different audiences and selecting the most successful — to fine-tune. The U.S. campaign didn’t cost much — $18,500, which Tangen and his production partners put up themselves — and it was extremely precise. In just 28 days, the Norsemen campaign reached 5.5 million Facebook users, generating 2 million video views and some 6,000 followers for the show. Netflix noticed. “Three weeks after we launched, Netflix called me: ‘You need to come to L.A., your show is exploding,'” Tangen recalls. Tangen invested a further $15,000 to promote the show on Facebook worldwide, using what he had learned during the initial U.S. campaign.
  • How did NASA Steer the Saturn V? Triply redundant in logic. Doubly redundant in memory. Two channels are compared to make sure they’re getting the same answer. If the same numbers aren’t returned, a subroutine is called to determine which number makes the most sense at that point in the flight. Across all Saturn flights there were fewer than 10 miscompares. More components mean less reliability. Never had a catastrophic failure. The biggest problem is vibration.
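The 2-of-3 voting logic described above is easy to sketch. Here is a minimal Python illustration; the `fallback` heuristic stands in for the Saturn V’s “which number makes the most sense at this point in the flight” subroutine, and the median choice is just an assumed example:

```python
def vote(a, b, c, fallback):
    """Triple-modular-redundancy vote: return the majority value,
    falling back to a plausibility heuristic when all three disagree."""
    if a == b or a == c:
        return a
    if b == c:
        return b
    # All three channels disagree: decide which value makes the most sense.
    return fallback(a, b, c)

# Example fallback: take the median of the three readings.
median = lambda *vs: sorted(vs)[1]
print(vote(10, 10, 99, fallback=median))  # majority wins: 10
print(vote(9, 10, 11, fallback=median))   # no majority, median: 10
```

A miscompare is exactly the case where the first two checks fail and the fallback has to run.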
  • Interesting idea: instead of interviews, use how well a candidate performs on training software to determine how well they know a set of skills. The Cloudcast – Cloud Computing. The role of the generalist is gone. Pick a problem people are struggling with, become an expert at solving it, and market yourself as the person who has the skill of solving that problem.
  • The end state for any application is to write its own scheduler. Making Instagram.com faster: Part 1. Use preload tags to start fetching resources as soon as possible. You can even preload GraphQL requests to get a head start on those long queries. Preloads have a higher network priority. Use a preload tag for all script resources, placed in the order they will be needed. Load in new batches before the user hits the end of their current feed. A prioritized task abstraction handles queueing of asynchronous work (in this case, a prefetch for the next batch of feed posts). If the user scrolls close enough to the end of the current feed, we increase the priority of this prefetch task to ‘high’ by cancelling the pending idle callback and thus firing off the prefetch immediately. Once the JSON data for the next batch of posts arrives, we queue a sequential background prefetch of all the images in that preloaded batch of posts. We prefetch these sequentially in the order the posts are displayed in the feed rather than in parallel, so that we can prioritize the download and display of images in posts closest to the user’s viewport. Also Preemption in Nomad — a greedy algorithm that scales.
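The prioritized-task abstraction is worth spelling out. Below is a toy Python sketch of the idea; the names (`TaskScheduler`, `bump`) are invented for illustration, and Instagram’s real implementation is JavaScript and not shown in their post:

```python
import heapq

class TaskScheduler:
    """Toy prioritized task queue: a task scheduled at 'idle' priority can
    be bumped to 'high' later, mimicking cancelling a pending idle callback
    and firing the prefetch immediately."""
    PRIORITIES = {"high": 0, "idle": 1}

    def __init__(self):
        self._heap = []      # [priority, seq, name, fn] entries
        self._seq = 0
        self._entries = {}   # name -> live heap entry, for cancellation

    def schedule(self, name, fn, priority="idle"):
        self.cancel(name)
        entry = [self.PRIORITIES[priority], self._seq, name, fn]
        self._seq += 1
        self._entries[name] = entry
        heapq.heappush(self._heap, entry)

    def cancel(self, name):
        entry = self._entries.pop(name, None)
        if entry is not None:
            entry[3] = None  # mark dead; skipped lazily on pop

    def bump(self, name):
        """Requeue a still-pending task at high priority."""
        entry = self._entries.get(name)
        if entry is not None and entry[3] is not None:
            fn = entry[3]
            self.cancel(name)
            self.schedule(name, fn, "high")

    def run_next(self):
        while self._heap:
            _, _, name, fn = heapq.heappop(self._heap)
            if fn is not None:
                self._entries.pop(name, None)
                return fn()
        return None
```

Scheduling the next-batch prefetch at idle priority, then calling `bump("prefetch")` when the user nears the end of the feed, reproduces the cancel-the-idle-callback-and-fire-immediately behavior described above.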
  • Native lazy loading has arrived! Adding the loading attribute to the images decreased the load time on a fast network connection by ~50% — it went from ~1 second to < 0.5 seconds, as well as saving up to 40 requests to the server 🎊. All of those performance enhancements just from adding one attribute to a bunch of images!
  • Maybe it should just be simpler to create APIs? Simple Two-way Messaging using the Amazon SQS Temporary Queue Client. It seems a lot of people use queues for front-end/back-end communication because a queue is simpler to set up and easier to secure than creating an HTTP endpoint. So AWS came up with a virtual queue that lets you multiplex many virtual queues over a single physical queue. No extra cost. It’s all done in the client. A clever tag-based heartbeat mechanism is used to garbage collect queues.
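The virtual-queue idea is simple to illustrate. Here is a rough in-memory Python sketch of multiplexing many virtual queues over one physical queue; the real Temporary Queue Client is a Java library that tags SQS messages with attributes, so `VirtualQueueMux` below is purely an illustrative stand-in:

```python
import queue
from collections import defaultdict

class VirtualQueueMux:
    """Many lightweight 'virtual queues' riding on one physical queue,
    each message tagged with the virtual queue it belongs to."""
    def __init__(self):
        self._physical = queue.Queue()           # stands in for one SQS queue
        self._virtual = defaultdict(queue.Queue) # per-virtual-queue buffers

    def send(self, virtual_queue_id, body):
        # The real client sets a message attribute; here we tag a tuple.
        self._physical.put((virtual_queue_id, body))

    def receive(self, virtual_queue_id, timeout=0.1):
        # Drain the physical queue, routing each message to its buffer,
        # then deliver from the requested virtual queue.
        while True:
            try:
                vq, body = self._physical.get_nowait()
                self._virtual[vq].put(body)
            except queue.Empty:
                break
        try:
            return self._virtual[virtual_queue_id].get(timeout=timeout)
        except queue.Empty:
            return None
```

Because the routing happens entirely in the client, creating a “queue” is free: it is just a new tag value, which is why temporary reply queues cost nothing extra.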
  • Monolith to Microservices to Serverless — Our journey: A large part of our technology stack at that time comprised a Spring-based application and a MySQL database running on VMs in a data centre…The application was working for our thousands of customers, day in, day out, with little to no downtime. But it couldn’t be denied that new features were becoming difficult to build and the underlying infrastructure was beginning to struggle to scale as we continued to grow as a business…We needed a drastic rethink of our infrastructure and that came in the shape of Docker containers and Kubernetes…We took a long hard look at our codebase and with the ‘independent loosely coupled services’ mantra at the forefront of our minds we were quickly able to break off large parts of the monolith into smaller much more manageable services. New functionality was designed and built in the same way and we were quickly up to a 2 node K8s cluster with over 35 running pods….Fast forward to today and we have now been using AWS for well over 2 years, we have migrated the core parts of our reporting suite into the cloud and where appropriate all new functionality is built using serverless AWS services. Our ‘serverless first’ ethos allows us to build highly performant and highly scalable systems that are quick to provision and easy to manage. 
  • This is Crypto 101. Security Now 727 BlackHat & DefCon. Steve Gibson details how electronic hotel locks can protect themselves against replay attacks: All that’s needed to prevent this is for the door, when challenged to unlock, to provide a nonce for the phone to sign and return. The door contains a software ratchet. This is a counter which feeds a secretly-keyed AES symmetric cipher. Each door lock is configured with its own secret key which is never exposed. The AES cipher which encrypts a counter, produces a public elliptic key which is used to verify signatures. So the door lock first checks the key that it is currently valid for and has been using. If that fails, it checks ahead to the next public key to see whether that one can verify the returned signature. If not, it ignores the request. But if the next key does successfully verify the request’s signature it makes the next key permanent, ratcheting forward and forgetting the previous no-longer-valid key. This means that the door locks do not need to communicate with the hotel. Each door lock is able to operate autonomously with its own secret key which determines the sequence of its public keys. The hotel system knows each room’s secret key so it’s able to issue the proper private signing key to each guest for the proper room. If that system is designed correctly, no one with a copy of the Mobile Key software, and the ability to eavesdrop on the conversation, is able to gain any advantage from doing so.
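The ratchet logic is compact enough to sketch. The toy Python below substitutes HMAC tags for the real design’s elliptic-curve signatures (an intentional simplification; only the nonce challenge and the forward-only key ratchet are modeled, and `key_at` stands in for the secretly-keyed cipher over a counter):

```python
import hmac, hashlib, os

def key_at(secret, counter):
    # Stand-in for the keyed cipher over a counter that yields the lock's
    # key sequence (the real design derives elliptic-curve key pairs).
    return hmac.new(secret, counter.to_bytes(8, "big"), hashlib.sha256).digest()

class DoorLock:
    def __init__(self, secret):
        self._secret = secret
        self._counter = 0  # software ratchet: only ever moves forward

    def challenge(self):
        return os.urandom(16)  # a fresh nonce per attempt defeats replay

    def try_unlock(self, nonce, tag):
        for step in (0, 1):  # check the current key, then the next key
            k = key_at(self._secret, self._counter + step)
            if hmac.compare_digest(hmac.new(k, nonce, hashlib.sha256).digest(), tag):
                self._counter += step  # next key matched: ratchet forward
                return True
        return False

# The hotel, knowing the room secret, issues the guest the key for counter 1.
secret = b"room-101-secret"
lock = DoorLock(secret)
guest_key = key_at(secret, 1)  # first use ratchets the lock forward
nonce = lock.challenge()
tag = hmac.new(guest_key, nonce, hashlib.sha256).digest()
assert lock.try_unlock(nonce, tag)
```

After the ratchet, the previous key is forgotten: a tag made with `key_at(secret, 0)` no longer opens the door, and an eavesdropped `(nonce, tag)` pair is useless against a fresh challenge.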
  • Trip report: Summer ISO C++ standards meeting (Cologne). Reddit trip report. C++20 is now feature complete. Added: modules, coroutines, concepts including in the standard library via ranges, <=> spaceship including in the standard library, broad use of normal C++ for direct compile-time programming, ranges, calendars and time zones, text formatting, span, and lots more. Contracts were moved to C++21. 
  • Ingesting data at “Bharat” Scale: Initially, we considered Redis for our failover store, but with serving an average ingestion rate of 250K events per second, we would end up needing large Redis clusters just to support minutes worth of panic of our message bus. Finally, we decided to use a failover log producer that writes logs locally to disk. This periodically rotates & uploads to S3…We’ve seen outages, where our origin crashes & as it tries to recover, it is inundated with client retries & pending requests in the surge queue. That’s a recipe for cascading failure…We want to continue to serve the requests we can sustain, for anything over that, sorry, no entry. So we added a rate-limit to each of our API servers. We arrived at this configuration after a series of simulations & load-tests, to truly understand at what RPS our boxes will not sustain the load. We use nginx to control the number of requests per second using a leaky bucket algorithm. The target tracking scaling trigger is 3/4th of the rate-limit, to allow for the room to scale; but there are still occasions where large surges are too quick for target-tracking scaling to react.
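nginx’s request limiting is a leaky bucket, and the shape of the algorithm is easy to show. A minimal Python version follows; the injectable `clock` parameter is for testability and is not part of nginx’s design:

```python
import time

class LeakyBucket:
    """Leaky-bucket limiter in the style of nginx's limit_req: requests
    drain at a fixed rate; bursts beyond the bucket size are rejected."""
    def __init__(self, rate_per_sec, burst, clock=time.monotonic):
        self.rate = rate_per_sec
        self.burst = burst
        self.clock = clock
        self.level = 0.0          # how "full" the bucket currently is
        self.last = clock()

    def allow(self):
        now = self.clock()
        # Drain the bucket according to the elapsed time.
        self.level = max(0.0, self.level - (now - self.last) * self.rate)
        self.last = now
        if self.level < self.burst:
            self.level += 1
            return True
        return False  # over the limit: shed the request ("sorry, no entry")
```

Setting the target-tracking scaling trigger at 3/4 of the limiter’s rate, as described above, leaves headroom so the fleet can scale out before requests start getting shed.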

Soft Stuff:

  • jedisct1/libsodium: Sodium is a new, easy-to-use software library for encryption, decryption, signatures, password hashing and more. It is a portable, cross-compilable, installable, packageable fork of NaCl, with a compatible API, and an extended API to improve usability even further. Its goal is to provide all of the core operations needed to build higher-level cryptographic tools.
  • amejiarosario/dsa.js-data-structures-algorithms-javascript: In this repository, you can find the implementation of algorithms and data structures in JavaScript. This material can be used as a reference manual for developers, or you can refresh specific topics before an interview. Also, you can find ideas to solve problems more efficiently.
  • linkedin/brooklin: Brooklin is a distributed system intended for streaming data between various heterogeneous source and destination systems with high reliability and throughput at scale. Designed for multitenancy, Brooklin can simultaneously power hundreds of data pipelines across different systems and can easily be extended to support new sources and destinations.
  • gojekfarm/hospital: Hospital is an autonomous healing system for any system. Any failures or faults that occur in the system will be resolved automatically by Hospital according to a given run-book, without manual intervention.
  • BlazingDB/pyBlazing: BlazingSQL is a GPU accelerated SQL engine built on top of the RAPIDS ecosystem. RAPIDS is based on the Apache Arrow columnar memory format, and cuDF is a GPU DataFrame library for loading, joining, aggregating, filtering, and otherwise manipulating data.
  • serverless/components: Forget infrastructure — Serverless Components enables you to deploy entire serverless use-cases, like a blog, a user registration system, a payment system or an entire application — without managing complex cloud infrastructure configurations.

Pub Stuff:

  • Zooming in on Wide-area Latencies to a Global Cloud Provider: The network communications between the cloud and the client have become the weak link for global cloud services that aim to provide low latency services to their clients. In this paper, we first characterize WAN latency from the viewpoint of a large cloud provider Azure, whose network edges serve hundreds of billions of TCP connections a day across hundreds of locations worldwide. 
  • What is Applied Category Theory? Two themes that appear over and over (and over and over and over) in applied category theory are functorial semantics and compositionality. 
  • ML can never be fair. On Fairness and Calibration: In this paper, we investigate the tension between minimizing error disparity across different population groups while maintaining calibrated probability estimates. We show that calibration is compatible only with a single error constraint (i.e. equal false-negatives rates across groups), and show that any algorithm that satisfies this relaxation is no better than randomizing a percentage of predictions for an existing classifier. These unsettling findings, which extend and generalize existing results, are empirically confirmed on several datasets.

from High Scalability

Stuff The Internet Says On Scalability For August 2nd, 2019

Stuff The Internet Says On Scalability For August 2nd, 2019

Wake up! It’s HighScalability time—once again:

Do you like this sort of Stuff? I’d greatly appreciate your support on Patreon. I wrote Explain the Cloud Like I’m 10 for people who need to understand the cloud. And who doesn’t these days? On Amazon it has 52 mostly 5 star reviews (121 on Goodreads). They’ll learn a lot and hold you in even greater awe.

Number Stuff:

  • $9.6B: games investment in last 18 months, equal to the previous five years combined.
  • $3 million: won by a teenager in the Fortnite World Cup.  
  • 100,000: issues in Facebook’s codebase fixed from bugs found by static analysis. 
  • 106 million: Capital One IDs stolen by a former Amazon employee. (complaint)
  • 2 billion: IoT devices at risk because of 11 VXWorks zero day vulnerabilities.
  • 2.1 billion: parking spots in the US, taking 30% of city real estate, totaling 34 billion square meters, the size of West Virginia, valued at 60 trillion dollars.
  • 2.1 billion: people use Facebook, Instagram, WhatsApp, or Messenger every day on average. 
  • 100: words per minute from Facebook’s machine-learning algorithms capable of turning brain activity into speech. 
  • 51%: Facebook and Google’s ownership of the global digital ad market space on the internet.
  • 56.9%: Raleigh, NC was the top U.S. city for tech job growth.
  • 20-30: daily CPAN (Perl) uploads. 700-800 for Python.
  • 476 miles: LoRaWAN (Low Power, Wide Area (LPWA)) distance world record broken using 25mW transmission power.
  • 74%: Skyscanner savings using spot instances and containers on the Kubernetes cluster.
  • 49%: say convenience is more important than price when selecting a provider.
  • 30%: Airbnb app users prefer a non-default font size.
  • 150,000: number of databases migrated to AWS using the AWS Database Migration Service.
  • 1 billion: Google photos users, @MikeElgan: same size as Instagram but far larger than Twitter, Snapchat or Pinterest
  • 300M: Pinterest monthly active users with revenue of $261 million, up 64% year-over-year, on losses of $26 million for the second-quarter of 2019.
  • 7%: of all dating app messages were rated as false.
  • $100 million: Goldman Sachs spend to improve stock trades from hundreds of milliseconds down to 100 microseconds while handling more simultaneous trades. The article mentions using microservices and event sourcing, but it’s not clear how that’s related.

Quotable Stuff:

  • Josh Frydenberg, Australian Treasurer: Make no mistake, these companies are among the most powerful and valuable in the world. They need to be held to account and their activities need to be more transparent.
  • Neil Gershenfeld: Fabrication merges with communication and computation. Most fundamentally, it leads to things like morphogenesis and self-reproducing an assembler. Most practically, it leads to almost anybody can make almost anything, which is one of the most disruptive things I know happening right now. Think about this range I talked about as for computing the thousand, million, billion, trillion now happening for the physical world, it’s all here today but coming out on many different link scales.
  • Alan Kay: Marvin and Seymour could see that most interesting systems were crossconnected in ways that allowed parts to be interdependent on each other—not hierarchical—and that the parts of the systems needed to be processes rather than just “things”
  • Lawrence Abrams: Now that ransomware developers know that they can earn monstrous payouts from local cities and insurance policies, we see a new government agency, school district, or large company getting hit with a ransomware attack every day.
  • @tmclaughbos: A lot of serverless adoption will fail because organizations will push developers to assume more responsibility down the stack instead of forcing them to move up the stack closer to the business.
  • Lightstep: Google Cloud Functions’ reusable connection insertion makes the requests more than 4 times faster [than S3] both in region and cross region.
  • Henry A. Kissinger, Eric Schmidt, Daniel Huttenlocher: The evolution of the arms-control regime taught us that grand strategy requires an understanding of the capabilities and military deployments of potential adversaries. But if more and more intelligence becomes opaque, how will policy makers understand the views and abilities of their adversaries and perhaps even allies? Will many different internets emerge or, in the end, only one? What will be the implications for cooperation? For confrontation? As AI becomes ubiquitous, new concepts for its security need to emerge. The three of us differ in the extent to which we are optimists about AI. But we agree that it is changing human knowledge, perception, and reality—and, in so doing, changing the course of human history. We seek to understand it and its consequences, and encourage others across disciplines to do the same.
  • minesafetydisclosures: Visa’s business is all about scale. That’s because the company’s fixed costs are high, but the cost of processing a transaction is essentially zero. Said more simply, it takes a big upfront investment in computers, servers, personnel, marketing, and legal fees to run Visa. But those costs don’t increase as volume increases; i.e., they’re “fixed”. So as Visa processes more transactions through their network, profit swells. As a result, the company’s operating margin has increased from 40% to 65%. And the total expense per transaction has dropped from a dime to a nickel; of which only half of a penny goes to the processing cost. Both trends are likely to continue.
  • noobiemcfoob: Summarizing my views: MQTT seems as opaque as WebSockets without the benefits of being built on a very common protocol (HTTP) and being used in industries beyond just IoT. The main benefits proponents of MQTT argue for (low bandwidth, small libraries) don’t seem particularly true in comparison to HTTP and WebSockets.
  • erincandescent: It is still my opinion that RISC-V could be much better designed; though I will also say that if I was building a 32 or 64-bit CPU today I’d likely implement the architecture to benefit from the existing tooling.
  • Director Jon Favreau~  the plan was to create a virtual Serengeti in the Unity game engine, then apply live action filmmaking techniques to create the film — the “Lion King” team described this as a “virtual production process.”
  • Alex Heath: In confidential research Mr. Cunningham prepared for Facebook CEO Mark Zuckerberg, parts of which were obtained by The Information, he warned that if enough users started posting on Instagram or WhatsApp instead of Facebook, the blue app could enter a self-sustaining decline in usage that would be difficult to undo. Although such “tipping points” are difficult to predict, he wrote, they should be Facebook’s biggest concern. 
  • jitbit: Well, to be embarrassingly honest… We suck at pricing. We were offering “unlimited” plans to everyone until recently. And the “impressive names” like you mention, well, they mostly pay us around $250 a month – which used to be our “Enterprise” pricing plan with unlimited everything (users, storage, agents etc.) So I guess the real answer is – we suck at positioning and we suck at marketing. As the result – profits were REALLY low (Lesson learned – don’t compete on pricing). P.S. Couple of years ago I met Thomas from “FE International” at some conference, really experienced guy, who told me “dude, this is crazy, dump the unlimited plan like right now” so we did. So I guess technically we can afford a PaaS now…
  • 1e-9: The markets are kind of like a massive, distributed, realtime, ensemble, recursive predictor that performs much better than any one of its individual component algorithms could. The reason why shaving a few milliseconds (or even microseconds) can be beneficial is because the price discovery feedback loops get faster, which allows the system to determine a giant pricing vector that is more self-consistent, stable, and beneficial to the economy. It’s similar to how increasing the sample rate of a feedback control system improves performance and stability. Providers of such benefits to the markets get rewarded through profit.
  • @QuinnyPig: There’s something else afoot too. I fix cloud bills. If I offer $10k to save $100k people sign off. If I offer $10 million to save $100 million people laugh me out of the room. Large numbers are psychologically scary.
  • mrjn:  Is it worth paying $20K for any DB or DB support? If it would save you 1/10th of an engineer per year, it becomes immediately worth. That means, can you avoid 5 weeks of one SWE by using a DB designed to better suit your dataset? If the answer is yes (and most cases it is), then absolutely that price is worth. See my blog post about how much money it must be costing big companies building their graph layers. Second part is, is Dgraph worth paying for compared to Neo or others? Note that the price is for our enterprise features and support. Not for using the DB itself. Many companies run a 6-node or a 12-node distributed/replicated Dgraph cluster and we only learn that much later when they’re close to pushing it into production and need support. They don’t need to pay for it, the distributed/replicated/transactional architecture of Dgraph is all open source. How much would it cost if one were to run a distributed/replicated setup of another graph DB? Is it even possible, can it execute and perform well? And, when you add support to that, what’s the cost?
  • @codemouse: It’s halfway to 2020. At this point, if any of your strategy is continued investment into your data centers you’re doing it wrong. Yes migration may take years, but you’re not going to be doing #cloud or #ops better than @awscloud
  • hermitdev: Not Citibank, but previously worked for a financial firm that sold a copy of its back office fund administration stack. Large, on site deployment. It would take a month or two to make a simple DNS change so they could locate the services running on their internal network. The client was a US depository trust with trillions on deposit. No, I won’t name any names. But getting our software installed and deployed was as much fun as extracting a tooth with a dull wood chisel and a mallet.
  • Insikt Group: Approximately 50% of all activity concerning ransomware on underground forums are either requests for any generic ransomware or sales posts for generic ransomware from lower-level vendors. We believe this reflects a growing number of low-level actors developing and sharing generic ransomware on underground forums.
  • Facebook: For classes of bugs intended for all or a wide variety of engineers on a given platform, we have gravitated toward a “diff time” deployment, where analyzers participate as bots in code review, making automatic comments when an engineer submits a code modification. Later, we recount a striking situation where the diff time deployment saw a 70% fix rate, where a more traditional “offline” or “batch” deployment (where bug lists are presented to engineers, outside their workflow) saw a 0% fix rate.
  • Andy Rachleff: Venture capitalists know that the thing that causes their companies to go out of business is lack of a market, not poor execution. So it’s a fool’s errand to back a company that proposes to do a ride-hailing service or renting a room or something as crazy as that. Again–how would you know if it’s going to work? So the venture industry outsourced that market risk to the angel community. The angel community thinks they won it away from the venture community, but nothing could be further from the truth, because it’s a sucker bet. It’s a horrible risk/reward. The venture capitalists said, “Okay, let the angels invest at a $5 million valuation and take all of that market risk. We’ll invest at a $50 million valuation. We have to pay up if it works.” Now they hope the company will be worth $5 billion to make the same return as they would have in the old model. Interestingly, there now are as many companies worth $5 billion today as there were companies worth $500 million 20 years ago, which is why the returns of the premier venture capital firms have stayed the same or even gone up.
  • imagetic: I dealt with a lot of high traffic live streaming video on Facebook for several years. We saw interaction rates decline almost 20x in a 3 year period but views kept increasing. Things just didn’t add up when the dust settled and we’d look at the stats. I wouldn’t be the least bit surprised if every stat FB has fed me was blown extremely out of proportion.
  • prism1234: If you are designing a small embedded system, and not a high performance general computing device, then you already know what operations your software will need and can pick what extensions your core will have. So not including a multiply by default doesn’t matter in this case, and may be preferred if your use case doesn’t involve a multiply. That’s a large use case for risc-v, as this is where the cost of an arm license actually becomes an issue. They don’t need to compete with a cell phone or laptop level cpu to still be a good choice for lots of devices.
  • oppositelock: You don’t have time to implement everything yourself, so you delegate. Some people now have credentials to the production systems, and to ease their own debugging, or deployment, spin up little helper bastion instances, so they don’t have to use 2FA each time to use SSH or don’t have to deal with limited-time SSH cert authorities, or whatever. They roll out your fairly secure design, and forget about the little bastion they’ve left hanging around, open to 0.0.0.0 with the default SSH private key every dev checks into git. So, any former employee can get into the bastion.
  • Lyft: Our tech stack comprises Apache Hive, Presto, an internal machine learning (ML) platform, Airflow, and third-party APIs.
  • Casey Rosenthal: It turns out that redundancy is often orthogonal to robustness, and in many cases it is absolutely a contributing factor to catastrophic failure. The problem is, you can’t really tell which of those it is until after an incident definitively proves it’s the latter.
  • Colm MacCárthaigh: There are two complementary tools in the chest that we all have these days, that really help combat Open Loops. The first is Chaos Engineering. If you actually deliberately go break things a lot, that tends to find a lot of Open Loops and make it obvious that they have to be fixed.
  • @eeyitemi: I’m gonna constantly remind myself of this everyday. “You can outsource the work, but you can’t outsource the risk.” @Viss 2019
  • Ben Grossman~ this could lead to a situation where filmmaking is less about traditional “filmmaking or storytelling,” and more about “world-building”: “You create a world where characters have personalities and they have motivations to do different things and then essentially, you can throw them all out there like a simulation and then you can put real people in there and see what happens.”
  • cheeze: I’m a professional dev and we own a decent amount of perl. That codebase is by far the most difficult to work in out of anything we own. New hires have trouble with it (nobody learns perl these days). Lots of it is next to unreadable.
  • Annie Lowrey: All that capital from institutional investors, sovereign wealth funds, and the like has enabled start-ups to remain private for far longer than they previously did, raising bigger and bigger rounds. (Hence the rise of the “unicorn,” a term coined by the investor Aileen Lee to describe start-ups worth more than $1 billion, of which there are now 376.) Such financial resources “never existed at scale before” in Silicon Valley, says Steve Blank, a founder and investor. “Investors said this: ‘If we could pull back our start-ups from the public market and let them appreciate longer privately, we, the investors, could take that appreciation rather than give it to the public market.’ That’s it.”
  • alexis_fr: I wonder if the human life calculation worked well this time. As far as I see, Boeing lost more than the sum of the human lives; they also lost reputation for everything new they’ve designed in the last 7 years being corrupted, and they also engulfed the reputation of FAA with them, whose agents would fit the definition of “corrupted” by any people’s definition (I know, they are not, they just used agents of Boeing to inspect Boeing because they were understaffed), and the FAA showed the last step of failure by not admitting that the plane had to be stopped until a few days after the European agencies. In other words, even in financial terms, it cost more than damages. It may have cost the entire company. They “DeHavailland”’ed their company. Ever heard of DeHavailland? No? That’s probably to do with their 4 successive deintegrating planes that “CEOs have complete trust in.” It just died, as a name. The risk is high.
  • Neil Gershenfeld: computer science was one of the worst things ever to happen to computers or science, why I believe that, and what that leads me to. I believe that because it’s fundamentally unphysical. It’s based on maintaining a fiction that digital isn’t physical and happens in a disconnected virtual world.
  • @benedictevans: Netflix and Sky both realised that a new technology meant you could pay vastly more for content than anyone expected, and take it to market in a new way. The new tech (satellite, broadband) is a crowbar for breaking into TV. But the questions that matter are all TV questions
  • @iamdevloper: Therapist: And what do we do when we feel like this? Me: buy a domain name for the side project idea we’ve had for 15 seconds. Therapist: No
  • @dvassallo: Step 1: Forget that all these things exist: Microservices, Lambda, API Gateway, Containers, Kubernetes, Docker. Anything whose main value proposition is about “ability to scale” will likely trade off your “ability to be agile & survive”. That’s rarely a good trade off. 4/25 Start with a t3.nano EC2 instance, and do all your testing & staging on it. It only costs $3.80/mo. Then before you launch, use something bigger for prod, maybe an m5.large (2 vCPU & 8 GB mem). It’s $70/mo and can easily serve 1 million page views per day.
  • PeteSearch: I believe we have an opportunity right now to engineer-in privacy at a hardware level, and set the technical expectation that we should design our systems to be resistant to abuse from the very start. The nice thing about bundling the audio and image sensors together with the machine learning logic into a single component is that we have the ability to constrain the interface. If we truly do just have a single pin output that indicates if a person is present, and there’s no other way (like Bluetooth or WiFi) to smuggle information out, then we should be able to offer strong promises that it’s just not possible to leak pictures. The same for speech interfaces, if it’s only able to recognize certain commands then we should be able to guarantee that it can’t be used to record conversations.
  • Murat: As I have mentioned in the previous blog post, MAD questions, Cosmos DB has operationalized a fault-masking streamlined version of replication via nested replica-sets deployed in fan-out topology. Rather than doing offline updates from a log, Cosmos DB updates database at the replicas online, in place, to provide strong consistent and bounded-staleness consistency reads among other read levels. On the other hand, Cosmos DB also maintains a change log by way of a witness replica, which serves several useful purposes, including fault-tolerance, remote storage, and snapshots for analytic workload.
  • grauenwolf: That’s where I get so frustrated. Far too often I hear “premature optimization” as a justification for inefficient code when doing it the right way would actually require the same or less effort and be more readable.
  • Murat: Leader – I tell you Paxos joke, if you accept me as leader. Quorum – Ok comrade. Leader – Here is joke! (*Transmits joke*) Quorum – Oookay… Leader – (*Laughs* hahaha). Now you laugh!! Quorum – Hahaha, hahaha.
  • Manmax75: The amount of stories I’ve heard from SysAdmins who jokingly try to access a former employers network with their old credentials only to be shocked they still have admin access is a scary and boggling thought.
  • @dougtoppin: Fargate brings significant opportunity for cost savings and to get the maximum benefit the minimal possible number of tasks must be running to handle your capacity needs. This means quickly detecting request traffic, responding just as quickly and then scaling back down.
  • @evolvable: At a startup bank we got management pushback when revealing we planned to start testing in production – concerns around regulation and employees accessing prod. We changed the name to “Production Verification”. The discussion changed to why we hadn’t been doing it until now. 
  • @QuinnyPig: I’m saying it a bit louder every time: @awscloud’s data transfer pricing is predatory garbage. I have made hundreds of thousands of consulting dollars straightening these messes out. It’s unconscionable. I don’t want to have to do this for a living. To be very clear, it’s not that the data transfer pricing is too expensive, it’s that it’s freaking inscrutable to understand. If I can cut someone’s bill significantly with a trivial routing change, that’s not the customer’s fault.
  • @PPathole: Alternative Big O notations: O(1) = O(yeah) O(log n) = O(nice) O(nlogn) = O(k-ish) O(n) = O(ok) O(n²) = O(my) O(2ⁿ) = O(no) O(n^n) = O(f*ck) O(n!) = O(mg!)
  • Brewster Kahle: There’s only a few hackers I’ve known like Richard Stallman, he’d write flawless code at typing speed. He worked himself to the bone trying to keep up with really smart former colleagues who had been poached from MIT. Carpal tunnel, sleeping under the desk, really trying hard for a few years and it was killing him. So he basically says I give up, we’re going to lose the Lisp machine. It was going into this company that was flying high, it was going to own the world, and he said it was going to die, and with it the Lisp machine. He said all that work is going to be lost, we need a way to deal with the violence of forking. And he came up with the GNU public license. The GPL is a really elegant hack in the classic sense of a hack. His idea of the GPL was to allow people to use code but to let people put it back into things. Share and share alike.

Useful Stuff:

  • It’s probably not a good idea to start a Facebook poll on the advisability of your pending nuptials a day before the wedding. But it is very funny and disturbingly plausible. Made Public. Another funny/sad one is using a ML bot to “deal with” phone scams. The sad part will be when both sides are just AIs trying to socially engineer each other and half the world’s resources become dedicated to yet another form of digital masturbation. Perhaps we should just stop the MADness?
  • URGENT/11: 11 Zero Day Vulnerabilities Impacting VxWorks, the Most Widely Used Real-Time Operating System (RTOS). I read this with special interest because I’ve used VxWorks on several projects. Not once do I remember anyone asking “I wonder if the TCP/IP stack has security vulnerabilities?” We looked at licensing costs, board support packages, device driver support, tool chain support, ISR service latencies, priority inversion handling, task switch determinacy, etc. Why did we never think of these kinds of vulnerabilities? One reason is social proof. Surely all these other companies use VxWorks, it must be good, right? Another reason is that VxWorks is often used within a secure perimeter. None of the network interfaces are supposed to be exposed to the internet, so remote code execution is not part of your threat model. But in reality you have no idea if a customer will expose a device to the internet. And you have no idea if later product enhancements will place the device on the internet. Since it seems all network devices expand until they become a router, this seems a likely path to Armageddon. At that point nobody is going to requalify their entire toolchain. That just wouldn’t be done in practice. VxWorks is dangerous because everything is compiled into a single image that boots and runs, much like a unikernel. At least when I used it that was the case. VxWorks is basically just a library you link into your application that provides OS functionality. You write the boot code, device drivers, and other code to make your application work. So if there’s a remote code execution bug, it has access to everything. And a lot of these images are burned into ROM, so they aren’t upgradeable. And even if the images are upgradeable in EEPROM or flash, how many people will actually do that? Unless you pay a lot of money you do not get the source to VxWorks. You just get libraries and header files. So you have no idea what’s going on in the network stack. 
I’m surprised the VxWorks stack was never fuzz tested; that’s a great way to find bugs in protocols. Though nobody can define simplicity, many of the bugs were in the handling of the little-used TCP Urgent Pointer feature. Is anyone surprised the code around it is broken? Who uses it? It shouldn’t be in the stack at all. Simple to say, harder to do.
  • JuliaCon 2019 videos are now available. You might like Keynote: Professor Steven G. Johnson and The Unreasonable Effectiveness of Multiple Dispatch
  • CERN is Migrating to open-source technologies. Microsoft wants too much for their licenses so CERN is giving MS the finger.
  • Memory and Compute with Onur Mutlu:
    • The main problem is that DRAM latency is hardly improving at all. From 1999 to 2017, DRAM capacity has increased by 128x, bandwidth by 20x, but latency only by 1.3x! This means that more and more effort has to be spent tolerating memory latency.  But what could be done to actually improve memory latency?
    • You could “easily” get a 30% latency improvement by having DRAM chips provide a bit more precise information to the memory controller about actual latencies and current temperatures.
    • Another concept to truly break the memory barrier is to move the compute to the memory. Basically, why not put the compute operations in memory?  One way is to use something like High-Bandwidth Memory (HBM) and shorten the distance to memory by stacking logic and memory.
    • Another rather cool (but also somewhat limited) approach is to actually use the DRAM cells themselves as a compute engine. It turns out that you can do copy, clear, and even logic ops on memory rows by using the existing way that DRAMs are built and adding a tiny amount of extra logic.
  • Want to make something in hardware? Like Pebble, Dropcam, or Ring. Who you gonna call? Dragon Innovation. Listen how on the AMP Hour podcast episode #451 – An Interview with Scott Miller
    • Typical customers build between 5k and 1 million units, but they’ll talk with you at 100 units. Customers usually start small. Dragon has built a big toolbox for IoT, so they don’t need to reinvent the wheel every time; they have designs for sensing, processing, electronics on the edge, radios, and all the different security layers. They can deploy quickly with little customization.
    • Dragon is moving into doing the design, manufacturing, packaging, issue all POs, and installation support. They call this Product as a Service (PaaS)—full end-to-end provider. Say you have a sensor to determine when avocados are ripe you would pay per sensor per month, or maybe per avocado, instead of a one time sale. Seeing more non-traditional getting into the IoT space, with different revenue models, Dragon has an opportunity to innovate on their business model. 
    • Consumer is dying and industrial is growing. A trend they are seeing in the US is a contraction of business-to-consumer startups in the hardware space, but an expansion of industrial IoT. There have been a bunch of high-profile bankruptcies in the consumer space (Anki, Jibo).
    • Europe is growing. Overall huge growth in industrial startups across Europe. Huge number of capable factories in the EU. They get feet on the ground to find and qualify factories. They have over 2000 factories in their database. 75% in China, increasingly more in the EU and the US. 
    • Factories are going global. Seeing a lot of companies driven out of China by the 25% tariffs, moving into Asian pacific countries like Taiwan, Singapore, Vietnam, Indonesia, Malaysia. Coming up quickly, but not up to China’s level yet. Dragon will include RFQs on a global basis, including factories from the US, China, EU, Indonesia, Vietnam, to see what the landed cost is as a function of geography. 
    • Factories are different in different countries. In China factories are vertically integrated: mold making, injection molding, final assembly, test, and packaging, all under one roof, which is very convenient. In the US and Europe factories are more horizontal, so it takes a lot more effort to put together your supply chain. As an example of the degree of vertical integration, one factory in China made its own paint and cardboard. 
    • Automation is huge in China. Chinese labor rates average 5 to 6 dollars an hour, depending on region, factory, and training, so the focus is on automation. One factory they worked with had 100,000 workers; now it has 30,000 because of automation.
    • Automation is different in China. Automation in China is bottom-up: they’ll build a simple robot that attaches to a soldering iron and solders the leads. In the US it’s top-down: build a huge, fully functioning worker that can do anything instead of a task-specific robot. China is really good at building stuff, so they build task-specific robots to make their processes more efficient. Since products are always changing, this allows them to stay nimble. 
    • Also from Strange Parts: Design for Manufacturing Course and How I Made My Own iPhone – in China.
  • BigQuery best practices: Controlling costs: Query only the columns that you need; don’t run queries to explore or preview table data; before running queries, preview them to estimate costs; use the query validator; use the maximum bytes billed setting to limit query costs; do not use a LIMIT clause as a method of cost control; create a dashboard to view your billing data so you can make adjustments to your BigQuery usage, and consider streaming your audit logs to BigQuery so you can analyze usage patterns; partition your tables by date; if possible, materialize your query results in stages; if you are writing large query results to a destination table, use the default table expiration time to remove the data when it’s no longer needed; use streaming inserts only if your data must be immediately available.
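A back-of-the-envelope sketch of why the byte-scanned advice above matters. The $5/TB figure is BigQuery's approximate 2019 on-demand price and may be out of date, and the function names here are made up, not part of any Google client library:

```python
# Hypothetical helpers for reasoning about BigQuery on-demand costs,
# assuming the (2019-era) price of roughly $5 per TB scanned.

PRICE_PER_TB = 5.00
TB = 1024 ** 4

def estimate_cost(bytes_processed: int) -> float:
    """Estimate on-demand cost from a dry run's byte count."""
    return bytes_processed / TB * PRICE_PER_TB

def enforce_byte_cap(bytes_processed: int, maximum_bytes_billed: int) -> None:
    """Local analogue of the maximum-bytes-billed setting: refuse big scans."""
    if bytes_processed > maximum_bytes_billed:
        raise RuntimeError(f"query would scan {bytes_processed} bytes; "
                           f"cap is {maximum_bytes_billed}")

# A dry run reporting a 2 TB scan implies roughly $10 -- this is why
# selecting only needed columns and partitioning by date pay off.
print(f"${estimate_cost(2 * TB):.2f}")  # → $10.00
```

Note that a LIMIT clause doesn't reduce the bytes scanned, which is why the post says not to use it for cost control; only column pruning and partition pruning do.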
  • Boeing has changed a lot over the years. Once upon a time I worked on a project with Boeing and the people were excellent. This is something I heard: “The changes can be attributed to the influence of the McDonnell family who maintain extremely high influence through their stock shares resulting from the merger. It has been gradually getting better recently but still a problem for those inside who understand the real potential impact.”
  • Maybe we are all just random matrices? What Is Universality? It turns out there are deep patterns in complex correlated systems that lie somewhere between randomness and order. They arise from components that interact and repel one another. Do such patterns exist in software systems? Also, Bubble Experiment Finds Universal Laws
  • PID Loops and the Art of Keeping Systems Stable
    • I see a lot of places where control theory is directly applicable but rarely applied. Auto-scaling and placement are really obvious examples, we’re going to walk through some, but another is fairness algorithms. A really common fairness algorithm is how TCP achieves fairness. You’ve got all these network users and you want to give them all a fair slice. Turns out that a PID loop it’s what’s happening. In system stability, how do we absorb errors, recover from those errors? 
    • Something we do in CloudFront is we run a control system. We’re constantly measuring the utilization of each site and depending on that utilization, we figure out what’s our error, how far are we from optimized? We change the mass or radius of effect of each site, so that at our really busy time of day, really close to peak, it’s servicing everybody in that city, everybody directly around it drawing those in, but that at our quieter time of day can extend a little further and go out. It’s a big system of dynamic springs all interconnected, all with PID loops. It’s amazing how optimal a system like that can be, and how applying a system like that has increased our effectiveness as a CDN provider. 
    • A surprising number of control systems are just like this, they’re just Open Loops. I can’t count the number of customers I’ve gone through control systems with and they told me, “We have this system that pushes out some states, some configuration and sometimes it doesn’t do it.” I find that scary, because what it’s saying is nothing’s actually monitoring the system. Nothing’s really checking that everything is as it should be. My classic favorite example of this as an Open Loop process, is certificate rotation. I happened to work on TLS a lot, it’s something I spent a lot of my time on. Not a week goes by without some major website having a certificate outage.
    • We have two observability systems at AWS, CloudWatch, and X-Ray. One of the things I didn’t appreciate until I joined AWS – I was a bit going on like Charlie and the chocolate factory, and seeing the insides. I expected to see all sorts of cool algorithms and all sorts of fancy techniques and things that I just never imagined. It was a little bit of that, there was some of that once I got inside working, but mostly what I found was really mundane, people were just doing a lot of things at scale that I didn’t realize. One of those things was just the sheer volume of monitoring. The number of metrics we keep on, every single host, every single system, I still find staggering.
    • Exponential Back-off is a really strong example. Exponential Back-off is basically an integral, an error happens and we retry, a second later if that fails, then we wait. Rate limiters are like derivatives, they’re just rate estimators and what’s going on and deciding what’s to let in and what to let out. We’ve built both of these into the AWS SDKs. We’ve got other back pressure strategies too, we’ve got systems where servers can tell clients, “Back off, please, I’m a little busy right now,” all those things working together. If I look at system design and it doesn’t have any of this, if it doesn’t have exponential back-off, if it doesn’t have rate-limiters in some place, if it’s not able to fight some power-law that I think might arise due to errors propagating, that tells me I need to be a bit more worried and start digging deeper.
    • I like to watch out for edge triggering in systems, it tends to be an anti-pattern. One reason is because edge triggering seems to imply a modal behavior. You cross the line, you kick into a new mode, that mode is probably rarely tested and it’s now being kicked into at a time of high stress, that’s really dangerous. Your system has to be idempotent, if you’re going to build an idempotent system, you might as well make a level-triggered system in the first place, because generally, the only benefit of building an edge-triggered system is it doesn’t have to be idempotent.
    • There is definitely tension between stability and optimality, and in general, the more finely-tuned you want to make a system to achieve absolute optimality, the more risk you are of being able to drive it into an unstable state. There are people who do entire PIDs on nothing else then finding that balance for one system. Oil refineries are a good example, where the oil industry will pay people a lot of money just to optimize that, even very slightly. Computer Science, in my opinion, and distributed systems, are nowhere near that level of advanced control theory practice yet. We have a long way to go. We’re still down at the baby steps of, “We’ll at least measure it.”
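The talk never shows code, so here is a minimal, illustrative discrete PID loop (not AWS's implementation) driving a toy "capacity" toward a target; the gains are made up and a real system would need per-system tuning:

```python
# Minimal discrete PID controller sketch. The proportional term reacts to the
# current error, the integral term absorbs steady-state error (like exponential
# back-off accumulating retries), and the derivative term damps rapid change
# (like a rate estimator).

class PID:
    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, setpoint, measured, dt=1.0):
        error = setpoint - measured
        self.integral += error * dt                  # I: accumulated error
        derivative = (error - self.prev_error) / dt  # D: rate of change of error
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

# Toy plant: capacity simply accumulates the controller output each tick.
pid = PID(kp=0.5, ki=0.02, kd=0.1)
capacity, target = 0.0, 100.0
for _ in range(200):
    capacity += pid.update(target, capacity)
print(round(capacity, 1))  # settles near the target of 100
```

This is a closed loop in the talk's sense: the controller keeps measuring the actual state rather than pushing configuration and hoping.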
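The exponential back-off described above can be sketched in a few lines. This is a simplified illustration in the spirit of the AWS SDK behavior, not its actual code: real SDKs retry only retryable errors and also honor server-side "please back off" hints:

```python
# Retries with capped exponential back-off and full jitter.
import random

def backoff_delays(retries, base=0.1, cap=20.0):
    """Yield one delay per retry: full jitter over min(cap, base * 2**attempt)."""
    for attempt in range(retries):
        yield random.uniform(0, min(cap, base * 2 ** attempt))

def call_with_retries(op, retries=5):
    for delay in backoff_delays(retries):
        try:
            return op()
        except ConnectionError:
            pass  # real code would time.sleep(delay) here before retrying
    return op()  # final attempt; let any exception propagate

# A stand-in operation that fails twice, then succeeds.
attempts = {"n": 0}
def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("transient failure")
    return "ok"

result = call_with_retries(flaky)
print(result, attempts["n"])  # → ok 3
```

The jitter matters: without it, a fleet of clients that failed together retries together, re-creating the very spike that caused the errors.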
  • Re:Inforce 2019 videos are now available.
  • Top Seven Myths of Robust Systems: The number one myth we hear out in the field is that if a system is unreliable, we can fix that with redundancy; rather than trying to simplify or remove complexity, learn to live with it. Ride complexity like a wave. Navigate the complexity; The adaptive capacity to improvise well in the face of a potential system failure comes from frequent exposure to risk; Both sides — the procedure-makers and the procedure-not-followers — have the best of intentions, and yet neither is likely to believe that about the other; Unfortunately it turns out catastrophic failures in particular tend to be a unique confluence of contributing factors and circumstances, so protecting yourself from prior outages, while it shouldn’t hurt, also doesn’t help very much; Best practices aren’t really a knowable thing; Don’t blame individuals. That’s the easy way out, but it doesn’t fix the system. Change the system instead. 
  • They grow up so slow. What’s new in JavaScript: Google I/O 2019 Summary
  • From a rough calculation we saw about a 40% decrease in the amount of CPU resources used. Overall, we saw latency stabilize for both avg and max p99. Max p99 latency also decreased a bit. Safely Rewriting Mixpanel’s Highest Throughput Service in Golang. Mixpanel moved from Python to Go for their data collection API. They had already migrated the Python API to use the Google Load Balancer to route messages to Kubernetes pods on Google Cloud, where an Envoy container load-balanced between eight Python API containers. The Python API containers then submitted the data to a Google Pub/Sub queue via a pubsub sidecar container that had a kestrel interface. To enable testing against live traffic, they created a dedicated setup: a separate Kubernetes pod running in the same namespace and cluster as the API deployments. The pod ran an open source API correctness tool, Diffy, along with copies of the old and new API services. Diffy is a service that accepts HTTP requests and forwards them to two copies of an existing HTTP service and one copy of a candidate HTTP service. One huge improvement is they now only need to run a single API container per pod. 
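The Diffy trick is worth a sketch. Each request goes to two copies of the old service (primary and secondary) plus the candidate; a field where the two old copies disagree is nondeterministic noise and is ignored, while a candidate mismatch on a stable field is a likely regression. The services below are stand-in functions, not Mixpanel's actual APIs:

```python
# Toy version of Diffy-style response diffing.
import itertools

_clock = itertools.count()  # makes the "ts" field nondeterministic across calls

def old_api(req):
    return {"status": 200, "count": req * 2, "ts": next(_clock)}

def new_api(req):
    return {"status": 200, "count": req * 3, "ts": 0}  # bug: count is wrong

def diff_request(req, primary, secondary, candidate):
    p, s, c = primary(req), secondary(req), candidate(req)
    return {
        k: (p[k], c.get(k))
        for k in p
        if p[k] == s.get(k) and c.get(k) != p[k]  # stable field, candidate differs
    }

# The noisy "ts" field is filtered out; the real regression in "count" is flagged.
print(diff_request(21, old_api, old_api, new_api))  # → {'count': (42, 63)}
```

Running two old copies just to filter noise looks wasteful, but it is what makes diffing against live traffic practical.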
  • Satisfactory: Network Optimizations: It would be a big gain to stop replicating the inventory when it’s not viewed, which is essentially what we did, but the method of doing so was a bit complicated and required a lot of rework…Doing this also helps to reduce CPU time, as an inventory is a big state to compare, and look for changes in. If we can reduce that to a maximum of 4x the number of players it is a huge gain, compared to the hundreds, if not thousands, that would otherwise be present in a big base…There is, of course, a trade-off. As I mentioned there is a chance the inventory is not there when you first open to view it, as it has yet to arrive over the network…In this case the old system actually varied in size but landed around 48 bytes per delta, compared to the new system of just 3 bytes…On top of this, we also reduced how often a conveyor tries to send an update to just 3 times a second compared to the previous of over 20…the accuracy of item placements on the conveyors took a small hit, but we have added complicated systems in order to compensate for that…we’ve noticed that the biggest issue for running smooth multiplayer in large factories is not the network traffic anymore, it’s rather the general performance of the PC acting as a server.
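The post gives the sizes (48 bytes down to 3) but not the wire format, so the packing scheme below is purely a guess to show how a 3-byte conveyor delta is plausible, e.g. two 12-bit fields fitting exactly in 3 bytes:

```python
# Hypothetical 3-byte delta: a 12-bit item id plus a 12-bit position.

def pack_delta(item_id: int, position: int) -> bytes:
    assert 0 <= item_id < 4096 and 0 <= position < 4096  # 12 bits each
    return ((item_id << 12) | position).to_bytes(3, "big")

def unpack_delta(data: bytes):
    word = int.from_bytes(data, "big")
    return word >> 12, word & 0xFFF

delta = pack_delta(item_id=100, position=2047)
print(len(delta), unpack_delta(delta))  # → 3 (100, 2047)
```

Sending quantized deltas at a lower tick rate trades item-placement accuracy for bandwidth, which matches the trade-off the developers describe.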
  • MariaDB vs MySQL Differences: MariaDB is fully GPL licensed while MySQL takes a dual-license approach. Each handles thread pools in a different way. MariaDB supports a lot of different storage engines. In many scenarios, MariaDB offers improved performance.
  • Our pySpark pipeline churns through tens of billions of rows on a daily basis. Calculating 30 billion speed estimates a week with Apache Spark: Probes generated from the traces are matched against the entire world’s road network. At the end of the matching process we are able to assign each trace an average speed, a 5 minute time bucket and a road segment. Matches on the same road that fall within the same 5 minute time bucket are aggregated to create a speed histogram. Finally, we estimate a speed for each aggregated histogram which represents our prediction of what a driver will experience on a road at a given time of the week…On a weekly basis, we match on average 2.2 billion traces to 2.3 billion roads to produce 5.4 billion matches. From the matches, we build 51 billion speed histograms to finally produce 30 billion speed estimates…The first thing we spent time on was designing the pipeline and schemas of all the different datasets it would produce. In our pipeline, each pySpark application produces a dataset persisted in a hive table readily available for a downstream application to use…Instead of having one pySpark application execute all the steps (map matching, aggregation, speed estimation, etc.) we isolated each step to its own application…We favored normalizing our tables as much as possible and getting to the final traffic profiles dataset through relationships between relevant tables…Partitioning makes querying part of the data faster and easier. We partition all the resulting datasets by both a temporal and spatial dimension. 
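The core aggregation step described above can be sketched in plain Python (the real pipeline is pySpark over billions of rows): matched traces are grouped by (road segment, 5-minute bucket) and each group's speed histogram is reduced to one estimate. Using the median as the estimator is an assumption; the post doesn't say which statistic is used:

```python
# Group map-matched traces into per-segment 5-minute buckets, then estimate.
from collections import defaultdict
from statistics import median

BUCKET_SECONDS = 5 * 60
WEEK_SECONDS = 7 * 24 * 3600

def bucket_of(timestamp_s: int) -> int:
    """5-minute time bucket within the week (0..2015)."""
    return (timestamp_s % WEEK_SECONDS) // BUCKET_SECONDS

def speed_estimates(matches):
    """matches: iterable of (segment_id, timestamp_s, speed_kmh) rows."""
    histograms = defaultdict(list)
    for segment, ts, speed in matches:
        histograms[(segment, bucket_of(ts))].append(speed)
    # Reduce each histogram to a single speed estimate.
    return {key: median(speeds) for key, speeds in histograms.items()}

matches = [("seg-1", 0, 50), ("seg-1", 100, 60), ("seg-1", 400, 70)]
print(speed_estimates(matches))  # → {('seg-1', 0): 55.0, ('seg-1', 1): 70}
```

Keying everything by a (spatial, temporal) pair is also what makes the partitioning strategy in the post work: queries for one region and time window only touch the matching partitions.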
  • Do not read this unless you can become comfortable with the feeling that everything you’ve done in your life is trivial and vainglorious. Morphogenesis for the Design of Design
    • One of my students built and runs all the computers Facebook runs on, one of my students used to run all the computers Twitter runs on—this is because I taught them to not believe in computer science. In other words, their job is to take billions of dollars, hundreds of megawatts, and tons of mass, and make information while also not believing that the digital is abstracted from the physical. Some of the other things that have come out from this lineage were the first quantum computations, or microfluidic computing, or part of creating some of the first minimal cells.
    • The Turing machine was never meant to be an architecture. In fact, I’d argue it has a very fundamental mistake, which is that the head is distinct from the tape. And the notion that the head is distinct from the tape—meaning, persistence of tape is different from interaction—has persisted. The computer in front of Rod Brooks here is spending about half of its work just shuttling from the tape to the head and back again.
    • There’s a whole parallel history of computing, from Maxwell to Boltzmann to Szilard to Landauer to Bennett, where you represent computation with physical resources. You don’t pretend digital is separate from physical. Computation has physical resources. It has all sorts of opportunities, and getting that wrong leads to a number of false dichotomies that I want to talk through now. One false dichotomy is that in computer science you’re taught many different models of computation and adherence, and there’s a whole taxonomy of them. In physics there’s only one model of computation: A patch of space occupies space, it takes time to transit, it stores state, and states interact—that’s what the universe does. Anything other than that model of computation is physics and you need epicycles to maintain the fiction, and in many ways that fiction is now breaking.
    • We did a study for DARPA of what would happen if you rewrote from scratch a computer software and hardware so that you represented space and time physically.
    • One of the places that I’ve been involved in pushing that is in exascale high-performance computing architecture, really just a fundamental do-over to make software look like hardware and not to be in an abstracted world.
    • Digital isn’t ones and zeroes. One of the hearts of what Shannon did is threshold theorems. A threshold theorem says I can talk to you as a wave form or as a symbol. If I talk to you as a symbol, if the noise is above a threshold, you’re guaranteed to decode it wrong; if the noise is below a threshold, for a linear increase in the physical resources representing the symbol there’s an exponential reduction in the fidelity to decode it. That exponential scaling means unreliable devices can operate reliably. The real meaning of digital is that scaling property. But the scaling property isn’t one and zero; it’s the states in the system. 
    • If you mix chemicals and make a chemical reaction, a yield of a part per 100 is good. When the ribosome—the molecular assembler that makes your proteins—elongates, it makes an error of one in 10⁴. When DNA replicates, it adds one extra error-correction step, and that takes the error rate to one in 10⁸, and that’s exactly the scaling of the threshold theorem. The exponential complexity that makes you possible comes from error detection and correction in your construction. It’s everything Shannon and von Neumann taught us about codes and reconstruction, but it’s now doing it in physical systems.
    • One of the projects I’m working on in my lab that I’m most excited about is making an assembler that can assemble assemblers from the parts that it’s assembling—a self-reproducing machine. What it’s based on is us. 
    • If you look at scaling coding construction by assembly, ribosomes are slow—they run at one hertz, one amino acid a second—but a cell can have a million, and you can have a trillion cells. As you were sitting here listening, you’re placing 10¹⁸ parts a second, and it’s because you can ring up this capacity of assembling assemblers. The heart of the project is the exponential scaling of self-reproducing assemblers.
    • As we work on the self-reproducing assembler, and writing software that looks like hardware that respects geometry, they meet in morphogenesis. This is the thing I’m most excited about right now: the design of design. Your genome doesn’t store anywhere that you have five fingers. It stores a developmental program, and when you run it, you get five fingers. It’s one of the oldest parts of the genome. Hox genes are an example. It’s essentially the only part of the genome where the spatial order matters. It gets read off as a program, and the program never represents the physical thing it’s constructing. The morphogenes are a program that specifies morphogens that do things like climb gradients and symmetry break; it never represents the thing it’s constructing, but the morphogens then following the morphogenes give rise to you.
    • What’s going on in morphogenesis, in part, is compression. A billion bases can specify a trillion cells, but the more interesting thing that’s going on is almost anything you perturb in the genome is either inconsequential or fatal. The morphogenes are a curated search space where rearranging them is interesting—you go from gills to wings to flippers. The heart of success in machine learning, however you represent it, is function representation. The real progress in machine learning is learning representation. 
    • We’re at an interesting point now where it makes as much sense to take seriously that scaling as it did to take Moore’s law scaling in 1965 when he made his first graph. We started doing these FAB labs just as outreach for NSF, and then they went viral, and they let ordinary people go from consumers to producers. It’s leading to very fundamental things about what is work, what is money, what is an economy, what is consumption.
    • Looking at exactly this question of how a code and a gene give rise to form. Turing and von Neumann both completely understood that the interesting place in computation is how computation becomes physical, how it becomes embodied and how you represent it. That’s where they both ended their life. That’s neglected in the canon of computing.
    • If I’m doing morphogenesis with a self-reproducing system, I don’t want to then just paste in some lines of code. The computation is part of the construction of the object. I need to represent the computation in the construction, so it forces you to be able to overlay geometry with construction.
    • Why align computer science and physical science? There are at least five reasons for me. Only lightly is it philosophical. It’s the cracks in the matrix. The matrix is cracking. 1) The fact that whoever has their laptop open is spending about half of its resources shuttling information from memory transistors to processor transistors, even though the memory transistors have the same computational power as the processor transistors, is a bad legacy of the EDVAC. It’s a bit annoying for the computer, but when you get to things like an exascale supercomputer, it breaks. You just can’t maintain the fiction as you push the scaling. In very large-scale computing, the resource cost of maintaining the fiction so programmers can pretend it’s not true is getting so painful you need to redo it. In fact, if you look down in the trenches, things like emerging ways to do very large-scale GPU programming are beginning to inch in that direction. So, it’s breaking in performance.
    •  What’s interesting is a lot of the things that are hard—for example, in parallelization and synchronization—come for free. By representing time and space explicitly, you don’t need to do the annoying things like thread synchronization and all the stuff that goes into parallel programming.
    • Communication degraded with distance. Along came Shannon. We now have the Internet. Computation degraded with time. The last great analog computer work was Vannevar Bush’s differential analyzer. One of the students working on it was Shannon. He was so annoyed that he invented our modern digital notions in his Master’s thesis to get over the experience of working on the differential analyzer.
    • When you merge communication with computation with fabrication, it’s not there’s a duopoly of communication and computation and then over here is manufacturing; they all belong together. The heart of how we work is this trinity of communication plus computation and fabrication, and for me the real point is merging them.
    • I almost took over running research at Intel. It ended up being a bad idea on both sides, but when I was talking to them about it, I was warned off. It was like the godfather: “You can do that other stuff, but don’t you dare mess with the mainline architecture.” We weren’t allowed to even think about that. In defense of them, it’s billions and billions of dollars investment. It was a good multi-decade reign. They just weren’t able to do it. 
    • Again, the embodiment of everything we’re talking about, for me, is the morphogenes—the way evolution searches for design by coding for construction. And they’re the oldest part of the genome. They were invented a very long time ago and nobody has messed with them since.
    • Get over digital and physical are separate; they can be united. Get over analog as separate from digital; there’s a really profound place in between. We’re at the beginning of fifty years of Moore’s law but for the physical world. We didn’t talk much about it, but it has the biggest impact of anything I know if anybody can make anything.

Soft Stuff:

  • paypal/hera (article): Hera multiplexes connections for MySQL and Oracle databases. It supports sharding the databases for horizontal scaling. It is a data access gateway that PayPal uses to scale database access for hundreds of billions of SQL queries per day. Additionally, Hera improves database availability through sophisticated protection mechanisms and provides application resiliency through transparent traffic failover. Hera is now available outside of PayPal as an Apache 2-licensed project.
  • zerotier/lf: a fully decentralized fully replicated key/value store. LF is built on a directed acyclic graph (DAG) data model that makes synchronization easy and allows many different security and conflict resolution strategies to be used. One way to think of LF’s DAG is as a gigantic conflict-free replicated data type (CRDT). Proof of work is used to rate limit writes to the shared data store on public networks and as one thing that can be taken into consideration for conflict resolution. 
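To make the CRDT angle concrete, here’s a hedged sketch of a last-writer-wins register, one of the simplest CRDTs. LF’s actual conflict-resolution strategies are pluggable and more sophisticated; this only shows why CRDT merges make synchronization easy: merge is commutative and idempotent, so replicas can exchange state in any order and still converge.

```python
from dataclasses import dataclass

# Generic LWW register sketch -- not LF's data model, just the core idea.
@dataclass(frozen=True, order=True)
class LWWRegister:
    timestamp: int  # logical clock, e.g. a Lamport timestamp
    node_id: str    # tie-breaker so merges are deterministic
    value: str

def merge(a: LWWRegister, b: LWWRegister) -> LWWRegister:
    return max(a, b)  # highest (timestamp, node_id) wins

r1 = LWWRegister(5, "node-a", "k=1")
r2 = LWWRegister(7, "node-b", "k=2")
assert merge(r1, r2) == merge(r2, r1)  # order doesn't matter
assert merge(r1, r1) == r1             # idempotent: safe to re-sync
```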
  • pahud/fargate-fast-autoscaling: This reference architecture demonstrates how to build an AWS Fargate workload that detects spiky traffic in less than 10 seconds and triggers immediate horizontal autoscaling.
  • ailidani/paxi: Paxi is a framework that implements WPaxos and other Paxos protocol variants. It provides most of the elements any Paxos or replication protocol implementation needs, including network communication, a key-value store state machine, a client API, and multiple types of quorum systems.

Pub Stuff:

from High Scalability

Stuff The Internet Says On Scalability For July 26th, 2019


Wake up! It’s HighScalability time—once again:


The Apollo 11 guidance computer repeatedly crashed on descent. On Earth, computer scientists had just 13 hours to debug the problem. They did: it was CPU overload caused by a wrong setting. Some things never change! 

Do you like this sort of Stuff? I’d greatly appreciate your support on Patreon. I wrote Explain the Cloud Like I’m 10 for people who need to understand the cloud. And who doesn’t these days? On Amazon it has 52 mostly 5 star reviews (120 on Goodreads). They’ll learn a lot and hold you in even greater awe.

Number Stuff:

  • $11 million: Google fine for discriminating against not-young people. 
  • 55,000+: human-labeled 3D annotated frames, a drivable surface map, and an underlying HD spatial semantic map in Lyft’s Level 5 Dataset.
  • 645 million: LinkedIn members with 4.5 trillion daily messages pumping through Kafka.
  • 49%: drop in Facebook’s net income. A fine result.
  • 50 ms: interval for repeatedly randomizing the elements of code that attackers need access to in order to compromise the hardware. 
  • 7.5 terabytes: hacked Russian data.
  • 5%: increase in Tinder shares after bypassing Google’s 30% app store tax.
  • $21.7 billion: Apple’s profit from other people’s apps.
  • 5 billion: records in 102 datasets in the UCR STAR spatio-temporal index.
  • 200x: speediest quantum operation yet. 
  • 45%: US Fortune 500 companies founded by immigrants or their children.
  • 70%: hard drive failures caused by media damage, including full head crashes. 
  • 12: lectures on everything Buckminster Fuller knew.
  • 600,000: satellite images taken in a single day, used to create a picture of Earth.
  • 149: hours between Airbus reboots needed to mask software problems. 

Quotable Stuff:

  • @mdowd: Now that Alan Turing is on the 50 pound note, they should rename ATMs to “Turing Machines”
  • @hacks4pancakes: Another real estate person: “I tried using a password manager, but then my credit card was stolen and my annual payment for it failed – and they cut off access to all my passwords during a meeting.”
  • @juliaferraioli: “Our programs were more complex because Go was so simple” — @_rsc on reshaping #Golang at #gophercon
  • Jason Shen: A YC founder once said to me that he found little correlation between the success of a YC company and how hard their founders worked. That is to say, among a group of smart, ambitious entrepreneurs who were all already working pretty hard, the factors that made the biggest difference were things like timing, strategy, and relationships. Which is why Reddit cofounder-turned-venture capitalist Alexis Ohanian now warns against the “utter bullshit” of this so-called hustle porn mentality.
  • Dale Markowitz: A successful developer today needs to be as good at writing code as she is at extinguishing that code when it explodes.
  • @allspaw: I’m with @snowded on this. Taleb’s creation of ‘antifragile’ is what the Resilience Engineering community has referred to as resilience all along.
  • Timothy Lee: Heath also said that the interviewer assumed that the word “byte” meant eight bits. In his view, this also revealed age bias. Modern computer systems use 8-bit bytes, but older computer systems could have byte sizes ranging from six to 40 bits.
  • General Patton: If everybody is thinking alike, somebody isn’t thinking
  • panpanna: Architecturally, microkernels and unikernels are direct opposites. Unikernels strive to minimize communication complexity (and size and footprint but that is not relevant to this discussion) by putting everything in the same address space. This gives them many advantages among which performance is often mentioned but ease of development is IMHO equally important. However, the two are not mutually exclusive. Unikernels often run on top of microkernels or hypervisors.
  • Wayne Ma: The team ultimately put a stop to most [Apple] device leaks—and discovered some audacious attempts, such as some factory workers who tried to build a tunnel to transport components to the outside without security spotting them.
  • Dr. Steven Gundry: I think, uploaded most of our information processing to our bacterial cloud that lives in us, on us, around us, because they have far more genes than we do. They reproduce virtually instantaneously, and so they can do fantastic information processing. Many of us think that perhaps lifeforms on Earth, particularly animal lifeforms, exist as a home for bacteria to prosper on Earth. 
  • @UdiDahan: I’ve been gradually coming around to the belief that any “good” code base lives long enough for the environment around it to change in such away that its architecture is no longer suitable, making it then “bad”. This would probably be as equally true of FP as of OOP.
  • @DmitryOpines: No one “runs” a crypto firm Holger, we are merely the mortal agents through whose minor works the dream of disaggregated ledger currency manifests on this most unworthy of Prime Material Planes.
  • Philip Ball: One of the most remarkable ideas in this theoretical framework is that the definite properties of objects that we associate with classical physics — position and speed, say — are selected from a menu of quantum possibilities in a process loosely analogous to natural selection in evolution: The properties that survive are in some sense the “fittest.” As in natural selection, the survivors are those that make the most copies of themselves. This means that many independent observers can make measurements of a quantum system and agree on the outcome — a hallmark of classical behavior.
  • @mikko: Rarely is anyone thanked for the work they did to prevent the disaster that didn’t happen.
  • David Rosenthal: Back in 1992 Robert Putnam et al published Making democracy work: civic traditions in modern Italy, contrasting the social structures of Northern and Southern Italy. For historical reasons, the North has a high-trust structure whereas the South has a low-trust structure. The low-trust environment in the South had led to the rise of the Mafia and persistent poor economic performance. Subsequent effects include the rise of Silvio Berlusconi. Now, in The Internet Has Made Dupes-And Cynics-Of Us All, Zeynep Tufekci applies the same analysis to the Web
  • Diego Basch: So here is an obvious corollary. Like I just mentioned, if you have an idea for an interesting gadget you will move to a place like Shenzhen or Hong Kong or Taipei. You will build a prototype, prove your concept, work with a manufacturer to iterate your design until it’s mature enough to mass-produce. Either you will bootstrap the business or you will partner with someone local to fund it, because VCs won’t give you the time of the day. Now, let’s say hardware is not your cup of tea and you want to build services. Why be in Silicon Valley at all?
  • Buckminster Fuller~ You derive data by segregating; You derive principles by integrating; Without data, you cannot calculate; Without calculations, you cannot generalize; Without generalizations, you cannot design; Without designs, you cannot discover; Without discoveries, you cannot derive new data…Segregation and integration are not opposed: they are complementary and interdependent. Striving to be a specialist OR a generalist is counterproductive; the aim is to be COMPREHENSIVE!
  • @jessfraz: I see a lot of debates about “open core”. To me the premise behind it is “we will open this part of our software but you gotta take care of supporting it yourself.” Then they charge for support. Except the problem was some other people *cough* clouds *cough* beat them to it.
  • @greglinden: Tech companies consistently get this wrong, thinking this is a simple black-and-white ML classification problem, spam or not spam, false or not false. Disinformation exploits that by being just ambiguous enough to not get classified as false. It’s harder than that.
  • Brent Ozar: The ultimate, #1, primary, existential, responsibility of a DBA – for which all other responsibilities pale in comparison – is to implement database backup and restore processing adequate to support the business’s acceptable level of data loss.
  • Alex Hern: A dataset with 15 demographic attributes, for instance, “would render 99.98% of people in Massachusetts unique”. And for smaller populations, it gets easier: if town-level location data is included, for instance, “it would not take much to reidentify people living in Harwich Port, Massachusetts, a city of fewer than 2,000 inhabitants”.
  • Memory Guy: Our forecasts find that 3D XPoint Memory’s sub-DRAM prices will drive that technology’s revenues to over $16 billion by 2029, while stand-alone MRAM and STT-RAM revenues will approach $4 billion — over one hundred seventy times MRAM’s 2018 revenues.  Meanwhile, ReRAM and MRAM will compete to replace the bulk of embedded NOR and SRAM in SoCs, to drive even greater revenue growth. This transition will boost capital spending, increasing the spend for MRAM alone by thirty times to $854 million in 2029.
  • @unclebobmartin: John told me he considered FP a failure because, to paraphrase him, FP made it simple to do hard things but almost impossible to do simple things.
  • Dr. Neil J. Gunther: All performance is nonlinear.
  • @mathiasverraes: I wish we’d stop debating OOP vs FP, and started debating individual paradigms. Immutability, encapsulation, global state, single assignment, actor model, pure functions, IO in the type system, inheritance, composition… all of these are perfectly possible in either OOP or FP.
  • Troy Hunt: “1- All those servers were compromised. They were either running standalone VPSs or cpanel installations. 2- Most of them were running WordPress or Drupal (I think only 2 were not running any of the two). 3- They all had a malicious cron.php running”
  • Gartner: AWS makes frequent proclamations about the number of price reductions it has made. Customers interpret these proclamations as being applicable to the company’s services broadly, but this is not the case. For instance, the default and most frequently provisioned storage for AWS’s compute service has not experienced a price reduction since 2014, despite falling prices in the market for the raw components.
  • mcguire: Speaking as someone who has done a fair number of rewrites as well as watching rewrites fail, conventional wisdom is somewhat wrong. 1. Do a rewrite. Don’t try to add features, just replace the existing functionality. Avoid a moving target. 2. Rewrite the same project. Don’t redesign the database schema at the same time you are rewriting. Try to keep the friction down to a manageable level. 3. Incremental rewrites are best. Pick part of the project, rewrite and release that, then get feedback while you work on rewriting the next chunk.
  • Atlassian: Isolating context/state management association to a single point is very helpful. This was reinforced at Re:Invent 2018 where a remarkably high amount of sessions had a slide of “then we have service X which manages tenant → state routing and mapping”.
  • taxicabjesus: I have a ~77 year old friend who was recently telling me about going to Buckminster Fuller’s lectures at his engineering university, circa 1968. He quoted Mr. Fuller as saying something like, “entropy takes things apart, life puts them back together.”
  • Daniel Abadi: PA/EC systems sound good in theory, but are not particularly useful in practice. Our one example of a real PA/EC system — Hazelcast — has spent the past 1.5 years introducing features that are designed for alternative PACELC configurations — specifically PC/EC and PA/EL configurations. PC/EC and PA/EL configurations are a more natural cognitive fit for an application developer. Either the developer can be certain that the underlying system guarantees consistency in all cases (the PC/EC configuration) in which case the application code can be significantly simplified, or the system makes no guarantees about consistency at all (the PA/EL configuration) but promises high availability and low latency. CRDTs and globally unique IDs can still provide limited correctness guarantees despite the lack of consistency guarantees in PA/EL configurations.
  • Simone de Beauvoir: Then why “swindled”? When one has an existentialist view of the world, like mine, the paradox of human life is precisely that one tries to be and, in the long run, merely exists. It’s because of this discrepancy that when you’ve laid your stake on being—and, in a way you always do when you make plans, even if you actually know that you can’t succeed in being—when you turn around and look back on your life, you see that you’ve simply existed. In other words, life isn’t behind you like a solid thing, like the life of a god (as it is conceived, that is, as something impossible). Your life is simply a human life.
  • SkyPuncher: I think Netflix is the perfect example of where being data driven completely fails. If you listen to podcasts with important Netflix people everything you hear is about how they experiment and use data to decide what to do. Every decision is based on some data point.  At the end of the day, they just continue to add features that create short term payoffs and long term failures. Pennywise and pound foolish.
  • Frank Wilczek: I don’t think a singularity is imminent, although there has been quite a bit of talk about it. I don’t think the prospect of artificial intelligence outstripping human intelligence is imminent because the engineering substrate just isn’t there, and I don’t see the immediate prospects of getting there. I haven’t said much about quantum computing, other people will, but if you’re waiting for quantum computing to create a singularity, you’re misguided. That crossover, fortunately, will take decades, if not centuries.

Useful Stuff:

  • What an incredible series. Apollo 11: 13 Minutes to the Moon, Ep.05 The Fourth Astronaut tells the story of how the Apollo computer system was made. The Apollo guidance computer weighed 30 kilos, was as big as a couple of shoe boxes, was built by a team at MIT, and was the world’s first digital, portable, general-purpose computer. It was the first software system where people’s lives depended on it. It was the first fly-by-wire system. Contract 1, the first contract of the Apollo program, was for the navigation computer. The MIT group used inertial navigation, first pioneered in Polaris missiles. The idea is that if you know where you started, your direction, and your acceleration, then you always know where you are and where you are going. Until this time flight craft were controlled by manually pushing levers and flipping switches. Apollo couldn’t afford the weight of these pulley-based systems. They chose, for the first time ever (1964), to make a computer to control the flight of the spacecraft. They had to figure out everything. How would the computer communicate with all the different subsystems? Command a valve to open? Turn on an engine? Turn off an engine? Redirect an engine? Apollo is the moment when people stopped bragging about how big their computers were and started bragging about how small they were. Digital computers were the size of buildings at the time, and nobody trusted them because they would only work a few hours or days at a time. They needed a computer to work for a couple of weeks. They risked everything on a brand new technology called integrated circuits that existed only in labs. They made the very first computer with ICs, and they got permission to do it. A huge risk betting everything, but there was no alternative. There was no other way to build a computer with enough compute power. The use of ICs to build digital computers is one of the lasting legacies of Apollo. 
Apollo bought 60% of the total chip output at the time, a huge boost to a fledgling computer industry. But the hardware needed software, and software was not even in the original 10-page contract. In 1967 they were afraid they wouldn’t meet the end-of-the-decade deadline because software is so complicated to build. And nothing has changed since. Margaret Hamilton joined the project in 1964. There were no rules for software at the time; there was no field of software development. You could get hired just for knowing a computer language. So again, not much has changed. Nobody knew what software was. You couldn’t describe what you did to family. Very unregimented, very free environment. Don Eyles wrote the landing software on the AGC (Apollo Guidance Computer). The AGC was one square foot in size, weighed 70 pounds, drew 55 watts, and had 76KB of memory in the form of 16-bit words; only 4K was RAM, the rest was hard-wired memory. Once written, a program was printed out on paper and then converted by keypunch operators to punch cards that could be read directly into mainframe computers, which translated the programs for the AGC. Over 100 people worked on it at the end. All the cards had to be integrated together and submitted in one big run that executed overnight. Then the simulation would be run the next day to see if the code was OK. This was your debug cycle. The keypunch operators had to go around at night and beat up on the prima donna programmers, who always wanted more time or to do something over, to submit their jobs. Again, not much has changed. The keypunch operators would go back to the programmers when they noticed syntax errors. If the code wasn’t right the program wouldn’t go. It used core rope memory: software was woven into the cores. If a wire went through one of the donut-shaped cores of magnetic material, that was a 1; if it went around a core, that was a 0. 
Software was hardware, literally sewn into the structure of the computer by textile workers, by hand. Rope memory was proven tech at the time. It was bulletproof. There was no equivalent bulletproof approach to software, which is why Hamilton invented software engineering. There were no tools for finding errors at the time, so they tried to find a way to build software so errors would not happen. Wrong time, wrong priority, wrong data, and interface errors were the big sources of errors. Nobody knew how to control a system with software. They came up with a verb-and-noun system that looked like a big calculator with buttons. The buttons had to be big and clear so they could be punched with gloves and seen through a visor. Verb: what do you want to do? Noun: what do you want to do it to? It used a simple little keyboard. There were three digital readouts, no text, just three sets of numbers. To initiate the lunar landing program you would press noun, 63, enter. To start the program in 15 seconds you would enter verb, 15, enter. A clock would count down, and at zero it would start program 63, which initiated a large braking burn to slow you down so you would start dropping toward the surface of the moon. The astronauts didn’t fly, they controlled programs. They landed 200 meters from where they intended. Flying manually would have taken a lot more fuel. The computer was always on and in operation. It was a balance of control, a partnership. The intention at first was to create a fully automated system with two buttons: go to the moon; go home. They ended up with 500 buttons. Again, things don’t change.
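The verb/noun scheme is easy to picture in code. A toy sketch: only P63, the braking program, comes from the episode; verb 37 ("run program") and the dispatch logic here are illustrative, not the real AGC command set.

```python
# Verb: what do you want to do? Noun: what do you want to do it to?
PROGRAMS = {63: "braking burn"}  # P63 from the episode; others omitted

def execute(verb: int, noun: int) -> str:
    # Verb 37 is sketched here as "run the program named by the noun".
    if verb == 37 and noun in PROGRAMS:
        return f"P{noun}: {PROGRAMS[noun]}"
    return "OPR ERR"  # bad input: signal an operator error
```

The astronauts weren’t flying so much as composing tiny two-word sentences for the computer to act on.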
  • Why Some Platforms Thrive and Others Don’t: Some digital networks are fragmented into local clusters of users. In Uber’s network, riders and drivers interact with network members outside their home cities only occasionally. But other digital networks are global; on Airbnb, visitors regularly connect with hosts around the world. Platforms on global networks are much less vulnerable to challenges, because it’s difficult for new rivals to enter a market on a global scale…As for Didi and Uber, our analysis doesn’t hold out much hope. Their networks consist of many highly local clusters. They both face rampant multi-homing, which may worsen as more rivals enter the markets. Network-bridging opportunities—their best hope—so far have had only limited success. They’ve been able to establish bridges just with other highly competitive businesses, like food delivery and snack vending. (In 2018 Uber struck a deal to place Cargo’s snack vending machines in its vehicles, for instance.) And the inevitable rise of self-driving taxis will probably make it challenging for Didi and Uber to sustain their market capitalization. Network properties are trumping platform scale.
  • James Hamilton: Where Aurora took a different approach from that of common commercial and open source database management systems is in implementing log-only storage. Looking at contemporary database transaction systems, just about every system only does synchronous writes with an active transaction waiting when committing log records. The new or updated database pages might not be written for tens of seconds or even minutes after the transaction has committed. This has the wonderful characteristic that the only writes that block a transaction are sequential rather than random. This is generally a useful characteristic and is particularly important when logging to spinning media but it also supports an important optimization when operating under high load. If the log is completing an I/O while a new transaction is being committed, then the commit is deferred until the previous log I/O has completed and the next log I/O might carry out tens of completed transactions that had been waiting during the previous I/O. The busier the log gets, the more transactions that get committed in a single write. When the system is lightly loaded each log I/O commits a single transaction as quickly as possible. When the system is under heavy load, each commit takes out tens of transaction changes at a slight delay but at much higher I/O efficiency. Aurora takes a bit more radical approach where it simply only writes log records out and never writes out data pages synchronously or otherwise. Even more interesting, the log is remote and stored with 6-way redundancy using a 4/6 write quorum and a 3/6 read quorum. Further improving the durability of the transaction log, the log writes are done across 3 different Availability Zones (each are different data centers).  In this approach Aurora can continue to read without problem if an entire data center goes down and, at the same time, another storage server fails. 
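That group-commit behavior is easy to simulate. A minimal sketch, not Aurora’s actual code, showing how commits per log write grow with load: while one log I/O is in flight, arriving commits queue up, and the next write flushes them all.

```python
from collections import deque

def simulate(commit_arrivals, io_time):
    """commit_arrivals: sorted arrival times of commits.
    io_time: duration of one sequential log I/O.
    Returns the batch size of each log write."""
    batches, waiting = [], deque(commit_arrivals)
    t = 0.0
    while waiting:
        batch = []
        # everything that arrived while the log was busy shares one write
        while waiting and waiting[0] <= t:
            batch.append(waiting.popleft())
        if not batch:
            t = waiting[0]  # log idle: wait for the next commit
            continue
        batches.append(len(batch))
        t += io_time        # log busy; new commits pile up meanwhile
    return batches

# lightly loaded: each commit gets its own write, minimum latency
assert simulate([0, 10, 20], io_time=1) == [1, 1, 1]
# heavily loaded: commits arriving during an I/O share the next write
assert simulate([0, 0.2, 0.4, 0.6], io_time=1) == [1, 3]
```

The busier the log, the more transactions per I/O, which is exactly the self-balancing property Hamilton describes.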
  • Videos from DSConf 2019 are now available
  • Given Microsoft’s purchase of LinkedIn three years ago, it should be no big surprise that LinkedIn is moving its cloud to Azure. Have to wonder if there will be an Azure tax? Moving off your own datacenters will certainly chew up a lot of cycles that could have gone into product.
  • Best name ever: A grimoire of functions
  • ML helping programmers is becoming a thing. A GPT-2 model trained on ~2 million files from GitHub. Autocompletion with deep learning: TabNine is an autocompleter that helps you write code faster. We’re adding a deep learning model which significantly improves suggestion quality. 
  • It’s hard to get a grasp on how EventBridge will change architectures. This article on using it as a new kind of webhook is at least concrete: Amazon EventBridge: The biggest thing since AWS Lambda itself. Though with webhooks I just enter a URL in a form field and start receiving events. This works for PayPal, Slack, chatbots, etc. What’s the EventBridge equivalent? How to hook things up isn’t clear at all. Also, Why Amazon EventBridge will change the way you build serverless applications
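One concrete piece is how EventBridge decides where events go: rules match JSON patterns against event fields, where pattern values are arrays of acceptable values. A toy matcher for that simplest case (the real service supports much richer patterns like prefixes and numeric ranges):

```python
def matches(pattern: dict, event: dict) -> bool:
    """True if every field in the pattern is present in the event and
    the event's value is one of the pattern's allowed values."""
    for field, allowed in pattern.items():
        if isinstance(allowed, dict):           # nested pattern, e.g. "detail"
            if not isinstance(event.get(field), dict):
                return False
            if not matches(allowed, event[field]):
                return False
        elif event.get(field) not in allowed:   # leaf: value must be listed
            return False
    return True

rule = {"source": ["aws.ec2"], "detail": {"state": ["terminated"]}}
evt = {"source": "aws.ec2", "detail": {"state": "terminated", "id": "i-1"}}
assert matches(rule, evt)
assert not matches(rule, {"source": "aws.s3"})
```

So instead of entering a URL in a form, you declare a pattern and a target; the bus does the fan-out.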
  • Tired of pumping all your data into a lake? Mesh it. The eternal cycle of centralizing, distributing, and then centralizing continues. How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh: “In order to decentralize the monolithic data platform, we need to reverse how we think about data, its locality and ownership. Instead of flowing the data from domains into a centrally owned data lake or platform, domains need to host and serve their domain datasets in an easily consumable way.”  There’s also an interview at Straining Your Data Lake Through A Data Mesh – Episode 90
  • The problem of false knowledge. Exponential Wisdom Episode 74: The Future of Construction. In this podcast there’s a segment that extols the wonders of visiting the Sistine Chapel through VR instead of visiting in-person. Is anyone worried about the problem of false knowledge? If I show you a picture of a chocolate bar do you know what chocolate tastes like? Constricting all experience through only our visual senses is a form of false knowledge. The Sistine Chapel evoked in me a visceral feeling of awe tinged with sadness. Would I feel that through VR? I don’t think so. I walked the streets. Tasted the food. Met the people. Saw the city. All experiences that can’t be shoved through our eyes.
  • Good to see Steve Ballmer hasn’t changed. Players players players. Developers developers developers.
  • Never thought of this downside of open source before. SECURITY NOW 724: HIDE YOUR RDP NOW!. Kazakhstan is telling citizens to install a root cert into their browsers so it can perform man-in-the-middle attacks. An interesting question is how browser makers should respond. More interesting: what if Kazakhstan responds by making its own browser based on open source, compromising it, and requiring its use? Black Mirror should get on this. Software around us appears real, but has actually been replaced by pod-progs. Also, Open Source Could Be a Casualty of the Trade War
  • Darkweb Vendors and the Basic Opsec Mistakes They Keep Making. Don’t use email addresses that link to other accounts. Don’t use the same IDs across accounts. Don’t ship from the same area. Don’t do stuff yourself so you can be photographed. Don’t model your product using your own hands. Don’t cause anyone to die. Don’t sell your accounts to others. Don’t believe someone when they offer to launder your money. 
  • Though it’s still Electron. When a rewrite isn’t: rebuilding Slack on the desktop: The first order of business was to create the modern codebase: All UI components had to be built with React; All data access had to assume a lazily loaded and incomplete data model; All code had to be “multi-workspace aware”. The key to our approach ended up being Redux. The key to its success is the incremental release strategy that we adopted early on in the project: as code was modernized and features were rebuilt, we released them to our customers.
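For readers who haven’t met Redux, the core pattern is tiny: one state tree, updated only by dispatching actions through a pure reducer, with subscribers notified after each change. A minimal Python sketch of the pattern (action names invented, nothing Slack-specific):

```python
def reducer(state, action):
    """Pure function: (old state, action) -> new state, never mutates."""
    if action["type"] == "workspace_added":
        return {**state, "workspaces": state["workspaces"] + [action["id"]]}
    return state

class Store:
    def __init__(self, reducer, initial_state):
        self.reducer, self.state, self.listeners = reducer, initial_state, []

    def dispatch(self, action):
        self.state = self.reducer(self.state, action)
        for listen in self.listeners:
            listen(self.state)   # e.g. trigger React re-renders

    def subscribe(self, listener):
        self.listeners.append(listener)

store = Store(reducer, {"workspaces": []})
notified = []
store.subscribe(lambda s: notified.append(len(s["workspaces"])))
store.dispatch({"type": "workspace_added", "id": "acme"})
assert store.state["workspaces"] == ["acme"]
assert notified == [1]
```

The “multi-workspace aware” requirement maps naturally onto this: one store, with workspace data as just another slice of state.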
  • Re-Architecting the Video Gatekeeper: We [Netflix] decided to employ a total high-density near cache (i.e., Hollow) to eliminate our I/O bottlenecks. For each of our upstream systems, we would create a Hollow dataset which encompasses all of the data necessary for Gatekeeper to perform its evaluation. Each upstream system would now be responsible for keeping its cache updated. With this model, liveness evaluation is conceptually separated from the data retrieval from upstream systems. Instead of reacting to events, Gatekeeper would continuously process liveness for all assets in all videos across all countries in a repeating cycle. The cycle iterates over every video available at Netflix, calculating liveness details for each of them. At the end of each cycle, it produces a complete output (also a Hollow dataset) representing the liveness status details of all videos in all countries.
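The cycle model is simple to sketch. A toy version (the dataset shape and the liveness rule are invented; the real evaluation joins many upstream Hollow datasets), showing the key property: every pass produces one complete snapshot rather than reacting to individual events.

```python
# Toy upstream data: video -> country -> whether assets are present.
videos = {"v1": {"US": True, "FR": False}, "v2": {"US": False}}

def liveness_cycle(upstream):
    """Sweep every (video, country) pair and emit a complete snapshot."""
    snapshot = {}
    for video, by_country in upstream.items():
        for country, has_assets in by_country.items():
            # stand-in rule: "live" just means assets are present
            snapshot[(video, country)] = has_assets
    return snapshot  # itself published as a complete (Hollow-like) dataset

snap = liveness_cycle(videos)
assert snap[("v1", "US")] is True
assert snap[("v2", "US")] is False
```

Because the output is always complete, consumers never see partial state, and a bad cycle can simply be discarded and rerun.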
  • Should you hire someone who has already done the job you need to do? Not necessarily. Business Lessons from How Marvel Makes Movies: Marvel does something that is very counterintuitive. Instead of hiring people that are going to be really good at directing blockbusters, they look for people that have done a really good job with medium-sized budgets, but developing very strong storylines and characters. So, generally speaking, what they do is they looked to other genres like Shakespeare or horror. You can have spy films, comedy films, buddy cop films and what they do is they say, if I brought this director into the Marvel universe, what could they do with our characters? How could they shake up our stories and kind of reinvigorate them and provide new energy and new life?
  • What is a senior engineer? A historian. EliRivers: I work on some software of which the oldest parts of the source code date back to about 2009. Over the years, some very smart (some of them dangerously smart and woefully inexperienced, and clearly – not at all their fault – not properly mentored or supervised) people worked on it and left. What we have now is frequently a mystery. Simple changes are difficult, difficult changes verge on the impossible. Every new feature requires reverse-engineering of the existing code. Sometimes literally 95% of the time is spent reverse-engineering the existing code (no exaggeration – we measured it); changes can take literally 20 times as long as they should while we work out what the existing code does (and also, often not quite the same, what it’s meant to do, which is sometimes simply impossible to ever know). Pieces are gradually being documented as we work out what they do, but layers of cruft from years gone by from people with deadlines to meet and no chance of understanding the existing code sit like landmines and sometimes like unbreakable bonds that can never be undone. In our estimates, every time we have to rely on existing functionality that should be rock solid reliable and completely understood yet that we have not yet had to fully reverse-engineer, we mark it “high risk, add a month”. The time I found that someone had rewritten several pieces of the Qt libraries (without documenting what, or why) was devastating; it took away one of the cornerstones I’d been relying on, the one marked “at least I know I can trust the Qt libraries”. 
It doesn’t matter how smart we are, how skilled a coder we are, how genius our algorithms are; if we write something that can’t be understood by the next person to read it, and isn’t properly documented somewhere in some way that our fifth replacement can find easily five years later – if we write something of which even the purpose, let alone the implementation, will take someone weeks to reverse engineer – we’re writing legacy code on day one and, while we may be skilled programmers, we’re truly Godawful software engineers.
  • You always learn something new when you listen to Martin Thompson. Protocols and Sympathy With Martin Thompson. He goes into the many implications of the Universal Scalability Law, which covers what can be split up and shared while accounting for coherence costs, the time it takes parties working together to reach agreement. The mathematics for systems and the mathematics for people are very similar because it’s all just a system. Doubling the size of a system doesn’t mean doubling the amount of work done. You have to ask if the workload is decomposable. The workload needs to decompose and be done in parallel, but not concurrently. Parallelism is doing multiple things at the same time. Concurrency is dealing with multiple things at the same time. Concurrency requires coordination. Adding slack to a system reduces response time because it reduces utilization. If we constantly break teams up and re-form them, we end up spending more time on achieving coherence. If your team has become more efficient and reaches agreement faster, then you can do more things at the same time with less overhead. You get more throughput by maximizing parallelism and minimizing coherency. Slow down and think more. Also, Understanding the LMAX Disruptor
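The Universal Scalability Law he’s referring to has a compact closed form. A quick sketch showing both effects he names (the alpha and beta values below are made up for illustration):

```python
def usl(n, alpha, beta):
    """Gunther's Universal Scalability Law: relative capacity of n workers.
    alpha = contention (serialization) penalty, beta = coherence penalty,
    the cost of parties reaching agreement."""
    return n / (1 + alpha * (n - 1) + beta * n * (n - 1))

a, b = 0.05, 0.01  # illustrative penalties, not measured values
# with any contention or coherence cost, doubling never doubles throughput
assert usl(2, a, b) < 2 * usl(1, a, b)
# past some size, adding members makes throughput go *down* (retrograde)
assert usl(64, a, b) < usl(32, a, b)
```

The beta term is quadratic in n, which is why constantly re-forming teams (raising the cost of agreement) hurts so much more than it seems it should.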
  • Excellent explanation. Distributed Locks are Dead; Long Live Distributed Locks! and Testing the CP Subsystem with Jepsen
  • Atlassian on Our not-so-magic journey scaling low latency, multi-region services on AWS. Do you have something like this: “a context service which needed to be called multiple times per user request, with incredibly low latency, and be globally distributed. Essentially, it would need to service tens of thousands of requests per second and be highly resilient.” They were stuck with a previous sharding solution, so they couldn’t make a complete break as they moved to AWS. The first cut was CQRS with DynamoDB, which worked well until higher loads hit and DynamoDB had latency problems. They used SNS to invalidate node-level caches. They replaced ELBs with ALBs, which increased reliability, but the p99 latency went from 10ms to 20ms. They went with Caffeine instead of Guava for their cache. They added a sidecar as a local proxy for a service. A sidecar is essentially just another containerised application that runs alongside the main application on the EC2 node. The benefit of using sidecars (as opposed to libraries) is that they are technology agnostic. Latencies fell drastically. 
  • Nike on Moving Faster With AWS by Creating an Event Stream Database: we turned to the Kinesis Data Firehose service…a service called Athena that gives us the ability to perform SQL queries over partitioned data…how does our solution compare to more traditional architectures using RDS or Dynamo? Being able to ingest data and scale automatically via Firehose means our team doesn’t need to write or maintain pre-scaling code…Data storage costs on S3 ($0.023 per GB-month) are lower when compared to DynamoDB ($0.25 per GB-month) and Aurora ($0.10 per GB-month)…In a sample test, Athena delivered 5 million records in seconds, which we found difficult to achieve with DynamoDB…One limitation is that Firehose batches out data in windows of either data size or a time limit. This introduces a delay between when the data is ingested and when the data is discoverable by Athena…Queries to Athena are charged by the amount of data scanned, and if we scan the entire event stream frequently, we could rack up serious costs in our AWS bill.
  • It’s not easy to create a broadcast feed. Here’s how Hotstar did it. Building Pubsub for 50M concurrent socket connections. They went through a lot of different options. They ended up using EMQX, client side load balancing, and multiple clusters with bridges connecting them and a reverse bridge. Each subscriber node could support 250k clients. With 200 subscriber nodes, the system can support 50M connections and more. Also, Ingesting data at “Bharat” Scale
  • Making Containers More Isolated: An Overview of Sandboxed Container Technologies: We have looked at several solutions that tackle the current container technology’s weak isolation issue. IBM Nabla is a unikernel-based solution that packages applications into a specialized VM. Google gVisor is a merge of a specialized hypervisor and guest OS kernel that provides a secure interface between the applications and their host. Amazon Firecracker is a specialized hypervisor that provisions each guest OS a minimal set of hardware and kernel resources. OpenStack Kata is a highly optimized VM with built-in container engine that can run on hypervisors. It is difficult to say which one works best as they all have different pros and cons. 
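Thompson's Universal Scalability Law discussion above is easy to make concrete with a little arithmetic. Here's a minimal Python sketch (the coefficients are illustrative, not taken from the talk) showing why doubling the size of a system doesn't double the work done:

```python
def usl_throughput(n, base_rate, alpha, beta):
    """Universal Scalability Law: throughput of n workers.

    alpha models contention (queueing on shared resources); beta models
    coherence cost (the time parties spend reaching agreement).
    """
    return base_rate * n / (1 + alpha * (n - 1) + beta * n * (n - 1))

# With illustrative coefficients, doubling workers falls well short of
# doubling throughput, and at large n coherence costs dominate.
one = usl_throughput(1, 100, alpha=0.05, beta=0.001)
two = usl_throughput(2, 100, alpha=0.05, beta=0.001)
big = usl_throughput(64, 100, alpha=0.05, beta=0.001)
```

This is also why shrinking the coherence term (beta), say by keeping stable teams that reach agreement quickly, buys more throughput than simply adding more nodes or people.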
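The node-level caches with SNS invalidation that Atlassian describes above follow a common pattern: serve hot keys locally with a TTL, and evict on an invalidation message. A minimal Python sketch of the idea (the class and its API are illustrative, not Atlassian's actual Caffeine-based code):

```python
import time

class NodeLocalCache:
    """Sketch of a node-level cache with TTL and explicit invalidation.

    In a setup like Atlassian's, invalidate() would be wired to an
    SNS-style fan-out message; here it is just a method call.
    """

    def __init__(self, ttl_seconds=60):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (value, expiry timestamp)

    def get(self, key, loader):
        entry = self._store.get(key)
        now = time.monotonic()
        if entry is not None and entry[1] > now:
            return entry[0]               # fresh hit: no network call
        value = loader(key)               # miss: fetch from the backing service
        self._store[key] = (value, now + self.ttl)
        return value

    def invalidate(self, key):
        # Called when an invalidation message arrives for this key.
        self._store.pop(key, None)
```

The TTL bounds staleness even if an invalidation message is lost, which is the usual reason to combine both mechanisms rather than rely on invalidation alone.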
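The storage prices Nike quotes above make for an easy back-of-the-envelope comparison (rates as stated in the post, not current AWS pricing; the helper function is illustrative):

```python
# Per GB-month storage rates as quoted in the Nike post.
PRICES = {"S3": 0.023, "Aurora": 0.10, "DynamoDB": 0.25}

def monthly_storage_cost(gb, service):
    """Monthly storage cost in dollars for gb gigabytes on a service."""
    return gb * PRICES[service]

# Storing 1 TB of event data for a month:
s3 = monthly_storage_cost(1024, "S3")            # ~$23.55
aurora = monthly_storage_cost(1024, "Aurora")    # ~$102.40
dynamo = monthly_storage_cost(1024, "DynamoDB")  # ~$256.00
```

At these rates S3 is roughly 10x cheaper than DynamoDB for raw storage, though keep the post's caveat in mind: Athena charges by data scanned, so frequent full scans can erase the savings.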
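Hotstar's capacity math above checks out with simple ceiling division (the helper function is illustrative, not from their post):

```python
def nodes_needed(total_connections, per_node):
    """Ceiling division: subscriber nodes needed for a connection count."""
    return -(-total_connections // per_node)

# 50M concurrent connections at ~250k clients per subscriber node:
nodes = nodes_needed(50_000_000, 250_000)  # 200 nodes
```

In practice you would provision headroom beyond the exact quotient so that losing a node, or a traffic spike, doesn't push the remaining nodes past their connection limit.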

Soft Stuff

  • Nodes: In Nodes you write programs by connecting “blocks” of code. Each node – as we refer to them – is a self contained piece of functionality like loading a file, rendering a 3D geometry or tracking the position of the mouse. The source code can be as big or as tiny as you like. We’ve seen some of ours ranging from 5 lines of code to the thousands. Conceptual/functional separation is usually more important.
  • Picat: Picat is a simple, and yet powerful, logic-based multi-paradigm programming language aimed for general-purpose applications. Picat is a rule-based language, in which predicates, functions, and actors are defined with pattern-matching rules. Picat incorporates many declarative language features for better productivity of software development, including explicit non-determinism, explicit unification, functions, list comprehensions, constraints, and tabling. Picat also provides imperative language constructs, such as assignments and loops, for programming everyday things. 
  • When the aliens find the dead husk of our civilization the irony is what will remain of history are clay cuneiform tablets. Something comforting knowing what’s oldest will last longest. Cracking Ancient Codes: Cuneiform Writing.
  • donnaware/AGC: FPGA Based Apollo Guidance Computer. 

Pub Stuff

  • Unikernels: The Next Stage of Linux’s Dominance (overview): In this paper, we posit that an upstreamable unikernel target is achievable from the Linux kernel, and, through an early Linux unikernel prototype, demonstrate that some simple changes can bring dramatic performance advantages. rwmj: The entire point of this paper is not to start over from scratch, but to reuse existing software (Linux and memcached in this case), and fiddle with the linker command line and a little bit of glue to link them into a single binary. If you want to start over from scratch using a safe language then see MirageOS.
  • Linux System Programming. rofo1: The book is solid. I mentally place it up there with “Advanced programming in the UNIX Environment” by Richard Stevens. 
  • Checking-in on network functions: we need better approaches to verify and interact with network functions and packet processing program properties. Here, we provide a hybrid approach and implementation for gradually checking and validating arbitrary logic and side effects by combining design by contract, static assertions and type-checking, and code generation via macros, all without penalizing programmers at development time.
  • Are We Really Making Much Progress? A Worrying Analysis of Recent Neural Recommendation Approaches: In this work, we report the results of a systematic analysis of algorithmic proposals for top-n recommendation tasks. Specifically, we considered 18 algorithms that were presented at top-level research conferences in the last years. Only 7 of them could be reproduced with reasonable effort. For these methods, it however turned out that 6 of them can often be outperformed with comparably simple heuristic methods, e.g., based on nearest-neighbor or graph-based techniques. The remaining one clearly outperformed the baselines but did not consistently outperform a well-tuned non-neural linear ranking method. 
  • DistME: A Fast and Elastic Distributed Matrix Computation Engine using GPUs: We implement a fast and elastic matrix computation engine called DistME by integrating CuboidMM with GPU acceleration on top of Apache Spark. Through extensive experiments, we have demonstrated that CuboidMM and DistME significantly outperform the state-of-the-art methods and systems, respectively, in terms of both performance and data size.
  • PARTISAN: Scaling the Distributed Actor Runtime (github, video, twitter): We present the design of an alternative runtime system for improved scalability and reduced latency in actor applications called PARTISAN. PARTISAN provides higher scalability by allowing the application developer to specify the network overlay used at runtime without changing application semantics, thereby specializing the network communication patterns to the application. PARTISAN reduces message latency through a combination of three predominately automatic optimizations: parallelism, named channels, and affinitized scheduling. We implement a prototype of PARTISAN in Erlang and demonstrate that PARTISAN achieves up to an order of magnitude increase in the number of nodes the system can scale to through runtime overlay selection, up to a 38.07x increase in throughput, and up to a 13.5x reduction in latency over Distributed Erlang.
  • BPF Performance Tools (book): This is the official site for the book BPF Performance Tools: Linux System and Application Observability, published by Addison Wesley (2019). This book can help you get the most out of your systems and applications, helping you improve performance, reduce costs, and solve software issues. Here I’ll describe the book, link to related content, and list errata.

from High Scalability

Sponsored Post: Educative, PA File Sight, Etleap, PerfOps, InMemory.Net, Triplebyte, Stream, Scalyr

Who’s Hiring? 

  • Triplebyte lets exceptional software engineers skip screening steps at hundreds of top tech companies like Apple, Dropbox, Mixpanel, and Instacart. Make your job search O(1), not O(n). Apply here.
  • Need excellent people? Advertise your job here! 

Cool Products and Services

  • Grokking the System Design Interview is a popular course on Educative.io (taken by 20,000+ people) that’s widely considered the best System Design interview resource on the Internet. It goes deep into real-world examples, offering detailed explanations and useful pointers on how to improve your approach. There’s also a no questions asked 30-day return policy. Try a free preview today.
  • PA File Sight – Actively protect servers from ransomware, audit file access to see who is deleting files, reading files or moving files, and detect file copy activity from the server. Historical audit reports and real-time alerts are built-in. Try the 30-day free trial!
  • For heads of IT/Engineering responsible for building an analytics infrastructure, Etleap is an ETL solution for creating perfect data pipelines from day one. Unlike older enterprise solutions, Etleap doesn’t require extensive engineering work to set up, maintain, and scale. It automates most ETL setup and maintenance work, and simplifies the rest into 10-minute tasks that analysts can own. Read stories from customers like Okta and PagerDuty, or try Etleap yourself.
  • PerfOps is a data platform that digests real-time performance data for CDN and DNS providers as measured by real users worldwide. Leverage this data across your monitoring efforts and integrate with PerfOps’ other tools such as Alerts, Health Monitors and FlexBalancer – a smart approach to load balancing. FlexBalancer makes it easy to manage traffic between multiple CDN providers, APIs, databases or any custom endpoint, helping you achieve better performance, ensure the availability of services and reduce vendor costs. Creating an account is free and provides access to the full PerfOps platform.
  • InMemory.Net provides a .Net native in-memory database for analysing large amounts of data. It runs natively on .Net, and provides native .Net, COM & ODBC APIs for integration. It also has an easy to use language for importing data, and supports standard SQL for querying data. http://InMemory.Net
  • Build, scale and personalize your news feeds and activity streams with getstream.io. Try the API now in this 5 minute interactive tutorial. Stream is free up to 3 million feed updates so it’s easy to get started. Client libraries are available for Node, Ruby, Python, PHP, Go, Java and .NET. Stream is currently also hiring Devops and Python/Go developers in Amsterdam. More than 400 companies rely on Stream for their production feed infrastructure, this includes apps with 30 million users. With your help we’d like to add a few zeros to that number. Check out the job opening on AngelList.
  • Scalyr is a lightning-fast log management and operational data platform. It’s a tool (actually, multiple tools) that your entire team will love. Get visibility into your production issues without juggling multiple tabs and different services — all of your logs, server metrics and alerts are in your browser and at your fingertips. Loved and used by teams at Codecademy, ReturnPath, Grab, and InsideSales. Learn more today or see why Scalyr is a great alternative to Splunk.
  • Advertise your product or service here!

Fun and Informative Events

  • Advertise your event here!

If you are interested in a sponsored post for an event, job, or product, please contact us for more information.


PA File Sight monitors file access on a server in real-time.

It can track who is accessing what, and with that information can help detect file copying, detect (and stop) ransomware attacks in real-time, and record the file activity for auditing purposes. The collected audit records include user account, target file, the user’s IP address and more. This solution does NOT require Windows Native Auditing, which means there is no performance impact on the server. Join thousands of other satisfied customers by trying PA File Sight for yourself. No sign up is needed for the 30-day fully functional trial.


Make Your Job Search O(1) — not O(n)

Triplebyte is unique because they’re a team of engineers running their own centralized technical assessment. Companies like Apple, Dropbox, Mixpanel, and Instacart now let Triplebyte-recommended engineers skip their own screening steps.

We found that High Scalability readers are about 80% more likely to be in the top bracket of engineering skill.

Take Triplebyte’s multiple-choice quiz (system design and coding questions) to see if they can help you scale your career faster.


The Solution to Your Operational Diagnostics Woes

Scalyr gives you instant visibility of your production systems, helping you turn chaotic logs and system metrics into actionable data at interactive speeds. Don’t be limited by the slow and narrow capabilities of traditional log monitoring tools. View and analyze all your logs and system metrics from multiple sources in one place. Get enterprise-grade functionality with sane pricing and insane performance. Learn more today



2019 Open Source Database Report: Top Databases, Public Cloud vs. On-Premise, Polyglot Persistence

Ready to transition from a commercial database to open source, and want to know which databases are most popular in 2019? Wondering whether an on-premise vs. public cloud vs. hybrid cloud infrastructure is best for your database strategy? Or, considering adding a new database to your application and want to see which combinations are most popular? We found all the answers you need at the Percona Live event last month, and broke down the insights into the following free trends reports:

2019 Top Databases Used

So, which databases are most popular in 2019? We broke down the data by open source databases vs. commercial databases:

Open Source Databases

Open source databases are free community databases whose source code is available to the general public, and which may be modified or used in their original design. Popular examples of open source databases include MySQL, PostgreSQL and MongoDB.

Commercial Databases

Commercial databases are developed and maintained by a commercial business, are available for use through a paid license or subscription, and may not be modified. Popular examples of commercial databases include Oracle, SQL Server, and DB2.

Top Open Source Databases

MySQL remains on top as the #1 free and open source database, representing over 30% of open source database use. This comes as no surprise, as MySQL has held this position consistently for many years according to DB-Engines.

2019 Most Popular Open Source Databases Used Report Pie Chart - ScaleGrid

PostgreSQL came in 2nd place with 13.4% representation from open source database users, closely followed by MongoDB at 12.2% in 3rd place. This again could be expected based on the DB-Engines Trend Popularity Ranking, but we saw MongoDB in 2nd place at 24.6% just three months ago in our 2019 Database Trends – SQL vs. NoSQL, Top Databases, Single vs. Multiple Database Use report.

While over 50% of open source database use is represented by the top 3, we also saw a good representation for #4 Redis, #5 MariaDB, #6 Elasticsearch, #7 Cassandra, and #8 SQLite. The last 2% of databases represented include Clickhouse, Galera, Memcached, and HBase.

Top Commercial Databases

In this next graph, we’re looking at a unique report which represents both polyglot persistence and migration trends: top commercial databases used with open source databases.

We’ve been seeing a growing trend of leveraging multiple database types to meet your application needs, and wanted to compare how organizations are using both commercial and open source databases within a single application. This report also represents the commercial database users who are also in the process of migrating to an open source database. For example, PostgreSQL, the fastest growing database by popularity for 2 years in a row, has 11.5% of its user base represented by organizations currently in the process of migrating to PostgreSQL.

So, now that we’ve explained what this report represents, let’s take a look at the top commercial databases used with open source.

2019 Most Popular Commercial Databases Used with Open Source Report Pie Chart - ScaleGrid

Oracle, the #1 database in the world, holds true to form, representing over two-thirds of commercial and open source database combinations. What is shocking in this report is the large gap between Oracle and 2nd place Microsoft SQL Server; the gap is much smaller according to DB-Engines. IBM Db2 came in 3rd place, representing 11.1% of commercial database use combined with open source.

Cloud Infrastructure Breakdown by Database

Now, let’s take a look at the cloud infrastructure setup breakdown by database management systems.

Public Cloud vs. On-Premise vs. Hybrid Cloud

We asked our open source database users how they’re hosting their database deployments to identify the current trends between on-premise vs. public cloud vs. hybrid cloud deployments.

Coming in at #1, 49.5% of open source database deployments are run on-premise. While we anticipated on-premise would lead, we were surprised by the exact percentage: in our recent 2019 PostgreSQL Trends Report, on-premise private cloud deployments represented 59.6%, over 10% higher than in this report.

Public cloud came in 2nd place with 36.7% of open source database deployments, consistent with the 34.8% of deployments from the PostgreSQL report. Hybrid cloud, however, grew significantly compared to that report, with 13.8% representation from open source databases vs. 5.6% of PostgreSQL deployments.
2019 Open Source Databases Report: Public Cloud vs Private Cloud vs On-Premise Pie Chart - ScaleGrid

So, which cloud infrastructure is right for you? Here’s a quick intro to public cloud vs. on-premise vs. hybrid cloud: 

Public Cloud

Public cloud is a cloud computing model where IT services are delivered across the internet. Typically purchased through a subscription usage model, public cloud is very easy to set up with no large upfront investment, and can be quickly scaled as your application needs change.

On-Premise

On-premise, or private cloud, deployments are cloud solutions dedicated to a single organization and run in its own datacenter (or off-site with a third-party vendor). An on-premise setup offers many more opportunities to customize your infrastructure, but it requires a significant upfront investment in hardware and software computing resources, as well as ongoing maintenance responsibilities. These deployment types are best suited for large organizations, regulated industries, or organizations with advanced security needs.

Hybrid Cloud

A hybrid cloud is a mixture of both public cloud and private cloud solutions, integrated into a single infrastructure environment. This allows organizations to share resources between public and private clouds to improve their efficiency, security, and performance. These are best suited for deployments that require the advanced security of an on-premise infrastructure, as well as the flexibility of the public cloud.

Now, let’s take a look at which cloud infrastructures are most popular by each open source database type.

Open Source Database Deployments: On-Premise

In this graph, as well as the public cloud and hybrid cloud graphs below, we break down each individual open source database by the percentage of deployments that leverage this type of cloud infrastructure.

So, which open source databases are most frequently deployed on-premise? PostgreSQL came in 1st place with 55.8% of deployments on-premise, closely followed by MongoDB at 52.2%, Cassandra at 51.9%, and MySQL at 50% on-premise.
2019 Percent of Open Source Databases Using an On-Premise Infrastructure Report - ScaleGrid

The open source databases that reported less than half of deployments on-premise include MariaDB at 47.2%, SQLite at 43.8%, and Redis at 42.9%. The database that is least often deployed on-premise is Elasticsearch at only 34.5%.

Open Source Database Deployments: Public Cloud

Now, let’s look at the breakdown of open source databases in the public cloud.

SQLite is the most frequently deployed open source database in a public cloud infrastructure, at 43.8% of its deployments, closely followed by Redis at 42.9%. MariaDB public cloud deployments came in at 38.9%, then 36.7% for MySQL, and 34.5% for Elasticsearch.

2019 Percent of Open Source Databases Using a Public Cloud Infrastructure Report - ScaleGrid

Three databases came in with less than 1/3rd of their deployments in the public cloud, including MongoDB at 30.4%, PostgreSQL at 27.9%, and Cassandra with the fewest public cloud deployments at only 25.9%.

Open Source Database Deployments: Hybrid Cloud

Now that we know how the open source databases break down between on-premise vs. public cloud, let’s take a look at the deployments leveraging both computing environments.

The #1 open source database to leverage hybrid clouds is Elasticsearch, which came in at 31%. The closest following database for hybrid cloud is Cassandra at just 22.2%.

2019 Percent of Open Source Databases Using a Hybrid Cloud Infrastructure Report - ScaleGrid

MongoDB was in 3rd for percentage of deployments in a hybrid cloud at 17.4%, then PostgreSQL at 16.3%, Redis at 14.3%, MariaDB at 13.9%, MySQL at 13.3%, and lastly SQLite at only 12.5% of deployments in a hybrid cloud.

Open Source Database Deployments: Multi Cloud

On average, 20% of public cloud and hybrid cloud deployments are leveraging a multi-cloud strategy. Multi-cloud is the use of two or more cloud computing services. We also took a look at the number of clouds used, and found that some deployments leverage up to 5 different cloud providers within a single organization:

Average Number of Clouds Used for Open Source Database Multi-Cloud Deployments - ScaleGrid Report

Most Popular Cloud Providers for Open Source Database Hosting

In our last analysis under the Cloud Infrastructure breakdown, we analyze which cloud providers are most popular for open source database hosting:
2019 Most Popular Cloud Providers for Open Source Database Hosting Pie Chart - ScaleGrid

AWS is the #1 cloud provider for open source database hosting, representing 56.9% of all cloud deployments from this survey. Google Cloud Platform (GCP) came in 2nd at 26.2% with a surprising lead over Azure at 10.8%. Rackspace then followed in 4th representing 3.1% of deployments, and DigitalOcean and Softlayer followed last representing the remaining 3% of open source deployments in the cloud.

Polyglot Persistence Trends

Polyglot persistence is the concept of using different databases to handle different needs within a single software application, using each for what it is best at to achieve an end goal. This is a great way to ensure your application handles its data correctly, vs. trying to satisfy all of your requirements with a single database type. An obvious example would be SQL, which is good at handling structured data, vs. NoSQL, which is best for unstructured data.

Let’s take a look at a couple polyglot persistence analyses:

Average Number of Database Types Used

On average, we found that companies leverage 3.1 database types for their applications within a single organization. Just over 1/4 of organizations leverage a single database type, with some reporting up to 9 different database types used:

Average Number of Database Types Used in an Organization - ScaleGrid Report

Average Number of Database Types Used by Infrastructure

So, how does this number break down across infrastructure types? We found that hybrid cloud deployments are most likely to leverage multiple database types, and average 4.33 database types at a time.

On-premise deployments typically leverage 3.26 different database types, and public cloud came in lowest at 3.05 database types leveraged on average within an organization.

Average Number of Database Used On-Premise vs Public Cloud vs Hybrid Cloud - ScaleGrid Report

Databases Types Most Commonly Used Together

Let’s now take a closer look at the database types most commonly leveraged together within a single application.

In the chart below, the databases in the left column represent the sample size for that database type, and the databases listed across the top represent the percentage combined with that database type. The blue highlighted cells represent 100% of deployment combinations, while yellow represents 0% of combinations.

So, as we can see below in our database combinations heatmap, MySQL is the database most frequently combined with other database types. But, while other database types are frequently leveraged in conjunction with MySQL, that doesn’t mean that MySQL deployments are always leveraging another database type. This can be seen in the first row for MySQL, whose cells range from lighter blue to yellow, compared to the first column for MySQL, which shows a much closer match to the blue representing 100% combinations.

The cells highlighted with a black border represent the deployments leveraging only that one database type, where again MySQL takes #1, with 23% of its deployments using MySQL alone.

Percent of Database Deployments Used With Another Database Type - ScaleGrid Report

We can also see a similar trend with Db2, where the bottom row for Db2 shows that it is highly leveraged with MySQL, PostgreSQL, Cassandra, Oracle, and SQL Server, but a very low percentage of other database deployments also leverage Db2, outside of SQL Server, which also uses Db2 in 50% of those deployments.

SQL vs. NoSQL Open Source Database Popularity

Last but not least, we compare SQL vs. NoSQL for our open source database report. SQL represents over 3/5 of open source database use at 60.6%, compared to NoSQL at 39.4%.

SQL vs NoSQL Open Source Database Popularity - ScaleGrid Report

We hope these database trends were insightful and sparked some new ideas or validated your current database strategy! Tell us what you think below in the comments, and let us know if there’s a specific analysis you’d like to see in our next database trends report! Check out our other reports for more insight on what’s trending in the database space:

from High Scalability

Sponsored Post: Etleap, PerfOps, InMemory.Net, Triplebyte, Stream, Scalyr


Who’s Hiring? 

  • Triplebyte lets exceptional software engineers skip screening steps at hundreds of top tech companies like Apple, Dropbox, Mixpanel, and Instacart. Make your job search O(1), not O(n). Apply here.
  • Need excellent people? Advertise your job here! 

Fun and Informative Events

  • Advertise your event here!

Cool Products and Services

  • For heads of IT/Engineering responsible for building an analytics infrastructure, Etleap is an ETL solution for creating perfect data pipelines from day one. Unlike older enterprise solutions, Etleap doesn’t require extensive engineering work to set up, maintain, and scale. It automates most ETL setup and maintenance work, and simplifies the rest into 10-minute tasks that analysts can own. Read stories from customers like Okta and PagerDuty, or try Etleap yourself.
  • PerfOps is a data platform that digests real-time performance data for CDN and DNS providers as measured by real users worldwide. Leverage this data across your monitoring efforts and integrate with PerfOps’ other tools such as Alerts, Health Monitors and FlexBalancer – a smart approach to load balancing. FlexBalancer makes it easy to manage traffic between multiple CDN providers, APIs, databases or any custom endpoint, helping you achieve better performance, ensure the availability of services and reduce vendor costs. Creating an account is free and provides access to the full PerfOps platform.
  • InMemory.Net provides a .NET-native in-memory database for analysing large amounts of data. It runs natively on .NET and provides native .NET, COM & ODBC APIs for integration. It also has an easy to use language for importing data, and supports standard SQL for querying data. http://InMemory.Net
  • Build, scale and personalize your news feeds and activity streams with getstream.io. Try the API now in this 5 minute interactive tutorial. Stream is free up to 3 million feed updates so it’s easy to get started. Client libraries are available for Node, Ruby, Python, PHP, Go, Java and .NET. Stream is currently also hiring DevOps and Python/Go developers in Amsterdam. More than 400 companies rely on Stream for their production feed infrastructure, including apps with 30 million users. With your help we’d like to add a few zeros to that number. Check out the job opening on AngelList.
  • Scalyr is a lightning-fast log management and operational data platform. It’s a tool (actually, multiple tools) that your entire team will love. Get visibility into your production issues without juggling multiple tabs and different services — all of your logs, server metrics and alerts are in your browser and at your fingertips. Loved and used by teams at Codecademy, ReturnPath, Grab, and InsideSales. Learn more today or see why Scalyr is a great alternative to Splunk.
  • Advertise your product or service here!

If you are interested in a sponsored post for an event, job, or product, please contact us for more information.


Make Your Job Search O(1) — not O(n)

Triplebyte is unique because they’re a team of engineers running their own centralized technical assessment. Companies like Apple, Dropbox, Mixpanel, and Instacart now let Triplebyte-recommended engineers skip their own screening steps.

We found that High Scalability readers are about 80% more likely to be in the top bracket of engineering skill.

Take Triplebyte’s multiple-choice quiz (system design and coding questions) to see if they can help you scale your career faster.


The Solution to Your Operational Diagnostics Woes

Scalyr gives you instant visibility of your production systems, helping you turn chaotic logs and system metrics into actionable data at interactive speeds. Don’t be limited by the slow and narrow capabilities of traditional log monitoring tools. View and analyze all your logs and system metrics from multiple sources in one place. Get enterprise-grade functionality with sane pricing and insane performance. Learn more today


If you are interested in a sponsored post for an event, job, or product, please contact us for more information.

from High Scalability

Gone Fishin’


Well, not exactly Fishin’, but I’ll be on a month-long vacation starting today. I won’t be posting new content, so we’ll all have a break. Disappointing, I know. Please use this time for quiet contemplation and other inappropriate activities.

If you really need a not so quick fix there’s always the back catalog of Stuff the Internet Says. Odds are there’s a lot you didn’t read—yet.

from High Scalability

Stuff The Internet Says On Scalability For May 10th, 2019


Wake up! It’s HighScalability time:


Deep-sky mosaic, created from nearly 7,500 individual exposures, provides a wide portrait of the distant universe, containing 265,000 galaxies that stretch back through 13.3 billion years of time to just 500 million years after the big bang. (hubblesite)

Do you like this sort of Stuff? I’d greatly appreciate your support on Patreon. I wrote Explain the Cloud Like I’m 10 for people who need to understand the cloud. And who doesn’t these days? On Amazon it has 45 mostly 5 star reviews (107 on Goodreads). They’ll learn a lot and hold you in awe.

Number Stuff:

  • 36%: of the world touches a Facebook app every month, adding up to 2 years over a lifetime
  • $84.4: average yearly Facebook ad revenue per user in North America
  • 1%: performers raked in 60% of all concert-ticket revenue world-wide in 2017—more than double their share in 1982
  • 175 zettabytes: size of the global datasphere in 2025, up 5x from 2018, 49% stored in public clouds
  • 45.9%: Amazon’s share of U.S. online retail growth in 2018, and 20.8% of total U.S. retail sales growth
  • $4.5B: Apple’s make it all go away payment to Qualcomm
  • 64 nanowatts: per square meter energy harvested from sky battery
  • 18%: YoY drop in smartphone sales
  • 10x: size of software markets and businesses compared to 10-15 years ago, largely due to the liquidity provided by the global internet
  • 33: age of average American gamer, who prefers to play on their smartphone and is spending 20 percent more than a year ago and 85 percent more than in 2015, the $43.4 billion spent in 2018 was mostly on content
  • 336: average lifespan of a civilization in tears
  • 2/3rds: drop in M&A spending in April
  • 74%: SMBs said they “definitely would pay ransom at almost any price” to get their data back or prevent it from being stolen
  • 2.5B: devices powered by Android
  • 40%: use a hybrid cloud infrastructure
  • 50%: drop in 2019 hard disk sales, due to a combination of general market weaknesses and the transition of notebooks to SSDs
  • 2.5 million: total view count for the final Madden NFL 19 Bowl match
  • 40%: Amazon merchants based in China

Quotable Stuff:

  • @mjpt777: APIs to IO need to be asynchronous and support batching otherwise the latency of calls dominate throughput and latency profile under burst conditions. Languages need to evolve to better support asynchronous interfaces and have state machine support, not try to paper over the obvious issues with synchronous APIs. Not everyone needs high performance but the blatant waste  and energy consumption of our industry cannot continue.
  • Guido van Rossum: I did not enjoy at all when the central developers were sending me hints on Twitter questioning my authority and the wisdom of my decisions, instead of telling me to my face and having an honest debate about things.
  • Isobel Cockerell: A kind of WeChat code had developed through emoji: A half-fallen rose meant someone had been arrested. A dark moon, they had gone to the camps. A sun emoji—“I am alive.” A flower—“I have been released.”
  • @scottsantens: Australian company shifts to 4-day week with every Weds off and no decrease in pay. Result? 46% more revenue, a tripling of profits, and happier employees taking fewer sick days. Also Thurs are now much more productive. We work too much.
  • Twitter: Across the six Twitter Rules policy categories included in this report, 16,388 accounts were reported by known government entities compared to 5,461 reported during the last reported period, an increase of 17%. 
  • Michael Sheetz: The Blue Moon lander can bring 3.6 metric tons to the lunar surface, according to Bezos. Bezos also unveiled the company’s BE-7 rocket engine at the event. The engine will be test fired for the first time this summer, Bezos said. It’s largely made of “printed” parts, he added. “We need the new engine and that’s what this is,” Bezos said.
  • Umich: Called MORPHEUS, the chip blocks potential attacks by encrypting and randomly reshuffling key bits of its own code and data 20 times per second—infinitely faster than a human hacker can work and thousands of times faster than even the fastest electronic hacking techniques. With MORPHEUS, even if a hacker finds a bug, the information needed to exploit it vanishes 50 milliseconds later. It’s perhaps the closest thing to a future-proof secure system.
  • Sean Illing: In some ways our dependence on the phone also makes us less independent. Americans always celebrate self-reliance as a value, but it’s very clear we don’t — even for a moment — want to be by ourselves or on our own any longer. I have mixed feelings about the whole mythology of self-reliance. But certainly, while the myth that we’re self-reliant lives on, our ability to be alone seems to be going by the wayside.
  • DSHR: If University libraries/archives spent 1% of their acquisitions budget on Web archiving, they could expand their preserved historical Web records by a multiple of 20x.
  • Alexander Rose: Probably a third of the organizations or the companies over 500 or 1,000 years old are all in some way in wine, beer, or sake production.
  • @benedictevans: Idle observation: 2/3 to 3/4 of Google and Facebook’s ad business is from companies that never bought print advertising other than Yellow Pages. And a lot of what was in print went elsewhere.
  • @stevecheney: There is so much asymmetry in the Valley it cracks me up… Distributed teams are not a new trend — they are just downstream to VCs when fundraising series A/B. We built a 50 person distributed co after YC late 2013. And are 4x more capital efficient because of it.
  • digitalcommerce360: retailers ranked Nos. 401-500 this year grew their collective web revenue by 24.3% in 2018 over 2017, faster than the 20.0% growth of Amazon, and well above the 14.1% year-over-year ecommerce growth in North America.
  • Nikita: So why was AMP needed? Well, basically Google needed to lock content providers to be served through Google Search. But they needed a good cover story for that. And they chose to promote it as a performance solution.
  • c2h5oh: Name one high profile whistleblower in the USA in the last 30 years who has not had his entire life upturned or, more often, straight up ruined as a direct result of his high moral standards. Nobody working at Boeing or the FAA right now has witnessed one during their lifetime – all they saw were cautionary tales. 
  • Logan Engstrom et al.: In summary, both robust and non-robust features are predictive on the training set, but only non-robust features will yield generalization to the original test set. Thus, the fact that models trained on this dataset actually generalize to the standard test set indicates that (a) non-robust features exist and are sufficient for good generalization, and (b) deep neural networks indeed rely on these non-robust features, even in the presence of predictive robust features.
  • Andy Greenberg: SaboTor also underscored an aggressive new approach to law enforcement’s dark-web operations: The agents from the Joint Criminal Opioid Darknet Enforcement team that carried it out—from the FBI, Homeland Security Investigations, Drug Enforcement Administration, Postal Service, Customs and Border Protection, and Department of Defense—now all sit together in one room of the FBI’s Washington headquarters. They’ve been dedicated full-time to following the trail of dark-web suspects, from tracing their physical package deliveries to following the trail of payments on Bitcoin’s blockchain.
  • Dharmesh Thakker: The future of open source is in the cloud, and the future of cloud is heavily influenced by open source. Going forward, I believe the diamond standard in infrastructure software will be building a legendary open-source brand that is adopted by thousands of users, and then delivering a cloud-native, full-service experience to commercialize it. Along the way, non- open-source companies that use cloud “time-to-value” effectively, as well as hybrid open-source solutions delivered on multi-cloud and on-premise systems, will continue to thrive. This is the new OpenCloud paradigm, and I am excited about the hundreds of transformational companies that will be formed in the coming years to take advantage of it.
  • RcouF1uZ4gsC: There seems to be a trend of people making a lot of money designing/building stuff that erodes privacy and ethics and then leaving the company where they made that money and talking about privacy and ethics. Take for example Justin Rosenstein who invented the Like button.
  • A. Nonymous: On the fateful day, a switch crashed. The crash condition resulted in a repeated sequence of frames being sent at full wire speed. The repeated frames included broadcast traffic in the management VLAN, so every control-plane CPU had to process them. Network infrastructure CPUs at 100% all over the data center including core switches, routing adjacencies down, etc. The entire facility could not process for ~3.5 hours. No stretched L2, so damage was contained to a single site. This was a reasonably well-managed site, but had some dumb design choices. Highly bridged networks don’t tolerate dumb design choices.
  • Kevin Fogarty: Despite the moniker, 5G is more of a statement of direction than a single technology. The sub-6GHz version, which is what is being rolled out today, is more like 4.5G. Signal attenuation is modest, and these devices behave much like cell phones today. But when millimeter wave technology begins rolling out—current projections are 2021 or 2022—everything changes significantly. This slice of the spectrum is so sensitive that it can be blocked by clothing, skin, windows, and sometimes even fog.
  • DSHR: Why did “cloud service providers” have an “inventory build-up during calendar 2018”? Because the demand for storage from their customers was even further from insatiable than the drive vendors expected. Even the experts fall victim to the “insatiable demand” myth.
  • Eric Budish: In particular, the model suggests that Bitcoin would be majority attacked if it became sufficiently economically important — e.g., if it became a “store of value” akin to gold — which suggests that there are intrinsic economic limits to how economically important it can become in the first place.
  • Kalev Leetaru: In fact, much of the bias of deep learning comes from the reliance of the AI community on free data rather than paying to create minimally biased data. Putting this all together, as data science matures it must become far more like the hard sciences, especially a willingness to expend the resources to collect new data and ask the hard questions, rather than its current usage of merely lending a veneer of credibility to preordained conclusions.
  • Joel Hruska: AMD picked up one percentage point of unit share in the overall x86 market in Q1 2019 compared with the previous quarter and 4.7 percentage points of market share compared with Q1 2018. This means AMD increased its market share by 1.54x in just one year — a substantial improvement for any company.
  • @awsgeek: <- Meet the latest AWS Lambda Layers fanboy. I love how I can now move common dependencies into shared layers & reduce Lambda package sizes, which allows me to continue developing & debugging functions in the Lambda console. Yes, I love VIM, but I’m still a sucker for a GUI!
  • Chen: It’s not hard to believe that someone, maybe an employee, could be convinced to add a rogue element, a tiny little capacitor or something, to a board. There was a bug we heard about that looked like a generic Ethernet jack, and it worked like one, but it had some additional cables. The socket itself is the Trojan and the relevant piece is inside the case, so it’s hard to see.
  • @aallan: “The future is web apps, and has been since Steve Jobs told the WWDC audience in 2007 he had a ‘sweet solution’ to their desire to put apps on the iPhone – the web! – and was greeted by the stoniest of silences…” Yup, this! The future is never web apps.
  • @tmclaughbos: The biggest divide in the ops community isn’t Old v. DevOps v. SRE, k8s, v. serverless, or whatever. It is “How do I run infrastructure?” v. “How do I not run infrastructure?”.
  • @PaulDJohnston: The serverless shift * From Code to Configuration * From High LoC towards Low LoC (preferably zero) * From Building Services to Consuming Services * From Owning Workloads to Disowning Workloads
  • Vishal Gurbuxani: Facebook is not a social network anymore. It is a completely re-written internet, where a consumer spends their time, money, attention to buy products/services for their day-day life, as well as connect with people/groups, etc. I hope we can all take a stand and realize that our humanity is being lost by Facebook, when they choose to use algorithms to police 2.7 billion people.
  • @mattklein123: The thing I find most ironic about the C++ is dead narrative is that C++ IS one of the (several) reasons that Envoy has blown up. Google/Apple would not have touched Envoy if not C++, and their support was critical in early 2017, lending both expertise and resources. 1/ Winning in OSS is about product market fit, “hiring” contributors, and community building. The bigger the community, the more expertise and the more production usage, and this creates a compounding virtuous cycle. 2/
  • @clintsharp: It is nearly impossible to imbue an algorithm with *judgement*. And ultimately, we are paying operators of complex systems for their judgement. When to page out, when to escalate, when to bring in the developers. No algorithm is going to solve that for you.
  • Rudraksh Tuwani et al.: Our analysis reveals copy-mutation as a plausible mechanism of culinary evolution. As the world copes with the challenges of diet-linked disorders, knowledge of the key determinants of culinary evolution can drive the creation of novel recipe generation algorithms aimed at dietary interventions for better nutrition and health.
  • ellius: After fixing a recent bug, I asked my client company what if any postmortem process they had. I informally noted about 8 factors that had driven the resolution time to ~8 hours from what probably could have been 1 or 2. Some of them were things we had no control over, but a good 4-5 were things in the application team’s immediate control or within its orbit. These are issues that will definitely recur in troubleshooting future bugs, and doing a proper postmortem could easily save 250+ man hours over the course of a year. What’s more, fixing some of these issues would also aid in application development. So you’re looking at immediate cost savings
  • MITTR: The limiting factor for new machines is no longer the hardware but the power available to keep them humming. The Summit machine already requires a 14-megawatt power supply. That’s enough to light up an entire medium-sized town. “To scale such a system by 10x would require 140 MW of power, which would be prohibitively expensive,” say Villalonga and co. By contrast, quantum computers are frugal. Their main power requirement is the cooling for superconducting components. So a 72-qubit computer like Google’s Bristlecone, for example, requires about 14 kW. “Even as qubit systems scale up, this amount is unlikely to significantly grow,” say Villalonga and co.
  • Patient0: In the ~15 years I spent building software in C++ I don’t recall a single time that I wished for garbage collection. By using RAII techniques, it was always possible (nay, easy!) to write code that cleaned up after itself automatically. I always found it easier to reason about and debug programs because I knew when something was supposed to be freed, and in which thread. In contrast, in the ~10 years I spent working in Java, I frequently ran into problems with programs which needed an excessive amount of memory to run. I spent countless frustrating hours debugging and “tuning” the JVM to try to reduce the excessive memory footprint (never entirely successfully). Garbage collection is an oversold hack – I concede there are probably some situations where it is useful – but it has never lived up to the claims people made about it, especially with respect to it supposedly increasing developer productivity.
  • dan.j.newhouse: I just went through migrating our production machines from m5 and r5 instances to z1d over the last couple months. I’m a big fan of the z1d instance family now. Where I work, our workloads are very heavy on CPU, in addition to wanting a good chunk of RAM (what database doesn’t, though?). The m5 and r5 instances don’t cut it in the CPU department, and the c5 family is just poor RAM per dollar. While this blog post is highlighting the CPU, the z1d also has the instance storage NVMe ssd (as does the r5d). Set the database service to automatic delayed start, and toss TempDB on that disk. That local NVMe ssd is great in multiple ways. First, it’s already included in the price of the EC2 instance. Secondly, I’ve seen throughput in the neighborhood of 750 MB/s against it (YMMV). Considering the cost of an io1 volume with a good chunk of provisioned IOPS is NOT cheap, plus you need an instance large enough to support that level of throughput to EBS in the first place, this is a big deal. If you’ve got the gp2-blues, with TempDB performing poorly, or worse, even experiencing buffer latch wait timeouts for our good buddy database ID 2, making a change to a z1d (or r5d, if you don’t need that CPU) to leverage that local ssd is really something to consider.
  • _nothing: My opinion is different nowadays. Instagram is surely a place where exists a lot of true beauty and expression, but it’s also a place full of people largely driven by societal and monetary reward, to an extent that I’ve come to consider unhealthy. We are being influenced, and we are influencing. And we like that– my social brain wants to know what society considers beautiful, it enjoys training itself on what society considers beautiful, it wants to be affirmed in the beauty of its own body. Instagram gave me exactly what I wanted. But (and I don’t mean to criticize anyone here at all, considering I was and still am subject to the same pressures and influences) I don’t think what I thought I wanted was healthy. I want to be happy. A constant stream of corgi videos and bikini photos and travel porn gives me little ups, but it also shapes my brain in ways I think could be damaging.

Useful Stuff:

  • Another example of specialization being the key to scalability and efficiency. It’s fascinating to see all the knobs Dropbox can tune because they do one thing well. How we optimized Magic Pocket for cold storage
    • kdkeyser: This article is about single-region storage vs. multi-region storage (and how to reduce the cost in this case). There is very little public info available about distributed storage systems in multi-region setup with significant latency between the sites.
    • preslavle: In our approach the additional codebase for cold storage is extremely small relative to the entire Magic Pocket codebase and importantly does not mutate any data in the live write path: data is written to the warm storage system and then asynchronously migrated to the cold storage system. This provides us an opportunity to hold data in both systems simultaneously during the transition and run extensive validation tests before removing data from the warm system. We use the exact same storage zones and codebase for storing each cold storage fragment as we use for storing each block in the warm data store. It’s the same system storing the data, just for a fragment instead of a block. In this respect we still have multi-zone protections since each fragment is stored in multiple zones.
    • Over 40% of all file retrievals in Dropbox are for data uploaded in the last day, over 70% for data uploaded in the last month, and over 90% for data uploaded in the last year. Dropbox has unpredictable delete patterns so we needed some process to reclaim space when one of the blocks gets deleted.
    • This system is already designed for a fairly cold workload. It uses spinning disks, which have the advantage of being cheap, durable, and relatively high-bandwidth. We save the solid-state drives (SSDs) for our databases and caches. Magic Pocket also uses different data encodings as files age. When we first upload a file to Magic Pocket we use n-way replication across a relatively large number of storage nodes, but then later encode older data in a more efficient erasure coded format in the background
    • Dropbox’s network stack is already heavily optimized for transferring large blocks of data over long distances. We have a highly tuned network stack and gRPC-based RPC framework, called Courier, that is multiplexing requests over HTTP/2 transport. This all results in warm TCP connections with a large window size that allows us to transfer a multi-megabyte block of data with a single round-trip.
    • One beautiful property of the cold storage tier is that it’s always exercising the worst-case scenario. There is no plan A and plan B. Regardless of whether a region is down or not, retrieving data always requires a reconstruction from multiple fragments. Unlike our previous designs or even the warm tier, a region outage does not result in major shifts in traffic or increase of disk I/O in the surviving regions. This made us less worried about hitting unexpected capacity limits during emergency failover at peak hours.
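The “always exercising the worst-case” property follows from erasure coding: a read reconstructs the block from fragments whether or not anything has failed. A toy single-parity sketch of the reconstruction idea (Magic Pocket’s real codes are far more sophisticated and span regions):

```python
# Toy single-parity erasure code (NOT Dropbox's actual scheme). One XOR
# parity fragment lets any single lost fragment be rebuilt from the
# survivors, so a read-after-failure is a reconstruction, never a fallback
# to a spare replica.

def encode(block: bytes, k: int):
    """Split block into k data fragments plus one XOR parity fragment."""
    frag_len = -(-len(block) // k)  # ceiling division
    padded = block.ljust(frag_len * k, b"\0")
    frags = [padded[i * frag_len:(i + 1) * frag_len] for i in range(k)]
    parity = frags[0]
    for f in frags[1:]:
        parity = bytes(x ^ y for x, y in zip(parity, f))
    return frags + [parity]

def reconstruct(frags, missing):
    """Rebuild fragment `missing` by XORing all surviving fragments."""
    survivors = [f for i, f in enumerate(frags) if i != missing]
    out = survivors[0]
    for f in survivors[1:]:
        out = bytes(x ^ y for x, y in zip(out, f))
    return out

frags = encode(b"hello world!", 3)
assert reconstruct(frags, 1) == frags[1]  # lost data fragment recovered
assert reconstruct(frags, 3) == frags[3]  # lost parity fragment recovered
```

Because reconstruction runs on every failure-path read, a region outage changes nothing about the access pattern, which is exactly the property the Dropbox team highlights.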
  • If you come to Silicon Valley, should you work for a consumer-oriented company or an enterprise SaaS company? The answer for a long time has been to target the consumer space. Consumer has been sexy, where the innovation is, where a new business model can win, where opportunity can be found. Exponent Episode 170 — A Perfect Meal makes the case that’s no longer true. Consumer and enterprise SaaS have switched roles. Consumer is now controlled by monopolies. It’s hard for a new entrant to gain a foothold in the consumer space. Enterprise SaaS is where a new product can win on merit. The examples given are Zoom and Slack. Both have won because they are better than their competitors. Can you say the same about many recent consumer products? The change is driven by the same trends we’ve seen drive the consumer market. Bring your own device in the enterprise has made users more of a driving force in deciding what software an enterprise adopts. The role of the gatekeeper has diminished. You only need to convince an individual employee at a company to give your product a try, which is perfect for software as a service. Anyone at a company can sign up for a SaaS product at no risk. A sales team doesn’t have to build relationships to drive sales. Employees drive adoption. Once in a company, especially if your product has a viral component, you can land and expand, something both Zoom and Slack have mastered. This drives down customer acquisition costs dramatically. Once you have an individual on board, that individual can infect others. And your sales team, after seeing a company has a number of users, can call that company to try to get the entire company on board. The pitch can be that you’re relieving pain by offering a managed service for the entire company instead of each team managing a service for themselves. If you want to build a product where the best product wins, then enterprise is the new sexy. The competitive dynamics in the enterprise reward being the better company in a way that consumer no longer does.
  • A radical rethinking of the stack. Fast key-value stores: An idea whose time has come and gone: We argue that the time of the RInK [Remote, in-memory key-value] store has come and gone: their domain-independent APIs (e.g., PUT/GET) push complexity back to the application, leading to extra (un)marshalling overheads and network hops. Instead, data center services should be built using stateful application servers or custom in-memory stores with domain-specific APIs, which offer higher performance than RInKs at lower cost.
    • SerDes is always a huge waste: in ProtoCache prior to its rearchitecture, 27% of latency was due to (un)marshalling. In our experiments (Section 3), (un)marshalling accounts for more than 85% of CPU usage. We also found (un)marshalling a 1KB protocol buffer to cost over 10us, with all data in the L1 cache. A third-party benchmark [5] shows that other popular serialization formats (e.g., Thrift [27]) are equally slow
    • Extra network hops also have a cost:  prior to its rearchitecture, ProtoCache incurred an 80 ms latency penalty simply to transfer large records from a remote store, despite a high speed network.
    • What they want instead: Stateful application servers couple full application logic with a cache of in-memory state linked into the same process (Fig. 2b). This architecture effectively merges the RInK with the application server; it is feasible when a RInK is only accessed by a single application and all requests access a single key. Latency is 29% to 57% better (at the median), with relative improvement increasing with object size. 
    • This is really a back to the future model of services. Services were stateful at one time. Then we went stateless to scale and added in caches to mitigate the performance penalty for separating state from logic. It would be interesting to see something like Lambda distribute stateful actors instead of functions.
    • Good discussion on HN.
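The paper’s contrast is easy to see in miniature. In this hypothetical sketch (class names invented here), the RInK-style store round-trips every read through serialization while the stateful server hands back the live object:

```python
import json

# Hypothetical sketch of the paper's contrast (class names made up). A RInK
# store's domain-independent PUT/GET traffics in serialized bytes, so every
# read pays an unmarshalling cost (plus a network hop, elided here); a
# stateful application server keeps the live object linked into its process.

class RinkStore:
    def __init__(self):
        self._data = {}
    def put(self, key, obj):
        self._data[key] = json.dumps(obj).encode()  # marshal on every write
    def get(self, key):
        return json.loads(self._data[key])          # unmarshal on every read

class StatefulServer:
    def __init__(self):
        self._cache = {}
    def put(self, key, obj):
        self._cache[key] = obj                      # store the live object
    def get(self, key):
        return self._cache[key]                     # no SerDes, no hop

profile = {"user": 1, "feed": list(range(5))}
rink, app = RinkStore(), StatefulServer()
rink.put("u1", profile)
app.put("u1", profile)
assert rink.get("u1") == profile and rink.get("u1") is not profile  # a copy
assert app.get("u1") is profile                                     # the object
```

The trade-off the paper acknowledges still applies: the stateful version only works when a single application owns the state and requests can be routed to the server holding the key.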
  • We don’t need no stinkin’ OS. But of course the interfaces they talk about are really just another OS. I/O Is Faster Than the CPU – Let’s Partition Resources and Eliminate (Most) OS Abstractions: I/O is getting faster in servers that have fast programmable NICs and non-volatile main memory operating close to the speed of DRAM, but single-threaded CPU speeds have stagnated. Applications cannot take advantage of modern hardware capabilities when using interfaces built around abstractions that assume I/O to be slow. We therefore propose a structure for an OS called parakernel, which eliminates most OS abstractions and provides interfaces for applications to leverage the full potential of the underlying hardware. The parakernel facilitates application-level parallelism by securely partitioning the resources and multiplexing only those resources that are not partitioned. Great discussion on HN. We’ve seen all this before. 
  • We should keep this lesson in mind when it comes to the use of biological weapons. Anything put out in the world can be captured, analysed, 3D printed en masse, and sent right back at the attacker. How Chinese Spies Got the N.S.A.’s Hacking Tools, and Used Them for Attacks: Chinese intelligence agents acquired National Security Agency hacking tools and repurposed them in 2016 to attack American allies and private companies in Europe and Asia, a leading cybersecurity firm has discovered. The episode is the latest evidence that the United States has lost control of key parts of its cybersecurity arsenal.
  • Real Time Lambda Cost Analysis Using Honeycomb: All of the services that support our web and mobile applications at Fender Digital are built using AWS Lambda. With Lambda’s cost-per-use billing model we have cut the cost of hosting our services by approximately 90%…While we have been tagging our resources diligently to calculate the cost of each service, the AWS Cost Explorer does not allow us to delve into the configuration of the function and the actual resources it has consumed versus the actual invocation times…Log aggregation to the rescue! We aggregate all of the Cloudwatch log groups for our Lambda functions into a single Kinesis stream, which then parses out the structured JSON logs and publishes them to honeycomb.io. I cannot recommend that tool highly enough for analyzing log data. It is a great product from a great group of people. AWS adds its own log lines as well, including when the function starts an invocation, when the invocation ends, and a report of the invocation that can be used to calculate the cost of the invocation…The function is currently configured for 512 MB, and since most of its time is spent in network I/O calling app store APIs, we can reduce the configured memory. If we were to reduce it to 384 MB we would see an ~25% reduction in cost. The average invocation time may be slightly higher, but as this function is invoked off of a DynamoDB stream it has no direct impact on user experience. What are the actual costs incurred? Assuming we’ve already consumed the free tier for Lambda, during the 3 day period we are looking at there were 1,405,640 requests x $0.0000002 per request = $0.28 and 140,057.8 GB-seconds of compute time x $0.00001667 = $2.33. For those three days, this function cost $2.61, leading to a potential savings of $0.65. That’s not a lot, but as our use of Lambda steadily grows we can ensure we are not over-provisioned.
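The billing arithmetic above is simple enough to sketch. Note that only the GB-seconds portion scales with configured memory; the per-request charge is unaffected (rates are the article’s published Lambda prices):

```python
# Lambda cost = request charge + compute charge (GB-seconds).
PER_REQUEST = 0.0000002      # $ per request
PER_GB_SECOND = 0.00001667   # $ per GB-second

requests = 1_405_640
gb_seconds = 140_057.8       # at 512 MB configured memory

request_cost = requests * PER_REQUEST       # ~$0.28
compute_cost = gb_seconds * PER_GB_SECOND   # ~$2.33
total = request_cost + compute_cost         # ~$2.61 for the 3-day window

# Dropping memory from 512 MB to 384 MB scales GB-seconds by 384/512,
# a 25% cut to the compute portion (assuming duration stays the same).
reduced_compute = compute_cost * (384 / 512)
print(round(request_cost, 2), round(compute_cost, 2), round(reduced_compute, 2))
```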
  • Measuring MySQL Performance in Kubernetes: You can see the numbers vary a lot, from 10,550 tps to 20,196 tps, with most runs in the 10,000 tps range. That’s quite disappointing. Basically, we lost half of the throughput by moving [MySQL] to the Kubernetes node. To improve your experience you need to make sure you use Guaranteed QoS. Unfortunately, Kubernetes does not make it easy: you need to manually set the number of CPU threads, which is not always obvious if you use dynamic cloud instances. With Guaranteed QoS there is still a performance overhead of 10%, but I guess this is the cost we have to accept at the moment.
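For reference, Kubernetes grants Guaranteed QoS only when every container in the pod has CPU and memory requests exactly equal to its limits. A minimal manifest fragment (image and sizes illustrative, not from the article):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mysql
spec:
  containers:
    - name: mysql
      image: mysql:8.0
      resources:
        requests:
          cpu: "16"        # must equal the limit...
          memory: "32Gi"
        limits:
          cpu: "16"        # ...for the pod to get Guaranteed QoS
          memory: "32Gi"
```

This is the “manually set the number of CPU threads” pain point: on a dynamically sized cloud node you have to know the core count up front to pin requests and limits to it.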
  • If your company runs multiple lines of business (think Gmail vs. Google Docs vs. Google Calendar, etc.), how can you tell how much of your hardware and infrastructure spend is attributed to each LOB? Embracing context propagation: When the request enters our system, we typically already know which LOB it represent, either from the API endpoint or even directly from the client apps. We can use context (baggage) to store the LOB tag and use it anywhere in the call graph to attribute measurements of resource usage to specific LOB, such as number of reads/writes in the storage, or number of messages processed by the messaging platform.
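The mechanics are the same as tracing baggage: set the LOB tag once at the edge of the system and read it anywhere down the call graph without threading it through every function signature. A minimal sketch using Python’s `contextvars` (function and tag names hypothetical; real systems propagate baggage across process boundaries via request headers):

```python
import contextvars

# Baggage-style context: one ContextVar holds the line-of-business tag.
lob = contextvars.ContextVar("lob", default="unknown")

def storage_read(key, usage):
    # Deep in the call graph: attribute this read to the active LOB
    # without ever receiving the LOB as an argument.
    usage[lob.get()] = usage.get(lob.get(), 0) + 1

def handle_request(lob_tag, usage):
    lob.set(lob_tag)           # set once, at the API edge
    storage_read("row-1", usage)

usage = {}
for tag in ["mail", "docs", "mail"]:
    # each request runs in its own copied context, so tags don't leak
    ctx = contextvars.copy_context()
    ctx.run(handle_request, tag, usage)

print(usage)  # {'mail': 2, 'docs': 1}
```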
  • Attack of the Killer Microseconds: A new breed of low-latency I/O devices, ranging from faster datacenter networking to emerging non-volatile memories and accelerators, motivates greater interest in microsecond-scale latencies. Existing system optimizations targeting nanosecond- and millisecond-scale events are inadequate for events in the microsecond range. New techniques are needed to enable simple programs to achieve high performance when microsecond-scale latencies are involved, including new microarchitecture support. 
  • The fundamental issue with humanity is one person’s idea of utopia is another’s dystopia. We spend our lives navigating the resulting maelstrom of the conflict. This vision is not Humanity Unchained, it’s Humanity Limited by its own creation. Do you really want humanity limited by an AI enforcing all the “standard problems of political philosophy”? We would be stuck in the past rather than moving forward. Stephen Wolfram: But what will be possible with this? In a sense, human language was what launched civilization. What will computational language do? We can rethink almost everything: democracy that works by having everyone write a computational essay about what they want, that’s then fed to a big central AI—which inevitably has all the standard problems of political philosophy. New ways to think about what it means to do science, or to know things. Ways to organize and understand the civilization of the AIs. A big part of this is going to start with computational contracts and the idea of autonomous computation—a kind of strange merger of the world of natural law, human law, and computational law. Something anticipated three centuries ago by people like Leibniz—but finally becoming real today. Finally a world run with code.
  • A very well written Post-mortem and remediations for [Matrix] Apr 11 security incident. When you hear this in a movie you know exactly what happens next: What can we trust if not our own servers? And that’s what happened: If there is one lesson everyone should learn from this whole mess, it is: SSH agent forwarding is incredibly unsafe, and in general you should never use it. Not only can malicious code running on the server as that user (or root) hijack your credentials, but your credentials can in turn be used to access hosts behind your network perimeter which might otherwise be inaccessible. All it takes is someone to have snuck malicious code on your server waiting for you to log in with a forwarded agent, and boom, even if it was just a one-off ssh -A. TL;DR: keep your services patched; lock down SSH; partition your network; and there’s almost never a good reason to use SSH agent forwarding.
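The takeaway translates into two client-side settings: disable agent forwarding globally, and reach internal hosts with `ProxyJump`, which tunnels the authentication through the bastion without ever exposing your agent socket to it. An illustrative `~/.ssh/config` fragment (host names hypothetical):

```
# ~/.ssh/config
Host *
    ForwardAgent no                  # never expose the agent socket to remote hosts

Host internal-box
    ProxyJump bastion.example.com    # safer than `ssh -A bastion` then hopping onward
```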
  • You have a great IoT idea, but you just can’t get past the lack of an M2M network. Good news: AT&T went live with their NarrowBand Internet of Things (NB-IoT) network. Bodyport, for example, uses the LTE-M network to connect a smart scale that transmits patients’ cardiovascular data to remote care teams in near real-time. They’re working with suppliers to certify $5 modules that connect devices to NB-IoT, and pricing plans are available for as low as $5/year/device. You still need a revenue model, but at least it’s possible.
  • Meet programmers from a more civilized age. Brian Kernighan interviews Ken Thompson at Vintage Computer Festival East 2019
  • 6 new ways to reduce your AWS bill with little effort: AWS introduced AMD-powered EC2 instances that are 10% cheaper compared to the Intel-powered Instances. They provide the same resources (CPU, memory, network bandwidth) and run the same AMIs; Use VPC endpoints instead of NAT gateways; Convertible Reserved EC2 Instances – Saving potential: Additional 25% over On-Demand (assuming you can now go from 1-year terms to 3-year terms); EC2 Spot Instances – Saving potential: 70-90% over On-Demand; S3 Intelligent-Tiering.
  • Maybe this would work for how to communicate within a team? Mr. Rogers’ Nine Rules for Speaking to Children (1977): State the idea you wish to express as clearly as possible, and in terms preschoolers can understand; Rephrase in a positive manner; Rephrase your idea to eliminate all elements that could be considered prescriptive, directive, or instructive; Rephrase any element that suggests certainty; Rephrase your idea to eliminate any element that may not apply to all children; Add a simple motivational idea that gives preschoolers a reason to follow your advice; Rephrase your new statement, repeating the first step; Rephrase your idea a final time, relating it to some phase of development a preschooler can understand.
  • How are a blockchain and end-to-end encryption totally owned by Facebook any more private? Top 3 Takeaways from Facebook: Blockchain will be at the center of Facebook’s Strategy for their entire platform and payments; Building out infrastructure on the data-level to have security from the ground up; Code privacy and data use principles as first class concepts into the infrastructure. This is one of the primary use cases of using a Blockchain based network; I didn’t see anyone catch this sound bite from Mark, but he basically said that they are rewriting all of Facebook’s back-end code to be more user-centric, which is a distributed ledger and access control.
  • Do you want to go serverless and survive AWS region level outages? Here’s a huge amount of practical detail as well as tips and gotchas. Disaster Tolerance Patterns Using AWS Serverless Services: S3 resilience: Use versioning and cross region replication for S3 buckets; Use CloudFront origin failover for read access to replicated S3 buckets; DynamoDB resilience: Use global tables for DynamoDB tables; API Gateway and Lambda resilience: Use a regional API Gateway and associated Lambda functions in each region; Use Route 53 latency or failover routing with health checks in front of API Gateways; Cognito User Pools resilience: Create custom sync solution for now. 
  • The Story Behind an Instacart Order, Part 1: Building a Digital Catalog: Fun fact: While partners can send us inventory data at any point in the day, we receive most data dumps around 10 pm local time. Certain individual pieces of our system (like Postgres) weren’t configured to handle these 10 pm peak load times efficiently — we didn’t originally build with elastic scalability in mind. To solve this we began a “lift and shift” of our catalog infrastructure from our artisanal system to a SQL-based interface, running on top of a distributed system with inexpensive storage. We’ve decoupled compute from that storage, and in this new system, we rely on Airflow as our unified scheduler to orchestrate that work. Re-building our infrastructure now not only helps us deal with load times efficiently, it saves cost in the long run and ensures that we can make more updates every night to the catalog as we receive more product attributes and our data prioritization models evolve.
  • Some risks of coordinating only sometimes: Within AWS, we are starting to settle on some patterns that help constrain the behavior of systems in the worst case. One approach is to design systems that do a constant amount of coordination, independent of the offered workload or environmental factors. This is expensive, with the constant work frequently going to waste, but worth it for resilience. Another emerging approach is designing explicitly for blast radius, strongly limiting the ability of systems to coordinate or communicate beyond some limited radius. We also design for static stability, the ability for systems to continue to operate as best they can when they aren’t able to coordinate.
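The “constant amount of coordination” idea can be sketched as a control loop that does identical work every tick whether or not anything changed: no deltas, no change detection, so load can’t spike when the environment gets chaotic. A toy sketch (names hypothetical; real systems apply this to things like full scans of a configuration table):

```python
import time

def constant_work_loop(fetch_full_config, apply, ticks=3, interval=0.01):
    """Every tick, fetch and apply the ENTIRE configuration.

    Cost is the same on a quiet day as during a storm of updates, so
    overload cannot emerge from coordination under stress. The constant
    work is frequently wasted, which is the price paid for resilience.
    """
    for _ in range(ticks):
        config = fetch_full_config()   # same cost regardless of churn
        for key, value in config.items():
            apply(key, value)          # idempotent overwrite
        time.sleep(interval)

state = {}
constant_work_loop(lambda: {"route:a": 1, "route:b": 2},
                   lambda k, v: state.__setitem__(k, v))
```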
  • O’Reilly has branded something they call the Next Architecture: The growth we’ve seen on our online learning platform in cloud topics, in orchestration and container-related terms such as Kubernetes and Docker, and in microservices is part of a larger trend in how organizations plan, code, test, and deploy applications that we call the Next Architecture. This architecture allows fast, flexible deployment, feature flexibility, efficient use of programmer resources, and rapid adapting, including scaling, to unpredictable resource requirements. These are all goals businesses feel increasingly pressured to achieve to keep up with nimble competitors. There are four aspects of the Next Architecture: Decomposition, Cloud, Containers, and Orchestration. 
  • We’ve learned from running Azure Functions that only 30% of our executions are coming from HTTP events. KEDA: bringing event-driven containers and functions to Kubernetes: With the release of KEDA, any team can now deploy function apps created using those same [Microsoft] tools directly to Kubernetes. This allows you to run Azure Functions on-premises or alongside your other Kubernetes investments without compromising on the productive serverless development experience. The open source Azure Functions runtime is available to every team and organization, and brings a world-class developer experience and programming model to Kubernetes. The combination of flexible hosting options and an open source toolset gives teams more freedom and choice. If you choose to take advantage of the full benefits of a managed serverless service, you can shift responsibility and publish your apps to Azure Functions.
  • Why would you pick Fargate over Lambda? How Far Out is AWS Fargate: If I were to describe Fargate, I’d describe it as clusterless container orchestration. Much like “serverless” means an architecture where the server has been abstracted away…Lambda is an additional layer of abstraction where if your workload can be expressed as a function and complete its work in 15 minutes or less, then it’s a great choice, especially if your workload leans towards the sporadic. But if you need more control or the limits imposed by Lambda’s abstractions pose a problem for your workload, then Fargate is worth a close look. You don’t really need to choose one or the other as they very much complement each other…Fargate is inherently simpler than Kubernetes because it only does one thing: container orchestration. And it does this very well. Everything else is provided by an external AWS service.
  • You’ve heard of autonomous self-driving cars? How about autonomous databases? It’s a DBMS that can deploy, configure, and tune itself automatically without any human intervention. Advanced Database Systems 2019 #25: Self-Driving Database Engineering (slides): Personnel is ~50% of the TCO of a DBMS. Average DBA Salary (2017): $89,050. The scale and complexity of DBMS installations have surpassed humans. Replace DBMS components with ML models trained at runtime. True autonomous DBMSs are achievable in the next decade. You should think about how each new feature can be controlled by a machine.
  • Good explanation with a really cool white board. Latency Under Load: HBM2 vs. GDDR6. A traffic analogy is used: the more lanes you have to memory, the higher the bandwidth. Use HBM2 for the highest bandwidth and best power efficiency; the downside is that it’s harder to design with and costs more. GDDR6 is a compromise, giving great performance and a wide pathway to memory.
  • microsoft/Quantum: These samples demonstrate the use of the Quantum Development Kit for a variety of different quantum computing tasks. Most samples are provided as a Visual Studio 2017 C# or F# project under the QsharpSamples.sln solution.
  • Vanilla JS: a fast, lightweight, cross-platform framework for building incredible, powerful JavaScript applications.
  • sirixdb/sirix: facilitates effective and efficient storing and querying of your temporal data through snapshotting (only ever appends changed database pages) and a novel versioning approach called sliding snapshot, which versions at the node level. Currently we support the storage and querying of XML- and JSON-documents in our binary encoding. 
  • Book of Proceedings from Internet Identity Workshop 27. There’s a huge number of topics and 47 pages of notes. 
  • Gorilla: A Fast, Scalable, In-Memory Time Series Database: Gorilla optimizes for remaining highly available for writes and reads, even in the face of failures, at the expense of possibly dropping small amounts of data on the write path. To improve query efficiency, we aggressively leverage compression techniques such as delta-of-delta timestamps and XOR’d floating point values to reduce Gorilla’s storage footprint by 10x. This allows us to store Gorilla’s data in memory, reducing query latency by 73x and improving query throughput by 14x when compared to a traditional database (HBase)-backed time series data store. 
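The intuition behind delta-of-delta timestamps: with a regular scrape interval the second difference is almost always zero, and Gorilla stores a zero in a single bit. A sketch of just the arithmetic (the actual variable-length bit packing is omitted):

```python
def delta_of_delta(timestamps):
    """Second differences of a timestamp stream.

    With a regular collection interval, most values come out 0, which
    Gorilla encodes in one bit; irregular gaps cost a few more bits.
    """
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    return [b - a for a, b in zip(deltas, deltas[1:])]

# a 60-second scrape interval with one sample arriving a second late
ts = [1000, 1060, 1120, 1181, 1241]
print(delta_of_delta(ts))  # [0, 1, -1] -- mostly zeros, cheap to encode
```

The XOR trick for floating point values works on the same principle: successive samples rarely differ by much, so XORing them yields mostly zero bits.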
  • Uber created a site collecting all their research papers in one place. You might be surprised at all the topics they cover. 

from High Scalability

Stuff The Internet Says On Scalability For May 3rd, 2019

Stuff The Internet Says On Scalability For May 3rd, 2019

Wake up! It’s HighScalability time:

Event horizon? Nope. It’s a close-up of a security hologram. Makes one think.

 

Do you like this sort of Stuff? I’d greatly appreciate your support on Patreon. I wrote Explain the Cloud Like I’m 10 for people who need to understand the cloud. And who doesn’t these days? On Amazon it has 45 mostly 5 star reviews (105 on Goodreads). They’ll learn a lot and hold you in awe.

Number Stuff:

  • $1 trillion: Microsoft is the most valuable company in the world (for now)
  • 20%: global enterprises will have deployed serverless computing technologies by 2020
  • 390 million: paid Apple subscriptions, revenue from the services business climbed from $9.9 billion to $11.5 billion, services now account for “one-third” of the company’s gross profits
  • 1011: CubeSat missions
  • $326 billion: USA farm expenses in 2017
  • 61%: increase in average cyber attack losses, from $229,000 last year to $369,000 this year, a figure exceeding $700,000 for large firms versus just $162,000 in 2018.
  • $550: can yield 20x profit on the sale of compromised login credentials

Quotable Stuff:

  • Robert Lightfoot~ Protecting against risk and being safe are not the same thing. Risk is just simply a calculation of likelihood and consequence. Would we have ever launched Apollo in the environment we’re in today? Would Buzz and Neil have been able to go to the moon in the risk posture we live in today? Would we have launched the first shuttle with a crew? We must move from risk management to risk leadership. From a risk management perspective, the safest place to be is on the ground. From a risk leadership perspective, I believe that’s the worst place this nation can be.
  • Paul Kunert: In dollar terms, Jeff Bezos’s cloud services wing grew 41 per cent year on year to $7.6bn, figures from Canalys show. Microsoft was up 75 per cent to $3.4bn and Google grew a whopping 83 per cent to $2.3bn.
  • @codinghorror: 1999 “MIT – We estimate that the puzzle will require 35 years of continuous computation to solve” 2019 “🌎- LOL” https://www.csail.mit.edu/news/programmers-solve-mits-20-year-old-cryptographic-puzzle …
  • @dvassallo: TIL what EC2’s “Up to” means. I used to think it simply indicates best effort bandwidth, but apparently there’s a hard baseline bottleneck for most EC2 instance types (those with an “up to”). It’s significantly smaller than the rating, and it can be reached in just a few minutes. This stuff is so obscure that I bet 99% of Amazon SDEs that use EC2 daily inside Amazon don’t know about these limits. I only noticed this by accident when I was benchmarking S3 a couple of weeks ago
  • @Adron: 1997: startup requires about a million $ just to get physical infra setup for a few servers. 2007: one can finally run stuff online and kind of skip massive hardware acquisitions just to run a website. 2017: one can scale massively & get started for about $10 bucks of infra.
  • Wired: One executive described Nadella’s approach as “subtle shade.” He never explicitly eighty-sixed a division or cut down a product leader, but his underlying intentions were always clear. His first email to employees ran more than 1,000 words—and made no mention of Windows. He later renamed the cloud offering Microsoft Azure. “Satya doesn’t talk shit—he just started omitting ‘Windows’ from sentences,” this executive says. “Suddenly, everything from Satya was ‘cloud, cloud, cloud!’ ”
  • @ThreddyTheTrex: My approach to side projects has evolved. Beginning of my career: “I will build everything from scratch using C and manage my own memory and I don’t mind if it takes 3 years.” Now: “I will use existing software that takes no more than 15 minutes to configure.”
  • btown: The software wouldn’t have crashed if the user didn’t click these buttons in a weird order. The bug was only one factor in a chain of events that led to the segfault.
  • @Tjido: There are more than 10,000 data lakes in AWS. @strataconf  #datalakes #stratadata
  • Nicolas Kemper: Accretive projects are everywhere: Museums, universities, military bases – even neighborhoods and cities. Key to all accretive projects is that they house an institution, and key to all successful institutions is mission. Whereas scope is a detailed sense of both the destination and the journey, a mission must be flexible and adjust to maximum uncertainty across time. In the same way, an institution and a building are often an odd pair, because whereas the building is fixed and concrete, finished or unfinished, an institution evolves and its work is never finished.
  • @markmadsen: Your location-identified tweets plus those of two friends on twitter predict your location to within 100m 77% of the time. Location data is PII and must be treated as such #StrataData
  • Backblaze: The Annualized Failure Rate (AFR) for Q1 is 1.56%. That’s as high as the quarterly rate has been since Q4 2017 and it’s part of an overall upward trend we’ve seen in the quarterly failure rates over the last few quarters. Let’s take a closer look. 
  • Theron Mohamed: Google’s advertising revenue rose by 15% to $30.72 billion, a sharp slowdown from 24% growth a year ago, according to its earnings report for the first quarter of 2019. Paid clicks rose 39%, a significant decrease from 59% year-on-year growth in the first quarter of 2018. Cost-per-click also fell 19%, after sliding 19% in the same period of 2018.
  • @ajaynairthinks: “It was what we know to do so it was faster” -> this is the key challenge. Right now, the familiar path is not easy/effective in the long term, and the effective path is not familiar in the short term. We need make this gap visible, and we need to make the easy things familiar.
  • @NinjaEconomics: “For the first time ever there are now more people in the world older than 65 than younger than 5.”
  • Filipe Oliveira: with the new AWS public cloud C5n Instances designed for compute-heavy applications and to deliver performance that is just about indistinguishable from bare metal, with 100 Gbps Networking along with a higher ceiling on packets per second, we should be able to deliver at least the same 50 million operations per second below 1 millisecond with fewer VM nodes
  • Nima Khajehnouri: Snap’s monetization algorithms have the single biggest impact to our advertisers and shareholders
  • Carmen Bambach: He is an artist of his time and one that transcends his time. He is very ambitious. It’s important to remember that although Leonardo was a “disciple of experience,” as he called himself, he is also paying great attention to the sources of his time. After having devoured and looked at and bought many books, he realizes he can do better. He really wants to write books, but it’s a very steep learning curve. The way we should look at his notebooks and the manuscripts is that they are essentially the raw material for what he had intended to produce as treatises. His great contribution is being able to visualize knowledge in a way that had not been done before. 
  • Charlie Demerjian: The latest Intel roadmap leak blows a gaping hole in Intel’s 10nm messaging. SemiAccurate has said all along that the process would never work right and this latest info shows that 10nm should have never been released.
  • @mipsytipsy: Abuse and misery pile up when you are building and running large software systems without understanding them, without good feedback loops. Feedback loops are not a punishment. They mature you into a wise elder engineer.  They give you agency, mastery, autonomy, direction. And that is why software engineers, management, and ops engineers should all feel personally invested in empowering software engineers to own their own code in production.
  • Skip: Serverless has made it possible to scale Skip with a small team of engineers. It’s also given us a programming model that lets us tackle complexity early on, and gives us the ability to view our platform as a set of fine-grained services we can spread across agile teams.
  • seanwilson: Imagine having to install Trello, Google Docs, Slack etc. manually everywhere you wanted to use it, deal with updates yourself and ask people you wanted to collaborate with to do the same. That makes no sense in terms of ease of use.
  • Darryl Campbell: The slick PR campaign masked a design and production process that was stretched to the breaking point. Designers pushed out blueprints at double their normal pace, often sending incorrect or incomplete schematics to the factory floor. Software engineers had to settle for re-creating 40-year-old analog instruments in digital formats, rather than innovating and improving upon them. This was all done for the sake of keeping the Max within the constraints of its common type certificate.
  • Stripe: We have seen such promising results from our remote engineers that we are greatly increasing our investment in remote engineering. We are formalizing our Remote engineering hub. It is coequal with our physical hubs, and will benefit from some of our experience in scaling engineering organizations.
  • Joel Hruska: According to Intel in its Q1 2019 conference call, NAND price declines were a drag on its earnings, falling nearly twice the expected amount. This boom and bust cycle is common in the DRAM industry, where it drove multiple players to exit the market over the past 18 years. This is one reason we’re effectively down to just three DRAM manufacturers — Samsung, SK Hynix, and Micron. There are still a few more players in the NAND market, though we’ve seen consolidations there as well.
  • Alastair Edwards: The cloud infrastructure market is moving into a new phase of hybrid IT adoption, with businesses demanding cloud services that can be more easily integrated with their on-premises environment. Most cloud providers are now looking at ways to enter customers’ existing data centres, either through their own products or via partnerships
  • Paul Johnston: And yes I can absolutely see how the above company could have done this whole solution better as a Serverless solution but they don’t have the money for rearchitecting their back end (I don’t imagine) and what would be the value anyway? It’s up and running, with paying clients. The value at this point doesn’t seem valuable. Additional features may be a good fit for a Serverless approach, but not the whole thing if it’s all working. The pain of migrating to a new backend database, the pain of server migrations even at this level of simplicity, the pain of having to coordinate with other teams on something that seems so trivial, but never is that trivial has been really hard.
  • @rseroter: In serverless … Functions are not the point. Managed services are not the point. Ops is not the point. Cost is not the point. Technology is not the point. The point is focus on customer value. @ben11kehoe laying it all out. #deliveragile2019
  • @jessitron: Serverless is a direction, not a destination. There is no end state. @ben11kehoe  Keep moving technical details out of the team’s focus, in favor of customer value. #deliverAgile 
  • @jessitron: When we rush development, skip tests and refactoring, we get “Escalating Risk.” Please give up the “technical debt” description; it gives businesspeople a very wrong impression of the tradeoffs. From Janellekz #deliverAgile
  • @ben11kehoe: Good points in here about event-driven architectures. I do think the “bounded context” notions from microservices are still applicable, and that we don’t have good enough tools for establishing contracts for events and dynamic routing for #serverless yet.
  • Riot Games: We use MapReduce, a common cluster computing model, to calculate data in a distributed fashion. Below is an example of how we calculate the cosine similarity metric – user data is mapped to nodes, the item-item metric is calculated for each user, and the results are shuffled and sent to a common node so they can be aggregated together in the reduce stage. It takes approximately 1000 compute hours to carry out the entire offer generation process, from snapshotting data to running all of the distributed algorithms. That’s 50 machines running for 20 hours each.
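The map/shuffle/reduce stages described above fit in a few lines once you strip away the cluster. A toy single-process sketch of item-item cosine similarity (item names and the binary “owns the item” simplification are mine, not Riot’s):

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt

# users' item sets (binary ownership keeps the arithmetic simple)
users = {
    "u1": {"skinA", "skinB"},
    "u2": {"skinA", "skinB", "skinC"},
    "u3": {"skinA", "skinC"},
}

# MAP: each user emits a partial dot product for every item pair they co-own
partials = defaultdict(int)
for items in users.values():
    for a, b in combinations(sorted(items), 2):
        partials[(a, b)] += 1

# (the SHUFFLE is implicit here: the dict groups partials by pair key)

# REDUCE: normalize co-occurrence counts by each item's vector norm
counts = defaultdict(int)
for items in users.values():
    for item in items:
        counts[item] += 1

cosine = {pair: dot / (sqrt(counts[pair[0]]) * sqrt(counts[pair[1]]))
          for pair, dot in partials.items()}

print(round(cosine[("skinA", "skinB")], 3))  # 0.816
```

At Riot’s scale the map and reduce steps run on separate nodes, which is where the ~1000 compute hours go.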
  • Will Knight: Sze’s hardware is more efficient partly because it physically reduces the bottleneck between where data is stored and where it’s analyzed, but also because it uses clever schemes for reusing data. Before joining MIT, Sze pioneered this approach for improving the efficiency of video compression while at Texas Instruments.
  • Hersleb hypothesis~ coding is a socio-technical process where code and humans interact. According to what we call the Hersleb hypothesis, the following anti-pattern is a strong predictor for defects: • If two code sections communicate… • But the programmers of those two sections do not… • Then that code section is more likely to be buggy
  • Joel Hruska: But the adoption of chiplets is also the engineering acknowledgment of constraints that didn’t used to exist. We didn’t used to need chiplets. When companies like TSMC publicly predict that their 5nm node will deliver much smaller performance and power improvements than previous nodes did, it’s partly a tacit admission that the improvements engineers have gotten used to delivering from process nodes will now have to be gained in a different fashion. No one is particularly sure how to do this, and analyses of how effectively engineers boost performance without additional transistors to throw at the problem have not been optimistic.
  • Bryan Meyers: To some extent I think we should view chiplets as a stop-gap until other innovations come along. They solve the immediate problems of poor yields and reticle limits in exchange for a slight increase in integration complexity, while opening the door to more easily integrating application-specific accelerators cost-effectively. But it’s also not likely that CPU sockets will get much larger. We’ll probably hit the limit of density when chiplet-based SoC’s start using as much power as high-end GPUs. So really we’re waiting on better interconnects (e.g. photonics or wireless NoC) or 3D integration to push much farther. Both of which I think are still at least a decade away.
  • Olsavsky: And that will be a constant battle between growth, geographic expansion in AWS, and also efficiencies to limit how much we actually need. I think we are also getting much better at adding capacity faster, so there is less need to build it six to twelve months in advance.
  • Malith Jayasinghe: We noticed that a non-blocking system was able to handle a large number of concurrent users while achieving higher throughput and lower latency with a small number of threads. We then looked at how the number of processing threads impacts the performance. We noticed a minimal impact on throughput and average latency on the number of threads. However, as the number of threads increases, we see a significant increase in the tail latencies (i.e. latency percentiles) and load average.
  • Paul Berthaux: We [Algolia] run multiple LBs for resiliency – the LB selection is made through round robin DNS. For now this is fine, as the LBs are performing very simple tasks in comparison to our search API servers, so we do not need an even load balancing across them. That said, we have some very long term plans to move from round-robin DNS to something based on Anycast routing. The detection of upstream failures as well as retries toward different upstreams is embedded inside NGINX/OpenResty. I use the log_by_lua directive from OpenResty with some custom Lua code to count the failures and trigger the removal of the failing upstream from the active Redis entry and alert the lb-helper after 10 failures in a row. I set up this failure threshold to avoid lots of unnecessary events in case of short self-resolving incidents like occasional packet loss. From there the lb-helper will probe the failing upstream FQDN and put it back in Redis once it recovers.
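The consecutive-failure rule is worth sketching because the reset matters: only an unbroken streak trips eviction, so occasional packet loss never evicts a healthy upstream. A toy model of the counter logic (class and names hypothetical; the real implementation is Lua inside OpenResty):

```python
class UpstreamHealth:
    """Evict an upstream only after N consecutive failures."""

    def __init__(self, threshold=10):
        self.threshold = threshold
        self.consecutive_failures = 0
        self.evicted = False

    def record(self, success):
        if success:
            self.consecutive_failures = 0   # any success resets the streak
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.threshold:
                self.evicted = True         # remove from the active entry, alert

u = UpstreamHealth(threshold=3)
for ok in [False, False, True, False, False, False]:
    u.record(ok)
# the single success broke the first streak; the final three trip eviction
```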

Useful Stuff:

from High Scalability