Tag: Performance

Sponsored Post: Etleap, PerfOps, InMemory.Net, Triplebyte, Stream, Scalyr

Who’s Hiring? 

  • Triplebyte lets exceptional software engineers skip screening steps at hundreds of top tech companies like Apple, Dropbox, Mixpanel, and Instacart. Make your job search O(1), not O(n). Apply here.
  • Need excellent people? Advertise your job here! 

Fun and Informative Events

  • Advertise your event here!

Cool Products and Services

  • For heads of IT/Engineering responsible for building an analytics infrastructure, Etleap is an ETL solution for creating perfect data pipelines from day one. Unlike older enterprise solutions, Etleap doesn’t require extensive engineering work to set up, maintain, and scale. It automates most ETL setup and maintenance work, and simplifies the rest into 10-minute tasks that analysts can own. Read stories from customers like Okta and PagerDuty, or try Etleap yourself.
  • PerfOps is a data platform that digests real-time performance data for CDN and DNS providers as measured by real users worldwide. Leverage this data across your monitoring efforts and integrate with PerfOps’ other tools such as Alerts, Health Monitors and FlexBalancer – a smart approach to load balancing. FlexBalancer makes it easy to manage traffic between multiple CDN providers, APIs, databases or any custom endpoint, helping you achieve better performance, ensure the availability of services and reduce vendor costs. Creating an account is free and provides access to the full PerfOps platform.
  • InMemory.Net provides a .NET native in-memory database for analysing large amounts of data. It runs natively on .NET and provides native .NET, COM & ODBC APIs for integration. It also has an easy-to-use language for importing data, and supports standard SQL for querying data. http://InMemory.Net
  • Build, scale and personalize your news feeds and activity streams with getstream.io. Try the API now in this 5 minute interactive tutorial. Stream is free up to 3 million feed updates so it’s easy to get started. Client libraries are available for Node, Ruby, Python, PHP, Go, Java and .NET. Stream is currently also hiring DevOps and Python/Go developers in Amsterdam. More than 400 companies rely on Stream for their production feed infrastructure; this includes apps with 30 million users. With your help we’d like to add a few zeros to that number. Check out the job opening on AngelList.
  • Scalyr is a lightning-fast log management and operational data platform. It’s a tool (actually, multiple tools) that your entire team will love. Get visibility into your production issues without juggling multiple tabs and different services — all of your logs, server metrics and alerts are in your browser and at your fingertips. Loved and used by teams at Codecademy, ReturnPath, Grab, and InsideSales. Learn more today or see why Scalyr is a great alternative to Splunk.
  • Advertise your product or service here!

If you are interested in a sponsored post for an event, job, or product, please contact us for more information.


Make Your Job Search O(1) — not O(n)

Triplebyte is unique because they’re a team of engineers running their own centralized technical assessment. Companies like Apple, Dropbox, Mixpanel, and Instacart now let Triplebyte-recommended engineers skip their own screening steps.

We found that High Scalability readers are about 80% more likely to be in the top bracket of engineering skill.

Take Triplebyte’s multiple-choice quiz (system design and coding questions) to see if they can help you scale your career faster.


The Solution to Your Operational Diagnostics Woes

Scalyr gives you instant visibility of your production systems, helping you turn chaotic logs and system metrics into actionable data at interactive speeds. Don’t be limited by the slow and narrow capabilities of traditional log monitoring tools. View and analyze all your logs and system metrics from multiple sources in one place. Get enterprise-grade functionality with sane pricing and insane performance. Learn more today


Gone Fishin’

Well, not exactly Fishin’, but I’ll be on a month long vacation starting today. I won’t be posting new content, so we’ll all have a break. Disappointing, I know. Please use this time for quiet contemplation and other inappropriate activities.

If you really need a not-so-quick fix, there’s always the back catalog of Stuff the Internet Says. Odds are there’s a lot you didn’t read—yet.

Stuff The Internet Says On Scalability For May 10th, 2019

Wake up! It’s HighScalability time:


Deep-sky mosaic, created from nearly 7,500 individual exposures, provides a wide portrait of the distant universe, containing 265,000 galaxies that stretch back through 13.3 billion years of time to just 500 million years after the big bang. (hubblesite)

Do you like this sort of Stuff? I’d greatly appreciate your support on Patreon. I wrote Explain the Cloud Like I’m 10 for people who need to understand the cloud. And who doesn’t these days? On Amazon it has 45 mostly 5 star reviews (107 on Goodreads). They’ll learn a lot and hold you in awe.

Number Stuff:

  • 36%: of the world touches a Facebook app every month, 2 years over a lifetime
  • $84.4: average yearly Facebook ad revenue per user in North America
  • 1%: performers raked in 60% of all concert-ticket revenue world-wide in 2017—more than double their share in 1982
  • 175 zettabytes: size of the datasphere in 2025, up 5x from 2018, 49% stored in public clouds
  • 45.9: Amazon’s percentage of U.S. online retail growth in 2018 and 20.8% of total U.S. retail sales growth
  • $4.5B: Apple’s make it all go away payment to Qualcomm
  • 64 nanowatts: per square meter energy harvested from sky battery
  • 18%: YoY drop in smartphone sales
  • 10x: size of software markets and businesses compared to 10-15 years ago, largely due to the liquidity provided by the global internet
  • 33: age of the average American gamer, who prefers to play on their smartphone and is spending 20 percent more than a year ago and 85 percent more than in 2015; the $43.4 billion spent in 2018 was mostly on content
  • 336: average lifespan of a civilization in years
  • 2/3rds: drop in M&A spending in April
  • 74%: SMBs said they “definitely would pay ransom at almost any price” to get their data back or prevent it from being stolen
  • 2.5B: devices powered by Android
  • 40%: use a hybrid cloud infrastructure
  • 50%: drop in 2019 hard disk sales, due to a combination of general market weaknesses and the transition of notebooks to SSDs
  • 2.5 million: total view count for the final Madden NFL 19 Bowl match
  • 40%: Amazon merchants based in China

Quotable Stuff:

  • @mjpt777: APIs to IO need to be asynchronous and support batching otherwise the latency of calls dominate throughput and latency profile under burst conditions. Languages need to evolve to better support asynchronous interfaces and have state machine support, not try to paper over the obvious issues with synchronous APIs. Not everyone needs high performance but the blatant waste  and energy consumption of our industry cannot continue.
  • Guido van Rossum: I did not enjoy it at all when the central developers were sending me hints on Twitter questioning my authority and the wisdom of my decisions, instead of telling me to my face and having an honest debate about things.
  • Isobel Cockerell: A kind of WeChat code had developed through emoji: A half-fallen rose meant someone had been arrested. A dark moon, they had gone to the camps. A sun emoji—“I am alive.” A flower—“I have been released.”
  • @scottsantens: Australian company shifts to 4-day week with every Weds off and no decrease in pay. Result? 46% more revenue, a tripling of profits, and happier employees taking fewer sick days. Also Thurs are now much more productive. We work too much.
  • Twitter: Across the six Twitter Rules policy categories included in this report, 16,388 accounts were reported by known government entities compared to 5,461 reported during the last reported period, an increase of 17%. 
  • Michael Sheetz: The Blue Moon lander can bring 3.6 metric tons to the lunar surface, according to Bezos. Bezos also unveiled the company’s BE-7 rocket engine at the event. The engine will be test fired for the first time this summer, Bezos said. It’s largely made of “printed” parts, he added. “We need the new engine and that’s what this is,” Bezos said.
  • Umich: Called MORPHEUS, the chip blocks potential attacks by encrypting and randomly reshuffling key bits of its own code and data 20 times per second—infinitely faster than a human hacker can work and thousands of times faster than even the fastest electronic hacking techniques. With MORPHEUS, even if a hacker finds a bug, the information needed to exploit it vanishes 50 milliseconds later. It’s perhaps the closest thing to a future-proof secure system.
  • Sean Illing: In some ways our dependence on the phone also makes us less independent. Americans always celebrate self-reliance as a value, but it’s very clear we don’t — even for a moment — want to be by ourselves or on our own any longer. I have mixed feelings about the whole mythology of self-reliance. But certainly, while the myth that we’re self-reliant lives on, our ability to be alone seems to be going by the wayside.
  • DSHR: If University libraries/archives spent 1% of their acquisitions budget on Web archiving, they could expand their preserved historical Web records by a multiple of 20x.
  • Alexander Rose: Probably a third of the organizations or the companies over 500 or 1,000 years old are all in some way in wine, beer, or sake production.
  • @benedictevans: Idle observation: 2/3 to 3/4 of Google and Facebook’s ad business is from companies that never bought print advertising other than Yellow Pages. And a lot of what was in print went elsewhere.
  • @stevecheney: There is so much asymmetry in the Valley it cracks me up… Distributed teams are not a new trend — they are just downstream to VCs when fundraising series A/B. We built a 50 person distributed co after YC late 2013. And are 4x more capital efficient because of it.
  • digitalcommerce360: retailers ranked Nos. 401-500 this year grew their collective web revenue by 24.3% in 2018 over 2017, faster than the 20.0% growth of Amazon, and well above the 14.1% year-over-year ecommerce growth in North America.
  • Nikita: So why was AMP needed? Well, basically Google needed to lock content providers to be served through Google Search. But they needed a good cover story for that. And they chose to promote it as a performance solution.
  • c2h5oh: Name one high profile whistleblower in the USA in the last 30 years who has not had his entire life upturned or, more often, straight up ruined as a direct result of his high moral standards. Nobody working at Boeing or the FAA right now has witnessed one during their lifetime – all they saw were cautionary tales.
  • Logan Engstrom et al.: In summary, both robust and non-robust features are predictive on the training set, but only non-robust features will yield generalization to the original test set. Thus, the fact that models trained on this dataset actually generalize to the standard test set indicates that (a) non-robust features exist and are sufficient for good generalization, and (b) deep neural networks indeed rely on these non-robust features, even in the presence of predictive robust features.
  • Andy Greenberg: SaboTor also underscored an aggressive new approach to law enforcement’s dark-web operations: The agents from the Joint Criminal Opioid Darknet Enforcement team that carried it out—from the FBI, Homeland Security Investigations, Drug Enforcement Administration, Postal Service, Customs and Border Protection, and Department of Defense—now all sit together in one room of the FBI’s Washington headquarters. They’ve been dedicated full-time to following the trail of dark-web suspects, from tracing their physical package deliveries to following the trail of payments on Bitcoin’s blockchain.
  • Dharmesh Thakker: The future of open source is in the cloud, and the future of cloud is heavily influenced by open source. Going forward, I believe the diamond standard in infrastructure software will be building a legendary open-source brand that is adopted by thousands of users, and then delivering a cloud-native, full-service experience to commercialize it. Along the way, non- open-source companies that use cloud “time-to-value” effectively, as well as hybrid open-source solutions delivered on multi-cloud and on-premise systems, will continue to thrive. This is the new OpenCloud paradigm, and I am excited about the hundreds of transformational companies that will be formed in the coming years to take advantage of it.
  • RcouF1uZ4gsC: There seems to be a trend of people making a lot of money designing/building stuff that erodes privacy and ethics and then leaving the company where they made that money and talking about privacy and ethics. Take for example Justin Rosenstein who invented the Like button.
  • A. Nonymous: On the fateful day, a switch crashed. The crash condition resulted in a repeated sequence of frames being sent at full wire speed. The repeated frames included broadcast traffic in the management VLAN, so every control-plane CPU had to process them. Network infrastructure CPUs at 100% all over the data center including core switches, routing adjacencies down, etc. The entire facility could not process for ~3.5 hours. No stretched L2, so damage was contained to a single site. This was a reasonably well-managed site, but had some dumb design choices. Highly bridged networks don’t tolerate dumb design choices.
  • Kevin Fogarty: Despite the moniker, 5G is more of a statement of direction than a single technology. The sub-6GHz version, which is what is being rolled out today, is more like 4.5G. Signal attenuation is modest, and these devices behave much like cell phones today. But when millimeter wave technology begins rolling out—current projections are 2021 or 2022—everything changes significantly. This slice of the spectrum is so sensitive that it can be blocked by clothing, skin, windows, and sometimes even fog.
  • DSHR: Why did “cloud service providers” have an “inventory build-up during calendar 2018”? Because the demand for storage from their customers was even further from insatiable than the drive vendors expected. Even the experts fall victim to the “insatiable demand” myth.
  • Eric Budish: In particular, the model suggests that Bitcoin would be majority attacked if it became sufficiently economically important — e.g., if it became a “store of value” akin to gold — which suggests that there are intrinsic economic limits to how economically important it can become in the first place.
  • Kalev Leetaru: In fact, much of the bias of deep learning comes from the reliance of the AI community on free data rather than paying to create minimally biased data. Putting this all together, as data science matures it must become far more like the hard sciences, especially a willingness to expend the resources to collect new data and ask the hard questions, rather than its current usage of merely lending a veneer of credibility to preordained conclusions.
  • Joel Hruska: AMD picked up one percentage point of unit share in the overall x86 market in Q1 2019 compared with the previous quarter and 4.7 percentage points of market share compared with Q1 2018. This means AMD increased its market share by 1.54x in just one year — a substantial improvement for any company.
  • @awsgeek: <- Meet the latest AWS Lambda Layers fanboy. I love how I can now move common dependencies into shared layers & reduce Lambda package sizes, which allows me to continue developing & debugging functions in the Lambda console. Yes, I love VIM, but I’m still a sucker for a GUI!
  • Chen: It’s not hard to believe that someone, maybe an employee, could be convinced to add a rogue element, a tiny little capacitor or something, to a board. There was a bug we heard about that looked like a generic Ethernet jack, and it worked like one, but it had some additional cables. The socket itself is the Trojan and the relevant piece is inside the case, so it’s hard to see.
  • @aallan: “The future is web apps, and has been since Steve Jobs told the WWDC audience in 2007 he had a ‘sweet solution’ to their desire to put apps on the iPhone – the web! – and was greeted by the stoniest of silences…” Yup, this! The future is never web apps.
  • @tmclaughbos: The biggest divide in the ops community isn’t Old v. DevOps v. SRE, k8s, v. serverless, or whatever. It is “How do I run infrastructure?” v. “How do I not run infrastructure?”.
  • @PaulDJohnston: The serverless shift * From Code to Configuration * From High LoC towards Low LoC (preferably zero) * From Building Services to Consuming Services * From Owning Workloads to Disowning Workloads
  • Vishal Gurbuxani: Facebook is not a social network anymore. It is a completely re-written internet, where a consumer spends their time, money, attention to buy products/services for their day-to-day life, as well as connect with people/groups, etc. I hope we can all take a stand and realize that our humanity is being lost by Facebook, when they choose to use algorithms to police 2.7 billion people.
  • @mattklein123: The thing I find most ironic about the C++ is dead narrative is that C++ IS one of the (several) reasons that Envoy has blown up. Google/Apple would not have touched Envoy if not C++, and their support was critical in early 2017, lending both expertise and resources. 1/ Winning in OSS is about product market fit, “hiring” contributors, and community building. The bigger the community, the more expertise and the more production usage, and this creates a compounding virtuous cycle. 2/
  • @clintsharp: It is nearly impossible to imbue an algorithm with *judgement*. And ultimately, we are paying operators of complex systems for their judgement. When to page out, when to escalate, when to bring in the developers. No algorithm is going to solve that for you.
  • Rudraksh Tuwani et al.: Our analysis reveals copy-mutation as a plausible mechanism of culinary evolution. As the world copes with the challenges of diet-linked disorders, knowledge of the key determinants of culinary evolution can drive the creation of novel recipe generation algorithms aimed at dietary interventions for better nutrition and health.
  • ellius: After fixing a recent bug, I asked my client company what if any postmortem process they had. I informally noted about 8 factors that had driven the resolution time to ~8 hours from what probably could have been 1 or 2. Some of them were things we had no control over, but a good 4-5 were things in the application team’s immediate control or within its orbit. These are issues that will definitely recur in troubleshooting future bugs, and doing a proper postmortem could easily save 250+ man hours over the course of a year. What’s more, fixing some of these issues would also aid in application development. So you’re looking at immediate cost savings
  • MITTR: The limiting factor for new machines is no longer the hardware but the power available to keep them humming. The Summit machine already requires a 14-megawatt power supply. That’s enough to light up an entire a medium-sized town. “To scale such a system by 10x would require 140 MW of power, which would be prohibitively expensive,” say Villalonga and co. By contrast, quantum computers are frugal. Their main power requirement is the cooling for superconducting components. So a 72-qubit computer like Google’s Bristlecone, for example, requires about 14 kw.  “Even as qubit systems scale up, this amount is unlikely to significantly grow,” say Villalonga and co.
  • Patient0: In the ~15 years I spent building software in C++ I don’t recall a single time that I wished for garbage collection. By using RAII techniques, it was always possible (nay, easy!) to write code that cleaned up after itself automatically. I always found it easier to reason about and debug programs because I knew when something was supposed to be freed, and in which thread. In contrast, in the ~10 years I spent working in Java, I frequently ran into problems with programs which needed an excessive amount of memory to run. I spent countless frustrating hours debugging and “tuning” the JVM to try to reduce the excessive memory footprint (never entirely successfully). Garbage collection is an oversold hack – I concede there are probably some situations where it is useful – but it has never lived up to the claims people made about it, especially with respect to it supposedly increasing developer productivity.
  • dan.j.newhouse: I just went through migrating our production machines from m5 and r5 instances to z1d over the last couple months. I’m a big fan of the z1d instance family now. Where I work, our workloads are very heavy on CPU, in addition to wanting a good chunk of RAM (what database doesn’t, though?). The m5 and r5 instances don’t cut it in the CPU department, and the c5 family is just poor RAM per dollar. While this blog post is highlighting the CPU, the z1d also has the instance storage NVMe ssd (as does the r5d). Set the database service to automatic delayed start, and toss TempDB on that disk. That local NVMe ssd is great in multiple ways. First, it’s already included in the price of the EC2 instance. Secondly, I’ve seen throughput in the neighborhood of 750 MB/s of against it (YMMV). Considering the cost of a io1 volume with a good chunk of provisioned IOPS is NOT cheap, plus you need an instance large enough to support that level of throughput to EBS in the first place, this is a big deal. If you’ve got the gp2-blues, with TempDB performing poorly, or worse, even experiencing buffer latch wait timeouts for our good buddy database ID 2, making a change to a z1d (or r5d, if you don’t need that CPU) to leverage that local ssd is really something to consider.
  • _nothing: My opinion is different nowadays. Instagram is surely a place where exists a lot of true beauty and expression, but it’s also a place full of people largely driven by societal and monetary reward, to an extent that I’ve come to consider unhealthy. We are being influenced, and we are influencing. And we like that– my social brain wants to know what society considers beautiful, it enjoys training itself on what society considers beautiful, it wants to be affirmed in the beauty of its own body. Instagram gave me exactly what I wanted. But (and I don’t mean to criticize anyone here at all, considering I was and still am subject to the same pressures and influences) I don’t think what I thought I wanted was healthy. I want to be happy. A constant stream of corgi videos and bikini photos and travel porn gives me little ups, but it also shapes my brain in ways I think could be damaging.

Useful Stuff:

  • Another example of specialization being the key to scalability and efficiency. It’s fascinating to see all the knobs Dropbox can tune because they do one thing well. How we optimized Magic Pocket for cold storage
    • kdkeyser: This article is about single-region storage vs. multi-region storage (and how to reduce the cost in this case). There is very little public info available about distributed storage systems in multi-region setup with significant latency between the sites.
    • preslavle: In our approach the additional codebase for cold storage is extremely small relative to the entire Magic Pocket codebase and importantly does not mutate any data in the live write path: data is written to the warm storage system and then asynchronously migrated to the cold storage system. This provides us an opportunity to hold data in both systems simultaneously during the transition and run extensive validation tests before removing data from the warm system. We use the exact same storage zones and codebase for storing each cold storage fragment as we use for storing each block in the warm data store. It’s the same system storing the data, just for a fragment instead of a block. In this respect we still have multi-zone protections since each fragment is stored in multiple zones.
    • Over 40% of all file retrievals in Dropbox are for data uploaded in the last day, over 70% for data uploaded in the last month, and over 90% for data uploaded in the last year. Dropbox has unpredictable delete patterns so we needed some process to reclaim space when one of the blocks gets deleted.
    • This system is already designed for a fairly cold workload. It uses spinning disks, which have the advantage of being cheap, durable, and relatively high-bandwidth. We save the solid-state drives (SSDs) for our databases and caches. Magic Pocket also uses different data encodings as files age. When we first upload a file to Magic Pocket we use n-way replication across a relatively large number of storage nodes, but then later encode older data in a more efficient erasure coded format in the background
    • Dropbox’s network stack is already heavily optimized for transferring large blocks of data over long distances. We have a highly tuned network stack and gRPC-based RPC framework, called Courier, that is multiplexing requests over HTTP/2 transport. This all results in warm TCP connections with a large window size that allows us to transfer a multi-megabyte block of data with a single round-trip.
    • One beautiful property of the cold storage tier is that it’s always exercising the worst-case scenario. There is no plan A and plan B. Regardless of whether a region is down or not, retrieving data always requires a reconstruction from multiple fragments. Unlike our previous designs or even the warm tier, a region outage does not result in major shifts in traffic or increase of disk I/O in the surviving regions. This made us less worried about hitting unexpected capacity limits during emergency failover at peak hours.
  • If you come to Silicon Valley, should you work for a consumer-oriented company or an enterprise SaaS company? The answer for a long time has been to target the consumer space. Consumer has been sexy, where the innovation is, where a new business model can win, where opportunity can be found. Exponent Episode 170 — A Perfect Meal makes the case that this is no longer true. Consumer and enterprise SaaS have switched roles. Consumer is now controlled by monopolies. It’s hard for a new entrant to gain a foothold in the consumer space. Enterprise SaaS is where a new product can win on merit. The examples given are Zoom and Slack. Both Zoom and Slack have won because they are better than their competitors. Can you say the same about many recent consumer products? The change is driven by the same trends we’ve seen drive the consumer market. Bring your own device in the enterprise has made users more of a driving force in deciding what software an enterprise adopts. The role of the gatekeeper has diminished. You only need to convince an individual employee at a company to give your product a try, which is perfect for software as a service. Anyone at a company can sign up for a SaaS product at no risk. A sales team doesn’t have to build relationships to drive sales. Employees drive adoption. Once in a company, especially if your product has a viral component, you can land and expand sales, something both Zoom and Slack have mastered. This drives down customer acquisition costs dramatically. Once you have an individual on board, that individual can infect others. And your sales team, after seeing a company has a number of users, can call that company to try and get the entire company on board. The pitch can be that you’re relieving pain by offering a managed service for the entire company instead of the pain of each team managing a service for themselves. If you want to build a product where the best product wins then enterprise is the new sexy. The competitive dynamics in the enterprise reward being the better company in a way that consumer no longer does.
  • A radical rethinking of the stack. Fast key-value stores: An idea whose time has come and gone: We argue that the time of the RInK [Remote, in-memory key-value] store has come and gone: their domain-independent APIs (e.g., PUT/GET) push complexity back to the application, leading to extra (un)marshalling overheads and network hops. Instead, data center services should be built using stateful application servers or custom in-memory stores with domain-specific APIs, which offer higher performance than RInKs at lower cost.
    • SerDes is always a huge waste: in ProtoCache prior to its rearchitecture, 27% of latency was due to (un)marshalling. In our experiments (Section 3), (un)marshalling accounts for more than 85% of CPU usage. We also found (un)marshalling a 1KB protocol buffer to cost over 10us, with all data in the L1 cache. A third-party benchmark [5] shows that other popular serialization formats (e.g., Thrift [27]) are equally slow
    • Extra network hops also have a cost:  prior to its rearchitecture, ProtoCache incurred an 80 ms latency penalty simply to transfer large records from a remote store, despite a high speed network.
    • What they want instead: Stateful application servers couple full application logic with a cache of in-memory state linked into the same process (Fig. 2b). This architecture effectively merges the RInK with the application server; it is feasible when a RInK is only accessed by a single application and all requests access a single key. Latency is 29% to 57% better (at the median), with relative improvement increasing with object size. (A minimal sketch of the stateful-server idea follows this list.)
    • This is really a back to the future model of services. Services were stateful at one time. Then we went stateless to scale and added in caches to mitigate the performance penalty for separating state from logic. It would be interesting to see something like Lambda distribute stateful actors instead of functions.
    • Good discussion on HN.
  • We don’t need no stinkin’ OS. But of course the interfaces they talk about are really just another OS. I/O Is Faster Than the CPU – Let’s Partition Resources and Eliminate (Most) OS Abstractions: I/O is getting faster in servers that have fast programmable NICs and non-volatile main memory operating close to the speed of DRAM, but single-threaded CPU speeds have stagnated. Applications cannot take advantage of modern hardware capabilities when using interfaces built around abstractions that assume I/O to be slow. We therefore propose a structure for an OS called parakernel, which eliminates most OS abstractions and provides interfaces for applications to leverage the full potential of the underlying hardware. The parakernel facilitates application-level parallelism by securely partitioning the resources and multiplexing only those resources that are not partitioned. Great discussion on HN. We’ve seen all this before. 
  • We should keep this lesson in mind when it comes to the use of biological weapons. Anything put out in the world can be captured, analysed, 3D printed en masse, and sent right back at the attacker. How Chinese Spies Got the N.S.A.’s Hacking Tools, and Used Them for Attacks: Chinese intelligence agents acquired National Security Agency hacking tools and repurposed them in 2016 to attack American allies and private companies in Europe and Asia, a leading cybersecurity firm has discovered. The episode is the latest evidence that the United States has lost control of key parts of its cybersecurity arsenal.
  • Real Time Lambda Cost Analysis Using Honeycomb: All of the services that support our web and mobile applications at Fender Digital are built using AWS Lambda. With Lambda’s cost-per-use billing model we have cut the cost of hosting our services by approximately 90%…While we have been tagging our resources diligently to calculate the cost of each service, the AWS Cost Explorer does not allow us to delve into the configuration of the function and the actual resources it has consumed versus the actual invocation times….Log aggregation to the rescue! We aggregate all of the CloudWatch log groups for our Lambda functions into a single Kinesis stream, which then parses out the structured JSON logs and publishes them to honeycomb.io. I cannot recommend that tool highly enough for analyzing log data. It is a great product from a great group of people. AWS adds its own log lines as well, including when the function starts an invocation, when the invocation ends, and a report of the invocation that can be used to calculate the cost of the invocation…The function is currently configured for 512 MB, and since most of the time it spends is in network I/O on calling app store APIs, we can reduce the configured memory. If we were to reduce it to 384 MB we would see an ~25% reduction in cost. The average invocation time may be slightly higher, but as this function is invoked off of a DynamoDB stream it has no direct impact on user experience. What are the actual costs incurred? Assuming we’ve already consumed the free tier for Lambda, during the 3 day period we are looking at there were 1,405,640 requests x $0.0000002 per request = $0.28 and 140,057.8 GB-seconds of compute time x $0.00001667 = $2.33. For those three days, this function cost $2.61, leading to a potential savings of $0.65. That’s not a lot, but as our use of Lambda steadily grows we can ensure we are not over-provisioned. (The cost arithmetic is worked through in a short sketch after this list.)
  • Measuring MySQL Performance in Kubernetes: You can see the numbers vary a lot, from 10,550 tps to 20,196 tps, with most of the time spent in the 10,000 tps range. That’s quite disappointing. Basically, we lost half of the throughput by moving [MySQL] to the Kubernetes node. To improve your experience you need to make sure you use Guaranteed QoS. Unfortunately, Kubernetes does not make it easy. You need to manually set the number of CPU threads, which is not always obvious if you use dynamic cloud instances. With Guaranteed QoS there is still a performance overhead of 10%, but I guess this is the cost we have to accept at the moment.
  • If your company runs multiple lines of business (think Gmail vs. Google Docs vs. Google Calendar, etc.), how can you tell how much of your hardware and infrastructure spend is attributed to each LOB? Embracing context propagation: When the request enters our system, we typically already know which LOB it represents, either from the API endpoint or even directly from the client apps. We can use context (baggage) to store the LOB tag and use it anywhere in the call graph to attribute measurements of resource usage to a specific LOB, such as the number of reads/writes in the storage, or the number of messages processed by the messaging platform. (A small baggage-propagation sketch follows this list.)
  • Attack of the Killer Microseconds: A new breed of low-latency I/O devices, ranging from faster datacenter networking to emerging non-volatile memories and accelerators, motivates greater interest in microsecond-scale latencies. Existing system optimizations targeting nanosecond- and millisecond-scale events are inadequate for events in the microsecond range. New techniques are needed to enable simple programs to achieve high performance when microsecond-scale latencies are involved, including new microarchitecture support. 
  • The fundamental issue with humanity is one person’s idea of utopia is another’s dystopia. We spend our lives navigating the resulting maelstrom of the conflict. This vision is not Humanity Unchained, it’s Humanity Limited by its own creation. Do you really want humanity limited by an AI enforcing all the “standard problems of political philosophy”? We would be stuck in the past rather than moving forward. Stephen Wolfram: But what will be possible with this? In a sense, human language was what launched civilization. What will computational language do? We can rethink almost everything: democracy that works by having everyone write a computational essay about what they want, that’s then fed to a big central AI—which inevitably has all the standard problems of political philosophy. New ways to think about what it means to do science, or to know things. Ways to organize and understand the civilization of the AIs. A big part of this is going to start with computational contracts and the idea of autonomous computation—a kind of strange merger of the world of natural law, human law, and computational law. Something anticipated three centuries ago by people like Leibniz—but finally becoming real today. Finally a world run with code.
  • A very well written Post-mortem and remediations for [Matrix] Apr 11 security incident. When you hear this in a movie you know exactly what happens next: What can we trust if not our own servers? And that’s what happened: If there is one lesson everyone should learn from this whole mess, it is: SSH agent forwarding is incredibly unsafe, and in general you should never use it. Not only can malicious code running on the server as that user (or root) hijack your credentials, but your credentials can in turn be used to access hosts behind your network perimeter which might otherwise be inaccessible. All it takes is someone to have snuck malicious code on your server waiting for you to log in with a forwarded agent, and boom, even if it was just a one-off ssh -A. TL;DR: keep your services patched; lock down SSH; partition your network; and there’s almost never a good reason to use SSH agent forwarding.
  • You have a great IoT idea, but you just can’t get past the lack of an M2M network. Good news: AT&T went live with their NarrowBand Internet of Things (NB-IoT) network. Bodyport, for example, uses the LTE-M network to connect a smart scale that transmits patients’ cardiovascular data to remote care teams in near real-time. They’re working with suppliers to certify $5 modules that connect devices to NB-IoT, and pricing plans are available for as low as $5/year/device. You need a revenue model, but at least it’s possible.
  • Meet programmers from a more civilized age. Brian Kernighan interviews Ken Thompson at Vintage Computer Festival East 2019
  • 6 new ways to reduce your AWS bill with little effort: AWS introduced AMD-powered EC2 instances that are 10% cheaper compared to the Intel-powered Instances. They provide the same resources (CPU, memory, network bandwidth) and run the same AMIs; Use VPC endpoints instead of NAT gateways; Convertible Reserved EC2 Instances – Saving potential: Additional 25% over On-Demand (assuming you can now go from 1-year terms to 3-year terms); EC2 Spot Instances – Saving potential: 70-90% over On-Demand; S3 Intelligent-Tiering.
  • Maybe this would work for how to communicate within a team? Mr. Rogers’ Nine Rules for Speaking to Children (1977): State the idea you wish to express as clearly as possible, and in terms preschoolers can understand; Rephrase in a positive manner; Rephrase your idea to eliminate all elements that could be considered prescriptive, directive, or instructive; Rephrase any element that suggests certainty; Rephrase your idea to eliminate any element that may not apply to all children; Add a simple motivational idea that gives preschoolers a reason to follow your advice; Rephrase your new statement, repeating the first step; Rephrase your idea a final time, relating it to some phase of development a preschooler can understand.
  • How is a blockchain and end-to-end encryption totally owned by Facebook any more private? Top 3 Takeaways from Facebook: Blockchain will be at the center of Facebook’s strategy for their entire platform and payments; Building out infrastructure on the data level to have security from the ground up; Code privacy and data use principles as first-class concepts into the infrastructure. This is one of the primary use cases of a blockchain-based network; I didn’t see anyone catch this sound bite from Mark, but he basically said that they are rewriting all of Facebook’s back-end code to be more user-centric, which is a distributed ledger and access control.
  • Do you want to go serverless and survive AWS region level outages? Here’s a huge amount of practical detail as well as tips and gotchas. Disaster Tolerance Patterns Using AWS Serverless Services: S3 resilience: Use versioning and cross region replication for S3 buckets; Use CloudFront origin failover for read access to replicated S3 buckets; DynamoDB resilience: Use global tables for DynamoDB tables; API Gateway and Lambda resilience: Use a regional API Gateway and associated Lambda functions in each region; Use Route 53 latency or failover routing with health checks in front of API Gateways; Cognito User Pools resilience: Create a custom sync solution for now.
  • The Story Behind an Instacart Order, Part 1: Building a Digital Catalog: Fun fact: While partners can send us inventory data at any point in the day, we receive most data dumps around 10 pm local time. Certain individual pieces of our system (like Postgres) weren’t configured to handle these 10 pm peak load times efficiently — we didn’t originally build with elastic scalability in mind. To solve this we began a “lift and shift” of our catalog infrastructure from our artisanal system to a SQL-based interface, running on top of a distributed system with inexpensive storage. We’ve decoupled compute from that storage, and in this new system, we rely on Airflow as our unified scheduler to orchestrate that work. Re-building our infrastructure now not only helps us deal with load times efficiently, it also saves cost in the long run and ensures that we can make more updates every night to the catalog as we receive more product attributes and our data prioritization models evolve.
  • Some risks of coordinating only sometimes: Within AWS, we are starting to settle on some patterns that help constrain the behavior of systems in the worst case. One approach is to design systems that do a constant amount of coordination, independent of the offered workload or environmental factors. This is expensive, with the constant work frequently going to waste, but worth it for resilience. Another emerging approach is designing explicitly for blast radius, strongly limiting the ability of systems to coordinate or communicate beyond some limited radius. We also design for static stability, the ability for systems to continue to operate as best they can when they aren’t able to coordinate. (A toy illustration of the constant-work pattern follows this list.)
  • O’Reilly has branded something they call the Next Architecture: The growth we’ve seen on our online learning platform in cloud topics, in orchestration and container-related terms such as Kubernetes and Docker, and in microservices is part of a larger trend in how organizations plan, code, test, and deploy applications that we call the Next Architecture. This architecture allows fast, flexible deployment, feature flexibility, efficient use of programmer resources, and rapid adapting, including scaling, to unpredictable resource requirements. These are all goals businesses feel increasingly pressured to achieve to keep up with nimble competitors. There are four aspects of the Next Architecture: Decomposition, Cloud, Containers, and Orchestration.
  • We’ve learned from running Azure Functions that only 30% of our executions are coming from HTTP events. KEDA: bringing event-driven containers and functions to Kubernetes: With the release of KEDA, any team can now deploy function apps created using those same [Microsoft] tools directly to Kubernetes. This allows you to run Azure Functions on-premises or alongside your other Kubernetes investments without compromising on the productive serverless development experience. The open source Azure Functions runtime is available to every team and organization, and brings a world-class developer experience and programming model to Kubernetes. The combination of flexible hosting options and an open source toolset gives teams more freedom and choice. If you choose to take advantage of the full benefits of a managed serverless service, you can shift responsibility and publish your apps to Azure Functions.
  • Why would you pick Fargate over Lambda? How Far Out is AWS Fargate: If I were to describe Fargate, I’d describe it as clusterless container orchestration. Much like “serverless” means an architecture where the server has been abstracted away…Lambda is an additional layer of abstraction where if your workload can be expressed as a function and complete its work in 15 minutes or less, then it’s a great choice, especially if your workload leans towards the sporadic. But if you need more control or the limits imposed by Lambda’s abstractions pose a problem for your workload, then Fargate is worth a close look. You don’t really need to choose one or the other as they very much complement each other…Fargate is inherently simpler than Kubernetes because it only does one thing: container orchestration. And it does this very well. Everything else is provided by an external AWS service.
  • You’ve heard of autonomous self-driving cars? How about autonomous databases? It’s a DBMS that can deploy, configure, and tune itself automatically without any human intervention. Advanced Database Systems 2019 #25: Self-Driving Database Engineering (slides): Personnel is ~50% of the TCO of a DBMS. Average DBA Salary (2017): $89,050. The scale and complexity of DBMS installations have surpassed humans. Replace DBMS components with ML models trained at runtime. True autonomous DBMSs are achievable in the next decade. You should think about how each new feature can be controlled by a machine.
  • Good explanation with a really cool whiteboard. Latency Under Load: HBM2 vs. GDDR6. A traffic analogy is used: the more lanes you have to memory, the higher the bandwidth. Use HBM2 for highest bandwidth and best power efficiency. Downside: it’s harder to design with and has higher cost. GDDR is a compromise, giving great performance and a wide pathway to memory.
  • microsoft/Quantum: These samples demonstrate the use of the Quantum Development Kit for a variety of different quantum computing tasks. Most samples are provided as a Visual Studio 2017 C# or F# project under the QsharpSamples.sln solution.
  • Vanilla JS: a fast, lightweight, cross-platform framework for building incredible, powerful JavaScript applications.
  • sirixdb/sirix: facilitates effective and efficient storing and querying of your temporal data through snapshotting (only ever appends changed database pages) and a novel versioning approach called sliding snapshot, which versions at the node level. Currently we support the storage and querying of XML- and JSON-documents in our binary encoding. 
  • Book of Proceedings from Internet Identity Workshop 27. There’s a huge number of topics and 47 pages of notes. 
  • Gorilla: A Fast, Scalable, In-Memory Time Series Database: Gorilla optimizes for remaining highly available for writes and reads, even in the face of failures, at the expense of possibly dropping small amounts of data on the write path. To improve query efficiency, we aggressively leverage compression techniques such as delta-of-delta timestamps and XOR’d floating point values to reduce Gorilla’s storage footprint by 10x. This allows us to store Gorilla’s data in memory, reducing query latency by 73x and improving query throughput by 14x when compared to a traditional database (HBase)-backed time series data. (A small sketch of these two compression tricks follows this list.)
  • Uber created a site collecting all their research papers in one place. You might be surprised at all the topics they cover. 
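
A minimal sketch of the stateful-application-server idea from the “Fast key-value stores” item above (Python; the names and data shapes here are hypothetical, not from the paper): a RInK-style store pays for a network round trip plus (un)marshalling on every access, while a stateful server keeps the working set as live in-process objects.

    import json

    # RInK-style access: every read crosses the network and unmarshals the value.
    def get_profile_rink(remote_store, user_id):
        raw = remote_store.get(user_id)   # PUT/GET round trip to the remote store
        return json.loads(raw)            # (un)marshalling paid on every request

    # Stateful application server: the working state lives in the same process.
    class StatefulAppServer:
        def __init__(self):
            self._profiles = {}           # live objects, no serialization on access

        def put_profile(self, user_id, profile):
            self._profiles[user_id] = profile

        def get_profile(self, user_id):
            # Plain in-memory lookup: no network hop, no (un)marshalling.
            return self._profiles.get(user_id)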
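
The Lambda cost item above is easy to reproduce as a worked example. A small Python sketch of the published arithmetic (the per-request and per-GB-second prices and the usage numbers are the ones quoted in the post; everything else is illustrative):

    # Prices quoted in the Fender post: $0.0000002 per request, $0.00001667 per GB-second.
    REQUEST_PRICE = 0.0000002
    GB_SECOND_PRICE = 0.00001667

    def lambda_cost(requests, gb_seconds):
        return requests * REQUEST_PRICE + gb_seconds * GB_SECOND_PRICE

    requests, gb_seconds = 1_405_640, 140_057.8
    # ~$2.62 here; the post rounds each term first ($0.28 + $2.33) and quotes $2.61.
    print(f"3-day cost at 512 MB: ${lambda_cost(requests, gb_seconds):.2f}")
    # Dropping memory from 512 MB to 384 MB scales the GB-seconds term by 384/512,
    # assuming duration stays flat for this I/O-bound function: roughly $2.03.
    print(f"Estimated cost at 384 MB: ${lambda_cost(requests, gb_seconds * 384 / 512):.2f}")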
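
For the context-propagation item above, a minimal Python sketch of carrying a line-of-business tag as baggage down the call graph (all names are hypothetical; a real system would use its tracing library’s baggage API and a metrics pipeline rather than print):

    import contextvars

    # Baggage slot holding the line-of-business tag for the current request.
    current_lob = contextvars.ContextVar("lob", default="unknown")

    def record_usage(metric, resource):
        # Deep in the call graph: attribute the work to whichever LOB set the baggage.
        print(f"{metric}[lob={current_lob.get()}] {resource} += 1")

    def storage_read(table):
        record_usage("storage_reads", table)

    def handle_request(endpoint, lob):
        # Set once at the edge, where the API endpoint (or the client app) tells us the LOB.
        current_lob.set(lob)
        storage_read("user_settings")

    handle_request("/api/calendar/list", lob="calendar")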
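
The constant-work idea from the “coordinating only sometimes” item above can be illustrated with a toy config pusher (names hypothetical): it ships the full snapshot on a fixed cadence, so the coordination load is the same whether one thing changed or everything did.

    import time

    def constant_work_pusher(fetch_full_table, apply_table, period_seconds=10):
        # Push the entire configuration snapshot every period, changed or not.
        # Per-cycle work is constant, so a burst of changes (or recovery after
        # an outage) cannot turn into a surge of coordination traffic.
        while True:
            apply_table(fetch_full_table())  # always the full snapshot, never a delta
            time.sleep(period_seconds)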
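
For the Gorilla item above, a small Python sketch of the two compression ideas the paper leans on, delta-of-delta timestamps and XOR’d float bit patterns (illustrative only; the real encoder bit-packs these results into a compact stream):

    import struct

    def delta_of_delta(timestamps):
        # Most series arrive at a fixed interval, so the delta-of-delta is usually 0
        # and can be encoded in a single bit.
        deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
        return [b - a for a, b in zip(deltas, deltas[1:])]

    def xor_floats(values):
        # XOR each value's 64-bit pattern with its predecessor's; successive samples
        # are usually close, so the result has long runs of zero bits to elide.
        bits = [struct.unpack(">Q", struct.pack(">d", v))[0] for v in values]
        return [b ^ a for a, b in zip(bits, bits[1:])]

    print(delta_of_delta([60, 120, 180, 240, 302]))               # [0, 0, 2]
    print([f"{x:016x}" for x in xor_floats([12.0, 12.0, 12.5])])  # mostly zero bits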

Stuff The Internet Says On Scalability For May 3rd, 2019

Wake up! It’s HighScalability time:

Event horizon? Nope. It’s a close-up of a security hologram. Makes one think.

Do you like this sort of Stuff? I’d greatly appreciate your support on Patreon. I wrote Explain the Cloud Like I’m 10 for people who need to understand the cloud. And who doesn’t these days? On Amazon it has 45 mostly 5 star reviews (105 on Goodreads). They’ll learn a lot and hold you in awe.

Number Stuff:

  • $1 trillion: Microsoft is the most valuable company in the world (for now)
  • 20%: global enterprises will have deployed serverless computing technologies by 2020
  • 390 million: paid Apple subscriptions, revenue from the services business climbed from $9.9 billion to $11.5 billion, services now account for “one-third” of the company’s gross profits
  • 1011: CubeSat missions
  • $326 billion: USA farm expenses in 2017
  • 61%: increase in average cyber attack losses, from $229,000 last year to $369,000 this year; for large firms the figure exceeds $700,000, versus just $162,000 in 2018.
  • $550: can yield 20x profit on the sale of compromised login credentials

Quotable Stuff:

  • Robert Lightfoot~ Protecting against risk and being safe are not the same thing. Risk is just simply a calculation of likelihood and consequence. Would we have ever launched Apollo in the environment we’re in today? Would Buzz and Neil have been able to go to the moon in the risk posture we live in today? Would we have launched the first shuttle with a crew? We must move from risk management to risk leadership. From a risk management perspective, the safest place to be is on the ground. From a risk leadership perspective, I believe that’s the worst place this nation can be.
  • Paul Kunert: In dollar terms, Jeff Bezos’s cloud services wing grew 41 per cent year on year to $7.6bn, figures from Canalys show. Microsoft was up 75 per cent to $3.4bn and Google grew a whopping 83 per cent to $2.3bn.
  • @codinghorror: 1999 “MIT – We estimate that the puzzle will require 35 years of continuous computation to solve” 2019 “🌎- LOL” https://www.csail.mit.edu/news/programmers-solve-mits-20-year-old-cryptographic-puzzle …
  • @dvassallo: TIL what EC2’s “Up to” means. I used to think it simply indicates best effort bandwidth, but apparently there’s a hard baseline bottleneck for most EC2 instance types (those with an “up to”). It’s significantly smaller than the rating, and it can be reached in just a few minutes. This stuff is so obscure that I bet 99% of Amazon SDEs that use EC2 daily inside Amazon don’t know about these limits. I only noticed this by accident when I was benchmarking S3 a couple of weeks ago
  • @Adron: 1997: startup requires about a million $ just to get physical infra setup for a few servers. 2007: one can finally run stuff online and kind of skip massive hardware acquisitions just to run a website. 2017: one can scale massively & get started for about $10 bucks of infra.
  • Wired: One executive described Nadella’s approach as “subtle shade.” He never explicitly eighty-sixed a division or cut down a product leader, but his underlying intentions were always clear. His first email to employees ran more than 1,000 words—and made no mention of Windows. He later renamed the cloud offering Microsoft Azure. “Satya doesn’t talk shit—he just started omitting ‘Windows’ from sentences,” this executive says. “Suddenly, everything from Satya was ‘cloud, cloud, cloud!’ ”
  • @ThreddyTheTrex: My approach to side projects has evolved. Beginning of my career: “I will build everything from scratch using C and manage my own memory and I don’t mind if it takes 3 years.” Now: “I will use existing software that takes no more than 15 minutes to configure.”
  • btown: The software wouldn’t have crashed if the user didn’t click these buttons in a weird order. The bug was only one factor in a chain of events that led to the segfault.
  • @Tjido: There are more than 10,000 data lakes in AWS. @strataconf  #datalakes #stratadata
  • Nicolas Kemper: Accretive projects are everywhere: Museums, universities, military bases – even neighborhoods and cities. Key to all accretive projects is that they house an institution, and key to all successful institutions is mission. Whereas scope is a detailed sense of both the destination and the journey, a mission must be flexible and adjust to maximum uncertainty across time. In the same way, an institution and a building are often an odd pair, because whereas the building is fixed and concrete, finished or unfinished, an institution evolves and its work is never finished.
  • @markmadsen: Your location-identified tweets plus those of two friends on twitter predict your location to within 100m 77% of the time. Location data is PII and must be treated as such #StrataData
  • Backblaze: The Annualized Failure Rate (AFR) for Q1 is 1.56%. That’s as high as the quarterly rate has been since Q4 2017 and it’s part of an overall upward trend we’ve seen in the quarterly failure rates over the last few quarters. Let’s take a closer look.
  • Theron Mohamed: Google’s advertising revenue rose by 15% to $30.72 billion, a sharp slowdown from 24% growth a year ago, according to its earnings report for the first quarter of 2018. Paid clicks rose 39%, a significant decrease from 59% year-on-year growth in the first quarter of 2018. Cost-per-click also fell 19%, after sliding 19% in the same period of 2018.
  • @ajaynairthinks: “It was what we know to do so it was faster” -> this is the key challenge. Right now, the familiar path is not easy/effective in the long term, and the effective path is not familiar in the short term. We need make this gap visible, and we need to make the easy things familiar.
  • @NinjaEconomics: “For the first time ever there are now more people in the world older than 65 than younger than 5.”
  • Filipe Oliveira: with the new AWS public cloud C5n Instances designed for compute-heavy applications and to deliver performance that is just about indistinguishable from bare metal, with 100 Gbps Networking along with a higher ceiling on packets per second, we should be able to deliver at least the same 50 million operations per second below 1 millisecond with fewer VM nodes
  • Nima Khajehnouri: Snap’s monetization algorithms have the single biggest impact to our advertisers and shareholders
  • Carmen Bambach: He is an artist of his time and one that transcends his time. He is very ambitious. It’s important to remember that although Leonardo was a “disciple of experience,” as he called himself, he is also paying great attention to the sources of his time. After having devoured and looked at and bought many books, he realizes he can do better. He really wants to write books, but it’s a very steep learning curve. The way we should look at his notebooks and the manuscripts is that they are essentially the raw material for what he had intended to produce as treatises. His great contribution is being able to visualize knowledge in a way that had not been done before. 
  • Charlie Demerjian: The latest Intel roadmap leak blows a gaping hole in Intel’s 10nm messaging. SemiAccurate has said all along that the process would never work right and this latest info shows that 10nm should have never been released.
  • @mipsytipsy: Abuse and misery pile up when you are building and running large software systems without understanding them, without good feedback loops. Feedback loops are not a punishment. They mature you into a wise elder engineer.  They give you agency, mastery, autonomy, direction. And that is why software engineers, management, and ops engineers should all feel personally invested in empowering software engineers to own their own code in production.
  • Skip: Serverless has made it possible to scale Skip with a small team of engineers. It’s also given us a programming model that lets us tackle complexity early on, and gives us the ability to view our platform as a set of fine-grained services we can spread across agile teams.
  • seanwilson: Imagine having to install Trello, Google Docs, Slack etc. manually everywhere you wanted to use it, deal with updates yourself and ask people you wanted to collaborate with to do the same. That makes no sense in terms of ease of use.
  • Darryl Campbell: The slick PR campaign masked a design and production process that was stretched to the breaking point. Designers pushed out blueprints at double their normal pace, often sending incorrect or incomplete schematics to the factory floor. Software engineers had to settle for re-creating 40-year-old analog instruments in digital formats, rather than innovating and improving upon them. This was all done for the sake of keeping the Max within the constraints of its common type certificate.
  • Stripe: We have seen such promising results from our remote engineers that we are greatly increasing our investment in remote engineering. We are formalizing our Remote engineering hub. It is coequal with our physical hubs, and will benefit from some of our experience in scaling engineering organizations.
  • Joel Hruska: According to Intel in its Q1 2019 conference call, NAND price declines were a drag on its earnings, falling nearly twice the expected amount. This boom and bust cycle is common in the DRAM industry, where it drove multiple players to exit the market over the past 18 years. This is one reason we’re effectively down to just three DRAM manufacturers — Samsung, SK Hynix, and Micron. There are still a few more players in the NAND market, though we’ve seen consolidations there as well.
  • Alastair Edwards: The cloud infrastructure market is moving into a new phase of hybrid IT adoption, with businesses demanding cloud services that can be more easily integrated with their on-premises environment. Most cloud providers are now looking at ways to enter customers’ existing data centres, either through their own products or via partnerships
  • Paul Johnston: And yes I can absolutely see how the above company could have done this whole solution better as a Serverless solution but they don’t have the money for rearchitecting their back end (I don’t imagine) and what would be the value anyway? It’s up and running, with paying clients. The value at this point doesn’t seem valuable. Additional features may be a good fit for a Serverless approach, but not the whole thing if it’s all working. The pain of migrating to a new backend database, the pain of server migrations even at this level of simplicity, the pain of having to coordinate with other teams on something that seems so trivial, but never is that trivial has been really hard.
  • @rseroter: In serverless … Functions are not the point. Managed services are not the point. Ops is not the point. Cost is not the point. Technology is not the point. The point is focus on customer value. @ben11kehoe laying it all out. #deliveragile2019
  • @jessitron: Serverless is a direction, not a destination. There is no end state. @ben11kehoe  Keep moving technical details out of the team’s focus, in favor of customer value. #deliverAgile 
  • @jessitron (RT'd by @RealGeneKim): When we rush development, skip tests and refactoring, we get "Escalating Risk." Please give up the "technical debt" description; it gives businesspeople a very wrong impression of the tradeoffs. From Janellekz #deliverAgile
  • @ben11kehoe: Good points in here about event-driven architectures. I do think the “bounded context” notions from microservices are still applicable, and that we don’t have good enough tools for establishing contracts for events and dynamic routing for #serverless yet.
  • Riot Games: We use MapReduce, a common cluster computing model, to calculate data in a distributed fashion. Below is an example of how we calculate the cosine similarity metric – user data is mapped to nodes, the item-item metric is calculated for each user, and the results are shuffled and sent to a common node so they can be aggregated together in the reduce stage. It takes approximately 1000 compute hours to carry out the entire offer generation process, from snapshotting data to running all of the distributed algorithms. That's 50 machines running for 20 hours each. (A toy Python sketch of this map/shuffle/reduce flow appears just after this list.)
  • Will Knight: Sze’s hardware is more efficient partly because it physically reduces the bottleneck between where data is stored and where it’s analyzed, but also because it uses clever schemes for reusing data. Before joining MIT, Sze pioneered this approach for improving the efficiency of video compression while at Texas Instruments.
  • Hersleb hypothesis: coding is a socio-technical process where code and humans interact. According to what we call the Hersleb hypothesis, the following anti-pattern is a strong predictor for defects: if two code sections communicate, but the programmers of those two sections do not, then that code section is more likely to be buggy.
  • Joel Hruska: But the adoption of chiplets is also the engineering acknowledgment of constraints that didn’t used to exist. We didn’t used to need chiplets. When companies like TSMC publicly predict that their 5nm node will deliver much smaller performance and power improvements than previous nodes did, it’s partly a tacit admission that the improvements engineers have gotten used to delivering from process nodes will now have to be gained in a different fashion. No one is particularly sure how to do this, and analyses of how effectively engineers boost performance without additional transistors to throw at the problem have not been optimistic.
  • Bryan Meyers: To some extent I think we should view chiplets as a stop-gap until other innovations come along. They solve the immediate problems of poor yields and reticle limits in exchange for a slight increase in integration complexity, while opening the door to more easily integrating application-specific accelerators cost-effectively. But it’s also not likely that CPU sockets will get much larger. We’ll probably hit the limit of density when chiplet-based SoC’s start using as much power as high-end GPUs. So really we’re waiting on better interconnects (e.g. photonics or wireless NoC) or 3D integration to push much farther. Both of which I think are still at least a decade away.
  • Olsavsky: And that will be a constant battle between growth, geographic expansion in AWS, and also efficiencies to limit how much we actually need. I think we are also getting much better at adding capacity faster, so there is less need to build it six to twelve months in advance.
  • Malith Jayasinghe: We noticed that a non-blocking system was able to handle a large number of concurrent users while achieving higher throughput and lower latency with a small number of threads. We then looked at how the number of processing threads impacts the performance. We noticed that the number of threads had minimal impact on throughput and average latency. However, as the number of threads increases, we see a significant increase in the tail latencies (i.e. latency percentiles) and load average.
  • Paul Berthaux: We [Algolia] run multiple LBs for resiliency – the LB selection is made through round robin DNS. For now this is fine, as the LBs are performing very simple tasks in comparison to our search API servers, so we do not need an even load balancing across them. That said, we have some very long term plans to move from round-robin DNS to something based on Anycast routing. The detection of upstream failures as well as retries toward different upstreams is embedded inside NGINX/OpenResty. I use the log_by_lua directive from OpenResty with some custom Lua code to count the failures and trigger the removal of the failing upstream from the active Redis entry and alert the lb-helper after 10 failures in a row. I set up this failure threshold to avoid lots of unnecessary events in case of short self-resolving incidents like punctual packet loss. From there the lb-helper will probe the failing upstream FQDN and put it back in Redis once it recovers. (A Python sketch of this failure-counting logic follows just after this list.)
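
To make the Riot Games quote above a little more concrete, here is a toy, single-process stand-in for that map/shuffle/reduce flow for an item-item cosine similarity metric. The data, function names, and the in-memory "shuffle" are all invented for illustration; Riot's real pipeline runs on an actual MapReduce cluster, where the map step runs per user across many nodes and the grouped keys are shuffled over the network before the reduce stage.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt

# Toy input: one record per user, mapping item -> implicit rating.
users = {
    "u1": {"skinA": 1.0, "skinB": 2.0},
    "u2": {"skinA": 3.0, "skinC": 1.0},
    "u3": {"skinA": 0.5, "skinB": 1.0, "skinC": 4.0},
}

def map_user(ratings):
    """Map: from one user's ratings, emit partial dot products for each item
    pair and squared terms for each item's norm."""
    for (i, ri), (j, rj) in combinations(sorted(ratings.items()), 2):
        yield (i, j), ri * rj        # contributes to the pair's dot product
    for i, ri in ratings.items():
        yield (i, i), ri * ri        # contributes to item i's squared norm

# "Shuffle": group emitted values by key. On a real cluster this grouping
# happens over the network between the map and reduce stages.
grouped = defaultdict(float)
for ratings in users.values():
    for key, value in map_user(ratings):
        grouped[key] += value

# Reduce: cosine(i, j) = dot(i, j) / (||i|| * ||j||), aggregated per item pair.
norms = {i: sqrt(v) for (i, j), v in grouped.items() if i == j}
similarity = {
    (i, j): dot / (norms[i] * norms[j])
    for (i, j), dot in grouped.items() if i != j
}
print(similarity)  # e.g. ('skinA', 'skinB'): ~0.35
```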
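And for Paul Berthaux's load-balancer quote directly above: the real implementation is custom Lua inside OpenResty's log_by_lua, so the Python below is only a sketch of the described logic under assumptions (the `active_upstreams` Redis set and the `lb-helper` channel are invented names): count consecutive failures per upstream, drop it from the active Redis entry after 10 in a row, and let the lb-helper probe and restore it later.

```python
import redis  # assumes the redis-py client and a reachable Redis

FAILURE_THRESHOLD = 10          # "10 failures in a row", per the quote
r = redis.Redis()
consecutive_failures = {}       # upstream FQDN -> current failure streak

def record_result(upstream: str, ok: bool) -> None:
    """Called once per proxied request, mirroring what the Lua code in
    log_by_lua does after each response."""
    if ok:
        consecutive_failures[upstream] = 0
        return
    consecutive_failures[upstream] = consecutive_failures.get(upstream, 0) + 1
    if consecutive_failures[upstream] == FAILURE_THRESHOLD:
        # Drop the failing upstream from the active entry; a separate
        # lb-helper process probes it and re-adds it once it recovers.
        r.srem("active_upstreams", upstream)   # invented key name
        r.publish("lb-helper", upstream)       # invented alert channel
```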

Useful Stuff:

Stuff The Internet Says On Scalability For April 26th, 2019

Wake up! It’s HighScalability time:

Do you like this sort of Stuff? I’d greatly appreciate your support on Patreon. I wrote Explain the Cloud Like I’m 10 for people who need to understand the cloud. And who doesn’t these days? On Amazon it has 45 mostly 5 star reviews (103 on Goodreads). They’ll learn a lot and hold you in even greater awe.

  • $30 million: Apple's per month AWS bill (a ~50% reduction); 73%: Azure YoY growth; 3,500: times per day andon cords are pulled at Toyota; $1 trillion: size of micromobility market; $1 billion: cryptopiracy is the new sea piracy; $702 million: Tesla first quarter loss; $5.0 billion: FTC Facebook fine; 1.56 billion: Facebook DAUs, 8% growth; 93%: Facebook mobile advertising revenue out of total; 40%: internet traffic driven by bots; one litre per hour: required by Roman galley oarsmen; 1200: Fortnite World Cup cheaters; 26: states ban community broadband; $50M: Slack yearly AWS spend; 575: companies paying Slack $100k/year
  • Quotable Quotes:
    • Claude Shannon: Then there’s the idea of dissatisfaction. By this I don’t mean a pessimistic dissatisfaction of the world — we don’t like the way things are — I mean a constructive dissatisfaction. The idea could be expressed in the words, This is OK, but I think things could be done better. I think there is a neater way to do this. I think things could be improved a little. In other words, there is continually a slight irritation when things don’t look quite right; and I think that dissatisfaction in present days is a key driving force in good scientists.
    • Albert Kao: A feature of modular structure is that there’s always information loss, but the effect of that information loss on accuracy depends on the environment. Surprisingly, in complex environments, the information loss even helps accuracy in a lot of situations.
    • Eduards Sizovs: Be the company that says: we are hiring mentoring.
    • ???: There Is No Shortage of Talent. There’s a Shortage of Suckers. 
    • Andrew Leonard: AWS is striking at the Achilles’ heel of open source: lifting the work of others, and renting access to it. Observers of the clash between AWS and open source worry that the room for further innovation may be rapidly shrinking. 
    • V8: Like many with a background in programming languages and their implementations, the idea that safe languages enforce a proper abstraction boundary, not allowing well-typed programs to read arbitrary memory, has been a guarantee upon which our mental models have been built. It is a depressing conclusion that our models were wrong — this guarantee is not true on today’s hardware.
    • John Allspaw: resilience is something that a system does, not what it has
    • Valantar: Denser than DRAM, not NAND. Speed claims are against NAND, price/density claims against DRAM – where they might not be 1/10th the price, but definitely cheaper. The entire argument for 3D Xpoint is “faster than NAND, cheaper than DRAM (while persistent and closer to the former than the latter in capacity)”, after all.
    • @JoeEmison: Don’t need luck when I have serverless.
    • bzbarsky: Caches, say. Again as a concrete example, on Mac the OS font library (CoreText) has a multi-megabyte glyph cache that is per-process. Assuming you do your text painting directly in the web renderer process (which is an assumption that is getting revisited as a result of this problem), you now end up with multiple copies of this glyph cache. And since it’s in a system library, you can’t easily share it (even if you ignore the complication about it not being readonly data, since it’s a cache). Just to make the numbers clear, the number of distinct origins on a “typical” web page is in the dozens because of all the ads. So a 3MB per-process overhead corresponds to something like an extra 100MB of RAM usage…
    • T-Mobile: millimeter wave (mmWave) spectrum has great potential in terms of speed and capacity, but it doesn’t travel far from the cell site and doesn’t penetrate materials at all. It will never materially scale beyond small pockets of 5G hotspots in dense urban environments.
    • @StegerPatrick: I think it’s a bit of both honestly. True an idle instance makes AWS coin but at the same time they have to build out datacenters to support those idle boxes. My thinking is that margins are higher with lambda per box. 0 numbers to back it up, just my gut.
    • @jamesurquhart: I remember the question being asked in an AWS ops meeting: “Do we know mathematically that we detected that error as early as statistically reasonable?”
    • Herman Narula: Video games are the most important technological change happening in the world right now. Just look at the scale: a full third of the world’s population (2.6 billion people) find the time to game, plugging into massive networks of interaction. These networks let people exercise a social muscle they might not otherwise exercise. While social media can amplify our differences, could games create a space for us to empathize?
    • George Dyson: There are two kinds of creation myths: those where life arises out of the mud, and those where life falls from the sky. In this creation myth, computers arose from the mud, and code fell from the sky.
    • returnofthecityboy: As someone who has worked in both fields, often it’s nigh impossible in big corporations, and I imagine the military and aerospace is no exception, to improve something that you’re not immediately tasked with. There are 0.1% of things that are “your job”, and 99.9% of things are “not your job”. If you see something wrong in “not your job”, deep hierarchies, leadership that is selected for seniority over intelligence, and lots of layers of bureaucratic crap make it impossible for you to change anything about it.
    • Fortnite Source: The executives keep reacting and changing things. Everything has to be done immediately. We’re not allowed to spend time on anything. If something breaks — a weapon, say — then we can’t just turn it off and fix it with the next patch. It has to be fixed immediately, and all the while, we’re still working on next week’s patch. It’s brutal. I hardly sleep. I’m grumpy at home. I have no energy to go out. Getting a weekend away from work is a major achievement. If I take a Saturday off, I feel guilty. I’m not being forced to work this way, but if I don’t, then the job won’t get done. I know some people who just refused to work weekends, and then we missed a deadline because their part of the package wasn’t completed, and they were fired. People are losing their jobs because they don’t want to work these hours.
    • @hillelogram: Catastrophic accidents like the 737 are signs of deep systemic issues at all layers of the system and cannot be isolated to things like “good practices”. This is something we see in all major accidents: people scapegoat the technicians and miss the errors of the C-levels.
    • @Electric_Genie: Exclusive: For efficiency, Boeing wanted huge engines on the 737 Max. However, thoroughly redesigning the plane to accommodate the huge engines would have been very costly. So Boeing took the inexpensive route, relying primarily on software. Bad idea.
    • @deadprogrammer: AWS is not about paying for what you use, it’s about paying for what you forgot to turn off.
    • mdbm: I’ve seen a lot of friends publish apps to different ecosystems (e.g. marketplace.atlassian.com, https://apps.shopify.com, etc.) and make a steady profit. Seems like this is a good way to tap into an existing user base (although you share a portion of the revenue).
    • Peter Rüegg: The researchers took it a step further: they created a biological dual-core processor, similar to those in the digital world, by integrating two cores into a cell. To do so, they used CRISPR-Cas9 components from two different bacteria. Fussenegger was delighted with the result, saying: “We have created the first cell computer with more than one core processor.” This biological computer is not only extremely small, but in theory can be scaled up to any conceivable size. “Imagine a microtissue with billions of cells, each equipped with its own dual-core processor. Such ‘computational organs’ could theoretically attain computing power that far outstrips that of a digital supercomputer – and using just a fraction of the energy,” 
    • Laurence Scott: Our ongoing challenge, then, will be to negotiate the inherent inauthenticity and cynicism of an influence economy while preserving our ability to be occupied, and perhaps changed for the better, by the alien ideas of other people.
    • Oscar Schwartz: Perhaps we can take a lesson from the author Edgar Allan Poe. When he viewed von Kempelen’s Mechanical Turk, he was not fooled by the illusion. Instead, he wondered what it would be like for the chess player trapped inside, the concealed laborer “tightly compressed” among cogs and levers in “exceedingly painful and awkward positions.”
    • Polina Marinova: Once the very embodiment of Silicon Valley venture capital, the storied firm has suffered a two-decade losing streak. It missed the era’s hottest companies, took a disastrous detour into renewable energy, and failed to groom its next-generation leadership. Can it ever regain the old Kleiner magic?
    • torpfactory: Let’s just all pause for a moment of silence to ponder just how amazing even the PX Xavier is compared to somewhat recent history: ASCI Red was the first 1 TOP computer. It consumed 850kW of power and cost $46m. It was the fastest supercomputer in the world 19 years ago this June. Now we’re arguing over whether all our cars will be driving around with a computer either 144 or 300x as fast using either 8500x or 1700x less power and costing in the neighborhood of 100,000x less.
    • xibbie: With the shuttle disasters, NASA was at the forefront of science, and the pilots signed up knowing the risks. NASA may have been under time pressure (I’m not familiar with the root causes here, just echoing your reasons), but the deadlines didn’t carry commercial penalties. With the 737 crashes, commercial interests have clearly compromised safety objectives, and the passengers and pilots were BY DESIGN kept unaware of the increased risks.
    • KillerCodeMonky: The tendency to want to “fix” things is sometimes part of the problem. It’s called lava flow pattern, and it’s when your codebase is programmed in waves of different guiding design and architecture. One of the best things you can do in an old codebase is just follow existing patterns and design, even if you don’t agree with them. If you’re going to refactor anyway, make it very obvious where the line is between old and new, so that others know which design to follow where. Preferably putting the new stuff into a separate namespace or even library.
    • OpenAI Five: We were expecting to need sophisticated algorithmic ideas, such as hierarchical reinforcement learning, but we were surprised by what we found: the fundamental improvement we needed for this problem was scale.
    • david gerard: It’s 2019, and, of course, no systems using blockchains for access control of patient data are in production. Because this is snake oil that claims to work around political, social and legal issues using impossible and nonexistent technological magic. The only beneficiaries will be blockchain consultancies. Patient outcomes — which is a number that you have to provide for literally every exciting new medical initiative — will only be negative.
    • Geoff Huston: To understand just how dramatic the change could be for the DNS, it has been observed that some 70% of all queries to the root name servers relate to non-existent top-level names. If all recursive resolvers performed DNSSEC validation and aggressive NSEC caching of just the root zone by default, then we could drop the query rate to the root servers by most of this 70%, which is a dramatic change.
    • Pascal Luban: So that’s the lesson to draw for game designers:  When a game features characters we can identify with and situations we can relate to,  the intensity of the gamer experience is multiplied. This BAFTA reward illustrates a trend I identified several years ago:  The growing role of emotions in games and their consequences on their design and the part of a good narrative.
    • Merv Adrian: The DBMS market returned to double digit growth in 2017 (12.7% year over year in Gartner’s estimate) to $38.8 billion. Over 73% of that growth was attributable to two vendors: Amazon Web Services and Microsoft, reflecting the enormous shift to new spending going to cloud and hybrid-capable offerings. In 2018, the trend grew, and the erosion of share for vendors like Oracle, IBM and Teradata continued. We don’t have our 2018 data completed yet, but I suspect we will see a similar ballpark for overall growth, with the same players up and down as last year. Competition from Chinese cloud vendors, such as Alibaba Cloud and Tencent, is emerging, especially outside North America.
    • @stevecheney: The “getting multiple term sheets” is a myth in startup / VC land. Only about 10% of startups get more than 2 term sheets and the vast majority get just 1.
    • @stevecheney: VCs spend around 1/3 of their time on portfolio co’s, 1/3 sourcing deals and the remaining on misc & fund management. That means your VC – split among say 10 boards – spends about 3% of their day job thinking about you… that’s not a lot, so make it impactful.
    • John DiLullo: Today’s security experts recognize that the typical enterprise is in retreat and not winning the battle. Losses due to cybercrime hit a record in 2018 and it is largely believed that another record will be broken in 2019. People often believe that they are next in line, if they have not already been breached. No one feels totally safe. No one has that much hubris.
    • Peter Newman: Samsung is attempting to take on the likes of Qualcomm and Intel, as well as third-party semiconductor manufacturers like TSMC, with its costly chip plans [$116 billion through 2030]
    • steve cheney: But one thing is very clear: we are entering an era where cars will become autonomous, navigate by themselves and prevent catastrophes from happening at unprecedented scale. And just as smartphones spawned a different set of winners every decade, you can bet the next set of car makers will look much different than today’s. Google, Uber and Tesla will all be involved… And although we know almost nothing of Tesla’s future chip plans, we know they did something uncharacteristic and deeply impressive. They built a hardware and software platform foundation from the ground up, funding and subsidizing it – just barely – based on the incredible dream they are selling today.
    • turlockmike: A nice-to-have feature would be an option on the lambda to set a "Keep-Warm" option. The ability to set a minimum number of warmed instances, as well as being able to set a specific schedule (similar to a cron job schedule), all in exchange for some fee. This would help developers not have to write workarounds for services that need a certain level of responsiveness during cold periods. (Think people doing uncommon things late at night.) A sketch of the usual ping-based workaround appears at the end of this post.
    • Sarah Jackson: We must also recognise [for Mayans] that personhood is a dynamic state. An entity isn’t always or inherently a person. This is kind of wild – not only do we have to keep our eye out for the various persons who might surround us on a daily basis, but we have to be aware that things might be entering or exiting this state.
    • Andy Greenberg: Kaspersky found evidence that the Asus and videogame attacks, which it collectively calls ShadowHammer, are likely linked to an older, sophisticated spying campaign, one that it dubbed ShadowPad in 2017. In those earlier incidents, hackers hijacked server management software distributed by the firm Netsarang, and then used a similar supply chain attack to piggyback on CCleaner software installed on 700,000 computers. But just 40 specific companies' computers, including Asus, received the hackers' second-stage malware infection.
    • @GossiTheDog: Amusing oops from Facebook – Facebook Marketplace included the exact GPS location of sellers in their public Json data by mistake. When told about it through bug bounty, they closed it twice as ‘not an issue’.
    • Joseph Trevithick: Just a little over a month after MD Helicopters unveiled its latest armed helicopter, the MD 969, the company has revealed a new option for the chopper, a seven-round launcher that sits inside the main cabin and pops precision-guided munitions out through a hatch in the rear of the fuselage. The system uses the increasingly popular Common Launch Tube, or CLT, which can already accommodate a wide variety of payloads, including small drones.
    • 013a: Today, if you’re not locked in, you’re leaving business value on the table. I hope that changes in the future, and maybe Kube will be the standard platform we’ve needed to push it forward (for example, I wish I could tell kube “give me a queue with FIFO and exactly once delivery”, it knows what cloud provider you’re on, if you’re on AWS it provisions an SQS queue, if you’re on GCloud it errors because they don’t have one of those yet, and in either case I communicate with it from the app using a standard Kube API, not the aws-sdk). But for now, lean in. Don’t fight the lock-in; every minute you spend fighting it is a minute that you should be spending fighting your competition.
    • Benjamin Seibold: The transition from uniform traffic flow to jamiton-dominated flow is similar to water turning from a liquid state into a gas state. In traffic, this phase transition occurs once traffic density reaches a particular, critical threshold at which the drivers' anticipation exactly balances the delay effect in their velocity adjustment. The most fascinating aspect of this phase transition is that the character of the traffic changes dramatically while individual drivers do not change their driving behavior at all.
    • DSHR: I’ve shown that, whatever consensus mechanism they use, permissionless blockchains are not sustainable for very fundamental economic reasons. These include the need for speculative inflows and mining pools, security linear in cost, economies of scale, and fixed supply vs. variable demand. Proof-of-work blockchains are also environmentally unsustainable. The top 5 cryptocurrencies are estimated to use as much energy as The Netherlands. This isn’t to take away from Nakamoto’s ingenuity; proof-of-work is the only consensus system shown to work well for permissionless blockchains. The consensus mechanism works, but energy consumption and emergent behaviors at higher levels of the system make it unsustainable.
    • @allafarce: Hertz saw they did not have the internal capabilities in tech and digital experience, so they went to a vendor. But they also made that vendor the “product owner”. Seems like a pretty critical misstep. If you build ONE internal capability, product ownership is a good one.
    • Reed Albergotti: Then Thomas realized her daughter’s nightmares were real. In August, she walked into the room and heard pornography playing through the Nest Cam, which she had used for years as a baby monitor in their Novato, Calif., home. Hackers, whose voices could be heard faintly in the background, were playing the recording, using the intercom feature in the software. “I’m really sad I doubted my daughter,” she said.
    • Xavin: Millimeter-wave is really going to only ever be used for wireless connections inside one room where extremely high bandwidth is needed. Stuff like a TV on the wall with the receiver not connected over to the side, or wireless VR headsets. It can also be used for outdoor point to point connections, but the drawbacks mean it won’t ever work well for devices you move around or hold, like cellphones. Part of the confusion with 5G is that in addition to the millimeter-wave stuff that’s all smoke and mirrors as far as cellular service, it’s also just the next improved spec for cellular hardware. So there will be benefits to switching to it, lower latency, higher bandwidth, more security, etc, it’s just going to be evolutionary and not anything most people will notice. Certainly nothing that will sell handsets, which is why they are hyping the millimeter-wave stuff even though everyone with any domain knowledge knows it’s useless for cellular in 99.9% of cases.
    • jdleesmiller: 1. Multiple Rates of Change: Why must changing one module in a monolith take longer than changing the same module in its own service? Perhaps a better use of time would be to improve CI+CD on the monolith. 2. Independent Life Cycles: Ditto. If the tests partition neatly across modules, why not just run the tests for those modules? If they don’t, not running all your tests seems more likely to let bugs through, wasting the investment in writing said tests. 3. Independent Scalability: How bad is it to have a few more copies of the code for account administration module loaded even if it’s not serving a lot of requests? The hot endpoints determine how many servers you need; the cold ones don’t matter much. And load is stochastic: what is hot and what is cold will change over time and is not trivial to predict. If you separate the services, you have to over-provision every service to account for its variability individually, with no opportunity to take advantage of inversely correlated loads on different modules in the monolith. 4. Isolated Failure: Why not wrap the flaky external service in a library within the monolith? And if another persistence store is needed to cache, it is no harder to hook that up to a monolith. 5. Simplify Interactions: Ditto. The Façade pattern works just as well in a library. 6. Freedom to choose the right tech: This is true, but as they say having all these extra technologies comes with a lot of extra dev training, dev hiring and ops costs. Maybe it would have been better to use a ‘second best’ technology that the team already has rather than the best technology that it doesn’t, once those costs are accounted for. The cultural points are largely valid for large organisations, but I feel like it positions microservices as an antidote to waterfall methods and long release cycles, which I think could be more effectively addressed with agile practices and CI+CD. 
    • Josh Waitzkin: And a lot of what I work on with guys is creating rhythms in their life that really are based on feeding the unconscious mind, which is the wellspring of creativity information and then tapping it. So for example, ending the workday with high quality focus on a certain area of complexity where you can use an insight and then waking up first thing in the morning creating input and applying your mind to it, journaling on it. Not so much to do a big brainstorm, but to tap what you’ve been working on unconsciously overnight. Which of course, is a principle that Hemingway wrote about when he spoke about the two core principles in his creative writing process, number one ending the workday with something left to write and — Tim Ferriss: Yeah, often in mid-sentence even. Josh Waitzkin: Right. So not doing everything he had to do. Which most people do, but they feel this sense of guilt if they’re not working. You and I have discussed this at length, but leaving something left to write and then the second principle, release your mind from it. Don’t think about it all night. Really let go. Have a glass of wine. Then wake up first thing in the morning and reapply your mind to it. And it’s amazing because you’re basically feeding the mind complexity and then tapping that complexity or tapping what you’ve done with it. This rhythm, the large variation of it is overnight, and then you can have microbursts of it throughout the day. Before workouts pose a question, do a workout, release your mind after workout, return to it, and do creative bursts. Before you go to the bathroom, before you go to lunch, before anything. And in that way you’re systematically training yourself to generate the crystallization experience, that ah-ha moment that can happen once a month or once a year. A lot of what I do is work on systems to help it happen once a day or four times a day, and when we’re talking about guys who run financial groups of $20 to $30 billion, for example, if they have a huge insight that can have unbelievable value. If you can really train people to get systematic about nurturing their creative process, it’s unbelievable what can happen and most of that work relates to getting out of your own way. It’s unloading. It’s the constant practice of subtraction, reducing friction.
  • Why does Uber test in production? Testing in Production at Scale
    • 600 cities. 64 countries. 75m active riders. 3m active drivers. 15m trips per day. 10b cumulative trips. 1000s of microservices. 1000s of commits per day.
    • Less operational cost of maintaining a parallel stack. One knob to control capacity. No synchronization required. More accurate end-to-end capacity planning. Ensures stack can handle load. Test traffic takes same code path as production traffic. Enables other use cases like Canary, Shadowing, A/B testing. 
    • Tenancy Oriented Architecture. How to make different tenants—testing and production—coexist. Want isolation between test & production, tenancy-based access control, and minimal deviation between test and production environments. Want to use the same stack for both. Must be able to support multiple architectures at the same time. 
    • Tenancy Building Blocks. Attach a tenancy context to everything. Tenancy-aware infrastructure. Tenancy-aware environments. Tenancy-aware routing. (A minimal sketch of tenancy-context propagation appears at the end of this post.)
  • Usenix Fast’19 and NSDI’19 videos are now available
  • Can Luminary become the Netflix of podcasting? Unlikely. The production and distribution costs for podcasts are simply too low for an aggregator to gain leverage. YouTube worked because streaming video is technically hard (and expensive). Audio isn't. Netflix works because creating high quality video content is hard. We already have a near endless supply of high quality niche podcast content. Where can an aggregator add value? One area is in the interface/platform. Podcasting is stuck in the past because nobody is in a position to drive the platform forward. Like for ebooks, podcasts have not enriched the underlying audio data type at all. Ebooks are similarly stunted. The Kindle should be capable of so much more. We have supercomputers in our pockets! What do they do? Display content at the level of stone knives and bear skins. Multimedia apps in the 1990s were far more advanced than what we have today. This is the Apple play. Make a user experience that is so much better that consumers simply want to use it more than any other podcast platform. For example, we can't do a simple thing like share an audio snippet. Podcasts have zero viral loops. And we still can't read the text of a podcast. That's been a solved problem for several years now. Apple is just an RSS feed, they don't care about podcasting as a platform. If someone seriously made the underlying podcast platform better, that might do it. Just aggregating content won't be enough. Also, Epic Games Boss Says They'll Stop Doing Exclusives If Steam Gives Developers More Money
  • This. IAM Is The Real Cloud Lock-In. And this. Your CS Degree Won’t Prepare You For Angry Users, Legacy Code, or the Whims of Other Engineers
  • Which tech stack(s) is AWS built on?  in0pinatus: I am ex-AWS and therefore qualified to comment. Everyone knows that almost everything runs on EC2, but it's a little-known fact that most of the control plane is actually a collection of PHP scripts. However for backwards internal compat reasons they are stuck on PHP4.3. The hardware itself is bought ad-hoc from Fry's whenever needed and AWS uses spare Warehouse Fulfilment Associates to run it over to the DCs. When people say that "Logistics is at the heart of Amazon" this is what they mean. S3 isn't in PHP though, it's actually written in Perl and like all Perl code is littered with star wars in-jokes and data structures invented especially for the purpose. I never want to see another Classified Cumulative Purple Tree as long as I live. Glacier's name is in jest, it's actually a massive array of vinyl records and the name is a pun about the heatsinks used for data erasure; the whole thing is actually designed to help cool the AZs, which is why Glacier launched in hot climate regions first. Jeff did indeed buy a chip factory, but bought it in the UK by mistake and the resulting megashipment of potato crisp products is the reason he had to buy Wholefoods as well. James Hamilton is completely absent from AWS these days, having taken the leadership principle of Own A Ship to the ultimate extreme, and will only be upstaged when Bezos succeeds in his plan – hatched after he misread a six-pager about LP revisions during a particularly gruelling OP session – to Own A Space Ship and Earn Thrust. Finally, and most horrifyingly of all, the ghastly secret behind "Serverless" lambda is that it really doesn't use servers. Lambda actually uses the collective spare brain capacity of every Java programmer trying to fix mysterious Gradle issues on misconfigured ECS containers. When they say "AWS runs on Java" this is what is really meant. Andy Jassy got the idea after a broken Echo read the whole of the Hyperion Cantos to him and wouldn't respond to "ALEXA STOP". Nice!
  • BBC iPlayer: Architecting for TV: we basically moved to the idea of using server side render wherever we could. This meant that we have a hybrid app where some of the logic is in the client, but then a lot of it is HTML, it’s JavaScript chunks, it’s CSS built by the back end and then swapped out in the right places on the client. Performance massively increased, our lower performing devices loved it. Memory Usage really went down. The garbage collection seem to kick in and managed better with that. There was less need for the TAL abstractions. And importantly, we had way less logic in the front end, so it was much easier to test and reason with that. This is before and after. We reduced memory usage quite significantly despite actually tripling the number of images and the complexity of the UI that we had. We were using significantly less memory, performance was better and we were motoring again.
  • Develop Hundreds of Kubernetes Services at Scale with Airbnb: As of a week ago, 50% of our services are now in Kubernetes. Our goal is by the end of H1 this year, for the rest of our services to be in Kubernetes. This is about the services we know, so several hundred services. About 250 of those are in the critical production path. Some of these are more critical than others. But for example, when you go to airbnb.com, and you look at that homepage, that’s a Kubernetes service. When you type into the search bar, “I am looking for homes in London this weekend to stay at,” that’s also a Kubernetes service. Similarly, when you’re booking that Airbnb and making those payments, that’s also going through Kubernetes services. The point of this talk. Kubernetes out-of-the-box worked for us, but not without a lot of investment. There were a lot of issues that we noticed right away and we had to figure out how to solve them. The 10 takeaways that I wrote individually at different points in the talk. We started with the configuration itself, abstracting away that complex Kubernetes configuration. Then one strong opinion we took is standardizing on environments and namespaces, and that really helped us at different levels, especially with our tooling. Everything about a service should be in one place in Git. Once we store it in Git, we can get all these other things for free. So we can make best practices a default by generating configuration and storing that in Git. We can version it and refactor it automatically, which just reduces developer toil and also gets important security fixes, etc., in on a schedule. We can create an opinionated tool that basically automates common workflows, and we can also distribute this tool as kubectl plugin. We can integrate with Kubernetes for the tooling. CI/CD should run the same commands engineers run locally in a containerized environment. You can validate configuration as part of CI and CD. Code and configuration should be deployed with the same process. In that case, that’s our deploy process, custom controllers to deploy custom configuration. And then you can just use custom resources and custom controllers to integrate with your infrastructure.
  • Bug Juice, Amazon Envy: Inside Airbnb’s Engineering Challenges: Airbnb is shifting its code to what is known as a services architecture. Like many startups, Airbnb originally grew its business off a single, interconnected code base written in the coding language Ruby on Rails. That meant that as the company grew, software engineers would have to make updates to the site one at a time, slowing down how quickly the work could progress. The new technical infrastructure will allow engineers to do things like make changes to the code powering messaging between hosts and guests without affecting engineers working on the listing results page. But the technical work is expensive. The company plans to grow its 1,000-person engineering staff by 40% to 50% this year, people briefed on the matter said. The infrastructure changes also are driving up Airbnb’s spending on AWS compute and storage. Airbnb sharply increased its AWS usage in the early part of this year, a person familiar with the matter said. Last year, Airbnb’s AWS bill was more than $170 million, which was in line with what the company projected to spend for the year, the person said. AWS didn’t comment for this article. “We’re just under-resourced. Compared to Uber, we’re completely under-penetrated on engineers,” a person close to the company said. 
  • Tinder’s move to Kubernetes: It took nearly two years, but we finalized our migration in March 2019. The Tinder Platform runs exclusively on a Kubernetes cluster consisting of 200 services, 1,000 nodes, 15,000 pods, and 48,000 running containers. Infrastructure is no longer a task reserved for our operations teams. Instead, engineers throughout the organization share in this responsibility and have control over how their applications are built and deployed with everything as code.
  • James Hamilton has the description of Tesla's new ASIC you've been looking for: Overall, it's a nice design. They have adopted a conservative process node and frequency. They have taken a pretty much standard approach to inference by mostly leaning on a fairly large multiply/add array. In this case a 96×96 unit. What's particularly interesting to me is what's around the multiply/add array and their approach to extracting good performance from a conventional low-cost memory subsystem. I was also interested in their use of two redundant inference chips per car with each exchanging results each iteration to detect errors before passing the final plan (the actuator instructions) to a safety system for validation before the commands are sent to the actuators. Performance and price/performance look quite good.
  • The GRAND stack is GraphQL, React, Apollo, Neo4j. And there's also Elixir, Phoenix, Absinthe, GraphQL, React, and Apollo. benwilson-512: We've also felt the complexity of React and Apollo. It works best when, as others mentioned, you've got distinct teams that can focus on each part. In situations where that isn't the case the same decouplings that make it easier for teams to operate independently just add overhead and complexity. We're in a similar boat these days so in fact our latest projects are back to simple server side rendering, but we're still making the data retrieval calls with GraphQL. It ensures that the mobile app and reporting tools we're also developing will have parity, and we don't need to write ad hoc query logic for each use case. The built in docs and validations have simply proven too useful to pass up, and you really don't need a heavy weight client to make requests.
  • ThoughtWorks Technology Radar Vol. 20. It's a list of different technologies and whether they think you should adopt, trial, assess, or hold on them. For example, in the techniques section you should adopt: Four key metrics; Micro frontends; Opinionated and automated code formatting; Polyglot programming; Secrets as a service. trial: Chaos Engineering; Container security scanning; Continuous delivery for machine learning (CD4ML) models; Crypto shredding; Infrastructure configuration scanner; Service mesh. assess: Ethical OS; Smart contracts; Transfer learning for NLP; Wardley mapping. hold: Productionizing Jupyter Notebooks; Puncturing encapsulation with change data capture; Release train; Templating in YAML. They have sections on Platforms, Languages & Frameworks, and Tools.
  • Good advice on when you should use a ledger vs a blockchain. AmazonWebServices:  QLDB is aimed at applications that require a complete and verifiable record of all changes to the database. Amazon Managed Blockchain is aimed at applications where you have multiple parties that wish to interact through a blockchain. Customers building on QLDB will trust that AWS is faithfully executing their SQL statements to update the current and history views of their data. But once the journal transactions are published, they cannot be changed even by AWS without detection. An example of a customer that might benefit from QLDB is a logistics company. When they receive a shipment from a supplier and forward it to the receiver, they can record the relevant information in QLDB and publish periodic journal digests to all of their customers. Later, if the supplier or receiver claims that the updates didn’t happen in a timely manner and wants to audit the update trail, the company can give them direct access to the journal from AWS and they can verify that the transactions were executed at the time. A value for them is that QLDB can help them prove to their customers that history hasn’t changed.
  • Stanford has The Silicon Genesis collection, which gathers together roughly 100 oral histories and interviews with the people who conceived, built and worked in the semiconductor industry centered in Silicon Valley since the 1950s.
  • While it's important to avoid overoptimism, cynics pointing to the lack of real-world applications are missing a fundamental point. We're starting to see AI take a class of "unsolvable" problems and solve them in a weekend. I'm nervously excited about AI. Nervous about the disruption to society, but excited about the improvements to health and climate change. Change doesn't happen overnight, but when a team has the right algorithms, parameters and scale, it can happen in two days. Takeaways from OpenAI Five (2019): 2019. Rough year for professional gamers. Great year for AI research. OpenAI's Dota 2 and AlphaStar have outright beaten, or almost beaten, the best gamers with limited handicaps…Deep Reinforcement Learning Scales on Some Grand Challenges…OpenAI's 2018 analysis showed the amount of AI compute is doubling every 3.5 months. TI9 is no exception; it had 8x more training compute than TI8. TI9 consumed 800 petaflop/s-days and experienced about 45,000 years of Dota self-play over 10 realtime months…AlphaGo and AlphaStar's breakthroughs are also attributed to scaling existing algorithms…self-play has a large benefit: human biases are removed. It comes with expensive tradeoffs: increased training times and difficulty in finding proper reward signals can prevent model convergence…Impact on Human Psyche – Humans play with hesitation. Bots don't. This is a jarring experience. A professional Go player described it as "like looking at yourself in the mirror, naked". This often results in humans deviating slightly from their normal play styles (likely sub-optimal)….
  • An in-depth summary of Key Takeaway Points and Lessons Learned from QCon London 2019. @danielbryantuk: "Accidental microservices" at Google, via @el_bhs at #qconlondon "Microservices here largely emerged from the requirement of running applications at planetary scale"   @lizthegrey: Because it's so much more expensive to do a remote RPC than do a function call in the same process, we're making two steps back performance-wise when we make one step forward decoupling things. Read Hellerstein's paper for more. #QConLondon @lizthegrey: It's a 1000x performance hit to move things across process boundaries. And it magnifies market dominance of proprietary solutions. Think about when it makes sense to deploy functions to the edge, but proceed with caution for your core performance critical services. #QConLondon
  • How is Fauna different from CockroachDB or Spanner? Learn the secrets in Unpacking Fauna: A Global Scale Cloud Native Database – Episode 78
  • Before TCP/IP a babel of different networking protocols made it so computers from different lands could not talk to each other. Shocking, isn’t it? Relive those barbaric times in this excellent History of DECnet with Dave Oran podcast episode. Crave more history? Can you imagine a time when you would call someone up because they were cussing on the internet? That and more in The History of the ISP Industry With Sonic’s Dane Jasper
  • True, but it shouldn’t be. Storing HD photos in a relational database: recipe for an epic fail
  • Before you turn on HTTP/2, make sure that your application can handle the differences. HTTP/2 has different traffic patterns than HTTP/1.1, and your application may have been designed specifically for HTTP/1.1 patterns. Why Turning on HTTP/2 Was a Mistake: with HTTP/2, the browser can now send all HTTP requests concurrently over a single connection. Multiplexing substantially increased the strain on our servers. First, because they received requests in large batches instead of smaller, more spread-out batches. And secondly, because with HTTP/2, the requests were all sent together—instead of staggered like they were with HTTP/1.1—so their start times were closer together, which meant they were all likely to time out…server support for HTTP prioritization is spotty at best. Many CDNs and load balancers either don’t support it at all, or have buggy implementations, and even if they do, buffering can limit its effectiveness.
  • Beating round-trip latency with Redis pipelining: What is actually different here? Instead of sequentially doing the requests, we pipeline them. This means we send the requests without waiting on the responses. We send as many requests as our machine can, and the responses come in whenever they're ready from the server. This means if we take the same latency figures as above, we can assume the time to retrieve all these ttls is going to be around 100ms! For 40 keys, the latency will probably be a bit higher, but closer to 100ms than 400ms for sure! 400 keys? Much closer to 100ms than 4000ms! (A redis-py sketch of this appears at the end of this post.)
  • Just how expensive is the full AWS SDK?: WebPack improves the Initialization time across the board. Without any dependencies, Initialization time averages only 1.72ms without WebPack and 0.97ms with WebPack. Adding AWS SDK as the only dependency adds an average of 245ms without WebPack. This is fairly significant. Adding WebPack doesn’t improve things significantly either. Requiring only the DynamoDB client (the one-liner change discussed earlier) saves up to 176ms! In 90% of the cases, the saving was over 130ms. With WebPack, the saving is even more dramatic.
  • It's nice to see the use of queues to solve a queue problem instead of using a database. Copy Millions of S3 Objects in minutes: All-in-all, copying a million files from one bucket to another took 8 minutes if they're in the same region. For cross-region testing, it took 25 minutes to transfer the same million files between us-west-1 and us-east-2. Not bad for a solution that would fit easily into your free AWS tier…Using SQS we're able to scale our lambda invocations up in an orderly fashion, which allows our back-ends (e.g. DynamoDB or S3) to also scale up in tandem. Instantly invoking 800 lambdas to read an S3 bucket is a recipe for Throttling and the dreaded ClientError. (A boto3 sketch of the SQS-driven copy worker appears at the end of this post.)
  • I wouldn’t have thought of this case either. Facebook’s AI missed Christchurch shooting videos filmed in first-person. Also, Applied machine learning at Facebook – Kim Hazelwood (Facebook)
  • Interested in side channel analysis? This ChipWhisperer wiki page is for you. ChipWhisperer is an open source toolchain dedicated to hardware security research. 
  • The test is a medium.com clone. But really, how hard is it to display “You read a lot. We like that.”?  A RealWorld Comparison of Front-End Frameworks with Benchmarks (2019 update). Small footprint: Svelte, Stencil, and AppRun. Small code base: ClojureScript with re-frame, AppRun and Svelte. Fastest: AppRun, Elm, Hyperapp.
  • rancher/k3os: a Linux distribution designed to remove as much OS maintenance as possible in a Kubernetes cluster. It is specifically designed to only have what is needed to run k3s. Additionally, the OS is designed to be managed by kubectl once a cluster is bootstrapped.
  • facebook/folly (article):  a 14-way probing hash table that resolves collisions by double hashing. Up to 14 keys are stored in a chunk at a single hash table position. Folly’s F14 is widely used inside Facebook. F14 works well because its core algorithm leverages vector instructions to increase the load factor while reducing collisions, because it supports multiple memory layouts for different scenarios, and because we have paid attention to C++ overheads near the API. F14 is a good default choice — it delivers CPU and RAM efficiency that is robust across a wide variety of use cases.
  • Microsoft/BosqueLanguage: The Bosque programming language is a Microsoft Research project that is investigating language designs for writing code that is simple, obvious, and easy to reason about for both humans and machines. The key design features of the language provide ways to avoid accidental complexity in the development and coding process. The result is improved developer productivity, increased software quality, and enabling a range of new compilers and developer tooling experiences.
  • FAST '19 – Reaping the performance of fast NVM storage with uDepot. Optane can provide ~.06 Mops/sec at ~10 microseconds of latency. Storage is no longer the bottleneck, the network is. The problem is that existing KV stores are built for slower devices: they use sync IO, have data structures with inherent IO amplification, cache data in DRAM, and carry a rich feature set. RocksDB is 7x slower than necessary.
  • FAST ’19 – Optimizing Systems for Byte-Addressable NVM by Reducing Bit Flipping. We were able to reduce the number of bits flipped by up to 3.56× over standard implementations of the same data structures with negligible overhead. We measured the number of bits flipped by memory allocation and stack frame saves and found that careful data placement in the stack can reduce bit flips significantly. These changes require no hardware modifications and neither significantly reduce performance nor increase code complexity, making them attractive for designing systems optimized for NVM.
  • Haxe (video). A language well known in the game industry. Haxe is a strictly-typed (and type-inferring) programming language with a diverse set of influences, including OCaml, Java and ActionScript. Its syntax will be familiar to anyone who’s worked with modern OO languages, however it has features you’d expect in a meta language, such as: everything’s-an-expression, compile-time code manipulation and pattern matching. In addition, it boasts an unusual talent; it can generate code in other programming languages.
  • apache/pulsar: a distributed pub-sub messaging platform with a very flexible messaging model and an intuitive client API.
  • apache/incubator-hudi: manages storage of large analytical datasets on HDFS and serves them out via two types of tables
  • DistCache: Provable Load Balancing for Large-Scale Storage Systems with Distributed Caching: We present DistCache, a new distributed caching mechanism that provides provable load balancing for large-scale storage systems. DistCache co-designs cache allocation with cache topology and query routing. The key idea is to partition the hot objects with independent hash functions between cache nodes in different layers, and to adaptively route queries with the power-of-two-choices. We prove that DistCache enables the cache throughput to increase linearly with the number of cache nodes, by unifying techniques from expander graphs, network flows, and queuing theory. DistCache is a general solution that can be applied to many storage systems. We demonstrate the benefits of DistCache by providing the design, implementation, and evaluation of the use case for emerging switch-based caching. (A toy sketch of the power-of-two-choices routing idea appears at the end of this post.)
  • Liquid brains, solid brains: How distributed cognitive architectures process information: Other systems are formed by sets of agents that exchange, store and process information but without persistent connections or move relative to each other in physical space. We refer to these networks that lack stable connections and static elements as ‘liquid’ brains, a category that includes ant and termite colonies, immune systems and some microbiomes and slime moulds. What are the key differences between solid and liquid brains, particularly in their cognitive potential, ability to solve particular problems and environments, and information-processing strategies? To answer this question requires a new, integrative framework.
  • Modular structure within groups causes information loss but can improve decision accuracy: We find that modular structure necessarily causes a loss of information, effectively silencing the input from a fraction of the group. However, the effect of this information loss on collective accuracy depends on the informational environment in which the decision is made. In simple environments, the information loss is detrimental to collective accuracy. By contrast, in complex environments, modularity tends to improve accuracy. This is because small group sizes typically maximize collective accuracy in such environments, and modular structure allows a large group to behave like a smaller group (in terms of its decision-making). These results suggest that in naturalistic environments containing correlated information, large animal groups may be able to exploit modular structure to improve decision accuracy while retaining other benefits of large group size.
  • NEEXP in MIP*: The main result of this work is the inclusion of NEEXP (nondeterministic doubly exponential time) in MIP*. This is an exponential improvement over the prior lower bound and shows that proof systems with entangled provers are at least exponentially more powerful than classical provers.
  • Maybe storing important stuff in public isn’t a very future-proof idea? The Blockchain Bandit: Finding Over 700 Active Private Keys On Ethereum’s Blockchain: In this paper we examine how, even when faced with this statistical improbability, ISE discovered 732 private keys as well as their corresponding public keys that committed 49,060 transactions to the Ethereum blockchain. Additionally, we identified 13,319 Ethereum that was transferred to either invalid destination addresses, or wallets derived from weak keys that at the height of the Ethereum market had a combined total value of $18,899,969. In the process, we discovered that funds from these weak-key addresses are being pilfered and sent to a destination address belonging to an individual or group that is running active campaigns to compromise/gather private keys and obtain these funds. On January 13, 2018, this “blockchainbandit” held a balance of 37,926 ETH valued at $54,343,407.
  • Compress Objects, Not Cache Lines: An Object-Based Compressed Memory Hierarchy: Zippads transparently compresses variable-sized objects and stores them compactly. As a result, Zippads consistently outperforms a state-of-the-art compressed memory hierarchy: on a mix of array- and object-dominated workloads, Zippads achieves 1.63× higher compression ratio and improves performance by 17%.
  • Using Fault-Injection to Evolve a Reliable Broadcast Protocol: This is the first article in a series about building reliable fault-tolerant applications with Partisan, our high-performance, distributed runtime for the Erlang programming language. As part of this project, we will start with some pretty simple protocols and show how our system will guide you in adjusting the protocol for fault-tolerance issues
  • Seven Sketches in Compositionality: An Invitation to Applied Category Theory: The purpose of this book is to offer a self-contained tour of applied category theory. It is an invitation to discover advanced topics in category theory through concrete real-world examples. Rather than try to give a comprehensive treatment of these topics—which include adjoint functors, enriched categories, proarrow equipments, toposes, and much more—we merely provide a taste of each. We want to give readers some insight into how it feels to work with these structures as well as some ideas about how they might show up in practice.
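
The routing idea in the DistCache item above is compact enough to sketch. The following Python is a conceptual illustration only, not the paper’s implementation: the CacheNode class, the salted hashing, and the layer sizes are assumptions made for the example.

    import hashlib

    class CacheNode:
        # Toy cache node that tracks how many queries it has served.
        def __init__(self, name):
            self.name = name
            self.load = 0
            self.store = {}

        def get(self, key):
            self.load += 1
            return self.store.get(key)

    def node_for(key, salt, nodes):
        # Independent hash per layer: salting the key makes the two layers
        # partition the hot objects differently.
        digest = hashlib.sha256(f"{salt}:{key}".encode()).digest()
        return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]

    def route(key, upper_layer, lower_layer):
        # Power-of-two-choices: send the query to the less-loaded of the
        # two candidate cache nodes, one from each layer.
        candidates = (node_for(key, "upper", upper_layer),
                      node_for(key, "lower", lower_layer))
        return min(candidates, key=lambda node: node.load)

    upper = [CacheNode(f"U{i}") for i in range(4)]   # upper-layer caches
    lower = [CacheNode(f"L{i}") for i in range(8)]   # lower-layer caches
    value = route("hot-object-42", upper, lower).get("hot-object-42")
    # (returns None here; the toy nodes start empty)

Because the two layers hash independently, a key that lands on a busy node in one layer usually has a lightly loaded alternative in the other layer, which is what lets aggregate cache throughput grow with the number of cache nodes.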

from High Scalability

Stuff The Internet Says On Scalability For April 19th, 2019

Wake up! It’s HighScalability time:

Spirit? Smoke? Lightning? Nope. It’s a gorgeous LIDAR image showing 1500 years of Willamette River movement (@Blacky_Himself)

 

Do you like this sort of Stuff? I’d greatly appreciate your support on Patreon. I wrote Explain the Cloud Like I’m 10 for people who need to understand the cloud. And who doesn’t these days? On Amazon it has 44 mostly 5 star reviews (102 on Goodreads). They’ll learn a lot and hold you in awe.

  • 536: IRS tax return submissions per second; 400,000: drone-planted trees in a day; 200 million: smart speaker observers installed by year end; 54 million: GoT pirated in first 24 hours; 123,052: kg of crashed human spacecraft on the surface of the moon; 610 pounds: 128 kilobytes of IBM S/360 core memory; 33%: per account month over month Lambda function growth; $100,000: Netflix bug bounty payout; $1 million: Shopify bug bounty payout; ~$2300: cost to transfer 23TB from S3 to Backblaze B2 in 7 hours; 14%: Netflix users share passwords; 88%: believe P != NP; 30-90: minutes saved by StackOverflow per week; 83%: US teens have an iPhone; $2m: Microsoft bug bounty payout; $1 million: made by Colin Cowherd on 331 million Facebook page views, moving more to Instagram; 0: lines of code in Pong; ~95%: Redis is slower when GDPR compliant.
  • Quotable Quotes:
    • melodysheep: The universe has only just begun.
    • @matthew_d_green: I spent the year before Heartbleed visiting important people in DC trying to convince them OpenSSL was a mess, and they should fund it as “critical infrastructure”. They laughed and told me that term referred to dams and power plants.
    • Tim Cook: No
    • @asymco: Among 8,000 U.S. high school students surveyed, 83% have an iPhone, 9% Android. 86% plan their next phone to be an iPhone. -Piper Jaffray Taking Stock With Teens survey
    • @dialtone: Do you know S3 throughput is higher than a SATA 3 controller and almost as much as PCIe 4.0 2x? Depending on the usecase, mounting S3 as a filesystem, not only makes sense, but it saves a LOT of money (and time) as well. You just need to use a special kind of filesystem.
    • Steven Melendez: By 2022, automation will displace about 75 million jobs worldwide. On the other hand, they will create an estimated 133 million new jobs. The predictions come from extrapolating from surveys sent to more than 300 major employers around the world. 
    • Charlie Demerjian: Then Apple suddenly caved. And paid Qualcomm handsomely for their troubles. And agreed to buy 5G modems from Qualcomm for the next 6-8 years. Qualcomm beat Apple like a drum because they were in the right and Apple was in the wrong. But why did Apple fold now? Why didn’t they drag things out for another few decades in court through appeals, delays, and the spurious tactics they deployed against Samsung? There is a really good reason for that, Apple was screwed and would have lost the iPhone market if they had waited any longer. Qualcomm had Apple over a barrel and just had to wait. The longer Apple postured and threatened, the stronger Qualcomm’s position got, and likely the bigger the check Apple had to write was.
    • @jolson88: I think the half-page of code in the LISP 1.5 manual about its metacircular evaluator would have to be included. I agree with Alan Kay that it’s basically as close as we’ll ever get to a “Maxwell’s Equations for Software”.
    • HBR: Thales Teixeira, associate professor at Harvard Business School, believes many startups fail precisely because they try to emulate successful disruptive businesses. He says by focusing too early on technology and scale, entrepreneurs lose out on the learning that comes from serving initial customers with an imperfect product. He shares how Airbnb, Uber, Etsy, and Netflix approached their first 1,000 customers very differently, helping to explain why they have millions of customers today.
    • Google: We were happy to find no difference in the effectiveness, performance ratings,  or promotions for individuals and teams whose work requires collaboration with colleagues around the world versus Googlers who spend most of their day to day working with colleagues in the same office. 
    • Bob Sutton: People make better decisions when they look into the future and they imagine that they already failed, and they tell a story about what happened. With better planning, it won’t be a story that has to be bleeped out.
    • ecnahc515: This is my experience with most systems that send webhooks, in particular payments, and subscription management systems. As you’ve elaborated on in other comments, queuing and periodic retries are generally the best way to handle interacting with what is effectively, an eventually consistent API/system.
    • Twitter: This was a pretty big investigation in the end that included a few engineers and multiple teams but a 25% reduction in [Redis] cluster size is a pretty nice result
    • @colmmacc: I think right around this minute is just about exactly 5 years since the Heartbleed vulnerability in OpenSSL became public. I remember the day vividly, and if you’re interested, allow me to tell you about how the day, and the subsequent months, and years unfolded …
    • kitsunesoba: IMO, every attempt thus far has approached cross platform entirely the wrong way. They’re all way too focused on providing deeply custom UI. I believe that it may better to instead abstract app navigation as its own separate thing, making it a setting in a config file (e.g. tabbed, hamburger, split-pane tabbed, etc) and restricting UI coding to individual screens, which would have style hinting abilities and could be built via any number of platform agnostic (and perhaps language agnostic) ways. All this would then compile down to native UIKit, Android SDK, UWP, AppKit, etc.
    • @QuinnyPig: Confluent, Datastax, Neo4j, MongoDB, InfluxData, Elastic. It’s a who’s-who of who AWS has stomped on recently. Partnering with GCP seems a lot less perilous…#GoogleNext19
    • Erik Bernhardsson~ Why software tasks always take longer than you think? While the median blowup factor imputed from this fit is 1x (as before), the 99th percentile blowup factor is 32x, but if you go to the 99.99th percentile, it’s a whopping 55 million! One (hand wavy) interpretation is that some tasks end up being essentially impossible to do. In fact, these extreme edge cases have such an outsize impact on the mean, that the mean blowup factor of any task ends up being infinite. This is pretty bad news for people trying to hit deadlines!
    • @MayaKaczorowski: “When Google researchers discovered the Spectre vulnerability, we used live migration to patch every single GCP server with zero downtime for our users. So our customers were protected even before they knew they needed to be” – Brad Calder #GoogleNext19
    • Ted Kaminski: We’re composing things together when we can reason compositionally about the result, and we’re extending when we need non-compositional reasoning.
    • Gregory Travis: I believe the relative ease—not to mention the lack of tangible cost—of software updates has created a cultural laziness within the software engineering community. Moreover, because more and more of the hardware that we create is monitored and controlled by software, that cultural laziness is now creeping into hardware engineering—like building airliners. Less thought is now given to getting a design correct and simple up front because it’s so easy to fix what you didn’t get right later.
    • DSHR: The peer-to-peer architecture of the LOCKSS system is unusual among digital preservation systems for a specific reason. The goal of the system was to preserve published information, which one has to assume is covered by copyright. One hour of a good copyright lawyer will buy, at [2014] prices, about 12TB of disk, so the design is oriented to making efficient use of lawyers, not making efficient use of disk.
    • NASA: Given that the majority of the biological and human health variables remained stable, or returned to baseline, after a 340-day space mission, these data suggest that human health can be mostly sustained over this duration of spaceflight.
    • royjacobs: What I find unfortunate about infrastructure-as-code tooling is that a lot of the tooling isn’t actually using code, but instead uses esoteric configuration languages. Indeed, the article refers to Terraform with its custom syntax. Imho tools should use actual code (whether it’s TypeScript or Kotlin or whatever) instead of reinventing constructs like loops and string interpolation. Thankfully these tools are getting more popular, because frankly I can’t stand configuring another Kubernetes or GCP resource using a huge block of copy/pasted YAML.
    • Mathew Cherukara: The beauty is that this molecular model has no right to be as accurate as the atomistic models, but still ends up being so.
    • Geoff Huston:  By pushing client-side DNS queries into HTTPS the Internet itself has effectively lost control of the client end of DNS, and each and every application, including the vast array of malware, can use DOH [DNS over HTTPS] and the DNS as a command and control channel in a way that is undetectable by the client or client’s network operator. Much of today’s malware containment frameworks, including DNS firewalling, are rendered useless by DOH. Whether or not the browser has DOH enabled by default, applications can generate DOH requests for DNS resolution in a manner that bypasses today’s DNS-based malware containment mechanisms. As has been recently observed on a DOH-related mailing list: “Pandora’s box is now open and DOH has escaped, and there seems to be little we can do about it now. The times they are a changing.”
    • @AssaadRazzouk: Incredible Shrinking Battery Costs: Electric vehicles crossover point – when EVs  are cheaper than their polluting equivalents — was 2026 in 2017, 9 years out; then 2024 in 2018; and now 2022, just 3 years out. On current trends, expect parity next year
    • @obra: 6 hours later, I have replaced my 15 line perl script with a 500 meg Docker image and a few hundred lines of python. It works almost as well as what I had before, too!
    • @srhtcn: #Serverless 5 takeaways: – Still need great engineers. – LessOps, not NoOps – Cheaper, even at scale – Can do many use-cases – Vendor lock-in is often a myth
    • Kieren McCarthy: So what does Polystream do instead? It streams game data and leaves the graphics processing to your device’s GPU to sort out. The result? A fraction of the cost per user.
    • @mcmillen: One of the worst management red-flags that I ever saw at Google was the time a large team was forcibly re-orged into Nest & were told that they weren’t allowed to transfer out for N months. Not the best way to motivate engineers & forever tarnished my opinion of Nest leadership.
    • Mike Titus: Don’t tell me that Amazon Cloud and Google Cloud, they wouldn’t love to have our [Black-Hole Picture] data and store it for us. Too much data and too much money—that’s why we don’t do it that way. Nothing beats the bandwidth of a 747 filled with hard disks.
    • @abbyfuller: someone just non-ironically suggested sharding database volumes across containers and i have heartburn now
    • @mipsytipsy: If you’re scared of pushing to production on Fridays, I recommend reassigning all your developer cycles off of feature development and onto your CI/CD process and observability tooling for as long as it takes to ✨fix that✨.
    • @kelseyhightower: The OS abstracts away the machine while leaking hardware faults. Docker abstracts away the OS while leaking software faults. Kubernetes abstracts away multiple machines while leaking the distributed system faults. We are the plumbers.
    • Jeffrey Zeldman: And internet investors don’t want a modest return on their investment. They want an obscene profit right away, or a brutal loss, which they can write off their taxes. Making them a hundred million for the ten million they lent you is good. Losing their ten million is also good—they pay a lower tax bill that way, or they use the loss to fold a company, or they make a profit on the furniture while writing off the business as a loss…whatever rich people can legally do under our tax system, which is quite a lot.
    • Gill Lee: Emerging memories like MRAM, PCRAM and ReRAM are leading candidates to complement and, in some cases, replace today’s mainstream technologies—promising higher performance, lower power and lower cost.
    • @davidbrunelle: The Starbucks web team deploys to production on a daily basis. We made a conscious decision to optimize for the ability to catch and resolve issues quickly. Most new features are either toggled off, or kept out of the production build until we want them exposed to customers. Shipping more frequently has a few benefits for us: 1. Each deployment is limited in size. Less risk and easier to isolate issues. 2. Deployments become routine and low stress. Almost non-events. 3. Much faster feedback cycles. 4. Everyone on the team becomes familiar with prod.
    • Ahmed Kabil: After the French Revolution, when large parts of the cathedral were desecrated and damaged, oak trees from Versailles were used to rebuild it. The oaks that were planted thereafter were intended to be used to help rebuild Notre Dame, should it become necessary in the future.
    • Paul Johnston: A serverless application is one that provides maximum business value over its application lifecycle and that costs you nothing to run if nobody is using it, excluding data storage costs.
    • David Baker: Humans have only been able to harness the power of proteins by making very small changes to the amino acid sequences of the proteins we’ve found in nature. This is similar to the process that our Stone Age ancestors used to make tools and other implements from the sticks and stones they found around them.
    • Erika Hamden: I work on FIREBall because what I want is to take our view of the universe from one of mostly darkness, with just light from stars, to one where we can see and measure nearly every atom that exists.
    • BRIAN GALLAGHER: Another person’s mind comes through their mouth. 
    • Geoff Tate: In the edge, you may have one camera, which could be a surveillance camera, a camera in your robot or your set-top box, and it’s processing one image. Any architecture that uses large batch sizes to get high throughput is disqualified in the edge. You should be able to do a good job with processing one image at a time, which is also known as batch size equals one. To be in an edge device, you’ve got to be single-digit watts. You’re not going to put an Nvidia Tesla T4 card at $2,000 and 75 watts into your surveillance camera because it’s too much power. But the people at the edge want to do real-time. Lots of detection and recognition, and processing bigger images on the tougher models, is what gives them better prediction accuracy. They want to get as much throughput as they can or fewer watts for their dollar budget.
    • Rob Matheson: MIT researchers have designed a novel flash-storage system that could cut in half the energy and physical space required for one of the most expensive components of data centers: data storage. In a paper being presented at the ACM International Conference on Architectural Support for Programming Languages and Operating Systems, the researchers describe a new system called LightStore that modifies SSDs to connect directly to a data center’s network — without needing any other components — and to support computationally simpler and more efficient data-storage operations.
    • Matt Hayes: Using what they call DASH (DNA-based Assembly and Synthesis of Hierarchical) materials, Cornell engineers constructed a DNA material with capabilities of metabolism, in addition to self-assembly and organization – three key traits of life.
    • Paul McLellan: It turns out that a similar approach seems to work for photonic designs, although it goes under the name inverse design: you say what you want the photonic device to do, and then the system experiments until it hones in on an approach that works. It also turns out that, like the chair example I pictured above, you end up with designs no human would come up with. They are literally superhuman designs.
    • @fcosta_oliveira: just finished plotting the # io-threads vs RPS on @antirez Redis threaded-io branch single instance benchmark results, using TCP loopback and pipelining, on @GCPcloud n1-highcpu-96. Near 300K OPS on GETs. If we use pipelining we surpass 1.2M OPS. (1/3)
    • @Carnage4Life: There’s a book by Eric Schmidt where he mentions Sundar had the idea to recoup acquisition cost of Google Earth by having it install Google toolbar & hijack search defaults. They made hundreds of millions. This stuff is literally my job to know. 
    • Jennifer Valentino-DeVries: Law enforcement officials across the country have been seeking information from a Google database called Sensorvault — a trove of detailed location records involving at least hundreds of millions of devices worldwide
    • @kelseyhightower: FaaS Monolith: a collection of functions disguised as nano services, behind a single API gateway, leveraging the same database.
    • Maggie Koerth-Baker: To get a picture of the black hole, itself, the EHT project used a network of 10 Earthbound radio telescopes, linked together to function as a single system. The telescopes collected high-frequency radio waves from space, and four independent teams of scientists used algorithms to convert the radio signals into visual images.
    • @cgervais: One of our [Kyrus] Engineering teams started with #serverless last summer with @awscloud Lambda services and it turned out super-successful. So much so that two more teams are migrating their services as well. If you’re interested, we’re doing more with #serverless
    • @IamStan: Google Cloud Run: For those people that want to go Serverless, but really can’t let go of their Dockerfiles. Playing with it this morning… Initial thoughts… To use it, you need to create a Dockerfile, and you need to handle  HTTP traffic (ie run Express or some sort of http server)… These are two things you don’t need with Lambda or other Cloud Function platforms. This will make transition easier for developers who are used to running their own http server, and are comfortable with Docker
    • @davidgerard: “I used to work at Tumblr, the entirety of their user content is stored in a single multi-petabyte AWS S3 bucket, in a single AWS account, no backup, no MFA delete, no object versioning. It is all one fat finger away from oblivion.”
    • @newstodayohboy: If YouTube or Facebook are distributed computer systems where the CPUs are people, then a bad actor thinks of finding exploits in the people… cognitive biases become as important as programming errors.
    • Brent Ozar: 15 Reasons Your Query Was Fast Yesterday, But Slow Today. There are different workloads running on the server (like a backup is running right now.) You have a different query plan due to parameter sniffing. The query changed, like someone did a deployment or added a field to the select list…
    • R Danes: At the Cloud Next conference in San Francisco yesterday, Google signaled it’s ready to meet customers where they are with ready-to-wear products. It is tackling a main enterprise concern — hybrid and multicloud — with new offerings like the Anthos platform for managing applications across environments.
    • lget: I am not surprised. I was working in an open-plan office for a few years, and I also preferred talking to colleagues sitting only a few meters away via mail. It’s incredibly stressful to talk to a person face-to-face knowing that everyone on the team listens to the conversation. Not only do you have to constantly weigh and evaluate your sentences, you also have the constant feeling that you are disturbing other team members just by talking. On the other hand, you also don’t want to drag a person into another room for privacy if you just want some quick update on something.
    • @tmclaughbos: If you’re building custom software to be given to and run by a client, and you want to go serverless, I’m going to suggest you think long and hard about the operability of that software by your client. Hint: You should probably just stick code into a container and hand them a docker image. It’ll be a lot easier for you and the customer.
    • @CJHandmer: Do you ever lie awake at night and wonder how much cargo a SpaceX Starship could deliver to the moon, and then from the moon back to the Earth, without refueling on the Moon? Lots!! Amidst all the recent NASA Moon chat it’s easy to overlook just how transformative Starship is.
    • @antirez: Because who will replace us needs to find something inside herself/himself. Role models are not really sufficient or even needed, we need a way to understand how to put inside our children that thing that will make them adults that want to *do something*. The rest will follow.
    • C4ADS: found that GNSS spoofing activities in the Russian Federation, its occupied territories, and its overseas military facilities are larger in scope, more geographically diverse, and started earlier than any public reporting has suggested to date. Reports by CNN and the RNT Foundation identify fewer than 450 vessels affected since late 2016. Using Automatic Identification System (AIS) ship location data collected at scale, C4ADS identified 9,883 instances of GNSS spoofing that affected 1,311 commercial vessels beginning in February 2016. The disruptions appear to have originated from ten or more locations in Russia and Russian-controlled areas in Crimea and Syria.
    • @johnath: Over and over. Oops. Another accident. We’ll fix it soon. We want the same things. We’re on the same team. There were dozens of oopses. Hundreds maybe? I’m all for “don’t attribute to malice what can be explained by incompetence” but I don’t believe google is that incompetent.
    • @hcoyote: You have to admit … watching AWS Red Wedding is simultaneously horrible and amazing. Horrible in that  … all those poor people seeing their hopes and dreams crushed. Amazing in that … I can’t believe they get away with it. I can’t bring myself to look away.
    • IRS: We have about 60 different applications. I think we have about 12,000 or 13,000 servers on 12 mainframes. It’s difficult to continually patch. At some point, we need to replace and we’re definitely at that point. We’re as well posed as I think we’ve ever been to — and I was somewhat familiar with this when I was on the outside — but ours is as well posed as it’s ever been to be able to modernize both the infrastructure as well as the language so we’re moving forward with the ability to be agile and flexible as newer technologies come along.
    • MICHAEL FORTE: On top of this idiosyncratic population bias due to uneven population growth rates, there is a more persistent early adopter bias. These early adopters tend to be much more tech-savvy than the general population, trying out new products to be on the cutting edge of technology. This tech-savvy population desires features that can be detrimental to the target population. In our music example, tech-savvy users will want to select the specific bit-rate and sampling frequency of the song they are buying, but forcing our target population through this flow would lead to confusion and decreased conversion rates.
    • rchaud: FB’s public comments about these remind me a lot of the “5 Standard Excuses” scene in the ’80s BBC sitcom Yes Minister, where a civil servant lists the best CYA mea culpas for politicians to use when something goes wrong. 1. It occurred before certain important facts were known, and couldn’t happen again 2. It was an unfortunate lapse by an individual, which has now been dealt with under internal disciplinary procedures. 3. There is a perfectly satisfactory explanation for everything, but security forbids its disclosure. 4. It has only gone wrong because of heavy cuts in staff and budget which have stretched supervisory resources beyond their limits. 5. it was a worthwhile experiment, now abandoned, but not before it had provided much valuable data and considerable employment
  • What can a single person do these days? So much. Designing a modern serverless application with AWS Lambda and AWS Fargate: Recently I built and open sourced a sample application called changelogs.md. The application watches for open source packages on NPM, RubyGems, and PyPI. When a package is added or updated it crawls any changelog found in the package source code.
    • This is a crawler built in a serverless environment that makes full use of AWS services: CloudFront, ALB, S3, Fargate, ElastiCache, Redis Pub/Sub, SNS, API Gateway, Lambda, DynamoDB. Redis Pub/Sub messages get published back to your browser via a Docker container running Socket.io on AWS Fargate.
    • Why Fargate? There’s no reason to make an all-in choice on only one type of compute. Run code where it runs best. In changelogs.md short lived compute jobs are all being run in AWS Lambda. Long compute jobs that have no designated end run on Fargate. A design point you might not have considered is isolation: But it’s easy to imagine that someone may try to attack this component by creating a massive changelog that is gigabytes in size, or take advantage of an edge case in the parser to consume large amounts of resources and DDOS the parsing code. So the Lambda model of execution is perfect for these jobs because it keeps the invokes isolated from each other. If someone triggers a crawl on a malicious changelog that single Lambda execution might timeout but the rest of the system won’t experience any impact, because the individual Lambda invokes are separated from each other and the rest of the stack.
    • As a comparison, here’s The Anatomy of a Large-Scale Hypertextual Web Search Engine by Sergey Brin and Lawrence Page. You might recognize those names. Obviously a search engine is different, but there’s an essential complexity related to crawling that both share.
    • Another interesting aspect is the use of AWS Cloud Development Kit (CDK) instead of CloudFormation: I find that CDK is at the right level of abstraction for my applications: it doesn’t stop me from using the core cloud native services that I want to use, and it doesn’t require me to write a ton of boilerplate to use those services.
    • Did it work? Depends on the goals. The goals were minimal ops and minimal costs. Those were met: On average the price to run the service is about $3.80 a day. For about the price of my daily cup of coffee I can have an application crawling more than 80k open source repos a day. Here are some more fun stats: every month the application uses 40 million DynamoDB read units and 10 million write units, serves several 100k website visits using S3 and CloudFront, does about one million Lambda invokes and delivers over 500k notifications via SNS. It has two Docker containers and a small Redis instance running 24/7 all month. And remember all of this costs less than $4 a day!
  • Only star football players have the leverage necessary to demand a new contract. Not everyone has the leverage to port their social network over to a new platform, but it’s time content producers realize distribution is not the key part of the value chain anymore—they are. Distribution is free, global, and direct to customers over the internet. Unbundle yourself. You are the only differentiated source of you. You are the app. Disaggregate the aggregators! Use social networks like Instagram as on ramps to your own differentiated social network of you. Influencers are flocking to a surprising new kind of social media.
    • 350+ influencers with a collective audience of 3.5 billion people are flocking to a platform called Escapex, which gives them their own apps. It’s part of the next wave of social media focused on smaller, more private groups
    • For Abigail Ratchford, a glam model who posts sexy images of herself on Instagram to an audience of 8.9 million, this control has freed her up to share as much content as she wants with her biggest fans, who pay $9.99 a month for the privilege of accessing it, without living in fear of the Instagram algorithm
    • These personal apps aren’t just for influencers. Osric Chau, an actor who is best known for his role in the cult TV show Supernatural [a very minor character], has 624,000 followers on Instagram and launched his app in January 2018. He uses it for a very different purpose: to connect with his fans more intimately. While he charges $4.99 for a subscription
    • One key feature of Escapex apps is that fans like Enna earn points based on how much they engage with the app, including the fan feed (points are also available for purchase). Then, they can use those points to boost their comment on one of Chau’s posts so that he’s guaranteed to see it. According to Shapira, that’s the primary way that celebs like Jeremy Renner monetize their apps–all the content is free, but people pay to be seen
    • Being a superfan by yourself is a diminished experience. Imagine a huge sports fan who lives in a country where no one follows the sport. It’s part of your identity and connecting with others who share that identity is really important. It’s the major driver of fandom, which is why you see the same processes across all these different celebrities
    • With the future of major social networks unclear, going independent on social media is a savvy financial bet. “I would not be surprised at all if there’s a handful of people who are making crazy livings on apps in the future”
  • Instagram promises to tell you how they hold back the thundering herd. Thundering Herds & Promises: At Instagram, when turning up a new cluster we would run into a thundering herd problem as the cluster’s cache was empty. We then used promises to help solve this: instead of caching the actual value, we cached a Promise that will eventually provide the value. When we use our cache atomically and get a miss, instead of going immediately to the backend we create a Promise and insert it into the cache. This new Promise then starts the work against the backend. The benefit this provides is other concurrent requests will not miss as they’ll find the existing Promise — and all these simultaneous workers will wait on the single backend request. The net-effect is you’re able to maintain your assumptions around caching of requests. Assuming your request distribution has the property that 90% are cache-able, then you’d maintain that ratio to your backend even when something new happens or your service is restarted. A minimal sketch of this promise-caching pattern appears at the end of this list.
  • 3DXPoint’s most intriguing application is as a byte-addressable persistent memory that user space applications map into their address space (with the mmap() system call) and then access directly with loads and stores. Early Measurements of Intel’s 3DXPoint Persistent Memory DIMMs: The most critical difference between 3DXPoint and DRAM is that 3DXPoint has longer latency and lower bandwidth. Load and store performance is also asymmetric. On average, random loads take 305 ns compared to 81 ns for DRAM. For sequential loads, latencies are 169 ns, suggesting some buffering or caching inside the 3DXPoint DIMM. [Write] latency is 94 ns for 3DXPoint compared to 86 ns for DRAM. For reads (at left), bandwidth peaks at 39.4 GB/s. For writes (at right), it takes just four threads to reach saturation at 13.9 GB/s. Performance for read and write rises quickly until access size reaches 256 B and slowly climbs to a peak of 1.5 GB/s for stores and 2.8 GB/s for loads. 256 B is 3DXPoint’s internal block size. It represents the smallest efficient access granularity for 3DXPoint. Loads and stores that are smaller than this granularity waste bandwidth as they have the same latency as a 256 B access. Stores that are smaller also result in write amplification since 3DXPoint writes at least 256 B for every update, incurring wear and consuming energy. The figure also shows the benefits of building native, memory-mapped, persistent data structures for Redis and RocksDB. The impact varies widely: performance for RocksDB increases by 3.5×, while Redis 3.2 gains just 20%. A minimal sketch of the mmap() access pattern appears at the end of this list.
  • Taking programming up a notch. Facebook built a recommendation engine that shows how other programmers solved similar problems. Aroma: Using machine learning for code recommendation. Something like this for GitHub would be great. Though it would be interesting to know how well it works in practice. Do Facebook developers actually find it useful?
  • We now have more ways than ever of packaging up and running bits of serverless functionality. Lucet from Fastly is the most recent: “With Lucet, Fastly’s edge cloud can execute tens of thousands of WebAssembly programs simultaneously, in the same process, without compromising security. The Lucet compiler and runtime work together to ensure each WebAssembly program is allowed access to only its own resource.” It doesn’t use a VM, container, or isolate, it: @jedisct1: “Not a V8 sandbox. This is a new runtime, written from scratch in Rust, specifically designed for concurrency and low latency. It’s open source, and included in Lucet.” @nathankpeck~ “I think the future of serverless is the exact opposite direction. Right now the most interesting platform I’ve seen is Fastly Lucet …Long story short it compiles Rust, TypeScript, C, and C++ to WASM which executes in a V8 sandbox, only 50micros of overhead.” It’s still early days so we don’t know much about how it works in practice, but it’s good to see competition. Also, Running Unikernels in 2019 with OPS. Also also, Wasmer is taking WebAssembly beyond the browser.
  • Basic but good intro to the Google Network. The High-Performance Network
  • How Airbnb Avoids Double Payments in a Distributed Payments System: There are three different common techniques used in distributed systems to achieve eventual consistency: read repair, write repair, and asynchronous repair. Our solution in this particular post utilizes write repair, where every write call from the client to the server attempts to repair an inconsistent, broken state…Write repair requires clients to be smarter (we’ll expand on this later), and allows them to repeatedly fire the same request and never have to maintain state (aside from retries). Clients can thus request eventual consistency on-demand, giving them control over the user experience. Idempotency is an extremely important property when implementing write repair…For an API request to be idempotent, clients can make the same call repeatedly and the result will be the same…We implemented and utilized “Orpheus”, a general-purpose idempotency library, across multiple payments services…An idempotency key is passed into the framework, representing a single idempotent request. Tables of idempotency information, always read and written from a sharded master database (for consistency). Database transactions are combined in different parts of the codebase to ensure atomicity, using Java lambdas. Error responses are classified as “retryable” or “non-retryable”. A minimal sketch of the idempotency-key pattern appears at the end of this list.
  • Looks like Murat has made good use of his sabbatical because this is an awesome description of Azure Cosmos DB: Microsoft’s Cloud-Born Globally Distributed Database: To realize [Global distribution, Elastic scalability of throughput and storage worldwide, Fine grained multi-tenancy] Cosmos DB uses a novel nested distributed replication protocol, robust scalability techniques, and well-designed resource governance abstractions, and I will try to introduce these next. Good discussion on HN.
  • What is Taking Serverless to the Next Level? It involves lots and lots of configuration files. We have all these UIs that do so much checking and cross linking and the best practice is to type text into configuration files. Configuration Driven Programming requires programmers to build a runtime model of the entire AWS ecosystem in their head to understand how their system executes. This is just like programming, only the runtime environment is far more complex and hidden. Is this really progress? Capital One’s move to serverless is given as an example. 70% performance gain. 90% cost saving by moving from EC2, ELB, and RDS to DynamoDB, S3, SNS, Lambda. 30% increase in velocity by adopting CI/CD pipeline. 
  • AWS Hero Ben Kehoe takes a look at The Good and the Bad of Google Cloud Run.
    • “So what’s bad about Cloud Run? Inside your container is a fully-featured web server doing all the work! The point of serverless is to focus on business value, and the best way to do that is to use managed services for everything you can, ideally only resorting to custom code for your business logic. If we try to compare Cloud Run to what’s possible inside a Lambda function execution environment, we’re missing the point. The point is that the code you put inside Lambda, the code that you are liable for, can be smaller and more focused because so much of the logic can be moved into the services themselves.” 
    • Ben deftly highlights a major philosophical difference between AWS and GCP. GCP tends to promote a world view where programmers generate value through well understood code rather than spending hours if not days coercing tools like API Gateway into doing their bidding. Web servers are simple and well understood. Dispatching a function is easy. Many lambda functions even dispatch internally using similar dispatch code to Express. API Gateway is not simple. It can do a lot, but it’s complex, confusing, and expensive. How many hours do you have to waste with velocity templates and a horrible local testing infrastructure to get to the point that running a web server in a container is not the path to a lack of value, but is the path to good local testing, clearer code, better control, and less frustration? Let’s not make a virtue out of a technology stack choice based on such an undefinable quality as “value.” There are many paths to value. It doesn’t matter if Cloud Run is FaaS, is like or not like Lambda, or uses Kubernetes—what matters is can you get the job done?
    • How Google Cloud Run Combines Serverless with Containers: My initial observation is that Cloud Run delivers the same promise as the original PaaS. The fundamental difference between PaaS and Cloud Run lies in transparency. Since the underlying layer is based on Knative, every step can be easily mapped to the functionality of Istio and Kubernetes. When Cloud Run becomes available on GKE On-Prem and Anthos, customers will be able to target a consistent platform with a repeatable workflow to deploy modern applications.
  • Indeed it is. Nicely done. A Detailed Overview of AWS API Gateway
  • Satellites Are Reshaping How Traders Track Earthly Commodities. Using satellites and AI you can track all supply chains in real-time to make your bets using information others do not have. Put your surveillance capitalism hat on. One wonders if all those Google Maps cars are doing more than making maps. Could they be gathering intel? It’s impossible to know their true purpose. We’ve seen this before in WWII. It’s incredible how much effort was put into disinformation. Entire armies were faked to feed misinformation to Axis powers. And it worked. It’s all deliciously described in The Deceivers: Allied Military Deception in the Second World War. A blast to read. In the future we’re likely to see the same sort of deception techniques used to confuse those who spy from above.
  • Good Recap: LinkedIn at EmberConf 2019
  • The storage wars continue they do. Google now offers ice cold storage. AWS offers Glacier Deep. There’s more to consider than just price. tpetry: The interesting part is it’s cheaper than AWS Glacier ($4 per TB per month) and slightly more expensive than AWS Glacier Deep Archive ($0.99 per TB per month) but the data is available immediately and not in hours like Glacier where you have to pay a hefty premium for faster access to the data. Youden: The place where this won’t be as cheap as Backblaze is retrieval. Unless Google makes a big change, you’ll still have to pay for network egress, which is obscenely priced. Good discussion on HN.
  • Because we all know systems are used just for their stated purpose. Tools are always used to suit the purpose of their wielder, not the best hopes of their creators. The Messy Truth About Social Credit: But the social credit system as it currently exists is not aimed at Orwellian social control. Rather, the cluster of policy initiatives designated by the term are intended to promote greater trust—namely, trust between companies and their customers, and between citizens and the government. 
  • When you hit a scaling inflection point—both organizationally and customer load related—do you continue refactoring the old system or do you build a new one? Here’s everything you should not do. Microservices Gone Wrong
    • They chose to build a second, new green-field product. They chose microservices using containers on Mesos to run Spark, Hadoop, Storm, and apps. Do not treat microservices as objects. Have a couple of services with well-defined boundaries. Start with a service as big as possible because it can always be split later. It’s hard to stitch services together later. Have a team dedicated to prod and dev infrastructure. Automate everything: deployment, migration, backups, state restoration, everything. Gracefully handle failures first, not the happy path. Define service level objectives early in terms of what users care about. Managing complexity is the number one job. You can never eliminate complexity, you can only move it around or add it. It’s too easy to create a system so complicated you can’t understand it, and if you can’t understand it, how can you hope to run it?
    • David Moore: This talk is dangerous. He’s giving a talk advising people to not choose microservices model because he incorrectly applied that pattern to his product and unsurprisingly had issues. This does not warrant advising others to not choose a very successful pattern that works for the biggest tech companies because you didn’t understand correctly.  1) The biggest red flag – he decomposed the services by database entity. All you’ve done at this point is create an extremely coupled ball of mud. Of course “simple refactoring became major coordinated surgery” because each service is a representation of an entity. 2) The second major issue is his primary issues were lack of proper CI/CD infrastructure. Circuit breakers, service mesh, service resilience, scaling the containers, monitoring etc. All of this is needed for a fault tolerant system. You can’t just build out some services and expect it to solve your problems because you checked a few boxes. 
  • Maybe too complicated. A simple linear call chain works too. Chaining Serverless Functions for Stateful Workflows: AWS Step Functions using Adapter Pattern. Also, Building an AWS Serverless ML Pipeline with Step Functions
  • Integrating AI directly into data stores makes a lot of sense. All 29 AI announcements from Google
  • Microsoft/BosqueLanguage: The Bosque language derives from a combination of TypeScript inspired syntax and types plus ML and Node/JavaScript inspired semantics. This document provides an overview of the syntax, operations, and semantics in the Bosque language with an emphasis on the distinctive or unusual features in the language.
  • dgryski/go-perfbook: This document outlines best practices for writing high-performance Go code.
  • Microsoft/BlingFire: we are a team at Microsoft called Bling (Beyond Language Understanding); we help Bing be smarter. Here we wanted to share with all of you our FInite State machine and REgular expression manipulation library (FIRE). We use Fire for many linguistic operations inside Bing such as Tokenization, Multi-word expression matching, Unknown word-guessing, Stemming / Lemmatization just to mention a few. sergeio76: There are deterministic finite state machines underneath implemented in lean C++. These automata allow you to implement operations on strings optimally, or close to it. In the readme file there is a link to how to recompile linguistic resources; if you look inside the makefile that is used for resource compilation you will see the steps.
  • emichael/dslabs (article): a new framework for creating, testing, model checking, visualizing, and debugging distributed systems lab assignments. paper: Students often march through test cases incrementally, fixing problems only once they occur. A particular student tried this for the primary-backup assignment and got stuck: the fix for a problem found by one test would often break the solution for previous tests. The student found he could find a version to pass each of the tests, just not the same version. After we encouraged him to start over with a clean design that met all of the criteria simultaneously, he was able to quickly converge on a solution.
  • OverSketched Newton: Fast Convex Optimization for Serverless Systems: Motivated by recent developments in serverless systems for large-scale machine learning as well as improvements in scalable randomized matrix algorithms, we develop OverSketched Newton, a randomized Hessian-based optimization algorithm to solve large-scale smooth and strongly-convex problems in serverless systems. OverSketched Newton leverages matrix sketching ideas from Randomized Numerical Linear Algebra to compute the Hessian approximately. These sketching methods lead to inbuilt resiliency against stragglers that are a characteristic of serverless architectures. We establish that OverSketched Newton has a linear-quadratic convergence rate, and we empirically validate our results by solving large-scale supervised learning problems on real-world datasets. Experiments demonstrate a reduction of ~50% in total running time on AWS Lambda, compared to state-of-the-art distributed optimization schemes.
  • Distributed consensus revised: In this thesis, we re-examine the foundations of how Paxos solves distributed consensus. Our hypothesis is that these limitations are not inherent to the problem of consensus but instead specific to the approach of Paxos. The surprising result of our analysis is a substantial weakening to the requirements of this widely studied algorithm. Building on this insight, we are able to prove an extensive generalisation over the Paxos algorithm.
  • Giant monolithic source-code repositories are one of the fundamental pillars of the back end infrastructure in large and fast-paced software companies. Uber on Keeping Master Green at Scale: This paper presents the design and implementation of SubmitQueue. It guarantees an always green master branch at scale: all build steps (e.g., compilation, unit tests, UI tests) successfully execute for every commit point. SubmitQueue has been in production for over a year, and can scale to thousands of daily commits to giant monolithic repositories.
  • Analyzing the Impact of GDPR on Storage Systems: We show that despite needing to introduce a small set of new features, a strict real-time compliance (e.g., logging every user request synchronously) lowers Redis’ throughput by ∼95%. Our work reveals how GDPR allows compliance to be a spectrum, and what its implications are for system designers. We discuss the technical challenges that need to be solved before strict compliance can be efficiently achieved.
  • Fail-Slow at Scale: Evidence of Hardware Performance Faults in Large Production Systems: Fail-slow hardware is an under-studied failure mode. We present a study of 101 reports of fail-slow hardware incidents, collected from large-scale cluster deployments in 12 institutions. We show that all hardware types such as disk, SSD, CPU, memory and network components can exhibit performance faults. We made several important observations such as faults convert from one form to another, the cascading root causes and impacts can be long, and fail-slow faults can have varying symptoms. From this study, we make suggestions to vendors, operators, and systems designers
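
Here is a minimal Python sketch of the promise-caching pattern from the Instagram thundering-herd item above. It is conceptual only (the post does not include Instagram’s code); the fetch callback, the lock, and the lack of expiry are simplifying assumptions.

    import threading
    from concurrent.futures import Future

    class PromiseCache:
        def __init__(self, fetch):
            self._fetch = fetch            # fetch(key) performs the real backend request
            self._lock = threading.Lock()
            self._futures = {}             # key -> Future holding (or awaiting) the value

        def get(self, key):
            with self._lock:
                fut = self._futures.get(key)
                owner = fut is None
                if owner:                  # first caller for this key becomes the owner
                    fut = Future()
                    self._futures[key] = fut
            if owner:
                try:
                    fut.set_result(self._fetch(key))      # exactly one backend request per key
                except Exception as exc:
                    with self._lock:
                        self._futures.pop(key, None)      # drop the failed Promise so a later caller can retry
                    fut.set_exception(exc)
            return fut.result()            # concurrent callers all wait on the same Future

    cache = PromiseCache(fetch=lambda key: {"feed": key})  # stand-in for the real backend call
    print(cache.get("user:1"))             # first call performs the fetch
    print(cache.get("user:1"))             # later calls find the cached Future

On a cold cache, a burst of identical requests finds the single cached Future and waits on it, so the backend sees one request per key instead of one per caller; a real implementation would also add expiry and negative-result handling.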
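
The mmap() access pattern described in the 3DXPoint item above looks roughly like the following Python sketch. The device path is a hypothetical DAX-mounted persistent-memory filesystem, and a production system would use a persistent-memory library (with cache-line writeback instructions) rather than raw mmap plus a flush.

    import mmap
    import os

    PMEM_PATH = "/mnt/pmem0/example.dat"   # assumed file on a DAX-mounted pmem filesystem
    SIZE = 4096

    fd = os.open(PMEM_PATH, os.O_CREAT | os.O_RDWR)
    os.ftruncate(fd, SIZE)
    region = mmap.mmap(fd, SIZE)           # map the persistent region into the address space

    # Plain loads and stores -- no read()/write() system calls on the data path.
    region[0:8] = b"hello!!\n"
    print(region[0:8])

    # The measurements above suggest 256 B is the efficient access granularity,
    # so real designs batch small updates into 256 B-aligned writes.
    region.flush()                         # flush; real pmem code also uses CLWB plus fences
    region.close()
    os.close(fd)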
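
And here is a minimal Python sketch of the idempotency-key pattern from the Airbnb item above. It is a conceptual illustration, not Airbnb’s Orpheus library: the in-memory response table and the error classification are assumptions (the real system records idempotency information in a sharded master database inside transactions).

    import uuid

    class IdempotentPayments:
        RETRYABLE = (TimeoutError, ConnectionError)    # assumed classification of errors

        def __init__(self, charge):
            self._charge = charge      # charge(payload) performs the real side effect
            self._responses = {}       # idempotency key -> recorded response

        def create_payment(self, idempotency_key, payload):
            # A replayed request with the same key returns the recorded response
            # instead of charging the customer twice.
            if idempotency_key in self._responses:
                return self._responses[idempotency_key]
            try:
                response = self._charge(payload)
            except self.RETRYABLE:
                raise                  # nothing recorded; the client retries with the same key
            self._responses[idempotency_key] = response
            return response

    payments = IdempotentPayments(charge=lambda p: {"status": "charged", **p})
    key = str(uuid.uuid4())            # the client generates one key per logical request
    first = payments.create_payment(key, {"amount_cents": 1200})
    retry = payments.create_payment(key, {"amount_cents": 1200})
    assert first == retry              # the charge side effect ran exactly once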

from High Scalability

MySQL High Availability Framework Explained – Part III: Failover Scenarios

In this three-part blog series, we introduced a High Availability (HA) Framework for MySQL hosting in Part I, and discussed the details of MySQL semisynchronous replication in Part II. Now in Part III, we review how the framework handles some of the important MySQL failure scenarios and recovers to ensure high availability.

MySQL Failover Scenarios

Scenario 1 – Master MySQL Goes Down

  • The Corosync and Pacemaker framework detects that the master MySQL is no longer available. Pacemaker demotes the master resource and tries to recover with a restart of the MySQL service, if possible.
  • At this point, due to the semisynchronous nature of the replication, all transactions committed on the master have been received by at least one of the slaves.
  • Pacemaker waits until all the received transactions are applied on the slaves and lets the slaves report their promotion scores. The score calculation is done in such a way that the score is ‘0’ if a slave is completely in sync with the master, and is a negative number otherwise. (A minimal sketch of this selection logic appears at the end of this scenario.)
  • Pacemaker picks the slave that has reported a score of ‘0’ and promotes it; that slave now assumes the role of master MySQL, on which writes are allowed.
  • After slave promotion, the Resource Agent triggers a DNS rerouting module. The module updates the proxy DNS entry with the IP address of the new master, thus, facilitating all application writes to be redirected to the new master.
  • Pacemaker also sets up the available slaves to start replicating from this new master.

Thus, whenever a master MySQL goes down (whether due to a MySQL crash, OS crash, system reboot, etc.), our HA framework detects it and promotes a suitable slave to take over the role of the master. This ensures that the system continues to be available to the applications.
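
As a conceptual illustration (not the actual Pacemaker resource agent), the promotion-score selection in Scenario 1 boils down to something like the following Python sketch; the slave names and the lag metric are assumptions made for the example.

    def promotion_score(received_transactions, applied_transactions):
        # 0 when the slave has applied everything it received from the master,
        # a negative number (more lag means more negative) otherwise.
        return -(received_transactions - applied_transactions)

    scores = {
        "slave1": promotion_score(received_transactions=1000, applied_transactions=1000),
        "slave2": promotion_score(received_transactions=1000, applied_transactions=970),
    }

    # Promote the slave that reported a score of 0, i.e. fully in sync with the old master.
    new_master = max(scores, key=scores.get)
    assert scores[new_master] == 0      # slave1 is promoted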

Scenario 2 – Slave MySQL Goes Down

  • The Corosync and Pacemaker framework detects that the slave MySQL is no longer available.
  • Pacemaker tries to recover the resource by trying to restart MySQL on the node. If it comes up, it is added back to the current master as a slave and replication continues.
  • If recovery fails, Pacemaker reports that resource as down – based on which alerts or notifications can be generated. If necessary, the ScaleGrid support team will handle the recovery of this node.
  • In this case, there is no impact on the availability of MySQL services.

Scenario 3 – Network Partition – Network Connectivity Breaks Down Between Master and Slave Nodes

This is a classical problem in any distributed system where each node thinks the other nodes are down, while in reality, only the network communication between the nodes is broken. This scenario is more commonly known as a split-brain scenario, and if not handled properly, it can lead to more than one node claiming to be the master MySQL, which in turn leads to data inconsistencies and corruption.

Let’s use an example to review how our framework deals with split-brain scenarios in the cluster. We assume that due to network issues, the cluster has partitioned into two groups – master in one group and 2 slaves in the other group, and we will denote this as [(M), (S1,S2)].

  • Corosync detects that the master node is not able to communicate with the slave nodes, and the slave nodes can communicate with each other, but not with the master.
  • The master node will not be able to commit any transactions as the semisynchronous replication expects acknowledgement from at least one of the slaves before the master can commit. At the same time, Pacemaker shuts down MySQL on the master node due to lack of quorum based on the Pacemaker setting ‘no-quorum-policy = stop’. Quorum here means a majority of the nodes, or two out of three in a 3-node cluster setup. Since there is only one master node running in this partition of the cluster, the no-quorum-policy setting is triggered, leading to the shutdown of the MySQL master. (A minimal sketch of this quorum decision appears at the end of this scenario.)
  • Now, Pacemaker on the partition [(S1), (S2)] detects that there is no master available in the cluster and initiates a promotion process. Assuming that S1 is up to date with the master (as guaranteed by semisynchronous replication), it is then promoted as the new master.
  • Application traffic will be redirected to this new master MySQL node and the slave S2 will start replicating from the new master.
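
The quorum behaviour described above can be summarised in a few lines. This is a simplified sketch of the decision each partition makes, not Pacemaker's implementation; the function names are made up for illustration.

```python
# Simplified sketch of 'no-quorum-policy = stop' in a 3-node cluster.
# Not Pacemaker's actual logic; names are illustrative only.

CLUSTER_SIZE = 3

def has_quorum(visible_nodes: int, cluster_size: int = CLUSTER_SIZE) -> bool:
    """A partition has quorum only if it holds a strict majority of the nodes."""
    return visible_nodes > cluster_size // 2

def partition_action(visible_nodes: int, holds_master: bool) -> str:
    if not has_quorum(visible_nodes):
        return "stop MySQL"                      # no-quorum-policy = stop
    if holds_master:
        return "keep serving writes"
    return "promote the most up-to-date slave"

# Partition [(M), (S1, S2)] from the example above:
print(partition_action(visible_nodes=1, holds_master=True))   # stop MySQL
print(partition_action(visible_nodes=2, holds_master=False))  # promote the most up-to-date slave
```

Because only the majority partition is ever allowed to run a master, at most one node accepts writes at any time, which is what keeps split-brain from corrupting data.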

Thus, we see that the MySQL HA framework handles split-brain scenarios effectively, ensuring both data consistency and availability when network connectivity breaks down between the master and slave nodes.

This concludes our 3-part blog series on the MySQL High Availability (HA) framework using semisynchronous replication and the Corosync plus Pacemaker stack. At ScaleGrid, we offer highly available hosting for MySQL on AWS and MySQL on Azure that is implemented based on the concepts explained in this blog series. Please visit the ScaleGrid Console for a free trial of our solutions.

from High Scalability

Stuff The Internet Says On Scalability For April 12th, 2019


from High Scalability

Sponsored Post: PerfOps, InMemory.Net, Triplebyte, Etleap, Stream, Scalyr

Who’s Hiring? 

  • Triplebyte lets exceptional software engineers skip screening steps at hundreds of top tech companies like Apple, Dropbox, Mixpanel, and Instacart. Make your job search O(1), not O(n). Apply here.
  • Need excellent people? Advertise your job here! 

Fun and Informative Events

  • Join Etleap, an Amazon Redshift ETL tool, to learn the latest trends in designing a modern analytics infrastructure. Learn what has changed in the analytics landscape and how to avoid the major pitfalls that can hinder your organization's growth. Watch a demo and learn how Etleap can save you engineering hours and decrease your time to value for your Amazon Redshift analytics projects. Register for the webinar today.
  • Advertise your event here!

Cool Products and Services

  • PerfOps is a data platform that digests real-time performance data for CDN and DNS providers as measured by real users worldwide. Leverage this data across your monitoring efforts and integrate with PerfOps’ other tools such as Alerts, Health Monitors and FlexBalancer – a smart approach to load balancing. FlexBalancer makes it easy to manage traffic between multiple CDN providers, APIs, databases or any custom endpoint, helping you achieve better performance, ensure the availability of services and reduce vendor costs. Creating an account is free and provides access to the full PerfOps platform.
  • InMemory.Net provides a .NET-native in-memory database for analysing large amounts of data. It runs natively on .NET, and provides native .NET, COM & ODBC APIs for integration. It also has an easy-to-use language for importing data, and supports standard SQL for querying data. http://InMemory.Net
  • Build, scale and personalize your news feeds and activity streams with getstream.io. Try the API now in this 5-minute interactive tutorial. Stream is free up to 3 million feed updates, so it’s easy to get started. Client libraries are available for Node, Ruby, Python, PHP, Go, Java and .NET. Stream is currently also hiring DevOps and Python/Go developers in Amsterdam. More than 400 companies rely on Stream for their production feed infrastructure, including apps with 30 million users. With your help we’d like to add a few zeros to that number. Check out the job opening on AngelList.
  • Scalyr is a lightning-fast log management and operational data platform. It’s a tool (actually, multiple tools) that your entire team will love. Get visibility into your production issues without juggling multiple tabs and different services – all of your logs, server metrics and alerts are in your browser and at your fingertips. Loved and used by teams at Codecademy, ReturnPath, Grab, and InsideSales. Learn more today or see why Scalyr is a great alternative to Splunk.
  • Advertise your product or service here!

If you are interested in a sponsored post for an event, job, or product, please contact us for more information.


Make Your Job Search O(1) — not O(n)

Triplebyte is unique because they’re a team of engineers running their own centralized technical assessment. Companies like Apple, Dropbox, Mixpanel, and Instacart now let Triplebyte-recommended engineers skip their own screening steps.

We found that High Scalability readers are about 80% more likely to be in the top bracket of engineering skill.

Take Triplebyte’s multiple-choice quiz (system design and coding questions) to see if they can help you scale your career faster.


The Solution to Your Operational Diagnostics Woes

Scalyr gives you instant visibility of your production systems, helping you turn chaotic logs and system metrics into actionable data at interactive speeds. Don’t be limited by the slow and narrow capabilities of traditional log monitoring tools. View and analyze all your logs and system metrics from multiple sources in one place. Get enterprise-grade functionality with sane pricing and insane performance. Learn more today


If you are interested in a sponsored post for an event, job, or product, please contact us for more information.

from High Scalability

From bare-metal to Kubernetes

In 2013, dotCloud released Docker.

The Betabrand use case for Docker was immediately obvious: I saw it as a way to simplify our development and staging environments by getting rid of the Ansible scripts (well, almost; more on that later).

Those scripts would now only be used for production.

At the time, one of the main pain points for the team was competing for our three physical staging servers, dev1, dev2 and dev3; and for me, maintaining those three servers was a major annoyance.

After observing Docker for a few months, I decided to give it a go in April 2014.

After installing Docker on one of the staging servers, I created a single Docker image containing our entire stack (HAProxy, Varnish, Redis, Apache, etc.), then over the next few months wrote a tool (sailor) that allowed us to create, destroy and manage any number of staging environments, each accessible via its own unique URL.

It’s worth noting that docker-compose didn’t exist at the time, and that putting your entire stack inside one Docker image is of course a big no-no, but that’s an unimportant detail here.

From this point on, the team no longer competed for access to the staging servers: anybody could create a new, fully configured staging container from the Docker image using sailor. I didn’t need to maintain the servers anymore either; better yet, I shut down and cancelled two of them.
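
sailor itself isn't published in the post, so the following is only a sketch of the core idea using the Docker SDK for Python: start a throwaway container from the all-in-one image and hand back a unique URL. The image name, port mapping and URL scheme are assumptions for illustration.

```python
# Sketch of a sailor-like helper: create/destroy a staging environment from a
# single all-in-one image. Image name, port and host name are hypothetical.
import docker

IMAGE = "betabrand/staging:latest"   # hypothetical all-in-one stack image

def create_env(name: str) -> str:
    client = docker.from_env()
    container = client.containers.run(
        IMAGE,
        name=f"staging-{name}",
        detach=True,
        ports={"80/tcp": None},      # let Docker assign a free host port
    )
    container.reload()               # refresh attrs to read the assigned port
    port = container.attrs["NetworkSettings"]["Ports"]["80/tcp"][0]["HostPort"]
    return f"http://staging-host:{port}/"   # the environment's unique URL

def destroy_env(name: str) -> None:
    docker.from_env().containers.get(f"staging-{name}").remove(force=True)
```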

Our development environment, however, was still running on macOS (well, “Mac OS X” at the time) and using the Ansible scripts.

Then, sometime around 2016, docker-machine was released.

docker-machine is a tool that takes care of deploying a Docker daemon on any stack of your choice (VirtualBox, AWS, GCE, bare metal, Azure, you name it), in one command line.
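
For instance, provisioning a new Docker host and printing the environment variables the Docker client needs boils down to two docker-machine invocations. This is only a sketch: the driver ("virtualbox") and the machine name ("dev") are arbitrary choices, and the commands are wrapped in Python here simply to keep the examples in one language.

```python
# Provision a Docker host with docker-machine and show how to point the
# Docker client at it. Driver and machine name are arbitrary.
import subprocess

subprocess.run(["docker-machine", "create", "--driver", "virtualbox", "dev"], check=True)
subprocess.run(["docker-machine", "env", "dev"], check=True)  # prints DOCKER_HOST etc.
```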

I saw it as an opportunity to quickly and easily migrate our Ansible-based development environment to a Docker-based one, so I modified sailor to use docker-machine as its backend.

Setting up a development environment was now a matter of creating a new docker-machine and passing a flag telling sailor to use it.

At this point, our development and staging process had been simplified tremendously, at least from a DevOps perspective: anytime I needed to upgrade any part of our stack to a newer version or change its configuration, I could simply push a new Docker image, instead of modifying my Ansible scripts, asking the whole team to run them, and then running them myself on all three staging servers.

Ironically enough, I ended up needing virtual machines (which I had deliberately avoided) to run Docker on our MacBooks. Using Vagrant instead of Ansible would have been a better choice from the get-go. Hindsight is always 20/20.

Using Docker for our development and staging systems paved the way to the better solution that Betabrand.com now runs on.

from High Scalability