19
submitted 8 months ago by liori@lemm.ee to c/programming@programming.dev

I'm working on a query engine, essentially a tool to scan/filter/annotate-by-lookups/group-by/aggregate a large dataset, in the tens-of-terabytes range. The compute part seems to be the bottleneck for me (I'll be doing around 80-300 GB/s of reads, and yes, I will have hardware capable of providing that kind of throughput). My hypothesis is that by encoding a query in the form of template arguments I can make the compiler generate code optimized for the specific type of query (like the filtering or aggregation keys). But I do not know in advance what queries users will send, so I need a way to instantiate templates at runtime.

Sounds simple: for a new type of query, invoke a compiler at runtime to build a dynamic library with a new instantiation, then dynload it and off we go. Some prior work is here, though I'm pretty sure any JIT compiler also counts. But there are enough technical details to worry about, and at the same time this idea isn't novel, so I wonder: are there any packaged solutions for this kind of approach?
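To make it concrete, here is a minimal sketch of the compile-and-dlopen flow I have in mind (POSIX only, g++ on PATH assumed; query_kernels.hpp, run_query, Pred and query_entry are made-up names for illustration, not an existing API):

    // Sketch: generate a TU that instantiates the templated kernel for one
    // concrete query, compile it to a shared object, and load the symbol.
    #include <cstddef>
    #include <cstdlib>
    #include <dlfcn.h>
    #include <fstream>
    #include <stdexcept>
    #include <string>

    // Signature the generated library is expected to export.
    using QueryFn = void (*)(const double* data, std::size_t n, double* out);

    QueryFn instantiate_query(const std::string& predicate_expr) {
        std::ofstream src("/tmp/query_gen.cpp");
        src << "#include <cstddef>\n"
               "#include \"query_kernels.hpp\"\n"  // hypothetical shared header
               "struct Pred { bool operator()(double x) const { return "
            << predicate_expr << "; } };\n"
               "extern \"C\" void query_entry(const double* d, std::size_t n, double* out) {\n"
               "    run_query<Pred>(d, n, out);\n"  // hypothetical templated kernel
               "}\n";
        src.close();

        // -O3 so the predicate is inlined straight into the scan loop.
        if (std::system("g++ -O3 -march=native -std=c++17 -shared -fPIC "
                        "-I. /tmp/query_gen.cpp -o /tmp/query_gen.so") != 0)
            throw std::runtime_error("compilation failed");

        void* lib = dlopen("/tmp/query_gen.so", RTLD_NOW);
        if (!lib) throw std::runtime_error(dlerror());
        return reinterpret_cast<QueryFn>(dlsym(lib, "query_entry"));
    }

Usage would be something like instantiate_query("x > 42.0"). Caching the built libraries keyed by query shape would amortize the compiler invocation, which can easily take a second or more.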

33

TL;DR: we have discovered wiretapping (a man-in-the-middle attack) of the encrypted TLS connections of the XMPP (Jabber) instant messaging service jabber.ru (aka xmpp.ru) on its servers at the Hetzner and Linode hosting providers in Germany. The attacker issued several new TLS certificates through the Let’s Encrypt service and used them to hijack encrypted STARTTLS connections on port 5222 via a transparent MiTM proxy. The attack was discovered because one of the MiTM certificates expired and hadn't been reissued. There are no indications of a server breach or of spoofing attacks on the network segment; quite the contrary: the traffic redirection was configured on the hosting providers' network. The wiretapping may have lasted for up to 6 months overall (90 days confirmed). We believe this is lawful interception that Hetzner and Linode were forced to set up.

104
submitted 1 year ago* (last edited 1 year ago) by liori@lemm.ee to c/worldnews@lemmy.ml

In short, according to the exit polls:

  • PiS (the current ruling party, right-wing) got the most votes, but cannot rule alone, nor in coalition with the alt-right party Konfederacja: together they miss the majority by a decent margin (212 mandates total vs. the 231 needed).
  • The opposition parties (center-right Koalicja Obywatelska, center-right Trzecia Droga and the leftist Lewica) together have a majority. They know they have to form a government together; the question is whether they can overcome their differences. They did suggest strong cooperation during their campaigns.
  • Highest-ever turnout (72%) in the elections.
  • The accompanying referendum (a device to get more funds for promoting the current ruling party's ideas) was a total failure (40% turnout; voters had to explicitly opt out of participation!).
36

While high-frequency trading is not exactly my favourite topic, I do like reading about its technical approaches.

By Paul Bilokon, Burak Gunduz

This work aims to bridge the existing knowledge gap in the optimisation of latency-critical code, specifically focusing on high-frequency trading (HFT) systems. The research culminates in three main contributions: the creation of a Low-Latency Programming Repository, the optimisation of a market-neutral statistical arbitrage pairs trading strategy, and the implementation of the Disruptor pattern in C++. The repository serves as a practical guide and is enriched with rigorous statistical benchmarking, while the trading strategy optimisation led to substantial improvements in speed and profitability. The Disruptor pattern showcased significant performance enhancement over traditional queuing methods. Evaluation metrics include speed, cache utilisation, and statistical significance, among others. Techniques like Cache Warming and Constexpr showed the most significant gains in latency reduction. Future directions involve expanding the repository, testing the optimised trading algorithm in a live trading environment, and integrating the Disruptor pattern with the trading algorithm for comprehensive system benchmarking. The work is oriented towards academics and industry practitioners seeking to improve performance in latency-sensitive applications.
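For context: at its core the Disruptor is a pre-allocated ring buffer coordinated by sequence counters rather than locks. A minimal single-producer/single-consumer sketch of that idea (my illustration, not the paper's implementation):

    #include <atomic>
    #include <cstddef>
    #include <optional>

    // Fixed-size, power-of-two ring; two monotonically increasing sequence
    // counters replace locks, and masking replaces modulo.
    template <typename T, std::size_t N>
    class SpscRing {
        static_assert((N & (N - 1)) == 0, "N must be a power of two");
        T slots_[N];
        alignas(64) std::atomic<std::size_t> head_{0};  // producer sequence
        alignas(64) std::atomic<std::size_t> tail_{0};  // consumer sequence
    public:
        bool try_push(const T& v) {
            std::size_t h = head_.load(std::memory_order_relaxed);
            if (h - tail_.load(std::memory_order_acquire) == N) return false;  // full
            slots_[h & (N - 1)] = v;
            head_.store(h + 1, std::memory_order_release);  // publish the slot
            return true;
        }
        std::optional<T> try_pop() {
            std::size_t t = tail_.load(std::memory_order_relaxed);
            if (t == head_.load(std::memory_order_acquire)) return std::nullopt;  // empty
            T v = slots_[t & (N - 1)];
            tail_.store(t + 1, std::memory_order_release);  // free the slot
            return v;
        }
    };

The alignas(64) keeps the two counters on separate cache lines; avoiding that false sharing is a large part of the speedup over a mutex-guarded queue.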

[-] liori@lemm.ee 23 points 1 year ago

This plea for help is specifically for non-coding, but still deeply technical work.

[-] liori@lemm.ee 8 points 1 year ago

I guess the best start would be to have a person to organize volunteers.

507

I've said this previously, and I'll say it again: we're severely under-resourced. Not just XFS, the whole fsdevel community. As a developer and later a maintainer, I've learnt the hard way that a very large amount of non-coding work is necessary to build a good filesystem. There's enough not-really-coding work for several people. Instead, we lean hard on maintainers to do all that work. That might've worked acceptably for the first 20 years, but it doesn't now.

[…]

Dave and I are both burned out. I'm not sure Dave ever got past the 2017 burnout that led to his resignation. Remarkably, he's still around. Is this (extended burnout) where I want to be in 2024? 2030? Hell no.

[-] liori@lemm.ee 7 points 1 year ago

As of May 2023, 65% of the Ukrainian refugees who left Ukraine starting February 2022 and decided to stay in Poland had found a job; that is, within around a year, as opposed to the 5-6 years in the article. Cultural similarity is likely making it much, much simpler here. For those who want to read more about the situation of Ukrainian refugees in Poland, this report by the Polish National Bank (Narodowy Bank Polski, NBP) might be useful: https://nbp.pl/wp-content/uploads/2023/05/Raport_Imigranci_EN.pdf (in English!); it contains a lot of interesting details.

21
submitted 1 year ago* (last edited 1 year ago) by liori@lemm.ee to c/programming@programming.dev

We are working on a tool that essentially allows external customers to access various extracts of our datasets, with parameterized filtering, aggregation, the usual stuff, through a REST API. Some of these extracts are time-consuming to prepare, so we are looking for ways to manage asynchronous report generation, or to make it possible for customers to schedule reports upfront, as opposed to having a synchronous API. There are tons of libraries for implementing synchronous REST APIs, but are there any standard approaches or tools for this kind of asynchronous cross-organizational communication? Like, maybe something that would allow each customer to inspect their schedules and pending queries, and configure how they want the results to be delivered? I fear we will need to build something like that from scratch.
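What I mean is roughly the classic asynchronous request-reply shape: the submit call immediately returns a job id with 202 Accepted, and the client polls (or registers a delivery channel). A rough sketch of that endpoint shape, assuming the cpp-httplib single-header library; the routes and fields are my own invention, not any standard:

    // Rough sketch of an async request-reply API: POST returns 202 + job id,
    // GET polls the job status. Job store is in-memory for illustration only.
    #include "httplib.h"  // cpp-httplib, single-header HTTP server
    #include <atomic>
    #include <chrono>
    #include <mutex>
    #include <string>
    #include <thread>
    #include <unordered_map>

    int main() {
        httplib::Server svr;
        std::mutex mu;
        std::unordered_map<std::string, std::string> jobs;  // job id -> status
        std::atomic<int> next_id{0};

        // Submit a report request: reply immediately, generate in the background.
        svr.Post("/reports", [&](const httplib::Request&, httplib::Response& res) {
            std::string id = std::to_string(next_id++);
            { std::lock_guard<std::mutex> l(mu); jobs[id] = "pending"; }
            std::thread([&, id] {  // safe here: main() blocks in listen() forever
                std::this_thread::sleep_for(std::chrono::seconds(30));  // pretend work
                std::lock_guard<std::mutex> l(mu);
                jobs[id] = "done";
            }).detach();
            res.status = 202;  // Accepted
            res.set_content("{\"job\":\"" + id + "\"}", "application/json");
        });

        // Poll a job by id (regex capture).
        svr.Get(R"(/reports/(\d+))", [&](const httplib::Request& req, httplib::Response& res) {
            std::lock_guard<std::mutex> l(mu);
            auto it = jobs.find(req.matches[1].str());
            if (it == jobs.end()) { res.status = 404; return; }
            res.set_content("{\"status\":\"" + it->second + "\"}", "application/json");
        });

        svr.listen("0.0.0.0", 8080);
    }

A real service would persist the job table and layer per-customer listings, scheduling, and delivery configuration on top of the same shape; the question is whether anything packages that layer for us.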

[-] liori@lemm.ee 7 points 1 year ago

From my experience, despite all the citogenesis described in other comments here, Wikipedia citations are still better vetted than those in many, many scientific papers, let alone regular journalism :/ I recall spending days following citation links in already well-cited papers just to debunk basic statements in the field.

[-] liori@lemm.ee 11 points 1 year ago

Good question! I quickly found this table, though this is yearly statistics only: https://www150.statcan.gc.ca/t1/tbl1/en/tv.action?pid=3510019201

[-] liori@lemm.ee 27 points 1 year ago

One reason (among many) is that employment in American companies is less stable than in European countries with strong employment laws. Twitter could not do the same type of layoffs in Europe, where stories like this one are pretty common. But this safety net has a cost, and that cost is part of the total employment cost for employers. Whether the safety net is worth it for employees in IT is another matter; but, the law being what it is, it has to be taken into account.

BTW, in some European countries there is a strong culture of IT workers doing long-term contractor work precisely to trade employment-law protections for a (usually much) higher wage.

[-] liori@lemm.ee 15 points 1 year ago

He likely couldn't "just" do it. The synchronization overhead for federation is large, and with the amount of data Reddit has, you'd have to put a lot of effort into writing efficient code to handle that. Or pay for a lot of servers doing it.

BTW, it would be interesting to see whether the current Lemmy codebase could handle it as well…

[-] liori@lemm.ee 26 points 1 year ago

I found it crazy useful to study old, established, mature technologies: relational databases, storage, the low-level networking stack, optimizing compilers, etc. Much more valuable than learning the fad of the year. For example, consider studying the internals of PostgreSQL if you're using it.

[-] liori@lemm.ee 6 points 1 year ago* (last edited 1 year ago)

Given these criteria, ggplot2 wins by a landslide. The API, thanks to R's nonstandard evaluation feature, is crazy good compared to whatever is available in Python. Not having to use numpy/pandas as inputs is a bonus as well; somehow pandas managed to duplicate many bad features of R's data frames and introduce its own inconsistencies, without providing many of the good features¹. Styling defaults are decent, definitely much better than matplotlib's, and it's much easier to consistently apply custom styling. The future of ggplot2 is defined by downstream libraries: ggplot2 itself is just the core of the ecosystem, which, at this point, is mature and stable. Matplotlib's higher activity is mostly because the lack of nonstandard evaluation makes flexible APIs more cumbersome to implement, so everything simply takes more work. Both have very minimal support for interactive and web use; it's easier to wrap them in shiny/dash than to force them alone to do web/interactive stuff. And there, btw, I'd again say shiny » dash, if for nothing but R's nonstandard evaluation feature.

Note though that learning proper R takes time, and if you don't know it yet, you will underestimate the time necessary to get comfortable with it. Nonstandard evaluation alone is so nonstandard that it gives headaches to people who are otherwise skilled programmers. matplotlib would hugely win on flexibility, which you apparently don't need; but there's always that one tiny tweak you'll wish you could make. Also, it's usually much easier to use the default of whatever publishing platform you're going to use.

As for me, given the choice, I pick ggplot2 as a default. So far it has been good enough for the significant majority of my academic and professional work.

¹ Admittedly numpy was not designed for data analysis directly, and pandas has some nice features missing from R's data frames.

[-] liori@lemm.ee 7 points 1 year ago

At such a scale, a scraper wouldn't be necessary; that's easily doable by the humans involved in these communities, with a human touch as well.

[-] liori@lemm.ee 9 points 1 year ago

Yes, many times. And I recall using the technique manually back when I was working with Subversion many, many years ago. No fun stories though, sorry, it's just another tool in a toolbox.

2

Some time ago I was looking for sources that would give me an in-depth understanding of the performance characteristics of large-scale storage. This is the best text on hard disk drives I've found so far, explaining details such as various switch times, zoned recording, or head skew. It's almost 20 years old, though, and so misses some developments. I wonder if you know of any more modern sources?
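For a sense of scale, the first-order access-time model that such texts build on still applies. An illustrative worked example for a single 4 KiB random read on a 7200 RPM drive (the 8 ms seek and 150 MB/s transfer rate are typical assumed values, not figures from the text; average rotational latency is half a revolution):

    t_access ≈ t_seek + ½·t_rev + t_transfer
             ≈ 8 ms + ½·(60 s / 7200) + 4 KiB / (150 MB/s)
             ≈ 8 ms + 4.17 ms + 0.03 ms ≈ 12.2 ms

Seek and rotational latency dominate random reads; details like zoned recording and head skew mostly shape the transfer term and sequential throughput.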

