Can regex be compiled into efficient machine code?
In general, absent any context about a specific language? Yes, it can. The deeper question is really about context (design goals) and costs (which tradeoffs you're willing to work with).
At the library-level, one example is Hana Dusikova's compile-time regular expressions in C++. See also its WG21 proposal paper, GitHub repo, and website with links to various conference talks. It leverages C++'s compile-time facilities. I think it hits close to what you're talking about. From its proposal paper:
The current std::regex design and implementation are slow, mostly because the RE pattern is parsed and compiled at runtime. Users often don’t need a runtime RE parser engine as the pattern is known during compilation in many common use cases. I think this breaks C++’s promise of “don’t pay for what you don’t use.” If the RE is known at compile time, the pattern should be checked during the compilation. The design of std::regex doesn’t allow for this as the RE input is a runtime string and syntax errors are reported as exceptions.
Notice a couple of things:
- The proposal addresses a core design goal of the language.
- The language already had features powerful enough to support implementing this as a library outside of the standard library.
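To make the quoted complaint concrete, here is a minimal sketch using only the standard `std::regex`. The helper name `pattern_is_valid` is made up for illustration. It shows that a malformed pattern can only be detected when the program runs, and that it surfaces as an exception:

```cpp
#include <regex>
#include <string>

// Hypothetical helper: reports whether `pattern` is a syntactically valid
// regex (ECMAScript grammar, the std::regex default).
// The check can only happen at run time, because std::regex takes a runtime string.
bool pattern_is_valid(const std::string& pattern) {
    try {
        std::regex re(pattern);  // the pattern is parsed and compiled here, at run time
        return true;
    } catch (const std::regex_error&) {
        return false;            // syntax errors are reported as exceptions
    }
}
```

Calling `pattern_is_valid("[0-9")` compiles without complaint even though the pattern is a literal with an unterminated bracket expression; the mistake only shows up when that line executes.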
What would it take for a language to 'compile' regex at compile time to be as efficient as writing an equivalent string matching function by hand?
The answer is not black and white. There can be space and time efficiency tradeoffs. Now that you have contextual information at each site that can be used for specialized code generation, how much code do you inline? What do you inline and what don't you? What do you extract into common procedures, and what impact does that have on function-call costs? Inlining isn't always best: there can be cases where more code sharing results in better instruction-cache usage. How much code gets generated? How does the runtime data cost relate to the inputs? How do all those costs compare to the costs of a generalized regex engine? At what point does generating optimized code for known-at-compile-time regexes become more costly in code size than just having a regex engine in the runtime environment?
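As a rough illustration of where that per-site specialized code comes from, here is a toy sketch (assuming C++17) that encodes a pattern in types, so the compiler instantiates a dedicated matcher for each distinct pattern. This is not how any particular library is implemented; `Range`, `Plus`, `Seq`, `full_match`, and the `[0-9]+-[0-9]+` pattern are all names and choices I made up for the example, and it does no backtracking.

```cpp
#include <cstddef>
#include <string_view>

// Sentinel meaning "no match".
inline constexpr std::size_t npos = static_cast<std::size_t>(-1);

// Each pattern element is a type whose static `match` tries to match at
// position `i` and returns the position after the match, or npos on failure.

// Matches one character in the inclusive range [Lo, Hi].
template <char Lo, char Hi>
struct Range {
    static constexpr std::size_t match(std::string_view s, std::size_t i) {
        return (i < s.size() && s[i] >= Lo && s[i] <= Hi) ? i + 1 : npos;
    }
};

// Matches one or more repetitions of P (greedy, no backtracking).
template <typename P>
struct Plus {
    static constexpr std::size_t match(std::string_view s, std::size_t i) {
        std::size_t next = P::match(s, i);
        if (next == npos) return npos;
        while (true) {
            std::size_t more = P::match(s, next);
            if (more == npos) return next;
            next = more;
        }
    }
};

// Matches each element in order; once one fails, the rest are skipped.
template <typename... Ps>
struct Seq {
    static constexpr std::size_t match(std::string_view s, std::size_t i) {
        ((i = (i == npos) ? npos : Ps::match(s, i)), ...);
        return i;
    }
};

// True if pattern P consumes the whole input.
template <typename P>
constexpr bool full_match(std::string_view s) {
    return P::match(s, 0) == s.size();
}

// Hypothetical pattern, roughly [0-9]+-[0-9]+
using Digits = Plus<Range<'0', '9'>>;
using DashedNumber = Seq<Digits, Range<'-', '-'>, Digits>;

// The matcher is usable at compile time as well as at run time.
static_assert(full_match<DashedNumber>("2024-05"));
static_assert(!full_match<DashedNumber>("2024-"));
```

Every distinct pattern type produces its own instantiations of `match`, which is exactly the code-size side of the tradeoff: the optimizer can inline and simplify each one aggressively, but a program with many patterns carries many specialized matchers instead of one shared engine.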
If this is implemented at the library level (and assuming you care), the user needs a way to tell the library what they want to optimize for, i.e. which tradeoff they want. The language then needs to provide facilities powerful enough for library code to estimate space and time costs. Even then, if the language specification and the language implementations are fairly separate, there is a limit to how accurate those estimates can be with respect to what an implementation actually does (e.g. what code it actually generates).
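One way a library could let the user state that preference is a compile-time policy parameter. The sketch below is purely hypothetical (assuming C++17; `Prefer`, `all_digits`, and `is_number` are invented names, not any real library's API): one setting uses a matcher specialized for the pattern `[0-9]+`, the other routes through the shared general-purpose `std::regex` engine.

```cpp
#include <regex>
#include <string_view>

// Hypothetical knob the user sets to state which tradeoff they want.
enum class Prefer { Speed, Size };

// Hand-specialized matcher for the pattern [0-9]+ (one or more digits).
constexpr bool all_digits(std::string_view s) {
    if (s.empty()) return false;
    for (char c : s) {
        if (c < '0' || c > '9') return false;
    }
    return true;
}

// Matches the whole input against [0-9]+ using the strategy selected by P.
template <Prefer P>
bool is_number(std::string_view s) {
    if constexpr (P == Prefer::Speed) {
        // Specialized code for this one pattern, easy for the compiler to inline.
        return all_digits(s);
    } else {
        // Shared general-purpose engine; the pattern is parsed once, on first call.
        static const std::regex re("[0-9]+");
        return std::regex_match(s.begin(), s.end(), re);
    }
}
```

Note what the sketch cannot do: the library has no way to ask the compiler how large either branch ends up or how fast it runs, so the "policy" is only the user's guess passed along, which is the accuracy limit described above.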
Why do most languages not do this?
Not all languages prioritize efficient codegen highly enough as a design goal to want this. That's a perfectly valid design choice. You really can't be everything. In fact, the general trend in languages over time seems to be moving away from the hardware, giving up some of its benefits in exchange for higher-level, more abstract languages and runtimes.
And a lot of runtime models are not really geared towards doing heavy compile-time optimization / specialization. For example, Java bytecode has a limited set of supported instructions, and that can further limit what kinds of code you can generate. At that point, it can become a conversation about adding deeper features to the language or its components (which has its own costs), or just shrugging your shoulders and leaving it to runtime implementations to detect patterns and optimize what they do under the hood (which has its limitations).
Sometimes it's just a matter of nobody caring enough or having enough time to put the work into implementing or specifying it yet. Things don't just magically appear, and a lot of people who work on language design are not doing that as their day-job. Life can get in the way. For languages that aren't designed by just one person, collaborating on things and making decisions as a group has its own challenges as well, such as just meeting at the same time (timezone things), agreeing on the same priorities, dealing with conflicts in effects on different use-cases, etc.
What would be the advantages or disadvantages of having regex as a core language feature as opposed to a library function?
As stated above, the people implementing the language's compilers or interpreter optimizers could implement the time and space tradeoff optimizations themselves. The alternative is a library that attempts those optimizations within its limitations, without that information ever propagating to the compilers or interpreter optimizers.