
I have been working on a private project on adapting LLaMA 3.2, a decoder-only (autoregressive) transformer, for Named Entity Recognition (NER) with Hugging Face (so not implemented "from scratch"). Traditionally, encoder-only models like BERT have dominated NER tasks due to their ability to process input text bidirectionally, capturing rich contextual information. However, by removing the causal mask in LLaMA, we enable it to leverage bidirectional context while maintaining its strengths in generative tasks, making it a versatile solution for NER.
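
For context, here is a minimal sketch of what such a token-classification setup might look like with Hugging Face transformers. It assumes a recent transformers release that ships LlamaForTokenClassification; the checkpoint name and label set are placeholders rather than the project's exact code:

```python
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

# Placeholder checkpoint and label set (the project's actual choices may differ).
model_id = "meta-llama/Llama-3.2-1B"
labels = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers ship without a pad token

model = AutoModelForTokenClassification.from_pretrained(
    model_id,
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
    torch_dtype=torch.bfloat16,
)
model.config.pad_token_id = tokenizer.pad_token_id
```

Out of the box, this only adds a linear token-classification head on top of the Llama backbone; the backbone still attends causally unless the causal mask is removed, which is the change discussed here.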

What do you think about this implementation?

Project: https://github.com/d-kleine/NER_decoder


That's pretty cool! I think that removing the causal mask is reasonable here. I've seen something similar in recent classification-finetuning papers that used Llama models.

Based on the loss and the qualitative eval at the end, it looks like it definitely works!

Btw, how long did it take to finetune? If it's not too long, I'd be curious to know how it would perform if you left the causal mask in place.

@d-kleine

I needed to fix the code, as model.config.is_decoder = False does not seem to disable the causal attention masking, at least for Llama models. I have now implemented the bidirectionality based on this discussion, and it works now.
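
For reference, one way to do this with current Hugging Face transformers is to replace the backbone's internal causal-mask builder so that only padding positions are masked. This is a rough sketch under assumptions, not necessarily the project's actual fix: _update_causal_mask is a private method whose name and signature have changed across transformers releases, and the patch only targets the eager/SDPA attention paths (flash-attention-2 applies causality internally and would need different handling):

```python
import torch

def _no_causal_mask(self, attention_mask, input_tensor, *args, **kwargs):
    """Return a 4D additive mask that hides only padding, never future tokens."""
    bsz, q_len = input_tensor.shape[0], input_tensor.shape[1]
    dtype, device = input_tensor.dtype, input_tensor.device
    if attention_mask is None:
        # All-zeros mask: every token may attend to every other token.
        return torch.zeros(bsz, 1, q_len, q_len, dtype=dtype, device=device)
    # attention_mask: (batch, seq_len) with 1 = real token, 0 = padding.
    pad = attention_mask[:, None, None, :].to(dtype=dtype, device=device)
    return (1.0 - pad) * torch.finfo(dtype).min  # 0 = keep, large negative = drop

# Bind the replacement onto the Llama backbone (assumed to sit at model.model,
# as in LlamaForTokenClassification).
model.model._update_causal_mask = _no_causal_mask.__get__(model.model)
```

With a patch like this, the classification head sees hidden states built from both left and right context. It only makes sense for non-generative uses such as token classification, since text generation still requires causal attention.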

I have tested running both models for a few epochs: the one with the causal mask trains faster than the one without (as expected), but still performs fairly well after some training. The bidirectional model takes much longer to train (again, as expected) and seems to perform slightly, but not significantly, better than the masked one. These are just first impressions from this toy experiment, though, nothing solid.

In the same notebook, I had also tried out BERT as an encoder-only model some weeks ago: compared to that, training a decoder-only model like Llama 3.2 takes much longer in both cases (again, as expected, due to the larger and more complex architecture). At least you can see that both approaches for decoder-only models (causal attention masking and bidirectional attention) seem to work and could be considered an alternative to encoder-only models for NER.

@rasbt (Maintainer), Nov 5, 2024

Thanks for sharing. Yeah, good pretrained decoder-only models are usually pretty capable, even if they don't see future tokens, so I can see the bidirectionality maybe not doing that much (but it should definitely help a little, as you observed).

I think it makes intuitive sense, too. For the example from your notebook:

Example 1: Steve Jobs, the co-founder of Apple Inc., was born in San Francisco, California.

One could infer everything from the left context. Of course there will be exceptions, but in most cases you probably don't need the right context (although it can help).

@d-kleine

Yeah, I agree. For me, it seems like disabling the attention mask doesn't really pay off in terms of performance relative to execution time/cost. But maybe it can be beneficial for specific use cases (or languages) where inferring from both sides is helpful.

I just finalized the notebook and fixed an issue with displaying the performance metrics (the model overfits heavily, but this is just a showcase anyway).
