Skip to main content
  1. About
  2. For Teams
Asked
Viewed 17k times
11

df (Pandas Dataframe) has three rows.

col_name
"This is Donald."
"His hands are so small"
"Why are his fingers so short?"

I'd like to extract the row that contains "is" and "small".

If I do

df.col_name.str.contains("is|small", case=False)

Then it catches "His" as well- which I don't want.

Is below query is the right way to catch the whole word in df.series?

df.col_name.str.contains("\bis\b|\bsmall\b", case=False)
0

5 Answers 5

13

No, the regex /bis/b|/bsmall/b will fail because you are using /b, not \b which means "word boundary".

Change that and you get a match. I would recommend using

\b(is|small)\b

This regex is a little faster and a little more legible, at least to me. Remember to put it in a raw string (r"\b(is|small)\b") so you don’t have to escape the backslashes.

Sign up to request clarification or add additional context in comments.

6 Comments

Thank you. I reflected your point /b -> \b. Still want to wait for few more days to see if there is any other way to catch the whole word.
tangentially, I had to add a r before the string to get it to work: anybody knows why? I have not found any reference to it..
well, apparently the |char makes it implicitly into a regex, while a \b does not..
@mccc It makes it into a raw string (this is a Python thing, not a Pandas or Regex thing).
@Laurel I think your answer would be more complete if you added the point about using raw string parameter because that is also missing from OP's query.
|
4

First, you may want to convert everything to lowercase, remove punctuation and whitespace and then convert the result into a set of words.

import string

df['words'] = [set(words) for words in
    df['col_name']
    .str.lower()
    .str.replace('[{0}]*'.format(string.punctuation), '')
    .str.strip()
    .str.split()
]

>>> df
                        col_name                                words
0                This is Donald.                   {this, is, donald}
1         His hands are so small         {small, his, so, are, hands}
2  Why are his fingers so short?  {short, fingers, his, so, are, why}

You can now use boolean indexing to see if all of your target words are in these new word sets.

target_words = ['is', 'small']
# Convert target words to lower case just to be safe.
target_words = [word.lower() for word in target_words]

df['match'] = df.words.apply(lambda words: all(target_word in words 
                                               for target_word in target_words))


print(df)
# Output: 
#                         col_name                                words  match
# 0                This is Donald.                   {this, is, donald}  False
# 1         His hands are so small         {small, his, so, are, hands}  False
# 2  Why are his fingers so short?  {short, fingers, his, so, are, why}  False    

target_words = ['so', 'small']
target_words = [word.lower() for word in target_words]

df['match'] = df.words.apply(lambda words: all(target_word in words 
                                               for target_word in target_words))

print(df)
# Output:
# Output: 
#                         col_name                                words  match
# 0                This is Donald.                   {this, is, donald}  False
# 1         His hands are so small         {small, his, so, are, hands}   True
# 2  Why are his fingers so short?  {short, fingers, his, so, are, why}  False    

To extract the matching rows:

>>> df.loc[df.match, 'col_name']
# Output:
# 1    His hands are so small
# Name: col_name, dtype: object

To make this all into a single statement using boolean indexing:

df.loc[[all(target_word in word_set for target_word in target_words) 
        for word_set in (set(words) for words in
                         df['col_name']
                         .str.lower()
                         .str.replace('[{0}]*'.format(string.punctuation), '')
                         .str.strip()
                         .str.split())], :]

5 Comments

Thanks for the answer.. I'm trying to use Pandas' inbuilt indexing (because my table contains about 500k rows) but I guess you are indexing it by yourself...?
Not sure what you mean. This does use Pandas indexing.
This will return a match but not the whole string match !
@Nico, Please elaborate. Just use boolean indexing on the relevant column to extract the matching rows per the example above.
@Alexander Is it possible to find which word is matching in the sentence .
2

In "\bis\b|\bsmall\b", the backslash \b is parsed as ASCII Backspace before it is even passed to the regular expression method for matching/searching. For more information check this document about escape characters. It is mentioned in this document, that

When an ‘r’ or ‘R’ prefix is present, a character following a backslash is included in the string without change, and all backslashes are left in the string.

Therefore, there are two options -

  1. Use r prefix
df.col_name.str.contains(r"\bis\b|\bsmall\b", case=False)
  1. (Or) Escape the \ character -
df.col_name.str.contains("\\bis\\b|\\bsmall\\b", case=False)

Comments

0

Your way (with /b) didn't work for me. I'm not sure why you can't use the logical operator and (&) since I think that's what you actually want.

This is a silly way to do it, but it works:

mask = lambda x: ("is" in x) & ("small" in x)
series_name.apply(mask)

1 Comment

The example you gave is confusing in that regard, although I see that you've since reworded it to make it a little clearer. This solves what you originally said the problem was "I'd like to extract the row that contains "is" and "small"."
0

In extension to the discussion, I want to use a variable inside the regex as follows:

df = df_w[df_w['Country/Region'].str.match("\b(location.loc[i]['country'])\b",case=False)]

If I do not put \b\b, the code returns all the columns with both Sudan and South Sudan. While, when I use "\b(location.loc[i]['country'])\b", it returns empty dataframes. Kindly tell me the correct usage.

Comments

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.

Morty Proxy This is a proxified and sanitized view of the page, visit original site.