Skip to main content
  1. About
  2. For Teams
Asked
Viewed 8k times
2

I have two text columns and I would like to find whether a word from one column is present in another. I wrote the below code, which works very well, but it detects if a word is present anywhere in the string. For example, it will find "ha" in "ham". I want to use regex expression instead, but I am stuck. I came across this post and looked at the second answer, but I haven't been able to modify it for my purpose. I would like to do something similar.

I would appreciate help and/or any pointers

d = {'emp': ['abc d. efg', 'za', 'sdfadsf '], 'vendor': ['ABCD enterprise', 'za industries', '' ]}
df = pd.DataFrame(data=d)
df['clean_empy_name']=df["emp"].str.lower().str.replace('\W', ' ')

def check_subset(vendor, employee):
    s = []
    for n in employee.split():
      # n=" " + n +"[^a-zA-Z\d:]"
      if ((str(n) in vendor.lower()) & (len(str(n))>1)):
        s.append(n)
    return s

check_subset("ABC-xy 54", "54 xy")

df['emp_name_find_in_vendor'] = df.apply(lambda row: check_subset(row['vendor'],row['clean_empy_name']), axis=1)
df
#########update 2

i updated my dataframe as below

d = {'emp': ['abc d. efg', 'za', 'sdfadsf ','abc','yuma'], 'vendor': ['ABCD enterprise', 'za industries', '','Person Vue\Cisco','U OF M CONTLEARNING' ]}
df = pd.DataFrame(data=d)
df['clean_empy_name']=df["emp"].str.lower().str.replace('\W', ' ')

I used code provided by first answer and it fails

  1. in case of 'Person Vue\Cisco' it throws the error error: bad escape \c. If i remove \ in 'Person Vue\Cisco', code runs fine
  2. in case of 'U OF M CONTLEARNING' it return u and m when clearly they are not a match

1 Answer 1

3

Yes, you can! It is going to be a little bit messy so let me construct in a few steps:

First, let's just create a regular expression for the single case of check_subset("ABC-xy 54", "54 xy"):

  • We will use re.findall(pattern, string) to find all the occurrences of pattern in string
  • The regex pattern will basically say "any of the words":
    • for the "any" we use the | (or) operator
    • for constructing words we need to use the parenthesis to group together... However, parenthesis (word) create a group that keeps track, so we could later call reuse these groups, since we are not interested we can create a non-capturing group by adding ?: as follows: (?:word)
import re

re.findall('(?:54)|(?:xy)', 'ABC-xy 54')
# -> ['xy', '54']

Now, we have to construct the pattern each time:

  • Split into words
  • Wrap each word inside a non-capturing group (?:)
  • Join all of these groups by |
re.findall('|'.join(['(?:'+x+')' for x in '54 xy'.split()]), 'ABC-xy 54')

One minor thing, since the last row's vendor is empty and you seem to want no matches (technically, the empty string matches with everything) we have to add a minor check. So we can rewrite your function to be:

def check_subset_regex(vendor, employee):
    if vendor == '':
        return []

    pattern = '|'.join(['(?:'+x+')' for x in vendor.lower().split(' ')])
    return re.findall(pattern, employee)

And then we can apply the same way:

df['emp_name_find_in_vendor_regex'] = df.apply(lambda row: check_subset_regex(row['vendor'],row['clean_empy_name']), axis=1)

One final comment is that your solution matches partial words, so employee Tom Sawyer would match "Tom" to the vendor "Atomic S.A.". The regex function I provided here will not give this as a match, should you want to do this the regex would become a little more complicated.


EDIT: Removing punctuation marks from vendors

You could either add a new column as you did with clean_employee, or simply add the removal to the function, as so (you will need to import string to get the string.punctuation, or just add in there a string with all the symbols you want to substitute):

def check_subset_regex(vendor, employee):
    if vendor == '':
        return []

    clean_vnd = re.sub('[' + string.punctuation + ']', '', vendor)

    pattern = '|'.join(['(?:'+x+')' for x in clean_vnd.lower().split(' ')])
    return re.findall(pattern, employee)

In the spirit of teaching to fish :), in regex the [] denote any of these characters... So [abc] would be the same to a|b|c.

So the re.sub line will substitute any occurrence of the string.punctuation (which evaluates to !"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~) characters by a '' (removing them).


EDIT2: Adding the possibility of a single non-alphanumeric character at the end of each searchword:

def check_subset_regex(vendor, employee):
    if vendor == '':
        return []

    clean_vnd = re.sub('[' + string.punctuation + ']', '', vendor)

    pattern = '|'.join(['(?:'+x+'[^a-zA-Z0-9]?)' for x in clean_vnd.lower().split(' ')])
    return re.findall(pattern, employee)

In this case we are using:
- ^ as the first character inside a [] (called character class), denotes any character except for those specified in the character class, e.g. [^abc] would match anything that is not a or b or c (so d, or a white space, or @) - and the ?, which means the previous symbol is optional...

So, [^a-zA-Z0-9]? means an optional single non-alphanumeric character.

Sign up to request clarification or add additional context in comments.

6 Comments

i don't want "Tom" to be found in "Atomic". Your solution looks good. Let me understand it and i will accept it. Thanks for the explanation..that helps...teaches me how to fish!
it doesn't work in case of check_subset_regex("ABC-xy 54!", "54 xy")....when i do the search i am okay with special characters as they could indicate end of the sentence or end of the question etc...
@Ni_Tempe In that case it doesn't match because we haven't cleared the punctuation marks from the vendor. Would you want to clear all punctuation marks?
we can clear all special characters first. But i was wondering can we something like "search_word=" "+search_word +"[^a-zA-Z\d:]" and then str.findall(search_word)
You mean including for the possibility of special characters at the end of the word? Sure, see EDIT2 :)
|

Your Answer

Post as a guest

Required, but never shown

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.

Morty Proxy This is a proxified and sanitized view of the page, visit original site.