Regular expression to match string not containing a word?
15
I know it is possible to match the word and then reverse the match using a tool's options (e.g. grep -v). However, I want to know whether it is possible, using regular expressions alone, to match lines that do not contain a specific word, say hede.

Input:

# grep "Regex for do not contain hede" Input

Output: 
Gino Medhurst Created at: 2013-11-13 17:07:08 UTC By Gino Medhurst
Probably a couple years late, but what's wrong with: ([^h]*(h([^e]|$)|he([^d]|$)|hed([^e]|$)))*? The idea is simple. Keep matching until you see the start of the unwanted string, then only match in the N-1 cases where the string is unfinished (where N is the length of the string). These N-1 cases are "h followed by non-e", "he followed by non-d", and "hed followed by non-e". If you managed to pass these N-1 cases, you successfully didn't match the unwanted string so you can start looking for [^h]* again - Haskell Maggio
@stevendesu: try this for 'a-very-very-long-word' or even better half a sentence. Have fun typing. BTW, it is nearly unreadable. Don't know about the performance impact. - Ena Bins DVM
@PeterSchuetze: Sure it's not pretty for very very long words, but it is a viable and correct solution. Although I haven't run tests on the performance, I wouldn't imagine it being too slow since most of the latter rules are ignored until you see an h (or the first letter of the word, sentence, etc.). And you could easily generate the regex string for long strings using iterative concatenation. If it works and can be generated quickly, is legibility important? That's what comments are for. - Carleton Padberg
@stevendesu: i'm even later, but that answer is almost completely wrong. for one thing, it requires the subject to contain "h" which it shouldn't have to, given the task is "match lines which [do] not contain a specific word". let us assume you meant to make the inner group optional, and that the pattern is anchored:

^([^h]*(h([^e]|$)|he([^d]|$)|hed([^e]|$))?)*$

this fails when instances of "hede" are preceded by partial instances of "hede" such as in "hhede". - Ezra Ziemann
An earlier question/answer for the same problem is available here:
[link]stackoverflow.com/questions/116819/…
The answers are similar, but might be interesting. - Taylor Swaniawski
11 Answers
0
This works if you want the list of non-'hede' words as a result:

h='Hoho \
Hihi \
Haha \
hede'

print([x for x in h.split() if x != 'hede'])


You can also rebuild the list, separated by newlines:

print('\n'.join([x for x in h.split() if x != 'hede']))

0
If you're just using it for grep, you can use grep -v hede to get all lines which do not contain hede.

ETA Oh, rereading the question, grep -v is probably what you meant by "tools options".
0
Not regex, but I've found it logical and useful to use serial greps with pipe to eliminate noise.

E.g., search an Apache config file while skipping the comments:

grep -v '\#' /opt/lampp/etc/httpd.conf      # prints every line that does not contain a '#'


and

grep -v '\#' /opt/lampp/etc/httpd.conf |  grep -i dir


The logic of serial greps is: (not a comment) and (matches dir).
0
The fact that regex doesn't support inverse matching is not entirely true. You can mimic this behavior by using negative look-arounds:

^((?!hede).)*$


The regex above will match any string, or line without a line break, not containing the (sub) string 'hede'.
As mentioned, this is not something regex is "good" at (or should do), but still, it is possible. 

And if you need to match line break chars as well, use the DOT-ALL modifier (the trailing s in the following pattern):

/^((?!hede).)*$/s


or use it inline:

/(?s)^((?!hede).)*$/


(where the /.../ are the regex delimiters, ie, not part of the pattern)

If the DOT-ALL modifier is not available, you can mimic the same behavior with the character class [\s\S]:

/^((?!hede)[\s\S])*$/


Explanation

A string is just a list of n characters. Before, and after each character, there's an empty string. So a list of n characters will have n+1 empty strings. Consider the string "ABhedeCD":

    +--+---+--+---+--+---+--+---+--+---+--+---+--+---+--+---+--+
S = |e1| A |e2| B |e3| h |e4| e |e5| d |e6| e |e7| C |e8| D |e9|
    +--+---+--+---+--+---+--+---+--+---+--+---+--+---+--+---+--+

index    0      1      2      3      4      5      6      7


where the e's are the empty strings. The regex (?!hede). looks ahead to see if there's no substring "hede" to be seen, and if that is the case (so something else is seen), then the . (dot) will match any character except a line break. Look-arounds are also called zero-width-assertions because they don't consume any characters. They only assert/validate something. 

So, in my example, every empty string is first validated to see if there's no "hede" up ahead, before a character is consumed by the . (dot). The regex (?!hede). will do that only once, so it is wrapped in a group, and repeated zero or more times: ((?!hede).)*. Finally, the start- and end-of-input are anchored to make sure the entire input is consumed: ^((?!hede).)*$

As you can see, the input "ABhedeCD" will fail because on e3, the regex (?!hede) fails (there is "hede" up ahead!).
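
A minimal sketch of the same pattern in Python's re module, for illustration only. The sample lines and variable names are made up here, and the second half assumes that re.DOTALL plays the role of the DOT-ALL modifier described above:

```python
import re

# Pattern from the answer: each consumed character must not start "hede".
NOT_CONTAINING_HEDE = re.compile(r"^((?!hede).)*$")

# Hypothetical sample lines, chosen for this illustration.
lines = ["hoho", "hihi", "haha", "hede", "ABhedeCD", ""]
kept = [line for line in lines if NOT_CONTAINING_HEDE.search(line)]
print(kept)  # ['hoho', 'hihi', 'haha', '']

# With a line break in the input, the plain pattern fails because "." stops
# at the newline; the DOTALL flag lets "." cross it.
multi = "line one\nline two"
DOTALL_VERSION = re.compile(r"^((?!hede).)*$", re.DOTALL)
print(bool(NOT_CONTAINING_HEDE.search(multi)))  # False
print(bool(DOTALL_VERSION.search(multi)))       # True
```

Note that the empty line is kept: zero repetitions of the group satisfy the pattern.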
0
Note that the solution for "does not start with hede":

^(?!hede).*$


is generally much more efficient than the solution for "does not contain hede":

^((?!hede).)*$


The former checks for “hede” only at the input string’s first position, rather than at every position.
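
To make the difference in meaning concrete, here is a small Python sketch that runs both patterns over a few made-up strings. It only shows which inputs each pattern accepts; it does not measure performance:

```python
import re

NOT_STARTING_WITH = re.compile(r"^(?!hede).*$")   # lookahead tested once, at position 0
NOT_CONTAINING = re.compile(r"^((?!hede).)*$")    # lookahead tested before every character

# Hypothetical inputs for illustration.
samples = ["hede first", "ends with hede", "no match here"]

results = {
    s: (bool(NOT_STARTING_WITH.search(s)), bool(NOT_CONTAINING.search(s)))
    for s in samples
}
for text, (not_starting, not_containing) in results.items():
    print(f"{text!r}: not-starting={not_starting}, not-containing={not_containing}")
```

Only "ends with hede" is treated differently: it passes the first pattern and is rejected by the second.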
0
Here's a good explanation of why it's not easy to negate an arbitrary regex. I have to agree with the other answers, though: if this is anything other than a hypothetical question, then a regex is not the right choice here.
0
If you want the regex to only fail if the entire string matches, the following will work:

^(?!hede$).*


e.g. -- If you want to allow all values except "foo" (i.e. "foofoo", "barfoo", and "foobar" will pass, but "foo" will fail), use: ^(?!foo$).*

Of course, if you're checking for exact equality, a better general solution in this case is to check for string equality, i.e. 

myStr !== 'foo'


You could even put the negation outside the test if you need any regex features (here, case insensitivity and range matching):

!/^[a-f]oo$/i.test(myStr)


The regex solution at the top may be helpful, however, in situations where a positive regex test is required (perhaps by an API).
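
As a rough Python equivalent of the ideas above (the variable names are illustrative, not from the answer), the following sketch checks the "everything except foo" pattern against the four example values and compares it with a plain inequality test:

```python
import re

# Reject only the exact string "foo"; anything else passes.
NOT_EXACTLY_FOO = re.compile(r"^(?!foo$).*")

values = ["foo", "foofoo", "barfoo", "foobar"]
by_regex = {v: bool(NOT_EXACTLY_FOO.match(v)) for v in values}
by_equality = {v: v != "foo" for v in values}

print(by_regex)
# {'foo': False, 'foofoo': True, 'barfoo': True, 'foobar': True}
print(by_regex == by_equality)  # True
```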
0
If you want to negate a whole word while matching character by character, in the way a negated character class negates single characters:

For example, a string:

<?
$str="aaa        bbb4      aaa     bbb7";
?>


Do not use:

<?
preg_match('/aaa[^bbb]+?bbb7/s', $str, $matches);
?>


Use:

<?
preg_match('/aaa(?:(?!bbb).)+?bbb7/s', $str, $matches);
?>


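Notice:

"(?!bbb)." is a negative lookahead followed by a dot: the lookahead checks the current position without consuming anything, and the dot then consumes the character at that same position. For example, "(?=abc)abcde" matches "abcde", while "(?!abc)abcde" can never match.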

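The reason to avoid [^bbb] is that it is a character class, equivalent to [^b], so it rejects any single b rather than the word bbb. The Python sketch below illustrates this. On the string from the answer both patterns happen to find the same match, so a second, hypothetical string with a lone b is added to show where they differ:

```python
import re

CHAR_CLASS = re.compile(r"aaa[^bbb]+?bbb7", re.DOTALL)      # [^bbb] means "any char but b"
TEMPERED = re.compile(r"aaa(?:(?!bbb).)+?bbb7", re.DOTALL)  # "any char not starting bbb"

original = "aaa        bbb4      aaa     bbb7"  # string from the answer
with_single_b = "aaa  b  bbb7"                  # hypothetical: a lone "b" in between


def matched_text(pattern, text):
    """Return the matched substring, or None when there is no match."""
    m = pattern.search(text)
    return m.group(0) if m else None


print(matched_text(CHAR_CLASS, original) == matched_text(TEMPERED, original))  # True
print(matched_text(CHAR_CLASS, with_single_b))  # None
print(matched_text(TEMPERED, with_single_b))    # aaa  b  bbb7
```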
0
With this, you avoid testing a lookahead at every position:

/^(?:[^h]++|h++(?!ede))*+$/

0
The OP did not specify, or tag the post to indicate, the context (programming language, editor, tool) in which the regex will be used.

For me, I sometimes need to do this while editing a file using Textpad.  

Textpad supports some Regex, but does not support lookahead or lookbehind, so it takes a few steps.  

If I am looking to retain all lines that Do NOT contain the string hede, I would do it like this:


  1. Search/replace the entire file to add a unique "Tag" to the beginning of each line containing any text.

    Search string:^(.)  
    Replace string:<@#-unique-#@>\1  
    Replace-all  

  2. Delete all lines that contain the string hede (replacement string is empty):  

    Search string:<@#-unique-#@>.*hede.*\n  
    Replace string:<nothing>  
    Replace-all  

  3. At this point, all remaining lines Do NOT contain the string hede. Remove the unique "Tag" from all lines (replacement string is empty):  

    Search string:<@#-unique-#@>
    Replace string:<nothing>  
    Replace-all  


Now you have the original text with all lines containing the string hede removed.
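
For readers without Textpad, the three steps above can be sketched in Python with re.sub. This is only an emulation under assumptions: the sample text is made up, the function name is hypothetical, and Textpad's own regex dialect may differ in details. As in the Textpad recipe, a final line containing hede is only removed if it ends with a newline.

```python
import re

TAG = "<@#-unique-#@>"  # the unique tag from the steps above


def drop_lines_containing(text, word="hede"):
    """Emulate the tag / delete / untag search-and-replace sequence."""
    # Step 1: tag the beginning of every line that contains any text.
    tagged = re.sub(r"^(.)", TAG + r"\1", text, flags=re.MULTILINE)
    # Step 2: delete tagged lines that contain the word, newline included.
    pruned = re.sub(re.escape(TAG) + r".*" + re.escape(word) + r".*\n", "", tagged)
    # Step 3: remove the tag from the remaining lines.
    return pruned.replace(TAG, "")


sample = "hoho\nhihi\nhede\nhaha hede x\n\nhaha\n"  # hypothetical input
print(repr(drop_lines_containing(sample)))  # 'hoho\nhihi\n\nhaha\n'
```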


If I am looking to Do Something Else to only lines that Do NOT contain the string hede, I would do it like this:


  1. Search/replace the entire file to add a unique "Tag" to the beginning of each line containing any text.

    Search string:^(.)  
    Replace string:<@#-unique-#@>\1  
    Replace-all  

  2. For all lines that contain the string hede, remove the unique "Tag":  

    Search string:<@#-unique-#@>(.*hede)
    Replace string:\1  
    Replace-all  

  3. At this point, all lines that begin with the unique "Tag", Do NOT contain the string hede. I can now do my Something Else to only those lines.

  4. When I am done, I remove the unique "Tag" from all lines (replacement string is empty):  

    Search string:<@#-unique-#@>
    Replace string:<nothing>  
    Replace-all  

0
The given answers are perfectly fine, just an academic point:

Regular expressions in the theoretical computer science sense have no look-arounds, so they cannot do it like this. For them it would have to look something like this (written here for the word hehe, and grouped so that ^ applies to every alternative):

^(([^h].*$)|(h([^e].*$|$))|(he([^h].*$|$))|(heh([^e].*$|$))|(hehe.+$))


This only does a FULL match. Doing it for sub-matches would be even more awkward.
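
As an illustration of the same idea, here is a look-around-free pattern adapted to the word hede and checked in Python. The pattern below is my own adaptation, not the answer's: it accepts every single-line string except the exact string "hede" (the empty string included), which is the full-match behaviour described above rather than a "does not contain" test.

```python
import re

# One branch per way a string can differ from "hede":
# wrong first char, wrong 2nd, wrong 3rd, wrong 4th, or extra trailing chars.
# The trailing "?" on the group also lets the empty string through.
NOT_EQUAL_HEDE = re.compile(
    r"^(?:[^h].*|h(?:[^e].*)?|he(?:[^d].*)?|hed(?:[^e].*)?|hede.+)?$"
)

# Hypothetical test strings.
candidates = ["hede", "", "h", "he", "hed", "hedex", "xhede", "hhede"]
accepted = [s for s in candidates if NOT_EQUAL_HEDE.search(s)]
print(accepted)  # ['', 'h', 'he', 'hed', 'hedex', 'xhede', 'hhede']
```

Strings such as "xhede" are accepted even though they contain hede, which shows why the sub-match version needs considerably more work.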