Using re to find only uppercase letters - Printable Version
Python Forum (https://python-forum.io)
Forum: Python Coding (https://python-forum.io/forum-7.html)
Forum: Homework (https://python-forum.io/forum-9.html)
Thread: Using re to find only uppercase letters (/thread-33793.html)
Using re to find only uppercase letters - ranbarr - May-27-2021

Hi, I'm trying to solve a problem using the re module, and one of the requirements is to find a string with the letters ATGC only in uppercase. This is my code:

    import re

    def isVCF(file):
        num_format = re.compile(r"^chr(?:0?[1-9]|[1-9][0-9]|[MXY])\t0*[1-9][0-9]*\t[^\t]*(?:\t[ATCG]){2}\t")
        with open(file, "r+") as my_file:
            for line in my_file:
                if not num_format.match(line):
                    return False
        return True

And this is an example of a line:

The problem is that it's matching the lowercase "t" as well, and I only want it to find uppercase letters. I've tried several things but none worked. I'd appreciate any kind of help!

RE: Using re to find only uppercase letters - perfringo - May-27-2021

What is the exact meaning of 'find a string with the letters ATGC only in uppercase'? Does it mean 'determine whether a line contains a word constructed only from the letters ATGC in any combination'? Or the same applied to the whole file? And finally, do you have to use re? Line #2 in your code reminds me of an old programming joke: regex in plural is regrets.

RE: Using re to find only uppercase letters - nilamo - May-27-2021

Is there something that isn't shown? This shouldn't have matched:

ChrX (your regex only looks for lowercase "chr")

I'm assuming that's the same reason the lowercase "t" was matched.

RE: Using re to find only uppercase letters - ranbarr - May-28-2021

(May-27-2021, 06:53 PM)perfringo Wrote: What is the exact meaning of 'find a string with the letters ATGC only in uppercase'? Does it mean 'determine whether a line contains a word constructed only from the letters ATGC in any combination'? Or the same applied to the whole file? And finally, do you have to use re? Line #2 in your code reminds me of an old programming joke: regex in plural is regrets.

It means that one of the letters (ATGC, just one) appears in columns 4 and 5, and yes, sadly I have to use regex.

RE: Using re to find only uppercase letters - ranbarr - May-28-2021

(May-27-2021, 09:21 PM)nilamo Wrote: Is there something that isn't shown? This shouldn't have matched:

You are right, it's actually like this:

    (r"^[Cc]hr(?:0?[1-9]|[1-9][0-9]|[MXY])\t0*[1-9][0-9]*\t[^\t]*\t[ATGC]{2}

RE: Using re to find only uppercase letters - nilamo - May-28-2021

Something still seems off, as that regex won't match the string.

    >>> import re
    >>> test = 'ChrX\t74226540\tT\tt\t50\t.'
    >>> test
    'ChrX\t74226540\tT\tt\t50\t.'
    >>> print(test)
    ChrX    74226540        T       t       50      .
    >>> raw_regex = r"^[Cc]hr(?:0?[1-9]|[1-9][0-9]|[MXY])\t0*[1-9][0-9]*\t[^\t]*\t[ATGC]{2}"
    >>> regex = re.compile(raw_regex)
    >>> regex.match(test)
    >>> regex
    re.compile('^[Cc]hr(?:0?[1-9]|[1-9][0-9]|[MXY])\\t0*[1-9][0-9]*\\t[^\\t]*\\t[ATGC]{2}')

RE: Using re to find only uppercase letters - ranbarr - May-31-2021

(May-28-2021, 06:58 PM)nilamo Wrote: Something still seems off, as that regex won't match the string.

I kind of figured it out. For some reason, when I use the {2} it's case insensitive, so I just separated it to do it twice:

    import re

    def isVCF(file):
        num_format = re.compile(r"^[Cc]hr(?:0?[1-9]|[1-9][0-9]|[MXY])\t0*[1-9][0-9]*\t[^\t]*\t[ATGC]\t[ATGC]")
        with open(file, "r+") as my_file:
            for line in my_file:
                if line.startswith("#"):
                    continue
                if num_format.match(line):
                    return True
                else:
                    return False

I used the line.startswith check to skip the header line.
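
A note on the last two posts, with a sketch. A `{2}` quantifier does not change case sensitivity: a pattern compiled without `re.IGNORECASE` never lets `[ATGC]` match a lowercase letter. What `\t[ATGC]{2}` does is repeat only the character class, so it asks for two letters side by side in one column (for example `TG`), which is why the match in nilamo's session came back empty. Repeating a group that includes the tab, `(?:\t[ATGC]){2}`, is equivalent to writing `\t[ATGC]\t[ATGC]`. Also, the last `isVCF` in the thread returns from inside the loop, so it only ever checks the first line that does not start with `#`. The sketch below checks every line. The names `VCF_LINE` and `is_vcf_lines`, the sample lines, and the trailing lookahead (column 5 must end at a tab or at the end of the line) are assumptions made for illustration, not part of the original assignment.

```python
import re

# Compiled without re.IGNORECASE, so [ATGC] only matches uppercase letters.
VCF_LINE = re.compile(
    r"^[Cc]hr(?:0?[1-9]|[1-9][0-9]|[MXY])"  # chromosome, e.g. chr1, Chr07, chrX
    r"\t0*[1-9][0-9]*"                      # column 2: a positive integer
    r"\t[^\t]*"                             # column 3: anything without a tab
    r"(?:\t[ATGC]){2}"                      # columns 4 and 5: one uppercase base each
    r"(?=\t|$)"                             # assumption: column 5 ends here
)


def is_vcf_lines(lines):
    """Return True only if every non-header line matches VCF_LINE."""
    for line in lines:
        if line.startswith("#"):
            continue  # skip header lines
        if not VCF_LINE.match(line):
            return False  # one bad line is enough to reject the input
    return True  # reached only after all lines were checked


# Hypothetical sample lines, made up for this sketch.
good = "chr1\t12345\t.\tA\tG\t50\t.\n"
lower = "chr1\t12345\t.\tA\tg\t50\t.\n"

print(is_vcf_lines(["#header\n", good]))
print(is_vcf_lines([good, lower]))
```

Because an open file yields its lines when iterated, the same function can be called as `is_vcf_lines(my_file)` inside a `with open(file) as my_file:` block.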