Python: Regex is not good for re.search (AttributeError: 'NoneType' object has no att
Python Forum (https://python-forum.io), Forum: General Coding Help, Thread: /thread-40243.html
Python: Regex is not good for re.search (AttributeError: 'NoneType' object has no att - Melcu54 - Jun-28-2023

In my HTML file I have these lines:

    <div class="color-black mt-lg-0" id="hidden">, in</div>
    <a href="https://neculaifantanaru.com/en/leadership-pro.html" title="View all articles from Leadership Pro" class="color-green font-weight-600 mx-1" id="hidden">Leadership Pro</a>

I use this regex:

    ^\s*<a href="(.*?)" title="View

in order to find this link: https://neculaifantanaru.com/en/leadership-pro.html

In Notepad++ the regex search works. The problem is in Python.

FIND (on line 18):

    b_content = re.search('^\s*<a href="(.*?)" title="View', new_file_content).group(1)

REPLACE:

    old_file_content = re.sub(', in <a href="(.*?)" title="Vezi', f', in <a href="{b_content}" title="Vezi', old_file_content)

This gives me the following error on line 18:

    Traceback (most recent call last):
      File "<module2>", line 18, in <module>
    AttributeError: 'NoneType' object has no attribute 'group'

I also tried changing that line to:

    b_content = re.match(r'^\s*<a href="(.*?)" title="View', new_file_content).group(1)

but I get the same error.

RE: Python: Regex is not good for re.search (AttributeError: 'NoneType' object has no att - bowlofred - Jun-28-2023

Doing regex on HTML is super annoying. Use an HTML parser instead (like BeautifulSoup).

Your regex is anchored at the front of the string. If your new_file_content contains the entire file, then the match will fail. When I try your command with only the second line in that variable, it matches.
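
To make the anchoring point concrete, here is a minimal sketch. The `html` string is only a shortened stand-in for the two lines quoted in the question (an assumption, not the real file). Without any flags, `^` matches only at the very start of the string, so `re.search` returns `None` for the whole text and calling `.group(1)` on it raises the AttributeError.

```python
import re

# Shortened stand-in for the two lines from the question (assumed sample data).
html = (
    '<div class="color-black mt-lg-0" id="hidden">, in</div>\n'
    '<a href="https://neculaifantanaru.com/en/leadership-pro.html" '
    'title="View all articles from Leadership Pro">Leadership Pro</a>\n'
)

pattern = r'^\s*<a href="(.*?)" title="View'

# Whole text: "^" can only match at position 0, where the <div> sits.
whole = re.search(pattern, html)

# Only the second line: the <a> tag is now at the start of the string.
second_line = html.splitlines()[1]
single = re.search(pattern, second_line)

print(whole)            # None, so whole.group(1) would raise AttributeError
print(single.group(1))  # the URL from the href attribute
```

Checking the result for `None` before calling `.group()` avoids the traceback, but it does not make the pattern match; the anchor still has to be dealt with.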
RE: Python: Regex is not good for re.search (AttributeError: 'NoneType' object has no att - Gribouillis - Jun-28-2023

(Jun-28-2023, 07:32 AM)bowlofred Wrote: Your regex is anchored at the front of the string.

The regex multiline mode (?m) could do the trick.
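
A small sketch of what multiline mode changes, again using a shortened stand-in for the HTML from the question (assumed sample data). With `(?m)` at the start of the pattern, or equivalently the `re.MULTILINE` flag, `^` also matches right after each newline, so the `<a>` tag on the second line can be found.

```python
import re

# Shortened stand-in for the two lines from the question (assumed sample data).
html = (
    '<div class="color-black mt-lg-0" id="hidden">, in</div>\n'
    '<a href="https://neculaifantanaru.com/en/leadership-pro.html" '
    'title="View all articles from Leadership Pro">Leadership Pro</a>\n'
)

# Inline flag: (?m) must be the first thing in the pattern.
inline = re.search(r'(?m)^\s*<a href="(.*?)" title="View', html)

# Same behaviour, with the flag passed as an argument instead.
flagged = re.search(r'^\s*<a href="(.*?)" title="View', html, re.MULTILINE)

for match in (inline, flagged):
    if match is not None:
        print(match.group(1))
    else:
        print("No match found")
```

Both calls should find the same URL; which spelling to use is mostly a matter of taste.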
RE: Python: Regex is not good for re.search (AttributeError: 'NoneType' object has no att - Melcu54 - Jun-28-2023

(Jun-28-2023, 07:42 AM)Gribouillis Wrote:
(Jun-28-2023, 07:32 AM)bowlofred Wrote: Your regex is anchored at the front of the string.
The regex multiline mode (?m) could do the trick.

Hello. Can you update my code so I can understand it better?

RE: Python: Regex is not good for re.search (AttributeError: 'NoneType' object has no att - snippsat - Jun-28-2023

The old classic read.

    from bs4 import BeautifulSoup
    import re

    html = '''\
    <div class="color-black mt-lg-0" id="hidden">, in</div>
    <a href="https://neculaifantanaru.com/en/leadership-pro.html" title="View all articles from Leadership Pro" class="color-green font-weight-600 mx-1" id="hidden">Leadership Pro</a>
    '''

    soup = BeautifulSoup(html, 'html.parser')
    link = soup.find('a').get('href')
    print(link)

If you are wondering about a working regex: as the link says, you should not use regex with HTML/XML. It can work on a small part, as here, but it can/will blow up with errors on larger HTML.

    >>> import re
    >>>
    >>> b_content = re.search(r"<a href=\"(.*?)\"", html).group(1)
    >>> b_content
    'https://neculaifantanaru.com/en/leadership-pro.html'

RE: Python: Regex is not good for re.search (AttributeError: 'NoneType' object has no att - Melcu54 - Jun-28-2023

    import re

    # Read the contents of new-file.html
    with open('c:/Folder7/new-file.html', 'r') as file:
        first_code = file.read()

    # Read the contents of old-file.html
    with open('c:/Folder7/old-file.html', 'r') as file:
        second_code = file.read()

    # Extract the URL from first_code
    match = re.search('<a href="(.*?)" title="View all articles', first_code)
    if match is not None:
        url = match.group(1)
        # Replace the URL in second_code
        second_code = re.sub(', in <a href=".*?" title="Vezi toate', f', in <a href="{url}" title="Vezi toate', second_code)
        # Write the modified content back to old-file.html
        with open('c:/Folder7/old-file.html', 'w') as file:
            file.write(second_code)
    else:
        print("No match found")

RE: Python: Regex is not good for re.search (AttributeError: 'NoneType' object has no att - Gribouillis - Jun-28-2023

(Jun-28-2023, 07:50 AM)Melcu54 Wrote: Can you update my code as to understand better ?

Add (?m) at the beginning of the regex, as specified in the re.MULTILINE documentation. It is very useful to read the documentation.
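
One caveat about the f-string replacement in the code above: `re.sub` processes backslash escapes in a replacement string, so a URL that happened to contain a backslash could be mangled or rejected. A possible workaround, sketched here under assumptions, is to pass a function as the replacement, since its return value is inserted as-is. The helper name `replace_href`, the sample `old_html` line, and the example.com URL are all hypothetical.

```python
import re

def replace_href(old_html: str, url: str) -> str:
    """Swap the href of the ', in <a ... title="Vezi toate' link for url."""
    # Group 1 and group 2 keep the text around the href value unchanged.
    pattern = r'(, in <a href=")[^"]*(" title="Vezi toate)'
    # A function replacement is inserted literally, with no escape processing.
    return re.sub(pattern, lambda m: m.group(1) + url + m.group(2), old_html)

# Hypothetical line from the old (Romanian) file.
old_html = (
    ', in <a href="https://example.com/old.html" '
    'title="Vezi toate articolele">Leadership Pro</a>'
)

new_html = replace_href(
    old_html, 'https://neculaifantanaru.com/en/leadership-pro.html'
)
print(new_html)
```

For ordinary URLs the f-string version behaves the same; the function form only matters when the inserted text may contain backslashes.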
RE: Python: Regex is not good for re.search (AttributeError: 'NoneType' object has no att - snippsat - Jun-28-2023

As advised, no regex 🔨 with HTML/XML.

    from bs4 import BeautifulSoup

    with open('file.html') as file:
        first_code = file.read()
    with open('old-file.html') as file:
        second_code = file.read()

    # Take the href from the first <a> tag in the new file
    new_soup = BeautifulSoup(first_code, 'html.parser')
    new_href = new_soup.find('a')['href']

    # Set it on the first <a> tag in the old file
    old_soup = BeautifulSoup(second_code, 'html.parser')
    old_soup.find('a')['href'] = new_href

    with open('old-file.html', 'w') as file:
        file.write(old_soup.prettify())
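
If installing BeautifulSoup is not an option, the same parser-based idea can be sketched with the standard library's `html.parser` module instead. This is a different tool from the one used in the post above, and the class name `FirstLinkParser` and the shortened sample `html` are assumptions for illustration only.

```python
from html.parser import HTMLParser

class FirstLinkParser(HTMLParser):
    """Remembers the first href value found on an <a> tag."""

    def __init__(self):
        super().__init__()
        self.href = None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        if tag == 'a' and self.href is None:
            self.href = dict(attrs).get('href')

# Shortened stand-in for the two lines from the question (assumed sample data).
html = (
    '<div class="color-black mt-lg-0" id="hidden">, in</div>\n'
    '<a href="https://neculaifantanaru.com/en/leadership-pro.html" '
    'title="View all articles from Leadership Pro">Leadership Pro</a>\n'
)

parser = FirstLinkParser()
parser.feed(html)
print(parser.href)
```

This only extracts the link; for rewriting and saving a whole document, BeautifulSoup remains the more convenient choice.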
RE: Python: Regex is not good for re.search (AttributeError: 'NoneType' object has no att - Melcu54 - Jun-28-2023

Thank you very much.

RE: Python: Regex is not good for re.search (AttributeError: 'NoneType' object has no att - Melcu54 - Jun-28-2023

SOLUTION 1:

FIND:

    b_content = re.search('^\s*<a href="(.*?)" title="View', new_file_content).group(1)

REPLACE:

    old_file_content = re.sub(', in <a href="(.*?)" title="Vezi', f', in <a href="{b_content}" title="Vezi', old_file_content)

SOLUTION 2:

FIND:

    b_content = re.match(r'^\s*<a href="(.*?)" title="View', new_file_content).group(1)

REPLACE:

    old_file_content = re.sub(', in <a href="(.*?)" title="Vezi', f', in <a href="{b_content}" title="Vezi', old_file_content)

SOLUTION 3:

    import re

    b_content = re.match(r'^\s*<a href="(.*?)" title="View', new_file_content)
    if b_content is not None:
        b_content = b_content.group(1)
    else:
        b_content = "No match found"

SOLUTION 4:

    import re

    match = re.search('^\s*<a href="(.*?)" title="View', new_file_content)
    if match is not None:
        b_content = match.group(1)
        old_file_content = re.sub(', in <a href="(.*?)" title="Vezi', f', in <a href="{b_content}" title="Vezi', old_file_content)
    else:
        print("No match found")

SOLUTION 5 (use re.MULTILINE):

    import re

    match = re.search('^\s*<a href="(.*?)" title="View', new_file_content, re.MULTILINE)
    if match is not None:
        b_content = match.group(1)
        old_file_content = re.sub(', in <a href="([^"]*)" title="Vezi', f', in <a href="{b_content}" title="Vezi', old_file_content)
    else:
        print("No match found")