string parsing with re.search()

string parsing with re.search() - delahug - Jun-03-2020

hi,

I've been trying to parse a string using the re.search() function, but I run into trouble when it encounters ½ (the single character for one half)...

l = re.search(r'[[]',str(viola.text),re.I).start()+1

UnicodeEncodeError: 'ascii' codec can't encode character u'\xbd' in position 1: ordinal not in range(128)

How should I proceed here, or is there another way to do the string parsing?

thanks

RE: string parsing with re.search() - Gribouillis - Jun-03-2020

Make sure you are using python 3.

RE: string parsing with re.search() - snippsat - Jun-03-2020

In Python 3 the u'' prefix has no effect, because every str is already Unicode, so follow the advice above.

# Python 3.8
>>> s = u'\xbd' 
>>> s
'½'

# The u prefix can be removed; it makes no difference
>>> s = '\xbd' 
>>> s
'½'

# Python 2.7
>>> s = u'\xbd' 
>>> s
u'\xbd'
>>> s.encode()
Traceback (most recent call last):
  File "<interactive input>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xbd' in position 0: ordinal not in range(128)

# Try the obvious one first
>>> s.encode('utf-8')
'\xc2\xbd'
>>> print(s.encode('utf-8'))
Â½

# Make a guess
>>> print(s.encode('latin-1'))
½

One of the biggest changes in moving to Python 3 was better Unicode handling.
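
For anyone reading this on Python 3, here is a minimal sketch of what the session above is showing. The variable names are only illustrative. It assumes the 'Â½' output came from UTF-8 bytes being displayed by a terminal that interprets them as Latin-1, which is consistent with the latin-1 line printing correctly.

```python
# Python 3: every str is Unicode; encoding is always an explicit step.
s = "\xbd"  # the single character '½' (U+00BD)

utf8_bytes = s.encode("utf-8")      # two bytes: 0xC2 0xBD
latin1_bytes = s.encode("latin-1")  # one byte: 0xBD

# Reading the UTF-8 bytes as if they were Latin-1 gives the 'Â½' mojibake.
mojibake = utf8_bytes.decode("latin-1")

# ASCII has no code for '½', so this is the same failure as in the traceback.
try:
    s.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

print(utf8_bytes, latin1_bytes, mojibake, ascii_ok)
```

The point is that the character itself is fine; the error only appears when something tries to squeeze it into an encoding that cannot represent it.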

RE: string parsing with re.search() - delahug - Jun-03-2020

Thanks for your help.

But I don't see where this fits into my code.

Specifically what I am looking at is this:


<span class="rp-horseTable__pos__length">
<span>2½</span>
<span>[5½]</span>
</span>


I want what's in the square brackets, within the second nested span.
If I grab the whole lot by referencing the span class, I then run into the problem above when using re.search() on the square bracket. It's caused (apparently) by the fraction in the first span.

Can I get at the second span directly?

thanks

RE: string parsing with re.search() - snippsat - Jun-03-2020

That is HTML, so you should not be parsing it with regex anyway; use an HTML parser instead.

from bs4 import BeautifulSoup

html = '''\
<span class="rp-horseTable__pos__length">
<span>2½</span>
<span>[5½]</span>
</span>'''

soup = BeautifulSoup(html, 'lxml')

Usage:

>>> tag = soup.select_one('span > span:nth-child(2)')
>>> tag
<span>[5½]</span>
>>> tag.text
'[5½]'

Here the second span tag is found directly with a CSS selector.
Once .text has been called the parser has done its job, so a regex can now be used to get what's inside the square brackets.

>>> import re
>>> 
>>> r = re.search(r"\[(.*)\]", tag.text)
>>> r.group(1)
'5½'
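
A small variation on the regex step above, as a sketch only: the helper name inside_brackets is made up for illustration. It uses a non-greedy group so that a string with more than one bracketed part returns just the first, and it returns None instead of raising when there are no brackets.

```python
import re


def inside_brackets(text):
    """Return the text between the first '[' and the next ']', or None."""
    # .*? is non-greedy: it stops at the first closing bracket it can.
    match = re.search(r"\[(.*?)\]", text)
    return match.group(1) if match else None


print(inside_brackets("[5½]"))
print(inside_brackets("2½"))
```

With the greedy pattern from the session, a string such as "[a] and [b]" would capture everything up to the last bracket; whether that matters depends on the data being scraped.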

In larger code you may want to first match, e.g., the class name and then do what is posted above.
Alternatively, find_all() can be used as another approach.

>>> tag = soup.find(class_="rp-horseTable__pos__length")
>>> tag
<span class="rp-horseTable__pos__length">
<span>2½</span>
<span>[5½]</span>
</span>

>>> tag.find_all('span')
[<span>2½</span>, <span>[5½]</span>]
>>> tag.find_all('span')[1]
<span>[5½]</span>
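
The find()/find_all() steps above could be wrapped up roughly like this. It is a sketch under a few assumptions: the function name second_span_text is hypothetical, the built-in 'html.parser' backend is swapped in for 'lxml' so that only bs4 itself is required, and the brackets are trimmed with str.strip() rather than a regex.

```python
from bs4 import BeautifulSoup


def second_span_text(html, css_class="rp-horseTable__pos__length"):
    """Return the text of the second nested span without its brackets, or None."""
    soup = BeautifulSoup(html, "html.parser")
    outer = soup.find(class_=css_class)
    if outer is None:
        return None  # the outer span is not on this page
    spans = outer.find_all("span")
    if len(spans) < 2:
        return None  # the bracketed value is missing
    return spans[1].get_text(strip=True).strip("[]")


html = '''\
<span class="rp-horseTable__pos__length">
<span>2½</span>
<span>[5½]</span>
</span>'''

print(second_span_text(html))
```

Returning None for the missing cases is just one choice; real pages may need different handling.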

RE: string parsing with re.search() - Gribouillis - Jun-04-2020

delahug Wrote:I then run into the problem above when using re.search() on the square bracket. It's caused (apparently) by the fraction in the first span.

If you are running Python 2.7, the problem is not caused by the fraction itself. It is caused by the str() call, which implicitly tries to encode the unicode string with the ASCII codec, and the fraction character cannot be encoded that way because it is not an ASCII character. In Python 3 there would be no such problem, because str() doesn't try to encode the Unicode string.

>>> # python 2.7
>>> text = u"\xbd"
>>> str(text)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xbd' in position 0: ordinal not in range(128)

You could perhaps try a unicode regex like u"[[]" and remove the call to str(), or, better, switch to Python 3, because Python 2 is no longer supported.
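
As a rough Python 3 sketch of the original indexing approach with the str() call removed: the sample string stands in for viola.text, whose real contents are not shown in this thread, and the pattern is written as \[ which matches the same literal bracket as [[].

```python
import re

# Hypothetical stand-in for viola.text; in Python 3 this is already a str.
text = "2½ [5½]"

# No str() call and no encoding step: search the Unicode text directly.
start = re.search(r"\[", text).start() + 1  # index just after '['
end = text.index("]", start)                # index of the closing ']'
inside = text[start:end]

print(inside)
```

If the text might contain no bracket at all, re.search() returns None, so a real script would want to check for that before calling .start().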

RE: string parsing with re.search() - delahug - Jun-04-2020

Thank you, sir!

Taking out str() gets me going forward again...

As for the Python version: I installed an Anaconda environment (is that what it's called?!), so I have the version of Python (2.something) that came with it...

RE: string parsing with re.search() - Gribouillis - Jun-04-2020

delahug Wrote:I have the version of Python (2.something) which came with this

I think Anaconda can use Python 3. Carrying on with Python 2 exposes your code to a myriad of tiny issues like this one that simply don't exist in Python 3. I cannot stress enough how absurd it is to write code in an obsolete version of the language.

RE: string parsing with re.search() - snippsat - Jun-04-2020

(Jun-04-2020, 09:39 AM)delahug Wrote: I installed an Anaconda environment (is that what it's called?!)

When you install, choose the Python 3.7 version of Anaconda.

To run the parsing code I posted with BeautifulSoup and lxml, nothing extra needs to be installed, as Anaconda comes with both pre-installed.
You can check Anaconda's package list to see what is included.
See also: Anaconda and other ways to run Python

RE: string parsing with re.search() - delahug - Jun-04-2020

Thanks for this. Apologies that I appeared to overlook your previous reply - I didn't notice it because there was another after it.