Encoding issue #157

Closed
TheoLechemia opened this issue Aug 7, 2018 · 4 comments
TheoLechemia commented Aug 7, 2018

Hello,

I'm trying to create a shapefile with non-ASCII characters ('é', 'à', etc.).
When I pass only ASCII characters to the record() method, everything is fine, but when I pass non-ASCII characters, the data are totally mixed up (values don't correspond to their columns, and some values spill over into other columns).

Here in the screenshot, the "nom_valide" column should contain only strings, and the numbers at the beginning of that column should be in the "cd_ref" column...
[Screenshot: columnx_mixed]

I thought I had to encode my data myself, but I saw in the code that it's already done:

def b(v):
    if PYTHON3:
        if isinstance(v, str):
            # For python 3 encode str to bytes.
            return v.encode('utf-8')
        elif isinstance(v, bytes):
            # Already bytes.
            return v
        else:
            # Error.
            raise Exception('Unknown input type')
    else:
        # For python 2 assume str passed in and return str.
        return v

When I pass already encoded data (bytes in UTF-8), everything works, but all the values in the data columns are prefixed with a "u"... because it's encoded twice...

I also saw in the docs that we can pass the encoding to the Writer class, but I think version 1.2.12 doesn't have this feature yet.

I'm using pyshp 1.2.12 and Python 3; my data come from SQLAlchemy and are already in UTF-8.

Any help?

Thanks a lot

@karimbahgat
Collaborator

Hi @TheoLechemia.
Yes, manually specifying the encoding is only available in v2.x, which I would recommend switching to if possible.
If you are stuck with v1.2.x, it does automatically encode the data to UTF-8, as you say. But I'm not sure why it would mix up the columns like that when you pass unicode text.
I'm also not sure why it prefixes values with 'u' when you pass in already encoded bytes. There should not be any double encoding, since it only encodes strings if they are unicode and leaves them as is if they are already encoded to bytes.
Could you share a snippet of the code you use for creating the shapefile?

@MichalTorma

I stumbled upon the same issue. It seems that every non-ASCII character (such as 'å') takes the place of 2 characters, and this leaks into the following records...
Here is some sample code:

import shapefile

w = shapefile.Writer(shapeType=0)
w.field("_id", 'N')
w.field("name", 'C', size=10)

w.record(_id = 0, name="xxx")
w.record(_id = 1, name="åxx")
w.record(_id = 2, name="xxx")

w.save('/tmp/test')
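
The size mismatch behind this symptom can be seen in plain Python, without pyshp. The following is only a minimal sketch, assuming UTF-8 encoding (the encoding used by b() in the snippet quoted earlier); the variable names are illustrative and not part of pyshp.

```python
# Compare the character count with the UTF-8 byte count for the sample values.
names = ["xxx", "åxx"]

for name in names:
    encoded = name.encode("utf-8")
    # 'å' is one character but two bytes in UTF-8, so the two lengths differ.
    print(repr(name), "chars:", len(name), "bytes:", len(encoded))

char_len = len("åxx")                   # length counted in characters
byte_len = len("åxx".encode("utf-8"))   # length counted in bytes
```

A fixed-width text field is sized in bytes, so a value measured in characters can end up one byte longer than the field for every 2-byte character it contains.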

karimbahgat added a commit that referenced this issue Sep 8, 2018
- Fixes an issue in Py3 where text values were meant to be converted to byte strings but were converted to unicode instead, because the code used the Py2-specific str() function rather than the version-neutral b(). When the text contains non-ASCII characters that take 2 bytes, this truncates by unicode length instead of byte length, which results in incorrectly padded byte lengths and data values ending up in the wrong field/column. See #157, and also #148.
- Also bump to next version.
@karimbahgat
Collaborator

That's correct. Looking further into it, it seems the old code attempted to convert the value to a byte string before truncating and padding it to the correct text size. However, it used the Py2-specific str() function instead of b(), so in Py3 the value was converted to unicode before truncating. This led to incorrect truncation and byte lengths, and thus column underflows etc.
Fixed now in fc630bb, and tested locally that it works in Py2 and Py3. Can any of you download the latest 1.2.x branch and check that it works now? I will release a bugfix version once you confirm.
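
To illustrate the difference between the two orderings, here is a rough sketch. It is not the actual pyshp code: the helper names pad_then_encode and encode_then_pad are hypothetical, and the field size of 10 is taken from the sample code earlier in the thread.

```python
FIELD_SIZE = 10  # assumed width, matching size=10 in the sample above


def pad_then_encode(value, size=FIELD_SIZE):
    """Problematic ordering: truncate and pad by character count, then encode."""
    return value[:size].ljust(size).encode("utf-8")


def encode_then_pad(value, size=FIELD_SIZE):
    """Corrected ordering: encode first, then truncate and pad by byte count.

    Note: cutting at a byte boundary can split a multi-byte character;
    this sketch does not handle that case.
    """
    return value.encode("utf-8")[:size].ljust(size)


buggy = pad_then_encode("åxx")
fixed = encode_then_pad("åxx")
print(len(buggy), len(fixed))
```

With the first ordering, the record for "åxx" is one byte longer than the field, so every following value is shifted, which matches the mixed-up columns reported above. With the second ordering, each value occupies exactly the field width in bytes.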

@MichalTorma

That worked nicely and swiftly, good job 😄
