Encoding issue #157

TheoLechemia · 2018-08-07T17:23:48Z

Hello,

I'm trying to create a shapefile with non-ASCII characters ('é', 'à', etc.).
When I pass only ASCII characters to the record() method, everything is fine, but when I pass non-ASCII characters, the data are totally mixed up (values do not correspond to their columns, and some values spill over into other columns).

Here in the screenshot, the column "nom_valide" should contain only strings, and the numbers at the beginning of that column should be in the "cd_ref" column...

I thought I had to encode my data myself, but I saw in the code that this is already done...

def b(v):
    if PYTHON3:
        if isinstance(v, str):
            # For python 3 encode str to bytes.
            return v.encode('utf-8')
        elif isinstance(v, bytes):
            # Already bytes.
            return v
        else:
            # Error.
            raise Exception('Unknown input type')
    else:
        # For python 2 assume str passed in and return str.
        return v

When I pass already encoded data (bytes in UTF-8), everything works, but all the data columns are prefixed with a "u"... because it's encoded twice...

I also saw in the doc that we can pass the encoding to the Writer class, but I think the 1.2.12 version I have doesn't have this feature yet.

I'm using pyshp 1.2.12 and Python 3; my data come from SQLAlchemy and are already in UTF-8.

Any help?

Thanks a lot


karimbahgat · 2018-08-28T18:59:48Z

Hi @TheoLechemia.
Yes, manually specifying the encoding is only available in v2.x, which I would recommend switching to if possible.
If you are stuck with v1.2.x, it does automatically encode the data to UTF-8, as you say. But I'm not sure why it would mix up the columns like that when you pass unicode text.
I'm also not sure why it prefixes with 'u' when you pass in already encoded bytes. There should not be any double-encoding, since it only encodes the strings if they are unicode, and leaves them as is if they are already encoded to bytes.
Could you share a snippet of the code you use for creating the shapefile?

MichalTorma · 2018-09-04T18:20:44Z

I stumbled upon the same issue. It seems that every non-ASCII Unicode character takes up the space of 2 characters, and the overflow leaks into the following lines...
Here is some sample code:

import shapefile  # pyshp 1.2.x

w = shapefile.Writer(shapeType=0)
w.field("_id", 'N')
w.field("name", 'C', size=10)

w.record(_id = 0, name="xxx")
# The non-ASCII 'å' in this record shifts the data that follows it.
w.record(_id = 1, name="åxx")
w.record(_id = 2, name="xxx")

w.save('/tmp/test')
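To make the "2 characters" observation concrete, here is a minimal sketch in plain Python 3 that does not use pyshp at all. It compares the character count of each sample name with its UTF-8 byte count. The helper name `utf8_widths` is hypothetical, and 'å' is written as the escape `\u00e5` so that the precomposed code point is used.

```python
# Names taken from the sample records above; "\u00e5" is the precomposed 'å'.
names = ["xxx", "\u00e5xx", "xxx"]

def utf8_widths(text):
    """Return (character count, UTF-8 byte count) for text."""
    return len(text), len(text.encode("utf-8"))

for name in names:
    chars, nbytes = utf8_widths(name)
    print(f"{name!r}: {chars} characters, {nbytes} bytes")
```

The second name has 3 characters but 4 bytes, so any fixed-width layout computed from the character count ends up one byte too long for that record.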

- Fixes an issue in Py3 where text that should be converted to byte strings was converted to unicode instead, because the code used the Py2-specific str() function instead of the version-neutral b(). When the text contains non-ASCII 2-byte unicode values, this truncates by unicode length instead of byte length, which results in incorrectly padded byte lengths and data values ending up in the wrong field/column. See #157, and also #148.
- Also bumps to the next version.

karimbahgat · 2018-09-08T20:32:26Z

That's correct. Looking further into it, it seems the old code attempted to convert to a byte string before truncating and padding to the correct text size. However, it used the Py2-specific str() function instead of b(), so in Py3 the value was converted to unicode before truncating, leading to incorrect truncation and byte lengths, and thus column underflows etc.
Fixed now in fc630bb, and tested locally that it works in Py2 and 3. Can any of you download the latest 1.2.x branch and check that it works now? I will release a bugfix version once you confirm.
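The ordering problem described above can be sketched in a few lines of Python 3. This is an illustration only, not the actual pyshp code: both function names are hypothetical, and the field size of 10 is taken from the sample code earlier in the thread. The first function sizes the value by character count and encodes afterwards; the second encodes first and then truncates and pads the bytes.

```python
FIELD_SIZE = 10  # assumed: matches size=10 of the "name" field in the sample

def pack_chars_then_encode(value, size=FIELD_SIZE):
    """Buggy order: truncate and pad by character count, then encode."""
    return value[:size].ljust(size).encode("utf-8")

def pack_encode_then_pad(value, size=FIELD_SIZE):
    """Corrected order: encode first, then truncate and pad by byte count."""
    return value.encode("utf-8")[:size].ljust(size)

for name in ["xxx", "\u00e5xx"]:
    buggy = pack_chars_then_encode(name)
    fixed = pack_encode_then_pad(name)
    print(f"{name!r}: buggy={len(buggy)} bytes, fixed={len(fixed)} bytes")
```

With the buggy order, the non-ASCII name produces 11 bytes for a 10-byte field, so every value after it is shifted by one byte. One caveat of this simple sketch: truncating at the byte level can cut a multi-byte character in half when a value is longer than the field.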

MichalTorma · 2018-09-08T23:29:50Z

That worked nicely and swiftly, good job 😄

karimbahgat added the bug label Aug 28, 2018

karimbahgat closed this as completed Jan 26, 2022
