BUG: read_sql_query duplicates column names in cells in pandas v2.0.0 #52437


Open
raj-patra opened this issue Apr 5, 2023 · 20 comments
Labels: Bug, IO SQL (to_sql, read_sql, read_sql_query)

Comments

@raj-patra

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
import pymssql

conn = pymssql.connect(user=user, password=password, host=host, database=db, as_dict=True, autocommit=True)

result_set = pd.read_sql_query("select * from table", conn)
print(result_set.head())

Issue Description

In the latest version of pandas (v2.0.0), the read_sql_query function does not return correct data.

I am using a pymssql connection, and when executing a query the result set is fetched and the number of rows is intact, but every cell contains its column name instead of the actual value.

The functionality works as expected in v1.5.3.

The result set looks like this with v2.0.0 (Unexpected Behavior)

Column 1 Column 2 Column 3
Column 1 Column 2 Column 3
Column 1 Column 2 Column 3
Column 1 Column 2 Column 3

Expected Behavior

Column 1 Column 2 Column 3
1 4 7
2 5 8
3 6 9

Installed Versions

INSTALLED VERSIONS

commit : 478d340
python : 3.10.5.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.22000
machine : AMD64
processor : Intel64 Family 6 Model 142 Stepping 12, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : English_India.1252

pandas : 2.0.0
numpy : 1.24.2
pytz : 2022.7.1
dateutil : 2.8.2
setuptools : 58.1.0
pip : 23.0.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : 3.0.9
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : None
pandas_datareader: None
bs4 : 4.12.0
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : 3.1.2
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : 2.0.1
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None

@raj-patra added the Bug and Needs Triage labels on Apr 5, 2023
@raj-patra changed the title from "BUG: read_sql_query duplicates column names in cells" to "BUG: read_sql_query duplicates column names in cells in pandas v2.0.0" on Apr 5, 2023
@mroeschke
Member

Could you give a reproducible example using a sqlite ":memory:" database?

@mroeschke added the IO SQL and Needs Info labels and removed the Needs Triage label on Apr 5, 2023
@raj-patra
Author

Looks like the bug is not reproducible with sqlite connections.
I tested with a sqlite in-memory database, and the result was the expected behavior.

My guess is that any DBAPI2 connection other than sqlite will show the same unexpected behavior.

import sqlite3
import pandas as pd

conn = sqlite3.connect('file:cachedb?mode=memory&cache=shared', uri=True)  # uri=True so the shared in-memory URI is honored

df = pd.DataFrame([{"A":1,"B":2}, {"A":3,"B":4}, {"A":1,"B":2}, {"A":3,"B":4}])
print(df)

df.to_sql("test", conn, if_exists="replace")

resp = pd.read_sql("select * from test", conn)
print(resp)

@phofl
Member

phofl commented Apr 9, 2023

Can you also provide a reproducible example for mysql/postgres?

@raj-patra
Author

raj-patra commented Apr 10, 2023

Okay, so after an eternity of digging in, I found out that the issue is with the as_dict=True parameter of the pymssql package.

With the following connection, pandas v2.0.0 gives the unexpected results described above.

conn = pymssql.connect(user=user, password=password, host=host, database=db, as_dict=True, autocommit=True)

With the following connection, pandas v2.0.0 gives the expected results.

conn = pymssql.connect(user=user, password=password, host=host, database=db, autocommit=True)

I have no clue why that would affect the result set, but now we know exactly what the issue is.

I will leave it up to the maintainers to decide whether the pandas code needs a change, or whether the documentation should mention that as_dict=True leads to unexpected results.

I rest my case. ✌️

@raj-patra
Author

I can add a few more examples of what works and what does not:

WORKS

conn = pymssql.connect(user=user, password=password, host=host, database=db, as_dict=True, autocommit=True)
with conn.cursor() as cursor:
    cursor.execute(QUERY)
    res = pd.DataFrame(cursor.fetchall())

conn = pymssql.connect(user=user, password=password, host=host, database=db, autocommit=True)
res = pd.read_sql_query(QUERY, conn)

DOESN'T WORK

conn = pymssql.connect(user=user, password=password, host=host, database=db, autocommit=True)
with conn.cursor() as cursor:
    cursor.execute(QUERY)
    res = pd.DataFrame(cursor.fetchall())

conn = pymssql.connect(user=user, password=password, host=host, database=db, as_dict=True, autocommit=True)
res = pd.read_sql_query(QUERY, conn)

@raj-patra
Author

@phofl @mroeschke
Any updates on this?

Is the team going to look at this issue?
Let me know if any more clarification is needed.

@phofl
Member

phofl commented Apr 13, 2023

Investigations are welcome

@raj-patra
Author

raj-patra commented Apr 13, 2023

@phofl

Upon some investigation, I found out that the bug might be in the following function:

def _wrap_result(

In v1.5.x, the _wrap_result function wrapped the list of dicts directly into a DataFrame with the from_records function:

frame = DataFrame.from_records(data, columns=columns, coerce_float=coerce_float)

But in v2.0.0 it calls a newly added function, _convert_arrays_to_dataframe, in the following line:

frame = _convert_arrays_to_dataframe(data, columns, coerce_float, dtype_backend)

This function uses the lib.to_object_array_tuples function, whose documentation says:
"Convert a list of tuples into an object array. Any subclass of tuple in rows will be casted to tuple."

content = lib.to_object_array_tuples(data)

What it fails to consider is that the data this lib function receives might not be a list of tuples; it could be a list of dicts. Therefore, it returns incorrect data when it is passed the list of dicts that comes from cursor.fetchall() here:

result = cur.fetchall()

I am almost certain this occurs in the SQLiteDatabase class, which is also the fallback when a SQLAlchemy connectable is not used (for example, DBAPI2 connections whose result sets are lists of dicts).

Let me know whether this paints a correct picture, since I have never done this before. I hope this helps. 🙂
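
A minimal sketch of why dict rows end up as column names (plain Python, no database needed; the data below is made up). Converting each row to a tuple, which is roughly what the tuple-oriented path does, iterates the dict and yields its keys rather than its values:

import pandas as pd

# Rows as a DBAPI2 dict cursor would return them (hypothetical data).
rows = [{"A": 1, "B": 2}, {"A": 3, "B": 4}]

# Iterating a dict yields its keys, so tuple(row) drops the values entirely.
print([tuple(row) for row in rows])  # [('A', 'B'), ('A', 'B')]

# This mirrors the v2.0.0 symptom: the column names repeated in every cell.
print(pd.DataFrame([tuple(row) for row in rows], columns=["A", "B"]))

# DataFrame.from_records handles dict rows correctly, as the v1.5.x path did.
print(pd.DataFrame.from_records(rows))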

@pwillen

pwillen commented Aug 11, 2023

Hi, just wanted to let you know this is still an issue with sqlite as well. I usually use a row factory to return each row as a dict.
Reproducible example using a sqlite ":memory:" database:

import sqlite3
from typing import Any

import pandas as pd


def sqlite3_factory(cursor: Any, row: Any) -> Any:
    """ Convert SQL into a Dict rather than Tuple
    :param cursor:
    :param row:
    :return:
    """
    d = {}
    for idx, col in enumerate(cursor.description):
        d[col[0]] = row[idx]
    return d


conn = sqlite3.connect(':memory:')
conn.row_factory = sqlite3_factory

my_dict = [{'name': 'foo'}, {'name': 'bar'}]

df = pd.DataFrame(my_dict)
df.to_sql(name='example', con=conn, if_exists='replace', index=False)
df = pd.read_sql_query("SELECT * FROM example", conn)

Expected Output:

name
foo
bar

Actual Output:

name
name
name

Note

Using sqlite3.Row is an option, but it loses some dictionary functionality.
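
For anyone who mainly needs key access on individual rows, a sketch of that sqlite3.Row route (assuming its tuple-style indexing is enough for read_sql_query's path; the table and data are just an example):

import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # rows act like tuples and also allow row["col"] access

pd.DataFrame([{"name": "foo"}, {"name": "bar"}]).to_sql(
    name="example", con=conn, if_exists="replace", index=False
)

# Because sqlite3.Row rows index positionally, the values come through, not the keys.
print(pd.read_sql_query("SELECT * FROM example", conn))

# Key access still works when fetching rows directly.
row = conn.execute("SELECT * FROM example").fetchone()
print(row["name"])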

@macllc2021

Has this been resolved? I stumbled upon the same problem this morning.

@fabbber

fabbber commented Jun 27, 2024

Still open? I'm facing this issue as well.

@samueldy

samueldy commented Jun 29, 2024

I want to report that I am currently facing the same issue with libsqlite=3.46.0 and pandas=2.2.2 from the conda-forge channel.

@asishm
Contributor

asishm commented Jun 29, 2024

@fabbber @samueldy Are you both using cursors/connections that return dicts? Can you please confirm whether you're using the stdlib sqlite3 library for the connection or a SQLAlchemy string/connectable? The docs mention:

Using SQLAlchemy makes it possible to use any DB supported by that library. If a DBAPI2 object, only sqlite3 is supported.

If not, please create a separate issue. This issue is still open; PRs to fix it are welcome.

see also: #53028

@samueldy

samueldy commented Jul 1, 2024

My use cases are all with the standard sqlite3 library (import sqlite3) and with pd.read_sql. I am operating with a sqlite3 database stored on disk and am not using pd.read_sql_table or SQLAlchemy.

@asishm
Contributor

asishm commented Jul 1, 2024

Are you using a dict factory/cursor with the sqlite connection? If not, can you produce a copy-pastable example that reproduces what you're seeing? @samueldy

@samueldy

samueldy commented Jul 2, 2024

Yes, now that I look at it, using a dict factory does produce the error. I'm wrapping sqlite3.connect like this:

import sqlite3

# Convenience context handler for operations on the database.
# Return records as dictionaries.
# Thanks to https://stackoverflow.com/a/3300514


def dict_factory(cursor, row):
    d = {}
    for idx, col in enumerate(cursor.description):
        d[col[0]] = row[idx]
    return d


class SqliteDatabase:
    def __init__(self, db_path: str):
        self.db_path = db_path

        self.connection = sqlite3.connect(database=db_path)
        self.connection.row_factory = dict_factory
        self.cursor = self.connection.cursor()

    def __enter__(self):
        return self.connection, self.cursor

    def __exit__(self, exc_type, exc_val, exc_tb):
        self.connection.commit()
        self.connection.close()

then fetch data like this:

with SqliteDatabase(db_path=my_db_path) as (
    conn,
    cur,
):
    ase_interface.show_entire_df(
        pd.read_sql("SELECT * from STOP ORDER BY StopID DESC LIMIT 5", con=conn)
    )

With the connection row factory enabled, I get the column names as values:

StopID Name DivisionID
0 StopID Name DivisionID
1 StopID Name DivisionID

Omitting the self.connection.row_factory = dict_factory line above (or using the same code with pandas=1.1.3) gives the expected result:

StopID Name DivisionID
0 2 Octave 4 2
1 1 Flute 8 1

@samueldy

samueldy commented Jul 2, 2024

Yet with the row factory enabled, pandas=2.2.2 still returns the correct results via cursor.fetchall:

with SqliteDatabase(db_path=my_db_path) as (
    conn,
    cur,
):
    cur.execute("SELECT * from STOP ORDER BY StopID DESC LIMIT 5")
    print(cur.fetchall())
[{'StopID': 2, 'Name': 'Octave 4', 'DivisionID': 2}, {'StopID': 1, 'Name': 'Flute 8', 'DivisionID': 1}]

@asishm
Contributor

asishm commented Jul 4, 2024

@samueldy Since you're already using a dict factory, you could replace your pd.read_sql call with pd.DataFrame.from_records(cur.fetchall())

The change was introduced in #50048, which changed

frame = DataFrame.from_records(data, columns=columns, coerce_float=coerce_float)

to

frame = _convert_arrays_to_dataframe(data, columns, coerce_float, dtype_backend)

which assumes a list of tuples (and iterating over a dict gives you the keys instead of the values).

One way to address this might be to revert to from_records and add frame.convert_dtypes(dtype_backend=...) to keep the dtype_backend enhancement, but that would likely be less performant.

Another alternative could be checking for a list of dicts and handling that case, but a row_factory can theoretically return anything. Maybe we restrict the assumption to lists of tuples/dicts and state that in the docs; a self-contained sketch of the dict check follows.
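
A minimal sketch of that dict-check idea (frame_from_rows is illustrative only, not the actual pandas internals):

import pandas as pd


def frame_from_rows(rows, columns):
    # Hypothetical dispatch: handle dict rows before the tuple-only path.
    if rows and isinstance(rows[0], dict):
        # Dict rows: read values by key; iterating the dict would yield only the keys.
        return pd.DataFrame.from_records(rows, columns=columns)
    # Tuple-like rows: positional construction, as the current code assumes.
    return pd.DataFrame(rows, columns=columns)


cols = ["StopID", "Name"]
tuple_rows = [(2, "Octave 4"), (1, "Flute 8")]
dict_rows = [{"StopID": 2, "Name": "Octave 4"}, {"StopID": 1, "Name": "Flute 8"}]

# Both row shapes now produce the same frame instead of repeating the column names.
print(frame_from_rows(tuple_rows, cols))
print(frame_from_rows(dict_rows, cols))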

@phofl if you have any thoughts

@samueldy

samueldy commented Jul 5, 2024

Sounds good. As an end user I have no issue with pd.read_sql/pd.read_sql_query assuming a list of tuples as long as it's in the docs.

@asishm
Contributor

asishm commented Jul 5, 2024

To be a bit more specific, the current behavior only applies to DBAPI2 connections used directly. If a SQLAlchemy connection/string is used, it follows a different path (sketch below).
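
For anyone hitting this, a sketch of routing the same query through SQLAlchemy instead of a raw DBAPI2 connection (the connection strings and table name below are placeholders):

import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string; adjust driver/credentials for your setup,
# e.g. for SQL Server via pymssql: create_engine("mssql+pymssql://user:password@host/db")
engine = create_engine("sqlite:///my_database.db")

# With a SQLAlchemy engine or connectable, read_sql_query takes the SQLAlchemy path,
# so the dict-row issue described above does not apply.
df = pd.read_sql_query("SELECT * FROM example", con=engine)
print(df.head())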

@mroeschke mroeschke removed the Needs Info Clarification about behavior needed to assess issue label Aug 26, 2024