BUG: read_sql_query duplicates column names in cells in pandas v2.0.0 #52437
Could you give a reproducible example using sqlite? |
Looks like the bug is not reproducible with sqlite connections. My guess is that any DBAPI2 connection other than sqlite will have the same unexpected behavior.
|
You can also provide a reproducible example for mysql/postgres |
Okay, so after an eternity of digging in, I found out that the issue is with the arguments passed when creating the connection.

With the following connection, pandas v2.0.0 gives the unexpected results shown in the issue description:

    conn = pymssql.connect(user=user, password=password, host=host, database=db, as_dict=True, autocommit=True)

With the following connection, pandas v2.0.0 gives the expected results:

    conn = pymssql.connect(user=user, password=password, host=host, database=db, autocommit=True)

I have no clue as to why that might affect the result set, but now we know exactly what triggers the issue. I will leave it up to the maintainers to decide whether to change the pandas code or mention this somewhere in the documentation. I rest my case. ✌️ |
I can add a few more examples of what works and what does not.

WORKS

    conn = pymssql.connect(user=user, password=password, host=host, database=db, as_dict=True, autocommit=True)
    with conn.cursor() as cursor:
        cursor.execute(QUERY)
        res = pd.DataFrame(cursor.fetchall())

    conn = pymssql.connect(user=user, password=password, host=host, database=db, autocommit=True)
    res = pd.read_sql_query(QUERY, conn)

DOESN'T WORK

    conn = pymssql.connect(user=user, password=password, host=host, database=db, autocommit=True)
    with conn.cursor() as cursor:
        cursor.execute(QUERY)
        res = pd.DataFrame(cursor.fetchall())

    conn = pymssql.connect(user=user, password=password, host=host, database=db, as_dict=True, autocommit=True)
    res = pd.read_sql_query(QUERY, conn) |
@phofl @mroeschke Is the team gonna look at the issue? |
Investigations are welcome |
Upon some investigation, I found out that the bug might be in the following function: Line 166 in bd5ed2f
In version v1.5.x, the Line 146 in 2e218d1
But with version v2.0.0 it calls a newly added function, Line 176 in bd5ed2f
This function uses Line 148 in bd5ed2f
What it fails to consider is that the data this lib function receives might not be a list of tuples but could instead be a list of dicts. Therefore, it fails to return correct data when it is passed a list of dicts that comes from the Line 2303 in bd5ed2f
I am almost certain this occurs in the Let me know if this paints a correct picture because I have never done this before. I hope this helps. 🙂 |
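The failure mode described in this comment can be shown without pandas. The sketch below is not the pandas internals: `rows_to_columns` is a hypothetical stand-in for any helper that transposes rows by iterating over each one, which is only correct when every row is a tuple. The sample rows are invented for illustration.

```python
def rows_to_columns(rows):
    """Transpose row-wise records into column-wise lists.

    Hypothetical helper: it iterates over each row, so it silently
    assumes that iterating a row yields that row's values.
    """
    return [list(col) for col in zip(*rows)]


# Rows as a plain DBAPI2 cursor would return them.
tuple_rows = [(1, "a"), (2, "b")]
# The same rows as a dict-returning cursor would return them.
dict_rows = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]

cols_from_tuples = rows_to_columns(tuple_rows)
cols_from_dicts = rows_to_columns(dict_rows)

print(cols_from_tuples)  # values, as intended
print(cols_from_dicts)   # keys repeated once per row
```

Iterating a dict yields its keys, so the second call produces the column names in every cell while keeping the row count intact, which matches the symptom reported in this issue.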
Hi, just wanted to let you know this is still an issue with sqlite as well. I usually use a row factory to return each row as a dict.
Expected Output:
Actual Output:
Note Using sqlite3.Row is an option but loses dictionary functionality |
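For reference, a minimal sketch of how `sqlite3.Row` behaves, using an invented in-memory table (the `stop` table and its columns are assumptions for illustration only). Iterating a `Row` yields values rather than keys, so it looks like a tuple to code that iterates rows, while still allowing lookup by column name and conversion to a real dict when needed.

```python
import sqlite3

# In-memory database with an illustrative table.
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.execute("CREATE TABLE stop (StopID INTEGER, Name TEXT)")
conn.execute("INSERT INTO stop VALUES (1, 'Main St')")

row = conn.execute("SELECT StopID, Name FROM stop").fetchone()

values = tuple(row)    # iterating a Row yields the values
by_name = row["Name"]  # lookup by column name still works
as_dict = dict(row)    # explicit conversion when a real dict is needed

conn.close()
```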
Has this been resolved? Stumbled upon the same problem this morning. |
Still open? I'm facing an issue with this as well |
Want to report that I am currently facing the same issue with |
@fabbber @samueldy Are you both using cursors/connections that return dicts? Can you please confirm if you're using stdlib
If not, then please create a separate issue. The issue is still open. PRs welcome to fix. see also: #53028 |
My use cases are all with the standard sqlite3 library ( |
Are you using a dict factory/cursor with the sqlite connection? If not, can you produce a copy-pastable example that reproduces what you're seeing? @samueldy |
Yes, now that I look at it, using a dict factory does produce the error. I'm wrapping the connection in a class like this:

    import sqlite3

    import pandas as pd

    # Convenience context handler for operations on the database
    # Return records as dictionaries.
    # Thanks to https://stackoverflow.com/a/3300514
    def dict_factory(cursor, row):
        d = {}
        for idx, col in enumerate(cursor.description):
            d[col[0]] = row[idx]
        return d

    class SqliteDatabase:
        def __init__(self, db_path: str):
            self.db_path = db_path
            self.connection = sqlite3.connect(database=db_path)
            self.connection.row_factory = dict_factory
            self.cursor = self.connection.cursor()

        def __enter__(self):
            return self.connection, self.cursor

        def __exit__(self, exc_type, exc_val, exc_tb):
            self.connection.commit()
            self.connection.close()

then fetch data like this:

    with SqliteDatabase(db_path=my_db_path) as (
        conn,
        cur,
    ):
        ase_interface.show_entire_df(
            pd.read_sql("SELECT * from STOP ORDER BY StopID DESC LIMIT 5", con=conn)
        )

With the connection row factory enabled, I get the column names as values:
Omitting the
|
Yet having the row factory enabled in

    with SqliteDatabase(db_path=my_db_path) as (
        conn,
        cur,
    ):
        cur.execute("SELECT * from STOP ORDER BY StopID DESC LIMIT 5")
        print(cur.fetchall())
|
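A self-contained sketch of what the cursor returns in each configuration, using an in-memory database so it can be run anywhere. The `STOP` table contents and the `fetch_rows` helper are assumptions made up for this illustration; only `dict_factory` follows the code posted above.

```python
import sqlite3


def dict_factory(cursor, row):
    # Map each column name from the cursor description to its value.
    return {col[0]: row[idx] for idx, col in enumerate(cursor.description)}


def fetch_rows(use_dict_factory):
    """Run the same query with or without the dict row factory (hypothetical helper)."""
    conn = sqlite3.connect(":memory:")
    if use_dict_factory:
        conn.row_factory = dict_factory
    cur = conn.cursor()
    cur.execute("CREATE TABLE STOP (StopID INTEGER, Name TEXT)")
    cur.executemany("INSERT INTO STOP VALUES (?, ?)", [(1, "A"), (2, "B")])
    cur.execute("SELECT * FROM STOP ORDER BY StopID DESC LIMIT 5")
    rows = cur.fetchall()
    conn.close()
    return rows


tuple_rows = fetch_rows(use_dict_factory=False)  # list of tuples
dict_rows = fetch_rows(use_dict_factory=True)    # list of dicts

print(tuple_rows)
print(dict_rows)
```

The data is the same in both cases; only the row type differs, and that row type is what any downstream code iterating over the rows ends up seeing.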
@samueldy Since you're already using a dict factory, you could replace your The change was introduced in #50048 changing
which assumes a list of tuples (and iterating over a dict gives you the keys instead of the values). One way to address this might be to revert back and add Another alternative could be checking for list of dicts and handling those, but you could theoretically replace the row_factory with anything. Maybe we restrict to assuming list of tuples/dicts and state that in the doc @phofl if you have any thoughts |
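The "check for a list of dicts and handle those" alternative could look roughly like the sketch below. This is not pandas code: `normalize_rows` is a hypothetical helper, and it assumes the column names are already known (for example from `cursor.description`). Anything that is not a mapping is assumed to be tuple-like.

```python
from collections.abc import Mapping


def normalize_rows(rows, columns):
    """Convert rows to tuples ordered by `columns` (hypothetical helper).

    Mapping rows are looked up by column name; every other row type is
    assumed to iterate over its values in column order.
    """
    normalized = []
    for row in rows:
        if isinstance(row, Mapping):
            normalized.append(tuple(row[col] for col in columns))
        else:
            normalized.append(tuple(row))
    return normalized


columns = ["id", "name"]
from_dicts = normalize_rows([{"id": 1, "name": "a"}, {"id": 2, "name": "b"}], columns)
from_tuples = normalize_rows([(1, "a"), (2, "b")], columns)

print(from_dicts)
print(from_tuples)
```

As noted above, a row factory can return anything, so a check like this only covers the tuple and dict cases and would need to be documented as such.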
Sounds good. As an end user I have no issue with |
To be a bit more specific, the current behavior only applies to DBAPI2 connections used directly. If a sqlalchemy connection/string is used, then it follows a different path. |
Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
Issue Description
In the latest version of pandas (v2.0.0), the read_sql_query function does not seem to work. I am using a pymssql connection, and when executing a query, the result set is fetched and the number of rows is intact, but the column names are duplicated in the cell values. The functionality works as expected in version v1.5.3.
The result set looks like this with v2.0.0 (Unexpected Behavior)
Expected Behavior
Installed Versions
INSTALLED VERSIONS
commit : 478d340
python : 3.10.5.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.22000
machine : AMD64
processor : Intel64 Family 6 Model 142 Stepping 12, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : English_India.1252
pandas : 2.0.0
numpy : 1.24.2
pytz : 2022.7.1
dateutil : 2.8.2
setuptools : 58.1.0
pip : 23.0.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : 3.0.9
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : None
pandas_datareader: None
bs4 : 4.12.0
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : 3.1.2
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : 2.0.1
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None