The Wayback Machine - https://web.archive.org/web/20201030024745/https://github.com/pandas-dev/pandas/issues/36688

ENH: Change Pandas User-Agent and add possibility to set custom http_headers to pd.read_* functions #36688

Open
AstroMatt opened this issue Sep 27, 2020 · 5 comments · May be fixed by #36754

@AstroMatt AstroMatt commented Sep 27, 2020

Currently Pandas makes HTTP requests using "Python-urllib/3.8" as the User-Agent.
This prevents downloading some resources and static files from various sites.
What if Pandas made requests with a "Pandas/1.1.0" User-Agent instead?
It should also be possible to add custom headers (auth, CSRF tokens, API versions and so on).
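
To see where a value like "Python-urllib/3.8" comes from, here is a minimal sketch using only the standard library. It reads the default headers that urllib attaches to a freshly built opener. The helper name default_urllib_user_agent is made up for this illustration and is not part of pandas or urllib.

```python
import urllib.request


def default_urllib_user_agent() -> str:
    """Return the User-Agent urllib sends when the caller sets none."""
    opener = urllib.request.build_opener()
    # An opener keeps its default headers as a list of (name, value) pairs.
    default_headers = dict(opener.addheaders)
    return default_headers["User-agent"]


# The version suffix follows the running interpreter, e.g. 3.8 on Python 3.8.
print(default_urllib_user_agent())
```

Any server that filters on this string will reject requests made through a plain urlopen(url) call, which is the situation described here.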

Use Case:

I am writing a book on Pandas:

I published data in CSV and JSON to use in code listings:

You can access those resources via a browser, curl, or even requests, but not using Pandas.
The only change needed is to set the User-Agent.
This is because readthedocs.io blocks the "Python-urllib/3.8" User-Agent for whatever reason.
The same problem affects many other places where you can get data (not only readthedocs.io).

Currently I get those resources with requests and then pass response.text to one of:

  • pd.read_csv
  • pd.read_json
  • pd.read_html

Unfortunately this makes even the simplest code listings... quite complex (because I have to explain the requests library and why I do it this way).

Pandas uses urllib.request.urlopen, which does not allow setting http_headers when it is given a plain URL:
https://github.com/pandas-dev/pandas/blob/master/pandas/io/common.py#L146

However, urllib.request.urlopen can also take a urllib.request.Request as an argument,
and a urllib.request.Request object can carry custom http_headers:
https://docs.python.org/3/library/urllib.request.html#urllib.request.Request
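
The Request-based approach described above can be sketched with the standard library alone. The helper names build_request and open_with_headers and the example.com URL are placeholders for illustration; this is not how pandas is implemented, only a demonstration that urlopen accepts a Request carrying headers.

```python
import urllib.request


def build_request(url, headers):
    """Wrap a URL in a Request object that carries custom HTTP headers."""
    return urllib.request.Request(url, headers=headers)


def open_with_headers(url, headers, timeout=30):
    """Open the URL with the given headers; returns a file-like HTTP response."""
    return urllib.request.urlopen(build_request(url, headers), timeout=timeout)


# Building the Request does not touch the network.
req = build_request("https://example.com/data.csv", {"User-Agent": "Pandas/1.1.0"})
# urllib normalizes header names with str.capitalize(), hence "User-agent".
print(req.get_header("User-agent"))
```

The bytes read from open_with_headers(...) could then be wrapped in io.BytesIO and handed to pd.read_csv, which accepts file-like objects, so the user-facing change would mostly be a matter of threading the headers through to this call.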

It should be possible to pass custom http_headers to the pd.read_csv, pd.read_json and pd.read_html functions.

From what I see, the read_* call stack is three to four functions deep.
There are only 6 references in 4 files to the urlopen(*args, **kwargs) function.
So the change shouldn't be hard to implement.

The http_headers parameter can be Optional[List], which would be fully backward compatible and would not require any changes to others' code.

@AstroMatt AstroMatt changed the title from "ENH: Change Pandas `User-Agent` and add possibility to set custom `http_headers` to `pd.read_*` functions" to "ENH: Change Pandas User-Agent and add possibility to set custom http_headers to pd.read_* functions" Sep 27, 2020

@jreback jreback commented Sep 27, 2020

we have had this request before

pls search for these issues


@AstroMatt AstroMatt commented Sep 27, 2020

Related to #10526

@Antetokounpo Antetokounpo linked a pull request that will close this issue Sep 30, 2020
4 of 5 tasks complete
@jreback jreback added IO Data and removed Needs Triage labels Oct 2, 2020

@jreback jreback commented Oct 2, 2020

@martindurant can we pass these thru using StorageOptions?


@martindurant martindurant commented Oct 6, 2020

HTTP is the only one of the "protocol://" URL types that is not handled by fsspec, because it already had its own code (whereas s3fs and gcs were already using fsspec second-hand).

For HTTPFileSystem, you can include headers as a key in client_kwargs, which could contain your custom user agent or anything else you want. That would look a little bit untidy, but OK

storage_options={"client_kwargs": {"headers": {"User-Agent": "pandas"}}}
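
As a rough sketch of how that dictionary might be used, the example below goes through fsspec directly and hands the opened file to pandas. It assumes fsspec is installed with HTTP support (its HTTPFileSystem needs aiohttp) and does not assume that pd.read_csv accepts storage_options for HTTP URLs, since that is not established in this thread. The helper names http_storage_options and read_csv_over_http are hypothetical.

```python
def http_storage_options(user_agent):
    """Build the nested options dict suggested above for HTTPFileSystem."""
    return {"client_kwargs": {"headers": {"User-Agent": user_agent}}}


def read_csv_over_http(url, user_agent="pandas"):
    """Hypothetical helper: open the URL through fsspec, then parse with pandas."""
    # Imported lazily so that building the options dict needs neither package.
    import fsspec
    import pandas as pd

    # Extra keyword arguments to fsspec.open are forwarded to the filesystem class.
    with fsspec.open(url, mode="rb", **http_storage_options(user_agent)) as f:
        return pd.read_csv(f)


print(http_storage_options("pandas"))
```

Wrapping the nesting in a small helper keeps code listings short, which speaks to the "a little bit untidy" concern, at the cost of one more function to explain.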

@jreback jreback commented Oct 6, 2020

ok, I think a PR to add an example in read_csv / io.rst would be sufficient then

@AstroMatt if u are interested

@jreback jreback added the Docs label Oct 6, 2020
@jreback jreback added this to the Contributions Welcome milestone Oct 6, 2020