Twython unnecessarily disables compression. #310

Merged
Lukasa merged 3 commits from master into master 2014-02-25 12:35:58 -08:00
Lukasa commented 2014-02-23 07:10:14 -08:00 (Migrated from github.com)

While doing performance profiling as mentioned in lukasa/hyper#19, I stumbled upon the fact that Twython appears to send the header Accept-Encoding: identity (or leaves the header out altogether, it's not entirely clear, but they're semantically the same thing). This means that downloading the last 800 @-mentions to myself involves downloading 2.2MB of uncompressed JSON.

It seems to be trivial to turn this on by passing the appropriate header (Accept-Encoding: gzip, deflate) in client_args. More importantly though, this should be on by default when using requests. I've looked through the code and it doesn't seem to be deliberately turned off, so it looks like it's probably an accident.

From some code reading, the problem looks like it's around here. In this block of code you end up ensuring that Twython.client_args always contains a headers key. Subsequently, in this block, you ensure that Session.headers is overwritten with your new headers dictionary. This overwriting gets rid of the default headers (set here), which include the Accept-Encoding line.

I have two suggested fixes:

  1. Note that the headers will always be present and to treat them specially, by merging them instead of overwriting them.
  2. Add Accept-Encoding to your default headers dict.

Sample code to reproduce the problem (this is set up for using mitmproxy to examine the HTTP headers: if you don't want to, remove the client_args):

from twython import Twython

client_args = {
    'proxies': {
        'http': 'http://localhost:8080',
        'https': 'http://localhost:8080',
    },
    'verify': False,

# Uncomment this section to get the improved behaviour.
#    'headers': {
#        'Accept-Encoding': 'gzip, deflate',
    }
}

# Fill these in as needed.
API_KEY = ''
API_SECRET = ''
ACCESS_TOKEN = ''
ACCESS_SECRET = ''

t = Twython(
    API_KEY, 
    API_SECRET, 
    ACCESS_TOKEN, 
    ACCESS_SECRET, 
    client_args=client_args,
)

# Let's get all my mentions. First, get the initial set and save off the
# smallest ID found.
mentions = []
chunk = t.get_mentions_timeline(count=200, include_rts=1)
max_id = chunk[-1]['id']
mentions += chunk

# Now, keep looping until we reach the end or we've reached 800 tweets.
while len(chunk) > 180 and len(mentions) < 800:
    chunk = t.get_mentions_timeline(count=200, include_rts=1, max_id=max_id)
    max_id = chunk[-1]['id']
    mentions += chunk

# Print all the messages.
print("Messages: %d" % (len(mentions),))
print("Execution time: %f\n\n" % (end-start,))
While doing performance profiling as mentioned in lukasa/hyper#19, I stumbled upon the fact that Twython appears to send the header `Accept-Encoding: identity` (or leaves the header out altogether, it's not entirely clear, but they're semantically the same thing). This means that downloading the last 800 @-mentions to myself involves downloading 2.2MB of uncompressed JSON. It seems to be trivial to turn this on by passing the appropriate header (`Accept-Encoding: gzip, deflate`) in `client_args`. More importantly though, this should be on by default when using requests. I've looked through the code and it doesn't seem to be deliberately turned off, so it looks like it's probably an accident. From some code reading, the problem looks like it's around [here](https://github.com/ryanmcgrath/twython/blob/master/twython/api.py#L76-L83). In this block of code you end up ensuring that `Twython.client_args` always contains a `headers` key. Subsequently, in [this block](https://github.com/ryanmcgrath/twython/blob/master/twython/api.py#L110-L114), you ensure that `Session.headers` is _overwritten_ with your new `headers` dictionary. This overwriting gets rid of the default headers (set [here](https://github.com/kennethreitz/requests/blob/master/requests/sessions.py#L208)), which include the `Accept-Encoding` line. I have two suggested fixes: 1. Note that the headers will always be present and to treat them specially, by merging them instead of overwriting them. 2. Add `Accept-Encoding` to your default headers dict. Sample code to reproduce the problem (this is set up for using `mitmproxy` to examine the HTTP headers: if you don't want to, remove the `client_args`): ``` python from twython import Twython client_args = { 'proxies': { 'http': 'http://localhost:8080', 'https': 'http://localhost:8080', }, 'verify': False, # Uncomment this section to get the improved behaviour. # 'headers': { # 'Accept-Encoding': 'gzip, deflate', } } # Fill these in as needed. API_KEY = '' API_SECRET = '' ACCESS_TOKEN = '' ACCESS_SECRET = '' t = Twython( API_KEY, API_SECRET, ACCESS_TOKEN, ACCESS_SECRET, client_args=client_args, ) # Let's get all my mentions. First, get the initial set and save off the # smallest ID found. mentions = [] chunk = t.get_mentions_timeline(count=200, include_rts=1) max_id = chunk[-1]['id'] mentions += chunk # Now, keep looping until we reach the end or we've reached 800 tweets. while len(chunk) > 180 and len(mentions) < 800: chunk = t.get_mentions_timeline(count=200, include_rts=1, max_id=max_id) max_id = chunk[-1]['id'] mentions += chunk # Print all the messages. print("Messages: %d" % (len(mentions),)) print("Execution time: %f\n\n" % (end-start,)) ```
ryanmcgrath commented 2014-02-23 05:36:52 -08:00 (Migrated from github.com)

Oh wow, this is a pretty nice find - I would not have caught this. @michaelhelmick is more active with the merging at the moment, so I'm gonna leave this for him - just couldn't help but comment. :)

Oh wow, this is a pretty nice find - I would not have caught this. @michaelhelmick is more active with the merging at the moment, so I'm gonna leave this for him - just couldn't help but comment. :)
keonne commented 2014-02-23 06:46:11 -08:00 (Migrated from github.com)

Also wanted to pop in and comment - very nice find and fixes Cory

On 2/23/14 8:18 AM, Cory Benfield wrote:

While doing performance profiling as mentioned in Lukasa/hyper#19
https://github.com/Lukasa/hyper/issues/19, I stumbled upon the fact
that Twython appears to send the header |Accept-Encoding: identity|
(or leaves the header out altogether, it's not entirely clear, but
they're semantically the same thing). This means that downloading the
last 800 @-mentions to myself involves downloading 2.2MB of
uncompressed JSON.

It seems to be trivial to turn this on by passing the appropriate
header (|Accept-Encoding: gzip, deflate|) in |client_args|. More
importantly though, this should be on by default when using requests.
I've looked through the code and it doesn't seem to be deliberately
turned off, so it looks like it's probably an accident.

From some code reading, the problem looks like it's around here
https://github.com/ryanmcgrath/twython/blob/master/twython/api.py#L76-L83.
In this block of code you end up ensuring that |Twython.client_args|
always contains a |headers| key. Subsequently, in this block
https://github.com/ryanmcgrath/twython/blob/master/twython/api.py#L110-L114,
you ensure that |Session.headers| is /overwritten/ with your new
|headers| dictionary. This overwriting gets rid of the default headers
(set here
https://github.com/kennethreitz/requests/blob/master/requests/sessions.py#L208),
which include the |Accept-Encoding| line.

I have two suggested fixes:

  1. Note that the headers will always be present and to treat them
    specially, by merging them instead of overwriting them.
  2. Add |Accept-Encoding| to your default headers dict.

Sample code to reproduce the problem (this is set up for using
|mitmproxy| to examine the HTTP headers: if you don't want to, remove
the |client_args|):

from twython import Twython

client_args = {
'proxies': {
'http': 'http://localhost:8080',
'https': 'http://localhost:8080',
},
'verify': False,

Uncomment this section to get the improved behaviour.

'headers': {

'Accept-Encoding': 'gzip, deflate',

}

}

Fill these in as needed.

API_KEY = ''
API_SECRET = ''
ACCESS_TOKEN = ''
ACCESS_SECRET = ''

t = Twython(
API_KEY,
API_SECRET,
ACCESS_TOKEN,
ACCESS_SECRET,
client_args=client_args,
)

Let's get all my mentions. First, get the initial set and save off the

smallest ID found.

mentions = []
chunk = t.get_mentions_timeline(count=200, include_rts=1)
max_id = chunk[-1]['id']
mentions += chunk

Now, keep looping until we reach the end or we've reached 800 tweets.

while len(chunk) > 180 and len(mentions) < 800:
chunk = t.get_mentions_timeline(count=200, include_rts=1, max_id=max_id)
max_id = chunk[-1]['id']
mentions += chunk

Print all the messages.

print("Messages: %d" % (len(mentions),))
print("Execution time: %f\n\n" % (end-start,))


Reply to this email directly or view it on GitHub
https://github.com/ryanmcgrath/twython/issues/310.

Also wanted to pop in and comment - very nice find and fixes Cory On 2/23/14 8:18 AM, Cory Benfield wrote: > While doing performance profiling as mentioned in Lukasa/hyper#19 > https://github.com/Lukasa/hyper/issues/19, I stumbled upon the fact > that Twython appears to send the header |Accept-Encoding: identity| > (or leaves the header out altogether, it's not entirely clear, but > they're semantically the same thing). This means that downloading the > last 800 @-mentions to myself involves downloading 2.2MB of > uncompressed JSON. > > It seems to be trivial to turn this on by passing the appropriate > header (|Accept-Encoding: gzip, deflate|) in |client_args|. More > importantly though, this should be on by default when using requests. > I've looked through the code and it doesn't seem to be deliberately > turned off, so it looks like it's probably an accident. > > From some code reading, the problem looks like it's around here > https://github.com/ryanmcgrath/twython/blob/master/twython/api.py#L76-L83. > In this block of code you end up ensuring that |Twython.client_args| > always contains a |headers| key. Subsequently, in this block > https://github.com/ryanmcgrath/twython/blob/master/twython/api.py#L110-L114, > you ensure that |Session.headers| is /overwritten/ with your new > |headers| dictionary. This overwriting gets rid of the default headers > (set here > https://github.com/kennethreitz/requests/blob/master/requests/sessions.py#L208), > which include the |Accept-Encoding| line. > > I have two suggested fixes: > 1. Note that the headers will always be present and to treat them > specially, by merging them instead of overwriting them. > 2. Add |Accept-Encoding| to your default headers dict. > > Sample code to reproduce the problem (this is set up for using > |mitmproxy| to examine the HTTP headers: if you don't want to, remove > the |client_args|): > > from twython import Twython > > client_args = { > 'proxies': { > 'http': 'http://localhost:8080', > 'https': 'http://localhost:8080', > }, > 'verify': False, > > # Uncomment this section to get the improved behaviour. > > # 'headers': { > > # 'Accept-Encoding': 'gzip, deflate', > > ``` > } > ``` > > } > > # Fill these in as needed. > > API_KEY = '' > API_SECRET = '' > ACCESS_TOKEN = '' > ACCESS_SECRET = '' > > t = Twython( > API_KEY, > API_SECRET, > ACCESS_TOKEN, > ACCESS_SECRET, > client_args=client_args, > ) > > # Let's get all my mentions. First, get the initial set and save off the > > # smallest ID found. > > mentions = [] > chunk = t.get_mentions_timeline(count=200, include_rts=1) > max_id = chunk[-1]['id'] > mentions += chunk > > # Now, keep looping until we reach the end or we've reached 800 tweets. > > while len(chunk) > 180 and len(mentions) < 800: > chunk = t.get_mentions_timeline(count=200, include_rts=1, max_id=max_id) > max_id = chunk[-1]['id'] > mentions += chunk > > # Print all the messages. > > print("Messages: %d" % (len(mentions),)) > print("Execution time: %f\n\n" % (end-start,)) > > — > Reply to this email directly or view it on GitHub > https://github.com/ryanmcgrath/twython/issues/310.
Lukasa commented 2014-02-23 07:11:44 -08:00 (Migrated from github.com)

Thanks guys!

I've turned this into a pull request implementing the first of my possible fixes. Whether you end up wanting this solution is up to you guys.

I also want to note that the test is a little bit brittle: it's highly likely that a future version of requests will change the Accept-Encoding header it sends (see kennethreitz/requests#1916), at which point the test will break.

Thanks guys! I've turned this into a pull request implementing the first of my possible fixes. Whether you end up wanting this solution is up to you guys. I also want to note that the test is a little bit brittle: it's highly likely that a future version of requests will change the Accept-Encoding header it sends (see kennethreitz/requests#1916), at which point the test will break.
michaelhelmick commented 2014-02-25 12:35:54 -08:00 (Migrated from github.com)

Very nice catch @Lukasa :D

I'll hopefully push a new Twython release within the next week! Thanks again!

Very nice catch @Lukasa :D I'll hopefully push a new Twython release within the next week! Thanks again!
Sign in to join this conversation.
No reviewers
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference: code/twython#310
No description provided.