The Wayback Machine - https://web.archive.org/web/20200913022538/https://github.com/zotero/translation-server/issues/10
Google book link only gets partial data #10

Closed
mvolz opened this issue Jun 14, 2018 · 16 comments


@mvolz mvolz commented Jun 14, 2018

curl -d '{ "query": "http://books.google.de/books?hl=en&lr=&id=Ct6FKwHhBSQC&oi=fnd&pg=PP9" }' -H 'Content-Type: application/json' http://127.0.0.1:1969/search
Internal Server Error

node src/server.js

(3)(+0000000): Translators initialized with 523 loaded

(3)(+0000006): Listening on 0.0.0.0:1969

(3)(+0052583): HTTP GET http://books.google.de/books?hl=en&lr=&id=Ct6FKwHhBSQC&oi=fnd&pg=PP9

(1)(+0000203): Error: read ECONNRESET

Error: read ECONNRESET
    at _errnoException (util.js:1024:11)
    at TCP.onread (net.js:615:25)

InternalServerError: An error occurred retrieving the document

  at Object.throw (/home/marielle/Code/translation-server-v2/node_modules/koa/lib/context.js:93:11)
  at Object.handleURL (/home/marielle/Code/translation-server-v2/src/endpoints.js:164:23)
  at <anonymous>
  at process._tickCallback (internal/process/next_tick.js:188:7)

http://books.google.de/books?hl=en&lr=&id=Ct6FKwHhBSQC&oi=fnd&pg=PP9&dq=%2522Peggy+Eaton%2522&ots=KN-Z0-HAcv&sig=snBNf7bilHi9GFH4-6-3s1ySI9Q&redir_esc=y#v=onepage&q=%2522Peggy%2520Eaton%2522&f=false

Member

@dstillman dstillman commented Jun 29, 2018

This works for me. It looks like this may be either a networking issue on your end or Google just blocking your IP.

@mvolz changed the title from “Google book link causes internal server error” to “Google book link gets poor metadata” on Jun 29, 2018
Author

@mvolz mvolz commented Jun 29, 2018

It no longer causes an internal server error with the update; might have to do with the xpath stuff being fixed. But now it does this:

[{"key":"3SBPBISF","version":0,"itemType":"book","creators":[],"tags":[],"title":"Some American Ladies","url":"https://books.google.com/books/about/Some_American_Ladies.html?id=Ct6FKwHhBSQC","libraryCatalog":"books.google.de","accessDate":"CURRENT_TIMESTAMP"}]

(4)(+0000000): Translate: Parsing code for Google Books (3e684d82-73a3-9a34-095f-19b112d88bbf, 2017-12-03 04:20:33)

(3)(+0000001): Translate: Beginning translation with Google Books

(3)(+0000001): Translate: resolving URL //books.google.com/books/feeds/volumes/Ct6FKwHhBSQC

(3)(+0000000): Translate: resolved to http://books.google.com/books/feeds/volumes/Ct6FKwHhBSQC

(3)(+0000001): Zotero.HTTP.doGet is deprecated. Use Zotero.HTTP.request

(3)(+0000000): HTTP GET http://books.google.com/books/feeds/volumes/Ct6FKwHhBSQC

(3)(+0000706): TypeError: Cannot read property 'textContent' of undefined

TypeError: Cannot read property 'textContent' of undefined
    at parseXML (eval at <anonymous> (/home/marielle/Code/translation-server-v2/src/translation/sandboxManager.js:65:4), <anonymous>:144:41)
    at /home/marielle/Code/translation-server-v2/modules/zotero/chrome/content/zotero/xpcom/utilities_translate.js:331:5
    at <anonymous>
    at process._tickCallback (internal/process/next_tick.js:188:7)

(2)(+0000000): Translate: Translation using Google Books failed:
TypeError: Cannot read property 'textContent' of undefined

TypeError: Cannot read property 'textContent' of undefined
    at parseXML (eval at <anonymous> (/home/marielle/Code/translation-server-v2/src/translation/sandboxManager.js:65:4), <anonymous>:144:41)
    at /home/marielle/Code/translation-server-v2/modules/zotero/chrome/content/zotero/xpcom/utilities_translate.js:331:5
    at <anonymous>
    at process._tickCallback (internal/process/next_tick.js:188:7)
url => http://books.google.de/books?hl=en&lr=&id=Ct6FKwHhBSQC&oi=fnd&pg=PP9

(5)(+0000000): Translate: Running handler 0 for error

(1)(+0000000): Translation using Google Books failed

(1)(+0000000): TypeError: Cannot read property 'textContent' of undefined

TypeError: Cannot read property 'textContent' of undefined
    at parseXML (eval at <anonymous> (/home/marielle/Code/translation-server-v2/src/translation/sandboxManager.js:65:4), <anonymous>:144:41)
    at /home/marielle/Code/translation-server-v2/modules/zotero/chrome/content/zotero/xpcom/utilities_translate.js:331:5
    at <anonymous>
    at process._tickCallback (internal/process/next_tick.js:188:7)
@mvolz changed the title from “Google book link gets poor metadata” to “Google book link causes "Cannot read property 'textContent' of undefined at parseXML" error” on Jun 29, 2018
Member

@dstillman dstillman commented Jun 29, 2018

Member

@dstillman dstillman commented Jun 29, 2018

Also, have you changed the User-Agent setting? Google will almost certainly block you with a non-browser User-Agent.

Author

@mvolz mvolz commented Jun 29, 2018

Yes, and no, just left the default string in there.

The fact that it is getting the title (Some American Ladies) and that Embedded Metadata runs okay does seem to indicate that it's scraping the page (i.e. not an IP block), just that it doesn't have the XML for some reason.

(I didn't paste the success part of the output, here's the rest:)

(4)(+0000000): Translate: Parsing code for Embedded Metadata (951c027d-74ac-47d4-a107-9c3069ab7b48, 2018-02-13 19:20:46)

(3)(+0000002): Translate: Beginning translation with Embedded Metadata

(3)(+0000001): Translate: Embedded Metadata: found 7 meta tags.

(3)(+0000000): Translate: Creating translate instance of type import in sandbox

(4)(+0000000): Translate: Binding sandbox to http://books.google.de/books?hl=en&lr=&id=Ct6FKwHhBSQC&oi=fnd&pg=PP9

(4)(+0000001): Translate: Parsing code for RDF (5e3ad958-ac79-463d-812b-a86a9235c28f, 2018-05-08 19:39:38)

(3)(+0000001): Translate: Initializing RDF data store

(3)(+0000006): Translate: Promise not available in sandbox in _itemDone()

(3)(+0000000): Translate: Saving item

(5)(+0000000): Translate: Running handler 0 for itemDone

(3)(+0000009): Translate: Looking for authors in byline, vcard

(3)(+0000004): Translate: Found 0 elements with 'byline' class

(3)(+0000001): Translate: Found 0 elements with 'vcard' class

(3)(+0000001): Translate: No byline found.

(3)(+0000001): Translate: Promise not available in sandbox in _itemDone()

(3)(+0000000): Translate: Saving item

(3)(+0000000): Translate: Translation successful

(5)(+0000001): Translate: Running handler 0 for done

(3)(+0000000): itemToAPIJSON: Discarded field publicationTitle: field not valid for type book

Member

@dstillman dstillman commented Jun 29, 2018

I think you'll need to add some Zotero.debug() lines to see what it's getting instead of the XML. Google tends to lock down its data exports (BibTeX, etc.) more than its webpages, so it's totally possible that the XML is blocked for some reason.

Member

@dstillman dstillman commented Jun 29, 2018

Oh, wait, I seem to be getting the same error now. We'll look into it.

Member

@dstillman dstillman commented Jun 29, 2018

And now it's working for me again.

Is this failing for you consistently? If you delete package-lock.json and run npm i again, and then start the server again, does it still happen?

If so, can you add Zotero.debug(xmlhttp); above the processor(xmlhttp.responseText, xmlhttp, url); line in modules/zotero/chrome/content/zotero/xpcom/utilities_translate.js (around line 311) and see what it shows?

Author

@mvolz mvolz commented Jun 29, 2018

I'm getting it consistently, here's the output of the debug line:

		if(processor) {
			Zotero.debug(xmlhttp);

(3)(+0000314): {
    "responseText": "<?xml version='1.0' encoding='UTF-8'?><entry xmlns='http://www.w3.org/2005/Atom' xmlns:gbs='http://schemas.google.com/books/2008' xmlns:gd='http://schemas.google.com/g/2005' xmlns:batch='http://schemas.google.com/gdata/batch' xmlns:dc='http://purl.org/dc/terms'><id>http://www.google.com/books/feeds/volumes/Ct6FKwHhBSQC</id><updated>2018-06-29T15:51:03.000Z</updated><category scheme='http://schemas.google.com/g/2005#kind' term='http://schemas.google.com/books/2008#volume'/><title type='text'>Some American Ladies</title><link rel='http://schemas.google.com/books/2008/thumbnail' type='image/x-unknown' href='http://books.google.com/books/content?id=Ct6FKwHhBSQC&amp;printsec=frontcover&amp;img=1&amp;zoom=5&amp;imgtk=AFLRE737vH96m2qfjPFR3Xtz7zCkaITQxZtgGD5qYI6pij6vTEO7JUOUg0crywriKISD9LvKiOPaeQIRvia33nYCYkFWr6pDyzjZfiftIV8Sh-uxtspeyeCnTuOnypWpSV_EmVCrRqvX&amp;source=gbs_gdata'/><link rel='http://schemas.google.com/books/2008/info' type='text/html' href='http://books.google.com/books?id=Ct6FKwHhBSQC&amp;source=gbs_gdata'/><link rel='http://schemas.google.com/books/2008/annotation' type='application/atom+xml' href='http://www.google.com/books/feeds/users/me/volumes'/><link rel='alternate' type='text/html' href='http://books.google.com/books?id=Ct6FKwHhBSQC'/><link rel='self' type='application/atom+xml' href='http://www.google.com/books/feeds/volumes/Ct6FKwHhBSQC'/><gbs:contentVersion>2.1.0.0.preview.0</gbs:contentVersion><gbs:embeddability value='http://schemas.google.com/books/2008#not_embeddable'/><gbs:openAccess value='http://schemas.google.com/books/2008#disabled'/><gbs:viewability value='http://schemas.google.com/books/2008#view_no_pages'/><dc:creator>Meade Minnigerode</dc:creator><dc:date>1926</dc:date><dc:format>Dimensions 14.6x24.0x2.7 cm</dc:format><dc:format>332 
pages</dc:format><dc:format>book</dc:format><dc:identifier>Ct6FKwHhBSQC</dc:identifier><dc:identifier>ISBN:0836913620</dc:identifier><dc:identifier>ISBN:9780836913620</dc:identifier><dc:language>en</dc:language><dc:publisher>G.P. Putnam's Sons</dc:publisher><dc:subject>Biography &amp; Autobiography / Women</dc:subject><dc:title>Some American Ladies</dc:title><dc:title>Seven Informal Biographies ...</dc:title></entry>"
    "headers": {
        "content-type": "application/atom+xml; charset=UTF-8"
        "expires": "Fri, 29 Jun 2018 15:51:03 GMT"
        "date": "Fri, 29 Jun 2018 15:51:03 GMT"
        "cache-control": "private, max-age=0, must-revalidate, no-transform"
        "vary": "Accept, X-GData-Authorization, GData-Version"
        "gdata-version": "1.0"
        "last-modified": "Fri, 29 Jun 2018 15:51:03 GMT"
        "transfer-encoding": "chunked"
        "x-content-type-options": "nosniff"
        "x-frame-options": "SAMEORIGIN"
        "x-xss-protection": "1; mode=block"
        "server": "GSE"
        "connection": "close"
    }
    "statusCode": 200
}

(3)(+0000012): TypeError: Cannot read property 'textContent' of undefined
Member

@adomasven adomasven commented Jul 2, 2018

"responseText": "http://www.google.com/books/feeds/volumes/Ct6FKwHhBSQC2018-06-29T15:51:03.000Z<title type='text'>Some American Ladies</title>gbs:contentVersion2.1.0.0.preview.0</gbs:contentVersion><gbs:embeddability value='http://schemas.google.com/books/2008#not_embeddable'/><gbs:openAccess value='http://schemas.google.com/books/2008#disabled'/><gbs:viewability value='http://schemas.google.com/books/2008#view_no_pages'/>dc:creatorMeade Minnigerode</dc:creator>dc:date1926</dc:date>dc:formatDimensions 14.6x24.0x2.7 cm</dc:format>dc:format332 pages</dc:format>dc:formatbook</dc:format>dc:identifierCt6FKwHhBSQC</dc:identifier>dc:identifierISBN:0836913620</dc:identifier>dc:identifierISBN:9780836913620</dc:identifier>dc:languageen</dc:language>dc:publisherG.P. Putnam's Sons</dc:publisher>dc:subjectBiography & Autobiography / Women</dc:subject>dc:titleSome American Ladies</dc:title>dc:titleSeven Informal Biographies ...</dc:title>"

For some reason you are receiving a very wrong looking version of the item metadata. Does translation with the old server work well?

Update: Maybe not. I had to look at the source of your comment; some XML tags got swallowed by the Markdown interpreter.
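
The swallowed tags in the quote above are what happens when raw XML is pasted into a comment outside a code block: the renderer treats `<entry>`, `<dc:creator>` and the like as HTML. A minimal sketch of one way to avoid that before pasting debug output, assuming a hypothetical helper `fenceForMarkdown` (it is not part of translation-server or Zotero). It wraps the text in a code fence that is longer than any run of backticks inside the text:

```javascript
// Hypothetical helper, not part of translation-server or Zotero.
// Wraps arbitrary text in a Markdown code fence so that angle-bracket
// tags are shown literally instead of being interpreted as HTML.
function fenceForMarkdown(text, lang = '') {
  // Find the longest run of backticks inside the text.
  const runs = text.match(/`+/g) || [];
  const longest = runs.reduce((max, run) => Math.max(max, run.length), 0);
  // A fence needs at least three backticks and must be longer than any run inside.
  const fence = '`'.repeat(Math.max(3, longest + 1));
  return `${fence}${lang}\n${text}\n${fence}`;
}

// Example with a fragment of the feed shown earlier in the thread.
const sample = '<dc:creator>Meade Minnigerode</dc:creator>';
console.log(fenceForMarkdown(sample, 'xml'));
```

Pasting the fenced result keeps every tag visible, so the response can be compared character for character with what the server actually received.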

Member

@adomasven adomasven commented Jul 2, 2018

Ok, so this is probably still just an XPath matching issue. Make sure you run npm i after pulling the latest commit and that your package-lock.json contains this line.

Also note that I am developing and testing on Node.js v10.5.0 and npm v6.1.0.

Author

@mvolz mvolz commented Jul 10, 2018

I'm running the same version and the package-lock contains the same line :/

Member

@adomasven adomasven commented Jul 10, 2018

What version of node.js and npm are you on?

Contributor

@zuphilip zuphilip commented Jul 10, 2018

The example http://books.google.de/books?hl=en&lr=&id=Ct6FKwHhBSQC&oi=fnd&pg=PP9 works for me as well under node 8.11.1 and npm 6.0.0.

@mvolz Can you activate this debug line in the translator file https://github.com/zotero/translators/blob/master/Google%20Books.js#L93, and possibly some of the commented-out lines after it, to check that the DOMParser is working as expected?

Member

@adomasven adomasven commented Jul 10, 2018

TypeError: Cannot read property 'textContent' of undefined
at parseXML (eval at <anonymous> (/home/marielle/Code/translation-server-v2/src/translation/sandboxManager.js:65:4), <anonymous>:144:41)

This is certainly an xpath issue, which occurred in the xpath library before the latest fix. The line on which it fails is Google Books.js:144.
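
The error message is what JavaScript reports when code reads `.textContent` from the first element of an empty result list. A minimal sketch of that failure mode and of a guard against it, using plain arrays in place of real XPath results. `firstText` is a hypothetical helper, and the actual code at Google Books.js:144 is not reproduced here; that it indexes the first match of an XPath query is an assumption.

```javascript
// Hypothetical helper for illustration; not taken from Google Books.js.
// `nodes` stands in for the array that an XPath query returns.
function firstText(nodes, fallback = '') {
  const node = nodes && nodes[0];
  // An empty result makes node undefined, so check before dereferencing.
  return node ? node.textContent : fallback;
}

const matched = [{ textContent: 'Some American Ladies' }];
const unmatched = []; // assumed shape of a query that found nothing

console.log(firstText(matched));
console.log(JSON.stringify(firstText(unmatched)));

try {
  // Unguarded access: reading a property of undefined throws.
  unmatched[0].textContent;
} catch (e) {
  console.log(e instanceof TypeError);
}
```

A guard like this would only turn the crash into an empty field. It would not recover the missing metadata, which in this thread depended on getting the fixed xpath package installed.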

If @mvolz is running a version of npm that does not support package-lock.json, this is exactly where it would fail if npm did not fetch the latest version of the package. You could also try removing node_modules and running npm i again.
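
One way to check whether `node_modules` really matches `package-lock.json` before deleting anything is to compare the two versions directly. A minimal sketch under assumptions: `checkLockedVersion` is a hypothetical helper, the package name passed to it is a placeholder (the thread does not name the package in plain text), and it reads both the `dependencies` layout written by npm 6 and the `packages` layout of newer lock files.

```javascript
const fs = require('fs');
const path = require('path');

// Hypothetical helper: compares the version recorded in package-lock.json
// with the version actually installed under node_modules.
function checkLockedVersion(projectDir, pkgName) {
  const lock = JSON.parse(
    fs.readFileSync(path.join(projectDir, 'package-lock.json'), 'utf8')
  );
  // Newer lock files list packages by path; npm 6 uses `dependencies`.
  const fromPackages = lock.packages && lock.packages[`node_modules/${pkgName}`];
  const fromDeps = lock.dependencies && lock.dependencies[pkgName];
  const entry = fromPackages || fromDeps;
  const locked = entry ? entry.version : null;

  let installed = null;
  try {
    const manifest = path.join(projectDir, 'node_modules', pkgName, 'package.json');
    installed = JSON.parse(fs.readFileSync(manifest, 'utf8')).version;
  } catch (e) {
    // Package is missing or unreadable; leave `installed` as null.
  }

  return { locked, installed, matches: locked !== null && locked === installed };
}
```

Calling `checkLockedVersion(process.cwd(), 'some-package')` from the project root returns an object with `locked`, `installed` and `matches`. A mismatch suggests that a stale copy is still sitting in `node_modules`. For a dependency installed from a git URL the lock file may record the URL and commit rather than a plain version number, in which case this comparison would not apply.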

@mvolz changed the title from “Google book link causes "Cannot read property 'textContent' of undefined at parseXML" error” to “Google book link only gets partial data” on Jul 10, 2018
Author

@mvolz mvolz commented Jul 10, 2018

Removing the old packages seemed to do the trick.

@mvolz mvolz closed this Jul 10, 2018