Google book link only gets partial data #10

mvolz · Jun 14, 2018

curl -d '{ "query": "http://books.google.de/books?hl=en&lr=&id=Ct6FKwHhBSQC&oi=fnd&pg=PP9" }' -H 'Content-Type: application/json' http://127.0.0.1:1969/search
Internal Server Error

node src/server.js

(3)(+0000000): Translators initialized with 523 loaded

(3)(+0000006): Listening on 0.0.0.0:1969

(3)(+0052583): HTTP GET http://books.google.de/books?hl=en&lr=&id=Ct6FKwHhBSQC&oi=fnd&pg=PP9

(1)(+0000203): Error: read ECONNRESET

Error: read ECONNRESET
    at _errnoException (util.js:1024:11)
    at TCP.onread (net.js:615:25)

InternalServerError: An error occurred retrieving the document

  at Object.throw (/home/marielle/Code/translation-server-v2/node_modules/koa/lib/context.js:93:11)
  at Object.handleURL (/home/marielle/Code/translation-server-v2/src/endpoints.js:164:23)
  at <anonymous>
  at process._tickCallback (internal/process/next_tick.js:188:7)

http://books.google.de/books?hl=en&lr=&id=Ct6FKwHhBSQC&oi=fnd&pg=PP9&dq=%2522Peggy+Eaton%2522&ots=KN-Z0-HAcv&sig=snBNf7bilHi9GFH4-6-3s1ySI9Q&redir_esc=y#v=onepage&q=%2522Peggy%2520Eaton%2522&f=false

dstillman · Jun 29, 2018

This works for me. It looks like this may be either a networking issue on your end or Google just blocking your IP.

mvolz · Jun 29, 2018

It no longer causes an internal server error with the update; might have to do with the xpath stuff being fixed. But now it does this:

[{"key":"3SBPBISF","version":0,"itemType":"book","creators":[],"tags":[],"title":"Some American Ladies","url":"https://books.google.com/books/about/Some_American_Ladies.html?id=Ct6FKwHhBSQC","libraryCatalog":"books.google.de","accessDate":"CURRENT_TIMESTAMP"}]

(4)(+0000000): Translate: Parsing code for Google Books (3e684d82-73a3-9a34-095f-19b112d88bbf, 2017-12-03 04:20:33)

(3)(+0000001): Translate: Beginning translation with Google Books

(3)(+0000001): Translate: resolving URL //books.google.com/books/feeds/volumes/Ct6FKwHhBSQC

(3)(+0000000): Translate: resolved to http://books.google.com/books/feeds/volumes/Ct6FKwHhBSQC

(3)(+0000001): Zotero.HTTP.doGet is deprecated. Use Zotero.HTTP.request

(3)(+0000000): HTTP GET http://books.google.com/books/feeds/volumes/Ct6FKwHhBSQC

(3)(+0000706): TypeError: Cannot read property 'textContent' of undefined

TypeError: Cannot read property 'textContent' of undefined
    at parseXML (eval at <anonymous> (/home/marielle/Code/translation-server-v2/src/translation/sandboxManager.js:65:4), <anonymous>:144:41)
    at /home/marielle/Code/translation-server-v2/modules/zotero/chrome/content/zotero/xpcom/utilities_translate.js:331:5
    at <anonymous>
    at process._tickCallback (internal/process/next_tick.js:188:7)

(2)(+0000000): Translate: Translation using Google Books failed:
TypeError: Cannot read property 'textContent' of undefined

TypeError: Cannot read property 'textContent' of undefined
    at parseXML (eval at <anonymous> (/home/marielle/Code/translation-server-v2/src/translation/sandboxManager.js:65:4), <anonymous>:144:41)
    at /home/marielle/Code/translation-server-v2/modules/zotero/chrome/content/zotero/xpcom/utilities_translate.js:331:5
    at <anonymous>
    at process._tickCallback (internal/process/next_tick.js:188:7)
url => http://books.google.de/books?hl=en&lr=&id=Ct6FKwHhBSQC&oi=fnd&pg=PP9

(5)(+0000000): Translate: Running handler 0 for error

(1)(+0000000): Translation using Google Books failed

(1)(+0000000): TypeError: Cannot read property 'textContent' of undefined

TypeError: Cannot read property 'textContent' of undefined
    at parseXML (eval at <anonymous> (/home/marielle/Code/translation-server-v2/src/translation/sandboxManager.js:65:4), <anonymous>:144:41)
    at /home/marielle/Code/translation-server-v2/modules/zotero/chrome/content/zotero/xpcom/utilities_translate.js:331:5
    at <anonymous>
    at process._tickCallback (internal/process/next_tick.js:188:7)

dstillman · Jun 29, 2018

Can you load http://books.google.com/books/feeds/volumes/Ct6FKwHhBSQC from your IP address?

dstillman · Jun 29, 2018

Also, have you changed the User-Agent setting? Google will almost certainly block you with a non-browser User-Agent.

mvolz · Jun 29, 2018

Yes, and no, just left the default string in there.

The fact that it is getting the title (Some American Ladies) and that Embedded Metadata runs okay does seem to indicate it's scraping the page (i.e. not an IP block), just that it doesn't have the XML for some reason.

(I didn't paste the success part of the output, here's the rest:)

(4)(+0000000): Translate: Parsing code for Embedded Metadata (951c027d-74ac-47d4-a107-9c3069ab7b48, 2018-02-13 19:20:46)

(3)(+0000002): Translate: Beginning translation with Embedded Metadata

(3)(+0000001): Translate: Embedded Metadata: found 7 meta tags.

(3)(+0000000): Translate: Creating translate instance of type import in sandbox

(4)(+0000000): Translate: Binding sandbox to http://books.google.de/books?hl=en&lr=&id=Ct6FKwHhBSQC&oi=fnd&pg=PP9

(4)(+0000001): Translate: Parsing code for RDF (5e3ad958-ac79-463d-812b-a86a9235c28f, 2018-05-08 19:39:38)

(3)(+0000001): Translate: Initializing RDF data store

(3)(+0000006): Translate: Promise not available in sandbox in _itemDone()

(3)(+0000000): Translate: Saving item

(5)(+0000000): Translate: Running handler 0 for itemDone

(3)(+0000009): Translate: Looking for authors in byline, vcard

(3)(+0000004): Translate: Found 0 elements with 'byline' class

(3)(+0000001): Translate: Found 0 elements with 'vcard' class

(3)(+0000001): Translate: No byline found.

(3)(+0000001): Translate: Promise not available in sandbox in _itemDone()

(3)(+0000000): Translate: Saving item

(3)(+0000000): Translate: Translation successful

(5)(+0000001): Translate: Running handler 0 for done

(3)(+0000000): itemToAPIJSON: Discarded field publicationTitle: field not valid for type book

dstillman · Jun 29, 2018

I think you'll need to add some Zotero.debug() lines to see what it's getting instead of the XML. Google tends to lock down its data exports (BibTeX, etc.) more than its webpages, so it's totally possible that the XML is blocked for some reason.

dstillman · Jun 29, 2018

Oh, wait, I seem to be getting the same error now. We'll look into it.

dstillman · Jun 29, 2018

And now it's working for me again.

Is this failing for you consistently? If you delete package-lock.json and run npm i again, and then start the server again, does it still happen?

If so, can you add Zotero.debug(xmlhttp); above the processor(xmlhttp.responseText, xmlhttp, url); line in modules/zotero/chrome/content/zotero/xpcom/utilities_translate.js (around line 311) and see what it shows?

mvolz · Jun 29, 2018

I'm getting it consistently, here's the output of the debug line:

		if(processor) {
			Zotero.debug(xmlhttp);


(3)(+0000314): {
    "responseText": "<?xml version='1.0' encoding='UTF-8'?><entry xmlns='http://www.w3.org/2005/Atom' xmlns:gbs='http://schemas.google.com/books/2008' xmlns:gd='http://schemas.google.com/g/2005' xmlns:batch='http://schemas.google.com/gdata/batch' xmlns:dc='http://purl.org/dc/terms'><id>http://www.google.com/books/feeds/volumes/Ct6FKwHhBSQC</id><updated>2018-06-29T15:51:03.000Z</updated><category scheme='http://schemas.google.com/g/2005#kind' term='http://schemas.google.com/books/2008#volume'/><title type='text'>Some American Ladies</title><link rel='http://schemas.google.com/books/2008/thumbnail' type='image/x-unknown' href='http://books.google.com/books/content?id=Ct6FKwHhBSQC&amp;printsec=frontcover&amp;img=1&amp;zoom=5&amp;imgtk=AFLRE737vH96m2qfjPFR3Xtz7zCkaITQxZtgGD5qYI6pij6vTEO7JUOUg0crywriKISD9LvKiOPaeQIRvia33nYCYkFWr6pDyzjZfiftIV8Sh-uxtspeyeCnTuOnypWpSV_EmVCrRqvX&amp;source=gbs_gdata'/><link rel='http://schemas.google.com/books/2008/info' type='text/html' href='http://books.google.com/books?id=Ct6FKwHhBSQC&amp;source=gbs_gdata'/><link rel='http://schemas.google.com/books/2008/annotation' type='application/atom+xml' href='http://www.google.com/books/feeds/users/me/volumes'/><link rel='alternate' type='text/html' href='http://books.google.com/books?id=Ct6FKwHhBSQC'/><link rel='self' type='application/atom+xml' href='http://www.google.com/books/feeds/volumes/Ct6FKwHhBSQC'/><gbs:contentVersion>2.1.0.0.preview.0</gbs:contentVersion><gbs:embeddability value='http://schemas.google.com/books/2008#not_embeddable'/><gbs:openAccess value='http://schemas.google.com/books/2008#disabled'/><gbs:viewability value='http://schemas.google.com/books/2008#view_no_pages'/><dc:creator>Meade Minnigerode</dc:creator><dc:date>1926</dc:date><dc:format>Dimensions 14.6x24.0x2.7 cm</dc:format><dc:format>332 
pages</dc:format><dc:format>book</dc:format><dc:identifier>Ct6FKwHhBSQC</dc:identifier><dc:identifier>ISBN:0836913620</dc:identifier><dc:identifier>ISBN:9780836913620</dc:identifier><dc:language>en</dc:language><dc:publisher>G.P. Putnam's Sons</dc:publisher><dc:subject>Biography &amp; Autobiography / Women</dc:subject><dc:title>Some American Ladies</dc:title><dc:title>Seven Informal Biographies ...</dc:title></entry>"
    "headers": {
        "content-type": "application/atom+xml; charset=UTF-8"
        "expires": "Fri, 29 Jun 2018 15:51:03 GMT"
        "date": "Fri, 29 Jun 2018 15:51:03 GMT"
        "cache-control": "private, max-age=0, must-revalidate, no-transform"
        "vary": "Accept, X-GData-Authorization, GData-Version"
        "gdata-version": "1.0"
        "last-modified": "Fri, 29 Jun 2018 15:51:03 GMT"
        "transfer-encoding": "chunked"
        "x-content-type-options": "nosniff"
        "x-frame-options": "SAMEORIGIN"
        "x-xss-protection": "1; mode=block"
        "server": "GSE"
        "connection": "close"
    }
    "statusCode": 200
}

(3)(+0000012): TypeError: Cannot read property 'textContent' of undefined
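
The dump above can be read as a quick sanity check: status 200, an Atom XML content type, and a dc:creator element in the body would suggest the download itself is fine and the failure happens later, during parsing. A minimal sketch of that check follows. `summarizeResponse` is a hypothetical helper, not part of translation-server, and it only assumes an object shaped like the debug output (statusCode, headers, responseText).

```javascript
// Hypothetical helper (not part of translation-server): summarize a response
// object shaped like the Zotero.debug(xmlhttp) dump above.
function summarizeResponse(res) {
  const contentType = (res.headers && res.headers["content-type"]) || "";
  const body = res.responseText || "";
  return {
    ok: res.statusCode === 200,                 // request succeeded
    isXml: /[\/+]xml\b/i.test(contentType),     // e.g. application/atom+xml
    hasCreator: body.includes("<dc:creator>"),  // author data is present
  };
}

// Abbreviated stand-in for the real response body shown above.
const summary = summarizeResponse({
  statusCode: 200,
  headers: { "content-type": "application/atom+xml; charset=UTF-8" },
  responseText: "<entry><dc:creator>Meade Minnigerode</dc:creator></entry>",
});
console.log(summary);
```

If all three flags are true, as they would be for the dump above, a block by Google looks unlikely and the parsing step is the place to look.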

adomasven · Jul 2, 2018

"responseText": "http://www.google.com/books/feeds/volumes/Ct6FKwHhBSQC2018-06-29T15:51:03.000Z<title type='text'>Some American Ladies</title>gbs:contentVersion2.1.0.0.preview.0</gbs:contentVersion><gbs:embeddability value='http://schemas.google.com/books/2008#not_embeddable'/><gbs:openAccess value='http://schemas.google.com/books/2008#disabled'/><gbs:viewability value='http://schemas.google.com/books/2008#view_no_pages'/>dc:creatorMeade Minnigerode</dc:creator>dc:date1926</dc:date>dc:formatDimensions 14.6x24.0x2.7 cm</dc:format>dc:format332 pages</dc:format>dc:formatbook</dc:format>dc:identifierCt6FKwHhBSQC</dc:identifier>dc:identifierISBN:0836913620</dc:identifier>dc:identifierISBN:9780836913620</dc:identifier>dc:languageen</dc:language>dc:publisherG.P. Putnam's Sons</dc:publisher>dc:subjectBiography & Autobiography / Women</dc:subject>dc:titleSome American Ladies</dc:title>dc:titleSeven Informal Biographies ...</dc:title>"

~~For some reason you are receiving a very wrong looking version of the item metadata. Does translation with the old server work well?~~

Update: Maybe not. I had to look at the source of your comment; some XML tags got swallowed by the Markdown interpreter.

adomasven · Jul 2, 2018

Ok, so this is probably still just an XPath matching issue. Make sure you run npm i after pulling the latest commit and that your package-lock.json contains this line.

Also note that I am developing and testing on Node.js v10.5.0 and npm v6.1.0.

mvolz · Jul 10, 2018

I'm running the same version and the package-lock contains the same line :/

adomasven · Jul 10, 2018

What version of node.js and npm are you on?

zuphilip · Jul 10, 2018

The example http://books.google.de/books?hl=en&lr=&id=Ct6FKwHhBSQC&oi=fnd&pg=PP9 works for me as well under node 8.11.1 and npm 6.0.0.

@mvolz Can you activate the debug line in the translator file at https://github.com/zotero/translators/blob/master/Google%20Books.js#L93, and possibly a few more further down, to check that the DOMParser is working as expected?

adomasven · Jul 10, 2018

TypeError: Cannot read property 'textContent' of undefined
    at parseXML (eval at <anonymous> (/home/marielle/Code/translation-server-v2/src/translation/sandboxManager.js:65:4), <anonymous>:144:41)

This is certainly an xpath issue, which occurred in the xpath library before the latest fix. The line on which it fails is Google Books.js:144.
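
To illustrate the failure mode: this TypeError is what you get when code reads .textContent from the first element of a result set that turned out to be empty. The sketch below uses plain arrays as stand-ins for XPath results. `firstText` is a hypothetical guard and not the translator's actual code, and it is only an assumption that line 144 indexes a result this way.

```javascript
// Hypothetical guard, not the Google Books translator's actual code:
// return the text of the first matched node, or null when nothing matched.
function firstText(nodes) {
  if (!nodes || nodes.length === 0) {
    return null;
  }
  return nodes[0].textContent;
}

// Stand-ins for XPath results: plain arrays of node-like objects.
const matched = [{ textContent: "Some American Ladies" }];
const unmatched = [];

console.log(firstText(matched));
console.log(firstText(unmatched));

// Without the guard, unmatched[0] is undefined, so reading .textContent throws.
let caught = null;
try {
  unmatched[0].textContent;
} catch (err) {
  caught = err;
}
console.log(caught instanceof TypeError);
```

A guard like this would only hide the symptom. If the XPath library returns no matches for valid XML, the library version is still the thing to fix.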

If @mvolz is running a version of npm that does not support package-lock.json, npm would not have fetched the latest version of the package, and this is exactly where it would fail. You could also try removing node_modules and running npm i again.
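
One rough way to spot a stale install is to compare the version the lockfile pins with the version actually present in node_modules. The sketch below works on in-memory objects. The package name `some-xpath-lib` and the version numbers are made up for illustration, and the comparison assumes the dependency is pinned by a plain version string. A dependency pinned to a git commit would need its resolved commit compared instead.

```javascript
// Look up the version a lockfile pins for a package.
// Handles both lockfile layouts: "packages" (newer) and "dependencies" (older).
function lockedVersion(lock, name) {
  const fromPackages = lock.packages && lock.packages["node_modules/" + name];
  if (fromPackages && fromPackages.version) {
    return fromPackages.version;
  }
  const fromDeps = lock.dependencies && lock.dependencies[name];
  return fromDeps && fromDeps.version ? fromDeps.version : null;
}

// True when the installed copy does not match what the lockfile pins.
function isStale(lock, name, installedVersion) {
  const pinned = lockedVersion(lock, name);
  return pinned !== null && pinned !== installedVersion;
}

// Made-up data: "some-xpath-lib" is a hypothetical package name.
const lock = { dependencies: { "some-xpath-lib": { version: "1.0.1" } } };
console.log(isStale(lock, "some-xpath-lib", "1.0.0"));
console.log(isStale(lock, "some-xpath-lib", "1.0.1"));
```

In a real checkout, the two inputs would come from package-lock.json and from the package.json inside node_modules for the package in question.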

mvolz · Jul 10, 2018

Removing the old packages seemed to do the trick.

mvolz changed the title ~~Google book link causes internal server error~~ Google book link gets poor metadata Jun 29, 2018

mvolz changed the title ~~Google book link gets poor metadata~~ Google book link causes "Cannot read property 'textContent' of undefined at parseXML" error Jun 29, 2018

mvolz changed the title ~~Google book link causes "Cannot read property 'textContent' of undefined at parseXML" error~~ Google book link only gets partial data Jul 10, 2018

mvolz closed this Jul 10, 2018

zotero / translation-server
