Scraping from a search field

asked 2016-02-02 08:20:18 -0600

I am trying to scrape details of State of Rio government salaries that are available on this page: http://www.consultaremuneracao.rj.gov.br/pages/welcome.jsf. With scrapy and selenium I have a working script (https://bitbucket.org/chicocvenancio/salarios_rio) to store the results. Now I am having trouble constructing the search queries so that I get all the results, ideally with the minimum number of searches.

The query is "wildcarded" on both sides, so "bel" returns both "isabel" and "bela". The results are also limited to 100 per search. Strangely, we may also use wildcards such as % or _ inside the query.

Does anybody know of a good strategy for the search query with these restrictions? I have seen this strategy (http://www.charleshooper.net/blog/screen-scraping-search-results-for-information-retrieval/) for searches that place a wildcard only at the end, but I think it will be terribly inefficient when wildcards are also applied at the beginning of a query.
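
To make the problem concrete, here is a minimal sketch of one possible refinement strategy under these restrictions. It is not the method from the linked post and has not been tried against the real site. Everything in it is an assumption for illustration: `make_fake_search` is a hypothetical stand-in for the search field (plain substring match, capped at `limit` hits), `crawl_all`, `alphabet`, `limit` and `max_len` are made-up names, and `%` and `_` are deliberately left out of the alphabet so every query is a literal string. The idea is to grow queries one character at a time, only refine queries that hit the cap, and skip any query whose tail (the query without its first character) was already fully enumerated, since its matches must already be known.

```python
from typing import Callable, Iterable, List, Set


def make_fake_search(records: Iterable[str], limit: int) -> Callable[[str], List[str]]:
    """Hypothetical stand-in for the site: substring match, at most `limit` hits."""
    data = list(records)

    def search(query: str) -> List[str]:
        return [r for r in data if query in r][:limit]

    return search


def crawl_all(search: Callable[[str], List[str]], alphabet: str,
              limit: int, max_len: int = 8) -> Set[str]:
    """Collect records by refining only the queries that hit the result cap."""
    found: Set[str] = set()
    # Queries from the previous round whose result list reached the cap.
    # The empty query is treated as capped without actually being sent.
    saturated = {""}
    for _ in range(max_len):
        next_saturated = set()
        for prefix in sorted(saturated):
            for ch in alphabet:
                query = prefix + ch
                # If the tail of the query was not capped, every record
                # containing the tail (and hence the query) is already known.
                if query[1:] not in saturated:
                    continue
                hits = search(query)
                found.update(hits)
                if len(hits) >= limit:
                    next_saturated.add(query)
        if not next_saturated:
            break
        saturated = next_saturated
    return found


if __name__ == "__main__":
    names = ["isabel", "bela", "abel", "belmiro", "anabela", "maria", "mario"]
    fake = make_fake_search(names, limit=3)
    print(sorted(crawl_all(fake, "abcdefghijklmnopqrstuvwxyz", limit=3)))
```

Two caveats with this sketch: a record can still be missed if its full text, used as a query, is itself capped and the record is not among the returned hits, and the real alphabet would need to include every character that can appear in a name (spaces, accented letters), which multiplies the number of requests.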

(removed links for lack of karma)
