Apify & Crawlee · 4y ago
precious-lavender

Issues with charset

Hi everyone,
I am new to Apify. I love how useful it is, so I decided to learn it by solving a real-life problem: scraping data from an official government website.

I am using Cheerio Scraper to get data from a list (link below) that contains Czech text. My problem is that I cannot get the data with the correct encoding: characters from the Czech alphabet come out garbled.

It scrapes this: "Apoďż˝tolskďż˝ cďż˝rkev, 1. sbor Praha" (with windows-1250) or this: "Apo�tolsk� c�rkev, 1. sbor Praha" (with utf8) instead of this: "Apoštolská církev, 1. sbor Praha"

I have tried forcing different response encodings (utf8, windows-1250) and sending different headers, but without success.
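
In case it helps with diagnosis, here is a minimal Node.js sketch that reproduces both garbled strings locally. It assumes the server really sends windows-1250 bytes and that something decodes them as UTF-8 before any forced encoding is applied; I have not confirmed either. The byte array is written by hand for illustration, not captured from the site, and `TextDecoder('windows-1250')` assumes a Node.js build with full ICU.

```javascript
// Hand-written windows-1250 bytes for "Apoštolská církev" (an assumption,
// not captured from the real response). In windows-1250:
// š = 0x9A, á = 0xE1, í = 0xED; ASCII letters are unchanged.
const cp1250Bytes = Buffer.from([
    0x41, 0x70, 0x6f, 0x9a, 0x74, 0x6f, 0x6c, 0x73, 0x6b, 0xe1, 0x20,
    0x63, 0xed, 0x72, 0x6b, 0x65, 0x76,
]);

// Needs full ICU; throws a RangeError if the encoding is not supported.
const cp1250 = new TextDecoder('windows-1250');

// Correct path: decode the raw bytes as windows-1250.
const correct = cp1250.decode(cp1250Bytes);

// Wrong path 1: decode the same bytes as UTF-8. Each byte that is not
// valid UTF-8 becomes U+FFFD, and the original byte is lost for good.
const garbledUtf8 = cp1250Bytes.toString('utf8');

// Wrong path 2: the already-garbled string is serialized as UTF-8
// (U+FFFD becomes EF BF BD) and then decoded as windows-1250.
const garbledTwice = cp1250.decode(Buffer.from(garbledUtf8, 'utf8'));

console.log(correct);
console.log(garbledUtf8);
console.log(garbledTwice);
```

If this matches what is happening, the windows-1250 output is the UTF-8 output garbled a second time, so the original bytes would already be lost before the forced encoding is applied.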

After many hours I feel like I am getting nowhere. Do you have any suggestions?


Start URL: https://www-cns.mkcr.cz/cns_internet/CNS/Seznam_cpo.aspx?id_subj=148&str_zpet=Seznam_CPO.aspx

Glob pattern: https://www-cns.mkcr.cz/cns_internet/CNS/Detail_cpo.aspx?id_subj=*&str_zpet=Seznam_CPO.aspx

Link selector: td > a

Code:
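async function pageFunction(context) {
    // $ is the Cheerio handle for the parsed page.
    const { $, request, log } = context;
    const pageTitle = $('title').first().text();
    const url = request.url;
    // Find the label cell by its ASCII tail "zev:" (presumably "Název:")
    // and read the text of the cell that follows it.
    const churchName = $('td:contains("zev:")').next().text();

    log.info('Church Name:', { churchName });
    return {
        url,
        churchName,
    };
}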


BTW: I am using a proxy located in CZ to reach the site.