I am looking for a freelancer to write a Python program that crawls data from the World Bank based on specific indicator names. The indicator names will be provided as arguments to the script.
1. Should accept a list of proxy IPs as an argument and use a random IP address from the given list.
2. Should retry up to 3 times when crawling data fails.
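The two requirements above (random proxy per attempt, up to 3 retries) could be sketched roughly as follows. This is a non-authoritative sketch: the name `fetch_with_retry` and the injected `fetch` callable are my own; in the real crawler, `fetch` would wrap an HTTP client call (e.g. the requests library with a per-attempt `proxies` mapping).

```python
import random
import time

def fetch_with_retry(url, proxies, fetch, max_retries=3, delay=1.0):
    """Try up to max_retries times, choosing a random proxy per attempt.

    `proxies` is the config's proxy list ({"proxy_type": ..., "proxy": ...});
    `fetch` is any callable (url, proxy) -> parsed JSON. In the real crawler
    it would route the HTTP request through the chosen proxy.
    """
    last_err = None
    for attempt in range(max_retries):
        # pick a random proxy for this attempt (None if no proxies configured)
        proxy = random.choice(proxies) if proxies else None
        try:
            return fetch(url, proxy)
        except Exception as err:  # a failed proxy connection counts as one attempt
            last_err = err
            if attempt < max_retries - 1:
                time.sleep(delay)
    raise RuntimeError(f"all {max_retries} attempts failed: {last_err}")
```

The `fetch` parameter is injected only to keep the sketch easy to test; a production version could hard-wire the HTTP call.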
Passed arguments:
(1) --config [login to view URL]: the config file, in JSON format, including the indicator name list and specifying the global year range and per-indicator year ranges. For example:
{
"global": {
"starttime": "2021-01-01", //YYYY-MM-DD, if not specify, no limit to starttime
"endtime": "2021-01-01", //YYYY-MM-DD, if not specify, no limit to endtime
"craw_interval": 3000, //ms, craw interval between each
"proxies": [ //support multiple proxies, use random proxy. If proxy connect failed, try 3 times.
{
"proxy_type": "http",
"proxy": "[login to view URL]"
}
]
},
"indicators": [
{
"meta_name": "[login to view URL]",
"indicator_name": "Population Ages",
"indicator_code": "pop",
"starttime": "2021-01-01", //YYYY-MM-DD, if not specify, follow global config
"endtime": "2021-01-01", //YYYY-MM-DD, if not specify, follow global config
},
{
"meta_name": "[login to view URL]", //origin indicator name of world bank
"indicator_name": "Population Ages",
"indicator_code": "pop",
"starttime": "2021-01-01", //YYYY-MM-DD, if not specify, follow global config
"endtime": "2021-01-01", //YYYY-MM-DD, if not specify, follow global config
}
],
"mapping_dict": {
"country": [
{
"name": "USA",
"code": "USA",
"alias": [ // search every alias ignore case.
"United States of America",
"US"
]
}
],
"field": [ //use to remapping the output values when a single indicator contains multiple values fields. default field is value
{
"meta_name": "value", // meta_name is the original field name. the key of "value" represent original default value field.
"meta_value": "value"
},
{
"meta_name": "score", // meta_name is the original field name. the key of "value" represent original default value field.
"meta_value": "Final Score"
}
]
}
}
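Given the config structure above, the per-indicator fallback to the global year range could be resolved with a small helper along these lines (`resolve_range` is a name of my own; a value missing in both places means "no limit"):

```python
def resolve_range(indicator, global_cfg):
    """Return (starttime, endtime) for one indicator config entry.

    A per-indicator value wins; otherwise fall back to the global config;
    if neither specifies a value, None means "no limit".
    """
    return (
        indicator.get("starttime", global_cfg.get("starttime")),
        indicator.get("endtime", global_cfg.get("endtime")),
    )
```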
(2) --output: output path of the final JSON file. Output format:
{
"indicators": [
{
"meta_name": "[login to view URL]",
"indicator_name": "Population Ages",
"indicator_code": "pop",
"starttime": "2021-01-01", //YYYY-MM-DD, if not specify, follow global config
"endtime": "2021-01-01", //YYYY-MM-DD, if not specify, follow global config
"status": "success", // if crawl failed, set status = failed.
"errmsg": "OK", //if crawl failed, show error message here
"total": 2939293, //total data entry counts
"countries": 238823, //total country counts
"years": 60 //total years
},
],
"data": [
{
"datasource": "worldbank",
"ref_link": "[login to view URL]", //the indicator's original link
"meta_name": "[login to view URL]",
"indicator_name": "Population Ages",
"indicator_code": "pop",
"country_name": "USA",
"country_code": "USA",
"crawl_time": "2022-01-01 12:00:00",
"year": "2022",
"starttime": "2022-01-01", // if data source only contains year like 2022, set it to start of the year
"endtime": "2022-12-30", // if data source only contains year like 2022, set it to end of the year
"values": { // use field mapping to convert to new field name first. the key should be the new field name.
"value": 123,
"Final Score": 300
}
}
]
}
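The rule above for year-only data points (set starttime/endtime to the first and last day of the year) might look like this; `year_to_range` is a hypothetical helper name of my own:

```python
import calendar

def year_to_range(year):
    """Expand a bare year (e.g. "2022") into (starttime, endtime) strings
    covering the first and last day of that year, per the output spec."""
    y = int(year)
    last_day = calendar.monthrange(y, 12)[1]  # December always has 31 days
    return f"{y:04d}-01-01", f"{y:04d}-12-{last_day}"
```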
3. The output should contain:
(1) failed indicator names and failure reasons
(2) remapped indicator_name and value fields
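The two remapping requirements — value-field renaming via mapping_dict["field"] and case-insensitive country matching via mapping_dict["country"] — might be sketched as follows (both helper names are my own, not part of the spec):

```python
def remap_fields(raw_values, field_mapping):
    """Rename value fields: meta_name (original name) -> meta_value (output name).
    Fields without a mapping entry keep their original name."""
    rename = {f["meta_name"]: f["meta_value"] for f in field_mapping}
    return {rename.get(k, k): v for k, v in raw_values.items()}

def match_country(name, country_mapping):
    """Return the mapped country code, comparing against name, code, and
    every alias case-insensitively; None if nothing matches."""
    needle = name.strip().lower()
    for entry in country_mapping:
        candidates = [entry["name"], entry["code"], *entry.get("alias", [])]
        if any(needle == c.lower() for c in candidates):
            return entry["code"]
    return None
```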
4. Should provide:
(1) [login to view URL]
(2) Python code
(3) A simple test case ([login to view URL] and a test command)
Skills and Experience:
- Strong experience in web scraping and data crawling
- Proficiency in Python or another suitable programming language for web scraping
- Familiarity with the World Bank's data structure and API
Data Format:
- The crawled data should be in JSON format.
Data Cleaning and Structuring:
- The client requires the data to be cleaned and structured according to specific data attributes.
Please provide examples of similar projects you have completed in the past.