---
title: "Data cleaning"
author: "Meng Ye"
date: "2023-03-28"
format:
  html:
    code-fold: show
language:
  title-block-published: "Last updated"
---

```{r load-packages-data, message=FALSE, warning=FALSE}
library(tidyverse)
library(targets)
library(here)
library(lubridate)
library(sf)
library(mapchina)

# Point to the _targets folder location since this qmd is in a subfolder
tar_config_set(
  store = here::here('_targets'),
  script = here::here('_targets.R')
)

# Load targets
ongo <- tar_read(ongo)
```

# Documentation of the pre-processing of scraped data

The inspected raw data is the version obtained after checking the scraped csv file against the source website. The blank lines in the scraped csv file come from ROs^[RO/ro = representative office of ONGOs] that have been de-registered. [The website](https://ngo.mps.gov.cn/ngo/portal/toInfogs.do) keeps their names in the general list, but their profiles are now blank.

A couple of scraping errors were also corrected in the process, that is, cases where the RO is active but appears as a blank row in the scraped file. I manually added those rows back.

For ROs that have already been de-registered, I added back their `index`, `ro_name`, and `ro_id`, and generated a dichotomous indicator in the `still_active` column showing that they are no longer "active".

For the `still_active` variable, the values for the blank lines were entered manually after checking against the official website: `FALSE` for ROs that have indeed been de-registered and `TRUE` for those that turned out to be parsing errors. The remaining `NA`s will be recoded to `TRUE` in data cleaning.

After the cross-checking process, the index number of each RO in our dataset is the same as its index in the official database as of 2022-10-04.

The following columns are also manually coded:

- `home`: home country or region of the ONGO.
  The name "country" is avoided so that the variable can also denote Hong Kong, Macau and Taiwan.
- `psu_level`: dummy variable indicating whether the official sponsoring unit (for registration and annual audit) is at the national or local level. It is clearly correlated with `geo_scope_num`, but should be a mediating factor for the relationship we are examining.
- `ro_count`: Some ONGOs are not granted permission to operate across China, so they register multiple ROs to cover the provinces where they need to work. We want to include this variable because it is directly correlated with `geo_scope_num`.

  A related `multiple_index` is an index number created to identify siblings, in the form of 100-1, 100-2... (starting from 100 to ensure all have three digits); it is `NA` for ONGOs that have only one RO. Note that there are cases where one of the siblings has been de-registered, but only a few.
- `local_aim`: In some cases the aim/mission of the ONGO is attached to a locality, so I created this dummy to indicate such situations. *However,* ONGOs seem to be inconsistent in whether they use a location name in their work field. It might also be that the location limitation was added at the request of the authority. So this variable probably should not be used.
- `Index`: Index numbers larger than 700 denote ROs that were active at the time of data scraping (Jan. 2022) but had been de-registered as of Oct. 2022.
- `work_field`: original mission statement in Chinese. The name "field" is chosen for the moment because "area" is more easily confused with the geographical scope. It is not called "mission" to differentiate it from the mission of the mother organization, which is not subject to approval by the authority. Also, the statement of the work field tends to be more concrete than a mission.

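
The recoding rules described for `still_active`, the index numbers above 700, and `multiple_index` can be sketched as follows. This is only a minimal illustration on made-up rows, not the cleaning code used in this project: the toy values, the lowercase `index` column name, and the new columns `dereg_after_scrape`, `family_id` and `sibling_no` are all assumptions introduced for the example.

```r
library(dplyr)
library(tidyr)

# Toy rows standing in for the inspected raw data; all values are made up.
ongo_toy <- tibble(
  index          = c(12L, 57L, 58L, 701L),
  still_active   = c(NA, FALSE, TRUE, FALSE),
  multiple_index = c(NA, "100-1", "100-2", NA)
)

ongo_toy_clean <- ongo_toy |>
  mutate(
    # Rows left as NA were never blank in the scrape, so treat them as active
    still_active = replace_na(still_active, TRUE),
    # Hypothetical flag: index numbers above 700 mark ROs that were
    # de-registered between Jan. 2022 and Oct. 2022
    dereg_after_scrape = index > 700
  ) |>
  # Split "100-1" into a family id (100) and a sibling number (1);
  # ONGOs with a single RO keep NA in both hypothetical columns
  separate(
    multiple_index,
    into = c("family_id", "sibling_no"),
    sep = "-", remove = FALSE, convert = TRUE
  )

ongo_toy_clean
```

Keeping `multiple_index` alongside the split columns (`remove = FALSE`) leaves the manually coded value available for checking against the source spreadsheet.
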

`work_field_code1` is the main work field code; `work_field_code2` is an alternative code for the work field if the ONGO's work touches on overlapping fields.

The field categories follow those listed in Article 3 of the law:

> Overseas NGOs may, in accordance with the provisions of this Law, engage in undertakings of benefit to the public in the areas of the economy, education, science, culture, health, sports and environmental protection, as well as in the areas of poverty and disaster relief.

Earlier fields used:

- Commerce and Trade Promotion
- Science and Technology
- Education
- Charity and Public Benefit
- Environment
- Industry Promotion and Self-regulation
- Public Health
- International Exchange
- Multiple Fields (usually grant-making foundations)

# Cross checking of ONGO work field coding

Google translations of the work field statements (disclosed in Chinese) are included. Researcher 1 is the main coder of the work field. When the coding is ambiguous, Researcher 1 marks "TRUE" in the `coding_need_check` document and Researcher 2 cross-checks the coding of the work field.

# Work Field coding notes

**Decided by how the work fields of registered NGO ROs cluster**

- **[Economy and trade]**
- **[Environment]** (including animal protection)
- **[Education]** (including youth development and international cultural exchange)
- **[Charity and humanitarian]** poverty alleviation, disaster relief, etc., including rural development and social work. If rural development is researched at the macroeconomic level, it is coded as science and technology, e.g.
  #375
- **[Science and technology]**
- **[Industry association]**
- **[Health]** public health and health care system support (professional training)
- **[Arts and Culture]** including sports (noted as “sports” in `work_field_code2`) and international culture exchange
- **[General]** usually grant-making organizations, or working on general international communication, including among governments, or working on multiple non-overlapping fields

(Chinese background) (ONGOs set up by Chinese communities abroad or by Chinese entities in foreign jurisdictions). Such ONGOs tend to have strong local connections with their origins. Not sure yet whether this should be kept as a separate variable. Now coded only in `work_field_code2`.

(equality) Only in `work_field_code2`; aims to distinguish rights-based from non-rights-based Charity and humanitarian orgs.

# Coding rules used and notes

1. Education vs. Charity

   If charitable activities are all education related, e.g. building schools, donating education facilities and books, coded as education.

   If the aid provided includes other community support, e.g. building bridges, roads, medical facilities, coded as charity.
2. Economy and trade vs. Industry association

   If the trade promoted is focused on a particular industry (poultry, grains etc.), coded as industry association.
3. Arts and culture vs. education

   Now includes sports organizations too.
4. Industry association vs. health, science and technology, arts and culture

   There are professional associations of health providers and of certain techniques. They are coded as health or science and technology at the moment because they also conduct activities promoting certain techniques, and the nature of their work can be quite different from that of industry associations for grains, wines, pecans etc. They are more of a professional association with specialized expertise.
   The same applies to arts and culture organizations, e.g., a ballet dancer association.

   Exception: a specialized association is coded as an industry association if it only serves its member companies (#91).
5. Health vs. Charity

   If only providing medical aid to underprivileged communities, not building schools etc., coded as health.
6. Charity vs. General

   If the multiple lines of work, e.g., poverty alleviation, education, health, can all be encompassed by Charity and humanitarian, coded as the former; otherwise coded as the latter (e.g.)
7. The International exchange category is dropped because it is absorbed by other fields, i.e., international exchange in education, arts and culture, science and technology.
8. Limitations: The sample/population only covers those INGOs that were successfully registered.
9. Chinese background, equality and sports are coded in `work_field_code2`; we can derive separate variables for them if needed.
10. Multiple ROs are coded based on the overall work area span of the mother organization, not the specific RO, so ensure the work_area is consistent across ROs.

# Clean data

## Province names

```{r province-list}
province_list <- ongo |>
  group_by(province_cn) |>
  summarise(province_cn = first(province_cn)) |>
  drop_na() |> # de-registered ones
  pull(province_cn)

province_list
```

In total, `r length(province_list)` provinces (including provincial-level cities, e.g. Beijing and Shanghai) have ONGO rep offices.

The maximum number of provinces that an RO can possibly operate in is 32.

## English organization names

```{r}
table(is.na(ongo$org_name_en))
```
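
The tabulation above counts how many ROs lack an English organization name. As a minimal sketch of what that check returns, the same idea on a made-up vector is shown here; `org_name_en_toy` and its values are hypothetical and only stand in for `ongo$org_name_en`.

```r
# Made-up names; NA marks an RO without an English organization name.
org_name_en_toy <- c("Example Foundation", NA, "Sample Trust", NA, "Demo Fund")

# TRUE where the English name is missing
missing_en <- is.na(org_name_en_toy)

# Same tabulation as above: counts of FALSE (present) and TRUE (missing)
table(missing_en)

n_missing <- sum(missing_en)       # number of missing names
share_missing <- mean(missing_en)  # proportion of missing names
```

The proportion can be easier to compare than the raw count when the number of ROs changes between scrapes.
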