init tool to upload csv file of kktix ticket #23
Conversation
Force-pushed from e22f363 to fd34e63
Hi @uranusjr, thanks for the suggestion. Here are the pros and cons that come to my mind:

We all want as little manual process as possible. However, readability is also important. My gut feeling is that we may mix an "automatic smart parser" with a mapping table, like: Do you think this is a good idea? Besides, is there any conventional rule for naming the BigQuery column fields when sanitizing data? If you know one, @david30907d @uranusjr, please let me know.
btw @tai271828, what do you think about uploading some personal information (name or email) to BQ as a primary key? We cannot calculate … Also, per discussion, we can set up proper authentication and have volunteers sign the non-disclosure agreement.
I'd just go with a smart-ish parser and keep the mapping table as small as possible, for fields that cannot be easily parsed. All fields listed here are very easily transformable, so no mapping table is needed yet.

```python
import re

def make_field_name_readable(raw: str) -> str:
    # Keep only the part before the first double underscore,
    # then turn the remaining single underscores into spaces.
    match = re.search(r"__", raw)
    if not match:
        return raw
    return raw[:match.start()].replace("_", " ")
```
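For illustration, here is how that snippet behaves on a made-up KKTiX-style column name (the raw names below are hypothetical, not taken from the real export):

```python
# Hypothetical raw column name in the style of a KKTiX CSV export.
print(make_field_name_readable("nickname__how_would_you_like_to_be_called"))
# -> "nickname"

# Names without a double underscore pass through unchanged.
print(make_field_name_readable("paid_date"))
# -> "paid_date"
```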
btw @tai271828 I don't have a convention for column naming but have one for tables.
Interesting, thank you @uranusjr @david30907d for the information. Let me follow up with an update to this pull request then :)
From the maintenance point of view, I would avoid a third-party service (the Data Catalog service in this case) at the very beginning while the team is small. Using a third-party service means 1) communication overhead and 2) the effort of maintaining the service, and we have higher-priority tasks. A third-party service can be reconsidered as a solution when the data volume is large and updates are frequent; otherwise it is overkill in our (current) case.

@uranusjr 's code snippet works like a charm (see the quotation below). I will update my parser based on it. With the "new" parser, the columns look like: It looks much nicer (than my previous dumb code). Some more special characters like … So, I feel like some mapping table is still needed ... let me think about this a bit more. 🤔

Regarding the primary-key topic: @david30907d understood. I will re-upload the raw data. -tai
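A minimal sketch of the "smart parser plus a small mapping table" mix being discussed, reusing make_field_name_readable from the snippet above; the override entries are hypothetical placeholders, not real KKTiX column names:

```python
# Hypothetical overrides for the few fields the automatic parser cannot
# handle, e.g. headers containing special characters. Placeholder entries only.
FIELD_NAME_OVERRIDES = {
    "contact_email_(required!)": "contact email",
}

def sanitize_field_name(raw: str) -> str:
    # Consult the small mapping table first; fall back to the smart parser.
    if raw in FIELD_NAME_OVERRIDES:
        return FIELD_NAME_OVERRIDES[raw]
    return make_field_name_readable(raw)
```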
Please re-upload raw data to provide better primary keys.
```python
if __name__ == "__main__":
    main()
```
Using hashed emails in the end. The data will be uploaded later. See #23 (comment)
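For reference, a minimal sketch of deriving such a pseudonymous key from an email address; the strip/lowercase normalization is an assumption of mine, not something stated in the thread:

```python
import hashlib

def email_to_primary_key(email: str) -> str:
    # Assumed normalization: trim whitespace and lowercase, so the same
    # address always hashes to the same key.
    normalized = email.strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

# e.g. email_to_primary_key("Alice@Example.com")
#   == email_to_primary_key(" alice@example.com ")
```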
@uranusjr @david30907d names like … Update: oh, I notice that BigQuery will convert …
up to you haha 😄
The current parsing result is shown below. Time to decide the mapping table...
2/c
@uranusjr thanks for the suggestion. I had the same thought as well (and implemented it earlier than your comment XD). I currently follow the conventions mentioned in this comment: #23 (comment), e.g. singular nouns, snake_case, etc.
Currently it looks like this:
@uranusjr forgive my ignorance, what is 2/c?
https://en.wikipedia.org/wiki/My_two_cents
They are … Let me know if any names would be easier for you to understand. The goal is to make the fields as self-descriptive as possible. Thanks!
Force-pushed from fd34e63 to 9f2a083
This pull request is ready for review again. The pre-processed column names are shown below [1]. Additionally, the revised/enhanced pull request includes the following changes:

[1]
Besides, … The reason is that the pull request is growing too much and, my two cents, we should pause here a bit to make sure we all agree on how we upload the data. I will make the following changes/enhancements when I am ready to upload all the data I have from the past years, to make sure the column names of the tables are all consistent.
Force-pushed from 9f2a083 to d395a2a

Force-pushed from d395a2a to fcc8dd5
The tool uploads the ticket information exported from KKTiX to BigQuery, and pre-processes the "raw" data so that the table column names are more BigQuery-friendly. Uploading data by default is dangerous, so we use dry-run mode by default. We would like the column names to be as consistent as possible across years, so we use some heuristics. We may need to maintain the heuristics annually; luckily, and ideally, the annual maintenance will be one-off.
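A minimal sketch of how the dry-run-by-default CLI could be wired up; the `-p`/`-d`/`-t`/`--upload` flags follow the usage shown under "Steps to Test This Pull Request" below, while the long option names and help texts are my own guesses:

```python
import argparse

def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(
        description="Upload a KKTiX ticket CSV to BigQuery (dry run by default)."
    )
    parser.add_argument("csv_file", help="Ticket CSV exported from KKTiX.")
    parser.add_argument("-p", "--project-id", required=True)
    parser.add_argument("-d", "--dataset-name", required=True)
    parser.add_argument("-t", "--table-name", required=True)
    parser.add_argument(
        "--upload",
        action="store_true",
        help="Actually upload; when omitted, only show the sanitized columns.",
    )
    return parser.parse_args()

def main() -> None:
    args = parse_args()
    # ... parse the CSV and sanitize the column names here ...
    if args.upload:
        pass  # perform the real BigQuery load job
    else:
        print("Dry run: no data uploaded.")
```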
Force-pushed from fcc8dd5 to 6dc9d57
utACK, btw it might be good to document this up~
hope the 2022 rookies can run this tool by themselves


Types of changes
Description
When uploading the CSV file of KKTiX tickets, the column field names parsed automatically by BigQuery are very ugly, with a lot of underscores. This tool not only helps users upload the CSV file, but also automatically handles and sanitizes the column field names. See the attached images for the "as-is" and "to-be" results.
As-is (manually uploaded using the BigQuery dashboard UI)

Steps to Test This Pull Request
```
./upload-kktix-ticket-csv-to-bigquery.py ticket.csv -p bigquery-project-id -d dataset-name -t table-name
```

Expected behavior

The column names of the data pending upload will be shown. If the `--upload` argument is appended, a BigQuery table named after `table-name` is created under the `dataset-name` of `bigquery-project-id`.
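For readers unfamiliar with the upload step, here is a minimal sketch of what `--upload` might do under the hood with the google-cloud-bigquery client; the `upload_csv` helper and the `autodetect=True` choice are illustrative assumptions, not necessarily what the script does:

```python
from google.cloud import bigquery

def upload_csv(csv_path: str, project: str, dataset: str, table: str) -> None:
    client = bigquery.Client(project=project)
    table_id = f"{project}.{dataset}.{table}"
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,  # skip the (already sanitized) header row
        autodetect=True,      # let BigQuery infer the schema
    )
    with open(csv_path, "rb") as fh:
        job = client.load_table_from_file(fh, table_id, job_config=job_config)
    job.result()  # block until the load job completes
```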
Related Issue

#5
Additional context
Internal Trello card status of the pycontw organization team:
https://trello.com/c/yRGq1sZ3/11-kktix-%E7%9A%84%E8%B3%87%E6%96%99%EF%BC%8C%E7%94%A8-airflow-%E5%AF%A6%E4%BD%9C-etl-%E4%B8%A6%E5%AD%98%E9%80%B2-bigquery