vaers_flatfile_build.py

run_all() ...
validate_dirs_and_files() ...

1 drops in input to process
First (oldest) input: ../Download/ALL_VAERS_DROPS/2020-12-18_VAERS_CSV.zip
Last (newest) input: ../Download/ALL_VAERS_DROPS/2023-09-29_AllVAERSDataCSVS.zip

Already processed files do appear in vaers_changes and the latest will be built upon:
    vaers_changes/2020-12-18_VAERS_CHANGES.csv
    vaers_changes/2020-12-25_VAERS_CHANGES.csv
    vaers_changes/2021-01-08_VAERS_CHANGES.csv
    vaers_changes/2021-01-15_VAERS_CHANGES.csv
    vaers_changes/2021-01-22_VAERS_CHANGES.csv
    ... 144 total

= = = = = = = = = = = = = = = = = = = = = = = = = = = = = =
Next date 2023-09-29
= = = = = = = = = = = = = = = = = = = = = = = = = = = = = =

unzip ../Download/ALL_VAERS_DROPS/2023-09-29_AllVAERSDataCSVS.zip
Creating in ./vaers_working/ date marker file 2023-09-29

Consolidation
Concatenating files, *VAERSDATA.csv, *VAERSVAX.csv, *VAERSSYMPTOMS.csv
open vaers_working\2020VAERSDATA.csv ... Highest VAERS_ID 2679890
open vaers_working\2021VAERSDATA.csv ... Highest VAERS_ID 2681134
open vaers_working\2022VAERSDATA.csv ... Highest VAERS_ID 2678082
open vaers_working\2023VAERSDATA.csv ... Highest VAERS_ID 2688373
open vaers_working\NonDomesticVAERSDATA.csv ... Highest VAERS_ID 2688370
open vaers_working\2020VAERSVAX.csv ... Highest VAERS_ID 2679890
open vaers_working\2021VAERSVAX.csv ... Highest VAERS_ID 2681134
open vaers_working\2022VAERSVAX.csv ... Highest VAERS_ID 2678082
open vaers_working\2023VAERSVAX.csv ... Highest VAERS_ID 2688373
open vaers_working\NonDomesticVAERSVAX.csv ... Highest VAERS_ID 2688370
4808 exact duplicates dropped in concatenated files, now 2035299 rows
open vaers_working\2020VAERSSYMPTOMS.csv ... Highest VAERS_ID 2679890
open vaers_working\2021VAERSSYMPTOMS.csv ... Highest VAERS_ID 2681134
open vaers_working\2022VAERSSYMPTOMS.csv ... Highest VAERS_ID 2678082
open vaers_working\2023VAERSSYMPTOMS.csv ... Highest VAERS_ID 2688373
open vaers_working\NonDomesticVAERSSYMPTOMS.csv ... Highest VAERS_ID 2688370
133356 records removed prior to the first covid report (covid_earliest_vaers_id 896636)
1728267 reports to work with (unique VAERS_IDs)

lo_ever 896636
hi_all_never_published 2684987
hi_this_week 2688373
week_vids_present [2679890, 896750, 896754, 896765, 896766, 896897, 896637, 896638 ... 2687807, 2687808, 2688307, 2688311, 2688314, 2688315, 2688317, 2688370]
list_range_all_ever [896636, 896637, 896638, 896639, 896640, 896641, 896642, 896643 ... 2688366, 2688367, 2688368, 2688369, 2688370, 2688371, 2688372, 2688373]
list_range_week_only [2684988, 2684989, 2684990, 2684991, 2684992, 2684993, 2684994, 2684995 ... 2688366, 2688367, 2688368, 2688369, 2688370, 2688371, 2688372, 2688373]
gaps_filled [2683894, 2683903, 2684009, 2684384, 2684389, 2684390, 2684391, 2684392 ... 2684951, 2684960, 2684963, 2684968, 2684980, 2684983, 2684986, 2684987]
week_gaps_new [2685015, 2685017, 2685046, 2685154, 2685278, 2685310, 2685311, 2685346 ... 2688308, 2688309, 2688310, 2688312, 2688313, 2688316, 2688318, 2688319]
remedied_past_all_never_published [896713, 896742, 896875, 896892, 896899, 897093, 897105, 897169 ... 2684876, 2684922, 2684923, 2684969, 2684970, 2684972, 2684973, 2684976]
all_never_published [896713, 896742, 896875, 896892, 896899, 897093, 897105, 897169 ... 2688308, 2688309, 2688310, 2688312, 2688313, 2688316, 2688318, 2688319]
VAERS_IDs 896637 to 2688373 expected 1791738 all_ever 1760946 gaps 30823

131289 dropped in df_data due to no covid VAX_TYPE involved in the report
166273 dropped in df_vax due to no covid VAX_TYPE involved in the report
162407 dropped in df_syms due to no covid VAX_TYPE involved in the report
1596978 covid reports to work with
Repeat sentence removal in SYMPTOM_TEXT, showing each next larger if any (takes time)
372 SYMPTOM_TEXT field repeat sentences deduped in 82 reports, max difference 6192 bytes in VAERS_ID 1645697
Shortening some field values in VAX_NAME, VAX_MANU
Merging DATA into VAX
1681756 rows in df_data_vax
Aggregating symptoms into symptom_entries string, new column
Combining symptoms column items. Grouping by VAERS_ID ...
Appending each symptom in new column called symptom_entries
Cleaning multiple delimiters due to empty columns
Merging symptom_entries into df_data_vax
1681756 rows in df_data_vax_syms_consolidated
Saving result into one file: vaers_consolidated/2023-09-29_VAERS_CONSOLIDATED.csv
Consolidation of 2023-09-29 done

Flattening
Aggregate/flatten VAX items. Grouping by VAERS_ID
1596978 rows in df_vax_flat
Merging DATA into VAX flattened
1596978 rows in df_data_vax_flat
Merging symptom_entries into df_data_vax_syms_flat
Saving result into one file: vaers_flattened/2023-09-29_VAERS_FLATTENED.csv
1596978 rows in vaers_flattened/2023-09-29_VAERS_FLATTENED.csv
Flattening of 2023-09-29 done

open vaers_flattened/2023-09-22_VAERS_FLATTENED.csv ... Highest VAERS_ID 2684990
Using flat 2023-09-29 already in memory, 1596978 rows
Previous changes file for changes, cell_edits and status columns
open vaers_changes/2023-09-22_VAERS_CHANGES.csv ... Highest VAERS_ID 2684990

Comparing 2023-09-22 v. 2023-09-29
1596978 this drop total covid
1594996 previous total covid
1594961 identical set aside
2017 this drop to work with
35 previous to work with
1982 difference
2016 new in 2023-09-29
0 delayed this week
34 deleted this week kept
0 restored this week
Column value changes
SYMPTOM_TEXT 1094447 St Luke's <> hospital Hospital
1 column altered
31372 modified reports on 2023-09-29
Writing ... vaers_changes/2023-09-29_VAERS_CHANGES.csv
1 report with the most (18) records/lots/doses: 1900339
1 comparison done

Doing stats
open stats.csv ... ok
column changes: {'SYMPTOM_TEXT': 1, 'TODAYS_DATE': 0, 'BIRTH_DEFECT': 0, 'DIED': 0, 'ER_VISIT': 0, 'VAX_NAME': 0, 'VAX_DOSE_SERIES': 0, 'VAX_TYPE': 0, 'VAX_LOT': 0, 'DISABLE': 0, 'HISTORY': 0, 'RPT_DATE': 0, 'STATE': 0, 'SPLTTYPE': 0, 'ER_ED_VISIT': 0, 'VAX_SITE': 0, 'PRIOR_VAX': 0, 'HOSPDAYS': 0, 'OFC_VISIT': 0, 'VAX_MANU': 0, 'SEX': 0, 'X_STAY': 0, 'FORM_VERS': 0, 'CAGE_YR': 0, 'VAX_DATE': 0, 'VAX_ROUTE': 0, 'CAGE_MO': 0, 'LAB_DATA': 0, 'OTHER_MEDS': 0, 'CUR_ILL': 0, 'V_ADMINBY': 0, 'RECVDATE': 0, 'RECOVD': 0, 'L_THREAT': 0, 'ONSET_DATE': 0, 'NUMDAYS': 0, 'DATEDIED': 0, 'HOSPITAL': 0, 'AGE_YRS': 0, 'ALLERGIES': 0, 'V_FUNDBY': 0, 'symptom_entries': 0}

This week
0 delayed/late/gapfill
34 deleted
0 restored
0 cell edits trivial not printed
1 cell edits significant
0 cells emptied entirely
1 writeups changed

All time
542236 delayed/late/gapfill
31421 deleted
16 restored
29372270 cell edits trivial not printed
30956 cell edits significant
1475724 cells emptied entirely
7587 writeups changed
30823 never published [896713, 896742, 896875, 896892, 896899, 897093, 897105, 897169 ... 2688308, 2688309, 2688310, 2688312, 2688313, 2688316, 2688318, 2688319]
20 reports cleared of duplicate sentences within them

0 hr 27.0 min This week None
0 hr 27.0 min Overall None
Saving vaers_changes/2023-09-29_VAERS_CHANGES_A.csv, 1048575 rows and vaers_changes/2023-09-29_VAERS_CHANGES_B.csv, 579808 rows
No more to do, last set 2023-09-29 >= 2023-09-29 done
0 hr 28.9 min
Done with vaers_flatfile_build.py at line 2375, clock time 2023-10-07 11:19:01.484641
- - - - - - - - - - - - - - - - - - - - - - - -