vaers_flatfile_build.py
run_all() ...
validate_dirs_and_files() ...
136 drops in input to process
First (oldest) input: ../Download/ALL_VAERS_DROPS/2020-12-18_VAERS_CSV.zip
Last (newest) input: ../Download/ALL_VAERS_DROPS/2023-07-28_AllVAERSDataCSVS.zip
Already processed files do appear in vaers_changes and the latest will be built upon:
vaers_changes/2020-12-18_VAERS_CHANGES.csv
vaers_changes/2020-12-25_VAERS_CHANGES.csv
vaers_changes/2021-01-08_VAERS_CHANGES.csv
vaers_changes/2021-01-15_VAERS_CHANGES.csv
vaers_changes/2021-01-22_VAERS_CHANGES.csv
... 135 total
= = = = = = = = = = = = = = = = = = = = = = = = = = = = = =
Next date 2023-07-28
= = = = = = = = = = = = = = = = = = = = = = = = = = = = = =
unzip ../Download/ALL_VAERS_DROPS/2023-07-28_AllVAERSDataCSVS.zip
Creating in ./vaers_working/ date marker file 2023-07-28
Consolidation
Concatenating files, *VAERSDATA.csv, *VAERSVAX.csv, *VAERSSYMPTOMS.csv
open vaers_working\2020VAERSDATA.csv ... Highest VAERS_ID 918561
open vaers_working\2021VAERSDATA.csv ... Highest VAERS_ID 2662484
open vaers_working\2022VAERSDATA.csv ... Highest VAERS_ID 2659681
open vaers_working\2023VAERSDATA.csv ... Highest VAERS_ID 2662342
open vaers_working\NonDomesticVAERSDATA.csv ... Highest VAERS_ID 2662279
open vaers_working\2020VAERSVAX.csv ... Highest VAERS_ID 918561
open vaers_working\2021VAERSVAX.csv ... Highest VAERS_ID 2662484
open vaers_working\2022VAERSVAX.csv ... Highest VAERS_ID 2659681
open vaers_working\2023VAERSVAX.csv ... Highest VAERS_ID 2662342
open vaers_working\NonDomesticVAERSVAX.csv ... Highest VAERS_ID 2662279
4728 exact duplicates dropped in concatenated files, now 2005713 rows
open vaers_working\2020VAERSSYMPTOMS.csv ... Highest VAERS_ID 918561
open vaers_working\2021VAERSSYMPTOMS.csv ... Highest VAERS_ID 2662484
open vaers_working\2022VAERSSYMPTOMS.csv ... Highest VAERS_ID 2659681
open vaers_working\2023VAERSSYMPTOMS.csv ... Highest VAERS_ID 2662342
open vaers_working\NonDomesticVAERSSYMPTOMS.csv ... Highest VAERS_ID 2662279
133358 records removed prior to the first covid report (covid_earliest_vaers_id 896636)
1703131 reports to work with (unique VAERS_IDs)
week_vids_present [896750, 896754, 896765, 896766, 896897, 896637, 896638, 896639 ... 2662220, 2662221, 2662225, 2662228, 2662229, 2662264, 2662278, 2662279]
hi_all_never_published 2659096 lo_ever 896637 hi_this_week 2662484
list_range_all_ever [896637, 896638, 896639, 896640, 896641, 896642, 896643, 896644 ... 2662477, 2662478, 2662479, 2662480, 2662481, 2662482, 2662483, 2662484]
list_range_week_only [2659097, 2659098, 2659099, 2659100, 2659101, 2659102, 2659103, 2659104 ... 2662477, 2662478, 2662479, 2662480, 2662481, 2662482, 2662483, 2662484]
gaps_filled [2658305, 2658310, 2658311, 2658313, 2658360, 2658363, 2658365, 2658368 ... 2658228, 2658247, 2656219, 2658270, 2658286, 2658290, 2658296, 2658300]
week_gaps_new [2659184, 2659187, 2659237, 2659269, 2659283, 2659291, 2659292, 2659326 ... 2662476, 2662477, 2662478, 2662479, 2662480, 2662481, 2662482, 2662483]
remedied_past_all_never_published [896713, 896742, 896875, 896892, 896899, 897093, 897105, 897169 ... 2659036, 2659042, 2659045, 2659082, 2659083, 2659084, 2659085, 2659096]
all_never_published [896713, 896742, 896875, 896892, 896899, 897093, 897105, 897169 ... 2662476, 2662477, 2662478, 2662479, 2662480, 2662481, 2662482, 2662483]
VAERS_IDs 896637 to 2662484 expected 1765848 all_ever 1735509 gaps 30371
123716 dropped in df_data due to no covid VAX_TYPE involved in the report
155916 dropped in df_vax due to no covid VAX_TYPE involved in the report
153045 dropped in df_syms due to no covid VAX_TYPE involved in the report
1579415 covid reports to work with
Repeat sentence removal in SYMPTOM_TEXT, showing each next larger if any (takes time)
1084 SYMPTOM_TEXT field repeat sentences deduped in 161 reports, max difference 6192 bytes in VAERS_ID 1645697
Shortening some field values in VAX_NAME, VAX_MANU
Merging DATA into VAX
1662525 rows in df_data_vax
Aggregating symptoms into symptom_entries string, new column
Combining symptoms column items. Grouping by VAERS_ID ...
Appending each symptom in new column called symptom_entries
Cleaning multiple delimiters due to empty columns
Merging symptom_entries into df_data_vax
1662525 rows in df_data_vax_syms_consolidated
Saving result into one file: vaers_consolidated/2023-07-28_VAERS_CONSOLIDATED.csv
Consolidation of 2023-07-28 done
Flattening
Aggregate/flatten VAX items. Grouping by VAERS_ID
1579415 rows in df_vax_flat
Merging DATA into VAX flattened
1579415 rows in df_data_vax_flat
Merging symptom_entries into df_data_vax_syms_flat
Saving result into one file: vaers_flattened/2023-07-28_VAERS_FLATTENED.csv
1579415 rows in vaers_flattened/2023-07-28_VAERS_FLATTENED.csv
Flattening of 2023-07-28 done
open vaers_flattened/2023-07-21_VAERS_FLATTENED.csv ... Highest VAERS_ID 2659098
Using flat 2023-07-28 already in memory, 1579415 rows
Previous changes file for changes, cell_edits and status columns
open vaers_changes/2023-07-21_VAERS_CHANGES.csv ... Highest VAERS_ID 2659098
Comparing 2023-07-21 v. 2023-07-28
1579415 this drop total covid
1576796 previous total covid
1576722 identical set aside
2693 this drop to work with
74 previous to work with
2619 difference
2657 new in 2023-07-28
0 delayed this week
38 deleted this week kept
0 restored this week
Column value changes
DATEDIED 2 cells of trivial non-letter differences ignored
1 duplicate dropped in df_three_columns
2 DIED [] <> Y [1710283, 1826734]
OTHER_MEDS 2422423 [] <> HUMIRA
OTHER_MEDS 2422315 [] <> RINVOQ
RECOVD 1826734 Y <> N
SYMPTOM_TEXT 1786747 Japan JCS <> CS
SYMPTOM_TEXT 2415604 white not hispanic or latino <>
SYMPTOM_TEXT 2418151 AbbVie <> Company
SYMPTOM_TEXT 1820251 Double North Yilan redacted <>
SYMPTOM_TEXT 1824986 at PRIVACY <>
SYMPTOM_TEXT 1807044 PRIVACY <> a
SYMPTOM_TEXT 2053885 RIDOH <> DOH
SYMPTOM_TEXT 1492532 Pfizer <>
SYMPTOM_TEXT 1799918 PRIVACY On 03Oct2021 NICU <> In Oct2021 ICU
SYMPTOM_TEXT 1469113 Pfizer <>
SYMPTOM_TEXT 1521491 PRIVACY <>
SYMPTOM_TEXT 1795624 PGS Puurs NTM <> Regulatory Authority RA
SYMPTOM_TEXT 1823266 Taiwan north Ilan <>
SYMPTOM_TEXT 2418170 ChemoCentryx <> Company
SYMPTOM_TEXT 2440026 Phase 2/3 to Evaluate the Immunogenicity and Safety of mRNA Vaccine Boosters for SARS-CoV-2 Variants mRNA-1273-P205 <>
SYMPTOM_TEXT 2422423 AbbVie <> Company
SYMPTOM_TEXT 2422279 003110 <>
SYMPTOM_TEXT 1653431 of unspecified race and ethnicity unknown <>
SYMPTOM_TEXT 1782153 COVAES <> online portal
SYMPTOM_TEXT 1511185 COVID-19 Adverse Event Self-Reporting Solution <>
SYMPTOM_TEXT 1634759 of Cardiology <>
2 duplicates dropped in df_three_columns
3 VAX_DOSE_SERIES 2|UNK <> 2 [1479303, 1761291, 1803025]
VAX_DOSE_SERIES 2422423 3|UNK <> 3
VAX_DOSE_SERIES 2422315 UNK|UNK|UNK <> UNK|UNK
VAX_LOT 3 cells of trivial non-letter differences ignored
VAX_LOT 1803025 FC5435|Unknown <> FC5435
VAX_LOT 2422315 ||1160069 <> |
3 duplicates dropped in df_three_columns
4 VAX_MANU Pfizer-BionT|Unknown <> Pfizer-BionT [1479303, 1761291, 1803025, 2422423]
VAX_MANU 2422315 Pfizer-BionT|Unknown|Unknown <> Pfizer-BionT|Unknown
3 duplicates dropped in df_three_columns
VAX_NAME 2422315 UNKNOWN|Not Specified NO BRAND NAME <> UNKNOWN
4 VAX_NAME C19 Pfizer-BionT|Not Specified NO BRAND NAME <> C19 Pfizer-BionT [1479303, 1761291, 1803025, 2422423]
VAX_ROUTE 1 cell of trivial non-letter differences ignored
2 duplicates dropped in df_three_columns
3 VAX_ROUTE OT|OT <> OT [1479303, 1803025, 2422423]
VAX_ROUTE 2422315 OT||OT <> OT|
VAX_SITE 5 cells of trivial non-letter differences ignored
3 duplicates dropped in df_three_columns
VAX_TYPE 2422315 COVID19|COVID19|UNK <> COVID19|COVID19
4 VAX_TYPE COVID19|UNK <> COVID19 [1479303, 1761291, 1803025, 2422423]
symptom_entries 1473257 _|_Immune thrombocytopenia_|_ <>
symptom_entries 1291690 _|_Deep vein thrombosis_|_ <>
symptom_entries 1397475 _|_Postmenopausal haemorrhage_|_Transient ischaemic attack_|_ <>
symptom_entries 1472272 _|_Maternal exposure during breast feeding_|_ <>
symptom_entries 1472117 _|_Maternal exposure during breast feeding_|_ <>
symptom_entries 1472621 _|_Irritability_|_ <>
symptom_entries 1462798 _|_Bradycardia_|_ <>
symptom_entries 1465031 _|_Foetal warfarin syndrome_|_ <>
13 columns altered
31153 modified reports on 2023-07-28
Writing ... vaers_changes/2023-07-28_VAERS_CHANGES.csv
1 report with the most (18) records/lots/doses: 1900339
1 comparison done
Doing stats
open stats.csv ... ok
column changes: {'SYMPTOM_TEXT': 22, 'symptom_entries': 8, 'VAX_DOSE_SERIES': 5, 'VAX_MANU': 5, 'VAX_TYPE': 5, 'VAX_NAME': 5, 'VAX_ROUTE': 4, 'OTHER_MEDS': 2, 'VAX_LOT': 2, 'DIED': 2, 'RECOVD': 1, 'AGE_YRS': 0, 'VAX_DATE': 0, 'V_FUNDBY': 0, 'RPT_DATE': 0, 'V_ADMINBY': 0, 'CAGE_MO': 0, 'X_STAY': 0, 'ONSET_DATE': 0, 'PRIOR_VAX': 0, 'SPLTTYPE': 0, 'ER_ED_VISIT': 0, 'NUMDAYS': 0, 'FORM_VERS': 0, 'BIRTH_DEFECT': 0, 'STATE': 0, 'LAB_DATA': 0, 'HOSPITAL': 0, 'CAGE_YR': 0, 'ER_VISIT': 0, 'DISABLE': 0, 'L_THREAT': 0, 'HISTORY': 0, 'ALLERGIES': 0, 'TODAYS_DATE': 0, 'CUR_ILL': 0, 'VAX_SITE': 0, 'SEX': 0, 'DATEDIED': 0, 'RECVDATE': 0, 'HOSPDAYS': 0, 'OFC_VISIT': 0}
This week
0 delayed/late/gapfill
38 deleted
0 restored
11 cell edits trivial not printed
61 cell edits significant
0 cells emptied entirely
22 writeups changed
All time
542230 delayed/late/gapfill
31169 deleted
14 restored
29372228 cell edits trivial not printed
30588 cell edits significant
1475720 cells emptied entirely
7396 writeups changed
30371 never published [896713, 896742, 896875, 896892, 896899, 897093, 897105, 897169 ... 2662476, 2662477, 2662478, 2662479, 2662480, 2662481, 2662482, 2662483]
16 reports cleared of duplicate sentences within them
This week 0 hr 30.2 min
Overall 0 hr 30.3 min
No more to do, last set 2023-07-28 >= 2023-07-28 done
Done with vaers_flatfile_build.py at line 2335, clock time 2023-08-04 18:44:52.313826
- - - - - - - - - - - - - - - - - - - - - - - -