#!python
'''
TODO: NL 1702590 Gary Ha to Albert
    https://medalerts.org/vaersdb/findfield.php?IDNUMBER=1702590&WAYBACKHISTORY=ON
    Thank you for calling my attention to this; it's a flaw in my flat file I need to address.
    Deleted before the purge. Restored this week. My record of those events is there, but the
    purged info is nowhere to be found, as you're pointing out. The best way to cope with it
    isn't immediately apparent to me, in case any suggestions come to mind. I'll sleep on it.
    Good catch. /hawk

Always now says 1 comparison done.
Make cumulative total: 16 reports cleared of duplicate sentences within them.
OpenShot video editor: what Eagle uses.

Contacts:
    John Findley 832 709 6407 June 12, 2023
    katlindley@protonmail.com
    peter.halligan58@gmail.com
    Amy Kelly replied to my email to Naomi Wolf.

Uncover unknown VAX_TYPE covids: known lots, Novavax in SPLTTYPE, Moderna in writeup, Pfizer, Astra, Comirnaty covid-19 vaccine.

Number doesn't look right: 6 modified reports on 2021-01-22, trivial or what???

Says 9 edits but changes aren't all there.
    1963163 9 2022-01-21 SYMPTOM_TEXT: AND <>

Says 3 but only 2
    903134 3 2021-01-01 SPLTTYPE: [] <> VSafe 2022-12-23 SYMPTOM_TEXT: P R V S Y M <> []

7 edits but shows only 1 ...
    1457073 COVID19 Pfizer-BionT en6198 2 SYR RA C19 Pfizer-BionT 07/08/2021 KS 59 59 F Rash look like shingles, spread across back, chest, and right arm Y 03/30/2021 03/30/2021 0 no OTH synthroid no no Vsafe 2 07/08/2021 Y no _|_Rash_|_ 7 2021-07-23 SYMPTOM_TEXT: look <> looked

5 edits but changes blank ...
    909274 COVID19 Pfizer-BionT EK5730 1 IM LA C19 Pfizer-BionT 12/26/2020 NY 50 50 F Patient developed symptoms early the next morning after the Covid vaccine. Started with fever 102.8 and body aches. had fever for 5 days. Extreme low energy for 8 days and continuing. Headaches everyday. cough for the last 2 days. no SOB, no N/V D, no Sore throat, no nasal congestion. no loss of taste and smell. not pregnant. N 12/18/2020 12/19/2020 1 Covid test on 12/21/2020- test negative. Think of having repeat Covid test and get a influenza test. PVT hydroxychloroquine 200mg bid none Rheumatoid arthritis VSafe 2 12/26/2020 none _|_Asthenia_|_Cough_|_Headache_|_Pain_|_Pyrexia_|_SARS-CoV-2 test negative_|_ 5

4 edits showing only 1
    1378547 COVID19 Pfizer-BionT EN6207 1 SYR LA C19 Pfizer-BionT 06/07/2021 CO 73 73 F 3/18 vaccination *In Jan 2021, I never had varicose veins, I was standing up more and ended up with legs hurting me. After having the vaccinations, the veins were more painful on the R side, intense. I am supposed to see a vein specialist. When standing still, it makes them worse. When I don't walk, they hurt more. *My NP gave me a referral. I haven't made appt yet, it has been a week. I had a couple days where I didn't have pain but then we had bad weather, I wasn't walking and the pain started again. She listed 2-3 places.
N 03/18/2021 03/18/2021 0 PVT deplin 15, hyosycamine .125 mg 3x day, ametal 2x daily, mexium 40 mg 2x daily, arethostatin, promarin, 0.625 2x weekly, Synthroid 25 units 1 tablet weekly, 2 tablets weekends; aspirin, glucosamine, vit E, zyrtec, probiotics, vit B complex, sleep apnea, stenosis in aortic valve, GERD, osteoarthritis, bi-polar, high cholesterol VSAFE 2 06/07/2021 onions _|_Pain in extremity_|_Varicose vein_|_Vascular pain_|_ 4 2021-07-09 OTHER_MEDS: vit complex <> Vitamin c x edits blank changes 902757 902988 903022 903028 903046 903094 903193 903195 903634 903650 903734 903769 903803 903913 904238 904271 904459 904508 904996 907873 908008 908863 908943 909274 910073 910159 910238 910668 910713 910829 919053 951153 952852 953671 953720 1001094 1005476 1012749 1046239 1046307 1046773 1238060 1238792 1238887 1247766 1247962 1251010 1256906 1258737 1259209 1259409 1263617 1263944 1266655 1274134 1283101 1289747 1293164 1293291 1296328 1296427 1307402 1311329 1317694 1317790 1320787 1326571 1326863 1333434 1334024 1336486 1336764 1337035 1337311 1337553 1347087 1347507 1347590 1347648 1350534 1350760 1351575 1358086 1358454 1358558 1358614 1358632 1358690 1358791 1360896 1361837 1365021 1365241 1371759 1374658 1377794 1378382 1381912 1381926 1384919 1386007 1386291 1392591 1398907 1402621 1407712 1443122 1456996 1457465 1497987 TODO: Verify 1204337 fixed. Example of differences vs. medalerts.org: Medalerts is showing only a change between 4/23/2021 and 4/22/2022 a year later ... https://medalerts.org/vaersdb/findfield.php?IDNUMBER=1204337&WAYBACKHISTORY=ON I'm seeing a big change between these two: 2021-06-25_VAERS_CONSOLIDATED.csv:1204337,COVID19,JANSSEN,20210413997,1,IM,LA,COVID19 (COVID19 (JANSSEN)),04/13/2021,NM,63.0,63.0,,F,,"Hello, I received my J&J vaccine on 3/31/2021. 11:00a.m. Here is my results: 3/31/21 Day1: Waited 30 mins at site due to the history of Penicillin and Sulfa issues in the past. I had a little itching at the shot site. That went away in about 15 mins. Late afternoon I had a slight headache. 4/1/21 Day 2: Morning was normal. Afternoon I started feeling a little sluggish. I had a little headache. 4/2/21 Day 3: Morning was slow. I noticed that I was tired. Headache was stronger. I took Ibuprofen 400mg. By mid day, I was on the couch the rest of the day. My body was telling me to rest. So I did. 4/2/21 Day 3: Morning was slow. 4/3/21 Day 4: Morning was extremely slow. I did force myself to get up at the normal time, but I didn't want to. Headache was still present. Body sluggish. I ran an errand for about an hour, and when I returned, I was exhausted, so I took a 3 hour nap. Headache all day. Took Ibuprofen 400mg. 4/4/21 Day 5: Headache had gone away. Today I noticed that I had a rash on my mid to lower back. We put some cortisone cream on it, and it helped. I have clear skin and don't normally have any reactions to anything. Very tired 4/5/21 Day 6: Feeling pretty good today. The rash is still there and itching, but the cream is helping. I feel like I am 80% back to my normal self. 4/6/21 Day 7: Almost back to normal. Rash has almost cleared up, and I have my energy back to about 90%. Still a very slight headache. 4/7/21 Day 8: Still having a slight headache. Skipped from 4/7-13th/21 - Doing much better. Still having a slight headache here and there. 
Noticing that I am very tired in the afternoon, and making myself take a nap.",,,,,,,,,N,03/31/2021,04/02/2021,2.0,None,OTH,,"Symbicort 80/4.5, albuterol Sulfate 90mcg, and the following Vitamins: D, C, Gluconate, Probiotic, Super enzymes, Q10, Super B, Multi-Vitamin.",None,Fibromyalgia since 2009,,,2,04/13/2021,,,,Sulfa and Penicillin,_|_Asthenia_|_Fatigue_|_Headache_|_Injection site pruritus_|_Mobility decreased_|_Rash_|_Rash pruritic_|_Sluggishness_|_ 2021-07-02_VAERS_CONSOLIDATED.csv:1204337,COVID19,JANSSEN,20210413997,1,IM,LA,COVID19 (COVID19 (JANSSEN)),04/13/2021,NM,63.0,63.0,,F,,"Hello, I received my J&J vaccine on 3/31/2021. 11:00a.m. Here is my results: 3/31/21 Day1: Waited 30 mins at site due to the history of Penicillin and Sulfa issues in the past. I had a little itching at the shot site. That went away in about 15 mins. Late afternoon I had a slight headache. 4/1/21 Day 2: Morning was normal. Afternoon I started feeling a little sluggish. I had a little headache. 4/2/21 Day 3: Morning was slow. I noticed that I was tired. Headache was stronger. I took Ibuprofen 400mg. By mid day, I was on the couch the rest of the day. My body was telling me to rest. So I did.",,,,,,,,,N,03/31/2021,04/02/2021,2.0,None,OTH,,"Symbicort 80/4.5, albuterol Sulfate 90mcg, and the following V",None,Fibromyalgia since 2009,,,2,04/13/2021,,,,Sulfa and Penicillin,_|_Asthenia_|_Fatigue_|_Headache_|_Injection site pruritus_|_Mobility decreased_|_Rash_|_Rash pruritic_|_Sluggishness_|_ What I'm seeing is that on 2021-07-02 they deleted: itamins: D, C, Gluconate, Probiotic, Super enzymes, Q10, Super B, Multi-Vitamin. Theirs says it was added on 4/22/2022 and that's it (except for other changes in Write-up, this applies to Other Medications) My code appears to be doing the right thing: OTHER_MEDS 1204337 vitamins d c gluconate probiotic super enzymes q10 super b multi vitamin -> v Use known lot codes to uncover additional covid involved in some reports. The following has a copy, see that spot for how-to: A way to make unique words of SYMPTOM_TEXT for searching known lot codes for reports flying under the radar from improperly not labeled covid In string compare, if 'a' is a substring of b, drop all in b except for a couple of words plus 'a', context. Only merge those not already in flat file, new reports, pull past from already processed, speed. vaers_id highest from previous results being the starting point from working dir In writeup, reference: Attack on fertility by symptom counts: https://welcometheeagle.substack.com/p/vaers-nothing-fancy-just-the-cold Naomi female issue numbers: https://dailyclout.io/dr-naomi-wolf-uncovers-pfizers-depopulation-agenda-as-evidenced-by-its-own-documents/ Helpful in any sport to confuse one’s opponent. To be able to mine VAERS data more efficiently and effectively, flat file containing everything with all changes and keeping deleted, by processing week-over-week. Iterate through CDC VAERS weekly data drops starting with oldest like 2020-12-18 and build a complete set, keeping changes. The first 5 dates before 2021-01-29 were a gap obtained from . Note their dates are the week prior to public drop, CDC delays one week (set to nearest Friday to be consistent in filenames). Output: Saved to a dir (directory) like ... vaers_flatfile Steps: 1. Consolidate to one file. This also contains SYMPTOMS file entries flattened. 2. Flatten VAX file also to one file (now all one row per VAERS_ID). 3. Operate on flattened, recording changes in 'changes' column. 
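
A minimal sketch of the week-over-week comparison idea in step 3 (illustrative only; the filenames
and variable names here are hypothetical, and the real logic lives in the functions further below):

    prev = pd.read_csv('vaers_flattened/2021-01-01_FLATTENED.csv').set_index('VAERS_ID')
    curr = pd.read_csv('vaers_flattened/2021-01-08_FLATTENED.csv').set_index('VAERS_ID')
    both = prev.index.intersection(curr.index)   # deleted = in prev only, gapfill/new = in curr only
    for col in prev.columns:
        changed = prev.loc[both, col].astype(str) != curr.loc[both, col].astype(str)
        # each True here is one candidate cell edit to be recorded in the 'changes' column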
Consolidated: All records after start of covid, all years including foreign, otherwise unchanged.
Flattened:    One row per VAERS_ID with all information for multiple lots contained within each row,
              plus all symptom file entries as a new column called 'symptom_entries'.
Changes:      Also flattened, arrived at by comparing flattened files and noting changes in a column
              called 'changes'. The latest always contains all of the information needed; no others
              matter anymore.

Copyright 2023 Gary Hawkins
Currently http://hawkvaers.com/download/
'''

import glob, os, sys, re, shutil, pprint, inspect
import pandas as pd
import numpy as np
import subprocess as sp
import time as _time
from collections import Counter
from datetime import datetime
from string import punctuation
from collections import OrderedDict

pp = pprint.PrettyPrinter(indent=4)

''' from zipfile import ZipFile ... can't handle ... compression type 9 (deflate64) '''
import zipfile_deflate64 as zipfile  # use ... pip install zipfile-deflate64 ... that solves it.

date_floor    = ''  #'2022-11-11'  # Useful any time for no redo of what's done
date_ceiling  = ''  #'2021-07-01'  # In testing, to stop if greater (newer) than this
reconsolidate = 0   # Once put together, no need to do again
reflatten     = 0   # Once flattened, no need to do again
restart       = 0   # In testing, clear directories and start over completely
files_limit   = []  #['2022']  # For testing speed, any of ['2020', '2021', '2022', '2023', 'NonD']
vids_limit    = []  #[2270225, 2048529, 2451431]  # For testing speed, vid stands for vaers_id, list as integers
autodownload  = 1   # Automatically download the latest VAERS data from https://vaers.hhs.gov/data/datasets.html

dir_input        = '../Download/ALL_VAERS_DROPS'
dir_changes      = 'vaers_flatfile'  # changed from vaers_changes, resulting in a tad bit of cognitive confusion throughout the code now but overall better.
dir_working      = 'vaers_working'
dir_flattened    = 'vaers_flattened'
dir_consolidated = 'vaers_consolidated'

file_stats            = 'stats.csv'
file_never_published  = 'never_published.txt'
file_all_ever_seen    = 'all_ever_seen.txt'
file_symptoms_deduped = 'symptoms_deduped.txt'

tones                   = 0
floor_notice_printed    = 0
ceiling_notice_printed  = 0
covid_earliest_vaers_id = 896636  # was during trials

elapsed_begin = _time.time()
elapsed_drop  = _time.time()

df_vax       = pd.DataFrame()
df_data      = pd.DataFrame()
df_syms_flat = pd.DataFrame()
df_flat_1    = pd.DataFrame()
df_flat_2    = pd.DataFrame()  # df_flat_1 would be the previously flattened file
df_changes_1 = pd.DataFrame()
df_stats     = pd.DataFrame()

punctuations = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~'

all_never_published  = {}  # changes week-to-week
all_ever_seen        = {}  # grows
all_symptoms_deduped = {}
files   = {}
stats   = {}
df_done = {}
working_date = ''  # could be used more universally, currently just for stats

columns_vaers = ['VAERS_ID', 'AGE_YRS', 'SEX', 'STATE', 'SPLTTYPE', 'DIED', 'L_THREAT', 'ER_VISIT', 'ER_ED_VISIT',
                 'HOSPITAL', 'DISABLE', 'BIRTH_DEFECT', 'OFC_VISIT', 'VAX_TYPE', 'VAX_MANU', 'VAX_LOT',
                 'VAX_DOSE_SERIES', 'VAX_ROUTE', 'VAX_SITE', 'VAX_NAME', 'DATEDIED', 'VAX_DATE', 'RPT_DATE',
                 'RECVDATE', 'TODAYS_DATE', 'ONSET_DATE', 'NUMDAYS', 'HOSPDAYS', 'X_STAY', 'RECOVD', 'CAGE_YR',
                 'CAGE_MO', 'V_ADMINBY', 'V_FUNDBY', 'FORM_VERS', 'PRIOR_VAX', 'CUR_ILL', 'OTHER_MEDS',
                 'ALLERGIES', 'HISTORY', 'LAB_DATA', 'SYMPTOM_TEXT']
# These are added: cell_edits, status, changes, symptom_entries
# Dates are changed for sort from 12/14/2020 to 2020-12-14, omitting like cut_2023-07-14, leaving out on
# data fields the indicator that the field was blanked out.
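
# Illustrative sketch only (not called anywhere in the pipeline): the TODO notes above describe showing,
# when an old cell value 'a' survives as a substring of the new value 'b', just 'a' plus a couple of
# words of surrounding context instead of all of 'b'. This is one hedged way that could look; the name
# and exact behavior are assumptions, not the script's existing change-display logic.
def _substring_with_context(a, b, context_words=2):
    ''' If a is a substring of b, return a with up to context_words words of b on each side;
        otherwise return b unchanged. '''
    idx = b.find(a)
    if idx < 0:
        return b
    before = b[:idx].split()[-context_words:]            # words of b just before a
    after  = b[idx + len(a):].split()[:context_words]    # words of b just after a
    return ' '.join(before + [a] + after)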
never_published_patch_done = 0 # due to my missing 2020-12-30 drop def stats_resolve(working_date): ''' Statistics file. All of the stats for this week are in the stats dictionary. Turn them into a row in df_stats and update the row called All (usually sum but vid for max sentence removal is the highest, for example). Overwrite stats.csv. TODO: If there's an interuption, make sure each week's stats are present or redo as needed. ''' global stats, df_stats print() ; print(' Doing stats') stats[ 'never_published' ] = len(all_never_published) df_tmp = pd.DataFrame.from_dict(stats, orient='index').T.reset_index().rename(columns={'index': 'date'}) # Separate out the columns from their sub-dictionary for k in list(stats[ 'columns' ].keys()): df_tmp[ k ] = stats[ 'columns' ][ k ] del df_tmp[ 'columns' ] df_tmp[ 'date' ] = working_date if len(df_stats): # A continuing run of many weeks one after another, stats.csv file was already opened or created new. df_tmp = df_tmp.reindex(df_stats.columns, axis=1) # same column order df_stats = pd.concat([ df_stats, df_tmp ], ignore_index=True) # ignore_index does a reindex df_stats = df_stats[df_stats.date.ne('All')] else: if os.path.exists(file_stats): df_stats = open_file_to_df(file_stats, doprint=1) # stored from a previous run df_stats = pd.concat([ df_stats, df_tmp ], ignore_index=True) df_stats = df_stats.loc[ df_stats.date.ne('All') ] # remove previous 'All' else: print(); print(f' Creating {file_stats} since it did not already exist') df_stats = df_tmp.copy() # Line in csv all zeros for first date with no compare if files['changes']: # has to be there, just making sure df_stats.loc[ len(df_stats) ] = 0 date_prv = list(files['changes']['keyval'].keys())[-1] df_stats.loc[ df_stats.date.eq(0), 'date' ] = date_prv del df_tmp df_stats = df_stats.drop_duplicates(subset='date', keep='last') df_stats = df_stats.sort_values(by='date') df_stats = df_stats.reset_index(drop=True) ''' 'All' row add ''' try: # there were some issues, TODO remove these try/except? 
df_stats = df_stats.append(pd.Series(df_stats.sum()), ignore_index=True) # lots of ways, surely one is best maybe better df_stats.loc[ df_stats.index[ -1 ], 'date' ] = 'All' except Exception as e: print(f'1 Exception {e}') ''' list_cols_vaers_minus_vid = list(set(columns_vaers) - set(['VAERS_ID'])) + ['symptom_entries'] df_vaers = df_stats.loc[df_stats.date.eq('All')][list_cols_vaers_minus_vid] df_vaers = df_vaers.sort_values(df_vaers.last_valid_index(), axis=1, ascending=False) # https://stackoverflow.com/a/41350596/962391 list_new_order = list_only_df_stats_columns + list_sorted_columns ''' ''' Sort vaers columns by 'All' row values descending ''' stats['columns'] = {k:v for k,v in sorted(stats['columns'].items(), key=lambda item: item[1], reverse=True)} # sorted by value, descending print() ; print(f"{ '':>10} column changes: {stats['columns']}") list_sorted_columns = list(stats['columns'].keys()) list_only_df_stats_columns = [c for c in list(df_stats.columns) if c not in list_sorted_columns] df_vaers = df_stats.loc[df_stats.date.eq('All')][list_sorted_columns] df_vaers = df_vaers.sort_values(df_vaers.last_valid_index(), axis=1, ascending=False) # https://stackoverflow.com/a/41350596/962391 list_new_order = list_only_df_stats_columns + list(df_vaers.columns) try: df_stats = df_stats.reindex(list_new_order, axis=1) except Exception as e: print(f'2 Exception {e}') # For print, corrections for those not actually a sum df_stats.loc[ df_stats.index[ -1 ], 'never_published' ] = stats['never_published'] # overwrite, the new all-time value df_stats.loc[ df_stats.index[ -1 ], 'sentence_dedupe_max_bytes' ] = df_stats.loc[ df_stats.index[ :-1 ], 'sentence_dedupe_max_bytes' ].max() idx_high_dedupe_bytes = df_stats['sentence_dedupe_max_bytes'].astype('float64').idxmax() df_stats.loc[ df_stats.index[ -1 ], 'sentence_dedupe_max_vid' ] = df_stats.loc[ df_stats.index[ idx_high_dedupe_bytes ], 'sentence_dedupe_max_vid' ] df_stats['comparisons'] = df_stats.index + 1 df_stats.loc[ df_stats.date.eq('All'), 'comparisons' ] = df_stats.loc[ df_stats.index[ -2 ], 'comparisons' ] try: _all = df_stats.loc[ df_stats.date.eq('All') ].to_dict(orient='records')[0] except Exception as e: print(f'3 Exception {e}') del _all['date'] ''' This week 830 delayed/late/gapfill 56 deleted 0 restored 7 cell edits trivial not printed 22 cell edits significant 1 cells emptied entirely 6 writeups changed All time 521672 delayed/late/gapfill 3009 deleted 1 restored 847030 cell edits trivial not printed 2495 cell edits significant 51 cells emptied entirely 526 writeups changed 21455 never published [903657, 903679, 903707, 903738, 903802, 903829, 903946, 903970 ... 
1901656, 1901657, 1901658, 1901659, 1901660, 1901661, 1901662, 1901676] 16 reports cleared of duplicate sentences within them ''' print() ; print(' This week') print(f"{stats ['gapfill']:>10} delayed/late/gapfill") print(f"{stats ['deleted']:>10} deleted") print(f"{stats ['restored']:>10} restored") print(f"{stats['cell_trivial_changes_ignored']:>10} cell edits trivial not printed") print(f"{stats ['cells_edited']:>10} cell edits significant") print(f"{stats ['cells_emptied']:>10} cells emptied entirely") print(f"{stats ['writeups_changed']:>10} writeups changed") print() print(' All time') print(f"{_all ['gapfill']:>10} delayed/late/gapfill") print(f"{_all ['deleted']:>10} deleted") print(f"{_all ['restored']:>10} restored") print(f"{_all['cell_trivial_changes_ignored']:>10} cell edits trivial not printed") print(f"{_all ['cells_edited']:>10} cell edits significant") print(f"{_all ['cells_emptied']:>10} cells emptied entirely") print(f"{_all ['writeups_changed']:>10} writeups changed") print() print(f"{_all ['never_published']:>10} never published {subrange(sorted(all_never_published), 8)}") print(f"{ len(file_symptoms_deduped):>10} reports cleared of duplicate sentences within them") ''' Write to stats.csv ''' df_stats = df_stats.sort_values(by='date', ascending=False) write_to_csv(df_stats, file_stats, open=0, ignore_dupes=0) # last of 'All' is kept ''' Write all_symptoms_deduped to symptoms_deduped.txt ''' with open(file_symptoms_deduped, 'w') as f: for x in list(all_symptoms_deduped.keys()): f.write(f'{x}\n') return def stats_initialize(): ''' Per week, stored in stats dictionary, then written to stats.csv with each date as a row. ''' global stats stats = { 'comparisons' : 0, # Number of weeks. 'deleted' : 0, # Were visible, now gone. 'restored' : 0, # Had been deleted. 'gapfill' : 0, # Held back but finally showed up. 'never_published' : 0, # Total still being held back. Gaps in the VAERS_ID sequence. 'writeups_changed' : 0, # SYMPTOM_TEXT' 'cells_emptied' : 0, # Blanked out. 'cells_edited' : 0, # Significant edits. 'cell_trivial_changes_ignored': 0, # Trivial changes as a count of the number of cells. 'sentence_dedupe_bytes' : 0, # This week bytes in removal of repeat sentences in SYMPTOM_TEXT fields. 'sentence_dedupe_reports' : 0, # This week new reports that had repeat sentences (that were not counted before). 'sentence_dedupe_count' : 0, # This week total repeat sentence replacements including within cells. 'sentence_dedupe_max_bytes' : 0, # This week report with most deduped sentences, the bytes/characters value. 'All' row will be the highest all-time. 'sentence_dedupe_max_vid' : 0, # This week report with most deduped sentences VAERS_ID. 'columns' : {}, } columns_without_vid = list(set(columns_vaers) - set(['VAERS_ID'])) + ['symptom_entries'] stats['columns'] = {x:0 for x in columns_without_vid} def validate_dirs_and_files(): ''' Create directories if not existing ''' global all_never_published, all_symptoms_deduped, list_vid_repeat_sentences print() ; print('validate_dirs_and_files() ...') if not os.path.exists(dir_input): print(f'The expected inputs directory of CDC drops does not exist: {dir_input }') os.makedirs(dir_input ) print(' That directory has been created and should be populated with each weekly data drop to process, named like:') print(' ... with directories named like:') print(' 2020-12-25') print(' 2021-01-07') print(' ... 
and containing the CDC csv files like: 2021VAERSDATA.csv') exit('\n see validate_dirs_and_files()') if not os.path.exists(dir_working): print(f' The expected working directory for processing does not exist, creating {dir_working}') os.makedirs(dir_working) if not os.path.exists(dir_consolidated): print(f' The expected consolidated directory for storing files does not exist, creating {dir_consolidated}') os.makedirs(dir_consolidated) if not os.path.exists(dir_flattened): print(f' The expected flattened directory for storing files does not exist, creating {dir_flattened}') os.makedirs(dir_flattened) if not os.path.exists(dir_changes): print(f' The expected output directory for changes does not exist, creating {dir_changes}') os.makedirs(dir_changes) if os.path.exists(file_never_published): with open(file_never_published, 'r') as f: lines = f.readlines() lines = [x.strip() for x in lines] lines = [ int(x) for x in lines if x] all_never_published = {x:1 for x in lines} else: with open(file_never_published, 'w'): # create it empty pass if os.path.exists(file_symptoms_deduped): with open(file_symptoms_deduped, 'r') as f: lines = f.readlines() lines = [x.strip() for x in lines] lines = [ int(x) for x in lines if x] all_symptoms_deduped = {x:1 for x in lines} # dictionary more efficient than list else: with open(file_symptoms_deduped, 'w'): pass files_populate_information() if not files['input']['files']: exit(f" No csv or zip files in dir_input {dir_input}, no point in continuing") print() print(f"{(len(files['input']['date']) - len(files['changes']['date'])):>10} drops in input to process") ; print() print(f" First (oldest) input: {files['input']['files'][0]}") print(f" Last (newest) input: {files['input']['files'][-1]}") files_changes = sorted(files['changes']['files']) date_changes = sorted(files['changes']['date' ]) date_input = sorted(files['input' ]['date' ]) if files_changes: if date_input[-1] <= date_changes[-1]: print() if len(files_changes) >= 2: print(f" Second-to-last changes file is {files_changes[-2]}") if len(files_changes) >= 1: print(f" Last (newest) changes file is {files_changes[-1]}") print() ; print(f' Already processed files do appear in {dir_changes} and the latest will be built upon:') count = 0 for f in files_changes: # showing some of those print(f' {f}') count += 1 if count >= 5: print(f' ... {len(files_changes)} total') break print() else: ''' First covid 2020-12-25 treatment initially or other with date_floor ''' print() ; print(f' No processed files show up in {dir_changes}. Is the first in the loop.') return def subrange(list_in, _max): ''' Input list and for print return up to _max (like 5) at the start and end of that list ''' this_many = int(max(min(len(list_in) / 2, _max), 1)) head = f'{list_in[:this_many]}' if len(list_in) > 1 else '' head = re.sub(r'\]', '', head) tail = f'{list_in[-this_many:]}' if len(list_in) > 1 else '' tail = re.sub(r'\[', '', tail) return(head + ' ... ' + tail) def files_from_zip(zip_file, dir_dst): ''' This requires ... pip install zipfile-deflate64 ... to handle zip straight from https://vaers.hhs.gov/data/datasets.html See https://stackoverflow.com/a/73040025/962391 The alternative is to unzip and rezip to get away from compression type 9 (deflate64), a licensing issue ''' archive = zipfile.ZipFile(zip_file) print(f' unzip {zip_file}') for file in archive.namelist(): # only 2020... 
and NonDomestic files if file.startswith('202') or file.lower().startswith('nond'): archive.extract(file, './' + dir_dst) if files_limit: # In testing with files_limit, remove those that don't apply print(f' files_limit = {files_limit}') files_all = glob.glob(dir_working + '/' + '*.csv') files_all = [linux_path(x) for x in files_all] for file in files_all: for x in files_limit: if x not in file and 'nond' not in file.lower(): print(f' remove file {file}') os.remove(file) #don't do this ... print(f' Removing {zip_file}') #os.remove(zip_file) def open_file_to_df(filename, doprint=1): ''' Read CSV filename into dataframe df ''' df = 'see open_file_to_df()' try: if doprint: print(f' open {filename:>54}', flush=True, end='') with open(filename, encoding='utf-8-sig', errors='replace') as f: # 'utf-8-sig' and 'ISO-8859-1', need to resolve this df = pd.read_csv(f, index_col=None, header=0, sep=',', engine='python', encoding='ISO-8859-1').fillna('') if doprint: max_vid = 'ok' if 'VAERS_ID' in df.columns: max_vid = 'Highest VAERS_ID ' + str(df.VAERS_ID.astype(int).max()) print(f' ... {max_vid}') except ValueError as e: print(f'\n\t{e}') return df def files_populate_information(): ''' Often updating in 'files' variable ''' global files, floor_notice_printed, ceiling_notice_printed if not files: # make the keys for x in ['input', 'working', 'flattened', 'changes', 'consolidated']: files[x] = {} for y in ['date', 'files']: files[x][y] = [] # set _dir if x == 'input': files[x]['_dir'] = dir_input elif x == 'working': files[x]['_dir'] = dir_working elif x == 'changes': files[x]['_dir'] = dir_changes elif x == 'flattened': files[x]['_dir'] = dir_flattened elif x == 'consolidated': files[x]['_dir'] = dir_consolidated # current values for thing in list(files.keys()): _dir = files[thing]['_dir'] # filenames only and made lowercase) full = sorted( [y for x in os.walk(_dir) for y in glob.glob(os.path.join(x[0], '*' + '.*'))] ) # note other files/dirs can be there without a problem, only .csv or .zip are picked up full = [x for x in full if re.search(r'\\\d{4}\-\d{2}\-\d{2}', x)] # file must start with a date like 2020-12-24 full = [linux_path(x) for x in full] full = [x for x in full if (x.lower().endswith('.csv') or x.lower().endswith('.zip'))] full = [x for x in full if not (x.lower().endswith('_a.csv') or x.lower().endswith('_b.csv'))] # date only like 2020-12-24 files[thing]['date'] = [date_from_filename(x) for x in full] # date to either the zip file or directory name files[thing]['keyval'] = {date_from_filename(x) : x for x in full} files[thing]['valkey'] = {x : date_from_filename(x) for x in full} files[thing]['files' ] = list(files[thing]['valkey'].keys()) do_file_limits = 0 if date_floor: do_file_limits = 1 if not floor_notice_printed: print(f'\n\n\n\t\t date_floor is set at { date_floor}, limiting files\n\n') floor_notice_printed = 1 if date_ceiling: do_file_limits = 1 if not ceiling_notice_printed: print(f'\n\n\n\t\t date_ceiling is set at {date_ceiling}, limiting files\n\n') ceiling_notice_printed = 1 if do_file_limits: # remove those that don't apply if date_floor: files['input']['date'] = [x for x in files['input']['date'] if x >= date_floor] if date_ceiling: files['input']['date'] = [x for x in files['input']['date'] if x <= date_ceiling] for y in ['input']: # only this files[y]['date'] = [x for x in files[y]['date'] if x in files['input']['date']] files[y]['keyval'] = {k:v for k, v in files[y]['keyval'].items() if k in files['input']['date']} files[y]['valkey'] = {k:v for k, v in 
files[y]['valkey'].items() if v in files['input']['date']} files[y]['files' ] = list(files[y]['valkey'].keys()) #pp.pprint(files) return def open_files(_date): # like './vaers_drop_inputs/2020-12-25' ''' Input files in dir_input: csv within directories zip files in a single directory, containing csv, treated as if they are folders, sort of. ''' files_populate_information() if not reconsolidate: if _date in files['consolidated']['date']: # already consolidated print(f' {_date} already consolidated, no need to copy input files to dir_working') shutil.rmtree(dir_working) # removing directory os.mkdir(dir_working) set_files_date_marker(dir_working, _date) # a flag used by consolidate() return if not reconsolidate and _date in files['flattened']['date']: print(f' Skipping unzip because flattened for {_date} already exists') return if not _date in files['input']['keyval']: exit(f" Failed to find in files['input']['keyval'] the _date {_date} in open_files() ") files_value = files['input']['keyval'][_date] if isinstance(files_value, list): # csv files already extracted manually print(f' Copy {_date}/* to {dir_working}') shutil.rmtree(dir_working) # removing directory to avoid error next line shutil.copytree(_date, dir_working) set_files_date_marker(dir_working, _date) elif 'zip' in files_value: # zip file, treat it sort of like a directory here shutil.rmtree(dir_working) os.makedirs(dir_working) ####shutil.copy(files_value, dir_working) # copies the zip file files_from_zip(files_value, dir_working) else: exit(f' Unexpected _date {_date} in open_files() ') set_files_date_marker(dir_working, _date) return def save_multi_csv(this_date, df): df_a = df.head(1048575) if len(df) > 1048576: df_b = df.loc[ ~df.VAERS_ID.isin(df_a.VAERS_ID) ] #filename_a = dir_changes + '/' + this_date + '_VAERS_FLATFILE_A.csv' #filename_b = dir_changes + '/' + this_date + '_VAERS_FLATFILE_B.csv' filename_a = f'{dir_changes}/{this_date}_VAERS_FLATFILE_A.csv' filename_b = f'{dir_changes}/{this_date}_VAERS_FLATFILE_B.csv' print(f'Saving {filename_a}, {len(df_a)} rows') write_to_csv(df_a, filename_a, open=1) # TODO: How is this broken? ... 'vaers_flatfile' is not recognized if len(df_b): print(f' and {filename_b}, {len(df_b)} rows') write_to_csv(df_b, filename_b) ### Currently broken, writes file but is corrupted ###save_xlsx(df, 'XLSX_VAERS_FLATFILE.xlsx') def save_xlsx(df, filename): pd.io.formats.excel.ExcelFormatter.header_style = None df = df.copy() df = df.reset_index(drop=True) # Currently broken, writes file but is corrupted # what a mess ... ''' from unidecode import unidecode def FormatString(s): if isinstance(s, unicode): try: s.encode('ascii') return s except: return unidecode(s) else: return s df = df.applymap(FormatString) ''' print(' Unicode decode') df = df.applymap(lambda x: x.encode('unicode_escape'). decode('utf-8') if isinstance(x, str) else x) print(f' Saving to {filename}') df1 = df.head(1048575) if len(df) > 1048576: df2 = df.tail(len(df) - (1048575 + 1)) sheets = { 'First million or so rows': df1, 'Remaining rows': df2, } ''' with pd.ExcelWriter('test.xlsx', engine='openpyxl') as writer: df1.to_excel(writer, sheet_name = 'Tab1', index = False) df2.to_excel(writer, sheet_name = 'Tab2', index = False) ''' #df['SYMPTOM_TEXT'] = df.SYMPTOM_TEXT.str.replace(r'[^\x00-\x7F]+', r'{}', regex=True) #df['changes'] = df.changes .str.replace(r'[^\x00-\x7F]+', r'{}', regex=True) ''' New problem with openpyxl: openpyxl.utils.exceptions.IllegalCharacterError: ... benadryl allergy. Aroun??d 11 pm t ... 
''' #print(' ILLEGAL_CHARACTERS_RE') #ILLEGAL_CHARACTERS_RE = re.compile(r'[\000-\010]|[\013-\014]|[\016-\037]') #df = df.applymap(lambda x: re.sub(ILLEGAL_CHARACTERS_RE, '{}', x) if isinstance(x, str) else x) #df = df.applymap(lambda x: ILLEGAL_CHARACTERS_RE.sub(r'', x) if isinstance(x, str) else x) print(f' Writing {filename}') with pd.ExcelWriter(filename, engine='openpyxl') as writer: for sheet in sheets: sheets[ sheet ].to_excel(writer, sheet_name = sheet, index = False) ''' writer.book.use_zip64() workbook = writer.book #workbook.use_zip64() ##workbook.use_zip64() # due to size threshold hit header_styles = workbook.add_format({'font_name': 'Arial', 'font_size': 10, 'bold': False}) cell_styles = workbook.add_format({ # was test, now unused, didn't take why? Re-try, not yet tested. 'font_name' : 'Arial', 'font_size' : 10, 'bold' : False, }) cell_styles = header_styles for sheet in sheets: sheets[sheet].to_excel(writer, sheet_name=sheet, index=False) worksheet = writer.sheets[ sheet ] for col_num, value in enumerate(sheets[sheet].columns.values): worksheet.write(0, col_num, value, cell_styles) worksheet.set_row(0, None, header_styles) ''' return # engine options: openpyxl or with pd.ExcelWriter(filename, engine='xlsxwriter') as writer: workbook = writer.book ##workbook.use_zip64() # due to size threshold hit header_styles = workbook.add_format({'font_name': 'Arial', 'font_size': 10, 'bold': False}) cell_styles = workbook.add_format({ # was test, now unused, didn't take why? Re-try, not yet tested. 'font_name' : 'Arial', 'font_size' : 10, 'bold' : False, }) for sheet in sheets: sheets[sheet].to_excel(writer, sheet_name=sheet, index=False) worksheet = writer.sheets[ sheet ] #workbook.use_zip64() # due to size threshold hit. Here instead??? https://stackoverflow.com/a/48095021/962391 ... FAIL for col_num, value in enumerate(sheets[sheet].columns.values): worksheet.write(0, col_num, value, cell_styles) worksheet.set_row(0, None, header_styles) workbook.close() # help? def get_next_date(): ''' Next date has to be the next flattened needed to be done. Compare uses flattened files. ''' #files_populate_information() already done in previous more_to_do() stats_initialize() # TODO: This has changed, there's surely some dead code here now that would never be hit, to be removed. dates_input = sorted(files['input' ]['date']) dates_changes = sorted(files['changes']['date']) if not files['flattened']['keyval']: # first run working_date = dates_input[0] return working_date # Comparing to inputs (to do) if dates_changes: high_changes_done = dates_changes[-1] changes_next_candidates = [x for x in dates_input if x > high_changes_done] if changes_next_candidates: date_next = changes_next_candidates[0] if date_ceiling and ( date_next > date_ceiling ): exit(f'\n\n\n\n\t\t\t date_ceiling is set and was reached: date_next {date_next} > {date_ceiling} \n\n\n\n') working_date = date_next return date_next else: ''' Save final output as two, split in half, due to Excel row limit of about 1.048 million per sheet. ''' if len(df_changes_1): save_multi_csv(high_changes_done, df_changes_1) if date_ceiling: exit(f'\n\n\n\n\t\t\t date_ceiling is {date_ceiling}, highest changes: {high_changes_done} \n\n\n\n') else: print() exit(' No more input files to process') ; print() else: ''' changes_not_done = sorted( set(files['input']['date']) - set(files['changes']['date']) ) if changes_not_done: date_todo = sorted(changes_not_done)[0] return date_todo Confusions. Whole thing needs a lot of conditions applied. 
If there are no changes files, can still be consolidated and flattened. If no changes, just return the first as next? No, that's a loop. ''' if files['flattened']['date']: if dates_input[0] in files['flattened']['date']: # Initiating, copy it this_dir_date = dates_input[0] file_for_copy_original = dir_changes + '/' + this_dir_date + '_VAERS_FLATFILE.csv' print( f' Due to first drop, creating from flattened: {file_for_copy_original}') print() file_flattened = files['flattened']['keyval'][this_dir_date] # TODO: Add ['changes', 'status', 'cell_edits'] and move them forward shutil.copyfile(file_flattened, file_for_copy_original) return dates_input[1] # NEXT ONE working_date = dates_input[0] return working_date # TODO, needs work consolidated_not_done = sorted( set(files['input']['date']) - set(files['consolidated']['date']) ) if consolidated_not_done: date_todo = sorted(consolidated_not_done)[0] print(f' Consolidation to do: {date_todo}') working_date = date_todo return date_todo flattened_not_done = sorted( set(files['input']['date']) - set(files['flattened']['date']) ) if flattened_not_done: date_todo = sorted(flattened_not_done)[0] print(f' Flattening to do: {date_todo}') working_date = date_todo return date_todo latest_flattened_done = sorted(files['flattened']['date'])[-1] if files['flattened']['date'] else '' this_drop_date_list = [x for x in dates_input if x > latest_flattened_done] date_next = this_drop_date_list[0] if len(this_drop_date_list) == 0: # No more to process print() exit(' No more files to process') elif date_ceiling and ( date_next > date_ceiling ): exit(f'\n\n\n\n\t\t\t date_ceiling is set and was reached: date_next {date_next} > {date_ceiling} \n\n\n\n') else: working_date = date_next return date_next def set_files_date_marker(_dir, filename): ''' For visual clarity create an empty file in dir with filename as the particular date ''' file_date_part = date_from_filename(filename) pattern_dir = './' + _dir + '/' if file_date_part: files_date_marker = pattern_dir + file_date_part if os.path.exists(files_date_marker): return else: print(f' Creating in {pattern_dir} date marker file {file_date_part}') open(files_date_marker, 'a').close() else: print(f' FAILED creating in {pattern_dir} date marker file {file_date_part}') return('EMPTY IN set_files_date_marker, must fix if hit') def get_files_date_marker(_dir): ''' Date from filename for files in director ''' pattern_dir = './' + _dir + '/' files_with_date = glob.glob(pattern_dir + '*-*') if not files_with_date: return('') files_with_date = files_with_date[0] # any will do, thus first in list returned (even if just one of course) file_date_part = date_from_filename(files_with_date) if file_date_part: return file_date_part else: print(' EMPTY in get_files_date_marker') return def date_from_filename(filename): ''' Pull just date portion of a filename ''' return re.sub(r'.*(\d{4}\-\d{2}\-\d{2}).*', r'\1', filename) def linux_path(x): return re.sub(r'\\', '/', x) def files_concat(files_list): ''' Join/concatenate files ''' df_out = None count = 0 for filename in files_list: df = open_file_to_df(filename, doprint=1) if count == 0: df_out = df.copy() else: df_out = pd.concat([df_out, df]) count += 1 df_out = df_out.fillna('') # There are nans in all NonDomestic, why? ''' VAERS_IDs inconsistently have a leading zero or not, fixing that, uniformity. This is important for removal of duplicates for example. Making double sure they are integers. 
''' df_out['VAERS_ID'] = pd.to_numeric(df_out.VAERS_ID) ''' For debug but tricky as df_dupes are not all dupes but merely contain one that is: df_duplicated = df_out.loc[df_out.duplicated(subset=df_out.columns) == True].sort_values(by='VAERS_ID') # each record only once though vids_duplicated = df_duplicated.VAERS_ID.to_list() df_dupes = df_out.loc[ df_out.VAERS_ID.isin(vids_duplicated) ] # to examine them ''' len_before = len(df_out) df_out = df_out.drop_duplicates(df_out.columns).reset_index(drop=True) # must avoid dropping on default by index, use of df_out.columns if (len(df_out) - len_before): print(f'{(len_before - len(df_out)):>10} exact duplicates dropped in concatenated files, now {len(df_out)} rows') return df_out def write_to_csv(df, full_filename, open=0, ignore_dupes=0): ''' Output to csv file(s). CSV stands for Comma Separated Values ''' len_before = len(df) # No duplicate lines error check df = df.drop_duplicates(subset=df.columns, keep='last') # Should be none. Overwrite of values now done ''' Make date columns sortable for the user in spreadsheet like 2020-12-14 ''' for col in ['DATEDIED', 'VAX_DATE', 'RPT_DATE', 'RECVDATE', 'TODAYS_DATE', 'ONSET_DATE']: if col in df.columns: df[col] = pd.to_datetime(df[col]).dt.strftime('%Y-%m-%d') df = df.fillna('') if (len(df) - len_before): if not open: # use of 'open' is for manual debugging, while in code during a run while testing can be overridden with open=1, ignore_dupes=1 if not ignore_dupes: print(f'\n\n\n\n\t\t\t {(len_before - len(df)):>10} write_to_csv() {full_filename} complete duplicates dropped, SHOULD NOT HAPPEN \n\n\n\n') exit(f'\t\t\t Duplicates exist, this has to be fixed, exiting at {line()}\n\n\n') df.to_csv(full_filename, sep=',', encoding='utf-8-sig', index=False) if open: sp.Popen(full_filename, shell=True) # opening if specified like in the debugger manually def move_rows(df_subset, df_move_from, df_move_to): ''' Move rows in df_subset out of df_move_from into df_move_to Return the new df_move_from and df_move_to ''' if not len(df_subset): return df_move_from, df_move_to df_move_to = df_move_to.copy() df_move_to = pd.concat([df_move_to.reset_index(drop=True), df_subset.reset_index(drop=True) ], ignore_index=True) df_move_from = df_move_from.loc[ ~df_move_from.VAERS_ID.isin( df_move_to.VAERS_ID ) ] # everything from before, not now in done return df_move_from, df_move_to def move_column_forward(df_in, column): ''' Reorder columns moving column to second ''' if column not in df_in: return df_in columns_pre = list(df_in.columns) if columns_pre[1] == column: return df_in # already in that spot #columns = ['VAERS_ID', column] + columns[1:-1] columns = ['VAERS_ID', column] for c in columns_pre: if c not in columns: columns.append(c) return df_in.reindex(columns, axis=1) def nan_alert(df): ''' nan stands for not-a-number, meaning null, check to see if any are present in the dataframe (df) ''' count_nans = 0 if len(df): if df.isnull().values.any(): df_with_nans = df[df.isnull().any(axis=1)] count_nans = len(df_with_nans) print(f'\n\n\n\n\t\t\t {count_nans} rows with NANS line {inspect.stack()[1][2]}\n\n\n\n') print( df_with_nans ) # the rows with nan else: print(f'\n\n\t\t Empty df at line {inspect.stack()[1][2]}\n\n') # line number return count_nans def lookup(previous_date, this_dir_date, vid_cols_affected): ''' A tool for adhoc on-the-fly debugging. 
Show values for just the fields and VAERS_IDs for these compared files in dir_changes ''' print() print(f'debug lookup() on vids affected') files_populate_information() df_changes = pd.DataFrame() files_list = [ files['changes']['keyval'][previous_date], files['changes']['keyval'][this_dir_date] ] vids_all_list = list(vid_cols_affected.keys()) cols_all = [] for k in vid_cols_affected: for col in vid_cols_affected[k]: if col not in cols_all: cols_all.append(col) for filename in files_list: df_tmp = open_file_to_df(filename, doprint=0).fillna('') # [vids_all_list] df_tmp = df_tmp.loc[ df_tmp.VAERS_ID.isin(vids_all_list) ] df_tmp = df_tmp.copy()[['VAERS_ID'] + cols_all] df_tmp['date'] = date_from_filename(filename) df_changes = pd.concat([df_changes, df_tmp]) df_changes = move_column_forward(df_changes, 'date') df_changes = df_changes.sort_values(by=['VAERS_ID', 'date']) write_to_csv(df_changes, 'lookup_changes.csv', open=1) print() def single_plural(count, word): if count == 1: return re.sub(r'.$', '', word) # removing 's' on the end if just 1 else: return word def do_replace(d, col, tag, this, that): ''' Not essential, a utility for testing some prep for harvesting numbers. Makes regular expression replacements to remove false positives. Pared way down from its original elsewhere. ''' d2 = d.copy() # ??? print( f' do_replace {col} {tag:>40} {this:>40} {that}' ) d2[col] = d2[col].str.replace(this, that, flags=re.IGNORECASE, regex=True) return d2 def line(): ''' Print line number (debug) ''' caller = inspect.getframeinfo(inspect.stack()[1][0]) print(f' Line {caller.lineno} in {caller.function}()') def tone(): ''' Utility sounding a tone. For example at assign_outliers() main work starting or end of a run ''' if not tones: return print('\a', end='') # sounds a tone def do_elapsed(marker_in): ''' Calculate elapsed time for a given input marker as the starting point. ''' elapsd = (_time.time() - marker_in) / 60 # Minutes print(f"{'{} hr {} min'.format(int(elapsd / 60), '%.1f' % (elapsd % 60))}") tone() def exit(_in=None): ''' Exit with message ''' if not _in: _in = '' print() ; print(f'{_in}') do_elapsed(elapsed_begin) print() ; print(f'Done with {__file__} at line {inspect.stack()[1][2]}, clock time {str(datetime.now())}') print() ; print('- - - - - - - - - - - - - - - - - - - - - - - - ') tone() os._exit(0) def do_restart(): ''' For testing, start from scratch, earliest input date. 
Otherwise, will pick up after the latest processed in dir_changes or above date_floor if set ''' print(' Is restart, removing directories ...') for x in [dir_changes, dir_working, dir_flattened, dir_consolidated]: if not reconsolidate and x == dir_consolidated: continue # skip removal of consolidated if not reflatten and x == dir_flattened: continue # skip removal of flattened files if os.path.exists(x): print(f' Since restart={restart}, removing directory {x} to start empty there') shutil.rmtree(x) return def more_to_do(): ''' Return 1 or 0 on whether any more input files process ''' global elapsed_drop, df_changes_1 files_populate_information() if not files['changes']['date']: # None done yet return 1 if files['input']['date'] == files['flattened']['date'] == files['consolidated']['date'] == files['changes']['date']: if files['changes']['date'][-1] >= files['input']['date'][-1]: this_date = files['changes']['date'][-1] if df_changes_1 is not None and len(df_changes_1): save_multi_csv(this_date, df_changes_1) # more than one csv due to Excel row limit else: prv_date = files['changes']['date'][-1] filename = files['changes']['keyval'][prv_date] print(f'Patching for creation of split changes file _A and _B, no more to do case unusual, using {filename}') df_changes_1 = open_file_to_df(filename, doprint=0) save_multi_csv(this_date, df_changes_1) # more than one csv due to Excel row limit print() print(f" No more to do, last set {files['changes']['date'][-1]} >= {files['input']['date'][-1]} done") return 0 elapsed_drop = _time.time() # new start marker for total time on each week's drop return 1 def print_date_banner(this_drop_date): print() print('= = = = = = = = = = = = = = = = = = = = = = = = = = = = = = ') print(f' Next date {this_drop_date}') print('= = = = = = = = = = = = = = = = = = = = = = = = = = = = = = ') print() def do_never_published(week_vids_present): ''' Reports never published, identifying gaps in VAERS_ID sequence. Required to keep all VAERS_IDs ever seen so far, doing so as keys in dictionary. Many of these can be non-covid also. Use of dictionaries fast, lists slow. all_ever_seen file was late on the scene and maybe all of the week stuff is no longer needed, just high, lo, range (sequence) and gaps. 
''' global all_never_published, all_ever_seen, never_published_patch_done ''' Tests 1 always there 2 always missing 3 was missing but appears this week 4 new this week 5 missing this week 6 new this week all_never_published = [2, 3] # from file never_published.txt opened earlier week_vids_present = [1, 3, 4, 6] # [x for x in df_data.VAERS_ID.to_list() if x >= covid_earliest_vaers_id] ''' if not all_ever_seen: with open(file_all_ever_seen, 'r') as f: lines = f.readlines() lines = [x.strip() for x in lines] lines = [ int(x) for x in lines if x] all_ever_seen = {x:1 for x in lines} dict_week_vids_present = {x:1 for x in week_vids_present} list_all_never_published = list(all_never_published.keys()) hi_all_never_published = max(list_all_never_published) lo_this_week = min(week_vids_present) hi_this_week = max(week_vids_present) lo_ever = min( min(list_all_never_published), min(all_ever_seen), lo_this_week ) list_range_all_ever = sorted(range(lo_ever, hi_this_week + 1)) # range() is non-inclusive at the top end list_range_week_only = sorted(range(hi_all_never_published + 1, hi_this_week + 1)) # a hack # Patch b/c I'm missing 2020-12-30 drop, these were published but later deleted patch = [ 905000, 905553, 907988, 907989, 908265, 908266, 908269, 908279, 909030, 909617, 910294, 910295, 910297, 910298, 910300, 910301, 910303, 910304, 910305, 910312, 910313, 910575, 910577, 910594, 910601, 910617, 910646, 913144 ] if (not never_published_patch_done) and hi_this_week > min(patch): # any case except if in contiguous run restart and not yet at the 2020-12-30 drop never_published_patch_done = 1 all_ever_seen = {**{x:1 for x in patch}, **all_ever_seen} all_ever_seen = {**dict_week_vids_present, **all_ever_seen} # merge these dictionaries fast python 3.5 and up set_gap_fills = dict_week_vids_present.keys() & all_never_published.keys() remedied_past_all_never_published = [x for x in all_never_published if x not in set_gap_fills] week_gaps_new = [x for x in list_range_week_only if x not in dict_week_vids_present] all_never_published = {x:1 for x in (remedied_past_all_never_published + week_gaps_new )} # new all_never_published print(f' lo_ever {lo_ever}') print(f' hi_all_never_published {hi_all_never_published}') print(f' hi_this_week {hi_this_week}') print(f' week_vids_present {subrange(list(dict_week_vids_present), 8)}') print(f' list_range_all_ever {subrange(list_range_all_ever, 8)}') print(f' list_range_week_only {subrange(list_range_week_only, 8)}') print(f' gaps_filled {subrange(sorted(set_gap_fills), 8)}') print(f' week_gaps_new {subrange(sorted(week_gaps_new), 8)}') print(f'remedied_past_all_never_published {subrange(sorted(remedied_past_all_never_published), 8)}') print(f' all_never_published {subrange(sorted(all_never_published), 8)}') # TODO: all_ever_seen won't be right until it is being saved and a full run from scratch to collect them or other means print(f' VAERS_IDs {lo_this_week:>7} to {hi_this_week:>7} expected {len(list_range_all_ever):>7} all_ever {len(all_ever_seen):>7} gaps {len(all_never_published):>7}') ''' Write all_never_published to file_never_published text file ''' with open(file_never_published, 'w') as f: for x in list(all_never_published.keys()): f.write(f'{x}\n') ''' Write all_ever_seen to file_all_ever_seen text file ''' with open(file_all_ever_seen, 'w') as f: for x in list(all_ever_seen.keys()): f.write(f'{x}\n') def consolidate(files_date_marker): ''' = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = Consolidate -- All in one file but multiple rows per VAERS_ID 
= = = = = = = = = = = = = = = = = = = = = = = = = = = = = = ''' global df_vax, df_data, df_syms_flat, all_never_published print() ; print(f' Consolidation') files_populate_information() ''' Consolidate only if the file doesn't already exist or if reconsolidate is set ''' if not reconsolidate and files_date_marker in files['consolidated']['date']: if files_date_marker in files['consolidated']['keyval']: # already done, use it print(f" Consolidation already done: {files['consolidated']['keyval'][files_date_marker]}") df_vax = pd.DataFrame() # avoid picking up a previous consolidation when flattening return else: print(f" ERROR: Expected {files_date_marker} in files['consolidated']['keyval']") if not reconsolidate and files_date_marker in files['flattened']['date']: print(f' Skipping consolidation because flattened for {files_date_marker} already exists') return ''' Combine all data, vax and symptoms files as one (if not already a file like 2020-12-25_VAERS_CONSOLIDATED.csv in which case just do the treament) ''' print(' Concatenating files, *VAERSDATA.csv, *VAERSVAX.csv, *VAERSSYMPTOMS.csv') df_data = files_concat( glob.glob(dir_working + '/' + '*VAERSDATA.csv' ) ) df_vax = files_concat( glob.glob(dir_working + '/' + '*VAERSVAX.csv' ) ) df_syms = files_concat( glob.glob(dir_working + '/' + '*VAERSSYMPTOMS.csv') ) ''' Remove all reports prior to the first jab 2020-12-14 902418 First report after public rollout is next day: https://www.medalerts.org/vaersdb/findfield.php?IDNUMBER=902418&WAYBACKHISTORY=ON 2020-10-02 896636 may have been from a trial: https://www.medalerts.org/vaersdb/findfield.php?IDNUMBER=896636&WAYBACKHISTORY=ON VAERS ID: 896636 VAERS Form: 2 Age: 47.0 Sex: Female Location: South Carolina Vaccinated: 2020-09-28 Onset: 2020-10-02 Submitted: 0000-00-00 Entered: 2020-11-14 Vaccination / Manufacturer (1 vaccine) Lot / Dose Site / Route COVID19: COVID19 (COVID19 (MODERNA)) / MODERNA - / UNK LA / SYR 2020-12-10 970043 Died same day, reported later: https://www.medalerts.org/vaersdb/findfield.php?IDNUMBER=970043&WAYBACKHISTORY=ON ''' if vids_limit: df_data = df_data.loc[ df_data.VAERS_ID.isin(vids_limit) ] df_vax = df_vax .loc[ df_vax .VAERS_ID.isin(vids_limit) ] df_syms = df_syms.loc[ df_syms.VAERS_ID.isin(vids_limit) ] len_before = len(df_data) df_data = df_data.loc[ df_data.VAERS_ID >= covid_earliest_vaers_id ] df_vax = df_vax .loc[ df_vax .VAERS_ID >= covid_earliest_vaers_id ] df_syms = df_syms.loc[ df_syms.VAERS_ID >= covid_earliest_vaers_id ] print() ; print(f'{len_before - len(df_data):>10} records removed prior to the first covid report (covid_earliest_vaers_id {covid_earliest_vaers_id})') print(f'{len(df_data):>10} reports to work with (unique VAERS_IDs)') ; print() do_never_published( df_data.VAERS_ID.to_list() ) ''' VAERS_IDs can have multiple records/doses/lots in each report but at least one covid19. 
''' dfv = df_vax.copy() dfv['doses'] = dfv.VAERS_ID.map(dfv.VAERS_ID.value_counts()) dfv_single_doses = dfv.loc[ dfv.doses.eq(1) ] dfv_singles_w_covid = dfv_single_doses.loc[ dfv_single_doses.VAERS_ID.isin(dfv_single_doses.loc[ dfv_single_doses.VAX_TYPE.str.contains('COVID') ].VAERS_ID.to_list()) ].sort_values(by='VAERS_ID') dfv_multpl_doses = dfv.loc[ dfv.doses.ge(2) ] dfv_multiples_w_covid = dfv_multpl_doses.loc[ dfv_multpl_doses.VAERS_ID.isin(dfv_multpl_doses.loc[ dfv_multpl_doses.VAX_TYPE.str.contains('COVID') ].VAERS_ID.to_list()) ].sort_values(by='VAERS_ID') del dfv dfv_covid_type_both = pd.concat([dfv_singles_w_covid, dfv_multiples_w_covid]) dfv_covid_type_both = dfv_covid_type_both.drop_duplicates() df_data_other_for_covid_search = df_data.loc[ ~df_data.VAERS_ID.isin(dfv_covid_type_both.VAERS_ID) ] df_data_covid_other_found = df_data_other_for_covid_search.loc[ df_data_other_for_covid_search.SYMPTOM_TEXT.str.contains(r'Pfizer|Moderna|Janssen', re.IGNORECASE, na=False) ] #df_data_other_for_covid_search, df_data_covid_all = move_rows(df_data_covid_other_found, df_data_other_for_covid_search, df_data_covid_all) print(f'{len(df_data_covid_other_found):>10} additional reports captured with Pfizer|Moderna|Janssen in SYMPTOM_TEXT although not officially type COVID') vids_vax__covid_type_both = dfv_covid_type_both .VAERS_ID.to_list() vids_data_covid_other_found = df_data_covid_other_found.VAERS_ID.to_list() vids_all_covid_list = sorted( set(vids_vax__covid_type_both + vids_data_covid_other_found) ) # Reducing to only covid reports df_data = df_data.loc[ df_data.VAERS_ID.isin(vids_all_covid_list) ] df_vax = df_vax .loc[ df_vax .VAERS_ID.isin(vids_all_covid_list) ] df_syms = df_syms.loc[ df_syms.VAERS_ID.isin(vids_all_covid_list) ] #df_covid_type_all = pd.concat([dfv_covid_type_both, df_data_covid_other_found]) #df_data = df_covid_type_all.drop_duplicates() #del df_covid_type_all print() #''' Filter to just those in the two lists ''' #len_before = len(df_data) #df_data = df_data.loc[ df_data.VAERS_ID.isin(dfv_singles_w_covid.VAERS_ID) | df_data.VAERS_ID.isin(dfv_multiples_w_covid.VAERS_ID) ] #if (len(df_data) - len_before): # print(f'{(len_before - len(df_data)):>10} dropped in df_data due to no covid VAX_TYPE involved in the report') # #len_before = len(df_vax) #df_vax = df_vax.loc[ df_vax.VAERS_ID.isin(dfv_singles_w_covid.VAERS_ID) | df_vax.VAERS_ID.isin(dfv_multiples_w_covid.VAERS_ID) ] #if (len(df_vax) - len_before): # print(f'{(len_before - len(df_vax)):>10} dropped in df_vax due to no covid VAX_TYPE involved in the report') # #len_before = len(df_syms) #df_syms = df_syms.loc[ df_syms.VAERS_ID.isin(dfv_singles_w_covid.VAERS_ID) | df_syms.VAERS_ID.isin(dfv_multiples_w_covid.VAERS_ID) ] #if (len(df_data) - len_before): # print(f'{(len_before - len(df_syms)):>10} dropped in df_syms due to no covid VAX_TYPE involved in the report') ''' KEEP: Could be useful some day ... In any screened out from consolidate due to no covid entry in the report. 
A way to distill down, make unique words of SYMPTOM_TEXT for searching known lot codes for reports flying under the radar from improperly not labeled covid df_syms = df_to_write[['VAERS_ID', 'SYMPTOM_TEXT']] df_syms = df_syms.copy() df_syms['sym_word_search'] = df_syms.SYMPTOM_TEXT.str.lower().replace(r'[^a-zA-Z0-9|]', ' ', False, regex=True) df_syms['sym_word_search'] = df_syms.sym_word_search.str.replace(r'\s\s+', ' ', False, regex=True) df_syms['sym_word_search'] = df_syms.sym_word_search.apply(lambda x: x.split(' ')).apply(set).apply(lambda x: sorted(x)) df_syms['sym_word_search'] = df_syms.sym_word_search.apply(lambda x: ' '.join(x)) ''' len_before = len(df_data) df_data = df_data.drop_duplicates(subset='VAERS_ID') if (len(df_data) - len_before): print(f'{(len_before - len(df_data)):>10} duplicates dropped in df_data on VAERS_IDs') print(f'{len(set(df_data.VAERS_ID.to_list())):>10} covid reports to work with') ; print() ''' Remove repeat sentences ''' print(' Repeat sentence removal in SYMPTOM_TEXT, showing each next larger if any (takes time)') # Will anyone be unhappy that SYMPTOM_TEXT are no longer pristine but deduped? if len(df_data): df_data['SYMPTOM_TEXT'] = df_data.apply(lambda row: symptoms_dedupe_repeat_sentences( row[['VAERS_ID', 'SYMPTOM_TEXT']] ), axis=1) print() to_print = f"{stats['sentence_dedupe_count']:>10} SYMPTOM_TEXT field repeat sentences deduped in " to_print += f"{stats['sentence_dedupe_reports']} reports, max difference {stats['sentence_dedupe_max_bytes']} bytes in VAERS_ID {stats['sentence_dedupe_max_vid']}" print(to_print) ; print() len_before = len(df_data) df_data = df_data.drop_duplicates(subset='VAERS_ID') if (len(df_data) - len_before): print(f'{(len_before - len(df_data)):>10} duplicates dropped in df_data on VAERS_IDs, expected none') ''' Shorten some fields in VAX (the change to title case indicates I've touched it) ''' print(' Shortening some field values in VAX_NAME, VAX_MANU') ; print() ''' VAX_TYPE COVID19 COVID19-2 FLU4 VAX_NAME Shorter COVID19 (COVID19 (MODERNA)) C19 Moderna COVID19 (COVID19 (MODERNA BIVALENT)) C19 Moderna BIVALENT COVID19 (COVID19 (PFIZER-BIONTECH)) C19 Pfizer-BionT COVID19 (COVID19 (PFIZER-BIONTECH BIVALENT)) C19 Pfizer-BionT BIVALENT COVID19 (COVID19 (JANSSEN)) COVID19 (COVID19 (UNKNOWN)) COVID19 (COVID19 (NOVAVAX)) INFLUENZA (SEASONAL) (FLULAVAL QUADRIVALENT) Transitional with parens removed first step COVID19 COVID19 MODERNA COVID19 COVID19 MODERNA BIVALENT COVID19 COVID19 PFIZER-BIONTECH COVID19 COVID19 PFIZER-BIONTECH BIVALENT COVID19 COVID19 JANSSEN INFLUENZA SEASONAL FLULAVAL QUADRIVALENT ''' df_vax = df_vax.copy() df_vax['VAX_NAME'] = df_vax.VAX_NAME.str.replace(r'[\)\(]' , r'' , False, regex=True) # COVID19 COVID19 ... and/or COVID19-2 COVID19-2 ... 
except those so far don't show up in VAX_NAME, just in case df_vax['VAX_NAME'] = df_vax.VAX_NAME.str.replace(r'(?:COVID(\S+)\s)+' , r'C\1 ' , False, regex=True) #df_vax['VAX_NAME'] = df_vax.VAX_NAME.str.replace(r'(?:COVID(\d+)(\-\d+)*\s)+' , r'C\1 \2' , False, regex=True) df_vax['VAX_NAME'] = df_vax.VAX_NAME.str.replace(r'PFIZER.BION.*' , r'Pfizer-BionT' , False, regex=True) df_vax['VAX_NAME'] = df_vax.VAX_NAME.str.replace(r'MODERNA' , r'Moderna' , False, regex=True) df_vax['VAX_NAME'] = df_vax.VAX_NAME.str.replace(r'JANSSEN' , r'Janssen' , False, regex=True) df_vax['VAX_NAME'] = df_vax.VAX_NAME.str.replace(r'INFLUENZA' , r'Flu' , False, regex=True) df_vax['VAX_NAME'] = df_vax.VAX_NAME.str.replace(r'VACCINE NOT SPECIFIED' , r'Not Specified' , False, regex=True) df_vax['VAX_MANU'] = df_vax.VAX_MANU.str.replace(r'UNKNOWN MANUFACTURER' , r'Unknown' , False, regex=True) df_vax['VAX_MANU'] = df_vax.VAX_MANU.str.replace(r'PFIZER.BION.*' , r'Pfizer-BionT' , False, regex=True) df_vax['VAX_MANU'] = df_vax.VAX_MANU.str.replace(r'MODERNA' , r'Moderna' , False, regex=True) df_vax['VAX_MANU'] = df_vax.VAX_MANU.str.replace(r'JANSSEN' , r'Janssen' , False, regex=True) df_vax['VAX_MANU'] = df_vax.VAX_MANU.str.replace(r'PF.*WYETH' , r'Pfizer-Wyeth' , False, regex=True) df_vax['VAX_MANU'] = df_vax.VAX_MANU.str.replace(r'NOVARTIS.*' , r'Novartis' , False, regex=True) df_vax['VAX_MANU'] = df_vax.VAX_MANU.str.replace(r'SANOFI.*' , r'Sanofi' , False, regex=True) df_vax['VAX_MANU'] = df_vax.VAX_MANU.str.replace(r'MERCK.*' , r'Merck' , False, regex=True) df_vax['VAX_MANU'] = df_vax.VAX_MANU.str.replace(r'PROTEIN.*' , r'Protein' , False, regex=True) df_vax['VAX_MANU'] = df_vax.VAX_MANU.str.replace(r'GLAXO.*' , r'Glaxo' , False, regex=True) df_vax['VAX_MANU'] = df_vax.VAX_MANU.str.replace(r'SEQIRUS.*' , r'Seqirus' , False, regex=True) df_vax['VAX_MANU'] = df_vax.VAX_MANU.str.replace(r'BERNA.*' , r'Berna' , False, regex=True) # To avoid jumbling in the case of tiny differences between drops, apparently this is necessary df_vax = df_vax.sort_values(by=['VAERS_ID', 'VAX_LOT', 'VAX_SITE', 'VAX_DOSE_SERIES', 'VAX_TYPE', 'VAX_MANU', 'VAX_ROUTE', 'VAX_NAME']) print(' Merging DATA into VAX') df_data_vax = pd.merge(df_vax.astype(str), df_data.astype(str), on='VAERS_ID', how='left') if nan_alert(df_data_vax): pause=1 df_data_vax = df_data_vax.fillna('') print(f'{len(df_data_vax):>10} rows in df_data_vax') ''' = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = symptom_entries = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = ''' ''' Aggregate symptoms into new column called symptom_entries, and merge for all symptoms covered as a single string ''' print(' Aggregating symptoms into symptom_entries string, new column') df_syms_flat = symptoms_file_entries_append_to_symptom_text(df_syms) ''' = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = Consolidation Save result into one file like ... 2020-12-25_VAERS_CONSOLIDATED.csv More than one row per VAERS_ID for the various doses/lots, thus duplicates to symptom_entries etc. 
= = = = = = = = = = = = = = = = = = = = = = = = = = = = = = ''' ''' symptom_entries to consolidated ''' print(' Merging symptom_entries into df_data_vax') df_data_vax_syms_consolidated = pd.merge(df_data_vax.astype(str), df_syms_flat.astype(str)[['VAERS_ID', 'symptom_entries']], on='VAERS_ID', how='left') #nan_alert(df_data_vax_syms_consolidated) # nan in symptom_entries in 1106891 on 03/10/2021 as the symptoms file has no entry line for it even tho Current Illness: Atrial Fibrillation df_data_vax_syms_consolidated = df_data_vax_syms_consolidated.fillna('') print(f'{len(df_data_vax_syms_consolidated):>10} rows in df_data_vax_syms_consolidated') ''' Make date columns sortable for the user in spreadsheet like 2020-12-14 ''' print('Fixing dates to sortable like 2022-11-01') for col in ['DATEDIED', 'VAX_DATE', 'RPT_DATE', 'RECVDATE', 'TODAYS_DATE', 'ONSET_DATE']: if col in df_data_vax_syms_consolidated.columns: df_data_vax_syms_consolidated[col] = pd.to_datetime(df_data_vax_syms_consolidated[col]).dt.strftime('%Y-%m-%d') # makes some nans df_data_vax_syms_consolidated = df_data_vax_syms_consolidated.fillna('') filename_consolidated = dir_consolidated + '/' + files_date_marker + '_VAERS_CONSOLIDATED.csv' print(f' Saving result into one file: {filename_consolidated}') write_to_csv(df_data_vax_syms_consolidated, filename_consolidated) print() print(f' Consolidation of {files_date_marker} done') return def symptoms_dedupe_repeat_sentences(df_vid_sym): ''' Remove repeat sentences in VAERS SYMPTOM_TEXT fields Prints each time a larger change occurs. Originated at https://stackoverflow.com/a/40353780/962391 ''' vid = df_vid_sym[0] content = str(df_vid_sym[1]) # sometimes not string ''' Some of my input files filled in by various people evidently might have gone through some processing in Excel by them. As a result, fields have double quotes injected, they have to be removed or big trouble, takes time but required. ''' content = re.sub('"', '', content) # string_split_multiple_delims = re.split('[:;,\|]\s*', content) # doesn't preserve them list_in = re.split(r'[^a-zA-Z0-9\'\s]', content) # split on everything except alphanumeric, space etc list_in = [x for x in list_in if len(x) >= 40] # just the longer strings dupes_list = [x for x, count in Counter(list_in).items() if count > 1] if not dupes_list: return content # Doing dedupe counts only once per VAERS_ID # Replace all but the first with ...^... , low tech solution: https://stackoverflow.com/a/53239962/962391 total_bytes = 0 for dup in dupes_list: content = content.replace(dup, ' ...^... ').replace(' ...^... ', dup, 1) total_bytes += len(dup) if vid not in all_symptoms_deduped: stats['sentence_dedupe_bytes'] += len(dup) if total_bytes > stats['sentence_dedupe_max_bytes']: stats['sentence_dedupe_max_bytes'] = total_bytes stats['sentence_dedupe_max_vid' ] = vid # stats if vid not in all_symptoms_deduped: # count each report only once, first time seen stats['sentence_dedupe_reports'] += 1 # count of reports that ever had any repeat sentences all_symptoms_deduped[vid] = 1 # for unique vid tracking count_of_replacements = content.count(' ...^... 
') if count_of_replacements: stats['sentence_dedupe_count'] += count_of_replacements return content # modified def symptoms_file_entries_append_to_symptom_text(df_symfile): ''' Append symptoms entries from SYMPTOMS files to SYMPTOM_TEXT in symptoms column Files like 2023VAERSSYMPTOMS.csv ''' for col in list(df_symfile.copy().columns): # remove the version columns if 'VERSION' in col: del df_symfile[col] del col ''' Combine the symptoms column values to one string in a column called 'symptom_entries' Step 1: Five columns to just one ''' print(' Combining symptoms column items. Grouping by VAERS_ID ...') cols_symfile = sorted(set(df_symfile.columns) - set(['VAERS_ID'])) print(' Appending each symptom in new column called symptom_entries') df_symfile['symptom_entries'] = ['_|_'.join(x) for x in np.sort(df_symfile[cols_symfile])] # Works. Combining the 5 columns as 1, keeping their order. ''' Step 2: Multiple VAERS_ID rows to just 1 row each ''' df_symfile = df_symfile.reset_index(drop=True) # avoid sort here df_symfile = df_symfile[['VAERS_ID', 'symptom_entries']] df_symfile = df_symfile.astype(str).groupby('VAERS_ID').agg(list).applymap('_|_'.join).reset_index() #df_symfile = df_symfile.groupby('VAERS_ID').agg(lambda grp: '_|_'.join(grp.unique())).reset_index() # do not use drop=True here, it drops VAERS_ID print(' Cleaning multiple delimiters due to empty columns') df_symfile['symptom_entries'] = df_symfile.symptom_entries.str.replace(r'\s*_\|_\s*(?:\s*_\|_\s*)+\s*', '_|_', False, regex=True) # multiples to single (where empty columns) ''' _|_Always at beginning and end_|_ ... like a box or cell ''' df_symfile['symptom_entries'] = df_symfile.symptom_entries.str.replace(r'^(?!_\|_)(.*)$', r'_|_\1', False, regex=True) # _|_ at start of line always df_symfile['symptom_entries'] = df_symfile.symptom_entries.str.replace(r'^(.*)(?!_\|_)$', r'\1_|_', False, regex=True) # _|_ at end of line always df_long_cells = df_symfile.loc[ df_symfile.symptom_entries.str.len() >= 32720 ] # "Both . xlsx and . csv files have a limit of 32,767 characters per cell" if len(df_long_cells): print(f'{len(df_long_cells):>10} over 32720 in length to be truncated') df_long_cells['symptom_entries'] = df_long_cells.symptom_entries.str.replace(r'^(.{32720,}).*', r'\1 \[truncated, Excel cell size limit 32,767\]', regex=True) df_symfile = copy_column_from_1_to_2_per_vids('symptom_entries', df_long_cells, df_symfile, df_long_cells.VAERS_ID.to_list()) # TODO: This was 'symptoms', was a bug, right? 
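''' Illustration (not executed): a minimal sketch of the symptom_entries aggregation above on toy data,
    using the same _|_ delimiter convention. The SYMPTOM1/SYMPTOM2 column names are hypothetical
    stand-ins for whatever symptom columns the SYMPTOMS file provides.
import pandas as pd
toy = pd.DataFrame({'VAERS_ID': [1, 1, 2],
                    'SYMPTOM1': ['Headache', 'Chills', 'Rash'],
                    'SYMPTOM2': ['Pyrexia', '', '']})
cols = ['SYMPTOM1', 'SYMPTOM2']
toy['symptom_entries'] = toy[cols].agg('_|_'.join, axis=1)                                      # several columns to one string per row
toy = toy.groupby('VAERS_ID')['symptom_entries'].agg('_|_'.join).reset_index()                  # several rows to one per VAERS_ID
toy['symptom_entries'] = '_|_' + toy.symptom_entries + '_|_'                                    # box each cell with the delimiter
toy['symptom_entries'] = toy.symptom_entries.str.replace(r'(?:_\|_){2,}', '_|_', regex=True)    # collapse runs left by empty columns
# VAERS_ID 1 -> _|_Headache_|_Pyrexia_|_Chills_|_        VAERS_ID 2 -> _|_Rash_|_
'''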
del df_long_cells return df_symfile def flatten(files_date_marker): ''' = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = Flattening -- Only one row per VAERS_ID = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = ''' global df_vax, df_data, df_syms_flat, df_flat_2 print() print(f' Flattening') files_populate_information() ''' Flatten only if the file doesn't already exist or if reflatten is set ''' if not reflatten and files_date_marker in files['flattened']['date']: if files_date_marker in files['flattened']['keyval']: # already done, use it print(f" Flattening already done: {files['flattened']['keyval'][files_date_marker]}") ; print() return else: print(f" ERROR: Expected {files_date_marker} in files['flattened']['keyval']") pull_vax_records = 0 file_consolidated = '' if not len(df_vax): # already exists if consolidate just created it # vax file records from consolidate file if files_date_marker in files['consolidated']['keyval']: file_consolidated = files['consolidated']['keyval'][files_date_marker] print(f" Pulling vax records from previously consolidate file {file_consolidated}") pull_vax_records = 1 else: print(f" ERROR: Expected {files_date_marker} in files['consolidated']['keyval']") if pull_vax_records: df_consolidated = open_file_to_df(file_consolidated, doprint=1).fillna('') columns_vax = ['VAERS_ID', 'VAX_TYPE', 'VAX_MANU', 'VAX_LOT', 'VAX_DOSE_SERIES', 'VAX_ROUTE', 'VAX_SITE', 'VAX_NAME'] # Note VAX_DATE is in data, not vax file columns_consolidated = df_consolidated.columns df_vax = df_consolidated[columns_vax].astype(str) columns_data = ['VAERS_ID'] + list( set(columns_consolidated) - set(columns_vax) ) df_data = df_consolidated[columns_data].astype(str) del df_data['symptom_entries'] # added back in later df_syms_flat = df_consolidated.astype(str)[['VAERS_ID', 'symptom_entries']] del file_consolidated, pull_vax_records, columns_vax, columns_data ''' Aggregate VAX for single VAERS_ID ''' print(' Aggregate/flatten VAX items. 
Grouping by VAERS_ID') df_vax = df_vax.reset_index(drop=True) df_vax_flat = df_vax.groupby('VAERS_ID').agg(list).applymap('|'.join).reset_index() len_before = len(df_vax_flat) df_vax_flat = df_vax_flat.drop_duplicates(subset=df_vax_flat.columns, keep='first') if (len(df_vax_flat) - len_before): print(f'\n\n\n\n\t\t\t\t {(len_before - len(df_vax_flat)):>10} duplicates dropped in df_vax_flat, THERE SHOULD BE NONE\n\n\n\n') print(f'{len(df_vax_flat):>10} rows in df_vax_flat') ; print() ''' Combine DATA into VAX (has multiple records for each VAERS_ID, when more than one dose is listed) ''' print(' Merging DATA into VAX flattened') df_data_vax_flat = pd.merge(df_vax_flat.astype(str), df_data.astype(str), on='VAERS_ID', how='left') if nan_alert(df_data_vax_flat): pause=1 df_data_vax_flat = df_data_vax_flat.fillna('') print(f'{len(df_data_vax_flat):>10} rows in df_data_vax_flat') ; print() len_before = len(df_data_vax_flat) df_data_vax_flat = df_data_vax_flat.drop_duplicates(subset=df_data_vax_flat.columns, keep='first') if (len(df_data_vax_flat) - len_before): print(f'{(len_before - len(df_data_vax_flat)):>10} duplicates dropped in df_data_vax_flat') print(f'{len(df_data_vax_flat):>10} rows in df_data_vax_flat') ; print() ''' symptom_entries to flattened ''' print(' Merging symptom_entries into df_data_vax_syms_flat') df_data_vax_syms_flat = pd.merge(df_data_vax_flat.astype(str), df_syms_flat.astype(str)[['VAERS_ID', 'symptom_entries']], on='VAERS_ID', how='left') df_data_vax_syms_flat = df_data_vax_syms_flat.fillna('') # there are nans like 1106891 with no record for it in ...symptoms.csv len_before = len(df_data_vax_syms_flat) # TODO: Understand the reason for the dupes df_data_vax_syms_flat = df_data_vax_syms_flat.drop_duplicates(subset=df_data_vax_syms_flat.columns, keep='first') if (len(df_data_vax_syms_flat) - len_before): print(f'{(len_before - len(df_data_vax_syms_flat)):>10} duplicates dropped in df_data_vax_syms_flat') ; print() ''' Save result into one file like ... 2020-12-25_VAERS_FLATTENED.csv ''' filename_flattened = dir_flattened + '/' + files_date_marker + '_VAERS_FLATTENED.csv' print(f' Saving result into one file: {filename_flattened}') ; print() write_to_csv(df_data_vax_syms_flat, filename_flattened) print(f'{len(df_data_vax_syms_flat):>10} rows in {filename_flattened}') ; print() print(f' Flattening of {files_date_marker} done') ; print() df_flat_2 = df_data_vax_syms_flat df_done['flat2'] = files_date_marker return ''' Example of examining the data ... 
While in debugger, this will open a dataframe in Excel, LibreOffice etc write_to_csv(df_changes, , open=1) df_consol = open_file_to_df('vaers_consolidated/2021-12-24_VAERS_CONSOLIDATED.csv', doprint=0) df_consol.VAERS_ID.value_counts() df_consol.loc[df_consol.VAERS_ID.eq(1900339)] df_flat = open_file_to_df('vaers_flattened/2021-12-24_VAERS_FLATTENED.csv', doprint=0) df_flat.VAERS_ID.value_counts() df_flat.loc[df_flat.VAERS_ID.eq(1900339)] df_changes = open_file_to_df(f'{dir_changes}/2021-12-24_VAERS_FLATFILE.csv', doprint=0) df_changes.VAERS_ID.value_counts() df_changes.loc[df_changes.VAERS_ID.eq(1900339)] Some VAERS_ID in 2021-01-07 for example with multiple lots/doses (multiple rows in VAX file) for testing 903999 903157 903848 903867 904522 905340 906428 907571 907776 907837 908431 908448 908763 909370 909520 910580 919620 18 doses: https://medalerts.org/vaersdb/findfield.php?IDNUMBER=1900339&WAYBACKHISTORY=ON ''' def get_flattened(previous_or_working, raw_date): ''' Flattened file, previous or [newest] working [for compare] Depends on the files dictionary having been populated. ''' flattened_all_dates = sorted( files['flattened']['date'] ) if flattened_all_dates: if previous_or_working == 'previous': candidates_previous = [ x for x in flattened_all_dates if x < raw_date ] if candidates_previous: return files['flattened']['keyval'][ candidates_previous[-1] ] # most recent flattened file before this current being worked on, to compare to else: return('') # can be first run, no previous flattened file, or first matching date_floor elif previous_or_working == 'working': if raw_date in files['flattened']['keyval']: return files['flattened']['keyval'][raw_date] # this current being worked on else: exit(f" ERROR: Expected {raw_date} in files['flattened']['keyval']") else: exit(f'\n\n\n\n\t\t\t ERROR: No flattened files using raw_date {raw_date}, unexpected \n\n\n\n') list_index_of_relative_to = files['input']['date'].index(raw_date) if list_index_of_relative_to == 0: # there's no previous in this case, and no point in doing a compare to itself print(f' ERROR: Index 0, no previous available') if len(files['flattened']['files']) == 0: print(f'\n\n\n\n\t\t\t ERROR: No flattened files, unexpected \n\n\n\n') return('') if len(files['flattened']['files']) == 1: print(f"\n\n\n\n\t\t\t ERROR: Only 1 flattened file, unexpected, it is {files['flattened']['files'][0]} \n\n\n\n") return('') if previous_or_working == 'previous': target = files['flattened']['files'][-2] elif previous_or_working == 'working': target = files['flattened']['files'][-1] return target def copy_column_from_1_to_2_per_vids(column, df1, df2, vids): # vids are VAERS_IDs df2.loc[df2.VAERS_ID.isin(vids), column] = df2['VAERS_ID'].map( pd.Series(df1[column].values, index=df1.VAERS_ID).to_dict() ) return df2 def compare(this_dir_date): ''' Compare based on flattened (one row per VAERS_ID with each row containing all of the information) saving modifications to changes column. Add reports to result set when not already there. 
VAX file columns: VAERS_ID VAX_TYPE VAX_MANU VAX_LOT VAX_DOSE_SERIES VAX_ROUTE VAX_SITE VAX_NAME There are multiple entries with same VAERS_ID, different VAX_LOT and VAX_DOSE_SERIES ''' global df_changes_1, df_flat_1 modified_reports = 0 #columns_added = ['cell_edits', 'changes', 'symptom_entries', 'status'] #columns_all = columns_vaers + columns_added ; files_populate_information() if 0 and this_dir_date == date_floor: # neutered, forgot why lol print(f" This date {this_dir_date} is the date_floor {date_floor}, nothing to compare yet, returning") ; print() return file_flattened_previous = get_flattened('previous', this_dir_date) # like file_flattened_working = get_flattened('working' , this_dir_date) # like if not file_flattened_previous: print() ; print(f' No flattened file to compare prior to this, {this_dir_date}, skip compare') file_for_copy_original = dir_changes + '/' + this_dir_date + '_VAERS_FLATFILE.csv' print(f' Due to first drop, copying current {file_flattened_working} to: {file_for_copy_original}') ; print() shutil.copyfile(file_flattened_working, file_for_copy_original) return previous_date = date_from_filename(file_flattened_previous) if not previous_date: # first run? exit('No previous_date, cannot continue') #df_flat_prv = open_file_to_df(file_flattened_previous, doprint=1) if len(df_flat_1): print(f" Using flat {df_done['flat1']} already in memory, {len(df_flat_1)} rows") df_flat_prv = df_flat_1.copy() # the one finished previously, carryover else: df_flat_prv = open_file_to_df(file_flattened_previous, doprint=1) if len(df_flat_2): print(f" Using flat {df_done['flat2']} already in memory, {len(df_flat_2)} rows") ; print() df_flat_new = df_flat_2.copy() # the one just finished, carryover else: df_flat_new = open_file_to_df(file_flattened_working, doprint=1) df_flat_1 = df_flat_new.copy() # for next time as prv df_done['flat1'] = this_dir_date # There are nans, blank fields are nan on open (not a number, meaning null, nothing) df_flat_prv = df_flat_prv.fillna('') ; df_flat_new = df_flat_new.fillna('') ''' cell_edits and changes fields only exist in the changes file Have to be pulled from there to be updated when applicable ''' print() ; print(f' Previous changes file for changes, cell_edits and status columns') ''' Previous changes filename, conditions, for compare ''' if len(df_changes_1): print(f" Using changes {df_done['changes']} already in memory") df_changes_prv = df_changes_1.copy() # the one finished previously in this function, carryover elif len( files['changes']['date'] ): dates_changes = files['changes']['date'] date_prv_candidates = [x for x in dates_changes if x < this_dir_date] date_changes_prv = date_prv_candidates[-1] # most recent filename_changes_previous = files['changes']['keyval'][date_changes_prv] # like df_changes_prv = open_file_to_df(filename_changes_previous, doprint=1) del dates_changes, date_prv_candidates, date_changes_prv, filename_changes_previous else: ''' Resort to initializing with flattened. TODO: Remove this as other code (copyfile above) means this will never happen now? 
''' print(f' Due to first CHANGES, using {file_flattened_previous}') df_changes_prv = df_flat_prv.copy() # If first run or restart, it needs these columns since only changes files contain these columns if 'cell_edits' not in df_changes_prv or 'changes' not in df_changes_prv or 'status' not in df_changes_prv: if 'cell_edits' not in df_changes_prv.columns: df_changes_prv['cell_edits'] = 0 if 'changes' not in df_changes_prv.columns: df_changes_prv['changes' ] = '' if 'status' not in df_changes_prv.columns: df_changes_prv['status' ] = '' if nan_alert(df_changes_prv): pause=1 del file_flattened_previous, file_flattened_working df_flat_prv = df_flat_prv .astype(str) # merge below hit error on small test with VAX_DOSE_SERIES as int in df_flat_prv, avoiding that. df_flat_new = df_flat_new .astype(str) df_changes_prv = df_changes_prv.astype(str) df_flat_prv ['VAERS_ID'] = df_flat_prv ['VAERS_ID'] .astype('float64').astype(int) df_flat_new ['VAERS_ID'] = df_flat_new ['VAERS_ID'] .astype('float64').astype(int) df_changes_prv['VAERS_ID'] = df_changes_prv['VAERS_ID'] .astype('float64').astype(int) df_changes_prv['cell_edits'] = df_changes_prv['cell_edits'].astype('float64').astype(int) if nan_alert(df_flat_prv) or nan_alert(df_flat_new) or nan_alert(df_changes_prv): pause=1 df_flat_prv_ori = df_flat_prv.copy() ; df_flat_new_ori = df_flat_new.copy() print() print(f'{" "*30} Comparing') print() ; print(f'{" "*30} {previous_date} v. {this_dir_date}') print() print(f'{ len(df_flat_new):>10} this drop total covid') print(f'{ len(df_flat_prv):>10} previous total covid') ''' Initialize df_edits, starting with df_flat_new, importing df_changes_prv columns for matching rows, adding any in df_changes_prv not present in df_edits, then they will be changed in some cases later, eventually going to df_changes_done ''' df_changes_prv = move_column_forward(df_changes_prv, 'changes') df_changes_prv = move_column_forward(df_changes_prv, 'status') df_changes_prv = move_column_forward(df_changes_prv, 'cell_edits') df_changes_prv = df_changes_prv.sort_values(by=['cell_edits', 'status', 'changes'], ascending=False) df_edits = df_flat_new.copy() df_changes_done = pd.DataFrame(columns=list(df_changes_prv.columns)) df_edits['cell_edits'] = 0 ; df_edits['changes'] = '' ; df_edits['status'] = '' list_both_new_and_changes_vids = df_flat_new.loc[ df_flat_new.VAERS_ID.isin(df_changes_prv.VAERS_ID) ].VAERS_ID.to_list() df_edits = copy_column_from_1_to_2_per_vids('cell_edits', df_changes_prv, df_edits, list_both_new_and_changes_vids) df_edits = copy_column_from_1_to_2_per_vids('changes' , df_changes_prv, df_edits, list_both_new_and_changes_vids) df_edits = copy_column_from_1_to_2_per_vids('status' , df_changes_prv, df_edits, list_both_new_and_changes_vids) del list_both_new_and_changes_vids # Bring in the previous changes etc ''' creates even VAERS_ID nans ... 
df_edits.loc[df_edits.VAERS_ID.isin(df_changes_prv.VAERS_ID), ['VAERS_ID', 'cell_edits', 'changes', 'status']] = df_changes_prv[['VAERS_ID', 'cell_edits', 'changes', 'status']] ''' df_edits = move_column_forward(df_edits, 'changes') df_edits = move_column_forward(df_edits, 'status') df_edits = move_column_forward(df_edits, 'cell_edits') df_edits = df_edits.copy().sort_values(by=['cell_edits', 'status', 'changes'], ascending=False) if nan_alert(df_edits): pause=1 # for breakpoint in debugger # Add any others from df_changes_prv not in df_edits df_changes_prv_not_in_edits = df_changes_prv.loc[ ~df_changes_prv.VAERS_ID.isin(df_edits.VAERS_ID) ] df_edits = pd.concat([df_edits, df_changes_prv_not_in_edits]) ''' Identical. prv and new Set cell_edits 0 and nothing in changes column or status. Move to df_changes_done (removing from df_flat_prv and df_flat_new) ''' df_merged = df_flat_prv.merge(df_flat_new, indicator=True, how='outer', suffixes=('1', '2',)) # comparing them df_identical = df_merged.loc[ df_merged._merge.eq('both') ] # add these, done del df_identical['_merge'] vids_identical = df_identical.VAERS_ID.to_list() df_edits, df_changes_done = move_rows(df_edits.loc[ df_edits.VAERS_ID.isin(vids_identical) ], df_edits, df_changes_done) # Moved identicals to done, also remove these IDs in prv and new df_flat_prv = df_flat_prv.loc[ ~df_flat_prv.VAERS_ID.isin(vids_identical) ] df_flat_new = df_flat_new.loc[ ~df_flat_new.VAERS_ID.isin(vids_identical) ] print(f'{len(vids_identical):>10} identical set aside') del df_merged, df_identical, vids_identical print(f'{ len(df_flat_new):>10} this drop to work with') print(f'{ len(df_flat_prv):>10} previous to work with') print(f'{ len(df_flat_new) - len(df_flat_prv):>10} difference') ''' Deleted. Flat prv-only They all exist in df_edits but not in df_flat_new. Add cell_edits +42 and in status column add Deleted. Move to df_changes_done, removing from df_edits and df_flat_prv, also from df_flat_new as no compare possible of course. ''' list_deleted_vids = df_flat_prv.loc[ ~df_flat_prv.VAERS_ID.isin(df_flat_new.VAERS_ID) ].VAERS_ID.to_list() if len(list_deleted_vids): df_edits.loc[ df_edits.VAERS_ID.isin(list_deleted_vids), 'cell_edits' ] += len(columns_vaers) df_edits.loc[ df_edits.VAERS_ID.isin(list_deleted_vids), 'status' ] += f'Deleted {this_dir_date} ' df_edits, df_changes_done = move_rows(df_edits.loc[df_edits.VAERS_ID.isin(list_deleted_vids)], df_edits, df_changes_done) df_flat_prv = df_flat_prv.loc[ ~df_flat_prv.VAERS_ID.isin(list_deleted_vids) ] # remove them from flat_prv, and they are already not in new stats['deleted'] += len(list_deleted_vids) ''' Greater-Than. Flat new-only high VAERS_ID: df_flat_new VAERS_ID above max df_flat_prv Set cell_edits 0 and nothing in changes or status column. Move to df_changes_done (removing from df_flat_new) ''' list_flat_new_gt_vids = df_flat_new.loc[ df_flat_new.VAERS_ID.gt( df_flat_prv.VAERS_ID.max() ) ].VAERS_ID.to_list() df_edits, df_changes_done = move_rows(df_edits.loc[df_edits.VAERS_ID.isin(list_flat_new_gt_vids)], df_edits, df_changes_done) df_flat_new = df_flat_new.loc[ ~df_flat_new.VAERS_ID.isin(list_flat_new_gt_vids) ] # move them out of df_flat_new, they are now in df_changes_done. ''' Restored. Flat new that were Deleted then Restored They all exist in df_edits and are marked Deleted. Add cell_edits +42 and add status column Restored. Remove them from df_flat_new but keep them in df_edits since there could be other changes also. 
''' df_edits_deleted = df_edits.loc[ df_edits.status.str.contains('Deleted \d{4}\-\d{2}\-\d{2} $') ] # Deleted as last entry df_edits_restored = df_edits_deleted.loc[ df_edits_deleted.VAERS_ID.isin(df_flat_new.VAERS_ID) ] list_edits_restored = [] if len(df_edits_restored): list_edits_restored = df_edits_restored.VAERS_ID.to_list() df_edits = df_edits.copy().sort_values(by=['status', 'changes', 'cell_edits'], ascending=False) df_edits.loc[ df_edits.VAERS_ID.isin(list_edits_restored), 'cell_edits' ] += len(columns_vaers) df_edits.loc[ df_edits.VAERS_ID.isin(list_edits_restored), 'status' ] += f'Restored {this_dir_date} ' # Mark them as Restored df_edits, df_changes_done = move_rows(df_edits.loc[ df_edits.VAERS_ID.isin(list_edits_restored) ], df_edits, df_changes_done) df_flat_new = df_flat_new.loc[ ~df_flat_new.VAERS_ID.isin(list_edits_restored) ] # remove them from df_flat_new being scrutinized df_flat_prv = df_flat_prv.loc[ ~df_flat_prv.VAERS_ID.isin(list_edits_restored) ] # some confusion, just removing, TODO stats['restored'] += len(list_edits_restored) del df_edits_deleted, df_edits_restored ''' Gapfill/Throttled/Delayed/Late. Flat new-only remaining. (VAERS_IDs that didn't show up in in the sequence their week but only later) First appearance, late. Is possible because all greater-than prv.max() have already been moved. Move from df_edits to df_changes_done and remove from df_flat_new ''' list_flat_only_new_vids = df_flat_new.loc[ ~df_flat_new.VAERS_ID.isin( df_changes_prv.VAERS_ID ) ].VAERS_ID.to_list() # had not shown up yet in changes_prv if len(list_flat_only_new_vids): df_edits = df_edits.copy().sort_values(by=['status', 'changes', 'cell_edits'], ascending=False) df_edits.loc[ df_edits.VAERS_ID.isin(list_flat_only_new_vids), 'status' ] += f'Delayed {this_dir_date} ' # staying with the term Delayed for now df_edits, df_changes_done = move_rows(df_edits.loc[df_edits.VAERS_ID.isin(list_flat_only_new_vids)], df_edits, df_changes_done) if nan_alert(df_edits): pause=1 df_flat_new = df_flat_new.loc[ ~df_flat_new.VAERS_ID.isin(list_flat_only_new_vids) ] stats['gapfill'] += len(list_flat_only_new_vids) print(f'{(len(list_flat_new_gt_vids)):>10} new in {this_dir_date}') print(f'{len(list_flat_only_new_vids):>10} delayed this week' ) print(f'{ len(list_deleted_vids):>10} deleted this week kept' ) print(f"{ len(list_edits_restored):>10} restored this week {' '.join([str(x) for x in list_edits_restored])}" ) del list_flat_new_gt_vids, list_flat_only_new_vids, list_deleted_vids, list_edits_restored len_before = len(df_flat_prv) df_flat_prv = df_flat_prv.drop_duplicates(subset=df_flat_prv.columns, keep='last') # Should be none. Overwrite of values now done if (len_before - len(df_flat_prv)): print(f'\n\n\n\n\t\t\t\t {(len_before - len(df_flat_prv)):>10} complete duplicates dropped in df_flat_prv, SHOULD NOT HAPPEN \n\n\n\n') len_before = len(df_flat_new) df_flat_new = df_flat_new.drop_duplicates(subset=df_flat_new.columns, keep='last') # Should be none. Overwrite of values now done if (len_before - len(df_flat_new)): print(f'\n\n\n\n\t\t\t\t {(len_before - len(df_flat_new)):>10} complete duplicates dropped in df_flat_new, SHOULD NOT HAPPEN \n\n\n\n') if prv_new_error_check(df_flat_prv, df_flat_new): pause=1 if nan_alert(df_flat_prv) or nan_alert(df_flat_new): pause=1 ''' Remaining are only common VAERS_IDs and same number of rows containing some fields that vary. ''' if 'SEX' in df_flat_prv.columns: # I object, it is a verb while gender is the scientific M or F noun they were looking for. 
The way it is, the column values could say yes, no or sometimes. df_flat_prv = df_flat_prv.copy() df_flat_new = df_flat_new.copy() df_flat_prv.loc[ df_flat_prv.SEX.eq('U'), 'SEX' ] = '' # loop below looks for from-empty or to-empty changes ... this is unholy df_flat_new.loc[ df_flat_new.SEX.eq('U'), 'SEX' ] = '' df_flat_prv = df_flat_prv.sort_values(by='VAERS_ID') ; df_flat_prv = df_flat_prv.reset_index(drop=True) df_flat_new = df_flat_new.sort_values(by='VAERS_ID') ; df_flat_new = df_flat_new.reset_index(drop=True) df_flat_new = df_flat_new.reindex(df_flat_prv.columns, axis=1) # TODO: Find out how the columns changed order after adding the blanks block of code below # A patch for old dates like 11/03/2022 to 2022-11-03 # THIS DOES NOT HAVE TO BE DONE IN THE FUTURE, TODAY IS JAN 2024 for col in ['DATEDIED', 'VAX_DATE', 'RPT_DATE', 'RECVDATE', 'TODAYS_DATE', 'ONSET_DATE']: if col in df_flat_prv.columns: df_flat_prv[col] = pd.to_datetime(df_flat_prv[col]).dt.strftime('%Y-%m-%d') # makes some nans df_flat_prv = df_flat_prv.fillna('') if len(df_edits): if nan_alert(df_edits): pause=1 vid_cols_affected = {} if not len(df_flat_new): print(f'{str(0):>10} column changes in {len(columns_vaers)} columns') else: print() ; print(' Column value changes') ; print() # section could do with some smarter refactoring, surely #### CDC went back from pipes to commas on 2022-01-07, sweeping in trivial changes, unless my files are corrupted # List of columns changed, skipping cell_edits and changes try: df_all_changed = df_flat_prv.compare(df_flat_new) cols_changed_list = sorted( set( [ x[0] for x in list(df_all_changed.columns) ] ) ) except Exception as e: # note the error, patch by restricting both frames to their common VAERS_IDs, and allow to continue print(f'{line()} {e}') if len(df_flat_prv) > len(df_flat_new): df_flat_prv = df_flat_prv.loc[ df_flat_prv.VAERS_ID.isin(df_flat_new.VAERS_ID) ].reset_index(drop=True) elif len(df_flat_prv) < len(df_flat_new): df_flat_new = df_flat_new.loc[ df_flat_new.VAERS_ID.isin(df_flat_prv.VAERS_ID) ].reset_index(drop=True) df_all_changed = df_flat_prv.compare(df_flat_new) cols_changed_list = sorted( set( [ x[0] for x in list(df_all_changed.columns) ] ) ) del df_all_changed count_cols_altered = len(cols_changed_list) show_col_row_progress = 0 count_cols = 0 ''' Debug utility Specify a VAERS_ID in debug_vids and/or a selected_column to focus on. 
''' debug_vids = [] # list of int VAERS_ID for pause if any and if breakpoint set debug_dates = [] # list of dates like '2021-01-15' for pause if any and if breakpoint set selected_column = '' # like 'SYMPTOM_TEXT' optional for debug for d in debug_dates: if this_dir_date == d: print(f'\n\n\n\n\t\t\t breakpoint for date {d} \n\n\n\n') del debug_dates for col in cols_changed_list: # Find the new changes if selected_column: # debug if col != selected_column: continue count_cols += 1 col_padded = (' ' * (20 - len(col))) + col df_slice_prv = df_flat_prv[['VAERS_ID', col]].sort_values(by='VAERS_ID') df_slice_new = df_flat_new[['VAERS_ID', col]].sort_values(by='VAERS_ID') df_slice_prv = df_slice_prv.reset_index(drop=True) # make these indexes same df_slice_new = df_slice_new.reset_index(drop=True) vids_changed_this_col = df_slice_new.loc[ df_slice_new[col].ne(df_slice_prv[col]) ].VAERS_ID.to_list() df_three_columns = pd.merge( df_slice_prv.loc[df_slice_prv.VAERS_ID.isin(vids_changed_this_col)], df_slice_new.loc[df_slice_new.VAERS_ID.isin(vids_changed_this_col)], on='VAERS_ID', suffixes=('_prv', '_new')) if nan_alert(df_three_columns): pause=1 del df_slice_prv, df_slice_new col_prv = col + '_prv' col_new = col + '_new' count_rows = 0 if debug_vids: debug_vids = [int(x) for x in debug_vids] # making sure and so it doesn't matter, VAERS_ID in df_three_columns are int df_three_columns = df_three_columns.loc[ df_three_columns.VAERS_ID.isin(debug_vids) ] if not len(df_three_columns): continue len_before = len(df_three_columns) df_three_columns = df_three_columns.drop_duplicates(subset=df_three_columns.columns, keep=False) show_dupe_count = 0 # no longer wanting to print this if show_dupe_count and (len(df_three_columns) - len_before): print(f'{(len_before - len(df_three_columns)):>10} duplicates dropped in df_three_columns (columns VAERS_ID, prv and new), should not happen') ''' Special case the 11-11 purge in 2022 to try to increase speed and to reduce the size of run output vastly by marking them in bulk as blanked out These bulk blankings in the purge are not counted in column changes: {} ''' df_3_made_blank = df_three_columns.loc[ df_three_columns[col_prv].ne('') & df_three_columns[col_new].eq('') ] if len(df_3_made_blank) >= 200: vids_list = df_3_made_blank.VAERS_ID.to_list() df_blanked_all = df_edits.loc[ df_edits.VAERS_ID.isin(vids_list) ] print(f' {len(df_3_made_blank):>10} {col} blanked out, noting in bulk, omitting their VAERS_IDs') df_blanked_all = df_blanked_all.copy() # quirky requirement here, what's the point in having df one can't edit without making a copy first? df_blanked_all['cell_edits'] = df_blanked_all['cell_edits'].astype('float64').astype(int) message = f"{len(df_3_made_blank)} rows with {df_3_made_blank[col_prv].astype(str).map(len).sum()} total bytes MADE BLANK" print(f'{" ":>10} {col_padded} {" ":>8} {message:>70} <> [] ') stats['cells_emptied'] += len(vids_list) df_blanked_all['changes_old'] = df_blanked_all['changes'] # col_prv to 'changes_new', a new column df_blanked_all.loc[df_blanked_all.VAERS_ID.isin(vids_list), 'changes_new'] = df_blanked_all['VAERS_ID'].map( pd.Series(df_3_made_blank[col_prv].values, index=vids_list).to_dict()) if nan_alert(df_blanked_all): pause=1 df_blanked_all['changes_a'] = f"{col} {this_dir_date}: " # section clumsy! Help!? 
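# Illustration (comment only): once the pieces are concatenated just below, a bulk-blanked report's changes
# column gains an entry of the form ... SYMPTOM_TEXT 2022-11-11: <previous cell contents> <> []  ...
# i.e. the same "column date: old <> new" layout used for ordinary per-cell edits further down.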
df_blanked_all['changes_b'] = f" <> [] " df_blanked_all['changes'] = df_blanked_all['changes_old'] + df_blanked_all['changes_a'] + df_blanked_all['changes_new'] + df_blanked_all['changes_b'] del df_blanked_all['changes_old'] ; del df_blanked_all['changes_new'] ; del df_blanked_all['changes_a'] ; del df_blanked_all['changes_b'] df_blanked_all = df_blanked_all.reset_index(drop=True) # avoid ... cannot reindex from a duplicate axis ... in copy_column_from_1_to_2_per_vids() df_edits = df_edits.reset_index(drop=True) # no idea why this suddenly became necessary ''' duplicate values in your original index. To find them do this: df_blanked_all[df_blanked_all.index.duplicated()] df_edits[df_edits.index.duplicated()] ''' df_edits.loc[ df_edits.VAERS_ID.isin(vids_list), 'cell_edits' ] += 1 # direct df_edits = copy_column_from_1_to_2_per_vids('changes', df_blanked_all, df_edits, vids_list) # copied into df_edits = copy_column_from_1_to_2_per_vids('status' , df_blanked_all, df_edits, vids_list) df_three_columns = df_three_columns.loc[ ~df_three_columns.VAERS_ID.isin(vids_list) ] # remove them from df_three_columns if nan_alert(df_changes_done): pause=1 if not len(df_three_columns): # ok in this case continue if not len(df_three_columns): print(f'\n\n\n\n\t\t\t df_three_columns HAS NO CHANGES, UNEXPECTED, SOMETHING IS WRONG \n\n\n\n') continue df_three_columns = df_three_columns.sort_values(by=[col_prv, col_new]) ''' Remove same ignoring punctuation. 1. Often vastly reduces the work for diff_context() to do, like on 2021-09-03 from 496837 rows to 172. 2. Ignores for example commas changed to pipe like 1483307 on 2021-09-03 in SYMPTOM_TEXT, was otherwise annoying. CDC went wild on 2021-09-03 for example, changing commas to pipe. Now keeping pipe in VAX_LOT etc but ignoring in other fields with inexplicable changes like ... ALLERGIES 1075898 fentanyl,codeine,steroid,fruits,nuts <> fentanyl|codeine|steroid|fruits|nuts ... change of plan: Just treating DATEs differently, everything else the same, as not much info is lost, otherwise filled with trivial noise. 
''' len_before = len(df_three_columns) if 'DATE' in col: # neutralize hundreds of thousands of date changes dropping 0's like 12/03/2020 <> 12/3/2020 df_three_columns = df_three_columns.loc[df_three_columns[col_prv].str.replace(r'^(?:0*\d\/\d\d|\d\d\/0*\d|0*\d\/0*\d)\/\d{4}', '', regex=True) != df_three_columns[col_new].str.replace(r'^(?:0*\d\/\d\d|\d\d\/0*\d|0*\d\/0*\d)\/\d{4}', '', regex=True)] else: df_three_columns = df_three_columns.loc[df_three_columns[col_prv].str.replace(r'[\W_]', '', regex=True) != df_three_columns[col_new].str.replace(r'[\W_]', '', regex=True)] ''' Line above can result in no changes, they were only punctuation etc ''' count_trivial_changes = len_before - len(df_three_columns) if count_trivial_changes: cellword = single_plural(count_trivial_changes, 'cells') #' cell' if count_trivial_changes == 1 else 'cells' print(f'{col_padded} {count_trivial_changes:>7} {cellword} of trivial non-letter differences ignored') stats['cell_trivial_changes_ignored'] += count_trivial_changes if not len(df_three_columns): # because with trivials removed, they're often all identical so there's often nothing left to compare continue # Faster way, drop lambda?: https://stackoverflow.com/questions/63306888/general-groupby-in-python-pandas-fast-way df_uniq = df_three_columns.groupby([col_prv, col_new])[['VAERS_ID', col_prv, col_new]].transform(lambda x: ' '.join(x.astype(str).unique())) len_before = len(df_uniq) df_uniq = df_uniq.drop_duplicates(subset=df_uniq.columns) count_dupes = len_before - len(df_uniq) if count_dupes: print(f"{count_dupes:>10} {single_plural(count_dupes, 'duplicates')} dropped in df_three_columns") df_uniq = df_uniq.copy() df_uniq['VAERS_IDs'] = df_uniq.VAERS_ID.str.split(' ') df_uniq = df_uniq[['VAERS_IDs', col_prv, col_new]] # dropping VAERS_ID strings with spaces where multiple df_uniq['len'] = df_uniq[col_new].apply(len) df_uniq = df_uniq.sort_values(by=list(set(df_uniq.columns) - set(['VAERS_IDs']))) # for print for index, row in df_uniq.iterrows(): count_rows += 1 if show_col_row_progress and (count_rows == 1) or (count_rows % 100 == 0) and (count_cols != count_cols_altered): # needs work, is here expecting updated progress indicators while code is working hard print(f'{col_padded:>20} {count_cols}/{count_cols_altered} columns {count_rows}/{len(df_uniq)} rows ', end='\r') vids_list = row['VAERS_IDs'] # all with this same change, often just one vids_list = [int(x) for x in vids_list] vids_list_to_print = '' vids_len_to_print = '' vid_to_print = '' vid = vids_list[0] # only one needs to be processed, then distribute the result to all the vids val_prv = row[col_prv] val_new = row[col_new] val_prv_blank = '[]' if not val_prv else '' # has to be captured before the diff val_new_blank = '[]' if not val_new else '' if val_new_blank: stats['cells_emptied'] += len(vids_list) if debug_vids or selected_column: for v in vids_list: if v not in vid_cols_affected: vid_cols_affected[v] = [col] if col not in vid_cols_affected[v]: vid_cols_affected[v].append(col) if col == 'SYMPTOM_TEXT': stats['writeups_changed'] += len(vids_list) stats['cells_edited'] += len(vids_list) # this does not count trivial changes like punctuation but some are minor stats['columns'][col] += len(vids_list) if val_prv and val_new: # both have content, pare down to the diff if re.sub(r'\W', '', val_prv) == re.sub(r'\W', '', val_new): # skip trivial, for example SYMPTOM_TEXT 1272751 40,3 <> 40|3 ... and ... 
SYMPTOM_TEXT 1946413 21:15|(at <> 21:15,(at continue # DIFF, vast val_prv, val_new = diff_context(val_prv, val_new, col, vid, this_dir_date) if not val_prv and not val_new: # both empty (no diffs); since trivial differences were removed they can wind up the same now continue if len(vids_list) == 1: # fill the values to be shown (printed to screen) depending on single or list/multiple vid_to_print = vid else: vids_len_to_print = len(vids_list) if len(vids_list) > 1 else '' vids_list_to_print = [int(x) for x in vids_list] # no need for all those string single quotes val_prv = val_prv_blank if val_prv_blank else val_prv val_new = val_new_blank if val_new_blank else val_new ''' 2115523 German character? Or replacement character? https://stackoverflow.com/a/72656394/962391 'ADHD; Adverse reaction to antibiotics (Nausea and vomiting induced by K?vepening.); Allergy; Asthma; Migraine' UnicodeEncodeError: 'charmap' codec can't encode character '\\ufffd' in position 86: character maps to <undefined> Possibly happens in the command prompt but not in the PyCharm debugger ''' try: ''' Due especially to the 2022-11-11 clearing out of SYMPTOM_TEXT, LAB_DATA, HISTORY etc: for printing during the run only, abbreviate large content when it was simply blanked out. ''' if val_new_blank and len(val_prv) > 100: prv_excerpt = re.sub(r'^(.{1,40}.*?)\b.*', r'\1', val_prv) + f' ... ~{len(val_prv) - 40} more' print(f'{vids_len_to_print:>10} {col_padded} {vid_to_print:>8} {prv_excerpt:>70} <> {val_new} {vids_list_to_print}') else: print(f'{vids_len_to_print:>10} {col_padded} {vid_to_print:>8} {val_prv:>70} <> {val_new} {vids_list_to_print}') except Exception as e: ''' Mystery TODO Example first dose shows up as a question mark but is actually unicode: https://medalerts.org/vaersdb/findfield.php?IDNUMBER=2128531&WAYBACKHISTORY=ON Attempt using r'�', val_prv = re.sub(r'�', '_', val_prv) ... out of Search and Replace program, but bogus Exception workaround, 'charmap' codec can't encode character 'ufffd' in position 108: character maps to <undefined> ... ''' val_prv = re.sub('\ufffd', '_', val_prv) # U+FFFD, the Unicode replacement character (looks like a box here); a bare '?' would be an invalid regex val_new = re.sub('\ufffd', '_', val_new) print(f' Exception workaround, {e} ...') print(f'{vids_len_to_print:>10} {col_padded} {vid_to_print} {val_prv:>70} <> {val_new} {vids_list_to_print}') ''' Append to changes column. Bring in any changes from a previous drop to be appended to. Make more efficient. TODO ''' for vid in vids_list: if val_new == val_new_blank: # Keep completely-deleted values in place but tag them with the date like ... Myocarditis in a 42 year old female patient cut_2022-11-11 if 'DATE' not in col: # Skip adding cut_2022-11-11 in date fields so the transition from 11/11/2022 to 2022-11-11 can work for sorting in a spreadsheet. 
df_edits.loc[ df_edits.VAERS_ID.eq(vid), col ] += f' cut_{this_dir_date} ' else: pause=2 else: df_edits.loc[ df_edits.VAERS_ID.eq(vid), 'changes' ] += f'{col} {this_dir_date}: {val_prv} <> {val_new} ' # cell_edits +1 for each cell that changed (non-trivially) on each report df_edits.loc[ df_edits.VAERS_ID.eq(vid), 'cell_edits' ] += 1 print() print(f"{ count_cols_altered:>10} {single_plural(count_cols_altered, 'columns')} altered") del count_cols_altered df_edits_unique = df_edits.drop_duplicates(subset=columns_vaers) modified_reports += len(df_edits_unique) print(f"{modified_reports:>10} modified {single_plural(modified_reports, 'reports')} on {this_dir_date}") ; print() # should not be necessary, TODO if nan_alert(df_changes_done): pause=1 df_changes_done['cell_edits'] = df_changes_done['cell_edits'].fillna(0) df_changes_done['changes'] = df_changes_done['changes'] .fillna('') df_changes_done['status'] = df_changes_done['status'] .fillna('') df_changes_done['cell_edits'] = df_changes_done['cell_edits'].astype('float64').astype(int) ''' Add everything, with changes now set, into df_changes_done ''' df_edits, df_changes_done = move_rows(df_edits, df_edits, df_changes_done) # moving entire df_edits to df_changes_done ''' VAERS_IDs are int here ''' len_before = len(df_changes_done) # All columns df_changes_done = df_changes_done.drop_duplicates(subset=df_changes_done.columns, keep='last') # Should be none. Overwrite of values now done if (len_before - len(df_changes_done)): print(f'{(len_before - len(df_changes_done)):>10} complete duplicates dropped in df_changes_done') len_before = len(df_changes_done) # Just VAERS_ID, error-checking to make sure df_changes_done = df_changes_done.drop_duplicates(subset='VAERS_ID', keep='last') # Should be none. if (len_before - len(df_changes_done)): print(f'\n\n\n\n\t\t\t\t{(len_before - len(df_changes_done)):>10} duplicate VAERS_ID dropped in df_changes_done, should never happen\n\n\n\n') ''' Sort the output by cell_edits. Section needs scrutiny. ''' df_changes_done['cell_edits'] = pd.to_numeric(df_changes_done.cell_edits) mask = df_changes_done['cell_edits'].eq(0) df_changes_done = pd.concat([df_changes_done[~mask].sort_values('cell_edits'), df_changes_done[mask]]) df_changes_done['cell_edits'] = df_changes_done['cell_edits'].astype('float64').astype(int) # TODO: Here? ''' dummy code maybe doesn't always work, don't know why. Although addition of the line above might have helped ... .astype('float64').astype(int). Fancy sort with 0's last while coping with any minus values. https://stackoverflow.com/a/64141721/962391 This should be ascending from 1 to infinity, followed by 0's. ''' df_changes_done.assign(dummy=df_changes_done.cell_edits.eq(0)).sort_values(['dummy', 'cell_edits', 'VAERS_ID']).drop('dummy', axis=1) df_changes_done = move_column_forward(df_changes_done, 'changes') df_changes_done = move_column_forward(df_changes_done, 'status') df_changes_done = move_column_forward(df_changes_done, 'cell_edits') df_changes_done = df_changes_done.sort_values(by=['cell_edits', 'status', 'changes'], ascending=False) if nan_alert(df_changes_done): pause=1 filename = dir_changes + '/' + this_dir_date + '_VAERS_FLATFILE.csv' print(f' Writing ... {filename}') ; print() write_to_csv(df_changes_done, filename) df_changes_1 = df_changes_done.copy() # use as the previous in next when processing more than one week. 
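''' Illustration (not executed): the zeros-last ordering described above as a self-contained sketch.
    Note the assign/sort/drop chain has to be assigned back to the dataframe to take effect.
import pandas as pd
toy = pd.DataFrame({'VAERS_ID': [1, 2, 3, 4], 'cell_edits': [0, 3, 1, 0]})
toy = (toy.assign(dummy=toy.cell_edits.eq(0))
          .sort_values(['dummy', 'cell_edits', 'VAERS_ID'])
          .drop('dummy', axis=1))
# cell_edits now reads 1, 3, 0, 0 -- ascending edit counts first, untouched reports last
'''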
df_done[ 'changes' ] = this_dir_date #lookup(previous_date, this_dir_date, vid_cols_affected) # for debug, open a csv showing all of the reports touched for these dates compared ''' VAERS_IDs with the most records/doses/lots in each report and at least one covid19. (There's surely a better way with lambda or .apply() and/or .groupby() etc). Here, just counting delimiters in VAX_LOT and adding 1 to that. A report in df_vax has 18 entries in 2020-12-25 but no covid, for example, see df_vax.VAERS_ID.value_counts().unique() ''' df_doses = df_changes_done.loc[df_changes_done.VAX_LOT.str.contains('\|')][['VAERS_ID', 'VAX_LOT']] if nan_alert(df_doses): pause=1 df_doses['doses'] = df_doses.VAX_LOT.str.count('\|') vid_top_doses_list = sorted(df_doses.loc[ df_doses.doses.eq(df_doses.doses.max()) ].VAERS_ID.to_list()) top_num = df_doses.doses.max() + 1 print(f'{len(vid_top_doses_list):>10} {single_plural(len(vid_top_doses_list), "reports")} with the most ({top_num}) records/lots/doses: {" ".join([str(x) for x in vid_top_doses_list])}') stats['comparisons'] += 1 print(f"{stats['comparisons']:>10} {single_plural(stats['comparisons'], 'comparisons')} done") stats_resolve(this_dir_date) print(f' This week {do_elapsed(elapsed_drop)}') print(f' Overall {do_elapsed(elapsed_begin)}') print() return def diff_context( prv, new, col, vid, this_dir_date ): ''' Inputs: String column values like in SYMPTOM_TEXT. Code reviewer: Is there a better way? Surely. ''' delim = '' if len(prv) > 200: delim_count, delim = get_prevalent_delimiters(prv) a = diff_simplify( prv ) b = diff_simplify( new ) ''' Ignore punctuation and upper/lower case. [^a-zA-Z0-9|] clears everything not alphanumeric or pipe symbol (since VAX delim matters there) ''' if re.sub(r'[^a-zA-Z0-9|]', '', a.lower()) == re.sub(r'[^a-zA-Z0-9|]', '', b.lower()): return( '', '' ) if int(vid) in [2226564, 1940162, 1005427, 1867441, 2174886, 2122100, 2147076]: z=1 if (delim and delim_count >= 5) or col == 'symptom_entries': if col == 'symptom_entries': # these are lightning fast delim = '_|_' list_prv = a.split(delim) list_new = b.split(delim) only_prv = [x for x in list_prv if x not in list_new] # remove common only_new = [x for x in list_new if x not in list_prv] a = delim.join(only_prv) b = delim.join(only_new) ''' Neutralizing punctuation changes needed here too because like 1334938 comma to pipe ... Hallucination| auditory ... wacky stuff by them ''' if re.sub(r'[^a-zA-Z0-9|]', '', a.lower()) == re.sub(r'[^a-zA-Z0-9|]', '', b.lower()): return( '', '' ) # Ignore punctuation and upper/lower case if delim == '_|_': if sorted(only_prv) == sorted(only_new): return ('', '') # Ignore reversed order if a: a = delim + a + delim # endings if b: b = delim + b + delim return( a, b ) context = 6 # way low for cases like 1589954 to rid the word 'appropriate' by itself at len(largest_common_string) > context count_while = 0 continue_while = 1 length_max = 50 while count_while <= 9 and continue_while: # Whiddling strings down, removing common substrings count_while += 1 # Also consider if a in b: re.sub(a, '', b) ; a = '' and reverse for shorter no context if not a or not b: return( a, b ) if a.lower() == b.lower(): return( '', '' ) ''' Important: If changes are the same with words just in a different order, skipping those. Why? There are so many they would drown out changes. Applies for example to dose order changes, they are legion. 
''' if delim: if delim == '|' and sorted(set(a.split('|'))) == sorted(set(b.split('|'))): # vax fields, their delimiter is the pipe symbol, also removing duplicates return( '', '' ) elif delim != '. ' and sorted(set(a.split(' '))) == sorted(set(b.split(' '))): # ignore reversed order and repeat words return( '', '' ) ''' To remove punctuation crudely: from string import punctuation ; ' '.join(filter(None, (word.strip(punctuation) for word in a.split()))) https://stackoverflow.com/a/15740656/962391 Ok, I give up. To avoid boatloads of problems, like commas replaced with pipe symbols in SYMPTOM_TEXT, going to compare/return just bare words for long strings Avoid affecting things like 48.0 `punctuation` is just a string ... '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~' punctuations = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~' ... removing the single quote, note the 's', plural, variable name. ''' if len(a) > context and len(b) > context: a = ' '.join(filter(None, (word.strip(''.join(punctuation)) for word in a.split()))) # consider not stripping ' .. ' instead but dicey b = ' '.join(filter(None, (word.strip(''.join(punctuation)) for word in b.split()))) largest_string = longest_word_string(a, b) if largest_string and (len(largest_string) > context): # and ((len(a) > length_max) or (len(b) > length_max)): a = re.sub(re.escape(largest_string), ' .. ', a, flags=re.IGNORECASE) b = re.sub(re.escape(largest_string), ' .. ', b, flags=re.IGNORECASE) if a.lower() == b.lower(): return ('', '') a_list = [x.strip() for x in a.split()] ; b_list = [x.strip() for x in b.split()] only_a = [x for x in a_list if x not in b_list] only_b = [x for x in b_list if x not in a_list] a_list = [] ; b_list = [] for x in only_a: if x in a_list: continue a_list.append(x) for x in only_b: if x in b_list: continue b_list.append(x) a = ' '.join(a_list) ; b = ' '.join(b_list) if not a or not b: return (a, b) else: continue_while = 0 a = re.sub(r'\s\s+', ' ', a) ; b = re.sub(r'\s\s+', ' ', b) # multiple spaces a = a.strip() ; b = b.strip() # front and end spaces # remove multiples like ' cms hcc .. cms hcc .. cms hcc .. cms hcc ' if ' .. ' in a or ' .. ' in b: a_list = a.split('..') ; b_list = b.split('..') # not a complete job on 1371552 but good enough for now a_list = [x.strip() for x in a_list] ; b_list = [x.strip() for x in b_list] # drop '' and ' ' a_list = [x for x in a_list if re.search(r'\w', x)] ; b_list = [x for x in b_list if re.search(r'\w', x)] if len(a_list) != len(set(a_list)) or len(b_list) != len(set(b_list)): # if multiples only_a = [x for x in a_list if x not in b_list] only_b = [x for x in b_list if x not in a_list] a_list = [] ; b_list = [] for x in only_a: if x in a_list: continue a_list.append(x) for x in only_b: if x in b_list: continue b_list.append(x) a = ' .. '.join(a_list) ; b = ' .. '.join(b_list) a = re.sub(r'\s\s+', ' ', a) ; b = re.sub(r'\s\s+', ' ', b) # necessary anymore after additions above? if re.sub(r'[^a-zA-Z0-9|]', '', a.lower()) == re.sub(r'[^a-zA-Z0-9|]', '', b.lower()): # again return( '', '' ) # Ignore punctuation and upper/lower case return ( a, b ) def diff_simplify( x ): ''' Remove trivial stuff from string, etc ''' x = re.sub(r' ...^... ' , '' , x) # the repeat sentence replacements that were entered x = re.sub(r'[^\x00-\x7F]+', ' ' , x) # unicode x = re.sub(r'\s\s+' , ' ' , x) # multiple spaces with single to avoid trivial x = re.sub(r"''" , "'" , x) # doubled can mess things up like ... I''m ... 
in 1455992 x = x.strip() # front and end spaces return x def get_prevalent_delimiters(prv_str): ''' ''' max_delim = 0 delim = '' for d in [ '. ', '|', '; ', ' - ' ]: if d == '|' and '_|_' in prv_str: continue # clumsy but these get crossed and '_|_' handled elsewhere count = prv_str.count(d) if count > max_delim: max_delim = count if max_delim >= 5: delim = d return( max_delim, delim ) def longest_word_string(a, b): ''' https://stackoverflow.com/a/42882629/962391 ''' answer = [] list_a = a.split() list_b = b.split() len1, len2 = len(list_a), len(list_b) for i in range(len1): for j in range(len2): lcs_temp = 0 match = [] while ((i+lcs_temp < len1) and (j+lcs_temp < len2) and list_a[i+lcs_temp].lower() == list_b[j+lcs_temp].lower()): match.append( list_b[j+lcs_temp].lower() ) lcs_temp += 1 if len(match) > len(answer): answer = match answer = ' '.join(answer) answer = answer.strip() return answer def prv_new_error_check(df_prv, df_new): ''' Ensuring the VAERS_IDs are identical in each ''' vids_in_prv = df_prv.VAERS_ID.to_list() vids_in_new = df_new.VAERS_ID.to_list() if sorted(set(vids_in_prv)) != sorted(set(vids_in_new)): print() ; print(f' ERROR: Expected VAERS_ID in prv and new to match exactly') print(f' len prv: {len(vids_in_prv)}') print(f' len new: {len(vids_in_new)}') print(f' Difference: {len(vids_in_new) - len(vids_in_prv)}') vids_in_prv_not_in_new = df_prv.loc[ ~df_prv.VAERS_ID.isin(df_new.VAERS_ID) ].VAERS_ID.to_list() vids_in_new_not_in_prv = df_new.loc[ ~df_new.VAERS_ID.isin(df_prv.VAERS_ID) ].VAERS_ID.to_list() print(f' In new not in prv: {vids_in_new_not_in_prv}') print(f' In prv not in new: {vids_in_prv_not_in_new}') print() return 1 elif sorted(vids_in_prv) != sorted(vids_in_new): # as lists print() ; print(f' ERROR: VAERS_ID in prv are the same sets but there are multiple VAERS_ID in one') print(f' len prv: {len(vids_in_prv)}') print(f' len new: {len(vids_in_new)}') print(f' Difference: {len(vids_in_new) - len(vids_in_prv)}') list_multiples_prv = [k[0] for k,v in df_prv[ ['VAERS_ID'] ].value_counts().to_dict().items() if v > 1] if list_multiples_prv: print(f' VAERS_IDs: {list_multiples_prv}') print(f" df_prv.loc[ df_prv.VAERS_ID.isin({list_multiples_prv}) ][['VAERS_ID', 'VAX_LOT', 'VAX_MANU', 'VAX_TYPE', 'VAX_NAME']]") df_multiples_prv = df_prv.loc[df_prv.VAERS_ID.isin(list_multiples_prv)][['VAERS_ID', 'VAX_LOT', 'VAX_MANU', 'VAX_TYPE', 'VAX_NAME']] print(f'{df_multiples_prv}') list_multiples_new = [k[0] for k,v in df_new[ ['VAERS_ID'] ].value_counts().to_dict().items() if v > 1] if list_multiples_new: print(f' VAERS_IDs: {list_multiples_new}') print(f"df_new.loc[ df_new.VAERS_ID.isin({list_multiples_new}) ][['VAERS_ID', 'VAX_LOT', 'VAX_MANU', 'VAX_TYPE', 'VAX_NAME']]") df_multiples_new = df_new.loc[ df_new.VAERS_ID.isin(list_multiples_new) ][['VAERS_ID', 'VAX_LOT', 'VAX_MANU', 'VAX_TYPE', 'VAX_NAME']] print(f'{df_multiples_new}') exit(' Cannot continue with this discrepancy') def find_all_context(search_string, df_in, df_in_column, df_out_column): search_string = search_string .lower() df_in[df_in_column] = df_in[df_in_column].str.lower() df_found = df_in.loc[ df_in[df_in_column].str.contains(search_string, na=False) ] if len(df_found): df_found = df_found.copy() str_context = r'\b(.{0,20}' + search_string + r'.{0,20})\b' df_found[df_out_column] = df_found[df_in_column].str.findall(str_context).str.join(' ~~ ') return df_found else: return pd.DataFrame() # empty for len 0 def do_autodownload(): ''' https://vaers.hhs.gov/data/datasets.html ''' if not autodownload: return 
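# The import below is deferred until a download is actually requested. As written, updateVAERSFiles
# fetches only the 2020 files; presumably the years list would be extended for a full refresh.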
sys.path.insert(0, './vaers_downloader') from VAERSFileDownloader import updateVAERSFiles updateVAERSFiles( needsUpdate = True, years = [2020], workingDirectory = os.getcwd() ) def run_all(): print(__file__) ; print() print('run_all() ...') ; print() ''' Can start from scratch (if restart = 1 is set at the top of this file). However, reconsolidating can be skipped with reconsolidate = 0 and reflattening with reflatten = 0, in which case only the changes files are wiped out and rebuilt. ''' if restart: do_restart() #do_autodownload() #exit('test autodownload done') validate_dirs_and_files() while more_to_do(): this_drop_date = get_next_date() # next input not already in a resulting changes file print_date_banner(this_drop_date) open_files (this_drop_date) consolidate (this_drop_date) flatten (this_drop_date) compare (this_drop_date) def show_vid_as_text(df, vid): if not isinstance(vid, int): print(f' Expected vid {vid} to be an integer') return if vid not in df.VAERS_ID.to_list(): print(f' vid {vid} not found in the dataframe') return filename = f'{vid}.txt' df = df.loc[ df.VAERS_ID.eq(vid) ] df.to_csv(filename, sep=',', encoding='utf-8-sig', index=False) sp.Popen(filename, shell=True) run_all() exit()
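''' More examples for use while in the debugger (illustrative sketch; the literal values are made up)
prv, new = diff_context('Fever for 5 days, cough for 2 days', 'Fever for 5 days| cough for 3 days', 'SYMPTOM_TEXT', 0, '2021-01-15')
# punctuation-only noise is ignored, so roughly only the changed word survives, e.g. ('2', '3')
df_hits = find_all_context('myocarditis', df_flat_2.copy(), 'SYMPTOM_TEXT', 'context')   # pass a copy, the search lowercases the column in place
df_hits[['VAERS_ID', 'context']]       # each hit with ~20 characters of surrounding context
'''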