TDM 20200: Project 4 — 2024
Motivation: It is worthwhile to learn how to parse hundreds of thousands of files systematically. We practice this skill, step by step.
Context: We return to the over-the-counter medications from Project 1, aiming to extract the ingredient substances from each medication and to create a tally of all of the ingredient substances.
Scope: Python, XML
Dataset(s)
The questions below use the following dataset(s):
- /anvil/projects/tdm/data/otc/archive1 through /anvil/projects/tdm/data/otc/archive10
When building the dictionary, you will see that Dr Ward writes:
Some of you will not have worked with dictionaries too much in the past. Dictionaries start out empty, and you need to add the words as you go. Alternatively, if you want to, you can just write:
This approach is a little bit cleaner, but I didn't know if you would understand it. Either approach is OK, and you might have another Pythonic way that you want to handle this step in creating the dictionary.
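The two snippets that this note refers to are not reproduced above. As a stand-in, here is a minimal sketch of two common ways to build a tally in a dictionary. The names `names`, `tally_explicit`, and `tally_get` are hypothetical, and this is not necessarily the code that Dr Ward wrote.

```python
# Hypothetical sample data standing in for extracted ingredient substances.
names = ["WATER", "GLYCERIN", "WATER", "TALC", "WATER"]

# Way 1: start with an empty dictionary and add each key the first time it appears.
tally_explicit = {}
for name in names:
    if name not in tally_explicit:
        tally_explicit[name] = 0
    tally_explicit[name] += 1

# Way 2: dict.get returns a default (0 here) when the key is not yet present.
tally_get = {}
for name in names:
    tally_get[name] = tally_get.get(name, 0) + 1

print(tally_explicit)
print(tally_get)
```

Both loops produce the same dictionary; the second simply folds the "is the key there yet?" check into one line.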
Dr Ward created 8 videos to help with this project.
Questions
Question 1 (2 points)
Run the lines:
import pandas as pd
import lxml.etree
import glob
- Remind yourself how to extract the ingredient substances from each of these two files: /anvil/projects/tdm/data/otc/valu.xml and /anvil/projects/tdm/data/otc/hawaii.xml
For each of these two files, print a list of all ingredient substances (it is OK if some are repeated; also, do not worry about which ingredient each ingredient substance comes from). For instance, if you extract the ingredient substances from the file
/anvil/projects/tdm/data/otc/valu.xml
you should get these ingredient substances:
HYPROMELLOSES
MINERAL OIL
POLYETHYLENE GLYCOL, UNSPECIFIED
POLYSORBATE 80
POVIDONE, UNSPECIFIED
... blah blah blah ...
STARCH, CORN
SODIUM STARCH GLYCOLATE TYPE A CORN
STEARIC ACID
TITANIUM DIOXIDE
ACETAMINOPHEN
or if you extract the ingredient substances from the file
/anvil/projects/tdm/data/otc/hawaii.xml
you should get these ingredient substances:
DIBASIC CALCIUM PHOSPHATE DIHYDRATE
WATER
SORBITOL
SODIUM LAURYL SULFATE
CARBOXYMETHYLCELLULOSE SODIUM, UNSPECIFIED FORM
... blah blah blah ...
WHITE WAX
MANGIFERA INDICA SEED BUTTER
ROSEMARY OIL
TOCOPHEROL
ZINC OXIDE
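As a reminder of the general technique, here is a minimal sketch that pulls names out of a small inline XML string. It assumes that the label files use the default namespace `urn:hl7-org:v3` and that each substance is an `ingredientSubstance` element with a `name` child; check both against the real files. The names `NS`, `SAMPLE`, and `substance_names` are hypothetical.

```python
try:
    import lxml.etree as ET              # the library this project uses
except ImportError:
    import xml.etree.ElementTree as ET   # standard-library fallback; same calls below

NS = "urn:hl7-org:v3"   # assumed default namespace of the label files

# Tiny made-up document that mimics the assumed structure.
SAMPLE = f"""
<document xmlns="{NS}">
  <ingredient><ingredientSubstance><name>WATER</name></ingredientSubstance></ingredient>
  <ingredient><ingredientSubstance><name>GLYCERIN</name></ingredientSubstance></ingredient>
  <ingredient><ingredientSubstance><name>WATER</name></ingredientSubstance></ingredient>
</document>
"""

def substance_names(root):
    """Return the <name> text of every <ingredientSubstance> (None if a name is missing)."""
    return [
        sub.findtext(f"{{{NS}}}name")
        for sub in root.iter(f"{{{NS}}}ingredientSubstance")
    ]

names = substance_names(ET.fromstring(SAMPLE.strip()))
for name in names:
    print(name)
```

For a file on disk, `ET.parse(path).getroot()` would supply the `root` argument instead of `ET.fromstring`.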
Question 2 (2 points)
- Use this Python code:
for myfile in glob.glob("/anvil/projects/tdm/data/otc/archive1/*.xml")[0:11]:
and also this code:
tree = lxml.etree.parse(myfile)
to loop over the first eleven files in the archive1 directory. Print all of the ingredient substances from these first eleven files.
- Make a Python dictionary (called a dict in Python) from the ingredient substances, keeping track of the number of times that each ingredient substance occurs.
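One more Pythonic option for the tally is `collections.Counter`, which can be updated one file at a time. The sketch below uses hypothetical lists of names (`names_per_file`) in place of parsed files, so it shows only the counting step and the slicing, not the XML work.

```python
from collections import Counter

# Hypothetical stand-in: one list of substance names per parsed file.
names_per_file = [
    ["WATER", "GLYCERIN"],
    ["WATER", "TALC"],
    ["WATER"],
]

counts = Counter()
# The slice [0:2] keeps items 0 and 1, just as [0:11] keeps items 0 through 10.
for names in names_per_file[0:2]:
    counts.update(names)   # adds 1 for every name in this file's list

tally = dict(counts)       # a Counter converts to a plain dict if one is wanted
print(tally)
```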
Question 3 (2 points)
- Convert the dictionary from question 2b to a data frame.
- Sort the data frame according to the counts, and print the 5 most popular ingredient substances from those eleven files, and the number of times that each of these 5 most popular ingredient substances occurs. Your output should contain:
COCAMIDOPROPYL BETAINE 60
FD&C BLUE NO. 1 70
CITRIC ACID MONOHYDRATE 87
GLYCERIN 93
WATER 114
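One possible way to go from a dictionary to a sorted data frame is sketched below, with made-up counts. The column names `substance` and `count` and the variable `top2` are hypothetical. Sorting in ascending order and taking the tail puts the largest count last, which matches the ordering of the expected output above.

```python
import pandas as pd

# Made-up tally; the real one comes from the files.
tally = {"WATER": 5, "GLYCERIN": 3, "TALC": 1, "ZINC OXIDE": 2}

# Each (key, value) pair becomes one row of the data frame.
df = pd.DataFrame(list(tally.items()), columns=["substance", "count"])

# Ascending sort, then keep the last rows: the most frequent substances.
top2 = df.sort_values("count").tail(2)
print(top2.to_string(index=False))
```

Replacing `tail(2)` with `tail(5)` gives the five most frequent substances.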
Question 4 (2 points)
- Now analyze the first 1000 files from the archive1 directory, and print the output that shows the 5 most popular ingredient substances from those 1000 files, and the number of times that each of these 5 most popular ingredient substances occurs.
- Now try to analyze all of the files from the archive1 directory. Likely, your work will fail, because there is at least one enormous file that needs a slightly fancier parsing method! So you can add these lines:
from lxml.etree import XMLParser, parse
p = XMLParser(huge_tree=True)
and then add the parameter parser=p to your parse statement. Now you can analyze all of the files from the archive1 directory. Print output that shows the 5 most popular ingredient substances from all of the files (altogether) in the archive1 directory, and the number of times that each of these 5 most popular ingredient substances occurs.
Question 5 (2 points)
- Now analyze all of the files in all 10 directories archive1 through archive10, and print output that shows the 5 most popular ingredient substances from all of the files (altogether) in these 10 directories, and the number of times that each of these 5 most popular ingredient substances occurs.
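Gathering the file paths from all ten directories can be sketched as follows. The helper name `list_archive_files` and its parameters are hypothetical; only the directory layout (archive1 through archive10 under one base directory) comes from the dataset description.

```python
import glob
import os

def list_archive_files(base, n_archives=10):
    """Collect the .xml paths from base/archive1 ... base/archive<n_archives>."""
    paths = []
    for i in range(1, n_archives + 1):
        pattern = os.path.join(base, f"archive{i}", "*.xml")
        paths.extend(glob.glob(pattern))
    return paths

all_files = list_archive_files("/anvil/projects/tdm/data/otc")
print(len(all_files))   # 0 on a machine where these directories do not exist
```

The resulting list can then be fed to the same loop used for a single directory. A single wildcard pattern such as `archive*/*.xml` would likely work too, provided that no other directories in the base directory start with `archive`.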
Project 04 Assignment Checklist
- Jupyter Lab notebook with your code, comments and output for the assignment
  - firstname-lastname-project04.ipynb
- Python file with code and comments for the assignment
  - firstname-lastname-project04.py
- Submit files through Gradescope
Please double check that your submission is complete and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it, to make sure that what you think you submitted is what you actually submitted. In addition, please review our submission guidelines before submitting your project.