
IT 385 Allegany College of Maryland Data Quality Assessment Paper

Scenario: Your team is doing research on web traffic on Marymount’s network (dorms and academic/admin buildings). You will be using web traffic data, firewall data, and data from two external data sources of your choosing. (You can use your best guess for the size of the university’s datasets.) Please write out a Data Quality Assessment that includes the information below for your data sets. Student privacy must be included.

Data Quality Assessment
Description of Data

Type of Research
Types of Data
Format and Scale of Data


Data Collection / Generation

Methodologies for data collection/generation
Data Quality and Standards

Data Management, Documentation, and Curation

Managing/storing and curating data
Metadata standards and data documentation
Data preservation strategy and standards

Data Security and Confidentiality

Format information / data security standards
Main risks to data security

Data Sharing and Access

Suitability for sharing
Discovery by potential users of the research data
Governance of Access
The study team's exclusive use of the data
Regulation of responsibilities of users

Relevant institutional, departmental or study policies on data sharing and data security

Include the policy name and a URL/reference to it, as well as any laws that may apply.

Deliverable: Please provide the following in a PDF or Word document(s):
Research Question/Problem you will use.
Data Quality Assessment Report, including the 2 chosen external data sources used to match against web traffic.

IT 385
Managing Big Data
Professor Tim Eagle
Module 3:
Data Assets
Data Assets
• ETL
• Data Formats
• Sources of Data
  – Internal vs. External
  – Log Data
  – Public vs. Private
  – Free vs. Paid
  – Refreshing Data
• Data Providers
• Data Scraping
  – Old Way – Scraping from HTML
  – New Way – APIs, XML, CAML, etc.
• Data Formatting
  – Strategies – ETL vs. ELT
  – Tools
Extract, Transform, and Load
• Extract is the process of reading data from a database. In this stage, the data is collected, often from multiple and different types of sources.
• Transform is the process of converting the extracted data from its previous form into the form it needs to be in so that it can be placed into another database. Transformation occurs by using rules or lookup tables or by combining the data with other data.
• Load is the process of writing the data into the target database.
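For illustration only (not part of the slides), here is a minimal ETL sketch in Python; the file name traffic.csv, the column names, and the warehouse.db target are all hypothetical.

```python
import csv
import sqlite3

# --- Extract: read rows from a source file (hypothetical traffic.csv) ---
with open("traffic.csv", newline="") as f:
    rows = list(csv.DictReader(f))

# --- Transform: standardize one field and enforce a numeric type ---
for row in rows:
    row["host"] = row["host"].strip().lower()
    row["bytes"] = int(row["bytes"])

# --- Load: write the cleaned rows into the target database ---
conn = sqlite3.connect("warehouse.db")
conn.execute("CREATE TABLE IF NOT EXISTS traffic (host TEXT, bytes INTEGER)")
conn.executemany(
    "INSERT INTO traffic (host, bytes) VALUES (?, ?)",
    [(r["host"], r["bytes"]) for r in rows],
)
conn.commit()
conn.close()
```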
Data Formats
• Excel
• Delimited files – most common for larger sets
  – Can be easy to work with, but older sets are often complex (FECA example)
• JSON – JavaScript Object Notation – becoming more popular
• XML – Extensible Markup Language – most common for newer data sets
• Proprietary formats
  – SAS datasets
  – Database backups/dumps
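As a quick illustration of reading the formats above in Python (not from the slides; the file names and the record element name are made up):

```python
import csv
import json
import xml.etree.ElementTree as ET

# Delimited file (here assumed pipe-delimited)
with open("records.txt", newline="") as f:
    delimited_rows = list(csv.reader(f, delimiter="|"))

# JSON document
with open("records.json") as f:
    json_data = json.load(f)

# XML document: collect the attributes of each <record> element
tree = ET.parse("records.xml")
xml_rows = [rec.attrib for rec in tree.getroot().iter("record")]
```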
Sources of Data
Internal vs. External
Internal:
• Free
• Varied depending on the business
• Should be able to paint a picture of your business
• Can be quite messy
External:
• Not free
• Plenty of places to look
• Pay for cleaner data
Sources of Data
Log Data
• Depending on the project you are working on, log data might be useful
• Sources:
  – System logs
  – Network logs
  – Application logs
• Often needs more formatting than other data sources
• Often very noisy
• Log aggregation tools bring it all together and might handle formatting
  – Splunk
  – Graylog
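Log data usually needs parsing before it is useful. A small sketch (not from the slides) that pulls fields out of a web-server access-log line; the sample line and the combined-log layout are assumptions:

```python
import re

# A made-up line in a common access-log layout
line = '10.1.2.3 - - [05/Mar/2024:10:15:32 -0500] "GET /index.html HTTP/1.1" 200 5320'

pattern = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)
match = pattern.match(line)
if match:
    print(match.groupdict())  # ip, timestamp, request, status, size
```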
Sources of Data
Public vs. Private
Public Data Sets:
• NWS Weather
• Data.gov
• SSA DMF
• https://cloud.google.com/public-datasets/
• https://aws.amazon.com/opendata/public-datasets/
• https://datasetsearch.research.google.com/
• https://github.com/awesomedata/awesome-public-datasets
Private Data Sets:
• Typically not free
• Very business-specific
• Updated frequently
• Very specific licensing terms
• Examples:
  – https://www.fdbhealth.com/solutions/medknowledge/medknowledge-drug-pricing
  – Dun & Bradstreet
  – https://risk.lexisnexis.com/products/accurint-for-healthcare
Sources of Data
Free vs. Paid
Free:
• Generally will require more cleaning
• Updated less often
• Some sets are free only for partial sets of data
• No support for data issues
• You get what you pay for…
Paid:
• Cleanliness of data varies
• Pay for more frequent updates
• Full sets of data, or built for your specific needs
• Support often provided
• Can get costly (cost-benefit analysis needed)
• Some public data costs as well
Sources of Data
Refreshing Data
• How often?
• Refresh policy
  – Replace existing data
  – Update existing data
  – Append data
  – Update and add new data
  – Storage space limitations
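A rough sketch of the refresh options above using SQLite (not from the slides); it reuses the hypothetical traffic table from the ETL sketch earlier, and the rows are made up:

```python
import sqlite3

# Each function demonstrates one refresh strategy against the hypothetical
# "traffic" table; rows are (host, bytes) tuples.

def replace_existing(conn, rows):
    """Replace existing data: wipe the table, then reload it."""
    conn.execute("DELETE FROM traffic")
    conn.executemany("INSERT INTO traffic (host, bytes) VALUES (?, ?)", rows)

def append(conn, rows):
    """Append data: add new rows on top of what is already there."""
    conn.executemany("INSERT INTO traffic (host, bytes) VALUES (?, ?)", rows)

def update_and_add(conn, rows):
    """Upsert: update matching hosts, add new ones (assumes hosts are unique)."""
    conn.execute("CREATE UNIQUE INDEX IF NOT EXISTS idx_host ON traffic (host)")
    conn.executemany(
        "INSERT INTO traffic (host, bytes) VALUES (?, ?) "
        "ON CONFLICT(host) DO UPDATE SET bytes = excluded.bytes",
        rows,
    )

conn = sqlite3.connect("warehouse.db")
conn.execute("CREATE TABLE IF NOT EXISTS traffic (host TEXT, bytes INTEGER)")
update_and_add(conn, [("alice.example.edu", 1024), ("bob.example.edu", 2048)])
conn.commit()
conn.close()
```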
Data Providers
Data Providers – companies/websites that aggregate various datasets, then provide the data under either a paid license or an open license. Can be very market-specific. Often have a single API to access all data sets.
• https://www.ecoinvent.org/home.html
• https://intrinio.com/
• https://www.programmableweb.com/category/all/apis
• Data.gov
Also, there is a new push for Data as a Service, where we don't download data sets; we just query against a service provider.
Data Scraping
Old Way
• Since the internet used to be fairly static, to get data we would scrape it from the web pages using code.
• You'd have scripts run through pages, look for certain spots or words, and capture them in another file.
• Often memory and resource intensive.
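A bare-bones example of the old way (not from the slides), using only the Python standard library to walk a page's HTML and capture text from particular tags; the URL is a placeholder:

```python
from html.parser import HTMLParser
from urllib.request import urlopen

class HeadingScraper(HTMLParser):
    """Capture the text inside every <h1> tag as the parser walks the page."""
    def __init__(self):
        super().__init__()
        self.in_h1 = False
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.in_h1 = True

    def handle_endtag(self, tag):
        if tag == "h1":
            self.in_h1 = False

    def handle_data(self, data):
        if self.in_h1 and data.strip():
            self.headings.append(data.strip())

html = urlopen("https://example.com/").read().decode("utf-8", errors="replace")
scraper = HeadingScraper()
scraper.feed(html)
print(scraper.headings)
```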
Data Scraping
New Way
• Grab RSS/XML feeds from pages
• Use APIs to access a site's data
• Use tools to scrape data from social media or webpages
  – Data Scraper – Chrome plugin
  – WebHarvy
  – Import.io
• Buy pre-scraped and formatted data
• Use a hybrid of the new and old ways to see the whole picture
• https://www.scrapehero.com/tutorial-how-to-scrape-amazon-product-details-using-python-and-selectorlib/
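By contrast, the new way usually means calling a documented API and getting structured JSON back. A sketch (not from the slides); the endpoint, query parameters, and field names are invented, and most real APIs also require an API key:

```python
import json
from urllib.request import urlopen

# Hypothetical JSON API endpoint
url = "https://api.example.com/v1/traffic?building=dorms&limit=100"
with urlopen(url) as resp:
    payload = json.load(resp)

for record in payload.get("results", []):
    print(record.get("host"), record.get("bytes"))
```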
Data Formatting
ETL vs. ELT
https://www.xplenty.com/blog/etl-vs-elt/
Data Formatting
ETL
• A continuous, ongoing process with a well-defined workflow: ETL first extracts data from homogeneous or heterogeneous data sources. Next, it deposits the data into a staging area. Then the data is cleansed, enriched, transformed, and stored in the data warehouse.
• Used to require detailed planning, supervision, and coding by data engineers and developers: the old-school methods of hand-coding ETL transformations in data warehousing took an enormous amount of time. Even after designing the process, it took time for the data to go through each stage when updating the data warehouse with new information.
• Modern ETL solutions are easier and faster: modern ETL, especially for cloud-based data warehouses and cloud-based SaaS platforms, happens a lot faster.
Data Formatting
ELT
• Ingest anything and everything as the data becomes available: ELT paired with a data lake lets you ingest an ever-expanding pool of raw data immediately, as it becomes available. There's no requirement to transform the data into a special format before saving it in the data lake.
• Transforms only the data you need: ELT transforms only the data required for a particular analysis. Although it can slow down the process of analyzing the data, it offers more flexibility, because you can transform the data in different ways on the fly to produce different types of metrics, forecasts, and reports. Conversely, with ETL, the entire ETL pipeline, and the structure of the data in the OLAP warehouse, may require modification if the previously decided structure doesn't allow for a new type of analysis.
• ELT is less reliable than ETL: it's important to note that the tools and systems of ELT are still evolving, so they're not as reliable as ETL paired with an OLAP database. Although it takes more effort to set up, ETL provides more accurate insights when dealing with massive pools of data. Also, ELT developers who know how to use ELT technology are more difficult to find than ETL developers.
Data Formatting
Tools
• Excel – it can do a ton of great stuff; however, it falls apart with larger data sources
• Scripting – Perl or Python – works great with flat files, XML, and JSON, but not with other formats
• SQL – can do much of the formatting in SQL and create new tables
• SAS / R – same as scripting; very powerful, but with a learning curve
• ETL-specific tools:
  – Informatica
  – Microsoft SSIS
  – Oracle Data Integrator
  – IBM InfoSphere DataStage
  – Apache Airflow, Kafka, NiFi
  – Talend Open Studio
Data Formatting
Tips
• Dates/times – convert to the same timezone
• Units – convert to the same units when possible
• Standardize addresses
• Classify/tag certain data
• Timestamp data
• If you are going to map, geocode early – it can be costly.
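Two of the tips above, converting timestamps to one timezone and converting units, might look like this in Python (illustration only; the values are made up and zoneinfo requires Python 3.9+):

```python
from datetime import datetime
from zoneinfo import ZoneInfo

# Dates/times: convert a local timestamp to UTC so every record shares one timezone
local_ts = datetime(2024, 3, 5, 10, 15, tzinfo=ZoneInfo("America/New_York"))
utc_ts = local_ts.astimezone(ZoneInfo("UTC"))

# Units: convert megabytes to bytes so every record uses the same unit
size_mb = 5.3
size_bytes = int(size_mb * 1024 * 1024)

print(utc_ts.isoformat(), size_bytes)
```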
In-Class Assignment
Find the best data source to help solve each of these problems:
• When should our nation-wide business start to stock snow shovels?
• Where is the best location to buy real estate for a car dealership?
• Which third baseman plays better in day games?
• How many people have larger incomes and fewer children in a specific geographic area?
• What data sources would have me (your professor) in them?
Post your answers to the discussion board, as replies to each question.
The End
• Quiz 1 posted, due next week
• Assignment 1 posted, due in 2 weeks
IT 385
Managing Big Data
Professor Tim Eagle
Module 4:
Data Quality
Data Quality
• Garbage In, Garbage Out
• Definition of Data Quality
• The Continuum of Data Quality
• Other Problems with Data Quality
• Creating Better Data Quality
  – Data Cleansing
  – Master Data Management
  – Data Deduplication
  – Data Interpretation
  – The Need for Domain Experts
Garbage In, Garbage Out
Definition of Data Quality
• Validity – Data measure what they are supposed to measure.
• Reliability – Everyone defines, measures, and collects data the same way, all the time.
• Completeness – Data include all of the values needed to calculate indicators. No variables are missing.
• Precision – Data have sufficient detail. Units of measurement are very clear.
• Timeliness – Data are up to date. Information is available on time.
• Integrity – Data are true. The values are safe from deliberate bias and have not been changed for political or personal reasons.
Validity
• Data measure what they are supposed to measure.
When it goes wrong:
AI and facial recognition.
https://www.wmm.com/sponsored-project/codedbias/?fbclid=IwAR1xSLJToeXvMbNePKnlTndCFjnO3485Iv0AMf5wcZC-Tb1UuIJSiT3ivnQ
Reliability
Everyone defines, measures, and collects data the same way – all the time.
When it goes wrong:
Even the discovery of the Americas was a result of bad data. Christopher Columbus made a few significant miscalculations when charting the distance between Europe and Asia. First, he favored the values given by the Persian geographer Alfraganus over the more accurate calculations of the Greek geographer Eratosthenes. Second, Columbus assumed Alfraganus was referring to Roman miles in his calculations when, in reality, he was referring to Arabic miles.
Completeness
Data include all of the values needed to calculate indicators. No variables are missing.
When it goes wrong:
2016 election – the 2016 United States presidential election was also mired in bad data. National polling data used to predict state-by-state Electoral College votes led to the prediction of a Hillary Clinton landslide, a forecast that led many American voters to stay home on Election Day.
Also, The Big Short.
Precision
Data have sufficient detail. Units of measurement are very clear.
When it goes wrong:
In 1999, NASA took a $125 million hit when it lost the Mars Climate Orbiter. It turns out that the engineering team responsible for developing the Orbiter used English units of measurement while NASA used the metric system. The problem is that the data was inconsistent, making it a rather costly and disastrous mistake.
Timeliness
Data are up to date. Information is available on time.
When it goes wrong:
The turning point of the Civil War: Gettysburg. Lee, general of the Confederate army, had old intel and didn't have an accurate count of troops.
Integrity
Data are true. The values are safe from deliberate bias and have not been changed for political or personal reasons.
When it goes wrong:
The Enron scandal in 2001 was largely a result of bad data. Enron was once the sixth-largest company in the world. A host of fraudulent data provided to Enron's shareholders resulted in Enron's meteoric rise and subsequent crash. An ethical external auditing firm could have prevented this fraud from occurring.
Or the anti-vaccination movement.
The Data Quality Continuum
• Data and information are not static; they flow through a data collection and usage process:
  – Data gathering
  – Data delivery
  – Data storage
  – Data integration
  – Data retrieval
  – Data mining/analysis
Data Gathering
• How does the data enter the system?
• Sources of problems:
– Manual entry
– No uniform standards for content and formats
– Parallel data entry (duplicates)
– Approximations, surrogates – SW/HW constraints
– Measurement errors.
Solutions
• Potential Solutions:
– Preemptive:
• Process architecture (build in integrity checks)
• Process management (reward accurate data entry, data sharing,
data stewards)
– Retrospective:
• Cleaning focus (duplicate removal, merge/purge, name & address
matching, field value standardization)
• Diagnostic focus (automated detection of glitches).
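As a rough illustration of the name-matching / field-value-standardization idea above (not from the slides), a sketch that maps free-text values onto a canonical list; the building names are made up:

```python
from difflib import get_close_matches

canonical = ["Rowley Hall", "Berg Hall", "Gerard Hall", "Caruthers Hall"]

def standardize(name: str) -> str:
    """Map a messy value onto the closest canonical name, or leave it for review."""
    match = get_close_matches(name.strip().title(), canonical, n=1, cutoff=0.6)
    return match[0] if match else name

print(standardize("rowley  hall"))   # -> "Rowley Hall"
print(standardize("Caruthers Hll"))  # -> "Caruthers Hall"
```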
Data Delivery
• Destroying or mutilating information by inappropriate pre-processing
– Inappropriate aggregation
– Nulls converted to default values
• Loss of data:
– Buffer overflows
– Transmission problems
– No checks
Solutions
• Build reliable transmission protocols
  – Use a relay server
• Verification
  – Checksums, verification parser
  – Do the uploaded files fit an expected pattern?
• Relationships
  – Are there dependencies between data streams and processing steps?
• Interface agreements
  – Data quality commitment from the data stream supplier.
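A small checksum-verification sketch (not from the slides); the file name and the expected digest, which the supplier would publish alongside the file, are placeholders:

```python
import hashlib

expected_sha256 = "<digest published by the data supplier>"

# Hash the delivered file in chunks so large files don't need to fit in memory
sha256 = hashlib.sha256()
with open("traffic_feed.csv", "rb") as f:
    for chunk in iter(lambda: f.read(8192), b""):
        sha256.update(chunk)

if sha256.hexdigest() != expected_sha256:
    raise ValueError("Checksum mismatch: the delivery may be corrupt or incomplete")
```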
Data Storage
• You get a data set. What do you do with it?
• Problems in physical storage
– Can be an issue, but terabytes are cheap.
• Problems in logical storage (ER ↔ relations)
– Poor metadata.
• Data feeds are often derived from application programs or legacy
data sources. What does it mean?
– Inappropriate data models.
• Missing timestamps, incorrect normalization, etc.
– Ad-hoc modifications.
• Structure the data to fit the GUI.
– Hardware / software constraints.
• Data transmission via Excel spreadsheets, Y2K
Solutions
• Metadata
– Document and publish data specifications.
• Planning
– Assume that everything bad will happen.
– Can be very difficult.
• Data exploration
– Use data browsing and data mining tools to examine the data.
• Does it meet the specifications you assumed?
• Has something changed?
Data Integration
• Combine data sets (acquisitions, across departments).
• Common source of problems
– Heterogeneous data: no common key, different field formats
• Approximate matching
– Different definitions
• What is a customer: an account, an individual, a family, …
– Time synchronization
• Does the data relate to the same time periods? Are the time
windows compatible?
– Legacy data
• IMS, spreadsheets, ad-hoc structures
– Sociological factors
• Reluctance to share – loss of power.
Solutions
• Commercial Tools
– Significant body of research in data integration
– Many tools for address matching, schema mapping are available.
• Data browsing and exploration
– Many hidden problems and meanings : must extract metadata.
– View before and after results : did the integration go the way you
thought?
Data Retrieval
• Exported data sets are often a view of the actual data. Problems occur
because:
– Source data not properly understood.
– Need for derived data not understood.
– Just plain mistakes.
• Inner join vs. outer join
• Understanding NULL values
• Computational constraints
– E.g., too expensive to give a full history, we’ll supply a snapshot.
• Incompatibility
– EBCDIC?
Data Mining and Analysis
• What are you doing with all this data anyway?
• Problems in the analysis.
– Scale and performance
– Confidence bounds?
– Black boxes and dart boards
• “fire your Statisticians”
– Attachment to models
– Insufficient domain expertise
– Casual empiricism
Solutions
• Data exploration
– Determine which models and techniques are appropriate, find data bugs,
develop domain expertise.
• Continuous analysis
– Are the results stable? How do they change?
• Accountability
– Make the analysis part of the feedback loop.
Other problems in DQ – Missing Data
• Missing data – values, attributes, entire records, entire sections
• Missing values and defaults are indistinguishable
• Truncation/censoring – not aware, mechanisms not known
• Problem: Misleading results, bias.
Data Glitches
• Systemic changes to data which are external to the recorded process.
– Changes in data layout / data types
• Integer becomes string, fields swap positions, etc.
– Changes in scale / format
• Dollars vs. euros
– Temporary reversion to defaults
• Failure of a processing step
– Missing and default values
• Application programs do not handle NULL values well …
– Gaps in time series
• Especially when records represent incremental changes.
Departmental Silos
• Everyone sees their job, department, or business as the most important thing.
• Often departments or other groups will have their own data quality standards for their specific mission.
• Data quality suffers when you have to look at data across the business or between companies.
• Example: federal IDs for companies and businesses
  – DUNS vs. TaxID vs. NPI vs. SSN vs. Department ID vs. Universal ID
Then why is every DB dirty?
• Consistency constraints are often not used
– Cost of enforcing the constraint
• E.g., foreign key constraints, triggers.
– Loss of flexibility
– Constraints not understood
• E.g., large, complex databases with rapidly changing requirements
– DBA does not know / does not care.
• Garbage in
– Merged, federated, web-scraped DBs.
• Undetectable problems
– Incorrect values, missing data
• Metadata not maintained
• Database is too complex to understand
Improving Data Quality
Data Cleansing
• Not just about the data itself; it's also about standardizing business log data and metrics
• Create universal identifiers across your business; look at best practices.
• Convert dates to the same timezone and format.
• Standardize naming conventions in metadata.
Cleansing Methods
• Histograms
• Conversion tables
  – Example – USA, U.S., U.S.A., US, United States
• Tools
• Algorithms
• Manually
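A conversion table is often just a lookup dictionary. A sketch echoing the USA / U.S. / U.S.A. example above (illustration only):

```python
# Map the messy variants onto one canonical value
COUNTRY_TABLE = {
    "usa": "United States",
    "u.s.": "United States",
    "u.s.a.": "United States",
    "us": "United States",
    "united states": "United States",
}

def standardize_country(value: str) -> str:
    key = value.strip().lower()
    return COUNTRY_TABLE.get(key, value)  # pass unmapped values through untouched

print([standardize_country(v) for v in ["USA", "U.S.", "us", "Canada"]])
# ['United States', 'United States', 'United States', 'Canada']
```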
Master Data Management
MDM Continued
Data Deduplication
• Data deduplication: a process that examines new data blocks using hashing, compares them to existing data blocks, and skips redundant blocks when data is transferred to the target.
• Data reduction: a process that tracks block changes, usually using some kind of log or journal, and then transfers only new blocks to the backup target.
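A toy sketch of hash-based block deduplication as described above (not a real backup tool); the block size and sample payload are arbitrary:

```python
import hashlib

def dedupe_blocks(data: bytes, block_size: int = 4096):
    """Split data into fixed-size blocks and store only one copy per hash."""
    stored = {}        # sha256 digest -> block bytes actually kept
    block_refs = []    # ordered digests needed to rebuild the original data
    for i in range(0, len(data), block_size):
        block = data[i:i + block_size]
        digest = hashlib.sha256(block).hexdigest()
        if digest not in stored:   # redundant blocks are skipped
            stored[digest] = block
        block_refs.append(digest)
    return stored, block_refs

payload = b"A" * 8192 + b"B" * 4096   # two identical "A" blocks plus one "B" block
stored, refs = dedupe_blocks(payload)
print(len(refs), "blocks referenced,", len(stored), "blocks stored")  # 3 vs. 2
```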
Data Interpretation
• Data interpretation refers to the implementation of processes through which data is reviewed for the purpose of arriving at an informed conclusion. The interpretation of data assigns a meaning to the information analyzed and determines its signification and implications.
• Qualitative interpretation – observations, documents, interviews
• Quantitative interpretation – mean, standard deviation, frequency distribution
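The quantitative measures named above are all in the Python standard library; a tiny example with made-up values:

```python
from collections import Counter
from statistics import mean, stdev

daily_requests = [12, 15, 15, 9, 22, 15, 9, 30]  # hypothetical requests per user

print("mean:", mean(daily_requests))
print("standard deviation:", stdev(daily_requests))
print("frequency distribution:", Counter(daily_requests))
```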
Data Interpretation Problems
• Correlation mistaken for causation
• Confirmation bias
• Irrelevant data
Domain Expertise
• Data quality gurus: “We found these peculiar records in
your database after running sophisticated algorithms!”
• Domain Experts: “Oh, those apples – we put them in the
same baskets as oranges because there are too few
apples to bother. Not a big deal. We knew that already.”
Why Domain Expertise?
• DE is important for understanding the data, the problem and interpreting the
results
• “The counter resets to 0 if the number of calls exceeds N”.
• “The missing values are represented by 0, but the default billed
amount is 0 too.”
• Insufficient DE is a primary cause of poor DQ – data are unusable
• DE should be documented as metadata
Where is the Domain Expertise?
• Usually in people’s heads – seldom documented
• Fragmented across organizations
– Often experts don’t agree. Force consensus.
• Lost during personnel and project transitions
• If undocumented, deteriorates and becomes fuzzy over time
The End
• Readings: ebook Ch. 3 and Ch. 5
• Homework 1 is due next week
ALSPAC DATA MANAGEMENT PLAN, 2019-2024
0. Proposal name
The Avon Longitudinal Study of Parents and Children (ALSPAC). Core Program Support
2019-2024.
1. Description of the data
1.1 Type of study
ALSPAC is a multi-generation, geographically based cohort study following 14,541 mothers recruited
during pregnancy (in 1990-1992) and their partners (G0), offspring (G1) and grandchildren (G2).
1.2 Types of data
Quantitative data
• Data from numerous self-completed pa…