Carefully read this entire document
Purpose of the project
The goal of the individual project is to combine and implement all of the research/analytical
skills you acquired throughout the training modules. This means that you will independently
collect, process, analyse and visualise data to answer research question(s). You will have to
write Python code to accomplish this, and combine the outcomes in a short, written document
of 1,500 to 2,000 words (everything included).
The word ‘independent’ is crucial here. There is no ready-made template to work from, and that’s
the whole point of this project. Imagine being an analyst for a government agency or a
consultancy firm, handed a project to complete by your team leader. You can’t walk in every
five minutes to ask what exactly it is you need to do. It is your responsibility to deliver, even if
not every step is entirely spelled out. That means you will have to make decisions for yourself,
be able to explain them, and assure quality. Remember, there’s usually more than one way to
skin a cat.
You will probably get stuck a few times. That’s the way it goes. You will have to find and try
multiple solutions, and navigate through more or less uncharted territory. However, everything
you need is available in the training modules in some form (!). This document will help you
though: it contains multiple tips to get you started, and redirects to materials that cover what
you cannot explicitly find in the modules. Still, it is up to you to identify what you need, how to
change it to fit your goal, and how to make it work.
The research questions
With the current covid-19 crisis, policy makers continuously need to communicate about
implemented measures. Their communication needs to be timely and clear, instil trust in, and
acceptance by the public. We will focus on policy makers’ Twitter communication as indicative
for their communication strategies. Please keep in mind that the scale of this project is
particularly narrow: it is just an exercise to test your skills, rather than a fully fledged research
project.
The key research questions are:
1. How do policy makers (i.e., Australia’s prime minister, the minister of health, as well
as the health department) communicate about the covid-19 crisis? That is, what do
they communicate about covid, and what is their tone?
2. How has this communication evolved/shifted throughout time, considering the timeline
of the registered covid-19 cases?
For the first research question, you can draw on the following code book – i.e., to manually
code the contents of the tweets:
Tweet (content coding) – notes and descriptions:
Factuality (non-exclusive categories = three variables, each with two possible values; 1
means yes, 0 means no):
• Report current status (value 1 or 0): the tweet contains statistics and factual
situational overviews of the numbers of infections, casualties and cured (‘the curve’)
= value 1, otherwise value 0.
• Structural precautions (value 1 or 0): the tweet explains what the government is
doing to prevent/limit the spread of the virus (enforceable policy measures such as
forced closures, limiting social gatherings, etc.) = value 1, otherwise value 0.
• Individual precautions (value 1 or 0): the tweet explains what individuals can/should
do to ‘flatten the curve’ (advice on what we can do: social distancing, hand washing,
staying inside, seeking medical attention when confronted with symptoms, etc.) =
value 1, otherwise value 0.
Emotionality (exclusive categorisation = one variable with three possible values):
• Threat (value 1): predominantly warns of (possible) adverse consequences: if you
don’t do x, then y will happen (and y is a negative outcome).
• Reassurance (value 2): predominantly plays down the danger and/or comforts
people (e.g., by referring to limited danger, positive signs, etc.).
• Neutral (value 3): none of the above – neither predominantly positively nor
negatively phrased.
Locus of responsibility (exclusive categorisation = one variable with three possible
values):
• Internal responsibility (value 1): the tweet says that we (Australian residents) are
responsible for the outcomes we are heading for: we should do enough, and be
cautious and responsible in following policy measures.
• External responsibility (value 2): the tweet points to other countries as the source of
danger, and/or the reason why Australian measures are not optimally effective.
• No mention of responsibility (value 3): none of the above.
What you need to do
Part I – Data collection
Write a Python script that accesses the Twitter API and harvests the tweets of the
three Twitter accounts the study focuses on. Get the contents of their tweets and when they
were tweeted. Write the information into a data file (an xlsx file, not a csv file – see the
documentation below). You will need to upload both the Python script and the data file that
you generated as part of the assignment.
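As a hedged illustration of this step: the sketch below assumes the tweets have already been fetched (the fetch itself, shown only in comments, follows a tweepy-style flow and needs real API credentials) and focuses on writing handle, timestamp and text into a data file. All names here are illustrative; use the base script provided on Blackboard/GitHub as your actual starting point.

```python
import pandas as pd
from datetime import datetime

# Fetching would look roughly like this with tweepy (hypothetical credentials):
#   auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
#   auth.set_access_token(access_token, access_token_secret)
#   api = tweepy.API(auth)
#   statuses = api.user_timeline(screen_name='ScottMorrisonMP', count=200)

# Stand-in for fetched statuses: (handle, timestamp, text) records
records = [
    ('ScottMorrisonMP', datetime(2020, 3, 15, 10, 0), 'Example tweet text'),
    ('healthgovau', datetime(2020, 3, 16, 9, 30), 'Another example tweet'),
]

# Put the handle, timestamp and text of each tweet into a data frame
df = pd.DataFrame(records, columns=['handle', 'timestamp', 'text'])

# Write the data frame to an xlsx file (requires the openpyxl package):
# df.to_excel('harvested_tweets.xlsx', index=False)
```

In the real script, the `records` list would be filled by looping over the statuses returned for each of the three accounts.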
Part II – Data analysis
Analyse the file that is provided on Blackboard, named twitterdata.xlsx. Everybody will use
the same data file, regardless of the data you collected yourself. The analysis will be a
descriptive analysis of the tweets, in which you will compare how the three accounts in focus
have tweeted (and how that possibly changed over time). This will require you to draw a
random sample of 50
covid tweets per account (See the additional documentation on how to do that).
Be creative in how you handle the analysis. Again, this should be an independent enterprise
in which you draw on and apply all possibly suitable course materials. You will have to upload
the Python script in which you perform the analysis (data cleaning – if necessary – and the
analysis/visualisation itself).
Part III – Report
Write a research report (1,500 to 2,000 words, all included), with the following sections:
1. Introduction section in which you contextualize the research and outline the research
question(s).
2. Methodology section in which you concisely explain the procedure: (1) how did you
get the data (Although you will work with the data that I provide, the procedure should
be just the same as the one that you used; only the timeframe of data collection is
wider – it started on December 1st and lasted until April 18th), (2) what do the sample data
look like (i.e., how many tweets, harvested in what period – this will be a description of
the file that is available on Blackboard, not the data file you harvested yourself).
3. Results section in which you discuss the analysis: i.e., what did you do with what
data, and what does that tell us.
4. Discussion section in which you explain how the results answer the initial research
questions (i.e., what do the results actually mean). This is concluded by a reflection on
the strengths and weaknesses of the research methodology (draw inspiration from the
introduction lecture, as well as from the module on scraping APIs).
Make sure that the report mentions your name and student number. There are no strict
guidelines on how to format the document, except for the word count. However, make it look
clean and professional in every possible way. A professionally type-set research article by a
publisher such as Sage, Wiley-Blackwell or Elsevier might inspire you.
In total, there are five files you need to upload, combined in a single compressed .zip file:
1. A Python file that harvests tweets and writes them into a data file (.py file)
2. The data file with the harvested tweets (.xlsx file)
3. A Python file with the data processing/analysis/visualisation (.py file)
4. The final version of the data file that you processed (i.e., the one that you downloaded
from Blackboard, not the one that you harvested yourself) (.xlsx file)
5. A text document with the 1,500-to-2,000-word research report (.pdf file)
What are you graded on?
Your project makes up 50% of your final grade. I take the following matters into account:
1. Were you able to outline the relevance of the research question? (introduction
section report)
2. Is the code that you wrote to harvest tweets valid and effective? (harvest file)
3. Were you able to clearly describe the procedure on how tweets were harvested?
(methodology section report)
4. Were you able to transparently explain what you did with the data, what you
analysed/visualised? (results section report)
5. Were you able to clean and format the given research data? (analysis file)
6. Is the analysis/visualisation that you performed sound/valid? (analysis file)
7. Does your discussion of the results make sense in answering the research
questions? Are you able to pinpoint strengths and weaknesses of the method
(including whether analysing tweets is the right way to go…)? (discussion section
report)
8. Is your writing tidy and clear? (entire report)
9. Is your document professionally formatted? (entire report)
Help to get started…
Although this is independent work, I will be ready to answer your directed, concrete questions
on the Discussion Board and during the Zoom Q&As. Please do not e-mail me: it is highly
inefficient, and the answer to your question is likely to help many others, so there is no reason
to keep it to yourself.
Apart from that, there is help available to get past the most difficult hurdles:
• You will need to use the Twitter API or a ready-made Twitter scraper to harvest
tweets. There is an example base script available via Blackboard/GitHub. You will have
to customize and expand the code to write your first tweet-harvesting Python script.
• You need to analyse the data file that is available on Blackboard
(twitterdata.xlsx), not your own data file. The reason for that is twofold: (1) if you
are somehow unable to manage building the harvest script, not all is lost and you
can still complete the other parts of the project, and (2) it is more straightforward for
me to understand what you did if the variable names are consistent and you are working
from the same starting point. >>> The data file is available now.
o To read and write from and to a .xlsx file in Python instead of a csv, you need
to change the code we have been using so far just a tiny bit:
### Read from and write to xlsx
import pandas as pd
# Read xlsx file as data frame
df = pd.read_excel('twitterdata.xlsx')
# Write data frame into xlsx file
df.to_excel('twitterdata.xlsx', index=False)
We need to use the xlsx-format because Twitter data are just too messy to be
stored in a csv (too much chance for delimiter conflicts, regardless of which
one we use – , – ; – …). Moreover, it will help you to do the manual content
coding, because you can do that in Excel (open it, add a column variable on
the first line, and type in your values).
o The first thing you need to do with that file is to filter the dataset so it
only contains tweets that mention covid (‘covid’, ‘corona’, ‘virus’ – and
potentially other keywords you find relevant). You can descriptively analyse
that data file. This is a categorization problem, in which you will add a variable
to your data frame with two potential values: ‘yes’ or ‘no’, based on the
presence of any of the aforementioned keywords. Then, you will have to subset
to only retain the ‘yes’ rows. You should get to 631 covid-related tweets. This
is covered in the modules:
§ Categorisation: see first example (i.e., genres) on data categorisation
(Module 5)
§ Sub-setting: see example code on checking data ingest (Module 5)
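For illustration, the categorisation and sub-setting steps could look like the sketch below; the three-row data frame and the exact keyword list are stand-ins, not the real twitterdata.xlsx:

```python
import pandas as pd

# Toy data frame standing in for twitterdata.xlsx
df = pd.DataFrame({
    'handle': ['ScottMorrisonMP', 'greghuntmp', 'healthgovau'],
    'text': ['Update on COVID-19 cases today',
             'A great day for Australian sport',
             'Wash your hands to stop the virus'],
})

keywords = ['covid', 'corona', 'virus']  # extend with other relevant terms

# Categorisation: add a yes/no variable based on keyword presence
def is_covid(text):
    text = text.lower()
    return 'yes' if any(k in text for k in keywords) else 'no'

df['covid'] = df['text'].apply(is_covid)

# Sub-setting: retain only the 'yes' rows
df_covid = df[df['covid'] == 'yes']
print(len(df_covid))  # 2 of the 3 toy tweets mention a keyword
```

Run on the real file, the length of `df_covid` is what should come out to 631.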
o For the analysis on how the politicians involved communicate, you will
need to do some manual content coding. You can use the aforementioned
five-variable code book and add and code these variables in Excel (i.e., type in
the values). To keep the amount of data in check (this is only an exercise after
all), I strongly suggest you randomly sample 50 tweets per account (so you
only need to read and manually code a manageable set of 150 tweets). This
is the way to sample these 150 tweets, using pandas:
### Random sampling of 50 tweets per account
# Subset tweets by 'ScottMorrisonMP' and sample 50 random tweets, write them into variable df_1
df_1 = df[df['handle'] == 'ScottMorrisonMP'].sample(n=50)
# Subset tweets by 'greghuntmp' and sample 50 random tweets, write them into variable df_2
df_2 = df[df['handle'] == 'greghuntmp'].sample(n=50)
# Subset tweets by 'healthgovau' and sample 50 random tweets, write them into variable df_3
df_3 = df[df['handle'] == 'healthgovau'].sample(n=50)
# Join these three dataframes together into one joint dataframe
df = pd.concat([df_1, df_2, df_3])
o There is a clear longitudinal (time) component in the second research
question (i.e., how did the communication evolve over time). This means you
probably need to visualise a time series. We did that in Module 5 (see
‘visualise a time series’); it might be helpful to do that here as well. You will
need a variable that indicates the days since December 1st 2019 – reportedly
the first covid case. I did that for you, and wrote the days since December 1st
into the column ‘days-since-dec1’. You can use this variable any way you see
fit: as is in the time series, or you can further categorize it. Whatever you want.
The datetime calculation was done by running the code below. You don’t have
to run it again; just have a look at how it was done:
from datetime import datetime

def calculatedays(date):
    dec1 = "2019-12-01 00:00:00"  # Setting the date of Dec 1st
    dec1 = datetime.strptime(dec1, "%Y-%m-%d %H:%M:%S")  # Converting it to a datetime object
    days = (date - dec1).days  # Subtract Dec 1st from the date of the tweet, convert the result to days
    return days

df['days-since-dec1'] = df['timestamp'].apply(lambda x: calculatedays(x))  # Define the column with a lambda function drawing on calculatedays()

# Write data frame into xlsx file
df.to_excel('twitterdata.xlsx', index=False)
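One hedged way to build that time series is to count covid tweets per account per day and plot those counts. The sketch below uses a toy data frame; only the column names ‘handle’ and ‘days-since-dec1’ come from the assignment, and the plotting call is commented out so the script also runs without matplotlib:

```python
import pandas as pd

# Toy data frame with the two columns the sketch needs
df = pd.DataFrame({
    'handle': ['ScottMorrisonMP', 'ScottMorrisonMP', 'healthgovau', 'healthgovau'],
    'days-since-dec1': [100, 100, 100, 101],
})

# Count tweets per account per day; unstack puts one column per account
counts = df.groupby(['days-since-dec1', 'handle']).size().unstack(fill_value=0)

# Plot one line per account over time (Module 5 style), e.g.:
# counts.plot()

print(counts)
```

On the real data, `counts` would have one row per day since December 1st and one column per account, ready for a line chart or further categorisation into periods.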
