Wednesday, March 11, 2020

Collect data with a view to estimating population parameters using estimation techniques Essays

Task: You are required to collect data with a view to estimating population parameters using estimation techniques. This should involve taking a random sample as well as calculating and comparing confidence intervals.

I have decided to estimate the population parameters for sentence length in two different genres of book. I have chosen a horror book and a drama book to see how sentence length varies between them. In theory I would expect the horror book to have much shorter sentences to add suspense, whilst I would expect the drama to have longer, more descriptive sentences.

Method: As it would be too time consuming to record the sentence length for the whole population (the whole book), I am going to use sampling. To try to avoid any bias I will use the random number function on a calculator to pick a page in the book and then record the length of the first full sentence on that page. I will take 100 samples for each book, as this is enough to give reasonably accurate estimates of the population parameters without taking too much time. If by chance the random number function produces a page number that has already been used, I will simply take the length of the second sentence on that page.

The Central Limit Theorem

Because I don't know anything about how the population is distributed I have to use the Central Limit Theorem. Even if you don't know how the parent population is distributed, the central limit theorem allows you to make predictions about the distribution of the sample means. Also, with a large enough sample the sample mean will be close to the population mean. The central limit theorem says that:
* If the sample size is large enough, the sample means will be approximately normally distributed.
* The mean of the sample means is approximately equal to the population mean.
* The variance of the sample mean is roughly the same as the population variance divided by the sample size.
* The larger the sample size, the closer the sample mean and variance are to the population mean and variance.

If $X \sim \text{(unknown)}(\mu, \sigma^2)$ then, approximately, $\bar{X} \sim N(\mu, \sigma^2/n)$.

Once I have collected the data I will calculate the mean, standard deviation and variance of the sample. When I have figures for these I can estimate the variance and standard deviation of the population. Next I will calculate the standard error, which will allow me to calculate confidence intervals for the population mean. When calculating confidence intervals I will use the tables for the normal distribution.

Data Collection

Data For Horror Book

Page and Sentence Length (listed as pairs)
217 12 161 11 178 1 200 24 14 13 138 27 38 11 80 3 155 10 43 8
65 20 171 7 96 31 135 8 96 6 128 17 163 12 199 20 41 17 59 18
93 9 173 6 90 12 56 9 100 17 123 40 218 2 110 11 59 21 193 13
20 8 229 41 203 8 126 1 197 10 53 8 83 9 190 10 182 25 203 28
226 5 110 13 196 14 87 7 63 24 42 31 43 1 131 13 185 2 200 25
29 23 194 21 44 4 125 1 51 3 32 25 47 6 194 10 21 16 124 25
221 7 127 5 77 16 56 35 222 4 34 28 141 8 231 8 15 12 30 20
157 12 97 20 131 34 108 21 173 6 75 8 192 4 139 6 100 8 223 31
99 12 118 8 201 24 138 7 230 10 95 21 193 15 147 10 196 12 190 11
170 3 120 27 162 39 123 16 129 9 73 7 208 18 98 5 51 145 11

Data For Drama Book

Page and Sentence Length (listed as pairs)
51 19 148 20 234 29 114 18 195 6 313 4 239 19 115 11 10 2 203 9
191 8 118 21 109 10 317 4 217 9 298 9 241 9 10 6 232 10 57 11
114 32 80 11 196 14 49 11 67 9 282 15 280 31 226 18 71 24 315 16
308 5 203 9 226 14 147 38 224 10 236 19 185 18 257 5 317 11 1 29
169 15 66 9 267 17 106 20 232 28 160 37 300 25 322 8 49 21 26 29
276 41 214 15 233 7 131 9 76 8 71 8 317 9 177 5 155 13 266 6
95 5 308 3 93 6 55 8 96 4 311 6 65 9 128 21 288 18 203 4
210 19 166 20 175 14 280 13 249 8 245 19 182 4 312 19 52 23 73 13
221 6 204 12 73 13
189 9 129 25 50 25 230 6 273 22 218 12 31 39
149 28 96 7 48 14 80 18 13 11 167 4 34 23 43 10 94 7 49 16

The first thing for me to do is to find the mean, standard deviation and variance of the samples I have taken. As it would be extremely time consuming to find the exact mean and variance for 100 results, I have set up frequency tables which allow me to work out the mean and variance more quickly. I have chosen quite small class intervals so that the calculations will be as accurate as possible.

Drama Book

Number of Words   Frequency (F)   Mid interval Value (Miv)   F x Miv   F x Miv^2
1-5               12              3                          36        108
6-10              31              8                          248       1984
11-15             19              13                         247       3211
16-20             17              18                         306       5508
21-25             10              23                         230       5290
26-30             5               28                         140       3920
31-35             2               33                         66        2178
36-40             3               38                         114       4332
41-45             1               43                         43        1849
Total             100                                        1430      28380

Mean = 1430 / 100 = 14.3. Variance = 28380 / 100 - 14.3^2 = 79.31.

Horror Book

Number of Words   Frequency (F)   Mid interval Value (Miv)   F x Miv   F x Miv^2
1-5               15              3                          45        135
6-10              31              8                          248       1984
11-15             18              13                         234       3042
16-20             12              18                         216       3888
21-25             12              23                         276       6348
26-30             4               28                         112       3136
31-35             5               33                         165       5445
36-40             2               38                         76        2888
41-45             1               43                         43        1849
Total             100                                        1415      28715

Mean = 1415 / 100 = 14.15. Variance = 28715 / 100 - 14.15^2 = 86.93.

From my frequency tables I have been able to use a number of graphical methods to show the data. I have worked out the median of the horror book to be 11 and of the drama book to be 12. I have also found the lower quartiles to be 7 for the horror book and 8 for the drama book, whilst the upper quartiles are 19 for the drama book and 20 for the horror book. This tells me that the data for the horror book appears to be more spread out, so I would expect it to have a larger variance.

Having found the means of the samples, I can use them as estimates of the population means, because the central limit theorem says that the mean of the sample means is approximately equal to the population mean. So my estimate of the mean sentence length is 14.15 words for the horror book and 14.3 words for the drama book.
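The frequency-table working above can be checked with a few lines of Python. This is only a sketch: the function name `grouped_mean_and_variance` is made up for illustration, and the midpoints and frequencies are copied from the two tables.

```python
# (class midpoint, frequency) pairs copied from the essay's frequency tables.
HORROR = [(3, 15), (8, 31), (13, 18), (18, 12), (23, 12),
          (28, 4), (33, 5), (38, 2), (43, 1)]
DRAMA = [(3, 12), (8, 31), (13, 19), (18, 17), (23, 10),
         (28, 5), (33, 2), (38, 3), (43, 1)]


def grouped_mean_and_variance(table):
    """Return (n, mean, variance) for grouped data using class midpoints.

    The variance uses divisor n, so it is the sample variance,
    not the unbiased estimate of the population variance.
    """
    n = sum(f for _, f in table)
    sum_fx = sum(f * x for x, f in table)
    sum_fx2 = sum(f * x * x for x, f in table)
    mean = sum_fx / n
    variance = sum_fx2 / n - mean ** 2
    return n, mean, variance


for name, table in (("horror", HORROR), ("drama", DRAMA)):
    n, mean, variance = grouped_mean_and_variance(table)
    print(f"{name}: n={n}, mean={mean:.2f}, variance={variance:.2f}")
```

Because every sentence in a class is treated as having the midpoint length, these figures are approximations to what the raw data would give.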
However, the variance obtained from the sample is not the same as that of the population; it is a biased estimator. This means the mean of its distribution is not equal to the population value it is estimating. To obtain an unbiased estimator for the variance of the population we can use the formula

$\hat{\sigma}^2 = \frac{n}{n-1} s^2$

where $s^2$ is the sample variance and $n$ is the sample size. We can see that this didn't really have that big an effect on the variance, because the value of n was quite large and so n/(n - 1) was almost 1.

In order to judge the accuracy of your value for the sample mean you can calculate the standard error (s.e.). This is the standard deviation of the sample means. It is found using the formula

$\text{s.e.} = \frac{\hat{\sigma}}{\sqrt{n}}$

We can see that the s.e. for the horror book was 0.937 and for the drama book it was 0.895. This standard error is quite small, so I can be fairly confident that the actual mean of each population is close to that of the sample.

A better way of showing how confident I can be in my approximations is to use confidence intervals. Confidence intervals allow you to give a percentage value to how confident you can be that the mean of the population lies within a certain range. The central limit theorem says that the sample mean is approximately normally distributed when a large enough sample is taken, and that the mean of this distribution is equal to the population mean. This means that we can use the tables for the normal distribution to find out how confident we can be that the population mean is within a certain range. For example, when the central 68% of the area under the normal curve is shaded, we can use the normal tables to work out that this corresponds to plus or minus 1 s.e. either side of the mean. So if you took a sample you could be 68% confident the sample mean was within plus or minus 1 s.e. of the population mean. Writing $\bar{x}$ for the sample mean and $\mu$ for the population mean, this can be written as the inequality:

$\mu - \text{s.e.} < \bar{x} < \mu + \text{s.e.}$

However, because we don't know the value of $\mu$ we must rearrange this to form the inequality:

$\bar{x} - \text{s.e.} < \mu < \bar{x} + \text{s.e.}$

I am going to use this to calculate 90%, 95% and 99% confidence intervals.
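As a rough check on the unbiased variance and standard error described above, the sketch below applies the n/(n - 1) correction and then divides by the sample size. The function names are illustrative, and the inputs 86.9275 and 79.31 are the divisor-n sample variances that follow from the frequency-table totals (an assumption about which figures feed the formula).

```python
import math


def unbiased_variance(sample_variance, n):
    """Scale the divisor-n sample variance by n / (n - 1)."""
    return sample_variance * n / (n - 1)


def standard_error(variance_estimate, n):
    """Standard deviation of the sample mean: sqrt(variance / n)."""
    return math.sqrt(variance_estimate / n)


def one_se_interval(mean, se):
    """Roughly 68% interval: the sample mean plus or minus one s.e."""
    return mean - se, mean + se


# (sample mean, divisor-n sample variance) from the frequency tables, n = 100.
SAMPLES = {"horror": (14.15, 86.9275), "drama": (14.3, 79.31)}

for book, (mean, sample_var) in SAMPLES.items():
    var_hat = unbiased_variance(sample_var, 100)
    se = standard_error(var_hat, 100)
    low, high = one_se_interval(mean, se)
    print(f"{book}: variance estimate={var_hat:.1f}, s.e.={se:.3f}, "
          f"68% interval {low:.2f} to {high:.2f}")
```

With n = 100 the correction factor is 100/99, which is why the unbiased estimate differs from the sample variance by only about 1%.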
The z value for a 90% confidence interval is 1.645, so I can be 90% sure that the sample mean is within 1.645 s.e. of the population mean. This means the calculation is the sample mean plus or minus 1.645 × s.e. For 95% confidence I found that the z value was 1.96, so you can be 95% sure the sample mean is within 1.96 s.e. of the population mean. The z value for 99% was 2.58, so the interval is the sample mean plus or minus 2.58 s.e.

Data Interpretation

From my calculations I have been able to estimate the population parameters for the two books. Firstly, I found that the mean for the Alfred Hitchcock horror book was 14.15 whilst the mean for the drama was 14.3. I found that the estimated population variance was 87.8 for the horror book and 80.1 for the drama book, giving standard errors of 0.937 and 0.895.

The confidence intervals I calculated for the horror book were
90%: 12.61 to 15.69
95%: 12.31 to 15.99
99%: 11.73 to 16.57

And for the drama book they were
90%: 12.83 to 15.77
95%: 12.55 to 16.05
99%: 11.99 to 16.61

Although this supports my prediction that horror books would have fewer words per sentence, I am not actually that confident in this conclusion. This is because the 99% confidence intervals have a large range: 4.84 words for the horror book and 4.62 words for the drama book, and the intervals for the two books overlap almost completely. This means that the actual population mean could be quite different from the sample mean I calculated, and so it could be that the population mean for the drama book is actually less than that of the horror book.

I also found that the variance for the horror book was greater than that of the drama book. This is probably because a drama book is likely to keep the same style of writing throughout, with roughly the same sentence length, whereas a horror book is likely to contain parts where there is suspense and the sentences are short, and parts where there is description and the sentences are much longer.

One of the problems with my findings was that the calculated means were not whole numbers. It is impossible to have fractions of a word, so if you round the means to the nearest whole word they are exactly the same, at 14.
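The confidence intervals discussed above can be reproduced from the frequency-table totals. The sketch below is illustrative only: the helper names are invented, the z values are the table values quoted in the essay, and the standard error is assumed to come from the n/(n - 1) corrected variance.

```python
import math

# Normal-table z values quoted in the essay for each confidence level.
Z_VALUES = {90: 1.645, 95: 1.96, 99: 2.58}

# (n, sum of F x Miv, sum of F x Miv^2) from the frequency tables.
BOOKS = {"horror": (100, 1415, 28715), "drama": (100, 1430, 28380)}


def mean_and_se(n, sum_fx, sum_fx2):
    """Sample mean and standard error based on the unbiased variance."""
    mean = sum_fx / n
    sample_var = sum_fx2 / n - mean ** 2
    unbiased_var = sample_var * n / (n - 1)
    return mean, math.sqrt(unbiased_var / n)


def confidence_interval(mean, se, z):
    """Interval mean - z * se < mu < mean + z * se."""
    half_width = z * se
    return mean - half_width, mean + half_width


def overlaps(a, b):
    """True if the closed intervals a and b share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


INTERVALS = {}
for book, totals in BOOKS.items():
    mean, se = mean_and_se(*totals)
    for level, z in Z_VALUES.items():
        low, high = confidence_interval(mean, se, z)
        INTERVALS[(book, level)] = (low, high)
        print(f"{book} {level}%: {low:.2f} < mu < {high:.2f}")

print("99% intervals overlap:",
      overlaps(INTERVALS[("horror", 99)], INTERVALS[("drama", 99)]))
```

The overlap check is a simple way to see why the difference between the two sample means is not convincing on its own; it is not a formal test of the difference.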
There were a number of limitations with this investigation. Firstly, I couldn't be that confident that the means I obtained were accurate; if I wanted to be more accurate I would have to take a lot more samples. For example, if I wanted to be 99% sure that the sample mean was within 0.1 of a word of the population mean, I would need 2.58 × s.e. to be at most 0.1, that is $n \ge (2.58\hat{\sigma}/0.1)^2$, where $\hat{\sigma}^2$ is the estimated population variance. This means taking over 58,000 samples for the horror book and over 53,000 samples for the drama book. Obviously this is highly impractical, but it shows how imprecise my estimate is because I took so few samples.

Also, I only sampled one book from each genre, so it is difficult for me to say that all books from these genres will be the same. It is possible that different authors with different writing styles will produce different sentence lengths. For example, another horror writer may use longer sentences whilst another drama writer might use shorter sentences.

So if I were to extend this investigation I would firstly take more samples to ensure greater accuracy, which would allow greater certainty in any conclusions drawn. Secondly, I would compare a number of different horror books against each other to see if their population parameters were similar or if they varied. Another progression could be to sample a number of horror books by the same author to see how similar their population parameters are.
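The sample-size figures mentioned above follow from requiring z × s.e. to be no larger than the chosen margin. A minimal sketch, assuming the unbiased variance estimates from the frequency-table totals and the table value z = 2.58 (function names are hypothetical):

```python
import math

Z_99 = 2.58  # normal-table value used in the essay for 99% confidence


def unbiased_variance(n, sum_fx, sum_fx2):
    """Unbiased population variance estimate from grouped totals."""
    mean = sum_fx / n
    return (sum_fx2 / n - mean ** 2) * n / (n - 1)


def required_sample_size(variance, z, margin):
    """Smallest n for which z * sqrt(variance / n) <= margin."""
    return math.ceil(variance * (z / margin) ** 2)


# Totals (n, sum of F x Miv, sum of F x Miv^2) from the frequency tables.
horror_n = required_sample_size(unbiased_variance(100, 1415, 28715), Z_99, 0.1)
drama_n = required_sample_size(unbiased_variance(100, 1430, 28380), Z_99, 0.1)
print(f"horror: {horror_n} samples, drama: {drama_n} samples")
```

Because the required n grows with the square of z / margin, halving the margin to 0.05 of a word would roughly quadruple these figures.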