One thing I failed to consider in my previous blog post is early admissions.

By admitting many or most of their students early, a college can appear to be very selective when, in fact, it is only selective for people who do not apply early. Applying early decision is the equivalent of ranking a school first, so schools know that admitting students early will improve their matriculation rate. Also, students who really want to attend a particular school may turn out to be better students than those for whom the school was a second, third, or lower choice.

A summary of actual acceptance rates at Ivy League schools, early and otherwise, appears here. To understand what is happening, take Harvard, which has the lowest overall acceptance rate, 5.8%. If you apply there through regular admissions, you have a 3.8% chance (less than 1 in 25) of being admitted. However, if you apply early, your chances increase to 18.4% (about 1 in 5 or 6). Of course, the quality of the students likely differs between the group that applies for regular admission and the group that applies early, so the difference between two equally qualified students is likely smaller. However, it seems doubtful that the entire difference is due to the quality of the applicant pool.
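As a rough check on the Harvard figures above, the three rates pin down how the applicant pool must be split, if we assume every applicant is counted exactly once as either early or regular. That is an assumption; deferred early applicants may be counted differently in the published figures. The sketch below solves for the implied early share of applicants and of admits; the function name is made up for illustration.

```python
def implied_early_shares(overall, early, regular):
    """Solve overall = w*early + (1 - w)*regular for w, the early share of applicants.

    Returns (early share of applicants, early share of admitted students).
    Assumes each applicant is counted once, as either early or regular.
    """
    w = (overall - regular) / (early - regular)
    early_share_of_admits = w * early / overall
    return w, early_share_of_admits


# Acceptance rates quoted in the post for Harvard.
applicants_early, admits_early = implied_early_shares(0.058, 0.184, 0.038)
print(f"Implied early share of applicants: {applicants_early:.1%}")
print(f"Implied early share of admits:     {admits_early:.1%}")
```

Under that assumption, only about 14% of applicants applied early, yet they make up about 43% of the admitted students.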

At a recent presentation, an admissions officer at a local college stated outright that the standards change between early and later admissions, even at "rolling" admissions schools. Put simply, early applicants get priority and are more likely to be accepted.

So what's the strategy? Apply early, but remember that you only have one shot at early decision (typically you can only apply to one school). Therefore, apply early to a top choice, but one where you have a decent chance of getting in, judging by that school's average SATs, grades, etc. If you reach too high, you will be rejected and relegated to the regular application pool, where the chances of getting into top schools are far lower.

## Monday, May 5, 2014

## Monday, April 28, 2014

### Getting into College

Now that I have a 9th-grader, I am starting to think about college admissions. The urban myth is: "If you were applying to college now, you'd never get into the (great) college you went to (in the 1980s or 1990s)."

This belief is driven by lower acceptance rates at many elite colleges, as well as by the parents and peers of those who went to elite schools. This Washington Post article debunks the myth. It refers to an article about a study at the Center for Public Education, which has more detail. On the other hand, this paper shows that while overall selectivity fell, the top schools have become more difficult to get into, at least as measured by SAT/ACT scores.

Here are some factors that could be at play:

1) Regression to the mean. People who went to great schools are, on average, high achievers compared to the general public. However, within the cohort accepted to these schools, some fraction will have gotten in partly by luck, scoring or performing better than their underlying ability would predict. The next generation will regress to the mean, so colleges will appear more selective to families in which the parents went to more selective colleges (by the same token, among those who went to the least selective schools, there will be the opposite effect), all else being equal of course. This is the same effect that results in the children of the tallest people being shorter than their parents, even though they may still be taller than the average person.

2) People apply to more schools. When the average person applies to 10 schools, whereas the average person used to apply to 3, acceptance rates can go down, resulting in higher perceived selectivity. This article shows the number of people applying to four or more schools more than doubling since the 70s. The increase in applications might also imply that students who never would have applied to, say, Harvard, are now applying. This is why a lower acceptance rate doesn't necessarily mean it is more difficult to get admitted, once you adjust for the quality of the student.

3) Slight increase in actual selectivity at a few schools. The New York Times had an interesting article on changes in selectivity, which focused on the number of spots per 100,000 population (rather than the number admitted). Harvard, with the greatest change, had a drop of 27% in spots per 100,000 (the article focused only on US student rates) over the last 20 years. While this might seem large, keep in mind that its admissions rate has dropped about two-thirds, from 18% to 6%, a much larger change.

4) Student quality improved. There is certainly room in the equation for a true increase in student quality. As the article above implies, the top schools did have moderate increases in test scores.
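Factor 1, regression to the mean, can be illustrated with a small simulation. This is a toy model under stated assumptions, not an estimate of anything real: a test score is ability plus luck, a "great school" admits the top 5% of scores, and children inherit only part of their parents' ability and none of their luck. The function name and every parameter value are invented for illustration.

```python
import random
import statistics


def simulate_two_generations(n=50_000, top_share=0.05, inheritance=0.5, seed=1):
    """Toy model: score = ability + luck; children inherit part of the ability only.

    All parameter values are illustrative assumptions, not estimates.
    Returns (mean admitted-parent score, mean child score,
             share of those children who clear the same cutoff).
    """
    rng = random.Random(seed)

    # Parent generation: ability and luck are independent standard normals.
    parents = []
    for _ in range(n):
        ability = rng.gauss(0, 1)
        parents.append((ability, ability + rng.gauss(0, 1)))

    # The "great school" admits the top `top_share` of scores.
    scores = sorted(score for _, score in parents)
    cutoff = scores[int(n * (1 - top_share))]
    admitted = [(a, s) for a, s in parents if s >= cutoff]

    # Children: ability is partly inherited, luck is drawn fresh.
    noise_sd = (1 - inheritance ** 2) ** 0.5  # keeps child ability at variance 1
    child_scores = []
    for ability, _ in admitted:
        child_ability = inheritance * ability + rng.gauss(0, noise_sd)
        child_scores.append(child_ability + rng.gauss(0, 1))

    parent_mean = statistics.mean(s for _, s in admitted)
    child_mean = statistics.mean(child_scores)
    share_clearing = sum(s >= cutoff for s in child_scores) / len(child_scores)
    return parent_mean, child_mean, share_clearing


parent_mean, child_mean, share = simulate_two_generations()
print(f"Mean score of admitted parents:  {parent_mean:.2f}")
print(f"Mean score of their children:    {child_mean:.2f}")
print(f"Children clearing the same bar:  {share:.1%}")
```

In this toy model the children of admitted parents still score above the population average of zero, but well below their parents, and only a minority clear the cutoff their parents cleared, even though the cutoff itself never moved.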

No matter whether college is the same or more difficult to get into, it certainly appears to be more stressful. One solution is the med school solution (and the NYC schools solution): a ranking and matching program. This is fairly simple and goes as follows: each student ranks the schools he/she applies to in order of preference. Colleges rank the students who apply in order. Colleges are matched with the students who are highest on their lists, beginning with students who ranked them first. Students are required to go to the college they are matched with, or enter a second consolation round.
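The procedure above is described informally. As one way to make it concrete, here is a minimal sketch of student-proposing deferred acceptance (the Gale-Shapley approach), the family of algorithms that matching programs of this kind are generally built on. It is a swapped-in textbook technique, not necessarily the exact rule described above, and the students, colleges, and capacities are invented for illustration.

```python
def deferred_acceptance(student_prefs, college_prefs, capacity):
    """Student-proposing deferred acceptance.

    student_prefs: {student: [colleges, most preferred first]}
    college_prefs: {college: [students, most preferred first]}
    capacity:      {college: number of seats}
    Returns {college: [students held when proposals stop]}.
    """
    rank = {c: {s: i for i, s in enumerate(prefs)} for c, prefs in college_prefs.items()}
    next_choice = {s: 0 for s in student_prefs}  # index of the next college to try
    held = {c: [] for c in college_prefs}
    unplaced = list(student_prefs)

    while unplaced:
        student = unplaced.pop()
        choices = student_prefs[student]
        if next_choice[student] >= len(choices):
            continue  # list exhausted: student stays unmatched
        college = choices[next_choice[student]]
        next_choice[student] += 1
        if student not in rank[college]:
            unplaced.append(student)  # college did not rank this student
            continue
        held[college].append(student)
        held[college].sort(key=lambda s: rank[college][s])
        if len(held[college]) > capacity[college]:
            unplaced.append(held[college].pop())  # bump the lowest-ranked student

    return held


# Hypothetical example: three students, two colleges with one seat each.
student_prefs = {"Ann": ["X", "Y"], "Ben": ["X", "Y"], "Cal": ["Y", "X"]}
college_prefs = {"X": ["Ben", "Ann", "Cal"], "Y": ["Ann", "Cal", "Ben"]}
capacity = {"X": 1, "Y": 1}

matches = deferred_acceptance(student_prefs, college_prefs, capacity)
print(matches)
```

In this made-up example Ben ends up at X and Ann at Y, while Cal is left unmatched and would go on to the consolation round.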

## Sunday, December 29, 2013

### CitiBike share--what are the chances?

I have been working with Joe Jansen on the Citibike data in the R language. Citibike is New York's bike sharing program, which started in May and currently has more than 80,000 annual members. R is a freely available programming language for statistics, based on the S language originally developed at Bell Labs.

Joe has downloaded all the data and done an extensive analysis, which you can find here. I did a simpler analysis predicting trips using a statistical regression model and graphed it using the ggplot2 package in R. I found maximum temperature, humidity, wind, and amount of sunshine to be significant factors in predicting the number of trips that will be taken on any given day. While rain was not a significant factor, it is likely confounded with sunshine, so it only drops out as a factor after accounting for the amount of sunshine. Also, keep in mind that a number of days with rain, especially in the summer, are generally sunny days with an hour or two of rain or thunderstorms. The day of the week, surprisingly, was not an important factor influencing the number of trips. The R-squared, which is a typical measure of predictive power and is on a scale from 0 to 100%, was more than 70%.
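For readers unfamiliar with R-squared, here is a minimal Python sketch of how it is computed from actual and predicted values. The numbers are made up purely to show the arithmetic; they are not the Citibike data, and the function name is hypothetical.

```python
def r_squared(actual, predicted):
    """Share of the variation in `actual` that the predictions account for."""
    mean_actual = sum(actual) / len(actual)
    ss_total = sum((a - mean_actual) ** 2 for a in actual)
    ss_residual = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return 1 - ss_residual / ss_total


# Made-up daily trip counts and model predictions, for illustration only.
actual = [10, 20, 30, 40]
predicted = [12, 18, 33, 37]
print(f"R-squared: {r_squared(actual, predicted):.1%}")
```

A value of 100% would mean the predictions match the actual values exactly, while 0% would mean they do no better than always predicting the average.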

Here is a graph of the results that shows the predicted number of trips per 1,000 members versus the actual number of trips. The day of the week is indicated by the color of the point.

I am an amateur with ggplot, and so the legend for day of the week has the days in alphabetical order rather than Monday, Tuesday, etc. Help on that and other aspects of ggplot for this graph would be welcome (please comment accordingly).

If day of the week made a difference, for any given point on the x-axis (predicted trips) you would have more of a certain color high on the y-axis than other colors. For example, if more trips occurred on weekends, you would have more of the green colors (Saturday and Sunday) on top. However, no such effect seems to exist. I guess people are enjoying Citibike every day of the week, or casual riders on the weekends are roughly making up for weekday commuting riders.

## Monday, November 25, 2013

### Highest property taxes in America?

I read on CNN's Money website today that Westchester County, NY has the highest property taxes in America (see Nov 25 Money website). Moreover, the New York area in general seems to have the highest taxes. That surprises me, because, as an owner of a co-op in Brooklyn, I know that my property taxes, and property taxes in general in the city, are extremely low.

So what's the problem? If you click on the "interactive graph" you find that you can display results in two ways. The headline and accompanying map refer to taxes in dollars. This type of information is little more than a graph of housing prices in the US, because expensive houses have higher taxes than cheaper houses. Sure, tax rate comes into play, but the owners of a $10 million mansion in a low-tax district still generally pay more property taxes per year than the owners of a $200,000 house in a high-tax district.

Here's an example. Click on Brooklyn on the interactive map and you will see taxes of $3,050. Click on Richland County, South Carolina (where my parents live), and you will see that taxes average $1,129, a little more than one-third of the "high" taxes of Brooklyn. Yet this ignores the fact that housing prices are much higher in Brooklyn.

How much higher? Well, to see this, go to the interactive map that shows taxes as a percentage of home prices. This map accounts for different housing costs and shows taxes in the familiar manner, as a rate. In this map, you can see that Brooklyn property taxes are 0.53% of housing prices and Richland County's are 0.75%. (By the way, Westchester County is 1.76%, which is high but certainly not the highest.)
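The two views of the same data can be compared with a little arithmetic. The sketch below uses only the four figures quoted above. The implied home values are a rough back-calculation (average tax divided by rate), which is an assumption about how the two maps relate, not a number from the article.

```python
# Figures quoted above from the two interactive maps.
taxes = {"Brooklyn": 3050, "Richland County": 1129}      # average annual tax, dollars
rates = {"Brooklyn": 0.0053, "Richland County": 0.0075}  # tax as a share of home price

dollar_ratio = taxes["Brooklyn"] / taxes["Richland County"]
rate_gap = 1 - rates["Brooklyn"] / rates["Richland County"]

print(f"In dollars, Brooklyn pays {dollar_ratio:.1f} times what Richland County pays")
print(f"As a rate, Brooklyn is {rate_gap:.0%} lower than Richland County")

# Rough implied average home value: dollars of tax divided by the rate (an assumption).
for county in taxes:
    print(f"{county}: implied home value about ${taxes[county] / rates[county]:,.0f}")
```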

Thus, while taxes on the map shown in the headline are nearly three times higher in Brooklyn than in Richland County, S.C., taxes are actually 30% *lower* in Brooklyn, when looked at as a percentage of home prices.

## Monday, August 12, 2013

### What are the chances of different "splits" in bridge?

**If you know how to play bridge, skip to the fourth paragraph!**

In bridge, 13 cards are dealt to each of 4 players (so all 52 cards are dealt). Players sitting across from each other are partners, so we can think of the two teams' positions as North and South and East and West on a compass. A process of "bidding" ensues, in which the team with the highest bid selects a "trump" suit and a number of rounds, or "tricks," that they have contracted to take.

Suppose North-South had the highest bid and North is playing the hand. Then East "leads" a card, meaning East places a card (any card he/she wants) face up on the table. The play goes clockwise, East -> South -> West -> North. South, West and North must play a card of the same suit that East played, if they have one. When four cards are down, the highest one wins the "trick," and that winner puts down any card of his/hers to begin a new trick. Play continues until 13 rounds of 4 cards each have been played.

Suppose that West wins a trick and thus gets to play a card. He plays the Ace of Hearts. North, who is next and otherwise required to play hearts, is out of hearts. North can play any other suit, but if he chooses to play the "trump" suit (say Spades are trump), then he automatically wins the trick unless East or South is also out of hearts and plays a higher Spade (the trump suit). In other words, trumps are very valuable. In the bidding process, the teams try to bid in such a way that the trump suit is one in which they have a lot of cards. Generally, the team with the winning bid (the "contract") will have at least 7 of the 13 trumps between the two of them, meaning the other team will have 6 or fewer. Whatever number the opponents have, it is generally advantageous to the contract winners if the opponents hold the same number each rather than having them skewed toward one opponent or the other.

*Bridge players begin here:* So here is the probability piece. Suppose you and your partner hold 7 trumps between you. What are the chances the opponents each have 3? Have 4 and 2? Have 5 and 1? Have 6 and 0? To solve this sort of problem, we use combinations. See my earlier post for some detail (and more odds of bridge hands).

The opponents have 26 cards altogether, and we want to know the number of different groups of six among those 26 cards. Think of this as picking six cards from the 26. You have 26 choices for the first card, 25 for the second, and so on, and thus there are 26*25*24*23*22*21 total 'permutations' of size 6. However, we do not care what order the cards are in. Any given set of six cards can be arranged in 6*5*4*3*2*1 different orders (6 possible positions for the first card, 5 for the second, etc.), so we divide the permutations by 6*5*4*3*2*1 to get the number of unique sets when order does not matter. Again, see my earlier post for a more detailed explanation of this concept.
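The counting argument above can be checked directly. This is a small sketch in Python, whose standard-library `math.comb` computes the same "n choose k" quantity:

```python
import math

permutations = 26 * 25 * 24 * 23 * 22 * 21  # ordered ways to pick 6 of 26 cards
orderings = math.factorial(6)                # orders in which any 6 cards can appear
combinations = permutations // orderings     # unordered sets of 6 cards

print(permutations, orderings, combinations)
print(combinations == math.comb(26, 6))
```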

The R language calculates this combination of 6 out of 26 with the command choose(26,6). This is the denominator when we calculate probabilities, because it gives the total number of equally likely combinations of 6 cards. The numerator is split between the two bridge hands of 13 cards each. The number of combinations with an even 3-3 split is "13 choose 3" for each hand, multiplied together.

To calculate that probability in R, we write choose(13,3)*choose(13,3)/choose(26,6) and get 35.5%.

How about hands with a 4-2 split? That is the number of ways Opponent 1's hand can have 4 trumps multiplied by the number of ways Opponent 2's hand can have 2 trumps, PLUS the number of ways Opponent 2's hand can have 4 trumps multiplied by the number of ways Opponent 1's hand can have 2, all divided by the same denominator. Since the two cases are equally likely, we can just double the first one. We get choose(13,4)*choose(13,2)*2/choose(26,6) = 48.4% as the chance of one opponent having 4 trumps and the other having 2.

Continuing this calculation, we get the following chances for hands with 6 trumps in the opponents' hands (6 trumps "out"):

3-3 split: 35.5%

4-2 split: 48.4%

5-1 split: 14.5%

6-0 split: 1.5%

For hands with 5 trumps out, we get:

3-2 split: 67.8%

4-1 split: 28.3%

5-0 split: 3.9%

For hands with 4 trumps out:

2-2 split: 40.7%

3-1 split: 49.7%

4-0 split: 9.6%

For hands with 3 trumps out:

2-1 split: 78%

3-0 split: 22%

For hands with 2 trumps out:

1-1 split: 52%

2-0 split: 48%

I find it interesting that the even split (for 2, 4, or 6 trumps out) is only the most likely scenario when 2 trumps are out. When 4 trumps are out, a 3-1 split is more likely. When 6 are out, a 4-2 split is more likely.
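The whole list of splits can be reproduced in a few lines. This sketch uses Python, where `math.comb` plays the same role as R's `choose`; the function name `split_probability` is made up for illustration.

```python
import math


def split_probability(trumps_out, larger, smaller):
    """Chance the opponents' trumps divide larger-smaller between their two 13-card hands."""
    if larger + smaller != trumps_out:
        raise ValueError("the two holdings must add up to the trumps that are out")
    ways = math.comb(13, larger) * math.comb(13, smaller)
    if larger != smaller:
        ways *= 2  # either opponent can be the one holding the longer trumps
    return ways / math.comb(26, trumps_out)


for trumps_out in range(6, 1, -1):
    print(f"{trumps_out} trumps out:")
    for smaller in range(trumps_out // 2, -1, -1):
        larger = trumps_out - smaller
        p = split_probability(trumps_out, larger, smaller)
        print(f"  {larger}-{smaller} split: {p:.1%}")
```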

## Monday, April 29, 2013

### Simpson's Paradox

A North Slope real estate broker (named North) is trying to convince you that North Slope is a more affluent neighborhood than South Slope. To prove it, he explains that professionals in North Slope earn a median income of $150,000, versus only $100,000 in South Slope. Working class folks fare better in North Slope also, with hourly workers making $30,000 a year to South Slope's $25,000.

The South Slope real estate broker (named South) explains that North is crazy. South Slope is much more affluent. The median income in South Slope is $80,000 versus the North Slope median of $40,000.

Question: Who is lying, North or South?

Answer: It could be neither.

Consider the breakdown of income shown below.

We can see that North is not lying. Half the hourly South Slope workers earn $20K and half $30K, for a median of $25K. A similar calculation for the North Slope workers yields an hourly median of $30K. For professionals in the South Slope, the median is $100K, with half earning $80K and half earning $120K. In the North Slope, a similar calculation yields the median of $150,000.

South is not lying either. For the South Slope, the median is $80,000, since more than half of the workers make $80,000 or less and more than half make $80,000 or more (according to the definition of median, at least half must be at or above the median and at least half must be at or below it). For the North Slope, the median is $40,000.

What happened here? The reason for the conflict between the wages by type of work and the overall wages is that the percentage of residents in each category does not match. Though both professionals and hourly workers make more in the North Slope, there are far more hourly workers in the North Slope than in the South Slope. Thus, the overall median (or mean) income is lower in the North Slope.
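The reversal is easiest to follow with numbers in hand, so here is a small Python sketch. The head counts and the North Slope pay levels are hypothetical, chosen only so that the medians match the ones quoted above; they are not the original breakdown.

```python
from statistics import median

# Incomes in $K. Counts are hypothetical, chosen to reproduce the quoted medians.
south = {
    "hourly": [20] * 10 + [30] * 10,
    "professional": [80] * 40 + [120] * 40,
}
north = {
    "hourly": [20] * 40 + [40] * 40,
    "professional": [100] * 10 + [200] * 10,
}

for name, area in (("South Slope", south), ("North Slope", north)):
    everyone = area["hourly"] + area["professional"]
    print(
        f"{name}: hourly median {median(area['hourly'])}K, "
        f"professional median {median(area['professional'])}K, "
        f"overall median {median(everyone)}K"
    )
```

With these made-up counts, 80% of South Slope residents are professionals against 20% in the North Slope, which is what lets the overall medians point the opposite way from both within-group medians.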

While Wikipedia has an entry for Simpson's Paradox, a specific example of which I described above, it seems that most people are unaware of it. My motivation for writing about it is not the made-up example I present above but the fact that I encounter it so much in my everyday work. I either make my clients very happy by explaining that the 'bad' effect they have found may well be spurious, or anger them when I explain that the interesting relationship they have found is a mere statistical anomaly.

## Monday, November 12, 2012

### The Worst Graph

One reason for quotes like "there are lies, damn lies, and statistics" is graphs like this:

This was on the front of money.com this morning with the caption: "Huge US Oil Boom ahead: The U.S. will overtake Saudi Arabia to become the world's biggest oil producer before 2020."

I was shocked at first glance, because I thought oil production was going to go up 10 or 20-fold from the tiny amount in 2011 to the huge amount in 2015. That does indeed sound huge. Then I looked at the left y-axis, where I can see it is only going from 8 million to 10 million barrels a day, an increase of about 25%.

Fine, you say, but you can still easily see that the light blue bar is above the dark blue bar starting in 2025, showing the US overtakes Saudi Arabia.

I'm afraid not. The two bars are not Saudi Arabia versus US production but oil versus gas production, and it is not even clear whose production is depicted. Is it the whole world, the US, Saudi Arabia? The article puts US production at 5.8 million barrels a day in 2011, so it appears not to be US production, but other sources put it at closer to 9 million, so maybe it is the US.

Ok, you say, despite the poor caption, at least you can clearly see that gas production begins to top oil production (in whatever country the graph is depicting) around 2025.

Not really. Since oil is in millions of barrels per day and gas is in billions of cubic meters (per day, per month, per year, who knows?), you cannot actually tell. The year 2030 shows oil at about 10 million barrels per day and gas at nearly 800 billion cubic meters. Which is more? Maybe the readers of Money can quickly translate these figures into BTUs or some useful measure of production output, but I sure can't tell you.

Fine, you say, but since they start at about the same level, we at least know that gas increases more than oil over the time period.

Sorry, even that is incorrect. Look at the scale on the left axis (oil), which starts at 8 and goes to 12, a 50% increase. The right axis starts at 600 and goes to 800, a 33% increase. Thus, oil goes from 8 to just over 10 (more than a 25% increase) while gas goes from a little over 600 to just a little under 800 (less than a 33% increase--maybe a little more than oil but maybe not).

The only thing that appears to be correct about this graph is the year axis, until you realize that the first period covers only four years (2011-2015) while the other periods cover five years each.
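The percentages in this comparison are easy to recompute. A minimal sketch, using the approximate values read off the graph's axes (the function name is invented for illustration):

```python
def pct_increase(start, end):
    """Percentage increase from start to end."""
    return (end - start) / start * 100


print(f"Left axis range (oil), 8 to 12:      {pct_increase(8, 12):.1f}%")
print(f"Right axis range (gas), 600 to 800:  {pct_increase(600, 800):.1f}%")
print(f"Oil bars, 8 to about 10:             {pct_increase(8, 10):.1f}%")
```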
