New York University (NYU) - 21DOCS Test Area

http://www.nyu.edu/


Determining Factors that Affect a Restaurant’s Yelp Rating

Kevin Han

and 3 more

December 12, 2016

KEYWORDS: Logistic regression, principal component analysis, lasso regression, hot spot analysis, kernel density

NYC Subways Safety Study

Sunny Kulkarni

November 28, 2016

PUI2016 Extra Credit Project Proposal

PUI2016 Extra Credit Project Proposal <yc2839>

Yue Cai

November 28, 2016

Problem Description: Why is New York City’s mental health service indispensable? This question is explored using NYC open data. There are currently 836 mental health facilities across the five boroughs of NYC; Manhattan has the most, with 284, and Staten Island has the fewest, with 57. First, the 311 complaint data will be analyzed to determine whether the number of facilities is associated with the total number of complaints, the population density, or the economic status of each borough. In other words, the investigation aims to determine which factors make people complain the most, and whether those factors are potentially related to mental stress. For this part, linear regressions will help identify the correlations between those factors. Second, the NYC leading-cause-of-death data indicates that from 2007 to 2011 mental health problems were the fourth-highest risk factor for death in NYC. Finally, if the data sets work well and reveal correlations among the potential factors, we can confirm that the need for NYC's mental health service is supported by the data.

An Analysis of the NYPD Public Indicators Dataset

Patrick Mwangi

November 28, 2016

Github: PatrickGitundu
NYUID: pgm293
Problem Description: In public safety, delays in responding to crimes as they happen, even by a minute, could mean the difference between life and death. Despite a decline in the number of “crime in progress” calls over the last three years, NYPD response times to crimes in progress and to non-critical calls are all up, according to the 2016 Mayor’s Management Report released in September of this year. Understanding the dynamics of response times could be invaluable. The NYPD releases a data set of the major felonies committed in the city per precinct as well as the response times for those crimes.
Goals: By analyzing this data set of crimes by precinct and response time, can we:
1. Predict future response times by precinct and crime? What will be the response time to crimes in the future?
2. Identify trends, events or periodicity in the response times?
3. Identify the ‘slowest’ precincts?
4. Spatially visualize the response times for crimes in progress using a ‘responsiveness’ scale for each precinct (zip code might be possible as well)?
Data: The data to be used is the NYPD Public Indicators dataset hosted on the New York City open data platform. The dataset contains response times for crimes in progress. The crimes recorded are split according to the 7 major felonies as well as by precinct.
Analysis: I will look to predict future response times using regression. I will also carry out several time-series analyses with the data: I will look for trends by smoothing, look for events through thresholds, and search for periodicity using ARMA. In addition, I will use spatial autocorrelation through LISA to create a spatial ‘heat map’ of the average response times as seen geographically by precinct service area boundaries and zip code. This heat map will show the responsiveness of precincts on a scale from 1 to 10.
Deliverable: As a deliverable, I would like to produce a report showing my statistical and spatial conclusions about the response times and crimes as handled by precinct and time. This could be useful in identifying slower precincts and helping the city administration better allocate resources towards bringing down the response time to crimes in progress.
References: Anthony Shorris, Mindy Tarlow. The Mayor’s Management Report – September 2016.
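The smoothing and threshold steps named in the Analysis paragraph can be sketched as follows. This is only an illustration under stated assumptions: the function names, the trailing three-month window, the two-sigma rule, and the sample series are all hypothetical and do not come from the proposal or the NYPD dataset.

```python
from statistics import mean, stdev


def rolling_mean(values, window):
    """Trailing moving average; returns len(values) - window + 1 points."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [mean(values[i:i + window]) for i in range(len(values) - window + 1)]


def threshold_events(values, n_sigma=2.0):
    """Indices whose value lies more than n_sigma standard deviations from the mean."""
    mu, sigma = mean(values), stdev(values)
    return [i for i, v in enumerate(values) if abs(v - mu) > n_sigma * sigma]


# Hypothetical monthly average response times (minutes) for one precinct.
response_minutes = [9.1, 9.3, 9.0, 9.4, 9.2, 14.8, 9.5, 9.6, 9.4, 9.7, 9.8, 9.9]

trend = rolling_mean(response_minutes, window=3)          # smoothed trend
events = threshold_events(response_minutes, n_sigma=2.0)  # candidate "events"
```

A wider window smooths more aggressively at the cost of a shorter output series; the threshold multiplier controls how unusual a month must be before it is flagged.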

dlk253 PUI2016 Extra Credit Proposal

danak

and 1 more

November 28, 2016

UPDATE: This proposal was updated to reflect the results seen in the report located here. After speaking with Prof. Fedhere, the analysis shifted from histograms to plotting geographic features onto the raster image. The tiling of the images and hillside were completed as originally proposed.
PUI2016 Extra Credit Project Proposal
Anthropogenic Impacts Found in the Bedrock Layer
<Dana Karwas, dlk253, dlk253>
Problem Description: Coastal urban ecosystems are under the constant pressure of natural and man-made forces. How much is the bedrock layer affected by urban coastal ecosystems? By identifying patterns at the bedrock layer, is it possible to identify urban coastal areas through their bedrock profile? Can an algorithm to measure anthropogenic impacts on urban coastal ecosystems be established by applying visual synthesis and analysis techniques to bedrock models? What can the bedrock tell us about the current state of our coastal cities? Can a metric for human impact be established by looking at the shape and topographic details of the bedrock?
Data: The dataset that is available and suitable is the Earth2014 arcmin global topography and relief models from Curtin University. The data includes a global bedrock-only layer, which is what I would like to start with. It is available as gridded data and as degree-10,800 spherical harmonics. The bedrock (BED) layer includes Earth's relief without water and ice masses. This data was found in the paper linked below, with its accompanying data gateway.
Paper: http://ddfe.curtin.edu.au/models/Earth2014/Hirt_Rexer2015_Earth2014.pdf
Data Gateway: http://ddfe.curtin.edu.au/models/Earth2014/
Bedrock Layer: http://ddfe.curtin.edu.au/models/Earth2014/Earth2014_visualisation_Antarctica.jpg
Data Source: Western Australian Center for Geodesy, Curtin University Perth
Data Contact: [email protected]
This data is suitable for my questions because it has an isolated bedrock layer for the entire globe. The analysis will be made on three coastal urban cities in the US (New York City, Los Angeles, and New Orleans). I will have to pay close attention to land and ocean stitching and may need to find additional data to fill gaps in resolution. I will also need to pay close attention to the coordinate system transformation for my datasets. I will look for topographic anomalies by using image processing techniques on the shape files. I will establish search criteria through image processing (histogram matching/analysis) for "man-made" interventions, ultimately leading to machine learning (this is very ambitious, and I would be happy if I could just begin to compare a few histograms).
Other data of interest:
Earth
1- ETOPO1 (1 arc minute) http://www.ngdc.noaa.gov/mgg/global/global.html
2- SRTM30_PLUS (0.5 arc minute ~ 900 meters) and SRTM15_PLUS (0.25 arc minute ~ 450 meters) http://topex.ucsd.edu/WWW_html/srtm30_plus.html
Mars
The MOLA Mission Experiment Gridded Data Records (MEGDRs) are global topographic maps of Mars http://pds-geosciences.wustl.edu/missions/mgs/megdr.html
Analysis: Image processing techniques such as histogram matching could be used as a way to compare the datasets. Finding patterns in the histograms would be one way to begin identifying the impacts.
References
Technical References:
http://www.machinalis.com/blog/python-for-geospatial-data-processing/
https://en.wikipedia.org/wiki/Histogram_matching
http://geospatialpython.com/
https://github.com/GeospatialPython/pyshp
https://code.google.com/archive/p/pyshp/wikis/CreatePRJfiles.wiki
Theoretical References:
http://press.uchicago.edu/ucp/books/book/chicago/S/bo18295743.html
http://www.nyu.edu/classes/bkg/methods/daston.pdf
Deliverable: The expected deliverable would be an algorithm that can stitch together topographic bedrock data of ANY planetary body, such as Mars (http://astrogeology.usgs.gov/search/map/Mars/GlobalSurveyor/MOLA/Mars_MGS_MOLA_DEM_mosaic_global_463m), and search for human impact in that dataset. This algorithm can be used by agencies and students to search for "unnatural impacts". This will be interesting, I think, when the impact includes errors from the sensing device, such as those discussed in the Mars MOLA dataset, and impacts created by human (or other) intervention.

PUI2016 Urban Informatics Class Project Proposal

Ozgur L. Akkas

and 1 more

November 27, 2016


Data Check into Climate Change - A Brief View

Achilles Edwin Alfred Saxby

November 26, 2016

Profile - Achilles_Saxby; GitHub - achillessaxby; NYU - aes807

PUI Extra Credit Project

Le Xu

November 22, 2016

Author: Le Xu
BEST TIME TO DRIVE A CAB? WHEN ARE THE LONGER TRIPS AND HIGHER AVERAGE FARES LIKELY TO HAPPEN?
Problem Description: It is generally known that Friday and Saturday nights are the busiest periods of the week, and that bad weather can also drive up taxi demand. After the CUSP hack-day working on the NYC Taxi dataset, I would like to study taxi fares further by analyzing them against trip times and daily weather. In my data analysis, I will answer questions such as: What time of day has a better chance of longer trips or a higher average trip fare? How does bad weather (rain/snow) affect taxi ridership?
Data (ready for analysis):
Taxi trip data. Source: NYC Taxi & Limousine Commission - NYC.gov. Since there will be time series analysis, it matters that the taxi data from NYC.gov includes the pick-up and drop-off time of each trip, as well as the total fare and the total tip amount. The anticipated data processing will mainly consist of binning the timestamps to discover the trend/periodicity of taxi income during the day/week/month.
Daily Central Park weather data. Source: National Climatic Data Center, https://www.ncdc.noaa.gov/cdo-web/datasets/GHCND/stations/GHCND:USW00094728/detail. This dataset provides the weather data that I need.
*Since the taxi data is very large, I might consider taking a subset of the dataset, although I have not decided yet.
Analysis Plan:
Idea 1: At night more people go out or dine out, and trains run slower or less often, so the night has more long trips than the day.
NULL HYPOTHESIS: THE RATIO OF THE NUMBER OF LONGER TAXI TRIPS TO THE TOTAL NUMBER OF TRIPS DURING THE DAYTIME IS THE SAME AS OR HIGHER THAN THE RATIO OF THE NUMBER OF LONGER TAXI TRIPS TO THE TOTAL NUMBER OF TRIPS DURING THE NIGHTTIME.
Range of hours for this analysis: day time (from 6:00 to 18:00), night time (from 18:00 to 6:00).
$$H_0 : \frac{\text{Long Day Trips}}{\text{Total Day Trips}} \geq \frac{\text{Long Night Trips}}{\text{Total Night Trips}}$$
$$H_1 : \frac{\text{Long Day Trips}}{\text{Total Day Trips}} < \frac{\text{Long Night Trips}}{\text{Total Night Trips}}$$
I will use a significance level $$\alpha=0.05$$
I will probably use one month of data from the winter and one month from the summer.
Idea 2: How does the weather affect taxi demand? A rainy day could either drive up demand, because people might not have umbrellas, or depress it, because fewer people want to go out in the rain. For the analysis of taxi rides against weather conditions, extreme events will be identified through outlier detection via thresholds. Clustering/correlation analysis could be performed to examine the relationship between taxi demand and weather.
Analysis Tools: Pandas, matplotlib
References:
Analyzing 1.1 Billion NYC Taxi and Uber Trips, with a Vengeance - Todd W. Schneider http://toddwschneider.com/posts/analyzing-1-1-billion-nyc-taxi-and-uber-trips-with-a-vengeance/
Visual Exploration of Big Spatio-Temporal Urban Data: A Study of New York City Taxi Trips - Nivan Ferreira, Jorge Poco, Huy T. Vo, Juliana Freire, and Claudio T. Silva
NYC Taxi Trip and Fare Data Analytics using BigData - Umang Patel
Deliverable:
Idea 1: The result of the statistical analysis, with at least two supporting figures.
Idea 2: The result of the clustering/correlation analysis.
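Idea 1 amounts to binning pick-up timestamps into day and night and then comparing two proportions. The sketch below uses only the Python standard library rather than the pandas workflow the proposal names, and it rests on several assumptions: the proposal does not define a "longer" trip, so the 5-mile cut-off, the function names, and the pooled two-proportion z-test are all illustrative choices.

```python
from datetime import datetime
from math import sqrt
from statistics import NormalDist

LONG_TRIP_MILES = 5.0  # assumed cut-off; the proposal does not define "longer"


def is_daytime(pickup):
    """Daytime is 6:00 to 18:00, the range given in the proposal."""
    return 6 <= pickup.hour < 18


def long_trip_counts(trips):
    """trips: iterable of (pickup datetime, distance in miles).

    Returns (long_day, total_day, long_night, total_night).
    """
    long_day = total_day = long_night = total_night = 0
    for pickup, miles in trips:
        if is_daytime(pickup):
            total_day += 1
            long_day += miles > LONG_TRIP_MILES
        else:
            total_night += 1
            long_night += miles > LONG_TRIP_MILES
    return long_day, total_day, long_night, total_night


def one_sided_z_test(long_day, total_day, long_night, total_night):
    """Pooled two-proportion z-test of H1: night long-trip share > day share."""
    p_day = long_day / total_day
    p_night = long_night / total_night
    pooled = (long_day + long_night) / (total_day + total_night)
    se = sqrt(pooled * (1 - pooled) * (1 / total_day + 1 / total_night))
    z = (p_night - p_day) / se
    p_value = 1 - NormalDist().cdf(z)  # upper-tail probability
    return z, p_value


# Hypothetical counts: 100 of 1000 day trips and 150 of 1000 night trips are long.
z, p_value = one_sided_z_test(100, 1000, 150, 1000)
reject_null = p_value < 0.05
```

With real data, the tuple returned by `long_trip_counts` can be unpacked directly into `one_sided_z_test`.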

Urban Environment and Perceived Safety of NYC

Achilles Edwin Alfred Saxby

and 4 more

November 14, 2016

Abstract: Urban environments are growing more complex by the day. A person's feeling of safety when interacting with these environments depends on a multitude of elements. Some of these elements are tangible and measurable, whereas others are intangible. Social science literature has shown a strong connection between the visual appearance of a city and its streets and the behavior and health of its citizens [1]. Many aspects of a neighborhood's appearance can be considered when assessing its impact on the health and behavior of the individuals who inhabit it.
Visual perception is one of the major elements that help us understand the actual safety of a neighborhood. MIT Media Lab has tried to quantify safety based on the visual perception of neighborhoods. Beyond visual perception, there are characteristics of a neighborhood that are not captured in an image frozen in time. In this study we explore different physical, socio-economic and incidental factors to see how closely they relate to the visual perception of safety. For example, consider a tourist entering a city for the first time. This tourist may perceive the city to be the best in the world, yet that perception can be skewed when considering safety in various areas around the city. This is because cities are usually heterogeneous, and often unequal with respect to safety, demographics, cleanliness of the neighborhoods, the beauty of their architecture and many other evaluative dimensions. These aspects, although subtle, contribute heavily to the perceived safety of the city as a whole or of its parts. The statement and example above raise some questions that need to be answered with respect to safety and its perception: What does the perception of safety tell us about different streets, neighborhoods or cities? Does the quality of the surroundings depend on the architecture, the width of the street or the amount of greenery? Is it possible that there is another, unknown and semi-subconscious factor at play in our perceptions of safety?

REPORT ON DETROIT LAND BANK AUTHORITY OCCUPANCY MODEL

Geoff Perrin

November 13, 2016

FELLOWSHIP OVERVIEW I was hired as a Bloomberg Associates Fellow to work on the Detroit Land Bank Authority (DLBA) Inventory team beginning in July of 2016 in order to build a statistical “occupancy model” that would predict the occupancy of every residential non-lot parcel in the city of Detroit.

Exploring the Citibike trip duration differences between single-time customers and su...

Jie Zhou

October 20, 2016

Abstract: This report is based on my PUI Homework 3, Assignment 2, a Citibike analysis using Python. The goal is to explore the difference in Citibike trip duration between one-time customers and subscribers in terms of the MTP (Mean Trip Duration). The idea is to show that the average trip duration of single-time customers is greater than that of subscribers, and to conclude from this that a single-time customer makes fuller use of a bike than a subscriber. A hypothesis is established below. Because the two samples are unequal, a two-sided t-test is then performed; the results support the hypothesis.
Keywords: CitiBike Data, Data Wrangling, Null Hypothesis, Alternative Hypothesis, Statistical Significance Level, Two-sided t-test
Hypothesis
Null Hypothesis H0: T(customer) <= T(subscriber). The mean trip duration of single-time customers over a week is less than or equal to the mean trip duration of subscribers over a week.
Alternative Hypothesis H1: T(customer) > T(subscriber). The mean trip duration of single-time customers over a week is greater than the mean trip duration of subscribers over a week.
Statistical Significance Level: α = 0.05. The significance level α is the threshold the p-value must fall below for the null hypothesis to be rejected.
Data Analysis: The data was collected from the CitiBike_Data_Website and covers the trip durations of both one-time customers and subscribers. First, the null and alternative hypotheses were established with a statistical significance level of 0.05; then the data was collected, tabulated, cleaned, and reshaped before being analyzed, plotted and visualized. The analysis uses pandas DataFrames in Python to compute the mean trip duration for Customers (one-time users) and Subscribers respectively. The figures are plotted with Matplotlib. Because a t-test is appropriate for testing the difference between two samples when the variances of the two normal distributions are unknown, as is the case here, the data is subjected to a two-sided t-test.
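The report does not say which form of the t-test was used. As a hedged illustration only, the sketch below computes Welch's t statistic, a common choice when two samples have unequal sizes and unknown variances, together with the Welch-Satterthwaite degrees of freedom. The function name and the sample durations are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Welch's t statistic and degrees of freedom for mean(a) - mean(b)."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard error contributed by each sample.
    se2_a = variance(sample_a) / n_a
    se2_b = variance(sample_b) / n_b
    t = (mean(sample_a) - mean(sample_b)) / sqrt(se2_a + se2_b)
    dof = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t, dof


# Hypothetical trip durations in minutes.
customer_minutes = [30, 35, 40, 45, 50]
subscriber_minutes = [10, 12, 14, 16, 18, 20]

t_stat, dof = welch_t(customer_minutes, subscriber_minutes)
```

The standard library has no Student's t distribution, so the p-value is left out here; a routine such as SciPy's `scipy.stats.ttest_ind` with `equal_var=False` returns both the statistic and the p-value.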

Citibike Analysis - Study on customers/subscribers ratio change during weekends

Adriano Yoshino

October 20, 2016

Abstract: This study investigates whether Customer riders make up a larger proportion of Citibike users on weekends, relative to Subscriber riders. If so, the finding could help the Citibike operations team improve the user experience and provide better infrastructure to riders through a dedicated weekend operation. The study used Usertype and Date data from March 2015, and the difference was assessed with a z-test.
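The quantity being compared is the Customer share of rides on weekends versus weekdays. A minimal sketch of that calculation is given below; the function names and the sample rides are illustrative, and the user type labels "Customer" and "Subscriber" are assumed to match the Usertype column.

```python
from datetime import date


def is_weekend(day):
    """Saturday and Sunday have weekday() values 5 and 6."""
    return day.weekday() >= 5


def customer_share(rides):
    """rides: iterable of (date, usertype). Returns the Customer share per day type."""
    counts = {"weekend": [0, 0], "weekday": [0, 0]}  # [customers, total rides]
    for day, usertype in rides:
        key = "weekend" if is_weekend(day) else "weekday"
        counts[key][0] += usertype == "Customer"
        counts[key][1] += 1
    return {key: (c / n if n else None) for key, (c, n) in counts.items()}


# Hypothetical rides from March 2015 (March 1 was a Sunday, March 7 a Saturday).
rides = [
    (date(2015, 3, 1), "Customer"),
    (date(2015, 3, 1), "Subscriber"),
    (date(2015, 3, 2), "Subscriber"),
    (date(2015, 3, 3), "Subscriber"),
    (date(2015, 3, 4), "Customer"),
    (date(2015, 3, 7), "Customer"),
]
shares = customer_share(rides)
```

The two shares and their ride counts are the inputs a z-test for proportions would then compare.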

Citibike Report

Xianbo Gao

October 19, 2016

Abstract: This report aims to find whether the mean trip duration of young people is longer than that of middle-aged people. A z-test is performed to test this hypothesis. The test indicates that young people very likely ride longer than middle-aged people.
Data: The data source is citibikedata, which contains information about people using Citibikes. The birth year and trip duration columns are used. To classify riders, people aged between 21 and 40 are defined as young and those aged between 40 and 59 are defined as middle-aged.
Analysis: A null hypothesis test was performed. The null hypothesis is: the total trip duration of people in the age range (20,40] is equal to or shorter than that of people in the age range (40,60] in the year 2015 in NYC (at 5% significance). The normalized total trip duration of 5 different age ranges is shown below: each age range's total trip duration divided by the total trip duration of all age ranges.
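The normalization described in the Analysis can be sketched as follows. The function names and sample rides are hypothetical, and only the two age ranges named in the null hypothesis are used as bin edges; the report's five ranges are not listed, so they are not reproduced here.

```python
def age_bin(age, edges=(20, 40, 60)):
    """Return the half-open bin (lo, hi] containing age, or None if outside all bins."""
    for lo, hi in zip(edges, edges[1:]):
        if lo < age <= hi:
            return (lo, hi)
    return None


def normalized_duration_by_age(rides, year=2015, edges=(20, 40, 60)):
    """rides: iterable of (birth_year, trip_duration_seconds).

    Returns each bin's total duration divided by the total over all bins.
    """
    totals = {}
    for birth_year, duration in rides:
        bin_ = age_bin(year - birth_year, edges)
        if bin_ is not None:
            totals[bin_] = totals.get(bin_, 0) + duration
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {bin_: total / grand_total for bin_, total in totals.items()}


# Hypothetical rides: ages 25, 30, 45 and 75 in 2015 (the last falls outside the bins).
rides = [(1990, 600), (1985, 900), (1970, 500), (1940, 1200)]
shares = normalized_duration_by_age(rides)
```

Passing a longer `edges` tuple yields more age ranges without changing the rest of the code.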

Difference in biking times between male and female

Trang Dam

October 19, 2016

Abstract: This project examines whether male and female bikers differ in how they ride during the day and at night. More specifically, my initial hypothesis is that men are more likely than women to bike at night because of safety concerns: women are more sensitive than men to the potential safety risks of traveling at night.
Data: I use the information on bikers in February 2015 as the sample for my study. The time a rider started biking determines the time of day assigned to the trip. I divided my sample into two groups, men and women, based on the gender information. Day and night are categorized as follows:
- Day time: from 7am to 7pm
- Night time: from 7pm to 7am
Based on the available data, I calculated the normalized ratios of bikers in the day and at night for each gender, for use in illustrative graphs and statistical analysis.
Analysis
Data Overview: The graphs below show that there may be some differences between men and women in terms of the hours they are most likely to bike. In figure 1, the fractions of men riding after 6pm and before 8am are higher than those of women. In figure 2, which shows the fraction of each gender by day and night, the fraction of female riders is higher than that of male riders during the day and lower at night.

citibike mini-project

Le Xu

October 18, 2016

Author: Chunqing Xu, Le Xu, Jianghao Zhu
Abstract: This is an analysis of Citibike data. The idea is to see whether age affects biking behavior. Specifically, the analysis asks whether younger women are more likely than younger men to choose to ride a bike. The data analysis and results follow.

HW6_Assignment_2

Sokratis Papadopoulos

October 17, 2016

Data-Driven Inference of CitiBike Data: An Analysis of Trip Distance by Rider Age

Claire Xueqi Huang

and 2 more

October 17, 2016

ABSTRACT
The objective of this analysis was to perform a data-driven analysis of CitiBike trip data in New York City using statistical testing in Python. Using CitiBike data from June 2016, the relationship between rider age and trip distance was explored. Specifically, the ratio of long-distance trips to all trips among young riders was compared to that of all riders. Younger riders typically have more energy and strength, which should translate into the ability to ride farther than riders overall. The analysis did not find a significant difference between young riders and all riders, and therefore the null hypothesis could not be rejected.
DATA
The data for this analysis was obtained from the CUSP Data Facility at New York University. The data was subset to only the fields needed to calculate rider trip distance: Start Station Latitude, Start Station Longitude, End Station Latitude, End Station Longitude, and rider birth year. Next, geopy was used to calculate trip distance in miles between the stations. The data wrangling process is detailed in the linked ipython notebook.
ANALYSIS
Through preliminary data inspection, the team took an interest in long-distance trips among CitiBike riders. The team first defined null and alternative hypotheses:
Null Hypothesis: The long-distance trip ratio in young bikers is less than or equal to the long-distance trip ratio in all bikers.
$$H_0: L_y/A_y - L/A \leq 0$$
Alternative Hypothesis: The long-distance trip ratio in young bikers is greater than the long-distance trip ratio in all bikers.
$$H_a: L_y/A_y - L/A > 0$$
Significance level: $$\alpha = 0.05$$
Young riders were defined as millennials born after 1980, and long-distance trips were defined as trips greater than three miles from start to end station. In the exploratory phase of the analysis, the team reviewed the distribution of CitiBike ridership by birth year (Figure 1). The team then looked at the data pertaining only to long trips, greater than three miles in distance, for both age groups (Figure 2). The ratio of long trips for all riders was also explored, as shown in Figures 3 and 4. After the exploratory phase, the team began statistical testing. Following peer review, a Z-test was chosen to test the hypothesis; all riders were defined as the population and millennial riders as the sample to be tested. The ratios of each subgroup defined in the hypothesis were calculated and then tested.
RESULTS
Looking back at the data, although young riders account for a larger percentage of total trips (Figure 1), they have a relatively small long-distance trip ratio (Figure 2). Therefore, when the Z-test was performed, the p-value indicated that the null hypothesis could not be rejected at the 0.05 significance level. The results of this data analysis reveal that millennial CitiBike riders do not have a significantly higher ratio of long-distance trips, and therefore trip distance is not dictated by rider age.
Link to ipython notebook on GitHub: https://github.com/nmonarizqa/PUI2016_nm2773/tree/master/HW6_nm2773/HW6_Assignment2.ipynb
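The two computational steps above, station-to-station distance and a Z-test of a sample proportion against a population proportion, can be sketched as follows. This is not the team's code: the haversine formula stands in for the geopy calculation, the Earth radius constant is an approximation, and the function names and counts are hypothetical.

```python
from math import asin, cos, radians, sin, sqrt
from statistics import NormalDist

EARTH_RADIUS_MILES = 3958.8  # approximate mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (latitude, longitude) points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(a))


def proportion_z_test(long_young, all_young, long_all, all_all):
    """One-sided Z-test of H_a: L_y/A_y - L/A > 0, treating all riders as the population."""
    p_sample = long_young / all_young
    p_pop = long_all / all_all
    z = (p_sample - p_pop) / sqrt(p_pop * (1 - p_pop) / all_young)
    p_value = 1 - NormalDist().cdf(z)  # upper-tail probability
    return z, p_value


# Hypothetical counts: 40 of 1000 young-rider trips and 100 of 2000 trips overall are long.
z, p_value = proportion_z_test(40, 1000, 100, 2000)
reject_null = p_value < 0.05
```

A negative z, as in this made-up example, means the young-rider ratio is below the overall ratio, so the one-sided null hypothesis cannot be rejected.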

PUI2016 _citibike_ Summary

Jonathan Toy

and 4 more

October 17, 2016

PUI2016 Citibike Project Summary
ABSTRACT: In this project we looked at whether, on average, older individuals (over 40 years old) used Citibikes for shorter trips than younger individuals (less than 40 years old). Using information on trip duration and rider age for the month of February 2015, we ran a Z-test for the proportions grouped by trip duration, yielding a statistic of 26.09. In this case we reject the null hypothesis and conclude that older individuals are more willing to take shorter trips.
DATA: We used the zip file on the Citibike website corresponding to the month of February 2015. The data can be downloaded here: https://s3.amazonaws.com/tripdata/201502-citibike-tripdata.zip The corresponding .csv file contained entries for the start and stop station location, trip duration, customer type, birth year and gender of each rider during the month. We extracted age by subtracting the birth year of subscribers from the then-current year, 2015, and dropped all fields except trip duration and age. We split the pandas dataframe into those over and under 40 to create 2 samples. Then we divided trip duration into two categories, short trips (less than 10 mins) and long trips (more than 10 mins) (see Figure 1). Finally, we normalized the distribution (see Figure 2).
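The summary does not state which form of the Z-test for proportions was used. A common choice, shown here only as an assumption, is the pooled two-proportion statistic. The symbols are introduced for illustration: $$s_o$$ and $$n_o$$ are the number of short trips and the total number of trips for riders over 40, and $$s_y$$ and $$n_y$$ are the same counts for riders under 40.

$$\hat{p}_o = \frac{s_o}{n_o}, \qquad \hat{p}_y = \frac{s_y}{n_y}, \qquad \hat{p} = \frac{s_o + s_y}{n_o + n_y}$$

$$z = \frac{\hat{p}_o - \hat{p}_y}{\sqrt{\hat{p}\,(1 - \hat{p})\left(\frac{1}{n_o} + \frac{1}{n_y}\right)}}$$

For a one-sided test at $$\alpha = 0.05$$ the critical value is approximately 1.645, so a statistic of 26.09 lies far inside the rejection region.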