software engineering datasets

In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada, Samy Bengio, Hanna M. Wallach, Hugo Larochelle, Kristen Grauman, Nicolò Cesa-Bianchi, and Roman Garnett (Eds.). 153-164.
US Bureau of Labor Statistics.

# Outcomes are
# determined at the end of the semester through evaluation of student
# team work in two categories: software engineering process (how well
# the team applied best software engineering practices), and software
# engineering product (the quality of the finished product the team
# produced).

arXiv:1909.00934 http://arxiv.org/abs/1909.00934

#
# TAM FEATURE LIST
# ----------------
# year
# semester
# timeInterval
# teamNumber
# semesterId
# teamMemberCount
# femaleTeamMembersPercent
# teamLeadGender
# teamDistribution
# teamMemberResponseCount
# meetingHoursTotal
# meetingHoursAverage
# meetingHoursStandardDeviation
# inPersonMeetingHoursTotal
# inPersonMeetingHoursAverage
# inPersonMeetingHoursStandardDeviation
# nonCodingDeliverablesHoursTotal
# nonCodingDeliverablesHoursAverage
# nonCodingDeliverablesHoursStandardDeviation
# codingDeliverablesHoursTotal
# codingDeliverablesHoursAverage
# codingDeliverablesHoursStandardDeviation
# helpHoursTotal
# helpHoursAverage
# helpHoursStandardDeviation
# leadAdminHoursResponseCount
# leadAdminHoursTotal
# leadAdminHoursAverage
# leadAdminHoursStandardDeviation
# globalLeadAdminHoursResponseCount
# globalLeadAdminHoursTotal
# globalLeadAdminHoursAverage
# globalLeadAdminHoursStandardDeviation
# averageResponsesByWeek
# standardDeviationResponsesByWeek
# averageMeetingHoursTotalByWeek
# standardDeviationMeetingHoursTotalByWeek
# averageMeetingHoursAverageByWeek
# standardDeviationMeetingHoursAverageByWeek
# averageInPersonMeetingHoursTotalByWeek
# standardDeviationInPersonMeetingHoursTotalByWeek
# averageInPersonMeetingHoursAverageByWeek
# standardDeviationInPersonMeetingHoursAverageByWeek
# averageNonCodingDeliverablesHoursTotalByWeek
# standardDeviationNonCodingDeliverablesHoursTotalByWeek
# averageNonCodingDeliverablesHoursAverageByWeek
# standardDeviationNonCodingDeliverablesHoursAverageByWeek
# averageCodingDeliverablesHoursTotalByWeek
# standardDeviationCodingDeliverablesHoursTotalByWeek
# averageCodingDeliverablesHoursAverageByWeek
# standardDeviationCodingDeliverablesHoursAverageByWeek
# averageHelpHoursTotalByWeek
# standardDeviationHelpHoursTotalByWeek
# averageHelpHoursAverageByWeek
# standardDeviationHelpHoursAverageByWeek
# averageLeadAdminHoursResponseCountByWeek
# standardDeviationLeadAdminHoursResponseCountByWeek
# averageLeadAdminHoursTotalByWeek
# standardDeviationLeadAdminHoursTotalByWeek
# averageGlobalLeadAdminHoursResponseCountByWeek
# standardDeviationGlobalLeadAdminHoursResponseCountByWeek
# averageGlobalLeadAdminHoursTotalByWeek
# standardDeviationGlobalLeadAdminHoursTotalByWeek
# averageGlobalLeadAdminHoursAverageByWeek
# standardDeviationGlobalLeadAdminHoursAverageByWeek
# averageResponsesByStudent
# standardDeviationResponsesByStudent
# averageMeetingHoursTotalByStudent
# standardDeviationMeetingHoursTotalByStudent
# averageMeetingHoursAverageByStudent
# standardDeviationMeetingHoursAverageByStudent
# averageInPersonMeetingHoursTotalByStudent
# standardDeviationInPersonMeetingHoursTotalByStudent
# averageInPersonMeetingHoursAverageByStudent
# standardDeviationInPersonMeetingHoursAverageByStudent
# averageNonCodingDeliverablesHoursTotalByStudent
# standardDeviationNonCodingDeliverablesHoursTotalByStudent
# averageNonCodingDeliverablesHoursAverageByStudent
# standardDeviationNonCodingDeliverablesHoursAverageByStudent
# averageCodingDeliverablesHoursTotalByStudent
# standardDeviationCodingDeliverablesHoursTotalByStudent
# averageCodingDeliverablesHoursAverageByStudent
# standardDeviationCodingDeliverablesHoursAverageByStudent
# averageHelpHoursTotalByStudent
# standardDeviationHelpHoursTotalByStudent
# averageHelpHoursAverageByStudent
# standardDeviationHelpHoursAverageByStudent
# commitCount
# uniqueCommitMessageCount
# uniqueCommitMessagePercent
# commitMessageLengthTotal
# commitMessageLengthAverage
# commitMessageLengthStandardDeviation
# averageCommitCountByWeek
# standardDeviationCommitCountByWeek
# averageUniqueCommitMessageCountByWeek
# standardDeviationUniqueCommitMessageCountByWeek
# averageUniqueCommitMessagePercentByWeek
# standardDeviationUniqueCommitMessagePercentByWeek
# averageCommitMessageLengthTotalByWeek
# standardDeviationCommitMessageLengthTotalByWeek
# averageCommitCountByStudent
# standardDeviationCommitCountByStudent
# averageUniqueCommitMessageCountByStudent
# standardDeviationUniqueCommitMessageCountByStudent
# averageUniqueCommitMessagePercentByStudent
# standardDeviationUniqueCommitMessagePercentByStudent
# averageCommitMessageLengthTotalByStudent
# standardDeviationCommitMessageLengthTotalByStudent
# averageCommitMessageLengthAverageByStudent
# standardDeviationCommitMessageLengthAverageByStudent
# averageCommitMessageLengthStandardDeviationByStudent
# issueCount
# onTimeIssueCount
# lateIssueCount
# processLetterGrade
# productLetterGrade

D. Petkovic, M. Sosnick-Pérez, K. Okada, R. Todtenhoefer, S. Huang, N. Miglani, A. Vigil: Using the Random Forest Classifier to Assess and Predict Student Learning of Software Engineering Teamwork, Frontiers in Education FIE 2016, Erie, PA, 2016.

# By using this systematic approach, TAM feature names are
# produced that are human understandable and intuitive and related to
# aggregation method.

[2106.15209] Making the most of small Software Engineering datasets
[ICPC 2020] Zejun Zhang, Minxue Pan, Tian Zhang, Xinyu Zhou, and Xuandong Li.
In Proceedings of the 38th International Conference on Machine Learning, PMLR 139: 6471-6482, 2021.
[TSE 2020] Minxue Pan, Tongtong Xu, Yu Pei, Zhong Li, Tian Zhang, and Xuandong Li.
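The TAM feature names listed above appear to compose an optional outer aggregation (`average` or `standardDeviation`), a base measure, and an optional grouping suffix (`ByWeek` or `ByStudent`). The following is a minimal Python sketch under that reading of the naming convention; the helper `split_tam_feature` is hypothetical and not part of the dataset or its README.

```python
import re

# Assumed pattern, inferred from the feature list: an outer aggregation
# prefix and a grouping suffix wrap a base measure, e.g.
# "average" + "CommitCount" + "ByWeek".
_GROUPED = re.compile(r"^(average|standardDeviation)(.+)By(Week|Student)$")


def split_tam_feature(name):
    """Split a TAM feature name into (aggregation, base, grouping).

    Names without a By<Group> suffix are returned as (None, name, None).
    """
    match = _GROUPED.match(name)
    if match is None:
        return (None, name, None)
    aggregation, base, grouping = match.groups()
    # Lower-case the first letter so the base matches the ungrouped
    # feature name (e.g. "CommitCount" -> "commitCount").
    base = base[0].lower() + base[1:]
    return (aggregation, base, grouping)


if __name__ == "__main__":
    for feature in ("averageCommitCountByWeek",
                    "standardDeviationHelpHoursTotalByStudent",
                    "teamMemberCount"):
        print(feature, "->", split_tam_feature(feature))
```

Grouping features by their base measure this way can help when selecting or discarding whole families of aggregated columns before training.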
In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017.
Hanmin Qin and Xin Sun.
arXiv:2106.15853 https://arxiv.org/abs/2106.15853
Charles Bouveyron and Stéphane Girard.
[ESEC/FSE 2019] Yifei Lu, Minxue Pan, Juan Zhai, Tian Zhang, and Xuandong Li.
It can improve the test coverage by considering different preference option combinations and detect more preference-related bugs.
Pre-training
Distilling Transformers for Neural Cross-Domain Search
BERT_SE: A Pre-trained Language Representation Model for Software
Training deep neural-networks using a noise adaptation layer.

When you're browsing for job openings, especially in data science and technology, you'll likely see different roles that include the word engineer. It can be difficult to decipher the exact differences between the two roles from just reading job descriptions.

Robust log-based anomaly detection on unstable log data.
Post2Vec: Learning Distributed Representations of Stack Overflow Posts.
Class imbalance evolution and verification latency in just-in-time software defect prediction.
A Bug or a Suggestion?
Journal of Systems and Software, Volume 159, 2020, Article 110433.
Robust supervised classification with mixture models: Learning from data with uncertain labels.
In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015.
https://doi.org/10.1613/jair.953
Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey E. Hinton.
Experience: Quality Benchmarking of Datasets Used in Software Effort
Payscale.
In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d'Alché-Buc, Emily B.
A Comparative Study to Benchmark Cross-Project Defect Prediction Approaches.
Learning Imbalanced Datasets with Label-Distribution-Aware Margin Loss.
IEEE / ACM, 548-559.
2015.

Coding (programming languages such as SQL, Python, Java, R, and Scala)
ETL (extract, transform, and load) systems
Big data tools, such as Hadoop, MongoDB, and Kafka
Coding languages like Python, Java, C, C++, or Scala

experimental, or empirical software engineering
Software Engineering Artifacts Can Really Assist Future Tasks
Cryptocurrency GitHub Activity and Market Cap Dataset
MSR: Mining Software Repositories conference
PROMISE: Predictive Models and Data Analytics in Software Engineering conference
ACM Transactions on Software Engineering and Methodology (TOSEM)
ESEC/FSE: ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering
ICSE: International Conference on Software Engineering
IEEE Transactions on Software Engineering
SANER: IEEE International Conference on Software Analysis, Evolution and Reengineering
This list requires your input for its continuous improvement.

204-214.
These engineers operate at a broader level, building the infrastructure or platform that imports and stores the data for a website, app, or software.
Methodol.
Associate Professor
Using Class Imbalance Learning for Software Defect Prediction.
Outlets exclusively devoted to empirical software engineering research
Outlets that publish empirical software engineering research
Trupti M Kodinariya and Prashant R Makwana.
Here's a rough breakdown of degrees commonly held by data and software engineers:
Certifications can also help you break into data or software engineering.
https://pytorch.org/.
ACM, 807-817.
Gradient Descent with Early Stopping is Provably Robust to Label Noise for Overparameterized Neural Networks.
This research approach is often termed experimental, or empirical software engineering.
#
# The SE process and SE product outcomes represent ML training classes
# and are to be considered separately, e.g.

OpenReview.net.
US Software Engineer Jobs - Free Sample Dataset - ZenRows
Biometrics Bulletin 1, 6 (1945), 80-83.
IEEE Transactions on Software Engineering (2021), 11.
Understanding and Improving Early Stopping for Learning with Noisy Labels.
Robust Learning of Deep Predictive Models from Noisy and Imbalanced Software Engineering Datasets.
Yanming Yang, Xin Xia, David Lo, Tingting Bi, John Grundy, and Xiaohu Yang.
J. Artif.
This paper provides a starting point for Software Engineering (SE) researchers and practitioners faced with the problem of training machine learning models on small datasets.

Electrical Engineering, Mathematics and Computer Science
Source code of "An Improved Pareto Front Modeling Algorithm for Large-scale Many-Objective Optimization"
Generating Class-Level Integration Tests Using Call Site Information
PropR: Property-based Automatic Program Repair - Reproduction Package
CAPYBARA: Decompiled Binary Functions and Related Summaries
Classifying code comments in Java Mobile Applications
10.4121/UUID:97F5FC68-0C48-4EA6-B357-184F5B6809C9
10.4121/UUID:CB751E3E-3034-44A1-B0C1-B23128927DD8
Data underlying the Preliminary Evaluation of EvoCrash
10.4121/UUID:001BB128-0A55-4A8D-B3F5-E39BFC5795EA

https://doi.org/10.1007/11538059_91
Jiangfan Han, Ping Luo, and Xiaogang Wang.
What are best baseline models for different classes of predictive software models?
Pytorch.

# Please contact us at petkovic '@' sfsu.edu.

labeling data, in Software Engineering, there exist many small (< 1 000 samples)
9911), Bastian Leibe, Jiri Matas, Nicu Sebe, and Max Welling (Eds.).
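The README fragment above states that the SE process and SE product outcomes are separate training classes. As a minimal Python sketch of that separation, the hypothetical helper `split_outcomes` below pulls the two label columns (`processLetterGrade`, `productLetterGrade`, taken from the TAM feature list) out of each row, so that a model for one outcome never sees the other as an input. The sample rows and grade values are made up for illustration and are not taken from the dataset.

```python
def split_outcomes(rows):
    """Return (features, process_labels, product_labels) from TAM rows.

    Each row is a dict keyed by TAM feature name. Both outcome columns
    are removed from the feature dicts so that neither label leaks into
    the inputs of a model trained for the other outcome.
    """
    label_columns = ("processLetterGrade", "productLetterGrade")
    features, process_labels, product_labels = [], [], []
    for row in rows:
        features.append(
            {key: value for key, value in row.items() if key not in label_columns}
        )
        process_labels.append(row["processLetterGrade"])
        product_labels.append(row["productLetterGrade"])
    return features, process_labels, product_labels


if __name__ == "__main__":
    # Made-up rows for illustration only; values are NOT from the dataset.
    sample_rows = [
        {"teamMemberCount": "5", "commitCount": "120",
         "processLetterGrade": "A", "productLetterGrade": "F"},
        {"teamMemberCount": "4", "commitCount": "95",
         "processLetterGrade": "F", "productLetterGrade": "A"},
    ]
    features, process_labels, product_labels = split_outcomes(sample_rows)
    print(features)
    print(process_labels)
    print(product_labels)
```

Two independent classifiers can then be trained, one on `(features, process_labels)` and one on `(features, product_labels)`.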
The Effects of Existing Review Comments on Code Review
Continuous Integration and Delivery practices for Cyber-Physical systems: An interview-based study.

#
# Detailed information about the exact format of the .csv file may be
# found in the csv files themselves.

Q-testing is an automated testing tool for Android applications.
Geoffrey Hinton and Terrence J Sejnowski.
2020.
Do Developers Really Know How to Use Git Commands?
47, 8 (2021), 1559-1586.
ACM, 321-332.
https://openreview.net/forum?id=r1gRTCVFvB
Shyamgopal Karthik, Jérôme Revaud, and Chidlovskii Boris.
has proven effective on small-sized datasets, primarily thanks to pre-training,
We are working on complete datasets from a wide variety of countries.
[STVR 2021] Renhe Jiang, Zhengzhao Chen, Yu Pei, Minxue Pan, Tian Zhang, and Xuandong Li.
IEEE, 698-709.
July 21, 2022: One paper accepted to ASE 2022.
This is a well-known database for SE research data.
arXiv:2108.11096 https://arxiv.org/abs/2108.11096
Sunghun Kim, Hongyu Zhang, Rongxin Wu, and Liang Gong.
http://www.jstor.org/stable/3001968
Xiaoxue Wu, Wei Zheng, Xin Xia, and David Lo.

#
# It is left to the individual researcher to decide how to accommodate
# NULL values, and the data is included in this file.

[ISSTA 2020 (Distinguished Paper Award)] Minxue Pan, An Huang, Guoxin Wang, Tian Zhang, and Xuandong Li.
Data sets and data quality in software engineering. Gernot Armin Liebchen (Bournemouth University) and Martin John Shepperd (Brunel University London). OBJECTIVE - to assess.
With such different end-goals, data and software engineers spend their time collaborating with different teams within the company.
Software & Systems Modeling (2019), Volume 18, Issue 3, pp.
[JSS 2020] Zhengzhao Chen, Renhe Jiang, Zejun Zhang, Yu Pei, Minxue Pan, Tian Zhang and Xuandong Li.
309-326.
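The README fragments above leave the handling of NULL values to the researcher and say the format details live in the .csv files themselves. The Python sketch below shows one possible policy and should not be read as the dataset's prescribed procedure. It assumes that descriptive header lines in the files start with `#` and that missing cells hold the literal token `NULL`; both are assumptions, as are the helper names `read_tam_csv` and `drop_incomplete` and the made-up sample text.

```python
import csv
import io


def read_tam_csv(text, null_token="NULL"):
    """Parse TAM-style CSV text into dicts, mapping NULL cells to None.

    Lines starting with '#' are treated as comments (an assumption about
    the file layout); the first remaining line is used as the header.
    """
    data_lines = [
        line for line in text.splitlines()
        if line.strip() and not line.lstrip().startswith("#")
    ]
    reader = csv.DictReader(io.StringIO("\n".join(data_lines)))
    return [
        {key: (None if value == null_token else value) for key, value in row.items()}
        for row in reader
    ]


def drop_incomplete(rows, columns):
    """One possible policy: discard rows with a missing value in `columns`."""
    return [row for row in rows if all(row[col] is not None for col in columns)]


if __name__ == "__main__":
    # Hypothetical excerpt for illustration; not real data from the archive.
    sample_text = (
        "# descriptive comment line\n"
        "teamNumber,commitCount,helpHoursTotal\n"
        "1,120,NULL\n"
        "2,95,3\n"
    )
    rows = read_tam_csv(sample_text)
    print(rows)
    print(drop_incomplete(rows, ["helpHoursTotal"]))
```

Dropping rows is only one option. Imputing a per-column value, or using a learner that tolerates missing inputs, are alternatives that keep more of a small dataset.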
Moreover, there is a lack of research on the feature set that should be used in DP recognition.
Many of the data sets can also be useful in research using search-based software engineering methods.
September 3-7, 2018, Montpellier, France.
International Conference on.
It offers a variety of purposes and tools for building data pipelines and automation of programs.
Pattern Recognit.

#
#
# PRIVACY
# -------
# The data contained in this file does not contain any information
# which may be individually traced to a particular student who
# participated in the study.

There are two reasons why.
8536-8546.
IEEE Computer Society, 99-108.
https://doi.org/10.1109/TSE.2018.2883603.
Supervised cross-modal hashing methods leverage the labels of training data to improve the retrieval performance.
Panichella, A.
Laurens van der Maaten and Geoffrey Hinton.
Software engineers' salary depends on factors such .
In Proceedings of the 41st International Conference on Software Engineering, ICSE 2019, Montreal, QC, Canada, May 25-31, 2019, Joanne M. Atlee, Tevfik Bultan, and Jon Whittle (Eds.).
https://doi.org/10.1109/ICSME.2019.00070
Bingyi Kang, Saining Xie, Marcus Rohrbach, Zhicheng Yan, Albert Gordo, Jiashi Feng, and Yannis Kalantidis.
Where to find software development related data sets
In this paper, we propose RobustTrainer, the first approach to learning deep predictive models on raw training datasets where the mislabelled samples and the imbalanced classes coexist.
Deep Learning Meets Software Engineering: A Survey on Pre-Trained Models
2019.
Time intervals are used in research only.
Accepted.
IEEE Transactions on Software Engineering, vol.
In 37th IEEE/ACM International Conference on Software Engineering, ICSE 2015, Florence, Italy, May 16-24, 2015, Volume 1, Antonia Bertolino, Gerardo Canfora, and Sebastian G. Elbaum (Eds.).
Data Engineering is the field associated with analysis and tasks to get and store the data from other sources.
#
# INTRODUCTION
# ------------
#
# The data contained in these files were collected over a period of
# several semesters from students engaged in software engineering
# classes at San Francisco State University (class sections of CSC
# 640, CSC 648 and CSC 848).

Efficient validation of self-adaptive applications by counterexample probability maximization.
Then, process those data and convert them into clean data for use in further processes such as data visualisation, business analytics, and data science solutions.
Easy modelling and verification of unpredictable and preemptive interrupt-driven systems.

# In addition to the README file, the archive
# contains a number of .csv files.

https://doi.org/10.1109/CVPR.2009.5206848
Davide Falessi, Aalok Ahluwalia, and Massimiliano Di Penta.
arXiv:2108.11569 https://arxiv.org/abs/2108.11569
Frank Wilcoxon.

Day-to-day tasks for a data engineer might include:
Acquiring datasets that align with business needs
Developing algorithms to transform data into actionable insights
Building, testing, and maintaining database pipeline architectures
Collaborating with management to fulfill company objectives
Creating new data validation methods and data analysis tools

IEEE Computer Society, 392-401.
2020.
previous models, especially for tasks involving natural language; whereas for
Undergraduate compulsory course.
(Creator), TU Delft - 4TU.ResearchData, 2 Apr 2020, DOI: 10.4121/UUID:23752F31-91B0-4C04-B070-C603541E1E90
Spadini, D. (Creator), Aniche, M. (Creator), Bacchelli, A.
2013.
PMLR, 1597-1607.
OpenReview.net.
Data Engineer Education Requirements, https://www.zippia.com/data-engineer-jobs/education/. Accessed September 16, 2022.
I am looking for motivated graduate students.
The main aim of this thesis is to enable the required paradigm shift by laying down an accurate, comprehensive and information-rich foundation of feature and data sets.
In Proceedings of the 28th ACM Joint Meeting of the European Software Engineering Conference and ACM SIGSOFT Symposium on the Foundations of Software Engineering, pp. https://doi.org/10.1109/CVPR42600.2020.01374, Tong Wei, Jiang-Xin Shi, Yu-Feng Li, and Min-Ling Zhang.