Eva Man-Yan Tse - Sunnyvale CA, US; Michael Dean Lore - Katy TX, US; James Daniel Attaway - Katy TX, US
Assignee:
Computer Associates Think, Inc. - Islandia NY
International Classification:
G06F 17/30
US Classification:
707/102, 707/10, 707/101, 707/103, 707/104
Abstract:
A method of populating multiple data marts in a single operation from a set of transactional data held in a database. The data is processed in a single aggregation pass in which each aggregate value is calculated only once, a determination is made as to which output data marts require the aggregate value, and the aggregate values are output to the appropriate data marts. Dimension data associated with the output aggregate records is also output to the appropriate data marts.
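The aggregate-once, route-to-many idea in this abstract can be sketched roughly as follows. This is only an illustration under assumed names: `populate_marts`, the `amount` field, and the `mart_specs` layout are hypothetical and do not come from the patent, and the output of dimension records is omitted.

```python
from collections import defaultdict


def populate_marts(transactions, mart_specs):
    """Aggregate once, then route each aggregate to the marts that need it.

    transactions: iterable of dicts with a numeric "amount" (assumed schema).
    mart_specs:   {mart_name: set of group-by column tuples that mart requires}.
    """
    # Union of every grouping any mart asks for, so each is computed only once.
    groupings = set()
    for required in mart_specs.values():
        groupings |= required

    # Single pass over the transactional data.
    totals = {g: defaultdict(float) for g in groupings}
    for row in transactions:
        for g in groupings:
            key = tuple(row[col] for col in g)
            totals[g][key] += row["amount"]

    # Decide which marts require each aggregate and output it there.
    marts = {name: {} for name in mart_specs}
    for g, values in totals.items():
        for name, required in mart_specs.items():
            if g in required:
                marts[name][g] = dict(values)
    return marts


if __name__ == "__main__":
    tx = [
        {"region": "TX", "product": "A", "amount": 10.0},
        {"region": "TX", "product": "B", "amount": 5.0},
        {"region": "CA", "product": "A", "amount": 7.0},
    ]
    specs = {
        "sales": {("region",), ("region", "product")},
        "finance": {("region",)},
    }
    print(populate_marts(tx, specs))
```

Here the per-region totals are computed once and shared by both the hypothetical `sales` and `finance` marts, while the region-by-product totals go only to `sales`.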
Method And Apparatus For Synchronizing Cache With Target Tables In A Data Warehousing System
Eva Man-Yan Tse - Sunnyvale CA, US; Pinaki Mukhopadhyay - Cupertino CA, US; Sumitro Samaddar - Cupertino CA, US
Assignee:
Informatica Corporation - Redwood City CA
International Classification:
G06F 12/08, G06F 17/30
US Classification:
711/118, 711/141, 707/8, 707/201
Abstract:
A method and apparatus for processing (transporting) data, such as in a data warehouse system. In one embodiment, the data are received from a source and compared to data in a lookup cache comprising a subset of data from a first data set (e.g., a dimension table). Instances of the data not present in the lookup cache (that is, new data) are identified. Information corresponding to these instances is generated (e.g., a unique identifier is associated with each of these instances), and the first data set is updated accordingly. The lookup cache is then updated with the new data and the unique identifiers. Accordingly, the information (data) in the lookup cache and in the first data set are in synchronization. The lookup cache does not need to be rebuilt (e.g., to update a second data set such as a fact table).
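A minimal sketch of the cache-synchronization flow described in this abstract is given below. The class name `DimensionLoader`, the use of a plain dict as the dimension table, and the sequential integer identifiers are all assumptions made for illustration. The fallback read of the table on a cache miss is likewise an assumption added so that a partial cache does not create duplicate rows; the abstract itself does not spell out that step.

```python
import itertools


class DimensionLoader:
    """Keeps a lookup cache in step with a dimension table (both plain dicts here)."""

    def __init__(self, dimension_table, cached_keys=None):
        # dimension_table maps natural key -> unique identifier (stand-in for the target table).
        self.dimension_table = dimension_table
        keys = dimension_table.keys() if cached_keys is None else cached_keys
        # The cache holds a subset of the table.
        self.cache = {k: dimension_table[k] for k in keys}
        # Hypothetical id scheme: next integer after the current maximum.
        self._ids = itertools.count(max(dimension_table.values(), default=0) + 1)

    def process(self, source_keys):
        """Return the identifier for each source key, adding unseen keys as they appear."""
        resolved = []
        for key in source_keys:
            if key not in self.cache:
                if key not in self.dimension_table:
                    # New data: generate an identifier and update the table.
                    self.dimension_table[key] = next(self._ids)
                # Update the cache in place; it is never rebuilt.
                self.cache[key] = self.dimension_table[key]
            resolved.append(self.cache[key])
        return resolved


if __name__ == "__main__":
    table = {"alice": 1, "bob": 2}
    loader = DimensionLoader(table, cached_keys=["alice"])
    print(loader.process(["alice", "carol", "bob", "carol"]))
    print(table)
```

The identifiers returned by `process` could then be used as foreign keys when loading a fact table, without reloading the cache from the dimension table first.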
One embodiment of the present invention sets forth a technique for optimizing data in a dataset. The technique includes determining, based on one or more attributes of a dataset, an optimization that is associated with at least one of a file encoding, a file size, and a sort column. The technique also includes identifying a plurality of candidate configurations associated with the dataset and corresponding to the optimization, and for each candidate configuration, generating a corresponding set of evaluation metrics associated with the optimization. The technique further includes determining, based on the sets of evaluation metrics corresponding to the plurality of candidate configurations, a set of configurations in the plurality of candidate configurations to be applied to the dataset. Finally, the technique includes modifying the dataset based on the set of configurations.
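As a rough illustration of the evaluate-candidates-then-apply loop in this abstract, the sketch below treats each candidate configuration as a possible sort column. The metric (total number of runs of equal adjacent values across all columns, a simple proxy for run-length compressibility) and the function names are assumptions chosen for the example, not details taken from the patent. The rows are assumed to be non-empty dicts sharing the same columns.

```python
def count_runs(values):
    """Number of maximal runs of equal adjacent values (fewer runs suit run-length encoding)."""
    return sum(1 for i, v in enumerate(values) if i == 0 or v != values[i - 1])


def evaluate_sort_column(rows, column):
    """Evaluation metric for one candidate: total runs over all columns after sorting."""
    ordered = sorted(rows, key=lambda r: r[column])
    return sum(count_runs([r[c] for r in ordered]) for c in ordered[0])


def optimize(rows, candidate_columns):
    """Score every candidate sort column, pick the best, and apply it to the dataset."""
    metrics = {c: evaluate_sort_column(rows, c) for c in candidate_columns}
    best = min(metrics, key=metrics.get)
    return best, metrics, sorted(rows, key=lambda r: r[best])


if __name__ == "__main__":
    data = [
        {"country": "US", "id": 3},
        {"country": "CA", "id": 1},
        {"country": "US", "id": 2},
        {"country": "CA", "id": 4},
    ]
    best, metrics, rewritten = optimize(data, ["country", "id"])
    print(best, metrics)
```

The same shape would extend to other candidate dimensions named in the abstract, such as file encoding or file size, by swapping in a different candidate list and metric function.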
Netflix
Director of Big Data Services
Jaman Apr 2005 - Apr 2009
Vice President of Engineering
Informatica Nov 2000 - Mar 2005
Senior Development Manager
Informatica Jan 1999 - Nov 2000
Senior Software Engineer
Education:
University of Houston
Master of Science, Computer Science
University of Houston
Bachelor of Science, Computer Science
Skills:
Cloud Computing Agile Methodologies Hadoop Scalability Software Development Distributed Systems Python Amazon Web Services Big Data Java Software Design Business Intelligence Architecture Data Warehousing Software Engineering Big Data Infrastructure Architectures