Transcriptomics in Yeast


10.14M
Biology Classification


    README.md

    Disclaimer

    This is a dataset of mine that I thought the community might enjoy. It concerns next-generation sequencing (NGS) and transcriptomics. I used several public raw datasets, but the processing needed to get to this dataset is extensive. This is my first contribution to Kaggle, so be nice, and let me know how I can improve the experience. Combined, NGS machines are the biggest data producers worldwide, so why not add some (more?) to Kaggle.

    A look into Yeast transcriptomics

    Background

    Yeasts (in this case *Saccharomyces cerevisiae*) are used in the production of beer, wine and bread, and in a whole lot of biotech applications such as creating complex pharmaceuticals. They are living eukaryotic organisms (meaning quite complex).

    All living organisms store information in their DNA, but action within a cell is carried out by specific proteins. The path from DNA to protein (from data to action) is simple: a specific region of the DNA gets transcribed to mRNA, which gets translated to protein. A common assumption is that the translation step is linear: more mRNA means more protein. Cells actively regulate the amount of a protein through the amount of mRNA they create, and the expression of each gene depends on the condition the cell is in (starving, stressed, etc.).

    Modern methods in biology show us all the mRNA that is currently inside a cell. Assuming the linearity of the process, the more of a specific mRNA is available to a cell, the more of the corresponding protein it can make. This makes mRNA an excellent marker for what is actually happening inside a cell. It is important to consider that mRNA is fragile. It is actively replenished only when it is needed. Both mRNA and proteins are expensive for a cell to produce.

    Yeasts are good model organisms for this, since they only have about 6000 genes. They are also single cells, which makes samples more homogeneous, and they contain few advanced features (splice junctions, etc.). (All of this is heavily simplified; let me know if I should go into more detail.)

    The data

    Files

    The following files are provided:

    * **SC_expression.csv**: expression values for each gene over the available conditions
    * **labels_CC.csv**: labels for the individual genes, their status and, where known, their intracellular localization (see below)

    Maybe this would be nice as a little competition. I'll see how this one goes before I upload the other label files. Please provide some feedback on the presentation, and on whatever else you would want me to share.

    Background

    I used 92 samples from various openly available raw datasets and ran them through a modern RNA-seq pipeline. The samples span a range of different conditions (I hid the raw names). The conditions cover stress conditions, temperature and heavy metals, as well as growth media changes and the deletion of specific genes. Originally I had 150 sets, of which 92 were of good enough quality. Evaluation was done at the gene level. Each gene has its own row, and samples are columns (some are in replicates over several columns). Expression levels were normalized to TPM (transcripts per million), a default normalization procedure. Raw counts would have been integers; normalized, they are floats.
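
    To make the TPM step concrete, here is a minimal sketch of how raw counts for one sample turn into TPM values. It is not the pipeline that produced this dataset: the function name `tpm`, the gene names, the counts and the transcript lengths are all made up for illustration.

```python
def tpm(counts, lengths_kb):
    """Convert raw read counts for ONE sample to TPM.

    counts     -- dict gene -> raw read count (hypothetical input)
    lengths_kb -- dict gene -> transcript length in kilobases (hypothetical input)
    """
    # Step 1: reads per kilobase, which corrects for transcript length.
    rpk = {gene: counts[gene] / lengths_kb[gene] for gene in counts}
    # Step 2: rescale so the values of this sample sum to one million.
    total = sum(rpk.values())
    if total == 0:
        return {gene: 0.0 for gene in counts}
    return {gene: value / total * 1_000_000 for gene, value in rpk.items()}


# Toy numbers, NOT taken from the dataset.
sample_counts = {"geneA": 100, "geneB": 300, "geneC": 600}
sample_lengths = {"geneA": 1.0, "geneB": 1.5, "geneC": 3.0}
sample_tpm = tpm(sample_counts, sample_lengths)
```

    Because every sample is rescaled to the same total, TPM values can be compared between columns, and integer counts become floats, as noted above.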

    Analysis and labels

    Genes

    The function of individual genes is a matter of dispute. Clearly, living cells are complex, and their inner workings are not directly visible. Gene function is commonly inferred indirectly by removing a gene and testing the cell's behavior. This is time-consuming and not very precise. As you can see in the dataset, there is still much to be done to fully understand even single-celled yeasts. The provided dataset allows for a different approach to the functional classification of genes. The label files contained in the set map each gene to a specific label. The classification is based on the official Gene Ontology (GO) annotations. I simplified the nomenclature: gene functionality is usually given in a hierarchical structure [inside cell --> cytoplasm --> associated to complex A ...], but I'm only keeping high-level associations and using readable terms instead of GO terms. I'll extend this if people are interested.
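
    As a rough illustration of what "keeping only high-level associations" can mean, the sketch below collapses a detailed term to an ancestor near the top of a hierarchy. The `PARENT` table is a hypothetical toy tree built from the example path above. Real GO terms form a much larger graph in which a term can have several parents, and this is not the exact procedure used to build the label files.

```python
# Hypothetical child -> parent relations (toy tree, not real GO data).
PARENT = {
    "inside cell": "cellular_component",
    "cytoplasm": "inside cell",
    "nucleus": "inside cell",
    "complex A": "cytoplasm",
}
ROOT = "cellular_component"


def high_level(term, depth=2):
    """Return the ancestor of `term` that sits `depth` steps below the root.

    Terms that are already closer to the root than `depth` are returned as-is.
    """
    path = [term]
    while path[-1] != ROOT:
        path.append(PARENT[path[-1]])  # walk up one level at a time
    path.reverse()  # root first, most specific term last
    return path[min(depth, len(path) - 1)]


collapsed = high_level("complex A")  # detailed term -> readable high-level term
```

    Choosing a smaller `depth` gives coarser labels with more genes per class; a larger one keeps more detail but leaves fewer examples per label.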

    Labels

    * CC labels concern Cellular Component: where the gene product is located within a cell. The label file goes into the details of the associations found. The label 'cellular_component' is synonymous with 'unknown location'. CC is the easiest label to attach to a gene, since it is the one that can be studied most easily. Still, many genes are missing.
    * MF labels concern Molecular Function: what the gene is doing. [upcoming]
    * BP labels concern Biological Process: what the gene is involved in. [upcoming]

    The core interest here is whether it is possible to improve the classification of genes by modeling the data. A common assumption is that genes that are expressed under the same conditions are functionally related. There are a bunch of possible applications out there, many of which are limited by our current state of knowledge of the complex systems we observe, or fail to observe. Bringing biology into the realm of data science is an ongoing effort. Having better insight into the data might very well help.
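
    One simple way to act on the co-expression assumption is "guilt by association": give an unlabeled gene the label of the labeled gene whose expression profile correlates best with it. The sketch below is only a starting point under stated assumptions. The gene names, expression values and labels ('nucleus', 'membrane') are invented, and treating 'cellular_component' as "unknown" follows the convention described above.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equally long expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # a flat profile carries no co-expression signal
    return cov / (sx * sy)


def predict_label(gene, expression, labels, unknown="cellular_component"):
    """Copy the label of the most correlated gene that has a known label."""
    best_gene, best_r = None, -2.0
    for other, profile in expression.items():
        if other == gene or labels.get(other, unknown) == unknown:
            continue  # skip the gene itself and genes without a real label
        r = pearson(expression[gene], profile)
        if r > best_r:
            best_gene, best_r = other, r
    if best_gene is None:
        return unknown, 0.0
    return labels[best_gene], best_r


# Made-up TPM-like profiles (one value per sample) and labels.
expression = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.1, 3.9, 6.2, 8.0],
    "g3": [4.0, 3.0, 2.0, 1.0],
    "g4": [1.1, 2.1, 2.9, 4.2],
}
labels = {"g1": "nucleus", "g2": "nucleus", "g3": "membrane",
          "g4": "cellular_component"}
predicted, r = predict_label("g4", expression, labels)
```

    On real data it would be worth log-transforming the TPM values first and voting over several nearest neighbors instead of one, since single correlations on noisy expression data are easy to over-interpret.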

    Note

    The dataset is real, and therefore noisy. The labels are incomplete even though I'm using the current state of the art; that is how much is known. Using expression levels for classification has already been attempted by software such as SPELL (Serial Pattern of Expression Levels Locator).

    Acknowledgements

    I guess I own the dataset. It is a by-product of another project of mine. If someone is interested in publishing this, contact me.

    Inspiration

    Unraveling genetic mechanisms is a complex but rewarding task. Humans and yeast are quite similar in many ways. So apart from the fact that we use yeast for food and medicine, we might eventually use knowledge gained from it for studying diseases. Again, any feedback is welcome. Enjoy, CE