C For Data Science?

Within data science, there are many languages that can be used for statistical analysis and machine learning. Typically, the choice of language comes down to what the language provides functionally. Popular choices include R and Python. Lower-level, more rigidly structured languages are often ignored, as their structure can be quite hard to deal with compared to a language with functional conveniences.

Nowhere is this more true than with one of the languages most vital to the foundations of modern computing: C.

Unless you’ve been programming under a rock, and likely even if you have been, you’ve heard of C. C is a very low-level language that can still stretch into high-level development territory. Although C is versatile, achieving anything high-level in it often takes a really long time.

This is why so many higher-level languages are themselves implemented in C, such as Python, whose reference interpreter (CPython) is written in C. But if Python is so great for data science, why can’t its parent language be just as good?

In a lot of ways, C is perfectly acceptable for data science. Moving and managing data is the trademark operation of a low-level language like C, since that is most of what a low-level language does. But there are certainly a lot of properties that make C less viable than a language like Python.
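To give a sense of what managing data by hand means in practice, here is a minimal sketch of a growable array of doubles, the kind of container Python provides for free as a list. The type DoubleVec and the vec_* functions are hypothetical names chosen for this illustration and are not part of any library.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical growable array of doubles; all names are illustrative. */
typedef struct {
    double *data;   /* heap buffer, NULL while empty */
    size_t  len;    /* number of values stored */
    size_t  cap;    /* number of values the buffer can hold */
} DoubleVec;

/* Start empty; no memory is allocated until the first push. */
void vec_init(DoubleVec *v)
{
    assert(v != NULL);
    v->data = NULL;
    v->len = 0;
    v->cap = 0;
}

/* Append one value, doubling the capacity when the buffer is full.
   Returns 0 on success, -1 if the allocation fails. */
int vec_push(DoubleVec *v, double value)
{
    assert(v != NULL);
    if (v->len == v->cap) {
        size_t new_cap = v->cap ? v->cap * 2 : 8;
        double *grown = realloc(v->data, new_cap * sizeof *grown);
        if (grown == NULL)
            return -1;      /* the old buffer is still valid */
        v->data = grown;
        v->cap = new_cap;
    }
    v->data[v->len++] = value;
    return 0;
}

/* Release the buffer and reset to the empty state. */
void vec_free(DoubleVec *v)
{
    assert(v != NULL);
    free(v->data);
    vec_init(v);
}
```

Every allocation, resize, and release is explicit here, which is exactly the control that makes C fast and the bookkeeping that makes it slow to write.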

Stuck With One Paradigm

Although it is certainly true that Python is object oriented, Python also possesses a lot of features that make it very much functional. This is a common theme in software in general: there is little reason anymore for a language to be strictly one or the other, so why not take the useful parts of both?

On the flip side, C is by nature one hundred percent procedural. It has no classes, no closures, and no built-in map or filter, so anything object-like or functional has to be assembled by hand from structs and function pointers. That makes it harder for beginners, and a lot harder for data scientists. A language with more functional properties, where functions can be passed around and composed without constant boilerplate, is certainly a lot easier when working back and forth in data science.
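For a sense of what the functional style costs in C, here is a minimal sketch of applying a function to every element of an array, which Python handles with a one-line map or list comprehension. The names unary_fn, map_doubles, and square are made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Type of a function that maps one double to another. */
typedef double (*unary_fn)(double);

/* Apply f to each of the n elements of in, writing the results to out.
   in and out may point to the same array. */
void map_doubles(const double *in, double *out, size_t n, unary_fn f)
{
    assert(in != NULL && out != NULL && f != NULL);
    for (size_t i = 0; i < n; i++)
        out[i] = f(in[i]);
}

/* Example function to pass in. */
double square(double x)
{
    return x * x;
}
```

Calling map_doubles(xs, out, 3, square) squares three values. It works, but the element type is fixed to double, and the caller has to supply both the length and the output buffer.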

C is hard

There is a reason why C is rarely recommended as a first language: compared to interpreted languages like Python, C is definitely one of the more difficult of the bunch. Machine learning can also be surprisingly difficult to program, especially from scratch without libraries. And with that comes C’s biggest downside in general, and the Achilles’ heel of using C for data science:

There are so many easier, better languages (for DS)

So why use C?

In addition to being hard to write, C is hard to read, especially when a disorganized programmer has a lot of dense math going on. In most cases, it’s just not necessary to shrink the number of programmers who can read your code to a fraction of what it was just to use C. With that in mind, there certainly are some redeeming factors for C.

Why it is good for Data Science

Outside of notebooks, and into the world of pipelines and software engineering with data, a lot of strong C code bases benefit dramatically when the machine learning algorithms themselves are also written in C. For software engineering with DS, C is an absolute godsend for everyone who knows how to use it. Python typically falls rather short in a lot of compute-intensive situations because of its issue with speed.

With that out of the way, it’s also pretty impressive just how many Python packages rely on C under the hood; NumPy, for example, has its core written in C. Not only that, but among languages that can still be used for relatively high-level programming, C is usually one of the fastest choices available.
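As a rough illustration of how C code ends up underneath a Python package, here is a minimal sketch of a numeric routine written with plain C types only. The name ds_mean is hypothetical, and this is only one simple way such a function could be structured, not how any particular package does it.

```c
#include <assert.h>
#include <stddef.h>

/* Arithmetic mean of n values; returns 0.0 for an empty array.
   Only plain C types appear in the signature, so it is easy to
   describe to a foreign-function interface. */
double ds_mean(const double *values, size_t n)
{
    assert(values != NULL || n == 0);
    if (n == 0)
        return 0.0;

    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += values[i];
    return sum / (double)n;
}
```

Compiled into a shared library (for example with cc -shared -fPIC), a function like this can be loaded from Python through the standard ctypes module, so the loop runs as native code while the calling code stays in Python.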

Using C

I have been using C for a long time now, but I have never used C for data science. The closest I’ve come is reading and writing CSV files.

But there’s a first for everything.

With that in mind, I first went to read in a CSV. This is relatively straightforward, as shown below:

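/* strdup is POSIX, not ISO C; this macro exposes it on strict compilers. */
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Return the num-th (1-based) semicolon-separated field of line,
   or NULL if the line has fewer fields. Modifies line in place.
   Note that strtok skips empty fields. */
const char* getfield(char* line, int num)
{
    const char* tok;
    for (tok = strtok(line, ";\n");
            tok && *tok;
            tok = strtok(NULL, ";\n"))
    {
        if (!--num)
            return tok;
    }
    return NULL;
}

int main(void)
{
    FILE* stream = fopen("l", "r");
    if (stream == NULL)
    {
        perror("fopen");
        return 1;
    }

    char line[1024];
    while (fgets(line, sizeof line, stream))
    {
        /* strtok modifies its input, so work on a copy */
        char* tmp = strdup(line);
        if (tmp == NULL)
            break;
        const char* field = getfield(tmp, 3);
        printf("Field 3 would be %s\n", field ? field : "(missing)");
        free(tmp);
    }

    fclose(stream);
    return 0;
}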
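The fields returned by getfield are still text, so before any statistics can be computed they have to be converted to numbers. Below is a minimal sketch of that step using the standard strtod function. The helper name parse_double is hypothetical, and rejecting fields with trailing non-numeric text is a design choice assumed here, not something the listing above requires.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Convert one CSV field to a double.
   Returns 0 and stores the value in *out on success. Returns -1 if the
   field is missing, empty, out of range, or followed by non-numeric text. */
int parse_double(const char *field, double *out)
{
    assert(out != NULL);
    if (field == NULL)
        return -1;

    char *end;
    errno = 0;
    double value = strtod(field, &end);

    if (end == field)       /* nothing was consumed */
        return -1;
    if (errno == ERANGE)    /* overflow or underflow */
        return -1;

    /* Allow trailing whitespace such as a carriage return */
    while (*end == ' ' || *end == '\t' || *end == '\r' || *end == '\n')
        end++;
    if (*end != '\0')
        return -1;

    *out = value;
    return 0;
}
```

Inside the read loop, the result of getfield(tmp, 3) could be passed straight to parse_double, and each row that fails to parse could be skipped or reported.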

Now it was time to try fitting an algorithm. I didn’t want to spend a ridiculous amount of time on setup code, and since execution of a C program starts at main, we can simply define plain functions and call them from there.

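#include <stdio.h>
#include <stddef.h>

/* Sum of the n elements of arr. C arrays do not carry their length,
   so n is passed explicitly instead of computed by a length function. */
double summation(const double arr[], size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += arr[i];
    return sum;
}

/* Fit y = slope * x + b by linear least squares on n points, then write
   predictions for the nt values of xt into y_pred.
   Returns 0 on success, -1 if the slope is undefined. */
int llsq(const double x[], const double y[], size_t n,
         const double xt[], size_t nt, double y_pred[])
{
    double sxy = 0.0, sx2 = 0.0;
    for (size_t i = 0; i < n; i++)
    {
        sxy += x[i] * y[i];     /* summation of x*y */
        sx2 += x[i] * x[i];     /* summation of x^2 */
    }
    /* Summation of x and y */
    double sx = summation(x, n);
    double sy = summation(y, n);

    double dn = (double)n;
    double denom = (dn * sx2) - (sx * sx);
    if (n == 0 || denom == 0.0)
        return -1;

    /* Calculate the slope */
    double slope = ((dn * sxy) - (sx * sy)) / denom;
    /* Calculate the y intercept */
    double b = (sy - (slope * sx)) / dn;

    /* Fill the prediction list */
    for (size_t i = 0; i < nt; i++)
        y_pred[i] = (slope * xt[i]) + b;
    return 0;
}

int main(void)
{
    /* Made-up sample data: points on y = 2x + 1 */
    double x[] = {1.0, 2.0, 3.0, 4.0, 5.0};
    double y[] = {3.0, 5.0, 7.0, 9.0, 11.0};
    double xt[] = {6.0, 7.0};
    double y_pred[2];

    if (llsq(x, y, 5, xt, 2, y_pred) != 0)
    {
        fprintf(stderr, "could not fit a line\n");
        return 1;
    }
    for (size_t i = 0; i < 2; i++)
        printf("x = %g -> predicted y = %g\n", xt[i], y_pred[i]);
    return 0;
}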

So I think it’s obvious from how many lines it took just to write the functions needed to obtain our data and manipulate it: C really isn’t all that great for data science. That’s not to say C doesn’t have its place in data science, but in the particular case of data science and machine learning, it is usually so much easier to use a simpler language that a low-level language like C is hard to justify.

Typically, statistics is associated with high-level, functional programming. This is for good reason, as writing algorithms similar to the ones we would write in Python, R, and Scala is especially hard in C and requires a lot more code.
