For instance, I have a DataFrame with thousands of rows, one of whose columns is 'cow_id', where each cow ID appears in several rows. I want to replace those IDs with numbers starting from 1, just to make them easier to remember.
df['cow_id'].unique().tolist()
resulting in:
5603,
5606,
5619,
4330,
5587,
4967,
5554,
4879,
4151,
5501,
4723,
4908,
3963,
4023,
4573,
3986,
5668,
4882,
5645,
5548
How do I change each unique ID into a new number, such as:
5603 -> 1
5606 -> 2
Try:
df.groupby('cow_id').ngroup()+1
Or try pd.factorize:
pd.factorize(df['cow_id'])[0]+1
As the documentation says, pd.factorize encodes the object as an enumerated type or categorical variable.
Note that pd.factorize returns two values, which is why the [0] above selects just the codes.
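For example, a minimal sketch of both approaches on toy data (the values here are just a few of the IDs from the question):

import pandas as pd

df = pd.DataFrame({'cow_id': [5603, 5606, 5603, 5619, 5606]})

# ngroup() numbers the groups 0..n-1 (in sorted key order by default); +1 starts at 1
df['new_id'] = df.groupby('cow_id').ngroup() + 1

# pd.factorize returns (codes, uniques); codes number the IDs in order of appearance
codes, uniques = pd.factorize(df['cow_id'])
df['new_id2'] = codes + 1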
What you are looking for is called categorical encoding.
The sklearn library in Python has many preprocessing methods, of which LabelEncoder should do the job for you. See this link:
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html#sklearn.preprocessing.LabelEncoder
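A minimal sketch of what that could look like (toy data; the +1 just shifts the 0-based labels so they start at 1):

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'cow_id': [5603, 5606, 5603, 5619]})

le = LabelEncoder()
# fit_transform assigns labels 0..n-1 in sorted order of the unique values
df['cow_id_encoded'] = le.fit_transform(df['cow_id']) + 1

# the original IDs can be recovered with le.inverse_transform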
Also keep in mind that encodings like these might introduce some bias into your dataset, as some algorithms can consider one label higher than another, i.e., 1 > 2 > ... > 54.
See this blog post to learn more about encodings and when to use which:
https://towardsdatascience.com/encoding-categorical-features-21a2651a065c
Let me know if you have any questions.
Here is an approach using pandas.Categorical. The benefit is that you keep the original data and can flip back and forth. Here I create a variable called "c" that holds both the original categories and the new codes.
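A sketch of that approach (toy data standing in for the question's DataFrame):

import pandas as pd

df = pd.DataFrame({'cow_id': [5603, 5606, 5603, 5619]})

# "c" holds both the original categories and the 0-based integer codes
c = pd.Categorical(df['cow_id'])
df['new_id'] = c.codes + 1

# flipping back: index the stored categories with the 0-based codes
df['cow_id_again'] = c.categories[c.codes]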
Moving from SAS to Python, I am trying to replicate a SAS macro-style process that uses input parameters to generate a different iteration of the code on each loop. In particular, I am trying to binarize continuous variables for modeling (whatever the merits of that may be). What I'm doing at the moment looks as follows:
Some sample data:
import pandas as pd
import numpy as np  # np.inf is used in the pd.cut calls below

data = [[2, 20], [4, 50], [6, 75], [1, 80], [3, 40]]
df = pd.DataFrame(data, columns=['var1', 'var2'])
Then I run the following:
df['var1_f'] = pd.cut(df['var1'], [0,1,2,3,4,5,7,np.inf], include_lowest=True, labels=['a','b','c','d','e','f','g'])
df['var2_f'] = pd.cut(df['var2'], [-np.inf,0,62,73,81,98,np.inf], include_lowest=True, labels=['a','b','c','d','e','f'])
.
.
.
df1=pd.get_dummies(df,columns=['var1_f'])
df1=pd.get_dummies(df1,columns=['var2_f'])
.
.
.
The above results in a table that contains the original DataFrame, but now with appended columns taking the value 1 or 0, depending on whether the continuous variable falls into a particular band. That's great, but there must be a better way to do this than having potentially a dozen or so structurally identical entries that differ only in the variable names and the cutoff/label values.
The SAS equivalent would involve replacing "varx_f", "varx", the cutoff values and the labels with placeholders that change on each iteration. In this case, I would do that through pre-defined values (as per the values in the above code), rather than dynamically.
How would I go about looping through this with different arguments for each iteration?
Apologies if this is an existing topic (I'm sure it is) - I just haven't been able to find it.
Thanks for reading!
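One common way to parameterize this is to loop over a mapping from variable name to (bin edges, labels), reusing the edges and labels from the question (a sketch; the specs dict name is illustrative):

import numpy as np
import pandas as pd

data = [[2, 20], [4, 50], [6, 75], [1, 80], [3, 40]]
df = pd.DataFrame(data, columns=['var1', 'var2'])

# one entry per variable: (bin edges, labels)
specs = {
    'var1': ([0, 1, 2, 3, 4, 5, 7, np.inf], list('abcdefg')),
    'var2': ([-np.inf, 0, 62, 73, 81, 98, np.inf], list('abcdef')),
}

for var, (bins, labels) in specs.items():
    df[f'{var}_f'] = pd.cut(df[var], bins, include_lowest=True, labels=labels)

# one get_dummies call covers all the binned columns
df1 = pd.get_dummies(df, columns=[f'{v}_f' for v in specs])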
I would like to use a model's predictions (let's say RandomForestRegressor) to replace the missing values in the Age column of a DataFrame. I checked that the data type of the model's prediction is numpy.ndarray.
Here’s what I do:
from sklearn.ensemble import RandomForestRegressor

a = RandomForestRegressor()
a.fit(train_data, target)
result = a.predict(test_data)
df[df.Age.isna()].Age.iloc[:] = result
But it doesn't work: the NaN values are not replaced. May I ask why?
I have seen other people use the same method and it works for them.
Do not use chained indexing. It is explicitly discouraged in the docs. The inconsistency you may be seeing may be linked to copy versus view discrepancies as described in the docs.
Instead, use a single pd.DataFrame.loc call:
df.loc[df['Age'].isna(), 'Age'] = result
See also Indexing and Selecting Data.
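A tiny self-contained illustration of the difference (the data and prediction values here are made up):

import numpy as np
import pandas as pd

df = pd.DataFrame({'Age': [22.0, np.nan, 35.0, np.nan]})
result = np.array([30.0, 28.0])   # stand-in for the model's predictions

# the chained version assigns into a temporary copy and is silently lost:
# df[df.Age.isna()].Age.iloc[:] = result

# a single .loc call writes into df itself; result must hold one value
# per missing row, in the same order as those rows appear
df.loc[df['Age'].isna(), 'Age'] = result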
In my pandas DataFrame I have lots of boolean features (True/False). Pandas correctly represents them as bool if I check df.dtypes. But if I pass my DataFrame to H2O (h2o.H2OFrame(df)), the boolean features are represented as enum, so they are interpreted as categorical features with two categories.
Is there a way to change the type of those features from enum to bool? In pandas I can use df.astype('bool'); is there an equivalent in H2O?
One idea was to encode True/False to their numeric representation (1/0) before converting df to an H2OFrame, but H2O then recognises this as int64.
Thanks in advance for any help!
The enum type is used for categorical variables with two or more categories. So it includes boolean. I.e. there is no distinct bool category in H2O, and there is nothing you need to fix here.
By the way, if you have a lot of boolean features because you have manually done one-hot encoding, don't do that. Instead, give H2O the original (multi-level categorical) data, and it will do one-hot encoding when needed, behind the scenes. This is better because algorithms like decision trees can use multi-level categorical data directly, so it will be more efficient.
See http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/algo-params/categorical_encoding.html for some alternatives you can try. The missing category is added for when that column is missing in production.
(But "What happens when you try to predict on a categorical level not seen during training?" at http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/deep-learning.html#faq does not seem to describe the behaviour you see?)
Also see http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/algo-params/use_all_factor_levels.html (I cannot work out from that description if you want it to be true or false, so try both ways!)
UPDATE: set use_all_factor_levels = F and it will only have one input neuron (plus the NA one) for each boolean input, instead of two. If your categorical inputs are almost all boolean types, I'd recommend setting this. If your categorical inputs mostly have quite a lot of levels, I wouldn't (because, overall, it won't make much difference in the number of input neurons, but it might make the network easier to train).
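In the Python API that could look something like this (a sketch; the frame and column names are illustrative):

import h2o
import pandas as pd
from h2o.estimators import H2ODeepLearningEstimator

h2o.init()

# toy frame with a boolean feature
df = pd.DataFrame({'isBig': [True, False, True, False], 'y': [1.0, 0.0, 1.0, 0.0]})
hf = h2o.H2OFrame(df)

# use_all_factor_levels=False drops one level per categorical column, so a
# boolean input gets a single neuron (plus the missing(NA) one) instead of two
model = H2ODeepLearningEstimator(use_all_factor_levels=False)
model.train(x=['isBig'], y='y', training_frame=hf)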
WHY MISSING(NA)?
If I have a boolean input, e.g. "isBig", there will be 3 input neurons created for it. If you look at varimp() you can see they are named:
isBig.1
isBig.0
isBig.missing(NA)
Imagine you now put it into production, and the user does not give a value (or gives an NA, or gives an illegal value such as "2") for the isBig input. This is when the NA input neuron gets fired, to signify that we don't know if it is big or not.
To be honest, I think this cannot be any more useful than firing both the .0 and the .1 neurons, or firing neither of them. But if you are using use_all_factor_levels=F then it is useful. Otherwise all NA data gets treated as "not-big" rather than "could be big or not-big".
I have a 20+GB dataset that is structured as follows:
1 3
1 2
2 3
1 4
2 1
3 4
4 2
(Note: the repetition is intentional and there is no inherent order in either column.)
I want to construct a file in the following format:
1: 2, 3, 4
2: 3, 1
3: 4
4: 2
Here is my problem: I have tried writing scripts in both Python and C++ to load the file, build long strings, and write them to a file line by line. It seems, however, that neither attempt can handle the task at hand. Does anyone have suggestions for how to tackle this problem? Specifically, is there a particular method/program that is well suited to it? Any help or guided directions would be greatly appreciated.
You can try this using Hadoop. You can run a standalone MapReduce program. The mapper will output the first column as the key and the second column as the value. All outputs with the same key go to one reducer, so you end up with a key and the list of values for that key. You can run through the values list and output (key, valueString), which is the final output you desire. You can start with a simple Hadoop tutorial and write the mapper and reducer as I suggested. However, I have not tried scaling 20 GB of data on a standalone Hadoop system. You may try. Hope this helps.
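With Hadoop Streaming, the mapper and reducer can be plain Python scripts along these lines (a sketch; the framework sorts the mapper output by key before the reducer sees it):

# mapper.py: emit "key<TAB>value" for each input pair
import sys
for line in sys.stdin:
    parts = line.split()
    if len(parts) == 2:
        print(parts[0] + '\t' + parts[1])

# reducer.py: input arrives sorted by key, so consecutive lines can be grouped
import sys
from itertools import groupby
pairs = (line.rstrip('\n').split('\t') for line in sys.stdin)
for key, group in groupby(pairs, key=lambda kv: kv[0]):
    print(key + ': ' + ', '.join(kv[1] for kv in group))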
Have you tried using a std::vector of std::vectors?
The outer vector represents each row. Each slot in the outer vector is a vector containing all the possible values for each row. This assumes that the row # can be used as an index into the vector.
Otherwise, you can try std::map<unsigned int, std::vector<unsigned int> >, where the key is the row number and the vector contains all values for the row.
A std::list of vectors would also work.
Does your program run out of memory?
Edit 1: Handling large data files
You can handle your issue by treating it like a merge sort.
Open a file for each row number.
Append the 2nd column values to the file.
After all data is read, close all files.
Open each file and read the values and print them out, comma separated.
In other words: open an output file for each key; while iterating over the lines of the source file, append each value to the matching output file; finally, join the output files.
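A sketch of that approach in Python (file and directory names are illustrative; reopening in append mode per line is slow but keeps memory use flat):

import os

os.makedirs('buckets', exist_ok=True)

# pass 1: append each value to a small file named after its key
with open('input.txt') as src:
    for line in src:
        key, value = line.split()
        with open('buckets/%s.txt' % key, 'a') as bucket:
            bucket.write(value + '\n')

# pass 2: join each bucket into one "key: v1, v2, ..." output line
with open('output.txt', 'w') as out:
    for name in sorted(os.listdir('buckets'), key=lambda n: int(n.split('.')[0])):
        with open(os.path.join('buckets', name)) as bucket:
            values = ', '.join(v.strip() for v in bucket)
        out.write(name.split('.')[0] + ': ' + values + '\n')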
An interesting thought, also found on Stack Overflow:
If you want to persist a large dictionary, you are basically looking at a database.
As recommended there, use Python's sqlite3 module to write to a table where the primary key is auto incremented, with a field called "key" (or "left") and a field called "value" (or "right").
Then SELECT the MIN(key) and MAX(key) from the table, and with that information you can SELECT all rows that share each "key" (or "left") value, in sorted order, and print that information to an outfile (if the database itself is not a good output format for you).
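A sketch of that approach with Python's sqlite3 module (file names are illustrative; the DISTINCT query is a small simplification of the MIN/MAX scan described above):

import sqlite3

con = sqlite3.connect('pairs.db')
con.execute('CREATE TABLE IF NOT EXISTS pairs '
            '(id INTEGER PRIMARY KEY AUTOINCREMENT, key INTEGER, value INTEGER)')

# load the two-column input into the table
with open('input.txt') as src:
    rows = (tuple(map(int, line.split())) for line in src)
    con.executemany('INSERT INTO pairs (key, value) VALUES (?, ?)', rows)
con.commit()

# let the database do the grouping and sorting
with open('output.txt', 'w') as out:
    for (key,) in con.execute('SELECT DISTINCT key FROM pairs ORDER BY key'):
        cur = con.execute('SELECT value FROM pairs WHERE key = ?', (key,))
        out.write('%d: %s\n' % (key, ', '.join(str(v) for (v,) in cur)))
con.close()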
I have written this approach on the assumption that you call this problem "big data" because the number of keys does not fit well into memory (otherwise, a simple Python dictionary would be enough). However, IMHO this question is not correctly tagged as "big data": to require distributed computations on Hadoop or similar, your input data should be much larger than what you can hold on a single hard drive, or your computations should be much more costly than a simple hash table lookup and insertion.
I am using csv.DictReader to read some large files into memory and then do some analysis, so all objects from multiple CSV files need to be kept in memory. I need to read them as dictionaries to make the analysis easier, and because the CSV files may be altered by adding new columns.
Yes, SQL could be used, but I'd rather avoid it if it's not needed.
I'm wondering if there is a better and easier way of doing this. My concern is that I will have many dictionary objects with the same keys, wasting memory. The use of __slots__ was an option, but I will only know the attributes of an object after reading the CSV.
[Edit:] Due to being on legacy system and "restrictions", use of third party libraries is not possible.
If you are on Python 2.6 or later, collections.namedtuple is what you are asking for.
See http://docs.python.org/library/collections.html#collections.namedtuple
(there is even an example of using it with csv).
EDIT: It requires the field names to be valid as Python identifiers, so perhaps it is not suitable in your case.
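A sketch of that pattern (the file name is illustrative; rename=True papers over header fields that are not valid identifiers, at the cost of positional names like _0):

import csv
from collections import namedtuple

with open('data.csv', newline='') as f:
    reader = csv.reader(f)
    # build the class from the header row; rename=True fixes invalid field names
    Row = namedtuple('Row', next(reader), rename=True)
    rows = [Row(*r) for r in reader]

print(rows[0])   # fields are accessible by name, e.g. rows[0].some_column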
Have you considered using pandas?
It works very well for tables. Relevant for you are the read_csv function and the DataFrame type.
This is how you would use it:
>>> import pandas
>>> table = pandas.read_csv('a.csv')
>>> table
a b c
0 1 2 a
1 2 4 b
2 5 6 word
>>> table.a
0 1
1 2
2 5
Name: a
Use Python's shelve module. It provides a dictionary-like object that can be dumped to disk when required and loaded back very easily.
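A minimal sketch (the file name is illustrative; shelve keys must be strings, and values can be any picklable object):

import shelve

with shelve.open('rows.db') as db:
    db['row0'] = {'a': 1, 'b': 2}   # written through to disk

with shelve.open('rows.db') as db:
    print(db['row0'])               # loaded back on demand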
If all the data in a given column are of the same type, you can use NumPy. NumPy's loadtxt and genfromtxt functions can be used to read CSV files, and because they return arrays, the memory usage is smaller than with dicts.
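For example (a sketch, assuming a CSV with a header row and numeric columns; the file and column names are illustrative):

import numpy as np

# names=True reads the column names from the header; the result is a structured array
data = np.genfromtxt('data.csv', delimiter=',', names=True)
print(data['some_column'])   # one column as a compact typed array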
Possibilities:
(1) Benchmark the csv.DictReader approach and see if it causes a problem. Note that the dicts contain POINTERS to the keys and values; the actual key strings are not copied into each dict.
(2) For each file, use csv.reader; after the first row, build a class dynamically and instantiate it once per remaining row. Perhaps this is what you had in mind.
(3) Have one fixed class, instantiated once per file, which gives you a list of tuples for the actual data, a tuple that maps column indices to column names, and a dict that maps column names to column indices. Tuples occupy less memory than lists because no extra append-space is allocated. You can then get and set your data via (row_index, column_index) and (row_index, column_name); a sketch of this option follows below.
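A sketch of possibility (3)'s read path (the class name is illustrative; rows are stored as tuples to keep memory compact):

import csv

class Table:
    def __init__(self, path):
        with open(path, newline='') as f:
            reader = csv.reader(f)
            self.colnames = tuple(next(reader))                           # index -> name
            self.colindex = {n: i for i, n in enumerate(self.colnames)}   # name -> index
            self.rows = [tuple(r) for r in reader]                        # compact storage

    def get(self, row_index, column):
        # accept either a column index or a column name
        if isinstance(column, str):
            column = self.colindex[column]
        return self.rows[row_index][column]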
In any case, to get better advice, how about some simple facts and stats: What version of Python? How many files? rows per file? columns per file? total unique keys/column names?