I cannot change the values of a column using Python pandas

I am working with the UCI adult dataset. I added a header row to make the data easier to work with. I need to change the last column, named 'etiquette', which can take the two values '<=50K' and '>50K'. I have tried the following
num_datos.loc[num_datos.loc[:,"etiquette"]=="<=50K", "etiquette"]=1
num_datos.loc[num_datos.loc[:,"etiquette"]==">50K", "etiquette"]=0
and the following
num_datos['etiquette'].replace(['<=50K'], 1)
num_datos['etiquette'].replace(['>50K'], 0)
However, this seems to do nothing, since if I then execute
print(num_datos.etiquette[0])
I still get a value of <=50K. Is there a way for me to replace the values of the column in question?

Your second try, using df.replace(), is close, but replace() returns a new Series rather than modifying the column in place, so you need to assign the result back:
num_datos['etiquette'] = num_datos['etiquette'].replace('<=50K', 1)
num_datos['etiquette'] = num_datos['etiquette'].replace('>50K', 0)
The bracketed form ['<=50K'] is not the problem; replace() accepts either a single value or a list of values to replace. The issue is that calling replace() without assigning its return value (or passing inplace=True) leaves the original dataframe untouched, which is why your print still shows <=50K.
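As a minimal sketch (the frame here is a made-up stand-in for the adult data, not the real file):
import pandas as pd

# Toy frame standing in for the UCI adult data
num_datos = pd.DataFrame({'etiquette': ['<=50K', '>50K', '<=50K']})

# Both substitutions in one pass, assigned back to the column
num_datos['etiquette'] = num_datos['etiquette'].replace({'<=50K': 1, '>50K': 0})

print(num_datos.etiquette[0])  # 1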
Hope this helps!

Related

I need my values to not repeat their names in their categories

I am not sure how to fix this. This is the code I have, but I do not want it to continuously repeat the names of the rows in the output.
I'd suggest a few changes to your code.
Firstly, to answer your question, you can remove the multiple occurrences of the words by using:
select_merch = d.loc[df['Category'] == 'Merchandise'].sum()['Cost']
This will select only the sum of the Cost column for that particular dataframe. Also, the original code is quite redundant and confusing; you can instead build a list of the categories and iterate over it.
list(df['Category'].unique()) will give you a list of all the unique categories. Store it in a list and then iterate over it. Plus, you don't need to do d=pd.DataFrame(df) every time; you can use df itself.
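A minimal sketch of that loop, using a made-up frame with the Category and Cost columns from the snippet above:
import pandas as pd

# Made-up data standing in for the question's dataframe
df = pd.DataFrame({
    'Category': ['Merchandise', 'Food', 'Merchandise'],
    'Cost': [10.0, 5.0, 7.5],
})

# One pass per category instead of a hand-written block for each
for category in df['Category'].unique():
    total = df.loc[df['Category'] == category, 'Cost'].sum()
    print(category, total)
For this particular pattern, pandas can also do the whole thing in one line: df.groupby('Category')['Cost'].sum().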

Viewing column indices and names at once

I need to slice several different datasets that contain a lot of extraneous columns. It's easier for me to glance over the indices of the columns I want and tell Python to save those columns than to type out their names one by one. For instance, if I want to save only SCHOOL_DATE, STUDENT_DATE, STUDENT_P2_DATE, I'd rather tell Python to save column[3, 5:6] or something.
However, I can't find a quick way to view column names right next to their index.
Currently I just run a debugger up to a line where I create an array of my column names, then view it as an array in PyCharm to quickly identify which number belongs to which name. I also tried iterating through the columns to return each index position and name but, perhaps because I don't know Python objects well, I wasn't able to get that to work.
SQLdf = pd.read_csv(desktoppath + SchoolFromSQLfilename)
cols = np.array(SQLdf.columns)
print(SQLdf.columns)
I put a debugger break on the print line. Obviously, I'd rather print the matches straight to the console than take a few point-and-click steps to view them.
First, pair each index with its name using enumerate:
list(enumerate(df.columns))
[(0, 'id'), (1, 'A')]
Then pass the positions you want to np.r_, which concatenates scalars and slices into a single index array, e.g. np.r_[3, 5:8], and select with df.iloc.
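Putting both steps together, a sketch with a made-up frame standing in for SQLdf (the real one comes from the CSV above):
import numpy as np
import pandas as pd

# Made-up columns; only positions 3, 5 and 6 are wanted
df = pd.DataFrame(columns=['id', 'A', 'B', 'SCHOOL_DATE', 'C',
                           'STUDENT_DATE', 'STUDENT_P2_DATE', 'D'])

# Index/name pairs, printed straight to the console
for i, name in enumerate(df.columns):
    print(i, name)

# np.r_ turns scalars and slices into one array of positions
keep = df.iloc[:, np.r_[3, 5:7]]
print(keep.columns.tolist())  # ['SCHOOL_DATE', 'STUDENT_DATE', 'STUDENT_P2_DATE']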

How do I check for a specific exact-match value in each array that is held within another array? (Python)

Hi, this is my first question here. Quick background: I am trying to run a de-duplication process over a large Excel file of names and other pieces of data, which I extracted into an array of arrays.
So arr[0] would hold the contents of that one person and arr[0][1] would hold the last name.
I am having trouble finding a way to see if I have duplicated last names in my array PER entry.
my current code is basically like this for the condition checking
if arr[x][1] in full_arr:
However, it seems that I am getting way more entries than I should. Is Python's "in" looking at partial matches in other areas of the array too? For example, arr[0][3] holds emails.
Thank you so much for your help!
You can use a combination of zip and set to check whether there is a duplicate in a specific column of your multidimensional array:
if len(list(zip(*arr))[1]) != len(set(list(zip(*arr))[1])):
    # if there is at least one duplicate: do some stuff
set removes duplicates, so if len(set(array)) != len(array), it means that array holds some duplicates.
The * operator unpacks your array into positional arguments: list(zip(a[0], a[1], a[2], ...)) is the same as list(zip(*a)).
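A runnable sketch of that check, with made-up rows (index 1 holding the last name, as in the question):
# Made-up rows: [first name, last name, email]
arr = [
    ['Ann', 'Smith', 'ann@example.com'],
    ['Bob', 'Jones', 'bob@example.com'],
    ['Cal', 'Smith', 'cal@example.com'],
]

# Column 1 of every row: ('Smith', 'Jones', 'Smith')
last_names = list(zip(*arr))[1]

# A set drops duplicates, so a length mismatch means repeats exist
if len(last_names) != len(set(last_names)):
    print('duplicate last names found')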

Adding an element to an RDD depending on a calculation on the same RDD

I am new to Python Spark, so this question might be elementary. However, I could not find any good answer here or on Google, so I will just ask it anyway.
I want to add some elements to my RDD depending on a calculation I do on that RDD. Let's say my RDD is named lines and contains strings. I want to add together two tab-separated numbers in the file, then append this sum at the end of the lines RDD.
lines = sc.textFile("myFile.txt")
#Splitting the string where there are tabs
linesArr=lines.map(lambda line: line.split("\t"))
Now I want to add together the first two fields in linesArr and append the result at the end of lines.
How do I do this?
For those of you who might wonder about the same thing, here is how I solved it with a simple example:
n=sc.parallelize([(1,1),(2,2),(3,3),(4,4),(5,5),(6,6),(7,7),(8,8),(9,9)])
m=n.map(lambda x: x[0]+x[1])
z=n.zip(m).map(lambda x: (x[0][0],x[0][1],x[1]))
The result z is: [(1,1,2),(2,2,4),...]
Note that if one omits the final map, the result will be [((1,1),2),((2,2),4),...], which I did not want in this case.
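Applied back to the original tab-separated lines, a sketch under the assumption that the first two fields of each line parse as integers (myFile.txt and the variable names come from the question; a live SparkContext sc is assumed):
# Split each line at tabs, then append the sum of the first two fields
lines = sc.textFile("myFile.txt")
z = lines.map(lambda line: line.split("\t")) \
         .map(lambda f: (f[0], f[1], int(f[0]) + int(f[1])))
Here a single map can append the sum directly; the zip step above is only needed when the calculation has to run as a separate pass over the RDD.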

Transforming type Int64Index into an integer index in Python

I'm quite new to Python, but I have to complete an assignment and I am now stuck on a problem. I am trying to get the index of an element in a table A when some other parameter from this table corresponds to a value in a list B. Table A also contains a column "index" where all elements are numbered from 0 to the end. Moreover, the values in tableA.parameter1 and listB can coincide only once; multiple matches are not possible. To derive the necessary index I use the line
t=tableA.index[tableA.parameter1==listB[numberObservation]]
However, what I get as a result is something like:
t = Int64Index([2], dtype='int64')
If I use the variable t in this Int64Index format, it doesn't suit the further code I have to work with. Actually, I need only 2 as an integer, without all the redundant rest.
Can somebody please help me get around this problem? I am in total despair and would be grateful for any help.
Try .tolist()
t=tableA.index[tableA.parameter1==listB[numberObservation]].tolist()
This should return
t = [2]
a list "without all the redundant rest" :)
What package is giving you Int64Index? This looks vaguely numpy-ish, but numpy arrays define __index__, so a single-element array of integer values will seamlessly operate as an index for sequence lookup.
Regardless, assuming t is supposed to be exactly one value, and it's a sequence type itself, you can just do:
t, = tableA.index[tableA.parameter1==listB[numberObservation]]
That trailing comma changes the line from straight assignment to iterable unpacking; it expects the right hand side to produce an iterable with exactly one value, and that one value is unpacked into t. If the iterable has 0 or 2+ values, you'll get a ValueError.
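If a bare Python int is needed rather than a list or a one-element index, a couple of equivalent sketches (tableA, parameter1 and listB as in the question):
match = tableA.index[tableA.parameter1 == listB[numberObservation]]

t = match[0]      # positional lookup on the one-element index
t = match.item()  # or: returns the scalar, raising if there is not exactly one match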
