Suppose there is a pipeline as follows....
pipe = make_pipeline(one-hot,selectkbest,......)
If you want to see only the selected columns...
selected_mask = pipe.named_steps['selectkbest'].get_feature_names_out()
I tried other approaches, but could not get them to work. Here is what I tried:
all_names = pipe.named_steps['onehotencoder'].get_feature_names()
selected_mask = pipe.named_steps['selectkbest'].get_support()
selected_names = all_names[selected_mask]
When I run the following line...
selected_names = all_names[selected_mask]
I get the following error:
TypeError: only integer scalar arrays can be converted to a scalar index
How can I solve it?
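This error is what NumPy raises when a plain Python list is indexed with an array, so one likely cause is that all_names is a list rather than an ndarray. The sketch below reproduces that situation with made-up feature names and a made-up mask (both are assumptions for illustration, not output from the pipeline above) and shows that converting the names to an array makes boolean-mask indexing work.

```python
import numpy as np

# Hypothetical stand-ins for the encoder's feature names and the selector's mask.
all_names = ["city_Paris", "city_Rome", "room_Private", "room_Shared"]
selected_mask = np.array([True, False, True, False])

# Indexing the list directly with the mask raises the TypeError from the question.
try:
    all_names[selected_mask]
    list_indexing_failed = False
except TypeError:
    list_indexing_failed = True

# Converting to an ndarray first allows boolean-mask indexing.
selected_names = np.array(all_names)[selected_mask]
print(selected_names)
```

If all_names is already an ndarray in your environment, the cause is something else and this conversion will not change anything.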
Downloading some data from here:
http://insideairbnb.com/get-the-data.html
Then
listings = pd.read_csv('listings.csv')
Trying to change types
listings.bathrooms = listings.bathrooms.astype('int64',errors='ignore')
listings.bedrooms = listings.bedrooms.astype('int64',errors='ignore')
listings.beds = listings.beds.astype('int64',errors='ignore')
listings.price = listings.price.replace(r'[\$,]','',regex=True).astype('float')
listings.price = listings.price.astype('int64',errors='ignore')
I tried some other combinations, but in the end each one either raises an error or just doesn't change the data type.
EDIT: corrected some typos
The apostrophes in the last line are not in the correct place, and the last one is not the correct type: you need ' instead of ` (maybe it was accidentally added because of the code block).
So for me it works like this:
listings.price.astype('int64', errors='ignore')
But if you would like to reassign it to the original variable then you need the same structure as you used in the previous lines:
listings.price = listings.price.astype('int64', errors='ignore')
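Note that errors='ignore' makes astype return the original data unchanged when the conversion fails, which may be why the dtype sometimes appears not to change. As a minimal sketch of an alternative, the example below cleans a small made-up price column (the sample values are assumptions, not the Airbnb data) with pd.to_numeric and the nullable Int64 dtype, so missing values survive the conversion.

```python
import pandas as pd

# Made-up sample standing in for the 'price' column of listings.csv.
listings = pd.DataFrame({"price": ["$1,200.00", "$85.00", None]})

# Strip the dollar sign and thousands separators, then parse as numbers.
cleaned = listings["price"].str.replace(r"[\$,]", "", regex=True)
numeric = pd.to_numeric(cleaned, errors="coerce")  # unparseable values become NaN

# The nullable integer dtype keeps missing values as <NA> instead of failing.
listings["price"] = numeric.astype("Int64")
print(listings.dtypes)
```

The plain 'int64' dtype cannot hold missing values, so a column containing NaN needs either the nullable 'Int64' dtype or a fillna step first.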
I have a data set with one column of strings. For each string, I want to find the most common character and put that character in a new column. I also want another column that contains the proportion of the string that the character represents.
The method I want to use on each string is as follows:
sequence = 'ACCCCTGGC'
char_i_want = collections.Counter(sequence).most_common(1)[0] # for the character
value_i_want = collections.Counter(sequence).most_common(1)[1] / len(sequence) # for the proportion
I understand the result of most_common is a tuple, but when I try this in a python shell, I need to do collections.Counter(sequence).most_common(1)[0][0] to access the 0th element of the tuple, the tuple being the 0th element of the returned list. When I tried implementing that, it still didn't work.
Here is how I attempted to do it:
def common_char(sequence):
    return Counter(sequence).most_common(1)[0][0]
def char_freq(sequence):
    return Counter(sequence).most_common(1)[0][1] / len(sequence)
data = pd.read_csv('final_file_noidx.csv')
data['most_common_ref'] = data['REF'].map(lambda x: common_char(x))
data['most_common_ref_frac'] = data['REF'].map(lambda x: char_freq(x))
I am greeted by this error message: TypeError: 'float' object is not iterable
data['most_common_ref'] = data['REF'].map(lambda x: common_char(x), na_action='ignore')
data['most_common_ref_frac'] = data['REF'].map(lambda x: char_freq(x), na_action='ignore')
I needed to ignore NaNs (they are floats, hence the error). Thanks, Andy L.
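For reference, here is a self-contained sketch of the fix on a small made-up frame (the sample rows are assumptions, not the contents of final_file_noidx.csv). With na_action='ignore', map skips missing values instead of passing them to the function.

```python
from collections import Counter

import numpy as np
import pandas as pd


def common_char(sequence):
    # most_common(1) returns a list holding one (character, count) tuple.
    return Counter(sequence).most_common(1)[0][0]


def char_freq(sequence):
    _, count = Counter(sequence).most_common(1)[0]
    return count / len(sequence)


# Made-up data with one missing value, which pandas stores as a float NaN.
data = pd.DataFrame({"REF": ["ACCCCTGGC", np.nan, "GGT"]})

# na_action='ignore' leaves NaN rows as NaN rather than calling the function on them.
data["most_common_ref"] = data["REF"].map(common_char, na_action="ignore")
data["most_common_ref_frac"] = data["REF"].map(char_freq, na_action="ignore")
print(data)
```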
We're trying to create a function that takes some data as input, containing an ID number, a name, and a number of columns with the grades for different assignments. The function should sort the data alphabetically by name and then display it with an added column showing the final grade (which we calculate with another function we made). We've tried writing the following code, but can't get it to work. The error message given is "names = GRADESdata[:,1].tolist() TypeError: string indices must be integers".
Can anyone help us to figure out how to get it working?
def listOfgrades(GRADESdata):
    names = GRADESdata[:,1].tolist()
    names = names.sort(names)
    assignments = GRADESdata[:,2::]
    final_grades = computeFinalGrades(GRADESdata)
    final_grades = np.array(final_grades.reshape(len(final_grades),1))
    List_of_grades = np.hstack((GRADESdata, final_grades))
    NOofColumns = np.size(GRADESdata,axis = 1)
    display = np.zeros(NOofColumns)
    for i in names:
        display = np.vstack((display,GRADESdata[GRADESdata[:,1] == i]))
    grades = display[1::,2:-1]
    gradesfinal = display[1::,-1]
    #Column titles
    c = {"Student ID": GRADESdata[1::,0], "Name": GRADESdata[1::,1]}
    for i in range(GRADESdata.shape[1]):
        c["Assign.{}".format(i+1)] = GRADESdata[:,i]
    c["Final grade"] = final_grades
    d = pd.DataFrame(c)
    print(d.to_string())
    display = np.array([student_list, names, assignments, final_grades])
    return display
The expected output is something like this (with the data below, of course):
ID number Name Assignment 1 Assignment 2 Final Grade
EDIT: the data input is a .csv file containing the following data: ID number, Name, Assignment 1, Assignment 2, etc.
The comma in
names = GRADESdata[:,1].tolist()
is not valid when GRADESdata is a string. A string can only be indexed with an integer or a slice; the [:,1] form works on a NumPy array, not on a str.
From looking at .tolist(), I assume the data structure you're supposed to use is numpy.ndarray.
I managed to replicate the error with the following code:
print("12354"[:,1].tolist())
which makes sense if you're using a file name as input - and that's your mistake.
In order to fix this problem, you need to implement a string parser at the beginning or outside the function.
Add the following to your code at the beginning:
import numpy as np
with open(GRADESdata, "r") as file:
    data = file.read()
list1 = data.split("\n")  # Replace \n with appropriate line separator
list2 = [e.split(",") for e in list1]
GRADESdata = np.array(list2)
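Since the question already uses pandas for display, another option is to let pandas parse the CSV and do the sorting. The sketch below is only an illustration under stated assumptions: the inline CSV text is made up, and the final grade is computed as a plain mean as a placeholder, because the asker's computeFinalGrades function is not shown.

```python
import io

import pandas as pd

# Made-up CSV content in the layout described in the question.
csv_text = (
    "ID number,Name,Assignment 1,Assignment 2\n"
    "s1,Zoe,7,10\n"
    "s2,Adam,4,12\n"
)

# With a real file, pass its path to read_csv instead of a StringIO object.
grades = pd.read_csv(io.StringIO(csv_text))

# Placeholder for computeFinalGrades: the mean of the assignment columns.
assignment_cols = [c for c in grades.columns if c.startswith("Assignment")]
grades["Final Grade"] = grades[assignment_cols].mean(axis=1)

# Sort alphabetically by name and display without the index column.
sorted_grades = grades.sort_values("Name").reset_index(drop=True)
print(sorted_grades.to_string(index=False))
```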
I'm trying to manipulate a list of items in Python, but I'm getting the error "AttributeError: 'list' object has no attribute 'split'".
I understand that a list does not support .split, but I don't know what else to do. Below is a copy-paste of the relevant part of my code.
tourl = 'http://data.bitcoinity.org/chart_data'
tovalues = {'timespan':'24h','resolution':'hour','currency':'USD','exchange':'all','mining_pool':'all','compare':'no','data_type':'price_volume','chart_type':'line_bar','smoothing':'linear','chart_types':'ccacdfcdaa'}
todata = urllib.urlencode(tovalues)
toreq = urllib2.Request(tourl, todata)
tores = urllib2.urlopen(toreq)
tores2 = tores.read()
tos = json.loads(tores2)
tola = tos["data"]
for item in tola:
    ting = item.get("values")
    ting.split(',')[2] <-----ERROR
    print(ting)
To understand what I'm trying to do, you will also need to see the JSON data. ting outputs this:
[
[1379955600000L, 123.107310846774], [1379959200000L, 124.092526428571],
[1379962800000L, 125.539504822835], [1379966400000L, 126.27024617931],
[1379970000000L, 126.723474983766], [1379973600000L, 126.242406356837],
[1379977200000L, 124.788410570987], [1379980800000L, 126.810084904632],
[1379984400000L, 128.270580796748], [1379988000000L, 127.892411269036],
[1379991600000L, 126.140579640523], [1379995200000L, 126.513705084746],
[1379998800000L, 128.695124951923], [1380002400000L, 128.709738051044],
[1380006000000L, 125.987767097378], [1380009600000L, 124.323433535528],
[1380013200000L, 123.359378559603], [1380016800000L, 125.963250678733],
[1380020400000L, 125.074618194444], [1380024000000L, 124.656345088853],
[1380027600000L, 122.411303435449], [1380031200000L, 124.145747100372],
[1380034800000L, 124.359452274881], [1380038400000L, 122.815357211394],
[1380042000000L, 123.057706915888]
]
[
[1379955600000L, 536.4739135], [1379959200000L, 1235.42506637],
[1379962800000L, 763.16329656], [1379966400000L, 804.04579319],
[1379970000000L, 634.84689741], [1379973600000L, 753.52716718],
[1379977200000L, 506.90632968], [1379980800000L, 494.473732950001],
[1379984400000L, 437.02095093], [1379988000000L, 176.25405034],
[1379991600000L, 319.80432715], [1379995200000L, 206.87212398],
[1379998800000L, 638.47226435], [1380002400000L, 438.18036666],
[1380006000000L, 512.68490443], [1380009600000L, 904.603705539997],
[1380013200000L, 491.408088450001], [1380016800000L, 670.275397960001],
[1380020400000L, 767.166941339999], [1380024000000L, 899.976089609997],
[1380027600000L, 1243.64963909], [1380031200000L, 1508.82429811],
[1380034800000L, 1190.18854705], [1380038400000L, 546.504592349999],
[1380042000000L, 206.84883264]
]
And ting[0] outputs this:
[1379955600000L, 123.187067936508]
[1379955600000L, 536.794013499999]
What I'm really trying to do is add up the values from ting[0] through ting[24] that come AFTER the comma in each entry (the second number of each pair). This made me try a split, but that does not work.
You already have a list; the commas are put there by Python to delimit the values only when printing the list.
Just access element 2 directly:
print ting[2]
This prints:
[1379962800000, 125.539504822835]
Each of the entries in item['values'] (so ting) is a list of two values, an integer timestamp and a float, so you can address each of those with index 0 and 1:
>>> print ting[2][0]
1379962800000
>>> print ting[2][1]
125.539504822835
To get a list of all the second values, you could use a list comprehension:
second_vals = [t[1] for t in ting]
When you load the data with json.loads, it is already parsed into a real list that you can slice and index as normal. If you want the data starting with the third element, just use ting[2:]. (If you just want the third element by itself, just use ting[2].)
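To add up the second value of every entry, which is what the question ultimately asks for, the list comprehension can be passed to sum. This is a minimal sketch in Python 3 syntax using only the first three pairs from the question's data; the variable names are illustrative.

```python
# The first three [timestamp, value] pairs from the question's output.
ting = [
    [1379955600000, 123.107310846774],
    [1379959200000, 124.092526428571],
    [1379962800000, 125.539504822835],
]

# Collect the second element of every pair, then add them up.
second_vals = [pair[1] for pair in ting]
total = sum(second_vals)
print(total)
```

In the original loop this would run once per item in tola, giving one total for each series.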