Update values in new column - python

I want to run a package (RAKE) to extract keyphrases from comments (df['CUSTOMER_RECOMMENDATIONS_TRANS']) and create a new column (df['keyphrase_RAKE']) to store them alongside each comment. I'm getting an error saying "ValueError: Length of values does not match the length of index".
I know the reason behind the error but don't know how to fix it. What can be done?
keywords is a list of keyphrases.
This is the code:
import RAKE
import operator
import pandas as pd
# RAKE setup with stopword directory
stop_dir = "SmartStoplist.txt"
rake_object = RAKE.Rake(stop_dir)
# Load the comments to run RAKE on
df = pd.read_excel('my.xlsx')
for i in df['CUSTOMER_RECOMMENDATIONS_TRANS']:
    keywords = rake_object.run(i)
    df['keyphrase_RAKE'] = keywords

You can call apply on the column (pandas.Series.apply) and avoid the for loop:
df['keyphrase_RAKE'] = df['CUSTOMER_RECOMMENDATIONS_TRANS'].apply(rake_object.run)
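A minimal sketch of the same pattern, runnable without RAKE or the Excel file. The function extract_keyphrases and the two sample comments are made up for illustration and merely stand in for rake_object.run and the real data:

```python
import pandas as pd


def extract_keyphrases(text):
    # Hypothetical stand-in for rake_object.run: returns a list per comment
    return [word for word in text.lower().split() if len(word) > 4]


# Made-up sample comments (assumption, not the asker's data)
df = pd.DataFrame(
    {"CUSTOMER_RECOMMENDATIONS_TRANS": ["Faster delivery please", "Lower prices"]}
)

# apply calls the function once per row and stores one list per row,
# so the new column always has the same length as the index
df["keyphrase_RAKE"] = df["CUSTOMER_RECOMMENDATIONS_TRANS"].apply(extract_keyphrases)
print(df)
```

Because each row receives exactly one return value, the length mismatch from assigning a single list to the whole column cannot occur.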

Related

Pandas - Can't change datatype of dataframe columns

I downloaded some data from here:
http://insideairbnb.com/get-the-data.html
Then I loaded it:
listings = pd.read_csv('listings.csv')
and tried to change the types:
listings.bathrooms = listings.bathrooms.astype('int64',errors='ignore')
listings.bedrooms = listings.bedrooms.astype('int64',errors='ignore')
listings.beds = listings.beds.astype('int64',errors='ignore')
listings.price = listings.price.replace('[\$,]','',regex=True).astype('float')
listings.price = listings.price.astype('int64',errors='ignore')
I tried some other combinations, but in the end either an error pops up or the datatype simply doesn't change.
EDIT: corrected some typos
The apostrophes in the last line are not in the correct place, and the last one is not the correct type: you need ' instead of ` (maybe it was accidentally added because of the code block).
So for me it works like this:
listings.price.astype('int64', errors='ignore')
But if you would like to reassign it to the original variable then you need the same structure as you used in the previous lines:
listings.price = listings.price.astype('int64', errors='ignore')
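One possible reason a column keeps its old dtype is that astype with errors='ignore' suppresses the exception and hands back the original object, for example when the column contains missing values, which plain int64 cannot hold. The sketch below uses a small made-up frame (not the real listings.csv) and pandas' nullable "Int64" dtype as one way around that:

```python
import pandas as pd

# Made-up sample mimicking two columns from the question (assumption)
listings = pd.DataFrame(
    {
        "bedrooms": [1.0, None, 3.0],
        "price": ["$1,250.00", "$80.00", "$99.00"],
    }
)

# Plain int64 cannot represent missing values; the nullable "Int64" dtype can
listings["bedrooms"] = pd.to_numeric(listings["bedrooms"], errors="coerce").astype("Int64")

# Strip "$" and "," first, then convert to float and finally to int64
listings["price"] = (
    listings["price"].replace(r"[\$,]", "", regex=True).astype("float").astype("int64")
)

print(listings.dtypes)
```

Leaving out errors='ignore' while debugging also helps, since the raised exception then says which value could not be converted.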

Display list sorted alphabetically

We're trying to create a function that takes some data as input (an ID number, a name, and a number of columns containing the grades for different assignments), sorts the data alphabetically by name, and then displays it with an added column showing the final grade (which we calculate with another function we made). We've tried writing the following code, but can't get it to work. The error message given is "names = GRADESdata[:,1].tolist() TypeError: string indices must be integers".
Can anyone help us to figure out how to get it working?
def listOfgrades(GRADESdata):
    names = GRADESdata[:,1].tolist()
    names = names.sort(names)
    assignments = GRADESdata[:,2::]
    final_grades = computeFinalGrades(GRADESdata)
    final_grades = np.array(final_grades.reshape(len(final_grades),1))
    List_of_grades = np.hstack((GRADESdata, final_grades))
    NOofColumns = np.size(GRADESdata,axis = 1)
    display = np.zeros(NOofColumns)
    for i in names:
        display = np.vstack((display,GRADESdata[GRADESdata[:,1] == i]))
    grades = display[1::,2:-1]
    gradesfinal = display[1::,-1]
    #Column titles
    c = {"Student ID": GRADESdata[1::,0], "Name": GRADESdata[1::,1]}
    for i in range(GRADESdata.shape[1]):
        c["Assign.{}".format(i+1)] = GRADESdata[:,i]
    c["Final grade"] = final_grades
    d = pd.DataFrame(c)
    print(d.to_string())
    display = np.array([student_list, names, assignments, final_grades])
    return display
The expected output is something like this (with the data below, of course):
ID number Name Assignment 1 Assignment 2 Final Grade
EDIT: the data input is a .csv file containing the following data: ID number, Name, Assignment 1, Assignment 2, etc.
The comma in
names = GRADESdata[:,1].tolist()
is not valid when GRADESdata is a string: a string index must be an integer or a slice, not a comma-separated pair.
From looking at .tolist(), I assume the data structure you're supposed to use is numpy.ndarray.
I managed to replicate the error with the following code:
print("12354"[:,1].tolist())
which makes sense if you're using a file name as input - and that's your mistake.
In order to fix this problem, you need to implement a string parser at the beginning or outside the function.
Add the following to your code at the beginning:
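# GRADESdata holds the file name at this point
with open(GRADESdata, "r") as file:
    data = file.read()
# strip() drops the trailing newline so no empty last row is created
list1 = data.strip().split("\n")  # Replace \n with the appropriate line separator
list2 = [e.split(",") for e in list1]
GRADESdata = np.array(list2)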
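As an alternative to splitting the text by hand, the standard csv module can do the parsing. The following is only a sketch under assumptions: the helper name load_sorted_by_name and the sample rows are invented, and the name is assumed to sit in the second column as described in the question:

```python
import csv
import io


def load_sorted_by_name(csv_text):
    """Parse CSV text and return (header, rows sorted by the Name column)."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    rows = [row for row in reader if row]  # skip blank lines
    rows.sort(key=lambda row: row[1].lower())  # column 1 holds the name
    return header, rows


# Made-up sample in the layout from the question (assumption)
sample = "ID number,Name,Assignment 1,Assignment 2\ns01,Zoe,7,10\ns02,Adam,4,12\n"

header, rows = load_sorted_by_name(sample)
print(header)
for row in rows:
    print(row)
```

For a real file, the same reader can be fed an open file object instead of io.StringIO.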

List items to pandas columns

I have a list with 4 URLs:
['https://cache.wihaben.at/mmo/6/297/469/806_-1094197631.jpg', 'https://cache.wihaben.at/mmo/6/297/469/806_-455156804.jpg', 'https://cache.wihaben.at/mmo/6/297/469/806_466214286.jpg', 'https://cache.wihaben.at/mmo/6/297/469/806_1475201828.jpg']
and I want to build a Pandas dataframe which should have Image_1, Image_2, Image_3 and Image_4 as column names and the URLs as row values.
My code:
advert_images = {('Image_1', eval(advert_image_list[0])),
                 ('Image_2', eval(advert_image_list[1])),
                 ('Image_3', eval(advert_image_list[2])),
                 ('Image_4', eval(advert_image_list[3])),
                 }
adIm_DF = pd.DataFrame(advert_images)
returns this error:
  File "<string>", line 1
    https://cache.wihaben.at/mmo/6/297/469/806_-1094197631.jpg
         ^
SyntaxError: invalid syntax
Evaluation gets stuck on the ":" in the URL, probably because it is parsing it as a dict.
I also need a way to iterate over any number of URLs in the list and build the corresponding columns with values,
with the columns being Image_(iterator_value) and the row being the URL value.
If the URLs are stored as strings (as @Tox pointed out), I have no problem with this code:
url_list = ['https://cache.wihaben.at/mmo/6/297/469/806_-1094197631.jpg', 'https://cache.wihaben.at/mmo/6/297/469/806_-455156804.jpg', 'https://cache.wihaben.at/mmo/6/297/469/806_466214286.jpg', 'https://cache.wihaben.at/mmo/6/297/469/806_1475201828.jpg']
im_labels = ['Image_{}'.format(x) for x in range(1, len(url_list) + 1)]
im_df = pd.DataFrame([url_list], columns=im_labels)
You should make a string of the URL:
str(advert_image_list[0])
I think you are confusing the use of eval. It is used to run code that is saved in a string. In your example Python tries to run the URL as code, which will obviously not work. You do not need eval.
Try this:
advert_image_list = ['https://cache.willhaben.at/mmo/6/297/469/806_-1094197631.jpg', 'https://cache.willhaben.at/mmo/6/297/469/806_-455156804.jpg', 'https://cache.willhaben.at/mmo/6/297/469/806_466214286.jpg', 'https://cache.willhaben.at/mmo/6/297/469/806_1475201828.jpg']
advert_images = [('Image_1', advert_image_list[0]),
                 ('Image_2', advert_image_list[1]),
                 ('Image_3', advert_image_list[2]),
                 ('Image_4', advert_image_list[3])]
adIm_DF = pd.DataFrame(advert_images).set_index(0).T
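To make the point about eval concrete, here is a small sketch that needs nothing beyond the standard library. The variable names eval_failed and image_columns are invented for illustration:

```python
url = "https://cache.willhaben.at/mmo/6/297/469/806_-1094197631.jpg"

# eval() treats its argument as Python source code, so an expression string works
print(eval("1 + 2"))

# A URL is not valid Python source, so eval() raises SyntaxError
try:
    eval(url)
    eval_failed = False
except SyntaxError:
    eval_failed = True

print(eval_failed)

# The URL is already a string, so it can be used directly as a value
image_columns = {"Image_1": url}
print(image_columns)
```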
This works for me:
df = pd.DataFrame(columns=['Image1','Image2','Image3','Image4'])
df.loc[0] = ['https://cache.wihaben.at/mmo/6/297/469/806_-1094197631.jpg', 'https://cache.wihaben.at/mmo/6/297/469/806_-455156804.jpg', 'https://cache.wihaben.at/mmo/6/297/469/806_466214286.jpg', 'https://cache.wihaben.at/mmo/6/297/469/806_1475201828.jpg']

Splitting json data in python

I'm trying to manipulate a list of items in Python, but I'm getting the error "AttributeError: 'list' object has no attribute 'split'".
I understand that a list does not support .split, but I don't know what else to do. Below is a copy-paste of the relevant part of my code.
tourl = 'http://data.bitcoinity.org/chart_data'
tovalues = {'timespan':'24h','resolution':'hour','currency':'USD','exchange':'all','mining_pool':'all','compare':'no','data_type':'price_volume','chart_type':'line_bar','smoothing':'linear','chart_types':'ccacdfcdaa'}
todata = urllib.urlencode(tovalues)
toreq = urllib2.Request(tourl, todata)
tores = urllib2.urlopen(toreq)
tores2 = tores.read()
tos = json.loads(tores2)
tola = tos["data"]
for item in tola:
    ting = item.get("values")
    ting.split(',')[2]  # <----- ERROR
    print(ting)
To understand what I'm trying to do you will also need to see the JSON data. ting outputs this:
[
[1379955600000L, 123.107310846774], [1379959200000L, 124.092526428571],
[1379962800000L, 125.539504822835], [1379966400000L, 126.27024617931],
[1379970000000L, 126.723474983766], [1379973600000L, 126.242406356837],
[1379977200000L, 124.788410570987], [1379980800000L, 126.810084904632],
[1379984400000L, 128.270580796748], [1379988000000L, 127.892411269036],
[1379991600000L, 126.140579640523], [1379995200000L, 126.513705084746],
[1379998800000L, 128.695124951923], [1380002400000L, 128.709738051044],
[1380006000000L, 125.987767097378], [1380009600000L, 124.323433535528],
[1380013200000L, 123.359378559603], [1380016800000L, 125.963250678733],
[1380020400000L, 125.074618194444], [1380024000000L, 124.656345088853],
[1380027600000L, 122.411303435449], [1380031200000L, 124.145747100372],
[1380034800000L, 124.359452274881], [1380038400000L, 122.815357211394],
[1380042000000L, 123.057706915888]
]
[
[1379955600000L, 536.4739135], [1379959200000L, 1235.42506637],
[1379962800000L, 763.16329656], [1379966400000L, 804.04579319],
[1379970000000L, 634.84689741], [1379973600000L, 753.52716718],
[1379977200000L, 506.90632968], [1379980800000L, 494.473732950001],
[1379984400000L, 437.02095093], [1379988000000L, 176.25405034],
[1379991600000L, 319.80432715], [1379995200000L, 206.87212398],
[1379998800000L, 638.47226435], [1380002400000L, 438.18036666],
[1380006000000L, 512.68490443], [1380009600000L, 904.603705539997],
[1380013200000L, 491.408088450001], [1380016800000L, 670.275397960001],
[1380020400000L, 767.166941339999], [1380024000000L, 899.976089609997],
[1380027600000L, 1243.64963909], [1380031200000L, 1508.82429811],
[1380034800000L, 1190.18854705], [1380038400000L, 546.504592349999],
[1380042000000L, 206.84883264]
]
And ting[0] outputs this:
[1379955600000L, 123.187067936508]
[1379955600000L, 536.794013499999]
What I'm really trying to do is add up the values from ting[0] to ting[24] that come AFTER the second comma. This made me try a split, but that does not work.
You already have a list; the commas are put there by Python to delimit the values only when printing the list.
Just access element 2 directly:
print ting[2]
This prints:
[1379962800000, 125.539504822835]
Each of the entries in item['values'] (so ting) is a list of two numeric values (a timestamp and a float), so you can address each of those with index 0 and 1:
>>> print ting[2][0]
1379962800000
>>> print ting[2][1]
125.539504822835
To get a list of all the second values, you could use a list comprehension:
second_vals = [t[1] for t in ting]
When you load the data with json.loads, it is already parsed into a real list that you can slice and index as normal. If you want the data starting with the third element, just use ting[2:]. (If you just want the third element by itself, just use ting[2].)
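Since the goal in the question was to add up the second values, the list comprehension above can be fed straight into sum. This sketch uses Python 3 syntax and only the first three pairs from the question's output, written without the Python 2 L suffix:

```python
# First three [timestamp, value] pairs from the question's output
ting = [
    [1379955600000, 123.107310846774],
    [1379959200000, 124.092526428571],
    [1379962800000, 125.539504822835],
]

# Take the second value of each pair and add them up
second_vals = [t[1] for t in ting]
total = sum(second_vals)
print(total)

# Slicing limits the sum to a sub-range, here from the third entry onwards
print(sum(t[1] for t in ting[2:]))
```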

use slice in for loop to build a list

I would like to build up a list using a for loop and am trying to use a slice notation. My desired output would be a list with the structure:
known_result[i] = (record.query_id, (align.title, align.title,align.title....))
However I am having trouble getting the slice operator to work:
knowns = "output.xml"
i=0
for record in NCBIXML.parse(open(knowns)):
    known_results[i] = record.query_id
    known_results[i][1] = (align.title for align in record.alignment)
    i+=1
which results in:
list assignment index out of range.
I am iterating through a series of sequences using BioPython's NCBIXML module but the problem is adding to the list. Does anyone have an idea on how to build up the desired list either by changing the use of the slice or through another method?
Thanks, zach cp
(crossposted at Biostar)
You cannot assign a value to a list at an index that doesn't exist. The way to add an element (at the end of the list, which is the common use case) is to use the .append method of the list.
In your case, the lines
known_results[i] = record.query_id
known_results[i][1] = (align.title for align in record.alignment)
should probably be changed to
element = (record.query_id, tuple(align.title for align in record.alignment))
known_results.append(element)
Warning: The code above is untested, so might contain bugs. But the idea behind it should work.
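The append idea can be tried without BioPython or the XML file. In this sketch the Record and Alignment namedtuples and their contents are hypothetical stand-ins for the parsed BLAST records, not the real NCBIXML objects:

```python
from collections import namedtuple

# Hypothetical stand-ins for the parsed records (assumption)
Alignment = namedtuple("Alignment", ["title"])
Record = namedtuple("Record", ["query_id", "alignments"])

records = [
    Record("query_1", [Alignment("hit A"), Alignment("hit B")]),
    Record("query_2", [Alignment("hit C")]),
]

known_results = []  # start empty and grow with append instead of indexing
for record in records:
    titles = tuple(align.title for align in record.alignments)
    known_results.append((record.query_id, titles))

print(known_results)
```

No counter variable is needed, because append always adds to the end of the list.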
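Use:
known_results = []
for record in NCBIXML.parse(open(knowns)):
    # Append a mutable [query_id, titles] pair; a tuple could not be assigned to afterwards
    known_results.append([record.query_id, None])
    known_results[-1][1] = tuple(align.title for align in record.alignment)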
If I understand you correctly, you want to assign one or more matching align.title values to every record.query_id. So I guess your query_ids are unique and each unique ID is related to some titles. If so, I would suggest a dictionary instead of a list.
A dictionary consists of a key (e.g. record.query_id) and value(s) (e.g. a list of align.title).
catalog = {}
for record in NCBIXML.parse(open(knowns)):
    catalog[record.query_id] = [align.title for align in record.alignment]
To access this catalog you could either iterate through it:
for query_id in catalog:
    print(catalog[query_id])  # prints the title list for the current key
or you could access an entry directly if you know what you're looking for:
query_id = XYZ_Whatever
print(catalog[query_id])
