My code as it is right now looks like this:
def read_in_movie_preference():
"""Read the move data, and return a
preference dictionary."""
preference = {}
movies = []
# write code here:
file_location="./data/"
f = open(file_location+"preference.csv","r")
df = f.readlines()
#names as keys and prefrences
for line in df:
name = line[1].strip("\n").split(",")
prefs = line[2:].strip("\n").split(",")
preference[line[1]] = line[2:]
#print(test)
#movie names`
movietitles = df[0].strip("\n").split(",")
for movie in movietitles:
movie=movie.rstrip()
#can't seem to get rid of the spaces at the end
movies+=movietitles[2:]
print(movies)
return [movies, preference]
I cant seem to get the movie titles into the list without spaces at the end of some of them & I also cant add the names and preferences into the dictionary... I am supposed to do this task with basic python and no pandas .. very stuck would appreciate any help!
the dictionary would have names as keys and the preference numbers in number format instead of strings so it would theoretically look like this:
key: pref:
dennis, 0 1 0 1 0 ect
[![enter image description here][1]][1]this is what the data set looks like
here is the data pasted:
So the issue here is that you are using rstrip on a copy of the data but never apply it to the original.
The issue
for movie in movietitles:
movie=movie.rstrip() # Changes the (copy) of the data rather than the original
# We still need to apply this back to movietitles
There are a couple ways to fix this!
# Using indexing
for _ in range(len(movietitles)):
movietitles[_] = movietitles[_].rstrip()
Or we can do this inline with list comprehension
# Using list comprehension
movietitles = [movie.rstrip() for movie in movietitles]
As stated in the other answer, when working with csv data it's recomended to use a csv parser, but completely unnecessary for this scale! Hope this helps
Downloading some data from here:
http://insideairbnb.com/get-the-data.html
Then
listings = pd.read_csv('listings.csv')
Trying to change types
listings.bathrooms = listings.bathrooms.astype('int64',errors='ignore')
listings.bedrooms = listings.bedrooms.astype('int64',errors='ignore')
listings.beds = listings.beds.astype('int64',errors='ignore')
listings.price = listings.price.replace('[\$,]','',regex=True).astype('float')
listings.price = listings.price.astype('int64',errors='ignore')
Tried some other combinations but at the end pops error or just doesn't change datatype.
EDIT: corrected some typos
The apostrophes in the last line is not in the correct place and the last one is not the correct type: you need ' instead of ` (maybe it was accidentaly added because of the code block).
So for me it works like this:
listings.price.astype('int64', errors='ignore')
But if you would like to reassign it to the original variable then you need the same structure as you used in the previous lines:
listings.price = listings.price.astype('int64', errors='ignore')
I want to run a package(RAKE) to extract keyphrases from comments(df['CUSTOMER_RECOMMENDATIONS_TRANS]) and create a new column(df['keyphrase_RAKE']) to store them corresponding to each comment. I'm getting an error saying "ValueError: Length of values does not match the length of index".
I know the reason behind the error but don't know how to fix it. What can be done?
keywords return a list of keyphrases.
This the code:
import RAKE
import operator
# Reka setup with stopword directory
stop_dir = "SmartStoplist.txt"
rake_object = RAKE.Rake(stop_dir)
# Sample text to test RAKE
df = pd.read_excel('my.xlsx')
for i in df['CUSTOMER_RECOMMENDATIONS_TRANS']:
keywords = rake_object.run(i)
df['keyphrase_RAKE'] = keywords
you can usepandas.DataFrame.apply and avoid the for loop
df['keyphrase_RAKE'] = df['CUSTOMER_RECOMMENDATIONS_TRANS'].apply(rake_object.run)
CHECK_OUTPUT_HERE
Currently, the output I am getting is in the string format. I am not sure how to convert that string to a pandas dataframe.
I am getting 3 different tables in my output. It is in a string format.
One of the following 2 solutions will work for me:
Convert that string output to 3 different dataframes. OR
Change something in the function so that I get the output as 3 different data frames.
I have tried using RegEx to convert the string output to a dataframe but it won't work in my case since I want my output to be dynamic. It should work if I give another input.
def column_ch(self, sample_count=10):
report = render("header.txt")
match_stats = []
match_sample = []
any_mismatch = False
for column in self.column_stats:
if not column["all_match"]:
any_mismatch = True
match_stats.append(
{
"Column": column["column"],
"{} dtype".format(self.df1_name): column["dtype1"],
"{} dtype".format(self.df2_name): column["dtype2"],
"# Unequal": column["unequal_cnt"],
"Max Diff": column["max_diff"],
"# Null Diff": column["null_diff"],
}
)
if column["unequal_cnt"] > 0:
match_sample.append(
self.sample_mismatch(column["column"], sample_count, for_display=True)
)
if any_mismatch:
for sample in match_sample:
report += sample.to_string()
report += "\n\n"
print("type is", type(report))
return report
Since you have a string, you can pass your string into a file-like buffer and then read it with pandas read_csv into a dataframe.
Assuming that your string with the dataframe is called dfstring, the code would look like this:
import io
bufdf = io.StringIO(dfstring)
df = pd.read_csv(bufdf, sep=???)
If your string contains multiple dataframes, split it with split and use a loop.
import io
dflist = []
for sdf in dfstring.split('\n\n'): ##this seems the separator between two dataframes
bufdf = io.StringIO(sdf)
dflist.append(pd.read_csv(bufdf, sep=???))
Be careful to pass an appropriate sep parameter, my ??? means that I am not able to understand what could be a proper parameter. Your field are separated by spaces, so you could use sep='\s+') but I see that you have also spaces which are not meant to be a separator, so this may cause a parsing error.
sep accept regex, so to have 2 consecutive spaces as a separator, you could do: sep='\s\s+' (this will require an additional parameter engine='python'). But again, be sure that you have at least 2 spaces between two consecutive fields.
See here for reference about the io module and StringIO.
Note that the io module exists in python3 but not in python2 (it has another name) but since the latest pandas versions require python3, I guess you are using python3.
I'm trying to manipulate a list of items in python but im getting the error "AttributeError: 'list' object has no attribute 'split'"
I understand that list does not understand .split but i don't know what else to do. Below is a copy paste of the relevant part of my code.
tourl = 'http://data.bitcoinity.org/chart_data'
tovalues = {'timespan':'24h','resolution':'hour','currency':'USD','exchange':'all','mining_pool':'all','compare':'no','data_type':'price_volume','chart_type':'line_bar','smoothing':'linear','chart_types':'ccacdfcdaa'}
todata = urllib.urlencode(tovalues)
toreq = urllib2.Request(tourl, todata)
tores = urllib2.urlopen(toreq)
tores2 = tores.read()
tos = json.loads(tores2)
tola = tos["data"]
for item in tola:
ting = item.get("values")
ting.split(',')[2] <-----ERROR
print(ting)
To understand what i'm trying to do you will also need to see the json data. Ting outputs this:
[
[1379955600000L, 123.107310846774], [1379959200000L, 124.092526428571],
[1379962800000L, 125.539504822835], [1379966400000L, 126.27024617931],
[1379970000000L, 126.723474983766], [1379973600000L, 126.242406356837],
[1379977200000L, 124.788410570987], [1379980800000L, 126.810084904632],
[1379984400000L, 128.270580796748], [1379988000000L, 127.892411269036],
[1379991600000L, 126.140579640523], [1379995200000L, 126.513705084746],
[1379998800000L, 128.695124951923], [1380002400000L, 128.709738051044],
[1380006000000L, 125.987767097378], [1380009600000L, 124.323433535528],
[1380013200000L, 123.359378559603], [1380016800000L, 125.963250678733],
[1380020400000L, 125.074618194444], [1380024000000L, 124.656345088853],
[1380027600000L, 122.411303435449], [1380031200000L, 124.145747100372],
[1380034800000L, 124.359452274881], [1380038400000L, 122.815357211394],
[1380042000000L, 123.057706915888]
]
[
[1379955600000L, 536.4739135], [1379959200000L, 1235.42506637],
[1379962800000L, 763.16329656], [1379966400000L, 804.04579319],
[1379970000000L, 634.84689741], [1379973600000L, 753.52716718],
[1379977200000L, 506.90632968], [1379980800000L, 494.473732950001],
[1379984400000L, 437.02095093], [1379988000000L, 176.25405034],
[1379991600000L, 319.80432715], [1379995200000L, 206.87212398],
[1379998800000L, 638.47226435], [1380002400000L, 438.18036666],
[1380006000000L, 512.68490443], [1380009600000L, 904.603705539997],
[1380013200000L, 491.408088450001], [1380016800000L, 670.275397960001],
[1380020400000L, 767.166941339999], [1380024000000L, 899.976089609997],
[1380027600000L, 1243.64963909], [1380031200000L, 1508.82429811],
[1380034800000L, 1190.18854705], [1380038400000L, 546.504592349999],
[1380042000000L, 206.84883264]
]
And ting[0] outputs this:
[1379955600000L, 123.187067936508]
[1379955600000L, 536.794013499999]
What i'm really trying to do is add up the values from ting[0-24] that comes AFTER the second comma. This made me try to do a split but that does not work
You already have a list; the commas are put there by Python to delimit the values only when printing the list.
Just access element 2 directly:
print ting[2]
This prints:
[1379962800000, 125.539504822835]
Each of the entries in item['values'] (so ting) is a list of two float values, so you can address each of those with index 0 and 1:
>>> print ting[2][0]
1379962800000
>>> print ting[2][1]
125.539504822835
To get a list of all the second values, you could use a list comprehension:
second_vals = [t[1] for t in ting]
When you load the data with json.loads, it is already parsed into a real list that you can slice and index as normal. If you want the data starting with the third element, just use ting[2:]. (If you just want the third element by itself, just use ting[2].)