I am trying to show the correlation between two individual lists. Before installing NumPy, I parsed World Bank data for GDP values and the number of internet users and stored them in two separate lists. Here is a snippet of the code; this is just for gdp07. I actually have more lists for more years and other data, such as unemployment.
import numpy as np

file = open('final_gdpnum.txt', 'r')
gdp07 = []
for line in file:
    fields = line.strip().split()
    gdp07.append(fields[0])

file2 = open('internetnum.txt', 'r')
netnum07 = []
for line in file2:
    fields2 = line.strip().split()
    netnum07.append(fields2[0])

print np.correlate(gdp07, netnum07, "full")
The error I get is this:
Traceback (most recent call last):
  File "Project3.py", line 83, in <module>
    print np.correlate(gdp07, netnum07, "full")
  File "/usr/lib/python2.6/site-packages/numpy/core/numeric.py", line 645, in correlate
    return multiarray.correlate2(a, v, mode)
ValueError: data type must provide an itemsize
Just for the record, I am using Cygwin with Python 2.6 on a Windows computer. I am only using NumPy along with its dependencies and other parts of its build (the gcc compiler). Any help would be great. Thanks.
Perhaps that is the error you get when you pass the data in as strings, since according to the Python docs strip() returns a string:
http://docs.python.org/library/stdtypes.html
Try converting the data to the numeric type you want.
As you can see here:
In [14]:np.correlate(["3", "2","1"], [0, 1, 0.5])
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/home/dog/<ipython-input-14-a0b588b9af44> in <module>()
----> 1 np.correlate(["3", "2","1"], [0, 1, 0.5])
/usr/lib64/python2.7/site-packages/numpy/core/numeric.pyc in correlate(a, v, mode, old_behavior)
643 return multiarray.correlate(a,v,mode)
644 else:
--> 645 return multiarray.correlate2(a,v,mode)
646
647 def convolve(a,v,mode='full'):
ValueError: data type must provide an itemsize
Try converting the values:
In [15]: np.correlate([int("3"), int("2"),int("1")], [0, 1, 0.5])
Out[15]: array([ 2.5])
import numpy as np

file = open('final_gdpnum.txt', 'r')
gdp07 = []
for line in file:
    fields = line.strip().split()
    gdp07.append(int(fields[0]))

file2 = open('internetnum.txt', 'r')
netnum07 = []
for line in file2:
    fields2 = line.strip().split()
    netnum07.append(int(fields2[0]))

print np.correlate(gdp07, netnum07, "full")
Your other error is a character encoding problem. I hope this works; I can't reproduce it myself, since I have a Linux box that uses UTF-8 by default.
I went by the IPython help(codecs) documentation:
http://code.google.com/edu/languages/google-python-class/dict-files.html
import codecs

# 'utf-8-sig' decodes UTF-8 and strips a leading BOM if one is present.
f = codecs.open(file, "r", "utf-8-sig")
for line in f:
    fields = line.strip().split()
    gdp07.append(int(fields[0]))
Try casting the data to float; it works for me!
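For example, a minimal sketch of the same parsing loop with a float cast (assuming the same file layout as in the question):

gdp07 = []
for line in open('final_gdpnum.txt'):
    fields = line.strip().split()
    # Cast the first column to float so np.correlate gets numeric data.
    gdp07.append(float(fields[0]))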
I'm trying to read a very large set of nested JSON records into a pandas DataFrame, using the code below. It's a few million records; it's the "review" file from the Yelp academic dataset.
Does anyone know a quicker way to do this?
Is it possible to just load a sample of the json records? I would probably be fine with just a couple hundred thousand records.
Also I probably don't need all the fields from the review.json file, could I just load a subset of them like user_id, business_id, stars? And would that speed things up?
I would post sample data but I can't even get it to finish loading.
Code:
df_review = pd.read_json('dataset/review.json', lines=True)
Update:
Code:
reviews = ''
with open('dataset/review.json','r') as f:
    for line in f.readlines()[0:1000]:
        reviews += line
testdf = pd.read_json(reviews,lines=True)
Error:
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-18-8e4a45990905> in <module>()
5 reviews += line
6
----> 7 testdf = pd.read_json(reviews,lines=True)
/Users/anaconda/lib/python2.7/site-packages/pandas/io/json.pyc in read_json(path_or_buf, orient, typ, dtype, convert_axes, convert_dates, keep_default_dates, numpy, precise_float, date_unit, encoding, lines)
273 # commas and put it in a json list to make a valid json object.
274 lines = list(StringIO(json.strip()))
--> 275 json = u'[' + u','.join(lines) + u']'
276
277 obj = None
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 357: ordinal not in range(128)
Update 2:
import sys
reload(sys)
sys.setdefaultencoding('utf-8')

reviews = ''
with open('dataset/review.json','r') as f:
    for line in f.readlines()[0:1000]:
        reviews += line
testdf = pd.read_json(reviews,lines=True)
If your file has one JSON object per line, as you imply, this might work: just read the first 1000 lines of the file and then parse them with pandas.
import pandas as pd

reviews = ''
with open('dataset/review.json','r') as f:
    for line in f.readlines()[0:1000]:
        reviews += line
pd.read_json(reviews,lines=True)
Speeding up that one line would be challenging, because it's already highly optimized.
I would first check whether you can get fewer rows/less data from the provider, as you mentioned.
If you can preprocess the data, I would recommend parsing the JSON beforehand (even try different parsers; their performance changes with each dataset's structure), saving just the data you need, and then calling the pandas method on that output.
Here you can find a benchmark of JSON parsers; keep in mind that you should test on your own data, and that the article is from 2015.
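A rough sketch of that preprocessing idea, using the standard json module and keeping only the fields named in the question (the sample size of 200,000 and the variable names are assumptions):

import json
import pandas as pd

rows = []
with open('dataset/review.json', 'r') as f:
    for i, line in enumerate(f):
        if i >= 200000:  # stop after a sample you are comfortable with
            break
        record = json.loads(line)
        # Keep only the fields that are actually needed.
        rows.append({key: record[key] for key in ('user_id', 'business_id', 'stars')})

df_review = pd.DataFrame(rows)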
I agree with @Nathan H's proposition, but the precise point probably lies in parallelization.
import pandas as pd

buf_lst = []
chunk_size = 1000

with open('dataset/review.json', 'r') as f:
    lines = f.readlines()
buf_lst += [''.join(lines[x:x + chunk_size]) for x in range(0, len(lines), chunk_size)]

def f(buf):
    return pd.read_json(buf, lines=True)

#### single-thread
df_lst = map(f, buf_lst)

#### multi-thread
import multiprocessing as mp
pool = mp.Pool(4)
df_lst = pool.map(f, buf_lst)
pool.close()   # close() before join(), otherwise join() raises an error
pool.join()
However, I am not sure how best to combine the resulting pandas DataFrames yet.
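For what it's worth, one straightforward way to combine them (assuming df_lst holds the per-chunk DataFrames from above) is pd.concat:

import pandas as pd

# Stack the per-chunk DataFrames into one, renumbering the index.
df_review = pd.concat(df_lst, ignore_index=True)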
I am working on a simple data science project with Python. However, I am getting the following error:
ValueError: could not convert string to float:
Here is what my code looks like:
import matplotlib.pyplot as plt
import csv
from datetime import datetime

filename = 'USAID.csv'
with open(filename) as f:
    reader = csv.reader(f)
    header_row = next(reader)

    monies = []
    for row in reader:
        money = int(row[1])
        monies.append(money)
    print(monies)
If I change the line
money = int(row[1]) to money = float(row[1])
I get this error: ValueError: could not convert string to float:
Here are my tracebacks. First error:
Traceback (most recent call last):
File "funding.py", line 60, in <module>
money = int(row[1])
ValueError: invalid literal for int() with base 10: '42152129.0'
Second Error:
Traceback (most recent call last):
File "funding.py", line 60, in <module>
money = float(row[1])
ValueError: could not convert string to float:
Any help would be great! Thank you!
The first failure is because you passed a string with . in it to int(); you can't convert that to an integer because there is a decimal portion.
The second failure is due to a different row[1] string value; one that is empty.
You could test for that:
if row[1]:
    money = float(row[1])
Since you are working on a data science project, you may want to consider using pandas to load your CSV instead, with pandas.read_csv().
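A minimal sketch of that approach (the file name is the one from the question; coercing with to_numeric is one way to handle the blank entries):

import pandas as pd

df = pd.read_csv('USAID.csv')
# Coerce the second column to numbers; blank entries become NaN instead of raising.
monies = pd.to_numeric(df.iloc[:, 1], errors='coerce')
print(monies.tolist())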
Some of the entries in row[1] are empty, so you probably want to check for those before trying to cast; pass a default value of, say, 0 if the entry is blank.
You should also consider using decimal for computations that relate to money.
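A small sketch of that combination, slotting into the loop from the question (defaulting blank entries to 0 is an assumption about what you want):

from decimal import Decimal

for row in reader:
    # Use Decimal for money values; fall back to 0 when the field is blank.
    money = Decimal(row[1]) if row[1] else Decimal('0')
    monies.append(money)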
I had the same issue while I was learning data visualization using Seaborn. Thanks to EdChum's help, I was able to solve the issue with this approach:
df['col'] = pd.to_numeric(df['col'], errors='coerce')
Given the following script to read in latitude, longitude, and magnitude data:
#!/usr/bin/env python

# Read in latitudes and longitudes
eq_data = open('lat_long')
lats, lons = [], []
for index, line in enumerate(eq_data.readlines()):
    if index > 0:
        lats.append(float(line.split(',')[0]))
        lons.append(float(line.split(',')[1]))

# Build the basemap
from mpl_toolkits.basemap import Basemap
import matplotlib.pyplot as plt
import numpy as np

antmap = Basemap(projection='spstere', boundinglat=-20, lon_0=-60, resolution='f')
antmap.drawcoastlines(color='0.50', linewidth=0.25)
antmap.fillcontinents(color='0.95')

x, y = antmap(lons, lats)
antmap.plot(x, y, 'r^', markersize=4)
plt.show()
I receive the following error when attempting to read in the latitudes, longitudes, and magnitudes:
Traceback (most recent call last):
File "./basic_eqplot.py", line 10, in <module>
lats.append(float(line.split(',')[0]))
ValueError: invalid literal for float(): -18.381 -172.320 5.9
The input file looks something like:
-14.990,167.460,5.6
-18.381,-172.320,5.9
-33.939,-71.868,5.9
-22.742,-63.571,5.9
-2.952,129.219,5.7
Any ideas for why this would cause a hiccup?
It appears you have one or more lines of corrupt data in your input file. Your traceback says as much:
ValueError: invalid literal for float(): -18.381 -172.320 5.9
Specifically what is happening:
The line -18.381 -172.320 5.9 is read in from eq_data.
split(',') is called on the string "-18.381 -172.320 5.9". Since there is no comma in the string, the split method returns a list with a single element, the original string.
You attempt to parse the first element of the returned list as a float. The string "-18.381 -172.320 5.9" cannot be parsed as a float, and a ValueError is raised.
To fix this issue, double check the format of your input data. You might also try surrounding this code snippet in a try/except block to give you a bit more useful information as to the specific source of the problem:
for index, line in enumerate(eq_data.readlines()):
    if index > 0:
        try:
            lats.append(float(line.split(',')[0]))
            lons.append(float(line.split(',')[1]))
        except ValueError:
            raise ValueError("Unable to parse input file line #%d: '%s'" % (index + 1, line))
What is probably going on is that your input file has a malformed line where spaces are used to separate fields instead of commas.
As a consequence, the result of line.split(',')[0] is the whole input line (in your case "-18.381 -172.320 5.9").
More in general: for these types of problems I really like to use the Python csv module to parse the input file:
import csv

with open('lat_long', 'r') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        assert len(row) == 3
        lat, lon, mag = row
        ...
An alternative would be to use tools like pandas; but that might be overkill in some cases.
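If you do go that route, a minimal sketch could look like this (the column names are made up; skiprows=1 mirrors the index > 0 check above):

import pandas as pd

# Read the comma-separated file, skipping the header row.
# Note: a malformed (space-separated) line would still need to be cleaned up first.
df = pd.read_csv('lat_long', skiprows=1, names=['lat', 'lon', 'mag'])
lats, lons = df['lat'].tolist(), df['lon'].tolist()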
I'm trying to make a script which takes all rows starting with 'HELIX', 'SHEET' and 'DBREF' from a .txt file, takes some specific columns from those rows, and then saves the results to a new file.
#!/usr/bin/python
import sys

if len(sys.argv) != 3:
    print("2 Parameters expected: You must introduce your pdb file and a name for output file.")
    exit()

for line in open(sys.argv[1]):
    if 'HELIX' in line:
        helix = line.split()
        cols_h = helix[0], helix[3:6:2], helix[6:9:2]
    elif 'SHEET' in line:
        sheet = line.split()
        cols_s = sheet[0], sheet[4:7:2], sheet[7:10:2], sheet[12:15:2], sheet[16:19:2]
    elif 'DBREF' in line:
        dbref = line.split()
        cols_id = dbref[0], dbref[3:5], dbref[8:10]

modified_data = open(sys.argv[2], 'w')
modified_data.write(cols_id)
modified_data.write(cols_h)
modified_data.write(cols_s)
My problem is that when I try to write my final results, it gives this error:
Traceback (most recent call last):
File "funcional2.py", line 21, in <module>
modified_data.write(cols_id)
TypeError: expected a character buffer object
When I try to convert to a string using ''.join(), it returns another error:
Traceback (most recent call last):
File "funcional2.py", line 21, in <module>
modified_data.write(' '.join(cols_id))
TypeError: sequence item 1: expected string, list found
What am I doing wrong?
Also, if there is some easy way to simplify my code, it'll be great.
PS: I'm no programmer so I'll probably need some explanation if you do something...
Thank you very much.
cols_id, cols_h and cols_s are tuples whose items are strings and lists, not plain strings.
You can only write a string to your file, so you have to convert each of them to a string.
modified_data.write(' '.join(cols_id))
and similar.
'!'.join(a_list_of_things) converts the list into a string, separating each element with an exclamation mark.
EDIT:
#!/usr/bin/python
import sys

if len(sys.argv) != 3:
    print("2 Parameters expected: You must introduce your pdb file and a name for output file.")
    exit()

cols_h, cols_s, cols_id = [], [], []

for line in open(sys.argv[1]):
    if 'HELIX' in line:
        helix = line.split()
        cols_h.append(' '.join([helix[0]] + helix[3:6:2] + helix[6:9:2]) + '\n')
    elif 'SHEET' in line:
        sheet = line.split()
        cols_s.append(' '.join([sheet[0]] + sheet[4:7:2] + sheet[7:10:2] + sheet[12:15:2] + sheet[16:19:2]) + '\n')
    elif 'DBREF' in line:
        dbref = line.split()
        cols_id.append(' '.join([dbref[0]] + dbref[3:5] + dbref[8:10]) + '\n')

modified_data = open(sys.argv[2], 'w')
cols = [cols_id, cols_h, cols_s]
for col in cols:
    modified_data.write(''.join(col))
Here is a solution (untested) that separates data and code a little more. There is a data structure (keyword_and_slices) describing the keywords searched for in the lines, paired with the slices to be taken for the result.
The code then goes through the lines and builds a data structure (keyword2lines) mapping the keyword to the result lines for that keyword.
At the end the collected lines for each keyword are written to the result file.
import sys
from collections import defaultdict


def main():
    if len(sys.argv) != 3:
        print(
            '2 Parameters expected: You must introduce your pdb file'
            ' and a name for output file.'
        )
        sys.exit(1)

    input_filename, output_filename = sys.argv[1:3]

    #
    # Pairs of keywords and slices that should be taken from the line
    # starting with the respective keyword.
    #
    keyword_and_slices = [
        ('HELIX', [slice(3, 6, 2), slice(6, 9, 2)]),
        (
            'SHEET',
            [slice(a, b, 2) for a, b in [(4, 7), (7, 10), (12, 15), (16, 19)]]
        ),
        ('DBREF', [slice(3, 5), slice(8, 10)]),
    ]

    keyword2lines = defaultdict(list)
    with open(input_filename, 'r') as lines:
        for line in lines:
            for keyword, slices in keyword_and_slices:
                if line.startswith(keyword):
                    parts = line.split()
                    result_line = [keyword]
                    for index in slices:
                        result_line.extend(parts[index])
                    keyword2lines[keyword].append(' '.join(result_line) + '\n')

    with open(output_filename, 'w') as out_file:
        for keyword in ['DBREF', 'HELIX', 'SHEET']:
            out_file.writelines(keyword2lines[keyword])


if __name__ == '__main__':
    main()
The code follows your text in checking if a line starts with a keyword, instead of your code, which checks if a keyword is anywhere within a line.
It also makes sure all files are closed properly by using the with statement.
You need to convert the tuple created on the RHS of your assignments to a string.
# Replace this with the statement given below
cols_id = dbref[0], dbref[3:5], dbref[8:10]

# Create a string out of the parts; dbref[3:5] and dbref[8:10] are lists,
# so flatten everything into one list of strings before joining
cols_id = ' '.join([dbref[0]] + dbref[3:5] + dbref[8:10])
I have code in Python to index a text file that contains Arabic words. I tested the code on an English text and it works well, but it gives me an error when I test an Arabic one.
Note: the text file is saved with Unicode encoding, not ANSI.
This is my code:
from whoosh import fields, index
import os.path
import csv
import codecs
from whoosh.qparser import QueryParser

# This list associates a name with each position in a row
columns = ["juza", "chapter", "verse", "voc"]

schema = fields.Schema(juza=fields.NUMERIC,
                       chapter=fields.NUMERIC,
                       verse=fields.NUMERIC,
                       voc=fields.TEXT)

# Create the Whoosh index
indexname = "indexdir"
if not os.path.exists(indexname):
    os.mkdir(indexname)
ix = index.create_in(indexname, schema)

# Open a writer for the index
with ix.writer() as writer:
    with open("h.txt", 'r') as txtfile:
        lines = txtfile.readlines()
        # Read each row in the file
        for i in lines:
            # Create a dictionary to hold the document values for this row
            doc = {}
            thisline = i.split()
            u = 0
            # Read the values for the row enumerated like
            # (0, "juza"), (1, "chapter"), etc.
            for w in thisline:
                # Get the field name from the "columns" list
                fieldname = columns[u]
                u += 1
                #if isinstance(w, basestring):
                #    w = unicode(w)
                doc[fieldname] = w
            # Pass the dictionary to the add_document method
            writer.add_document(**doc)

with ix.searcher() as searcher:
    query = QueryParser("voc", ix.schema).parse(u"بسم")
    results = searcher.search(query)
    print(len(results))
    print(results[1])
The error is:
Traceback (most recent call last):
File "C:\Python27\yarab.py", line 38, in <module>
fieldname = columns[u]
IndexError: list index out of range
This is a sample of the file:
1 1 1 كتاب
1 1 2 قرأ
1 1 3 لعب
1 1 4 كتاب
While I cannot see anything obviously wrong with that, I would make sure you're designing for error. Make sure you catch any situation where split() returns a different number of elements than expected, and handle it promptly (e.g. print and terminate). It looks like you might be dealing with ill-formatted data.
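For example, a minimal defensive check inside the row loop of your script (a sketch using the thisline and columns names from your code; how you handle the bad row is up to you):

thisline = i.split()
if len(thisline) != len(columns):
    # A BOM or a stray token makes the row longer or shorter than expected; report and stop.
    print("Malformed row: %r" % i)
    break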
You missed the Unicode encoding header in your script; the first line should be:
# -*- coding: utf-8 -*-
Also, to open a file with a Unicode encoding, use:
import codecs

with codecs.open("s.txt", encoding='utf-8') as txtfile:
    lines = txtfile.readlines()