I have a number of tab-delimited files in my database that have 7 columns with the headers 'name', 'taxID', 'taxRank', 'genomesize', 'numReads', 'numUniqueReads', and 'abundance'. I would like to write a program that takes one file at a time as a generic argument (e.g. via sys.argv) and keeps only columns 0, 1, and 4 (name, taxID, and numReads). I'm trying (very poorly) to do this in Python.
import sys
import pandas as pd

with open(sys.argv[1], 'r') as f:
    rows = (line.split('\t') for line in f)
    d = {row[0]: row[1:] for row in rows}

df = pd.DataFrame(d)
df_prekrona = df.drop(['taxRank', 'genomesize', 'numUniqueReads', 'abundance'], axis=1)
After running the script I got the error:
Traceback (most recent call last):
File "open_file.py", line 19, in <module>
df.drop(['taxRank','genomesize','numUniqueReads','abundance'], axis = 1)
File "/software/7/apps/python_3/3.6.6/lib/python3.6/site-packages/pandas/core/frame.py", line 3697, in drop
errors=errors)
File "/software/7/apps/python_3/3.6.6/lib/python3.6/site-packages/pandas/core/generic.py", line 3108, in drop
obj = obj._drop_axis(labels, axis, level=level, errors=errors)
File "/software/7/apps/python_3/3.6.6/lib/python3.6/site-packages/pandas/core/generic.py", line 3140, in _drop_axis
new_axis = axis.drop(labels, errors=errors)
File "/software/7/apps/python_3/3.6.6/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 4388, in drop
'labels %s not contained in axis' % labels[mask])
KeyError: "labels ['taxRank' 'genomesize' 'numUniqueReads' 'abundance'] not contained in axis"
This tells me I have to define my columns, but I am not sure how to do that. Any help is greatly appreciated!
Perhaps you could try:
df = pd.read_csv(f, sep='\t', usecols=["name", "taxID", "numReads"])
usecols takes a list of column headers and creates a subset of the data using only those columns; since your files are tab-delimited, pass sep='\t' as well.
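For what it's worth, a minimal end-to-end sketch along those lines might look like this (the output filename is just an illustration, not something from your setup):

import sys
import pandas as pd

# Read only the three wanted columns from the tab-delimited input file.
df = pd.read_csv(sys.argv[1], sep='\t', usecols=['name', 'taxID', 'numReads'])

# Write the pared-down table next to the input, e.g. sample.tsv -> sample_prekrona.tsv
out_path = sys.argv[1].rsplit('.', 1)[0] + '_prekrona.tsv'
df.to_csv(out_path, sep='\t', index=False)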
Seeking your assistance with this issue; I have tried many variations of the syntax but still get the same error. I have multiple CSV files to convert and I am pulling the same data from each; the script works on one of my CSV files but not on the other. Looking forward to your feedback. Thank you very much.
My code:
import os
import pandas as pd

directory = 'C:/path'
ext = ('.csv')

for filename in os.listdir(directory):
    f = os.path.join(directory, filename)
    if f.endswith(ext):
        head_tail = os.path.split(f)
        head_tail1 = 'C:/path'
        k = head_tail[1]
        r = k.split(".")[0]
        p = head_tail1 + "/" + r + " - Revised.csv"
        mydata = pd.read_csv(f)

        # to pull columns and values
        new = mydata[["A", "Room", "C", "D"]]
        new = new.rename(columns={'D': 'Qty. of Parts'})
        new['Qty. of Parts'] = 1
        new.to_csv(p, index=False)

        # to merge columns and values
        merge_columns = ['A', 'Room', 'C']
        merged_col = ''.join(merge_columns).replace('ARoomC', 'F')
        new[merged_col] = new[merge_columns].apply(lambda x: '.'.join(x), axis=1)
        new.drop(merge_columns, axis=1, inplace=True)
        new = new.groupby(merged_col).count().reset_index()
        new.to_csv(p, index=False)
The error I get:
Traceback (most recent call last):
File "C:Path\MyProject.py", line 34, in <module>
new[merged_col] = new[merge_columns].apply(lambda x: '.'.join(x), axis=1)
File "C:Path\MyProject.py", line 9565, in apply
return op.apply().__finalize__(self, method="apply")
File "C:Path\MyProject.py", line 746, in apply
return self.apply_standard()
File "C:Path\MyProject.py", line 873, in apply_standard
results, res_index = self.apply_series_generator()
File "C:Path\MyProject.py", line 889, in apply_series_generator
results[i] = self.f(v)
File "C:Path\MyProject.py", line 34, in <lambda>
new[merged_col] = new[merge_columns].apply(lambda x: '.'.join(x), axis=1)
TypeError: sequence item 1: expected str instance, int found
It's hard to say what you're trying to achieve without a sample of your data. But in any case, to fix the error you need to cast each value to a string with str inside the function you pass to apply:
new[merged_col] = new[merge_columns].apply(lambda x: '.'.join(map(str, x)), axis=1)
Or, you can convert the columns first with pandas.DataFrame.astype:
new[merged_col] = new[merge_columns].astype(str).apply(lambda x: '.'.join(x), axis=1)
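As a quick illustration on made-up data (the column names and values below are assumed for the example, not taken from your files), casting to str first makes the join work even when one of the merge columns holds integers:

import pandas as pd

# Hypothetical sample: 'Room' holds integers, which is what triggers the join error.
new = pd.DataFrame({'A': ['x', 'y'], 'Room': [1, 2], 'C': ['a', 'b']})
merge_columns = ['A', 'Room', 'C']

new['F'] = new[merge_columns].astype(str).apply(lambda x: '.'.join(x), axis=1)
print(new['F'].tolist())  # ['x.1.a', 'y.2.b']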
I am getting a memory error while creating a dataframe. I am reading a gzipped file from S3 and loading the decompressed bytes into a dataframe, but I keep getting a MemoryError. Could you please help me avoid this, or tell me what changes I can make to my code?
My code:
import datetime
import gzip
from collections import OrderedDict
from io import BytesIO

import boto3
import pandas as pd

list_table = []
for table in d:
    dict_table = OrderedDict()
    s_time = datetime.datetime.now().strftime("%H:%M:%S")
    print("start_time--->>", s_time)
    print("tablename--->>", table)

    s3 = boto3.resource('s3')
    key = 'raw/vs-1/load-1619/data' + '/' + table
    obj = s3.Object('*******', key)
    n = obj.get()['Body'].read()

    gzipfile = BytesIO(n)
    gzipfile = gzip.GzipFile(fileobj=gzipfile)
    content = gzipfile.read()
    #print(content)
    content_str = content.decode('utf-8')

    df1 = pd.DataFrame([x.split(',') for x in str(content_str).split('\n')])
    #print(df1)
    #count = os.popen('aws s3 cp s3://itx-agu-lake/raw/vs-1/load-1619/data/{0} - | wc -l'.format(table)).read()
    count = int(len(df1)) - 2
    del(df1)

    e_time = datetime.datetime.now().strftime("%H:%M:%S")
    print("End_time---->>", e_time)
    print(count)

    dict_table['Table_Name'] = str(table)
    dict_table['Count'] = count
    list_table.append(dict_table)
I am getting the memory error on the line below:
df1 = pd.DataFrame([x.split(',') for x in str(content_str).split('\n')])
Error:
Traceback (most recent call last):
File "ravi_sir.py", line 45, in <module>
df1 = pd.DataFrame([x.split(',') for x in str(content_str).split('\n')])
File "/app/python3/lib/python3.6/site-packages/pandas/core/frame.py", line 520, in __init__
mgr = arrays_to_mgr(arrays, columns, index, columns, dtype=dtype)
File "/app/python3/lib/python3.6/site-packages/pandas/core/internals/construction.py", line 93, in arrays_to_mgr
return create_block_manager_from_arrays(arrays, arr_names, axes)
File "/app/python3/lib/python3.6/site-packages/pandas/core/internals/managers.py", line 1650, in create_block_manager_from_arrays
blocks = form_blocks(arrays, names, axes)
File "/app/python3/lib/python3.6/site-packages/pandas/core/internals/managers.py", line 1739, in form_blocks
object_blocks = _simple_blockify(items_dict["ObjectBlock"], np.object_)
File "/app/python3/lib/python3.6/site-packages/pandas/core/internals/managers.py", line 1784, in _simple_blockify
values, placement = _stack_arrays(tuples, dtype)
File "/app/python3/lib/python3.6/site-packages/pandas/core/internals/managers.py", line 1830, in _stack_arrays
stacked = np.empty(shape, dtype=dtype)
MemoryError
Does it help to use the pandas Series string split method (.str.split)?
import pandas as pd

# a sample string
content_str = 'a,b,c,d\nd,e,f,g\nh,i,j,k'
content_str = content_str.split('\n')

df1 = pd.DataFrame(content_str)
df1 = df1[0].str.split(',', expand=True)
Posted here instead of the comments because it isn't pretty to post code there.
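If it helps, here is a rough sketch of how that could slot into your loop in place of the list-comprehension line (variable names taken from your code, untested against real S3 data). It avoids building the intermediate Python list of lists; whether that lowers the peak memory enough is something you would have to measure:

content_str = content.decode('utf-8')

# One column of raw lines, then a vectorised split instead of a Python list of lists.
df1 = pd.DataFrame(content_str.split('\n'))
df1 = df1[0].str.split(',', expand=True)

count = len(df1) - 2
del df1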
So I want to get the monthly sum with my script, but I always get an AttributeError, which I don't understand. The column Timestamp does indeed exist in my combined_csv. I know for sure that this line is causing the problem, since I tested all of my other code beforehand.
AttributeError: 'DataFrame' object has no attribute 'Timestamp'
I'd appreciate any kind of help I can get - thanks!
import os
import glob
import pandas as pd
# set working directory
os.chdir("Path to CSVs")
# find all csv files in the folder
# use glob pattern matching -> extension = 'csv'
# save result in list -> all_filenames
extension = 'csv'
all_filenames = [i for i in glob.glob('*.{}'.format(extension))]
# print(all_filenames)
# combine all files in the list
combined_csv = pd.concat([pd.read_csv(f, sep=';') for f in all_filenames])
# Format CSV
# Transform Timestamp column into datetime
combined_csv['Timestamp'] = pd.to_datetime(combined_csv.Timestamp)
# Read out first entry of every day of every month
combined_csv = round(combined_csv.resample('D', on='Timestamp')['HtmDht_Energy'].agg(['first']))
# To get the yield of day i have to subtract day 2 HtmDht_Energy - day 1 HtmDht_Energy
combined_csv["dailyYield"] = combined_csv["first"] - combined_csv["first"].shift()
# combined_csv.reset_index()
# combined_csv.index.set_names(["year", "month"], inplace=True)
combined_csv["monthlySum"] = combined_csv.groupby([combined_csv.Timestamp.dt.year, combined_csv.Timestamp.dt.month]).sum()
Output of combined_csv.columns
Index(['Timestamp', 'teHst0101', 'teHst0102', 'teHst0103', 'teHst0104',
'teHst0105', 'teHst0106', 'teHst0107', 'teHst0201', 'teHst0202',
'teHst0203', 'teHst0204', 'teHst0301', 'teHst0302', 'teHst0303',
'teHst0304', 'teAmb', 'teSolFloHexHst', 'teSolRetHexHst',
'teSolCol0501', 'teSolCol1001', 'teSolCol1501', 'vfSol', 'prSolRetSuc',
'rdGlobalColAngle', 'gSolPump01_roActual', 'gSolPump02_roActual',
'gHstPump03_roActual', 'gHstPump04_roActual', 'gDhtPump06_roActual',
'gMB01_isOpened', 'gMB02_isOpened', 'gCV01_posActual',
'gCV02_posActual', 'HtmDht_Energy', 'HtmDht_Flow', 'HtmDht_Power',
'HtmDht_Volume', 'HtmDht_teFlow', 'HtmDht_teReturn', 'HtmHst_Energy',
'HtmHst_Flow', 'HtmHst_Power', 'HtmHst_Volume', 'HtmHst_teFlow',
'HtmHst_teReturn', 'teSolColDes', 'teHstFloDes'],
dtype='object')
Traceback:
When I select it with
combined_csv["monthlySum"] = combined_csv.groupby([combined_csv['Timestamp'].dt.year, combined_csv['Timestamp'].dt.month]).sum()
Traceback (most recent call last):
File "D:\Users\wink\PycharmProjects\csvToExcel\main.py", line 28, in <module>
combined_csv["monthlySum"] = combined_csv.groupby([combined_csv['Timestamp'].dt.year, combined_csv['Timestamp'].dt.month]).sum()
File "D:\Users\wink\PycharmProjects\csvToExcel\venv\lib\site-packages\pandas\core\frame.py", line 3024, in __getitem__
indexer = self.columns.get_loc(key)
File "D:\Users\wink\PycharmProjects\csvToExcel\venv\lib\site-packages\pandas\core\indexes\base.py", line 3082, in get_loc
raise KeyError(key) from err
KeyError: 'Timestamp'
Traceback with Mustafa's solution:
Traceback (most recent call last):
File "C:\Users\winklerm\PycharmProjects\csvToExcel\venv\lib\site-packages\pandas\core\frame.py", line 3862, in reindexer
value = value.reindex(self.index)._values
File "C:\Users\winklerm\PycharmProjects\csvToExcel\venv\lib\site-packages\pandas\util\_decorators.py", line 312, in wrapper
return func(*args, **kwargs)
File "C:\Users\winklerm\PycharmProjects\csvToExcel\venv\lib\site-packages\pandas\core\frame.py", line 4176, in reindex
return super().reindex(**kwargs)
File "C:\Users\winklerm\PycharmProjects\csvToExcel\venv\lib\site-packages\pandas\core\generic.py", line 4811, in reindex
return self._reindex_axes(
File "C:\Users\winklerm\PycharmProjects\csvToExcel\venv\lib\site-packages\pandas\core\frame.py", line 4022, in _reindex_axes
frame = frame._reindex_index(
File "C:\Users\winklerm\PycharmProjects\csvToExcel\venv\lib\site-packages\pandas\core\frame.py", line 4038, in _reindex_index
new_index, indexer = self.index.reindex(
File "C:\Users\winklerm\PycharmProjects\csvToExcel\venv\lib\site-packages\pandas\core\indexes\multi.py", line 2492, in reindex
target = MultiIndex.from_tuples(target)
File "C:\Users\winklerm\PycharmProjects\csvToExcel\venv\lib\site-packages\pandas\core\indexes\multi.py", line 175, in new_meth
return meth(self_or_cls, *args, **kwargs)
File "C:\Users\winklerm\PycharmProjects\csvToExcel\venv\lib\site-packages\pandas\core\indexes\multi.py", line 531, in from_tuples
arrays = list(lib.tuples_to_object_array(tuples).T)
File "pandas\_libs\lib.pyx", line 2527, in pandas._libs.lib.tuples_to_object_array
ValueError: Buffer dtype mismatch, expected 'Python object' but got 'long long'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "C:\Users\winklerm\PycharmProjects\csvToExcel\main.py", line 28, in <module>
combined_csv["monthlySum"] = combined_csv.groupby([combined_csv.Timestamp.dt.year, combined_csv.Timestamp.dt.month]).sum()
File "C:\Users\winklerm\PycharmProjects\csvToExcel\venv\lib\site-packages\pandas\core\frame.py", line 3163, in __setitem__
self._set_item(key, value)
File "C:\Users\winklerm\PycharmProjects\csvToExcel\venv\lib\site-packages\pandas\core\frame.py", line 3242, in _set_item
value = self._sanitize_column(key, value)
File "C:\Users\winklerm\PycharmProjects\csvToExcel\venv\lib\site-packages\pandas\core\frame.py", line 3888, in _sanitize_column
value = reindexer(value).T
File "C:\Users\winklerm\PycharmProjects\csvToExcel\venv\lib\site-packages\pandas\core\frame.py", line 3870, in reindexer
raise TypeError(
TypeError: incompatible index of inserted column with frame index
This line makes the Timestamp column the index of the combined_csv:
combined_csv = round(combined_csv.resample('D', on='Timestamp')['HtmDht_Energy'].agg(['first']))
and therefore you get an error when you try to access .Timestamp.
The remedy is to reset_index, so instead of the above line you can try this:
combined_csv = round(combined_csv.resample('D', on='Timestamp')['HtmDht_Energy'].agg(['first'])).reset_index()
which moves Timestamp from the index back into a normal column, so you can access it again.
Side note:
combined_csv["dailyYield"] = combined_csv["first"] - combined_csv["first"].shift()
is equivalent to
combined_csv["dailyYield"] = combined_csv["first"].diff()
How should I fix this?
import pandas as pd
csv_file = 'sample.csv'
count = 1
my_filtered_csv = pd.read_csv(csv_file, usecols=['subDirectory_filePath', 'expression'])
emotion_map = { '0':'6', '1':'3', '2':'4', '3':'5', '4':'2', '5':'1', '6':'0'}
my_filtered_csv['expression'] = my_filtered_csv['expression'].replace(emotion_map)
print(my_filtered_csv)
Error is:
Traceback (most recent call last):
File "/Users/mona/CS585/project/affnet/emotion_map.py", line 11, in <module>
my_filtered_csv['expression'] = my_filtered_csv['expression'].replace(emotion_map)
File "/Users/mona/anaconda/lib/python3.6/site-packages/pandas/core/generic.py", line 3836, in replace
limit=limit, regex=regex)
File "/Users/mona/anaconda/lib/python3.6/site-packages/pandas/core/generic.py", line 3885, in replace
regex=regex)
File "/Users/mona/anaconda/lib/python3.6/site-packages/pandas/core/internals.py", line 3259, in replace_list
masks = [comp(s) for i, s in enumerate(src_list)]
File "/Users/mona/anaconda/lib/python3.6/site-packages/pandas/core/internals.py", line 3259, in <listcomp>
masks = [comp(s) for i, s in enumerate(src_list)]
File "/Users/mona/anaconda/lib/python3.6/site-packages/pandas/core/internals.py", line 3247, in comp
return _maybe_compare(values, getattr(s, 'asm8', s), operator.eq)
File "/Users/mona/anaconda/lib/python3.6/site-packages/pandas/core/internals.py", line 4619, in _maybe_compare
raise TypeError("Cannot compare types %r and %r" % tuple(type_names))
TypeError: Cannot compare types 'ndarray(dtype=int64)' and 'str'
Process finished with exit code 1
A few lines of the csv file look like:
,subDirectory_filePath,expression
0,689/737db2483489148d783ef278f43f486c0a97e140fc4b6b61b84363ca.jpg,1
1,392/c4db2f9b7e4b422d14b6e038f0cdc3ecee239b55326e9181ee4520f9.jpg,0
2,468/21772b68dc8c2a11678c8739eca33adb6ccc658600e4da2224080603.jpg,0
3,944/06e9ae8d3b240eb68fa60534783eacafce2def60a86042f9b7d59544.jpg,1
4,993/02e06ee5521958b4042dd73abb444220609d96f57b1689abbe87c024.jpg,8
5,979/f675c6a88cdef99a6d8b0261741217a0319387fcf1571a174f99ac81.jpg,6
6,637/94b769d8e880cbbea8eaa1350cb8c094a03d27f9fef44e1f4c0fb2ae.jpg,9
7,997/b81f843f08ce3bb0c48b270dc58d2ab8bf5bea3e2262e50bbcadbec2.jpg,6
8,358/21a32dd1c1ecd57d3e8964621c911df1c0b3348a4ae5203b4a243230.JPG,9
Changing the emotion_map to use integer keys and values fixed the problem:
emotion_map = {0: 6, 1: 3, 2: 4, 3: 5, 4: 2, 5: 1, 6: 0}
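The underlying reason is that the expression column is parsed as int64, so the string keys in the original map never match, and pandas ends up comparing an integer array with strings. If you would rather keep the string-keyed map, an alternative sketch (not from the original answer) is to force the column to be read as strings:

my_filtered_csv = pd.read_csv(csv_file,
                              usecols=['subDirectory_filePath', 'expression'],
                              dtype={'expression': str})
my_filtered_csv['expression'] = my_filtered_csv['expression'].replace(emotion_map)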
Another possibility that can produce this error: you have already run this code once and the data has already been replaced. To fix that, go back and load the data set again.
I need to update some of the data in my dataframe, in the same sense as an UPDATE query in SQL. My current code is as follows:
import pandas

df = pandas.read_csv('filee.csv')  # load trades from csv file

def updateDataframe(row):
    if row['Name'] == "Joe":
        return "Black"
    else:
        return row

df['LastName'] = df.apply(updateDataframe, axis=1)
However, it returns the following error:
Traceback (most recent call last):
File "test.py", line 11, in <module>
df['LastName'] = df.apply(updateDataframe,axis=1)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/core/frame.py", line 2038, in __setitem__
self._set_item(key, value)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/core/frame.py", line 2085, in _set_item
NDFrame._set_item(self, key, value)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/core/generic.py", line 582, in _set_item
self._data.set(key, value)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/core/internals.py", line 1459, in set
_set_item(self.items[loc], value)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/core/internals.py", line 1454, in _set_item
block.set(item, arr)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/core/internals.py", line 176, in set
self.values[loc] = value
ValueError: output operand requires a reduction, but reduction is not enabled
How do I resolve this? Or is there a better way to accomplish what I am trying to do?
@Jeff has a good, concise implementation of your problem in the comments above, but if you want to fix the error in your code, try the following:
For the file filee.csv with the following contents:
Name,LastName
Andy,Blue
Joe,Smith
After the else, you need to return a last-name string rather than the whole row object, as shown below:
import pandas

df = pandas.read_csv('filee.csv')  # load trades from csv file

def updateDataframe(row):
    if row['Name'] == "Joe":
        return "Black"
    else:
        return row['LastName']

df['LastName'] = df.apply(updateDataframe, axis=1)
print df
results in the following output:
   Name LastName
0  Andy     Blue
1   Joe    Black
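For reference, since the comment referred to above isn't shown here: a common concise way to do this kind of conditional update without apply is boolean indexing with .loc, roughly along these lines:

import pandas

df = pandas.read_csv('filee.csv')  # same hypothetical input file as above
df.loc[df['Name'] == 'Joe', 'LastName'] = 'Black'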