why is my data a tuple and how can I change this so I can sort the data - python
I am using rpy2 to do some statistical analyses in R via Python. After importing a data file, I want to sort the data and do a couple of other things with it in R. Once I import the data and try to sort it, I get this error message:
TypeError: 'tuple' object cannot be interpreted as an index
The last 2 lines of my code are where I am trying to sort my data, and the few lines before that are where I import the data.
root = os.getcwd()
dirs = [os.path.abspath(name) for name in os.listdir(".") if os.path.isdir(name)]
for d in dirs:
    os.chdir(d)
    cwd = os.getcwd()
    files_to_analyze = glob.glob("*.afa")
    for f in files_to_analyze:
        afa_file = os.path.join(cwd + '/' + f)
        readfasta = robjects.r['read.fasta']
        mydatafasta = readfasta(afa_file)
        names = robjects.r['names']
        IDnames = names(mydatafasta)
        substr = robjects.r['substr']
        ID = substr(IDnames, 1, 8)
        #print ID
        readtable = robjects.r['read.table']
        gps_file = os.path.join(root + '/' + "GPS.txt")
        xy = readtable(gps_file, sep="\t")
        #print xy
        order = robjects.r['order']
        gps = xy[order(xy[:,2]),]
I don't understand why my data is a tuple and not a dataframe that I can manipulate further using R. Is there a way to transform this into a workable dataframe that can be used by R?
My xy data look like:
Species AB425882 35.62 -83.4
Species AB425905 35.66 -83.33
Species KC413768 37.35 127.03
Species AB425841 35.33 -82.82
Species JX402724 29.38 -82.2
I want to sort the data alphanumerically by the second column using the order function in R.
There is quite a bit of guesswork here, since the example is not sufficient to reproduce what you have.
In the following, if xy is an R data frame, you will want to use the method dedicated to R-style subsetting to perform R-style subsetting (see the doc):
# Note: R indices are 1-based while Python indices are 0-based.
# When using R-style subsetting, the indices are 1-based.
gps = xy.rx(order(xy.rx(True, 2)), True)
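If staying in R is not essential, the same alphanumeric sort by the second column can also be sketched in pandas. This is only an illustration, not the rpy2 answer above: the column names and the two sample rows below are assumptions modeled on the xy data shown in the question.

```python
import io
import pandas as pd

# Two rows in the same shape as the GPS.txt sample in the question
# (tab-separated, no header): species, accession ID, latitude, longitude.
raw = "Species\tAB425905\t35.66\t-83.33\nSpecies\tAB425841\t35.33\t-82.82\n"

xy = pd.read_csv(io.StringIO(raw), sep="\t", header=None,
                 names=["species", "id", "lat", "lon"])

# Sort alphanumerically by the second column (the accession ID),
# the equivalent of R's xy[order(xy[, 2]), ].
gps = xy.sort_values("id").reset_index(drop=True)
print(gps["id"].tolist())  # ['AB425841', 'AB425905']
```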
Related
How do I save a dataframe in the name of a variable I created earlier in the code (oldest_id and iso_data as seen in the code)
#fetch the data in a sequence of 1 million rows as dataframe
df1 = My_functions.get_ais_data(json1)
df2 = My_functions.get_ais_data(json2)
df3 = My_functions.get_ais_data(json3)
df_all = pd.concat([df1, df2, df3], axis=0)
#save the data frame with the names of oldest_id and the corresponding ISO date
df_all.to_csv('oldest_id + iso_date +.csv')
The last line might be silly, but I am trying to save the data frame under the names of some variables I created earlier in the code.
You can use an f-string to embed variables in strings, like this:
df_all.to_csv(f'/path/to/folder/{oldest_id}{iso_date}.csv')
If you need the value corresponding to the variable, then mid's answer is correct, thus:
df_all.to_csv(f'/path/to/folder/{oldest_id}{iso_date}.csv')
However, if you want to use the name of the variable itself (the f-string '=' specifier, Python 3.8+):
df_all.to_csv('/path/to/folder/' + f'{oldest_id=}'.split('=')[0] + f'{iso_date=}'.split('=')[0] + '.csv')
would do the work.
Maybe try:
file_name = f"{oldest_id}{iso_date}.csv"
df_all.to_csv(file_name)
Assuming you are using Python 3.6 or later.
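A minimal, runnable sketch of both approaches from the answers above; the values of oldest_id and iso_date here are made up purely for illustration:

```python
oldest_id = "AB425841"
iso_date = "2017-08-01"

# Embed the *values* of the variables in the file name (Python 3.6+).
file_name = f"{oldest_id}_{iso_date}.csv"

# The f-string '=' specifier (Python 3.8+) renders "name=value", so
# splitting on '=' recovers the variable *name* itself.
name_part = f"{oldest_id=}".split("=")[0]

print(file_name)  # AB425841_2017-08-01.csv
print(name_part)  # oldest_id
```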
TypeError: can't convert type 'NoneType' to numerator/denominator
Here I try to calculate a mean value based on the data in two lists of dicts. Although I used the same code before, I keep getting an error. Is there any solution?
import pandas as pd

data = pd.read_csv('data3.csv', sep=';')       # Read data from csv
data = data.dropna(axis=0)                     # Drop rows with null values
data = data.T.to_dict().values()               # Convert dataframe into list of dictionaries

newdata = pd.read_csv('newdata.csv', sep=';')  # Read data from csv
newdata = newdata.T.to_dict().values()         # Convert dataframe into list of dictionaries

score = []
for item in newdata:
    score.append({item['Genre_Name']: item['Ranking']})

from statistics import mean
score = {k: int(v) for i in score for k, v in i.items()}
for item in data:
    y = mean(map(score.get, map(str.strip, item['Recommended_Genres'].split(','))))
    print(y)
To see the csv files: https://repl.it/#rmakakgn/SVE2
The .get method of dict returns None if the given key does not exist, and statistics.mean fails because of that. Consider:
import statistics
d = {"a": 1, "c": 3}
data = [d.get(x) for x in ("a", "b", "c")]
print(statistics.mean(data))
results in:
TypeError: can't convert type 'NoneType' to numerator/denominator
You need to remove the Nones before feeding the data into statistics.mean, which you can do with a list comprehension:
import statistics
d = {"a": 1, "c": 3}
data = [d.get(x) for x in ("a", "b", "c")]
data = [i for i in data if i is not None]
print(statistics.mean(data))
or filter:
import statistics
d = {"a": 1, "c": 3}
data = [d.get(x) for x in ("a", "b", "c")]
data = filter(lambda x: x is not None, data)
print(statistics.mean(data))
(both snippets above will print 2)
In this particular case, you can get the filtering effect by replacing:
mean(map(score.get, map(str.strip, item['Recommended_Genres'].split(','))))
with:
mean([i for i in map(score.get, map(str.strip, item['Recommended_Genres'].split(','))) if i is not None])
though, as with most Python built-in and standard-library functions that accept a list as the sole argument, you may decide not to build a list but to feed a generator expression directly, i.e.
mean(i for i in map(score.get, map(str.strip, item['Recommended_Genres'].split(','))) if i is not None)
For further discussion see PEP 202 and PEP 289.
Python - Pandas library returns wrong column values after parsing a CSV file
SOLVED: Found the solution by myself. It turns out that when you want to retrieve specific columns by their names, you should pass the names in the order they appear inside the csv (which, IMO, is really stupid for a library that is intended to save a developer some parsing time). Correct me if I am wrong, but I don't see an option to get a specific column's values by its name if the columns are in a different order...
I am trying to read a comma-separated value file with Python and then parse it using the Pandas library. Since the file has many values (columns) that are not needed, I make a list of the column names I do need. Here's a look at the csv file format:
Div,Date,HomeTeam,AwayTeam,FTHG,FTAG,FTR,HTHG,HTAG,HTR,Attendance,Referee,HS,AS,HST,AST,HHW,AHW,HC,AC,HF,AF,HO,AO,HY,AY,HR,AR,HBP,ABP,GBH,GBD,GBA,IWH,IWD,IWA,LBH,LBD,LBA,SBH,SBD,SBA,WHH,WHD,WHA
E0,19/08/00,Charlton,Man City,4,0,H,2,0,H,20043,Rob Harris,17,8,14,4,2,1,6,6,13,12,8,6,1,2,0,0,10,20,2,3,3.2,2.2,2.9,2.7,2.2,3.25,2.75,2.2,3.25,2.88,2.1,3.2,3.1
E0,19/08/00,Chelsea,West Ham,4,2,H,1,0,H,34914,Graham Barber,17,12,10,5,1,0,7,7,19,14,2,3,1,2,0,0,10,20,1.47,3.4,5.2,1.6,3.2,4.2,1.5,3.4,6,1.5,3.6,6,1.44,3.6,6.5
E0,19/08/00,Coventry,Middlesbrough,1,3,A,1,1,D,20624,Barry Knight,6,16,3,9,0,1,8,4,15,21,1,3,5,3,1,0,75,30,2.15,3,3,2.2,2.9,2.7,2.25,3.2,2.75,2.3,3.2,2.75,2.3,3.2,2.62
E0,19/08/00,Derby,Southampton,2,2,D,1,2,A,27223,Andy D'Urso,6,13,4,6,0,0,5,8,11,13,0,2,1,1,0,0,10,10,2,3.1,3.2,1.8,3,3.5,2.2,3.25,2.75,2.05,3.2,3.2,2,3.2,3.2
E0,19/08/00,Leeds,Everton,2,0,H,2,0,H,40010,Dermot Gallagher,17,12,8,6,0,0,6,4,21,20,6,1,1,3,0,0,10,30,1.65,3.3,4.3,1.55,3.3,4.5,1.55,3.5,5,1.57,3.6,5,1.61,3.5,4.5
E0,19/08/00,Leicester,Aston Villa,0,0,D,0,0,D,21455,Mike Riley,5,5,4,3,0,0,5,4,12,12,1,4,2,3,0,0,20,30,2.15,3.1,2.9,2.3,2.9,2.5,2.35,3.2,2.6,2.25,3.25,2.75,2.4,3.25,2.5
E0,19/08/00,Liverpool,Bradford,1,0,H,0,0,D,44183,Paul Durkin,16,3,10,2,0,0,6,1,8,8,5,0,1,1,0,0,10,10,1.25,4.1,7.2,1.25,4.3,8,1.35,4,8,1.36,4,8,1.33,4,8
This list is passed to pandas.read_csv()'s names parameter. See code:
# Returns an array of the column names needed for our raw data table
def cols_to_extract():
    cols_to_use = [None] * RawDataCols.COUNT
    cols_to_use[RawDataCols.DATE] = 'Date'
    cols_to_use[RawDataCols.HOME_TEAM] = 'HomeTeam'
    cols_to_use[RawDataCols.AWAY_TEAM] = 'AwayTeam'
    cols_to_use[RawDataCols.FTHG] = 'FTHG'
    cols_to_use[RawDataCols.HG] = 'HG'
    cols_to_use[RawDataCols.FTAG] = 'FTAG'
    cols_to_use[RawDataCols.AG] = 'AG'
    cols_to_use[RawDataCols.FTR] = 'FTR'
    cols_to_use[RawDataCols.RES] = 'Res'
    cols_to_use[RawDataCols.HTHG] = 'HTHG'
    cols_to_use[RawDataCols.HTAG] = 'HTAG'
    cols_to_use[RawDataCols.HTR] = 'HTR'
    cols_to_use[RawDataCols.ATTENDANCE] = 'Attendance'
    cols_to_use[RawDataCols.HS] = 'HS'
    cols_to_use[RawDataCols.AS] = 'AS'
    cols_to_use[RawDataCols.HST] = 'HST'
    cols_to_use[RawDataCols.AST] = 'AST'
    cols_to_use[RawDataCols.HHW] = 'HHW'
    cols_to_use[RawDataCols.AHW] = 'AHW'
    cols_to_use[RawDataCols.HC] = 'HC'
    cols_to_use[RawDataCols.AC] = 'AC'
    cols_to_use[RawDataCols.HF] = 'HF'
    cols_to_use[RawDataCols.AF] = 'AF'
    cols_to_use[RawDataCols.HFKC] = 'HFKC'
    cols_to_use[RawDataCols.AFKC] = 'AFKC'
    cols_to_use[RawDataCols.HO] = 'HO'
    cols_to_use[RawDataCols.AO] = 'AO'
    cols_to_use[RawDataCols.HY] = 'HY'
    cols_to_use[RawDataCols.AY] = 'AY'
    cols_to_use[RawDataCols.HR] = 'HR'
    cols_to_use[RawDataCols.AR] = 'AR'
    return cols_to_use

# Extracts raw data from the raw data csv and populates the raw match data table in the database
def extract_raw_data(csv):
    # Clear the database table if it has any logs
    # if MatchRawData.objects.count != 0:
    #     MatchRawData.objects.delete()
    cols_to_use = cols_to_extract()
    # Read and parse the csv file
    parsed_csv = pd.read_csv(csv, delimiter=',', names=cols_to_use, header=0)
    for col in cols_to_use:
        values = parsed_csv[col].values
        for val in values:
            print(str(col) + ' --------> ' + str(val))
Where RawDataCols is an IntEnum:
class RawDataCols(IntEnum):
    DATE = 0
    HOME_TEAM = 1
    AWAY_TEAM = 2
    FTHG = 3
    HG = 4
    FTAG = 5
    AG = 6
    FTR = 7
    RES = 8
    ...
The column names are obtained using it. That part of the code works OK. The correct column name is obtained, but after trying to get its values using
values = parsed_csv[col].values
pandas returns the values of a wrong column. The wrong column index is around 13 indexes away from the one I am trying to get. What am I missing?
You can select columns by name. Just use the following line:
values = parsed_csv[["Column Name", "Column Name2"]]
Or you can select them index-wise with:
cols = [1, 2, 3, 4]
values = parsed_csv[parsed_csv.columns[cols]]
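For the original problem (selecting a subset of columns without caring about their order in the file), read_csv's usecols parameter is worth knowing: it selects columns by header name, regardless of the order in which the names are given. A small sketch, using a two-row CSV modeled on the sample data above:

```python
import io
import pandas as pd

# A tiny CSV in the same shape as the first columns of the sample data.
raw = "Div,Date,HomeTeam,AwayTeam,FTHG\nE0,19/08/00,Charlton,Man City,4\n"

# usecols matches columns by header name; the order of the names passed
# here does not need to match their order in the file. The resulting
# DataFrame keeps the file's column order.
df = pd.read_csv(io.StringIO(raw), usecols=['HomeTeam', 'Date'])

print(list(df.columns))        # ['Date', 'HomeTeam']
print(df['HomeTeam'].iloc[0])  # Charlton
```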
Get results in an Earth Engine python script
I'm trying to get the NDVI mean for every polygon in a feature collection with the Earth Engine Python API. I think that I succeeded in getting the result (a feature collection in a feature collection), but then I don't know how to get data out of it. The data I want is the ID of each feature and the NDVI mean within each feature.
import datetime
import ee

ee.Initialize()

# Feature collection
fc = ee.FeatureCollection("ft:1s57dkY_Sg_E_COTe3sy1tIR_U-5Gw-BQNwHh4Xel")
fc_filtered = fc.filter(ee.Filter.equals('NUM_DECS', 1))

# Image collection
Sentinel_collection1 = (ee.ImageCollection('COPERNICUS/S2')).filterBounds(fc_filtered)
Sentinel_collection2 = Sentinel_collection1.filterDate(datetime.datetime(2017, 1, 1),
                                                       datetime.datetime(2017, 8, 1))

# NDVI function to use with ee map
def NDVIcalc(image):
    red = image.select('B4')
    nir = image.select('B8')
    ndvi = nir.subtract(red).divide(nir.add(red)).rename('NDVI')
    # NDVI mean calculation with reduceRegions
    MeansFeatures = ndvi.reduceRegions(reducer=ee.Reducer.mean(),
                                       collection=fc_filtered,
                                       scale=10)
    return MeansFeatures

# Result that I don't know how to get the information from:
# feature IDs and NDVI means
result = Sentinel_collection2.map(NDVIcalc)
If the result is small, you can pull it into Python using result.getInfo(). That will give you a Python dictionary containing a list of FeatureCollections (which are more dictionaries). However, if the results are large or the polygons cover large regions, you'll have to Export the collection instead.
That said, there are probably some other things you'll want to do first:
1) You might want to flatten() the collection, so it's not nested collections. It'll be easier to handle that way.
2) You might want to add a date to each result so you know what time the result came from. You can do that with a map on the result, inside your NDVIcalc function:
return MeansFeatures.map(lambda f: f.set('date', image.date().format()))
3) If what you really want is a time series of NDVI over time for each polygon (most common), then restructuring your code to map over polygons first will be easier:
Sentinel_collection = (ee.ImageCollection('COPERNICUS/S2')
    .filterBounds(fc_filtered)
    .filterDate(ee.Date('2017-01-01'), ee.Date('2017-08-01')))

def GetSeries(feature):
    def NDVIcalc(img):
        red = img.select('B4')
        nir = img.select('B8')
        ndvi = nir.subtract(red).divide(nir.add(red)).rename(['NDVI'])
        return (feature
                .set(ndvi.reduceRegion(ee.Reducer.mean(), feature.geometry(), 10))
                .set('date', img.date().format("YYYYMMdd")))

    series = Sentinel_collection.map(NDVIcalc)

    # Get the time series of values as two lists.
    list = series.reduceColumns(ee.Reducer.toList(2), ['date', 'NDVI']).get('list')
    return feature.set(ee.Dictionary(ee.List(list).flatten()))

result = fc_filtered.map(GetSeries)
print(result.getInfo())
4) And finally, if you're going to try to Export the result, you're likely to run into an issue where the columns of the exported table are selected from whatever columns the first feature has, so it's good to provide a "header" feature that has all columns (times), which you can merge() with the result as the first feature:
# Get all possible dates.
dates = ee.List(Sentinel_collection.map(
    lambda img: ee.Feature(None, {'date': img.date().format("YYYYMMdd")})
).aggregate_array('date'))

# Make a default value for every date.
header = ee.Feature(None, ee.Dictionary.fromLists(dates, ee.List.repeat(-1, dates.size())))
output = header.merge(result)

ee.batch.Export.table.toDrive(...)
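Once the (flattened) collection is small enough for result.getInfo(), the returned value is a GeoJSON-like dictionary, and the IDs and means can be extracted with plain Python. A sketch under assumptions: the dictionary below is a hand-built stand-in for a real getInfo() result, and 'mean' is the property name ee.Reducer.mean() produces by default.

```python
# Hand-built stand-in for what result.getInfo() returns for a small
# FeatureCollection: a GeoJSON-like dict with a 'features' list, where
# each feature carries its reducer outputs in 'properties'.
info = {
    'type': 'FeatureCollection',
    'features': [
        {'type': 'Feature', 'id': '0', 'properties': {'mean': 0.61}},
        {'type': 'Feature', 'id': '1', 'properties': {'mean': 0.48}},
    ],
}

# Pull (feature ID, NDVI mean) pairs out of the nested dictionaries.
ndvi_by_id = {f['id']: f['properties']['mean'] for f in info['features']}
print(ndvi_by_id)  # {'0': 0.61, '1': 0.48}
```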
Storing Datetime in a matrix to be used to define points of interest (Python)
I have a bunch of CSV files that contain rows of dates corresponding to data, with column headers. Using pandas, I have been able to import the CSV files. Now, I made a CSV file that labels the points of interest by datetime. I have also used pandas to import this file. I need to store the start time and end time in a matrix/array/something to call later to parse with my data, which is labeled with these dates. Currently, using pd.to_datetime, I have been able to convert the strings in my CSVs to datetime, but I have no idea how to store this. This is my third day using Python, so I apologize for the newbie question. I am a relatively advanced user of Matlab. I will provide my code, but I will not be able to provide the data in question as it is not owned by me. Thanks guys!
NUMBER_OF_CLASSES = 4
SUBSPACE_DIMENSION = 3

from datetime import datetime
import pandas as pd
import pandas_datareader.data as web
import numpy as np
import matplotlib.pyplot as plt
import scipy.io as sio

PeriodList = pd.read_csv('IP_List.csv')
PeriodList = PeriodList.as_matrix()

# Pdata format:
# Pdata{hull, engine, 1}(:) - datetime array of hull and engine P data
# Pdata{hull, engine, 2}(:,:) - parametric data corresponding to timestamps in datetime array
# Pdata{hull, engine, 3}(:) - array of parametric channel labels
Pdata_1 = pd.read_csv('LPD-17_1A.csv')

[list_m, list_n] = PeriodList.shape
Pdata_1 = Pdata_1.as_matrix()

startdatetime = []
enddatetime = []
#Up to line 27 done on MatLab script
for d in range(0, list_m):
    Hull = PeriodList[d, 0]
    Engine = PeriodList[d, 1]
    startdatetime[d] = pd.to_datetime(PeriodList[d, 2])
    enddatetime[d] = pd.to_datetime(PeriodList[d, 3])
    #startdatetime = pd.to_datetime(PeriodList[d,2])
Instead of iterating through the dataframe, you can store the start and end dates in a new dataframe, convert the columns to time series, and then access the data with the iloc method:
dates = PeriodList[['START', 'END']]
dates['START'] = pd.to_datetime(dates['START'])
dates['END'] = pd.to_datetime(dates['END'])

# You can access the dates based on index using iloc
dates.iloc[3]
# If you want the start date you can use the column name
dates.iloc[3]['START']
In case you want to store them under an existing data structure, you can use a dictionary with the index as key and the dataframe values as values:
start_end = dict(zip(dates.index, dates.values))
If you are looking for the difference between the end date and the start date, you can simply subtract the columns, i.e.
dates['Difference'] = dates['END'] - dates['START']
I suggest you go through the pandas documentation for more info about accessing the data.
Edit: You can also use dictionaries in your code, i.e.
startdatetime = {}
enddatetime = {}
#Up to line 27 done on MatLab script
for d in range(0, list_m):
    Hull = PeriodList[d, 0]
    Engine = PeriodList[d, 1]
    startdatetime[d] = pd.to_datetime(PeriodList[d, 2])
    enddatetime[d] = pd.to_datetime(PeriodList[d, 3])
Hope this helps.
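A minimal runnable sketch of the dictionary approach above; the column names and dates here are made-up stand-ins for the IP_List.csv data, which isn't available:

```python
import io
import pandas as pd

# Made-up stand-in for IP_List.csv: one row per period of interest.
raw = ("HULL;ENGINE;START;END\n"
       "A;1;2017-01-01 00:00;2017-01-02 12:00\n"
       "B;2;2017-02-01 06:00;2017-02-03 18:00\n")
periods = pd.read_csv(io.StringIO(raw), sep=';')

# Store parsed start/end timestamps in dicts keyed by row index.
startdatetime = {i: pd.to_datetime(s) for i, s in periods['START'].items()}
enddatetime = {i: pd.to_datetime(s) for i, s in periods['END'].items()}

# Timestamps support arithmetic, so durations come for free.
duration = enddatetime[0] - startdatetime[0]
print(duration)  # 1 days 12:00:00
```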
Figured out a solution: make lists of empty strings, so the loop stores a value on each iteration. Since each slot starts as an empty string, there will not be a "cannot convert to float" error. Thanks for the help, @Bharath Shetty. Code:
PeriodList = pd.read_csv('IP_List.csv')
PeriodList = PeriodList.as_matrix()

# Pdata format:
# Pdata{hull, engine, 1}(:) - datetime array of hull and engine P data
# Pdata{hull, engine, 2}(:,:) - parametric data corresponding to timestamps in datetime array
# Pdata{hull, engine, 3}(:) - array of parametric channel labels
Pdata_1 = pd.read_csv('LPD-17_1A.csv')

[list_m, list_n] = PeriodList.shape
#Pdata_1 = Pdata_1.as_matrix()

startdatetime = ['' for x in range(list_m)]
enddatetime = ['' for x in range(list_m)]
#Up to line 27 done on MatLab script
for d in range(0, list_m):
    Hull = PeriodList[d, 0]
    Engine = PeriodList[d, 1]
    startdatetime[d] = pd.to_datetime(PeriodList[d, 2])
    enddatetime[d] = pd.to_datetime(PeriodList[d, 3])
    #startdatetime = pd.to_datetime(PeriodList[d,2])