why is my data a tuple and how can I change this so I can sort the data - python

I am using rpy2 to do some statistical analyses in R via python. After importing a data file I want to sort the data and do a couple other things with it in R. Once I import the data and try to sort the data I get this error message:
TypeError: 'tuple' object cannot be interpreted as an index
The last 2 lines of my code are where I am trying to sort my data, and the few lines before that are where I import the data.
root = os.getcwd()
dirs = [os.path.abspath(name) for name in os.listdir(".") if os.path.isdir(name)]
for d in dirs:
    os.chdir(d)
    cwd = os.getcwd()
    files_to_analyze = (glob.glob("*.afa"))
    for f in files_to_analyze:
        afa_file = os.path.join(cwd + '/' + f)
        readfasta = robjects.r['read.fasta']
        mydatafasta = readfasta(afa_file)
        names = robjects.r['names']
        IDnames = names(mydatafasta)
        substr = robjects.r['substr']
        ID = substr(IDnames, 1,8)
        #print ID
        readtable = robjects.r['read.table']
        gps_file = os.path.join(root + '/' + "GPS.txt")
        xy = readtable(gps_file, sep="\t")
        #print xy
        order = robjects.r['order']
        gps = xy[order(xy[:,2]),]
I don't understand why my data is a tuple and not a dataframe that I can manipulate further using R. Is there a way to transform this into a workable dataframe that can be used by R?
My xy data look like:
Species AB425882 35.62 -83.4
Species AB425905 35.66 -83.33
Species KC413768 37.35 127.03
Species AB425841 35.33 -82.82
Species JX402724 29.38 -82.2
I want to sort the data alphanumerically by the second column using the order function in R.

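There is quite a bit of guesswork here, since the example is not sufficient to reproduce what you have.
Your data is most likely not a tuple. In Python, xy[:,2] passes the tuple (slice(None), 2) as the index, and the rpy2 object does not accept a tuple there, hence the TypeError. If xy is an R data frame, use .rx, the accessor rpy2 provides for R-style subsetting (see the rpy2 documentation):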
# Note R indices are 1-based while Python indices are 0-based.
# When using R-style subsetting the indices are 1-based.
gps = xy.rx(order(xy.rx(True, 2)),
True)
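To see where the tuple in the error message comes from, here is a minimal sketch in plain Python (no rpy2 involved). IndexProbe is a made-up class for illustration only; it returns whatever key Python hands to __getitem__ when square brackets are used.

```python
class IndexProbe:
    """Hypothetical helper: returns whatever object Python passes to __getitem__."""

    def __getitem__(self, key):
        return key


probe = IndexProbe()
print(probe[2])     # a single index arrives as the int 2
print(probe[:, 2])  # a comma inside [] arrives as one tuple: (slice(None, None, None), 2)
```

So the tuple is the index expression itself, not the data, which is why a dedicated accessor is needed for R-style row and column subsetting.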

Related

How do I save a dataframe in the name of a variable I created earlier in the code (oldest_id and iso_data as seen in the code)

#fetch the data in a sequence of 1 million rows as dataframe
df1 = My_functions.get_ais_data(json1)
df2 = My_functions.get_ais_data(json2)
df3 = My_functions.get_ais_data(json3)
df_all = pd.concat([df1,df2,df3], axis = 0 )
#save the data frame with names of the oldest_id and the corresponding iso data format
df_all.to_csv('oldest_id + iso_date +.csv')
The last line is probably naive, but I am trying to save the data frame under a file name built from variables I created earlier in the code.
You can use an f-string to embed variables in strings like this:
df_all.to_csv(f'/path/to/folder/{oldest_id}{iso_date}.csv')
If you need the values held by the variables, then mids answer is correct:
df_all.to_csv(f'/path/to/folder/{oldest_id}{iso_date}.csv')
However, if you want to use the names of the variables themselves (the = specifier requires Python 3.8 or later):
df_all.to_csv('/path/to/folder/' + f'{oldest_id=}'.split('=')[0] + f'{iso_date=}'.split('=')[0] + '.csv')
would do the job.
Maybe try:
file_name = f"{oldest_id}{iso_date}.csv"
df_all.to_csv(file_name)
Assuming you are using Python 3.6 and up.
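As a small sketch of the same idea using pathlib: build_csv_path is a hypothetical helper, and the folder name, ID and date below are made-up stand-ins for the variables in the question. The underscore separator and the colon replacement are my own choices, not part of the answers above.

```python
from pathlib import Path


def build_csv_path(folder, oldest_id, iso_date):
    """Return folder/<oldest_id>_<iso_date>.csv as a Path (hypothetical helper)."""
    # Colons, as in "2021-03-01T12:00:00", are not allowed in Windows file names.
    safe_date = str(iso_date).replace(":", "-")
    return Path(folder) / f"{oldest_id}_{safe_date}.csv"


# Made-up values standing in for oldest_id and iso_date.
csv_path = build_csv_path("output", 1234567, "2021-03-01T12:00:00")
print(csv_path.name)  # 1234567_2021-03-01T12-00-00.csv
```

The resulting path can then be passed to df_all.to_csv(csv_path).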

TypeError: can't convert type 'NoneType' to numerator/denominator

Here I try to calculate the mean value based on the data in two lists of dicts. Although I have used the same code before, I keep getting this error. Is there any solution?
import pandas as pd
data = pd.read_csv('data3.csv',sep=';') # Reading data from csv
data = data.dropna(axis=0) # Drop rows with null values
data = data.T.to_dict().values() # Converting dataframe into list of dictionaries
newdata = pd.read_csv('newdata.csv',sep=';') # Reading data from csv
newdata = newdata.T.to_dict().values() # Converting dataframe into list of dictionaries
score = []
for item in newdata:
    score.append({item['Genre_Name']:item['Ranking']})
from statistics import mean
score={k:int(v) for i in score for k,v in i.items()}
for item in data:
    y= mean(map(score.get,map(str.strip,item['Recommended_Genres'].split(','))))
    print(y)
To see the csv files: https://repl.it/#rmakakgn/SVE2
The .get method of dict returns None if the given key does not exist, and statistics.mean fails because of that. Consider:
import statistics
d = {"a":1,"c":3}
data = [d.get(x) for x in ("a","b","c")]
print(statistics.mean(data))
which results in:
TypeError: can't convert type 'NoneType' to numerator/denominator
You need to remove the None values before passing the data to statistics.mean, which you can do using a list comprehension:
import statistics
d = {"a":1,"c":3}
data = [d.get(x) for x in ("a","b","c")]
data = [i for i in data if i is not None]
print(statistics.mean(data))
or filter:
import statistics
d = {"a":1,"c":3}
data = [d.get(x) for x in ("a","b","c")]
data = filter(lambda x:x is not None,data)
print(statistics.mean(data))
(both snippets above will print 2)
In this particular case, you can get the same filtering effect by replacing:
mean(map(score.get,map(str.strip,item['Recommended_Genres'].split(','))))
with:
mean([i for i in map(score.get,map(str.strip,item['Recommended_Genres'].split(','))) if i is not None])
though, as with most Python built-in and standard library functions that accept an iterable as their sole argument, you can skip building the list and pass a generator expression directly, i.e.
mean(i for i in map(score.get,map(str.strip,item['Recommended_Genres'].split(','))) if i is not None)
For further discussion see PEP 202 (list comprehensions) and PEP 289 (generator expressions).
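One case the filtering does not cover is a row where none of the genres are known: the filtered sequence is then empty, and statistics.mean raises StatisticsError. A minimal sketch of a guard follows; mean_known_scores and its default parameter are hypothetical names, and the sample dictionary mirrors the toy data used above.

```python
from statistics import mean


def mean_known_scores(genres, score, default=None):
    """Mean of the scores for the comma-separated genres that exist in score."""
    names = (name.strip() for name in genres.split(","))
    values = [score[name] for name in names if name in score]
    # statistics.mean raises StatisticsError on empty input, so guard against it.
    return mean(values) if values else default


score = {"a": 1, "c": 3}
print(mean_known_scores("a, b, c", score))  # 2
print(mean_known_scores("x, y", score))     # None
```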

Python - Pandas library returns wrong column values after parsing a CSV file

SOLVED: Found the solution by myself. It turns out that when you want to retrieve specific columns by their names, you should pass the names in the order they appear inside the csv (which is really stupid for a library that is intended to save some parsing time for a developer, IMO). Correct me if I am wrong, but I don't see an option to get a specific column's values by its name if the columns are in a different order...
I am trying to read a comma-separated values file with Python and then parse it using the Pandas library. Since the file has many values (columns) that are not needed, I make a list of the column names I do need.
Here's a look at the csv file format.
Div,Date,HomeTeam,AwayTeam,FTHG,FTAG,FTR,HTHG,HTAG,HTR,Attendance,Referee,HS,AS,HST,AST,HHW,AHW,HC,AC,HF,AF,HO,AO,HY,AY,HR,AR,HBP,ABP,GBH,GBD,GBA,IWH,IWD,IWA,LBH,LBD,LBA,SBH,SBD,SBA,WHH,WHD,WHA
E0,19/08/00,Charlton,Man City,4,0,H,2,0,H,20043,Rob Harris,17,8,14,4,2,1,6,6,13,12,8,6,1,2,0,0,10,20,2,3,3.2,2.2,2.9,2.7,2.2,3.25,2.75,2.2,3.25,2.88,2.1,3.2,3.1
E0,19/08/00,Chelsea,West Ham,4,2,H,1,0,H,34914,Graham Barber,17,12,10,5,1,0,7,7,19,14,2,3,1,2,0,0,10,20,1.47,3.4,5.2,1.6,3.2,4.2,1.5,3.4,6,1.5,3.6,6,1.44,3.6,6.5
E0,19/08/00,Coventry,Middlesbrough,1,3,A,1,1,D,20624,Barry Knight,6,16,3,9,0,1,8,4,15,21,1,3,5,3,1,0,75,30,2.15,3,3,2.2,2.9,2.7,2.25,3.2,2.75,2.3,3.2,2.75,2.3,3.2,2.62
E0,19/08/00,Derby,Southampton,2,2,D,1,2,A,27223,Andy D'Urso,6,13,4,6,0,0,5,8,11,13,0,2,1,1,0,0,10,10,2,3.1,3.2,1.8,3,3.5,2.2,3.25,2.75,2.05,3.2,3.2,2,3.2,3.2
E0,19/08/00,Leeds,Everton,2,0,H,2,0,H,40010,Dermot Gallagher,17,12,8,6,0,0,6,4,21,20,6,1,1,3,0,0,10,30,1.65,3.3,4.3,1.55,3.3,4.5,1.55,3.5,5,1.57,3.6,5,1.61,3.5,4.5
E0,19/08/00,Leicester,Aston Villa,0,0,D,0,0,D,21455,Mike Riley,5,5,4,3,0,0,5,4,12,12,1,4,2,3,0,0,20,30,2.15,3.1,2.9,2.3,2.9,2.5,2.35,3.2,2.6,2.25,3.25,2.75,2.4,3.25,2.5
E0,19/08/00,Liverpool,Bradford,1,0,H,0,0,D,44183,Paul Durkin,16,3,10,2,0,0,6,1,8,8,5,0,1,1,0,0,10,10,1.25,4.1,7.2,1.25,4.3,8,1.35,4,8,1.36,4,8,1.33,4,8
This list is passed to pandas.read_csv()'s names parameter.
See code.
# Returns an array of the column names needed for our raw data table
def cols_to_extract():
    cols_to_use = [None] * RawDataCols.COUNT
    cols_to_use[RawDataCols.DATE] = 'Date'
    cols_to_use[RawDataCols.HOME_TEAM] = 'HomeTeam'
    cols_to_use[RawDataCols.AWAY_TEAM] = 'AwayTeam'
    cols_to_use[RawDataCols.FTHG] = 'FTHG'
    cols_to_use[RawDataCols.HG] = 'HG'
    cols_to_use[RawDataCols.FTAG] = 'FTAG'
    cols_to_use[RawDataCols.AG] = 'AG'
    cols_to_use[RawDataCols.FTR] = 'FTR'
    cols_to_use[RawDataCols.RES] = 'Res'
    cols_to_use[RawDataCols.HTHG] = 'HTHG'
    cols_to_use[RawDataCols.HTAG] = 'HTAG'
    cols_to_use[RawDataCols.HTR] = 'HTR'
    cols_to_use[RawDataCols.ATTENDANCE] = 'Attendance'
    cols_to_use[RawDataCols.HS] = 'HS'
    cols_to_use[RawDataCols.AS] = 'AS'
    cols_to_use[RawDataCols.HST] = 'HST'
    cols_to_use[RawDataCols.AST] = 'AST'
    cols_to_use[RawDataCols.HHW] = 'HHW'
    cols_to_use[RawDataCols.AHW] = 'AHW'
    cols_to_use[RawDataCols.HC] = 'HC'
    cols_to_use[RawDataCols.AC] = 'AC'
    cols_to_use[RawDataCols.HF] = 'HF'
    cols_to_use[RawDataCols.AF] = 'AF'
    cols_to_use[RawDataCols.HFKC] = 'HFKC'
    cols_to_use[RawDataCols.AFKC] = 'AFKC'
    cols_to_use[RawDataCols.HO] = 'HO'
    cols_to_use[RawDataCols.AO] = 'AO'
    cols_to_use[RawDataCols.HY] = 'HY'
    cols_to_use[RawDataCols.AY] = 'AY'
    cols_to_use[RawDataCols.HR] = 'HR'
    cols_to_use[RawDataCols.AR] = 'AR'
    return cols_to_use
# Extracts raw data from the raw data csv and populates the raw match data table in the database
def extract_raw_data(csv):
    # Clear the database table if it has any logs
    # if MatchRawData.objects.count != 0:
    #     MatchRawData.objects.delete()
    cols_to_use = cols_to_extract()
    # Read and parse the csv file
    parsed_csv = pd.read_csv(csv, delimiter=',', names=cols_to_use, header=0)
    for col in cols_to_use:
        values = parsed_csv[col].values
        for val in values:
            print(str(col) + ' --------> ' + str(val))
Where RawDataCols is an IntEnum.
class RawDataCols(IntEnum):
    DATE = 0
    HOME_TEAM = 1
    AWAY_TEAM = 2
    FTHG = 3
    HG = 4
    FTAG = 5
    AG = 6
    FTR = 7
    RES = 8
    ...
The column names are looked up through it, and that part of the code works: the correct column name is obtained. But when I try to get its values using
values = parsed_csv[col].values
pandas returns the values of the wrong column. The wrong column is around 13 positions away from the one I am trying to get. What am I missing?
You can select columns by name. Just use the following line:
values = parsed_csv[["Column Name","Column Name2"]]
Or you can select by index:
cols = [1,2,3,4]
values = parsed_csv[parsed_csv.columns[cols]]
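A likely explanation for the shifted values, as far as I can tell, is that names= assigns labels to columns by position rather than selecting columns by label. To pick columns by the header names already in the file, read_csv has a usecols parameter. The sketch below uses a tiny made-up CSV (same header style as the question, far fewer columns) so that it is self-contained.

```python
import io

import pandas as pd

# A tiny stand-in for the real file: same header style, far fewer columns.
csv_text = (
    "Div,Date,HomeTeam,AwayTeam,FTHG,FTAG\n"
    "E0,19/08/00,Charlton,Man City,4,0\n"
    "E0,19/08/00,Chelsea,West Ham,4,2\n"
)

# usecols selects columns by header name; the order given here does not matter.
wanted = ["FTAG", "HomeTeam", "Date"]
parsed = pd.read_csv(io.StringIO(csv_text), usecols=wanted)

# read_csv keeps the file's column order, so reindex if a specific order is needed.
parsed = parsed[wanted]
print(parsed["HomeTeam"].tolist())  # ['Charlton', 'Chelsea']
```

Note that every name passed to usecols has to exist in the file's header; names such as 'HG' or 'Res' from the question's list do not appear in the sample header shown above.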

Get results in an Earth Engine python script

I'm trying to get NDVI mean in every polygon in a feature collection with earth engine python API.
I think that I succeeded getting the result (a feature collection in a feature collection), but then I don't know how to get data from it.
The data I want is IDs from features and ndvi mean in each feature.
import datetime
import ee
ee.Initialize()
#Feature collection
fc = ee.FeatureCollection("ft:1s57dkY_Sg_E_COTe3sy1tIR_U-5Gw-BQNwHh4Xel");
fc_filtered = fc.filter(ee.Filter.equals('NUM_DECS', 1))
#Image collection
Sentinel_collection1 = (ee.ImageCollection('COPERNICUS/S2')).filterBounds(fc_filtered)
Sentinel_collection2 = Sentinel_collection1.filterDate(datetime.datetime(2017, 1, 1),datetime.datetime(2017, 8, 1))
# NDVI function to use with ee map
def NDVIcalc (image):
    red = image.select('B4')
    nir = image.select('B8')
    ndvi = nir.subtract(red).divide(nir.add(red)).rename('NDVI')
    #NDVI mean calculation with reduceRegions
    MeansFeatures = ndvi.reduceRegions(reducer= ee.Reducer.mean(),collection= fc_filtered,scale= 10)
    return (MeansFeatures)
#Result that I don't know to get the information: Features ID and NDVI mean
result = Sentinel_collection2.map(NDVIcalc)
If the result is small, you can pull it into Python using result.getInfo(). That will give you a Python dictionary containing a list of FeatureCollections (which are themselves dictionaries). However, if the results are large or the polygons cover large regions, you'll have to Export the collection instead.
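Once the dictionary is in Python, extracting the IDs and means is ordinary dictionary traversal. The sketch below is plain Python and does not call Earth Engine; the sample dictionary is made up, and its shape (nested 'features' lists, a 'mean' property written by the mean reducer) is an assumption about what getInfo() returns here, so check it against your actual output.

```python
def iter_features(info):
    """Yield every Feature dict found in a (possibly nested) collection dict."""
    for item in info.get("features", []):
        if item.get("type") == "FeatureCollection":
            # A collection nested inside a collection: descend into it.
            yield from iter_features(item)
        else:
            yield item


# Made-up sample in the assumed shape: one inner collection per image.
sample = {
    "type": "FeatureCollection",
    "features": [
        {"type": "FeatureCollection", "features": [
            {"type": "Feature", "id": "a", "properties": {"mean": 0.41}},
            {"type": "Feature", "id": "b", "properties": {"mean": 0.37}},
        ]},
        {"type": "FeatureCollection", "features": [
            {"type": "Feature", "id": "a", "properties": {"mean": 0.52}},
        ]},
    ],
}

rows = [(f.get("id"), f["properties"].get("mean")) for f in iter_features(sample)]
print(rows)
```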
That said, there are probably some other things you'll want to do first:
1) You might want to flatten() the collection, so it's not nested collections. It'll be easier to handle that way.
2) You might want to add a date to each result so you know what time the result came from. You can do that with a map on the result, inside your NDVIcalc function:
return MeansFeatures.map(lambda f : f.set('date', image.date().format()))
3) If what you really want is a time-series of NDVI over time for each polygon (most common), then restructuring your code to map over polygons first will be easier:
Sentinel_collection = (ee.ImageCollection('COPERNICUS/S2')
    .filterBounds(fc_filtered)
    .filterDate(ee.Date('2017-01-01'),ee.Date('2017-08-01')))
def GetSeries(feature):
    def NDVIcalc(img):
        red = img.select('B4')
        nir = img.select('B8')
        ndvi = nir.subtract(red).divide(nir.add(red)).rename(['NDVI'])
        return (feature
            .set(ndvi.reduceRegion(ee.Reducer.mean(), feature.geometry(), 10))
            .set('date', img.date().format("YYYYMMdd")))
    series = Sentinel_collection.map(NDVIcalc)
    # Get the time-series of values as two lists.
    list = series.reduceColumns(ee.Reducer.toList(2), ['date', 'NDVI']).get('list')
    return feature.set(ee.Dictionary(ee.List(list).flatten()))
result = fc_filtered.map(GetSeries)
print(result.getInfo())
4) And finally, if you're going to try to Export the result, you're likely to run into an issue where the columns of the exported table are selected from whatever columns the first feature has, so it's good to provide a "header" feature that has all columns (times), that you can merge() with the result as the first feature:
# Get all possible dates.
dates = ee.List(Sentinel_collection.map(
    lambda img: ee.Feature(None, {'date': img.date().format("YYYYMMdd")})
).aggregate_array('date'))
# Make a default value for every date.
header = ee.Feature(None, ee.Dictionary.fromLists(dates, ee.List.repeat(-1, dates.size())))
# merge() is a FeatureCollection method, so wrap the header feature first.
output = ee.FeatureCollection([header]).merge(result)
ee.batch.Export.table.toDrive(...)

Storing Datetime in a matrix to be used to define points of interest (Python)

I have a bunch of CSV files that contain rows of dates corresponding to data, with column headers. Using pandas, I have been able to import the CSV files. Now, I have made a CSV file that labels the points of interest by datetime, and I have also used pandas to import this file. I need to store the start time and end time in a matrix/array/something that I can call later to parse my data, which is labeled with these dates. Currently, using pd.to_datetime, I have been able to convert the strings in my CSVs to datetime, but I have no idea how to store the result. This is my third day using Python, so I apologize for the newbie question. I am a relatively advanced user of Matlab. I will provide my code, but I will not be able to provide the data in question as it is not owned by me. Thanks guys!
NUMBER_OF_CLASSES = 4
SUBSPACE_DIMENSION = 3
from datetime import datetime
import pandas as pd
import pandas_datareader.data as web
import numpy as np
import matplotlib.pyplot as plt
import scipy.io as sio
PeriodList = pd.read_csv('IP_List.csv')
PeriodList = PeriodList.as_matrix()
# Pdata format:
# Pdata{hull, engine, 1}(:) - datetime array of hull and engine P data
# Pdata{hull, engine, 2}(:,:) - parametric data corrsponding to timestamps in datetime array
# Pdata{hull, engine, 3}(:) - array of parametric channel labels
Pdata_1 = pd.read_csv('LPD-17_1A.csv')
[list_m, list_n] = PeriodList.shape
Pdata_1 = Pdata_1.as_matrix()
startdatetime = []
enddatetime = []
#Up to line 27 done on MatLab script
for d in range (0, list_m):
    Hull = PeriodList[d,0]
    Engine = PeriodList[d,1]
    startdatetime[d] = pd.to_datetime(PeriodList[d,2])
    enddatetime[d] = pd.to_datetime(PeriodList[d,3])
    #startdatetime = pd.to_datetime(PeriodList[d,2])
Instead of iterating through the dataframe, you can store the start and end dates in a new dataframe, convert the columns to datetimes, and then access the data with the iloc method (this assumes PeriodList is still a dataframe, i.e. as_matrix() has not been called on it):
dates = PeriodList[['START','END']].copy()
dates['START'] = pd.to_datetime(dates['START'])
dates['END'] = pd.to_datetime(dates['END'])
# You can access the dates based on index using iloc
dates.iloc[3]
# If you want the start date, you can use the column name
dates.iloc[3]['START']
In case you specifically want to store them in an existing data structure, you can use a dictionary with the index as keys and the dataframe rows as values:
start_end = dict(zip(dates.index, dates.values))
If you are looking for the difference between the end date and the start date, you can simply subtract the columns, i.e.
dates['Difference'] = dates['END']-dates['START']
I suggest going through the pandas documentation for more information about accessing the data.
Edit :
You can also use dictionaries in your code, i.e.
startdatetime = {}
enddatetime = {}
#Up to line 27 done on MatLab script
for d in range (0, list_m):
    Hull = PeriodList[d,0]
    Engine = PeriodList[d,1]
    startdatetime[d] = pd.to_datetime(PeriodList[d,2])
    enddatetime[d] = pd.to_datetime(PeriodList[d,3])
Hope this helps
Figured out a solution: fill the lists with empty strings first, so that the loop can store a value at each index on every iteration. Since the lists are pre-populated with empty strings, there will not be a "cannot convert to float" error. Thanks for the help, @Bharath Shetty.
Code:
PeriodList = pd.read_csv('IP_List.csv')
PeriodList = PeriodList.as_matrix()
# Pdata format:
# Pdata{hull, engine, 1}(:) - datetime array of hull and engine P data
# Pdata{hull, engine, 2}(:,:) - parametric data corrsponding to timestamps in datetime array
# Pdata{hull, engine, 3}(:) - array of parametric channel labels
Pdata_1 = pd.read_csv('LPD-17_1A.csv')
[list_m, list_n] = PeriodList.shape
#Pdata_1 = Pdata_1.as_matrix()
startdatetime = ['' for x in range(list_m)]
enddatetime = ['' for x in range(list_m)]
#Up to line 27 done on MatLab script
for d in range (0, list_m):
    Hull = PeriodList[d,0]
    Engine = PeriodList[d,1]
    startdatetime[d] = pd.to_datetime(PeriodList[d,2])
    enddatetime[d] = pd.to_datetime(PeriodList[d,3])
    #startdatetime = pd.to_datetime(PeriodList[d,2])
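Unlike Matlab arrays, Python lists do not grow when you assign past their end, which is why the original loop failed on an empty list. The placeholder strings can be avoided by building the lists directly. The sketch below swaps pd.to_datetime for the standard library's datetime.strptime so that it runs without pandas; the rows and the timestamp format are made up, since the real data is not available.

```python
from datetime import datetime

# Hypothetical rows standing in for IP_List.csv: hull, engine, start, end.
rows = [
    ("LPD-17", "1A", "2017-01-05 08:00", "2017-01-05 17:30"),
    ("LPD-17", "1B", "2017-02-11 06:15", "2017-02-12 09:00"),
]
FMT = "%Y-%m-%d %H:%M"  # assumed timestamp format

# Build both lists in one pass; no placeholder values are needed.
startdatetime = [datetime.strptime(row[2], FMT) for row in rows]
enddatetime = [datetime.strptime(row[3], FMT) for row in rows]

# Index d still lines up with row d, as in the loop version.
durations = [end - start for start, end in zip(startdatetime, enddatetime)]
print(durations[0])  # 9:30:00
```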
