ValueError: '_index' is a reserved name for dataframe columns - python

I am trying to save a file to h5ad format and it is giving this value error: ValueError: '_index' is a reserved name for dataframe columns.
import pandas as pd
import scanpy as sc
import numpy as np
data = sc.read_h5ad('f.h5ad')
annotation = pd.read_csv('n.tsv', sep='\t')
annotation_dict = {item['barcodes']:item['celltype'] for item in annotation.to_dict('records')}
data.obs['barcodes'] = data.obs.index
data.obs['celltype'] = data.obs['barcodes'].map(annotation_dict)
sc.pp.filter_genes(data,min_cells=686)
sc.pp.filter_cells(data,min_genes=10)
sc.pp.normalize_per_cell(data,20000)
sc.pp.log1p(data)
sc.pp.highly_variable_genes(data,n_top_genes=1000)
data.X = np.exp(data.X.toarray())-1
data=data[:,data.var['highly_variable']]
sc.pp.normalize_per_cell(data,3800)
clustered = sc.read_h5ad('f.h5ad')
sc.pp.filter_cells(data,min_genes=10)
sc.pp.recipe_zheng17(clustered)
sc.tl.pca(clustered, n_comps=50)
sc.pp.neighbors(clustered, n_pcs=50)
sc.tl.louvain(clustered, resolution=0.15)
clustered.obs.groupby('louvain').count()
data.obs['louvain'] = list(clustered.obs['louvain'])
split = pd.DataFrame(data.obs['barcodes'])
test = split.sample(frac=0.2)
d_split = {item:'test' for item in test['barcodes']}
data.obs['split'] = data.obs['barcodes'].map(d_split).fillna('train')
data.write_h5ad('e.h5ad')

This is probably related to a known issue with the AnnData .raw object.
Two workarounds (From here):
#1
data.__dict__['_raw'].__dict__['_var'] = data.__dict__['_raw'].__dict__['_var'].rename(columns={'_index': 'features'})
#2, deleting the backed up raw information
del data.raw
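If the raw slot should be kept rather than deleted, a minimal sketch of the same rename done through the public API (assuming data.raw is what carries the clashing '_index' column and that anndata's Raw.to_adata() is available) could look like this:
if data.raw is not None:
    raw = data.raw.to_adata()  # copy the raw slot out as a full AnnData object
    raw.var = raw.var.rename(columns={'_index': 'features'})  # drop the reserved name
    data.raw = raw  # put the cleaned copy back
data.write_h5ad('e.h5ad')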

Related

Why can't I replace null values in this excel sheet?

In my code, I run a t-test which sometimes yields "NaN" or "nan" when running a test on two all-zero groups. I have tried making new data frames, tried replacing with .replace, and also tried fillna(), but nothing was successful. I also get errors when trying to define a new dataframe or read the file again after adding new calculations.
How do I replace the nulls and "nan" values in these files: "significant_report2.xls" or "quant_report2.xls"?
import json
import os, sys
import numpy as np
import pandas as pd
import scipy.stats

output_report = "quant_report2.xls"
significant_report = "significant_report2.xls"
output_report_writer = open(output_report, "w")
significant_writer = open(significant_report, "w")

# set up samples grouped by control and treatment
header = []
for idx in control_indices:
    header.append(quant_columns[idx])
for idx in treatment_indices:
    header.append(quant_columns[idx])
output_report_writer.write("Feature\t%s\tP-value\tctrl_means\tctrl_stdDev\ttx_means\ttx_stdDev\n" % "\t".join(header))
significant_writer.write("Feature\t%s\tP-value\tctrl_means\tctrl_stdDev\ttx_means\ttx_stdDev\n" % "\t".join(header))

feature_list = list(quantitative_data_frame.index)
for feature_idx in range(len(feature_list)):
    feature_name = feature_list[feature_idx]
    control_values = quantitative_data_frame.iloc[feature_idx, control_indices]
    treatment_values = quantitative_data_frame.iloc[feature_idx, treatment_indices]
    ttest_stat, ttest_pvalue = scipy.stats.ttest_ind(control_values, treatment_values, equal_var=False)
    ctrl_means = np.mean(control_values, axis=0)
    ctrl_stdDev = scipy.stats.tstd(control_values)
    tx_means = np.mean(treatment_values, axis=0)
    tx_stdDev = scipy.stats.tstd(treatment_values)
    output_report_writer.write("%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n" % (feature_name,
        "\t".join([str(x) for x in list(control_values)]),
        "\t".join([str(x) for x in list(treatment_values)]), ttest_pvalue, ctrl_means, ctrl_stdDev, tx_means, tx_stdDev))
    significant_writer.write("%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n" % (feature_name,
        "\t".join([str(x) for x in list(control_values)]),
        "\t".join([str(x) for x in list(treatment_values)]), ttest_pvalue, ctrl_means, ctrl_stdDev, tx_means, tx_stdDev))
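One way to handle the NaN results (a sketch, not part of the original post; the placeholder values are assumptions): either guard the p-value where it is produced, or load the finished tab-separated report back into pandas and use fillna there.
# Option 1: guard the p-value at the point where it is computed
ttest_stat, ttest_pvalue = scipy.stats.ttest_ind(control_values, treatment_values, equal_var=False)
if np.isnan(ttest_pvalue):
    ttest_pvalue = 1.0  # assumed placeholder meaning "no detectable difference"

# Option 2: clean the written report afterwards; the files are tab-separated despite the .xls extension
report = pd.read_csv("quant_report2.xls", sep="\t")
report = report.fillna(0)  # the string "nan" is parsed as NaN by default, so it is caught too
report.to_csv("quant_report2.xls", sep="\t", index=False)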

How do I convert pandas data frame to GeoJson overlay in folium

I have this code here, but the thing is when I do folium.GeoJson(data, name="geojson") it just returns ValueError: Cannot render objects with any missing geometries: 0 {'type': 'MultiPolygon', 'coordinates': [[[[-7....
import requests
import json
import pandas as pd
import folium
map = folium.Map(location=[40, -73], zoom_start=6, tiles="OpenStreetMap")
fg = folium.FeatureGroup(name="Parks", show = False)
map.add_child(fg)
park_data = requests.get("https://data.cityofnewyork.us/resource/enfh-gkve.json")
park_data = park_data.json()
park_frame = pd.DataFrame(data=park_data)
park_geo_data = park_frame["multipolygon"]
fg.add_child(folium.GeoJson(park_geo_data, name = "Parks"))
map.add_child(folium.LayerControl(position="topright"))
map.save("about.html")
[See image for what the data looks like raw from the REST API][1]
[Shows the error as well as what the raw REST API response looks like][2]

[1]: https://i.stack.imgur.com/dGSqL.png
[2]: https://i.stack.imgur.com/XlKdS.png
You can iterate over the rows of your dataframe and add each row's 'multipolygon' value as a GeoJSON feature to the FeatureGroup.
park_data = pd.read_json("https://data.cityofnewyork.us/resource/enfh-gkve.json")
for i, row in park_data.iterrows():
    geo_data = row['multipolygon']
    fg.add_child(folium.GeoJson(geo_data))
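Put together with the rest of the original script, that looks roughly like this (a sketch under the same assumption, i.e. that each 'multipolygon' cell already holds a valid GeoJSON geometry dict; rows without a geometry are skipped to avoid the original error):
import folium
import pandas as pd

map = folium.Map(location=[40, -73], zoom_start=6, tiles="OpenStreetMap")
fg = folium.FeatureGroup(name="Parks", show=False)
map.add_child(fg)

park_data = pd.read_json("https://data.cityofnewyork.us/resource/enfh-gkve.json")
for i, row in park_data.iterrows():
    geometry = row['multipolygon']
    if isinstance(geometry, dict):  # skip rows with a missing geometry
        fg.add_child(folium.GeoJson(geometry))

map.add_child(folium.LayerControl(position="topright"))
map.save("about.html")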

Set up a column based on another column and outside list in a Pandas Dataframe

I am trying to create a new column in a Pandas dataframe which takes only one array from a list of 5 arrays (the list is titled cluster_centre) and puts that array into the dataframe. It should take the array at the index that matches the value in the 'labels' column of the same dataframe (which has values of 0, 1, 2, 3 or 4). So, for instance, if the sentence in a row was given a label of 2, i.e. the 'labels' column value for that row is 2, then the value of the 'cluster_centres' column for that row should be cluster_centre[2]. How can I do this? The code I have attempted is pasted below:
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
import pandas as pd
with open('JWN_Nordstrom_MDNA_overview_2017.txt', 'r') as file:
    initial_corpus = file.read()
corpus = initial_corpus.split('. ')
# Extract sentence embeddings
embedder = SentenceTransformer('bert-base-wikipedia-sections-mean-tokens')
corpus_embeddings = embedder.encode(corpus)
# Perform KMeans clustering
num_clusters = 5
clustering_model = KMeans(n_clusters=num_clusters)
clustering_model.fit(corpus_embeddings)
cluster_assignment = clustering_model.labels_
cluster_centre = clustering_model.cluster_centers_
# Create dataframe
All_data_df = pd.DataFrame()
All_data_df['sentences'] = corpus
All_data_df['embeddings'] = corpus_embeddings
All_data_df['labels'] = cluster_assignment
# The line below creates a ValueError
All_data_df['cluster_centres'] = cluster_centre[All_data_df['labels']]
print(All_data_df.head())
I get this error: ValueError: Wrong number of items passed 768, placement implies 1
UPDATE: I did some new stuff and tried this:
All_data_df = pd.DataFrame()
All_data_df['sentences'] = corpus
All_data_df['embeddings'] = corpus_embeddings
All_data_df['labels'] = cluster_assignment
#All_data_df['cluster_centres'] = 0
for index, row in All_data_df.iterrows():
    iforval = cluster_centre[row['labels']]
    All_data_df.at[index, 'cluster_centres'] = iforval
print(All_data_df.head())
But I get a new error: ValueError: Must have equal len keys and value when setting with an iterable. I printed iforval inside the loop and it does indeed return the 29 correct arrays from the cluster_centre list, which matches the 29 rows present in the dataframe. Now I just need to put them into the new column of the dataframe, but .at[] didn't work; I'm not sure if I am using it correctly.
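(For reference, .at[] can be made to work if the target column is created with object dtype first, so that each cell can hold a whole array; a minimal sketch, not from the original post:)
All_data_df['cluster_centres'] = pd.Series([None] * len(All_data_df), dtype=object)  # object dtype so a cell can store an array
for index, row in All_data_df.iterrows():
    All_data_df.at[index, 'cluster_centres'] = cluster_centre[row['labels']]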
EDIT/UPDATE: OK, I found a sort of solution; I don't know why I didn't realise this before. I just created a list beforehand and made that into the new column, which ended up being much simpler.
cluster_centres_list = [cluster_centre[label] for label in cluster_assignment]
all_data_df = pd.DataFrame()
all_data_df['sentences'] = corpus
all_data_df['embeddings'] = corpus_embeddings
all_data_df['labels'] = cluster_assignment
all_data_df['cluster_centres'] = cluster_centres_list
print(all_data_df.head())
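Equivalently, since cluster_centre is a numpy array, the per-row centres can be picked out in one step with fancy indexing (a small variation on the same idea):
# cluster_centre[cluster_assignment] selects one centre row per label;
# list() turns the 2-D result into one array object per dataframe row
all_data_df['cluster_centres'] = list(cluster_centre[cluster_assignment])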

Normalize columns in a numpy array- results in typeerror

I want to do a simple normalization of the data in a numpy ndarray. Specifically, I want (X - mu) / sigma. I tried using the exact code that I found in earlier questions and kept getting the error TypeError: cannot perform reduce with flexible type. I gave up and tried a simpler normalization method, (X - mu) / X.ptp(), and got the same error.
import csv
import numpy as np
from numpy import *
import urllib.request

# Import comma-separated data from the UCI repository
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data'
urllib.request.urlretrieve(url, 'F:/Python/Wine Dataset/wine_data')

# open file
filename = 'F:/Python/Wine Dataset/wine_data'
raw_data = open(filename, 'rt')

# Put raw_data into a numpy.ndarray
reader = csv.reader(raw_data)
x = list(reader)
data = np.array(x)

# First column is classification, other columns are features
y = data[:, 0]
X_raw = data[:, 1:13]

# Attempt at normalizing data - really wanted (X - mu) / sigma, gave up;
# even this simplified version doesn't work
# latest error is TypeError: cannot perform reduce with flexible type
X = (X_raw - X_raw.min(0)) / X_raw.ptp(0)
print(X)
Finally figured it out. The line data = np.array(x) returned an array containing string data (the csv reader yields strings), which numpy cannot reduce over.
was:
data = np.array(x)
changed to:
data = np.array(x).astype(float)
After that everything worked; a simple issue that cost me hours.
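With the array cast to float, the z-score normalization originally wanted is then straightforward (a sketch following on from the code above):
data = np.array(x).astype(float)
y = data[:, 0]
X_raw = data[:, 1:]  # all feature columns

# (X - mu) / sigma, column-wise
X = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)
print(X.mean(axis=0))  # ~0 for every column
print(X.std(axis=0))   # ~1 for every column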

ProgrammingError: (psycopg2.ProgrammingError) can't adapt type 'numpy.ndarray'

I am creating a python script using pandas to read through a file which has multiple row values.
Once read, I need to build an array of these values and then assign it to a dataframe row value.
The code I have used is
import re
import numpy as np
import pandas as pd

master_data = pd.DataFrame()
temp_df = pd.DataFrame()
new_df = pd.DataFrame()

for f in data:
    ## Reading the file in pandas, which is in excel format
    file_df = pd.read_excel(f)
    filename = file_df['Unnamed: 1'][2]
    ## Skipping first 24 rows to get the required reading values
    column_names = ['start_time', 'xxx_value']
    data_df = pd.read_excel(f, names=column_names, skiprows=25)
    array = np.array([])
    for i in data_df.iterrows():
        array = np.append(array, i[1][1])
    temp_df['xxx_value'] = [array]
    temp_df['Filename'] = filename
    temp_df['sub_id'] = temp_df['Filename'].str.split('_', 1).str[1].str.strip()
    temp_df['sen_site'] = temp_df['Filename'].str.split('_', 1).str[0].str.strip()
    temp_df['sampling_interval'] = 15
    temp_df['start_time'] = data_df['start_time'][2]
    new_df = new_df.append(temp_df)

new_df.index = new_df.index + 1
new_df = new_df.sort_index()
new_df.index.name = 'record_id'
new_df = new_df.drop("Filename", 1)  ## dropping the Filename as it is not needed to be loaded in postgresql

## Rearrange to postgresql format
column_new_df = new_df.columns.tolist()
column_new_df.insert(4, column_new_df.pop(column_new_df.index('xxx_value')))
new_df = new_df.reindex(columns=column_new_df)
print(new_df)
This code is not working when I try to insert the array data into Postgresql. It gives me an error stating:
ProgrammingError: (psycopg2.ProgrammingError) can't adapt type 'numpy.ndarray'
I am not sure where the problem is, as I can't see in your code the part where you insert the data into Postgres.
My guess though is that you are giving Postgres a Numpy array: psycopg2 can't handle Numpy data types, but it should be fairly easy to convert them to native Python types that work with psycopg2 (e.g. by using the .tolist() method); it is difficult to give more precise information without the insert code.
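For example (a sketch with assumed table and column names, since the insert code isn't shown), the array cell could be converted to a plain list just before the insert:
import psycopg2

conn = psycopg2.connect("dbname=mydb user=me")  # assumed connection details
cur = conn.cursor()

row = new_df.iloc[0]
values = row['xxx_value'].tolist()  # numpy.ndarray -> plain Python list, adapted by psycopg2 as an ARRAY

# table and column names below are assumptions, for illustration only
cur.execute(
    "INSERT INTO readings (start_time, sampling_interval, xxx_value) VALUES (%s, %s, %s)",
    (row['start_time'], int(row['sampling_interval']), values),
)
conn.commit()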
In my opinion, the most effective way would be to make psycopg2 always aware of np.ndarray(s). One could do that by registering an adapter:
import numpy as np
from psycopg2.extensions import register_adapter, AsIs

def addapt_numpy_array(numpy_array):
    return AsIs(tuple(numpy_array))

register_adapter(np.ndarray, addapt_numpy_array)
To help when working with numpy in general, my default add-on to scripts/libraries that depend on psycopg2 is:
import numpy as np
from psycopg2.extensions import register_adapter, AsIs

def addapt_numpy_float64(numpy_float64):
    return AsIs(numpy_float64)

def addapt_numpy_int64(numpy_int64):
    return AsIs(numpy_int64)

def addapt_numpy_float32(numpy_float32):
    return AsIs(numpy_float32)

def addapt_numpy_int32(numpy_int32):
    return AsIs(numpy_int32)

def addapt_numpy_array(numpy_array):
    return AsIs(tuple(numpy_array))

register_adapter(np.float64, addapt_numpy_float64)
register_adapter(np.int64, addapt_numpy_int64)
register_adapter(np.float32, addapt_numpy_float32)
register_adapter(np.int32, addapt_numpy_int32)
register_adapter(np.ndarray, addapt_numpy_array)
Otherwise there would be some issues even with plain numerical types. I got the adapter trick from this other Stack Overflow entry.
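With those adapters registered, numpy scalars coming straight out of a dataframe no longer need manual conversion (a sketch with assumed table/column names; note that the ndarray adapter above renders the array as a tuple literal, so match that to your column type):
import psycopg2

conn = psycopg2.connect("dbname=mydb user=me")  # assumed connection details
cur = conn.cursor()

# np.int64 / np.float64 values can now be passed without wrapping them in int()/float()
cur.execute(
    "INSERT INTO readings (sub_id, sampling_interval) VALUES (%s, %s)",
    (np.int64(42), np.float64(15.0)),
)
conn.commit()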
Convert each numpy array element to its equivalent list using apply and tolist first, and then you should be able to write the data to Postgres:
df['column_name'] = df['column_name'].apply(lambda x: x.tolist())
We can also address the issue by extracting one element at a time. Assuming that in a dataframe temp_df the sub_id column is of type numpy.int64, we can extract the value directly with iloc and .item(), as temp_df.iloc[0]['sub_id'].item(), and push that into the DB.
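In code, that looks something like this (reusing a psycopg2 cursor named cur and an assumed table name):
value = temp_df.iloc[0]['sub_id'].item()  # numpy.int64 -> plain Python int
cur.execute("INSERT INTO readings (sub_id) VALUES (%s)", (value,))
conn.commit()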
