Concatenate multiple H5ad files into one - python

I tried to do what you suggested but I am getting an error saying ValueError: only one regex group is supported with Index.
I have multiple h5ad files with varying n_obs × n_vars. Here is my code:
adatas = [an.read_h5ad(filename) for filename in filenames]
batch_names = []
for i in range(len(adatas)):
    adatas[i].var_names_make_unique()
    batch_names.append(filenames[i].split('.')[0])
    print(i, adatas[i])
adata = adatas[0].concatenate(adatas[1:],
                              batch_key='ID',
                              uns_merge="unique",
                              index_unique=None,
                              batch_categories=batch_names)
and this produces the above error. Can anyone help?
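One thing that may be worth trying (not a confirmed fix for this exact error) is the newer top-level anndata.concat API, which in recent anndata versions replaces AnnData.concatenate and takes the batch labels via label/keys instead of batch_key/batch_categories. A minimal sketch, assuming `an` is the anndata module imported above and reusing the adatas and batch_names already built:
adata = an.concat(
    adatas,
    label="ID",          # column in .obs that will hold the batch label
    keys=batch_names,    # one label per input AnnData
    uns_merge="unique",
    index_unique=None,   # keep the original obs names, no suffix appended
    join="outer",        # keep the union of vars since n_vars differ
)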


For-loop in a list with datasets in Python

I have 32 datasets with the same structure. I need to do some preparation on each one and then join them. For the cleaning I've prepared a function, and then I tried to put that function in a loop, but it doesn't work.
Here is my code.
First, I imported the datasets into my environment in a list called files. I'm working in Google Colab.
import glob
import os
import pandas as pd

os.chdir('/content')
extension = 'xls'
all_files = [i for i in glob.glob('*.{}'.format(extension))]
files = []
for filename in all_files:
    data = pd.read_excel(filename, skiprows=6)
    files.append(data)
Second, I wrote my cleaning function.
def data_cleaning(data):
    data = data.iloc[2:, :4]
    data = data[(data['Desglose'] != 'Localidades')]
    data = data.drop(columns='Desglose')
    data = data.rename(columns={'Total de localidades y su población1': 'poblacion'})
    data['municipio'] = data['Municipio'].str.split(' ', n=1).str.get(-1)
    data['entidad_federativa'] = data['Entidad federativa'].str.split(' ', n=1).str.get(-1)
    data = data.iloc[:, 2:]
    return data
And finally, I tried to write a for loop to repeat the cleaning process on each dataset in the list files.
files_clean = []
for i in files:
    data_clean = data_cleaning(files[i])
    files_clean.append(data_clean)
The error I get is:
TypeError Traceback (most recent call last)
<ipython-input-44-435517607919> in <module>()
1 files_clean = []
2 for i in files:
----> 3 data_clean = data_cleaning(files[i])
4 files_clean.append(data_clean)
TypeError: list indices must be integers or slices, not DataFrame
I've done a similar process in R but I can't repeat it in Python. So, any suggestions would be appreciated.
Thank you very much for your time.
The error TypeError: list indices must be integers or slices, not DataFrame is raised when you try to index a list with a DataFrame instead of an integer. To solve this problem, make sure that you access the list using an integer index.
A common scenario where this error is raised is when you iterate over a list and then use the items themselves as indices. To avoid it, you can use range() in your for loop:
for i in range(len(files)):
or else you can check the type of files and the type of one object in files and make the necessary changes accordingly.
The problem is with the index: for i in files does not give you i as an integer but as a DataFrame. A possible solution to your problem would be:
for df in files:
    data_clean = data_cleaning(df)
    files_clean.append(data_clean)
or similarly
for i in range(len(files)):
    data_clean = data_cleaning(files[i])
    files_clean.append(data_clean)
or possibly
for i, df in enumerate(files):
    data_clean = data_cleaning(df)
    files_clean.append(data_clean)
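Since the original goal is to clean all 32 datasets and then join them, here is a short hedged sketch of the final step, assuming the cleaned frames all share the same columns:
import pandas as pd

# clean every DataFrame, then stack them into a single one
files_clean = [data_cleaning(df) for df in files]
combined = pd.concat(files_clean, ignore_index=True)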

Having a problem with CSV export in Python with Jupyter

I've tried several solutions from Stack Overflow and each one gives me a different error. The last one I tried is this:
df = pd.read_csv('arima1.csv', sep=';',parse_dates={'Month':[0, 1]}, index_col = 'Month')
df.head()
plt.xlabel('Data')
plt.ylabel('Receita')
plt.plot(df)
and i get this error:
IndexError: list index out of range
this is my CSV file:
https://drive.google.com/file/d/1BlDo10_Oz1RzFEcosiVgdGickXs4elSA/view?usp=sharing
Thanks.
The CSV file needs cleaning first:
df = pd.read_csv("arima1.csv", sep='\"+')
# df['Month'] = pd.to_datetime(df['Month,'], format="%m/%d/%Y,")
# df['Receita'] = df['Receita'].apply(lambda x: float(x.replace("R$", "").replace(",", "")))
# df.set_index(['Month'])['Receita'].plot()
Your separator is ',' not ';'.
When splitting on ';' there is only one column, so there is no column named 'Month'.
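A hedged sketch of what the corrected read might look like, assuming the file really is comma-separated with Month and Receita columns (the exact layout of the linked file is not visible here):
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('arima1.csv', sep=',', parse_dates=['Month'], index_col='Month')

plt.xlabel('Data')
plt.ylabel('Receita')
plt.plot(df)
plt.show()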

Stacking .csv files vertically with Pandas in Python

So I have been trying to merge .csv files with Pandas, and trying to create a couple of functions to automate it, but I keep having an issue.
My problem is that I want to stack one .csv after the other (same number of columns and a different number of rows), but instead of getting a bigger csv with the same number of columns, I get a bigger csv with more columns and rows (the correct number of rows, but more columns than there should be).
The code I'm using is this one:
import os
import pandas as pd

def stackcsv(content_folder):
    global combined_csv
    combined_csv = []
    entries = os.listdir(content_folder)
    for i in entries:
        csv_path = os.path.join(content_folder, i)
        solo_csv = pd.read_csv(csv_path, index_col=None)
        combined_csv.append(solo_csv)
    csv_final = pd.concat(combined_csv, axis=0, ignore_index=True)
    return csv_final.to_csv("final_data.csv", index=None, header=None)
I have three .csv files of size 20000x17, and I want to merge them into one of 60000x17. I suppose my error must be in the arguments index, header, index_col, etc.
Thanks in advance.
So after modifying the code, it worked. First of all, as Serge Ballesta said, it is necessary to tell read_csv that there is no header. Finally, using sort=False, the function works perfectly. This is the final code that I used, and the final .csv is 719229 rows × 17 columns. Thanks to everybody!
import os
import pandas as pd

def stackcsv(content_folder):
    global combined_csv
    combined_csv = []
    entries = os.listdir(content_folder)
    for i in entries:
        csv_path = os.path.join(content_folder, i)
        solo_csv = pd.read_csv(csv_path, index_col=None, header=None)
        combined_csv.append(solo_csv)
    csv_final = pd.concat(combined_csv, axis=0, sort=False)
    return csv_final.to_csv("final_data.csv", header=None)
If the files have no header you must tell read_csv. If you don't, the first line of each file is read as a header line. As a result the DataFrames have different column names and concat will add new columns. So you should read with:
solo_csv = pd.read_csv(csv_path,index_col=None, header=None)
Alternatively, there is no reason to decode them at all, and you could just concatenate the files sequentially:
def stackcsv(content_folder):
    with open("final_data.csv", "w") as fdout:
        entries = os.listdir(content_folder)
        for i in entries:
            csv_path = os.path.join(content_folder, i)
            with open(csv_path) as fdin:
                while True:
                    chunk = fdin.read(65536)  # read in fixed-size chunks to keep memory bounded
                    if len(chunk) == 0:
                        break
                    fdout.write(chunk)
Set the sort parameter to False in the pandas concat function:
csv_final = pd.concat(combined_csv, axis = 0, ignore_index=True, sort=False)

EmptyDataError: No columns to parse from file when loading several files in a dictionary

I have 1000 csv files that I load using the following code (which puts every file into a dictionary):
from pathlib import Path
import pandas as pd

dataframes = {}
csv_root = Path(".")
for csv_path in csv_root.glob("*.csv"):
    key = csv_path.stem
    dataframes[key] = pd.read_csv(csv_path, skiprows=1)
However when I use this code I get this error:
EmptyDataError: No columns to parse from file
which indicates that an empty file (or one with no data past the skipped header) was encountered.
I would like to know how to identify which of those 1000 csv files are causing trouble, because, as you can understand, checking file by file would consume a lot of time.
Thanks a lot!
I would just use a try/except, like so:
dataframes = {}
csv_root = Path(".")
for csv_path in csv_root.glob("*.csv"):
    key = csv_path.stem
    try:
        dataframes[key] = pd.read_csv(csv_path, skiprows=1)
    except Exception:  # or, more precisely, pd.errors.EmptyDataError
        dataframes[key] = 'error'  # mark the file that failed to parse
This last step will get you the stems with issues:
errored_stems = [k for k, v in dataframes.items() if v == 'error']
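Another way to spot the problem files without parsing them at all, a hedged sketch assuming the troublesome files are simply empty on disk (files that contain only a header would still need the try/except above):
from pathlib import Path

# list csv files that are zero bytes before trying to parse them
empty_files = [p.name for p in Path(".").glob("*.csv") if p.stat().st_size == 0]
print(empty_files)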

How to split a large excel file using Pandas?

I've tried the following (pd is pandas):
for i, chunk in pd.read_excel(os.path.join(INGEST_PATH,file), chunksize=5):
but I am getting this error:
NotImplementedError: chunksize keyword of read_excel is not implemented
I've tried searching for other methods, but most of them are for CSV files, not xlsx. I also have pandas version 0.20.1.
Any help is appreciated.
import numpy as np

df = pd.read_excel(os.path.join(INGEST_PATH, file))
# split the index into 5 roughly equal pieces
idxes = np.array_split(df.index.values, 5)
chunks = [df.loc[idx] for idx in idxes]  # .loc instead of the deprecated .ix
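And since the goal is to split the workbook into separate files, a short hedged follow-up (the output file names are just illustrative):
# write each chunk to its own xlsx file
for n, chunk in enumerate(chunks):
    chunk.to_excel('chunk_{}.xlsx'.format(n), index=False)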
The above solutions weren't working for me because the file wasn't being split properly and ended up omitting the last few rows; it actually gave me an error about unequal divisions or something to that effect.
So I wrote the following; this will work for any file size.
import math
import pandas as pd

url_1 = r'C:/Users/t3734uk/Downloads/ML-GooGLECRASH/amp_ub/df2.csv'
target_folder = r'C:\Users\t3734uk\Downloads\ML-GooGLECRASH\amp_ub'
df = pd.read_csv(url_1)
rows, columns = df.shape

def calcRowRanges(_no_of_files):
    row_ranges = []
    interval_size = math.ceil(rows / _no_of_files)
    print('interval size is ----> {}'.format(interval_size))
    for n in range(_no_of_files):
        row_range = (n * interval_size, (n + 1) * interval_size)
        if row_range[1] > rows:
            row_range = (n * interval_size, rows)
        row_ranges.append(row_range)
    return row_ranges

def splitFile(_df_, _row_ranges):
    for row_range in _row_ranges:
        _df = _df_[row_range[0]:row_range[1]]
        writer = pd.ExcelWriter('FILE_' + str(_row_ranges.index(row_range)) + '_' + '.xlsx')
        _df.to_excel(writer)
        writer.save()  # without save() the xlsx files are never written to disk

def dosplit(num_files):
    row_ranges = calcRowRanges(num_files)
    print(row_ranges)
    print(len(row_ranges))
    splitFile(df, row_ranges)

dosplit(enter_no_files_to_be_split_in)  # pass the number of files to split into
On second thought, the following function is more intuitive:
def splitFile2(_df_, no_of_splits):
    _row_ranges = calcRowRanges(no_of_splits)
    for row_range in _row_ranges:
        _df = _df_[row_range[0]:row_range[1]]
        writer = pd.ExcelWriter('FILE_' + str(_row_ranges.index(row_range)) + '_' + '.xlsx')
        _df.to_excel(writer)
        writer.save()
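As a hedged usage example (the split count 4 is arbitrary, just for illustration):
splitFile2(df, 4)  # split the loaded DataFrame into 4 xlsx files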
