I have 32 datasets with the same structure. I need to do some preparation on each one and then join them. For the cleaning I've written a function, and I then tried to call that function in a loop, but it doesn't work.
Here is my code.
First, I imported the datasets into my environment in a list called files (I'm working in Google Colab):
import glob
import os
import pandas as pd

os.chdir('/content')
extension = 'xls'
all_files = [i for i in glob.glob('*.{}'.format(extension))]

files = []
for filename in all_files:
    data = pd.read_excel(filename, skiprows=6)
    files.append(data)
Second, I wrote my cleaning function:
def data_cleaning(data):
    data = data.iloc[2:, :4]
    data = data[data['Desglose'] != 'Localidades']
    data = data.drop(columns='Desglose')
    data = data.rename(columns={'Total de localidades y su población1': 'poblacion'})
    data['municipio'] = data['Municipio'].str.split(' ', n=1).str.get(-1)
    data['entidad_federativa'] = data['Entidad federativa'].str.split(' ', n=1).str.get(-1)
    data = data.iloc[:, 2:]
    return data
And finally, I tried to write a for loop to repeat the cleaning process on each dataset in the list files:
files_clean = []
for i in files:
    data_clean = data_cleaning(files[i])
    files_clean.append(data_clean)
The error I get is:
TypeError Traceback (most recent call last)
<ipython-input-44-435517607919> in <module>()
1 files_clean = []
2 for i in files:
----> 3 data_clean = data_cleaning(files[i])
4 files_clean.append(data_clean)
TypeError: list indices must be integers or slices, not DataFrame
I've done a similar process in R but I can't repeat it in Python. So, any suggestions would be appreciated.
Thank you very much for your time.
The error TypeError: list indices must be integers or slices, not DataFrame is raised when you try to index a list with a DataFrame instead of an integer. To solve this problem, make sure you access the list with an integer index.
A common scenario where this error is raised is when you iterate over a list and compare the objects in it. To solve this error, you can use range() in your Python for loop:
for i in range(len(files)):
Alternatively, you can check the type of files and the type of one of its elements and make the necessary changes based on that.
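For example, a quick check like this (purely illustrative) shows that iterating over files yields DataFrames, not integers:
print(type(files), type(files[0]))
# e.g. <class 'list'> <class 'pandas.core.frame.DataFrame'>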
The problem is with the index: for i in files does not give you i as an integer but as a DataFrame. A possible solution to your problem is:
for df in files:
    data_clean = data_cleaning(df)
    files_clean.append(data_clean)
or similarly
for i in range(len(files)):
    data_clean = data_cleaning(files[i])
    files_clean.append(data_clean)
or possibly
for i, df in enumerate(files):
    data_clean = data_cleaning(files[i])
    files_clean.append(data_clean)
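Since the end goal is to join the cleaned datasets, a compact variant is a list comprehension followed by pd.concat (a sketch only; the name combined is just illustrative):
files_clean = [data_cleaning(df) for df in files]      # clean every DataFrame in the list
combined = pd.concat(files_clean, ignore_index=True)   # stack the cleaned frames into one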
I tried to do what you suggested but I am getting an error saying ValueError: only one regex group is supported with Index.
I have multiple h5ad files with varying n_obs × n_vars. Here is my code:
import anndata as an

adatas = [an.read_h5ad(filename) for filename in filenames]
batch_names = []
for i in range(len(adatas)):
    adatas[i].var_names_make_unique()
    batch_names.append(filenames[i].split('.')[0])
    print(i, adatas[i])

adata = adatas[0].concatenate(adatas[1:],
                              batch_key='ID',
                              uns_merge="unique",
                              index_unique=None,
                              batch_categories=batch_names)
and this produces the above error. Can anyone help?
I am trying to analyze a large dataset from Yelp. The data is in JSON format, but it is too large, so the script crashes when it tries to read all of the data at once. So I decided to read it line by line and concatenate the lines into a DataFrame to get a proper sample of the data.
f = open('./yelp_academic_dataset_review.json', encoding='utf-8')
I tried it without encoding='utf-8', but that raises an error.
I created a function that reads the file line by line and builds a pandas DataFrame up to a given number of lines.
Some lines are lists, so the script iterates over each list too and adds its items to the DataFrame.
def json_parser(file, max_chunk):
    f = open(file)
    df = pd.DataFrame([])
    for i in range(2, max_chunk + 2):
        try:
            type(f.readlines(i)) == list
            for j in range(len(f.readlines(i))):
                part = json.loads(f.readlines(i)[j])
                df2 = pd.DataFrame(part.items()).T
                df2.columns = df2.iloc[0]
                df2 = df2.drop(0)
                datas = [df2, df]
                df2 = pd.concat(datas)
                df = df2
        except:
            f = open(file, encoding="utf-8")
            for j in range(len(f.readlines(i))):
                try:
                    part = json.loads(f.readlines(i)[j-1])
                except:
                    print(i, j)
                df2 = pd.DataFrame(part.items()).T
                df2.columns = df2.iloc[0]
                df2 = df2.drop(0)
                datas = [df2, df]
                df2 = pd.concat(datas)
                df = df2
    df2.reset_index(inplace=True, drop=True)
    return df2
But I am still getting a list index out of range error. (Yes, I used print to debug.)
So I looked more closely at the lines that cause this error.
Very interestingly, when I try to look at those lines, the script gives me a different list.
Here is what I mean:
I ran the cells repeatedly and got lists of different lengths.
So I looked at the lists:
They seem to be completely different lists. Each run brings back a different list even though the line number is the same, and the readlines documentation is not helping. What am I missing?
Thanks in advance.
You are using the expression f.readlines(i) several times as if it referred to the same set of lines each time.
But as a side effect of evaluating the expression, more lines are actually read from the file each time. At some point you are basing the indices j on more lines than are actually available, because they came from a different invocation of f.readlines.
You should call f.readlines(i) only once in each iteration of the for i in ... loop and store its result in a variable instead.
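A minimal sketch of that fix, based on the parser from the question (same names and logic, just reading each batch of lines once and reusing the stored result):
import json
import pandas as pd

def json_parser(file, max_chunk):
    df = pd.DataFrame([])
    with open(file, encoding='utf-8') as f:
        for i in range(2, max_chunk + 2):
            lines = f.readlines(i)      # read this batch once and keep it
            for line in lines:          # iterate over the stored result, not a new call
                part = json.loads(line)
                row = pd.DataFrame(part.items()).T
                row.columns = row.iloc[0]
                row = row.drop(0)
                df = pd.concat([df, row])
    df.reset_index(inplace=True, drop=True)
    return df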
I have five .csv files with the same fields in the same order that need to be processed as follows:
Get the list of files
Make each file into a DataFrame
Check if a column of letter-number combinations has a specific value (different for each file), e.g. check whether PT333 is in column1 for the file named data1:
column1 column2 column3
PT389 LA image.jpg
PT372 NY image2.jpg
If the column has a specific value, print which value it has and the filename/variable name that I've assigned to that file, and then rename that DataFrame to output1.
I tried to do this, but I don't know how to make it loop and do the same thing for each file.
At the moment it returns the number, but I also want it to return the data frame name, and I also want it to loop through all the files (a to e) to check for all the values in the numbers list.
This is what I have:
import os
import glob
import pandas as pd
from glob import glob
from os.path import expanduser
home = expanduser("~")
os.chdir(home + f'/files/')
data = glob.glob('data*.csv')
data
# If you have tips on how to loop through these rather than
# have a line for each one, open to feedback
a = pd.read_csv(data[0], encoding='ISO-8859-1', error_bad_lines=False)
b = pd.read_csv(data[1], encoding='ISO-8859-1', error_bad_lines=False)
c = pd.read_csv(data[2], encoding='ISO-8859-1', error_bad_lines=False)
d = pd.read_csv(data[3], encoding='ISO-8859-1', error_bad_lines=False)
e = pd.read_csv(data[4], encoding='ISO-8859-1', error_bad_lines=False)
filenames = [a,b,c,d,e]
filelist= ['a','b','c','d','e']
# I am aware that this part is repetitive. Unsure how to fix this,
# I keep getting errors
# Any help appreciated
numbers = ['PT333', 'PT121', 'PT111', 'PT211', 'PT222']
def type():
    for i in a.column1:
        if i == numbers[0]:
            print(numbers[0])
        elif i == numbers[1]:
            print(numbers[1])
        elif i == numbers[2]:
            print(numbers[2])
        elif i == numbers[3]:
            print(numbers[3])
        elif i == numbers[4]:
            print(numbers[4])

type()
Also happy to take any constructive criticism as to how to repeat less code and make things smoother. TIA
Give this a try
for file in glob.glob('data*.csv'):              # loop through each file
    df = pd.read_csv(file,                       # create the DataFrame of the file
                     encoding='ISO-8859-1',
                     error_bad_lines=False)
    result = (df.where(df.isin(numbers))         # check where the DF contains these numbers
                .melt()['value']                 # melt the DF into a series of 'value'
                .dropna()                        # remove any NaNs (non-matches)
                .unique().tolist())              # return the unique values as a list
    if result:                                   # if there are any results
        print(file, ', '.join(result))           # print the file name and the results
The result chain is wrapped in parentheses (rather than backslash continuations) so the inline comments don't cause a SyntaxError if you copy and paste the code.
As mentioned, you should be able to do the same without a DataFrame as well:
for file in glob.glob('data*.csv'):
    data = open(file, encoding='ISO-8859-1').read()
    for num in numbers:
        if num in data:
            print(file, num)
Also happy to take any constructive criticism as to how to repeat less
code and make things smoother.
I hope you don't mind that I started with a code restructure; it makes explaining the next steps easier.
Loading the Files List
Using a list comprehension lets us iterate through the files and load them into a list in one line. It also has memory and time benefits.
files = [pd.read_csv(entry, encoding='ISO-8859-1', error_bad_lines=False) for entry in data]
more on comprehension
Type Function
First we need a parameter so that we can call this function for any given file:
def type(file):
    for value in file.column1:
        if value in numbers:
            print(value)
Calling the Type Function on Multiple Files
With the list of files, we use a for-each loop again here:
for file in files:
    type(file)
more on python for loops
Result
import os
import pandas as pd
from glob import glob
from os.path import expanduser

home = expanduser("~")
os.chdir(home + '/files/')

# please note that I am using glob instead of glob.glob here
data = glob('data*.csv')
files = [pd.read_csv(entry, encoding='ISO-8859-1', error_bad_lines=False) for entry in data]

numbers = ['PT333', 'PT121', 'PT111', 'PT211', 'PT222']

def type(file):
    for value in file.column1:
        if value in numbers:
            print(value)

for file in files:
    type(file)
I would suggest changing the type function, and calling it slightly differently
def type(x):
    for i in x.column1:
        if i == numbers[0]:
            print(i, numbers[0])
        elif i == numbers[1]:
            print(i, numbers[1])
        elif i == numbers[2]:
            print(i, numbers[2])
        elif i == numbers[3]:
            print(i, numbers[3])
        elif i == numbers[4]:
            print(i, numbers[4])

for j in filenames:
    type(j)
I have the following code:
for state in state_list:
    state_df = pd.DataFrame()
    for df in pd.read_csv(tax_sample, sep='\|\|', engine='python',
                          dtype=tax_column_types, chunksize=10, nrows=100):
        state_df = pd.concat(state_df, df[df['state'] == state])
    state_df.to_csv('property' + state + '.csv')
My dataset is quite big, so I'm breaking it into chunks (in reality these would be bigger than 10 observations). I take each chunk, check whether the state matches a particular state in a list, and, if so, store it in a DataFrame and save it down.
In short, I'm trying to take a DataFrame with many different states in it, break it into several DataFrames, each with only one state, and save each one to CSV.
However, the code above gives the error:
TypeError: first argument must be an iterable of pandas objects, you
passed an object of type "DataFrame"
Any idea why?
Thanks,
Mike
Consider iterating over the chunks and each time running .isin() to filter on state_list, but saving the results in a container like a dict or list. As commented, avoid the overhead of expanding DataFrames in a loop.
Afterwards, bind the container with pd.concat, then run a looped groupby on the state field to output each file individually.
df_list = []

reader = pd.read_csv(tax_sample, sep='\|\|', engine='python',
                     dtype=tax_column_types, chunksize=10, nrows=100)

for chunk in reader:
    tmp = chunk[chunk['state'].isin(state_list)]
    df_list.append(tmp)

master_df = pd.concat(df_list)

for g in master_df.groupby('state'):
    g[1].to_csv('property' + g[0] + '.csv')
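A slightly more readable variant of the final loop unpacks each groupby tuple into names instead of indexing g[0] and g[1]:
for state, group in master_df.groupby('state'):
    group.to_csv('property' + state + '.csv')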
I've tried the following (pd is pandas):
for i, chunk in pd.read_excel(os.path.join(INGEST_PATH,file), chunksize=5):
but I am getting this error:
NotImplementedError: chunksize keyword of read_excel is not implemented
I've tried searching for other methods but most of them are for CSV files, not xlsx, I also have pandas version 0.20.1
Any help is appreciated.
import numpy as np

df = pd.read_excel(os.path.join(INGEST_PATH, file))
# split the index values into 5 roughly equal pieces
idxes = np.array_split(df.index.values, 5)
chunks = [df.loc[idx] for idx in idxes]  # .loc replaces the deprecated .ix
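If each piece then needs to be written out on its own, a minimal sketch could follow (the output file names here are just placeholders):
for n, chunk in enumerate(chunks):
    chunk.to_excel('chunk_{}.xlsx'.format(n))  # write each chunk to its own file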
The above solutions weren't working for me because the file wasn't being split properly and ended up omitting the last few rows; actually, it gave me an error about unequal divisions or something to that effect.
So I wrote the following. This will work for any file size.
import math
import pandas as pd

url_1 = r'C:/Users/t3734uk/Downloads/ML-GooGLECRASH/amp_ub/df2.csv'
target_folder = r'C:\Users\t3734uk\Downloads\ML-GooGLECRASH\amp_ub'

df = pd.read_csv(url_1)
rows, columns = df.shape

def calcRowRanges(_no_of_files):
    row_ranges = []
    interval_size = math.ceil(rows / _no_of_files)
    print('interval size is ----> ' + '{}'.format(interval_size))
    for n in range(_no_of_files):
        row_range = (n * interval_size, (n + 1) * interval_size)
        if row_range[1] > rows:
            row_range = (n * interval_size, rows)
        row_ranges.append(row_range)
    return row_ranges

def splitFile(_df_, _row_ranges):
    for row_range in _row_ranges:
        _df = _df_[row_range[0]:row_range[1]]
        writer = pd.ExcelWriter('FILE_' + str(_row_ranges.index(row_range)) + '_' + '.xlsx')
        _df.to_excel(writer)
        writer.save()  # write the file to disk

def dosplit(num_files):
    row_ranges = calcRowRanges(num_files)
    print(row_ranges)
    print(len(row_ranges))
    splitFile(df, row_ranges)

dosplit(enter_no_files_to_be_split_in)  # replace with the desired number of output files
On second thought, the following function is more intuitive:
def splitFile2(_df_, no_of_splits):
    _row_ranges = calcRowRanges(no_of_splits)
    for row_range in _row_ranges:
        _df = _df_[row_range[0]:row_range[1]]
        writer = pd.ExcelWriter('FILE_' + str(_row_ranges.index(row_range)) + '_' + '.xlsx')
        _df.to_excel(writer)
        writer.save()  # write the file to disk
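An illustrative call (the split count here is arbitrary):
splitFile2(df, 5)  # split df into 5 roughly equal Excel files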