I wanted to impute a dataframe using the fillna command in pandas. Here is a snippet of my code:
import glob
import pandas as pd

files = glob.glob("IN.201*.csv")
i = 0
n = 1
# the while loops are for reading and writing different subsets of the table into
# different .txt files:
while i < 15:
    j = 0
    while j < 7:
        dfs = []
        m = 1
        # for loop over only one file for testing:
        for file in files[:1]:
            z = i + 1
            # reading a subset of the dataframe:
            k = float(68.109375) + float(1.953125) * i
            k1 = float(68.109375) + float(1.953125) * z
            l = float(8.0) + float(4) * j
            l1 = float(8.0) + float(4) * (j + 1)
            df = pd.read_csv(path + file).query('@k <= lon < @k1 and @l < lat <= @l1')[['lon', 'lat', 'country', 'avg']]
            # renaming columns in df:
            df.rename(columns={"avg": "Day" + str(m)}, inplace=True)
            # print(df)
            m = m + 1
            dfs.append(df)
        # imputation:
        df_final = dfs[0].fillna(method='bfill', axis='columns', inplace=True).fillna(method='ffill', axis=1, inplace=True)
        # writing to a txt file:
        with open('Region_' + str(n), 'w+') as f:
            df_final.to_csv(f)
        n = n + 1
        j = j + 1
    i = i + 1
Error:
Traceback (most recent call last):
  File "imputation_test.py", line 42, in <module>
    df_final=dfs[0].fillna(method='bfill', axis='columns', inplace=True).fillna(
        method='ffill', axis=1, inplace=True)
  File "C:\Users\DELL\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\core\frame.py", line 3787, in fillna
    downcast=downcast, **kwargs)
  File "C:\Users\DELL\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\core\generic.py", line 5359, in fillna
    raise NotImplementedError()
NotImplementedError
Motivation for the code:
I essentially wanted to read a .csv file into multiple dataframes, each holding a different subset of the table (hence all the loops), in order to rearrange and split the .csv file(s) into a more suitable format; I actually want to do this for multiple .csv files. I then wanted to fill the missing data using fillna along the column axis.
The code is structured for reading multiple .csv files and thus has commands that are unnecessary here, like 'dfs=[]' and the 'for' loop, but to keep things simple I was trying out this code on a single file first, and I got this error.
Feel free to ask for more info about this error.
Thanks!
Use bfill and ffill with axis=1:
df_final = dfs[0].bfill(axis=1).ffill(axis=1)
Part of the problem is the combination of inplace=True and method chaining: inplace=True returns None, so there is nothing to call the chained method on. The other part is that fillna(method='ffill') can be shortened to just ffill().
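For example, a minimal sketch on a small made-up frame with gaps in its rows:
import numpy as np
import pandas as pd

# a small frame with missing values scattered across each row
df = pd.DataFrame({'Day1': [1.0, np.nan], 'Day2': [np.nan, 5.0], 'Day3': [3.0, np.nan]})

# fill along the columns in both directions and keep the returned frame
df_final = df.bfill(axis=1).ffill(axis=1)
print(df_final)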
I came across this error when using ffill (which is a synonym for fillna(method='ffill')) to fill missing values across columns with inplace=True when the data frame contains both ints and NaNs in the same row.
The workaround is to use inplace=False:
df = pd.DataFrame(data={'a': [1]})
df = df.reindex(columns=['a', 'b', 'c']) # columns b and c contain NaN
df.ffill(axis='columns', inplace=True) # raises NotImplementedError
df = df.ffill(axis='columns') # works (inplace defaults to False)
Related
I'm creating a matrix and converting it into a DataFrame after creation. Since I'm working with lots of data and the creation takes a while, I wanted to store the matrix in a CSV so I can just read it back once it has been created. Here is what I'm doing:
transitions = create_matrix(alpha, N)
# convert the matrix to a DataFrame
df = pd.DataFrame(transitions, columns=list(tags), index=list(tags))
df.to_csv(r'D:\U\k\Desktop\prg\F_transition_' + language + '.csv')
df_r = pd.read_csv('transition_en.csv')
The fact is that after reading from CSV I got the error:
in get_loc raise KeyError(key). KeyError: 'O'
It seems this is thrown by those lines of code:
if i == 0:
    tran_pr = df_r.loc['O', tag]
else:
    tran_pr = df_r.loc[st[-1], tag]
I imagine that once the data is stored in a CSV, reading the file back does not give me a DataFrame equivalent to the one I had before. How can I adapt these lines of code so the lookup works like it did before?
I tried to set index=False when creating the csv and also skip_blank_lines=True when reading. Nothing changed.
df_r is like:
can you try:
import pandas as pd
df = pd.DataFrame([[1, 2], [2, 3]], columns = ['A', 'B'], index = ['C', 'D'])
print(df['A']['C'])
While using loc you need to provide the index label first and then the column label, so
df_r.loc[tag, 'O']
will work.
Also, don't use index=False when writing the CSV, as that would leave the index out of the file.
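Another common cause (an assumption here, since df_r isn't shown) is that the row labels are not restored when the CSV is read back; writing the index and reading with index_col=0 keeps the label-based lookup working:
import pandas as pd

# hypothetical small transition matrix with the tags as row/column labels
tags = ['O', 'NN', 'VB']
df = pd.DataFrame([[0.1, 0.5, 0.4],
                   [0.3, 0.3, 0.4],
                   [0.2, 0.6, 0.2]], index=tags, columns=tags)

df.to_csv('transition_en.csv')                        # keep the index in the file
df_r = pd.read_csv('transition_en.csv', index_col=0)  # restore it when reading back

print(df_r.loc['O', 'NN'])  # row label first, then column label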
I got the following warning:
PerformanceWarning: DataFrame is highly fragmented. This is usually
the result of calling frame.insert many times, which has poor
performance. Consider using pd.concat instead. To get a
de-fragmented frame, use newframe = frame.copy()
when I tried to append multiple dataframes, like this:
df1 = pd.DataFrame()
for file in files:
    df = pd.read_pickle(file)
    df['id'] = file
    df1 = df1.append(df, ignore_index=True)
where
df['id'] = file
seems to cause the warning. I wonder if anyone can explain how copy() can avoid or reduce the fragmentation problem, or suggest other solutions to avoid the issue.
Thanks,
I tried to create test code to duplicate the problem, but I don't see the PerformanceWarning with a test dataset (random integers). The same code continues to produce the warning when reading in the real dataset. It looks like something in the real dataset triggers the issue.
import pandas as pd
import numpy as np
import os
import glob

rows = 35000
cols = 1900

def gen_data(rows, cols, num_files):
    if not os.path.isdir('./data'):
        os.mkdir('./data')
    files = []
    for i in range(num_files):
        file = f'./data/{i}.pkl'
        pd.DataFrame(
            np.random.randint(1, 1_000, (rows, cols))
        ).to_pickle(file)
        files.append(file)
    return files

# comment out the first line to run the real dataset; comment out the second line to run the testing dataset
files = gen_data(rows, cols, 10)  # testing dataset, runs okay
files = glob.glob('../pickles3/my_data_*.pickle')  # real dataset, gets the performance warning

dfs = []
for file in files:
    df = pd.read_pickle(file)
    df['id'] = file
    dfs.append(df)
dfs = pd.concat(dfs, ignore_index=True)
append is not an efficient method for this operation. concat is more appropriate in this situation.
Replace
df1 = df1.append(df, ignore_index =True)
with
df1 = pd.concat((df1, df), axis=0)
Details about the differences are in this question: Pandas DataFrame concat vs append
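Even better than concatenating inside the loop is to collect the pieces in a list and call concat once at the end; a small sketch (the file pattern is a placeholder):
import glob
import pandas as pd

files = glob.glob('./data/*.pkl')  # placeholder pattern for the input files

# collect each piece in a list and build the final frame in a single concat call
pieces = []
for file in files:
    df = pd.read_pickle(file)
    df['id'] = file
    pieces.append(df)

df1 = pd.concat(pieces, ignore_index=True)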
I had the same problem. This raised the PerformanceWarning:
df['col1'] = False
df['col2'] = 0
df['col3'] = 'foo'
This didn't:
df[['col1', 'col2', 'col3']] = (False, 0, 'foo')
Maybe you're adding single columns elsewhere?
copy() is supposed to consolidate the dataframe, and thus defragment it. There was a bug fix for this in pandas 1.3.1 [GH 42579][1]. Copies of a larger dataframe might get expensive.
Tested on pandas 1.5.2, python 3.8.15
[1]: https://github.com/pandas-dev/pandas/pull/42579
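As a rough sketch of the pattern the warning describes (sizes and column names made up): repeatedly inserting single columns fragments the frame, and copy() returns a consolidated one.
import numpy as np
import pandas as pd

df = pd.DataFrame({'base': np.zeros(1000)})

# inserting many single columns one at a time fragments the internal block structure
for i in range(200):
    df[f'col{i}'] = np.zeros(1000)

# copy() returns a consolidated (de-fragmented) frame
df = df.copy()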
This is a problem with a recent update. Check this issue from pandas-dev. It seems to have been resolved in pandas version 1.3.1 (reference PR).
The question is quite self-explanatory: is there any way to read the time series data from the csv file while skipping the first column?
I tried this code:
df = pd.read_csv("occupancyrates.csv", delimiter = ',')
df = df[:,1:]
print(df)
But this is throwing an error:
"TypeError: unhashable type: 'slice'"
If you know the name of the column just do:
df = pd.read_csv("occupancyrates.csv") # no need to use the delimiter = ','
df = df.drop(['your_column_to_drop'], axis=1)
print(df)
df = pd.read_csv("occupancyrates.csv")
df.pop('column_name')
A dataframe is like a dictionary, where column names are the keys and the column contents are the values. For example:
d = dict(a=1,b=2)
d.pop('a')
Now if you print d, the output will be
{'b': 2}
This is what I have done above to remove a column from the dataframe.
This way you don't need to assign the result back to the dataframe like in the other answer(s).
df = df.iloc[:, 1:]
This way you don't even need to specify inplace=True anywhere.
The simplest way to delete the first column should be:
del df[df.columns[0]]
or
df.pop(df.columns[0])
I have one csv, test1.csv (it has no headers!). As you can see, the delimiter is a pipe, but there is also exactly one tab after the eighth column.
ug|s|b|city|bg|1|94|ON-05-0216 9.72|28|288
ug|s|b|city|bg|1|94|ON-05-0217 9.72|28|288
I have second file test2.csv with only delimiter pipe
ON-05-0216|100|50
ON-05-0180|244|152
ON-05-0219|269|146
Because only one value (ON-05-0216) from the eighth column of the first file matches the first column of the second file, I should get only one row in the output file, but with an added SUM column built from the second and third columns of the second file (100+50).
So the final result is the following:
ug|s|b|city|bg|1|94|ON-05-0216 Total=150|9.72|28|288
or
ug|s|b|city|bg|1|94|ON-05-0216|Total=150 9.72|28|288
whatever is easier.
I thought that the best way to do this is with pandas, but I am stuck on handling the multiple delimiters in the first file and on how to match columns without column names, so I am not sure how to continue.
import pandas as pd
a = pd.read_csv("test1.csv", header=None)
b = pd.read_csv("test2.csv", header=None)
merged = a.merge(b,)
merged.to_csv("output.csv", index=False)
Thank you in advance
Use:
import numpy as np
import pandas as pd

# Reading files
df1 = pd.read_csv('file1.csv', header=None, sep='|')
df2 = pd.read_csv('file2.csv', header=None, sep='|')
# splitting file on tab and concatenating with rest
ndf = pd.concat([df1.iloc[:,:7], df1[7].str.split('\t', expand=True), df1.iloc[:,8:]], axis=1)
ndf.columns = np.arange(11)
# adding values from df2 and bringing in format Total=sum
df2.columns = ['c1', 'c2', 'c3']
tot = df2.eval('c2+c3').apply(lambda x: 'Total='+str(x))
# Finding which rows needs to be retained
idx_1 = ndf.iloc[:,7].str.split('-',expand=True).iloc[:,2]
idx_2 = df2.c1.str.split('-',expand=True).iloc[:,2]
idx = idx_1.isin(idx_2) # Updated
ndf = ndf[idx].reset_index(drop=True)
tot = tot[idx].reset_index(drop=True)
# concatenating both CSV together and writing output csv
ndf.iloc[:,7] = ndf.iloc[:,7].map(str) + chr(9) + tot
pd.concat([ndf.iloc[:,:8],ndf.iloc[:,8:]], axis=1).to_csv('out.csv', sep='|', header=None, index=None)
# OUTPUT
# ug|s|b|city|bg|1|94|ON-05-0216 Total=150|9.72|28|288
You can use pipe as a delimiter when reading the csv, pd.read_csv(..., sep='|'), and only split the tab-separated column later on, by using this example here.
When merging two dataframes, you must have a common column that you will merge on. You could use it as the index for easier appending after you do the necessary math on the separate dataframes. A rough sketch of that approach follows.
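Here is a minimal sketch of that idea, assuming test1.csv and test2.csv look like the samples above (the exact output layout is easy to adjust):
import pandas as pd

# both files are pipe-delimited with no header row
df1 = pd.read_csv('test1.csv', sep='|', header=None)
df2 = pd.read_csv('test2.csv', sep='|', header=None, names=['key', 'v1', 'v2'])

# column 7 of the first file holds "<key>\t<value>"; split it on the tab
df1[['key', 'val']] = df1[7].str.split('\t', expand=True)

# per-key total of the two numeric columns in the second file
df2['total'] = df2['v1'] + df2['v2']

# an inner merge keeps only the keys present in both files
merged = df1.merge(df2[['key', 'total']], on='key')

# rebuild column 7 as "<key><tab>Total=<sum> <value>"; adjust the layout as needed
merged[7] = merged['key'] + '\tTotal=' + merged['total'].astype(str) + ' ' + merged['val']
merged.drop(columns=['key', 'val', 'total']).to_csv('output.csv', sep='|', header=False, index=False)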
I have a pandas dataframe which I have created from data stored in an xml file:
Initially, the xml file is opened and parsed:
xmlData = etree.parse(filename)
trendData = xmlData.findall("//TrendData")
I created a dictionary that has all the data names (which are used as column names) as keys and gives the position of the data in the xml file as values:
Parameters = {"TreatmentUnit":("Worklist/AdminData/AdminValues/TreatmentUnit"),
"Modality":("Worklist/AdminData/AdminValues/Modality"),
"Energy":("Worklist/AdminData/AdminValues/Energy"),
"FieldSize":("Worklist/AdminData/AdminValues/Fieldsize"),
"SDD":("Worklist/AdminData/AdminValues/SDD"),
"Gantry":("Worklist/AdminData/AdminValues/Gantry"),
"Wedge":("Worklist/AdminData/AdminValues/Wedge"),
"MU":("Worklist/AdminData/AdminValues/MU"),
"My":("Worklist/AdminData/AdminValues/My"),
"AnalyzeParametersCAXMin":("Worklist/AdminData/AnalyzeParams/CAX/Min"),
"AnalyzeParametersCAXMax":("Worklist/AdminData/AnalyzeParams/CAX/Max"),
"AnalyzeParametersCAXTarget":("Worklist/AdminData/AnalyzeParams/CAX/Target"),
"AnalyzeParametersCAXNorm":("Worklist/AdminData/AnalyzeParams/CAX/Norm"),
....}
This is just a small part of the dictionary; the actual one lists over 80 parameters.
The dictionary keys are then sorted:
sortedKeys = list(sorted(Parameters.keys()))
A header is created for the pandas dataframe:
dateList=[]
dateList.append('date')
headers = dateList+sortedKeys
I then create an empty pandas dataframe with the same number of rows as the number of records in trendData and with the column headers set to 'headers' and then loop through the file filling the dataframe:
df = pd.DataFrame(index=np.arange(0, len(trendData)), columns=headers)
for a, b in enumerate(trendData):
    result = {}
    result["date"] = dateutil.parser.parse(b.attrib['date'])
    for i, j in enumerate(Parameters):
        result[j] = b.findtext(Parameters[j])
    df.loc[a] = result
df = df.set_index('date')
This seems to work fine, but the problem is that the dtype of each column is set to 'object', whereas most should be integers. It's possible to use:
df.convert_objects(convert_numeric=True)
and it works fine, but it is now deprecated.
I can also use, for example, :
df.AnalyzeParametersBQFMax = pd.to_numeric(df.AnalyzeParametersBQFMax)
to convert individual columns. But is there a way of using pd.to_numeric with a list of column names? I can create a list of the columns which should be integers using the following:
int64list = []
for q in sortedKeys:
    if q.startswith("AnalyzeParameters"):
        int64list.append(q)
but I can't find a way of passing this list to the function.
You can explicitly replace columns in a DataFrame with the same column just with another dtype.
Try this:
import pandas as pd
data = pd.DataFrame({'date':[2000, 2001, 2002, 2003], 'type':['A', 'B', 'A', 'C']})
data['date'] = data['date'].astype('int64')
When you now call data.dtypes it should return the following:
date int64
type object
dtype: object
For multiple columns, use a for loop to run through the int64list you mentioned in your question, as sketched below.
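A minimal sketch of that (the column names and values here are made up):
import pandas as pd

# hypothetical frame with object-dtype columns, as produced by the XML loop above
df = pd.DataFrame({'AnalyzeParametersCAXMin': ['1', '2'],
                   'AnalyzeParametersCAXMax': ['3', '4'],
                   'TreatmentUnit': ['U1', 'U2']})

int64list = [c for c in df.columns if c.startswith('AnalyzeParameters')]

# loop variant: convert each listed column individually
for col in int64list:
    df[col] = pd.to_numeric(df[col])

# equivalent one-liner: apply pd.to_numeric to the whole column subset
df[int64list] = df[int64list].apply(pd.to_numeric)

print(df.dtypes)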
For multiple columns you can also do it this way:
import numpy as np

cols = df.filter(like='AnalyzeParameters').columns.tolist()
df[cols] = df[cols].astype(np.int64)