I'm creating a matrix and converting it into a DataFrame. Since I'm working with a lot of data and creation takes a while, I wanted to store the matrix in a CSV so I can simply read it back once it has been created. Here is what I'm doing:
transitions = create_matrix(alpha, N)
# convert the matrix to a DataFrame
df = pd.DataFrame(transitions, columns=list(tags), index=list(tags))
df.to_csv(r'D:\U\k\Desktop\prg\F_transition_' + language + '.csv')
df_r = pd.read_csv('transition_en.csv')
The problem is that after reading the CSV back I get this error:
in get_loc raise KeyError(key). KeyError: 'O'
It seems to be thrown by these lines of code:
if i == 0:
    tran_pr = df_r.loc['O', tag]
else:
    tran_pr = df_r.loc[st[-1], tag]
I imagine that once the data has been stored in a CSV, reading the file back does not give me a DataFrame equivalent to the one I had before. How can I adapt these lines of code so the lookup works like it did before?
I tried setting index=False when writing the CSV and also skip_blank_lines=True when reading. Nothing changes.
Can you try:
import pandas as pd
df = pd.DataFrame([[1, 2], [2, 3]], columns = ['A', 'B'], index = ['C', 'D'])
print(df['A']['C'])
When using loc you need to provide the index (row label) first and then the column:
df_r.loc[tag, 'O']
will work.
Don't use index=False when writing the CSV; that will not include the index in the file, so it can't be restored in the DataFrame.
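A minimal sketch with made-up tag labels showing the row-then-column lookup; the index_col=0 argument is my addition here, so that the string labels come back as the index (rather than as an unnamed data column) after the CSV round trip:

import pandas as pd

# Tiny labelled matrix standing in for the real transition matrix.
tags = ['O', 'N', 'V']
df = pd.DataFrame([[0.1, 0.2, 0.7],
                   [0.3, 0.3, 0.4],
                   [0.5, 0.4, 0.1]], index=tags, columns=tags)
df.to_csv('transition_en.csv')

# index_col=0 restores the tag labels as the index instead of leaving them as a data column.
df_r = pd.read_csv('transition_en.csv', index_col=0)

# .loc takes the row label first, then the column label.
print(df_r.loc['O', 'V'])  # 0.7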
I am trying to process a CSV file into a new CSV file containing only the columns of interest and to remove rows with unwanted values of -1. Unfortunately I get unexpected results: the new CSV file automatically includes column 0 (the old ID) even though the script never asks for it (it is not listed in cols = [..]).
How can I renumber the rows to get a new row count? That is, when we remove for example row 9 with id=9, the dataset ids currently run as [..7,8,10...] instead of a fresh count like [..7,8,9,10...]. I hope someone has a solution for this.
import pandas as pd
# take only specific columns from dataset
cols = [1, 5, 6]
data = pd.read_csv('data_sample.csv', usecols=cols, header=None)
data.columns = ["url", "gender", "age"]
# remove rows from dataset with undefined values of -1
data = data[data['gender'] != -1]
data = data[data['age'] != -1]
""" Additional working solution
indexGender = data[data['gender'] == -1].index
indexAge = data[data['age'] == -1].index
# Delete the rows indexes from dataFrame
data.drop(indexGender,inplace=True)
data.drop(indexAge, inplace=True)
"""
data.to_csv('data_test.csv')
Thank you in advance.
I solved the problem with a simple line after the data drops:
data.reset_index(drop=True, inplace=True)
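For completeness, a minimal sketch of how that line fits into the code from the question; the index=False in to_csv is an extra assumption on my part to keep the (now renumbered) index out of the output file:

import pandas as pd

# take only specific columns from the dataset
data = pd.read_csv('data_sample.csv', usecols=[1, 5, 6], header=None)
data.columns = ["url", "gender", "age"]

# remove rows with undefined values of -1
data = data[(data['gender'] != -1) & (data['age'] != -1)]

# renumber the rows so the ids run 0, 1, 2, ... again
data.reset_index(drop=True, inplace=True)

# index=False keeps the index column out of the new CSV
data.to_csv('data_test.csv', index=False)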
I am looking to change part of the string in a column of a DataFrame. However, I cannot get it to update in the DataFrame. This is my code:
import pandas as pd

# File path
csv = '/home/test.csv'

# Read csv to pandas
df = pd.read_csv(csv, header=None, names=['A', 'B', 'C', 'D', 'E', 'F'])

# Select data to update
paths = df['A']

# Loop over data
for x in paths:
    # Select data to update
    old = x[:36]
    # Update value
    new = '/Datasets/RetinaNetData'
    # Replace
    new_path = x.replace(old, new)
    # Save values to DataFrame
    paths.update(new_path)

# Print updated DataFrame
print(df)
The input and the output I would like are:
Input:
/Annotations/test_folder/10_m03293_ORG.png
/Annotations/test_folder/10_m03293_ORG.png
/Annotations/test_folder/10_m03293_ORG.png
/Annotations/test_folder/10_m03293_ORG.png
OutPut:
/Datasets/RetinaNetData/10_m03293_ORG.png
/Datasets/RetinaNetData/10_m03293_ORG.png
/Datasets/RetinaNetData/10_m03293_ORG.png
/Datasets/RetinaNetData/10_m03293_ORG.png
Assuming that all of the rows are strings and all of them have at least 36 characters, you can use .str to get the part of the cells after the 36th character. Then you can just use the + operator to combine the new beginning with the remainder of each cell's contents:
df.A = '/Datasets/RetinaNetData' + df.A.str[36:]
As a general tip, methods like this that operate across the whole dataframe at once are going to be more efficient than looping over each row individually.
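As a small, hedged demo on the sample rows from the question: the slice offset has to match the length of the prefix being replaced, so here it is computed from the example path rather than hard-coded (use 36 only if the real paths really have a 36-character prefix):

import pandas as pd

# Made-up frame using the sample paths from the question.
df = pd.DataFrame({'A': ['/Annotations/test_folder/10_m03293_ORG.png'] * 4})

# Length of the old prefix that should be replaced.
prefix_len = len('/Annotations/test_folder/')

# Keep everything after the old prefix and prepend the new one.
df.A = '/Datasets/RetinaNetData/' + df.A.str[prefix_len:]

print(df)
# Every row becomes /Datasets/RetinaNetData/10_m03293_ORG.png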
I wanted to impute missing values in a dataframe using the fillna command in pandas. Here is a snippet of my code:
import glob
import pandas as pd

files = glob.glob("IN.201*.csv")
i = 0
n = 1
# the while loops are for reading and writing different subsets of the table into
# different .txt files:
while i < 15:
    j = 0
    while j < 7:
        dfs = []
        m = 1
        # for loop over only one file for testing:
        for file in files[:1]:
            z = i + 1
            # reading a subset of the dataframe:
            k = float(68.109375) + float(1.953125) * i
            k1 = float(68.109375) + float(1.953125) * z
            l = float(8.0) + float(4) * j
            l1 = float(8.0) + float(4) * (j + 1)
            df = pd.read_csv(path + file).query('@k <= lon < @k1 and @l < lat <= @l1')[['lon', 'lat', 'country', 'avg']]
            # renaming columns in df:
            df.rename(columns={"avg": "Day" + str(m)}, inplace=True)
            # print(df)
            m = m + 1
            dfs.append(df)
        # imputation:
        df_final = dfs[0].fillna(method='bfill', axis='columns', inplace=True).fillna(method='ffill', axis=1, inplace=True)
        # writing to a txt file:
        with open('Region_' + str(n), 'w+') as f:
            df_final.to_csv(f)
        n = n + 1
        j = j + 1
    i = i + 1
Error:
Traceback (most recent call last):
  File "imputation_test.py", line 42, in <module>
    df_final=dfs[0].fillna(method='bfill', axis='columns', inplace=True).fillna(
    method='ffill', axis=1, inplace=True)
  File "C:\Users\DELL\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\core\frame.py", line 3787, in fillna
    downcast=downcast, **kwargs)
  File "C:\Users\DELL\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\core\generic.py", line 5359, in fillna
    raise NotImplementedError()
NotImplementedError
Motivation for the code:
I essentially wanted to read a .csv file into multiple dataframes consisting of different subsets of the table (hence all the loops) in order to rearrange and split the .csv file(s) into a more suitable format; I actually want to do this for multiple .csv files. I then wanted to fill in the missing data using the fillna command along the column axis.
The code is structured for reading multiple .csv files and thus has commands that look unnecessary here, such as dfs=[] and the for loop, but for simplicity I was trying this code out on a single file first, and I got this error.
Feel free to ask for more info for this error.
Thanks!
Use bfill and ffill with axis=1:
dfs = dfs.bfill(axis=1).ffill(axis=1)
Part of the problem is the combination of inplace=True and method chaining: inplace=True returns None, so there is nothing to call the chained method on. The other point is that fillna(method='ffill') can be shortened to just ffill().
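A tiny illustration of that None return, with made-up values:

import numpy as np
import pandas as pd

df = pd.DataFrame([[np.nan, 2, np.nan], [1, np.nan, 3]])

# Any inplace=True call returns None, so chaining another method onto it fails.
print(df.fillna(0, inplace=True))  # None

# Returning new frames (the default) makes the chain work.
df2 = pd.DataFrame([[np.nan, 2, np.nan], [1, np.nan, 3]])
filled = df2.bfill(axis=1).ffill(axis=1)
print(filled)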
I came across this error when using ffill (a synonym for fillna(method='ffill')) with inplace=True to fill missing values across columns, when the data frame contains both ints and NaNs in the same row.
The workaround is to use inplace=False:
df = pd.DataFrame(data={'a': [1]})
df = df.reindex(columns=['a', 'b', 'c']) # columns b and c contain NaN
df.ffill(axis='columns', inplace=True) # raises NotImplementedError
df = df.ffill(axis='columns') # works (inplace defaults to False)
I have a pandas dataframe which I have created from data stored in an xml file:
Initially the xml file is opened and parsed:
xmlData = etree.parse(filename)
trendData = xmlData.findall("//TrendData")
I created a dictionary which lists all the data names (which are used as column names) as keys and gives the position of the data in the xml file:
Parameters = {"TreatmentUnit":("Worklist/AdminData/AdminValues/TreatmentUnit"),
"Modality":("Worklist/AdminData/AdminValues/Modality"),
"Energy":("Worklist/AdminData/AdminValues/Energy"),
"FieldSize":("Worklist/AdminData/AdminValues/Fieldsize"),
"SDD":("Worklist/AdminData/AdminValues/SDD"),
"Gantry":("Worklist/AdminData/AdminValues/Gantry"),
"Wedge":("Worklist/AdminData/AdminValues/Wedge"),
"MU":("Worklist/AdminData/AdminValues/MU"),
"My":("Worklist/AdminData/AdminValues/My"),
"AnalyzeParametersCAXMin":("Worklist/AdminData/AnalyzeParams/CAX/Min"),
"AnalyzeParametersCAXMax":("Worklist/AdminData/AnalyzeParams/CAX/Max"),
"AnalyzeParametersCAXTarget":("Worklist/AdminData/AnalyzeParams/CAX/Target"),
"AnalyzeParametersCAXNorm":("Worklist/AdminData/AnalyzeParams/CAX/Norm"),
....}
This is just a small part of the dictionary; the actual one lists over 80 parameters.
The dictionary keys are then sorted:
sortedKeys = list(sorted(Parameters.keys()))
A header is created for the pandas dataframe:
dateList=[]
dateList.append('date')
headers = dateList+sortedKeys
I then create an empty pandas dataframe with the same number of rows as there are records in trendData and with the columns set to headers, and then loop through the file filling the dataframe:
df = pd.DataFrame(index=np.arange(0, len(trendData)), columns=headers)
for a, b in enumerate(trendData):
    result = {}
    result["date"] = dateutil.parser.parse(b.attrib['date'])
    for i, j in enumerate(Parameters):
        result[j] = b.findtext(Parameters[j])
    df.loc[a] = result
df = df.set_index('date')
This seems to work fine, but the problem is that the dtype for each column is set to 'object' whereas most should be integers. It's possible to use:
df.convert_objects(convert_numeric=True)
and it works fine, but it is now deprecated.
I can also use, for example:
df.AnalyzeParametersBQFMax = pd.to_numeric(df.AnalyzeParametersBQFMax)
to convert individual columns. But is there a way of using pd.to_numeric with a list of column names? I can create a list of the columns which should be integers using the following:
int64list = []
for q in sortedKeys:
    if q.startswith("AnalyzeParameters"):
        int64list.append(q)
but I can't find a way of passing this list to the function.
You can explicitly replace columns in a DataFrame with the same column cast to another dtype.
Try this:
import pandas as pd
data = pd.DataFrame({'date':[2000, 2001, 2002, 2003], 'type':['A', 'B', 'A', 'C']})
data['date'] = data['date'].astype('int64')
Calling data.dtypes should now return the following:
date int64
type object
dtype: object
For multiple columns, use a for loop to run through the int64list you mentioned in your question, as sketched below.
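A minimal sketch of that loop, using a made-up frame in place of the real df and int64list from the question:

import pandas as pd

# Made-up frame; in the real code df and int64list come from the question.
df = pd.DataFrame({'AnalyzeParametersCAXMin': ['1', '2'],
                   'AnalyzeParametersCAXMax': ['3', '4'],
                   'TreatmentUnit': ['U1', 'U2']})
int64list = [c for c in df.columns if c.startswith("AnalyzeParameters")]

# Convert each listed column in place.
for col in int64list:
    df[col] = pd.to_numeric(df[col])

print(df.dtypes)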
For multiple columns you can do it this way:
import numpy as np

cols = df.filter(like='AnalyzeParameters').columns.tolist()
df[cols] = df[cols].astype(np.int64)
What is the best approach for importing into a pandas DataFrame a CSV that has a different number of columns for each row, using pandas or the csv module?
"H","BBB","D","Ajxxx Dxxxs"
"R","1","QH","DTR"," "," ","spxxt rixxls, raxxxd","1"
Using this code:
import pandas as pd
data = pd.read_csv("smallsample.txt",header = None)
the following error is generated
Error tokenizing data. C error: Expected 4 fields in line 2, saw 8
Supplying a list of column names to read_csv() should do the trick.
ex: names=['a', 'b', 'c', 'd', 'e']
https://github.com/pydata/pandas/issues/2981
Edit: if you don't want to supply column names, then do what Nicholas suggested.
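A minimal sketch of this on the sample data, assuming enough names are supplied to cover the widest row (8 fields in line 2); shorter rows are padded with NaN:

import pandas as pd

names = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']
data = pd.read_csv("smallsample.txt", header=None, names=names)
print(data)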
You can dynamically generate column names as simple counters (0, 1, 2, etc).
Dynamically generate column names
import pandas as pd

# Input
data_file = "smallsample.txt"
# Delimiter
data_file_delimiter = ','
# The max column count a line in the file could have
largest_column_count = 0

# Loop the data lines
with open(data_file, 'r') as temp_f:
    # Read the lines
    lines = temp_f.readlines()
    for l in lines:
        # Count the columns in the current line
        column_count = len(l.split(data_file_delimiter))
        # Keep the largest column count seen so far
        largest_column_count = column_count if largest_column_count < column_count else largest_column_count

# Generate column names (will be 0, 1, 2, ..., largest_column_count - 1)
column_names = [i for i in range(0, largest_column_count)]

# Read csv
df = pd.read_csv(data_file, header=None, delimiter=data_file_delimiter, names=column_names)
# print(df)
Missing values will be assigned to the columns which your CSV lines don't have a value for.
A polished version of P.S.'s answer is as follows. It works.
Remember that this inserts a lot of missing values into the dataframe.
import pandas as pd

### Loop the data lines
with open("smallsample.txt", 'r') as temp_f:
    # get the number of columns in each line
    col_count = [len(l.split(",")) for l in temp_f.readlines()]

### Generate column names (names will be 0, 1, 2, ..., maximum columns - 1)
column_names = [i for i in range(0, max(col_count))]

### Read csv
df = pd.read_csv("smallsample.txt", header=None, delimiter=",", names=column_names)
If you want something really concise without explicitly giving column names, you could do this:
Make a one column DataFrame with each row being a line in the .csv file
Split each row on commas and expand the DataFrame
df = pd.read_fwf('<filename>.csv', header=None)
df = df[0].str.split(',', expand=True)
Error tokenizing data. C error: Expected 4 fields in line 2, saw 8
The error gives a clue to the solution: "Expected 4 fields in line 2, saw 8" means that the second row has 8 fields while the first row has 4.
import pandas as pd
# inside range set the maximum value you can see in "Expected 4 fields in line 2, saw 8"
# here will be 8
data = pd.read_csv("smallsample.txt",header = None,names=range(8))
Use range instead of manually setting names as it will be cumbersome when you have many columns.
You can use shantanu pathak's method to find the longest row length in your data.
Additionally, you can fill the NaN values with 0 if you need uniform data lengths, e.g. for clustering (k-means):
new_data = data.fillna(0)
We could even use the pd.read_table() method to read the csv file, which loads it as a single-column DataFrame that can then be split on ','; a rough sketch follows.
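A rough sketch of that idea (the quoting=csv.QUOTE_NONE argument is my addition so that every line survives as one field; note that a plain comma split keeps the quote characters and breaks quoted fields that themselves contain commas):

import csv
import pandas as pd

# Read each line as a single string (the default tab separator never matches).
raw = pd.read_table("smallsample.txt", header=None, quoting=csv.QUOTE_NONE)

# Split every line on commas and expand into columns.
df = raw[0].str.split(',', expand=True)
print(df)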
Manipulate your csv so that the first row is the one with the most elements and all following rows have fewer. Pandas will create as many columns as the first row has.