Converting an ipysheet.sheet to a DataFrame while preserving manual changes - python

The aim is to create an interactive dataframe whose cell values I can edit without coding.
To me it seems it should work the following way:
creating an ipysheet.sheet
editing cells manually
converting it to a pandas DataFrame
The problem: after creating an ipysheet.sheet I manually changed the values of some cells and then converted it to a pandas DataFrame, but the changes are not reflected in this DataFrame; if you just display the sheet without converting, you can see the changes.
d = {'col1': [2,8], 'col2': [3,6]}
df = pd.DataFrame(data=d)
sheet1 = ipysheet.sheet(rows=len(df.columns) + 1, columns=3)
first_col = df.columns.to_list()
first_col.insert(0, 'Attribute')
column = ipysheet.column(0, value = first_col, row_start = 0)
cell_value1 = ipysheet.cell(0, 1, 'Format')
cell_value2 = ipysheet.cell(0, 2, 'Description')
sheet1
creating and displaying sheet1
ipysheet.to_dataframe(sheet1)
converting to a pd.DataFrame; the manual edits are lost here

Solved by predefining all empty cells as np.nan. You can then edit them manually, and the edits carry over to the DataFrame on conversion.
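A minimal sketch of that fix, applied to the example above (cell_range is ipysheet's API for filling a block of cells; here it covers the 2x2 block the original code left empty):
import numpy as np
import pandas as pd
import ipysheet
d = {'col1': [2, 8], 'col2': [3, 6]}
df = pd.DataFrame(data=d)
sheet1 = ipysheet.sheet(rows=len(df.columns) + 1, columns=3)
first_col = df.columns.to_list()
first_col.insert(0, 'Attribute')
column = ipysheet.column(0, value=first_col, row_start=0)
cell_value1 = ipysheet.cell(0, 1, 'Format')
cell_value2 = ipysheet.cell(0, 2, 'Description')
# pre-fill the remaining cells with np.nan so they exist as real cell widgets;
# manual edits to them then survive the conversion
filler = ipysheet.cell_range([[np.nan, np.nan], [np.nan, np.nan]], row_start=1, column_start=1)
sheet1                         # edit cells by hand in the notebook ...
ipysheet.to_dataframe(sheet1)  # ... the edits now show up here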

Related

Apply changes to dataframes in a dict with a for loop: how to do it?

I can't get the alterations I make to dataframes inside a dictionary to stick. The changes are made with a for loop.
The problem is that although the loop itself works (each iterated df shows the changes), the changes do not propagate to the dictionary the dfs are in.
The end goal is to create a merge of all the dataframes, since they come from different Excel files and sheets.
Here is the code:
Import the two Excel files, passing None to the sheet_name parameter so that all sheets of each document are read into a dict. I have 8 sheets in the EEG Excel file and 5 in the SC file.
import numpy as np
import pandas as pd
eeg = pd.read_excel("path_file", sheet_name=None)
sc = pd.read_excel("path_file", sheet_name=None)
Merge the second dictionary into the first with the update method. The eeg dict now contains both EEG and SC, so I have a dict with 13 dfs inside.
eeg.update(sc)
The for loop is needed to carry out some modifications inside each single df:
set the index to a specific column (common to all dfs), rename it, add a prefix corresponding to the df's key to the column names, and lastly replace 0 with NaN.
for key, df in eeg.items():
    df.set_index('Unnamed: 0', inplace=True)
    df.index.rename('Sbj', inplace=True)
    df = df.add_prefix(key + '_')
    df.replace(0, np.nan, inplace=True)
Although the loop runs over the dictionary items and each iterated dataframe works, I don't see the changes on the dictionary's dfs and therefore can't proceed to extract them into a list and then merge.
Inside the for loop the single df looks good, but when I go back to the dfs in the dict, they are still as before.
You need to map your modified dataframe back into your dictionary: inside the loop, df is only a local name, so df = df.add_prefix(...) rebinds it to a new object without touching the dict entry.
for key, df in eeg.items():
    df.set_index('Unnamed: 0', inplace=True)
    df.index.rename('Sbj', inplace=True)
    df = df.add_prefix(key + '_')
    df.replace(0, np.nan, inplace=True)
    eeg[key] = df  # map df back into eeg
What you probably want is:
# merge the dataframes in your dictionary into one
df1 = pd.DataFrame()
for key, df in eeg.items():
    # the key-specific steps have to happen per sheet, before concatenating;
    # after the loop, `key` would only refer to the last sheet
    df.set_index('Unnamed: 0', inplace=True)
    df.index.rename('Sbj', inplace=True)
    df = df.add_prefix(key + '_')
    df1 = pd.concat([df1, df])
# the remaining change can be applied to the merged dataframe as a whole
df1.replace(0, np.nan, inplace=True)
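As a side note (an alternative, not part of the original answer): pd.concat also accepts the dict directly and turns its keys into an extra index level, which can replace the manual accumulator loop when no per-sheet prefixing is needed:
df1 = pd.concat(eeg)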

Create a dictionary from an empty pandas dataframe with only column names

I have a pandas data frame with only two column names (a single row, which can also be considered the headers). I want to make a dictionary out of this, with the first column name as the key and the second as the value. I already tried the
to_dict() method, but it's not working as it's an empty dataframe.
Example
df = |Land|Norway| should become {'Land': 'Norway'}
I can convert the pandas data frame to some other type and work around it, but this question is mostly about learning the best/different/efficient approaches to this problem.
For now I have this as the solution :
dict(zip(df.iloc[0:0, 0:1], df.iloc[0:0, 1:2]))
Is there any other way to do this?
Here's a simple way: convert the columns to a list and the list to a dictionary:
def list_to_dict(a):
    it = iter(a)
    # both zip arguments pull from the same iterator, pairing consecutive items
    ret_dict = dict(zip(it, it))
    return ret_dict

df = pd.DataFrame([], columns=['Land', 'Norway'])
dict_val = list_to_dict(df.columns.to_list())
dict_val  # {'Land': 'Norway'}
A very manual solution:
df = pd.DataFrame(columns=['Land', 'Norway'])
df = pd.DataFrame({df.columns[0]: df.columns[1]}, index=[0])
If you have any number of columns and you want each sequential pair to have this transformation, try:
df = pd.DataFrame(dict(zip(df.columns[::2], df.columns[1::2])), index=[0])
Note: You will get an error if your DataFrame does not have at least two columns.
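For example, with four (hypothetical) column names, each adjacent pair becomes one key/value entry:
df = pd.DataFrame(columns=['Land', 'Norway', 'Capital', 'Oslo'])
pd.DataFrame(dict(zip(df.columns[::2], df.columns[1::2])), index=[0])
#      Land Capital
# 0  Norway    Oslo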

Add new columns and new column names in python

I have a CSV file in the following format:
Date,Time,Open,High,Low,Close,Volume
09/22/2003,00:00,1024.5,1025.25,1015.75,1022.0,720382.0
09/23/2003,00:00,1022.0,1035.5,1019.25,1022.0,22441.0
10/22/2003,00:00,1035.0,1036.75,1024.25,1024.5,663229.0
I would like to add 20 new columns to this file, the value of each new column is synthetically created by simply randomizing a set of numbers.
It would be something like this:
import pandas as pd
from random import randrange

df = pd.read_csv('dataset.csv')
print(len(df))
input()
for i in range(len(df)):
    # Data that already exists
    date = df.values[i][0]
    time = df.values[i][1]
    open_value = df.values[i][2]
    high_value = df.values[i][3]
    low_value = df.values[i][4]
    close_value = df.values[i][5]
    volume = df.values[i][6]
    # This is the new data
    prediction_1 = randrange(3)
    prediction_2 = randrange(3)
    prediction_3 = randrange(3)
    prediction_4 = randrange(3)
    prediction_5 = randrange(3)
    prediction_6 = randrange(3)
    prediction_7 = randrange(3)
    prediction_8 = randrange(3)
    prediction_9 = randrange(3)
    prediction_10 = randrange(3)
    prediction_11 = randrange(3)
    prediction_12 = randrange(3)
    prediction_13 = randrange(3)
    prediction_14 = randrange(3)
    prediction_15 = randrange(3)
    prediction_16 = randrange(3)
    prediction_17 = randrange(3)
    prediction_18 = randrange(3)
    prediction_19 = randrange(3)
    prediction_20 = randrange(3)
    # How to concatenate these data row by row in a matrix?
    # How to add new column names and save the file?
I would like to concatenate the old and synthetic data and, after that, add 20 new columns named 'synthetic1', 'synthetic2', ..., 'synthetic20' to the existing column names, and then save the resulting new dataset in a new text file.
I could do that easily with NumPy, but here not all of the data is numeric, so I don't know how to do that (or whether it is possible). Is it possible to do this with pandas or another library?
Here's a way you can do it:
import numpy as np
import pandas as pd
# the number of rows must match the existing df
n_row = len(df)
n_col = 20
f = pd.DataFrame(np.random.randint(100, size=(n_row, n_col)),
                 columns=['synthetic' + str(x) for x in range(1, n_col + 1)])
# axis=1 appends the new columns next to the existing ones
df = pd.concat([df, f], axis=1)
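The question's last step, writing the result to a new file, can then be done with to_csv (the filename here is just a placeholder):
df.to_csv('dataset_with_synthetic.csv', index=False)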

Append Data Frame in Python

I have multiple sheets that are identical in column headers but not in the number of rows. I want to combine the sheets into one master sheet.
At the moment, with the code below, the output is blank and I end up with data equal to that of the last sheet.
I decided to use a for loop iterating through data_sheetnames, which is a list.
Below is the code I have used:
combineddata = pd.DataFrame()
for club in data_sheetnames:
    data = pd.read_excel(r'''C:\Users\me\Desktop\Data.xlsx''', header=1, index_col=2, sheet_name=club)
    combineddata.append(data)
If I were to change combineddata to a blank list then I get a dictionary of dataframes.
The problem is that append does not work in place.
It returns the appended DataFrame.
Therefore
combineddata = pd.DataFrame()
for club in data_sheetnames:
    data = pd.read_excel(r'''C:\Users\me\Desktop\Data.xlsx''', header=1, index_col=2, sheet_name=club)
    combineddata = combineddata.append(data)
should solve the issue
An easier way is a single concat over all sheets (also the approach to prefer going forward, since DataFrame.append was deprecated in pandas 1.4 and removed in 2.0):
combined_data = pd.concat([pd.read_excel(r'''C:\Users\me\Desktop\Data.xlsx''', header=1, index_col=2, sheet_name=club) for club in data_sheetnames])

Changing the dtype for specific columns in a pandas dataframe

I have a pandas dataframe which I have created from data stored in an xml file:
Initially the xml file is opened and parsed:
from lxml import etree  # assumption: lxml's etree, whose findall supports the "//" syntax below
xmlData = etree.parse(filename)
trendData = xmlData.findall("//TrendData")
I created a dictionary which lists all the data names (which are used as column names) as keys and gives the position of the data in the xml file:
Parameters = {"TreatmentUnit":("Worklist/AdminData/AdminValues/TreatmentUnit"),
"Modality":("Worklist/AdminData/AdminValues/Modality"),
"Energy":("Worklist/AdminData/AdminValues/Energy"),
"FieldSize":("Worklist/AdminData/AdminValues/Fieldsize"),
"SDD":("Worklist/AdminData/AdminValues/SDD"),
"Gantry":("Worklist/AdminData/AdminValues/Gantry"),
"Wedge":("Worklist/AdminData/AdminValues/Wedge"),
"MU":("Worklist/AdminData/AdminValues/MU"),
"My":("Worklist/AdminData/AdminValues/My"),
"AnalyzeParametersCAXMin":("Worklist/AdminData/AnalyzeParams/CAX/Min"),
"AnalyzeParametersCAXMax":("Worklist/AdminData/AnalyzeParams/CAX/Max"),
"AnalyzeParametersCAXTarget":("Worklist/AdminData/AnalyzeParams/CAX/Target"),
"AnalyzeParametersCAXNorm":("Worklist/AdminData/AnalyzeParams/CAX/Norm"),
....}
This is just a small part of the dictionary; the actual one lists over 80 parameters.
The dictionary keys are then sorted:
sortedKeys = list(sorted(Parameters.keys()))
A header is created for the pandas dataframe:
dateList = []
dateList.append('date')
headers = dateList + sortedKeys
I then create an empty pandas dataframe with the same number of rows as the number of records in trendData and with the column headers set to 'headers' and then loop through the file filling the dataframe:
df = pd.DataFrame(index=np.arange(0, len(trendData)), columns=headers)
for a, b in enumerate(trendData):
    result = {}
    result["date"] = dateutil.parser.parse(b.attrib['date'])
    for i, j in enumerate(Parameters):
        result[j] = b.findtext(Parameters[j])
    df.loc[a] = result
df = df.set_index('date')
This seems to work fine, but the problem is that the dtype for each column is set to 'object' whereas most should be integers. It's possible to use:
df.convert_objects(convert_numeric=True)
and it works fine, but it is now deprecated.
I can also use, for example:
df.AnalyzeParametersBQFMax = pd.to_numeric(df.AnalyzeParametersBQFMax)
to convert individual columns. But is there a way of using pd.to_numeric with a list of column names? I can create a list of columns which should be integers using the following:
int64list = []
for q in sortedKeys:
    if q.startswith("AnalyzeParameters"):
        int64list.append(q)
but I can't find a way of passing this list to the function.
You can explicitly replace columns in a DataFrame with the same column just with another dtype.
Try this:
import pandas as pd
data = pd.DataFrame({'date':[2000, 2001, 2002, 2003], 'type':['A', 'B', 'A', 'C']})
data['date'] = data['date'].astype('int64')
Calling data.dtypes now returns the following:
date int64
type object
dtype: object
For multiple columns, use a for loop to run through the int64list you mentioned in your question, as sketched below.
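A minimal sketch of that loop (reusing int64list from the question):
for col in int64list:
    df[col] = df[col].astype('int64')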
For multiple columns you can also do it this way:
cols = df.filter(like='AnalyzeParameters').columns.tolist()
df[cols] = df[cols].astype(np.int64)
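As for the pd.to_numeric part of the question: pd.to_numeric itself only accepts one-dimensional input, but it can be applied column by column over the list:
df[int64list] = df[int64list].apply(pd.to_numeric)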
