It seems the default for pd.read_csv() is to read the column names in as str. I can't find this behavior documented, so I can't find where to change it.
Is there a way to tell read_csv() to read the column names in as integers?
Or maybe the solution is specifying the datatype when calling pd.DataFrame.to_csv(). Either way: at the time of writing to csv, the column names are integers, and that is not preserved on read, presumably because CSV stores the header row as plain text with no type information.
The code I'm working with is loosely related to this (credit):
import pandas as pd

df = pd.DataFrame(index=pd.MultiIndex.from_arrays([[], []]))
for row_ind1 in range(3):
    for row_ind2 in range(3, 6):
        for col in range(6, 9):
            entry = row_ind1 * row_ind2 * col
            df.loc[(row_ind1, row_ind2), col] = entry

df.to_csv("df.csv")
dfr = pd.read_csv("df.csv", index_col=[0, 1])
print(dfr.loc[(0, 3), 6])    # KeyError
print(dfr.loc[(0, 3), "6"])  # no KeyError
My temporary solution is:
dfr.columns = dfr.columns.map(int)
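An equivalent alternative using pandas' built-in casting (a minimal sketch; it assumes every column name is a numeric string):

dfr.columns = dfr.columns.astype(int)  # same effect as .map(int)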
Related
I have several large csv files, each with 100 columns and 800k rows. Starting from the first column, every other column has cells that look like a Python list: for example, cell A2 holds [1000], cell A3 holds [2300], and so forth. Column 2 is fine and contains plain numbers, but columns 1, 3, 5, 7, ..., 99 are like column 1: their values are wrapped in brackets. Is there an efficient way to strip the list brackets [] from those columns so their cells become normal numbers?
files_directory: r":D\my_files"
dir_files =os.listdir(r"D:\my_files")
for file in dir_files:
edited_csv = pd.read_csv("%s\%s"%(files_directory, file))
for column in list(edited_csv.columns):
if (column % 2) != 0:
edited_csv[column] = ?
Please try:
import pandas as pd

df = pd.read_csv('file.csv', header=None)  # read everything as data, no header
df.columns = df.iloc[0]  # promote the first row to column names
df = df[1:]              # drop that row from the data
for x in df.columns[::2]:  # every other column, starting with the first
    df[x] = df[x].apply(lambda v: float(v[1:-1]))  # strip "[" and "]", then cast
print(df)
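If the files really have 800k rows, a vectorized variant may be noticeably faster than apply (a sketch assuming every cell in those columns looks like [number]):

for x in df.columns[::2]:
    df[x] = df[x].str.strip('[]').astype(float)  # remove brackets, cast to float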
Note that when reading the cells back, for example column_1[3], which in this case looks like [4554.8433], pandas actually gives you a string, not a list. Once a cell has been parsed into a real Python list, you can read the numerical value inside it like so:

value = column_1[3]  # a one-element list, e.g. [4554.8433]
print(value[0])      # prints 4554.8433 instead of [4554.8433]
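To get from the raw string to a real list, one option is ast.literal_eval (a sketch on a single cell value; raw is a hypothetical example):

import ast

raw = "[4554.8433]"            # what read_csv actually returns for such a cell
value = ast.literal_eval(raw)  # -> [4554.8433], a real Python list
print(value[0])                # 4554.8433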
I want to concatenate the multidimensional output of a NumPy computation back onto the input DataFrame; the output matches the input's shape with regard to rows and the respective selected columns.
But it fails with: NotImplementedError: Can only union MultiIndex with MultiIndex or Index of tuples, try mi.to_flat_index().union(other) instead.
I do not want to flatten the indices first - so is there another way to get it to work?
import pandas as pd
from pandas import Timestamp
df = pd.DataFrame({
    ('metrik_0', Timestamp('2020-01-01 00:00:00')): {
        (1, 1): 2.5393693602911447, (1, 5): 4.316896324314225,
        (1, 6): 4.271001191238499, (1, 9): 2.8712588011247377,
        (1, 11): 4.0458495954752545},
    ('metrik_0', Timestamp('2020-01-01 01:00:00')): {
        (1, 1): 4.02779063729038, (1, 5): 3.3849606155101224,
        (1, 6): 4.284114856052976, (1, 9): 3.980919941298365,
        (1, 11): 5.042488191587525},
    ('metrik_0', Timestamp('2020-01-01 02:00:00')): {
        (1, 1): 2.374592085569529, (1, 5): 3.3405503781564487,
        (1, 6): 3.4049690284720366, (1, 9): 3.892686173978996,
        (1, 11): 2.1876998087043127},
})
def compute_return_columns_to_df(df, columns_to_process, axis=0):
    method = 'compute_result'
    renamed_base_levels = map(lambda x: f'{x}_{method}', columns_to_process.get_level_values(0).unique())
    renamed_columns = columns_to_process.set_levels(renamed_base_levels, level=0)

    #####
    # Perform the calculation in NumPy here.
    # For the sake of simplicity (and as the actual computation is irrelevant), it is omitted in this minimal example.
    result = df[columns_to_process].values
    #####

    result = pd.DataFrame(result, columns=renamed_columns)
    display(result)
    # fails with: NotImplementedError: Can only union MultiIndex with MultiIndex or Index of tuples, try mi.to_flat_index().union(other) instead.
    return pd.concat([df, result], axis=1)

compute_return_columns_to_df(df[df.columns[0:3]].head(), df.columns[0:2])
The reason your code fails is here:

result = df[columns_to_process].values
result = pd.DataFrame(result, columns=renamed_columns)
Note that result has:
- column names with the top index level renamed to metrik_0_compute_result (so far OK),
- but a row index that is the default single-level index, composed of consecutive numbers.
Then, when you concatenate df and result, pandas attempts to align both source DataFrames on the row index, but the indexes are incompatible (df has a MultiIndex, whereas result has an "ordinary" single-level index).
Change this part of your code to:

result = df[columns_to_process]
result.columns = renamed_columns

This way result keeps the original MultiIndex and concat raises no exception.
Another remark: your function has an axis parameter that is never used. Consider removing it.
Another possible approach
Since result has a default (single-level) index, you can leave the previous part of the code as is and instead reset the index of df before joining:
return pd.concat([df.reset_index(drop=True), result], axis=1)
This way both DataFrames have the same indices and you can concatenate
them as well.
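Putting the first fix together, the whole function might look like this (a sketch based on the example above, with the unused axis parameter dropped):

def compute_return_columns_to_df(df, columns_to_process):
    method = 'compute_result'
    renamed_base_levels = map(lambda x: f'{x}_{method}', columns_to_process.get_level_values(0).unique())
    renamed_columns = columns_to_process.set_levels(renamed_base_levels, level=0)
    result = df[columns_to_process]  # keep the DataFrame, and with it the MultiIndex rows
    result.columns = renamed_columns
    return pd.concat([df, result], axis=1)

compute_return_columns_to_df(df[df.columns[0:3]].head(), df.columns[0:2])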
import pandas as pd
DF = pd.DataFrame(DICTIONARY,
                  index=[r"$\lambda$=" + str(i) for i in range(3)],
                  columns=[r"$\xi$=" + str(j) for j in range(3)])

There are times when I have a (not very large) dictionary and try to convert it into a DataFrame, and the code above yields one with every cell being NaN. Yet the code below works fine. I wonder what the difference is?

DF = pd.DataFrame(DICTIONARY, index=[r"$\lambda$=" + str(i) for i in range(3)])
DF.columns = [r"$\xi$=" + str(j) for j in range(3)]
What are your dictionary keys? I am guessing the keys don't align with your columns.
In the second option you are letting pandas assign default column names and then overwriting them.
Something like the code below works when the column names align, but explicitly defining the columns parameter adds no value in this case, because the dict keys already provide the names.
DF = pd.DataFrame({1: 1, 2: 2, 3: 3},
                  index=[r"$\lambda$=" + str(i) for i in range(3)],
                  columns=[j + 1 for j in range(3)])
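To see the alignment effect in isolation, compare the two constructions below (a minimal sketch with made-up keys):

import pandas as pd

d = {1: 1, 2: 2, 3: 3}
ok = pd.DataFrame(d, index=[0, 1, 2], columns=[1, 2, 3])         # keys match columns -> values kept
bad = pd.DataFrame(d, index=[0, 1, 2], columns=['a', 'b', 'c'])  # no key matches -> every cell NaN
print(ok, bad, sep='\n')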
I have outputted a pandas df into an excel file using xlsxwriter. I'm trying to create a totals row at the top. To do so, I'm trying to create a function that dynamically populates the totals based off the column I choose.
Here is an example of what I'm intending to do:
worksheet.write_formula('G4', '=SUM(G4:G21)')  # G4 = where the total should be placed
I need this to be a function because the row counts can change (summation range should be dynamic), and I want there to be an easy way to apply this formula to various columns.
Therefore I've come up with the following:
def get_totals(column):
    start_row = '4'  # row on which the totals will be placed
    row_count = str(tl_report.shape[0])  # number of rows in the table, so I can sum up every row
    return worksheet.write_formula(str(column + start_row), "'=SUM(" + str(column + start_row) + ":" + str(column + row_count) + ")'")
When running get_totals("G") it just results in 0. I'm suspecting it has to do with the STR operator that I had to apply because its adding single quotes to the formula, and therefore rendering it incorrectly.
However I cannot take the str operator out because I cannot concatenate INTs apparently?
Maybe I'm coding this all wrong, new to python, any help appreciated.
Thank you!
You could also do something like this:
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [5, 6, 7, 8],
                   'C': [np.nan, np.nan, np.nan, np.nan]})

writer = pd.ExcelWriter('test.xlsx', engine='xlsxwriter')
df.to_excel(writer, sheet_name='Sheet1', index=False, startrow=2)
workbook = writer.book
worksheet = writer.sheets['Sheet1']

def get_totals(start_row, sum_column, column1='A', column2='B'):
    for row in range(start_row, df.shape[0] + start_row):
        worksheet.write_formula(f'{sum_column}{row}', f'=SUM({column1}{row},{column2}{row})')

get_totals(4, 'C')
writer.save()
In almost all cases XlsxWriter methods support two forms of notation to designate the position of cells: Row-column notation and A1 notation.
Row-column notation uses a zero-based index for both row and column, while A1 notation uses the standard Excel alphanumeric sequence of a column letter and a 1-based row. For example:
(6, 2) # Row-column notation.
('C7') # The same cell in A1 notation.
So for your case you could do the following and set the row-column values programmatically (you may have to adjust by -1 to get zero indexing):
worksheet.write_formula(start_row, start_column, '=SUM(G4:G21)')
For the formula range you could use one of XlsxWriter's utility functions:
from xlsxwriter.utility import xl_range
my_range = xl_range(3, 6, 20, 6) # G4:G21
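Combining the two ideas, a dynamic totals helper might look like this (a sketch, not a drop-in answer: it assumes, as in the question, that tl_report and worksheet already exist and that the totals live on Excel row 4 with the data rows directly below):

from xlsxwriter.utility import xl_range, xl_rowcol_to_cell

def get_totals(col):
    # col is a zero-based column index (column G = 6)
    total_cell = xl_rowcol_to_cell(3, col)  # Excel row 4, zero-based row 3
    # sum the data rows below the totals row, however many there are
    data_range = xl_range(4, col, 3 + tl_report.shape[0], col)
    worksheet.write_formula(total_cell, f'=SUM({data_range})')

get_totals(6)  # writes =SUM(G5:G...) into G4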
I am trying to create a column from two other columns in a DataFrame.
Consider the 3-column data frame:
import numpy as np
import pandas as pd
random_list_1 = np.random.randint(1, 10, 5)
random_list_2 = np.random.randint(1, 10, 5)
random_list_3 = np.random.randint(1, 10, 5)
df = pd.DataFrame({"p": random_list_1, "q": random_list_2, "r": random_list_3})
I create a new column from "p" and "q" with a function that will be given to apply.
As a simple example:
def operate(row):
    return [row['p'], row['q']]
Here,
df['s'] = df.apply(operate, axis=1)
evaluates correctly and creates a column "s".
The issue appears when the DataFrame has a number of columns equal to the length of the list returned by operate. So for instance, with
df2 = pd.DataFrame({"p": random_list_1, "q": random_list_2})
evaluating this:
df2['s'] = df2.apply(operate, axis=1)
throws a ValueError exception:
ValueError: Wrong number of items passed 2, placement implies 1
What is happening?
As a workaround, I could make operate return tuples (which does not throw an exception) and then convert them to lists, but for performance's sake I would prefer to get lists in a single pass over the DataFrame.
Is there a way to achieve this?
In both cases this works for me:

df["s"] = list(np.column_stack((df.p.values, df.q.values)))

Working with vectorized functions is better than using apply; in this case the speed boost is about 3x. See the documentation.
Anyway, I found your question interesting and would also like to know why this is happening.
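For reference, a quick check with the two-column frame from the question (a sketch reusing df2 from above; note that each cell of "s" ends up holding a length-2 NumPy array rather than a list):

df2["s"] = list(np.column_stack((df2.p.values, df2.q.values)))
print(df2.head())  # no ValueError, unlike the apply-based version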