I would like to ask for your help. The problem, in steps:
1. Import two Excel files into pandas DataFrames - so far, no problem.
2. Convert the DataFrames into numpy arrays.
3. Create a VLOOKUP-like function in Python with the arrays. Both arrays have a unique key in the first column that can be used for matching. The two tables contain data that is correct in one table but not in the other. I would like to overwrite the wrong values with the values from the table that is correct (I know which table has the right values...).
Is there a more numpy-like way to do it?
The code I have written so far:
import pandas as pd

s = pd.read_excel(r"C:\a.xlsx")  # raw string so the backslash is not treated as an escape
r = pd.read_excel(r"C:\b.xlsx")
z = s.values  # table with the correct values
t = r.values  # table with the wrong values
Here I match the two arrays and overwrite the wrong values:
for i in range(len(z)):
    for j in range(len(t)):
        if z[i, 0] == t[j, 0]:   # keys match
            t[j, 41] = z[i, 5]   # copy the correct value from z into t
If the tables have matching keys, use pd.merge; it acts like VLOOKUP:
newdf = s.merge(r, on='same_key')  # 'same_key' is the shared key column
newdf will have all the columns from both data frames. You can now access the individual columns you need to update:
newdf['wrongcolumn'] = newdf['rightcolumn']
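Put together, a minimal sketch of the merge-based overwrite might look like this ('key', 'right_col' and 'wrong_col' are placeholder column names; s is assumed to hold the correct values and r the wrong ones):

import pandas as pd

# Bring the correct column over to the table that needs fixing.
merged = r.merge(s[['key', 'right_col']], on='key', how='left')
# Where the key was found in s, take the correct value; otherwise keep the old one.
merged['wrong_col'] = merged['right_col'].fillna(merged['wrong_col'])
merged = merged.drop(columns='right_col')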
I have a data frame with N rows containing certain information. Depending on the values in the data frame, I want to create a numpy array with the same number of rows but with M columns.
I have a solution where I iterate through the rows of the data frame and apply a function, which outputs a row of M entries for the array.
However, I wonder whether there is a smarter, more efficient way that avoids iterating through the df.
Edit:
Apologies, the description may not have been very clear.
So I have a df with N rows. Depending on the values of certain columns, I want to create M binary entries for each row, which I store in a separate np array.
E.g. the function that I defined can look like this:
import numpy as np

def func(row):
    ret = np.zeros(12)  # M = 12 binary entries
    if row['A'] == 'X':
        ret[3] = 1
    else:
        ret[[3, 6, 9]] = 1
    return ret
And currently I am applying this (simplified) function to each row of the df to get a full (N,M) array, which seems to be a bit inefficient.
See pandas groupby() to group on the relevant columns and then extract.
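Alternatively, the row-wise apply can usually be replaced by boolean-mask assignments on a pre-allocated array. A minimal sketch, assuming the example func above and a hypothetical column 'A':

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['X', 'Y', 'X', 'Z']})  # hypothetical data
N, M = len(df), 12
out = np.zeros((N, M))

is_x = (df['A'] == 'X').to_numpy()
out[is_x, 3] = 1                    # rows where A == 'X': only column 3
out[np.ix_(~is_x, [3, 6, 9])] = 1   # all other rows: columns 3, 6 and 9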
I have a Pandas DataFrame to which I would like to add a new column that I will then populate with numpy arrays, such that each row in that column contains a numpy array. I'm using the following approach, and am wondering whether this is the correct approach.
df['embeddings'] = pd.Series(dtype='object')
Then I would iterate over rows and add computed arrays like so (using np.zeros(1024) for illustration only, in reality these are the output of a neural network):
for i in range(df.shape[0]):
    df.at[i, 'embeddings'] = np.zeros(1024)  # .at sets a single cell directly and avoids chained assignment
I tested whether it helps to pre-allocate the cells like so, but didn't notice a difference in execution time when I then iterate over rows, at least not with a DataFrame that only has 200 rows:
df['embeddings'] = [np.zeros(1024)] * df.shape[0]  # note: every cell references the same array until it is overwritten
As an alternative to adding a column and then updating its rows, one could create the list of numpy arrays first and then add the list as a new column, but that would require more memory.
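For completeness, a minimal sketch of that list-first variant (np.zeros(1024) again standing in for the real network output):

import numpy as np

# Build one fresh array per row, then attach the whole list at once.
embeddings = [np.zeros(1024) for _ in range(df.shape[0])]
df['embeddings'] = embeddings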
I have implemented the below piece of code:
array = [table.iloc[:, [0]], table.iloc[:, [i]]]
It is supposed to be a dataframe consisting of two vectors extracted from a previously imported dataset. I use the parameter i because this code is part of a loop that uses a predefined function to analyze correlations between one fixed variable [0] and the rest of them - each iteration checks the correlation with a different variable [i].
Python treats this object as a list (or as a tuple when I change the brackets to round ones). I need this object to be a dataframe (the next step is to remove NaN values using .dropna, which is a DataFrame attribute).
How can I fix that issue?
If I have correctly understood your question, you want to build an extract from a larger dataframe containing only 2 columns known by their index number. You can simply do:
sub = table.iloc[:, [0, i]]
It will keep all attributes (including index, column names and dtype) from the original table dataframe.
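Put back into the loop, that might look like this (analyze_correlation is a placeholder for the predefined function mentioned above):

for i in range(1, table.shape[1]):
    sub = table.iloc[:, [0, i]].dropna()  # still a DataFrame, so .dropna() works
    analyze_correlation(sub)              # placeholder for the real analysis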
What is your goal with the dataframe?
A "dataframe" is a common term in data analysis using pandas.
Pandas was developed precisely to facilitate such analysis. With it, getting the data from a .csv file and transforming it into a dataframe is as simple as:
import pandas as pd
df = pd.read_csv('my-data.csv')
df.info()
Or from a dict or array:
df = pd.DataFrame(my_dict_or_array)
Then you can select the rows you wish by label:
df.loc[['INDEX_ROW_1', 'INDEX_ROW_2'], :]
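For example, with hypothetical row labels:

import pandas as pd

df = pd.DataFrame({'name': ['a', 'b', 'c'], 'score': [1, 2, 3]},
                  index=['INDEX_ROW_1', 'INDEX_ROW_2', 'INDEX_ROW_3'])
print(df.loc[['INDEX_ROW_1', 'INDEX_ROW_2'], :])  # just the first two rows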
Let us know if this is what you are looking for.
I have to write an object that takes either a pandas data frame or a numpy array as input (similar to sklearn behavior). In one of the methods of this object, I need to select columns (not a particular fixed one; I get a few column indices based on other calculations).
So, to make my code compatible with both input types, I tried to find a common way to select columns and tried methods like X[:, 0] (doesn't work on pandas dataframes), X[0] and others, but they select differently. Is there a way to select columns in a similar fashion across pandas and numpy?
If not, then how does sklearn work across these data structures?
You can use an if condition within your method and have separate selection logic for pandas dataframes and numpy arrays. Sample code is given below.
def method_1(self, var, col_indices):
    if isinstance(var, pd.DataFrame):
        selected_columns = var[var.columns[col_indices]]  # label-based selection via positional indices
    else:
        selected_columns = var[:, col_indices]            # plain positional indexing on the array
    return selected_columns
Here, var is your input, which can be a numpy array or a pandas dataframe, and col_indices are the indices of the columns you want to select.
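As for sklearn: it generally normalizes its inputs to numpy arrays up front (see sklearn.utils.check_array), so the rest of its code only ever deals with arrays. A minimal sketch of that idea, with a hypothetical select_columns helper:

import numpy as np

def select_columns(X, col_indices):
    # Hypothetical helper: coerce to a numpy array first, then
    # index positionally; works for DataFrames and arrays alike.
    X = np.asarray(X)
    return X[:, col_indices]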
I have a nested list of coordinates:
I need my list to be in the format of rows and columns shown below (I think it is called a data frame), with its contents having the Pythagorean formula applied against each cell's column and row header:
What is the best approach in Python to do it?
If I understand correctly, this should solve your problem:
import numpy as np
import pandas as pd

df = pd.DataFrame(coor_house)  # coor_house: the nested list of coordinate pairs
df['l2'] = np.sqrt(
    (df[1].apply(lambda x: x[0]) - df[0].apply(lambda x: x[0])) ** 2
    + (df[1].apply(lambda x: x[1]) - df[0].apply(lambda x: x[1])) ** 2
)
This creates a dataframe where each column is a point, plus a column with the L2 norm of the difference.
I'm not very used to applying functions to a whole dataframe, so I'm sure there is a better way.
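If the nested list really holds two (x, y) points per row, a possibly cleaner variant is to stack the points into arrays and let numpy compute the norm (same assumption about coor_house's shape):

import numpy as np

points_a = np.array(df[0].tolist())  # shape (N, 2): first point of each pair
points_b = np.array(df[1].tolist())  # shape (N, 2): second point of each pair
df['l2'] = np.linalg.norm(points_b - points_a, axis=1)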