I have a data frame with N rows containing certain information. Depending on the values in the data frame, I want to create a numpy array with the same number of rows but with M columns.
I have a solution where I iterate through the rows of the data frame and apply a function, which outputs a row for the array with M entries.
However, I am wondering whether there is a smarter, more efficient way that avoids iterating through the df.
Edit:
Apologies, I think my original description was not very clear.
So I have a df with N rows. Depending on the values of certain columns, I want to create M binary entries for each row, that I store in a separate np array.
E.g. the function that I defined can look like this:
import numpy as np

def func(row):
    ret = np.zeros(12)
    if row['A'] == 'X':
        ret[3] = 1
    else:
        ret[[3, 6, 9]] = 1
    return ret
And currently I am applying this (simplified) function to each row of the df to get a full (N,M) array, which seems to be a bit inefficient.
See Pandas groupby() to group depending on the values and then extract.
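Alternatively, here is a vectorized sketch, assuming the real logic resembles the simplified func above: allocate the full (N, 12) array once and fill whole slices with boolean masks instead of applying a function per row.

import numpy as np

# build the (N, 12) result in one shot instead of row-by-row
def build_array(df):
    out = np.zeros((len(df), 12))
    mask = (df['A'] == 'X').to_numpy()
    out[mask, 3] = 1                    # rows where A == 'X'
    out[np.ix_(~mask, [3, 6, 9])] = 1   # all other rows
    return out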
I have a Pandas DataFrame to which I would like to add a new column that I will then populate with numpy arrays, such that each row in that column contains a numpy array. I'm using the following approach, and am wondering whether it is correct.
df['embeddings'] = pd.Series(dtype='object')
Then I would iterate over rows and add computed arrays like so (using np.zeros(1024) for illustration only, in reality these are the output of a neural network):
for i in range(df.shape[0]):
    df['embeddings'].loc[i] = np.zeros(1024)
I tested whether it helps to pre-allocate the cells like so, but didn't notice a difference in execution time when I then iterate over rows, at least not with a DataFrame that only has 200 rows:
df['embeddings'] = [np.zeros(1024)] * df.shape[0]  # note: every cell references the same array object
As an alternative to adding a column and then updating its rows, one could create the list of numpy arrays first and then add the list as a new column, but that would require more memory.
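A minimal sketch of that alternative, reusing np.zeros(1024) as the stand-in for the real network output; note that a list comprehension gives each row its own array, whereas the [obj] * N pattern above makes every cell reference the same array object:

import numpy as np

# df as in the question above; one distinct array per row, assigned in one step
df['embeddings'] = [np.zeros(1024) for _ in range(len(df))]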
I am appending different dataframes to make one set. Occasionally, some values have the same index, so it stores the value as a series. Is there a quick way within Pandas to just overwrite the value instead of storing all the values as a series?
Your question isn't very clear. If you want to resolve the duplicated-index problem, the pd.DataFrame.reset_index() method will probably be enough. But if you have duplicate rows when you concat the DataFrames, just use the pd.DataFrame.drop_duplicates() method. Otherwise, share a bit of your code or be clearer.
I'm not sure that the code below is what you're looking for.
Say we have two dataframes with one column, the same index, and different values, and you want to overwrite the values in one dataframe with those from the other. You can do it with a simple loop using the iloc indexer.
import pandas as pd

df_1 = pd.DataFrame({'col_1': ['a', 'b', 'c', 'd']})
df_2 = pd.DataFrame({'col_1': ['q', 'w', 'e', 'r']})

rows = df_1.shape[0]
for idx in range(rows):
    df_1['col_1'].iloc[idx] = df_2['col_1'].iloc[idx]
Then, if you check df_1, you should get this:
df_1
col_1
0 q
1 w
2 e
3 r
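For what it's worth, when the indexes are already aligned, the loop can be replaced by a direct column assignment, or by pandas' update(), which overwrites df_1 in place with the non-NA values from df_2:

# same effect as the loop when the indexes line up
df_1['col_1'] = df_2['col_1']

# or in place, keeping df_1's values wherever df_2 has NaN
df_1.update(df_2)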
Whether or not this is what you want, let me know so I can help you.
I would like to ask for your help. The problem, in steps:
1. Import two excel files into Python Data frames - so far no problem
2. Transferring the data frames into numpy arrays.
3. Create a VLOOKUP function in Python with the arrays. Both arrays have a key in the first column, which is unique and can be used for matching. The two tables include data that is correct in one table but not in the other. I would like to overwrite the wrong values with values from the table that has them right (I know which table has the right values...)
Is there a more numpy way to do it?
So far the code I wrote:
import pandas as pd

df = pd.DataFrame()
# raw strings so "\a" and "\b" are not treated as escape sequences
s = pd.read_excel(r"C:\a.xlsx")
r = pd.read_excel(r"C:\b.xlsx")
z = s.values
t = r.values
Here I match the two arrays and overwrite the value:
for i in range(z.shape[0]):
    for j in range(t.shape[0]):
        if z[i, 0] == t[j, 0]:
            t[j, 41] = z[i, 5]
If both tables share a key column, use pd.merge; it acts like a VLOOKUP:

newdf = s.merge(r, on='same_key')
newdf will have all the columns from both data frames. You can now access the individual columns you need to update:
newdf['wrongcolumn'] = newdf['rightcolumn']
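If you want to stay closer to a vectorized "numpy way", here is a hedged sketch using pandas' map() instead of the nested loop; the column positions 0, 5 and 41 are taken from the loop above, so adjust them to your real tables:

# map each key to its correct value from s (column 5 in the loop above)
mapping = s.set_index(s.columns[0])[s.columns[5]]

# overwrite column 41 of r where the key is found; keep the old value otherwise
key_col, wrong_col = r.columns[0], r.columns[41]
r[wrong_col] = r[key_col].map(mapping).fillna(r[wrong_col])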
I have the following dataset, which I am reading from a csv file:
x = [1, 2, 3, 4, 5]
With pandas I can access the array:
df_train = pd.read_csv("train.csv")
x = df_train["x"]
And
x = df_train[["x"]]
I wonder about this, since both seem to produce the same result: the former makes sense to me, but the latter does not. Could you please explain the difference and when to use each?
In pandas, you can slice your data frame in different ways. On a high level, you can choose to select a single column out of a data frame, or many columns.
When you select many columns, you have to slice using a list, and the return is a pandas DataFrame. For example
df[['col1', 'col2', 'col3']] # returns a data frame
When you select only one column, you can pass only the column name, and the return is just a pandas Series
df['col1'] # returns a series
When you do df[['col1']], you get back a DataFrame with only one column. In other words, it's like you're telling pandas "give me all the columns from the following list" and handing it a list with one column in it. It will filter your df, returning all columns in your list (in this case, a data frame with only one column).
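A quick way to see the difference, assuming a DataFrame df that has a column 'col1':

type(df['col1'])     # <class 'pandas.core.series.Series'>
type(df[['col1']])   # <class 'pandas.core.frame.DataFrame'>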
If you want more details on the difference between a Series and a one-column DataFrame, check this thread, which has very good answers.
I have a nested list of coordinates:
I need my list to be in the format of rows and columns shown below (I think it is called a data frame), with its contents having the Pythagorean formula applied against each cell's column and row header:
What is the best approach in Python to do it?
If I understand correctly, this should solve your problem:
import numpy as np
import pandas as pd

df = pd.DataFrame(coor_house)
df['l2'] = np.sqrt((df[1].apply(lambda x: x[0])
                    - df[0].apply(lambda x: x[0]))**2
                   + (df[1].apply(lambda x: x[1])
                    - df[0].apply(lambda x: x[1]))**2)
This will create a dataframe where each column is a point, plus a column with the L2 norm of the difference between the two points.
I'm not very used to applying a function to a whole dataframe, so I'm sure there is a better way.
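One shorter option, as a sketch: assuming coor_house is a nested list of (x, y) pairs such as [[(0, 0), (3, 4)], ...], you can convert it to a numpy array and let numpy take the norm along the last axis.

import numpy as np

pts = np.asarray(coor_house, dtype=float)  # shape (N, 2, 2)
df['l2'] = np.linalg.norm(pts[:, 1] - pts[:, 0], axis=1)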