How do I filter so that row 2 == 1, find the min value in row 1, and print the column it's in?
So in this case the result would only bring up Driver 3 [0.941, 0.210, 1], because its row 2 is 1 and its row 1 value is smaller than that of the other drivers that also have 1 in row 2.
When I try np.amin(df, axis=1) I get the min values I want, but row 2 returns a 0 instead of 1.
You first need to select the subset of columns you're interested in, then find the minimum value among those, then figure out which column it's in (or do both at once with argmin). (Note that the pandas API somewhat prefers filtering rows over columns.)
First, set up some example data:
import pandas as pd
import numpy as np
data = np.arange(24).reshape(4,6)
data[1,3:5]=1 # Adjust some values in row 2
df = pd.DataFrame(data, columns=['A', 'B', 'C', 'D', 'E', 'F'])
Then transpose columns and rows:
dfT = df.T
Get the columns where row 2 == 1 (rows of the transposed frame):
dfFiltered = dfT[dfT[1] == 1]
Find the entry with the minimum value in row 1 (column 0 after transposing) and get its index:
dfFiltered.index[dfFiltered[0].argmin()]
This produces 'D' for this example.
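As a side note (not part of the original answer), Series.idxmin returns the index label directly, which skips the positional lookup:

dfFiltered[0].idxmin()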
First of all, for this kind of application you really want your DataFrame to be transposed. It is just not convenient to filter on columns.
import pandas as pd
A = pd.DataFrame(...)
B = A.transpose()
# Now you can filter the drivers where "column" 2 is equal to 1:
C = B[B[2] == 1]
# And find where the minimum of "column" 1 is:
idx = C[1].argmin()
# Finally, find out the driver and its information:
driver = C.index[idx]
info = C.iloc[idx]
# info2 = C.loc[driver] # alternative
# To get the format you want:
print(driver, info.values)
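One caveat: Series.argmin has changed meaning across pandas versions (it now returns a position, while very old versions returned a label). If you want the label directly and version-robustly, idxmin does that; a small variant of the above:

driver = C[1].idxmin()
info = C.loc[driver]
print(driver, info.values)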
I'm trying to create a new column on my DataFrame by combining two existing columns
import pandas as pd
import numpy as np
DATA = pd.DataFrame(np.random.randn(5, 2), columns=['A', 'B'])
DATA['index'] = np.arange(5)
DATA.set_index('index', inplace=True)
The output is something like this:

              A         B
index
0     -0.003635 -0.644897
1     -0.617104 -0.343998
2      1.270503 -0.514588
3     -0.053097 -0.404073
4     -0.056717  1.870671
I would like to have an extra column 'C' that holds an np.array with the elements of 'A' and 'B' for the corresponding row. In the real case, 'A' and 'B' already contain 1D np.arrays, but of different lengths; I would like to make a longer array with all the elements stacked or concatenated.
Thanks
If columns a and b contain numpy arrays, you can apply hstack across rows:
import pandas as pd
import numpy as np
num_rows = 10
max_arr_size = 3

df = pd.DataFrame({
    "a": [np.random.rand(max_arr_size) for _ in range(num_rows)],
    "b": [np.random.rand(max_arr_size) for _ in range(num_rows)],
})

df["c"] = df.apply(np.hstack, axis=1)
assert all(row.a.size + row.b.size == row.c.size for _, row in df.iterrows())
DATA['C'] = DATA.apply(lambda x: np.array([x.A, x.B]), axis=1)
Since each cell of 'C' holds its own array object, the rows don't all need to contain equally sized arrays.
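For the real case described in the question, where 'A' and 'B' already hold 1D arrays of different lengths, a small variant with np.concatenate should do it (np.atleast_1d is only there to also tolerate scalar cells; adjust as needed):

DATA['C'] = DATA.apply(lambda row: np.concatenate([np.atleast_1d(row.A), np.atleast_1d(row.B)]), axis=1)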
I am doing it this way:
1- drop the columns that don't need feature scaling from the main dataframe
2- the resulting dataframe only has the columns that require feature scaling
3- concatenate the dropped columns with the scaled columns to get the final dataframe
But I want to do it without dropping any columns, by using a command that scales the first 14 columns while the others are preserved in the dataframe I get as output.
Look into DataFrame.apply(). Setting the axis parameter to 1 applies the function to each row (passed in as a Series indexed by the column names). You can put a filter in that function so it only touches the columns you want to scale.
For example:
import pandas as pd

def scaling_function(x, col_to_scale):
    for col in x.index:
        if col in col_to_scale:
            # your scaling operation here
            x[col] = x[col] * 2
    return x

df = pd.DataFrame([[4, 9, 2]] * 3, columns=['A', 'B', 'C'])
col_to_scale = ['A', 'B']
scaled_df = df.apply(lambda x: scaling_function(x, col_to_scale), axis=1)
This will double the values in columns A and B while leaving C as is.
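That said, for a purely positional rule like "scale the first 14 columns", a vectorized slice assignment is simpler and faster. A minimal sketch, assuming your frame's first 14 columns are the numeric ones to scale and using * 2 as a stand-in for your scaling operation:

df.iloc[:, :14] = df.iloc[:, :14] * 2  # scale the first 14 columns, leave the rest untouched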
I am trying to make a script that loops through the rows of a dataframe and builds a new column by appending values from column A or B, based on a condition in column C. However, something seems to be wrong with the appending, as each cell of my new column contains several values.
import pandas as pd
import numpy as np
#Loading in the csv file
filename = '35180_TRA_data.csv'
df1 = pd.read_csv(filename, sep=',', nrows=1300, skiprows=25, index_col=False, header=0)
#Calculating the B concentration using column A and a factor
B_calc = df1['A']*137.818
#The measured B concentration
B_measured = df1['B']
#Looping through the dataset, and append the B_calc values where the C column is 2, while appending the B_measured values where the C column is 1.
calculations = []
for row in df1['C']:
    if row == 2:
        calculations.append(B_calc)
    if row == 1:
        calculations.append(B_measured)
df1['B_new'] = calculations
The values of my new column (B_new) are all wrong. For example, the first row should be just 0.00, but it contains numerous values. So something is going wrong in the appending. Can anyone spot the problem?
B_calc and B_measured are whole Series (full columns), not single values. You have to specify which element you want to append; otherwise you append the entire Series every time. Here is how you could do it:
df1 = pd.DataFrame({"A":[1,3,5,7,9], "B" : [9,7,5,3,1], "C":[1,2,1,2,1]})
#Calculating the B concentration using column A and a factor
B_calc = df1['A']*137.818
#The measured B concentration
B_measured = df1['B']
#Looping through the dataset, and append the B_calc values where the C column is 2, while appending the B_measured values where the C column is 1.
calculations = []
for index, row in df1.iterrows():
    if row['C'] == 2:
        calculations.append(B_calc[index])
    if row['C'] == 1:
        calculations.append(B_measured[index])
df1['B_new'] = calculations
But iterating over rows is bad practice because it is slow. A better way is to use boolean masks; here is how it works:
mask_1 = df1['C'] == 1
mask_2 = df1['C'] == 2
df1.loc[mask_1, 'B_new'] = df1.loc[mask_1, 'B']
df1.loc[mask_2, 'B_new'] = df1.loc[mask_2, 'A'] * 137.818
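Assuming every row has C equal to 1 or 2, this collapses further into a single vectorized expression with np.where:

df1['B_new'] = np.where(df1['C'] == 2, df1['A'] * 137.818, df1['B'])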
I'm starting from the pandas DataFrame documentation here: Introduction to data structures
I'd like to iteratively fill the DataFrame with values in a time series kind of calculation. I'd like to initialize the DataFrame with columns A, B, and timestamp rows, all 0 or all NaN.
I'd then add initial values and go over this data calculating the new row from the row before, say row[A][t] = row[A][t-1]+1 or so.
I'm currently using the code as below, but I feel it's kind of ugly and there must be a way to do this with a DataFrame directly or just a better way in general.
Note: I'm using Python 2.7.
import datetime as dt
import pandas as pd
import scipy as s

if __name__ == '__main__':
    base = dt.datetime.today().date()
    dates = [base - dt.timedelta(days=x) for x in range(0, 10)]
    dates.sort()

    valdict = {}
    symbols = ['A', 'B', 'C']
    for symb in symbols:
        valdict[symb] = pd.Series(s.zeros(len(dates)), dates)

    for thedate in dates:
        if thedate > dates[0]:
            for symb in valdict:
                valdict[symb][thedate] = 1 + valdict[symb][thedate - dt.timedelta(days=1)]

    print valdict
NEVER grow a DataFrame row-wise!
TLDR; (just read the bold text)
Most answers here will tell you how to create an empty DataFrame and fill it out, but no one will tell you that it is a bad thing to do.
Here is my advice: Accumulate data in a list, not a DataFrame.
Use a list to collect your data, then initialise a DataFrame when you are ready. Either a list-of-lists or list-of-dicts format will work, pd.DataFrame accepts both.
data = []
for row in some_function_that_yields_data():
    data.append(row)

df = pd.DataFrame(data)
pd.DataFrame converts the list of rows (each row being a list, tuple, or dict of scalar values) into a DataFrame. If your function yields DataFrames instead, call pd.concat, as sketched below.
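A minimal sketch of that second case (the yielding function is a stand-in, not a real API):

dfs = list(some_function_that_yields_dataframes())
df = pd.concat(dfs, ignore_index=True)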
Pros of this approach:
It is always cheaper to append to a list and create a DataFrame in one go than it is to create an empty DataFrame (or one of NaNs) and append to it over and over again.
Lists also take up less memory and are a much lighter data structure to work with, append, and remove (if needed).
dtypes are automatically inferred (rather than assigning object to all of them).
A RangeIndex is automatically created for your data, instead of you having to take care to assign the correct index to the row you are appending at each iteration.
If you aren't convinced yet, this is also mentioned in the documentation:
Iteratively appending rows to a DataFrame can be more computationally intensive than a single concatenate. A better solution is to append those rows to a list and then concatenate the list with the original DataFrame all at once.
*** Update for pandas >= 1.4: append is now DEPRECATED! ***
As of pandas 1.4, append has now been deprecated! Use pd.concat instead. See the release notes
These options are horrible
append or concat inside a loop
Here is the biggest mistake I've seen from beginners:
df = pd.DataFrame(columns=['A', 'B', 'C'])
for a, b, c in some_function_that_yields_data():
    df = df.append({'A': a, 'B': b, 'C': c}, ignore_index=True)  # yuck
    # or similarly,
    # df = pd.concat([df, pd.DataFrame([{'A': a, 'B': b, 'C': c}])], ignore_index=True)
Memory is re-allocated for every append or concat operation you have. Couple this with a loop and you have a quadratic complexity operation.
The other mistake associated with df.append is that users tend to forget append is not an in-place function, so the result must be assigned back. You also have to worry about the dtypes:
df = pd.DataFrame(columns=['A', 'B', 'C'])
df = df.append({'A': 1, 'B': 12.3, 'C': 'xyz'}, ignore_index=True)

df.dtypes
A     object   # yuck!
B    float64
C     object
dtype: object
Dealing with object columns is never a good thing, because pandas cannot vectorize operations on those columns. You will need to do this to fix it:
df.infer_objects().dtypes
A      int64
B    float64
C     object
dtype: object
loc inside a loop
I have also seen loc used to append to a DataFrame that was created empty:
df = pd.DataFrame(columns=['A', 'B', 'C'])
for a, b, c in some_function_that_yields_data():
    df.loc[len(df)] = [a, b, c]
As before, you have not pre-allocated the amount of memory you need each time, so the memory is re-grown each time you create a new row. It's just as bad as append, and even more ugly.
Empty DataFrame of NaNs
And then, there's creating a DataFrame of NaNs, and all the caveats associated therewith.
df = pd.DataFrame(columns=['A', 'B', 'C'], index=range(5))
df
     A    B    C
0  NaN  NaN  NaN
1  NaN  NaN  NaN
2  NaN  NaN  NaN
3  NaN  NaN  NaN
4  NaN  NaN  NaN
It creates a DataFrame of object columns, like the others.
df.dtypes
A    object  # you DON'T want this
B    object
C    object
dtype: object
Appending still has all the issues of the methods above:
for i, (a, b, c) in enumerate(some_function_that_yields_data()):
    df.iloc[i] = [a, b, c]
The Proof is in the Pudding
Timing these methods is the fastest way to see just how much they differ in terms of their memory and utility.
Benchmarking code for reference.
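If you want to reproduce the effect yourself, here is a minimal timing sketch using timeit (the function names are illustrative, and absolute numbers will vary by machine and pandas version):

import timeit
import pandas as pd

def with_list(n=1000):
    # accumulate in a list, build the frame once
    data = []
    for i in range(n):
        data.append({'A': i, 'B': i, 'C': i})
    return pd.DataFrame(data)

def with_loc(n=1000):
    # grow an empty frame row by row
    df = pd.DataFrame(columns=['A', 'B', 'C'])
    for i in range(n):
        df.loc[len(df)] = [i, i, i]  # re-grows the frame every iteration
    return df

print(timeit.timeit(with_list, number=10))  # fast
print(timeit.timeit(with_loc, number=10))   # dramatically slower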
Here are a couple of suggestions:
Use date_range for the index:
import datetime
import pandas as pd
import numpy as np
todays_date = datetime.datetime.now().date()
index = pd.date_range(todays_date-datetime.timedelta(10), periods=10, freq='D')
columns = ['A','B', 'C']
Note: we could create an empty DataFrame (with NaNs) simply by writing:
df_ = pd.DataFrame(index=index, columns=columns)
df_ = df_.fillna(0) # With 0s rather than NaNs
To do these types of calculations for the data, use a NumPy array:
data = np.array([np.arange(10)]*3).T
Hence we can create the DataFrame:
In [10]: df = pd.DataFrame(data, index=index, columns=columns)
In [11]: df
Out[11]:
A B C
2012-11-29 0 0 0
2012-11-30 1 1 1
2012-12-01 2 2 2
2012-12-02 3 3 3
2012-12-03 4 4 4
2012-12-04 5 5 5
2012-12-05 6 6 6
2012-12-06 7 7 7
2012-12-07 8 8 8
2012-12-08 9 9 9
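As an aside, the specific "add 1 per day" recurrence from the question can be done without any loop at all; a sketch reusing index and columns from above (cumulative sum of a frame of ones, shifted to start at 0):

df = pd.DataFrame(1, index=index, columns=columns).cumsum() - 1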
If you simply want to create an empty data frame and fill it with some incoming data frames later, try this:
newDF = pd.DataFrame()  # creates a new dataframe that's empty
newDF = newDF.append(oldDF, ignore_index=True)  # ignoring index is optional
# try printing some data from newDF
print newDF.head()  # again optional
In this example I am using this pandas doc to create a new data frame and then using append to write to the newDF with data from oldDF. If I have to keep appending new data into this newDF from more than one oldDF, I just use a for loop to iterate over pandas.DataFrame.append().
Note: append() is deprecated since version 1.4.0. Use concat() instead.
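With concat, the equivalent of the append above is (a minimal sketch):

newDF = pd.concat([newDF, oldDF], ignore_index=True)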
Initialize empty frame with column names
import pandas as pd

col_names = ['A', 'B', 'C']
my_df = pd.DataFrame(columns=col_names)
my_df
Add a new record to a frame
my_df.loc[len(my_df)] = [2, 4, 5]
You also might want to pass a dictionary:
my_dic = {'A':2, 'B':4, 'C':5}
my_df.loc[len(my_df)] = my_dic
Append another frame to your existing frame
col_names = ['A', 'B', 'C']
my_df2 = pd.DataFrame(columns=col_names)
my_df = my_df.append(my_df2)
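On pandas >= 2.0, where append was removed entirely, the equivalent is:

my_df = pd.concat([my_df, my_df2])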
Performance considerations
If you are adding rows inside a loop, consider performance. For roughly the first 1000 records my_df.loc performance is better, but it gradually becomes slower as the number of records in the loop grows.
If you plan to do this inside a big loop (say 10M records or so), you are better off using a mixture of the two:
fill a temporary dataframe with loc until its size reaches around 1000, then append it to the original dataframe, and empty the temp dataframe.
This can boost your performance by around 10 times.
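Here is a minimal sketch of that mixture (rows_source and the chunk size of 1000 are illustrative assumptions, not part of the original answer):

import pandas as pd

CHUNK = 1000
col_names = ['A', 'B', 'C']
final_df = pd.DataFrame(columns=col_names)
temp = pd.DataFrame(columns=col_names)

for row in rows_source():  # rows_source() is a stand-in for wherever your rows come from
    temp.loc[len(temp)] = row
    if len(temp) >= CHUNK:
        final_df = pd.concat([final_df, temp], ignore_index=True)
        temp = temp.iloc[0:0]  # empty the temp frame but keep its columns

final_df = pd.concat([final_df, temp], ignore_index=True)  # flush the remainder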
Simply:
import numpy as np
import pandas as pd

rows, columns = 5, 3  # choose whatever shape you need
df = pd.DataFrame(np.zeros((rows, columns)))
Then fill it.
Assume a dataframe with 19 rows
index = range(0, 19)
columns = ['A']
test = pd.DataFrame(index=index, columns=columns)
Keeping column A as a constant:
test['A'] = 10
Keeping column b as a variable given by a loop:
for x in range(0, 19):
    test.loc[[x], 'b'] = pd.Series([x], index=[x])
You can replace the first x in pd.Series([x], index=[x]) with any value.
This is my way to make a dynamic dataframe from several lists with a loop:
x = [1,2,3,4,5,6,7,8]
y = [22,12,34,22,65,24,12,11]
z = ['as','ss','wa', 'ss','er','fd','ga','mf']
names = ['Bob', 'Liz', 'chop']
A loop:
import pandas as pd

def dataF(x, y, z, names):
    res = []
    for t in zip(x, y, z):
        res.append(t)
    return pd.DataFrame(res, columns=names)
Result:
dataF(x,y,z,names)
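which should give:

   Bob  Liz chop
0    1   22   as
1    2   12   ss
2    3   34   wa
3    4   22   ss
4    5   65   er
5    6   24   fd
6    7   12   ga
7    8   11   mf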
# import pandas library
import pandas as pd
# create a dataframe
my_df = pd.DataFrame({"A": ["shirt"], "B": [1200]})
# show the dataframe
print(my_df)
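Output:

       A     B
0  shirt  1200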