I want to create a loop which creates multiple csvs which have the same 9 columns in the beginning but differ iteratively in the last column.
[col1,col2,col3,col4,...,col9,col[i]]
I have a dataframe with a shape of (20000,209).
What I want is a loop that does not take too much computation power or resources but creates 200 CSVs which differ only in the last column. All columns exist in one dataframe. The columns to be added are the columns i = [10:-1].
I thought of something like:
for col in df.columns[10:-1]:
    dfi = df[:9]
    dfi.concat(df[10])
    dfi.dropna()
    dfi.to_csv('dfi.csv')
Maybe it is also possible to use
dfi.to_csv('dfi.csv', sequence = [:9,i])
The i should display the number of the added column. Any idea how to make this happen easily? :)
Thanks a lot!
I'm not sure I fully understand what you want, but are you saying that each CSV should have just 10 columns: the first 9 in every file, plus one of the remaining 200 columns per file?
If so I would go for something as simple as:
base_cols = list(range(9))
for i in range(9, 209):
    df.iloc[:, base_cols+[i]].to_csv('csv{}.csv'.format(i))
Which should work I think.
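For reference, here is a minimal sketch of the same idea that selects columns by label and also drops incomplete rows, as the question's attempt seemed to intend. The small random frame, the column names and the temporary output directory are assumptions made only so the example runs on its own; they are not from the original post.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical stand-in for the real (20000, 209) frame: 9 base columns + 4 extras.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((6, 13)), columns=[f"col{k}" for k in range(1, 14)])
df.iloc[0, 10] = np.nan  # one missing value in col11 to show the dropna step

base_cols = list(df.columns[:9])
out_dir = tempfile.mkdtemp()  # assumed output location

written = []
for col in df.columns[9:]:
    # Take the 9 shared columns plus one extra column, drop rows with NaN,
    # and write the result without the index column.
    path = os.path.join(out_dir, f"{col}.csv")
    df[base_cols + [col]].dropna().to_csv(path, index=False)
    written.append(path)
```

Naming each file after the added column makes it easy to tell the outputs apart; whether rows with missing values should really be dropped depends on the data.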
Newbie here. Just as the title says, I have a list of dataframes (each dataframe is a class of students). All dataframes have the same columns. I have made certain columns global.
BINARY_CATEGORIES = ['Gender', 'SPED', '504', 'LAP']
for example. These are yes/no or male/female categories, and I have already changed all of the data to be 1's and 0's for these columns. There are several other columns which I want to ignore as I iterate.
I am trying to accept the list of classes (dataframes) into my function and perform calculations on each dataframe using only my BINARY_CATEGORIES list of columns. This is what I've got, but it isn't making it through all of the classes and/or all of the columns.
def bal_bin_cols(classes):
    i = 0
    c = 0
    for x in classes:
        total_binary = classes[c][BINARY_CATEGORIES[i]].sum()
        print(total_binary)
        i+=1
        c+=1
Eventually I need a new dataframe holding all of the sums for each category and each class. print(total_binary) is just a placeholder/debugger. I don't yet have the code that will populate the dataframe from the results of the code above, but I'd like the classes as the index and the totals as the columns.
I know there's probably a vectorized way to do this, or enum, or groupby, but I will take a fix to my loop. I've been stuck forever. Please help.
Try something like:
Firstly create a dictionary:
d = {
    'male': 1,
    'female': 0,
    'yes': 1,
    'no': 0
}
Finally use replace():
df[BINARY_CATEGORIES] = df[BINARY_CATEGORIES].replace(d)
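As for the loop itself: it advances i and c together, so it only ever reads category i of class i rather than every category of every class. A minimal sketch of one way to get the per-class totals follows. The two sample classes and the 'Name' column are made up for illustration, and the result uses each class's position in the list as the index, which is an assumption about the desired layout.

```python
import pandas as pd

BINARY_CATEGORIES = ['Gender', 'SPED', '504', 'LAP']

# Hypothetical sample classes; 'Name' stands in for the columns to ignore.
class_a = pd.DataFrame({
    'Name': ['a', 'b', 'c'],
    'Gender': [1, 0, 1], 'SPED': [0, 0, 1], '504': [0, 1, 0], 'LAP': [1, 1, 0],
})
class_b = pd.DataFrame({
    'Name': ['d', 'e'],
    'Gender': [0, 0], 'SPED': [1, 1], '504': [0, 0], 'LAP': [1, 0],
})

def bal_bin_cols(classes):
    # Sum only the binary columns of each class; each resulting Series
    # becomes one row, so the index is the class's position in the list.
    return pd.DataFrame([cls[BINARY_CATEGORIES].sum() for cls in classes])

totals = bal_bin_cols([class_a, class_b])
print(totals)
```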
[updated with expected outcome]
I'm trying to implement a "running" check where I need the sum and mean of two rows to be more than the previous 2 rows.
Referring to the dataframe (copied into a spreadsheet) below, I'm trying to code a function where, if the mean of the two orange cells is more than the mean of the blue cells, the function returns True for row 8 under a new column called 'Cond11'. The dataframe here is historical, so all rows are available.
Note that the Rows column was added in the spreadsheet to make it easier for me to reference the rows here.
I have been using .rolling to refer to the current row + whatever number of rows to refer to, or using shift(1) to refer to the previous row.
df.loc[:, ('Cond9')] = df.n.rolling(4).mean() >= 30
df.loc[:, ('Cond10')] = df.a > df.a.shift(1)
I'm stuck here... how do I do this for 2 rows vs. the previous 2 rows? Please advise!
The 2nd part of this question: I have another function that checks the latest rows in the dataframe for the same condition above. This function is meant to be used in real-time, when new data is streaming into the dataframe and the function is supposed to check the latest rows only.
Can I check if the following code works to detect the same conditions above?
cond11 = candles.n[-2:-1].sum() > candles.n[-4:-3].sum()
I believe this solves your problem:
df.rolling(4).apply(lambda rows: float(rows[0] + rows[1] < rows[2] + rows[3]), raw=True)
The first 3 rows will be NaNs but you did not define what you would like to happen there.
As for the second part, to be able to produce this condition live for new data you just have to prepend the last 3 rows of your current data and then apply the same process to it:
pd.concat([df[-3:], df])
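An alternative sketch, not the approach above, avoids apply entirely: take the rolling sum over 2 rows and compare it with the same sum shifted by 2 rows. Because both windows have the same length, comparing sums is equivalent to comparing means. The sample values in column n are invented for illustration. Note also that the slices n[-2:-1] and n[-4:-3] in the question each select a single row; the live check below uses iloc[-2:] and iloc[-4:-2] to cover two rows each.

```python
import pandas as pd

# Hypothetical data; 'n' is the column name used in the question.
df = pd.DataFrame({"n": [1, 2, 3, 4, 2, 1, 5, 6]})

# Sum of the current row and the one before it.
pair_sum = df["n"].rolling(2).sum()

# True where the current pair beats the pair two rows earlier.
# Comparisons against NaN are False, so the first three rows are False.
df["Cond11"] = pair_sum > pair_sum.shift(2)

# Live check on the most recent four rows only.
candles = df
latest_ok = bool(candles["n"].iloc[-2:].sum() > candles["n"].iloc[-4:-2].sum())
```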
I have the following piece of code to iterate through two data frames.
for i, row in df1.iterrows():
    for j, innerrow in df2.iterrows():
        if row["df1_id"] == innerrow["df2_id"]:
            df1.at[i,"count_col_df1"] = innerrow["count_col_df2"]
Here, the comparison of ID's column is done to fill the data of one column in df1 from df2.
Since there are 10,000+ records in each data frame, it is taking hours to complete.
Any suggestions for efficient ways to compile the code would be welcomed.
Thanks in advance
If I understood you correctly, this should help. eq() compares the two columns element-wise and returns True or False for each row, so it only finds IDs that sit in the same row of both dataframes:
df1.loc[df1['df1_id'].eq(df2['df2_id']), 'count_col_df1'] = df2['count_col_df2']
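If the matching IDs can sit in different rows, as the nested loop in the question allows, a lookup by ID is closer to the original behaviour. The following is a minimal sketch using Series.map; the sample values are invented, and keeping the last duplicate of each df2_id is an assumption chosen to mimic the loop, where a later match overwrites an earlier one.

```python
import pandas as pd

# Hypothetical sample frames using the column names from the question.
df1 = pd.DataFrame({"df1_id": [1, 2, 3], "count_col_df1": [0, 0, 0]})
df2 = pd.DataFrame({"df2_id": [3, 1, 5], "count_col_df2": [30, 10, 50]})

# Build an ID -> count lookup from df2 (IDs must be unique for map).
lookup = (
    df2.drop_duplicates("df2_id", keep="last")
       .set_index("df2_id")["count_col_df2"]
)

# Map each df1 ID to its count; keep the existing value where no ID matches.
df1["count_col_df1"] = df1["df1_id"].map(lookup).fillna(df1["count_col_df1"])
```

This does one hash lookup per row instead of comparing every row of df1 with every row of df2. The column may become float because of the intermediate NaN values.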
I have created a DataFrame using pandas by reading a csv file. What I want to do is iterate down the rows and collect the values in column 1 into one array, and do the same for the values in column 2 with a different array. This seems like it should be a fairly easy thing to do, so I think I am missing something, but I can't find much online that isn't overly complicated or that does what I want. Stack questions like this one appear to be asking the same thing, but the answers are long and complicated. Is there no way to do this in a few lines of code? Here is what I have set up:
import pandas as pd
#available possible players
playerNames = []
df = pd.read_csv('Fantasy Week 1.csv')
What I anticipate I should be able to do would be something like:
for row in df.columns[1]:
    playerNames.append(row)
This however does not return the desired result.
Essentially, if df =
[1,2,3
4,5,6
7,8,9], I would want my array to be [1,4,7]
Do:
for row in df[df.columns[1]]:
    playerNames.append(row)
Or even better:
print(df[df.columns[1]].tolist())
In this case you want the 1st column's values so do:
for row in df[df.columns[0]]:
    playerNames.append(row)
Or even better:
print(df[df.columns[0]].tolist())
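To see why the attempt in the question fails: df.columns[1] is just the column's name, so when that name is a string the loop iterates over its characters, not over the values. A minimal sketch with the 3x3 example from the question follows; the column names 'a', 'b', 'c' are made up, since the real CSV header is not shown.

```python
import pandas as pd

# The example frame from the question, with hypothetical column names.
df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]], columns=["a", "b", "c"])

# Select a column by position and convert its values to a plain list.
first_col = df.iloc[:, 0].tolist()
second_col = df.iloc[:, 1].tolist()

print(first_col)
print(second_col)
```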
I'm quite new to python and am learning by applying what I know to automate some tasks in Excel.
Basically what I'm trying to do is take certain columns (ex: columns J:Z) and stack them below each other, i.e. column K goes under column J, then column L goes under the column J & K stack, and so on.
I was able to achieve this by:
df1 = df1['December']\
    .append(df1['Jan 2017'])\
    .append(df1['Feb 2017'])\
    .reset_index(drop=False)
But in the process it dropped columns A:I. What I would like to accomplish is to copy columns A:I (rows 1 - 20) for each stack. The data is columnar and I'd like to convert it into rows for each column.
I think you are looking for pandas .stack(), .unstack(), or .melt().
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.stack.html
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.melt.html
https://www.youtube.com/watch?v=qOkj5zOHwRE
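As a rough illustration of the melt() route: the identifier columns are repeated for every stacked column, which is what the question asks for with columns A:I. The frame below is a made-up miniature; 'Account' and 'Region' are hypothetical stand-ins for columns A:I, and the output names 'Month' and 'Value' are arbitrary choices.

```python
import pandas as pd

# Hypothetical wide table: two identifier columns and three month columns.
df1 = pd.DataFrame({
    "Account": ["x", "y"],
    "Region": ["N", "S"],
    "December": [1, 2],
    "Jan 2017": [3, 4],
    "Feb 2017": [5, 6],
})

month_cols = ["December", "Jan 2017", "Feb 2017"]
id_cols = [c for c in df1.columns if c not in month_cols]

# Stack the month columns under each other, repeating the id columns per block.
stacked = df1.melt(
    id_vars=id_cols,
    value_vars=month_cols,
    var_name="Month",
    value_name="Value",
)
```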