We want to add the values of several different rows into one single row. In the image you can see an example of what we want to do: on the left (columns A, B, C) the data we have, on the right the data we want.
We have a large dataset and thus want to write a script. Currently we have a pandas dataframe, and we want to combine every five rows into one.
Does anyone have a simple solution?
Image (left: what we have; right: what we want)
You can do this:
import pandas as pd
# flatten the 2D matrix of all values into a 1D list, row by row
pd.DataFrame([
    [j for i in df.values for j in i]
])
The outer [] in pd.DataFrame([...]) means the whole flattened list becomes the first row -> horizontal formatting.
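A self-contained sketch of the same idea, with a small made-up frame standing in for the asker's df (their real data is not shown):

```python
import pandas as pd

# toy frame standing in for the asker's df
df = pd.DataFrame({'A': [1, 4], 'B': [2, 5], 'C': [3, 6]})

# flatten the 2D values row by row into a single list, then wrap that
# list in an outer list so it becomes the one row of a new DataFrame
flat = pd.DataFrame([[j for i in df.values for j in i]])
print(flat)
#    0  1  2  3  4  5
# 0  1  2  3  4  5  6
```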
Here's a way you can try:
import numpy as np
import pandas as pd
from itertools import product
# sample data
df = pd.DataFrame(np.random.randint(1, 10, size=9).reshape(-1, 3), columns=['X', 'Y', 'Z'])
X Y Z
0 2 6 5
1 5 6 2
2 2 4 5
# total number of values in the frame
total_values = df.count().sum()
# existing column names
cols = df.columns
nums = [1, 2, 3]
# create new column names: X_1, X_2, X_3, Y_1, ...
new_cols = ['_'.join(str(i) for i in x) for x in product(cols, nums)]
# transpose first so the values line up with the X_*, Y_*, Z_* names
df2 = pd.DataFrame(df.T.values.reshape(-1, total_values), columns=new_cols)
   X_1  X_2  X_3  Y_1  Y_2  Y_3  Z_1  Z_2  Z_3
0    2    5    2    6    6    4    5    2    5
I'd do this:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.arange(1, 10).reshape(3, 3), columns=['X', 'Y', 'Z'])
print(df)
X Y Z
0 1 2 3
1 4 5 6
2 7 8 9
dat = df.to_numpy()
d = np.column_stack([dat[:, x].reshape(1, dat.shape[0]) for x in range(dat.shape[1])])
pd.DataFrame(d, columns=(x + str(y) for x in df.columns for y in range(len(df))))
X0 X1 X2 Y0 Y1 Y2 Z0 Z1 Z2
0 1 4 7 2 5 8 3 6 9
Assuming this is a NumPy array (if it's a CSV, you can read it in as a NumPy array):
yourArray.flatten(order='C')
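For instance, with a small hypothetical array (order='C' flattens row by row; order='F' would flatten column by column):

```python
import numpy as np
import pandas as pd

arr = np.array([[1, 2, 3],
                [4, 5, 6]])

row_major = arr.flatten(order='C')  # row by row: [1 2 3 4 5 6]
col_major = arr.flatten(order='F')  # column by column: [1 4 2 5 3 6]

# wrap the flat array in a one-row DataFrame if that is the desired end result
one_row = pd.DataFrame([row_major])
```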
Context: I have a dataframe that contains duplicated indices; however, I'd only like to drop the duplicated indices that match a specific pattern.
For example if we have this dataframe:
tt = pd.DataFrame({'A':[1,2,3,10,10,2,3,4,20,20]}, index=['Sum_2019','X','Y','Sum_2020','Sum_2020','A','B','C','Sum_2021','Sum_2021'])
A
Sum_2019 1
X 2
Y 3
Sum_2020 10
Sum_2020 10
A 2
B 3
C 4
Sum_2021 20
Sum_2021 20
How can I drop only the repeated "Sum_" indices for each given year (e.g. Sum_2020 appearing twice), keeping everything else? Desired output:
A
Sum_2019 1
X 2
Y 3
Sum_2020 10
A 2
B 3
C 4
Sum_2021 20
Attempts:
I was trying to do this:
df= df[~df.index.duplicated(keep='first')]
But this also removes the indices "X" and "Y" that I want to keep.
Thank you!
Add a condition to keep indices not starting with Sum_, chained with | for bitwise OR:
df = pd.DataFrame({'A':[1,2,3,10,10,2,3,4,20,20]},
index=['Sum_2019','X','Y','Sum_2020','Sum_2020',
'X','Y','Z','Sum_2021','Sum_2021'])
df = df[~df.index.duplicated(keep='first') | ~df.index.str.startswith('Sum_')]
print (df)
A
Sum_2019 1
X 2
Y 3
Sum_2020 10
X 2
Y 3
Z 4
Sum_2021 20
You can use shift to compare successive labels:
# identify "Sum_" indices
s = tt.index.to_series().str.contains('Sum_')
# drop rows where both the current and the previous label are Sum_
out = tt[~(s & s.shift())]
output:
A
Sum_2019 1
X 2
Y 3
Sum_2020 10
A 2
B 3
C 4
Sum_2021 20
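A self-contained sketch of this approach; fill_value=False is an assumption added here so the first shifted comparison is well defined:

```python
import pandas as pd

tt = pd.DataFrame({'A': [1, 2, 3, 10, 10, 2, 3, 4, 20, 20]},
                  index=['Sum_2019', 'X', 'Y', 'Sum_2020', 'Sum_2020',
                         'A', 'B', 'C', 'Sum_2021', 'Sum_2021'])

# True wherever the label contains "Sum_"
s = tt.index.to_series().str.contains('Sum_')
# True only where a Sum_ label immediately repeats the previous Sum_ label
mask = s & s.shift(fill_value=False)
out = tt[~mask]
print(out.index.tolist())
# ['Sum_2019', 'X', 'Y', 'Sum_2020', 'A', 'B', 'C', 'Sum_2021']
```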
I am having trouble figuring out how to iterate over variables in a pandas dataframe and perform the same arithmetic function on each.
I have a dataframe df that contains three numeric variables x1, x2 and x3. I want to create three new variables by multiplying each by 2. Here's what I am doing:
existing = ['x1','x2','x3']
new = ['y1','y2','y3']
for i in existing:
    for j in new:
        df[j] = df[i]*2
The above code does in fact create three new variables y1, y2 and y3 in the dataframe. But the values of y1 and y2 are overridden by the values of y3, so all three variables end up with the same values, corresponding to those of y3. I am not sure what I am missing.
Really appreciate any guidance/ suggestion. Thanks.
The nested loops run 9 times here: each of the 3 new columns is written once per existing column, with each iteration overwriting the previous one.
You may want something like
for e, n in zip(existing, new):
    df[n] = df[e]*2
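Put together with hypothetical sample data:

```python
import pandas as pd

df = pd.DataFrame({'x1': [1, 2], 'x2': [3, 4], 'x3': [5, 6]})

existing = ['x1', 'x2', 'x3']
new = ['y1', 'y2', 'y3']

# zip pairs each existing column with exactly one new name, so every
# target column is written exactly once
for e, n in zip(existing, new):
    df[n] = df[e] * 2

print(df)
#    x1  x2  x3  y1  y2  y3
# 0   1   3   5   2   6  10
# 1   2   4   6   4   8  12
```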
I would do something more generic:
# existing = ['x1','x2','x3']
existing = df.columns
new = [c.replace('x', 'y') for c in existing]
for col_existing, col_new in zip(existing, new):
    df[col_new] = df[col_existing] * 2
# maybe there is a more elegant way using pandas' assign function
You can concatenate the original DataFrame with the columns with doubled values:
cols_to_double = ['x0', 'x1', 'x2']
new_cols = list(df.columns) + [c.replace('x', 'y') for c in cols_to_double]
df = pd.concat([df, 2 * df[cols_to_double]], axis=1, copy=True)
df.columns = new_cols
So, if your input df Dataframe is:
x0 x1 x2 other0 other1
0 0 1 2 3 4
1 0 1 2 3 4
2 0 1 2 3 4
3 0 1 2 3 4
4 0 1 2 3 4
after executing the previous lines, you get:
x0 x1 x2 other0 other1 y0 y1 y2
0 0 1 2 3 4 0 2 4
1 0 1 2 3 4 0 2 4
2 0 1 2 3 4 0 2 4
3 0 1 2 3 4 0 2 4
4 0 1 2 3 4 0 2 4
Here the code to create df:
import pandas as pd
import numpy as np
df = pd.DataFrame(
    data=np.column_stack([np.full((5,), i) for i in range(5)]),
    columns=[f'x{i}' for i in range(3)] + [f'other{i}' for i in range(2)]
)
I want to see changes in cell values when comparing multiple dataframes. These dataframes are formed from JSON data, which produces many-column tables, and I cannot easily change this data source. Let's say there are 10 dataframes, each with 10 rows and 10 columns (equally labelled). I'd like to compare the information by turning each dataframe into 100 rows and 1 column.
for 3x3 example:
import pandas as pd
data = [{'a':1,'b':2,'c':3},{'a':10,'b':20,'c':30},{'a':100,'b':200,'c':300}]
df = pd.DataFrame(data)
df.index = ['x','y','z']
gives this table
a b c
x 1 2 3
y 10 20 30
z 100 200 300
but I would like to have:
col
xa 1
xb 2
xc 3
ya 10
yb 20
yc 30
za 100
zb 200
zc 300
so that I may then add many columns and compare values changes.
Can somebody advise me on how to do this using pandas?
It is okay if a third column is required, i.e.:
1 2 3
x a 1
x b 2
x c 3
y a 10
y b 20
y c 30
z a 100
z b 200
z c 300
Use DataFrame.stack with Series.to_frame, then flatten the MultiIndex into a plain index with map:
df_us = df.stack().to_frame('col')
df_us.index = df_us.index.map(lambda x: f'{x[0]}{x[1]}')
print (df_us)
col
xa 1
xb 2
xc 3
ya 10
yb 20
yc 30
za 100
zb 200
zc 300
For 3 columns:
df_us = df.stack().reset_index()
df_us.columns = [0,1,2]
print (df_us)
0 1 2
0 x a 1
1 x b 2
2 x c 3
3 y a 10
4 y b 20
5 y c 30
6 z a 100
7 z b 200
8 z c 300
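Once each frame is stacked into a single column, the stated goal of comparing many equally-labelled dataframes can be done by concatenating the stacked columns side by side; a sketch with two hypothetical frames:

```python
import pandas as pd

df1 = pd.DataFrame([{'a': 1, 'b': 2, 'c': 3}, {'a': 10, 'b': 20, 'c': 30}], index=['x', 'y'])
df2 = pd.DataFrame([{'a': 1, 'b': 5, 'c': 3}, {'a': 10, 'b': 20, 'c': 31}], index=['x', 'y'])

# one column per original frame, rows indexed by (row label, column label)
combined = pd.concat({'df1': df1.stack(), 'df2': df2.stack()}, axis=1)

# keep only the cells whose value changed between the two frames
changed = combined[combined['df1'] != combined['df2']]
print(changed)
#      df1  df2
# x b    2    5
# y c   30   31
```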
Try something like this:
import pandas as pd
data = [{'a':1,'b':2,'c':3},{'a':10,'b':20,'c':30},{'a':100,'b':200,'c':300}]
df = pd.DataFrame(data)
df.index = ['x','y','z']
df_us = df.unstack().reset_index()
df_us.columns = [i for i in range(df_us.shape[1])]
df_us = df_us.sort_values(by=2)
I have a data frame with 10 columns which successfully loads into a classifier. Now I am trying to load the sum of the columns instead of all 10 columns.
previous_games_stats = pd.read_csv('stats/2016-2017 CANUCKS STATS.csv', header=1)
numGamesToLookBack = 10
X = previous_games_stats[['GF', 'GA']]
X = X[0:numGamesToLookBack]  # num games to look back
stats_feature_names = list(X.columns.values)
totals = pd.DataFrame(X, columns=stats_feature_names)
y = previous_games_stats['Unnamed: 7']  # outcome variable (win/loss)
y = y[numGamesToLookBack+1]
stats_df = pd.DataFrame(X, columns=stats_feature_names).sum()
The final line (with .sum() at the end) causes stats_df to go from being formatted like:
GF GA
0 2 1
1 4 3
2 2 1
3 2 1
4 3 4
5 2 4
6 0 3
7 0 2
8 2 5
9 0 3
to:
GF 17
GA 27
But I want to keep the same format, so the end result should be this:
GF GA
0 17 27
Since it is getting re-formatted, I am getting the following error:
IndexError: boolean index did not match indexed array along dimension 0; dimension is 4 but corresponding boolean dimension is 3
What can I do to make the format stay the same?
Calling sum on a DataFrame returns a Series. For a one-row DataFrame, use:
stats_df = pd.DataFrame(X, columns=stats_feature_names).sum().to_frame().T
Another solution:
df1 = pd.DataFrame(X, columns=stats_feature_names)
stats_df = pd.DataFrame([df1.sum().values], columns=df1.columns)
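A minimal sketch of the first variant with made-up GF/GA numbers (the asker's real CSV is not available here):

```python
import pandas as pd

X = pd.DataFrame({'GF': [2, 4, 2], 'GA': [1, 3, 1]})

# sum() collapses the frame to a Series; to_frame().T turns that Series
# back into a one-row DataFrame, keeping GF/GA as column names
stats_df = X.sum().to_frame().T
print(stats_df)
#    GF  GA
# 0   8   5
```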
I have a list abc of size x, and a dataframe with 2x rows. I want to assign the values from the list abc to a new column in the dataframe.
When the size of the df is equal to or less than the list, I just use:
df['NewCol'] = abc[:df.shape[0]]
When the size of the df is more than the list (in this case twice), I use a loop like the one below:
for i, rowData in df.iterrows():
    j = i // 2  # integer division: two consecutive rows share one list entry
    df['NewCol'].iloc[i] = abc[j]
Here the size of df is exactly twice the size of list. And I will always have the case where the size of df is either twice/thrice the list. So that one entry can be matched to two or three consecutive entries.
Is there any faster way to achieve this?
df = pd.DataFrame(np.random.randn(4, 3), columns=list('ABC'))
abc = ['a', 'b']
I will always have the case where the size of df is either twice/thrice the list.
multiplier = len(df) // len(abc)  # integer division; should be 2 or 3 per the above condition
df = df.assign(NewCol=[val for val in abc for _ in range(multiplier)])
>>> df
A B C NewCol
0 -0.262760 1.898977 2.265480 a
1 0.552906 2.144316 -0.942272 a
2 -1.429635 -0.060660 0.756665 b
3 -0.658036 -1.056586 1.458374 b
You could use numpy.repeat to repeat your list, since you are sure the length ratio will always be an integer:
import numpy as np
import pandas as pd
df = pd.DataFrame({'a':np.arange(6)})
abc = [4, 5, 6]
df['NewCol'] = np.repeat(abc, len(df) // len(abc))
df
a NewCol
0 0 4
1 1 4
2 2 5
3 3 5
4 4 6
5 5 6
If you prefer to have the list repeated as a whole, you can use np.tile:
df['NewCol2'] = np.tile(abc, len(df) // len(abc))
df
a NewCol NewCol2
0 0 4 4
1 1 4 5
2 2 5 6
3 3 5 4
4 4 6 5
5 5 6 6