I have a pandas DataFrame and would like to add an empty column (named nb_trades), then fill this new column in increments of 5, so I should get a column with the values 5 10 15 20 ...
The code below assigns the same value (the last value of i) to the whole column, which is not what I wanted:
big_df["nb_trade"]='0'
for i in range(big_df.shape[0]):
big_df['nb_trade']=5*(i+1)
Can someone help me?
Use range or np.arange:
df = pd.DataFrame({'a':[1,2,3]})
print (df)
a
0 1
1 2
2 3
df['new'] = range(5, len(df.index) * 5 + 5, 5)
print (df)
a new
0 1 5
1 2 10
2 3 15
df['new'] = np.arange(5, df.shape[0] * 5 + 5, 5)
print (df)
a new
0 1 5
1 2 10
2 3 15
Solution by John Galt, from the comments:
df['new'] = np.arange(5, 5*(df.shape[0]+1), 5)
print (df)
a new
0 1 5
1 2 10
2 3 15
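For completeness, the same idea applied to the frame from the question (a minimal sketch; big_df is mocked here and stands in for the asker's DataFrame):
import numpy as np
import pandas as pd

big_df = pd.DataFrame({'price': [1.0, 2.0, 3.0, 4.0]})           # stand-in for the real data
big_df['nb_trades'] = np.arange(5, 5 * (big_df.shape[0] + 1), 5)  # 5, 10, 15, 20
print(big_df)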
I have a large dataset with many columns of numeric data and want to be able to count all the zeros in each of the rows. The following will generate a small sample of the data.
df = pd.DataFrame(np.random.randint(0, 3, size=(8,3)),columns=list('abc'))
df
While I can create a column to sum all the values in the rows with the following code:
df2=df.sum(axis=1)
df2
And I can get a count of the zeros in a column:
df.loc[df.a==1].count()
I haven't been able to figure out how to get a count of the zeros across each of the rows. Any assistance would be greatly appreciated.
To count matched values, you can sum the True values of a boolean mask.
If you need a new column:
df['sum of 1'] = df.eq(1).sum(axis=1)
#alternative
#df['sum of 1'] = (df == 1).sum(axis=1)
Sample:
np.random.seed(2020)
df = pd.DataFrame(np.random.randint(0, 3, size=(8,3)),columns=list('abc'))
df['sum of 1'] = df.eq(1).sum(axis=1)
print (df)
a b c sum of 1
0 0 0 2 0
1 1 0 1 2
2 0 0 0 0
3 2 1 2 1
4 2 2 1 1
5 0 0 0 0
6 0 2 0 0
7 1 1 1 3
If you need a new row:
df.loc['sum of 1'] = df.eq(1).sum()
#alternative
#df.loc['sum of 1'] = (df == 1).sum()
Sample:
np.random.seed(2020)
df = pd.DataFrame(np.random.randint(0, 3, size=(8,3)),columns=list('abc'))
df.loc['sum of 1'] = df.eq(1).sum()
print (df)
a b c
0 0 0 2
1 1 0 1
2 0 0 0
3 2 1 2
4 2 2 1
5 0 0 0
6 0 2 0
7 1 1 1
sum of 1 2 2 3
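The question itself asks about zeros; the same boolean-mask pattern works, only the compared value changes (a minimal sketch on the same random data):
import numpy as np
import pandas as pd

np.random.seed(2020)
df = pd.DataFrame(np.random.randint(0, 3, size=(8,3)), columns=list('abc'))
df['zero count'] = df.eq(0).sum(axis=1)   # per-row count of zeros
print(df)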
I have a DataFrame and want to find duplicate values within a column. Where duplicates occur, I want to create a new column that appends a zero to the value for every repeated occurrence, while leaving the original column unchanged.
Original DataFrame:
Code1
1
2
3
4
5
1
2
1
1
New DataFrame:
Code1 Code2
1 1
2 2
3 3
4 4
5 5
6 6
1 10
2 20
1 100
1 1000
6 60
Use groupby and cumcount:
df.assign(counts=df.groupby("Code1").cumcount(),
          Code2=lambda x: x["Code1"] * 10 ** x["counts"]
          ).drop("counts", axis=1)
Code1 Code2
0 1 1
1 2 2
2 3 3
3 4 4
4 5 5
5 1 10
6 2 20
7 1 100
8 1 1000
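An equivalent one-liner that skips the helper column (the same cumcount idea, just inlined):
df['Code2'] = df['Code1'] * 10 ** df.groupby('Code1').cumcount()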
There might be a solution using transform (I just don't have time to investigate right now). However, the following is very explicit about what is happening:
import pandas as pd

data = [1, 2, 3, 4, 5, 1, 2, 1, 1]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['Code1'])

code2 = []
x = {}
for d in data:
    if d not in x:
        x[d] = d
    else:
        x[d] = x[d] * 10
    code2.append(x[d])

df['Code2'] = code2
print(df)
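As for the transform idea mentioned above, one possible sketch (an assumption on my part, not tested beyond the sample data; each value is multiplied by 10 raised to its position within its group):
import numpy as np

df['Code2'] = df.groupby('Code1')['Code1'].transform(lambda s: s * 10 ** np.arange(len(s)))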
I have a table and want to sum the values of the columns belonging to the same class h.*, so my final table will look like this:
Is it possible to aggregate by string column name?
Thank you for any suggestions!
Use a lambda function to select the first 3 characters of each column name with axis=1 (or index the column names in a similar way) and aggregate with sum:
df1 = df.set_index('object')
df2 = df1.groupby(lambda x: x[:3], axis=1).sum().reset_index()
Or:
df1 = df.set_index('object')
df2 = df1.groupby(df1.columns.str[:3], axis=1).sum().reset_index()
Sample:
np.random.seed(123)
cols = ['object', 'h.1.1','h.1.2','h.1.3','h.1.4','h.1.5',
        'h.2.1','h.2.2','h.2.3','h.2.4','h.3.1','h.3.2','h.3.3']
df = pd.DataFrame(np.random.randint(10, size=(4, 13)), columns=cols)
print (df)
object h.1.1 h.1.2 h.1.3 h.1.4 h.1.5 h.2.1 h.2.2 h.2.3 h.2.4 \
0 2 2 6 1 3 9 6 1 0 1
1 9 3 4 0 0 4 1 7 3 2
2 4 8 0 7 9 3 4 6 1 5
3 8 3 5 0 2 6 2 4 4 6
h.3.1 h.3.2 h.3.3
0 9 0 0
1 4 7 2
2 6 2 1
3 3 0 6
df1 = df.set_index('object')
df2 = df1.groupby(lambda x: x[:3], axis=1).sum().reset_index()
print (df2)
object h.1 h.2 h.3
0 2 21 8 9
1 9 11 13 13
2 4 27 16 9
3 8 16 16 9
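In recent pandas versions, groupby with axis=1 emits a deprecation warning; an equivalent sketch that transposes instead (same df1 as above) is:
df2 = df1.T.groupby(lambda x: x[:3]).sum().T.reset_index()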
The solution above works great but is vulnerable if the h.X index goes beyond single digits. I'd recommend the following:
Sample Data:
cols = ['h.%d.%d' %(i, j) for i in range(1, 11) for j in range(1, 11)]
df = pd.DataFrame(np.random.randint(10, size=(4, len(cols))), columns=cols, index=['p_%d'%p for p in range(4)])
Proposed Solution:
new_df = df.groupby(df.columns.str.split('.').str[1], axis=1).sum()
new_df.columns = 'h.' + new_df.columns # the columns are originally numbered 1, 2, 3; this brings them back to h.1, h.2, h.3
Alternative Solution:
Going through multiindices might be more convoluted, but may be useful while manipulating this data elsewhere.
df.columns = df.columns.str.split('.', expand=True) # Transform into a multiindex
new_df = df.sum(axis = 1, level=[0,1])
new_df.columns = new_df.columns.get_level_values(0) + '.' + new_df.columns.get_level_values(1) # Rename columns
My dataframe looks like this:
item_id sales_quantity
1 10
1 11
1 1
1 2
... ...
10 1
10 9
... ...
I want to filter out all rows corresponding to an item_id that appears fewer than 100 times. Here is what I tried:
from pandas import *
from statsmodels.tsa.stattools import adfuller

def adf(X):
    result = adfuller(X)
    print('ADF Statistic: %f' % result[0])
    print('p-value: %f' % result[1])
    print('Critical Values:')
    for key, value in result[4].items():
        print('\t%s: %.3f' % (key, value))

filtered = df.groupby('item_id_copy')['sales_quantity'].filter(lambda x: len(x) >= 100)
df[df['sales_quantity'].isin(filtered)]
df['sales_quantity'].groupby(df['item_id_copy']).apply(adf)
But when I run df['sales_quantity'].groupby(df['item_id_copy']).size(), I get lots of item_ids with size less than 100. Can someone please tell me what is wrong with my code?
It seems you need to remove ['sales_quantity']:
df = df.groupby('item_id_copy').filter(lambda x: len(x) >= 100)
Or:
df = df[df.groupby('item_id_copy')['sales_quantity'].transform('size') >= 100]
Sample:
np.random.seed(130)
df=pd.DataFrame(np.random.randint(3, size=(10,2)), columns=['item_id_copy','sales_quantity'])
print (df)
item_id_copy sales_quantity
0 1 1
1 1 2
2 2 1
3 0 1
4 2 0
5 2 0
6 0 1
7 1 2
8 1 2
9 1 2
df1 = df.groupby('item_id_copy').filter(lambda x: len(x) >= 4)
print (df1)
item_id_copy sales_quantity
0 1 1
1 1 2
7 1 2
8 1 2
9 1 2
df1 = df[df.groupby('item_id_copy')['sales_quantity'].transform('size') >= 4]
print (df1)
item_id_copy sales_quantity
0 1 1
1 1 2
7 1 2
8 1 2
9 1 2
EDIT:
To get columns after applying the function, return a Series from adf (via the Series constructor), then reshape with unstack. Finally, build a new DataFrame from the dicts in the Critical Values column and join it to the original:
np.random.seed(130)
df = pd.DataFrame(np.random.randint(10, size=(1000,2)),
                  columns=['item_id_copy','sales_quantity'])
#print (df)

from statsmodels.tsa.stattools import adfuller

def adf(X):
    result = adfuller(X)
    return pd.Series(result, index=['ADF Statistic','p-value','a','b','Critical Values','c'])

df1 = df[df.groupby('item_id_copy')['sales_quantity'].transform('size') >= 100]
df2 = df1['sales_quantity'].groupby(df1['item_id_copy']).apply(adf).unstack()

df3 = pd.DataFrame(df2['Critical Values'].values.tolist(),
                   index=df2.index,
                   columns=['1%','5%','10%'])

df2 = df2[['ADF Statistic','p-value']].join(df3).reset_index()
print (df2)
print (df2)
item_id_copy ADF Statistic p-value 1% 5% 10%
0 1 -12.0739 2.3136e-22 -3.498198 -2.891208 -2.582596
1 2 -4.48264 0.000211343 -3.494850 -2.889758 -2.581822
2 7 -4.2745 0.000491609 -3.491818 -2.888444 -2.581120
3 9 -11.7981 9.47089e-22 -3.486056 -2.885943 -2.579785
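If you then need the statistics for a single item, ordinary indexing on df2 works, e.g. (using the sample output above):
print(df2.loc[df2['item_id_copy'] == 1, ['ADF Statistic', 'p-value']])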
So I have a file 500 columns by 600 rows and want to take the average of all columns for rows 200-400:
df = pd.read_csv('file.csv', sep= '\s+')
sliced_df=df.iloc[200:400]
Then I create a new column of the row averages across all columns and extract only that newly created column:
sliced_df['mean'] = sliced_df.mean(axis=1)
final_df = sliced_df['mean']
But how can I prevent the indexes from resetting when I extract the new column?
I think it is not necessary to create a new column in sliced_df; just rename the Series, and if you need the output as a DataFrame, add to_frame. The indexes are not reset, see the sample below:
#random dataframe
np.random.seed(100)
df = pd.DataFrame(np.random.randint(10, size=(5,5)), columns=list('ABCDE'))
print (df)
A B C D E
0 8 8 3 7 7
1 0 4 2 5 2
2 2 2 1 0 8
3 4 0 9 6 2
4 4 1 5 3 4
#in real data use df.iloc[200:400]
sliced_df=df.iloc[2:4]
print (sliced_df)
A B C D E
2 2 2 1 0 8
3 4 0 9 6 2
final_ser = sliced_df.mean(axis=1).rename('mean')
print (final_ser)
2 2.6
3 4.2
Name: mean, dtype: float64
final_df = sliced_df.mean(axis=1).rename('mean').to_frame()
print (final_df)
mean
2 2.6
3 4.2
Python counts from 0, so you may need to change the slice from 200:400 to 100:300; see the difference:
sliced_df=df.iloc[1:3]
print (sliced_df)
A B C D E
1 0 4 2 5 2
2 2 2 1 0 8
final_ser = sliced_df.mean(axis=1).rename('mean')
print (final_ser)
1 2.6
2 2.6
Name: mean, dtype: float64
final_df = sliced_df.mean(axis=1).rename('mean').to_frame()
print (final_df)
mean
1 2.6
2 2.6
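Because the original labels are kept, the result also aligns back to the full frame by label if you ever need it there (a small sketch reusing the df and final_ser from above; rows outside the slice become NaN):
df['mean'] = final_ser
print(df)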
Use the copy() function as follows:
df = pd.read_csv('file.csv', sep= '\s+')
sliced_df=df.iloc[200:400].copy()
sliced_df['mean'] = sliced_df.mean(axis=1)
final_df = sliced_df['mean'].copy()
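The copy() calls make sliced_df an independent frame, so assigning the new column does not trigger pandas' SettingWithCopyWarning. A minimal sketch of the same flow on random data (the 600-row frame below is only a stand-in for file.csv):
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.rand(600, 5), columns=list('ABCDE'))  # stand-in for the CSV

sliced_df = df.iloc[200:400].copy()
sliced_df['mean'] = sliced_df.mean(axis=1)
final_df = sliced_df['mean'].copy()
print(final_df.index.min(), final_df.index.max())   # 200 399 -- original labels preserved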