How to split a single column containing 1000 rows into two columns of 500 rows each in pandas?
I have a csv file that contains a single column, and I need to split it into multiple columns. Below is the format of the csv.
Steps I took:
I had multiple csv files, each containing one column with 364 rows. I converted each into a dataframe and concatenated them, but the files end up stacked vertically, one below the other.
Code I tried
import pandas as pd

monthly_list = []
for file in ['D0_monthly.csv','c1_monthly.csv','c2_monthly.csv','c2i_monthly.csv','c3i_monthly.csv','c4i_monthly.csv','D1_monthly.csv','D2i_monthly.csv','D3i_monthly.csv','D4i_monthly.csv',
             'D2j_monthly.csv','D3j_monthly.csv','D4j_monthly.csv','c2j_monthly.csv','c3j_monthly.csv','c4j_monthly.csv']:
    monthly_file = pd.read_csv(file, header=None, index_col=None, skiprows=[0])
    monthly_list.append(monthly_file)
monthly_all_file = pd.concat(monthly_list)  # default axis=0 stacks the files vertically
How the data is:
column1
1
2
3
.
.
364
1
2
3
.
.
364
I need to split the above column into the format shown below.
What the data should be:
column1  column2
1        1
2        2
3        3
4        4
5        5
.        .
.        .
.        .
364      364
Answer updated to work for an arbitrary number of columns
You could start from either the number of columns or the column length; for a given initial total length, each one determines the other. In this answer I use the desired target column length, tgt_row_len.
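For reference, the relation between the two parameters (total_len is hypothetical here, standing for the length of the original column):

nb_groups = total_len // tgt_row_len  # number of columns implied by a target column length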
import numpy as np
import pandas as pd

nb_groups = 4
tgt_row_len = 5
df = pd.DataFrame({'column1': np.arange(1, tgt_row_len*nb_groups + 1)})
print(df)
column1
0 1
1 2
2 3
3 4
4 5
5 6
6 7
...
17 18
18 19
19 20
Create groups in the index for the grouping operation that follows:
df.index = df.reset_index(drop=True).index // tgt_row_len
column1
0 1
0 2
0 3
0 4
0 5
1 6
1 7
...
3 17
3 18
3 19
3 20
dfn = (
df.groupby(level=0).apply(lambda x: x['column1'].reset_index(drop=True)).T
.rename(columns = lambda x: 'col' + str(x+1)).rename_axis(None)
)
print(dfn)
col1 col2 col3 col4
0 1 6 11 16
1 2 7 12 17
2 3 8 13 18
3 4 9 14 19
4 5 10 15 20
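When the total length is an exact multiple of tgt_row_len, the groupby/apply above is equivalent to a single numpy reshape; a minimal sketch using the nb_groups and tgt_row_len defined earlier:

dfn = pd.DataFrame(
    df['column1'].to_numpy().reshape(nb_groups, tgt_row_len).T,  # each group becomes one column
    columns=['col' + str(i + 1) for i in range(nb_groups)]
)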
Previous answer that handles creating two columns
This answer just shows 10 target rows as an example. That can easily be changed to 364 or 500.
A dataframe where column1 contains 2 sets of 10 rows
tgt_row_len = 10
df = pd.DataFrame({'column1': np.tile(np.arange(1,tgt_row_len+1),2)})
print(df)
column1
0 1
1 2
2 3
3 4
4 5
5 6
6 7
7 8
8 9
9 10
10 1
11 2
12 3
13 4
14 5
15 6
16 7
17 8
18 9
19 10
Move the bottom set of rows to column2
df.assign(column2=df['column1'].shift(-tgt_row_len)).iloc[:tgt_row_len].astype(int)  # shift introduces NaN (float), hence astype(int)
column1 column2
0 1 1
1 2 2
2 3 3
3 4 4
4 5 5
5 6 6
6 7 7
7 8 8
8 9 9
9 10 10
I don't know if anyone has a more efficient solution, but using pd.merge on a temp column should solve your issue. Here is a quick implementation of what you could write.
csv1['temp'] = range(len(csv1))  # a shared row number; merging on a constant key would produce a cross join
csv2['temp'] = range(len(csv2))
new_df = pd.merge(csv1, csv2, on=["temp"])
new_df = new_df.drop("temp", axis=1)
I hope this helps!
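A related alternative (my suggestion, not part of the answer above): when both frames share the same row order, pd.concat with axis=1 lines them up side by side without a helper column:

new_df = pd.concat([csv1.reset_index(drop=True), csv2.reset_index(drop=True)], axis=1)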
Related
I have this column (similar, but with a lot more entries)
import pandas as pd

numbers = range(1, 16)
sequence = []
for number in numbers:
    sequence.append(number)
df = pd.DataFrame(sequence).rename(columns={0: 'sequence'})
and I want to distribute the same values across many more columns periodically (and automatically) to get something like this (but with many more values)
Thanks
Use reshape with 5 for the number of new rows; -1 counts the number of columns automatically:
import numpy as np

numbers = range(1, 16)
df = pd.DataFrame(np.array(numbers).reshape(-1, 5).T)
print(df)
0 1 2
0 1 6 11
1 2 7 12
2 3 8 13
3 4 9 14
4 5 10 15
If the number of values cannot evenly fill N rows, here is a possible solution that pads the remaining positions with a fill value:
L = range(1, 22)
N = 5
filled = 0  # pad value for the empty positions
arr = np.full(((len(L) - 1)//N + 1)*N, filled)  # round the length up to a multiple of N
arr[:len(L)] = L
df = pd.DataFrame(arr.reshape((-1, N)).T)
print(df)
0 1 2 3 4
0 1 6 11 16 21
1 2 7 12 17 0
2 3 8 13 18 0
3 4 9 14 19 0
4 5 10 15 20 0
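To pad with NaN instead of 0 (my variation, not from the answer above), allocate a float array; note the columns then become float dtype:

filled = np.nan
arr = np.full(((len(L) - 1)//N + 1)*N, filled, dtype=float)  # NaN requires a float array
arr[:len(L)] = L
df = pd.DataFrame(arr.reshape((-1, N)).T)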
Use pandas.Series.values.reshape with the desired number of rows and columns:
pd.DataFrame(df.sequence.values.reshape(5, -1, order="F"))  # order="F" fills down the columns
If you prefer to reshape after reading the dataframe:
df = pd.DataFrame(df.to_numpy().reshape(5, -1, order="F"))
num_cols = 3
result = pd.DataFrame(df.sequence.to_numpy().reshape(-1, num_cols, order="F"))
for a given number of columns (e.g., 3 here), this reshapes df.sequence to (total_number_of_values / num_cols, num_cols), where the first dimension is inferred with -1. The Fortran order (order="F") fills the values column by column, so the numbers "go down first",
to get
>>> result
0 1 2
0 1 6 11
1 2 7 12
2 3 8 13
3 4 9 14
4 5 10 15
If num_cols = 5, then
>>> result
0 1 2 3 4
0 1 4 7 10 13
1 2 5 8 11 14
2 3 6 9 12 15
I need to calculate a column based on other rows. Basically I want my new_column to be the sum of "base_column" over all rows with the same id.
I currently do the following (but it is not really efficient); what is the most efficient way to achieve that?
def calculate(x):
    # in reality my filter is more complex: same id and a date within the last 4 weeks
    filtered_df = df[df["id"] == df.at[x.name, "id"]]
    df.at[x.name, "new_column"] = filtered_df["base_column"].sum()

df.apply(calculate, axis=1)
You can do as below:
df['new_column']= df.groupby('id')['base_column'].transform('sum')
input
id base_column
0 1 2
1 1 4
2 2 5
3 3 6
4 5 7
5 7 4
6 7 5
7 7 3
output
id base_column new_column
0 1 2 6
1 1 4 6
2 2 5 5
3 3 6 6
4 5 7 7
5 7 4 12
6 7 5 12
7 7 3 12
Another way to do this is to use groupby and merge
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 2], 'base_column': [2, 4, 5]})
# compute the sum by id
sum_base = df.groupby("id").agg({"base_column": 'sum'}).reset_index().rename(columns={'base_column': 'new_column'})
# join the result back to df
df = pd.merge(df, sum_base, how='left', on='id')
# id base_column new_column
#0 1 2 6
#1 1 4 6
#2 2 5 5
I have a dataset that looks like the following. The "HomeForm" column is what I'm trying to create and fill with values, i.e. the output.
HomeTeam AwayTeam FTHG FTAG HomeForm
Date
9 0 12 1 0
9 2 3 0 0
9 4 13 1 0
9 8 5 0 3
9 10 16 4 1
9 14 19 0 3
9 17 7 1 4
8 1 9 0 4
8 18 11 1 2
7 6 15 3 1
What I'm trying to do is to create another column called "HomeForm" that has, say, the sum of goals scored by the home team in each of its last 6 matches. Bear in mind that a team can appear either in the "HomeTeam" column or in the "AwayTeam" column. What would be the best way to achieve this using Python?
Thanks.
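A minimal sketch of one possible approach (untested; the long-format reshape and the helper names team, goals, match, and is_home are my assumptions, as is filling a team's first match with 0; the column names come from the question):

import pandas as pd

# assume df is sorted chronologically (as the Date index suggests)
home = df[['HomeTeam', 'FTHG']].rename(columns={'HomeTeam': 'team', 'FTHG': 'goals'})
away = df[['AwayTeam', 'FTAG']].rename(columns={'AwayTeam': 'team', 'FTAG': 'goals'})
home['match'] = range(len(df))
away['match'] = range(len(df))
home['is_home'] = True
away['is_home'] = False

# one row per team per match, restored to chronological order
long_df = pd.concat([home, away]).sort_values('match')

# goals scored by each team over its previous 6 matches; shift() excludes the current match
long_df['form'] = (long_df.groupby('team')['goals']
                          .transform(lambda s: s.shift().rolling(6, min_periods=1).sum()))

# write the home team's form back to the original rows
df['HomeForm'] = long_df.loc[long_df['is_home'], 'form'].fillna(0).to_numpy()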
Given a dataframe with numerical values in a specific column, I want to randomly remove a certain percentage of the rows for which the value in that specific column lies within a certain range.
For example given the following dataframe:
df = pd.DataFrame({'col1': [1,2,3,4,5,6,7,8,9,10]})
df
col1
0 1
1 2
2 3
3 4
4 5
5 6
6 7
7 8
8 9
9 10
2/5 of the rows where col1 is below 6 should be removed randomly.
What's the most concise way to do that?
Use sample + drop:
df.drop(df.query('col1 < 6').sample(frac=.4).index)
col1
1 2
3 4
4 5
5 6
6 7
7 8
8 9
9 10
For a range
df.drop(df.query('2 < col1 < 8').sample(frac=.4).index)
col1
0 1
1 2
3 4
4 5
5 6
7 8
8 9
9 10
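A usage note: sample draws a fresh random subset on each call, so the rows removed differ between runs; passing its random_state parameter makes the removal reproducible:

df.drop(df.query('col1 < 6').sample(frac=.4, random_state=42).index)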
I am trying to create a new column named 'Cumulative Frequency' in a data frame, where each row holds the sum of all frequencies up to and including the current row, as shown here.
What is the way to do this?
You want cumsum:
df['Cumulative Frequency'] = df['Frequency'].cumsum()
Example:
In [23]:
df = pd.DataFrame({'Frequency':np.arange(10)})
df
Out[23]:
Frequency
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
In [24]:
df['Cumulative Frequency'] = df['Frequency'].cumsum()
df
Out[24]:
Frequency Cumulative Frequency
0 0 0
1 1 1
2 2 3
3 3 6
4 4 10
5 5 15
6 6 21
7 7 28
8 8 36
9 9 45
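If the frequencies were grouped by some key column, the same idea works per group (the 'group' column here is hypothetical, for illustration):

df['Cumulative Frequency'] = df.groupby('group')['Frequency'].cumsum()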