I have a pandas DataFrame that contains the data given below:
ID Q1_rev Q1_transcnt Q2_rev Q2_transcnt Q3_rev Q3_transcnt Q4_rev Q4_transcnt
1 100 2 200 4 300 6 400 8
2 101 3 201 5 301 7 401 9
I would like to do the following:
a) For each ID, create 3 rows (from the 8 input columns).
b) Each row should contain the data of two consecutive quarters.
c) Each subsequent row should shift the columns by 1 (one quarter of data).
To make this clearer, I expect my output to look like the one below
I tried the following, based on another SO post, but was unable to get the expected output:
s = 3
n = 2
cols = ['1st_rev','1st_transcnt','2nd_rev','2nd_transcnt']
output = pd.concat((df.iloc[:,0+i*s:6+i*s].set_axis(cols, axis=1) for i in range(int((df.shape[1]-(s*n))/n))), ignore_index=True, axis=0).set_index(np.tile(df.index,2))
Can you help me with this? The problem is that in practice n=2 will not always be the case. It could be 4 or 5 as well. That is, instead of '1st_rev','1st_transcnt','2nd_rev','2nd_transcnt', I may have the columns below, with 4 pairs of columns:
'1st_rev','1st_transcnt','2nd_rev','2nd_transcnt','3rd_rev','3rd_transcnt','4th_rev','4th_transcnt'
Use a custom function with DataFrame.groupby, grouping the columns by the second part of each name after splitting on _ (x.split('_')[1]):
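N = 2
df1 = df.set_index('ID')
def f(x, n=N):
    # for each row, collect every window of n consecutive quarters
    out = np.array([[list(L[i:i+n]) for i in range(len(L)-n+1)] for L in x.to_numpy()])
    return pd.DataFrame(np.vstack(out))
# group columns by their suffix ('rev', 'transcnt');
# note that groupby(axis=1) is deprecated in recent pandas versions
df2 = (df1.groupby(lambda x: x.split('_')[1], axis=1, sort=False)
          .apply(f)
          .sort_index(axis=1, level=1, sort_remaining=False))
# repeat each ID once per window and flatten the column MultiIndex
df2.index = np.repeat(df1.index, int(len(df2.index) / len(df1.index)))
df2.columns = df2.columns.map(lambda x: f'{x[1] + 1}_{x[0]}')
print (df2)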
1_rev 1_transcnt 2_rev 2_transcnt
ID
1 100 2 200 4
1 200 4 300 6
1 300 6 400 8
2 101 3 201 5
2 201 5 301 7
2 301 7 401 9
Test with a window of 3:
N = 3
df1 = df.set_index('ID')
def f(x, n=N):
    out = np.array([[list(L[i:i+n]) for i in range(len(L)-n+1)] for L in x.to_numpy()])
    return pd.DataFrame(np.vstack(out))
df2 = (df1.groupby(lambda x: x.split('_')[1], axis=1, sort=False)
          .apply(f)
          .sort_index(axis=1, level=1, sort_remaining=False))
df2.index = np.repeat(df1.index, int(len(df2.index) / len(df1.index)))
df2.columns = df2.columns.map(lambda x: f'{x[1] + 1}_{x[0]}')
print (df2)
1_rev 1_transcnt 2_rev 2_transcnt 3_rev 3_transcnt
ID
1 100 2 200 4 300 6
1 200 4 300 6 400 8
2 101 3 201 5 301 7
2 201 5 301 7 401 9
One option is with a for loop or list comprehension, followed by a concatenation, and a sort:
temp = df.set_index('ID')
cols = ['1st_rev','1st_transcnt','2nd_rev','2nd_transcnt']
outcome = [temp
           .iloc(axis=1)[n:n+4]
           .set_axis(cols, axis = 1)
           for n in range(0, len(cols)+2, 2)]
pd.concat(outcome).sort_index()
1st_rev 1st_transcnt 2nd_rev 2nd_transcnt
ID
1 100 2 200 4
1 200 4 300 6
1 300 6 400 8
2 101 3 201 5
2 201 5 301 7
2 301 7 401 9
To make it more generic, a while loop can be used (you can use a for loop - a while loop seems more readable/easier to understand):
def reshape_N(df, N):
    # you can pass your custom column names here instead,
    # as long as they match the width of each window
    columns = ['rev', 'transcnt']
    columns = np.tile(columns, N)
    numbers = np.arange(1, N+1).repeat(2)
    columns = [f"{n}_{ent}"
               for n, ent
               in zip(numbers, columns)]
    contents = []
    start = 0
    end = N * 2
    temp = df.set_index("ID")
    # slide a window of N quarters (2 columns each), one quarter at a time
    while end <= temp.columns.size:
        frame = temp.iloc(axis=1)[start:end]
        contents.append(frame.set_axis(columns, axis=1))
        start += 2
        end += 2
    if not contents:
        return df
    # a stable sort keeps the windows of each ID in quarter order
    return pd.concat(contents).sort_index(kind="mergesort")
Let's apply the function:
reshape_N(df, 2)
1_rev 1_transcnt 2_rev 2_transcnt
ID
1 100 2 200 4
1 200 4 300 6
1 300 6 400 8
2 101 3 201 5
2 201 5 301 7
2 301 7 401 9
reshape_N(df, 3)
1_rev 1_transcnt 2_rev 2_transcnt 3_rev 3_transcnt
ID
1 100 2 200 4 300 6
1 200 4 300 6 400 8
2 101 3 201 5 301 7
2 201 5 301 7 401 9
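As an alternative to the loops above, the same windows can be taken directly from the underlying array. The following is only a sketch: it swaps in numpy's sliding_window_view (NumPy 1.20 or later), and the names rolling_quarters and per_quarter are made up for illustration. It assumes every quarter contributes exactly per_quarter adjacent columns, in quarter order, as in the sample data.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

df = pd.DataFrame({
    "ID": [1, 2],
    "Q1_rev": [100, 101], "Q1_transcnt": [2, 3],
    "Q2_rev": [200, 201], "Q2_transcnt": [4, 5],
    "Q3_rev": [300, 301], "Q3_transcnt": [6, 7],
    "Q4_rev": [400, 401], "Q4_transcnt": [8, 9],
})

def rolling_quarters(df, n, per_quarter=2):
    """Hypothetical helper: one output row per window of n consecutive quarters."""
    values = df.set_index("ID").to_numpy()
    width = n * per_quarter
    # every window of `width` adjacent columns, keeping only those that
    # start on a quarter boundary (every `per_quarter`-th start position)
    windows = sliding_window_view(values, width, axis=1)[:, ::per_quarter, :]
    n_windows = windows.shape[1]
    cols = [f"{q}_{name}" for q in range(1, n + 1) for name in ("rev", "transcnt")]
    return pd.DataFrame(
        windows.reshape(-1, width),
        index=np.repeat(df["ID"].to_numpy(), n_windows),
        columns=cols,
    ).rename_axis("ID")

out = rolling_quarters(df, 2)
print(out)
```

Because the windows come from a single array view, no concatenation or sorting is needed; the rows are already grouped by ID in quarter order.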
I have below dataframe columns:
Index(['Location', 'Dec-2021_x', 'Jan-2022_x', 'Feb-2022_x', 'Mar-2022_x',
'Apr-2022_x', 'May-2022_x', 'Jun-2022_x', 'Jul-2022_x', 'Aug-2022_x',
'Sep-2022_x', 'Oct-2022_x', 'Nov-2022_x', 'Dec-2022_x', 'Jan-2023_x',
'Feb-2023_x', 'Mar-2023_x', 'Apr-2023_x', 'May-2023_x', 'Jun-2023_x',
'Jul-2023_x', 'Aug-2023_x', 'Sep-2023_x', 'Oct-2023_x', 'Nov-2023_x',
'Dec-2023_x', 'Jan-2024_x', 'Feb-2024_x', 'Mar-2024_x', 'Apr-2024_x',
'May-2024_x', 'Jun-2024_x', 'Jul-2024_x', 'Aug-2024_x', 'Sep-2024_x',
'Oct-2024_x', 'Nov-2024_x', 'Dec-2024_x',
'sum_val',
'Dec-2021_y', 'Jan-2022_y', 'Feb-2022_y',
'Mar-2022_y', 'Apr-2022_y', 'May-2022_y', 'Jun-2022_y', 'Jul-2022_y',
'Aug-2022_y', 'Sep-2022_y', 'Oct-2022_y', 'Nov-2022_y', 'Dec-2022_y',
'Jan-2023_y', 'Feb-2023_y', 'Mar-2023_y', 'Apr-2023_y', 'May-2023_y',
'Jun-2023_y', 'Jul-2023_y', 'Aug-2023_y', 'Sep-2023_y', 'Oct-2023_y',
'Nov-2023_y', 'Dec-2023_y', 'Jan-2024_y', 'Feb-2024_y', 'Mar-2024_y',
'Apr-2024_y', 'May-2024_y', 'Jun-2024_y', 'Jul-2024_y', 'Aug-2024_y',
'Sep-2024_y', 'Oct-2024_y', 'Nov-2024_y', 'Dec-2024_y'],
dtype='object')
Sample dataframe with reduced columns looks like this:
df:
Location Dec-2021_x Jan-2022_x sum_val Dec-2021_y Jan-2022_y
A 212 315 1000 12 13
B 312 612 1100 13 17
C 242 712 1010 15 15
D 215 382 1001 16 17
E 252 319 1110 17 18
I have to create a resultant dataframe which will be in the below format:
Index(['Location', 'Dec-2021', 'Jan-2022', 'Feb-2022', 'Mar-2022',
'Apr-2022', 'May-2022', 'Jun-2022', 'Jul-2022', 'Aug-2022',
'Sep-2022', 'Oct-2022', 'Nov-2022', 'Dec-2022', 'Jan-2023',
'Feb-2023', 'Mar-2023', 'Apr-2023', 'May-2023', 'Jun-2023',
'Jul-2023', 'Aug-2023', 'Sep-2023', 'Oct-2023', 'Nov-2023',
'Dec-2023', 'Jan-2024', 'Feb-2024', 'Mar-2024', 'Apr-2024',
'May-2024', 'Jun-2024', 'Jul-2024', 'Aug-2024', 'Sep-2024',
'Oct-2024', 'Nov-2024', 'Dec-2024'],
dtype='object')
The way we do this is using the formula:
'Dec-2021' = 'Dec-2021_x' * sum_val * 'Dec-2021_y' (these are all numeric columns)
and similarly for all the other months. There are 36 months, to be precise. Is there a way to do this in a loop over each month-year column? There are 65,000+ rows, so I do not want to overwhelm the system.
Use:
#sample data
np.random.seed(2022)
c = ['Location', 'Dec-2021_x', 'Jan-2022_x', 'Feb-2022_x', 'Mar-2022_x',
'Apr-2022_x','sum_val', 'Dec-2021_y', 'Jan-2022_y', 'Feb-2022_y',
'Mar-2022_y', 'Apr-2022_y']
df = (pd.DataFrame(np.random.randint(10, size=(5, len(c))), columns=c)
.assign(Location=list('abcde')))
print (df)
Location Dec-2021_x Jan-2022_x Feb-2022_x Mar-2022_x Apr-2022_x \
0 a 1 1 0 7 8
1 b 8 0 3 6 8
2 c 1 7 5 5 4
3 d 0 7 5 5 8
4 e 8 0 3 9 5
sum_val Dec-2021_y Jan-2022_y Feb-2022_y Mar-2022_y Apr-2022_y
0 2 8 0 5 9 1
1 0 1 2 0 5 7
2 8 2 3 1 0 4
3 2 4 0 9 4 9
4 2 1 7 2 1 7
#remove columns that are not part of the x/y pairs
df1 = df.drop(['sum_val'], axis=1)
#move columns that must be kept, but not multiplied, into the index
df1 = df1.set_index('Location')
#split column names by the last _ into a MultiIndex (month, x/y)
df1.columns = df1.columns.str.rsplit('_', n=1, expand=True)
#select the x and y DataFrames by the second level and multiply
df2 = (df1.xs('x', axis=1, level=1).mul(df['sum_val'].to_numpy(), axis= 0) *
       df1.xs('y', axis=1, level=1))
print (df2)
Dec-2021 Jan-2022 Feb-2022 Mar-2022 Apr-2022
Location
a 16 0 0 126 16
b 0 0 0 0 0
c 16 168 40 0 128
d 0 0 90 40 144
e 16 0 12 18 70
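The same product can also be sketched on the reduced sample from the question by deriving the month names from the _x columns and multiplying the underlying arrays in one step. This is only an illustration: it assumes every month has both an _x and a _y column, and the variable names months, x, y and result are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "Location": list("ABCDE"),
    "Dec-2021_x": [212, 312, 242, 215, 252],
    "Jan-2022_x": [315, 612, 712, 382, 319],
    "sum_val": [1000, 1100, 1010, 1001, 1110],
    "Dec-2021_y": [12, 13, 15, 16, 17],
    "Jan-2022_y": [13, 17, 15, 17, 18],
})

# month labels, taken from the *_x columns (assumes a matching *_y exists)
months = [c[:-2] for c in df.columns if c.endswith("_x")]

x = df[[f"{m}_x" for m in months]].to_numpy()
y = df[[f"{m}_y" for m in months]].to_numpy()
# sum_val is kept as a single column so it broadcasts across all months
result = pd.DataFrame(x * df[["sum_val"]].to_numpy() * y, columns=months)
result.insert(0, "Location", df["Location"])
print(result)
```

The multiplication is a single vectorized operation over all rows and months, so 65,000+ rows should not need an explicit Python loop over columns.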
I need some help with pandas. I'm trying to clean up CSV files, and I have three types of CSV:
Correct and expected CSV:
0   | 1   | 2   | 3   | 4
100 | 200 | 300 | 400 | 500
Type one clumped:
0   | 1   | 2       | 3   | 4
100 | 200 | 300 400 | NaN | 500
Type two clumped:
0   | 1   | 2           | 3
100 | 200 | 300 400 500 | NaN
I'm trying to correct CSV types 2 and 3 so that they look like type 1.
Code
import glob
import pandas as pd
dir = r'D:\csv_files'
file_list = glob.glob(dir +'/*.csv')
files = []
for filename in file_list:
    df = pd.read_csv(filename, header=None)
    split = df.pop(2).str.split(' ', expand=True)
    df.join(split, how='right', lsuffix = '_left', rsuffix = '_right')
    print(df)
output:
0 1 2 3 4
0 100 200 300 400 500
0 1 3 4
0 100 200 NaN 500
0 3
0 100 NaN
Goal:
0 1 2 3 4
0 100 200 300 400 500
0 1 2 3 4
0 100 200 300 400 500
0 1 2 3 4
0 100 200 300 400 500
I printed out the split and it's correct; however, I can't work out how to put it back into the main data frame.
Thanks in advance
You might find it easier to pre-parse the data using a standard Python csv.reader(). This could be used to split up any 'clumped' values and then flatten them back into a single list.
For example:
import pandas as pd
from itertools import chain
import glob
import csv
data = []
for fn in glob.glob('rate*.csv'):
    with open(fn) as f_input:
        csv_input = csv.reader(f_input)
        for row in csv_input:
            values = chain.from_iterable(value.split(' ') for value in row[2:] if value)
            data.append([row[0], row[1], *values])
df = pd.DataFrame(data, columns=range(6))
print(df)
This would give you a dataframe starting:
0 1 2 3 4 5
0 Montserrat Manzini 6 6 5 6
1 Madagascar San Juan 10 4 9 8
2 Botswana Tehran 2 10 9 10
3 Syrian Arab Republic Fairbanks 2 4 9 2
4 Guinea Punta Arenas 5 1 6 3
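To see the split-and-flatten step in isolation, here is a small sketch applied to the two clumped layouts from the question. It assumes the NaN cells are simply empty fields in the file; the helper name flatten_row and the in-memory sample text are made up for the example.

```python
import csv
import io
from itertools import chain

def flatten_row(row):
    """Hypothetical helper: split clumped cells on whitespace, drop empty cells."""
    return list(chain.from_iterable(cell.split() for cell in row if cell))

# "type one" and "type two" clumped rows, as they might appear in a file
sample = "100,200,300 400,,500\n100,200,300 400 500,\n"

rows = [flatten_row(row) for row in csv.reader(io.StringIO(sample))]
print(rows)
```

Both rows come out as five separate values, so they can be passed to pd.DataFrame in the same way as the data list above.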
I have two DataFrames, df1 and df2, that share an index. I would like to assign the values in df1 based on a value in df2.
The standard pandas code looks like this:
df1['column1'][df2['column2']==i] = j
This populates df1 correctly when run on all the inputs.
However, the same syntax on dask DataFrames returns an error:
TypeError: 'Series' object does not support item assignment
dd.where() and dd.mask() don't seem to work as they return the original value as well.
Is there a dask equivalent to the above pandas code?
To do this, use mask to compute the new column, then assign it back to column1.
To test, I used the following source DataFrames:
df1:
column1 xxx
0 1 230
1 2 160
2 3 160
3 4 190
4 5 190
5 6 260
6 7 260
7 8 260
8 9 300
df2:
column2 yyy
0 11 402
1 12 349
2 13 336
3 14 369
4 15 402
5 16 209
6 17 492
7 18 455
8 19 387
Then I set variables:
i = 15
j = 100
I created both Dask DataFrames as follows:
import dask.dataframe as dd
dd1 = dd.from_pandas(df1, chunksize=5)
dd2 = dd.from_pandas(df2, chunksize=5)
And to do the actual processing, I ran:
dd1.column1 = dd1.column1.mask(dd2['column2'] == i, j)
result = dd1.compute()
The result is:
column1 xxx
0 1 230
1 2 160
2 3 160
3 4 190
4 100 190
5 6 260
6 7 260
7 8 260
8 9 300
So the value of df1.column1 at index 4 (the row where df2.column2 == 15, i.e. i) has been set to 100 (j).
I believe that you are looking for the dask.dataframe.Series.where method. It seems to work ok for me.
https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.Series.where
In [1]: import pandas as pd
In [2]: import dask.dataframe as dd
In [3]: s = pd.Series(range(5))
In [4]: ds = dd.from_pandas(s, npartitions=2)
In [5]: ds.where(ds > 1, 10).compute()
Out[5]:
0 10
1 10
2 2
3 3
4 4
dtype: int64
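The two answers are complementary: mask replaces values where the condition is True, while where replaces them where it is False. The sketch below shows this on plain pandas Series, chosen only so that it runs without Dask installed; the answers above use the Dask methods of the same names. The sample values follow the test data used earlier, and masked and kept are illustrative names.

```python
import pandas as pd

df1 = pd.DataFrame({"column1": range(1, 10)})
df2 = pd.DataFrame({"column2": range(11, 20)})
i, j = 15, 100

# mask: replace where the condition is True
masked = df1["column1"].mask(df2["column2"] == i, j)

# where: keep where the condition is True, so the condition is inverted
kept = df1["column1"].where(df2["column2"] != i, j)

print(masked.tolist())
```

Both calls return a new Series rather than modifying the original, which is why the result has to be assigned back to the column.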
I have a dataframe called df1:
Long_ID IndexBegin IndexEnd
0 10000001 0 3
1 10000002 3 6
2 10000003 6 10
I have a second dataframe called df2, which can be up to 1 million rows long:
Short_ID
0 1
1 2
2 3
3 10
4 20
5 30
6 100
7 101
8 102
9 103
I want to link Long_ID to Short_ID in such a way that if (IndexBegin:IndexEnd) is (0:3), then Long_ID gets inserted into df2 at indexes 0 through 2 (IndexEnd - 1). The starting index and ending index are determined using the last two columns of df1.
So that ultimately, my final dataframe looks like this: df3:
Short_ID Long_ID
0 1 10000001
1 2 10000001
2 3 10000001
3 10 10000002
4 20 10000002
5 30 10000002
6 100 10000003
7 101 10000003
8 102 10000003
9 103 10000003
First, I tried storing the index of df2 as a key and Short_ID as a value in a dictionary, then iterating row by row, but that was too slow. This led me to learn about vectorization.
Then, I tried using where(), but I got "ValueError: Can only compare identically-labeled Series objects."
df2 = df2.reset_index()
df2['Long_ID'] = df1['Long_ID'] [ (df2['index'] < df1['IndexEnd']) & (df2['index'] >= df1['IndexBegin']) ]
I am relatively new to programming, and I would appreciate it if anyone could suggest a better approach to this problem. Code to reproduce the data is below:
df1_data = [(10000001, 0, 3), (10000002, 3, 6), (10000003, 6, 10)]
df1 = pd.DataFrame(df1_data, columns = ['Long_ID', 'IndexBegin', 'IndexEnd'])
df2_data = [1, 2, 3, 10, 20, 30, 100, 101, 102, 103]
df2 = pd.DataFrame(df2_data, columns = ['Short_ID'])
You do not need "IndexEnd" as long as the ranges are contiguous. You may use pd.merge_asof:
(pd.merge_asof(df2.reset_index(), df1, left_on='index', right_on='IndexBegin')
.reindex(['Short_ID', 'Long_ID'], axis=1))
Short_ID Long_ID
0 1 10000001
1 2 10000001
2 3 10000001
3 10 10000002
4 20 10000002
5 30 10000002
6 100 10000003
7 101 10000003
8 102 10000003
9 103 10000003
Here is one way using IntervalIndex:
df1.index = pd.IntervalIndex.from_arrays(left=df1.IndexBegin, right=df1.IndexEnd, closed='left')
df2['New'] = df1.loc[df2.index, 'Long_ID'].values
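A variant of the same idea that leaves the index of df1 untouched is sketched below. It builds the IntervalIndex separately and uses get_indexer to find, for each row position of df2, which interval contains it. The names intervals, pos and df3 are illustrative, and the sketch assumes the ranges do not overlap and cover every row of df2 (get_indexer returns -1 for positions outside every range, which the assert guards against).

```python
import pandas as pd

df1 = pd.DataFrame(
    [(10000001, 0, 3), (10000002, 3, 6), (10000003, 6, 10)],
    columns=["Long_ID", "IndexBegin", "IndexEnd"],
)
df2 = pd.DataFrame({"Short_ID": [1, 2, 3, 10, 20, 30, 100, 101, 102, 103]})

# half-open ranges [IndexBegin, IndexEnd), matching the question
intervals = pd.IntervalIndex.from_arrays(
    df1["IndexBegin"], df1["IndexEnd"], closed="left"
)

# position of the containing interval for each row label of df2
pos = intervals.get_indexer(df2.index)
assert (pos >= 0).all(), "some rows of df2 fall outside every range"

df3 = df2.assign(Long_ID=df1["Long_ID"].to_numpy()[pos])
print(df3)
```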
You may do:
df3 = df2.copy()
df3['Long_ID'] = df2.merge(df1, left_on=df2.index, right_on="IndexBegin", how='left').Long_ID.ffill().astype(int)
I created a function to solve your question. Hope it helps.
df = pd.read_excel('C:/Users/me/Desktop/Sovrflw_data_2.xlsx')
df
Long_ID IndexBegin IndexEnd
0 10000001 0 3
1 10000002 3 6
2 10000003 6 10
df2 = pd.read_excel('C:/Users/me/Desktop/Sovrflw_data.xlsx')
df2
Short_ID
0 1
1 2
2 3
3 10
4 20
5 30
6 100
7 101
8 102
9 103
def convert_Short_ID(df1, df2):
    df2['Long_ID'] = None
    for i in range(len(df2)):
        for j in range(len(df1)):
            # does row label i fall inside [IndexBegin, IndexEnd)?
            if (df2.index[i] >= df1.loc[j,'IndexBegin']) and (df2.index[i] < df1.loc[j,'IndexEnd']):
                df2.loc[i,'Long_ID'] = df1.loc[j, 'Long_ID']
                break
        else:
            # no range matched this row
            df2.loc[i, 'Long_ID'] = np.nan
    df2['Long_ID'] = df2['Long_ID'].astype(str)
    return df2
convert_Short_ID(df,df2)
Short_ID Long_ID
0 1 10000001
1 2 10000001
2 3 10000001
3 10 10000002
4 20 10000002
5 30 10000002
6 100 10000003
7 101 10000003
8 102 10000003
9 103 10000003
Building the data with NumPy before creating the DataFrame is a better approach, since adding elements to a DataFrame one at a time is slow. So:
import numpy as np
import pandas as pd
#Step 1: creating the first Data Frame
df1 = pd.DataFrame({'Long_ID':[10000001,10000002,10000003],
'IndexBegin':[0,3,6],
'IndexEnd':[3,6,10]})
#Step 2: creating the second chunk of data as a Numpy array
Short_ID = np.array([1,2,3,10,20,30,100,101,102,103])
#Step 3: creating a new column on df1 to count Long_ID occurrences
df1['Qt']=df1['IndexEnd']-df1['IndexBegin']
#Step 4: using append to create a Numpy Array for the Long_ID item
Long_ID = np.array([])
for i in range(len(df1)):
    Long_ID = np.append(Long_ID, [df1['Long_ID'][i]]*df1['Qt'][i])
#Finally, create the second Data Frame using both previous Numpy arrays
df2 = pd.DataFrame(np.vstack((Short_ID, Long_ID)).T, columns=['Short_ID','Long_ID'])
df2
Short_ID Long_ID
0 1.0 10000001.0
1 2.0 10000001.0
2 3.0 10000001.0
3 10.0 10000002.0
4 20.0 10000002.0
5 30.0 10000002.0
6 100.0 10000003.0
7 101.0 10000003.0
8 102.0 10000003.0
9 103.0 10000003.0
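The append loop in Step 4 can likely be replaced by a single np.repeat call, which also keeps the integer dtype instead of producing floats. This is a sketch under the same assumptions as the answer above: the ranges are contiguous, start at 0, and together cover every row of df2. The names counts and df3 are illustrative.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"Long_ID": [10000001, 10000002, 10000003],
                    "IndexBegin": [0, 3, 6],
                    "IndexEnd": [3, 6, 10]})
df2 = pd.DataFrame({"Short_ID": [1, 2, 3, 10, 20, 30, 100, 101, 102, 103]})

# number of rows covered by each range
counts = (df1["IndexEnd"] - df1["IndexBegin"]).to_numpy()

# repeat each Long_ID once per covered row; lengths must add up to len(df2)
df3 = df2.assign(Long_ID=np.repeat(df1["Long_ID"].to_numpy(), counts))
print(df3)
```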
I have a data frame with three columns, A, B and C. I have grouped it by columns A, B and C and counted the number of rows in each group.
Resulting data:
Now, I want to make 0 and 1 (cell values in column C) as columns themselves.
Also, I want to add them and display their sum in a separate column (alongside 0 and 1 columns).
Desired output:
A B Count0 Count1 Sum of Counts Count1/Sum of Counts
1000 1000 38 538 567 538/567
1000 1001 9 90 99 90/99
1000 1002 8 16 24 16/24
1000 1003 2 10 12 10/12
(I am not an active Python user. I have searched a lot on this but can’t seem to find the right words to search it) If I learn how to do the sum of counts 0 and 1 and display alongside other columns in the dataframe, I will do the division myself.
Thanks in advance.
Use SeriesGroupBy.value_counts or size with unstack:
df = pd.DataFrame({
'A': [1000] * 10,
'B': [1000] * 2 + [1001] * 3 + [1002] * 5,
'C':[0,1] * 5
})
print (df)
A B C
0 1000 1000 0
1 1000 1000 1
2 1000 1001 0
3 1000 1001 1
4 1000 1001 0
5 1000 1002 1
6 1000 1002 0
7 1000 1002 1
8 1000 1002 0
9 1000 1002 1
df = df.groupby(['A','B'])['C'].value_counts().unstack(fill_value=0).reset_index()
#another solution
#df = pd.crosstab([df['A'], df['B']], df['C']).reset_index()
#solution 2
#df = df.groupby(['A','B','C']).size().unstack(fill_value=0).reset_index()
print (df)
C A B 0 1
0 1000 1000 1 1
1 1000 1001 2 1
2 1000 1002 2 3
And then sum and divide:
df = df.rename(columns={0:'Count0',1:'Count1'})
df['Sum of Counts'] = df['Count0'] + df['Count1']
df['Count1/Sum of Counts'] = df['Count1'] / df['Sum of Counts']
print (df)
C A B Count0 Count1 Sum of Counts Count1/Sum of Counts
0 1000 1000 1 1 2 0.500000
1 1000 1001 2 1 3 0.333333
2 1000 1002 2 3 5 0.600000
Try:
df1 = df.pivot_table(values='counts', index=['A', 'B'], columns=['C'], aggfunc='sum', fill_value=None, margins=True, dropna=True, margins_name='Sum of Counts').reset_index()
df1 = df1.rename(columns={0:'Count0',1:'Count1'})
df1['Count1/Sum of Counts'] = df1['Count1'] / df1['Sum of Counts']
This assumes df already holds the grouped counts in a column named counts. You can call reset_index() to flatten the result, and Count1/Sum of Counts is then just df1['Count1'] / df1['Sum of Counts'].
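For a version that can be run end to end, the sketch below starts from an already-aggregated table with a counts column, as described in the question. The sample numbers are made up (they match the counts in the value_counts example above), and the sum column is added explicitly rather than through margins so that no extra total row appears.

```python
import pandas as pd

# hypothetical aggregated input: one row per (A, B, C) group with its row count
counts = pd.DataFrame({
    "A": [1000] * 6,
    "B": [1000, 1000, 1001, 1001, 1002, 1002],
    "C": [0, 1, 0, 1, 0, 1],
    "counts": [1, 1, 2, 1, 2, 3],
})

wide = (counts.pivot_table(index=["A", "B"], columns="C",
                           values="counts", aggfunc="sum", fill_value=0)
              .rename(columns={0: "Count0", 1: "Count1"})
              .rename_axis(columns=None)   # drop the leftover "C" label
              .reset_index())

wide["Sum of Counts"] = wide["Count0"] + wide["Count1"]
wide["Count1/Sum of Counts"] = wide["Count1"] / wide["Sum of Counts"]
print(wide)
```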