Pandas split column with inconsistent data and inconsistent columns - python

I need some help with pandas. I'm trying to clean up CSV files, and I have three types of CSV:
correct and expected csv
0    1    2    3    4
100  200  300  400  500
type one clumped
0    1    2        3    4
100  200  300 400  NaN  500
type two clumped
0    1    2            3
100  200  300 400 500  NaN
I'm trying to correct CSV types 2 and 3 so that they look like CSV type 1.
Code
import glob
import pandas as pd
dir = r'D:\csv_files'
file_list = glob.glob(dir +'/*.csv')
files = []
for filename in file_list:
    df = pd.read_csv(filename, header=None)
    split = df.pop(2).str.split(' ', expand=True)
    df.join(split, how='right', lsuffix = '_left', rsuffix = '_right')
    print(df)
output:
0 1 2 3 4
0 100 200 300 400 500
0 1 3 4
0 100 200 NaN 500
0 3
0 100 NaN
Goal:
0 1 2 3 4
0 100 200 300 400 500
0 1 2 3 4
0 100 200 300 400 500
0 1 2 3 4
0 100 200 300 400 500
I printed out the split and it is correct. However, I can't work out how to put it back into the main data frame.
Thanks in advance.

You might find it easier to pre-parse the data using a standard Python csv.reader(). This could be used to split up any 'clumped' values and then flatten them back into a single list.
For example:
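import pandas as pd
from itertools import chain
import glob
import csv
data = []
for fn in glob.glob('rate*.csv'):
    # newline='' is what the csv module documentation recommends for files passed to csv.reader
    with open(fn, newline='') as f_input:
        csv_input = csv.reader(f_input)
        for row in csv_input:
            # split any clumped values on spaces, skip empty cells, and flatten into one sequence
            values = chain.from_iterable(value.split(' ') for value in row[2:] if value)
            data.append([row[0], row[1], *values])
df = pd.DataFrame(data, columns=range(6))
print(df)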
This would give you a dataframe starting:
0 1 2 3 4 5
0 Montserrat Manzini 6 6 5 6
1 Madagascar San Juan 10 4 9 8
2 Botswana Tehran 2 10 9 10
3 Syrian Arab Republic Fairbanks 2 4 9 2
4 Guinea Punta Arenas 5 1 6 3
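The same pre-parsing idea can be sketched against the three CSV shapes from the question. This is only an illustration: the in-memory strings below are hypothetical stand-ins for the real files, and it assumes the NaN cells are simply empty fields in the CSV. The helper name flatten_row is made up for this example.

```python
import csv
import io
from itertools import chain

import pandas as pd


def flatten_row(row):
    """Split space-clumped cells, drop empty cells, and keep the original order."""
    return list(chain.from_iterable(cell.split() for cell in row if cell.strip()))


# Hypothetical stand-ins for the three file types described in the question.
samples = {
    "correct": "100,200,300,400,500\n",
    "type_one": "100,200,300 400,,500\n",
    "type_two": "100,200,300 400 500,\n",
}

frames = {}
for name, text in samples.items():
    rows = [flatten_row(row) for row in csv.reader(io.StringIO(text))]
    # Every cleaned row now has five values, so the columns come out as 0..4.
    frames[name] = pd.DataFrame(rows).astype(int)

for name, frame in frames.items():
    print(name)
    print(frame)
```

With real files, io.StringIO(text) would be replaced by an open file handle, as in the answer above.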

Related

pandas dynamic wide to long based on time

I have a pandas dataframe that contains the data given below:
ID  Q1_rev  Q1_transcnt  Q2_rev  Q2_transcnt  Q3_rev  Q3_transcnt  Q4_rev  Q4_transcnt
1   100     2            200     4            300     6            400     8
2   101     3            201     5            301     7            401     9
I would like to do the following:
a) For each ID, create 3 rows (from the 8 input columns).
b) Each row should contain the data for two consecutive quarters.
c) Each subsequent row should shift the columns by one quarter.
To illustrate: for each ID, the first row would hold the Q1 and Q2 values, the second row the Q2 and Q3 values, and the third row the Q3 and Q4 values.
I tried the code below, based on another SO post, but was unable to get the expected output:
s = 3
n = 2
cols = ['1st_rev','1st_transcnt','2nd_rev','2nd_transcnt']
output = pd.concat((df.iloc[:,0+i*s:6+i*s].set_axis(cols, axis=1) for i in range(int((df.shape[1]-(s*n))/n))), ignore_index=True, axis=0).set_index(np.tile(df.index,2))
Can you help me with this? The problem is that with the real data, n=2 will not always be the case. It could be 4 or 5 as well. That is, instead of '1st_rev','1st_transcnt','2nd_rev','2nd_transcnt', I may have the columns below, with 4 pairs:
'1st_rev','1st_transcnt','2nd_rev','2nd_transcnt','3rd_rev','3rd_transcnt','4th_rev','4th_transcnt'
Use a custom function with DataFrame.groupby, grouping the columns by the part of each column name after the underscore, selected with x.split('_')[1]:
N = 2
df1 = df.set_index('ID')
def f(x,n=N):
    # for each row, collect every run of n consecutive values
    out = np.array([[list(L[x:x+n]) for x in range(len(L)-n+1)] for L in x.to_numpy()])
    return pd.DataFrame(np.vstack(out))
# note: groupby(..., axis=1) is deprecated since pandas 2.1
df2 = (df1.groupby(lambda x: x.split('_')[1], axis=1, sort=False)
          .apply(f)
          .sort_index(axis=1, level=1, sort_remaining=False))
df2.index = np.repeat(df1.index, int(len(df2.index) / len(df1.index)))
df2.columns = df2.columns.map(lambda x: f'{x[1] + 1}_{x[0]}')
print (df2)
1_rev 1_transcnt 2_rev 2_transcnt
ID
1 100 2 200 4
1 200 4 300 6
1 300 6 400 8
2 101 3 201 5
2 201 5 301 7
2 301 7 401 9
Test with a window of 3:
N = 3
df1 = df.set_index('ID')
def f(x,n=N):
    out = np.array([[list(L[x:x+n]) for x in range(len(L)-n+1)] for L in x.to_numpy()])
    return pd.DataFrame(np.vstack(out))
df2 = (df1.groupby(lambda x: x.split('_')[1], axis=1, sort=False)
          .apply(f)
          .sort_index(axis=1, level=1, sort_remaining=False))
df2.index = np.repeat(df1.index, int(len(df2.index) / len(df1.index)))
df2.columns = df2.columns.map(lambda x: f'{x[1] + 1}_{x[0]}')
print (df2)
1_rev 1_transcnt 2_rev 2_transcnt 3_rev 3_transcnt
ID
1 100 2 200 4 300 6
1 200 4 300 6 400 8
2 101 3 201 5 301 7
2 201 5 301 7 401 9
One option is with a for loop or list comprehension, followed by a concatenation, and a sort:
temp = df.set_index('ID')
cols = ['1st_rev','1st_transcnt','2nd_rev','2nd_transcnt']
outcome = [temp
           .iloc(axis=1)[n:n+4]
           .set_axis(cols, axis = 1)
           for n in range(0, len(cols)+2, 2)]
pd.concat(outcome).sort_index()
1st_rev 1st_transcnt 2nd_rev 2nd_transcnt
ID
1 100 2 200 4
1 200 4 300 6
1 300 6 400 8
2 101 3 201 5
2 201 5 301 7
2 301 7 401 9
To make it more generic, a while loop can be used (you can use a for loop - a while loop seems more readable/easier to understand):
def reshape_N(df, N):
    # you can pass your custom column names here instead
    # as long as it matches the width
    # of the dataframe
    columns = ['rev', 'transcnt']
    columns = np.tile(columns, N)
    numbers = np.arange(1, N+1).repeat(2)
    columns = [f"{n}_{ent}"
               for n, ent
               in zip(numbers, columns)]
    contents = []
    start = 0
    end = N * 2
    temp = df.set_index("ID")
    # slide a window of N quarters (2 columns each) forward one quarter at a time
    while end <= temp.columns.size:
        frame = temp.iloc[:, start:end].set_axis(columns, axis=1)
        contents.append(frame)
        start += 2
        end += 2
    if not contents:
        return df
    return pd.concat(contents).sort_index()
Let's apply the function:
reshape_N(df, 2)
1_rev 1_transcnt 2_rev 2_transcnt
ID
1 100 2 200 4
1 200 4 300 6
1 300 6 400 8
2 101 3 201 5
2 201 5 301 7
2 301 7 401 9
reshape_N(df, 3)
1_rev 1_transcnt 2_rev 2_transcnt 3_rev 3_transcnt
ID
1 100 2 200 4 300 6
1 200 4 300 6 400 8
2 101 3 201 5 301 7
2 201 5 301 7 401 9
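For comparison, the same windows can also be taken directly on the underlying array. The following is a minimal sketch, not part of either answer above: it assumes NumPy 1.20 or later (for sliding_window_view), that every quarter contributes exactly two columns in the order rev, transcnt, and that the frame has an ID column. The function name sliding_quarters and its parameters are made up for this example.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

df = pd.DataFrame({
    "ID": [1, 2],
    "Q1_rev": [100, 101], "Q1_transcnt": [2, 3],
    "Q2_rev": [200, 201], "Q2_transcnt": [4, 5],
    "Q3_rev": [300, 301], "Q3_transcnt": [6, 7],
    "Q4_rev": [400, 401], "Q4_transcnt": [8, 9],
})


def sliding_quarters(df, n, per_quarter=2):
    """Return one row per window of n consecutive quarters for every ID."""
    wide = df.set_index("ID")
    width = n * per_quarter
    # All runs of `width` adjacent columns; keep only those that start on a quarter boundary.
    windows = sliding_window_view(wide.to_numpy(), width, axis=1)[:, ::per_quarter, :]
    n_windows = windows.shape[1]
    names = [f"{i}_{s}" for i in range(1, n + 1) for s in ("rev", "transcnt")]
    return pd.DataFrame(
        windows.reshape(-1, width),          # rows ordered by ID, then by window
        index=wide.index.repeat(n_windows),  # repeat each ID once per window
        columns=names,
    )


print(sliding_quarters(df, 2))
print(sliding_quarters(df, 3))
```

Because the rows come out already grouped by ID, no sort is needed afterwards. If n is larger than the number of quarters available, sliding_window_view raises a ValueError rather than returning an empty result.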

How to calculate the mean of rows with rows having same content from Columns A to C in Excel using python? [duplicate]

I have a DataFrame
>>> df = pd.DataFrame({'a':[1,1,1,2,2,2],
... 'b':[10,20,20,10,20,20],
... 'result':[100,200,300,400,500,600]})
...
>>> df
a b result
0 1 10 100
1 1 20 200
2 1 20 300
3 2 10 400
4 2 20 500
5 2 20 600
and want to create a new column that is the average result for the corresponding values for 'a' and 'b'. I can get those values with a groupby:
>>> df.groupby(['a','b'])['result'].mean()
a  b
1  10    100
   20    250
2  10    400
   20    550
Name: result, dtype: int64
but can not figure out how to turn that into a new column in the original DataFrame. The final result should look like this,
>>> df
a b result avg_result
0 1 10 100 100
1 1 20 200 250
2 1 20 300 250
3 2 10 400 400
4 2 20 500 550
5 2 20 600 550
I could do this by looping through the combinations of 'a' and 'b' but that would get really slow and unwieldy for larger sets of data. There is probably a much simpler and faster way to go.
You need transform:
df['avg_result'] = df.groupby(['a', 'b'])['result'].transform('mean')
This generates a correctly indexed column of the groupby values for you:
a b result avg_result
0 1 10 100 100
1 1 20 200 250
2 1 20 300 250
3 2 10 400 400
4 2 20 500 550
5 2 20 600 550
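As a quick check of why transform fits here, the sketch below compares it with the more manual route of aggregating and merging back. It reuses the example frame from the question; the variable names via_transform, means and via_merge are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 1, 2, 2, 2],
                   'b': [10, 20, 20, 10, 20, 20],
                   'result': [100, 200, 300, 400, 500, 600]})

# transform returns one value per original row, aligned on the original index.
via_transform = df.groupby(['a', 'b'])['result'].transform('mean')

# The manual equivalent: aggregate to one row per group, then merge back onto df.
means = (df.groupby(['a', 'b'], as_index=False)['result']
           .mean()
           .rename(columns={'result': 'avg_result'}))
via_merge = df.merge(means, on=['a', 'b'], how='left')['avg_result']

print(via_transform.tolist())
print(via_merge.tolist())
```

Both routes give the same per-row averages; transform simply skips the intermediate aggregated frame and the merge.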
Since the previous answer (https://stackoverflow.com/a/33445035/6504287) is pandas based, I'm adding a PySpark based solution.
In PySpark it is better to use a Window function, as in the snippet below (ext_data_df is assumed to be a Spark DataFrame holding the same data):
from pyspark.sql import Window
from pyspark.sql.functions import avg
windowSpecAgg = Window.partitionBy('a', 'b')
ext_data_df.withColumn('avg_result', avg('result').over(windowSpecAgg)).show()
The above code uses the example from the previously provided solution (https://stackoverflow.com/a/33445035/6504287).

Create a column based on computation of another column

I would like to create another column based on the sales for the previous week. Here is the sample input:
df = pd.DataFrame({'Week':[1,1,2,2,3,3,4,4,5,5,1,1,2,2,3,3,4,4,5,5],
'Category':['Red','White','Red','White','Red','White','Red','White','Red','White','Red','White','Red','White','Red','White','Red','White','Red','White'],
'id':[1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2],
'Sales':[100,200,300,400,100,200,300,400,100,200,100,200,300,400,100,200,300,400,100,200],
'Sales_others':[10,20,30,40,10,20,30,40,10,20,10,20,30,40,10,20,30,40,10,20]})
print(df)
Based on this, I would like to create another column that holds the sales of the previous week. Here is a sample of the desired output:
df_output = pd.DataFrame({'Week':[1,1,2,2,3,3,4,4,5,5,1,1,2,2,3,3,4,4,5,5],
'Category':['Red','White','Red','White','Red','White','Red','White','Red','White','Red','White','Red','White','Red','White','Red','White','Red','White'],
'id':[1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2],
'Sales':[100,200,300,400,100,200,300,400,100,200,100,200,300,400,100,200,300,400,100,200],
'Sales_others':[10,20,30,40,10,20,30,40,10,20,10,20,30,40,10,20,30,40,10,20],
'Sales_previous_week':[0,0,100,200,300,400,100,200,300,400,0,0,100,200,300,400,100,200,300,400]})
print(df_output)
I am finding it hard to create what would be a self join. The previous week's value should be derived only from the Sales column, and I should be able to retain the "Sales_others" column.
--Edit
Adding original code
CR_UK_NL_Weeklevel['PREVIOUS_WEEK'] = CR_UK_NL_Weeklevel.groupby(['RETAIL_SITE_ID','CATEGORY_NAME'])['CURRENT_WEEK'].shift(fill_value=0)
print(CR_UK_NL_Weeklevel)
Renaming columns
CR_UK_NL_Weeklevel.columns.values[4] = 'CURRENT_WEEK'
CR_UK_NL_Weeklevel.columns.values[3] = 'LAST_YEAR_WEEK'
CR_UK_NL_Weeklevel.columns.values
Trying to implement solution:
CR_UK_NL_Weeklevel['PREVIOUS_WEEK'] = CR_UK_NL_Weeklevel.groupby(['RETAIL_SITE_ID','CATEGORY_NAME'])['CURRENT_WEEK'].shift(fill_value=0)
print(CR_UK_NL_Weeklevel)
--Error
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
in
----> 1 CR_UK_NL_Weeklevel['PREVIOUS_WEEK'] = CR_UK_NL_Weeklevel.groupby(['RETAIL_SITE_ID','CATEGORY_NAME'])['CURRENT_WEEK'].shift(fill_value=0)
2 print(CR_UK_NL_Weeklevel)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\base.py in getitem(self, key)
273 else:
274 if key not in self.obj:
--> 275 raise KeyError("Column not found: {key}".format(key=key))
276 return self._gotitem(key, ndim=1)
277
KeyError: 'Column not found: CURRENT_WEEK'
If every week has the same categories and the weeks are consecutive, use DataFrameGroupBy.shift, grouping by the Category column:
df['Sales_PREVIOUS'] = df.groupby('Category')['Sales'].shift(fill_value=0)
print (df)
Week Category Sales Sales_PREVIOUS
0 1 Red 100 0
1 1 White 200 0
2 2 Red 300 100
3 2 White 400 200
4 3 Red 100 300
5 3 White 200 400
6 4 Red 300 100
7 4 White 400 200
8 5 Red 100 300
9 5 White 200 400
Another idea is to pivot with DataFrame.pivot, shift with DataFrame.shift, turn the result back into a Series with DataFrame.stack, and finally add the new column with DataFrame.join:
# pivot takes keyword-only arguments from pandas 2.0 onwards
s = df.pivot(index='Week', columns='Category', values='Sales').shift(fill_value=0).stack()
df = df.join(s.rename('Sales_PREVIOUS WEEK'), on=['Week','Category'])
EDIT:
With the new data, add the id column to the grouping:
df['Sales_PREVIOUS'] = df.groupby(['id','Category'])['Sales'].shift(fill_value=0)
And for second solution:
s = df.set_index(['Week','id','Category'])['Sales'].unstack([1,2]).shift(fill_value=0).unstack()
df = df.join(s.rename('Sales_PREVIOUS WEEK'), on=['id','Category','Week'])
print (df)
Week Category id Sales Sales_others Sales_PREVIOUS WEEK
0 1 Red 1 100 10 0
1 1 White 1 200 20 0
2 2 Red 1 300 30 100
3 2 White 1 400 40 200
4 3 Red 1 100 10 300
5 3 White 1 200 20 400
6 4 Red 1 300 30 100
7 4 White 1 400 40 200
8 5 Red 1 100 10 300
9 5 White 1 200 20 400
10 1 Red 2 100 10 0
11 1 White 2 200 20 0
12 2 Red 2 300 30 100
13 2 White 2 400 40 200
14 3 Red 2 100 10 300
15 3 White 2 200 20 400
16 4 Red 2 300 30 100
17 4 White 2 400 40 200
18 5 Red 2 100 10 300
19 5 White 2 200 20 400
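The shift-based solution depends on the rows of each group already being in week order. A minimal sketch that makes this explicit, using the question's sample data, is shown below. The explicit sort is an added precaution that is not in the answer above, and the variable name ordered is only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Week': [1, 1, 2, 2, 3, 3, 4, 4, 5, 5] * 2,
    'Category': ['Red', 'White'] * 10,
    'id': [1] * 10 + [2] * 10,
    'Sales': [100, 200, 300, 400, 100, 200, 300, 400, 100, 200] * 2,
    'Sales_others': [10, 20, 30, 40, 10, 20, 30, 40, 10, 20] * 2,
})

# shift works on row order, so put each (id, Category) group in week order first.
ordered = df.sort_values(['id', 'Category', 'Week'])

# The result keeps the original index labels, so assigning it aligns back onto df.
df['Sales_previous_week'] = (ordered.groupby(['id', 'Category'])['Sales']
                                    .shift(fill_value=0))
print(df)
```

Note that this still treats the previous row of a group as the previous week, so a missing week would not be detected.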
EDIT:
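The problem is with the column names: writing into columns.values in place does not reliably rename the columns, so the 'CURRENT_WEEK' lookup fails. Build a new list of names and assign it back instead: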
cols = CR_UK_NL_Weeklevel.columns.tolist()
cols[4] = 'CURRENT_WEEK'
cols[3] = 'LAST_YEAR_WEEK'
CR_UK_NL_Weeklevel.columns = cols
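An alternative sketch for the same positional rename uses DataFrame.rename, which returns a new frame instead of touching the existing column index. The small frame and its column names a to e are hypothetical placeholders for the real CR_UK_NL_Weeklevel data.

```python
import pandas as pd

# Hypothetical stand-in for the real frame; only the column positions matter here.
frame = pd.DataFrame([[1, 2, 3, 4, 5]], columns=['a', 'b', 'c', 'd', 'e'])

# Map the current names at positions 3 and 4 to the new names.
renamed = frame.rename(columns={
    frame.columns[3]: 'LAST_YEAR_WEEK',
    frame.columns[4]: 'CURRENT_WEEK',
})

print(renamed.columns.tolist())
# Column lookups now work as expected.
print(renamed['CURRENT_WEEK'].tolist())
```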

How to add dataframe column data to a range of indexes in another dataframe?

I have a dataframe called df1:
Long_ID IndexBegin IndexEnd
0 10000001 0 3
1 10000002 3 6
2 10000003 6 10
I have a second dataframe called df2, which can be up to 1 million rows long:
Short_ID
0 1
1 2
2 3
3 10
4 20
5 30
6 100
7 101
8 102
9 103
I want to link Long_ID to Short_ID in such a way that if (IndexBegin:IndexEnd) is (0:3), then Long_ID gets inserted into df2 at indexes 0 through 2 (IndexEnd - 1). The starting index and ending index are determined using the last two columns of df1.
So that ultimately, my final dataframe looks like this: df3:
Short_ID Long_ID
0 1 10000001
1 2 10000001
2 3 10000001
3 10 10000002
4 20 10000002
5 30 10000002
6 100 10000003
7 101 10000003
8 102 10000003
9 103 10000003
First, I tried storing the index of df2 as a key and Short_ID as a value in a dictionary, then iterating row by row, but that was too slow. This led me to learn about vectorization.
Then, I tried using where(), but I got "ValueError: Can only compare identically-labeled Series objects."
df2 = df2.reset_index()
df2['Long_ID'] = df1['Long_ID'] [ (df2['index'] < df1['IndexEnd']) & (df2['index'] >= df1['IndexBegin']) ]
I am relatively new to programming, and I appreciate if anyone can give a better approach to solving this problem. I have reproduced the code below:
df1_data = [(10000001, 0, 3), (10000002, 3, 6), (10000003, 6, 10)]
df1 = pd.DataFrame(df1_data, columns = ['Long_ID', 'IndexBegin', 'IndexEnd'])
df2_data = [1, 2, 3, 10, 20, 30, 100, 101, 102, 103]
df2 = pd.DataFrame(df2_data, columns = ['Short_ID'])
You do not need "IndexEnd" as long as the ranges are contiguous. You may use pd.merge_asof:
(pd.merge_asof(df2.reset_index(), df1, left_on='index', right_on='IndexBegin')
   .reindex(['Short_ID', 'Long_ID'], axis=1))
Short_ID Long_ID
0 1 10000001
1 2 10000001
2 3 10000001
3 10 10000002
4 20 10000002
5 30 10000002
6 100 10000003
7 101 10000003
8 102 10000003
9 103 10000003
Here is one way using IntervalIndex
df1.index=pd.IntervalIndex.from_arrays(left=df1.IndexBegin,right=df1.IndexEnd,closed='left')
df2['New']=df1.loc[df2.index,'Long_ID'].values
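A slightly more defensive variant of the interval idea is sketched below. It assumes the ranges in df1 do not overlap, and it uses IntervalIndex.get_indexer so that rows falling outside every range can be detected (they come back as -1). The variable names intervals and positions are only for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'Long_ID': [10000001, 10000002, 10000003],
                    'IndexBegin': [0, 3, 6],
                    'IndexEnd': [3, 6, 10]})
df2 = pd.DataFrame({'Short_ID': [1, 2, 3, 10, 20, 30, 100, 101, 102, 103]})

# Half-open intervals [IndexBegin, IndexEnd), matching the question's description.
intervals = pd.IntervalIndex.from_arrays(df1['IndexBegin'], df1['IndexEnd'], closed='left')

# For each row label of df2, the position of the interval that contains it (-1 if none).
positions = intervals.get_indexer(df2.index)
if (positions == -1).any():
    raise ValueError('some df2 rows are not covered by any range in df1')

df2['Long_ID'] = df1['Long_ID'].to_numpy()[positions]
print(df2)
```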
You may do:
df3 = df2.copy()
df3['long_ID'] = df2.merge(df1, left_on =df2.index,right_on = "IndexBegin", how = 'left').Long_ID.ffill().astype(int)
I created a function to solve your question. Hope it helps.
df = pd.read_excel('C:/Users/me/Desktop/Sovrflw_data_2.xlsx')
df
Long_ID IndexBegin IndexEnd
0 10000001 0 3
1 10000002 3 6
2 10000003 6 10
df2 = pd.read_excel('C:/Users/me/Desktop/Sovrflw_data.xlsx')
df2
Short_ID
0 1
1 2
2 3
3 10
4 20
5 30
6 100
7 101
8 102
9 103
def convert_Short_ID(df1,df2):
    df2['Long_ID'] = None
    for i in range(len(df2)):
        for j in range(len(df1)):
            # assign the Long_ID whose [IndexBegin, IndexEnd) range contains this row's index
            if (df2.index[i] >= df1.loc[j,'IndexBegin']) and (df2.index[i] < df1.loc[j,'IndexEnd']):
                df2.loc[i,'Long_ID'] = df1.loc[j, 'Long_ID']
                break
            else:
                df2.loc[i, 'Long_ID'] = np.nan
    df2['Long_ID'] = df2['Long_ID'].astype(str)
    return df2
convert_Short_ID(df,df2)
Short_ID Long_ID
0 1 10000001
1 2 10000001
2 3 10000001
3 10 10000002
4 20 10000002
5 30 10000002
6 100 10000003
7 101 10000003
8 102 10000003
9 103 10000003
Using Numpy to create the data before creating a Data Frame is a better approach since adding elements to a Data Frame is time-consuming. So:
import numpy as np
import pandas as pd
# Step 1: create the first Data Frame
df1 = pd.DataFrame({'Long_ID':[10000001,10000002,10000003],
                    'IndexBegin':[0,3,6],
                    'IndexEnd':[3,6,10]})
# Step 2: create the second chunk of data as a Numpy array
Short_ID = np.array([1,2,3,10,20,30,100,101,102,103])
# Step 3: add a column to df1 that counts the occurrences of each Long_ID
df1['Qt']=df1['IndexEnd']-df1['IndexBegin']
# Step 4: use append to build a Numpy array of Long_ID values
Long_ID = np.array([])
for i in range(len(df1)):
    Long_ID = np.append(Long_ID, [df1['Long_ID'][i]]*df1['Qt'][i])
# Finally, create the second Data Frame from both Numpy arrays
df2 = pd.DataFrame(np.vstack((Short_ID, Long_ID)).T, columns=['Short_ID','Long_ID'])
df2
Short_ID Long_ID
0 1.0 10000001.0
1 2.0 10000001.0
2 3.0 10000001.0
3 10.0 10000002.0
4 20.0 10000002.0
5 30.0 10000002.0
6 100.0 10000003.0
7 101.0 10000003.0
8 102.0 10000003.0
9 103.0 10000003.0
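The append loop can also be replaced by a single np.repeat call, which keeps the integer dtype. This is a sketch under the assumption that the ranges in df1 are contiguous, start at 0, and together cover every row of df2; the variable name counts is only for illustration.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'Long_ID': [10000001, 10000002, 10000003],
                    'IndexBegin': [0, 3, 6],
                    'IndexEnd': [3, 6, 10]})
df2 = pd.DataFrame({'Short_ID': [1, 2, 3, 10, 20, 30, 100, 101, 102, 103]})

# Number of df2 rows covered by each Long_ID.
counts = (df1['IndexEnd'] - df1['IndexBegin']).to_numpy()

# Guard the assumption that the ranges cover df2 exactly.
if counts.sum() != len(df2):
    raise ValueError('ranges in df1 do not cover df2 exactly')

# Repeat each Long_ID once per covered row.
df2['Long_ID'] = np.repeat(df1['Long_ID'].to_numpy(), counts)
print(df2)
```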
