Unpivot in pandas using a column that has multiple value columns - python

So I have a dataframe like this:
index  type_of_product  dt_of_product  value_of_product  size_of_product
1      A                01/02/22       23.1              1
1      B                01/03/22       23.2              2
1      C                01/04/22       23.3              2
And I need to unpivot the column type_of_product with the values of dt_of_product, value_of_product and size_of_product.
I tried to use
pd.pivot(df, index="index", columns="type_of_product", values=["dt_of_product", "value_of_product", "size_of_product"])
and want to get this desired output:
index  A_dt_of_product  B_dt_of_product  C_dt_of_product  A_value_of_product  B_value_of_product  C_value_of_product  A_size_of_product  B_size_of_product  C_size_of_product
1      01/02/22         01/03/22         01/04/22         23.1                23.2                23.3                1                  2                  2
Is there a way to do this in pandas with one pivot, or do I have to do 3 pivots and merge them all?

You can do:
df = df.pivot(index='index',
              values=['dt_of_product', 'value_of_product', 'size_of_product'],
              columns=['type_of_product'])
df.columns = df.columns.swaplevel(0).map('_'.join)
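For reference, a self-contained sketch with the sample data from the question (the frame construction here is added for illustration):

import pandas as pd

# Rebuild the sample frame from the question
df = pd.DataFrame({
    'index': [1, 1, 1],
    'type_of_product': ['A', 'B', 'C'],
    'dt_of_product': ['01/02/22', '01/03/22', '01/04/22'],
    'value_of_product': [23.1, 23.2, 23.3],
    'size_of_product': [1, 2, 2],
})

out = df.pivot(index='index',
               columns='type_of_product',
               values=['dt_of_product', 'value_of_product', 'size_of_product'])
# Columns are now (value_column, type) tuples; swap the levels and join them
out.columns = out.columns.swaplevel(0).map('_'.join)
print(out.reset_index())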

Try:
df = (
    df.set_index(["index", "type_of_product"])
    .unstack(level=1)
    .swaplevel(axis=1)
)
df.columns = map("_".join, df.columns)
print(df.reset_index().to_markdown(index=False))
Prints:
| index | A_dt_of_product | B_dt_of_product | C_dt_of_product | A_value_of_product | B_value_of_product | C_value_of_product | A_size_of_product | B_size_of_product | C_size_of_product |
|------:|:----------------|:----------------|:----------------|-------------------:|-------------------:|-------------------:|------------------:|------------------:|------------------:|
|     1 | 01/02/22        | 01/03/22        | 01/04/22        |               23.1 |               23.2 |               23.3 |                 1 |                 2 |                 2 |

You could try set_index with unstack:
s = df.set_index(['index','type_of_product']).unstack().sort_index(level=1,axis=1)
s.columns = s.columns.map('{0[1]}_{0[0]}'.format)
s
Out[30]:
      A_dt_of_product A_size_of_product ... C_size_of_product C_value_of_product
index                                   ...
1            01/02/22                 1 ...                 2               23.3

[1 rows x 9 columns]
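As a detail, the '{0[1]}_{0[0]}'.format trick formats each MultiIndex tuple with its levels swapped; a minimal sketch:

# Each column label is a tuple like ('dt_of_product', 'A');
# '{0[1]}_{0[0]}'.format renders it as 'A_dt_of_product'.
fmt = '{0[1]}_{0[0]}'.format
print(fmt(('dt_of_product', 'A')))  # A_dt_of_product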

Related

Merge two data frames

I tried to merge two data frames by adding the first line of the second df to the first line of the first df. I also tried to concatenate them, but both failed.
The format of the data is:
1,3,N0128,Durchm.,5.0,0.1,5.0760000000000005,0.076,-----****--
2,0.000,,,,,,,
3,3,N0129,Position,62.2,0.376,62.238,0.136,***---
4,76.1,-36.000,0.300,-36.057,,,,
5,2,N0130,Durchm.,5.0,0.1,5.067,0.067,-----***---
6,0.000,,,,,,,
The expected format of the output should be
1,3,N0128,Durchm.,5.0,0.1,5.0760000000000005,0.076,-----****--,0.000,,,,,,,
2,3,N0129,Position,62.2,0.376,62.238,0.136,***---**,76.1,-36.000,0.300,-36.057,,,,
3,N0130,Durchm.,5.0,0.1,5.067,0.067,-----***---,0.000,,,,,,,
I already split the dataframe from above into two frames. The first one contains only the odd indexes and the second one the even ones.
My problem now is to merge/concatenate the two frames by adding the first row of the second df to the first row of the first df. I already tried some methods of merging/concatenating, but all of them failed. The print functions are not necessary; I only use them to have a quick overview in the console.
The code which I felt most comfortable with is:
os.chdir(output)
csv_files = os.listdir('.')
for csv_file in csv_files:
    if csv_file.endswith(".asc.csv"):
        df = pd.read_csv(csv_file)
        keep_col = ['Messpunkt', 'Zeichnungspunkt', 'Eigenschaft', 'Position', 'Sollmass', 'Toleranz', 'Abweichung', 'Lage']
        new_df = df[keep_col]
        new_df = new_df[~new_df['Messpunkt'].isin(['**Teil'])]
        new_df = new_df[~new_df['Messpunkt'].isin(['**KS-Oben'])]
        new_df = new_df[~new_df['Messpunkt'].isin(['**KS-Unten'])]
        new_df = new_df[~new_df['Messpunkt'].isin(['**N'])]
        print(new_df)
        new_df.to_csv(output + csv_file)
        df1 = new_df[new_df.index % 2 == 1]
        df2 = new_df[new_df.index % 2 == 0]
        df1.reset_index()
        df2.reset_index()
        print(df1)
        print(df2)
        merge_df = pd.concat([df1, df2], axis=1)
        print(merge_df)
        merge_df.to_csv(output + csv_file)
I highly appreciate some help.
With this code, the output is:
1,3,N0128,Durchm.,5.0,0.1,5.0760000000000005,0.076,-----****--,,,,,,,,
2,,,,,,,,,0.000,,,,,,,
3,3,N0129,Position,62.2,0.376,62.238,0.136,***---,,,,,,,,
4,,,,,,,,,76.1,-36.000,0.300,-36.057,,,,
5,2,N0130,Durchm.,5.0,0.1,5.067,0.067,-----***---,,,,,,,,
6,,,,,,,,,0.000,,,,,,,
I get the expected result when I use reset_index() to give both DataFrames the same index.
It may also need drop=True to skip adding the old index as a new column:
pd.concat([df1.reset_index(drop=True), df2.reset_index(drop=True)], axis=1)
Minimal working example; I use io only to simulate a file in memory.
text = '''1,3,N0128,Durchm.,5.0,0.1,5.0760000000000005,0.076,-----****--
2,0.000,,,,,,,
3,3,N0129,Position,62.2,0.376,62.238,0.136,***---
4,76.1,-36.000,0.300,-36.057,,,,
5,2,N0130,Durchm.,5.0,0.1,5.067,0.067,-----***---
6,0.000,,,,,,,'''
import pandas as pd
import io
pd.options.display.max_columns = 20 # to display all columns
df = pd.read_csv(io.StringIO(text), header=None, index_col=0)
#print(df)
df1 = df[df.index % 2 == 1] # .reset_index(drop=True)
df2 = df[df.index % 2 == 0] # .reset_index(drop=True)
#print(df1)
#print(df2)
merge_df = pd.concat([df1.reset_index(drop=True), df2.reset_index(drop=True)], axis=1)
print(merge_df)
Result:
     1      2         3     4      5       6      7            8     1        2      3        4   5   6   7   8
0  3.0  N0128   Durchm.   5.0  0.100   5.076  0.076  -----****--   0.0      NaN    NaN      NaN NaN NaN NaN NaN
1  3.0  N0129  Position  62.2  0.376  62.238  0.136       ***---  76.1  -36.000  0.300  -36.057 NaN NaN NaN NaN
2  2.0  N0130   Durchm.   5.0  0.100   5.067  0.067  -----***---   0.0      NaN    NaN      NaN NaN NaN NaN NaN
EDIT:
It may need
merge_df.index = merge_df.index + 1
to correct the index.

evaluate every cell and return column head if not null pandas df

I have a pandas df of 233 rows * 234 columns, and I need to evaluate every cell and return the corresponding column header if the value is not NaN. So far I wrote the following:
#First get a list of all column names (except column 0):
col_list = []
for column in df.columns[1:]:
    col_list.append(column)
#Then I try to iterate through every cell and evaluate for null.
#Also a counter is initiated to take the next col_name from col_list
#when count reaches 233.
for index, row in df.iterrows():
    count = 0
    for x in row[1:]:
        count = count + 1
        for col_name in col_list:
            if count >= 233:
                break
            elif str(x) != 'nan':
                print(col_name)
The code does not do exactly that. What do I need to change to get the code to break after 233 rows and go to the next col_name?
Example:
   Col_1  Col_2  Col_3
1    nan     13    nan
2     10    nan    nan
3    nan      2      5
4    nan    nan      4
output:
1    Col_2
2    Col_1
3    Col_2
4    Col_3
5    Col_3
I think you need stack (if the first column is the index): it removes all NaNs. Then get the values from the second level of the MultiIndex, either by reset_index and selecting, or with the Series constructor and Index.get_level_values:
s = df.stack().reset_index()['level_1'].rename('a')
print (s)
0    Col_2
1    Col_1
2    Col_2
3    Col_3
4    Col_3
Name: a, dtype: object
Or:
s = pd.Series(df.stack().index.get_level_values(1))
print (s)
0    Col_2
1    Col_1
2    Col_2
3    Col_3
4    Col_3
dtype: object
If need output as list:
L = df.stack().index.get_level_values(1).tolist()
print (L)
['Col_2', 'Col_1', 'Col_2', 'Col_3', 'Col_3']
Detail:
print (df.stack())
1  Col_2    13.0
2  Col_1    10.0
3  Col_2     2.0
   Col_3     5.0
4  Col_3     4.0
dtype: float64
I'd use jezrael's stack solution. However, if you're interested, here is a NumPy way, which is usually faster.
In [4889]: np.tile(df.columns, df.shape[0])[~np.isnan(df.values.ravel())]
Out[4889]: array(['Col_2', 'Col_1', 'Col_2', 'Col_3', 'Col_3'], dtype=object)
Timings
In [4913]: df.shape
Out[4913]: (100, 3)
In [4914]: %timeit np.tile(df.columns, df.shape[0])[~np.isnan(df.values.ravel())]
10000 loops, best of 3: 35.8 µs per loop
In [4915]: %timeit df.stack().index.get_level_values(1)
1000 loops, best of 3: 335 µs per loop
In [4905]: df.shape
Out[4905]: (100000, 3)
In [4907]: %timeit np.tile(df.columns, df.shape[0])[~np.isnan(df.values.ravel())]
100 loops, best of 3: 5.98 ms per loop
In [4908]: %timeit df.stack().index.get_level_values(1)
100 loops, best of 3: 11.7 ms per loop
Choose based on your needs (readability, speed, maintainability, etc.).
You can use dropna:
df.dropna(axis=1).columns
axis : {0 or ‘index’, 1 or ‘columns’}
how : {‘any’, ‘all’}
Basically, you use dropna to remove the nulls: axis=1 drops columns, how="any" drops a column if at least one value in it is null, and .columns gets the remaining headers.
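A minimal sketch of how the axis and how arguments interact (the small frame here is made up for illustration):

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3],
                   'b': [1, np.nan, 3],
                   'c': [np.nan, np.nan, np.nan]})
print(df.dropna(axis=1, how='any').columns)  # only 'a' survives: b and c contain NaN
print(df.dropna(axis=1, how='all').columns)  # 'a' and 'b' survive: only c is all-NaN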

Groupby certain number of rows pandas

I have a dataframe with let's say 2 columns: dates and doubles
2017-05-01 2.5
2017-05-02 3.5
... ...
2017-05-17 0.2
2017-05-18 2.5
Now I would like to do a groupby and sum over every x rows. So e.g. with 6 rows it would return:
2017-05-06 15.6
2017-05-12 13.4
2017-05-18 18.0
Is there a clean solution to do this without running it through a for-loop with something like this:
temp = pd.DataFrame()
j = 0
for i in range(0, len(df.index), 6):
    temp[df.ix[i]['date']] = df.ix[i:i+6]['value'].sum()
I guess you are looking for resample. Consider this dataframe:
import numpy as np
import pandas as pd

rng = pd.date_range('2017-05-01', periods=18, freq='D')
num = np.random.randint(5, size=18)
df = pd.DataFrame({'date': rng, 'val': num})
df.resample('6D', on='date').sum().reset_index()
will return
        date  val
0 2017-05-01   14
1 2017-05-07   11
2 2017-05-13   16
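Note that resample labels each bin with its first date, while the expected output in the question uses the last date of each 6-row chunk. A sketch of one way to get that, reusing the df built above (this aggregation is my suggestion, not part of the original answer):

# Label each 6-row chunk with its last date and sum the values
df.groupby(np.arange(len(df)) // 6).agg({'date': 'last', 'val': 'sum'})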
This is an alternative solution, using groupby on a range over the length of the dataframe.
Two columns, using agg:
df.groupby(np.arange(len(df)) // 6).agg(lambda x: {'date': x.date.iloc[0],
                                                   'value': x.value.sum()})
For multiple columns, you can use first (or last) for the date and sum for the other columns:
group = df.groupby(np.arange(len(df)) // 6)
pd.concat((group['date'].first(),
           group[[c for c in df.columns if c != 'date']].sum()), axis=1)

How to divide two columns element-wise in a pandas dataframe

I have two columns in my pandas dataframe. I'd like to divide column A by column B, value by value, and show it as follows:
import pandas as pd
csv1 = pd.read_csv('auto$0$0.csv')
csv2 = pd.read_csv('auto$0$8.csv')
df1 = pd.DataFrame(csv1, columns=['Column A', 'Column B'])
df2 = pd.DataFrame(csv2, columns=['Column A', 'Column B'])
dfnew = pd.concat([df1, df2])
The columns:
Column A  Column B
      12         2
      14         7
      16         8
      20         5
And the expected result:
Result
6
2
2
4
How do I do this?
Just divide the columns:
In [158]:
df['Result'] = df['Column A'] / df['Column B']
df
Out[158]:
   Column A  Column B  Result
0        12         2     6.0
1        14         7     2.0
2        16         8     2.0
3        20         5     4.0
Series.div()
Equivalent to the / operator but with support to substitute a fill_value for missing data in either one of the inputs.
So normally div() is the same as /:
df['C'] = df.A.div(df.B)
# df.A / df.B
But div()'s fill_value is more concise than 2x fillna():
df['C'] = df.A.div(df.B, fill_value=-1)
# df.A.fillna(-1) / df.B.fillna(-1)
And div()'s method chaining is more idiomatic:
df['C'] = df.A.div(df.B).cumsum().add(1).gt(10)
# ((df.A / df.B).cumsum() + 1) > 10
Note that when dividing a DataFrame with another DataFrame or Series, DataFrame.div() also supports broadcasting across an axis or MultiIndex level.
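A minimal sketch of that broadcasting (the frame and Series here are made up for illustration):

import pandas as pd

df = pd.DataFrame({'a': [10, 20], 'b': [30, 40]})
divisors = pd.Series([10, 100])  # one divisor per row
print(df.div(divisors, axis=0))  # each row of df divided element-wise by its divisor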

How to assign a value_count output to a dataframe

I am trying to assign the output from a value_counts to a new df. My code follows.
import pandas as pd
import glob
df = pd.concat((pd.read_csv(f, names=['date','bill_id','sponsor_id']) for f in glob.glob('/home/jayaramdas/anaconda3/df/s11?_s_b')))
column_list = ['date', 'bill_id']
df = df.set_index(column_list, drop = True)
df = df['sponsor_id'].value_counts()
df.columns=['sponsor', 'num_bills']
print (df)
The value counts are not being assigned the column headers specified ('sponsor', 'num_bills'). I'm getting the following output from print:
1036    426
791     408
1332    401
1828    388
136     335
Name: sponsor_id, dtype: int64
Your column length doesn't match: you read 3 columns from the csv and then set the index to 2 of them. value_counts produces a Series with the column values as the index and the counts as the values, so you need to reset_index and then overwrite the column names:
df = df.reset_index()
df.columns=['sponsor', 'num_bills']
Example:
In [276]:
df = pd.DataFrame({'col_name':['a','a','a','b','b']})
df
Out[276]:
  col_name
0        a
1        a
2        a
3        b
4        b
In [277]:
df['col_name'].value_counts()
Out[277]:
a    3
b    2
Name: col_name, dtype: int64
In [278]:
type(df['col_name'].value_counts())
Out[278]:
pandas.core.series.Series
In [279]:
df = df['col_name'].value_counts().reset_index()
df.columns = ['col_name', 'count']
df
Out[279]:
  col_name  count
0        a      3
1        b      2
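As a side note, a sketch of a one-chain variant that names both columns without a separate assignment (out is a hypothetical name; rename_axis names the index holding the unique values):

out = (df['col_name'].value_counts()
         .rename_axis('col_name')     # name the index of unique values
         .reset_index(name='count'))  # move it into a column and name the counts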
Appending value_counts() to multi-column dataframe:
df = pd.DataFrame({'C1':['A','B','A'],'C2':['A','B','A']})
vc_df = df.value_counts().to_frame('Count').reset_index()
display(df, vc_df)
  C1 C2
0  A  A
1  B  B
2  A  A

  C1 C2  Count
0  A  A      2
1  B  B      1
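Note that DataFrame.value_counts() used above needs a relatively recent pandas (1.1+); on older versions, a groupby equivalent (a sketch, my suggestion rather than part of the original answer):

vc_df = (df.groupby(['C1', 'C2']).size()
           .sort_values(ascending=False)
           .reset_index(name='Count'))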
