Python: Inserting hyphens after column header in Pandas

I have created a dataframe in Python Pandas as below:
import pandas as pd
import os
cols = ('Name','AGE','SAL')
df = pd.read_csv("C:\\Users\\agupt80\\Desktop\\POC\\Python\\test.csv",names = cols)
print(df)
When I print the dataframe, I get the output below:
Name AGE SAL
0 Amit 32 100
1 gupta 33 200
2 hello 34 300
3 Amit 33 100
Please let me know how I can insert a line of hyphens ("-") after the column header, like below:
Name AGE SAL
------------------------
0 Amit 32 100
1 gupta 33 200
2 hello 34 300
3 Amit 33 100

I don't know of any pandas customization option for printing separators, but you can just print the df to a string, and then insert the line yourself. Something like this:
string_repr = df.to_string().splitlines()
string_repr.insert(1, "-" * len(string_repr[0]))
out = '\n'.join(string_repr)
>>> print(out)
Name AGE SAL
------------------
0 Amit 32 100
1 gupta 33 200
2 hello 34 300
3 Amit 33 100

You could do something like this:
import pandas as pd
df = pd.DataFrame([
['Amit', 32, 100],
['gupta', 33, 200],
['hello', 34, 100],
['Amit', 33, 100]],
columns=['Name', 'AGE', 'SAL'])
lines = df.to_string().splitlines()
num_hyphens = max(len(line) for line in lines)
lines.insert(1, '-' * num_hyphens)
result = '\n'.join(lines)
print(result)
Output:
Name AGE SAL
------------------
0 Amit 32 100
1 gupta 33 200
2 hello 34 100
3 Amit 33 100
You can change the calculation of num_hyphens depending on exactly how you want your output to look. For example, you could do:
num_hyphens = 2 * len(lines[0]) - len(lines[0].strip())
To get:
Name AGE SAL
----------------------
0 Amit 32 100
1 gupta 33 200
2 hello 34 100
3 Amit 33 100
Note: If the DataFrame has a named index, to_string will output an additional header line with the name of the index. In that case you could either remove that line (replace it with the hyphens) or add the hyphens after it (at position 2 instead of 1).
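The note about a named index can be folded into a small helper. This is only a sketch: the function name to_string_with_rule and its char parameter are made up for illustration, and it assumes single-level column headers.

```python
import pandas as pd

def to_string_with_rule(df, char="-"):
    """Return df.to_string() with a separator line after the header.

    Hypothetical helper; assumes the columns are not a MultiIndex.
    """
    lines = df.to_string().splitlines()
    # A named index adds a second header line holding the index name.
    has_named_index = any(name is not None for name in df.index.names)
    header_rows = 2 if has_named_index else 1
    width = max(len(line) for line in lines)
    lines.insert(header_rows, char * width)
    return "\n".join(lines)

df = pd.DataFrame({"Name": ["Amit", "gupta"], "AGE": [32, 33], "SAL": [100, 200]})
plain = to_string_with_rule(df)                    # rule at position 1
named = to_string_with_rule(df.set_index("Name"))  # rule at position 2
print(plain)
print(named)
```

The position is derived from df.index.names so that the same call works whether or not the index has a name.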

You need:
#shift the values down one row (note: this pushes the last row out of the frame)
df=df.shift(1)
#fill the first row with hyphens
df.iloc[0]='----'
output
Name AGE SAL
---- ---- ----
Amit 32 100
gupta 33 200
hello 34 300
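One thing to watch with the shift approach: shift(1) moves every row down by one, so the last row falls off the end of the frame. A minimal sketch of a variant that keeps every row is to prepend a separator row with pd.concat instead. The names sep and out are illustrative, and the columns become object dtype once they hold the hyphen strings.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Amit", "gupta", "hello", "Amit"],
    "AGE": [32, 33, 34, 33],
    "SAL": [100, 200, 300, 100],
})

# One row of hyphens with the same columns as df.
sep = pd.DataFrame([["----"] * len(df.columns)], columns=df.columns)

# Put the separator row first; all original rows are kept.
out = pd.concat([sep, df], ignore_index=True)
print(out.to_string(index=False))
```

This changes the data itself, so it is probably best used on a copy that exists only for display.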

Related

Pandas Taking a CSV and combining/concatenating rows from a previously pivoted csv

All I have a simple objective.
I have a CSV that has thousands of columns and thousands of rows. I want to take the existing CSV and concatenate the values from Row 1 and Row 2 ONLY into one single row, similar to below. The key thing to keep in mind is that some of the values (like "lion", "tiger", "bear") repeat several times, once for each metric. I do not want it to display lion.1, lion.2, etc.; it should just display lion.
Data sample for Rows 1 and 2:
flow color desc lion tiger bear lion tiger bear
flow color desc m1 m1 m1 m2 m2 m2
flavor1 catego1 flavor1 catego1 32 23 34 34 21 24
flavor2 catego2 flavor2 catego2 32 23 34 34 21 24
How I want the data to appear in the CSV in Row 1 (note that row 2 should NOT appear in the file after we combine them):
"flow flow" "color color" "desc desc" "lion m1" "tiger m1" "bear m1" "lion m2" "tiger m2" "bear m2"
"flavor1 catego1" flavor1 catego1 32 23 34 34 21 24
"flavor1 catego2" flavor2 catego2 32 23 34 34 21 24
Sad code attempt:
import pandas as pd
df = pd.read_csv(r"C:Test.csv")
row_one = df.iloc[0]
spacer = " "
row_two = df.iloc[1]
new_header = row_one+spacer+row_two
Updated based on new information.
I'm assuming you haven't included the row index in your output, so here is what works for my small dataset; I think it would work for yours too.
import pandas as pd
def main():
    df = pd.DataFrame([['flow', 'color', 'desc'], ['flavor1 catego1', 'flavor1', 'catego1'], ['flavor2 catego2', 'flavor2', 'catego2']], columns=['flow', 'color', 'desc'])
    print('before')
    print(df)
    df = df.rename(columns={x: f'{x} {y}' for x, y in zip(df.columns, df.iloc[0])})
    print(f'Index of row to delete: {df.iloc[0].name}')
    df.drop(index=df.iloc[0].name, inplace=True)
    print()
    print('after')
    print(df)
if __name__ == '__main__':
    main()
Output
before
flow color desc
0 flow color desc
1 flavor1 catego1 flavor1 catego1
2 flavor2 catego2 flavor2 catego2
Index of row to delete: 0
after
flow flow color color desc desc
1 flavor1 catego1 flavor1 catego1
2 flavor2 catego2 flavor2 catego2
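For the repeated names (lion, tiger, bear), one possible approach is to read the file with header=None so that pandas never renames duplicates to lion.1, lion.2, and then build the header from the first two rows. The sketch below uses a small in-memory CSV as a stand-in for the real file; the sample values and the comma delimiter are assumptions.

```python
import io
import pandas as pd

# Stand-in for the real CSV file (assumed comma-separated).
raw = io.StringIO(
    "flow,color,desc,lion,tiger,lion,tiger\n"
    "flow,color,desc,m1,m1,m2,m2\n"
    "a,b,c,32,23,34,21\n"
)

# header=None keeps both header rows as data and avoids the .1/.2 suffixes.
df = pd.read_csv(raw, header=None)

# Join row 0 and row 1 element-wise with a space between them.
new_header = df.iloc[0].astype(str) + " " + df.iloc[1].astype(str)

# Drop the two header rows and install the combined header.
df = df.iloc[2:].reset_index(drop=True)
df.columns = new_header.tolist()
print(df)
```

Because the header rows were read as data, the remaining values are strings; pd.to_numeric can convert the metric columns afterwards if numbers are needed.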

summing two columns in a dataframe

My df looks as follows:
Roll Name Age Physics English Maths
0 A1 Max 16 87 79 90
1 A2 Lisa 15 47 75 60
2 A3 Luna 17 83 49 95
3 A4 Ron 16 86 79 93
4 A5 Silvia 15 57 99 91
I'd like to add the columns Physics, English, and Maths and display the results in a separate column 'Grade'.
I've tried the code:
df['Physics'] + df['English'] + df['Maths']
But it just concatenates. I have not been taught lambda functions yet.
How do I go about this?
df['Grade'] = df['Physics'] + df['English'] + df['Maths']
If it concatenates, your data is probably stored as strings; convert the columns to float or integer first.
Check the data types first by using df.dtypes
Try:
df["total"] = df[["Physics", "English", "Maths"]].sum(axis=1)
df
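If df.dtypes shows object for the subject columns, a minimal sketch of converting them before summing could look like the following. The sample frame is a cut-down version of the one in the question with the marks stored as strings, and errors="coerce" is a choice made here that turns unparseable values into NaN.

```python
import pandas as pd

# Marks stored as strings, which is what makes + concatenate.
df = pd.DataFrame({
    "Name": ["Max", "Lisa"],
    "Physics": ["87", "47"],
    "English": ["79", "75"],
    "Maths": ["90", "60"],
})

subjects = ["Physics", "English", "Maths"]

# Convert each subject column to a numeric dtype.
df[subjects] = df[subjects].apply(pd.to_numeric, errors="coerce")

# Row-wise sum across the subject columns.
df["Grade"] = df[subjects].sum(axis=1)
print(df)
```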
Check the code below. It is possible your columns are in string format; the following will solve that:
import pandas as pd
df = pd.DataFrame({"Physics":['1','2','3'],"English":['1','2','3'],"Maths":['1','2','3']})
df['Total'] = df['Physics'].astype('int') +df['English'].astype('int') +df['Maths'].astype('int')
df

Pandas matching elements

I have a database named df1 and a sheet named df2.
I want to use df1 to fill in df2 with pandas.
DF1:
name SCORE height weight
1 JACK 66 150 100
2 PAUL 50 165 22
3 MLKE 30 132 33
4 Meir 20 110 20
5 Payne 10 175 21
DF2:
name SCORE height weight
1 JACK
2 PAUL
3 MLKE
*The names may be in a different order.
My mistaken code:
import openpyxl
import pandas as pd
df1 = pd.DataFrame(pd.read_excel('df1.xlsx',sheet_name =0))
df2 = pd.DataFrame(pd.read_excel('df2.xlsx',sheet_name = 0))
result = df1.merge(df2,on = ['NAME'],how="left")
Expected result:
DF2:
name SCORE height weight
1 JACK 66 150 100
2 PAUL 50 165 22
3 MLKE 30 132 33
As you mentioned, the names may be in a different order. Therefore, if you want to use df1 to fill in df2, you can try setting name as the index in both df1 and df2 and then use .update(), as follows:
df1a = df1.set_index('name')
df2a = df2.set_index('name')
df2a.update(df1a)
df2 = df2a.reset_index()
Result:
(Using df1 data based on the picture near the bottom):
print(df2)
name SCORE height weight
0 JACK 66 150 100
1 PAUL 50 165 22
2 MLKE 30 132 33
If you want to keep the original row index of df2, you can save the index and then restore it later, as follows:
df1a = df1.set_index('name')
df2a = df2.set_index('name')
df2a.update(df1a)
df2_idx = df2.index
df2 = df2a.reset_index()
df2.index = df2_idx
Result:
print(df2)
name SCORE height weight
1 JACK 66 150 100
2 PAUL 50 165 22
3 MLKE 30 132 33
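As an alternative sketch, a left merge can also work if df2 is the left side and the key matches the column name exactly ('name', not 'NAME'). The frames below are rebuilt by hand from the tables in the question, with df2 reduced to its name column in a shuffled order.

```python
import pandas as pd

df1 = pd.DataFrame({
    "name": ["JACK", "PAUL", "MLKE", "Meir", "Payne"],
    "SCORE": [66, 50, 30, 20, 10],
    "height": [150, 165, 132, 110, 175],
    "weight": [100, 22, 33, 20, 21],
})

# df2 only knows the names, possibly in a different order than df1.
df2 = pd.DataFrame({"name": ["PAUL", "JACK", "MLKE"]})

# Keep df2's rows and order; pull the other columns from df1 by name.
filled = df2[["name"]].merge(df1, on="name", how="left")
print(filled)
```

Selecting only the name column from df2 before merging avoids getting duplicated SCORE_x / SCORE_y style columns from the empty columns in df2.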

Pandas DataFrame pivot (reshape?)

I can't seem to get this right... here's what I'm trying to do:
import pandas as pd
df = pd.DataFrame({
'item_id': [1,1,3,3,3],
'contributor_id': [1,2,1,4,5],
'contributor_role': ['sing', 'laugh', 'laugh', 'sing', 'sing'],
'metric_1': [80, 90, 100, 92, 50],
'metric_2': [180, 190, 200, 192, 150]
})
--->
item_id contributor_id contributor_role metric_1 metric_2
0 1 1 sing 80 180
1 1 2 laugh 90 190
2 3 1 laugh 100 200
3 3 4 sing 92 192
4 3 5 sing 50 150
And I want to reshape it into:
item_id SING_1_contributor_id SING_1_metric_1 SING_1_metric_2 SING_2_contributor_id SING_2_metric_1 SING_2_metric_2 ... LAUGH_1_contributor_id LAUGH_1_metric_1 LAUGH_1_metric_2 ... <LAUGH_2_...>
0 1 1 80 180 N/A N/A N/A ... 2 90 190 ... N/A..
1 3 4 92 192 5 50 150 ... 1 100 200 ... N/A..
Basically, for each item_id, I want to collect all relevant data into a single row. Each item could have multiple types of contributors, and there is a max for each type (e.g. max SING contributor = A per item, max LAUGH contributor = B per item). There are a set of metrics tied to each contributor (but for the same contributor, the values could be different across different items / contributor types).
I can probably achieve this through some seemingly inefficient methods (e.g. looping and matching then populating a template df), but I was wondering if there is a more efficient way to achieve this, potentially through cleverly specifying the index / values / columns in the pivot operation (or any other method..).
Thanks in advance for any suggestions!
EDIT:
Ended up adapting Ben's script below into the following:
df['role_count'] = df.groupby(['item_id', 'contributor_role']).cumcount().add(1).astype(str)
df['contributor_role'] = df.apply(lambda row: row['contributor_role'] + '_' + row['role_count'], axis=1)
df = df.set_index(['item_id','contributor_role']).unstack()
df.columns = ['_'.join(x) for x in df.columns.values]
You can create the additional key with cumcount then do unstack
df['newkey']=df.groupby('item_id').cumcount().add(1).astype(str)
df['contributor_id']=df['contributor_id'].astype(str)
s = df.set_index(['item_id','newkey']).unstack().sort_index(level=1,axis=1)
s.columns=s.columns.map('_'.join)
s
Out[38]:
contributor_id_1 contributor_role_1 ... metric_1_3 metric_2_3
item_id ...
1 1 sing ... NaN NaN
3 1 messaround ... 50.0 150.0
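Putting the cumcount and unstack steps together on the sample data from the question, a self-contained sketch might look like this. The role_key column name and the upper-casing of the role are choices made here to match the desired SING_1_... column names.

```python
import pandas as pd

df = pd.DataFrame({
    "item_id": [1, 1, 3, 3, 3],
    "contributor_id": [1, 2, 1, 4, 5],
    "contributor_role": ["sing", "laugh", "laugh", "sing", "sing"],
    "metric_1": [80, 90, 100, 92, 50],
    "metric_2": [180, 190, 200, 192, 150],
})

# Number each contributor within its (item, role) group: 1, 2, ...
count = df.groupby(["item_id", "contributor_role"]).cumcount().add(1).astype(str)
df["role_key"] = df["contributor_role"].str.upper() + "_" + count

# One row per item, one column per (field, role_key) pair.
wide = (
    df.drop(columns="contributor_role")
      .set_index(["item_id", "role_key"])
      .unstack()
)

# Flatten the two column levels into names like SING_1_metric_1.
wide.columns = [f"{role}_{field}" for field, role in wide.columns]
wide = wide.sort_index(axis=1).reset_index()
print(wide)
```

Slots that an item does not fill (for example a second SING contributor on item 1) come out as NaN, which also turns those columns into floats.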

calculate values between two pandas dataframe based on a column value

EDITED: let me copy the whole data set
df is the store sales/inventory data
branch daqu store store_name style color size stocked sold in_stock balance
0 huadong wenning C301 EE EEBW52301M 39 160 7 4 3 -5
1 huadong wenning C301 EE EEBW52301M 39 165 1 0 1 1
2 huadong wenning C301 EE EEBW52301M 39 170 6 3 3 -3
dh is the transaction (move 'amount' from store 'from' to 'to')
branch daqu from to style color size amount box_sum
8 huadong shanghai C306 C30C EEOM52301M 59 160 1 162
18 huadong shanghai C306 C30C EEOM52301M 39 160 1 162
25 huadong shanghai C306 C30C EETJ52301M 52 160 9 162
26 huadong shanghai C306 C30C EETJ52301M 52 155 1 162
32 huadong shanghai C306 C30C EEOW52352M 19 160 2 162
What I want is the store inventory data after the transaction, which would look exactly the same format as the df, but only 'in_stock' numbers would have changed from the original df according to numbers in dh.
below is what I tried:
df['full_code'] = df['store']+df['style']+df['color'].astype(str)+df['size'].astype(str)
dh['from_code'] = dh['from']+dh['style']+dh['color'].astype(str)+dh['size'].astype(str)
dh['to_code'] = dh['to']+dh['style']+dh['color'].astype(str)+dh['size'].astype(str)
# subtract from 'from' store
dh_from = pd.DataFrame(dh.groupby('from_code')['amount'].sum())
for code, stock in dh_from.iterrows() :
df.loc[df['full_code'] == code, 'in_stock'] = df.loc[df['full_code'] == code, 'in_stock'] - stock
# add to 'to' store
dh_to = pd.DataFrame(dh.groupby('to_code')['amount'].sum())
for code, stock in dh_to.iterrows() :
df.loc[df['full_code'] == code, 'in_stock'] = df.loc[df['full_code'] == code, 'in_stock'] + stock
df.to_csv('d:/after_dh.csv')
But when I open the csv file, the 'in_stock' values for the rows where a transaction occurred are all blank.
I think df.loc[df['full_code'] == code, 'in_stock'] = df.loc[df['full_code'] == code, 'in_stock'] + stock this has some problem. What's the correct way of updating the value?
ORIGINAL: I have two pandas dataframe: df1 is for the inventory, df2 is for the transaction
df1 look something like this:
full_code in_stock
1 AAA 200
2 BBB 150
3 CCC 150
df2 look something like this:
from to full_code amount
1 XX XY AAA 30
2 XX XZ AAA 35
3 ZY OI BBB 50
4 AQ TR AAA 15
What I want is the inventory after all transactions are done.
In this case,
full_code in_stock
1 AAA 120
2 BBB 100
3 CCC 150
Note that full_code is unique in df1, but not unique in df2.
Is there any pandas way of doing this? I got messed up with the original dataframe and a view of the dataframe and got it solved by turning them into numpy array and finding matching full_codes. But the resulting code is also a mess and wonder if there is a simpler way of doing this not turning everything into a numpy array.
What I would do is to set the index in df1 to the 'full_code' column and then call sub to subtract the other df.
What we pass for the values is the result of grouping on 'full_code' and calling sum on 'amount' column.
An additional param for sub is fill_value. It is needed because product 'CCC' does not exist on the rhs and we want its value to be preserved; otherwise it becomes NaN:
In [25]:
total = df1.set_index('full_code')['in_stock'].sub(df2.groupby('full_code')['amount'].sum(), fill_value=0)
total.reset_index()
Out[25]:
full_code in_stock
0 AAA 120
1 BBB 100
2 CCC 150
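The same idea can probably be extended to the edited question, where stock leaves the 'from' store and arrives at the 'to' store. The blanks in the original loop are likely caused by stock being a whole row (a Series) from iterrows, so the subtraction aligns on mismatched labels and yields NaN. The sketch below avoids the loop; the tiny inventory and moves frames and the item column are made up for illustration.

```python
import pandas as pd

# Made-up stand-ins for df (inventory) and dh (transactions).
inventory = pd.DataFrame({
    "store": ["A", "B", "C"],
    "item": ["x", "x", "x"],
    "in_stock": [10, 5, 7],
})
moves = pd.DataFrame({
    "from": ["A", "A"],
    "to": ["B", "C"],
    "item": ["x", "x"],
    "amount": [3, 2],
})

# Total quantity leaving and arriving per (store, item).
out_qty = moves.groupby(["from", "item"])["amount"].sum()
in_qty = moves.groupby(["to", "item"])["amount"].sum()

# Look up each inventory row's key; rows without a transaction get 0.
keys = pd.MultiIndex.from_frame(inventory[["store", "item"]])
inventory["in_stock"] = (
    inventory["in_stock"].to_numpy()
    - out_qty.reindex(keys, fill_value=0).to_numpy()
    + in_qty.reindex(keys, fill_value=0).to_numpy()
)
print(inventory)
```

Grouping on several key columns replaces the concatenated full_code string, though building that string first and grouping on it should work just as well.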
