dataframe transform partial row data on column - python

I have a dataframe in the format shown in the image below.
In each row, groups of three columns represent one type of data: the first column is the ticker, the next three columns are one type of data, and columns 5-7 are a second type.
Now I want to transform this so that each type of data becomes a column, with the groups appended to one another.
Expected output is:
Is there any way to do this transformation in pandas using an existing API? I am currently doing it in a very basic way, creating a new dataframe for each group and then appending them.

Here is one way to do it:
Use pd.melt to unstack the table, then split what used to be column names (and are now row values) on "/" to separate them into two columns (txt, year).
Create the new row key by combining ticker and year, then use pivot to get the desired result set.
df2 = df.melt(id_vars='ticker', var_name='col')  # line missed in the earlier solution, now added
df2[['txt', 'year']] = df2['col'].str.split('/', expand=True)
df2.assign(ticker2=df2['ticker'] + '/' + df2['year']).pivot(index='ticker2', columns='txt', values='value').reset_index()
Result set
txt ticker2 data1 data2
0 AAPL/2020 0.824676 0.616524
1 AAPL/2021 0.018540 0.046365
2 AAPL/2022 0.222349 0.729845
3 AMZ/2020 0.122288 0.087217
4 AMZ/2021 0.012168 0.734674
5 AMZ/2022 0.923501 0.437676
6 APPL/2020 0.886927 0.520650
7 APPL/2021 0.725515 0.543404
8 APPL/2022 0.211378 0.464898
9 GGL/2020 0.777676 0.052658
10 GGL/2021 0.297292 0.213876
11 GGL/2022 0.894150 0.185207
12 MICO/2020 0.898251 0.882252
13 MICO/2021 0.141342 0.105316
14 MICO/2022 0.440459 0.811005
Based on the code that you posted in the comment: I had unfortunately missed a line when posting the solution; it is added now.
import numpy as np
import pandas as pd

df2 = pd.DataFrame(np.random.randint(0, 100, size=(2, 6)),
                   columns=["data1/2020", "data1/2021", "data1/2022", "data2/2020", "data2/2021", "data2/2022"])
ticker = ['APPL', 'MICO']
df2.insert(loc=0, column='ticker', value=ticker)
df2.head()
df3 = df2.melt(id_vars='ticker', var_name='col')  # the line missed in the earlier posting
df3[['txt', 'year']] = df3['col'].str.split('/', expand=True)
df3.head()
df3.assign(ticker2=df3['ticker'] + '/' + df3['year']).pivot(index='ticker2', columns='txt', values='value').reset_index()
txt ticker2 data1 data2
0 APPL/2020 26 9
1 APPL/2021 75 59
2 APPL/2022 20 44
3 MICO/2020 79 90
4 MICO/2021 63 30
5 MICO/2022 73 91
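As an aside, since the column names follow a stub/suffix pattern ("data1/2020"), the same reshape can be done in a single call with pd.wide_to_long. This is only a sketch against the sample frame above; the 'ticker2' column is rebuilt afterwards to match the pivot output:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 100, size=(2, 6)),
                  columns=["data1/2020", "data1/2021", "data1/2022",
                           "data2/2020", "data2/2021", "data2/2022"])
df.insert(loc=0, column='ticker', value=['APPL', 'MICO'])

# stubnames are the shared column prefixes; sep and suffix describe the "/2020" part
out = (pd.wide_to_long(df, stubnames=['data1', 'data2'],
                       i='ticker', j='year', sep='/', suffix=r'\d+')
       .reset_index())
out['ticker2'] = out['ticker'] + '/' + out['year'].astype(str)
```

This avoids the intermediate melt/split/pivot round-trip, at the cost of having to enumerate the stub names.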

Related

How to filter dataframe only by month and year?

I want to select many cells filtered only by month and year. For example, there are cells 01.01.2017, 15.01.2017, 03.02.2017 and 15.02.2017. I want to group these cells looking only at the month and year information: if they are in January, they should be grouped together.
Output Expectation:
01.01.2017 ---- 1
15.01.2017 ---- 1
03.02.2017 ---- 2
15.02.2017 ---- 2
Edit: I have 2 datasets in different Excel files, as you can see below.
first data
second data
What I'm trying to do is get the 'Su Seviye' data for every 'DH_ID' separately from the first dataset, and then paste that data into the 'Kuyu Yüksekliği' column in the second dataset. But the problems are that every 'DH_ID' is in a different sheet, and although the first dataset has only month and year information, the second dataset additionally has day information. How can I write this kind of code?
import pandas as pd
df = pd.read_excel('...Gözlem kuyu su seviyeleri- 2017.xlsx', sheet_name= 'GÖZLEM KUYULARI1', header=None)
df2 = pd.read_excel('...YERALTI SUYU GÖZLEM KUYULARI ANALİZ SONUÇLAR3.xlsx', sheet_name= 'HJ-8')
HJ8 = df.iloc[:, [0,5,7,9,11,13,15,17,19,21,23,25,27,29]]
##writer = pd.ExcelWriter('yıllarsuseviyeler.xlsx')
##HJ8.to_excel(writer)
##writer.save()
rb = pd.read_excel('...yıllarsuseviyeler.xlsx')
rb.loc[0,7]='01.2022'
rb.loc[0,9]='02.2022'
rb.loc[0,11]='03.2022'
rb.loc[0,13]='04.2022'
rb.loc[0,15]='05.2021'
rb.loc[0,17]='06.2022'
rb.loc[0,19]='07.2022'
rb.loc[0,21]='08.2022'
rb.loc[0,23]='09.2022'
rb.loc[0,25]='10.2022'
rb.loc[0,27]='11.2022'
rb.loc[0,29]='12.2022'
You can see what I have done above.
First, convert the date column to a datetime object, then get the year and month part with to_period, and finally get the group number with ngroup():
df['group'] = df.groupby(pd.to_datetime(df['date'], format='%d.%m.%Y').dt.to_period('M')).ngroup() + 1
date group
0 01.01.2017 1
1 15.01.2017 1
2 03.02.2017 2
3 15.02.2017 2
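For the edit about combining the two datasets, the same to_period trick gives you a common join key even though one side has day information and the other does not. This is a sketch with hypothetical stand-in frames and column names ('ay', 'su_seviye', 'tarih' are placeholders for the real Excel columns):

```python
import pandas as pd

# stand-ins for the two Excel sheets: monthly water levels (first dataset)
# and daily records (second dataset)
monthly = pd.DataFrame({'ay': ['01.2017', '02.2017'],
                        'su_seviye': [12.3, 11.8]})
daily = pd.DataFrame({'tarih': ['01.01.2017', '15.01.2017',
                                '03.02.2017', '15.02.2017']})

# build a month-period key on both sides, then merge daily rows
# against the matching monthly value
monthly['key'] = pd.to_datetime(monthly['ay'], format='%m.%Y').dt.to_period('M')
daily['key'] = pd.to_datetime(daily['tarih'], format='%d.%m.%Y').dt.to_period('M')
merged = daily.merge(monthly[['key', 'su_seviye']], on='key', how='left')
```

Each daily row then carries the monthly value for its month, which you could write into the 'Kuyu Yüksekliği' column.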

Column wise concatenation for each set of values

I am trying to concatenate rows column-wise for every set of 4 rows/values.
I have 11 values: the first 4 values should be concatenated into one row, rows 5-8 into a second, and the last 3 rows into a third, even though the last group has fewer than four values.
df_in = pd.DataFrame({'Column_IN': ['text 1','text 2','text 3','text 4','text 5','text 6','text 7','text 8','text 9','text 10','text 11']})
and my expected output is as follows
df_out = pd.DataFrame({'Column_OUT': ['text 1&text 2&text 3&text 4','text 5&text 6&text 7&text 8','text 9&text 10&text 11']})
I have tried to get my desired output df_out as below:
df_2 = df_in.iloc[:-7].agg('&'.join).to_frame()
What modification is required to get the desired output?
Try using groupby and agg:
>>> df_in.groupby(df_in.index // 4).agg('&'.join)
Column_IN
0 text 1&text 2&text 3&text 4
1 text 5&text 6&text 7&text 8
2 text 9&text 10&text 11
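One caveat worth noting: `df_in.index // 4` only groups correctly because the index is the default RangeIndex. If the frame has been filtered or reindexed, a position-based key is safer; a sketch of that variant:

```python
import numpy as np
import pandas as pd

df_in = pd.DataFrame({'Column_IN': [f'text {i}' for i in range(1, 12)]})

# np.arange(len(df_in)) // 4 assigns 0,0,0,0,1,1,1,1,2,2,2 by position,
# regardless of what the index labels are
out = df_in.groupby(np.arange(len(df_in)) // 4)['Column_IN'].agg('&'.join)
```

The result matches df_out: the final group simply contains the remaining 3 values.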

How to extract multiples tables from one PDF file using Pandas and tabula-py

Can someone help me extract multiple tables from ONE pdf file? I have 5 pages; every page has a table with the same header columns, e.g.:
Table exp in every page
student Score Rang
Alex 50 23
Julia 80 12
Mariana 94 4
I want to extract all these tables into one dataframe. First I did
df = tabula.read_pdf(file_path, pages='all', multiple_tables=True)
but I got a messy output: a list of dataframes that looks like this:
[student Score Rang
Alex 50 23
Julia 80 12
Mariana 94 4 ,student Score Rang
Maxim 43 34
Nourah 93 5]
so i edited my code like this
import pandas as pd
import tabula
file_path = "filePath.pdf"
# read my file
df1 = tabula.read_pdf(file_path, pages=1, multiple_tables=True)
df2 = tabula.read_pdf(file_path, pages=2, multiple_tables=True)
df3 = tabula.read_pdf(file_path, pages=3, multiple_tables=True)
df4 = tabula.read_pdf(file_path, pages=4, multiple_tables=True)
df5 = tabula.read_pdf(file_path, pages=5, multiple_tables=True)
It gives me a dataframe for each table, but I don't know how to regroup them into one single dataframe, and I would also like to avoid repeating that line of code for every page.
According to the documentation of tabula, read_pdf returns a list when passed the multiple_tables=True option.
Thus, you can use pandas.concat on its output to concatenate the dataframes:
df = pd.concat(tabula.read_pdf(file_path,pages='all',multiple_tables=True))
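To see what the concat step does on its own (without needing a PDF), here is a sketch with two hand-built frames standing in for the per-page tables that tabula.read_pdf would return:

```python
import pandas as pd

# stand-ins for the per-page tables returned by tabula.read_pdf(..., multiple_tables=True)
page1 = pd.DataFrame({'student': ['Alex', 'Julia'], 'Score': [50, 80], 'Rang': [23, 12]})
page2 = pd.DataFrame({'student': ['Maxim'], 'Score': [43], 'Rang': [34]})

# ignore_index=True rebuilds a clean 0..n-1 index across pages instead of
# keeping each page's own 0-based index
df = pd.concat([page1, page2], ignore_index=True)
```

Passing ignore_index=True to pd.concat is usually what you want here, since every page restarts its row index at 0.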

DataFrame.sort_values only looks at first digit rather than entire number

I have a DataFrame that looks like this,
del Ticker Open Interest
0 1 SPY 20,996,893
1 3 IWM 7,391,074
2 5 EEM 6,545,445
...
47 46 MU 1,268,256
48 48 NOK 1,222,759
49 50 ET 1,141,467
I want it to go in order from the lowest number to greatest with df['del'], but when I write df.sort_values('del') I get
del Ticker
0 1 SPY
29 10 BAC
5 11 GE
It appears to sort based on the first digit rather than the whole number. Am I using the correct code, or do I need to completely change it?
Assuming you have numbers as type string you can do:
add leading zeros to the string numbers, which makes lexicographic ordering match numeric ordering
df["del"] = df["del"].str.zfill(10)
df = df.sort_values('del')
or convert the type to integer
df["del"] = df["del"].astype('int') # as recommended by Alex.Kh in comment
#df["del"] = df["del"].map(int) # my initial answer
df = df.sort_values('del')
I also noticed that del seems to be sorted the same way your index is sorted, so you could even do:
df = df.sort_index()
To go from lowest to highest you can also be explicit with .sort_values('del', ascending=True).
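A small self-contained demo of why the original sort misbehaves, using a few of the values from the question: string comparison is character by character, so '10' sorts before '3', while converting to int first gives the numeric order:

```python
import pandas as pd

df = pd.DataFrame({'del': ['1', '10', '11', '3', '5'],
                   'Ticker': ['SPY', 'BAC', 'GE', 'IWM', 'EEM']})

# lexicographic sort: '10' < '3' because '1' < '3' at the first character
lex = df.sort_values('del')['del'].tolist()

# numeric sort: convert the column first, then sort
num = df.assign(**{'del': df['del'].astype(int)}).sort_values('del')['del'].tolist()
```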

pandas multiple dataframe plot

I have two data frames. They have the same structure, but they come from two different models. Basically, I would like to compare them in order to find the differences. The first thing I would like to do is plot the same row from both: one from the first data frame and one from the second.
This is what I do:
I read the two csv file,
PRICES = pd.read_csv('test_model_1.csv',sep=';',index_col=0, header = 0)
PRICES_B = pd.read_csv('bench_mark.csv',sep=';',index_col=0, header = 0)
then I plot the 8th row of both:
rowM = PRICES.iloc[8]
rowB = PRICES_B.iloc[8]
rowM.plot()
rowB.plot()
This does not seem to be the correct way; in particular, I am not able to set the labels or the legend.
This the results:
comparison between the 8th row of the first dataframe and the 8th row of the second dataframe
Could someone suggest the correct way to compare the two data frames and plot a selection of their rows?
Let's prepare some test data:
import numpy as np
import pandas as pd

mtx1 = np.random.rand(10, 8) * 1.1 + 2
mtx2 = np.random.rand(10, 8) + 2
df1 = pd.DataFrame(mtx1)
df2 = pd.DataFrame(mtx2)
example output for df1:
Out[60]:
0 1 2 3
0 2.604748 2.233979 2.575730 2.491230
1 3.005079 2.984622 2.745642 2.082218
2 2.577554 3.001736 2.560687 2.838092
3 2.342114 2.435438 2.449978 2.984128
4 2.416953 2.124780 2.476963 2.766410
5 2.468492 2.662972 2.975939 3.026482
6 2.738153 3.024694 2.916784 2.988288
7 2.082538 3.030582 2.959201 2.438686
8 2.917811 2.798586 2.648060 2.991314
9 2.133571 2.162194 2.085843 2.927913
now let's plot it:
import matplotlib.pyplot as plt
%matplotlib inline

i = range(len(df1.loc[6, :]))  # from 0 to 7, one point per column
plt.plot(i, df1.loc[6, :], label='model 1')    # whole row 6 of df1
plt.plot(i, df2.loc[6, :], label='benchmark')  # whole row 6 of df2
plt.legend()
result:
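A pandas-native alternative, sketched against the same test frames, is to align the two rows side by side in one frame; DataFrame.plot then draws one line per column and builds the legend from the column names (the labels 'model 1' and 'benchmark' are just illustrative):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df1 = pd.DataFrame(rng.random((10, 8)) * 1.1 + 2)
df2 = pd.DataFrame(rng.random((10, 8)) + 2)

# put row 6 of each frame side by side; the dict keys become column names
cmp = pd.concat({'model 1': df1.loc[6, :], 'benchmark': df2.loc[6, :]}, axis=1)
ax = cmp.plot()  # one line per column; legend labels come from the column names
```

Keeping the rows in one frame also makes further comparison easy, e.g. cmp.diff(axis=1) or cmp['model 1'] - cmp['benchmark'].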
