I have a dataframe df with mixed data (float and text) which, when printed, looks like this (it's a very small part of the printing):
0 1
0 Le Maurepas NaN
1 CODE_90 AREA_HA
2 112 194.97
3 121 70.37
4 211 113.86
5 La Rolande NaN
6 CODE_90 AREA_HA
7 112 176.52
8 211 97.28
If necessary, this output can be reproduced by the following code (for example):
import pandas as pd
fst_col = ['Le Maurepas', 'CODE_90', 112, 121, 211, 'La Rolande', 'CODE_90', 112, 211]
snd_col = ['NaN', 'AREA_HA', 194.97, 70.37, 113.86, 'NaN', 'AREA_HA', 176.52, 97.28]
df = pd.DataFrame({'0' : fst_col, '1' : snd_col})
df
I would like to give another structure to my dataframe df and get it to look like this when printed:
Name Code Area
0 Le Maurepas 112 194.97
1 Le Maurepas 121 70.37
2 Le Maurepas 211 113.86
3 La Rolande 112 176.52
4 La Rolande 211 97.28
I browsed SO and I am aware that a function like pivot(index='', columns='', values='') might do the job, but I don't know whether it applies to my case, nor how to apply it.
Should I persist with this function and adjust the index, columns and values parameters, or is there another approach better suited to the structure of my initial dataframe df?
Any help welcome.
IIUC, try:
# convert the "NaN" strings into real missing values
df["1"] = df["1"].replace("NaN", float("nan"))
output = pd.DataFrame()
output["Name"] = df.loc[df["1"].isnull(), "0"].reindex(df.index, method="ffill")
output["Code"] = pd.to_numeric(df["0"], errors="coerce")
output["Area"] = pd.to_numeric(df["1"], errors="coerce")
output = output.dropna().reset_index(drop=True)
>>> output
Name Code Area
0 Le Maurepas 112.0 194.97
1 Le Maurepas 121.0 70.37
2 Le Maurepas 211.0 113.86
3 La Rolande 112.0 176.52
4 La Rolande 211.0 97.28
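For reference, a self-contained sketch of the same forward-fill idea on the sample data from the question. It is only one possible variant: it assumes the name rows are exactly those whose second column holds the string "NaN", and it casts Code back to int at the end. The names is_name and out are illustrative.

```python
import pandas as pd

fst_col = ['Le Maurepas', 'CODE_90', 112, 121, 211, 'La Rolande', 'CODE_90', 112, 211]
snd_col = ['NaN', 'AREA_HA', 194.97, 70.37, 113.86, 'NaN', 'AREA_HA', 176.52, 97.28]
df = pd.DataFrame({'0': fst_col, '1': snd_col})

# Assumption: a row holds a site name when its second column is the string "NaN".
is_name = df['1'].eq('NaN')

out = pd.DataFrame({
    # Keep the names only on name rows, then carry them down to the data rows.
    'Name': df['0'].where(is_name).ffill(),
    # Non-numeric cells (names and the CODE_90 / AREA_HA headers) become NaN.
    'Code': pd.to_numeric(df['0'], errors='coerce'),
    'Area': pd.to_numeric(df['1'], errors='coerce'),
})

# Drop the name and header rows, then restore integer codes.
out = out.dropna().reset_index(drop=True)
out['Code'] = out['Code'].astype(int)
print(out)
```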
You can use:
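# each block starts on the row just before a 'CODE_90' header row
indexes = (df[df['0'].eq('CODE_90')].index - 1).to_list()
indexes.append(len(df))
all_dfs = []
for idx in range(0, len(indexes)-1):
    # .copy() avoids a SettingWithCopyWarning when adding the 'Name' column
    df_temp = df.loc[indexes[idx]:indexes[idx+1]-1].copy()
    print(df_temp)
    df_temp['Name'] = df_temp['0'].iloc[0]
    df_temp.rename(columns={'0': 'Code', '1': 'Area'}, inplace=True)
    # skip the name row and the header row of the block
    all_dfs.append(df_temp.iloc[2:])
df = pd.concat(all_dfs, ignore_index=True)
print(df)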
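The block-by-block idea can also be sketched with groupby instead of manual index arithmetic. This is an assumption-laden variant, not the code above: it assumes every block consists of a name row, a 'CODE_90' header row and then the data rows. The names starts, block_id, pieces and result are illustrative.

```python
import pandas as pd

fst_col = ['Le Maurepas', 'CODE_90', 112, 121, 211, 'La Rolande', 'CODE_90', 112, 211]
snd_col = ['NaN', 'AREA_HA', 194.97, 70.37, 113.86, 'NaN', 'AREA_HA', 176.52, 97.28]
df = pd.DataFrame({'0': fst_col, '1': snd_col})

# A block starts on the row just before each 'CODE_90' header row.
starts = df['0'].eq('CODE_90').shift(-1, fill_value=False)
# Running count of block starts gives every row its block number.
block_id = starts.cumsum()

pieces = []
for _, block in df.groupby(block_id):
    # Rows 0 and 1 of a block are the name and the header; the rest is data.
    body = block.iloc[2:].rename(columns={'0': 'Code', '1': 'Area'})
    body.insert(0, 'Name', block['0'].iloc[0])
    pieces.append(body)

result = pd.concat(pieces, ignore_index=True)
print(result)
```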
I would like to make my data frame more aesthetically appealing and drop what I believe are the unnecessary first row and column from the multi-index. I would like the column headers to be: 'Rk', 'Team','Conf','G','Rec','ADJOE',.....,'WAB'
Any help is much appreciated.
import pandas as pd
url = 'https://www.barttorvik.com/#'
df = pd.read_html(url)
df = df[0]
df
You only have to iterate over the existing columns and select the second value. Then you can set the list of values as new columns:
import pandas as pd
url = 'https://www.barttorvik.com/#'
df = pd.read_html(url)[0]  # read_html returns a list of DataFrames; take the first table
df.columns = [x[1] for x in df.columns]
df.head()
Output:
Rk Team Conf G Rec AdjOE AdjDE Barthag EFG% EFGD% ... ORB DRB FTR FTRD 2P% 2P%D 3P% 3P%D Adj T. WAB
0 1 Gonzaga WCC 24 22-211–0 122.42 89.05 .97491 60.21 421 ... 30.2120 2318 30.4165 21.710 62.21 41.23 37.821 29.111 73.72 4.611
1 2 Houston Amer 25 21-410–2 117.39 89.06 .95982 53.835 42.93 ... 37.26 27.6141 28.2242 33.3247 54.827 424 34.8108 29.418 65.2303 3.416
When you read from HTML, specify the row number you want as header:
df = pd.read_html(url, header=1)[0]
print(df.head())
output:
>>
Rk Team Conf G Rec ... 2P%D 3P% 3P%D Adj T. WAB
0 1 Gonzaga WCC 24 22-211–0 ... 41.23 37.821 29.111 73.72 4.611
1 2 Houston Amer 25 21-410–2 ... 424 34.8108 29.418 65.2303 3.416
2 3 Kentucky SEC 26 21-510–3 ... 46.342 35.478 29.519 68.997 4.89
3 4 Arizona P12 25 23-213–1 ... 39.91 33.7172 31.471 72.99 6.24
4 5 Baylor B12 26 21-59–4 ... 49.2165 35.966 30.440 68.3130 6.15
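Since the page itself may change, here is an offline sketch of flattening a two-level header. The frame below is a hypothetical stand-in for the scraped table (the 'Unnamed: ...' top-level labels are an assumption about what read_html produces here); only the technique carries over.

```python
import pandas as pd

# Hypothetical stand-in for the scraped table: the top header level is noise.
columns = pd.MultiIndex.from_tuples([
    ('Unnamed: 0_level_0', 'Rk'),
    ('Unnamed: 1_level_0', 'Team'),
    ('Unnamed: 2_level_0', 'Conf'),
])
df = pd.DataFrame([[1, 'Gonzaga', 'WCC'], [2, 'Houston', 'Amer']], columns=columns)

# Keep only the second header level; equivalent to [x[1] for x in df.columns].
df.columns = df.columns.get_level_values(1)
print(df)
```

df.droplevel(0, axis=1) would give the same column labels without assigning to df.columns.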
I have a CSV file with 4 columns and n lines.
I want the value in the 4th column of every row to move to a line of its own, right after that row.
For example:
[LN],[cb],[I], [LS]
to
[LN],[cb],[I]
[LS]
that is, if my file is:
[LN1],[cb1],[I1], [LS1]
[LN2],[cb2],[I2], [LS2]
[LN3],[cb2],[I3], [LS3]
[LN4],[cb4],[I4], [LS4]
the output file will look like
[LN1],[cb1],[I1]
[LS1]
[LN2],[cb2],[I2]
[LS2]
[LN3],[cb2],[I3]
[LS3]
[LN4],[cb4],[I4]
[LS4]
Test file:
101 Xavier Mexico City 41 88.0
102 Ann Toronto 28 79.0
103 Jana Prague 33 81.0
104 Yi Shanghai 34 80.0
105 Robin Manchester 38 68.0
Output required:
101 Xavier Mexico City 41
88.0
102 Ann Toronto 28
79.0
103 Jana Prague 33
81.0
104 Yi Shanghai 34
80.0
105 Robin Manchester 38
68.0
Split the dataframe into 2 dataframes, one with the first 3 columns and the other with the last column. Add a helper column to both so you can order them afterwards. Then combine them again and sort first by index (which is identical for entries that were previously in the same row) and then by the helper column.
Since there is no test data, this answer is untested:
from io import StringIO
import pandas as pd
s = """col1,col2,col3,col4
101 Xavier,Mexico City,41,88.0
102 Ann,Toronto,28,79.0
103 Jana,Prague,33,81.0
104 Yi,Shanghai,34,80.0
105 Robin,Manchester,38,68.0"""
df = pd.read_csv(StringIO(s), sep=',')
df1 = df[['col1', 'col2', 'col3']].copy()
df2 = df[['col4']].rename(columns={'col4':'col1'}).copy()
df1['ranking'] = 1
df2['ranking'] = 2
df_out = pd.concat([df1, df2])  # DataFrame.append was removed in pandas 2.0
df_out = df_out.rename_axis('index_name').sort_values(by=['index_name', 'ranking'], ascending=[True, True])
df_out = df_out.drop(['ranking'], axis=1)
Another solution to this is to convert the table to a list, then rearrange the list to reconstruct the table.
import pandas as pd
df = pd.read_csv(r"test_file.csv")
df_list = df.values.tolist()
new_list = []
for x in df_list:
    # Removes last element from list and save it to a variable 'last_x'.
    # This action also modifies the original list
    last_x = x.pop()
    # append the modified list and the last values to an empty list.
    new_list.append(x)
    new_list.append([last_x])
# Use the new list to create the new table...
new_df = pd.DataFrame(new_list)
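If the goal is only to rewrite the file, the same pop-the-last-field idea can be sketched without pandas, using the standard csv module. This is a different technique from the two answers above. The in-memory StringIO objects stand in for the real input and output files, and the sample rows are written without the stray space before the last field.

```python
import csv
from io import StringIO

# Stand-ins for open("input.csv") / open("output.csv", "w"); names are hypothetical.
source = StringIO("[LN1],[cb1],[I1],[LS1]\n[LN2],[cb2],[I2],[LS2]\n")
target = StringIO()

reader = csv.reader(source)
writer = csv.writer(target, lineterminator="\n")

for row in reader:
    if not row:
        continue  # skip blank lines
    writer.writerow(row[:-1])  # everything except the last field
    writer.writerow(row[-1:])  # the last field on a line of its own

print(target.getvalue())
```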
Can I match values in a specific DataFrame column the way the SQL LIKE operator does, then count the matches and store the result in a new column? Here is the code for my dataframe:
import pandas as pd
dataku = pd.DataFrame()
dataku['CIF'] = ['789', '290', '789', '789','290']
dataku['NAMA'] = ['de','ra','de','de','ra']
dataku['SALDO'] = [100,500,800,200,500]
dataku ['PRODUK']=['tabungan','deposito','deposito','tabungan','deposito usd']
dataku.groupby(['CIF','NAMA','PRODUK']).agg({'SALDO':'sum', 'PRODUK':'count'}).rename(columns={'SALDO':'TOTAL SALDO','PRODUK':'TOTAL PRODUK'})
The result I want for the new dataframe looks like this:
CIF NAMA PRODUK TOTAL_SALDO TOTAL_PRODUK GT_SALDO GT_PRODUK
290 ra deposito 500 1 1000 2
deposito usd 500 1
789 de tabungan 300 2 300 2
deposito 800 1 800 1
How can I get the values of the GT_SALDO and GT_PRODUK columns shown in the table above as the final result?
I am not entirely sure this is what you want, but you can group by part of the strings stored in a column. For example, this is your original groupby, stored in df1:
df1 = dataku.groupby(['CIF','NAMA','PRODUK']).agg({'SALDO':'sum', 'PRODUK':'count'}).rename(columns={'SALDO':'TOTAL SALDO','PRODUK':'TOTAL PRODUK'})
This is a groupby that only uses first 8 characters of the 'PRODUK' column:
df2 = dataku.groupby(['CIF','NAMA',dataku['PRODUK'].str.slice(stop=8)]).agg({'SALDO':'sum', 'PRODUK':'count'}).rename(columns={'SALDO':'GT_SALDO','PRODUK':'GT_PRODUK'})
df2 looks like this
GT_SALDO GT_PRODUK
CIF NAMA PRODUK
290 ra deposito 1000 2
789 de deposito 800 1
tabungan 300 2
You can join the two to get something that looks like your desired output:
df1.join(df2)
produces
TOTAL SALDO TOTAL PRODUK GT_SALDO GT_PRODUK
CIF NAMA PRODUK
290 ra deposito 500 1 1000.0 2.0
deposito usd 500 1 NaN NaN
789 de deposito 800 1 800.0 1.0
tabungan 300 2 300.0 2.0
You can fill the NaNs with fillna if they bother you.
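As an alternative to joining and then filling NaNs, the grand totals can be broadcast to every row with groupby.transform. This is a sketch under one assumption that the question does not confirm: products belong together when they share their first word (so 'deposito' and 'deposito usd' form one family). The FAMILY column and the name per_product are illustrative.

```python
import pandas as pd

dataku = pd.DataFrame({
    'CIF': ['789', '290', '789', '789', '290'],
    'NAMA': ['de', 'ra', 'de', 'de', 'ra'],
    'SALDO': [100, 500, 800, 200, 500],
    'PRODUK': ['tabungan', 'deposito', 'deposito', 'tabungan', 'deposito usd'],
})

# Per-product totals, kept as ordinary columns.
per_product = (
    dataku.groupby(['CIF', 'NAMA', 'PRODUK'], as_index=False)
          .agg(TOTAL_SALDO=('SALDO', 'sum'), TOTAL_PRODUK=('SALDO', 'count'))
)

# Assumption: the product family is the first word of PRODUK.
per_product['FAMILY'] = per_product['PRODUK'].str.split().str[0]

# transform returns one value per row, so every row of a family gets its total.
grouped = per_product.groupby(['CIF', 'NAMA', 'FAMILY'])
per_product['GT_SALDO'] = grouped['TOTAL_SALDO'].transform('sum')
per_product['GT_PRODUK'] = grouped['TOTAL_PRODUK'].transform('sum')
print(per_product)
```

Unlike the join, this repeats the family total on every row of the family instead of leaving the second row empty.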
I have this df:
Date Plate Route Speed VehicleType
0 2020-11-03 13:54:00 0660182 Route 66 32 Wagon
1 2020-11-03 13:25:03 939CH003 Route 35 24 Truck
2 2020-11-03 09:27:11 WH3457 Route 02 41 Bus
and so on. I need the time differences between consecutive records of the same plate, which I can easily obtain with:
df.groupby('Plate').Date.diff( )
Then I sort (otherwise I would get differences between different dates/plates, which I don't need) and group like this:
df2 = df.sort_values(by=['Plate', 'Date']).groupby('Plate').Date.diff().dt.total_seconds().reset_index()
I end up with a df (after renaming one column) like this:
index Difference (s)
0 34517 NaN
1 377539 33.0
2 119714 34.0
3 300900 765.0
That's not what I need (the "index" column is supposed to hold the plates). What I want is something like:
Plate Difference
0 WH3457 54.0
1 9W432T 24.0
2 947CH05 33.0
so that this df can be merged into the original one (left_on and right_on) by plate number for some filters. Pandas says the merge can't be done because the "index" column is just numbers, while the plate column is clearly a string (I somehow lose the plate values when sorting).
So, how can I obtain this plate/difference df? (Sorting by plate and date is a must, otherwise the differences make no sense.)
I've been struggling with this and can't get it. Thank you in advance.
EDIT:
This is a bigger chunk of the original df (sorry about the alignment and the vehicle types in Spanish):
Date Plate Route Latitude Longitud Speed VehicleType
0 2020-11-17 13:54:00+00:00 0660182 RUTA 66 19.333958 -99.199240 10 AUTOBUS LARGO (MAYOR A 10 M DE LONGITUD)
1 2020-11-17 13:54:00+00:00 939CH001M RUTA 51 19.256760 -98.955510 22 AUTOBUS LARGO (MAYOR A 10 M DE LONGITUD)
2 2020-11-17 13:54:00+00:00 596NZ008M RUTA 102 19.448385 -98.952400 0 VAGONETA
3 2020-11-17 13:54:00+00:00 0790024 RUTA 79 19.429462 -99.150820 0 MICROBUS (MENOR A 7.5 M DE LONGITUD)
4 2020-11-17 13:54:01+00:00 947CH045M RUTA 50 19.282007 -99.009000 28 MICROBUS (MENOR A 7.5 M DE LONGITUD)
... ... ... ... ... ... ... ...
1279721 2020-11-18 05:59:57+00:00 0120414 RUTA 12 19.357872 -99.077920 0 MICROBUS (MENOR A 7.5 M DE LONGITUD)
1279722 2020-11-18 05:59:58+00:00 1090016 CETRAM XOCHIMILCO 200826 19.295107 -99.102936 0 MICROBUS (MENOR A 7.5 M DE LONGITUD)
1279723 2020-11-18 05:59:59+00:00 0350144 RUTA 35 19.297995 -99.061150 0 VAGONETA
1279724 2020-11-18 05:59:59+00:00 006908 RUTA 106 19.490650 -99.174640 0 AUTOBUS CORTO (ENTRE 7.5 Y 10 M DE LONGITUD)
1279725 2020-11-18 05:59:59+00:00 0340071 RUTA 34 19.324417 -99.165500 1 MICROBUS (MENOR A 7.5 M DE LONGITUD)
If you desire to place your calculation (diff in seconds) back to the original dataframe, you can use pandas groupby.transform instead:
df['diff_in_sec'] = df.groupby('Plate').Date.transform(lambda x: x.diff().dt.total_seconds())
Furthermore, since diff doesn't perform any aggregation, df2 has the same number of rows as df and keeps its index labels, so that index can be used to map the values back to df like this:
df2 = df.sort_values(by=['Plate', 'Date']).groupby('Plate').Date.diff().dt.total_seconds()
# this
df.loc[df2.index, 'diff_in_sec'] = df2
# or this
df2.name = 'diff_in_sec'
df = df.merge(df2, left_index=True, right_index=True)
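To get the Plate/Difference frame the question asks for, one option is to compute the diff on the sorted frame and keep the Plate column next to it, rather than resetting the index. A minimal sketch follows. The two plates come from the question, but the second timestamp for each plate is invented for illustration, and the names ordered and plate_diff are hypothetical.

```python
import pandas as pd

# Illustrative data: the second reading of each plate is made up.
df = pd.DataFrame({
    'Date': pd.to_datetime([
        '2020-11-03 13:54:00', '2020-11-03 13:25:03',
        '2020-11-03 13:54:33', '2020-11-03 13:25:27',
    ]),
    'Plate': ['0660182', '939CH003', '0660182', '939CH003'],
})

# Sort first so that diff compares consecutive readings of the same plate.
ordered = df.sort_values(['Plate', 'Date'])
ordered['Difference'] = ordered.groupby('Plate')['Date'].diff().dt.total_seconds()

# Keep the plate next to its difference; the first reading of a plate has no diff.
plate_diff = ordered[['Plate', 'Difference']].dropna().reset_index(drop=True)
print(plate_diff)
```

Because the Plate column survives, plate_diff can be merged on 'Plate' directly.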
With my code I can join 2 databases in one. Now, I need to do the same with another database file.
archivo1:
Fecha Cliente Impresiones Impresiones 2 Revenue
20/12/17 Jose 1312 35 $12
20/12/17 Martin 12 56 $146
20/12/17 Pedro 5443 124 $1,256
20/12/17 Esteban 667 1235 $1
archivo2:
Fecha Cliente Impresiones Impresiones 2 Revenue
21/12/17 Jose 25 5 $2
21/12/17 Martin 6347 523 $123
21/12/17 Pedro 2368 898 $22
21/12/17 Esteban 235 99 $7,890
archivo:
Fecha Cliente Impresiones Impresiones 2 Revenue
22/12/17 Peter 55 5 $2
22/12/17 Juan 634527 523 $123
22/12/17 Pedro 836 898 $22
22/12/17 Esteban 125 99 $7,890
I have these results:
The problem is that I need to add the new database (archivo) to the Data.xlsx file, so that it looks like this:
Code:
import pandas as pd
import pandas.io.formats.excel
import numpy as np
# Read both files and load them into DataFrames
df1 = pd.read_excel("archivo1.xlsx")
df2 = pd.read_excel("archivo2.xlsx")
df = pd.concat([df1, df2])\
    .set_index(['Cliente', 'Fecha'])\
    .stack()\
    .unstack(-2)\
    .sort_index(ascending=[True, False])
i, j = df.index.get_level_values(0), df.index.get_level_values(1)
k = np.insert(j.values, np.flatnonzero(j == 'Revenue'), i.unique())
idx = pd.MultiIndex.from_arrays([i.unique().repeat(len(df.index.levels[1]) + 1), k])
df = df.reindex(idx).fillna('')
df.index = df.index.droplevel()
# Create the output xlsx
pandas.io.formats.excel.header_style = None
with pd.ExcelWriter("Data.xlsx",
                    engine='xlsxwriter',
                    date_format='dd/mm/yyyy',
                    datetime_format='dd/mm/yyyy') as writer:
    df.to_excel(writer, sheet_name='Sheet1')
Extending my comment as an answer, I'd recommend creating a function that will reshape your dataframes to conform to a given format. I'd recommend doing this simply because it is much easier to just reshape your data, rather than reshape new entries to conform to the existing structure. This is because your current structure is a format that makes it extremely hard to work with (take it from me).
So, the easiest thing to do would be to create a function -
def process(dfs):
    df = pd.concat(dfs)\
        .set_index(['Cliente', 'Fecha'])\
        .stack()\
        .unstack(-2)\
        .sort_index(ascending=[True, False])
    i = df.index.get_level_values(0)
    j = df.index.get_level_values(1)
    y = np.insert(j.values, np.flatnonzero(j == 'Revenue'), i.unique())
    x = i.unique().repeat(len(df.index.levels[1]) + 1)
    df = df.reindex(pd.MultiIndex.from_arrays([x, y])).fillna('')
    df.index = df.index.droplevel()
    return df
Now, load your dataframes -
df_list = []
for file in ['archivo1.xlsx', 'archivo2.xlsx', ...]:
    df_list.append(pd.read_excel(file))
Now, call the process function with your df_list -
df = process(df_list)
df
Fecha 20/12/17 21/12/17
Esteban
Revenue $1 $7,890
Impresiones2 1235 99
Impresiones 667 235
Jose
Revenue $12 $2
Impresiones2 35 5
Impresiones 1312 25
Martin
Revenue $146 $123
Impresiones2 56 523
Impresiones 12 6347
Pedro
Revenue $1,256 $22
Impresiones2 124 898
Impresiones 5443 2368
Save df to a new excel file. Repeat the process for every new dataframe that enters the system.
In summary, your entire code listing would look like this -
import pandas as pd
import pandas.io.formats.excel
import numpy as np
def process(dfs):
    df = pd.concat(dfs)\
        .set_index(['Cliente', 'Fecha'])\
        .stack()\
        .unstack(-2)\
        .sort_index(ascending=[True, False])
    i = df.index.get_level_values(0)
    j = df.index.get_level_values(1)
    y = np.insert(j.values, np.flatnonzero(j == 'Revenue'), i.unique())
    x = i.unique().repeat(len(df.index.levels[1]) + 1)
    df = df.reindex(pd.MultiIndex.from_arrays([x, y])).fillna('')
    df.index = df.index.droplevel()
    return df
if __name__ == '__main__':
    df_list = []
    for file in ['archivo1.xlsx', 'archivo2.xlsx']:
        df_list.append(pd.read_excel(file))
    df = process(df_list)
    with pd.ExcelWriter("test.xlsx",
                        engine='xlsxwriter',
                        date_format='dd/mm/yyyy',
                        datetime_format='dd/mm/yyyy') as writer:
        df.to_excel(writer, sheet_name='Sheet1')
The alternative to this tedious process is to change your dataset structure, and reconsider a more viable alternative that makes it much easier to add new data to existing data without having to keep reshaping everything from scratch. This is something you'll have to sit down and think about.
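One possible reading of "a more viable structure" is to keep the data in long (tidy) form and build the wide report only when exporting. The sketch below is an assumption about what that could look like, not the only option. It uses a two-client, two-column subset of the sample data as in-memory frames instead of the Excel files, and the names to_long, store, report, Metric and Value are hypothetical.

```python
import pandas as pd

# Subset of the sample data, built in memory instead of read from Excel.
archivo1 = pd.DataFrame({
    'Fecha': ['20/12/17', '20/12/17'],
    'Cliente': ['Jose', 'Martin'],
    'Impresiones': [1312, 12],
    'Revenue': ['$12', '$146'],
})
archivo2 = pd.DataFrame({
    'Fecha': ['21/12/17', '21/12/17'],
    'Cliente': ['Jose', 'Martin'],
    'Impresiones': [25, 6347],
    'Revenue': ['$2', '$123'],
})

def to_long(frame):
    # One row per (Fecha, Cliente, Metric): new files are simply appended.
    return frame.melt(id_vars=['Fecha', 'Cliente'],
                      var_name='Metric', value_name='Value')

# The long table is the stored structure; adding a file is a plain concat.
store = pd.concat([to_long(archivo1), to_long(archivo2)], ignore_index=True)

# The wide report is derived on demand, one column per date.
report = store.pivot(index=['Cliente', 'Metric'], columns='Fecha', values='Value')
print(report)
```

With this layout a third file only needs one more to_long call before the concat; the report itself is never edited in place.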