Incorporating new data into an existing dataframe - python

My code joins two Excel files into one DataFrame. Now I need to do the same with a third file.
archivo1:
Fecha     Cliente  Impresiones  Impresiones 2  Revenue
20/12/17  Jose     1312         35             $12
20/12/17  Martin   12           56             $146
20/12/17  Pedro    5443         124            $1,256
20/12/17  Esteban  667          1235           $1
archivo2:
Fecha     Cliente  Impresiones  Impresiones 2  Revenue
21/12/17  Jose     25           5              $2
21/12/17  Martin   6347         523            $123
21/12/17  Pedro    2368         898            $22
21/12/17  Esteban  235          99             $7,890
archivo:
Fecha     Cliente  Impresiones  Impresiones 2  Revenue
22/12/17  Peter    55           5              $2
22/12/17  Juan     634527       523            $123
22/12/17  Pedro    836          898            $22
22/12/17  Esteban  125          99             $7,890
I get these results:
The problem is that I need to add the new file (archivo) to the existing Data.xlsx file, so that it looks like this:
Code:
import pandas as pd
import pandas.io.formats.excel
import numpy as np
# Read both files and load them into DataFrames
df1 = pd.read_excel("archivo1.xlsx")
df2 = pd.read_excel("archivo2.xlsx")
df = pd.concat([df1, df2])\
       .set_index(['Cliente', 'Fecha'])\
       .stack()\
       .unstack(-2)\
       .sort_index(ascending=[True, False])
i, j = df.index.get_level_values(0), df.index.get_level_values(1)
k = np.insert(j.values, np.flatnonzero(j == 'Revenue'), i.unique())
idx = pd.MultiIndex.from_arrays([i.unique().repeat(len(df.index.levels[1]) + 1), k])
df = df.reindex(idx).fillna('')
df.index = df.index.droplevel()
# Create the output xlsx
pandas.io.formats.excel.header_style = None
with pd.ExcelWriter("Data.xlsx",
                    engine='xlsxwriter',
                    date_format='dd/mm/yyyy',
                    datetime_format='dd/mm/yyyy') as writer:
    df.to_excel(writer, sheet_name='Sheet1')

Extending my comment as an answer: I'd recommend writing a function that reshapes your dataframes into the required format. It is much easier to rebuild the whole report from the raw data than to reshape new entries so that they fit the existing structure, because your current layout is extremely hard to work with (take it from me).
So, the easiest thing to do would be to create a function:
def process(dfs):
    df = pd.concat(dfs)\
           .set_index(['Cliente', 'Fecha'])\
           .stack()\
           .unstack(-2)\
           .sort_index(ascending=[True, False])
    i = df.index.get_level_values(0)
    j = df.index.get_level_values(1)
    y = np.insert(j.values, np.flatnonzero(j == 'Revenue'), i.unique())
    x = i.unique().repeat(len(df.index.levels[1]) + 1)
    df = df.reindex(pd.MultiIndex.from_arrays([x, y])).fillna('')
    df.index = df.index.droplevel()
    return df
Now, load your dataframes -
df_list = []
for file in ['archivo1.xlsx', 'archivo2.xlsx', ...]:
    df_list.append(pd.read_excel(file))
Now, call the process function with your df_list -
df = process(df_list)
df
Fecha         20/12/17  21/12/17
Esteban
Revenue             $1    $7,890
Impresiones2      1235        99
Impresiones        667       235
Jose
Revenue            $12        $2
Impresiones2        35         5
Impresiones       1312        25
Martin
Revenue           $146      $123
Impresiones2        56       523
Impresiones         12      6347
Pedro
Revenue         $1,256       $22
Impresiones2       124       898
Impresiones       5443      2368
Save df to a new Excel file. Repeat the process whenever a new dataframe enters the system.
In summary, your entire code listing would look like this:
import pandas as pd
import pandas.io.formats.excel
import numpy as np
def process(dfs):
    df = pd.concat(dfs)\
           .set_index(['Cliente', 'Fecha'])\
           .stack()\
           .unstack(-2)\
           .sort_index(ascending=[True, False])
    i = df.index.get_level_values(0)
    j = df.index.get_level_values(1)
    y = np.insert(j.values, np.flatnonzero(j == 'Revenue'), i.unique())
    x = i.unique().repeat(len(df.index.levels[1]) + 1)
    df = df.reindex(pd.MultiIndex.from_arrays([x, y])).fillna('')
    df.index = df.index.droplevel()
    return df
if __name__ == '__main__':
    df_list = []
    for file in ['archivo1.xlsx', 'archivo2.xlsx']:
        df_list.append(pd.read_excel(file))
    df = process(df_list)
    with pd.ExcelWriter("test.xlsx",
                        engine='xlsxwriter',
                        date_format='dd/mm/yyyy',
                        datetime_format='dd/mm/yyyy') as writer:
        df.to_excel(writer, sheet_name='Sheet1')
The alternative to this tedious process is to change how you store the data: pick a structure that lets you add new data to the existing data without reshaping everything from scratch each time. This is something you'll have to sit down and think about.
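One possible shape for such a structure, as a rough sketch rather than the answerer's own method: keep the raw data in a long ("tidy") table with one row per client, date and metric, append each new file to it, and build the wide report only when exporting. The helper names `to_long` and `to_report` and the two small inline frames are made up for illustration; the values are borrowed from the question's files.

```python
import pandas as pd

def to_long(df):
    """Melt one daily file into long rows: Fecha, Cliente, Metric, Value."""
    return df.melt(id_vars=["Fecha", "Cliente"], var_name="Metric", value_name="Value")

def to_report(long_df):
    """Pivot the long table: one row per (Cliente, Metric), one column per Fecha.

    pivot raises if the same (Cliente, Metric, Fecha) combination appears twice.
    """
    return long_df.pivot(index=["Cliente", "Metric"], columns="Fecha", values="Value")

# Small stand-ins for two of the Excel files (hypothetical subset of the data)
day1 = pd.DataFrame({"Fecha": ["20/12/17", "20/12/17"],
                     "Cliente": ["Jose", "Pedro"],
                     "Impresiones": [1312, 5443],
                     "Revenue": ["$12", "$1,256"]})
day3 = pd.DataFrame({"Fecha": ["22/12/17", "22/12/17"],
                     "Cliente": ["Peter", "Pedro"],
                     "Impresiones": [55, 836],
                     "Revenue": ["$2", "$22"]})

store = to_long(day1)                                         # existing data
store = pd.concat([store, to_long(day3)], ignore_index=True)  # append the new file
report = to_report(store)                                     # wide view for export
print(report)
```

With this layout, adding a file is a single `pd.concat`, and clients that appear on only some dates (Peter here) simply get missing values in the other date columns.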

Related

How to pivot a specific dataframe?

I have a dataframe df with mixed data (float and text) which, when printed, looks like this (it's a very small part of the printing):
             0        1
0  Le Maurepas      NaN
1      CODE_90  AREA_HA
2          112   194.97
3          121    70.37
4          211   113.86
5   La Rolande      NaN
6      CODE_90  AREA_HA
7          112   176.52
8          211    97.28
If necessary, this output can be reproduced by the following code (for example):
import pandas as pd
fst_col = ['Le Maurepas', 'CODE_90', 112, 121, 211, 'La Rolande', 'CODE_90', 112, 211]
snd_col = ['NaN', 'AREA_HA', 194.97, 70.37, 113.86, 'NaN', 'AREA_HA', 176.52, 97.28]
df = pd.DataFrame({'0' : fst_col, '1' : snd_col})
df
I would like to give another structure to my dataframe df and get it to look like this when printed:
          Name  Code    Area
0  Le Maurepas   112  194.97
1  Le Maurepas   121   70.37
2  Le Maurepas   211  113.86
3   La Rolande   112  176.52
4   La Rolande   211   97.28
I browsed SO and I am aware that a function like pivot(index='', columns='', values='') might do the job, but I don't know whether it applies in my case or how to use it.
Should I keep trying with this function, adjusting the index, columns and values parameters, or is there an approach better suited to the structure of my initial dataframe df?
Any help is welcome.
IIUC, try:
# turn the string "NaN" into a real missing value
df["1"] = df["1"].mask(df["1"] == "NaN")
output = pd.DataFrame()
output["Name"] = df.loc[df["1"].isnull(), "0"].reindex(df.index, method="ffill")
output["Code"] = pd.to_numeric(df["0"], errors="coerce")
output["Area"] = pd.to_numeric(df["1"], errors="coerce")
output = output.dropna().reset_index(drop=True)
>>> output
          Name   Code    Area
0  Le Maurepas  112.0  194.97
1  Le Maurepas  121.0   70.37
2  Le Maurepas  211.0  113.86
3   La Rolande  112.0  176.52
4   La Rolande  211.0   97.28
You can use:
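# each block starts one row above its 'CODE_90' header row
indexes = (df[df['0'].eq('CODE_90')].index - 1).to_list()
indexes.append(len(df))
all_dfs = []
for idx in range(0, len(indexes)-1):
    # copy the slice so the assignments below do not touch the original frame
    df_temp = df.loc[indexes[idx]:indexes[idx+1]-1].copy()
    print(df_temp)
    df_temp['Name'] = df_temp['0'].iloc[0]
    df_temp.rename(columns={'0': 'Code', '1': 'Area'}, inplace=True)
    # skip the name row and the header row of the block
    all_dfs.append(df_temp.iloc[2:])
df = pd.concat(all_dfs, ignore_index=True)
print(df)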

How do I make my dataframe single-index from a MultiIndex?

I would like to make my data frame more aesthetically appealing and drop what I believe are the unnecessary first row and column from the multi-index. I would like the column headers to be: 'Rk', 'Team','Conf','G','Rec','ADJOE',.....,'WAB'
Any help is much appreciated.
import pandas as pd
url = 'https://www.barttorvik.com/#'
df = pd.read_html(url)
df = df[0]
df
You only have to iterate over the existing columns and select the second value. Then you can set the list of values as new columns:
import pandas as pd
url = 'https://www.barttorvik.com/#'
df = pd.read_html(url)[0]  # read_html returns a list of tables; take the first
df.columns = [x[1] for x in df.columns]
df.head()
Output:
Rk Team Conf G Rec AdjOE AdjDE Barthag EFG% EFGD% ... ORB DRB FTR FTRD 2P% 2P%D 3P% 3P%D Adj T. WAB
0 1 Gonzaga WCC 24 22-211–0 122.42 89.05 .97491 60.21 421 ... 30.2120 2318 30.4165 21.710 62.21 41.23 37.821 29.111 73.72 4.611
1 2 Houston Amer 25 21-410–2 117.39 89.06 .95982 53.835 42.93 ... 37.26 27.6141 28.2242 33.3247 54.827 424 34.8108 29.418 65.2303 3.416
When you read from HTML, specify the row number you want as header:
df = pd.read_html(url, header=1)[0]
print(df.head())
output:
Rk Team Conf G Rec ... 2P%D 3P% 3P%D Adj T. WAB
0 1 Gonzaga WCC 24 22-211–0 ... 41.23 37.821 29.111 73.72 4.611
1 2 Houston Amer 25 21-410–2 ... 424 34.8108 29.418 65.2303 3.416
2 3 Kentucky SEC 26 21-510–3 ... 46.342 35.478 29.519 68.997 4.89
3 4 Arizona P12 25 23-213–1 ... 39.91 33.7172 31.471 72.99 6.24
4 5 Baylor B12 26 21-59–4 ... 49.2165 35.966 30.440 68.3130 6.15
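Both answers can be checked without fetching the page. The sketch below builds a small frame with two header levels as a stand-in for what `read_html` returns; the top-level labels (`Unnamed: ...`) and the three columns are assumptions for illustration, not the site's actual header. It then drops the outer level with `droplevel`, which is an alternative to the list comprehension shown earlier.

```python
import pandas as pd

# Stand-in for the scraped table: two header rows, with a made-up top level
columns = pd.MultiIndex.from_tuples([
    ("Unnamed: 0_level_0", "Rk"),
    ("Unnamed: 1_level_0", "Team"),
    ("Unnamed: 2_level_0", "Conf"),
])
df = pd.DataFrame([[1, "Gonzaga", "WCC"], [2, "Houston", "Amer"]], columns=columns)

# Drop the outer (first) header level and keep the inner one
df.columns = df.columns.droplevel(0)
# Equivalent: df.columns = df.columns.get_level_values(1)

print(df)
```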

Move the 4th column in each row to next row in python

I have a CSV file with 4 columns and n lines.
I want the 4th column's value in every line to move to the next line.
For example:
[LN],[cb],[I], [LS]
to
[LN],[cb],[I]
[LS]
that is, if my file is:
[LN1],[cb1],[I1], [LS1]
[LN2],[cb2],[I2], [LS2]
[LN3],[cb2],[I3], [LS3]
[LN4],[cb4],[I4], [LS4]
the output file will look like
[LN1],[cb1],[I1]
[LS1]
[LN2],[cb2],[I2]
[LS2]
[LN3],[cb2],[I3]
[LS3]
[LN4],[cb4],[I4]
[LS4]
Test file:
101 Xavier Mexico City 41 88.0
102 Ann Toronto 28 79.0
103 Jana Prague 33 81.0
104 Yi Shanghai 34 80.0
105 Robin Manchester 38 68.0
Output required:
101 Xavier Mexico City 41
88.0
102 Ann Toronto 28
79.0
103 Jana Prague 33
81.0
104 Yi Shanghai 34
80.0
105 Robin Manchester 38
68.0
Split the dataframe into 2 dataframes, one with the first 3 columns and the other with the last column. Add a helper column to both so you can order them afterwards. Then combine them again and sort first by index (which is identical for entries that were previously in the same row) and then by the helper column.
Since there is no test data, this answer is untested:
from io import StringIO
import pandas as pd
s = """col1,col2,col3,col4
101 Xavier,Mexico City,41,88.0
102 Ann,Toronto,28,79.0
103 Jana,Prague,33,81.0
104 Yi,Shanghai,34,80.0
105 Robin,Manchester,38,68.0"""
df = pd.read_csv(StringIO(s), sep=',')
df1 = df[['col1', 'col2', 'col3']].copy()
df2 = df[['col4']].rename(columns={'col4':'col1'}).copy()
df1['ranking'] = 1
df2['ranking'] = 2
df_out = pd.concat([df1, df2])  # DataFrame.append was removed in pandas 2.0
df_out = df_out.rename_axis('index_name').sort_values(by=['index_name', 'ranking'], ascending=[True, True])
df_out = df_out.drop(['ranking'], axis=1)
Another solution to this is to convert the table to a list, then rearrange the list to reconstruct the table.
import pandas as pd
df = pd.read_csv(r"test_file.csv")
df_list = df.values.tolist()
new_list = []
for x in df_list:
    # Removes last element from list and save it to a variable 'last_x'.
    # This action also modifies the original list
    last_x = x.pop()
    # append the modified list and the last values to an empty list.
    new_list.append(x)
    new_list.append([last_x])
# Use the new list to create the new table...
new_df = pd.DataFrame(new_list)
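If the goal is only to rewrite the file, the same rearrangement can be done without pandas. This is a minimal sketch using the standard `csv` module; the function name `split_last_field` is made up, and in-memory `StringIO` buffers stand in for the real input and output files.

```python
import csv
import io

def split_last_field(src, dst):
    """Copy rows from src to dst, writing the last field of each row on its own line."""
    writer = csv.writer(dst, lineterminator="\n")
    # skipinitialspace drops the blank that follows a comma, as in "[I1], [LS1]"
    for row in csv.reader(src, skipinitialspace=True):
        if not row:  # ignore blank lines
            continue
        writer.writerow(row[:-1])  # every field except the last
        writer.writerow(row[-1:])  # the last field alone on the next line

src = io.StringIO("LN1,cb1,I1, LS1\nLN2,cb2,I2, LS2\n")
dst = io.StringIO()
split_last_field(src, dst)
print(dst.getvalue())
```

For real files, `src` and `dst` would be file objects opened with `newline=""`, as the `csv` module documentation recommends.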

Any method to avoid creating individual files when using groupby and sort_values in pandas

This is a small part of my dataset, which contains thousands of rows:
designation names runs wickets catches
batsman brendon mccullum 78 0 12
bowler shane bond 0 3 0
bowler mitchell mcclenaghan 20 1 1
batsman kane williamson 192 0 7
wicketkeeper brendon mccullum 78 0 12
batsman daniel vettori 65 11 3
wicketkeeper luke ronchi 7 0 4
bowler daniel vettori 65 11 3
batsman martin guptill 120 0 2
I need to split the dataset by name, calculate the weightage for each column, and then append the results to the same Excel sheet. This is my code:
df1 = df.sort_values('names')
for i, g in df1.groupby('names'):
    g.to_csv('{}'.format(i) + '-names' + '.csv', header=True, index_label=True)
This code splits the main file into one intermediate file per name; I then run a for loop to perform the calculation on all intermediate files.
filenames = glob.glob('*-names.csv')
for files_ in filenames:
    df2 = pd.read_csv(files_)
    ### perform required calculations
    df.to_excel(writer, 'Sheet1', index=False, header=True)
writer.save()
This code works for me, but it creates a huge number of intermediate files. Is there a way to bypass the file-creation step?
It seems you need to process each group and then write to Excel:
df1 = df.sort_values('names')
for i, g in df1.groupby('names'):
    print(g)
    # perform required calculations with g
    g.to_excel(writer, sheet_name='Sheet1', index=False, header=True)
writer.close()
Or apply a custom function to each group:
def f(x):
    print(x)
    # perform required calculations with x
    return x
df2 = df1.groupby('names').apply(f)
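To make the in-memory idea concrete, here is a small sketch that reduces each name to one row and collects the rows into a single frame, with no intermediate CSV files. The question does not define "weightage", so the `summarise` function below is a placeholder calculation (row count and per-column maximum), and the sample frame is a subset of the question's data.

```python
import pandas as pd

df = pd.DataFrame({
    "designation": ["batsman", "bowler", "wicketkeeper", "batsman"],
    "names": ["brendon mccullum", "shane bond", "brendon mccullum", "kane williamson"],
    "runs": [78, 0, 78, 192],
    "wickets": [0, 3, 0, 0],
    "catches": [12, 0, 12, 7],
})

def summarise(group):
    """Placeholder per-name calculation: reduce one group to a single row."""
    return pd.Series({
        "rows": len(group),
        "runs": group["runs"].max(),
        "wickets": group["wickets"].max(),
        "catches": group["catches"].max(),
    })

# Process every group in memory; groupby yields (name, sub-frame) pairs sorted by name
parts = [summarise(g).rename(name) for name, g in df.groupby("names")]
result = pd.DataFrame(parts).rename_axis("names").reset_index()
print(result)
```

The combined `result` can then be written once, for example with `result.to_excel("output.xlsx", index=False)`, which requires an Excel engine such as openpyxl to be installed.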

Create dictionary from csv using pandas with date as key

I want to create a dictionary from the table below:
ID ArCityArCountry DptCityDptCountry DateDpt DateAr
1922 ParisFrance NewYorkUnitedState 2008-03-10 2001-02-02
1002 LosAngelesUnitedState California UnitedState 2008-03-10 2008-12-01
1901 ParisFrance LagosNigeria 2001-03-05 2001-02-02
1922 ParisFrance NewYorkUnitedState 2011-02-03 2008-12-01
1002 ParisFrance CaliforniaUnitedState 2003-03-04 2002-03-04
1099 ParisFrance BeijingChina 2011-02-03 2009-02-04
1901 LosAngelesUnitedState ParisFrance 2001-03-05 2001-02-02
import csv
import pandas as pd
# Merge the split city/country fields and write a cleaned copy of the file
with open("testfile.csv", newline="") as src:
    data = [[row[0], row[1] + row[2], row[3] + row[4], row[5], row[6]]
            for row in csv.reader(src)]
print(data)
with open("data.csv", "w", newline="") as out:
    csv.writer(out).writerows(data)
df = pd.read_csv('data.csv')
# Parse both date columns
df.DateDpt = pd.to_datetime(df.DateDpt, format='%Y-%m-%d')
df.DateAr = pd.to_datetime(df.DateAr, format='%Y-%m-%d')
print(df)
dept_cities = df.groupby('ArCityArCountry')
for city, departures in dept_cities:
    print(city)
    print([list(r) for r in departures.loc[:, ['ID', 'DptCityDptCountry', 'DateDpt', 'DateAr']].to_records()])
Expected output:
ParisFrance = { DateAr, ID, ArCityArCountry, DptCityDptCountry}
Note: I want to group by ArCityArCountry and DptCityDptCountry.
You will notice I didn't include DateDpt: I want to select all IDs whose stay falls between DateAr and DateDpt and who were actually in ParisFrance or CaliforniaUnitedStates during the specified period.
For example, Mr A was in Paris from 1999-10-02 until 2013-12-12, and Mr B arrived in Paris on 2010-11-04 and left on 2012-09-09. That means Mr A and Mr B were in Paris at the same time, since Mr B's visit falls within the time Mr A was there.
CaliforniaUnitedStates = { DateAr, ID, ArCityArCountry, DptCityDptCountry}
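One way to read this request, offered as a sketch under assumptions rather than a definitive answer: build a dictionary keyed by arrival city whose values are the matching rows, and add a helper that returns the IDs present in a city during a given period. It assumes DateAr is the arrival date, DateDpt is the departure date, and a stay is the interval between them. The helper name `ids_present` and the four sample rows (chosen from the table so that DateAr is not after DateDpt) are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1922, 1901, 1099, 1901],
    "ArCityArCountry": ["ParisFrance", "ParisFrance", "ParisFrance", "LosAngelesUnitedState"],
    "DptCityDptCountry": ["NewYorkUnitedState", "LagosNigeria", "BeijingChina", "ParisFrance"],
    "DateDpt": ["2008-03-10", "2001-03-05", "2011-02-03", "2001-03-05"],
    "DateAr": ["2001-02-02", "2001-02-02", "2009-02-04", "2001-02-02"],
})
for col in ("DateDpt", "DateAr"):
    df[col] = pd.to_datetime(df[col], format="%Y-%m-%d")

# One dictionary entry per arrival city; each value is a list of row dicts
by_city = {
    city: group[["ID", "DptCityDptCountry", "DateAr", "DateDpt"]].to_dict("records")
    for city, group in df.groupby("ArCityArCountry")
}

def ids_present(df, city, start, end):
    """IDs whose stay [DateAr, DateDpt] in `city` overlaps the period [start, end]."""
    start, end = pd.Timestamp(start), pd.Timestamp(end)
    mask = (
        (df["ArCityArCountry"] == city)
        & (df["DateAr"] <= end)      # arrived before the period ended
        & (df["DateDpt"] >= start)   # left after the period started
    )
    return sorted(df.loc[mask, "ID"].unique().tolist())

print(by_city["ParisFrance"])
print(ids_present(df, "ParisFrance", "2001-02-15", "2001-02-20"))
```

Two stays overlap exactly when each one starts before the other ends, which is the condition the mask encodes.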
