I want to append a dataframe to an Excel file every time the code executes, starting at the first empty row of the sheet. Here is the code I am using:
import pandas as pd
from openpyxl import load_workbook

def append_df_to_excel(df, excel_path):
    df_excel = pd.read_excel(excel_path)
    result = pd.concat([df_excel, df], ignore_index=True)
    result.to_excel(excel_path)
data_set1 = {
    'Name': ['Rohit', 'Mohit'],
    'Roll no': ['01', '02'],
    'maths': ['93', '63']}
df1 = pd.DataFrame(data_set1)
append_df_to_excel(df1, r'C:\Users\kashk\OneDrive\Documents\ScreenStocks.xlsx')
My desired output (after 3 code runs):
Rohit 1 93
Mohit 2 63
Rohit 1 93
Mohit 2 63
Rohit 1 93
Mohit 2 63
But what I get:
Unnamed: 0.1 Unnamed: 0 Name Roll no maths
0 0 0 Rohit 1 93
1 1 1 Mohit 2 63
2 2 Rohit 1 93
3 3 Mohit 2 63
4 Rohit 1 93
5 Mohit 2 63
Not sure where I am going wrong.
It's happening because, by default, functions like to_excel and to_csv write the dataframe's index as an extra column. So every time you save the file, a new index column is added.
That's why you just need to change the line where you save your dataframe to a file:
result.to_excel(excel_path, index=False)
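Putting it together, a minimal sketch of the corrected helper (assuming the Excel file already exists; only the save line changes):

import pandas as pd

def append_df_to_excel(df, excel_path):
    # Read the existing sheet, append the new rows at the bottom,
    # and save without writing the dataframe index as an extra column.
    df_excel = pd.read_excel(excel_path)
    result = pd.concat([df_excel, df], ignore_index=True)
    result.to_excel(excel_path, index=False)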
As part of a larger task, I want to calculate the monthly mean values for each specific station. This is already difficult to do, but I am getting close.
The dataframe has many columns, but ultimately I only use the following information:
Date Value Station_Name
0 2006-01-03 18 2
1 2006-01-04 12 2
2 2006-01-05 11 2
3 2006-01-06 10 2
4 2006-01-09 22 2
... ... ...
3510 2006-12-23 47 45
3511 2006-12-24 46 45
3512 2006-12-26 35 45
3513 2006-12-27 35 45
3514 2006-12-30 28 45
I am running into two issues, using:
df.groupby(['Station_Name', pd.Grouper(freq='M')])['Value'].mean()
It results in something like:
Station_Name Date
2 2003-01-31 29.448387
2003-02-28 30.617857
2003-03-31 28.758065
2003-04-30 28.392593
2003-05-31 30.318519
...
45 2003-09-30 16.160000
2003-10-31 18.906452
2003-11-30 26.296667
2003-12-31 30.306667
2004-01-31 29.330000
I can't seem to use this as a regular dataframe. The dates are messed up: they show the last day of each month rather than the month itself. The station name appears as a single index entry per group, not as a value in every row, and the mean values don't have a "column name" at all. This isn't a dataframe but a pandas.core.series.Series, and converting it with .to_frame() gives a dataframe, but not in the shape I'm after. I don't get this part.
I found that, in order to return a normal dataframe, you can pass
as_index=False
to the groupby method. But this results in the months not being shown:
df.groupby(['Station_Name', pd.Grouper(freq='M')], as_index=False)['Value'].mean()
Gives:
Station_Name Value
0 2 29.448387
1 2 30.617857
2 2 28.758065
3 2 28.392593
4 2 30.318519
... ... ...
142 45 16.160000
143 45 18.906452
144 45 26.296667
145 45 30.306667
146 45 29.330000
I can't just simply add the month later, as not every station has an observation in every month.
I've tried using other methods, such as
df.resample("M").mean()
but it doesn't seem possible to do this on multiple columns; it returns the mean value of everything.
Edit: This is ultimately what I would want.
Station_Name Date Value
0 2 2003-01 29.448387
1 2 2003-02 30.617857
2 2 2003-03 28.758065
3 2 2003-04 28.392593
4 2 2003-05 30.318519
... ... ...
142 45 2003-08 16.160000
143 45 2003-09 18.906452
144 45 2003-10 26.296667
145 45 2003-11 30.306667
146 45 2003-12 29.330000
OK, how about this:
df = df.groupby(['Station_Name',df['Date'].dt.to_period('M')])['Value'].mean().reset_index()
Output:
Station_Name Date Value
0 2 2006-01 14.6
1 45 2006-12 38.2
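For the desired layout in the edit, here is a minimal self-contained sketch (assuming Date is a datetime64 column rather than the index; the Period values print as 'YYYY-MM' and can be cast to strings if plain text is preferred):

import pandas as pd

# Hypothetical sample data mirroring the question's columns.
df = pd.DataFrame({
    'Date': pd.to_datetime(['2006-01-03', '2006-01-04', '2006-12-23', '2006-12-24']),
    'Value': [18, 12, 47, 46],
    'Station_Name': [2, 2, 45, 45]})

monthly = (df.groupby(['Station_Name', df['Date'].dt.to_period('M')])['Value']
             .mean()
             .reset_index())
monthly['Date'] = monthly['Date'].astype(str)  # Period -> 'YYYY-MM' strings
print(monthly)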
I am trying to load this API URL into a pandas DataFrame. I am getting the values, but I still need to add the date as a column like the other values:
import pandas as pd
from pandas.io.json import json_normalize
import ssl
ssl._create_default_https_context = ssl._create_unverified_context
df = pd.read_json("https://covidapi.info/api/v1/country/DOM")
df = pd.DataFrame(df['result'].values.tolist())
print (df)
Getting this output:
confirmed deaths recovered
0 0 0 0
1 0 0 0
2 0 0 0
3 0 0 0
4 0 0 0
.. ... ... ...
72 1488 68 16
73 1488 68 16
74 1745 82 17
75 1828 86 33
76 1956 98 36
You need to pass the index from your dataframe as well as the data itself:
df = pd.DataFrame(index=df.index, data=df['result'].values.tolist())
The line above creates the same columns, but keeps the original date index from the API call.
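And if the date should end up as a regular column rather than the index, resetting the index afterwards would do it. A sketch, assuming the index holds the (unnamed) date strings returned by the API:

df = pd.DataFrame(index=df.index, data=df['result'].values.tolist())
df = df.reset_index().rename(columns={'index': 'date'})  # date becomes an ordinary column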
I have 3 dataframes:
df1 with match history (organized by date)
df2 with player stats (organized by player name)
df3 difference between player stats (df2) per match (df1) [in progress]
I want to do something like:
for idx, W_nm, L_nm in df1[['index','winner_name','loser_name']].values:
    df3.loc[idx] = df2.loc[W_nm] - df2.loc[L_nm]
    # ... edit this row further
Which fails because:
'idx' doesn't reference df1's indices
df3 has no defined columns
Is there a way to reference the indices on the first line?
I've read that iterrows() is 7x slower than .loc[], and I have quite a bit of data to process.
Is there anything cleaner than this:
for idx in df1.index:
    W_nm = df1.loc[idx, 'winner_name']
    L_nm = df1.loc[idx, 'loser_name']
    df3.loc[idx] = df2.loc[W_nm] - df2.loc[L_nm]
    # ... edit this row further
Which doesn't fix the "no defined columns", but gives me my handles.
So I'm expecting something like:
df1
[ 'Loser' 'Winner' 'Score'
0 Harry Hermione 3-7 ...
1 Harry Ron 0-2 ...
2 Ron Voldemort 7-89 ... ]
df2
[ 'Spells' 'Allies'
Harry 23 84 ...
Hermione 94 68 ...
Ron 14 63 ...
Voldemort 97 92 ... ]
then
df3
[ 'Spells' 'Allies'
0 -71 16 ...
1 9 21 ...
2 -83 -29 ... ]
What you need is join:
loser = df1.join(df2, on='Loser').loc[:,['Spells', 'Allies']]
winner = df1.join(df2, on='Winner').loc[:,['Spells', 'Allies']]
df3 = winner - loser
With your example data it gives:
Spells Allies
0 71 -16
1 -9 -21
2 83 29
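A self-contained sketch with the example frames from the question; join looks up df2 by the names in each column, so no explicit loop over rows is needed:

import pandas as pd

df1 = pd.DataFrame({'Loser': ['Harry', 'Harry', 'Ron'],
                    'Winner': ['Hermione', 'Ron', 'Voldemort'],
                    'Score': ['3-7', '0-2', '7-89']})
df2 = pd.DataFrame({'Spells': [23, 94, 14, 97],
                    'Allies': [84, 68, 63, 92]},
                   index=['Harry', 'Hermione', 'Ron', 'Voldemort'])

loser = df1.join(df2, on='Loser').loc[:, ['Spells', 'Allies']]
winner = df1.join(df2, on='Winner').loc[:, ['Spells', 'Allies']]
df3 = winner - loser  # row-aligned with df1's index, ready for further edits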
When creating a dataframe as below (instructions from here), the order of the columns changes from "Day, Visitors, Bounce Rate" to "Bounce Rate, Day, Visitors"
import pandas as pd
web_stats = {'Day': [1,2,3,4,5,6],
             'Visitors': [43,34,65,56,29,76],
             'Bounce Rate': [65,67,78,65,45,52]}
df = pd.DataFrame(web_stats)
Gives:
Bounce Rate Day Visitors
0 65 1 43
1 67 2 34
2 78 3 65
3 65 4 56
4 45 5 29
5 52 6 76
How can the order be kept intact? (i.e. Day, Visitors, Bounce Rate)
One approach is to pass the columns parameter explicitly.
Ex:
import pandas as pd
web_stats = {'Day': [1,2,3,4,5,6],
             'Visitors': [43,34,65,56,29,76],
             'Bounce Rate': [65,67,78,65,45,52]}
df = pd.DataFrame(web_stats, columns = ['Day', 'Visitors', 'Bounce Rate'])
print(df)
Output:
Day Visitors Bounce Rate
0 1 43 65
1 2 34 67
2 3 65 78
3 4 56 65
4 5 29 45
5 6 76 52
Dictionaries are not considered to be ordered in Python <3.7.
You can use collections.OrderedDict instead:
from collections import OrderedDict
web_stats = OrderedDict([('Day', [1,2,3,4,5,6]),
                         ('Visitors', [43,34,65,56,29,76]),
                         ('Bounce Rate', [65,67,78,65,45,52])])
df = pd.DataFrame(web_stats)
If you don't want to write out the column names, which becomes really inconvenient when you have many keys, you may use:
df = pd.DataFrame(web_stats, columns = web_stats.keys())
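For what it's worth, on Python 3.7+ (where insertion order of plain dicts is guaranteed by the language) the original code already preserves the column order, so no workaround is needed:

import pandas as pd

web_stats = {'Day': [1,2,3,4,5,6],
             'Visitors': [43,34,65,56,29,76],
             'Bounce Rate': [65,67,78,65,45,52]}
df = pd.DataFrame(web_stats)  # columns come out as Day, Visitors, Bounce Rate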
I have a data file with columns A-G like below, but when I read it with pd.read_csv('data.csv') it prints an extra unnamed column at the end for no reason:
colA ColB colC colD colE colF colG Unnamed: 7
44 45 26 26 40 26 46 NaN
47 16 38 47 48 22 37 NaN
19 28 36 18 40 18 46 NaN
50 14 12 33 12 44 23 NaN
39 47 16 42 33 48 38 NaN
I have checked my data file several times, but there is no extra data in any other column. How should I remove this extra column while reading? Thanks
df = df.loc[:, ~df.columns.str.contains('^Unnamed')]
In [162]: df
Out[162]:
colA ColB colC colD colE colF colG
0 44 45 26 26 40 26 46
1 47 16 38 47 48 22 37
2 19 28 36 18 40 18 46
3 50 14 12 33 12 44 23
4 39 47 16 42 33 48 38
NOTE: very often there is only one unnamed column Unnamed: 0, which is the first column in the CSV file. This is the result of the following steps:
a DataFrame is saved into a CSV file using parameter index=True, which is the default behaviour
we read this CSV file into a DataFrame using pd.read_csv() without explicitly specifying index_col=0 (default: index_col=None)
The easiest way to get rid of this column is to specify the parameter pd.read_csv(..., index_col=0):
df = pd.read_csv('data.csv', index_col=0)
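A small round trip illustrating the cause, assuming a throwaway data.csv path:

import pandas as pd

df = pd.DataFrame({'colA': [44, 47], 'ColB': [45, 16]})
df.to_csv('data.csv')  # index=True by default, so an extra first column is written
print(pd.read_csv('data.csv').columns)               # includes 'Unnamed: 0'
print(pd.read_csv('data.csv', index_col=0).columns)  # just 'colA', 'ColB'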
First, find the columns whose names contain 'unnamed', then drop those columns. Note that inplace=True is included in the .drop parameters, so the dataframe is modified directly.
df.drop(df.columns[df.columns.str.contains('unnamed',case = False)],axis = 1, inplace = True)
The pandas.DataFrame.dropna function removes missing values (e.g. NaN, NaT).
For example, the following returns your dataframe with any column dropped in which all of the elements are missing:
df = df.dropna(how='all', axis='columns')
The accepted solution doesn't work in my case, so my solution is the following:
# The column name in the example case is "Unnamed: 7",
# but this works with any other name ("Unnamed: 0", for example).
df.rename({"Unnamed: 7": "a"}, axis="columns", inplace=True)
# Then, drop the column as usual.
df.drop(["a"], axis=1, inplace=True)
Hope it helps others.