Can you help me get the rows with each index value?
I am getting an empty dataframe even though I set a condition.
I am trying to extract rows that have a certain index value. It is temperature data, and I want to group it based on temperature ranges. So I did the indexing to group the temperature data, then tried to extract the rows with a specific idx. Here is the entire code:
from pandas import read_csv
import pandas as pd
import numpy as np
from matplotlib import pyplot
import matplotlib.dates as mdates
import matplotlib.pyplot as plt

bins = [0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40] #21

path = '/Users/user/Desktop/OJERI/00_Research/00_Climate_Data/00_Temp/01_ASOS_daily_seperated/StationINFO_by_years/'
filename = f'STATIONS_LATLON_removedst_1981-2020_59.csv' #59for1981-2020
df = pd.read_csv(path+filename, index_col=None, header=0)
station = df['Stations']
Temp = ['Tmean','Tmax','Tmin']

for st in station[24:25]:
    for t in Temp:
        path1 = f'/Users/user/Desktop/OJERI/00_Research/00_Climate_Data/00_Temp/01_ASOS_daily_seperated/{t}_JJAS_NDJF/'
        filenames = f'combined_SURFACE_ASOS_{st}_1981-2020_JJAS_{t}.csv'
        df1 = read_csv(path1+filenames, index_col=None, header=0, parse_dates=True, squeeze=True)

        def bincal(row):
            for idx, which_bin in enumerate(bins):
                if idx == 0:
                    pass
                if row <= which_bin:
                    return idx

        df1['data'] = df1[f'{t}'].map(bincal)
        print(df1)

        ok = df1[df1['data'] == 'idx']
        print(ok)
Printing df1 right after read_csv looks like: [output screenshot omitted]
Printing df1 after df1['data'] = df1[f'{t}'].map(bincal) looks like: [output screenshot omitted]
I expected to get the Date, Tmean (or Tmin, or Tmax), and idx values (which should be the same number) when I print ok; however, it shows an empty dataframe: [output screenshot omitted]
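For context, here is a minimal sketch of this binning pattern on synthetic data (not the asker's CSV). Note that df1['data'] holds integer bin indices, so the filter has to compare against an integer; comparing against the literal string 'idx' always yields an empty dataframe:

```python
import pandas as pd

# Synthetic temperature data standing in for the asker's file.
df1 = pd.DataFrame({"Tmean": [1.5, 5.0, 13.2, 25.7, 31.0]})

bins = list(range(0, 42, 2))  # 0, 2, 4, ..., 40

def bincal(row):
    # Return the index of the first bin edge that the value falls under.
    for idx, which_bin in enumerate(bins):
        if row <= which_bin:
            return idx

df1["data"] = df1["Tmean"].map(bincal)

# Filter with an integer bin index, not the string 'idx':
ok = df1[df1["data"] == 3]
print(ok)  # the row with Tmean == 5.0
```

The same selection can be repeated for each bin index of interest, e.g. in a loop over `df1["data"].unique()`.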
I am working with a dataset, Adult, that I have modified and would like to save as a CSV. However, after saving it as a CSV and re-loading the data to work with again, the data is not converted properly: the headers are not preserved and some columns are now combined. I have looked through the page and online, but what I have tried is not working. I load the data in with the following code:
import numpy as np  ## Import necessary packages
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix
from sklearn.preprocessing import *

url2 = "http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"  # Reading in data from a freely and easily available source on the internet
Adult = pd.read_csv(url2, header=None, skipinitialspace=True)  # Decoding data by removing extra spaces in columns with skipinitialspace=True

## Assigning reasonable column names to the dataframe
Adult.columns = ["age","workclass","fnlwgt","education","educationnum","maritalstatus","occupation",
                 "relationship","race","sex","capitalgain","capitalloss","hoursperweek","nativecountry",
                 "less50kmoreeq50kn"]
After inserting missing values and changing the data frame as desired, I have tried:
df = Adult
df.to_csv('file_name.csv',header = True)
df.to_csv('file_name.csv')
and a few other variations. How can I save the file to a CSV and preserve the correct format for the next time I read the file in?
When re-loading the data I use the code:
import pandas as pd
df = pd.read_csv('file_name.csv')
when running df.head the output is:
<bound method NDFrame.head of Unnamed: 0 Unnamed: 0.1 age ... Black Asian-Pac-Islander Other
0 0 0 39 ... 0 0 0
1 1 1 50 ... 0 0 0
2 2 2 38 ... 0 0 0
3 3 3 53 ... 1 0 0
and print(df.loc[:,"age"].value_counts()) the output is:
36 898
31 888
34 886
23 877
35 876
which should not have two columns.
If you pickle it like so:
Adult.to_pickle('adult.pickle')
You will, subsequently, be able to read it back in using read_pickle as follows:
original_adult = pd.read_pickle('adult.pickle')
Hope that helps.
If you want to preserve the output column order you can specify the columns directly while saving the DataFrame:
import pandas as pd
url2 = "http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
df = pd.read_csv(url2, header=None, skipinitialspace=True)
my_columns = ["age", "workclass", "fnlwgt", "education", "educationnum", "maritalstatus", "occupation",
"relationship","race","sex","capitalgain","capitalloss","hoursperweek","nativecountry",
"less50kmoreeq50kn"]
df.columns = my_columns
# do the computation ...
df[my_columns].to_csv('file_name.csv')
You can pass index=False, as in to_csv('file_name.csv', index=False), if you are not interested in saving the DataFrame row index. Otherwise, when reading the csv file back in, you'd need to specify the index_col parameter.
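A minimal round trip illustrating both options (synthetic data; an in-memory buffer stands in for the csv file):

```python
import io
import pandas as pd

df = pd.DataFrame({"age": [39, 50], "workclass": ["State-gov", "Self-emp"]})

# Option 1: drop the row index when writing, nothing extra needed on read.
buf1 = io.StringIO()
df.to_csv(buf1, index=False)
buf1.seek(0)
back1 = pd.read_csv(buf1)

# Option 2: keep the index, and tell read_csv which column it is.
buf2 = io.StringIO()
df.to_csv(buf2)
buf2.seek(0)
back2 = pd.read_csv(buf2, index_col=0)
```

With option 2, omitting index_col=0 is what produces the stray "Unnamed: 0" column seen in the question.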
According to the documentation, value_counts() returns a Series object: you see two columns because the first one is the index (the ages 36, 31, ...) and the second is the count (898, 888, ...).
I replicated your code and it works for me. The order of the columns is preserved.
Let me show what I tried. I tried this batch of code:
import numpy as np  ## Import necessary packages
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix
from sklearn.preprocessing import *

url2 = "http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"  # Reading in data from a freely and easily available source on the internet
Adult = pd.read_csv(url2, header=None, skipinitialspace=True)  # Decoding data by removing extra spaces in columns with skipinitialspace=True

## Assigning reasonable column names to the dataframe
Adult.columns = ["age","workclass","fnlwgt","education","educationnum","maritalstatus","occupation",
                 "relationship","race","sex","capitalgain","capitalloss","hoursperweek","nativecountry",
                 "less50kmoreeq50kn"]
This worked perfectly. Then
df = Adult
This also worked.
Then I saved this data frame to a csv file. Make sure you are providing the absolute path to the file, even if it is being saved in the same folder as this script.
df.to_csv('full_path_to_the_file.csv', header=True)
# so something like
# df.to_csv('/Users/user_name/Desktop/folder/NameFile.csv', header=True)
Load this csv file into a new_df. It will generate a new column for keeping track of the index; it is unnecessary and you can drop it like the following:
new_df = pd.read_csv('/Users/user_name/Desktop/folder/NameFile.csv', index_col=None)
new_df = new_df.drop('Unnamed: 0', axis=1)
When I compare the columns of new_df with those of the original df, using this line of code:
new_df.columns == df.columns
I get
array([ True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True])
You might not have been providing the absolute path to the file, or you may have been saving the file twice, as here; you only need to save it once:
df.to_csv('file_name.csv',header = True)
df.to_csv('file_name.csv')
When you save the dataframe, in general the first column is the index, and you should load the index when reading the dataframe back in. Also, whenever you assign a dataframe to a variable, make sure to copy the dataframe:
df = Adult.copy()
df.to_csv('file_name.csv',header = True)
And to read:
df = pd.read_csv('file_name.csv', index_col=0)
The first column from print(df.loc[:,"age"].value_counts()) is the index, which is shown whenever you print the Series. To save the counts to a list, use the to_list method:
print(df.loc[:,"age"].value_counts().to_list())
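A quick illustration of the point above, on a small synthetic Series rather than the Adult data:

```python
import pandas as pd

ages = pd.Series([36, 36, 31, 36, 31, 34])
counts = ages.value_counts()

# The "two columns" are really the index and the values of one Series:
print(counts.index.tolist())  # the ages, most frequent first
print(counts.to_list())       # the counts
```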
I want to create a dataframe that combines historical price data for various ETFs (XLF, XLP, XLU, etc.) downloaded from Yahoo Finance. I tried to create a dataframe for each ETF, with the date and the adjusted close. Now I want to combine them all into one dataframe. Any idea why this doesn't work?
I've tried saving each csv as a dataframe, which only has date and adjusted close. Then I want to combine, using Date as the index.
# Here's what I've tried.
import os
import pandas as pd
from functools import reduce
XLF_df = pd.read_csv("datasets/XLF.csv").set_index("Date")["Adj Close"]
XRT_df = pd.read_csv("datasets/XRT.csv").set_index("Date")["Adj Close"]
XLP_df = pd.read_csv("datasets/XLP.csv").set_index("Date")["Adj Close"]
XLY_df = pd.read_csv("datasets/XLY.csv").set_index("Date")["Adj Close"]
XLV_df = pd.read_csv("datasets/XLV.csv").set_index("Date")["Adj Close"]
dfList = [XLF_df, XRT_df, XLP_df, XLY_df, XLV_df]
df = reduce(lambda df1,df2: pd.merge(df1,df2,on="Date"), dfList)
df.head()
I get an empty dataframe! And if I say on="id" then I get a key error.
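One thing worth noting: .set_index("Date")["Adj Close"] produces a Series whose Date lives in the index, not in a column, so merging on="Date" has nothing to match. A sketch of an alternative that aligns the Series on their index directly, using pd.concat (synthetic prices standing in for the Yahoo Finance CSVs):

```python
import pandas as pd

# Synthetic stand-ins for the per-ETF "Adj Close" Series, indexed by Date.
dates = pd.to_datetime(["2019-01-02", "2019-01-03", "2019-01-04"])
XLF_close = pd.Series([23.1, 23.4, 23.0], index=dates, name="XLF")
XLP_close = pd.Series([51.2, 51.5, 51.1], index=dates, name="XLP")

# concat aligns the Series on their shared Date index, one column per ETF.
df = pd.concat([XLF_close, XLP_close], axis=1)
print(df.head())
```

The same pattern extends to all five Series: pd.concat(dfList, axis=1), optionally with join="inner" to keep only dates present in every file.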
I want to look at all files in a specific directory and get their names and modification dates. I got the modification date; what I want now is to get the dates into a dataframe so I can work with them. Something like a pandas dataframe with one column called ModificationTime containing all the times.
I am using Jupyter notebooks and Python 3
import os
import datetime
import time
import pandas as pd
import numpy as np
from collections import OrderedDict

with os.scandir('My_Dir') as dir_entries:
    for entry in dir_entries:
        info = entry.stat()
        (info.st_mtime)
        time = (datetime.datetime.utcfromtimestamp(info.st_mtime))
        df = {'ModificationTime': [time]}
        df1 = pd.DataFrame(df)
        print(df1)
#Output is this
ModificationTime
0 2019-02-16 02:39:13.428990
ModificationTime
0 2019-02-16 02:34:01.247963
ModificationTime
0 2018-09-22 18:07:34.829137
#If I print the code in a new cell I only get 1 output
print(df1)
#Output is this
ModificationTime
0 2019-02-16 02:39:13.428990
df1 = pd.DataFrame([])
with os.scandir('My_Dir') as dir_entries:
    for entry in dir_entries:
        info = entry.stat()
        time = datetime.datetime.utcfromtimestamp(info.st_mtime)
        df = pd.DataFrame({'ModificationTime': [time]})
        # DataFrame.append was removed in pandas 2.0; pd.concat does the same job.
        df1 = pd.concat([df1, df], ignore_index=True)
This will solve the problem. In your code, you create a dataframe but you keep overwriting it so you only get one row in the final dataframe.
I'm trying to get 4 CSV files into one dataframe. I've looked around on the web for examples and tried a few but they all give errors. Finally I think I'm onto something, but it gives unexpected results. Can anybody tell me why this doesn't work?
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
n = 24*365*4
dates = pd.date_range('20120101',periods=n,freq='h')
df = pd.DataFrame(np.random.randn(n,1),index=dates,columns=list('R'))
#df = pd.DataFrame(index=dates)
paths = ['./LAM DIV/10118218_JAN_LAM_DIV_1.csv',
'./LAM DIV/10118218_JAN-APR_LAM_DIV_1.csv',
'./LAM DIV/10118250_JAN_LAM_DIV_2.csv',
'./LAM DIV/10118250_JAN-APR_LAM_DIV_2.csv']
for i in range(len(paths)):
    data = pd.read_csv(paths[i], index_col=0, header=0, parse_dates=True)
    df.join(data['TempC'])

df.head()
Expected result:
Date Time R 0 1 2 3
Getting this:
Date Time R
You need to save the result of your join:
df = df.join(data['TempC'])
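One further caveat: once the result is saved, the second iteration tries to join another column also named TempC, which raises a column-overlap error. Renaming each joined Series avoids this. A sketch with synthetic frames standing in for the four CSV files:

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20120101", periods=4, freq="h")
df = pd.DataFrame(np.random.randn(4, 1), index=dates, columns=list("R"))

# Synthetic stand-ins for the per-file TempC columns.
frames = [pd.DataFrame({"TempC": range(4)}, index=dates) for _ in range(2)]

for i, data in enumerate(frames):
    # Rename each Series so repeated joins don't collide on the name "TempC".
    df = df.join(data["TempC"].rename(f"TempC_{i}"))

print(df.columns.tolist())
```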