Dataset engineering with Python Pandas

I am trying to modify a CSV dataset with the Python Pandas package.
I have a "time" column (column num 5) that has 51 days and ~4K on records for each day.
I want to minimize the dataset to 35 days with 24 random records per day.
I am using the following code:
import pandas as pd
import datetime
import random
import matplotlib.pyplot as plt
file_name = "filename"
data = pd.read_csv('path/to/the/file/'+file_name+".csv")
df = data.sort_values(data.columns[5])
df = df.reset_index(drop=True)
new_df=df
new_df.iloc[:,5]= pd.to_datetime(new_df.iloc[:,5])
new_df=new_df[(new_df.iloc[:,5] < '08/10/2018')]
Now I have 35 days with 4K records per day.
My thought was to create an empty Pandas DataFrame and append 24 random samples from each day to it in a loop, using the following code:
final_df = pd.DataFrame()
for date in new_df.iloc[:,5].unique():
    day = new_df[(new_df.iloc[:,5] == date)].sample(n=24)
    final_df.append(day)
print(final_df)
but it seems that the DF is still empty:
Empty DataFrame
Columns: []
Index: []
Can someone direct me to the right solution? :)

The df.append() function does not modify the DataFrame in place; it returns a new DataFrame that combines the original with the appended rows.
So the solution is to assign the result back:
final_df = final_df.append(day)
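Note that DataFrame.append was deprecated and has been removed in pandas 2.0. On newer versions, a minimal sketch of the same idea, assuming the filtered new_df from the question, is to let groupby draw the 24 rows per day in one step:
import pandas as pd
# Sketch, assuming new_df from the question, with its datetime column at position 5
date_col = new_df.columns[5]
final_df = (new_df.groupby(new_df[date_col].dt.date)   # one group per calendar day
                  .sample(n=24, random_state=42))      # 24 random rows from each day
print(final_df)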

Related

How to extract rows (from csv files) with a specific condition in python

I am getting an empty dataframe even though I set a condition.
I am trying to extract rows that have a certain index value. It is temperature data, and I want to group it based on temperature ranges.
So I assigned an index to group the temperature data, then tried to extract rows with a specific idx.
Here is the entire code:
from pandas import read_csv
import pandas as pd
import numpy as np
from matplotlib import pyplot
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
bins = [0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40] #21
path = '/Users/user/Desktop/OJERI/00_Research/00_Climate_Data/00_Temp/01_ASOS_daily_seperated/StationINFO_by_years/'
filename = f'STATIONS_LATLON_removedst_1981-2020_59.csv' #59for1981-2020
df = pd.read_csv(path+filename, index_col=None, header=0)
station = df['Stations']
Temp = ['Tmean','Tmax','Tmin']
for st in station[24:25]:
    for t in Temp:
        path1 = f'/Users/user/Desktop/OJERI/00_Research/00_Climate_Data/00_Temp/01_ASOS_daily_seperated/{t}_JJAS_NDJF/'
        filenames = f'combined_SURFACE_ASOS_{st}_1981-2020_JJAS_{t}.csv'
        df1 = read_csv(path1+filenames, index_col=None, header=0, parse_dates=True, squeeze=True)

        def bincal(row):
            for idx, which_bin in enumerate(bins):
                if idx == 0:
                    pass
                if row <= which_bin:
                    return idx

        df1['data'] = df1[f'{t}'].map(bincal)
        print(df1)
        ok = df1[df1['data'] == 'idx']
        print(ok)
Printing df1 right after read_csv gives the output shown in a screenshot in the original post, and so does printing df1 after "df1['data'] = df1[f'{t}'].map(bincal)".
I expected "ok" to contain Date, Tmean (or Tmin, or Tmax), and idx values (which should be the same number); however, it shows an empty dataframe (screenshot of the output in the original post).
Can you help me get the rows for each index value?
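No answer is recorded here, but a likely reason "ok" comes back empty is that df1['data'] holds the integer bin numbers returned by bincal, while 'idx' is a literal string, so the comparison never matches. A minimal sketch, assuming df1, bins, and t from the code above (note that pd.cut numbers its bins from 0, one lower than bincal):
# Compare against an integer bin number, not the string 'idx'
ok = df1[df1['data'] == 3]
print(ok)
# The hand-rolled bincal can also be replaced by pd.cut, which assigns
# an integer bin label to each row directly
df1['data'] = pd.cut(df1[t], bins=bins, labels=False)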

converting a dataframe to a csv file

I am working with a dataset, Adult, that I have changed and would like to save as a csv. However, after saving it as a csv and re-loading the data to work with again, the data is not converted properly. The headers are not preserved and some columns are now combined. I have looked through the page and online, but what I have tried is not working. I load the data in with the following code:
import numpy as np ##Import necessary packages
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix
from sklearn.preprocessing import *
url2="http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data" #Reading in Data from a freely and easily available source on the internet
Adult = pd.read_csv(url2, header=None, skipinitialspace=True) #Decoding data by removing extra spaces in columns with skipinitialspace=True
##Assigning reasonable column names to the dataframe
Adult.columns = ["age","workclass","fnlwgt","education","educationnum","maritalstatus","occupation",
                 "relationship","race","sex","capitalgain","capitalloss","hoursperweek","nativecountry",
                 "less50kmoreeq50kn"]
After inserting missing values and changing the data frame as desired I have tried:
df = Adult
df.to_csv('file_name.csv',header = True)
df.to_csv('file_name.csv')
and a few other variations. How can I save the file to a CSV and preserve the correct format for the next time I read the file in?
When re-loading the data I use the code:
import pandas as pd
df = pd.read_csv('file_name.csv')
when running df.head the output is:
<bound method NDFrame.head of Unnamed: 0 Unnamed: 0.1 age ... Black Asian-Pac-Islander Other
0 0 0 39 ... 0 0 0
1 1 1 50 ... 0 0 0
2 2 2 38 ... 0 0 0
3 3 3 53 ... 1 0 0
and the output of print(df.loc[:,"age"].value_counts()) is:
36 898
31 888
34 886
23 877
35 876
which should not have 2 columns
If you pickle it like so:
Adult.to_pickle('adult.pickle')
You will, subsequently, be able to read it back in using read_pickle as follows:
original_adult = pd.read_pickle('adult.pickle')
Hope that helps.
If you want to preserve the output column order you can specify the columns directly while saving the DataFrame:
import pandas as pd
url2 = "http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
df = pd.read_csv(url2, header=None, skipinitialspace=True)
my_columns = ["age", "workclass", "fnlwgt", "education", "educationnum", "maritalstatus", "occupation",
"relationship","race","sex","capitalgain","capitalloss","hoursperweek","nativecountry",
"less50kmoreeq50kn"]
df.columns = my_columns
# do the computation ...
df[my_columns].to_csv('file_name.csv')
You can add the parameter index=False to to_csv('file_name.csv', index=False) if you are not interested in saving the DataFrame row index. Otherwise, when reading the csv file again, you'd need to specify the index_col parameter.
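For example, a minimal round trip along those lines (a sketch, reusing df and my_columns from above):
# Save without the row index ...
df[my_columns].to_csv('file_name.csv', index=False)
# ... and read it straight back; no index_col needed
df_again = pd.read_csv('file_name.csv')
# Or keep the index when saving and tell read_csv where it is
df[my_columns].to_csv('file_name.csv')
df_again = pd.read_csv('file_name.csv', index_col=0)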
According to the documentation value_counts() returns a Series object - you see two columns because the first one is the index - Age (36, 31, ...), and the second is the count (898, 888, ...).
I replicated your code and it works for me. The order of the columns is preserved.
Let me show what I tried. I ran this batch of code:
import numpy as np ##Import necessary packages
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix
from sklearn.preprocessing import *
url2="http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data" #Reading in Data from a freely and easily available source on the internet
Adult = pd.read_csv(url2, header=None, skipinitialspace=True) #Decoding data by removing extra spaces in columns with skipinitialspace=True
##Assigning reasonable column names to the dataframe
Adult.columns = ["age","workclass","fnlwgt","education","educationnum","maritalstatus","occupation",
                 "relationship","race","sex","capitalgain","capitalloss","hoursperweek","nativecountry",
                 "less50kmoreeq50kn"]
This worked perfectly. Then
df = Adult
This also worked.
Then I saved this data frame to a csv file. Make sure you are providing the absolute path to the file, even if it is being saved in the same folder as this script.
df.to_csv('full_path_to_the_file.csv',header = True)
# so something like
#df.to_csv('Users/user_name/Desktop/folder/NameFile.csv',header = True)
Load this csv file into a new_df. It will generate a new column for keeping track of the index, which is unnecessary; you can drop it as follows:
new_df = pd.read_csv('Users/user_name/Desktop/folder/NameFile.csv', index_col = None)
new_df= new_df.drop('Unnamed: 0', axis =1)
When I compare the columns of new_df with those of the original df, using this line of code
new_df.columns == df.columns
I get
array([ True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True])
You might not have been providing the absolute path to the file, or you may have been saving the file twice, as here (you only need to save it once):
df.to_csv('file_name.csv',header = True)
df.to_csv('file_name.csv')
In general, when you save the dataframe, the first column is the index, and you should load the index when reading the dataframe back. Also, whenever you assign a dataframe to a variable, make sure to copy the dataframe:
df = Adult.copy()
df.to_csv('file_name.csv',header = True)
And to read:
df = pd.read_csv('file_name.csv', index_col=0)
The first column from print(df.loc[:,"age"].value_counts()) is the index, which is shown when you print the Series. To save the counts to a list, use the to_list method:
print(df.loc[:,"age"].value_counts().to_list())

Combining a list of dataframes into a new dataframe

I want to create a dataframe that combines historical price data for various ETFs (XLF, XLP, XLU, etc) downloaded from yahoo finance. I tried to create a dataframe for each ETF, with the date and the adjusted close. Now I want to combine them all into one dataframe. Any idea why this doesn't work?
I've tried saving each csv as a dataframe, which only has date and adjusted close. Then I want to combine, using Date as the index.
# Here's what I've tried.
import os
import pandas as pd
from functools import reduce
XLF_df = pd.read_csv("datasets/XLF.csv").set_index("Date")["Adj Close"]
XRT_df = pd.read_csv("datasets/XRT.csv").set_index("Date")["Adj Close"]
XLP_df = pd.read_csv("datasets/XLP.csv").set_index("Date")["Adj Close"]
XLY_df = pd.read_csv("datasets/XLY.csv").set_index("Date")["Adj Close"]
XLV_df = pd.read_csv("datasets/XLV.csv").set_index("Date")["Adj Close"]
dfList = [XLF_df, XRT_df, XLP_df, XLY_df, XLV_df]
df = reduce(lambda df1,df2: pd.merge(df1,df2,on="Date"), dfList)
df.head()
I get an empty dataframe! And if I say on="id" then I get a key error.
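No answer is recorded here, but since each of these objects is an "Adj Close" Series indexed by Date, one common alternative is to let pd.concat align them on that shared index instead of merging pairwise. A sketch, assuming dfList from above (the ticker list is just a set of labels for the resulting columns):
tickers = ['XLF', 'XRT', 'XLP', 'XLY', 'XLV']
# Align all five series on their Date index, one column per ETF
df = pd.concat(dfList, axis=1, keys=tickers)
df.head()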

Creating a list of names and modification dates of files in a specific directory and make a dataframe out of it

I want to look at all files in a specific directory and get their names and modification dates. I got the modification date; now I want to get the dates into a dataframe so I can work with them. I want something like a pandas dataframe with one column called ModificationTime containing the list of all the times.
I am using Jupyter notebooks and Python 3
import os
import datetime
import time
import pandas as pd
import numpy as np
from collections import OrderedDict
with os.scandir('My_Dir') as dir_entries:
    for entry in dir_entries:
        info = entry.stat()
        (info.st_mtime)
        time = (datetime.datetime.utcfromtimestamp(info.st_mtime))
        df = {'ModificationTime': [time]}
        df1 = pd.DataFrame(df)
        print(df1)
#Output is this
ModificationTime
0 2019-02-16 02:39:13.428990
ModificationTime
0 2019-02-16 02:34:01.247963
ModificationTime
0 2018-09-22 18:07:34.829137
#If I print df1 in a new cell I only get 1 output
print(df1)
#Output is this
ModificationTime
0 2019-02-16 02:39:13.428990
df1 = pd.DataFrame([])
with os.scandir('My_Dir') as dir_entries:
    for entry in dir_entries:
        info = entry.stat()
        (info.st_mtime)
        time = (datetime.datetime.utcfromtimestamp(info.st_mtime))
        df = pd.DataFrame({'ModificationTime': [time]})
        df1 = df1.append(df)
This will solve the problem. In your code, you create a dataframe but you keep overwriting it so you only get one row in the final dataframe.
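If you are on a pandas version where DataFrame.append has been removed, a sketch of the same fix that collects the timestamps in a plain list and builds the frame once at the end (assuming the same 'My_Dir' directory):
import os
import datetime
import pandas as pd

times = []
with os.scandir('My_Dir') as dir_entries:
    for entry in dir_entries:
        # Collect each file's modification time as a datetime
        times.append(datetime.datetime.utcfromtimestamp(entry.stat().st_mtime))

df1 = pd.DataFrame({'ModificationTime': times})
print(df1)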

pandas - Joining CSV time series into a single dataframe

I'm trying to get 4 CSV files into one dataframe. I've looked around on the web for examples and tried a few, but they all give errors. Finally I think I'm onto something, but it gives unexpected results. Can anybody tell me why this doesn't work?
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
n = 24*365*4
dates = pd.date_range('20120101',periods=n,freq='h')
df = pd.DataFrame(np.random.randn(n,1),index=dates,columns=list('R'))
#df = pd.DataFrame(index=dates)
paths = ['./LAM DIV/10118218_JAN_LAM_DIV_1.csv',
         './LAM DIV/10118218_JAN-APR_LAM_DIV_1.csv',
         './LAM DIV/10118250_JAN_LAM_DIV_2.csv',
         './LAM DIV/10118250_JAN-APR_LAM_DIV_2.csv']
for i in range(len(paths)):
    data = pd.read_csv(paths[i], index_col=0, header=0, parse_dates=True)
    df.join(data['TempC'])
df.head()
Expected result:
Date Time R 0 1 2 3
Getting this:
Date Time R
You need to save the result of your join:
df = df.join(data['TempC'])
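Put back into the loop from the question, that looks something like the sketch below; the rsuffix argument is only there to keep the repeated 'TempC' column name from colliding on the second and later joins:
for i in range(len(paths)):
    data = pd.read_csv(paths[i], index_col=0, header=0, parse_dates=True)
    # join returns a new frame, so reassign it; rsuffix avoids duplicate column names
    df = df.join(data['TempC'], rsuffix='_' + str(i))
df.head()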
