I'm trying to get 4 CSV files into one dataframe. I've looked around on the web for examples and tried a few but they all give errors. Finally I think I'm onto something, but it gives unexpected results. Can anybody tell me why this doesn't work?
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
n = 24*365*4
dates = pd.date_range('20120101',periods=n,freq='h')
df = pd.DataFrame(np.random.randn(n,1),index=dates,columns=list('R'))
#df = pd.DataFrame(index=dates)
paths = ['./LAM DIV/10118218_JAN_LAM_DIV_1.csv',
         './LAM DIV/10118218_JAN-APR_LAM_DIV_1.csv',
         './LAM DIV/10118250_JAN_LAM_DIV_2.csv',
         './LAM DIV/10118250_JAN-APR_LAM_DIV_2.csv']
for i in range(len(paths)):
    data = pd.read_csv(paths[i], index_col=0, header=0, parse_dates=True)
    df.join(data['TempC'])
df.head()
Expected result:
Date Time R 0 1 2 3
Getting this:
Date Time R
You need to save the result of your join:
df = df.join(data['TempC'])
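For completeness, here is the whole loop with the join result saved on every pass; a minimal sketch, assuming each CSV really has a TempC column keyed by the same datetime index (renaming each series to i reproduces the 0..3 columns from the expected output and avoids duplicate 'TempC' column names):

for i, path in enumerate(paths):
    data = pd.read_csv(path, index_col=0, header=0, parse_dates=True)
    # join returns a new frame instead of modifying df, so assign it back
    df = df.join(data['TempC'].rename(i))
df.head()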
I am trying to modify a CSV dataset with the Python Pandas package.
I have a "time" column (column num 5) that has 51 days and ~4K on records for each day.
I want to minimize the dataset to 35 days with 24 random records per day.
I am using the following code:
import pandas as pd
import datetime
import random
import matplotlib.pyplot as plt
file_name = "filename"
data = pd.read_csv('path/to/the/file/'+file_name+".csv")
df = data.sort_values(data.columns[5])
df = df.reset_index(drop=True)
new_df=df
new_df.iloc[:,5]= pd.to_datetime(new_df.iloc[:,5])
new_df=new_df[(new_df.iloc[:,5] < '08/10/2018')]
Now I have 35 days with 4K records per day.
My thought was to create an empty Pandas DataFrame and to add by iteration 24 samples of each day, using the following code:
final_df = pd.DataFrame()
for date in new_df.iloc[:,5].unique():
    day = new_df[(new_df.iloc[:,5] == date)].sample(n=24)
    final_df.append(day)
print(final_df)
but it seems that the DF is still empty:
Empty DataFrame
Columns: []
Index: []
Can someone direct me to the right solution? :)
The df.append() function returns a new DataFrame that combines the original frame with the appended rows; it does not modify final_df in place.
So the solution is to assign the result back:
final_df = final_df.append(day)
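That said, DataFrame.append was deprecated and later removed in pandas 2.0, so a list-plus-concat version of the same loop is more future-proof; a minimal sketch, assuming new_df from the question:

import pandas as pd

samples = []
for date in new_df.iloc[:, 5].unique():
    # 24 random records for each remaining day
    samples.append(new_df[new_df.iloc[:, 5] == date].sample(n=24))

# a single concat after the loop is also much faster than appending inside it
final_df = pd.concat(samples, ignore_index=True)
print(final_df)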
Getting empty dataframe even though I set a condition.
I am trying to extract rows that have a certain index value.
It is temperature data. I want to group the data based on temperature ranges.
So I did the indexing to group the temperature data, then tried to extract the rows with a specific idx.
Here is entire code.
from pandas import read_csv
import pandas as pd
import numpy as np
from matplotlib import pyplot
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
bins = [0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40] #21
path = '/Users/user/Desktop/OJERI/00_Research/00_Climate_Data/00_Temp/01_ASOS_daily_seperated/StationINFO_by_years/'
filename = f'STATIONS_LATLON_removedst_1981-2020_59.csv' #59for1981-2020
df = pd.read_csv(path+filename, index_col=None, header=0)
station = df['Stations']
Temp = ['Tmean','Tmax','Tmin']
for st in station[24:25]:
    for t in Temp:
        path1 = f'/Users/user/Desktop/OJERI/00_Research/00_Climate_Data/00_Temp/01_ASOS_daily_seperated/{t}_JJAS_NDJF/'
        filenames = f'combined_SURFACE_ASOS_{st}_1981-2020_JJAS_{t}.csv'
        df1 = read_csv(path1+filenames, index_col=None, header=0, parse_dates=True, squeeze=True)

        def bincal(row):
            for idx, which_bin in enumerate(bins):
                if idx == 0:
                    pass
                if row <= which_bin:
                    return idx

        df1['data'] = df1[f'{t}'].map(bincal)
        print(df1)
        ok = df1[df1['data'] == 'idx']
        print(ok)
Printing df1 right after read_csv, and again after df1['data'] = df1[f'{t}'].map(bincal), both look as expected (screenshots omitted). I expected to get Date, Tmean (or Tmin, or Tmax), and idx (which should be the same number) values when I print out "ok"; however, it shows an empty dataframe (screenshot omitted).
Can you help me with how to get the rows for each index value?
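For reference, bincal returns integer bin positions, so df1['data'] == 'idx' compares those numbers against the literal string 'idx' and can never match. Filtering with an integer bin index is one likely fix; a minimal sketch, assuming df1 and bins from the code above (bin 5 is just an example value):

# rows whose temperature fell into one specific bin (5 is an arbitrary example)
ok = df1[df1['data'] == 5]
print(ok)

# or inspect every bin index in turn
for idx in range(1, len(bins)):
    print(idx, len(df1[df1['data'] == idx]))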
I'm trying to read data from a log file I have in Python. Suppose the file is called data.log.
The content of the file looks as follows:
# Performance log
# time, ff, T vector, dist, windnorth, windeast
0.00000000,0.00000000,0.00000000,0.00000000,0.00000000,0.00000000
1.00000000,3.02502604,343260.68655952,384.26845401,-7.70828175,-0.45288215
2.00000000,3.01495320,342124.21684440,767.95286901,-7.71506536,-0.45123853
3.00000000,3.00489957,340989.57100678,1151.05303883,-7.72185550,-0.44959182
I would like to obtain the last two columns and put them into two separate lists, such that I get an output like:
list1 = [-7.70828175, -7.71506536, -7.72185550]
list2 = [-0.45288215, -0.45123853, -0.44959182]
I have tried reading the data with the code shown below, but instead of separate columns and rows I just get one whole column with three rows in return.
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
file = open('data.log', 'r')
df = pd.read_csv('data.log', sep='\\s+')
df = list(df)
print (df[0])
Could someone indicate what I have to adjust in my code to obtain the required output as indicated above?
Thanks in advance!
import pandas as pd
df = pd.read_csv('data.log', skiprows=3, header=None,
names=['time', 'ff', 'T vector', 'dist', 'windnorth', 'windeast'])
spam = list(df['windeast'])
print(spam)
# store a specific column in a list
df['wind_diff'] = df.windnorth - df['windeast'] # two different ways to access columns
print(df)
print(df['wind_diff'])
output
[-0.45288215, -0.45123853, -0.44959182]
time ff T vector dist windnorth windeast wind_diff
0 1.0 3.025026 343260.686560 384.268454 -7.708282 -0.452882 -7.255400
1 2.0 3.014953 342124.216844 767.952869 -7.715065 -0.451239 -7.263827
2 3.0 3.004900 340989.571007 1151.053039 -7.721856 -0.449592 -7.272264
0 -7.255400
1 -7.263827
2 -7.272264
Name: wind_diff, dtype: float64
Note: for creating a plot in matplotlib you can work with a pandas.Series directly; there is no need to store it in a list.
The error comes from the sep argument. If you remove it, read_csv uses the default separator (the comma), which is the one you need:
e.g.
>>> import pandas as pd
>>> import numpy as np
>>> file = open('data.log', 'r')
>>> df = pd.read_csv('data.log') # or use sep=','
>>> df = list(df)
>>> df[0]
'1.00000000'
>>> df[5]
'-0.45288215'
Plus, use skiprows to skip the header lines.
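Alternatively, since both header lines start with #, read_csv's comment parameter can drop them without counting rows by hand; a minimal sketch, assuming the data.log layout shown in the question:

import pandas as pd

# lines starting with '#' are skipped as comments
df = pd.read_csv('data.log', comment='#', header=None,
                 names=['time', 'ff', 'T vector', 'dist', 'windnorth', 'windeast'])
df = df.iloc[1:]  # drop the all-zero first sample, matching the expected output

list1 = df['windnorth'].tolist()
list2 = df['windeast'].tolist()
print(list1)
print(list2)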
I want to look at all the files in a specific directory and get their names and modification dates. I already have the modification dates; what I want now is to get them into a dataframe so I can work with them: something like a pandas DataFrame with one column called ModificationTime holding the list of all the times.
I am using Jupyter notebooks and Python 3
import os
import datetime
import time
import pandas as pd
import numpy as np
from collections import OrderedDict
with os.scandir('My_Dir') as dir_entries:
    for entry in dir_entries:
        info = entry.stat()
        time = datetime.datetime.utcfromtimestamp(info.st_mtime)
        df = {'ModificationTime': [time]}
        df1 = pd.DataFrame(df)
        print(df1)
#Output is this
ModificationTime
0 2019-02-16 02:39:13.428990
ModificationTime
0 2019-02-16 02:34:01.247963
ModificationTime
0 2018-09-22 18:07:34.829137
#If I print df1 in a new cell I only get one output
print(df1)
#Output is this
ModificationTime
0 2019-02-16 02:39:13.428990
df1 = pd.DataFrame([])
with os.scandir('My_Dir') as dir_entries:
    for entry in dir_entries:
        info = entry.stat()
        time = datetime.datetime.utcfromtimestamp(info.st_mtime)
        df = pd.DataFrame({'ModificationTime': [time]})
        df1 = df1.append(df)
This will solve the problem. In your original code you create a brand-new DataFrame on every iteration, overwriting the previous one, so you only get one row in the final dataframe.
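Since DataFrame.append was removed in pandas 2.0, collecting the timestamps in a plain list and building the frame once is a more future-proof variant of the same idea; a minimal sketch, assuming the same My_Dir directory:

import datetime
import os
import pandas as pd

times = []
with os.scandir('My_Dir') as dir_entries:
    for entry in dir_entries:
        # one modification timestamp per directory entry
        times.append(datetime.datetime.utcfromtimestamp(entry.stat().st_mtime))

# build the DataFrame once, after the loop
df1 = pd.DataFrame({'ModificationTime': times})
print(df1)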
I'm trying to filter a data frame based on the contents of a pre-defined array.
I've looked up several examples on StackOverflow but simply get an empty output.
I'm not able to figure out what I'm doing incorrectly. Could I please get some guidance here?
import pandas as pd
import numpy as np
csv_path = 'history.csv'
df = pd.read_csv(csv_path)
pre_defined_arr = ["A/B", "C/D", "E/F", "U/Y", "R/E", "D/F"]
distinct_count_column_headers = ['Entity']
distinct_elements = pd.DataFrame(df.drop_duplicates().Entity.value_counts(), columns=distinct_count_column_headers)
filtered_data = distinct_elements[distinct_elements['Entity'].isin(pre_defined_arr)]
print("Filtered data ... ")
print(filtered_data)
OUTPUT
Filtered data ...
Empty DataFrame
Columns: [Entity]
Index: []
Managed to do that using the filter function -> .filter(items=pre_defined_arr)
import pandas as pd
import numpy as np
csv_path = 'history.csv'
df = pd.read_csv(csv_path)
pre_defined_arr = ["A/B", "C/D", "E/F", "U/Y", "R/E", "D/F"]
distinct_count_column_headers = ['Entity']
distinct_elements_filtered = pd.DataFrame(df.drop_duplicates().Entity.value_counts().filter(items=pre_defined_arr), columns=distinct_count_column_headers)
It's strange that there was just one answer I bumped into that suggested the filter function. Almost 9 out of 10 answers out there talk about the .isin function, which didn't work in my case.
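For what it's worth, .isin probably failed here because value_counts() puts the entity names into the index, so the 'Entity' column of distinct_elements holds the counts; testing the index is the .isin equivalent. A minimal sketch, assuming the same history.csv layout:

import pandas as pd

df = pd.read_csv('history.csv')
pre_defined_arr = ["A/B", "C/D", "E/F", "U/Y", "R/E", "D/F"]

counts = df.drop_duplicates().Entity.value_counts()
# the entity names live in the index, not in the column values
filtered = counts[counts.index.isin(pre_defined_arr)]
print(filtered)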