Related
I have a dataset which I transformed to CSV as potential input for a keras auto encoder.
The loading of the CSV works flawless with pandas.read_csv() but the data types are not correct.
The csv solely contains two colums: label and features whereas the label column contains strings and the features column arrays with signed integers ([-1, 1]). So in general pretty simple structure.
To get two different dataframes for further processing I created them via:
labels = pd.DataFrame(columns=['label'], data=csv_data, dtype='U')
and
features = pd.DataFrame(columns=['features'], data=csv_data)
in both cases I got wrong datatypes as both are marked as object typed dataframes. What am I doing wrong?
For the features it is even harder because the parsing returns me a pandas.sequence that contains the array as string: ['[1, ..., 1]'].
So I tried a tedious workaround by parsing the string back to an numpy array via .to_numpy() a python cast for every element and than an np.assarray() - but the type of the dataframe is still incorrect. I think this could not be the general approach how to solve this task. As I am fairly new to pandas I checked some tutorials and the API but in most cases a cell in a dataframe rather contains a single value instead of a complete array. Maybe my overall design of the dataframe ist just not suitable for this task.
Any help appreacheated!
You are reading the file as string but you have a python list as a column you need to evaluate it to get the list.
I am not sure of the use case but you can split the labels for a more readable dataframe
import pandas as pd
features = ["featurea","featureb","featurec","featured","featuree"]
labels = ["[1,0,1,1,1,1]","[1,0,1,1,1,1]","[1,0,1,1,1,1]","[1,0,1,1,1,1]","[1,0,1,1,1,1]"]
df = pd.DataFrame(list(zip(features, labels)),
columns =['Features', 'Labels'])
import ast
#convert Strings to lists
df['Labels'] = df['Labels'].map(ast.literal_eval)
df.index = df['Features']
#Since list itself might not be useful you can split and expand it to multiple columns
new_df = pd.DataFrame(df['Labels'].values.tolist(),index= df.index)
Output
0 1 2 3 4 5
Features
featurea 1 0 1 1 1 1
featureb 1 0 1 1 1 1
featurec 1 0 1 1 1 1
featured 1 0 1 1 1 1
featuree 1 0 1 1 1 1
The input csv was formatted incorrectly, therefore the parsing was accurate but not what i intended. I expanded the real columns and skipped the header to have a column for every array entry - now panda recognize the types and the correct dimensions.
I have two different csv files, I have merged them into a single data frame and grouped according to the 'class_name' column. The group by works as intended but I dont know how to perform the operation by comparing the groups against one other. From r1.csv the class algebra has gone down by 5 students, so I want -5, calculus has increased by 5 so it has to +5, this has to be added as a new column in a separate data frame. Same with date arithmetics.
This is what I tried so far
import pandas as pd
report_1_df=pd.read_csv('r1.csv')
report_2_df=pd.read_csv('r2.csv')
for group,elements in pd.concat([report_1_df, report_2_df], axis=0, sort=False).groupby('class_name'):
print(elements)
I can see that my group by works, I tried .sum() .diff() but none seem to do what I want, what can I do here. Thanks.
r1.csv
class_name,student_count,start_time,end_time
algebra,15,"2019,Dec,08","2019,Dec,09"
calculus,10,"2019,Dec,08","2019,Dec,09"
statistics,12,"2019,Dec,08","2019,Dec,09"
r2.csv
class_name,student_count,start_time,end_time
calculus,15,"2019,Dec,09","2019,Dec,10"
algebra,10,"2019,Dec,09","2019,Dec,10"
trigonometry,12,"2019,Dec,09","2019,Dec,10"
Needed
class_name,student_count,student_count_change,start_time,start_time_delay,end_time,end_time_delay
algebra,10,-5,"2019,Dec,09",1,"2019,Dec,10",1
calculus,15,5,"2019,Dec,09",1,"2019,Dec,10",1
statistics,12,-12,"2019,Dec,08",0,"2019,Dec,09",0
trigonometry,12,12,"2019,Dec,09",0,"2019,Dec,10",0
Not sure if there's a more direct way, but you can start by appending missing data on both your dfs:
classes = (df1["class_name"].append(df2["class_name"])).unique()
def fill_data(df):
for i in np.setdiff1d(classes, df["class_name"].values):
df.loc[df.shape[0]] = [i, 0, *df.iloc[0,2:].values]
return df
df1 = fill_data(df1)
df2 = fill_data(df2)
With the missing classes filled, now you can use groupby to assign a new column for the difference and lastly drop_duplicates:
df = pd.concat([df1,df2],axis=0).reset_index(drop=True)
df["diff"] = df.groupby("class_name")["student_count"].diff().fillna(df["student_count"])
print (df.drop_duplicates("class_name",keep="last"))
class_name student_count start_time end_time diff
4 calculus 15 2019,Dec,09 2019,Dec,10 5.0
5 algebra 10 2019,Dec,09 2019,Dec,10 -5.0
6 trigonometry 12 2019,Dec,09 2019,Dec,10 12.0
7 statistics 0 2019,Dec,09 2019,Dec,10 -12.0
I have a DataFrame with some NA and I want to drop the rows where a particular column has NA values.
My first trial has been:
- Identifying the rows where the specific column values were NA
- Pass them to pandas.drop()
In my specific case, I have a DataFrame of 39164 rows by 40 columns.
If I look to NA in the specific columns I found 17715 concerned labels that I saved to a dedicated variable. I then sent them to pandas.drop() expecting about 22000 rows remaining, but I only got 2001.
If I use pandas.dropna() I get 21449 rows remaining, which is what I was expecting.
Here follows my code.
The first code portion downloads data from gouv.fr (sorry for not using fake data...but it will only take less than 10 seconds to execute).
WARNING: only the last 5 years are stored on the online database. So my example should be adapted later...
import pandas as pd
villes = {'Versailles' : '78646',
'Aix-en-Provence' : '13001'}
years = range(2014,2019)
root = "https://cadastre.data.gouv.fr/data/etalab-dvf/latest/csv/"
data = pd.DataFrame({})
for ville in villes.keys() :
for year in years :
file_addr = '/'.join([root,str(year),'communes',villes[ville][:2],villes[ville]+'.csv'])
print(file_addr)
tmp = pd.read_csv(file_addr)
data =pd.concat([data,tmp])
This is the second portion, where I try to drop some rows. As said, the results are very different depending on the chosen strategy (data_1 vs data_2). data_2, obtained by dropna() is the expected results.
print(data.shape)
undefined_surface = data.index[data.surface_reelle_bati.isna()]
print(undefined_surface)
data_1 = data.drop(undefined_surface)
data_2 = data.dropna(subset=['surface_reelle_bati'])
print(data_1.shape)
print(data_2.shape)
Using dropna() is totally fine for me, but I would like to understand what I was doing wrong with drop() since I got a very silly result compared to my expectation and I would like to be aware of that in the future...
Thanks in advance for help.
This is because your index is not unique, look for example for index 0, you have forty rows with this index
data_idx0 = data.iloc[0]
data_idx0.shape
# (40,)
If at least one of the rows with index 0 has surface_reelle_bati missing, all the forty rows will disappear from data_1. This is why you drop more rows when creating data_1 than when creating data_2.
To solve this, use reset_index()to get index go from 0 to the number of rows of data
data = data.reset_index()
undefined_surface = data.index[data.surface_reelle_bati.isna()].tolist()
data_1 = data.drop(undefined_surface)
print(data_1.shape)
# (21449, 41)
data_2 = data.dropna(subset=['surface_reelle_bati'])
print(data_2.shape)
# (21449, 41)
I'm trying to do something that I think should be a one-liner, but am struggling to get it right.
I have a large dataframe, we'll call it lg, and a small dataframe, we'll call it sm. Each dataframe has a start and an end column, and multiple other columns all of which are identical between the two dataframes (for simplicity, we'll call all of those columns type). Sometimes, sm will have the same start and end as lg, and if that is the case, I want sm's type to overwrite lg's type.
Here's the setup:
lg = pd.DataFrame({'start':[1,2,3,4], 'end':[5,6,7,8], 'type':['a','b','c','d']})
sm = pd.DataFrame({'start':[9,2,3], 'end':[10,6,11], 'type':['e','f','g']})
...note that the only matching ['start','end'] combo is ['2','6']
My desired output:
start end type
0 1 5 a
1 2 6 f # where sm['type'] overwrites lg['type'] because of matching ['start','end']
2 3 7 c
3 3 11 g # where there is no overwrite because 'end' does not match
4 4 8 d
5 9 10 e # where this row is added from sm
I've tried multiple versions of .merge(), merge_ordered(), etc. but to no avail. I've actually gotten it to work with merge_ordered() and drop_duplicates() only to realize that it was simply dropping the duplicate that was earlier in the alphabet, not because it was from sm.
You can try to set start and end columns as index and then use combine_first:
sm.set_index(['start', 'end']).combine_first(lg.set_index(['start', 'end'])).reset_index()
Fairly new to pandas and I have created a data frame called rollParametersDf:
rollParametersDf = pd.DataFrame(columns=['insampleStart','insampleEnd','outsampleStart','outsampleEnd'], index=[])
with the 4 column headings given. Which I would like to hold the reference dates for a study I am running. I want to add rows of data (one at a time) with the index name roll1, roll2..rolln that is created using the following code:
outsampleEnd = customCalender.iloc[[totalDaysAvailable]]
outsampleStart = customCalender.iloc[[totalDaysAvailable-outsampleLength+1]]
insampleEnd = customCalender.iloc[[totalDaysAvailable-outsampleLength]]
insampleStart = customCalender.iloc[[totalDaysAvailable-outsampleLength-insampleLength+1]]
print('roll',rollCount,'\t',outsampleEnd,'\t',outsampleStart,'\t',insampleEnd,'\t',insampleStart,'\t')
rollParametersDf.append({insampleStart,insampleEnd,outsampleStart,outsampleEnd})
I have tried using append but cannot get an individual row to append.
I would like the final dataframe to look like:
insampleStart insampleEnd outsampleStart outsampleEnd
roll1 1 5 6 8
roll2 2 6 7 9
:
rolln
You give key-values pairs to append
df = pd.DataFrame({'insampleStart':[], 'insampleEnd':[], 'outsampleStart':[], 'outsampleEnd':[]})
df = df.append({'insampleStart':[1,2], 'insampleEnd':[5,6], 'outsampleStart':[6,7], 'outsampleEnd':[8,9]}, ignore_index=True)
The pandas documentation has an example of appending rows to a DataFrame. This appending action is different from that of a list in that this appending action generates a new DataFrame. This means that for each append action you are rebuilding and reindexing the DataFrame which is pretty inefficient. Here is an example solution:
# create empty dataframe
columns=['insampleStart','insampleEnd','outsampleStart','outsampleEnd']
rollParametersDf = pd.DataFrame(columns=columns)
# loop through 5 rows and append them to the dataframe
for i in range(5):
# create some artificial data
data = np.random.normal(size=(1, len(columns)))
# append creates a new dataframe which makes this operation inefficient
# ignore_index causes reindexing on each call.
rollParametersDf = rollParametersDf.append(pd.DataFrame(data, columns=columns),
ignore_index=True)
print rollParametersDf
insampleStart insampleEnd outsampleStart outsampleEnd
0 2.297031 1.792745 0.436704 0.706682
1 0.984812 -0.417183 -1.828572 -0.034844
2 0.239083 -1.305873 0.092712 0.695459
3 -0.511505 -0.835284 -0.823365 -0.182080
4 0.609052 -1.916952 -0.907588 0.898772