Let's say I have a 10 GB CSV file and I want to get summary statistics for it using the DataFrame describe method.
In this case I first need to create a DataFrame for all 10 GB of CSV data:
text_csv = pandas.read_csv("target.csv")
df = pandas.DataFrame(text_csv)
df.describe()
Does this mean all 10 GB will get loaded into memory to calculate the statistics?
Yes, I think you are right. And you can omit df = pandas.DataFrame(text_csv), because the output of read_csv is already a DataFrame:
import pandas as pd
df = pd.read_csv("target.csv")
print(df.describe())
Or you can use dask, which reads the CSV lazily in partitions; note that its describe() is lazy too, so call .compute() to get the result:
import dask.dataframe as dd
df = dd.read_csv('target.csv')
print(df.describe().compute())
You can use the chunksize parameter of read_csv, but then you get a TextFileReader instead of a DataFrame, so you need to concat the chunks:
import pandas as pd
import io
temp=u"""a;b
1;525
1;526
1;533
2;527
2;528
2;532
3;519
3;534
3;535
4;530
5;529
5;531
6;520
6;521
6;524"""
# after testing, replace io.StringIO(temp) with the filename
# chunksize=2 only for testing
tp = pd.read_csv(io.StringIO(temp), sep=";", chunksize=2)
print(tp)
<pandas.io.parsers.TextFileReader object at 0x000000001995ADA0>
df = pd.concat(tp, ignore_index=True)
print(df.describe())
a b
count 15.000000 15.000000
mean 3.333333 527.600000
std 1.877181 5.082182
min 1.000000 519.000000
25% 2.000000 524.500000
50% 3.000000 528.000000
75% 5.000000 531.500000
max 6.000000 535.000000
Keep in mind that pd.concat above still loads the whole file into memory. You can instead convert each chunk of the TextFileReader to a DataFrame and describe it separately, but aggregating this per-chunk output can be difficult:
import pandas as pd
import io
temp=u"""a;b
1;525
1;526
1;533
2;527
2;528
2;532
3;519
3;534
3;535
4;530
5;529
5;531
6;520
6;521
6;524"""
# after testing, replace io.StringIO(temp) with the filename
tp = pd.read_csv(io.StringIO(temp), sep=";", chunksize=2)
print(tp)
dfs = []
for t in tp:
    df = pd.DataFrame(t)
    df1 = df.describe()
    dfs.append(df1.T)
df2 = pd.concat(dfs)
print(df2)
count mean std min 25% 50% 75% max
a 2 1.0 0.000000 1 1.00 1.0 1.00 1
b 2 525.5 0.707107 525 525.25 525.5 525.75 526
a 2 1.5 0.707107 1 1.25 1.5 1.75 2
b 2 530.0 4.242641 527 528.50 530.0 531.50 533
a 2 2.0 0.000000 2 2.00 2.0 2.00 2
b 2 530.0 2.828427 528 529.00 530.0 531.00 532
a 2 3.0 0.000000 3 3.00 3.0 3.00 3
b 2 526.5 10.606602 519 522.75 526.5 530.25 534
a 2 3.5 0.707107 3 3.25 3.5 3.75 4
b 2 532.5 3.535534 530 531.25 532.5 533.75 535
a 2 5.0 0.000000 5 5.00 5.0 5.00 5
b 2 530.0 1.414214 529 529.50 530.0 530.50 531
a 2 6.0 0.000000 6 6.00 6.0 6.00 6
b 2 520.5 0.707107 520 520.25 520.5 520.75 521
a 1 6.0 NaN 6 6.00 6.0 6.00 6
b 1 524.0 NaN 524 524.00 524.0 524.00 524
There seems to be no file-size limit for the pandas.read_csv method itself; available memory is the practical constraint.
According to fickludd's and Sebastian Raschka's answers in "Large, persistent DataFrame in pandas", you can use iterator=True and chunksize=xxx to load the giant CSV file in chunks and calculate the statistics you want:
import pandas as pd
reader = pd.read_csv('some_data.csv', iterator=True, chunksize=1000)  # gives a TextFileReader, iterable in chunks of 1000 rows
partial_descs = [chunk.describe() for chunk in reader]
Then aggregate all the partial describe outputs yourself.
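Only some of the statistics can be combined exactly across chunks (count, min, max, and a count-weighted mean); std and the quantiles cannot be recovered exactly from per-chunk summaries. A minimal sketch of the exact part, assuming the same chunked read as above and numeric columns only:

import pandas as pd

reader = pd.read_csv('some_data.csv', iterator=True, chunksize=1000)
count, total = None, None
mins, maxs = [], []
for chunk in reader:
    num = chunk.select_dtypes('number')
    # per-column running totals across chunks
    count = num.count() if count is None else count + num.count()
    total = num.sum() if total is None else total + num.sum()
    mins.append(num.min())
    maxs.append(num.max())

# combine the per-chunk partial results
summary = pd.DataFrame({'count': count,
                        'mean': total / count,
                        'min': pd.concat(mins, axis=1).min(axis=1),
                        'max': pd.concat(maxs, axis=1).max(axis=1)})
print(summary)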
I have a multi-column CSV file and I want to subtract columns X31-X27, Y31-Y27 and Z31-Z27 within the same dataframe, but the subtraction gives me NaN values.
Here are the values of the CSV file:
It gives me the result shown in the picture.
Help me to figure out this problem.
import pandas as pd
import os
import numpy as np
df27 = pd.read_csv('D:27.txt', names=['No27','X27','Y27','Z27','Date27','Time27'], sep='\s+')
df28 = pd.read_csv('D:28.txt', names=['No28','X28','Y28','Z28','Date28','Time28'], sep='\s+')
df29 = pd.read_csv('D:29.txt', names=['No29','X29','Y29','Z29','Date29','Time29'], sep='\s+')
df30 = pd.read_csv('D:30.txt', names=['No30','X30','Y30','Z30','Date30','Time30'], sep='\s+')
df31 = pd.read_csv('D:31.txt', names=['No31','X31','Y31','Z31','Date31','Time31'], sep='\s+')
total=pd.concat([df27,df28,df29,df30,df31], axis=1)
total.to_csv('merge27-31.csv', index = False)
print(total)
df2731 = pd.read_csv('C:\\Users\\finalmerge27-31.csv')
df2731.reset_index(inplace=True)
print(df2731)
df227 = df2731[['X31', 'Y31', 'Z31']] - df2731[['X27', 'Y27', 'Z27']]
print(df227)
# input data
df = pd.DataFrame({'x27':[-1458.88, 181.78, 1911.84, 3739.3, 5358.19], 'y27':[-5885.8, -5878.1,-5786.5,-5735.7, -5545.6],
'z27':[1102,4139,4616,4108,1123], 'x31':[-1458, 181, 1911, np.nan, 5358], 'y31':[-5885, -5878, -5786, np.nan, -5554],
'z31':[1102,4138,4616,np.nan,1123]})
df
x27 y27 z27 x31 y31 z31
0 -1458.88 -5885.8 1102 -1458.0 -5885.0 1102.0
1 181.78 -5878.1 4139 181.0 -5878.0 4138.0
2 1911.84 -5786.5 4616 1911.0 -5786.0 4616.0
3 3739.30 -5735.7 4108 NaN NaN NaN
4 5358.19 -5545.6 1123 5358.0 -5554.0 1123.0
Subtracting two DataFrame slices aligns on the column labels, and since 'X31' never matches 'X27', every result is NaN; dropping to the underlying values bypasses the alignment:
df1 = df[['x31', 'y31', 'z31']]
df2 = df[['x27', 'y27', 'z27']]
pd.DataFrame(df1.values - df2.values).rename(columns={0: 'x31-x27', 1: 'y31-y27', 2: 'z31-z27'})
Out:
   x31-x27  y31-y27  z31-z27
0 -0.88 -0.8 0.0
1 0.78 -0.1 1.0
2 0.84 -0.5 0.0
3 NaN NaN NaN
4 0.19 8.4 0.0
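An equivalent label-based sketch (using the same sample df as above) renames one slice so the columns line up before subtracting, which keeps index alignment instead of dropping to raw NumPy values:

# rename the *31 columns so they align with the *27 columns
left = df[['x31', 'y31', 'z31']].rename(columns={'x31': 'x27', 'y31': 'y27', 'z31': 'z27'})
diff = left - df[['x27', 'y27', 'z27']]
diff.columns = ['x31-x27', 'y31-y27', 'z31-z27']
print(diff)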
I am importing an Excel worksheet using pandas and trying to remove any instance where there is a duplicate area measurement for a given Frame. The sheet I'm working with looks vaguely like the table below, in which there are n files, a measured area from each frame of an individual file, and the Frame number that corresponds to each area measurement.
Filename.0          Area.0  Frame.0  Filename.1          Area.1  Frame.1  ...  Filename.n          Area.n  Frame.n
Exp327_Date_File_0  600     1        Exp327_Date_File_1  830     1        ...  Exp327_Date_File_n  700     1
Exp327_Date_File_0  270     2        Exp327_Date_File_1  730     1        ...  Exp327_Date_File_n  600     2
Exp327_Date_File_0  230     3        Exp327_Date_File_1  630     2        ...  Exp327_Date_File_n  500     3
Exp327_Date_File_0  200     4        Exp327_Date_File_1  530     3        ...  Exp327_Date_File_n  400     4
NaN                 NaN     NaN      Exp327_Date_File_1  430     4        ...  NaN                 NaN     NaN
If I manually go through the excel worksheet and concatenate the filenames into just 3 unique columns containing my entire dataset like so:
Filename            Area    Frame
Exp327_Date_File_0  600     1
Exp327_Date_File_0  270     2
etc...              etc...  etc...
Exp327_Date_File_n  530     4
I have been able to successfully use pandas to remove the duplicates using the following:
df_1 = df.groupby(['Filename', 'Frame Number']).agg({'Area': 'sum'})
However, manually concatenating everything into this format isn't feasible when I have hundreds of file replicates, and I then have to separate everything back out into multiple column sets (similar to how the data is presented in Table 1). How do I either (1) use pandas to create a new DataFrame with every 3 columns stacked on top of each other, which I can then group and aggregate before breaking back up into individual sets of columns based on Filename, or (2) loop through the multiple filenames and aggregate any Frames with multiple Areas? I have tried option 2:
(row, col) = df.shape  # shape of the data frame the excel file was read into
for count in range(0, round(col / 3)):  # iterate through the column sets
    aggregation_functions = {'Area.' + str(count): 'sum'}  # add Areas together
    df_2.groupby(['Filename.' + str(count), 'Frame Number.' + str(count)]).agg(aggregation_functions)
However, this just returns the same DataFrame without any of the Areas summed together. Any help would be appreciated, and please let me know if my question is unclear.
Here is a way to achieve option (1):
import numpy as np
import pandas as pd
# sample data
df = pd.DataFrame({'Filename.0': ['Exp327_Date_File_0', 'Exp327_Date_File_0',
'Exp327_Date_File_0', 'Exp327_Date_File_0',
np.NaN],
'Area.0': [600, 270, 230, 200, np.NaN],
'Frame.0': [1, 2, 3, 4, np.NaN],
'Filename.1': ['Exp327_Date_File_1', 'Exp327_Date_File_1',
'Exp327_Date_File_1', 'Exp327_Date_File_1',
'Exp327_Date_File_1'],
'Area.1': [830, 730, 630, 530, 430],
'Frame.1': [1, 1, 2, 3, 4],
'Filename.2': ['Exp327_Date_File_2', 'Exp327_Date_File_2',
'Exp327_Date_File_2', 'Exp327_Date_File_2',
'Exp327_Date_File_2'],
'Area.2': [700, 600, 500, 400, np.NaN],
'Frame.2': [1, 2, 3, 4, np.NaN]})
# create list of sub-dataframes, each with 3 columns, partitioning the original dataframe
subframes = [df.iloc[:, j:(j + 3)] for j in np.arange(len(df.columns), step=3)]
# set column names to the same values for each subframe
for subframe in subframes:
    subframe.columns = ['Filename', 'Area', 'Frame']
# concatenate the subframes
df_long = pd.concat(subframes)
df_long
Filename Area Frame
0 Exp327_Date_File_0 600.0 1.0
1 Exp327_Date_File_0 270.0 2.0
2 Exp327_Date_File_0 230.0 3.0
3 Exp327_Date_File_0 200.0 4.0
4 NaN NaN NaN
0 Exp327_Date_File_1 830.0 1.0
1 Exp327_Date_File_1 730.0 1.0
2 Exp327_Date_File_1 630.0 2.0
3 Exp327_Date_File_1 530.0 3.0
4 Exp327_Date_File_1 430.0 4.0
0 Exp327_Date_File_2 700.0 1.0
1 Exp327_Date_File_2 600.0 2.0
2 Exp327_Date_File_2 500.0 3.0
3 Exp327_Date_File_2 400.0 4.0
4 Exp327_Date_File_2 NaN NaN
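From here the grouping and aggregation from option (1) is straightforward; a sketch continuing from df_long above (the final pivot back to one column set per file is optional and hypothetical, depending on the layout you need):

# drop the all-NaN padding rows, then sum duplicate (Filename, Frame) areas
df_agg = (df_long.dropna(subset=['Filename'])
                 .groupby(['Filename', 'Frame'], as_index=False)['Area']
                 .sum())

# optionally pivot back to one Area column per file
df_wide = df_agg.pivot(index='Frame', columns='Filename', values='Area')
print(df_agg)
print(df_wide)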
I am trying to calculate the position of an object based on a timestamp. For this I have two dataframes in pandas: one for the measurement data and one for the position. All the movement is straightforward acceleration.
Dataframe 1 contains the measurement data:
ms force ... ... ...
1 5 20
2 10 20
3 15 25
4 20 30
5 25 20
..... (~ 6000 lines)
Dataframe 2 contains "positioning data"
ms speed (m/s)
1 0 0.66
2 4500 0.66
3 8000 1.3
4 16000 3.0
5 20000 3.0
.....(~300 lines)
Now I want to calculate the position for the first dataframe using the data from the second dataframe.
In Excel I solved the problem with an array formula, but now I have to use Python/pandas and I can't find a way to select the correct row from dataframe 2.
My idea is to use a condition on the timestamp ranges (see the pseudo code further below).
In the end I want to display a graph of "force <-> distance" instead of "force <-> time".
Thank you in advance.
==========================================================================
Update:
In the meantime I could almost solve my issue. Now my data looks like this:
Dataframe 2 (Speed Data):
pos v a t t-end t-start
0 -3.000 0.666667 0.000000 4.500000 4.500000 0.000000
1 0.000 0.666667 0.187037 0.071287 4.571287 4.500000
2 0.048 0.680000 0.650794 0.010244 4.581531 4.571287
3 0.055 0.686667 0.205432 0.064904 4.646435 4.581531
...
15 0.055 0.686667 0.5 0.064904 23.0 20.0
...
28 0.055 0.686667 0.6 0.064904 35.0 34.0
...
30 0.055 0.686667 0.9 0.064904 44.0 39.0
And Dataframe 1 (time based measurement):
Fx Fy Fz abs_t expected output ('a' from DF1)
0 -13.9 170.3 45.0 0.005 0.000000
1 -14.1 151.6 38.2 0.010 0.000000
...
200 -14.1 131.4 30.4 20.015 0.5
...
300 -14.3 111.9 21.1 34.01 0.6
...
400 -14.5 95.6 13.2 40.025
So I want to check the time (abs_t) from DF1 and look up the correct 'a' in DF2.
So something like this (pseudo code):
if DF1['t_abs'] between (DF2['t-start'], DF2['t-end']):
    DF1['a'] = DF2['a']
I could write two for loops, but that looks like the wrong way and is very, very slow.
I hope you understand my problem; providing a running sample is very hard.
In Excel I did it like this:
I found a very slow solution, but at least it's working:
df1['a'] = 0
for index, row in df2.iterrows():
    start = row['t-start']
    end = row['t-end']
    a = row['a']
    df1.loc[(df1['tabs'] > start) & (df1['tabs'] < end), 'a'] = a
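A vectorized alternative that avoids the row loop is to build an IntervalIndex from the t-start/t-end columns and look all timestamps up at once; a sketch that assumes the intervals in df2 do not overlap:

import numpy as np
import pandas as pd

# one interval per row of df2
intervals = pd.IntervalIndex.from_arrays(df2['t-start'], df2['t-end'], closed='neither')
# index of the interval each timestamp falls into (-1 means no match)
idx = intervals.get_indexer(df1['tabs'])
a_values = df2['a'].to_numpy()
df1['a'] = np.where(idx >= 0, a_values[idx], 0)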
I'm at a beginner to intermediate data science level. I want to impute missing values from a dataframe using knn.
As the dataframe contains strings and floats, I need to encode / decode values using LabelEncoder.
My method is as follows:
Replace NaN to be able to encode
Encode the text values and put them in a dictionary
Retrieve the NaN (previously converted) to be imputed with knn
Assign values with knn
Decode values from the dictionary
Unfortunately, in the last step, imputing values adds new values that cannot be decoded (an "unseen labels" error message).
Could you please explain to me what I am doing wrong, and ideally help me to correct it? Before concluding, I wanted to say that I know there are other tools like OneHotEncoder, but I don't know them well enough, and I found LabelEncoder much more intuitive because you can see the result directly in the dataframe (whereas OneHotEncoder returns an array).
Please find below an example of my method; thank you very much for your help:
[1]
# Import libraries.
import pandas as pd
import numpy as np
# intialise data of lists.
data = {'Name':['Jack', np.nan, 'Victoria', 'Nicolas', 'Victor', 'Brad'], 'Age':[59, np.nan, 29, np.nan, 65, 50], 'Car color':['Blue', 'Black', np.nan, 'Black', 'Grey', np.nan], 'Height ':[177, 150, np.nan, 180, 175, 190]}
# Make a DataFrame
df = pd.DataFrame(data)
# Print the output.
df
Output :
Name Age Car color Height
0 Jack 59.0 Blue 177.0
1 NaN NaN Black 150.0
2 Victoria 29.0 NaN NaN
3 Nicolas NaN Black 180.0
4 Victor 65.0 Grey 175.0
5 Brad 50.0 NaN 190.0
[2]
# LabelEncoder does not work with NaN values, so I replace them with value '1000' :
df = df.replace(np.nan, 1000)
# And to avoid errors, str columns must be set as strings (even '1000' value) :
df[['Name','Car color']] = df[['Name','Car color']].astype(str)
df
Output
Name Age Car color Height
0 Jack 59.0 Blue 177.0
1 1000 1000.0 Black 150.0
2 Victoria 29.0 1000 1000.0
3 Nicolas 1000.0 Black 180.0
4 Victor 65.0 Grey 175.0
5 Brad 50.0 1000 190.0
[3]
# Import LabelEncoder library :
from sklearn.preprocessing import LabelEncoder
# define labelencoder :
le = LabelEncoder()
# Import defaultdict library to make a dict of labelencoder :
from collections import defaultdict
# Initiate a dict of LabelEncoder values :
encoder_dict = defaultdict(LabelEncoder)
# Make a new dataframe of LabelEncoder values :
df[['Name','Car color']] = df[['Name','Car color']].apply(lambda x: encoder_dict[x.name].fit_transform(x))
# Show output :
df
Output
Name Age Car color Height
0 2 59.0 2 177.0
1 0 1000.0 1 150.0
2 5 29.0 0 1000.0
3 3 1000.0 1 180.0
4 4 65.0 3 175.0
5 1 50.0 0 190.0
[4]
#Reverse back 1000 to missing values in order to impute them :
df = df.replace(1000, np.nan)
df
Output
Name Age Car color Height
0 2 59.0 2 177.0
1 0 NaN 1 150.0
2 5 29.0 0 NaN
3 3 NaN 1 180.0
4 4 65.0 3 175.0
5 1 50.0 0 190.0
[5]
# Import the KNN imputer to impute the missing values :
from sklearn.impute import KNNImputer
# Define imputer :
imputer = KNNImputer(n_neighbors=2)
# Impute and reassign index/columns :
df = pd.DataFrame(np.round(imputer.fit_transform(df)),columns = df.columns)
df
Output
Name Age Car color Height
0 2.0 59.0 2.0 177.0
1 0.0 47.0 1.0 150.0
2 5.0 29.0 0.0 165.0
3 3.0 44.0 1.0 180.0
4 4.0 65.0 3.0 175.0
5 1.0 50.0 0.0 190.0
[6]
# Decode data :
inverse_transform_lambda = lambda x: encoder_dict[x.name].inverse_transform(x)
# Apply it to df -> THIS IS WHERE ERROR OCCURS :
df[['Name','Car color']].apply(inverse_transform_lambda)
Error message :
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-55-8a5e369215f6> in <module>()
----> 1 df[['Name','Car color']].apply(inverse_transform_lambda)
5 frames
/usr/local/lib/python3.6/dist-packages/pandas/core/frame.py in apply(self, func, axis, broadcast, raw, reduce, result_type, args, **kwds)
6926 kwds=kwds,
6927 )
-> 6928 return op.get_result()
6929
6930 def applymap(self, func):
/usr/local/lib/python3.6/dist-packages/pandas/core/apply.py in get_result(self)
184 return self.apply_raw()
185
--> 186 return self.apply_standard()
187
188 def apply_empty_result(self):
/usr/local/lib/python3.6/dist-packages/pandas/core/apply.py in apply_standard(self)
290
291 # compute the result using the series generator
--> 292 self.apply_series_generator()
293
294 # wrap results
/usr/local/lib/python3.6/dist-packages/pandas/core/apply.py in apply_series_generator(self)
319 try:
320 for i, v in enumerate(series_gen):
--> 321 results[i] = self.f(v)
322 keys.append(v.name)
323 except Exception as e:
<ipython-input-54-f16f4965b2c4> in <lambda>(x)
----> 1 inverse_transform_lambda = lambda x: encoder_dict[x.name].inverse_transform(x)
/usr/local/lib/python3.6/dist-packages/sklearn/preprocessing/_label.py in inverse_transform(self, y)
297 "y contains previously unseen labels: %s" % str(diff))
298 y = np.asarray(y)
--> 299 return self.classes_[y]
300
301 def _more_tags(self):
IndexError: ('arrays used as indices must be of integer (or boolean) type', 'occurred at index Name')
Based on my comment: KNNImputer returns floats, but inverse_transform needs the integer class codes, so cast back to int before decoding:
# Decode data :
inverse_transform_lambda = lambda x: encoder_dict[x.name].inverse_transform(x.astype(int))
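For completeness, applying it to the same columns as in step [3] would look like this (assuming the rounded, imputed codes all fall inside the range each encoder was fitted on):

df[['Name', 'Car color']] = df[['Name', 'Car color']].apply(inverse_transform_lambda)
df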
How can I get my JSON data into a reasonable data frame? I have a deeply nested file which I aim to get into a large data frame. Everything is described in the GitHub repository below:
http://www.github.com/simongraham/dataExplore.git
With nested JSON, you will need to walk through the levels, extracting the needed segments. For the nutrition segment of the larger JSON, consider iterating through every nutritionPortions level, each time running the pandas normalization and concatenating to the final dataframe:
import pandas as pd
import json
with open('/Users/simongraham/Desktop/Kaido/Data/kaidoData.json') as f:
data = json.load(f)
# INITIALIZE DF
nutrition = pd.DataFrame()
# ITERATIVELY CONCATENATE
for item in data[0]["nutritionPortions"]:
    if 'ftEnergyKcal' in item.keys():        # MISSING IN 3 OF 53 LEVELS
        temp = (pd.io
                  .json
                  .json_normalize(item, 'nutritionNutrients',
                                  ['vcNutritionId', 'vcUserId', 'vcPortionId', 'vcPortionName', 'vcPortionSize',
                                   'ftEnergyKcal', 'vcPortionUnit', 'dtConsumedDate'])
                )
        nutrition = pd.concat([nutrition, temp])
nutrition.head()
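Note that pd.io.json.json_normalize has since been deprecated in favour of the top-level pd.json_normalize; on a recent pandas the call inside the loop would look roughly like this, with the same record path and metadata fields:

temp = pd.json_normalize(item,
                         record_path='nutritionNutrients',
                         meta=['vcNutritionId', 'vcUserId', 'vcPortionId', 'vcPortionName',
                               'vcPortionSize', 'ftEnergyKcal', 'vcPortionUnit', 'dtConsumedDate'])

Collecting the per-portion frames in a list and calling pd.concat once after the loop is also usually faster than concatenating inside it.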
Output
ftValue nPercentRI vcNutrient vcNutritionPortionId \
0 0.00 0.0 alcohol c993ac30-ecb4-4154-a2ea-d51dbb293f66
1 0.00 0.0 bcfa c993ac30-ecb4-4154-a2ea-d51dbb293f66
2 7.80 6.0 biotin c993ac30-ecb4-4154-a2ea-d51dbb293f66
3 49.40 2.0 calcium c993ac30-ecb4-4154-a2ea-d51dbb293f66
4 1.82 0.0 carbohydrate c993ac30-ecb4-4154-a2ea-d51dbb293f66
vcTrafficLight vcUnit dtConsumedDate \
0 g 2016-04-12T00:00:00
1 g 2016-04-12T00:00:00
2 µg 2016-04-12T00:00:00
3 mg 2016-04-12T00:00:00
4 g 2016-04-12T00:00:00
vcNutritionId ftEnergyKcal \
0 070b97a4-d562-427d-94a8-1de1481df5d1 18.2
1 070b97a4-d562-427d-94a8-1de1481df5d1 18.2
2 070b97a4-d562-427d-94a8-1de1481df5d1 18.2
3 070b97a4-d562-427d-94a8-1de1481df5d1 18.2
4 070b97a4-d562-427d-94a8-1de1481df5d1 18.2
vcUserId vcPortionName vcPortionSize \
0 fe585e3d-2863-46fe-a41f-290bf58ad169 1 mug 260
1 fe585e3d-2863-46fe-a41f-290bf58ad169 1 mug 260
2 fe585e3d-2863-46fe-a41f-290bf58ad169 1 mug 260
3 fe585e3d-2863-46fe-a41f-290bf58ad169 1 mug 260
4 fe585e3d-2863-46fe-a41f-290bf58ad169 1 mug 260
vcPortionId vcPortionUnit
0 2 ml
1 2 ml
2 2 ml
3 2 ml
4 2 ml