Using pandas to clean massive dataset


I'm fairly new to this. I have a dataset that is laid out awkwardly (transposed), and I want to get it back to our guy within the next week. I think I'm pretty close to done.
The problem I'm facing is getting the data into one data frame. When I run the code and print from inside the for loop, I can see the chunks of values that need to be concatenated. However, I can't find a way to store all of the values: when I try, I end up with just one chunk.
import pandas as pd
import numpy as np

df = pd.read_excel("DATA,h",
                   header=None,
                   dtype=object)
ranges = []
last_index = 0

def clean(df12, df13):
    df12 = df12.T
    df13 = df13.T
    value1 = pd.DataFrame(df12)
    value2 = pd.DataFrame(df13)
    final_value = value1.append(value2)
    return final_value

for i, row in df.iterrows():
    rows = df.iloc[i]
    if rows[9] == 'Member' or rows[9] == 'Non-Pledging Member':
        if last_index == 0:
            last_index = i
        else:
            ranges.append([last_index, i])
            last_index = i
    # df44 is overwritten on every pass through the loop
    df44 = clean(row, row)
    print(df44)
When I print rows from inside the for loop, I get all the values I need in the terminal, but if I store the result in a variable or data frame, I only see one of those blocks of data. Does anyone know what's going on?
Data (there are 15k of these records):
Proctor, Terry 206-915-3555 Member
620 33rd Ave E 16283
Seattle, WA 98112
What I am shooting for:
Proctor, Terry, 620 33rd Ave E, Seattle, WA, 98112, 206-915-3555, Member
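One common way to keep every chunk instead of only the last one is to append each piece to a list inside the loop and call pd.concat once at the end. The sketch below is only an illustration under assumptions: the three-column layout, the second record ("Doe, Jane") and the names raw, chunks and result are hypothetical, and the real spreadsheet may be laid out differently.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the spreadsheet: every record spans three rows,
# and the status ("Member" / "Non-Pledging Member") sits in the last column.
raw = pd.DataFrame(
    [
        ["Proctor, Terry", "206-915-3555", "Member"],
        ["620 33rd Ave E", "16283", None],
        ["Seattle, WA 98112", None, None],
        ["Doe, Jane", "555-0100", "Non-Pledging Member"],
        ["1 Main St", "20001", None],
        ["Tacoma, WA 98401", None, None],
    ]
)

# A row that carries a membership status marks the start of a new record.
is_start = raw[2].isin(["Member", "Non-Pledging Member"]).to_numpy()
starts = np.flatnonzero(is_start).tolist()
bounds = starts + [len(raw)]

chunks = []  # one single-row frame per record is collected here
for begin, end in zip(bounds[:-1], bounds[1:]):
    block = raw.iloc[begin:end]
    # Flatten the block row by row and drop the empty cells.
    values = [v for v in block.to_numpy().ravel() if pd.notna(v)]
    chunks.append(pd.DataFrame([values]))

# Concatenate once, after the loop, so that no chunk is lost.
result = pd.concat(chunks, ignore_index=True)
print(result)
```

Assigning to the same variable inside the loop (as df44 is in the question) replaces the previous chunk each time, which is why only one block survives.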

Related

Pandas for Loop Optimization(Vectorization) when looking at previous row value

I'm looking to optimize the time taken by a function with a for loop. The code below is OK for smaller dataframes, but for larger dataframes it takes too long. The function effectively creates a new column based on calculations using other column values and parameters. The calculation also considers the previous row's value for one of the columns. I read that the most efficient way is to use pandas vectorization, but I'm struggling to understand how to implement this when my for loop uses the previous row's value of one column to populate a new column on the current row. I'm a complete novice; I have looked around and can't find anything that suits this specific problem, though I'm searching from a position of relative ignorance, so I may have missed something.
The function is below, and I've created a test dataframe and random parameters too. It would be great if someone could point me in the right direction to get the processing time down. Thanks in advance.
def MODE_Gain (Data, rated, MODELim1, MODEin, Normalin,NormalLim600,NormalLim1):
    print('Calculating Gains')
    df = Data
    df.fillna(0, inplace=True)
    df['MODE'] = ""
    df['Nominal'] = ""
    df.iloc[0, df.columns.get_loc('MODE')] = 0
    for i in range(1, (len(df.index))):
        print('Computing Status{i}/{r}'.format(i=i, r=len(df.index)))
        if ((df['MODE'].loc[i-1] == 1) & (df['A'].loc[i] > Normalin)) :
            df['MODE'].loc[i] = 1
        elif (((df['MODE'].loc[i-1] == 0) & (df['A'].loc[i] > NormalLim600))|((df['B'].loc[i] > NormalLim1) & (df['B'].loc[i] < MODELim1 ))):
            df['MODE'].loc[i] = 1
        else:
            df['MODE'].loc[i] = 0
    df[''] = (df['C']/6)
    for i in range(len(df.index)):
        print('Computing MODE Gains {i}/{r}'.format(i=i, r=len(df.index)))
        if ((df['A'].loc[i] > MODEin) & (df['A'].loc[i] < NormalLim600)&(df['B'].loc[i] < NormalLim1)) :
            df['Nominal'].loc[i] = rated/6
        else:
            df['Nominal'].loc[i] = 0
    df["Upgrade"] = df[""] - df["Nominal"]
    return df

A = np.random.randint(0,28,size=(8000))
B = np.random.randint(0,45,size=(8000))
C = np.random.randint(0,2300,size=(8000))
df = pd.DataFrame()
df['A'] = pd.Series(A)
df['B'] = pd.Series(B)
df['C'] = pd.Series(C)
MODELim600 = 32
MODELim30 = 28
MODELim1 = 39
MODEin = 23
Normalin = 20
NormalLim600 = 25
NormalLim1 = 32
rated = 2150
finaldf = MODE_Gain(df, rated, MODELim1, MODEin, Normalin,NormalLim600,NormalLim1)

Your second loop doesn't look at the prior row, so you should be able to replace it with this:
df['Nominal'] = 0
df.loc[(df['A'] > MODEin) & (df['A'] < NormalLim600) & (df['B'] < NormalLim1), 'Nominal'] = rated/6
For your first loop, the elif branch evaluates
((df['B'].loc[i] > NormalLim1) & (df['B'].loc[i] < MODELim1 ))
and sets MODE to 1 whenever it is true, regardless of the other conditions, so you can pull that check out of the loop and vectorize it. I didn't try it, but this should do it:
df.loc[(df['B'] > NormalLim1) & (df['B'] < MODELim1), 'MODE'] = 1
Then you may be able to collapse the remaining conditions into one statement using |.
I'm not sure how much time all of that will save, but getting rid of the second loop should cut the time roughly in half.

To vectorize it, I suggest you first shift the column into a new one:
df['MODE_1'] = df['MODE'].shift(1)
and then use:
(df['MODE_1'].loc[i] == 1)
After that you should be able to vectorize.
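Because MODE on each row depends on MODE on the previous row, a shifted copy of the column is only available once MODE has already been computed. One alternative, shown here as a hedged sketch and not as either answerer's method, is to keep the recurrence as a loop but run it over plain NumPy arrays instead of df.loc, after vectorizing the part that depends only on B. The function name compute_mode and the small test frame are hypothetical; the parameter names follow the question.

```python
import numpy as np
import pandas as pd

def compute_mode(df, Normalin, NormalLim600, NormalLim1, MODELim1):
    """Recompute the MODE column from the question on NumPy arrays."""
    a = df["A"].to_numpy()
    b = df["B"].to_numpy()
    # Rows where B alone forces MODE to 1, independent of the previous row.
    forced = (b > NormalLim1) & (b < MODELim1)
    mode = np.zeros(len(df), dtype=np.int64)  # row 0 stays 0, as in the question
    for i in range(1, len(df)):
        if forced[i]:
            mode[i] = 1
        elif mode[i - 1] == 1:
            # Already in MODE: stay in it while A is above the lower threshold.
            mode[i] = int(a[i] > Normalin)
        else:
            # Not in MODE: enter it only when A exceeds the higher threshold.
            mode[i] = int(a[i] > NormalLim600)
    return mode

# Tiny hypothetical frame to exercise every branch.
demo = pd.DataFrame({"A": [0, 26, 22, 10, 10], "B": [0, 0, 0, 0, 35]})
demo["MODE"] = compute_mode(demo, Normalin=20, NormalLim600=25,
                            NormalLim1=32, MODELim1=39)
print(demo)
```

The loop is still O(n), but it avoids the per-row label lookups and the per-row print calls, which usually dominate the running time in code like the original.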

Efficient way to loop through GroupBy DataFrame

Since my last post lacked information, here is an example of my df (the important columns):
deviceID: unique ID for the vehicle. Vehicles send data every X minutes.
mileage: the distance moved since the last message (in km)
position_timestamp_measure: Unix timestamp of the time the dataset was created.
deviceID    mileage    position_timestamp_measure
54672       10         1600696079
43423       20         1600696079
42342       3          1600701501
54672       3          1600702102
43423       2          1600702701
My goal is to validate the mileage by comparing it to the max speed of the vehicle (which is 80 km/h), calculating the speed of the vehicle from the timestamp and the mileage. The result should then be written to the original dataset.
What I've done so far is the following:
df_ori['dataIndex'] = df_ori.index
df = df_ori.groupby('device_id')
#create new col and set all values to false
df_ori['validPosition'] = 0
for group_name, group in df:
    #sort group by time
    group = group.sort_values(by='position_timestamp_measure')
    group = group.reset_index()
    #since I can't validate the first point in the group, I set it to valid
    df_ori.loc[df_ori.index == group.dataIndex.values[0], 'validPosition'] = 1
    #iterate through each data in the group
    for i in range(1, len(group)):
        timeGoneSec = abs(group.position_timestamp_measure.values[i]-group.position_timestamp_measure.values[i-1])
        timeHours = (timeGoneSec/60)/60
        #calculate speed
        if((group.mileage.values[i]/timeHours)<maxSpeedKMH):
            df_ori.loc[df_ori.index == group.dataIndex.values[i], 'validPosition'] = 1
df_ori.validPosition.value_counts()
It definitely works the way I want it to, but it performs very poorly. The df contains nearly 700k rows of data (already cleaned). I am still a beginner and can't figure out a better solution. I would really appreciate any help.

If I got it right, no for loops are needed here. Here is what I've transformed your code into:
df_ori['dataIndex'] = df_ori.index
# create new col and set all values to false
df_ori['valid'] = 0
df_ori = df_ori.sort_values(['position_timestamp_measure'])
# Subtract the preceding value from the current value within each device
df_ori['timeGoneSec'] = \
    df_ori.groupby('device_id')['position_timestamp_measure'].diff()
# The operation above produces NaN for the first row in each group;
# set 'valid' to 1 for those rows, as in the original code
df_ori.loc[df_ori['timeGoneSec'].isna(), 'valid'] = 1
df_ori['timeHours'] = df_ori['timeGoneSec']/3600 # 60*60 = 3600
df_ori['flag'] = (df_ori['mileage'] / df_ori['timeHours']) <= maxSpeedKMH
df_ori.loc[df_ori['flag'], 'valid'] = 1
# Remove helper columns
df_ori = df_ori.drop(columns=['flag', 'timeHours', 'timeGoneSec'])
The basic idea is to use vectorized operations as much as possible and to avoid for loops, especially row-by-row iteration, which can be extremely slow.
Since I don't have the full context of your code, please double-check the logic and make sure it works as desired.
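To make the idea easy to try out, here is a small self-contained sketch built on the sample rows from the question. The last row (500 km in one hour) is a made-up record added only so that one reading fails the check, and the strict < comparison follows the question's loop; treat the whole thing as an illustration and not as a drop-in replacement.

```python
import pandas as pd

MAX_SPEED_KMH = 80  # maximum speed stated in the question

df = pd.DataFrame({
    "device_id": [54672, 43423, 42342, 54672, 43423, 54672],
    "mileage": [10, 20, 3, 3, 2, 500],  # last value is hypothetical
    "position_timestamp_measure": [
        1600696079, 1600696079, 1600701501,
        1600702102, 1600702701, 1600705702,
    ],
})

# Sort so that each device's readings are in time order.
df = df.sort_values(["device_id", "position_timestamp_measure"])

# Hours elapsed since the previous reading of the same device
# (NaN for the first reading of each device).
hours = df.groupby("device_id")["position_timestamp_measure"].diff() / 3600
speed = df["mileage"] / hours

# First readings cannot be checked, so they count as valid.
df["valid"] = (hours.isna() | (speed < MAX_SPEED_KMH)).astype(int)

# Restore the original row order.
df = df.sort_index()
print(df)
```

Readings with identical timestamps give an infinite speed and are therefore flagged as invalid; whether that is the desired behaviour depends on the data.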

Most efficient method to modify values within large dataframes - Python

Overview: I am working with pandas dataframes of census information. They only have two columns, but they are several hundred thousand rows long. One column is a census block ID number and the other is a 'place' value, which is unique to the city in which that census block resides.
Example Data:
      BLOCKID         PLACEFP
0     60014001001000  53000
1     60014001001001  53000
...
5844  60014099004021  53000
5845  60014100001000
5846  60014100001001
5847  60014100001002  53000
Problem: As shown above, there are several place values that are blank, though they have a census block ID in their corresponding row. What I found was that in several instances, the census block ID that is missing a place value, is located within the same city as the surrounding blocks that do not have a missing place value, especially if the bookend place values are the same - as shown above, with index 5844 through 5847 - those two blocks are located within the same general area as the surrounding blocks, but just seem to be missing the place value.
Goal: I want to be able to go through this dataframe, find these instances and fill in the missing place value, based on the place value before the missing value and the place value that immediately follows.
Current State & Obstacle: I wrote a loop that goes through the dataframe to correct these issues, shown below.
import pandas

current_state_blockid_df = pandas.DataFrame({'BLOCKID':[60014099004021,60014100001000,60014100001001,60014100001002,60014301012019,60014301013000,60014301013001,60014301013002,60014301013003,60014301013004,60014301013005,60014301013006],
                                             'PLACEFP': [53000,'','',53000,11964,'','','','','','',11964]})
for i in current_state_blockid_df.index:
    if current_state_blockid_df.loc[i, 'PLACEFP'] == '':
        #Get value before blank
        prior_place_fp = current_state_blockid_df.loc[i - 1, 'PLACEFP']
        next_place_fp = ''
        _n = 1
        # Find the end of the blank section
        while next_place_fp == '':
            next_place_fp = current_state_blockid_df.loc[i + _n, 'PLACEFP']
            if next_place_fp == '':
                _n += 1
        # if the blanks could likely be in the same city, assign them the city's place value
        if prior_place_fp == next_place_fp:
            # the blank rows are i, i + 1, ..., i + _n - 1
            for _i in range(_n):
                current_state_blockid_df.loc[i + _i, 'PLACEFP'] = prior_place_fp
However, as expected, it is very slow when dealing with hundreds of thousands of rows of data. I have considered using something like a ThreadPoolExecutor to split up the work, but I haven't quite figured out the logic I'd use to get that done. One possibility to speed it up slightly is to drop the check for where the gap ends and instead just fill the gap with whatever place value came before the blanks. While that may end up being my go-to, there's still a chance it's too slow, and ideally I'd like it to fill in only if the before and after values match, eliminating the possibility of a block being mistakenly assigned. If someone has another suggestion as to how this could be achieved quickly, it would be very much appreciated.

You can use shift to help speed up the process. However, this doesn't handle cases where there are multiple blanks in a row.
df['PLACEFP_PRIOR'] = df['PLACEFP'].shift(1)
df['PLACEFP_AFTER'] = df['PLACEFP'].shift(-1)
criteria1 = df['PLACEFP'].isnull()
criteria2 = df['PLACEFP_PRIOR'] == df['PLACEFP_AFTER']
df.loc[criteria1 & criteria2, 'PLACEFP'] = df.loc[criteria1 & criteria2, 'PLACEFP_PRIOR']
If you end up needing to iterate over the dataframe, use df.itertuples. You can access the column values in the row via dot notation (row.column_name) and the index via row.Index.
for row in df.itertuples():
    # logic goes here
    pass
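For runs of several blanks, one possible vectorized variant (a sketch under assumptions, not the answerer's code) is to forward-fill and backward-fill the column and keep the filled value only where both agree. It assumes blanks are stored as empty strings, as in the question's sample frame; the shortened frame and the 22222 value below are hypothetical test data.

```python
import pandas as pd

df = pd.DataFrame({
    "BLOCKID": [60014099004021, 60014100001000, 60014100001001,
                60014100001002, 60014301012019, 60014301013000,
                60014301013001],
    # The trailing 22222 is made up so that one gap has mismatched bookends.
    "PLACEFP": [53000, "", "", 53000, 11964, "", 22222],
})

# Turn blanks into NaN so the fill methods can see them.
place = pd.to_numeric(df["PLACEFP"], errors="coerce")

before = place.ffill()  # nearest known value above each row
after = place.bfill()   # nearest known value below each row

# Fill a blank only when the values on both sides of the gap match.
candidate = before.where(before == after)
df["PLACEFP"] = place.fillna(candidate).astype("Int64")
print(df)
```

Gaps whose bookend values differ, and gaps at the very start or end of the frame, stay missing (shown as <NA>), which matches the stated goal of filling only when the before and after values agree.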

Using your dataframe as defined
def fix_df(current_state_blockid_df):
    df_with_blanks = current_state_blockid_df[current_state_blockid_df['PLACEFP'] == '']
    df_no_blanks = current_state_blockid_df[current_state_blockid_df['PLACEFP'] != '']
    sections = {}
    last_i = 0
    grouping = []
    for i in df_with_blanks.index:
        if i - 1 == last_i:
            grouping.append(i)
            last_i = i
        else:
            last_i = i
            if len(grouping) > 0:
                sections[min(grouping)] = {'indexes': grouping}
            grouping = []
            grouping.append(i)
    if len(grouping) > 0:
        sections[min(grouping)] = {'indexes': grouping}
    for i in sections.keys():
        sections[i]['place'] = current_state_blockid_df.loc[i-1, 'PLACEFP']
    l = []
    for i in sections:
        for x in sections[i]['indexes']:
            l.append(sections[i]['place'])
    df_with_blanks['PLACEFP'] = l
    final_df = pandas.concat([df_with_blanks, df_no_blanks]).sort_index(axis=0)
    return final_df

df = fix_df(current_state_blockid_df)
print(df)
Output:
BLOCKID PLACEFP
0 60014099004021 53000
1 60014100001000 53000
2 60014100001001 53000
3 60014100001002 53000
4 60014301012019 11964
5 60014301013000 11964
6 60014301013001 11964
7 60014301013002 11964
8 60014301013003 11964
9 60014301013004 11964
10 60014301013005 11964
11 60014301013006 11964

No axis named length for object type, trying to sort based on length of dataframe and mask it to the second dataframe

I have two dataframes. I want to sort the values of the first dataframe by string length, which I compute with str.len(), and then sort the second dataframe based on the resulting index of the first. I'm trying to use pandas masking, but it gives me an error. Any advice?
The indexes of both dataframes match.
My code:
wdata = pd.read_csv(fileinput, nrows=0).columns[0]
skip = int(wdata.count(' ') == 0)
wdata = pd.read_csv(fileinput, names=['sentences'], skiprows=skip)
length= wdata['sentences'].str.len().sort_values()
print(length)
sort= wdata['sentences'].sort_values('length', ascending=True, inplace=True).any(axis=1)
df=sort
print(df)
df2 = pd.read_csv(fileinput, nrows=0).columns[0]
skip = int(df2.count(' ') == 0)
df2 = pd.read_csv(fileinput, names=['sentences'], skiprows=skip)
newdata2 = df2[df2.sort(df.index)]
print(newdata2)
----------------------
#first dataframe example
----------------------
#how are you
#I want to die
#I was home
#I went to sleep at work
#he have a bad reputation
#it was me who went to him
#have good sleep home
#yes
#I'm good
----------------------
#second dataframe example
----------------------
#halaw kuy bashii
#damawe bmrm
#la malawa bum
#la esh nustm
#aw kabraya bash nya
#awa mn bum chum bo lay
#xaweki xosh basar bba la malawa
#bale
#mn bashm
the output I expect is
the errors I'm getting
raise ValueError("No axis named {0} for object type {1}".format(axis, cls))
ValueError: No axis named length for object type <class 'pandas.core.series.Series'>
What am I doing wrong? Any ideas on how to solve it, please?

First, use Series.argsort to get the positions of the sorted values, then pass them to DataFrame.iloc:
idx = wdata['sentences'].str.len().argsort()
df = wdata.iloc[idx]
print (df)
                   sentences
7                        yes
8                   I'm good
2                 I was home
0                how are you
1              I want to die
6       have good sleep home
3    I went to sleep at work
4   he have a bad reputation
5  it was me who went to him
If you want to select one column as a Series:
sentences = df['sentences']
For the second dataframe, use the same positions, provided it has the same index values as wdata:
newdata2 = df2.iloc[idx]
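A compact, self-contained version of the same idea is sketched below, using three of the sentence pairs from the question. The variable names order, sorted_first and sorted_second are illustrative only, and the stable sort is an optional choice that keeps equal-length sentences in their original order.

```python
import pandas as pd

wdata = pd.DataFrame({"sentences": ["how are you", "yes", "I was home"]})
df2 = pd.DataFrame({"sentences": ["halaw kuy bashii", "bale", "la malawa bum"]})

# Positions that would sort the first frame by sentence length.
lengths = wdata["sentences"].str.len().to_numpy()
order = lengths.argsort(kind="stable")

# Apply the same positional order to both frames so the rows stay paired.
sorted_first = wdata.iloc[order]
sorted_second = df2.iloc[order]

print(sorted_first)
print(sorted_second)
```

Because iloc selects by position, this works as long as both frames have the same number of rows in the same original order.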

Slicing my data frame is returning unexpected results

I have 13 CSV files that contain billing information in an unusual format. Multiple readings are recorded every 30 minutes of the day. Five days are recorded beside each other (in columns), and the next five days are recorded under them. To make things more complicated, the day of the week, date, and billing day are shown over the first recording of KVAR each day.
The image below shows a small example. However, imagine that KW, KVAR, and KVA repeat 3 more times before continuing some 50 rows later.
My goal is to create a simple Python script that turns the data into a data frame with the columns DATE, TIME, KW, KVAR, KVA, and DAY.
The problem is that my script returns NaN for the KW, KVAR, and KVA data after the first five days (which coincides with a new iteration of a for loop). What is weird to me is that when I print out the same ranges, I get the data that I expect.
My code is below. I have included comments to help further explain things. I also have an example of sample output of my function.
def make_df(df):
    #starting values
    output = pd.DataFrame(columns=["DATE", "TIME", "KW", "KVAR", "KVA", "DAY"])
    time = df1.loc[3:50,0]
    val_start = 3
    val_end = 51
    date_val = [0,2]
    day_type = [1,2]
    # There are 7 row movements that need to take place.
    for row_move in range(1,8):
        day = [1,2,3]
        date_val[1] = 2
        day_type[1] = 2
        # There are 5 column movements that take place.
        # The basic idea is that I would cycle through the five days, grab their data in a temporary dataframe,
        # and then append that dataframe onto the output dataframe
        for col_move in range(1,6):
            temp_df = pd.DataFrame(columns=["DATE", "TIME", "KW", "KVAR", "KVA", "DAY"])
            temp_df['TIME'] = time
            #These are the 3 values that stop working after the first column change
            # I get the values that I expect for the first 5 days
            temp_df['KW'] = df.iloc[val_start:val_end, day[0]]
            temp_df['KVAR'] = df.iloc[val_start:val_end, day[1]]
            temp_df['KVA'] = df.iloc[val_start:val_end, day[2]]
            # These 2 values work perfectly for the entire data set
            temp_df['DAY'] = df.iloc[day_type[0], day_type[1]]
            temp_df["DATE"] = df.iloc[date_val[0], date_val[1]]
            # trouble shooting
            print(df.iloc[val_start:val_end, day[0]])
            print(temp_df)
            output = output.append(temp_df)
            # increase values for each iteration of row loop.
            # seems to work perfectly when I print the data
            day = [x + 3 for x in day]
            date_val[1] = date_val[1] + 3
            day_type[1] = day_type[1] + 3
        # increase values for each iteration of column loop
        # seems to work perfectly when I print the data
        date_val[0] = date_val[0] + 55
        day_type [0]= day_type[0] + 55
        val_start = val_start + 55
        val_end = val_end + 55
    return output

test = make_df(df1)
Below is some sample output. It shows where the data starts to break down after the fifth day (or first instance of the column shift in the for loop). What am I doing wrong?

This could be caused by pandas requiring matching row indices: when a Series is assigned to a DataFrame column, the values are aligned on the row index, and rows without a match become NaN.
import pandas as pd
import numpy as np
output = pd.DataFrame(np.random.rand(10,2), columns=['a','b']) # fake data
output['c'] = list('abcdefghij') # add a column of non-numerical entries
tmp = pd.DataFrame(columns=['a','b','c'])
tmp['a'] = output.iloc[0:2, 2]
tmp['b'] = output.iloc[3:5, 2] # generates NaN
tmp['c'] = output.iloc[0:2, 2]
pd.concat([output, tmp])
(initial response)
What does df1 look like? Does df.iloc[val_start:val_end, day[0]] have any issue past the fifth day? The code doesn't show how you read the CSV files, or df1 itself.
My guess: if val_start:val_end gives invalid indices on the sixth day, or df1 happens to be malformed past the fifth day, df.iloc[val_start:val_end, day[0]] will return an empty Series object that may make its way into temp_df. iloc does not report out-of-range row slices, though an out-of-range column index would trigger an IndexError.
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.rand(5,3), columns=['a','b','c'], index=np.arange(5)) # fake data
df.iloc[0:2, 1] # returns the subset
df.iloc[100:102, 1] # returns: Series([], Name: b, dtype: float64)
A little off topic, but I would recommend preprocessing the CSV files rather than dealing with the indexing in a pandas DataFrame, as the original format is fairly complex. Slice the data by date and later use pd.melt or DataFrame.groupby to shape it into the format you like. Alternatively, try a MultiIndex if you want to stick with pandas I/O.
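The index-alignment effect described in the first part of this answer can be reproduced in a few lines. In the question, temp_df['TIME'] = time fixes the row index of temp_df to 3..50, so any later slice with different row labels no longer lines up. The sketch below uses a hypothetical frame raw and hypothetical column names; stripping the index with to_numpy() is one possible workaround, not necessarily the only one.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for one CSV sheet: 10 rows, 2 columns.
raw = pd.DataFrame(np.arange(20).reshape(10, 2))

# The first assignment fixes the row index of the new frame to 0..2.
block = pd.DataFrame({"TIME": raw.iloc[0:3, 0]})

# A slice with row labels 5..7 shares no labels with 0..2,
# so alignment turns every value into NaN.
block["KW_aligned"] = raw.iloc[5:8, 1]

# Dropping the index makes the assignment positional instead.
block["KW_positional"] = raw.iloc[5:8, 1].to_numpy()

print(block)
```

Printing the slice on its own shows the expected data, which is consistent with what the question reports: the values exist, but they are discarded during assignment because their row labels do not match.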
