I currently have a massive collection of datasets, one for each year of the 2000s. I take a combination of three years and run my cleaning code on that.
The problem is that, due to the size, I can't run the cleaning code on the combined set because I run out of memory.
I was thinking about splitting the data using something like:
df.iloc[0:N // x]
where N is the total number of rows in my dataframe. I think I should replace the dataframe after each chunk to free up the memory being used, which does mean I have to load the dataframe back in for each chunk I create.
There are several problems:
How do I get N when it can be different for each year?
The operation requires that groups of data stay together.
Is there a way to make x vary with the size of N?
Is all of this highly inefficient, or is there an efficient built-in function for this?
Dataframe looks like:
ID Location year other variables
1 a 2006
1 a 2007
1 b 2006
1 a 2005
2 c 2005
2 c 2007
3 d 2005
What I need is for all rows with the same ID to stay together.
The data should be cut into manageable chunks, dependent on a total amount of data that changes from year to year.
In this case it would be:
ID Location year other variables
1 a 2006
1 a 2007
1 b 2006
1 a 2005
ID Location year other variables
2 c 2005
2 c 2007
3 d 2005
The data originates from one CSV per year, so all 2005 data comes from the 2005 CSV, 2006 data from the 2006 CSV, etc.
The CSVs are loaded into memory and concatenated to form one set of three years.
The individual CSV files have the same setup as indicated above: each observation is listed with an ID, location and year, followed by a lot of other variables.
Running it on a group-by-group basis would be a bad idea, as there are thousands, if not millions, of these IDs. Each can have dozens of locations and a maximum of three years, and all of it needs to stay together.
Loops over this many rows take ages in my experience.
I was thinking maybe something along the lines of:
create a variable that counts the number of groups
take the maximum of this count variable and divide it by 4 or 5
cut the data up into chunks this way
I'm not sure if this would be efficient, and even less sure how to execute it; a rough sketch of the idea follows.
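Something like this is what I have in mind (a rough sketch only: groupby().ngroup() numbers the ID groups so that every row of an ID lands in the same chunk, and clean() is a placeholder for my actual cleaning code):
# df is the concatenated three-year dataframe; clean() is a placeholder
n_chunks = 4                                       # "divide it by 4 or 5"
group_no = df.groupby('ID', sort=False).ngroup()   # 0, 1, 2, ... per distinct ID
chunk_no = group_no % n_chunks                     # all rows of one ID share a chunk number

for _, chunk in df.groupby(chunk_no):
    clean(chunk)                                   # process one manageable piece at a time
Because ngroup() counts however many groups each year's data happens to contain, this adapts to a different N each year without knowing N in advance.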
One way to achieve this would be as follows:
import numpy as np
import pandas as pd

# generating a random demo DF
num_rows = 100
locs = list('abcdefghijklmno')
df = pd.DataFrame(
    {'id': np.random.randint(1, 100, num_rows),
     'location': np.random.choice(locs, num_rows),
     'year': np.random.randint(2005, 2007, num_rows)})
df.sort_values('id', inplace=True)
print('**** sorted DF (first 10 rows) ****')
print(df.head(10))
# chopping the DF into chunks of `chunk_size` unique ids each;
# the sentinel (max id + 1) makes sure the last ids are not dropped
chunk_size = 5
boundaries = list(df.id.unique()[::chunk_size]) + [df.id.max() + 1]
chunk_margins = [(boundaries[i - 1], boundaries[i]) for i in range(1, len(boundaries))]
df_chunks = [df.loc[(df.id >= lo) & (df.id < hi)] for lo, hi in chunk_margins]
print('**** first chunk ****')
print(df_chunks[0])
Output:
**** sorted DF (first 10 rows) ****
id location year
31 2 c 2005
85 2 e 2006
89 2 l 2006
70 2 i 2005
60 4 n 2005
68 7 g 2005
22 7 e 2006
73 10 i 2005
23 10 j 2006
47 16 n 2005
**** first chunk ****
id location year
31 2 c 2005
85 2 e 2006
89 2 l 2006
70 2 i 2005
60 4 n 2005
68 7 g 2005
22 7 e 2006
73 10 i 2005
23 10 j 2006
47 16 n 2005
6 16 k 2006
82 16 g 2005
Another option is to use Blaze, which gives you chunked pandas processing.
Instructions from http://blaze.readthedocs.org/en/latest/ooc.html
Naive use of Blaze triggers out-of-core systems automatically when called on large files.
from blaze import Data

d = Data('my-small-file.csv')
d.my_column.count() # Uses Pandas
d = Data('my-large-file.csv')
d.my_column.count() # Uses Chunked Pandas
How does it work?
Blaze breaks up the data resource into a sequence of chunks. It pulls one chunk into memory, operates on it, pulls in the next, etc.. After all chunks are processed it often has to finalize the computation with another operation on the intermediate results.
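For the CSV-per-year setup in the question, plain pandas can follow the same chunk-operate-finalize pattern via read_csv(chunksize=...). A minimal sketch (the file name and the aggregation are placeholders; note that rows of one ID may be split across chunks, so grouped work needs a final combine step):
import pandas as pd

partial_counts = []
for chunk in pd.read_csv('2005.csv', chunksize=100_000):   # hypothetical file name
    # operate on one in-memory piece at a time
    partial_counts.append(chunk.groupby('ID').size())

# finalize: combine the per-chunk results into one answer
rows_per_id = pd.concat(partial_counts).groupby(level=0).sum()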
Related
I have columns with vehicle data; for vehicles more than one year old with mileage less than 100, I want to replace the mileage with 1000.
My attempts:
mileage_corr = vehicle_data_all.loc[(vehicle_data_all["mileage"] < 100) & (vehicle_data_all["year"] < 2020)], 1000
Error - AttributeError: 'tuple' object has no attribute
and
mileage_corr = vehicle_data_all.loc[(vehicle_data_all["mileage"] < 100) & (vehicle_data_all["year"] < 2020)]
mileage_corr['mileage'].where(mileage_corr['mileage'] <= 100, 1000, inplace=True)
error -
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
return self._where(
Without complete information, assuming your vehicle_data_all DataFrame looks something like this,
year mileage
0 2019 192
1 2014 78
2 2010 38
3 2018 119
4 2019 4
5 2012 122
6 2005 50
7 2015 69
8 2004 56
9 2003 194
Pandas has a way of assigning based on a filter result. This is referred to as setting values.
df.loc[condition, "field_to_change"] = desired_change
Applied to your dataframe would look something like this,
vehicle_data_all.loc[((vehicle_data_all["mileage"] < 100) & (vehicle_data_all["year"] < 2020)), "mileage"] = 1000
This was my result,
year mileage
0 2019 192
1 2014 1000
2 2010 1000
3 2018 119
4 2019 1000
5 2012 122
6 2005 1000
7 2015 1000
8 2004 1000
9 2003 194
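For reference, the same conditional replacement can also be written with Series.mask, which is what the where attempt in the question was reaching for (where keeps values where the condition holds, mask replaces them). A small sketch on the same assumed dataframe:
cond = (vehicle_data_all["mileage"] < 100) & (vehicle_data_all["year"] < 2020)
vehicle_data_all["mileage"] = vehicle_data_all["mileage"].mask(cond, 1000)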
I have a table that looks like this:
code  year  month  Value A  Value B
1     2020  1      120      100
1     2020  2      130      90
1     2020  3      90       89
1     2020  4      67       65
...   ...   ...    ...      ...
100   2020  10     90       90
100   2020  11     115      100
100   2020  12     150      135
I would like to know if there's a way to rearrange the data to find the correlation between A and B for every distinct code.
What I'm thinking is, for example, getting an array for every code, like:
[(A1,A2,A3...,A12),(B1,B2,B3...,B12)]
where A and B are the values for the respective months, and then I could compute the correlation between these two columns. Is there a way to make this dynamic?
IIUC, you don't need to re-arrange to get the correlation for each "code". Instead, try with groupby:
>>> df.groupby("code").apply(lambda x: x["Value A"].corr(x["Value B"]))
code
1 0.830163
100 0.977093
dtype: float64
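If you do want the wide arrangement sketched in the question (one row per code with the monthly values side by side), a pivot gets you there, and groupby().corr() is another way to read off the per-code A-vs-B correlation. A sketch, assuming code and month uniquely identify each row:
# one row per code, one column per (value, month) pair
wide = df.pivot(index="code", columns="month", values=["Value A", "Value B"])

# full 2x2 correlation matrix per code, keeping only the A-vs-B entry
corr_ab = (df.groupby("code")[["Value A", "Value B"]]
             .corr()
             .unstack()
             .loc[:, ("Value A", "Value B")])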
I have two data frames, each with about 250K lines. I am trying to do a fuzzy lookup between the two data frames' columns. After the lookup I need the indexes of the good matches that meet the threshold.
Some details follow.
My df1:
Name State Zip combined_1
0 Auto MN 10 Auto,MN,10
1 Rtla VI 253 Rtla,VI,253
2 Huka CO 56218 Huka,CO,56218
3 kann PR 214 Kann,PR,214
4 Himm NJ 65216 Himm,NJ,65216
5 Elko NY 65418 Elko,NY,65418
6 Tasm MA 13 Tasm,MA,13
7 Hspt OH 43218 Hspt,OH,43218
My other data frame, the one I am trying to look up against:
Name State Zip combined_2
0 Kilo NC 69521 Kilo,NC,69521
1 Kjhl FL 3369 Kjhl,FL,3369
2 Rtla VI 25301 Rtla,VI,25301
3 Illt GA 30024 Illt,GA,30024
4 Huka CO 56218 Huka,CO,56218
5 Haja OH 96766 Haja,OH,96766
6 Auto MN 010 Auto,MN,010
7 Clps TX 44155 Clps,TX,44155
If you look closely, when I do the fuzzy lookup I should get good matches for indexes 0 and 2 in my df1 from df2 indexes 6 and 4.
So, I did this,
from fuzzywuzzy import fuzz
# Save df1 index
df_1index = []
# save df2 index
df2_indexes = []
# save fuzzy ratio
fazz_rat = []
for index, details in enumerate(df1['combined_1']):
    for ind, information in enumerate(df2['combined_2']):
        fuzmatch = fuzz.ratio(str(details), str(information))
        if fuzmatch >= 94:
            df_1index.append(index)
            df2_indexes.append(ind)
            fazz_rat.append(fuzmatch)
        else:
            pass
As expected, I get these results for this example case:
df_1index
>> [0,2]
df2_indexes
>> [6,4]
Running this against 250K × 250K combinations from the two data frames takes far too much time.
How can I speed up this lookup process? Is there a pandas or Python way to improve performance for what I want?
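One option that might speed this up (not from the original post) is the rapidfuzz package, which reimplements the fuzzywuzzy scorers in C++ and can score a whole block of strings against another list in parallel with process.cdist. Processing df1 in blocks keeps the score matrix small. A sketch, assuming rapidfuzz is installed and the column names from above:
import numpy as np
from rapidfuzz import fuzz, process

threshold = 94
block_size = 1000                           # tune to available memory

queries = df1['combined_1'].astype(str).tolist()
choices = df2['combined_2'].astype(str).tolist()

df_1index, df2_indexes, fazz_rat = [], [], []
for start in range(0, len(queries), block_size):
    block = queries[start:start + block_size]
    # score matrix for this block; entries below the cutoff come back as 0
    scores = process.cdist(block, choices, scorer=fuzz.ratio,
                           score_cutoff=threshold, workers=-1)
    rows, cols = np.nonzero(scores)
    df_1index.extend(rows + start)
    df2_indexes.extend(cols)
    fazz_rat.extend(scores[rows, cols])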
I have DF[number] = pd.read_html(url.text).
I want to concatenate or join the DF lists (there are hundreds of them, e.g. DFs[400]) into a single pandas dataframe.
The dataframes come wrapped in lists (so a list of lists), where each inner element is a pandas dataframe.
[ Vessel Built GT DWT Size (m) Unnamed: 5
0 x XIN HUA Bulk Carrier 2012 44543 82269 229 x 32
1 b FRANCESCO CORRADO Bulk Carrier 2008 40154 77061 225 x 32
2 5 NAN XIN 17 Bulk Carrier 2001 40570 75220 225 x 32
3 p DIAMOND INDAH Bulk Carrier 2002 43321 77830 229 x 37
4 NaN PRIME LILY Bulk Carrier 2012 44485 81507 229 x 32
5 s EVGENIA Bulk Carrier 2011 92183 176000 292 x 45
df[number] = pd.read_html(url.text)
for number in range(494):
    df = pd.concat(df[number])
but that doesn't seem to work. I have also tried:
df1=pd.concat(df[1])
df2=pd.concat(df[2])
df3=pd.concat(df[3])
dfx=pd.concat([df1,df2,df3],ignore_index=True)
This is not what I want, as there are hundreds of these list-wrapped dataframes.
I want one pandas dataframe that joins all of the list dataframes into one.
Just to be clear: the df container of the lists is a dict, while df[1] is a list.
You can use a list comprehension to flatten the lists of tables and concatenate them:
pd.concat([table for i in range(len(dfs)) for table in dfs[i]], ignore_index=True)
I am working with a very large dataframe (3.5 million × 150, which takes 25 GB of memory when unpickled) and I need to find the maximum of one column over an ID number and a date and keep only the row with that maximum value. Each row is a recorded observation for one ID at a certain date, and I also need the latest date.
This is animal test data where there are twenty additional columns seg1-seg20 for each id and date that are filled with test-day information consecutively; for example, the first test's data fills seg1, the second test's data fills seg2, etc. The "value" field indicates how many segments have been filled, in other words how many tests have been done, so the row with the maximum "value" has the most test data. Ideally I only want these rows and not the previous rows. For example:
import pandas as pd

df = pd.DataFrame({'id': [1000, 1000, 1001, 2000, 2000, 2000],
                   "date": [20010101, 20010201, 20010115, 20010203, 20010223, 20010220],
                   "value": [3, 1, 4, 2, 6, 6],
                   "seg1": [22, 76, 23, 45, 12, 53],
                   "seg2": [23, "", 34, 52, 24, 45],
                   "seg3": [90, "", 32, "", 34, 54],
                   "seg4": ["", "", 32, "", 43, 12],
                   "seg5": ["", "", "", "", 43, 21],
                   "seg6": ["", "", "", "", 43, 24]})
df
date id seg1 seg2 seg3 seg4 seg5 seg6 value
0 20010101 1000 22 23 90 3
1 20010201 1000 76 1
2 20010115 1001 23 34 32 32 4
3 20010203 2000 45 52 2
4 20010223 2000 12 24 34 43 43 41 6
5 20010220 2000 12 24 34 43 44 35 6
And eventually it should be:
date id seg1 seg2 seg3 seg4 seg5 seg6 value
0 20010101 1000 22 23 90 3
2 20010115 1001 23 34 32 32 4
4 20010223 2000 12 24 34 43 43 41 6
I first tried to use .groupby('id').max() but couldn't find a way to use it to drop rows. The resulting dataframe MUST contain the ORIGINAL ROWS, not just the maximum value of each column for each id. My current solution is:
for i in df.id.unique():
    df = df.drop(df.loc[df.id == i].sort(['value', 'date']).index[:-1])
But this takes around 10 seconds per iteration, I assume because it's scanning the entire dataframe each time. There are 760,000 unique ids, each 17 digits long, so it will take way too long to be feasible at this rate.
Is there another method that would be more efficient? Currently every column is read in as an "object", but converting the relevant columns to the smallest possible integer type doesn't seem to help either.
I tried groupby('id').max() and it works, and it also drops the rows. Did you remember to reassign the df variable? This operation (like almost all pandas operations) is not in-place.
If you do:
df.groupby('id', sort=False).max()
You will get:
date value
id
1000 20010201 3
1001 20010115 4
2000 20010223 6
And if you don't want id as the index, you do:
df.groupby('id', sort=False, as_index=False).max()
And you will get:
id date value
0 1000 20010201 3
1 1001 20010115 4
2 2000 20010223 6
I don't know if that's going to be much faster, though.
Update
This way the index will not be reset:
df.loc[df.groupby('id').apply(lambda x: x['value'].idxmax())]
And you will get:
date id seg1 seg2 seg3 seg4 seg5 seg6 value
0 20010101 1000 22 23 90 3
2 20010115 1001 23 34 32 32 4
4 20010223 2000 12 24 34 43 43 43 6
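If the apply above is still too slow for 760,000 ids, a fully vectorized alternative (a sketch, using the latest date to break ties on equal value, and keeping the original rows) is to sort and drop duplicates:
result = (df.sort_values(['value', 'date'])
            .drop_duplicates('id', keep='last')
            .sort_index())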