To whom it may concern,
I have a very large dataframe (MasterDataFrame) containing ~180K groups that I would like to split into 5 smaller DataFrames so I can process each one separately. Does anyone know of a way to achieve this split without accidentally splitting any of the groups from the MasterDataFrame? In other words, the 5 smaller DataFrames should not have overlapping groups.
Thanks in advance,
Christos
This is what my dataset looks like:
|======MasterDataset======|
Name Age Employer
Tom 12 Walmart
Nick 15 Disney
Chris 18 Walmart
Darren 19 KMart
Nate 43 ESPN
Harry 23 Walmart
Uriel 24 KMart
Matt 23 Disney
. . .
. . .
. . .
I need to be able to split my dataset such that the groups shown in the MasterDataset above are preserved. The smaller groups into which my MasterDataset will be split need to look like this:
|======SubDataset1======|
Name Age Employer
Tom 12 Walmart
Chris 18 Walmart
Harry 23 Walmart
Darren 19 KMart
Uriel 24 KMart
|======SubDataset2======|
Name Age Employer
Nick 15 Disney
Matt 23 Disney
I assume that by "groups" you mean the number of rows.
For that .iloc should be perfect.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html
df_1 = df.iloc[0:100000, :]
df_2 = df.iloc[100000:200000, :]  # slices are end-exclusive, so each one starts where the previous stopped
....
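If the intent is instead to keep the Employer groups intact, as the question asks, here is a minimal sketch: `groupby(...).ngroup()` assigns every group an integer id, and taking that id modulo 5 sends all rows of a group to the same chunk. The small frame below is a stand-in for the MasterDataFrame.

```python
import pandas as pd

# Toy stand-in for the MasterDataFrame; assumes "Employer" defines the groups.
df = pd.DataFrame({
    "Name": ["Tom", "Nick", "Chris", "Darren", "Nate", "Harry", "Uriel", "Matt"],
    "Age": [12, 15, 18, 19, 43, 23, 24, 23],
    "Employer": ["Walmart", "Disney", "Walmart", "KMart", "ESPN", "Walmart", "KMart", "Disney"],
})

# ngroup() numbers each Employer group; modulo 5 maps every row of a group
# to the same chunk, so no group is ever split across two sub-DataFrames.
chunk_id = df.groupby("Employer").ngroup() % 5
sub_dfs = [sub for _, sub in df.groupby(chunk_id)]
```

With ~180K groups, each of the 5 chunks receives roughly a fifth of the groups; group sizes are not balanced, only group counts.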
I'm looking to append a column in a pandas data frame that is similar to the following "Identifier" column:
Name Age Identifier
Peter Pan 13 PanPe
James Jones 24 JonesJa
Peter Pan 22 PanPe
Chris Smith 19 SmithCh
I need the "Identifier" column to look like:
Identifier
PanPe01
JonesJa01
PanPe02
SmithCh01
How would I number each original string with 01? And if there are duplicates (for example Peter Pan), then the following duplicate strings (after the original 01) will have 02, 03, and so forth?
I've been referred to the following theory:
combo = "PanPe"
counts = {}
if combo in counts:
    count = counts[combo]
    counts[combo] = count + 1
else:
    counts[combo] = 1
However, a working code example would be ideal, as I am relatively new to Python. I would love to know how to apply this process to an entire column, instead of just the one string ("PanPe") shown above.
You can use cumcount here:
df['new_Identifier'] = df['Identifier'] + (df.groupby('Identifier').cumcount() + 1).astype(str).str.pad(2, 'left', '0')  # thanks @dm2 for the str.pad part
Output:
Name Age Identifier new_Identifier
0 Peter Pan 13 PanPe PanPe01
1 James Jones 24 JonesJa JonesJa01
2 Peter Pan 22 PanPe PanPe02
3 Chris Smith 19 SmithCh SmithCh01
Thank you @dm2 and @Bushmaster
Sorry for the confusing title. I am practicing how to manipulate dataframes in Python through pandas. How do I make this kind of table:
id role name
0 11 ACTOR Luna Wedler, Jannis Niewöhner, Milan Peschel, ...
1 11 DIRECTOR Christian Schwochow
2 22 ACTOR Guy Pearce, Matilda Anna Ingrid Lutz, Travis F...
3 22 DIRECTOR Andrew Baird
4 33 ACTOR Glenn Fredly, Marcello Tahitoe, Andien Aisyah,...
5 33 DIRECTOR Saron Sakina
Into this kind:
id director actors name
0 11 Christian Schwochow Luna Wedler, Jannis Niewöhner, Milan Peschel, ...
1 22 Andrew Baird Guy Pearce, Matilda Anna Ingrid Lutz, Travis F...
2 33 Saron Sakina Glenn Fredly, Marcello Tahitoe, Andien Aisyah,...
Try this way
df.pivot(index='id', columns='role', values='name')
You can do in addition to @Tejas's answer:
df = (df.pivot(index='id', columns='role', values='name')
        .reset_index()
        .rename_axis('', axis=1)
        .rename(columns={'ACTOR': 'actors name', 'DIRECTOR': 'director'}))
Mathew Jim 60
Gerry Hagger 61
Sam Page 23
Azli Muzan 52
David Agor 32
James Paine 40
Mike Gregor 63
Howard Jack 56
I have this data in a CSV file, with names and ages in one column, and would like a Python script that separates the names from the ages and creates a new CSV with the individuals' heights classified as short, medium, and tall.
Height Classification
<5 short
5-6 medium
6> tall
I would like two column headings: Names and Classification.
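A minimal sketch of the classification step, assuming the combined column has already been split and a numeric Height column (in feet) is available; the names and heights below are hypothetical. `pd.cut` bins the values according to the table above (with these default settings, a value of exactly 5 or 6 falls into the lower class).

```python
import pandas as pd

# Hypothetical input: assumes a numeric Height column in feet.
df = pd.DataFrame({
    "Name": ["Mathew Jim", "Gerry Hagger", "Sam Page"],
    "Height": [4.8, 5.5, 6.3],
})

# Bin heights into the three classes: <5 short, 5-6 medium, >6 tall.
df["Classification"] = pd.cut(
    df["Height"],
    bins=[0, 5, 6, float("inf")],
    labels=["short", "medium", "tall"],
)

# Write only the two requested columns to a new CSV.
df[["Name", "Classification"]].to_csv("classified.csv", index=False)
```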
I've three datasets:
dataset 1
Customer1 Customer2 Exposures + other columns
Nick McKenzie Christopher Mill 23450
Nick McKenzie Stephen Green 23450
Johnny Craston Mary Shane 12
Johnny Craston Stephen Green 12
Molly John Casey Step 1000021
dataset2 (unique Customers: Customer 1 + Customer 2)
Customer Age
Nick McKenzie 53
Johnny Craston 75
Molly John 34
Christopher Mill 63
Stephen Green 65
Mary Shane 54
Casey Step 34
Mick Sale
dataset 3
Customer1 Customer2 Exposures + other columns
Mick Sale Johnny Craston
Mick Sale Stephen Green
Exposures refers to Customer 1 only.
Other columns are omitted for brevity. Dataset 2 is built from the unique values of Customer 1 and Customer 2: there are no duplicates in that dataset. Dataset 3 has the same columns as dataset 1.
I'd like to add the information from dataset 1 into dataset 2 to have
Final dataset
Customer Age Exposures + other columns
Nick McKenzie 53 23450
Johnny Craston 75 12
Molly John 34 1000021
Christopher Mill 63
Stephen Green 65
Mary Shane 54
Casey Step 34
Mick Sale
The final dataset should have all Customer1 and Customer 2 from both dataset 1 and dataset 3, with no duplicates.
I have tried to combine them as follows
result = pd.concat([df2,df1,df3], axis=1)
but the result is not that one I'd expect.
Something is wrong in the way I am concatenating the datasets, and I'd appreciate it if you could let me know what it is.
After concatenating the dataframes df1 and df3 (assuming they have the same columns), we can remove the duplicates using df1.drop_duplicates(subset=['Customer1']) and then join with df2 like this:
df1.set_index('Customer1').join(df2.set_index('Customer'))
If df1 and df2 have different columns for the primary key, we can join using the command above and then join again with the age table.
This gives the result: concatenate dataset 1 and dataset 3 (they have the same columns), then run this operation, joining on the respective keys.
Note: though not directly related to the question, for the concatenation one can use pd.concat([df1, df3], ignore_index=True) (here we are ignoring the index column).
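Putting the steps above together, a sketch with small stand-ins for the three datasets (names and numbers taken from the question; the merge on customer name replaces the set_index/join pair, which is equivalent here):

```python
import pandas as pd

# Minimal stand-ins for the three datasets described above.
df1 = pd.DataFrame({
    "Customer1": ["Nick McKenzie", "Nick McKenzie", "Johnny Craston",
                  "Johnny Craston", "Molly John"],
    "Customer2": ["Christopher Mill", "Stephen Green", "Mary Shane",
                  "Stephen Green", "Casey Step"],
    "Exposures": [23450, 23450, 12, 12, 1000021],
})
df2 = pd.DataFrame({
    "Customer": ["Nick McKenzie", "Johnny Craston", "Molly John", "Christopher Mill",
                 "Stephen Green", "Mary Shane", "Casey Step", "Mick Sale"],
    "Age": [53, 75, 34, 63, 65, 54, 34, None],
})
df3 = pd.DataFrame({
    "Customer1": ["Mick Sale", "Mick Sale"],
    "Customer2": ["Johnny Craston", "Stephen Green"],
    "Exposures": [None, None],
})

# Stack dataset 1 and dataset 3 (same columns), keep one row per Customer1,
# then attach the Exposures to dataset 2 by customer name.
stacked = pd.concat([df1, df3], ignore_index=True).drop_duplicates(subset=["Customer1"])
final = df2.merge(
    stacked[["Customer1", "Exposures"]],
    left_on="Customer", right_on="Customer1", how="left",
).drop(columns="Customer1")
```

Customers that never appear in Customer1 (Mary Shane, Casey Step, ...) keep a missing Exposures value, matching the desired final dataset.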
My dataframes look like this:
country1 state1 city1 District1
india 36 20 40
china 27 21 35
honkong 34 21 38
london 32 21 38
company technology car brand population
adf java Ford 40
ydfh java Hyundai 19
klyu java Nissan 47
hy6g dotnet Toyota 20
rghtr dotnet Hyundai 30
htryr dotnet hummer 12
I want to create multiple subsets from a single dataframe. I do not want to use index numbers, the iloc function, or hard-coded index positions, because the filter would break whenever a new entry is added, either after the london entry or after the last entry.
Any new entry should also be captured. Any clues how to do this in pandas or NumPy?
I hope this question is clear.
Assuming your data frame is saved as df, you can use groupby and save the grouped sub-data to a dictionary for future reference.
d = {}
for group, frame in df.groupby('country1'):
    d[group] = frame
Also, if you want to group by multiple columns, pass a list to groupby as follows:
for group, frame in df.groupby(['country1', 'technology']):
    d[group] = frame
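A quick usage sketch of the dictionary approach, with a tiny stand-in for the first dataframe shown above; any new country added to df simply becomes a new key in d:

```python
import pandas as pd

# Small stand-in for the country dataframe above.
df = pd.DataFrame({
    "country1": ["india", "china", "honkong", "london"],
    "state1": [36, 27, 34, 32],
})

# One sub-DataFrame per country; no index positions are hard-coded.
d = {group: frame for group, frame in df.groupby("country1")}

# Look up a subset by its group key.
india_rows = d["india"]
```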