Splitting Pandas dataframe into multiple mini-dataframes - python

This is the second part of a program I'm working on. I have a pandas dataframe whose columns are:
Title|df1_data1|df1_data2|df1_data3|df1_data4|df2_data1|df2_data2|df2_data3|df2_data4|df3_data1|df3_data2|df3_data3|df3_data4
But there are two rules:
The df will NOT always consist of 3 files (df1, df2, df3); there can be more or fewer.
There are ALWAYS 4 pieces of data per file.
I have the next step of the code written, but its input needs to be multiple mini-dataframes of this bigger one.
So for this example of three files, I need to split the dataframe into:
1. |Title|df1_data1|df1_data2|df1_data3|df1_data4|
2. |Title|df2_data1|df2_data2|df2_data3|df2_data4|
3. |Title|df3_data1|df3_data2|df3_data3|df3_data4|
I'm currently trying to figure this out by looping through the headers and creating a dataframe for every four headers (not counting Title), but I haven't gotten it working yet. Any help is appreciated.
Here's the big dataframe (remember the rules):
import pandas as pd

thisdict = {'Title': ['aaarrr', 'hahahamhm', 'yaaahooo', 'yaahoo', 'oopsymhm', 'ayorrr'],
            'df1_data1': ['324', '123', '444', 'NOTHING', 'NOTHING', 'NOTHING'],
            'df1_data2': ['4314', '4321', '7658', 'NOTHING', 'NOTHING', 'NOTHING'],
            'df1_data3': ['342', '111', '235', 'NOTHING', 'NOTHING', 'NOTHING'],
            'df1_data4': ['325', '542', '523', 'NOTHING', 'NOTHING', 'NOTHING'],
            'df2_data1': ['1', 'NOTHING', 'NOTHING', '4', '3', 'NOTHING'],
            'df2_data2': ['2', 'NOTHING', 'NOTHING', '3', '2', 'NOTHING'],
            'df2_data3': ['3', 'NOTHING', 'NOTHING', '2', '4', 'NOTHING'],
            'df2_data4': ['4', 'NOTHING', 'NOTHING', '1', '1', 'NOTHING'],
            'df3_data1': ['NOTHING', 'NOTHING', 'NOTHING', '2', '67', '4'],
            'df3_data2': ['NOTHING', 'NOTHING', 'NOTHING', '73', '2', '7'],
            'df3_data3': ['NOTHING', 'NOTHING', 'NOTHING', '2', '4', '5'],
            'df3_data4': ['NOTHING', 'NOTHING', 'NOTHING', '1', '0', '9']}
dataframe = pd.DataFrame(thisdict)

You can append Title to the index (in case there are duplicate Title values), then build a dict of four-column segments based on the total number of columns:
df2 = dataframe.set_index('Title', append=True) # append for just in case duplicate values of Title
df_s = {(i+1): df2.iloc[:, i*4: i*4+4].reset_index(level=-1) for i in range(len(df2.columns) // 4)}
Then you can access each split dataframe as df_s[i], e.g.
print(df_s[1])
Title df1_data1 df1_data2 df1_data3 df1_data4
0 aaarrr 324 4314 342 325
1 hahahamhm 123 4321 111 542
2 yaaahooo 444 7658 235 523
3 yaahoo NOTHING NOTHING NOTHING NOTHING
4 oopsymhm NOTHING NOTHING NOTHING NOTHING
5 ayorrr NOTHING NOTHING NOTHING NOTHING
print(df_s[2])
Title df2_data1 df2_data2 df2_data3 df2_data4
0 aaarrr 1 2 3 4
1 hahahamhm NOTHING NOTHING NOTHING NOTHING
2 yaaahooo NOTHING NOTHING NOTHING NOTHING
3 yaahoo 4 3 2 1
4 oopsymhm 3 2 4 1
5 ayorrr NOTHING NOTHING NOTHING NOTHING
print(df_s[3])
Title df3_data1 df3_data2 df3_data3 df3_data4
0 aaarrr NOTHING NOTHING NOTHING NOTHING
1 hahahamhm NOTHING NOTHING NOTHING NOTHING
2 yaaahooo NOTHING NOTHING NOTHING NOTHING
3 yaahoo 2 73 2 1
4 oopsymhm 67 2 4 0
5 ayorrr 4 7 5 9

You can set Title as the index and use filter to select each group of columns. Deriving the number of groups from the column count keeps it working for any number of files:
df = df.set_index('Title')
n_files = len(df.columns) // 4  # always 4 data columns per file
dfs = {'df%s' % i: df.filter(like='df%s' % i).reset_index()
       for i in range(1, n_files + 1)}
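Either way, the next step of your code can then consume the pieces one at a time, e.g.:
for name, sub_df in dfs.items():
    print(name)
    print(sub_df)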

Related

Conditional merging of 2 pandas dataframes with no overlapping variable

I'm trying to merge two dataframes with conditions and one-to-many relationships.
The dataframes do not have an overlapping column, as I am working with coordinate data.
df1 = pd.DataFrame({'Label': ['A', 'B', 'C', 'D'],
                    'x_low': [101940675, 101947985, 101941345, 101948789],
                    'x_high': [101940777, 10194855, 101941577, 101949111],
                    'y_low': [427429081, 427429000, 427429596, 427429466],
                    'y_high': [427429089, 427429001, 427429599, 427429467]})
df2 = pd.DataFrame({'Image': ['1', '2', '3', '4', '5'],
                    'X': [101948445, 101948467, 101948764, 101947896, 101941234],
                    'Y': [427429082, 427429001, 427429597, 427429467, 427430045]})
df1
Label x_low x_high y_low y_high
0 A 101940675 101940777 427429081 427429089
1 B 101947985 10194855 427429000 427429001
2 C 101941345 101941577 427429596 427429599
3 D 101948789 101949111 427429466 427429467
df2
Image X Y
0 1 101948445 427429082
1 2 101948467 427429001
2 3 101948764 427429597
3 4 101947896 427429467
4 5 101941234 427430045
I want to merge the datasets with the condition that the X and Y coordinates from df2 have to be between the corresponding _low and _high values of df1. I would also like to know whether multiple coordinate pairs form one-to-many relationships.
The desired output would look something like this:
Label x_low x_high y_low y_high Image X Y
0 B 101947985 101948555 427429000 427429002 2 101948467 427429001
1 B 101947985 101948555 427429000 427429002 3 101948477 427429008
2 D 101948789 101949111 427429466 427429467 5 101941234 427430045
I tried merging the whole dataframes, but they have 3,000 and 30,000 lines, so I don't have the memory available. I've tried multiple other methods, but nothing seems to work.
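One possible approach (a sketch of mine, not from the original thread): cross-join df1 against df2 in chunks and keep only the rows whose coordinates fall inside the bounds. This assumes pandas >= 1.2 for how='cross', keeps peak memory proportional to the chunk size, and naturally preserves one-to-many matches:
import pandas as pd

matches = []
chunk_size = 1000  # tune to fit available memory
for start in range(0, len(df2), chunk_size):
    chunk = df2.iloc[start:start + chunk_size]
    cross = df1.merge(chunk, how='cross')  # every (Label, Image) pair in this chunk
    inside = (cross['X'].between(cross['x_low'], cross['x_high'])
              & cross['Y'].between(cross['y_low'], cross['y_high']))
    matches.append(cross[inside])
result = pd.concat(matches, ignore_index=True)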

Pandas error: "None of [Index([' '], dtype='object')] are in the [columns]"

For some reason, my code works when the list I pass contains only integers; passing strings instead leads to the error in the title.
Here is my code:
def get_support(self, data, itemset):
    return data[itemset].all(axis='columns').sum()
    # I also tried: return data.loc[:, itemset].all(axis='columns').sum()
    # this function returns the number of True values (from .all()) for a given column or set of columns
A sample of a data where this code works is:
0 1 2
0 0 0 1
1 1 1 1
2 1 1 0
3 0 1 0
4 1 1 1
5 1 1 0
Running get_support(df, [0]) returns 4, and running get_support(df, [0, 2]) returns 2.
However, once columns are labeled, the code no longer works and outputs the error. I've checked the .csv file, and it's completely clean, with no spaces or extra stuff.
Sample of a data that will cause an error in my code:
Red Yellow Blue
0 0 0 1
1 1 1 1
2 1 1 0
3 0 1 0
4 1 1 1
5 1 1 0
Where exactly am I wrong?
Edit: Thank you very very much to #osint_alex. The error is gone now, but there is unfortunately a newfound problem:
print(get_support(temp_df, ['A']))
print(get_support(temp_df, ['A', 'B']))
print(get_support(temp_df, ['A', 'B', 'C']))
Running this block of code outputs the same value for each call: 9835, which is the number of rows in the dataset.
I have attempted commenting out the other lines, and I get 9835 nonetheless. However, after checking the .csv file, I should only get 516 for A (I'm unable to test the others).
As of now, I am still trying to solve it on my own, but the numbers are so far off that I don't even know where to begin.
I think what is going wrong is that you are using itemset to select the columns of interest in the dataframe. So if you want itemset to denote the indices of the columns - which is what I assume you want if you always pass in numbers - then the get_support function needs to change to reflect this.
Essentially, if you pass in numbers while the columns are labelled with strings, you'll get a key error because those numbers aren't the column names.
Here is a suggested revision:
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3],
                   'B': [3, 4, 0]})

def get_support(data, itemset):
    cols = [x for x in data.columns if list(data.columns).index(x) in itemset]
    return data[cols].all(axis='columns').sum()

print(get_support(df, [0, 1]))
Out:
2
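A shorter equivalent, assuming itemset always holds integer positions (my sketch, not from the thread), is positional selection with iloc:
def get_support(data, itemset):
    # iloc selects columns by position, so [0, 1] works whatever the labels are
    return data.iloc[:, itemset].all(axis='columns').sum()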
You're going wrong here:
data[itemset].all(axis = 'columns').sum()
You can't sum() a string. You could run it through a data cleaning function first to make sure the list only has integers or floats.

How to remove strings between parentheses (or any char) in DataFrame?

I have a column of numeric strings that I want to convert to int, but I need to remove the parentheses and the number inside them first (it's just a multiplier for my application; this is how I get the data).
Here is the sample code.
import pandas as pd
voltages = ['0', '0', '0', '0', '0', '310.000 (31)', '300.000 (30)', '190.000 (19)', '0', '20.000 (2)']
df = pd.DataFrame(voltages, columns=['Voltage'])
df
Out [1]:
Voltage
0 0
1 0
2 0
3 0
4 0
5 310.000 (31)
6 300.000 (30)
7 190.000 (19)
8 0
9 20.000 (2)
How can I remove the substrings within the parenthesis? Is there a Pandas.series.str way to do it?
Use str.replace with a regex (regex=True is needed because recent pandas versions default to literal matching):
df.Voltage.str.replace(r"\s\(.*", "", regex=True)
Out:
0 0
1 0
2 0
3 0
4 0
5 310.000
6 300.000
7 190.000
8 0
9 20.000
Name: Voltage, dtype: object
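Since the goal is a numeric column, you can follow this with a cast (my addition, assuming the cleaned strings are all plain numbers):
df['Voltage'] = pd.to_numeric(df.Voltage.str.replace(r"\s\(.*", "", regex=True))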
You can also use str.split():
df_2 = df['Voltage'].str.split(' ', n=0, expand=True).rename(columns={0: 'Voltage'})
df_2['Voltage'] = df_2['Voltage'].astype('float')
If you know the separating character will always be a space then the following is quite a neat way of doing it:
voltages = [i.rsplit(' ')[0] for i in voltages]
I think you could try this:
new_series = df['Voltage'].apply(lambda x: int(x.split('.')[0]))
df['Voltage'] = new_series
I hope it helps.
Hopefully, this will work for you:
idx = source_value.find(" (")
result = source_value[:idx] if idx != -1 else source_value
NOTE: the find function requires source_value to be a string, but if you have parens in your value, I assume it is one. The guard is needed because find returns -1 when there is no " (", and slicing with -1 would silently drop the last character.
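To apply the same idea across the dataframe column (a sketch of mine, assuming every value is a string):
df['Voltage'] = df['Voltage'].apply(lambda v: v[:v.find(' (')] if ' (' in v else v)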

getting index of a multiple line string

I'm trying to get a digit from a multi-line string whose value is the same as its index. Here's my attempt:
table='''012
345
678'''
print(table[4])
If I execute the above, I get an output of 3 instead of 4.
I am trying to get number i with print(table[i]).
What is the simplest way of getting the number corresponding to table[i] without using a list? I have to use while loops later to replace values of the table, and using lists would be very troublesome. Thanks.
Your string contains whitespace (a line break) at position 4 (\n on Linux; \r\n would occupy positions 4 and 5 on Windows) - you can clean your text by removing it:
table='''012
345
678'''
print(table[4])                    # 3 - because table[3] == '\n'
print(table.replace("\n", "")[4])  # 4
You can view all characters in your "table" like so:
print(repr(table))
# print the ordinal value of the character, and the character itself if it is printable
for c in table:
    print(ord(c), c if ord(c) > 31 else "")
Output:
'012\n345\n678'
48 0
49 1
50 2
10
51 3
52 4
53 5
10
54 6
55 7
56 8
On a side note - if your table does not change, you might want to build a lookup dict to skip replacing stuff in your string all the time:
table='''012
345
678'''
indexes = dict(enumerate(table.replace("\n", "")))
print(indexes)
Output:
{0: '0', 1: '1', 2: '2', 3: '3', 4: '4', 5: '5', 6: '6', 7: '7', 8: '8'}
so you can do indexes[3] to get the '3' string.
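Since each row here is three characters plus one newline, you could also compute the offset arithmetically instead of building anything (a sketch of mine, assuming the fixed 3-wide layout):
def char_at(table, i, width=3):
    # each row contributes `width` characters plus one newline
    return table[i + i // width]

print(char_at(table, 4))  # '4'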

How to apply UDF to dataframe?

I am trying to create a function that will clean up a dataframe that I put through it. But I noticed that the returned df is cleaned up, while the original df is left unchanged.
How can I run a UDF on a dataframe and keep the updated dataframe saved in place?
P.S. I know I can combine these rules into one line, but the function I am creating is a lot more complex, so I don't want to combine them for this example.
import pandas as pd

df = pd.DataFrame({'Key': ['3', '9', '9', '9', '9', '34', '34', '34'],
                   'LastFour': ['2290', '0087', 'M433', 'M433', '25', '25', '25', '25'],
                   'NUM': [20120528, 20120507, 20120615, 20120629, 20120621, 20120305, 20120506, 20120506]})

def cleaner(x):
    x = x[x['Key'] == '9']
    x = x[x['LastFour'] == 'M433']
    x = x[x['NUM'] == 20120615]
    return x

cleaner(df)
Result from the UDF:
Key LastFour NUM
2 9 M433 20120615
But if I inspect df after calling the function, I still get the original dataset:
Key LastFour NUM
0 3 2290 20120528
1 9 0087 20120507
2 9 M433 20120615
3 9 M433 20120629
4 9 25 20120621
5 34 25 20120305
6 34 25 20120506
7 34 25 20120506
You need to assign the result of cleaner(df) back to df, like so:
df = cleaner(df)
An alternative method is to use pd.DataFrame.pipe to pass your dataframe through a function:
df = df.pipe(cleaner)
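If you genuinely need the function to mutate the caller's dataframe, one option (a sketch of mine, not from the thread) is to drop the unwanted rows on the object itself instead of rebinding x to filtered copies:
def cleaner_inplace(x):
    # drop() with inplace=True mutates the very object the caller passed in
    x.drop(x[x['Key'] != '9'].index, inplace=True)
    x.drop(x[x['LastFour'] != 'M433'].index, inplace=True)
    x.drop(x[x['NUM'] != 20120615].index, inplace=True)

cleaner_inplace(df)  # df itself is now filtered
Reassigning the return value, as shown above, is generally considered the cleaner pattern, though.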
