Cutting dataframe loop - python

I have a dataset that is only one column. I want to cut the column into multiple dataframes.
I use a for loop to create a list containing the positions at which I want to cut the dataframe.
import pandas as pd

df = pd.read_csv("column.csv", delimiter=";", header=0, index_col=0)
number_of_pixels = len(df.index)
print("You have " + str(number_of_pixels) + " pixels in your file")
number_of_rows = int(input("Enter number of rows you want to create"))
pixels_per_row_list = []  # this list contains the number of pixels per row
for i in range(number_of_rows):  # fill the list with the number of pixels per row
    pixels_per_row = int(input("Enter number of pixels in row " + str(i)))
    pixels_per_row_list.append(pixels_per_row)
print(pixels_per_row_list)
After cutting the column into multiple dataframes, I want to transpose each dataframe and concatenate all the dataframes back together using:
df1=df1.reset_index(drop=True)
df1=df1.T
df2=df2.reset_index(drop=True)
df2=df2.T
frames = [df1,df2]
result = pd.concat(frames, axis=0)
print(result)
So I want to create a loop that cuts my dataframe into multiple frames at the positions stored in my list.
Thank you!

This is a problem that is better solved with numpy. I'll start from the point where you have received the list from your user input. The whole idea is to use numpy.split to separate the values at the cumulative pixel counts, and then build a new DataFrame.
Setup
import numpy as np
import pandas as pd
np.random.seed(123)
df = pd.DataFrame({'val': np.random.randint(1,10,50)})
lst = [4,10,2,1,15,8,9,1]
Code
pd.DataFrame(np.split(df.val.values, np.cumsum(lst)[:-1]))
Output
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
0 3 3.0 7.0 2.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 4 7.0 2.0 1.0 2.0 1.0 1.0 4.0 5.0 1.0 NaN NaN NaN NaN NaN
2 1 5.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 8 4.0 3.0 5.0 8.0 3.0 5.0 9.0 1.0 8.0 4.0 5.0 7.0 2.0 6.0
5 7 3.0 2.0 9.0 4.0 6.0 1.0 3.0 NaN NaN NaN NaN NaN NaN NaN
6 7 3.0 5.0 5.0 7.0 4.0 1.0 7.0 5.0 NaN NaN NaN NaN NaN NaN
7 8 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
If your list asks for more pixels than the total number of rows in your initial DataFrame, you'll get extra all-NaN rows in your output. If your lst sums to less than the total number of pixels, the remainder will all be appended to the last row. Since you didn't specify either of these conditions in your question, I'm not sure how you'd want to handle them.
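For comparison, here is a minimal loop-based sketch of the slice/transpose/concatenate approach the question describes, using the same lst as above; this is one possible wiring, not the only way:

```python
import numpy as np
import pandas as pd

np.random.seed(123)
df = pd.DataFrame({'val': np.random.randint(1, 10, 50)})
lst = [4, 10, 2, 1, 15, 8, 9, 1]

# Cut the single column at the cumulative positions, transpose each
# piece, and concatenate the transposed pieces row by row.
bounds = np.cumsum([0] + lst)
frames = [df.iloc[start:stop].reset_index(drop=True).T
          for start, stop in zip(bounds[:-1], bounds[1:])]
result = pd.concat(frames, axis=0, ignore_index=True)
```

This produces the same 8 x 15 frame as the numpy.split version, with shorter rows padded with NaN.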

Related

Add a single field at the end of a df

Is it possible, after building a dataframe with 20+ rows and xx+ columns, to add a single field with the total count of a certain value? The user will add different values to the df, and before pandas.DataFrame.to_excel it's necessary to add a single field with some specific data, like in the attached picture. Is it possible to add a single field after an already structured df?
This can work for you:
Df:
A B output
0 a 1.0 1.0
1 a 2.0 1.0
2 a 3.0 1.0
3 a 4.0 1.0
4 a 5.0 1.0
for i in range(df.iloc[-1].name + 1, 25):  # add 20 new NaN rows (you can change the count)
    df.loc[i, :] = np.nan
df.loc[df.iloc[-1].name + 1, 'A'] = 'Result: ' + str(df['B'].sum())  # for this example I just sum column B; change it as needed
print(df)
A B output
0 a 1.0 1.0
1 a 2.0 1.0
2 a 3.0 1.0
3 a 4.0 1.0
4 a 5.0 1.0
5 NaN NaN NaN
6 NaN NaN NaN
7 NaN NaN NaN
8 NaN NaN NaN
9 NaN NaN NaN
10 NaN NaN NaN
11 NaN NaN NaN
12 NaN NaN NaN
13 NaN NaN NaN
14 NaN NaN NaN
15 NaN NaN NaN
16 NaN NaN NaN
17 NaN NaN NaN
18 NaN NaN NaN
19 NaN NaN NaN
20 NaN NaN NaN
21 NaN NaN NaN
22 NaN NaN NaN
23 NaN NaN NaN
24 NaN NaN NaN
25 Result: 15.0 NaN NaN
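For reference, a self-contained version of the padding-and-summary steps above; the example frame and the 20-row padding count are assumptions matching the answer:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['a'] * 5,
                   'B': [1.0, 2.0, 3.0, 4.0, 5.0],
                   'output': [1.0] * 5})

# Pad with blank rows so the summary field sits apart from the data.
for i in range(df.index[-1] + 1, 25):
    df.loc[i, :] = np.nan

# Write the summary into the first column of the next free row.
df.loc[df.index[-1] + 1, 'A'] = 'Result: ' + str(df['B'].sum())
```

sum skips NaN by default, so the padded rows don't affect the total.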

Rearranging columns based on sequence in pandas dataframe

I have a pandas dataframe as below. I want to rearrange the columns in my dataframe based on the numeric sequence, separately for the XX_ and YY_ columns.
import numpy as np
import pandas as pd
import math
import sys
import re
data=[[np.nan,2, 5,np.nan,np.nan,1],
[np.nan,np.nan,2,np.nan,np.nan,np.nan],
[np.nan,3,np.nan,np.nan,np.nan,np.nan],
[1,np.nan,np.nan,np.nan,np.nan,1],
[np.nan,2,np.nan,np.nan,2,np.nan],
[np.nan,np.nan,np.nan,2,np.nan,5]]
df = pd.DataFrame(data,columns=['XX_4','XX_2','XX_3','YY_4','YY_2','YY_3'])
df
My output dataframe should look like:
XX_2 XX_3 XX_4 YY_2 YY_3 YY_4
0 2.0 5.0 NaN NaN 1.0 NaN
1 NaN 2.0 NaN NaN NaN NaN
2 3.0 NaN NaN NaN NaN NaN
3 NaN NaN 1.0 NaN 1.0 NaN
4 2.0 NaN NaN 2.0 NaN NaN
5 NaN NaN 2.0 NaN 5.0 2.0
Since this is a small dataframe, I can manually rearrange the columns. Is there any way of doing it based on _2, _3 suffix?
IIUC, we can use a natural-sort function written by Mark Byers, based on Jeff Atwood's article on sorting alphanumeric strings:
https://stackoverflow.com/a/2669120/9375102
import re

def sorted_nicely(l):
    """Sort the given iterable in the way that humans expect."""
    convert = lambda text: int(text) if text.isdigit() else text
    alphanum_key = lambda key: [convert(c) for c in re.split('([0-9]+)', key)]
    return sorted(l, key=alphanum_key)

df = pd.DataFrame(data, columns=['XX_9', 'XX_10', 'XX_3', 'YY_9', 'YY_10', 'YY_3'])
cols = df.columns.tolist()
print(df[sorted_nicely(cols)])
XX_3 XX_9 XX_10 YY_3 YY_9 YY_10
0 5.0 NaN 2.0 1.0 NaN NaN
1 2.0 NaN NaN NaN NaN NaN
2 NaN NaN 3.0 NaN NaN NaN
3 NaN 1.0 NaN 1.0 NaN NaN
4 NaN NaN 2.0 NaN NaN 2.0
5 NaN NaN NaN 5.0 2.0 NaN
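If every column follows the fixed prefix_number pattern (an assumption that holds for this frame), a narrower sketch can sort on the split parts directly instead of using a general natural sort:

```python
import numpy as np
import pandas as pd

data = [[np.nan, 2, 5, np.nan, np.nan, 1],
        [1, np.nan, np.nan, np.nan, np.nan, 1]]
df = pd.DataFrame(data, columns=['XX_9', 'XX_10', 'XX_3',
                                 'YY_9', 'YY_10', 'YY_3'])

# Sort by (prefix, numeric suffix) so XX_10 lands after XX_9.
ordered = sorted(df.columns,
                 key=lambda c: (c.split('_')[0], int(c.split('_')[1])))
df = df[ordered]
```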

Unexpected result in pandas pivot_table

I am trying to do a pivot_table on a pandas Dataframe. I am almost getting the expected result, but it seems to be multiplied by two. I could just divide by two and call it a day, however, I want to know whether I am doing something wrong.
Here goes the code:
import pandas as pd
import numpy as np
df = pd.DataFrame(data={"IND":[1,2,3,4,5,1,5,5],"DATA":[2,3,4,2,10,4,3,3]})
df_pvt = pd.pivot_table(df, aggfunc=np.size, index=["IND"], columns="DATA")
df_pvt is now:
DATA 2 3 4 10
IND
1 2.0 NaN 2.0 NaN
2 NaN 2.0 NaN NaN
3 NaN NaN 2.0 NaN
4 2.0 NaN NaN NaN
5 NaN 4.0 NaN 2.0
However, instead of 2.0 it should be 1.0! What am I misunderstanding / doing wrong?
Use the string 'size' instead. This will trigger the Pandas interpretation of "size", i.e. the number of elements in a group. The NumPy interpretation of size is the product of the lengths of each dimension.
df = pd.pivot_table(df, aggfunc='size', index=["IND"], columns="DATA")
print(df)
DATA 2 3 4 10
IND
1 1.0 NaN 1.0 NaN
2 NaN 1.0 NaN NaN
3 NaN NaN 1.0 NaN
4 1.0 NaN NaN NaN
5 NaN 2.0 NaN 1.0
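The two interpretations of "size" can be seen side by side on a toy frame (just for illustration):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'IND': [5, 5], 'DATA': [3, 3]})

# NumPy: total number of elements (2 rows * 2 columns).
print(np.size(frame))  # 4

# pandas: number of rows in the group.
print(frame.groupby('IND').size().iloc[0])  # 2
```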

Pandas combine two columns

I have the following DataFrame:
df = pandas.DataFrame({'Buy':[10,np.nan,2,np.nan,np.nan,4],'Sell':[np.nan,7,np.nan,9,np.nan,np.nan]})
Out[37]:
Buy Sell
0 10.0 NaN
1 NaN 7.0
2 2.0 NaN
3 NaN 9.0
4 NaN NaN
5 4.0 NaN
I want to create two more columns called Quant and B/S.
For Quant it works fine as follows:
df['Quant'] = df['Buy'].fillna(df['Sell']) # Fetch available value from both column and if both values are Nan then output is Nan.
Output is:
df
Out[39]:
Buy Sell Quant
0 10.0 NaN 10.0
1 NaN 7.0 7.0
2 2.0 NaN 2.0
3 NaN 9.0 9.0
4 NaN NaN NaN
5 4.0 NaN 4.0
But I want to create B/S based on which column the value was taken from while creating Quant.
You can perform an equality test and feed into numpy.where:
df['B/S'] = np.where(df['Quant'] == df['Buy'], 'B', 'S')
For the case where both values are null, you can use an additional step:
df.loc[df[['Buy', 'Sell']].isnull().all(1), 'B/S'] = np.nan
Example
from io import StringIO
import numpy as np
import pandas as pd
mystr = StringIO("""Buy Sell
10 nan
nan 8
4 nan
nan 5
nan 7
3 nan
2 nan
nan nan""")
df = pd.read_csv(mystr, delim_whitespace=True)
df['Quant'] = df['Buy'].fillna(df['Sell'])
df['B/S'] = np.where(df['Quant'] == df['Buy'], 'B', 'S')
df.loc[df[['Buy', 'Sell']].isnull().all(1), 'B/S'] = np.nan
Result
print(df)
Buy Sell Quant B/S
0 10.0 NaN 10.0 B
1 NaN 8.0 8.0 S
2 4.0 NaN 4.0 B
3 NaN 5.0 5.0 S
4 NaN 7.0 7.0 S
5 3.0 NaN 3.0 B
6 2.0 NaN 2.0 B
7 NaN NaN NaN NaN
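An alternative sketch labels rows from Buy's presence directly instead of the equality test; this is a variation on the answer above, not the answer's own code:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Buy': [10, np.nan, 2, np.nan, np.nan, 4],
                   'Sell': [np.nan, 7, np.nan, 9, np.nan, np.nan]})

df['Quant'] = df['Buy'].fillna(df['Sell'])

# 'B' where Buy is populated, 'S' otherwise, then blank out the rows
# where neither side has a value.
df['B/S'] = np.where(df['Buy'].notna(), 'B', 'S')
df.loc[df[['Buy', 'Sell']].isna().all(axis=1), 'B/S'] = np.nan
```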

Duplicated rows from .csv and count - python

I have to get the number of times each complete line appears in my data frame, then show only those lines that appear more than once, with the last column showing how many times each line appears.
Input values for creating output correct table:
dur,wage1,wage2,wage3,cola,hours,pension,stby_pay,shift_diff,educ_allw,holidays,vacation,ldisab,dntl,ber,hplan,agr
2,4.5,4.0,?,?,40,?,?,2,no,10,below average,no,half,?,half,bad
2,2.0,2.0,?,none,40,none,?,?,no,11,average,yes,none,yes,full,bad
3,4.0,5.0,5.0,tc,?,empl_contr,?,?,?,12,generous,yes,none,yes,half,good
1,2.0,?,?,tc,40,ret_allw,4,0,no,11,generous,no,none,no,none,bad
1,6.0,?,?,?,38,?,8,3,?,9,generous,?,?,?,?,good
2,2.5,3.0,?,tcf,40,none,?,?,?,11,below average,?,?,yes,?,bad
3,2.0,3.0,?,tcf,?,empl_contr,?,?,yes,?,?,yes,half,yes,?,good
1,2.1,?,?,tc,40,ret_allw,2,3,no,9,below average,yes,half,?,none,bad
1,2.8,?,?,none,38,empl_contr,2,3,no,9,below average,yes,half,?,none,bad
1,5.7,?,?,none,40,empl_contr,?,4,?,11,generous,yes,full,?,?,good
2,4.3,4.4,?,?,38,?,?,4,?,12,generous,?,full,?,full,good
1,2.8,?,?,?,35,?,?,2,?,12,below average,?,?,?,?,good
2,2.0,2.5,?,?,35,?,?,6,yes,12,average,?,?,?,?,good
1,5.7,?,?,none,40,empl_contr,?,4,?,11,generous,yes,full,?,?,good
2,4.5,4.0,?,none,40,?,?,4,?,12,average,yes,full,yes,half,good
3,3.5,4.0,4.6,none,36,?,?,3,?,13,generous,?,?,yes,full,good
3,3.7,4.0,5.0,tc,?,?,?,?,yes,?,?,?,?,yes,?,good
3,2.0,3.0,?,tcf,?,empl_contr,?,?,yes,?,?,yes,half,yes,?,good
I just have to keep those rows that are totally equal.
This is the table result:
dur wage1 wage2 wage3 cola hours pension stby_pay shift_diff num_reps
6 3.0 2.0 3.0 NaN tcf NaN empl_contr NaN NaN 4
8 1.0 2.8 NaN NaN none 38.0 empl_contr 2.0 3.0 2
9 1.0 5.7 NaN NaN none 40.0 empl_contr NaN 4.0 3
43 2.0 2.5 3.0 NaN NaN 40.0 none NaN NaN 2
As you can see in this table, we keep for example the line with index 6 because lines 6 and 17 of the input table are identical.
With my current code:
def detect_duplicates(data):
    x = DataFrame(columns=data.columns.tolist() + ["num_reps"])
    x = data[data.duplicated(keep=False)].drop_duplicates()
    return x
I get the rows correctly; however, I do not know how to count the repeated rows and then add the count in the 'num_reps' column at the end of the table.
This is my result, without the last column that counts the number of repeated rows:
dur wage1 wage2 wage3 cola hours pension stby_pay shift_diff
6 3.0 2.0 3.0 NaN tcf NaN empl_contr NaN NaN
8 1.0 2.8 NaN NaN none 38.0 empl_contr 2.0 3.0
9 1.0 5.7 NaN NaN none 40.0 empl_contr NaN 4.0
43 2.0 2.5 3.0 NaN NaN 40.0 none NaN NaN
How can I perform a correct count, based on the equality of all the data in the columns, and then add it and show it in the 'num_reps' column?
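One possible sketch of the count: group on every column, count the rows per group, and keep only the groups that occur more than once. Keeping NaN keys needs pandas 1.1+ for dropna=False; the toy frame below stands in for the CSV:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'dur': [3, 1, 3, 1],
                   'wage1': [2.0, 2.8, 2.0, 2.8],
                   'wage2': [3.0, np.nan, 3.0, np.nan]})

# Count identical rows (NaN keys kept), then filter to the repeats.
counts = (df.groupby(df.columns.tolist(), dropna=False, sort=False)
            .size()
            .reset_index(name='num_reps'))
repeated = counts[counts['num_reps'] > 1]
```

Note this collapses each repeated line to one row with its count, rather than keeping the original row index.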
