I have a column in my dataframe with values between 2100 and 8000. I want to split this column into multiple columns at intervals of 500. Let me show you by example:
column
2100
2105
2119
.
8000
I want to split it like this:
column1   column2   column3   ...   column n
2100      2601      3102      ...
...       ...       ...       ...
2600      3101      3602      ...   8000
Please suggest a solution.
Here's one approach using pd.cut and DataFrame.pivot:
import pandas as pd

df = pd.DataFrame(list(range(2100, 8000+1)), columns=['column'])
# create the bins to be used in pd.cut
bins = list(range(df.column.min(), df.column.max()+50, 50))
# [2100, 2150, 2200, 2250, 2300, ...]
# Create the labels for pd.cut, which will be used as column names
labels = [f'column{i}' for i in range(len(bins)-1)]
# ['column0', 'column1', 'column2', 'column3', 'column4', ...
df['bins'] = pd.cut(df.column, bins, labels=labels, include_lowest=True)
Which will give you:
column bins
0 2100 column0
1 2101 column0
2 2102 column0
3 2103 column0
4 2104 column0
5 2105 column0
6 2106 column0
7 2107 column0
8 2108 column0
And now use pivot to obtain the final result:
ix = df.groupby('bins').column.cumcount()
df.pivot(columns = 'bins', index=ix).fillna(0)
bins column0 column1 column2 column3 column4 column5 column6 column7 column8 ...
0 2100.0 2151.0 2201.0 2251.0 2301.0 2351.0 2401.0 2451.0 2501.0
1 2101.0 2152.0 2202.0 2252.0 2302.0 2352.0 2402.0 2452.0 2502.0
2 2102.0 2153.0 2203.0 2253.0 2303.0 2353.0 2403.0 2453.0 2503.0
3 2103.0 2154.0 2204.0 2254.0 2304.0 2354.0 2404.0 2454.0 2504.0
4 2104.0 2155.0 2205.0 2255.0 2305.0 2355.0 2405.0 2455.0 2505.0
5 2105.0 2156.0 2206.0 2256.0 2306.0 2356.0 2406.0 2456.0 2506.0
6 2106.0 2157.0 2207.0 2257.0 2307.0 2357.0 2407.0 2457.0 2507.0
7 2107.0 2158.0 2208.0 2258.0 2308.0 2358.0 2408.0 2458.0 2508.0
8 2108.0 2159.0 2209.0 2259.0 2309.0 2359.0 2409.0 2459.0 2509.0
9 2109.0 2160.0 2210.0 2260.0 2310.0 2360.0 2410.0 2460.0 2510.0
10 2110.0 2161.0 2211.0 2261.0 2311.0 2361.0 2411.0 2461.0 2511.0
...
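An aside on the cumcount step: pivot requires each (index, column) pair to be unique, so numbering the rows within each bin gives every value its own row slot. A minimal sketch of the idea, with made-up values:
import pandas as pd

tmp = pd.DataFrame({'column': [1, 2, 3, 10, 11],
                    'bins': ['a', 'a', 'a', 'b', 'b']})
ix = tmp.groupby('bins').column.cumcount()  # 0, 1, 2, 0, 1
print(tmp.pivot(columns='bins', index=ix)['column'])
# bins    a     b
# 0     1.0  10.0
# 1     2.0  11.0
# 2     3.0   NaN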
Let's encapsulate it all in a function and try a simpler example to better see how this works:
def binning_and_pivot(df, bin_size):
    bins = list(range(df.column.min(), df.column.max() + bin_size, bin_size))
    labels = [f'column{i}' for i in range(len(bins) - 1)]
    df['bins'] = pd.cut(df.column, bins, labels=labels, include_lowest=True)
    ix = df.groupby('bins').column.cumcount()
    return df.pivot(columns='bins', index=ix).fillna(0)
df = pd.DataFrame(list(range(100+1)), columns=['column'])
df = df.sample(frac=0.7).reset_index(drop=True)
binning_and_pivot(df, bin_size=10)
bins column0 column1 column2 column3 column4 column5 column6 column7 column8
0 2.0 16.0 32.0 39.0 45.0 55.0 69.0 81.0 87.0
1 6.0 21.0 29.0 42.0 46.0 59.0 72.0 76.0 92.0
2 3.0 13.0 31.0 36.0 49.0 61.0 68.0 74.0 91.0
3 12.0 20.0 25.0 41.0 52.0 56.0 70.0 78.0 86.0
4 8.0 17.0 30.0 37.0 43.0 62.0 64.0 73.0 89.0
5 7.0 19.0 27.0 38.0 50.0 53.0 71.0 77.0 83.0
6 0.0 22.0 28.0 0.0 0.0 54.0 65.0 82.0 90.0
7 0.0 18.0 24.0 0.0 0.0 60.0 63.0 80.0 0.0
8 0.0 14.0 26.0 0.0 0.0 0.0 0.0 75.0 0.0
bins column9
0 95.0
1 100.0
2 96.0
3 0.0
4 0.0
5 0.0
6 0.0
7 0.0
8 0.0
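Back to the original question, the same function with bin_size=500 gives the 500-wide intervals asked for (a sketch; I haven't reproduced the full output here):
df = pd.DataFrame(list(range(2100, 8000 + 1)), columns=['column'])
result = binning_and_pivot(df, bin_size=500)
# 12 bins: [2100, 2600], (2600, 3100], ..., (7600, 8100]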
Here's my approach. I did it for intervals of 4.
NOTE: the number of rows must be evenly divisible by the interval size.
import pandas as pd

df = pd.read_csv(r'Z:\Path\neww.txt', delim_whitespace=True)
didi = df.to_dict()
num = 4
dd = {}
for i in range(int(len(didi['column'].items()) / num)):
    dd['col' + str(i)] = dict(list(didi['column'].items())[i*num:num*(i+1)])
print(pd.DataFrame(dd).apply(lambda x: pd.Series(x.dropna().values)))
Input:
column
2100
2100
2100
2100
2100
2100
2100
2100
2100
2100
8000
8000
8000
8000
8000
8000
8000
8000
80
8000
Output:
col0 col1 col2 col3 col4
0 2100.0 2100.0 2100.0 8000.0 8000.0
1 2100.0 2100.0 2100.0 8000.0 8000.0
2 2100.0 2100.0 8000.0 8000.0 80.0
3 2100.0 2100.0 8000.0 8000.0 8000.0
See the documentation of numpy.reshape.
Suppose you extract the data of interest into a NumPy array, say data. Here's a possible solution:
newdata = data.reshape((500, -1))
newdata is your reshaped data.
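Note that reshape fills row-major, so data.reshape((500, -1)) interleaves the values rather than keeping consecutive runs together. A minimal sketch that keeps each consecutive block of 500 values in its own column (assuming the length is divisible by 500; the 1500-value array is made up):
import numpy as np
import pandas as pd

data = np.arange(2100, 2100 + 1500)    # hypothetical 1500-value column
chunks = data.reshape((-1, 500)).T     # each column = 500 consecutive values
new_df = pd.DataFrame(chunks, columns=[f'column{i}' for i in range(chunks.shape[1])])
print(new_df.shape)                    # (500, 3)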
I have a Pandas data frame, as shown below, with multiple columns and would like to get the total of column, MyColumn.
print df
X MyColumn Y Z
0 A 84 13.0 69.0
1 B 76 77.0 127.0
2 C 28 69.0 16.0
3 D 28 28.0 31.0
4 E 19 20.0 85.0
5 F 84 193.0 70.0
My attempt:
I have attempted to get the sum of the column using groupby and .sum():
Total = df.groupby['MyColumn'].sum()
print Total
This causes the following error:
TypeError: 'instancemethod' object has no attribute '__getitem__'
Expected Output
I'd have expected the output to be as follows:
319
Or alternatively, I would like df to be edited with a new row entitled TOTAL containing the total:
X MyColumn Y Z
0 A 84 13.0 69.0
1 B 76 77.0 127.0
2 C 28 69.0 16.0
3 D 28 28.0 31.0
4 E 19 20.0 85.0
5 F 84 193.0 70.0
TOTAL 319
You should use sum:
Total = df['MyColumn'].sum()
print(Total)
319
Then, to add a Total row, use loc with a Series whose index matches the name of the column you want to sum:
df.loc['Total'] = pd.Series(df['MyColumn'].sum(), index=['MyColumn'])
print(df)
X MyColumn Y Z
0 A 84.0 13.0 69.0
1 B 76.0 77.0 127.0
2 C 28.0 69.0 16.0
3 D 28.0 28.0 31.0
4 E 19.0 20.0 85.0
5 F 84.0 193.0 70.0
Total NaN 319.0 NaN NaN
because if you pass a scalar, every column of the new row is filled with that value:
df.loc['Total'] = df['MyColumn'].sum()
print(df)
X MyColumn Y Z
0 A 84 13.0 69.0
1 B 76 77.0 127.0
2 C 28 69.0 16.0
3 D 28 28.0 31.0
4 E 19 20.0 85.0
5 F 84 193.0 70.0
Total 319 319 319.0 319.0
Two other solutions use at and ix; see their application below:
df.at['Total', 'MyColumn'] = df['MyColumn'].sum()
print(df)
X MyColumn Y Z
0 A 84.0 13.0 69.0
1 B 76.0 77.0 127.0
2 C 28.0 69.0 16.0
3 D 28.0 28.0 31.0
4 E 19.0 20.0 85.0
5 F 84.0 193.0 70.0
Total NaN 319.0 NaN NaN
df.ix['Total', 'MyColumn'] = df['MyColumn'].sum()
print(df)
X MyColumn Y Z
0 A 84.0 13.0 69.0
1 B 76.0 77.0 127.0
2 C 28.0 69.0 16.0
3 D 28.0 28.0 31.0
4 E 19.0 20.0 85.0
5 F 84.0 193.0 70.0
Total NaN 319.0 NaN NaN
Note: Since Pandas v0.20, ix has been deprecated. Use loc or iloc instead.
Another option you can go with here:
df.loc["Total", "MyColumn"] = df.MyColumn.sum()
# X MyColumn Y Z
#0 A 84.0 13.0 69.0
#1 B 76.0 77.0 127.0
#2 C 28.0 69.0 16.0
#3 D 28.0 28.0 31.0
#4 E 19.0 20.0 85.0
#5 F 84.0 193.0 70.0
#Total NaN 319.0 NaN NaN
You can also use the append() method:
df.append(pd.DataFrame(df.MyColumn.sum(), index = ["Total"], columns=["MyColumn"]))
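Note that DataFrame.append was removed in pandas 2.0; pd.concat is the modern equivalent (a sketch using the same df):
total_row = pd.DataFrame({'MyColumn': [df['MyColumn'].sum()]}, index=['Total'])
df = pd.concat([df, total_row])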
Update:
In case you need to append the sum for all numeric columns, you can do one of the following:
Use append to do this in a functional manner (doesn't change the original data frame):
# select numeric columns and calculate the sums
# (pd.np was removed in recent pandas versions; use numpy directly)
import numpy as np
sums = df.select_dtypes(np.number).sum().rename('total')
# append sums to the data frame
df.append(sums)
# X MyColumn Y Z
#0 A 84.0 13.0 69.0
#1 B 76.0 77.0 127.0
#2 C 28.0 69.0 16.0
#3 D 28.0 28.0 31.0
#4 E 19.0 20.0 85.0
#5 F 84.0 193.0 70.0
#total NaN 319.0 400.0 398.0
Use loc to mutate data frame in place:
df.loc['total'] = df.select_dtypes(np.number).sum()
df
# X MyColumn Y Z
#0 A 84.0 13.0 69.0
#1 B 76.0 77.0 127.0
#2 C 28.0 69.0 16.0
#3 D 28.0 28.0 31.0
#4 E 19.0 20.0 85.0
#5 F 84.0 193.0 70.0
#total NaN 638.0 800.0 796.0
(These totals are doubled, which suggests the statement was run twice, so the second sum included the first total row; on a fresh frame the row would read 319.0, 400.0 and 398.0.)
Similar to getting the length of a dataframe, len(df), the following worked for pandas and blaze:
Total = sum(df['MyColumn'])
or alternatively
Total = sum(df.MyColumn)
print Total
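One caveat worth knowing: the built-in sum does not skip missing values, while Series.sum does (skipna=True by default):
s = pd.Series([1.0, float('nan'), 2.0])
sum(s)    # nan
s.sum()   # 3.0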
There are two ways to sum a column:
dataset = pd.read_csv("data.csv")
1: sum(dataset.Column_name)
2: dataset['Column_Name'].sum()
If there is any issue with this, please correct me.
As another option, you can do something like the below. Given this data:
Group Valuation amount
0 BKB Tube 156
1 BKB Tube 143
2 BKB Tube 67
3 BAC Tube 176
4 BAC Tube 39
5 JDK Tube 75
6 JDK Tube 35
7 JDK Tube 155
8 ETH Tube 38
9 ETH Tube 56
You can use the script below for the above data:
import pandas as pd
data = pd.read_csv("daata1.csv")
bytreatment = data.groupby('Group')
bytreatment['amount'].sum()
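For the sample data above, this prints per-group totals along these lines:
Group
BAC    215
BKB    366
ETH     94
JDK    265
Name: amount, dtype: int64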
I have a text file like this
0, 23.00, 78.00, 75.00, 105.00, 2,0.97
1, 371.00, 305.00, 38.00, 48.00, 0,0.85
1, 24.00, 78.00, 75.00, 116.00, 2,0.98
1, 372.00, 306.00, 37.00, 48.00, 0,0.84
2, 28.00, 87.00, 74.00, 101.00, 2,0.97
2, 372.00, 307.00, 35.00, 47.00, 0,0.80
3, 32.00, 86.00, 73.00, 98.00, 2,0.98
3, 363.00, 310.00, 34.00, 46.00, 0,0.83
4, 40.00, 77.00, 71.00, 98.00, 2,0.94
4, 370.00, 307.00, 38.00, 47.00, 0,0.84
4, 46.00, 78.00, 74.00, 116.00, 2,0.97
5, 372.00, 308.00, 34.00, 46.00, 0,0.57
5, 43.00, 66.00, 67.00, 110.00, 2,0.96
Code I tried:
import numpy as np

frames = []
x = []
y = []
labels = []
with open(file, 'r') as lb:
    for line in lb:
        line = line.replace(',', ' ')
        arr = line.split()
        frames.append(arr[0])
        x.append(arr[1])
        y.append(arr[2])
        labels.append(arr[5])
print(np.shape(frames))
for d, a in enumerate(frames):
    compare = []
    if a == frames[d+2]:
        compare.append(x[d])
        compare.append(x[d+1])
        compare.append(x[d+2])
        xm = np.argmin(compare)
        label = {0: int(labels[d]), 1: int(labels[d+1]), 2: int(labels[d+2])}.get(xm)
    elif a == frames[d+1]:
        compare.append(x[d])
        compare.append(x[d+1])
        xm = np.argmin(compare)
        label = {0: int(labels[d]), 1: int(labels[d+1])}.get(xm)
In the first line, because the first number (0) is unique, I can extract the sixth number (2) easily.
But after that there are many lines with the same first number, so I want to store all lines sharing a first number, compare their second numbers, and extract the sixth number of the line with the lowest second number.
Can someone suggest a Python solution? I tried readline() and next() but couldn't work it out.
You can read the file with pandas.read_csv instead, and things become much easier:
import pandas as pd
df = pd.read_csv(file_path, header=None)
This reads the file as a table:
0 1 2 3 4 5 6
0 0 23.0 78.0 75.0 105.0 2 0.97
1 1 371.0 305.0 38.0 48.0 0 0.85
2 1 24.0 78.0 75.0 116.0 2 0.98
3 1 372.0 306.0 37.0 48.0 0 0.84
4 2 28.0 87.0 74.0 101.0 2 0.97
5 2 372.0 307.0 35.0 47.0 0 0.80
6 3 32.0 86.0 73.0 98.0 2 0.98
7 3 363.0 310.0 34.0 46.0 0 0.83
8 4 40.0 77.0 71.0 98.0 2 0.94
9 4 370.0 307.0 38.0 47.0 0 0.84
10 4 46.0 78.0 74.0 116.0 2 0.97
11 5 372.0 308.0 34.0 46.0 0 0.57
12 5 43.0 66.0 67.0 110.0 2 0.96
Then you can group into sub-tables based on one of the columns (in your case column 0):
for group, sub_df in df.groupby(0):
    row = sub_df[1].idxmin()  # the index of the minimum value in column 1
    df.loc[row, 5]            # this is the number you are looking for
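If you want the result collected per group rather than inspected row by row, a minimal sketch (same df as above):
labels = {group: df.loc[sub_df[1].idxmin(), 5]
          for group, sub_df in df.groupby(0)}
# {0: 2, 1: 2, 2: 2, 3: 2, 4: 2, 5: 2}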
I think this is what you need using pandas:
import pandas as pd
df = pd.read_table('./test.txt', sep=',', names = ('1','2','3','4','5','6','7'))
print(df)
# 1 2 3 4 5 6 7
# 0 0 23.0 78.0 75.0 105.0 2 0.97
# 1 1 371.0 305.0 38.0 48.0 0 0.85
# 2 1 24.0 78.0 75.0 116.0 2 0.98
# 3 1 372.0 306.0 37.0 48.0 0 0.84
# 4 2 28.0 87.0 74.0 101.0 2 0.97
# 5 2 372.0 307.0 35.0 47.0 0 0.80
# 6 3 32.0 86.0 73.0 98.0 2 0.98
# 7 3 363.0 310.0 34.0 46.0 0 0.83
# 8 4 40.0 77.0 71.0 98.0 2 0.94
# 9 4 370.0 307.0 38.0 47.0 0 0.84
# 10 4 46.0 78.0 74.0 116.0 2 0.97
# 11 5 372.0 308.0 34.0 46.0 0 0.57
# 12 5 43.0 66.0 67.0 110.0 2 0.96
df_new = df.loc[df.groupby("1")["2"].idxmin()]
print(df_new)
#     1      2     3     4      5  6     7
# 0   0   23.0  78.0  75.0  105.0  2  0.97
# 2   1   24.0  78.0  75.0  116.0  2  0.98
# 4   2   28.0  87.0  74.0  101.0  2  0.97
# 6   3   32.0  86.0  73.0   98.0  2  0.98
# 8   4   40.0  77.0  71.0   98.0  2  0.94
# 12  5   43.0  66.0  67.0  110.0  2  0.96
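If you only want the extracted sixth numbers themselves, pull out that column:
print(df_new["6"].tolist())
# [2, 2, 2, 2, 2, 2]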
This is my dataframe. How do I add max_value, min_value, mean_value and median_value rows so that my index values look like:
0
1
2
3
4
max_value
min_value
mean_value
median_value
Could anyone help me solve this?
If you want to add rows, use DataFrame.append with DataFrame.agg:
df1 = df.append(df.agg(['max','min','mean','median']))
If you want to add columns, use assign with min, max, mean and median:
df2 = df.assign(max_value=df.max(axis=1),
                min_value=df.min(axis=1),
                mean_value=df.mean(axis=1),
                median_value=df.median(axis=1))
One way is as follows (thanks to @jezrael for the help):
df = pd.DataFrame(np.random.randint(0,100,size=(5, 4)), columns=list('ABCD'))
df1=df.copy()
#column wise calc
df.loc['max']=df1.max()
df.loc['min']=df1.min()
df.loc['mean']=df1.mean()
df.loc['median']=df1.median()
#row wise calc
df['max']=df1.max(axis=1)
df['min']=df1.min(axis=1)
df['mean']=df1.mean(axis=1)
df['median']=df1.median(axis=1)
Output:
A B C D max min mean median
0 49.0 91.0 16.0 17.0 91.0 16.0 43.25 33.0
1 20.0 42.0 86.0 60.0 86.0 20.0 52.00 51.0
2 32.0 25.0 94.0 13.0 94.0 13.0 41.00 28.5
3 40.0 1.0 66.0 31.0 66.0 1.0 34.50 35.5
4 18.0 30.0 67.0 31.0 67.0 18.0 36.50 30.5
max 49.0 91.0 94.0 60.0 NaN NaN NaN NaN
min 18.0 1.0 16.0 13.0 NaN NaN NaN NaN
mean 31.8 37.8 65.8 30.4 NaN NaN NaN NaN
median 32.0 30.0 67.0 31.0 NaN NaN NaN NaN
This worked well:
df1 = df.copy()
df.loc['max']=df1.max()
df.loc['min']=df1.min()
df.loc['mean']=df1.mean()
df.loc['median']=df1.median()
How to match values from this DataFrame source:
car_id lat lon
0 100 10.0 15.0
1 100 12.0 10.0
2 100 09.0 08.0
3 110 23.0 12.0
4 110 18.0 32.0
5 110 21.0 16.0
5 110 12.0 02.0
And keep only those whose coords are in this second DataFrame coords:
lat lon
0 12.0 10.0
1 23.0 12.0
3 18.0 32.0
So that the resulting DataFrame result is:
car_id lat lon
1 100 12.0 10.0
3 110 23.0 12.0
4 110 18.0 32.0
I can do that in an iterative way with apply, but I'm looking for a vectorized way. I tried the following with isin() with no success:
result = source[source[['lat', 'lon']].isin({
    'lat': coords['lat'],
    'lon': coords['lon']
})]
The above method returns:
ValueError: ('operands could not be broadcast together with shapes (53103,) (53103,2)
DataFrame.merge() by default merges on all columns with the same names (the intersection of the columns of both DataFrames):
In [197]: source.merge(coords)
Out[197]:
car_id lat lon
0 100 12.0 10.0
1 110 23.0 12.0
2 110 18.0 32.0
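merge discards source's original index; if you need to keep it (as in the expected output), a common workaround is merging on a reset index — a sketch:
result = source.reset_index().merge(coords).set_index('index')
result.index.name = None
#    car_id   lat   lon
# 1     100  12.0  10.0
# 3     110  23.0  12.0
# 4     110  18.0  32.0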
Here's one approach with NumPy broadcasting -
a = source.values
b = coords.values
out = source[(a[:,1:]==b[:,None]).all(-1).any(0)]
Sample run -
In [74]: source
Out[74]:
car_id lat lon
0 100 10.0 15.0
1 100 12.0 10.0
2 100 9.0 8.0
3 110 23.0 12.0
4 110 18.0 32.0
5 110 21.0 16.0
5 110 12.0 2.0
In [75]: coords
Out[75]:
lat lon
0 12.0 10.0
1 23.0 12.0
3 18.0 32.0
In [76]: a = source.values
...: b = coords.values
...:
In [77]: source[(a[:,1:]==b[:,None]).all(-1).any(0)]
Out[77]:
car_id lat lon
1 100 12.0 10.0
3 110 23.0 12.0
4 110 18.0 32.0
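One caveat: exact float equality can be brittle for coordinates; np.isclose is a safer drop-in for the comparison:
import numpy as np

mask = np.isclose(a[:, 1:], b[:, None]).all(-1).any(0)
out = source[mask]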