My main goal is to divide two columns by each other in Python.
I import an Excel file called 'data'. The two columns I need are on different tabs, which is not a problem. The first column is called 'city_area' and has 27 numbers (printed with index 0 - 26). The second column is called 'population'; it has a gap at the top before the numbers start, so I use [6:] to keep only the data I need. It then prints 27 numbers as well (but with index 6 - 32). When I divide the two, only the rows whose index matches (6 - 26) are divided; the rest are NaN. Both columns are floats. Any suggestions?
population1 = (df2['Unnamed: 5'])
population = population1[6:].astype(float)
population
6 664000.0
7 3557000.0
8 619000.0
9 3351000.0
10 13974000.0
11 8238000.0
12 2393000.0
13 3474000.0
14 5750000.0
15 6199000.0
16 2866000.0
17 2304000.0
18 19522000.0
19 7136000.0
20 3595886.0
21 10261856.0
22 8518000.0
23 3041000.0
24 15593000.0
25 3051000.0
26 10984000.0
27 1567000.0
28 405000.0
29 5974000.0
30 41164000.0
31 2007000.0
32 1337000.0
city_area = (df3['SQ_KM'])
city_area
0 8835.000
1 511.000
2 6407.000
3 11400.000
4 693.800
5 313.800
6 5802.000
7 93.380
8 739.000
9 827.141
10 3292.000
11 8096.000
12 330.900
13 1059.400
14 211.500
15 432.170
16 218.000
17 1390.000
18 1260.000
19 169.300
20 496.800
21 34091.000
22 5687.000
23 675.400
24 1520.000
25 181.857
26 2220.000
Name: SQ_KM, dtype: float64
answer = population/city_area
answer
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 114.443295
7 38091.668451
8 837.618403
9 4051.304433
10 4244.835966
11 1017.539526
12 7231.792082
13 3279.214650
14 27186.761229
15 14343.892450
16 13146.788991
17 1657.553957
18 15493.650794
19 42150.029533
20 7238.095813
21 301.013640
22 1497.802005
23 4502.517027
24 10258.552632
25 16776.918128
26 4947.747748
27 NaN
28 NaN
29 NaN
30 NaN
31 NaN
32 NaN
dtype: float64
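The NaN rows come from index alignment: pandas matches the two Series on their index labels before dividing, so only labels present in both (6 to 26) produce a value. A minimal sketch of two ways to divide by position instead, using short made-up Series in place of the real columns (the values here are assumptions for illustration only):

```python
import pandas as pd

# Made-up stand-ins for the two columns; only the index layout matters here.
city_area = pd.Series([10.0, 20.0, 40.0])                        # index 0, 1, 2
population = pd.Series([100.0, 400.0, 1200.0], index=[6, 7, 8])  # index 6, 7, 8

# Label alignment: the two indexes share no labels, so every result is NaN.
aligned = population / city_area

# Option 1: drop the old labels so both Series are indexed 0..n-1.
by_position = population.reset_index(drop=True) / city_area

# Option 2: divide the underlying NumPy arrays and reuse city_area's index.
by_array = pd.Series(population.to_numpy() / city_area.to_numpy(),
                     index=city_area.index)

print(by_position.tolist())
```

Either option only makes sense if row i of one column really corresponds to row i of the other, which is worth confirming against the spreadsheet first.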
I have a column as follows:
A B
0 0 20.00
1 1 35.00
2 2 75.00
3 3 29.00
4 4 125.00
5 5 16.00
6 6 52.50
7 7 NaN
8 8 NaN
9 9 NaN
10 10 NaN
11 11 NaN
12 12 NaN
13 13 239.91
14 14 22.87
15 15 52.74
16 16 37.20
17 17 27.44
18 18 57.01
19 19 29.88
I want to change the values of the column as follows
if 0<B<10.0, then Replace the cell value of B by "0 to 10"
if 10.1<B<20.0, then Replace the cell value of B by "10 to 20"
continue like this until the maximum range achieved.
I have tried
ds['B'] = np.where(ds['B'].between(10.0,20.0), "10 to 20", ds['B'])
But once I perform this operation, column B contains the string "10 to 20" mixed in with the numbers, so I cannot repeat the operation for the remaining ranges. After this step, the DataFrame looks like this:
A B
0 0 10 to 20
1 1 35.0
2 2 75.0
3 3 29.0
4 4 125.0
5 5 10 to 20
6 6 52.5
7 7 nan
8 8 nan
9 9 nan
10 10 nan
11 11 nan
12 12 nan
13 13 239.91
14 14 22.87
15 15 52.74
16 16 37.2
17 17 27.44
18 18 57.01
19 19 29.88
And the following line: ds['B'] = np.where(ds['B'].between(20.0,30.0), "20 to 30", ds['B']) throws TypeError: '>=' not supported between instances of 'str' and 'float'
How can I code this so that all of the values in the column are changed to these range strings at once?
Build your bins and labels and use pd.cut:
# + 20 so the top edge lies above the maximum (np.arange excludes its stop value)
bins = np.arange(0, ds["B"].max() // 10 * 10 + 20, 10).astype(int)
labels = [' to '.join(t) for t in zip(bins[:-1].astype(str), bins[1:].astype(str))]
ds["B"] = pd.cut(ds["B"], bins=bins, labels=labels)
>>> ds
A B
0 0 10 to 20
1 1 30 to 40
2 2 70 to 80
3 3 20 to 30
4 4 120 to 130
5 5 10 to 20
6 6 50 to 60
7 7 NaN
8 8 NaN
9 9 NaN
10 10 NaN
11 11 NaN
12 12 NaN
13 13 230 to 240
14 14 20 to 30
15 15 50 to 60
16 16 30 to 40
17 17 20 to 30
18 18 50 to 60
19 19 20 to 30
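One detail worth checking with pd.cut is which side of each bin is closed. A minimal sketch on a handful of values taken from the question (the variable names are illustrative, not from the original answer):

```python
import numpy as np
import pandas as pd

s = pd.Series([20.0, 35.0, 16.0, np.nan, 239.91])

# Edges 0, 10, ... up to the first multiple of 10 at or above the maximum.
top = int(np.ceil(s.max() / 10) * 10)
bins = np.arange(0, top + 10, 10)
labels = [f"{lo} to {hi}" for lo, hi in zip(bins[:-1], bins[1:])]

# Default right=True: intervals are (lo, hi], so 20.0 lands in "10 to 20".
right_closed = pd.cut(s, bins=bins, labels=labels)

# right=False: intervals are [lo, hi), so 20.0 lands in "20 to 30".
left_closed = pd.cut(s, bins=bins, labels=labels, right=False)

print(pd.DataFrame({"value": s, "right_closed": right_closed,
                    "left_closed": left_closed}))
```

NaN inputs stay NaN in both cases, and values that fall outside every bin also become NaN, so it pays to make sure the last edge covers the maximum.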
This can be done with much less code as this is actually just a matter of string formatting.
ds['B'] = ds['B'].apply(lambda x: f'{int(x/10) if x>=10 else ""}0 to {int(x/10)+1}0' if pd.notnull(x) else x)
You can create a custom function that maps each range to a string. For example, 19.0 will be mapped to "10 to 20", and then apply this function to each row.
I've written the code so that the minimum and maximum of the range is generalizable to the DataFrame, and takes on values that are multiples of 10.
import numpy as np
import pandas as pd
## copy and paste your DataFrame
ds = pd.read_clipboard()
# floor to the nearest multiple of 10
ds_min = ds['B'].min() // 10 * 10
# ceiling to the nearest multiple of 10 (round() could round down instead)
ds_max = np.ceil(ds['B'].max() / 10) * 10
# np.linspace needs an integer number of points
ranges = np.linspace(ds_min, ds_max, int((ds_max - ds_min) / 10) + 1)
def map_value_to_string(value):
    # returns None for NaN and for values outside every range
    for idx in range(1, len(ranges)):
        low_value, high_value = ranges[idx-1], ranges[idx]
        if low_value < value <= high_value:
            return f"{int(low_value)} to {int(high_value)}"
ds['B'] = ds['B'].apply(map_value_to_string)
Output:
>>> ds
A B
0 0 10 to 20
1 1 30 to 40
2 2 70 to 80
3 3 20 to 30
4 4 120 to 130
5 5 10 to 20
6 6 50 to 60
7 7 None
8 8 None
9 9 None
10 10 None
11 11 None
12 12 None
13 13 230 to 240
14 14 20 to 30
15 15 50 to 60
16 16 30 to 40
17 17 20 to 30
18 18 50 to 60
19 19 20 to 30
Disclaimer: This might be possible duplicate but I cannot find the exact solution. Please feel free to mark this question as duplicate and provide link to duplicate question in comments.
I am still learning python dataframe operations and this possibly has a very simple solution which I am not able to figure out.
I have a pandas DataFrame with a single column. I want to change the value of each row to the value of the previous row if certain conditions are satisfied. I have written a loop to do this, but I was hoping for a more efficient solution.
Creation of initial data:
import numpy as np
import pandas as pd
data = np.random.randint(5,30,size=20)
df = pd.DataFrame(data, columns=['random_numbers'])
print(df)
random_numbers
0 6
1 24
2 29
3 18
4 22
5 17
6 12
7 7
8 6
9 27
10 29
11 13
12 23
13 6
14 25
15 24
16 16
17 15
18 25
19 19
Now let's assume the two conditions are 1) value less than 10 and 2) value more than 20. In either case, set the row's value to the previous row's value. This has been implemented as a loop as follows:
for index, row in df.iterrows():
    if index == 0:
        continue
    if row.random_numbers < 10:
        df.loc[index, 'random_numbers'] = df.loc[index-1, 'random_numbers']
    if row.random_numbers > 20:
        df.loc[index, 'random_numbers'] = df.loc[index-1, 'random_numbers']
random_numbers
0 6
1 6
2 6
3 18
4 18
5 17
6 12
7 12
8 12
9 12
10 12
11 13
12 13
13 13
14 13
15 13
16 16
17 15
18 15
19 19
Please suggest a more efficient way to implement this logic as I am using large number of rows.
You can replace the values less than 10 and values more than 20 with NaN then use pandas.DataFrame.ffill() to fill nan with previous row value.
mask = (df['random_numbers'] < 10) | (df['random_numbers'] > 20)
# Since you escape with `if index == 0:`
mask[df.index[0]] = False
df.loc[mask, 'random_numbers'] = np.nan
df['random_numbers'] = df['random_numbers'].ffill()
# Original
random_numbers
0 7
1 28
2 8
3 14
4 12
5 20
6 21
7 11
8 16
9 27
10 19
11 23
12 18
13 5
14 6
15 11
16 6
17 8
18 17
19 8
# After replaced
random_numbers
0 7.0
1 7.0
2 7.0
3 14.0
4 12.0
5 20.0
6 20.0
7 11.0
8 16.0
9 16.0
10 19.0
11 19.0
12 18.0
13 18.0
14 18.0
15 11.0
16 11.0
17 11.0
18 17.0
19 17.0
We can also do it more compactly by combining .mask() with .ffill():
cond = (df['random_numbers'] < 10) | (df['random_numbers'] > 20)
cond.iloc[0] = False  # leave row 0 untouched, as the loop does
df['random_numbers'] = df['random_numbers'].mask(cond).ffill().astype(int)
.mask() replaces values with NaN where the condition is true (NaN is the default when the parameter other= is not supplied) and keeps the original values elsewhere.
Masking introduces NaN, which turns the column into float; the final .astype(int) converts it back to integers once the gaps are filled. Older pandas versions could do this with ffill(downcast='infer'), but that keyword is deprecated in recent releases.
Setting cond.iloc[0] = False ensures row 0 is never replaced, since it has no previous row. This also avoids chained assignment of the form df['random_numbers'][1:] = ..., which is not guaranteed to write back to the DataFrame.
# Original data: (reusing your sample data)
random_numbers
0 6
1 24
2 29
3 18
4 22
5 17
6 12
7 7
8 6
9 27
10 29
11 13
12 23
13 6
14 25
15 24
16 16
17 15
18 25
19 19
# After transformation:
random_numbers
0 6
1 6
2 6
3 18
4 18
5 17
6 12
7 12
8 12
9 12
10 12
11 13
12 13
13 13
14 13
15 13
16 16
17 15
18 15
19 19
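As a quick check, the vectorized idea can be run on the question's sample data and compared with the loop's result listed in the question. This is a minimal sketch; the name out_of_range is just illustrative:

```python
import pandas as pd

values = [6, 24, 29, 18, 22, 17, 12, 7, 6, 27,
          29, 13, 23, 6, 25, 24, 16, 15, 25, 19]
df = pd.DataFrame({"random_numbers": values})

# True where a value should be replaced by the previous kept value.
out_of_range = (df["random_numbers"] < 10) | (df["random_numbers"] > 20)
out_of_range.iloc[0] = False  # row 0 has no previous row, so it is kept

# Blank out the flagged rows, forward-fill them, and restore the int dtype.
filled = df["random_numbers"].where(~out_of_range).ffill().astype(int)

print(filled.tolist())
# [6, 6, 6, 18, 18, 17, 12, 12, 12, 12, 12, 13, 13, 13, 13, 13, 16, 15, 15, 19]
```

Forward-filling copies the last value that was kept, which matches the loop here because the loop reads the already-updated previous row.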
I have a pandas DataFrame with around 15 columns. For each partition_num, I want to check whether the data in its first row is equal to the data in its last row; if it is not, I want to add a new row at the end of that partition with the data from its first row.
Input:
row id partition_num lat long time
0 1 7333 24 26 9
1 2 7333 15 19 10
2 3 7333 24 25 9
3 1 8999 26 18 15
4 2 8999 15 17 45
5 3 8999 26 18 15
6 1 3455 12 14 18
7 2 3455 12 14 18
Desired output:
row id partition_num lat long time
0 1 7333 24 26 9
1 2 7333 15 19 10
2 3 7333 25 26 9
3 4 7333 24 26 9
4 1 8999 26 18 15
5 2 8999 15 17 45
6 3 8999 26 18 15
7 1 3455 12 14 18
8 2 3455 12 14 18
Since the data for partition_num 7333 in row 0 is not equal to the data in row 2, add a new row (row 3) with the same data as row 0.
Can we also add a new column, something like flag, to identify the new record?
row id partition_num lat long time flag
0 1 7333 24 26 9 old
1 2 7333 15 19 10 old
2 3 7333 25 26 9 old
3 4 7333 24 26 9 new
4 1 8999 26 18 15 old
5 2 8999 15 17 45 old
6 3 8999 26 18 15 old
7 1 3455 12 14 18 old
8 2 3455 12 14 18 old
groupby easily builds a sub-DataFrame per partition_num. From that point the processing is simple (pd.concat is used here because DataFrame.append was removed in pandas 2.0):
for i, x in df.groupby('partition_num'):
    if (x.iloc[0]['partition_num':] != x.iloc[-1]['partition_num':]).any():
        s = x.iloc[0].copy()
        s['id'] = x.iloc[-1]['id'] + 1
        df = pd.concat([df, s.to_frame().T]).reset_index(drop=True).rename_axis('row')
The following code compares the values of 'partition_num' in the first and last rows and, if they don't match, appends the first row to the end of the DataFrame (using pd.concat, since DataFrame.append was removed in pandas 2.0):
if df.loc[0, 'partition_num'] != df.loc[len(df)-1, 'partition_num']:
    df = pd.concat([df, df.loc[[0]]]).reset_index(drop=True)
df.index.name = 'row'
print(df)
id partition_num lat long time
row
0 1 7333 24 26 9
1 2 7333 15 19 10
2 3 7333 24 26 9
3 1 8999 26 18 15
4 2 8999 15 17 45
5 3 8999 26 18 15
6 1 3455 12 14 18
7 2 3455 12 14 18
8 1 7333 24 26 9
The index column is set to 'row', and it is reset and renamed to get the correct ordering.
Added this piece to the above logic:
s['flag']= 'new_row'
and it worked!!
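For reference, a minimal sketch that adds the closing row directly after its partition and fills the flag column in one pass. The sample values follow the desired-output table in the question, and compare_cols is an assumption about which columns define "the same data":

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 1, 2, 3, 1, 2],
    "partition_num": [7333, 7333, 7333, 8999, 8999, 8999, 3455, 3455],
    "lat": [24, 15, 25, 26, 15, 26, 12, 12],
    "long": [26, 19, 26, 18, 17, 18, 14, 14],
    "time": [9, 10, 9, 15, 45, 15, 18, 18],
})

compare_cols = ["lat", "long", "time"]  # assumed: columns that must match
pieces = []
# sort=False keeps the partitions in their original order of appearance.
for _, group in df.groupby("partition_num", sort=False):
    group = group.assign(flag="old")
    first, last = group.iloc[0], group.iloc[-1]
    if first[compare_cols].tolist() != last[compare_cols].tolist():
        # Copy the first row, give it the next id, and mark it as new.
        closing = group.iloc[[0]].assign(id=last["id"] + 1, flag="new")
        group = pd.concat([group, closing])
    pieces.append(group)

result = pd.concat(pieces, ignore_index=True).rename_axis("row")
print(result)
```

Collecting the groups in a list and concatenating once at the end avoids rebuilding the DataFrame on every iteration.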
I have a csv of daily maximum temperatures. I am trying to assign a "rank" for my data. I first sorted my daily maximum temperature from lowest to highest. I then created a new column called rank.
#Sort data smallest to largest
ValidFullData_Sorted=ValidFullData.sort_values(by="TMAX")
#count total obs
n=ValidFullData_Sorted.shape[0]
#add a numbered column 1-> n to use in return calculation for rank
ValidFullData_Sorted.insert(0,'rank',range(1,1+n))
How can I make the rank the same for values of daily maximum temperature that are the same? (i.e. every time the daily maximum temperature reaches 95° the rank for each of those instances should be the same)
Here is some sample data (it is daily temperature data, so it is thousands of lines long):
Date TMAX TMIN
1/1/00 22 11
1/2/00 26 12
1/3/00 29 14
1/4/00 42 7
1/5/00 42 21
And I want to add a TMAXrank column that would look like this:
Date TMAX TMIN TMAXRank
1/1/00 22 11 4
1/2/00 26 12 3
1/3/00 29 14 2
1/4/00 42 7 1
1/5/00 42 21 1
ValidFullData['TMAXRank'] = ValidFullData[ValidFullData['TMAX'] < 95]['TMAX'].rank(ascending=False, method='dense')
Output:
Unnamed: 0 TMAX TMIN TMAXRank
17 17 88 14 1.0
16 16 76 12 2.0
15 15 72 11 3.0
14 14 64 21 4.0
8 8 62 7 5.0
7 7 58 14 6.0
13 13 58 7 6.0
18 18 55 7 7.0
3 3 42 7 8.0
4 4 42 21 8.0
6 6 41 12 9.0
12 12 37 14 10.0
5 5 36 11 11.0
2 2 29 14 12.0
1 1 26 12 13.0
0 0 22 11 14.0
9 9 98 21 NaN
10 10 112 11 NaN
11 11 98 12 NaN
19 19 95 21 NaN
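The key piece is rank(method='dense'), which gives tied values the same rank without skipping the next one. The TMAX < 95 filter in the line above is why the rows with TMAX of 95 or more show NaN. A minimal sketch on the five sample rows from the question, without that filter:

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["1/1/00", "1/2/00", "1/3/00", "1/4/00", "1/5/00"],
    "TMAX": [22, 26, 29, 42, 42],
    "TMIN": [11, 12, 14, 7, 21],
})

# ascending=False: the highest TMAX gets rank 1.
# method="dense": ties share a rank and the following rank is not skipped.
df["TMAXRank"] = df["TMAX"].rank(ascending=False, method="dense").astype(int)

print(df)
```

No sorting is needed beforehand, since rank works on the values regardless of row order.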
I am new to programming and have taken up learning Python in an attempt to make some tasks I run in my research more efficient. I am running a PCA with pandas (I found a tutorial online) and have the script for this, but I need to sub-select part of a DataFrame prior to the PCA.
So far I have the following (just as an example; in reality I am reading a .csv file with a larger matrix):
x = np.random.randint(30, size=(8,8))
df = pd.DataFrame(x)
0 1 2 3 4 5 6 7
0 9 0 23 13 2 5 14 6
1 20 17 11 10 25 23 20 23
2 15 14 22 25 11 15 5 15
3 9 27 15 27 7 15 17 23
4 12 6 11 13 27 11 26 20
5 27 13 5 16 5 5 2 18
6 3 18 22 0 7 10 11 11
7 25 18 10 11 29 29 1 25
What I want to do is sub-select the columns that satisfy a certain criterion in any of their rows. Specifically, I want every column that has at least one number >=27 (just as an example), to produce a new DataFrame:
0 1 3 4 5
0 9 0 13 2 5
1 20 17 10 25 23
2 15 14 25 11 15
3 9 27 27 7 15
4 12 6 13 27 11
5 27 13 16 5 5
6 3 18 0 7 10
7 25 18 11 29 29
I have looked into the various slicing methods in pandas (.loc, .iloc, etc.) but none seems to do what I want.
The script I am using to read the data so far is
filename = 'Data.csv'
data = pd.read_csv(filename, sep=',')
x = data.iloc[:, 1:] # variables - species
y = data.iloc[:, 0] # cases - age
so a sub-DataFrame of x is what I am after (as above).
Any advice is greatly appreciated.
Indexers like loc and iloc accept boolean arrays. For example, if you have three columns, df.loc[:, [True, False, True]] returns all the rows and columns 0 and 2 (the positions where the value is True). You can check whether any element in a column is greater than or equal to 27 with (df>=27).any(). This returns True for each column that has at least one value >=27. So you can slice the DataFrame with:
df.loc[:, (df>=27).any()]
Out[34]:
0 1 3 4 5 7
0 8 2 28 9 14 21
1 24 26 23 17 0 0
2 3 24 7 15 4 28
3 29 17 12 7 7 6
4 5 3 10 24 29 14
5 23 21 0 16 23 13
6 22 10 27 1 7 24
7 9 27 2 27 17 12
And this is the initial dataframe:
df
Out[35]:
0 1 2 3 4 5 6 7
0 8 2 7 28 9 14 26 21
1 24 26 15 23 17 0 21 0
2 3 24 26 7 15 4 7 28
3 29 17 9 12 7 7 0 6
4 5 3 13 10 24 29 22 14
5 23 21 26 0 16 23 17 13
6 22 10 19 27 1 7 9 24
7 9 27 26 2 27 17 8 12
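A minimal self-contained sketch of the same selection on a small made-up frame (the column names and values are assumptions chosen for illustration), with the intermediate boolean Series kept in its own variable:

```python
import pandas as pd

df = pd.DataFrame({
    "a": [9, 20, 27],
    "b": [0, 17, 14],
    "c": [23, 11, 29],
})

# One boolean per column: True if any row in that column is >= 27.
has_large = (df >= 27).any()

# Keep all rows and only the columns flagged True.
subset = df.loc[:, has_large]

print(subset.columns.tolist())
```

Swapping .any() for .all() would instead keep only the columns in which every value meets the threshold.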