I have a pandas.DataFrame of the following form. I'll show a simple example (in reality it consists of hundreds of millions of rows).
I want to replace column '2' with a counter that increases each time the letter in column '2' changes. The numbers in the remaining columns (columns '1', '3', ...) should not change.
df=
index 1 2 3
0 0 a100 1
1 1.04 a100 2
2 32 a100 3
3 5.05 a105 4
4 1.01 a105 5
5 155 a105 6
6 3155.26 a105 7
7 354.12 a100 8
8 5680.13 a100 9
9 125.55 a100 10
10 13.32 a100 11
11 5656.33 a156 12
12 456.61 a156 13
13 23.52 a1235 14
14 35.35 a1235 15
15 350.20 a100 16
16 30. a100 17
17 13.50 a100 18
18 323.13 a231 19
19 15.11 a1111 20
20 11.22 a1111 21
Here is my expected result:
df=
index 1 2 3
0 0 0 1
1 1.04 0 2
2 32 0 3
3 5.05 1 4
4 1.01 1 5
5 155 1 6
6 3155.26 1 7
7 354.12 2 8
8 5680.13 2 9
9 125.55 2 10
10 13.32 2 11
11 5656.33 3 12
12 456.61 3 13
13 23.52 4 14
14 35.35 4 15
15 350.20 5 16
16 30 5 17
17 13.50 5 18
18 323.13 6 19
19 15.11 7 20
20 11.22 7 21
How do I solve this problem?
Create identifiers for consecutive groups by comparing the column with its shifted values for inequality, take the cumulative sum of that mask, and then subtract 1:
# if the column label is the string '2'
df['2'] = df['2'].ne(df['2'].shift()).cumsum().sub(1)
# if the column label is the number 2
df[2] = df[2].ne(df[2].shift()).cumsum().sub(1)
print(df)
index 1 2 3
0 0 0.00 0 1
1 1 1.04 0 2
2 2 32.00 0 3
3 3 5.05 1 4
4 4 1.01 1 5
5 5 155.00 1 6
6 6 3155.26 1 7
7 7 354.12 2 8
8 8 5680.13 2 9
9 9 125.55 2 10
10 10 13.32 2 11
11 11 5656.33 3 12
12 12 456.61 3 13
13 13 23.52 4 14
14 14 35.35 4 15
15 15 350.20 5 16
16 16 30.00 5 17
17 17 13.50 5 18
18 18 323.13 6 19
19 19 15.11 7 20
20 20 11.22 7 21
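For clarity, here is the same one-liner broken into steps (s is just a temporary name for the column):

s = df['2']
changed = s.ne(s.shift())     # True on the first row and wherever the value differs from the previous row
group_ids = changed.cumsum()  # running count of changes -> 1-based consecutive-group ids
df['2'] = group_ids.sub(1)    # subtract 1 to get the 0-based ids shown above

Note that this numbers consecutive runs, so a letter that reappears later (like a100 here) starts a new group rather than reusing its old number.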
I have a dataframe like this:
a\tb\tc d\te\tf g\th\ti
20\t21\t22 1\t2\t3 30\t31\t32
17\t18\t19 4\t5\t6 27\t28\t29
14\t15\t16 7\t8\t9 24\t25\t26
11\t12\t13 10\t11\t12 21\t22\t23
8\t9\t10 13\t14\t15 18\t19\t20
5\t6\t7 16\t17\t18 15\t16\t17
2\t3\t4 19\t20\t21 12\t13\t14
expected output:
a b c d e f g h i
0 20 21 22 1 2 3 30 31 32
1 17 18 19 4 5 6 27 28 29
2 14 15 16 7 8 9 24 25 26
3 11 12 13 10 11 12 21 22 23
4 8 9 10 13 14 15 18 19 20
5 5 6 7 16 17 18 15 16 17
6 2 3 4 19 20 21 12 13 14
My solution is:
l = list()
for column in df.columns:
    columns = column.split()
    d = df[column].str.split(expand=True)
    l.append(d.rename(columns=dict(zip(range(len(columns)), columns))))
pd.concat(l, axis=1)
But this looks complex.
Is there a simpler way to do this?
Your approach looks good. You can simplify the rename step by assigning the new names directly to the .columns attribute:
def expand(col):
    _df = df[col].str.split(expand=True)
    _df.columns = col.split()
    return _df

pd.concat(map(expand, df.columns), axis=1)
a b c d e f g h i
0 20 21 22 1 2 3 30 31 32
1 17 18 19 4 5 6 27 28 29
2 14 15 16 7 8 9 24 25 26
3 11 12 13 10 11 12 21 22 23
4 8 9 10 13 14 15 18 19 20
5 5 6 7 16 17 18 15 16 17
6 2 3 4 19 20 21 12 13 14
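One caveat: str.split() with no arguments splits on any whitespace. If the in-cell separator is specifically a tab and the values themselves might contain spaces, you can pass it explicitly (a minor variation on the same sketch):

def expand(col):
    _df = df[col].str.split('\t', expand=True)
    _df.columns = col.split('\t')
    return _df

pd.concat(map(expand, df.columns), axis=1)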
I am trying to implement a task of finding all simple cycles in an undirected graph. Originally the task was to find all cycles of fixed length (= 3), and I managed to do it using the properties of adjacency matrices. But before settling on that approach I also tried DFS. It worked correctly for really small inputs, but for bigger ones it blew up into (nearly) infinite loops, and when I tried to fix the code, it simply stopped finding all the cycles.
My code is attached below.
1. Please don't pay attention to the several global variables used. The working code using the other approach has already been submitted; this one is just for me to see how to make DFS work properly.
2. Yes, I searched for this problem before posting this question, but the solutions I managed to find either used a different approach or only detected whether any cycle exists at all. Besides, I want to know whether my code can be fixed.
Big thanks to anyone who could help.
num_res = 0
adj_list = []
cycles_list = []

def dfs(v, path):
    global num_res
    for node in adj_list[v]:
        if node not in path:
            dfs(node, path + [node])
        elif len(path) >= 3 and (node == path[-3]):
            if sorted(path[-3:]) not in cycles_list:
                cycles_list.append(sorted(path[-3:]))
                num_res += 1

if __name__ == "__main__":
    num_towns, num_pairs = [int(x) for x in input().split()]
    adj_list = [[] for x in range(num_towns)]
    adj_matrix = [[0 for x in range(num_towns)] for x in range(num_towns)]
    # EDGE LIST TO ADJACENCY LIST
    for i in range(num_pairs):
        cur_start, cur_end = [int(x) for x in input().split()]
        adj_list[cur_start].append(cur_end)
        adj_list[cur_end].append(cur_start)
    dfs(0, [0])
    print(num_res)
UPD: It works OK for the following inputs:
5 8
4 0
0 2
0 1
3 2
4 3
4 2
1 3
3 0
(output: 5)
6 15
5 4
2 0
3 1
5 1
4 1
5 3
1 0
4 0
4 3
5 2
2 1
3 0
3 2
5 0
4 2
(output: 20)
9 12
0 1
0 2
1 3
1 4
2 4
2 5
3 6
4 6
4 7
5 7
6 8
7 8
(output: 0)
For the following input it does NOT give any output and just keeps running:
22 141
5 0
12 9
18 16
7 6
7 0
4 1
16 1
8 1
6 1
14 0
16 0
11 9
20 14
12 3
18 3
1 0
17 0
17 15
14 5
17 13
6 5
18 12
21 1
13 4
18 11
18 13
8 0
15 9
21 18
13 6
12 8
16 13
20 18
21 3
11 6
15 14
13 5
17 5
10 8
9 5
16 14
19 9
7 5
14 10
16 4
18 7
12 1
16 3
19 18
19 17
20 2
12 11
15 3
15 11
13 2
10 7
15 13
10 9
7 3
14 3
10 1
21 19
9 2
21 4
19 0
18 1
10 6
15 0
20 7
14 11
19 6
18 10
7 4
16 10
9 4
13 3
12 2
4 3
17 7
15 8
13 7
21 14
4 2
21 0
20 16
18 8
20 12
14 2
13 1
16 15
17 11
17 16
20 10
15 7
14 1
13 0
17 12
18 5
12 4
15 1
16 9
9 1
17 14
16 2
12 5
20 8
19 2
18 4
19 4
19 11
15 12
14 12
11 8
17 10
18 14
12 7
16 8
20 11
8 7
18 9
6 4
11 5
17 6
5 3
15 10
20 19
15 6
19 10
20 13
9 3
13 9
13 10
21 7
19 13
19 12
19 14
6 3
21 15
21 6
17 3
10 5
(output should be 343)
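For reference, the adjacency-matrix trick I mentioned at the top boils down to a few lines (a sketch, not necessarily my submitted code): the trace of A^3 counts closed walks of length 3, and each triangle is counted 6 times (3 starting vertices x 2 directions).

import numpy as np

def count_triangles(adj_matrix):
    # adj_matrix is assumed to be a 0/1 matrix (list of lists or array)
    a = np.array(adj_matrix)
    # each triangle contributes 6 closed walks of length 3
    return int(np.trace(a @ a @ a)) // 6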
I am new to programming and have taken up learning Python to make some tasks in my research more efficient. I am running a PCA using pandas (I found a tutorial online) and have the script for it, but I need to sub-select part of a dataframe prior to the PCA.
So far I have (just as an example; in reality I am reading a .csv file with a larger matrix):
x = np.random.randint(30, size=(8,8))
df = pd.DataFrame(x)
0 1 2 3 4 5 6 7
0 9 0 23 13 2 5 14 6
1 20 17 11 10 25 23 20 23
2 15 14 22 25 11 15 5 15
3 9 27 15 27 7 15 17 23
4 12 6 11 13 27 11 26 20
5 27 13 5 16 5 5 2 18
6 3 18 22 0 7 10 11 11
7 25 18 10 11 29 29 1 25
What I want to do is sub-select the columns that satisfy a certain criterion in any of the rows. Specifically, I want every column that has at least one value >= 27 (just as an example), to produce a new dataframe:
0 1 3 4 5
0 9 0 13 2 5
1 20 17 10 25 23
2 15 14 25 11 15
3 9 27 27 7 15
4 12 6 13 27 11
5 27 13 16 5 5
6 3 18 0 7 10
7 25 18 11 29 29
I have looked into the various slicing methods in pandas but none seem to do what I want (.loc and .iloc etc.).
The actual script I am using to read in the data so far is:
filename = 'Data.csv'
data = pd.read_csv(filename,sep = ',')
x = data.ix[:,1:] # variables - species
y = data.ix[:,0] # cases - age
so a sub-dataframe of x is what I am after (as above).
Any advice is greatly appreciated.
Indexers like loc, iloc, and ix accept boolean arrays. For example, if you have three columns, df.loc[:, [True, False, True]] returns all the rows and columns 0 and 2 (the columns where the corresponding value is True). You can check whether any element in a column is greater than or equal to 27 with (df>=27).any(); this returns True for every column that has at least one value >= 27. So you can slice the dataframe with:
df.loc[:, (df>=27).any()]
Out[34]:
0 1 3 4 5 7
0 8 2 28 9 14 21
1 24 26 23 17 0 0
2 3 24 7 15 4 28
3 29 17 12 7 7 6
4 5 3 10 24 29 14
5 23 21 0 16 23 13
6 22 10 27 1 7 24
7 9 27 2 27 17 12
And this is the initial dataframe:
df
Out[35]:
0 1 2 3 4 5 6 7
0 8 2 7 28 9 14 26 21
1 24 26 15 23 17 0 21 0
2 3 24 26 7 15 4 7 28
3 29 17 9 12 7 7 0 6
4 5 3 13 10 24 29 22 14
5 23 21 26 0 16 23 17 13
6 22 10 19 27 1 7 9 24
7 9 27 26 2 27 17 8 12
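Applied to the script in the question, it would look roughly like this (a sketch; note that .ix has since been deprecated in pandas, and .iloc does the same positional slicing):

filename = 'Data.csv'
data = pd.read_csv(filename, sep=',')
x = data.iloc[:, 1:]  # variables - species
y = data.iloc[:, 0]   # cases - age

# keep only the columns of x that have at least one value >= 27
x_sub = x.loc[:, (x >= 27).any()]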
I am a novice in Python.
I have 8 CSV files with 26 columns and 600 rows each. I want to take the last 4 columns of each file (columns 22 to 25), sum them element-wise across the files, and replace those 4 columns in every file with the sums. For example (showing some random data here):
new-1.csv:
a b c d e f g h i j k
1 1 1 1 1 1 1 1 1 1 1
2 2 2 2 2 2 2 2 2 2 2
3 3 3 3 3 3 3 3 3 3 3
4 4 4 4 4 4 4 4 4 4 4
5 5 5 5 5 5 5 5 5 5 5
6 6 6 6 6 6 6 6 6 6 6
7 7 7 7 7 7 7 7 7 7 7
8 8 8 8 8 8 8 8 8 8 8
9 9 9 9 9 9 9 9 9 9 9
new-2.csv:
a b c d e f g h i j k
11 11 11 11 11 11 11 11 11 11 11
12 12 12 12 12 12 12 12 12 12 12
13 13 13 13 13 13 13 13 13 13 13
14 14 14 14 14 14 14 14 14 14 14
15 15 15 15 15 15 15 15 15 15 15
16 16 16 16 16 16 16 16 16 16 16
17 17 17 17 17 17 17 17 17 17 17
18 18 18 18 18 18 18 18 18 18 18
19 19 19 19 19 19 19 19 19 19 19
Now, I want to sum each element of columns "h, i, j, k" across these 2 files, then replace the last 4 columns of each file with this sum.
Modified new-1.csv:
a b c d e f g h i j k
1 1 1 1 1 1 1 12 12 12 12
2 2 2 2 2 2 2 14 14 14 14
3 3 3 3 3 3 3 16 16 16 16
4 4 4 4 4 4 4 18 18 18 18
5 5 5 5 5 5 5 20 20 20 20
6 6 6 6 6 6 6 22 22 22 22
7 7 7 7 7 7 7 24 24 24 24
8 8 8 8 8 8 8 26 26 26 26
9 9 9 9 9 9 9 28 28 28 28
Modified new-2.csv:
a b c d e f g h i j k
11 11 11 11 11 11 11 12 12 12 12
12 12 12 12 12 12 12 14 14 14 14
13 13 13 13 13 13 13 16 16 16 16
14 14 14 14 14 14 14 18 18 18 18
15 15 15 15 15 15 15 20 20 20 20
16 16 16 16 16 16 16 22 22 22 22
17 17 17 17 17 17 17 24 24 24 24
18 18 18 18 18 18 18 26 26 26 26
19 19 19 19 19 19 19 28 28 28 28
I am assuming I should use pandas or NumPy for this, but I am not sure how. Any suggestions/hints would be appreciated.
You can do this using just NumPy.
import numpy as np

# list of all the files
file_list = ['foo.csv', 'bar.csv', 'baz.csv']  # all 8 files
col_names = ['a', 'b', 'c', 'd', 'e', 'f']  # all the column names (through 'z' if necessary), used as the header row; skip this if not needed

# initialize a numpy array to accumulate the sum of the last 4 columns
add_cols = np.zeros((600, 4))

# iterate over all .csv files
for file in file_list:
    # skiprows skips the header row; usecols reads only the last 4 columns
    temp = np.loadtxt(file, skiprows=1, delimiter=',', usecols=(22, 23, 24, 25))
    add_cols = np.add(temp, add_cols)

# now overwrite all the files, substituting the last 4 columns with the sum
for file in file_list:
    # load the full content of the file into temp
    temp = np.loadtxt(file, skiprows=1, delimiter=',')
    temp[:, [22, 23, 24, 25]] = add_cols
    # write the column names first
    with open(file, 'w') as p:
        p.write(','.join(col_names) + '\n')
    # then append the values in temp to the file as csv
    with open(file, 'a') as p:
        np.savetxt(p, temp, delimiter=',', fmt='%i')
If your files are space-separated rather than comma-separated, remove the delimiter option from these calls (whitespace is the default) and join the header row with spaces accordingly; a sketch follows.
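For example, the space-separated version of the relevant calls would look like this (a sketch of the same loop body as above):

# reading: np.loadtxt splits on whitespace when no delimiter is given
temp = np.loadtxt(file, skiprows=1, usecols=(22, 23, 24, 25))

# writing: np.savetxt uses a single space as its default delimiter
with open(file, 'w') as p:
    p.write(' '.join(col_names) + '\n')
with open(file, 'a') as p:
    np.savetxt(p, temp, fmt='%i')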
After loading your csvs using read_csv, you can add the last 4 columns together and then overwrite them:
In [10]:
total = df[df.columns[-4:]].values + df1[df1.columns[-4:]].values
total
Out[10]:
array([[12, 12, 12, 12],
       [14, 14, 14, 14],
       [16, 16, 16, 16],
       [18, 18, 18, 18],
       [20, 20, 20, 20],
       [22, 22, 22, 22],
       [24, 24, 24, 24],
       [26, 26, 26, 26],
       [28, 28, 28, 28]], dtype=int64)
In [12]:
df[df.columns[-4:]] = total
df1[df1.columns[-4:]] = total
df
Out[12]:
a b c d e f g h i j k
0 1 1 1 1 1 1 1 12 12 12 12
1 2 2 2 2 2 2 2 14 14 14 14
2 3 3 3 3 3 3 3 16 16 16 16
3 4 4 4 4 4 4 4 18 18 18 18
4 5 5 5 5 5 5 5 20 20 20 20
5 6 6 6 6 6 6 6 22 22 22 22
6 7 7 7 7 7 7 7 24 24 24 24
7 8 8 8 8 8 8 8 26 26 26 26
8 9 9 9 9 9 9 9 28 28 28 28
In [13]:
df1
Out[13]:
a b c d e f g h i j k
0 11 11 11 11 11 11 11 12 12 12 12
1 12 12 12 12 12 12 12 14 14 14 14
2 13 13 13 13 13 13 13 16 16 16 16
3 14 14 14 14 14 14 14 18 18 18 18
4 15 15 15 15 15 15 15 20 20 20 20
5 16 16 16 16 16 16 16 22 22 22 22
6 17 17 17 17 17 17 17 24 24 24 24
7 18 18 18 18 18 18 18 26 26 26 26
8 19 19 19 19 19 19 19 28 28 28 28
We need the .values attribute here to get a plain NumPy array, because otherwise pandas will try to align the two frames on their index and column labels, which can give unexpected results when the labels don't match.
Once you have overwritten them, call df.to_csv(file_path) and df1.to_csv(file_path).
In the case of your 8 dfs you can aggregate while looping over them:

# take a copy of the first df's last 4 columns
total = df_list[0]
total = total[total.columns[-4:]].values
for df in df_list[1:]:
    total += df[df.columns[-4:]].values
Then just loop over your dfs again to overwrite:
for df in df_list:
    df[df.columns[-4:]] = total
And then write out again using to_csv.
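Putting it all together (a sketch; the file names are placeholders for your 8 files, and index=False stops to_csv from adding an extra index column):

import pandas as pd

file_list = ['new-1.csv', 'new-2.csv']  # ... all 8 files
df_list = [pd.read_csv(f) for f in file_list]

# element-wise sum of the last 4 columns across all files
total = df_list[0][df_list[0].columns[-4:]].values.copy()
for df in df_list[1:]:
    total += df[df.columns[-4:]].values

# overwrite the last 4 columns of each dataframe and write it back out
for f, df in zip(file_list, df_list):
    df[df.columns[-4:]] = total
    df.to_csv(f, index=False)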