How to loop through .csv file and extract certain values in python? - python

I'm trying to loop through the 11th column in a CSV file and search for the term "abc" (as an example). For every "abc" it finds, I want it to return the value of the first column of the same row, unless it's empty. If it's empty, I want it to go up the first column row by row until it finds a cell that's not empty and return the value of that cell.
I've already imported the needed CSV file. Here's my code trying to do the above.
for row in csvReader:
    if row[10] == 'abc':
        colAVal = row
        while colAVal[0] == '' and colAVal != 0:
            colAVal -= 1
        print(colAVal[0])
My question is: does this code do what it's supposed to do?
And for the second part of what I'm trying to do, I want to be able to manipulate the values that it returns. Is there a way of storing these values so that I can write code that does something for every colAVal[0] that the first part returned?

What you have there won't quite do what you want. Invoking
colAVal -= 1
does not give you the previous row of an iterator. In languages with a more traditional for loop you could step backwards through the rows until you found what you wanted, but that is not the recommended approach in Python. Python's for loop is really a for-each loop: once you have moved from one row to the next, the previous row is inaccessible unless you saved it or you index the underlying data by row number. Mixing these two kinds of access is not recommended and gets confusing fast.
You also have two questions in your post above, and I'll try my best to answer both.
Given a dataset that looks like the following:
col1,col2,col3,col4,col5,col6,col7,col8,col9,col10,col11,col12
0,0,0,0,0,0,0,0,0,0,abc,0
1,1,1,1,1,1,1,1,1,1,1,1
2,2,2,2,2,2,2,2,2,2,2,2
3,3,3,3,3,3,3,3,3,3,3,3
4,4,4,4,4,4,4,4,4,4,4,4
,5,5,5,5,5,5,5,5,5,abc,5
,6,6,6,6,6,6,6,6,6,abc,6
7,7,7,7,7,7,7,7,7,7,7,7
you would expect the answers to be 0, 4, and 4, if I'm understanding your question correctly. You could accomplish that and save the data for later use with something like the following:
#! /usr/bin/env python
import csv
results = []
lastValidFirstColumn = None  # most recent non-empty value seen in the first column
with open('example.csv', newline='') as file_handler:
    for row in csv.reader(file_handler):
        # csv.reader yields strings, so an empty cell is ''
        if row[0] != '':
            lastValidFirstColumn = row[0]
        if row[10] == 'abc':
            results.append(lastValidFirstColumn)
print(results)
# prints ['0', '4', '4']
The data you want, if I understood correctly, is now stored in the results variable. It's not too difficult to write it to a file or manipulate it further, and I'd recommend looking that up yourself, since it would be a better learning experience.
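For the second part of the question (reusing the stored values), here is a minimal sketch of the same carry-forward idea wrapped in a function. The function name `first_column_for` and the in-memory sample built with `io.StringIO` are assumptions made for illustration and are not part of the original answer.

```python
import csv
import io

def first_column_for(lines, term, column=10):
    """Collect the last non-empty first-column value for each row whose
    `column` cell equals `term`. `lines` is any iterable of CSV text lines."""
    found = []
    last_seen = None  # most recent non-empty value from column 0
    for row in csv.reader(lines):
        if row and row[0] != '':
            last_seen = row[0]
        if len(row) > column and row[column] == term:
            found.append(last_seen)
    return found

# In-memory stand-in for a few rows of the example file (hypothetical sample).
sample = io.StringIO(
    "0,0,0,0,0,0,0,0,0,0,abc,0\n"
    "4,4,4,4,4,4,4,4,4,4,4,4\n"
    ",5,5,5,5,5,5,5,5,5,abc,5\n"
    ",6,6,6,6,6,6,6,6,6,abc,6\n"
)
values = first_column_for(sample, 'abc')
print(values)

# The stored values can then be processed one by one, e.g. converted to int.
as_numbers = [int(v) for v in values]
print(as_numbers)
```

Because the function returns a plain list, any later step (writing to a file, arithmetic, filtering) can loop over it without re-reading the CSV.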

You can do this in pandas pretty easily
import pandas as pd
import numpy as np
df = pd.read_csv('my.csv', header=None)
Using a made up csv, we have these values
      0  1  2  3  4  5  6  7  8  9   10
0  20.0  b  a  b  a  b  a  b  a  b  abc
1   NaN  c  d  c  d  c  d  c  d  c  def
2  10.0  d  e  d  e  d  e  d  e  d  ghi
3   NaN  e  f  e  f  e  f  e  f  e  abc
df['has_abc'] = np.where(df[10]=='abc', df.ffill()[0], np.nan)
df.dropna(subset=['has_abc'], inplace=True)
Output
      0  1  2  3  4  5  6  7  8  9   10  has_abc
0  20.0  b  a  b  a  b  a  b  a  b  abc     20.0
3   NaN  e  f  e  f  e  f  e  f  e  abc     10.0
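The forward-fill idea can also be checked without a file on disk. The sketch below is a reduced version under stated assumptions: the three-column CSV text is made up, and the search term sits in column 2 rather than column 10.

```python
import io
import pandas as pd

# Hypothetical CSV: column 0 has gaps, column 2 holds the search term.
csv_text = "20,b,abc\n,c,def\n10,d,ghi\n,e,abc\n"
df = pd.read_csv(io.StringIO(csv_text), header=None)

filled = df[0].ffill()              # carry the last non-empty value downward
matches = filled[df[2] == 'abc']    # keep only rows whose last column is 'abc'
print(matches.tolist())
```

Filling first and filtering second matters: filtering first would drop the rows that hold the values to carry forward.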

Related

How to explode pandas dataframe with lists to label the ones in the same row with same id?

For example, I have a pandas dataframe like this :
Ignoring the "Name" column, I want a dataframe that looks like this, labelling the Hashes of the same group with their "ID"
Here, we traverse each row, we encounter "8a43", and assign ID 1 to it, and wherever we find the same hash value, we assign ID as 1. Then we move on to the next row, and encounter 79e2 and b183. We then traverse all the rows and wherever we find these values, we store their IDs as 2. Now the issue will arise when we reach "abc7". It will be assigned ID=5 as it was previously encountered in "abc5". But I also want that in rows after the current one, wherever I find "26ea", assign the ID=5 to those as well.
I hope all this makes sense. If not, feel free to reach out to me via comments or message. I will clear it out quickly.
Solution using dict
import numpy as np
import pandas as pd
hashvalues = list(df['Hash_Value'])
dic, i = {}, 1
id_list = []
for hashlist in hashvalues:
    # convert to list
    if isinstance(hashlist, str):
        hashlist = hashlist.replace('[','').replace(']', '')
        hashlist = hashlist.split(',')
        # check if the hash is unknown
        if hashlist[0] not in dic:
            # Assign a new id
            dic[hashlist[0]] = i
            k = i
            i += 1
        else:
            # if known use existing id
            k = dic[hashlist[0]]
        for h in hashlist[1:]:
            # set id of the rest of the list hashes
            # equal to the first hash's id
            dic[h] = k
        id_list.append(k)
    else:
        id_list.append(np.nan)
# store the collected ids as the new column
df['ID'] = id_list
print(df)
Hash Name ID
0 [8a43] abc1 1
1 [79e2,b183] abc2 2
2 [f82a] abc3 3
3 [b183] abc4 2
4 [eaa7,5ea9,1cee] abc5 4
5 [5ea9] abc6 4
6 [1cee,26ea] abc7 4
7 [79e2] abc8 2
8 [8a43] abc9 1
9 [26ea] abc10 4
Use a networkx solution to build a dictionary of hashes that share common values, then select the first value of each Hash_Value list with str[0] and map it with Series.map:
#if necessary convert to lists
#df['Hash_Value'] = df['Hash_Value'].str.strip('[]').str.split(', ')
import networkx as nx
G=nx.Graph()
for l in df['Hash_Value']:
    nx.add_path(G, l)
new = list(nx.connected_components(G))
print (new)
[{'8a43'}, {'79e2', 'b183'}, {'f82a'}, {'5ea9', '1cee', '26ea', 'eaa7'}]
mapped = {node: cid for cid, component in enumerate(new) for node in component}
df['ID'] = df['Hash_Value'].str[0].map(mapped) + 1
print (df)
Hash_Value Name ID
0 [8a43] abc1 1
1 [79e2, b183] abc2 2
2 [f82a] abc3 3
3 [b183] abc4 2
4 [eaa7, 5ea9, 1cee] abc5 4
5 [5ea9] abc6 4
6 [1cee, 26ea] abc7 4
7 [79e2] abc8 2
8 [8a43] abc9 1
9 [26ea] abc10 4
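If installing networkx is not an option, the same connected-components grouping can be sketched with a small union-find structure in plain Python. This is a swapped-in technique, not what either answer above uses; the function name `group_ids` is hypothetical, and every row is assumed to hold a non-empty list of hashes.

```python
def group_ids(hash_lists):
    """Return one ID per row; rows sharing any hash (directly or through a
    chain of other rows) get the same ID. IDs start at 1 in order of first
    appearance."""
    parent = {}

    def find(x):
        # Walk up to the representative, compressing the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # First pass: union all hashes that occur together in a row.
    for hashes in hash_lists:
        for h in hashes:
            parent.setdefault(h, h)
        for h in hashes[1:]:
            parent[find(h)] = find(hashes[0])

    # Second pass: number the groups in order of first appearance.
    ids, seen = [], {}
    for hashes in hash_lists:
        root = find(hashes[0])
        ids.append(seen.setdefault(root, len(seen) + 1))
    return ids

rows = [['8a43'], ['79e2', 'b183'], ['f82a'], ['b183'],
        ['eaa7', '5ea9', '1cee'], ['5ea9'], ['1cee', '26ea'],
        ['79e2'], ['8a43'], ['26ea']]
ids = group_ids(rows)
print(ids)
```

The two passes are deliberate: a hash such as 26ea only becomes linked to its group once the row containing both 1cee and 26ea has been seen, so IDs are assigned only after all unions are done.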

how to write a list in a file with a specific format?

I have a Python list and wanna reprint that in a special way.
input:
trend_end= ['skill1',10,0,13,'skill2',6,1,0,'skill3',5,8,9,'skill4',9,0,1]
I want to write a file like this:
output:
1 2 3
1 10 0 13
2 6 1 0
3 5 8 9
4 9 0 1
Basically, I need to do the following steps:
Separate elements of the list for each skill.
Write them in a table shape, add indices of columns and rows.
I wanna use it as an input of another software. That's why I wanna write a file.
I did this but I know it is wrong, can you see how I can fix it?
f1 = open("data.txt", "a")
for j in trend_end:
    f1.write(str(j))
for i in range(1,int(len(trend_end)/df1ana.shape[0])):
    G=[trend_end[i*(df1ana.shape[0]-10)- (df1ana.shape[0]-10):i*(df1ana.shape[0]-10)]]
    for h in G:
        f1.write(i)
        f1.write(h)
        f1.write('\n')
f.close()
df1ana.shape[0] is 3 in the above example. It is basically the length of data for each skill
Another option that you can try via pandas:
import pandas as pd
pd.DataFrame([trend_end[i+1:i+4] for i in range(0,len(trend_end),4)]).to_csv('data.txt', sep='\t')
OUTPUT:
0 1 2
0 10 0 13
1 6 1 0
2 5 8 9
3 9 0 1
You should iterate over the list in steps of 4, i.e. df1ana.shape[0]+1:
steps = df1ana.shape[0]+1
with open("data.txt", "a") as f:
    f.write('   ' + ''.join(f"{n:<3}" for n in range(1, steps)) + '\n') # write header line
    # i is the position of the first number of each skill, row is the 1-based row label
    for row, i in enumerate(range(1, len(trend_end), steps), start=1):
        f.write(f"{row:<3}")
        for j in range(i, i+steps-1):
            f.write(f"{trend_end[j]:<3}")
        f.write("\n")
The :<3 formatting puts each value in a 3-character, left-aligned field.
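To check the layout without touching a file, the same stepping and `:<3` formatting can be sketched as a function that returns the lines. The name `format_rows` and its `per_group` and `width` parameters are hypothetical; `per_group` plays the role of df1ana.shape[0].

```python
def format_rows(values, per_group, width=3):
    """Return table lines: a header row, then one numbered row per group."""
    step = per_group + 1  # each group is one label followed by per_group numbers
    header = " " * width + "".join(f"{n:<{width}}" for n in range(1, per_group + 1))
    lines = [header.rstrip()]
    for row_no, start in enumerate(range(1, len(values), step), start=1):
        cells = values[start:start + per_group]
        line = f"{row_no:<{width}}" + "".join(f"{c:<{width}}" for c in cells)
        lines.append(line.rstrip())
    return lines

trend_end = ['skill1', 10, 0, 13, 'skill2', 6, 1, 0,
             'skill3', 5, 8, 9, 'skill4', 9, 0, 1]
table = format_rows(trend_end, per_group=3)
print("\n".join(table))
```

Writing the result is then a single call such as f.write("\n".join(table) + "\n") inside a with open(...) block.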
This should work regardless of the number of groups or the number of records per group. It uses the difference in the size of the full list compared to the integer only list to calculate the number of rows you should have, and uses the ratio of the number of integers over the number of rows to get the number of columns.
import numpy as np
import pandas as pd
# keep only the numbers; each skill-name string marks one row
digits = [x for x in trend_end if isinstance(x, int)]
n_rows = len(trend_end) - len(digits)  # one label per row
n_cols = len(digits) // n_rows
pd.DataFrame(np.reshape(digits, (n_rows, n_cols))).to_csv('output.csv')

Python: how to multiply 2 columns?

I have a simple dataframe and I would like to add the column 'Pow_calkowita'. If 'liczba_kon' is 0, 'Pow_calkowita' is 'Powierzchn', but if 'liczba_kon' is not 0, 'Pow_calkowita' is 'liczba_kon' * 'Powierzchn'. Why I can't do that?
for index, row in df.iterrows():
    if row['liczba_kon'] == 0:
        row['Pow_calkowita'] = row['Powierzchn']
    elif row['liczba_kon'] != 0:
        row['Pow_calkowita'] = row['Powierzchn'] * row['liczba_kon']
My code didn't return any values.
liczba_kon Powierzchn
0 3 69.60495
1 1 39.27270
2 1 130.41225
3 1 129.29570
4 1 294.94400
5 1 64.79345
6 1 108.75560
7 1 35.12290
8 1 178.23905
9 1 263.00930
10 1 32.02235
11 1 125.41480
12 1 47.05420
13 1 45.97135
14 1 154.87120
15 1 37.17370
16 1 37.80705
17 1 38.78760
18 1 35.50065
19 1 74.68940
I have found a solution:
result = []
for index, row in df.iterrows():
    if row['liczba_kon'] == 0:
        result.append(row['Powierzchn'])
    elif row['liczba_kon'] != 0:
        result.append(row['Powierzchn'] * row['liczba_kon'])
df['Pow_calkowita'] = result
Is it good way?
To write idiomatic pandas code and benefit from its efficient array processing, avoid looping over the rows yourself. Pandas operations are vectorized over NumPy ndarrays, whose inner loops run in optimized compiled code, so a single statement can process a whole column without an explicit Python loop. The result is both shorter and faster.
As your formula is based on a condition, you cannot use direct multiplication. Instead you can use np.where() as follows:
import numpy as np
df['Pow_calkowita'] = np.where(df['liczba_kon'] == 0, df['Powierzchn'], df['Powierzchn'] * df['liczba_kon'])
When the test condition in first parameter is true, the value from second parameter is taken, else, the value from the third parameter is taken.
Test run output (with two extra test rows added at the end, one of them with liczba_kon equal to 0):
print(df)
liczba_kon Powierzchn Pow_calkowita
0 3 69.60495 208.81485
1 1 39.27270 39.27270
2 1 130.41225 130.41225
3 1 129.29570 129.29570
4 1 294.94400 294.94400
5 1 64.79345 64.79345
6 1 108.75560 108.75560
7 1 35.12290 35.12290
8 1 178.23905 178.23905
9 1 263.00930 263.00930
10 1 32.02235 32.02235
11 1 125.41480 125.41480
12 1 47.05420 47.05420
13 1 45.97135 45.97135
14 1 154.87120 154.87120
15 1 37.17370 37.17370
16 1 37.80705 37.80705
17 1 38.78760 38.78760
18 1 35.50065 35.50065
19 1 74.68940 74.68940
20 0 69.60495 69.60495
21 2 74.68940 149.37880
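The np.where rule described above can be tried on a small frame. This is only a sketch: the four rows are made up, and only the column names come from the question.

```python
import numpy as np
import pandas as pd

# Small made-up frame using the question's column names.
df = pd.DataFrame({'liczba_kon': [3, 1, 0, 2],
                   'Powierzchn': [10.0, 20.0, 30.0, 40.0]})

# Where liczba_kon is 0 keep Powierzchn, otherwise multiply the two columns.
df['Pow_calkowita'] = np.where(df['liczba_kon'] == 0,
                               df['Powierzchn'],
                               df['Powierzchn'] * df['liczba_kon'])
print(df)
```

Both candidate columns are computed in full and np.where then picks element by element, which is why no explicit loop is needed.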
To answer the first question: "Why I can't do that?"
The documentation states (in the notes):
Because iterrows returns a Series for each row, ....
and
You should never modify something you are iterating over. [...] the iterator returns a copy and not a view, and writing to it will have no effect.
This basically means that it returns a new Series with the values of that row.
So, what you are getting is NOT the actual row, and definitely NOT the dataframe!
BUT what you are doing is working, although not in the way that you want:
df = pd.DataFrame(dict(a=[1, 2, 3], b=list("abc")))
df # To demonstrate what you are doing
a b
0 1 a
1 2 b
2 3 c
for index, row in df.iterrows():
... print("\n------------------\n>>> Next Row:\n")
... print(row)
... row["c"] = "ADDED" ####### HERE I am adding to 'the row'
... print("\n -- >> added:")
... print(row)
... print("----------------------")
...
------------------
Next Row: # as you can see, this Series has the same values
a 1 # as the row that it represents
b a
Name: 0, dtype: object
-- >> added:
a 1
b a
c ADDED # and adding to it works... but you aren't doing anything
Name: 0, dtype: object # with it, unless you append it to a list
----------------------
------------------
Next Row:
a 2
b b
Name: 1, dtype: object
### same here
-- >> added:
a 2
b b
c ADDED
Name: 1, dtype: object
----------------------
------------------
Next Row:
a 3
b c
Name: 2, dtype: object
### and here
-- >> added:
a 3
b c
c ADDED
Name: 2, dtype: object
----------------------
To answer the second question: "Is it good way?"
No.
Because using the multiplication like SeaBean has shown actually uses the power of numpy and pandas, which are vectorized operations.
This is a link to a good article on vectorization in numpy arrays, which are basically the building blocks of pandas DataFrames and Series.
A DataFrame is designed to operate with vectorization, and you can treat it like a database table. So you should use its built-in functions whenever possible.
tdf = df.copy() # temp df; a plain "tdf = df" would only alias df and overwrite its zeros
tdf['liczba_kon'] = tdf['liczba_kon'].replace(0, 1) # replace 0 with 1
tdf['Pow_calkowita'] = tdf['liczba_kon'] * tdf['Powierzchn'] # multiply
df['Pow_calkowita'] = tdf['Pow_calkowita'] # copy column
This simplifies the code and improves performance. We can test both versions:
import time
import numpy as np
import pandas as pd
sampleSize = 100000
df = pd.DataFrame({
    'liczba_kon': np.random.randint(3, size=(sampleSize)),
    'Powierzchn': np.random.randint(1000, size=(sampleSize)),
})
# vectorization
s = time.time()
tdf = df.copy() # temp df, copied so the zeros in df stay untouched
tdf['liczba_kon'] = tdf['liczba_kon'].replace(0, 1) # replace 0 with 1
tdf['Pow_calkowita'] = tdf['liczba_kon'] * tdf['Powierzchn'] # multiply
df['Pow_calkowita'] = tdf['Pow_calkowita'] # copy column
print(time.time() - s)
# iteration
s = time.time()
result = []
for index, row in df.iterrows():
    if row['liczba_kon'] == 0:
        result.append(row['Powierzchn'])
    elif row['liczba_kon'] != 0:
        result.append(row['Powierzchn'] * row['liczba_kon'])
df['Pow_calkowita'] = result
print(time.time() - s)
We can see that vectorization performed much faster:
0.0034716129302978516
6.193516492843628

How might I count the occurrence of a specific character that is different in every row

I would like to count how often a specific character (that is different in every row) appears in a dataframe series. Example dataframe:
givenletter  phrase
w            whatwhatwhat
q            queenbee
d            devildonkey
n            woohoo
e            arrogant
Desired result:
givenletter  phrase        frequency
w            whatwhatwhat  3
q            queenbee      1
d            devildonkey   2
n            woohoo        0
e            arrogant      0
Attempted code below just came back as 0 for all frequencies
df["frequency"] = df["phrase"].str.count(str(df["givenletter"]))
I've tried digging through similar stackoverflow questions but they all seem to deal with counting the occurrence of a character that doesn't change. Would be grateful for any advice on how I might correct my code.
Try
df['frequency'] = df.apply(lambda x:x['phrase'].count(x['givenletter']), axis=1)
A list comprehension is sufficient for this:
df["frequency"] = [phrase.count(letter)
                   for phrase, letter
                   in zip(df.phrase, df.givenletter)]
df
givenletter phrase frequency
0 w whatwhatwhat 3
1 q queenbee 1
2 d devildonkey 2
3 n woohoo 0
4 e arrogant 0
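Neither answer says why the original attempt returned 0 everywhere. A likely explanation is that str(df["givenletter"]) turns the whole column into one long text (index, values and dtype line), which never occurs inside any phrase. The sketch below rebuilds the example frame to show this next to the row-wise pairing; the variable name `pattern` is only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'givenletter': ['w', 'q', 'd', 'n', 'e'],
    'phrase': ['whatwhatwhat', 'queenbee', 'devildonkey', 'woohoo', 'arrogant'],
})

# str(df['givenletter']) is one long string describing the whole column,
# not the letter belonging to each row.
pattern = str(df['givenletter'])
print(len(pattern) > 1)

# Pairing the columns row by row gives each phrase its own letter.
df['frequency'] = [p.count(c) for p, c in zip(df['phrase'], df['givenletter'])]
print(df['frequency'].tolist())
```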

get a random item from a group of rows in a xlsx file in python

I have a xlsx file, for example:
A B C D E F G
1 5 2 7 0 1 8
3 4 0 7 8 5 9
4 2 9 7 0 6 2
1 6 3 2 8 8 0
4 3 5 2 5 7 9
5 2 3 2 6 9 1
These are my values (they are actually in an Excel file).
I need to get random rows from it, grouped by the values in column D.
You can note that column D has values that are 7 and values that are 2.
I need to get 1 random row of all the rows that have 7 on column D and 1 random row of all the rows that have 2 on column D.
And put the results on another xlsx file.
My expected output needs to be the content of line 0, 1 or 2 and the content of line 3, 4 or 5.
Can someone help me with that?
Thanks!
I've written code to do that. The code below assumes that the Excel file is named test.xlsx and is in the folder where you run your code. It samples NrandomLines rows for each unique value in column D and prints them out.
import pandas as pd
import numpy as np
import random
df = pd.read_excel('test.xlsx') # read the excel file
vals = df.D.unique() # all unique values in column D, in your case only 7 and 2
idx = []
N = []
for i in vals: # loop over unique values in column D
    locs = (df.D==i).values.nonzero()[0]
    idx = idx + [locs] # save row index of every unique value in column D
    N = N + [len(locs)] # save how many rows contain specific value in D
NrandomLines = 1 # how many random samples you want
for i in np.arange(len(vals)): # loop over unique values of D
    for k in np.arange(NrandomLines): # loop how many random samples you want
        randomRow = random.randint(0,N[i]-1) # create random sample
        print(df.iloc[idx[i][randomRow],:]) # print out random row
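A shorter route to the same result is sketched below, assuming pandas 1.1 or newer, where grouped frames have a sample method. The frame is typed in by hand instead of read from test.xlsx so the example is self-contained, and result.xlsx is a hypothetical output name.

```python
import pandas as pd

# In-memory copy of the example sheet (instead of reading test.xlsx).
df = pd.DataFrame({
    'A': [1, 3, 4, 1, 4, 5], 'B': [5, 4, 2, 6, 3, 2], 'C': [2, 0, 9, 3, 5, 3],
    'D': [7, 7, 7, 2, 2, 2], 'E': [0, 8, 0, 8, 5, 6], 'F': [1, 5, 6, 8, 7, 9],
    'G': [8, 9, 2, 0, 9, 1],
})

# One random row per distinct value of column D.
sampled = df.groupby('D').sample(n=1)
print(sampled)

# Writing the result needs an Excel engine such as openpyxl installed;
# 'result.xlsx' is a hypothetical output name.
# sampled.to_excel('result.xlsx', index=False)
```

Changing n gives more rows per group, and passing random_state makes the draw repeatable.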
With OpenPyXl, you can use Worksheet.iter_rows to iterate the worksheet rows.
You can use itertools.groupby to group the row according to the "D" column values.
To do that, you can create a small function to pick-up this value in a row:
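def get_d(row):
    return row[3].value
Then, you can use random.choice to choose a row randomly.
Putting it all together, you can have:
import itertools
import random
def get_d(row):
    return row[3].value
# rows comes from Worksheet.iter_rows; groupby only groups adjacent rows,
# so the rows must already be ordered by column D (as in the example)
for key, group in itertools.groupby(rows, key=get_d):
    row = random.choice(list(group))
    print(row)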
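One caveat with itertools.groupby is that it only merges neighbouring items, so unsorted data needs a sort on the same key first. The sketch below shows this with plain tuples standing in for worksheet rows; the sample data is made up and get_d reads the cell directly instead of using .value.

```python
import itertools
import random

# Plain tuples stand in for worksheet rows (hypothetical sample data).
rows = [
    (1, 5, 2, 7, 0, 1, 8),
    (1, 6, 3, 2, 8, 8, 0),
    (3, 4, 0, 7, 8, 5, 9),
    (4, 3, 5, 2, 5, 7, 9),
]

def get_d(row):
    return row[3]  # column D is the fourth cell

# groupby only merges adjacent items, so sort by the same key first.
picks = {}
for key, group in itertools.groupby(sorted(rows, key=get_d), key=get_d):
    picks[key] = random.choice(list(group))
print(picks)
```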
