Select column from multiple DataFrames based on same header prefix - python

I have a function that iterates over the rows of a CSV, checking the Age column; if an age is negative, it writes the Key and the Age value to a text file.
import pandas as pd

def neg_check():
    results = []
    file_path = input('Enter file path: ')
    file_data = pd.read_csv(file_path, encoding='utf-8')
    for index, row in file_data.iterrows():
        if row['Age'] < 0:
            results.append((row['Key'], row['Age']))
    with open('results.txt', 'w') as outfile:
        outfile.write("\n".join(map(str, results)))
To make this code reusable, how can I modify it so it checks any column whose name starts with "Age"? My files have many columns that start with "Age" but end differently. I tried the following...
if row.startswith['Age'] < 0:
and
if row[row.startswith('Age')] < 0:
but both throw AttributeError: 'Series' object has no attribute 'startswith'.
My csv files:
sample 1
Key Sex Age
1 Male 46
2 Female 34
sample 2
Key Sex AgeLast
1 Male 46
2 Female 34
sample 3
Key Sex AgeFirst
1 Male 46
2 Female 34

I would do this in one step; there are a few options. One is filter:
v = df[df.filter(like='Age').iloc[:, 0] < 0]
Or,
c = df.columns[df.columns.str.startswith('Age')][0]
v = df[df[c] < 0]
Finally, to write the matching rows to a CSV, use:
if not v.empty:
    v.to_csv('invalid.csv')
Looping over your data is not necessary with pandas.
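
Putting it together, a minimal sketch of a reusable version of the original function, assuming each file has a Key column and at least one column whose name starts with "Age" (the prefix argument and the output filename are illustrative):

import pandas as pd

def neg_check(prefix='Age'):
    file_path = input('Enter file path: ')
    df = pd.read_csv(file_path, encoding='utf-8')
    # pick the first column whose name starts with the prefix
    age_col = df.columns[df.columns.str.startswith(prefix)][0]
    invalid = df.loc[df[age_col] < 0, ['Key', age_col]]
    if not invalid.empty:
        invalid.to_csv('invalid.csv', index=False)
    return invalid

Called as neg_check(), this handles Age, AgeLast, and AgeFirst alike, since all three share the prefix.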

Related

How to match input data with data in a df, for loop minus

I want an input string to match a string in a file that has a fixed row layout, and then subtract 1 from the SCORE column of that row.
1!! == I think this is the for loop that looks for a matching string, line by line, from first to last.
2!! == this is where, once the input string has matched, it subtracts 1 from the score of the matched row.
CSV file:
article = pd.read_csv('Customer_List.txt', delimiter=',', names=['ID', 'NAME', 'LASTNAME', 'SCORE', 'TEL', 'PASS'])
y = len(article.ID)
line = article.readlines()
for x in range(0, y):  # 1!!
    if word in line:
        newarticle = int(article.SCORE[x]) - 1  # 2!!
        print(newarticle)
    else:
        x = x + 1
P.S. I have only been studying Python for 5 days, so please give me suggestions. Thank you.
Since I see you are using pandas, I will give a solution without any loops, as it is much easier.
You have, for example:
df = pd.DataFrame()
df['ID'] = [216, 217]
df['NAME'] = ['Chatchai', 'Bigm']
df['LASTNAME'] = ['Karuna', 'Koratuboy']
df['SCORE'] = [25, 15]
You need to do:
lookfor = input("Enter the name: ")
df.loc[df.NAME == lookfor, 'SCORE'] -= 1
What happens in the lines above is: you look for the entered name in the NAME column of your dataframe and reduce the score by 1 wherever there is a match, which is what you want, if I understand your question.
Example:
Now, let's say you search the NAME column for a person called Alex; since there is no such person, you get the same dataframe back:
Enter the name: Alex
ID NAME LASTNAME SCORE
0 216 Chatchai Karuna 25
1 217 Bigm Koratuboy 15
Now, let's say you search for a person called Chatchai; since there is a match and you want the score reduced, you will get:
Enter the name: Chatchai
ID NAME LASTNAME SCORE
0 216 Chatchai Karuna 24
1 217 Bigm Koratuboy 15
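
For completeness, a hedged sketch tying this back to the asker's Customer_List.txt, assuming the column layout from the question; writing the result back to the file is an assumption about what the asker wants:

import pandas as pd

article = pd.read_csv('Customer_List.txt', delimiter=',',
                      names=['ID', 'NAME', 'LASTNAME', 'SCORE', 'TEL', 'PASS'])
lookfor = input("Enter the name: ")
article.loc[article.NAME == lookfor, 'SCORE'] -= 1  # subtract 1 only on matching rows
article.to_csv('Customer_List.txt', index=False, header=False)  # persisting the change is assumed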

Text file combining script - Python - Big Data

I was wondering if anyone could help me come up with a better way of doing this.
Basically, I have text files formatted like this (some have more columns, some have fewer; each column is separated by spaces):
AA BB CC DD Col1 Col2 Col3
XX XX XX Total 1234 1234 1234
Aaaa OO0 LAHB TEXT 111 41 99
Aaaa OO0 BLAH XETT 112 35 176
Aaaa OO0 BALH TXET 131 52 133
Aaaa OO0 HALB EXTT 144 32 193
These text files range in size from a few hundred KB to around 100MB for the newest and largest files. What I need to do is combine two or more files, checking for duplicate data first: if AA, BB, CC and DD from a row match a row from another file, I append that row's data from Col1, Col2, Col3 (etc.) onto the existing row; if not, I fill the new columns with zeros. Then I calculate the top 100 rows based on the total of each row and output those top 100 results to a webpage.
Here is the Python code I'm using:
import operator

def combine(dataFolder, getName, sort):
    files = getName.split(",")
    longestHeader = 0
    longestHeaderFile = []
    dataHeaders = []
    dataHeaderCode = []
    fileNumber = 1
    combinedFile = {}
    # First pass: collect the data headers and find the longest key header
    for fileName in files:
        lines = []
        file = dataFolder + "/tableFile/" + fileName + ".txt"
        with open(file) as f:
            x = 0
            for line in f:
                lines.append(line.upper().split())
                if x == 1:
                    break
                x += 1  # only the first two lines are needed in this pass
        splitLine = lines[1].index("TOTAL") + 1
        dataHeaders.extend(lines[0][splitLine:])
        headerNumber = 1
        for name in lines[0][splitLine:]:
            dataHeaderCode.append(str(fileNumber) + "_" + str(headerNumber))
            headerNumber += 1
        if splitLine > longestHeader:
            longestHeader = splitLine
            longestHeaderFile = lines[0][:splitLine]
        fileNumber += 1
    # Second pass: normalize each row and merge it into combinedFile
    for fileName in files:
        lines = []
        file = dataFolder + "/tableFile/" + fileName + ".txt"
        with open(file) as f:
            for line in f:
                lines.append(line.upper().split())
        splitLine = lines[1].index("TOTAL") + 1
        headers = lines[0][:splitLine]
        data = lines[0][splitLine:]
        for x in range(2, len(lines)):
            normalizedLine = {}
            lineName = ""
            total = 0
            for header in longestHeaderFile:
                try:
                    if header == "EE" or header == "DD":
                        index = splitLine - 1
                    else:
                        index = headers.index(header)
                    normalizedLine[header] = lines[x][index]
                except ValueError:
                    normalizedLine[header] = "XX"
                lineName += normalizedLine[header]
            combinedFile[lineName] = normalizedLine
            for header in dataHeaders:
                headIndex = dataHeaders.index(header)
                name = dataHeaderCode[headIndex]
                try:
                    index = splitLine + data.index(header)
                    value = int(lines[x][index])
                except (ValueError, IndexError):
                    value = 0
                try:
                    value = combinedFile[lineName][header]
                    combinedFile[lineName][name] = int(value)
                except KeyError:
                    combinedFile[lineName][name] = int(value)
                total += int(value)
            combinedFile[lineName]["TOTAL"] = total
    combined = sorted(combinedFile.values(), key=operator.itemgetter(sort), reverse=True)
    return combined
I'm pretty new to Python, so this may not be the most "Pythonic" way of doing it. Anyway, it works, but it's slow (about 12 seconds for two files of about 6MB each), and when we uploaded the code to our AWS server we found we would get a 500 error saying the headers were too large when we tried to combine larger files. Can anyone help me refine this into something a bit quicker and more suited to a web environment? Also, just to clarify, I don't have access to the AWS server or its settings; that goes through our Lead Developer, so I have no actual clue how it's set up. I do most of my dev work through localhost and then commit to GitHub.
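
No answer is recorded here, but one hedged sketch of a faster approach: assuming the first four columns (AA BB CC DD) are the row key and the files are whitespace-delimited with a single header line, pandas can do the duplicate-key matching, zero-filling, and top-100 selection without hand-written loops (the function name and file handling are illustrative):

import pandas as pd

def combine_files(paths):
    frames = []
    for i, path in enumerate(paths):
        df = pd.read_csv(path, delim_whitespace=True)
        # use the first four columns as the row key, then tag the
        # remaining data columns with the file number
        df = df.set_index(list(df.columns[:4])).add_suffix('_' + str(i + 1))
        frames.append(df)
    # outer join keeps every key; rows missing from a file get zeros
    combined = pd.concat(frames, axis=1, join='outer').fillna(0)
    combined['TOTAL'] = combined.sum(axis=1)
    return combined.nlargest(100, 'TOTAL')  # top 100 rows by total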

Pandas Parse DataFrame Field and Maintain ID Field

I have a made-up pandas series that I split on a delimiter:
s2 = pd.Series(['2*C316*first_field_name17*second_field_name16*third_field_name2*N311*field value1*Y5*hello2*O30*0*0*'])
split = s2.str.split('*')
The general logic to parse this string:
Asterisks are the delimiter
Numbers immediately before asterisks identify the length of the following block
Three indicators
C indicates field names will follow
N indicates new field values will follow
O indicates old field values will follow
Numbers immediately after indicators (tricky, because they sit right next to the numbers that precede asterisks) identify how many field names or values will follow
The parsing logic and code works on a single pandas series. Therefore, it is less important to understand that than it is to understand applying the logic/code to a dataframe.
I calculate the number of fields in the string (in this case, the 3 in the second block, C316):
number_of_fields = int(split[0][1][1:int(split[0][0])])
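To make the slicing concrete, the values for the sample string are:

split[0][0]  # '2'    -> length of the leading count block
split[0][1]  # 'C316' -> indicator C, field count 3, next block length 16
int(split[0][1][1:int(split[0][0])])  # int('C316'[1:2]) == 3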
I apply a lot of list splitting to extract the results I need into three separate lists (field names, new values, and old values):
i = 2
string_length = int(split[0][1][int(split[0][0]):])
field_names_list = []
while i < number_of_fields + 2:
    field_name = split[0][i][0:string_length]
    field_names_list.append(field_name)
    string_length = int(split[0][i][string_length:])
    i += 1

i = 3 + number_of_fields
string_length = int(split[0][2 + number_of_fields][string_length:])
new_values_list = []
while i < 3 + number_of_fields * 2:
    field_name = split[0][i][0:string_length]
    new_values_list.append(field_name)
    string_length = int(split[0][i][string_length:])
    i += 1

i = 4 + number_of_fields * 2
string_length = int(split[0][3 + number_of_fields * 2][string_length:])
old_values_list = []
while i <= 3 + number_of_fields * 3:
    old_value = split[0][i][0:string_length]
    old_values_list.append(old_value)
    if i == 3 + number_of_fields * 3:
        string_length = 0
    else:
        string_length = int(split[0][i][string_length:])
    i += 1
I combine the lists into a df with three columns:
df = pd.DataFrame(
{'field_name': field_names_list,
'new_value': new_values_list,
'old_value': old_values_list
})
field_name new_value old_value
0 first_field_name field value
1 second_field_name Y
2 third_field_name hello
How would I apply this same process to a df with multiple strings? The df would look like this:
row_id string
0 24 2*C316*first_field_name17*second_field_name16*third_field_name2*N311*field value1*Y5*hello2*O30*0*0*
1 25 2*C316*first_field_name17*second_field_name16*third_field_name2*N311*field value1*Y5*hello2*O30*0*0*
I'm unsure how to maintain the row_id with the eventual columns. The end result should look like this:
row_id field_name new_value old_value
0 24 first_field_name field value
1 24 second_field_name Y
2 24 third_field_name hello
3 25 first_field_name field value
4 25 second_field_name Y
5 25 third_field_name hello
I know I can concatenate multiple dataframes, but that would come after maintaining the row_id. How do I keep the row_id with the corresponding values after a series of list slicing operations?
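
No answer is recorded here, but one hedged sketch: wrap the slicing logic above in a helper (parse_string is an illustrative name), call it once per row, and attach the row_id to each parsed frame before concatenating. This assumes the input df has the row_id and string columns shown above:

import pandas as pd

def parse_string(s):
    # same slicing logic as the single-series version above
    tokens = s.split('*')
    n = int(tokens[1][1:int(tokens[0])])      # number of fields (the 3 in C316)
    length = int(tokens[1][int(tokens[0]):])  # length of the next block
    names, new_vals, old_vals = [], [], []
    i = 2
    while i < n + 2:
        names.append(tokens[i][:length])
        length = int(tokens[i][length:])
        i += 1
    i = 3 + n
    length = int(tokens[2 + n][length:])
    while i < 3 + n * 2:
        new_vals.append(tokens[i][:length])
        length = int(tokens[i][length:])
        i += 1
    i = 4 + n * 2
    length = int(tokens[3 + n * 2][length:])
    while i <= 3 + n * 3:
        old_vals.append(tokens[i][:length])
        length = 0 if i == 3 + n * 3 else int(tokens[i][length:])
        i += 1
    return names, new_vals, old_vals

frames = []
for row_id, string in zip(df['row_id'], df['string']):
    names, new_vals, old_vals = parse_string(string)
    part = pd.DataFrame({'field_name': names,
                         'new_value': new_vals,
                         'old_value': old_vals})
    part.insert(0, 'row_id', row_id)  # keep the id with every parsed row
    frames.append(part)

result = pd.concat(frames, ignore_index=True)

Because each per-row frame carries its own row_id column, the ids survive the list slicing, and the final concat produces exactly the layout shown above.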

Python - how to print multiple rows with a common word from a csv file?

import csv

whichgender = input("Would you like to view male or female students? ")
if whichgender == "Male":
    with open('classinfo.csv', 'r') as classinfoReport:
        classinfoReaders = csv.reader(classinfoReport)
        for row in classinfoReaders:
            for field in row:
                if field == whichgender:
                    print(row)
I am trying to print every row from my csv file that contains the word 'Male'. This code works, but it only prints the first row it finds with the word 'Male' in it. There are 13 rows in my file with 'Male' in them, and I want to print them all. How do I do that?
I suggest you use pandas to simplify the problem.
import pandas as pd

df = pd.read_csv('classinfo.csv', header=None)
print(df[df[<index of the gender string here>] == 'Male'])
I wrote a dummy CSV file with the same filename as yours, classinfo.csv:
Adam,Male,25
Milo,Male,34
Mikka,Female,20
Samantha,Female,19
John,Male,21
Since the gender index is 1:
import pandas as pd
df = pd.read_csv('classinfo.csv', header=None)
print(df[df[1] == 'Male'])
The result when run:
0 1 2
0 Adam Male 25
1 Milo Male 34
4 John Male 21
Or you can change your code as follows:
whichgender = input("Would you like to view male or female students? ")
if whichgender == "Male":
    with open('classinfo.csv', 'r') as classinfoReport:
        classinfoReaders = csv.reader(classinfoReport)
        for row in classinfoReaders:
            if 'Male' in row:
                print(row)
My suggestion is also to use pandas.
This is what you need:
whichgender = input("Would you like to view male or female students? ")
if whichgender == "Male":
    with open('classinfo.csv', 'r') as classinfoReport:
        classinfoReaders = csv.reader(classinfoReport)
        for row in classinfoReaders:
            for field in row:
                if whichgender in field:
                    print(row)
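
A small hedged variant, assuming the gender sits in the second column (index 1) as in the dummy file above; it drops the hard-coded "Male" so either input works:

import csv

whichgender = input("Would you like to view male or female students? ")
with open('classinfo.csv', 'r') as classinfoReport:
    for row in csv.reader(classinfoReport):
        if row[1] == whichgender:  # column 1 holds the gender
            print(row)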

enumerate append error while creating list from csv

I'm stuck in the process of creating lists of columns. I am trying to avoid using defaultdict.
Thanks for any help!
Here is my code:
import csv

# Read CSV file
with open('input.csv', 'r') as csvfile:
    reader = csv.reader(csvfile)
    row_list = []
    column_list = []
    year = []
    suburb = []
    for each in reader:
        row_list = row_list + [each]
        year = year + [each[0]]      # create list of years
        suburb = suburb + [each[2]]  # create list of suburbs
        for (i, v) in enumerate(each[3:-1]):
            column_list[i].append(v)
            # print i, v
    # print column_list[0]
My error message:
19 suburb = suburb + [each[2]]#create list of suburb
20 for i,v in enumerate(each[3:-1]):
---> 21 column_list[i].append(v)
22 #print i,v
23 #print column_list[0]
IndexError: list index out of range
printed result of (i,v):
0 10027
1 14513
2 3896
3 23362
4 77966
5 5817
6 24699
7 9805
8 62692
9 33466
10 38792
0 0
1 122
2 0
3
4 137
5 0
6 0
7
8
9 77
10
Basically, I want the lists to look like this:
column[0]=['10027','0']
column[1]=['14513','122']
A sample of my csv file was attached as an image (not reproduced here).
Yes, as Alex mentioned, the problem is indeed due to trying to access an index before creating/initializing it. As an alternative solution, you can also consider this:
for (i, v) in enumerate(each[3:-1]):
    if len(column_list) < i + 1:
        column_list.append([])  # grow the list of columns on demand
    column_list[i].append(v)
Hope it helps!
The error happens because column_list is empty, so column_list[i] doesn't exist. It doesn't matter that you want to append to it: you can't append to something nonexistent, and appending doesn't create it from scratch.
column_list = defaultdict(list) would indeed solve this, but since you don't want that, the simplest fix is to make sure column_list starts out with enough empty lists to append to. Like this:
column_list = [[] for _ in range(size)]
where size is the number of columns, the length of each[3:-1], which is apparently 11 according to your output.
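
Putting that together with the original loop, a minimal sketch assuming 11 data columns (the length of each[3:-1], per the printed output) and the same input.csv:

import csv

size = 11  # length of each[3:-1], per the printed output
column_list = [[] for _ in range(size)]

with open('input.csv', 'r') as csvfile:
    for each in csv.reader(csvfile):
        for i, v in enumerate(each[3:-1]):
            column_list[i].append(v)

# column_list[0] is now e.g. ['10027', '0'] and column_list[1] ['14513', '122']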
