I looked around for a while and didn't find anything that matched what I was doing.
I have this code:
import csv
import datetime

legdistrict = []

reader = csv.DictReader(open('active.txt', 'rb'), delimiter='\t')
for row in reader:
    if '27' in row['LegislativeDistrict']:
        legdistrict.append(row)

ages = []
for i, value in enumerate(legdistrict):
    dates = datetime.datetime.now() - datetime.datetime.strptime(value['Birthdate'], '%m/%d/%Y')
    ages.append(int(datetime.timedelta.total_seconds(dates) / 31556952))

total_values = len(ages)
total = sum(ages) / total_values

print total_values
print sum(ages)
print total
which searches a tab-delimited text file and finds the rows whose LegislativeDistrict column contains the string 27 (so, finding all rows that are in the 27th LD). It works well, but I run into issues if the string is a single-digit number.
When I run the code with 27, I get this result:
0 ;) eric@crunchbang ~/sbdmn/May 2014 $ python data.py
74741
3613841
48
Which means there are 74,741 values that contain 27, with combined ages of 3,613,841, and an average age of 48.
But when I run the code with 4 I get this result:
0 ;) eric@crunchbang ~/sbdmn/May 2014 $ python data.py
1177818
58234407
49
The first result (1,177,818) is much too large. There are no LDs in my state with over 170,000 people, and my lists deal with voters only.
Because of this, I'm assuming that searching for 4 finds all the values that have a 4 anywhere in them... so 14, 41, and 24 would all be counted, thus causing the huge number.
Is there a way I can search for a value in a specific column and use a regex or exact search? Regex works, but I can't get it to search just one column -- it searches the entire text file.
My data looks like this:
StateVoterID CountyVoterID Title FName MName LName NameSuffix Birthdate Gender RegStNum RegStFrac RegStName RegStType RegUnitType RegStPreDirection RegStPostDirection RegUnitNum RegCity RegState RegZipCode CountyCode PrecinctCode PrecinctPart LegislativeDistrict CongressionalDistrict Mail1 Mail2 Mail3 Mail4 MailCity MailZip MailState MailCountry Registrationdate AbsenteeType LastVoted StatusCode
IDNUMBER OTHERIDNUMBER NAME MI 01/01/1900 M 123 FIRST ST W CITY STATE ZIP MM 123 4 AGE 5 01/01/1950 N 01/01/2000 B
'4' in '400' will return True because in does a substring check. Use '4' == '400' instead, which will only return True if the two strings are identical:
if '4' == row['LegislativeDistrict']:
    (...)
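If you still want the option of a regex, the same idea works as long as you apply the pattern to that one column only rather than to the whole line. A minimal sketch against the reader loop from the question (the file name, column name, and the anchored pattern for district 4 are taken from the question's example, so adjust as needed):

import csv
import re

exact = re.compile(r'^4$')   # anchored pattern, equivalent to an exact match on '4'

legdistrict = []
reader = csv.DictReader(open('active.txt', 'rb'), delimiter='\t')
for row in reader:
    district = row['LegislativeDistrict'].strip()
    if district == '4':            # exact comparison on this one column
        legdistrict.append(row)
    # or, with the regex applied to the same single column:
    # if exact.match(district):
    #     legdistrict.append(row)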
I'm working with a very old program that outputs the results for a batch query in a very odd format (at least for me).
Imagine having queried info for the objects A, B and C.
The output will look like this:
name : A
------
Group 1
p1 : 11
p2 : 12
Group 2
p3 : 23
p4 : 24
======
name : B
------
Group 1
p1 : 11
p2 : 12
Group 2
p3 : 23
p4 : 24
======
name : C
------
Group 1
p1 : 11
p2 : 12
Group 2
p3 : 23
p4 : 24
Do you have any idea of how to put the data in a more useful format?
A possible good format would be a table with columns A B C and rows p1, p2...
I had a few ideas but I don't really know how to implement them:
Every object is separated by a ====== string, which means I can use that to split the output into many .txt files.
Then I can read the files with Excel, setting : as the separator, obtaining a CSV file with 2 columns (one containing the p descriptors and one with the actual values).
Now I need to merge all the CSV files into one single CSV with as many columns as objects and as many rows as p descriptors.
I'd like to do this in Python, but I don't know of any package for this situation. Also, there are a few hundred objects, so I need an automated way of doing it.
Any tip, advice or idea you can think of is welcome.
Here's a quick solution that puts the data you say you need - not all the labels - in a CSV file. Each output line starts with the name (A/B/C) and then the values p1..x follow.
It has no handling of missing values, so in that case only the present values will be listed, and column 5 will not always be p4. It relies on the assumption that a name line starts every item/entry, and that every other a : b line has a value b to be stored. This should be a good start if you need to put the data into another structure. The format is truly special, more of a report structure, so I'd guess there's no suitable general-purpose library. Flat format is another similarly tricky old format type for which there are libraries - I've used it when calculating how much money each Swedish participant in the Interrail programme should receive. Tricky business but fun! :-)
The code:
import re
import csv

with open('input.txt') as f:
    lines = f.readlines()

entries = []
entry = []

for line in lines:
    parts = re.split(r':', line)
    if len(parts) >= 2:
        label = parts[0]
        value = parts[1].strip()
        if label.startswith('name'):
            print('got name: ' + value)
            # start a new entry with the name as its first value
            entry = [value]
            entries.append(entry)
        else:
            print('got value: ' + value)
            entry.append(value)

print('collected {} entries'.format(len(entries)))

with open('output.csv', 'w', newline='') as output:
    wr = csv.writer(output, quoting=csv.QUOTE_ALL)
    wr.writerows(entries)
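If you then want the table oriented the way you described - one column per object A/B/C and one row per p descriptor - a minimal sketch that transposes the entries list collected above, continuing the script. It assumes every entry has the same number of values and that the row labels are name, p1..p4 (hypothetical labels, adjust to your data):

labels = ['name', 'p1', 'p2', 'p3', 'p4']   # assumed row labels

with open('output_transposed.csv', 'w', newline='') as output:
    wr = csv.writer(output, quoting=csv.QUOTE_ALL)
    # zip(*entries) turns the per-object rows into per-label rows
    for label, row in zip(labels, zip(*entries)):
        wr.writerow([label] + list(row))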
I have been working on a Python project analyzing a CSV file and cannot get the output to show sums alongside my strings; instead I just get the lists of numbers that should have been summed.
Code I'm working with:
import pandas as pd
data = pd.read_csv('XML_projectB.csv')
#inserted column headers since the raw data doesn't have any
data.columns = ['name','email','category','amount','date']
data['date'] = pd.to_datetime(data['date'])
# Calculate the total budget by category
category_wise = data.groupby('category').agg({'amount':['sum']})
category_wise.reset_index(inplace=True)
category_wise.columns = ['category','total_budget']
#Determine which budget category people spent the most money in
max_budget = category_wise[category_wise['total_budget']==max(category_wise['total_budget'])]['category'].to_list()
#Tally the total amounts for each year-month (e.g., 2017-05)
months_wise = data.groupby([data.date.dt.year, data.date.dt.month])['amount'].sum()
months_wise = pd.DataFrame(months_wise)
months_wise.index.names = ['year','month']
months_wise.reset_index(inplace=True)
#Determine which person(s) spent the most money on a single item.
person = data[data['amount'] == max(data['amount'])]['name'].to_list()
#Tells user in Shell that text file is ready
print("Check your folder!")
#Get all this info into a text file
tfile = open('output.txt','a')
tfile.write(category_wise.to_string())
tfile.write("\n\n")
tfile.write("The type with most budget is " + str(max_budget) + " and the value for the same is " + str(max(category_wise['total_budget'])))
tfile.write("\n\n")
tfile.write(months_wise.to_string())
tfile.write("\n\n")
tfile.write("The person who spent most on a single item is " + str(person) + " and he/she spent " + str(max(data['amount'])))
tfile.close()
The CSV raw data looks like this (there are almost 1000 lines of it):
Walker Gore,wgore8i@irs.gov,Music,$77.98,2017-08-25
Catriona Driussi,cdriussi8j@github.com,Garden,$50.35,2016-12-23
Barbara-anne Cawsey,bcawsey8k@tripod.com,Health,$75.38,2016-10-16
Henryetta Hillett,hhillett8l@pagesperso-orange.fr,Electronics,$59.52,2017-03-20
Boyce Andreou,bandreou8m@walmart.com,Jewelery,$60.77,2016-10-19
My output in the txt file looks like this:
category total_budget
0 Automotive $53.04$91.99$42.66$1.32$35.07$97.91$92.40$21.28$36.41
1 Baby $93.14$46.59$31.50$34.86$30.99$70.55$86.74$56.63$84.65
2 Beauty $28.67$97.95$4.64$5.25$96.53$50.25$85.42$24.77$64.74
3 Books $4.03$17.68$14.21$43.43$98.17$23.96$6.81$58.33$30.80
4 Clothing $64.07$19.29$27.23$19.78$70.50$8.81$39.36$52.80$80.90
year month amount
0 2016 9 $97.95$67.81$80.64
1 2016 10 $93.14$6.08$77.51$58.15$28.31$2.24$12.83$52.22$48.72
2 2016 11 $55.22$95.00$34.86$40.14$70.13$24.82$63.81$56.83
3 2016 12 $13.32$10.93$5.95$12.41$45.65$86.69$31.26$81.53
I want the total_budget column to be the sum of the values for each category, not the concatenated individual values you see here. It's the same problem for months_wise: it gives me the individual values, not the sums.
I tried the {} .format in the write lines, .apply(str), .format on its own, and just about every other Python permutation of the conversion to string from a list I could think of, but I'm stumped.
What am I missing here?
As @Barmar said, the source has values like $77.98, so pandas reads the amount column as strings rather than numbers, and summing strings concatenates them. You need to strip the $ and parse the values as floats before doing the groupby.
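A minimal sketch of that conversion, assuming the column names from your code (put it right after the read_csv and column-renaming lines, before any groupby):

data['amount'] = data['amount'].str.lstrip('$').astype(float)   # '$77.98' -> 77.98

After that, category_wise and months_wise will hold real numeric totals, and max(category_wise['total_budget']) will pick the largest amount instead of the lexicographically largest string.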
I want an input string to match a string in the file, which has fixed rows, and then I will subtract from the score column of that row.
1!! == I think this is a for loop to find the matching string line by line, from first to last
2!! == this is for when the input string has matched; it will reduce the score of the matched row by 1.
CSV file:
article = pd.read_csv('Customer_List.txt', delimiter = ',', names = ['ID','NAME','LASTNAME','SCORE','TEL','PASS'])
y = len(article.ID)
line = article.readlines()
for x in range(0, y):  # 1!!
    if word in line:
        newarticle = int(article.SCORE[x]) - 1  # 2!!
        print(newarticle)
    else:
        x = x + 1
P.S. I have only been studying Python for 5 days, so please give me a suggestion. Thank you.
Since I see you are using pandas, I will give a solution without any loops, as it is much easier.
You have, for example:
import pandas as pd

df = pd.DataFrame()
df['ID'] = [216, 217]
df['NAME'] = ['Chatchai', 'Bigm']
df['LASTNAME'] = ['Karuna', 'Koratuboy']
df['SCORE'] = [25, 15]
You need to do:
lookfor = str(input("Enter the name: "))
df.loc[df.NAME == lookfor, 'SCORE'] -= 1
What happens in the lines above is: you look for the entered name in the NAME column of your dataframe and reduce the score by 1 wherever there is a match, which is what you want if I understand your question.
Example:
Now, let's say you are looking for a person called Alex by name; since there is no such person, you get the same dataframe back.
Enter the name: Alex
ID NAME LASTNAME SCORE
0 216 Chatchai Karuna 25
1 217 Bigm Koratuboy 15
Now, let's say you are looking for a person called Chatchai by name; since there is a match and you want the score reduced, you will get:
Enter the name: Chatchai
ID NAME LASTNAME SCORE
0 216 Chatchai Karuna 24
1 217 Bigm Koratuboy 15
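If you then want the updated scores written back to the file, a minimal sketch applied to the article dataframe from your own read_csv call, assuming the same file name and the header-less, comma-separated layout implied by names=:

article.to_csv('Customer_List.txt', index=False, header=False)   # overwrite with the updated SCORE column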
I receive data that looks like the below:
0,JW111101,Run Name
0,111116,Date
0,+2.5,Increment
0,2=0,Start Station
0,1=Fri 11 Nov 2016 14:21:44,Date & Time Stamp
0,6=1 Off 4On,Cycle Times
0,6=Fluke 189B,Meter Type
0,6=Racal Landstar,GPS Reciever
0,1=Fri 11 Nov 2016 14:21:47,Date & Time Stamp
0,6=COMPANY NAME,142156.00,29.0638153,95.3436157,-1.2
0,6=LINE NAME,142156.00,29.0638153,95.3436157,-1.2
1,6=Test Station,142255.00,29.0638145,95.3436133,-0.9
1,6=-1559 On NG,142255.00,29.0638145,95.3436133,-0.9
1,6=-1169 Off NG,142255.00,29.0638145,95.3436133,-0.9
1,6=1Approx.,142255.00,29.0638145,95.3436133,-0.9
1,6=AC 0.735,142255.00,29.0638145,95.3436133,-0.9
1,1558,GPS Not Available
1,1460,142350.00,29.0638166,95.3436115,-0.9
1,1185z,142351.00,29.0638167,95.3436116,-0.9
1,1554,142352.00,29.0638166,95.3436116,-0.9
I would like to find the smallest and largest numbers in the 3rd column. The third column is actually a UTC timestamp.
My eventual goal is to be able to figure out when they start, when they end, and their duration.
Can anyone point me in the right direction?
Assuming your data is in a file called random.csv, here's how you can read the data in.
import csv

list_of_stuff = []
with open('random.csv') as readcsv:
    row_object = csv.reader(readcsv)
    for value in row_object:
        list_of_stuff.append(value)

print list_of_stuff
Now you have a list of lists where each sublist is a line of your file. Here's the documentation on reading in csv files: https://docs.python.org/2/library/csv.html
Your third column includes strings and integers that can't all be converted to timestamps, so I can't help you with that part. Let's assume the third column was all numbers, though. Then you could use this in addition to my code above:
# convert to numbers so max/min/range compare numerically, not as strings
third_column = [float(row[2]) for row in list_of_stuff]

print "max: ", max(third_column)
print "min: ", min(third_column)
print "range: ", max(third_column) - min(third_column)
Hello, I am looking for some help to do something like an INDEX/MATCH in Excel. I am very new to Python, but my data sets are now far too large for Excel.
I will dumb my question down as much as possible, because the data contains a lot of information that is irrelevant to this problem.
CSV A (has 3 Basic columns)
Name, Date, Value
CSV B (has 2 columns)
Value, Score
CSV C (I want to create this using python; 2 columns)
Name, Score
All I want to do is enter a date, have it look up all rows in CSV A that match that date, then look up in CSV B the score associated with the value from that row of CSV A, and return it in CSV C along with the name of the person. Rinse and repeat through every row.
Any help is much appreciated; I don't seem to be getting very far.
Here is a working script using Python's csv module:
It prompts the user to input a date (format is m-d-yy), then reads csvA row by row to check if the date in each row matches the inputted date.
If yes, it checks if the value that corresponds to the date from the current row of A matches any of the rows in csvB.
If there are matches, it will write the name from csvA and the score from csvB to csvC.
import csv

date = input('Enter date: ').strip()

A = csv.reader(open('csvA.csv', newline=''), delimiter=',')

matches = 0

# reads each row of csvA
for row_of_A in A:
    # removes whitespace before and after each string in the row of csvA
    row_of_A = [string.strip() for string in row_of_A]
    # if the date of the row in csvA equals the inputted date
    if row_of_A[1] == date:
        B = csv.reader(open('csvB.csv', newline=''), delimiter=',')
        # reads each row of csvB
        for row_of_B in B:
            # removes whitespace before and after each string in the row of csvB
            row_of_B = [string.strip() for string in row_of_B]
            # if the value of the row in csvA equals the value of the row in csvB
            if row_of_A[2] == row_of_B[0]:
                # if csvC.csv does not exist yet, create it and write the header
                try:
                    open('csvC.csv', 'r')
                except:
                    C = open('csvC.csv', 'a')
                    print('Name,', 'Score', file=C)
                C = open('csvC.csv', 'a')
                # writes the name from csvA and the score from csvB to csvC
                print(row_of_A[0] + ', ' + row_of_B[1], file=C)
                matches += 1

m = 'matches' if matches > 1 else 'match'
print('Found', matches, m)
Sample csv files:
csvA.csv
Name, Date, Value
John, 2-6-15, 10
Ray, 3-5-14, 25
Khay, 4-4-12, 30
Jake, 2-6-15, 100
csvB.csv
Value, Score
10, 500
25, 200
30, 300
100, 250
Sample Run:
>>> Enter date: 2-6-15
Found 2 matches
csvC.csv (generated by script)
Name, Score
John, 500
Jake, 250
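Since you mention your data sets are too large for Excel, here is an alternative sketch of the same lookup using a pandas merge instead of nested loops; it assumes the same file names and the Name/Date/Value and Value/Score headers shown above:

import pandas as pd

date = input('Enter date: ').strip()

a = pd.read_csv('csvA.csv', skipinitialspace=True)
b = pd.read_csv('csvB.csv', skipinitialspace=True)

# keep only the csvA rows for the requested date, then join to csvB on Value
matched = a[a['Date'] == date].merge(b, on='Value')
matched[['Name', 'Score']].to_csv('csvC.csv', index=False)
print('Found {} match(es)'.format(len(matched)))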
If you are using Unix, you can do this with the shell script below.
I am also assuming that you are appending the search output to file_C and that there are no duplicates in either source file.
while true
do
    echo "enter date ..."
    read date
    value_one=$(grep "$date" file_A | cut -d',' -f1)
    tmp1=$(grep "$date" file_A | cut -d',' -f3)
    value_two=$(grep "$tmp1" file_B | cut -d',' -f2)
    echo "${value_one},${value_two}" >> file_c
    echo "want to search more dates ... press y|Y, press any other key to exit"
    read ch
    if [ "$ch" = "y" ] || [ "$ch" = "Y" ]
    then
        continue
    else
        exit
    fi
done