Python csv search script - python

I wish to write a Python script which reads from a csv. The csv consists of 2 columns. I want the script to read through the first column row by row and find the corresponding value in the second column. If it finds the value in the second column I want it to input a value into a third column.
example of output
Any help with this would be much appreciated and I hope my aim is clear. Apologies in advance if it is too vague.

This script reads test.csv, parses it, and writes the result to OUTPUT.txt:
f = open("test.csv","r")
d={}   # first value seen for each ID
s={}   # later values for the same ID, each prefixed with ';'
for line in f:
    l=line.split(",")
    if not l[0] in d:
        d[l[0]]=l[1].rstrip()
        s[l[0]]=''
    else:
        s[l[0]]+=str(";")+str(l[1].rstrip())
w=open("OUTPUT.txt","w")
# header row, then one fixed-width row per ID
w.write("%-10s %-10s %-10s\r\n" % ("ID","PARENTID","Atachment"))
for i in d.keys():
    w.write("%-10s %-10s %-10s\r\n" % (i,d[i],s[i]))
f.close()
w.close()
example:
input:
1,123
2,456
1,333
3,
1,asas
1,333
000001,sasa
1,ss
1023265,333
0221212,
000001,sasa2
000001,sas4
OUTPUT:
ID PARENTID Atachment
000001 sasa ;sasa2;sas4
1023265 333
1 123 ;333;asas;333;ss
3
2 456
0221212
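As a rough alternative sketch, the same grouping can be done with the standard csv module, which also copes with quoted fields. The function name group_by_id, the in-memory sample, and the ATTACHMENT column name are assumptions made for illustration; they are not part of the script above.

```python
import csv
import io


def group_by_id(rows):
    """Map each ID to (first value, list of later values), in first-seen order."""
    grouped = {}
    for row in rows:
        if not row:
            continue  # skip blank lines
        key = row[0]
        value = row[1].strip() if len(row) > 1 else ""
        if key not in grouped:
            grouped[key] = (value, [])
        else:
            grouped[key][1].append(value)
    return grouped


# Hypothetical in-memory sample standing in for test.csv.
sample = "1,123\n2,456\n1,333\n3,\n1,asas\n"
grouped = group_by_id(csv.reader(io.StringIO(sample)))

# Write ID, first value, and the remaining values joined with ';'.
out = io.StringIO()
writer = csv.writer(out, lineterminator="\n")
writer.writerow(["ID", "PARENTID", "ATTACHMENT"])
for key, (first, rest) in grouped.items():
    writer.writerow([key, first, ";".join(rest)])
result = out.getvalue()
print(result)
```

To use real files, the io.StringIO objects would be replaced by open("test.csv", newline="") and an output file opened for writing.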

Related

Data Scraping from txt file with consistent structure

I'm working with a very old program that outputs the results for a batch query in a very odd format (at least for me).
Imagine having queried info for the objects A, B and C.
The output will look like this:
name : A
------
Group 1
p1 : 11
p2 : 12
Group 2
p3 : 23
p4 : 24
======
name : B
------
Group 1
p1 : 11
p2 : 12
Group 2
p3 : 23
p4 : 24
======
name : C
------
Group 1
p1 : 11
p2 : 12
Group 2
p3 : 23
p4 : 24
Do you have any idea of how to put the data in a more useful format?
A possible good format would be a table with columns A B C and rows p1, p2...
I had few ideas but I don't really know how to implement those:
Every object is separated by a ====== string, which means I can use this to split the output into many .txt files
Then I can read the files with Excel, setting : as the separator, to obtain a csv file with 2 columns (one containing the p descriptors and one with the actual values)
Now I need to merge all the csv files into one single csv with as many columns as objects and px rows
I'd like to do this in Python but I really don't know any package for this situation. Also, there are a few hundred objects, so I need an automated way of doing this.
Any tip, advice or idea you can think of is welcome.
Here's a quick solution that puts the data you say you need (not all labels) into a csv file. Each output line starts with the name (A/B/C), followed by the values p1..x.
It does not handle missing values: only the values that are present are written, so column 5 will not always be p4. It assumes that every entry starts with a name line, and that every other a:b line has a value b to be stored. This should be a good starting point if you need to put the data into another structure. The format is truly special, more of a report structure, so I'd guess there's no suitable general-purpose lib. Flat format is another similarly tricky old format type for which there are libraries; I've used it when calculating how much money each Swedish participant in the Interrail program should receive. Tricky business but fun! :-)
The code:
import re
import csv

# the with block closes the file, so no explicit f.close() is needed
with open('input.txt') as f:
    lines = f.readlines()

entries = []
entry = []
for line in lines:
    parts = re.split(r':', line)
    # only "label : value" lines are kept; group headings and separators are skipped
    if len(parts) >= 2:
        label = parts[0]
        value = parts[1].strip()
        if label.startswith('name'):
            print('got name: ' + value)
            # start new entry with the name as first value
            entry = [value]
            entries.append(entry)
        else:
            print('got value: ' + value)
            entry.append(value)
print('collected {} entries'.format(len(entries)))
with open('output.csv', 'w', newline='') as output:
    wr = csv.writer(output, quoting=csv.QUOTE_ALL)
    wr.writerows(entries)
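The question asked for a table with one column per object and one row per p label. A minimal sketch of that layout follows, assuming labels are unique within an object; the helper names parse_report and pivot and the sample values are hypothetical. Unlike the script above, it keeps the labels, so a missing value becomes an empty cell instead of shifting the columns.

```python
import csv
import io


def parse_report(lines):
    """Return {object name: {label: value}} from 'label : value' lines."""
    objects = {}
    current = None
    for line in lines:
        label, sep, value = line.partition(":")
        if not sep:
            continue  # group headings and ------ / ====== separators
        label, value = label.strip(), value.strip()
        if label == "name":
            current = objects.setdefault(value, {})
        elif current is not None:
            current[label] = value
    return objects


def pivot(objects):
    """Build rows with one column per object and one row per label."""
    names = list(objects)
    labels = []
    for values in objects.values():
        for label in values:
            if label not in labels:
                labels.append(label)
    rows = [["label"] + names]
    for label in labels:
        rows.append([label] + [objects[n].get(label, "") for n in names])
    return rows


# Hypothetical sample: B lacks p2 and A lacks p3 to show the empty cells.
sample = """name : A
------
Group 1
p1 : 11
p2 : 12
======
name : B
------
Group 1
p1 : 21
Group 2
p3 : 23
"""
table = pivot(parse_report(sample.splitlines()))

out = io.StringIO()
csv.writer(out, lineterminator="\n").writerows(table)
print(out.getvalue())
```

For a few hundred objects, the same two functions can be fed the lines of the real report file, and the rows written to a file opened with newline=''.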

Looping through a pandas dataframe - how to make code run faster?

I have a dataframe, df, with 43244 rows, and a txt file, text with 1107957 lines. The purpose of the following code is to evaluate entries in df, and return a word_id value if they are present in the text.
with open('text.txt') as f:
    text = f.readlines()
for index, row in df.iterrows():
    lemma_id = 0
    for lines in range(len(text)):
        word_row = text[lines].split()
        if word_row[2] == row['Word']:
            word_id = word_row[1]
            row['ID'] = word_id
However, this code would take an estimated 120 days to complete in my jupyter notebook, and I (obviously) want it to execute a bit more efficiently.
How do I approach this? Should I convert text into a dataframe/database, or is there another more efficient approach?
EDIT
Example of dataframe structure:
Word ID
0 hello NaN
1 there NaN
Example of txt.file structure:
NR ID WORD
32224 86289 ah
32225 86290 general
32226 86291 kenobi
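Have you tried using pandas.merge?
Your for loop would be replaced by the following (assuming that text has been loaded into a DataFrame, text_df, with the columns NR, ID and WORD). The empty ID column is dropped from df first so that the merged result has a single ID column:
new_df = pd.merge(df.drop(columns='ID'), text_df, left_on='Word', right_on='WORD')
new_df.dropna(subset=['ID'], inplace=True)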
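If pandas is not an option for the text file, a plain dictionary gives a similar speed-up: the file is scanned once, and each word is then looked up in constant time on average instead of being compared against every line. This is only a sketch under the assumption that each line has the three whitespace-separated fields shown in the question; build_word_index is a hypothetical helper name.

```python
def build_word_index(lines):
    """Map WORD -> ID from whitespace-separated 'NR ID WORD' lines."""
    index = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        word_id, word = parts[1], parts[2]
        # A later duplicate overwrites an earlier one, which matches the
        # original loop, where the last matching line wins.
        index[word] = word_id
    return index


# Sample lines taken from the question's example of the txt file.
text = [
    "NR ID WORD",
    "32224 86289 ah",
    "32225 86290 general",
    "32226 86291 kenobi",
]
index = build_word_index(text[1:])  # skip the header line

words = ["hello", "general", "kenobi"]
ids = [index.get(word) for word in words]  # None when the word is absent
print(ids)
```

With a DataFrame, the same dictionary could be applied with df['ID'] = df['Word'].map(index), which leaves NaN for words that are not in the text.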

How to show differences between columns of two csv files?

I have two csv files:
old file:
name size_bytes
air unknown
data/air/monitor
data/air/monitor/ambient-air-quality-oil-sands-region
data/air/monitor/ambient-air-quality-oil-sands-region/ecosystem-sites-speciated-mercury-preliminary-data-oil-sands-region
data/air/monitor/ambient-air-quality-oil-sands-region/ecosystem-sites-speciated-mercury-preliminary-data-oil-sands-region/2017-04-02
data/air/monitor/ambient-air-quality-oil-sands-region/ecosystem-sites-speciated-mercury-preliminary-data-oil-sands-region/2017-04-02/EN
data/air/monitor/ambient-air-quality-oil-sands-region/ecosystem-sites-speciated-mercury-preliminary-data-oil-sands-region/2017-04-02/EN/datapackage.json 886
data/air/monitor/ambient-air-quality-oil-sands-region/ecosystem-sites-speciated-mercury-preliminary-data-oil-sands-region/2017-04-02/EN/digest.txt 186
data/air/monitor/ambient-air-quality-oil-sands-region/ecosystem-sites-speciated-mercury-preliminary-data-oil-sands-region/2017-04-02/EN/JOSM_AMS13_SpecHg_AB_2017-04-02_EN.pdf 9033
data/air/monitor/ambient-air-quality-oil-sands-region/ecosystem-sites-speciated-mercury-preliminary-data-oil-sands-region/2017-04-02/FR
data/air/monitor/ambient-air-quality-oil-sands-region/ecosystem-sites-speciated-mercury-preliminary-data-oil-sands-region/2017-04-02/FR/datapackage.json 886
data/air/monitor/ambient-air-quality-oil-sands-region/ecosystem-sites-speciated-mercury-preliminary-data-oil-sands-region/2017-04-02/FR/digest.txt 186
...
new file:
name size_bytes
data 0
data/air 0
data/air/monitor 0
data/air/monitor/ambient-air-quality-oil-sands-region 0
data/air/monitor/ambient-air-quality-oil-sands-region/96c679c3-709e-4a42-89c6-09f09f2b7ffe.xml 65589
data/air/monitor/ambient-air-quality-oil-sands-region/datapackage.json 13152367
data/air/monitor/ambient-air-quality-oil-sands-region/digest.txt 188
data/air/monitor/ambient-air-quality-oil-sands-region/ecosystem-sites-speciated-mercury-preliminary-data-oil-sands-region 0
data/air/monitor/ambient-air-quality-oil-sands-region/ecosystem-sites-speciated-mercury-preliminary-data-oil-sands-region/2017-04-02 0
data/air/monitor/ambient-air-quality-oil-sands-region/ecosystem-sites-speciated-mercury-preliminary-data-oil-sands-region/2017-04-02/FR 0
data/air/monitor/ambient-air-quality-oil-sands-region/ecosystem-sites-speciated-mercury-preliminary-data-oil-sands-region/2017-04-02/FR/JOSM_AMS13_SpecHg_AB_2017-04-02_FR.pdf 9186
data/air/monitor/ambient-air-quality-oil-sands-region/ecosystem-sites-speciated-mercury-preliminary-data-oil-sands-region/2017-04-02/digest.txt 82
data/air/monitor/ambient-air-quality-oil-sands-region/ecosystem-sites-speciated-mercury-preliminary-data-oil-sands-region/2017-04-09 0
...
I want to compare the names from the "old file" to the names in the "new file" and get any missing names (folder or file paths).
Right now I have this:
with open('old_file.csv', 'r') as old_file:
    old = set(row.split(',')[0].strip().lower() for row in old_file)
with open('new_file.csv','r') as new_file, open('compare.csv', 'w') as compare_files:
    for line in new_file:
        if line.split(',')[0].strip().lower() not in old:
            compare_files.write(line)
This runs but the output is not correct, it prints out names that ARE in both files.
Here is the output:
data 0
data/air 0
data/air/monitor/deposition-oil-sands-region/the-monitored-ambient-concentration-and-estimated-atmospheric-deposition-of-trace-elements-at-four-monitoring-sites-in-the-canadian-athabasca-oil-sands-region 0
data/air/monitor/deposition-oil-sands-region/the-monitored-ambient-concentration-and-estimated-atmospheric-deposition-of-trace-elements-at-four-monitoring-sites-in-the-canadian-athabasca-oil-sands-region/ElementConcentrationPM25_OSM_AMS-sites_2016-2017.csv 736737
data/air/monitor/deposition-oil-sands-region/the-monitored-ambient-concentration-and-estimated-atmospheric-deposition-of-trace-elements-at-four-monitoring-sites-in-the-canadian-athabasca-oil-sands-region/ElementConcentrationPM25to10_OSM_AMS-sites_2016-2017.csv 227513
data/air/monitor/deposition-oil-sands-region/the-monitored-ambient-concentration-and-estimated-atmospheric-deposition-of-trace-elements-at-four-monitoring-sites-in-the-canadian-athabasca-oil-sands-region/ElementFlux_OSM_AMS-sites_2016-2017.csv 691252
data/air/monitor/deposition-oil-sands-region/the-monitored-ambient-concentration-and-estimated-atmospheric-deposition-of-trace-elements-at-four-monitoring-sites-in-the-canadian-athabasca-oil-sands-region/ffeae500-ea0c-493f-9b24-5efbd16411fd.xml 41399
data/air/monitor/monitoring-of-atmospheric-precipitation-chemistry/major-ions/AtmosphericPrecipitationChemistry-MajorIons-APQMP-AllSites-2019.csv 169109
data/air/monitor/monitoring-of-atmospheric-precipitation-chemistry/major-ions/AtmosphericPrecipitationChemistry-MajorIons-APQMP-AllSites-2020.csv 150205
data/air/monitor/monitoring-of-atmospheric-precipitation-chemistry/major-ions/AtmosphericPrecipitationChemistry-MajorIons-CAPMoN-AllSites-2017.csv 4343972
data/air/monitor/monitoring-of-atmospheric-precipitation-chemistry/major-ions/AtmosphericPrecipitationChemistry-MajorIons-CAPMoN-AllSites-2018.csv 3782783
data/air/monitor/monitoring-of-combined-atmospheric-gases-and-particles/major-ions-and-acidifying-gases 0
data/air/monitor/monitoring-of-combined-atmospheric-gases-and-particles/major-ions-and-acidifying-gases/AtmosphericCombinedGasesParticles-FilterPack-CAPMoN-AllSites-2012.csv 1826690
data/air/monitor/monitoring-of-combined-atmospheric-gases-and-particles/major-ions-and-acidifying-gases/AtmosphericCombinedGasesParticles-FilterPack-CAPMoN-AllSites-2013.csv 1890761
data/air/monitor/monitoring-of-combined-atmospheric-gases-and-particles/major-ions-and-acidifying-gases/AtmosphericCombinedGasesParticles-FilterPack-CAPMoN-AllSites-2014.csv 1946788
data/air/monitor/monitoring-of-combined-atmospheric-gases-and-particles/major-ions-and-acidifying-gases/AtmosphericCombinedGasesParticles-FilterPack-CAPMoN-AllSites-2015.csv 2186536
data/air/monitor/monitoring-of-combined-atmospheric-gases-and-particles/major-ions-and-acidifying-gases/AtmosphericCombinedGasesParticles-FilterPack-CAPMoN-AllSites-2016.csv 2434692
data/air/monitor/monitoring-of-combined-atmospheric-gases-and-particles/major-ions-and-acidifying-gases/AtmosphericCombinedGasesParticles-FilterPack-CAPMoN-AllSites-2017.csv 2150499
data/air/monitor/monitoring-of-combined-atmospheric-gases-and-particles/major-ions-and-acidifying-gases/AtmosphericCombinedGasesParticles-FilterPack-CAPMoN-AllSites-2018.csv 2136853
...
Is there something wrong with my code?
Is there a better way to do this? Maybe using pandas?
Your tags mention Pandas but I don't see you using it. Either way, an outer merge should do what you want, if I understand your question:
import pandas as pd
old = pd.read_csv(path_to_old_file)
new = pd.read_csv(path_to_new_file)
df = pd.merge(old, new, on="name", how="outer")
Your post isn't super clear on what exactly you need, and I don't particularly feel like scrutinizing those file names for differences. From what I could gather, you want all the unique file paths from both csv files, right? It's not clear what you want done with the other column so I've left them alone.
I recommend reading this Stack Overflow post.
EDIT
After your clarification:
import numpy as np
import pandas as pd
old = pd.read_csv(path_to_old_file)
new = pd.read_csv(path_to_new_file)
np.setdiff1d(old["name"], new["name"])
This will give you all the values in the name column of the old dataframe which are not present in the new dataframe.
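The same comparison can be sketched without pandas by using the csv module and set difference. This assumes the files really are comma-delimited with a header row; read_names and the small in-memory samples are hypothetical. If the real files use another delimiter (a tab, for example), splitting on a comma returns the whole line including the size, so a name with a changed size would look like a different name; the delimiter parameter is there for that case.

```python
import csv
import io


def read_names(handle, delimiter=","):
    """Return the normalised first-column values, skipping the header row."""
    reader = csv.reader(handle, delimiter=delimiter)
    next(reader, None)  # header row
    return {row[0].strip().lower() for row in reader if row}


# Hypothetical in-memory samples standing in for old_file.csv and new_file.csv.
old_csv = io.StringIO(
    "name,size_bytes\ndata/air,0\ndata/air/monitor,0\ndata/old.txt,12\n"
)
new_csv = io.StringIO(
    "name,size_bytes\ndata/air,0\ndata/air/monitor,0\ndata/new.txt,34\n"
)

old_names = read_names(old_csv)
new_names = read_names(new_csv)

only_in_old = sorted(old_names - new_names)  # names missing from the new file
only_in_new = sorted(new_names - old_names)  # names missing from the old file
print(only_in_old, only_in_new)
```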

Python - average of unique values

I have a CSV file that looks like this:
DATE,TEMP
0101,39.0
0102,40.9
0103,44.4
0104,41.0
0105,40.0
0106,42.2
...
0101,41.0
0102,39.9
0103,44.6
0104,42.0
0105,43.0
0106,42.4
It's a list of temperatures for specific dates. It contains data for several years so the same dates occur multiple times. I would like to average the temperature so that I get a new table where each date is only occurring once and has the average temperature for that date in the second column.
I know that Stack Overflow requires you to include what you've attempted, but I really don't know how to do this and couldn't find any other answers on this.
I hope someone can help. Any help is much appreciated.
You can use pandas and run the groupby command, where df is your data frame:
df.groupby('DATE').mean()
Here is a toy example to show the behaviour:
import pandas as pd
df=pd.DataFrame({"a":[1,2,3,1,2,3],"b":[1,2,3,4,5,6]})
df.groupby('a').mean()
Will result in
     b
a
1  2.5
2  3.5
3  4.5
When the original dataframe was
   a  b
0  1  1
1  2  2
2  3  3
3  1  4
4  2  5
5  3  6
If you can use defaultdict from the collections module, it makes this type of thing pretty easy.
Assuming your list is in the same directory as the Python script and it looks like this:
list.csv:
DATE,TEMP
0101,39.0
0102,40.9
0103,44.4
0104,41.0
0105,40.0
0106,42.2
0101,39.0
0102,40.9
0103,44.4
0104,41.0
0105,40.0
0106,42.2
Here is the code I used to print out the averages (written for Python 3).
#test.py
#usage: python test.py list.csv
import sys
from collections import defaultdict
#Open the file that is listed in the second position on the command line
with open(sys.argv[1]) as File:
    #Skip the first line of the file, since it is just the "DATE,TEMP" header
    next(File)
    #Create a dictionary of lists
    ourDict = defaultdict(list)
    #parse the file, line by line
    for each in File:
        # Split the line by a comma,
        # or whatever separates the fields (Comma Separated Values = CSV)
        each = each.split(',')
        # now each[0] is a date, and each[1] is a value.
        # We use each[0] as the key, and append values to the list
        ourDict[each[0]].append(float(each[1]))
print("Date\tValue")
for key,value in ourDict.items():
    # Average is the sum of the value of all members of the list
    # divided by the list's length
    print(key,'\t',sum(value)/len(value))

Matching parts of two csv files to return certain elements

Hello, I am looking for some help doing something like an INDEX/MATCH in Excel. I am very new to Python, but my data sets are now far too large for Excel.
I will simplify my question as much as possible, because the data contains a lot of information that is irrelevant to this problem.
CSV A (has 3 Basic columns)
Name, Date, Value
CSV B (has 2 columns)
Value, Score
CSV C (I want to create this using python; 2 columns)
Name, Score
All I want to do is enter a date, look up all rows in CSV A that match that date, then look up in CSV B the score associated with the value from each of those rows, and write it to CSV C along with the name of the person. Rinse and repeat through every row.
Any help is much appreciated; I don't seem to be getting very far.
Here is a working script using Python's csv module:
It prompts the user to input a date (format is m-d-yy), then reads csvA row by row to check if the date in each row matches the inputted date.
If yes, it checks if the value that corresponds to the date in the current row of A matches any of the rows in csvB.
If there are matches, it writes the name from csvA and the score from csvB to csvC.
import csv
date = input('Enter date: ').strip()
A = csv.reader( open('csvA.csv', newline=''), delimiter=',')
matches = 0
# reads each row of csvA
for row_of_A in A:
    # removes whitespace before and after of each string in each row of csvA
    row_of_A = [string.strip() for string in row_of_A]
    # if date of row in csvA has equal value to the inputted date
    if row_of_A[1] == date:
        B = csv.reader( open('csvB.csv', newline=''), delimiter=',')
        # reads each row of csvB
        for row_of_B in B:
            # removes whitespace before and after of each string in each row of csvB
            row_of_B = [string.strip() for string in row_of_B]
            # if value of row in csvA is equal to the value of row in csvB
            if row_of_A[2] == row_of_B[0]:
                # if csvC.csv does not exist, create it and write the header
                try:
                    open('csvC.csv', 'r')
                except FileNotFoundError:
                    C = open('csvC.csv', 'a')
                    print('Name,', 'Score', file=C)
                C = open('csvC.csv', 'a')
                # writes name from csvA and value from csvB to csvC
                print(row_of_A[0] + ', ' + row_of_B[1], file=C)
                # count the match so the summary below is correct
                matches += 1
m = 'matches' if matches > 1 else 'match'
print('Found', matches, m)
Sample csv files:
csvA.csv
Name, Date, Value
John, 2-6-15, 10
Ray, 3-5-14, 25
Khay, 4-4-12, 30
Jake, 2-6-15, 100
csvB.csv
Value, Score
10, 500
25, 200
30, 300
100, 250
Sample Run:
>>> Enter date: 2-6-15
Found 2 matches
csvC.csv (generated by script)
Name, Score
John, 500
Jake, 250
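Since the data sets are described as large, re-reading csvB for every matching row of csvA may become slow. A possible variation, sketched here under the assumption that each Value appears only once in csvB, loads csvB into a dictionary once and then makes a single pass over csvA. The function names load_scores and names_and_scores are hypothetical, and the in-memory strings stand in for the sample files above.

```python
import csv
import io


def load_scores(handle):
    """Map Value -> Score from csvB-style rows, skipping the header."""
    reader = csv.reader(handle, skipinitialspace=True)
    next(reader, None)  # header row
    return {row[0].strip(): row[1].strip() for row in reader if len(row) >= 2}


def names_and_scores(handle, scores, date):
    """Return (name, score) pairs for csvA-style rows matching the date."""
    reader = csv.reader(handle, skipinitialspace=True)
    next(reader, None)  # header row
    result = []
    for row in reader:
        if len(row) < 3:
            continue  # skip blank or short rows
        name, row_date, value = (field.strip() for field in row[:3])
        if row_date == date and value in scores:
            result.append((name, scores[value]))
    return result


# In-memory copies of the sample csvA.csv and csvB.csv shown above.
csv_a = io.StringIO(
    "Name, Date, Value\nJohn, 2-6-15, 10\nRay, 3-5-14, 25\n"
    "Khay, 4-4-12, 30\nJake, 2-6-15, 100\n"
)
csv_b = io.StringIO("Value, Score\n10, 500\n25, 200\n30, 300\n100, 250\n")

scores = load_scores(csv_b)
matches = names_and_scores(csv_a, scores, "2-6-15")
print(matches)
```

The resulting pairs can then be written to csvC.csv with csv.writer, with the header row written once before the data rows.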
If you are using Unix, you can do this with the shell script below.
I am also assuming that you are appending the search output to file_C and that there are no duplicates in either source file.
while true
do
    echo "enter date ..."
    read date
    # command substitution captures the output of each pipeline
    value_one=$(grep "$date" file_A | cut -d',' -f1)
    tmp1=$(grep "$date" file_A | cut -d',' -f3)
    value_two=$(grep "$tmp1" file_B | cut -d',' -f2)
    echo "${value_one},${value_two}" >> file_C
    echo "want to search more dates ... press y|Y, press any other key to exit"
    read ch
    if [ "$ch" = "y" ] || [ "$ch" = "Y" ]
    then
        continue
    else
        exit
    fi
done
