To clarify, I have 2 CSV files I want to read.
First CSV has the following headers: ['ISO3', 'Languages', 'Country Name'].
Second CSV has the following headers: ['ISO3', 'Area', 'Country Name']
I want to write to a new CSV file with the following headers (and their corresponding values obviously), so like: ['ISO3', 'Area', 'Languages', 'Country Name']. Basically, I want to merge the 2 CSVs, without having the duplication of ISO3 and Country Name.
Right now, I am reading both CSVs, and I can successfully write the 'Area' values to the output CSV, which previously contained only ['ISO3', 'Languages', 'Country Name'].
However, the formatting is off.
import csv
filePath = '/file/path/shortlist_languages.csv'
fp_write = input("Enter fp for writing new CSV (do not include .csv extension): ")
country_data_fields = []
with open(filePath) as file:
    reader = csv.DictReader(file)
    for row in reader:
        country_data_fields.append({
            'Languages': row['Languages'],
            'Country Name': row['Country Name'],
            'ISO3': row['ISO3']
        })
with open('/file/path/shortlist_area.csv') as file_t:
    reader = csv.DictReader(file_t)
    for row in reader:
        country_data_fields.append({
            'Area': row['Area'],
        })
with open(fp_write+'country_data_table.csv', 'w',
          newline='') as country_data_fields_csv:
    fieldnames = ['Languages', 'Country Name', 'ISO3', 'Area']
    csv_dict_writer = csv.DictWriter(country_data_fields_csv, fieldnames=fieldnames)
    csv_dict_writer.writeheader()
    for data in country_data_fields:
        csv_dict_writer.writerow(data)
The CSV result looks like the below:
Languages,Country Name,ISO3,Area
Albanian,Albania,ALB,
Arabic,Algeria,DZA,
Catalan,Andorra,AND,
Portuguese,Angola,AGO,
English,Antigua and Barbuda,ATG,
,,,28748
,,,2381741
,,,468
,,,1246700
,,,442
I want the "Area" values to line up with the other columns in the same rows, though. How can I do that?
I understand that you're identifying each record by the 'ISO3' key? Use a dict instead of a list, using the 'ISO3' value as a key.
In the first loop, instead of .append, store each record in the dict under its 'ISO3' key. In the second loop, look up the existing record dict for that key and set its ['Area'] to the row['Area'] value, and it should update properly. Something like this (not tested):
for row in reader:
    iso3 = row['ISO3']
    country_record = country_data_fields[iso3]
    country_record['Area'] = row['Area']
Modify the final loop to iterate through the dict instead of a list.
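Putting the dict-based approach together, a minimal sketch could look like the following. The function name merge_by_iso3 and the in-memory io.StringIO inputs are assumptions made for illustration (real code would pass open file objects), and rows from the second file whose ISO3 is missing from the first file are simply skipped here.

```python
import csv
import io

def merge_by_iso3(languages_file, area_file, out_file):
    """Merge two CSV streams on ISO3 and write one combined row per country."""
    records = {}  # ISO3 -> combined record
    for row in csv.DictReader(languages_file):
        records[row['ISO3']] = {
            'ISO3': row['ISO3'],
            'Languages': row['Languages'],
            'Country Name': row['Country Name'],
        }
    for row in csv.DictReader(area_file):
        # Assumption: countries missing from the first file are skipped.
        record = records.get(row['ISO3'])
        if record is not None:
            record['Area'] = row['Area']
    fieldnames = ['ISO3', 'Area', 'Languages', 'Country Name']
    writer = csv.DictWriter(out_file, fieldnames=fieldnames)
    writer.writeheader()
    for record in records.values():
        writer.writerow(record)  # a missing 'Area' is written as an empty field

# In-memory stand-ins for the two input files (values taken from the question).
languages_csv = io.StringIO(
    "ISO3,Languages,Country Name\nALB,Albanian,Albania\nDZA,Arabic,Algeria\n"
)
area_csv = io.StringIO(
    "ISO3,Area,Country Name\nALB,28748,Albania\nDZA,2381741,Algeria\n"
)
merged = io.StringIO()
merge_by_iso3(languages_csv, area_csv, merged)
print(merged.getvalue())
```

Because dicts keep insertion order, the output rows come out in the order of the first file.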
I have two CSV files that I want to combine into one. Call them A.csv and B.csv. I already know that they contain some conflicts. For example, both have the columns ID and name; in A.csv, ID "12345" has the name "Jack", while in B.csv, ID "12345" has the name "Tom", so the same ID has different names. I want to keep ID "12345" with the name from A.csv and discard the name from B.csv. How can I do that?
Here is the code I have tried. It combines the two CSV files but cannot deal with the conflicts; more precisely, it cannot pick the value from A.csv:
import pandas as pd
import glob
def merge(csv_list, outputfile):
    for input_file in csv_list:
        f = open(input_file, 'r', encoding='utf-8')
        data = pd.read_csv(f, error_bad_lines=False)
        data.to_csv(outputfile, mode='a', index=False)
    print('Combine Completed')
def distinct(file):
    df = pd.read_csv(file, header=None)
    datalist = df.drop_duplicates()
    datalist.to_csv('result_new_month01.csv', index = False, header = False)
    print('Distinct Completed')
if __name__ == '__main__':
    csv_list = glob.glob('*.csv')
    output_csv_path = 'result.csv'
    print(csv_list)
    merge(csv_list, output_csv_path)
    distinct(output_csv_path)
P.S. English is not my native language. Please excuse my syntax error.
Will put down my comments here as an answer:
The problem with your merge function is that you read each file and append it to the same result.csv without checking for duplicates. To check for duplicates, the rows need to be in the same dataframe. Also, your code combines any number of CSV files, not necessarily just A.csv and B.csv. So when you say "I want to choose name from A.csv, and abandon name from B.csv", it looks like you really mean "keep the first one".
Anyway, fix your merge() function so that it reads the files into a list of dataframes, with A.csv first. Then use @gold_cy's answer to concatenate the dataframes, keeping only the first occurrence.
Or, in your distinct() function, replace datalist = df.drop_duplicates() with datalist = df.drop_duplicates("ID", keep="first").reset_index(drop=True). This can also be done right after the read loop: instead of writing out a CSV full of duplicates, drop the dupes first and then write the CSV.
So here's the change using the first method:
import pandas as pd
import glob
def merge_distinct(csv_list, outputfile):
    all_frames = []  # list of dataframes
    for input_file in csv_list:
        # skip your file-open line and pass those params to pd.read_csv
        # (pandas 1.3+ replaces error_bad_lines=False with on_bad_lines='skip')
        data = pd.read_csv(input_file, encoding='utf-8', error_bad_lines=False)
        all_frames.append(data)  # append to list of dataframes
    print('Combine Completed')
    final = pd.concat(all_frames).drop_duplicates("ID", keep="first").reset_index(drop=True)
    final.to_csv('result_new_month01.csv', index=False, header=False)
    print('Distinct Completed')
if __name__ == '__main__':
    csv_list = sorted(glob.glob('*.csv'))  # sort the list
    output_csv_path = 'result.csv'
    print(csv_list)
    merge_distinct(csv_list, output_csv_path)
Notes:
Instead of doing f = open(...), just pass those arguments to pd.read_csv().
Why are you writing out the final CSV with header=False? The header is useful to have.
glob.glob() doesn't guarantee any ordering (it depends on the filesystem), so I've used sorted() above.
The order the filesystem returns can differ from the order sorted() produces, which is essentially Unicode code point order.
If you want to keep one DataFrame's values over the other's, concatenate them and keep the first duplicate in the output. This means the DataFrame with the preferred values should come first in the sequence you pass to pd.concat, as shown below.
df = pd.DataFrame({"ID": ["12345", "4567", "897"], "name": ["Jack", "Tom", "Frank"]})
df1 = pd.DataFrame({"ID": ["12345", "333", "897"], "name": ["Tom", "Sam", "Rob"]})
pd.concat([df, df1]).drop_duplicates("ID", keep="first").reset_index(drop=True)
      ID   name
0  12345   Jack
1   4567    Tom
2    897  Frank
3    333    Sam
I have data in a CSV file, and I want to execute an HTTP GET request for every row in the CSV and store the results of the requests in a DataFrame.
Here's what I'm working with so far:
with open('input.csv') as csv_file:
    csv_reader = csv.DictReader(csv_file)
    df = pd.DataFrame()
    for row in csv_reader:
        result = requests.get(BASEURL+row['ID']+"&access_token="+TOKEN).json()
        data = pd.DataFrame(result)
        df.append(data)
However, this doesn't seem to be appending to the df?
Note the json response will always return id, first_name, last_name key-value pairs.
The append operation does not modify df in place; it returns a new dataframe with the appended data.
Change the last line to:
df = df.append(data)
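Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on current versions the line above raises an AttributeError. A minimal sketch of an alternative is to collect each response in a list and build the DataFrame once at the end. Here fetch_json is a hypothetical stand-in for the requests.get(...).json() call, and fake_fetch returns made-up data so the example runs without network access.

```python
import csv
import io
import pandas as pd

def collect_responses(csv_file, fetch_json):
    """Call fetch_json(row_id) for every CSV row and return one DataFrame."""
    records = []
    for row in csv.DictReader(csv_file):
        # fetch_json is a hypothetical stand-in for
        # requests.get(BASEURL + row['ID'] + "&access_token=" + TOKEN).json()
        records.append(fetch_json(row['ID']))
    # Build the DataFrame once from the list of dicts.
    return pd.DataFrame(records, columns=['id', 'first_name', 'last_name'])

def fake_fetch(row_id):
    # Made-up response with the keys the question says are always returned.
    return {'id': row_id, 'first_name': 'First' + row_id, 'last_name': 'Last' + row_id}

df = collect_responses(io.StringIO("ID\n1\n2\n"), fake_fetch)
print(df)
```

Building the frame once also avoids copying the accumulated data on every iteration, which is what repeated appends did.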
Intro Python question: I am working on a program that counts the number of politicians in each political party for each session of the U.S. Congress. I'm starting from a .csv with biographical data, and wish to export my political party membership count as a new .csv. This is what I'm doing:
import pandas as pd
read = pd.read_csv('30.csv', delimiter = ';', names = ['Name', 'Years', 'Position', 'Party', 'State', 'Congress'])
party_count = read.groupby('Party').size()
with open('parties.csv', 'a') as f:
    party_count.to_csv(f, header=False)
This updates my .csv to read as follows:
'Year','Party','Count'
'American Party',1
'Democrat',162
'Independent Democrat',3
'Party',1
'Whig',145
I next need to include the date under my first column ('Year'). This is contained in the 'Congress' column in my first .csv. What do I need to add to my final line of code to make this work?
Here is a snippet from the original .csv file I am drawing from:
'Name';'Years';'Position';'Party';'State';'Congress'
'ABBOTT, Amos';'1786-1868';'Representative';'Whig';'MA';'1847'
'ADAMS, Green';'1812-1884';'Representative';'Whig';'KY';'1847'
'ADAMS, John Quincy';'1767-1848';'Representative';'Whig';'MA';'1847'
You can merge the per-party counts back into your original dataframe (df here is the frame you loaded with pd.read_csv, called read in your code):
party_count = df.groupby('Party').size().reset_index(name='Count')
df = df.merge(party_count, on='Party', how='left')
Once you have the party counts, you can select the columns you need. For example, if you need [Congress, Party, Count], you can use:
out_df = df[['Congress', 'Party', 'Count']].drop_duplicates()
out_df.columns = ['Year', 'Party', 'Count']
out_df is then the dataframe you can write to the my.csv file:
out_df.to_csv('my.csv', index=False)
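The steps above can be tried end to end on the three rows quoted in the question. This is only a sketch under some assumptions: the data is read from an in-memory string instead of 30.csv, quotechar="'" is used to strip the single quotes, the header row is taken from the file itself, and the file covers a single Congress (as in the question), so the per-party count is also the per-session count.

```python
import io
import pandas as pd

# Rows copied from the question's snippet; quotechar="'" strips the single quotes.
raw = io.StringIO(
    "'Name';'Years';'Position';'Party';'State';'Congress'\n"
    "'ABBOTT, Amos';'1786-1868';'Representative';'Whig';'MA';'1847'\n"
    "'ADAMS, Green';'1812-1884';'Representative';'Whig';'KY';'1847'\n"
    "'ADAMS, John Quincy';'1767-1848';'Representative';'Whig';'MA';'1847'\n"
)
df = pd.read_csv(raw, sep=';', quotechar="'")

# Count members per party and attach the count to every row.
party_count = df.groupby('Party').size().reset_index(name='Count')
df = df.merge(party_count, on='Party', how='left')

# One row per (Congress, Party) pair, with 'Congress' renamed to 'Year'.
out_df = (
    df[['Congress', 'Party', 'Count']]
    .drop_duplicates()
    .rename(columns={'Congress': 'Year'})
    .reset_index(drop=True)
)
print(out_df)
```

Letting pandas read the header row from the file, rather than passing names=, also keeps the header line from being counted as a row of data.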
I have a data file with 3004 rows and no header, and each row has a different number of fields (for example, rows 1, 2, 3 and 4 have 16, 17, 21 and 12 fields, respectively). Here is the code I use to read the file:
df = pd.read_csv(file,'rb', delimiter ='\t', engine='python')
Here is the output:
$GPRMC,160330.40,A,1341.,N,10020.,E,0.006,,150517,,,A*7D
$GPGGA,160330.40,1341.,N,10020.,E,1,..
$PUBX,00,160330.40,1341.,N,10020.,E,...
$PUBX,03,20,2,-,056,40,,000,5,U,014,39,41,026,...
$PUBX,04,160330.40,150517,144210.39,1949,18,-6...
ÿ$GPRMC,160330.60,A,1341.,N,10020.,E...
$GPGGA,160330.60,1341.,N,10020.,E,1,...
It seems the delimiter did not separate the data into columns at all. So I tried passing column names, based on the number of fields in the $PUBX,00 message. Here is the code with the column names added:
my_cols = ['MSG type', 'ID MSG', 'UTC','LAT', 'N/S', 'LONG', 'E/W', 'Alt', 'Status','hAcc', 'vAcc','SOG', 'COG', 'VD','HDOP', 'VDOP', 'TDOP', 'Svs', 'reserved', 'DR', 'CS', '<CR><LF>']
df = pd.read_csv(file, 'rb', header = None, na_filter = False, engine = 'python', index_col=False, names=my_cols)
The result looks like the picture below: the whole file ends up in a single column, 'MSG type'.
[Image: the output]
My goal, once I can read this file, is to keep only the $PUBX,00,... rows, combined with one column from the $PUBX,04,... rows, and write them to a CSV. But I am still struggling to separate the file into columns. Please advise me on this matter. Thank you very much.
pd.read_csv is used for reading CSV (comma-separated values) files, so you don't need to specify a delimiter: the default is a comma.
If you want to read a TSV (tab-separated values) file, you can use:
pd.read_table(filepath)
Its default separator is a tab.
Hat Tip to Ilja Everilä
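To illustrate the two defaults mentioned in this answer, here is a small sketch. The sample strings are made up and are read from memory with io.StringIO instead of from a file path.

```python
import io
import pandas as pd

comma_text = "a,b\n1,2\n"   # made-up comma-separated sample
tab_text = "a\tb\n1\t2\n"   # the same table, tab-separated

df_csv = pd.read_csv(io.StringIO(comma_text))               # default separator: ','
df_tsv = pd.read_table(io.StringIO(tab_text))               # default separator: '\t'
df_tsv_alt = pd.read_csv(io.StringIO(tab_text), sep='\t')   # equivalent to read_table

print(df_csv.equals(df_tsv), df_tsv.equals(df_tsv_alt))
```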
@Hasanah, based on your code:
df = pd.read_csv(file,'rb', delimiter ='\t', engine='python')
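delimiter='\t' tells pandas to split the data into fields on tab characters, but your data is separated by commas.
The default delimiter when pandas reads CSV files is a comma, so you should not need to define one. Also drop 'rb': it is a file mode for open(), not a read_csv option. Older pandas versions take a second positional argument as the separator, and pandas 2.0 and later reject it:
df = pd.read_csv(file, engine='python')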
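Since the rows have different numbers of fields, another option is to split the lines with the standard csv module and filter on the first two fields before building any table. The sketch below is only an illustration: select_messages is a hypothetical helper, and the sample lines are made up and shortened in the style of the question's output.

```python
import csv
import io

def select_messages(lines, msg_type, msg_id):
    """Return the rows whose first two fields equal msg_type and msg_id."""
    selected = []
    for fields in csv.reader(lines):
        # Rows may have any number of fields, so check the length first.
        if len(fields) >= 2 and fields[0] == msg_type and fields[1] == msg_id:
            selected.append(fields)
    return selected

# Made-up, shortened lines in the style of the question's data.
sample = io.StringIO(
    "$GPRMC,160330.40,A\n"
    "$PUBX,00,160330.40,N\n"
    "$PUBX,04,160330.40,150517\n"
    "$PUBX,00,160330.60,N\n"
)
pubx00 = select_messages(sample, "$PUBX", "00")

# Write the selected rows back out as CSV (to memory here, a file in practice).
out = io.StringIO()
csv.writer(out).writerows(pubx00)
print(out.getvalue())
```

The same helper can be called with "04" to pick out the $PUBX,04 rows before combining the two sets.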