I am having trouble writing a loop that returns what I need. I have two CSV files. For the values in a column of CSV 1, I need to find whether there are matching values in CSV 2 and, if there are, return the rows of CSV 2 containing the matches as a DataFrame. When I try to create a loop, I cannot get the right values out of it. For example:
import pandas as pd
csv2 = pd.read_csv('/users/jamesh/documents/asiopods/asicrawlconcat.csv', header = 1)
csv1 = pd.read_csv('/users/jamesh/documents/asiopods/asiconcat.csv', header = 0)
h1s = csv1['Recommended_H1']
h1 = h1s
h1[0:3] #test
subject = csv2['H1_1']
for x in h1:
    for y in subject:
        if x == y:
            print(y)
The code above prints the values I need, but as plain strings. What I actually need is the DataFrame rows from CSV 2 that correspond to each matching y.
Any help or direction is greatly appreciated!
Edit - with some offline help, I have been able to get the correct information from the loop. However, I still can't figure out how to get the data into a pandas DataFrame; instead, the data comes back in a vertical manner. Here is the new loop:
def foogaiz():
    for k1, v1 in h1.items():
        for k2, v2 in subject.items():
            if v1 == v2:
                data = csv2.iloc[k2]
                return data
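For reference, a minimal sketch that collects all matching rows into one DataFrame at once, assuming the same csv2 and h1 as above, is boolean indexing with isin:
# All rows of csv2 whose H1_1 value appears anywhere in h1.
matches = csv2[csv2['H1_1'].isin(h1)]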
It's a little unclear whether the values you're matching on ('Recommended_H1' in your example) are unique and appear only once in asiconcat.csv. If so, then I recommend naming the two columns that hold the matching values the same ('H1_1' in my example syntax below) and doing a df.merge():
matched_df = df.merge(crawldf, on="H1_1", how="left")
The how="left" option keeps the rows of df that don't have matches in crawldf.
You can read the documentation for merge here:
http://pandas.pydata.org/pandas-docs/stable/merging.html
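Applied to the frames in the question, that could look like the following sketch (assuming the Recommended_H1 values really are unique in csv1):
# Rename so both frames share the join column, then left-merge.
matched_df = csv1.rename(columns={'Recommended_H1': 'H1_1'}).merge(csv2, on='H1_1', how='left')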
Related
I am trying to replace some missing and incorrect values in my master dataset by filling it in with correct values from two different datasets.
I created a miniature version of the full dataset like so (note the real dataset is several thousand rows long):
import pandas as pd
data = {'From': ['GA0251', 'GA5201', 'GA5551', 'GA510A', 'GA5171', 'GA5151'],
        'To': ['GA0201_T', 'GA5151_T', 'GA5151_R', 'GA5151_V', 'GA5151_P', 'GA5171_B'],
        'From_Latitude': [55.86630869, 0, 55.85508787, 55.85594626, 55.85692217, 55.85669934],
        'From_Longitude': [-4.27138731, 0, -4.24126866, -4.24446585, -4.24516129, -4.24358251],
        'To_Latitude': [55.86614756, 0, 55.85522197, 55.85593762, 55.85693878, 0],
        'To_Longitude': [-4.271040979, 0, -4.241466534, -4.244607602, -4.244905037, 0]}
dataset_to_correct = pd.DataFrame(data)
However, some values in the From lat/long and the To lat/long are incorrect. I have two tables like the one below for each of From and To, which I would like to substitute into the table in place of the two values for that row.
Table of Corrected To lat/long:
data = {'Site': ['GA5151_T', 'GA5171_B'],
        'Correct_Latitude': [55.85952791, 55.87044558],
        'Correct_Longitude': [55.85661767, -4.24358251]}
correct_to_coords = pd.DataFrame(data)
I would like to match this table to the To column and then replace the To_Latitude and To_Longitude with the correct values.
Table of Corrected From lat/long:
data = {'Site': ['GA5201', 'GA0251'],
        'Correct_Latitude': [55.857577, 55.86616756],
        'Correct_Longitude': [-4.242770, -4.272140979]}
correct_from_coords = pd.DataFrame(data)
I would like to match this table to the From column and then replace the From_Latitude and From_Longitude with the correct values.
Is there a way to match the site in each table to the corresponding From or To column and then replace only the values in the respective columns?
I have tried using code from this answer (Elegant way to replace values in pandas.DataFrame from another DataFrame), but it seems to have no effect on the DataFrame:
(correct_to_coords.set_index('Site')
 .rename(columns={'Correct_Latitude': 'To_Latitude'})
 .combine_first(dataset_to_correct.set_index('To')))
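One likely reason that snippet appears to do nothing is that combine_first returns a new DataFrame rather than modifying anything in place, so the result has to be captured. A sketch of that fix (note it also needs Correct_Longitude renamed):
fixed = (correct_to_coords.set_index('Site')
         .rename(columns={'Correct_Latitude': 'To_Latitude',
                          'Correct_Longitude': 'To_Longitude'})
         .combine_first(dataset_to_correct.set_index('To')))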
@zswqa's answer produces the right result; @Anurag Dabas's doesn't.
Another possible solution; it is a bit faster than the merge method suggested above, although both are correct.
dataset_to_correct.set_index("To",inplace=True)
correct_to_coords.set_index("Site",inplace=True)
dataset_to_correct.loc[correct_to_coords.index, "To_Latitude"] = correct_to_coords["Correct_Latitude"]
dataset_to_correct.loc[correct_to_coords.index, "To_Longitude"] = correct_to_coords["Correct_Longitude"]
dataset_to_correct.reset_index(inplace=True)
dataset_to_correct.set_index("From",inplace=True)
correct_from_coords.set_index("Site",inplace=True)
dataset_to_correct.loc[correct_from_coords.index, "From_Latitude"] = correct_from_coords["Correct_Latitude"]
dataset_to_correct.loc[correct_from_coords.index, "From_Longitude"] = correct_from_coords["Correct_Longitude"]
dataset_to_correct.reset_index(inplace=True)
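A caveat on the .loc steps above: they raise a KeyError if any Site value is absent from the To (or From) column. A defensive sketch for the To phase, while both frames are still indexed (an addition, not from the original answer), intersects the indexes first:
common = correct_to_coords.index.intersection(dataset_to_correct.index)
dataset_to_correct.loc[common, "To_Latitude"] = correct_to_coords.loc[common, "Correct_Latitude"]
dataset_to_correct.loc[common, "To_Longitude"] = correct_to_coords.loc[common, "Correct_Longitude"]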
# Fix the To coordinates via a left merge on To/Site.
merge = dataset_to_correct.merge(correct_to_coords, left_on='To', right_on='Site', how='left')
merge.loc[(merge.To == merge.Site), 'To_Latitude'] = merge.Correct_Latitude
merge.loc[(merge.To == merge.Site), 'To_Longitude'] = merge.Correct_Longitude
merge = merge.drop(columns=['Site', 'Correct_Latitude', 'Correct_Longitude'])
# Same for the From coordinates.
merge = merge.merge(correct_from_coords, left_on='From', right_on='Site', how='left')
merge.loc[(merge.From == merge.Site), 'From_Latitude'] = merge.Correct_Latitude
merge.loc[(merge.From == merge.Site), 'From_Longitude'] = merge.Correct_Longitude
merge = merge.drop(columns=['Site', 'Correct_Latitude', 'Correct_Longitude'])
merge
Let's try a dual merge with merge() + pop() + fillna() + drop():
dataset_to_correct = dataset_to_correct.merge(correct_to_coords, left_on='To', right_on='Site', how='left').drop(columns='Site')
dataset_to_correct['From_Latitude'] = dataset_to_correct.pop('Correct_Latitude').fillna(dataset_to_correct['From_Latitude'])
dataset_to_correct['From_Longitude'] = dataset_to_correct.pop('Correct_Longitude').fillna(dataset_to_correct['From_Longitude'])
dataset_to_correct = dataset_to_correct.merge(correct_from_coords, left_on='From', right_on='Site', how='left').drop(columns='Site')
dataset_to_correct['To_Latitude'] = dataset_to_correct.pop('Correct_Latitude').fillna(dataset_to_correct['To_Latitude'])
dataset_to_correct['To_Longitude'] = dataset_to_correct.pop('Correct_Longitude').fillna(dataset_to_correct['To_Longitude'])
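For comparison, DataFrame.update can express the same overwrite; it aligns on index and column names and modifies the caller in place. A minimal sketch for the To columns (an illustration, not from either answer above):
tmp = dataset_to_correct.set_index("To")
tmp.update(correct_to_coords.set_index("Site")
           .rename(columns={"Correct_Latitude": "To_Latitude",
                            "Correct_Longitude": "To_Longitude"}))
dataset_to_correct = tmp.reset_index()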
I'm new to Python. I'm trying to concatenate two CSV files to find out the differences between them, using the Id column as the index. Since the CSV files have duplicate Ids, I get the error below:
ValueError: Shape of passed values is (17, 4), indices imply (13, 4)
The error is on the line:
df_all_changes = pd.concat([old, new],axis=1,keys=['src','tgt'], join='inner')
Q1: How do I handle/remedy the above error?
Q2: Also, what does the line below do?
df_changed = df_all_changes.groupby(level=0, axis=0).apply(lambda frame: frame.apply(report_diff, axis=1))
Q3: What would happen if I passed level=1, axis=1 in the line above?
import pandas as pd
#list of key column(s)
key=['Id']
# Read in the two CSV files
old = pd.read_csv('Source.csv')
new = pd.read_csv('Target.csv')
#set index
old=old.set_index(key)
new=new.set_index(key)
#identify dropped rows and added (new) rows
dropped_rows = set(old.index) - set(new.index)
added_rows = set(new.index) - set(old.index)
#print(old.loc[dropped_rows])
#combine data
df_all_changes = pd.concat([old,new],axis=1,keys=['src','tgt'],join='inner')
print(df_all_changes)
# swap the column index levels
df_all_changes = df_all_changes.swaplevel(axis='columns')
# prepare a function for comparing old and new values
def report_diff(x):
    return x[0] if x[0] == x[1] else '{} ---> {}'.format(*x)
#apply the report_diff function
df_changed = df_all_changes.groupby(level=0, axis=0).apply(lambda frame: frame.apply(report_diff, axis=1))
print(df_changed)
You may want to provide examples of what the DataFrames look like.
Q1 - because the two DataFrames do not have the same number of rows, instead of using pd.concat I would suggest pd.merge. Since both frames are indexed on Id, you can join on the indexes:
df_all_changes = pd.merge(new, old, left_index=True, right_index=True, how='inner', suffixes=('_tgt', '_src'))
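If you would rather keep the pd.concat approach, another remedy for Q1 (assuming the repeated Ids are true duplicates that can be dropped) is to make each index unique first:
old = old[~old.index.duplicated(keep='first')]
new = new[~new.index.duplicated(keep='first')]
df_all_changes = pd.concat([old, new], axis=1, keys=['src', 'tgt'], join='inner')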
Q2
level=0 means that you want to group by the first level of the hierarchical row index; and
axis=0 means you want to split along rows (it is the default setting).
You should look at the documentation. .apply() simply applies your custom function, which compares the old and new values within each row (axis=1).
Q3 - see the explanation in Q2.
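For intuition, a tiny throwaway check of what report_diff produces per row (the values are illustrative):
print(report_diff(pd.Series(['a', 'a'])))  # unchanged -> 'a'
print(report_diff(pd.Series(['a', 'b'])))  # changed   -> 'a ---> b'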
I want to aggregate some API responses into a DataFrame.
The request consistently returns a number of JSON key-value pairs, let's say A, B, C. Occasionally, however, it will return A, B, C, D.
I would like something comparable to SQL's OUTER JOIN, which would simply add the new row while filling the extra column with NULL (or some other placeholder) for the rows that lack it.
The pandas join options insist on imposing a unique suffix for each side, which I really don't want.
Am I looking at this the wrong way?
If there is no easy solution, I could just select a subset of the consistently available columns, but I really want to download the lot and do the processing as a separate stage.
You can use pandas.concat, as it provides all the functionality required for your problem. Let this toy example illustrate a possible solution.
import numpy as np
import pandas as pd

# Generate random data with some key and value pairs.
def gen_data(_size):
    import string
    keys = list(string.ascii_uppercase)
    return dict((k, [v]) for k, v in zip(np.random.choice(keys, _size),
                                         np.random.randint(1000, size=_size)))

counter = 0
df = pd.DataFrame()
while True:
    if counter > 5:
        break
    # Receive the data
    new_data = gen_data(5)
    # Convert it to a DataFrame
    new_data = pd.DataFrame(new_data)
    # Append it to the stack; sort=True aligns the columns and
    # fills missing keys with NaN.
    df = pd.concat((df, new_data), axis=0, sort=True)
    counter += 1
df.reset_index(drop=True, inplace=True)
print(df.to_string())
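If each response arrives as a plain dict, a simpler sketch (with made-up response values) is to collect them in a list and build the frame once; missing keys become NaN automatically:
responses = [{'A': 1, 'B': 2, 'C': 3},
             {'A': 4, 'B': 5, 'C': 6, 'D': 7}]
df = pd.DataFrame(responses)  # column D is NaN in the first row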
I have a csv file with 330k+ rows and 12 columns. I need to put column 1 (numeric ID) and column 3 (text string) into a list or array so I can analyze the data in column 3.
This code worked for me to pull out the third column:
for row in csv_strings:
    string1.append(row[2])
Can someone point me to the correct class of commands that I can research to get the job done?
Thanks.
Pandas is the best tool for this.
import pandas as pd
df = pd.read_csv("filename.csv", usecols=[ 0, 2 ])
points = []
for row in csv_strings:
    points.append({'id': row[0], 'text': row[2]})
You can pull them out into a list of key-value pairs.
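With the pandas route above, the same list can come straight from the frame; a sketch (the column names are illustrative, the real ones depend on the CSV header):
df.columns = ['id', 'text']            # illustrative names
points = df.to_dict(orient='records')  # [{'id': ..., 'text': ...}, ...]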
A different answer, using tuples, which are immutable and pretty fast, but less convenient than dictionaries:
# build results
results = []
for row in csv_lines:
    results.append((row[0], row[2]))
# read results
for result in results:
    result[0]  # id
    result[1]  # string
import csv

x, z = [], []
csv_reader = csv.reader(open('Data.csv'))
for line in csv_reader:
    x.append(line[0])
    z.append(line[2])
This can help you get the data from the 1st and 3rd columns.
I am trying to match the stop_id in stop_times.csv to the stop_id in stops.csv in order to copy over the stop_lat and stop_lon to their respective columns in stop_times.csv.
Gist files:
stops.csv LINK
stop_times.csv LINK
Here's my code:
import pandas as pd
st = pd.read_csv('csv/stop_times.csv', sep=',')
st.set_index(['trip_id','stop_sequence'])
stops = pd.read_csv('csv/stops.csv')
for i in range(len(st)):
    for x in range(len(stops)):
        if st['stop_id'][i] == stops['stop_id'][x]:
            st['stop_lat'][i] = stops['stop_lat'][x]
            st['stop_lon'][i] = stops['stop_lon'][x]
st.to_csv('csv/stop_times.csv', index=False)
I'm aware that the script is assigning to a copy, but I'm not sure how else to go about this, as I'm fairly new to pandas.
You can merge the two DataFrames:
pd.merge(stops, st, on='stop_id')
Since there are stop_lat columns in each, it will give you stop_lat_x (the good one) and stop_lat_y (the always-zero one). You can then remove or ignore the bad column and output the resulting DataFrame however you want.
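A sketch of that cleanup, assuming (as the question implies) that stops.csv holds the good coordinates and the ones already in stop_times.csv are placeholders:
merged = pd.merge(st.drop(columns=['stop_lat', 'stop_lon']),
                  stops[['stop_id', 'stop_lat', 'stop_lon']],
                  on='stop_id', how='left')
merged.to_csv('csv/stop_times.csv', index=False)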