I have been trying to convert data obtained from a Google sheet into a pandas dataframe.
My solution works:
header = data[0]  # The first row
all_other_rows = data[1:]  # All other rows
df = pd.DataFrame(all_other_rows, columns=header)
However, I don't understand why this code failed:
df = pd.DataFrame(data[0], columns=data[1:])
The initial error is "Shape of values passed is (4, 1), indices imply (4, 4)", but even when that is resolved as suggested in this answer, by adding brackets around data[0], the call takes 5-10 minutes and the df loads incorrectly. What is the difference?
Extra details of this code:
The data was imported with this code:
gc = gspread.authorize(credentials)
wb = gc.open_by_key(spreadsheet_key)
ws = wb.worksheet(worksheet_name)
data = ws.get_all_values()
Here's sample data:
[['id', 'other_id', 'another_id', 'time_of_determination'],
['63409', '1', '', '2019-11-14 22:01:19.386903+00'],
['63499', '1', '8', '2019-11-14 22:01:19.386903+00'],
['63999', '1', '', '2019-11-14 22:01:19.386903+00'],
['69999', '1', '', '2019-11-14 22:01:19.386903+00']]
You can download the Google Sheet in MS Excel format. Then try:
import pandas as pd
df = pd.read_excel(excel_file)
You don't have to mention columns explicitly for this. read_excel will automatically detect columns.
Alternatively, I guess you wanted something like this. The issue is that the rows and columns arguments are swapped in your DataFrame call:
data = [['id', 'other_id', 'another_id', 'time_of_determination'],
        ['63409', '1', '', '2019-11-14 22:01:19.386903+00'],
        ['63499', '1', '8', '2019-11-14 22:01:19.386903+00'],
        ['63999', '1', '', '2019-11-14 22:01:19.386903+00'],
        ['69999', '1', '', '2019-11-14 22:01:19.386903+00']]
import pandas as pd
header = data[0] # The first row
all_other_rows = data[1:] # All other rows
pd.DataFrame(all_other_rows, columns=header)
Output:
      id other_id another_id          time_of_determination
0  63409        1             2019-11-14 22:01:19.386903+00
1  63499        1          8  2019-11-14 22:01:19.386903+00
2  63999        1             2019-11-14 22:01:19.386903+00
3  69999        1             2019-11-14 22:01:19.386903+00
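As for why the failing call misbehaves: the arguments are swapped, so pandas sees one column of data but four column labels. A short sketch of the mismatch, using the sample data above:

import pandas as pd

data = [['id', 'other_id', 'another_id', 'time_of_determination'],
        ['63409', '1', '', '2019-11-14 22:01:19.386903+00'],
        ['63499', '1', '8', '2019-11-14 22:01:19.386903+00'],
        ['63999', '1', '', '2019-11-14 22:01:19.386903+00'],
        ['69999', '1', '', '2019-11-14 22:01:19.386903+00']]

# data[0] is a flat list of 4 strings, which pandas treats as a single
# column of 4 values -> values of shape (4, 1). columns=data[1:] supplies
# 4 labels (one per remaining row), so the index implies a (4, 4) frame.
try:
    pd.DataFrame(data[0], columns=data[1:])
except ValueError as exc:
    print(exc)  # e.g. "Shape of passed values is (4, 1), indices imply (4, 4)"

# Rows as data, header as column names, works as intended:
df = pd.DataFrame(data[1:], columns=data[0])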
Related
The project I'm working on requires me to find out which 'project' has been updated since the last time it was processed. For this purpose I have two dataframes, which both contain three columns, the last of which is a date signifying the last time a project was updated. The first dataframe is derived from a query on a database table which records the date a 'project' was updated. The second is metadata I store myself in a different table about the last time my part of the application processed a project.
I think I came pretty far but I'm stuck on the following error, see the code provided below:
lastmatch = pd.DataFrame({
    'projectid': ['1', '2', '2', '3'],
    'stage': ['c', 'c', 'v', 'v'],
    'lastmatchdate': ['2020-08-31', '2013-11-24', '2013-11-24', '2020-08-31']
})
lastmatch['lastmatchdate'] = pd.to_datetime(lastmatch['lastmatchdate'])
processed = pd.DataFrame({
    'projectid': ['1', '2'],
    'stage': ['c', 'v'],
    'process_date': ['2020-08-30', '2013-11-24']
})
processed['process_date'] = pd.to_datetime(processed['process_date'])
unprocessed = lastmatch[~lastmatch.isin(processed)].dropna()
processed.set_index(['projectid', 'stage'], inplace=True)
lastmatch.set_index(['projectid', 'stage'], inplace=True)
processed.sort_index(inplace=True)
lastmatch.sort_index(inplace=True)
print(lastmatch['lastmatchdate'])
print(processed['process_date'])
to_process = lastmatch.loc[lastmatch['lastmatchdate'] > processed['process_date']]
The result I want to achieve is a dataframe containing the rows where the 'lastmatchdate' is greater than the date that the project was last processed (process_date). However this line:
to_process = lastmatch.loc[lastmatch['lastmatchdate'] > processed['process_date']]
produces a ValueError: Can only compare identically-labeled Series objects. I think there might be some syntax I don't know of or am getting wrong.
The output I expect is in this case:
lastmatchdate
projectid stage
1 c 2020-08-31
So, concretely, the question is: how do I get a dataframe containing only the rows of one dataframe whose (datetime) value in column a is greater than the value in column b of the other dataframe?
merged = pd.merge(processed, lastmatch, left_index=True, right_index=True)
merged = merged.assign(to_process=merged['lastmatchdate'] > merged['process_date'])
You will get the following:
                process_date lastmatchdate  to_process
projectid stage
1         c       2020-08-30    2020-08-31        True
2         v       2013-11-24    2013-11-24       False
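If only the rows flagged for processing are wanted, a follow-up filter on that column (a small sketch, reusing merged from above) reduces the frame:

to_process = merged[merged['to_process']].drop(columns='to_process')
print(to_process)
#                 process_date lastmatchdate
# projectid stage
# 1         c       2020-08-30    2020-08-31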
You've received the ValueError because you tried to compare two differently-labeled dataframes. If you want to compare two dataframes row by row, merge them first:
lastmatch = pd.DataFrame({
    'projectid': ['1', '2', '2', '3'],
    'stage': ['c', 'c', 'v', 'v'],
    'lastmatchdate': ['2020-08-31', '2013-11-24', '2013-11-24', '2020-08-31']
})
lastmatch['lastmatchdate'] = pd.to_datetime(lastmatch['lastmatchdate'])
processed = pd.DataFrame({
    'projectid': ['1', '2'],
    'stage': ['c', 'v'],
    'process_date': ['2020-08-30', '2013-11-24']
})
processed['process_date'] = pd.to_datetime(processed['process_date'])
df = pd.merge(lastmatch, processed, on=['stage', 'projectid'])
df = df[df.lastmatchdate > df.process_date]
print(df)
  projectid stage lastmatchdate process_date
0         1     c    2020-08-31   2020-08-30
I have a few pandas dataframes in Python, and I want to loop through them to find out which dataframes meet my row criteria, then save those in a new dataframe.
d = {'Count': ['10', '11', '12', '13', '13.4', '12.5']}
df_1 = pd.DataFrame(data=d)
df_1
d = {'Count': ['10', '-11', '-12', '13', '16', '2']}
df_2 = pd.DataFrame(data=d)
df_2
Here is the logic I want to use, though it does not have the right syntax:
for df in (df_1, df_2)
    if df['Count'][0] > 0 and df['Count'][1] > 0 and df['Count'][2] > 0 and df['Count'][3] > 0
    and (df['Count'][4] is between df['Count'][3]+0.5 and df['Count'][3]-0.5) is True:
        df.save
The correct output is df_1... because it meets my condition. How do I create a new DataFrame or LIST to save the result as well?
Let me know if you have any questions in the comments. The main updates I made to your code were:
Replacing your chained indexing with .loc
Consolidating your first few separate and'd comparisons into a comparison on a slice of the series, reduced down to a single T/F with .all()
Code below:
import pandas as pd
# df_1 & df_2 input taken from you
d = {'Count': ['10', '11', '12', '13', '13.4', '12.5']}
df_1 = pd.DataFrame(data=d)
d = {'Count': ['10', '-11', '-12', '13', '16', '2']}
df_2 = pd.DataFrame(data=d)
# my solution here
df_1['Count'] = df_1['Count'].astype('float')
df_2['Count'] = df_2['Count'].astype('float')
my_dataframes = {'df_1': df_1, 'df_2': df_2}
good_dataframes = []
for df_name, df in my_dataframes.items():
    if (df.loc[0:3, 'Count'] > 0).all() and (df.loc[3, 'Count'] - 0.5 <= df.loc[4, 'Count'] <= df.loc[3, 'Count'] + 0.5):
        good_dataframes.append(df_name)
good_dataframes_df = pd.DataFrame({'good': good_dataframes})
TEST:
>>> print(good_dataframes_df)
   good
0  df_1
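If you want to keep the matching DataFrames themselves (or one combined DataFrame) rather than just their names, a small variant of the loop above (a sketch) appends the objects instead:

good_dfs = []
for df_name, df in my_dataframes.items():
    if (df.loc[0:3, 'Count'] > 0).all() and (df.loc[3, 'Count'] - 0.5 <= df.loc[4, 'Count'] <= df.loc[3, 'Count'] + 0.5):
        good_dfs.append(df)  # store the DataFrame object itself

# Concatenate into one DataFrame if a single result frame is preferred:
combined = pd.concat(good_dfs, ignore_index=True)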
I have a csv file full of data, which is all type string. The file is called Identifiedλ.csv.
Here is some of the data from the csv file:
['Ref', 'Ion', 'ULevel', 'UConfig.', 'ULSJ', 'LLevel', 'LConfig.', 'LLSJ']
['12.132', 'Ne X', '4', '2p1', '2P3/2', '1', '1s1', '1S0']
['12.132', 'Ne X', '3', '2p1', '2P3/2', '1', '1s1', '1S0']
['12.846', 'Fe XX', '58', '1s2.2s2.2p2.3d1', '4P5/2', '1', '1s2.2s2.2p3', '4S3/2']
What I would like to do is read the file and search for a number in the column 'Ref', for example 12.846. And if the number I search for matches a number in the file, print the whole row containing that number.
eg. something like:
csv_g = csv.reader(open('Identifiedλ.csv', 'r'), delimiter=",")
for row in csv_g:
    if 12.846 == (row[0]):
        print(row)
And it would return (hopefully)
['12.846', 'Fe XX', '58', '1s2.2s2.2p2.3d1', '4P5/2', '1', '1s2.2s2.2p3', '4S3/2']
However this returns nothing and I think it's because the 'Ref' column is type string and the number I search is type float. I'm trying to find a way to convert the string to float but am seemingly stuck.
I've tried:
df = pd.read_csv('Identifiedλ.csv', dtype = {'Ref': np.float64,})
and
array = b = np.asarray(array,
dtype = np.float64, order = 'C')
but am confused on how to incorporate this with the rest of the search.
Any tips would be most helpful! Thank you!
Python has a function to convert strings to floats. For example, the following evaluates to True:
float('3.14')==3.14
I would try this conversion while comparing the value in the first column.
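A minimal sketch of that fix applied to the loop from the question (the try/except is my addition, to skip the header row, which cannot be converted to float):

import csv

target = 12.846
with open('Identifiedλ.csv', 'r') as f:
    for row in csv.reader(f, delimiter=','):
        try:
            if float(row[0]) == target:
                print(row)
        except ValueError:
            pass  # header row: 'Ref' is not convertible to float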
I have a csv file like this :
id;verbatim
1;je veux manger
2;tu as manger
I have my script which return a dictionary like this :
dico_phrases = {"['manger']": ['7', '1', '0'], "['être', 'laid']": ['0', '21', '1041']}
I would like to add 4 new columns like this:
id;verbatim;key;value1;value2;value3
1;je veux manger
2;tu as manger
And then add my dictionary into each column like this:
id;verbatim;key;value1;value2;value3
1;je veux manger;manger;7;1;0
2;tu as manger;être laid;0;21;1041
Below is the script which lets me write my dictionary:
import csv

with open('output.csv', 'wb') as f:
    w = csv.writer(f)
    w.writerow(dico_phrases.keys())
    w.writerow(dico_phrases.values())
I have this :
['manger'],"['être', 'laid']"
"['7', '1', '0']","['0', '21', '1041']"
It is not quite what I had imagined.
Consider using pandas for this -
df = pd.read_csv("input.csv", index_col='id')
dico_phrases = {"['manger']": ['7', '1', '0'], "['être', 'laid']": ['0', '21', '1041']}
df['key'] = [" ".join(eval(x)) for x in dico_phrases.keys()]
df = df.join(pd.DataFrame([x for x in dico_phrases.values()], index=df.index))
df.to_csv("output.csv")
Output:
id,verbatim,key,0,1,2
1,je veux manger,manger,7,1,0
2,tu as manger,être laid,0,21,1041
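One design note: eval executes arbitrary code, so on untrusted input, ast.literal_eval from the standard library is a safer way to parse keys like "['être', 'laid']". A sketch of the swap, reusing df and dico_phrases from above:

import ast

# literal_eval only accepts Python literals, so a hostile key cannot run code
df['key'] = [" ".join(ast.literal_eval(k)) for k in dico_phrases.keys()]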
I am having a bit of trouble getting started on an assignment. We are issued a tab-delimited .txt file with 6 columns of data and around 50 lines of it. I need help starting a list to store this data in for later recall. Eventually I will need to be able to list all the contents of any particular column and sort it, count it, etc. Any help would be appreciated.
Edit: I really haven't done much besides research on this kind of stuff. I know I'll be looking into csv, and I have done single-column .txt files before, but I'm not sure how to tackle this situation. How will I give names to the separate columns? How will I tell the program when one row ends and the next begins?
The dataframe structure in Pandas basically does exactly what you want. It's highly analogous to the data frame in R if you're familiar with that. It has built in options for subsetting, sorting, and otherwise manipulating tabular data.
It reads directly from csv and even automatically reads in column names. You'd call:
import pandas as pd

df = pd.read_csv(yourfilename,
                 sep='\t',  # makes it tab delimited
                 header=0)  # makes the first row the header row
Works in Python 3.
Let's say you have a csv like the following.
1 2 3 4 5 6
1 2 3 4 5 6
1 2 3 4 5 6
1 2 3 4 5 6
1 2 3 4 5 6
You can read them into a dictionary like so:
>>> import csv
>>> reader = csv.DictReader(open('test.csv','r'), fieldnames= ['col1', 'col2', 'col3', 'col4', 'col5', 'col6'], dialect='excel-tab')
>>> for row in reader:
...     print row
{'col6': '6', 'col4': '4', 'col5': '5', 'col2': '2', 'col3': '3', 'col1': '1'}
{'col6': '6', 'col4': '4', 'col5': '5', 'col2': '2', 'col3': '3', 'col1': '1'}
{'col6': '6', 'col4': '4', 'col5': '5', 'col2': '2', 'col3': '3', 'col1': '1'}
{'col6': '6', 'col4': '4', 'col5': '5', 'col2': '2', 'col3': '3', 'col1': '1'}
{'col6': '6', 'col4': '4', 'col5': '5', 'col2': '2', 'col3': '3', 'col1': '1'}
But Pandas library might be better suited for this. http://pandas.pydata.org/pandas-docs/stable/io.html#csv-text-files
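For reference, a rough pandas equivalent of the DictReader example above (a sketch; it assumes the same header-less, tab-separated test.csv):

import pandas as pd

df = pd.read_csv('test.csv', sep='\t', header=None,
                 names=['col1', 'col2', 'col3', 'col4', 'col5', 'col6'])
print(df['col4'])              # pick out a single column
print(df.sort_values('col1'))  # sort by a column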
Sounds like a job better suited to a database. You should just use something like PostgreSQL's COPY FROM operation to import the CSV data into a table, then use Python + SQL for all your sorting, searching and matching needs.
If you feel a real database is overkill, there are still options like SQLite and BerkeleyDB, which both have Python modules.
EDIT: BerkeleyDB is deprecated, but anydbm is similar in concept.
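If SQLite sounds right, here is a minimal sketch using only the standard library (the file name and column names are assumptions for illustration):

import csv
import sqlite3

conn = sqlite3.connect(':memory:')  # or a file path for persistence
conn.execute('CREATE TABLE data (col1, col2, col3, col4, col5, col6)')

with open('data.txt', 'r') as f:
    rows = list(csv.reader(f, delimiter='\t'))
conn.executemany('INSERT INTO data VALUES (?, ?, ?, ?, ?, ?)', rows)

# Sorting, counting and searching now happen in SQL:
for row in conn.execute('SELECT col1, COUNT(*) FROM data GROUP BY col1'):
    print(row)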
I think using a db for 50 lines and 6 columns is overkill, so here's my idea:
from __future__ import print_function
import os
from operator import itemgetter
def get_records_from_file(path_to_file):
    """
    Read a tab-delimited file and return a
    list of dictionaries representing the data.
    """
    records = []
    with open(path_to_file, 'r') as f:
        # Use the first line to get names for columns;
        # strip the trailing newline so the last name is clean
        fields = [e.strip().lower() for e in f.readline().split('\t')]
        # Iterate over the rest of the lines and store records
        for line in f:
            record = {}
            for i, field in enumerate(line.rstrip('\n').split('\t')):
                record[fields[i]] = field
            records.append(record)
    return records
if __name__ == '__main__':
    path = os.path.join(os.getcwd(), 'so.txt')
    records = get_records_from_file(path)
    print('Number of records: {0}'.format(len(records)))
    s = sorted(records, key=itemgetter('id'))
    print('Sorted: {0}'.format(s))
For storing records for later use, look into Python's pickle library--that'll allow you to preserve them as Python objects.
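A quick sketch of the pickle suggestion, reusing the records list from above (the file name is an assumption):

import pickle

# Save the parsed records for a later run...
with open('records.pkl', 'wb') as f:
    pickle.dump(records, f)

# ...and load them back as the same list of dicts.
with open('records.pkl', 'rb') as f:
    records = pickle.load(f)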
Also, note I don't have Python 3 installed on the computer I'm using right now, but I'm pretty sure this'll work on Python 2 or 3.