Speed differences and errors when creating a pandas data frame - python

I have been trying to convert data obtained from a Google sheet into a pandas dataframe.
My solution works:
header = data[0] # The first row
all_other_rows = data[1:] # All other rows
df = pd.DataFrame(all_other_rows, columns=header)
However, I don't understand why this code failed:
df = pd.DataFrame(data[0], columns=data[1:])
The initial error is "Shape of passed values is (4, 1), indices imply (4, 4)", and even after resolving that as suggested in this answer, by adding brackets around data[0], building the frame takes 5-10 minutes and the df loads incorrectly. What is the difference?
Extra details of this code:
The data was imported with this code:
gc = gspread.authorize(credentials)
wb = gc.open_by_key(spreadsheet_key)
ws = wb.worksheet(worksheet_name)
data = ws.get_all_values()
Here's sample data:
[['id', 'other_id', 'another_id', 'time_of_determination'],
['63409', '1', '', '2019-11-14 22:01:19.386903+00'],
['63499', '1', '8', '2019-11-14 22:01:19.386903+00'],
['63999', '1', '', '2019-11-14 22:01:19.386903+00'],
['69999', '1', '', '2019-11-14 22:01:19.386903+00']]

You can download the Google Sheet in MS Excel format. Then try:
import pandas as pd
df = pd.read_excel(excel_file)
You don't have to specify the columns explicitly for this; read_excel will detect them automatically.
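A minimal sketch of that route, where "my_sheet.xlsx" is just a placeholder name for the downloaded export:
import pandas as pd
# "my_sheet.xlsx" stands in for whatever you name the downloaded export of the Google Sheet
df = pd.read_excel("my_sheet.xlsx")  # the first row is used as the header by default
print(df.head())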
Alternatively, I guess you wanted something like this. The issue is probably with your rows and columns selection:
data = [['id', 'other_id', 'another_id', 'time_of_determination'],
['63409', '1', '', '2019-11-14 22:01:19.386903+00'],
['63499', '1', '8', '2019-11-14 22:01:19.386903+00'],
['63999', '1', '', '2019-11-14 22:01:19.386903+00'],
['69999', '1', '', '2019-11-14 22:01:19.386903+00']]
import pandas as pd
header = data[0] # The first row
all_other_rows = data[1:] # All other rows
pd.DataFrame(all_other_rows, columns=header)
Output:
      id other_id another_id          time_of_determination
0  63409        1             2019-11-14 22:01:19.386903+00
1  63499        1          8  2019-11-14 22:01:19.386903+00
2  63999        1             2019-11-14 22:01:19.386903+00
3  69999        1             2019-11-14 22:01:19.386903+00
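As for why the original call errored: pd.DataFrame(data[0], columns=data[1:]) passes the header row as the data (a single column of four values) while asking pandas to build four column labels from the data rows, hence "Shape of passed values is (4, 1), indices imply (4, 4)". A minimal sketch of the two orientations, using the sample data above (exact messages may vary by pandas version):
import pandas as pd

data = [['id', 'other_id', 'another_id', 'time_of_determination'],
        ['63409', '1', '', '2019-11-14 22:01:19.386903+00'],
        ['63499', '1', '8', '2019-11-14 22:01:19.386903+00'],
        ['63999', '1', '', '2019-11-14 22:01:19.386903+00'],
        ['69999', '1', '', '2019-11-14 22:01:19.386903+00']]

# Swapped arguments: the header becomes a single 4-value column while
# columns=data[1:] implies 4 columns, so the shapes do not line up.
try:
    pd.DataFrame(data[0], columns=data[1:])
except ValueError as exc:
    print(exc)

# Correct orientation: the data rows are the values, the header names the columns.
print(pd.DataFrame(data[1:], columns=data[0]))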

Related

Comparing date columns between two dataframes

The project I'm working on requires me to find out which 'project' has been updated since the last time it was processed. For this purpose I have two dataframes which both contain three columns, the last one of which is a date signifying the last time a project is updated. The first dataframe is derived from a query on a database table which records the date a 'project' is updated. The second is metadata I store myself in a different table about the last time my part of the application processed a project.
I think I came pretty far but I'm stuck on the following error, see the code provided below:
lastmatch = pd.DataFrame({
    'projectid': ['1', '2', '2', '3'],
    'stage': ['c', 'c', 'v', 'v'],
    'lastmatchdate': ['2020-08-31', '2013-11-24', '2013-11-24', '2020-08-31']
})
lastmatch['lastmatchdate'] = pd.to_datetime(lastmatch['lastmatchdate'])

processed = pd.DataFrame({
    'projectid': ['1', '2'],
    'stage': ['c', 'v'],
    'process_date': ['2020-08-30', '2013-11-24']
})
processed['process_date'] = pd.to_datetime(processed['process_date'])

unprocessed = lastmatch[~lastmatch.isin(processed)].dropna()

processed.set_index(['projectid', 'stage'], inplace=True)
lastmatch.set_index(['projectid', 'stage'], inplace=True)
processed.sort_index(inplace=True)
lastmatch.sort_index(inplace=True)

print(lastmatch['lastmatchdate'])
print(processed['process_date'])

to_process = lastmatch.loc[lastmatch['lastmatchdate'] > processed['process_date']]
The result I want to achieve is a dataframe containing the rows where the 'lastmatchdate' is greater than the date that the project was last processed (process_date). However this line:
to_process = lastmatch.loc[lastmatch['lastmatchdate'] > processed['process_date']]
produces a ValueError: Can only compare identically-labeled Series objects. I think there might be some syntax I don't know about or have gotten wrong.
The output I expect is in this case:
                lastmatchdate
projectid stage
1         c        2020-08-31
So concretely the question is: how do I get a dataframe containing only the rows of one dataframe whose (datetime) value in column a is greater than the value in column b of the other dataframe?
merged = pd.merge(processed, lastmatch, left_index=True, right_index=True)
merged = merged.assign(to_process=merged['lastmatchdate'] > merged['process_date'])
You will get the following:
                process_date lastmatchdate  to_process
projectid stage
1         c       2020-08-30    2020-08-31        True
2         v       2013-11-24    2013-11-24       False
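To keep only the rows actually flagged for processing (and just the date column, matching the expected output above), you can then filter on that flag; a small follow-up using the merged frame from above:
# Keep the rows where to_process is True and show only the last-match date
to_process = merged.loc[merged['to_process'], ['lastmatchdate']]
print(to_process)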
You've received the ValueError because you tried to compare Series from two differently-labeled dataframes; if you want to compare two dataframes row by row, merge them first:
lastmatch = pd.DataFrame({
    'projectid': ['1', '2', '2', '3'],
    'stage': ['c', 'c', 'v', 'v'],
    'lastmatchdate': ['2020-08-31', '2013-11-24', '2013-11-24', '2020-08-31']
})
lastmatch['lastmatchdate'] = pd.to_datetime(lastmatch['lastmatchdate'])

processed = pd.DataFrame({
    'projectid': ['1', '2'],
    'stage': ['c', 'v'],
    'process_date': ['2020-08-30', '2013-11-24']
})
processed['process_date'] = pd.to_datetime(processed['process_date'])
df = pd.merge(lastmatch, processed, on=['stage', 'projectid'])
df = df[df.lastmatchdate > df.process_date]
print(df)
  projectid stage lastmatchdate process_date
0         1     c    2020-08-31   2020-08-30
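If you want the result in the same indexed shape as the expected output above, you can set the index on that merged result afterwards; a small follow-up sketch using the df from above:
# Index by project and stage and keep just the last-match date
to_process = df.set_index(['projectid', 'stage'])[['lastmatchdate']]
print(to_process)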

How to loop through multiple data frames to select a data frame based on row criteria?

I have a few pandas dataframes in Python, and I want to loop through them to find out which dataframes meet my row criteria and save them in a new dataframe.
d = {'Count' : ['10', '11', '12', '13','13.4','12.5']}
df_1= pd.DataFrame(data=d)
df_1
d = {'Count' : ['10', '-11', '-12', '13','16','2']}
df_2= pd.DataFrame(data=d)
df_2
Here is the logic I want to use, but it does not contain the right syntax,
for df in (df_1,df_2)
if df['Count'][0] >0 and df['Count'][1] >0 and df['Count'][2]>0 and df['Count'][3]>0
and (df['Count'][4] is between df['Count'][3]+0.5 and df['Count'][3]-0.5) is True:
df.save
The correct output is df_1... because it meets my condition. How do I create a new DataFrame or LIST to save the result as well?
Let me know if you have any questions in the comments. The main updates I made to your code were:
Replacing your chained indexing with .loc
Consolidating your first few separate and-ed comparisons into a comparison on a slice of the series, reduced to a single True/False with .all()
Code below:
import pandas as pd

# df_1 & df_2 input taken from you
d = {'Count': ['10', '11', '12', '13', '13.4', '12.5']}
df_1 = pd.DataFrame(data=d)
d = {'Count': ['10', '-11', '-12', '13', '16', '2']}
df_2 = pd.DataFrame(data=d)

# my solution here
df_1['Count'] = df_1['Count'].astype('float')
df_2['Count'] = df_2['Count'].astype('float')

my_dataframes = {'df_1': df_1, 'df_2': df_2}
good_dataframes = []
for df_name, df in my_dataframes.items():
    if (df.loc[0:3, 'Count'] > 0).all() and \
            (df.loc[3, 'Count'] - 0.5 <= df.loc[4, 'Count'] <= df.loc[3, 'Count'] + 0.5):
        good_dataframes.append(df_name)

good_dataframes_df = pd.DataFrame({'good': good_dataframes})
TEST:
>>> print(good_dataframes_df)
good
0 df_1
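If you also want the matching DataFrames themselves rather than just their names, you can look them up in the same dict; a small follow-up sketch using the names collected above:
# Collect the surviving frames by name (empty if nothing matched)
good = {name: my_dataframes[name] for name in good_dataframes}
# Optionally stack them into one frame, keyed by their source name
if good:
    combined = pd.concat(good)
    print(combined)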

Python csv file convert string to float

I have a csv file full of data, which is all type string. The file is called Identifiedλ.csv.
Here is some of the data from the csv file:
['Ref', 'Ion', 'ULevel', 'UConfig.', 'ULSJ', 'LLevel', 'LConfig.', 'LLSJ']
['12.132', 'Ne X', '4', '2p1', '2P3/2', '1', '1s1', '1S0']
['12.132', 'Ne X', '3', '2p1', '2P3/2', '1', '1s1', '1S0']
['12.846', 'Fe XX', '58', '1s2.2s2.2p2.3d1', '4P5/2', '1', '1s2.2s2.2p3', '4S3/2']
What I would like to do is read the file and search for a number in the 'Ref' column, for example 12.846, and if the number I search for matches a number in the file, print the whole row containing that number.
eg. something like:
csv_g = csv.reader(open('Identifiedλ.csv', 'r'), delimiter=",")
for row in csv_g:
    if 12.846 == (row[0]):
        print(row)
And it would return (hopefully)
['12.846', 'Fe XX', '58', '1s2.2s2.2p2.3d1', '4P5/2', '1', '1s2.2s2.2p3', '4S3/2']
However this returns nothing and I think it's because the 'Ref' column is type string and the number I search is type float. I'm trying to find a way to convert the string to float but am seemingly stuck.
I've tried:
df = pd.read_csv('Identifiedλ.csv', dtype = {'Ref': np.float64,})
and
array = b = np.asarray(array, dtype=np.float64, order='C')
but am confused on how to incorporate this with the rest of the search.
Any tips would be most helpful! Thank you!
Python has a function to convert strings to floats. For example, the following evaluates to True:
float('3.14')==3.14
I would try this conversion while comparing the value in the first column.
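A minimal sketch of what that could look like in the loop from the question (the header row is skipped, since its 'Ref' cell is not a number; the filename and target value are the ones from the question):
import csv

target = 12.846
with open('Identifiedλ.csv', 'r', newline='') as f:
    for row in csv.reader(f, delimiter=','):
        try:
            ref = float(row[0])   # convert the 'Ref' string to a float
        except ValueError:
            continue              # skip the header (or any non-numeric) row
        if ref == target:
            print(row)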

Create new columns and add keys and values from a dictionary in new columns for an existing csv file

I have a csv file like this:
id;verbatim
1;je veux manger
2;tu as manger
I have a script which returns a dictionary like this:
dico_phrases = {"['manger']": ['7', '1', '0'], "['être', 'laid']": ['0', '21', '1041']}
I would like to add 4 new columns like this:
id;verbatim;key;value1;value2;value3
1;je veux manger
2;tu as manger
And then fill these columns from my dictionary, like this:
id;verbatim;key;value1;value2;value3
1;je veux manger;manger;7;1;0
2;tu as manger;être laid;0;21;1041
Below is the script I am using to write my dictionary:
with open('output.csv', 'wb') as f:
    w = csv.writer(f)
    w.writerow(dico_phrases.keys())
    w.writerow(dico_phrases.values())
I get this:
['manger'],"['être', 'laid']"
"['7', '1', '0']","['0', '21', '1041']"
It is not quite what I had imagined.
Consider using pandas for this:
import pandas as pd

df = pd.read_csv("input.csv", sep=';', index_col='id')
dico_phrases = {"['manger']": ['7', '1', '0'], "['être', 'laid']": ['0', '21', '1041']}
df['key'] = [" ".join(eval(x)) for x in dico_phrases.keys()]
df = df.join(pd.DataFrame([x for x in dico_phrases.values()], index=df.index))
df.to_csv("output.csv")
Output:
id,verbatim,key,0,1,2
1,je veux manger,manger,7,1,0
2,tu as manger,être laid,0,21,1041
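If those stringified list keys come from untrusted input, ast.literal_eval is a safer way than eval to turn them back into lists. A minimal sketch of that variation, assuming (as above) the dictionary has one entry per csv row, in row order, and naming the value columns value1-value3 as in the question:
import ast
import pandas as pd

dico_phrases = {"['manger']": ['7', '1', '0'], "['être', 'laid']": ['0', '21', '1041']}

df = pd.read_csv("input.csv", sep=';', index_col='id')             # same input file as above
df['key'] = [" ".join(ast.literal_eval(k)) for k in dico_phrases]  # parse "['manger']" safely
values = pd.DataFrame(list(dico_phrases.values()),
                      columns=['value1', 'value2', 'value3'],
                      index=df.index)
df = df.join(values)
df.to_csv("output.csv")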

Tab-delimited Python 3 .txt file reading

I am having a bit of trouble getting started on an assignment. We are issued a tab-delimited .txt file with 6 columns of data and around 50 lines of this data. I need help starting a list to store this data in for later recall. Eventually I will need to be able to list all the contents of any particular column and sort it, count it, etc. Any help would be appreciated.
Edit: I really haven't done much besides research on this kind of stuff. I know I'll be looking into csv, and I have done single-column .txt files before, but I'm not sure how to tackle this situation. How will I give names to the separate columns? How will I tell the program when one row ends and the next begins?
The dataframe structure in pandas basically does exactly what you want. It's highly analogous to the data frame in R if you're familiar with that. It has built-in options for subsetting, sorting, and otherwise manipulating tabular data.
It reads directly from csv and even automatically reads in column names. You'd call:
pd.read_csv(yourfilename,
            sep='\t',  # makes it tab delimited
            header=0)  # makes the first row the header row
Works in Python 3.
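For example, assuming the assignment file is called data.txt (a placeholder name) and its first row holds the column names, something like this would cover the listing, sorting, and counting the question mentions:
import pandas as pd

# "data.txt" is a placeholder for the tab-delimited assignment file
df = pd.read_csv("data.txt", sep="\t", header=0)

print(df.columns.tolist())            # names of the six columns
first = df.columns[0]
print(df[first].tolist())             # all values in one column
print(df.sort_values(first).head())   # rows sorted by that column
print(df[first].value_counts())       # counts of each value in that column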
Let's say you have a csv like the following.
1 2 3 4 5 6
1 2 3 4 5 6
1 2 3 4 5 6
1 2 3 4 5 6
1 2 3 4 5 6
You can read each row into a dictionary like so:
>>> import csv
>>> reader = csv.DictReader(open('test.csv', 'r'), fieldnames=['col1', 'col2', 'col3', 'col4', 'col5', 'col6'], dialect='excel-tab')
>>> for row in reader:
...     print(row)
{'col1': '1', 'col2': '2', 'col3': '3', 'col4': '4', 'col5': '5', 'col6': '6'}
{'col1': '1', 'col2': '2', 'col3': '3', 'col4': '4', 'col5': '5', 'col6': '6'}
{'col1': '1', 'col2': '2', 'col3': '3', 'col4': '4', 'col5': '5', 'col6': '6'}
{'col1': '1', 'col2': '2', 'col3': '3', 'col4': '4', 'col5': '5', 'col6': '6'}
{'col1': '1', 'col2': '2', 'col3': '3', 'col4': '4', 'col5': '5', 'col6': '6'}
But the pandas library might be better suited for this. http://pandas.pydata.org/pandas-docs/stable/io.html#csv-text-files
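For reference, a rough pandas equivalent of the DictReader example above (reading the same tab-separated test.csv, which has no header row) would be:
import pandas as pd

# Same tab-separated test.csv as above, with no header row
df = pd.read_csv('test.csv', sep='\t', header=None,
                 names=['col1', 'col2', 'col3', 'col4', 'col5', 'col6'])
print(df)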
Sounds like a job better suited to a database. You could just use something like PostgreSQL's COPY FROM operation to import the CSV data into a table, then use Python + SQL for all your sorting, searching and matching needs.
If you feel a real database is overkill, there are still options like SQLite and BerkeleyDB, which both have Python modules.
EDIT: BerkeleyDB is deprecated but anydbm (dbm in Python 3) is similar in concept.
I think using a db for 50 lines and 6 columns is overkill, so here's my idea:
from __future__ import print_function
import os
from operator import itemgetter

def get_records_from_file(path_to_file):
    """
    Read a tab-delimited file and return a
    list of dictionaries representing the data.
    """
    records = []
    with open(path_to_file, 'r') as f:
        # Use the first line to get names for columns
        fields = [e.lower() for e in f.readline().rstrip('\n').split('\t')]
        # Iterate over the rest of the lines and store records
        for line in f:
            record = {}
            for i, field in enumerate(line.rstrip('\n').split('\t')):
                record[fields[i]] = field
            records.append(record)
    return records

if __name__ == '__main__':
    path = os.path.join(os.getcwd(), 'so.txt')
    records = get_records_from_file(path)
    print('Number of records: {0}'.format(len(records)))
    s = sorted(records, key=itemgetter('id'))
    print('Sorted: {0}'.format(s))
For storing records for later use, look into Python's pickle library--that'll allow you to preserve them as Python objects.
Also, note I don't have Python 3 installed on the computer I'm using right now, but I'm pretty sure this'll work on Python 2 or 3.
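A minimal sketch of the pickle suggestion, assuming the records list returned by get_records_from_file above:
import pickle

# Save the parsed records for later use
with open('records.pkl', 'wb') as out:
    pickle.dump(records, out)

# ...and load them back in a later run
with open('records.pkl', 'rb') as src:
    records = pickle.load(src)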
