I run a for loop that executes a few SQL queries and capture each result in a DataFrame (inside the loop), as shown below for two validations.
DataFrame for Test1:

index  column1  column2
0      jack     100
1      bill     200
2      Tom      300

DataFrame for Test2:

index  column1
0      102345
1      102345
I have to write the results of the DataFrame for each test to another table in Oracle. To do this, I need the column names. I cannot tell how many columns are present at a given point in the loop, because the DataFrame can have 1-5 columns depending on the SQL that was run. Is there a way to do this?
Code for reading from table and writing to DataFrame:
def get_src_query_metadata(cursor, sql_query):
    cursor.execute(sql_query)
    columns = [col[0] for col in cursor.description]
    cursor.rowfactory = lambda *args: dict(zip(columns, args))
    data = pd.DataFrame(cursor.fetchall())
    return data

def get_target_query_metadata(cursor, sql_query):
    cursor.execute(sql_query)
    columns = [col[0] for col in cursor.description]
    cursor.rowfactory = lambda *args: dict(zip(columns, args))
    data = pd.DataFrame(cursor.fetchall())
    return data

def main():
    _JobDict_src = get_src_query_metadata(cursor, src_query[i])
    _JobDict_tgt = get_target_query_metadata(cursor, target_query[i])
How do I get the column names and their values assigned to separate variables?
You can list and count the column names with this loop:
coln = 0
for col in df.columns:
    coln += 1
    print(col)
print(coln)
and list the data types with:
for col in df.dtypes:
    print(col)
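A more direct way to handle the variable number of columns is to read the names and values straight off the DataFrame and build the Oracle INSERT dynamically. The sketch below is only an illustration: RESULT_TABLE is a hypothetical target table name, and the :1/:2 positional bind style assumes a cx_Oracle-like driver.

data = get_src_query_metadata(cursor, src_query[i])

# column names and row values, however many columns the query returned
columns = list(data.columns)    # e.g. ['column1', 'column2']
rows = data.values.tolist()     # list of row-value lists

# build the INSERT dynamically so it works for 1-5 columns
col_list = ", ".join(columns)
bind_list = ", ".join(":{0}".format(n + 1) for n in range(len(columns)))
insert_sql = "INSERT INTO RESULT_TABLE ({0}) VALUES ({1})".format(col_list, bind_list)

cursor.executemany(insert_sql, rows)
# remember to commit the transaction on the connection afterwards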
My dataframe example (over 35k rows):
stop_id time
7909 2022-04-06T03:47:00+03:00
7909 2022-04-06T04:07:00+03:00
1009413 2022-04-06T04:10:00+03:00
1002246 2022-04-06T04:19:00+03:00
1009896 2022-04-06T04:20:00+03:00
I want to conduct some operations on this dataframe and then split it based on the value of stop_id. So, assuming there are 50 unique stop_id values, I want to get 50 separate csv/excel files, each containing the data for one unique stop_id. How can I do this?
Using groupby:
# group by 'stop_id' column
groups = df.groupby("stop_id")
Then iterate over the groups, naming each file after the group's stop_id with an f-string:
for name, group in groups:
    # write each group to its own file
    group.to_csv(f'{name}.csv')
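If Excel files are preferred over CSV, the same loop works with to_excel (this assumes an Excel writer engine such as openpyxl is installed):

for name, group in groups:
    group.to_excel(f'{name}.xlsx', index=False)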
I used the groupby and first methods here (note that first keeps only the first row for each stop_id).
import pandas as pd

df = pd.DataFrame({"stop_id": [7909, 7909, 1009413, 1002246, 1009896],
                   "time": ["2022-04-06T03:47:00+03:00", "2022-04-06T04:10:00+03:00",
                            "2022-04-06T04:07:00+03:00", "2022-04-06T04:19:00+03:00",
                            "2022-04-06T04:20:00+03:00"]})
df = df.groupby("stop_id")
df = df.first().reset_index()
print(df)

for idx, item in enumerate(df["stop_id"]):
    df_inner = pd.DataFrame({"stop_id": [item], "time": [df["time"].values[idx]]})
    df_inner.to_csv(f'{item}.csv', index=False)
   stop_id                       time
0     7909  2022-04-06T03:47:00+03:00
1  1002246  2022-04-06T04:19:00+03:00
2  1009413  2022-04-06T04:07:00+03:00
3  1009896  2022-04-06T04:20:00+03:00
I'd like to split a column of a dataframe into two separate columns. Here is what my dataframe looks like (only the first 3 rows):
I'd like to split the column referenced_tweets into two columns: type and id in a way that for example, for the first row, the value of the type column would be replied_to and the value of id would be 1253050942716551168.
Here is what I've tried:
df[['type', 'id']] = df['referenced_tweets'].str.split(',', n=1, expand=True)
but I get the error:
ValueError: Columns must be the same length as key
(I think I get this error because the type in the referenced_tweets column is not always replied_to; e.g., it can be retweeted, and therefore the lengths would be different.)
Why not get the values from the dict and add them as two new columns?
def unpack_column(df_series, key):
    """Unpack the key's value from each entry of the column, skipping NaN values."""
    return [None if pd.isna(value) else value[0][key] for value in df_series]

df['type'] = unpack_column(df['referenced_tweets'], 'type')
df['id'] = unpack_column(df['referenced_tweets'], 'id')
or in a one-liner (this assumes there are no missing values in referenced_tweets):
df[['type', 'id']] = df['referenced_tweets'].apply(lambda x: pd.Series([x[0]['type'], x[0]['id']]))
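For reference, here is a tiny reproduction of the assumed structure (each cell of referenced_tweets is a list holding one dict with 'type' and 'id' keys, which is an assumption based on the question; the second row's values are made up):

import pandas as pd

df = pd.DataFrame({
    "referenced_tweets": [
        [{"type": "replied_to", "id": "1253050942716551168"}],
        [{"type": "retweeted", "id": "1111111111111111111"}],
    ]
})

df[['type', 'id']] = df['referenced_tweets'].apply(
    lambda x: pd.Series([x[0]['type'], x[0]['id']])
)
print(df[['type', 'id']])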
I want to read two columns (Latitude and Longitude) from my dataframe df1 in pandas, create a new column zipcode, and add a zipcode for each row in the dataframe.
I think this webpage is useful: https://postcodes.readthedocs.io/en/latest/
df1 = df[['Col1', 'Col2', 'Col3', 'Col4', 'Col5', 'Latitude', 'Longitude']]

for row in df1[7]:
    # Try to
    try:
        # get lat/long and find the post code
        postcodes.get_nearest(lat, lng)
    # But if you get an error
    except:
        # error
        pass

# Create a new column post code in df1
df1['postcode'] = zipcode
You have to use apply to create a new column based on other data in the dataframe.
def getPostcode(row):
    try:
        row['postcode'] = postcodes.get_nearest(row['Latitude'], row['Longitude'])
    except:
        print('Error for data {0}'.format(row))
    return row
Then add this line to the main code after df1 is initialised (apply returns a new dataframe, so assign it back):
df1 = df1.apply(getPostcode, axis=1)
You can try:
df1['postcode'] = df1.apply(
    lambda x: postcodes.get_nearest(x['Latitude'], x['Longitude']),
    axis=1
)
You can think of the apply function as looping over every row or column of the dataframe and executing a function (in this case, the lambda function).
It loops over rows when axis=1 and over columns when axis=0 (the default).
The lambda function receives each row as x and passes its 'Latitude' and 'Longitude' values to get_nearest.
Depending on the size of your dataframe, it may take a while, though.
I tested the postcodes library here and it didn't work for me, but if the library is working for you, this code should do fine.
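To make the axis behaviour concrete, here is a minimal sketch with toy data (it does not use the postcodes library):

import pandas as pd

df_demo = pd.DataFrame({"a": [1, 2], "b": [10, 20]})

# axis=1: the function receives one row at a time
row_sums = df_demo.apply(lambda row: row["a"] + row["b"], axis=1)
print(row_sums.tolist())   # [11, 22]

# axis=0 (the default): the function receives one column at a time
col_sums = df_demo.apply(lambda col: col.sum(), axis=0)
print(col_sums.tolist())   # [3, 30]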
I have a pandas dataframe grouped by certain columns. Now I want to insert the mean of the numeric values of four adjacent columns into a new column. This is what I did:
df = pd.read_csv(filename)
# in this line I extract a unique ID from the filename
id = re.search(r'(\w\w\w)', filename).group(1)
Files look like this:
col1 | col2 | col3
-----------------------
str1a | str1b | float1
My idea then was the following:
# get the numeric values
df2 = pd.DataFrame(df.groupby(['col1', 'col2']).mean()['col3']).T
# insert the id into a new column
df2.insert(0, 'ID', id)
Now loop over all the values:
for j in range(len(df2.values)):
    for k in df['col1'].unique():
        df2.insert(j + 5, (k, 'mean'), df2.values[j])
df2.to_excel('text.xlsx')
But I get the following error, referring to the line with df2.insert:
TypeError: not all arguments converted during string formatting
and
if not allow_duplicates and item in self.items:
    # Should this be a different kind of error??
    raise ValueError('cannot insert %s, already exists' % item)
I am not sure what string formatting refers to here, since I have only numerical values being passed around.
The final output should have all values from col3 in a single row (indexed by id) and every fifth column should be the inserted mean value of the four preceding values.
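For context on the errors above: DataFrame.insert raises a ValueError when the column label already exists (unless allow_duplicates=True), and because the label here is a tuple like (k, 'mean'), formatting that tuple into the 'cannot insert %s, already exists' message with the % operator is what likely produces the "not all arguments converted during string formatting" TypeError. A minimal sketch of the duplicate check, with made-up toy data:

import pandas as pd

df2 = pd.DataFrame({'ID': ['abc'], 'val1': [1.0], 'val2': [2.0]})

# inserting a brand-new label works
df2.insert(2, 'mean_1', 1.5)

# inserting a label that already exists raises
try:
    df2.insert(2, 'mean_1', 1.5)
except ValueError as exc:
    print(exc)   # cannot insert mean_1, already exists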
If I had to work with files like yours, I would write a function to convert them to CSV, something like this:
import pandas

data = []
with open(filename) as file:
    for lineInFile in file.read().splitlines():
        lineInFile_splited = lineInFile.split('|')
        if len(lineInFile_splited) > 1:  # keep only data lines, not the '------' separator
            data.append([part.strip() for part in lineInFile_splited])

df = pandas.DataFrame(data, columns=['col1', 'col2', 'col3'])
Hope it helps!
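A shorter alternative, assuming the dashed separator is always the second line of the file, is to let pandas parse the pipe-separated file directly (column values may still need their surrounding spaces stripped):

import pandas as pd

df = pd.read_csv(filename, sep='|', skiprows=[1], skipinitialspace=True)
df.columns = [c.strip() for c in df.columns]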
I have a data frame as shown below
I have a problem iterating over the rows. For every row fetched, I want to return the key value. For example, in the second row, for the 2016-08-31 00:00:01 entry, df1 and df3 have a compass value of 4.0, so I want to return the keys that share the same compass value, which are df1 and df3 in this case.
I have been iterating over the rows using
for index, row in df.iterrows():
Update
Okay, now that I understand your question better, this will work for you.
First, change the shape of your dataframe with
dfs = df.stack().swaplevel(axis=0)
This will make your dataframe look like:
Then you can iterate over the rows as before and extract the information you want. I'm just using print statements for everything, but you can put the results into a more appropriate data structure.
for index, row in dfs.iterrows():
    dup_filter = row.duplicated(keep=False)
    dfss = row[dup_filter].index.values
    print("Attribute:", index[0])
    print("Index:", index[1])
    print("Matches:", dfss, "\n")
which will print out something like
.....
Attribute: compass
Index: 5
Matches: ['df1' 'df3']
Attribute: gyro
Index: 5
Matches: ['df1' 'df3']
Attribute: accel
Index: 6
Matches: ['df1' 'df3']
....
You could also do it one attribute at a time by
dfs_compass = df.stack().swaplevel(axis=0).loc['compass']
and iterate through the rows with just the index.
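A sketch of that per-attribute variant, reusing the duplicate-value logic from above (it assumes the same MultiIndex column layout of source frames df1/df2/df3 by attributes compass/gyro/accel):

dfs_compass = df.stack().swaplevel(axis=0).loc['compass']

for index, row in dfs_compass.iterrows():
    dup_filter = row.duplicated(keep=False)
    print("Index:", index, "Matches:", row[dup_filter].index.values)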
Old
If I understand your question correctly, you want to return the indexes of rows that have matching values on the second level of your columns, i.e. ('compass', 'accel', 'gyro'). The following will work.
compass_match_indexes = []
for index, row in df.iterrows():
    match_filter = row[:, 'compass'].duplicated()
    if len(row[:, 'compass'][match_filter]) > 0:
        compass_match_indexes.append(index)
You can then select from your dataframe with that list, like df.loc[compass_match_indexes].
--
As another approach, you could take the transpose of your DataFrame with df.T and then use the duplicated function.
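A rough sketch of that transpose idea, again assuming the MultiIndex column layout of source frames by attributes; for each timestamp column it reports which source frames share the same compass value:

dft = df.T                            # rows: (source, attribute), columns: timestamps
compass = dft.xs('compass', level=1)  # rows: df1/df2/df3, columns: timestamps

for ts in compass.columns:
    col = compass[ts]
    dup = col.duplicated(keep=False)
    print(ts, "matches:", col[dup].index.tolist())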