I have saved out a data column as follows:
[[A,1], [B,5], [C,18]....]
I was hoping to group A, B, C as shown above into a Category column and 1, 5, 18 into a Values/Series column, for updating my PowerPoint chart using python-pptx.
Example:
Category  Values
A         1
B         5
Is there any way I can do this? Currently the example above is also extracted as a string, so I believe I have to convert it to a list first?
Thanks in advance!
Try parsing your string (which represents a list of lists), then create your DataFrame from the real list:
import pandas as pd
import re

s = '[[A,1], [B,5], [C,18]]'
cols = ['Category', 'Values']
# strip the outer brackets, capture each bracketed pair, then split on the comma
data = [row.split(',') for row in re.findall(r'\[([^]]+)\]', s[1:-1])]
df = pd.DataFrame(data, columns=cols)
print(df)
# Output:
  Category Values
0        A      1
1        B      5
2        C     18
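Note that after the split both columns hold strings; if the chart needs numbers, a quick follow-up conversion (a minimal sketch, assuming the values are all integers) is:
df['Values'] = df['Values'].astype(int)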
You should be able to just use pandas.DataFrame and pass in your data, unless I'm misunderstanding the question. Anyway, try:
df = pandas.DataFrame(data=d, columns=['Category', 'Value'])
where d is your list of tuples.
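For example, a minimal sketch (assuming d has already been parsed into a real Python list of tuples rather than a string):
import pandas as pd

d = [('A', 1), ('B', 5), ('C', 18)]
df = pd.DataFrame(data=d, columns=['Category', 'Value'])
print(df)  # one row per tuple, with Category and Value columns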
from prettytable import PrettyTable

column = [["A", 1], ["B", 5], ["C", 18]]
columnname = []
columnvalue = []
t = PrettyTable(['Category', 'Values'])
for data in column:
    columnname.append(data[0])
    columnvalue.append(data[1])
    t.add_row([data[0], data[1]])
print(t)
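Since the end goal is updating a python-pptx chart, the two lists built above can feed new chart data directly. A minimal sketch, assuming chart is an existing chart object you already grabbed from a slide (that variable is hypothetical here):
from pptx.chart.data import CategoryChartData

chart_data = CategoryChartData()
chart_data.categories = columnname              # ['A', 'B', 'C']
chart_data.add_series('Series 1', columnvalue)  # [1, 5, 18]
chart.replace_data(chart_data)                  # chart: a chart object from your slide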
I have a data file with column names like this (numbers in the name from 1 to 32):
inlet_left_cell-10<stl-unit=m>-imprint)
inlet_left_cell-11<stl-unit=m>-imprint)
inlet_left_cell-12<stl-unit=m>-imprint)
...
inlet_left_cell-9<stl-unit=m>-imprint)
(rows of data under each column)
I would like to sort the columns (together with their data) from left to right in Python, based on the number in the column name; each whole column needs to move to its sorted position. So the desired order is xxx-1xxx, xxx-2xxx, xxx-3xxx, ..., xxx-32xxx:
inlet_left_cell-1<stl-unit=m>-imprint)
inlet_left_cell-2<stl-unit=m>-imprint)
inlet_left_cell-3<stl-unit=m>-imprint)
...
inlet_left_cell-32<stl-unit=m>-imprint)
(rows of data under each column)
Is there any way to do this in Python? Thanks.
Here is a solution:
import numpy as np
import pandas as pd

# Some random data
data = np.random.randint(1, 10, size=(100, 32))

# Set up column names as given in the problem, randomly ordered
columns = [f'inlet_left_cell-{x}<stl-unit=m>-imprint)' for x in range(1, 33)]
np.random.shuffle(columns)

# Create the dataframe
df = pd.DataFrame(data, columns=columns)
df.head()

# Sort the columns into the required order: pull the number out of each
# name (between the '-' and the '<'), map it back to the full column
# name, then select the columns in numeric order
col_nums = [int(x.split('-')[1].split('<')[0]) for x in df.columns]
column_map = dict(zip(col_nums, df.columns))
df = df[[column_map[i] for i in range(1, 33)]]
df.head()
There are many ways to do it; I'm just posting a simple one: extract the column names and sort them using natsort. Assuming the DataFrame is named df:
from natsort import natsorted, ns

dfl = list(df)                           # column names as a list
dfl = natsorted(dfl, alg=ns.IGNORECASE)  # natural sort on the embedded numbers
df_sorted = df[dfl]                      # rearrange the DataFrame
print(df_sorted)
If the column names differ only by this number, try a plain sort. (Note that a lexicographic sort only gives the numeric order when the numbers all have the same digit count, as in the example below; for 1-32 you would need a numeric key, as sketched after the example.)
import pandas as pd
data = pd.read_excel("D:\\..\\file_name.xlsx")
data = data.reindex(sorted(data.columns), axis=1)
For example:
data = pd.DataFrame(columns=["inlet_left_cell-23<stl-unit=m>-imprint)", "inlet_left_cell-47<stl-unit=m>-imprint)", "inlet_left_cell-10<stl-unit=m>-imprint)", "inlet_left_cell-12<stl-unit=m>-imprint)"])
print(data)
Empty DataFrame
Columns: [inlet_left_cell-23<stl-unit=m>-imprint), inlet_left_cell-47<stl-unit=m>-imprint), inlet_left_cell-10<stl-unit=m>-imprint), inlet_left_cell-12<stl-unit=m>-imprint)]
Index: []
After this:
data = data.reindex(sorted(data.columns), axis=1)
print(data)
Empty DataFrame
Columns: [inlet_left_cell-10<stl-unit=m>-imprint), inlet_left_cell-12<stl-unit=m>-imprint), inlet_left_cell-23<stl-unit=m>-imprint), inlet_left_cell-47<stl-unit=m>-imprint)]
Index: []
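When the numbers vary in digit count (1 through 32), a sketch of a numeric sort key (assuming the number always sits between the '-' and the '<' as in these names):
import re

data = data.reindex(sorted(data.columns, key=lambda c: int(re.search(r'-(\d+)<', c).group(1))), axis=1)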
Problem:
Executing this code:
import pandas as pd
data = {"1","2","3","4","5"}
index = ["1_i","2_i","3_i","4_i","5_i"]
df = pd.DataFrame(data,index=index)
print(df)
Results in this output:
0
1_i 4
2_i 3
3_i 5
4_i 1
5_i 2
Question:
Why aren't the values in order according to the index I set them to?
1 should be set to the index 1_i, 2 should be set to the index 2_i, etc.
The problem is that you are making the dataframe from a set, which is arbitrarily ordered. Try making your dataframe from a container that maintains order, like a list or tuple.
import pandas as pd
data = ["1","2","3","4","5"] # a list
index = ["1_i","2_i","3_i","4_i","5_i"]
df = pd.DataFrame(data, index=index)
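With a list, the values line up with the index in the order given:
print(df)
#      0
# 1_i  1
# 2_i  2
# 3_i  3
# 4_i  4
# 5_i  5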
BACKGROUND:
I have two columns, 'address' and 'raw_data'. (The examples here are just samples I made up; the original dataset is over 6M rows and in a different language.)
Problem:
I need to find all the rows where 'address' and 'raw_data' do not match, meaning some mistakes were made when logging the data from 'address' into 'raw_data'.
I'm fairly new to Pandas. My plan is to separate the 'raw_data' column by comma, then compare the newly produced columns with the original 'address' column (to see whether the 'address' column has that info; if not, that means a mistake was made?).
Like I said, I'm new to pandas, and this is what I have so far.
import pandas as pd
columns = ['address', 'raw_data']
df=pd.read_csv('address.csv', usecols=columns)
df = pd.concat([df['address'], df['raw_data'].str.split(',', expand=True)], axis=1)
Now the new columns have info like this: "CITY":"ATLANTA". I want the columns to have just ATLANTA, without the colons and 'CITY', in order to compare the info with the 'address' column.
How should I go on about it?
Also, at this point of my pandas learning experience, I do not yet know how to compare two columns. Could someone help a newbie out please? Thanks a lot!
PS: by comparison of two columns I meant to check whether one column has the characters in the second column, not to check whether the two columns are equal. Just want to point that out.
import numpy as np
import pandas as pd

df = pd.DataFrame([[2, 2], [3, 6], [1, 1]], columns=["col1", "col2"])
comparison_column = np.where(df["col1"] == df["col2"], True, False)
df["equal"] = comparison_column
   col1  col2  equal
0     2     2   True
1     3     6  False
2     1     1   True
I will use this data:
import numpy as np
import pandas as pd
j = {"address":"foo","b": "bar"}
j2 = {"address":"foo2","b": "bar2"}
values = [["foo", j], ["bar", j2]]
df = pd.DataFrame(data=values, columns=["address", "raw_data"])
df
address raw_data
0 foo {'address': 'foo', 'b': 'bar'}
1 bar {'address': 'foo2', 'b': 'bar2'}
I will expand the raw_data dicts into their own columns (with .values.tolist()) in another DataFrame (df2):
df2 = pd.DataFrame(df['raw_data'].values.tolist())
df2
address b
0 foo bar
1 foo2 bar2
To compare you use:
df.address == df2.address
0 True
1 False
If you need to save this in the original df, you can add a column:
df["result"] = df.address == df2.address
Since each raw_data entry is already a dict, you don't need to split on commas at all; just access its keys. You can map custom functions to columns with the apply function; in this case, define a function that accesses the keys of the dictionary and extracts the values.
df['address_raw'] = df['raw_data'].apply(lambda x: x['address'])
df['city_raw'] = df['raw_data'].apply(lambda x: x['CITY'])
df['addrline2_raw'] = df['raw_data'].apply(lambda x: x['ADDR_LINE_2'])
df['addrline3_raw'] = df['raw_data'].apply(lambda x: x['ADDR_LINE_3'])
df['utmnorthing_raw'] = df['raw_data'].apply(lambda x: x['UTM_NORTHING'])
These lines will create columns of each field in the dict and then you can just compare the ones like:
df['address'] == df['address_raw']
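Per the PS about containment rather than strict equality, a minimal row-wise sketch (assuming both values are strings; the match column name is hypothetical):
df['match'] = df.apply(lambda row: str(row['address_raw']) in str(row['address']), axis=1)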
I have a database in which many columns share the same name. The data frame is generated when I load the data in Python as per the code below:
import os
import pandas as pd
data_dir="D:\\userdata\\adbharga\\Desktop\\AVA\\PythonCoding\\VF-Aus\\4G Cell Graphs"
os.chdir(data_dir)
df = pd.read_csv('CA Throughput(Kbit_s) .csv',index_col=None, header=0)
The resulting data frame contains duplicated columns. Is there any way to avoid reading duplicate columns in Pandas, or to remove the duplicate columns after reading?
Please note: the column names are different once the data is read into Pandas (duplicates get renamed on read), so a command like df = df.loc[:, ~df.columns.duplicated()] won't work.
The actual database is very big and has many duplicate columns, all dates.
There are 2 ways you can do this.
Ignore columns when reading the data
pandas.read_csv has the argument usecols, which accepts an integer list.
So you can try:
# read once to work out the required columns:
# column 0, then every second column after it
df = pd.read_csv('file.csv', header=0)
cols = [0] + list(range(1, len(df.columns), 2))

# read again, keeping only those columns
df = pd.read_csv('file.csv', usecols=cols)
Remove columns from dataframe
You can use similar logic with pd.DataFrame.iloc to remove unwanted columns.
# cols as defined in previous example
df = df.iloc[:, cols]
One way to do it is to read only the header row and create a mask using drop_duplicates(). We pass this to usecols without needing to specify the indices beforehand, so it should be failsafe:
m = pd.read_csv(io.StringIO(data), nrows=1, header=None).T.drop_duplicates().index
df = pd.read_csv(io.StringIO(data), usecols=m)
Full example:
import io
import pandas as pd

data = '''\
Date,Value1,Date,Value2
2018-01-01,0,2018-01-01,1
2018-01-02,0,2018-01-02,1'''

# read just the header row, transpose so the names become rows,
# drop the duplicated names, and keep the surviving positions
m = pd.read_csv(io.StringIO(data), nrows=1, header=None).T.drop_duplicates().index
df = pd.read_csv(io.StringIO(data), usecols=m)
print(df)

#          Date  Value1  Value2
# 0  2018-01-01       0       1
# 1  2018-01-02       0       1
Another way would be to remove all columns with a dot in the name (pandas appends .1, .2, ... to duplicated column names when reading). This should work in most cases, as the dot is rarely used in original column names:
df = df.loc[:,~df.columns.str.contains('.', regex=False)]
Full example:
import io
import pandas as pd

data = '''\
Date,Value1,Date,Value2
2018-01-01,0,2018-01-01,1
2018-01-02,0,2018-01-02,1'''

df = pd.read_csv(io.StringIO(data))
# the second Date column was read as 'Date.1'; drop anything containing a dot
df = df.loc[:, ~df.columns.str.contains('.', regex=False)]
print(df)

#          Date  Value1  Value2
# 0  2018-01-01       0       1
# 1  2018-01-02       0       1
I'm trying to use Python to read my CSV file, extract specific columns into a pandas DataFrame, and show that DataFrame. However, I don't see the data frame; I receive Series([], dtype: object) as the output. Below is the code I'm working with.
My document consists of:
product, sub_product, issue, sub_issue, consumer_complaint_narrative, company_public_response, company, state, zipcode, tags, consumer_consent_provided, submitted_via, date_sent_to_company, company_response_to_consumer, timely_response, consumer_disputed?, complaint_id
I want to extract:
sub_product, issue, sub_issue, consumer_complaint_narrative
import pandas as pd
df=pd.read_csv("C:\\....\\consumer_complaints.csv")
df=df.stack(level=0)
df2 = df.filter(regex='[B-F]')
df[df2]
import pandas as pd

input_file = "C:\\....\\consumer_complaints.csv"
df = pd.read_csv(input_file)  # read_csv already returns a DataFrame

cols = [1, 2, 3, 4]
df = df[df.columns[cols]]
Specify the column numbers you want to select in cols; DataFrame columns are numbered from index 0.
You can also select columns by name. Just use the following line:
df = df[["Column Name","Column Name2"]]
A simple way to achieve this would be as follows:
df = pd.read_csv("C:\\....\\consumer_complaints.csv")
df2 = df.loc[:, 'B':'F']  # slices by column label, so this assumes columns literally named 'B' through 'F'
Hope that helps.
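With the actual column names from this question, the equivalent label slice (relying on the column order listed above) would be:
df2 = df.loc[:, 'sub_product':'consumer_complaint_narrative']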
This worked for me, using positional slicing with iloc (note that plain df[n1:n2] slices rows, not columns):
df = pd.read_csv("C:\\....\\consumer_complaints.csv")
df1 = df.iloc[:, n1:n2]
where n1 < n2 are column positions and the end of the slice is exclusive, e.g. if you want columns 3 and 4, use
df1 = df.iloc[:, 3:5]
For the first column, use
df1 = df.iloc[:, 0]
Though not sure how to select a discontinuous range of columns (one option is sketched below).
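For a discontinuous set of columns, one option is NumPy's index-building helper (a sketch; the chosen positions are just examples):
import numpy as np

df1 = df.iloc[:, np.r_[0, 3:5]]  # column 0 plus columns 3 and 4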
We can also use iloc. Given data in dataset2:
dataset2.iloc[:3, [1, 2]]
This spits out the top 3 rows of columns 2 and 3 (remember, numbering starts at 0).