This question already has answers here:
Selection with .loc in python
(5 answers)
Closed 4 years ago.
This is my pandas DataFrame:
   istat    cap        Comune
0   1001  10011        AGLIE'
1   1002  10060       AIRASCA
2   1003  10070  ALA DI STURA
I want to reproduce the equivalent SQL query:
SELECT cap
FROM DataFrame
WHERE Comune = 'AIRASCA'
Obtaining:
cap
10060
I tried to achieve this with DataFrame.loc, but I cannot retrieve what I need.
And this is my Python code:
import pandas as pd
from lxml import etree
from pykml import parser

def to_upper(l):
    return l.upper()

kml_file_path = '../Source/Kml_Regions/Lombardia.kml'
excel_file_path = '../Source/Milk_Coverage/Milk_Milan_Coverage.xlsx'
zip_file_path = '../Source/ZipCodes/italy_cap.csv'

# Read zipcode csv
zips = pd.read_csv(zip_file_path)
zip_df = pd.DataFrame(zips, columns=['cap', 'Comune']).set_index('Comune')
zips_dict = zips.apply(lambda x: x.astype(str).str.upper())

# Read excel file for coverage
df = pd.ExcelFile(excel_file_path).parse('Comuni')
x = df['City'].tolist()
cities = list(map(to_upper, x))

# ---------------------------------------------------------------------------- #
# Check uncovered
# parse the input file into an object tree
with open(kml_file_path) as f:
    tree = parser.parse(f)

# get a reference to the "Document.Folder" node
uncovered = tree.getroot().Document.Folder

# iterate through all "Document.Folder.Placemark" nodes and remove every node
# whose child node "name" matches a covered city
for pm in uncovered.Placemark:
    if pm.name in cities:
        parent = pm.getparent()
        parent.remove(pm)

# convert the object tree into a string and write it into an output file
with open('../Output/Uncovered_Milkman_LO.kml', 'w') as output:
    output.write(etree.tostring(uncovered, pretty_print=True))

# ---------------------------------------------------------------------------- #
# Check covered
with open(kml_file_path) as f:
    tree = parser.parse(f)

covered = tree.getroot().Document.Folder
for pmC in covered.Placemark:
    if pmC.name not in cities:
        parentCovered = pmC.getparent()
        parentCovered.remove(pmC)

# convert the object tree into a string and write it into an output file
with open('../Output/Covered_Milkman_LO.kml', 'w') as outputs:
    outputs.write(etree.tostring(covered, pretty_print=True))

# Writing CAP
with open('../Output/Covered_Milkman_LO.kml', 'r') as f:
    in_file = f.readlines()  # in_file is now a list of lines

# Now we start building our output
out_file = []
cap = ''
#for line in in_file:
#    out_file.append(line)  # copy each line, one by one

# Iterate through the covered placemarks and look up each city's CAP
for city in covered.Placemark:
    print(zips_dict.loc[city.name, ['Comune']])
I cannot understand the errors Python is giving me. What am I doing wrong? Technically I should be able to look up a value by another column's value in pandas, correct?
I don't think this is the same as the possible duplicate question, because I'm asking how to retrieve a single value rather than a column.
ksooklall's answer should work just fine, but (unless I'm remembering incorrectly) it's a bit of a faux pas to use back-to-back brackets in pandas: it's slower than using loc, and that can actually matter when making many calls on larger dataframes.
Using loc like this should work just fine:
df.loc[df['Comune'] == 'AIRASCA', 'cap']
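If you need the bare scalar (10060) rather than a one-element Series, chaining .iloc[0] onto that selection should work; a minimal sketch using the frame from the question:

```python
import pandas as pd

df = pd.DataFrame({
    "istat": [1001, 1002, 1003],
    "cap": [10011, 10060, 10070],
    "Comune": ["AGLIE'", "AIRASCA", "ALA DI STURA"],
})

# The boolean mask selects the matching row; 'cap' narrows it to one column
result = df.loc[df["Comune"] == "AIRASCA", "cap"]
print(result.iloc[0])  # 10060
```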
Try this:
cap = df[df['Comune'] == 'AIRASCA']['cap']
You can use eq. Example:
import pandas as pd
df = pd.DataFrame({"istat": [1001, 1002, 1003], "cap": [10011, 10060, 10070 ], "Comune": ['AGLIE', 'AIRASCA', 'ALA DI STURA']})
print( df.loc[df["Comune"].eq('AIRASCA'), "cap"] )
Output:
1 10060
Name: cap, dtype: int64
I'm trying to convert values from strings to ints in a certain column of a dataset. I tried using a for loop, and even though the loop does seem to be iterating through the data, it's failing to convert any of the values. I'm certain I'm making a super basic mistake but can't figure it out, as I'm very new at this.
I downloaded a data file from https://www.kaggle.com/datasets/majunbajun/himalayan-climbing-expeditions
Then proceeded to process the data so that I can analyse it statistically.
Here's the start of the code
#import pandas
import pandas as pd
#import expeditions as csv file
exp = pd.read_csv('C:\\file\\path\\to\\expeditions.csv')
#create subset for success vs failure
exp_win_v_fail = exp[['termination_reason', 'basecamp_date', 'season']]
#drop successes in dispute
exp_win_v_fail = exp_win_v_fail[(exp_win_v_fail['termination_reason'] != 'Success (claimed)') & (exp_win_v_fail['termination_reason'] != 'Attempt rumoured')]
This is the part I can't figure out
#recode termination reason to be binary
for element in exp_win_v_fail['termination_reason']:
    if element == 'Success (main peak)':
        element = 1
    elif element == 'Success (subpeak)':
        element = 1
    else:
        element = 0
Any help would be very much appreciated
To replace all values beginning with 'Success' with 1, and all other values to 0:
from pandas import read_csv
RE = '^Success.*$'
NRE = '^((?!Success).)*$'
TR = 'termination_reason'
BD = 'basecamp_date'
SE = 'season'
data = read_csv('expeditions.csv')
exp_win_v_fail = data[[TR, BD, SE]]
for v, re_ in enumerate((NRE, RE)):
    exp_win_v_fail[TR] = exp_win_v_fail[TR].replace(to_replace=re_, value=v, regex=True)

for e in exp_win_v_fail[TR]:
    print(e)
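For what it's worth, the original loop fails because `element` is rebound to a new value on each iteration; the DataFrame itself is never modified. A regex-free alternative (a sketch, not part of the answer above) is to use str.startswith, whose boolean result casts directly to 0/1:

```python
import pandas as pd

df = pd.DataFrame({"termination_reason": [
    "Success (main peak)", "Bad weather", "Success (subpeak)", "Route conditions",
]})

# True/False -> 1/0 in one vectorized step
df["termination_reason"] = df["termination_reason"].str.startswith("Success").astype(int)
print(df["termination_reason"].tolist())  # [1, 0, 1, 0]
```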
I have data in a csv file; here is a sample:
firstnb,secondnb,distance
901,19011,459.73618164837535
901,19017,492.5540450352788
901,19018,458.489289271722
903,13019,167.46632044684435
903,13020,353.16001204909657
the desired output:
901,19011,19017,19018
903,13019,13020
As you can see in the output, I want to take the firstnb column (901/903)
and put beside each one its secondnb values. I believe you can understand the desired output better than my explanation :D
What I tried so far is the following:
import pandas as pd
import csv
df = pd.read_csv('test.csv')
with open('neighborList.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    secondStation = []
    for row in range(len(df)):
        firstStation = df['firstnb'][row]
        for x in range(len(df)):
            if firstStation == df['firstnb'][x]:
                secondStation.append(df['secondnb'][x])
        # line = firstStation, secondStation
        # writer.writerow(line)
        print(firstStation, secondStation)
        secondStation = []
My code outputs this:
901 [19011, 19017, 19018]
901 [19011, 19017, 19018]
901 [19011, 19017, 19018]
903 [13019, 13020]
903 [13019, 13020]
Pandas has a built-in function to do this, called groupby:
df = pd.read_csv(YOUR_CSV_FILE)
df_grouped = list(df.groupby(df['firstnb']))  # group by first column

# chain keys and values into merged lists
for key, values in df_grouped:
    print([key] + values['secondnb'].tolist())
Here I just print the sublists; you can save them into a new csv in any format you'd like (strings, ints, etc)
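Writing those grouped rows back out with csv.writer might look like this (a sketch reusing the question's neighborList.csv filename):

```python
import csv
import pandas as pd

df = pd.DataFrame({
    "firstnb": [901, 901, 901, 903, 903],
    "secondnb": [19011, 19017, 19018, 13019, 13020],
})

# One output row per group: the key followed by its secondnb values
with open("neighborList.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for key, group in df.groupby("firstnb"):
        writer.writerow([key] + group["secondnb"].tolist())
```

This produces the two lines `901,19011,19017,19018` and `903,13019,13020` asked for in the question.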
First, I grouped the data by firstnb, creating a list of the values in secondnb using the aggregate function.
df[['firstnb','secondnb']].groupby('firstnb').aggregate(func=list).to_dict()
By turning this into a dict, we get:
{'secondnb': {901: [19011, 19017, 19018], 903: [13019, 13020]}}
I'm not entirely clear on what the final output should be (plain strings, lists, …), but from here on, it's easy to produce whatever you'd like.
For example, a list of lists:
intermediate = df[['firstnb','secondnb']].groupby('firstnb').aggregate(func=list).to_dict()
[[k] + v for k,v in intermediate['secondnb'].items()]
Result:
[[901, 19011, 19017, 19018], [903, 13019, 13020]]
def toList(a):
    res = []
    for r in a:
        res.append(r)
    return res

df.groupby('firstnb').agg(toList)
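Note that toList behaves the same as the built-in list, so (if I'm reading the API right) the groupby can be written more compactly:

```python
import pandas as pd

df = pd.DataFrame({
    "firstnb": [901, 901, 903],
    "secondnb": [19011, 19017, 13019],
})

# Passing the built-in list as the aggregate collects each group's values
grouped = df.groupby("firstnb")["secondnb"].agg(list)
print(grouped.to_dict())  # {901: [19011, 19017], 903: [13019]}
```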
I have a text file containing employee details and various other details. Below is the consolidated data.
Data file created on 4 Jun 2020
GROUPCASEINSENSITIVE ON
#KCT-User-Group
GROUP KCT ALopp190 e190 ARaga789 Lshastri921
GROUP KCT DPatel592 ANaidu026 e026 KRam161 e161
#KBN-User-Group
GROUP KBN SPatil322 e322 LAgarwal908 AKeshri132 e132
GROUP KBN BRaju105 e105 LNaik110 PNeema163 e163
#PDA-User-Group
GROUP PDA SRoy977 AAgarwal594 e594 AMath577 e577
GROUP PDA BSharma865 e865 CUmesh195 RRana354
When I run the Python code, I need the output shown below:
ALopp190
ARaga789
Lshastri921
DPatel592
ANaidu026
KRam161
SPatil322
LAgarwal908
AKeshri132
BRaju105
LNaik110
PNeema163
SRoy977
AAgarwal594
AMath577
BSharma865
CUmesh195
RRana354
From that text file I need only the above data. This is what I tried, but it's not working:
def user(li):
    n = len(li)
    for j in range(0, n, 2):
        print(li[j])

import os
os.getcwd()

fo = open('C:\\Users\\Kiran\\Desktop\\Emplyoees\\User.txt', 'r')
for i in fo.readlines():
    li = list(i.split(" "))
    #print(li)
    li.remove("GROUP")
    li.remove("KCT")
    li.remove("KBN")
    li.remove("PDA")
    user(li)
I am new to python and not sure how to get the data. Can you please assist me in fixing this issue.
Try this:
with open('data.txt') as fp:
    res = []
    for line in fp.readlines()[2:]:
        if not line.startswith('#'):
            res += [x for x in line.split()[2:]
                    if not (x.startswith('e') and x.replace('e', '').isnumeric())]

print('\n'.join(res))
Output:
ALopp190
ARaga789
Lshastri921
DPatel592
ANaidu026
KRam161
SPatil322
LAgarwal908
AKeshri132
BRaju105
LNaik110
PNeema163
SRoy977
AAgarwal594
AMath577
BSharma865
CUmesh195
RRana354
Based on the output format (letters followed by digits), you can parse it with a regex and then use pandas to save the results to Excel:
import re
import pandas as pd

with open('file.txt', 'r') as f:
    result = re.findall(r'[A-Z]\w+\d+', f.read())

df = pd.DataFrame(result)
df.to_excel('result.xlsx')
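To sanity-check the pattern on one of the sample lines from the question: [A-Z] anchors the match on an uppercase first letter, so lowercase codes like e190 are skipped, and GROUP/KCT fail because they don't end in digits:

```python
import re

line = "GROUP KCT ALopp190 e190 ARaga789 Lshastri921"
print(re.findall(r"[A-Z]\w+\d+", line))  # ['ALopp190', 'ARaga789', 'Lshastri921']
```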
I am working on a script to extract some details from images. I am trying to loop over a dataframe that has my image names. How can I add a new column to the dataframe that holds the extracted name against each image name?
for image in df['images']:
    concatenated_name = ''.join(name)
    df.loc[image, df['images']]['names'] = concatenated_name
Expected:
Index images names
0 img_01 TonyStark
1 img_02 Thanos
2 img_03 Thor
Got:
Index images names
0 img_01 Thor
1 img_02 Thor
2 img_03 Thor
Use apply to apply a function on each row:
def get_name(image):
    # Code for getting the name
    return name

df['names'] = df['images'].apply(get_name)
Following your answer, which added some more details, it should be possible to shorten it to:
def get_details(filename):
    image = os.getcwd() + filename
    data = pytesseract.image_to_string(Image.open(image))
    .
    .
    .
    data = ''.join(a)
    return data

df['data'] = df['filenames'].apply(get_details)
# save df to csv / excel / other
After multiple trials, I think I have a viable solution to this question.
I was using nested functions for this exercise, such that function 1 loops over a dataframe of files and calls function 2 to extract text, perform validation, and return a value if the image had the expected field.
First, I created an empty list which would be populated during each run of function 2. At the end, the user can choose to use this list to create a dataframe.
# dataframes to store data
df = pd.DataFrame(os.listdir(), columns=['filenames'])
df = df[df['filenames'].str.contains(".png|.jpg|.jpeg")]
df['filenames'] = '\\' + df['filenames']
df1 = []  # Empty list to record details

# Function 1
def extract_details(df):
    for filename in df['filenames']:
        get_details(filename)

# Function 2
def get_details(filename):
    image = os.getcwd() + filename
    data = pytesseract.image_to_string(Image.open(image))
    .
    .
    .
    data = ''.join(a)
    print(filename, data)
    df1.append([filename, data])

df_data = pd.DataFrame(df1, columns=['filenames', 'data'])  # Container for final output
df_data.to_csv('data_list.csv')    # Write output to a csv file
df_data.to_excel('data_list.xlsx')  # Write output to an excel file
I am trying to import a csv and deal with faulty values, e.g. a wrong decimal separator or strings in int/double columns. I use converters to do the error fixing. In the case of strings in number columns, the user sees an input box where he has to fix the value. Is it possible to get the column name and/or the row that is actually being imported? If not, is there a better way to do the same?
example csv:
------------
description;elevation
point a;-10
point b;10,0
point c;35.5
point d;30x
from PyQt4 import QtGui
import numpy
from pandas import read_csv

def fixFloat(x):
    # return x as float if possible
    try:
        return float(x)
    except ValueError:
        # if not, test if there is a , inside, replace it with a . and return it as float
        try:
            return float(x.replace(",", "."))
        except ValueError:
            changedValue, ok = QtGui.QInputDialog.getText(
                None, 'Fehlerhafter Wert',
                'Bitte korrigieren sie den fehlerhaften Wert:', text=x)
            if ok:
                return fixFloat(changedValue)
            return -9999999999

def fixEmptyStrings(s):
    if s == '':
        return None
    return s

converters = {
    'description': fixEmptyStrings,
    'elevation': fixFloat
}
dtypes = {
    'description': object,
    'elevation': numpy.float64
}

csvData = read_csv('/tmp/csv.txt',
                   error_bad_lines=True,
                   dtype=dtypes,
                   converters=converters)
If you want to iterate over them, the built-in csv.DictReader is pretty handy. I wrote up this function:
import csv
import pandas

def read_points(csv_file):
    point_names, elevations = [], []
    message = (
        "Found bad data for {0}'s row: {1}. Type new data to use "
        "for this value: "
    )
    with open(csv_file, 'r') as open_csv:
        r = csv.DictReader(open_csv, delimiter=";")
        for row in r:
            tmp_point = row.get("description", "some_default_name")
            tmp_elevation = row.get("elevation", "some_default_elevation")
            point_names.append(tmp_point)
            try:
                tmp_elevation = float(tmp_elevation.replace(',', '.'))
            except ValueError:
                while True:
                    user_val = raw_input(message.format(tmp_point,
                                                        tmp_elevation))
                    try:
                        tmp_elevation = float(user_val)
                        break
                    except ValueError:
                        tmp_elevation = user_val
            elevations.append(tmp_elevation)
    return pandas.DataFrame({"Point": point_names, "Elevation": elevations})
And for the four-line test file, it gives me the following:
In [41]: read_points("/home/ely/tmp.txt")
Found bad data for point d's row: 30x. Type new data to use for this value: 30
Out[41]:
Elevation Point
0 -10.0 point a
1 10.0 point b
2 35.5 point c
3 30.0 point d
[4 rows x 2 columns]
Displaying a whole Qt dialog box seems like overkill for this task. Why not just a command prompt? You can also add more conversion functions, and turn things like the delimiter into keyword arguments if you want it to be more customizable.
One question is how much data there is to iterate through. If it's a lot of data, this will be time consuming and tedious. In that case, you may just want to discard observations like the '30x' or write their point ID name to some other file so you can go back and deal with them all in one swoop inside something like Emacs or VIM where manipulating a big swath of text at once will be easier.
I would take a different approach here.
Rather than at read_csv time, I would read the csv naively and then fix / convert to float:
In [11]: df = pd.read_csv(csv_file, sep=';')
In [12]: df['elevation']
Out[12]:
0 -10
1 10,0
2 35.5
3 30x
Name: elevation, dtype: object
Now just iterate through this column:
In [13]: df['elevation'] = df['elevation'].apply(fixFloat)
This is going to make it much easier to reason about the code (which columns you're applying functions to, how to access other columns etc. etc.).
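A fully vectorized variant of the same read-then-fix idea (a sketch; note it silently coerces unfixable strings like '30x' to NaN rather than prompting the user):

```python
import pandas as pd

df = pd.DataFrame({"elevation": ["-10", "10,0", "35.5", "30x"]})

# Normalize decimal commas, then coerce; bad values ('30x') become NaN
df["elevation"] = pd.to_numeric(
    df["elevation"].str.replace(",", ".", regex=False), errors="coerce"
)
print(df["elevation"].tolist())  # [-10.0, 10.0, 35.5, nan]
```

From there the NaN rows can be located with df["elevation"].isna() and fixed interactively if needed.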